A Rust implementation of the CVM Distinct Elements counting algorithm#
This library implements the algorithm described in
Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34
The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/
Installation#
Binaries and installation instructions are available for x64 Linux, Apple Silicon and Intel, and x64 Windows in releases
You can also build your own if you have Rust installed: cargo install cvmcount.
CLI Example#
cvmcount -t file.txt -e 0.8 -d 0.1 -s 5000
-t --tokens: a valid path to a text file
-e --epsilon: how close you want your estimate to be to the true number of distinct tokens. A smaller ε means you require a more precise estimate. For example, ε = 0.05 means you want your estimate to be within 5 % of the actual value. An epsilon of 0.8 is a good starting point for most applications.
-d --delta: the level of certainty that the algorithm's estimate will fall within your desired accuracy range. A higher confidence (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a higher chance the estimate may be outside your desired range. A delta of 0.1 is a good starting point for most applications.
-s --streamsize: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size, the more accurate the results.
The --help option is available.
Note#
If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!
Perf#
Calculating the unique tokens in a 418K UTF-8 text file takes 18.6 ms ± 0.3 ms on an M2 Pro
Implementation Details#
This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.