Rust implementation of the CVM algorithm for counting distinct elements in a stream
0

Configure Feed

Select the types of activity you want to include in your feed.

18 2 0

Clone this repository

https://tangled.org/urschrei.eurosky.social/cvmcount https://tangled.org/did:plc:worabswoc5op4rmwsr5ovpgw
git@tangled.org:urschrei.eurosky.social/cvmcount git@tangled.org:did:plc:worabswoc5op4rmwsr5ovpgw

For self-hosted knots, clone URLs may differ based on your setup.



README.md

A Rust implementation of the CVM Distinct Elements counting algorithm#

This library implements the algorithm described in

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

Installation#

Binaries and installation instructions are available for x64 Linux, Apple Silicon and Intel, and x64 Windows in releases

You can also build your own if you have Rust installed: cargo install cvmcount.

CLI Example#

cvmcount -t file.txt -e 0.8 -d 0.1 -s 5000

-t --tokens: a valid path to a text file

-e --epsilon: how close you want your estimate to be to the true number of distinct tokens. A smaller ε means you require a more precise estimate. For example, ε = 0.05 means you want your estimate to be within 5 % of the actual value. An epsilon of 0.8 is a good starting point for most applications.

-d --delta: the level of certainty that the algorithm's estimate will fall within your desired accuracy range. A higher confidence (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a higher chance the estimate may be outside your desired range. A delta of 0.1 is a good starting point for most applications.

-s --streamsize: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size, the more accurate the results.

The --help option is available.

Note#

If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

Perf#

Calculating the unique tokens in a 418K UTF-8 text file takes 18.6 ms ± 0.3 ms on an M2 Pro

Implementation Details#

This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.