Rust implementation of the CVM algorithm for counting distinct elements in a stream
0

Configure Feed

Select the types of activity you want to include in your feed.

3 2 0

Clone this repository

https://tangled.org/urschrei.eurosky.social/cvmcount https://tangled.org/did:plc:worabswoc5op4rmwsr5ovpgw
git@tangled.org:urschrei.eurosky.social/cvmcount git@tangled.org:did:plc:worabswoc5op4rmwsr5ovpgw

For self-hosted knots, clone URLs may differ based on your setup.



README.md

Rust implementation of the CVM counting algorithm#

This library implements

Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). Distinct Elements in Streams: An Algorithm for the (Text) Book. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34

The blog post describing the algorithm is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

CLI Example#

cargo install cvmcount cvmcount file.txt 0.8 0.1 2900

The --help option is available.

Note#

If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

Implementation Details#

This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.