Rust implementation of the CVM algorithm for counting distinct elements in a stream
0

Configure Feed

Select the types of activity you want to include in your feed.

at main 157 lines 8.3 kB View raw View rendered
1# A Rust implementation of the CVM Algorithm for Counting Distinct Elements 2 3This library implements the algorithm described in 4 5> Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). *Distinct Elements in Streams: An Algorithm for the (Text) Book*. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34 6 7The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/ 8 9## What does that mean 10The count-distinct problem, or cardinality-estimation problem refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature. 11 12In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality. 13 14# What is an Element 15In this implementation, an element is anything implementing the `Ord` trait: various integer flavours, strings, any Struct on which you have implemented the trait. Not `f32` / `f64`, however (unless wrapped in an ordered wrapper type). 16 17## Ownership 18The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach. 19 20## Further Details 21Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) – as a binary tree 22> "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1." 23 24where _s_ >= 1. This implementation uses a treap as a buffer, following Knuth's original design. While this results in O(log n) operations instead of O(1) for hash-based approaches, it provides better cache locality for small buffers and eliminates hash collision overhead. 25 26# What does this library provide 27Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file. 28 29# Installation 30Binaries and installation instructions are available for x64 Linux, Apple Silicon and Intel, and x64 Windows in [releases](https://github.com/urschrei/cvmcount/releases) 31 32You can also build your own if you have Rust installed: `cargo install cvmcount`. 33 34# CLI Example 35 36```shell 37cvmcount -t file.txt -e 0.8 -d 0.1 -s 5000 38``` 39`-t --tokens`: a valid path to a text file 40 41`-e --epsilon`: how close you want your estimate to be to the true number of distinct tokens. A smaller ε means you require a more precise estimate. For example, ε = 0.05 means you want your estimate to be within 5 % of the actual value. An epsilon of 0.8 is a good starting point for most applications. 42 43`-d --delta`: the level of certainty that the algorithm's estimate will fall within your desired accuracy range. A higher confidence (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a higher chance the estimate may be outside your desired range. A delta of 0.1 is a good starting point for most applications. 44 45`-s --streamsize`: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size, the more accurate the results. 46 47The `--help` option is available. 48 49# Library Usage 50 51The library provides both a simple constructor and a builder pattern for more ergonomic usage: 52 53## Simple Constructor 54 55```rust 56use cvmcount::CVM; 57 58let mut cvm = CVM::new(0.05, 0.01, 10_000); 59for item in data_stream { 60 cvm.process_element(item); 61} 62let estimate = cvm.calculate_final_result(); 63``` 64 65## Builder Pattern (Recommended) 66 67The builder pattern provides better readability and validation: 68 69```rust 70use cvmcount::CVM; 71 72// Using defaults (epsilon=0.8, confidence=0.9, size=1000) 73let mut cvm: CVM<String> = CVM::builder().build().unwrap(); 74 75// Custom configuration with confidence level 76let mut cvm: CVM<i32> = CVM::builder() 77 .epsilon(0.05) // 5 % accuracy 78 .confidence(0.99) // 99 % confidence 79 .estimated_size(50_000) 80 .build() 81 .unwrap(); 82 83// Using delta (failure probability) instead of confidence 84let mut cvm: CVM<String> = CVM::builder() 85 .epsilon(0.1) // 10 % accuracy 86 .delta(0.01) // 1 % chance of failure 87 .estimated_size(1_000) 88 .build() 89 .unwrap(); 90 91// Process your data 92for word in text.split_whitespace() { 93 cvm.process_element(word.to_string()); 94} 95 96let estimate = cvm.calculate_final_result(); 97println!("Estimated unique words: {}", estimate as usize); 98``` 99 100The builder validates parameters and provides clear error messages for invalid inputs. 101 102## Streaming Interface 103 104For processing iterators directly, you can use the streaming methods: 105 106```rust 107use cvmcount::{CVM, EstimateDistinct}; 108 109// Process an entire iterator with CVM instance 110let mut cvm: CVM<i32> = CVM::builder().epsilon(0.05).build().unwrap(); 111let numbers = vec![1, 2, 3, 2, 1, 4, 5]; 112let estimate = cvm.process_stream(numbers); 113 114// Or use the iterator extension trait for one-liners 115let estimate = (1..=1000) 116 .cycle() 117 .take(10_000) 118 .estimate_distinct_count(0.1, 0.1, 10_000); 119 120// With builder pattern 121let words = vec!["hello".to_string(), "world".to_string(), "hello".to_string()]; 122let builder = CVM::<String>::builder().epsilon(0.05).confidence(0.99); 123let estimate = words.into_iter().estimate_distinct_with_builder(builder).unwrap(); 124 125// When working with borrowed data, map to owned explicitly 126let borrowed_words = vec!["hello", "world", "hello"]; 127let estimate = borrowed_words 128 .iter() 129 .map(|s| s.to_string()) 130 .estimate_distinct_count(0.1, 0.1, 1000); 131``` 132 133The streaming interface accepts owned values to avoid cloning within the algorithm, making the ownership requirements explicit. 134 135## Analysis 136 137![](cvmcount.png) 138```text 139Mean: 9015.744000 140Std: 534.076058 141Min 7552.000000 14225% 8672.000000 14350% 9024.000000 14475% 9344.000000 145Max 11072.00000 146``` 147 148## Note 149If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed and memory usage! 150 151## Perf 152Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) using the CLI takes 7.2 ms ± 0.3 ms on an M2 Pro. Counting 10e6 7-digit integers takes around 13.5 ms. An exact count using the same regex and HashSet runs in around 18 ms. Run `cargo bench` for more. 153 154## Implementation Details 155The CLI app strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality. 156 157