README.md at main · urschrei.eurosky.social/cvmcount

urschrei.eurosky.social / cvmcount
Fork 0
Rust implementation of the CVM algorithm for counting distinct elements in a stream
Fork 0
cvmcount / README.md
at main 157 lines 8.3 kB View raw View rendered
wrap content
Stephan Hügel Streaming / Iterator interface 1y ago
8ae23cbd
  1# A Rust implementation of the CVM Algorithm for Counting Distinct Elements
  2
  3This library implements the algorithm described in
  4
  5> Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). *Distinct Elements in Streams: An Algorithm for the (Text) Book*. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34
  6
  7The accompanying article in Quanta is here: https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/
  8
  9## What does that mean
 10The count-distinct problem, or cardinality-estimation problem refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature.
 11
 12In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.
 13
 14# What is an Element
 15In this implementation, an element is anything implementing the `Ord` trait: various integer flavours, strings, any Struct on which you have implemented the trait. Not `f32` / `f64`, however (unless wrapped in an ordered wrapper type).
 16
 17## Ownership
 18The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach.
 19
 20## Further Details
 21Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) – as a binary tree
 22> "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."
 23
 24where _s_ >= 1. This implementation uses a treap as a buffer, following Knuth's original design. While this results in O(log n) operations instead of O(1) for hash-based approaches, it provides better cache locality for small buffers and eliminates hash collision overhead.
 25
 26# What does this library provide
 27Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file.
 28
 29# Installation
 30Binaries and installation instructions are available for x64 Linux, Apple Silicon and Intel, and x64 Windows in [releases](https://github.com/urschrei/cvmcount/releases)
 31
 32You can also build your own if you have Rust installed: `cargo install cvmcount`.
 33
 34# CLI Example
 35
 36```shell
 37cvmcount -t file.txt -e 0.8 -d 0.1 -s 5000
 38```
 39`-t --tokens`: a valid path to a text file
 40
 41`-e --epsilon`: how close you want your estimate to be to the true number of distinct tokens. A smaller ε means you require a more precise estimate. For example, ε = 0.05 means you want your estimate to be within 5 % of the actual value. An epsilon of 0.8 is a good starting point for most applications.
 42
 43`-d --delta`: the level of certainty that the algorithm's estimate will fall within your desired accuracy range. A higher confidence (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a higher chance the estimate may be outside your desired range. A delta of 0.1 is a good starting point for most applications.
 44
 45`-s --streamsize`: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size, the more accurate the results.
 46
 47The `--help` option is available.
 48
 49# Library Usage
 50
 51The library provides both a simple constructor and a builder pattern for more ergonomic usage:
 52
 53## Simple Constructor
 54
 55```rust
 56use cvmcount::CVM;
 57
 58let mut cvm = CVM::new(0.05, 0.01, 10_000);
 59for item in data_stream {
 60    cvm.process_element(item);
 61}
 62let estimate = cvm.calculate_final_result();
 63```
 64
 65## Builder Pattern (Recommended)
 66
 67The builder pattern provides better readability and validation:
 68
 69```rust
 70use cvmcount::CVM;
 71
 72// Using defaults (epsilon=0.8, confidence=0.9, size=1000)
 73let mut cvm: CVM<String> = CVM::builder().build().unwrap();
 74
 75// Custom configuration with confidence level
 76let mut cvm: CVM<i32> = CVM::builder()
 77    .epsilon(0.05)        // 5 % accuracy
 78    .confidence(0.99)     // 99 % confidence
 79    .estimated_size(50_000)
 80    .build()
 81    .unwrap();
 82
 83// Using delta (failure probability) instead of confidence
 84let mut cvm: CVM<String> = CVM::builder()
 85    .epsilon(0.1)         // 10 % accuracy
 86    .delta(0.01)          // 1 % chance of failure
 87    .estimated_size(1_000)
 88    .build()
 89    .unwrap();
 90
 91// Process your data
 92for word in text.split_whitespace() {
 93    cvm.process_element(word.to_string());
 94}
 95
 96let estimate = cvm.calculate_final_result();
 97println!("Estimated unique words: {}", estimate as usize);
 98```
 99
100The builder validates parameters and provides clear error messages for invalid inputs.
101
102## Streaming Interface
103
104For processing iterators directly, you can use the streaming methods:
105
106```rust
107use cvmcount::{CVM, EstimateDistinct};
108
109// Process an entire iterator with CVM instance
110let mut cvm: CVM<i32> = CVM::builder().epsilon(0.05).build().unwrap();
111let numbers = vec![1, 2, 3, 2, 1, 4, 5];
112let estimate = cvm.process_stream(numbers);
113
114// Or use the iterator extension trait for one-liners
115let estimate = (1..=1000)
116    .cycle()
117    .take(10_000)
118    .estimate_distinct_count(0.1, 0.1, 10_000);
119
120// With builder pattern
121let words = vec!["hello".to_string(), "world".to_string(), "hello".to_string()];
122let builder = CVM::<String>::builder().epsilon(0.05).confidence(0.99);
123let estimate = words.into_iter().estimate_distinct_with_builder(builder).unwrap();
124
125// When working with borrowed data, map to owned explicitly
126let borrowed_words = vec!["hello", "world", "hello"];
127let estimate = borrowed_words
128    .iter()
129    .map(|s| s.to_string())
130    .estimate_distinct_count(0.1, 0.1, 1000);
131```
132
133The streaming interface accepts owned values to avoid cloning within the algorithm, making the ownership requirements explicit.
134
135## Analysis
136
137![](cvmcount.png)
138```text
139Mean: 9015.744000
140Std: 534.076058
141Min 7552.000000
14225% 8672.000000
14350% 9024.000000
14475% 9344.000000
145Max 11072.00000
146```
147
148## Note
149If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed and memory usage!
150
151## Perf
152Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) using the CLI takes 7.2 ms ± 0.3 ms on an M2 Pro. Counting 10e6 7-digit integers takes around 13.5 ms. An exact count using the same regex and HashSet runs in around 18 ms. Run `cargo bench` for more.
153
154## Implementation Details
155The CLI app strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.
156
157
Configure Feed

Configure Feed