Rust implementation of the CVM algorithm for counting distinct elements in a stream

Switch to FxHashSet

+16 -17
+2 -1
Cargo.toml
···
  keywords = ["CVM", "count-distinct", "estimation"]
  categories = ["algorithms", ]

- version = "0.1.6"
+ version = "0.1.7"
  edition = "2021"

  [dependencies]
  rand = "0.8.5"
  regex = "1.10.4"
  clap = { version = "4.5.4", features = ["cargo"] }
+ rustc-hash = "1.1.0"

  [dev-dependencies]
  rand = "0.8.5"
+3 -5
README.md
···
  In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.

  # What is an Element
- In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) and [`PartialEQ`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
+ In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) + [`PartialEQ`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) + `Eq` + `PartialEq` + `Hash` traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.

  ## Ownership
  The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway.
  If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach.
···
  Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) – as a binary tree
  > "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."

- where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a Vec and performs a linear search during step **D4**.
-
- I may switch to a treap implementation eventually; for many practical applications a linear search is considerably faster than e.g. a HashSet. If your application assumes a a large buffer such that linear search will be too slow, you may wish to consider using a treap.
+ where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the [FxHash](https://docs.rs/fxhash/latest/fxhash/) algorithm: we pay the hash cost when inserting, but search in step **D4** is `O(1)`. The library may switch to a treap implementation eventually.

  # What does this library provide
  Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file.
···
  If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

  ## Perf
- Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) using the CLI takes 18.6 ms ± 0.3 ms on an M2 Pro. Run `cargo bench` for more.
+ Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) using the CLI takes 7.2 ms ± 0.3 ms on an M2 Pro. Counting 10e6 7-digit integers takes around 13.5 ms. Run `cargo bench` for more.

  ## Implementation Details
  The CLI app strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.
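
The README's ownership note says `&str` tokens must become owned `String`s before they are stored, and that punctuation stripping and case folding create new strings anyway. A minimal sketch of that step follows. It makes several assumptions: `normalise` and `distinct_tokens` are hypothetical names that do not appear in the crate, a character filter stands in for the regex the CLI uses, and a plain `std::collections::HashSet` gives an exact count here rather than the CVM estimate.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not part of the crate): drop punctuation and
/// lowercase the token, returning a freshly allocated, owned `String`.
/// A char filter is used here in place of the regex the CLI relies on.
pub fn normalise(token: &str) -> String {
    token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Hypothetical helper: split on whitespace, normalise each token, and
/// collect the owned strings. The set owns its elements, so the input
/// text can be dropped afterwards.
pub fn distinct_tokens(text: &str) -> HashSet<String> {
    text.split_whitespace()
        .map(normalise)
        .filter(|t| !t.is_empty())
        .collect()
}
```

`String` satisfies the `Eq + Hash` bounds this change introduces, which is why the same normalised tokens could be passed to a hash-backed buffer; `f32` and `f64` implement neither `Eq` nor `Hash`, which matches the README's exclusion of floats.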
+11 -11
src/lib.rs
···
  use rand::rngs::ThreadRng;
  use rand::Rng;

+ use rustc_hash::FxHashSet;
+ use std::hash::Hash;
+
  /// A counter implementing the CVM algorithm
  ///
  /// Note that the CVM struct's buffer takes ownership of its elements.
- pub struct CVM<T: PartialOrd + PartialEq> {
+ pub struct CVM<T: PartialOrd + PartialEq + Eq + Hash> {
      buf_size: usize,
-     buf: Vec<T>,
+     buf: FxHashSet<T>,
      probability: f64,
      rng: ThreadRng,
  }

- impl<T: PartialOrd + PartialEq> CVM<T> {
+ impl<T: PartialOrd + PartialEq + Eq + Hash> CVM<T> {
      /// Initialise the algorithm
      ///
      /// epsilon: how close you want your estimate to be to the true number of distinct elements.
···
          let bufsize = buffer_size(epsilon, delta, stream_size);
          Self {
              buf_size: bufsize,
-             buf: Vec::with_capacity(bufsize),
+             buf: FxHashSet::with_capacity_and_hasher(bufsize, Default::default()),
              probability: 1.0,
              rng: rand::thread_rng(),
          }
      }
      /// Add an element, potentially updating the unique element count
      pub fn process_element(&mut self, elem: T) {
-         // linear search
-         // I think this will be faster than a hashset for practical sizes
-         // Should really switch to a treap as per Knuth
-         if let Some(pos) = self.buf.iter().position(|x| *x == elem) {
-             self.buf.swap_remove(pos);
-         }
+         // We should switch to a treap (as per Knuth) to avoid the hash overhead, but FxHash
+         // is still a lot faster than linear searching a Vec, even at small (1000) buffer sizes
+         self.buf.remove(&elem);
          if self.rng.gen_bool(self.probability) {
-             self.buf.insert(elem);
+             self.buf.insert(elem);
          }
          while self.buf.len() == self.buf_size {
              self.clear_about_half();
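
For readers who want to try the remove / insert / halve loop from this hunk without the crate's dependencies, here is a standalone sketch. It is not the crate's code. `SketchCvm`, `XorShift64`, and `estimate` are hypothetical names, `std::collections::HashSet` stands in for `FxHashSet`, a toy xorshift generator stands in for `rand::thread_rng()`, and the body of the halving step is a guess at what `clear_about_half` does (keep each element with probability 1/2, then halve the sampling probability), since that function is not shown in the diff.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Toy xorshift64 generator so the sketch needs no external crates.
/// Stand-in for `rand::thread_rng()`; not suitable for real use.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The xorshift state must be non-zero.
        Self(seed.max(1))
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns `true` with probability `p`; always `true` when `p == 1.0`.
    fn gen_bool(&mut self, p: f64) -> bool {
        // Map the top 53 bits to a float in [0, 1).
        let u = (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64;
        u < p
    }
}

/// Hypothetical, simplified counter mirroring the fields in the diff.
pub struct SketchCvm<T: Eq + Hash> {
    buf_size: usize,
    buf: HashSet<T>,
    probability: f64,
    rng: XorShift64,
}

impl<T: Eq + Hash> SketchCvm<T> {
    pub fn new(buf_size: usize, seed: u64) -> Self {
        assert!(buf_size >= 1, "buffer must hold at least one element");
        Self {
            buf_size,
            buf: HashSet::with_capacity(buf_size),
            probability: 1.0,
            rng: XorShift64::new(seed),
        }
    }

    pub fn process_element(&mut self, elem: T) {
        // Hash lookup replaces the old linear scan of a Vec.
        self.buf.remove(&elem);
        if self.rng.gen_bool(self.probability) {
            self.buf.insert(elem);
        }
        while self.buf.len() == self.buf_size {
            // Assumed behaviour of `clear_about_half`: coin-flip each element.
            let rng = &mut self.rng;
            self.buf.retain(|_| rng.gen_bool(0.5));
            self.probability /= 2.0;
        }
    }

    /// Estimated number of distinct elements seen so far.
    pub fn estimate(&self) -> f64 {
        self.buf.len() as f64 / self.probability
    }
}
```

When the buffer is larger than the number of distinct elements, the sampling probability never drops below 1.0 and the estimate equals the exact count; for example, feeding `[1, 2, 3, 2, 1, 3, 4]` into `SketchCvm::new(100, 42)` gives an estimate of 4.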