Rust implementation of the CVM algorithm for counting distinct elements in a stream

README and docstring update

+11 -6
+3
Cargo.toml
···
  readme = "README.md"
  license = "MIT OR Apache-2.0"
  repository = "https://github.com/urschrei/cvmcount"
+ documentation = "https://docs.rs/cvmcount"
+ keywords = ["CVM", "count-distinct", "cardinality-estimation"]
+ categories = ["algorithms", ]

  version = "0.1.6"
  edition = "2021"
+4 -4
README.md
···
  # What is an Element
  In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) and [`PartialEQ`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.

- You will also note that I didn't mention `&str`: that's because the buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the entire point of the algorithm: your data set is very large; you **don't** want to keep the original data around in order to store references to it!
+ You will also note that I didn't mention `&str`: that's because the buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the entire point of the algorithm: your data set is very large; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s from them to store them. Of course, if you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `Strings` anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) seems like a reasonable approach.

  ## Further Details
- Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.)
- > "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1.", where _s_ >= 1.
+ Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) – as a binary tree
+ > "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."

- This implementation doesn't use a treap as a buffer; it uses a Vec and performs a binary search during step **D4**. Note in particular his modification of step **D6** on p5: **D6'**: halving the buffer.
+ where _s_ >= 1. This implementation doesn't use a treap as a buffer; it uses a Vec and performs a binary search during step **D4**. Note in particular his modification of step **D6** on p5: **D6'**: halving the buffer.

  I may switch to a treap implementation eventually; for many practical applications a binary search is considerably faster than the hashing algorithms under consideration. If your application assumes a buffer containing 100k+ elements, you may wish to consider using a treap.
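The README change above says that `&str` input has to be turned into owned values before it reaches the buffer. A minimal sketch of what that might look like, using only the standard library, is shown here. The helper names `normalise_token`, `parse_numeric`, and `owned_tokens` are hypothetical and not part of the crate, and the normalisation rules (drop ASCII punctuation, lowercase) are just one plausible choice.

```rust
/// Hypothetical helper: turn a borrowed token into an owned, normalised `String`
/// by dropping ASCII punctuation and lowercasing. Returns `None` if nothing is left.
fn normalise_token(token: &str) -> Option<String> {
    let cleaned: String = token
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .flat_map(|c| c.to_lowercase())
        .collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Hypothetical helper: parse a numeric token into an integer (which is `Copy`),
/// returning `None` for input that is not a valid `u64`.
fn parse_numeric(token: &str) -> Option<u64> {
    token.trim().parse::<u64>().ok()
}

/// Hypothetical helper: split a borrowed line into owned, normalised elements
/// that a buffer could take ownership of.
fn owned_tokens(line: &str) -> Vec<String> {
    line.split_whitespace().filter_map(normalise_token).collect()
}
```

Each `String` returned by `owned_tokens` could then be moved into a counter, so the original line can be dropped once it has been tokenised.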
+4 -2
src/lib.rs
···
  use rand::Rng;

  /// A counter implementing the CVM algorithm
+ ///
+ /// Note that the CVM struct's buffer takes ownership of its elements.
  pub struct CVM<T: PartialOrd + PartialEq> {
      buf_size: usize,
      buf: Vec<T>,
···
      /// A delta of 0.1 is a good starting point for most applications.
      ///
      /// stream_size: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size,
-     /// the more accurate the result will be
+     /// the more accurate the result will be.
      pub fn new(epsilon: f64, delta: f64, stream_size: usize) -> Self {
          let bufsize = buffer_size(epsilon, delta, stream_size);
          Self {
···
              rng: rand::thread_rng(),
          }
      }
-     /// Count elements, updating the current unique count
+     /// Add an element, potentially updating the unique element count
      pub fn process_element(&mut self, elem: T) {
          // binary search should be pretty fast
          // I think this will be faster than a hashset for practical sizes
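The hunk above cuts off before the body of `process_element`, so the following is only a sketch of how a sorted `Vec` with binary search could serve as the buffer. It is not the crate's code. Assumptions: the type `SketchCounter`, its `estimate` method, and the `XorShift` generator are hypothetical; the sketch requires `T: Ord` so it can call `slice::binary_search` (the crate's bound is `PartialOrd + PartialEq`); and a tiny xorshift generator stands in for `rand` to keep the example dependency-free.

```rust
/// Tiny xorshift64 generator, used only so the sketch needs no external crates.
/// (Assumption for illustration; the crate itself uses `rand`.)
struct XorShift(u64);

impl XorShift {
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Use the top 53 bits to build a float in [0, 1).
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical minimal counter whose buffer is a sorted Vec.
struct SketchCounter<T: Ord> {
    buf_size: usize,
    buf: Vec<T>,
    probability: f64,
    rng: XorShift,
}

impl<T: Ord> SketchCounter<T> {
    fn new(buf_size: usize, seed: u64) -> Self {
        assert!(buf_size >= 1, "buffer must hold at least one element");
        Self {
            buf_size,
            buf: Vec::with_capacity(buf_size),
            probability: 1.0,
            // xorshift must not be seeded with zero
            rng: XorShift(seed.max(1)),
        }
    }

    /// The buffer takes ownership of `elem` if it is kept.
    fn process_element(&mut self, elem: T) {
        // Decide whether this element should be in the buffer afterwards.
        let keep = self.rng.next_f64() < self.probability;
        // Binary search finds the element or the index that keeps the Vec sorted.
        match self.buf.binary_search(&elem) {
            Ok(idx) if !keep => {
                self.buf.remove(idx);
            }
            Err(idx) if keep => self.buf.insert(idx, elem),
            _ => {} // already in the desired state
        }
        // Buffer full: drop each element with probability 1/2 and halve p.
        if self.buf.len() >= self.buf_size {
            let rng = &mut self.rng;
            self.buf.retain(|_| rng.next_f64() < 0.5);
            self.probability /= 2.0;
        }
    }

    /// Current estimate of the number of distinct elements seen.
    fn estimate(&self) -> f64 {
        self.buf.len() as f64 / self.probability
    }
}
```

When the buffer is larger than the number of distinct elements, the probability never drops below 1 and the estimate is exact. For example, feeding `i % 100` for `i` in `0..1000` into `SketchCounter::new(500, 42)` gives an estimate of 100. Because `retain` preserves order, the `Vec` stays sorted after each halving and later binary searches remain valid.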