Rust implementation of the CVM algorithm for counting distinct elements in a stream

Treap remarks

+4 -4
+2 -2
README.md
@@ -21,9 +21,9 @@
 Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure – called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.) – as a binary tree
 > "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."
 
-where _s_ >= 1. This implementation doesn't use a treap as a buffer; it uses a Vec and performs a binary search during step **D4**. Note in particular his modification of step **D6** on p5: **D6'**: halving the buffer.
+where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a Vec and performs a linear search during step **D4**.
 
-I may switch to a treap implementation eventually; for many practical applications a binary search is considerably faster than the hashing algorithms under consideration. If your application assumes a buffer containing 100k+ elements, you may wish to consider using a treap.
+I may switch to a treap implementation eventually; for many practical applications a linear search is considerably faster than e.g. a HashSet. If your application assumes a a large buffer such that linear search will be too slow, you may wish to consider using a treap.
 
 # What does this library provide
 Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file.
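The README text in this hunk describes a buffer holding up to _s_ pairs (_a_, _u_) that is kept in a Vec and searched linearly rather than stored in a treap. A minimal sketch of that idea follows. `PairBuffer` and its methods are hypothetical names invented for illustration and are not part of the crate; how _u_ is drawn and what happens when the buffer is full are deliberately left out.

```rust
/// Hypothetical illustration only: a bounded buffer of (a, u) pairs kept in a Vec.
/// `PairBuffer`, `remove` and `insert` are names made up for this sketch.
struct PairBuffer<T> {
    capacity: usize,      // s >= 1
    pairs: Vec<(T, f64)>, // (a, u) with 0 <= u < 1
}

impl<T: PartialEq> PairBuffer<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity >= 1, "s must be at least 1");
        Self { capacity, pairs: Vec::with_capacity(capacity) }
    }

    /// Linear search for `a`; removes its pair if present and returns it.
    fn remove(&mut self, a: &T) -> Option<(T, f64)> {
        let pos = self.pairs.iter().position(|(x, _)| x == a)?;
        // swap_remove does not preserve order, which is fine for an unordered buffer.
        Some(self.pairs.swap_remove(pos))
    }

    /// Inserts (a, u) if there is room; returns false when the buffer is full.
    fn insert(&mut self, a: T, u: f64) -> bool {
        debug_assert!((0.0..1.0).contains(&u));
        if self.pairs.len() >= self.capacity {
            return false;
        }
        self.pairs.push((a, u));
        true
    }
}
```

Each `remove` costs time proportional to the number of stored pairs, which is the trade-off the README weighs against a treap for large buffers.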
+2 -2
src/lib.rs
@@ -40,9 +40,9 @@
     }
     /// Add an element, potentially updating the unique element count
     pub fn process_element(&mut self, elem: T) {
-        // binary search should be pretty fast
+        // linear search
         // I think this will be faster than a hashset for practical sizes
-        // but I need some empirical data for this
+        // Should really switch to a treap as per Knuth
         if let Some(pos) = self.buf.iter().position(|x| *x == elem) {
             self.buf.swap_remove(pos);
         }
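The lookup in this hunk can be tried in isolation, without the rest of `src/lib.rs`. The sketch below assumes two free functions, `remove_if_present` and `remove_if_present_set`; both are hypothetical names, not the crate's API. The first mirrors the `position` + `swap_remove` pattern from the diff. The second shows the `HashSet` alternative that the comment mentions. No claim is made here about which is faster.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper mirroring the hunk: linear scan, then O(1) unordered removal.
fn remove_if_present<T: PartialEq>(buf: &mut Vec<T>, elem: &T) -> bool {
    if let Some(pos) = buf.iter().position(|x| x == elem) {
        // swap_remove moves the last element into `pos`, so order is not preserved.
        buf.swap_remove(pos);
        true
    } else {
        false
    }
}

/// The HashSet counterpart, for comparison; it needs Hash + Eq rather than PartialEq.
fn remove_if_present_set<T: Hash + Eq>(set: &mut HashSet<T>, elem: &T) -> bool {
    set.remove(elem)
}
```

Note that `Iterator::position` scans from the front, so this is a linear search, as the updated comment says.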