···
# What is an Element
In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) and [`PartialEq`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) traits: various integer flavours, strings, any struct on which you have implemented the traits. Not `f32` / `f64`, however.

-You will also note that I didn't mention `&str`: that's because the buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the entire point of the algorithm: your data set is very large; you **don't** want to keep the original data around in order to store references to it!
+You will also note that I didn't mention `&str`: that's because the buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the entire point of the algorithm: your data set is very large; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s from them to store them. Of course, if you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) seems like a reasonable approach.
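
The conversions described above might look something like the following. This is a minimal sketch, not part of the crate: `normalise` and `parse_numeric` are hypothetical helper names, and the choice of `u64` and of "alphanumeric characters only" as the punctuation rule are assumptions made for illustration.

```rust
/// Hypothetical helper (not part of this crate): turn a borrowed token into an
/// owned, normalised `String` by dropping non-alphanumeric characters and
/// lower-casing the rest. Returns `None` if nothing is left.
fn normalise(token: &str) -> Option<String> {
    let cleaned: String = token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Hypothetical helper (not part of this crate): parse a numeric token into an
/// integer. `u64` is an assumption; pick whichever integer type fits your data.
fn parse_numeric(token: &str) -> Option<u64> {
    token.trim().parse::<u64>().ok()
}
```

Either helper yields an owned value (`String` or `u64`) that satisfies `PartialOrd + PartialEq`, so it could be handed to `process_element` without borrowing from the input stream.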

## Further Details
-Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.)
-> "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1.", where _s_ >= 1.
+Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses, a data structure called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.), as a binary tree
+> "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."

-This implementation doesn't use a treap as a buffer; it uses a Vec and performs a binary search during step **D4**. Note in particular his modification of step **D6** on p5: **D6'**: halving the buffer.
+where _s_ ≥ 1. This implementation doesn't use a treap as a buffer; it uses a `Vec` and performs a binary search during step **D4**. Note in particular his modification of step **D6** on p5, **D6'**: halving the buffer.

I may switch to a treap implementation eventually; for many practical applications a binary search is considerably faster than the hashing algorithms under consideration. If your application assumes a buffer containing 100k+ elements, you may wish to consider using a treap.
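
To illustrate the idea of a sorted `Vec` searched by binary search, here is a rough sketch. It is not the crate's actual code: `find`, `insert_sorted` and `remove_sorted` are hypothetical names, and treating incomparable values as `Ordering::Less` is an assumption made only because `PartialOrd` (unlike `Ord`) allows `partial_cmp` to return `None`.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: locate `elem` in a sorted slice.
/// Returns `Ok(index)` if present, `Err(insertion_point)` otherwise.
fn find<T: PartialOrd>(buf: &[T], elem: &T) -> Result<usize, usize> {
    // Assumption: if two values cannot be compared, treat the probe as Less.
    buf.binary_search_by(|probe| probe.partial_cmp(elem).unwrap_or(Ordering::Less))
}

/// Insert `elem` at its sorted position unless it is already present.
/// Returns `true` if the buffer changed.
fn insert_sorted<T: PartialOrd>(buf: &mut Vec<T>, elem: T) -> bool {
    match find(buf, &elem) {
        Ok(_) => false,
        Err(pos) => {
            buf.insert(pos, elem);
            true
        }
    }
}

/// Remove `elem` if present. Returns `true` if the buffer changed.
fn remove_sorted<T: PartialOrd>(buf: &mut Vec<T>, elem: &T) -> bool {
    match find(buf, elem) {
        Ok(pos) => {
            buf.remove(pos);
            true
        }
        Err(_) => false,
    }
}
```

The lookup is O(log n), but `Vec::insert` and `Vec::remove` shift the elements after the affected position, which is one reason a treap may become attractive for very large buffers.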
+4-2
src/lib.rs
···
use rand::Rng;

/// A counter implementing the CVM algorithm
+///
+/// Note that the CVM struct's buffer takes ownership of its elements.
pub struct CVM<T: PartialOrd + PartialEq> {
    buf_size: usize,
    buf: Vec<T>,
···
    /// A delta of 0.1 is a good starting point for most applications.
    ///
    /// stream_size: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size,
-    /// the more accurate the result will be
+    /// the more accurate the result will be.
    pub fn new(epsilon: f64, delta: f64, stream_size: usize) -> Self {
        let bufsize = buffer_size(epsilon, delta, stream_size);
        Self {
···
            rng: rand::thread_rng(),
        }
    }
-    /// Count elements, updating the current unique count
+    /// Add an element, potentially updating the unique element count
    pub fn process_element(&mut self, elem: T) {
        // binary search should be pretty fast
        // I think this will be faster than a hashset for practical sizes