···
In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.

# What is an Element
-In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) and [`PartialEq`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
+In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html), [`PartialEq`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html), `Eq` and `Hash` traits: various integer flavours, strings, any struct on which you have implemented the traits. Not `f32` / `f64`, however: they implement neither `Eq` nor `Hash`.

## Ownership
The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach.
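
The two conversions described above can be sketched roughly as follows. This is only an illustration: `normalise` and `parse_numeric` are hypothetical helper names, not part of this library, and the character filter stands in for whatever punctuation handling you prefer.

```rust
/// Hypothetical helper: turn a borrowed token into an owned, normalised `String`.
/// Non-alphanumeric characters are dropped and the case is folded to lowercase.
/// Returns `None` if nothing is left after cleaning.
fn normalise(token: &str) -> Option<String> {
    let cleaned: String = token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(char::to_lowercase)
        .collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Hypothetical helper: parse a numeric token into a `Copy` integer,
/// returning `None` for anything that is not a valid `u64`.
fn parse_numeric(token: &str) -> Option<u64> {
    token.trim().parse().ok()
}
```

Either result is an owned value, so it can be handed to the counter without keeping the original input alive.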
···
Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses, a data structure called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.), as a binary tree
> "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."

-where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a Vec and performs a linear search during step **D4**.
-
-I may switch to a treap implementation eventually; for many practical applications a linear search is considerably faster than e.g. a HashSet. If your application assumes a large buffer such that linear search will be too slow, you may wish to consider using a treap.
+where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the [FxHash](https://docs.rs/fxhash/latest/fxhash/) algorithm: we pay the hash cost when inserting, but search in step **D4** is `O(1)`. The library may switch to a treap implementation eventually.
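
To make the role of the buffer concrete, here is a minimal, dependency-free sketch of the sampling loop. It is not this library's code: `Sketch` and `XorShift` are hypothetical names, the standard library's `HashSet` stands in for the FxHash-backed set, and a small xorshift generator replaces the `rand` crate so that the example is self-contained.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Tiny xorshift64* generator, used only to keep this sketch dependency-free.
struct XorShift(u64);

impl XorShift {
    /// Returns a pseudo-random float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let bits = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D);
        // Keep the top 53 bits so the result fits an f64 mantissa exactly
        (bits >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical stand-in for the counter: a bounded set plus a sampling probability.
struct Sketch<T> {
    buf: HashSet<T>,
    buf_size: usize,
    p: f64,
    rng: XorShift,
}

impl<T: Eq + Hash> Sketch<T> {
    fn new(buf_size: usize, seed: u64) -> Self {
        assert!(buf_size > 1, "buffer must hold at least two elements");
        assert!(seed != 0, "xorshift needs a non-zero seed");
        Self {
            buf: HashSet::with_capacity(buf_size),
            buf_size,
            p: 1.0,
            rng: XorShift(seed),
        }
    }

    fn process_element(&mut self, elem: T) {
        // Forget any earlier occurrence, then re-admit with the current probability
        self.buf.remove(&elem);
        if self.rng.next_f64() < self.p {
            self.buf.insert(elem);
        }
        // When the buffer is full, evict each element with probability 1/2
        // and halve the sampling probability to compensate
        while self.buf.len() >= self.buf_size {
            let rng = &mut self.rng;
            self.buf.retain(|_| rng.next_f64() < 0.5);
            self.p /= 2.0;
        }
    }

    /// Estimated number of distinct elements seen so far.
    fn estimate(&self) -> f64 {
        self.buf.len() as f64 / self.p
    }
}
```

While the number of distinct elements stays below the buffer size, nothing is ever evicted, the probability remains 1, and the estimate is exact; the trade-off between accuracy and memory only starts once the buffer fills up.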

# What does this library provide
Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file.
···
If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

## Perf
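
The "specified bounds" translate into a buffer size. As a rough illustration, the sketch below computes the threshold given in the original CVM paper, ⌈(12 / ε²) · log₂(8m / δ)⌉, where ε is the desired accuracy, δ the allowed failure probability and m the stream size. That this crate's `buffer_size` function uses exactly this formula is an assumption; `paper_threshold` is a hypothetical name.

```rust
/// Hypothetical helper: buffer size threshold as stated in the CVM paper,
/// ceil((12 / epsilon^2) * log2(8 * stream_size / delta)).
/// Assumption: the library's own `buffer_size` may differ from this.
fn paper_threshold(epsilon: f64, delta: f64, stream_size: usize) -> usize {
    let accuracy_term = 12.0 / epsilon.powi(2);
    let confidence_term = ((8.0 * stream_size as f64) / delta).log2();
    (accuracy_term * confidence_term).ceil() as usize
}
```

Tightening ε grows the buffer quadratically, while tightening δ or lengthening the stream only grows it logarithmically.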
-Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) using the CLI takes 18.6 ms ± 0.3 ms on an M2 Pro. Run `cargo bench` for more.
+Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) using the CLI takes 7.2 ms ± 0.3 ms on an M2 Pro. Counting 10e6 7-digit integers takes around 13.5 ms. Run `cargo bench` for more.

## Implementation Details
The CLI app strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.
+11-11
src/lib.rs
···
 use rand::rngs::ThreadRng;
 use rand::Rng;

+use rustc_hash::FxHashSet;
+use std::hash::Hash;
+
 /// A counter implementing the CVM algorithm
 ///
 /// Note that the CVM struct's buffer takes ownership of its elements.
-pub struct CVM<T: PartialOrd + PartialEq> {
+pub struct CVM<T: PartialOrd + PartialEq + Eq + Hash> {
     buf_size: usize,
-    buf: Vec<T>,
+    buf: FxHashSet<T>,
     probability: f64,
     rng: ThreadRng,
 }

-impl<T: PartialOrd + PartialEq> CVM<T> {
+impl<T: PartialOrd + PartialEq + Eq + Hash> CVM<T> {
     /// Initialise the algorithm
     ///
     /// epsilon: how close you want your estimate to be to the true number of distinct elements.
···
         let bufsize = buffer_size(epsilon, delta, stream_size);
         Self {
             buf_size: bufsize,
-            buf: Vec::with_capacity(bufsize),
+            buf: FxHashSet::with_capacity_and_hasher(bufsize, Default::default()),
             probability: 1.0,
             rng: rand::thread_rng(),
         }
     }
     /// Add an element, potentially updating the unique element count
     pub fn process_element(&mut self, elem: T) {
-        // linear search
-        // I think this will be faster than a hashset for practical sizes
-        // Should really switch to a treap as per Knuth
-        if let Some(pos) = self.buf.iter().position(|x| *x == elem) {
-            self.buf.swap_remove(pos);
-        }
+        // We should switch to a treap (as per Knuth) to avoid the hash overhead, but FxHash
+        // is still a lot faster than linear searching a Vec, even at small (1000) buffer sizes
+        self.buf.remove(&elem);
         if self.rng.gen_bool(self.probability) {
-            self.buf.push(elem);
+            self.buf.insert(elem);
         }
         while self.buf.len() == self.buf_size {
             self.clear_about_half();