Rust implementation of the CVM algorithm for counting distinct elements in a stream

Relax trait bounds

+3 -3
+1 -1
README.md
@@ -12,7 +12,7 @@
 In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.
 
 # What is an Element
-In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) + [`PartialEQ`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) + `Eq` + `PartialEq` + `Hash` traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
+In this implementation, an element is anything implementing the `Eq` + `PartialEq` + `Hash` traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
 
 ## Ownership
 The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach.
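The element and ownership requirements described in this README hunk can be illustrated with a minimal sketch. It does not use the library's API: a standard-library `HashSet` stands in for the buffer, and `distinct_owned`, `normalise`, `distinct_words`, `distinct_numbers`, and `Point` are hypothetical names introduced only for illustration.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical stand-in for an owning buffer: anything that is `Eq + Hash`
// can be stored. `f32` / `f64` implement neither trait, so passing floats
// here would be rejected at compile time.
fn distinct_owned<T: Eq + Hash>(items: impl IntoIterator<Item = T>) -> HashSet<T> {
    items.into_iter().collect()
}

// Turn a borrowed word into an owned `String`, stripping punctuation and
// regularising the case along the way.
fn normalise(word: &str) -> String {
    word.chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

// Text input: each `&str` token becomes a new `String` owned by the set.
fn distinct_words(text: &str) -> HashSet<String> {
    distinct_owned(text.split_whitespace().map(normalise))
}

// Numeric input: parse to an integer type first; unparsable lines are skipped.
fn distinct_numbers(lines: &[&str]) -> HashSet<u64> {
    distinct_owned(lines.iter().filter_map(|s| s.trim().parse::<u64>().ok()))
}

// A user-defined struct qualifies once the traits are derived.
// Note that `PartialOrd` is not derived and is not needed.
#[derive(PartialEq, Eq, Hash)]
struct Point {
    x: i32,
    y: i32,
}
```

Because the set owns its elements, the original text or line buffer can be dropped as soon as the tokens have been converted, which matches the small-working-memory motivation given in the README.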
+2 -2
src/lib.rs
@@ -10,14 +10,14 @@
 /// A counter implementing the CVM algorithm
 ///
 /// Note that the CVM struct's buffer takes ownership of its elements.
-pub struct CVM<T: PartialOrd + PartialEq + Eq + Hash> {
+pub struct CVM<T: PartialEq + Eq + Hash> {
     buf_size: usize,
     buf: FxHashSet<T>,
     probability: f64,
     rng: ThreadRng,
 }
 
-impl<T: PartialOrd + PartialEq + Eq + Hash> CVM<T> {
+impl<T: PartialEq + Eq + Hash> CVM<T> {
     /// Initialise the algorithm
     ///
     /// epsilon: how close you want your estimate to be to the true number of distinct elements.
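To see why dropping `PartialOrd` is safe for a hash-set-backed buffer, here is a simplified sketch. It is not the crate's code: `Buffer` and `Tag` are hypothetical, the standard `HashSet` replaces `FxHashSet`, and the probability and RNG fields are left out. A hash set only ever hashes elements and compares them for equality, so `Eq + Hash` is sufficient; `PartialEq` is a supertrait of `Eq`, so naming it in the bound is harmless but redundant.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// Hypothetical, simplified stand-in for the struct in the diff.
pub struct Buffer<T: Eq + Hash> {
    buf_size: usize,
    buf: HashSet<T>,
}

impl<T: Eq + Hash> Buffer<T> {
    pub fn new(buf_size: usize) -> Self {
        Self {
            buf_size,
            buf: HashSet::with_capacity(buf_size),
        }
    }

    // Store the element if there is room or it is already present.
    // Returns whether the element is in the buffer afterwards.
    // (This is an illustrative policy, not the CVM sampling step.)
    pub fn insert(&mut self, elem: T) -> bool {
        if self.buf.len() < self.buf_size || self.buf.contains(&elem) {
            self.buf.insert(elem);
            true
        } else {
            false
        }
    }

    pub fn len(&self) -> usize {
        self.buf.len()
    }
}

// An element type with no ordering at all: it derives equality and hashing
// but not `PartialOrd`, and is still accepted by the relaxed bound.
#[derive(PartialEq, Eq, Hash)]
pub struct Tag(pub String);
```

With the earlier `PartialOrd` bound, a type like `Tag` would have needed an ordering implementation that the buffer never calls.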