In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.

# What is an Element
-In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) + [`PartialEQ`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) + `Eq` + `PartialEq` + `Hash` traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
+In this implementation, an element is anything implementing the `Eq` + `PartialEq` + `Hash` traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
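
To make the new bound concrete, here is a minimal sketch of what qualifies as an element. The `distinct` helper and the `Reading` struct are hypothetical and not part of this library; `distinct` uses the standard library's `HashSet` purely as a stand-in for something that requires `Eq + Hash`. The `to_bits` conversion is one possible workaround for floats, and it assumes you are happy to treat `0.0` and `-0.0` (and differing NaN payloads) as different elements.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical stand-in for the bound discussed above: any `T: Eq + Hash` is accepted.
/// (`Eq` already implies `PartialEq`.)
fn distinct<T: Eq + Hash>(items: impl IntoIterator<Item = T>) -> usize {
    items.into_iter().collect::<HashSet<T>>().len()
}

/// A user-defined element type: deriving the three traits is enough.
#[derive(PartialEq, Eq, Hash)]
struct Reading {
    sensor_id: u32,
    label: String,
}

fn main() {
    let readings = vec![
        Reading { sensor_id: 1, label: "a".to_string() },
        Reading { sensor_id: 1, label: "a".to_string() },
        Reading { sensor_id: 2, label: "a".to_string() },
    ];
    println!("distinct readings: {}", distinct(readings));

    // `f64` implements neither `Eq` nor `Hash`, so it cannot be an element directly.
    // Mapping each value to its `u64` bit pattern gives a type that can.
    let floats = [1.5_f64, 1.5, 2.0];
    println!("distinct floats: {}", distinct(floats.iter().map(|x| x.to_bits())));
}
```

Dropping `PartialOrd` from the bound means a struct like `Reading` no longer needs an ordering it may not naturally have.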

## Ownership
The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach.
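
The preprocessing described above could look roughly like the sketch below. The `normalise` helper is hypothetical and not part of this library; it only illustrates turning borrowed `&str` tokens into owned, cleaned `String`s, and parsing numeric tokens into integers, before they are handed to a buffer that takes ownership.

```rust
/// Hypothetical helper: drop non-alphanumeric characters and lower-case the rest,
/// returning an owned `String`, or `None` if nothing is left.
fn normalise(token: &str) -> Option<String> {
    let cleaned: String = token
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

fn main() {
    // Text input: each surviving token becomes a new, owned String.
    let text = "The cat, the CAT; and 42 dogs!";
    let words: Vec<String> = text.split_whitespace().filter_map(normalise).collect();
    println!("{:?}", words);

    // Numeric input: parse to an integer type (which is `Copy`), skipping bad tokens.
    let numbers: Vec<u64> = "10 20 x 10"
        .split_whitespace()
        .filter_map(|s| s.parse().ok())
        .collect();
    println!("{:?}", numbers);
}
```

Because the cleaned values are freshly allocated, the original input can be dropped or streamed past as soon as each token has been processed.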
+2-2
src/lib.rs
/// A counter implementing the CVM algorithm
///
/// Note that the CVM struct's buffer takes ownership of its elements.
-pub struct CVM<T: PartialOrd + PartialEq + Eq + Hash> {
+pub struct CVM<T: PartialEq + Eq + Hash> {
    buf_size: usize,
    buf: FxHashSet<T>,
    probability: f64,
    rng: ThreadRng,
}

-impl<T: PartialOrd + PartialEq + Eq + Hash> CVM<T> {
+impl<T: PartialEq + Eq + Hash> CVM<T> {
    /// Initialise the algorithm
    ///
    /// epsilon: how close you want your estimate to be to the true number of distinct elements.