Rust implementation of the CVM algorithm for counting distinct elements in a stream

Switch BACK to swap_remove

`swap_remove` doesn't duplicate elements; it only re-orders them.

+3 -3
+1 -1
Cargo.toml
 license = "MIT OR Apache-2.0"
 repository = "https://github.com/urschrei/cvmcount"

-version = "0.1.2"
+version = "0.1.3"
 edition = "2021"

 [dependencies]
+1 -1
README.md
 If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

 ## Perf
-Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 19.2 ms ± 0.3 ms on an M2 Pro
+Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 18.6 ms ± 0.3 ms on an M2 Pro

 ## Implementation Details
 This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.
+1 -1
src/lib.rs
 // I think this will be faster than a hashset for practical sizes
 // but I need some empirical data for this
 if let Some(pos) = self.buf.iter().position(|x| *x == clean_word) {
-    self.buf.remove(pos);
+    self.buf.swap_remove(pos);
 }
 if self.rng.gen_bool(self.probability) {
     self.buf.push(clean_word);
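
The difference between the two calls in the `src/lib.rs` hunk can be sketched in isolation. This is a minimal illustration using only the standard library, not code from the crate: `compare_removals` and `remove_word` are hypothetical names, and the buffer is assumed to be a plain `Vec<String>`. `Vec::remove` shifts every later element left by one, while `Vec::swap_remove` moves the last element into the vacated slot, so no element is duplicated and only the order changes.

```rust
/// Hypothetical demo: remove index 1 from the same data with both methods.
fn compare_removals() -> (Vec<&'static str>, Vec<&'static str>) {
    let mut shifted = vec!["a", "b", "c", "d"];
    let mut swapped = shifted.clone();

    // `remove` preserves order by shifting the tail left: O(n) in the tail length.
    shifted.remove(1);
    // `swap_remove` fills the hole with the last element: O(1), order not preserved.
    swapped.swap_remove(1);

    (shifted, swapped)
}

/// Hypothetical helper mirroring the lookup-then-remove step in the diff.
/// Returns true if `word` was found and removed from `buf`.
fn remove_word(buf: &mut Vec<String>, word: &str) -> bool {
    // The linear scan is still O(n); only the removal itself becomes O(1).
    if let Some(pos) = buf.iter().position(|x| x.as_str() == word) {
        buf.swap_remove(pos);
        true
    } else {
        false
    }
}
```

Calling `compare_removals()` yields `["a", "c", "d"]` for the `remove` case and `["a", "d", "c"]` for the `swap_remove` case. Both hold the same three elements, so the swap is a safe substitution wherever the buffer's order is not relied upon.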