Switch BACK to swap_remove · urschrei.eurosky.social/cvmcount@bfcd526

Rust implementation of the CVM algorithm for counting distinct elements in a stream

Switch BACK to swap_remove

swap_remove doesn't duplicate elements; it only re-orders them.
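As a minimal illustration (not part of the commit), the sketch below compares the two standard-library `Vec` methods. `remove_both` is a hypothetical helper introduced only for this example: `remove` shifts every later element left, while `swap_remove` moves the last element into the vacated slot, so the surviving elements are the same and only their order differs.

```rust
/// Hypothetical helper: removes index `pos` both ways and returns
/// (result of `remove`, result of `swap_remove`).
fn remove_both(items: &[&'static str], pos: usize) -> (Vec<&'static str>, Vec<&'static str>) {
    let mut shifted = items.to_vec();
    let mut swapped = items.to_vec();
    // O(n): every element after `pos` moves one slot left, order preserved.
    let _ = shifted.remove(pos);
    // O(1): the last element is moved into slot `pos`, order changed.
    let _ = swapped.swap_remove(pos);
    (shifted, swapped)
}

fn main() {
    let (shifted, swapped) = remove_both(&["a", "b", "c", "d"], 1);
    assert_eq!(shifted, ["a", "c", "d"]);
    assert_eq!(swapped, ["a", "d", "c"]);
    println!("{shifted:?} {swapped:?}");
}
```

Both results hold the same three elements; nothing is duplicated.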

author: Stephan Hügel
date: May 23, 2024, 5:14 PM +0100
commit: bfcd5267ead792a4a048fc38d4fe0ed1151457f8
parent: 024ed37b92d66f59fa26ba1aaa3959303cfda00f

+3 -3

3 changed files


+1 -1

Cargo.toml

 license = "MIT OR Apache-2.0"
 repository = "https://github.com/urschrei/cvmcount"

-version = "0.1.2"
+version = "0.1.3"
 edition = "2021"

 [dependencies]

+1 -1

README.md

 If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!

 ## Perf
-Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 19.2 ms ± 0.3 ms on an M2 Pro
+Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 18.6 ms ± 0.3 ms on an M2 Pro

 ## Implementation Details
 This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.

+1 -1

src/lib.rs

 // I think this will be faster than a hashset for practical sizes
 // but I need some empirical data for this
 if let Some(pos) = self.buf.iter().position(|x| *x == clean_word) {
-    self.buf.remove(pos);
+    self.buf.swap_remove(pos);
 }
 if self.rng.gen_bool(self.probability) {
     self.buf.push(clean_word);
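The changed step can be sketched in isolation as follows. This is an assumption-laden stand-in, not the crate's code: `update_buffer` is a hypothetical free function, and its `keep` flag replaces the `self.rng.gen_bool(self.probability)` call so the behaviour is deterministic. It shows that the buffer never ends up with two copies of a word, whichever order `swap_remove` leaves it in.

```rust
/// Hypothetical stand-in for the step in the diff: drop any existing copy of
/// `word`, then re-insert it only if `keep` is true. In the real code the
/// decision comes from `self.rng.gen_bool(self.probability)`.
fn update_buffer(buf: &mut Vec<String>, word: &str, keep: bool) {
    if let Some(pos) = buf.iter().position(|x| x == word) {
        // The buffer is only ever searched and appended to here, so the
        // re-ordering done by the O(1) `swap_remove` is assumed harmless.
        buf.swap_remove(pos);
    }
    if keep {
        buf.push(word.to_string());
    }
}

fn main() {
    let mut buf: Vec<String> = ["the", "quick", "fox"]
        .iter()
        .map(|s| s.to_string())
        .collect();

    // "the" is already present: it is removed, then pushed back once.
    update_buffer(&mut buf, "the", true);
    assert_eq!(buf, ["fox", "quick", "the"]);

    // "quick" is present and not kept: it simply disappears.
    update_buffer(&mut buf, "quick", false);
    assert_eq!(buf, ["fox", "the"]);
}
```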
