···99## What does that mean
1010The count-distinct problem, or cardinality-estimation problem, refers to counting the number of distinct elements in a data stream with repeated elements. As a concrete example, imagine that you want to count the unique words in a book. If you have enough memory, you can keep track of every unique element you encounter. However, you may not have enough working memory due to resource constraints, or the number of potential elements may be enormous. This constraint is referred to as the bounded-storage constraint in the literature.
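For comparison, the "enough memory" case can be sketched as an exact count that stores every unique word. The function name `exact_distinct_words` is hypothetical and is not part of this crate; the point is only that the set grows with the number of distinct elements, which is what the bounded-storage constraint rules out.

```rust
use std::collections::HashSet;

/// Exact distinct count: keeps every unique (lowercased) word in memory,
/// so the set grows with the number of distinct elements in the input.
fn exact_distinct_words(text: &str) -> usize {
    let mut seen: HashSet<String> = HashSet::new();
    for word in text.split_whitespace() {
        seen.insert(word.to_lowercase());
    }
    seen.len()
}
```

This is fine for a single book, but the memory cost is proportional to the answer itself.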
11111212+In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.
12131313-In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.
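The sampling idea can be sketched without any dependencies. The block below follows the same steps as the `process_element` code in this diff (drop an existing copy, keep the new element with the current probability, thin the buffer and halve the probability when it fills), but it is a standalone illustration: `XorShift64`, `estimate_distinct`, and the `seed` parameter are assumptions made for this sketch, and the crate itself uses `rand`.

```rust
/// Tiny xorshift generator, a stand-in so the sketch needs no external crates.
/// (Assumption: not what the crate uses; the crate relies on `rand`.)
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Estimate the number of distinct items in `stream`, holding fewer than
/// `buf_size` items between steps.
fn estimate_distinct<T: PartialEq>(
    stream: impl IntoIterator<Item = T>,
    buf_size: usize,
    seed: u64,
) -> f64 {
    assert!(buf_size > 0, "buffer must hold at least one element");
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut buf: Vec<T> = Vec::with_capacity(buf_size);
    let mut p = 1.0_f64;
    for item in stream {
        // Forget any earlier copy, so only the latest occurrence is sampled.
        if let Some(pos) = buf.iter().position(|x| *x == item) {
            buf.swap_remove(pos);
        }
        // Keep the item with the current sampling probability.
        if rng.next_f64() < p {
            buf.push(item);
        }
        // When the buffer is full, keep each entry with probability 1/2
        // and halve the sampling probability to match.
        while buf.len() >= buf_size {
            buf.retain(|_| rng.next_f64() < 0.5);
            p /= 2.0;
        }
    }
    buf.len() as f64 / p
}
```

When the buffer never fills, the probability stays at 1 and the result is the exact count; otherwise the result is an estimate that varies with the seed.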
1414+# What is an Element
1515+In this implementation, an element is anything implementing the [`PartialOrd`](https://doc.rust-lang.org/std/cmp/trait.PartialOrd.html) and [`PartialEq`](https://doc.rust-lang.org/std/cmp/trait.PartialEq.html) traits: various integer flavours, strings, any struct on which you have implemented the traits. Not `f32` / `f64`, however.
1616+1717+You will also note that I didn't mention `&str`: that's because the buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small.
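As a rough illustration of both points, the sketch below derives the two traits on a custom struct and converts borrowed words to owned `String`s before storing them. `Isbn`, `push_if_new`, and `owned_words` are hypothetical names used only here; `push_if_new` is a stand-in with the same trait bound as `CVM<T>`, not the crate's API.

```rust
/// Hypothetical element type: deriving the two traits satisfies the bound.
#[derive(Debug, PartialEq, PartialOrd)]
struct Isbn {
    group: u32,
    title: u32,
}

/// Stand-in for a buffer with the same bound as `CVM<T>`:
/// takes ownership of `elem` and stores it only if no equal element exists.
fn push_if_new<T: PartialOrd + PartialEq>(buf: &mut Vec<T>, elem: T) -> bool {
    if buf.iter().any(|x| *x == elem) {
        false
    } else {
        buf.push(elem);
        true
    }
}

/// Borrowed `&str` words are turned into owned `String`s before storage,
/// because the buffer must own what it holds.
fn owned_words(line: &str) -> Vec<String> {
    let mut buf = Vec::new();
    for word in line.split_whitespace() {
        push_if_new(&mut buf, word.to_string());
    }
    buf
}
```

With the real type, the equivalent step is to pass `word.to_string()` (or another owned value) to `process_element` on a `CVM<String>`.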
14181519## Further Details
1620Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses as a data structure called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.)
···1923This implementation doesn't use a treap as a buffer; it uses a Vec and performs a binary search during step **D4**. Note in particular his modification of step **D6** on p5: **D6'**: halving the buffer.
20242125I may switch to a treap implementation eventually; for many practical applications a binary search is considerably faster than the hashing algorithms under consideration. If your application assumes a buffer containing 100k+ elements, you may wish to consider using a treap.
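For reference, a buffer that supports binary search has to stay sorted. The sketch below shows one way that could look; `SortedBuf` is a hypothetical type, and it assumes `T: Ord`, which is stricter than the `PartialOrd + PartialEq` bound used in this diff. The `process_element` code shown in this diff locates elements with `iter().position`, which is a linear scan, and removes them with `swap_remove`, which would not preserve sorted order.

```rust
/// Sketch of a buffer kept in sorted order so lookups can use binary search.
/// Assumes `T: Ord` (stricter than the bound used by `CVM<T>`).
struct SortedBuf<T: Ord> {
    items: Vec<T>,
}

impl<T: Ord> SortedBuf<T> {
    fn new() -> Self {
        Self { items: Vec::new() }
    }

    /// Remove `elem` if present: O(log n) search, then O(n) shift.
    /// `Vec::remove` is used instead of `swap_remove` to keep the order.
    fn remove(&mut self, elem: &T) -> bool {
        match self.items.binary_search(elem) {
            Ok(pos) => {
                self.items.remove(pos);
                true
            }
            Err(_) => false,
        }
    }

    /// Insert `elem` at its sorted position unless an equal element exists.
    fn insert(&mut self, elem: T) -> bool {
        match self.items.binary_search(&elem) {
            Ok(_) => false,
            Err(pos) => {
                self.items.insert(pos, elem);
                true
            }
        }
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}
```

The search is logarithmic, but insertion and removal still shift elements, which is part of why a treap may be preferable for very large buffers.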
2626+2727+# What does this library provide
2828+Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file.
22292330# Installation
2431Binaries and installation instructions are available for x64 Linux, Apple Silicon and Intel, and x64 Windows in [releases](https://github.com/urschrei/cvmcount/releases).
+22-28
src/lib.rs
···11-//! An implementation of the CVM fast token counting algorithm presented in
11+//! An implementation of the CVM fast element counting algorithm presented in
22//! Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). *Distinct Elements in Streams: An Algorithm for the (Text) Book*. 6 pages, 727571 bytes. https://doi.org/10.4230/LIPIcs.ESA.2022.34
3344use rand::rngs::ThreadRng;
55use rand::Rng;
66-use regex::Regex;
7688-pub struct CVM {
77+pub struct CVM<T: PartialOrd + PartialEq> {
98 buf_size: usize,
1010- buf: Vec<String>,
99+ buf: Vec<T>,
1110 probability: f64,
1211 rng: ThreadRng,
1313- re: Regex,
1412}
1515-impl CVM {
1313+/// A counter implementing the CVM algorithm
1414+impl<T: PartialOrd + PartialEq> CVM<T> {
1615 /// Initialise the algorithm
1716 ///
1817 /// epsilon: how close you want your estimate to be to the true number of distinct elements.
···2120 /// An epsilon of 0.8 is a good starting point for most applications.
2221 ///
2322 /// delta: The level of certainty that the algorithm's estimate will fall within the desired accuracy range. A higher confidence
2424- /// (e.g., 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g., 90 %) means there's a
2323+ /// (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a
2524 /// higher chance the estimate might be outside the desired range.
2625 /// A delta of 0.1 is a good starting point for most applications.
2726 ///
2827 /// stream_size: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size,
2929- /// the more accurate the results
2828+ /// the more accurate the result will be
3029 pub fn new(epsilon: f64, delta: f64, stream_size: usize) -> Self {
3130 let bufsize = buffer_size(epsilon, delta, stream_size);
3231 Self {
···3433 buf: Vec::with_capacity(bufsize),
3534 probability: 1.0,
3635 rng: rand::thread_rng(),
3737- re: Regex::new(r"[^\w\s]").unwrap(),
3836 }
3937 }
4040- /// Count tokens, given a string containing words, e.g. a line of a book
4141- pub fn process_line_tokens(&mut self, line: String) {
4242- let words = line.split(' ');
4343- for word in words {
4444- let clean_word = self.re.replace_all(word, "").to_lowercase();
4545- // binary search should be pretty fast
4646- // I think this will be faster than a hashset for practical sizes
4747- // but I need some empirical data for this
4848- if let Some(pos) = self.buf.iter().position(|x| *x == clean_word) {
4949- self.buf.swap_remove(pos);
5050- }
5151- if self.rng.gen_bool(self.probability) {
5252- self.buf.push(clean_word);
5353- }
5454- while self.buf.len() == self.buf_size {
5555- self.clear_about_half();
5656- self.probability /= 2.0;
5757- }
3838+ /// Count elements, updating the current unique count
3939+ pub fn process_element(&mut self, elem: T) {
4040+ // linear scan for an existing copy of the element
4141+ // a binary search over a sorted buffer, or a hashset, may be faster
4242+ // for practical sizes, but I need some empirical data for this
4343+ if let Some(pos) = self.buf.iter().position(|x| *x == elem) {
4444+ self.buf.swap_remove(pos);
4545+ }
4646+ if self.rng.gen_bool(self.probability) {
4747+ self.buf.push(elem);
4848+ }
4949+ while self.buf.len() == self.buf_size {
5050+ self.clear_about_half();
5151+ self.probability /= 2.0;
5852 }
5953 }
6054 // remove around half of the elements at random
6155 fn clear_about_half(&mut self) {
6256 self.buf.retain(|_| self.rng.gen_bool(0.5));
6357 }
6464- /// Calculate the final token count
5858+ /// Calculate the current unique element count. You can continue to add elements after calling this method.
6559 pub fn calculate_final_result(&self) -> f64 {
6660 self.buf.len() as f64 / self.probability
6761 }
+14-4
src/main.rs
···11use clap::{arg, crate_version, value_parser, Command};
22+use regex::Regex;
23use std::fs::File;
34use std::io::BufRead;
45use std::io::BufReader;
···1314{
1415 let f = File::open(filename).expect("Couldn't read from file");
1516 BufReader::new(f)
1717+}
1818+1919+fn line_to_word(re: &Regex, cvm: &mut CVM<String>, line: &str) {
2020+ let words = line.split(' ');
2121+ words.for_each(|word| {
2222+ let clean_word = re.replace_all(word, "").to_lowercase();
2323+ cvm.process_element(clean_word)
2424+ })
1625}
17261827fn main() {
···3544 .required(true)
3645 .value_parser(value_parser!(usize)))
3746 .get_matches();
4747+3848 let input_file = params.get_one::<PathBuf>("tokens").unwrap();
3949 let epsilon = params.get_one::<f64>("epsilon").unwrap();
4050 let delta = params.get_one::<f64>("delta").unwrap();
4151 let stream_size = params.get_one::<usize>("streamsize").unwrap();
5252+ let mut counter: CVM<String> = CVM::new(*epsilon, *delta, *stream_size);
5353+ let re = Regex::new(r"[^\w\s]").unwrap();
42544343- let mut counter = CVM::new(*epsilon, *delta, *stream_size);
4455 let br = open_file(input_file);
4545- for line in br.lines() {
4646- counter.process_line_tokens(line.unwrap())
4747- }
5656+ br.lines()
5757+ .for_each(|line| line_to_word(&re, &mut counter, &line.unwrap()));
4858 println!(
4959 "Unique tokens: {:?}",
5060 counter.calculate_final_result() as i32