···
 In order to overcome this constraint, streaming algorithms have been developed: [Flajolet-Martin](https://en.wikipedia.org/wiki/Flajolet–Martin_algorithm), LogLog, [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog). The algorithm implemented by this library is an improvement on these in one particular sense: it is extremely simple. Instead of hashing, it uses a sampling method to compute an [unbiased estimate](https://www.statlect.com/glossary/unbiased-estimator#:~:text=An%20estimator%20of%20a%20given,Examples) of the cardinality.

 # What is an Element
-In this implementation, an element is anything implementing the `Eq` + `PartialEq` + `Hash` traits: various integer flavours, strings, any Struct on which you have implemented the traits. Not `f32` / `f64`, however.
+In this implementation, an element is anything implementing the `Ord` trait: various integer flavours, strings, any struct on which you have implemented the trait. Not `f32` / `f64`, however (unless wrapped in an ordered wrapper type).
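As an illustration of the "ordered wrapper type" mentioned above, here is a minimal sketch of a newtype that gives `f64` a total order through the standard library's `f64::total_cmp`. The names `OrderedF64` and `distinct_floats` are hypothetical and not part of this crate, and a `BTreeSet` stands in for any `Ord`-based buffer.

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

/// Hypothetical newtype: orders `f64` values with `f64::total_cmp`.
/// Note that under this order `0.0` and `-0.0` are distinct values.
#[derive(Debug, Clone, Copy)]
struct OrderedF64(f64);

impl PartialEq for OrderedF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.total_cmp(&other.0) == Ordering::Equal
    }
}

impl Eq for OrderedF64 {}

impl PartialOrd for OrderedF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for OrderedF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// Count distinct floats exactly, using an `Ord`-based set as a stand-in buffer.
fn distinct_floats(values: &[f64]) -> usize {
    values
        .iter()
        .map(|&v| OrderedF64(v))
        .collect::<BTreeSet<_>>()
        .len()
}
```

Any type wrapped this way satisfies an `Ord` bound, so it could be passed wherever such a bound is required.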

 ## Ownership
 The buffer has to keep ownership of its elements. In practice, this is not a problem: relative to its input stream size, the buffer is very small. This is also the point of the algorithm: your data set is very large and your working memory is small; you **don't** want to keep the original data around in order to store references to it! Thus, if you have `&str` elements you will need to create new `String`s to store them. If you're processing text data you'll probably want to strip punctuation and regularise the case, so you'll need new `String`s anyway. If you're processing strings containing numeric values, parsing them to the appropriate integer type (which implements `Copy`) first seems like a reasonable approach.
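The two preparation steps described above could look roughly like this. It is a sketch using only the standard library; `normalise_word` and `parse_id` are hypothetical helper names, and the choice of `u64` is an assumption about the data.

```rust
/// Hypothetical helper: keep only alphanumeric characters and lower-case them,
/// returning an owned `String` that a buffer can take ownership of.
/// Returns `None` when nothing is left (e.g. the input was pure punctuation).
fn normalise_word(word: &str) -> Option<String> {
    let cleaned: String = word
        .chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(char::to_lowercase)
        .collect();
    if cleaned.is_empty() {
        None
    } else {
        Some(cleaned)
    }
}

/// Hypothetical helper: parse a numeric field into a `Copy` integer,
/// ignoring surrounding whitespace. Returns `None` for non-numeric input.
fn parse_id(field: &str) -> Option<u64> {
    field.trim().parse().ok()
}
```

Either helper yields a value that owns its data, so the original line of input can be dropped as soon as it has been processed.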
···
 Don Knuth has written about the algorithm (he refers to it as **Algorithm D**) at https://cs.stanford.edu/~knuth/papers/cvm-note.pdf, and does a far better job than I do at explaining it. You will note that on p1 he describes the buffer he uses, a data structure called a [treap](https://en.wikipedia.org/wiki/Treap#:~:text=7%20External%20links-,Description,(randomly%20chosen)%20numeric%20priority.), as a binary tree
 > "that’s capable of holding up to _s_ ordered pairs (_a_, _u_), where _a_ is an element of the stream and _u_ is a real number, 0 ≤ _u_ < 1."

-where _s_ >= 1. Our implementation doesn't use a treap as a buffer; it uses a fast HashSet with the [FxHash](https://docs.rs/fxhash/latest/fxhash/) algorithm: we pay the hash cost when inserting, but search in step **D4** is `O(1)`. The library may switch to a treap implementation eventually.
+where _s_ >= 1. This implementation uses a treap as a buffer, following Knuth's original design. Its operations run in expected `O(log n)` time rather than the expected `O(1)` of a hash-based buffer, but it needs no hashing and only requires elements to implement `Ord`.
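To make the sampling scheme concrete, here is a self-contained sketch of the same remove, re-insert, and halve loop. It is not this crate's code: a `BTreeSet` stands in for the treap, a tiny xorshift generator (hypothetical `XorShift` type) replaces the `rand` crate so the example has no dependencies, and `estimate_distinct` is an illustrative name.

```rust
use std::collections::BTreeSet;

/// Tiny xorshift64 generator, used only to keep this sketch dependency-free.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Return a pseudo-random float in [0, 1).
    fn next_f64(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Use the top 53 bits so the result fits an f64 mantissa exactly
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Estimate the number of distinct items in `stream`, keeping at most
/// `buf_size` items in memory at any time.
fn estimate_distinct<T: Ord>(
    stream: impl IntoIterator<Item = T>,
    buf_size: usize,
    seed: u64,
) -> f64 {
    assert!(buf_size >= 1 && seed != 0);
    let mut rng = XorShift(seed);
    let mut buf = BTreeSet::new();
    let mut p = 1.0_f64;
    for item in stream {
        // Forget any earlier occurrence, then keep the item with probability p
        buf.remove(&item);
        if rng.next_f64() < p {
            buf.insert(item);
        }
        // When the buffer fills up, drop about half of it and halve p
        while buf.len() == buf_size {
            buf.retain(|_| rng.next_f64() < 0.5);
            p /= 2.0;
        }
    }
    buf.len() as f64 / p
}
```

As long as the stream holds fewer distinct items than `buf_size`, the probability never drops below 1 and the result is exact; beyond that point it becomes an estimate.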

 # What does this library provide
 Two things: the crate / library, and a command-line utility (`cvmcount`) which will count the unique strings in an input text file.
···
 //! An implementation of the CVM fast element counting algorithm presented in
 //! Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2022). *Distinct Elements in Streams: An Algorithm for the (Text) Book*. 6 pages, 727571 bytes. <https://doi.org/10.4230/LIPIcs.ESA.2022.34>
+//!
+//! This implementation uses a treap data structure as the buffer, following Knuth's original design.
+
+mod treap;

+use crate::treap::Treap;
 use rand::rngs::StdRng;
 use rand::{Rng, SeedableRng};

-use rustc_hash::FxHashSet;
-use std::hash::Hash;
-
 /// A counter implementing the CVM algorithm
 ///
+/// This implementation uses a treap (randomized binary search tree) as the buffer,
+/// which provides expected `O(log n)` operations while maintaining the probabilistic properties
+/// needed for the algorithm.
+///
 /// Note that the CVM struct's buffer takes ownership of its elements.
-pub struct CVM<T: PartialEq + Eq + Hash> {
+pub struct CVM<T: Ord> {
     buf_size: usize,
-    buf: FxHashSet<T>,
+    buf: Treap<T>,
     probability: f64,
     rng: StdRng,
 }

-impl<T: PartialEq + Eq + Hash> CVM<T> {
+impl<T: Ord> CVM<T> {
     /// Initialise the algorithm
     ///
-    /// epsilon: how close you want your estimate to be to the true number of distinct elements.
-    /// A smaller ε means you require a more precise estimate.
-    /// For example, ε = 0.05 means you want your estimate to be within 5% of the actual value.
-    /// An epsilon of 0.8 is a good starting point for most applications.
+    /// `epsilon`: how close you want your estimate to be to the true number of distinct elements.
+    /// A smaller `ε` means you require a more precise estimate.
+    /// For example, `ε = 0.05` means you want your estimate to be within 5 % of the actual value.
+    /// An epsilon of `0.8` is a good starting point for most applications.
     ///
-    /// delta: The level of certainty that the algorithm's estimate will fall within the desired accuracy range. A higher confidence
+    /// `delta`: the probability that the estimate falls outside the desired accuracy range, i.e. the confidence is `1 - delta`. A higher confidence
     /// (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a
     /// higher chance the estimate might be outside the desired range.
-    /// A delta of 0.1 is a good starting point for most applications.
+    /// A `delta` of `0.1` (90 % confidence) is a good starting point for most applications.
     ///
-    /// stream_size: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size,
+    /// `stream_size`: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size,
     /// the more accurate the result will be.
     pub fn new(epsilon: f64, delta: f64, stream_size: usize) -> Self {
         let bufsize = buffer_size(epsilon, delta, stream_size);
         Self {
             buf_size: bufsize,
-            buf: FxHashSet::with_capacity_and_hasher(bufsize, Default::default()),
+            buf: Treap::new(),
             probability: 1.0,
             rng: StdRng::from_entropy(),
         }
     }
     /// Add an element, potentially updating the unique element count
     pub fn process_element(&mut self, elem: T) {
-        // We should switch to a treap (as per Knuth) to avoid the hash overhead, but FxHash
-        // is still a lot faster than linear searching a Vec, even at small (1000) buffer sizes
-        // Round 0: if an element exists, remove it. Element is added back due to probability 1
-        // When buffer is full, remove half the elements
-        // Round 1: if an element exists, remove it. Element MAY be added back due to probability 0.5
+        // The algorithm works as follows:
+        // 1. If the element is already in the buffer, remove it, so only its latest occurrence is sampled
+        // 2. Add the element back with the current probability
+        // 3. While the buffer is full, remove ~half the elements and halve the probability
+        // This creates a geometric sampling scheme that provides an unbiased estimate
         if self.buf.contains(&elem) {
             self.buf.remove(&elem);
         }
         if self.rng.gen_bool(self.probability) {
-            self.buf.insert(elem);
+            self.buf.insert(elem, &mut self.rng);
         }
         while self.buf.len() == self.buf_size {
             self.clear_about_half();
···
     }
     // remove around half of the elements at random
     fn clear_about_half(&mut self) {
-        self.buf.retain(|_| self.rng.gen_bool(0.5));
+        // Borrow the RNG on its own so the closure captures only `rng`, not all of `self`
+        let rng = &mut self.rng;
+        self.buf.retain(|_| rng.gen_bool(0.5));
     }
     /// Calculate the current unique element count. You can continue to add elements after calling this method.
     pub fn calculate_final_result(&self) -> f64 {
···
         path::Path,
     };

-    use super::*;
     use regex::Regex;
+    use std::collections::HashSet;

     fn open_file<P>(filename: P) -> BufReader<File>
     where
···
         BufReader::new(f)
     }

-    fn line_to_word(re: &Regex, hs: &mut FxHashSet<String>, line: &str) {
+    fn line_to_word(re: &Regex, hs: &mut HashSet<String>, line: &str) {
         let words = line.split(' ');
         words.for_each(|word| {
             let clean_word = re.replace_all(word, "").to_lowercase();
···
         let input_file = "benches/kiy.txt";
         let re = Regex::new(r"[^\w\s]").unwrap();
         let br = open_file(input_file);
-        let mut hs = FxHashSet::with_hasher(Default::default());
+        let mut hs = HashSet::new();
         br.lines()
             .for_each(|line| line_to_word(&re, &mut hs, &line.unwrap()));
         assert_eq!(hs.len(), 9016)
src/treap.rs (+313)
+//! A randomized binary search tree (treap) implementation
+//!
+//! A treap maintains both BST property (for keys) and heap property (for priorities).
+//!
+//! This implementation was inspired by the treap exploration in <https://github.com/apanda/cvm>
+//! (BSD-2-Clause license), but is an independent implementation tailored specifically
+//! for the CVM algorithm's requirements.
+//!
+//! ## Key Differences from apanda/cvm treap:
+//!
+//! 1. **Simpler structure**: We don't use a separate Element type; keys and priorities are
+//!    stored directly in nodes
+//! 2. **Random priorities**: apanda's implementation expects explicit priorities, while ours
+//!    generates random priorities at insertion time
+//! 3. **No allocation tracking**: apanda uses `alloc_counter` for performance analysis
+//! 4. **Simplified delete**: Our delete returns a bool, apanda's has more complex handling
+//! 5. **Retain operation**: We added a specialized `retain` method for CVM's "clear half" step
+//! 6. **No Display trait**: We focus on the minimal API needed for CVM
+//! 7. **Insert behavior**: apanda's `insert_or_replace` updates existing elements; ours
+//!    keeps the original (no update) which is what CVM needs
+//!
+//! ## Design Decisions
+//!
+//! Unlike general-purpose treap implementations, this one is optimized for CVM:
+//! - No key-value mapping: CVM only needs to track unique elements
+//! - Simplified API: Only operations needed for CVM are implemented
+//! - Efficient `retain`: Optimized for the "clear about half" operation
+//! - RNG integration: Accepts an external RNG for consistent randomness
+
+use rand::Rng;
+use std::cmp::Ordering;
+
+/// A node in the treap
+struct Node<T> {
+    key: T,
+    priority: u32,
+    left: Option<Box<Node<T>>>,
+    right: Option<Box<Node<T>>>,
+}
+
+impl<T> Node<T> {
+    fn new(key: T, priority: u32) -> Self {
+        Node {
+            key,
+            priority,
+            left: None,
+            right: None,
+        }
+    }
+}
+
+/// A treap data structure
+///
+/// Key differences from typical treap implementations:
+/// 1. Priorities are generated at insertion time using the provided RNG
+/// 2. The `retain` operation is optimized for the CVM algorithm's "clear half" operation
+/// 3. No support for key-value pairs - only keys are stored (values are implicit)
+/// 4. No split operation as it's not needed for CVM
+/// 5. Insert doesn't update existing keys - matching CVM's requirement
+pub struct Treap<T> {
+    root: Option<Box<Node<T>>>,
+    size: usize,
+}
+
+impl<T: Ord> Treap<T> {
+    /// Create a new empty treap
+    pub fn new() -> Self {
+        Treap {
+            root: None,
+            size: 0,
+        }
+    }
+
+    /// Get the number of elements in the treap
+    pub fn len(&self) -> usize {
+        self.size
+    }
+
+    /// Check if the treap is empty
+    #[allow(dead_code)]
+    pub fn is_empty(&self) -> bool {
+        self.size == 0
+    }
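+
+    /// Insert a key with a random priority; a key that is already present is left untouched
+    pub fn insert<R: Rng>(&mut self, key: T, rng: &mut R) {
+        // `insert_node` ignores duplicates, so skip them here to keep `size` accurate
+        if self.contains(&key) {
+            return;
+        }
+        let priority = rng.gen();
+        self.root = Self::insert_node(self.root.take(), key, priority);
+        self.size += 1;
+    }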
+
+    /// Check if the treap contains a key
+    pub fn contains(&self, key: &T) -> bool {
+        Self::contains_node(&self.root, key)
+    }
+
+    /// Remove a key from the treap
+    pub fn remove(&mut self, key: &T) -> bool {
+        let (new_root, removed) = Self::remove_node(self.root.take(), key);
+        self.root = new_root;
+        if removed {
+            self.size -= 1;
+        }
+        removed
+    }
+
+    /// Clear the treap
+    #[allow(dead_code)]
+    pub fn clear(&mut self) {
+        self.root = None;
+        self.size = 0;
+    }
+
+    /// Apply a function to each element, removing those for which it returns false
+    pub fn retain<F>(&mut self, mut f: F)
+    where
+        F: FnMut(&T) -> bool,
+    {
+        let (new_root, new_size) = Self::retain_node(self.root.take(), &mut f);
+        self.root = new_root;
+        self.size = new_size;
+    }
+
+    // Helper function to insert a node
+    fn insert_node(node: Option<Box<Node<T>>>, key: T, priority: u32) -> Option<Box<Node<T>>> {
+        match node {
+            None => Some(Box::new(Node::new(key, priority))),
+            Some(mut n) => {
+                match key.cmp(&n.key) {
+                    Ordering::Less => {
+                        n.left = Self::insert_node(n.left, key, priority);
+                        // Maintain heap property
+                        if n.left.as_ref().unwrap().priority > n.priority {
+                            Self::rotate_right(n)
+                        } else {
+                            Some(n)
+                        }
+                    }
+                    Ordering::Greater => {
+                        n.right = Self::insert_node(n.right, key, priority);
+                        // Maintain heap property
+                        if n.right.as_ref().unwrap().priority > n.priority {
+                            Self::rotate_left(n)
+                        } else {
+                            Some(n)
+                        }
+                    }
+                    Ordering::Equal => Some(n), // Key already exists, do nothing
+                }
+            }
+        }
+    }
+
+    // Helper function to check if a node contains a key
+    fn contains_node(node: &Option<Box<Node<T>>>, key: &T) -> bool {
+        match node {
+            None => false,
+            Some(n) => match key.cmp(&n.key) {
+                Ordering::Less => Self::contains_node(&n.left, key),
+                Ordering::Greater => Self::contains_node(&n.right, key),
+                Ordering::Equal => true,
+            },
+        }
+    }
+
+    // Helper function to remove a node
+    fn remove_node(node: Option<Box<Node<T>>>, key: &T) -> (Option<Box<Node<T>>>, bool) {
+        match node {
+            None => (None, false),
+            Some(mut n) => match key.cmp(&n.key) {
+                Ordering::Less => {
+                    let (new_left, removed) = Self::remove_node(n.left, key);
+                    n.left = new_left;
+                    (Some(n), removed)
+                }
+                Ordering::Greater => {
+                    let (new_right, removed) = Self::remove_node(n.right, key);
+                    n.right = new_right;
+                    (Some(n), removed)
+                }
+                Ordering::Equal => {
+                    // Found the node to remove
+                    (Self::merge(n.left, n.right), true)
+                }
+            },
+        }
+    }
+
+    // Merge two subtrees; every key in `left` must be smaller than every key in `right`
+    fn merge(left: Option<Box<Node<T>>>, right: Option<Box<Node<T>>>) -> Option<Box<Node<T>>> {
+        match (left, right) {
+            (None, right) => right,
+            (left, None) => left,
+            (Some(mut l), Some(mut r)) => {
+                // The root with the higher priority stays on top (heap property)
+                if l.priority > r.priority {
+                    l.right = Self::merge(l.right, Some(r));
+                    Some(l)
+                } else {
+                    r.left = Self::merge(Some(l), r.left);
+                    Some(r)
+                }
+            }
+        }
+    }
+
+    // Rotate right
+    fn rotate_right(mut node: Box<Node<T>>) -> Option<Box<Node<T>>> {
+        let mut new_root = node.left.take().unwrap();
+        node.left = new_root.right.take();
+        new_root.right = Some(node);
+        Some(new_root)
+    }
+
+    // Rotate left
+    fn rotate_left(mut node: Box<Node<T>>) -> Option<Box<Node<T>>> {
+        let mut new_root = node.right.take().unwrap();
+        node.right = new_root.left.take();
+        new_root.left = Some(node);
+        Some(new_root)
+    }
+
+    // Retain nodes that satisfy the predicate
+    fn retain_node<F>(node: Option<Box<Node<T>>>, f: &mut F) -> (Option<Box<Node<T>>>, usize)
+    where
+        F: FnMut(&T) -> bool,
+    {
+        match node {
+            None => (None, 0),
+            Some(mut n) => {
+                let (new_left, left_size) = Self::retain_node(n.left, f);
+                let (new_right, right_size) = Self::retain_node(n.right, f);
+
+                if f(&n.key) {
+                    n.left = new_left;
+                    n.right = new_right;
+                    (Some(n), left_size + right_size + 1)
+                } else {
+                    // Remove this node by merging its subtrees
+                    let merged = Self::merge(new_left, new_right);
+                    (merged, left_size + right_size)
+                }
+            }
+        }
+    }
+}
+
+impl<T: Ord> Default for Treap<T> {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use rand::rngs::StdRng;
+    use rand::SeedableRng;
+
+    #[test]
+    fn test_insert_and_contains() {
+        let mut treap = Treap::new();
+        let mut rng = StdRng::seed_from_u64(42);
+
+        treap.insert(5, &mut rng);
+        treap.insert(3, &mut rng);
+        treap.insert(7, &mut rng);
+
+        assert!(treap.contains(&5));
+        assert!(treap.contains(&3));
+        assert!(treap.contains(&7));
+        assert!(!treap.contains(&1));
+        assert_eq!(treap.len(), 3);
+    }
+
+    #[test]
+    fn test_remove() {
+        let mut treap = Treap::new();
+        let mut rng = StdRng::seed_from_u64(42);
+
+        treap.insert(5, &mut rng);
+        treap.insert(3, &mut rng);
+        treap.insert(7, &mut rng);
+
+        assert!(treap.remove(&3));
+        assert!(!treap.contains(&3));
+        assert_eq!(treap.len(), 2);
+
+        assert!(!treap.remove(&3)); // Already removed
+    }
+
+    #[test]
+    fn test_retain() {
+        let mut treap = Treap::new();
+        let mut rng = StdRng::seed_from_u64(42);
+
+        for i in 0..10 {
+            treap.insert(i, &mut rng);
+        }
+
+        treap.retain(|&x| x % 2 == 0);
+        assert_eq!(treap.len(), 5);
+
+        for i in 0..10 {
+            if i % 2 == 0 {
+                assert!(treap.contains(&i));
+            } else {
+                assert!(!treap.contains(&i));
+            }
+        }
+    }
+}
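The tests above cover membership and size, but not the shape of the tree. One way to also check the BST and heap properties after rotations and merges is a recursive validator along these lines. This is a standalone sketch: `Node` is re-declared as a stand-in with the same fields, and `Link`, `node` and `is_valid_treap` are hypothetical names that do not exist in this PR.

```rust
/// Stand-in for the treap node, re-declared so the sketch is self-contained.
struct Node<T> {
    key: T,
    priority: u32,
    left: Option<Box<Node<T>>>,
    right: Option<Box<Node<T>>>,
}

type Link<T> = Option<Box<Node<T>>>;

/// Hypothetical convenience constructor for building small trees by hand.
fn node<T>(key: T, priority: u32, left: Link<T>, right: Link<T>) -> Link<T> {
    Some(Box::new(Node { key, priority, left, right }))
}

/// Returns true if every key lies strictly between `lo` and `hi` (BST property)
/// and no node has a higher priority than its parent (max-heap property).
/// Call it on the root with `None` for all three bounds.
fn is_valid_treap<T: Ord>(
    link: &Link<T>,
    lo: Option<&T>,
    hi: Option<&T>,
    max_priority: Option<u32>,
) -> bool {
    match link {
        None => true,
        Some(n) => {
            let key_ok = lo.map_or(true, |l| &n.key > l) && hi.map_or(true, |h| &n.key < h);
            let heap_ok = max_priority.map_or(true, |p| n.priority <= p);
            key_ok
                && heap_ok
                // Left subtree: keys below this key; right subtree: keys above it
                && is_valid_treap(&n.left, lo, Some(&n.key), Some(n.priority))
                && is_valid_treap(&n.right, Some(&n.key), hi, Some(n.priority))
        }
    }
}
```

Inside the `treap` module, a check like this could run against the root after each `insert`, `remove` or `retain` in a test, assuming it is given access to the private node fields.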