Rust implementation of the CVM algorithm for counting distinct elements in a stream

README update

+11 -1
README.md
@@ -10,13 +10,23 @@
 
 ```shell
 cargo install cvmcount
-cvmcount file.txt 0.8 0.1 2900
+cvmcount -t file.txt -e 0.8 -d 0.1 -s 2900
 ```
+`-t --tokens`: a valid path to a text file
+
+`-e --epsilon`: how close you want your estimate to be to the true number of distinct tokens. A smaller ε means you require a more precise estimate. For example, ε = 0.05 means you want your estimate to be within 5 % of the actual value. An epsilon of 0.8 is a good starting point for most applications.
+
+`-d --delta`: the level of certainty that the algorithm's estimate will fall within your desired accuracy range. A higher confidence (e.g. 99.9 %) means you're very sure the estimate will be accurate, while a lower confidence (e.g. 90 %) means there's a higher chance the estimate may be outside your desired range. A delta of 0.1 is a good starting point for most applications
+
+`-s --streamsize`: this is used to determine buffer size and can be a loose approximation. The closer it is to the stream size, the more accurate the results
 
 The `--help` option is available.
 
 ## Note
 If you're thinking about using this library, you presumably know that it only provides an estimate (within the specified bounds), similar to something like HyperLogLog. You are trading accuracy for speed!
+
+## Perf
+Calculating the unique tokens in a [418K UTF-8 text file](https://www.gutenberg.org/ebooks/8492) takes 19.2 ms ± 0.3 ms on an M2 Pro
 
 ## Implementation Details
 This library strips punctuation from input tokens using a regex. I assume there is a small performance penalty, but it seems like a small price to pay for increased practicality.
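
To make the roles of `--epsilon`, `--delta` and `--streamsize` concrete, here is a minimal Rust sketch of how the three values could combine into a buffer size and drive the estimate. It is not taken from the `cvmcount` source: the buffer-size formula is the one given in the CVM paper, and whether this crate uses exactly that expression is an assumption. The names `buffer_size`, `estimate_distinct` and `XorShift64` are hypothetical, and the small xorshift generator only stands in for a real RNG so the example needs no external crates.

```rust
use std::collections::HashSet;

/// Buffer size as stated in the CVM paper:
/// ceil((12 / epsilon^2) * log2(8 * stream_size / delta)).
/// Assumption: cvmcount may compute its buffer differently.
fn buffer_size(epsilon: f64, delta: f64, stream_size: usize) -> usize {
    let m = stream_size as f64;
    ((12.0 / epsilon.powi(2)) * (8.0 * m / delta).log2()).ceil() as usize
}

/// Tiny xorshift64 generator (illustrative stand-in for a real RNG).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform float in [0, 1), built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// CVM-style estimate of the number of distinct tokens.
/// Returns `None` in the (very unlikely) case that the buffer is
/// still full after a halving round, which the paper treats as failure.
fn estimate_distinct<'a, I>(tokens: I, buf_size: usize, seed: u64) -> Option<f64>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut buf: HashSet<&'a str> = HashSet::new();
    let mut p = 1.0_f64; // current sampling probability

    for tok in tokens {
        // Forget any earlier decision about this token, then re-sample it.
        buf.remove(tok);
        if rng.next_f64() < p {
            buf.insert(tok);
        }
        if buf.len() == buf_size {
            // Buffer full: keep each element with probability 1/2.
            buf.retain(|_| rng.next_f64() < 0.5);
            p /= 2.0;
            if buf.len() == buf_size {
                return None;
            }
        }
    }
    Some(buf.len() as f64 / p)
}
```

Under these assumptions, `buffer_size(0.8, 0.1, 2900)` evaluates to 335 for the values in the example command. When the stream holds fewer distinct tokens than the buffer can store, the sampling probability never drops below 1 and the estimate is exact, so `estimate_distinct("a b a c b".split_whitespace(), 335, 42)` returns `Some(3.0)`.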