Re-Writing a Text Deduplication Tool in Rust
Exploratory project to re-write a text deduplication tool in Rust, comparing performance and accuracy with the original Python implementation.
· 2 min read
Introduction #
I tried working on this without much prior knowledge of Rust and arithmetics and it was a fun learning experience.
Arithmetics overflows #
Overflows are abstracted away in Python, such as when using integers. You get a warning, but the program always runs;
In rust, its a compile time error. Before encountering this, I didn’t know numpy wraps the overflow, which was why I tried expanding to u128, but the results did not match.
// This wouldn't compile in Rust
fn main() {
let a = 18446744073709551615u64;
let b = 1000u64;
let mult = a * b;
println!("{}", mult)
}
Speed and complexity tradeoff #
In terms of speed the rust implementation is about 2x faster than the python one, and the pure rust implementation is about 3x faster.
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash | 0.8537 | 0.9488 | 0.946 | 0.8463 | 0.8961 | 0.8197 | 11.39s |
| MinHashRust | 0.8537 | 0.9489 | 0.946 | 0.8464 | 0.8961 | 0.8197 | 5.67s |
| MinHashPureRS | 0.8537 | 0.9489 | 0.946 | 0.8463 | 0.8961 | 0.8197 | 3.22s |
| Exact Title | 0.8302 | 0.5521 | 0.7098 | 0.9065 | 0.77 | 0.7456 | - |
However, the pure rust implementation is more complex, due to having to read directly from parquet files, while the python implementation is integrated with the datasets library, which makes it easier to use.
Cross languages determinism #
Despite heavy testing and setting the same permutations, the results were not exactly the same between the two.
On smaller datasets, like the benchmark_news, the results were the same, but on larger datasets like the full benchmark_core, there was a difference of 1 duplicate found.
On the togethercomputer/RedPajama-Data-1T-Sample dataset, the differences were more pronounced:
Python: 440263
Hybrid: 440234
Rust: 440235
Another observation is while replicating the results is important, it is also important to validate the truthfulness of the results. In this case, it may very well be possible that the Rust implementation has a better accuracy, despite the differences. More testing is required.
Conclusion #
Overall, it was a very fun learning experience and I can see why Rust is gaining popularity. Compile time checks are very useful in catching early bugs, and the speed improvements are significant.