patternrustMinor
Jaccard distance between strings in Rust
Viewed 0 times
jaccarddistancebetweenruststrings
Problem
The Jaccard distance between two sets is the size of their intersection divided by the size of their union. For example, the distance between
The Jaccard distance can be used for string similarity by slicing a string into word or character n-grams. For example, with 2-character n-grams:
I've written an implementation in Rust as an attempt to learn something about the language.
The shingles method seems particularly bad. I'd like it to be more flexible (for 3 or 4 character n-grams,
{1, 2, 3} and {2, 3, 4} is 2 ({2,3}) / 4 ({1,2,3,4}) = 0.5. The Jaccard distance can be used for string similarity by slicing a string into word or character n-grams. For example, with 2-character n-grams:
"Pear" vs "Peach"
`{"Pe", "ea", "ar"}` vs `{"Pe", "ea", "ac", "ch"}` = 2/5 = 0.4I've written an implementation in Rust as an attempt to learn something about the language.
extern crate unidecode;
use unidecode::unidecode;
use std::collections::HashSet;
fn normalize(s: &str) -> String {
unidecode(s).to_lowercase()
}
fn shingles(s: String) -> HashSet {
let mut xs = HashSet::new();
let it = s.chars();
let mut pk = it.peekable();
loop {
let cur = pk.next();
let nxt = pk.peek();
match (cur, nxt) {
(Some(i), Some(nxt_i)) => {
xs.insert(format!("{}{}", i, nxt_i));
}
_ => { break }
}
}
return xs
}
// Intersection of the sets divided by the size of the union of the
// sets.
fn jaccard_distance(s1: String, s2: String) -> f64 {
let s1_shingles = shingles(s1);
let s2_shingles = shingles(s2);
s1_shingles.len() as f64;
let inter: HashSet = s1_shingles.intersection(&s2_shingles).collect();
let union: HashSet = s1_shingles.union(&s2_shingles).collect();
(inter.len() as f64) / (union.len() as f64)
}
fn comp_and_print(s1: &str, s2: &str) {
let normal_s1 = normalize(s1);
let normal_s2 = normalize(s2);
println!("'{}' vs '{}' ... \t {}", s1, s2,
jaccard_distance(normal_s1, normal_s2));
}
fn main() {
comp_and_print("Apple sauce", "Apple Trees");
comp_and_print("Apple pie", "Grandma's Apple pie");
comp_and_print("Pear", "Peach");
}The shingles method seems particularly bad. I'd like it to be more flexible (for 3 or 4 character n-grams,
Solution
Let's start by removing unused code, the whole
Then, in order to actually be able to call your functions easily, we change all
Let's do
We start by getting all the characters once into a vector. Now that we have a slice (through the Vec), we can use
Note that now we have a parameter to
The other optimization is to avoid collecting into a collection if we just want the count of elements:
All together, it now looks like:
Edit
And of course now I see that if I were to scroll down a bit, I'd see that you do use
normalize function and the line s1_shingles.len() as f64.Then, in order to actually be able to call your functions easily, we change all
String arguments into &str. You usually want to accept &str as more types can provide it. In this case, string literals are the big example.Let's do
shingles in a big bang!fn shingles(s: &str) -> HashSet {
let chars: Vec = s.chars().collect();
chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}We start by getting all the characters once into a vector. Now that we have a slice (through the Vec), we can use
windows, which gives overlapping views into the slice. We then convert each window into an iterator of char using cloned and then into a string using collect. We convert the overall iterator into a HashSet, also using collect. Look at that type inference go!Note that now we have a parameter to
windows that allows you to extend to arbitrary-sized n-grams!The other optimization is to avoid collecting into a collection if we just want the count of elements:
let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();All together, it now looks like:
use std::collections::HashSet;
fn shingles(s: &str) -> HashSet {
let chars: Vec = s.chars().collect();
chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
fn jaccard_distance(s1: &str, s2: &str) -> f64 {
let s1_shingles = shingles(s1);
let s2_shingles = shingles(s2);
let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();
(inter as f64) / (union as f64)
}
fn main() {
println!("{}", jaccard_distance("Pear", "Peach"));
}Edit
And of course now I see that if I were to scroll down a bit, I'd see that you do use
normalize. Oops! You should still accept a &str, the only change you need to make is to take a reference to the Strings:println!("'{}' vs '{}' ... \t {}", s1, s2,
jaccard_distance(&normal_s1, &normal_s2));Code Snippets
fn shingles(s: &str) -> HashSet<String> {
let chars: Vec<_> = s.chars().collect();
chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();use std::collections::HashSet;
fn shingles(s: &str) -> HashSet<String> {
let chars: Vec<_> = s.chars().collect();
chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
fn jaccard_distance(s1: &str, s2: &str) -> f64 {
let s1_shingles = shingles(s1);
let s2_shingles = shingles(s2);
let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();
(inter as f64) / (union as f64)
}
fn main() {
println!("{}", jaccard_distance("Pear", "Peach"));
}println!("'{}' vs '{}' ... \t {}", s1, s2,
jaccard_distance(&normal_s1, &normal_s2));Context
StackExchange Code Review Q#109461, answer score: 4
Revisions (0)
No revisions yet.