HiveBrain v1.2.0
Get Started
← Back to all entries
patternrustMinor

Jaccard distance between strings in Rust

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
jaccarddistancebetweenruststrings

Problem

The Jaccard distance between two sets is the size of their intersection divided by the size of their union. For example, the distance between {1, 2, 3} and {2, 3, 4} is 2 ({2,3}) / 4 ({1,2,3,4}) = 0.5.

The Jaccard distance can be used for string similarity by slicing a string into word or character n-grams. For example, with 2-character n-grams:

"Pear" vs "Peach" 
`{"Pe", "ea", "ar"}` vs `{"Pe", "ea", "ac", "ch"}` = 2/5 = 0.4


I've written an implementation in Rust as an attempt to learn something about the language.

extern crate unidecode;
use unidecode::unidecode;

use std::collections::HashSet;

fn normalize(s: &str) -> String {
    unidecode(s).to_lowercase()
}

fn shingles(s: String) -> HashSet {
    let mut xs = HashSet::new();
    let it = s.chars();
    let mut pk = it.peekable();

    loop {
        let cur = pk.next();
        let nxt = pk.peek();
        match (cur, nxt) {
            (Some(i), Some(nxt_i)) => {
                xs.insert(format!("{}{}", i, nxt_i));
            }
            _ => { break }
        }
    }
    return xs
}

// Intersection of the sets divided by the size of the union of the
// sets.
fn jaccard_distance(s1: String, s2: String) -> f64 {
    let s1_shingles = shingles(s1);
    let s2_shingles = shingles(s2);
    s1_shingles.len() as f64;
    let inter: HashSet = s1_shingles.intersection(&s2_shingles).collect();
    let union: HashSet = s1_shingles.union(&s2_shingles).collect();
    (inter.len() as f64) / (union.len() as f64)
}

fn comp_and_print(s1: &str, s2: &str) {
    let normal_s1 = normalize(s1);
    let normal_s2 = normalize(s2);
    println!("'{}' vs '{}' ... \t {}", s1, s2,
             jaccard_distance(normal_s1, normal_s2));
}

fn main() {
    comp_and_print("Apple sauce", "Apple Trees");
    comp_and_print("Apple pie", "Grandma's Apple pie");
    comp_and_print("Pear", "Peach");
}


The shingles method seems particularly bad. I'd like it to be more flexible (for 3 or 4 character n-grams,

Solution

Let's start by removing unused code, the whole normalize function and the line s1_shingles.len() as f64.

Then, in order to actually be able to call your functions easily, we change all String arguments into &str. You usually want to accept &str as more types can provide it. In this case, string literals are the big example.

Let's do shingles in a big bang!

fn shingles(s: &str) -> HashSet {
    let chars: Vec = s.chars().collect();
    chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}


We start by getting all the characters once into a vector. Now that we have a slice (through the Vec), we can use windows, which gives overlapping views into the slice. We then convert each window into an iterator of char using cloned and then into a string using collect. We convert the overall iterator into a HashSet, also using collect. Look at that type inference go!

Note that now we have a parameter to windows that allows you to extend to arbitrary-sized n-grams!

The other optimization is to avoid collecting into a collection if we just want the count of elements:

let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();


All together, it now looks like:

use std::collections::HashSet;

fn shingles(s: &str) -> HashSet {
    let chars: Vec = s.chars().collect();
    chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}

fn jaccard_distance(s1: &str, s2: &str) -> f64 {
    let s1_shingles = shingles(s1);
    let s2_shingles = shingles(s2);
    let inter = s1_shingles.intersection(&s2_shingles).count();
    let union = s1_shingles.union(&s2_shingles).count();
    (inter as f64) / (union as f64)
}

fn main() {
    println!("{}", jaccard_distance("Pear", "Peach"));
}


Edit

And of course now I see that if I were to scroll down a bit, I'd see that you do use normalize. Oops! You should still accept a &str, the only change you need to make is to take a reference to the Strings:

println!("'{}' vs '{}' ... \t {}", s1, s2,
         jaccard_distance(&normal_s1, &normal_s2));

Code Snippets

fn shingles(s: &str) -> HashSet<String> {
    let chars: Vec<_> = s.chars().collect();
    chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}
let inter = s1_shingles.intersection(&s2_shingles).count();
let union = s1_shingles.union(&s2_shingles).count();
use std::collections::HashSet;

fn shingles(s: &str) -> HashSet<String> {
    let chars: Vec<_> = s.chars().collect();
    chars.windows(2).map(|w| w.iter().cloned().collect()).collect()
}

fn jaccard_distance(s1: &str, s2: &str) -> f64 {
    let s1_shingles = shingles(s1);
    let s2_shingles = shingles(s2);
    let inter = s1_shingles.intersection(&s2_shingles).count();
    let union = s1_shingles.union(&s2_shingles).count();
    (inter as f64) / (union as f64)
}

fn main() {
    println!("{}", jaccard_distance("Pear", "Peach"));
}
println!("'{}' vs '{}' ... \t {}", s1, s2,
         jaccard_distance(&normal_s1, &normal_s2));

Context

StackExchange Code Review Q#109461, answer score: 4

Revisions (0)

No revisions yet.