Calculate Cosine similarity between Multisets
Problem
Having read "Effective Java" and "Clean Code", I wanted to apply these principles in a manageable scenario. To this end I refactored a Java library that calculates similarities between strings.
The code below is that of a cosine metric. Documents are tokenized into multisets of strings. These are passed on to the metric, which calculates their magnitudes and dot product by treating the multisets as if they were (sparse) vectors.
```
package org.simmetrics.metrics;

import static com.google.common.collect.Multisets.union;
import static java.lang.Math.sqrt;

import org.simmetrics.MultisetMetric;

import com.google.common.collect.Multiset;

/**
 * Calculates the cosine similarity over two multisets. The similarity is
 * defined as the cosine of the angle between the multisets expressed as sparse
 * vectors.
 * <p>
 * <code>
 * similarity(a,b) = a·b / (||a|| ||b||)
 * </code>
 * <p>
 * The cosine similarity is identical to the Tanimoto coefficient, but unlike
 * Tanimoto the occurrence (cardinality) of an entry is taken into account. E.g.
 * {@code [hello, world]} and {@code [hello, world, hello, world]} would be
 * identical when compared with Tanimoto but are dissimilar when the cosine
 * similarity is used.
 * <p>
 * This class is immutable and thread-safe.
 *
 * @see TanimotoCoefficient
 * @see "Wikipedia: Cosine similarity"
 *
 * @param <T>
 *            type of the token
 */
public final class CosineSimilarity<T> implements MultisetMetric<T> {

    @Override
    public float compare(Multiset<T> a, Multiset<T> b) {

        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f;
        }

        if (a.isEmpty() || b.isEmpty()) {
            return 0.0f;
        }

        float dotProduct = 0;
        float magnitudeA = 0;
        float magnitudeB = 0;

        // Larger set first for performance improvement.
        // See: MultisetUnionSize benchmark
        if (a.size() < b.size()) {
            Multiset<T> swap = a; a = b; b = swap;
        }

        for (T entry : union(a, b).elementSet()) {
            float aCount // ... (listing truncated here)
```
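To make the "multisets as sparse vectors" idea concrete, here is a minimal, self-contained sketch of the same formula. It is not the library's code: it assumes plain `java.util` maps (token to count) in place of Guava's `Multiset`, and the names `SparseCosineSketch`, `countTokens` and `cosine` are hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of the library under review.
final class SparseCosineSketch {

    /** Builds a multiset as a token -> occurrence-count map. */
    static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine of the angle between two count vectors. Assumes both are non-empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Every distinct token in either multiset is one dimension of the vector space.
        Set<String> dimensions = new HashSet<>(a.keySet());
        dimensions.addAll(b.keySet());

        double dotProduct = 0;
        double squaredNormA = 0;
        double squaredNormB = 0;
        for (String token : dimensions) {
            int aCount = a.getOrDefault(token, 0); // absent token = coordinate 0
            int bCount = b.getOrDefault(token, 0);
            dotProduct += (double) aCount * bCount;
            squaredNormA += (double) aCount * aCount;
            squaredNormB += (double) bCount * bCount;
        }
        // a·b / (||a|| * ||b||)
        return dotProduct / (Math.sqrt(squaredNormA) * Math.sqrt(squaredNormB));
    }

    public static void main(String[] args) {
        Map<String, Integer> a = countTokens(List.of("hello", "world"));
        Map<String, Integer> b = countTokens(List.of("hello", "world", "hello"));
        System.out.println(cosine(a, b)); // 3 / (sqrt(2) * sqrt(5)), roughly 0.949
    }
}
```

Because the cosine depends only on the direction of the count vectors, multisets whose counts are proportional, such as `[hello, world]` and `[hello, world, hello, world]`, score 1 (up to rounding) under this formula.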
Solution
This is pretty good, but two things come immediately to mind:
- Declare variables closest to where they're being used. Declare `dotProduct`, `magnitudeA`, and `magnitudeB` right above your loop.
- Don't reassign method arguments. That's really confusing six months down the road. In this case, use a helper method which takes the larger and smaller sets in a specific order.
Example:
```
import static com.google.common.collect.Multisets.union;
import static java.lang.Math.sqrt;

import org.simmetrics.MultisetMetric;

import com.google.common.collect.Multiset;

/**
 * [Excluded for brevity]
 *
 * @param <T>
 *            type of the token
 */
public final class CosineSimilarity<T> implements MultisetMetric<T> {

    @Override
    public float compare(final Multiset<T> a, final Multiset<T> b) {

        if (a.isEmpty() && b.isEmpty()) {
            return 1.0f;
        }

        if (a.isEmpty() || b.isEmpty()) {
            return 0.0f;
        }

        // Pass the larger multiset first instead of swapping the arguments.
        if (a.size() >= b.size()) {
            return this.determineSimilarity(a, b);
        } else {
            return this.determineSimilarity(b, a);
        }
    }

    private float determineSimilarity(final Multiset<T> largerSet, final Multiset<T> smallerSet) {

        float dotProduct = 0;
        float magnitudeA = 0;
        float magnitudeB = 0;

        for (final T entry : union(largerSet, smallerSet).elementSet()) {
            final float aCount = largerSet.count(entry);
            final float bCount = smallerSet.count(entry);

            dotProduct += aCount * bCount;
            magnitudeA += aCount * aCount;
            magnitudeB += bCount * bCount;
        }

        // a·b / (||a|| * ||b||)
        return (float) (dotProduct / (sqrt(magnitudeA) * sqrt(magnitudeB)));
    }

    @Override
    public String toString() {
        return "CosineSimilarity";
    }
}
```
Context
StackExchange Code Review Q#111591, answer score: 3