patternpythonModerate
Efficient numpy cosine distance calculation
Viewed 0 times
calculationnumpycosineefficientdistance
Problem
I want to calculate the nearest cosine neighbors of a vector using the rows of a matrix, and have been testing the performance of a few Python functions for doing this.
The
def cos_loop_spatial(matrix, vector):
"""
Calculating pairwise cosine distance using a common for loop with the numpy cosine function.
"""
neighbors = []
for row in range(matrix.shape[0]):
neighbors.append(scipy.spatial.distance.cosine(vector, matrix[row,:]))
return neighbors
def cos_loop(matrix, vector):
"""
Calculating pairwise cosine distance using a common for loop with manually calculated cosine value.
"""
neighbors = []
for row in range(matrix.shape[0]):
vector_norm = np.linalg.norm(vector)
row_norm = np.linalg.norm(matrix[row,:])
cos_val = vector.dot(matrix[row,:]) / (vector_norm * row_norm)
neighbors.append(cos_val)
return neighbors
def cos_matrix_multiplication(matrix, vector):
"""
Calculating pairwise cosine distance using matrix vector multiplication.
"""
dotted = matrix.dot(vector)
matrix_norms = np.linalg.norm(matrix, axis=1)
vector_norm = np.linalg.norm(vector)
matrix_vector_norms = np.multiply(matrix_norms, vector_norm)
neighbors = np.divide(dotted, matrix_vector_norms)
return neighbors
cos_functions = [cos_loop_spatial, cos_loop, cos_matrix_multiplication]
# Test performance and plot the best results of each function
mat = np.random.randn(1000,1000)
vec = np.random.randn(1000)
cos_performance = {}
for func in cos_functions:
func_performance = %timeit -o func(mat, vec)
cos_performance[func.__name__] = func_performance.best
pd.Series(cos_performance).plot(kind='bar')The
cos_matrix_multiplication function is clearly the fastest of these, but I'm wondering if you have suggestions of further efficiency improvements for matrix vector cosine distance calculations.Solution
-
Your code does not run: there are missing
-
Your algorithms compute different results, so some of them must be wrong!
In
-
Your graph. (i) It has a very poor "data-ink ratio": just three data points! (ii) We can't actually read off the data points in any case, only estimate them based on the length of the bars; (iii) The labels are hard to read because they are written vertically; (iv) I can't copy and paste the labels, let alone the data. A better way to present this kind of information is in a table, like the one I give below.
-
You hint in the text that your real problem is to do with finding nearest neighbours. It's always best to ask about your real problem rather than some proxy for it. We might be able to suggest a solution that involves a different approach. (For example, if you were using Euclidean distance rather than cosine distance, it might make sense to use
-
Scipy includes a function
-
You don't give us your test case, so I can't confirm your findings or compare them against my own implementation. For example, how many dimensions do your points typically have? I used the following test case:
Which produces the following results:
So
Your code does not run: there are missing
import statements:import numpy as np
import scipy.spatial.distance-
Your algorithms compute different results, so some of them must be wrong!
>>> vector = np.array([1, 2, 3, 4, 5])
>>> matrix = np.array([[7, 5, 8, 1, 9], [6, 6, 4, 0, 8]])
>>> cos_matrix_multiplication(matrix, vector)
array([ 0.81818182, 0.76558762])
>>> cos_loop(matrix, vector)
[0.81818181818181823, 0.76558761870466252]
>>> cos_loop_spatial(matrix, vector)
[0.18181818181818177, 0.23441238129533748]In
cos_matrix_multiplication and cos_loop, you've forgotten to subtract the distance from 1. In what follows I have fixed these bugs.-
Your graph. (i) It has a very poor "data-ink ratio": just three data points! (ii) We can't actually read off the data points in any case, only estimate them based on the length of the bars; (iii) The labels are hard to read because they are written vertically; (iv) I can't copy and paste the labels, let alone the data. A better way to present this kind of information is in a table, like the one I give below.
-
You hint in the text that your real problem is to do with finding nearest neighbours. It's always best to ask about your real problem rather than some proxy for it. We might be able to suggest a solution that involves a different approach. (For example, if you were using Euclidean distance rather than cosine distance, it might make sense to use
scipy.spatial.KDTree. But we can't help you unless you tell us what you're really trying to do.)-
Scipy includes a function
scipy.spatial.distance.cdist specifically for computing pairwise distances. In your case you could call it like this:def cos_cdist(matrix, vector):
"""
Compute the cosine distances between each row of matrix and vector.
"""
v = vector.reshape(1, -1)
return scipy.spatial.distance.cdist(matrix, v, 'cosine').reshape(-1)-
You don't give us your test case, so I can't confirm your findings or compare them against my own implementation. For example, how many dimensions do your points typically have? I used the following test case:
import timeit
def test():
algorithms = (cos_loop_spatial, cos_loop, cos_matrix_multiplication,
cos_cdist)
vector = np.random.rand(10)
matrix = np.random.rand(100000, 10)
for algorithm in algorithms:
t = timeit.timeit(lambda:algorithm(matrix, vector), number=1)
print("{:30s} {:f}".format(algorithm.__name__, t))Which produces the following results:
cos_loop_spatial 8.231966
cos_loop 7.212096
cos_matrix_multiplication 0.106005
cos_cdist 0.019018So
scipy.spatial.distance.cdist is about five times as fast (on this test case) as cos_matrix_multiplication.Code Snippets
import numpy as np
import scipy.spatial.distance>>> vector = np.array([1, 2, 3, 4, 5])
>>> matrix = np.array([[7, 5, 8, 1, 9], [6, 6, 4, 0, 8]])
>>> cos_matrix_multiplication(matrix, vector)
array([ 0.81818182, 0.76558762])
>>> cos_loop(matrix, vector)
[0.81818181818181823, 0.76558761870466252]
>>> cos_loop_spatial(matrix, vector)
[0.18181818181818177, 0.23441238129533748]def cos_cdist(matrix, vector):
"""
Compute the cosine distances between each row of matrix and vector.
"""
v = vector.reshape(1, -1)
return scipy.spatial.distance.cdist(matrix, v, 'cosine').reshape(-1)import timeit
def test():
algorithms = (cos_loop_spatial, cos_loop, cos_matrix_multiplication,
cos_cdist)
vector = np.random.rand(10)
matrix = np.random.rand(100000, 10)
for algorithm in algorithms:
t = timeit.timeit(lambda:algorithm(matrix, vector), number=1)
print("{:30s} {:f}".format(algorithm.__name__, t))cos_loop_spatial 8.231966
cos_loop 7.212096
cos_matrix_multiplication 0.106005
cos_cdist 0.019018Context
StackExchange Code Review Q#55717, answer score: 18
Revisions (0)
No revisions yet.