HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Efficient numpy cosine distance calculation

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
calculationnumpycosineefficientdistance

Problem

I want to calculate the nearest cosine neighbors of a vector using the rows of a matrix, and have been testing the performance of a few Python functions for doing this.

def cos_loop_spatial(matrix, vector):
    """
    Calculating pairwise cosine distance using a common for loop with the numpy cosine function.
    """
    neighbors = []
    for row in range(matrix.shape[0]):
        neighbors.append(scipy.spatial.distance.cosine(vector, matrix[row,:]))
    return neighbors

def cos_loop(matrix, vector):
    """
    Calculating pairwise cosine distance using a common for loop with manually calculated cosine value.
    """
    neighbors = []
    for row in range(matrix.shape[0]):
        vector_norm = np.linalg.norm(vector)
        row_norm = np.linalg.norm(matrix[row,:])
        cos_val = vector.dot(matrix[row,:]) / (vector_norm * row_norm)
        neighbors.append(cos_val)
    return neighbors

def cos_matrix_multiplication(matrix, vector):
    """
    Calculating pairwise cosine distance using matrix vector multiplication.
    """
    dotted = matrix.dot(vector)
    matrix_norms = np.linalg.norm(matrix, axis=1)
    vector_norm = np.linalg.norm(vector)
    matrix_vector_norms = np.multiply(matrix_norms, vector_norm)
    neighbors = np.divide(dotted, matrix_vector_norms)
    return neighbors

cos_functions = [cos_loop_spatial, cos_loop, cos_matrix_multiplication]

# Test performance and plot the best results of each function
mat = np.random.randn(1000,1000)
vec = np.random.randn(1000)
cos_performance = {}
for func in cos_functions:
    func_performance = %timeit -o func(mat, vec)
    cos_performance[func.__name__] = func_performance.best

pd.Series(cos_performance).plot(kind='bar')


The cos_matrix_multiplication function is clearly the fastest of these, but I'm wondering if you have suggestions of further efficiency improvements for matrix vector cosine distance calculations.

Solution

-
Your code does not run: there are missing import statements:

import numpy as np
import scipy.spatial.distance


-
Your algorithms compute different results, so some of them must be wrong!

>>> vector = np.array([1, 2, 3, 4, 5])
>>> matrix = np.array([[7, 5, 8, 1, 9], [6, 6, 4, 0, 8]])
>>> cos_matrix_multiplication(matrix, vector)
array([ 0.81818182,  0.76558762])
>>> cos_loop(matrix, vector)
[0.81818181818181823, 0.76558761870466252]
>>> cos_loop_spatial(matrix, vector)
[0.18181818181818177, 0.23441238129533748]


In cos_matrix_multiplication and cos_loop, you've forgotten to subtract the distance from 1. In what follows I have fixed these bugs.

-
Your graph. (i) It has a very poor "data-ink ratio": just three data points! (ii) We can't actually read off the data points in any case, only estimate them based on the length of the bars; (iii) The labels are hard to read because they are written vertically; (iv) I can't copy and paste the labels, let alone the data. A better way to present this kind of information is in a table, like the one I give below.

-
You hint in the text that your real problem is to do with finding nearest neighbours. It's always best to ask about your real problem rather than some proxy for it. We might be able to suggest a solution that involves a different approach. (For example, if you were using Euclidean distance rather than cosine distance, it might make sense to use scipy.spatial.KDTree. But we can't help you unless you tell us what you're really trying to do.)

-
Scipy includes a function scipy.spatial.distance.cdist specifically for computing pairwise distances. In your case you could call it like this:

def cos_cdist(matrix, vector):
    """
    Compute the cosine distances between each row of matrix and vector.
    """
    v = vector.reshape(1, -1)
    return scipy.spatial.distance.cdist(matrix, v, 'cosine').reshape(-1)


-
You don't give us your test case, so I can't confirm your findings or compare them against my own implementation. For example, how many dimensions do your points typically have? I used the following test case:

import timeit

def test():
    algorithms = (cos_loop_spatial, cos_loop, cos_matrix_multiplication,
                  cos_cdist)
    vector = np.random.rand(10)
    matrix = np.random.rand(100000, 10)
    for algorithm in algorithms:
        t = timeit.timeit(lambda:algorithm(matrix, vector), number=1)
        print("{:30s} {:f}".format(algorithm.__name__, t))


Which produces the following results:

cos_loop_spatial               8.231966
cos_loop                       7.212096
cos_matrix_multiplication      0.106005
cos_cdist                      0.019018


So scipy.spatial.distance.cdist is about five times as fast (on this test case) as cos_matrix_multiplication.

Code Snippets

import numpy as np
import scipy.spatial.distance
>>> vector = np.array([1, 2, 3, 4, 5])
>>> matrix = np.array([[7, 5, 8, 1, 9], [6, 6, 4, 0, 8]])
>>> cos_matrix_multiplication(matrix, vector)
array([ 0.81818182,  0.76558762])
>>> cos_loop(matrix, vector)
[0.81818181818181823, 0.76558761870466252]
>>> cos_loop_spatial(matrix, vector)
[0.18181818181818177, 0.23441238129533748]
def cos_cdist(matrix, vector):
    """
    Compute the cosine distances between each row of matrix and vector.
    """
    v = vector.reshape(1, -1)
    return scipy.spatial.distance.cdist(matrix, v, 'cosine').reshape(-1)
import timeit

def test():
    algorithms = (cos_loop_spatial, cos_loop, cos_matrix_multiplication,
                  cos_cdist)
    vector = np.random.rand(10)
    matrix = np.random.rand(100000, 10)
    for algorithm in algorithms:
        t = timeit.timeit(lambda:algorithm(matrix, vector), number=1)
        print("{:30s} {:f}".format(algorithm.__name__, t))
cos_loop_spatial               8.231966
cos_loop                       7.212096
cos_matrix_multiplication      0.106005
cos_cdist                      0.019018

Context

StackExchange Code Review Q#55717, answer score: 18

Revisions (0)

No revisions yet.