HiveBrain v1.2.0
Get Started
← Back to all entries
patterncppMinor

Cosine similarity

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
similaritycosinestackoverflow

Problem

Is there anything to make the cosine similarity function more optimized in terms of CPU execution time?

double cosine_similarity(double *A, double *B, unsigned int size)
{
    double mul = 0.0, d_a = 0.0, d_b = 0.0 ;

    for(unsigned int i = 0; i < size; ++i) 
    {
        mul += A[i] * B[i] ;
        d_a += A[i] * A[i] ;
        d_b += B[i] * B[i] ;
    }
    return mul / (sqrt(d_a) * sqrt(d_b)) ;
}

Solution

Check for Possible Errors

There are a number of possible errors in the code:

  • The variable size may be zero which will lead to division by zero.



  • The vectors A and B may not be the same length, if the size variable


is the size of the shorter vector this is not a problem, but if the
size variable is the size of the larger vector this can lead to unknown
behavior.

  • Either d_a or d_b can add up to zero, which can lead to division by zero


as @coderodde pointed out.

C++ provides the exception class std:exception that has multiple sub classes.
Two of the classes are std::runtime_error and std::logic_error. Item 1 and 2
above belong in the std::logic_error class, item 3 belongs in the
std::runtime_error class.

Code Readability and Maintainability

The code has:

double mul = 0.0, d_a = 0.0, d_b = 0.0 ;


This is less readable and maintainable than putting the initialization on separate lines:

double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;


Direct Addressing Versus Indirect Addressing

The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers
use the pointers rather than indexing them.

double cosine_similarity(double *A, double *B, unsigned int size)
{
    double mul = 0.0;
    double d_a = 0.0;
    double d_b = 0.0 ;

    for(unsigned int i = 0; i < size; ++i)
    {
        mul += *A * *B;
        d_a += *A * *A;
        d_b += *B * *B;
        A++;
        B++;
    }

    if (d_a == 0.0f || d_b == 0.0f)
    {
        throw std::logic_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a) * sqrt(d_b)) ;
}


Using Container Classes Can be Safer as Well as Efficient

The container classes such as std::vector and std::array provide safer containers than C style arrays. If the number of elements in the array are know, the std::array class is more efficient. If the number of elements in the array are not known std::vector provides a flexible as well as convenient way to store values.

These container classes provide additional information such as the size
of the array/vector and iterators for direct addressing. They also allow
for indexing through the 'array'.

#include 
#include 
#include 

double cosine_similarity_vectors(std::vector A, std::vectorB)
{
    double mul = 0.0;
    double d_a = 0.0;
    double d_b = 0.0 ;

    if (A.size() != B.size())
    {
        throw std::logic_error("Vector A and Vector B are not the same size");
    }

    // Prevent Division by zero
    if (A.size() ::iterator B_iter = B.begin();
    std::vector::iterator A_iter = A.begin();
    for( ; A_iter != A.end(); A_iter++ , B_iter++ )
    {
        mul += *A_iter * *B_iter;
        d_a += *A_iter * *A_iter;
        d_b += *B_iter * *B_iter;
    }

    if (d_a == 0.0f || d_b == 0.0f)
    {
        throw std::logic_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a) * sqrt(d_b));
}

Code Snippets

double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;
double cosine_similarity(double *A, double *B, unsigned int size)
{
    double mul = 0.0;
    double d_a = 0.0;
    double d_b = 0.0 ;

    for(unsigned int i = 0; i < size; ++i)
    {
        mul += *A * *B;
        d_a += *A * *A;
        d_b += *B * *B;
        A++;
        B++;
    }

    if (d_a == 0.0f || d_b == 0.0f)
    {
        throw std::logic_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a) * sqrt(d_b)) ;
}
#include <math.h>
#include <vector>
#include <stdexcept>

double cosine_similarity_vectors(std::vector<double> A, std::vector<double>B)
{
    double mul = 0.0;
    double d_a = 0.0;
    double d_b = 0.0 ;

    if (A.size() != B.size())
    {
        throw std::logic_error("Vector A and Vector B are not the same size");
    }

    // Prevent Division by zero
    if (A.size() < 1)
    {
        throw std::logic_error("Vector A and Vector B are empty");
    }

    std::vector<double>::iterator B_iter = B.begin();
    std::vector<double>::iterator A_iter = A.begin();
    for( ; A_iter != A.end(); A_iter++ , B_iter++ )
    {
        mul += *A_iter * *B_iter;
        d_a += *A_iter * *A_iter;
        d_b += *B_iter * *B_iter;
    }

    if (d_a == 0.0f || d_b == 0.0f)
    {
        throw std::logic_error(
                "cosine similarity is not defined whenever one or both "
                "input vectors are zero-vectors.");
    }

    return mul / (sqrt(d_a) * sqrt(d_b));
}

Context

StackExchange Code Review Q#142331, answer score: 9

Revisions (0)

No revisions yet.