patterncppMinor
Cosine similarity
Viewed 0 times
similaritycosinestackoverflow
Problem
Is there anything to make the cosine similarity function more optimized in terms of CPU execution time?
double cosine_similarity(double *A, double *B, unsigned int size)
{
double mul = 0.0, d_a = 0.0, d_b = 0.0 ;
for(unsigned int i = 0; i < size; ++i)
{
mul += A[i] * B[i] ;
d_a += A[i] * A[i] ;
d_b += B[i] * B[i] ;
}
return mul / (sqrt(d_a) * sqrt(d_b)) ;
}Solution
Check for Possible Errors
There are a number of possible errors in the code:
is the size of the shorter vector this is not a problem, but if the
behavior.
as @coderodde pointed out.
C++ provides the exception class std:exception that has multiple sub classes.
Two of the classes are
above belong in the
Code Readability and Maintainability
The code has:
This is less readable and maintainable than putting the initialization on separate lines:
Direct Addressing Versus Indirect Addressing
The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers
use the pointers rather than indexing them.
Using Container Classes Can be Safer as Well as Efficient
The container classes such as
These container classes provide additional information such as the size
of the array/vector and iterators for direct addressing. They also allow
for indexing through the 'array'.
There are a number of possible errors in the code:
- The variable
sizemay be zero which will lead to division by zero.
- The vectors A and B may not be the same length, if the
sizevariable
is the size of the shorter vector this is not a problem, but if the
size variable is the size of the larger vector this can lead to unknownbehavior.
- Either d_a or d_b can add up to zero, which can lead to division by zero
as @coderodde pointed out.
C++ provides the exception class std:exception that has multiple sub classes.
Two of the classes are
std::runtime_error and std::logic_error. Item 1 and 2above belong in the
std::logic_error class, item 3 belongs in thestd::runtime_error class.Code Readability and Maintainability
The code has:
double mul = 0.0, d_a = 0.0, d_b = 0.0 ;This is less readable and maintainable than putting the initialization on separate lines:
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;Direct Addressing Versus Indirect Addressing
The the most efficient way to write the function is to use direct addressing rather than indirect addressing. Since the code is already passing pointers
use the pointers rather than indexing them.
double cosine_similarity(double *A, double *B, unsigned int size)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0 ;
for(unsigned int i = 0; i < size; ++i)
{
mul += *A * *B;
d_a += *A * *A;
d_b += *B * *B;
A++;
B++;
}
if (d_a == 0.0f || d_b == 0.0f)
{
throw std::logic_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a) * sqrt(d_b)) ;
}Using Container Classes Can be Safer as Well as Efficient
The container classes such as
std::vector and std::array provide safer containers than C style arrays. If the number of elements in the array are know, the std::array class is more efficient. If the number of elements in the array are not known std::vector provides a flexible as well as convenient way to store values.These container classes provide additional information such as the size
of the array/vector and iterators for direct addressing. They also allow
for indexing through the 'array'.
#include
#include
#include
double cosine_similarity_vectors(std::vector A, std::vectorB)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0 ;
if (A.size() != B.size())
{
throw std::logic_error("Vector A and Vector B are not the same size");
}
// Prevent Division by zero
if (A.size() ::iterator B_iter = B.begin();
std::vector::iterator A_iter = A.begin();
for( ; A_iter != A.end(); A_iter++ , B_iter++ )
{
mul += *A_iter * *B_iter;
d_a += *A_iter * *A_iter;
d_b += *B_iter * *B_iter;
}
if (d_a == 0.0f || d_b == 0.0f)
{
throw std::logic_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a) * sqrt(d_b));
}Code Snippets
double mul = 0.0, d_a = 0.0, d_b = 0.0 ;double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0;double cosine_similarity(double *A, double *B, unsigned int size)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0 ;
for(unsigned int i = 0; i < size; ++i)
{
mul += *A * *B;
d_a += *A * *A;
d_b += *B * *B;
A++;
B++;
}
if (d_a == 0.0f || d_b == 0.0f)
{
throw std::logic_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a) * sqrt(d_b)) ;
}#include <math.h>
#include <vector>
#include <stdexcept>
double cosine_similarity_vectors(std::vector<double> A, std::vector<double>B)
{
double mul = 0.0;
double d_a = 0.0;
double d_b = 0.0 ;
if (A.size() != B.size())
{
throw std::logic_error("Vector A and Vector B are not the same size");
}
// Prevent Division by zero
if (A.size() < 1)
{
throw std::logic_error("Vector A and Vector B are empty");
}
std::vector<double>::iterator B_iter = B.begin();
std::vector<double>::iterator A_iter = A.begin();
for( ; A_iter != A.end(); A_iter++ , B_iter++ )
{
mul += *A_iter * *B_iter;
d_a += *A_iter * *A_iter;
d_b += *B_iter * *B_iter;
}
if (d_a == 0.0f || d_b == 0.0f)
{
throw std::logic_error(
"cosine similarity is not defined whenever one or both "
"input vectors are zero-vectors.");
}
return mul / (sqrt(d_a) * sqrt(d_b));
}Context
StackExchange Code Review Q#142331, answer score: 9
Revisions (0)
No revisions yet.