Cosine similarity of one vector with many

Submitted by: @import:stackexchange-codereview

Problem

I'm keen to hear ideas for optimising R code to compute the cosine similarity of a vector x (with length l) with n other vectors (stored in any structure such as a matrix m with n rows and l columns).

Values for n will typically be much larger than values for l.
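
For reference, the quantity wanted for each row y of m is sum(x * y) / sqrt(sum(x^2) * sum(y^2)). A minimal base R sketch of that definition, computed one row at a time, can serve as a slow but easy-to-check baseline. The name cosine_reference and the small example data are assumptions made for illustration; they are not part of the original question.

```r
# Hypothetical reference implementation: cosine similarity of x with each
# row of m, computed row by row straight from the definition.
cosine_reference <- function(x, m) {
  stopifnot(length(x) == ncol(m))
  x_norm <- sqrt(sum(x^2))  # norm of x, computed once
  vapply(
    seq_len(nrow(m)),
    function(i) sum(x * m[i, ]) / (x_norm * sqrt(sum(m[i, ]^2))),
    numeric(1)
  )
}

# Illustrative data with l = 2 and n = 3
x <- c(1, 0)
m <- rbind(c(2, 0),   # same direction as x
           c(0, 3),   # orthogonal to x
           c(1, 1))   # 45 degrees from x
res <- cosine_reference(x, m)
```

Here res should hold 1 for the parallel row, 0 for the orthogonal row, and 1/sqrt(2) for the row at 45 degrees.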

I'm currently using this custom Rcpp function to compute the similarity of a vector x to each row of a matrix m:

library(Rcpp)
cppFunction('NumericVector cosine_x_to_m(NumericVector x, NumericMatrix m) {
  int nrows = m.nrow();
  NumericVector out(nrows);
  for (int i = 0; i < nrows; i++) {
    NumericVector y = m(i, _);
    out[i] = sum(x * y) / sqrt(sum(pow(x, 2.0)) * sum(pow(y, 2.0)));
  }
  return out;
}')


Varying n and l, I'm getting the following sorts of timings:

Reproducible code below.

# Function to simulate data
sim_data <- [...] %>%
  mutate(timings = map2(l, n, timer))

# Plot results
results_plot <- [...] %>%
  unnest(timings) %>%
  mutate(time = time / 1000000) %>%  # Convert time to seconds
  group_by(l, n) %>%
  summarise(mean = mean(time), ci = 1.96 * sd(time) / sqrt(n()))

pd <- [...] %>%
  ggplot(aes(n, mean, group= l)) +
  geom_line(aes(color = factor(l)), position = pd, size = 2) +
  geom_errorbar(aes(ymin = mean - ci, ymax = mean + ci), position = pd, width = 100) +
  geom_point(position = pd, size = 2) +
  scale_color_brewer(palette = "Blues") +
  theme_minimal() +
  labs(x = "n", y = "Seconds", color = "l") +
  ggtitle("Algorithm Runtime",
          subtitle = "Error bars represent 95% confidence intervals")

Solution

I'm using Microsoft R (with Intel MKL), which makes matrix multiplication faster, but for a fair comparison I set it to be single-threaded.

setMKLthreads(1)


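In my tests, this pure R version, cosine_x_to_m2, is about twice as fast as yours.

cosine_x_to_m2 = function(x, m){
  # Scale x to unit length once; drop() turns the 1 x 1 matrix from crossprod() into a scalar
  x = x / sqrt(drop(crossprod(x)))
  # One matrix-vector product gives all n dot products; divide each by its row norm
  return(as.vector((m %*% x) / sqrt(rowSums(m^2))))
}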
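
The pure R version works because normalising x once and taking a single matrix-vector product yields every row's dot product with the unit vector, leaving only the division by each row's norm. A quick way to convince yourself that it matches the per-row definition is to compare the two on random data. The function names, seed, and sizes below are assumptions chosen for illustration, and sqrt(sum(x^2)) is swapped in for sqrt(crossprod(x)) so that the norm is a plain scalar.

```r
# Vectorised version, restated so this block runs on its own.
cosine_vectorised <- function(x, m) {
  x_unit <- x / sqrt(sum(x^2))                    # normalise x once
  as.vector((m %*% x_unit) / sqrt(rowSums(m^2)))  # n dot products over row norms
}

# Row-by-row calculation straight from the definition, for comparison only.
cosine_rowwise <- function(x, m) {
  apply(m, 1, function(y) sum(x * y) / sqrt(sum(x^2) * sum(y^2)))
}

set.seed(1)            # arbitrary seed, for reproducibility
l <- 5
n <- 1000              # illustrative sizes with n much larger than l
x <- runif(l)
m <- matrix(runif(n * l), nrow = n)

# TRUE when the two results agree up to floating-point tolerance
agree <- isTRUE(all.equal(cosine_vectorised(x, m), cosine_rowwise(x, m)))
```

The two versions differ only in the order of floating-point operations, so all.equal() rather than identical() is the appropriate comparison.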


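Replacing rowSums(m^2) with a C/C++ implementation (rowSumsSq from the ramwas package) makes it faster still, about four times faster than the original.

library(ramwas)
cosine_x_to_m2 = function(x, m){
  # Same as above, with the row sums of squares computed by rowSumsSq()
  x = x / sqrt(drop(crossprod(x)))
  return(as.vector((m %*% x) / sqrt(rowSumsSq(m))))
}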
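
For readers without ramwas, the following is a rough sketch of what computing the row sums of squares in C++ could look like, using Rcpp's cppFunction as in the question. The name row_sums_sq is hypothetical and this is not the ramwas implementation; it only illustrates the idea of accumulating squares directly instead of first building the intermediate matrix m^2. No timing claims are made for it.

```r
library(Rcpp)

# Hypothetical stand-in for a C++ row-sums-of-squares routine.
cppFunction('NumericVector row_sums_sq(NumericMatrix m) {
  int nrows = m.nrow();
  int ncols = m.ncol();
  NumericVector out(nrows);  // zero-initialised accumulator, one entry per row
  // Walk column by column to follow the column-major storage of R matrices
  for (int j = 0; j < ncols; j++) {
    for (int i = 0; i < nrows; i++) {
      double v = m(i, j);
      out[i] += v * v;
    }
  }
  return out;
}')

set.seed(1)                              # arbitrary seed, for reproducibility
m <- matrix(runif(1000 * 5), nrow = 1000)

# Should agree with the base R expression it replaces
same <- isTRUE(all.equal(row_sums_sq(m), rowSums(m^2)))
```

Whether this beats rowSums(m^2) on a given machine is best checked with a benchmark on data of the intended size.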


Initial performance:

Final version performance:

Context

StackExchange Code Review Q#159396, answer score: 4