
Calculating how similar two objects are according to a database

Submitted by: @import:stackexchange-codereview

Problem

I want to calculate how similar two objects are according to a database. However, the code is very slow: it takes around 2 minutes to analyze just 3 objects. How can I speed it up? I have tried to avoid for loops, and I'm also avoiding recalculating things.

This function returns the ith combination of n elements taken r at a time; in my case r is always 2.

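".combinadic" <- function(n, r, i) {
  # Signature, bounds check and error text reconstructed from how the body uses n, n0, r and i.
  n0 <- length(n)
  if (i < 1 | i > choose(n0,r)) {
    stop("'i' must be 0 < i <= choose(n0, r)")
  }
  largestV <- function(n, r, i) {
    v <- n # Adjusted for one-based indexing
    while (choose(v,r) >= i) { # Adjusted for one-based indexing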
      v <- v - 1
    }
    return(v)
  }

  res <- rep(NA,r)
  for (j in 1:r) {
    res[j] <- largestV(n0,r,i)
    i <- i - choose(res[j],r)
    n0 <- res[j]
    r <- r - 1
  }
  res <- res + 1
  res <- n[res]
  return(res)
}
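As a cross-check on what "the ith combination" means, here is a minimal sketch using base R's combn(), which enumerates every combination up front. This is not the function above, and the order in which the two approaches list the pairs may differ; the element names are made up for illustration.

```r
# Hypothetical element names, for illustration only.
n <- c("a", "b", "c", "d")
r <- 2

# combn() builds all combinations at once: one pair per column.
all_pairs <- combn(n, r)

# There are choose(4, 2) = 6 pairs in total.
n_pairs <- ncol(all_pairs)

# The first pair in combn()'s ordering.
first_pair <- all_pairs[, 1]   # "a" "b"
```

Building the full table is fine for small inputs; indexing directly into the ith combination, as the function above does, avoids storing all choose(n, r) columns.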


This function compares graphs by calculating how many nodes are in both graphs; if a character vector is given instead, it compares how many elements are in both objects.

compare_graphs <- function(g1, g2){
  # Function to estimate how much two graphs overlap by looking if the nodes
  # are the same
  # Check which case are we using
  if (is(g1, "graph") & is(g2, "graph")) {
    prot1 <- nodes(g1)
    prot2 <- nodes(g2)
    if (length(prot1) == 0 | length(prot2) == 0) {
      return(NA)
    }
  } else if (is(g1, "graph") & is.character(g2)) {
    prot1 <- nodes(g1)
    prot2 <- g2
  } else if (is(g2, "graph") & is.character(g1)) {
    prot2 <- nodes(g2)
    prot1 <- g1
  } else {
    prot1 <- g1
    prot2 <- g2
  }

  score <- (length(intersect(prot1, prot2)))*2/(
    length(prot2) + length(prot1))
  score
}
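To make the score concrete, here is a small worked sketch of the same formula on plain character vectors: twice the size of the intersection divided by the sum of the two sizes. The helper name overlap_score and the protein identifiers are assumptions made for this example only.

```r
# Hypothetical helper that mirrors the last lines of compare_graphs()
# for the character-vector case.
overlap_score <- function(a, b) {
  2 * length(intersect(a, b)) / (length(a) + length(b))
}

# Made-up identifiers: two of them ("P3", "P4") are shared.
prot1 <- c("P1", "P2", "P3", "P4")
prot2 <- c("P3", "P4", "P5")

score <- overlap_score(prot1, prot2)   # 2 * 2 / (4 + 3) = 4/7
```

The score is 1 when both sets are identical and 0 when they share nothing.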


This function calculates the degree of overlap of Gene Ontology terms in the Biological Process (BP) class. It differs from compare_graphs because the structure of this graph and its subgraphs is known (it is a DAG), and I use that to compare the paths of both graphs.

```
# Calculates the degree of overlap of the GO BP ontologies of entrez ids.
go_cor 1 | length(UI) > 1) {
if (is.na(LP["sim"]) | is.na(UI["sim"])) {
return(NA)
}
} else if (is.na(LP) | is.na(UI)) {
r
```

Solution

I'll address the two main bottlenecks in your code.

First bottleneck

To help understand the issue, let's first remind ourselves of the difference between the [ and [[ operators:

  • when applied to a list, [ returns a sub-list, while [[ returns a list element.
  • when applied to a data.frame (which is a form of a list), [ returns a data.frame, while [[ returns a vector (the data in a column).
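The two points above can be checked in a few lines. This is a minimal sketch on a toy data.frame; the column names and values are made up and only stand in for your genes table.

```r
# Toy stand-in for the genes table (hypothetical values).
genes <- data.frame(
  Entrez = c("1", "2", "3"),
  Symbol = c("A1BG", "A2M", "A2MP1"),
  stringsAsFactors = FALSE
)
colm <- "Entrez"

sub_df  <- genes[colm]     # single bracket: still a data.frame, one column
col_vec <- genes[[colm]]   # double bracket: the column itself

class(sub_df)    # "data.frame"
class(col_vec)   # "character"

# The same distinction on a plain list.
lst <- list(a = 1:3, b = letters)
class(lst["a"])     # "list"    (a sub-list of length 1)
class(lst[["a"]])   # "integer" (the element itself)
```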



Inside genes.info, where you do:

out <- unique(genes[genes[colm] == id, "Symbol"])


genes is a data.frame (i.e. a list), so genes[colm] is also a data.frame (a sub-list). When you then do genes[colm] == id, the == operator has to convert your one-column data.frame into a matrix before it can compare it to id, which is very expensive. This is where the matrix item at the top of your profile comes from. Instead, you meant to do:

out <- unique(genes[genes[[colm]] == id, "Symbol"])


where genes[[colm]] is a vector, so == does not have to do any conversion.
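If you want to see the difference for yourself, here is a rough sketch on simulated data. The table size, column contents and id value are assumptions, not your actual data, and the timings will vary by machine.

```r
set.seed(1)
n <- 1e5

# Simulated stand-in for the genes table (hypothetical contents).
genes <- data.frame(
  Entrez = sample(sprintf("%d", 1:1000), n, replace = TRUE),
  Symbol = sample(sprintf("SYM%d", 1:1000), n, replace = TRUE),
  stringsAsFactors = FALSE
)
colm <- "Entrez"
id <- "42"

slow_mask <- genes[colm] == id     # data.frame compared: result is an n x 1 logical matrix
fast_mask <- genes[[colm]] == id   # vector compared: result is a plain logical vector

# Both masks select the same rows, so the lookups agree.
out_slow <- unique(genes[genes[colm] == id, "Symbol"])
out_fast <- unique(genes[genes[[colm]] == id, "Symbol"])

# Compare the cost of building each mask repeatedly.
system.time(for (k in 1:50) genes[colm] == id)
system.time(for (k in 1:50) genes[[colm]] == id)
```

The results are the same either way; only the work done to build the mask changes.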

Note that you have a similar issue twice inside comb_biopath where you meant to use info[[by]] instead of info[by].

Second bottleneck

With the iterative merge calls, you end up with pretty large data. What is costly is that these merge calls, by default, also sort your data. That's where the second item in your profile (order) comes from. To get rid of it, which should not affect your results, add sort = FALSE to all your merge() calls.
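Here is a minimal sketch of what sort = FALSE changes, using two tiny made-up tables rather than your data. The merged rows are the same; only their order can differ, which is why it should not matter if your later steps do not rely on row order.

```r
# Two hypothetical tables sharing an "id" key.
left  <- data.frame(id = c("c", "a", "b"), x = 1:3, stringsAsFactors = FALSE)
right <- data.frame(id = c("b", "c", "a"), y = c(10, 20, 30), stringsAsFactors = FALSE)

sorted   <- merge(left, right, by = "id")                 # default: sort = TRUE
unsorted <- merge(left, right, by = "id", sort = FALSE)   # skips the ordering step

sorted$id   # "a" "b" "c"

# Reordering the unsorted result by key recovers the same pairing of x and y.
reordered <- unsorted[order(unsorted$id), ]
```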

On my machine, these two changes cut the computation time by roughly two-thirds. I hope this puts you on the right track.

Context

StackExchange Code Review Q#142487, answer score: 3
