patternpythonMinor
Averaging lists of values with duplicate keys
Viewed 0 times
withaveragingduplicatekeyslistsvalues
Problem
I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.
For example:
is converted to
Here is my function:
This function is slower than any other part of my data pipeline, taking 5-10 seconds when other steps that also operate on the whole dataset take sub-seconds. Any thoughts on how I can improve this?
For example:
genes = ['A', 'C', 'C', 'B', 'A']
vals = [[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4
[3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2
[8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1
[4.0, 4.0, 4.0, 3.0], # B: is fine
[1.0, 1.0, 1.0, 1.0]] # A: will be averaged with row=0is converted to
genes = ['A', 'B', 'C']
vals = [[1.5, 1.5, 5.0, 5.0],
[4.0, 4.0, 4.0, 3.0],
[5.5, 5.5, 2.5, 2.5]]Here is my function:
def avg_dups(genes, values):
"""Finds duplicate genes and averages their expression data.
"""
unq_genes = np.unique(genes)
out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
for i, gene in enumerate(unq_genes):
dups = values[genes==gene]
out_values[i] = np.mean(dups, axis=0)
return (unq_genes, out_values)This function is slower than any other part of my data pipeline, taking 5-10 seconds when other steps that also operate on the whole dataset take sub-seconds. Any thoughts on how I can improve this?
Solution
This seems to be the fastest so far:
This finds the unique genes to fold the values into, along with the
It adds the row from each current index to the new index in the new
The mean is taken with a simple division of the number of repeated values that map to the same index:
Using Ashwini Chaudhary's
import numpy
from numpy import newaxis
def avg_dups(genes, values):
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
output /= counts[:, newaxis]
return folded, outputThis finds the unique genes to fold the values into, along with the
current index → new index mapping and the number of repeated values that map to the same index:folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)It adds the row from each current index to the new index in the new
output:output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)numpy.add.at(output, indices, values) is used over output[indices] += values because the buffering used in += breaks the code for repeated indices.The mean is taken with a simple division of the number of repeated values that map to the same index:
output /= counts[:, newaxis]Using Ashwini Chaudhary's
generate_test_data(2000) (giving a 10000x4 array), my rough timings are:name time/ms Author
avg_dups 230 gwg
avg_dups_fast 33 Ashwini Chaudhary
avg_dups_python 45 Ashwini Chaudhary
avg_dups 430 Veedrac
avg_dups 5 Veedrac with Jaime's improvementCode Snippets
import numpy
from numpy import newaxis
def avg_dups(genes, values):
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)
output /= counts[:, newaxis]
return folded, outputfolded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)output = numpy.zeros((folded.shape[0], values.shape[1]))
numpy.add.at(output, indices, values)output /= counts[:, newaxis]name time/ms Author
avg_dups 230 gwg
avg_dups_fast 33 Ashwini Chaudhary
avg_dups_python 45 Ashwini Chaudhary
avg_dups 430 Veedrac
avg_dups 5 Veedrac with Jaime's improvementContext
StackExchange Code Review Q#82010, answer score: 3
Revisions (0)
No revisions yet.