HiveBrain v1.2.0

Averaging lists of values with duplicate keys

Submitted by: @import:stackexchange-codereview
Tags: averaging, duplicate, keys, lists, values, with

Problem

I have gene expression data that I represent as a list of genes and a list of lists of values. I average the expression data for any genes with the same name.

For example:

genes = ['A', 'C', 'C', 'B', 'A']
vals  = [[2.0, 2.0, 9.0, 9.0], # A: will be averaged with row=4
         [3.0, 3.0, 3.0, 3.0], # C: will be averaged with row=2
         [8.0, 8.0, 2.0, 2.0], # C: will be averaged with row=1
         [4.0, 4.0, 4.0, 3.0], # B: is fine
         [1.0, 1.0, 1.0, 1.0]] # A: will be averaged with row=0


is converted to

genes = ['A', 'B', 'C']
vals  = [[1.5, 1.5, 5.0, 5.0],
         [4.0, 4.0, 4.0, 3.0],
         [5.5, 5.5, 2.5, 2.5]]


Here is my function:

import numpy as np

def avg_dups(genes, values):
    """Finds duplicate genes and averages their expression data.
    """
    genes = np.asarray(genes)    # boolean masking below needs arrays, not lists
    values = np.asarray(values)
    unq_genes = np.unique(genes)
    out_values = np.zeros((unq_genes.shape[0], values.shape[1]))
    for i, gene in enumerate(unq_genes):
        dups = values[genes == gene]
        out_values[i] = np.mean(dups, axis=0)
    return (unq_genes, out_values)


This function is slower than any other part of my data pipeline: it takes 5-10 seconds, while other steps that also operate on the whole dataset finish in under a second. Any thoughts on how I can speed it up?

Solution

This seems to be the fastest so far:

import numpy as np

def avg_dups(genes, values):
    folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)

    output = np.zeros((folded.shape[0], values.shape[1]))
    np.add.at(output, indices, values)
    output /= counts[:, np.newaxis]

    return folded, output


This finds the unique genes to fold the values into, along with the current index → new index mapping and the number of repeated values that map to the same index:

folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
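On the example genes from the problem, these three return values look like this (a quick illustration):

```python
import numpy as np

genes = np.array(['A', 'C', 'C', 'B', 'A'])
folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)

print(folded)   # the unique genes, sorted
print(indices)  # old row i folds into new row indices[i]
print(counts)   # how many old rows fold into each new row
```

Here folded is ['A' 'B' 'C'], indices is [0 2 2 1 0], and counts is [2 1 2].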


It then accumulates each input row into the output row that its gene maps to:

output = np.zeros((folded.shape[0], values.shape[1]))
np.add.at(output, indices, values)


np.add.at(output, indices, values) is used instead of output[indices] += values because fancy-indexed += buffers its writes: for repeated indices only the last contribution survives, which gives wrong results here.
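A minimal demonstration of the difference, with index 0 repeated:

```python
import numpy as np

idx = np.array([0, 0, 1])          # index 0 appears twice
vals = np.array([1.0, 2.0, 3.0])

buffered = np.zeros(2)
buffered[idx] += vals              # buffered writes: only the last write to index 0 lands

unbuffered = np.zeros(2)
np.add.at(unbuffered, idx, vals)   # unbuffered: every contribution is accumulated

print(buffered)    # [2. 3.]  (the 1.0 contribution was lost)
print(unbuffered)  # [3. 3.]  (1.0 + 2.0 both counted)
```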

The mean is then taken by dividing each output row by the number of input rows that were folded into it:

output /= counts[:, np.newaxis]
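Putting the pieces together, here is a self-contained run on the example data from the problem (a sketch using import numpy as np, matching the solution above):

```python
import numpy as np

def avg_dups(genes, values):
    # Map each gene to its row in the folded output, count duplicates.
    folded, indices, counts = np.unique(genes, return_inverse=True, return_counts=True)
    output = np.zeros((folded.shape[0], values.shape[1]))
    np.add.at(output, indices, values)   # unbuffered accumulation per output row
    output /= counts[:, np.newaxis]      # sum -> mean
    return folded, output

genes = np.array(['A', 'C', 'C', 'B', 'A'])
vals = np.array([[2.0, 2.0, 9.0, 9.0],
                 [3.0, 3.0, 3.0, 3.0],
                 [8.0, 8.0, 2.0, 2.0],
                 [4.0, 4.0, 4.0, 3.0],
                 [1.0, 1.0, 1.0, 1.0]])

folded, out = avg_dups(genes, vals)
print(folded)  # ['A' 'B' 'C']
print(out)
# [[1.5 1.5 5.  5. ]
#  [4.  4.  4.  3. ]
#  [5.5 5.5 2.5 2.5]]
```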


Using Ashwini Chaudhary's generate_test_data(2000) (giving a 10000x4 array), my rough timings are:

name             time/ms  Author
avg_dups           230    gwg
avg_dups_fast       33    Ashwini Chaudhary
avg_dups_python     45    Ashwini Chaudhary
avg_dups           430    Veedrac
avg_dups             5    Veedrac with Jaime's improvement


Context

StackExchange Code Review Q#82010, answer score: 3
