HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonModerate

Groupby in NumPy

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
groupbynumpystackoverflow

Problem

To avoid the XY problem, here is an example of what I need. Given the sorted integer input array

[1 1 1 2 2 3], I would like to produce the following slices, together with their index:

0: slice(0,3)
1: slice(3,5)
2: slice(5,6)


(Both the index and the slice will be used to index/slice into some other arrays.)

I think this could be solved with the itertools.groupby(). Partly because I am learning NumPy, and partly because evidence suggests that groupby() may be inefficient, I would like to do this in NumPy. The code below seems to do what I want.

from __future__ import print_function
import numpy as np

if __name__=='__main__':
    arr = np.array([1, 1, 1, 2, 2, 3])
    print(arr)
    mask = np.ediff1d(arr, to_begin=1, to_end=1)
    indices = np.where(mask)[0]
    for i in np.arange(indices.size-1):
        print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))


Is there a cleaner / more efficient way of doing this?

Solution

The function scipy.ndimage.find_objects returns exactly the slices that you are looking for:

>>> a = np.array([1, 1, 1, 2, 2, 3])
>>> scipy.ndimage.find_objects(a)
[(slice(0, 3, None),), (slice(3, 5, None),), (slice(5, 6, None),)]


It's hard to make a fair timing comparison without knowing exactly how you are going to use these slices, but a quick test shows that this is about twice as fast as your strategy of using np.ediff2d followed by np.where.

A couple of general points on your code:

-
You use numpy.where with a single argument. This is the same as numpy.nonzero and it would be clearer in this case to use the latter.

-
But having made that change, you could use numpy.flatnonzero and so avoid the [0].

-
It's pointless to create an array if you are only going to iterate over it. Better to use an iterator. So instead of:

for i in np.arange(indices.size-1):
    print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))


you can write something like:

for i, j in enumerate(zip(indices, indices[1:])):
    print('{}: {}'.format(i, slice(*j)))


In response to comment: If you find the zip idiom unclear then you could encapsulate it:

from future_builtins import zip     # Python 2 only

def pairs(a):
    """Return an iterator over pairs of adjacent elements in a.

    >>> list(pairs([1, 2, 3, 4]))
    [(1, 2), (2, 3), (3, 4)]

    """
    return zip(a[:-1], a[1:])


and then the loop would become:

for i, j in enumerate(pairs(indices)):
    print('{}: {}'.format(i, slice(*j)))


which ought to be clear.

A fancier alternative is to use the (undocumented) numpy.lib.stride_tricks.as_strided to make an array of pairs of adjacent elements without allocating a new array:

from numpy.lib.stride_tricks import as_strided

def pairs(a):
    """Return array of pairs of adjacent elements in a.

    >>> pairs([1, 2, 3, 4])
    array([[1, 2],
           [2, 3],
           [3, 4]])

    """
    a = np.asarray(a)
    return as_strided(a, shape=(a.size - 1, 2), strides=a.strides * 2)


The documented way to make such an array is np.c_[indices[:-1], indices[1:]] but that allocates a new array.

Code Snippets

>>> a = np.array([1, 1, 1, 2, 2, 3])
>>> scipy.ndimage.find_objects(a)
[(slice(0, 3, None),), (slice(3, 5, None),), (slice(5, 6, None),)]
for i in np.arange(indices.size-1):
    print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))
for i, j in enumerate(zip(indices, indices[1:])):
    print('{}: {}'.format(i, slice(*j)))
from future_builtins import zip     # Python 2 only

def pairs(a):
    """Return an iterator over pairs of adjacent elements in a.

    >>> list(pairs([1, 2, 3, 4]))
    [(1, 2), (2, 3), (3, 4)]

    """
    return zip(a[:-1], a[1:])
for i, j in enumerate(pairs(indices)):
    print('{}: {}'.format(i, slice(*j)))

Context

StackExchange Code Review Q#60884, answer score: 12

Revisions (0)

No revisions yet.