patternpythonModerate
Groupby in NumPy
Viewed 0 times
groupbynumpystackoverflow
Problem
To avoid the XY problem, here is an example of what I need. Given the sorted integer input array
(Both the index and the slice will be used to index/slice into some other arrays.)
I think this could be solved with the itertools.groupby(). Partly because I am learning NumPy, and partly because evidence suggests that
Is there a cleaner / more efficient way of doing this?
[1 1 1 2 2 3], I would like to produce the following slices, together with their index: 0: slice(0,3)
1: slice(3,5)
2: slice(5,6)(Both the index and the slice will be used to index/slice into some other arrays.)
I think this could be solved with the itertools.groupby(). Partly because I am learning NumPy, and partly because evidence suggests that
groupby() may be inefficient, I would like to do this in NumPy. The code below seems to do what I want.from __future__ import print_function
import numpy as np
if __name__=='__main__':
arr = np.array([1, 1, 1, 2, 2, 3])
print(arr)
mask = np.ediff1d(arr, to_begin=1, to_end=1)
indices = np.where(mask)[0]
for i in np.arange(indices.size-1):
print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))Is there a cleaner / more efficient way of doing this?
Solution
The function
It's hard to make a fair timing comparison without knowing exactly how you are going to use these slices, but a quick test shows that this is about twice as fast as your strategy of using
A couple of general points on your code:
-
You use
-
But having made that change, you could use
-
It's pointless to create an array if you are only going to iterate over it. Better to use an iterator. So instead of:
you can write something like:
In response to comment: If you find the
and then the loop would become:
which ought to be clear.
A fancier alternative is to use the (undocumented)
The documented way to make such an array is
scipy.ndimage.find_objects returns exactly the slices that you are looking for:>>> a = np.array([1, 1, 1, 2, 2, 3])
>>> scipy.ndimage.find_objects(a)
[(slice(0, 3, None),), (slice(3, 5, None),), (slice(5, 6, None),)]It's hard to make a fair timing comparison without knowing exactly how you are going to use these slices, but a quick test shows that this is about twice as fast as your strategy of using
np.ediff2d followed by np.where.A couple of general points on your code:
-
You use
numpy.where with a single argument. This is the same as numpy.nonzero and it would be clearer in this case to use the latter.-
But having made that change, you could use
numpy.flatnonzero and so avoid the [0].-
It's pointless to create an array if you are only going to iterate over it. Better to use an iterator. So instead of:
for i in np.arange(indices.size-1):
print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))you can write something like:
for i, j in enumerate(zip(indices, indices[1:])):
print('{}: {}'.format(i, slice(*j)))In response to comment: If you find the
zip idiom unclear then you could encapsulate it:from future_builtins import zip # Python 2 only
def pairs(a):
"""Return an iterator over pairs of adjacent elements in a.
>>> list(pairs([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]
"""
return zip(a[:-1], a[1:])and then the loop would become:
for i, j in enumerate(pairs(indices)):
print('{}: {}'.format(i, slice(*j)))which ought to be clear.
A fancier alternative is to use the (undocumented)
numpy.lib.stride_tricks.as_strided to make an array of pairs of adjacent elements without allocating a new array:from numpy.lib.stride_tricks import as_strided
def pairs(a):
"""Return array of pairs of adjacent elements in a.
>>> pairs([1, 2, 3, 4])
array([[1, 2],
[2, 3],
[3, 4]])
"""
a = np.asarray(a)
return as_strided(a, shape=(a.size - 1, 2), strides=a.strides * 2)The documented way to make such an array is
np.c_[indices[:-1], indices[1:]] but that allocates a new array.Code Snippets
>>> a = np.array([1, 1, 1, 2, 2, 3])
>>> scipy.ndimage.find_objects(a)
[(slice(0, 3, None),), (slice(3, 5, None),), (slice(5, 6, None),)]for i in np.arange(indices.size-1):
print('%d: slice(%d,%d)' % (i, indices[i], indices[i+1]))for i, j in enumerate(zip(indices, indices[1:])):
print('{}: {}'.format(i, slice(*j)))from future_builtins import zip # Python 2 only
def pairs(a):
"""Return an iterator over pairs of adjacent elements in a.
>>> list(pairs([1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]
"""
return zip(a[:-1], a[1:])for i, j in enumerate(pairs(indices)):
print('{}: {}'.format(i, slice(*j)))Context
StackExchange Code Review Q#60884, answer score: 12
Revisions (0)
No revisions yet.