patternpythonMinor
Compute stats on two class vectors
Viewed 0 times
vectorscomputestatstwoclass
Problem
Below is a full working code example of code which is used to compute stats on two class vectors (partitions).
There are two functions:
The first one is the function in question and the second is just for completeness, since it's apparently not so time consuming.
The bottleneck probably are the loops (depth is 3) and the logic and look-up stuff inside of
```
import itertools as it
import numpy as np
def pairwise_indication(part1, part2):
r"""
Computes the pairwise indicators.
Parameters
----------
part1, part2 : array, (n)
The two partition vectors
Returns
-------
n11, n00, n10, n01 : tuple (int)
Tuple with the counts of incidences
Examples
--------
>>> from cluvap.calc import pairwise_indication
>>> import numpy as np
>>> p1 = np.array([1, 2, 3, 3, 1, 2])
>>> p2 = np.array([3, 2, 3, 3, 1, 2])
>>> pairwise_indication(p1, p2)
(2, 10, 1, 2)
"""
n = len(part1)
if n != len(part2):
raise ValueError('Partition shapes do not match')
# Create clustering as set
A = to_clust_set(part1)
B = to_clust_set(part2)
observations = np.arange(n)
n11, n00, n10, n01 = 0, 0, 0, 0
# Count incidences
# function: P'QR'S v P'QRS v PQRS v PQR'S (' == not, v == or)
# condensed: P('QR('S v S) v QR('S v S))
# P: obs1 in a
# Q: obs2 in a
# R: obs1 in b
# S: obs2 in b
for obs1, obs2 in it.com
There are two functions:
pairwise_indicationand
to_clust_set.
The first one is the function in question and the second is just for completeness, since it's apparently not so time consuming.
pairwise_indication returns the indicators n11, n00, n10, n01 which are used to compute scores such as the RAND Index (they're named a, b, c, d on that site). The bottleneck probably are the loops (depth is 3) and the logic and look-up stuff inside of
pairwise_indication. I already tried to improve the logical function in it by pulling forward the common cases. I was hoping it can be improved some further. What makes it interminable (if I get the profiling at the bottom of the post correct) are the loops and maybe the set look-ups using in, though I cannot imagine something faster.```
import itertools as it
import numpy as np
def pairwise_indication(part1, part2):
r"""
Computes the pairwise indicators.
Parameters
----------
part1, part2 : array, (n)
The two partition vectors
Returns
-------
n11, n00, n10, n01 : tuple (int)
Tuple with the counts of incidences
Examples
--------
>>> from cluvap.calc import pairwise_indication
>>> import numpy as np
>>> p1 = np.array([1, 2, 3, 3, 1, 2])
>>> p2 = np.array([3, 2, 3, 3, 1, 2])
>>> pairwise_indication(p1, p2)
(2, 10, 1, 2)
"""
n = len(part1)
if n != len(part2):
raise ValueError('Partition shapes do not match')
# Create clustering as set
A = to_clust_set(part1)
B = to_clust_set(part2)
observations = np.arange(n)
n11, n00, n10, n01 = 0, 0, 0, 0
# Count incidences
# function: P'QR'S v P'QRS v PQRS v PQR'S (' == not, v == or)
# condensed: P('QR('S v S) v QR('S v S))
# P: obs1 in a
# Q: obs2 in a
# R: obs1 in b
# S: obs2 in b
for obs1, obs2 in it.com
Solution
If I understood what you want to do, this would be a more direct way to compute the same result. At least for the test cases provided, the result is indeed the same.
On my computer this is 13 times faster than yours.
Here's a completely different approach based on the idea that you can partition the partitions with one another (using
def pairwise_indication(part1, part2):
if len(part1) != len(part2):
raise ValueError('Partition shapes do not match')
n11, n00, n10, n01 = 0, 0, 0, 0
for (a1,a2), (b1,b2) in it.combinations(zip(part1, part2), 2):
if a1 == b1:
if a2 == b2:
n11 += 1
else:
n10 += 1
else:
if a2 == b2:
n01 += 1
else:
n00 += 1
return n11, n00, n10, n01On my computer this is 13 times faster than yours.
Here's a completely different approach based on the idea that you can partition the partitions with one another (using
zip in Python) to obtain a finer partition. (There must be a word for that?) From the size of each subset you can directly calculate the number of pairs that can be formed. Add up the numbers for each partition and subtract the overlap. This is 70 times faster than yours with the given example, and more importantly, operates in linear time, so it scales well to larger data.from collections import Counter
def pairs(n):
'''Calculate number of pairs that can be formed from n items'''
return n * (n - 1) // 2
def partition_pairs(partition):
'''Calculate number of pairs in subsets of partition'''
return sum(pairs(x) for x in Counter(partition).values())
def pairwise_indication(part1, part2):
n = len(part1)
if n != len(part2):
raise ValueError('Partition shapes do not match')
n11 = partition_pairs(zip(part1, part2))
n10 = partition_pairs(part1) - n11
n01 = partition_pairs(part2) - n11
n00 = pairs(n) - n11 - n10 - n01
return n11, n00, n10, n01Code Snippets
def pairwise_indication(part1, part2):
if len(part1) != len(part2):
raise ValueError('Partition shapes do not match')
n11, n00, n10, n01 = 0, 0, 0, 0
for (a1,a2), (b1,b2) in it.combinations(zip(part1, part2), 2):
if a1 == b1:
if a2 == b2:
n11 += 1
else:
n10 += 1
else:
if a2 == b2:
n01 += 1
else:
n00 += 1
return n11, n00, n10, n01from collections import Counter
def pairs(n):
'''Calculate number of pairs that can be formed from n items'''
return n * (n - 1) // 2
def partition_pairs(partition):
'''Calculate number of pairs in subsets of partition'''
return sum(pairs(x) for x in Counter(partition).values())
def pairwise_indication(part1, part2):
n = len(part1)
if n != len(part2):
raise ValueError('Partition shapes do not match')
n11 = partition_pairs(zip(part1, part2))
n10 = partition_pairs(part1) - n11
n01 = partition_pairs(part2) - n11
n00 = pairs(n) - n11 - n10 - n01
return n11, n00, n10, n01Context
StackExchange Code Review Q#45686, answer score: 4
Revisions (0)
No revisions yet.