HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Compute stats on two class vectors

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
vectorscomputestatstwoclass

Problem

Below is a full working code example of code which is used to compute stats on two class vectors (partitions).

There are two functions:

  • pairwise_indication and



  • to_clust_set.



The first one is the function in question and the second is just for completeness, since it's apparently not so time consuming. pairwise_indication returns the indicators n11, n00, n10, n01 which are used to compute scores such as the RAND Index (they're named a, b, c, d on that site).

The bottleneck probably are the loops (depth is 3) and the logic and look-up stuff inside of pairwise_indication. I already tried to improve the logical function in it by pulling forward the common cases. I was hoping it can be improved some further. What makes it interminable (if I get the profiling at the bottom of the post correct) are the loops and maybe the set look-ups using in, though I cannot imagine something faster.

```
import itertools as it
import numpy as np

def pairwise_indication(part1, part2):
r"""
Computes the pairwise indicators.

Parameters
----------
part1, part2 : array, (n)
The two partition vectors

Returns
-------
n11, n00, n10, n01 : tuple (int)
Tuple with the counts of incidences

Examples
--------
>>> from cluvap.calc import pairwise_indication
>>> import numpy as np
>>> p1 = np.array([1, 2, 3, 3, 1, 2])
>>> p2 = np.array([3, 2, 3, 3, 1, 2])
>>> pairwise_indication(p1, p2)
(2, 10, 1, 2)
"""
n = len(part1)
if n != len(part2):
raise ValueError('Partition shapes do not match')
# Create clustering as set
A = to_clust_set(part1)
B = to_clust_set(part2)
observations = np.arange(n)
n11, n00, n10, n01 = 0, 0, 0, 0
# Count incidences
# function: P'QR'S v P'QRS v PQRS v PQR'S (' == not, v == or)
# condensed: P('QR('S v S) v QR('S v S))
# P: obs1 in a
# Q: obs2 in a
# R: obs1 in b
# S: obs2 in b
for obs1, obs2 in it.com

Solution

If I understood what you want to do, this would be a more direct way to compute the same result. At least for the test cases provided, the result is indeed the same.

def pairwise_indication(part1, part2):
    if len(part1) != len(part2):
        raise ValueError('Partition shapes do not match')
    n11, n00, n10, n01 = 0, 0, 0, 0
    for (a1,a2), (b1,b2) in it.combinations(zip(part1, part2), 2):
        if a1 == b1:
            if a2 == b2:
                n11 += 1
            else:
                n10 += 1
        else:
            if a2 == b2:
                n01 += 1
            else:
                n00 += 1
    return n11, n00, n10, n01


On my computer this is 13 times faster than yours.

Here's a completely different approach based on the idea that you can partition the partitions with one another (using zip in Python) to obtain a finer partition. (There must be a word for that?) From the size of each subset you can directly calculate the number of pairs that can be formed. Add up the numbers for each partition and subtract the overlap. This is 70 times faster than yours with the given example, and more importantly, operates in linear time, so it scales well to larger data.

from collections import Counter

def pairs(n):
    '''Calculate number of pairs that can be formed from n items'''
    return n * (n - 1) // 2

def partition_pairs(partition):
    '''Calculate number of pairs in subsets of partition'''
    return sum(pairs(x) for x in Counter(partition).values())

def pairwise_indication(part1, part2):
    n = len(part1)
    if n != len(part2):
        raise ValueError('Partition shapes do not match')

    n11 = partition_pairs(zip(part1, part2))
    n10 = partition_pairs(part1) - n11
    n01 = partition_pairs(part2) - n11
    n00 = pairs(n) - n11 - n10 - n01

    return n11, n00, n10, n01

Code Snippets

def pairwise_indication(part1, part2):
    if len(part1) != len(part2):
        raise ValueError('Partition shapes do not match')
    n11, n00, n10, n01 = 0, 0, 0, 0
    for (a1,a2), (b1,b2) in it.combinations(zip(part1, part2), 2):
        if a1 == b1:
            if a2 == b2:
                n11 += 1
            else:
                n10 += 1
        else:
            if a2 == b2:
                n01 += 1
            else:
                n00 += 1
    return n11, n00, n10, n01
from collections import Counter

def pairs(n):
    '''Calculate number of pairs that can be formed from n items'''
    return n * (n - 1) // 2

def partition_pairs(partition):
    '''Calculate number of pairs in subsets of partition'''
    return sum(pairs(x) for x in Counter(partition).values())

def pairwise_indication(part1, part2):
    n = len(part1)
    if n != len(part2):
        raise ValueError('Partition shapes do not match')

    n11 = partition_pairs(zip(part1, part2))
    n10 = partition_pairs(part1) - n11
    n01 = partition_pairs(part2) - n11
    n00 = pairs(n) - n11 - n10 - n01

    return n11, n00, n10, n01

Context

StackExchange Code Review Q#45686, answer score: 4

Revisions (0)

No revisions yet.