HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Clustering nodes with Hamming distance < 3

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
nodeswithdistanceclusteringhamming

Problem

I want to speed up the following code, which is from an algorithm class.

I get a list of 200000 nodes where every node is a tuple of the length of 24 where every item is either a 1 or 0.

These items represent a graph where the distance between them is the hamming distance(number of bits that two nodes are different).

I have to run the union find algorithm on them to union all the nodes with distances 2.

I wrote the code and it runs unfortunately in 3 minutes. After profiling the code I got

143269498 function calls in 185.193 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   47.656   47.656  183.583  183.583 ex2_big.py:18(cluster_alg)
        1    1.090    1.090  185.193  185.193 ex2_big.py:2()
  192670    0.165    0.000    0.165    0.000 ex2_big.py:39(union)
 11247652   13.388    0.000   13.388    0.000 ex2_big.py:49(find)
110327340  120.790    0.000  122.365    0.000 ex2_big.py:5(all_string_with_diff)
 14511524    1.175    0.000    1.175    0.000 ex2_big.py:6()
  5000000    0.355    0.000    0.355    0.000 ex2_big.py:61()
   596364    0.044    0.000    0.044    0.000 {len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.010    0.005    0.010    0.005 {method 'keys' of 'dict' objects}
   200001    0.144    0.000    0.144    0.000 {method 'split' of 'str' objects}
   200001    0.020    0.000    0.020    0.000 {method 'strip' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
   993940    0.355    0.000    0.355    0.000 {range}


The heavy function is all_string_with_diff

The code of this function is very simple:

```
def all_string_with_diff(tup,k):
perms = [(j for j in range(i-1,len(tup)-k+i))for i in range(1,k+1)]
for x in product(*perms):
l = list(tup)
for y in x:
if l[y] =='1':
l[y] = '0'
else:
l[y] = '1'
yield

Solution

Generic comments

Your spacing is very odd and inconsistent, which makes reading the code a bit harder. 2 blank lines before function definition, spaces around operators, and after a coma is what is suggested by PEP 8.

You should also put your top-level code under if __name__ == '__main__' to be able to import your file without anything being executed. And since local symbols are resolved faster in Python, it is also often advised to wrap the top-level code into a main() function and just have

if __name__ == '__main__':
    main()


at the end of the file.

Your naming also involves a bit of abreviations which is not really necessary.

And you also happen to open a file without closeing it. You should use the with statement in conjunction with open to avoid forgetting such things.

Other nitpicks include creating generators that will only iterate over an iterable without modification ((j for j in range(i-1,len(tup)-k+i)) or i for i in line.strip().split(" ")) and using str.strip().split(' ') where str.split() would give similar results.

A first pass on your file would turn it into:

import sys
from itertools import product

def all_string_with_diff(tup, k):
    perms = [range(i - 1, len(tup) - k + i) for i in range(1, k + 1)]
    for x in product(*perms):
        l = list(tup)
        for y in x:
            if l[y] == '1':
                l[y] = '0'
            else:
                l[y] = '1'
        yield tuple(l)

def cluster_alg(nodes, n):
    d = 1
    i = 0
    while d = r2[1]:
        r2[0] = r1[0]
        r1[1] = r1[1] + r2[1]
    else:
        r1[0] = r2[0]
        r2[1] = r2[1] + r1[1]

def find(clusters, u):
    while clusters[u][0] != u:
        u = clusters[u][0]
    return u

def build_nodes(stream):
    nodes = {}
    node_data = next(stream)
    n = int(node_data.split()[0])

    for line in stream:
        arr = tuple(line.split())
        if arr in nodes:
            n -= 1
        else:
            nodes[arr] = [arr, 1]

    return nodes, n

if __name__ == '__main__':
    filename = sys.argv[1]
    with open(filename) as f:
        nodes, n = build_nodes(f)

    print(cluster_alg(nodes, n))


I also changed the way you iterate over the lines in the file to simplify the special case of the first line.

cluster_alg

i is only used to "monitor" that the code is alive and processing stuff. At this point you should be fairly confident that your code produce the desired results and you could remove that bit. Or at least use for i, n1 in enumerate(nodes):.

Same kind of argument for d, you do not need to manage its incrementation yourself. Use for d in range(1, 3): instead of your while.

def cluster_alg(nodes, n):
    for d in range(1, 3):
        for n1 in nodes:
            for n2 in all_string_with_diff(n1, d):
                if n2 in nodes:
                    r1 = find(nodes, n1)
                    r2 = find(nodes, n2)
                    if r1 != r2:
                        union(nodes, r1, r2)
                        n -= 1
    return n


all_string_with_diff

I was first tempted to rewrite perms using:

size = len(tup) - k + 1
perms = [range(i, size + i) for i in range(k)]


as there is no need to have i start at 1 and then consistently substract 1 from it. But there is no real gain beside better readability.

But since you know that you will only call that function with a limited subset of values for k (1 and 2 in your case), why not try to provide an improved version for each value:

def differences_1(tup):
    for x, val in enumerate(tup):
        l = list(tup)
        l[x] = '0' if val == '1' else '1'
        yield tuple(l)

def differences_2(tup):
    for x, y in product(range(len(tup) - 1), range(1, len(tup))):
        l = list(tup)
        l[x] = '0' if l[x] == '1' else '1'
        l[y] = '0' if l[y] == '1' else '1'
        yield tuple(l)

def cluster_alg(nodes, n):
    for differences in (differences_1, differences_2):
        for n1 in nodes:
            for n2 in differences(n1):
                if n2 in nodes:
                    r1 = find(nodes, n1)
                    r2 = find(nodes, n2)
                    if r1 != r2:
                        union(nodes, r1, r2)
                        n -= 1
    return n


On my machine, differences_2 is around 10% faster than all_string_with_diff with k=2 for tuples of 24 elements.

Strings vs tuples

You can gain an additionnal 5% speedup by converting between string and bytearray rather than tuple and list. This will, however, require that you adapt your parsing a bit:

```
import sys
from itertools import product

def differences_1(value):
for x, val in enumerate(value):
l = bytearray(value, 'utf8')
l[x] = 48 if val == '1' else 49 # 48 is ord('0'), 49 is ord('1')
yield l.decode()

def differences_2(value):
for x, y in product(range(len(value) - 1), range(1, len(value))):
l = bytearray(value, 'utf8')

Code Snippets

if __name__ == '__main__':
    main()
import sys
from itertools import product


def all_string_with_diff(tup, k):
    perms = [range(i - 1, len(tup) - k + i) for i in range(1, k + 1)]
    for x in product(*perms):
        l = list(tup)
        for y in x:
            if l[y] == '1':
                l[y] = '0'
            else:
                l[y] = '1'
        yield tuple(l)


def cluster_alg(nodes, n):
    d = 1
    i = 0
    while d < 3:
        for n1 in nodes:
            i = i + 1
            if i%1000 == 0:
                print(d, i)
            for n2 in all_string_with_diff(n1, d):
                if n2 in nodes:
                    r1 = find(nodes, n1)
                    r2 = find(nodes, n2)
                    if r1 != r2:
                        union(nodes, r1, r2)
                        n = n - 1
        d = d + 1
    return n


def union(clusters, r1, r2):
    r1 = clusters[r1]
    r2 = clusters[r2]
    if r1[1] >= r2[1]:
        r2[0] = r1[0]
        r1[1] = r1[1] + r2[1]
    else:
        r1[0] = r2[0]
        r2[1] = r2[1] + r1[1]


def find(clusters, u):
    while clusters[u][0] != u:
        u = clusters[u][0]
    return u


def build_nodes(stream):
    nodes = {}
    node_data = next(stream)
    n = int(node_data.split()[0])

    for line in stream:
        arr = tuple(line.split())
        if arr in nodes:
            n -= 1
        else:
            nodes[arr] = [arr, 1]

    return nodes, n


if __name__ == '__main__':
    filename = sys.argv[1]
    with open(filename) as f:
        nodes, n = build_nodes(f)

    print(cluster_alg(nodes, n))
def cluster_alg(nodes, n):
    for d in range(1, 3):
        for n1 in nodes:
            for n2 in all_string_with_diff(n1, d):
                if n2 in nodes:
                    r1 = find(nodes, n1)
                    r2 = find(nodes, n2)
                    if r1 != r2:
                        union(nodes, r1, r2)
                        n -= 1
    return n
size = len(tup) - k + 1
perms = [range(i, size + i) for i in range(k)]
def differences_1(tup):
    for x, val in enumerate(tup):
        l = list(tup)
        l[x] = '0' if val == '1' else '1'
        yield tuple(l)


def differences_2(tup):
    for x, y in product(range(len(tup) - 1), range(1, len(tup))):
        l = list(tup)
        l[x] = '0' if l[x] == '1' else '1'
        l[y] = '0' if l[y] == '1' else '1'
        yield tuple(l)


def cluster_alg(nodes, n):
    for differences in (differences_1, differences_2):
        for n1 in nodes:
            for n2 in differences(n1):
                if n2 in nodes:
                    r1 = find(nodes, n1)
                    r2 = find(nodes, n2)
                    if r1 != r2:
                        union(nodes, r1, r2)
                        n -= 1
    return n

Context

StackExchange Code Review Q#133260, answer score: 3

Revisions (0)

No revisions yet.