patternpythonMinor
Clustering nodes with Hamming distance < 3
Viewed 0 times
nodeswithdistanceclusteringhamming
Problem
I want to speed up the following code, which is from an algorithm class.
I get a list of 200000 nodes where every node is a tuple of the length of 24 where every item is either a 1 or 0.
These items represent a graph where the distance between them is the hamming distance(number of bits that two nodes are different).
I have to run the union find algorithm on them to union all the nodes with distances 2.
I wrote the code and it runs unfortunately in 3 minutes. After profiling the code I got
The heavy function is all_string_with_diff
The code of this function is very simple:
```
def all_string_with_diff(tup,k):
perms = [(j for j in range(i-1,len(tup)-k+i))for i in range(1,k+1)]
for x in product(*perms):
l = list(tup)
for y in x:
if l[y] =='1':
l[y] = '0'
else:
l[y] = '1'
yield
I get a list of 200000 nodes where every node is a tuple of the length of 24 where every item is either a 1 or 0.
These items represent a graph where the distance between them is the hamming distance(number of bits that two nodes are different).
I have to run the union find algorithm on them to union all the nodes with distances 2.
I wrote the code and it runs unfortunately in 3 minutes. After profiling the code I got
143269498 function calls in 185.193 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 47.656 47.656 183.583 183.583 ex2_big.py:18(cluster_alg)
1 1.090 1.090 185.193 185.193 ex2_big.py:2()
192670 0.165 0.000 0.165 0.000 ex2_big.py:39(union)
11247652 13.388 0.000 13.388 0.000 ex2_big.py:49(find)
110327340 120.790 0.000 122.365 0.000 ex2_big.py:5(all_string_with_diff)
14511524 1.175 0.000 1.175 0.000 ex2_big.py:6()
5000000 0.355 0.000 0.355 0.000 ex2_big.py:61()
596364 0.044 0.000 0.044 0.000 {len}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
2 0.010 0.005 0.010 0.005 {method 'keys' of 'dict' objects}
200001 0.144 0.000 0.144 0.000 {method 'split' of 'str' objects}
200001 0.020 0.000 0.020 0.000 {method 'strip' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
993940 0.355 0.000 0.355 0.000 {range}The heavy function is all_string_with_diff
The code of this function is very simple:
```
def all_string_with_diff(tup,k):
perms = [(j for j in range(i-1,len(tup)-k+i))for i in range(1,k+1)]
for x in product(*perms):
l = list(tup)
for y in x:
if l[y] =='1':
l[y] = '0'
else:
l[y] = '1'
yield
Solution
Generic comments
Your spacing is very odd and inconsistent, which makes reading the code a bit harder. 2 blank lines before function definition, spaces around operators, and after a coma is what is suggested by PEP 8.
You should also put your top-level code under
at the end of the file.
Your naming also involves a bit of abreviations which is not really necessary.
And you also happen to
Other nitpicks include creating generators that will only iterate over an iterable without modification (
A first pass on your file would turn it into:
I also changed the way you iterate over the lines in the file to simplify the special case of the first line.
cluster_alg
Same kind of argument for
all_string_with_diff
I was first tempted to rewrite
as there is no need to have
But since you know that you will only call that function with a limited subset of values for
On my machine,
Strings vs tuples
You can gain an additionnal 5% speedup by converting between
```
import sys
from itertools import product
def differences_1(value):
for x, val in enumerate(value):
l = bytearray(value, 'utf8')
l[x] = 48 if val == '1' else 49 # 48 is ord('0'), 49 is ord('1')
yield l.decode()
def differences_2(value):
for x, y in product(range(len(value) - 1), range(1, len(value))):
l = bytearray(value, 'utf8')
Your spacing is very odd and inconsistent, which makes reading the code a bit harder. 2 blank lines before function definition, spaces around operators, and after a coma is what is suggested by PEP 8.
You should also put your top-level code under
if __name__ == '__main__' to be able to import your file without anything being executed. And since local symbols are resolved faster in Python, it is also often advised to wrap the top-level code into a main() function and just haveif __name__ == '__main__':
main()at the end of the file.
Your naming also involves a bit of abreviations which is not really necessary.
And you also happen to
open a file without closeing it. You should use the with statement in conjunction with open to avoid forgetting such things.Other nitpicks include creating generators that will only iterate over an iterable without modification (
(j for j in range(i-1,len(tup)-k+i)) or i for i in line.strip().split(" ")) and using str.strip().split(' ') where str.split() would give similar results.A first pass on your file would turn it into:
import sys
from itertools import product
def all_string_with_diff(tup, k):
perms = [range(i - 1, len(tup) - k + i) for i in range(1, k + 1)]
for x in product(*perms):
l = list(tup)
for y in x:
if l[y] == '1':
l[y] = '0'
else:
l[y] = '1'
yield tuple(l)
def cluster_alg(nodes, n):
d = 1
i = 0
while d = r2[1]:
r2[0] = r1[0]
r1[1] = r1[1] + r2[1]
else:
r1[0] = r2[0]
r2[1] = r2[1] + r1[1]
def find(clusters, u):
while clusters[u][0] != u:
u = clusters[u][0]
return u
def build_nodes(stream):
nodes = {}
node_data = next(stream)
n = int(node_data.split()[0])
for line in stream:
arr = tuple(line.split())
if arr in nodes:
n -= 1
else:
nodes[arr] = [arr, 1]
return nodes, n
if __name__ == '__main__':
filename = sys.argv[1]
with open(filename) as f:
nodes, n = build_nodes(f)
print(cluster_alg(nodes, n))I also changed the way you iterate over the lines in the file to simplify the special case of the first line.
cluster_alg
i is only used to "monitor" that the code is alive and processing stuff. At this point you should be fairly confident that your code produce the desired results and you could remove that bit. Or at least use for i, n1 in enumerate(nodes):.Same kind of argument for
d, you do not need to manage its incrementation yourself. Use for d in range(1, 3): instead of your while.def cluster_alg(nodes, n):
for d in range(1, 3):
for n1 in nodes:
for n2 in all_string_with_diff(n1, d):
if n2 in nodes:
r1 = find(nodes, n1)
r2 = find(nodes, n2)
if r1 != r2:
union(nodes, r1, r2)
n -= 1
return nall_string_with_diff
I was first tempted to rewrite
perms using:size = len(tup) - k + 1
perms = [range(i, size + i) for i in range(k)]as there is no need to have
i start at 1 and then consistently substract 1 from it. But there is no real gain beside better readability.But since you know that you will only call that function with a limited subset of values for
k (1 and 2 in your case), why not try to provide an improved version for each value:def differences_1(tup):
for x, val in enumerate(tup):
l = list(tup)
l[x] = '0' if val == '1' else '1'
yield tuple(l)
def differences_2(tup):
for x, y in product(range(len(tup) - 1), range(1, len(tup))):
l = list(tup)
l[x] = '0' if l[x] == '1' else '1'
l[y] = '0' if l[y] == '1' else '1'
yield tuple(l)
def cluster_alg(nodes, n):
for differences in (differences_1, differences_2):
for n1 in nodes:
for n2 in differences(n1):
if n2 in nodes:
r1 = find(nodes, n1)
r2 = find(nodes, n2)
if r1 != r2:
union(nodes, r1, r2)
n -= 1
return nOn my machine,
differences_2 is around 10% faster than all_string_with_diff with k=2 for tuples of 24 elements.Strings vs tuples
You can gain an additionnal 5% speedup by converting between
string and bytearray rather than tuple and list. This will, however, require that you adapt your parsing a bit:```
import sys
from itertools import product
def differences_1(value):
for x, val in enumerate(value):
l = bytearray(value, 'utf8')
l[x] = 48 if val == '1' else 49 # 48 is ord('0'), 49 is ord('1')
yield l.decode()
def differences_2(value):
for x, y in product(range(len(value) - 1), range(1, len(value))):
l = bytearray(value, 'utf8')
Code Snippets
if __name__ == '__main__':
main()import sys
from itertools import product
def all_string_with_diff(tup, k):
perms = [range(i - 1, len(tup) - k + i) for i in range(1, k + 1)]
for x in product(*perms):
l = list(tup)
for y in x:
if l[y] == '1':
l[y] = '0'
else:
l[y] = '1'
yield tuple(l)
def cluster_alg(nodes, n):
d = 1
i = 0
while d < 3:
for n1 in nodes:
i = i + 1
if i%1000 == 0:
print(d, i)
for n2 in all_string_with_diff(n1, d):
if n2 in nodes:
r1 = find(nodes, n1)
r2 = find(nodes, n2)
if r1 != r2:
union(nodes, r1, r2)
n = n - 1
d = d + 1
return n
def union(clusters, r1, r2):
r1 = clusters[r1]
r2 = clusters[r2]
if r1[1] >= r2[1]:
r2[0] = r1[0]
r1[1] = r1[1] + r2[1]
else:
r1[0] = r2[0]
r2[1] = r2[1] + r1[1]
def find(clusters, u):
while clusters[u][0] != u:
u = clusters[u][0]
return u
def build_nodes(stream):
nodes = {}
node_data = next(stream)
n = int(node_data.split()[0])
for line in stream:
arr = tuple(line.split())
if arr in nodes:
n -= 1
else:
nodes[arr] = [arr, 1]
return nodes, n
if __name__ == '__main__':
filename = sys.argv[1]
with open(filename) as f:
nodes, n = build_nodes(f)
print(cluster_alg(nodes, n))def cluster_alg(nodes, n):
for d in range(1, 3):
for n1 in nodes:
for n2 in all_string_with_diff(n1, d):
if n2 in nodes:
r1 = find(nodes, n1)
r2 = find(nodes, n2)
if r1 != r2:
union(nodes, r1, r2)
n -= 1
return nsize = len(tup) - k + 1
perms = [range(i, size + i) for i in range(k)]def differences_1(tup):
for x, val in enumerate(tup):
l = list(tup)
l[x] = '0' if val == '1' else '1'
yield tuple(l)
def differences_2(tup):
for x, y in product(range(len(tup) - 1), range(1, len(tup))):
l = list(tup)
l[x] = '0' if l[x] == '1' else '1'
l[y] = '0' if l[y] == '1' else '1'
yield tuple(l)
def cluster_alg(nodes, n):
for differences in (differences_1, differences_2):
for n1 in nodes:
for n2 in differences(n1):
if n2 in nodes:
r1 = find(nodes, n1)
r2 = find(nodes, n2)
if r1 != r2:
union(nodes, r1, r2)
n -= 1
return nContext
StackExchange Code Review Q#133260, answer score: 3
Revisions (0)
No revisions yet.