
Fast comparison of molecular structures and deleting duplicates

Submitted by: @import:stackexchange-codereview

Problem

I have a program that reads in two xyz-files (molecular structures) and compares them by an intramolecular distance measure (dRMSD). A friend told me that my program structure is bad, and as he is a programmer while I am not, I guess he is right. But as he has no time to help me much, I thought maybe someone here could provide me with helpful answers.

What my program does right now:

  • read in molecule 1 (p) and molecule 2 (q)



  • compare them by dRMSD: $$\mathrm{dRMSD}(p,q) = \sqrt{\frac{1}{N(N+1)}\sum_{i<j}\left(\lVert p_i-p_j\rVert-\lVert q_i-q_j\rVert\right)^2}$$



  • delete the file of molecule 1 if dRMSD is below a certain threshold



```
#!/usr/bin/env python

import sys
import math
import os
import numpy as np
from scipy import linalg

def ecd(vec1, vec2):
    "euclidean distance"
    a = np.array(vec1)
    b = np.array(vec2)
    return np.linalg.norm(a-b)

def drmsd(p, q):
    "intra-molecular distance measure"
    a = np.array(p)
    b = np.array(q)
    k = len(p)*(len(p)+1.0)
    total = 0.0
    for i in range(len(p)):
        for j in range(i+1,len(p)):
            total += (ecd(a[i],a[j])-ecd(b[i],b[j]))**2
    return np.sqrt(total/k)

nconf = int(sys.argv[1])

for i in range(1,nconf+1):
    for j in range(i+1,nconf+1):
        fname1 = "h" + str(i) + ".xyz"
        xyz = open(fname1)
        n_atoms = int(xyz.readline())
        coord1 = np.zeros(shape=(n_atoms,3))
        title = xyz.readline()

        cnt = 0
        for line in xyz:
            atom,x,y,z = line.split()
            coord1[cnt] = [float(x), float(y), float(z)]
            cnt += 1
        xyz.close()

        fname2 = "h" + str(j) + ".xyz"
        xyz = open(fname2)
        n_atoms = int(xyz.readline())
        coord2 = np.zeros(shape=(n_atoms,3))
        title = xyz.readline()

        cnt = 0
        for line in xyz:
            atom,x,y,z = line.split()
            coord2[cnt] = [float(x), float(y), float(z)]
            cnt += 1
        xyz.close()

        dist = drmsd(coord1,coord2)
        if dist < 0.0001:
            print i, j, dist, " ... ", fname1
```

Solution

I can probably guess what your friend meant when they said that your code wasn't particularly well-structured: your program is one big double for loop and ought to be broken up into small functions with names that can be understood at a glance (at least by somebody fluent in the application domain). Putting that aside, there are many small things that could be improved:

- Iteration over numpy arrays is slow. While numpy arrays help you write much faster code than regular Python lists, iterating over them element by element is slower than iterating over a plain list. The trick with numpy arrays is to use the functions designed for them whenever possible, and to measure: time your code, convert arrays to Python lists where you really only iterate over them, time again, read the documentation to find the most useful functions in numpy and scipy, then time once more; in the end you will know which solution is fastest. Agreed, that's not the nicest thing in the world to do.

I didn't time your code, but I guess that it would be more performant if you used regular Python lists while you read files, still use Python lists in drmsd since you only iterate over the arrays, but use numpy arrays in ecd since you have to use np.linalg.norm.
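In fact, the double loop in drmsd can be eliminated entirely. This is a sketch (not from the original answer) using scipy.spatial.distance.pdist, which returns all pairwise Euclidean distances of an array in one vectorized call, in the same i<j order as the original loops; the question's N(N+1) normalisation is kept as-is:

```python
import numpy as np
from scipy.spatial.distance import pdist

def drmsd_vectorized(p, q):
    """Intra-molecular distance measure, computed without explicit Python loops.

    pdist returns the "condensed" distance matrix: the N*(N-1)/2 pairwise
    Euclidean distances, ordered exactly like the original i < j double loop,
    so the per-pair differences line up element by element.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    n = len(p)
    diff = pdist(p) - pdist(q)          # per-pair distance differences
    return np.sqrt(np.dot(diff, diff) / (n * (n + 1.0)))
```

This produces the same numbers as the question's drmsd, only much faster for large molecules, since all the work happens inside numpy/scipy rather than in a Python loop.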

- Instead of manually calling close on file objects, use with:

```
with open('some_file') as f:
    # do things with f
# Here, f is automagically closed
```


The file will be closed when you leave the with block. That ensures your file is always closed, even if an exception is thrown while you are working with it, and prevents resource leaks without cumbersome try/except/finally clauses.
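Applied to the question's file-reading code, the repeated block could become a small helper along these lines (the name read_xyz is my choice, not from the original):

```python
import numpy as np

def read_xyz(fname):
    """Read an xyz file: atom count, a title line, then 'atom x y z' lines.

    Returns an (n_atoms, 3) array of coordinates. The file is closed
    automatically when the with block is left, even if parsing fails.
    """
    with open(fname) as xyz:
        n_atoms = int(xyz.readline())
        xyz.readline()                          # skip the title line
        coords = np.zeros((n_atoms, 3))
        for cnt, line in enumerate(xyz):
            atom, x, y, z = line.split()
            coords[cnt] = [float(x), float(y), float(z)]
    return coords
```

Besides the automatic close, this removes the copy-pasted parsing block and the manual cnt counter (enumerate does the counting).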

- Instead of putting your main code directly at module level, try this:

```
if __name__ == '__main__':
    # Here your main code
```


It ensures that your main code only runs when the file containing it is the one launched by the Python interpreter. That way, you can reuse its functions from other modules without risking launching the main code when you don't want to. This feature is also commonly used to provide example code in modules.
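Putting these pieces together, the whole script might be restructured along these lines. This is a sketch, not the original answer's code: the helper names (read_xyz, find_duplicates) are mine, it is written in Python 3 syntax (in 2.7 you would use a print statement and xrange), and it keeps the question's dRMSD formula:

```python
import sys
import numpy as np

def read_xyz(fname):
    """Read an xyz file (atom count, title, then 'atom x y z' lines)."""
    with open(fname) as xyz:
        n_atoms = int(xyz.readline())
        xyz.readline()                          # skip the title line
        coords = np.zeros((n_atoms, 3))
        for cnt, line in enumerate(xyz):
            atom, x, y, z = line.split()
            coords[cnt] = [float(x), float(y), float(z)]
    return coords

def drmsd(p, q):
    """Intra-molecular distance measure, as defined in the question."""
    n = len(p)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (np.linalg.norm(p[i] - p[j])
                      - np.linalg.norm(q[i] - q[j])) ** 2
    return np.sqrt(total / (n * (n + 1.0)))

def find_duplicates(nconf, threshold=1e-4):
    """Yield (i, j, dRMSD) for every conformer pair below the threshold."""
    for i in range(1, nconf + 1):
        coord1 = read_xyz("h%d.xyz" % i)        # read file i once per i
        for j in range(i + 1, nconf + 1):
            coord2 = read_xyz("h%d.xyz" % j)
            dist = drmsd(coord1, coord2)
            if dist < threshold:
                yield i, j, dist

if __name__ == '__main__':
    nconf = int(sys.argv[1])
    for i, j, dist in find_duplicates(nconf):
        print(i, j, dist, "...", "h%d.xyz" % i)
```

Note one incidental gain: the original re-read file i from disk for every j, whereas here it is read once per outer iteration.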

- You seem to be using Python 2.7. If so, you can turn your range calls into xrange. While range produces a full list, xrange produces only a lightweight generator-like object that returns a new integer at each iteration. It is much more memory-efficient than range when all you need is to iterate.
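The difference is easy to see by comparing memory footprints. The snippet below uses Python 3, where xrange was removed and range itself became the lazy object, so it illustrates the same point:

```python
import sys

# A full list must store a million integer references; the lazy range
# object (Python 3's range, behaving like Python 2's xrange) stores only
# its start/stop/step, regardless of length.
eager = list(range(10**6))
lazy = range(10**6)

print(sys.getsizeof(eager))  # several megabytes
print(sys.getsizeof(lazy))   # well under a hundred bytes
```

The loop variables still take the same values either way; only the intermediate storage differs.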


Context

StackExchange Code Review Q#90389, answer score: 5
