HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Integration of data using Simpson's rule

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
ruleusingintegrationdatasimpson

Problem

I wrote a Python program that performs an integration of data using Simpson's rule. The program takes the areas to be integrated from a foo.ref file, with the following syntax:

# peak          m/z             charge  window
LAG_H4N4F1S1    4244.829467     1+      0.06
LAG_H4N4F1S1    4245.832508     1+      0.06
LAG_H4N4F1S1    4246.835549     1+      0.06
LAG_H4N4F1S1    4247.838590     1+      0.06


The data that the program parses is taken from all the .xy files in the current directory, with the following syntax:

3497.7552 0
3497.7619 410646
3497.7685 397637
3497.7752 363672
3497.7819 321723
3497.7886 286960


The program's final output is a tab seperated file that can be opened directly in excel, the performance is ... a lot worse than I was expecting however and I was wondering if anyone could offer some tricks to increase that (and also to have a general look at the code, if it is correct Python).

```
#! /usr/bin/env python
import glob
import os
import fileinput
print " a script to quickly calculate the area for a set"
print "of analytes. Written by Bas Jansen, Leiden University Medical"
print "Center, March 20th 2013."
print ""

# Simpson's rule
def integrate(y_vals,h):
i=1
total=y_vals[0]+y_vals[-1]
for y in y_vals[1:-1]:
if i%2 == 0:
total+=2*y
else:
total+=4*y
i+=1
return total*(h/3.0)

# Reads the reference file
ref=[]
for refs in glob.glob("*.ref"):
fr=open(refs)
for line in fr.readlines():
parts=line.rstrip('\n').split('\t')
ref.append(parts)
fr.close()

# Initialize the output file
f=open('output.xls','w')
for i in range(1,len(ref)):
f.write(ref[i][0]+"\n")
f.close()

# Find XY files in current directory
xy_files=[]
for files in glob.glob("*.xy"):
xy_files.append(files)
x=[]
area=[]
print "Parsing: "+files
fh=open(files)
for line in fh.readlines():
y=[float(i) for i in line.split()]
x.append(y)
fh.close()

# Pick the XY window
print "Progress indicator:"
prin

Solution

#! /usr/bin/env python
import glob
import os
import fileinput
print " a script to quickly calculate the area for a set"
print "of analytes. Written by Bas Jansen, Leiden University Medical"
print "Center, March 20th 2013."
print ""

# Simpson's rule
def integrate(y_vals,h):


The recommendation for python is four spaces for indentation, you appear to be using 2

i=1
  total=y_vals[0]+y_vals[-1]


It's also recommended to have spaces around binary operators

for y in y_vals[1:-1]:


Use for i, y in enumerate(y_vals, 1): rather then counting your own i value

if i%2 == 0:
      total+=2*y
    else:
      total+=4*y
    i+=1
  return total*(h/3.0)

# Reads the reference file
ref=[]


I'd recommended not using an abbreviation her

for refs in glob.glob("*.ref"):
  fr=open(refs)


It's recommended to use a with block:

with open(refs) as fr:
      do stuff with file
  file will be automatically closed here


That ensures your file will be closed even if an exception occours

for line in fr.readlines():
    parts=line.rstrip('\n').split('\t')
    ref.append(parts)


You shouldn't really store the header ref here.

fr.close()

# Initialize the output file
f=open('output.xls','w')
for i in range(1,len(ref)):
  f.write(ref[i][0]+"\n")
f.close()

# Find XY files in current directory
xy_files=[]
for files in glob.glob("*.xy"):
  xy_files.append(files)
  x=[]
  area=[]
  print "Parsing: "+files
  fh=open(files)
  for line in fh.readlines():


Actually this is the same as for line in fh:

y=[float(i) for i in line.split()]
    x.append(y)
  fh.close()

  # Pick the XY window
  print "Progress indicator:"
  print "0%"
  for i in range(1,len(ref)):


Whenever possible, don't iterate over indexes. Instead iterate over the lists. So something like for line in ref[1:]: If you really need the indexes us enumerate

obs=[]
    vals=0
    low=float(ref[i][1])-float(ref[i][3])
    high=float(ref[i][1])+float(ref[i][3])
    # Progress indicator
    if i %(len(ref)/10) == 0:
      print str(i/(len(ref)/10))+"0%"


Why divide by 10 just to add the 0 back?

for j in range(0,len(x)):


Use for j in x: and then avoid the indexing

if x[j][0] > low and x[j][0] < high:
        coords=[float(l) for l in x[j]]
        obs.append(coords)
        vals+=1


What's this for?

if obs:
      y_vals=[]
      h=(obs[len(obs)-1][0]-obs[0][0])/len(obs)
      for k in range(0,len(obs)):
        y_vals.append(obs[k][1])
      area.append(integrate(y_vals,h))

  # Neat trick to append the area from other input files
  l = 0
  for line in fileinput.input('output.xls',inplace=1):


If you need to iterate over multiple sources at once, use a zip.

print "%s\t%s\n" % (line.rstrip(),area[l]),
    l+=1


Your neat trick is what's killing you. This is designed for processing the output of some other program. Its not good for continually adding to the end of lines you have already written. What you should do is store everything in a big list and then output it at the end.

# Display the legend (& filenames) at bottom of output file
f=open('output.xls','a')
legend = "Peak"
for m in range(0,len(xy_files)):
  legend+="\t"+str(xy_files[m])
legend+="\n"
f.write(legend)
f.close()

# Being polite
raw_input("Finished, press any key to exit")

Code Snippets

#! /usr/bin/env python
import glob
import os
import fileinput
print "<SimpInt.py> a script to quickly calculate the area for a set"
print "of analytes. Written by Bas Jansen, Leiden University Medical"
print "Center, March 20th 2013."
print ""

# Simpson's rule
def integrate(y_vals,h):
i=1
  total=y_vals[0]+y_vals[-1]
for y in y_vals[1:-1]:
if i%2 == 0:
      total+=2*y
    else:
      total+=4*y
    i+=1
  return total*(h/3.0)

# Reads the reference file
ref=[]
for refs in glob.glob("*.ref"):
  fr=open(refs)

Context

StackExchange Code Review Q#24160, answer score: 2

Revisions (0)

No revisions yet.