patternpythonMinor
Reading an Excel file and comparing the amino acid sequence of each data pair
Viewed 0 times
readingfilethecomparingexcelacideachsequenceanddata
Problem
Since I am fairly new to Python I was wondering whether anyone can help me by making the code more efficient. I know the output stinks; I will be using Pandas to make this a little nicer.
```
from xlrd import *
def main():
'''This Proram reads input (A:clone name, B:sequence, C:elisa) from an
Excel file and makes a cross comparison of each sequence pair'''
book = open_workbook("mtask.xlsx")
Input = book.sheet_by_index(0)
# naming of input data
a = (Input.col_values(0,0))
b = (Input.col_values(1,0))
c = (Input.col_values(2,0))
# make dictionary: keys are seq numbers; values are residues
y = {}
for i in range(Input.nrows):
x = []
for j in b[i]:
x.append(j)
y[a[i]] = x
# comparison of sequences and extraction of mutations for each sequence pair
List = []
for shit in range(Input.nrows):
for seq in range(Input.nrows):
seq12 = []
z = 0
for i in y[a[seq]]:
try:
for j in y[a[shit]][z]:
if i == j:
seq12.append(i.lower()+j.lower())
else:
seq12.append(i+j)
z = z+1
except IndexError:
print("oops")
lib = [a[seq],a[shit],c[seq],c[shit]]
for position, item in enumerate(seq12):
if item.isupper():
x = (str(item[0])+str(position+1)+str(item[1]))
lib.append(x)
List.append(lib)
# comparison of sequences and extraction of mutations for each sequence pair
dic = {}
for i in range(Input.nrows*Input.nrows):
x = []
for j in List[i]:
x.append(j)
dic[i] = x
# sort
a = []
for i in dic.values():
a.append(i)
# collect number of mutations in data files
import csv
null = []
one = []
```
from xlrd import *
def main():
'''This Proram reads input (A:clone name, B:sequence, C:elisa) from an
Excel file and makes a cross comparison of each sequence pair'''
book = open_workbook("mtask.xlsx")
Input = book.sheet_by_index(0)
# naming of input data
a = (Input.col_values(0,0))
b = (Input.col_values(1,0))
c = (Input.col_values(2,0))
# make dictionary: keys are seq numbers; values are residues
y = {}
for i in range(Input.nrows):
x = []
for j in b[i]:
x.append(j)
y[a[i]] = x
# comparison of sequences and extraction of mutations for each sequence pair
List = []
for shit in range(Input.nrows):
for seq in range(Input.nrows):
seq12 = []
z = 0
for i in y[a[seq]]:
try:
for j in y[a[shit]][z]:
if i == j:
seq12.append(i.lower()+j.lower())
else:
seq12.append(i+j)
z = z+1
except IndexError:
print("oops")
lib = [a[seq],a[shit],c[seq],c[shit]]
for position, item in enumerate(seq12):
if item.isupper():
x = (str(item[0])+str(position+1)+str(item[1]))
lib.append(x)
List.append(lib)
# comparison of sequences and extraction of mutations for each sequence pair
dic = {}
for i in range(Input.nrows*Input.nrows):
x = []
for j in List[i]:
x.append(j)
dic[i] = x
# sort
a = []
for i in dic.values():
a.append(i)
# collect number of mutations in data files
import csv
null = []
one = []
Solution
Follow PEP8
PEP8 is the official coding style guide of Python.
You have several notable violations:
-
Avoid wildcard imports like
-
Use
These are just examples, there are many more PEP8 violations.
Use the
to detect all the PEP8 violations in your source files.
Although it's not a PEP8 violation,
but for better readability I recommend extra spaces around operators,
and to remove unnecessary parentheses. So for example:
Use list comprehensions
List comprehensions are a powerful and elegant sexy feature of Python.
For example instead of this:
You can write simply:
But actually, if you just want to clone a list, this is the best:
Cleaning up the csv writing
This of course is a complete mess:
The code duplication might actually be the smaller of two big problems.
The biggest problem is that each file is potentially rewritten multiple times.
I don't know what kind of data you have,
but for example if you have 10 rows with
then the file
This is crazy.
You should rework this to write each file only once.
Here's an untested approach that I think should work,
and solves both the multiple writing and the code duplication:
PEP8 is the official coding style guide of Python.
You have several notable violations:
-
Avoid wildcard imports like
from xlrd import *. It makes it hard to tell which names in the code come from xlrd. frim xlrd import open_workbook would be much better and easier to understand.-
Use
snake_case for variable names. The variables Input and List violate this, but I admit these are a bit tricky because input and list could shadow builtin names, so you need to find something better instead. Instead of Input, sheet seems a clear choice, and instead of List, something that describes what kind of list it is would be good, for example sequence_pairsThese are just examples, there are many more PEP8 violations.
Use the
pep8 command line utility (you can install with pip install --user pep8)to detect all the PEP8 violations in your source files.
Although it's not a PEP8 violation,
but for better readability I recommend extra spaces around operators,
and to remove unnecessary parentheses. So for example:
- Instead of:
x = (str(item[0])+str(position+1)+str(item[1]))
- Write like this:
x = (str(item[0]) + str(position+1) + str(item[1]))
Use list comprehensions
List comprehensions are a powerful and elegant sexy feature of Python.
For example instead of this:
x = []
for j in b[i]:
x.append(j)
y[a[i]] = xYou can write simply:
y[a[i]] = [j for j in b[i]]But actually, if you just want to clone a list, this is the best:
y[a[i]] = b[i][:]Cleaning up the csv writing
This of course is a complete mess:
null = []
one = []
two = []
# ...
for i in range(Input.nrows*Input.nrows):
if len(a[i]) <= 4:
null.append(a[i])
with open("no_mut.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(null)
elif len(a[i]) == 5:
one.append(a[i])
with open("one.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(one)
elif len(a[i]) == 6:
# ...The code duplication might actually be the smaller of two big problems.
The biggest problem is that each file is potentially rewritten multiple times.
I don't know what kind of data you have,
but for example if you have 10 rows with
len(a[i]) == 5,then the file
one.csv will be first written with 1 line, then rewritten with 2 lines, 3 lines, ..., in the end 10 lines.This is crazy.
You should rework this to write each file only once.
Here's an untested approach that I think should work,
and solves both the multiple writing and the code duplication:
files = (
(lambda length: length <= 4, "no_mut.csv", []),
(lambda length: length == 5, "one.csv", []),
(lambda length: length == 6, "two.csv", []),
(lambda length: length == 7, "three.csv", []),
# ... and so on
)
for i in range(sheet.nrows * sheet.nrows):
for matcher, _, rows in files:
if matcher(a[i]):
rows.append(a[i])
break
for _, filename, rows in files:
with open(filename, 'w', newline='') as fh:
writer = csv.writer(fh)
writer.writerows(rows)Code Snippets
x = []
for j in b[i]:
x.append(j)
y[a[i]] = xy[a[i]] = [j for j in b[i]]y[a[i]] = b[i][:]null = []
one = []
two = []
# ...
for i in range(Input.nrows*Input.nrows):
if len(a[i]) <= 4:
null.append(a[i])
with open("no_mut.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(null)
elif len(a[i]) == 5:
one.append(a[i])
with open("one.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(one)
elif len(a[i]) == 6:
# ...files = (
(lambda length: length <= 4, "no_mut.csv", []),
(lambda length: length == 5, "one.csv", []),
(lambda length: length == 6, "two.csv", []),
(lambda length: length == 7, "three.csv", []),
# ... and so on
)
for i in range(sheet.nrows * sheet.nrows):
for matcher, _, rows in files:
if matcher(a[i]):
rows.append(a[i])
break
for _, filename, rows in files:
with open(filename, 'w', newline='') as fh:
writer = csv.writer(fh)
writer.writerows(rows)Context
StackExchange Code Review Q#82261, answer score: 2
Revisions (0)
No revisions yet.