HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Reading an Excel file and comparing the amino acid sequence of each data pair

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
readingfilethecomparingexcelacideachsequenceanddata

Problem

Since I am fairly new to Python I was wondering whether anyone can help me by making the code more efficient. I know the output stinks; I will be using Pandas to make this a little nicer.

```
from xlrd import *
def main():
'''This Proram reads input (A:clone name, B:sequence, C:elisa) from an
Excel file and makes a cross comparison of each sequence pair'''
book = open_workbook("mtask.xlsx")
Input = book.sheet_by_index(0)

# naming of input data
a = (Input.col_values(0,0))
b = (Input.col_values(1,0))
c = (Input.col_values(2,0))

# make dictionary: keys are seq numbers; values are residues
y = {}
for i in range(Input.nrows):
x = []
for j in b[i]:
x.append(j)
y[a[i]] = x

# comparison of sequences and extraction of mutations for each sequence pair
List = []
for shit in range(Input.nrows):
for seq in range(Input.nrows):
seq12 = []
z = 0
for i in y[a[seq]]:
try:
for j in y[a[shit]][z]:
if i == j:
seq12.append(i.lower()+j.lower())
else:
seq12.append(i+j)
z = z+1
except IndexError:
print("oops")

lib = [a[seq],a[shit],c[seq],c[shit]]
for position, item in enumerate(seq12):
if item.isupper():
x = (str(item[0])+str(position+1)+str(item[1]))
lib.append(x)
List.append(lib)

# comparison of sequences and extraction of mutations for each sequence pair
dic = {}
for i in range(Input.nrows*Input.nrows):
x = []
for j in List[i]:
x.append(j)
dic[i] = x

# sort
a = []
for i in dic.values():
a.append(i)

# collect number of mutations in data files
import csv
null = []
one = []

Solution

Follow PEP8

PEP8 is the official coding style guide of Python.
You have several notable violations:

-
Avoid wildcard imports like from xlrd import *. It makes it hard to tell which names in the code come from xlrd. frim xlrd import open_workbook would be much better and easier to understand.

-
Use snake_case for variable names. The variables Input and List violate this, but I admit these are a bit tricky because input and list could shadow builtin names, so you need to find something better instead. Instead of Input, sheet seems a clear choice, and instead of List, something that describes what kind of list it is would be good, for example sequence_pairs

These are just examples, there are many more PEP8 violations.
Use the pep8 command line utility (you can install with pip install --user pep8)
to detect all the PEP8 violations in your source files.

Although it's not a PEP8 violation,
but for better readability I recommend extra spaces around operators,
and to remove unnecessary parentheses. So for example:

  • Instead of: x = (str(item[0])+str(position+1)+str(item[1]))



  • Write like this: x = (str(item[0]) + str(position+1) + str(item[1]))



Use list comprehensions

List comprehensions are a powerful and elegant sexy feature of Python.
For example instead of this:

x = []
for j in b[i]:
    x.append(j)
y[a[i]] = x


You can write simply:

y[a[i]] = [j for j in b[i]]


But actually, if you just want to clone a list, this is the best:

y[a[i]] = b[i][:]


Cleaning up the csv writing

This of course is a complete mess:

null = []
one = []
two = []
# ...
for i in range(Input.nrows*Input.nrows):
    if len(a[i]) <= 4:
        null.append(a[i])
        with open("no_mut.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(null)
    elif len(a[i]) == 5:
        one.append(a[i])
        with open("one.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(one)
    elif len(a[i]) == 6:
        # ...


The code duplication might actually be the smaller of two big problems.
The biggest problem is that each file is potentially rewritten multiple times.
I don't know what kind of data you have,
but for example if you have 10 rows with len(a[i]) == 5,
then the file one.csv will be first written with 1 line, then rewritten with 2 lines, 3 lines, ..., in the end 10 lines.
This is crazy.
You should rework this to write each file only once.

Here's an untested approach that I think should work,
and solves both the multiple writing and the code duplication:

files = (
    (lambda length: length <= 4, "no_mut.csv", []),
    (lambda length: length == 5, "one.csv", []),
    (lambda length: length == 6, "two.csv", []),
    (lambda length: length == 7, "three.csv", []),
    # ... and so on
)

for i in range(sheet.nrows * sheet.nrows):
    for matcher, _, rows in files:
        if matcher(a[i]):
            rows.append(a[i])
            break

for _, filename, rows in files:
    with open(filename, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerows(rows)

Code Snippets

x = []
for j in b[i]:
    x.append(j)
y[a[i]] = x
y[a[i]] = [j for j in b[i]]
y[a[i]] = b[i][:]
null = []
one = []
two = []
# ...
for i in range(Input.nrows*Input.nrows):
    if len(a[i]) <= 4:
        null.append(a[i])
        with open("no_mut.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(null)
    elif len(a[i]) == 5:
        one.append(a[i])
        with open("one.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(one)
    elif len(a[i]) == 6:
        # ...
files = (
    (lambda length: length <= 4, "no_mut.csv", []),
    (lambda length: length == 5, "one.csv", []),
    (lambda length: length == 6, "two.csv", []),
    (lambda length: length == 7, "three.csv", []),
    # ... and so on
)

for i in range(sheet.nrows * sheet.nrows):
    for matcher, _, rows in files:
        if matcher(a[i]):
            rows.append(a[i])
            break

for _, filename, rows in files:
    with open(filename, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerows(rows)

Context

StackExchange Code Review Q#82261, answer score: 2

Revisions (0)

No revisions yet.