Code correctness and refinement for quantile normalization
Problem
The code below is still far from feature complete, but I would like some of the sections critiqued so I can learn better idioms or adjustments. Not yet implemented: handling of CSV files with headers, exception handling, more robust color labeling for matplotlib graphs, etc.
```
"""
Quantile normalization
License:
Creative Commons Attribution-ShareAlike 3.0 Unported License
http://creativecommons.org/licenses/by-sa/3.0/
This is an implementation of quantile normalization for microarray data analysis.
CSV files must not contain header. Format must be as follows:
| Gene | Expression value |
Example:
| ABCD1 | 5.675 |
Other restrictions:
1.) Each csv file must contain the same gene set.
2.) Each gene must be unique.
Usage on command line:
python2.7 quantile_normalization *csv
"""
import csv
import matplotlib.pyplot as plt
import numpy as np
import random
import sys
if (len(sys.argv) > 1):
    file_list = sys.argv[1:]
else:
    print "Not enough arguments given."
    sys.exit()
# Parse csv files for samples, creating lists of gene names and expression values.
set_dict = {}
for path in file_list:
    with open(path) as stream:
        data = list(csv.reader(stream, delimiter = '\t'))
    data = sorted([(i, float(j)) for i, j in data], key = lambda v: v[1])
    sample_genes = [i for i, j in data]
    sample_values = [j for i, j in data]
    set_dict[path] = (sample_genes, sample_values)
# Create sorted list of genes and values for all datasets.
set_list = [x for x in set_dict.items()]
set_list.sort(key = lambda (x,y): file_list.index(x))
# Compute row means.
L = len(file_list)
all_sets = [[i] for i in set_list[0:L+1]]
sample_values_list = [[v for i, (j, k) in A for v in k] for A in all_sets]
mean_values = [sum(p) / L for p in zip(*sample_values_list)]
# Compute histogram bin size using Rice Rule
for sample in sample_values_list:
    bin_size = int(pow(2 * len(sample), 1.0 / 3.0))
# Provide corresponding gene names for mean values and r
```
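For reference, the core step the script builds toward (sort each sample, average across samples at each rank, hand each gene the mean for its rank) can be sketched on its own, without the CSV parsing or plotting. This is a minimal sketch in Python 3 rather than the Python 2.7 the script targets. The function name `quantile_normalize`, the nested-dict input shape, and the sample data are assumptions made for illustration, and tied values are not averaged as some implementations do.

```python
def quantile_normalize(samples):
    """Quantile-normalize expression values.

    samples: dict mapping sample name -> dict of gene -> value.
    Assumes every sample holds the same set of unique genes,
    matching the restrictions stated in the script's docstring.
    """
    names = list(samples)
    genes = list(samples[names[0]])
    for name in names:
        if set(samples[name]) != set(genes):
            raise ValueError("sample %r has a different gene set" % name)

    # For each sample, list its genes ordered by ascending expression value.
    ranked = {
        name: sorted(genes, key=lambda g: samples[name][g])
        for name in names
    }

    # Mean across samples of the values found at each rank.
    rank_means = [
        sum(samples[name][ranked[name][r]] for name in names) / len(names)
        for r in range(len(genes))
    ]

    # Each gene receives the mean belonging to its rank in its own sample.
    return {
        name: {gene: rank_means[r] for r, gene in enumerate(ranked[name])}
        for name in names
    }


# Hypothetical two-sample input, for illustration only.
example = {
    "a.csv": {"g1": 5.0, "g2": 2.0, "g3": 3.0},
    "b.csv": {"g1": 4.0, "g2": 1.0, "g3": 6.0},
}
normalized = quantile_normalize(example)
for name in normalized:
    print(name, normalized[name])
```

Keeping the gene names attached to the values through the sort avoids maintaining separate, parallel lists of names and values.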
Solution
I am not sure `file_list = [args for args in sys.argv[1:]]` calls for a list comprehension. I might be wrong, but `file_list = sys.argv[1:]` should do the trick.

The same applies to your other list comprehensions. If you do want to create a new list out of an existing one, `list(my_list)` does the trick, but this is not required when using slice operations, as they already return a new list.

The `while True:` is not really useful, is it?

Is `all_sets = [set_list[i - 1: i] for i in range(1, L + 1)]` any different from `all_sets = [[i] for i in set_list[0:L+1]]`?

Then, I must confess I got lost in the middle of the code. Please have a look at what can be simplified based on my first comments.
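The two points about slicing and `all_sets` can be checked quickly. The snippet below is a small sketch in Python 3; the `argv` list and the contents of `set_list` are made-up stand-ins for `sys.argv` and the parsed data, not values from the original script.

```python
# Hypothetical stand-in for sys.argv, so the snippet runs without arguments.
argv = ["quantile_normalization.py", "a.csv", "b.csv"]

# A slice is already a new list: changing it leaves the source untouched.
file_list = argv[1:]
file_list.append("c.csv")
print(argv)       # still the original three entries
print(file_list)  # the slice plus the appended name

# Hypothetical stand-in for set_list: (path, (genes, values)) pairs.
set_list = [
    ("a.csv", (["g2", "g1"], [2.0, 5.0])),
    ("b.csv", (["g2", "g1"], [1.0, 4.0])),
]
L = len(set_list)

by_slices = [set_list[i - 1:i] for i in range(1, L + 1)]
by_wrapping = [[i] for i in set_list[0:L + 1]]
simplest = [[item] for item in set_list]

# All three build the same list of one-element lists.
print(by_slices == by_wrapping == simplest)
```

The equality holds when `L` is the length of `set_list`, which is the case in the script as long as no file path is passed twice. A slice end beyond the list length is simply clamped, so `set_list[0:L+1]` is the whole list.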
Context
StackExchange Code Review Q#31617, answer score: 2