
Code correctness and refinement for quantile normalization

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·


Problem

The code below is still far from feature complete, but I am looking to have some of the sections critiqued to learn better idioms or adjustments. Not yet implemented: handling of CSV files with headers, exception handling, more robust color labeling for matplotlib graphs, etc.

```
"""
Quantile normalization

License:
Creative Commons Attribution-ShareAlike 3.0 Unported License
http://creativecommons.org/licenses/by-sa/3.0/

This is an implementation of quantile normalization for microarray data analysis.
CSV files must not contain header. Format must be as follows:
| Gene | Expression value |
Example:
| ABCD1 | 5.675 |

Other restrictions:
1.) Each csv file must contain the same gene set.
2.) Each gene must be unique.

Usage on command line:
python2.7 quantile_normalization *csv
"""

import csv
import matplotlib.pyplot as plt
import numpy as np
import random
import sys

if (len(sys.argv) > 1):
    file_list = sys.argv[1:]
else:
    print "Not enough arguments given."
    sys.exit()

# Parse csv files for samples, creating lists of gene names and expression values.
set_dict = {}
for path in file_list:
    with open(path) as stream:
        data = list(csv.reader(stream, delimiter = '\t'))
    data = sorted([(i, float(j)) for i, j in data], key = lambda v: v[1])
    sample_genes = [i for i, j in data]
    sample_values = [j for i, j in data]
    set_dict[path] = (sample_genes, sample_values)

# Create sorted list of genes and values for all datasets.
set_list = [x for x in set_dict.items()]
set_list.sort(key = lambda (x,y): file_list.index(x))

# Compute row means.
L = len(file_list)
all_sets = [[i] for i in set_list[0:L+1]]
sample_values_list = [[v for i, (j, k) in A for v in k] for A in all_sets]
mean_values = [sum(p) / L for p in zip(*sample_values_list)]

# Compute histogram bin size using Rice Rule
for sample in sample_values_list:
    # Rice Rule: number of bins = 2 * n^(1/3), where n is the sample size.
    bin_size = int(2 * pow(len(sample), 1.0 / 3.0))

# Provide corresponding gene names for mean values and r
```

Solution

I am not sure `file_list = [args for args in sys.argv[1:]]` calls for a list comprehension. I might be wrong, but `file_list = sys.argv[1:]` should do the trick.
The same applies to some of your other list comprehensions. If you do want to create a new list from an existing one, `list(my_list)` does the trick, but this is not required with slice operations, as they already return a new list.
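
A minimal sketch of the point about slices. The `argv` list here is a hypothetical stand-in for `sys.argv`, so the snippet runs without command-line arguments:

```python
# Hypothetical stand-in for sys.argv.
argv = ["quantile_normalization.py", "a.csv", "b.csv"]

via_comprehension = [arg for arg in argv[1:]]
via_slice = argv[1:]
via_list = list(argv[1:])

# All three hold the same items.
assert via_comprehension == via_slice == via_list == ["a.csv", "b.csv"]

# The slice is already a new list object: changing it leaves the original alone.
via_slice.append("c.csv")
assert argv == ["quantile_normalization.py", "a.csv", "b.csv"]
```

The comprehension and the `list(...)` call only copy what the slice has already copied.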

The `while True:` is not really useful, is it?

Is `all_sets = [set_list[i - 1: i] for i in range(1, L + 1)]` any different from `all_sets = [[i] for i in set_list[0:L+1]]`?
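
One way to check is to run both expressions on a small input. The `set_list` below is made-up data in the same `(path, (genes, values))` shape the question builds, and `simplest` is a hypothetical third spelling that drops the slice altogether:

```python
# Made-up stand-in for set_list: (path, (genes, values)) pairs.
set_list = [
    ("a.csv", (["G1", "G2"], [1.0, 2.0])),
    ("b.csv", (["G2", "G1"], [1.5, 3.0])),
    ("c.csv", (["G1", "G2"], [0.5, 4.0])),
]
L = len(set_list)

by_index = [set_list[i - 1: i] for i in range(1, L + 1)]
by_item = [[i] for i in set_list[0:L + 1]]
simplest = [[item] for item in set_list]

# Each expression wraps every element of set_list in its own one-item list.
assert by_index == by_item == simplest
```

On this input the three agree. The slice `set_list[0:L + 1]` runs past the end of the list, which Python clamps silently, so it is just a copy of the whole list.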

Then, I must confess I got lost in the middle of the code. Please have a look at what can be simplified based on my first comments.
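
As one possible direction for that simplification, here is a sketch of the row-mean step without the intermediate `set_list` and `all_sets` structures. It is written for Python 3 (the question targets Python 2.7), and it assumes each sample is a plain list of floats of equal length. The function name `rank_means` and the sample data are hypothetical:

```python
def rank_means(samples):
    """Return, for each rank k, the mean of the k-th smallest value across samples.

    Assumes every sample has the same length.
    """
    sorted_samples = [sorted(values) for values in samples]
    count = len(sorted_samples)
    # zip(*...) groups the values that share a rank across all samples.
    return [sum(row) / count for row in zip(*sorted_samples)]


# Made-up expression values for two samples.
samples = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
means = rank_means(samples)
assert means == [1.5, 3.5, 5.5]
```

This covers only the "Compute row means" step shown in the question. Mapping the means back to gene names is left out because that part of the original code is cut off.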

Context

StackExchange Code Review Q#31617, answer score: 2
