patternpythonMinor
Matrix Multiplication Python -- Memory Hungry
Viewed 0 times
hungrymultiplicationpythonmemorymatrix
Problem
Here is a snippet from my python script where I am performing:
This is slowing things way down and making it hard to debug with the ~10 min wait times. How can I write this in a more efficient way, and that makes use of more than just one CPU?
Here are the docs for the input params:
- a dictionary lookup
- a cartesian multiplication of a list of len=500 against a list of len=60,
- calculating a cumulative addition for each multiplcation combination
- repeat this down a 20,000 rows
This is slowing things way down and making it hard to debug with the ~10 min wait times. How can I write this in a more efficient way, and that makes use of more than just one CPU?
ccle_head, dot_prod_total = [], []
with open(ccle_mRNA_file, mode='r') as ccle:
for _ in range(2): #skip to data
next(ccle)
exp_mat = [line.rstrip().split('\t') for line in ccle]
for row in exp_mat:
if row[0] != 'Name' and row[1] in pat_mRNA: # data row
n_row = [float(s) for s in row[2:]]
# get ready for some slow matrix multiplication:
prod = [p*n for n in n_row for p in pat_mRNA[row[1]]] # slows here
if dot_prod_total:
dot_prod_total = [cum_prod + prod for (cum_prod,prod)
in zip(dot_prod_total,prod)] # cumulative addition
else: dot_prod_total = prod # first record of matrix products
else: ccle_exp_head = row # header rowHere are the docs for the input params:
"""
:param ccle_mRNA_file: text file, rows are gene expression across sample columns
:param pat_head: list, patient id [pat_1(str), pat_2...]
:param pat_mRNA: dict, patient RNA {gene(str):[pat_1(float), pat_2..}
:param cell_interest: set, cells of same tissue {cell_1, cell_2}
"""Solution
I'm not sure I completely follow exactly what processing you're doing to your data. The comments in the code suggest it's meant to be a matrix multiplication, the description you've provided above it suggests it's a cartesian product, but your code looks like does the sum of (flattened) outer products of corresponding rows. I'll proceed on the assumption that your code is doing the right thing, but hopefully you can clarify.
There are two general pieces of advice that apply here:
-
Separation of concerns. You're mingling code to parse your data with code to process it once it is parsed. Instead, parse everything into the appropriate data structures first, and then do the processing.
-
Use appropriate libraries and data structures. When you're doing linear algebra, that means numpy. Numpy is a dedicated, LA library that has heavily optimised routines for things like various matrix products, so you don't need to roll your own. Numpy arrays will also use quite a bit less memory than Python lists. However, what you're working with is labelled data (and some of it is labelled in two dimensions), which means the best data structure is probably a pandas dataframe - it's basically like your dict of strings to lists in
These things together will make your code easier to follow, more maintainable, and often both faster and shorter.
So, start by looking at your core data structure:
This is a list of lists of strings. For most of the rows, you don't seem to care about column 0 at all. Column 1 contains row labels, and corresponding to keys in your dictionary
The only trick parsing this into a DataFrame is that it is a little easier (and, apparently, more efficient) to swap the columns and rows from how you have them currently. That will affect how it prints, and which methods you call to relabel things, but not much else. This is how I would parse it into a DataFrame:
Note that I have modified your conditions a little, since yours seemed slightly off. The
I've maintained a temporary 'headers' list, because your code is currently robust against those appearing anywhere in the file. A better way to maintain that robustness would be if you know - or, preferably, can parse from your file before now - how many variables you have data for. Then you could do away with that
In either case, I am assuming that you have one header for each variable. If that's a bad assumption, you may need to adjust this code.
As an aside, as a matter of commonly-accepted Python style (and in accordance with PEP 8), I have placed all of the code for each branch of the
I would change
Once you've done that, you can use numpy to do your LA work. As written, your entire strange product becomes this:
Which should be quite a lot faster than trying to do it by list comprehensions, and may use multithreading depending on how numpy is compiled.
There are two general pieces of advice that apply here:
-
Separation of concerns. You're mingling code to parse your data with code to process it once it is parsed. Instead, parse everything into the appropriate data structures first, and then do the processing.
-
Use appropriate libraries and data structures. When you're doing linear algebra, that means numpy. Numpy is a dedicated, LA library that has heavily optimised routines for things like various matrix products, so you don't need to roll your own. Numpy arrays will also use quite a bit less memory than Python lists. However, what you're working with is labelled data (and some of it is labelled in two dimensions), which means the best data structure is probably a pandas dataframe - it's basically like your dict of strings to lists in
pat_mRNA, but much more optimised, and fully compatible with numpy's matrix manipulations.These things together will make your code easier to follow, more maintainable, and often both faster and shorter.
So, start by looking at your core data structure:
exp_mat = [line.rstrip().split('\t') for line in ccle]This is a list of lists of strings. For most of the rows, you don't seem to care about column 0 at all. Column 1 contains row labels, and corresponding to keys in your dictionary
pat_mRNA. The rest of it is numeric data, currently stored as strings. It also contains 'header' rows - you only keep one of those around, so presumably you only expect one to be in there. I'm guessing that is column labels, which a DataFrame lets you include directly in the same data structure. The only trick parsing this into a DataFrame is that it is a little easier (and, apparently, more efficient) to swap the columns and rows from how you have them currently. That will affect how it prints, and which methods you call to relabel things, but not much else. This is how I would parse it into a DataFrame:
data = pd.DataFrame()
headers = []
for line in ccle:
row = line.rstrip().split('\t')
if row[0] == 'Name':
headers = row
elif row[1] in pat_mRNA:
data[row[1]] = [float(f) for f in row[2:]]
if headers:
data.index = headersNote that I have modified your conditions a little, since yours seemed slightly off. The
else branch of your condition effectively says "if the zeroth column is "name" or the first column isn't a match, it must be a header" - I've changed it so that instead, an appropriate value in the zeroth column is a header, and a non-matching first column on is ignored. I've maintained a temporary 'headers' list, because your code is currently robust against those appearing anywhere in the file. A better way to maintain that robustness would be if you know - or, preferably, can parse from your file before now - how many variables you have data for. Then you could do away with that
headers list and do this:data = pd.DataFrame(index=range(nvars))
for line in ccle:
row = line.rstrip().split('\t')
if row[0] == 'Name':
data.index = row
elif row[1] in pat_mRNA:
data[row[1]] = [float(f) for f in row[2:]]In either case, I am assuming that you have one header for each variable. If that's a bad assumption, you may need to adjust this code.
As an aside, as a matter of commonly-accepted Python style (and in accordance with PEP 8), I have placed all of the code for each branch of the
if in indented blocks, even if it is only one line. At most, you should only put it immediately after the : if there is only one branch to the if (no else or elifs), and even then, consider an indented block anyway.I would change
pat_mRNA to also be a DataFrame. You could change how you currently build it, or you could just pass the whole dictionary you have at the moment into pd.DataFrame(). Once you've done that, you can use numpy to do your LA work. As written, your entire strange product becomes this:
dot_prod_total = sum(np.outer(data[key], pat_mRNA[key]).flat for key in data)Which should be quite a lot faster than trying to do it by list comprehensions, and may use multithreading depending on how numpy is compiled.
Code Snippets
exp_mat = [line.rstrip().split('\t') for line in ccle]data = pd.DataFrame()
headers = []
for line in ccle:
row = line.rstrip().split('\t')
if row[0] == 'Name':
headers = row
elif row[1] in pat_mRNA:
data[row[1]] = [float(f) for f in row[2:]]
if headers:
data.index = headersdata = pd.DataFrame(index=range(nvars))
for line in ccle:
row = line.rstrip().split('\t')
if row[0] == 'Name':
data.index = row
elif row[1] in pat_mRNA:
data[row[1]] = [float(f) for f in row[2:]]dot_prod_total = sum(np.outer(data[key], pat_mRNA[key]).flat for key in data)Context
StackExchange Code Review Q#98026, answer score: 5
Revisions (0)
No revisions yet.