Slicing time spans into calendar months
Problem
I have apparently correct code that still runs for weeks on my data (tens of millions of rows). I show the entire code for reference (and maybe other gains to be made), but the key operation is in the loop between lines 66 and 79. Basically, if a spell (spent in hospital) extended over a single calendar month, I wanted to have separate lines counting the number of days spent in hospital for each of those calendar months.
I thought things won't be this bad iterating over rows if I allocate space for all the new rows in a single step (the concatenation before the loop) and only reset values row by row in the loop.
```
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
all_treatments = list()
filelist = ['slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']
tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
nutrition_codes = '|'.join(["D{}".format(i) for i in range(50, 54)] + ["E{}".format(i) for i in range(10, 15)] + ["E{}".format(i) for i in range(40, 47)] + ["E{}".format(i) for i in range(50, 69)])
mental_codes = 'F'
alcohol_codes = '|'.join(["K70"] + ["F0"])
circulatory_codes = 'I'
dental_codes = '|'.join(["K0{}".format(i) for i in range(2, 4)])
accident_codes = '|'.join(["X{}".format(i) for i in range(10, 60)] + ["V"] + ["X0"])
selfharm_codes = '|'.join(["X{}".format(i) for i in range(60, 85)])
cancer_codes = 'C'
endonutrimetab_codes = 'E'
pregnancy_codes = 'O'
other_stress_codes = '|'.join(["J{}".format(i) for i in range(11, 48)] + ["L{}".format(i) for i in range(20, 66)] + ["K{}".format(i) for i in range(20, 60)] + ["X{}".format(i) for i in range(86, 99)] + ["Z{}".format(i) for i in range(10, 77)] + ["R"] + ["J0"] + ["Z0"])
items = {}
conds = ['tobacco','nutrition','mental','alcohol','circulatory','dental','accident','selfharm','cancer','endonutrimetab','pregnancy','other_stress']
for c in conds:
    items[c] = eval(c + '_codes')
treatment_summaries = {item: list() for item in it
```
Solution
Performance
Since this was about performance, let's talk about that first.
I'd definitely recommend profiling first. It's hard to say whether it's running slow because of limitations of Pandas (i.e. `DataFrame` operations), NumPy (because of possible conversions), too much garbage (again, because of copied data), or because the algorithm in the inner loop (`for x in range(0, monthstoadd):`) is sub-optimal and therefore doesn't perform well enough on millions of rows (from the outer loop), especially without any test cases.
To hazard a guess, see if the date calculations can't be cached, i.e. `BOMoffset.rollforward` should be pure(?), so the results can be reused for the same input values, which could make a difference.
Also, try to get rid of the inner loop, or at least store the intermediate results somewhere else and only concatenate them to the full data frame at the end of the operation, either after the inner loop, or, even better, after the main loop, using `pd.concat` or so.
The `all_treatments.DIAGNOS.str.contains` is probably also quite slow. It might be better to filter out unwanted rows earlier, that is, before actually processing them, or to avoid regex matching entirely, instead transforming the `DIAGNOS` column into an integer code that can be matched against much faster.
Btw. with millions of rows, grouping and selecting like this almost sounds more like a job for a full database instead, just saying. Especially if it needs to be run more often.
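Putting the loop suggestions together, here is a minimal sketch (with invented spell dates and column names, not the original data) that splits each spell into per-month day counts while building the result frame only once, after both loops:

```python
import pandas as pd

# Invented example spells: admission and discharge dates.
spells = pd.DataFrame({'indate': pd.to_datetime(['2005-01-20', '2005-03-05']),
                       'outdate': pd.to_datetime(['2005-02-10', '2005-03-07'])})

rows = []  # collect plain dicts instead of concatenating inside the loop
for spell in spells.itertuples():
    # one entry per calendar month the spell touches
    for month in pd.period_range(spell.indate, spell.outdate, freq='M'):
        start = max(spell.indate, month.start_time)
        end = min(spell.outdate, month.end_time)
        rows.append({'month': str(month), 'days': (end - start).days + 1})

# build the result frame once, at the very end
result = pd.DataFrame(rows)
```

Appending to a Python list is cheap; what kills performance is growing a `DataFrame` row by row, because each concatenation copies all the data accumulated so far.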
General
Take a look at PEP8 for more style hints; I'm only mentioning the most important ones and reformat even if I don't mention the specific rule.
There is a section at the bottom marked "Cleaning up:", with a `del`, and there's also a `del` in the filelist loop on the variable `treatments` - I suspect that both of them aren't really needed unless you really want to make sure that the data frames can be garbage collected immediately, but even then it wouldn't force a GC anyway. For the first `del` I'd at least move it out of the loop so that after the last run the data frame can be GC-ed.
- Use `[]` instead of `list()`.
- Constants should be `UPPER_CASE_WITH_UNDERSCORES`.
- For Python 2 use `dict.iterkeys` if possible, same for `xrange`. `print` should be called like a function (`print(x)`).
- Using standard names (`file`, `list`) as variable or function names is discouraged, though I do that myself as well.
- The code could use some more comments, especially the parts where data is fixed up (e.g. the special handling for `'slutenvard20052010'`).
- Variable names are not very understandable (`micolix`, `mocolix`, `ocolix`?).
- Aligning variables seems not that usual in Python code and it's not applied consistently.
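For concreteness, a few of these points side by side (Python 3 spellings, since `xrange`/`iterkeys` only matter on Python 2):

```python
codes = []                        # [] rather than list()
MAX_CODE = 3                      # constants in UPPER_CASE_WITH_UNDERSCORES
for i in range(MAX_CODE):         # range; on Python 2 this would be xrange
    codes.append('C{}'.format(i))
print(codes)                      # print called as a function
```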
Also, the initialisation of the codes constants at the start should be handled in a nicer fashion. For starters, I would like to see a function with parameters instead of multiple copied `join`/`format`/`range` statements. In any case using `eval` there is unnecessary and should in general be avoided (for numerous reasons, but here it's just much clearer with a different solution). Instead, just use a regular dictionary from the start so the different codes can be accessed by name. So, initially I'd rewrite to this kind of definition, keeping in mind that `ITEMS` is a very non-descriptive name:

```
from itertools import chain

...

def codes_range(prefix, start, stop):
    return ['{}{}'.format(prefix, i) for i in range(start, stop)]

def join_codes(*lists):
    return '|'.join(chain(*lists))

def format_codes(prefix, start, stop):
    return join_codes(codes_range(prefix, start, stop))

ITEMS = {
    'tobacco': join_codes(codes_range('C', 30, 40), ['F17']),
    'nutrition': join_codes(codes_range('D', 50, 54),
                            codes_range('E', 10, 15),
                            codes_range('E', 40, 47),
                            codes_range('E', 50, 69)),
    'mental': 'F',
    ...
}
```

`treatment_summaries` is set twice; the first time can be removed as it's not used anywhere.
For the future, I also recommend at least parsing command line arguments instead of hard coding all paths, possibly using the current paths as default values. That way the same script can be run even if external circumstances change, that is, input files, paths, options. For now, I'll add a constant `PATH` which is reused instead of repeating the same path everywhere:

```
import os.path

...

PATH = '/PATH'

...

for file in FILELIST:
    filename = os.path.join(PATH, file + '.txt')
    ...

...

for name, summary in treatment_summaries.iteritems():
    treatment_summaries[name].to_csv('{0}/inpatient_treatments_monthly_sliced_{1}.csv'.format(PATH, name))
```

For reference, I cleaned it up a bit, but it is not at the state where I'd say I'm happy with the readability, but it might illustrate some of the points from above:
```
# -*- coding: utf-8 -*-
import os.path
import numpy as np
import pandas as pd
from itertools import chain
```
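The command line suggestion above could start out like this — a hypothetical sketch where the current hard-coded values become overridable defaults (`--path` and the positional file list are invented parameter names):

```python
import argparse
import os.path

parser = argparse.ArgumentParser(description='Slice hospital spells into calendar months')
parser.add_argument('--path', default='/PATH', help='directory containing the input files')
parser.add_argument('files', nargs='*', default=['slutenvard1997'], help='input file stems')

# An empty argv keeps the defaults; the real script would parse sys.argv.
args = parser.parse_args([])
filename = os.path.join(args.path, args.files[0] + '.txt')
```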
Code Snippets

```
# -*- coding: utf-8 -*-
import os.path
import numpy as np
import pandas as pd
from itertools import chain

PATH = 'PATH'
FILELIST = ['slutenvard1997',
            'slutenvard2011',
            'slutenvard2012',
            'slutenvard19982004',
            'slutenvard20052010']

def codes_range(prefix, start, stop):
    return ['{}{}'.format(prefix, i) for i in range(start, stop)]

def join_codes(*lists):
    return '|'.join(chain(*lists))

def format_codes(prefix, start, stop):
    return join_codes(codes_range(prefix, start, stop))

ITEMS = {
    'tobacco': join_codes(codes_range('C', 30, 40), ['F17']),
    'nutrition': join_codes(codes_range('D', 50, 54),
                            codes_range('E', 10, 15),
                            codes_range('E', 40, 47),
                            codes_range('E', 50, 69)),
    'mental': 'F',
    'alcohol': join_codes(['K70', 'F0']),
    'circulatory': 'I',
    'dental': format_codes('K0', 2, 4),
    'accident': join_codes(codes_range('X', 10, 60), ['V', 'X0']),
    'selfharm': format_codes('X', 60, 85),
    'cancer': 'C',
    'endonutrimetab': 'E',
    'pregnancy': 'O',
    'other_stress': join_codes(codes_range('J', 11, 48),
                               codes_range('L', 20, 66),
                               codes_range('K', 20, 60),
                               codes_range('X', 86, 99),
                               codes_range('Z', 10, 77),
                               ['R', 'J0', 'Z0'])
}

all_treatments = []
for file in FILELIST:
    filename = os.path.join(PATH, file + '.txt')
    treatments = pd.read_table(filename, usecols=[0, 8, 9, 11])
    if file == 'slutenvard20052010':
        treatments.loc[treatments['INDATUMA'] == 20060230, 'INDATUMA'] = 20060203
        treatments.loc[treatments['INDATUMA'] == 20108024, 'INDATUMA'] = 20100824
    elif file == 'slutenvard19982004':
        treatments.loc[treatments['UTDATUMA'] == 2003071, 'UTDATUMA'] = 20030701
        treatments.loc[treatments['UTDATUMA'] == 2003091, 'UTDATUMA'] = 20030901
    treatments = treatments[(treatments['INDATUMA'] != '.') & (treatments['UTDATUMA'] > 19971231)]
    treatments['INDATUMA'] = treatments['INDATUMA'].astype(float)
    all_treatments.append(treatments)
all_treatments = pd.concat(all_treatments, ignore_index=True)

print("Remember datatypes for future use:")
print(all_treatments.dtypes)

all_treatments['indate'] = pd.to_datetime(all_treatments['INDATUMA'], errors='coerce', format='%Y%m%d')
all_treatments['outdate'] = pd.to_datetime(all_treatments['UTDATUMA'], errors='coerce', format='%Y%m%d')

# Separating months:
all_treatments['monthlyindate'] = all_treatments['indate']
all_treatments['monthlyoutdate'] = all_treatments['outdate']
columns = all_treatments.columns
micolix = columns.get_loc('monthlyindate')
mocolix = columns.get_loc('monthlyoutdate')
ocolix = columns.get_loc('outdate')
emcolix = columns.get_loc('extramonths')
outdate = all_treatments['outdate'].dt
indate
```

Context
StackExchange Code Review Q#108332, answer score: 3