HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Slicing time spans into calendar months

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
intotimemonthsslicingspanscalendar

Problem

I have apparently correct code that still runs for weeks on my data (tens of millions of rows). I show the entire code for reference (and maybe other gains to be made), but the key operation is in the loop between lines 66 and 79. Basically, if a spell (spent in hospital) extended over a single calendar month, I wanted to have separate lines counting the number of days spent in hospital for each of those calendar months.

I thought things won't be this bad iterating over rows if I allocate space for all the new rows in a single step (the concatenation before the loop) and only reset values row by row in the loop.

`# -- coding: utf-8 --
import numpy as np
import pandas as pd

all_treatments = list()
filelist = ['slutenvard1997','slutenvard2011','slutenvard2012','slutenvard19982004','slutenvard20052010']

tobacco_codes = '|'.join(["C{}".format(i) for i in range(30, 40)] + ["F17"])
nutrition_codes = '|'.join(["D{}".format(i) for i in range(50, 54)] + ["E{}".format(i) for i in range(10, 15)] + ["E{}".format(i) for i in range(40, 47)] + ["E{}".format(i) for i in range(50, 69)])
mental_codes = 'F'
alcohol_codes = '|'.join(["K70"] + ["F0"])
circulatory_codes = 'I'
dental_codes = '|'.join(["K0{}".format(i) for i in range(2, 4)])
accident_codes = '|'.join(["X{}".format(i) for i in range(10, 60)] + ["V"] + ["X0"])
selfharm_codes = '|'.join(["X{}".format(i) for i in range(60, 85)])
cancer_codes = 'C'
endonutrimetab_codes = 'E'
pregnancy_codes = 'O'
other_stress_codes = '|'.join(["J{}".format(i) for i in range(11, 48)] + ["L{}".format(i) for i in range(20, 66)] + ["K{}".format(i) for i in range(20, 60)] + ["X{}".format(i) for i in range(86, 99)] + ["Z{}".format(i) for i in range(10, 77)] + ["R"] + ["J0"] + ["Z0"])

items = {}
conds = ['tobacco','nutrition','mental','alcohol','circulatory','dental','accident','selfharm','cancer','endonutrimetab','pregnancy','other_stress']
for c in conds:
items[c] = eval(c + '_codes')

treatment_summaries = {item: list() for item in it

Solution

Performance

Since this was about performance, let's talk about that first.

I'd definitely recommend profiling first. It's hard to say whether it's
running slow because of limitations of Pandas (i.e. DataFrame
operations), NumPy (because of possible conversion), too much garbage
(again, because of copied data), or because the algorithm in the inner
loop (for x in range(0, monthstoadd):) is sub-optimal and therefore
doesn't perform well enough on millions of rows (from the outer loop),
especially without any test cases.

To hazard a guess, see if the date calculations can't be cached,
i.e. BOMoffset.rollforward should be pure(?), so the results can be
reused for the same input values, which could make a difference.

Also, try to get rid of the inner loop, or at least store the
intermediate results somewhere else and only concatenate them to the
full data frame at the end of the operation, either after the inner
loop, or, even better, after the main loop, using pd.concat or so.

The all_treatments.DIAGNOS.str.contains is probably also quite slow.
It might be better to filter out unwanted rows earlier, that is, before
actually processing them, or trying to not to regex matching, instead
transforming the DIAGNOS column into an integer code that can be
matched against much faster.

Btw. with millions of rows, grouping and selecting this almost sounds
more like a job for a full database instead, just saying. Especially if
it needs to be run more often.

General

Take a look at PEP8 for
more style hints, I'm only mentioning the most important ones and
reformat even if I don't mention the specific rule.

There is a section at the bottom marked "Cleaning up:", with a del and
there's also a del in the filelist loop on the variable
treatments - I suspect that both of them aren't really needed unless
you really want to make sure that they data frames can be garbage
collected immediately, but even then it wouldn't force a GC anyway.

For the first del I'd at least move it out of the loop so after the
last run the data frame can be GC-ed.

  • Use [] instead of list.



  • Constants should be UPPER_CASE_WITH_UNDERSCORES.



  • For Python 2 use dict.iterkeys if possible, same for xrange.



  • print should be called like a function (print(x)).



  • Using standard names (file, list) as variable or function names is


discouraged, though I do that myself as well.

  • The code could use some more comments, especially the parts where data


is fixed up (e.g. the special handling for 'slutenvard20052010').

  • Variable names are not very understandable (micolix, mocolix,


ocolix?)

  • Aligning variables seems not that usual in Python code and it's not


applied consistently.

Also, the initialisation of the codes constants at the start should be
handled in a nicer fashion. For starters, I would like to see a
function with parameters instead of multiple copied
join/format/range statements.

In any case using eval there is unnecessary and should in general be
avoided (for numerous reasons, but here it's just much clearer with a
different solution). Instead, just use a regular dictionary from the
start so the different codes can be accessed by name. So, initially I'd
rewrite to this kind of definition, keeping in mind that ITEMS is a
very non-descriptive name:

from itertools import chain

...

def codes_range(prefix, start, stop):
    return ['{}{}'.format(prefix, i) for i in range(start, stop)]

def join_codes(*lists):
    return '|'.join(chain(*lists))

def format_codes(prefix, start, stop):
    return join_codes(codes_range(prefix, start, stop))

ITEMS = {
    'tobacco': join_codes(codes_range('C', 30, 40), ['F17']),
    'nutrition': join_codes(codes_range('D', 50, 54),
                            codes_range('E', 10, 15),
                            codes_range('E', 40, 47),
                            codes_range('E', 50, 69)),
    'mental': 'F',
    ...
}


  • treatment_summaries is set twice, the first time can be removed as


it's not used anywhere.

For the future, I also recommend at least parsing command line arguments
instead of hard coding all paths, possibly using the current paths as
default values. That way the same script can be run even if external
circumstances change, that is, input files, paths, options. For now,
I'll add a constant PATH which is reused instead of repeating the same
path everywhere:

import os.path

...

PATH = '/PATH'

...

for file in FILELIST:
    filename = os.path.join(PATH, file + '.txt')
    ...

...

for name, summary in treatment_summaries.iteritems():
    treatment_summaries[name].to_csv('{0}/inpatient_treatments_monthly_sliced_{1}.csv'.format(PATH, name))


For reference, I cleaned it up a bit, but it is not at the state where I'd
say I'm happy with the readability, but it might illustrate some of the
points from above:

```
# -- coding: utf-8 --

import os.path

import numpy as np
import pandas as pd

from itertools impo

Code Snippets

from itertools import chain

...

def codes_range(prefix, start, stop):
    return ['{}{}'.format(prefix, i) for i in range(start, stop)]


def join_codes(*lists):
    return '|'.join(chain(*lists))


def format_codes(prefix, start, stop):
    return join_codes(codes_range(prefix, start, stop))


ITEMS = {
    'tobacco': join_codes(codes_range('C', 30, 40), ['F17']),
    'nutrition': join_codes(codes_range('D', 50, 54),
                            codes_range('E', 10, 15),
                            codes_range('E', 40, 47),
                            codes_range('E', 50, 69)),
    'mental': 'F',
    ...
}
import os.path

...

PATH = '/PATH'

...

for file in FILELIST:
    filename = os.path.join(PATH, file + '.txt')
    ...

...

for name, summary in treatment_summaries.iteritems():
    treatment_summaries[name].to_csv('{0}/inpatient_treatments_monthly_sliced_{1}.csv'.format(PATH, name))
# -*- coding: utf-8 -*-

import os.path

import numpy as np
import pandas as pd

from itertools import chain


PATH = 'PATH'
FILELIST = ['slutenvard1997',
            'slutenvard2011',
            'slutenvard2012',
            'slutenvard19982004',
            'slutenvard20052010']


def codes_range(prefix, start, stop):
    return ['{}{}'.format(prefix, i) for i in range(start, stop)]


def join_codes(*lists):
    return '|'.join(chain(*lists))


def format_codes(prefix, start, stop):
    return join_codes(codes_range(prefix, start, stop))


ITEMS = {
    'tobacco': join_codes(codes_range('C', 30, 40), ['F17']),
    'nutrition': join_codes(codes_range('D', 50, 54),
                            codes_range('E', 10, 15),
                            codes_range('E', 40, 47),
                            codes_range('E', 50, 69)),
    'mental': 'F',
    'alcohol': join_codes(['K70', 'F0']),
    'circulatory': 'I',
    'dental': format_codes('K0', 2, 4),
    'accident': join_codes(codes_range('X', 10, 60), ['V', 'X0']),
    'selfharm': format_codes('X', 60, 85),
    'cancer': 'C',
    'endonutrimetab': 'E',
    'pregnancy': 'O',
    'other_stress': join_codes(codes_range('J', 11, 48),
                               codes_range('L', 20, 66),
                               codes_range('K', 20, 60),
                               codes_range('X', 86, 99),
                               codes_range('Z', 10, 77),
                               ['R', 'J0', 'Z0'])
}


all_treatments = []
for file in FILELIST:
    filename = os.path.join(PATH, file + '.txt')
    treatments = pd.read_table(filename, usecols=[0, 8, 9, 11])
    if file == 'slutenvard20052010':
        treatments.loc[treatments['INDATUMA'] == 20060230, 'INDATUMA'] = 20060203
        treatments.loc[treatments['INDATUMA'] == 20108024, 'INDATUMA'] = 20100824
    elif file == 'slutenvard19982004':
        treatments.loc[treatments['UTDATUMA']==2003071,'UTDATUMA'] = 20030701
        treatments.loc[treatments['UTDATUMA']==2003091,'UTDATUMA'] = 20030901
        treatments = treatments[(treatments['INDATUMA'] != '.') & (treatments['UTDATUMA'] > 19971231)]
        treatments['INDATUMA'] = treatments['INDATUMA'].astype(float)
    all_treatments.append(treatments)


all_treatments = pd.concat(all_treatments, ignore_index=True)
print("Remember datatypes for future use:")
print(all_treatments.dtypes)
all_treatments['indate'] = pd.to_datetime(all_treatments['INDATUMA'], errors='coerce', format='%Y%m%d')
all_treatments['outdate'] = pd.to_datetime(all_treatments['UTDATUMA'], errors='coerce', format='%Y%m%d')


# Separating months:
all_treatments['monthlyindate'] = all_treatments['indate']
all_treatments['monthlyoutdate'] = all_treatments['outdate']

columns     = all_treatments.columns
micolix     = columns.get_loc('monthlyindate')
mocolix     = columns.get_loc('monthlyoutdate')
ocolix      = columns.get_loc('outdate')
emcolix     = columns.get_loc('extramonths')

outdate     = all_treatments['outdate'].dt
indate

Context

StackExchange Code Review Q#108332, answer score: 3

Revisions (0)

No revisions yet.