HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Extracting time duration in the session from 30 million rows

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
millionrowsthetimedurationsessionextractingfrom

Problem

I am looking for making my code faster. I am working on yoochoose recsys 2015 dataset.. and trying to perform some transformations.. [recsys2015], it has got 30 million plus rows of data. The goal of my code is to extract time duration in the session with the help of python, and the day , hour segment etc, however I realized that my code with pandas is taking hell lot of time, any alternative approaches is well appreciated
below is sample data and code

Data:

sessioid,timestamp,itemid,category
1,2014-04-07T10:51:09.277Z,214536502,0
1,2014-04-07T10:54:09.868Z,214536500,0
1,2014-04-07T10:54:46.998Z,214536506,0
1,2014-04-07T10:57:00.306Z,214577561,0
2,2014-04-07T13:56:37.614Z,214662742,0
2,2014-04-07T13:57:19.373Z,214662742,0
2,2014-04-07T13:58:37.446Z,214825110,0
2,2014-04-07T13:59:50.710Z,214757390,0


Code

```
import pandas as pd
import datetime as dt
clickspath='/tmp/gensim/yoochoose/yoochoose-clicks.dat'
buyspath='/tmp/gensim/yoochoose/yoochoose-buys.dat'
clicksdat=pd.read_csv(clickspath,header=None)
clicksdat.columns=['Sid','Timestamp','itemid','itemcategory']
buysdat=pd.read_csv(buyspath,header=None)
buysdat.columns=['Sid','Timestamp','itemid','price','qty']
segment={}
for i in range(24):
if i<7:
segment[i]='EM'
elif i<10:
segment[i]='M'
elif i<13:
segment[i]='A'
elif i<18:
segment[i]='E'
elif i<23:
segment[i]='N'
elif i<25:
segment[i]='MN'
#***
buyersession=buysdat.Sid.unique()
clickersession=clicksdat.Sid.unique()
with open('/tmp/gensim/yoochooseclick_new.dat','w') as f1:
for session in clickersession:
purchased=1 if session in buyersession else 0
temp=clicksdat[clicksdat.Sid==session]
maxtime=dt.datetime.strptime(temp.Timestamp.max(),"%Y-%m-%dT%H:%M:%S.%fZ")
mintime=dt.datetime.strptime(temp.Timestamp.min(),"%Y-%m-%dT%H:%M:%S.%fZ")
day=maxtime.day
month=maxtime.month
n

Solution

A few basic points:

The bool datatype is basically just an int and therefore 1 == True and 0 == False. This makes your two ternaries easier to read, because instead of

purchased=1 if session in buyersession else 0
segmentchange=1 if (segment[maxtime.hour]!=segment[mintime.hour]) else 0


You can just write

purchased = session in buyersession
segmentchange = segment[maxtime.hour] != segment[mintime.hour]


noofnavigations is a bad name. You should at least call it no_of_navigations, to go along with PEP8. But why not just navigations?

While we're here, you might as well comply with PEP8 on the other things as well (mainly, put spaces around all infix operators and equal signs).

If you demote all of your temporary variables (which you only use once anyways) to keys in a dictionary you can make the printing a lot easier:

with open('/tmp/gensim/yoochooseclick_new.dat', 'w') as f1:
    for session in clickersession:
        temp = clicksdat[clicksdat.Sid == session]
        maxtime = dt.datetime.strptime(temp.Timestamp.max(), "%Y-%m-%dT%H:%M:%S.%fZ")
        mintime = dt.datetime.strptime(temp.Timestamp.min(), "%Y-%m-%dT%H:%M:%S.%fZ")
        d = {'purchased': session in buyersession,
             'day': maxtime.day,
             'month': maxtime.month,
             'navigations': len(temp.Timestamp)-1,
             'duration': int((maxtime-mintime).total_seconds()),
             'totalcats': len(temp.itemcategory.unique()),
             'totalitems': len(temp.itemid.unique()),
             'timesegment': segment[maxtime.hour],
             'segmentchange': segment[maxtime.hour] != segment[mintime.hour],
             }
        s = "{purchased}, {day}, {month}, {navigations}, {duration}, {totalcats}, {totalitems}, {timesegment}"
        f1.write(s.format(**d))


Lastly, the building of the words for the different times of day could be written slightly more concise when using a list instead of a dictionary:

segment = ['EM']*24
segment[7:10] = ['M']*3
segment[10:13] = ['A']*3
segment[13:18] = ['E']*5
segment[18:23] = ['N']*5
segment[23] = 'MN'


In the end your code is probably slow, because of all the finding of mins, maxs and lens. This just takes time. Would be good to see a run with a profiler to see at which operation, though.

With four more years of experience, here is how you could make this even faster:

def process(group):
    min_time = group.loc[group["Timestamp"].idxmin()]
    max_time = group.loc[group["Timestamp"].idxmax()]
    diff = max_time["Timestamp"] - min_time["Timestamp"]
    return pd.Series({"purchased": group["Sid"].iloc[0] in buyer_session,
                      "day": max_time["Timestamp"].day,
                      "month": max_time["Timestamp"].month,
                      "navigations": group["Timestamp"].count() - 1,
                      "duration": int(diff.total_seconds()),
                      "totalcats": group["itemcategory"].nunique(),
                      "totalitems": group["itemid"].nunique(),
                      "timesegment": max_time["segment"],
                      "segmentchange": max_time["segment"] != min_time["segment"]})
    
if __name__ == "__main__":
    buys_data = pd.read_csv(...)
    buyer_session = set(buys_data["Sid"].unique())
    clicks_data = pd.read_csv(...)
    clicks_data["Timestamp"] = pd.to_datetime(clicks_data["Timestamp"], 
                                              format="%Y-%m-%dT%H:%M:%S.%fZ")
    clicks_data["segment"] = pd.cut(clicks_data["Timestamp"].dt.hour,
                                    [0, 7, 10, 13, 18, 23, 24],
                                    labels=["EM", "M", "A", "E", "N", "MN"])
    sessions = clicks_data.groupby("Sid", as_index=False).apply(process)
    sessions.to_csv('/tmp/gensim/yoochooseclick_new.dat', index=False)


I used the fact that you can directly get the number of unique items using nunique and that it is easier to group by the session ID than iterate over it using groupby (provided you are actually interested in all session IDs). I also converted to datetime objects right away and assigned the segements using cut. Before, there was a possible bug, because you determined the maximum and minimum date using string comparison. For the datetime format you have, this just happens to also sort correctly by date, but if that format ever changes, your code would break.

Code Snippets

purchased=1 if session in buyersession else 0
segmentchange=1 if (segment[maxtime.hour]!=segment[mintime.hour]) else 0
purchased = session in buyersession
segmentchange = segment[maxtime.hour] != segment[mintime.hour]
with open('/tmp/gensim/yoochooseclick_new.dat', 'w') as f1:
    for session in clickersession:
        temp = clicksdat[clicksdat.Sid == session]
        maxtime = dt.datetime.strptime(temp.Timestamp.max(), "%Y-%m-%dT%H:%M:%S.%fZ")
        mintime = dt.datetime.strptime(temp.Timestamp.min(), "%Y-%m-%dT%H:%M:%S.%fZ")
        d = {'purchased': session in buyersession,
             'day': maxtime.day,
             'month': maxtime.month,
             'navigations': len(temp.Timestamp)-1,
             'duration': int((maxtime-mintime).total_seconds()),
             'totalcats': len(temp.itemcategory.unique()),
             'totalitems': len(temp.itemid.unique()),
             'timesegment': segment[maxtime.hour],
             'segmentchange': segment[maxtime.hour] != segment[mintime.hour],
             }
        s = "{purchased}, {day}, {month}, {navigations}, {duration}, {totalcats}, {totalitems}, {timesegment}"
        f1.write(s.format(**d))
segment = ['EM']*24
segment[7:10] = ['M']*3
segment[10:13] = ['A']*3
segment[13:18] = ['E']*5
segment[18:23] = ['N']*5
segment[23] = 'MN'
def process(group):
    min_time = group.loc[group["Timestamp"].idxmin()]
    max_time = group.loc[group["Timestamp"].idxmax()]
    diff = max_time["Timestamp"] - min_time["Timestamp"]
    return pd.Series({"purchased": group["Sid"].iloc[0] in buyer_session,
                      "day": max_time["Timestamp"].day,
                      "month": max_time["Timestamp"].month,
                      "navigations": group["Timestamp"].count() - 1,
                      "duration": int(diff.total_seconds()),
                      "totalcats": group["itemcategory"].nunique(),
                      "totalitems": group["itemid"].nunique(),
                      "timesegment": max_time["segment"],
                      "segmentchange": max_time["segment"] != min_time["segment"]})
    
if __name__ == "__main__":
    buys_data = pd.read_csv(...)
    buyer_session = set(buys_data["Sid"].unique())
    clicks_data = pd.read_csv(...)
    clicks_data["Timestamp"] = pd.to_datetime(clicks_data["Timestamp"], 
                                              format="%Y-%m-%dT%H:%M:%S.%fZ")
    clicks_data["segment"] = pd.cut(clicks_data["Timestamp"].dt.hour,
                                    [0, 7, 10, 13, 18, 23, 24],
                                    labels=["EM", "M", "A", "E", "N", "MN"])
    sessions = clicks_data.groupby("Sid", as_index=False).apply(process)
    sessions.to_csv('/tmp/gensim/yoochooseclick_new.dat', index=False)

Context

StackExchange Code Review Q#135755, answer score: 4

Revisions (0)

No revisions yet.