Data cleansing and formatting script
Problem
This is a script that creates a base dataframe from a SQLite database, adds data to it (also from SQLite), cleanses it, and formats it all to an Excel file. I feel like it is incredibly verbose and my code could be simplified quite a bit. One thing I definitely want to do is eliminate the direct references to the db, /Users/meter/code/mm/vid_score/test.db, so that this script can be run on any machine.

Any thoughts on simple changes I can make to decrease the complexity and amount of code?
```
import pandas as pd
import numpy as np
import sqlite3 as db
from contextlib import closing
from datetime import date, timedelta

ONE = date.today() - timedelta(1)
TWO = date.today() - timedelta(2)
SEVEN = date.today() - timedelta(8)

def video_base():
    with closing(db.connect('/Users/meter/code/mm/vid_score/test.db')) as con:
        df = pd.read_sql_query("SELECT * FROM video_played;", con)
    df['time'] = pd.to_datetime(df['time'], unit='s').dt.date
    df = df.drop('index', axis=1)
    df_one = df.loc[df['time'] == ONE]
    df_two = df.loc[df['time'] == TWO]
    df_seven = df.loc[df['time'] == SEVEN]
    df_one = df_one.groupby('video_id').agg({"ios_id": {"count_watched": np.count_nonzero,
                                                        "unique_watched": pd.Series.nunique},
                                             "feed_position": {"feed_position": np.average},
                                             "time_watched": {"time_watched": np.sum},
                                             "video_length": {"video_length": np.sum}})
    df_one.columns = df_one.columns.droplevel(0)
    df_one['avg_time_watched'] = df_one['time_watched'] / df_one['video_length']
    df_two = df_two.groupby('video_id').agg({"ios_id": {"count_watched": np.count_nonzero,
                                                        "unique_watched": pd.Series.nunique},
                                             "feed_position": {"feed_position": np.average},
                                             "time_watched": {"time_watched": np.sum},
                                             "video_length": {"video_length": np.sum}})
    # ... (remainder of the script truncated in the source)
```
Solution
So the good thing is it's PEP8 clean (except for the 80 character limit) and is readable. The `pandas` and `numpy` usage is very nice as well; the `groupby` statements look great, for example. But yeah, as you said, you have a lot of duplicated code.

I've uploaded the repo to Github, so you might want to clone that soonish.
Command line arguments
It looks like you can run this just as a filter, i.e. something like `python video.py test.db daily_report.xlsx`. If you want more convenience I'd suggest `argparse` to have a nicer interface. After that you could also add a configuration file to store those settings locally on each machine. However, for this post I'll restrict myself to the first option.

I don't see why you need to open/close the database so many times. I'd move `con` to `main` and reuse it as long as you need. That also eliminates all but one occurrence of that filename. The Excel filename can be moved to `main` as well. Afterwards both filenames should come from the command line instead, i.e. `sys.argv`, of course raising an error if you don't get enough arguments (or, you know, use `argparse` to do that for you). That now looks like this:
```
def main():
    if len(sys.argv) != 3:
        sys.exit("Need exactly two arguments: database and output file.")
    with closing(db.connect(sys.argv[1])) as con:
        video_data = video_base(con)
        intent_added = video_intent(con, video_data)
        feed_data = feed_position(con)
    cleaned = cleanup(intent_added)
    to_excel(cleaned, feed_data, sys.argv[2])
```

(This needs `import sys` at the top of the script.) The `cleanup` call is moved out of the `closing` body since it doesn't depend on the connection, so it makes sense to still close the database as early as possible.
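If you do later switch to the `argparse` route mentioned above, the same entry point could look roughly like this (a sketch; the argument names `database` and `output` are mine, not from the original script):

```
import argparse

def main():
    parser = argparse.ArgumentParser(description="Build the daily video report.")
    parser.add_argument("database", help="path to the SQLite database")
    parser.add_argument("output", help="path of the Excel file to write")
    args = parser.parse_args()  # exits with a usage message if arguments are missing

    with closing(db.connect(args.database)) as con:
        video_data = video_base(con)
        intent_added = video_intent(con, video_data)
        feed_data = feed_position(con)
    cleaned = cleanup(intent_added)
    to_excel(cleaned, feed_data, args.output)
```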
Further refactoring
There's a typo where you used `yeterday` all the time.

Now that that's done, let's eliminate more duplication. The first thing I see is the dups in `video_base`; basically all you need to do here is factor out the two changing parameters:
```
def video_aggregations(df, day):
    df = df.loc[df['time'] == day]
    df = df.groupby('video_id').agg({"ios_id": {"count_watched": np.count_nonzero,
                                                "unique_watched": pd.Series.nunique},
                                     "feed_position": {"feed_position": np.average},
                                     "time_watched": {"time_watched": np.sum},
                                     "video_length": {"video_length": np.sum}})
    df.columns = df.columns.droplevel(0)  # not df_one.columns - df_one doesn't exist here
    return df
```

Called like:

```
df_one = video_aggregations(df, ONE)
df_one['avg_time_watched'] = df_one['time_watched'] / df_one['video_length']
```
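The three day slices could even be produced in one go with a small mapping (a sketch; `frames` and its keys are my names, assuming the per-day frames are still used separately afterwards):

```
frames = {name: video_aggregations(df, day)
          for name, day in {'one': ONE, 'two': TWO, 'seven': SEVEN}.items()}
frames['one']['avg_time_watched'] = (frames['one']['time_watched']
                                     / frames['one']['video_length'])
```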
Same thing for `video_intent` really. Factor out the constant parts, then call that five times. The only thing I'm not sure about here is the layout of the resulting data frames, so I'm more cautious than is probably necessary - you might be able to remove the `aggregation_name` parameter if the name isn't important:
```
def video_intent_aggregations(con, video_base, table, result_key, aggregation_name):
    df = pd.read_sql_query("SELECT * FROM %s;" % table, con)
    df['time'] = pd.to_datetime(df['time'], unit='s').dt.date
    df = df.drop('index', axis=1)
    df = df.loc[df['time'] == ONE]
    df = df.groupby('video_id').agg({"ios_id": {aggregation_name: np.count_nonzero,
                                                "unique_clicks": pd.Series.nunique},
                                     "feed_position": {"feed_position": np.average}})
    df.columns = df.columns.droplevel(0)
    video_base[result_key] = df[aggregation_name]

def video_intent(con, video_base):
    video_intent_aggregations(con, video_base, "item_info_click", "item_info_clicks", "count_clicks")
    video_intent_aggregations(con, video_base, "faved", "faves", "count_faves")
    video_intent_aggregations(con, video_base, "table", "replays", "replays")
    video_intent_aggregations(con, video_base, "add_to_cart", "add_to_cart", "add_to_cart")
    video_intent_aggregations(con, video_base, "carousel", "carousel", "carousel")
    return video_base
```
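Going one step further, those five calls are themselves just data, so `video_intent` could loop over a table of names instead (the tuples below reuse the exact arguments from above; `INTENT_TABLES` is my name for the list):

```
INTENT_TABLES = [
    ("item_info_click", "item_info_clicks", "count_clicks"),
    ("faved", "faves", "count_faves"),
    ("table", "replays", "replays"),
    ("add_to_cart", "add_to_cart", "add_to_cart"),
    ("carousel", "carousel", "carousel"),
]

def video_intent(con, video_base):
    for table, result_key, aggregation_name in INTENT_TABLES:
        video_intent_aggregations(con, video_base, table, result_key, aggregation_name)
    return video_base
```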
In `cleanup` you can get the values from `video_raw` once, then reuse them:
```
watched = video_raw['count_watched']
yesterday = video_raw['count_watched_yesterday']
seven = video_raw['count_watched_seven']
video_raw['percent_yesterday'] = (watched - yesterday) / yesterday
video_raw['percent_seven'] = (watched - seven) / seven
```
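Since `main` above expects `cleanup` to return the cleaned frame, that snippet would sit inside something like this (a sketch; it assumes only the column names shown above):

```
def cleanup(video_raw):
    watched = video_raw['count_watched']
    yesterday = video_raw['count_watched_yesterday']
    seven = video_raw['count_watched_seven']
    # Relative change versus yesterday and versus a week ago.
    video_raw['percent_yesterday'] = (watched - yesterday) / yesterday
    video_raw['percent_seven'] = (watched - seven) / seven
    return video_raw
```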
And finally in `to_excel` you could extract the columns into a separate list, then call the method on each of them:
```
columns = [('B:B', 13, integer),
           ('C:C', 17, percentage),
           ('D:D', 19, percentage),
           ('E:E', 15, integer),
           ('F:F', 15, percentage),
           ('G:G', 15, decimal),
           ('H:H', 13, integer),
           ('I:I', 13, integer),
           ('J:J', 13, integer),
           ('K:K', 13, integer),
           ('L:L', 13, integer)]

for column in columns:
    worksheet.set_column(*column)
```
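For context, `integer`, `percentage` and `decimal` here are presumably xlsxwriter cell formats created elsewhere in the script with `workbook.add_format`; a sketch of how they might be defined (the number-format strings are guesses, not taken from the original):

```
integer = workbook.add_format({'num_format': '#,##0'})
percentage = workbook.add_format({'num_format': '0.0%'})
decimal = workbook.add_format({'num_format': '0.00'})
```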
Context
StackExchange Code Review Q#92379, answer score: 3