
Data cleansing and formatting script


Problem

This is a script that creates a base dataframe from a SQLite database, adds data to it (also from SQLite), cleanses it, and formats it all into an Excel file. I feel like it is incredibly verbose and my code could be simplified quite a bit. One thing I definitely want to do is eliminate the direct references to the db, /Users/meter/code/mm/vid_score/test.db, so that this script can be run on any machine.

Any thoughts on simple changes I can make to decrease the complexity and amount of code?

```
import pandas as pd
import numpy as np
import sqlite3 as db
from contextlib import closing
from datetime import date, timedelta

ONE = date.today() - timedelta(1)
TWO = date.today() - timedelta(2)
SEVEN = date.today() - timedelta(8)

def video_base():
    with closing(db.connect('/Users/meter/code/mm/vid_score/test.db')) as con:
        df = pd.read_sql_query("SELECT * FROM video_played;", con)

    df['time'] = pd.to_datetime(df['time'], unit='s').dt.date

    df = df.drop('index', axis=1)

    df_one = df.loc[df['time'] == ONE]
    df_two = df.loc[df['time'] == TWO]
    df_seven = df.loc[df['time'] == SEVEN]

    df_one = df_one.groupby('video_id').agg({"ios_id": {"count_watched": np.count_nonzero,
                                                        "unique_watched": pd.Series.nunique},
                                             "feed_position": {"feed_position": np.average},
                                             "time_watched": {"time_watched": np.sum},
                                             "video_length": {"video_length": np.sum}})

    df_one.columns = df_one.columns.droplevel(0)
    df_one['avg_time_watched'] = df_one['time_watched'] / df_one['video_length']

    df_two = df_two.groupby('video_id').agg({"ios_id": {"count_watched": np.count_nonzero,
                                                        "unique_watched": pd.Series.nunique},
                                             "feed_position": {"feed_position": np.average},
                                             "time_watched": {"time_watched": np.sum},
                                             "video_length": {"video_length": np.sum}})

    # ... (the rest of the original script, including video_intent, feed_position,
    # cleanup and to_excel, is truncated in this entry)
```

Solution

So the good thing is it's PEP 8 clean (except for the 80-character line limit) and readable. pandas and numpy are put to good use as well; the groupby statements look great, for example. But yeah, as you said, you have a lot of duplicated code.

I've uploaded the repo to GitHub, so you might want to clone that soonish.

Command line arguments

It looks like you can run this just as a filter, i.e. something like
python video.py test.db daily_report.xlsx. If you want more
convenience I'd suggest
argparse for a
nicer interface. After that you could also add a configuration file to
store those settings locally on each machine. However, for this post
I'll restrict myself to the first option.
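
For illustration, a minimal argparse version might look like this (the argument names and help strings here are my own, not from the original script):

```
import argparse

def parse_args():
    # Hypothetical CLI; the exact argument names are assumptions.
    parser = argparse.ArgumentParser(
        description="Build a daily video report from a SQLite database.")
    parser.add_argument("database", help="path to the SQLite database file")
    parser.add_argument("output", help="path of the Excel report to write")
    return parser.parse_args()
```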

I don't see why you need to open/close the database so many times. I'd
move con to main and reuse it as long as you need. That also
eliminates all but one occurrence of that filename. The Excel filename
can be moved to main as well.

Afterwards both filenames should come from the command-line instead,
i.e. sys.argv, of course raising an error if you don't get enough
arguments (or you know, use argparse to do that for you). That now
looks like this:

```
import sys

def main():
    if len(sys.argv) != 3:
        sys.exit("Need exactly two arguments: database and output file.")

    with closing(db.connect(sys.argv[1])) as con:
        video_data = video_base(con)
        intent_added = video_intent(con, video_data)
        feed_data = feed_position(con)

    cleaned = cleanup(intent_added)
    to_excel(cleaned, feed_data, sys.argv[2])
```


The cleanup call is moved out of the closing body since it doesn't
need the open connection; that way the database is still closed as
early as possible.
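
If the script is meant to be importable as well as runnable, the usual entry-point guard goes at the bottom (assuming the original doesn't have one already):

```
if __name__ == '__main__':
    main()
```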

Further refactoring

There's also a typo: you wrote yeterday instead of yesterday throughout.

Now that that's done, let's eliminate more duplication.

The first thing I see is the dups in video_base; basically all you
need to do here is factor out the two changing parameters:

```
def video_aggregations(df, day):
    df = df.loc[df['time'] == day]
    df = df.groupby('video_id').agg({"ios_id": {"count_watched": np.count_nonzero,
                                                "unique_watched": pd.Series.nunique},
                                     "feed_position": {"feed_position": np.average},
                                     "time_watched": {"time_watched": np.sum},
                                     "video_length": {"video_length": np.sum}})
    df.columns = df.columns.droplevel(0)
    return df
```


Called like:

```
df_one = video_aggregations(df, ONE)
df_one['avg_time_watched'] = df_one['time_watched'] / df_one['video_length']
```
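
As an aside, the dict-of-dicts renaming used in .agg here was deprecated in pandas 0.20 and removed in 1.0; on a modern pandas (0.25 or later) the same function could be written with named aggregation, which also makes the droplevel call unnecessary. A sketch:

```
def video_aggregations(df, day):
    # Equivalent aggregation using named-aggregation syntax (pandas >= 0.25).
    df = df.loc[df['time'] == day]
    return df.groupby('video_id').agg(
        count_watched=('ios_id', np.count_nonzero),
        unique_watched=('ios_id', 'nunique'),
        feed_position=('feed_position', 'mean'),
        time_watched=('time_watched', 'sum'),
        video_length=('video_length', 'sum'),
    )
```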


Same thing for video_intent really. Factor out the constant parts,
then call that five times. The only thing I'm not sure about here is
the layout of the resulting data frames, so I'm more cautious than is
probably necessary; you might be able to remove the aggregation_name
parameter if the name isn't important:

```
def video_intent_aggregations(con, video_base, table, result_key, aggregation_name):
    # Table names come from the fixed, trusted list below, so plain string
    # interpolation is acceptable here.
    df = pd.read_sql_query("SELECT * FROM %s;" % table, con)

    df['time'] = pd.to_datetime(df['time'], unit='s').dt.date
    df = df.drop('index', axis=1)
    df = df.loc[df['time'] == ONE]

    df = df.groupby('video_id').agg({"ios_id": {aggregation_name: np.count_nonzero,
                                                "unique_clicks": pd.Series.nunique},
                                     "feed_position": {"feed_position": np.average}})
    df.columns = df.columns.droplevel(0)

    video_base[result_key] = df[aggregation_name]

def video_intent(con, video_base):
    video_intent_aggregations(con, video_base, "item_info_click", "item_info_clicks", "count_clicks")
    video_intent_aggregations(con, video_base, "faved", "faves", "count_faves")
    video_intent_aggregations(con, video_base, "table", "replays", "replays")
    video_intent_aggregations(con, video_base, "add_to_cart", "add_to_cart", "add_to_cart")
    video_intent_aggregations(con, video_base, "carousel", "carousel", "carousel")
    return video_base
```


In cleanup you can get the values from video_raw once, then reuse
them:

```
watched = video_raw['count_watched']
yesterday = video_raw['count_watched_yesterday']
seven = video_raw['count_watched_seven']

video_raw['percent_yesterday'] = (watched - yesterday) / yesterday
video_raw['percent_seven'] = (watched - seven) / seven
```
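
One caveat worth noting (not from the original answer): if a video has a zero count in either denominator, these divisions produce inf. A possible guard:

```
# Replace infinities from zero-count denominators with NaN.
video_raw = video_raw.replace([np.inf, -np.inf], np.nan)
```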


And finally in to_excel you could extract the columns into a separate
list, then call the method on each of them:

```
columns = [('B:B', 13, integer),
           ('C:C', 17, percentage),
           ('D:D', 19, percentage),
           ('E:E', 15, integer),
           ('F:F', 15, percentage),
           ('G:G', 15, decimal),
           ('H:H', 13, integer),
           ('I:I', 13, integer),
           ('J:J', 13, integer),
           ('K:K', 13, integer),
           ('L:L', 13, integer)]

for column in columns:
    worksheet.set_column(*column)
```
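
The integer, percentage and decimal values above are presumably xlsxwriter format objects defined elsewhere in the script. If they aren't, they might look something like this (a sketch; the exact number-format strings are assumptions):

```
import xlsxwriter

workbook = xlsxwriter.Workbook('daily_report.xlsx')
worksheet = workbook.add_worksheet()

# Assumed cell formats; adjust the format strings to taste.
integer = workbook.add_format({'num_format': '#,##0'})
percentage = workbook.add_format({'num_format': '0.0%'})
decimal = workbook.add_format({'num_format': '0.00'})
```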


Context

StackExchange Code Review Q#92379, answer score: 3
