HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Fetching, processing, and storing Mixpanel analytics data to SQLite

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
mixpanelandsqlitedatafetchingprocessingstoringanalytics

Problem

I'm a self-taught Python programmer and I never really learned the fundamentals of programming, so I want to see how to improve upon this script and make it adhere to best practices.

The script has three functions that retrieve data from an API, cleanse the data and store it in a sqlite db. This script is going to run daily on a cron and append to the sqlite tables every morning.

-
get_data() fetches the data and turns it into a pandas dataframe.

-
data_cleanse() removes some non-necessary data.

-
send_to_db() sends the cleansed data to a sqlite db, there is one table for each of the event types.

All of the functions are called in a for-loop which iterates through each of the event types.

I'm open to any suggestions on how to improve this, but here are some thoughts/questions that I have:

-
Should this be a class? I have never used one before because I always found plain functions to be less confusing.

-
Should I be using a if __name__ == "__main__":?

```
import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db

def get_data(start_date, end_date, event_name):
con_data = Mixpanel(API_KEY, API_SECRET)

data = con_data.request(['export'], {
'event': [event_name],
'from_date': start_date,
'to_date': end_date
})

parameters = set()

events = []

for line in data.split('\n'):
try:
event = json.loads(line)
ev = event['properties']
except ValueError:
continue

parameters.update(ev.keys())
events.append(ev)

df = pd.DataFrame(events)

return df, event_name

def data_cleanse(df, event_name):
if event_name == "Video Played":
df = df[['$ios_ifa',
'Groups',
'Lifetime Number of Sessions',
'Days Since Last Visit',
'time',
'Product ID',
'Time Watche

Solution

From a first look the code is nice enough, meaning you can follow the
control flow easily and it's clear what each function does, so
the separation of the three or four steps is rather good.

  • This script is rather small and you don't pass around too much data;


I'd say leave it as is unless you want to have some more reuse in
other scripts.

  • Yes please, do use __name__ if only for consistency.



I'd also suggest the following to clean up anyway:

  • Move constants to the top.



  • Use the values from table_names instead of event_types so you


don't repeat the keys all the time, so you end up with a constant
EVENT_TYPES = {'Video Played': ...}.

  • That way you can also iterate over the event types and table names


simultaneously, eliminating the need to look up table names in
send_to_db.

  • I'd use timedelta with named arguments, so it's a bit clearer what


timedelta(1) means, i.e. use timedelta(days=1) instead.

  • The additional return value for event_name from get_data and


data_cleanse doesn't make much sense to me. It's not like you
transform the event name, so I'd just drop that altogether.

  • The spurious print statements could be replaced by logging calls


instead, but this looks just like debugging statements anyway?

  • The exception handling in get_data could be clearer; I've moved the


access via 'properties' after the try block to make it clearer
which operation can actually fail there.

  • Database connections should probably be protected by


with closing(...) just in case.

  • parameters in get_data is unused.



  • The cases in data_cleanse are duplicated and can be condensed.



  • The comparison foo == 'x' or 'y' doesn't do what you mean. Compare


'a' == 'x' or 'y', which is 'y', with 'a' == 'x' or 'a' == 'y',
which returns False. In any case, this comparison can be rewritten
with x in (...) instead. You also miss the (currently impossible)
else case; depending on your code I'd rather just have the default
case and handle "Video Played" additionally or raise an exception
yourself.

  • The line after # remove admin users from data looks fishy, but I


don't really know how to improve it.

And finally you can always check with flake8 and similar tools for
style violations.
All in all:

import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db
from contextlib import closing

API_KEY = 'xxxxxxx'
API_SECRET = 'xxxxxxx'

EVENT_TYPES = {
    'Video Played': 'video_played',
    'Item Information Click': 'item_info_click',
    'Faved': 'faved',
    'Add to Cart': 'add_to_cart',
    'Tap to Replay': 'replay'
}

DEFAULT_COLUMNS = [
    ('$ios_ifa', 'ios_id'),
    ('Groups', 'groups'),
    ('Lifetime Number of Sessions', 'lifetime_sessions'),
    ('Days Since Last Visit', 'days_since'),
    ('time', 'time'),
    ('Product ID', 'product_id'),
]

VIDEO_COLUMNS = list(DEFAULT_COLUMNS).extend([
    ('Time Watched', 'time_watched'),
    ('Video Length', 'video_length')
])

def get_data(start_date, end_date, event_name):
    con_data = Mixpanel(API_KEY, API_SECRET)

    data = con_data.request(['export'], {
        'event': [event_name],
        'from_date': start_date,
        'to_date': end_date
    })

    events = []

    for line in data.split('\n'):
        try:
            event = json.loads(line)
        except ValueError:
            continue

        events.append(event['properties'])

    return pd.DataFrame(events)

def data_cleanse(df, event_name):
    columns = DEFAULT_COLUMNS
    if event_name == "Video Played":
        columns = VIDEO_COLUMNS

    df = df[[c[0] for c in columns]]
    df.columns = [c[1] for c in columns]

    df['groups'] = df['groups'].astype(str)

    # remove admin users from data
    idx = df['groups'].isin(['[u\'Admin-Personal\']', '[u\'Admin\']'])
    df = df[~idx]

    # remove '0' lifetime session users from data
    idx = df['lifetime_sessions'].isin([0])
    df = df[~idx]

    return df

def send_to_db(df, table_name):
    with closing(db.connect('/code/vid_score/test.db')) as con:
        df.to_sql(table_name, con, flavor='sqlite', if_exists='append')

def main():
    end_date = date.today() - timedelta(days=1)
    start_date = end_date

    for (event_name, table_name) in EVENT_TYPES.iteritems():
        df = get_data(start_date, end_date, event_name)
        df = data_cleanse(df, event_name)
        send_to_db(df, table_name)

if __name__ == "__main__":
    main()

Code Snippets

import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db
from contextlib import closing


API_KEY = 'xxxxxxx'
API_SECRET = 'xxxxxxx'

EVENT_TYPES = {
    'Video Played': 'video_played',
    'Item Information Click': 'item_info_click',
    'Faved': 'faved',
    'Add to Cart': 'add_to_cart',
    'Tap to Replay': 'replay'
}

DEFAULT_COLUMNS = [
    ('$ios_ifa', 'ios_id'),
    ('Groups', 'groups'),
    ('Lifetime Number of Sessions', 'lifetime_sessions'),
    ('Days Since Last Visit', 'days_since'),
    ('time', 'time'),
    ('Product ID', 'product_id'),
]

VIDEO_COLUMNS = list(DEFAULT_COLUMNS).extend([
    ('Time Watched', 'time_watched'),
    ('Video Length', 'video_length')
])


def get_data(start_date, end_date, event_name):
    con_data = Mixpanel(API_KEY, API_SECRET)

    data = con_data.request(['export'], {
        'event': [event_name],
        'from_date': start_date,
        'to_date': end_date
    })

    events = []

    for line in data.split('\n'):
        try:
            event = json.loads(line)
        except ValueError:
            continue

        events.append(event['properties'])

    return pd.DataFrame(events)


def data_cleanse(df, event_name):
    columns = DEFAULT_COLUMNS
    if event_name == "Video Played":
        columns = VIDEO_COLUMNS

    df = df[[c[0] for c in columns]]
    df.columns = [c[1] for c in columns]

    df['groups'] = df['groups'].astype(str)

    # remove admin users from data
    idx = df['groups'].isin(['[u\'Admin-Personal\']', '[u\'Admin\']'])
    df = df[~idx]

    # remove '0' lifetime session users from data
    idx = df['lifetime_sessions'].isin([0])
    df = df[~idx]

    return df


def send_to_db(df, table_name):
    with closing(db.connect('/code/vid_score/test.db')) as con:
        df.to_sql(table_name, con, flavor='sqlite', if_exists='append')


def main():
    end_date = date.today() - timedelta(days=1)
    start_date = end_date

    for (event_name, table_name) in EVENT_TYPES.iteritems():
        df = get_data(start_date, end_date, event_name)
        df = data_cleanse(df, event_name)
        send_to_db(df, table_name)


if __name__ == "__main__":
    main()

Context

StackExchange Code Review Q#91133, answer score: 3

Revisions (0)

No revisions yet.