patternpythonMinor
Fetching, processing, and storing Mixpanel analytics data to SQLite
Viewed 0 times
mixpanelandsqlitedatafetchingprocessingstoringanalytics
Problem
I'm a self-taught Python programmer and I never really learned the fundamentals of programming, so I want to see how to improve upon this script and make it adhere to best practices.
The script has three functions that retrieve data from an API, cleanse the data and store it in a sqlite db. This script is going to run daily on a cron and append to the sqlite tables every morning.
-
-
-
All of the functions are called in a
I'm open to any suggestions on how to improve this, but here are some thoughts/questions that I have:
-
Should this be a class? I have never used one before because I always found plain functions to be less confusing.
-
Should I be using a
```
import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db
def get_data(start_date, end_date, event_name):
con_data = Mixpanel(API_KEY, API_SECRET)
data = con_data.request(['export'], {
'event': [event_name],
'from_date': start_date,
'to_date': end_date
})
parameters = set()
events = []
for line in data.split('\n'):
try:
event = json.loads(line)
ev = event['properties']
except ValueError:
continue
parameters.update(ev.keys())
events.append(ev)
df = pd.DataFrame(events)
return df, event_name
def data_cleanse(df, event_name):
if event_name == "Video Played":
df = df[['$ios_ifa',
'Groups',
'Lifetime Number of Sessions',
'Days Since Last Visit',
'time',
'Product ID',
'Time Watche
The script has three functions that retrieve data from an API, cleanse the data and store it in a sqlite db. This script is going to run daily on a cron and append to the sqlite tables every morning.
-
get_data() fetches the data and turns it into a pandas dataframe.-
data_cleanse() removes some non-necessary data.-
send_to_db() sends the cleansed data to a sqlite db, there is one table for each of the event types.All of the functions are called in a
for-loop which iterates through each of the event types.I'm open to any suggestions on how to improve this, but here are some thoughts/questions that I have:
-
Should this be a class? I have never used one before because I always found plain functions to be less confusing.
-
Should I be using a
if __name__ == "__main__":?```
import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db
def get_data(start_date, end_date, event_name):
con_data = Mixpanel(API_KEY, API_SECRET)
data = con_data.request(['export'], {
'event': [event_name],
'from_date': start_date,
'to_date': end_date
})
parameters = set()
events = []
for line in data.split('\n'):
try:
event = json.loads(line)
ev = event['properties']
except ValueError:
continue
parameters.update(ev.keys())
events.append(ev)
df = pd.DataFrame(events)
return df, event_name
def data_cleanse(df, event_name):
if event_name == "Video Played":
df = df[['$ios_ifa',
'Groups',
'Lifetime Number of Sessions',
'Days Since Last Visit',
'time',
'Product ID',
'Time Watche
Solution
From a first look the code is nice enough, meaning you can follow the
control flow easily and it's clear what each function does, so
the separation of the three or four steps is rather good.
I'd say leave it as is unless you want to have some more reuse in
other scripts.
I'd also suggest the following to clean up anyway:
don't repeat the keys all the time, so you end up with a constant
simultaneously, eliminating the need to look up table names in
transform the event name, so I'd just drop that altogether.
instead, but this looks just like debugging statements anyway?
access via
which operation can actually fail there.
which returns
with
case and handle
yourself.
don't really know how to improve it.
And finally you can always check with
style violations.
All in all:
control flow easily and it's clear what each function does, so
the separation of the three or four steps is rather good.
- This script is rather small and you don't pass around too much data;
I'd say leave it as is unless you want to have some more reuse in
other scripts.
- Yes please, do use
__name__if only for consistency.
I'd also suggest the following to clean up anyway:
- Move constants to the top.
- Use the values from
table_namesinstead ofevent_typesso you
don't repeat the keys all the time, so you end up with a constant
EVENT_TYPES = {'Video Played': ...}.- That way you can also iterate over the event types and table names
simultaneously, eliminating the need to look up table names in
send_to_db.- I'd use
timedeltawith named arguments, so it's a bit clearer what
timedelta(1) means, i.e. use timedelta(days=1) instead.- The additional return value for
event_namefromget_dataand
data_cleanse doesn't make much sense to me. It's not like youtransform the event name, so I'd just drop that altogether.
- The spurious
printstatements could be replaced byloggingcalls
instead, but this looks just like debugging statements anyway?
- The exception handling in
get_datacould be clearer; I've moved the
access via
'properties' after the try block to make it clearerwhich operation can actually fail there.
- Database connections should probably be protected by
with closing(...) just in case.parametersinget_datais unused.
- The cases in
data_cleanseare duplicated and can be condensed.
- The comparison
foo == 'x' or 'y'doesn't do what you mean. Compare
'a' == 'x' or 'y', which is 'y', with 'a' == 'x' or 'a' == 'y',which returns
False. In any case, this comparison can be rewrittenwith
x in (...) instead. You also miss the (currently impossible)else case; depending on your code I'd rather just have the defaultcase and handle
"Video Played" additionally or raise an exceptionyourself.
- The line after
# remove admin users from datalooks fishy, but I
don't really know how to improve it.
And finally you can always check with
flake8 and similar tools forstyle violations.
All in all:
import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db
from contextlib import closing
API_KEY = 'xxxxxxx'
API_SECRET = 'xxxxxxx'
EVENT_TYPES = {
'Video Played': 'video_played',
'Item Information Click': 'item_info_click',
'Faved': 'faved',
'Add to Cart': 'add_to_cart',
'Tap to Replay': 'replay'
}
DEFAULT_COLUMNS = [
('$ios_ifa', 'ios_id'),
('Groups', 'groups'),
('Lifetime Number of Sessions', 'lifetime_sessions'),
('Days Since Last Visit', 'days_since'),
('time', 'time'),
('Product ID', 'product_id'),
]
VIDEO_COLUMNS = list(DEFAULT_COLUMNS).extend([
('Time Watched', 'time_watched'),
('Video Length', 'video_length')
])
def get_data(start_date, end_date, event_name):
con_data = Mixpanel(API_KEY, API_SECRET)
data = con_data.request(['export'], {
'event': [event_name],
'from_date': start_date,
'to_date': end_date
})
events = []
for line in data.split('\n'):
try:
event = json.loads(line)
except ValueError:
continue
events.append(event['properties'])
return pd.DataFrame(events)
def data_cleanse(df, event_name):
columns = DEFAULT_COLUMNS
if event_name == "Video Played":
columns = VIDEO_COLUMNS
df = df[[c[0] for c in columns]]
df.columns = [c[1] for c in columns]
df['groups'] = df['groups'].astype(str)
# remove admin users from data
idx = df['groups'].isin(['[u\'Admin-Personal\']', '[u\'Admin\']'])
df = df[~idx]
# remove '0' lifetime session users from data
idx = df['lifetime_sessions'].isin([0])
df = df[~idx]
return df
def send_to_db(df, table_name):
with closing(db.connect('/code/vid_score/test.db')) as con:
df.to_sql(table_name, con, flavor='sqlite', if_exists='append')
def main():
end_date = date.today() - timedelta(days=1)
start_date = end_date
for (event_name, table_name) in EVENT_TYPES.iteritems():
df = get_data(start_date, end_date, event_name)
df = data_cleanse(df, event_name)
send_to_db(df, table_name)
if __name__ == "__main__":
main()Code Snippets
import pandas as pd
import json
from datetime import date, timedelta
from mixpanel_client_lib import Mixpanel
import sqlite3 as db
from contextlib import closing
API_KEY = 'xxxxxxx'
API_SECRET = 'xxxxxxx'
EVENT_TYPES = {
'Video Played': 'video_played',
'Item Information Click': 'item_info_click',
'Faved': 'faved',
'Add to Cart': 'add_to_cart',
'Tap to Replay': 'replay'
}
DEFAULT_COLUMNS = [
('$ios_ifa', 'ios_id'),
('Groups', 'groups'),
('Lifetime Number of Sessions', 'lifetime_sessions'),
('Days Since Last Visit', 'days_since'),
('time', 'time'),
('Product ID', 'product_id'),
]
VIDEO_COLUMNS = list(DEFAULT_COLUMNS).extend([
('Time Watched', 'time_watched'),
('Video Length', 'video_length')
])
def get_data(start_date, end_date, event_name):
con_data = Mixpanel(API_KEY, API_SECRET)
data = con_data.request(['export'], {
'event': [event_name],
'from_date': start_date,
'to_date': end_date
})
events = []
for line in data.split('\n'):
try:
event = json.loads(line)
except ValueError:
continue
events.append(event['properties'])
return pd.DataFrame(events)
def data_cleanse(df, event_name):
columns = DEFAULT_COLUMNS
if event_name == "Video Played":
columns = VIDEO_COLUMNS
df = df[[c[0] for c in columns]]
df.columns = [c[1] for c in columns]
df['groups'] = df['groups'].astype(str)
# remove admin users from data
idx = df['groups'].isin(['[u\'Admin-Personal\']', '[u\'Admin\']'])
df = df[~idx]
# remove '0' lifetime session users from data
idx = df['lifetime_sessions'].isin([0])
df = df[~idx]
return df
def send_to_db(df, table_name):
with closing(db.connect('/code/vid_score/test.db')) as con:
df.to_sql(table_name, con, flavor='sqlite', if_exists='append')
def main():
end_date = date.today() - timedelta(days=1)
start_date = end_date
for (event_name, table_name) in EVENT_TYPES.iteritems():
df = get_data(start_date, end_date, event_name)
df = data_cleanse(df, event_name)
send_to_db(df, table_name)
if __name__ == "__main__":
main()Context
StackExchange Code Review Q#91133, answer score: 3
Revisions (0)
No revisions yet.