Read daily files and concatenate them
Problem
Edit - here is my modified code: http://jsfiddle.net/#&togetherjs=GzytydCsRh
Can someone take a look and give me some feedback? It still seems a bit long, but this is the first time I have used functions.
I am reading a bunch of CSV files and using glob to concatenate them all together into separate dataframes. I eventually join them together and basically create a single large file which I use to connect to a dashboard. I am not too familiar with Python, but I have used Pandas and sklearn often.
As you can see, I am basically just reading the last 60 (or more) days' worth of data (the last 60 files) and creating a dataframe for each. This works, but I am wondering if there is a more Pythonic/better way? I watched a PyData video (about not being restricted by PEP 8 and making sure your code is Pythonic) which was interesting.
(FYI - the reason I need to read 60 days' worth of files is that customers can fill out a survey about a call which happened a long time ago. A customer might fill out a survey today about a call that happened in July, and I need to know about that call: how long it lasted, what the topic was, etc.)
```
import pandas as pd
import numpy as np
from pandas import *
import datetime as dt
import os
from glob import glob
os.chdir(r'C:\\Users\Documents\FTP\\')
loc = r'C:\\Users\Documents\\'
rosterloc = r'\\mand\\'
splitsname = r'Splits.csv'
fcrname = r'global_disp_'
npsname = r'survey_'
ahtname = r'callbycall_'
rostername = 'Daily_Roster.csv'
vasname = r'vas_report_'
ext ='.csv'
startdate = dt.date.today() - Timedelta('60 day')
enddate = dt.date.today()
daterange = Timestamp(enddate) - Timestamp(startdate)
daterange = (daterange / np.timedelta64(1, 'D')).astype(int)
data = []
frames = []
calls = []
bracket = []
try:
    for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
        aht = pd.read_csv(ahtname+date_range.strftime('%Y_%m_%d')+ext)
        calls.append(aht)
except IOError:
    print('File does not exist:', ahtname+da
```
Solution
Use a class, or at least some functions, to make your code more readable and understandable
- My very first reaction looking at your code is... blech. I don't want to read that giant blob.
- Why not make a class to bundle a bunch of functions together, such as a function `readAndConcatAHT`? Actually, many of these for loops are doing the exact same thing for slightly differently named files. Why not write a function that takes in a filename and then runs a for loop, like so:
```
def readAndConcatFile(filename, daterange):
    # Relies on startdate, ext and data being defined at module level, as in the question.
    try:
        for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
            fcr = pd.read_csv(filename+date_range.strftime('%m_%d_%Y')+ext, parse_dates = ['call_time'])
            data.append(fcr)
    except IOError:
        print('File does not exist:', filename+date_range.strftime('%m_%d_%Y')+ext)
```
This would really clear out your code even if you elected not to write any other functions or a class. I think it's also fairer to your reader and yourself to respect DRY and not make readers check for themselves when you are doing something absolutely repetitive with slightly different names.
- I'll add an extra point to say I think a class would be nice, because in your `__init__` or some `processing` function you could string together a bunch of calls to `readAndConcatFile` to standardize your read/write process for these CSV files. This will, again, make your code more extensible and more readable.
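The function above still depends on module-level names (`startdate`, `ext`, `data`), and because the `try` wraps the whole loop, the first missing file ends the loop. A minimal sketch of a variant that takes everything as parameters and skips missing files one at a time could look like the following. The names `read_daily_files`, `reader` and `base_dir` are hypothetical and not from the original code; `reader` stands in for something like `pd.read_csv`, so the sketch itself needs only the standard library.

```python
import datetime as dt
from pathlib import Path


def read_daily_files(prefix, days, date_format, reader, ext='.csv', base_dir='.'):
    """Load one file per day and return the loaded objects in date order.

    `reader` is any callable that accepts a path (assumed here to be
    something like pd.read_csv). Missing files are reported and skipped.
    """
    loaded = []
    for day in days:
        path = Path(base_dir) / (prefix + day.strftime(date_format) + ext)
        try:
            loaded.append(reader(path))
        except FileNotFoundError:
            # Only this one file is skipped; the loop carries on.
            print('File does not exist:', path)
    return loaded
```

With pandas, a call such as `pd.concat(read_daily_files('callbycall_', days, '%Y_%m_%d', pd.read_csv))` would then build one dataframe per file family, assuming at least one file was found.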
Avoid redundant import statements and stick with standards
- Almost everyone uses `import pandas as pd`. I wouldn't recommend doing it any other way, and it's never a good idea to do a wholesale `import *`.
- Don't import `glob` unless you are actually using it. Where do you use `glob` after importing it?
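On the import point: in the part of the question's code that is shown, the star import only supplies `Timestamp` and `Timedelta`, and numpy is only used to turn the date difference back into a day count. As a rough sketch, assuming plain calendar dates are all that is needed, the same list of days can be produced with the standard `datetime` module alone. The helper name `last_n_days` is made up for illustration.

```python
import datetime as dt


def last_n_days(n, end=None):
    """Return a generator over the n dates before `end`, oldest first.

    `end` itself is excluded, matching range(daterange) in the question.
    """
    end = end or dt.date.today()
    start = end - dt.timedelta(days=n)
    return (start + dt.timedelta(days=i) for i in range(n))
```

If pandas types are still wanted elsewhere, `pd.Timestamp` and `pd.Timedelta` are available through the usual `pd` alias without the star import.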
Use special features only when you need them
- Do you actually need raw strings? I don't see you using your strings in any way that would seem to require them.
- Similarly, why use `os.chdir` when it could be smarter to specify filenames as absolute file names? Here you're again using an option you don't really need and that could have unintended side effects in the future.
Use more defined constants
- It's not a good idea to hard-code `Timedelta('60 day')` like so. You should separately specify `DAY_RANGE = 60` as a constant and then use that wherever you'd use 60. That way you can easily change the day range. Alternatively, you could make the day range an input parameter to your script so that non-programmer users can also call this script for their desired look-back period.
- Similarly, you can save your desired date formats as strings to be treated as constants at the top of your file: `format1 = '%m_%d_%Y'` and `format2 = '%Y_%m_%d'`. Again, this makes it easier to see what's going on and also makes it easier to make changes in the future. You can change just one string at the top of your file to change all related formatting, rather than having to change each string. This won't make any given line of code shorter, but it will make it better.
More sophisticated error handling
- Error handling is not something I do enough of myself, but I wonder if you can do better here. I'm going to assume that errors in `ahtname` are related, for example, to errors in `fcrname`. If that's the case, once you establish that a date is missing for one kind of file, why not delete that date for all further queries in future loops? You could do so easily by simply deleting that member of `daterange` that causes the `IOError`. Then you wouldn't get repetitive error messages that are really all telling you the same thing.
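The last three suggestions (absolute paths instead of `os.chdir`, named constants, and dropping a date once it is known to be missing) can be combined in one rough sketch. It assumes, as the answer does, that a date missing for one kind of file is not worth querying for the others. `BASE_DIR` is a placeholder path, and `daily_path` and `usable_days` are hypothetical helper names; only the file prefixes and date formats come from the question.

```python
import datetime as dt
from pathlib import Path

# Constants collected at the top of the file.
DAY_RANGE = 60                    # look-back period in days
FORMAT_MDY = '%m_%d_%Y'           # used by the global_disp_ files
FORMAT_YMD = '%Y_%m_%d'           # used by the callbycall_ files
EXT = '.csv'
BASE_DIR = Path('/path/to/FTP')   # placeholder: substitute the real folder


def daily_path(base_dir, prefix, day, date_format):
    """Build the full path of one daily file, with no os.chdir involved."""
    return Path(base_dir) / (prefix + day.strftime(date_format) + EXT)


def usable_days(base_dir, families, days):
    """Drop each date as soon as any family lacks a file for it.

    `families` is a list of (prefix, date_format) pairs. Returns
    (kept_days, missing_days); each missing date is reported only once.
    """
    kept = list(days)
    missing = []
    for prefix, date_format in families:
        still_kept = []
        for day in kept:
            path = daily_path(base_dir, prefix, day, date_format)
            if path.is_file():
                still_kept.append(day)
            else:
                print('File does not exist:', path)
                missing.append(day)
        kept = still_kept
    return kept, missing
```

Checking with `is_file()` up front is a design choice made for this sketch; catching the `IOError` from the read call and removing the date at that point would serve the same purpose.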
What I liked
- It's good practice to use generator expressions where you can, so I liked seeing code like `for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):`.
Code Snippets
```
def readAndConcatFile(filename, daterange):
    # Relies on startdate, ext and data being defined at module level, as in the question.
    try:
        for date_range in (Timestamp(startdate) + dt.timedelta(n) for n in range(daterange)):
            fcr = pd.read_csv(filename+date_range.strftime('%m_%d_%Y')+ext, parse_dates = ['call_time'])
            data.append(fcr)
    except IOError:
        print('File does not exist:', filename+date_range.strftime('%m_%d_%Y')+ext)
```
Context
StackExchange Code Review Q#104050, answer score: 5