Reading groups of files and concatenating them

Submitted by: @import:stackexchange-codereview

Problem

I have made some adjustments to some code that you can see on this thread:

Read daily files and concatenate them

I would like to make some further refinements, make sure I am on the right track, and organize some of my files better. I added some functions and made some other changes to prevent errors and to make my code easier to reuse (and more Pythonic).

```
#got rid of import *
import pandas as pd
import numpy as np
import datetime as dt

ftploc = r'C:\Users\FTP\\'
loc = r'C:\Users\\'
splitsname = 'Splits'
fcrname = 'fcr_report_'
npsname = 'csat_report_'
ahtname = 'aht_report_'
rostername = 'Daily_Roster'
vasname = 'vas_report_'
ext ='.csv'

#had to create some periods and date format parameters
start_period = '13 day'
end_period = '1 day'
fcr_period = '3 day'
date_format1 = '%m_%d_%Y'
date_format2 = '%Y_%m_%d'
start_date = dt.date.today() - pd.Timedelta(start_period)
end_date = dt.date.today() - pd.Timedelta(end_period)
fcr_end_date = end_date - pd.Timedelta(fcr_period)
daterange1 = pd.Timestamp(end_date) - pd.Timestamp(start_date)
daterange2 = pd.Timestamp(fcr_end_date) - pd.Timestamp(start_date)
daterange1 = (daterange1 / np.timedelta64(1, 'D')).astype(int)
daterange2 = (daterange2 / np.timedelta64(1, 'D')).astype(int)
print('Starting scrubbing file...')

#AHT files have a different date format in the filename so I made this function
def dateFormat(filename):
    if filename == ahtname:
        return date_format2
    else:
        return date_format1

#FCR is 3 days delayed (72 hour window) so I needed to create some logic to adjust for it
def dateRange(filename):
    if filename == fcrname:
        return daterange2
    else:
        return daterange1

#this function works on all of my files now. I just wonder if there is a better way to refer to the other functions? Is having a separate function for the date range and format ideal?
def readAndConcatFile(filename, daterange):
    df_list = []
    try:
        for date_range in (pd.Timestamp(startdate) +
```

Solution

Please note that I'm not sure I'm reading your script correctly. I compared it to the original to see what was lost in how you call the functions. If I've misunderstood entirely, please let me know.

In your original script, one of your read_csv calls passed in a 'date_completed' key, which you left out here so that one function could serve all files. You can still pass that information by using a default value. A default value in a function's parameter list means the variable exists inside the function even when the caller doesn't supply it. In your case it would be good to have a default for parse_dates.

def readAndConcatFile(filename, daterange, parse_dates=True):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates=parse_dates)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)


This means that when no value is supplied, True is passed to parse_dates, as you're currently doing. However, you can also pass specific columns like you did previously.

nps = readAndConcatFile(npsname, daterange, ['call_date','date_completed'])
vas = readAndConcatFile(vasname, daterange, ['Call_date'])
fcr = readAndConcatFile(fcrname, daterange, ['call_time'])
aht = readAndConcatFile(ahtname, daterange)
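
To see the default in isolation, here is a minimal sketch that is not part of your script. The `read_report` helper and the sample CSV text are hypothetical stand-ins, and the data is read from an in-memory string rather than from your FTP folder.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for one daily report file.
CSV_TEXT = "call_date,score\n2015-09-01,4\n2015-09-02,5\n"


def read_report(text, parse_dates=True):
    """Toy stand-in for readAndConcatFile: one read_csv call, one default."""
    return pd.read_csv(io.StringIO(text), parse_dates=parse_dates)


# No second argument: parse_dates falls back to True inside the function.
default_df = read_report(CSV_TEXT)

# Explicit list: the named column is parsed as dates.
explicit_df = read_report(CSV_TEXT, ['call_date'])

print(default_df['call_date'].dtype)
print(explicit_df['call_date'].dtype)
```

As far as I understand the pandas documentation, parse_dates=True only asks read_csv to try parsing the index, so in this sketch the call_date column is converted only in the second call, where it is named explicitly.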


However, I noticed that you previously passed nothing at all in your call for aht, whereas you're now passing True. If that's something you'd like to avoid, it's easy with a slight modification. When you want a default that makes a parameter optional, set the default to None and then test inside the function whether an argument was actually passed.

def readAndConcatFile(filename, daterange, parse_dates=None):
    df_list = []
    try:
        for date_range in (pd.Timestamp(start_date) + dt.timedelta(n) for n in range(dateRange(filename))):
            if parse_dates is None:
                df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext)
            else:
                df = pd.read_csv(ftploc + filename + date_range.strftime(dateFormat(filename)) + ext, parse_dates=parse_dates)
            df_list.append(df)
        return pd.concat(df_list)
    except IOError:
        print('File does not exist: ', filename + date_range.strftime(dateFormat(filename)) + ext)


This means you no longer have to pass True to parse_dates for aht just to have the function work.
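
If the duplicated read_csv line bothers you, one possible variation is to build a dictionary of keyword arguments and only add parse_dates when it was supplied. The sketch below is self-contained so it can be run anywhere: `read_and_concat`, its parameters, and the temporary demo files are all hypothetical names of mine, not your real script. It also makes one design change you may not want: it assumes a missing day should be skipped, so the try sits around each read rather than around the whole loop.

```python
import datetime as dt
import os
import tempfile

import pandas as pd


def read_and_concat(folder, prefix, start, days, date_format, parse_dates=None):
    """Read one CSV per day and concatenate whatever was found."""
    # Only forward parse_dates when the caller supplied it, so that
    # read_csv keeps its own default otherwise.
    kwargs = {} if parse_dates is None else {'parse_dates': parse_dates}
    frames = []
    for n in range(days):
        day = start + dt.timedelta(days=n)
        path = os.path.join(folder, prefix + day.strftime(date_format) + '.csv')
        try:
            frames.append(pd.read_csv(path, **kwargs))
        except IOError:
            # Assumption: a missing day is reported and skipped.
            print('File does not exist:', path)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demo with throwaway files; the middle day is deliberately missing.
with tempfile.TemporaryDirectory() as folder:
    start = dt.date(2015, 9, 1)
    for n in (0, 2):
        day = start + dt.timedelta(days=n)
        path = os.path.join(folder, 'vas_report_' + day.strftime('%m_%d_%Y') + '.csv')
        with open(path, 'w') as handle:
            handle.write('Call_date,score\n{},{}\n'.format(day.isoformat(), n))

    vas = read_and_concat(folder, 'vas_report_', start, 3, '%m_%d_%Y', ['Call_date'])
    plain = read_and_concat(folder, 'vas_report_', start, 3, '%m_%d_%Y')

print(len(vas))
```

Passing the folder, start date, and date format as arguments instead of reading module-level globals is also just a choice made to keep the sketch runnable on its own; the None check works the same way with your existing globals.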

Context

StackExchange Code Review Q#104338, answer score: 2
