Appending unique data to a postgres table
Problem
I receive daily files (filename_%m_%d_%Y.csv) from a client; I read them into pandas, process them, and store them in Postgres. Sometimes there are delays and we do not get the data for a few days. The problem is that everything runs as a scheduled task, and I do not want to manually check whether the files exist before I run my code, since I am using append mode on Postgres (and do not have constraints set up yet).
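For context, a rough sketch of the daily load being described, with hypothetical file, table, and connection details; the key point is that `to_sql` in append mode with no constraints will happily insert the same rows twice:

```python
import pandas as pd
import sqlalchemy as sa

# Hypothetical connection string and table name; the question only says "Postgres" and "append mode".
engine = sa.create_engine('postgresql://usr:pass@localhost:5432/db')

file_date = pd.Timestamp.now(tz='US/Central').strftime('%m_%d_%Y')
df = pd.read_csv(f'filename_{file_date}.csv')        # today's daily file
# ... processing ...
df.to_sql('report', engine, if_exists='append', index=False)   # no unique constraint yet
```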
My approach here is to read the last few days of data from my existing database, and then find the max date in order to know where to start reading new data from. So if the max date is Jan 22, I will know that the start date for reading my new files is Jan 23 onward. Is this a decent approach?
```python
import datetime as dt

import pandas as pd
import pytz
import sqlalchemy as sa

central_time = dt.datetime.now(pytz.timezone('US/Central'))
start_date = central_time - pd.Timedelta(days=7)                 # look back one week
engine = sa.create_engine('postgresql://usr:pass@localhost:XXXX/db')
df = pd.read_sql(sa.text('SELECT * FROM table WHERE report_date > :date'), engine, params={'date': start_date}, parse_dates=['report_date'])
latest_date = df.report_date.max() + pd.Timedelta(days=1)        # first day to read new files from
```

Up until this week I have been exclusively using .csv files. My approach here was to find the beginning of the month and read all files from the beginning of the month until the current date. This meant that every day I would overwrite my file (so this month I would read the Jan 1 - 27 daily files and create a January file, then tomorrow I would read the Jan 1 - 28 daily files and then overwrite my previous MTD file).
```python
import numpy as np  # continues from the snippet above (central_time, pd, dt already imported)

start_date = (central_time - pd.Timedelta(days=0)) - pd.offsets.MonthBegin()
end_date = central_time
days = int((pd.Timestamp(end_date) - pd.Timestamp(start_date)) / np.timedelta64(1, 'D'))

def last_n_days(filename, num_of_days, include_curr=False):
    f = '%m_%d_%Y'
    curr = end_date
    if include_curr:
        yield curr.strftime(f)
    for num in range(num_of_days):
        curr = curr - dt.timedelta(days=1)
        yield curr.strftime(f)
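
# Example usage (hypothetical): the month-to-date list of file names the code re-reads each day.
# Note that the `filename` argument is currently unused inside last_n_days.
mtd_files = [f'filename_{d}.csv' for d in last_n_days('filename', days, include_curr=True)]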
```

Solution
First of all I don't know Python, so I will try my best.
For the first portion of the code, why do you use a 7-day offset? If the file is delayed by more than 7 days, your script will crash. I would do a SQL query for that instead, one like this:
```sql
SELECT max(report_date) FROM table;
```
and now you have the latest date.
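A minimal sketch of how that query could replace the fixed 7-day window, assuming the engine and the placeholder table/column names from the question:

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine('postgresql://usr:pass@localhost:XXXX/db')  # connection string from the question

# Ask the database directly for the newest report_date instead of guessing a window.
with engine.connect() as conn:
    latest_date = conn.execute(sa.text('SELECT max(report_date) FROM table')).scalar()

# First day for which new files still need to be read (an empty table would need explicit handling).
start_date = pd.Timestamp(latest_date) + pd.Timedelta(days=1)
```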
For the second part, your code seems OK, but personally I don't see the reason to update the whole month every day; that will become problematic.
For me, the best solution is to find the latest date (I gave you the query above) and, after that, advance day by day looking for files up to today, as sketched below.
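A minimal sketch of that day-by-day loop, assuming the question's filename pattern; the start date value and the loading step are placeholders for the real pipeline:

```python
import pathlib

import pandas as pd

# Day after max(report_date), e.g. obtained with the query above (hypothetical example value).
start_date = pd.Timestamp('2024-01-23')
today = pd.Timestamp.now(tz='US/Central').tz_localize(None).normalize()

# Walk forward one day at a time and only load files that actually exist.
for day in pd.date_range(start_date, today, freq='D'):
    path = pathlib.Path(f"filename_{day.strftime('%m_%d_%Y')}.csv")
    if path.exists():
        print(f'loading {path}')  # read, process, and append to Postgres here
```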
Context
StackExchange Code Review Q#118052, answer score: 2