Backing up files to remote servers
Problem
I have code that backs up files to remote servers.

There are 2 pandas data frames:

- `storedfiles`: points to the files already stored in previous runs
- `filelist`: a list of candidates to be stored in one of the depots (comes from a config file).

Filelist structure has the following columns:

```
In [10]: filelist
Out[10]:
                     Name  Objects  Class Subclass Creator  \
0  /backups/dir001/test00      400  waves     time    john
1  /backups/dir002/test00      400  waves     time    john

              Created  Datahost Res  Total Size
0  12-Feb-15 10:10:59       NaN   D     1609728
1  04-Jan-15 14:40:38       NaN   D     1609728
```

To evaluate the storage I check the version on `filelist` against the version on `storedfiles`. If the version on `filelist` is newer, it will be stored in some depot (the old version will be deleted). (edit) But to get the "version" I "need" to fetch it from an XML structure, using a class to read the XML and process the data in it.

```
# 'a' is an instance of an internal XML processing class

# This fetches the XML structure related to the file 'f' from filelist.
# type(xmlfile) is 'str'
xmlfile = a.get_xml(f)

# This is an object holding the structured data loaded from xmlfile.
# type(xmldata) is aXML
xmldata = a.load(xmlfile)

# With that I can fetch information from my file, for example:
# this returns a datetime.datetime structure
type(xmldata.get_data_created())

# As I do at:
version = int(xmldata.get_data_created().strftime("%s"))
```

To control versions I get the Created field (that stores a datetime) and convert it to a Unix timestamp.

My main concern is that I use too many `if`s. I feel the code is ugly and I would like hints on how to write neater, prettier code.

```
# Stored files is in fact a local csv file with this format
# It holds all file names that have been processed and stored in the "depot" server
storedfiles= """
Filename,Version,Depot
foo.txt,12342412,server1
bar.mov,14144862,ser
```
Solution
In general, iterating through the rows of a series or dataframe is slow, and is not the recommended process. Instead, you should do one of two things:
- use `map` or `apply` (`map` for series, generally `apply` for data frame; see answers to this question for more details)
- select/broadcast
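As a rough illustration of the two options above, here is a minimal sketch on a made-up dataframe. The column names (`name`, `size`, `ext`, `label`) and the values are purely hypothetical and do not come from the question.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({"name": ["a.txt", "b.txt", "c.txt"], "size": [10, 250, 40]})

# map: element-wise transformation of a single column (a Series).
df["ext"] = df["name"].map(lambda n: n.rsplit(".", 1)[-1])

# apply with axis=1: the function receives one row at a time.
df["label"] = df.apply(lambda row: f"{row['name']}:{row['size']}", axis=1)

# select/broadcast: the comparison is evaluated for the whole column at once,
# and the resulting boolean mask picks out the matching rows.
large = df[df["size"] > 30]
```

Neither form needs an explicit loop over rows or an `if` per row: the condition becomes a boolean mask and the per-row work becomes a function.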
In this case, you have the opportunity to do both. You should use `apply` to get the version information for your `filelist` dataframe:

```
def function_that_does_your_xml_stuff(row):
    """Get version information for a row in the filelist dataframe from XML.

    Your code above sets the filename as f, but then doesn't seem to use it
    which I find confusing. Perhaps it was a typo, and `a` should be `f`?
    Of course, a vanilla string object doesn't have a get_xml() method.
    """
    # Assumes the file name lives in the 'Name' column shown in filelist.
    filename = row['Name']
    xmlfile = a.get_xml(filename)
    xmldata = a.load(xmlfile)
    version = int(xmldata.get_data_created().strftime("%s"))
    return version

filelist['Version'] = filelist.apply(function_that_does_your_xml_stuff, axis=1)
```

(The `axis=1` argument here tells `apply` to work on rows instead of columns.) This adds a new column to your `filelist` dataframe which contains the version information as obtained from the XML data associated with each file name.

Then, to compare with your already stored versions, you will use a selector based on your versioning criterion. Actually, in this case, you're going to want to perform a join on the two data frames first, and then use the selector:

```
# join matches filelist['Name'] against the index of the other frame, so
# sfdf (assumed to be the stored-files dataframe) is indexed by 'Filename' first.
filelist = filelist.join(sfdf.set_index('Filename'), how='left', on='Name', rsuffix='_stored')
need_to_update = filelist[filelist['Version'] > filelist['Version_stored']]
need_to_update.apply(actually_update, axis=1)
```

The join here accomplishes the task of connecting the files that have already been backed up with those in `filelist`; the second line selects out only those files whose current version is greater than the one that was stored previously, and the third line actually performs the update (again, by applying a function).
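To make the join-then-select idea concrete, here is a small self-contained sketch. Everything in it is a stand-in: the `CREATED` dictionary replaces the XML lookup, `get_version` and `stored` are hypothetical names, the `dir003` entry and the stored versions are made up, and treating the timestamps as UTC is an assumption. It also shows two details worth considering: `datetime.timestamp()` as a portable alternative to `strftime("%s")` (the `%s` code is a platform extension and is not available everywhere), and an explicit check for files that were never stored, since a comparison against a missing `Version_stored` evaluates to false.

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical stand-in for the XML lookup: file name -> creation time.
# The dir003 entry is invented to represent a file that was never stored.
CREATED = {
    "/backups/dir001/test00": datetime(2015, 2, 12, 10, 10, 59, tzinfo=timezone.utc),
    "/backups/dir002/test00": datetime(2015, 1, 4, 14, 40, 38, tzinfo=timezone.utc),
    "/backups/dir003/test00": datetime(2015, 3, 1, 8, 0, 0, tzinfo=timezone.utc),
}


def get_version(row):
    """Return the creation time of the row's file as a Unix timestamp."""
    # timestamp() works on every platform, unlike strftime("%s").
    return int(CREATED[row["Name"]].timestamp())


filelist = pd.DataFrame({"Name": list(CREATED)})
filelist["Version"] = filelist.apply(get_version, axis=1)

# Made-up stored state: dir001 is outdated, dir002 is current, dir003 is absent.
stored = pd.DataFrame({
    "Filename": ["/backups/dir001/test00", "/backups/dir002/test00"],
    "Version": [filelist.loc[0, "Version"] - 100, filelist.loc[1, "Version"]],
    "Depot": ["server1", "server1"],
})

# Left join: every candidate is kept; unmatched rows get NaN in Version_stored.
merged = filelist.join(
    stored.set_index("Filename"), on="Name", how="left", rsuffix="_stored"
)

never_stored = merged["Version_stored"].isna()
outdated = merged["Version"] > merged["Version_stored"]
need_to_update = merged[never_stored | outdated]

to_backup = need_to_update["Name"].tolist()
```

With this made-up data, `to_backup` contains the outdated `dir001` file and the never-stored `dir003` file, while the up-to-date `dir002` file is skipped.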
Context
StackExchange Code Review Q#120076, answer score: 3