Backing up files to remote servers

Problem

I have code that backs up files to remote servers.

There are two pandas data frames:

  • storedfiles: points to the files already stored in previous runs
  • filelist: a list of candidates to be stored in one of the depots (comes from a config file).

The filelist structure has the following columns:

In [10]: filelist
Out[10]: 
                 Name     Objects    Class Subclass Creator  \  
0  /backups/dir001/test00         400    waves     time   john      
1  /backups/dir002/test00         400    waves     time   john      

              Created  Datahost Res  Total Size  
0  12-Feb-15 10:10:59       NaN   D     1609728  
1  04-Jan-15 14:40:38       NaN   D     1609728


To evaluate the storage, I check the version in filelist against the version in storedfiles. If the version in filelist is newer, the file will be stored in some depot (and the old version will be deleted). (edit) But to get the "version" I "need" to fetch it from an XML structure, using a class that reads the XML and processes the data in it.

# 'a' is an instance of an internal XML processing class

# This fetches the XML structure related to the file 'f' from filelist.
# type(xmlfile) is 'str'
xmlfile = a.get_xml(f)

# This is an object holding the structured data loaded from xmlfile.
# type(xmldata) is aXML
xmldata = a.load(xmlfile)

# With that I can fetch information about my file, e.g.:
# this returns a datetime.datetime object
type(xmldata.get_data_created())

# As I do here:
version = int(xmldata.get_data_created().strftime("%s"))


To control versions I get the Created field (that stores a datetime) and convert it to a Unix timestamp.
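
A side note on the conversion itself: `"%s"` is not among the format codes listed in the Python documentation for `strftime`; it is passed through to the platform's C library, so it works on glibc-based systems but may not work elsewhere. Below is a minimal sketch of a more portable conversion, assuming Python 3. The helper name `to_version` is hypothetical, and the example date is only an illustration.

```python
from datetime import datetime, timezone


def to_version(created: datetime) -> int:
    """Convert a datetime to a whole-second Unix timestamp.

    datetime.timestamp() interprets naive datetimes as local time,
    and honours tzinfo on timezone-aware ones.
    """
    return int(created.timestamp())


# A timezone-aware value, so the result does not depend on the machine's timezone.
created = datetime(2015, 2, 12, 10, 10, 59, tzinfo=timezone.utc)
version = to_version(created)
```

For naive datetimes this also uses local time, so it should agree with `strftime("%s")` on systems where the latter works.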

My main concern is that I use too many ifs. I feel the code is ugly, and I would welcome hints on how to make it neater.

```
# Stored files is in fact a local CSV file with this format.
# It holds all file names that have been processed and stored in the "depot" server
storedfiles= """
Filename,Version,Depot
foo.txt,12342412,server1
bar.mov,14144862,ser
```

Solution

In general, iterating through the rows of a series or dataframe is slow and is not recommended. Instead, you should do one of two things:

  • use map or apply (map for a series, generally apply for a data frame)
  • select/broadcast
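
The two alternatives listed above can be sketched on a toy frame. The column names and values here are made up for illustration and are not taken from the question's data.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a.txt", "b.txt", "c.txt"], "size": [10, 250, 4000]})

# map: element-wise transformation of a single Series.
df["ext"] = df["name"].map(lambda n: n.rsplit(".", 1)[-1])

# apply with axis=1: call a function once per row of the DataFrame.
df["label"] = df.apply(lambda row: f"{row['name']} ({row['size']} B)", axis=1)

# select/broadcast: the comparison is evaluated for the whole column at once
# and yields a boolean mask, which is then used to pick rows.
large = df[df["size"] > 100]
```

Selection with a boolean mask is usually the fastest of the three, since the comparison runs inside pandas rather than in a Python-level function call per element.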



In this case, you have the opportunity to do both. You should use apply to get the version information for your filelist dataframe:

def function_that_does_your_xml_stuff(row):
    """Get version information for a row in the filelist dataframe from XML.

    Your code above sets the filename as f, but then doesn't seem to use it,
    which I find confusing.  Perhaps it was a typo, and `a` should be `f`?
    Of course, a vanilla string object doesn't have a get_xml() method.

    """
    # The filelist shown in the question keeps the file name in the 'Name' column.
    filename = row['Name']
    xmlfile = a.get_xml(filename)
    xmldata = a.load(xmlfile)
    version = int(xmldata.get_data_created().strftime("%s"))
    return version

filelist['Version'] = filelist.apply(function_that_does_your_xml_stuff, axis=1)


The axis=1 argument here tells apply to work on rows instead of columns. This adds a new column to your filelist dataframe containing the version information obtained from the XML data associated with each file name.
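
To make the effect of the axis argument concrete, here is a small sketch on a made-up two-column frame (the names `a` and `b` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# axis=0 (the default): the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum())

# axis=1: the function receives each row as a Series indexed by column name.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)
```

With axis=1 the result has one entry per row, which is why it can be assigned directly as a new column.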

Then, to compare with your already stored versions, you will use a selector based on your versioning criterion. Actually, in this case, you're going to want to perform a join on the two data frames first, and then use the selector:

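# join() matches the `on` column of filelist against the index of the other
# frame, so index the stored-files frame (sfdf) by 'Filename' first. This
# assumes filelist['Name'] and sfdf['Filename'] hold the same file names.
filelist = filelist.join(sfdf.set_index('Filename'), how='left', on='Name', rsuffix='_stored')
need_to_update = filelist[filelist['Version'] > filelist['Version_stored']]
need_to_update.apply(actually_update, axis=1)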


The join connects the files that have already been backed up with those in filelist; the second statement selects only those files whose current version is greater than the one stored previously; and the third performs the update (again, by applying a function).
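
The join-then-select step can be tried on small, made-up data. In this sketch all file names, versions and depot names are hypothetical. It also handles one detail that may or may not matter for your case: after a left join, a file that was never stored has NaN in `Version_stored`, and any comparison against NaN is False, so such files need their own test if they should be backed up too.

```python
import pandas as pd

# Hypothetical candidates with already-computed versions.
filelist = pd.DataFrame({
    "Name": ["report.csv", "clip.mov", "fresh.dat"],
    "Version": [200, 300, 400],
})

# Hypothetical stored-files table, indexed by file name for the join.
sfdf = pd.DataFrame({
    "Filename": ["report.csv", "clip.mov"],
    "Version": [100, 300],
    "Depot": ["depot_a", "depot_b"],
}).set_index("Filename")

# Left join: keep every candidate, attach stored info where it exists.
joined = filelist.join(sfdf, how="left", on="Name", rsuffix="_stored")

# Never-stored files have NaN here, so test for them explicitly.
is_new = joined["Version_stored"].isna()
is_newer = joined["Version"] > joined["Version_stored"]
need_to_update = joined[is_new | is_newer]
```

Here `need_to_update` contains the file with a newer version and the file that was never stored, while the unchanged one is skipped.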

Context

StackExchange Code Review Q#120076, answer score: 3
