HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Building an efficient Python backup module

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
efficientmodulebuildingpythonbackup

Problem

I am writing a module that will synchronize destination to be the same as source, but will only overwrite files in destination that differ from source as to not cause excessive writes, hopefully prolonging the life of the backup. I will probably add additional functionality on top of this later.

I am primarily looking for critique regarding the style and structure of this module. The things I don't like or am not sure about:

-
When should or shouldn't I make a class? Is this overkill for a class? A backup is in my head an event-like object which has some state (a source and destination), but then it really only has one thing it needs to do, synchronize the destination to match source. This is probably the main question here. I was taught that OOP is the greatest and should always use it, but it just seems wrong here. I also noticed most of the modules in the Python Standard Library just have various functions and don't really break things out into classes like I was expecting (i.e. Java).

-
There is one method sync and then a couple utility methods that are meant to be static (not attributes of an instance). Should these be moved out into a separate "utility" submodule for the project or can they stay?

-
I don't like how the sync method is very similar to the sync_helper method and there is probably a way to clean that up, but I will leave that as an exercise for myself. However, is this generally how things are done when structuring recursive function calls (i.e. set things up and then call recursively when things are "aligned").

-
Should I have just left the code from main.py in a main function in backup.py? Would that be cleaner than splitting it out into multiple files? The if __name__ == '__main__': doesn't really make sense here, since this file will only ever be executed as the main file.

main.py:

```
#!/usr/bin/python3

"""Script to make efficient backups."""

import argparse
import os
import sys

from backup import Backup

def main():
"""Ma

Solution

First, like @MatthiasEttinger suggested in the comments, have you had a look at rsync?

$ rsync -aAXEv --delete /path/to/source /path/to/backup/folder


seems to do exactly what you want. The flags (used or implied here) mean:

-a, --archive               archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive             recurse into directories
-l, --links                 copy symlinks as symlinks
-p, --perms                 preserve permissions
-t, --times                 preserve modification times
-g, --group                 preserve group
-o, --owner                 preserve owner (super-user only)
-D                          same as --devices --specials
    --devices               preserve device files (super-user only)
    --specials              preserve special files

-A, --acls                  preserve ACLs (implies -p)
-X, --xattrs                preserve extended attributes
-E, --executability         preserve executability
-v, --verbose               increase verbosity
    --delete                delete extraneous files from dest dirs


Whenever you have a class method in Python without the self keyword, which is not decorated with @staticmethod, you are probably not using classes correctly.

The way it is right now, most of these methods have absolutely no link to the class, except that they are helper functions. You pass them the things they need to work on. However, you have all this information already in the class members!

First some easy fixes:

@staticmethod
def remove(path):
    """Removes file, directory, or symlink located at path"""

    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.unlink(path)


As mentioned above, this method should be a staticmethod.

@staticmethod
def sha512(file_name):
    blocksize = 65536
    hasher = hashlib.sha512()
    with open(file_name, 'rb') as hash_file:
        buf = hash_file.read(blocksize)
        while len(buf):
            hasher.update(buf)
            buf = hash_file.read(blocksize)
    return hasher.digest()

def compare(self, source, destination):
    """Compares the SHA512 hash of source and destination."""
    return self.sha512(source) == self.sha512(destination)


For the compare function, I would define a helper function that computes the hash. Note that while open(...) as x: is a SyntaxError, what you want is with open(...) as x:. You can also directly return the result of a comparison, no need for

cond = a == b
if cond:
    return True
else:
    return False


Now, I would add two things to __init__:

-
Already get the normpath of source and destination, so we can later use self.source and self.destination (You never use these variables again).

-
Create the target backup directory if it does not exist

class Backup:

    def __init__(self, source, destination):
        self.source = os.path.normpath(source)
        self.destination = os.path.join(os.path.normpath(
            destination), os.path.basename(source))
        if not os.path.isdir(self.destination):
            print "Creating destination {} because it does not exist yet.".format(self.destination)
            os.makedirs(self.destination)


Finally, I would completely change Backup.sync. I would use os.walk to walk through the source and process it step by step. For every directory level I would first create all directories missing in that directory. Then I would copy all files in that directory, skipping identical files (compared by the hash). Lastly, I would remove all files and folders which are not present anymore in the source.

def sync(self):
    """Synchronizes source to destination."""
    for dir_path, dirnames, filenames in os.walk(self.source):
        # make directory structure
        self.sync_directories(dir_path, dirnames)
        # compare and copy files
        self.sync_files(dir_path, filenames)
        # check if we have to remove files or folders
        self.sync_deleted(dir_path, dirnames, filenames)


The helper functions are defined thusly:

```
def _dest_path(self, dir_path, name):
return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)

def sync_directories(self, dir_path, dirnames):
for dir_name in dirnames:
dest_dir_name = self._dest_path(dir_path, dir_name)
if not os.path.exists(dest_dir_name):
print "mkdir", dest_dir_name
os.mkdir(dest_dir_name)

def sync_files(self, dir_path, filenames):
for file_name in filenames:
source = os.path.join(dir_path, file_name)
destination = self._dest_path(dir_path, file_name)
if os.path.isfile(destination):
if self.compare(source, destination):
# print "skip", destination
continue
print "copy", destination
shutil.copy2(source, destination)

def sync_deleted(self, dir_path, dirnames, filenames):
source_directories = set(dirnames)

Code Snippets

$ rsync -aAXEv --delete /path/to/source /path/to/backup/folder
-a, --archive               archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive             recurse into directories
-l, --links                 copy symlinks as symlinks
-p, --perms                 preserve permissions
-t, --times                 preserve modification times
-g, --group                 preserve group
-o, --owner                 preserve owner (super-user only)
-D                          same as --devices --specials
    --devices               preserve device files (super-user only)
    --specials              preserve special files

-A, --acls                  preserve ACLs (implies -p)
-X, --xattrs                preserve extended attributes
-E, --executability         preserve executability
-v, --verbose               increase verbosity
    --delete                delete extraneous files from dest dirs
@staticmethod
def remove(path):
    """Removes file, directory, or symlink located at path"""

    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.unlink(path)
@staticmethod
def sha512(file_name):
    blocksize = 65536
    hasher = hashlib.sha512()
    with open(file_name, 'rb') as hash_file:
        buf = hash_file.read(blocksize)
        while len(buf):
            hasher.update(buf)
            buf = hash_file.read(blocksize)
    return hasher.digest()

def compare(self, source, destination):
    """Compares the SHA512 hash of source and destination."""
    return self.sha512(source) == self.sha512(destination)
cond = a == b
if cond:
    return True
else:
    return False

Context

StackExchange Code Review Q#143949, answer score: 4

Revisions (0)

No revisions yet.