patternpythonMinor
Building an efficient Python backup module
Viewed 0 times
efficientmodulebuildingpythonbackup
Problem
I am writing a module that will synchronize destination to be the same as source, but will only overwrite files in destination that differ from source as to not cause excessive writes, hopefully prolonging the life of the backup. I will probably add additional functionality on top of this later.
I am primarily looking for critique regarding the style and structure of this module. The things I don't like or am not sure about:
-
When should or shouldn't I make a class? Is this overkill for a class? A backup is in my head an event-like object which has some state (a source and destination), but then it really only has one thing it needs to do, synchronize the destination to match source. This is probably the main question here. I was taught that OOP is the greatest and should always use it, but it just seems wrong here. I also noticed most of the modules in the Python Standard Library just have various functions and don't really break things out into classes like I was expecting (i.e. Java).
-
There is one method sync and then a couple utility methods that are meant to be static (not attributes of an instance). Should these be moved out into a separate "utility" submodule for the project or can they stay?
-
I don't like how the sync method is very similar to the sync_helper method and there is probably a way to clean that up, but I will leave that as an exercise for myself. However, is this generally how things are done when structuring recursive function calls (i.e. set things up and then call recursively when things are "aligned").
-
Should I have just left the code from main.py in a main function in backup.py? Would that be cleaner than splitting it out into multiple files? The
main.py:
```
#!/usr/bin/python3
"""Script to make efficient backups."""
import argparse
import os
import sys
from backup import Backup
def main():
"""Ma
I am primarily looking for critique regarding the style and structure of this module. The things I don't like or am not sure about:
-
When should or shouldn't I make a class? Is this overkill for a class? A backup is in my head an event-like object which has some state (a source and destination), but then it really only has one thing it needs to do, synchronize the destination to match source. This is probably the main question here. I was taught that OOP is the greatest and should always use it, but it just seems wrong here. I also noticed most of the modules in the Python Standard Library just have various functions and don't really break things out into classes like I was expecting (i.e. Java).
-
There is one method sync and then a couple utility methods that are meant to be static (not attributes of an instance). Should these be moved out into a separate "utility" submodule for the project or can they stay?
-
I don't like how the sync method is very similar to the sync_helper method and there is probably a way to clean that up, but I will leave that as an exercise for myself. However, is this generally how things are done when structuring recursive function calls (i.e. set things up and then call recursively when things are "aligned").
-
Should I have just left the code from main.py in a main function in backup.py? Would that be cleaner than splitting it out into multiple files? The
if __name__ == '__main__': doesn't really make sense here, since this file will only ever be executed as the main file.main.py:
```
#!/usr/bin/python3
"""Script to make efficient backups."""
import argparse
import os
import sys
from backup import Backup
def main():
"""Ma
Solution
First, like @MatthiasEttinger suggested in the comments, have you had a look at rsync?
seems to do exactly what you want. The flags (used or implied here) mean:
Whenever you have a class method in Python without the
The way it is right now, most of these methods have absolutely no link to the class, except that they are helper functions. You pass them the things they need to work on. However, you have all this information already in the class members!
First some easy fixes:
As mentioned above, this method should be a staticmethod.
For the
Now, I would add two things to
-
Already get the normpath of source and destination, so we can later use
-
Create the target backup directory if it does not exist
Finally, I would completely change
The helper functions are defined thusly:
```
def _dest_path(self, dir_path, name):
return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)
def sync_directories(self, dir_path, dirnames):
for dir_name in dirnames:
dest_dir_name = self._dest_path(dir_path, dir_name)
if not os.path.exists(dest_dir_name):
print "mkdir", dest_dir_name
os.mkdir(dest_dir_name)
def sync_files(self, dir_path, filenames):
for file_name in filenames:
source = os.path.join(dir_path, file_name)
destination = self._dest_path(dir_path, file_name)
if os.path.isfile(destination):
if self.compare(source, destination):
# print "skip", destination
continue
print "copy", destination
shutil.copy2(source, destination)
def sync_deleted(self, dir_path, dirnames, filenames):
source_directories = set(dirnames)
$ rsync -aAXEv --delete /path/to/source /path/to/backup/folderseems to do exactly what you want. The flags (used or implied here) mean:
-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive recurse into directories
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
-D same as --devices --specials
--devices preserve device files (super-user only)
--specials preserve special files
-A, --acls preserve ACLs (implies -p)
-X, --xattrs preserve extended attributes
-E, --executability preserve executability
-v, --verbose increase verbosity
--delete delete extraneous files from dest dirsWhenever you have a class method in Python without the
self keyword, which is not decorated with @staticmethod, you are probably not using classes correctly.The way it is right now, most of these methods have absolutely no link to the class, except that they are helper functions. You pass them the things they need to work on. However, you have all this information already in the class members!
First some easy fixes:
@staticmethod
def remove(path):
"""Removes file, directory, or symlink located at path"""
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)As mentioned above, this method should be a staticmethod.
@staticmethod
def sha512(file_name):
blocksize = 65536
hasher = hashlib.sha512()
with open(file_name, 'rb') as hash_file:
buf = hash_file.read(blocksize)
while len(buf):
hasher.update(buf)
buf = hash_file.read(blocksize)
return hasher.digest()
def compare(self, source, destination):
"""Compares the SHA512 hash of source and destination."""
return self.sha512(source) == self.sha512(destination)For the
compare function, I would define a helper function that computes the hash. Note that while open(...) as x: is a SyntaxError, what you want is with open(...) as x:. You can also directly return the result of a comparison, no need for cond = a == b
if cond:
return True
else:
return FalseNow, I would add two things to
__init__:-
Already get the normpath of source and destination, so we can later use
self.source and self.destination (You never use these variables again).-
Create the target backup directory if it does not exist
class Backup:
def __init__(self, source, destination):
self.source = os.path.normpath(source)
self.destination = os.path.join(os.path.normpath(
destination), os.path.basename(source))
if not os.path.isdir(self.destination):
print "Creating destination {} because it does not exist yet.".format(self.destination)
os.makedirs(self.destination)Finally, I would completely change
Backup.sync. I would use os.walk to walk through the source and process it step by step. For every directory level I would first create all directories missing in that directory. Then I would copy all files in that directory, skipping identical files (compared by the hash). Lastly, I would remove all files and folders which are not present anymore in the source.def sync(self):
"""Synchronizes source to destination."""
for dir_path, dirnames, filenames in os.walk(self.source):
# make directory structure
self.sync_directories(dir_path, dirnames)
# compare and copy files
self.sync_files(dir_path, filenames)
# check if we have to remove files or folders
self.sync_deleted(dir_path, dirnames, filenames)The helper functions are defined thusly:
```
def _dest_path(self, dir_path, name):
return os.path.join(self.destination, os.path.relpath(dir_path, self.source), name)
def sync_directories(self, dir_path, dirnames):
for dir_name in dirnames:
dest_dir_name = self._dest_path(dir_path, dir_name)
if not os.path.exists(dest_dir_name):
print "mkdir", dest_dir_name
os.mkdir(dest_dir_name)
def sync_files(self, dir_path, filenames):
for file_name in filenames:
source = os.path.join(dir_path, file_name)
destination = self._dest_path(dir_path, file_name)
if os.path.isfile(destination):
if self.compare(source, destination):
# print "skip", destination
continue
print "copy", destination
shutil.copy2(source, destination)
def sync_deleted(self, dir_path, dirnames, filenames):
source_directories = set(dirnames)
Code Snippets
$ rsync -aAXEv --delete /path/to/source /path/to/backup/folder-a, --archive archive mode; equals -rlptgoD (no -H,-A,-X)
-r, --recursive recurse into directories
-l, --links copy symlinks as symlinks
-p, --perms preserve permissions
-t, --times preserve modification times
-g, --group preserve group
-o, --owner preserve owner (super-user only)
-D same as --devices --specials
--devices preserve device files (super-user only)
--specials preserve special files
-A, --acls preserve ACLs (implies -p)
-X, --xattrs preserve extended attributes
-E, --executability preserve executability
-v, --verbose increase verbosity
--delete delete extraneous files from dest dirs@staticmethod
def remove(path):
"""Removes file, directory, or symlink located at path"""
if os.path.isdir(path):
shutil.rmtree(path)
else:
os.unlink(path)@staticmethod
def sha512(file_name):
blocksize = 65536
hasher = hashlib.sha512()
with open(file_name, 'rb') as hash_file:
buf = hash_file.read(blocksize)
while len(buf):
hasher.update(buf)
buf = hash_file.read(blocksize)
return hasher.digest()
def compare(self, source, destination):
"""Compares the SHA512 hash of source and destination."""
return self.sha512(source) == self.sha512(destination)cond = a == b
if cond:
return True
else:
return FalseContext
StackExchange Code Review Q#143949, answer score: 4
Revisions (0)
No revisions yet.