patternpythonMinor
Python code to write simple VPK archive files
Viewed 0 times
simplearchivewritefilespythoncodevpk
Problem
I'm new to Python, but have a lot of experience with C and C++. I would appreciate some feedback on my implementation of a packer for this archive format. Specifically, any feedback on how idiomatic this code is for Python, and any suggestions about the algorithm used.
VPK is a relatively simple uncompressed archive format. It starts with this header:
In this implementation,
The directory tree is stored in three levels: extension, path, and filename. If any of these are empty (e.g. a file with no extension, or which lives in the root directory) it is represented by a space.
For example, for this directory tree:
The internal data structure would be:
The binary format this structure is saved in is probably described most succinctly by the pseudocode for reading these files available here. (Briefly, extensions appear in plain text as ASCIIZ strings followed by any number of path strings which can each be followed by a number of filenames (and a brief struct follows each filename). Each group is terminated by an additional null byte.)
Here is the binary output archiving the example directory would produce, for reference:
```
0000000: 3412 aa55 0100 0000 8f00 0000 2000 2000 4..U........ . .
0000010: 6e6f 5f65 7874 0000 0000 0000 0000 0000 no_ext..........
0000020: 0000 0000 0000 00ff ff00 0074 7874 0020 ..
VPK is a relatively simple uncompressed archive format. It starts with this header:
struct VPKHeader {
uint32_t signature; // Always 0x55aa1234
uint32_t version; // Always 1
uint32_t tree_len_bytes;
};In this implementation,
tree_len_bytes will always be the length of the entire output file minus the length of the header (12 bytes).The directory tree is stored in three levels: extension, path, and filename. If any of these are empty (e.g. a file with no extension, or which lives in the root directory) it is represented by a space.
For example, for this directory tree:
vpk_test
├── file.txt
├── no_ext
└── some_dir
├── another_file.txt
└── image.jpgThe internal data structure would be:
{
" ": { // no extension
" ": [ // root directory
"no_ext" // filename
]
},
"txt": {
" ": [
"file"
],
"some_dir": [
"another_file"
]
},
"jpg": {
"some_dir": [
"image"
]
}
}The binary format this structure is saved in is probably described most succinctly by the pseudocode for reading these files available here. (Briefly, extensions appear in plain text as ASCIIZ strings followed by any number of path strings which can each be followed by a number of filenames (and a brief struct follows each filename). Each group is terminated by an additional null byte.)
Here is the binary output archiving the example directory would produce, for reference:
```
0000000: 3412 aa55 0100 0000 8f00 0000 2000 2000 4..U........ . .
0000010: 6e6f 5f65 7874 0000 0000 0000 0000 0000 no_ext..........
0000020: 0000 0000 0000 00ff ff00 0074 7874 0020 ..
Solution
I have notes on Python style and idiomatic writing. Reading PEP0008 is definitely a must, as it contains all the guidelines on writing nice clean Pythonic code.
Using
So, you create
Now, normally you shouldn't have a function that runs a process and returns a value. It's an established principle that, where possible functions should either return a value or perform some process but not both. However it's better to not use globals as they're a messier construct.
As for your other global,
Let's see what else we can do with your functions, after all you have one very large function begging to be refactored. Another idiom you should try and stick to is having functions do a single job. That way they're easier to test, read and reuse. You seem to have a whole process happening in
Your process is complicated and confusing. A very easy way to think about splitting it up is to take each block under a comment and make them a function, these are what I mean:
All those sound look good function candidates. Now, you might have problems with scope and passing values that makes one or two not worthwhile, but I suspect that it's not as bad as you think. Let's take a look at
Well, that was easy, wasn't it? Like I said before, passing file references is perfectly safe in Python and that means that none of this function is complicated. Now let's look at
You use
Now, just look at how tidy the start of
There used to be much more code here and now it's reduced a lot. You can also look at improving individual functions easier and if you have bugs you can test individual functions to see if they're misbehaving. Everything becomes easier this way, the only problem is having to manage the scope and pass arguments back and forth but the value in doing that is well worth it.
Using
globals is a bad idea! Almost every time, using globals is a symptom of another problem in how the code is structured. The best way to deal with traversing scope in Python is to pass arguments, but let's take a look at your running_offset and see what can be done.So, you create
running_offset in make_vpk, then in write_file_entry you're using it as well as updating it. You want it to be global so that you can keep adding to that value, right? Instead, you could have write_file_entry accept running_offset as an argument and then return the updated value. In a brief example, you'd have this:def make_vpk(srcdir, dstdir):
running_offset = 0
...
running_offset = write_file_entry(pak01_dir, real_path, running_offset)
def write_file_entry(pak01_dir, srcfile, running_offset):
...
return running_offset + len(data)Now, normally you shouldn't have a function that runs a process and returns a value. It's an established principle that, where possible functions should either return a value or perform some process but not both. However it's better to not use globals as they're a messier construct.
As for your other global,
pak01_000. This is even worse. It's even easier to just open this where you do and pass it as an argument to write_file_entry. But, I suspect you might not have realised that it'd be perfectly fine to pass it as an argument. Python can accept all types of data as arguments without issue. It will merely pass a reference to this same file, so you don't need to worry about it opening multiple copies and possibly forgetting to close them.Let's see what else we can do with your functions, after all you have one very large function begging to be refactored. Another idiom you should try and stick to is having functions do a single job. That way they're easier to test, read and reuse. You seem to have a whole process happening in
make_vpk. add_file and write_file_entry seem to be pretty direct about what they're doing, but a lot happens in make_vpk.Your process is complicated and confusing. A very easy way to think about splitting it up is to take each block under a comment and make them a function, these are what I mean:
# Write VPK header
# Prepare dictionary for VPK directory
# Write VPK directory and pak000
# Fix VPK header directory lengthAll those sound look good function candidates. Now, you might have problems with scope and passing values that makes one or two not worthwhile, but I suspect that it's not as bad as you think. Let's take a look at
write_header:def write_header(pak01_dir):
pak01_dir.write(struct.pack('I', 0x55aa1234)) # Magic signature
pak01_dir.write(struct.pack('I', 1)) # Version
pak01_dir.write(struct.pack('I', 0)) # Directory length -- filled laterWell, that was easy, wasn't it? Like I said before, passing file references is perfectly safe in Python and that means that none of this function is complicated. Now let's look at
prepare_dict.You use
srcdir, but nothing else! This is rather easy to adapt too. All it needs to do is return the prepared dictionary:def prepare_dict(srcdir):
ext_path_file = {}
for root, dirs, files in os.walk(srcdir):
for f in files:
path = os.path.join(root, f)
ext = os.path.splitext(path)[1]
if ext == "":
ext = " "
if ext[0] == ".":
ext = ext[1:]
if root == "" or root == ".":
root = " "
add_file(ext_path_file, ext, root, f)
return ext_path_fileNow, just look at how tidy the start of
make_vpk looks after that:def make_vpk(srcdir, dstdir):
"""Creates a vpk from srcdir and places the output in dstdir."""
running_offset = 0
dstdir = os.path.abspath(dstdir)
os.chdir(srcdir)
srcdir = "."
with open(os.path.join(dstdir, "pak01_dir.vpk"), "wb") as pak01_dir:
write_header(pak01_dir)
ext_path_file = prepare_dict(srcdir)
print "VPK Structure:"
print json.dumps(ext_path_file, indent=4)There used to be much more code here and now it's reduced a lot. You can also look at improving individual functions easier and if you have bugs you can test individual functions to see if they're misbehaving. Everything becomes easier this way, the only problem is having to manage the scope and pass arguments back and forth but the value in doing that is well worth it.
Code Snippets
def make_vpk(srcdir, dstdir):
running_offset = 0
...
running_offset = write_file_entry(pak01_dir, real_path, running_offset)
def write_file_entry(pak01_dir, srcfile, running_offset):
...
return running_offset + len(data)# Write VPK header
# Prepare dictionary for VPK directory
# Write VPK directory and pak000
# Fix VPK header directory lengthdef write_header(pak01_dir):
pak01_dir.write(struct.pack('I', 0x55aa1234)) # Magic signature
pak01_dir.write(struct.pack('I', 1)) # Version
pak01_dir.write(struct.pack('I', 0)) # Directory length -- filled laterdef prepare_dict(srcdir):
ext_path_file = {}
for root, dirs, files in os.walk(srcdir):
for f in files:
path = os.path.join(root, f)
ext = os.path.splitext(path)[1]
if ext == "":
ext = " "
if ext[0] == ".":
ext = ext[1:]
if root == "" or root == ".":
root = " "
add_file(ext_path_file, ext, root, f)
return ext_path_filedef make_vpk(srcdir, dstdir):
"""Creates a vpk from srcdir and places the output in dstdir."""
running_offset = 0
dstdir = os.path.abspath(dstdir)
os.chdir(srcdir)
srcdir = "."
with open(os.path.join(dstdir, "pak01_dir.vpk"), "wb") as pak01_dir:
write_header(pak01_dir)
ext_path_file = prepare_dict(srcdir)
print "VPK Structure:"
print json.dumps(ext_path_file, indent=4)Context
StackExchange Code Review Q#67810, answer score: 2
Revisions (0)
No revisions yet.