patternpythonMinor
Computation of Prefix Free Codes with many repeated weight values in reduced space
Viewed 0 times
spacevaluescomputationfreewithrepeatedweightcodesmanyprefix
Problem
Does the code below respect Python programming conventions? (and if not, which one does it disrespect?) The objective would be to publish this code as part of a scientific article, as "reproducible research".
In 1998, Moffat and Turpin described in "Efficient Construction of Minimum Redundancy Codes for Large Alphabets" (which I reviewed here) how to compute prefix free codes from sorted weights with many repeated weight values in reduced space and time (once the weights are sorted). I am planning to reproduce and describe their first experiment (compressing some TREC data set) in a fully reproducible way, as an exercize in reproducible research. Their algorithm should be implementable (they did implement it and ran it for their experiment but did not share their code) but I found that I needed to make large modifications when programming it in Python, and I wonder if I am doing it properly (I am more of a mathematician than a programmer).
```
def assertEqual(x,y):
assert x==y, str(x)+" should be "+str(y)
# A function to simulate one sorted list from two:
def sourceOfFirstOfTwo(leaves,graphs):
"""Given two lists of 4-uple, returns a pointer to the one with the list with the smallest first element, sorted by first value."""
assert(len(leaves)>0 or len(graphs)>0)
if len(leaves)==0:
return graphs
elif len(graphs)==0:
return leaves
else:
(firstLeafWeight,firstLeafFrequency,firstLeafLeft,firstLeafRight) = leaves[0]
(firstGraphWeight,firstGraphFrequency,firstGraphLeft,firstGraphRight) = graphs[0]
if firstLeafWeight0)
graphs = []
source = leaves
first = (firstWeight,firstFrequency,firstLeft,firstRight) = source[0]
while firstFrequency>1 or len(leaves)+len(graphs)>1:
assert(len(graphs)1:
if firstFrequency %2 == 1:
source[0] = (firstWeight, 1, firstLeft, firstRight)
else:
source.remove(first)
graphs.append((2*firstWeight,firstFrequency // 2,first,first))
In 1998, Moffat and Turpin described in "Efficient Construction of Minimum Redundancy Codes for Large Alphabets" (which I reviewed here) how to compute prefix free codes from sorted weights with many repeated weight values in reduced space and time (once the weights are sorted). I am planning to reproduce and describe their first experiment (compressing some TREC data set) in a fully reproducible way, as an exercize in reproducible research. Their algorithm should be implementable (they did implement it and ran it for their experiment but did not share their code) but I found that I needed to make large modifications when programming it in Python, and I wonder if I am doing it properly (I am more of a mathematician than a programmer).
```
def assertEqual(x,y):
assert x==y, str(x)+" should be "+str(y)
# A function to simulate one sorted list from two:
def sourceOfFirstOfTwo(leaves,graphs):
"""Given two lists of 4-uple, returns a pointer to the one with the list with the smallest first element, sorted by first value."""
assert(len(leaves)>0 or len(graphs)>0)
if len(leaves)==0:
return graphs
elif len(graphs)==0:
return leaves
else:
(firstLeafWeight,firstLeafFrequency,firstLeafLeft,firstLeafRight) = leaves[0]
(firstGraphWeight,firstGraphFrequency,firstGraphLeft,firstGraphRight) = graphs[0]
if firstLeafWeight0)
graphs = []
source = leaves
first = (firstWeight,firstFrequency,firstLeft,firstRight) = source[0]
while firstFrequency>1 or len(leaves)+len(graphs)>1:
assert(len(graphs)1:
if firstFrequency %2 == 1:
source[0] = (firstWeight, 1, firstLeft, firstRight)
else:
source.remove(first)
graphs.append((2*firstWeight,firstFrequency // 2,first,first))
Solution
- Comments on your code
-
The Python style guide (PEP8) recommends using 4 spaces for each indentation step, and
lowercase_words_with_underscores for variable names and function names. You're not obliged to follow this guide, but it will make it easier for you to collaborate with other Python programmers if you do.-
In answer to your question in comments,
assert statements are fine. It's usually a good sign when code includes them, because it shows that the programmer has thought about what needs to be true at each step of the algorithm.-
The
assertEqual calls, on the other hand, are not quite in the right place. It's a good sign to see test cases. However, it's not a good idea to intermix the test suite with the code like this, because it means that every time your program gets run or your module gets loaded, the test suite gets run. That's harmless here because the test cases are so short, but if the test suite took a lot of time to run, you'd normally prefer not to run it every time.So I would suggest reorganizing the test code so that it uses the facilities in the
unittest module. Which includes its own assertEqual method, saving you the need to write your own. See below for how you might do this.-
The docstring of
sourceOfFirstOfTwo seems confused. Python doesn't have "pointers" and it's best not to think in terms of pointers, but in terms of values. You should have written:"""Given two lists of 4-tuples, return the one with the smallest first element."""but in fact this doesn't completely describe the behaviour of the function because it doesn't say what happens if one of the lists is empty.
-
A confused or complicated docstring is often an indication that a function would profit from being simplified or generalized, and that's definitely the case here. The first thing to note is that sequences test true if they are non-empty, and false if they are empty, so the initial tests can be rewritten:
assert(leaves or graphs)
if not leaves:
return graphs
elif not graphs:
return leavesNext, Python already compares tuples element-wise:
>>> sorted([(2,3), (1,4), (2,1)])
[(1, 4), (2, 1), (2, 3)]so there is no need to split up your 4-tuples into their components. Instead you can write:
if leaves[0] < graphs[0]:
return leaves
else:
return graphswhich can be simplified to:
return min(leaves, graphs)since lists compare element-wise too. But now that we've made these simplifications, there's nothing in this function that needs to know about leaves and graphs. Really, it's a function that takes two sequences, discards empty ones, and returns the smaller among the remaining ones. But you can see that there's nothing special about the number two here. So I would generalize this function in order to further simplify it, like this (just one line!):
def min_non_empty(*sequences):
"""Return the smallest of the non-empty sequences.
Raise ValueError if all sequences are empty.
>>> min_non_empty([1,2],[2,3],[])
[1, 2]
>>> min_non_empty([1,2],[1],[])
[1]
>>> min_non_empty([], [])
Traceback (most recent call last):
...
ValueError: min() arg is an empty sequence
"""
return min(s for s in sequences if s)Notice how I've used the docstring to provide examples of how the function can be called. Python's
doctest module is capable of running these mini-tests and checking their output. (This feature doesn't substitute for a proper test suite, but it's a useful way to check examples in documentation.)Note also that there's no need for an assertion any more because
min raises ValueError if you don't give it any arguments.But having done all that, I don't think we actually need this function at all. See below.
-
A data structure with more than a couple of elements becomes unwieldy to manipulate if you represent it as a tuple. You have to refer to its elements by index (like
x[1]) in which case it is not clear what that element means, or else you have to break it down into its components.It would be better to refer to the elements of your data structure by name. You could do this by creating a class, but the simplest thing to do here is to use a named tuple:
Graph = namedtuple('Graph', 'weight frequency left right'.split())-
Instead of using
0 as a placeholder meaning "no left/right child", it's conventional in Python to use None. That way there's no risk of confusing it for a weight or a frequency.-
Your
Meld function lacks a docstring. What does it do and how to call it?-
Your
Meld algorithm isn't the same as the one in your blog. (The original paper is hidden behind a paywall and IEEE want to charge me $31 just to read it, so I am basing this on your blog instead.) The differences are as follows: (i) you separate the graphs into two lists, leaves and graphs, whereas Moffat and Turpin just maintain a singCode Snippets
"""Given two lists of 4-tuples, return the one with the smallest first element."""assert(leaves or graphs)
if not leaves:
return graphs
elif not graphs:
return leaves>>> sorted([(2,3), (1,4), (2,1)])
[(1, 4), (2, 1), (2, 3)]if leaves[0] < graphs[0]:
return leaves
else:
return graphsreturn min(leaves, graphs)Context
StackExchange Code Review Q#26071, answer score: 5
Revisions (0)
No revisions yet.