patternMinor
Knuth's Word Wrap Algorithm in Haskell
Viewed 0 times
algorithmwordhaskellknuthwrap
Problem
I put together a Haskell function for Knuth's word wrap algorithm. A simplified version of the algorithm is described here.
I've already been told that there are certain aspects of my style that are not very agreeable (e.g. lack of comments and type declarations). However, I'd like to get some opinions on where I might be able to improve the performance of the code itself without resorting to agressive, structure breaking optimization techniques.
The code can be exercised like the following examples:
I've already been told that there are certain aspects of my style that are not very agreeable (e.g. lack of comments and type declarations). However, I'd like to get some opinions on where I might be able to improve the performance of the code itself without resorting to agressive, structure breaking optimization techniques.
import Data.Array
import Data.List
import Data.Ord
knuth m xs = map (concat . intersperse " ") $ split ixs 1 ws
where
(_:ixs) = reverse $ snd (c ! lls)
ws = words xs
ls = map length ws
lls = length ls
split [] _ ws = [ws]
split (x:xs) n ws = let (a, b) = splitAt (x-n) ws in a : split xs x b
c = listArray (0, lls) $ map go [0..lls]
where
go 0 = (0, [])
go j = minimumBy (comparing fst) $ map (path j) [1..j]
path j i = let (x, p) = c ! (i-1) in (lc ! (i, j) + x, i:p)
lc = listArray ((1, 1), (lls, lls)) $ concat $ go ls
where
go = map (costs . extras) . take lls . tails
extras = lpad lls 0 . map (m -) . zipWith (+) [0..] . scanl1 (+)
lpad m x = reverse . take m . (++ repeat x) . reverse
costs [] = []
costs (e:es)
| e < 0 = (1/0) : costs es
| null es = [0]
| otherwise = exp (fromIntegral e) : costs esThe code can be exercised like the following examples:
t1 = knuth 6 "aaa bb cc ddddd"
t2 = knuth 37 "You cant trust code that you did not create yourself. (Especially code from companies that employ people like me.) No amount of source-level verification or scrutiny will protect you from untrusted code."Solution
Lack of comments, type information, and good naming makes this code much harder to read than it should be.
The names
Since we're only ever interested in either the cost or the path from
With these name changes, the first confusing line of the program,
It is not obvious what
The
You spend a lot of effort making lists which have to match up to arrays. This is harder to follow than defining each array element with a function (which is possible for these arrays), and it sets you up for off-by-one errors if you're not careful.
I can see two possible ways to speed it up. You use
lls is not descriptive, and the s at the end makes it sound like a list of something, when it's just a number. Since n is not used for anything else, I'd go with that. The names
c and lc are horrible, these arrays are central to the algorithm, and their definitions take up most of the function, but there is nothing to tell us what they are in the program. The linked article says that the c array represents cumulative costs, and lc[i][j] is the cost of a line containing words i through j, so I would suggest the names cumulativeCost and lineCost. However, the c in your program also adds the p array from the article, so cumulativeCost doesn't quite cover it. cumulativeCostAndPath is a bit long, but not too bad. Since we're only ever interested in either the cost or the path from
cumulativeCostAndPath (though they're calculated together), I'd suggest getpath = snd and cumCost = fst as ways of being clearer about what you mean (you could also separate the pieces, but that would be more work). With these name changes, the first confusing line of the program,
(_:ixs) = reverse $ snd (c ! lls) becomes (_:ixs) = reverse $ getpath (cumulativeCostAndPath ! n). path j i = let (x, p) = c ! (i-1) in (lc ! (i, j) + x, i:p) is also hard to read, mostly because of naming. It uses 6 names, of which lc is the only one with more than one character, and only i and j have obvious meanings. I'm going to call x costSoFar and p pathSoFar. I'm also going to change the let into a where and break it into 2 lines.path j i = (lineCost ! (i, j) + costSoFar, i:pathSoFar)
where (costSoFar, pathSoFar) = cumulativeCostAndPath ! (i-1)It is not obvious what
extras is trying to do, even with the article as a reference. It would be much easier to follow with a comment explaining the formula for extras[i][j] in the article, what extras is meant to represent, and mentioning that the scanl1 call in extras and the tails call in go interact to create the last term in extras. The
costs function could use a comment explaining what each of the 3 possibilities means. If extras were extracted to be its own array, it would allow costs to be rewritten more cleanly, as well as potentially simplifying the definition of lc. Before figuring out the details of what's going on, that deep pipeline of functions was hard to reason about. You spend a lot of effort making lists which have to match up to arrays. This is harder to follow than defining each array element with a function (which is possible for these arrays), and it sets you up for off-by-one errors if you're not careful.
I can see two possible ways to speed it up. You use
exp in the definition of costs, which is more expensive than the 2 multiplications required for cubing suggested in the article. Also, the calculation of extras doesn't stop after it's impossible. Since after finding that words i through j won't fit, all of the rest of them won't fit either, you could avoid calculating everything after the first negative number. If you have 500 words to wrap and an 80 column limit, figuring out whether you can fit words 1 through 200 on a line is a waste of time.Code Snippets
path j i = (lineCost ! (i, j) + costSoFar, i:pathSoFar)
where (costSoFar, pathSoFar) = cumulativeCostAndPath ! (i-1)Context
StackExchange Code Review Q#84531, answer score: 3
Revisions (0)
No revisions yet.