HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Justify text from a file using a LaTeX method

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filemethodtextlatexusingjustifyfrom

Problem

I wrote a bit of code that takes a file, justifies the text and writes to another file.

It uses a DP approach to minimise a badness metric, which is the amount of undershoot of the line length, cubed (see my answer below, the badness function isn't quite right if there is an overshoot, the badness should be inf - I believe this is how LaTeX justifies its text).

I encapsulated it in a class (which is the bit I'm least confident about). Be careful running it on big files, because I think it's got a pretty nasty complexity (\$O(n^2)\$ maybe).

class justifyText:
    def __init__(self, inFileName, outFileName, pageWidth):
        with open(inFileName, "r") as fp:
            self.words = [word for line in fp for word in line.split()]
        self.outFileName = outFileName
        self.pageWidth = pageWidth
        self.memo = {}
        self.breakPointTrack = {}
        self.n = len(self.words)

    def badness(self, i, j):
        totalWidth = 0;
        for word in self.words[i:j]:
            totalWidth += len(word)
        return abs((self.pageWidth - totalWidth)**3)

    def minBadness(self, i):
        if i in self.memo:
            return self.memo[i]
        if i == self.n:
            f = 0
            j_min = self.n
        else:
            f = None
            for j in range(i + 1, self.n + 1):
                temp = self.minBadness(j) + self.badness(i, j)
                if (f is None) or (temp < f):
                    f = temp
                    j_min = j
        self.memo[i] = f
        self.breakPointTrack[i] = j_min
        return f

    def justify(self):
        self.minBadness(0)
        fOut = open(self.outFileName, "w")
        brk = 0
        while brk < self.n:
            start = brk
            brk = self.breakPointTrack[brk]
            line = " ".join(self.words[start:brk]) + "\n"
            fOut.write(line)
        fOut.close()

test = justifyText("test.txt", "out.txt", 80)
test.justify()


Is this a standard way to implement

Solution

You should have a look at Python's official style-guide, PEP8. One of its recommendations is to use PascalCase for class names and lower_case for function and variable names.

Since you read all words into memory anyway, you can just write:

with open(in_file_name) as fp:
    self.words = fp.read().split()


This is because split splits by default on all whitespace (so both space and newlines). Also note that the default behaviour of open is to open a file in read mode, so "r" is also not needed here.

Caching (or memoization) of a functions return value is best done with a decorator. This way you avoid interleaving the actual implementation of the function with the caching. One example could be:

import functools

def memoize_method(func):
    cache = func.cache = {}

    @functools.wraps(func)
    def wrapper(self, n):
        if n not in cache:
            cache[n] = func(self, n)
        return cache[n]
    return wrapper

class JustifyText:
    ...
    @memoize_method
    def min_badness(self, i):
        if i == self.n:
            f, j_min = 0, self.n
        else:
            f, j_min = min((self.min_badness(j) + self.badness(i, j), j) 
                           for j in range(i + 1, self.n + 1))
        self.break_point_track[i] = j_min
        return f


Your badness function can be simplified by using sum, similar to how I used min above:

def badness(self, start, end):
    total_width = sum(len(word) for word in self.words[start:end])
    return abs((self.page_width - total_width)**3)


While it might be your most used use case, writing the output to a file prevents anyone from using this module in any other way. It would be better if justify just returned (even better and simpler: yielded) the justified text and writing to a file is left to the user or a dedicated method using justify internally.

def justify(self):
     self.min_badness(0)  # no-op or to populate the cache?
     brk = 0
     while brk < self.n:
         start, brk = brk, self.break_point_track[brk]
         yield " ".join(self.words[start:brk]) + "\n"

def write(self):
    with open(self.out_file_name, "w") as f_out:
        f_out.writelines(self.justify())


I'm not exactly sure why you have a lone self.min_badness(0) in justify. is it just to populate the cache with the value for zero? If so, this is worth to put a comment here, because otherwise you might accidentally delete it.

Code Snippets

with open(in_file_name) as fp:
    self.words = fp.read().split()
import functools


def memoize_method(func):
    cache = func.cache = {}

    @functools.wraps(func)
    def wrapper(self, n):
        if n not in cache:
            cache[n] = func(self, n)
        return cache[n]
    return wrapper


class JustifyText:
    ...
    @memoize_method
    def min_badness(self, i):
        if i == self.n:
            f, j_min = 0, self.n
        else:
            f, j_min = min((self.min_badness(j) + self.badness(i, j), j) 
                           for j in range(i + 1, self.n + 1))
        self.break_point_track[i] = j_min
        return f
def badness(self, start, end):
    total_width = sum(len(word) for word in self.words[start:end])
    return abs((self.page_width - total_width)**3)
def justify(self):
     self.min_badness(0)  # no-op or to populate the cache?
     brk = 0
     while brk < self.n:
         start, brk = brk, self.break_point_track[brk]
         yield " ".join(self.words[start:brk]) + "\n"

def write(self):
    with open(self.out_file_name, "w") as f_out:
        f_out.writelines(self.justify())

Context

StackExchange Code Review Q#147823, answer score: 9

Revisions (0)

No revisions yet.