HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

"Acro Words" - Creating Acronyms of a Text that are Words

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
acrocreatingarewordstextacronymsthat

Problem

I wrote a program that given an input string returns a list of tuples in which each element of the tuple is a word given by dropping some number of characters of the original string, preserving order.

For example:

'valley' -> [('alley',), ('valley',)]
'friction' -> [('fiction',), ('friction',), ('ricin',)]
'university' -> [('unvest',), ('nesty',), ('unsty',), ...]
'freetrade' -> [..., ('erade',), ..., ('free', 'tade'), ..., ('free', 'trade'), ...('feer',), ...]


The algorithm I've written works as follows

  • Find the powerset of the characters in the string. This gives us a set of lists of characters, conceptually given by "dropping" some number of characters of the input string, preserving order. powerset('abc') = {(),('a',),('a', 'b'),('a', 'b', 'c'),('a', 'c'),('b',),('b', 'c'),('c',)}



  • Next, for each element of the powerset, we perform a "powersplit" in which we find every possible split (or, inversely, concatenation). powerset(('a', 'b', 'c')) = {('a', 'b', 'c'), ('a', 'bc'), ('ab', 'c'), ('abc',)}



  • Finally, for each tuple produced by the powersplit, we check if the contents of the tuple are in our dictionary. contents_in_dictionary(dictionary, ('free', 'trade')) = True



The following is my code

```
from functools import reduce
from itertools import chain, combinations, islice, zip_longest

def powerset(data):
return {z for z in chain.from_iterable((x for x in combinations(data, r)) for r in range(len(data)+1))}

def slices(data, indices):
return [[x for x in islice(data, *list(y))] for y in zip_longest(chain([0], indices), chain(indices, [len(data)]))]

def powersplit(data):
return {tuple([''.join(x) for x in slices(data, indices)]) for indices in powerset(range(1, len(data)))}

def get_dictionary(min_length):
return {line.rstrip('\n').lower() for line in open('/usr/share/dict/words') if len(line.rstrip('\n')) >= min_length}

def contents_in_dictionary(dictionary, in_string):
return all([y in dictionary for y in in_

Solution


  1. Review



-
The lines are quite long, which means that it doesn't fit in the box here at Code Review. The Python style guide (PEP8) recommends sticking to 79 characters.

-
There are no docstrings. What do all these functions do? What am I supposed to pass as parameters?

-
Instead of:

[x for x in expression]


write:

list(expression)


and instead of:

{x for x in expression}


write:

set(expression)


-
This simplifies powerset to:

def powerset(seq):
    """Return a set whose elements are tuples of elements of seq."""
    return set(chain.from_iterable(combinations(seq, r)
                                   for r in range(len(seq) + 1)))


-
Instead of building a list:

return all([y in dictionary for y in in_string])


you can often just pass a generator expression to a function:

return all(y in dictionary for y in in_string)


This avoids constructing an intermediate list in memory.

-
Similarly, instead of:

tuple([''.join(x) for x in slices(data, indices)])


write:

tuple(map(''.join, slices(data, indices)))


-
In slices, I think it would be clearer to use tuple unpacking to give names to the elements of y, and so avoid the mysterious *list(y). Also, since the chains are the same length, you can use zip instead of zip_longest:

return [list(islice(data, start, stop)]
        for start, stop in zip(chain([0], indices),
                               chain(indices, [len(data)]))]


-
Consider using the pairwise recipe from the itertools documentation, and writing:

indices = chain([0], indices, [len(data)])
return [list(islice(data, start, stop)]
        for start, stop in pairwise(indices)]


-
The expression line.rstrip('\n') is repeated in get_dictionary.

  1. Algorithm



The approach taken here is known as generate-and-test. The code generates all subsequences of the input, and then all ways to split the subsequences into words, and then tests whether each set of words is valid. The problem with this approach is that it takes time that's exponential in the length of the input.

What you need is some way to prune the search so that most of the exponentially many candidates subsequences are never considered (because they don't appear in your dictionary).

I'll sketch an algorithm. Take your dictionary:

BLACK
BLUE
BLUER
GREEN
GREY


and transform it into a prefix tree (in the diagram, nodes with a thicker border indicate ends of valid words in the dictionary):

We'll carry out a depth-first search over all the possible substrings of the input, but at each branch of the search in the word we'll also have a corresponding position in the prefix tree. At each step of the search we can either add the first letter of the input to the candidate subsequence and move downwards in the prefix tree, or discard the first letter of the input and remain at the same point in the prefix tree. This gives us two possible branches in the search.

Now suppose the input is the word ABSOLUTE. We start the search with input ABSOLUTE, an empty candidate subsequence, and the position at the root of the prefix tree:

Can we take the first branch of the search (adding the A to the candidate subsequence and moving down in the prefix tree)? No, because we can't move down in the prefix tree from our current position to an A. So this branch of the search goes nowhere.

In the other branch of the search (where we discard the A), we now have input BSOLUTE (with candidate subsequence and position in the prefix tree unchanged).

From this position, we can take the first branch, because there is a transition to B in the prefix tree, so we can add B to the candidate subsequence, discard the B from the input, and continue to here in the search:

And so on. You'll see that in this algorithm we immediately prune any step in the search that we know can never lead to a valid subsequence.

Code Snippets

[x for x in expression]
list(expression)
{x for x in expression}
set(expression)
def powerset(seq):
    """Return a set whose elements are tuples of elements of seq."""
    return set(chain.from_iterable(combinations(seq, r)
                                   for r in range(len(seq) + 1)))

Context

StackExchange Code Review Q#159511, answer score: 2

Revisions (0)

No revisions yet.