patternpythonMinor
"Acro Words" - Creating Acronyms of a Text that are Words
Viewed 0 times
acrocreatingarewordstextacronymsthat
Problem
I wrote a program that given an input string returns a list of tuples in which each element of the tuple is a word given by dropping some number of characters of the original string, preserving order.
For example:
The algorithm I've written works as follows
The following is my code
```
from functools import reduce
from itertools import chain, combinations, islice, zip_longest
def powerset(data):
return {z for z in chain.from_iterable((x for x in combinations(data, r)) for r in range(len(data)+1))}
def slices(data, indices):
return [[x for x in islice(data, *list(y))] for y in zip_longest(chain([0], indices), chain(indices, [len(data)]))]
def powersplit(data):
return {tuple([''.join(x) for x in slices(data, indices)]) for indices in powerset(range(1, len(data)))}
def get_dictionary(min_length):
return {line.rstrip('\n').lower() for line in open('/usr/share/dict/words') if len(line.rstrip('\n')) >= min_length}
def contents_in_dictionary(dictionary, in_string):
return all([y in dictionary for y in in_
For example:
'valley' -> [('alley',), ('valley',)]
'friction' -> [('fiction',), ('friction',), ('ricin',)]
'university' -> [('unvest',), ('nesty',), ('unsty',), ...]
'freetrade' -> [..., ('erade',), ..., ('free', 'tade'), ..., ('free', 'trade'), ...('feer',), ...]The algorithm I've written works as follows
- Find the powerset of the characters in the string. This gives us a set of lists of characters, conceptually given by "dropping" some number of characters of the input string, preserving order.
powerset('abc') = {(),('a',),('a', 'b'),('a', 'b', 'c'),('a', 'c'),('b',),('b', 'c'),('c',)}
- Next, for each element of the powerset, we perform a "powersplit" in which we find every possible split (or, inversely, concatenation).
powerset(('a', 'b', 'c')) = {('a', 'b', 'c'), ('a', 'bc'), ('ab', 'c'), ('abc',)}
- Finally, for each tuple produced by the powersplit, we check if the contents of the tuple are in our dictionary.
contents_in_dictionary(dictionary, ('free', 'trade')) = True
The following is my code
```
from functools import reduce
from itertools import chain, combinations, islice, zip_longest
def powerset(data):
return {z for z in chain.from_iterable((x for x in combinations(data, r)) for r in range(len(data)+1))}
def slices(data, indices):
return [[x for x in islice(data, *list(y))] for y in zip_longest(chain([0], indices), chain(indices, [len(data)]))]
def powersplit(data):
return {tuple([''.join(x) for x in slices(data, indices)]) for indices in powerset(range(1, len(data)))}
def get_dictionary(min_length):
return {line.rstrip('\n').lower() for line in open('/usr/share/dict/words') if len(line.rstrip('\n')) >= min_length}
def contents_in_dictionary(dictionary, in_string):
return all([y in dictionary for y in in_
Solution
- Review
-
The lines are quite long, which means that it doesn't fit in the box here at Code Review. The Python style guide (PEP8) recommends sticking to 79 characters.
-
There are no docstrings. What do all these functions do? What am I supposed to pass as parameters?
-
Instead of:
[x for x in expression]write:
list(expression)and instead of:
{x for x in expression}write:
set(expression)-
This simplifies
powerset to:def powerset(seq):
"""Return a set whose elements are tuples of elements of seq."""
return set(chain.from_iterable(combinations(seq, r)
for r in range(len(seq) + 1)))-
Instead of building a list:
return all([y in dictionary for y in in_string])you can often just pass a generator expression to a function:
return all(y in dictionary for y in in_string)This avoids constructing an intermediate list in memory.
-
Similarly, instead of:
tuple([''.join(x) for x in slices(data, indices)])write:
tuple(map(''.join, slices(data, indices)))-
In
slices, I think it would be clearer to use tuple unpacking to give names to the elements of y, and so avoid the mysterious *list(y). Also, since the chains are the same length, you can use zip instead of zip_longest:return [list(islice(data, start, stop)]
for start, stop in zip(chain([0], indices),
chain(indices, [len(data)]))]-
Consider using the
pairwise recipe from the itertools documentation, and writing:indices = chain([0], indices, [len(data)])
return [list(islice(data, start, stop)]
for start, stop in pairwise(indices)]-
The expression
line.rstrip('\n') is repeated in get_dictionary.- Algorithm
The approach taken here is known as generate-and-test. The code generates all subsequences of the input, and then all ways to split the subsequences into words, and then tests whether each set of words is valid. The problem with this approach is that it takes time that's exponential in the length of the input.
What you need is some way to prune the search so that most of the exponentially many candidates subsequences are never considered (because they don't appear in your dictionary).
I'll sketch an algorithm. Take your dictionary:
BLACK
BLUE
BLUER
GREEN
GREYand transform it into a prefix tree (in the diagram, nodes with a thicker border indicate ends of valid words in the dictionary):
We'll carry out a depth-first search over all the possible substrings of the input, but at each branch of the search in the word we'll also have a corresponding position in the prefix tree. At each step of the search we can either add the first letter of the input to the candidate subsequence and move downwards in the prefix tree, or discard the first letter of the input and remain at the same point in the prefix tree. This gives us two possible branches in the search.
Now suppose the input is the word ABSOLUTE. We start the search with input ABSOLUTE, an empty candidate subsequence, and the position at the root of the prefix tree:
Can we take the first branch of the search (adding the A to the candidate subsequence and moving down in the prefix tree)? No, because we can't move down in the prefix tree from our current position to an A. So this branch of the search goes nowhere.
In the other branch of the search (where we discard the A), we now have input BSOLUTE (with candidate subsequence and position in the prefix tree unchanged).
From this position, we can take the first branch, because there is a transition to B in the prefix tree, so we can add B to the candidate subsequence, discard the B from the input, and continue to here in the search:
And so on. You'll see that in this algorithm we immediately prune any step in the search that we know can never lead to a valid subsequence.
Code Snippets
[x for x in expression]list(expression){x for x in expression}set(expression)def powerset(seq):
"""Return a set whose elements are tuples of elements of seq."""
return set(chain.from_iterable(combinations(seq, r)
for r in range(len(seq) + 1)))Context
StackExchange Code Review Q#159511, answer score: 2
Revisions (0)
No revisions yet.