Approximate (250 over 100) permutation best fitting certain criteria
Tags: pattern, python (difficulty: moderate)
Problem
Given a list of 250 words of 4 letters each, what is the fastest way, in Python, to find a subsample of 100 words (drawn without replacement) so that the distribution of letters across the whole sample is most even?
List of words: http://pastebin.com/2kpiNAcC
Note there are only 18 different characters, and some occur much more frequently than others.
(Currently, "even letter distribution" is operationalized as the smallest difference between max and min letter counts and the smallest standard deviation of letter counts, although I'm open to suggestions.)
I've tried brute forcing it by drawing random samples, and after various refinements I've arrived at the code below. But (1) I assume my brute forcing isn't optimal, and (2) there is probably a smarter approach, but I can't think of one.
I am interested in finding a better subsample, so either making the brute-forcing more efficient, or using a smarter algorithm could help.
Not sure if this is better suited for Code Review (optimizing the brute forcing) or a site more suited for discussing a smarter algorithm.
```
import random
import numpy as np
import pandas as pd
# currently unused, as iterating over all combinations seems impossible
from itertools import combinations
from collections import Counter
from joblib import Parallel, delayed
#this just reads the file so "words" is a list of strings
#corresponding to the 250 words
fname = "file.txt"
df = pd.read_csv(fname, sep="\t", index_col=0)  # DataFrame.from_csv was removed from pandas; read_csv is the equivalent
words = list(df.index)
# define function to randomly sample words and rate them
def sampl(r):
    s, mm = 1000000, 1000000  # arbitrary constants
    for w in range(r):
        w = random.sample(words, 100)
        counts = [v for k, v in Counter("".join(w)).items()]
        s_new = np.std(counts)
        mm_new = np.max(counts) - np.min(counts)
        # compare current subsample to benchmark within this run
        if s_new < s:
            w_candidate_1 = w
            if mm_new < mm:
                # reconstructed tail: per the review below, w_candidate_3
                # is updated only when both scores improve
                w_candidate_3 = w
            s = s_new
        if mm_new < mm:
            w_candidate_2 = w
            mm = mm_new
    return w_candidate_1, w_candidate_2, w_candidate_3
```
Solution
1. Analysis
In the general case, the problem of finding the sample of words with the "most even" distribution of letters is NP-hard. Here I'm considering a general instance of this problem to be:
Given an alphabet \$ \Sigma \$, a set \$ W \$ of 4-letter words over that alphabet, and an integer \$ n \le \left|W\right| \$, find the subset \$ W' \subseteq W \$ with size \$ \left|W'\right| = n \$ having the smallest variance of letter counts among all subsets of that size.
Here's a proof that this is NP-hard, by a reduction from EXACT COVER BY 3-SETS (that is, EXACT COVER restricted to sets of size 3, also known as "X3C").
Suppose that we have an instance of EXACT COVER BY 3-SETS in the form of a set \$ X \$ and a collection \$ S \$ of 3-element subsets of \$ X \$, and the problem is to determine if there is a subcollection \$ S' \subseteq S \$ such that each element in \$ X \$ is contained in exactly one subset in \$ S' \$.
This translates into the word problem as follows. Let \$ \Sigma = X \cup S \$; let \$ W \$ be the set of four-letter words \$ abcd \$ such that \$ \{a, b, c\} = d \in S \$, and let \$ n = {\left|X\right| \over 3} \$. Then solve the word problem to find \$ W' \$ with the smallest variance of letter counts. There is an exact cover if and only if the variance of the letter counts in \$ W' \$ is zero. (Because there can only be one instance of any letter from \$ S \$, if the variance is zero there must be exactly one instance of each letter from \$ X \$, and since there are \$ {\left|X\right| \over 3} \$ words, each element of \$ X \$ is covered exactly once.)
So, unless P = NP, there's no efficient (polynomial-time) algorithm that solves your problem. Exhaustive search is out too, because \$ {250 \choose 100} \approx 6 \times 10^{71} \$. So the best you can do is to use heuristics to find good approximate solutions. I'll describe one possibility in §4 below.
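To see the reduction at work, here is a hypothetical toy instance (the names below are illustrative, not part of the original answer): take \$ X = \{1, \dots, 6\} \$ and \$ S = \{A, B, C\} \$ with \$ A = \{1,2,3\} \$, \$ B = \{4,5,6\} \$, \$ C = \{2,3,4\} \$, giving words 123A, 456B, 234C and \$ n = 2 \$:
```
from collections import Counter
from statistics import pvariance

# Hypothetical toy X3C instance (illustrative only): X = {1,...,6},
# S = {A, B, C} with A = {1,2,3}, B = {4,5,6}, C = {2,3,4}.
# Each 3-set becomes a 4-letter word: its three elements plus its own name.
words = ["123A", "456B", "234C"]
n = 2  # |X| / 3

def variance_of_letter_counts(sample):
    return pvariance(list(Counter("".join(sample)).values()))

# {123A, 456B} uses every occurring letter exactly once, so the variance
# is zero -- certifying the exact cover {A, B}.
print(variance_of_letter_counts(["123A", "456B"]))  # 0.0

# {123A, 234C} repeats 2 and 3 and misses 5 and 6, so the variance is > 0.
print(variance_of_letter_counts(["123A", "234C"]))  # ~0.22
```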
2. Review

- Having code at the top level of a module makes it hard to test from the interactive interpreter, because whenever you reload the module, the code runs. It's best to guard top-level code with `if __name__ == '__main__':` (see the sketch after this list).
- The comment for `sampl` doesn't explain the most important points about the behaviour of the function, namely: what arguments does it take? and what does it return?
- The number 100 is arbitrary, so it ought to be a parameter.
- When finding the minimum or maximum, it's best to avoid an arbitrary starting point like 1000000, and instead to rearrange the code into a generator that can then be passed to `min`.
- Writing `for w in range(r):` is misleading, because `w` is not actually used (it's immediately overwritten by the random sample). It's conventional to write `_` for unused loop variables.
- The code uses NumPy to do the statistics on the sample. But there are at most 400 letters in the sample, so this doesn't really gain much: NumPy only shows a benefit for large volumes of data. I think that using the standard Python functions `statistics.pstdev`, `max` and `min` would be adequate for this amount of data.
- Since the standard deviation is only used for comparison, you'd get the same results if you used the variance instead, and this would have the advantage of avoiding the square root.
- Instead of `Counter("".join(w))`, use `itertools.chain.from_iterable` and write `Counter(chain.from_iterable(w))`. This avoids concatenating the words. [Improved by Veedrac in comments.]
- Instead of:

      counts = [v for k, v in Counter("".join(w)).items()]

  write:

      counts = list(Counter(chain.from_iterable(w)).values())

  since you don't use the keys.
- I dislike the logic for selecting `w_candidate_3`. This is only updated if both scoring functions show an improvement for this sample, and so the "best" sample according to this logic will depend on the order in which samples are examined. This is a bad property for a measure to have. It would be better to find some numeric way to combine the two scores (for example, their sum or product), so that all samples are comparable according to this measure.
- The logic in `sampl` is complicated because you aren't sure of the right way to score the samples. In this kind of situation a better approach would be for `sampl` to take a scoring function as an argument. Then you could easily experiment with different kinds of score.
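On the first point, a minimal sketch of the guard (hypothetical: `sampl` and `words` are as defined in the question, and `"file.txt"` and `r=1000` are placeholders):
```
import pandas as pd

# Sketch of the __main__ guard suggested in the first review point: the
# module can now be imported (and reloaded) without running the experiment.
if __name__ == '__main__':
    df = pd.read_csv("file.txt", sep="\t", index_col=0)
    words = list(df.index)
    print(sampl(1000))
```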
3. Revised code

```
from collections import Counter
from itertools import chain
import random
from statistics import pvariance

def letter_counts(words):
    """Generate the letter counts of the words."""
    return Counter(chain.from_iterable(words)).values()

def score_variance(words):
    """Score words according to variance of letter counts."""
    return pvariance(letter_counts(words))

def score_range(words):
    """Score words according to range of letter counts."""
    counts = list(letter_counts(words))
    return max(counts) - min(counts)

def best_sample(words, score, n=100, r=100):
    """Generate r (default 100) random samples of n (default 100) elements
    from words, and return the sample with the smallest score.
    """
    return min((random.sample(words, n) for _ in range(r)), key=score)
```
4. Hill climbing

Instead of scoring independent random samples, start from one random sample and repeatedly swap a single word for one outside the sample, keeping the best-scoring neighbour until no swap improves the score:

```
def best_sample_2(words, score, n=100, r=100):
    """Return a sample of n (default 100) elements from words, found by
    starting with a random sample and repeatedly hill climbing by
    generating r (default 100) neighbours and picking the one with
    smallest score, until no more progress is made.
    """
    words = set(words)
    # list(...) because random.sample no longer accepts sets (Python >= 3.11)
    current_sample = set(random.sample(list(words), n))
    current_score = score(current_sample)

    # Generate neighbouring samples (one word swapped).
    def neighbours():
        nonsample = words - current_sample
        for _ in range(r):
            s = set(current_sample)
            s.difference_update(random.sample(list(current_sample), 1))
            s.update(random.sample(list(nonsample), 1))
            yield s

    while True:
        new_sample = min(neighbours(), key=score)
        new_score = score(new_sample)
        if current_score <= new_score:
            # No neighbour is any better, so stop here.
            return current_sample
        current_sample, current_score = new_sample, new_score
```

For comparison, a run of the brute-force sampler:

```
>>> score_variance(best_sample(WORDS, score_variance, r=10000))
114.18300653594771
```
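Since each hill-climbing run can stop in a different local optimum, one might restart it a few times and keep the best result. A sketch of random-restart hill climbing (the restart count 10 is arbitrary, and `WORDS` is the 250-word list as above):
```
# Random-restart hill climbing: run best_sample_2 several times and keep
# the sample with the lowest score among the runs.
best = min((best_sample_2(WORDS, score_variance) for _ in range(10)),
           key=score_variance)
print(score_variance(best))
```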
Context
StackExchange Code Review Q#90505, answer score: 15