HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Perfect Hash Family Non-Distinct t Columns Calculator

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
distinctcolumnsnonperfecthashfamilycalculator

Problem

Perfect Hash Families (PHFs) are a widely studied combinatorial object. Essentially, they are 4-tuples (N; t, k, v), where the PHF is an N by k array of v symbols. For any t columns chosen, there is at least one row of those chosen columns that has no symbol duplicated in that row.

I want to be able to count how many "non-distinct" t-column choices there are in a given PHF. My code is below, with some sample inputs that I made up:

from itertools import combinations

def get_column(phf, c):
    return [x[c] for x in phf]

# returns the PHF as if we restrict to the "cs" columns (a list)
def combine_columns_into_array(phf, cs):
    r = []
    for c in cs:
        r.append(get_column(phf, c))
    return list(map(list, zip(*r)))

# returns the number of t-column choices that have no distinct rows
def count_non_distinct_combinations(phf, t):
    k = len(phf[0])
    true_count = 0
    # get all combinations
    for c in combinations([i for i in range(k)], t):
        b = False
        choices = combine_columns_into_array(phf, list(c))
        for row in choices:
            # if the row is completely distinct
            if len(row) == len(set(row)):
                b = True
        # if all rows are non-distinct:
        if not b:
            true_count += 1
    return true_count

phf = [
    [4, 2, 3], 
    [2, 1, 3], 
    [1, 1, 4]]

# first 2 columns have no distinct rows
phf2 = [
    [4, 4, 3], 
    [2, 2, 3], 
    [1, 1, 4]]

print(count_non_distinct_combinations(phf, 2)) # 0
print(count_non_distinct_combinations(phf2, 2)) # 1


I plan to use this code as a metric for how "good" the array is - what I mean by this is that I want, given the number of rows N, try to reduce this metric as much as possible. Therefore, speed/efficiency is the most important part of any review of this code.

Solution

For a start, use iterators. Don't convert to lists constantly:

get_column can be simplified to (use generator rather than list):

def get_column(phf, c):
    return (x[c] for x in phf)


combine_columns_into_array to:

def combine_columns_into_array(phf, cs):
    return zip(*[get_column(phf, c) for c in cs])


In count_non_distinct_combinations:

for c in combinations(range(k), t):


Rather than:

for c in combinations([i for i in range(k)], t):


To further simplify, combine_columns_into_array can be merged with get_column:

def combine_columns_into_array(phf, cs):
    return zip(*[[x[c] for x in phf] for c in cs])


Combining combine_columns_into_array into count_non_distinct_combinations:

def count_non_distinct_combinations(phf, t):
    k = len(phf[0])
    true_count = 0
    # get all combinations
    for c in combinations(range(k), t):
        b = False
        for row in zip(*[[x[i] for x in phf] for i in c]):
            # if the row is completely distinct
            if len(row) == len(set(row)):
                b = True
        # if all rows are non-distinct:
        if not b:
            true_count += 1
    return true_count


The zip(*[[x[i] for x in phf] for i in c]) can be simplified to ([x[i] for i in c] for x in phf)

The whole

k = len(phf[0])
true_count = 0
# get all combinations
for c in combinations([i for i in range(k)], t):
    b = False
    choices = combine_columns_into_array(phf, list(c))
    for row in choices:
        # if the row is completely distinct
        if len(row) == len(set(row)):
            b = True
    # if all rows are non-distinct:
    if not b:
        true_count += 1
return true_count


section can be simplified (I think) to:

true_count = 0
# get all combinations
for c in combinations(range(len(phf[0])), t):
    s = [[x[i] for i in c] for x in phf]
    # if all rows are non-distinct:
    if not [len(i) for i in s] == [len(set(i)) for i in s]:
        true_count += 1
return true_count


And so to:

def count_non_distinct_combinations(phf, t):
    true_count = 0
    for c in combinations(range(len(phf[0])), t):
        if not any(len(set(i)) == len(i) for i in ([x[i] for i in c] for x in phf)):
            true_count += 1
    return true_count


I admit I lost readability somewhere along the line, but this new 6 line version takes 55.3% of the time needed to complete the phf2 case, so its almost twice as fast as your version. There may be improvements that can be done to the algorithm, but I'll stop here.

Code Snippets

def get_column(phf, c):
    return (x[c] for x in phf)
def combine_columns_into_array(phf, cs):
    return zip(*[get_column(phf, c) for c in cs])
for c in combinations(range(k), t):
for c in combinations([i for i in range(k)], t):
def combine_columns_into_array(phf, cs):
    return zip(*[[x[c] for x in phf] for c in cs])

Context

StackExchange Code Review Q#96807, answer score: 2

Revisions (0)

No revisions yet.