HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Comparison function for strings

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
stringsfunctionforcomparison

Problem

I got a function that has as input 2 strings, computes the string similarity of all the combinations and give back and output the highest similarity. For example, "you are" and "are you" will have 1 as similarity value.

import itertools
from difflib import SequenceMatcher
import numpy as np

def _compare_names(a, b):
    return SequenceMatcher(None, a, b).ratio()

def _compare_names_with_combinations(name_1,name_2):
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    combinations_name_1 = list(itertools.permutations(name_1))
    combinations_name_2 = list(itertools.permutations(name_2))
    combinations_name_joined_1 = [' '.join(list(name)) for name in combinations_name_1]
    combinations_name_joined_2 = [' '.join(list(name)) for name in combinations_name_2]
    distances = []
    for name1 in combinations_name_joined_1:
        for name2 in combinations_name_joined_2:
            distance = _compare_names(name1,name2)
            distances.append(distance)
    return max(distances)


examples:

_compare_names_with_combinations('you are','are you')
>> 1
_compare_names_with_combinations('you are','are yos')
>> 0.85714285


My concerns come when I have to compare a lot of texts and it seems that there should be around a more efficient way of computing this value. Do you think there is space in this function to decrease the computational time?

Solution

This will not reduce time complexity, just space complexity; as well as spending less time in the interpreter, most of the work being done in C.

The key idea is to make use of generators to avoid building lists that you only iterate once; reducing your need for memory management, thus increasing the overall speed.

Since you are already importing itertools, I’ll just import the relevant parts to save on typing a bit, feel free to use the long names if you see fit:

from itertools import product, permutations


in case you are using Python 2, I'd use:

from itertools import product, permutations, imap as map


Now we transform each list into a generator expression:

def _compare_names_with_combinations(name_1,name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    distances = []
    for name1 in map(join, permutations(name_1)):
        for name2 in map(join, permutations(name_2)):
            distance = _compare_names(name1,name2)
            distances.append(distance)
    return max(distances)


Then product lets us merge the two loops:

def _compare_names_with_combinations(name_1,name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    distances = []
    for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
        distance = _compare_names(name1,name2)
        distances.append(distance)
    return max(distances)


An other improvement is to not build the list of all distances since we are only interested in the maximum value: let's put this max call out of the function and turn it into a generator:

def _compare_names_with_combinations(name_1,name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
        yield _compare_names(name1,name2)

def score_names(name_1, name_2):
    return max(_compare_names_with_combinations(name_1, name_2))


Well, we’re just mapping _compare_names over the product. Maybe we could have that max call in the function after all:

def _compare_names_with_combinations(name_1, name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    return max(
        map(
            _compare_names,
            product(
                map(join, permutations(name_1)),
                map(join, permutations(name_2))
            )
        )
    )


But this will require that we modify _compare_names to accept a tuple of two names as parameter instead of two names as two separate parameters:

def _compare_names(names):
    a, b = names
    return SequenceMatcher(None, a, b).ratio()

Code Snippets

from itertools import product, permutations
from itertools import product, permutations, imap as map
def _compare_names_with_combinations(name_1,name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    distances = []
    for name1 in map(join, permutations(name_1)):
        for name2 in map(join, permutations(name_2)):
            distance = _compare_names(name1,name2)
            distances.append(distance)
    return max(distances)
def _compare_names_with_combinations(name_1,name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    distances = []
    for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
        distance = _compare_names(name1,name2)
        distances.append(distance)
    return max(distances)
def _compare_names_with_combinations(name_1,name_2):
    join = ' '.join
    name_1 = name_1.lower().split()
    name_2 = name_2.lower().split()
    for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
        yield _compare_names(name1,name2)

def score_names(name_1, name_2):
    return max(_compare_names_with_combinations(name_1, name_2))

Context

StackExchange Code Review Q#119752, answer score: 3

Revisions (0)

No revisions yet.