patternpythonMinor
Comparison function for strings
Viewed 0 times
stringsfunctionforcomparison
Problem
I got a function that has as input 2 strings, computes the string similarity of all the combinations and give back and output the highest similarity. For example, "you are" and "are you" will have 1 as similarity value.
examples:
My concerns come when I have to compare a lot of texts and it seems that there should be around a more efficient way of computing this value. Do you think there is space in this function to decrease the computational time?
import itertools
from difflib import SequenceMatcher
import numpy as np
def _compare_names(a, b):
return SequenceMatcher(None, a, b).ratio()
def _compare_names_with_combinations(name_1,name_2):
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
combinations_name_1 = list(itertools.permutations(name_1))
combinations_name_2 = list(itertools.permutations(name_2))
combinations_name_joined_1 = [' '.join(list(name)) for name in combinations_name_1]
combinations_name_joined_2 = [' '.join(list(name)) for name in combinations_name_2]
distances = []
for name1 in combinations_name_joined_1:
for name2 in combinations_name_joined_2:
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)examples:
_compare_names_with_combinations('you are','are you')
>> 1
_compare_names_with_combinations('you are','are yos')
>> 0.85714285My concerns come when I have to compare a lot of texts and it seems that there should be around a more efficient way of computing this value. Do you think there is space in this function to decrease the computational time?
Solution
This will not reduce time complexity, just space complexity; as well as spending less time in the interpreter, most of the work being done in C.
The key idea is to make use of generators to avoid building lists that you only iterate once; reducing your need for memory management, thus increasing the overall speed.
Since you are already importing
in case you are using Python 2, I'd use:
Now we transform each list into a generator expression:
Then
An other improvement is to not build the list of all distances since we are only interested in the maximum value: let's put this
Well, we’re just mapping
But this will require that we modify
The key idea is to make use of generators to avoid building lists that you only iterate once; reducing your need for memory management, thus increasing the overall speed.
Since you are already importing
itertools, I’ll just import the relevant parts to save on typing a bit, feel free to use the long names if you see fit:from itertools import product, permutationsin case you are using Python 2, I'd use:
from itertools import product, permutations, imap as mapNow we transform each list into a generator expression:
def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
distances = []
for name1 in map(join, permutations(name_1)):
for name2 in map(join, permutations(name_2)):
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)Then
product lets us merge the two loops:def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
distances = []
for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)An other improvement is to not build the list of all distances since we are only interested in the maximum value: let's put this
max call out of the function and turn it into a generator:def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
yield _compare_names(name1,name2)
def score_names(name_1, name_2):
return max(_compare_names_with_combinations(name_1, name_2))Well, we’re just mapping
_compare_names over the product. Maybe we could have that max call in the function after all:def _compare_names_with_combinations(name_1, name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
return max(
map(
_compare_names,
product(
map(join, permutations(name_1)),
map(join, permutations(name_2))
)
)
)But this will require that we modify
_compare_names to accept a tuple of two names as parameter instead of two names as two separate parameters:def _compare_names(names):
a, b = names
return SequenceMatcher(None, a, b).ratio()Code Snippets
from itertools import product, permutationsfrom itertools import product, permutations, imap as mapdef _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
distances = []
for name1 in map(join, permutations(name_1)):
for name2 in map(join, permutations(name_2)):
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
distances = []
for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
distance = _compare_names(name1,name2)
distances.append(distance)
return max(distances)def _compare_names_with_combinations(name_1,name_2):
join = ' '.join
name_1 = name_1.lower().split()
name_2 = name_2.lower().split()
for name1, name2 in product(map(join, permutations(name_1)), map(join, permutations(name_2))):
yield _compare_names(name1,name2)
def score_names(name_1, name_2):
return max(_compare_names_with_combinations(name_1, name_2))Context
StackExchange Code Review Q#119752, answer score: 3
Revisions (0)
No revisions yet.