HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Avoid iteration on complete list to find (fuzzy) matching string

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
avoiditerationcompletefindliststringmatchingfuzzy

Problem

I have about 10k categories, represented by their name.
I need to find a match for a category input among this list of 10k categories.

This is done through an API, and by batches: the endpoint will receive about 500 categories to match.

The process is:

-> Receive request with all categories to match

-> For each word, run fuzzy matching algorithm with the 10k categories.

-> Return match

I'm using Fuzzy Wuzzy for the algorithm, and Django for the API.
Basically, this would look like this:

response = {}

for category in categories_received:
  for master_category in master_categories:
    if fuzz.ratio(category, master_category) > 95:
       response[category] = master_category


This is terribly inefficient, but I couldn't find another way to do it.
I control both sides: the way data is sent to the API, and of course the way it is compared to the existing categories.

Any idea/input on how to make this code more efficient would be much appreciated

Solution

Fuzzy-wuzzy is poorly optimized for that kind of use.

What you can do is use fuzzy-wuzzy functions to construct your own ratio function.

The biggest boost you will get if you will use any caching for master_category tokenization. If you are on python3 you can use functions.lru_cache and if memory allows you I would suggest making it the size of len(master_categories)+1.

This is a small change and it will increase memory usage a lot, but as a result speed of calculation will also be boosted a lot.

also since you need just 1 master_category per each category, just break your loop as soon as you find one.

for category in categories_received:
  for master_category in master_categories:
    if fuzz.ratio(category, master_category) > 95:
       response[category] = master_category
       break

Code Snippets

for category in categories_received:
  for master_category in master_categories:
    if fuzz.ratio(category, master_category) > 95:
       response[category] = master_category
       break

Context

StackExchange Code Review Q#151765, answer score: 3

Revisions (0)

No revisions yet.