Search in a big Python dictionary

Submitted by: @import:stackexchange-codereview
dictionary big python search

Problem

I have a big Python dictionary with more than 150,000 keys; every key has a list as its value. The dictionary maps infinitive words (the keys) to all grammatical forms of those words (the values):

{"конфузити": ["конфузить", "конфужу", "конфузиш", "конфузить", "конфузим", "конфузимо", "конфузите", "конфузять", "конфузитиму", "конфузитимеш", "конфузитиме", "конфузитимем", "конфузитимемо", "конфузитимете", "конфузитимуть", "конфузив", "конфузила", "конфузило", "конфузили"]}


I built a list of words from a particular text. It contains more than 2 million words, each in some grammatical form. I am trying to find these words among my dictionary values and return the corresponding dictionary keys, which, as mentioned above, are the base (dictionary) forms of the words. This process is called lemmatization. I have tried different approaches, but they are all too slow.

In this part I perform text tokenization.

import re

lst = []
with open("/home/yan/PycharmProjects/vk/my_patrioty/men_patrioty.txt", 'r', encoding="utf-8") as f:
    for sent in f:
        sent = sent.lower()
        # remove ASCII letters, digits and punctuation before tokenizing
        sent = re.sub(r"[A-z0-9'\"`|/+#,)(?!\-:=;.«»—@]", '', sent)
        # collect the remaining word tokens
        lst.extend(re.findall(r'\w+', sent))


In this part I am trying to perform a binary search, but it is very slow.

with open("/home/yan/data.txt") as f:
    d = json.load(f)
    for w in lst:   #list of my words
        for key, value in d.items():
            lb = 0
            ub = len(value)
            mid_index = (lb + ub) // 2
            item_at_mid = value[mid_index]
            if item_at_mid == w:
                print(key)
            if item_at_mid < w:
                lb = mid_index + 1
            else:
                ub = mid_index


This is a linear search; it is a bit faster, but still not fast enough for my amount of data.

with open("/home/yan/data.txt") as f:
    d = json.load(f) #dictionary to search in
    for w in lst:
        for key, va

Solution

I'd suggest turning the problem around. A dictionary is really good for looking up a value by its key, but not for finding the key that belongs to a specific value.
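
A rough count shows why the direction of the lookup matters. The figures below are the lower bounds given in the question (150,000 keys, 2 million words). The sketch only counts how many dictionary entries each approach touches and ignores the extra cost of searching inside each value list, so treat it as an order-of-magnitude estimate.

```python
n_keys = 150_000       # entries in the dictionary (figure from the question)
n_words = 2_000_000    # words in the text (figure from the question)

# Searching by value: every word is checked against every entry.
entries_visited_scan = n_words * n_keys

# Searching by key in a dictionary keyed by word form: one hash lookup per word.
lookups_by_key = n_words

print(entries_visited_scan)                     # 300000000000
print(entries_visited_scan // lookups_by_key)   # 150000
```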

First, you need to build a reversed dictionary that maps each word form to its base form(s):

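import json

# load the original mapping: base form -> list of word forms
with open("/home/yan/data.txt") as f:
    kv_dict = json.load(f)

# reverse it: word form -> list of base forms
vk_dict = {}
for k, vs in kv_dict.items():
    for v in vs:
        vk_dict.setdefault(v, []).append(k)

# save the reversed mapping so it can be reused
with open("/home/yan/data_rev.txt", "w") as f:
    json.dump(vk_dict, f)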
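
As a small illustration of what the reversed mapping looks like, the sketch below runs the same loop on a toy in-memory dictionary instead of data.txt. The sample entries are invented for the example. In particular, the form "мила" is assumed to belong to two different base forms, which is why each value in the reversed dictionary is a list rather than a single key.

```python
# Toy stand-in for the contents of data.txt (invented sample data).
kv_dict = {
    "мити": ["мию", "миєш", "мила"],
    "милий": ["милого", "мила"],
}

vk_dict = {}
for k, vs in kv_dict.items():
    for v in vs:
        # setdefault creates the list the first time v is seen, then we append
        vk_dict.setdefault(v, []).append(k)

print(vk_dict["мию"])   # ['мити']
print(vk_dict["мила"])  # ['мити', 'милий']
```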


Then, in your code, you can just write

import json

with open("/home/yan/data_rev.txt") as f:
    d = json.load(f)

for w in lst:  # lst is the list of words from the text
    for k in d.get(w, []):  # unknown word -> empty list, nothing printed
        print(k)
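
If the results are needed for further processing rather than printed, the lookup can be wrapped in a small function. The following is a sketch under assumptions: the function name `lemmatize`, the choice to return unknown words unchanged, and the toy data are not part of the original answer.

```python
def lemmatize(words, form_to_lemmas):
    """Map each word to its list of candidate base forms.

    Words missing from the reversed dictionary are returned unchanged
    (an assumption; dropping them would be equally valid).
    """
    result = []
    for w in words:
        # dict.get is a hash lookup, so its cost does not grow with dictionary size
        result.append(form_to_lemmas.get(w, [w]))
    return result

# Toy data standing in for the loaded data_rev.txt and the token list.
d = {"конфужу": ["конфузити"], "конфузив": ["конфузити"]}
lst = ["конфужу", "слово", "конфузив"]

print(lemmatize(lst, d))
# [['конфузити'], ['слово'], ['конфузити']]
```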


The advantage: building data_rev.txt only needs to be done when data.txt changes, which is hopefully not that often.
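
One way to automate that is to compare file modification times and rebuild the reversed file only when it is missing or older than the source. This is a sketch, not part of the original answer: the function name `load_reversed` and its parameters are hypothetical, and it assumes that modification times are a reliable signal that data.txt has changed.

```python
import json
import os

def load_reversed(src_path, rev_path):
    """Return the form -> [base forms] dict, rebuilding rev_path if it is stale."""
    stale = (
        not os.path.exists(rev_path)
        or os.path.getmtime(rev_path) < os.path.getmtime(src_path)
    )
    if stale:
        with open(src_path, encoding="utf-8") as f:
            kv_dict = json.load(f)
        vk_dict = {}
        for k, vs in kv_dict.items():
            for v in vs:
                vk_dict.setdefault(v, []).append(k)
        with open(rev_path, "w", encoding="utf-8") as f:
            json.dump(vk_dict, f)
        return vk_dict
    # the cached file is up to date, so just load it
    with open(rev_path, encoding="utf-8") as f:
        return json.load(f)
```

With the paths from above, this would be called as `d = load_reversed("/home/yan/data.txt", "/home/yan/data_rev.txt")`.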

Context

StackExchange Code Review Q#124377, answer score: 5
