Search in a big dictionary Python
Problem
I have a big Python dictionary with more than 150,000 keys, and every key has a list as its value. The keys are the infinitive forms of words and the values are all grammatical forms of those words:
```
{"конфузити": ["конфузить", "конфужу", "конфузиш", "конфузить", "конфузим", "конфузимо", "конфузите", "конфузять", "конфузитиму", "конфузитимеш", "конфузитиме", "конфузитимем", "конфузитимемо", "конфузитимете", "конфузитимуть", "конфузив", "конфузила", "конфузило", "конфузили"]}
```
I built a list of words from a particular text; it contains more than 2 million words, each in some grammatical form. What I am trying to do is search for these words in my dictionary's values and return the matching dictionary keys, which, as I said, are the base (dictionary) forms of the words. This process is called lemmatization. I have tried different approaches, but they are all too slow.
In this part I perform text tokenization.
```
import re

lst = []
with open("/home/yan/PycharmProjects/vk/my_patrioty/men_patrioty.txt", 'r', encoding="utf-8") as f:
    for sent in f:
        sent = sent.lower()
        # Strip ASCII letters, digits and punctuation. The original class also
        # contained \B, which current Python versions reject inside [...];
        # A-z already covers the letter B.
        sent = re.sub(r"[A-z0-9\'\"`\|\/\+\#\,\)\(\?\!\-\:\=\;\.\«\»\—\@]", '', sent)
        sent = re.findall(r'\w+', sent)
        for word in sent:
            lst.append(word)
```
In this part I am trying to perform binary search, but it is very slow.
```
import json

with open("/home/yan/data.txt") as f:
    d = json.load(f)
for w in lst: #list of my words
    for key, value in d.items():
        lb = 0
        ub = len(value)
        mid_index = (lb + ub) // 2
        item_at_mid = value[mid_index]
        if item_at_mid == w:
            print(key)
        if item_at_mid < w:
            lb = mid_index + 1
        else:
            ub = mid_index
```
This is linear search; it is a bit faster, but still not fast enough for my amount of data.
```
with open("/home/yan/data.txt") as f:
    d = json.load(f) #dictionary to search in
for w in lst:
    for key, va
```
Solution
I'd suggest turning the problem around. A dictionary is really good at looking up a value by its key, but not at finding the key for a specific value.
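To make that asymmetry concrete, here is a minimal sketch on toy data. The English words and the names `forms_by_lemma`, `lemmas_by_form` and `find_lemmas_by_scan` are hypothetical stand-ins, not part of the original code.

```python
# Toy data; the English words are hypothetical stand-ins for the real dictionary.
forms_by_lemma = {
    "go": ["go", "goes", "went", "gone"],
    "be": ["am", "is", "are", "was", "were"],
}


def find_lemmas_by_scan(word, forms_by_lemma):
    """Forward direction: every value list has to be examined for each query."""
    return [lemma for lemma, forms in forms_by_lemma.items() if word in forms]


# Reverse direction: one pass to build, then each query is a single hash lookup.
lemmas_by_form = {}
for lemma, forms in forms_by_lemma.items():
    for form in forms:
        lemmas_by_form.setdefault(form, []).append(lemma)

print(find_lemmas_by_scan("went", forms_by_lemma))  # ['go']
print(lemmas_by_form.get("went", []))               # ['go']
```

Both calls give the same answer, but the scan does work proportional to the whole dictionary for every word, while the reversed mapping does a constant amount of work per word on average.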
First, you need to convert your dictionary to a dictionary in reverse:
```
import json

with open("/home/yan/data.txt") as f:
    kv_dict = json.load(f)
# Map every grammatical form to the list of base forms it belongs to.
vk_dict = {}
for k, vs in kv_dict.items():
    for v in vs:
        vk_dict.setdefault(v, []).append(k)
with open("/home/yan/data_rev.txt", "w") as f:
    json.dump(vk_dict, f)
```
Then, in your code, you can just write
```
with open("/home/yan/data_rev.txt") as f:
    d = json.load(f)
for w in lst:
    # Words missing from the dictionary yield no base forms.
    for k in d.get(w, []):
        print(k)
```
The advantage: building data_rev.txt only needs to be done when data.txt changes, which is hopefully not that often.
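One way to automate the "rebuild only when data.txt changes" step is to compare file modification times. The following is a rough sketch under that assumption; the helper names `build_reverse` and `load_reverse` are hypothetical. It also skips repeated entries, since the sample entry in the question lists "конфузить" twice under the same key.

```python
import json
import os


def build_reverse(kv_dict):
    """Map each word form to the list of base forms it belongs to."""
    vk_dict = {}
    for lemma, forms in kv_dict.items():
        for form in forms:
            lemmas = vk_dict.setdefault(form, [])
            if lemma not in lemmas:  # a form can be listed twice under one key
                lemmas.append(lemma)
    return vk_dict


def load_reverse(src_path, rev_path):
    """Return the reversed dictionary, rebuilding rev_path only if it is stale."""
    stale = (not os.path.exists(rev_path)
             or os.path.getmtime(rev_path) < os.path.getmtime(src_path))
    if stale:
        with open(src_path, encoding="utf-8") as f:
            vk_dict = build_reverse(json.load(f))
        with open(rev_path, "w", encoding="utf-8") as f:
            json.dump(vk_dict, f, ensure_ascii=False)
        return vk_dict
    with open(rev_path, encoding="utf-8") as f:
        return json.load(f)
```

With the paths from the answer, this would be called as `d = load_reverse("/home/yan/data.txt", "/home/yan/data_rev.txt")`. Modification times are only a heuristic; comparing a hash of data.txt would be a stricter check.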
Code Snippets
```
import json

with open("/home/yan/data.txt") as f:
    kv_dict = json.load(f)
vk_dict = {}
for k, vs in kv_dict.items():
    for v in vs:
        vk_dict.setdefault(v, []).append(k)
with open("/home/yan/data_rev.txt", "w") as f:
    json.dump(vk_dict, f)
```
```
with open("/home/yan/data_rev.txt") as f:
    d = json.load(f)
for w in lst:
    for k in d.get(w, []):
        print(k)
```
Context
StackExchange Code Review Q#124377, answer score: 5