HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Python DictionaryTagger class takes too long to initialize

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
dictionarytaggerlongtoopythontakesclassinitialize

Problem

With big inspiration from code from different webpages, I have created this DictionaryTagger. My problem is that it takes around 2 sec. to initialize the taggers / class.

When the tagger has been initialized, it only takes a few milliseconds to use the different taggers. I see two different ways to optimize the tagger:

  • Do the initialize of the tagger in some preload function. This have


to be used through a homepage. If it could be initialize and the stay
in memory all the time.

  • Rewrite the whole class with eye for performance



```
class DictionaryTagger(object):

def __init__(self, dictionary_paths):
files = [open(path, 'r') for path in dictionary_paths]
dictionaries = [yaml.load(dict_file) for dict_file in files]
map(lambda x: x.close(), files)
self.dictionary = {}
self.max_key_size = 0
for curr_dict in dictionaries:
for key in curr_dict:
#print(key)
if key in self.dictionary:
self.dictionary[key].extend(curr_dict[key])
else:
self.dictionary[key] = curr_dict[key]
self.max_key_size = max(self.max_key_size, len(key))

def tag(self, postagged_sentences):
return [self.tag_sentence(sentence) for sentence in postagged_sentences]

def tag_sentence(self, sentence, tag_with_lemmas=False):
"""
the result is only one tagging of all the possible ones.
:type sentence: object
The resulting tagging is determined by these two priority rules:
- longest matches have higher priority
- search is made from left to right
"""
tag_sentence = []
n = len(sentence)
if self.max_key_size == 0:
self.max_key_size = n
i = 0
while i i:
expression_form = ' '.join([word[0] for word in sentence[i:j]]).lower()
expression_lemma = ' '.join([word[1] for word in sentence[i:j]]).lower()
if tag_with_lemmas:
literal = expression_lemma
else:

Solution

So far all I can see being called is the __init__ method, so I thought I would focus on that. It feels weird, so I wanted to first explain why it felt that way, and then address your questions a little.

Readability first

I think the core reason is that this code feels weird is that it appears to use list comprehensions "just because". It creates a temporary list called files; it creates a temporary list called dictionaries; it closes all the already-opened files, and then it iterates over all the dictionaries. This feels weird because these temporaries serve no obvious purpose. Instead the code could easily set its invariants (initializing self.dictionary and self.max_key_size) and then do a simple for loop over thedictionary_paths:

def __init__(self, dictionary_paths)
    self.dictionary = {}
    self.max_key_size = 0
    for path in dictionary_paths:
        dict_file = open(path, 'r')
        curr_dict = yaml.load(dict_file)
        dict_file.close()
        for key in curr_dict: 
            : : :


This may have some performance benefits, but they are likely to be too small to measure. Namely it puts less file-based pressure on the system as it only opens a single file at a time. Similarly there's somewhat less memory pressure, as only one curr_dict is maintained at a time. One downside is if initialization should fail if it can't open one of the dictionary_paths, that now happens after it does more work. I'm fine with that trade-off.

This change also gives you the opportunity to use modern approaches for closing files. For example, the open/load/close could instead use with to automatically close at the end of the block:

with open(path, 'r') as dict_file:
        curr_dict = yaml.load(dict_file)


All in all, this advice so far makes things slightly easier to read. There are fewer moving pieces to keep in your head at the same time, and the ones that are there, are ones that we are used to tracking.

Performance concerns

Once the code is easy to read, it becomes easier to consider performance. If you're used to optimizing a compiled language, you'll find that optimizing python can be counter-intuitive. Where in a compiled language it's sometimes helpful to do a larger number of smaller operations (building manual caches, doing shifts and additions to avoid a multiplication), in python it's almost always the reverse: the fewer operations the better, even if they each take longer. What's in common? Always profile. I don't have your data, so I'm going to shoot from the hip here and offer some general ideas. It's up to you to profile them and see if any of the ideas are useful.

Loops are a good first guess, and the loop of __init__ is almost just a conditional. At first it might seem like there's nowhere to go with this, but there are actually three main things I would test to see if they are worth changing. First, the code traverses the internals of curr_dict multiple times; once is in the for loop, and once is in the lookup curr_dict[key]. Do these together by using dict.items (or dict.iteritems in python 2.x if curr_dict might be large):

for key, value in curr_dict.items():         # or .iteritems() in 2.x
    if key in self.dictionary:
        self.dictionary[key].extend(value)
    else:
        self.dictionary[key] = value
        self.max_key_size = max(self.max_key_size, len(key))


Second, to make the if and else case more similar, pull the max_key_size calculation out of the conditional:

for key, value in curr_dict.items():         # or .iteritems() in 2.x
    if key in self.dictionary:
        self.dictionary[key].extend(value)
    else:
        self.dictionary[key] = value
    self.max_key_size = max(self.max_key_size, len(key))


Or better, per what I mentioned above, let's call it once outside of the loop, having it do more work the single time we call it:

for key, value in curr_dict.items():         # or .iteritems() in 2.x
    if key in self.dictionary:
        self.dictionary[key].extend(value)
    else:
        self.dictionary[key] = value
self.max_key_size = max(len(key) for key in self.dictionary)


Third, back to the conditional. Let's examine two ways to collapse this all into a single line of code. One way is to use dict.setdefault. It reads a little funny, but it does exactly the above in one line:

self.dictionary.setdefault(key, []).extend(value)


The upside is that this only traverses the dictionary internals once. What's the downside? Even if you don't need it, this creates an empty list. That's unlikely to be a big performance concern, but if it is, the second way avoids this. A specialized dictionary type, collections.defaultdict lets you choose a "factory" for missing items. So selecting list as that factory will make self.dictionary do something similar to that setdefault line on its own:

```
self.dictionary = collections.defaultdict(list)
: : :
self.dictionary[key].ex

Code Snippets

def __init__(self, dictionary_paths)
    self.dictionary = {}
    self.max_key_size = 0
    for path in dictionary_paths:
        dict_file = open(path, 'r')
        curr_dict = yaml.load(dict_file)
        dict_file.close()
        for key in curr_dict: 
            : : :
with open(path, 'r') as dict_file:
        curr_dict = yaml.load(dict_file)
for key, value in curr_dict.items():         # or .iteritems() in 2.x
    if key in self.dictionary:
        self.dictionary[key].extend(value)
    else:
        self.dictionary[key] = value
        self.max_key_size = max(self.max_key_size, len(key))
for key, value in curr_dict.items():         # or .iteritems() in 2.x
    if key in self.dictionary:
        self.dictionary[key].extend(value)
    else:
        self.dictionary[key] = value
    self.max_key_size = max(self.max_key_size, len(key))
for key, value in curr_dict.items():         # or .iteritems() in 2.x
    if key in self.dictionary:
        self.dictionary[key].extend(value)
    else:
        self.dictionary[key] = value
self.max_key_size = max(len(key) for key in self.dictionary)

Context

StackExchange Code Review Q#65342, answer score: 4

Revisions (0)

No revisions yet.