NLTK sentence / word tokenize
Problem
I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dictionary of the frequency of each word.
import nltk
from collections import Counter

def freq(string):
    f = Counter()
    sentence_list = nltk.tokenize.sent_tokenize(string)
    for sentence in sentence_list:
        words = nltk.word_tokenize(sentence)
        words = [word.lower() for word in words]
        for word in words:
            f[word] += 1
    return f

I'm supposed to optimize the above code further, to reduce preprocessing time, and am unsure how to do so. The return value should obviously be exactly the same as the above, so I'm expected to use NLTK, though I'm not explicitly required to do so. I also have access to scipy, numpy and pandas, though I'm not sure how they would help in this case.
Is there any way to speed up the code? Or, if not, any way to make the code at least more elegant (i.e. pythonic)?
Solution
- Quick review

The collections.Counter class has an update method that adds counts for items in an iterable. So instead of:

words = nltk.word_tokenize(sentence)
words = [word.lower() for word in words]
for word in words:
    f[word] += 1

you could write:

f.update(word.lower() for word in nltk.word_tokenize(sentence))

There's no need to call sent_tokenize if you are then going to call word_tokenize on the results: if you look at the implementation of word_tokenize you'll see that it calls sent_tokenize, so by calling it yourself you're doubling the amount of work here.

Revised code:

def freq2(string):
    return Counter(word.lower() for word in nltk.word_tokenize(string))

This is about 35% faster than the original code, mostly due to avoiding the duplicate sentence processing. But can we make further progress?
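Before going further: since the return value must stay exactly the same, it is worth checking that freq and freq2 agree. A minimal sketch of such a check, using the two functions defined above and assuming the NLTK punkt tokenizer data is installed:

# Hypothetical sanity check: both implementations should return identical
# Counters for the same input (requires nltk.download('punkt')).
sample = "The cat sat on the mat. The mat was flat."
assert freq(sample) == freq2(sample)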
- Profiling

When you have to improve the speed of some piece of code, there is a standard approach that works like this:

1. Prepare a repeatable and representative test case whose execution time you can measure.
2. Profile the execution of the code on the test case you prepared in step 1.
3. The profiling results will usually show that a small fraction of the code is responsible for a large fraction of the runtime. (If this is not the case, you've reached the limits of what can be achieved by this approach, so stop here.)
4. Investigate the code you identified in step 3 and make it faster.
5. Go to step 2.
So let's try this. Here's the test case:

TEST_FILE = 'nltk_data/corpora/gutenberg/melville-moby_dick.txt'
TEST_TEXT = open(TEST_FILE).read()
TEST_CASE = lambda: freq2(TEST_TEXT)

We can time the execution using the timeit module:

>>> from timeit import timeit
>>> timeit(TEST_CASE, number=1)
1.699578917992767

And profile the execution using the cProfile module:

>>> import cProfile
>>> cProfile.run('TEST_CASE()', sort='time')
2201133 function calls (2158188 primitive calls) in 2.127 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
281219 1.100 0.000 1.303 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
9853 0.165 0.000 0.573 0.000 punkt.py:1287(_slices_from_text)
216744 0.082 0.000 0.129 0.000 re.py:323(_subx)
9852 0.064 0.000 1.366 0.000 treebank.py:96(tokenize)
254990 0.060 0.000 0.084 0.000 cr153873.py:16(<genexpr>)
1 0.058 0.058 0.142 0.142 {built-in method _collections._count_elements}
25066 0.054 0.000 0.120 0.000 punkt.py:388(__init__)
35055 0.052 0.000 0.225 0.000 punkt.py:533(_tokenize_words)
216744 0.047 0.000 0.047 0.000 re.py:306(_compile_repl)
37938 0.042 0.000 0.056 0.000 sre_parse.py:931(expand_template)
12491 0.033 0.000 0.033 0.000 {method 'findall' of '_sre.SRE_Pattern' objects}
25066 0.031 0.000 0.056 0.000 punkt.py:581(_first_pass_annotation)
294926 0.029 0.000 0.029 0.000 {method 'lower' of 'str' objects}
[... and so on ...]

It takes a bit of practice to read this. Each row of the table gives statistics for a single function. The "tottime" column gives the total time spent running code in that function (but only in that function, not in any functions called by it). The "cumtime" column gives the cumulative time spent in that function (including time spent in functions called by it).
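If you save the profile data to a file (as is done below when looking for callers), the pstats module can also re-sort the same data after the fact, for example by the "cumtime" column. A small sketch, assuming the profile has been saved under the file name used below:

>>> import pstats
>>> # Reload the saved profile and show the ten rows with the largest
>>> # cumulative time.
>>> pstats.Stats('cr153873.profile').sort_stats('cumulative').print_stats(10)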
The profiling results satisfy the condition in step (3): that is, a large fraction of the runtime (more than half) is spent in a single function, the sub method on regular expression objects. Why is so much time being spent here? In order to find out who is calling this method, we can use the print_callers method from the pstats module, like this:

>>> cProfile.run('TEST_CASE()', 'cr153873.profile')
>>> import pstats
>>> pstats.Stats('cr153873.profile').print_callers("method 'sub'")
   Random listing order was used
   List reduced from 83 to 1 due to restriction

Function          was called by...
                      ncalls  tottime  cumtime
{method 'sub'}    <-       1    0.000    0.000  re.py:175(sub)
                       25066    0.022    0.022  nltk/tokenize/punkt.py:411(_get_type)
                      256152    1.078    1.282  nltk/tokenize/treebank.py:96(tokenize)

So the caller that's responsible is the tokenize method of the TreebankWordTokenizer class, which is called by nltk.word_tokenize. Looking at the implementation, you can see that it works by making many regular expression substitutions to the text being tokenized. Each substitution requires a pass over the text and then makes a copy of the text. This is wasteful: it would be more efficient to pass once over the text being tokenized, and yield the tokens as they are found.

- A faster word tokenizer
So it looks as if one way to significantly improve the performance of the code would be to use a word tokenizer that makes a single pass over the text and yields tokens as it finds them, rather than one built on repeated regular expression substitutions.
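As an illustration of that single-pass idea, here is a minimal sketch built on re.finditer. The pattern is an assumption of mine, not the answer's tokenizer, and it does not reproduce word_tokenize's output exactly (contractions such as "don't" come out differently), so it shows the approach rather than a drop-in replacement:

import re
from collections import Counter

# Hypothetical single-pass tokenizer: one scan over the text with re.finditer,
# yielding each token as it is found, instead of applying many regular
# expression substitutions that each copy the whole text.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def fast_word_tokenize(text):
    for match in TOKEN_RE.finditer(text):
        yield match.group()

def freq3(string):
    return Counter(token.lower() for token in fast_word_tokenize(string))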
Context
StackExchange Code Review Q#153873, answer score: 8