NLTK sentence / word tokenize
Problem
I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dictionary of the frequency of each word.
import nltk
from collections import Counter

def freq(string):
    f = Counter()
    sentence_list = nltk.tokenize.sent_tokenize(string)
    for sentence in sentence_list:
        words = nltk.word_tokenize(sentence)
        words = [word.lower() for word in words]
        for word in words:
            f[word] += 1
    return f

I'm supposed to optimize the above code further, to reduce preprocessing time, and am unsure how to do so. The return value should obviously be exactly the same as the above, so I'm expected to use NLTK, though I'm not explicitly required to do so. I also have access to scipy, numpy and pandas, though I'm not sure how they would help in this case.
Is there any way to speed up the code? Or, if not, any way to make the code at least more elegant (i.e. pythonic)?
Solution
- Quick review

The collections.Counter class has an update method that adds counts for items in an iterable. So instead of:

words = nltk.word_tokenize(sentence)
words = [word.lower() for word in words]
for word in words:
    f[word] += 1

you could write:

f.update(word.lower() for word in nltk.word_tokenize(sentence))

There's no need to call sent_tokenize if you are then going to call word_tokenize on the results: if you look at the implementation of word_tokenize you'll see that it calls sent_tokenize, so by calling it yourself you're doubling the amount of work here.

Revised code:

def freq2(string):
    return Counter(word.lower() for word in nltk.word_tokenize(string))

This is about 35% faster than the original code, mostly due to avoiding the duplicate sentence processing. But can we make further progress?
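Before going further: since the return value must stay exactly the same, it is worth checking that freq and freq2 agree. A minimal sketch of such a check, using the two functions defined above and assuming the NLTK punkt tokenizer data is installed:

# Hypothetical sanity check: both implementations should return identical
# Counters for the same input (requires nltk.download('punkt')).
sample = "The cat sat on the mat. The mat was flat."
assert freq(sample) == freq2(sample)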
- Profiling

When you have to improve the speed of some piece of code, there is a standard approach that works like this:

1. Prepare a repeatable and representative test case whose execution time you can measure.
2. Profile the execution of the code on the test case you prepared in step 1.
3. The profiling results will usually show that a small fraction of the code is responsible for a large fraction of the runtime. (If this is not the case, you've reached the limits of what can be achieved by this approach, so stop here.)
4. Investigate the code you identified in step 3 and make it faster.
5. Go to step 2.
So let's try this. Here's the test case:

TEST_FILE = 'nltk_data/corpora/gutenberg/melville-moby_dick.txt'
TEST_TEXT = open(TEST_FILE).read()
TEST_CASE = lambda: freq2(TEST_TEXT)

We can time the execution using the timeit module:

>>> from timeit import timeit
>>> timeit(TEST_CASE, number=1)
1.699578917992767

And profile the execution using the cProfile module:

>>> import cProfile
>>> cProfile.run('TEST_CASE()', sort='time')
2201133 function calls (2158188 primitive calls) in 2.127 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
281219 1.100 0.000 1.303 0.000 {method 'sub' of '_sre.SRE_Pattern' objects}
9853 0.165 0.000 0.573 0.000 punkt.py:1287(_slices_from_text)
216744 0.082 0.000 0.129 0.000 re.py:323(_subx)
9852 0.064 0.000 1.366 0.000 treebank.py:96(tokenize)
254990 0.060 0.000 0.084 0.000 cr153873.py:16(<genexpr>)
1 0.058 0.058 0.142 0.142 {built-in method _collections._count_elements}
25066 0.054 0.000 0.120 0.000 punkt.py:388(__init__)
35055 0.052 0.000 0.225 0.000 punkt.py:533(_tokenize_words)
216744 0.047 0.000 0.047 0.000 re.py:306(_compile_repl)
37938 0.042 0.000 0.056 0.000 sre_parse.py:931(expand_template)
12491 0.033 0.000 0.033 0.000 {method 'findall' of '_sre.SRE_Pattern' objects}
25066 0.031 0.000 0.056 0.000 punkt.py:581(_first_pass_annotation)
294926 0.029 0.000 0.029 0.000 {method 'lower' of 'str' objects}
[... and so on ...]

It takes a bit of practice to read this. Each row of the table gives statistics for a single function. The "tottime" column gives the total time spent running code in that function (but only in that function, not in any functions called by it). The "cumtime" column gives the cumulative time spent in that function (including time spent in functions called by it).
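If you save the profile data to a file (as is done below when looking for callers), the pstats module can also re-sort the same data after the fact, for example by the "cumtime" column. A small sketch, assuming the profile has been saved under the file name used below:

>>> import pstats
>>> # Reload the saved profile and show the ten rows with the largest
>>> # cumulative time.
>>> pstats.Stats('cr153873.profile').sort_stats('cumulative').print_stats(10)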
The profiling results satisfy the condition in step (3): that is, a large fraction of the runtime (more than half) is spent in a single function, the sub method on regular expression objects. Why is so much time being spent here? In order to find out who is calling this method, we can use the print_callers method from the pstats module, like this:

>>> cProfile.run('TEST_CASE()', 'cr153873.profile')
>>> import pstats
>>> pstats.Stats('cr153873.profile').print_callers("method 'sub'")
   Random listing order was used
   List reduced from 83 to 1 due to restriction

Function          was called by...
                      ncalls  tottime  cumtime
{method 'sub'}    <-       1    0.000    0.000  re.py:175(sub)
                       25066    0.022    0.022  nltk/tokenize/punkt.py:411(_get_type)
                      256152    1.078    1.282  nltk/tokenize/treebank.py:96(tokenize)

So the caller that's responsible is the tokenize method of the TreebankWordTokenizer class, which is called by nltk.word_tokenize. Looking at the implementation, you can see that it works by making many regular expression substitutions to the text being tokenized. Each substitution requires a pass over the text and then makes a copy of the text. This is wasteful: it would be more efficient to pass once over the text being tokenized, and yield the tokens as they are found.

- A faster word tokenizer
So it looks as if one way to significantly improve the performance of the code would be to use a word tokenizer that makes a single pass over the text and yields tokens as it finds them, rather than one built on repeated regular expression substitutions.
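As an illustration of that single-pass idea, here is a minimal sketch built on re.finditer. The pattern is an assumption of mine, not the answer's tokenizer, and it does not reproduce word_tokenize's output exactly (contractions such as "don't" come out differently), so it shows the approach rather than a drop-in replacement:

import re
from collections import Counter

# Hypothetical single-pass tokenizer: one scan over the text with re.finditer,
# yielding each token as it is found, instead of applying many regular
# expression substitutions that each copy the whole text.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def fast_word_tokenize(text):
    for match in TOKEN_RE.finditer(text):
        yield match.group()

def freq3(string):
    return Counter(token.lower() for token in fast_word_tokenize(string))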
Context
StackExchange Code Review Q#153873, answer score: 8