Removing all stopwords from a list of words

Problem

What is the fastest Pythonic way to remove all stopwords from a list of words in a document? Right now I am using a list comprehension that contains a for loop.

import nltk
from nltk.corpus import stopwords

# Load the English stopwords into a list
stop = stopwords.words('english')
Document = ' Some huge text .......................... '
# Tokenize the document
words = nltk.word_tokenize(Document)
# Keep only the words that are not in the stopword list
stopwordsfree_words = [word for word in words if word not in stop]


Is there a faster way to do this?

Solution

If stop is a list containing $s$ stopwords, and words is a list containing $w$ words, then the list comprehension takes $O(ws)$ time: each "word not in stop" test scans the stop list linearly, so the work amounts to a nested loop over both lists.

However, if you make the stopwords into a set

stop = set(stopwords.words('english'))


… then each lookup takes $O(1)$ time on average, because a set is hash-based. The total running time drops to $O(w)$, plus a one-time $O(s)$ cost to build the set, just by changing the data structure like that.
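
As a rough illustration of the set-based approach described above, here is a minimal sketch that runs without NLTK. The short STOPWORDS list and the remove_stopwords helper are hypothetical stand-ins introduced for this example; in the original code the list would come from stopwords.words('english') and the words from nltk.word_tokenize.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
STOPWORDS = ["the", "is", "a", "of", "and", "in"]


def remove_stopwords(words, stopwords):
    """Return the words that are not stopwords, preserving their order."""
    # Build the set once, outside the comprehension, so that each
    # membership test is a hash lookup instead of a linear scan.
    stop = set(stopwords)
    return [word for word in words if word not in stop]


# Simple whitespace split used here in place of a real tokenizer.
words = "the cat is in the hat".split()
filtered = remove_stopwords(words, STOPWORDS)
print(filtered)
```

The important detail is that the set is built once before filtering; calling set(...) inside the comprehension would rebuild it for every word and throw away the gain.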

Another minor issue is that by convention (PEP 8), Document should be lowercase (document), because it is a variable rather than a class.

Context

StackExchange Code Review Q#90692, answer score: 9
