Removing all stopwords from a list of words
Problem
What is the fastest Pythonic way to remove all stopwords from a list of words in a document? Right now I am using a list comprehension that contains a for loop.

import nltk
from nltk.corpus import stopwords

''' Push stopwords to a list '''
stop = stopwords.words('english')

Document = ' Some huge text .......................... '

''' Tokenize the doc '''
words = nltk.word_tokenize(Document)

''' Comparing two lists '''
stopwordsfree_words = [word for word in words if word not in stop]

Is there a faster way to do this?
Solution
If stop is a list containing \$s\$ stopwords, and words is a list containing \$w\$ words, then the loop in the list comprehension will be \$O(w s)\$, since it basically has to iterate over both lists in a nested loop.

However, if you make the stopwords into a set …

stop = set(stopwords.words('english'))

… then each lookup can be done in \$O(1)\$ time. You would get \$O(w)\$ running time just by changing the data structure like that.
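The effect of that one-line change can be seen with a minimal benchmark sketch. The word and stopword data below are made up for illustration; only the list-versus-set membership test is the point.

```python
import timeit

# Made-up data: simulate s stopwords and w document words.
stop_list = ['the', 'a', 'an', 'in', 'of', 'to'] * 50   # list: O(s) per lookup
stop_set = set(stop_list)                               # set: O(1) per lookup
words = ['some', 'huge', 'text', 'the', 'of', 'data'] * 1000

# Same list comprehension, different membership data structure.
list_time = timeit.timeit(
    lambda: [w for w in words if w not in stop_list], number=10)
set_time = timeit.timeit(
    lambda: [w for w in words if w not in stop_set], number=10)

print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```

Both versions produce the same filtered list; only the lookup cost changes, and the set version should be dramatically faster as the stopword list grows.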
Another minor issue is that by convention, Document should be lowercase, because it is a variable rather than a class.

Code Snippets
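Putting both suggestions together, a sketch of the revised pipeline could look like the following. For portability this sketch uses a small hard-coded stopword set and str.split() in place of stopwords.words('english') and nltk.word_tokenize; swap those back in when NLTK and its corpora are available.

```python
# Stand-in stopword set; in the real pipeline this would be
# set(stopwords.words('english')). A set gives O(1) membership tests.
stop = {'the', 'a', 'an', 'in', 'of', 'to', 'is'}

# Lowercase variable name, per the naming convention noted above.
document = 'the cat sat in a corner of the huge room'

# Stand-in tokenizer; the real pipeline would use nltk.word_tokenize(document).
words = document.split()

# Same list comprehension as before, now O(w) overall.
stopwordsfree_words = [word for word in words if word not in stop]
print(stopwordsfree_words)  # ['cat', 'sat', 'corner', 'huge', 'room']
```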
stop = set(stopwords.words('english'))

Context
StackExchange Code Review Q#90692, answer score: 9