patternpythonMinor
Words in text counter
Viewed 0 times
wordscountertext
Problem
This accepts text and returns a list of all words which occur with a required frequency (default frequency is 2). It works well, but gets too slow with large texts, more than 1M words. Where can this function be improved to speed searches in larger text?
from collections import Counter
import re
def words_in_text_counter(in_text,frequency=2):
no_noise_text = re.sub(r'[0-9.,:;<>"#!%&/()=?*-+]','',in_text.lower()).strip().split()
frequency_dict = {}
for key, value in Counter(no_noise_text).items():
frequency_dict.setdefault(value, []).append(key)
try:
print(str(frequency)+' times words in text.')
return frequency_dict[frequency]
except KeyError:
return str(frequency)+' times never happens in text.'Solution
Your problem is ultimately:
There's a lot to dislike about this line. It horribly violates the Law of Demeter. It takes a while to figure out what it is supposed to do. It uses the wrong tool for the job. And it's inefficient.
First, to start with, if the goal is to split a strict by non-alpha characters, then we should just use the correct function:
Next, this is going to be inefficient. We have to walk the entire string once (to make it lower-case), then again to split it. And then we have to keep all of the words in memory. We don't have to do any of that if we really start with the correct function:
That will give you an iterator over the words. Which you can stick into a generator expression that you can pass to
And if you're already using
Note that I'm using
Furthermore, why do we need a
Putting it all together
no_noise_text = re.sub(r'[0-9.,:;<>"#!%&/()=?*-+]','',in_text.lower()).strip().split()There's a lot to dislike about this line. It horribly violates the Law of Demeter. It takes a while to figure out what it is supposed to do. It uses the wrong tool for the job. And it's inefficient.
First, to start with, if the goal is to split a strict by non-alpha characters, then we should just use the correct function:
re.split:words = re.split("[^a-z']+", in_text.lower())
counts = Counter(words)Next, this is going to be inefficient. We have to walk the entire string once (to make it lower-case), then again to split it. And then we have to keep all of the words in memory. We don't have to do any of that if we really start with the correct function:
re.finditer:re.finditer("([a-zA-Z']+)", in_text)That will give you an iterator over the words. Which you can stick into a generator expression that you can pass to
Counter:words = re.finditer("([a-zA-Z']+)", in_text)
counts = Counter(m.group(0).lower() for m in words)And if you're already using
Counter, you may as well use defaultdict:freq_dict = defaultdict(list)
for key, val in counts.iteritems():
freq_dict[key].append(val)Note that I'm using
iteritems() (which just gives you a generator) rather than items() (which must give you the full list). Furthermore, why do we need a
frequency_dict? We only care about those words which match the given frequency. Why not just keep a list?frequency_list = []
for key, val in counts.iteritems():
if val == frequency:
frequency_list.append(key)Putting it all together
from collections import Counter
import re
def words_in_text_counter(in_text, frequency=2):
words = re.finditer("([a-zA-Z']+)", in_text)
counts = Counter(m.group(0).lower() for m in words)
frequency_list = []
for key, val in counts.iteritems():
if val == frequency:
frequency_list.append(key)
if frequency_list:
print(str(frequency)+' times words in text.')
return frequency_list
else:
return str(frequency)+' times never happens in text.'Code Snippets
no_noise_text = re.sub(r'[0-9.,:;<>"#!%&/()=?*-+]','',in_text.lower()).strip().split()words = re.split("[^a-z']+", in_text.lower())
counts = Counter(words)re.finditer("([a-zA-Z']+)", in_text)words = re.finditer("([a-zA-Z']+)", in_text)
counts = Counter(m.group(0).lower() for m in words)freq_dict = defaultdict(list)
for key, val in counts.iteritems():
freq_dict[key].append(val)Context
StackExchange Code Review Q#104657, answer score: 3
Revisions (0)
No revisions yet.