HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Words in text counter

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
wordscountertext

Problem

This accepts text and returns a list of all words which occur with a required frequency (default frequency is 2). It works well, but gets too slow with large texts, more than 1M words. Where can this function be improved to speed searches in larger text?

from collections import Counter
import re

def words_in_text_counter(in_text,frequency=2):
    no_noise_text = re.sub(r'[0-9.,:;<>"#!%&/()=?*-+]','',in_text.lower()).strip().split()
    frequency_dict = {}
    for key, value in Counter(no_noise_text).items():
        frequency_dict.setdefault(value, []).append(key)
    try:
        print(str(frequency)+' times words in text.')
        return frequency_dict[frequency]
    except KeyError:
        return str(frequency)+' times never happens in text.'

Solution

Your problem is ultimately:

no_noise_text = re.sub(r'[0-9.,:;<>"#!%&/()=?*-+]','',in_text.lower()).strip().split()


There's a lot to dislike about this line. It horribly violates the Law of Demeter. It takes a while to figure out what it is supposed to do. It uses the wrong tool for the job. And it's inefficient.

First, to start with, if the goal is to split a strict by non-alpha characters, then we should just use the correct function: re.split:

words = re.split("[^a-z']+", in_text.lower())
counts = Counter(words)


Next, this is going to be inefficient. We have to walk the entire string once (to make it lower-case), then again to split it. And then we have to keep all of the words in memory. We don't have to do any of that if we really start with the correct function: re.finditer:

re.finditer("([a-zA-Z']+)", in_text)


That will give you an iterator over the words. Which you can stick into a generator expression that you can pass to Counter:

words = re.finditer("([a-zA-Z']+)", in_text)
counts = Counter(m.group(0).lower() for m in words)


And if you're already using Counter, you may as well use defaultdict:

freq_dict = defaultdict(list)
for key, val in counts.iteritems():
    freq_dict[key].append(val)


Note that I'm using iteritems() (which just gives you a generator) rather than items() (which must give you the full list).

Furthermore, why do we need a frequency_dict? We only care about those words which match the given frequency. Why not just keep a list?

frequency_list = []
for key, val in counts.iteritems():
    if val == frequency:
        frequency_list.append(key)


Putting it all together

from collections import Counter
import re

def words_in_text_counter(in_text, frequency=2):
    words = re.finditer("([a-zA-Z']+)", in_text)
    counts = Counter(m.group(0).lower() for m in words)

    frequency_list = []
    for key, val in counts.iteritems():
        if val == frequency:
            frequency_list.append(key)

    if frequency_list:
        print(str(frequency)+' times words in text.')
        return frequency_list
    else:
        return str(frequency)+' times never happens in text.'

Code Snippets

no_noise_text = re.sub(r'[0-9.,:;<>"#!%&/()=?*-+]','',in_text.lower()).strip().split()
words = re.split("[^a-z']+", in_text.lower())
counts = Counter(words)
re.finditer("([a-zA-Z']+)", in_text)
words = re.finditer("([a-zA-Z']+)", in_text)
counts = Counter(m.group(0).lower() for m in words)
freq_dict = defaultdict(list)
for key, val in counts.iteritems():
    freq_dict[key].append(val)

Context

StackExchange Code Review Q#104657, answer score: 3

Revisions (0)

No revisions yet.