patternpythonMinor
Improve efficiency for finding word frequency in text
Viewed 0 times
frequencytextimprovewordforfindingefficiency
Problem
Premise: Write a program that counts the frequencies of each word in a text, and
output each word with its count and line numbers where it appears. We
define a word as a contiguous sequence of non-white-space characters.
Different capitalizations of the same character sequence should be
considered same word (e.g. Python and python). The output is formatted as
follows: each line begins with a number indicating the frequency of the word,
a white space, then the word itself, and a list of line numbers containing this
word. You should output from the most frequent word to the least frequent.
In case two words have the same frequency, the lexicographically smaller
one comes first. All words are in lower case in the output.
Solution:
```
# Amazing Python libraries that prevent re-invention of wheels.
import re
import collections
import string
from operator import itemgetter
# Simulate work for terminal
print 'Scanning input for word frequency...'
# Text files
input = open('input.txt', 'r')
output = open('output.txt', 'w')
# Reads every word in input as lowercase while ignoring punctuation and
# returns a list containing 2-tuples of (word, frequency).
list = collections.Counter(input.read().lower()
.translate(None, string.punctuation).split()).most_common()
# Sorts the list by frequency in descending order.
list = sorted(list, key=itemgetter(0))
# Sorts the list by lexicogrpahical order in descending order while maintaining
# previous sorting.
list = sorted(list, key=itemgetter(1), reverse=True)
# Gets the lines where each word in list appears in the input.
for word in list:
lineList = []
# Reads the input line by line. Stores each text line in 'line' and keeps a
# counter in 'lineNum' that starts at 1.
for lineNum, line in enumerate(open('input.txt'),1):
# If the word is in the current line then add it to the list of lines
# in which the current word appears in. The word frequency list is not
# mutated. This is stored in a sepa
output each word with its count and line numbers where it appears. We
define a word as a contiguous sequence of non-white-space characters.
Different capitalizations of the same character sequence should be
considered same word (e.g. Python and python). The output is formatted as
follows: each line begins with a number indicating the frequency of the word,
a white space, then the word itself, and a list of line numbers containing this
word. You should output from the most frequent word to the least frequent.
In case two words have the same frequency, the lexicographically smaller
one comes first. All words are in lower case in the output.
Solution:
```
# Amazing Python libraries that prevent re-invention of wheels.
import re
import collections
import string
from operator import itemgetter
# Simulate work for terminal
print 'Scanning input for word frequency...'
# Text files
input = open('input.txt', 'r')
output = open('output.txt', 'w')
# Reads every word in input as lowercase while ignoring punctuation and
# returns a list containing 2-tuples of (word, frequency).
list = collections.Counter(input.read().lower()
.translate(None, string.punctuation).split()).most_common()
# Sorts the list by frequency in descending order.
list = sorted(list, key=itemgetter(0))
# Sorts the list by lexicogrpahical order in descending order while maintaining
# previous sorting.
list = sorted(list, key=itemgetter(1), reverse=True)
# Gets the lines where each word in list appears in the input.
for word in list:
lineList = []
# Reads the input line by line. Stores each text line in 'line' and keeps a
# counter in 'lineNum' that starts at 1.
for lineNum, line in enumerate(open('input.txt'),1):
# If the word is in the current line then add it to the list of lines
# in which the current word appears in. The word frequency list is not
# mutated. This is stored in a sepa
Solution
You're right, the basic point to make this program faster is to store the line numbers in the first pass.
The
If you start using a more generic data structure than your
In one pass over the input you can populate this data structure. In a second step, you can properly sort this data structure and iterate over the elements.
An example implementation:
A few more notes:
-
If you
-
Don't hardcode input/output file names. Read from standard input (
-
Don't litter standard output with progress messages. If you absolutely need them use
-
Don't use regular expressions, unless it's necessary. Prefer string methods for their simplicity and speed.
-
Name your variables in
The
for word in list loop will run many times. The count depends on how many words you have. Every time it runs, it reopens and rereads the whole input file. This is awfully slow.If you start using a more generic data structure than your
Counter(), then you'll be able to store the line numbers. For example you could use a dictionary. The key of the dict is the word, the value is a list of line numbers. The hit count is stored indirectly as the length of the list of line numbers.In one pass over the input you can populate this data structure. In a second step, you can properly sort this data structure and iterate over the elements.
An example implementation:
from collections import defaultdict
import sys
if __name__ == "__main__":
hit = defaultdict(list)
for lineno, line in enumerate(sys.stdin, start=1):
for word in line.split():
hit[word.lower()].append(lineno)
for word, places in sorted(hit.iteritems(), key=lambda (w, p): (-len(p), w)):
print len(places), word, ",".join(map(str, sorted(set(places))))A few more notes:
-
If you
open() a file, close() it too. As soon as you're done with using it. (Or use the with statement to ensure that it gets closed.)-
Don't hardcode input/output file names. Read from standard input (
sys.stdin), write to standard output (sys.stdout or simply print), and let the user use his shell's redirections: program output.txt-
Don't litter standard output with progress messages. If you absolutely need them use
sys.stderr.write or get acquainted with the logging module.-
Don't use regular expressions, unless it's necessary. Prefer string methods for their simplicity and speed.
-
Name your variables in
lower_case_with_underscores as PEP8 recommends.Code Snippets
from collections import defaultdict
import sys
if __name__ == "__main__":
hit = defaultdict(list)
for lineno, line in enumerate(sys.stdin, start=1):
for word in line.split():
hit[word.lower()].append(lineno)
for word, places in sorted(hit.iteritems(), key=lambda (w, p): (-len(p), w)):
print len(places), word, ",".join(map(str, sorted(set(places))))Context
StackExchange Code Review Q#23662, answer score: 3
Revisions (0)
No revisions yet.