
Self-taught Pythonista: Any criticism welcome for this concurrent word count script!

Submitted by: @import:stackexchange-codereview

Problem

I've been teaching myself Python - my first programming language - for about two years now.

I recently discovered the concurrent.futures module and wanted to do something with it. What do you think about this script?

```
import re
import shutil
import string

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import chain, islice, zip_longest
from urllib.request import urlopen

# Regex to use for splitting text into words, dropping everything but
# alphabetic characters.
REGEX = re.compile(r"[{}{}{}]+".format(
    string.whitespace, string.digits, string.punctuation))

# http://docs.python.org/3/library/itertools.html#itertools-recipes
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def split_and_clean(line):
    """Returns an iterator of the words in line.

    Example:
        >>> list(split_and_clean("3, Four, five'"))
        ['four', 'five']

    Args:
        line: A string.

    Returns:
        A filter of the alphabetic words in the line.

    Raises:
        TypeError: Your input was of type <>. Must be a string.
    """
    try:
        return filter(None, re.split(REGEX, line.lower()))
    except AttributeError:
        input_type_str = str(type(line))[8:-2]
        error_message = "Your input was of type {}. Must be a string.".format(
                input_type_str)
        raise TypeError(error_message)

def wc_some_lines(lines):
    """Return a Counter containing the word count of several lines.
    Excludes any digital numbers or punctuation.

    Example:
        >>> wc_some_lines(["Line 1.", "Another line."])
        Counter({'line': 2, 'another': 1})

    Args:
        lines: An iterable of strings.

    Returns:
        A collections.Counter mapping words to their word counts.

    Raises:
        TypeError
```

Solution

PEP 8 says that top-level constructs such as functions should be separated by two blank lines. Hanging indents should use only one level of indentation (lines 43 and 61). Watch out for trailing whitespace (lines 61 and 89).
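As a rough illustration of the two layout points, here is a sketch of how part of the script might look after applying them. The regex and `split_and_clean` are taken from the question; `count_words` is a hypothetical helper added only so that there is a second top-level function to separate.

```python
import re
import string

# Hanging indent: the continuation line is indented by one level (4 spaces).
REGEX = re.compile(r"[{}{}{}]+".format(
    string.whitespace, string.digits, string.punctuation))


# Two blank lines before and after every top-level definition.
def split_and_clean(line):
    """Return an iterator over the lowercase alphabetic words in line."""
    try:
        return filter(None, re.split(REGEX, line.lower()))
    except AttributeError:
        input_type_str = str(type(line))[8:-2]
        # One level of hanging indent relative to the statement it continues.
        error_message = "Your input was of type {}. Must be a string.".format(
            input_type_str)
        raise TypeError(error_message)


def count_words(line):
    """Hypothetical helper: number of words found in line."""
    return len(list(split_and_clean(line)))
```

Trailing whitespace is invisible in a listing like this; most editors can be configured to highlight or strip it on save.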

I love functional style myself, but it is often frowned upon in Python: `Counter(chain(*map(split_and_clean, lines)))` or `filter(None, re.split(REGEX, line.lower()))` will be considered unreadable by some and elegant by others.
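For readers in the "unreadable" camp, here is a sketch of two equivalent spellings next to the functional one. The function names `wc_functional`, `wc_comprehension` and `wc_loop` are made up for this comparison, and `split_and_clean` is reduced to the single expression quoted above, without the question's error handling.

```python
import re
import string
from collections import Counter
from itertools import chain

REGEX = re.compile(r"[{}{}{}]+".format(
    string.whitespace, string.digits, string.punctuation))


def split_and_clean(line):
    # Same expression as in the question, minus the TypeError handling.
    return filter(None, re.split(REGEX, line.lower()))


def wc_functional(lines):
    # Functional spelling: map over the lines, flatten, count.
    return Counter(chain(*map(split_and_clean, lines)))


def wc_comprehension(lines):
    # Generator expression: the nested loops read left to right.
    return Counter(word for line in lines for word in split_and_clean(line))


def wc_loop(lines):
    # Explicit loop: Counter.update adds the counts of each line's words.
    counts = Counter()
    for line in lines:
        counts.update(split_and_clean(line))
    return counts
```

All three return `Counter({'line': 2, 'another': 1})` for `["Line 1.", "Another line."]`. Which one to prefer is largely a matter of taste; if the functional form is kept, `chain.from_iterable(map(split_and_clean, lines))` avoids unpacking every line's iterator into arguments up front.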

Otherwise your code is awesome: well written, with beautiful docstrings and clever touches (`filter()` to drop empty strings, catching the `AttributeError` raised by the call to `lower()`). Thanks for sharing!
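To make those two details concrete, here is a small sketch using the regex from the question. The variable names `raw`, `words` and `error_name` are only for illustration.

```python
import re
import string

REGEX = re.compile(r"[{}{}{}]+".format(
    string.whitespace, string.digits, string.punctuation))

# re.split leaves an empty string wherever the text starts or ends
# with a separator.
raw = re.split(REGEX, "3, Four, five'".lower())
print(raw)    # ['', 'four', 'five', '']

# filter(None, ...) keeps only truthy items, so the empty strings vanish.
words = list(filter(None, raw))
print(words)  # ['four', 'five']

# Non-strings have no lower() method, so the call fails before any
# splitting happens; the question turns this into a TypeError.
try:
    (5).lower()
except AttributeError as exc:
    error_name = type(exc).__name__
    print(error_name)  # AttributeError
```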

Context

StackExchange Code Review Q#27739, answer score: 2
