HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Script that scrubs data from a .csv file

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
scriptfilecsvscrubsthatfromdata

Problem

What does this script tell you about how I need to improve as a programmer? I'm somewhat new to both Python and programming, so feel free to minimize your assumptions about my knowledge.

The purpose of this script is to read a .csv of names and emails addresses that are improperly organized -- sometimes there is only one name, sometimes three names (first, middle, last) are in the same cell, and sometimes the person has a title (Mr., Mrs.) with their name.

Right now, the code does "work". It will execute, but there's a few small issues with some of these functions -- for example, the names aren't properly distributed to columns before the file is written. I'm less interested in these smaller things right now, and would rather have a bigger picture review.

I'll happily accept whatever advice you'll offer, but I'd most appreciate feedback on how I've accepted command line arguments, and the layout of the "engine" that controls the flow of operation and calls functions, and specific changes to how I've written the internals of each function -- for example, how could they be faster?

I've also posted general questions I've had along the way. If you're interested, please weigh in on those (see below code). I'm also less interested in unanticipated cases, but welcome them if you're inspired.

```
import csv
import sys

# This script will expext three arguments:
# 1. File with data to be scrubbed (CleanCsv.in_file)
# 2. File name to output clean data (CleanCsv.clean_out_file)
# 3. File name to output data that's needs review (CleanCsv.dirty_out_file)

class CleanCsv(object):
def __init__(self, args):
self.in_file = args[1] # Storing original (necc?)
self.clean_out_file = args[2]
self.dirty_out_file = args[3]
self.flags = [x for x in args[4:] if sys.argv != ""]
# Unnecessary? sys.argv will never be blank---> ^^^
self.manual_repair = []
self.rows = []
self.functions = [
'strip_blan

Solution

Design

As far as I can tell, what you are actually trying to do is to clean up a CSV file so that it has five fields (title, first name, middle name, last name, and e-mail address), by applying a sequence of heuristics. You then allow the caller to specify flags to turns these heuristics off if they are not working.

There are several problems with this approach:

-
You end up pushing the responsibility for working out which heuristics to apply onto the user of your program.

-
In order for the user to be able to figure out which heuristics to turn on, the documentation is going to have to be complex, and most likely hard to understand.

-
The heuristics interact with each other. For example, columnize adds blank fields, but strip_blank_fields removes blank fields: which wins? How will you explain this in the documentation?

-
In any case, the heuristics don’t achieve what you are trying to do: for example you don’t move the title to the beginning of the row, or the e-mail address to the end.

So, although it’s probably not what you were hoping to hear, I think that your code would be more reliable, easier to use and easier to understand if you implemented your clean-up as a straightforward algorithm, rather than as a configurable collection of interacting heuristics. If you find that the algorithm doesn’t work on some of your data, then improve it until it does (instead of adding yet more optional heuristics).

I’ve put some revised code at the end of this answer showing how you might implement my suggested approach.

Comments

-
There’s no docstring for the CleanCsv class. Users will like to be able to run help(CleanCsv) to learn how to use the class.

-
The CleanCsv class starts by processing the command-line arguments. This design decision means that you can only easily use this class from the command-line. If you want to use it from another program or from a test suite, or interactively from the Python interpreter, then you have to mock up an argv array. The class would be more widely useful if it took its arguments directly, leavng the command-line processing to be done in the command-line case.

-
It seems perverse to me that you interpret the command-line argument capitalize to mean “don’t capitalize”. The argument should be something like --dont-capitalize or --no-capitalization to make it clear what it does.

-
You don’t check the command-line arguments to see if they contain errors. This means that misspellings like capitalise won’t be detected.

-
If you’re going to process command-line arguments, it would be a good idea to use the argparse module in the standard library. This gives you help text and error messages.

-
You do your processing with the whole of the input and output in memory: you read the whole of the input CSV file into a list; then you apply each function to the whole list; then you write out the whole list to the output CSV file. This means that your memory usage is going to depend on the size of the CSV file, and means that your program will behave poorly (lots of swapping) if you need to process very large CSV files.

You could change the code so that it works row-by-row instead of function-by-function: read a single line at a time from the input CSV file, apply each function to that line, and then write out the updated line to one of the output CSV files. That way the memory footprint doesn’t need to be much bigger than the biggest line in the input.

-
All of your functions that operate on rows in the input have boilerplate of the form

for row in self.rows:


or

for num, row in (enumerate(self.rows)):


if you refactored the code so that it worked row-by-row then you would be able to avoid this (each function would operate on a single row).

-
This line is bogus (as you say in your comment):

self.flags = [x for x in args[4:] if sys.argv != ""]
# Unnecessary?  sys.argv will never be blank---> ^^^


You’re checking sys.argv each time around the loop to see if it’s equal to the empty string, which of course it isn’t (it’s an array, not a string). Probably you meant to write

self.flags = [x for x in args[4:] if x != ""]


which could be abbreviated to

self.flags = [x for x in args[4:] if x]


since only empty strings test false, but since what you are actually going to do is look up function names to see if they appear in this list, it would be more efficient to use a set than a list (sets can test membership in O(1) but lists take O(n)):

self.flags = {x for x in args[4:] if x}


but in practice you don’t care about whether this set contains the blank string or not (you’re never going to look it up), so you could just write

self.flags = set(args[4:])


but I still think using argparse would be even better.

-
This operation is O(n2):

while [] in self.rows:
    self.rows.remove([])


(Consider a row whose first half is non-blank and whose second half is blank.) The natural wa

Code Snippets

for row in self.rows:
for num, row in (enumerate(self.rows)):
self.flags = [x for x in args[4:] if sys.argv != ""]
# Unnecessary?  sys.argv will never be blank---> ^^^
self.flags = [x for x in args[4:] if x != ""]
self.flags = [x for x in args[4:] if x]

Context

StackExchange Code Review Q#16290, answer score: 7

Revisions (0)

No revisions yet.