Reading and merging multiple data files

Submitted by: @import:stackexchange-codereview

Problem

I've recently started teaching myself Python but have no prior programming experience or formal training. I've successfully created a program that outputs what I wanted it to output, but I'm looking for guidance on design/efficiency. Is there an overall design that would have been more efficient? Any pieces of code that should be re-written to be more stylistically correct? I purposely wrote this program without utilizing a database, but my next goal is to re-write the program to use a database instead of dictionaries.

The goal of this program is to read in 3 files that have a common key (custID). A customer can have multiple orders, but only one email/customer record. File layouts:

names.txt: custID|firstName|lastName|address1|address2|city|state|zip
emails.txt: custID|email
orders.txt: orderID|orderDate|channel|orderAmt|custID


Then, output a file with each order appearing on a new line:

output.txt: custID|firstName|lastName|city|state|email|orderDate|orderAmount


Program:

```
import re
import os

cust_info = {}
cust_emails = {}
cust_orders = {}
cust_join = {}

#Read Customer file with layout:
#custID|firstName|lastName|address1|address2|city|state|zip
def read_cust_file():
    with open('names.txt', 'r') as f:
        for line in f:
            split_line = re.sub("\s\s+", '|', line).strip().split('|')
            cust_info[int(split_line[0])] = "|".join(split_line[1:])
    return cust_info

#Read Email file with layout: custID|email
def read_email_file():
    with open('emails.txt', 'r') as g:
        for line in g:
            split_line = re.sub("\s\s+", '|', line).strip().split('|')
            cust_emails[int(split_line[0])] = "|".join(split_line[1:])
    return cust_emails

#Read Order file with layout: orderID|orderDate|channel|orderAmt|custID
#There can be multiple orders per custID
def read_order_file():
    with open('orders.txt', 'r') as f:
        for line in f:
            split_li
```

Solution

This doesn't look too bad; it's readable enough. The use of with is good (except in writeOutput, where it isn't used for some reason), and the code is split into separate functions by functionality. That said:

Stylistic

  • You should read and follow PEP8 (naming of globals, functions, and variables; whitespace; no need for semicolons).

  • You don't need those globals; just return values from your functions.

  • Maybe use docstrings instead of comments if you're describing the functionality of a function.

  • Create more functions for common code. If you copy and paste, something is wrong (usually). That applies to all the reader functions, as they share the line-splitting and accumulation code.

  • The try/except blocks in joinOrders aren't good. Exceptions aren't for control flow (usually), and a different approach would be easier to understand (where could the exception be raised from?) and to maintain (if you move the block or change things, does the exception still apply only to the part you originally wanted it to?).
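
The stylistic points above could look roughly like the following. This is only a sketch in Python 3, not the original program: the names parse_line, read_keyed_file, read_orders and join_orders are made up for illustration, and because joinOrders is not shown in full here, the choice to skip orders that have no customer record is an assumption rather than the original behaviour.

```python
import re


def parse_line(line):
    """Split one record into fields on '|' or on runs of 2+ whitespace characters."""
    return re.sub(r"\s\s+", "|", line.strip()).split("|")


def read_keyed_file(lines, key_index=0):
    """Return {custID: [remaining fields]} for files with one record per customer.

    `lines` is any iterable of strings, for example an open file object.
    """
    records = {}
    for line in lines:
        fields = parse_line(line)
        key = int(fields[key_index])
        records[key] = fields[:key_index] + fields[key_index + 1:]
    return records


def read_orders(lines, key_index=4):
    """Return {custID: [order field lists]}; a customer may have several orders."""
    orders = {}
    for line in lines:
        fields = parse_line(line)
        orders.setdefault(int(fields[key_index]), []).append(fields[:key_index])
    return orders


def join_orders(names, emails, orders):
    """Yield one output line per order, using explicit lookups, not try/except."""
    for cust_id in sorted(orders):
        name = names.get(cust_id)
        if name is None:
            # Assumption: orders without a customer record are skipped.
            continue
        first, last, _addr1, _addr2, city, state, _zip = name
        # Assumption: a missing email becomes an empty field.
        email = emails.get(cust_id, [""])[0]
        for _order_id, order_date, _channel, amount in orders[cust_id]:
            yield "|".join(
                [str(cust_id), first, last, city, state, email, order_date, amount]
            )
```

Nothing here touches module-level state: each reader returns its dictionary, so the caller decides where the data lives. With real files, each reader would be called inside a with open(...) block, passing the file object as lines.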

Design

  • You might be better off using the csv module with a custom delimiter to read the files. Even as a beginner, it's a good idea to use libraries to make your life easier.

  • The double splitting in the read functions looks unnecessary, but I'm not quite sure.

  • Why the weird truncate? It's normal for text files to end with a newline.

  • If you are concerned about speed, the list concatenation in joinOrders isn't the best idea. Creating the set from custInfo and then calling update on the set would be better, as it avoids unnecessary allocations for the list.

  • On Python 2, use the iter* methods on dictionaries for efficiency if you only iterate over the elements; here that would be iteritems instead of items. On Python 3, items already returns a lightweight view and iteritems no longer exists.
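
As a rough illustration of the csv and set suggestions above, here is a minimal Python 3 sketch. The helper name read_pipe_file and the sample records are invented for the example, and io.StringIO stands in for the real names.txt and emails.txt so that the snippet runs on its own.

```python
import csv
import io


def read_pipe_file(lines):
    """Return {custID: [other fields]} using csv.reader with '|' as the delimiter."""
    reader = csv.reader(lines, delimiter="|")
    # Skip blank lines, which csv.reader yields as empty lists.
    return {int(row[0]): row[1:] for row in reader if row}


# Stand-ins for the real input files; the data below is made up.
names_file = io.StringIO(
    "1|Ann|Lee|1 Main St||Springfield|IL|12345\n"
    "2|Bob|Ray|2 Oak Ave||Dayton|OH|45402\n"
)
emails_file = io.StringIO("2|bob@example.com\n3|cy@example.com\n")

cust_info = read_pipe_file(names_file)
cust_emails = read_pipe_file(emails_file)

# Collect every customer ID without building a concatenated list first.
all_ids = set(cust_info)
all_ids.update(cust_emails)
print(sorted(all_ids))  # prints [1, 2, 3]
```

When reading real files, the csv documentation recommends opening them with newline="" before handing them to csv.reader.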



Overall, the split into input and output makes sense, and I wouldn't worry too much about efficiency at this point (unless you're processing huge amounts of data, that is).

Context

StackExchange Code Review Q#95118, answer score: 3
