Reading and merging multiple data files
Problem
I've recently started teaching myself Python but have no prior programming experience or formal training. I've successfully created a program that outputs what I wanted it to output, but I'm looking for guidance on design/efficiency. Is there an overall design that would have been more efficient? Any pieces of code that should be re-written to be more stylistically correct? I purposely wrote this program without utilizing a database, but my next goal is to re-write the program to use a database instead of dictionaries.
The goal of this program is to read in 3 files that have a common key (custID). A customer can have multiple orders, but only one email/customer record. File layouts:
names.txt: custID|firstName|lastName|address1|address2|city|state|zip
emails.txt: custID|email
orders.txt: orderID|orderDate|channel|orderAmt|custID
Then, output a file with each order appearing on a new line:
output.txt: custID|firstName|lastName|city|state|email|orderDate|orderAmount
Program:
```
import re
import os

cust_info = {}
cust_emails = {}
cust_orders = {}
cust_join = {}

#Read Customer file with layout:
#custID|firstName|lastName|address1|address2|city|state|zip
def read_cust_file():
    with open('names.txt', 'r') as f:
        for line in f:
            split_line = re.sub("\s\s+", '|', line).strip().split('|')
            cust_info[int(split_line[0])] = "|".join(split_line[1:])
    return cust_info

#Read Email file with layout: custID|email
def read_email_file():
    with open('emails.txt', 'r') as g:
        for line in g:
            split_line = re.sub("\s\s+", '|', line).strip().split('|')
            cust_emails[int(split_line[0])] = "|".join(split_line[1:])
    return cust_emails

#Read Order file with layout: orderID|orderDate|channel|orderAmt|custID
#There can be multiple orders per custID
def read_order_file():
    with open('orders.txt', 'r') as f:
        for line in f:
            split_li
```
Solution
Looks not too bad, it's readable enough. The use of `with` is good (except for `writeOutput`, where it's not used for some reason) and the code is split into separate functions by functionality. That said:
Stylistic
- You should read and follow PEP8 (naming of globals, functions, variables, whitespace, no need for semicolons).
- You don't need those globals, just return values from your functions.
- Maybe use docstrings instead of comments if you're describing the functionality of a function.
- Create more functions for common code. If you copy and paste, something is wrong (usually). That applies to all the reader functions, as they share the line splitting and accumulation code.
- The `try`/`except` blocks in `joinOrders` aren't good. Exceptions aren't for control flow (usually) and you're better off using a different approach for better understanding (where could the exception be raised from) and maintainability (if you move the block or change things, does the exception still only apply to the part you originally wanted it to, etc.).
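The points about globals, docstrings, shared reader code and exceptions can be sketched together. This is only an illustration, not the original program: `read_keyed_file`, its `key_index` parameter and `join_orders` are hypothetical names, the input is assumed to be cleanly pipe-delimited, Python 3 is assumed, and silently skipping customers with a missing name or email record is an assumption about the desired behaviour.

```python
def read_keyed_file(path, key_index=0):
    """Read a pipe-delimited file into {custID: [fields, ...]}.

    Hypothetical helper: key_index selects the column holding custID,
    so one function can serve the names, emails and orders files.
    """
    records = {}
    with open(path, "r") as f:
        for line in f:
            fields = line.strip().split("|")
            if fields == [""]:
                continue  # skip blank lines
            key = int(fields[key_index])
            # A list per key lets one customer have several orders.
            records.setdefault(key, []).append(fields)
    return records


def join_orders(names, emails, orders):
    """Return one output line per order, skipping incomplete customers."""
    rows = []
    for cust_id, cust_orders in orders.items():
        # Explicit membership tests instead of try/except for control flow.
        if cust_id not in names or cust_id not in emails:
            continue  # assumption: incomplete customers are left out
        name = names[cust_id][0]
        email = emails[cust_id][0]
        for order in cust_orders:
            # custID|firstName|lastName|city|state|email|orderDate|orderAmount
            rows.append("|".join([
                name[0], name[1], name[2], name[5], name[6],
                email[1], order[1], order[3],
            ]))
    return rows
```

With this shape, the orders file would be read with `read_keyed_file('orders.txt', key_index=4)`, since custID is its last column, and nothing needs to live in module-level dictionaries.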
Design
- You might be better off using the `csv` module with custom separators to read the files. Even as a beginner it's a good idea to use libraries to make your life easier.
- The double splitting in the read functions looks unnecessary, but I'm not quite sure.
- Why the weird `truncate`? It's normal for text files to end with a newline.
- Unless you're not concerned about speed, the list concatenation in `joinOrders` isn't the best idea. Creating the set from `custInfo`, then calling `update` on the set would be better to avoid unnecessary allocations for the list.
- Use `iter*` on dictionaries, lists, etc. for efficiency if you only iterate over the elements. Here that would be `iteritems` instead of `items`. (In Python 3, `items()` already returns a lazy view and `iteritems` no longer exists.)
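As a rough sketch of the `csv` suggestion: `csv.reader` and `csv.writer` both accept a `delimiter` argument, so pipe-separated files need no manual splitting or joining. The function names `read_pipe_file` and `write_pipe_file` are made up for this example, the files are assumed to be cleanly pipe-delimited (no runs of whitespace acting as separators), and Python 3 is assumed.

```python
import csv


def read_pipe_file(path):
    """Return the rows of a pipe-delimited file as lists of strings."""
    # newline="" is what the csv documentation recommends for csv files.
    with open(path, newline="") as f:
        # csv.reader yields [] for blank lines; drop those.
        return [row for row in csv.reader(f, delimiter="|") if row]


def write_pipe_file(path, rows):
    """Write rows with '|' separators; every row ends with a newline."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|", lineterminator="\n")
        writer.writerows(rows)
```

Writing inside a `with` block also closes the file reliably, and since a trailing newline is normal for text files, no `truncate` call is needed afterwards.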
Overall the split into input and output makes sense, and I wouldn't worry too much about efficiency at this point (unless you're processing huge amounts of data, that is).
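Should speed ever start to matter, the set-based key collection mentioned in the Design list might look like the following. The dictionaries are small made-up stand-ins for the real data, `all_customer_ids` is a hypothetical helper name, and Python 3 is assumed.

```python
def all_customer_ids(cust_info, cust_emails, cust_orders):
    """Collect every custID that appears in any of the three dictionaries."""
    ids = set(cust_info)      # iterating a dict yields its keys
    ids.update(cust_emails)   # update() adds in place, no temporary list
    ids.update(cust_orders)
    return ids


# Made-up sample data, only for illustration.
cust_info = {1: "Ann|Lee", 2: "Bob|Ray"}
cust_emails = {1: "ann@example.com", 3: "cy@example.com"}
cust_orders = {2: ["A17"]}

for cust_id in sorted(all_customer_ids(cust_info, cust_emails, cust_orders)):
    print(cust_id)
```

Compared with concatenating three key lists and converting the result, this builds only the one set that is actually needed.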
Context
StackExchange Code Review Q#95118, answer score: 3