Comparing 2 CSV files
Problem
Here's the exercise in brief:
Consider the following file:
before.csv
A; ; B;
B; A; H;
C; ; D;
D; C; G;
E; D; F;
F; E; H;
G; D; ;
H; G; ;
And a modified version of the file:
after.csv
A; ; B;
B; A; H;
C; ; D;
D; ; G;
E; D; F;
F; E; H;
G; D; ;
K; ; E;
The first field of the CSV is a unique identifier of each line. The
exercise consists of detecting the changes applied to the file, by
comparing before and after.
There are 3 types of changes you should detect:
- ADDED (line is present in after.csv but not in before.csv)
- REMOVED (line is present in before.csv but not in after.csv)
- MODIFIED (line is present in both, but second and/or third field are modified)
In my example, there are three modifications:
- ADDED line (K)
- REMOVED line (H)
- MODIFIED line (D)
And my code:
```
import collections
import csv
import sys

class P_CSV(dict):
    '''A P_CSV is a dict representation of the csv file:
    {"id": dict(csvfile)} '''
    fieldnames = ["id", "col2", "col3"]

    def __init__(self, input):
        map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
                skipinitialspace=True))

    def readline(self, line):
        self[line["id"]] = line

    def get_elem(self, name):
        for i in self:
            if i == name:
                return self[i]

class Change:
    ''' a Change element will be instanciated
    each time a difference is found'''
    def __init__(self, *args):
        self.args=args

    def echo(self):
        print "\t".join(self.args)

class P_Comparator(collections.Counter):
    '''This class holds 2 P_CSV objects and counts
    the number of occurrence of each line.'''
    def __init__(self, in_pcsv, out_pcsv):
        self.change_list = []
        self.in_pcsv = in_pcsv
        self.out_pcsv = out_pcsv
        self.readfile(in_pcsv, 1)
        self.r
```
Solution
Implementation
- The `P_CSV` trick is a good idea.
- I can't tell whether `input` is supposed to be a file name, a string, a file object, and so on (this is Python's fault, but still). Please use a better name and document that it is a file.
- What does `{"id": dict(csvfile)}` mean in your docstring?
- `get_elem` could be implemented with `return self.get(name)`, since `dict.get` already returns `None` for a missing key. The way you're doing it is misleading, since you're relying on the fact that no `return` means `return None`.
- This means you can either remove `get_elem` or find a better name explaining that it's just like `self[name]` except that it returns `None` instead of raising a `KeyError`.
- I guess you want to use `__str__` in `Change`, and maybe `__repr__` (but not `echo`).
- Do you really need `Change` as it is? Simply store lists in `change_list`, instead of `Change` objects.
- `readfile`? I'd say `readcsv` since it is no longer a file, but a `P_CSV`. If you have a more descriptive name for what those files contain, then use that instead.
- `J_Comparator` doesn't work as requested in the exercise, since it also says which columns were modified.
- Setting a list to `None` is wrong. "No values" is the empty list. You can then use `self.change_list.extend(j)` without needing to worry about the empty list. It's much more elegant.
- Why P and J for the comparators?
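To make the points above concrete, here is a minimal sketch of how they might fit together. It is not the code under review: it assumes Python 3 (the original is Python 2), and the names `CsvRecords`, `compare`, `BEFORE` and `AFTER` are hypothetical. Changes are stored as plain tuples rather than `Change` objects, and "no changes" is an empty list.

```python
import csv
import io

# Sample data from the question, inlined so the sketch is self-contained.
BEFORE = """A; ; B;
B; A; H;
C; ; D;
D; C; G;
E; D; F;
F; E; H;
G; D; ;
H; G; ;
"""
AFTER = """A; ; B;
B; A; H;
C; ; D;
D; ; G;
E; D; F;
F; E; H;
G; D; ;
K; ; E;
"""


class CsvRecords(dict):
    """Rows of a ';'-separated CSV file, keyed by their first field.

    `csv_file` must be an open file object (or any iterable of lines),
    not a file name.
    """

    fieldnames = ["id", "col2", "col3"]

    def __init__(self, csv_file):
        super().__init__()
        reader = csv.DictReader(csv_file, self.fieldnames,
                                delimiter=";", skipinitialspace=True)
        for row in reader:
            # Keep only the named columns; the trailing ';' adds an extra field.
            self[row["id"]] = {name: row[name] for name in self.fieldnames}


def compare(before, after):
    """Return a list of (kind, id) tuples; an empty list means no changes."""
    changes = []
    changes.extend(("ADDED", key)
                   for key in sorted(after.keys() - before.keys()))
    changes.extend(("REMOVED", key)
                   for key in sorted(before.keys() - after.keys()))
    changes.extend(("MODIFIED", key)
                   for key in sorted(before.keys() & after.keys())
                   if before[key] != after[key])
    return changes


before = CsvRecords(io.StringIO(BEFORE))
after = CsvRecords(io.StringIO(AFTER))
for kind, key in compare(before, after):
    print(kind, key)
```

Because `CsvRecords` is a `dict`, a lookup that should not raise is just `before.get("Z")`, so no `get_elem` helper is needed in this sketch.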
Performance
Performance is good: you're using a linear algorithm, even though you're going through the files twice. If you're worried about very large files that won't fit in memory, you can use the assumption that the files are sorted to advance through both files simultaneously, making sure you are always looking at the same unique id in both. I don't think this is needed; 6000 lines is quite small!
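For reference, the simultaneous walk over two sorted files could look roughly like the sketch below. It rests on assumptions: Python 3, both inputs sorted by id under plain string ordering, and ids unique within each file. The names `read_rows` and `diff_sorted` are hypothetical.

```python
import csv
import io

# Sample data from the question, inlined so the sketch is self-contained.
BEFORE = """A; ; B;
B; A; H;
C; ; D;
D; C; G;
E; D; F;
F; E; H;
G; D; ;
H; G; ;
"""
AFTER = """A; ; B;
B; A; H;
C; ; D;
D; ; G;
E; D; F;
F; E; H;
G; D; ;
K; ; E;
"""


def read_rows(csv_file):
    """Lazily yield (id, other_fields) pairs from an open ';'-separated file."""
    for row in csv.reader(csv_file, delimiter=";", skipinitialspace=True):
        if row:  # skip blank lines
            yield row[0], row[1:]


def diff_sorted(before_file, after_file):
    """Yield (kind, id) changes while holding one row per file in memory."""
    before_rows = read_rows(before_file)
    after_rows = read_rows(after_file)
    old = next(before_rows, None)
    new = next(after_rows, None)
    while old is not None or new is not None:
        if new is None or (old is not None and old[0] < new[0]):
            # The id exists only in the "before" file.
            yield "REMOVED", old[0]
            old = next(before_rows, None)
        elif old is None or new[0] < old[0]:
            # The id exists only in the "after" file.
            yield "ADDED", new[0]
            new = next(after_rows, None)
        else:
            # Same id on both sides: compare the remaining fields.
            if old[1] != new[1]:
                yield "MODIFIED", old[0]
            old = next(before_rows, None)
            new = next(after_rows, None)


changes = list(diff_sorted(io.StringIO(BEFORE), io.StringIO(AFTER)))
for kind, key in changes:
    print(kind, key)
```

Unlike a dictionary-based comparison, this reports changes in id order and never loads a whole file, at the cost of depending on the sort order.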
Context
StackExchange Code Review Q#9744, answer score: 2