
Comparing 2 CSV files


Problem

Here's the exercise in brief:


Consider the following file:


before.csv

A; ; B; 
B; A; H; 
C; ; D; 
D; C; G; 
E; D; F; 
F; E; H; 
G; D; ; 
H; G; ;




And a modified version of the file:


after.csv

A; ; B; 
B; A; H; 
C; ; D;
D; ; G;
E; D; F;
F; E; H;
G; D; ;
K; ; E;




The first field of the CSV is a unique identifier of each line. The
exercise consists of detecting the changes applied to the file, by
comparing before and after.


There are 3 types of changes you should detect:



  • ADDED (line is present in after.csv but not in before.csv)



  • REMOVED (line is present in before.csv but not in after.csv)



  • MODIFIED (line is present in both, but second and/or third field are modified)





In my example, there are three modifications:



  • ADDED line (K)



  • REMOVED line (H)



  • MODIFIED line (D)




And my code:

```
import collections
import csv
import sys

class P_CSV(dict):
    '''A P_CSV is a dict representation of the csv file:
    {"id": dict(csvfile)} '''

    fieldnames = ["id", "col2", "col3"]

    def __init__(self, input):
        map(self.readline, csv.DictReader(input, self.fieldnames, delimiter=";",\
            skipinitialspace=True))

    def readline(self, line):
        self[line["id"]] = line

    def get_elem(self, name):
        for i in self:
            if i == name:
                return self[i]

class Change:
    ''' a Change element will be instantiated
    each time a difference is found'''

    def __init__(self, *args):
        self.args=args

    def echo(self):
        print "\t".join(self.args)

class P_Comparator(collections.Counter):
    '''This class holds 2 P_CSV objects and counts
    the number of occurrences of each line.'''

    def __init__(self, in_pcsv, out_pcsv):
        self.change_list = []

        self.in_pcsv = in_pcsv
        self.out_pcsv = out_pcsv

        self.readfile(in_pcsv, 1)
        self.r
```

Solution

Implementation

  • The P_CSV trick is a good idea.



  • I don't know if "input" is supposed to be a file name, a string, a file object, and so on (this is Python's fault, but still). Please use a better name and document that it is a file.



  • What does {"id": dict(csvfile)} mean in your docstring?



  • get_elem could be implemented with return self.get(name). The loop you wrote is misleading, since it relies on the fact that a function that never reaches a return statement returns None.



  • This means you can either remove get_elem or find a better name explaining that it's just like self[name] except that it returns None instead of raising a KeyError.



  • I guess you want to use __str__ in Change, and maybe __repr__ (but not echo).



  • Do you really need Change as it is? Simply store lists in change_list, instead of Changes.



  • readfile? I'd say readcsv since it is no longer a file, but a P_CSV. If you have a more descriptive name of what those files contain, then use that instead.



  • J_Comparator doesn't work as requested in the exercise, since it also says what columns were modified.



  • Setting a list to None is wrong. "No values" is the empty list. You can then use self.change_list.extend(j) without needing to worry about the empty list. It's much more elegant.



  • Why P and J for the comparators?
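
A rough sketch of how several of the points above could fit together, assuming Python 3. The names load_rows, find_changes, and the kind and row_id attributes are made up for illustration and are not part of the original code. It reads each file into a plain dict, gives Change a __str__ and a __repr__ instead of echo, and starts from an empty list that is grown with extend.

```python
import csv
import io


def load_rows(csv_file):
    """Map each id to its (col2, col3) pair.

    csv_file is an open text file (or any iterable of lines), not a file name.
    """
    reader = csv.reader(csv_file, delimiter=";", skipinitialspace=True)
    # Skip blank lines; ignore anything after the third field.
    return {row[0]: tuple(row[1:3]) for row in reader if row}


class Change:
    """One detected difference: a kind (ADDED, REMOVED, MODIFIED) and a row id."""

    def __init__(self, kind, row_id):
        self.kind = kind
        self.row_id = row_id

    def __str__(self):
        return "\t".join((self.kind, self.row_id))

    def __repr__(self):
        return "Change({!r}, {!r})".format(self.kind, self.row_id)


def find_changes(before, after):
    """Compare two dicts produced by load_rows and return a list of Changes."""
    changes = []  # "no changes" is the empty list, never None
    changes.extend(Change("REMOVED", key) for key in before if key not in after)
    changes.extend(Change("ADDED", key) for key in after if key not in before)
    changes.extend(Change("MODIFIED", key) for key in before
                   if key in after and before[key] != after[key])
    return changes


if __name__ == "__main__":
    # Hypothetical in-memory inputs standing in for the two files.
    old = load_rows(io.StringIO("A; ; B; \nD; C; G; \nH; G; ;\n"))
    new = load_rows(io.StringIO("A; ; B; \nD; ; G;\nK; ; E;\n"))
    for change in find_changes(old, new):
        print(change)
```

Because the rows live in an ordinary dict, before.get(some_id) already returns None for a missing id, so no get_elem helper is needed. This version reports only the kind of change and the id, which is all the exercise asks for.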



Performance

Performance is good: you're using a linear algorithm, even though you're going through the files twice. If you're worried about very large files that won't fit in memory, you can use the assumption that the files are sorted to advance through both files simultaneously, keeping both positions aligned on the same unique id. I don't think this is needed: 6000 lines is quite small!
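
If memory ever did become a concern, the simultaneous walk could look roughly like the sketch below. It assumes Python 3 and that both files are sorted by id, as the sample files are; iter_rows and stream_changes are hypothetical names chosen for illustration. Only one row from each file is held in memory at a time.

```python
import csv
import io


def iter_rows(csv_file):
    """Yield (id, (col2, col3)) pairs from an open ';'-delimited text file."""
    reader = csv.reader(csv_file, delimiter=";", skipinitialspace=True)
    for row in reader:
        if row:  # skip blank lines
            yield row[0], tuple(row[1:3])


def stream_changes(before_file, after_file):
    """Yield (kind, id) pairs. Both inputs must be sorted by id."""
    before_rows = iter_rows(before_file)
    after_rows = iter_rows(after_file)
    old = next(before_rows, None)
    new = next(after_rows, None)
    while old is not None or new is not None:
        if new is None or (old is not None and old[0] < new[0]):
            # This id exists only in the old file.
            yield "REMOVED", old[0]
            old = next(before_rows, None)
        elif old is None or new[0] < old[0]:
            # This id exists only in the new file.
            yield "ADDED", new[0]
            new = next(after_rows, None)
        else:
            # Same id on both sides: compare the remaining fields.
            if old[1] != new[1]:
                yield "MODIFIED", old[0]
            old = next(before_rows, None)
            new = next(after_rows, None)


if __name__ == "__main__":
    # Hypothetical in-memory inputs standing in for the two sorted files.
    before = io.StringIO("A; ; B; \nD; C; G; \nH; G; ;\n")
    after = io.StringIO("A; ; B; \nD; ; G;\nK; ; E;\n")
    for kind, row_id in stream_changes(before, after):
        print(kind, row_id, sep="\t")
```

The changes come out in id order rather than grouped by kind, and the result is only correct when the sorted-input assumption holds.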

Context

StackExchange Code Review Q#9744, answer score: 2
