Compare lines in 2 text files with different numbers of fields
Problem
This is the (hopefully) final version of my script for my file comparison problem mentioned previously in two posts on Stack Overflow (here and here).
I have come up with the code shown below, which does what I need it to do, but I'm wondering if it can be written in a more Pythonic (read: elegant) way, especially the cleanup of the lists.
#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        lista = ([x for x in linea if len(x) > 0])
        listastr = map(str.strip, lista)
        listastrne = filter(None, listastr)
        listb = ([x for x in lineb if len(x) > 0])
        listbstr = map(str.strip, listb)
        listbstrne = filter(None, listbstr)
        if len(listastrne) != len(listbstrne):
            print('Line {}: different fields: A: {} B: {}'.format(a.line_num, listastrne, listbstrne))
        elif sorted(map(str.lower, listastrne)) != sorted(map(str.lower, listbstrne)):
            print('Line {}: {} does not match {}'.format(a.line_num, listastrne, listbstrne))

Example input files:
A.csv:
1,2,,
1,2,2,3,4
1,2,3,4
X
AAA,BBB,CCC
DDD,,EEE,
GGG,HHH,III
XXX,YYY ,ZZZ
k,

B.csv:
 1,2,2,2
1,2,3,4
1,2,3,4
W
AAA,,BBB,CCC
EEE,,DDD,,
,,GGG,III,HHH
XXX,YYY,ZZZ
,

Solution
Less repetition
You have a lot of repeated (and unnecessary) code. For example, you manually advance your second reader when you could zip the two readers together. You perform the same list comprehension on each line. You map multiple things onto the result after the list comprehension. You perform a pointless filter that is essentially a copy. Removing all of this gives us this:
from itertools import izip
import csv
import sys

file1 = sys.argv[1]
file2 = sys.argv[2]

def get_clean_line(line):
    return [entry.strip().lower() for entry in line if entry]

with open(file1) as first_file, open(file2) as second_file:
    first_reader = csv.reader(first_file)
    second_reader = csv.reader(second_file)
    for first_line, second_line in izip(first_reader, second_reader):
        first_list, second_list = get_clean_line(first_line), get_clean_line(second_line)
        if (len(first_list) != len(second_list) or
                sorted(first_list) != sorted(second_list)):
            print('Line {}: different fields: A: {} B: {}'.format(
                first_reader.line_num, first_list, second_list))

You could easily move the sorted call into get_clean_line if you want to, but if you think the lengths will differ more often, it might make sense to leave it where it is so that short-circuiting avoids too many expensive sorts.

Names
You could use better names; bytes are cheap. For example, file1 and file2 rather than f1 and f2, first_reader and second_reader rather than a and b, and so on. There are probably even better names, but those would depend on your specific domain.
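For what it's worth, here is a minimal sketch of how both suggestions might look on Python 3, where the built-in zip is already lazy and itertools.izip no longer exists. The find_mismatches helper and the in-memory sample data are my own assumptions for illustration, not part of the original script. This version also filters on the stripped entry, so whitespace-only fields are dropped, which is closer to what the filter(None, ...) step in the question did after stripping:

```python
import csv
import io


def get_clean_line(line):
    # Strip and lowercase each entry; drop entries that are empty after stripping.
    return [entry.strip().lower() for entry in line if entry.strip()]


def find_mismatches(first_file, second_file):
    """Yield (line_number, first_fields, second_fields) for lines that differ.

    Hypothetical helper: takes two open text files (or file-like objects).
    """
    first_reader = csv.reader(first_file)
    second_reader = csv.reader(second_file)
    # zip pairs the rows lazily and stops at the shorter input.
    for first_line, second_line in zip(first_reader, second_reader):
        first_fields = get_clean_line(first_line)
        second_fields = get_clean_line(second_line)
        # The cheap length check short-circuits before the sorts run.
        if (len(first_fields) != len(second_fields)
                or sorted(first_fields) != sorted(second_fields)):
            yield first_reader.line_num, first_fields, second_fields


# Illustrative in-memory data, loosely based on the example files.
file_a = io.StringIO("1,2,,\nAAA,BBB,CCC\nXXX,YYY ,ZZZ\nX\n")
file_b = io.StringIO("1,2,2,2\nAAA,,BBB,CCC\nXXX,YYY,ZZZ\nW\n")

mismatches = list(find_mismatches(file_a, file_b))
for line_num, fields_a, fields_b in mismatches:
    print('Line {}: different fields: A: {} B: {}'.format(
        line_num, fields_a, fields_b))
```

Like izip, zip silently stops at the end of the shorter file; itertools.zip_longest would be one option if extra trailing lines in either file should be reported as well.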
Context
StackExchange Code Review Q#116804, answer score: 3