Compare lines in 2 text files with different numbers of fields

Submitted by: @import:stackexchange-codereview

Problem

This is the (hopefully) final version of my script for the file comparison problem I raised previously in two posts on Stack Overflow.

I have come up with the code shown below, which does what I need it to do, but I'm wondering if it can be written in a more Pythonic (read: elegant) way, especially the cleanup of the lists.

#!/usr/bin/python
import sys
import csv

f1 = sys.argv[1]
f2 = sys.argv[2]

with open(f1) as i, open(f2) as j:
    a = csv.reader(i)
    b = csv.reader(j)
    for linea in a:
        lineb = next(b)
        lista = ([x for x in linea if len(x) > 0])
        listastr = map(str.strip, lista)
        listastrne = filter(None, listastr)
        listb = ([x for x in lineb if len(x) > 0])
        listbstr = map(str.strip, listb)
        listbstrne = filter(None, listbstr)
        if len(listastrne) != len(listbstrne):
            print('Line {}: different fields: A: {} B: {}'.format(a.line_num, listastrne, listbstrne))
        elif sorted(map(str.lower, listastrne)) != sorted(map(str.lower, listbstrne)):
            print('Line {}: {} does not match {}'.format(a.line_num, listastrne, listbstrne))


Example input files:

A.csv:

1,2,,
1,2,2,3,4
1,2,3,4       
X
AAA,BBB,CCC
DDD,,EEE,  
GGG,HHH,III
XXX,YYY   ,ZZZ

 k,


B.csv:

1,2,2,2
1,2,3,4
1,2,3,4  
W
AAA,,BBB,CCC  
EEE,,DDD,,
,,GGG,III,HHH
XXX,YYY,ZZZ

,

Solution

Less repetition

You have a lot of repeated (and unnecessary) code. For example, you advance the second reader manually with next() when you could zip the two readers together. You run the same cleanup steps on each line separately instead of putting them in one function. You build a list comprehension and then map and filter over its result, when a single comprehension can do all of it. The len(x) > 0 check is also redundant, because the later filter(None, ...) already drops empty strings. Removing all of this gives us this:

from itertools import izip  # Python 2; on Python 3 use the built-in zip instead
import csv
import sys

file1 = sys.argv[1]
file2 = sys.argv[2]

def get_clean_line(line):
    # Strip and lowercase each field, dropping fields that are empty or
    # whitespace-only (the original script filtered these out after stripping).
    return [entry.strip().lower() for entry in line if entry.strip()]

with open(file1) as first_file, open(file2) as second_file:
    first_reader = csv.reader(first_file)
    second_reader = csv.reader(second_file)

    for first_line, second_line in izip(first_reader, second_reader):
        first_list, second_list = get_clean_line(first_line), get_clean_line(second_line)

        # The cheap length check runs first; `or` skips the sorts when it is True.
        if (len(first_list) != len(second_list) or
                sorted(first_list) != sorted(second_list)):
            print('Line {}: different fields: A: {} B: {}'.format(
                first_reader.line_num, first_list, second_list))


You could easily move the sorted call into get_clean_line if you want to. However, if you expect the lengths to differ often, it makes sense to leave it where it is: the or short-circuits on the length check, so lines with different field counts are reported without paying for two sorts.
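The trade-off can be sketched as follows. This is a minimal illustration, assuming Python 3; normalized_fields and lines_differ are hypothetical names that do not appear in the code above. The cleanup here drops whitespace-only fields, which is an assumption about the intended behavior based on the question's script.

```python
def get_clean_line(line):
    # Strip and lowercase each field; drop fields that are empty after stripping.
    return [entry.strip().lower() for entry in line if entry.strip()]


def normalized_fields(line):
    # Variant with the sort folded into the helper: every line pays for a sort.
    return sorted(get_clean_line(line))


def lines_differ(first_line, second_line):
    # Variant that keeps the sort at the comparison site: the cheap length
    # check runs first, and `or` skips both sorts when the lengths differ.
    first_list = get_clean_line(first_line)
    second_list = get_clean_line(second_line)
    return (len(first_list) != len(second_list)
            or sorted(first_list) != sorted(second_list))


# Same fields in a different order, ignoring case and blanks: no difference.
print(lines_differ(["DDD", "", "EEE", "  "], ["EEE", "", "DDD", "", ""]))  # False
# Different number of non-blank fields: detected by the length check alone.
print(lines_differ(["1", "2", "", ""], ["1", "2", "2", "2"]))  # True
```

With normalized_fields the comparison collapses to a single equality test, at the cost of sorting even those lines whose lengths already differ.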

Names

You could use better names; bytes are cheap.

For example, file1 and file2 instead of f1 and f2, or first_reader and second_reader instead of a and b. There are probably even better names, but those would depend on your specific domain.
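As one illustration of the naming advice, here is a hedged sketch of the same comparison wrapped in a generator with descriptive names. It assumes Python 3, where the built-in zip is lazy and izip is not needed. The name find_mismatched_lines is hypothetical, and the in-memory io.StringIO inputs stand in for the real files.

```python
import csv
import io


def get_clean_line(line):
    # Strip and lowercase each field; drop fields that are empty after stripping.
    return [entry.strip().lower() for entry in line if entry.strip()]


def find_mismatched_lines(first_source, second_source):
    """Yield (line_number, first_fields, second_fields) for each mismatch."""
    first_reader = csv.reader(first_source)
    second_reader = csv.reader(second_source)
    # zip stops at the end of the shorter input, like izip in the code above.
    for first_line, second_line in zip(first_reader, second_reader):
        first_fields = get_clean_line(first_line)
        second_fields = get_clean_line(second_line)
        if sorted(first_fields) != sorted(second_fields):
            yield first_reader.line_num, first_fields, second_fields


# Small in-memory stand-ins for A.csv and B.csv (assumed sample data).
first_text = io.StringIO("1,2,,\nAAA,BBB,CCC\nX\n")
second_text = io.StringIO("1,2,2,2\nAAA,,BBB,CCC  \nW\n")

for line_number, first_fields, second_fields in find_mismatched_lines(first_text, second_text):
    print('Line {}: A: {} B: {}'.format(line_number, first_fields, second_fields))
```

Returning the mismatches instead of printing them inside the loop keeps the comparison reusable; the caller decides how to report them.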

Context

StackExchange Code Review Q#116804, answer score: 3
