Comparing phone numbers across CSVs in Python

Submitted by: @import:stackexchange-codereview

Problem

(continuation from Speeding up and fixing phone numbers from CSVs with Regex)

I'm pulling all of the phone numbers from all CSVs in two different directories, writing them in a single simple format to two different files, and then comparing those two files to find the numbers that appear in one but not the other.

I'm failing in speed (of execution), style, and results. Here's what I was trying:

```
import csv
import re
import glob
import string

with open('Firstlist.csv', 'wb') as out:
    with open('Secondlist.csv', 'wb') as out2:
        with open('SecondnotinFirst.csv', 'wb') as out3:
            with open('FirstnotinSecond.csv', 'wb') as out4:
                seen = set()
                regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
                out_writer = csv.writer(out)
                out_writer.writerow([])
                csv_files = glob.glob('First\*.csv')
                for filename in csv_files:
                    with open(filename, 'rbU') as ifile:
                        read = csv.reader(ifile)
                        for row in read:
                            for column in row:
                                s1 = column.strip()
                                match = regex.search(s1)
                                if match:
                                    canonical_phone = re.sub(r'\D', '', match.group(0))
                                    if canonical_phone not in seen:
                                        seen.add(canonical_phone)
                for val in seen:
                    out_writer.writerow([val])

                seen2 = set()
                out_writer2 = csv.writer(out2)
                out_writer2.writerow([])
                csv_files2 = glob.glob('Second\*.csv')
                for filename in csv_files2:
                    with open(filename, 'rbU') as ifile2:
                        read2 = csv.reader(ifile2)
                        for row in read2:
                            for column in r
```

Solution

Here's my 5 cents as a non-Python guy.
Naming:

As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.

Rename out, out2, and the corresponding writers. Use descriptive names instead (maybe similar to your file names): first_list, second_list, and so on. The same applies to your writers.

Also, seen2 is not a good name. What do you store in that variable? Name it after that: second_seen is definitely better than seen2.
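
To make the naming advice concrete, here is a rough sketch of how the script might look with descriptive names. It is only an illustration, not a tested drop-in replacement: it assumes Python 3 (the question's code is Python 2), the function names collect_phone_numbers and compare_directories are made up for this example, and the set difference is just one way to get "in one but not the other". The regex is copied unchanged from the question.

```python
import csv
import glob
import os
import re

# Pattern copied as-is from the question.
PHONE_PATTERN = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')


def collect_phone_numbers(directory):
    """Return the set of digit-only phone numbers found in all CSVs in directory.

    Hypothetical helper: it replaces the two copy-pasted loops, so no
    numbered variables (seen2, read2, ...) are needed.
    """
    phone_numbers = set()
    for filename in glob.glob(os.path.join(directory, '*.csv')):
        with open(filename, newline='') as csv_file:
            for row in csv.reader(csv_file):
                for column in row:
                    match = PHONE_PATTERN.search(column.strip())
                    if match:
                        # Keep digits only so differently formatted numbers compare equal.
                        phone_numbers.add(re.sub(r'\D', '', match.group(0)))
    return phone_numbers


def compare_directories(first_dir, second_dir):
    """Return (first_not_in_second, second_not_in_first) as two sets."""
    first_numbers = collect_phone_numbers(first_dir)
    second_numbers = collect_phone_numbers(second_dir)
    first_not_in_second = first_numbers - second_numbers
    second_not_in_first = second_numbers - first_numbers
    return first_not_in_second, second_not_in_first
```

Calling compare_directories('First', 'Second') would then give the two result sets, and each name says what it holds without a trailing number.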

Context

StackExchange Code Review Q#47210, answer score: 3
