Comparing phone numbers across CSVs Python
Problem
(continuation from Speeding up and fixing phone numbers from CSVs with Regex)
I'm pulling all of the phone numbers from all CSVs in two different directories, outputting them in a single simple format to two different files, and then comparing those two files for which numbers are in one but not the other.
I'm failing in speed (of execution), style, and results. Here's what I was trying:
```
import csv
import re
import glob
import string
with open('Firstlist.csv', 'wb') as out:
    with open('Secondlist.csv', 'wb') as out2:
        with open('SecondnotinFirst.csv', 'wb') as out3:
            with open('FirstnotinSecond.csv', 'wb') as out4:
                seen = set()
                regex = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')
                out_writer = csv.writer(out)
                out_writer.writerow([])
                csv_files = glob.glob('First\*.csv')
                for filename in csv_files:
                    with open(filename, 'rbU') as ifile:
                        read = csv.reader(ifile)
                        for row in read:
                            for column in row:
                                s1 = column.strip()
                                match = regex.search(s1)
                                if match:
                                    canonical_phone = re.sub(r'\D', '', match.group(0))
                                    if canonical_phone not in seen:
                                        seen.add(canonical_phone)
                for val in seen:
                    out_writer.writerow([val])
                seen2 = set()
                out_writer2 = csv.writer(out2)
                out_writer2.writerow([])
                csv_files2 = glob.glob('Second\*.csv')
                for filename in csv_files2:
                    with open(filename, 'rbU') as ifile2:
                        read2 = csv.reader(ifile2)
                        for row in read2:
                            for column in r
```
Solution
Here's my 5 cents as a non-Python guy.
Naming:
As soon as you begin numbering variable names, there should be alarm bells ringing... multiple loud alarm bells.
Rename your out, out2, and the corresponding writers. Instead, use descriptive names (maybe similar to your "filenames"): first_list, second_list, and so on. The same applies to your writers.
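To make the naming advice concrete, here is a minimal sketch of how the numbered names could disappear. It assumes Python 3 (the question's 'wb' and 'rbU' modes are Python 2), and the helper names collect_phones and compare_directories are hypothetical, not from the question. Moving the repeated loop into one function is my own suggestion on top of the renaming; the regex is copied unchanged from the question.

```python
import csv
import glob
import re

# Regex taken verbatim from the question.
PHONE_RE = re.compile(r'(\+?[2-9]\d{2}\)?[ -]?\d{3}[ -]?\d{4})')


def collect_phones(pattern):
    """Return the set of digit-only phone numbers found in CSVs matching pattern.

    Hypothetical helper: it replaces the two copy-pasted loops, so no
    seen/seen2 or out_writer/out_writer2 pairs are needed.
    """
    phones = set()
    for filename in glob.glob(pattern):
        # Python 3 assumption: text mode with newline='' for the csv module.
        with open(filename, newline='') as csv_file:
            for row in csv.reader(csv_file):
                for column in row:
                    match = PHONE_RE.search(column.strip())
                    if match:
                        # Keep digits only so differently formatted numbers compare equal.
                        phones.add(re.sub(r'\D', '', match.group(0)))
    return phones


def compare_directories(first_pattern, second_pattern):
    """Return (first_not_in_second, second_not_in_first) as sets of phone numbers."""
    first_seen = collect_phones(first_pattern)
    second_seen = collect_phones(second_pattern)
    # Set difference does the comparison directly.
    return first_seen - second_seen, second_seen - first_seen
```

Called as compare_directories('First/*.csv', 'Second/*.csv'), this returns the two sets that the question writes to FirstnotinSecond.csv and SecondnotinFirst.csv. Every name says which list it belongs to, so nothing needs a numeric suffix.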
seen2 is not a good name either. What do you store in that variable? Name it after that: second_seen is definitely better than seen2.
Context
StackExchange Code Review Q#47210, answer score: 3