patternpythonMinor
Loads two columns from excel - cleans up the data - and makes a new list of unmatched values
Viewed 0 times
cleansthevaluesnewexcelcolumnsmakestwoloadsunmatched
Problem
I am comparing two columns of data (~20k rows) to find the items that do not match. Actually one of my files is an export from an Oracle database, but I have not gotten to the point of working with databases in python yet.
The code works as intended, but as I am new to this I am wondering if there is a more efficient way to do all of this.
First I read the columns into a list:
The data I have extracted from the database is already cleaned. The other data set is in an inconsistent format:
Next I clean the data up. I am slicing out the name, and just keeping the ID. The ID can be of a variable length. When it appears at the end of the string it is enclosed in parentheses. When it is at the beginning it is delimited by a '-'.
The only thing I am doing to the clean list is to convert it from unicode:
Finally I compare the two lists to find the non-matching values:
Eventually I will be passing the two lists into some queries, but I have more to learn before I do that.
The code works as intended, but as I am new to this I am wondering if there is a more efficient way to do all of this.
First I read the columns into a list:
import openpyxl
wb = openpyxl.load_workbook('file.XLSX')
ws = wb['dirtyfile']
dirtysource = []
cleansource = []
for col in ws['A']:
dirtysource.append(col.value)
ws2 = wb['Database Export']
for col in ws2['A']:
cleansource.append(col.value)The data I have extracted from the database is already cleaned. The other data set is in an inconsistent format:
dirtysource = [u'Tom Jones (01134)', u'Bob Smith (On Leave) (598711)', u'Jane Doe (00004)', u'100627 - Juan Carlos' ]Next I clean the data up. I am slicing out the name, and just keeping the ID. The ID can be of a variable length. When it appears at the end of the string it is enclosed in parentheses. When it is at the beginning it is delimited by a '-'.
j = 0
for i in dirtysource:
if i.endswith(')'):
dirtysource[j] = str(i[i.rfind('(')+1:len(i)-1])
else:
dirtysource[j] = str(i[0:i.find('-')-1])
j = j +1The only thing I am doing to the clean list is to convert it from unicode:
j = 0
for i in cleansource:
cleansource[j] = str(i)
j = j + 1Finally I compare the two lists to find the non-matching values:
delta = []
for i in dirtysource:
if i in cleansource: continue
else:
delta.append(i)
print delta
delta1 = []
for i in cleansource:
if i in dirtysource: continue
else:
delta1.append(i)
print delta1Eventually I will be passing the two lists into some queries, but I have more to learn before I do that.
Solution
Your code is actually pretty good!
But there are some small things I would change:
-
Rather than initalizeing
It's part of Python so it's well known, and improves clarity in your code.
This can change your
-
Rather than checking if an item is in another list, if it is do nothing otherwise add to another list.
Just check if the value is not in the list. E.g.
You also tagged this performance. And so there is one way you can improve the performance of this program.
Since
Instead you can use a
However I'd still keep your original
Say you have the list
Either way you can change your code to the following for a performance increase:
But there are some small things I would change:
- Don't use
wsandws2, instead just index them in the for loop. E.g.for col in wb['dirtyfile']['A'].
-
Rather than initalizeing
j and incrementing it each loop you can instead use enumerate.It's part of Python so it's well known, and improves clarity in your code.
This can change your
cleansource loop to:for j, i in enumerate(cleansource):
cleansource[j] = str(i)-
Rather than checking if an item is in another list, if it is do nothing otherwise add to another list.
Just check if the value is not in the list. E.g.
for i in dirtysource:
if i not in cleansource:
delta.append(i)You also tagged this performance. And so there is one way you can improve the performance of this program.
Since
cleansource is a list, if i in cleansource is a performance killer.Instead you can use a
set, this can allow you to perform if i in cleansource faster, as it's optimised to do so.However I'd still keep your original
cleansource, as sets are an unique unorganised collection of data.Say you have the list
['a', 'b', 'c', 'd', 'a', 'b'] and you change it to a set you could get {'b', 'd', 'c', 'a'} back.Either way you can change your code to the following for a performance increase:
clean_set = set(cleansource)
for i in dirtysource:
if i not in clean_set:
delta.append(i)Code Snippets
for j, i in enumerate(cleansource):
cleansource[j] = str(i)for i in dirtysource:
if i not in cleansource:
delta.append(i)clean_set = set(cleansource)
for i in dirtysource:
if i not in clean_set:
delta.append(i)Context
StackExchange Code Review Q#147144, answer score: 2
Revisions (0)
No revisions yet.