patternpythonMinor

Loads two columns from excel - cleans up the data - and makes a new list of unmatched values

Submitted by: @import:stackexchange-codereview·Mar 10, 2026·

Viewed 0 times

cleansthevaluesnewexcelcolumnsmakestwoloadsunmatched

Problem

I am comparing two columns of data (~20k rows) to find the items that do not match. Actually one of my files is an export from an Oracle database, but I have not gotten to the point of working with databases in python yet.

The code works as intended, but as I am new to this I am wondering if there is a more efficient way to do all of this.

First I read the columns into a list:

import openpyxl
 wb = openpyxl.load_workbook('file.XLSX')
 ws = wb['dirtyfile']
 dirtysource = []
 cleansource = []     
 for col in ws['A']:
     dirtysource.append(col.value)
 ws2 = wb['Database Export']
 for col in ws2['A']:
     cleansource.append(col.value)

The data I have extracted from the database is already cleaned. The other data set is in an inconsistent format:

dirtysource = [u'Tom Jones (01134)', u'Bob Smith (On Leave) (598711)', u'Jane Doe (00004)', u'100627 - Juan Carlos' ]

Next I clean the data up. I am slicing out the name, and just keeping the ID. The ID can be of a variable length. When it appears at the end of the string it is enclosed in parentheses. When it is at the beginning it is delimited by a '-'.

j = 0
for i in dirtysource:
    if i.endswith(')'):
        dirtysource[j] = str(i[i.rfind('(')+1:len(i)-1])
    else:
        dirtysource[j] = str(i[0:i.find('-')-1])
    j = j +1

The only thing I am doing to the clean list is to convert it from unicode:

j = 0
for i in cleansource:
    cleansource[j] = str(i)
    j = j + 1

Finally I compare the two lists to find the non-matching values:

delta = []    
for i in dirtysource:
    if i in cleansource: continue
    else:
        delta.append(i)
print delta

delta1 = []    
for i in cleansource:
    if i in dirtysource: continue
    else:
        delta1.append(i)
print delta1

Eventually I will be passing the two lists into some queries, but I have more to learn before I do that.

Solution

Your code is actually pretty good!

But there are some small things I would change:

Don't use ws and ws2, instead just index them in the for loop. E.g. for col in wb['dirtyfile']['A'].

-
Rather than initalizeing j and incrementing it each loop you can instead use enumerate.
It's part of Python so it's well known, and improves clarity in your code.
This can change your cleansource loop to:

for j, i in enumerate(cleansource):
    cleansource[j] = str(i)

-
Rather than checking if an item is in another list, if it is do nothing otherwise add to another list.
Just check if the value is not in the list. E.g.

for i in dirtysource:
    if i not in cleansource:
        delta.append(i)

You also tagged this performance. And so there is one way you can improve the performance of this program.
Since cleansource is a list, if i in cleansource is a performance killer.
Instead you can use a set, this can allow you to perform if i in cleansource faster, as it's optimised to do so.
However I'd still keep your original cleansource, as sets are an unique unorganised collection of data.
Say you have the list ['a', 'b', 'c', 'd', 'a', 'b'] and you change it to a set you could get {'b', 'd', 'c', 'a'} back.
Either way you can change your code to the following for a performance increase:

clean_set = set(cleansource)
for i in dirtysource:
    if i not in clean_set:
        delta.append(i)

Code Snippets

for j, i in enumerate(cleansource):
    cleansource[j] = str(i)

for i in dirtysource:
    if i not in cleansource:
        delta.append(i)

clean_set = set(cleansource)
for i in dirtysource:
    if i not in clean_set:
        delta.append(i)

Context

StackExchange Code Review Q#147144, answer score: 2

Revisions (0)

No revisions yet.