snippetpythonMinor
Excel Data Manipulation - parse, match and create
Viewed 0 times
excelcreatematchparsemanipulationanddata
Problem
I've got a simple Excel Data Manipulation Script to match one of my daily tasks, written in python 3.
Intro
Let's assume that I have 3 excel files:
If a serial number is find:
Now, my script is simply creating new files with the last column appended (instead of appending directly to the same file), which is ok (but if you guys have a better method, shout it). What I'm looking for, is a way of making this as optimized as possible. I'm not looking for
Code:
```
import petl
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
non_serial_rows = petl.select(main, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(main, lambda rec: rec['serial number'] is not None)
main_join_one = petl.join(serial_rows, petl.cut(one, ['serial number']), key='serial number')
main_join_one_file = petl.addfield(main_join_one, 'file', 'ok, 1.xlsx')
main_join_two = petl.join(serial_rows, petl.cut(two, ['serial number']), key='serial number')
main_join_two_file = petl.addfield(main_join_two, 'file', 'ok, 2.xlsx')
stacked_joins = petl.stack(main_join_two_file, main_join_one_file)
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
output_main = petl.stack(stacked_joins, non_serial_rows, nok_rows)
main_final = output_main
de
Intro
Let's assume that I have 3 excel files:
main.xlsx, 1.xlsx and 2.xlsx. In all of them I have a column named serial numbers. I have to:- lookup for all serial numbers in
1.xlsxand2.xlsxand verify if they are inmain.xlsx.
If a serial number is find:
- on the last column of
main.xlsx, on the same row with the serial number that was find, writeOK+name_of_the_file_in which_it_was_found. Else, writeNOK. At the same time, write in1.xlsxand2.xlsxokornokon the last column if the serial number was found or not.
Now, my script is simply creating new files with the last column appended (instead of appending directly to the same file), which is ok (but if you guys have a better method, shout it). What I'm looking for, is a way of making this as optimized as possible. I'm not looking for
PEP8 comments as I'm aware of them, but I'll handle this part when I'll have this as optimized / improved as possible.Code:
```
import petl
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
non_serial_rows = petl.select(main, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(main, lambda rec: rec['serial number'] is not None)
main_join_one = petl.join(serial_rows, petl.cut(one, ['serial number']), key='serial number')
main_join_one_file = petl.addfield(main_join_one, 'file', 'ok, 1.xlsx')
main_join_two = petl.join(serial_rows, petl.cut(two, ['serial number']), key='serial number')
main_join_two_file = petl.addfield(main_join_two, 'file', 'ok, 2.xlsx')
stacked_joins = petl.stack(main_join_two_file, main_join_one_file)
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
output_main = petl.stack(stacked_joins, non_serial_rows, nok_rows)
main_final = output_main
de
Solution
I've not seen or used
Since you do:
So many times, you should change it to a function.
This function could take a
And so you can use:
Adding this, with the changes to
There's not much more that I think could do with improving apart from having nothing in your global scope.
And then adding a global constant for the serial column, say
This significantly reduces the amount of magic strings, to ones that aren't repeated (except 'NOK').
This can result in (Not pep8 compliant to remove a scroll bar):
petl before, but there are some ways to improve your code.- Move things out of the global scope.
- Make a function for your most common functions.
- Change
petl.selecttopetl.selectisorpetl.selectisnot.
Since you do:
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')So many times, you should change it to a function.
This function could take a
fn of petl.antijoin, a left of serial_rows, a right of stacked_joins and a field of 'NOK'.And so you can use:
def change_row_file(fn, left, right, field):
rows = fn(left, petl.cut(right, 'serial number'), key='serial number')
return petl.addfield(rows, 'file', field)
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK')Adding this, with the changes to
petl.select, to main_compare can result in something like:def main_compare(table):
serial_rows = petl.selectisnot(table, 'serial number', None)
return petl.stack(
change_row_file(petl.join, serial_rows, main, 'OK'),
change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
petl.selectis(table, 'serial number', None))There's not much more that I think could do with improving apart from having nothing in your global scope.
And then adding a global constant for the serial column, say
SERIAL_COLUMN = 'serial number'.This significantly reduces the amount of magic strings, to ones that aren't repeated (except 'NOK').
This can result in (Not pep8 compliant to remove a scroll bar):
import petl
SERIAL_COLUMN = 'serial number'
def change_row_file(fn, left, right, field):
rows = fn(left, petl.cut(right, SERIAL_COLUMN), key=SERIAL_COLUMN)
return petl.addfield(rows, 'file', field)
def update_main(main, one, two):
serial_rows = petl.selectisnot(main, SERIAL_COLUMN, None)
stacked_joins = petl.stack(
change_row_file(petl.join, serial_rows, one, 'ok, 1.xlsx'),
change_row_file(petl.join, serial_rows, two, 'ok, 2.xlsx'))
return petl.stack(
stacked_joins,
petl.selectis(main, SERIAL_COLUMN, None),
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK'))
def main_compare(main, table):
serial_rows = petl.selectisnot(table, SERIAL_COLUMN, None)
return petl.stack(
change_row_file(petl.join, serial_rows, main, 'OK'),
change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
petl.selectis(table, SERIAL_COLUMN, None))
def main():
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
petl.toxlsx(update_main(main, one, two), 'mainNew.xlsx')
petl.toxlsx(main_compare(main, one), '1New.xlsx')
petl.toxlsx(main_compare(main, two), '2New.xlsx')
if __name__ == '__main__':
main()Code Snippets
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')def change_row_file(fn, left, right, field):
rows = fn(left, petl.cut(right, 'serial number'), key='serial number')
return petl.addfield(rows, 'file', field)
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK')def main_compare(table):
serial_rows = petl.selectisnot(table, 'serial number', None)
return petl.stack(
change_row_file(petl.join, serial_rows, main, 'OK'),
change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
petl.selectis(table, 'serial number', None))import petl
SERIAL_COLUMN = 'serial number'
def change_row_file(fn, left, right, field):
rows = fn(left, petl.cut(right, SERIAL_COLUMN), key=SERIAL_COLUMN)
return petl.addfield(rows, 'file', field)
def update_main(main, one, two):
serial_rows = petl.selectisnot(main, SERIAL_COLUMN, None)
stacked_joins = petl.stack(
change_row_file(petl.join, serial_rows, one, 'ok, 1.xlsx'),
change_row_file(petl.join, serial_rows, two, 'ok, 2.xlsx'))
return petl.stack(
stacked_joins,
petl.selectis(main, SERIAL_COLUMN, None),
change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK'))
def main_compare(main, table):
serial_rows = petl.selectisnot(table, SERIAL_COLUMN, None)
return petl.stack(
change_row_file(petl.join, serial_rows, main, 'OK'),
change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
petl.selectis(table, SERIAL_COLUMN, None))
def main():
main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')
petl.toxlsx(update_main(main, one, two), 'mainNew.xlsx')
petl.toxlsx(main_compare(main, one), '1New.xlsx')
petl.toxlsx(main_compare(main, two), '2New.xlsx')
if __name__ == '__main__':
main()Context
StackExchange Code Review Q#128232, answer score: 2
Revisions (0)
No revisions yet.