HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

Excel Data Manipulation - parse, match and create

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
excelcreatematchparsemanipulationanddata

Problem

I've got a simple Excel Data Manipulation Script to match one of my daily tasks, written in python 3.
Intro

Let's assume that I have 3 excel files: main.xlsx, 1.xlsx and 2.xlsx. In all of them I have a column named serial numbers. I have to:

  • lookup for all serial numbers in 1.xlsx and 2.xlsx and verify if they are in main.xlsx.



If a serial number is find:

  • on the last column of main.xlsx, on the same row with the serial number that was find, write OK + name_of_the_file_in which_it_was_found. Else, write NOK. At the same time, write in 1.xlsx and 2.xlsx ok or nok on the last column if the serial number was found or not.



Now, my script is simply creating new files with the last column appended (instead of appending directly to the same file), which is ok (but if you guys have a better method, shout it). What I'm looking for, is a way of making this as optimized as possible. I'm not looking for PEP8 comments as I'm aware of them, but I'll handle this part when I'll have this as optimized / improved as possible.
Code:

```
import petl

main = petl.fromxlsx('main.xlsx')
one = petl.fromxlsx('1.xlsx', row_offset=1)
two = petl.fromxlsx('2.xlsx')

non_serial_rows = petl.select(main, lambda rec: rec['serial number'] is None)
serial_rows = petl.select(main, lambda rec: rec['serial number'] is not None)
main_join_one = petl.join(serial_rows, petl.cut(one, ['serial number']), key='serial number')
main_join_one_file = petl.addfield(main_join_one, 'file', 'ok, 1.xlsx')
main_join_two = petl.join(serial_rows, petl.cut(two, ['serial number']), key='serial number')
main_join_two_file = petl.addfield(main_join_two, 'file', 'ok, 2.xlsx')
stacked_joins = petl.stack(main_join_two_file, main_join_one_file)
nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
output_main = petl.stack(stacked_joins, non_serial_rows, nok_rows)
main_final = output_main

de

Solution

I've not seen or used petl before, but there are some ways to improve your code.

  • Move things out of the global scope.



  • Make a function for your most common functions.



  • Change petl.select to petl.selectis or petl.selectisnot.



Since you do:

nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')


So many times, you should change it to a function.
This function could take a fn of petl.antijoin, a left of serial_rows, a right of stacked_joins and a field of 'NOK'.

And so you can use:

def change_row_file(fn, left, right, field):
    rows = fn(left, petl.cut(right, 'serial number'), key='serial number')
    return petl.addfield(rows, 'file', field)

change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK')


Adding this, with the changes to petl.select, to main_compare can result in something like:

def main_compare(table):
    serial_rows = petl.selectisnot(table, 'serial number', None)
    return petl.stack(
        change_row_file(petl.join, serial_rows, main, 'OK'),
        change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
        petl.selectis(table, 'serial number', None))


There's not much more that I think could do with improving apart from having nothing in your global scope.
And then adding a global constant for the serial column, say SERIAL_COLUMN = 'serial number'.
This significantly reduces the amount of magic strings, to ones that aren't repeated (except 'NOK').

This can result in (Not pep8 compliant to remove a scroll bar):

import petl
SERIAL_COLUMN = 'serial number'

def change_row_file(fn, left, right, field):
    rows = fn(left, petl.cut(right, SERIAL_COLUMN), key=SERIAL_COLUMN)
    return petl.addfield(rows, 'file', field)

def update_main(main, one, two):
    serial_rows = petl.selectisnot(main, SERIAL_COLUMN, None)
    stacked_joins = petl.stack(
        change_row_file(petl.join, serial_rows, one, 'ok, 1.xlsx'),
        change_row_file(petl.join, serial_rows, two, 'ok, 2.xlsx'))
    return petl.stack(
        stacked_joins,
        petl.selectis(main, SERIAL_COLUMN, None),
        change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK'))

def main_compare(main, table):
    serial_rows = petl.selectisnot(table, SERIAL_COLUMN, None)
    return petl.stack(
        change_row_file(petl.join, serial_rows, main, 'OK'),
        change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
        petl.selectis(table, SERIAL_COLUMN, None))

def main():
    main = petl.fromxlsx('main.xlsx')
    one = petl.fromxlsx('1.xlsx', row_offset=1)
    two = petl.fromxlsx('2.xlsx')

    petl.toxlsx(update_main(main, one, two), 'mainNew.xlsx')
    petl.toxlsx(main_compare(main, one), '1New.xlsx')
    petl.toxlsx(main_compare(main, two), '2New.xlsx')

if __name__ == '__main__':
    main()

Code Snippets

nok_rows = petl.antijoin(serial_rows, petl.cut(stacked_joins, ['serial number']), key='serial number')
nok_rows = petl.addfield(nok_rows, 'file', 'NOK')
def change_row_file(fn, left, right, field):
    rows = fn(left, petl.cut(right, 'serial number'), key='serial number')
    return petl.addfield(rows, 'file', field)

change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK')
def main_compare(table):
    serial_rows = petl.selectisnot(table, 'serial number', None)
    return petl.stack(
        change_row_file(petl.join, serial_rows, main, 'OK'),
        change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
        petl.selectis(table, 'serial number', None))
import petl
SERIAL_COLUMN = 'serial number'

def change_row_file(fn, left, right, field):
    rows = fn(left, petl.cut(right, SERIAL_COLUMN), key=SERIAL_COLUMN)
    return petl.addfield(rows, 'file', field)

def update_main(main, one, two):
    serial_rows = petl.selectisnot(main, SERIAL_COLUMN, None)
    stacked_joins = petl.stack(
        change_row_file(petl.join, serial_rows, one, 'ok, 1.xlsx'),
        change_row_file(petl.join, serial_rows, two, 'ok, 2.xlsx'))
    return petl.stack(
        stacked_joins,
        petl.selectis(main, SERIAL_COLUMN, None),
        change_row_file(petl.antijoin, serial_rows, stacked_joins, 'NOK'))

def main_compare(main, table):
    serial_rows = petl.selectisnot(table, SERIAL_COLUMN, None)
    return petl.stack(
        change_row_file(petl.join, serial_rows, main, 'OK'),
        change_row_file(petl.antijoin, serial_rows, main, 'NOK'),
        petl.selectis(table, SERIAL_COLUMN, None))

def main():
    main = petl.fromxlsx('main.xlsx')
    one = petl.fromxlsx('1.xlsx', row_offset=1)
    two = petl.fromxlsx('2.xlsx')

    petl.toxlsx(update_main(main, one, two), 'mainNew.xlsx')
    petl.toxlsx(main_compare(main, one), '1New.xlsx')
    petl.toxlsx(main_compare(main, two), '2New.xlsx')

if __name__ == '__main__':
    main()

Context

StackExchange Code Review Q#128232, answer score: 2

Revisions (0)

No revisions yet.