HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Excel to JSON parser with http download

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
excelparserwithdownloadhttpjson

Problem

I've been working on a project (link) to download a spreadsheet of known ransomware and properties and turn it into json so I can better consume the information within early detection projects.

I'm new to python - what can I be doing better? The destination json I'm converting can be found here.

update_json.py (entry point)

from excel_to_json import excel_to_json
from download_file import download_file

SOURCESHEET = 'https://docs.google.com/spreadsheets/d/1TWS238xacAto-fLKh1n5uTsdijWdCEsGIM0Y0Hvmc5g/pub?output=xlsx'
OUTPUTSHEET = '../RansomwareOverview.xlsx'
JSONFILE = '../ransomware_overview.json'

def write_json_file(json_data, filename):
    output = open(filename, 'w')
    output.writelines(json_data)

def generate_json(source_file, download_destination, json_file):
    download_file(source_file, download_destination)
    write_json_file(excel_to_json(download_destination), json_file)

generate_json(SOURCESHEET, OUTPUTSHEET, JSONFILE)


download_file.py

import urllib.request

def download_file(source, destination):
    try:
        urllib.request.urlretrieve(source, destination)
    except IOError:
        print('An error occured trying to write an updated spreadsheet. Do you already have it open?')
    except urllib.error.URLError:
        print('An error occured trying to download the file. Please check the source and try again.')


excel_to_json.py

```
import simplejson as json
import xlrd
from collections import OrderedDict

def excel_to_json(filename):
wb = xlrd.open_workbook(filename)
sh = wb.sheet_by_index(0)
mw = wb.sheet_by_index(2)

# List to hold dictionaries
c_list = []

# Iterate through each row in worksheet and fetch values into dict
for rownum in range(1, sh.nrows):
wares = OrderedDict()
row_values = sh.row_values(rownum)

if row_values[6]=="":
name = row_values[0]
gre=[name]
elif "," in row_values[6]:
e=row_values[6].split(",")

Solution

Performance tips:

  • ujson can bring more speed



  • since both simplejson and xlrd are pure-python, you may get performance improvements "for free" by switching to PyPy



  • you may (or not) see speed and memory usage improvements if switching to openpyxl and especially in the "read-only" mode



  • in the excel_to_json function, you are accessing the same values from row_values by index multiple times. Defining intermediate variables (e.g. defining name = row_values[6] and using name variable later on) and avoiding accessing an element by index more than once might have a positive impact



  • I'm not sure I completely understand the inner for r in range(1, mw.nrows) loop. Can you break once you get the if row_values[0] == rowe[0] evaluated to True?



  • are you sure you need the OrderedDict and cannot get away with a regular dict? (there is a serious overhead for CPythons prior to 3.6)



  • instead of .dumps() and a separate function to dump a JSON string to a file - use .dump() method to dump to a file directly - make sure to use with context manager when opening a file



Code Style notes:

  • follow PEP8 guidelines in terms of whitespace usage in expressions and statements



  • properly organize imports



  • if row_values[6]=="": can be simplified to if not row_values[6]: (similar for some other if conditions later on)



  • the generate_json() call should be put into the if __name__ == '__main__': to avoid it being executed on import



  • the excel_to_json() function is not quite easy to grasp - see if you can add a helpful docstring and/or comments to improve on clarity and readability



Other notes:

  • improve variable naming. Variables like sh, mw, rowe are very close to being meaningless. I would also replace wb with a more explicit workbook



  • have you considered using pandas.read_excel() to read the contents into the dataframe and then dumping it via .to_json() (after applying the desired transformations)?

Context

StackExchange Code Review Q#156672, answer score: 4

Revisions (0)

No revisions yet.