patternpythonMinor
Excel to JSON parser with http download
Viewed 0 times
excelparserwithdownloadhttpjson
Problem
I've been working on a project (link) to download a spreadsheet of known ransomware and properties and turn it into json so I can better consume the information within early detection projects.
I'm new to python - what can I be doing better? The destination json I'm converting can be found here.
update_json.py (entry point)
download_file.py
excel_to_json.py
```
import simplejson as json
import xlrd
from collections import OrderedDict
def excel_to_json(filename):
wb = xlrd.open_workbook(filename)
sh = wb.sheet_by_index(0)
mw = wb.sheet_by_index(2)
# List to hold dictionaries
c_list = []
# Iterate through each row in worksheet and fetch values into dict
for rownum in range(1, sh.nrows):
wares = OrderedDict()
row_values = sh.row_values(rownum)
if row_values[6]=="":
name = row_values[0]
gre=[name]
elif "," in row_values[6]:
e=row_values[6].split(",")
I'm new to python - what can I be doing better? The destination json I'm converting can be found here.
update_json.py (entry point)
from excel_to_json import excel_to_json
from download_file import download_file
SOURCESHEET = 'https://docs.google.com/spreadsheets/d/1TWS238xacAto-fLKh1n5uTsdijWdCEsGIM0Y0Hvmc5g/pub?output=xlsx'
OUTPUTSHEET = '../RansomwareOverview.xlsx'
JSONFILE = '../ransomware_overview.json'
def write_json_file(json_data, filename):
output = open(filename, 'w')
output.writelines(json_data)
def generate_json(source_file, download_destination, json_file):
download_file(source_file, download_destination)
write_json_file(excel_to_json(download_destination), json_file)
generate_json(SOURCESHEET, OUTPUTSHEET, JSONFILE)download_file.py
import urllib.request
def download_file(source, destination):
try:
urllib.request.urlretrieve(source, destination)
except IOError:
print('An error occured trying to write an updated spreadsheet. Do you already have it open?')
except urllib.error.URLError:
print('An error occured trying to download the file. Please check the source and try again.')excel_to_json.py
```
import simplejson as json
import xlrd
from collections import OrderedDict
def excel_to_json(filename):
wb = xlrd.open_workbook(filename)
sh = wb.sheet_by_index(0)
mw = wb.sheet_by_index(2)
# List to hold dictionaries
c_list = []
# Iterate through each row in worksheet and fetch values into dict
for rownum in range(1, sh.nrows):
wares = OrderedDict()
row_values = sh.row_values(rownum)
if row_values[6]=="":
name = row_values[0]
gre=[name]
elif "," in row_values[6]:
e=row_values[6].split(",")
Solution
Performance tips:
Code Style notes:
Other notes:
ujsoncan bring more speed
- since both
simplejsonandxlrdare pure-python, you may get performance improvements "for free" by switching toPyPy
- you may (or not) see speed and memory usage improvements if switching to
openpyxland especially in the "read-only" mode
- in the
excel_to_jsonfunction, you are accessing the same values fromrow_valuesby index multiple times. Defining intermediate variables (e.g. definingname = row_values[6]and usingnamevariable later on) and avoiding accessing an element by index more than once might have a positive impact
- I'm not sure I completely understand the inner
for r in range(1, mw.nrows)loop. Can youbreakonce you get theif row_values[0] == rowe[0]evaluated toTrue?
- are you sure you need the
OrderedDictand cannot get away with a regulardict? (there is a serious overhead for CPythons prior to 3.6)
- instead of
.dumps()and a separate function to dump a JSON string to a file - use.dump()method to dump to a file directly - make sure to usewithcontext manager when opening a file
Code Style notes:
- follow PEP8 guidelines in terms of whitespace usage in expressions and statements
- properly organize imports
if row_values[6]=="":can be simplified toif not row_values[6]:(similar for some other if conditions later on)
- the
generate_json()call should be put into theif __name__ == '__main__':to avoid it being executed on import
- the
excel_to_json()function is not quite easy to grasp - see if you can add a helpful docstring and/or comments to improve on clarity and readability
Other notes:
- improve variable naming. Variables like
sh,mw,roweare very close to being meaningless. I would also replacewbwith a more explicitworkbook
- have you considered using
pandas.read_excel()to read the contents into the dataframe and then dumping it via.to_json()(after applying the desired transformations)?
Context
StackExchange Code Review Q#156672, answer score: 4
Revisions (0)
No revisions yet.