Slicing Stack Exchange data dump XML files into small bites
Problem
I was posed with a challenge when trying to load XML files from Stack Exchange into a SQL Server database; some of these XML files are extremely large (the largest being a whopping 67 GB in a single XML file), so much so that the database just cannot handle them without throwing a System.OutOfMemoryException.
After some testing, I found out the optimal XML size for the database to handle efficiently (at least on my database) is ~20 MB. So I wrote this Python script to slice up any file larger than 20 MB into more-or-less equal files:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import math
from datetime import datetime
__author__ = 'https://github.com/Phrancis'
'''
The purpose of this script is to split up very large XML files from the Stack Exchange
public data dump into small files which can then efficiently be loaded into a Microsoft SQL Server
database for further processing.
Testing with SQL Server on the author's database has proven that XML file sizes of ~20 MB
have reasonable performance and processing time. Larger files have led to System.OutOfMemoryException
on the database. The size of output files can be adjusted by changing the value of
the SQL_SERVER_XML_SIZE_LIMIT variable.
NOTE: This script was made using Python 3.5 and is not compatible with Python 2.x.
This script assumes the format of the Stack Exchange XML files to be as follows,
and will not work correctly with differently-formatted XML.
...
'''
# Clock to measure how long the script takes to execute
start = datetime.now()
SQL_SERVER_XML_SIZE_LIMIT = 20000000
# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'
# get input_file size
file_size_bytes = os.path.getsize(file_path + file_name + '.xml')
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    ...
Solution
I think most comments are just going to be about intricacies of Python;
the general structure and comments are great.
The only other comment would be that you can probably open the file in
binary mode, not do any text encoding conversion and save some
processing time that way.
I don't quite get why you have to go through the whole file just to get
the closing XML line ... it should be pretty clear what that line is
going to be considering the opening XML line? I'd probably change the
whole part to just read from the file, accumulate as much as possible
and then open the next file instead of going through the file twice.
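Here is a rough, untested sketch of that single-pass, binary-mode idea; the split_xml helper, the byte budget and the way the closing tag is derived from the opening root tag are assumptions of mine, not code from the question:

#!/usr/bin/env python3
import os

def split_xml(path, limit=20000000):
    """Split one XML dump file into parts of roughly `limit` bytes."""
    base, ext = os.path.splitext(path)
    with open(path, 'rb') as input_file:
        # Stack Exchange dumps start with the XML declaration plus the
        # opening root tag, one per line.
        header = input_file.readline() + input_file.readline()
        # Derive the closing tag from the opening root tag, e.g.
        # b'<badges>' becomes b'</badges>', so no second pass is needed.
        root_name = header.splitlines()[-1].strip().strip(b'<>')
        footer = b'</' + root_name + b'>\n'
        num = 0
        output_file = None
        written = limit  # force a new output file on the first row
        for line in input_file:
            if line.strip() == footer.strip():
                break  # reached the original closing root tag
            if written >= limit:
                # budget used up: finish this part and start the next one
                if output_file is not None:
                    output_file.write(footer)
                    output_file.close()
                num += 1
                output_file = open('%s%d%s' % (base, num, ext), 'wb')
                output_file.write(header)
                written = len(header)
            output_file.write(line)
            written += len(line)
        if output_file is not None:
            output_file.write(footer)
            output_file.close()

This leans on the dump format described in the question: a two-line header and an attribute-free root tag, each on its own line.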
- Since you're using Python 3, `#!/usr/bin/python` isn't guaranteed to
  be it on many distributions, so I'd say using `#!/usr/bin/env python3`
  is a bit safer - that also deals with the binary being in another
  location.
- The last bit after `# Clean up and print results` should also be
  indented, no? All the variables are only set in the `else` branch.
  I'd actually put a `sys.exit()` in the `if` part and not have the
  indentation at all.
- Take a look at `os.path` for path manipulation, those functions are
  portable and a bit more structured than concatenating strings (see
  the first sketch after this list).
- I'd use `with` with the `input_file` too.
- The standard Python interpreter won't extract common subexpressions,
  so you'll have to and probably should do that yourself - the `encode`
  calls and the same occurrences of paths shouldn't be recomputed all
  the time (also shown in the first sketch below).
- Getting the last item from an iterator is actually a question on
  StackOverflow, c.f. https://stackoverflow.com/a/2138894/2769043 -
  perhaps use that (see the second sketch below).
- The comparison `line != root_close` could probably be replaced with a
  check of the line number against the line number of the last line;
  that should be much faster than string comparison.
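To make the `os.path` and common-subexpression points concrete, a hypothetical rearrangement of the path handling (the `file_dir` and `split_file_path` names are mine):

import os

# Build each path once with portable os.path functions instead of
# re-concatenating the same strings at every use.
file_dir = os.path.join('D:\\', 'Downloads', 'stackexchange', 'stackoverflow.com')
file_name = 'Badges'
full_file_path = os.path.join(file_dir, file_name + '.xml')

# Hoist the repeated concatenation into one place, computed once per
# output file instead of at every use site.
def split_file_path(num):
    return os.path.join(file_dir, '%s%d.xml' % (file_name, num))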
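And the linked last-item-of-an-iterator recipe as a sketch; it only replaces the counting loop if you also drop the line count, as the single-pass version above does:

from collections import deque

# A deque with maxlen=1 keeps only the most recent item, so feeding the
# file iterator through it yields the last line - the closing root tag -
# without holding the whole file in memory.
with open('Badges.xml', 'rb') as input_file:
    input_file.readline()  # skip the XML declaration
    input_file.readline()  # skip the opening root tag
    root_close = deque(input_file, maxlen=1).pop()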
Something like this, still same general approach though.
#!/usr/bin/env python3
...
# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'
full_file_path = file_path + file_name + '.xml'
# get input_file size
file_size_bytes = os.path.getsize(full_file_path)
# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('File size is less than 20 MB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()
num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)
with open(full_file_path, 'rb') as input_file:
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output file
    opening = input_file.readline() + input_file.readline()
    pointer = input_file.tell()
    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1
    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)
    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES
    # return stream to start of input
    input_file.seek(pointer)
    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'
        with open(full_current_file_path, 'w+b') as output_file:
            print('Writing to:', full_current_file_path)
            # write XML header
            output_file.write(opening)
            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach num_lines_per_split_file or the end of the input_file
                if output_line_num > num_lines_per_split_file or line == root_close:
                    break
                output_file.write(line)
                output_line_num += 1
            # write the footer as the last line in the file
            output_file.write(root_close)
# Clean up and print results
print('Path:', file_path + file_name)
print('Size:', file_size_bytes, 'bytes')
print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per file.')
print('Execution time:', datetime.now() - start)
Context
StackExchange Code Review Q#140392, answer score: 2