HiveBrain v1.2.0
pattern · python · Minor

Slicing Stack Exchange data dump XML files into small bites

Submitted by: @import:stackexchange-codereview

Problem

I was posed with a challenge when trying to load XML files from Stack Exchange into a SQL Server database; some of these XML files are extremely large (largest one being a whopping 67 GB in a single XML file), so much so that the database just cannot handle them without throwing a System.OutOfMemoryException.

After some testing, I found that the optimal XML size for the database to handle efficiently (at least on my database) is ~20 MB. So I wrote this Python script to slice up any file larger than 20 MB into more-or-less equal files:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import math
from datetime import datetime

__author__ = 'https://github.com/Phrancis'

'''
The purpose of this script is to split up very large XML files from the Stack Exchange
public data dump into small files which can then efficiently be loaded into a Microsoft SQL Server
database for further processing.

Testing with SQL Server on the author's database has proven that XML file sizes of ~20 MB
have reasonable performance and processing time. Larger files have led to System.OutOfMemoryException
on the database. The size of output files can be adjusted by changing the value of
the SQL_SERVER_XML_SIZE_LIMIT variable.

NOTE: This script was made using Python 3.5 and is not compatible with Python 2.x.

This script assumes the format of the Stack Exchange XML files to be as follows,
and will not work correctly with differently-formatted XML:

<?xml version="1.0" encoding="utf-8"?>
<root_node>
  <row ... />
  ...
</root_node>
'''

# Clock to measure how long the script takes to execute
start = datetime.now()

SQL_SERVER_XML_SIZE_LIMIT = 20000000

# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'

# get input_file size
file_size_bytes = os.path.getsize(file_path + file_name + '.xml')

# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    ...

Solution

I think most of my comments are just going to be about intricacies of
Python; the general structure and comments are great.

The only other comment would be that you can probably open the file in
binary mode, skip the text-encoding conversion entirely, and save some
processing time that way.

I don't quite get why you have to go through the whole file just to get
the closing XML line - it should be pretty clear what that line is
going to be, considering the opening XML line. I'd probably change the
whole part to just read from the file, accumulate as much as possible,
and then open the next file, instead of going through the file twice.
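The single-pass idea above can be sketched roughly as follows. This is only an illustration of the suggestion, not code from the answer; the function and variable names (`split_xml_stream`, `size_limit`, and so on) are hypothetical. The file is read once in binary mode, the closing root tag is derived from the opening one instead of scanning for it, and a new part file is started whenever the current part reaches the size budget:

```python
import os
import re

def split_xml_stream(src_path, dst_dir, size_limit):
    """Split src_path into numbered part files in one pass.

    Assumes the Stack Exchange dump layout: an XML declaration line,
    an opening root tag line, one row element per line, and the
    closing root tag as the very last line.
    """
    base = os.path.splitext(os.path.basename(src_path))[0]
    paths = []
    with open(src_path, 'rb') as src:
        declaration = src.readline()
        root_open = src.readline()
        # derive the closing tag from the opening one - no second pass needed
        root_name = re.match(rb'\s*<(\w+)', root_open).group(1)
        root_close = b'</' + root_name + b'>\n'
        header = declaration + root_open

        part_num, out, written = 0, None, 0
        for line in src:
            if line.strip() == root_close.strip():
                break                      # reached the real closing tag
            if out is None or written >= size_limit:
                if out is not None:
                    out.write(root_close)  # finish the previous part
                    out.close()
                part_num += 1
                path = os.path.join(dst_dir, '%s%d.xml' % (base, part_num))
                paths.append(path)
                out = open(path, 'wb')
                out.write(header)
                written = 0
            out.write(line)
            written += len(line)
        if out is not None:
            out.write(root_close)
            out.close()
    return paths
```

Unlike the two-pass version, this never needs to know the total line count, so it can stream a 67 GB file with constant memory.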

  • Since you're using Python 3, #!/usr/bin/python isn't guaranteed to
    point to Python 3 on many distributions, so I'd say using
    #!/usr/bin/env python3 is a bit safer - that also deals with the
    binary being in another location.

  • The last bit after # Clean up and print results should also be
    indented, no? All the variables are only set in the else branch.
    I'd actually put a sys.exit() in the if part and not have the
    indentation at all.

  • Take a look at os.path for path manipulation; those functions are
    portable and a bit more structured than concatenating strings.

  • I'd use with for the input_file too.

  • The standard Python interpreter won't extract common subexpressions,
    so you'll have to (and probably should) do that yourself - the encode
    calls and the repeated path concatenations shouldn't be recomputed
    all the time.

  • Getting the last item from an iterator is actually a question on
    StackOverflow, cf. https://stackoverflow.com/a/2138894/2769043 -
    perhaps use that.

  • The comparison line != root_close could probably be replaced with a
    check of the line number against the line number of the last line;
    that should be much faster than string comparison.
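Two of the points above can be shown concretely. This is just an illustrative sketch, not code from the answer; `io.BytesIO` stands in for a real dump file, and the `last` helper is my own name for the linked StackOverflow technique:

```python
import collections
import io
import os

def last(iterable):
    # Last item of any iterable, per the linked StackOverflow answer:
    # a deque with maxlen=1 discards all but the final element fed to it.
    return collections.deque(iterable, maxlen=1)[0]

# e.g. the closing root tag is the final line of the dump file:
fake_file = io.BytesIO(b'<?xml version="1.0"?>\n<badges>\n  <row />\n</badges>\n')
root_close = last(fake_file)

# os.path.join builds paths portably instead of concatenating strings;
# the directory names here are the ones from the script:
full_file_path = os.path.join('D:\\Downloads', 'stackexchange',
                              'stackoverflow.com', 'Badges.xml')
```

The deque version runs the iteration in C rather than in a Python-level loop, which matters on files this large.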

Something like this, still same general approach though.

#!/usr/bin/env python3

...

# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'

full_file_path = file_path + file_name + '.xml'

# get input_file size
file_size_bytes = os.path.getsize(full_file_path)

# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()

num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)

with open(full_file_path, 'rb') as input_file:
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output input_file
    opening = input_file.readline() + input_file.readline()

    pointer = input_file.tell()

    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1

    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)

    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES

    # return stream to start of input
    input_file.seek(pointer)

    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'

        with open(full_current_file_path, 'w+b') as output_file:
            print('Writing to:', full_current_file_path)

            # write XML header
            output_file.write(opening)

            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach the num_lines_per_split_file or the end of the input_file
                if output_line_num > num_lines_per_split_file or line == root_close:
                    break

                output_file.write(line)
                output_line_num += 1

            # write the footer as the last line in the file
            output_file.write(root_close)

        # move on to the next output file
        current_file_num += 1

    # Clean up and print results
    print('Path:', file_path + file_name)
    print('Size:', file_size_bytes, 'bytes')
    print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per input_file.')
    print('Execution time:', datetime.now() - start)

Code Snippets

#!/usr/bin/env python3

...

# TODO Make this script iterate through all subdirectories instead of targeting just one
file_path = 'D:\Downloads\stackexchange\stackoverflow.com\\'
file_name = 'Badges'

full_file_path = file_path + file_name + '.xml'

# get input_file size
file_size_bytes = os.path.getsize(full_file_path)

# no splitting needed if XML input_file already fits within SQL Server limit
if file_size_bytes <= SQL_SERVER_XML_SIZE_LIMIT:
    print('input_file size is less than 2 GB, no splitting needed.')
    print('Path:', full_file_path)
    print('Size:', file_size_bytes, '/', SQL_SERVER_XML_SIZE_LIMIT)
    sys.exit()

num_split_files_needed = math.ceil(file_size_bytes / SQL_SERVER_XML_SIZE_LIMIT)

with open(full_file_path, 'rb') as input_file:
    # get XML version, opening and closing root nodes,
    # and count the lines in order to determine how many lines to write to each split output input_file
    opening = input_file.readline() + input_file.readline()

    pointer = input_file.tell()

    num_lines_in_file = 2
    for root_close in input_file:
        num_lines_in_file += 1

    num_lines_per_split_file = math.ceil(num_lines_in_file / num_split_files_needed)

    # BEGIN SLICING OF INPUT FILE INTO SMALLER OUTPUT FILES

    # return stream to start of input
    input_file.seek(pointer)

    for current_file_num in range(1, num_split_files_needed + 1):
        full_current_file_path = file_path + file_name + str(current_file_num) + '.xml'

        with open(full_current_file_path, 'w+b') as output_file:
            print('Writing to:', full_current_file_path)

            # write XML header
            output_file.write(opening)

            # start writing lines from the input to the output file
            output_line_num = 1
            for line in input_file:
                # write lines until we reach the num_lines_per_split_file or the end of the input_file
                if output_line_num > num_lines_per_split_file or line == root_close:
                    break

                output_file.write(line)
                output_line_num += 1

            # write the footer as the last line in the file
            output_file.write(root_close)

        # move on to the next output file
        current_file_num += 1

    # Clean up and print results
    print('Path:', file_path + file_name)
    print('Size:', file_size_bytes, 'bytes')
    print(num_lines_in_file, 'total lines, split into', num_split_files_needed, 'files =', num_lines_per_split_file, 'lines per input_file.')
    print('Execution time:', datetime.now() - start)

Context

StackExchange Code Review Q#140392, answer score: 2
