Parsing a huge XML file with lxml.etree.iterparse in Python

Submitted by: @import:stackexchange-codereview

Problem

After solving the error on SO (as suggested), I am now back for a code review. :-)

The task is to parse the huge file dblp.xml (~800 MB) provided by DBLP. The records in this file look, for example, like this or this. In particular:


    <dblp>
    record_1
    ...
    record_n
    </dblp>


I wrote some code that should get me each tag of some records (they will be stored in a database).

I adapted this approach from an article from IBM developerWorks which refers to the article Incremental Parsing on effbot.org. Is this the correct approach for this task? Or is there a better way?

```
import sys
import os
import MySQLdb
from lxml import etree

def fast_iter2(context, cursor):
    # Available elements are: article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www
    elements = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])
    # Available tags are: author|editor|title|booktitle|pages|year|address|journal|volume|number|month|url|ee|cdrom|cite|
    # publisher|note|crossref|isbn|series|school|chapter
    childElements = set(["title", "booktitle", "year", "journal", "ee"])

    paper = {} # represents a paper with all its tags.
    authors = [] # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "masters
```

Solution

The name fast_iter2 seems to have little connection with what the function is actually doing.

Rather than having your two sets be local variables, I suggest making them global constants.

Calling them elements and childElements is rather generic. I suggest something more specific.

Your loop is a bit awkward. Rather than collecting the data piecemeal as each child tag goes by, just grab everything when you reach the enclosing record element. This makes the code that fills in paper and authors more straightforward.
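
As a small illustration of grabbing everything from the record element in one go, here is a sketch written for Python 3. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (find, findall and get behave the same way for this purpose). The record contents and the helper name record_to_paper are made up for the example.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Hypothetical record, shaped like a DBLP entry.
RECORD = """
<article mdate="2011-01-11" key="journals/example/Doe11">
  <author>Jane Doe</author>
  <author>John Roe</author>
  <title>An Example Title</title>
  <year>2011</year>
</article>
"""

DATA_ITEMS = ["title", "booktitle", "year", "journal", "ee"]


def record_to_paper(element):
    """Collect the attributes and selected child texts of one record element."""
    paper = {
        "element": element.tag,
        "mdate": element.get("mdate"),
        "dblpkey": element.get("key"),
    }
    for name in DATA_ITEMS:
        child = element.find(name)
        # Store the text, not the element object; skip absent or empty children.
        if child is not None and child.text:
            paper[name] = child.text
    authors = [a.text for a in element.findall("author") if a.text]
    return paper, authors


paper, authors = record_to_paper(ET.fromstring(RECORD.strip()))
```

Note that find returns an element, so the value to store is its .text.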

Rather than having an empty if block and using the else block, invert the logic.
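
A minimal sketch of what inverting the logic looks like. The "before" shape is an assumption about the part of the question's code that is not shown, and the stored list is a hypothetical stand-in for the database call.

```python
SKIP_CATEGORIES = {"phdthesis", "mastersthesis", "www"}


def store_before(paper, stored):
    # Assumed original shape: an empty branch, with the real work in the else.
    if paper["element"] in SKIP_CATEGORIES:
        pass
    else:
        stored.append(paper)


def store_after(paper, stored):
    # Inverted condition: the empty branch disappears.
    if paper["element"] not in SKIP_CATEGORIES:
        stored.append(paper)


papers = [{"element": "article"}, {"element": "www"}, {"element": "book"}]
before, after = [], []
for p in papers:
    store_before(p, before)
    store_after(p, after)
```

Both versions store the same papers; the second just says so directly.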

I suggest moving the element-clearing code into its own function. This makes the function's actual task clearer by keeping the question of how to clear out of the way.

The del context line is useless. del doesn't destroy the object; it merely removes the name from the current scope. If that were the only reference to the object, its reference count would drop to zero and it would be freed. However, that's not the case here, as the calling function keeps it alive anyway. Even if it weren't, the memory savings wouldn't be worth worrying about, since the whole script is about to end anyway.
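
To make the point about del concrete, here is a small sketch. It assumes CPython, where an object is freed as soon as its last reference goes away. Context and consume are hypothetical names, and a weak reference is used only to observe whether the object still exists.

```python
import weakref


class Context(object):
    """Hypothetical stand-in for the iterparse context object."""


def consume(context):
    # ... iterate over context here ...
    del context  # unbinds only this function's local name


context = Context()
probe = weakref.ref(context)  # does not keep the object alive

consume(context)
# The caller's name still refers to the object, so it is still alive.
alive_after_inner_del = probe() is not None

del context  # now the last reference is gone
alive_after_last_del = probe() is not None
```

On CPython, alive_after_inner_del is True and alive_after_last_del is False: the del inside the function changed nothing about the object's lifetime.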

My reworking of your code (untested):

    CATEGORIES = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])
    SKIP_CATEGORIES = set(['phdthesis','mastersthesis', 'www'])
    DATA_ITEMS = ["title", "booktitle", "year", "journal", "ee"]

    def clear_element(element):
        # Free the element's content, then drop the already-processed
        # siblings that the parent would otherwise keep referencing.
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

    def fast_iter2(context, cursor):
        # Available elements are:   article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www
        # Available tags are:       author|editor|title|booktitle|pages|year|address|journal|volume|number|month|url|ee|cdrom|cite|
        #                           publisher|note|crossref|isbn|series|school|chapter

        paperCounter = 0
        for event, element in context:
            if element.tag in CATEGORIES:
                authors = [author.text for author in element.findall("author") if author.text]
                paper = {
                    'element' : element.tag,
                    'mdate' : element.get("mdate"),
                    'dblpkey' : element.get('key')
                }
                for data_item in DATA_ITEMS:
                    data = element.find(data_item)
                    # find() returns an element; store its text, as the original code did
                    if data is not None and data.text:
                        paper[data_item] = data.text

                if paper['element'] not in SKIP_CATEGORIES:
                    populate_database(paper, authors, cursor)

                paperCounter += 1
                print(paperCounter)

                clear_element(element)
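
For readers without lxml at hand, the same stream-and-clear idea can be sketched with the standard library's xml.etree.ElementTree (Python 3). This is a swapped-in variant, not the code above. The stdlib elements have no getprevious() or getparent(), so the sketch keeps a reference to the root element and clears it after each record. The sample document and the name iter_records are made up for the example.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in; the answer itself uses lxml

# Hypothetical miniature of dblp.xml.
SAMPLE = b"""<dblp>
<article mdate="2011-01-11" key="journals/example/a1"><author>A. One</author><title>First</title></article>
<www mdate="2011-01-12" key="homepages/example/w1"><author>A. One</author><title>Home Page</title></www>
<book mdate="2011-01-13" key="books/example/b1"><author>B. Two</author><title>Second</title></book>
</dblp>"""

CATEGORIES = {"article", "inproceedings", "proceedings", "book",
              "incollection", "phdthesis", "mastersthesis", "www"}


def iter_records(source):
    """Yield each finished record element, then release it from the tree."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, element in context:
        if event == "end" and element.tag in CATEGORIES:
            yield element
            # Runs when the consumer asks for the next record: drop everything
            # collected under the root so far, so the tree stays small.
            root.clear()


titles = []
for record in iter_records(io.BytesIO(SAMPLE)):
    titles.append((record.tag, record.findtext("title")))
```

The consumer handles each record completely before asking for the next one, which is what lets the generator discard it afterwards.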


EDIT

How to avoid an explicit counter:

    CATEGORIES = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])
    SKIP_CATEGORIES = set(['phdthesis','mastersthesis', 'www'])
    DATA_ITEMS = ["title", "booktitle", "year", "journal", "ee"]

    def clear_element(element):
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]

    def extract_paper_elements(context):
        for event, element in context:
            if element.tag in CATEGORIES:
                yield element
                # runs once the caller asks for the next element
                clear_element(element)

    def fast_iter2(context, cursor):
        # start at 1 so the printed count matches the explicit counter version
        for paperCounter, element in enumerate(extract_paper_elements(context), 1):
            authors = [author.text for author in element.findall("author") if author.text]
            paper = {
                'element' : element.tag,
                'mdate' : element.get("mdate"),
                'dblpkey' : element.get('key')
            }
            for data_item in DATA_ITEMS:
                data = element.find(data_item)
                # find() returns an element; store its text
                if data is not None and data.text:
                    paper[data_item] = data.text

            if paper['element'] not in SKIP_CATEGORIES:
                populate_database(paper, authors, cursor)

            print(paperCounter)
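
One consequence of clearing after the yield is worth spelling out: the cleanup runs only when the generator is resumed, so each element must be fully processed before the next one is requested. The sketch below shows this with plain dicts as hypothetical stand-ins for parsed elements (the names extract and make_records are made up), so it runs without any XML library.

```python
def extract(records):
    """Yield each record, then wipe it once the consumer asks for the next one."""
    for record in records:
        yield record
        record.clear()  # runs only when iteration resumes


def make_records():
    # Hypothetical stand-ins for parsed elements.
    return [{"title": "First"}, {"title": "Second"}]


# Processing one record at a time sees the data.
seen = [record["title"] for record in extract(make_records())]

# Collecting the records first leaves only emptied ones behind.
hoarded = list(extract(make_records()))

# enumerate with a start value gives 1-based counts without a manual counter.
counts = [n for n, _ in enumerate(extract(make_records()), 1)]
```

Here seen holds both titles, while every dict in hoarded is empty, because list() had already resumed the generator past each record.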

Context

StackExchange Code Review Q#2449, answer score: 8
