Parsing a huge XML file with lxml.etree.iterparse in Python
Problem
After solving the error on SO (as suggested), I am now back for a code review. :-)
The task is to parse a huge file, dblp.xml (~800 MB), provided by DBLP. The records in this file look, for example, like this or this. In particular:
record_1
...
record_n
I wrote some code that should extract each tag of certain records (they will be stored in a database).
I adapted this approach from an IBM developerWorks article, which in turn refers to the article Incremental Parsing on effbot.org. Is this the correct approach for this task, or is there a better way?
```
import sys
import os
import MySQLdb
from lxml import etree

def fast_iter2(context, cursor):
    # Available elements are: article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www
    elements = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])
    # Available tags are: author|editor|title|booktitle|pages|year|address|journal|volume|number|month|url|ee|cdrom|cite|
    # publisher|note|crossref|isbn|series|school|chapter
    childElements = set(["title", "booktitle", "year", "journal", "ee"])
    paper = {}  # represents a paper with all its tags.
    authors = []  # a list of authors who have written the paper "together".
    paperCounter = 0
    for event, element in context:
        tag = element.tag
        if tag in childElements:
            if element.text:
                paper[tag] = element.text
                # print tag, paper[tag]
        elif tag == "author":
            if element.text:
                authors.append(element.text)
                # print "AUTHOR:", authors[-1]
        elif tag in elements:
            paper["element"] = tag
            paper["mdate"] = element.get("mdate")
            paper["dblpkey"] = element.get("key")
            # print tag, element.get("mdate"), element.get("key"), event
            if paper["element"] in ['phdthesis', "masters
```
Solution
The name fast_iter2 seems to have little connection with what the function is actually doing.
Rather than having your two sets be local variables, I suggest making them global constants.
Calling them elements and childElements is kinda generic. I suggest something more specific.
Your loop is a bit awkward. Rather than collecting the child tags iteratively as they stream past, just grab everything at once when you reach the record element. This makes the code that fills paper and authors more straightforward.
Rather than having an empty if block and using the else block, invert the logic.
I suggest moving the element-clearing code into its own function. This makes the function's actual task clearer and keeps the question of how to clear out of the way.
The del context line is useless. del doesn't destroy the object; it merely removes the name from the current scope. If that name were the only reference to the object, its reference count would drop to zero and it would be freed. That's not the case here, because the calling function keeps it alive anyway. Even if it did free the object, the memory savings aren't worth worrying about, since the whole script is about to end.
My reworking of your code, no testing has been done on it:
```
CATEGORIES = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])
SKIP_CATEGORIES = set(['phdthesis', 'mastersthesis', 'www'])
DATA_ITEMS = ["title", "booktitle", "year", "journal", "ee"]

def clear_element(element):
    # Free the element and any already-processed siblings before it.
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]

def fast_iter2(context, cursor):
    # Available elements are: article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www
    # Available tags are: author|editor|title|booktitle|pages|year|address|journal|volume|number|month|url|ee|cdrom|cite|
    # publisher|note|crossref|isbn|series|school|chapter
    paperCounter = 0
    for event, element in context:
        if element.tag in CATEGORIES:
            authors = [author.text for author in element.findall("author")]
            paper = {
                'element': element.tag,
                'mdate': element.get("mdate"),
                'dblpkey': element.get('key')
            }
            for data_item in DATA_ITEMS:
                data = element.find(data_item)
                if data is not None:
                    # Store the text, not the Element, which is cleared below.
                    paper[data_item] = data.text
            if paper['element'] not in SKIP_CATEGORIES:
                populate_database(paper, authors, cursor)
            paperCounter += 1
            print(paperCounter)
            clear_element(element)
```
EDIT
How to avoid an explicit counter:
```
CATEGORIES = set(['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', "mastersthesis", "www"])
SKIP_CATEGORIES = set(['phdthesis', 'mastersthesis', 'www'])
DATA_ITEMS = ["title", "booktitle", "year", "journal", "ee"]

def clear_element(element):
    # Free the element and any already-processed siblings before it.
    element.clear()
    while element.getprevious() is not None:
        del element.getparent()[0]

def extract_paper_elements(context):
    for event, element in context:
        if element.tag in CATEGORIES:
            yield element
            # Runs once the consumer asks for the next element.
            clear_element(element)

def fast_iter2(context, cursor):
    for paperCounter, element in enumerate(extract_paper_elements(context)):
        authors = [author.text for author in element.findall("author")]
        paper = {
            'element': element.tag,
            'mdate': element.get("mdate"),
            'dblpkey': element.get('key')
        }
        for data_item in DATA_ITEMS:
            data = element.find(data_item)
            if data is not None:
                # Store the text, not the Element, which is cleared later.
                paper[data_item] = data.text
        if paper['element'] not in SKIP_CATEGORIES:
            populate_database(paper, authors, cursor)
        print(paperCounter)
```
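To try the generator version without the 800 MB file or a database, here is a minimal sketch. It rests on several assumptions that are not in the original: the standard library's xml.etree.ElementTree stands in for lxml.etree so the snippet has no extra dependencies, the sibling cleanup via getprevious()/getparent() is therefore left out, the sample records are invented, and collect_papers is a hypothetical stand-in for fast_iter2 that returns rows instead of calling populate_database.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

CATEGORIES = {"article", "inproceedings", "proceedings", "book",
              "incollection", "phdthesis", "mastersthesis", "www"}
SKIP_CATEGORIES = {"phdthesis", "mastersthesis", "www"}
DATA_ITEMS = ["title", "booktitle", "year", "journal", "ee"]

# Invented sample data, shaped roughly like DBLP records.
SAMPLE = b"""<dblp>
<article mdate="2011-01-11" key="journals/x/A1">
<author>Ann Author</author><author>Bob Writer</author>
<title>First Paper</title><year>2010</year>
</article>
<phdthesis mdate="2011-02-02" key="phd/B2">
<author>Cy Student</author><title>A Thesis</title>
</phdthesis>
</dblp>"""


def extract_paper_elements(context):
    """Yield each complete record element, clearing it once it has been used."""
    for event, element in context:
        if element.tag in CATEGORIES:
            yield element
            # Runs only when the consumer asks for the next record.
            element.clear()


def collect_papers(source):
    """Hypothetical stand-in for fast_iter2: returns rows instead of writing to a DB."""
    context = ET.iterparse(source, events=("end",))
    rows = []
    for counter, element in enumerate(extract_paper_elements(context)):
        authors = [author.text for author in element.findall("author")]
        paper = {
            "element": element.tag,
            "mdate": element.get("mdate"),
            "dblpkey": element.get("key"),
        }
        for item in DATA_ITEMS:
            child = element.find(item)
            if child is not None:
                # Copy the text now; the element is cleared after this iteration.
                paper[item] = child.text
        if paper["element"] not in SKIP_CATEGORIES:
            rows.append((counter, paper, authors))
    return rows


rows = collect_papers(io.BytesIO(SAMPLE))
```

Because the generator clears an element only after the consumer resumes it, the loop body always sees a fully populated record. The same loop should work with lxml.etree.iterparse, though that has not been tried here.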
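On the earlier point about del context: a small sketch of why del inside a function does not free an object that the caller still references. The names process and records are hypothetical stand-ins for the parsing function and the iterparse context.

```python
def process(context):
    """Consume the items, then drop the local name, as the original code did."""
    for _ in context:
        pass
    # Unbinds only the local name `context`; the caller's reference is untouched.
    del context


# Hypothetical stand-in for the iterparse context created by the caller.
records = ["record_1", "record_n"]
process(records)

# The caller's name is still bound to the same, intact list.
still_there = records
```

The object is released only when the last reference to it goes away, which in a short script effectively means when the script ends.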
Context
StackExchange Code Review Q#2449, answer score: 8