HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Reading a large XML file and parsing necessary elements into MySQLdb

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
readingfilenecessaryelementsintoxmllargeparsingandmysqldb

Problem

I have fair concept in a programming (learner) but not an expert to refactor code at the highest level. I am trying to read a huge (100MB-2GB) XML file and parse necessary element (attributes) from a file into MySQLdb. The code is working perfectly as I have tested in small file sizes. But, unfortunately when I got big size today (400MB-900MB), it's taking unexpected time.

Currently, I am diving into List Comprehension, generator(), lambda() and list reduction techniques.

Not to make more lines of code, but I have eliminated 3 or 4 elements from my list (taglist) from here. But, the structure of the code for rest of the eliminated elements are the same.

I am using: Python 2.7, lxml, MySQL (1.2.3)

```
import os, sys
import stat
import getpass
import MySQLdb
from lxml import etree
import datetime, time
import dbconfig as config

# All global variables set
#Namespace which is default in every mzML file
#taglist=['mzML','sourceFile','software','dataProcessing','instrumentConfiguration','selectedIon']
taglist=['mzML','sourceFile']
NS="{http://psi.hupo.org/ms/mzml}"

def fast_iter(context, func,args=[],kwargs={}):
# fast_iter is useful if we need to free memory while iterating through a
# very large XML file.
#http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
# Author: Liza Daly

for event, elem in context:
func(elem,*args, **kwargs)
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
del context

def process_element(elt):
# global mzml_pk, sf_cv_fk,softw_fk_ID, softw_cv_fk_ID
# Processing first element (mzML) and it's attributes (id, version, accession)
# version attribute is required AND id & accession are optional
if elt.tag==NS+taglist[0]:

L= elt.keys()
L1=['id','version','accession']

# Checking whether element attributes (items L1) exist in a mzml element attributes (L)
if L1[0] in L:
mzmlID = elt.attrib

Solution

import os, sys
import stat
import getpass
import MySQLdb
from lxml import etree
import datetime, time
import dbconfig as config


They python style guide recommends one import per line. I'd also probably group things together i.e. standard library imports then 3rd party imports/ then app imports. But that doesn't really matter much

# All global variables set
 #Namespace which is default in every mzML file

taglist=['mzML','sourceFile']
NS="{http://psi.hupo.org/ms/mzml}"


For global constants, the python style guide recommends ALL_CAPS. I recommend not using NS, instead spell it out NAMESPACE

def fast_iter(context, func,args=[],kwargs={}):


Taking variable arguments as a list and dict are odd, any reason you didn't take them as *args and **kwargs?

# fast_iter is useful if we need to free memory while iterating through a
    # very large XML file.
    #http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly

    for event, elem in context:
        func(elem,*args, **kwargs)


Using a yield to make this a generator rather then calling a function would probable be better.

elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


There is no point in deleting local names at the end of the function. When the function ends all the names will be deleted anyways.

def process_element(elt):


I recommend not using short abbreviations like elt unless they are really common.

# global mzml_pk, sf_cv_fk,softw_fk_ID, softw_cv_fk_ID
    # Processing first element (mzML) and it's attributes (id, version, accession)
    # version attribute is required AND id & accession are optional
    if elt.tag==NS+taglist[0]:

        L= elt.keys()
        L1=['id','version','accession']

        # Checking whether element attributes (items L1) exist in a mzml element attributes (L)
        if L1[0] in L:


You haven't helped anything by storing 'id' L1[0] and then fetching it again. Just use 'id' here.

mzmlID = elt.attrib['id']
        else:
            mzmlID = "-"


You should be able to do something like mzmlID = el.attrib.get('id', '-') to replace this whole if/else block.

if L1[1] in L:
            mzml_version = elt.attrib['version']
        else:
            mzml_version = "-"
        if L1[2] in L:
            mzml_accession = elt.attrib['accession']
        else:
            mzml_accession = "-"

        exp_id_fk = exp_PK

        sql = """INSERT INTO pmass_mzml (mzml_id,accession,version,exp_id)
        VALUES (%s, %s, %s, %s)"""
        try:
            # Execute the SQL command
            config.cursor.execute(sql,(mzmlID,mzml_accession,mzml_version,exp_id_fk))
            # Commit your changes in the database
            config.conn.commit()

            #mzml_pk= cursor.lastrowid


Don't leave dead code in comments

except Exception as err:
            # logger.error(err)
            # Rollback in case there is any error
            config.conn.rollback()


Don't just ignore errors that happen. If you expect an error to occur, you should check to make sure exactly the error you wanted was caught. Otherwise you should re-raise or at least log the error.

You might also look into with statments and context managers. Amongst other things, they make transactions easier to work with.

global mzml_pk
        mzml_pk = config.cursor.lastrowid


avoid use of global variables. Use return values/object attributes/pretty much anything before using a global variable.

# Processing second element (sourceFile) which has attributes (id, name and location)
    # SourceFile attributes id, name and location are required
    # Further, sourceFile has child element (cvParam) and attributes (cvRef, name, accession, value)
    # cvParam attributes: cvRef, name, accession are Required and value optional


Right about now, this function is way too long. You should break it up into several functions.

```
elif elt.tag==NS+taglist[1]:
sf_keys = elt.keys()
sf_need = ['id','name','location']
if sf_need[0] in sf_keys:
sf_id = elt.attrib['id']
else:
sf_id, "-"
if sf_need[1] in sf_keys:
sf_name = elt.attrib['name']
else:
sf_name, "-"
if sf_need[2] in sf_keys:
sf_location = elt.attrib['location']
else:
sf_location = "-"
global sf_fk
sf_fk = mzml_pk
#print "Insert values into django sourceFile class data model"

sql = """INSERT INTO pmass_source_file (sf_id,name,location,mzml_fk_id)
VALUES (%s, %s, %s, %s)"""
try:
# Execute the SQL command
config.cursor.execute(sql,(sf_id,sf_name,sf_location,sf_fk))
# Commit your changes in the database
config.conn.commit()

except Exception as err:
# logger.error(err)
# Rollback in case

Code Snippets

import os, sys
import stat
import getpass
import MySQLdb
from lxml import etree
import datetime, time
import dbconfig as config
# All global variables set
 #Namespace which is default in every mzML file

taglist=['mzML','sourceFile']
NS="{http://psi.hupo.org/ms/mzml}"
def fast_iter(context, func,args=[],kwargs={}):
# fast_iter is useful if we need to free memory while iterating through a
    # very large XML file.
    #http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly

    for event, elem in context:
        func(elem,*args, **kwargs)
elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

Context

StackExchange Code Review Q#5782, answer score: 3

Revisions (0)

No revisions yet.