HiveBrain v1.2.0
Get Started
← Back to all entries
snippetpythonMinor

How can I make my XML parser more pythonic?

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
pythoniccanparsermakemorexmlhow

Problem

I have a function that takes an XML document and parses out specific XML elements and appends those to a list for use by an Oracle executemany call. The placement of the elements matters as the executemany string uses positional binds, though I could easily use a dictionary to use named binds instead. I feel that while this works, it is pretty slow since it has to reparse the data 5x per document instead of pulling the whole thing into an array once. I know xml.etree is probably what I want here instead of minidom, but I'm not sure how to implement it.

The data I'm working with is structured like this:


    
        
            TAX
            My vendor
            My version 23121
        
        
            OS
            Windows Server 2008 R2 amd64
            6.1
        
        
            APPSERVER
            JBoss Web
            3.0.0-CR1
        
        
            DATABASE
            Microsoft SQL Server
            10.50.1600
        
        
            JAVA
            Sun Microsystems Inc.
            1.6.0_26
       
       
           JDBC
           Microsoft SQL Server JDBC Driver 3.0
           3.0.1301.101
       
   
   ...


And the function looks like this:

```
def parseDocumentData(document, cusNumber):
"Take document from calling function, validate that the structure is \
correct, then process the XML elements and return the data as a list \
of lists."
# Check filename for customer number if cusNumber isn't passed.
if cusNumber == None:
cusNumber = os.path.splitext(os.path.basename(document))[0]
cusNumber = str(cusNumber).encode('ascii')
# Verify that the destination customer record exists. Source was verified by input function.
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)
cusVerify = _dbCur.fetchone()[0]
if cusVerify == 0:
_dbCur.execute('insert into pm_customer (customer_id) values (' + cusNumber + ')')
print 'Customer

Solution


  1. Comments on your code



-
Your database queries use string concatenation and so are subject to SQL injection attacks. Consider this pair of lines:

cusNumber = os.path.splitext(os.path.basename(document))[0]
# ...
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)


Imagine that an attacker were able to trick you into a processing a customer file with the name 0;drop database.xml. This would cause you to execute the SQL command

select count(*) from pm_customer where customer_id=0;drop database


which I am sure wasn't what you intended to happen. You need to use SQL query parameters to pass untrusted data safely to SQL queries:

_dbCur.execute('select count(*) from pm_customer where customer_id=?', cusNumber)


-
If you just want to know if a customer number is present in the database, use an EXISTS query to make it clear what you mean and to give the database engine a chance to make a better execution plan:

select exists(select * from pm_customer where customer_id=?)


Use a COUNT query only when you actually want to know the number of matching records.

But there's no point in checking to see whether the customer exists in the database if you are just going to add them if they are not present. You should use a MERGE statement, like this:

MERGE INTO pm_customer ON (customer_id = ?)
WHEN NOT MATCHED THEN INSERT (customer_id) VALUES (?)


But even better than that would be to save up all the customers that you plan to add to the database, and then add in one big MERGE statement which you call via executemany (preferably the same statement in which you add the customer attributes).

-
You get the contents of the ` and elements by executing getElementsByTagName queries. It would be more efficient to process the elements in the order they appear. (See section 2 below for an illustration of how to do this.)

-
Your code is delicate: it will break if the order of data within the XML file changes. This is because it assembles
prepList in the order in which the elements appear in the XML. But the values in prepList will eventually be fed to a prepared database query which expects them to appear in the right order (first the TAX data, then the OS data, etc.). It would be better to collect the data in whatever order it appears, and then put it into the right order at the end. That way your code will not break if the elements appear in a different order.

-
You don't give the definition of
xmlTypeTuple but it must be something like this:

xmlTypeTuple = 'TAX OS APPSERVER DATABASE JAVA'.split()


You then repeat this sequence of type names in the line

typePattern = re.compile('(TAX|OS|APPSERVER|DATABASE|JAVA)')


This violates the DRY (Don't Repeat Yourself) principle: if you had to add a new type it would be easy to make a mistake and add it to one of these and not the other. So you should write something like this:

typePattern = re.compile('({})'.format('|'.join(map(re.escape, xmlTypeTuple))))


(But see the next comment.)

-
You match the content of the
element against typePattern. But then you immediately check the content to see if it's equal to one of the members of the xmlTypeTuple list. It's not necessary to carry out both of these tests. If the latter passes, the former must too. So I would suggest dropping the regular expression match altogether and just seeing if the content is a member of xmlTypeTuple.

-
It's not clear what the point of
isValid is. You set it to true as soon as the content of any single element matches against typePattern. But you supply default values for all vendors and versions, so why is this necessary? Why is an empty customer record considered invalid?

(Edited to add: I see that in the revised code in your answer you changed the behaviour so that
isValid is set to true if there is an element with type TAX. This is still not clear to me. What's special about TAX?)

-
There's no need for your empty
return statement: when Python reaches the end of a function body, it returns automatically.

-
Your design involves global state variables
dbWriteList, completedFiles, and failedFiles. It is usually best practice to encapsulate persistent state into objects belonging to classes, and to turn the functions operating on that state into methods on the objects. See section 2 for some example code showing how you might do this.

-
Your success message

print 'File processed successfully. Moving on.'


would be more useful if it contained the name of the document that was processed successfully.

-
Failure messages like

print 'File did not contain valid XML elements:', document


ought to be printed to standard error, not standard output.

-
The Python style guide (PEP8) says that names of variables should be in lower case with underscores. So
cusNumber should be something like customer_id`.

-
You don't check

Code Snippets

cusNumber = os.path.splitext(os.path.basename(document))[0]
# ...
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)
select count(*) from pm_customer where customer_id=0;drop database
_dbCur.execute('select count(*) from pm_customer where customer_id=?', cusNumber)
select exists(select * from pm_customer where customer_id=?)
MERGE INTO pm_customer ON (customer_id = ?)
WHEN NOT MATCHED THEN INSERT (customer_id) VALUES (?)

Context

StackExchange Code Review Q#24022, answer score: 3

Revisions (0)

No revisions yet.