snippetpythonMinor
How can I make my XML parser more pythonic?
Viewed 0 times
pythoniccanparsermakemorexmlhow
Problem
I have a function that takes an XML document and parses out specific XML elements and appends those to a list for use by an Oracle executemany call. The placement of the elements matters as the executemany string uses positional binds, though I could easily use a dictionary to use named binds instead. I feel that while this works, it is pretty slow since it has to reparse the data 5x per document instead of pulling the whole thing into an array once. I know xml.etree is probably what I want here instead of minidom, but I'm not sure how to implement it.
The data I'm working with is structured like this:
And the function looks like this:
```
def parseDocumentData(document, cusNumber):
"Take document from calling function, validate that the structure is \
correct, then process the XML elements and return the data as a list \
of lists."
# Check filename for customer number if cusNumber isn't passed.
if cusNumber == None:
cusNumber = os.path.splitext(os.path.basename(document))[0]
cusNumber = str(cusNumber).encode('ascii')
# Verify that the destination customer record exists. Source was verified by input function.
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)
cusVerify = _dbCur.fetchone()[0]
if cusVerify == 0:
_dbCur.execute('insert into pm_customer (customer_id) values (' + cusNumber + ')')
print 'Customer
The data I'm working with is structured like this:
TAX
My vendor
My version 23121
OS
Windows Server 2008 R2 amd64
6.1
APPSERVER
JBoss Web
3.0.0-CR1
DATABASE
Microsoft SQL Server
10.50.1600
JAVA
Sun Microsystems Inc.
1.6.0_26
JDBC
Microsoft SQL Server JDBC Driver 3.0
3.0.1301.101
...
And the function looks like this:
```
def parseDocumentData(document, cusNumber):
"Take document from calling function, validate that the structure is \
correct, then process the XML elements and return the data as a list \
of lists."
# Check filename for customer number if cusNumber isn't passed.
if cusNumber == None:
cusNumber = os.path.splitext(os.path.basename(document))[0]
cusNumber = str(cusNumber).encode('ascii')
# Verify that the destination customer record exists. Source was verified by input function.
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)
cusVerify = _dbCur.fetchone()[0]
if cusVerify == 0:
_dbCur.execute('insert into pm_customer (customer_id) values (' + cusNumber + ')')
print 'Customer
Solution
- Comments on your code
-
Your database queries use string concatenation and so are subject to SQL injection attacks. Consider this pair of lines:
cusNumber = os.path.splitext(os.path.basename(document))[0]
# ...
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)Imagine that an attacker were able to trick you into a processing a customer file with the name
0;drop database.xml. This would cause you to execute the SQL commandselect count(*) from pm_customer where customer_id=0;drop databasewhich I am sure wasn't what you intended to happen. You need to use SQL query parameters to pass untrusted data safely to SQL queries:
_dbCur.execute('select count(*) from pm_customer where customer_id=?', cusNumber)-
If you just want to know if a customer number is present in the database, use an
EXISTS query to make it clear what you mean and to give the database engine a chance to make a better execution plan:select exists(select * from pm_customer where customer_id=?)Use a
COUNT query only when you actually want to know the number of matching records.But there's no point in checking to see whether the customer exists in the database if you are just going to add them if they are not present. You should use a
MERGE statement, like this:MERGE INTO pm_customer ON (customer_id = ?)
WHEN NOT MATCHED THEN INSERT (customer_id) VALUES (?)But even better than that would be to save up all the customers that you plan to add to the database, and then add in one big
MERGE statement which you call via executemany (preferably the same statement in which you add the customer attributes).-
You get the contents of the `
and elements by executing getElementsByTagName queries. It would be more efficient to process the elements in the order they appear. (See section 2 below for an illustration of how to do this.)
-
Your code is delicate: it will break if the order of data within the XML file changes. This is because it assembles prepList in the order in which the elements appear in the XML. But the values in prepList will eventually be fed to a prepared database query which expects them to appear in the right order (first the TAX data, then the OS data, etc.). It would be better to collect the data in whatever order it appears, and then put it into the right order at the end. That way your code will not break if the elements appear in a different order.
-
You don't give the definition of xmlTypeTuple but it must be something like this:
xmlTypeTuple = 'TAX OS APPSERVER DATABASE JAVA'.split()
You then repeat this sequence of type names in the line
typePattern = re.compile('(TAX|OS|APPSERVER|DATABASE|JAVA)')
This violates the DRY (Don't Repeat Yourself) principle: if you had to add a new type it would be easy to make a mistake and add it to one of these and not the other. So you should write something like this:
typePattern = re.compile('({})'.format('|'.join(map(re.escape, xmlTypeTuple))))
(But see the next comment.)
-
You match the content of the element against typePattern. But then you immediately check the content to see if it's equal to one of the members of the xmlTypeTuple list. It's not necessary to carry out both of these tests. If the latter passes, the former must too. So I would suggest dropping the regular expression match altogether and just seeing if the content is a member of xmlTypeTuple.
-
It's not clear what the point of isValid is. You set it to true as soon as the content of any single element matches against typePattern. But you supply default values for all vendors and versions, so why is this necessary? Why is an empty customer record considered invalid?
(Edited to add: I see that in the revised code in your answer you changed the behaviour so that isValid is set to true if there is an element with type TAX. This is still not clear to me. What's special about TAX?)
-
There's no need for your empty return statement: when Python reaches the end of a function body, it returns automatically.
-
Your design involves global state variables dbWriteList, completedFiles, and failedFiles. It is usually best practice to encapsulate persistent state into objects belonging to classes, and to turn the functions operating on that state into methods on the objects. See section 2 for some example code showing how you might do this.
-
Your success message
print 'File processed successfully. Moving on.'
would be more useful if it contained the name of the document that was processed successfully.
-
Failure messages like
print 'File did not contain valid XML elements:', document
ought to be printed to standard error, not standard output.
-
The Python style guide (PEP8) says that names of variables should be in lower case with underscores. So cusNumber should be something like customer_id`.-
You don't check
Code Snippets
cusNumber = os.path.splitext(os.path.basename(document))[0]
# ...
_dbCur.execute('select count(*) from pm_customer where customer_id=' + cusNumber)select count(*) from pm_customer where customer_id=0;drop database_dbCur.execute('select count(*) from pm_customer where customer_id=?', cusNumber)select exists(select * from pm_customer where customer_id=?)MERGE INTO pm_customer ON (customer_id = ?)
WHEN NOT MATCHED THEN INSERT (customer_id) VALUES (?)Context
StackExchange Code Review Q#24022, answer score: 3
Revisions (0)
No revisions yet.