XML schema parser

Submitted by: @import:stackexchange-codereview
parser, schema, xml

Problem

I've been working on a lightweight XML schema parser and have what I think is a moderately clean solution so far for obtaining all schema details (some parts were helped along by previous questions I posted here). I would like any criticism at all that could help further improve this code.

Below I have supplied the schema class I wrote, and then an example schema.txt file that the schema class will open if run as main. The schema class calls under "main" can be modified if you want to get a better look at the schema data structure, and I have some accompanying functions I have written for the class to pull out specific details that I haven't put here because I still need to do some work on them.

schema.py:

from lxml import etree

INDICATORS = ["all", "sequence", "choice"]
TYPES = ["simpleType", "complexType"]

class schema:

    def __init__(self, schemafile):
        if schemafile is None:
            print "Error creating Schema: Invalid schema file used"
            return

        self.schema = self.create_schema(etree.parse(schemafile))

    def create_schema(self, schema_data):
        def getXSVal(element): #removes namespace
            return element.tag.split('}')[-1]

        def get_simple_type(element):
            return {
                "name": element.get("name"),
                "restriction": element.getchildren()[0].attrib,
                "elements": [ e.get("value") for e in element.getchildren()[0].getchildren() ]
            }

        def get_simple_content(element):
            return {
                "simpleContent": {
                    "extension": element.getchildren()[0].attrib,
                    "attributes": [ a.attrib for a in element.getchildren()[0].getchildren() ]
                }
            }

        def get_elements(element):

            if len(element.getchildren()) == 0:
                return element.attrib

            data = {}

            ename = element.get("name")
            tag = getXSVal(element)

            if

Solution

from lxml import etree

INDICATORS = ["all", "sequence", "choice"]
TYPES = ["simpleType", "complexType"]

class schema:


Python convention (PEP 8) is to name classes using CamelCase, so this would be Schema.

    def __init__(self, schemafile):
        if schemafile is None:
            print "Error creating Schema: Invalid schema file used"
            return


Use exceptions to report errors in Python. Don't print problems to standard output and then try to continue. Nothing good will come of it. Actually, you don't even need to check for None, because it'll fail on the next line anyway.
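
A minimal sketch of what that might look like. It assumes Python 3 and swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra dependencies; the SchemaError name is made up for illustration and is not part of the original code.

```python
import io
import xml.etree.ElementTree as ET


class SchemaError(Exception):
    """Hypothetical error type for schema loading failures."""


class Schema:
    def __init__(self, schemafile):
        # No explicit None check: a bad argument or malformed XML
        # raises here instead of leaving a half-built object behind.
        try:
            self.tree = ET.parse(schemafile)
        except ET.ParseError as exc:
            raise SchemaError("could not parse schema: %s" % exc) from exc


good = Schema(io.StringIO("<schema/>"))
print(good.tree.getroot().tag)  # schema

try:
    Schema(io.StringIO("<schema>"))  # unclosed tag
except SchemaError as exc:
    print("rejected:", exc)
```

Wrapping the parser's error is optional; letting the original exception propagate would satisfy the same advice.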

        self.schema = self.create_schema(etree.parse(schemafile))

    def create_schema(self, schema_data):
        def getXSVal(element): #removes namespace
            return element.tag.split('}')[-1]


Shouldn't you at least verify that the namespace was correct?
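
One way to do that check, as a sketch only. It assumes Python 3, uses xml.etree.ElementTree in place of lxml, and assumes the schema elements should live in the standard XML Schema namespace; local_name is a hypothetical replacement for getXSVal.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"


def local_name(element, expected_ns=XS_NS):
    """Return the tag without its namespace, checking the namespace first."""
    tag = element.tag
    if tag.startswith("{"):
        namespace, _, name = tag[1:].partition("}")
    else:
        namespace, name = "", tag
    if namespace != expected_ns:
        raise ValueError("unexpected namespace %r on <%s>" % (namespace, name))
    return name


root = ET.fromstring(
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:simpleType name="size"/><other name="x"/></xs:schema>'
)
print(local_name(root))     # schema
print(local_name(root[0]))  # simpleType
```

Calling local_name(root[1]) raises ValueError, because the other element carries no namespace.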

        def get_simple_type(element):
            return {
                "name": element.get("name"),
                "restriction": element.getchildren()[0].attrib,
                "elements": [ e.get("value") for e in element.getchildren()[0].getchildren() ]
            }


It looks like you are using a dictionary like an object. Perhaps you should actually be creating a SimpleType object with these attributes.
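
Something along these lines, for instance. This is a sketch that assumes Python 3 and xml.etree.ElementTree rather than lxml; the field names simply mirror the dictionary keys above, and a namedtuple is only one of several reasonable choices.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Field names mirror the keys of the dictionary in the reviewed code.
SimpleType = namedtuple("SimpleType", ["name", "restriction", "elements"])


def get_simple_type(element):
    restriction = element[0]  # first child, as in the original
    return SimpleType(
        name=element.get("name"),
        restriction=dict(restriction.attrib),
        elements=[e.get("value") for e in restriction],
    )


node = ET.fromstring(
    '<simpleType name="size"><restriction base="string">'
    '<enumeration value="small"/><enumeration value="large"/>'
    '</restriction></simpleType>'
)
size = get_simple_type(node)
print(size.name, size.restriction["base"], size.elements)
```

A misspelled attribute such as size.nmae then fails immediately with AttributeError rather than surfacing later as a missing key.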

        def get_simple_content(element):
            return {
                "simpleContent": {
                    "extension": element.getchildren()[0].attrib,
                    "attributes": [ a.attrib for a in element.getchildren()[0].getchildren() ]
                }
            }

        def get_elements(element):


I've got no idea what this function is trying to do.

            if len(element.getchildren()) == 0:
                return element.attrib

            data = {}

            ename = element.get("name")
            tag = getXSVal(element)

            if ename is None:


It seems strange that you check for the name, but don't do anything with it

                if tag == "simpleContent":
                    return get_simple_content(element)


It's confusing the way you sometimes return something and other times add into a dictionary.

                elif tag in INDICATORS:
                    data["indicator"] = tag
                elif tag in TYPES:
                    data["type"] = tag
                else:
                    data["option"] = tag

            else:
                if tag == "simpleType":
                    return get_simple_type(element)
                else: 
                    data.update(element.attrib)


I don't really follow what the theory for this condition is. I do see the same code showing up multiple times which makes me wonder if it can be refactored to be cleaner.

            data["elements"] = []
            data["attributes"] = []
            children = element.getchildren()        

            for child in children:


Combine the last two lines: iterate over element.getchildren() directly instead of storing it in children first.

                if child.get("name") is not None:
                    data[getXSVal(child)+"s"].append(get_elements(child))
                elif tag in INDICATORS and getXSVal(child) in INDICATORS:
                    data["elements"].append(get_elements(child))
                else:
                    data.update(get_elements(child))

            if len(data["elements"]) == 0:
                del data["elements"]
            if len(data["attributes"]) == 0:
                del data["attributes"]


Do you really want to do this? It seems to me that it'll make the code that uses this data harder to write.
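
To illustrate the cost, here is a small sketch with made-up data (the dictionaries below are simplified stand-ins, not real parser output). When the keys are always present the consumer can index directly; once empty lists are deleted, every consumer has to guard against the missing key.

```python
def element_names(data):
    # Works only because "elements" is always present, even when empty.
    return [e["name"] for e in data["elements"]]


def element_names_defensive(data):
    # What every caller needs once empty keys are deleted.
    return [e["name"] for e in data.get("elements", [])]


full = {"elements": [{"name": "width"}], "attributes": []}
empty = {"elements": [], "attributes": []}
pruned = {}  # simplified shape after both empty lists are deleted

print(element_names(full))
print(element_names(empty))
print(element_names_defensive(pruned))
```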

            return data


These long functions as inner functions smell bad. They suggest that perhaps they should be in another class or something.

        schema = {}
        root = schema_data.getroot()
        children = root.getchildren()
        for child in children:
            c_type = getXSVal(child)
            if child.get("name") is not None and not c_type in schema:
                schema[c_type] = []


If the name is None and c_type has not been seen yet, the list is never created, so won't the next line raise a KeyError?

            schema[c_type].append(get_elements(child))


Instead, use schema.setdefault(c_type, []).append(get_elements(child)); it'll take care of adding the list the first time you append.
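
A tiny standalone illustration of the setdefault pattern. The (type, name) tuples are stand-ins for parsed child elements, chosen only to keep the example self-contained.

```python
# Stand-ins for (c_type, parsed element) pairs from the schema root.
children = [
    ("simpleType", "size"),
    ("complexType", "shirt"),
    ("simpleType", "colour"),
]

schema = {}
for c_type, name in children:
    # Creates the list on first use, then appends in either case.
    schema.setdefault(c_type, []).append(name)

print(schema)
# {'simpleType': ['size', 'colour'], 'complexType': ['shirt']}
```

collections.defaultdict(list) would work equally well if the dictionary is only ever used for this grouping.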

        return schema

    def get_Types(self, t_name):


Python convention is lowercase_with_underscores for method names.

        types = []
        for t in self.schema[t_name]:
            types.append(t["name"])
        return types


I'd use return [t["name"] for t in self.schema[t_name]]

    def get_simpleTypes(self):
        return self.get_Types("simpleType")

    def get_complexTypes(self):
        return self.get_Types("complexType")

if __name__ == '__main__':
    fschema = open("schema.txt")


I suggest using a with statement to make sure the file gets closed.
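
For example, roughly like this. The sketch assumes Python 3 and xml.etree.ElementTree instead of lxml, and it writes a throwaway file first so it can run on its own; the real code would just open schema.txt.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a throwaway schema file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("<schema><simpleType name='size'/></schema>")

# The file is closed when the block exits, even if parsing raises.
with open(path) as fschema:
    root = ET.parse(fschema).getroot()

print(fschema.closed)  # True
print([child.get("name") for child in root])
os.remove(path)
```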

    schema = schema(fschema)

    print schema.get_simpleTypes()
    print schema.get_complexTypes()


My overall problem with your approach is that you are converting the XML schema into a bunch of unstructured dictionaries. The result isn't going to be much easier to work with than the original XML object.
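
As a rough idea of a more structured result, here is a sketch covering complex types only. It assumes Python 3.7+ and xml.etree.ElementTree rather than lxml; the ComplexType and Schema classes and their fields are hypothetical, not a design taken from the original code.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

XS = "{http://www.w3.org/2001/XMLSchema}"
INDICATORS = ("all", "sequence", "choice")


@dataclass
class ComplexType:
    name: str
    indicator: str = ""  # one of INDICATORS, or "" if none was found
    element_names: list = field(default_factory=list)


@dataclass
class Schema:
    complex_types: dict = field(default_factory=dict)

    @classmethod
    def from_root(cls, root):
        schema = cls()
        for node in root.findall(XS + "complexType"):
            ctype = ComplexType(name=node.get("name"))
            for indicator in INDICATORS:
                group = node.find(XS + indicator)
                if group is not None:
                    ctype.indicator = indicator
                    ctype.element_names = [
                        e.get("name") for e in group.findall(XS + "element")
                    ]
                    break
            schema.complex_types[ctype.name] = ctype
        return schema


root = ET.fromstring(
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:complexType name="shirt"><xs:sequence>'
    '<xs:element name="size" type="xs:string"/>'
    '<xs:element name="colour" type="xs:string"/>'
    '</xs:sequence></xs:complexType></xs:schema>'
)
shirt = Schema.from_root(root).complex_types["shirt"]
print(shirt.indicator, shirt.element_names)
```

Callers then work with named attributes such as shirt.element_names instead of guessing which dictionary keys happen to exist.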

Context

StackExchange Code Review Q#10960, answer score: 3
