HiveBrain v1.2.0
Get Started
← Back to all entries
patternpythonMinor

Extending the functionality of lxml.etree

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
etreethelxmlfunctionalityextending

Problem

I wrote a class to slightly customize the behavior of lxml.etree.ElementTree and I use it quite extensively. It works great, but there are a few methods that I'm not sure how I wrote, and there are a few other methods that seem awfully redundant. I'll elaborate on specific questions below, but first here's the code.

import lxml.etree as old_etree
from xml_utilities import clean_xml
from datetime import datetime

VERSION = 'X.X.X.X'
TOOL = 'ExampleTool'

class etree(old_etree._ElementTree):
  @staticmethod
  def parse(path):  
    old_tdx = old_etree.parse(path)
    new_tdx = etree(old_tdx)
    new_tdx._setroot(old_tdx.getroot())
    return new_tdx

  @staticmethod
  def fromstring(string):
    old_tdx = old_etree.fromstring(string)
    new_tdx = etree(old_tdx.getroottree())
    new_tdx._setroot(old_tdx)
    return new_tdx

  @staticmethod
  def getISOTime():
    iso_time = datetime.now().isoformat()
    format_time = iso_time.split('.')[0] + 'Z'
    return format_time

  @staticmethod
  def SubElement(*args, **kargs):
    return old_etree.SubElement(*args, **kargs)

  @staticmethod
  def Element(*args, **kargs):
    return old_etree.Element(*args, **kargs)

  def getMetaNode(self, name, default=''):
    metas = self.findall('meta')
    for meta in metas:
      if meta.get('name') == name:
        return meta
    return old_etree.SubElement(self.getroot(), 'meta', 
                                attrib={'name': name, 'value': default})

  def setMetaNode(self, name, value):
    node = self.getMetaNode(name)
    node.set('value', value)

  def write(self, path, updateTool=True):       
    self.setMetaNode('saved', etree.getISOTime())

    if updateTool:
      self.setMetaNode('version', VERSION)
      self.setMetaNode('type', TOOL)

    super(etree, self).write(path)
    clean_xml(path)


The function clean_xml is custom written and largely irrelevant: It escapes troublesome characters that ElementTree.write doesn't by default (this is an issue with th

Solution

Boilerplate

class etree(old_etree._ElementTree):
  @staticmethod
  def parse(path):  
    old_tdx = old_etree.parse(path)
    new_tdx = etree(old_tdx)
    new_tdx._setroot(old_tdx.getroot())
    return new_tdx

  @staticmethod
  def fromstring(string):
    old_tdx = old_etree.fromstring(string)
    new_tdx = etree(old_tdx.getroottree())
    new_tdx._setroot(old_tdx)
    return new_tdx


In those boilerplate functions, I don't know what "tdx" means. I guess it's linked to the file specification you use.

Those functions are the main problem I have with your code. You mixed the lxml.etree module with the lxml.etree.ElementTree class. Those four methods are methods of the module, not the class! I think you should:

  • have a real etree module



  • put your current etree class (renamed as ElementTree) in it



  • add those functions in the module, not the class



  • fix the return type to return what lxml returns



This would make things way less confusing because it would be closer to how lxml.etree and the standard xml.etree work. Your coworkers would be able to pick your code faster.

You would do from lxml.etree import * and only redefine the functions you want to redefine. Your probably can't expect a perfect compatibility, but at least the basic API would be the same.

I don't have much to say about the implementation of the functions apart from the return type: ugly but I did not find a way to make things better. And yes, it's quite easy to understand them.

@staticmethod
  def SubElement(*args, **kargs):
    return old_etree.SubElement(*args, **kargs)

  @staticmethod
  def Element(*args, **kargs):
    return old_etree.Element(*args, **kargs)


You no longer would need do to this with my proposal above.

getISOTime

@staticmethod
  def getISOTime():
    iso_time = datetime.now().isoformat()
    format_time = iso_time.split('.')[0] + 'Z'
    return format_time


Adding the 'Z' means you're saying this is an UTC time, but it's not since you've used datetime.now() instead of datetime.utcnow(). I live in UTC+4, and the format_time you're returning is 4 hours off for me.

getMetaNode

def getMetaNode(self, name, default=''):
    metas = self.findall('meta')
    for meta in metas:
      if meta.get('name') == name:
        return meta


You can use lxml XPath support here if you want to avoid the loop: meta.xpath("meta[name = $name]", name=name).

Write

def write(self, path, updateTool=True):       
    self.setMetaNode('saved', etree.getISOTime())

    if updateTool:
      self.setMetaNode('version', VERSION)
      self.setMetaNode('type', TOOL)

    super(etree, self).write(path)
    clean_xml(path)


nitpick: Cleaning at the ElementTree would prevent your filesystem to see two different versions. Say at some point you decide to use watchdog, the callback will kick in before you have a chance to run clean_xml, which could cause subtle bugs.

Answering your questions



  • Is there a better way to extend the functionality of lxml.etree than inheriting from _ElementTree? Ultimately, I want to be changing the behavior of the method ElementTree.write and to add the methods ElementTree.getMetaNode and ElementTree.setMetaNode.




Since lxml is written in Cython, I think you can't monkeypatch lxml directly anyway, so subclassing is the way to go. lxml could have chosen to make things easier by using __new__ just like numpy does for facilitating ndarray subclassing.



  • Is there a better way of incorporating the methods parse, fromstring, SubElement, and Element, rather than just creating thin wrappers that reference the original methods?




See "Boilerplate" above.



  • Does anyone understand what my methods parse and fromstring are doing? I know that my method fromstring has different functionality from lxml.etree.fromstring because I want it to return an ElementTree rather than an Element. These methods were written largely through guess-and-check (oops).




Why do you want to get an ElementTree rather than an Element? I think it's a bad idea for two reasons:

  • It's very easy to get an ElementTree from an Element.



  • It's already hard enough to reason about the type of object you get when using lxml, if we can't use our knowledge about ElementTree/lxml it's only going to get worse.





  • Is there a better way to write the return old_etree.SubElement(...) line so it doesn't have to take up two lines? This isn't major, but...




If anything, I think I would use more than two lines. If you want to stop worrying about such things, I highly recommand yapf.


Originally, I tried to just overwrite lxml.etree.ElementTree.write directly, but that throws the error AttributeError: 'lxml.etree._ElementTree' object attribute 'write' is read-only. General critiques are also welcomed. I know that I should have comments in here, but everything is pretty straight-forward except for the two methods that I don't unders

Code Snippets

class etree(old_etree._ElementTree):
  @staticmethod
  def parse(path):  
    old_tdx = old_etree.parse(path)
    new_tdx = etree(old_tdx)
    new_tdx._setroot(old_tdx.getroot())
    return new_tdx

  @staticmethod
  def fromstring(string):
    old_tdx = old_etree.fromstring(string)
    new_tdx = etree(old_tdx.getroottree())
    new_tdx._setroot(old_tdx)
    return new_tdx
@staticmethod
  def SubElement(*args, **kargs):
    return old_etree.SubElement(*args, **kargs)

  @staticmethod
  def Element(*args, **kargs):
    return old_etree.Element(*args, **kargs)
@staticmethod
  def getISOTime():
    iso_time = datetime.now().isoformat()
    format_time = iso_time.split('.')[0] + 'Z'
    return format_time
def getMetaNode(self, name, default=''):
    metas = self.findall('meta')
    for meta in metas:
      if meta.get('name') == name:
        return meta
def write(self, path, updateTool=True):       
    self.setMetaNode('saved', etree.getISOTime())

    if updateTool:
      self.setMetaNode('version', VERSION)
      self.setMetaNode('type', TOOL)

    super(etree, self).write(path)
    clean_xml(path)

Context

StackExchange Code Review Q#119559, answer score: 4

Revisions (0)

No revisions yet.