Skip to content

Python Fast XML Parsing

Here is a useful tip on Python XML decoding.

I was extending xml.sax.ContentHandler class in a example to decode maps for a Pygame application when my connection went down and I noticed that the program stop working raising a exception regarded a call to urlib (a module for retrieve resources by url). I noticed that the module was getting the remote DTD schema to validate the XML.

This is not a requirement for my applications and it’s a huge performance overhead when works (almost 1 second for each map loaded) and when the applications is running in a environment without Internet it just waits for almost a minute and then fail with the remain decoding. A dirty workaround is open the XML file and get rid of the line containing the DTD reference.

But the correct way to programming XML decoding when we are not concerned on validate a XML schema is just the xml.parsers.expat. Instead of using a interface you just have to set some callback functions with the behaviors we want. This is a example from the documentation:

import xml.parsers.expat

# 3 handler functions
def start_element(name, attrs):
    print 'Start element:', name, attrs
def end_element(name):
    print 'End element:', name
def char_data(data):
    print 'Character data:', repr(data)

p = xml.parsers.expat.ParserCreate()

p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

Text goes here
More text
""", 1)

The output:

Start element: parent {'id': 'top'}
Start element: child1 {'name': 'paul'}
Character data: 'Text goes here'
End element: child1
Character data: '\n'
Start element: child2 {'name': 'fred'}
Character data: 'More text'
End element: child2
Character data: '\n'
End element: parent
Published inenglish


  1. Kannan Sampath Kannan Sampath

    Above script is not working for me.. am using linux.. am not familiar with this xml.parsers.expat. Kindly advice me what am missing.

Leave a Reply

Your email address will not be published. Required fields are marked *