Processing XML with Python - you're probably doing it wrong
If you think that processing XML in Python sucks, and your code is eating up hundreds of megabytes of RAM just to process a simple document, then don’t worry. You’re probably using xml.dom.minidom, and there is a much better way…
The test – books.xml
For this example, we’ll be attempting to process a 43MB document containing 4000 books. The test data can be downloaded here. The two methods shall be tested by providing implementations of iter_authors_and_descriptions.
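The harness used to produce the figures in this post is not shown here. As a rough sketch of how such a test might be driven, the following consumes any implementation of iter_authors_and_descriptions and collects the same statistics. The names run_benchmark and print_report are hypothetical, and tracemalloc only tracks allocations made through Python's allocator, so its peak is an approximation and not the same measurement as the process-level "Max memory" figure reported below.

```python
import time
import tracemalloc


def run_benchmark(iter_authors_and_descriptions, handle):
    """Consume an implementation and gather the statistics used in this post.

    `run_benchmark` is a hypothetical helper, not the original test harness.
    """
    tracemalloc.start()
    started = time.perf_counter()

    authors = set()
    longest = 0
    for author, description in iter_authors_and_descriptions(handle):
        authors.add(author)
        # An empty <description/> yields None, so fall back to "".
        longest = max(longest, len(description or ""))

    elapsed_ms = (time.perf_counter() - started) * 1000.0
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "unique_authors": len(authors),
        "longest_description": longest,
        "time_ms": elapsed_ms,
        "peak_mb": peak_bytes / (1024.0 * 1024.0),
    }


def print_report(stats):
    """Print the statistics in the same layout as the results in this post."""
    print("Unique authors: %d" % stats["unique_authors"])
    print("Longest description: %d" % stats["longest_description"])
    print("Time taken: %d ms" % stats["time_ms"])
    print("Max memory: %d MB" % stats["peak_mb"])
```

To try it, open books.xml in binary mode and pass the file handle, together with one of the two implementations, to run_benchmark, then hand the result to print_report.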
Because minidom loads the entire document into memory in one chunk, it takes a long time to run and uses a lot of memory.
Unique authors: 999
Longest description: 63577
Time taken: 368 ms
Max memory: 107 MB
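The minidom listing itself is not reproduced in this copy of the post. For comparison, a minidom-based implementation of iter_authors_and_descriptions might look something like the sketch below. It assumes each book element contains one author and one description child, which is what the cElementTree version relies on too. The helper _text_of and the sample document in the test are made up for illustration.

```python
from xml.dom import minidom


def _text_of(node):
    """Join the text and CDATA children of a DOM node (hypothetical helper)."""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType in (child.TEXT_NODE, child.CDATA_SECTION_NODE)
    )


def iter_authors_and_descriptions(handle):
    # minidom.parse() builds the complete DOM tree before returning,
    # so the whole document is held in memory at once.
    document = minidom.parse(handle)
    for book in document.getElementsByTagName("book"):
        author = book.getElementsByTagName("author")[0]
        description = book.getElementsByTagName("description")[0]
        yield _text_of(author), _text_of(description)
```

Nothing can be yielded until the whole file has been parsed, which is where the time and memory go.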
The right way – cElementTree
The cElementTree method is also awkward (this is XML, after all). However, by using the iterparse() method to avoid loading the whole document into memory, it is far more efficient.
from xml.etree import cElementTree


def iter_elements_by_name(handle, name):
    events = cElementTree.iterparse(handle, events=("start", "end",))
    _, root = next(events)  # Grab the root element.
    for event, elem in events:
        if event == "end" and elem.tag == name:
            yield elem
            root.clear()  # Free up memory by clearing the root element.


def iter_authors_and_descriptions(handle):
    for book in iter_elements_by_name(handle, "book"):
        yield (
            book.find("author").text,
            book.find("description").text,
        )
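A note for newer interpreters: xml.etree.cElementTree was deprecated in Python 3.3 and removed in Python 3.9. The plain xml.etree.ElementTree module uses the C accelerator automatically when it is available, so the same approach carries over with a one-line import change. The sketch below shows that variant running against a tiny in-memory document. The SAMPLE data is made up for illustration and is not the books.xml test file.

```python
import io
from xml.etree import ElementTree


def iter_elements_by_name(handle, name):
    events = ElementTree.iterparse(handle, events=("start", "end"))
    _, root = next(events)  # The first event is the "start" of the root element.
    for event, elem in events:
        if event == "end" and elem.tag == name:
            yield elem
            # Runs once the consumer asks for the next element: detach the
            # children processed so far so they can be garbage collected.
            root.clear()


def iter_authors_and_descriptions(handle):
    for book in iter_elements_by_name(handle, "book"):
        yield book.findtext("author"), book.findtext("description")


# Hypothetical sample data, standing in for the real books.xml.
SAMPLE = b"""<catalog>
  <book><author>Ann</author><description>First book</description></book>
  <book><author>Bob</author><description>Second</description></book>
</catalog>"""

for author, description in iter_authors_and_descriptions(io.BytesIO(SAMPLE)):
    print(author, len(description))
```

Because root.clear() sits after the yield, each book is still intact while the caller reads it and is only released when the caller moves on to the next one.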
The results speak for themselves. By using cElementTree, you can process the same XML in roughly half the time (192 ms instead of 368 ms) and use less than 10% of the memory (8 MB instead of 107 MB).
Unique authors: 999
Longest description: 63577
Time taken: 192 ms
Max memory: 8 MB
Conclusion
These tests are hardly scientific, so feel free to download the code and see how it runs in your own environment. In any case, the next time your servers get melted by an XML document, consider giving cElementTree a spin.