...and then it crashed

Programming the web with Python, Django and Javascript.

Processing XML with Python - you're probably doing it wrong

If you think that processing XML in Python sucks, and your code is eating up hundreds of megabytes of RAM just to process a simple document, then don’t worry. You’re probably using xml.dom.minidom, and there is a much better way…

The test – books.xml

For this example, we’ll be processing a 43MB document containing 4000 books. The test data can be downloaded here. The two approaches will be compared by each providing an implementation of iter_authors_and_descriptions.
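
If the download link no longer works, a script along the following lines can generate comparable test data. This is a sketch of mine, not part of the original post: the element names match the code below, but the exact structure and sizes of the real file are guesses.

(generate_books.py)

import random

def main():
    with open("books.xml", "w") as handle:
        handle.write("<catalog>\n")
        for number in range(4000):
            # 999 distinct authors, to match the results reported below.
            author = "Author {}".format(number % 999)
            # Random-length filler text, aiming at roughly 43MB in total.
            description = "Lorem ipsum dolor sit amet. " * random.randint(20, 750)
            handle.write(
                '<book id="bk{0}">'
                "<author>{1}</author>"
                "<title>Book {0}</title>"
                "<description><![CDATA[{2}]]></description>"
                "</book>\n".format(number, author, description)
            )
        handle.write("</catalog>\n")

if __name__ == "__main__":
    main()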

(xml_test.py)
import resource, time
# iter_authors_and_descriptions is supplied by the implementation under
# test (see minidom.py and cElementTree.py below).

def main():
    authors = set()
    max_description = 0
    start_time = time.time()
    with open("books.xml") as handle:
        for author, description in iter_authors_and_descriptions(handle):
            authors.add(author)
            max_description = max(max_description, len(description))
    # Print out the report.
    end_time = time.time()
    report = resource.getrusage(resource.RUSAGE_SELF)
    print("Unique authors: {}".format(len(authors)))
    print("Longest description: {}".format(max_description))
    print("Time taken: {} ms".format(int((end_time - start_time) * 1000)))
    print("Max memory: {} MB".format(report.ru_maxrss / 1024 / 1024))

if __name__ == "__main__":
    main()
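
The harness expects iter_authors_and_descriptions to come from whichever implementation is being tested. One way to wire that up (my own glue, not part of the original downloads) is a thin runner script:

(run_test.py)

import sys
import xml_test

# Usage: python run_test.py minidom  (or: python run_test.py cElementTree)
# Patch the chosen implementation into the test harness, then run it.
implementation = __import__(sys.argv[1])
xml_test.iter_authors_and_descriptions = implementation.iter_authors_and_descriptions
xml_test.main()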

The wrong way – minidom

The minidom method is both awkward and inefficient.

(minidom.py)
from xml.dom import minidom

def get_child_text(parent, child_name):
    # An element's content can be split across multiple text and CDATA
    # nodes, so all of them have to be joined together.
    child = parent.getElementsByTagName(child_name)[0]
    return "".join(
        grandchild.data
        for grandchild
        in child.childNodes
        if grandchild.nodeType in (grandchild.TEXT_NODE, grandchild.CDATA_SECTION_NODE)
    )

def iter_authors_and_descriptions(handle):
    document = minidom.parse(handle)
    for book in document.getElementsByTagName("book"):
        yield (
            get_child_text(book, "author"),
            get_child_text(book, "description"),
        )
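
The node-joining in get_child_text matters because an element's content can be split across plain text and CDATA sections. A quick sanity check of my own (assuming get_child_text from above is in scope):

from xml.dom import minidom

# A description split into a text node, a CDATA section, and another text node.
document = minidom.parseString(
    "<book><description>Plain <![CDATA[and CDATA]]> text</description></book>"
)
print(get_child_text(document.documentElement, "description"))
# Prints: Plain and CDATA text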

Because it loads the entire document into memory in one go, minidom takes a long time to run and uses a lot of memory.

Unique authors: 999
Longest description: 63577
Time taken: 368 ms
Max memory: 107 MB

The right way – cElementTree

The cElementTree approach is also awkward (this is XML, after all). However, by using iterparse() to stream through the document instead of loading it all into memory, a great deal more efficiency can be achieved. (Note: since Python 3.3, plain xml.etree.ElementTree uses the same fast C implementation automatically, and the xml.etree.cElementTree alias was removed entirely in Python 3.9, so new code should import ElementTree instead.)

(cElementTree.py)
from xml.etree import cElementTree

def iter_elements_by_name(handle, name):
    events = cElementTree.iterparse(handle, events=("start", "end",))
    _, root = next(events)  # Grab the root element.
    for event, elem in events:
        if event == "end" and elem.tag == name:
            yield elem
            # The generator only resumes here once the caller has finished
            # with the yielded element, so it's safe to free memory by
            # clearing the root (and everything parsed under it).
            root.clear()

def iter_authors_and_descriptions(handle):
    for book in iter_elements_by_name(handle, "book"):
        yield (
            book.find("author").text,
            book.find("description").text,
        )
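
One caveat worth noting (my addition, not from the original post): Element.text only contains the text up to the first child element, and is None for an empty element. If the books contained mixed content, joining every fragment with itertext() would be safer:

def element_text(elem):
    # itertext() walks the element and yields every text fragment,
    # including text nested inside child elements.
    return "".join(elem.itertext())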

The results speak for themselves. By using cElementTree, you can process the same document in about half the time and less than a tenth of the memory.

Unique authors: 999
Longest description: 63577
Time taken: 192 ms
Max memory: 8 MB

Conclusion

These tests are hardly scientific, so feel free to download the code and see how it runs in your own environment. In any case, the next time your servers get melted by an XML document, consider giving cElementTree a spin.
