XML tree representation in Rust XML libraries

Written on 2023-06-23 in 397 words ✍️.
Part of project typho digital-typesetting

Motivation

In my article “Graph-theoretic considerations for 'text documents as trees'”, I took a theoretical look at document trees. In this article, I want to look at the practical aspects.

How are XML trees represented in data structures in XML crates?

A wrong assumption about xml-rs

Based on all-time download numbers, xml-rs is the most popular XML library in Rust. First, I took a look at its primitive data types.

I find it interesting that several types come in explicit borrowed and owned versions (Name versus OwnedName, Attribute versus OwnedAttribute, …). Beyond that, the set of types is straightforward and corresponds directly to the XML specification.
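
To illustrate what such a borrowed/owned pair can look like, here is a minimal sketch. The two structs are simplified stand-ins and not the actual xml-rs definitions; the field set, the method names to_owned_name and as_borrowed, and the helper parse_qname are all assumptions made for this example.

```rust
/// Borrowed name: every field points into a buffer owned by someone else.
/// (Simplified stand-in, not the real xml-rs type.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Name<'a> {
    pub local_name: &'a str,
    pub prefix: Option<&'a str>,
}

/// Owned name: holds its own heap-allocated copies of the same data.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OwnedName {
    pub local_name: String,
    pub prefix: Option<String>,
}

impl<'a> Name<'a> {
    /// Copies the borrowed strings so the result can outlive the input buffer.
    pub fn to_owned_name(&self) -> OwnedName {
        OwnedName {
            local_name: self.local_name.to_string(),
            prefix: self.prefix.map(String::from),
        }
    }
}

impl OwnedName {
    /// Cheap view onto the owned data; no allocation happens here.
    pub fn as_borrowed(&self) -> Name<'_> {
        Name {
            local_name: &self.local_name,
            prefix: self.prefix.as_deref(),
        }
    }
}

/// Hypothetical helper: splits a qualified name such as "svg:rect"
/// into prefix and local part without copying anything.
pub fn parse_qname(qname: &str) -> Name<'_> {
    match qname.split_once(':') {
        Some((prefix, local)) => Name { local_name: local, prefix: Some(prefix) },
        None => Name { local_name: qname, prefix: None },
    }
}
```

The borrowed version is cheap to create while the input is still around; the owned version is what you need as soon as a value has to outlive the buffer it was parsed from.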

What I had to learn at this point, however, is that xml-rs, xmlparser, and quick-xml do not actually store the XML tree. They only provide SAX-like XML parsers. A SAX parser provides an interface where a method is called whenever an XML element starts or ends, a text node occurs, processing instruction data is found, et cetera. The point is that the data is available only within that method call and is discarded afterwards. Memory consumption is therefore much lower, but you cannot query the XML tree’s hierarchy through primitives like XPath.
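
The callback style described above can be sketched with a toy scanner. This is not the API of xml-rs, xmlparser, or quick-xml; the Handler trait, the parse function, and the Stats struct are hypothetical names for this example, and the scanner only understands plain start tags, end tags, and text (no attributes, comments, or self-closing tags).

```rust
/// Callbacks that a SAX-style parser invokes while scanning the input.
/// The string slices are only valid for the duration of each call.
pub trait Handler {
    fn start_element(&mut self, name: &str);
    fn end_element(&mut self, name: &str);
    fn text(&mut self, content: &str);
}

/// Toy scanner: recognises only `<name>`, `</name>`, and text in between.
pub fn parse(input: &str, handler: &mut dyn Handler) {
    let mut rest = input;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            // Tag: everything up to the next '>' (or to the end if unclosed).
            let (tag, tail) = after_lt.split_once('>').unwrap_or((after_lt, ""));
            match tag.strip_prefix('/') {
                Some(name) => handler.end_element(name),
                None => handler.start_element(tag),
            }
            rest = tail;
        } else {
            // Text: everything up to the next '<' (or to the end of input).
            let end = rest.find('<').unwrap_or(rest.len());
            let (content, tail) = rest.split_at(end);
            handler.text(content);
            rest = tail;
        }
    }
}

/// Example handler: keeps a few counters and nothing else.
/// No element, name, or text survives the callback it arrived in.
#[derive(Default)]
pub struct Stats {
    pub depth: usize,
    pub max_depth: usize,
    pub elements: usize,
    pub text_bytes: usize,
}

impl Handler for Stats {
    fn start_element(&mut self, _name: &str) {
        self.elements += 1;
        self.depth += 1;
        self.max_depth = self.max_depth.max(self.depth);
    }
    fn end_element(&mut self, _name: &str) {
        self.depth = self.depth.saturating_sub(1);
    }
    fn text(&mut self, content: &str) {
        self.text_bytes += content.len();
    }
}
```

Calling parse("<a><b>hi</b></a>", &mut stats) on a Stats::default() value walks the whole document, yet only four integers remain afterwards. Any knowledge about the hierarchy, such as the current depth, has to be tracked by the handler itself, which is why a query like "all c children of a" cannot be answered after the fact.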

So let us look at other libraries which not only provide a SAX parser, but actually keep the XML tree in memory.

Conclusion