Motivation
In my previous article “Graph-theoretic considerations for Graph-theoretic considerations for 'text documents as trees'”, I looked at ways to represent the hierarchical structure of documents as graph-theoretic trees. In this article, I explained graph theory fundamentals, proposed names for specific structures, and proposed names for these structures. Today, I want to revise these names as I gained more experience with construction of such trees and want to focus on the representation of the trees in Haskell and rust.
Table of tree structures
old name | new name | type of attributes | is colored |
---|---|---|---|
structure tree |
structure tree |
none |
no |
- |
attributeless markup tree |
none |
yes |
content tree |
attributed structure tree |
string-valued |
no |
XML tree |
markup tree |
string-valued |
yes |
- |
node-attributed structure tree |
node-valued |
no |
attribute-node tree |
node-attributed markup tree |
node-valued |
yes |
In general, one can observe that I normalized two structures with the wording “markup tree” versus “structure tree”. Why? Because I deem them most valuable. The most important tree is the ‘markup tree’, which was called XML tree before. XML tree is a misnomer, because XML trees need to distinguish between PCDATA, processing instructions, and others besides elements and text nodes. Calling it “subset XML tree” makes little sense but sounds like a cumbersome name.
My original intention was to gain an overview over all possible structures to add them to typho. Implementing all of them systematically is not a goal of mine anymore, since variants of additional technical attributes (mutable, with position information, …) makes variations explode exponentially. Instead I want to focus on valuable variants.
One inconsistency in this naming scheme is that “structure tree” is the simplest name for single-colored trees and “markup tree” is the simplest name for colored trees. But whereas a “markup tree” is has attributes, a “structure tree” has none. In that sense, the simple “structure tree” is no the same level like “attributeless markup tree”; which has a non-simple name. I chose pragmatism over consistency here, because the most common trees shall have the simplest names.
Furthermore, I consider node attributes today as crazy invention.
Representation in Haskell
- Structure tree
-
data Node = String [Node]
- Attributeless markup tree
-
data Node = Element String [Node] | Text String
- Structure tree with attributes
-
data Node = String (Map String String) [Node]
- Markup tree
-
data Node = Element String (Map String String) [Node] | Text String
- Structure tree with node attributes
-
data Node = String (Map String Node) [Node]
- Attribute-node markup tree
-
data Node = Element String (Map String Node) [Node] | Text String
Representation in rust
use std::collections::HashMap;
// structure trees
pub struct StructureTree {
element: String,
children: Vec<Self>,
}
pub struct StructureTreeWithAttributes {
element: String,
attributes: HashMap<String, String>,
children: Vec<Self>,
}
pub struct StructureTreeWithNodeAttributes {
element: String,
attributes: HashMap<String, Box<Self>>,
children: Vec<Self>,
}
// markup trees
pub enum AttributelessMarkupTree {
Element(AttributelessMarkupElementNode),
Text(MarkupTextNode),
}
pub struct AttributelessMarkupElementNode {
element: String,
children: Vec<AttributelessMarkupTree>,
}
pub struct MarkupTextNode {
content: String,
}
pub enum MarkupTree {
Element(MarkupElementNode),
Text(MarkupTextNode),
}
pub struct MarkupElementNode {
element: String,
attributes: HashMap<String, String>,
children: Vec<MarkupTree>,
}
pub enum NodeAttributedMarkupTree {
Element(NodeAttributedMarkupElementNode),
Text(MarkupTextNode),
}
pub struct NodeAttributedMarkupElementNode {
element: String,
attributes: HashMap<String, NodeAttributedMarkupTree>,
children: Vec<NodeAttributedMarkupTree>,
}
Size analysis for rust
The Layout type of rust allows us to determine size and alignment of each type. I want to contribute this in the next table. For repr-declarations, please refer to the rust book.
type | default rust repr | #[repr( C)] |
---|---|---|
|
size = 48, alignment = 8 |
size = 48, alignment = 8 |
|
size = 96, alignment = 8 |
size = 96, alignment = 8 |
|
size = 96, alignment = 8 |
size = 96, alignment = 8 |
|
size = 48, alignment = 8 |
size = 56, alignment = 8 |
|
size = 96, alignment = 8 |
size = 104, alignment = 8 |
|
size = 96, alignment = 8 |
size = 104, alignment = 8 |
The data was generated with rustc version 1.77 (2024-03-17) on amd64.
One can see that the decision, whether to include attributes, results in a major difference in struct size.
The subset XML tree as special case
Speaking of XML trees, our markup tree is actually a generalization of those subset XML trees. Specifically, in XML no two text nodes can occur next to each other. Consider the following example:
<?xml version="1.1" encoding="utf-8"?>
<root>
<sub1/>
hello
<sub2/>
<sub3/>
world,
tree!
</root>
In this case, <root>
has …
-
element
<sub1/>
-
text
hello
-
element
<sub2/>
-
element
<sub3/>
-
text
world,\ntree
… as children. As can be seen, two element nodes can occur next to each other. But there is no possibility that two text nodes occur next to each other. As such, one could consider a text node as inherent data of the previous node. Because the first element might be a text node without a previous element, one needs to introduce another inherent data field. These fields are commonly called text (first text node inside element) and tail (text node after element). So, one could define the subset XML tree in Haskell as:
import Data.Map
data Node = String (Map String String) String String [Node]
Here, the first item after the Map is text and the second item is tail. In this case, both representations would lead to a size of 144 bytes and alignment of 8 bytes. I rejected this alternative, because it consumes more memory per node (but requires less resolution of memory addresses) and is IMHO less comprehensible. Furthermore it would lock into to the paradigm to disallow two consecutive text nodes, which seems asymmetric.
Conclusion
I revisited documents as tree in a systematic way again and discovered the memory consumption of the designs. The new names seem more pragmatic to me. I will now consider these data structures as fundamental basis for typho.