Data-structure-theoretic considerations for 'text documents as trees'

✍️ Written on 2024-05-31 in 1833 words.
Part of project typho digital-typesetting

Motivation

In my previous article “Graph-theoretic considerations for Graph-theoretic considerations for 'text documents as trees'”, I looked at ways to represent the hierarchical structure of documents as graph-theoretic trees. In this article, I explained graph theory fundamentals, proposed names for specific structures, and proposed names for these structures. Today, I want to revise these names as I gained more experience with construction of such trees and want to focus on the representation of the trees in Haskell and rust.

Table of tree structures

old name new name type of attributes is colored

structure tree

structure tree

none

no

-

attributeless markup tree

none

yes

content tree

attributed structure tree

string-valued

no

XML tree

markup tree

string-valued

yes

-

node-attributed structure tree

node-valued

no

attribute-node tree

node-attributed markup tree

node-valued

yes

In general, one can observe that I normalized two structures with the wording “markup tree” versus “structure tree”. Why? Because I deem them most valuable. The most important tree is the ‘markup tree’, which was called XML tree before. XML tree is a misnomer, because XML trees need to distinguish between PCDATA, processing instructions, and others besides elements and text nodes. Calling it “subset XML tree” makes little sense but sounds like a cumbersome name.

My original intention was to gain an overview over all possible structures to add them to typho. Implementing all of them systematically is not a goal of mine anymore, since variants of additional technical attributes (mutable, with position information, …) makes variations explode exponentially. Instead I want to focus on valuable variants.

One inconsistency in this naming scheme is that “structure tree” is the simplest name for single-colored trees and “markup tree” is the simplest name for colored trees. But whereas a “markup tree” is has attributes, a “structure tree” has none. In that sense, the simple “structure tree” is no the same level like “attributeless markup tree”; which has a non-simple name. I chose pragmatism over consistency here, because the most common trees shall have the simplest names.

Furthermore, I consider node attributes today as crazy invention.

Representation in Haskell

Structure tree

data Node = String [Node]

Attributeless markup tree

data Node = Element String [Node] | Text String

Structure tree with attributes

data Node = String (Map String String) [Node]

Markup tree

data Node = Element String (Map String String) [Node] | Text String

Structure tree with node attributes

data Node = String (Map String Node) [Node]

Attribute-node markup tree

data Node = Element String (Map String Node) [Node] | Text String

Representation in rust

use std::collections::HashMap;

// structure trees
pub struct StructureTree {
    element: String,
    children: Vec<Self>,
}

pub struct StructureTreeWithAttributes {
    element: String,
    attributes: HashMap<String, String>,
    children: Vec<Self>,
}

pub struct StructureTreeWithNodeAttributes {
    element: String,
    attributes: HashMap<String, Box<Self>>,
    children: Vec<Self>,
}

// markup trees
pub enum AttributelessMarkupTree {
    Element(AttributelessMarkupElementNode),
    Text(MarkupTextNode),
}
pub struct AttributelessMarkupElementNode {
    element: String,
    children: Vec<AttributelessMarkupTree>,
}
pub struct MarkupTextNode {
    content: String,
}

pub enum MarkupTree {
    Element(MarkupElementNode),
    Text(MarkupTextNode),
}
pub struct MarkupElementNode {
    element: String,
    attributes: HashMap<String, String>,
    children: Vec<MarkupTree>,
}

pub enum NodeAttributedMarkupTree {
    Element(NodeAttributedMarkupElementNode),
    Text(MarkupTextNode),
}
pub struct NodeAttributedMarkupElementNode {
    element: String,
    attributes: HashMap<String, NodeAttributedMarkupTree>,
    children: Vec<NodeAttributedMarkupTree>,
}

Size analysis for rust

The Layout type of rust allows us to determine size and alignment of each type. I want to contribute this in the next table. For repr-declarations, please refer to the rust book.

type default rust repr #[repr( C)]

StructureTree

size = 48, alignment = 8

size = 48, alignment = 8

StructureTreeWithAttributes

size = 96, alignment = 8

size = 96, alignment = 8

StructureTreeWithNodeAttributes

size = 96, alignment = 8

size = 96, alignment = 8

AttributelessMarkupTree

size = 48, alignment = 8

size = 56, alignment = 8

MarkupTree

size = 96, alignment = 8

size = 104, alignment = 8

NodeAttributedMarkupTree

size = 96, alignment = 8

size = 104, alignment = 8

The data was generated with rustc version 1.77 (2024-03-17) on amd64.

One can see that the decision, whether to include attributes, results in a major difference in struct size.

The subset XML tree as special case

Speaking of XML trees, our markup tree is actually a generalization of those subset XML trees. Specifically, in XML no two text nodes can occur next to each other. Consider the following example:

<?xml version="1.1" encoding="utf-8"?>
<root>
  <sub1/>
  hello
  <sub2/>
  <sub3/>
  world,
  tree!
</root>

In this case, <root> has …

  • element <sub1/>

  • text hello

  • element <sub2/>

  • element <sub3/>

  • text world,\ntree

… as children. As can be seen, two element nodes can occur next to each other. But there is no possibility that two text nodes occur next to each other. As such, one could consider a text node as inherent data of the previous node. Because the first element might be a text node without a previous element, one needs to introduce another inherent data field. These fields are commonly called text (first text node inside element) and tail (text node after element). So, one could define the subset XML tree in Haskell as:

import Data.Map
data Node = String (Map String String) String String [Node]

Here, the first item after the Map is text and the second item is tail. In this case, both representations would lead to a size of 144 bytes and alignment of 8 bytes. I rejected this alternative, because it consumes more memory per node (but requires less resolution of memory addresses) and is IMHO less comprehensible. Furthermore it would lock into to the paradigm to disallow two consecutive text nodes, which seems asymmetric.

Conclusion

I revisited documents as tree in a systematic way again and discovered the memory consumption of the designs. The new names seem more pragmatic to me. I will now consider these data structures as fundamental basis for typho.