Graph-theoretic considerations for 'text documents as trees'

✍️ → Written on 2022-12-18 in 876 words. Part of project typho digital-typesetting

Update 2023-01: Sorry, I published this post too early. It is still being updated.

Motivation

The ‘texts documents are trees’ assumption is old. LISP does not hide the fact that all programs are hierarchical structures; hence trees. You can also see the importance of this notion from the Document Object Model (DOM) in HTML. XML trees is another keyword.

Today, I want to look at this topic from a graph theoretic point of view and establish three fundamental definitions for typesetting.

Graphic theoretic definitions

A Tree is a connected acyclic undirected graph. A rooted tree has one node called root (no additional contraints required). The hierarchical structure is defined by the distance to the root. A node is at hierarchical level N if the node has a distance

A picture showing a tree with three nodes an its hierarchy established by the distance from the root
Figure 1. Tree with its hierarchy over two levels

A tree is ordered, if the order of child nodes is significant (TAOCP, volume 1, page 308). This is a stronger notion than that of oriented trees. The following two trees are the same oriented tree, but different ordered trees:

A picture showing two trees with (for example) one tree showing root A with children B and C where the other tree shows root A with children C and B
Figure 2. Ordered trees differ in the sequential occurence of children

A colored tree is any tree with a color associated to each node. Often colors come with designated semantics (c.f. Red-black trees).

Given these basic definitions, we want to look at two types of text document syntaxes. They are representative, because the amount of information encoded in a text document is a desirable subset of these formats:

LISP and S-expressions

LISP is a set of programming languages based on the idea of S-expressions (e.g. Scheme, Common LISP, Clojure, but partially also protocols like IMAP). S-expression where originally defined as expressions of the form …

(x . y)

where x and y denote expressions themselves or any atom (an atom is an element from the infinite set of distinguishable atomic elements, commonly any alphanumeric identifier). For example (concat . 4 . (3 . NIL . (2 . NIL))) could denote the concatenation of the lists [4, 3] and [2]. But this notation of S-expressions is rather archaic. A more modern syntax would be (concat (list 4 3) (list 2)). Here the expressions are variadic, not necessarily binary w.r.t. arguments and the list command allows to skip NIL (= not-in-list) declarations. Semantics of commands like concat and list are outside the scope of S-expressions. S-expressions are a syntactic concepts which make the hierarchical structure very explicit.

Consider the following Scheme expressions:

(define square (lambda (x) (* x x)))
(square (square 2))

Then its expression tree can be visualized the following way:

XML trees

XML is another syntactic concept which makes its tree structure very explicit. XML notation is based on SGML, but uses a more consistent syntax to simplify error handling. The example above could be denoted as:

<?xml version="1.0"?>
<source>
  <command name="define">
    <atom>square</atom>
    <command name="lambda">
      <command name="x"/>
      <command name="*">
        <atom>x</atom>
        <atom>x</atom>
      </command>
    </command>
  </command>
  <command name="square">
    <command name="square">
      <atom>2</atom>
    </command>
  </command>
</source>

Here, we see certain differences between XML and LISP. Unlike LISP, XML requires one top-level element. I introduced a root element <source/> to handle these . Furthermore the set of names for LISP commands and XML elements differ. I had to introduce <command/> elements because is not a permitted name for XML elements. For LISP, the set of commands is not defined but a dialect like Scheme indeed permits as command.

Conclusion

TODO