Graph-theoretic considerations for 'text documents as trees'

✍️ → Written on 2022-12-18 in 876 words. Part of project typho digital-typesetting

Update 2023-01: Sorry, I published this post too early. It is still being updated.


The ‘texts documents are trees’ assumption is old. LISP does not hide the fact that all programs are hierarchical structures; hence trees. You can also see the importance of this notion from the Document Object Model (DOM) in HTML. XML trees is another keyword.

Today, I want to look at this topic from a graph theoretic point of view and establish three fundamental definitions for typesetting.

Graphic theoretic definitions

A Tree is a connected acyclic undirected graph. A rooted tree has one node called root (no additional contraints required). The hierarchical structure is defined by the distance to the root. A node is at hierarchical level N if the node has a distance

A picture showing a tree with three nodes an its hierarchy established by the distance from the root
Figure 1. Tree with its hierarchy over two levels

A tree is ordered, if the order of child nodes is significant (TAOCP, volume 1, page 308). This is a stronger notion than that of oriented trees. The following two trees are the same oriented tree, but different ordered trees:

A picture showing two trees with (for example) one tree showing root A with children B and C where the other tree shows root A with children C and B
Figure 2. Ordered trees differ in the sequential occurence of children

A colored tree is any tree with a color associated to each node. Often colors come with designated semantics (c.f. Red-black trees).

Given these basic definitions, we want to look at two types of text document syntaxes. They are representative, because the amount of information encoded in a text document is a desirable subset of these formats:

LISP and S-expressions

LISP is a set of programming languages based on the idea of S-expressions (e.g. Scheme, Common LISP, Clojure, but partially also protocols like IMAP). S-expression where originally defined as expressions of the form …

(x . y)

where x and y denote expressions themselves or any atom (an atom is an element from the infinite set of distinguishable atomic elements, commonly any alphanumeric identifier). For example (concat . 4 . (3 . NIL . (2 . NIL))) could denote the concatenation of the lists [4, 3] and [2]. But this notation of S-expressions is rather archaic. A more modern syntax would be (concat (list 4 3) (list 2)). Here the expressions are variadic, not necessarily binary w.r.t. arguments and the list command allows to skip NIL (= not-in-list) declarations. Semantics of commands like concat and list are outside the scope of S-expressions. S-expressions are a syntactic concepts which make the hierarchical structure very explicit.

Consider the following Scheme expressions:

(define square (lambda (x) (* x x)))
(square (square 2))

Then its expression tree can be visualized the following way:

XML trees

XML is another syntactic concept which makes its tree structure very explicit. XML notation is based on SGML, but uses a more consistent syntax to simplify error handling. The example above could be denoted as:

<?xml version="1.0"?>
  <command name="define">
    <command name="lambda">
      <command name="x"/>
      <command name="*">
  <command name="square">
    <command name="square">

Here, we see certain differences between XML and LISP. Unlike LISP, XML requires one top-level element. I introduced a root element <source/> to handle these . Furthermore the set of names for LISP commands and XML elements differ. I had to introduce <command/> elements because is not a permitted name for XML elements. For LISP, the set of commands is not defined but a dialect like Scheme indeed permits as command.