# Graph-theoretic considerations for 'text documents as trees'

✍️ → Written on 2022-12-18 in 876 words. Part of project typho digital-typesetting

Update 2023-01: Sorry, I published this post too early. It is still being updated.

## Motivation

The ‘texts documents are trees’ assumption is old. LISP does not hide the fact that all programs are hierarchical structures; hence trees. You can also see the importance of this notion from the Document Object Model (DOM) in HTML. XML trees is another keyword.

Today, I want to look at this topic from a graph theoretic point of view and establish three fundamental definitions for typesetting.

## Graphic theoretic definitions

A Tree is a connected acyclic undirected graph. A rooted tree has one node called root (no additional contraints required). The hierarchical structure is defined by the distance to the root. A node is at hierarchical level N if the node has a distance Figure 1. Tree with its hierarchy over two levels

A tree is ordered, if the order of child nodes is significant (TAOCP, volume 1, page 308). This is a stronger notion than that of oriented trees. The following two trees are the same oriented tree, but different ordered trees: Figure 2. Ordered trees differ in the sequential occurence of children

A colored tree is any tree with a color associated to each node. Often colors come with designated semantics (c.f. Red-black trees).

Given these basic definitions, we want to look at two types of text document syntaxes. They are representative, because the amount of information encoded in a text document is a desirable subset of these formats:

## LISP and S-expressions

LISP is a set of programming languages based on the idea of S-expressions (e.g. Scheme, Common LISP, Clojure, but partially also protocols like IMAP). S-expression where originally defined as expressions of the form …

`(x . y)`

where x and y denote expressions themselves or any atom (an atom is an element from the infinite set of distinguishable atomic elements, commonly any alphanumeric identifier). For example `(concat . 4 . (3 . NIL . (2 . NIL)))` could denote the concatenation of the lists `[4, 3]` and ``. But this notation of S-expressions is rather archaic. A more modern syntax would be `(concat (list 4 3) (list 2))`. Here the expressions are variadic, not necessarily binary w.r.t. arguments and the list command allows to skip NIL (= not-in-list) declarations. Semantics of commands like `concat` and `list` are outside the scope of S-expressions. S-expressions are a syntactic concepts which make the hierarchical structure very explicit.

Consider the following Scheme expressions:

```(define square (lambda (x) (* x x)))
(square (square 2))```

Then its expression tree can be visualized the following way:

## XML trees

XML is another syntactic concept which makes its tree structure very explicit. XML notation is based on SGML, but uses a more consistent syntax to simplify error handling. The example above could be denoted as:

``````<?xml version="1.0"?>
<source>
<command name="define">
<atom>square</atom>
<command name="lambda">
<command name="x"/>
<command name="*">
<atom>x</atom>
<atom>x</atom>
</command>
</command>
</command>
<command name="square">
<command name="square">
<atom>2</atom>
</command>
</command>
</source>``````

Here, we see certain differences between XML and LISP. Unlike LISP, XML requires one top-level element. I introduced a root element `<source/>` to handle these . Furthermore the set of names for LISP commands and XML elements differ. I had to introduce `<command/>` elements because is not a permitted name for XML elements. For LISP, the set of commands is not defined but a dialect like Scheme indeed permits as command.

TODO