On Description of Teχ

Written on 2021-12-09 in 1855 words ✍️.
Part of cs software-development digital-typesetting Teχ

Motivation

When Donald Knuth invented the Teχ typesetting engine, he specified its behavior in three major works (this is also discussed in a TeX.SE post):

“Preliminary preliminary description of TEX” May 1977
“Preliminary description of TEX” July 1977
“The TeXbook” 1986

The white papers can also be found in the book “Digital Typography”. In this article, I want to give a summary and retrospective view on which decisions were made and which yielded success.

Preliminary preliminary description

Please don’t expect to understand this mess now, TEX is really very simple; trust me.

— Don Knuth

In preliminary preliminary description of TEX, Don Knuth points out that …

Teχ is pronounced with ch like Loch Ness
“The input to TEX is a file in say TVeditor format at the Stanford AI lab”.
“While LISP works with one-dimensional character strings, TEX works with two-dimensional box patterns; TEX has both horizontal and vertical ‘cons’ operations.”
“7.Add infinitely many } symbols at the right.” […] “trailing }'s will match up with any {'s the user erroneously forgot to match in the external input file”
The example provided does not use any backslash to introduce a control sequence. Instead a line (e.g. line 9) just starts with special instructions like “vskip .5 cm plus 1 pc minus 5 pt”. “The construction \{string} is used to transmit a string without interpreting its words according to existing definitions”. The comment character “%” was already established.
Line 74 of the example shows mathematical notation: 74 X sub{n+1} = f(X sub n, X sub {n-1}) quad for quad n>0.. The notation by Kernighan & Cherry (“A System for Typesetting Mathematics”, 1975) was used instead of inventing the yet-to-come Teχ math notation.
“The second thing that should be emphasized is that it is written in an extension of TEX, not in the basic language itself.” Knuth already nudges the idea of an extensible macro system which defines control sequences not part of plain Teχ (“lower-level Teχ primitives”).
Don Knuth already established that sizes can be specified in points, picas, inches, centimeters, or millimeters.
The pattern matching capabilities of Teχ were conceptualized, but details were missing.
```
def quoteformat #1 author #2 { …
```
Don Knuth already established his novel glue concept using a fixed component x, a plus component y, and a minus component z including all tricks such as right-justification by a glue with a plus component of 10 meters. Computing the final glue configuration was done with a
“I have tentatively decided to replace italics by a slanted version of the normal ‘Roman’ typeface, for emphasized words in the text, while variable symbols in math formulas will still be in italics as usual. I will have to experiment with this, but my guess is that it will look a lot better, since typical italic fonts do not tie together into words very beautifully. At any rate I will be distinguishing slanted from italic, in case it becomes desirable to use two different fonts for these different purposes.”
` inserts a space, not ~
Horizontal tab was used for vertical alignment as it was originally anticipated. In the paper, it was denoted as ⭙.
“The ‘big(){…}’ in lines 113, 143, etc. and the ‘big/{…}’ in line 152 is used to choose suitably large versions of parentheses and slashes, forming ‘(…)’ and ‘/…’, respectively; the size of the symbols is determined by the height of the enclosed formula box.”
Don Knuth distinguished between block and inline math notation: “The summation signs produced by ‘sum over…’ in lines 131, 143, will be large, since these summations occur in displayed formulas; but between $…$ a smaller sign will be used and the quantity summed over will be attached as a subscript.”
Don Knuth sketched the tokenization process with a finite-state machine which will eventually turn out much more difficult.
Don Knuth thought about integrating an interpretive system to Teχ and a mini-programming language:

In the first implementation of TEX, routines like ACPpages will be written in SAIL code and loaded with TEX. I have had some notion of adding an interpretive system to TEX and a mini-programming language so that such extensions are easier to make without penetrating into TEX’s innards, but on second thought it seems much better to assume that TEX’s code will be sufficiently clean that it will best be extended using SAIL code. This will save the implementors considerable time and make the system run faster.

— Don Knuth
The badness concept as described as well: “The first idea is the concept of ‘badness.’ This is a number computed on the basis of the amount of stretching or shrinking necessary when setting the glue.”
Don Knuth mentions “where ε is a small tolerance to compensate for floating-point rounding” whereas later implementations only used integers to ensure cross-platform reproducible output.
“Footnotes:I have used footnotes only three times in over 2000 pages of The Art of Computer Programming, and personally I believe they should usually be avoided, so I am not planning an elaborate footnoting mechanism (e.g. to break long footnotes between pages or to mark the first footnote per page with an asterisk and the second with a dagger, etc.). They can otherwise be satisfactorily handled by attaching a footnote attribute to any line referring to a footnote, and by requiring the nextpage routine to put the footnote on the page containing that line.”
The concept of ligatures is introduced in the straight-forward way.
“OK to break here, but insert a times sign not a hyphen. The last of these would be used in a long product like $(n+1)*(n+2)\*(n+3)\*(n+4)$ .”
“There is no point in finding all possible places to hyphenate. For one thing, the problem is extremely difficult, since e.g. the word ‘record’ is supposed to be broken as ‘rec-ord’ when it is a noun but ‘re-cord’ when it is a verb.”

My remarks:

ASCII was published as ASA X3.4-1963 already in 1963, but not popular. 28 of the 128 code positions were left unassigned at that time. At the same time, symbols like ≤ were part of the input character set which is absent in ASCII.
The perspective of TEX as two-dimensional LISP is interesting and fundamental
Whereas I feel that many developers are annoyed by overly lenient standards, Teχ explicitly permits to skip any final group-ends.
Sizes were extended for big point (bp), didot point (dd), cicero (cc), scaled point (sp), nominal x-height (ex), as well nominal m-width (em). Only in math mode math units (mu) are available and only pdfTeχ/LuaLaTeχ support pixels (px). See Teχ.SE for details.
Don Knuth already looked for a unified solution to equally-sized delimiters on both ends. Denoted as ‘big(){…}’, eventually became ‘\left(…\right)’
The issue of line division (“Given the text of a paragraph and the set of all allowable places to break it between lines, find breakpoints that minimize the sum of the squares of the badnesses of the resulting lines.”) which heavily relates to badness was later solved with dynamic programming techniques (see chapter 2 “Mathematical Typography” and chapter 3 “Breaking Paragraphs Into Lines”).

Preliminary description

In this white paper, the following changes were made (I don’t mention code-level changes such as renamed control sequences):

The concept of “seven basic delimiters” (\ { } $ ⊗ % #) which will be expanded into category codes in later Teχ implementations. The possibility to redefine them, was already introduced later in the paper:

File ACPhdr begins with the sequence
\chcode'173←2. \chcode'176←3. \chcode'44←4.
\chcode'26←5.  \chcode'45←6.  \chcode'43←7.
\chcode'136←8. \chcode'1←9.
which, at Stanford, defines the characters {}$⊗%# to be the basic delimiters 2,3,4,5,6,7,8, and 9, respectively;

Backslash to introduce control sequences was established.
Kernighan & Cherry’s math notation was still used but the characters ↓ and ↑ now denote subscript and superscript respectively.
big(){…} becomes \\group(){…}
The following property for control sequences was specified:+

When the control sequence consists of \ and a single character, all printing characters are distinguished, but when the control sequence consists of \ and a letter string no distinction is made between upper and lower case letters, except on the very first letter of the sequence; thus, "\GAMMA" and "\Gamma" are considered identical, but they are not the same as "\gamma". Furthermore, letter sequences are considered different only if they differ in the first seven characters (six if TEX is implemented on a 32-bit machine) or if they have different lengths mod 8. For example, "\qUotefOmmmm" and "\quOTEfoxxxxxxxxxxxx" are both equivalent to "\quoteformat".

hyphen, en-dash, minus sign, and em-dash with -, --, - within $, and --- respectively was introduced.
Glues are now put in more descriptive relation to boxes.
The hyphenation algorithm now expanded a lot (examples, prefix rules, consonant rules).
A math formula is now parsed into one of the following modes:
- display mode
- text mode
- text mode with lower subscripts
- script mode
- scriptscript mode
The idea to break mathematical formulas into several lines now disappears: “Displayed formulas are never broken between lines by TEX; the user is supposed to figure out the psychologically best place to break them.”
The Computer Modern typeface was only mentioned once in the first paper. Now it got a more prominent role. This is e.g. represented by the following sentence “which ACPhdr has defined to be ‘Computer Modern Gothic Bold 20 point’, a font that I am currently designing”

Conclusion

It is very interesting to see the early days of Teχ’s design. I put the focus on the basic concepts of Teχ (category codes, badness, ligatures, hyphenation, …). A comparison with the Teχbook would be nice, but expands beyond the scope of a blog post. It is interesting from a modern perspective how designs where guided by technical optimizations rather than technical standards (as it would be necessary today).