typho log: week #3

✍️ Written on 2022-01-27 in 1306 words. Part of: cs, software-development, digital-typesetting

Motivation

In week 3, I continued my efforts on project typho. This third quick summary comes a few days late.

Goals and achievements

  • I looked into mathematical notation, but quickly got lost. The complexity of MathML, for example, is not infinite, but it is high.

  • I read the paper “A System for Typesetting Mathematics” by Brian W. Kernighan and Lorinda L. Cherry (1975). It introduces the math notation for eqn/troff.

  • I read the paper “Scribble: closing the book on ad hoc documentation tools” by Flatt, Barzilay, and Findler (2009), as suggested by Lexis.

  • I looked into Teχ category codes and wondered which markup-language subset can be extracted from LaTeχ when neglecting macro expansion. I implemented a lexer/parser.

  • I tried to download the source files from arxiv.org. I could not believe how cumbersome this is (AWS requires a credit card which must allow a $1 verification transaction).

Note on Scribble

  • My paper review reads as follows:

    Excellent tool which expands some existing concepts from Skribe (2005). Fundamentally, the prefix “#lang scribble/doc” dispatches parsing based on the module loaded. In particular, S-expressions are considered as building blocks for the language defining the typesetting elements. The primary advantage is that it supports the “program is data” paradigm in the typesetting context. The disadvantage is the tight coupling between the PLT Scheme infrastructure and the document.

  • Lexis even writes her weblog in Scribble, as pointed out in my week #1 log.

Note on Teχ as a markup language

Teχ is not a markup language. The moment you start reading Teχ, you need to execute control sequences which might change the syntax. A trivial example is the following:

\catcode`\!=0
!bye

A plain Teχ document is expected to end with “\bye”, but here we change the syntax to use “!” instead of “\” to introduce control sequences. For those new to Teχ: catcodes (category codes) are the basic mechanism behind such syntax changes. In inittex, for example, “\” has category code 0 (escape character), “{” has 1 (begin group), “}” has 2 (end group), “$” has 3 (math shift), “&” has 4 (alignment tab), and “#” has 6 (macro parameter).

However, I want to neglect this mechanism, because I don’t want to read the package sources, and most users don’t use the macro expansion machinery much in their own source code. Simply put, I make three assumptions:

  • I assume the category codes of inittex and treat them as immutable

  • Macro expansion does not support pattern matching

  • We know in advance how many tokens/groups and optional groups a control sequence consumes (see the sketch after this list)
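
To make the last assumption concrete, here is a minimal sketch of a control-sequence arity table. The names and arities are my own illustrative assumptions, not actual typho code:

use std::collections::HashMap;

// Arity of a control sequence: how many mandatory groups it consumes
// and whether an optional “[…]” group may precede them.
// (Hypothetical illustration, not actual typho code.)
struct Arity {
  mandatory: u8,
  takes_optional: bool,
}

fn arity_table() -> HashMap<&'static str, Arity> {
  HashMap::from([
    // \usepackage[options]{package}: one optional, one mandatory group
    ("usepackage", Arity { mandatory: 1, takes_optional: true }),
    // \mathbf{…}: exactly one mandatory group
    ("mathbf", Arity { mandatory: 1, takes_optional: false }),
    // \bye: consumes nothing
    ("bye", Arity { mandatory: 0, takes_optional: false }),
  ])
}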

Can we build a reasonable lexer and parser for Teχ under these assumptions? I am talking about a traditional lexer/parser design: the lexer generates tokens which are consumed by the parser, which builds the data structure we are looking for. I implemented such a lexer and parser, but got one difficult design decision wrong. Consider the following design:

#[derive(Clone, Debug)]
pub enum Token {
  Plain(String),    // any consecutive text without characters of special semantics
  CSeq(String),     // in inittex a control sequence like “\mathbf”
  Arg(String),      // a group following a control sequence, to be consumed by the control sequence
  LetterArg(char),  // if a macro requires an argument and Other-category characters follow, the first character is a LetterArg
  OptArg(String),   // represented as “[key1=value1,key2=value2]” in the keyval package (where String excludes the brackets)
  BeginGroup,       // in inittex “{”
  EndGroup,         // in inittex “}”
  ParSeparator,     // an empty line
  Align,            // the alignment character “&” occurred in the source code
  Parameter(u8),    // parameters occurred in the source code, e.g. Parameter(1) matches “#1”
  InlineMathBegin,  // starts inline math mode, e.g. “$”
  InlineMathEnd,    // ends inline math mode, e.g. “$”
  DisplayMathBegin, // starts display math mode, e.g. “\[” or “$$”
  DisplayMathEnd,   // ends display math mode, e.g. “\]” or “$$”
}

The issue is that “\usepackage[utf8]{inputenc}” is supposed to be parsed as CSeq("usepackage") OptArg("utf8") Arg("inputenc"). However, it is very common that Arg instances contain groups themselves that need to be interpreted. What I did wrong here is that I applied the knowledge about the number of arguments per control sequence already in the lexer, and also did not recognize that Arg is itself a recursive structure. These decisions led to a design which does not provide a proper interface between lexer and parser.
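
To illustrate the recursion problem with a nested argument (the input is my own example, using the Token enum above):

fn example() -> Vec<Token> {
  // Input: \section{The \emph{good} parts}
  vec![
    Token::CSeq("section".into()),
    // the inner “\emph{good}” is buried inside a plain string and
    // would have to be lexed a second time by the parser
    Token::Arg("The \\emph{good} parts".into()),
  ]
}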

What is more desirable is a design where the characters {“{”, “}”, “[”, “]”} are tokens. The resolution of hierarchical structures is then a task of the parser. In the case of curly braces, I had already introduced them as tokens; I just failed to apply the idea consistently.

#[derive(Clone, Debug)]
pub enum Token {
  Plain(String),    // any consecutive text without characters of special semantics
  CSeq(String),     // in inittex a control sequence like “\mathbf”
  BeginGroup,       // in inittex “{”
  EndGroup,         // in inittex “}”
  BeginOptGroup,    // “[” following a control sequence, a typical design due to the keyval package
  EndOptGroup,      // “]” ending an optional argument
  ParSeparator,     // an empty line
  Align,            // the alignment character “&” occurred in the source code
  Parameter(u8),    // parameters occurred in the source code, e.g. Parameter(1) matches “#1”
  InlineMathBegin,  // starts inline math mode, e.g. “$”
  InlineMathEnd,    // ends inline math mode, e.g. “$”
  DisplayMathBegin, // starts display math mode, e.g. “\[” or “$$”
  DisplayMathEnd,   // ends display math mode, e.g. “\]” or “$$”
}

As a result, I need to adapt the parser in another development session.
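
How could that adaptation look? A minimal sketch, assuming the revised Token enum above (Node and parse_nodes are my own hypothetical names, not typho’s actual code): group resolution becomes a plain recursion over the delimiter tokens.

// A hypothetical tree structure for illustration; not typho’s actual parser.
#[derive(Debug)]
pub enum Node {
  Plain(String),
  CSeq(String),
  Group(Vec<Node>),    // “{…}”, resolved by the parser
  OptGroup(Vec<Node>), // “[…]”, resolved by the parser
}

// Build nodes from `tokens` starting at `*pos`; an EndGroup/EndOptGroup
// token (or the end of input) closes the current group.
fn parse_nodes(tokens: &[Token], pos: &mut usize) -> Vec<Node> {
  let mut nodes = Vec::new();
  while *pos < tokens.len() {
    let tok = tokens[*pos].clone();
    *pos += 1;
    match tok {
      Token::Plain(s) => nodes.push(Node::Plain(s)),
      Token::CSeq(s) => nodes.push(Node::CSeq(s)),
      Token::BeginGroup => nodes.push(Node::Group(parse_nodes(tokens, pos))),
      Token::BeginOptGroup => nodes.push(Node::OptGroup(parse_nodes(tokens, pos))),
      // this simplified sketch does not check that “{” is closed by “}”
      // and “[” by “]”; a real parser would report such mismatches
      Token::EndGroup | Token::EndOptGroup => return nodes,
      _ => {} // math mode, alignment, parameters … omitted here
    }
  }
  nodes
}

Since the lexer now emits only flat delimiter tokens, nesting is resolved entirely in the parser, which restores the clean interface between the two stages. Proper error reporting for mismatched delimiters is exactly where a parser generator would help.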

Conclusion

I had quite some fun implementing the Teχ parser. However, more effort is required to advance the markup language and math typesetting topics. Furthermore, I am missing parser generators that make error reporting easier to implement (I’d favor a PEG parser for this).