Taking a look at the XymosTeX project by Emily Eisenberg

✍️ Written on 2023-06-23 in 1510 words.
Part of software-development digital-typesetting Teχ

Motivation

Emily Eisenberg is a software developer who re-implemented Teχ in Rust. This was exactly the goal of my project t-rex in 2017. I read the original source code section by section and tried to find a Rust equivalent for the Pascal code. This did not work at all, and I gave up after about 200 sections (in the end, the code did not compile). Unlike my single attempt, XymosTeX is Emily’s fourth attempt. XymosTeX is “A re-implementation of TeX in Rust to help me understand how it works and to eventually provide a debugging interface”. Emily was far more successful, so let us take a look.

The GitHub project and compilation

XymosTeX was publicly developed between 2019 and the end of 2022 (I am using commit 4c2211cfa3e170f) under the MIT license and shows some basic working examples (with DVI files as output). Let us compile it and reproduce the examples. First, the XymosTeX implementation depends on the kpathsea project by Deyan Ginev (i.e. bindings to the kpathsea C library, which provides path search for TeX) and once_cell (providing cell-like types). So there are few dependencies (no WASM support though), and with a working Rust toolchain the build is trivial anyway:

sudo apt install libkpathsea-dev
cargo build --release

XymosTeX generates three executables: interpret, print_dvi, and xymostex. The interpret executable takes a DVI file and returns

A first example

We slightly adjust the first example shown on the GitHub page.

% $ cargo run --release
% paste the following content to stdin and terminate with Ctrl+d :

\def\domain{digital typesetting}
\def\entity#1{#1}
\entity{We} are {\domain}!
\end

% convert the DVI file to a PDF file:
%   dvipdf texput.dvi    # generates texput.pdf
screenshot of the rendered PDF output

However, if we replace \end with \bye, we get an error:

% $ cargo run --release
% paste the following content to stdin and terminate with Ctrl+d :

\def\domain{digital typesetting}
\def\entity#1{#1}
\entity{We} are {\domain}!
\bye

% yields:
%   thread 'main' panicked at 'unimplemented!', src/parser/horizontal_list.rs:191:21
%   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Is this valid Teχ though?

% $ cat texput.tex
\def\domain{digital typesetting}
\def\entity#1{#1}
\entity{We} are {\domain}!
\bye

% $ tex texput.tex && dvipdf texput.dvi
% This is TeX, Version 3.141592653 (TeX Live 2022/dev/Debian) (preloaded format=tex)
% (./texput.tex [1] )
% Output written on texput.dvi (1 page, 248 bytes).
% Transcript written on texput.log.
Screenshot of the rendered PDF output

Yes, indeed. So we can conclude:

  • XymosTeX implements macros with and without arguments.

  • XymosTeX properly supplies groups as arguments to macros.

  • XymosTeX is not feature-complete (as the GitHub page immediately points out).
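
To make the first two points concrete, here is a minimal sketch of how a macro with a parameter like \entity#1{#1} could be expanded. The names Piece and expand are hypothetical and are not taken from XymosTeX. For brevity the sketch substitutes strings, whereas a real implementation works on tokens.

```rust
/// One element of a macro's replacement text (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
enum Piece {
    Text(String),     // literal replacement text
    Parameter(usize), // `#1` is Parameter(1)
}

/// Substitutes the supplied arguments into the replacement text.
/// Panics if a parameter number has no matching argument.
fn expand(replacement: &[Piece], args: &[&str]) -> String {
    replacement
        .iter()
        .map(|piece| match piece {
            Piece::Text(text) => text.clone(),
            Piece::Parameter(n) => args[*n - 1].to_string(),
        })
        .collect()
}

fn main() {
    // \def\domain{digital typesetting}: no parameters.
    let domain = vec![Piece::Text("digital typesetting".to_string())];
    // \def\entity#1{#1}: the replacement is just the first argument.
    let entity = vec![Piece::Parameter(1)];

    println!("{} are {}!", expand(&entity, &["We"]), expand(&domain, &[]));
}
```

The group {We} is handed over as one argument, which is why the braces do not show up in the output.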

The primes example

$ cargo run --release < examples/primes.tex && dvipdf texput.dvi

The provided primes computation example works just fine:

Screenshot of the rendered PDF output of the primes example

However, if you feed XymosTeX the original primes implementation by Donald E. Knuth, you will get an error:

$ cargo run --release < primes.tex
thread 'main' panicked at 'unimplemented', src/parser/vertical_list.rs:189:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Trying out catcodes

Category codes are the Teχ feature that makes it impossible to lex a document independently of executing it. Do they work with XymosTeX?

If you try to run this obfuscated plain Teχ example, you run into the error thread 'main' panicked at 'Invalid token found while looking for control sequence: Char('~', Other)', src/parser/assignment.rs:112:22:

\let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPA;;FP
AYYFvePAJJ7172F72e71PAEE71"72F72i71PAGG71Fo71dPAWW71;FPADDF
PA**7172F727171PAKK7172F72r71PAqq71.F71Kse71PAIIFoPAXXFKdiP
AQQFjbigskipDOPAzzFhPAHHFDPATT7172F72a71PAZZFTDDPAUU71,72MF
jpar71ing;jifx:72jelseU72MjfiPABB71W72;73,74:Fjif.74.jelseB
74:jfiQn tJ;z7172tz; TydDIfDCEzs;tTsm;DmWa;y "KKJtDulIY TYg
tI J;mU7173,74:MPB tJlwWf;Wq;Yq K*dmu.,eJYlnW;q Ep"p.,JntW;
lKsGZlTpe,En"nW;eDTJlsE "dTndc,Egz"eW;t Emd"TsZElk"m,JYsnW;
sTnwWo;sZs*mE"w,Ex"sW; Jg*JZsTyl,E"fWf;Y gGlEDng"KsW,fIurW;
TlcEngD"lbXsW,tzWXW;K*J JfKncz JnzsW,WJcsGnW;tWace;wI tKtuJ
DldIYsW,WKsE"ftW;aHAHHFndZPKpTEt"KdJEgn"DZpJTKDtK*J.W,:jbye

But let us start with a simpler example:

\catcode`\!=0
!def!domain{digital typesetting}
!def!entity#1{#1}
!entity{We} are {!domain}!
!bye

Here, I redefine ! as the escape character that starts a control sequence (usually \), but I also run into an error: thread 'main' panicked at 'unimplemented', src/parser/vertical_list.rs:189:21. In the end, I am not sure whether any category code examples work.
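
To illustrate why category codes tie lexing to the current state, here is a minimal sketch of a lexer that consults a category map for every character. It is not XymosTeX's lexer: the names Category, default_catcodes, and lex are my own, only three of the sixteen category codes are modelled, and details such as skipping spaces after a control word are left out.

```rust
use std::collections::HashMap;

/// A small subset of Teχ's category codes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Category {
    Escape, // catcode 0, normally `\`
    Letter, // catcode 11
    Other,  // catcode 12
}

#[derive(Debug, PartialEq)]
enum Token {
    ControlSequence(String),
    Char(char, Category),
}

/// Only the backslash gets an explicit entry; the rest falls back in `lex`.
fn default_catcodes() -> HashMap<char, Category> {
    let mut map = HashMap::new();
    map.insert('\\', Category::Escape);
    map
}

/// Hypothetical lexer: the same input yields different tokens
/// depending on the category map that is passed in.
fn lex(input: &str, catcodes: &HashMap<char, Category>) -> Vec<Token> {
    let cat = |c: char| match catcodes.get(&c) {
        Some(&category) => category,
        None if c.is_ascii_alphabetic() => Category::Letter,
        None => Category::Other,
    };
    let chars: Vec<char> = input.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        i += 1;
        if cat(c) != Category::Escape {
            tokens.push(Token::Char(c, cat(c)));
            continue;
        }
        let mut name = String::new();
        if i < chars.len() && cat(chars[i]) != Category::Letter {
            // A single non-letter after the escape character forms the name.
            name.push(chars[i]);
            i += 1;
        } else {
            // Otherwise the name is the run of letters that follows.
            while i < chars.len() && cat(chars[i]) == Category::Letter {
                name.push(chars[i]);
                i += 1;
            }
        }
        tokens.push(Token::ControlSequence(name));
    }
    tokens
}

fn main() {
    let mut catcodes = default_catcodes();
    // With the default codes, `!def` is four character tokens.
    println!("{:?}", lex("!def", &catcodes));
    // After the equivalent of \catcode`\!=0 it is one control sequence.
    catcodes.insert('!', Category::Escape);
    println!("{:?}", lex("!def", &catcodes));
}
```

Since a \catcode assignment can occur anywhere in the input, the map can change between two lines of the same file, so the lexer cannot run ahead of the interpreter.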

Mathematical typesetting

$\frac{a^2} x$
\end

Sadly, this leads to an error thread 'main' panicked at 'unimplemented', src/parser/math_list.rs:597:26. However, the following example works:

${a^2}^3$
\end
Screenshot of the rendered PDF output of the power example

A technical look at XymosTeX

  • There are 19 modules in this crate (e.g. glue, lexer, and token)

  • ALL_PRIMITIVES is the set of all implemented Teχ primitives.

  • Token is either a ControlSequence or a Char.

  • A TokenDefinition is either a Macro, Token, MathCode, Primitive, or Font.

  • Other Teχ primitives are boxes. HorizontalBox and VerticalBox are characterized by (width, height, depth), a list of elements, as well as a glue set ratio. HorizontalListElem is either a Char, HSkip, or Box. VerticalListElem is either a Box or a VSkip.

  • The Teχ primitive Glue contains a space, stretch, and shrink dimension.

  • Typographic units are points, pica, inch, and others. They are abstracted as scaled points.

  • Teχ works in the context of stacked scopes. Definitions can be local within the current scopes and will be discarded when leaving the scope. This is represented in the definition of Teχ’s state:

    • A TeXStateStack is a RefCell containing this stack which in turn is a vector of TeXStateInner. TeXStateInner is the data stored per scope:

      • A category map that maps each character to its category code, and the same for math category codes.

      • A token definition map maps the Token to a TokenDefinition.

      • The registers are 256 elements of type i32. Teχ’s definition excludes the value -2^31, unlike the typical hardware implementation.

      • Equivalently, Teχ has 256 box registers. XymosTeX, however, does not represent them as an array of 256 items but as a map from u8 to TexBox.

      • The current active Font is stored in the scope as well.

    • A map between the font and its font metrics

  • The Lexer contains the TeXState, a LexState (one of {BeginningLine, MiddleLine, SkippingBlanks}), source as Vec<Vec<char>>, the line number, and column number.

  • The definition of macros is a list of parameters and a list of replacements. These parameters and replacements are either Tokens or Parameters.

  • It defines two kinds of variables: IntegerVariable and DimenVariable.

  • The math implementation contains many definitions. Since it is likely incomplete, I will not recite its definition here (see also MathCall).

  • The kpathsea dependency comes into play when we look at the file lookup implementation.
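
The stacked scopes described in the list can be sketched as follows. This is a reduced model under my own assumptions, not XymosTeX's code: the names ScopeInner and StateStack are hypothetical, only integer registers are stored, and a lookup walks from the innermost scope outwards, which may differ from how XymosTeX resolves values.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Data stored per scope (hypothetical, reduced to the integer registers).
#[derive(Default)]
struct ScopeInner {
    count_registers: HashMap<u8, i32>,
}

/// A stack of scopes behind a RefCell, so it can be mutated through `&self`.
struct StateStack {
    scopes: RefCell<Vec<ScopeInner>>,
}

impl StateStack {
    fn new() -> Self {
        StateStack { scopes: RefCell::new(vec![ScopeInner::default()]) }
    }

    /// Entering a group `{` opens a new scope.
    fn push_scope(&self) {
        self.scopes.borrow_mut().push(ScopeInner::default());
    }

    /// Leaving a group `}` discards all local definitions.
    /// The outermost scope is never removed.
    fn pop_scope(&self) {
        let mut scopes = self.scopes.borrow_mut();
        if scopes.len() > 1 {
            scopes.pop();
        }
    }

    /// A local assignment only touches the innermost scope.
    fn set_count(&self, register: u8, value: i32) {
        let mut scopes = self.scopes.borrow_mut();
        scopes
            .last_mut()
            .expect("there is always at least one scope")
            .count_registers
            .insert(register, value);
    }

    /// Looks from the innermost scope outwards; unset registers read as 0.
    fn get_count(&self, register: u8) -> i32 {
        self.scopes
            .borrow()
            .iter()
            .rev()
            .find_map(|scope| scope.count_registers.get(&register).copied())
            .unwrap_or(0)
    }
}

fn main() {
    let state = StateStack::new();
    state.set_count(0, 1); // \count0=1
    state.push_scope(); // {
    state.set_count(0, 5); //   \count0=5
    println!("inside the group: {}", state.get_count(0));
    state.pop_scope(); // }
    println!("after the group: {}", state.get_count(0));
}
```

The RefCell is what allows methods taking &self to modify the stack, at the price of borrow checks at run time instead of compile time.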

I noticed that is_hex_char only accepts lowercase hexadecimal letters. I wonder whether this is intended. Otherwise, the is_ascii_hexdigit method of Rust’s char could be used.
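
The difference between the two checks can be shown with a small sketch. The function is_lower_hex_char is my own stand-in written from the behaviour described above, not a copy of XymosTeX's is_hex_char. As far as I know, Teχ's ^^xx notation only accepts the lowercase digits 0-9 and a-f, so if is_hex_char serves that notation (an assumption on my side), the restriction may well be deliberate.

```rust
/// Lowercase-only check: accepts 0-9 and a-f, rejects A-F.
fn is_lower_hex_char(c: char) -> bool {
    matches!(c, '0'..='9' | 'a'..='f')
}

fn main() {
    for c in ['7', 'f', 'F', 'g'] {
        println!(
            "{}: lowercase-only = {}, is_ascii_hexdigit = {}",
            c,
            is_lower_hex_char(c),
            c.is_ascii_hexdigit()
        );
    }
}
```

The two checks only disagree on the uppercase letters A to F, so swapping in is_ascii_hexdigit would change which inputs are accepted.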

A tiny bug

Since the project does not take bug reports, I want to document a tiny bug I found here:

$ file texput.dvi
texput.dvi: TeX DVI file (ade by XymosTeX\213)

Seems like there is an off-by-one bug in the DVI generation here.
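
For context, a DVI file starts with a preamble that consists of the pre opcode (247), the format id (2), three 4-byte values (num, den, mag), a length byte k, and a comment of k bytes. The following sketch builds such a preamble. The function dvi_preamble is hypothetical, and the comment string "Made by XymosTeX" is my guess based on the output above, not taken from the source code.

```rust
/// Builds a DVI preamble: pre, i, num[4], den[4], mag[4], k, x[k].
fn dvi_preamble(comment: &str) -> Vec<u8> {
    let bytes = comment.as_bytes();
    assert!(bytes.len() <= 255, "the comment length must fit into one byte");

    let mut out = vec![247, 2]; // `pre` opcode and DVI format id
    out.extend_from_slice(&25_400_000u32.to_be_bytes()); // num
    out.extend_from_slice(&473_628_672u32.to_be_bytes()); // den
    out.extend_from_slice(&1000u32.to_be_bytes()); // mag
    out.push(bytes.len() as u8); // k, the comment length
    out.extend_from_slice(bytes); // x[k], the comment itself
    out
}

fn main() {
    // Assumed comment string, see the note above.
    let preamble = dvi_preamble("Made by XymosTeX");
    println!("length byte k at offset 14: {}", preamble[14]);
    println!("comment starts at offset 15 with {:?}", preamble[15] as char);
    println!("octal 213 is decimal {}", 0o213);
}
```

In this layout the comment starts at offset 15, so a reader that starts at offset 16 would print "ade by XymosTeX". The trailing \213 is octal for 139, which is the bop opcode that begins the first page, so it would be consistent with the reader running past the end of the comment. The sketch does not settle on which side the off-by-one lies.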

Conclusion

This is neat progress and a real achievement. Wonderful! The source code is very readable and documented, and even in this preliminary state the executable works fine for its limited feature set. And if you are interested in a small Teχ-related fun project by Emily, take a look at the blog post “Calculating Pi on a Business Card using TeX”.