typho log: week #2

Written on 2022-01-16 in 695 words ✍️.
Part of cs software-development digital-typesetting

Motivation

In week 2, I continued my efforts on project typho. This is the second quick summary.

Goals and achievements

I proposed a draft for a command line interface for typho (unpublished).
I forked my digital identity. I am now tajpulo on Twitter and GitHub. My understanding is that tajpulo (IPA: tajp’ulo) means “typing person” in Esperanto (tajpi = to type, ulo = person). Why did I fork? Because social media is not really good at separating topics and languages, so I try to do the separation myself through different accounts. I also got my domain back, but it has no website yet.
I had the opportunity to discuss Orgdown with Karl. We shared our thoughts on digital typesetting. The central topic was the axiom of documents as a hierarchy. If you destructure a document into semantic elements, the question arises where you stop. Is a flow text just a string or does it destructure contained hashes? Is a math formula a string meant to contain Teχ syntax or is a math formula a semantic element containing semantic elements of symbolic notation? What we discussed are called “layers” in Orgdown.
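To make the “where do you stop” question concrete, here is a minimal Rust sketch of the two options for a math formula. All type and function names (`Formula`, `MathNode`, `element_count`) are hypothetical and invented for illustration; they are taken neither from typho nor from Orgdown.

```rust
/// Hypothetical node types for symbolic notation (not from typho or Orgdown).
#[derive(Debug, PartialEq)]
pub enum MathNode {
    Symbol(String),
    Superscript { base: Box<MathNode>, exponent: Box<MathNode> },
    Row(Vec<MathNode>),
}

/// The two answers to "where do you stop?" for a math formula.
#[derive(Debug, PartialEq)]
pub enum Formula {
    /// Stop early: the formula is an opaque string of TeX source.
    Opaque(String),
    /// Keep going: the formula is itself a tree of semantic elements.
    Structured(MathNode),
}

/// Count the semantic elements a consumer of the document can inspect.
pub fn element_count(formula: &Formula) -> usize {
    fn count(node: &MathNode) -> usize {
        match node {
            MathNode::Symbol(_) => 1,
            MathNode::Superscript { base, exponent } => 1 + count(base) + count(exponent),
            MathNode::Row(items) => 1 + items.iter().map(count).sum::<usize>(),
        }
    }
    match formula {
        // An opaque string is a single element; its inner structure is hidden.
        Formula::Opaque(_) => 1,
        Formula::Structured(root) => count(root),
    }
}
```

With this sketch, `Formula::Opaque("x^2".to_string())` exposes one element, whereas the structured variant of the same formula exposes the superscript node plus its base and exponent.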
I spent way too much time writing an S-expression parser in Rust. One can be found on rosettacode, but I started from scratch. I ran into borrow checking issues several times until I decided to maintain my parsing stack as usize indices and the tree itself as a mutable recursive data structure. In order to deal with memory management, the rosettacode one uses the typed_arena crate whereas mine uses only native data structures. Furthermore, mine propagates more data to provide better error messages. My S-expression definition looks as follows:
```
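/// A parsed S-expression.
#[derive(Debug, PartialEq)]
pub enum Sexpr {
    /// A single token, stored as a string.
    Atom(String),
    /// A parenthesized sequence of nested expressions.
    List(Vec<Sexpr>),
}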
```
Do I care about LISP for digital typesetting? Not really. Let’s consider it an exercise in compiler construction for markup languages.
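The idea of a parsing stack made of usize indices can be sketched as follows. This is not my actual parser, only a minimal illustration under simplifying assumptions: atoms are plain whitespace-delimited tokens (no strings, quoting, or comments), errors are plain strings, and the names `open_list` and `parse` are made up for this example. The `Sexpr` definition is repeated so that the block is self-contained.

```rust
#[derive(Debug, PartialEq)]
pub enum Sexpr {
    Atom(String),
    List(Vec<Sexpr>),
}

/// Follow `path` (child indices) from the root down to the currently open list.
fn open_list<'a>(root: &'a mut Vec<Sexpr>, path: &[usize]) -> &'a mut Vec<Sexpr> {
    let mut current = root;
    for &i in path {
        current = match &mut current[i] {
            Sexpr::List(children) => children,
            Sexpr::Atom(_) => unreachable!("path entries only ever point at lists"),
        };
    }
    current
}

/// Parse all top-level expressions in `src`.
pub fn parse(src: &str) -> Result<Vec<Sexpr>, String> {
    let mut root: Vec<Sexpr> = Vec::new();
    // The parsing stack: indices only, so no reference into the tree is held.
    let mut path: Vec<usize> = Vec::new();
    let mut atom = String::new();

    for (offset, ch) in src.char_indices() {
        // Any delimiter ends a pending atom.
        if (ch == '(' || ch == ')' || ch.is_whitespace()) && !atom.is_empty() {
            open_list(&mut root, &path).push(Sexpr::Atom(std::mem::take(&mut atom)));
        }
        match ch {
            '(' => {
                let list = open_list(&mut root, &path);
                list.push(Sexpr::List(Vec::new()));
                let index = list.len() - 1;
                path.push(index);
            }
            ')' => {
                if path.pop().is_none() {
                    return Err(format!("unexpected ')' at byte offset {}", offset));
                }
            }
            c if c.is_whitespace() => {}
            c => atom.push(c),
        }
    }
    if !path.is_empty() {
        return Err(format!("{} unclosed '(' at end of input", path.len()));
    }
    if !atom.is_empty() {
        root.push(Sexpr::Atom(atom));
    }
    Ok(root)
}
```

Because the stack holds indices instead of references, the borrow checker only sees one short-lived mutable borrow of the tree per step. The price is that every insertion walks the tree again from the root, which is acceptable for a sketch but not for deeply nested input.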
I joined Jonathan Fine's discussion at “Teχ hour” about accessibility of Teχ. It made me reconsider how flawed any approach is that tries to recover the original meaning from the output. Consider macro-expanded Teχ instructions (which we could obviously retrieve for the majority of documents): how can we determine what the original message to convey was? By omitting positioning information, we lose track of word boundaries. But even with positioning information, we need to reconstruct characters with diacritics with huge effort (LaTeχ encodes them separately AFAIU). Nelson gave another simple example: If there is a trailing hyphen in the previous line … is it due to hyphenation or due to favorable line breaking after a hyphen? In the end, any PDF/UA is pretty bad and the webstack is much better than this since HTML5 is a proper markup language. We furthermore discussed Unicode, χeTeχ, and SILE. Speaking of Unicode as the most primitive model, I also stumbled across this paper from 1999: “Application-independent representation of text for document processing—will Unicode suffice?”.
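Nelson's hyphen example can be illustrated with a small Rust sketch. It assumes the text has already been extracted as one string per line; the two function names are invented for this example. Each function encodes one guess about what a trailing hyphen means, and each guess is wrong for one of the two inputs.

```rust
/// Join two extracted lines, guessing that a trailing hyphen came from hyphenation.
fn join_dropping_hyphen(first: &str, second: &str) -> String {
    match first.strip_suffix('-') {
        Some(stem) => format!("{}{}", stem, second),
        None => format!("{} {}", first, second),
    }
}

/// Join two extracted lines, guessing that a trailing hyphen belongs to the word.
fn join_keeping_hyphen(first: &str, second: &str) -> String {
    if first.ends_with('-') {
        format!("{}{}", first, second)
    } else {
        format!("{} {}", first, second)
    }
}
```

For the lines “hyphen-” and “ation”, only the first guess recovers “hyphenation”; for “well-” and “known”, only the second recovers “well-known”. Nothing in the output alone tells us which guess applies.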
I took a look at the HarfBuzz API. The Rust binding even provides an example. The basics are trivial, but I need to take a deeper look at the data model in OpenType fonts. I also learned that someone wrote a Go binding 3 years ago.

I didn’t manage to propose a data model for mathematical typesetting. I also didn’t finish my notes on AsciiDoc(tor)'s data model. I need to spend more time on this. In fact, I don’t think I spent all 40 hours on typho.

Conclusion

typho week 2 was a bit disappointing. I still struggle with PQC topics too much. I need to get more things done.