syntok release

Written on 2024-12-19 in 2340 words ✍️.
Part of cs software-development digital-typesetting

Motivation

Assume you have a program to understand desired markup languages. Assume you have a typesetting engine. Assume you have components to generate desired output formats like PDF, EPUB, HTML5, and so on. By combining these tools, you get a program for your digital typesetting needs, right?

No, you will soon recognize that syntax highlighting is a crucial element of such systems. Programmers want their generated documents to feature syntax highlighting. Indeed, source code is often terrible to read without it. Can one easily distinguish types from identifiers? Can one identify substructures if the syntax does not require an off-side rule? One could certainly pull in one of the many syntax highlighting libraries, but isn’t this overkill and too strong a dependency?

As a result, I thought about a building block: a serialization format which encodes how syntax is tokenized.

Syntok

Let me introduce syntok: serialized tokenization of syntax.

Consider the following example C++ program:

#include <iostream>

int main() {
	std::cout << "hello " << ([](void){ return "world!"; })() << std::endl;
	return 0;
}

In my weblog, this source code appears colorized (also called “syntax highlighted”) to make it more readable. This is possible because a Ruby binding to pygments is used to generate a colorized HTML version of the code snippet. But wait… isn’t this annoying? I need a Ruby binding to run some Python software to generate HTML output?! Or in the case of tree-sitter, I need C and JavaScript to generate HTML or XML output. Would it not be nice to just take a file which encodes the individual tokens and let the software decide the remaining colorization parts? And if pygments and tree-sitter can emit these tokens, we can use them interchangeably.

Thus, instead of one tool covering the entire pipeline of reading some syntax and generating some specific output format, I want to split the pipeline up. One tool reads syntax and generates a syntok file. One tool reads the syntok file and generates the output format.

For the example above, the following file can be the corresponding syntok file:

<?xml version="1.0" encoding="utf-8"?>
<syntok xmlns="https://spec.typho.org/syntok/1.0/xml-schema">
  <item category="preprocessor-instruction" start="0" end="8">#include </item>
  <item category="system-library-ref" start="9" end="18">&lt;iostream&gt;</item>
  <item category="whitespace" start="19" end="20">

</item>
  <item category="type" start="21" end="23">int</item>
  <item category="whitespace" start="24" end="24"> </item>
  <item category="identifier" start="25" end="28">main</item>
  <item category="parameter-list" start="29" end="30">()</item>
  <item category="operator" start="31" end="34"> {
        </item>
  <item category="namespace" start="35" end="37">std</item>
  <item category="operator" start="38" end="39">::</item>
  <item category="identifier" start="40" end="43">cout</item>
  <item category="operator" start="44" end="48"> &lt;&lt; "</item>
  <item category="string" start="49" end="54">hello </item>
  <item category="operator" start="55" end="63">" &lt;&lt; ([](</item>
  <item category="type" start="64" end="67">void</item>
  <item category="operator" start="68" end="69">){</item>
  <item category="keyword" start="70" end="77"> return </item>
  <item category="string" start="78" end="85">"world!"</item>
  <item category="operator" start="86" end="95">; })() &lt;&lt; </item>
  <item category="namespace" start="96" end="98">std</item>
  <item category="operator" start="99" end="100">::</item>
  <item category="identifier" start="101" end="104">endl</item>
  <item category="operator" start="105" end="105">;</item>
  <item category="keyword" start="106" end="114">
        return </item>
  <item category="integer" start="115" end="115">0</item>
  <item category="operator" start="116" end="118">;
}</item>
</syntok>

A syntok file is an XML file (file extension .synt) which has a root element syntok and contains item elements for the individual tokens. The start and end attributes document the byte offsets of the token (both inclusive, as the single-space item with start 24 and end 24 shows) and, crucially, category assigns a category to the token. Now, I would like to point out three obvious points:

  • The set of categories (“data model”) can be selected by the tokenizer itself. For a short period of time in the beginning, I assumed I could contribute a data model for tokenization. But remember that many syntaxes do not have a “namespace” category and instead introduce arbitrary other syntactic elements (e.g. Python). It is impossible to contribute such a generic taxonomy. Instead, I have to rely upon a folksonomy.

  • In general, the category-to-syntax-highlighting-color association needs to be contributed externally. But of course, one can trivially just hash the category name and pick a color based on the hash (this is what I did in my example programs … and certainly it does not always lead to beautiful colorization!).

  • The quality of tokenization is allowed to vary. What about the final operator-categorized item? Why are ';' and '}' not split up by a whitespace-categorized item? Simply put, because for 99% of applications, whitespace won’t have a special style (e.g. a different background color). So the given quality suffices. And indeed, a better tokenizer hopefully splits them up to satisfy even more applications.
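The second point (hashing the category name to pick a color) can be sketched in a few lines of Python. This is a minimal sketch, not the example program mentioned above: the function names category_color and render_html are hypothetical, SHA-256 is just one possible hash, and the sketch assumes that only the text inside item elements counts as content.

```python
import hashlib
import html
import xml.etree.ElementTree as ET

# Namespace taken from the example syntok file above.
NS = "{https://spec.typho.org/syntok/1.0/xml-schema}"

def category_color(category: str) -> str:
    """Derive a stable hex color from a category name (hypothetical helper)."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    # The first three bytes of the digest serve as red, green and blue.
    return "#{:02x}{:02x}{:02x}".format(digest[0], digest[1], digest[2])

def render_html(syntok_xml: str) -> str:
    """Turn a syntok document into a <pre> block with one <span> per item."""
    root = ET.fromstring(syntok_xml)
    spans = []
    for item in root.iter(NS + "item"):
        category = item.get("category", "")
        text = item.text or ""
        spans.append('<span class="{}" style="color: {}">{}</span>'.format(
            html.escape(category), category_color(category), html.escape(text)))
    return "<pre>" + "".join(spans) + "</pre>"

EXAMPLE = """<syntok xmlns="https://spec.typho.org/syntok/1.0/xml-schema">
  <item category="type" start="0" end="2">int</item>
  <item category="whitespace" start="3" end="3"> </item>
  <item category="identifier" start="4" end="7">main</item>
</syntok>"""

print(render_html(EXAMPLE))
```

The same category always maps to the same color, so no external color table is needed. The price is that nothing guarantees good contrast or pleasant combinations.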

One of the immediate advantages is that a tool can generate the syntok file, and the user can then intervene and adjust the tokenization (or add additional markup) before it gets rendered. This solves a common difficulty I experienced in the LaTeχ package world. If my source code deviates slightly from what the highlighter expects (this often happens with ASM, happened when Python highlighting did not yet support Python 3, SQL versus PL/SQL, …), the software will irrevocably render erroneous syntax.

One of the requirements is that the entire file is tokenized. So the start and end attributes provide a partition[1]. Recognize that the representation is linear and flat. It does not represent the hierarchical structure often found in markup languages and programming languages.
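The partition requirement is easy to check mechanically. The following is a minimal sketch, not the verification script that ships with the specification: partition_errors is a hypothetical name, and the sketch assumes that items appear in document order, that end is inclusive, and that the caller knows the size of the original file in bytes.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example syntok file above.
NS = "{https://spec.typho.org/syntok/1.0/xml-schema}"

def partition_errors(syntok_xml: str, total_bytes: int) -> list:
    """Return a list of problems; an empty list means the items partition the file."""
    errors = []
    expected_start = 0
    for item in ET.fromstring(syntok_xml).iter(NS + "item"):
        start, end = int(item.get("start")), int(item.get("end"))
        if start != expected_start:
            # A gap or an overlap relative to the previous item.
            errors.append("item at {}..{}: expected start {}".format(
                start, end, expected_start))
        if end < start:
            errors.append("item at {}..{}: end precedes start".format(start, end))
        # With inclusive ends, the next item begins one byte after this one.
        expected_start = end + 1
    if expected_start != total_bytes:
        errors.append("items cover {} bytes, file has {}".format(
            expected_start, total_bytes))
    return errors

BROKEN = """<syntok xmlns="https://spec.typho.org/syntok/1.0/xml-schema">
  <item category="integer" start="0" end="40">0</item>
  <item category="operator" start="1" end="1">;</item>
</syntok>"""

for problem in partition_errors(BROKEN, 2):
    print(problem)
```

A mistyped offset, as in the BROKEN document, shows up as a start value that does not continue where the previous item ended.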

The specification

The specification document was written in AsciiDoc and is readable in this git repository:

Furthermore, it comes with a bunch of tools I wrote while using the standard in production. Most importantly:

Furthermore:

  • An XSD file to verify some properties of a syntok file

  • A Python script to verify the remaining properties of a syntok file

  • A Python script to generate a syntok template by Unicode categories

  • A Python script taking a tree-sitter dump and the original file to generate the syntok file
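To give an idea of what a template by Unicode categories could look like, here is a minimal sketch. It is not the script from the repository, and the function name unicode_template is hypothetical. It assumes that a template groups consecutive characters sharing the same Unicode general category into one item, with inclusive UTF-8 byte offsets and non-empty input.

```python
import unicodedata
from itertools import groupby

def unicode_template(source: str) -> list:
    """Group consecutive characters by Unicode general category.

    Returns (category, start, end, text) tuples with inclusive
    UTF-8 byte offsets, ready to be written as item elements.
    """
    items = []
    offset = 0
    for category, chars in groupby(source, key=unicodedata.category):
        text = "".join(chars)
        # Offsets count bytes, not characters, so encode before measuring.
        size = len(text.encode("utf-8"))
        items.append((category, offset, offset + size - 1, text))
        offset += size
    return items

for row in unicode_template("int main() {"):
    print(row)
```

Such a template only knows about letters, punctuation and separators, so a user would still rename the categories by hand. It does give a complete partition to start from.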

F.A.Q.

Why XML?

I think JSON and XML have the broadest support as data serialization formats to be written and read. YAML never gained sufficient traction (without reciting the reasons here). I looked at XML and JSON and recognized that writing XML is much simpler because of its simple escaping rules. Recognize that the user-provided content can be arbitrary (even binary) and in these cases, I would not dare to write my own JSON writer in C or assembly, but I would do so for XML (in fact, I did back at university).
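To illustrate how little is needed on the writing side, here is a minimal sketch of a syntok writer without any XML library. The names escape_xml and write_syntok are hypothetical. The sketch assumes that category names are plain identifiers which need no escaping, that tokens are non-empty, and it does not address control characters or carriage returns, which need extra care in XML 1.0.

```python
def escape_xml(text: str) -> str:
    """Escape &, < and > for use in XML element content."""
    # The ampersand must be replaced first, otherwise the entities
    # produced by the other replacements would be escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def write_syntok(items) -> str:
    """Serialize (category, text) pairs into a syntok document.

    Offsets are inclusive UTF-8 byte positions, computed on the fly.
    Assumes every token text is non-empty.
    """
    lines = ['<?xml version="1.0" encoding="utf-8"?>',
             '<syntok xmlns="https://spec.typho.org/syntok/1.0/xml-schema">']
    offset = 0
    for category, text in items:
        size = len(text.encode("utf-8"))
        lines.append('  <item category="{}" start="{}" end="{}">{}</item>'.format(
            category, offset, offset + size - 1, escape_xml(text)))
        offset += size
    lines.append("</syntok>")
    return "\n".join(lines) + "\n"

print(write_syntok([
    ("preprocessor-instruction", "#include "),
    ("system-library-ref", "<iostream>"),
]))
```

Three string replacements cover the escaping, which is simple enough to port to C or assembly.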

Why didn’t you allow both formats?

There was one unpublished version specifying JSON as well as XML serialization. In the end, I felt like this fragments the topic unnecessarily and makes it difficult for tooling providers.

Why document start/end?

Since people often ignore that XML is whitespace-sensitive, I think it can easily happen that someone introduces content accidentally. Then the original content gets lost. When someone wants to debug this situation, having the start/end attributes helps a lot. I admit, it makes it more difficult to adjust a syntok file manually.
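Such a debugging session could start with a check like the following. It is a minimal sketch: length_mismatches is a hypothetical name, and it assumes that end is inclusive and that offsets count UTF-8 bytes.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the example syntok file above.
NS = "{https://spec.typho.org/syntok/1.0/xml-schema}"

def length_mismatches(syntok_xml: str) -> list:
    """List items whose text length disagrees with their start/end attributes.

    Each entry is (start, end, declared_bytes, actual_bytes).
    """
    mismatches = []
    for item in ET.fromstring(syntok_xml).iter(NS + "item"):
        start, end = int(item.get("start")), int(item.get("end"))
        actual = len((item.text or "").encode("utf-8"))
        declared = end - start + 1
        if actual != declared:
            mismatches.append((start, end, declared, actual))
    return mismatches

# An editor "pretty-printed" the first item and thereby added whitespace.
EDITED = (
    '<syntok xmlns="https://spec.typho.org/syntok/1.0/xml-schema">\n'
    '  <item category="type" start="0" end="2">\n    int\n  </item>\n'
    '  <item category="whitespace" start="3" end="3"> </item>\n'
    '</syntok>'
)

for start, end, declared, actual in length_mismatches(EDITED):
    print("item {}..{} declares {} bytes but contains {}".format(
        start, end, declared, actual))
```

Without the attributes, the reformatted item would look like a perfectly valid token that happens to contain line breaks.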

Conclusion

I hope this standard helps to ease syntax highlighting in digital typesetting. Of course, it needs proper support by syntax highlighting libraries or, even better, by parser authors.

Let syntax be tokenizable!


1. In the mathematical sense. Colloquially this would be named “complete”.