On the concept of syntax escaping

Written on 2022-07-17 in 3329 words ✍️.
Part of project typho digital-typesetting

Motivation

Syntax escaping is a fundamental primitive in formal grammars. Whenever you embed one syntax into another, you need syntax escaping mechanisms. Since programmers and writers have to apply these mechanisms, you want to design them in a usable manner. Be aware that, more often than desired, grammars are embedded in grammars embedded in grammars embedded in grammars. With increasing nesting depth, the rules become more and more difficult for the programmer to keep in mind.

In this post, I want to devise a model to derive escaping rules and then reason about usability of escaping. The post is very long, but very exhaustive on the topic.

Fundamentals

Why is escaping necessary? If I show you the string let text = "Hello my world";; and you are a programmer, you will immediately conclude that the variable text is assigned the string of characters Hello my world (this is OCaml syntax, BTW). This is fine, but what happens if I want to assign Hello "my" world to the variable text? The immediate solution let text = "Hello "my" world";; yields invalid code, because " acts as both initializer and terminator of the string in between, so the string already ends right before my. As such, to denote the string Hello "my" world you need to escape the syntactical rules surrounding this context.
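The same collision can be reproduced in Python instead of OCaml. This is only a minimal sketch: the helper name parses is my own, and it merely asks the Python compiler whether a piece of source text is syntactically valid.

```python
def parses(source: str) -> bool:
    """Return True if `source` compiles as Python code (hypothetical helper)."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# The inner quotes terminate the string early, so the statement is invalid.
naive = 'text = "Hello "my" world"'
# Escaping the inner quotes with a backslash makes the statement valid.
escaped = 'text = "Hello \\"my\\" world"'

print(parses(naive))    # False
print(parses(escaped))  # True
```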

This was a programmer’s example. But it is not difficult to find examples from literature. Consider a direct quote:

Adam said, 'Sofia, wait!'. But she heard 'Sofia’s gate!'.

The quoted statement is wrapped in single apostrophes, but if the statement itself uses an apostrophe, you run into a problem. Only by context and by reading ahead will you be able to tell that she heard Sofia’s gate! and not Sofia. Funnily enough, my blog software replaces the apostrophe before s with the typographically correct version and thus renders a visual difference.

Formal point of view on syntax escaping

Let the alphabet of a formal grammar be {A, B, X}. In order to talk about escaping, we need to assign one of the three characters a special meaning. Let X be that character. Consider some string of the grammar. If the special character occurs, the string that follows can really have any kind of semantics:

AABBAX …

But usually we want to return to the original semantics and continue with {A, B} characters:

AABBAX … BABB

How can we get back from the special semantics of X to the regular semantics? There are two preferred strategies:

fixed length: X can be followed by exactly (e.g.) two characters which encode information. Thus there are 4 special sequences ({XAA, XAB, XBA, XBB}) where one should signify the actual character X.
variadic length with terminator: X can be followed by some non-T sequence followed by T. Here T acts as terminator, e.g. let T be B. Then {XB, XAB, XAAB, XAAAB, …} become special sequences.

But many other strategies are possible. As one example, the character following X could act as terminator. As another one, the sequence of As following X replaced by B could act as terminator. These examples could continue to arbitrary complexity.
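To make the two preferred strategies concrete, here is a small sketch that splits a string over the toy alphabet {A, B, X} into tokens. The function names and the choice of two payload characters and terminator B are assumptions taken from the examples above, not part of any real grammar.

```python
def split_fixed(text: str, escape: str = "X", length: int = 2) -> list[str]:
    """Fixed length: the escape character is followed by exactly `length` characters."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == escape:
            end = i + 1 + length
            if end > len(text):
                raise ValueError("truncated escape sequence")
            tokens.append(text[i:end])
            i = end
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def split_terminated(text: str, escape: str = "X", terminator: str = "B") -> list[str]:
    """Variadic length: the escape sequence runs up to and including the terminator."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] == escape:
            end = text.find(terminator, i + 1)
            if end == -1:
                raise ValueError("unterminated escape sequence")
            tokens.append(text[i:end + 1])
            i = end + 1
        else:
            tokens.append(text[i])
            i += 1
    return tokens


print(split_fixed("AABBAXABBABB"))
# ['A', 'A', 'B', 'B', 'A', 'XAB', 'B', 'A', 'B', 'B']
print(split_terminated("AABBAXAABBABB"))
# ['A', 'A', 'B', 'B', 'A', 'XAAB', 'B', 'A', 'B', 'B']
```

In both cases the parser returns to the regular {A, B} semantics as soon as the special sequence is complete.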

Conceptual view on syntax escaping

An ordered, finite sequence of rules described by case (A) or (B) gives an escaping mechanism:

Case (A): To escape string R, we start with escape character C followed by a unique non-empty string S describing the escaped character using a (1) fixed length or (2) some non-T sequence followed by T where T acts as terminator. An escaped representation of C must exist.
Case (B): To escape string R or any of its repetitions, we start with escape character C followed by a unique non-empty string S which might include the string R describing the escaped character using a (1) fixed length for S, (2) S is some non-T sequence followed by T where T acts as terminator or (3) S equals R. An escaped representation of C must exist.

Examples

Coming back to the OCaml [subset] example, OCaml uses let text = "Hello \"my\" world";; for the assignment of Hello "my" world to text. The escaping rules are pretty simple:

Case (A), R ≔ ", C ≔ \, S ≔ ", (1) with length 1
Case (A), R ≔ \, C ≔ \, S ≔ \, (1) with length 1

Why do we need two rules? Because the second one satisfies the requirement that an escaped representation of C must exist. Another interesting example is XML. Once more, we consider the sequence Hello "my" world. How does one escape double quotation marks in XML?

Case (A), R ≔ ", C ≔ &, S ≔ quot;, (2) with T ≔ ;
Case (A), R ≔ &, C ≔ &, S ≔ amp;, (2) with T ≔ ;
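The two XML rules can be tried out with Python's standard library. This is just an illustration using xml.sax.saxutils; the extra mapping for the quotation mark is passed explicitly because escape only handles &, < and > by default.

```python
from xml.sax.saxutils import escape, unescape

text = 'Hello "my" world & more'

# Encoding: & becomes &amp; and, via the extra mapping, " becomes &quot;
encoded = escape(text, {'"': "&quot;"})
print(encoded)  # Hello &quot;my&quot; world &amp; more

# Decoding reverses both rules and yields the original text again.
decoded = unescape(encoded, {"&quot;": '"'})
print(decoded == text)  # True
```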

Apparently, the difference is just that we use variadic-length instead of fixed-length S. You might wonder why we need case (B). One example is CSV, where " is often escaped as "". More specifically, any sequence of quotation marks is replaced by the same sequence concatenated with another quotation mark. With case (A) alone, one would have to list every possible sequence of " repetitions as a separate R, so the set of rules would be infinite. Since this is not practical, I added the word finite to the model together with Case (B):

Case (B), R ≔ " or any repetitions, C ≔ ", S ≔ R, (3)

Since R equals C here, only one rule is required.
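For a single quotation mark, the CSV behaviour can be observed with Python's csv module. This sketch relies on the module's default dialect, which is only one of many CSV flavours; the helper names are mine.

```python
import csv
import io


def to_csv_line(fields: list[str]) -> str:
    """Serialize one row with the csv module's default dialect."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(fields)
    return buffer.getvalue()


def from_csv_line(line: str) -> list[str]:
    """Parse one row with the csv module's default dialect."""
    return next(csv.reader(io.StringIO(line)))


line = to_csv_line(['Hello "my" world', "plain"])
print(repr(line))           # '"Hello ""my"" world",plain\r\n'
print(from_csv_line(line))  # ['Hello "my" world', 'plain']
```

Note that the field is additionally wrapped in quotation marks, which is where the escaped quotes get their context from.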

In essence, we can distinguish 3 styles here:

name	R	C	S	example
fixed-length syntax escaping	"	\	"	C, Javascript, …
successive syntax escaping	"	"	"	CSV
variadic-length syntax escaping	"	&	quot;	XML

I have to admit that from an information-theoretic perspective, the approaches allow roughly (26, 1, unlimited) number of escape sequences respectively. Calling them out as equal approaches might not be fair.

Algorithmic view on syntax escaping

From the algorithmic point of view, all three are pretty trivial to implement. You need to distinguish between encoding (“give me its escaped version”) and decoding (“give me its original text”). Variadic-length syntax decoding is a little more computationally intense in the sense that the entire sequence up to the terminator (e.g. &quot;) must be collected. Once the entire sequence is available, we can replace it with its corresponding text version.
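A decoding sketch for the fixed-length and the variadic-length style could look as follows. The two lookup tables are deliberately tiny assumptions (only the rules discussed in this post), and the function names are hypothetical.

```python
FIXED = {'"': '"', "\\": "\\"}     # S -> R for C = backslash, length 1
NAMED = {"quot": '"', "amp": "&"}  # S (without T) -> R for C = &, T = ;


def decode_fixed(text: str) -> str:
    """Decode backslash escapes: exactly one character follows the backslash."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\":
            key = text[i + 1:i + 2]  # empty if the backslash is the last character
            if key not in FIXED:
                raise ValueError(f"invalid escape sequence at index {i}")
            out.append(FIXED[key])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def decode_variadic(text: str) -> str:
    """Decode &name; escapes: collect everything up to the terminator first."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "&":
            end = text.find(";", i + 1)
            name = text[i + 1:end] if end != -1 else None
            if name not in NAMED:
                raise ValueError(f"invalid escape sequence at index {i}")
            out.append(NAMED[name])
            i = end + 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(decode_fixed('Hello \\"my\\" world'))            # Hello "my" world
print(decode_variadic("Hello &quot;my&quot; world"))   # Hello "my" world
```

The only structural difference is the search for the terminator in the variadic case.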

But one interesting question arises. Often you have functionality available which allows you to replace all occurrences of a string with some replacement (a replace operation taking two strings as arguments). Can the syntax escaping mechanism be fully described by successive application of replacements?

name	describable by replace?
fixed-length syntax escaping	yes
successive syntax escaping	no
variadic-length syntax escaping	yes

Why does it fail for CSV? Fundamentally, because it uses case (B), which would require infinitely many replace rules. Replacing " by "" would result in doubling the number of double quotes (i.e. con"tr""ived becomes con""tr""""ived because each double quote is handled individually). You would need to start with the longest double-quotes sequence. But what is its longest sequence? You can evaluate the longest sequence and adjust the replace operations accordingly. But then it is not just a successive application of replace operations anymore.

For the other two cases, this is possible. I provide a pure, trivial implementation of the three escaping mechanisms in Lua for UTF-8 strings in the next blogpost.
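The replace question can be checked quickly in Python. The following is a sketch of the encoding direction only, with function names of my own choosing; escape_runs implements the run-based model from this post (each run of quotation marks gets one more), not any particular CSV library.

```python
import re


def escape_backslash(text: str) -> str:
    # Two successive replaces suffice; the escape character itself goes first.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def escape_xml(text: str) -> str:
    # Same idea: & (the escape character) first, then the quotation mark.
    return text.replace("&", "&amp;").replace('"', "&quot;")


def escape_runs_naive(text: str) -> str:
    # A plain replace handles every quotation mark individually.
    return text.replace('"', '""')


def escape_runs(text: str) -> str:
    # Run-aware variant: append one quotation mark per run, which needs
    # pattern matching rather than a fixed list of replacements.
    return re.sub(r'"+', lambda match: match.group(0) + '"', text)


sample = 'con"tr""ived'
print(escape_runs_naive(sample))  # con""tr""""ived
print(escape_runs(sample))        # con""tr"""ived
```

The two results differ as soon as a run is longer than one character, which is exactly the case a finite list of replace operations cannot cover.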

Usecase-adjusted escaping avoidance

But can’t we just avoid syntax escaping? The grammar outside dictates which sequences need to be escaped. Can we not adjust the outer syntax so that the strings which would have to be replaced do not occur in the actual string? Because we modify the syntax outside, we don’t have the same scenario as above. This is avoidance, not escaping.

Let us take an example. I think, UNIX coreutil sed is most famous for it: s/home/tmp/ denotes that home shall be replaced by tmp. In this syntax, obviously U+002F SOLIDUS is the delimiter. So how can we use tmp/typho as replacement instead? Yes, there are escaping rules, but sed allows us to use a different delimiter to avoid collisions. s$home$tmp/typho$ would avoid the use of escape sequences.
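Picking a collision-free delimiter can be automated. The sketch below only builds the sed expression as a string; the function name and the list of candidate delimiters are assumptions, and it does not care about characters that are special inside the pattern or replacement themselves.

```python
def sed_substitute(pattern: str, replacement: str, candidates: str = "/$|#,:@") -> str:
    """Build an s-command using the first delimiter that occurs in neither operand."""
    for delimiter in candidates:
        if delimiter not in pattern and delimiter not in replacement:
            return f"s{delimiter}{pattern}{delimiter}{replacement}{delimiter}"
    raise ValueError("every candidate delimiter occurs in the operands")


print(sed_substitute("home", "tmp"))        # s/home/tmp/
print(sed_substitute("home", "tmp/typho"))  # s$home$tmp/typho$
```

If every candidate collides, one has to fall back to actual escaping, which is why avoidance complements escaping rather than replacing it.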

Can we name more examples?

Perl allows sequences like q/…/ where the delimiter can be changed. As such q$…$ is allowed as well.
Python allows ", ', """ or ''' as delimiter. You can pick whichever delimiter avoids syntax escaping.
Rust allows r#"…"#, r##"…"##, r###"…"###, … where the number of hashes preceding and following the string must match (65535 hashes is the maximum, BTW).
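For the Rust notation, the required number of hashes can be derived from the text itself: a raw string ends at the first " followed by as many hashes as were used to open it. Here is a sketch in Python that produces such a literal; the function name is hypothetical, and limits on the number of hashes or other restrictions of raw strings are ignored.

```python
import re


def rust_raw_literal(text: str) -> str:
    """Wrap `text` in a Rust raw string literal with just enough hashes."""
    # For every quotation mark, capture the run of hashes directly after it.
    runs = re.findall(r'"(#*)', text)
    # One more hash than the longest run makes the closing delimiter unambiguous.
    hashes = "#" * (max(len(run) for run in runs) + 1) if runs else ""
    return f'r{hashes}"{text}"{hashes}'


print(rust_raw_literal("plain"))             # r"plain"
print(rust_raw_literal('Hello "my" world'))  # r#"Hello "my" world"#
print(rust_raw_literal('a"#b'))              # r##"a"#b"##
```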

Indeed, this is very useful. However, I would like to emphasize that strings delimited by " or """ must not start with ". You need a second notation with ' and ''' to make this possible without using syntax escaping rules. In conclusion, you want to have for each delimiting character a sequence of this character as delimiter as well (so you can use this individual character inside text). On the other hand, you also need a different character, because the string must not start with the former character without using syntax escaping.

In the end, syntax escaping avoidance is possible, but has some complexity in itself.

Requirements for syntax escaping

I think syntax escaping has two different requirements:

Set R arises externally since the formal grammar outside dictates which characters need to be escaped. Set R must be allowed to be arbitrarily large to enable embeddability in many contexts.
Memorability for the syntax escaping rules shall be high.

The second requirement is vague. I call it usability, so let us go into the details.

On the usability of escaping mechanisms

Many programming languages replace " by \" together with \ as \\. However, additionally they might support \u{…} where … is some hexadecimal ID for the Unicode scalar. As such the system is not purely (1) or (2). XML prefers variadic lengths and CSV is an example for successive syntax escaping. We come back to the idea that the escape character starts a new formal grammar and anything can happen. But how do we need to design syntax escaping to make it usable?

Set C (escape characters) shall be small.
Set S (escape sequence) shall be intuitive.

Naturally, most grammars use only one escape character and thus already satisfy the first criterion. The second criterion is more difficult. Do you remember \n more easily or &newline;? Do you prefer single letters or keywords? This question is difficult to answer and only solvable with tradeoffs.

Examples and their usability

With this rough understanding of usability, I want to revisit the escape mechanisms of several languages:

Table 1. Syntax escaping mechanisms used in practice
language	remarks
C	fixed-length syntax escaping with various lengths (`\n` versus `\x0A`) but only one escape character
CSV	successive syntax escaping, but it becomes confusing with notions like QUOTE_MINIMAL vs. QUOTE_ALL
OCaml	fixed-length syntax escaping with various lengths (`\n` versus `\x0A`) but only one escape character. Escaping avoidance through quoted strings
Perl	fixed-length syntax escaping with various lengths (`\n` versus `\N{HYPHEN}`) but only one escape character. Escaping avoidance through single and double quotes as well as quote-like operators
Python	fixed-length syntax escaping with various lengths (`\n` versus `\u00A0`) but only one escape character. Escaping avoidance through single quotes, double quotes, triple single quotes and triple double quotes
Rust	fixed-length syntax escaping with various lengths (`\n` versus `\u{00A0}`) but only one escape character. Escaping avoidance through raw strings
XML	variadic-length syntax escaping with five escape sequences

We can conclude that fixed-length syntax escaping with various lengths is very popular among programming languages. For data serialization (CSV/XML), different approaches are popular. But let us get back to our usability criteria:

What I find interesting are the following 2 thoughts with the languages above:

XML uses & as escape character C, but ; as terminator T. Thus you need to remember two characters. Why did they not pick & as terminator too? Is escaping twice like &amp;amp; maybe more readable than &amp&amp&amp&? I can observe a perhaps more relevant property: since ; is not a character required to be escaped, escaping twice is shorter. a & b becomes a &amp;amp; b versus a &amp&amp&amp& b.
Error handling for single-letter escape sequences is very difficult. \n is newline in almost every grammar. But is \a defined? Does it mean the literal characters (U+005C REVERSE SOLIDUS, U+0061 LATIN SMALL LETTER A) or does it trigger some error?
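To illustrate how differently grammars answer the \a question, compare Python string literals with JSON as parsed by Python's json module. This is only a small sketch; the helper json_accepts is my own.

```python
import json


def json_accepts(literal: str) -> bool:
    """Return True if `literal` is accepted as a JSON document (hypothetical helper)."""
    try:
        json.loads(literal)
        return True
    except json.JSONDecodeError:
        return False


# In a Python string literal, \a is defined: it denotes U+0007 (BEL).
print("\a" == "\x07")        # True

# The JSON grammar knows \n, but rejects \a as an invalid escape.
print(json_accepts(r'"\n"'))  # True
print(json_accepts(r'"\a"'))  # False
```

So the same two characters are a control character in one grammar and an error in the other.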

I would like to call out two bad examples here as well:

Teχ’s syntax depends on category codes. So in order to judge escaping mechanisms (which need not exist if some package author forgets about them), you need to parse the entire preceding Teχ code and remember the assignment of 256 characters to categories. In practice, people always expect to be in the same category definition context. In some way \ is its escape character since control sequences are used as replacements for literal characters. But a literal backslash is not \\, but \textbackslash. Once more this depends on the context (text mode only).
vim requires the escaping character \ for + in regex pattern quantifiers, but not for *. This goes against common conventions and shows that for each grammar, you might have to relearn the conventions.

Conclusion

name	R	C	S	example	describable by replace?	escape sequence complexity	algorithmic complexity
fixed-length syntax escaping	"	\	"	C, Javascript, …	yes	moderate	moderate
successive syntax escaping	"	"	"	CSV	no	low	high
variadic-length syntax escaping	"	&	quot;	XML	yes	moderate	moderate

In this post, I wanted to provide a formal basis to reason about usability. But a problem arose. You can define the model (cf. Conceptual view on syntax escaping), you can look at approaches and examples (cf. Examples) and you can identify requirements. But in the end, the notion of usability is very difficult to reason about. Simply put, syntax escaping has a formal basis, but in practice all the different approaches are mixed up and occur simultaneously, because all concepts address different users. For example, \n is a historic relic of C, \u{00A0} is an obvious result of Unicode, "" is a trivial approach resulting from the syntactic minimality of CSV, and XML prefers keywords. So the question “which escaping rules are easiest to remember” does not have a clear winner but must be answered as “depends on which mechanisms the user already knows”.

However, putting all experiences aside, I personally think we do have a winner, namely a mechanism that is easiest to remember for unbiased users:

var x = "Hello world";
var y = 'There are two worlds%U+003B%%newline%the world that we can measure with “line and rule”,%newline%and the world that we feel with our hearts and imaginations.';
var z = "Hello 2%percent% of %dbq%my%dbq% world";

My point:

If you don’t know which strings occur frequently, allow escaping avoidance mechanisms. I am allowing " and ' as string delimiter here.
I prefer keywords like “dbq”, “newline”, “percent” as well as the conventional Unicode notation “U+003B” (SEMICOLON) over single-letters or punctuation.
There is only one character required to be escaped, one escape character (in this example %) and the same terminator for the keyword-based escape sequence (%).
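A decoder for this proposal fits into a few lines. The following is a sketch under my own assumptions: the keyword table contains only the three keywords used in the example, Unicode scalars are written as U+ followed by four to six uppercase hexadecimal digits, and the function names are hypothetical.

```python
import re

KEYWORDS = {"newline": "\n", "percent": "%", "dbq": '"'}


def expand(name: str) -> str:
    """Translate the text between two % characters into the character it denotes."""
    if name in KEYWORDS:
        return KEYWORDS[name]
    if re.fullmatch(r"U\+[0-9A-F]{4,6}", name):
        return chr(int(name[2:], 16))
    raise ValueError(f"unknown escape sequence %{name}%")


def decode(text: str) -> str:
    # Splitting at % yields literal text at even indices and escape names at odd ones.
    parts = text.split("%")
    if len(parts) % 2 == 0:
        raise ValueError("unterminated escape sequence")
    for i in range(1, len(parts), 2):
        parts[i] = expand(parts[i])
    return "".join(parts)


def encode(text: str) -> str:
    # Only the escape character itself has to be escaped.
    return text.replace("%", "%percent%")


print(decode("Hello 2%percent% of %dbq%my%dbq% world"))  # Hello 2% of "my" world
print(decode("worlds%U+003B%%newline%the world") == "worlds;\nthe world")  # True
```

Because escape character and terminator coincide, decoding reduces to a single split, and a missing terminator shows up as an even number of parts.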

And thus we have a new escaping mechanism. It breaks conventions. Use with care. Here be dragons.