The Problem with ‘The Problem with Unicode’

✍️ Written on 2021-08-30 in 880 words.
Part of cs software-development digital-typesetting writing-systems Unicode

Motivation

“The Problem with Unicode” is an article by Neville Holmes from 2003. I read his articles, strongly disagree, and thus have to write this blog entry.

Context

  • “Toward Decent Text Encoding” (1998) [dblp]: Holmes describes Unicode as an obese character set with 16 bits for every character and offers Mudawwar’s Multicode as an alternative.

  • “Crouching Error, Hidden Markup” (2001) [dblp]: Holmes describes his first experience with Microsoft Word and is unhappy about the WYSIWYG nature of this software frontend. He refers to Lilac as a two-view document editor and to Teχ as a versatile, extensible tool. He forgets that Teχ’s core has proved to be the opposite of extensible as time progressed.

  • “Seven great blunders of the computing world” (2002) [dblp]: Holmes explains seven blunders, which supposedly “arise from a failure of imagination, from an inability to see beyond the immediate problem to its full social or professional context”. Besides blunder 1 (“terminology”) and blunder 4 (“commercial programming”), in blunder 6 (“Text encoding”) Holmes says “Unicode itself proved a bigger blunder by far” and “Unicode’s blunder was in aiming to encode every language rather than every writing system”.

  • “The Profession as a Culture Killer” (2007): Holmes complains about DNS domains supporting Unicode. Furthermore, he suggests one World Wide Web per writing system. He points out that writing systems other than the Latin script have been culturally impoverished.

The article

The point of the article is to justify Holmes’ claim that Unicode is a blunder. The pull quote on the first page probably highlights the underlying question best: “Unicode is a success, but would another approach have fared even better?”

A contextualized summary is also given in the abstract at IEEE:

The author, in his 'Seven Great Blunders of the Computing World' column (Computer, July 2002, p.112, 110-111) claimed that Unicode was a blunder. The author reconsiders this and states that he stands by his claim, but states that this does not mean that he considers it to be a failure. He does however claim that a different approach would have worked much better for encoding text, documents, and writing systems.

Criticism

  • “Plain text of this kind, being mostly brief and personal, never mixes writing systems”. Besides the obviously missing definition of “plain text”, the statement contradicts his earlier article “Toward decent text encoding”: “I should be able to read all Swedish names in plaintext e-mail messages, but at present many are garbled”. Plaintext is merely a contrast term for text without semantics added through formally defined syntax elements, i.e. markup. That plaintext is often monolingual stems from present technical difficulties, not from an absence of mathematical notes and foreign names in so-called plaintext (the first sketch after this list shows mixed-script plaintext).

  • “This [remark: markup] text is coded within a single writing system–properly so for simplicity’s sake”. In 2003, HTML 4.01 (published in December 1999) was the current version of HTML. If you declare <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the document head, you can use all writing systems at the same time.

  • “For example, German treats ö as though it were o, while Finnish treats the two as distinct [remark: with respect to collation]” […] “Thus the placement of symbols within alphabets should be chosen to support transliteration”. In essence, this means the author wants to encode languages, not writing systems. That contradicts the caption of Table 1 (the collation sketch after this list shows that sort order is a matter of language, not of symbol placement).

  • “The generative capability of this approach provides for complex use of accents as in Vietnamese and for the stable generation of new transliterations and symbols, thanks to typography’s ability to provide aesthetically pleasing forms of newly popular compound symbols such as the euro”. This is no different from Unicode and does not solve a problem. Once the glyph and the font have been identified, it is the renderer’s task to generate the pixel representation (in practice mostly via OpenType substitution tables and index lookups). The original bit-level representation plays no role here (see the fontTools sketch after this list). I admit I cannot tell whether GSUB tables were already standardized in 2003 or only later.
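
To back the first point empirically: a minimal Python sketch (the names and the Greek letter are invented for illustration) shows that ordinary plaintext, without any markup, mixes writing systems just fine once it is encoded as UTF-8.

    # A "plain text" e-mail line mixing the Latin script, Swedish diacritics
    # and a Greek letter; no markup involved, UTF-8 encodes all of it.
    line = "Meeting with Åsa Öström about the λ-calculus seminar"

    encoded = line.encode("utf-8")             # one unambiguous byte sequence
    assert encoded.decode("utf-8") == line     # round-trips losslessly
    print(len(line), "code points,", len(encoded), "bytes")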
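
On the collation point: the difference Holmes describes is handled at the locale level today, on top of whatever encoding is used. A rough sketch with Python’s locale module, assuming the de_DE.UTF-8 and fi_FI.UTF-8 locales are installed on the system:

    import locale

    words = ["Ostrau", "Öström", "Othello"]

    # German collation groups ö with o, so Öström sorts before Othello ...
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))

    # ... while Finnish collation places ö after z, so Öström sorts last.
    locale.setlocale(locale.LC_COLLATE, "fi_FI.UTF-8")
    print(sorted(words, key=locale.strxfrm))

The byte and code point sequences never change; only the language-specific collation rules do. Sort order is a property of the language, not of the placement of symbols within the character set.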
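
Regarding rendering: a small sketch with the fontTools library illustrates that glyph substitution lives inside the font and its renderer, not inside the text encoding. The font path below is a placeholder; any OpenType font will do (pip install fonttools):

    from fontTools.ttLib import TTFont

    # "SomeFont.otf" is a placeholder; point it at any OpenType font file.
    font = TTFont("SomeFont.otf")

    if "GSUB" in font and font["GSUB"].table.FeatureList is not None:
        records = font["GSUB"].table.FeatureList.FeatureRecord
        tags = sorted({r.FeatureTag for r in records})
        print("glyph substitution features:", tags)  # e.g. ['ccmp', 'liga', 'locl']
    else:
        print("this font ships no GSUB substitutions")

These substitution features operate on glyph indices after the characters have been mapped through the font’s cmap table, so the bit-level representation of the text indeed has no influence on this stage.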

Conclusion

I admit that some of his points are valid. For example, Europeans generally fail to recognize the fact he states: “Font classes such as typewriter, serif, and sans serif have as little meaning in the Arab writing system as diwani, kufic, and thuluth have in the Latin writing system.” His main point, “Unicode is far too complex”, also has some merit. However, Holmes needs to accept the underlying technical difficulties and the resulting properties in order to propose a system that offers an actual advantage.

I cannot judge how much of my assessment is shaped by the knowledge base of 2021, as opposed to 2003, when (e.g.) the relationship between fonts and markup was still vaguer. But in the end, I think the article contains contradictions which would become more evident if proper definitions were provided.