Unicode issue: different encodings for diacritics [withdrawn]

✍️ Written on 2021-09-07 in 620 words. Part of digital-typesetting Unicode

Motivation

Unicode allows two different encodings for diacritics. This apparently creates some issues in practice.

WARNING!

Update 2021-10-21: I learned that this blog post is just wrong. It uses the assumption that Esperanto uses the caron diacritic. However, it actually uses the breve diacritic. I did not recognize this because Neo2 breve diacritic is broken and duolingo renders both diacritics the same.

I am sorry to get this analysis wrong in the first place. This blog post is only here for archival purposes.

The source of the problem

There are two ways to encode the character ǔ. I will analyze them in a python session.

>>> u1 = 'ǔ'
>>> len(u1)
1
>>> ord(u1)
468
>>> hex(ord(u1))
'0x1d4'
>>> u1.encode('utf-8')
b'\xc7\x94'
>>> u1.encode('utf-16')
b'\xff\xfe\xd4\x01'
>>> u2 = 'ǔ'
>>> len(u2)
1
>>> ord(u2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
>>> u2.encode('utf-8')
b'u\xcc\x8c'
>>> u2.encode('utf-16')
b'\xff\xfeu\x00\x0c\x03'
>>> import unicodedata
>>> [unicodedata.normalize(std, 'ǔ') for std in ['NFC', 'NFKC', 'NFD', 'NFKD']]
['ǔ', 'ǔ', 'ǔ', 'ǔ']
>>> [len(unicodedata.normalize(std, 'ǔ')) for std in ['NFC', 'NFKC', 'NFD', 'NFKD']]
[1, 1, 2, 2]

Learning platforms

Learning platforms evaluate the correctness of students' responses. In the case of duolingo and lernu, this is done via string comparison without respect for the two possible encodings:

duolingo

Diacritics issue on duolingo

lernu

Diacritics issue on lernu

Conclusion

I wanted to present this as an issue with Unicode in practice. To solve the problem, you can apply some Unicode normalization algorithm before comparing strings.

I filed a bug report on duolingo.