Which grapheme clusters are accessible by keyboard layouts?

Written on 2024-02-18 in 1095 words ✍️.
Part of cs digital-typesetting writing-systems

Introduction

I am interested in the following question: Which characters can the majority of computer users type easily? For example, German keyboard layouts require the AltGr key to reach the square brackets [], which makes them less accessible there than on an American keyboard layout. Can we also take mobile devices into account?

Data acquisition

The ideal data source would be activity data recording which key strokes were required to make a Unicode grapheme cluster appear on the screen. Such data is certainly not easy to obtain.

On a technical level, the mapping of key codes crosses multiple domains: the keyboard itself might change it, the kernel might change it, and commonly the operating system changes it. Finally, the application (such as a GUI toolkit or the web browser) might change it as well. At each level, one key code might turn into several key codes (fan-out), or several key codes might map to a single one (fan-in). In practice, common characters are mostly mapped one-to-one. Both aforementioned keyboard layouts produce the grapheme ‘a’ upon receiving the key codes KeyPress(38) and KeyRelease(38). The American layout produces ‘[’ through [KeyPress(34), KeyRelease(34)], whereas the German one requires [KeyPress(108), KeyPress(17), KeyRelease(17), KeyRelease(108)]. But technically this is not the final result: the JavaScript of a website might wait for the sequence 108-17-17-108 and map it back to 34-34. So it would make sense to observe only the input (keyboard) and the output (rendered screen data), but the output in particular is difficult to capture. If any data is available, it therefore usually stops before reaching the final output representation.
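To make the two key sequences concrete, here is a minimal sketch of a single mapping level in Python. It replays press/release events against a tiny layout table and emits the resulting graphemes. The names `KeyEvent`, `US`, `DE`, and `render` are hypothetical, and the tables contain only the key codes mentioned in the text plus one labeled assumption. A real input stack has several such levels and many more modifiers.

```python
from typing import NamedTuple

class KeyEvent(NamedTuple):
    kind: str  # "press" or "release"
    code: int  # key code as used in the text

ALTGR = 108  # key code of AltGr, taken from the German sequence in the text

# Hypothetical minimal layout tables: (key code, AltGr held?) -> grapheme.
US = {(38, False): "a", (34, False): "["}
DE = {
    (38, False): "a",
    (17, True): "[",
    (17, False): "8",  # assumption: the unmodified key yields '8'
}

def render(events, layout):
    """Replay key events and return the graphemes the layout produces."""
    held = set()
    out = []
    for ev in events:
        if ev.kind == "press":
            held.add(ev.code)
            if ev.code != ALTGR:
                grapheme = layout.get((ev.code, ALTGR in held))
                if grapheme is not None:
                    out.append(grapheme)
        else:
            held.discard(ev.code)
    return "".join(out)

us_bracket = [KeyEvent("press", 34), KeyEvent("release", 34)]
de_bracket = [
    KeyEvent("press", 108), KeyEvent("press", 17),
    KeyEvent("release", 17), KeyEvent("release", 108),
]
print(render(us_bracket, US), len(us_bracket))  # same grapheme, 2 events
print(render(de_bracket, DE), len(de_bracket))  # same grapheme, 4 events
```

Counting the events per grapheme, as in the last two lines, is one possible way to quantify "accessibility" in this model. Whether event count is the right measure is an open design choice.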

On an organizational level, tracking raises several issues. For example, logging key strokes implies logging entered passwords and, more generally, sensitive information. A responsible vendor does not include a key logger in its telemetry.

But aside from our theoretical considerations, is there any data available on the WWW?

  • Let us look for data by Microsoft, the developer of the operating system with the largest market share among end users. An Ars Technica article from 2017 reports on Windows 10 becoming more transparent, but its links no longer work. The WWW is full of tutorials about configuring telemetry settings in Windows and about collecting telemetry data as an application developer, but I did not find any public telemetry collection.

  • Maybe I need to switch to a more open-minded company. In the case of Mozilla, one immediately runs into the Mozilla data portal, which allows anyone to access and analyze this data publicly. Browsing the off-topic evaluations was a fun experience in its own right. Did you know that about 40% of Firefox users use EN language settings, 11% use DE, and 8% use FR? Back to the topic: Mozilla publishes a dataset list in JSON. Looking at the data, no entry concerns keyboard layouts. Firefox Focus once evaluated whether users use the “default” or a “custom” keyboard layout, in order to understand how the height of the keyboard affects the layout on mobile devices. But this data does not suffice for us. A next step would be to collect keyboard layout information per website via JavaScript. There actually used to be a Keyboard Map API proposal, but Mozilla took an (understandably) negative stance towards it: “We’re concerned that this exposes keyboard layouts, which seem likely to be a significant source of fingerprinting data, in a way that does not require any user interaction”. So we are left with no data from this side either.

How can we acquire data about the map from key strokes to graphemes?

Another, less reliable approach rests on the following sequence of assumptions:

  1. The majority of device users want to type grapheme clusters belonging to the languages they have learned.

  2. Within their language community, the majority uses a common keyboard layout.

  3. Based on popularity, one distinguishes between ISO (ISO/IEC 9995-2, 110 keys), ANSI (ANSI-INCITS 154-1988, 109 keys), and JIS (JIS X 6002-1980, ~69 keys) physical keyboards for desktop/laptop computers. We assume that the majority of people use physical keyboards of these designs.

And how can we acquire data for all these points?

  1. A list of learned languages and their distribution can be found on Wikipedia.

  2. The Keyboard layout Wikipedia article is pretty comprehensive.

  3. In principle, one can pay for ISO/IEC 9995 “Information technology — Keyboard layouts for text and office systems”, which provides a categorization, but not a complete list. We will stick to the Wikipedia list from point #2, as it is not specific to a physical layout.

Data conclusions

And to get started, I will pick the ten most popular learned languages:

| Language | Number of speakers | Percentage |
|---|---|---|
| English (excl. creole languages) | 1.456 billion | 27.1% |
| Mandarin Chinese (incl. Standard Chinese, but excl. other varieties) | 1.138 billion | 21.1% |
| Hindi (excl. Urdu) | 609 million | 11.3% |
| Spanish (excl. creole languages) | 559 million | 10.4% |
| French (excl. creole languages) | 310 million | 5.8% |
| Modern Standard Arabic (excl. dialects) | 274 million | 5.1% |
| Bengali | 273 million | 5.1% |
| Portuguese | 264 million | 4.9% |
| Russian | 255 million | 4.7% |
| Urdu (excl. Hindi) | 232 million | 4.3% |
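The chain of assumptions can be turned into a rough number: weight each language by its speaker count and sum the weights of those languages whose common layout reaches a grapheme directly. The following Python sketch does that with the speaker counts listed above. The function `weighted_share` and the set `direct_bracket` are hypothetical. Which languages count as having "direct access" is an input that still has to be derived from the layout data, and the example set below is only an illustration based on the American layout mentioned in the introduction.

```python
# Speaker counts in millions, copied from the list of languages above.
SPEAKERS = {
    "English": 1456,
    "Mandarin Chinese": 1138,
    "Hindi": 609,
    "Spanish": 559,
    "French": 310,
    "Modern Standard Arabic": 274,
    "Bengali": 273,
    "Portuguese": 264,
    "Russian": 255,
    "Urdu": 232,
}

def weighted_share(speakers, direct_access):
    """Share of listed speakers whose language is in direct_access."""
    total = sum(speakers.values())
    covered = sum(n for lang, n in speakers.items() if lang in direct_access)
    return covered / total

# Illustration only: assume that of the ten languages, only the layout
# commonly used for English reaches '[' without a modifier key.
direct_bracket = {"English"}
print(f"{weighted_share(SPEAKERS, direct_bracket):.1%}")
```

Note that the shares are relative to the ten listed languages only, and that people who speak several of them are counted more than once.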

/usr/share/X11/xkb/symbols/de
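Files such as /usr/share/X11/xkb/symbols/de list, for each key, the symbols produced on each level of that key. The following Python sketch shows how such entries could be read. The fragment in `SAMPLE` is hand-written in xkb symbols syntax as an assumption for illustration, not a verbatim excerpt of the file. The helper names `parse_symbols` and `levels_for` are hypothetical, and the regular expression only handles the simple `key <NAME> { [ ... ] };` form, not includes or explicit group and type declarations.

```python
import re

# Hand-written fragment in xkb symbols syntax (illustrative assumption,
# not copied from /usr/share/X11/xkb/symbols/de).
SAMPLE = """
key <AE08> { [ 8, parenleft, bracketleft, trademark ] };
key <AE09> { [ 9, parenright, bracketright, plusminus ] };
key <AC01> { [ a, A, ae, AE ] };
"""

# Matches only the simple single-group form of a key definition.
KEY_RE = re.compile(r"key\s*<(\w+)>\s*\{\s*\[([^\]]*)\]\s*\}\s*;")

def parse_symbols(text):
    """Map each key name to its list of keysym names, one per level."""
    table = {}
    for name, body in KEY_RE.findall(text):
        table[name] = [sym.strip() for sym in body.split(",") if sym.strip()]
    return table

def levels_for(table, keysym):
    """Return (key name, 1-based level) pairs that produce the keysym."""
    return [
        (name, index + 1)
        for name, syms in table.items()
        for index, sym in enumerate(syms)
        if sym == keysym
    ]

table = parse_symbols(SAMPLE)
print(levels_for(table, "bracketleft"))
```

In this fragment the bracket sits on the third level of its key. A symbol that only appears on level three or higher usually needs an extra modifier such as AltGr, which is the kind of accessibility difference this post is after.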