Unicode :: PyDays Vienna 2018/05/04

<h1 style="margin-bottom: 0">Unicode</h1>
<p class="subtitle" style="margin-top:0">Or why py3k was necessary</p>

![Unicode point for a snake](img/snake.png)

2018/05/04, Lukas Prokop <br/>
http://lukas-prokop.at/talks/pydays18-unicode/ <br/>
<a href="https://www.pydays.at/"><img src="img/pydays-logo.svg" alt="PyDays Logo" style="width:100px" /></a>

---

## About me

* Lukas Prokop &lt;admin@lukas-prokop.at&gt;
* [@meisterluk](http://twitter.com/meisterluk)
* Hire me for Python, Go or Rust programming
* 8th public talk with time constraint

---

# My point today

Unicode was so broken in Python 2 <br/>
that backwards-incompatible Python 3 was necessary

---

# Agenda

1. What is Unicode?
2. Terminology, use cases and issues of Unicode
3. The Python 2 Unicode model
4. Unicode in other programming languages
5. The Python 3 Unicode model
6. Retrospective and End of Life of Python 2

---

What is Unicode? <span class="section">[1/6]</span>

---

# What is Unicode?

> A **map** from numbers [i.e. Unicode code points] to characters/symbols. Such a character is neither a glyph, nor a grapheme, nor a linguistic unit.

The Unicode Consortium develops and promotes it.

---

# Character maps

We can define our own map:

<div style="display: flex; align-items: center; justify-content: center">
  <table class="encoding-map">
    <tr><th>0</th><td>A</td></tr>
    <tr><th>1</th><td>B</td></tr>
    <tr><th>2</th><td>C</td></tr>
    <tr><th>3</th><td>D</td></tr>
    <tr><th>4</th><td>E</td></tr>
    <tr style="opacity:0.8"><th>5</th><td>F</td></tr>
    <tr style="opacity:0.6"><th>6</th><td>G</td></tr>
    <tr style="opacity:0.3"><th>7</th><td>H</td></tr>
    <tr style="opacity:0.2"><th>⋮</th><td>⋮</td></tr>
    <tr style="opacity:0.5"><th>27</th><td>Ä</td></tr>
    <tr><th>28</th><td>Ö</td></tr>
    <tr><th>29</th><td>Ü</td></tr>
    <tr><th>30</th><td>ß</td></tr>
  </table>
  <div style="color:white">
    <p>The issue:</p>
    <p>standardization</p>
  </div>
</div>

---

# Character maps

We can define our own map:

<div style="display: flex; align-items: center; justify-content: center">
  <table class="encoding-map">
    <tr><th>0</th><td>A</td></tr>
    <tr><th>1</th><td>B</td></tr>
    <tr><th>2</th><td>C</td></tr>
    <tr><th>3</th><td>D</td></tr>
    <tr><th>4</th><td>E</td></tr>
    <tr style="opacity:0.8"><th>5</th><td>F</td></tr>
    <tr style="opacity:0.6"><th>6</th><td>G</td></tr>
    <tr style="opacity:0.3"><th>7</th><td>H</td></tr>
    <tr style="opacity:0.2"><th>⋮</th><td>⋮</td></tr>
    <tr style="opacity:0.5"><th>27</th><td>Ä</td></tr>
    <tr><th>28</th><td>Ö</td></tr>
    <tr><th>29</th><td>Ü</td></tr>
    <tr><th>30</th><td>ß</td></tr>
  </table>
  <div>
    <p>The issue:</p>
    <p>standardization</p>
  </div>
</div>

---

# Character maps: ASCII

<strong>A</strong>merican <strong>S</strong>tandard <strong>C</strong>ode for <strong>I</strong>nformation <strong>I</strong>nterchange (1967)

<div style="display:flex; align-items:flex-start; justify-content:center">
  <table class="ascii">
    <tr><th>0</th><td>Null</td></tr>
    <tr><th>1</th><td>Start of Heading</td></tr>
    <tr><th>2</th><td>Start of Text</td></tr>
    <tr><th>3</th><td>End of Text</td></tr>
    <tr><th>4</th><td>End of Transmission</td></tr>
    <tr><th>5</th><td>Enquiry</td></tr>
    <tr><th>6</th><td>Acknowledgement</td></tr>
    <tr><th>7</th><td>Bell</td></tr>
    <tr><th>8</th><td>Backspace</td></tr>
    <tr><th>9</th><td>Horizontal Tab</td></tr>
    <tr><th>10</th><td>Line Feed</td></tr>
    <tr><th>11</th><td>Vertical Tab</td></tr>
    <tr><th>12</th><td>Form Feed</td></tr>
  </table>
  <table class="ascii">
    <tr><th>13</th><td>Carriage Return</td></tr>
    <tr><th>14</th><td>Shift Out</td></tr>
    <tr><th>15</th><td>Shift In</td></tr>
    <tr><th>16</th><td>Data Link Escape</td></tr>
    <tr><th>17</th><td>Device Control 1</td></tr>
    <tr><th>18</th><td>Device Control 2</td></tr>
    <tr><th>19</th><td>Device Control 3</td></tr>
    <tr><th>20</th><td>Device Control 4</td></tr>
    <tr><th>21</th><td>Negative Acknowledgement</td></tr>
    <tr><th>22</th><td>Synchronous Idle</td></tr>
    <tr><th>23</th><td>End of Transmission Block</td></tr>
    <tr><th>24</th><td>Cancel</td></tr>
    <tr><th>25</th><td>End of Medium</td></tr>
  </table>
  <table class="ascii">
    <tr><th>26</th><td>Substitute</td></tr>
    <tr><th>27</th><td>Escape</td></tr>
    <tr><th>28</th><td>File Separator</td></tr>
    <tr><th>29</th><td>Group Separator</td></tr>
    <tr><th>30</th><td>Record Separator</td></tr>
    <tr><th>31</th><td>Unit Separator</td></tr>
    <tr><th>32</th><td>␣</td></tr>
    <tr><th>33</th><td>!</td></tr>
    <tr><th>34</th><td>"</td></tr>
    <tr><th>35</th><td>#</td></tr>
    <tr><th>36</th><td>$</td></tr>
  </table>
</div>

---

# Character maps: ASCII

<strong>A</strong>merican <strong>S</strong>tandard <strong>C</strong>ode for <strong>I</strong>nformation <strong>I</strong>nterchange (1967)

<div style="display:flex; align-items:flex-start; justify-content:center">
  <table class="ascii" style="margin-right:30px">
    <tr><th>37</th><td>%</td></tr>
    <tr><th>38</th><td>&amp;</td></tr>
    <tr><th>39</th><td>'</td></tr>
    <tr><th>40</th><td>(</td></tr>
    <tr><th>41</th><td>)</td></tr>
    <tr><th>42</th><td>*</td></tr>
    <tr><th>43</th><td>+</td></tr>
    <tr><th>44</th><td>,</td></tr>
    <tr><th>45</th><td>-</td></tr>
    <tr><th>46</th><td>.</td></tr>
    <tr><th>47</th><td>/</td></tr>
    <tr><th>48</th><td>0</td></tr>
    <tr><th>49</th><td>1</td></tr>
    <tr><th>50</th><td>2</td></tr>
  </table>
  <table class="ascii" style="margin-right:30px">
    <tr><th>51</th><td>3</td></tr>
    <tr><th>52</th><td>4</td></tr>
    <tr><th>53</th><td>5</td></tr>
    <tr><th>54</th><td>6</td></tr>
    <tr><th>55</th><td>7</td></tr>
    <tr><th>56</th><td>8</td></tr>
    <tr><th>57</th><td>9</td></tr>
    <tr><th>58</th><td>:</td></tr>
    <tr><th>59</th><td>;</td></tr>
    <tr><th>60</th><td>&lt;</td></tr>
    <tr><th>61</th><td>=</td></tr>
    <tr><th>62</th><td>&gt;</td></tr>
    <tr><th>63</th><td>?</td></tr>
    <tr><th>64</th><td>@</td></tr>
  </table>
  <table class="ascii" style="margin-right:30px">
    <tr><th>65</th><td>A</td></tr>
    <tr><th>66</th><td>B</td></tr>
    <tr><th>67</th><td>C</td></tr>
    <tr><th>68</th><td>D</td></tr>
    <tr><th>69</th><td>E</td></tr>
    <tr><th>70</th><td>F</td></tr>
    <tr><th>71</th><td>G</td></tr>
    <tr><th>72</th><td>H</td></tr>
    <tr><th>73</th><td>I</td></tr>
    <tr><th>74</th><td>J</td></tr>
    <tr><th>75</th><td>K</td></tr>
    <tr><th>76</th><td>L</td></tr>
    <tr><th>77</th><td>M</td></tr>
    <tr><th>78</th><td>N</td></tr>
  </table>
  <table class="ascii" style="margin-right:30px">
    <tr><th>79</th><td>O</td></tr>
    <tr><th>80</th><td>P</td></tr>
    <tr><th>81</th><td>Q</td></tr>
    <tr><th>82</th><td>R</td></tr>
    <tr><th>83</th><td>S</td></tr>
    <tr><th>84</th><td>T</td></tr>
    <tr><th>85</th><td>U</td></tr>
    <tr><th>86</th><td>V</td></tr>
    <tr><th>87</th><td>W</td></tr>
    <tr><th>88</th><td>X</td></tr>
    <tr><th>89</th><td>Y</td></tr>
    <tr><th>90</th><td>Z</td></tr>
    <tr><th>91</th><td>[</td></tr>
    <tr><th>92</th><td>\</td></tr>
  </table>
  <table class="ascii" style="margin-right:30px">
    <tr><th>93</th><td>]</td></tr>
    <tr><th>94</th><td>^</td></tr>
    <tr><th>95</th><td>_</td></tr>
    <tr><th>96</th><td>`</td></tr>
    <tr><th>97</th><td>a</td></tr>
    <tr><th>98</th><td>b</td></tr>
    <tr><th>99</th><td>c</td></tr>
    <tr><th>100</th><td>d</td></tr>
    <tr><th>101</th><td>e</td></tr>
    <tr><th>102</th><td>f</td></tr>
    <tr><th>103</th><td>g</td></tr>
    <tr><th>104</th><td>h</td></tr>
    <tr><th>105</th><td>i</td></tr>
    <tr><th>106</th><td>j</td></tr>
  </table>
  <table class="ascii" style="margin-right:30px">
    <tr><th>107</th><td>k</td></tr>
    <tr><th>108</th><td>l</td></tr>
    <tr><th>109</th><td>m</td></tr>
    <tr><th>110</th><td>n</td></tr>
    <tr><th>111</th><td>o</td></tr>
    <tr><th>112</th><td>p</td></tr>
    <tr><th>113</th><td>q</td></tr>
    <tr><th>114</th><td>r</td></tr>
    <tr><th>115</th><td>s</td></tr>
    <tr><th>116</th><td>t</td></tr>
    <tr><th>117</th><td>u</td></tr>
    <tr><th>118</th><td>v</td></tr>
    <tr><th>119</th><td>w</td></tr>
    <tr><th>120</th><td>x</td></tr>
  </table>
  <table class="ascii" style="margin-right:30px">
    <tr><th>121</th><td>y</td></tr>
    <tr><th>122</th><td>z</td></tr>
    <tr><th>123</th><td>{</td></tr>
    <tr><th>124</th><td>|</td></tr>
    <tr><th>125</th><td>}</td></tr>
    <tr><th>126</th><td>~</td></tr>
    <tr><th>127</th><td>Delete</td></tr>
  </table>
</div>

---

ASCII is a <strong>7</strong>-bit encoding

2<sup>7</sup> = 128 characters

---

# More than 128 symbols!

1996, Euro symbol: €

---

# The ISO-8859 family

<table>
  <tr><td>Latin-1 (1998)<td>Western Europe</tr>
  <tr><td>Latin-2<td>Central Europe</tr>
  <tr><td>Latin-3<td>South European</tr>
  <tr><td>Latin-4<td>North European</tr>
  <tr><td>ISO/IEC 8859-5:1999<td>Latin/Cyrillic</tr>
  <tr><td>ISO/IEC 8859-6:1999<td>Latin/Arabic</tr>
  <tr><td>ISO/IEC 8859-7:2003<td>Latin/Greek</tr>
  <tr><td>ISO/IEC 8859-8<td>Latin/Hebrew</tr>
  <tr><td>Latin-5<td>Turkish</tr>
  <tr><td>Latin-6<td>Nordic</tr>
  <tr><td>ISO/IEC 8859-11:2001<td>Latin/Thai</tr>
  <tr style="text-decoration:line-through"><td>ISO/IEC 8859-12<td>Devanagari</tr>
  <tr><td>Latin-7<td>Baltic Rim</tr>
  <tr><td>Latin-8<td>Celtic</tr>
  <tr><td>Latin-9<td>update of Latin-1</tr>
  <tr><td>Latin-10<td>South-Eastern European</tr>
</table>

via https://en.wikipedia.org/wiki/ISO/IEC_8859

---

# East Asian encodings

https://en.wikipedia.org/wiki/File:JIS_and_Shift-JIS_variants.svg [License: zLib License]

---

# Writing systems worldwide

<span class="writing-system" style="background-color:#AAAAAA;color:brown">Latin</span>
<span class="writing-system" style="background-color:#008080;color:white">Cyrillic</span> 
<span class="writing-system" style="background-color:cyan;color:blue">Georgian</span> 
<span class="writing-system" style="background-color:blue;color:white">Greek</span> 
<span class="writing-system" style="background-color:#5bc6f0;color:white">Armenian</span>
<span class="writing-system" style="background-color:green;color:white">Arabic</span>
<span class="writing-system" style="background-color:#00ff7f;color:black">Hebrew and Arabic</span>
<span class="writing-system" style="background-color:#40C040;color:black">Arabic and Neo-Tifinagh</span>
<span class="writing-system" style="background-color:#FFC000;color:black">North Indic</span> 
<span class="writing-system" style="background-color:orange;color:black">South Indic</span> 
<span class="writing-system" style="background-color:#800000;color:white">Ethiopic</span> 
<span class="writing-system" style="background-color:olive;color:white">Thaana</span>
<span class="writing-system" style="background-color:#FFFF80;color:black">Canadian Syllabic</span>
<span class="writing-system" style="background-color:#9B0000;color:white">Pure logographic</span> 
<span class="writing-system" style="background-color:#F40000;color:white">Mixed logographic and syllabaries</span> 
<span class="writing-system" style="background-color:#FF00FF;color:white">Featural-alphabet + limited logographic</span> 
<span class="writing-system" style="background-color:#800080;color:white">Featural-alphabet</span>

[GFDL 1.2 / CC BY-SA 3.0 by JWB](https://en.wikipedia.org/wiki/List_of_writing_systems#/media/File:Writing_systems_worldwide.png)

---

# What is Unicode?

<dl>
  <dt>Unicode's goal</dt>
  <dd>A universal character set (UCS) (dating back to 1987)</dd>
  <dt>[Unicode 1.0.0](http://www.unicode.org/versions/Unicode1.0.0/) (Oct 1991)</dt>
  <dd>7,161 characters from 24 writing systems</dd>
  <dt>[Unicode 10.0](http://www.unicode.org/versions/Unicode10.0.0/) (June 2017)</dt>
  <dd>136,755 characters from 139 writing systems</dd>
</dl>

Data from [Wikipedia:Unicode](https://en.wikipedia.org/wiki/Unicode#Versions)

---

# Unicode charts

<div class="center">
  <img src="img/example-chart.png" alt="Example excerpt from the Greek/Coptic Unicode block" />
  <p>via <a href="https://www.unicode.org/charts/PDF/U0370.pdf">https://www.unicode.org/charts/PDF/U0370.pdf</a></p>
</div>

All Unicode code points lie in the range U+0000 to U+10FFFF (decimal: 0 to 1,114,111).
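
As a rough illustration (Python 3; the variable names exist only in this sketch), `ord` and `chr` convert between a character and its code point, and `chr` rejects numbers beyond U+10FFFF:

```python
snake = '\N{SNAKE}'              # U+1F40D SNAKE
code_point = ord(snake)          # character -> number
label = 'U+%04X' % code_point    # conventional U+XXXX notation

last = chr(0x10FFFF)             # the highest code point is still accepted

try:
    chr(0x110000)                # one past the end of the code space
    beyond_range_accepted = True
except ValueError:
    beyond_range_accepted = False
```

The `U+` notation is just the code point written in hexadecimal, padded to at least four digits.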

---

# Encodings in a text editor

There are [seven character encoding schemes](http://www.unicode.org/glossary/#character_encoding_scheme) in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE [character encoding form + byte serialization].

---

# Properties of UTF-8

<strong>U</strong>nicode <strong>T</strong>ransformation <strong>F</strong>ormat <strong>8</strong>-bit <br/>
Originally designed by Ken Thompson and Rob Pike <br/>
Specifies how a Unicode code point is represented as a variable number of bytes (→ encoding) <br/>
used by >90% of websites

<table class="utf8">
  <tr>
    <th># bytes</th>
    <th>range</th>
    <th>byte 1</th>
    <th>byte 2</th>
    <th>byte 3</th>
    <th>byte 4</th>
  </tr>
  <tr>
    <td>1</td>
    <td>U+0000–U+007F</td>
    <td><span style="color:#00F">0</span>xxxxxxx</td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>2</td>
    <td>U+0080–U+07FF</td>
    <td><span style="color:#00F">110</span>xxxxx</td>
    <td>10xxxxxx</td>
    <td></td>
    <td></td>
  </tr>
  <tr>
    <td>3</td>
    <td>U+0800–U+FFFF</td>
    <td><span style="color:#00F">1110</span>xxxx</td>
    <td>10xxxxxx</td>
    <td>10xxxxxx</td>
    <td></td>
  </tr>
  <tr>
    <td>4</td>
    <td>U+10000–U+10FFFF</td>
    <td><span style="color:#00F">11110</span>xxx</td>
    <td>10xxxxxx</td>
    <td>10xxxxxx</td>
    <td>10xxxxxx</td>
  </tr>
</table>

* <strong>Backward compatibility:</strong> UTF-8 ⊃ ASCII
* <strong>Prefix code:</strong> number of bytes <span style="color:#00F">known</span>
* <strong>Self synchronization:</strong> we can distinguish between the first byte (<span style="color:#00F">0</span> or <span style="color:#00F">11</span>) and continuation bytes (<span style="color:#00F">10</span>)
* <strong>Overlong encodings disallowed:</strong> the minimum number of bytes <em>must</em> be used
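
The bit layout in the table can be sketched as a hand-written encoder. This is only an illustration in Python 3 (the function name `utf8_encode` is made up for this slide); real code should simply call `str.encode('utf-8')`:

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError('not encodable in UTF-8: %#x' % code_point)
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against the built-in codec for one sample per byte length.
samples = [0x41, 0xE4, 0x20AC, 0x1F40D]
matches = [utf8_encode(cp) == chr(cp).encode('utf-8') for cp in samples]
```

Always picking the first range that fits is what rules out overlong encodings: a code point below U+0080 never reaches the two-byte branch.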

---

# Properties of UTF-8

* <strong>Auto-detection:</strong> In short, real-world extended ASCII character sequences which look like valid UTF-8 multi-byte sequences are unlikely [[Wikipedia]](https://en.wikipedia.org/wiki/UTF-8)
* Avoids "requiring much more space when storing Latin"

Other encodings:

* Unicode ≅ ISO/IEC 10646
* UTF-1, until 1993, 1 to 5 bytes
* UTF-8, since 1993, 1 to 4 bytes
* UCS-2, before 2000, constant length 2 bytes ([obsolete](https://www.unicode.org/versions/Unicode10.0.0/appC.pdf))
* UCS-4, before 2003, = UTF-32
* UTF-16, ~2000, UCS-2 + low surrogates + high surrogates, 2 or 4 bytes
* UTF-32, since 2003, constant length 4 bytes

[Armin Ronacher: "UCS vs UTF-8 as Internal String Encoding"](http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/)

---

# Python and its internal representation

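Compile flags (Python 2 and Python 3 up to 3.2; since PEP 393 in Python 3.3 the build option is gone and `sys.maxunicode` is always `0x10ffff`):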

```
--enable-unicode=ucs2
--enable-unicode=ucs4
```

Which UCS is used internally?

```
>>> import sys
>>> {False: 'UCS-4', True: 'UCS-2'}[sys.maxunicode < 0x10ffff]
'UCS-4'
```

https://docs.python.org/3.8/c-api/unicode.html

---

Terminology, use cases and issues of Unicode <span class="section">[2/6]</span>

---

# Terminology

<dl>
  <dt>The Basic Multilingual Plane (BMP)</dt>
  <dd>65,424 code points containing all modern writing systems</dd>
  <dt>Surrogates</dt>
  <dd>A mechanism to distinguish 2 and 4 byte UTF-16 characters (first 2 bytes = high surrogate, then low surrogate)</dd>
  <dt>Byte Order Mark</dt>
  <dd>A special symbol making it possible to detect endianness</dd>
</dl>

```
>>> import codecs
>>> dict({attr: getattr(codecs, attr)
          for attr in dir(codecs) if attr.startswith('BOM')})
{'BOM': b'\xff\xfe',                  'BOM_BE': b'\xfe\xff',
 'BOM_LE': b'\xff\xfe',               'BOM_UTF8': b'\xef\xbb\xbf',
 'BOM_UTF16': b'\xff\xfe',            'BOM_UTF16_LE': b'\xff\xfe',
 'BOM_UTF16_BE': b'\xfe\xff',         'BOM_UTF32': b'\xff\xfe\x00\x00',
 'BOM_UTF32_LE': b'\xff\xfe\x00\x00', 'BOM_UTF32_BE': b'\x00\x00\xfe\xff',
 'BOM32_LE': b'\xff\xfe',             'BOM32_BE': b'\xfe\xff',
 'BOM64_LE': b'\xff\xfe\x00\x00',     'BOM64_BE': b'\x00\x00\xfe\xff'}
```
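
How a code point above the BMP is split into a surrogate pair can be sketched in a few lines of Python 3. The helper name `to_surrogate_pair` is hypothetical; the arithmetic follows the UTF-16 definition:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above the BMP need surrogates')
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F4A9)   # U+1F4A9 PILE OF POO

# The big-endian codec writes exactly these two 16-bit units, without a BOM.
encoded = '\U0001F4A9'.encode('utf-16-be')
```

With the plain `utf-16` codec the same units are preceded by a byte order mark and stored in the platform's byte order.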

---

# Terminology: Han unification

https://www.unicode.org/versions/Unicode10.0.0/ch18.pdf

* CJK stands for Chinese, Japanese and Korean.
* Chinese ⇒ Japanese/Kanji <br/> Chinese ⇒ Korean/Hanja
* Unify all these characters?

→ political debate

---

# Casing (majuscule and minuscule)

http://www.unicode.org/reports/tr21/

```
>>> 'abcd'.upper()     # latin
'ABCD'
>>> 'αβγδ'.upper()     # greek
'ΑΒΓΔ'
>>> 'ⲁⲃⲅⲇ'.upper()      # coptic
'ⲀⲂⲄⲆ'
>>> 'ⴀ ⴁ ⴂ'.upper()    # georgian (Nuskhuri)
'Ⴀ Ⴁ Ⴂ'
>>> '今日は'.upper()    # japanese - case invariant
'今日は'
```

```
>>> 'ß'.upper()
```

```
'SS'
```
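
Because case mappings can change the length of a string, comparing `upper()` or `lower()` results is fragile. A minimal sketch of caseless comparison using `str.casefold()` (Python 3.3 and later); the helper name `caseless_equal` is made up here:

```python
def caseless_equal(a, b):
    """Compare two strings while ignoring case, using full case folding."""
    return a.casefold() == b.casefold()


upper_length = len('ß'.upper())       # upper-casing maps one character to two
lower_unchanged = 'ß'.lower() == 'ß'  # lower-casing leaves it alone
same_word = caseless_equal('Straße', 'STRASSE')
```

Case folding does not normalize, so strings that differ only in composed versus decomposed form still compare unequal.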

---

# Directionality

http://www.unicode.org/reports/tr9/

* Arabic, Hebrew, Persian, Urdu, …
* Mixing LTR and RTL text
* `U+200E LEFT-TO-RIGHT MARK`
* `U+200F RIGHT-TO-LEFT MARK`

1. bad: <span dir="rtl">لغة C++ هي لغة برمجة تستخدم...</span>
2. <span dir="rtl">لغة C++<span dir="ltr" style="color:#999">&lt;U+200E&gt;</span> هي لغة برمجة تستخدم...</span>
3. good: <span dir="rtl">لغة C++‎ هي لغة برمجة تستخدم...</span>

* Python has no support for the bidirectional algorithm, apart from looking up the bidi class: `unicodedata.bidirectional('a') == 'L'`.
* My terminal does not support it either.
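
A small Python 3 sketch of what `unicodedata.bidirectional` does offer: it only reports the bidi class of a single character. The sample dictionary is invented for illustration:

```python
import unicodedata

samples = {
    'a': 'LATIN SMALL LETTER A',
    '+': 'PLUS SIGN',
    '\u05D0': 'HEBREW LETTER ALEF',
    '\u0644': 'ARABIC LETTER LAM',
    '\u200E': 'LEFT-TO-RIGHT MARK',
    '\u200F': 'RIGHT-TO-LEFT MARK',
}

# Map each character name to its bidi class
# (L = left-to-right, R = right-to-left, AL = Arabic letter, ES = European separator).
classes = {name: unicodedata.bidirectional(ch) for ch, name in samples.items()}
```

The plus sign has no strong direction of its own, which is why the `++` in the example above drifts to the wrong side until a LEFT-TO-RIGHT MARK follows it.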

---

# Line breaking

```
>>> nbsp = '\u00A0'        # NO-BREAK SPACE
>>> thin_space = '\u2009'  # THIN SPACE
>>> word_joiner = '\u2060' # WORD JOINER

>>> 'line1 line2'.split()
['line1', 'line2']
>>> 'line1\nline2'.split()
['line1', 'line2']
>>> 'line1{nbsp:}line2'.format(nbsp=nbsp).split()
['line1', 'line2']
>>> 'line1{ts:}line2'.format(ts=thin_space).split()
['line1', 'line2']
>>> 'line1{wj:}line2'.format(wj=word_joiner).split()
['line1\u2060line2']  # desired behavior
```

---

# Combining characters

```
>>> cafe1 = 'cafe\u0301' # COMBINING ACUTE ACCENT
>>> cafe2 = 'caf\u00E9'  # LATIN SMALL LETTER E WITH ACUTE

>>> cafe1 == cafe2
False
>>> print(cafe1, cafe2)
café café
>>> len(cafe1), len(cafe2)
(5, 4)
>>> cafe1[4]
'́'
>>> cafe2[4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range
```

---

# Unicode normalization

http://www.unicode.org/reports/tr15/

<table>
  <tr><td style="text-align:right; font-weight:bold">NF</td><td style="font-weight:bold">D</td><td>Canonical Decomposition</td></tr>
  <tr><td style="text-align:right; font-weight:bold">NF</td><td style="font-weight:bold">C</td><td>Canonical Decomposition + Canonical Composition</td></tr>
  <tr><td style="text-align:right; font-weight:bold">NF</td><td style="font-weight:bold">KD</td><td>Compatibility Decomposition</td></tr>
  <tr><td style="text-align:right; font-weight:bold">NF</td><td style="font-weight:bold">KC</td><td>Compatibility Decomposition + Canonical Composition</td></tr>
</table>

```
>>> import unicodedata
>>> unicodedata.normalize('NFC', cafe1) == unicodedata.normalize('NFC', cafe2)
True
>>> a = '\u212B'    # ANGSTROM SIGN
>>> len(unicodedata.normalize('NFD', a))
2
>>> len(unicodedata.normalize('NFC', a))
1
>>> s = '\u1E9B\u0323'  # LATIN SMALL LETTER LONG S WITH DOT ABOVE + COMBINING DOT BELOW
>>> len(unicodedata.normalize('NFKC', s))
1
>>> len(unicodedata.normalize('NFKD', s))
3
```
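
In practice, normalization is mostly needed before comparing strings. A minimal Python 3 sketch; the two helper names are hypothetical:

```python
import unicodedata


def canonically_equal(a, b):
    """True if both strings are the same after canonical normalization (NFC)."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


def compatibility_equal(a, b):
    """True if both strings are the same after compatibility normalization (NFKC)."""
    return unicodedata.normalize('NFKC', a) == unicodedata.normalize('NFKC', b)


cafes_match = canonically_equal('cafe\u0301', 'caf\u00E9')

# U+FB01 LATIN SMALL LIGATURE FI equals 'fi' only under compatibility forms.
ligature_canonical = canonically_equal('\uFB01', 'fi')
ligature_compatible = compatibility_equal('\uFB01', 'fi')
```

Which form is appropriate depends on the application: the compatibility forms discard distinctions such as ligatures, which may or may not be wanted.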

---

# Unicode categories

https://www.unicode.org/reports/tr49/

```
>>> import unicodedata
>>> unicodedata.category('A')
'Lu'
>>> unicodedata.bidirectional('a')
'L'
>>> unicodedata.category('1')
'Nd'
>>> unicodedata.category('①')   # U+2460 CIRCLED DIGIT ONE
'No'
>>> categories = {
...     'Pc', 'Pi', 'Sm', 'Pd', 'Mn', 'Sk', 'Lm', 'No',
...     'Cc', 'Ps', 'Nd', 'Ll', 'Lu', 'Lt', 'Me', 'Zp',
...     'Mc', 'Zs', 'Zl', 'Po', 'Cf', 'Pe', 'So', 'Pf',
...     'Lo', 'Nl', 'Sc'
... }
```

> "Unicode Character Categories" has been withdrawn. It was never formally approved

---

# Collation

https://www.unicode.org/reports/tr10/

```
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_AT.UTF-8')
'de_AT.UTF-8'
>>> 'ö' < 'z'
False
```

* Should be `True` in Austria, `False` in Sweden
* Collation depends on the locale
* Python applies unicode code point comparison
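
Locale-aware sorting can be sketched with `locale.strxfrm` (Python 3). Whether a locale such as `de_AT.UTF-8` is installed depends on the system, so the hypothetical helper below returns `None` when it is missing, and no particular locale result is claimed here:

```python
import locale

words = ['zebra', 'öl', 'apfel']

# Default behaviour: plain code point order, so 'ö' (U+00F6) sorts after 'z' (U+007A).
by_code_point = sorted(words)


def sorted_for_locale(items, name):
    """Sort with the C library's collation rules for locale `name`.

    Returns None if that locale is not available on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)      # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)   # restore


austrian = sorted_for_locale(words, 'de_AT.UTF-8')      # may be None
```

`setlocale` changes process-wide state, which is one reason the UCA-based collation in a library such as PyICU is often preferred.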

---

# Unicode Regular expressions

https://www.unicode.org/reports/tr18/

```
>>> import unicodedata, re
>>> unicodedata.category('A')
'Lu'
>>> re.search(r'\p{Lu}', 'A')
```

* `\p{}` is a mechanism to match characters by category; the stdlib `re` module does not support it (the third-party `regex` package does)
* regular expressions essentially inherit all other problems (such as casing)
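
With only the standard library, matching by category can be approximated with `unicodedata`. A minimal Python 3 sketch; the function name `chars_in_category` is made up for this slide:

```python
import unicodedata


def chars_in_category(text, category):
    """Return the characters of `text` whose general category equals `category`."""
    return [ch for ch in text if unicodedata.category(ch) == category]


uppercase = chars_in_category('PyDays Österreich Ωmega', 'Lu')
```

This is a filter rather than a pattern, so it cannot express repetition or alternation the way `\p{Lu}+` would.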

---

# Security issues: Domain names and Unicode

http://www.unicode.org/reports/tr36/#international_domain_names

* Invisible characters make a domain look identical to another one although it is different
* Perfect for phishing attacks
* e.g. `U+200C ZERO WIDTH NON-JOINER`
* e.g. `U+200D ZERO WIDTH JOINER`
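
One possible check, sketched in Python 3, is to flag every character of category `Cf` (format characters, which include both joiners). The function name and the sample domain are invented, and this is far from a complete defense against spoofing:

```python
import unicodedata


def invisible_characters(label):
    """List (index, name) for every format character (category Cf) in `label`."""
    return [(index, unicodedata.name(ch, 'UNKNOWN'))
            for index, ch in enumerate(label)
            if unicodedata.category(ch) == 'Cf']


suspicious = invisible_characters('py\u200Cdays.at')
clean = invisible_characters('pydays.at')
```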

---

The Python 2 Unicode model <span class="section">[3/6]</span>

---

# Python 2

* Python was created shortly before Unicode 1.0.0 was released
* Python 2 was the first Python featuring Unicode support

---

# Python 2: Source file encoding

```python
#!/usr/bin/env python

spätzle = "hello world"

print(spätzle)
```

Result:
--

```
  File "test.py", line 3
SyntaxError: Non-ASCII character '\xc3' in file test.py on line 3, but no encoding
             declared; see http://python.org/dev/peps/pep-0263/ for details
```

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
```

Default assumption: ASCII

via [PEP 263][pep263]

[pep263]: https://www.python.org/dev/peps/pep-0263/

---

# Python 2: str and unicode types

str → a sequence of bytes in an implicit encoding <br/>
unicode → a sequence of unicode points

```
>>> p = '\xf0\x9f\x92\xa9'
>>> type(p)
<type 'str'>
>>> b = b'\xf0\x9f\x92\xa9'
>>> p == b
True
>>> print(p.decode('utf-8'))
```
--

💩

`__str__(self)` returns `str` <br/>
`__unicode__(self)` returns `unicode`

---

# Python 2: Bytes and unicode world

---

# Python 2: `encode` and `decode`

```
>>> '\xf0\x9f\x92\xa9'.decode('utf-8').encode('utf-16')
'\xff\xfe=\xd8\xa9\xdc'
```

decode → given bytes, return unicode <br/>
encode → given unicode, return bytes

```
>>> '\xC0'.decode('utf-8')   # overlong-byte
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0:
                    invalid start byte
```

---

# Python 2: source code file strings

```
#!/usr/bin/env python

__author__ = 'meisterluk'
__version__ = '1.0.0'
```

should be encoded as unicode string → `u''` prefix.

```
#!/usr/bin/env python

__author__ = u'meisterluk'
__version__ = u'1.0.0'
```

---

# Python 2: bytes and unicode coercion

```
>>> 'pydays ' + u'vienna'
u'pydays vienna'
>>>
```

---

# Python 2: csv module

https://docs.python.org/2/library/csv.html

> <strong>Note:</strong> This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

```
import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")
```

---

# Python 2: csv module

```
class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self
```

---

# Python 2: csv module

```
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
```
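
For comparison, a sketch of the same round trip in Python 3, where the `csv` module works on `str` directly and none of the wrapper classes are needed. The sample rows are invented, and an in-memory buffer stands in for a file opened with `newline=''` and an explicit `encoding`:

```python
import csv
import io

rows = [['name', 'dish'],
        ['Lukas', 'Sp\u00e4tzle'],
        ['Guido', '\U0001F40D']]

# newline='' leaves line endings to the csv module, as its documentation recommends.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)

buffer.seek(0)
read_back = list(csv.reader(buffer))

# Encoding happens once, at the boundary to the outside world.
payload = buffer.getvalue().encode('utf-8')
```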

---

# Python 2: reading a file

```
with open('test.txt') as fd:
    # lines are *NOT* split according to Unicode
    for line in fd.readlines():
        print(line)
```

2 alternatives:

```
import codecs

with codecs.open('test.txt', encoding='utf-8') as fd:
    for line in fd.readlines():
        print(line)
```

```
import io

with io.open('test.txt', encoding='utf-8') as fd:
    for line in fd.readlines():
        print(line)
```

<!--
TODO

XML specifies the encoding in its header
JSON does not.
-->

---

# Python 2: Frustrations

* The API is very explicit and inconvenient
* The `csv` module is one example of a broken library

```
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
                     ordinal not in range(128)
```

*Be aware:* The encoding is not always known, for example in HTTP headers and file system paths.

---

Unicode in other languages <span class="section">[4/6]</span>

---

# TeX

Design from 1982 <br/>
Section § 21 of 1379

```
〈Set initial values of key variables 21〉≡
  xchr['40] = ' '; xchr['41] = '!'; xchr['42] = '"'; xchr['43] = '#'; 
  xchr['44] = '$'; xchr['45] = '%'; xchr['46] = '&'; xchr['47] = ''';
  xchr['50] = '('; xchr['51] = ')'; xchr['52] = '*'; xchr['53] = '+';
  xchr['54] = ','; xchr['55] = '-'; xchr['56] = '.'; xchr['57] = '/';
  xchr['60] = '0'; xchr['61] = '1'; xchr['62] = '2'; xchr['63] = '3';
  xchr['64] = '4'; xchr['65] = '5'; xchr['66] = '6'; xchr['67] = '7';
  xchr['70] = '8'; xchr['71] = '9'; xchr['72] = ':'; xchr['73] = ';';
  xchr['74] = '<'; xchr['75] = '='; xchr['76] = '>'; xchr['77] = '?';
  xchr['100] = '@'; xchr['101] = 'A'; xchr['102] = 'B'; xchr['103] = 'C';
  xchr['104] = 'D'; xchr['105] = 'E'; xchr['106] = 'F'; xchr['107] = 'G';
  xchr['110] = 'H'; xchr['111] = 'I'; xchr['112] = 'J'; xchr['113] = 'K';
  xchr['114] = 'L'; xchr['115] = 'M'; xchr['116] = 'N'; xchr['117] = 'O';
  xchr['120] = 'P'; xchr['121] = 'Q'; xchr['122] = 'R'; xchr['123] = 'S';
  xchr['124] = 'T'; xchr['125] = 'U'; xchr['126] = 'V'; xchr['127] = 'W';
  xchr['130] = 'X'; xchr['131] = 'Y'; xchr['132] = 'Z'; xchr['133] = '[';
  xchr['134] = '\'; xchr['135] = ']'; xchr['136] = '^'; xchr['137] = '_';
  xchr['140] = '`'; xchr['141] = 'a'; xchr['142] = 'b'; xchr['143] = 'c';
  xchr['144] = 'd'; xchr['145] = 'e'; xchr['146] = 'f'; xchr['147] = 'g';
  xchr['150] = 'h'; xchr['151] = 'i'; xchr['152] = 'j'; xchr['153] = 'k';
  xchr['154] = 'l'; xchr['155] = 'm'; xchr['156] = 'n'; xchr['157] = 'o';
  xchr['160] = 'p'; xchr['161] = 'q'; xchr['162] = 'r'; xchr['163] = 's';
  xchr['164] = 't'; xchr['165] = 'u'; xchr['166] = 'v'; xchr['167] = 'w';
  xchr['170] = 'x'; xchr['171] = 'y'; xchr['172] = 'z'; xchr['173] = '{';
  xchr['174] = '|'; xchr['175] = '}';
```

---

# PHP

Designed in 1994.

* Strings are plain byte sequences without an encoding; the abandoned PHP 6 effort planned to use UTF-16 internally
* Functions like `utf8_encode` assume the input is Latin-1
* Functions like `str_replace` are safe to use on UTF-8 strings
* `mb_convert_encoding` is generic and converts from one encoding to another
* [List of supported encodings](https://secure.php.net/manual/en/mbstring.supported-encodings.php)
* [mbstring.internal_encoding](https://secure.php.net/manual/en/mbstring.configuration.php#ini.mbstring.internal-encoding) (obsolete)
* Identifiers can be some superset of ASCII
* PHP has a history of ignoring the multibyte/Unicode issue

---

# Lua

Designed in 1993.

* Lua 5.3.4 consists of 24,000 LOCs
* Lua without stdlib <100kb
* Everything is bytes. Unicode implementation is too complex to be included.
* Cannot use non-ASCII characters as identifiers.

---

# Rust

Designed in 2010.

* Rust's `String` type is defined to be UTF-8 and its `char` type is a Unicode scalar value.
* Identifiers must be ASCII.

```
error: this form of character escape may only be used with characters
       in the range [\x00-\x7f]
 --> test.rs:2:23
  |
2 |     let non_utf8 = "\xC8";
  |                       ^^

error: aborting due to previous error
```

[encoding crate](https://crates.io/crates/encoding):

```
use encoding::{Encoding, EncoderTrap};
use encoding::all::ISO_8859_1;

assert_eq!(ISO_8859_1.encode("caf\u{e9}", EncoderTrap::Strict),
           Ok(vec![99,97,102,233]));
```

---

# Go

Designed in 2009, with Rob Pike (→ UTF-8 guy) among its designers

* Some non-ASCII identifiers allowed: Unicode categories Lu, Ll, Lt, Lm, Lo, Nd
* Uses and assumes UTF-8 everywhere (like Rust)
* But a `string` need not be valid UTF-8

```
package main

import "fmt"

func main() {
    spätzle := "\xC8"
    fmt.Println(spätzle)
}
```

---

The Python 3 Unicode model <span class="section">[5/6]</span>

---

# Python 3: Source file encoding

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
```

No longer needed: "Using UTF-8 as the default source encoding" [PEP 3120](https://www.python.org/dev/peps/pep-3120/)

---

# str and bytes

str → a sequence of unicode points <br/>
bytes → a sequence of bytes in an implicit encoding

```
>>> p = '\xf0\x9f\x92\xa9'
>>> type(p)
<class 'str'>
>>> b = b'\xf0\x9f\x92\xa9'
>>> p == b
False
>>> print(b.decode('utf-8'))
💩
```
`__str__(self)` returns `str` <br/>
No `__unicode__(self)` anymore.
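
The difference between the two types can be made concrete with a short Python 3 sketch (variable names are illustrative). The escape-sequence `str` from above turns out to be four Latin-1 characters, not one emoji:

```python
text = '\U0001F4A9'                    # one code point
data = text.encode('utf-8')            # four bytes

lengths = (len(text), len(data))
round_trip = data.decode('utf-8') == text

# Decoding the same bytes as Latin-1 yields four unrelated characters,
# which is exactly the str written as '\xf0\x9f\x92\xa9'.
latin1_view = data.decode('latin-1')
same_as_escape_string = latin1_view == '\xf0\x9f\x92\xa9'
```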

---

<!--

# The unicodedata package [stdlib]

# The chardet package [pypi]

-->

# Python 3

**Changes:** [Text Vs. Data Instead Of Unicode Vs. 8-bit](https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit)

* Most importantly, you cannot apply string operations on byte sequences
* Unicode characters in identifiers

```
>>> spätzle = ''
>>> list(l for l in locals() if l.startswith('sp'))
['spätzle']
```

* No coercion of str and bytes

> Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. […]

[Text Vs. Data Instead Of Unicode Vs. 8-bit](https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit)
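
A minimal Python 3 sketch of what "no coercion" means: mixing `str` and `bytes` raises `TypeError`, and the decoding step has to be written out. The helper `join_text` and its `encoding` parameter are hypothetical:

```python
def join_text(prefix, suffix, encoding='utf-8'):
    """Concatenate text, decoding `suffix` explicitly if it arrives as bytes."""
    if isinstance(suffix, bytes):
        suffix = suffix.decode(encoding)
    return prefix + suffix


error = None
try:
    'pydays ' + b'vienna'              # worked silently in Python 2
except TypeError as exc:
    error = exc

joined = join_text('pydays ', b'vienna')
```

The explicit `decode` is where the encoding decision becomes visible, instead of being guessed as ASCII behind the scenes.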

---

Retrospective and End of Life of Python 2 <span class="section">[6/6]</span>

---

# Debate

Should UTF-8 have been backwards-incompatible?

---

# Guido van Rossum: BDFL Python 3 retrospective

> * List of "Python warts" were circulating
>     * e.g. unicode mess; long vs. int; two class systems; relative imports; int division vs. float division; comparisons; nonlocal variables; …
> * Not all warts could be fixed without breaking compatibility
>     * esp. the unicode mess […]
> * Don't break everything
> * Don't fall in the Perl 6 trap […]
> * We didn't make enough compatibility allowances (e.g. u"…")

→ see [PEP 0414](https://www.python.org/dev/peps/pep-0414/)

[Youtube: BDFL Python 3 retrospective](https://www.youtube.com/watch?v=Oiw23yfqQy8)

---

# Guido on Twitter

![Guido on Twitter: 2.7.15 is the last python2 release](img/20180430_guido_python_2.7.15_released.png)

---

# Guido on Twitter

![Guido on Twitter: 2.7.15 is the last python2 release](img/20180430_guido_python_2.7.15_released_complete.png)

---

# PEP 0373

> The End Of Life date (EOL, sunset date) for Python 2.7 has been moved five years into the future, to 2020

> Being the last of the 2.x series, 2.7 will have an extended period of maintenance. Specifically, 2.7 will receive bugfix support until January 1, 2020. All 2.7 development work will cease in 2020.

https://www.python.org/dev/peps/pep-0373/

---

# Please donate!

* The Unicode Consortium is a non-profit organization
* Associate your company with a character.
* Now you can adopt a character!
  * Gold: $5,000
  * Silver: $1,000
  * Bronze: $100

[Unicode Adopt-a-Character Submission Form](https://unicode.org/consortium/adopt-a-character.html)

---

# At the end of the day

We are still bad at encodings.

---

# At the end of the day

* Sorry for the PITA issues
* I think it was worth it - Unicode matters
* Unicode in Python is not perfect, but comparatively good
* We need to overcome this backwards-incompatible change
* Please port your libraries

---

# So, did I make it?

---

# Thank you!