NF | D | Canonical Decomposition |
NF | C | Canonical Decomposition + Canonical Composition |
NF | KD | Compatibility Decomposition |
NF | KC | Compatibility Decomposition + Canonical Composition |
```
>>> import unicodedata
>>> unicodedata.normalize('NFC', cafe1) == unicodedata.normalize('NFC', cafe2)
True
>>> a = '\u212B' # ANGSTROM SIGN
>>> len(unicodedata.normalize('NFD', a))
2
>>> len(unicodedata.normalize('NFC', a))
1
>>> s = '\u1E9B\u0323' # LATIN SMALL LETTER LONG S WITH DOT ABOVE + COMBINING DOT BELOW
>>> len(unicodedata.normalize('NFKC', s))
1
>>> len(unicodedata.normalize('NFKD', s))
3
```
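To make the length differences above less abstract, here is a minimal sketch that lists the code points each normalization form produces for the ANGSTROM SIGN. The helper name `describe` is made up for this illustration and is not part of `unicodedata`.

```python
import unicodedata

def describe(text):
    """Return 'U+XXXX NAME' for every code point in text (hypothetical helper)."""
    return [f'U+{ord(ch):04X} {unicodedata.name(ch)}' for ch in text]

angstrom = '\u212B'  # ANGSTROM SIGN

# Print the code points produced by each of the four normalization forms.
for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    print(form, describe(unicodedata.normalize(form, angstrom)))

nfc = describe(unicodedata.normalize('NFC', angstrom))
nfd = describe(unicodedata.normalize('NFD', angstrom))
```

The composed forms collapse to a single code point, while the decomposed forms yield a base letter followed by a combining mark.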
---
# Unicode categories
https://www.unicode.org/reports/tr49/
```
>>> import unicodedata
>>> unicodedata.category('A')
'Lu'
>>> unicodedata.bidirectional('a')
'L'
>>> unicodedata.category('1')
'Nd'
>>> unicodedata.category('①') # U+2460 CIRCLED DIGIT ONE
'No'
>>> categories = {
... 'Pc', 'Pi', 'Sm', 'Pd', 'Mn', 'Sk', 'Lm', 'No',
... 'Cc', 'Ps', 'Nd', 'Ll', 'Lu', 'Lt', 'Me', 'Zp',
... 'Mc', 'Zs', 'Zl', 'Po', 'Cf', 'Pe', 'So', 'Pf',
... 'Lo', 'Nl', 'Sc'
... }
```
> "Unicode Character Categories" has been withdrawn. It was never formally approved.
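As a rough illustration of how the general categories show up in ordinary text, the sketch below counts the category of every character in a short sample string. The sample text is an arbitrary choice made for this example.

```python
import unicodedata
from collections import Counter

# Arbitrary sample: letters, digits, a circled digit and a currency sign.
text = 'Stra\u00DFe 42 \u2460 \u20AC'

# Map each character to its two-letter general category and count them.
counts = Counter(unicodedata.category(ch) for ch in text)

for category, count in sorted(counts.items()):
    print(category, count)
```

Here `Lu`/`Ll` cover the letters, `Nd` the decimal digits, `No` the circled digit, `Sc` the euro sign and `Zs` the spaces.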
---
# Collation
https://www.unicode.org/reports/tr10/
```
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_AT.UTF-8')
'de_AT.UTF-8'
>>> 'ö' < 'z'
False
```
--
* The result should be `True` in Austria (German sorts `ö` with `o`) but `False` in Sweden (`ö` sorts after `z`)
* Collation depends on the locale
* Python compares strings by Unicode code point and ignores the locale
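One way to get locale-aware ordering from the standard library is `locale.strxfrm` as a sort key. The sketch below assumes the `de_AT.UTF-8` locale is installed. If it is not, the hypothetical helper `sort_for_locale` returns `None` instead of failing. The actual ordering depends on the platform's locale data, so no result is shown.

```python
import locale

words = ['zebra', '\u00F6l', 'apfel']

# Plain sorted() compares code points, so the word starting with U+00F6
# lands after 'zebra'.
by_code_point = sorted(words)

def sort_for_locale(items, locale_name):
    """Sort items with locale.strxfrm; return None if the locale is missing."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous locale

by_locale = sort_for_locale(words, 'de_AT.UTF-8')
print(by_code_point)
print(by_locale)
```

`locale.strcoll` offers the same comparison for two strings at a time.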
---
# Unicode Regular expressions
https://www.unicode.org/reports/tr18/
```
>>> import unicodedata, re
>>> unicodedata.category('A')
'Lu'
>>> re.search(r'\p{Lu}', 'A')
```
* `\p{}` is a mechanism to match characters by category
* Python's built-in `re` module does not support `\p{}`
* Regular expressions essentially inherit all the other problems (such as casing)
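Since the standard library `re` module has no `\p{}` escape, a category match can be approximated with `unicodedata.category`. This is only a minimal sketch. The helper `find_category` is a made-up name, and it filters single characters rather than acting as a real regex engine. The third-party `regex` package is one option when `\p{Lu}` itself is needed.

```python
import unicodedata

def find_category(text, category):
    """Return the characters of text whose general category equals category."""
    return [ch for ch in text if unicodedata.category(ch) == category]

# Sample string with two non-ASCII uppercase letters (U+00C4 and U+00D6).
sample = '\u00C4rger im \u00D6tztal'
upper = find_category(sample, 'Lu')
print(upper)
```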
---
# Security issues: Domain names and Unicode
http://www.unicode.org/reports/tr36/#international_domain_names
* Invisible characters make two domain names look identical even though they differ
* Perfect for phishing attacks
* e.g. `U+200C ZERO WIDTH NON-JOINER`
* e.g. `U+200D ZERO WIDTH JOINER`
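The following sketch shows how such a spoofed name differs from the genuine one and how format characters (category `Cf`) could be flagged. The domain `example.com` and the helper `invisible_chars` are assumptions made for this illustration. Real IDN checks involve far more than this single test.

```python
import unicodedata

genuine = 'example.com'
spoofed = 'exam\u200Cple.com'  # ZERO WIDTH NON-JOINER hidden inside the name

def invisible_chars(name):
    """List (index, code point, name) for every format character (Cf)."""
    return [
        (i, f'U+{ord(ch):04X}', unicodedata.name(ch, '?'))
        for i, ch in enumerate(name)
        if unicodedata.category(ch) == 'Cf'
    ]

print(genuine == spoofed)          # the two strings are not equal
print(len(genuine), len(spoofed))  # the lengths differ by one
print(invisible_chars(spoofed))
```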
---
class: singleton, middle, center
The Python 2 Unicode model