I would like to sum up the terminology (as far as I understand it):
- mojibake
  - Garbled text that results when bytes written in encoding X are decoded using encoding Y
- glyph
  - The visual shape used to draw a character / symbol
- character
  - An abstract unit of text (the digital counterpart of a glyph), identified by a code point
- code point
  - An integer (mostly listed as hex code) representing / referring to a character
- charset (abbr. character set)
  - A set of associations between code points and characters
- encoding
  - A set of conventions to transform a code point to a byte string (always with respect to a charset)
- character string
  - A string with 1 character as a unit
- byte string
  - A character string encoded in a specific encoding, with 1 byte as a unit
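To make the distinction between code point, character string and byte string concrete, here is a minimal Python 3 sketch. It assumes Python's built-in `str` (character string) and `bytes` (byte string) types as stand-ins for the terms above; the variable names are only illustrative.

```python
# Character string: the unit is one character.
text = "Ä"

# Code point: the integer the Unicode charset assigns to the character.
code_point = ord(text)                  # 196, i.e. 0xC4

# Encoding: turns the code point into a byte string.
utf8_bytes = text.encode("utf-8")       # two bytes: 0xC3 0x84
latin1_bytes = text.encode("latin-1")   # one byte:  0xC4

# Same character, same code point, different byte strings.
print(hex(code_point))                               # 0xc4
print(len(text), len(utf8_bytes), len(latin1_bytes)) # 1 2 1
```

The character string has length 1 in both cases; only the byte string changes with the encoding.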
For example, the glyph A consists of three lines and is the first letter of the Latin alphabet (which defines a set of characters). If you put two dots on top of it (Ä) and decode its UTF-8 representation (the two bytes 0xC3 0x84) as Windows-1252, you get the mojibake Ã„; with strict latin1 you get Ã followed by an invisible control character. The code point of A in ASCII-compatible charsets such as Unicode is 65 (0x41). The charset Unicode defines the association between 65 and A, and UTF-8 is one of the encodings for the Unicode charset. Only 1 byte is required to store 0x41 in memory in UTF-8, which is 01000001 in binary. So the byte string of A looks like this in binary: (01000001).
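The example can be checked with a short Python 3 sketch. It assumes Python's `cp1252` codec (Windows-1252, which is what many tools actually apply when they say latin1); the variable names are only illustrative.

```python
# Encode Ä as UTF-8, then decode the bytes with the wrong encoding.
raw = "Ä".encode("utf-8")            # b'\xc3\x84'
mojibake = raw.decode("cp1252")      # 'Ã„' (two characters instead of one)

# The damage is reversible as long as no byte was lost on the way.
repaired = mojibake.encode("cp1252").decode("utf-8")   # 'Ä'

# The letter A: code point 65, one byte in UTF-8.
a_code_point = ord("A")                  # 65 == 0x41
a_bytes = "A".encode("utf-8")            # b'A'
a_binary = format(a_bytes[0], "08b")     # '01000001'

print(a_code_point, hex(a_code_point), a_binary)   # 65 0x41 01000001
```

Decoding with Python's strict `latin-1` codec instead would yield `Ã` followed by the control character U+0084, which most fonts do not display.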