Definitions in character terminology

I would like to sum up the terminology (as far as I understand it):

mojibake
Garbled text that results when bytes written in encoding X are decoded using a different encoding Y
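glyph
The visual shape used to render a character (the same character can have different glyphs in different fonts)
character
An abstract unit of text (letter, digit, symbol) that a charset identifies by a code point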
code point
An integer (usually written in hexadecimal) that refers to a character
charset (abbr. character set)
A set of associations between code points and characters
encoding
A set of conventions to transform a code point into a byte string (always with respect to a charset)
character string
A string with 1 glyph / character as a unit
byte string
Character string encoded (in a specific encoding)
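
To make the difference between a character string and a byte string concrete, here is a minimal Python 3 sketch (the variable names are my own, picked only for illustration). It encodes the same one-character string with two encodings and compares the lengths:

```python
# A character string: the unit is one character.
text = "Ä"
code_point = ord(text)            # 196, i.e. 0xC4 in Unicode

# Byte strings: the same character string, encoded two different ways.
as_utf8 = text.encode("utf-8")    # two bytes: 0xC3 0x84
as_latin1 = text.encode("latin-1")  # one byte: 0xC4

print(len(text), hex(code_point))     # 1 0xc4
print(len(as_utf8), as_utf8.hex())    # 2 c384
print(len(as_latin1), as_latin1.hex())  # 1 c4
```

The character string always has length 1, while the length of the byte string depends on which encoding was used.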

For example, the glyph A consists of three lines and is the first letter of the Latin alphabet (which defines a set of characters). If you put two dots on top of it (Ä) and decode its UTF-8 representation (the bytes 0xC3 0x84) as Windows-1252, you get the mojibake Ã„; with strict latin1 you get Ã followed by an invisible control character. A's code point in ASCII-compatible charsets such as Unicode is 65 (0x41). The charset Unicode defines the association between 65 and A, and UTF-8 is one of the encodings for the charset Unicode. Only 1 byte is required to store 0x41 in memory (in the UTF-8 encoding), and that byte is 01000001 in binary. So the byte string of A looks like this in binary: (01000001).
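
The example can be checked with a few lines of Python 3. This is only a sketch using the built-in codecs `utf-8`, `cp1252` (Windows-1252) and `latin-1`; the variable names are my own:

```python
# Encode Ä as UTF-8, then decode the bytes with the wrong encoding.
utf8_bytes = "Ä".encode("utf-8")               # b'\xc3\x84'
mojibake_cp1252 = utf8_bytes.decode("cp1252")   # 'Ã„'
mojibake_latin1 = utf8_bytes.decode("latin-1")  # 'Ã' plus control char U+0084

# The letter A: code point 65 (0x41), a single byte in UTF-8.
a_code_point = ord("A")
a_bytes = "A".encode("utf-8")
a_bits = format(a_bytes[0], "08b")              # '01000001'

print(mojibake_cp1252)
print(a_code_point, hex(a_code_point), a_bits)
```

Decoding the bytes with the encoding they were written in (`utf8_bytes.decode("utf-8")`) gives Ä back, so the mojibake comes only from the mismatch between the two encodings.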
