Basic XML Q&A

Written on 2022-04-17 in 2221 words ✍️.
Part of cs software-development digital-typesetting

Motivation

XML is a fundamental, established, well-developed standard. Its reputation among software developers is controversial, though. XML got a lot of momentum from the Java community, and as a result it is often associated with bloated and verbose APIs. These days, XML serves as the more verbose big brother of JSON and YAML.

In project typho, I consider text documents fundamentally as trees. XML provides the most advanced toolchain to process trees, so XML is of interest. But what are the details of the specification? Here I answer some questions for myself.

TL;DR

Which character encodings are allowed/defined?

Infinitely many. {Unicode, ISO 8859, JIS} are named explicitly. Others should be referred to by their IANA-registered names. Custom ones should be prefixed with x-.

Which versions exist?

XML 1.0 and XML 1.1

Why should one use XML 1.1 instead of XML 1.0?

Mainly, XML 1.1 allows more characters since Unicode continued to develop.

Open-close elements versus empty elements

<a></a> is semantically equivalent to <a/>.

Which characters are allowed in XML element names?

Many. It tries to be as inclusive as possible in terms of writing systems. But the first codepoint must not be a digit, dot, hyphen or combining character. Question mark and exclamation mark are two codepoints which are not permitted. Interestingly, “·” (U+00B7 MIDDLE DOT) can be used after the first code point.

Which characters are allowed for XML attributes?

Exactly the same as for XML element names.

Which whitespace characters might be used within tags?

Only four, namely {U+0020 SPACE, U+0009 CHARACTER TABULATION, U+000D CARRIAGE RETURN, U+000A LINE FEED}.

Which entities are predefined in XML?

Only five are defined, namely {U+003C LESS-THAN SIGN, U+003E GREATER-THAN SIGN, U+0026 AMPERSAND, U+0022 QUOTATION MARK, U+0027 APOSTROPHE}.

Which elements are reserved?

The ones starting with case-insensitive text “xml”.

Which attributes are reserved?

xml:lang and xml:space.

Which related specifications exist?

Many, but the following is a list to start with. By the way, XSL = {XSLT, XSL-FO, XPath}.

(The character questions have the same answers for XML 1.0 and 1.1)

Details

Which character encodings are allowed/defined?

Names are case-insensitive. The following values should be used for Unicode and ISO/IEC 10646:

UTF-8
UTF-16
ISO-10646-UCS-2
ISO-10646-UCS-4

The following values should be used for the parts of ISO 8859:

ISO-8859-X where X is the part number of ISO 8859

The following values should be used for encoded forms of JIS X-0208-1997:

ISO-2022-JP
Shift_JIS
EUC-JP

Other standardized encodings should be referred to by their IANA-registered names. “Processors are, of course, not required to support all IANA-registered encodings”. Custom ones should use a name starting with x-. Thus, the answer is “open-ended; the set of encodings a document may declare is unbounded, but a given processor supports only some of them”.
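To make the role of the encoding declaration concrete, here is a minimal sketch using Python's standard library. It assumes that the parser supports both UTF-8 and ISO-8859-1, which I expect to hold for common installations; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The same document, serialized once as ISO-8859-1 and once as UTF-8.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><name>Müller</name>'.encode('iso-8859-1')
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><name>Müller</name>'.encode('utf-8')

# The parser reads the encoding declaration and decodes the bytes accordingly.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_text, utf8_text)
print(latin1_doc == utf8_doc)  # False: the byte sequences differ
```

Both calls yield the same text even though the bytes on disk differ, which is exactly what the declaration is for.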

Which versions exist?

Why should one use XML 1.1 instead of XML 1.0?

XML relies on Unicode, and Unicode evolved. XML 1.1 tried to catch up with the Unicode standard. The rationale document puts it this way:

“Characters not present in Unicode 2.0 may already be used in XML 1.0 character data”.
“The overall philosophy of names has changed since XML 1.0. Whereas XML 1.0 provided a rigid definition of names, wherein everything that was not permitted was forbidden, XML 1.1 names are designed so that everything that is not forbidden (for a specific reason) is permitted.”
“Therefore XML 1.1 adds NEL (#x85) to the list of line-end characters. For completeness, the Unicode line separator character, #x2028, is also supported.”
“Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0”

(via Rationale and list of changes for XML 1.1)

In fact, the notions of RestrictedChar and normalization were introduced.
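The control character rule can be observed from the XML 1.0 side. The following is a small sketch, assuming that Python's standard library parser (which, as far as I know, is built on Expat and follows XML 1.0 rules) is used; the helper parses is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the stdlib parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


tab_reference = '<a>&#x9;</a>'      # TAB is a legal character in XML 1.0
control_reference = '<a>&#x1;</a>'  # U+0001 is not a legal XML 1.0 character

print(parses(tab_reference))
print(parses(control_reference))
```

An XML 1.1 processor would be expected to accept the second document when it declares version 1.1, but this sketch does not show that.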

Is there a semantic difference between open-close elements `<a></a>` and empty elements `<a/>`?

No. “An element with no content is said to be empty. The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag.”
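A quick way to convince yourself is to parse both forms and compare the results. This is only a sketch with Python's xml.etree.ElementTree; other parsers may expose the distinction through lower-level events.

```python
import xml.etree.ElementTree as ET

open_close = ET.fromstring('<a></a>')
empty_tag = ET.fromstring('<a/>')

# Both forms end up as an element without text and without children.
for element in (open_close, empty_tag):
    print(element.tag, element.text, len(element), ET.tostring(element))
```

After parsing, the tree no longer remembers which of the two notations was used in the source.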

Characters of XML elements

In XML 1.0, the characters are described by …

Name          ::= NameStartChar (NameChar)*
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

Thus, you need to distinguish between the start character and the subsequent characters.

Regarding the start characters, there are 971,506 Unicode code points if you generate this set (according to my Python program below). As pointed out in the spec, it tries to be as inclusive as possible in terms of writing systems. But to give you an idea, the first characters are …

:ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷ

Here characters like “?” (U+003F QUESTION MARK) and “!” (U+0021 EXCLAMATION MARK) are not included. The subsequent characters use the same set, extended by a few more:

“-” (U+002D HYPHEN-MINUS)
“.” (U+002E FULL STOP)
digits 0–9 (U+0030 to U+0039)
“·” (U+00B7 MIDDLE DOT)
“‿” (U+203F UNDERTIE)
“⁀” (U+2040 CHARACTER TIE) or
one of the combining characters (macron, bridge, grapheme joiner, …).

And in case you are wondering, bidi control characters cannot be used. The colon gets additional semantics through the XML namespace standard.
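The Name production translates almost directly into a regular expression. Below is a minimal sketch that transcribes the ranges quoted above; NAME_START, NAME_CHAR, NAME_RE and is_xml_name are names I made up, and the function checks only the Name grammar, not namespace rules or reserved prefixes.

```python
import re

# NameStartChar, transcribed from the grammar (regex escapes, raw strings).
NAME_START = (
    r':A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D'
    r'\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF'
    r'\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF'
)
# NameChar = NameStartChar plus hyphen, dot, digits, middle dot,
# combining characters and the two tie characters.
NAME_CHAR = NAME_START + r'\-.0-9\u00B7\u0300-\u036F\u203F-\u2040'

NAME_RE = re.compile('[{}][{}]*'.format(NAME_START, NAME_CHAR))


def is_xml_name(candidate):
    """Return True if the whole string matches the Name production."""
    return NAME_RE.fullmatch(candidate) is not None


for candidate in ['a-b.c1', 'xs:element', 'a·b', '1a', '-a', '·a', 'a?']:
    print(repr(candidate), is_xml_name(candidate))
```

The first three candidates are valid names; the remaining four fail, either because of their first character or because of the question mark.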

Characters of XML attributes

Same as XML elements.

Which whitespace characters might be used within tags?

Wikipedia lists 25 code points with property White_Space. However, XML only permits:

U+0020 SPACE
U+0009 CHARACTER TABULATION
U+000D CARRIAGE RETURN
U+000A LINE FEED

Thus, apart from carriage return and line feed, no character that might introduce a new line (such as U+2028 LINE SEPARATOR) counts as whitespace here. No-break space (U+00A0 NO-BREAK SPACE) is excluded as well.
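The restriction is easy to test. In this sketch (again assuming Python's standard library parser; the variable names are hypothetical), the first document separates attributes with tab, carriage return, line feed and space, while the second one uses a no-break space.

```python
import xml.etree.ElementTree as ET

allowed = '<a\tx="1"\r\n y="2" />'   # TAB, CR, LF and SPACE inside the tag
with_nbsp = '<a\u00A0x="1" />'       # NO-BREAK SPACE instead of whitespace

attributes = ET.fromstring(allowed).attrib
print(attributes)

try:
    ET.fromstring(with_nbsp)
    nbsp_accepted = True
except ET.ParseError as error:
    nbsp_accepted = False
    print('rejected:', error)
```

The no-break space is neither whitespace nor a name character, so the parser reports the second document as not well-formed.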

Which entities are predefined in XML?

The following XML entities are predefined:

&lt; (“<”, U+003C LESS-THAN SIGN)
&gt; (“>”, U+003E GREATER-THAN SIGN)
&amp; (“&”, U+0026 AMPERSAND)
&quot; (“"”, U+0022 QUOTATION MARK)
&apos; (“'”, U+0027 APOSTROPHE)
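As an illustration, the following sketch resolves the five predefined entities with Python's standard library, escapes a string in the other direction, and shows that an entity known from HTML (nbsp) is not defined in plain XML. The variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing resolves the five predefined entities to their characters.
text = ET.fromstring('<a>&lt;&gt;&amp;&quot;&apos;</a>').text
print(text)

# escape() handles &, < and > by default; quotes are passed explicitly.
escaped = escape('1 < 2 & "x"', {'"': '&quot;', "'": '&apos;'})
print(escaped)

# Any other entity must be declared in a DTD first.
try:
    ET.fromstring('<a>&nbsp;</a>')
    nbsp_accepted = True
except ET.ParseError:
    nbsp_accepted = False
print(nbsp_accepted)
```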

Which elements are reserved?

“Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.” (via Common Syntactic Constructs)
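Expressed as code, the rule is a case-insensitive prefix check. This is a tiny sketch; is_reserved_name is a hypothetical helper and not part of any library.

```python
def is_reserved_name(name):
    """Return True if the name starts with 'xml' in any letter case."""
    return name[:3].lower() == 'xml'


for name in ['xml-stylesheet', 'XmLfoo', 'xml:lang', 'exml', 'x']:
    print(name, is_reserved_name(name))
```

Note that parsers do not necessarily reject such names; the specification merely reserves them.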

Which attributes are reserved?

Which XML libraries exist in Rust?

XML reader/writer: xml-rs, xml-doc, quick-xml, treexml, rustyxml, yaserde, xml_serde, maybe_xml, easy-xml, sxd-document, …
XPath: libxml, amxml, xrust, sxd-xpath
XQuery: (none)
XML Schema: xml-schema
XSLT: (none)
XProc: (none)
XSL-FO: (none)
DOM: xml_dom

schema-analysis tries to abstract over multiple data serialization formats.

Appendix: My Python program to generate the codepoints from the specification

#!/usr/bin/env python3

import re
import unicodedata


def hexspecifier_to_codepoint(specifier):
    if len(specifier) == 1:
        return ord(specifier)
    m = re.match('#x(.+)$', specifier)
    assert m, 'unknown range specifier {!r}'.format(specifier)
    return int(m.group(1), 16)

def range_to_codepoints(match):
    start, end = match.group(1), match.group(2)
    start, end = hexspecifier_to_codepoint(start), hexspecifier_to_codepoint(end)
    return [chr(codepoint) for codepoint in range(start, end + 1)]


if __name__ == '__main__':
    firstchar_spec = '''":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]'''
    nextchars_without_firstchar_spec = '''
    "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
    '''

    spec = firstchar_spec

    specs = spec.strip().split(' | ')
    chars = []
    for s in specs:
        single_char = re.match('"(.)"', s)
        hex_spec = re.match('(#x[0-9A-F]+)', s)
        range_spec = re.match(r'\[([^-]+?)-([^\]]+?)\]', s)

        if single_char:
            chars.append(single_char.group(1))
        elif hex_spec:
            chars.append(chr(hexspecifier_to_codepoint(hex_spec.group(1))))
        elif range_spec:
            chars.extend(range_to_codepoints(range_spec))
        else:
            raise ValueError("Unknown specifier '{}'".format(s))

    if len(chars) > 400:
        for i in range(0, 20):
            print('   '.join(chars[20 * i:20 * i + 20]))

        if len(chars) > 400:
            print('… and many more ({} in total)'.format(len(chars)))
    else:
        for char in chars:
            print('{}  U+{:04X} {}'.format(char, ord(char), unicodedata.name(char)))

Conclusion

XML has a really concise definition. It was nice to answer some fundamental questions I had. However, the further standards (beyond de/serialization use cases) do not have good tool support outside the Java community (and maybe the C community). At least, that is what I observed in the Rust community while writing this article.