Motivation
XML is a fundamental, established, well-developed standard. Its reputation among software developers is controversial, though. XML got a lot of momentum from the Java community, and as a result it is often associated with bloated and verbose APIs. These days, XML serves as the more verbose big brother of JSON and YAML.
In project typho, I consider text documents fundamentally as trees. XML provides the most advanced toolchain to process trees, which makes it interesting to me. But what are the details of the specification? Here I answer some questions for myself.
TL;DR
- Which character encodings are allowed/defined?
  Infinitely many. {Unicode, ISO 8859, JIS} are named explicitly. Others shall be named according to IANA. Custom ones shall be prefixed with “x-”.
- Which versions exist?
  XML 1.0 and XML 1.1.
- Why should one use XML 1.1 instead of XML 1.0?
  Mainly, XML 1.1 allows more characters since Unicode continued to develop.
- Open-close elements versus empty elements
  <a></a> is semantically equivalent to <a/>.
- Which characters are allowed in XML element names?
  Many. It tries to be as inclusive as possible in terms of writing systems. But the first code point must not be a digit, dot, hyphen or combining character. Question mark and exclamation mark are two examples of code points which are not permitted. Interestingly, “·” (U+00B7 MIDDLE DOT) can be used after the first code point.
- Which characters are allowed for XML attributes?
  Exactly the same as for XML element names.
- Which whitespace characters might be used within tags?
  Only four, namely {U+0020 SPACE, U+0009 CHARACTER TABULATION, U+000D CARRIAGE RETURN, U+000A LINE FEED}.
- Which entities are predefined in XML?
  Only five are defined, namely {U+003C LESS-THAN SIGN, U+003E GREATER-THAN SIGN, U+0026 AMPERSAND, U+0022 QUOTATION MARK, U+0027 APOSTROPHE}.
- Which elements are reserved?
  The ones starting with the case-insensitive text “xml”.
- Which attributes are reserved?
  xml:lang and xml:space.
- Which related specifications exist?
  Many, but the following is a list to start with. By the way, XSL = {XSLT, XSL-FO, XPath}.
(The character questions have the same answers for XML 1.0 and 1.1)
Details
Which character encodings are allowed/defined?
Names are case-insensitive. The following values should be used for Unicode and ISO/IEC 10646:
- UTF-8
- UTF-16
- ISO-10646-UCS-2
- ISO-10646-UCS-4
The following values should be used for the parts of ISO 8859:
- ISO-8859-X, where X is the part number of ISO 8859
The following values should be used for encoded forms of JIS X-0208-1997:
- ISO-2022-JP
- Shift_JIS
- EUC-JP
Other standardized encodings should be specified with their IANA-registered names. “Processors are, of course, not required to support all IANA-registered encodings”. Custom ones should use a name starting with “x-”. Thus, the answer is “open-ended; there are infinitely many supported encodings”.
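To make the encoding declaration a bit more tangible, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name text_with_declared_encoding is made up for this illustration, and ISO-8859-1 is picked only because it is one of the explicitly named encodings; whether a given parser accepts any other encoding name is up to that parser.

```python
import xml.etree.ElementTree as ET


def text_with_declared_encoding(encoding_name):
    """Parse a tiny Latin-1 document whose XML declaration names the encoding."""
    declaration = '<?xml version="1.0" encoding="{}"?>'.format(encoding_name)
    # "é" becomes the single byte 0xE9 in ISO 8859-1.
    body = "<a>é</a>".encode("iso-8859-1")
    return ET.fromstring(declaration.encode("ascii") + body).text


if __name__ == "__main__":
    # The encoding name is matched case-insensitively.
    for name in ("ISO-8859-1", "iso-8859-1"):
        print(name, repr(text_with_declared_encoding(name)))
```

The parser decodes the byte 0xE9 according to the declared encoding, so the element text comes back as a regular Unicode string.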
Which versions exist?
- XML 1.0 “W3C Recommendation 26 November 2008” (Fifth Edition)
- XML 1.0 “W3C Proposed Edited Recommendation 05 February 2008” (Fifth Edition)
- XML 1.0 “W3C Recommendation 16 August 2006, edited in place 29 September 2006” (Fourth Edition)
- XML 1.0 “W3C Recommendation 04 February 2004” (Third Edition)
- XML 1.0 “W3C Proposed Edited Recommendation 30 October 2003” (Third Edition)
- XML 1.0 “W3C Recommendation 6 October 2000” (Second Edition)
- XML 1.0 “W3C Recommendation 10-February-1998” (First Edition)
- XML 1.1 “W3C Recommendation 16 August 2006, edited in place 29 September 2006” (Second Edition)
- XML 1.1 “W3C Proposed Edited Recommendation 14 June 2006” (Second Edition)
- XML 1.1 “W3C Recommendation 04 February 2004, edited in place 15 April 2004”
Why should one use XML 1.1 instead of XML 1.0?
XML relies on Unicode, and Unicode evolved. XML 1.1 is an attempt to catch up with the Unicode standard. The following quotes are taken from “Rationale and list of changes for XML 1.1”:
- “Characters not present in Unicode 2.0 may already be used in XML 1.0 character data”.
- “The overall philosophy of names has changed since XML 1.0. Whereas XML 1.0 provided a rigid definition of names, wherein everything that was not permitted was forbidden, XML 1.1 names are designed so that everything that is not forbidden (for a specific reason) is permitted.”
- “Therefore XML 1.1 adds NEL (#x85) to the list of line-end characters. For completeness, the Unicode line separator character, #x2028, is also supported.”
- “Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0”
In addition, XML 1.1 introduced the notions of RestrictedChar and of normalization.
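The XML 1.0 side of the control-character rule can be observed with a minimal sketch based on Python's standard xml.etree.ElementTree, which sits on top of an XML 1.0 parser. The helper name parses is made up for this illustration; the sketch only shows what XML 1.0 rejects and does not demonstrate XML 1.1 behaviour.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the document is accepted by the XML 1.0 parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # A reference to TAB (#x9) is fine in XML 1.0 ...
    print(parses("<a>&#x9;</a>"))
    # ... but a reference to the control character #x1 is not.
    print(parses("<a>&#x1;</a>"))
```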
Is there a semantic difference between open-close elements <a></a> and empty elements <a/>?
No. “An element with no content is said to be empty. The representation of an empty element is either a start-tag immediately followed by an end-tag, or an empty-element tag.”
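As a quick check of this equivalence, the following sketch parses both spellings with Python's standard xml.etree.ElementTree and compares what the parser reports. The helper name describe is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def describe(document):
    """Summarize the root element: tag, text, number of children, attributes."""
    root = ET.fromstring(document)
    return (root.tag, root.text, len(root), dict(root.attrib))


if __name__ == "__main__":
    print(describe("<a></a>"))
    print(describe("<a/>"))
    print(describe("<a></a>") == describe("<a/>"))
```

After parsing, the two forms can no longer be told apart, which is why serializers are free to pick either one.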
Characters of XML elements
In XML 1.0 (Fifth Edition), names are described by the following productions:
Name ::= NameStartChar (NameChar)*
NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
Thus, you need to distinguish between the start character and the subsequent characters.
Regarding the start characters, the set contains 971,506 Unicode code points (according to my Python program below). As pointed out in the spec, it tries to be as inclusive as possible in terms of writing systems. But to give you an idea, the first characters are …
:ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıĲĳĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƁƂƃƄƅƆƇƈƉƊƋƌƍƎƏƐƑƒƓƔƕƖƗƘƙƚƛƜƝƞƟƠơƢƣƤƥƦƧƨƩƪƫƬƭƮƯưƱƲƳƴƵƶƷ
Characters like “?” (U+003F QUESTION MARK) and “!” (U+0021 EXCLAMATION MARK) are not included here. Subsequent characters may come from the same set, extended by the following:
- “-” (U+002D HYPHEN-MINUS)
- “.” (U+002E FULL STOP)
- digits 0–9 (U+0030 to U+0039)
- “·” (U+00B7 MIDDLE DOT)
- “‿” (U+203F UNDERTIE)
- “⁀” (U+2040 CHARACTER TIE)
- the combining characters U+0300 to U+036F (macron, bridge, grapheme joiner, …)
In case you are wondering, bidi control characters cannot be used. The colon gets additional semantics through the XML namespace standard.
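The two productions translate almost directly into a regular expression. The following is a minimal sketch, assuming the Fifth Edition productions quoted above; the names NAME_START_CHAR, NAME_CHAR and is_xml_name are made up for this illustration, and namespace rules for the colon are ignored.

```python
import re

# Character-class body for NameStartChar, copied range by range from the spec.
NAME_START_CHAR = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# NameChar = NameStartChar plus hyphen, dot, digits, middle dot and two ranges.
NAME_CHAR = NAME_START_CHAR + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = re.compile("[{}][{}]*".format(NAME_START_CHAR, NAME_CHAR))


def is_xml_name(candidate):
    """Return True if the whole string matches the Name production."""
    return NAME_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    for candidate in ("a-b.c1", "élément", "a·b", "1a", "-a", "·a", "a?"):
        print(repr(candidate), is_xml_name(candidate))
```

The first three candidates are valid names, while the remaining four fail either on their first character or on a character that is not permitted at all.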
Characters of XML attributes
Same as for XML element names; both are governed by the Name production.
Which whitespace characters might be used within tags?
- U+0020 SPACE
- U+0009 CHARACTER TABULATION
- U+000D CARRIAGE RETURN
- U+000A LINE FEED
Thus, any other character that could introduce a new line is excluded. The no-break space (U+00A0 NO-BREAK SPACE) is excluded as well.
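A small sketch with Python's standard xml.etree.ElementTree illustrates the difference between the four permitted whitespace characters and the no-break space. The helper name attributes is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def attributes(document):
    """Return the root element's attributes, or None if parsing fails."""
    try:
        return dict(ET.fromstring(document).attrib)
    except ET.ParseError:
        return None


if __name__ == "__main__":
    # TAB, CR LF and SPACE all separate attributes inside a tag.
    print(attributes('<a\tb="1"\r\nc="2" d="3"/>'))
    # NO-BREAK SPACE is not whitespace in the XML sense, so this is rejected.
    print(attributes('<a\u00a0b="1"/>'))
```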
Which entities are predefined in XML?
The following XML entities are predefined:
- &lt; (“<”, U+003C LESS-THAN SIGN)
- &gt; (“>”, U+003E GREATER-THAN SIGN)
- &amp; (“&”, U+0026 AMPERSAND)
- &quot; (“"”, U+0022 QUOTATION MARK)
- &apos; (“'”, U+0027 APOSTROPHE)
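The following sketch, again with Python's standard xml.etree.ElementTree, shows that the five predefined entities resolve without any declaration, whereas an entity known from HTML such as &nbsp; does not. The helper name text_of is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def text_of(document):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


if __name__ == "__main__":
    # All five predefined entities in a row.
    print(repr(text_of("<a>&lt;&gt;&amp;&quot;&apos;</a>")))
    # Not predefined in XML and not declared here, hence rejected.
    print(repr(text_of("<a>&nbsp;</a>")))
```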
Which elements are reserved?
“Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.” (via Common Syntactic Constructs)
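The pattern from the quote can be mirrored with a regular expression. This is only a sketch; the names RESERVED_PREFIX and is_reserved_name are made up for this illustration.

```python
import re

# Mirrors (('X'|'x') ('M'|'m') ('L'|'l')) at the beginning of a name.
RESERVED_PREFIX = re.compile(r"[Xx][Mm][Ll]")


def is_reserved_name(name):
    """Return True if the name starts with "xml" in any letter case."""
    return RESERVED_PREFIX.match(name) is not None


if __name__ == "__main__":
    for name in ("xml-stylesheet", "XmlFoo", "xhtml", "myxml"):
        print(name, is_reserved_name(name))
```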
Which XML libraries exist in Rust?
schema-analysis tries to abstract over multiple data serialization formats.
Appendix: My Python program to generate the codepoints from the specification
#!/usr/bin/env python3
"""Expand an XML character-class specification into the list of its characters."""

import re
import unicodedata


def hexspecifier_to_codepoint(specifier):
    """Convert a literal character or a '#xHHHH' specifier to a code point."""
    if len(specifier) == 1:
        return ord(specifier)
    m = re.match('#x(.+)$', specifier)
    assert m, "unknown range specifier '{}'".format(specifier)
    return int(m.group(1), 16)


def range_to_codepoints(match):
    """Expand a matched '[start-end]' range into a list of characters."""
    start, end = match.group(1), match.group(2)
    start, end = hexspecifier_to_codepoint(start), hexspecifier_to_codepoint(end)
    return [chr(codepoint) for codepoint in range(start, end + 1)]


if __name__ == '__main__':
    firstchar_spec = '''":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]'''
    nextchars_without_firstchar_spec = '''
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
'''

    # Pick the specification to expand here.
    spec = firstchar_spec
    specs = spec.strip().split(' | ')

    chars = []
    for s in specs:
        single_char = re.match('"(.)"', s)
        hex_spec = re.match('(#x[0-9A-F]+)', s)
        range_spec = re.match(r'\[([^-]+?)-([^\]]+?)\]', s)
        if single_char:
            chars.append(single_char.group(1))
        elif hex_spec:
            chars.append(chr(hexspecifier_to_codepoint(hex_spec.group(1))))
        elif range_spec:
            chars.extend(range_to_codepoints(range_spec))
        else:
            raise ValueError("Unknown specifier '{}'".format(s))

    if len(chars) > 400:
        # Large set: print only the first 400 characters as 20 rows of 20.
        for i in range(0, 20):
            print(' '.join(chars[20 * i:20 * i + 20]))
        print('… and many more ({} in total)'.format(len(chars)))
    else:
        # Small set: print every character with its code point and name.
        for char in chars:
            print('{} U+{:04X} {}'.format(char, ord(char), unicodedata.name(char, '<unnamed>')))
Conclusion
XML has a really concise definition. It was nice to answer some fundamental questions I had. However, the related standards (beyond the de/serialization use cases) lack good tool support outside the Java community (and maybe the C community). At least, this is what I observed in the Rust community while writing this article.