Are line breaks encoded in Office Open XML?

✍️ Written on 2024-11-29 in 1088 words.

Motivation

During my talk on digital typesetting at the beginning of the month, Karl thoughtfully asked whether OOXML files store the automatic line breaks. What is so interesting about it?

Microsoft Word follows the WYSIWYG approach. The user essentially edits the final product. This contrasts with a markup approach where the user encodes the specification in a simple text file (like LaTeχ, Typst, or SILE do). So if we type something in Word, and store this content in a file, does it also store the automatic (and thus computed) line breaks?

If not, Microsoft Word might change the line breaks on a different device. On a different device the font (associated through the stored font name) might have different font metrics and thus generate different automatic line breaks, which can lead to different vertical paragraph lengths and thus layout changes. Or, more trivially, the line breaking algorithm on the other device might give different results after a software update.

On the other hand, if the line breaks are stored, they are preserved, but you might run into other problems. For example, a line might no longer fill the available width if the text alignment is set to ragged-right but the font metrics are different.

The legacy binary .doc format is proprietary, and we have little information about the stored semantics. With OOXML (the format behind .docx) the situation is better, because Microsoft standardized the file format. So let us have a look.

Input file

First, we consider a reference file. An Office Open XML (OOXML) file follows the ECMA-376 or ISO/IEC 29500-1:2016 standard (they are equivalent). The rendered document in Microsoft Word looks like this (note that the first line ends with “ccum.”):

Hello World reference document

If you look up the Open Packaging Conventions, you will see that this docx file is really just a ZIP archive containing a particular structure:

  • [Content_Types].xml

  • _rels

    • .rels

  • word

    • document.xml

    • _rels

      • document.xml.rels

    • theme

      • theme1.xml

    • settings.xml

    • styles.xml

    • webSettings.xml

    • fontTable.xml

  • docProps

    • core.xml

    • app.xml
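
To reproduce such a listing yourself, a few lines of Python are enough. This is only a minimal sketch: the helper names `list_parts` and `build_demo_package` are my own, and the demo archive is a tiny stand-in built in memory, not a valid Word document. For a real file you would pass the bytes of the .docx instead.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of all parts stored in a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny stand-in archive (hypothetical, NOT a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("docProps/app.xml", "<Properties/>")
    return buffer.getvalue()


if __name__ == "__main__":
    # Print one part name per line, in the order stored in the archive.
    for name in list_parts(build_demo_package()):
        print(name)
```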

docProps/app.xml

This file shows some metadata (I reformatted it with whitespace to make it readable):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://purl.oclc.org/ooxml/officeDocument/extendedProperties"
            xmlns:vt="http://purl.oclc.org/ooxml/officeDocument/docPropsVTypes">
  <Template>Normal.dotm</Template>
  <TotalTime>0</TotalTime>
  <Pages>2</Pages>
  <Words>504</Words>
  <Characters>2522</Characters>
  <Application>Microsoft Office Word</Application>
  <DocSecurity>0</DocSecurity>
  <Lines>252</Lines>
  <Paragraphs>275</Paragraphs>
  <ScaleCrop>false</ScaleCrop>
  <HeadingPairs>
    <vt:vector size="2" baseType="variant">
      <vt:variant>
        <vt:lpstr>Title</vt:lpstr>
      </vt:variant>
      <vt:variant>
        <vt:i4>1</vt:i4>
      </vt:variant>
    </vt:vector>
  </HeadingPairs>
  <TitlesOfParts>
    <vt:vector size="1" baseType="lpstr">
      <vt:lpstr></vt:lpstr>
    </vt:vector>
  </TitlesOfParts>
  <Company>typed;software</Company>
  <LinksUpToDate>false</LinksUpToDate>
  <CharactersWithSpaces>2751</CharactersWithSpaces>
  <SharedDoc>false</SharedDoc>
  <HyperlinksChanged>false</HyperlinksChanged>
  <AppVersion>16.0000</AppVersion>
</Properties>

Interestingly, it stores the number of lines, which already hints at how many line breaks the layout (supposedly?) contains.
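
If you want to pull these counters out programmatically, something like the following should do. It is a sketch under assumptions: `read_counts` and `local_name` are names I made up, the sample is a shortened copy of the file above, and I match on local element names so that the namespace URI does not matter.

```python
import xml.etree.ElementTree as ET

# Shortened copy of docProps/app.xml from above (only the counters).
APP_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://purl.oclc.org/ooxml/officeDocument/extendedProperties">
  <Pages>2</Pages>
  <Words>504</Words>
  <Lines>252</Lines>
  <Paragraphs>275</Paragraphs>
</Properties>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of names."""
    return tag.rsplit("}", 1)[-1]


def read_counts(app_xml: bytes) -> dict[str, int]:
    """Collect the integer counters stored as direct children of the root."""
    wanted = {"Pages", "Words", "Lines", "Paragraphs"}
    root = ET.fromstring(app_xml)
    return {
        local_name(child.tag): int(child.text)
        for child in root
        if local_name(child.tag) in wanted
    }


if __name__ == "__main__":
    print(read_counts(APP_XML))
```

Keep in mind that this is only a stored number. It tells us how many lines the application counted, not where they break.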

But the really interesting part is word/document.xml:

word/document.xml

<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="007C5FDB">
  <w:rPr>
    <w:rFonts w:ascii="Courier New" w:hAnsi="Courier New" w:cs="Courier New"/>
    <w:color w:val="000000"/>
    <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
    <w:lang w:val="de-DE"/>
  </w:rPr>
  <w:t>ccum</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r w:rsidRPr="007C5FDB">
  <w:rPr>
    <w:rFonts w:ascii="Courier New" w:hAnsi="Courier New" w:cs="Courier New"/>
    <w:color w:val="000000"/>
    <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
    <w:lang w:val="de-DE"/>
  </w:rPr>
  <w:t xml:space="preserve">. </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r w:rsidRPr="007C5FDB">
  <w:rPr>
    <w:rFonts w:ascii="Courier New" w:hAnsi="Courier New" w:cs="Courier New"/>
    <w:color w:val="000000"/>
    <w:shd w:val="clear" w:color="auto" w:fill="FFFFFF"/>
    <w:lang w:val="de-DE"/>
  </w:rPr>
  <w:t>Volon</w:t>
</w:r>

Here, we can spot our “ccum” text separated from “. ” because apparently each token gets its own r-element. And here we can see that neither the r-element nor the elements around it represent a line break. Well, really? It depends on the semantics of the elements. What does r or rPr stand for?
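
To see that the runs simply concatenate into the running text, here is a small sketch that collects the text of each r-element. The fragment is a simplified version of the excerpt above. The wrapping `w:p` element, the namespace URI (the one I know from transitional documents) and the helper names are my assumptions, and matching on local names sidesteps the question of which namespace a given file uses.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI as used by transitional WordprocessingML files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Simplified version of the excerpt above, wrapped in a paragraph element.
FRAGMENT = f"""<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:lang w:val="de-DE"/></w:rPr><w:t>ccum</w:t></w:r>
  <w:r><w:t xml:space="preserve">. </w:t></w:r>
  <w:r><w:t>Volon</w:t></w:r>
</w:p>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of names."""
    return tag.rsplit("}", 1)[-1]


def run_texts(paragraph_xml: str) -> list[str]:
    """Return the text of every run (r-element), one entry per run."""
    root = ET.fromstring(paragraph_xml)
    texts = []
    for element in root.iter():
        if local_name(element.tag) != "r":
            continue
        # A run may hold several t-elements; join them in document order.
        parts = [t.text or "" for t in element.iter() if local_name(t.tag) == "t"]
        texts.append("".join(parts))
    return texts


if __name__ == "__main__":
    print(run_texts(FRAGMENT))
    print("".join(run_texts(FRAGMENT)))
```

Joining the runs gives back “ccum. Volon” without any marker between “ccum.” and the next word, even though the rendered line ends there.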

The specification

If we open the specification document “Part 1: Fundamentals and Markup Language Reference”, we can find the documentation for the r-element (visible repeatedly above):

r (Text Run)

This element specifies a run of content in the parent field, hyperlink, custom XML element, structured document tag, smart tag, or paragraph.

— OOXML specification Part 1 section 17.3.2.25 page 292

Okay, there is no association with line breaks. How about rPr-elements?

rPr (Run Properties for the Paragraph Mark)

This element specifies the set of run properties applied to the glyph used to represent the physical location of the paragraph mark for this paragraph. […]

— OOXML specification Part 1 section 17.3.1.29 page 245

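Okay. Strictly speaking, this entry describes the rPr variant that sits inside paragraph properties; the rPr inside an r-element, as in the excerpt above, sets the formatting of that run. Either way, I don’t think any of the elements from above relate to line breaks. So the answer to our initial question is “no”. And is there even an element representing a line break in the specification?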

br (Break)

This element specifies that a break shall be placed at the current location in the run content. A break is a special character which is used to override the normal line breaking that would be performed based on the normal layout of the document’s contents. [Example: Normal breaking for English would occur only after a breaking space or optional hyphen character. end example]

The behavior of this break character (the location where text shall be restarted after this break) shall be determined by its type and clear attribute values, described below.

[…]

<w:r>
  <w:t>This is</w:t>
  <w:br/>
  <w:t xml:space="preserve"> a simple sentence.</w:t>
</w:r>
— OOXML specification Part 1 section 17.3.3.1 page 325
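
With that knowledge we can check a document for explicit breaks. The following is a rough sketch, not a complete tool: `break_types` is a hypothetical helper, the namespace URI is the one I know from transitional files, and I assume that a br-element without a type attribute is a text-wrapping (line) break.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI as used by transitional WordprocessingML files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The example from the specification, with the namespace declared.
SPEC_EXAMPLE = f"""<w:r xmlns:w="{W}">
  <w:t>This is</w:t>
  <w:br/>
  <w:t xml:space="preserve"> a simple sentence.</w:t>
</w:r>"""

# Simplified runs around the end of the first rendered line of our document.
WORD_EXCERPT = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>ccum</w:t></w:r>
  <w:r><w:t xml:space="preserve">. </w:t></w:r>
  <w:r><w:t>Volon</w:t></w:r>
</w:p>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts in front of names."""
    return tag.rsplit("}", 1)[-1]


def break_types(xml_text: str) -> list[str]:
    """Return the type of every br-element found in the given XML."""
    found = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "br":
            continue
        kind = "textWrapping"  # assumed default when no type is given
        for key, value in element.attrib.items():
            if local_name(key) == "type":
                kind = value
        found.append(kind)
    return found


if __name__ == "__main__":
    print(break_types(SPEC_EXAMPLE))
    print(break_types(WORD_EXCERPT))
```

The specification example yields one break, while the excerpt from our document yields none, which matches what we saw by reading the XML.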

Conclusion

OOXML could store automatic line breaks in the document. It does not. I believe the format prefers to let the layout change when a different font file is used rather than storing each automatic, computed line break. Good to know.