Pandoc's data model

✍️ Written on 2021-09-13 in 596 words. Part of cs software-development digital-typesetting

Motivation

Pandoc tries to provide universal conversion between markup languages. To do so, it needs an internal representation that covers the data models of all supported markup languages; in other words, pandoc’s data model should be a superset of those data models. That makes it interesting to look at.

Structure of pandoc

In the beginning, I looked at pandoc’s source code and just couldn’t determine where the fundamental data structures are defined. I started with parseFromString but just couldn’t find the definition of the Pandoc type. After Benedikt reminded me how modules in Haskell work, we were able to find the pandoc-types repository. Haskell’s module system is archaic, but algebraic data types are beautiful to read. So there is not much original contribution in this article, but let’s review the data model.

Data model

I write down the data model in pseudo-Rust code (remember that Rust’s type system is in many ways based on Haskell).

struct Pandoc {
  meta: Meta,
  blocks: Vec<Block>
}

use std::collections::HashMap;

type Format = String;
type Meta = HashMap<String, MetaValue>;

enum MetaValue {
	Map(HashMap<String, MetaValue>),
	List(Vec<MetaValue>),
	Bool(bool),
	String(String),
	Inlines(Vec<Inline>),
	Blocks(Vec<Block>)
}

// https://github.com/jgm/pandoc-types/blob/master/src/Text/Pandoc/Definition.hs#L272
enum Block {
	Plain(Vec<Inline>),
	Para(Vec<Inline>),
	Line(Vec<Vec<Inline>>),
	Code((Attr, String)),
	Raw((Format, String)),
	Quote(Vec<Block>),
	OrderedList((ListAttributes, Vec<Vec<Block>>)),
	BulletList(Vec<Vec<Block>>),
	DefinitionList(Vec<(Vec<Inline>, Vec<Vec<Block>>)>),
	Header((i32, Attr, Vec<Inline>)),
	HorizontalRule,
	Table((Attr, Caption, Vec<ColSpec>, TableHead, Vec<TableBody>, TableFoot)),
	Div((Attr, Vec<Block>)),
	Null
}

enum Inline {
	Str(String),
	Emph(Vec<Inline>),
	Underline(Vec<Inline>),
	Strong(Vec<Inline>),
	StrikeOut(Vec<Inline>),
	Superscript(Vec<Inline>),
	Subscript(Vec<Inline>),
	SmallCaps(Vec<Inline>),
	Quoted((QuoteType, Vec<Inline>)),
	Cite((Vec<Citation>, Vec<Inline>)),
	Code((Attr, String)),
	Space,
	SoftBreak,
	LineBreak,
	Math((MathType, String)),
	RawInline((Format, String)),
	Link((Attr, Vec<Inline>, Target)),
	Image((Attr, Vec<Inline>, Target)),
	Note(Vec<Block>),
	Span((Attr, Vec<Inline>))
}

struct Citation {
	id: String,
	prefix: Vec<Inline>,
	suffix: Vec<Inline>,
	mode: CitationMode,
	note_num: i32,
	hash: i32
}

enum CitationMode {
	AuthorInText,
	SuppressAuthor,
	NormalCitation
}

enum QuoteType {
	SingleQuote,
	DoubleQuote
}

type Target = (String, String);

enum MathType {
	DisplayMath,
	InlineMath
}

struct ListAttributes((i32, ListNumberStyle, ListNumberDelim));

enum ListNumberStyle {
	DefaultStyle,
	Example,
	Decimal,
	LowerRoman,
	UpperRoman,
	LowerAlpha,
	UpperAlpha
}

enum ListNumberDelim {
	DefaultDelim,
	Period,
	OneParen,
	TwoParens
}

struct Attr((String, Vec<String>, Vec<(String, String)>));

type RowHeadColumns = i32;

enum Alignment {
	Left,
	Right,
	Center,
	Default
}

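// In pandoc-types, a column width is either an explicit fraction or a default.
enum ColWidth {
	ColWidth(f64),
	ColWidthDefault
}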

struct ColSpec((Alignment, ColWidth));

struct Row((Attr, Vec<Cell>));

struct TableHead((Attr, Vec<Row>));
struct TableBody((Attr, RowHeadColumns, Vec<Row>, Vec<Row>));
struct TableFoot((Attr, Vec<Row>));

type ShortCaption = Vec<Inline>;

struct Caption((Option<ShortCaption>, Vec<Block>));

struct Cell((Attr, Alignment, RowSpan, ColSpan, Vec<Block>));

type RowSpan = i32;
type ColSpan = i32;
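
To make the listing a bit more tangible, here is a minimal sketch of how a tiny document maps onto these types. It uses a trimmed-down subset of the enums above (only Para, Str, Emph and Space), and the helper names inlines_to_text and hello_world are my own illustrations, not part of pandoc. The tree corresponds to the Markdown source `Hello *world*`.

```rust
// A trimmed-down subset of the types above, just enough to build
// and walk a single paragraph. Not the full data model.
#[derive(Debug, Clone, PartialEq)]
enum Inline {
    Str(String),
    Emph(Vec<Inline>),
    Space,
}

#[derive(Debug, Clone, PartialEq)]
enum Block {
    Para(Vec<Inline>),
}

/// Flatten inline elements to plain text, dropping all markup.
/// (Hypothetical helper for illustration.)
fn inlines_to_text(inlines: &[Inline]) -> String {
    let mut out = String::new();
    for inline in inlines {
        match inline {
            Inline::Str(s) => out.push_str(s),
            Inline::Emph(children) => out.push_str(&inlines_to_text(children)),
            Inline::Space => out.push(' '),
        }
    }
    out
}

/// The tree for the Markdown source `Hello *world*`.
/// (Hypothetical helper for illustration.)
fn hello_world() -> Block {
    Block::Para(vec![
        Inline::Str("Hello".to_string()),
        Inline::Space,
        Inline::Emph(vec![Inline::Str("world".to_string())]),
    ])
}
```

Note that the whitespace between the two words is its own Space element rather than part of a Str, so flattening the paragraph's inlines with inlines_to_text yields the text Hello world again.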

Conclusion

It is interesting to discover some properties and inconsistencies of this data model. I wonder whether it is truly a superset of the data models of markup languages.

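  • All integers are Haskell’s Int, which I rendered as i32 (GHC’s Int is actually machine-word sized, so i64 or isize would be closer on 64-bit platforms). I wonder whether numbers can actually be negative anywhere.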

  • There is no model for mathematical notation. Math is simply a type (display or inline) together with raw text. Teχ notation is assumed.

  • One could trivially assume a document is a sequence of block elements containing inline elements. But (e.g.) footnotes are inline elements that contain a sequence of block elements.

  • Alignment is either left, right, center or default. Justified is neglected, but some default exists. QuoteType is either single-quote or double-quote. A default option does not exist.

  • The data model mixes font properties and semantic properties. Emph represents emphasis and SmallCaps represents the font variant small capital letters. A representation for condensed letters or variable fonts is lacking.

  • Space is an explicit case of Inline. At the same time a non-breaking space or horizontal/vertical tab does not exist.
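
The footnote observation above has a practical consequence: Block and Inline are mutually recursive, so any traversal needs two functions that call each other. The following is a minimal sketch on a reduced subset of the types (Para, Quote, Str, Space, Emph, Note); the function names notes_in_blocks, notes_in_inlines and sample are hypothetical and only serve as illustration.

```rust
// Reduced subset of the data model: blocks contain inlines,
// and a Note inline contains blocks again.
#[derive(Debug, Clone, PartialEq)]
enum Block {
    Para(Vec<Inline>),
    Quote(Vec<Block>),
}

#[derive(Debug, Clone, PartialEq)]
enum Inline {
    Str(String),
    Space,
    Emph(Vec<Inline>),
    Note(Vec<Block>),
}

/// Count footnotes in a sequence of blocks (hypothetical helper).
fn notes_in_blocks(blocks: &[Block]) -> usize {
    blocks
        .iter()
        .map(|block| match block {
            Block::Para(inlines) => notes_in_inlines(inlines),
            Block::Quote(inner) => notes_in_blocks(inner),
        })
        .sum()
}

/// Count footnotes in a sequence of inlines (hypothetical helper).
/// A Note counts itself and then descends back into block level.
fn notes_in_inlines(inlines: &[Inline]) -> usize {
    inlines
        .iter()
        .map(|inline| match inline {
            Inline::Str(_) | Inline::Space => 0,
            Inline::Emph(children) => notes_in_inlines(children),
            Inline::Note(blocks) => 1 + notes_in_blocks(blocks),
        })
        .sum()
}

/// One paragraph, "Text", carrying a footnote "A footnote".
fn sample() -> Vec<Block> {
    vec![Block::Para(vec![
        Inline::Str("Text".to_string()),
        Inline::Note(vec![Block::Para(vec![
            Inline::Str("A".to_string()),
            Inline::Space,
            Inline::Str("footnote".to_string()),
        ])]),
    ])]
}
```

Because a Note may itself contain paragraphs with further notes, the recursion can go arbitrarily deep; a traversal that only looks at top-level blocks and their direct inlines would miss such nested footnotes.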