My interest in digital typesetting makes me read papers about industry standards such as Unicode and PDF. PDF/A is considered the archival variant of PDF, suitable for long-term storage. Consider a library which needs to build an index of its documents. Users want to retrieve the documents matching their criteria of interest and subsequently study them. PDF/A is considered an improvement over PDF in that regard. One basic example is that fonts must be embedded to ensure reproducible representation. In the paper “PDF/A considered harmful for digital preservation”, Marco Klindt sums up shortcomings of the PDF/A standard. I reviewed the paper.
Musical typesetting as example
First, I would like to draw the reader’s attention to the following insightful analogy:
An insightful analogue of the difference between human content understanding and machine extraction capabilities would be the visible communication of music. While storing the layout of sheet music is perfectly achievable with PDF the placement of note glyphs on lines with annotating glyphs for bars, clefs and so on, it is easily understood and transformed into audible sound by humans trained in reading musical notation. A machine would have a hard time extracting enough information to reproduce or compare the musical score
Second, I would like to criticize the mention of the Markdown language as an alternative:
The textual markup of Markdown variants is machine actionable while being human friendly to read at the same time. It is suitable for structured texts (including lists and tables) where the exact layout is not as important. Markdown is not well suited for validation.
Markdown is neither standardized nor suitable for document writing.
Nesting lists is a persistent problem in Markdown. And tables were not considered in its initial design (if you do not consider the HTML fallback part of its design, tables are not even possible).
I also doubt that “machine actionable” is appropriate here. Consider the following tasks: Identify all quotes. Is bold/italic text an emphasis or a keyword list? Identify all definitions. Identify variable names in a mathematical text. Markdown markup alone does not answer any of these reliably.
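To illustrate the doubt above, here is a minimal Python sketch. The sample text and the simplified regular expression are my own assumptions for illustration; a real Markdown parser handles many more cases (underscores, nesting, escapes). It extracts every single-asterisk span and shows that the markup itself does not reveal whether a span is a defined term, an emphasis, or a variable name.

```python
import re

# Hypothetical sample: the same *...* markup carries three different meanings
# (a defined term, a plain emphasis, and a mathematical variable).
SAMPLE = (
    "A *glyph* is the visual form of a character. "
    "This is *not* the same as a code point. "
    "Let *n* be the number of glyphs."
)

# Simplified pattern for single-asterisk emphasis (an assumption, not a
# full Markdown grammar): an asterisk, non-asterisk content, an asterisk.
EMPHASIS = re.compile(r"(?<!\*)\*([^*\s][^*]*?)\*(?!\*)")


def emphasized_spans(text: str) -> list[str]:
    """Return the text of every *...* span, in document order."""
    return EMPHASIS.findall(text)


spans = emphasized_spans(SAMPLE)
print(spans)  # ['glyph', 'not', 'n']
```

A machine gets three strings back and nothing else. Deciding that “glyph” is a definition, “not” is an emphasis and “n” is a variable requires understanding the surrounding sentence, which is exactly what the markup was supposed to spare us.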
As a result, I consider Markdown a terrible choice. The paper does not discuss any other markup languages.
The paper is an average read. It systematizes desirable goals of archival files and discusses PDF/A in this context. But, as the author points out, it does not provide a solution.
What I learned from reading this paper:
The Open Archival Information System (OAIS) is an accepted reference model for organizing an archive.
Web ARChive (WARC) is an archive format that aggregates all files belonging to a website into one archive together with metadata.