Motivation
I don’t claim to be an expert on this topic, but I feel there are some simple guidelines which don’t get retold often enough. I have designed some file formats myself and have had some frustrating experiences with recurring file format definition errors. Academically, I think there should be more interest in this topic instead of the common “I parsed PDF files using machine learning” papers. Anyhow, I hope some academics find more answers to the open questions, but for now let us summarize those guidelines.
Prior art
I listed the entries by my own rating of “contributes to the topic at hand” from most to least significant:
-
[paperDSL] paper “Design Guidelines for Domain Specific Languages” (2014) by Karsai, Krahn, Pinkernell, Rumpe, Schindler, and Völkel
-
[talk38c3] talk “38C3 - Fearsome File Formats” (2024-12-30) by Ange Albertini
-
[articleKOMPPA] article “On File Formats” (2025-05-19) by Jari Komppa
-
[paperLITTLE] paper “Little Languages” (1986) by Jon Bentley
Paper “Design Guidelines for Domain Specific Languages”
This paper from 2014 [paperDSL] lists 26 guidelines in five categories. Without discussing them in detail, I am going to list their names here:
-
Language Purpose
-
Guideline 1: Identify language uses early
-
Guideline 2: Ask questions
-
Guideline 3: Make your language consistent
-
-
Language Realization
-
Guideline 4: Decide carefully whether to use graphical or textual realization
-
Guideline 5: Compose existing languages where possible
-
Guideline 6: Reuse existing language definitions
-
Guideline 7: Reuse existing type systems
-
-
Language Content
-
Guideline 8: Reflect only the necessary domain concepts
-
Guideline 9: Keep it simple
-
Guideline 10: Avoid unnecessary generality
-
Guideline 11: Limit the number of language elements
-
Guideline 12: Avoid conceptual redundancy
-
Guideline 13: Avoid inefficient language elements
-
-
Concrete Syntax
-
Guideline 14: Adopt existing notations domain experts use
-
Guideline 15: Use descriptive notations
-
Guideline 16: Make elements distinguishable
-
Guideline 17: Use syntactic sugar appropriately
-
Guideline 18: Permit comments
-
Guideline 19: Provide organizational structures for models
-
Guideline 20: Balance compactness and comprehensibility
-
Guideline 21: Use the same style everywhere
-
Guideline 22: Identify usage conventions
-
-
Abstract Syntax
-
Guideline 23: Align abstract and concrete syntax
-
Guideline 24: Prefer layout which does not affect translation from concrete to abstract syntax
-
Guideline 25: Enable modularity
-
Guideline 26: Introduce interfaces
-
As can be seen from the list, the paper approaches the topic at a very abstract level and thus genuinely contributes to the field. It applies generically whenever you want to design a file format. But it also applies in more specific contexts; for example, if you want to redesign how mathematicians write their formulas. It is the only set of guidelines which also talks about notations.
Talk “38C3 - Fearsome File Formats”
Ange Albertini [talk38c3] has built up a lot of expertise on file formats over the last decade and gives an overview in this talk of how file formats can be abused and repurposed. Here, I want to focus on only one slide in particular, which lists recommendations for designing a “good file format”:

In text, the ten commandments are:
-
Magic at offset zero (fast identification, no bypass)
-
Clear chunk structure (forward compatibility, easy parsing/cleanup)
-
Version number (forward thinking)
-
No duplicity (duplicity → discrepancy)
-
No “constant” variables (ossification → hardcoding)
-
Up-to-date specs (reflect reality)
-
Samples set (Theory isn’t enough)
-
Extensibility (your format will evolve in unknown ways)
-
Keep the spirit (don’t reuse formats for different intent without trivial distinction)
-
Perfect is the enemy of good (shortcuts will be taken to avoid over-complexity)
Article “On File Formats”
Rather recently, Jari Komppa wrote an article [articleKOMPPA] on the design of file formats (HackerNews discussion). He lists the following recommendations:
-
Does a file format exist for this yet?
-
Does it need to be human readable?
-
Chunk your binaries.
-
Allow partial parsing.
-
Version your formats.
-
Document your format.
-
Don’t include fields just in case.
-
Consider the target hardware.
-
Compression.
-
On filename extensions (i.e. consider four letters, but three letters are mostly allocated).
Paper “Little Languages”
The paper by Jon Bentley [paperLITTLE] is by far the oldest. I do believe the notion of domain-specific languages was not sufficiently developed at that time, and he coined the notion of little languages instead. The most prominent example from the paper is awk as a little language. The idea is that domain-specific languages shall be developed, and tools like awk help with the first step. He cites “an old rule of thumb” that “the first 10% of programming effort provide 90% of the functionality”. Only if the language evolves sufficiently should developers consider using parsing tools like lex and yacc. Regarding the design of such languages, the paper lists the following recommendations:
- Orthogonality
-
keep unrelated features unrelated.
- Generality
-
use an operation for many purposes.
- Parsimony
-
delete unneeded operations.
- Completeness
-
can the language describe all objects of interest?
- Similarity
-
make the language as suggestive as possible.
- Extensibility
-
make sure the language can grow.
- Openness
-
let the user “escape” to use related tools.
Discussion
I want to start this discussion with a definition. Technically, there is no notion of a “binary file” and a “text file”. In practice, the distinction helps because we can immediately align our expectation whether we need a text editor (text file) or a specialized software (binary file) to handle the file. What is the distinction?
Text files versus binary files
Fundamentally, text means that some serialization format or character set exists to encode text. The famous ASCII (American Standard Code for Information Interchange) standard covers 128 characters; its many eight-bit extensions are not considered ASCII these days. The other famous text standard is Unicode, which has taken up the tremendous effort to cover all used writing systems of the world in one standard. To serialize Unicode scalars into actual bytes, different encodings can be chosen. UCS-2 is deprecated, but UTF-16 and UTF-32 can still be found. Unlike UTF-32, UTF-8 has the advantage of backwards compatibility with ASCII (and the first 256 Unicode code points agree with ISO 8859-1, although their UTF-8 serialization differs). Furthermore, it uses less space to serialize text in common languages like English. Since each code point uses one to four bytes in UTF-8, the same text encoded in UTF-32 is usually longer; the price is that UTF-8 cannot be indexed by code point in constant time. To summarize: there is a long history of character sets including EBCDIC, Mac OS Roman, Windows-1252, Shift-JIS, Cork, and WTF-8. But with a current adoption above 98% within the World Wide Web, UTF-8 is the de-facto standard (backed by manifestos like UTF-8 Everywhere) and commonly picked as the character set for new text file formats. And someone who thinks Unicode is unnecessary for English text must be considered a bit naïve.
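The size trade-off is easy to demonstrate. A quick sketch in Python (any Python 3):

```python
# Compare the serialized size of the same text in UTF-8 and UTF-32.
text = "na\u00efve"  # five code points, one of them ("ï") outside ASCII

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-be")  # big endian, without byte order mark

print(len(utf8))   # 6 bytes: "ï" needs two bytes in UTF-8
print(len(utf32))  # 20 bytes: every code point takes four bytes
```

Six bytes versus twenty for the same five code points; for mostly-ASCII text the gap only widens.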
Going back to the question of binary versus text, the result is that some file formats declare that only byte sequences valid in the declared character set are admissible files; these are text files. This stands in contrast to binary files, where any sequence of bytes is admissible by default. Note that file formats like HTML require you to declare the character set, formats like PDF allow you to switch the character encoding within the file as often as desired, and formats like HTTP headers are so complex that custom RFCs were written. One counterexample is the TOML file format, which declares that “a TOML file must be a valid UTF-8 encoded Unicode document” and is therefore a genuine text file format.
What a text encoding contributes
One notorious problem with file format definitions is that people think that terms like “whitespace”, “hyphen”, or “line break” are universal, unambiguous names for characters. No, no, and no. Instead, the notions of whitespace (more specifically, Unicode scalars with the Whitespace property), hyphen (more specifically, U+002D HYPHEN-MINUS), and line break (more specifically, a mandatory break according to UAX #14) come from text encodings like Unicode.
If you don’t specify the text encoding, I don’t know what those words mean. For Unicode encodings like UTF-8 or UTF-16, I clarified the meaning above. If you use ASCII instead of Unicode, everyone understands that “whitespace” means 0x20, the space character, as the only representative of this group. If you mention hyphen, ASCII is even less ambiguous than Unicode, where the less common U+00AD SOFT HYPHEN exists (among others); in ASCII, “hyphen” unambiguously means 0x2D. But in ASCII there is no line break definition at all. Is 0x0A (“line feed”) a line break? Is 0x0D (“carriage return”) a line break? Both? Conventionally, 0x0A is a line break on Linux machines and the sequence 0x0D 0x0A is a line break on Windows machines. But why is 0x0C (“form feed”, a page break) not a line break? If you define a page break, don’t you necessarily imply a line break?! We are never going to know, because ASCII does not define what a line break is.
If you actually decide to use a well-thought-through standard like Unicode, you can also answer difficult questions quickly in a standardized way: for example, is Unicode normalization semantically meaningful for your format?
If you don’t specify the text encoding, I literally know nothing about the content. I don’t know what a whitespace character is. I don’t even know what a character is, because I don’t know how many bytes constitute a character.
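Once you commit to Unicode, these questions get standardized answers. A small sketch (Python’s `str` operations follow the Unicode definitions):

```python
# With Unicode semantics, "whitespace" and "line break" are well-defined.
s = "a\x0cb"              # 0x0C, the ASCII form feed ("page break")
lines = s.splitlines()    # Unicode treats it as a line boundary
print(lines)              # ['a', 'b']

print("\u00a0".isspace())  # U+00A0 NO-BREAK SPACE is whitespace: True
print("-".isspace())       # U+002D HYPHEN-MINUS is not: False
```

The ASCII questions above (is a form feed a line break?) simply stop being questions: the standard answers them for you.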
Regarding syntax escaping
Some binary file formats and most text file formats have some requirement like “arbitrary user content follows”. In this setting, you really don’t know when the user content is finished. As a result, you are going to need some byte sequence which says “user content finishes here” and which is not interpreted as user content itself. You need to escape the “user content syntax”.
I wrote an article about syntax escaping some time ago, but the gist is this: syntax escaping can be avoided by a length specifier, which is only practical for binary files (never let a user count bytes or Unicode code points). Otherwise, you can declare one byte sequence to be “escaping”. If you repeat this byte sequence, it regains its original meaning; otherwise it starts an escape sequence which might signify something like “user content stops here”.
The worst thing you can do is to ignore the problem. If you allow arbitrary user content but don’t declare an escaping mechanism, you either open yourself up to ambiguities or violate the requirement of “arbitrary user content”. My personal opinion is that XML’s escaping mechanism is simple and extensible compared to other approaches.
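As a sketch of the “repeat the escape byte” scheme, assuming a hypothetical delimiter `%` (not any particular format):

```python
# Doubling-based escaping: '%%' regains the literal meaning of '%',
# while a single '%' starts an escape sequence such as '%END'.
def escape(user_content: str) -> str:
    return user_content.replace("%", "%%")

def unescape(escaped: str) -> str:
    return escaped.replace("%%", "%")

payload = "50% off"
framed = escape(payload) + "%END"  # "%END" signifies "user content stops here"
print(framed)                      # 50%% off%END
```

Because every literal `%` in the payload is doubled, the single `%` of `%END` can never be confused with user content.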
Regarding file extensions
File extensions give the operating system a clue which application might be capable of interpreting a file. For historic reasons, they tend to be short (two to four characters) sequences of Latin characters. They are an incomplete concept leading to unintended collisions. For example, all kinds of markup syntaxes are declared as .md files these days. Historically, .txt used to be full of collisions. But it still makes sense to align all users upon one file extension.
What I would like to stress here as well is the MIME type. It is equally helpful to align all users upon one MIME type. The x- prefix opens up MIME types to custom standards. So text/x-foobar would be a valid choice.
Regarding magic numbers
One might think that magic numbers are unnecessary boilerplate. If the specified structure is unique to your file format anyhow, why should a magic number be necessary? The answer is simple: not all tools want to look at the entire document structure to determine whether a file follows a certain file format. If a tool only has to read some leading bytes (namely the so-called magic number), it can determine much more quickly whether the file is interesting [1].
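A minimal sketch of such a check, using PNG’s actual magic number as the example:

```python
# Identify a PNG by its magic number at offset zero, reading only
# eight bytes instead of parsing the whole document structure.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_png(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(len(PNG_MAGIC)) == PNG_MAGIC
```

Tools like file(1) and grep apply exactly this kind of cheap prefix test before deciding how to treat a file.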
Regarding version numbers
The simple argument for version numbers is that implementors can easily dispatch interpretation. If your document follows specification 1.0, the source code for 1.0 interprets your file. If your document follows specification 2.0, the source code for 2.0 interprets your file. This way you can easily introduce backwards-incompatible version changes.
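The dispatch itself is trivial; a sketch for a hypothetical format whose first byte is the version (all names and the layout are made up):

```python
# Version-number dispatch: read the version field, pick the matching parser.
def parse_v1(body: bytes) -> dict:
    return {"version": 1, "body": body}

def parse_v2(body: bytes) -> dict:
    return {"version": 2, "body": body}

PARSERS = {1: parse_v1, 2: parse_v2}

def parse(data: bytes) -> dict:
    version = data[0]  # hypothetical: one version byte, then the body
    if version not in PARSERS:
        raise ValueError(f"unsupported version {version}")
    return PARSERS[version](data[1:])

print(parse(b"\x02payload"))  # {'version': 2, 'body': b'payload'}
```

Unknown versions fail loudly instead of being misinterpreted by the wrong parser.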
Design requires re-re-re-iteration
File format design is design. Every design needs iteration for perfection and consistency. Finish the draft version in a straightforward, use-case-centered manner, but be open to improving your design in subsequent versions. Iterate and iterate and iterate. And re-iterate again. And ask your target audience for their opinion. Then you have mastered the art.
About the general approach to design
One thing I would like to point out, which comes from programming language design, is that design should go from specific cases to generality. What is meant is that one can specify very extensible elements in a syntax, but you should define those elements only for their specific cases and disallow the others. In subsequent versions, you might understand which other cases exist and which cases make sense. Under these circumstances, you might open up that element to more (or more general) cases.
Let me illustrate this with a trivial example: You might have 10 specific cases to distinguish, but you have to use one byte as discriminant. Therefore 256 cases can be distinguished, but you only need 10 cases. Now the general approach to design can be done in the following wrong way:
-
Specify the 10 cases
-
Declare 246 cases to be “implementor-defined”
Instead the following correct way can be taken:
-
Specify the 10 cases
-
Disallow 246 cases
The point of the latter approach is not that “implementor-defined is always stupid”. If you are certain, then by all means specify (for example) 10 cases for the desired behavior and 20 cases for “implementor-defined” use cases. But be aware that disallowing some values opens up extensibility in future versions. You should restrict your design tightly. Once you have gained experience and feedback, you can open it up to other cases. Being restrictive in the beginning enables the necessary extensibility later.
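The “correct way” above can be sketched as a strict validator, assuming the hypothetical ten cases occupy 0x00..0x09:

```python
# Strict validation of a one-byte discriminant: the ten specified cases
# are accepted, the remaining 246 values are rejected as "reserved",
# which keeps them free for future versions of the format.
SPECIFIED_CASES = frozenset(range(10))  # hypothetical cases 0x00..0x09

def read_discriminant(byte: int) -> int:
    if byte not in SPECIFIED_CASES:
        raise ValueError(f"reserved discriminant 0x{byte:02x}")
    return byte
```

A version 2 of the format can later admit, say, 0x0A without breaking any conforming version 1 file, precisely because version 1 parsers rejected it.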
A generic approach for defining binary file formats
There is a simple design which can model any binary data model unambiguously. It is called TLV (Type-Length-Value).
The file format has to follow the general recommendations first. Introduce a magic number. Introduce a version number. Put your metadata in a header. And then let us write down the data in the body of the file.
The body is a sequence of entries. Every entry consists of a type, a length, and a value.
-
A type is a discriminator (thus, of fixed width) telling what kind of data you are supposed to expect in the value. For example, the byte 0x13 might identify the value as an unsigned integer in big endian; 0x13 is one instance of this type.
-
A length specifies how many bytes the value consists of. It needs to be of fixed width as well, and common choices include 16 or 32 bits. For example, the bytes 0x00 0x04 might declare the length 4, and thus our example value is expected to be a 4-byte unsigned integer in big endian.
-
The value is a sequence of bytes. Because of the previous two fields, you know exactly how many bytes you are supposed to read to understand the value. Furthermore, we attached semantics through the type discriminator. The specification is now supposed to define how to interpret the bytes of each type.
You have to encode a boolean value? Designate a type, always require length one, and specify which two values are admissible. Done.
You have to encode Unicode text? Designate a type, the length can be adjusted to the actual byte length, and specify which encoding you require in the value. Done.
This design is very simple and very generic. Does it solve all problems, and do all binary files become instances of the TLV design? No. TLV is generic and a very good guideline. However, it applies only to binary files (no human wants to think through a three-step process all the time, and particularly not count bytes), and binary files mainly exist because they optimize some requirements over text files. Binary files might optimize parsing performance or space usage. As a result, designers start to skip fields. For example, if an entry of type 0x42 is always preceded by an entry of type 0x41, space-optimizing designers might claim that the type byte 0x42 must be left out. Indeed, position-defined entries do not need a type if it can be derived from the position index. But in this very moment, the TLV design principle is violated and TLV remains only a general rule of thumb.
Everyone should be familiar with TLV and follow it for binary files if freedom from ambiguity cannot otherwise be guaranteed for the entire design. A similar design, where values usually represent chunks, is the Interchange File Format.
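A minimal TLV encoder/decoder sketch: one type byte, a 16-bit big-endian length, then the value bytes. The concrete type assignments are made up for illustration:

```python
import struct

# TLV entry: 1 type byte, 2-byte big-endian length, then `length` value bytes.
def write_entries(entries: list) -> bytes:
    out = bytearray()
    for type_id, value in entries:
        out += struct.pack(">BH", type_id, len(value))
        out += value
    return bytes(out)

def read_entries(data: bytes) -> list:
    entries, offset = [], 0
    while offset < len(data):
        type_id, length = struct.unpack_from(">BH", data, offset)
        offset += 3
        entries.append((type_id, data[offset:offset + length]))
        offset += length
    return entries

entries = [
    (0x01, "hällo".encode("utf-8")),  # 0x01: UTF-8 text (hypothetical)
    (0x13, struct.pack(">I", 4711)),  # 0x13: big-endian u32 (hypothetical)
]
blob = write_entries(entries)
assert read_entries(blob) == entries  # round trip
```

Note how the reader never has to understand a value to skip past it; the length field alone suffices, which is exactly what enables forward compatibility.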
Target hardware and endianness
When it comes to binary files, endianness needs to be specified. Deciding upon this value usually leads directly to the question of target hardware platforms. Endianness can be defined arbitrarily. Big endian or little endian? It is trivial. Just pick one. But the only meaningful way to pick the better value is thinking about the target platform. Does it target Intel machines? Then little endian makes more sense. Are humans going to look at the values from time to time? Big endian might be more convenient, but less optimized for desktop computer hardware.
You should know your target domain, your target audience, and common hardware platforms. But don’t optimize prematurely.
If you start cramming all data into bit vectors to save a few bytes, you neglect that machines are optimized to operate on bytes and cache lines. Extracting individual bits is a time-consuming operation. You might be better off adding a few unused bits, trading space for time.
In the end, benchmarking your prototype parsing implementation reveals the actually interesting parts to optimize. This becomes especially important if you plan to apply compression to parts of your data.
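The choice becomes visible as soon as a multi-byte value is serialized; in Python:

```python
import struct

# The same 32-bit integer serialized in both byte orders.
n = 0x12345678
big = struct.pack(">I", n)     # bytes 12 34 56 78, reads like the hex literal
little = struct.pack("<I", n)  # bytes 78 56 34 12, native order on x86

assert big == b"\x12\x34\x56\x78"
assert little == b"\x78\x56\x34\x12"
```

Big endian matches what a human sees in a hex dump; little endian is what an x86 machine can load without byte swapping. Either works; just write it down.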
Syntax and semantics
My final point would be that syntax and semantics are two different concepts. You need to be aware of it. You may be able to use the syntax of an existing standard and define custom semantics on top of it. One example would be XML (syntax) and InkML (semantics). You may also define a new syntax like Simple Outline XML for existing semantics like XML.
I gave examples for text files here, but this also applies to binary files. The only problem is that binary files usually have a very specific data model which differs from other formats’. But generic binary serialization standards include ASN.1 and postcard (introductory talk on YouTube).
If you are able to split syntax and semantics, you end up in one of two scenarios:
-
you can use an established, proven-in-practice standard for one of the two components. The necessary tools do not need to be written again.
-
you get the possibility to remove one component and exchange it for something else if you recognize a mistake. Programming languages like Dylan simply removed their LISP-style syntax and introduced something new.
Not too bad, right?
Decision list
Finally, I want to contribute a decision list where items to consider are listed:
-
Are you sure the effort of defining a new file format is worth it? If no, stop. If yes, proceed.
-
Split syntax and semantics. Can you reuse the syntax (e.g. XML, S-expressions, JSON, YAML, …) or semantics (e.g. JSON, YAML, ASN.1, postcard, …) of existing formats?
-
The following holds true for text and binary files:
-
No duplicity (duplicity → discrepancy)
-
Add a version number. No exceptions. Consider versioning schemes to communicate to users what expectations regarding compatibility / upgrade necessity apply.
-
Ask for feedback regarding syntax (text files) and data model (text files and binary files)
-
Avoid unnecessary generality. It is easier to permit features later on than to standardize elements once they are already in use
-
Ask a domain expert for feedback and iterate.
-
Increment the version number and release.
-
Which interfaces do you provide to embed content from other file formats?
-
Revise which elements provide modularity and extensibility in your file format.
-
-
Is it going to be a text file?
-
Declare the character set (UTF-8 is recommended)
-
Unicode characters are referred to with the U-notation: U+002C COMMA. Other character sets commonly use plain hexadecimal notation like 0x2C. If you refer to a character, get used to this notation and use it exclusively.
-
Depending on your character set, you may now use words like ‘whitespace’ to describe your file format
-
Do you want to base your syntax on existing concepts and notation? Remember that any existing technology exists because it has some valuable benefits. But you are designing something new because the existing technology does not satisfy your requirements. Comprehend the benefits, integrate them into your design, and reiterate multiple times to achieve consistency.
-
Define the syntax escaping mechanism if arbitrary user content is allowed
-
Discuss punctuation versus keywords. Punctuation is a small set of characters and thus brief, but only programmers are used to using it in various contexts. Keywords are longer and extensible, but you have to discuss questions like casing and singular versus plural (depending on the writing system and language).
-
Discuss whether you want to include comments (elements which carry no semantics, but provide an opportunity for documentation to the author)
-
Discuss whether you want to allow trailing separators (e.g. ["item1", "item2",] if comma is your separator)
-
-
Is it going to be a binary file?
-
Add a magic number at offset zero
-
Declare: are multi-byte values encoded in little endian or big endian?
-
Declare and illustrate the big picture structure of your file format (e.g. header/body/footer). It is easy to get lost in details (or bore the hell out of the reader) when describing binary file formats.
-
Enable parsers to skip structural parts of your file format (e.g. the entire body, because its length is declared)
-
Follow the TLV design. Declare the semantics of values in the file.
-
If you don’t follow the TLV design, define the escaping mechanism (length declaration is recommended)
-
-
Publication:
-
Provide example files.
-
Provide a specification document. Specify where people can direct their feedback to. Specify the format version this document describes.
-
Develop tools to read, write, analyze, and fix files in this format.
-
Suggest a file extension for files. Suggest a MIME type for files.
-
Conclusion
File format definition is a difficult art. And hopefully I summarized some guidelines for you. If you succeed, people are going to enjoy writing parsers for it. If you fail, your file format is going to suffer from fragmentation and limited adoption. Good luck!
[1] One example is grep, which has to decide whether a file is binary (to be ignored) or text (to be searched in).