Markup language language servers

Motivation

Microsoft pushed the idea of the Language Server Protocol (Wikipedia). Traditionally, every IDE/editor implemented features like auto-completion, go-to-definition, workspace symbol search, documentation on hover, etc. on its own. The Language Server Protocol defines a uniform interface. A server can be implemented to provide this functionality for a particular language, and every IDE/editor can then use that server without reinventing the wheel. I familiarized myself with language servers because of Rust’s first-tier support. I like the idea of language servers these days, even though it has one understandable deficiency: syntax highlighting is not part of the protocol (summarized: It’s difficult; or long version).

Language servers target programming languages, and Microsoft maintains a list of implementations.

But what about markup languages? They certainly have some different characteristics and might require different features. Also, the word “markup language” might be vague. So, let us dive into this topic and discuss the status quo!

Markup language (ML)

Admittedly, I don’t have a proper definition of ML here. Sometimes the notions of lightweight MLs, generalized MLs, and document MLs are used. Document ML probably comes closest, but I don’t see a need to establish a definition here. I would simply love to see the following languages supported by some language servers:

{AsciiDoc, Wolfram CDF, Creole, GML, HTML, Lout, Markdown/MultiMarkdown/CommonMark, MediaWiki, org-mode, Plain Old Documentation, PmWiki, reStructuredText, Rich Text Format, SCRIPT, Setext, Textile, Texy!, ConTeχt/LaTeχ/Teχ, troff man/mdoc/me/mm/ms, txt2tags, UDO, XML}

JSON, YAML, and XML are (to some extent) data serialization formats. JSON and YAML are not considered here, whereas XML stays on the list because it is also used to mark up documents. OpenDocument and Office Open XML are excluded, because they are instances of XML files. Teχ is included via LaTeχ and ConTeχt, because LaTeχ’s macros allow the syntax to be used as a kind of generalized ML.

Language server features

The features of language servers can be found in the protocol specification (for less technical documentation, see the Language Server Extension Guide). I consider version 3.16.0 here:

auto completion items at a given cursor position
hover information at a given text document position
signature information at a given cursor position
declaration/definition/type definition location of a symbol at a given text document position
implementation location of a symbol at a given text document position
project-wide references for the symbol denoted by the given text document position
document highlights for a given text document position. For programming languages this usually highlights all references to the symbol scoped to this file
(flat or hierarchic) list of all symbols found in a given text document
actions like code fixes, to fix problems or to beautify/refactor code
code lenses, a Visual Studio feature that provides buttons like “Impact” (where is this function referenced), “test” (run unit test for this function) and “latest” (revision information for this block of lines)
location of links in a document
list all color references found in a given text document
list of presentations for a color value at a given location
auto-format a whole document
workspace-wide rename of a symbol
folding ranges found in a given text document
suggested selection ranges at an array of given positions
call hierarchy for the language element of given text document positions
resolve semantic tokens (e.g. ‘event’, ‘class’, or ‘type’) for a given file. Semantic tokens are used to add additional color information to a file that depends on language specific symbol information
return the range of the symbol at the given position of a given document and all ranges that have the same content
symbol monikers: the Language Server Index Format (LSIF) introduced the concept of symbol monikers to help associate symbols across different indexes. Thus, for a given text document position, the server can return the corresponding moniker information
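
To make the list above a bit more concrete, here is a minimal sketch of what one such request looks like on the wire. It assumes the base protocol as I read it in the specification: a Content-Length header, an empty line, and a JSON-RPC 2.0 body. The helper names frame_request and parse_message, the file URI, and the cursor position are made up for illustration.

```python
import json


def frame_request(request_id, method, params):
    """Serialize one JSON-RPC 2.0 request and prepend the Content-Length header."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    ).encode("utf-8")
    # The header counts the bytes of the body, not its characters.
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


def parse_message(raw):
    """Split a framed message into header and body and decode the body."""
    header, _, rest = raw.partition(b"\r\n\r\n")
    fields = dict(line.split(b": ", 1) for line in header.split(b"\r\n"))
    length = int(fields[b"Content-Length"])
    return json.loads(rest[:length].decode("utf-8"))


# Hypothetical hover request: "what is at line 3, character 7 of this file?"
# Line and character offsets are zero-based.
hover_request = frame_request(
    1,
    "textDocument/hover",
    {
        "textDocument": {"uri": "file:///tmp/example.xml"},
        "position": {"line": 3, "character": 7},
    },
)
print(parse_message(hover_request)["method"])
```

Most features in the list follow this shape: a method name such as textDocument/hover plus a text document and a position as parameters.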

Applicability to MLs

Consider e.g. XML:

auto-completion: return “log>” given “</”, based on the innermost element that is still open
hover information is useful for XML namespaces and XML entities
definition locations are useful for XML entities
document highlights are useful for element occurrences with the same tag name
the list of all symbols can group elements with identical names
auto-format allows quick adjustment of the indentation size, leaving CDATA sections untouched
workspace-wide rename of a symbol should work for XML namespaces, tag names, and attributes
code lenses could copy the XPath of an element to the clipboard
folding follows immediately from the XML hierarchy: each element should be foldable
call hierarchy could correspond to the element hierarchy, but I am not sure whether this violates the original semantics of “call hierarchy”
the current list of semantic tokens is inapplicable to XML
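
The “log>” completion from the list above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the regular expression ignores comments, CDATA sections, processing instructions, and attribute values containing “>”, and the function name closing_tag_completion is hypothetical, not taken from any existing XML language server.

```python
import re

# Start tags, end tags, and self-closing tags (simplified, see the assumptions above).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.:-]*)[^<>]*?(/?)>")


def closing_tag_completion(text_before_cursor):
    """Return the text to insert after "</", or None if nothing can be completed."""
    if not text_before_cursor.endswith("</"):
        return None
    open_elements = []  # stack of element names that are not closed yet
    for match in TAG.finditer(text_before_cursor):
        is_end_tag, name, is_self_closing = match.groups()
        if is_self_closing:
            continue  # <item/> opens and closes at once
        if is_end_tag:
            # Only pop on a matching name; malformed input is otherwise ignored.
            if open_elements and open_elements[-1] == name:
                open_elements.pop()
        else:
            open_elements.append(name)
    return f"{open_elements[-1]}>" if open_elements else None


document = '<root>\n  <log level="info">message\n  </'
print(closing_tag_completion(document))  # prints: log>
```

A real server would return this string inside a completion item and would use a proper incremental XML parser instead of a regular expression.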

Unlike XML, SGML/HTML should not allow whitespace auto-format, because HTML is often used in combination with CSS where CSS defines which elements are white-space [in]sensitive. The same applies to all variants of Markdown as any HTML is allowed in Markdown documents.

But in general, XML is comparatively easy. It becomes more difficult to wrap your head around MLs like AsciiDoc. Let’s go:

auto-completion would make sense for inter-document link targets or file paths, e.g. of images
hover information could help you to distinguish the different kinds of delimited blocks (e.g. ==== shows ‘example’) and could also show what a replacement expands to
document highlights are only useful for semantically equivalent text
auto-format would establish a certain default style for writing AsciiDoc documents. But this is very limited: I can only think of merging multiple whitespace characters into a single one
code lenses could be used to generate a reference ID for the element that is clicked on
folding would totally make sense for the section hierarchy or any kind of blocks
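
Folding is probably the easiest of these to sketch. The following is a rough illustration, not a complete AsciiDoc parser: it assumes that section titles start with one to six equals signs followed by a space, that delimited blocks are opened and closed by identical lines of at least four delimiter characters, and that nested blocks may be ignored. The function name folding_ranges is hypothetical; the startLine/endLine keys follow the folding range structure of the protocol.

```python
import re

HEADING = re.compile(r"^(={1,6}) \S")
# Example, listing, literal, sidebar, quote, passthrough, and comment blocks.
DELIMITER = re.compile(r"^(={4,}|-{4,}|\.{4,}|\*{4,}|_{4,}|\+{4,}|/{4,})$")


def folding_ranges(text):
    """Return zero-based folding ranges for sections and delimited blocks."""
    lines = text.splitlines()
    ranges = []
    open_sections = []  # stack of (level, start_line)
    block = None  # (delimiter, start_line) while inside a delimited block

    def close_sections(min_level, end_line):
        # A section ends right before the next title of the same or a higher level.
        while open_sections and open_sections[-1][0] >= min_level:
            _, start = open_sections.pop()
            if end_line > start:
                ranges.append({"startLine": start, "endLine": end_line})

    for number, line in enumerate(lines):
        stripped = line.rstrip()
        if block is not None:
            # Inside a block nothing is interpreted until the same delimiter recurs.
            if stripped == block[0]:
                ranges.append({"startLine": block[1], "endLine": number})
                block = None
            continue
        if DELIMITER.match(stripped):
            block = (stripped, number)
            continue
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            close_sections(level, number - 1)
            open_sections.append((level, number))
    close_sections(0, len(lines) - 1)
    return sorted(ranges, key=lambda r: (r["startLine"], r["endLine"]))


sample = "\n".join([
    "= Document title", "", "== First section", "", "----",
    "== not a title, just listing content", "----", "", "== Second section", "", "Text.",
])
for folding_range in folding_ranges(sample):
    print(folding_range)
```

Note that the line inside the listing block is not treated as a section title, which is exactly the kind of context sensitivity that makes AsciiDoc harder than XML.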

At the same time, it would be very useful if the server also provided linguistic checks.
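
As far as I can tell, such checks would fit the protocol’s diagnostics, which the server publishes for a document. Here is a toy sketch, assuming a deliberately trivial check (an accidentally repeated word); the function name and the source label are made up, and a real server would delegate to a proper grammar checker.

```python
import re

# A word followed by whitespace and the same word again, ignoring case.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def repeated_word_diagnostics(text):
    """Return diagnostic objects for words that are repeated within a line."""
    diagnostics = []
    for line_number, line in enumerate(text.splitlines()):
        for match in REPEATED_WORD.finditer(line):
            diagnostics.append({
                # Zero-based positions. The protocol counts UTF-16 code units by
                # default, which equals Python's indices only for BMP characters.
                "range": {
                    "start": {"line": line_number, "character": match.start()},
                    "end": {"line": line_number, "character": match.end()},
                },
                "severity": 2,  # 2 means Warning in the protocol
                "source": "repeated-word-check",  # hypothetical label
                "message": f'Repeated word "{match.group(1)}"',
            })
    return diagnostics


sample = "Language servers are are useful.\nNothing wrong here."
for diagnostic in repeated_word_diagnostics(sample):
    print(diagnostic["range"], diagnostic["message"])
```

The editor would then underline the reported range, just as it does for compiler warnings in a programming language.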

Implementations

The implementation list mentioned above contains the following implementations of general interest:

JSON support implemented in an npm package
Adam Voss implemented support for YAML (with JSON schemas)
Red Hat Developers implemented YAML as well
Adam Voss implemented LanguageTool for grammar checks

The implementation list also lists some languages of my ML list above:

Microsoft’s implementation for HTML
IBM’s XML language server written in Java
LemMinX by Eclipse as XML language server
Eric Förster’s implementation for LaTeχ (also: texlab.netlify.app)
Wolfram CDF implementation by kenkangxgwe

See also langserver.org. Using a search engine, I additionally found the following implementation:

reStructuredText (declared as ‘work in progress’ by langserver.org)

Conclusion

I can see some potential for language servers in the field of markup languages, but so far you can barely find any implementations.