Lua implementation of syntax escaping styles

Written on 2022-07-17 in 1737 words ✍️.
Part of project typho digital-typesetting

Motivation

This is a very lightweight blogpost. I just wanted to provide implementations for all syntax escaping styles in Lua for UTF-8 strings to discover any differences in the algorithmic complexity.

We distinguish between encoding (text→escaped text) and decoding (escaped text→text).

Fixed-length syntax escaping

For fixed-length syntax escaping, an implementation with a replace-function is possible as well. So I also provided FLSE_gsub_encode.

local utf8 = require("utf8");

-- Implementation of fixed-length syntax escape encoding
-- for escape character \ and escape sequence \"
function FLSE_encode(text)
  local out = ""
  for _, scalar in utf8.codes(text) do
    if scalar == utf8.codepoint([["]], 1, 1) then
      out = out .. [[\]]
    elseif scalar == utf8.codepoint([[\]], 1, 1) then
      out = out .. [[\]]
    end
    out = out .. utf8.char(scalar)
  end
  return out
end

assert(FLSE_encode([[hello "my" \world]]) == [[hello \"my\" \\world]])

-- Implementation of fixed-length syntax escape encoding
-- for escape character \ and escape sequence \"
-- using string.gsub successively
function FLSE_gsub_encode(text)
  local out = ""
  return text:gsub([[\]], [[\\]]):gsub([["]], [[\"]])
end

assert(FLSE_gsub_encode([[hello "my" \world]]) == [[hello \"my\" \\world]])

-- Implementation of fixed-length syntax escaping decoding
-- for escape character \ and escape sequence \"
function FLSE_decode(text)
  local was_escape = false
  local out = ""
  for _, scalar in utf8.codes(text) do
    if scalar ~= utf8.codepoint([[\]], 1, 1) or was_escape then
      out = out .. utf8.char(scalar)
    end
    was_escape = (scalar == utf8.codepoint([[\]], 1, 1)) and not was_escape
  end
  return out
end

assert(FLSE_decode([[hello \"my\" world]]) == [[hello "my" world]])

Successive syntax escaping

Successive syntax escaping is the only variant here which cannot be described with a replace-function. However, Lua’s pattern matching capabilities are strong enough to still enable this.

-- Implementation of successive syntax escape encoding
-- for escape character " and escape sequence "
function SSE_encode(text)
  local was_escape = false
  local out = ""
  for _, scalar in utf8.codes(text) do
    if scalar == utf8.codepoint([["]], 1, 1) and not was_escape then
      out = out .. [["]]
    end
    was_escape = scalar == utf8.codepoint([["]], 1, 1)
    out = out .. utf8.char(scalar)
  end
  return out
end

assert(SSE_encode([[hello "my"" world]]) == [[hello ""my""" world]])

-- Implementation of successive syntax escape encoding
-- for escape character " and escape sequence "
-- using string.gsub with Lua pattern matching
function SSE_gsub_encode(text)
  return text:gsub([["+]], [[%1"]])
end

assert(SSE_gsub_encode([[hello "my"" world]]) == [[hello ""my""" world]])

-- Implementation of successive syntax escaping decoding
-- for escape character " and escape sequence "
function SSE_decode(text)
  local was_escape = false
  local out = ""
  for _, scalar in utf8.codes(text) do
    if scalar ~= utf8.codepoint([["]], 1, 1) or was_escape then
      out = out .. utf8.char(scalar)
    end
    was_escape = scalar == utf8.codepoint([["]], 1, 1)
  end
  return out
end

assert(SSE_decode([[hello ""my""" world]]) == [[hello "my"" world]])

Variadic-length syntax

Variadic-length syntax requires us to remember the escape sequences in memory. I called the variable cache. Again an implementation with a replace-function is possible and provided in function VLSE_gsub_encode.

-- Implementation of variadic-length syntax escape encoding
-- for escape character & and escape sequence &amp; or &quot;
function VLSE_encode(text)
  local out = ""
  for _, scalar in utf8.codes(text) do
    if scalar == utf8.codepoint([[&]], 1, 1) then
      out = out .. [[&amp;]]
    elseif scalar == utf8.codepoint([["]], 1, 1) then
      out = out .. [[&quot;]]
    else
      out = out .. utf8.char(scalar)
    end
  end
  return out
end

assert(VLSE_encode([[hello "my" & world]]) == [[hello &quot;my&quot; &amp; world]])

-- Implementation of variadic-length syntax escape decoding
-- for escape character & and escape sequence &amp; or &quot;
-- using string.gsub with Lua pattern matching
function VLSE_gsub_encode(text)
  -- NOTE: &amp; must come first
  return text:gsub([[&]], [[&amp;]]):gsub([["]], [[&quot;]])
end

assert(VLSE_gsub_encode([[hello "my" & world]]) == [[hello &quot;my&quot; &amp; world]])

-- Implementation of variadic-length syntax escape decoding
-- for escape character & and escape sequence &amp; or &quot;
function VLSE_decode(text)
  local lookup = { ["&quot;"] = [["]], ["&amp;"] = [[&]] }
  local cache = ""
  local out = ""
  for _, scalar in utf8.codes(text) do
    if scalar == utf8.codepoint([[&]], 1, 1) and cache == "" then
      cache = "&"
    elseif scalar ~= utf8.codepoint([[;]], 1, 1) and cache ~= "" then
      cache = cache .. utf8.char(scalar)
    elseif scalar == utf8.codepoint([[;]], 1, 1) and cache ~= "" then
      out = out .. lookup[cache .. ";"]
      cache = ""
    else
      out = out .. utf8.char(scalar)
    end
  end
  return out
end

assert(VLSE_decode([[hello &quot;my&quot; &amp; world]]) == [[hello "my" & world]])

Conclusion

None of the differences in algorithmic complexity are significant. Most use just a boolean flag (was_escape) to track state.