Motivation
This is a very lightweight blogpost. I just wanted to provide implementations for all syntax escaping styles in Lua for UTF-8 strings to discover any differences in the algorithmic complexity.
We distinguish between encoding (text→escaped text) and decoding (escaped text→text).
Fixed-length syntax escaping
For fixed-length syntax escaping, an implementation with a replace-function is possible as well. So I also provided FLSE_gsub_encode
.
local utf8 = require("utf8");
-- Implementation of fixed-length syntax escape encoding
-- for escape character \ and escape sequence \"
function FLSE_encode(text)
local out = ""
for _, scalar in utf8.codes(text) do
if scalar == utf8.codepoint([["]], 1, 1) then
out = out .. [[\]]
elseif scalar == utf8.codepoint([[\]], 1, 1) then
out = out .. [[\]]
end
out = out .. utf8.char(scalar)
end
return out
end
assert(FLSE_encode([[hello "my" \world]]) == [[hello \"my\" \\world]])
-- Implementation of fixed-length syntax escape encoding
-- for escape character \ and escape sequence \"
-- using string.gsub successively
function FLSE_gsub_encode(text)
local out = ""
return text:gsub([[\]], [[\\]]):gsub([["]], [[\"]])
end
assert(FLSE_gsub_encode([[hello "my" \world]]) == [[hello \"my\" \\world]])
-- Implementation of fixed-length syntax escaping decoding
-- for escape character \ and escape sequence \"
function FLSE_decode(text)
local was_escape = false
local out = ""
for _, scalar in utf8.codes(text) do
if scalar ~= utf8.codepoint([[\]], 1, 1) or was_escape then
out = out .. utf8.char(scalar)
end
was_escape = (scalar == utf8.codepoint([[\]], 1, 1)) and not was_escape
end
return out
end
assert(FLSE_decode([[hello \"my\" world]]) == [[hello "my" world]])
Successive syntax escaping
Successive syntax escaping is the only variant here which cannot be described with a replace-function. However, Lua’s pattern matching capabilities are strong enough to still enable this.
-- Implementation of successive syntax escape encoding
-- for escape character " and escape sequence "
function SSE_encode(text)
local was_escape = false
local out = ""
for _, scalar in utf8.codes(text) do
if scalar == utf8.codepoint([["]], 1, 1) and not was_escape then
out = out .. [["]]
end
was_escape = scalar == utf8.codepoint([["]], 1, 1)
out = out .. utf8.char(scalar)
end
return out
end
assert(SSE_encode([[hello "my"" world]]) == [[hello ""my""" world]])
-- Implementation of successive syntax escape encoding
-- for escape character " and escape sequence "
-- using string.gsub with Lua pattern matching
function SSE_gsub_encode(text)
return text:gsub([["+]], [[%1"]])
end
assert(SSE_gsub_encode([[hello "my"" world]]) == [[hello ""my""" world]])
-- Implementation of successive syntax escaping decoding
-- for escape character " and escape sequence "
function SSE_decode(text)
local was_escape = false
local out = ""
for _, scalar in utf8.codes(text) do
if scalar ~= utf8.codepoint([["]], 1, 1) or was_escape then
out = out .. utf8.char(scalar)
end
was_escape = scalar == utf8.codepoint([["]], 1, 1)
end
return out
end
assert(SSE_decode([[hello ""my""" world]]) == [[hello "my"" world]])
Variadic-length syntax
Variadic-length syntax requires us to remember the escape sequences in memory. I called the variable cache
.
Again an implementation with a replace-function is possible and provided in function VLSE_gsub_encode
.
-- Implementation of variadic-length syntax escape encoding
-- for escape character & and escape sequence & or "
function VLSE_encode(text)
local out = ""
for _, scalar in utf8.codes(text) do
if scalar == utf8.codepoint([[&]], 1, 1) then
out = out .. [[&]]
elseif scalar == utf8.codepoint([["]], 1, 1) then
out = out .. [["]]
else
out = out .. utf8.char(scalar)
end
end
return out
end
assert(VLSE_encode([[hello "my" & world]]) == [[hello "my" & world]])
-- Implementation of variadic-length syntax escape decoding
-- for escape character & and escape sequence & or "
-- using string.gsub with Lua pattern matching
function VLSE_gsub_encode(text)
-- NOTE: & must come first
return text:gsub([[&]], [[&]]):gsub([["]], [["]])
end
assert(VLSE_gsub_encode([[hello "my" & world]]) == [[hello "my" & world]])
-- Implementation of variadic-length syntax escape decoding
-- for escape character & and escape sequence & or "
function VLSE_decode(text)
local lookup = { ["""] = [["]], ["&"] = [[&]] }
local cache = ""
local out = ""
for _, scalar in utf8.codes(text) do
if scalar == utf8.codepoint([[&]], 1, 1) and cache == "" then
cache = "&"
elseif scalar ~= utf8.codepoint([[;]], 1, 1) and cache ~= "" then
cache = cache .. utf8.char(scalar)
elseif scalar == utf8.codepoint([[;]], 1, 1) and cache ~= "" then
out = out .. lookup[cache .. ";"]
cache = ""
else
out = out .. utf8.char(scalar)
end
end
return out
end
assert(VLSE_decode([[hello "my" & world]]) == [[hello "my" & world]])
Conclusion
None of the differences in algorithmic complexity are significant. Most use just a boolean flag (was_escape
) to track state.