Grazer Linuxtage, HS 03
April 2015, Lukas Prokop
Source code included!
English slides included!
dt. | … weil 30 RegEx ausreichen können, um UNIX zu definieren. |
engl. | … because 30 regex can suffice to define UNIX |
I define UNIX as 30 definitions of regular expressions living under one roof.
—Donald Ervin Knuth
RegEx in practice
. | one arbitrary character |
?*+ | the 3 quantifiers quantifying 0–1, 0–∞ and 1–∞ |
{1,3} | exact quantifier with quantity 1 to 3 |
a|b | alternation (either a or b ) |
[a-z] | character lists (any character between lowercase a and z ) |
[^g-z] | negated character list (any character not in g to z ) |
\d | shorthand character lists |
^regex$ | anchors to require beginning and end of matching string |
(regex) | grouping to define submatches |
[[:digit:]] | character class |
(?P<x>) | Named groups | Access group with identifier x afterwards. |
(?:X) | Unmatched groups | Don't create a group for X, but scope it. |
(?=X)Y | Positive lookahead | Iff X follows, continue with Y. |
(?<=X)Y | Positive lookbehind | Iff X was matched, continue with Y. |
(?!X)Y | Negative lookahead | Iff X will not be matched, continue with Y. |
(?<!X)Y | Negative lookbehind | Iff X was not matched, continue with Y. |
\p{X} | Unicode categories | Character of Unicode category X? |
(?(grp)then|else) | Conditionals | If grp matched, match then otherwise else |
(X)Y\1 | Backreferences | Reuse match of previous group X |
And who are we going to do discuss that?
Let's talk about APIs 😇
Rationale: I'm not aware of applications of XDuce and CDuce. HaRP features a broken link to the homepage. In general too commercial or too theoretical to be discussed here.
Parser generators feature CFG-based approaches (context-free grammar), require unambiguous grammars (regex isn't unambiguous) and structural persistency is more important than performance. Hence uninteresting.
Parsing expression grammars (PEG) have different semantics (greedy, non-backtracking), seem to be a super set and sorry, I didn't look into them thoroughly. Might be interesting; theory seems to be in progress.
If you want to escape metacharacters, you have to do it with two backslashes.
\[harlo\]
\\[harlo\\]
API described by a regex:
Find(All)?(String)?(Submatch)?(Index)?
The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.)
The implementation is written natively in Go. Russ Cox is the regex expert who has written this engine.
re2 is another engine by Russ Cox which was ported to several languages: RE2/J, pyre2, re::engine::RE2
re.search(pattern, string, flags=0)
re.match(pattern, string, flags=0)
re.fullmatch(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
re.findall(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
re.escape(string)
Links: Python's re module, Source code
$string =~ /regex/
$string !~ /regex/
my $e = "Hello"; my $p = /$e World/; print "Hello World" =~ $p, "\n";
Links: perlre, Source code
Differences on engines:
Leftmost-longest means when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible.
============================================== Mode ⇒ Leftmost first anyone Regular expression ⇒ .*(ham|and ham and spam) Input ⇒ eggs and ham and spam Groups ⇒ Yes, 1 Match ⇒ eggs and ham and spam ╰──────────╯ ↦ 'eggs and ham' [0–12] ⇒ eggs and ham and spam - group 1 ╰─╯
================================================================ Mode ⇒ Leftmost longest anyone Regular expression ⇒ .*(ham|and ham and spam) Input ⇒ eggs and ham and spam Groups ⇒ Yes, 1 Match ⇒ eggs and ham and spam ╰───────────────────╯ ↦ 'eggs and ham and spam' [0–21] ⇒ eggs and ham and spam - group 1 ╰──────────────╯
(^01234)
^[^01234]$
^[^0][^1][^2][^3][^4]$
^[^0]?[^1]?[^2]?[^3]?[^4]?$
^[^0]?[^1]?[^2]?[^3]?[^4]?.*$
^([^0]?([^1]?([^2]?([^3]?([^4]?.*)))))$
^([^0]([^1]([^2]([^3]([^4].*)?)?)?)?)?$
^([^0]([^1]([^2]([^3]([^4].*)?|0123)?|012)?|01)?|0)?$
^([^0]([^1]([^2]([^3]([^4].*)|...3)|..2)|.1)|0|)$
They all fail, because …
.*
; 01234 failsWe conclude:
some character is different.
^(|.|..|...|....|[^0]....|.[^1]...|..[^2]..|...[^3].|....[^4]|......+)$
To the best of my knowledge, a correct solution to our problem 😙
But we know negative lookaheads, right?
^(?!01234).*$
Wrong. Why? Would not match 012345. Let's fix it:
^(?!01234$).*$
Fine, works 😙 To fix potential problem that dot does not match newline:
^(?!01234$)(.|\n)*$
|.|..|...|....|…
.?.?.?…
For excluding a string of length 1389 (1…500), I computed a regex with (1) 2,914,158 or (2) 1,947,413 characters and 123,761 test strings.
python 3.4.0 | go 1.3 | OpenJDK java 1.7.0_79 | perl 5.18.2 | |
---|---|---|---|---|
no lookahead, (1) | 666 sec | 155 sec | 15 sec | 134 sec |
no lookahead, (2) | 340 sec | 2960 sec | 1.954 sec | 86 sec |
lookahead | 11.2 sec | unsupported | 0.416 sec | 1 sec |
Usecase: A text file contains edges of a graph. It stores several numbers in a line:
4 42 1337
The first number defines the identifier of the base vertex. All consecutive numbers/vertices are connected to the base vertex.
Task: Match all integers and store them in a list.
So …
… right?
>>> import re >>> regex = r"^(\d+(?: |$))+" >>> reg = re.compile(regex) >>> match = reg.search("4 42 1337")
>>> match.groups() ('1337',)
But we actually want 😥
('4', '42', '1337')
Case distinction:
re.FindAllString
,
py: re.findall
, @matches = $input =~ /regex/g;
while (matcher_object.find()) { matcher_object.group(0) }
Let's do it:
>>> import re >>> regex = r"^(\d+(?: |$))+" >>> reg = re.compile(regex) >>> reg.findall("4 42 1337") ['1337']
What went wrong here?
Fix your regex!
>>> import re >>> regex = r"\d+" >>> reg = re.compile(regex) >>> reg.findall("4 42 1337") ['4', '42', '1337']
Backreferences allow context-free specifications. What about XML?
>>> import re >>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text></text></svg>') [('svg', '<text></text>')] >>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text></text></svg>'[5:]) [('text', '')] >>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text>Hello World</text></svg>'[5:]) [('text', 'Hello World')]Backreferences, lookaheads and lookbehinds as features are the main reason, why regex engines are working inefficiently. They need to store a lot of information about the current parsing state which makes memory management kind of difficult.
Q: How do you match a newline?
I assume most think of something like
\r?\n
Mac OS up to version 9, …
(\r|\n|\r\n)
Acorn BBC, RISC OS, …
\r?\n|\n?\r
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
Line Feed, Vertical Tab, Form Feed, Carriage Return, Next Line, Line Separator, Paragraph Separator
Java's \R
matches that.
Thomas Limoncelli's email parsing regex
[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x 80-\xff\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80 -\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\01 5()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(? :\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040) <>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\ 015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\ ))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x8 0-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:" .\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\ ([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\x ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t ]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xf f])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^ \\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x 80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\015()]* (?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\n\0 15"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(? :(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\ t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n \015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\01 5\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\x ff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0 15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?! [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\ [^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*( ?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015 ()]*)*\)[\040\t]*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\0 15()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff ])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(? :\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80 -\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\ [\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\ x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[ \040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()] *)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])| "[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\x ff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]* (?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\0 40)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\ n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)* \))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x8 0-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:". \\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([ ^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff \n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]* )*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff] )*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\ \x80-\xff\n\015()]*)*\)[\040\t]*)*)*>)
/^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$/
StackOverflow: Is there a regular expression to detect a valid regular expression?
Please stand back; we know regex!
/