Advanced Regular Expressions

Advanced Regular Expressions logo

Grazer Linuxtage, HS 03
April 2015, Lukas Prokop

Source code included!
English slides included!

Subtitle of this talk

dt.… weil 30 RegEx ausreichen können, um UNIX zu definieren.
engl.… because 30 regex can suffice to define UNIX

I define UNIX as 30 definitions of regular expressions living under one roof.
—Donald Ervin Knuth

Donald E. Knuth

RegEx in practice

My talk 'RegEx in practice'

Predecessor talk RegEx in practice

Recap

RegEx architecture

Recap

Recap

. one arbitrary character
?*+ the 3 quantifiers quantifying 0–1, 0–∞ and 1–∞
{1,3} exact quantifier with quantity 1 to 3
a|b alternation (either a or b)
[a-z] character lists (any character between lowercase a and z)
[^g-z] negated character list (any character not in g to z)
\d shorthand character lists
^regex$ anchors to require beginning and end of matching string
(regex) grouping to define submatches
[[:digit:]] character class
Not in this talk!

Advanced regular expressions

(?P<x>) Named groups Access group with identifier x afterwards.
(?:X) Unmatched groups Don't create a group for X, but scope it.

And who are we going to do discuss that?

Let's talk about APIs 😇

Tools and APIs

Programming languages

Programming language frequency according to TIOBE

Data provided by TIOBE Index

Programming languages

Java, Go, Python and Perl as selected languages

What to discuss

Rationale: I'm not aware of applications of XDuce and CDuce. HaRP features a broken link to the homepage. In general too commercial or too theoretical to be discussed here.

Parser generators feature CFG-based approaches (context-free grammar), require unambiguous grammars (regex isn't unambiguous) and structural persistency is more important than performance. Hence uninteresting.

Parsing expression grammars (PEG) have different semantics (greedy, non-backtracking), seem to be a super set and sorry, I didn't look into them thoroughly. Might be interesting; theory seems to be in progress.

Java

Pattern
compile(String regex[, int flags] )
matcher([String regex,] CharSequence input)
quote(String s)
split(CharSequence input[, int limit])
Matcher
find([int start])
group([int|String group])
matches()
replaceAll(String replacement)
replaceFirst(String replacement)

Links: Oracle's regex documentation, Source code

Java

Go

API described by a regex:

Find(All)?(String)?(Submatch)?(Index)?

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.)

The implementation is written natively in Go. Russ Cox is the regex expert who has written this engine.

re2 is another engine by Russ Cox which was ported to several languages: RE2/J, pyre2, re::engine::RE2

Links: Go regexp documentation Source code

Python

re.search(pattern, string, flags=0)
re.match(pattern, string, flags=0)
re.fullmatch(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
re.findall(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
re.escape(string)

Links: Python's re module, Source code

Perl

$string =~ /regex/
$string !~ /regex/
my $e = "Hello";
my $p = /$e World/;
print "Hello World" =~ $p, "\n";

Links: perlre, Source code

RegEx engines

Engines

Differences on engines:

Understanding leftmost

leftmost-longest
POSIX ERE (egrep) syntax
leftmost-first
Non-POSIX engines (perl, python, go, java)

Leftmost-longest means when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible.

Leftmost first

==============================================
Mode                ⇒ Leftmost first anyone
Regular expression  ⇒ .*(ham|and ham and spam)
Input               ⇒ eggs and ham and spam
Groups              ⇒ Yes, 1

Match               ⇒ eggs and ham and spam
                      ╰──────────╯  ↦ 'eggs and ham' [0–12]

                    ⇒ eggs and ham and spam - group 1
                               ╰─╯

Leftmost longest

================================================================
Mode                ⇒ Leftmost longest anyone
Regular expression  ⇒ .*(ham|and ham and spam)
Input               ⇒ eggs and ham and spam
Groups              ⇒ Yes, 1

Match               ⇒ eggs and ham and spam
                      ╰───────────────────╯  ↦ 'eggs and ham and spam' [0–21]

                    ⇒ eggs and ham and spam - group 1
                           ╰──────────────╯

RegEx by example

Exercise NOT string

Specification
Match an arbitrary except one string.
Exercise
Match any string except '01234'
Example testcases
ab
{empty string}
4
0
02
11
0123
01234
012345

Exercise NOT string

(^01234)
^[^01234]$
^[^0][^1][^2][^3][^4]$
^[^0]?[^1]?[^2]?[^3]?[^4]?$
^[^0]?[^1]?[^2]?[^3]?[^4]?.*$

Exercise NOT string

^([^0]?([^1]?([^2]?([^3]?([^4]?.*)))))$
^([^0]([^1]([^2]([^3]([^4].*)?)?)?)?)?$
^([^0]([^1]([^2]([^3]([^4].*)?|0123)?|012)?|01)?|0)?$
^([^0]([^1]([^2]([^3]([^4].*)|...3)|..2)|.1)|0|)$

Exercise NOT string

They all fail, because …

  1. Matches string starting with 01234; 01234 fails
  2. Matches only strings of length 1; 0 fails
  3. Matches only strings of length 5; 0 fails
  4. Matches only strings of length up to 5; 012345 fails
  5. 01234 skips square bracket expr and matches .*; 01234 fails
  6. Same as previous, just groups added
  7. Next character must not correspond to 01234, but if current character does not correspond, anything is allowed; 11 fails
  8. Fixes previous for some cases, but 11 still fails
  9. Fixes previous for some cases, but 02 still fails

Exercise NOT string

We conclude:

  • If the string is shorter than 5 characters, we don't care.
    .?.?.?.?
  • If the string has length 5, it is not 01234.
  • Not 01234 at fixed length means some character is different.
    [^0]....|.[^1]...|..[^2]..|...[^3].|....[^4]
  • If the string is longer, we don't care.
    .{6,}

Exercise NOT string

^(|.|..|...|....|[^0]....|.[^1]...|..[^2]..|...[^3].|....[^4]|......+)$

To the best of my knowledge, a correct solution to our problem 😙

Exercise NOT string

But we know negative lookaheads, right?

^(?!01234).*$

Wrong. Why? Would not match 012345. Let's fix it:

^(?!01234$).*$

Fine, works 😙 To fix potential problem that dot does not match newline:

^(?!01234$)(.|\n)*$

Exercise NOT string performance

  1. |.|..|...|....|…
  2. .?.?.?…

For excluding a string of length 1389 (1…500), I computed a regex with (1) 2,914,158 or (2) 1,947,413 characters and 123,761 test strings.

python
3.4.0
go
1.3
OpenJDK
java 1.7.0_79
perl
5.18.2
no lookahead, (1) 666 sec 155 sec 15 sec 134 sec
no lookahead, (2) 340 sec 2960 sec 1.954 sec 86 sec
lookahead 11.2 sec unsupported 0.416 sec 1 sec

Usecase variadic matches

Usecase: A text file contains edges of a graph. It stores several numbers in a line:

4 42 1337

The first number defines the identifier of the base vertex. All consecutive numbers/vertices are connected to the base vertex.

Graph edges

Task: Match all integers and store them in a list.

Usecase variadic matches

So …

  1. define a regex to match integers separated by a space
  2. retrieve those integers as groups

… right?

>>> import re
>>> regex = r"^(\d+(?: |$))+"
>>> reg = re.compile(regex)
>>> match = reg.search("4 42 1337")
>>> match.groups()
('1337',)

But we actually want 😥

('4', '42', '1337')

Usecase variadic matches

Case distinction:

RegEx match do not overlap
go: re.FindAllString, py: re.findall,
perl: @matches = $input =~ /regex/g;
java: while (matcher_object.find()) { matcher_object.group(0) }
RegEx match overlaps with next possible match
Last match started at index i? Start search at i+1.

Usecase variadic matches

Let's do it:

>>> import re
>>> regex = r"^(\d+(?: |$))+"
>>> reg = re.compile(regex)
>>> reg.findall("4 42 1337")
['1337']

What went wrong here?

Fix your regex!

>>> import re
>>> regex = r"\d+"
>>> reg = re.compile(regex)
>>> reg.findall("4 42 1337")
['4', '42', '1337']

Usecase recursive structures

Backreferences allow context-free specifications. What about XML?

>>> import re
>>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text></text></svg>')
[('svg', '<text></text>')]
>>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text></text></svg>'[5:])
[('text', '')]
>>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text>Hello World</text></svg>'[5:])
[('text', 'Hello World')]
        
        

Backreferences, lookaheads and lookbehinds as features are the main reason, why regex engines are working inefficiently. They need to store a lot of information about the current parsing state which makes memory management kind of difficult.

Unicode

Q: How do you match a newline?

I assume most think of something like

\r?\n

Mac OS up to version 9, …

(\r|\n|\r\n)

Acorn BBC, RISC OS, …

\r?\n|\n?\r
As it turns out, it is more complicated. It is actually:
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

Line Feed, Vertical Tab, Form Feed, Carriage Return, Next Line, Line Separator, Paragraph Separator

Java's \R matches that.

RegEx for fun

Parsing email addresses

Thomas Limoncelli's email parsing regex

[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x 80-\xff\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80 -\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\01 5()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(? :\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040) <>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\ 015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\ ))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x8 0-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:" .\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\ ([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\x ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t ]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xf f])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^ \\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x 80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\015()]* (?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\n\0 15"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(? :(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\ t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n \015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\01 5\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\x ff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0 15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?! [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\ [^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*( ?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015 ()]*)*\)[\040\t]*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\0 15()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff ])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(? :\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80 -\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\ [\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\ x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[ \040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()] *)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])| "[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\x ff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]* (?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\0 40)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\ n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)* \))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x8 0-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:". \\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([ ^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff \n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]* )*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff] )*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\ \x80-\xff\n\015()]*)*\)[\040\t]*)*)*>)

Parsing email addresses

Mail::RFC822::Address

whois grml.solutions

mika and solutions email address

mikagrml@twitter

RegViz

RegViz Visual Debugging of Regular Expressions for JavaScript

regviz.org

Regex Crossword

RegEx crossword

regexcrossword.com

RegHex

RegHex

rampion.github.io/RegHex

RegEx to parse regex?

Regex recognizing Regex post on Stackoverflow
/^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$/

StackOverflow: Is there a regular expression to detect a valid regular expression?

Conclusion

Interesting reading material

Russ Cox
“Implementing regular expressions”
Unicode Consortium
“Unicode Regular Expressions”
Nick Patch
“Unicode Regular Expression Engines”
Claus Brabrand, Jakob G. Thomsen
“Typed and Unambiguous Pattern Matching on Strings using Regular Expressions”

Conclusion

Thanks!

Please stand back; we know regex!

Ninja

http://lukas-prokop.at/talks/advanced_regex/

/