Advanced Regular Expressions

Grazer Linuxtage, HS 03
April 2015, Lukas Prokop

Source code included!
English slides included!

Subtitle of this talk

dt.	… weil 30 RegEx ausreichen können, um UNIX zu definieren.
engl.	… because 30 regex can suffice to define UNIX

I define UNIX as 30 definitions of regular expressions living under one roof.
—Donald Ervin Knuth

RegEx in practice

Predecessor talk RegEx in practice

Recap

Recap

RegEx is a domain-specific language (DSL).
It is typically embedded into more general languages like C++, python, perl, ruby, Scala, …
One string specifies several strings.
Metacharacters have special meanings like quantities.
RegEx is a required tool for every programmer. Why? Because we write algorithms for data processing. Data processing requires matching, extracting, replacing, … data.

Recap

.	one arbitrary character
?*+	the 3 quantifiers quantifying 0–1, 0–∞ and 1–∞
{1,3}	exact quantifier with quantity 1 to 3
a\|b	alternation (either `a` or `b`)
[a-z]	character lists (any character between lowercase `a` and `z`)
[^g-z]	negated character list (any character not in `g` to `z`)
\d	shorthand character lists
^regex$	anchors to require beginning and end of matching string
(regex)	grouping to define submatches
[[:digit:]]	character class

Not in this talk!

Advanced regular expressions

(?P<x>)	Named groups	Access group with identifier `x` afterwards.
(?:X)	Unmatched groups	Don't create a group for X, but scope it.
(?=X)Y	Positive lookahead	Iff X follows, continue with Y.
(?<=X)Y	Positive lookbehind	Iff X was matched, continue with Y.
(?!X)Y	Negative lookahead	Iff X will not be matched, continue with Y.
(?<!X)Y	Negative lookbehind	Iff X was not matched, continue with Y.
\p{X}	Unicode categories	Character of Unicode category X?
(?(grp)then\|else)	Conditionals	If `grp` matched, match `then` otherwise `else`
(X)Y\1	Backreferences	Reuse match of previous group X

And who are we going to do discuss that?

Let's talk about APIs 😇

Tools and APIs

Programming languages

Programming language frequency according to TIOBE

Data provided by TIOBE Index

Programming languages

Java, Go, Python and Perl as selected languages

What to discuss

Java, Go, Python and Perl analyzed.
No libraries like XDuce, CDuce and HaRP.
No parsers like ANTLR, bison, Yacc and flex.

Rationale: I'm not aware of applications of XDuce and CDuce. HaRP features a broken link to the homepage. In general too commercial or too theoretical to be discussed here.

Parser generators feature CFG-based approaches (context-free grammar), require unambiguous grammars (regex isn't unambiguous) and structural persistency is more important than performance. Hence uninteresting.

Parsing expression grammars (PEG) have different semantics (greedy, non-backtracking), seem to be a super set and sorry, I didn't look into them thoroughly. Might be interesting; theory seems to be in progress.

Java

Pattern: compile(String regex[, int flags] )
matcher([String regex,] CharSequence input)
quote(String s)
split(CharSequence input[, int limit])
Matcher: find([int start])
group([int|String group])
matches()
replaceAll(String replacement)
replaceFirst(String replacement)

Links: Oracle's regex documentation, Source code

Java

Lookaheads or lookbehinds must be strings of fixed length.
If you want to escape metacharacters, you have to do it with two backslashes.
```
\[harlo\]
```
```
\\[harlo\\]
```
Java is old and stable. Different implementations of the JVM like SunJDK and OpenJDK exist.

Go

API described by a regex:

Find(All)?(String)?(Submatch)?(Index)?

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. (This is a property not guaranteed by most open source implementations of regular expressions.)

The implementation is written natively in Go. Russ Cox is the regex expert who has written this engine.

re2 is another engine by Russ Cox which was ported to several languages: RE2/J, pyre2, re::engine::RE2

Links: Go regexp documentation Source code

Python

re.search(pattern, string, flags=0)
re.match(pattern, string, flags=0)
re.fullmatch(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
re.findall(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
re.escape(string)

Links: Python's re module, Source code

Perl

$string =~ /regex/

$string !~ /regex/

A lot of modifiers
Inspired the famous PCRE library
perl was famous for its text processing capabilities. Nowadays it lacks activity (AFAICS).
String interpolation as a performance problem.

my $e = "Hello";
my $p = /$e World/;
print "Hello World" =~ $p, "\n";

Links: perlre, Source code

RegEx engines

Engines

Differences on engines:

What can be extracted? Only fixed number of arguments? Can you access structural information?
Embedding into host language? Syntactic sugar? String interpolation? Applicable to pattern matching?
Different modifiers but they are mostly the same. Local modifiers?
Ambiguities get resolved?
Configurability of greediness?
Word boundaries
Caching
Memory requirements
Runtime guarantees / efficiency

Understanding leftmost

leftmost-longest: POSIX ERE (egrep) syntax
leftmost-first: Non-POSIX engines (perl, python, go, java)

Leftmost-longest means when matching against text, the regexp returns a match that begins as early as possible in the input (leftmost), and among those it chooses a match that is as long as possible.

Leftmost first

==============================================
Mode                ⇒ Leftmost first anyone
Regular expression  ⇒ .*(ham|and ham and spam)
Input               ⇒ eggs and ham and spam
Groups              ⇒ Yes, 1

Match               ⇒ eggs and ham and spam
                      ╰──────────╯  ↦ 'eggs and ham' [0–12]

                    ⇒ eggs and ham and spam - group 1
                               ╰─╯

Leftmost longest

================================================================
Mode                ⇒ Leftmost longest anyone
Regular expression  ⇒ .*(ham|and ham and spam)
Input               ⇒ eggs and ham and spam
Groups              ⇒ Yes, 1

Match               ⇒ eggs and ham and spam
                      ╰───────────────────╯  ↦ 'eggs and ham and spam' [0–21]

                    ⇒ eggs and ham and spam - group 1
                           ╰──────────────╯

RegEx by example

Exercise NOT string

Specification: Match an arbitrary except one string.
Exercise: Match any string except '01234'
Example testcases: ab
{empty string}
4
0
02
11
0123
01234
012345

Exercise NOT string

(^01234)

^[^01234]$

^[^0][^1][^2][^3][^4]$

^[^0]?[^1]?[^2]?[^3]?[^4]?$

^[^0]?[^1]?[^2]?[^3]?[^4]?.*$

Exercise NOT string

^([^0]?([^1]?([^2]?([^3]?([^4]?.*)))))$

^([^0]([^1]([^2]([^3]([^4].*)?)?)?)?)?$

^([^0]([^1]([^2]([^3]([^4].*)?|0123)?|012)?|01)?|0)?$

^([^0]([^1]([^2]([^3]([^4].*)|...3)|..2)|.1)|0|)$

Exercise NOT string

They all fail, because …

Matches string starting with 01234; 01234 fails
Matches only strings of length 1; 0 fails
Matches only strings of length 5; 0 fails
Matches only strings of length up to 5; 012345 fails
01234 skips square bracket expr and matches .*; 01234 fails
Same as previous, just groups added
Next character must not correspond to 01234, but if current character does not correspond, anything is allowed; 11 fails
Fixes previous for some cases, but 11 still fails
Fixes previous for some cases, but 02 still fails

Exercise NOT string

We conclude:

If the string is shorter than 5 characters, we don't care.
.?.?.?.?
If the string has length 5, it is not 01234.
Not 01234 at fixed length means some character is different.
[^0]....|.[^1]...|..[^2]..|...[^3].|....[^4]
If the string is longer, we don't care.
.{6,}

Exercise NOT string

^(|.|..|...|....|[^0]....|.[^1]...|..[^2]..|...[^3].|....[^4]|......+)$

To the best of my knowledge, a correct solution to our problem 😙

Exercise NOT string

But we know negative lookaheads, right?

^(?!01234).*$

Wrong. Why? Would not match 012345. Let's fix it:

^(?!01234$).*$

Fine, works 😙 To fix potential problem that dot does not match newline:

^(?!01234$)(.|\n)*$

Exercise NOT string performance

```
|.|..|...|....|…
```
```
.?.?.?…
```

For excluding a string of length 1389 (1…500), I computed a regex with (1) 2,914,158 or (2) 1,947,413 characters and 123,761 test strings.

	python 3.4.0	go 1.3	OpenJDK java 1.7.0_79	perl 5.18.2
no lookahead, (1)	666 sec	155 sec	15 sec	134 sec
no lookahead, (2)	340 sec	2960 sec	1.954 sec	86 sec
lookahead	11.2 sec	unsupported	0.416 sec	1 sec

Usecase variadic matches

Usecase: A text file contains edges of a graph. It stores several numbers in a line:

4 42 1337

The first number defines the identifier of the base vertex. All consecutive numbers/vertices are connected to the base vertex.

Task: Match all integers and store them in a list.

Usecase variadic matches

So …

define a regex to match integers separated by a space
retrieve those integers as groups

… right?

>>> import re
>>> regex = r"^(\d+(?: |$))+"
>>> reg = re.compile(regex)
>>> match = reg.search("4 42 1337")

>>> match.groups()
('1337',)

But we actually want 😥

('4', '42', '1337')

Usecase variadic matches

Case distinction:

RegEx match do not overlap: go: re.FindAllString, py: re.findall,
perl: @matches = $input =~ /regex/g;
java: while (matcher_object.find()) { matcher_object.group(0) }
RegEx match overlaps with next possible match: Last match started at index i? Start search at i+1.

Usecase variadic matches

Let's do it:

>>> import re
>>> regex = r"^(\d+(?: |$))+"
>>> reg = re.compile(regex)
>>> reg.findall("4 42 1337")
['1337']

What went wrong here?

Fix your regex!

>>> import re
>>> regex = r"\d+"
>>> reg = re.compile(regex)
>>> reg.findall("4 42 1337")
['4', '42', '1337']

Usecase recursive structures

Backreferences allow context-free specifications. What about XML?

>>> import re
>>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text></text></svg>')
[('svg', '<text></text>')]
>>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text></text></svg>'[5:])
[('text', '')]
>>> re.findall(r'<(\w+)>(.*)</\1>', '<svg><text>Hello World</text></svg>'[5:])
[('text', 'Hello World')]
        
        
          Backreferences, lookaheads and lookbehinds as features are the main reason,
          why regex engines are working inefficiently. They need to store a lot of information about the current parsing state which makes memory management kind of difficult.

Unicode

Q: How do you match a newline?

I assume most think of something like

\r?\n

Mac OS up to version 9, …

(\r|\n|\r\n)

Acorn BBC, RISC OS, …

\r?\n|\n?\r

As it turns out, it is more complicated. It is actually:

\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

Line Feed, Vertical Tab, Form Feed, Carriage Return, Next Line, Line Separator, Paragraph Separator

Java's \R matches that.

RegEx for fun

Parsing email addresses

Thomas Limoncelli's email parsing regex

[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x 80-\xff\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80 -\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\01 5()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(? :\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040) <>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\ 015"]*)*")[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\ ))[^\\\x80-\xff\n\015()]*)*$[\040\t]*)*)*@[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x8 0-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:" .\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\ ([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:$[^\\\x80-\x ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t ]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xf f])*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^ \\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x 80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:$[^\\\x80-\xff\n\015()]* (?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\n\0 15"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(? :(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\ t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n \015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\01 5\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\x ff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0 15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?! [^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\ [^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*( ?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015 ()]*)*\)[\040\t]*)*@[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\0 15()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff ])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(? :\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80 -\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\ [\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:$[^\\\ x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[ \040\t]*)*)*)*:[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()] *)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])| "[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\x ff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:$[^\\\x80-\xff\n\015()]* (?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\0 40)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\ n\015"]*)*")[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)* $)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x8 0-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:". \\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([ ^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:$[^\\\x80-\xff \n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\\x80-\xff\n\015()]*)*\)[\040\t]* )*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff] )*\])[\040\t]*(?:$[^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*$)[^\\ \x80-\xff\n\015()]*)*\)[\040\t]*)*)*>)

Parsing email addresses

Mail::RFC822::Address

In general, email address parsing is difficult and broken.
But take this with a grain of salt. This is just IMHO.
Check whether @ is included and verify with a verification link.

whois grml.solutions

mikagrml@twitter

RegViz

regviz.org

Regex Crossword

regexcrossword.com

RegHex

rampion.github.io/RegHex

RegEx to parse regex?

Regex recognizing Regex post on Stackoverflow

/^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$/

StackOverflow: Is there a regular expression to detect a valid regular expression?

Conclusion

Interesting reading material

Russ Cox: “Implementing regular expressions”
Unicode Consortium: “Unicode Regular Expressions”
Nick Patch: “Unicode Regular Expression Engines”
Claus Brabrand, Jakob G. Thomsen: “Typed and Unambiguous Pattern Matching on Strings using Regular Expressions”

Conclusion

Do not use too powerful tools for simple jobs!
Do not let user specify and execute regular expressions!
We are not there yet!
I think there will be something going on in the next years.
I also think DSLs are a good idea, but we need standards.

Thanks!

Please stand back; we know regex!

http://lukas-prokop.at/talks/advanced_regex/