How tux21b fixed my regex §

Table of Contents

Background §

  1. At Grazer Linuxtage in April 2015, I gave a talk about Advanced Regular Expressions.
  2. I showed a slide with a table claiming that Go performed incredibly badly compared to other regular expression engines, even though it advertises linear asymptotic runtime behavior. I also published the source code in question.
  3. In a discussion at gograz in May, the more advanced users made claims such as: "bytes are better than strings (semantically and for performance)", "a new Go regex engine was recently released which probably fixes this problem", and "Go requires programmers to think hard about the best tool for the job; it does not add magic to provide a nice API".
  4. tux21b (one of the participants of gograz) published an improved solution to my regex problem on Google Plus.
  4. tux21b (one of the participants of gograz) published an improved solution to my regex problem at Google Plus.
The slide with the table in question

I was very curious about this. The program in question was my first larger Go program and certainly did not use Go idiomatically. So what really matters for performance? It's time to evaluate!

Too long, didn't read? Jump to the conclusions.

Test setup §

The benchmarks below were retrieved using the following setup:

The regex problem §

I solved a regex problem in various programming environments to create the table. The problem is defined as follows:

Given. A lower and upper bound (l, u). The slides used (0, 4) in the problem introduction whereas (1, 500) is used for testing regex engines [i.e. the table].

Find. A regular expression matching any string except the exact concatenation of the numbers l to u.

So 0123 and 012345 must be accepted for (0, 4), whereas 01234 must be rejected. Check out the slides if you are unsure what this is all about.

Performance of the two solutions §

author      identifier  lines of code  regex length     runtime
meisterluk  00          141            1,947,413 chars  2,684.796 s in critical path; 2,691.52 s in total
tux21b      99          101            22,275 chars     0.049984642 s in critical path; 0.722 s in total

Be aware that the roughly 45 minutes measured here are less than the 2,960 seconds (about 49 minutes) I mentioned in the slides. CPU frequency scaling might be responsible, because I did not disable it back then.

By critical path we mean the matching of strings against the test suite, i.e. the total runtime excluding generation of the regex and the test suite.

Conclusions. tux21b's regex is only ~1% of the length of meisterluk's regex. tux21b's implementation runs about 53,712 times faster in the critical path and about 3,728 times faster in total. Be aware that both values are subject to the operating system's scheduling and therefore not perfectly reliable.

The solutions §

In the following sections we will explore the differences between the two programs. We will gradually merge both programs and benchmark interesting steps. We will also briefly discuss benchmarking tools in Go.

Merging the approaches §

Source code versions and their development

Properties of 00 and 99 §

The following properties hold for the two versions:

Benchmarking with time §

00 uses a helper at line 108. The defer statement executes a function call when the surrounding function returns; it is similar to a finally clause covering the whole function.

It is important to recognize that the arguments are evaluated when the defer statement is encountered, not when the deferred call executes. Therefore, when run_test starts executing, time.Now() is evaluated immediately. The function then runs, and when it finishes, trackingTime computes the difference between that argument and the current time (time.Now()). Using the semantics of defer together with the time package in this way is a clever trick. This implementation is based on a Stack Overflow answer.

Diff
--- 01.go 2015-06-20 23:47:40.544336817 +0200
+++ 00.go  2015-06-16 13:24:30.562401000 +0200
@@ -7,6 +7,7 @@
  "regexp"
  "strconv"
  "strings"
+  "time"
 )
 
 // Generate a regex accepting any string but the string of concatenated numbers
@@ -96,8 +97,16 @@
  return testsuite
 }
 
+// A helper for measuring time
+func trackingTime(begin time.Time) {
+  elapsed := time.Since(begin)
+  fmt.Printf("It takes %s to test all input strings.", elapsed)
+}
+
 // Actually run all pattern searches.
 func run_test(pat *regexp.Regexp, testsuite map[string]bool) {
+  defer trackingTime(time.Now())
+
  for input, match := range testsuite {
    //mat := pat.FindStringIndex(input)
    //matches := (mat != nil)

Benchmarking with testing §

testing is Go's general testing and benchmarking package. The file name needs to end with _test.go, which is why you need to save code 99 under a special file name like 99_test.go.

It is also important to log all debug or error information via the provided testing.B instance instead of panicking or using log. An example is given at line 91.

After that you can use go test to run benchmarks and correctness tests. Functions implementing correctness tests must have names starting with Test. Benchmark functions must start with Benchmark, like version 99 at line 84.

If you want to measure only the critical path of the program, you stop the timer at the start of the benchmark and resume measurement once the critical path is reached. 99 stops the timer at line 85 and starts it again at line 93. I think there is a bug here: (*testing.B).ResetTimer should be used instead of the StopTimer/StartTimer pair.

Then you can run tests or benchmarks using go test and specify a regular expression; every benchmark or test whose name matches the regex will be run.

An example command line is:

go test -bench 'RegexpNot' 99_test.go
Diff
--- new/98.go 2015-06-21 00:52:47.868425972 +0200
+++ old/99.go  2015-06-21 00:53:06.832426405 +0200
@@ -6,6 +6,7 @@
  "bytes"
  "fmt"
  "regexp"
+  "testing"
  "unicode/utf8"
 )
 
@@ -66,31 +67,35 @@
  {[]byte("012345"), true},
 }
 
-func TestRegexpNot() {
+func TestRegexpNot(t *testing.T) {
  str := generateString(0, 4)
-  panic(fmt.Sprintf("Regexp:", generateRegexp(str)))
+  t.Log("Regexp:", generateRegexp(str))
  re, err := regexp.Compile(generateRegexp(str))
  if err != nil {
-    panic(fmt.Sprintf("regexp.Compile:", err))
+    t.Fatal("regexp.Compile:", err)
  }
  for _, test := range simpleTests {
    if m := re.Match(test.Pattern); m != test.Match {
-      panic(fmt.Sprintf("Test %q: expected %v, got %v.", test.Pattern, test.Match, m))
+      t.Errorf("Test %q: expected %v, got %v.", test.Pattern, test.Match, m)
    }
  }
 }
 
-func main() {
+func BenchmarkRegexpNot(b *testing.B) {
+  b.StopTimer()
  lower, upper := 1, 500
  regex := generateRegexp(generateString(lower, upper))
  tests := generateTests(lower, upper)
  re, err := regexp.Compile(regex)
  if err != nil {
-    panic(fmt.Sprintf("regexp.Compile:", err))
+    b.Fatal("regexp.Compile:", err)
  }
-  for _, t := range tests {
-    if m := re.Match(t.Pattern); m != t.Match {
-      panic(fmt.Sprintf("Test %q: expected %v, got %v", t.Pattern, t.Match, m))
+  b.StartTimer()
+  for i := 0; i < b.N; i++ {
+    for _, t := range tests {
+      if m := re.Match(t.Pattern); m != t.Match {
+        b.Errorf("Test %q: expected %v, got %v", t.Pattern, t.Match, m)
+      }
    }
  }
 }

00 ⇒ 01, 99 ⇒ 98: Removing benchmarking code §

01 removes the time benchmarking code of 00.
98 removes the testing benchmarking code of 99.

00 critical path 2,684.796 seconds
00 overall runtime 2,691.52 seconds
01 overall runtime 2,688.93 seconds
98 overall runtime 0.710 seconds
99 critical path 0.049984642 seconds
99 overall runtime 0.722 seconds

No significant speedup was observable.

01 ⇒ 02, 98 ⇒ 97: Using opposite tools §

And as a brief test, let's swap the benchmarking tools: tux21b's source code with time and meisterluk's source code with testing.

97 critical path 54.419541 ms
97 overall runtime 0.724 seconds

When we try to run source code 02 the program actually crashes:

SIGQUIT: quit
PC=0x42c3ef

goroutine 20 [running]:
regexp.(*machine).step(0xc20804c200, 0xc20804c218, 0xc20804c248, 0x34, 0x35, 0x2000000033)
  /usr/local/go1.3/src/pkg/regexp/exec.go:216 +0x27f fp=0xc2180adab8 sp=0xc2180ad948
regexp.(*machine).match(0xc20804c200, 0x7faed986f5c8, 0xc20804c2c8, 0x34, 0x47aa01)
  /usr/local/go1.3/src/pkg/regexp/exec.go:166 +0x455 fp=0xc2180adb70 sp=0xc2180adab8
regexp.(*Regexp).doExecute(0xc208064960, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc208010b03, 0x1bd, 0x0, 0x0, ...)
  /usr/local/go1.3/src/pkg/regexp/exec.go:439 +0x38e fp=0xc2180adcb0 sp=0xc2180adb70
regexp.(*Regexp).MatchString(0xc208064960, 0xc208010b03, 0x1bd, 0xc208014901)
  /usr/local/go1.3/src/pkg/regexp/regexp.go:400 +0x71 fp=0xc2180add38 sp=0xc2180adcb0
command-line-arguments.run_test(0xc208064960, 0xc208028330)
  /home/meisterluk/Desktop/notstring-performance-fix/new/02_test.go:105 +0xbc fp=0xc2180ade38 sp=0xc2180add38
command-line-arguments.BenchmarkRegexpNot(0xc20804c100)
  /home/meisterluk/Desktop/notstring-performance-fix/new/02_test.go:133 +0x442 fp=0xc2180adf10 sp=0xc2180ade38
testing.(*B).runN(0xc20804c100, 0x1)
  /usr/local/go1.3/src/pkg/testing/benchmark.go:124 +0x94 fp=0xc2180adf20 sp=0xc2180adf10
testing.(*B).launch(0xc20804c100)
  /usr/local/go1.3/src/pkg/testing/benchmark.go:197 +0x7b fp=0xc2180adfa0 sp=0xc2180adf20
runtime.goexit()
  /usr/local/go/src/pkg/runtime/proc.c:1445 fp=0xc2180adfa8 sp=0xc2180adfa0
created by testing.(*B).run
  /usr/local/go1.3/src/pkg/testing/benchmark.go:177 +0x3b

goroutine 16 [chan receive, 9 minutes]:
testing.(*B).run(0xc20804c100, 0x0, 0x0, 0x0, 0x0, 0x0)
  /usr/local/go1.3/src/pkg/testing/benchmark.go:178 +0x61
testing.RunBenchmarks(0x583c50, 0x5fdb00, 0x1, 0x1)
  /usr/local/go1.3/src/pkg/testing/benchmark.go:310 +0x53a
testing.Main(0x583c50, 0x609da0, 0x0, 0x0, 0x5fdb00, 0x1, 0x1, 0x609da0, 0x0, 0x0)
  /usr/local/go1.3/src/pkg/testing/testing.go:444 +0x1b5
main.main()
  command-line-arguments/_test/_testmain.go:47 +0x9c

goroutine 19 [finalizer wait, 9 minutes]:
runtime.park(0x413090, 0x606990, 0x6054e9)
  /usr/local/go/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x606990, 0x6054e9)
  /usr/local/go/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
  /usr/local/go/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
  /usr/local/go/src/pkg/runtime/proc.c:1445

rax     0xc20804c200
rbx     0xc225655790
rcx     0xaac
rdx     0x3f3
rdi     0x3f4
rsi     0xc2124b6000
rbp     0x0
rsp     0xc2180ad948
r8      0x7760
r9      0xee0ff
r10     0xc218a09160
r11     0xc20804c200
r12     0xc20804c218
r13     0xc218f1fb78
r14     0x4
r15     0xc226530000
rip     0x42c3ef
rflags  0x246
cs      0x33
fs      0x0
gs      0x0
*** Test killed with quit: ran too long (10m0s).
FAIL  command-line-arguments  600.010s
go test -bench=. 02_test.go  595.64s user 0.76s system 99% cpu 10:00.64 total

If a program run with the testing package takes more than 10 minutes, it is terminated.

01 ⇒ 03, 98 ⇒ 96: call graphs with runtime/pprof §

Generating and displaying profiling data has long been popular in the C++ community. Profiling data formats became established, and nice tools like callgrind made it easy to visualize which parts of a program consume most of the computation time. runtime/pprof is Go's response to this requirement: it generates runtime profiling data and can store it in a file.

Be aware that I profile the whole test program on purpose, not only the critical path: I want to find out where the time is spent. After building and running the program, a file '96.prof' is created that contains the profiling data.

Diff
--- 98.go 2015-06-21 00:52:47.868425972 +0200
+++ 96.go  2015-06-24 16:00:54.579200362 +0200
@@ -5,7 +5,9 @@
 import (
  "bytes"
  "fmt"
+  "os"
  "regexp"
+  "runtime/pprof"
  "unicode/utf8"
 )
 
@@ -81,6 +83,13 @@
 }
 
 func main() {
+  f, err := os.Create("96.prof")
+  if err != nil {
+    panic(err)
+  }
+  pprof.StartCPUProfile(f)
+  defer pprof.StopCPUProfile()
+
  lower, upper := 1, 500
  regex := generateRegexp(generateString(lower, upper))
  tests := generateTests(lower, upper)
pprof analysis of version 96 (image has been modified for better layout)

In version 96, 25% of the time (i.e. 3 of 12 samples in total) is spent in regexp.(*machine).step, even though that function is called only 3 times (1 sample is approximately 10 milliseconds according to the documentation).

function                        samples  share
regexp.(*machine).step          3        25%
runtime.memmove                 2        16.7%
regexp/syntax.patchList.append  2        16.7%

regexp/syntax.(*compiler).compile has 122 invocations, but takes almost no time. regexp.(*machine).add takes about 8.3%.

Diff
--- 01.go 2015-06-20 23:47:40.544336817 +0200
+++ 03.go  2015-06-24 15:11:37.939132899 +0200
@@ -5,6 +5,7 @@
  "fmt"
  "os"
  "regexp"
+  "runtime/pprof"
  "strconv"
  "strings"
 )
@@ -111,6 +112,13 @@
 }
 
 func main() {
+  f, err := os.Create("03.prof")
+  if err != nil {
+    panic(err)
+  }
+  pprof.StartCPUProfile(f)
+  defer pprof.StopCPUProfile()
+
  low := 1
  upp := 500
pprof analysis of 03 (image has been modified for better layout)

Okay, here we can immediately observe a problem. 89.9% of the total time was spent inside regexp.(*machine).add itself, and 95.1% inside that function or one of its callees. So what is this function?

The machine implementation inside the regexp package has an add function. It works recursively, which explains why so much time is concentrated in a single function rather than distributed across several. As far as I can understand the function comment, it adds the next instruction that can be reached from the current position, following through zero-width conditions. I am not sure what is meant by zero-width conditions (to me it sounds like positive lookaheads, but those are not supported by regexp). Probably it is similar to an alternation with an empty branch, like the first case in (|a|b|c|d|e|f).

So in conclusion, a lot of time is spent on inserting regex elements into the internal (NFA) representation.

01 ⇒ 04: refactor main routine §

We know our performance analysis tools, but for now let's refactor the code to bring the two approaches closer to each other. Hence we introduce version 04.

These changes are performance-irrelevant; skipping the stdout output only saves some insignificant buffering time.

Diff
--- 01.go 2015-06-20 23:47:40.544336817 +0200
+++ 04.go  2015-06-24 23:27:51.090513793 +0200
@@ -3,7 +3,6 @@
 import (
  "bytes"
  "fmt"
-  "os"
  "regexp"
  "strconv"
  "strings"
@@ -11,7 +10,7 @@
 
 // Generate a regex accepting any string but the string of concatenated numbers
 // lower_bound to upper_bound (each inclusive).
-func generate_regex(lower_bound, upper_bound, mode int) string {
+func generateRegexp(lower_bound, upper_bound int) string {
  var not_str bytes.Buffer
  var buf bytes.Buffer
 
@@ -22,30 +21,8 @@
  not_string := not_str.String()
  l := len(not_string)
 
-  // mode 3 = lookahead
-  if mode == 3 {
-    buf.WriteString("^(?!")
-    buf.WriteString(not_string)
-    buf.WriteString("$).*$")
-    return buf.String()
-  }
-
  buf.WriteString("^(")
-
-  // strings with length smaller than not_string
-  // mode 1 == |.|..|...|…
-  if mode == 1 {
-    for i := 0; i < l; i++ {
-      buf.WriteString(strings.Repeat(".", i))
-      if i != l-1 {
-        buf.WriteString("|")
-      }
-    }
-    // mode 2 == .?.?.?.?…
-  } else if mode == 2 {
-    buf.WriteString(strings.Repeat(".?", l-1))
-  }
-
+  buf.WriteString(strings.Repeat(".?", l-1))
  buf.WriteString("|")
 
  // strings with length equal to not_string
@@ -74,7 +51,7 @@
 
 // Generate less than `(upper_bound - lower_bound)^2` input strings.
 // Returns testsuite dict.
-func generate_input_strings(lower_bound, upper_bound int) map[string]bool {
+func generateTests(lower_bound, upper_bound int) map[string]bool {
  testsuite := make(map[string]bool)
 
  // generate not_string
@@ -111,22 +88,12 @@
 }
 
 func main() {
-  low := 1
-  upp := 500
-
-  regex := generate_regex(low, upp, 2)
-  fmt.Println(regex)
-  fmt.Println("\n\n\n\n")
-  testsuite := generate_input_strings(low, upp)
-
-  fmt.Printf("Generated regex of string length %d.\n", len(regex))
-  fmt.Printf("Generated %d input strings.\n", len(testsuite))
-
-  pat := regexp.MustCompile(regex)
-
-  fmt.Println("Regex compilation has finished.\n")
-
-  run_test(pat, testsuite)
-
-  os.Exit(0)
+  lower, upper := 1, 500
+  regex := generateRegexp(lower, upper)
+  tests := generateTests(lower, upper)
+  re, err := regexp.Compile(regex)
+  if err != nil {
+    panic(fmt.Sprintf("regexp.Compile:", err))
+  }
+  run_test(re, tests)
 }

04 ⇒ 05, 99 ⇒ 95: BenchmarkGenerateRegexp §

Even though regex generation was excluded from our critical path, it might be interesting to benchmark it in both versions. Let's do that.

The following outputs were retrieved:

go test -bench GenerateRegexp 05_test.go 
testing: warning: no tests to run
PASS
BenchmarkGenerateRegexp      100    19390897 ns/op
ok    command-line-arguments  1.969s
go test -bench GenerateRegexp 95_test.go
PASS
BenchmarkGenerateRegexp     2000      952723 ns/op
ok    command-line-arguments  2.013s

tux21b's implementation needs only about 5% of the runtime of meisterluk's implementation. But obviously the two generated regular expressions differ.

05 ⇒ 06, 95 ⇒ 94: Regex comparison for (1, 3) §

meisterluk regex
^(.?.?|[^1]..|.[^2].|..[^3]|....+)$
tux21b regex
^(?:(?:1(?:(?:2(?:(?:3.+)|[^3]|$))|[^2]|$))|[^1]|$)

meisterluk's regex §

meisterluk's regular expression uses the following approach:

  1. If the string length is smaller than the concatenated string, everything is allowed. For (1, 3) we model that as .?.?
  2. If the string length is equal, the string must not be the concatenated string. We rephrase that as "some letter of the string is different". This part of the regex looks like [^1]..|.[^2].|..[^3].
  3. If the string length is greater, everything is allowed. If you concatenate 1 to 500, you get a string of 1,392 characters. In the (1, 3) example we have 3 characters, giving ....+.

tux21b's regex §

tux21b's regex has a different structure. As explained, it is built by recursive code, and it is also best explained recursively (bullet items at the same level are connected with OR):

05 ⇒ 07, 95 ⇒ 93: fmt versus strconv §

I can see a difference between versions 05 and 95: 05 uses strconv and 95 uses fmt to convert an integer to a string. We exchange the uses and benchmark both implementations.

original author  version  int to string                 nanoseconds per operation
meisterluk       05       WriteString and strconv.Itoa  19,450,037
meisterluk       07       fmt.Fprint                    22,673,632
tux21b           93       WriteString and strconv.Itoa  888,468
tux21b           95       fmt.Fprint                    954,162

Conclusion: In general, strconv is faster than fmt.Fprint. Why? Because fmt has to inspect the dynamic type of its argument to find out how to convert the value to a string, whereas strconv.Itoa is told explicitly to convert an integer.

93 ⇒ 92: non-capturing groups §

I think non-capturing groups are among the more frequently used regex elements. They are pretty simple to understand: write (?:…) instead of (…) and the parentheses won't create a new group you can refer to later on. This allows the engine to skip the submatch bookkeeping for that group.

So it is actually more interesting from a memory point of view than from a runtime one. We pass -benchmem as an argument when running the programs.

version             nanoseconds per operation  allocated bytes per operation  allocations per operation
92 (non-capturing)  784,804 ns/op              80,623 B/op                    110 allocs/op
93 (capturing)      874,297 ns/op              185,809 B/op                   111 allocs/op

Conclusions: it got a little bit faster, and both the number of allocations and the allocated bytes dropped.

Diff
--- 93_test.go  2015-06-25 01:01:07.762641495 +0200
+++ 92_test.go 2015-06-25 01:53:48.678713620 +0200
@@ -32,7 +32,7 @@
    return
  }
  r, n := utf8.DecodeRune(str)
-  fmt.Fprintf(buf, "(?:(?:%c", r)
+  fmt.Fprintf(buf, "((%c", r)
  build(buf, str[n:])
  fmt.Fprintf(buf, ")|[^%c]|$)", r)
 }

07 ⇒ 08: merge tux21b's regex §

So tux21b's version is kind of cool: the regex is much faster and [except for strconv] contains nice improvements. We create 08 from 07 by merging tux21b's regex generation (and therefore also his regex).

Performance benchmarks show that the new implementation is (of course) much faster, but still slower than tux21b's version: 95 yields 49,567,479 ns/op whereas 08 yields 76,253,806 ns/op.

Diff
--- 07_test.go  2015-06-25 01:14:00.434659126 +0200
+++ 08_test.go 2015-06-25 02:36:54.646772625 +0200
@@ -4,51 +4,36 @@
  "bytes"
  "fmt"
  "regexp"
-  "strings"
  "testing"
+  "unicode/utf8"
 )
 
-// Generate a regex accepting any string but the string of concatenated numbers
-// lower_bound to upper_bound (each inclusive).
-func generateRegexp(lower_bound, upper_bound int) string {
-  not_str := &bytes.Buffer{}
+func generateString(lower, upper int) []byte {
  buf := &bytes.Buffer{}
-
-  // generate not_string
-  for number := lower_bound; number <= upper_bound; number++ {
-    fmt.Fprint(not_str, number)
-  }
-  not_string := not_str.String()
-  l := len(not_string)
-
-  fmt.Fprint(buf, "^(")
-  fmt.Fprint(buf, strings.Repeat(".?", l-1))
-  fmt.Fprint(buf, "|")
-
-  // strings with length equal to not_string
-  for index := 0; index < l; index++ {
-    fmt.Fprint(buf, strings.Repeat(".", index))
-    fmt.Fprint(buf, "[^")
-    fmt.Fprint(buf, string(not_string[index]))
-    fmt.Fprint(buf, "]")
-    fmt.Fprint(buf, strings.Repeat(".", l-index-1))
-    if index != l-1 {
-      fmt.Fprint(buf, "|")
-    }
+  for i := lower; i <= upper; i++ {
+    fmt.Fprint(buf, i)
  }
+  return buf.Bytes()
+}
 
-  fmt.Fprint(buf, "|")
-
-  // strings with length greater than not_string
-
-  fmt.Fprint(buf, strings.Repeat(".", l))
-  fmt.Fprint(buf, ".+")
-
-  fmt.Fprint(buf, ")$")
-
+func generateRegexp(str []byte) string {
+  buf := &bytes.Buffer{}
+  fmt.Fprint(buf, "^")
+  build(buf, str)
  return buf.String()
 }
 
+func build(buf *bytes.Buffer, str []byte) {
+  if len(str) == 0 {
+    fmt.Fprint(buf, ".+")
+    return
+  }
+  r, n := utf8.DecodeRune(str)
+  fmt.Fprintf(buf, "(?:(?:%c", r)
+  build(buf, str[n:])
+  fmt.Fprintf(buf, ")|[^%c]|$)", r)
+}
+
 // Generate less than `(upper_bound - lower_bound)^2` input strings.
 // Returns testsuite dict.
 func generateTests(lower_bound, upper_bound int) map[string]bool {
@@ -59,14 +44,13 @@
  for number := lower_bound; number <= upper_bound; number++ {
    fmt.Fprint(not_str, number)
  }
-  not_string := not_str.String()
+  not_string := generateString(lower_bound, upper_bound)
 
  // generate strings
  for begin := lower_bound; begin <= upper_bound; begin++ {
    for end := begin + 1; end <= upper_bound; end++ {
      var input_string = not_string[begin:end]
-      matches := (input_string != not_string)
-      testsuite[input_string] = matches
+      testsuite[string(input_string)] = !bytes.Equal(input_string, not_string)
    }
  }
 
@@ -89,13 +73,13 @@
 
 func BenchmarkGenerateRegexp(b *testing.B) {
  for i := 0; i < b.N; i++ {
-    generateRegexp(1, 500)
+    generateRegexp(generateString(1, 500))
  }
 }
 
 func BenchmarkRegexpNot(b *testing.B) {
  lower, upper := 1, 500
-  regex := generateRegexp(lower, upper)
+  regex := generateRegexp(generateString(lower, upper))
  tests := generateTests(lower, upper)
  re, err := regexp.Compile(regex)
  if err != nil {

08 ⇒ 09: introduce a Test struct §

tux21b's version uses a Test struct at line 39 to store all test cases. meisterluk's version uses a map from the input string to a boolean value indicating whether it should match. Be aware that the new version 09 uses a string instead of a byte array; its Test struct can be found at line 37.

version  nanoseconds per operation
08       75,522,867 ns/op
09       60,372,369 ns/op

Observation: This provided a nice performance improvement. Why? Even though maps provide asymptotic O(1) lookup, accessing their elements violates the principle of locality, leading to many cache misses; I think this is the reason here. A linear data structure is therefore preferable. Be aware that this is a very specific use case where the dataset size matters a lot. Recently a blog post concluded the opposite for their use case: use maps.
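A minimal sketch of the two layouts; the Test struct mirrors the one from the diffs, and the values are illustrative:

```go
package main

import "fmt"

// Test mirrors the struct from tux21b's version: a pattern plus the
// expected match result.
type Test struct {
	Pattern []byte
	Match   bool
}

func main() {
	// Map layout (version 08): entries live in scattered hash buckets.
	asMap := map[string]bool{"0123": true, "01234": false}

	// Slice layout (version 09): entries are contiguous in memory,
	// so iterating over them is friendly to the CPU cache.
	asSlice := []Test{
		{[]byte("0123"), true},
		{[]byte("01234"), false},
	}

	fmt.Println(len(asMap), len(asSlice)) // → 2 2
}
```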

09 ⇒ 10: switch from strings to bytes §

Strings can be considered an abstraction over byte arrays. You receive a byte array via I/O devices and need to decode it according to the specified character set; at that point the byte array becomes a string. Operations on strings are well-defined according to the Unicode standard.

So strings should be more expensive, because they hide some details and provide a large set of operations. The benchmarks yield:

version  nanoseconds per operation
09       60,372,369 ns/op
10       52,005,978 ns/op

Conclusion: strings carry more overhead and are slower; byte arrays are faster. Be aware that previously I did some byte-array-to-string conversions (because tux21b's regex generation inherently works on byte slices), which are now unnecessary. So the gap is not that significant, but avoiding the String methods of the regexp package might be worth it.

10 ⇒ 11: reduce differences §

We drop the run_test function and directly put test string matching into BenchmarkRegexpNot.

version  nanoseconds per operation
10       52,005,978 ns/op
11       53,358,339 ns/op

I tried it several times, but I doubt that removing a function call decreases performance; if anything it should be the other way around (assuming no inlining was done). Anyway, this result is not significant enough to draw any conclusions about how expensive a function call is.

Diff
--- 10_test.go  2015-06-25 03:29:54.806845188 +0200
+++ 11_test.go 2015-06-25 03:43:03.806863191 +0200
@@ -52,15 +52,6 @@
  return
 }
 
-// Actually run all pattern searches.
-func run_test(b *testing.B, re *regexp.Regexp, tests []Test) {
-  for _, t := range tests {
-    if m := re.Match(t.Pattern); m != t.Match {
-      b.Errorf("Test %q: expected %v, got %v", t.Pattern, t.Match, m)
-    }
-  }
-}
-
 func BenchmarkGenerateRegexp(b *testing.B) {
  for i := 0; i < b.N; i++ {
    generateRegexp(generateString(1, 500))
@@ -76,6 +67,10 @@
    b.Fatal("regexp.Compile:", err)
  }
  for i := 0; i < b.N; i++ {
-    run_test(b, re, tests)
+    for _, t := range tests {
+      if m := re.Match(t.Pattern); m != t.Match {
+        b.Errorf("Test %q: expected %v, got %v", t.Pattern, t.Match, m)
+      }
+    }
  }
 }

11 ⇒ 12: reduce differences §

And then we apply the final changes.

Diff
--- 11_test.go  2015-06-25 03:43:03.806863191 +0200
+++ 99_test.go 2015-06-24 11:31:30.912120720 +0200
@@ -1,5 +1,7 @@
 package main
 
+// courtesy of tux21b
+
 import (
  "bytes"
  "fmt"
@@ -52,13 +54,35 @@
  return
 }
 
-func BenchmarkGenerateRegexp(b *testing.B) {
-  for i := 0; i < b.N; i++ {
-    generateRegexp(generateString(1, 500))
+var simpleTests = []Test{
+  {[]byte("ab"), true},
+  {[]byte(""), true},
+  {[]byte("4"), true},
+  {[]byte("0"), true},
+  {[]byte("02"), true},
+  {[]byte("11"), true},
+  {[]byte("0123"), true},
+  {[]byte("01234"), false},
+  {[]byte("012a4"), true},
+  {[]byte("012345"), true},
+}
+
+func TestRegexpNot(t *testing.T) {
+  str := generateString(0, 4)
+  t.Log("Regexp:", generateRegexp(str))
+  re, err := regexp.Compile(generateRegexp(str))
+  if err != nil {
+    t.Fatal("regexp.Compile:", err)
+  }
+  for _, test := range simpleTests {
+    if m := re.Match(test.Pattern); m != test.Match {
+      t.Errorf("Test %q: expected %v, got %v.", test.Pattern, test.Match, m)
+    }
  }
 }
 
 func BenchmarkRegexpNot(b *testing.B) {
+  b.StopTimer()
  lower, upper := 1, 500
  regex := generateRegexp(generateString(lower, upper))
  tests := generateTests(lower, upper)
@@ -66,6 +90,7 @@
  if err != nil {
    b.Fatal("regexp.Compile:", err)
  }
+  b.StartTimer()
  for i := 0; i < b.N; i++ {
    for _, t := range tests {
      if m := re.Match(t.Pattern); m != t.Match {

Version 12 is not defined in the source code file, because it is equivalent to version 99.

Conclusions §

bytes are better than strings (semantically and for performance)
We don't discuss the semantics here; that is about language design, which is not covered. For performance, 09 ⇒ 10 showed only a minor difference, but byte arrays do appear to be somewhat more performant.
a new go regex engine was recently released which probably fixes this problem
This is true if Go 1.3 is meant. The changelog shows the following statement:
The regular expression package regexp is now significantly faster for certain simple expressions due to the implementation of a second, one-pass execution engine. The choice of which engine to use is automatic; the details are hidden from the user.
https://golang.org/doc/go1.3#performance
Go requires programmers to think hard about the best tool for the job; it does not add magic to provide a nice API

This is also about language design. Let's make a point here.

My concern is that programmers with years of experience know which regexes can act as regex bombs. With Go's regexp engine, a different set of regexes act as bombs, so programmers need to learn a new set of regular expression bombs to avoid. Requiring new knowledge worsens the situation.

My hope (on the other hand) is that this regex is contrived enough that no such regular expression will ever be seen in production code.

Furthermore it's worth pointing out:

Coming to an end: tux21b showed that regex tuning can have a dramatic effect. We programmers should be aware of regex features and capabilities. Thanks for providing the source code.

It's all about that regex!

meisterluk  ~9,200 seconds
tux21b      0.051057905 seconds

Further investigations §

re2dfa §

What is re2dfa? It recently drew attention when it popped up in a reddit thread, "re2dfa - regular expressions into finite state machines". It compiles regular expressions statically into an efficient DFA. I was curious whether this tool, with its totally different approach, would be capable of turning my regex into an efficient version.

We install re2dfa. Just to point it out: re2dfa is restricted to Go's regexp feature set.

go get github.com/opennota/re2dfa

We adjust source code 06 to print out the full regex:

  //fmt.Println(generateRegexp(1, 3))
  fmt.Println(generateRegexp(1, 500))

We run the go program and write the regular expression to a file.

go run 06.go > my_complex_regex

How long is it?

wc -c my_complex_regex
1947414 my_complex_regex

Oh boy, that's 1.9 megabytes. Anyway, we read the content into a shell variable:

regex=$(<my_complex_regex)

We run re2dfa:

re2dfa "$regex" main.matchAPlus string
zsh: liste d'arguments trop longue: re2dfa

Oh boy, that error message is French for "argument list too long". So the expansion of $regex is too long.

We need to use the re2dfa API directly (like its main program does) in our own program. The source code is provided as version re2dfa.

We start the program and … … … after more than 62 minutes it is still running. I aborted it; sorry, this tool failed this test.

sre2 §

sre2 is a pure Go implementation of re2. What is its runtime? See version sre2.

time go run notstring.go
go run notstring.go  84.17s user 1.06s system 102% cpu 1:23.17 total

So it runs approximately one minute and 23 seconds. Be aware that sre2 (unlike regexp) works with runes internally, so I converted every byte array to a string before invoking the match function. That is far worse performance than the original implementation, which finishes in less than 1 second.

So it runs slower on this very efficient program. Perhaps it has a smaller standard deviation and performs better on more difficult test cases? What about the meisterluk version? See version meisterluk-sre2.

After one hour I aborted. So sre2 failed as well. It is worse than the regexp package in terms of performance. But okay, it's pure Go, which might be useful in other scenarios.

Updated slide with tux21b regex §

Python §

I adapted tux21b's version to Python and got the following backtrace:

…
File "/usr/lib/python3.4/sre_parse.py", line 431, in _parse
    subpattern = SubPattern(state)
RuntimeError: maximum recursion depth exceeded while calling a Python object

Apparently the regex has too many groups (I counted 2784), and since the regular expression is parsed recursively, the parsing procedure fails.

The first step is to look for a regex-specific limit for a maximum size of the regular expression.

import sre_constants

This package of Python's regular expression implementation “sre” defines constants. I could not find any constant specific to a maximum recursion depth, so the general recursion limits of Python 3 apply. Therefore we increase the default recursion limit:

import sys
sys.setrecursionlimit(5583)  # the default of 1000 is too small here

As it turns out, a recursion limit of 5583 is the smallest that works for me; it is more than twice the number of groups in the regular expression: sys.setrecursionlimit(5583). According to this stackoverflow answer, you should also raise the stack size limit via the resource module. With these settings I get the following output (version python):

Generated regex of string length 22276.
Generated 123761 input strings.
Regex compilation has finished.
It takes 10.949613311000121 seconds to test all input strings.

So it takes about 11 seconds to test all input strings. And the overall program takes about 11.2 seconds to run.

Btw, I also tried sre, a pure Python implementation of Python's regular expression engine: it suffers from the same problem (as expected).

Java §

If you adapt the source code from tux21b for Java (version Java), you will get the following output at stderr:

Generated regex of string length 22275.
Generated 123761 input strings.
Exception in thread "main" java.lang.StackOverflowError
  at java.util.regex.Pattern$Node.<init>(Pattern.java:3347)
  at java.util.regex.Pattern$GroupHead.<init>(Pattern.java:4550)
…
  at java.util.regex.Pattern.expr(Pattern.java:1964)
  at java.util.regex.Pattern.group0(Pattern.java:2770)
  at java.util.regex.Pattern.sequence(Pattern.java:2018)
  at java.util.regex.Pattern.expr(Pattern.java:1964)
  at java.util.regex.Pattern.group0(Pattern.java:2770)

So Java suffers from the same problem as Python: by default the regular expression is rejected due to its high number of groups.

We need to increase the stack size. After compiling the Java program, we start the JVM with the option -Xss1361k, which assigns a larger stack size to the main thread. 1361k is the smallest value that works for me; that is a little less than half a kilobyte per group.

Generated regex of string length 22275.
Generated 123761 input strings.
Regex compilation has finished.
It takes 0.747 seconds to test all input strings.

I used find instead of matches in this implementation. Why? I don't know why I originally used matches; its semantics are less appropriate here.

If you don't know the difference between find and matches, I have a small exercise:

Online tester: regexplanet.com
Regex: ^(?:(?:1(?:(?:2(?:(?:3.+)|[^3]|$))|[^2]|$))|[^1]|$)
Test string: 12

matches says No, find says Yes.

Perl §

We can also update the Perl version. It works out of the box 😊.

Generated regex of string length 22275.
Generated 123761 input strings.
Regex compilation not available.
It takes 1 seconds to test all input strings.

It takes about 1 second to run all test strings and about 1.2 seconds for the whole program.

Updated slide §

This is my new table/slide:

                     python 3.4.0   go 1.3        OpenJDK 1.7.0_79   perl 5.18.2
tux21b regex         11 sec         0.05 sec      0.75 sec           1 sec
no lookahead, (1)    666 sec        155 sec       15 sec             134 sec
no lookahead, (2)    340 sec        2960 sec      1.954 sec          86 sec
lookahead            11.2 sec       unsupported   0.416 sec          1 sec

Remarks §