Faster Tokenization with Ragel generated FSM

Text processed by Bleve must first go through the process of tokenization. Read on to learn how we sped up the process by a factor of 2x by introducing an FSM generated by Ragel.

Tokenization?

Tokenization is the process by which we take a string like:

"Using Ragel to tokenize 2.1x faster."

And turn it into discrete tokens:

"Using", "Ragel", "to", "tokenize", "2.1x", "faster"

Your first thought might be that we can simply split the text up using the Unicode White Space property. And for simple use cases that might be good enough, but you still have to handle removing punctuation. Unicode defines properties for that as well, but if we want to retain 2.1x as is, we can’t simply remove all punctuation. Fortunately, the Unicode specification has already addressed this problem and defined rules for word segmentation.

Unicode Text Segmentation

If you’re interested in all details, see Unicode® Standard Annex #29. What matters for this discussion is that it defines a set of rules, based on which characters occur before and after others, to decide whether or not to break up the text.

Our first solution was simply to integrate with the C library ICU. This got us up and running quickly, but having a dependency on ICU was not a long term solution.

Next, I set out to write my own implementation. I followed the lead I found in the Go standard library’s unicode package, and wrote code which generated unicode.RangeTables for all the properties defined by the word segmentation rules. Next, I wrote a for loop which walked through the slice of bytes, kept track of which types of characters occurred in sequence, and returned the segmented text at the appropriate places. Fortunately, the Unicode specification also includes test cases, so after banging on the code for a while I had something that worked.

While it was great that it was working, it had some serious drawbacks. First, if the rules evolve in a newer Unicode spec, updating this code base could be tricky. Second, while it performed good enough to use, I always knew that the for-loops and Unicode range table lookups were not optimal.

A Generated FSM Solution

The appeal of a generated FSM solution was that we could keep the rule definitions at a high level. This lets us easily modify them as they change, and it also allows for the possibility of customization in the future. Second, the generated code should be able to take advantage of well understood FSM techniques and perform faster than my hand-rolled for-loop.

I’ve gone looking for FSM solutions in the Go ecosystem in the past and never found one that seemed to meet my needs. This time however, I came across someone suggesting Ragel. To be honest, approaching a project like this can be daunting. Just looking at the example on their home page you realize you have to learn their DSL. Sure it’s rooted in regular-expressions, but the full syntax goes beyond that. The documentation looked solid, but I still wondered, how good is Go code generator? The code isn’t on github and there doesn’t appear to be any community around it. Am I going to invest a day learning this just to hit some dead end?

I was desperate, so I dove in despite my concerns…

Ragel

I started by trying to write a test program, just to get comfortable writing rules in Ragel and generating Go code. I came across this project ragel-go-examples, which I recommend to anyone else getting started.

Once comfortable with the basics, I needed to start working towards something resembling the Unicode word segmentation rules. I had two things to work off of:

  1. The Ragel distribution includes a script contrib/unicode2ragel.rb
  2. The Lucene source has jflex rules for segmentation which share the same regular-expression roots

The unicode2ragel.rb script would take the Unicode properties file and create ragel rules matching the corresponding utf-8 byte sequences. Unfortunately it was hard-coded to work with a particular file, and only export particular properties. I was able to modify this script to work off an arbitrary URL and export whichever named properties you wanted. This let me generate Ragel rules for the properties defined in the word segmentation specification.

Next, I hand converted the jflex rules in the Lucene source into the Ragel language.

Finally, our existing segment package had an API that we wanted to continue to support. So I spent about another half-day attaching the right Ragel actions to the rules and massaging it to fit the API.

Performance

Ragel supports 7 different code output styles, some table based, others goto based. Benchmarking the different options showed that -G2 (goto, with in-place actions) yielded the highest performance. When compared against the original for-loop version:

$ benchcmp before.txt after.txt 
benchmark                          old ns/op     new ns/op     delta
BenchmarkSplitWords-4              966830        457480        -52.68%
BenchmarkWordSegmenter-4           1004317       486750        -51.53%
BenchmarkWordSegmenterDirect-4     970841        463655        -52.24%

benchmark                          old allocs     new allocs     delta
BenchmarkSplitWords-4              20             20             +0.00%
BenchmarkWordSegmenter-4           36             36             +0.00%
BenchmarkWordSegmenterDirect-4     32             32             +0.00%

benchmark                          old bytes     new bytes     delta
BenchmarkSplitWords-4              380966        380971        +0.00%
BenchmarkWordSegmenter-4           496429        496432        +0.00%
BenchmarkWordSegmenterDirect-4     467706        467709        +0.00%

Not too bad, the new version is 2x as fast, with no substantial change to memory usage or allocations.

All good? No…

Runtime performance may be the most important metric, but it’s not the only one. One detail I neglected to mention earlier was that when I generate the code with the -G2 option, the resulting go file is huge:

$ ls -lh segment_words.go
-rw-r--r--  1 mschoch  staff   2.6M Sep  8 10:32 segment_words.go

Well, by itself that isn’t a problem, how does it affect the size of a program using the package? I created a test program which segments some text and compared the size of the resulting binaries.

Before:

$ ls -lh stest
-rwxr-xr-x  1 mschoch  staff   1.3M Sep  8 13:40 stest

After:

$ ls -lh stest
-rwxr-xr-x  1 mschoch  staff   6.0M Sep  8 13:41 stest

I was pleasantly surprised by this one as I thought it might be a lot worse. I can imagine for some use cases this is a problem, but I think for most projects this will not be significant.

But, there was one more thing to check with those same programs, compilation time.

Before:

$ time go build

real	0m0.364s
user	0m0.434s
sys	0m0.079s

After:

$ time go build

real	0m4.848s
user	0m6.134s
sys	0m0.523s

Well, that is a bit unfortunate. The program now takes over 4.5 seconds longer to compile. The Ragel documentation did hint that this could be a problem, and I suspect the (temporary) slowdown in the Go 1.5 tool-chain is also contributing to the problem.

For projects like Bleve this is unfortunate as we had been trying to take other steps to improve compilation time, and this wipes out all of those gains. But, in the big picture I suspect most projects would prefer the 2x runtime gain.

More testing needs to be done, if one of the other output formats results in significantly faster compilation, another option would be to use build tags to control which one is used. That way production builds get the most optimal runtime code. But for now I’m going to let this be and see how it goes.

Testing/Code Coverage

The Unicode specification for text segmentation also includes a suite of test strings and their correct segmentation boundaries. We use these test cases to ensure that the behavior conforms to the specification. I always like the fact that between that test suite and a few other well crafted strings, we were able to get the code coverage to 100%

$ go test -coverprofile=coverage.out 
PASS
coverage: 100.0% of statements
ok  	github.com/blevesearch/segment	0.027s

So, how does it look now?

$ go test -coverprofile=coverage.out 
PASS
coverage: 1.2% of statements
ok  	github.com/blevesearch/segment	0.239s

Thats disappointing, but not surprising. The generated code is expanding the possible paths for efficiency, and our test set doesn’t cover nearly as many paths in the new code.

Getting that number higher might be a project for another day, but I still wanted some higher confidence in this newly generated code. And it seemed like a perfect opportunity to play with go-fuzz

go-fuzz

This was my first time using go-fuzz, so I followed this tutorial by Damian Gryski.

The idea was simple, I should be able to use the Unicode test suite as the initial corpus, and let go-fuzz bang away on library looking for problems. It’s not going to help with correctness, but it can at least identify any crashes or hangs.

First, I wrote a simple Fuzz function:

func Fuzz(data []byte) int {

	vals := make([][]byte, 0, 10000)
	types := make([]int, 0, 10000)
	if _, _, _, err := SegmentWordsDirect(data, vals, types); err != nil {
		return 0
	}
	return 1
}

Next, I wrote some code to convert the test suite strings into the initial test corpus for go-fuzz. I implemented it as a test and not a main program because the test data is only accessible from the test compilation. You can run that with:

go test -v -run=TestGenerateWordSegmentFuzz -tags gofuzz_generate

Then generate the go-fuzz package:

$ go-fuzz-build github.com/blevesearch/segment

Finally, let it run:

$ go-fuzz -bin=segment-fuzz.zip -workdir=workdir
2015/09/08 14:19:19 slaves: 8, corpus: 1859 (0s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 0, uptime: 3s
2015/09/08 14:19:22 slaves: 8, corpus: 2486 (0s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 3651, uptime: 6s
2015/09/08 14:19:25 slaves: 8, corpus: 2486 (3s ago), crashers: 0, restarts: 1/6850, execs: 54804 (6068/sec), cover: 3683, uptime: 9s
... data omitted ...
2015/09/08 15:04:10 slaves: 8, corpus: 7255 (1m6s ago), crashers: 0, restarts: 1/9982, execs: 67204666 (24946/sec), cover: 20912, uptime: 44m54s
2015/09/08 15:04:13 slaves: 8, corpus: 7255 (1m9s ago), crashers: 0, restarts: 1/9985, execs: 67279017 (24945/sec), cover: 20912, uptime: 44m57s
2015/09/08 15:04:16 slaves: 8, corpus: 7265 (0s ago), crashers: 0, restarts: 1/9984, execs: 67354365 (24946/sec), cover: 20912, uptime: 45m0s

This run was for 45 minutes, I have previously let it run for several hours and it has not reported any crashers. I was hoping this would make me feel better about things, but I still have some questions. The documentation states that the cover value is the number of bits set in a hashmap. So more is better right? But, the docs go on to say that the “value should be less than ~5000, otherwise fuzzer can miss new interesting inputs due to hash collisions”. So perhaps our cover is now too high? Maybe we’re not doing something right, so this requires more investigation.

Conclusion

In summary, I’m pretty pleased with how things have turned out. I was initially a bit intimidated by Ragel, but in just a few days I was able to use it successfully. The runtime performance gains will be a significant boost to projects like Bleve. The high compilation times, code coverage and fuzzing issues are things we’ll have to keep working on going forward.

If you’re interested, please checkout the segment github repo. If you have suggestions/ideas to help us improve it further, discuss it with us on the google group.