Over the last several years, I’ve rewritten Rust’s regex
crate to enable better internal composition, and to make it
easier to add optimizations while maintaining correctness. In the course of
this rewrite I created a new crate, regex-automata
, which exposes much
of the regex
crate internals as their own APIs for others to use. To my
knowledge, this is the first regex library to expose its internals to the
degree done in regex-automata
as a separately versioned library.
This blog post discusses the problems that led to the rewrite, how the rewrite
solved them and a guided tour of regex-automata
’s API.
Target audience: Rust programmers and anyone with an interest in how one particular finite automata regex engine is implemented. Prior experience with regular expressions is assumed.
Table of Contents
- Brief history
- The problems
- Follow along with regex-cli
- Flow of data
- Literal optimizations
- The NFA data type
- Regex engines
- Differences with RE2
- Testing strategy
- Benchmarking
- Costs
- Wrap up
Brief history
In September 2012, an issue was filed on the Rust
repository requesting that a regex library be added to the
Rust Distribution. Graydon Hoare later commented in that thread
that they preferred RE2. For those that don’t know, RE2 is a regex engine
that uses finite automata to guarantee O(m * n)
worst case search time
while providing a Perl-like syntax that excludes features for which no
efficient implementation is known. RE2’s design is described by its author, Russ
Cox, in a series of articles on implementing a regex engine using finite
automata.
In April 2014, I showed up and said I was working on a regex engine inspired
by RE2. I treated Cox’s articles as a blueprint for how to
build a regex library. Soon thereafter, I published an RFC to add a regex
library to the “Rust Distribution.” This was before Rust 1.0
and Cargo (the second version, not the first), and the “Rust
Distribution” referred to rustc
, std
and several “supporting” libraries
that were all bundled together. This RFC proposed adding a regex
crate to
that list of supporting libraries.
Ten days later, the RFC was approved. The next
day, I submitted a pull request to rust-lang/rust
,
adding it to the Rust distribution. Things moved fast back then. Notice also
that I had originally called the crate regexp
. The PR to Rust involved a
discussion about naming that eventually resulted in it being called regex
instead.
Two years later in May 2016, I wrote an RFC to release regex 1.0
. That took a few months to be approved, but it wasn’t
until a couple years later in May 2018 that I actually released regex 1.0
.
Before regex 1.0
was released, I had been steadily working on a complete
re-imagining of the crate internals. From a commit message in March
2018:
The [regex-syntax] rewrite is intended to be the first phase in an effort to overhaul the entire regex crate.
I didn’t know exactly where I was going at that point in time, but in
March 2020, I started work in earnest on rewriting the actual matching
engines. A little more than three years later, regex 1.9
has been
released with the completed rewrite.
The problems
What kinds of problems were facing the regex
crate that warranted a full
rewrite? And moreover, why publish the rewritten internals as its own crate?
There are a host of things to discuss here.
Problem: composition was difficult
Following in the tradition of RE2, the regex
crate contains a
number of different strategies that it can use to implement a search. Sometimes
multiple strategies are used in a single search call.
There are generally two dimensions, often at odds with one another, to each strategy: performance and functionality. Faster strategies tend to be more limited in functionality. For example, a fast strategy might be able to report the start and end of a match but not the offsets for each capture group in the regex. Conversely, a slower strategy might be needed to report the offsets of each capture group.
When I originally wrote the regex
crate, I implemented a single strategy
(the PikeVM
) and didn’t do any thoughtful design work for how to incorporate
alternative strategies. Eventually, new strategies were added organically:
- A BoundedBacktracker that can report capture group offsets like the PikeVM, but does so using a backtracking strategy. Its main limitation is the memory used to ensure its backtracking is bounded to O(m * n), so it can only be used for small haystacks/regexes. Its main upside is that it’s usually faster than the PikeVM.
- A hybrid NFA/DFA (also known as a “lazy DFA”) that can execute very quickly, but can only report the start and end of a match. It ignores capture groups completely.
- A literal strategy where a regex corresponds to a language that is both finite and small. Examples: foo, foo{2}, foo|bar, foo[1-3]. In this case, we could just use a single or multi-substring search algorithm without any kind of regex engine at all.
(We’ll get into why these strategies have these trade offs in more detail later in the blog.)
And with the above strategies came the composition of them:
- When the caller requests capture group offsets, it is usually faster to run the lazy DFA first to find the bounds of a match, and then only run the PikeVM or BoundedBacktracker to find the capture group offsets. In this way, especially for cases where matches are somewhat rare, most of the work is done by the much faster lazy DFA.
- When a regex begins with a prefix that corresponds to a small finite language, we can implement a prefilter that searches for occurrences of that language. Each occurrence corresponds to a candidate match for the overall regex. For each such candidate match, we run the full regex engine to confirm whether it’s an actual match or not. So for example, foo\w+ would look for occurrences of foo in a haystack, and then run the regex foo\w+ at the offset at which the occurrence of foo began. If there’s a match, stop and report it. Otherwise, restart the search for foo after the previous occurrence of foo.
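To make the prefilter idea concrete, here is a rough, hand-rolled sketch of that candidate-and-confirm loop using the public memchr and regex crates. This is not the regex crate’s actual internal code path, and find_foo_word is just a made-up name for illustration.

```rust
use memchr::memmem;
use regex::Regex;

// A hand-rolled sketch of the candidate-and-confirm loop described above.
// The real meta regex engine does this internally with its own prefilters.
fn find_foo_word(haystack: &str) -> Option<(usize, usize)> {
    let re = Regex::new(r"foo\w+").unwrap();
    let prefilter = memmem::Finder::new("foo");
    let mut at = 0;
    while let Some(i) = prefilter.find(&haystack.as_bytes()[at..]) {
        let candidate = at + i;
        // Confirm the candidate by running the full regex engine at the
        // offset where the literal occurrence begins.
        if let Some(m) = re.find_at(haystack, candidate) {
            if m.start() == candidate {
                return Some((m.start(), m.end()));
            }
        }
        // Otherwise, restart the literal search after this occurrence.
        at = candidate + 1;
    }
    None
}

fn main() {
    assert_eq!(Some((4, 13)), find_foo_word("mmm foodstuff"));
    println!("ok");
}
```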
Over time, I wanted to add both more strategies and add more ways of
composing them. But in an organically grown infrastructure, the regex
crate was beginning to buckle under its weight. Loosely speaking, all of the
following were problems:
- Not all strategies were necessarily designed to be composed with others. The PikeVM, for example, was the first strategy and it suffered from this. Specifically, it could not deal with starting and stopping a search in a subsequence of a slice, which is necessary in order to compose it with the lazy DFA. For example, at one point, the PikeVM would say that \babc\b matched in abcxyz if its search started at offset 0 and ended at offset 3. But the trailing \b should not match after c because an x follows it.
- It was difficult to reason about which strategy would be used for any given regex.
- There were repeated match expressions re-implementing various logic that could easily go out of sync.
- The construction of a regex did not holistically account for the fact that some strategies don’t need to be constructed at all. For example, I at one point added an optimization to the regex crate (prior to the rewrite) that just used Aho-Corasick for regexes like foo1|foo2|...|fooN, but it was extremely hacky to do that in a way that didn’t also result in a Thompson NFA being unnecessarily built that would never actually be used.
Basically, at the very least, many of the strategies needed a makeover and the infrastructure that composes them probably needed to be rewritten.
Problem: testing was difficult
While the regex
crate exposes a public interface that acts as a single regex
engine, as we just discussed, it uses multiple strategies internally depending
on the situation. Many of these strategies are regex engines themselves, and it
is absolutely critical that they behave the same on the same inputs.
As one example, a common case is when the caller requests the offsets of each
capture group in a match. The way this usually works is to run the lazy DFA
to find the bounds of the match, and then a slower regex engine like the
PikeVM
or the BoundedBacktracker
on the bounds of the match to report the
capture group offsets. What happens then, if the lazy DFA finds a match where
the other engines don’t? Oops. It’s a bug.
The problem here is that prior to regex 1.9
, all of the strategies used
internally are not part of any public API, and that makes them difficult to
independently test. One can of course test the public API, but the logic for
selecting which internal regex engines to use is complicated enough that there
isn’t always a clear and obvious mapping between the pattern itself and which
regex engine will be used internally. Moreover, that mapping can change as the
logic evolves. So writing tests just against the public API is not something
that gives us clear coverage over all of the internal engines. And even if one
could do it, it would make debugging test failures more annoying than necessary
because you have to reason through the logic for which strategies are selected.
One approach to this is to put all tests inside the crate so that tests have access to internal APIs. But you really want to leverage the same tests across all of the engines, so doing this requires defining the tests in some structured format, looping over them and running them on each engine. Since the test infrastructure was not written with testing each individual strategy in mind, I ended up not going this route.
Instead, I did some unholy hacks to make the existing test suite work:
- I exposed some internal APIs to make it possible to configure and build the internal strategies from outside the crate.
- I made it so one could actually build the main Regex type from those internal APIs using an undocumented From implementation.
- I wrote the tests using macros.
- I created test targets for each internal regex engine I wanted to test, and each test target was responsible for defining the aforementioned macros in a way that used the regex engine I wanted to test.
This was overall a hacky mess and it really needed a rethink. Exposing the internal engines in their own public API was not strictly a requirement to improve the situation, but it does make it possible to run a test suite across all engines without either relying on undocumented APIs or putting the tests inside the crate itself.
Problem: requests for niche APIs
Over the years, there were several requests for additional APIs to the regex
crate, but they were ones I considered too niche to be worth expanding the API
surface, or where I wasn’t totally clear on what the API ought to be.
One of the most popular such requests was better multi-pattern support. Namely,
the regex
crate provides a RegexSet
API that permits one to search for
possibly overlapping matches of zero or more regexes. The catch is that the API
only reports which patterns matched anywhere in the haystack. One cannot use
the API to get either the match offsets or the offsets of capture groups. While
useful, it isn’t as useful as it could be if it supported the full Regex
API.
As with adding multiple internal regex engines and the testing strategy, the
RegexSet
API was bolted on to the existing implementation in a fairly hacky
way. Making it capable of reporting match offsets would require either a major
refactoring of all existing engines or a rewrite.
But separately from that, it wasn’t totally clear to me how to expose APIs that
report match offsets in the context of the overlapping search done by the
RegexSet
APIs. Having more room to experiment with alternative APIs, for
example, a RegexSet
that does non-overlapping searches and reports match
offsets, would be something that others could use without needing to complicate
the regex
crate API.
There have been other requests for additional APIs too:
- The ability to execute an anchored search without needing to put a ^ in the pattern. This is especially useful in the context of running a regex on a part of the haystack that you know matches, but where you want to extract capture groups. It’s also useful in the context of an iterator that only reports adjacent matches. The regex crate could be augmented to support this, but there’s no easy way of extending existing APIs without either duplicating all of the search routines, or unnecessarily making “anchored mode” an option that one can pass to the regex. (Which, at that point, you might as well just put ^ at the beginning of the pattern.) A sketch of what this looks like with regex-automata appears after this list.
- The ability to run a regex search without it doing synchronization internally to get mutable scratch space used for a search. One might want to do this to avoid those synchronization costs in some cases. But to do it would in turn also require duplicating search APIs and exposing a new type that represents the “mutable scratch space.”
- Executing a regex on streams and/or non-contiguous
haystacks. This is especially useful for running a regex on
data structures like ropes, which are often found in text editors. This is a
big topic and not a problem I’ve even attempted to approach, but my hope is
that with more of the
regex
crate internals exposed, it might be tractable for others to more easily experiment with solutions to this problem.
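For a taste of how the first of these requests plays out in regex-automata: an anchored search can be requested on a per-call basis via the Input type, with no ^ in the pattern. This is a hedged sketch based on my reading of the 0.4-era meta engine API; check the crate docs for the exact methods.

```rust
use regex_automata::{meta::Regex, Anchored, Input};

fn main() {
    let re = Regex::new(r"\w+").unwrap();
    let hay = "foo bar";

    // An ordinary unanchored search finds the leftmost match, "foo".
    let m = re.find(hay).unwrap();
    assert_eq!((0, 3), (m.start(), m.end()));

    // An anchored search over hay[4..] must match starting at offset 4,
    // which finds "bar" without changing the pattern itself.
    let input = Input::new(hay).range(4..).anchored(Anchored::Yes);
    let m = re.find(input).unwrap();
    assert_eq!((4, 7), (m.start(), m.end()));
}
```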
By publishing a new separately versioned crate that contains much of the
regex
crate internals, it provides “breathing room” for more APIs that folks
want without needing to clutter up the general purpose regex API that services
the vast majority of all regex use cases. Namely, by targeting the crate toward
“expert” use cases, we make no attempt to keep the API small and simple.
Indeed, as we’ll see, the API of regex-automata
is sprawling and complex. And
by making it separately versioned, we can put out breaking change releases at a
much faster cadence than what we do for the regex
crate.
This line of reasoning is not too dissimilar from the line of reasoning that
led to the publication of the regex-syntax
crate. Namely, folks (including
myself) wanted access to a regex parser for their own projects. I certainly
didn’t want to expose a parser in the regex
crate itself because of the added
complexity and the fact that I wanted to be able to evolve the parser and its
data types independently of the regex
crate. (That is, regex-syntax
has
breaking change releases more frequently than regex
itself.) By putting it
into a separate crate, I could simultaneously use it as an implementation
detail of the regex
crate while also making it available for others to use.
Problem: fully compiled DFAs
The birth of regex-automata
was not actually the result of my crusade to
rewrite the regex crate. Its birth coincided with a desire to build fully
compiled DFAs, serialize them and then provide a barebones search runtime that
could zero-copy deserialize those DFAs and use them for a search. I used the
original version of regex-automata
to build DFAs for implementing
various Unicode algorithms inside of bstr
.
In the course of building the initial version of regex-automata
, I realized
that I needed to rebuild an NFA data structure and a compiler for it that was
very similar to one found in the regex
crate. At a certain point, I began
wondering whether it might be possible to share that code since it is quite
non-trivial and a place where a lot of interesting optimizations occur.
I had briefly considered building a new crate like regex-nfa
that both the
regex
crate and regex-automata
could depend upon. But after more thought,
it became apparent that there was more code that could be shared between
regex-automata
and regex
. For example, a lot of the determinization
process can be written generically such that it works for both fully compiled
DFAs and for lazy DFAs.
At that point, the right abstraction boundary seemed like it was closer to
“regex engine” than it was “an NFA.” So I reframed regex-automata
as less
about just DFAs and more about a menagerie of regex engines. The plan at that
point was, roughly, to put all of the regex engines in regex-automata
and
make the regex
crate itself just a thin wrapper around regex-automata
. By
setting things up this way, it should reduce the friction from migrating from
the regex
crate to regex-automata
if one should need access to the lower
level APIs.
In this way, we can build fully compiled DFAs using precisely the same code
that the regex
crate uses for its lazy DFA. And precisely the same code that
the regex
crate uses to convert the parsed representation of a regex pattern
into an NFA. Heck, this even makes it possible to use fully compiled DFAs in
the regex
crate in some cases. (This is normally a big no-no in a general
purpose regex engine because fully compiled DFAs are not only quite bloated,
but they take worst case exponential time to build. That is very inappropriate
for a regex engine you might use to compile untrusted patterns, or even in
cases where you just need compilation time of a regex to be “reasonable.”
Building a DFA may not be reasonable, especially when Unicode is involved.)
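As a rough illustration of what building a fully compiled DFA looks like with regex-automata (assuming its dfa-build and dfa-search features are enabled; feature names and methods here are from my reading of the docs):

```rust
use regex_automata::{
    dfa::{dense, Automaton},
    HalfMatch, Input,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fully compiling a DFA can take exponential time and space in the
    // worst case, which is why a general purpose engine rarely does it.
    let dfa = dense::DFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")?;

    // A forward DFA on its own only reports where a match ends.
    let got = dfa.try_search_fwd(&Input::new("launched on 2010-03-14"))?;
    assert_eq!(Some(HalfMatch::must(0, 22)), got);
    Ok(())
}
```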
Follow along with regex-cli
regex-cli
is a program maintained as part of the regex
crate that provides
convenient command line access to many of the APIs exposed in regex-syntax
,
regex-automata
and regex
. It also includes some useful utilities, such as
serializing fully compiled DFAs to a file and generating Rust code to read
them.
I will use regex-cli
at points throughout this blog post, so if you’d like to
follow along, you can install it straight from the regex
crate repository:
$ cargo install regex-cli
Here’s a pair of examples that shows the impact that Unicode has on the
.
regex. First, the version with Unicode enabled:
$ regex-cli debug thompson '.' --no-table
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 10
000003: \x80-\xBF => 11
000004: \xA0-\xBF => 3
000005: \x80-\xBF => 3
000006: \x80-\x9F => 3
000007: \x90-\xBF => 5
000008: \x80-\xBF => 5
000009: \x80-\x8F => 5
000010: sparse(\x00-\t => 11, \x0B-\x7F => 11, \xC2-\xDF => 3, \xE0 => 4, \xE1-\xEC => 5, \xED => 6, \xEE-\xEF => 5, \xF0 => 7, \xF1-\xF3 => 8, \xF4 => 9)
000011: capture(pid=0, group=0, slot=1) => 12
000012: MATCH(0)
transition equivalence classes: ByteClasses(0 => [\x00-\t], 1 => [\n], 2 => [\x0B-\x7F], 3 => [\x80-\x8F], 4 => [\x90-\x9F], 5 => [\xA0-\xBF], 6 => [\xC0-\xC1], 7 => [\xC2-\xDF], 8 => [\xE0], 9 => [\xE1-\xEC], 10 => [\xED], 11 => [\xEE-\xEF], 12 => [\xF0], 13 => [\xF1-\xF3], 14 => [\xF4], 15 => [\xF5-\xFF], 16 => [EOI])
)
And now the version with Unicode disabled:
$ regex-cli debug thompson '(?-u:.)' --no-table --no-utf8-syntax
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 3
000003: sparse(\x00-\t => 4, \x0B-\xFF => 4)
000004: capture(pid=0, group=0, slot=1) => 5
000005: MATCH(0)
transition equivalence classes: ByteClasses(0 => [\x00-\t], 1 => [\n], 2 => [\x0B-\xFF], 3 => [EOI])
)
The output here shows the Thompson NFA compiled by regex-automata
for the
regex pattern given. The regex-cli debug
command can print lots of different
data types from the regex crate ecosystem:
$ regex-cli debug
Prints the debug representation of various things from regex-automata and
regex-syntax.
This is useful for ad hoc interactions with objects on the command line. In
general, most objects support the full suite of configuration available in code
via the crate.
USAGE:
regex-cli debug <command> ...
COMMANDS:
ast Print the debug representation of an AST.
dense Print the debug representation of a dense DFA.
hir Print the debug representation of an HIR.
literal Print the debug representation of extracted literals.
onepass Print the debug representation of a one-pass DFA.
sparse Print the debug representation of a sparse DFA.
thompson Print the debug representation of a Thompson NFA.
There is also a regex-cli find
command that can run ad hoc searches. For
example, to run a multi-pattern search with capture groups using the meta regex
engine:
$ regex-cli find capture meta \
-p '(?<email>[.\w]+@(?<domain>[.\w]+))' \
-p '(?<phone>(?<areacode>[0-9]{3})-[0-9]{3}-[0-9]{4})' \
-y 'foo@example.com, 111-867-5309'
parse time: 20.713µs
translate time: 22.116µs
build meta time: 834.731µs
search time: 142.537µs
total matches: 2
0:{ 0: 0..15/foo@example.com, 1/email: 0..15/foo@example.com, 2/domain: 4..15/example.com }
1:{ 0: 17..29/111-867-5309, 1/phone: 17..29/111-867-5309, 2/areacode: 17..20/111 }
See the regex-cli
README for a few other examples.
Flow of data
Before diving into details, it’s worth pausing for a moment first to introduce
some terms and briefly describe the flow of data through the regex engine.
That is, when you call Regex::new
with a pattern string, we’ll trace the
transformations done on the pattern string that turn it into something that can
search haystacks.
- A pattern string is first parsed into an Ast. An Ast is a structured representation of the pattern.
- An Ast is translated into an Hir. An Hir is another structured representation of the pattern, but contains a lot less detail than an Ast. Things like Unicode case folding and Unicode character class references are all expanded as part of translation.
- An Hir is then used to build two things. First is a literal sequence, which corresponds to a sequence of literals extracted from the pattern that are used to optimize regex searches in some cases. If possible, a literal sequence is used to build a Prefilter. Secondly, an Hir is used to construct an NFA.
- At this point, an NFA is used to build a variety of regex engines:
  - A PikeVM can handle all possible regexes that are supported by parsing. A PikeVM can also report offsets for matching capture groups.
  - A BoundedBacktracker uses backtracking but explicitly bounds itself to avoid repeating work. Like the PikeVM, it can report offsets for matching capture groups.
  - A one-pass DFA that supports a very limited subset of regexes, but can report offsets for matching capture groups very quickly.
  - A fully compiled dense DFA. It can only report the overall start and end of a match (when combined with a second reverse DFA), but is very fast. The main downside is that its construction algorithm has worst case O(2^m) time and space complexity.
  - A lazy DFA that builds itself during a search from an NFA. In some cases it can be slower than a PikeVM, but in most cases is as fast as a fully compiled DFA and lacks the downside of O(2^m) worst case construction time/space.
- All of the above regex engines, including a Prefilter if one was constructed, are composed into a single meta regex engine.
- The regex crate itself is a thin wrapper around the meta regex engine from the regex-automata crate.
We’ll discuss each of these things in more detail throughout the blog, but
it’s difficult to avoid referencing some of these things before they get a
full treatment. For that reason, the above is meant to give you a very general
blueprint of regex
crate internals.
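The same flow can be driven by hand with the public APIs. The sketch below is hedged: the constructor names (build_from_hir, new_from_nfa, and so on) reflect my reading of regex-syntax and regex-automata at the time of writing, so double check them against the docs.

```rust
use regex_automata::nfa::thompson::{pikevm::PikeVM, Compiler};
use regex_syntax::{ast::parse::Parser as AstParser, hir::translate::Translator};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pattern = r"\w+";

    // pattern string -> Ast
    let ast = AstParser::new().parse(pattern)?;
    // Ast -> Hir (Unicode case folding, class expansion, etc. happen here)
    let hir = Translator::new().translate(pattern, &ast)?;
    // Hir -> Thompson NFA
    let nfa = Compiler::new().build_from_hir(&hir)?;
    // NFA -> a concrete regex engine, in this case the PikeVM
    let re = PikeVM::new_from_nfa(nfa)?;
    let mut cache = re.create_cache();
    assert!(re.is_match(&mut cache, "hello"));
    Ok(())
}
```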
Literal optimizations
In this section, we will begin our journey into the regex
crate internals
by talking about a critical optimization technique that it uses: literal
extraction from regexes. For example, the regex (foo|bar|quux)(\s+\w+)
describes a regular language where all elements in the language start with one
of foo, bar or quux. That is, every match of that regex is guaranteed to
begin with one of those three literals.
Motivating literal optimizations
Why does this matter? Why do we care about literals at all? We care about them because of the following two observations:
- There exist algorithms for searching for one or a small number of literals that are extremely fast. Speed is usually obtained by exploiting the “simplicity” of searching for plain literals and using vector instructions to process many bytes in a haystack in a single step.
- In very broad strokes, general algorithms for searching for matches of a regular expression cannot be accelerated easily.
The key observation is that while there are certainly many different techniques to implementing a regex engine (which we will cover a subset of in more depth later), none of them can consistently be as fast as a well optimized substring search implementation that uses vector instructions. In practice, I’ve often found that the difference is at least one order of magnitude, and sometimes more.
Literal extraction
Let’s show some examples. For this, we can utilize regex-cli
, as it exposes
a regex-cli debug literal
sub-command for extracting literals from regexes.
Let’s start simple:
$ regex-cli debug literal 'bar'
parse time: 13.967µs
translate time: 7.008µs
extraction time: 405ns
optimization time: 1.479µs
len: Some(1)
is finite?: true
is exact?: true
min literal len: Some(3)
max literal len: Some(3)
longest common prefix: Some("bar")
longest common suffix: Some("bar")
E("bar")
In the future, I’ll trim the output since there is a lot of extra information shown. But let’s go through them:
- Parse time refers to the time it takes to turn the bar pattern into a structured Ast value.
- Translate time refers to the time it takes to turn the Ast value into an Hir value.
- Extraction time refers to the time it takes to turn the Hir value into a literal sequence.
- Optimization time refers to the time it takes to “optimize” the literal sequence. This might be as simple as removing duplicate literals and as aggressive as shrinking the sequence in various ways based on heuristics. We’ll see more examples of this later.
- len is the number of literals in the sequence extracted.
- Finite refers to whether the sequence has a finite number of elements. An infinite sequence represents the sequence of all possible literals, and usually means that literal optimizations aren’t possible or aren’t believed to be fruitful.
- Exact refers to whether every element in the literal sequence is exact or not. An exact literal refers to a literal that reached a match state from the place where literal extraction began. Since this command extracts prefixes, an exact literal corresponds to an overall match of the regex. If a literal is not exact, then it is said to be inexact.
- Minimum literal length refers to the length, in bytes, of the shortest literal in the sequence.
- Maximum literal length refers to the length, in bytes, of the longest literal in the sequence.
- Longest common prefix represents a single literal that is a prefix of all elements in the sequence. Infinite sequences and finite sequences containing zero elements lack a common prefix. All other sequences have a common prefix of at least the empty string.
- Longest common suffix represents a single literal that is a suffix of all elements in the sequence. Infinite sequences and finite sequences containing zero elements lack a common suffix. All other sequences have a common suffix of at least the empty string.
Finally, after the above metadata, the extracted sequence is shown. Since the
regex is just the literal bar, the sequence contains a single exact element
corresponding to bar. If bar were a strict prefix, then the sequence would
be the same but bar would be inexact:
$ regex-cli debug literal --no-table 'bar+'
I("bar")
In this case, bar
is inexact because the r
at the end can match one or more
times. In fact, because of the unbounded repetition operator being applied
to a non-empty string, the language described by this regex is infinite. It
therefore follows that one cannot enumerate all of its literals and that at
least some of the literals extracted must be inexact (if any are extracted at
all).
But one does not need to write a regex that describes an infinite language in order to get an inexact literal. Here’s an example of a regex that describes a finite language, but for which only inexact literals are extracted:
$ regex-cli debug literal --no-table 'bar[a-z]'
I("bar")
Literal extraction could have enumerated every literal, for example,
bara
, barb
, barc
, …, barz
. But instead it didn’t. Why? It turns out
that literal extraction is one big heuristic. A dark art, if you will. We have
to go back to why we’re doing literal extraction at all in the first place:
to identify candidate matches in a haystack very quickly before using a slower
regex engine to confirm whether a match exists at that location.
The trick here is that the choice of what literals to search for might be
just as important as choosing how to search for them. The algorithm with the
highest throughput in the world isn’t going to help you if your haystack is
1,000,000 a’s and your regex is just the literal a. The key here is that a
good literal optimization achieves both of the following things:
- Minimizes the false positive rate of candidates. That is, most candidates it reports lead to a match.
- Minimizes its impact on the latency of the search. That is, when a prefilter is active, it ideally results in running the regex engine on only a small portion of the haystack. If a prefilter reports candidates frequently, then even if it has a 0% false positive rate, its impact on latency is likely to be hurting overall search times.
The reason why I called literal optimizations a dark art is because it is impossible to know, before a search begins, how to optimally choose the above two things. The reason is because they both depend on the haystack itself, and scanning the haystack to “study” it is almost certainly going to result in a net negative for overall search times. Therefore, we have to guess at how to minimize the false positive rate while reducing our impact on latency. That’s why it’s a dark art.
Thankfully, there are some guidelines we can follow that usually give us a good result:
- A smaller sequence of literals is usually better than a larger sequence, but not if this results in elements that are extremely short. That is, 1 or 2 bytes in length. Short elements are likely to match much more frequently, and so we’d rather not have them. For example, if we had 5,000 literals that were all limited to lowercase ASCII letters, we could trivially shrink the number of literals to at most 26 by taking the first byte of each literal. But this new sequence of literals is likely to match a lot more frequently, and thus result in a higher hit to latency and a higher false positive rate. It would be better to shrink the sequence while retaining longer literals that are less likely to match.
- Longer literals are generally better than short literals, but not if it would
result in a large sequence. Longer literals are usually more discriminative,
that is, they lead to a lower false positive rate since they are less likely to
match by chance. But one doesn’t want to prioritize long literals arbitrarily.
For example, you might have a sequence containing the literals
foobar, foobaz, fooquux, but a better sequence would probably be foo even
though it’s shorter than all three literals in the sequence. A single element
sequence is nice because it means we can potentially use a single-substring
search algorithm (which is probably fast).
Literal extraction tries to adhere to the above guidelines as much as possible,
but there are some other heuristics that often come into play. For example, the
ASCII space character, U+0020
, is unusually common. If a sequence would
otherwise contain a space, then the sequence is made infinite when optimized,
and this effectively disables literal optimizations. For example, this regex
has three prefix literals extracted:
$ regex-cli debug literal --no-table '(?:foo|z|bar)[a-z]+'
I("foo")
I("z")
I("bar")
But this one doesn’t. The only difference is that the z
was replaced with a
space:
$ regex-cli debug literal --no-table '(?:foo| |bar)[a-z]+'
Seq[∞]
This heuristic takes place during the “optimization” pass of a literal sequence. The heuristic notices that one of the literals is just a space character, assumes this will lead to a high false positive rate and makes the sequence infinite. When a sequence is infinite, it communicates that there is no small set of finite literals that would (likely) serve as a good prefilter. If we disable optimization, we can see that the space character is included:
$ regex-cli debug literal --no-table --no-optimize '(?:foo| |bar)[a-z]+'
I("foo")
I(" ")
I("bar")
To make matters more complicated, if we use a different regex that leads to a small finite sequence of literals that are all exact, then the literal containing the space character doesn’t result in the overall sequence being made infinite:
$ regex-cli debug literal --no-table 'foo| |bar'
E("foo")
E(" ")
E("bar")
This is useful because when all the literals are exact, the regex engine can be skipped completely. In this case, there doesn’t have to be any prefilter at all. One can just use the multi-substring algorithm directly to report matches. Therefore, the concern about a high false positive rate is irrelevant, because every match produced by searching for the literals is a real match.
At this point, we should cover why I’m using the term literal sequence
instead of literal set. Namely, the order of the literals extracted matters.
It matters because the regex
crate tries to simulate Perl-like semantics.
That is, that the matches reported are done as if employing a backtracking
search. This is also called leftmost-first matching, and in this context, the
|
operator is not commutative. For example:
$ regex-cli debug literal --no-table 'sam|samwise'
E("sam")
$ regex-cli debug literal --no-table 'samwise|sam'
E("samwise")
E("sam")
These are two different sequences. Both are minimal with respect to the
corresponding regexes. In the first case, sam|samwise
will only ever match
sam
, since sam
is a prefix of samwise
and comes before samwise
in the
pattern. Therefore, a literal sequence consisting of just sam
is correct,
since samwise
can never match. In the second case, samwise|sam
can match
either branch. Even though sam
is a prefix of samwise
, since samwise
appears first, it will be preferred when samwise
is in the haystack.
(Note: POSIX regex engines don’t implement regexes this way. Instead, they have
leftmost-longest semantics, where the longest possible match always wins. In
this case, |
is a commutative operator. Some other regex engines, such as
Hyperscan, implement “report all matches” or “earliest match” semantics. In
that case, abc|a
would match both a
and abc
in the haystack abc
.)
Our last examples show that literal extraction is somewhat intelligent. For example:
$ regex-cli debug literal --no-table 'abc?de?[x-z]ghi'
I("abcde")
I("abcdx")
I("abcdy")
I("abcdz")
I("abdex")
I("abdey")
I("abdez")
I("abdxg")
I("abdyg")
I("abdzg")
That is, literal extraction knows how to expand things like ?
and even small
character classes. This works as long as the literal sequence size stays
under several different heuristic limits. (Notice also
that literal extraction could have enumerated every element in the language
described by this regex, in full, but optimization chose to shrink it in
accordance with its heuristics.)
Another example of “intelligence” is that case insensitivity, including Unicode awareness, is taken into account as well:
$ regex-cli debug literal --no-table '(?i)She'
E("SHE")
E("SHe")
E("ShE")
E("She")
E("sHE")
E("sHe")
E("shE")
E("she")
E("ſHE")
E("ſHe")
E("ſhE")
E("ſhe")
This actually isn’t a result of literal extraction implementing Unicode case
folding, but rather due to the translation from an Ast
to an Hir
doing the
case folding for us:
$ regex-cli debug hir --no-table '(?i)She'
Concat(
[
Class(
{
'S'..='S',
's'..='s',
'ſ'..='ſ',
},
),
Class(
{
'H'..='H',
'h'..='h',
},
),
Class(
{
'E'..='E',
'e'..='e',
},
),
],
)
That is, literal extraction sees this regex as one that is equivalent to
[Ssſ][Hh][Ee]
. All it does is expand the classes as it would any other regex.
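If you want to do this kind of extraction programmatically rather than through regex-cli, regex-syntax exposes it in its hir::literal module. This is a hedged sketch; the Extractor type and its defaults are as I understand them from the docs.

```rust
use regex_syntax::{hir::literal::Extractor, Parser};

fn main() {
    // Parse straight to an Hir, then extract a prefix literal sequence
    // from it, much like the examples above.
    let hir = Parser::new().parse(r"(?i)She").unwrap();
    let mut extractor = Extractor::new();
    let seq = extractor.extract(&hir);
    // The Debug output uses the same E(...)/I(...) notation shown above.
    println!("{:?}", seq);
}
```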
Searching for literals
Once you’ve extracted some literals, you now need to figure out how to search for them.
The single substring case is somewhat easy: you pick the fastest algorithm
you can for finding a substring in a haystack. There’s not much more to it.
You don’t need to care about the order of literals at this point since there
is only one of them. For this case, the regex
crate uses the
memmem
module from the memchr
crate.
There are several different aspects to the algorithm used in memchr::memmem
:
- Its principal algorithm is Two-Way, which runs in
O(n)
worst case time and constant space. - In cases where the needle and haystack are both very short, Rabin-Karp is used in an effort to minimize latency.
- On
x86_64
, a variant of the “generic SIMD” algorithm is used. Basically, two bytes are chosen from the needle, and occurrences for those two bytes in their proper positions are searched for using vector instructions. When a match of those two bytes is found, then a full match of the needle is checked. (Notice that this is just another variant of the prefilter mechanism. We pick an operation that can quickly find candidates and then perform a more expensive verification step.)
For the generic SIMD algorithm, instead of always choosing the first and last
bytes in the needle, we choose two bytes that we believe are “rare” according
to a background frequency distribution of bytes. That is, we assume that bytes
like Z
are far less common than bytes like a
. It isn’t always true, but
we’re in heuristic-land here. It’s true commonly enough that it works well in
practice. By choosing bytes that are probably rarely occurring from the needle,
we hope to maximize the amount of time spent in the vector operations that
detect candidates, and minimize the number of verifications we need to perform.
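In code, that single-substring machinery is available directly from the memchr crate. A small sketch (the heuristics mentioned above are chosen internally when the Finder is built):

```rust
use memchr::memmem;

fn main() {
    // Building the Finder once lets memmem pick its searcher and (likely
    // rare) candidate bytes up front, and reuse them across many searches.
    let finder = memmem::Finder::new("Holmes");
    assert_eq!(Some(13), finder.find(b"Mr. Sherlock Holmes, I presume"));
    assert_eq!(None, finder.find(b"Dr. John Watson"));
}
```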
The multi-substring case is a bit trickier. Here, we need to make sure we treat the literals as a sequence and prioritize matches for literals earlier in the sequence over literals that come later. It’s also typically true that multi-substring search will be slower than the single-substring case, because there’s just generally more work to be done. Here, the principal algorithm employed is Teddy, which is an algorithm that I ported out of Hyperscan. At a high level, it uses vector instructions to detect candidates quickly and then a verification step to confirm those candidates as a match.
The Aho-Corasick algorithm is also used in some cases, although usually the regex engine will just prefer to construct a lazy DFA since performance is similar. Aho-Corasick can still help as a prefilter when the lazy DFA cannot be used though. Aho-Corasick will also typically do better than a lazy DFA when the number of literals is extremely large (around tens of thousands).
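The aho-corasick crate (which the regex crate builds on) exposes the leftmost-first match semantics needed to preserve pattern priority. A hedged sketch against its 1.x-style API:

```rust
use aho_corasick::{AhoCorasick, MatchKind};

fn main() {
    // Earlier patterns take priority over later ones, matching the
    // leftmost-first semantics the regex crate needs from a prefilter.
    let ac = AhoCorasick::builder()
        .match_kind(MatchKind::LeftmostFirst)
        .build(["samwise", "sam"])
        .unwrap();
    let m = ac.find("samwise gamgee").unwrap();
    assert_eq!(0, m.pattern().as_usize());
    assert_eq!((0, 7), (m.start(), m.end()));
}
```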
There is a lot more work I hope to do in the multi-substring case going forward.
The NFA data type
If there was a central data type inside the regex
crate, it would probably be
the NFA
. More specifically, it is a Thompson NFA, which means it was built
by an algorithm similar to Thompson’s construction. Thompson’s construction
builds an NFA from a structured representation of a regex in O(m)
time, where
m
is proportional to the size of the regex after counted repetitions have
been expanded. (For example, a{5}
is aaaaa
.) The algorithm works by mapping
each type of regex expression to a mini NFA unto itself, and then defining
rules for composing those mini NFAs into one big NFA.
NFAs are a central data type because they can be used directly, as-is,
to implement a regex engine. But they can also be transformed into other
types (such as DFAs) which are in turn used to implement different regex
engines. Basically, at present, if you want to build a regex engine with
regex-automata
, then you have to start with a Thompson NFA.
Before exploring NFAs in more detail, let’s look at a simple example.
A simple NFA example
As with literal extraction, regex-cli
can be helpful here by letting us print
a debug representation of an NFA when given a regex:
$ regex-cli debug thompson 'a'
parse time: 9.856µs
translate time: 3.005µs
compile nfa time: 18.51µs
memory: 700
states: 6
pattern len: 1
capture len: 1
has empty?: false
is utf8?: true
is reverse?: false
line terminator: "\n"
lookset any: ∅
lookset prefix any: ∅
lookset prefix all: ∅
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 3
000003: a => 4
000004: capture(pid=0, group=0, slot=1) => 5
000005: MATCH(0)
transition equivalence classes: ByteClasses(0 => [\x00-`], 1 => [a], 2 => [b-\xFF], 3 => [EOI])
)
In the future, I’ll trim the output to just the NFA itself. So let’s explain what the rest of the output means here:
- The parse and translate timings are the same as they were for literal extraction. That is, the time to build Ast and Hir values, respectively.
- Compilation time refers to the time it takes to compile an Hir value into an NFA.
- Memory refers to the number of bytes of heap memory used by the NFA.
- States is the number of states in the NFA.
- Pattern length is the number of patterns in the NFA.
- Capture length is the number of capture groups compiled into the NFA. When capture groups are enabled (they are by default), then there is always at least 1 group corresponding to the overall match.
- “has empty” refers to whether the NFA can match the empty string or not.
- “is utf8” refers to whether the NFA is guaranteed to never match any invalid UTF-8. This includes not matching the empty string between code units in a UTF-8 encoded codepoint. For example, 💩 is UTF-8 encoded as \xF0\x9F\x92\xA9. While the empty regex will match at every position, when the NFA is in UTF-8 mode, the only matches it would report are immediately before the \xF0 and immediately after \xA9. (A short snippet after this list demonstrates this through the regex crate.)
- “is reverse” refers to whether the NFA matches the regex in reverse. This means that it matches the language described by the original regex, but with the elements in the language reversed.
- Line terminator refers to the line terminator used for the (?m:^), (?m:$) and . regexes.
- “lookset any” is the set of all look-around assertions in the regex.
- “lookset prefix any” is the set of all look-around assertions that occur in the prefix of the regex. Every match may match zero or more of these.
- At the end, the “transition equivalence classes” refers to a partitioning of all possible byte values into equivalence classes. The rule is that bytes in the same equivalence class can be treated as equivalent to one another with respect to whether a match occurs.
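Here is the “is utf8” behavior mentioned above, observed through the regex crate (whose NFA is built in UTF-8 mode):

```rust
use regex::Regex;

fn main() {
    // The empty regex matches everywhere, but only at valid UTF-8
    // boundaries: for the 4-byte 💩, that means offsets 0 and 4 only.
    let re = Regex::new("").unwrap();
    let starts: Vec<usize> = re.find_iter("💩").map(|m| m.start()).collect();
    assert_eq!(vec![0, 4], starts);
}
```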
Other than that, the main output is the NFA itself. Let’s walk through it:
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 3
000003: a => 4
000004: capture(pid=0, group=0, slot=1) => 5
000005: MATCH(0)
The ^
before state 2
indicates that it is the “anchored” starting state,
while the >
before state 0
indicates that it is the “unanchored” starting
state. The former is used when running an anchored search and the latter is
used for an unanchored search.
An unanchored search starts with binary-union(2, 1)
. This indicates that
NFA traversal should first go to state 2
(a capture
state), and only if
that path fails should it try going to state 1
. In this case, state 1
matches any byte and loops back around to state 0
. In effect, states 0
and
1
represent an implicit (?s-u:.)*?
prefix at the beginning of the regex,
effectively making it possible for it to match anywhere in the haystack.
A capture
state is an unconditional epsilon transition that only exists to
cause a side effect: it instructs the virtual machine executing the NFA to
store the current offset in the slot included in the capture
state. If one is
doing NFA traversal outside the context of a virtual machine (or something else
that cares about capture groups), the capture
states are effectively ignored
by treating them as unconditional epsilon transitions with no side effects. For
example, this is how they are handled during determinization (the process of
converting an NFA to a DFA).
Once at state 3, one must check whether the byte at the current position is
equivalent to a, and if so, traversal moves to state 4, which is another
capture state. Traversal finally moves to state 5, which is a match state for
the pattern with identifier 0.
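To see those capture states do their job, here is a hedged sketch that runs the same regex through regex-automata’s PikeVM and reads the capture slots back out (method names per my reading of the docs):

```rust
use regex_automata::nfa::thompson::pikevm::PikeVM;

fn main() {
    let re = PikeVM::new("a").unwrap();
    let mut cache = re.create_cache();
    let mut caps = re.create_captures();
    // Each time the PikeVM crosses a capture state, it writes the current
    // haystack offset into that state's slot.
    re.captures(&mut cache, "xyza", &mut caps);
    // Group 0 is the implicit group for the overall match: the `a` at 3.
    let m = caps.get_match().unwrap();
    assert_eq!((3, 4), (m.start(), m.end()));
}
```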
NFA optimization: sparse states
One of the main problems with a Thompson NFA comes from the thing that makes it
a decent choice for a general purpose regex engine: its construction time is
worst case O(m)
, but this is achieved through liberal use of epsilon
transitions. Epsilon transitions are transitions in the NFA that are taken
without consuming any input. They are one of two ways that a search via an NFA
simulation can wind up in multiple states simultaneously. (The other way is for
a single NFA state to have multiple outgoing transitions for the same haystack
symbol.)
Why are epsilon transitions a problem? Well, when performing an NFA traversal
(whether it’s for a search or for building some other object such as a DFA), an
epsilon transition represents an added cost you must pay whenever one is found.
In particular, every NFA state has something called an epsilon closure, which
is the set of states reachable via following epsilon transitions recursively.
Depending on where it occurs in the NFA, an epsilon closure may be re-computed
many times. The epsilon closure for a particular state may change during
traversal because some epsilon transitions may be conditional, such as the ones
corresponding to anchor assertions like ^
, $
and \b
.
Let’s take a look at one relatively simple optimization that I made in the
new NFA compiler. First, let’s see how regex <1.9
compiled the regex
[A-Za-z0-9]
:
$ regex-debug compile --bytes '[A-Za-z0-9]'
0000 Save(0) (start)
0001 Split(2, 3)
0002 Bytes(0, 9) (goto: 6)
0003 Split(4, 5)
0004 Bytes(A, Z) (goto: 6)
0005 Bytes(a, z)
0006 Save(1)
0007 Match(0)
(Note: regex-debug
is an older hacky version of a command line tool for
interacting with the regex
crate. It is no longer available, although you can
always check out an older tag in the regex
crate repository and build it.)
The Split instruction corresponds to an NFA state with two unconditional
epsilon transitions (it’s the same as the binary-union instruction in the
previous section). The Save instruction is for capture groups and Bytes is for
checking whether a single byte or a contiguous range of bytes matches the
current haystack symbol. In this case, the character class is implemented with
multiple Split instructions. Notice, for example, that the epsilon closure of
state 0 is {1, 2, 3, 4, 5}.
Now let’s see what the new NFA compiler does:
$ regex-cli debug thompson --no-table '[A-Za-z0-9]'
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 3
000003: sparse(0-9 => 4, A-Z => 4, a-z => 4)
000004: capture(pid=0, group=0, slot=1) => 5
000005: MATCH(0)
)
(We can ignore states 0
and 1
, the old NFA doesn’t have an equivalent
prefix for unrelated reasons.)
Here, instead of epsilon transitions, we just have a single sparse
NFA
state. The name sparse
refers to the fact that the state contains more
than one contiguous range of bytes, with each range potentially pointing to
a different state. sparse
is used because the representation used doesn’t
permit a constant time determination of whether a particular haystack symbol
has a matching transition. (i.e., One has to use a linear or binary search.)
It accomplishes the same thing as the Split
instructions in the old NFA,
but without any explicit epsilon transitions. This results in less overhead
because there’s no need to compute an epsilon closure through multiple Split
instructions. There’s just one state, and finding the next transition requires
looking up which range, if any, the current haystack symbol matches.
The main downside of this particular optimization is that the sparse
state
(in the current representation of the NFA) uses indirection to support this. So
it may have harmful cache effects and may result in more heap memory used in
some cases. But the overhead of dealing with all of the epsilon transitions, in
practice, tends to trump that. It’s possible this indirection will be removed
in the future.
NFA optimization: minimal UTF-8 automata
One interesting aspect of the old NFA compiler is that it could produce two
different kinds of NFAs: an NFA whose alphabet was defined over Unicode code
points and an NFA whose alphabet was defined over arbitrary bytes. (Not just
UTF-8 code units. Even bytes that can never be a valid UTF-8 code unit,
like \xFF
, are permitted in this byte oriented NFA.) The Unicode NFA is
principally used when using an NFA regex engine (the PikeVM or the bounded
backtracker, we’ll get to those later), whereas the byte oriented NFA is
used whenever the lazy DFA engine is used. A byte oriented NFA is required
for use with the lazy DFA because a lazy DFA really wants its alphabet to be
defined over bytes. Otherwise, you wind up with difficult to solve performance
problems. (A byte oriented NFA can be used with the NFA regex engines, but this
only occurs when the regex can match invalid UTF-8. In this case, a Unicode
alphabet cannot be used.)
This led to at least three problems. First is that the byte oriented NFA was
often slower, primarily because of the epsilon transition problem we talked
about in the previous section. That is, a byte oriented NFA usually had more
Split
instructions, whereas the Unicode NFA would look more like the sparse
state in the new NFA compiler. For example:
$ regex-debug compile '[A-Za-z0-9]'
0000 Save(0) (start)
0001 '0'-'9', 'A'-'Z', 'a'-'z'
0002 Save(1)
0003 Match(0)
Notice that there are no Split
instructions at all.
The second problem follows from the first. Since the byte oriented NFA is usually slower, we would actually compile both a Unicode NFA and a byte oriented NFA. That way, we could use the Unicode NFA with the NFA regex engines and the byte oriented NFA with the lazy DFA. It works, but it’s wasteful.
The third problem is that the NFA regex engines need to work on both the Unicode and byte oriented versions of the NFA. This complicates things and has been the reason for bugs. This problem could likely be mitigated to an extent with better design, but it’s a complication.
So what does this have to do with UTF-8 automata? Well, a byte oriented NFA still needs to be able to deal with Unicode classes. But if the alphabet is bytes and not codepoints, how does it do it? It does it by building UTF-8 automata into the NFA. For example, here’s an NFA (from the new compiler) that can match the UTF-8 encoding of any Unicode scalar value:
$ regex-cli debug thompson --no-table '(?s:.)'
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 10
000003: \x80-\xBF => 11
000004: \xA0-\xBF => 3
000005: \x80-\xBF => 3
000006: \x80-\x9F => 3
000007: \x90-\xBF => 5
000008: \x80-\xBF => 5
000009: \x80-\x8F => 5
000010: sparse(\x00-\x7F => 11, \xC2-\xDF => 3, \xE0 => 4, \xE1-\xEC => 5, \xED => 6, \xEE-\xEF => 5, \xF0 => 7, \xF1-\xF3 => 8, \xF4 => 9)
000011: capture(pid=0, group=0, slot=1) => 12
000012: MATCH(0)
)
The way this is achieved is by starting from a contiguous range of Unicode
codepoints and then generating a sequence of byte oriented character classes
that match the UTF-8 encoding of that range of codepoints. This functionality
is provided by regex-syntax
’s utf8
module. So for example,
(?s:.)
would look like this:
[0-7F]
[C2-DF][80-BF]
[E0][A0-BF][80-BF]
[E1-EC][80-BF][80-BF]
[ED][80-9F][80-BF]
[EE-EF][80-BF][80-BF]
[F0][90-BF][80-BF][80-BF]
[F1-F3][80-BF][80-BF][80-BF]
[F4][80-8F][80-BF][80-BF]
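That listing can be reproduced with the utf8 module directly; a small sketch:

```rust
use regex_syntax::utf8::Utf8Sequences;

fn main() {
    // Each item is a sequence of byte ranges whose concatenation matches
    // the UTF-8 encodings of a contiguous range of Unicode scalar values.
    for seq in Utf8Sequences::new('\u{0}', '\u{10FFFF}') {
        println!("{:?}", seq);
    }
}
```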
Translating that into an NFA can be done by just treating it as an alternation, like this:
[0-7F]|[C2-DF][80-BF]|...|[F4][80-8F][80-BF][80-BF]
And that will work and it’s correct. The problem is that it will generate
very large NFAs for some very common cases. See, the problem with Unicode is
that it tends to introduce extremely large character classes into a regex.
Classes like \w
, for example, match 139,612 distinct codepoints (at time of
writing). The ASCII version of \w
only matches 63 codepoints. This is a
categorical difference, and there are plenty of tricks that will work for a
small number like 63 that just won’t scale to numbers like 139,612.
The old regex crate did not naively compile UTF-8 automata like the approach
above. Indeed, there is a lot of redundant structure in the classes produced
by the utf8
module above. The old regex crate noticed this and tried to factor
out common suffixes so that they were shared whenever possible. But this still
led to extremely large NFAs:
$ regex-debug compile --bytes '\w' | tail -n20
3545 Bytes(\xb0, \xb0) (goto: 3466)
3546 Bytes(\xf0, \xf0) (goto: 3545)
3547 Split(3550, 3551)
3548 Bytes(\x80, \x8c) (goto: 28)
3549 Bytes(\xb1, \xb1) (goto: 3548)
3550 Bytes(\xf0, \xf0) (goto: 3549)
3551 Split(3554, 3555)
3552 Bytes(\x8d, \x8d) (goto: 2431)
3553 Bytes(\xb1, \xb1) (goto: 3552)
3554 Bytes(\xf0, \xf0) (goto: 3553)
3555 Split(3558, 3562)
3556 Bytes(\x84, \x86) (goto: 28)
3557 Bytes(\xa0, \xa0) (goto: 3556)
3558 Bytes(\xf3, \xf3) (goto: 3557)
3559 Bytes(\x80, \xaf) (goto: 3563)
3560 Bytes(\x87, \x87) (goto: 3559)
3561 Bytes(\xa0, \xa0) (goto: 3560)
3562 Bytes(\xf3, \xf3) (goto: 3561)
3563 Save(1)
3564 Match(0)
Notice here that we’re only showing the last 20 lines of output. But the NFA produced has 3,564 states. Wow. And there are epsilon transitions everywhere. It’s truly a mess, and the only reason why the old regex crate does as well as it does is because the lazy DFA usually bails it out by compiling some subset of what is actually used into a DFA.
Now let’s look at what the new NFA compiler does:
$ regex-cli debug thompson --no-table '\w' | tail -n20
000292: \xB0-\xB9 => 310
000293: sparse(\x84 => 115, \x85 => 291, \x86 => 210, \xAF => 292)
000294: \x80-\x9F => 310
000295: sparse(\x80-\x9A => 5, \x9B => 294, \x9C-\xBF => 5)
000296: sparse(\x80-\x9B => 5, \x9C => 282, \x9D-\x9F => 5, \xA0 => 55, \xA1-\xBF => 5)
000297: sparse(\x80-\xA1 => 310, \xB0-\xBF => 310)
000298: sparse(\x80-\xB9 => 5, \xBA => 297, \xBB-\xBF => 5)
000299: \x80-\xA0 => 310
000300: sparse(\x80-\xAE => 5, \xAF => 299)
000301: \x80-\x9D => 310
000302: sparse(\xA0-\xA7 => 5, \xA8 => 301)
000303: sparse(\x80-\x8A => 310, \x90-\xBF => 310)
000304: sparse(\x80-\x8C => 5, \x8D => 303, \x8E-\xBF => 5)
000305: sparse(\x80-\x8D => 5, \x8E => 236)
000306: sparse(\x90 => 193, \x91 => 231, \x92 => 235, \x93 => 238, \x94 => 239, \x96 => 247, \x97 => 118, \x98 => 248, \x9A => 250, \x9B => 256, \x9C => 257, \x9D => 276, \x9E => 290, \x9F => 293, \xA0-\xA9 => 118, \xAA => 295, \xAB => 296, \xAC => 298, \xAD => 118, \xAE => 300, \xAF => 302, \xB0 => 118, \xB1 => 304, \xB2 => 305)
000307: sparse(\x84-\x86 => 5, \x87 => 236)
000308: \xA0 => 307
000309: sparse(0-9 => 310, A-Z => 310, _ => 310, a-z => 310, \xC2 => 3, \xC3 => 4, \xC4-\xCA => 5, \xCB => 6, \xCC => 5, \xCD => 7, \xCE => 8, \xCF => 9, \xD0-\xD1 => 5, \xD2 => 10, \xD3 => 5, \xD4 => 11, \xD5 => 12, \xD6 => 13, \xD7 => 14, \xD8 => 15, \xD9 => 16, \xDA => 5, \xDB => 17, \xDC => 18, \xDD => 19, \xDE => 20, \xDF => 21, \xE0 => 53, \xE1 => 93, \xE2 => 109, \xE3 => 116, \xE4 => 117, \xE5-\xE9 => 118, \xEA => 137, \xEB-\xEC => 118, \xED => 140, \xEF => 155, \xF0 => 306, \xF3 => 308)
000310: capture(pid=0, group=0, slot=1) => 311
000311: MATCH(0)
There are not only far fewer states, but there are zero epsilon transitions.
While this is due in part to the use of the sparse
state optimization
described in the previous section, it does not account for all of it.
The new NFA compiler achieves this by using Daciuk’s algorithm for computing
minimal DFAs from a sorted sequence of non-overlapping elements. That’s exactly
what we get from the utf8
module. In practice, we don’t necessarily generate
minimal DFAs because of the memory usage required, but instead sacrifice strict
minimality in favor of using a bounded amount of memory. But it’s usually close
enough.
The reverse case cannot be handled so easily because there is no simple way to
reverse sort the output of the utf8 module in a way that works with Daciuk’s
algorithm (as far as I know). To work around this, I built a bespoke data
structure called a range trie that re-partitions the output of the utf8 module
in reverse such that it’s sorted and non-overlapping. Once this is done, we can
use Daciuk’s algorithm just like we do for the forward case. The problem,
though, is that this can increase the time it takes to build an NFA quite a
bit. Because of that, one needs to opt into it. First, without the reverse
shrinking:
$ regex-cli debug thompson --no-table --no-captures '\w' -r | tail -n20
001367: \xB1 => 722
001368: \x80-\x8C => 1367
001369: \x80-\xBF => 1368
001370: \x8D => 1367
001371: \x80-\x8A => 1370
001372: \x90-\xBF => 1370
001373: \x8E-\xBF => 1367
001374: \x80-\xBF => 1373
001375: \xB2 => 722
001376: \x80-\x8D => 1375
001377: \x80-\xBF => 1376
001378: \x8E => 1375
001379: \x80-\xAF => 1378
001380: \xF3 => 1386
001381: \xA0 => 1380
001382: \x84-\x86 => 1381
001383: \x80-\xBF => 1382
001384: \x87 => 1381
001385: \x80-\xAF => 1384
001386: MATCH(0)
And now with it:
$ regex-cli debug thompson --no-table --no-captures '\w' -r --shrink | tail -n20
000469: sparse(\x90 => 2, \x92 => 2, \x97 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488,
\xEB-\xEC => 488, \xEF => 488)
000470: sparse(\x97 => 2, \x9A => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488, \xEB-\xEC
=> 488)
000471: sparse(\x80 => 3, \x81 => 387, \x82 => 6, \x83 => 387, \x84 => 397, \x85 => 451, \x86 => 174, \x87 => 6, \x88 => 100, \x89 => 100, \x8A => 13, \x8B => 348, \x8C => 14, \x8D => 45
2, \x8E => 16, \x8F => 88, \x90 => 19, \x91 => 62, \x92 => 20, \x93 => 343, \x94 => 21, \x95 => 62, \x96 => 23, \x97 => 61, \x98 => 179, \x99 => 27, \x9A => 27, \x9B => 441, \x9C => 446,
\x9D => 236, \x9E => 28, \x9F => 461, \xA0 => 454, \xA1 => 442, \xA2 => 31, \xA3 => 428, \xA4 => 33, \xA5 => 467, \xA6 => 35, \xA7 => 330, \xA8 => 455, \xA9 => 468, \xAA => 388, \xAB => 4
43, \xAC => 43, \xAD => 414, \xAE => 84, \xAF => 447, \xB0 => 438, \xB1 => 416, \xB2 => 363, \xB3 => 457, \xB4 => 67, \xB5 => 340, \xB6 => 199, \xB7 => 141, \xB8 => 465, \xB9 => 374, \xBA
=> 53, \xBB => 417, \xBC => 459, \xBD => 56, \xBE => 469, \xBF => 470, \xC3 => 488, \xC4-\xCA => 488, \xCC => 488, \xCD => 488, \xCE => 488, \xCF => 488, \xD0-\xD1 => 488, \xD2 => 488, \
xD3 => 488, \xD4 => 488, \xD5 => 488, \xD6 => 488, \xD8 => 488, \xD9 => 488, \xDA => 488, \xDC => 488, \xDD => 488, \xDF => 488)
000472: sparse(\x91 => 2, \x92 => 2, \x93 => 2, \x97 => 2, \x98 => 2, \x9F => 2, \xA0 => 7, \xA1-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xB0 => 2, \xB1 => 2, \
xB2 => 2, \xE1 => 488, \xE2 => 488, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488, \xED => 488)
000473: sparse(\x92 => 2, \x94 => 2, \x97 => 2, \x98 => 2, \x9D => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xB0 => 2, \xB1 => 2, \xE1 => 488, \xE3 => 48
8, \xE4 => 488, \xE5-\xE9 => 488, \xEB-\xEC => 488, \xED => 488)
000474: sparse(\x91 => 2, \x97 => 2, \x98 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE2 => 488, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488,
\xEA => 488, \xEB-\xEC => 488, \xEF => 488)
000475: sparse(\x90 => 2, \x91 => 2, \x96 => 2, \x97 => 2, \x9C => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE1 => 488, \xE3 =>
488, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488)
000476: sparse(\x90 => 2, \x96 => 2, \x97 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488,
\xEA => 488, \xEB-\xEC => 488, \xEF => 488)
000477: sparse(\x80 => 218, \x81 => 387, \x82 => 6, \x83 => 387, \x84 => 472, \x85 => 451, \x86 => 174, \x87 => 387, \x88 => 12, \x89 => 100, \x8A => 13, \x8B => 348, \x8C => 14, \x8D =>
452, \x8E => 16, \x8F => 426, \x90 => 19, \x91 => 62, \x92 => 20, \x93 => 473, \x94 => 21, \x95 => 62, \x96 => 23, \x97 => 61, \x98 => 179, \x99 => 157, \x9A => 27, \x9B => 441, \x9C =>
446, \x9D => 236, \x9E => 28, \x9F => 461, \xA0 => 454, \xA1 => 442, \xA2 => 31, \xA3 => 428, \xA4 => 33, \xA5 => 467, \xA6 => 305, \xA7 => 317, \xA8 => 463, \xA9 => 468, \xAA => 388, \xA
B => 443, \xAC => 223, \xAD => 414, \xAE => 43, \xAF => 447, \xB0 => 438, \xB1 => 474, \xB2 => 363, \xB3 => 457, \xB4 => 140, \xB5 => 340, \xB6 => 266, \xB7 => 141, \xB8 => 465, \xB9 => 2
01, \xBA => 108, \xBB => 417, \xBC => 475, \xBD => 476, \xBE => 77, \xBF => 470, \xC3 => 488, \xC4-\xCA => 488, \xCC => 488, \xCE => 488, \xCF => 488, \xD0-\xD1 => 488, \xD2 => 488, \xD3
=> 488, \xD4 => 488, \xD5 => 488, \xD8 => 488, \xD9 => 488, \xDA => 488, \xDC => 488, \xDD => 488)
000478: sparse(\x97 => 2, \x9D => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xB0 => 2, \xB1 => 2, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488,
\xEB-\xEC => 488)
000479: sparse(\x91 => 2, \x96 => 2, \x97 => 2, \x98 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xAF => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE3 => 48
8, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488)
000480: sparse(\x96 => 2, \x97 => 2, \x98 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xAF => 2, \xB0 => 2, \xB1 => 2, \xE3 => 488, \xE4 => 488, \xE5-\xE
9 => 488, \xEB-\xEC => 488, \xEF => 488)
000481: sparse(\x90 => 2, \x97 => 2, \x98 => 2, \x9D => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE1 => 488, \xE3 =>
488, \xE4 => 488, \xE5-\xE9 => 488, \xEB-\xEC => 488, \xEF => 488)
000482: sparse(\x91 => 2, \x97 => 2, \x98 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xAE => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE1 => 488, \xE3 => 488, \xE4 =
> 488, \xE5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488, \xEF => 488)
000483: sparse(\x91 => 2, \x97 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE1 => 488, \xE2 => 488, \xE3 => 488, \xE4 => 488, \x
E5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488)
000484: sparse(\x91 => 2, \x97 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE0 => 488, \xE1 => 488, \xE2 => 488, \xE3 => 488, \xE4 => 488, \x
E5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488, \xEF => 488)
000485: sparse(\x97 => 2, \xA0-\xA9 => 2, \xAA => 2, \xAB => 2, \xAC => 2, \xAD => 2, \xB0 => 2, \xB1 => 2, \xE3 => 488, \xE4 => 488, \xE5-\xE9 => 488, \xEA => 488, \xEB-\xEC => 488)
000486: sparse(\x80 => 4, \x81 => 396, \x82 => 6, \x83 => 387, \x84 => 472, \x85 => 451, \x86 => 174, \x87 => 387, \x88 => 12, \x89 => 100, \x8A => 195, \x8B => 348, \x8C => 14, \x8D =>
452, \x8E => 16, \x8F => 426, \x90 => 19, \x91 => 62, \x92 => 20, \x93 => 473, \x94 => 114, \x95 => 62, \x96 => 23, \x97 => 61, \x98 => 179, \x99 => 27, \x9A => 27, \x9B => 441, \x9C => 4
46, \x9D => 236, \x9E => 28, \x9F => 478, \xA0 => 462, \xA1 => 442, \xA2 => 31, \xA3 => 479, \xA4 => 33, \xA5 => 467, \xA6 => 305, \xA7 => 480, \xA8 => 481, \xA9 => 399, \xAA => 482, \xAB
=> 443, \xAC => 43, \xAD => 414, \xAE => 43, \xAF => 447, \xB0 => 438, \xB1 => 474, \xB2 => 363, \xB3 => 457, \xB4 => 483, \xB5 => 484, \xB6 => 108, \xB7 => 141, \xB8 => 465, \xB9 => 374
, \xBA => 108, \xBB => 417, \xBC => 71, \xBD => 476, \xBE => 57, \xBF => 485, \xC3 => 488, \xC4-\xCA => 488, \xCC => 488, \xCD => 488, \xCE => 488, \xCF => 488, \xD0-\xD1 => 488, \xD2 =>
488, \xD3 => 488, \xD4 => 488, \xD5 => 488, \xD6 => 488, \xD8 => 488, \xD9 => 488, \xDA => 488, \xDB => 488, \xDC => 488, \xDD => 488)
^000487: sparse(0-9 => 488, A-Z => 488, _ => 488, a-z => 488, \x80 => 58, \x81 => 72, \x82 => 78, \x83 => 86, \x84 => 96, \x85 => 111, \x86 => 123, \x87 => 136, \x88 => 143, \x89 => 153,
\x8A => 165, \x8B => 172, \x8C => 177, \x8D => 186, \x8E => 194, \x8F => 202, \x90 => 217, \x91 => 222, \x92 => 224, \x93 => 227, \x94 => 233, \x95 => 238, \x96 => 244, \x97 => 251, \x98
=> 257, \x99 => 258, \x9A => 269, \x9B => 274, \x9C => 279, \x9D => 285, \x9E => 295, \x9F => 298, \xA0 => 312, \xA1 => 315, \xA2 => 320, \xA3 => 322, \xA4 => 328, \xA5 => 333, \xA6 => 33
8, \xA7 => 341, \xA8 => 345, \xA9 => 351, \xAA => 360, \xAB => 365, \xAC => 368, \xAD => 375, \xAE => 383, \xAF => 385, \xB0 => 395, \xB1 => 402, \xB2 => 408, \xB3 => 411, \xB4 => 418, \x
B5 => 425, \xB6 => 429, \xB7 => 435, \xB8 => 440, \xB9 => 444, \xBA => 450, \xBB => 460, \xBC => 466, \xBD => 471, \xBE => 477, \xBF => 486)
000488: MATCH(0)
So shrinking in the reverse case still helps quite a bit in terms of generating tighter NFAs with fewer states. But because of the extra compile time hit, it is currently disabled by default. Therefore, the minimal UTF-8 automata optimization only applies to forward NFAs. (Reverse NFAs are created for use with DFAs, since a DFA requires a reverse scan to find the start of each match.) However, we do still look for redundant suffixes and share them in the reverse case when this extra NFA shrinking isn’t enabled.
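To make this concrete, here is a small sketch of how one might opt into the extra reverse shrinking when building an NFA programmatically. It assumes the reverse and shrink knobs on the Thompson NFA compiler config; the exact option names are worth double checking against the regex-automata docs for your version:
use regex_automata::nfa::thompson::{Compiler, Config, NFA};
fn main() {
    // Build a reverse NFA for \w without the extra shrinking pass.
    let unshrunk: NFA = Compiler::new()
        .configure(Config::new().reverse(true).shrink(false))
        .build(r"\w")
        .unwrap();
    // And again with the extra (more expensive) shrinking enabled.
    let shrunk: NFA = Compiler::new()
        .configure(Config::new().reverse(true).shrink(true))
        .build(r"\w")
        .unwrap();
    // The shrunk reverse NFA should report a noticeably smaller heap size.
    println!("unshrunk: {} bytes", unshrunk.memory_usage());
    println!("  shrunk: {} bytes", shrunk.memory_usage());
}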
NFA optimization: literal trie
If you haven’t noticed a theme by now, one of the biggest problems with a Thompson NFA is its epsilon transitions. It is really the critical thing about a Thompson NFA that makes it scale poorly with respect to the size of a regex. This is why, when using Thompson based regex engines, increasing the size of the regex can impact search times. Because of that, Thompson based regex engines often (but not always) have alternative engines that mitigate this weakness in one way or another. For example, by using a lazy DFA. However, a lazy DFA cannot be used in every circumstance, and even a lazy DFA can become overwhelmed by a large enough regex.
So in this section, we’ll talk about another NFA optimization that works to reduce epsilon transitions. In this case, we’re going to be limiting ourselves to an alternation of literals. Let’s take a look at an example using the old NFA compiler:
$ regex-debug compile 'zap|z|zapper'
0000 Save(0) (start)
0001 Split(2, 5)
0002 'z'
0003 'a'
0004 'p' (goto: 13)
0005 Split(6, 7)
0006 'z' (goto: 13)
0007 'z'
0008 'a'
0009 'p'
0010 'p'
0011 'e'
0012 'r'
0013 Save(1)
0014 Match(0)
Here, we’re building an NFA for the regex zap|z|zapper
. The way it’s compiled
is with nested Split
instructions. It starts with a Split(2, 5)
that points
to the beginning of zap
and another Split(6, 7)
instruction. This second
instruction then points to the beginning of z
and zapper
. So for this
regex, when looking for a match, the epsilon closure of all these splits is
enumerated for every character in the haystack.
In contrast, let’s look at what the new NFA compiler does:
$ regex-cli debug thompson --no-table 'zap|z|zapper'
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 11
000003: p => 12
000004: a => 3
000005: r => 12
000006: e => 5
000007: p => 6
000008: p => 7
000009: a => 8
000010: union(4, 12, 9)
000011: z => 10
000012: capture(pid=0, group=0, slot=1) => 13
000013: MATCH(0)
(Again, ignore states 0
and 1
, which correspond to the optional unanchored
(?s-u:.)*?
prefix. The old NFA compiler doesn’t emit those states for
unrelated reasons.)
Here, before an epsilon transition is seen at all, a z
must first be matched
in state 11
. Only after that is the union(4, 12, 9)
state seen. (This
union
is equivalent to the nested Split
instructions in the old NFA
compiler, but combines them all into one state.) If this NFA were used for a
search, one wouldn’t need to compute a beefy epsilon closure for every byte in
the haystack. One would only need to do it after a z
byte is seen, which is
much more rare. In effect, the regex was rewritten to z(?:ap||apper)
.
So what’s going on here? In this particular case, it almost looks like a common
prefix has been factored out. But the optimization at work here is a bit more
general than that. Consider the regex abc|xyz
. There are no common prefixes
here. First, let’s see what the old NFA compiler does:
$ regex-debug compile 'abc|xyz'
0000 Save(0) (start)
0001 Split(2, 5)
0002 'a'
0003 'b'
0004 'c' (goto: 8)
0005 'x'
0006 'y'
0007 'z'
0008 Save(1)
0009 Match(0)
Here, we see a Split
instruction again that forks out to the start of abc
and then xyz
.
Now the new NFA compiler:
$ regex-cli debug thompson --no-table 'abc|xyz'
thompson::NFA(
>000000: binary-union(2, 1)
000001: \x00-\xFF => 0
^000002: capture(pid=0, group=0, slot=0) => 7
000003: c => 8
000004: b => 3
000005: z => 8
000006: y => 5
000007: sparse(a => 4, x => 6)
000008: capture(pid=0, group=0, slot=1) => 9
000009: MATCH(0)
)
Here there are no epsilon transitions at all. The a
and x
have been lifted
out into the same sparse state, with each of them forking off to their respective
suffixes, bc
and yz
.
What’s happening here is that the NFA compiler is recognizing an alternation
of literals, compiling it into a trie, and then converting
that trie to an NFA directly in a way that minimizes epsilon transitions. The
key trick to this optimization is ensuring that leftmost-first semantics are
preserved. For example, in the zap|z|zapper
example above, one might be
tempted to rewrite it as z(?:ap(?:per)?)?
. But this does not have the same
matches! This regex will match zapper
in the haystack zapper
, but the
original zap|z|zapper
will match zap
. The literal trie achieves this by
partitioning the transitions in each trie state into chunks, where a chunk is
created whenever a match (of one of the literals) is seen. If a normal trie was
created, then the preference order required by leftmost-first semantics would
be lost when translating the trie back to an NFA.
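To see the preserved preference order in action, here is a small example using the meta regex engine (described later in this post); any of the engines discussed below reports the same match:
use regex_automata::{meta::Regex, Match};
fn main() {
    // Leftmost-first semantics: the first alternative that can match wins,
    // so zap|z|zapper matches "zap" inside "zapper" rather than "zapper".
    let re = Regex::new(r"zap|z|zapper").unwrap();
    assert_eq!(Some(Match::must(0, 0..3)), re.find("zapper"));
    // The naive rewrite z(?:ap(?:per)?)? would instead match all of
    // "zapper", which is why the literal trie must preserve the original
    // preference order when it builds its NFA states.
}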
NFA future work
There are two aspects I’d like to explore for future work on NFAs.
First is the Glushkov NFA. A Glushkov NFA has a worse time complexity
for compilation, but it comes with the advantage of not having any epsilon
transitions. (Instead, it is an NFA by virtue of permitting a state to have
multiple transitions defined for the same haystack symbol.) Because of the
worse compilation time complexity, a Glushkov NFA probably can’t be used in
every case, but it’s certainly plausible to use it for a subset of smaller
regexes. A Glushkov NFA is possibly more amenable to bit-parallel techniques
that are sadly underused in the regex
crate at present. One of the big
question marks for me here is how well a Glushkov NFA will fare with big
Unicode classes. (Perhaps such things will be a disqualifying criterion.)
Second is storing an NFA in a single contiguous allocation. This might make
access patterns faster and more cache friendly, but perhaps more importantly,
it could permit zero-copy serialization and deserialization. The downside of
doing this is code complexity and potentially more use of unsafe
, but there
are some potentially nice benefits too.
Regex engines
Like RE2, the regex
crate is internally composed of several different regex
engines. Most of them have already been mentioned so far. In this section, we
will do a bit of a deeper dive on each of them and explain why they exist.
Finally, we’ll wrap up by exploring the meta regex engine, which combines all
of the regex engines into a single engine.
But why have so many regex engines? The reason is essentially engineering: the
implementation of regex engines with more functionality tends to search more
slowly than regex engines that have less functionality. We could use only a
single regex engine that supports all the functionality we want, namely the
PikeVM
, but the search performance would likely be disappointing to a lot
of users. This fundamentally drives the addition of other engines. None of
the other engines can support the full range of functionality provided by the
regex
crate, and so the buck stops with the PikeVM
.
In addition to using regex-cli
to show how to run each regex engine, we’ll
also look at short example Rust programs. To follow along with the Rust
program examples, you’ll want to set up a Cargo project and add regex-automata
as a dependency:
$ cargo init --bin
$ cargo add 'regex-automata@0.3'
Then you can edit src/main.rs
to add source code and run the program with
cargo run
.
Common elements among regex engines
While there are many regex engines in the regex-automata
crate, all of them
share very similar APIs. Because of that, it’s worth covering a few of those
common elements first.
The three most important types are Input
, Match
and MatchError
.
Input
is a small abstraction for setting the parameters of a single search.
Its only required parameter is a haystack, that is, the sequence of bytes to
search. Most search APIs accept anything that implements the Into<Input>
trait, and both &[u8]
and &str
implement Into<Input>
. Optional parameters
consist of the span of the haystack to search, whether to execute an anchored
search and whether to stop the search early as soon as a match is found instead
of greedily trying to match as much as possible.
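As a brief sketch, here is how those optional parameters look in practice (the same Input works with every engine below; the meta regex engine, discussed later, is used here purely for brevity):
use regex_automata::{meta::Regex, Anchored, Input, Match};
fn main() {
    let re = Regex::new(r"\w+").unwrap();
    let hay = "Sherlock Holmes";
    // Search only part of the haystack and require the match to start
    // exactly at offset 9.
    let input = Input::new(hay).range(9..).anchored(Anchored::Yes);
    assert_eq!(Some(Match::must(0, 9..15)), re.find(input));
    // An "earliest" search stops as soon as a match is known to exist,
    // which can report a shorter match than leftmost-first would.
    let input = Input::new(hay).earliest(true);
    assert_eq!(Some(Match::must(0, 0..1)), re.find(input));
}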
Match
represents the data reported whenever a match is found in a
haystack. The data consists of two elements. The first is the span of
byte offsets in the haystack where the match was found. The second is the
PatternID
corresponding to the pattern that matched. (Every regex engine
in regex-automata
has first class multi-pattern support. Pattern IDs are
assigned, starting from zero, in an auto-incrementing fashion based on the
order of the patterns given to the regex engine constructor.)
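Here is a small example of what those two pieces of data look like for a multi-pattern search (again via the meta regex engine):
use regex_automata::meta::Regex;
fn main() {
    // Two patterns, assigned pattern IDs 0 and 1 in the order given.
    let re = Regex::new_many(&[r"[0-9]{4}", r"[a-z]+"]).unwrap();
    let hay = "abc 1234";
    let mut it = re.find_iter(hay);

    let m = it.next().unwrap();
    assert_eq!(1, m.pattern().as_usize()); // [a-z]+ matched first...
    assert_eq!(0..3, m.range());           // ...over the span "abc"

    let m = it.next().unwrap();
    assert_eq!(0, m.pattern().as_usize()); // ...and then [0-9]{4}...
    assert_eq!(4..8, m.range());           // ...over the span "1234"

    assert_eq!(None, it.next());
}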
MatchError
represents an error that occurred during a search. When an error
occurs, it is not possible to determine whether a match exists or not. That is,
the result is indeterminate. For this reason, many of the basic search APIs
have a return type of Result<Option<Match>, MatchError>
. Errors can occur
during a search for a variety of reasons. For example, a DFA can be configured
to quit immediately whenever a certain byte is seen. Another example is the
BoundedBacktracker
, which will fail if the length of the haystack exceeds a
configured limit. One of the main features of the meta regex engine, as we’ll
discuss later, is providing a facade on top of a composition of regex engines
that never results in an error being returned to the caller.
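As a quick sketch of what a fallible search looks like, here is the BoundedBacktracker (covered in more detail below) refusing to run when the haystack is far bigger than its default configuration allows:
use regex_automata::nfa::thompson::backtrack::BoundedBacktracker;
fn main() {
    let re = BoundedBacktracker::new(r"\w+").unwrap();
    let mut cache = re.create_cache();
    // A haystack that is (almost certainly) much bigger than the default
    // visited capacity permits, so the search returns a MatchError
    // instead of running.
    let hay = "a".repeat(10_000_000);
    assert!(re.try_find(&mut cache, hay.as_str()).is_err());
}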
There are some other themes common to most regex engines. For example, the
construction of most engines is done by a Builder
and configured by one or
more Config
values. We’ll talk about these more as they come up. See also the
API themes section in the regex-automata
crate documentation.
Engine: Pike VM
As mentioned above, the buck stops with the PikeVM
. That is, the PikeVM
supports the full suite of regex functionality that one can parse with
regex-syntax
, and it supports this for any haystack of any length. Other
regex engines have various limitations. For example:
- The
BoundedBacktracker
only works in cases where len(haystack) * len(regex)
is below a configured limit. In practice, this often means one can only use it with short haystacks.
- The one-pass DFA only works on a small subset of NFAs that satisfy the “one-pass” criterion.
- The lazy DFA and full DFA cannot match Unicode word boundaries. Both have heuristics for treating a Unicode word boundary as an ASCII word boundary when the haystack only consists of ASCII, but if a non-ASCII byte is seen, both can quit and return an error instead of completing the search.
Other than a Cache
value, a PikeVM
can be built directly
from an NFA without any additional work. That is, its search works by a
“simulation” of the NFA itself. (Confusingly, this is sometimes referred to
as the “DFA algorithm.”) The actual implementation is structured similarly to
a virtual machine, with each NFA state acting as an instruction. The PikeVM
works by moving from NFA state to the next, and computing epsilon closures
on the fly. Since it’s possible to be in multiple NFA states simultaneously,
the PikeVM
keeps track of every active state. The transition function then
applies to each of those states. The PikeVM
also keeps track of capture group
positions.
The main problem with the PikeVM
is performance, and its poor performance is
primarily rooted in having to keep track of so much state. The capture group
positions are required to report the start and end match offsets, while the
currently active states must be tracked in order to guarantee worst case
O(m * n)
time. That is, in contrast to a backtracking approach, the PikeVM
visits each byte in the haystack at most a constant number of times, and it
does so by computing all possible active states in lock-step. This adds quite
a bit of overhead, and it can be exacerbated by regexes with a lot of epsilon
transitions. (This is one of the reasons why, earlier in this blog, we spent so
much time talking about optimizations in the NFA to elide epsilon transitions.)
Now that we’ve talked a little about how the PikeVM
works, let’s look at a
few examples. Here’s a lightly annotated Rust program showing how to use it for
a search:
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
fn main() {
// This creates a PikeVM directly from a pattern string. But one can also
// build a PikeVM directly from an NFA using `PikeVM::builder()`.
let re = PikeVM::new(r"\b\w+\b").unwrap();
// Most regex engines in regex-automata require some kind of mutable
// scratch space that can be written to during a search. The meta regex
// engine has APIs that hide this fact from you, but when you use the
// underlying regex engines directly, you must create and pass around these
// cache values explicitly.
let mut cache = re.create_cache();
let mut it = re.find_iter(&mut cache, "Σέρλοκ Χολμς");
assert_eq!(Some(Match::must(0, 0..12)), it.next());
assert_eq!(Some(Match::must(0, 13..23)), it.next());
assert_eq!(None, it.next());
}
And the equivalent regex-cli
command:
$ regex-cli find match pikevm --no-table -p '\b\w+\b' -y 'Σέρλοκ Χολμς'
0:0:12:Σέρλοκ
0:13:23:Χολμς
Notice the use of Unicode word boundaries on a non-ASCII haystack. Only the
PikeVM
, BoundedBacktracker
and one-pass DFA support this. The lazy DFA
and fully compiled DFA would return an error in this case.
The other important thing to notice here is that the search APIs do not
return an error. Indeed, the PikeVM
can never return an error under any
circumstances (nor will it ever panic). This is actually a unique property
among the regex engines in regex-automata
. Every other regex engine can
return an error during a search for one reason or another.
We can also make use of multi-pattern support with capture groups
simultaneously. (This is something that the regex
crate cannot do, and many
have requested this support from its RegexSet
API. That still doesn’t exist,
but you can now at least drop down to regex-automata
and do it. This same
example also works with the meta regex engine.)
use regex_automata::nfa::thompson::pikevm::PikeVM;
fn main() {
let re = PikeVM::new_many(&[
r"(?<email>[.\w]+@(?<domain>[.\w]+))",
r"(?<phone>(?<areacode>[0-9]{3})-[0-9]{3}-[0-9]{4})",
])
.unwrap();
let mut cache = re.create_cache();
let hay = "foo@example.com, 111-867-5309";
let mut it = re.captures_iter(&mut cache, hay);
let caps = it.next().unwrap();
assert_eq!(0, caps.pattern().unwrap().as_usize());
assert_eq!("example.com", &hay[caps.get_group_by_name("domain").unwrap()]);
let caps = it.next().unwrap();
assert_eq!(1, caps.pattern().unwrap().as_usize());
assert_eq!("111", &hay[caps.get_group_by_name("areacode").unwrap()]);
assert!(it.next().is_none());
}
And the equivalent regex-cli
command:
$ regex-cli find capture pikevm --no-table \
-p '(?<email>[.\w]+@(?<domain>[.\w]+))' \
-p '(?<phone>(?<areacode>[0-9]{3})-[0-9]{3}-[0-9]{4})' \
-y 'foo@example.com, 111-867-5309'
0:{ 0: 0..15/foo@example.com, 1/email: 0..15/foo@example.com, 2/domain: 4..15/example.com }
1:{ 0: 17..29/111-867-5309, 1/phone: 17..29/111-867-5309, 2/areacode: 17..20/111 }
Notice how the capture groups are different for each pattern. The caller is responsible for using the correct capture group name based on which pattern matches.
Engine: bounded backtracker
The BoundedBacktracker
uses a backtracking algorithm to execute a search
using a Thompson NFA directly. This is sometimes also (confusingly) referred
to as the “NFA algorithm.” The key difference between the backtracking
implementation in regex-automata
and most other implementations is that
it uses additional state to avoid re-tracing steps already taken during
backtracking. This allows us to guarantee worst case O(m * n)
time, but at
the expense of O(m * n)
space.
(Classical backtracking is also technically bounded theoretically, but the
“bounded” in the name “bounded backtracker” refers to the explicit bound used
in the implementation to guarantee worst case O(m * n)
time.)
The benefit of the bounded backtracker is purely that it is usually faster
than the PikeVM
. In rough experiments, it’s usually about twice as fast.
Here’s a quick example, like the PikeVM
example previously, but using the
bounded backtracker:
use regex_automata::{nfa::thompson::backtrack::BoundedBacktracker, Match};
fn main() {
let re = BoundedBacktracker::new(r"\b\w+\b").unwrap();
// A bounded backtracker needs a cache just like the PikeVM. The Cache
// keeps track of work already done, and also contains scratch space for
// backtracking's call stack, which is stored on the heap.
let mut cache = re.create_cache();
// Unlike the PikeVM, the bounded backtracker can fail to run a search.
// This occurs when `len(regex) * len(haystack)` exceeds the configured
// visited capacity. We'll see an example of this below.
let mut it = re.try_find_iter(&mut cache, "Σέρλοκ Χολμς");
assert_eq!(Some(Ok(Match::must(0, 0..12))), it.next());
assert_eq!(Some(Ok(Match::must(0, 13..23))), it.next());
assert_eq!(None, it.next());
}
And the equivalent regex-cli
command:
$ regex-cli find match backtrack --no-table -p '\b\w+\b' -y 'Σέρλοκ Χολμς'
0:0:12:Σέρλοκ
0:13:23:Χολμς
This example should look almost identical to the PikeVM
example. One important
difference is that instead of calling re.find_iter(..)
, we instead called
re.try_find_iter(..)
. Namely, as mentioned above, the bounded backtracker can
return an error when a search would require more memory than what is configured.
The relevant configuration knob is Config::visited_capacity
. We can also
query just how big of a haystack a bounded backtracker can search without
failing:
use regex_automata::nfa::thompson::backtrack::BoundedBacktracker;
fn main() {
let re = BoundedBacktracker::new(r"\b\w+\b").unwrap();
println!("{:?}", re.max_haystack_len());
}
At the time of writing, the output of this program on my machine is 6635
.
That might seem pretty short, and that’s because the regex is perhaps bigger
than you might think it is. Namely, since \w
is Unicode-aware by default,
\w
matches over 100,000 distinct codepoints. While we could increase the
maximum haystack length that we could search by setting a bigger value for
Config::visited_capacity
, we can also increase it by decreasing the size of
our regex by disabling Unicode mode:
use regex_automata::nfa::thompson::backtrack::BoundedBacktracker;
fn main() {
let re = BoundedBacktracker::new(r"(?-u)\b\w+\b").unwrap();
println!("{:?}", re.max_haystack_len());
}
The output of this program on my machine is now 233015
. That’s nearly two
orders of magnitude difference!
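Alternatively, here is a sketch of raising the backtracker’s budget directly via its builder, at the cost of more heap memory per cache:
use regex_automata::nfa::thompson::backtrack::{self, BoundedBacktracker};
fn main() {
    let re = BoundedBacktracker::builder()
        // Give the backtracker a bigger budget (in bytes) for the bitset
        // it uses to track which (state, haystack offset) pairs have
        // already been visited.
        .configure(backtrack::Config::new().visited_capacity(1 << 22))
        .build(r"\b\w+\b")
        .unwrap();
    // With a bigger budget, the maximum searchable haystack grows too.
    println!("{:?}", re.max_haystack_len());
}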
Overall, when possible, one should prefer using the bounded backtracker over
the PikeVM
. They both have the same time complexity guarantees, but the
bounded backtracker tends to be faster in practice.
Engine: one-pass DFA
Before talking about the one-pass DFA, it makes sense to motivate its
existence in a bit more detail. Namely, one important aspect of both the
PikeVM
and the bounded backtracker is that they support reporting the offsets
of matching capture groups in the pattern. For example, using the PikeVM
with
regex-cli
:
$ regex-cli find capture pikevm --no-table \
-p '(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})' \
-y '2023-07-02'
0:{ 0: 0..10/2023-07-02, 1/year: 0..4/2023, 2/month: 5..7/07, 3/day: 8..10/02 }
Capture groups are quite useful because they permit decomposing a regex match
down into constituent parts that are independently useful. As in the above
example, we didn’t just match a date, we matched the individual components
of that date and made each of those components easily available via APIs. For
example, using the regex
crate itself:
use regex::Regex;
fn main() {
let pat = r"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})";
let re = Regex::new(pat).unwrap();
let Some(caps) = re.captures("2023-07-02") else { return };
println!(" year: {:?}", &caps["year"]);
println!("month: {:?}", &caps["month"]);
println!(" day: {:?}", &caps["day"]);
}
Without capture groups, regexes become a lot less convenient.
The problem with capture groups is that they aren’t something that cleanly fit
into the theoretical model of regular languages and finite automata. (They
require something more expressive known as tagged finite automata.) As a
result, capture groups are bolted on to the classical NFA simulation, and
the result is named PikeVM
. Capture groups are also part of the classic
“NFA algorithm” or backtracking, as the matching offsets of each group can be
recorded as the backtracking search progresses through the haystack.
But that’s generally where capture groups stop being supported. DFAs simply do not support them, and there is no obvious way to make them support capture groups in general without evolving to something like tagged finite automata.
However, there is one case where we can bolt capture groups onto something that executes like a DFA: a one-pass NFA. One can think of a one-pass NFA as an NFA that can be converted into a DFA where each DFA state maps to at most one NFA state. That is, when performing a search using an NFA simulation, at each character in the haystack there is at most one NFA state to transition to.
The intuition behind this special case is that only one copy of the matching
capture groups needs to be kept. (In the PikeVM
, there are up to len(regex)
copies of capture groups kept, as there is no way to know which capture groups
will wind up being part of the final match.) If one can detect this case, then
a new DFA can be constructed from the NFA in linear time, and this DFA can
execute a search such that a constant number of CPU instructions are used to
process each character in the haystack.
The end result of this is a one-pass DFA and it generally runs quite a bit
faster than either the PikeVM
or the bounded backtracker. In other words, it
represents the fastest way to report the offsets of matching capture groups in
the regex
crate.
The problem with a one-pass DFA is that, as a DFA, it uses a lot more memory.
(The memory problem is mitigated by giving one-pass DFA construction a
configurable fixed budget of memory, and if it’s exceeded, one-pass DFA
construction fails.) Additionally, many regexes are not one-pass. For example,
all unanchored searches in the regex
crate are done by adding an implicit
(?s-u:.)*?
prefix to the beginning of the regex. That prefix is itself not
one-pass when followed by any non-empty regex. Therefore, a one-pass DFA only
supports anchored searches.
The “only anchored” search limitation might make it seem like the one-pass DFA has very limited utility, but as we’ll see in more detail, the meta regex engine uses anchored searches quite a bit even if the original regex itself isn’t anchored. This can occur when the caller asked for the offsets of matching capture groups. The meta regex engine starts by looking for an overall match using a DFA engine, and then once a match is found, an anchored search is used on only the matched part of the haystack to report offsets for each matching capture group. In this way, the utility of the one-pass DFA is quite high.
Using the one-pass DFA directly is possible, and it looks similar to past examples but with some small deviations because of the one-pass DFA’s more limited API:
use regex_automata::{dfa::onepass::DFA, Match};
fn main() {
let re = DFA::new(r"\b\w+\b").unwrap();
let mut cache = re.create_cache();
// A one-pass DFA doesn't expose any iterator APIs directly because it only
// supports anchored matches. Thus, any iterator that would be returned
// would only support adjacent matches. Such a thing is a valid use case,
// but it wouldn't match the semantics of every other iterator in
// regex-automata.
let Some(m) = re.find(&mut cache, "Σέρλοκ Χολμς") else { return };
assert_eq!(Match::must(0, 0..12), m);
}
And the equivalent regex-cli
command:
$ regex-cli find match onepass --no-table --anchored \
-p '\b\w+\b' \
-y 'Σέρλοκ Χολμς'
0:0:12:Σέρλοκ
Notice how we pass the --anchored
flag to regex-cli
. Without it, the
one-pass DFA search would return an error.
We can execute multiple searches as well. Even though the regex itself isn’t
anchored, we don’t have to limit ourselves to searches beginning at offset 0
:
use regex_automata::{dfa::onepass::DFA, Input, Match};
fn main() {
let hay = "Σέρλοκ Χολμς";
let re = DFA::new(r"\b\w+\b").unwrap();
let mut cache = re.create_cache();
let Some(m) = re.find(&mut cache, hay) else { return };
assert_eq!(Match::must(0, 0..12), m);
let input = Input::new(hay).range(13..);
let Some(m) = re.find(&mut cache, input) else { return };
assert_eq!(Match::must(0, 13..23), m);
}
The Input
abstraction can be used in the same way with any other regex engine
as well.
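Since the whole point of the one-pass DFA is reporting capture group offsets at DFA speeds, here is a sketch of doing exactly that (the precise captures API is worth double checking against the onepass module docs):
use regex_automata::{dfa::onepass::DFA, Span};
fn main() {
    let re = DFA::new(
        r"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})",
    )
    .unwrap();
    let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
    // Remember: one-pass DFA searches are always anchored.
    re.captures(&mut cache, "2023-07-02", &mut caps);
    assert!(caps.is_match());
    assert_eq!(Some(Span::from(0..4)), caps.get_group_by_name("year"));
    assert_eq!(Some(Span::from(5..7)), caps.get_group_by_name("month"));
    assert_eq!(Some(Span::from(8..10)), caps.get_group_by_name("day"));
}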
It can be quite tricky to reason about whether a particular regex is one-pass or not. For instance, a regex can be one-pass when Unicode mode is disabled but not one-pass when it is enabled:
$ regex-cli find match onepass --no-table --anchored \
-p '\w+\s+\w+' \
-y 'Σέρλοκ Χολμς'
failed to compile onepass DFA: one-pass DFA could not be built because pattern is not one-pass: conflicting transition
But if we disable Unicode mode, then the regex becomes one-pass:
$ regex-cli find match onepass --no-table --anchored \
-p '(?-u)\w+\s+\w+' \
-y 'Sherlock Holmes'
0:0:15:Sherlock\x20Holmes
This happens because, when Unicode mode is enabled, \w
and \s
partially
overlap. They of course do not overlap logically, as taking the codepoint
sets from each of \w
and \s
and intersecting them would produce the empty
set:
$ regex-cli debug hir --no-table '[\w&&\s]'
Class(
{},
)
The overlap is actually at the byte level, and the byte level is how the
transitions in a one-pass DFA are defined. This overlap means that the DFA for
\w+\s+\w+
when Unicode mode is enabled doesn’t satisfy the property that
every state in the DFA maps to at most one NFA state. That is, there exist
codepoints in both \w
and \s
which start with the same leading UTF-8 code
units.
But when Unicode mode is disabled, not only do the codepoint sets not have any
overlap, but they don’t have any overlap at the byte level either. Why? Because
the codepoints in \w
and \s
when Unicode is disabled are limited to ASCII
codepoints, and each ASCII codepoint is always encoded as a single UTF-8 code
unit corresponding to the ASCII codepoint number.
One should prefer a one-pass DFA over both the PikeVM
and the bounded
backtracker because it is faster, although it can take longer to build and may
use more memory. However, because it can only be built from a very limited set
of regexes, one must be ready to deal with construction failing and falling
back to a different engine.
Engine: DFA
The DFA regex engine is made up of two dense DFAs. One DFA is
responsible for a forward scan that finds the end of a match and the other DFA
is used to perform an anchored scan backwards from the end of a match to find
the start of a match. (This second DFA is built by reversing the concatenations
in an Hir
, building an NFA from that and then determinizing that reverse
NFA into a DFA.) We call these DFAs “dense” to distinguish them from sparse
DFAs. A dense DFA uses a representation that optimizes for
search speed at the expense of more memory usage, while a sparse DFA uses a
representation that optimizes for less memory usage at the expense of search
speed.
Fully compiled DFAs are usually not found in general purpose regex engines
because building them has worst case O(2^m)
time and space (where m
is
proportional to len(regex)
). For example, [01]*1[01]{N}
compiles to an NFA
with approximately N
states, and as N
grows, the NFA grows linearly. But
the corresponding DFA has approximately 2^N
states, and as the DFA grows, the
number of states grows exponentially.
But the problem with a DFA is not just limited to its theoretical worst case
behavior. DFAs, especially dense DFAs, tend to use a lot of memory because each
state supports computing the next transition for any byte in constant time.
This fundamentally requires more memory to provide constant time random access.
When you combine this with large Unicode character classes, the result can be
disastrous. For example, let’s compare some sizes for the regex \w
. First
up is the NFA (which, remember, can be used directly as a regex engine in the
case of the PikeVM
and bounded backtracker):
$ regex-cli debug thompson '\w' -q
[.. snip ..]
memory: 17668
[.. snip ..]
So here, the NFA uses 17KB. That’s not exactly small, but watch what happens when we determinize the NFA into a DFA:
$ regex-cli debug dense dfa '\w' --start-kind unanchored -q
[.. snip ..]
memory: 159296
[.. snip ..]
Memory balloons to about 160KB! (I’ve limited the DFA to just an unanchored
DFA. If one used --start-kind both
instead, the default, then memory usage
would double.) And spending extra time to minimize the DFA doesn’t help:
$ regex-cli debug dense dfa '\w' --start-kind unanchored --minimize -q
[.. snip ..]
memory: 159296
[.. snip ..]
Sometimes minimization helps, but in this case, since we used Daciuk’s
algorithm to build a minimal UTF-8 automaton into the NFA for \w
, it follows
that determinizing that NFA into a DFA itself already results in a minimal DFA.
The real problem here is a result of our dense representation and the fact that
our alphabet is defined over bytes. We can make it a little better by switching
to a sparse representation:
$ regex-cli debug sparse dfa '\w' --start-kind unanchored -q
[.. snip ..]
memory: 102257
[.. snip ..]
But we’re still at over 100KB. Unicode character classes and fully compiled
DFAs just don’t mix well. And in practice, it’s often the case that one doesn’t
need the full class compiled since it’s somewhat rare to search haystacks with
a lot of difference scripts. More to the point, most searches are probably fine
with just the ASCII definition of \w
, which is much smaller:
$ regex-cli debug thompson '(?-u)\w' -q
[.. snip ..]
memory: 732
[.. snip ..]
$ regex-cli debug dense dfa '(?-u)\w' --start-kind unanchored -q
[.. snip ..]
memory: 384
[.. snip ..]
In this case, the dense DFA is actually smaller than the corresponding NFA.
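The same comparison can be done programmatically. Here is a small sketch that prints the heap memory used by the NFA and the default dense DFA for both variants (exact numbers will differ from the CLI output above, since the default DFA build includes both anchored and unanchored start states):
use regex_automata::{dfa::dense, nfa::thompson::NFA};
fn main() {
    for pattern in [r"\w", r"(?-u)\w"] {
        let nfa = NFA::new(pattern).unwrap();
        let dfa = dense::DFA::new(pattern).unwrap();
        println!(
            "{}: NFA is {} bytes, dense DFA is {} bytes",
            pattern,
            nfa.memory_usage(),
            dfa.memory_usage(),
        );
    }
}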
So with all of that said, why does a general purpose regex engine like
the regex
crate have a DFA engine with such huge downsides? Doesn’t the
exorbitant memory usage make it a non-starter? There are two angles to this.
First is that the DFA engine is actually disabled by default. One must opt
into it by enabling the perf-dfa-full
feature. I did this because fully
compiled DFAs don’t carry a ton of weight in the regex
crate, since the lazy
DFA (discussed in the next section) is a better choice in the vast majority of
cases. However, fully compiled DFAs do provide some optimization opportunities
that are difficult for a lazy DFA to take advantage of. For example, in
the regex (?m)^.*$
, a fully compiled DFA will notice that .
doesn’t
match a \n
. It knows this by looking for states where most transitions are
self-transitions (transitions that loop back to the same state). It follows
that there are only a limited number of ways to leave that state. The DFA finds
these states and “accelerates” them by running memchr
to find the bytes
in the haystack corresponding to non-self-transitions. You can see this in
practice with regex-cli
with a little ad hoc benchmarking. First we’ll start
with DFA state acceleration enabled (it’s enabled by default):
$ regex-cli find match dense -bB -q --repeat 1000000 \
-p '(?m-u)^.*$' \
-y 'this is a long line about the quick brown fox that jumped over the lazy dog'
[.. snip ..]
search time: 56.600473ms
[.. snip ..]
And now with DFA state acceleration disabled:
$ regex-cli find match dense -bB -q --repeat 1000000 --no-accelerate \
-p '(?m-u)^.*$' \
-y 'this is a long line about the quick brown fox that jumped over the lazy dog'
[.. snip ..]
search time: 199.044059ms
[.. snip ..]
The search time with acceleration enabled is quite a bit faster. Notice
also that we’ve disabled Unicode. When Unicode is enabled, .
matches the
UTF-8 encoding of any Unicode scalar value. This in turn makes the DFA more
complicated and inhibits the acceleration optimization in this case:
$ regex-cli find match dense -q --repeat 1000000 \
-p '(?m)^.*$' \
-y 'this is a long line about the quick brown fox that jumped over the lazy dog'
[.. snip ..]
search time: 204.897593ms
[.. snip ..]
While this form of DFA state acceleration is quite useful, it is somewhat
limited in the regexes it can be applied to in part because of Unicode. It is
also limited because the meta regex engine only chooses to use the DFA engine
when the regex is very small. Otherwise we open ourselves up to exorbitant
memory usage and exponential blowups while building a regex. The regex
crate isn’t
the fastest at compiling a regex, but taking exponential time is a big no-no.
Because of its somewhat limited utility and since the DFA engine adds a lot of code that in turn increases compile times and binary size substantially, full DFAs are disabled by default.
The second angle as to why full DFAs exist at all is because their search
runtime is extremely basic. They are the only regex engine in regex-automata
to not require any mutable Cache
space while executing a search. Indeed,
let’s take a look at an example:
use regex_automata::{
dfa::{dense, regex::Regex},
Match,
};
fn main() {
let re = Regex::builder()
// We need to enable heuristic Unicode word boundary support,
// or else the regex below will fail to compile. Why? Because
// \b is Unicode-aware by default, and the DFA engines don't
// support it on haystacks with non-ASCII Unicode codepoints.
// Enabling this comes with the downside of making it possible
// for a search to return an error. Namely, when the DFA sees
// a non-ASCII byte, it transitions to a special sentinel quit
// state, which in turn causes the search to stop and return an
// error.
.dense(dense::Config::new().unicode_word_boundary(true))
.build(r"\b\w+\b")
.unwrap();
    // Note that `find_iter` will panic if the underlying search returns
// an error! You can handle the error by using fallible APIs such as
// Regex::try_search.
let mut it = re.find_iter("Sherlock Holmes");
assert_eq!(Some(Match::must(0, 0..8)), it.next());
assert_eq!(Some(Match::must(0, 9..15)), it.next());
assert_eq!(None, it.next());
}
And the equivalent regex-cli
command:
$ regex-cli find match dense --no-table --unicode-word-boundary \
-p '\b\w+\b' \
-y 'Sherlock Holmes'
0:0:8:Sherlock
0:9:15:Holmes
Notice that no Cache
is required. Indeed, because of how simple the search
runtime is for a DFA, and because the DFA internals were designed with this use
case in mind, a DFA can be serialized to raw bytes. That
same DFA can be deserialized and used to execute a search in a free-standing
environment. That is, you don’t need Rust’s std
or alloc
libraries. Only
core
is needed.
(The DFA serialization use case was what motivated the initial regex-automata 0.1
release. It’s currently used in the bstr
crate for implementing some
of the Unicode segmentation algorithms.)
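Here is a sketch of that serialization round trip (error handling elided; in a real no_std setup the bytes would typically be generated at build time and embedded into the program):
use regex_automata::{
    dfa::{dense::DFA, Automaton},
    Input,
};
fn main() {
    // Build a DFA and serialize it to a byte buffer. The second element
    // of the tuple is the number of padding bytes prepended so that the
    // serialized data is correctly aligned.
    let original = DFA::new(r"(?-u)\w+").unwrap();
    let (bytes, pad) = original.to_bytes_little_endian();

    // Deserialization is a cheap, zero-copy operation: the new DFA
    // borrows directly from the byte buffer.
    let dfa: DFA<&[u8]> = DFA::from_bytes(&bytes[pad..]).unwrap().0;

    // Searching via the Automaton trait reports the end of the match.
    let m = dfa.try_search_fwd(&Input::new("foo bar")).unwrap().unwrap();
    assert_eq!(3, m.offset());
}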
Fully compiled DFAs are most useful when your regexes don’t make use of Unicode features, or if you need access to lower level APIs. For example, this shows how one can compute the transitions manually on a DFA:
use regex_automata::{
dfa::{dense::DFA, Automaton},
Anchored,
};
fn main() {
let re = DFA::new(r"\W").unwrap();
// DFAs can have multiple start states, but only when there are look-around
// assertions. When there aren't any look-around assertions, as in this
// case, we can ask for a start state without providing any of the
// haystack.
let mut sid = re.universal_start_state(Anchored::No).unwrap();
// The UTF-8 encoding of 💩 is \xF0\x9F\x92\xA9.
sid = re.next_state(sid, 0xF0);
sid = re.next_state(sid, 0x9F);
sid = re.next_state(sid, 0x92);
sid = re.next_state(sid, 0xA9);
sid = re.next_eoi_state(sid);
assert!(re.is_match_state(sid));
}
In this example, we walked the DFA manually and fed the DFA one byte at a time.
This example is a little contrived, but it demonstrates some of the APIs that
provide low level control. There are many more examples documented on the
Automaton
trait, which defines all of the lower level DFA routines that
dense and sparse DFAs implement.
Engine: hybrid NFA/DFA
The hybrid NFA/DFA or “lazy DFA” regex engine is exactly like the DFA engine, except its transition table is built at search time. In other words, while a full DFA is “ahead-of-time compiled,” the lazy DFA is “just-in-time compiled.”
(Calling it a JIT would be somewhat misleading. In this domain, a JIT usually refers to compiling a regex into machine code at runtime, such as the JIT in PCRE2. That is not what’s happening here.)
The lazy DFA generally has the same API as the fully compiled DFA, except that,
like the other regex engines, you need to pass a mutable Cache
argument. The
Cache
is where the transition table (among other things) is stored:
use regex_automata::{
hybrid::{dfa, regex::Regex},
Match,
};
fn main() {
let re = Regex::builder()
// As with the fully compiled DFA, we need to enable heuristic
// Unicode word boundary support for the lazy DFA as well. It
// will return an error if a non-ASCII codepoint is seen when
// the regex contains a Unicode word boundary, just like the
// full DFA.
.dfa(dfa::Config::new().unicode_word_boundary(true))
.build(r"\b\w+\b")
.unwrap();
let mut cache = re.create_cache();
    // Note that find_iter will panic if the underlying search returns
// an error! You can handle the error by using fallible APIs such
// as Regex::try_search.
let mut it = re.find_iter(&mut cache, "Sherlock Holmes");
assert_eq!(Some(Match::must(0, 0..8)), it.next());
assert_eq!(Some(Match::must(0, 9..15)), it.next());
assert_eq!(None, it.next());
}
And the equivalent regex-cli
command:
$ regex-cli find match hybrid --no-table --unicode-word-boundary \
-p '\b\w+\b' \
-y 'Sherlock Holmes'
0:0:8:Sherlock
0:9:15:Holmes
This example is nearly identical to the full DFA, but with a Cache
parameter.
The similarities extend to lower level APIs as well:
use regex_automata::{hybrid::dfa::DFA, Input};
fn main() {
let re = DFA::new(r"\W").unwrap();
let mut cache = re.create_cache();
// DFAs can have multiple start states, but only when there are
// look-around assertions. When there aren't any look-around
// assertions, as in this case, we can ask for a start state
// without providing any of the haystack. Full DFAs have a
// dedicated routine for this because the universality can be
// checked for us. But lazy DFAs don't compute all of their start
// states up front. So we just kind of fake it and ask for a start
// state given some dummy haystack.
let mut sid = re.start_state_forward(&mut cache, &Input::new("")).unwrap();
// The UTF-8 encoding of 💩 is \xF0\x9F\x92\xA9.
sid = re.next_state(&mut cache, sid, 0xF0).unwrap();
sid = re.next_state(&mut cache, sid, 0x9F).unwrap();
sid = re.next_state(&mut cache, sid, 0x92).unwrap();
sid = re.next_state(&mut cache, sid, 0xA9).unwrap();
sid = re.next_eoi_state(&mut cache, sid).unwrap();
assert!(sid.is_match());
}
Other than passing a Cache
explicitly, these APIs are almost the same. The
main difference is that each of the operations might fail. Namely, depending on
the method, they can return either a MatchError
or a CacheError
. A
MatchError
occurs when the start state can’t be computed because it enters
a quit state. (Which means the search must terminate with an error. This, for
example, occurs when heuristic Unicode word boundary support is enabled and a
non-ASCII byte is seen when computing the start state.) A CacheError
occurs
when the Cache
provided exhausts its capacity.
At this point, it’s worth talking a little more about the Cache
because it is
both the lazy DFA’s greatest strength and its greatest weakness. The lazy DFA,
as mentioned, works by building its transition table during a search. More
specifically, the following happens:
- A maximum cache capacity is configured at construction time. The capacity is not fully allocated up front. It’s just a number establishing an upper bound on how much heap memory can be used.
- When the caller asks to compute the transition for the current state and character from the haystack, the lazy DFA consults its Cache. If the transition has already been computed and stored in the Cache, then it is returned as-is immediately. Otherwise, the transition (and only that transition) is computed by doing NFA powerset construction. This process takes worst case O(m) time.
- If the cache fills up, it is cleared, and thus transitions previously computed will need to be re-computed.
- Optional configuration can be set to cause the lazy DFA to return an error if the Cache is being used inefficiently. Efficiency is measured in terms of how many bytes are searched per DFA state computed. If few bytes are searched compared to the number of DFA states in the Cache, then it’s likely that even the PikeVM would execute the search more quickly. (Other heuristics are used here as well, such as the total number of times the Cache has been cleared.)
In this way, at most one DFA state and transition is created for each byte of
haystack searched. Thus, the worst case search time for the lazy DFA is
O(m * n)
and its worst case space usage is the fixed capacity set at
construction time. Since building a lazy DFA itself does not require the
construction of any DFA states (except for a few basic sentinel states), it
follows that the lazy DFA mitigates the theoretical worst case time and space
complexities for full DFAs. That is, we avoid the exponential construction
time. In the common case, most states/transitions are cached, and so the lazy
DFA runs in average case O(n)
time. In practice, for most regexes, the lazy
DFA and the fully compiled DFA have the same search performance.
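For workloads where cache clearings become a problem, one mitigation is simply giving the lazy DFA a bigger cache. Here is a sketch of doing that via the hybrid regex builder (the right capacity is workload dependent; bigger caches mean fewer clearings at the cost of more heap per cache):
use regex_automata::{
    hybrid::{dfa, regex::Regex},
    Match,
};
fn main() {
    let re = Regex::builder()
        // Let each lazy DFA cache grow to roughly 10MB before it gets
        // cleared and states have to be re-computed.
        .dfa(dfa::Config::new().cache_capacity(10 * (1 << 20)))
        .build(r"\w{5}")
        .unwrap();
    let mut cache = re.create_cache();
    assert_eq!(Some(Match::must(0, 0..10)), re.find(&mut cache, "Δεδομένα"));
}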
The lazy DFA also mitigates the exorbitant space usage for large Unicode
character classes. Since a lazy DFA only computes what it needs based on the
actual bytes searched, searching for a Unicode-aware \w
in an ASCII-only
haystack only requires building the ASCII portion of \w
into a DFA. This ends
up working amazingly well in practice.
The lazy DFA tends to do poorly for regexes that fill up its cache and cause
it to be cleared repeatedly. This might just be a result of the regex being
large or even a result of the haystack forcing a large portion of the DFA to
be constructed. (A regex can easily become large through counted repetitions
or even by adding a lot of patterns. A single lazy DFA gets built for a
multi-pattern regex.) In these cases, heuristics usually detect it and force
the lazy DFA to return an error. At this point, in the context of the meta
regex engine, the search will be retried with a different engine (usually the
PikeVM
).
The meta regex engine
The meta regex engine brings all of the aforementioned regex engines together
and exposes one single infallible API that tries to do the best possible thing
in any given situation. The API it exposes also absolves the caller of needing
to explicitly create and pass Cache
values to each search call. Instead, the
meta regex engine handles this for you by keeping an internal thread safe pool
of Cache
values. (The meta regex engine does expose lower level APIs that
permit passing a Cache
explicitly. These are useful if one wants to avoid
the synchronization costs of the thread safe pool used internally.)
The end result here is that the meta regex engine very closely corresponds
to the top-level API in the regex
crate. Indeed, regex::Regex
,
regex::RegexSet
, regex::bytes::Regex
and regex::bytes::RegexSet
are all
very thin wrappers around the meta regex engine. This is by design, because it
makes it easier to drop down from the high level convenience API that serves
99% of use cases to the lower level API that exposes a lot more knobs.
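Using it looks almost identical to using the regex crate itself, with some extras such as first class multi-pattern support. For example:
use regex_automata::{meta::Regex, Match};
fn main() {
    // One meta regex searching for two patterns at once. All of the
    // engine selection and fallback logic described below happens behind
    // this single API.
    let re = Regex::new_many(&[r"\bfox\b", r"\bdog\b"]).unwrap();
    let hay = "the quick brown fox jumped over the lazy dog";
    let got: Vec<Match> = re.find_iter(hay).collect();
    assert_eq!(Match::must(0, 16..19), got[0]);
    assert_eq!(Match::must(1, 41..44), got[1]);
    assert_eq!(2, got.len());
}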
Internally, the meta regex engine implements roughly the following logic:
- If a regex engine isn’t needed at all and the search can be performed using single or multi-substring search algorithms directly, then the construction of a regex (including an NFA) is avoided entirely.
- If possible, extract a small literal sequence from the prefix of the regex that can be used as a Prefilter.
- If possible, choose a “reverse” optimization:
  - If a regex is anchored at the end via $, then a search can proceed by doing a reverse scan from the end of the haystack. This is called the “reverse anchored” optimization.
  - If no suitable Prefilter has been found and a literal sequence can be extracted from the suffix of the regex, then we can scan for occurrences of that literal sequence and match the regex in reverse from each candidate position. This is called the “reverse suffix” optimization.
  - If no suitable prefix or suffix literal sequence could be found but a literal sequence could be extracted from an inner part of the regex that cleanly partitions the regex, then we can scan for occurrences of that inner literal sequence. We split the regex in half at the point where the inner literal sequence was extracted. The first half gets compiled into a reverse regex and the second half gets compiled into a forward regex. When a candidate is found, we can look for the start of the match by scanning backwards with the first half, and then look for the end of the match by scanning forwards with the second half. This is called the “reverse inner” optimization.
- Otherwise, fall back to the “core” search strategy. The core strategy makes use of all available regex engines: the PikeVM, the bounded backtracker, the one-pass DFA, the lazy DFA and the full DFA. Only the PikeVM is required. The way these engines compose together is roughly as follows:
  - Whenever possible, use the lazy DFA (or full DFA if available) to find the overall bounds of the match. If the DFA fails, then we fall back to either the PikeVM, the bounded backtracker or the one-pass DFA. The DFA can fail either because the regex contained a Unicode word boundary and the haystack contained a non-ASCII codepoint, or because the lazy DFA was used and its Cache usage was inefficient according to some heuristic.
  - When capture groups are requested, then after the full match is found, either the PikeVM, the bounded backtracker or the one-pass DFA is used to report the offsets of each matching capture group. The order of preference is: one-pass DFA, bounded backtracker and then the PikeVM.
If one were to summarize the overall strategy, it can probably be distilled down to these two points:
- Search for literals whenever possible.
- Avoid the
PikeVM
as much as possible.
That’s pretty much it. In general, the more time we can spend in substring
search algorithms, the better. And the less time we can spend in specifically
the PikeVM
, the better.
Many regex engines do some kind of literal optimization, and indeed, most
popular production grade regex engines spend a fair bit of effort in doing
so. Hyperscan is the gold standard here, but as far as I’m aware, most
other general purpose regex engines don’t go to the lengths described above.
(One could argue about whether Hyperscan is a “general purpose” regex engine
or not. One of my own personal arguments against considering it one is its
match semantics. For example, \w+
matches abc
3 times because Hyperscan
reports matches as they’re seen. An undoubtedly correct choice given its
target domain.) The reverse suffix and reverse inner optimizations are
particularly tricky. They look easy, but there’s a subtle performance problem
with them: they open you up to worst case quadratic behavior (in the size of
the haystack).
Consider the regex [A-Z].*bcdefghijklmnopq
on a haystack of
bcdefghijklmnopq
repeated over and over. There is no “good” prefix literal
sequence that can be extracted from this, so according to the logic above,
the meta regex engine will try the “reverse suffix” optimization by using
bcdefghijklmnopq
as the suffix. This particular benchmark was devised to have
a worst case false positive rate: candidates are reported frequently and none
of them lead to a match. But that’s just a bad heuristic. It doesn’t in and of
itself lead to violating our time complexity guarantee (which is O(m * n)
).
The problem here is that each time the suffix matches, a reverse scan includes
the .*
, and that in turn scans all the way back to the beginning of the
haystack. So each candidate reported by the suffix match results in a complete
re-scan of the haystack back to the beginning. This changes our search routine
to have worst case O(m * n^2)
time complexity. That’s bad.
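To make the failure mode concrete, here is a small program of my own (not one of rebar’s benchmarks) that builds the adversarial input described above:

```rust
use regex::Regex;

fn main() {
    // There is no uppercase letter anywhere in the haystack, so the regex
    // can never match. But the suffix literal "bcdefghijklmnopq" occurs at
    // every 16 byte boundary, so a naive reverse suffix optimization would
    // do a reverse scan (through the `.*`) for every one of those
    // candidates, re-reading the haystack from the beginning each time.
    let re = Regex::new(r"[A-Z].*bcdefghijklmnopq").unwrap();
    let haystack = "bcdefghijklmnopq".repeat(50_000);
    assert!(re.find(&haystack).is_none());
}
```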
While it is possible to do syntactic analysis on a regex to determine whether
this quadratic behavior is possible, it doesn’t predict it perfectly. For
example, one can clearly see that the suffix bcdefghijklmnopq
overlaps with
the expression immediately before it, .*
. That in turn means some kind of
quadratic behavior might be possible. But that doesn’t mean it is inevitable.
Instead, the meta regex engine mitigates this by defining its own bespoke regex search routines based on the DFA engines. Namely, it defines its own search routine that will stop its reverse scan if it gets to a certain offset in the haystack. That offset corresponds to the end of the last suffix match. So if the search would otherwise exceed that offset, then we’re exposed to quadratic behavior. The meta regex engine detects this error case and falls back to the “core” strategy described above, thus abandoning the optimization when it would otherwise sacrifice our time complexity guarantees.
Roughly the same logic applies to the “reverse inner” optimization as well.
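Here is a rough sketch of that mitigation. The find_suffix and reverse_scan closures are hypothetical stand-ins for the suffix literal searcher and the reverse regex scan; the real code inside regex-automata’s meta engine is considerably more involved.

```rust
// A sketch of the quadratic-behavior mitigation in the reverse suffix
// strategy. `find_suffix` returns the end offset of the next suffix
// literal occurrence at or after `at`. `reverse_scan` runs the regex in
// reverse from that offset, refusing to look at anything before `floor`.
enum Reverse {
    /// The reverse scan found the start of a match.
    Match { start: usize },
    /// The reverse scan proved there is no match ending at this candidate.
    NoMatch,
    /// The reverse scan wanted to cross `floor`, which is exactly the
    /// situation that could lead to quadratic behavior.
    HitFloor,
}

/// Returned when the optimization must be abandoned in favor of the
/// "core" strategy.
struct GiveUp;

fn reverse_suffix_search(
    haystack: &[u8],
    find_suffix: impl Fn(&[u8], usize) -> Option<usize>,
    reverse_scan: impl Fn(&[u8], usize, usize) -> Reverse,
) -> Result<Option<(usize, usize)>, GiveUp> {
    // The reverse scan may never look at bytes before this offset. It
    // starts at 0 and is bumped to the end of each failed candidate, so no
    // byte is re-scanned in reverse more than once.
    let mut floor = 0;
    let mut at = 0;
    while let Some(suffix_end) = find_suffix(haystack, at) {
        match reverse_scan(haystack, floor, suffix_end) {
            Reverse::Match { start } => return Ok(Some((start, suffix_end))),
            Reverse::NoMatch => {
                floor = suffix_end;
                at = suffix_end;
            }
            // A match might start before `floor`, but confirming that
            // would mean re-scanning old bytes. Give up and fall back to
            // the core strategy, preserving the O(m * n) guarantee.
            Reverse::HitFloor => return Err(GiveUp),
        }
    }
    Ok(None)
}
```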
In summary, if you don’t need low level access to individual regex engines, but you do want control over the many knobs exposed by the regex engine or want to do multi-pattern matching, then the meta regex engine is a good choice. Namely, each of the regex engines described before this one has its own little caveats that make it less than ideal for general use in every case.
Differences with RE2
If you’ve read Russ Cox’s article series on regular expressions, then some portion of the previous section is likely to sound familiar. Namely, RE2 has a PikeVM (called just an “NFA”), a bounded backtracker (called a “bitstate backtracker” in RE2), a one-pass DFA (called a “one-pass NFA” in RE2) and a lazy DFA. It also has a meta regex engine (although that term isn’t used) that composes its internal regex engines in a similar fashion as described above. The only engine described above that RE2 doesn’t have is a fully compiled DFA.
So what, if any, are the differences between RE2 and the regex
crate?
The similarities between them are much greater than the differences, but here’s a high level list of differences:
- RE2 supports leftmost-longest semantics as an option in addition to leftmost-first. Leftmost-longest semantics are what POSIX regex engines use.
- RE2 has less support for Unicode. A full accounting of this is tricky
because RE2 does permit linking with ICU to add support for more Unicode
properties. However, RE2 does not have an option to make \w, \s, \d and \b use Unicode definitions. RE2 also does not support character class set operations beyond union. For example, it’s harder to write something like [\pL&&\p{Greek}] in RE2, which corresponds to the subset of codepoints considered letters that are also in the Greek script. (Character class set operations other than union aren’t strictly Unicode specific features, but they are most useful in the context of large Unicode character classes.)
- RE2 has a likely more memory efficient version of the PikeVM.
- RE2 has some limited support for literal optimizations, but overall does
a lot less here than what the
regex crate does. RE2 does have a Filtered RE2 which permits the caller to do a limited form of their own literal optimizations.
- RE2 uses the same transition cache across multiple threads for the lazy DFA
engine, which requires synchronization. Conversely, the regex crate requires a distinct cache for each thread, which requires more memory.
- The regex crate now exposes both regex-syntax and regex-automata as separately versioned libraries that provide access to its internals. RE2 does not support this.
- The regex-automata library has first class support for multi-pattern regexes in all engines. RE2 does have a “regex set,” but it only reports which patterns match in a haystack. regex-automata, on the other hand, can also report match and capture group offsets for each matching pattern (see the example below).
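To illustrate that last point, here is a small example of my own using the meta regex engine from regex-automata with multiple patterns. The patterns, haystack and offsets are just for show.

```rust
use regex_automata::meta::Regex;

fn main() {
    // Build one regex from several patterns. Each match reports which
    // pattern matched *and* where, unlike RE2's regex sets, which only
    // report the set of matching patterns.
    let re = Regex::new_many(&[
        r"[a-z]+@[a-z]+\.com",
        r"[0-9]{4}-[0-9]{2}-[0-9]{2}",
    ])
    .unwrap();
    let hay = "contact foo@example.com before 2023-07-05";
    let got: Vec<(usize, usize, usize)> = re
        .find_iter(hay)
        .map(|m| (m.pattern().as_usize(), m.start(), m.end()))
        .collect();
    assert_eq!(got, vec![(0, 8, 23), (1, 31, 41)]);
}
```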
In the future, I hope to add more engines to regex-automata
. Specifically,
I’d like to explore Glushkov NFAs and a bit parallel regex
engine. I’d also like to explore shift DFAs.
Testing strategy
As described near the opening of this blog, testing had become a problem for
the regex
crate. The macro hacks used to test each internal engine were
growing quite strained and they were generally difficult to work with and, more
importantly, understand.
My plan for revamping how the regex crate was tested was tied to the idea that each internal engine would have its own first class API that could be tested independently of the “main” regex engine. I also wanted to make the tests consumable from any context instead of burying them in macros or Rust source code. To that end, this is the strategy I settled upon:
- All regex tests are specified in TOML files.
- I published a crate, regex-test, that knows how to read these TOML files into a structured representation.
- I defined a single Rust unit test for each configuration of each regex engine that I wanted to test. Inside this single unit test, all of the tests from the TOML files that are applicable to the engine being tested are executed.
A little extra infrastructure was required in order to make this nice to use because Rust’s unit testing framework is not currently extensible. That means I had to define my own environment variables for filtering on which tests to run. (For example, if you’re working on a single bug that causes many tests to fail, it’s often useful to just have one of those tests run.)
There are oodles of other kinds of tests as well. For example, there are
over 450 documentation tests in regex-automata
alone.
Finally, in the run up to the regex 1.9 release, I added a lot of additional fuzz
testing targets. I had a ton of help from Addison Crump and there were
at least a few bugs I wouldn’t have found if it weren’t for him.
Benchmarking
At this point, this blog is already my longest one ever, and I haven’t even begun to discuss benchmarking. While I originally wanted to spend more time on this topic in this blog—particularly given all of the talk about optimizations—it just wasn’t practical to do so.
Instead, I’ve published a regex barometer called rebar. It isn’t just limited
to benchmarking the regex
crate. It also benchmarks many other regex engines
as well. I believe it is the most comprehensive regex benchmark published to
date.
Across 242 benchmarks, regex 1.9 is on average roughly 2.4 times faster (going by the ratio of the geometric means below) than regex 1.7.3
for search times. (I compared with 1.7
instead of 1.8
because 1.8
reflects a transition release that has some of the work described in this blog
post included. The 1.9
release just completes the transition.)
$ rebar rank record/all/2023-07-02/*.csv \
--intersection \
-M compile \
-e '^(rust/regex|rust/regexold)$'
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/regex 1.8.4 1.08 242
rust/regexold 1.7.3 2.61 242
But the time it takes to build a regex has regressed somewhat:
$ rebar rank record/all/2023-07-02/*.csv \
--intersection \
-m compile \
-e '^(rust/regex|rust/regexold)$'
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/regexold 1.7.3 1.07 28
rust/regex 1.8.4 1.46 28
The geometric mean reported above is a very crude aggregate statistic. I’m not
sure it really captures the extent of the improvements here. If you want to look
at individual benchmark differences, you can replace rebar rank
with rebar cmp
in the above command. (And run it from the root of a checkout of the rebar
repository.)
$ rebar cmp record/all/2023-07-02/*.csv \
--threshold-min 2 \
--intersection \
-M compile \
-e '^(rust/regex|rust/regexold)$'
I’ve added --threshold-min 2
to the above command to limit the comparisons
to ones where there is at least a 2x difference.
Costs
No good deed goes unpunished. What has this rewrite cost me?
First and foremost, it has used up the vast majority of my free time for the past several years. Compounding the problem, I have a lot less of that free time than I used to. So projects like ripgrep haven’t seen a release for quite some time.
Secondly, this has introduced a fair bit more code. Building reusable abstractions
for others to use is a different beast than internal abstractions that only
regex
crate hackers need to worry about. It usually results in more code,
which means bigger binary sizes and higher compile times.
Thirdly, those abstractions are now published and separately versioned. That
means I can’t just break the APIs of those internal engines without publishing
an appropriate breaking change release of regex-automata
. I won’t be nearly
as conservative in doing so as I am with the regex
crate, but it isn’t free
to do. This will also impact contributors. Instead of just being able to
refactor code as necessary, one must now contend with the pressures of public
API design.
Because the regex
crate already had a reputation for less-than-ideal binary
sizes and compile times, and since these changes were going to make that
worse, I decided on two different mitigations:
- As discussed above, I made the fully compiled DFA regex engine opt-in. This engine brings in quite a bit of code, but its impact on search performance is modest.
- I published a new crate,
regex-lite
, that acts nearly as a drop-in replacement for theregex
crate. Its design is based on optimizing almost exclusively for binary size and compile time, at the expense of functionality (namely, Unicode) and performance. You still get theO(m * n)
time complexity guarantee, but you don’t get any of the fancy Unicode support and you don’t get fast search times. But the binary size and compile times are a lot better.regex-lite
has zero dependencies. It shares zero code—including rolling its own regex parser—with theregex
crate.
The regex-lite
mitigation is still somewhat of an experiment, but it just
goes to show that making code arbitrarily reducible is difficult. Even though
the regex
crate has a bunch of features for disabling both optimizations and
Unicode functionality, it still can’t get anywhere close to the binary size
and compile times of regex-lite
.
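As a quick illustration of the drop-in claim, this little program of mine compiles against either crate just by changing which one is imported:

```rust
use regex_lite::Regex; // swap with `use regex::Regex;` and it still works

fn main() {
    // Same basic API surface: new, is_match, find, captures and so on.
    let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
    let m = re.find("released on 2023-07-05").unwrap();
    assert_eq!((m.start(), m.end()), (12, 22));
}
```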
Wrap up
In this blog, I tried to focus on telling a story about the biggest components
of the regex
crate internals. This higher level description can be difficult
to get from just reading the API documentation. With that said, this blog is
still just scratching the surface. I encourage you to peruse the regex-automata
API documentation for a lot more details and tons of code examples.
I encourage any and all questions on GitHub Discussions.