WARNING: 2018-04-12: The code snippets for this post are no longer available. This is just as well anyway, since they all depended on an unstable internal compiler interface, which hasn’t existed for years.
A few weeks ago, I set out to add regular expressions to the Rust distribution with an implementation and feature set heavily inspired by Russ Cox’s RE2. It was just recently added to the Rust distribution.
This regex crate includes a syntax extension that compiles a regular expression to native Rust code when a Rust program is compiled. This can be thought of as “ahead of time” compilation or something similar to compile time function execution. These special natively compiled regexes have the same exact API as regular expressions compiled at runtime.
In this article, I will outline my implementation strategy—including code samples on how to write a Rust syntax extension—and describe how I was able to achieve an identical API between regexes compiled at compile time and regexes compiled at runtime.
Brief notes on format
All code samples in this post compile with
rustc from commit
Code samples with an
// Output: comment are tested before uploaded.
Most code samples should correspond to a complete program that is compilable. This makes samples a bit longer than they need to be, but I think it’s important to include complete working examples for a language still in development and not (yet) widely known.
Note that Rust is still under heavy development. I will do my best to keep this article updated with any breaking changes.
How a regex matcher works
There are many different ways to implement a regular expression engine, so I will just briefly outline how mine works, which should closely correspond to RE2.
First, a regular expression is parsed and converted into an abstract syntax
tree. For example, the regex
a|b might be represented as
Alternate(Literal(a), Literal(b)), which signifies “match either
Second, the regex’s abstract syntax is converted to a sequence of
instructions. This sequence of instructions can then be evaluated by a
which will determine if a regex matches some text (and where it matches).
For the regex
a|b, the sequence of instructions looks something like:
- 1: Split(2, 4)
- 2: Char(a)
- 3: Jump(5)
- 4: Char(b)
- 5: Match
In this case,
Split means “jump to two different instructions
simultaneously.” In particular, if either branch executes the
instruction, then the entire regex will match. If both
fail, then it’s impossible to reach the
Clarifying the divide: native regexes vs. dynamic regexes
The word “compiled” is heavily overloaded, so it’s worth making sure that we
get our terms straight. For example, in Python, you can compile a regular
re.compile("..."), which is done at runtime, and converts the
pattern given into a data structure that is fed into a matching algorithm.
Often, it is good practice to compile a regular expression outside of loops
so that it doesn’t have to go through the overhead of compilation each time the
expression is used to search text. (Assuming you don’t want to rely on Python’s
This is emphatically not what I mean by compiling regexes “ahead of time.” What I mean is the more literal translation: a regular expression is converted to native Rust code when you compile your Rust program. That is, there is (virtually) zero cost of compiling a regex during runtime. Perhaps more importantly, because it’s compiled to native Rust code, your regex will always run faster. (I promise evidence of this claim with benchmarks toward the end of this article.)
Let’s clarify with an example. First, we can try compiling a regular expression at runtime:
<code data-gist-id=“11161035” data-gist-file=“intro.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
Notice that we call
Regex::new to create a regex, and the call may fail if
the expression given isn’t a valid regex. I call this a dynamic regex
because the regular expression compilation happens at runtime. This example is
analogous to Python’s
When a regular expression is compiled dynamically, it is converted to a sequence of instructions that are executed by a virtual machine. This virtual machine can execute any valid regular expression expressed as a sequence of instructions.
In contrast, I’ve called regexes that are compiled to Rust code when your Rust
program compiles native. A native regex is indistinguishable from a dynamic
regex, except that it must be created with a
<code data-gist-id=“11161035” data-gist-file=“intro-aot.rs” data-gist-highlight-line=“1,3,6” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
The highlighted lines indicate a change from the previous code sample.
(Note about Rust: Syntax extensions require a special compiler
#[phase(plugin)] to work. The
phase directive must be enabled
Notice that we no longer have to check if
regex! returns an error because
all errors are converted to compile time errors. For example, the following
program will fail to compile:
<code data-gist-id=“11161035” data-gist-file=“unbalanced-paren.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
The error given by
Regex syntax error near position 3: Unclosed parenthesis.
So what exactly does
regex! expand to if not a sequence of instructions?
Well, remember the virtual machine I mentioned earlier that can execute a
sequence of instructions? That’s precisely what
regex! gets replaced with,
except it’s specialized to the particular sequence of instructions
corresponding to the regular expression given. This specialization allows us to
perform optimizations such as removing heap allocation and embedding the
instructions directly instead of relying on a generalized algorithm.
If you want an example, you can take a look at how
regex! macro into
(It’s not important to understand the expanded code, but it is notable that it
expands to a lot of code! If you have a lot of
regex! calls in your program,
then you might see your binary size increase. But compiling with
optimization will help with that.)
Native regexes are pretty good. They provide extra safety (cannot compile a
program with an invalid regex) and extra performance. They do come with a
downside though: code bloat. If your program has a lot of
(hundreds) and is compiled unoptimized, then you’ll notice a bigger binary.
However, optimization can help shrink them back down.
Note that I am not the first to do this. D has ctRegex which claims to do something similar. There is also xpressive in Boost, but native regexes must be written with template syntax. Finally, there is also CL-PPCRE for Common Lisp which also claims to have native regexes.
Nimrod seems like it is capable of producing native regexes, but I don’t think it has been done yet.
A related project is Ragel, which can compile state machines to native code in a variety of languages. In principle, Ragel is also producing native regexes, but my approach is automated by the Rust compiler and provides an API identical to that of dynamic regexes.
(Please alert me if I’ve missed anything.)
Native regexes are implemented with macros
but they can be written in one of two ways.
The first way is to use
macro_rules! to conveniently specify some source
transformation with quasiquoting.
For example, here is how the
try! macro is defined, which provides an easy
way to return early in a function that returns a value with type
<code data-gist-id=“11161035” data-gist-file=“try.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
Notice the use of quasiquoting with
$e. The expression
$e is spliced into
match expression at compile time. The quasiquoting in this case
distinguishes between something that parameterizes the macro at compile time
and the actual source code being written.
In general, if a macro can be written with
macro_rules!, then it should be
macro_rules! because they are simple to write. However, they
are not powerful enough to implement native regexes because they are restricted
to source code transformations. To implement native regexes, we need to
actually compile a regular expression during the compilation of a Rust program.
Luckily, the second way to write a macro is via a syntax extension (also known as a “procedural macro”). This is done by a compile time hook that lets you execute arbitrary Rust code, rewrite the abstract syntax to whatever you want and opens up access to the Rust compiler’s parser. In short, you get a lot of power. And it’s enough to implement native regexes. The obvious drawback is that they are more difficult to implement.
Note that all macros are invoked with the
name! syntax, regardless of
whether they are defined with
macro_rules! or as a syntax extension.
Setting up a syntax extension
Syntax extensions only became available to user programs a few months ago. Before that, they were only available to the Rust compiler. Thus, there is little documentation, so I’ve learned what I know by example and by examining the syntax crate, which defines Rust’s abstract syntax, provides a parser and the syntax extension functionality itself. (Note that syntax extensions are currently an experimental feature of Rust.)
Since there is a fair bit of boiler plate required, let’s take a look at
implementing a trivial macro that returns the factorial of
<code data-gist-id=“11161035” data-gist-file=“lib-factorial_no_quote.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
(N.B. There are some comments in the code above that explain some details.)
The magic happens with the
macro_registrar function, which must be exported
and labelled with the special
#[macro_registrar] attribute. This function is
where the compiler lets you register syntax extensions, which have a name and
provide a value of type
SyntaxExtension which indicates the sort of expansion
that you intend to do. In this case, we’re using
NormalTT, which is a
function-like macro, but there
Here, we register a single macro:
factorial, which is expanded by the
expand function. The type of
expand is specified by the
expansion, which gives us an
(e.g., the parsing state),
(e.g., a region of the actual code, used for error reporting)
But most importantly, the return value is a
(“macro result”), where
MacResult is a trait.
Types that satisfy the
MacResult trait can be automatically spliced into the
AST of a Rust program.
In this case, we’re just creating an expression, so we can use the
helper type to construct a value that has type
Box<MacResult>. Finally, we
MacExpr::new has type
Gc<Expr> -> Box<MacResult>. Ah ha! So if we
can build an expression, then we can build a
Box<MacResult>, and therefore be
done with our macro expansion.
The body of the
expand function shows how to manually create an expression
with a single integer literal. Even though the expression is very simple, I
wrote two helper functions
dummy_expr to encapsulate some
of the more mundane details. If we have to go through this much work for just a
single literal, you might imagine that it gets very tedious pretty quickly.
But remember the quasiquoting that was available to use when defining macros
macro_rules!? They are available to us here too, courtesy of the
module. Here’s the same syntax extension implemented using quasiquoting:
<code data-gist-id=“11161035” data-gist-file=“lib-factorial.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
quote_expr! macro does all the heavy lifting for us. It splices the
answer into an expression for us, handling all the nitty-gritty
details automatically. (We could also write, say,
$answer + 1 which would
produce an expression with an addition operation. The
$answer part is
computed at compile time and the addition operation would be done at
runtime—if it isn’t optimized away.) Fundamentally, the
quasiquoting works the same here as it does with
(There is some middle ground between building the AST manually and using quasiquoting, but I won’t cover it here.)
And finally, we can actually use it with:
<code data-gist-id=“11161035” data-gist-file=“factorial.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
The comments in the code explain some of the extra compiler directives we have
to write in order to import a syntax extension. (Note that this is no different
than importing the
log crate from the Rust distribution, except that the
log crate requires
#[phase(syntax, link)] because it provides more stuff
than just a syntax extension.)
Accessing the parser
The above example demonstrates how to write a really simple—but pretty
useless—syntax extension. To write a useful syntax extension, we probably
need to define a macro that can accept arguments. With
macro_rules!, this is
really easy because there’s some convenient syntax similar to defining a
regular function. But in the land of syntax extensions, we must deal with
Rust’s parser directly.
An obvious extension to our
factorial! macro is to let it take an integer
argument so that we can compute any factorial. But how do we get the arguments
given to a macro defined as a syntax extension?
Well, if you look back at the last example, then you’ll notice that the
expand function has a few parameters that we didn’t use. One of which is a
sequence of token trees with type
&[ast::TokenTree]. Luckily, a sequence of
ast::TokenTree values can be used to build a parser with
new_parser_from_tts defined in the
module. With a parser, we can ask it for expressions—and once we have
expressions, we can translate those to real Rust values. In this case, that’s
just going to be an integer.
(You can see the
and although it is largely undocumented, most of the names are reasonably
Let’s take a look at the code. The principal changes from the last example are highlighted:
<code data-gist-id=“11161035” data-gist-file=“lib-factorial_arg.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true” data-gist-highlight-line=“999999999,20-25,35-65”
The first few changes are fairly trivial. This time, instead of hard-coding the
5, we parse an unsigned integer literal and pass that to the
factorial function. If a single unsigned integer literal could not be parsed,
then we return a
The actual error is written to the parser state in
parse. The code is
structured this way to allow the compiler to continue even if an error is
parse function shows how to create a parser from a sequence of token
trees. A single expression is parsed, and we use pattern matching to make sure
it corresponds to an unsigned integer literal.
(To figure out which value constructors to use in your patterns, you’ll want to
dig into the
documentation, specifically the
After we find a literal, we make sure it’s the last thing in the sequence of
token trees and then return the integer as a real Rust value with type
If anything unexpected occurs, we log an error with the Rust compiler via the
ExtCtxt and return
None. As a bonus, we use the
module to pretty-print Rust expressions to make our error messages nicer.
Finally, the macro can be used the same way as last time, except we call
factorial! with an integer argument:
<code data-gist-id=“11161035” data-gist-file=“factorial_arg.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true” data-gist-highlight-line=“5”
This concludes the tutorial aspect of this article. Now we’ll get back to regexes.
regex! syntax extension
Instead of going through the full code generator, which would require more explanation about how the virtual machine works, I’ll explain the representation of native regexes and how some of the optimizations work.
But first, we must figure out how the
regex! macro will be used. If possible,
it should reuse the existing
API for the Regex
The benefit of this is that native regexes won’t introduce any additional
complexity on top of dynamic regexes other than how they are constructed. The
regex! macro should also probably return an expression, so that we can write
Perhaps we can construct a value with type
Regex. If we could do that, then
we’d be done—because such values already support all the convenient matching,
splitting and replacing functions. So let’s take a look at the representation
of a dynamic
<code data-gist-id=“11161035” data-gist-file=“rep-regex-dynamic.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true” data-gist-line=“2-1000”
The representation shown here is very simple: it’s just a sequence of
instructions. The problem is that this representation is for dynamic
regexes. Remember that dynamic regexes work by being compiled to a sequence of
instructions that is then executed by a general virtual machine. The VM can be
thought of as a Rust function with type
fn(insts: &[Inst], search: &str) -> Captures, where
Captures is just the
location of matches in the
Native regexes on the other hand should be native Rust code with precisely the
same type as above with one omission: the sequence of instructions. Why?
Because the sequence of instructions is encoded directly into the function by
regex! syntax extension. Therefore, the type of a native regex VM would
have to be
fn(search: &str) -> Captures.
So we’re left with an interesting set of constraints. On the one hand, we have a VM that can execute a sequence of instructions on search text, and on the other, we have a specialized VM that just executes on search text. Since a regex must be either dynamic or native, we can represent this state of affairs with a sum type:
<code data-gist-id=“11161035” data-gist-file=“rep-regex-both.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true” data-gist-line=“2-1000”
And now we can easily write a function that executes a
Regex on search text,
regardless of whether its dynamic or native:
<code data-gist-id=“11161035” data-gist-file=“rep-regex-exec.rs” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true” data-gist-line=“12-1000”
These details of the representation are hidden from well behaving clients, and therefore, dynamic and native regexes are indistinguishable.
Why did you just say “well behaving clients”?
Shouldn’t Rust’s module system prevent clients from accessing parts of the representation that aren’t exported?
Herein lay the dirty little secret of the
regex_macros crate: it must
be able to access the internal representation of a
Regex (along with other
things, like the set of all instructions). But if you look at the public API
documentation, you’ll not see any such details of the representation exposed.
This is because they are hidden using a special
but you can see that an
entire module of
is exported for the exclusive benefit of the
Therefore, the “well behaving clients” qualification is necessary. Misbehaving clients could access the internal representation of the Regex type, and thereby distinguish between dynamic and native regexes.
(It’d be possible to ameliorate this somewhat by exposing a hidden constructor function.)
This is still early days for the regex crate in Rust, but there are two primary optimizations made with native regexes. The first is removing heap allocation. Briefly, the general VM for dynamic regexes has a few places where it needs to allocate things on the heap. Primarily, the allocation is for storing capture groups (i.e., the location of submatches in search text). The heap allocation is necessary because the dynamic VM doesn’t know at compile time how many capture groups any particular regex might have.
This of course is not the case for native regexes, since the regex is compiled
when your Rust program is compiled. Therefore, we can put all of the capture
group information on the stack using explicitly sized vectors (i.e.,
[T, ..N]). This removes all heap allocation (except for the return value).
The second optimization comes from encoding the instructions directly into one
match expression based on the position of the instruction. This removes
a lot of small computations necessary to discover what the “next” instruction
Some other smaller optimizations were made, such as removing conditional
branches based on flags set in a particular instruction (like case
insensitivity, multi-line mode, etc.) or converting a binary search on a
character class to a single
As promised, here are some benchmarks run on an Intel i3930K:
<code data-gist-id=“11161035” data-gist-file=“bench-cmp” data-gist-hide-footer=“true” data-gist-hide-line-numbers=“true”
Native regexes provide a universal speedup.
The more interesting observation is that native regexes drastically reduce
constant factors associated with matching a particular regex against some text.
This is evidenced by the
benchmarks, where there is an order of magnitude improvement in almost all of
These benchmarks in particular test a few different regexes (based on particular optimizations related to literal prefixes) on 32 bytes of text. As one might imagine, with such a small search string, constant factors are likely to dominate the performance of the matching algorithm. Not surprisingly, native regexes win big here—likely because there is almost no heap allocation occurring.
There are still many more optimizations that can be performed:
- Creating a “one pass” NFA where no backtracking need occur.
- Attempt to do state compression in the code generator via a more intelligent analysis on the sequence of instructions.
- The big one: implement a DFA, similar to how RE2/C++ works.
Many thanks to the following:
- Steven Fackler for helping me navigate the word of syntax extensions.
- Eduard Burtescu for helping me with the code generator and talking through some future ideas for optimization.
- Chris Morgan for giving me the initial idea to create native regexes.
- Alex Crichton for holding my hand through the pull request process.
- Russ Cox for blazing the trail and writing an amazing series of articles on implementing regular expression matching.
- regex API documentation.
- A gist of all code samples in this article.
- Some rough numbers for the regex-dna benchmark, compared with Go, Python and C.
- The infrastructure used to compile, check and upload code samples in this article automatically.
Other examples of syntax extensions
When documentation is scarce, the next best thing is to look at examples. Here are the ones I know about:
- simple-ext is an example of a syntax extension for sorting literal strings. You can see it used here.
- hexfloat for hexadecimal floating-point literals (part of Rust distribution).
- fourcc four-character code library (part of Rust distribution).
- rust-phf implements compile time static maps.
Please ping me if you know of more examples. I’ll add them here.