This post is intended for newer OCaml programmers, or those who want to use the
re2 library, but could use a couple of examples to help get started. This is not a general introduction to regular expressions, however. If you have never used regular expressions before, read up a little bit on the syntax before tackling this post.
There are a few choices for regular expression libraries available for OCaml on Opam. Some of the most popular include
- re, a pure OCaml library (installed 7667 times last month),
- pcre, bindings to the Perl Compatible Regular Expressions library (PCRE) (installed 1115 times last month), and
- re2, OCaml bindings for RE2, Google’s regular expression library (installed 114 times last month).
The first two are by far the most popular in terms of raw Opam install counts. However,
re2 integrates nicely into the Jane Street Base/Core/Async ecosystem (it’s a Jane Street package after all!), and is covered under the MIT license rather than the LGPL with OCaml linking exception, which may be appealing depending on your situation.
One issue that newcomers may face when getting started with the
re2 library is the slightly terse API documentation. While it is detailed and thorough, it can be hard to get started with if you’re not already used to reading Jane Street
mli files and source code.
Note: if you want to follow along, you can paste the examples into the toplevel (or utop). However, don’t paste in lines starting with "- :". These lines show the type of the expression as reported by the toplevel.
Creating regular expressions
You create regular expressions with Re2.create and Re2.create_exn. The former returns a Re2.t Or_error.t and the latter returns a Re2.t, raising an exception if the pattern is invalid.
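As a minimal sketch of the two constructors, the snippet below builds one regex with create_exn and checks two patterns with create. It assumes the re2 opam package is available (for example via #require "re2" in utop); the check helper and the patterns are made up for illustration.

```ocaml
(* Assumes the re2 opam package is installed and linked,
   e.g. #require "re2" in utop or (libraries re2) in a dune file. *)

(* create_exn gives back a Re2.t directly and raises on an invalid pattern. *)
let digits : Re2.t = Re2.create_exn "[0-9]+"

(* create returns an Or_error.t, which is an ordinary result underneath,
   so it can be matched with Ok and Error. check is a hypothetical helper. *)
let check pattern =
  match Re2.create pattern with
  | Ok (_ : Re2.t) -> "valid"
  | Error _ -> "invalid"

let () =
  Printf.printf "%b %s %s\n"
    (Re2.matches digits "42")
    (check "[0-9]+")
    (check "(unclosed")
```

The Or_error version is handy when the pattern comes from user input and a bad pattern should not crash the program.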
You can control how regular expression matching works by passing the options argument to the create and create_exn functions. If you omit this argument, the default options will be passed. Here they are:
For a more detailed description of these options, see the re2.h header file.
By default, re2 uses case-sensitive matching. To create a case-insensitive regex, pass in an options map like so.
Checking for a match
Perhaps the most basic regex task is to check if a string matches a given regular expression. You can use
Re2.matches for this.
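For instance, a small sketch of Re2.matches might look like this. The pattern and the has_digits helper are illustrative, and the re2 opam package is assumed to be loaded.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let digits = Re2.create_exn "[0-9]+"

(* matches succeeds if the pattern is found anywhere in the string. *)
let has_digits s = Re2.matches digits s

let () = Printf.printf "%b %b\n" (has_digits "abc123") (has_digits "abc")
(* prints: true false *)
```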
To find all matches of a regular expression in a string, you can use the find functions described in the following sections.
Find first match
To return the first match in the query string, use find_first or find_first_exn. These functions return the matched string rather than the underlying Match.t.
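A short sketch of find_first_exn, using a made-up pattern and input (the re2 opam package is assumed to be loaded):

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let word = Re2.create_exn "[a-z]+"

(* Returns the leftmost match as a plain string. *)
let first_word s = Re2.find_first_exn word s

let () = print_endline (first_word "123 apple pie")
(* prints: apple *)
```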
Find all matches
Whereas find_first returns the first match in a query string, find_all and find_all_exn return lists of all non-overlapping matches in the query string.
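Here is a small sketch of find_all_exn that collects every run of digits in a string. The pattern, the numbers helper, and the input are only for illustration.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let number = Re2.create_exn "[0-9]+"

(* Matches come back in the order they occur in the input. *)
let numbers s = Re2.find_all_exn number s

let () = print_endline (String.concat ";" (numbers "10A5S2S3A"))
(* prints: 10;5;2;3 *)
```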
Submatches and capturing groups
You can use the
sub argument to return submatches defined by capturing groups rather than the whole match.
Be aware that passing an index greater than the number of capturing groups will raise an error.
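As a sketch of the sub argument, the snippet below pulls out each capturing group separately. The two-group pattern and the helper names amounts and ops are assumptions made for this example.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let chunk = Re2.create_exn "([0-9]+)([AS])"

(* `Index 1 selects the first capturing group of every match. *)
let amounts s = Re2.find_all_exn ~sub:(`Index 1) chunk s

(* `Index 2 selects the second capturing group of every match. *)
let ops s = Re2.find_all_exn ~sub:(`Index 2) chunk s

let () =
  Printf.printf "%s | %s\n"
    (String.concat "," (amounts "10A5S"))
    (String.concat "," (ops "10A5S"))
(* prints: 10,5 | A,S *)
```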
Or_error returning vs. Exception raising
Like most of the functions in the Re2 module, the find functions come in both Or_error.t returning and exception raising versions. If the regular expression doesn’t match, find_all returns an Error, whereas find_all_exn raises an exception.
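A minimal sketch of handling the Or_error.t version follows. It relies on the no-match behavior described above; the describe helper and its output strings are hypothetical.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let number = Re2.create_exn "[0-9]+"

(* Or_error.t is a result underneath, so Ok and Error patterns work. *)
let describe s =
  match Re2.find_all number s with
  | Ok matches -> String.concat "," matches
  | Error _ -> "no match"

let () = Printf.printf "%s / %s\n" (describe "a1b22") (describe "abc")
(* prints: 1,22 / no match *)
```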
It is important to remember that the
find_all functions return non-overlapping matches.
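To make the non-overlapping behavior concrete, here is a tiny sketch with an illustrative pattern. The string "aaaaa" contains "aa" at four overlapping positions, but only two matches are returned.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let pair = Re2.create_exn "aa"

(* Each match starts where the previous one ended. *)
let pairs s = Re2.find_all_exn pair s

let () = Printf.printf "%d\n" (List.length (pairs "aaaaa"))
(* prints: 2 *)
```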
If you need a bit more control than provided by find_all with the sub argument (e.g., find_all ~sub:(`Index 1)), then you may need to use find_submatches or find_submatches_exn. These return the first match in the query string. The match is returned as a string option array, where the first element is the whole match, and subsequent elements are submatches as defined by any capturing groups.
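The sketch below shows the shape of the returned array for two made-up patterns. The second pattern has a capturing group that does not take part in the match, which shows up as None.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let chunk = Re2.create_exn "([0-9]+)([AS])"

(* Only the first match is used: whole match, then each group. *)
let first_chunk = Re2.find_submatches_exn chunk "10A5S"
(* [| Some "10A"; Some "10"; Some "A" |] *)

let either = Re2.create_exn "(a)|(b)"

(* The first group did not participate in the match, so it is None. *)
let only_b = Re2.find_submatches_exn either "b"
(* [| Some "b"; None; Some "b" |] *)

let () =
  let show = function Some s -> s | None -> "<none>" in
  Array.iter (fun x -> print_endline (show x)) first_chunk;
  Array.iter (fun x -> print_endline (show x)) only_b
```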
You may wonder why find_submatches_exn returns a string option array and not simply a string array. The reason is that it uses Match.get under the hood. Basically, find_submatches_exn processes a Match.t Sequence.t of matches, calling get on each one. And the Match.get function returns a string option.
This little code snippet will hopefully give you an idea of what’s going on.
If the `Index you pass to ~sub is higher than the number of capturing groups plus one (e.g., the number returned from Re2.num_submatches), None is returned.
More complicated submatch interface
If you need to work with submatches of every match in a string rather than just the first, and you need direct access to the Match.t, you will want to use get_matches or get_matches_exn. Let’s try it out with a weird, little example.
Say we have a string made up of chunks. Each chunk is a number followed by an
A (for add) or an
S (for subtract) (e.g.,
3S). The chunk describes an arithmetic operation:
12A means add 12 to the previous total;
3S means subtract 3 from the previous total.
A full string then might look something like this:
10A5S2S3A, which represents the following sequence of operations:
0 + 10 - 5 - 2 + 3.
One way to solve this little problem is to use regexes and the get_matches function. Let’s see how it might go.
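One possible sketch of a solution is shown below. It is not necessarily how the original example was written: the pattern, the eval helper, and the choice to fail on an unexpected operation are all assumptions made here.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)

(* Group 1 is the number, group 2 is the operation letter. *)
let chunk = Re2.create_exn "([0-9]+)([AS])"

(* Fold over every match, starting from a total of 0. *)
let eval s =
  Re2.get_matches_exn chunk s
  |> List.fold_left
       (fun total m ->
         let n = int_of_string (Re2.Match.get_exn ~sub:(`Index 1) m) in
         match Re2.Match.get_exn ~sub:(`Index 2) m with
         | "A" -> total + n
         | "S" -> total - n
         | op -> failwith ("unexpected operation: " ^ op))
       0

let () = Printf.printf "%d\n" (eval "10A5S2S3A")
(* prints: 6, that is 0 + 10 - 5 - 2 + 3 *)
```

Working with Match.t directly means both capturing groups of each chunk are available in a single pass over the matches.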
In the last two examples, we used the
sub argument along with a polymorphic variant to select capture groups. Let’s take a closer look at the type used for that.
To select submatches, we use id_t, which looks like this:
This type is used to refer to submatches. E.g., `Index 1 would be the result of the first capturing group, `Index 2 the second, etc. Remember that `Index 0 refers to the whole match.
In addition to referring to submatches/capturing groups by index, you can refer to them by name.
When using a complicated regular expression with multiple capturing groups, it is often less error prone to use named submatches rather than numbered ones.
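A small sketch of named submatches follows. It uses RE2's (?P<name>...) syntax for named groups; the group names amount and op are invented for this example.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)

(* Named capturing groups use the (?P<name>...) syntax. *)
let chunk = Re2.create_exn "(?P<amount>[0-9]+)(?P<op>[AS])"

(* `Name selects a group by its name rather than its position. *)
let amount s = Re2.find_first_exn ~sub:(`Name "amount") chunk s
let op s = Re2.find_first_exn ~sub:(`Name "op") chunk s

let () = Printf.printf "%s %s\n" (amount "12A") (op "12A")
(* prints: 12 A *)
```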
Note: It is not a compile error to try to access a capturing group that doesn’t exist in the regular expression. Depending on the function, you may get None or raise an exception.
id_t to control match efficiency
Many of the regex matching functions take a sub argument of type id_t. In some cases, you can increase the efficiency of matching by restricting the number of submatches. If you only care about whether a pattern matches, and not about submatches, you could pass in ~sub:(`Index (-1)) to many of the above functions.
You can get increasingly more information by increasing the n passed to `Index n.
This section of the documentation has more info on how specifying the
sub argument can have an impact on regex performance, and which functions are affected by its usage.
Another common regex task is splitting an input string based on a regular expression pattern.
Re2 provides the
split function for this purpose.
If you need to include the actual matches in the output, you can. Passing
~include_matches:true ensures the “separators” are in there with the rest of the output.
Just be aware of that final empty string at the end!
You can also limit the number of matches with the
max argument. You could use this to get the first value separated from the remaining values in a string of tab-separated values, for example.
If the regular expression has no matches in the query string, then a one-element list containing the whole query string is returned.
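The following sketch shows split with and without include_matches, plus the no-match case. The comma pattern and inputs are illustrative only.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let comma = Re2.create_exn ","

(* Plain split drops the separators. *)
let plain = Re2.split comma "a,b,c"
(* ["a"; "b"; "c"] *)

(* include_matches keeps the separators between the pieces. *)
let with_seps = Re2.split ~include_matches:true comma "a,b,c"
(* ["a"; ","; "b"; ","; "c"] *)

(* No match: the input comes back as the only element. *)
let unsplit = Re2.split comma "abc"
(* ["abc"] *)

let () =
  List.iter
    (fun pieces -> print_endline (String.concat "|" pieces))
    [ plain; with_seps; unsplit ]
```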
The simpler interface for regex replacement consists of the rewrite and rewrite_exn functions. The template argument defines how you want to replace any matches in the query string. In this case, we replace any matches with a capital A.
You can reference the submatches in the template string using the syntax
\\n. Check it out.
If you have multiple submatches, just keep referring to them in the same way:
\\1 ... \\2 ... etc.
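As a sketch of both uses of rewrite_exn, the first call below replaces every match with a capital A and the second swaps two submatches through the template. Patterns and inputs are made up for illustration.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let number = Re2.create_exn "[0-9]+"

(* Every match is replaced by the template. *)
let masked = Re2.rewrite_exn number ~template:"A" "x1y22z"
(* "xAyAz" *)

let chunk = Re2.create_exn "([0-9]+)([AS])"

(* In an OCaml string literal, "\\2\\1" is the four characters \2\1:
   the second submatch followed by the first. *)
let swapped = Re2.rewrite_exn chunk ~template:"\\2\\1" "10A5S"
(* "A10S5" *)

let () = Printf.printf "%s %s\n" masked swapped
```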
If you need to check if your rewrite template is valid before running rewrite, you can use valid_rewrite_template.

The re2 library also provides more powerful replacing functions: replace and replace_exn. You can use them if you need direct access to the Match.t.
Here is a silly example that picks a different replacement value depending on the match.
While the replace function is more complicated than rewrite, it gives you more control and has a few other options you may find useful.
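A minimal sketch of replace_exn is shown below, where the replacement depends on what was matched. The pattern and the mapping from letters to symbols are assumptions for this example, not the original one.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let op = Re2.create_exn "[AS]"

(* f receives each Match.t and returns the replacement text. *)
let to_symbols s =
  Re2.replace_exn op s ~f:(fun m ->
    match Re2.Match.get_exn ~sub:(`Index 0) m with
    | "A" -> "+"
    | _ -> "-")

let () = print_endline (to_symbols "10A5S")
(* prints: 10+5- *)
```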
Escaping strings for regular expressions
Properly escaping regular expressions can sometimes be tricky, especially if you want to avoid illegal backslash characters in your strings.
Re2 provides a function
escape that escapes its input in such a way that if you create a regex from the resulting escaped string, it would match the original string. Here’s how it works.
Depending on how many special characters are in the string you use to build the regex, escaping can be pretty noisy! In these cases,
escape is especially useful.
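Here is a short sketch of escape with an illustrative literal. Without escaping, the dot in "1.5" would match any character.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let literal = "1.5"

(* escape puts a backslash in front of the dot. *)
let escaped = Re2.escape literal

let exact = Re2.create_exn escaped   (* matches only the text 1.5 *)
let loose = Re2.create_exn literal   (* the dot matches any character *)

let () =
  Printf.printf "%s %b %b %b\n" escaped
    (Re2.matches exact "1.5")
    (Re2.matches exact "1x5")
    (Re2.matches loose "1x5")
(* prints: 1\.5 true false true *)
```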
Infix matching operator
If you’re feeling nostalgic for Perl, feel free to use the
=~ infix operator!
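A tiny sketch of the operator, which lives in the Re2.Infix module and takes the string on the left and the regex on the right. The has_digits helper is made up for this example.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)
let digits = Re2.create_exn "[0-9]+"

(* Local open of Re2.Infix brings =~ into scope for this expression. *)
let has_digits s = Re2.Infix.(s =~ digits)

let () = Printf.printf "%b %b\n" (has_digits "abc123") (has_digits "abc")
(* prints: true false *)
```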
“Precompiling” your regular expressions
Unless you have a good reason not to, you will probably want to create your regular expression outside of the function that will be using it.
To see why, let’s check out this little benchmark program that compares two functions. The first one reuses a regex that is created outside of the function, whereas the second one creates a new regex each time the function is called.
Note: This benchmark program uses Jane Street’s core_bench micro-benchmarking library.
| Name | Time/Run | mWd/Run | Percentage |
|---|---|---|---|
| outside | 272.60 ns | 2.00 w | 3.74% |
| inside | 7_281.55 ns | 91.00 w | 100.00% |
As you can see, reusing a regex rather than creating a new one each time a function is called makes a big difference in this benchmark. Keep in mind that this is a micro-benchmark, and that this difference may not be that important to the run time of your program as a whole. That said, if you had the slow version of the above function in a hot loop, it could really be wasting a lot of CPU cycles.
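The two styles being compared can be sketched roughly as follows. This is not the original benchmark program, just an illustration of the difference; the function names and pattern are hypothetical.

```ocaml
(* Assumes the re2 opam package is installed and linked. *)

(* The regex is compiled once, when the module is initialized. *)
let digits = Re2.create_exn "[0-9]+"
let has_digits_outside s = Re2.matches digits s

(* The regex is compiled again on every single call. *)
let has_digits_inside s = Re2.matches (Re2.create_exn "[0-9]+") s

let () =
  Printf.printf "%b %b\n"
    (has_digits_outside "abc123")
    (has_digits_inside "abc123")
(* prints: true true *)
```

Both functions return the same answers; only the amount of work done per call differs.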
Hopefully this overview helps you get started with using re2. To get more info about using re2, check out the API docs. Additionally, the re2 source code is quite readable. I encourage you to take a look at how the functions are defined, as it may help clear up any additional questions you have!