August 5, 2024

Going on a tangent: taking the derivative of a regular expression

Informal formal primer on regular expressions

Example (where $\Sigma = {a,b,c,d,e,...,z}$) and the regular expression is $b(l|o)u(e|t)(s)*$:

The Brzozowski derivative

Example: $R = do(g|t)$, $s = dog$

Why is this thing called a derivative?
Back to the algorithm
More

I've had to use regular expressions a LOT at my new job. When we recieve a file, the application code wants to look at these configuration files that map file names to various processing steps + metadata about the files, determine the correct configuration to use, and apply the required transformations to the data. The legacy code that figures out what configuration to apply uses regular expressions to match the incoming file name to a particular configuration name. Simple, right?

In practice this is trickier than it sounds to get correct, and I had to spend a lot of time fixing regular expressions in this old code that were either over or underconstrained (or just plain incorrect). This means I spent a lot of time really learning how to write, read, and use them!

I had always considered regexes to be this unreadably terse mess of symbols, but as I was forced to learned more about them, I grew to appreciate the beauty and power of them as a tool (and when they are appropriate to use).

After using them so much, I started to wonder how the heck a regex gets turned into executable code. How do we go from such a terse syntax like f[o]* to a little machine that determines "YES, we're a match" or "NO, but oh well, there's more fish in the sea"?

I looked around and started to read about Kleene, the mathematical foundations of regexes, automata conversion, and was totally blown away! However, the automata conversion algorithms I found were pretty complex, so I kept searching for a better matching algorithm that didn't require representing the pattern as this sort of labeled graph.

Finally, I found a particularly nice algorithm for determining if a regular expression matches a given input string! The two reasons I chose this algorithm are:

It's very simple to understand, and the code expressing it is pretty short.
It relies on the concept of the derivative of a regular expression, which I didn't even know existed.
It unearthed an interesting connection between regexes and Taylor series that I found fun!

    (defn regex-match [pattern data] 
      (if (empty? data)
        (regex-empty? (regex-empty pattern))
        (regex-match (derivative pattern (first data)) (rest data))))

The algorithm is simple to state:

if the current string we're matching is empty (written $\epsilon$), and the regular expression we want to match against matches the empty string, then $s$ matches the regex $R$! Otherwise, it does not match.
If the string is non-empty, then we run regex-match again with these new arguments:
- the current string is replaced by the the current string without the first character (ex: $blue \rightarrow lue$)
- the current regex is replaced by the derivative of $R$ with respect to the first character in $s$ (ex: $R \rightarrow D_b(R)$, since $b$ is the first character in $blue$).

In other words, we take the derivative of the regex over and over again with respect to each character in our input string. If the regex matches the input string, then when we get to the end of the string, then the resulting regex should accept the empty string as a match. If it doesn't, then the regex did not match the input string to begin with.

Ok, so what's this mysterious derivative?

In short: it is the regular expression that matches the same stuff as $R$, assuming that a given character (ex: $b$) has already been matched by $R$. You can think of it as the "tail" of the regular expression, or an approximation of the original regex.

Informal formal primer on regular expressions

Given:

a string $s$ to match, say "blue"
a regular expression $R$ that will compute a match, say $b(l|o)u(e|t)(s)*$
- recall: the $|$ operator is alternation (aka "choice"), and $*$ is iteration. The above regex could match $bout$, $blue$, $blot$, $boue$, as well as all of these variations ending in ANY number of $s$'s (ex: $bluessssssss$).
- In addition to all those operations, we are doing one other sneaky one: concatenating several regexes together to form $R$! The characters $b$, $l$, $o$, $u$, $e$, $t$ are each very simple regexes that ONLY match that one character, and we can concat them together to form a regex that matches the first thing, then the second thing, then the third thing, etc.!

then we want to determine if $s$ is recognized by $R$. To do this, we can slightly more formally define $R$ as representing the set of all the strings (drawn from a particular alphabet) that it matches: $$R := \{ t | \text{ $R$ matches t and } t \in \Sigma \}$$

Example (where $\Sigma = {a,b,c,d,e,...,z}$) and the regular expression is $b(l|o)u(e|t)(s)*$:

$$ b(l|o)u(e|t)(s)* = \{ bout, blue, blot, boue, bouts, blues, blots, boues, boutss, bluess, blotss, bouess, boutsss, bluesss, blotsss, bouesss, ... \} $$

We will also need a notion of the null regex $\empty$, which is the regex that matches nothing:

$$ \empty := \text{the empty set, confusingly also written } \empty $$

This is NOT THE SAME THING as the empty regex, which matches only the empty string $\epsilon$. This regular expression is also written as $\epsilon$:

$$ \epsilon := \{ \epsilon \} $$

The Brzozowski derivative

With this more concrete definition of a regex, then the Brzozowski derivative of $R$ with respect to a character $a$ is

$$ D_a(R) := \{\text{ } b \text{ } | \text{ } ab \in R \text{ }\} $$

In other words, it is the regex that matches everything $R$ would have if $a$ had already been matched.

To compute this, Brzozowski came up with a recursive way of expressing $D_a(R)$:

$$D_a(\empty) = \empty$$ $$D_a(\epsilon) = \empty$$ $$D_a(a) = \epsilon $$ $$D_a(a') = \empty \text{ if } a \neq a' $$ $$D_a(R_1 R_2) = δ(R_1) D_a(R_2) \text{ } | \text{ } D_a(R_1) \text{ } R_2 $$ $$D_a(R_1 \text{ } | \text{ } R_2) = D_a(R_1)\text{ } | \text{ } D_a(R_2)$$ $$D_a(R*) = D_a(R) R*$$

Note that the "product" rule also uses a function $\delta$. This function is a helper, and it just determines if its argument accepts the empty string pattern $\epsilon$. If it doesn't, then it returns the null pattern $\empty$:

$$\delta(\empty) = \empty$$ $$\delta(\epsilon) = \epsilon$$ $$\delta(a) = \empty$$ $$\delta(R_1 R_2) = \delta(R_1)\delta(R_2)$$ $$\delta(R_1 \text{ } | \text{ } R_2) =\delta(R_1) \text{ } | \text{ } \delta(R_2) $$ $$\delta(R *) = \epsilon$$

This is a lot of rules, and they are best understood through some examples, so let's get on with that.

Example: $R = do(g|t)$, $s = dog$

Let's try and compute $D_d(R)$ to get a feel for how to apply the rules above.

First we have to apply the product rule, since we have a concatenation of 3 regexes.

$$ D_d(do(g|t)) = \delta(d) D_d(o(g|t)) \text{ } | \text{ } D_d(d) \text{ } (o(g|t)) $$

The RHS of the alternation is the more imporant piece, so let's tackle that first:

$$ D_d(d) \text{ } (o(g|t)) = \epsilon \text{ } (o(g|t)) = o(g|t) $$

On the LHS of the |:

$$ \delta(d) D_d(o(g|t)) \text{ } = \empty \text{ } D_d(o(g|t)) = \empty $$

because NO strings are accepted by $\empty$, so if it appears at the beginning of a concatenation like it does here, it overrules any matches that may have happened in the rest of the regex.

Putting the LHS + RHS together, we find that

$$ D_d(do(g|t)) = o(g|t) \text{ } | \text{ } \empty = o(g|t) $$

This is exactly the subset of strings in the set represented by $do(g|t)$ that are suffixes of $d$!

Why is this thing called a derivative?

Every sufficiently nice function can be approximated by its Taylor series, and in some cases, the function is exactly equal to its infinite Taylor series.

In other words, the function's derivatives carry information about the original function, and bundling them together in the above sense represents the behavior of the function.

$$ f(a) \approx \sum_{ n=0 }^\infty \text{ (n-th derivative of f at a) } * g(x, a, n) $$ where $g(x, a, n)$ is some function g that depends on $x,a,n$.

In his original 1964 paper, Brzozowski showed that every regular expression has a similar representation in terms of its derivatives.

Note: his notation for alternation is $+$, so when you look at the following, read the sums as alternations $|$.

$$ \text{Every regular expression R can be written as } R = \delta(R) + \sum_{ a \in \Sigma } aD_a(R) $$

This is sort of like an analogue of Taylor's theorem, but for regexes! A regex is "approximated" in the above sense by its derivatives. They give subsets of the total set of matches that the regex $R$ yields, and as you add more and more of those together via $|$, you gather enough options to eventually reconstruct the original regex.

In fact, he shows that over a fixed alphabet $\Sigma$, every regex $R$ only has finitely many unique derivatives $d_R$. He terms these $R$'s characteristic derivatives. So, most of the terms in the above representation can be ignored. If we want, we can express $R$ purely in terms of an alternation of its characteristic derivatives, plus $\delta(R)$.

Back to the algorithm

With all of this, we can understand a little better why the matching algorithm works. If a regular expression is represented by its derivatives and we iteratively apply the derivative to $R$ (w.r.t. the current character we are looking at in the string $s$), then we should be construcing a descending chain of sets ($S_1 \supseteq S2 \supseteq S3 \supseteq ... $) of strings in our alphabet $\Sigma$, which should end in an empty set exactly when we run out of characters in $s$. If the string matches $R$, then each successive derivative $d_i$ "approximates" $R$ in the sense that the acceptance set $S_{d_i}$ is a subset of the acceptance set of $R$.

I ported Matt Might's implementation to Clojure here, and it works pretty well! I'd highly recommend going and playing with that code to get a feel for how the parsing algorithm works.

The performance of the naive implementation is not great, but it is a truly beautiful and simple algorithm to understand once you get what the Brzozowski derivative does to a regex.

Might + others have done a lot of work to make this algorithm more efficient, and also vastly generalize it to recognize any context-free language, not just regular ones.
In the original 1964 paper, Brzozowski defines the derivative for any boolean connective $f(P,Q)$, not just alternation. The value of $\delta$ on $f$ might be more complicated to derive, but it is possible in terms of other basic connectives. He gives many examples of what f could be: intersection, set complement, xor, etc.
The original paper also goes into how to construct state machines that accept strings from $R$ using its characteristic derivatives. If you want more info on how this works, read the paper! Or I may write a follow-up post on how this works.

Tags: regular expressions

Questions in Questions »