The Design of a Verified Derivative-Based Parsing Tool for Regular Expressions

We describe the formalization of Brzozowski and Antimirov derivative-based algorithms for regular expression parsing in the dependently typed language Agda. The formalization produces a proof that either an input string matches a given regular expression or that no match exists. A tool for regular-expression-based search in the style of the well-known GNU grep has been developed with the certified algorithms. Practical experiments conducted with this tool are reported.


Introduction
Parsing is the process of analysing whether a string of symbols conforms to a given set of rules. In computer science, it involves the formal specification of the rules in a grammar. An important product of parsing is data that makes evident which rules were used to conclude that the string of symbols can be obtained from the grammar rules, or an error indicating that the string cannot be generated from them.
In this work we are interested in the parsing problem for regular languages (RLs) [1], i.e. languages that can be recognized by (non-)deterministic finite automata and equivalent formalisms. Regular expressions (REs) are an algebraic and compact way of specifying RLs that are extensively used in lexical analyser generators [2] and string search utilities [3]. Since such tools are widely used and parsing is pervasive in computing, there is a growing interest in certified parsing algorithms [4][5][6]. This interest is motivated by the recent development of dependently typed languages, which are powerful enough to express algorithmic properties as types that are automatically checked by a compiler.
The use of derivatives for regular expressions was introduced by Brzozowski [7] as an alternative method to compute a finite state machine that is equivalent to a given RE and to perform RE-based parsing. According to Owens et al. [8], "derivatives have been lost in the sands of time" until their work on a functional encoding of RE derivatives renewed interest in their use for parsing [9,10]. In this work, we provide a complete formalization of an algorithm for RE parsing using derivatives [8], and describe a RE-based search tool we developed using the dependently typed language Agda [11]. We want to emphasize that what we call "RE parsing" is the problem of finding all prefixes and substrings of an input that match a given RE, as in RE-based text search tools such as GNU grep [3].
More specifically, our contributions are:
• A formalization of Brzozowski-derivative-based RE parsing in Agda. The presented certified algorithm produces as a result either a proof term (parse tree) that is evidence that the input string is in the language of the input RE, or a witness that such a parse tree does not exist.
• A detailed explanation of the technique used to simplify derivatives using "smart constructors" [8]. We give formal proofs that smart constructors indeed preserve the language recognized by REs.
• A formalization of Antimirov's partial derivatives and their use to construct a RE parsing algorithm. The main difference between partial derivatives and Brzozowski's is that the former compute a set of REs using set operators instead of smart constructors. Producing a set of REs avoids the need for simplification using smart constructors.
• A certified RE matching tool, built from the verified algorithms in the style of the well-known GNU grep [3], together with some experiments that use it.
The rest of this paper is organized as follows. Section 2 presents a brief introduction to Agda. Section 3 describes the encoding of REs and their parse trees. In Section 4 we define Brzozowski and Antimirov derivatives, smart constructors and some of their properties, describing how to build a correct parsing algorithm from them. Section 5 comments on the use of the certified algorithm to build a tool for RE-based search and presents some experiments with this tool. Related work is discussed in Section 6 and Section 7 concludes our work.
All the source code in this article has been formalized in Agda version 2.5.2 using Standard Library 0.13, but we do not present every detail. Proofs of some properties result in functions with a long pattern-matching structure that would distract the reader from the high-level structure of the formalization. In such situations we give just proof sketches. All details can be found in the source code available at [12].

An Overview of Agda
Agda is a dependently typed functional programming language based on Martin-Löf intuitionistic type theory [13]. Function types and an infinite hierarchy of types, Set l, where l is a natural number, are built in. Everything else is a user-defined type. The type Set, also known as Set 0, is the type of all "small" types, such as Bool, String and List Bool. The type Set 1 is the type of Set and "others like it", such as Set → Bool, String → Set, and Set → Set. We have that Set l is an element of the type Set (l + 1), for every l ≥ 0. This stratification of types is used to keep Agda consistent as a logical theory [14].
An ordinary (non-dependent) function type is written A → B and a dependent one is written (x : A) → B, where type B depends on x, or ∀ (x : A) → B. Agda allows the definition of implicit parameters, i.e. parameters whose value can be inferred from the context, by surrounding them in curly braces: ∀ {x : A} → B. To avoid clutter, we omit implicit arguments from the source code presentation. The reader can safely assume that every free variable in a type is an implicit parameter.
As an example of Agda code, consider the data type Vec of length-indexed lists, also known as vectors.
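The vector type discussed here can be sketched as below. This is a minimal reconstruction under our own assumptions rather than the paper's listing: it imports ℕ from the standard library, and it writes the cons operator as ∷ (rendered as :: in the text). The value example is a hypothetical name added for illustration.

```agda
open import Data.Nat using (ℕ; zero; suc)

infixr 5 _∷_

-- Vec A n: lists of elements of A whose length is exactly n.
-- A is a parameter (fixed in every occurrence); the ℕ is an index.
data Vec (A : Set) : ℕ → Set where
  []  : Vec A zero
  _∷_ : {n : ℕ} → A → Vec A n → Vec A (suc n)

-- The length index is checked by the type checker:
-- a two-element vector must have type Vec ℕ 2.
example : Vec ℕ 2
example = 1 ∷ 2 ∷ []
```

Changing the index in the type of example to 3 would make the definition ill-typed, which is the sense in which the length is part of the type.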
The Vec type uses some interesting concepts. First, Vec is parameterized by a type A : Set, which means that in every occurrence of Vec its parameter A should not change. Second, the type of Vec A is ℕ → Set, i.e. Vec A is a family of types indexed by natural numbers. For each natural number n, Vec A n is a type. In Vec's definition, constructor [ ] builds empty vectors. The cons operator (::) inserts a new element in front of a vector of n elements (of type Vec A n) and returns a value of type Vec A (suc n). The Vec datatype is an example of a dependent type, i.e. a type that uses a value (that denotes its length). The usefulness of dependent types can be illustrated with the definition of a safe list head function: head can be defined to accept only non-empty vectors, i.e. values of type Vec A (suc n).
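A safe head function along these lines might look as follows. The sketch assumes the standard library's Data.Vec (only Vec and its two constructors are imported, so the local head does not clash with the library's own); first is a hypothetical name used to show a call.

```agda
open import Data.Nat using (ℕ; suc)
open import Data.Vec using (Vec; []; _∷_)

-- head only accepts vectors whose length has the form suc n,
-- so no clause for the empty vector is needed (or allowed).
head : {A : Set} {n : ℕ} → Vec A (suc n) → A
head (x ∷ _) = x

-- Usage: the argument has length 2, which matches suc n with n = 1.
first : ℕ
first = head (7 ∷ 8 ∷ [])
```

An application such as head [] is rejected at type-checking time, because zero cannot be unified with suc n.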
In head's definition, constructor [ ] is not used. The Agda type-checker can figure out, from head's parameter type, that argument [ ] to head is not type-correct.
Thanks to the propositions-as-types principle we can interpret types as logical formulas and terms as proofs. We can encode logical disjunction as an Agda data type, written A ∨ B, with two constructors. Each constructor represents one way to build evidence for the proposition A ∨ B: using constructor inj₁ together with evidence for A, or using constructor inj₂ together with evidence for B. Intuitively, type ∨ encodes the intuitionistic interpretation of disjunction.
Another important example is the representation of equality as an Agda type, written x ≡ y. This type is called propositional equality. It defines that there is a unique evidence for equality, constructor refl (for reflexivity), which asserts that the only value equal to x is x itself.
Given a type P, type Dec P is used to build proofs that P is a decidable proposition, i.e. that either P or not P holds. The decidable proposition type is defined as:

data Dec (P : Set) : Set where
  yes : P → Dec P
  no : ¬ P → Dec P

Constructor yes stores a proof that property P holds, and constructor no stores evidence that such a proof is impossible. Some functions used in our formalization use this type. The type ¬ P is an abbreviation for P → ⊥, where ⊥ is a data type with no constructors (i.e. a data type for which it is not possible to construct a value, which corresponds to a false proposition).
Dependently typed pattern matching is built by using the so-called with construct, which allows for matching intermediate values [15]. If the matched value has a dependent type, then its result can affect the form of other values. For example, consider a function ≟ that tests the equality of two ℕ values.
Notice that when we pattern-match on the result of the recursive call n ≟ m using the with construct, the value of the second parameter is specialized using the fact that yes refl denotes that n and m must be equal. For the case in which n and m are different, we use the inversion lemma sucInv, which states that suc n ≡ suc m implies n ≡ m, to derive a contradiction. For further information about Agda, see [11,16].
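The logical types and the equality test described in this section can be put together in one small, self-contained sketch. It is a reconstruction under stated assumptions: the types are declared locally (rather than imported) so that the block stands alone, the fixity declarations are our choice, and only the names Dec, yes, no, refl, sucInv and ≟ come from the text.

```agda
open import Data.Nat using (ℕ; zero; suc)

infixr 1 _∨_
infix  3 ¬_
infix  4 _≡_

-- Disjunction: evidence for A ∨ B is evidence for A or evidence for B.
data _∨_ (A B : Set) : Set where
  inj₁ : A → A ∨ B
  inj₂ : B → A ∨ B

-- Propositional equality: refl is the only way to build evidence.
data _≡_ {A : Set} (x : A) : A → Set where
  refl : x ≡ x

-- The empty type (no constructors) and negation.
data ⊥ : Set where

¬_ : Set → Set
¬ P = P → ⊥

-- Decidable propositions.
data Dec (P : Set) : Set where
  yes : P → Dec P
  no  : ¬ P → Dec P

-- Inversion lemma: suc is injective.
sucInv : {n m : ℕ} → suc n ≡ suc m → n ≡ m
sucInv refl = refl

-- Decidable equality on ℕ, using `with` on the recursive call.
_≟_ : (n m : ℕ) → Dec (n ≡ m)
zero  ≟ zero   = yes refl
zero  ≟ suc m  = no (λ ())
suc n ≟ zero   = no (λ ())
suc n ≟ suc m  with n ≟ m
suc n ≟ suc .n | yes refl = yes refl
suc n ≟ suc m  | no ¬p    = no (λ eq → ¬p (sucInv eq))
```

In the yes refl clause the second argument is written suc .n: matching on refl forces m to be n, which is the specialization the text refers to.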

Regular Expressions
Regular expressions are defined with respect to a given alphabet. Formally, RE syntax is defined by the following context-free grammar:

e ::= ∅ | ε | a | e e | e + e | e⋆

where a is any symbol from the underlying alphabet. In our formalization, we represent alphabet symbols using Agda's type Char.
Datatype Regex encodes RE syntax.
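A datatype matching the constructors named in the next paragraph could be declared as follows. The constructor names follow the text; the fixity declarations and the value a•b⋆ are our own assumptions, added so that the example parses without surprises.

```agda
open import Data.Char using (Char)

infixl 5 _+_
infixl 6 _•_
infix  7 _⋆
infix  8 $_

data Regex : Set where
  ∅   : Regex                    -- empty language
  ε   : Regex                    -- empty string
  $_  : Char → Regex             -- single symbol
  _•_ : Regex → Regex → Regex    -- concatenation
  _+_ : Regex → Regex → Regex    -- union
  _⋆  : Regex → Regex            -- Kleene star

-- The RE a b⋆, written with explicit parentheses.
a•b⋆ : Regex
a•b⋆ = ($ 'a') • (($ 'b') ⋆)
```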
Constructors ∅ and ε denote respectively the empty language (∅) and the empty string (ε). Alphabet symbols are constructed by using the $ constructor. Composite REs are built using concatenation (•), union (+) and Kleene star (⋆). We define RE semantics as the inductively defined judgment s ∈ e, which means that string s is in the language denoted by RE e. Rule Eps specifies that only the empty string is accepted by RE ε, while rule Chr says that a single-character string is accepted by the RE formed by this character. The rules for concatenation and choice are straightforward. For Kleene star, we need two rules: the first specifies that the empty string is in the language of RE e⋆, and rule StarRec says that the string ss' is in the language denoted by e⋆ if s ∈ e and s' ∈ e⋆. In our formalization, we represent strings as values of type List Char and we encode the RE semantics as an inductive data type in which each constructor represents a rule of the previously defined semantics. Agda allows the overloading of constructor names. In some cases we use the same symbol both in the RE syntax and in the definition of its semantics.
Constructor ε states that the empty string (denoted by the empty list [ ]) is in the language of RE ε. For any single character a, the singleton string [ a ] is in the RL for $ a. Given parse trees for REs l and r, xs ∈ l and ys ∈ r, constructor _•_⇒_ can be used to build a parse tree for the concatenation of these REs. Constructor _+L_ (_+R_) creates a parse tree for l + r from a parse tree for l (r). Parse trees for Kleene star are built using the following well-known equivalence of REs: e⋆ = ε + e e⋆.
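One way to encode this semantics as an inductive family is sketched below. It is not the paper's listing: to keep the block unambiguous we avoid constructor overloading, so the hypothetical names eps, chr, cat, inl, inr and star stand in for the paper's overloaded constructors, and the relation is written _∈⟦_⟧. Star is handled through the unfolding ε + e • e⋆, as the text describes.

```agda
open import Data.Char using (Char)
open import Data.List using (List; []; _∷_; _++_)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

infixl 5 _+_
infixl 6 _•_
infix  7 _⋆
infix  8 $_

data Regex : Set where
  ∅   : Regex
  ε   : Regex
  $_  : Char → Regex
  _•_ : Regex → Regex → Regex
  _+_ : Regex → Regex → Regex
  _⋆  : Regex → Regex

infix 4 _∈⟦_⟧

-- xs ∈⟦ e ⟧: a value of this type is a parse tree showing that
-- the string xs is in the language of e.
data _∈⟦_⟧ : List Char → Regex → Set where
  eps  : [] ∈⟦ ε ⟧
  chr  : (a : Char) → (a ∷ []) ∈⟦ $ a ⟧
  cat  : ∀ {l r xs ys zs} → xs ∈⟦ l ⟧ → ys ∈⟦ r ⟧
       → zs ≡ xs ++ ys → zs ∈⟦ l • r ⟧
  inl  : ∀ {l xs} (r : Regex) → xs ∈⟦ l ⟧ → xs ∈⟦ l + r ⟧
  inr  : ∀ {r xs} (l : Regex) → xs ∈⟦ r ⟧ → xs ∈⟦ l + r ⟧
  star : ∀ {e xs} → xs ∈⟦ ε + (e • (e ⋆)) ⟧ → xs ∈⟦ e ⋆ ⟧

-- A parse tree showing that "ab" is in the language of a • b.
ab∈a•b : ('a' ∷ 'b' ∷ []) ∈⟦ ($ 'a') • ($ 'b') ⟧
ab∈a•b = cat (chr 'a') (chr 'b') refl
```

Storing the equation zs ≡ xs ++ ys in cat, instead of using xs ++ ys directly as the index, keeps pattern matching on concatenation proofs simple.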
Several inversion lemmas about the RE parsing relation are necessary for the formalization of derivative-based parsing. They consist of pattern matching on proofs of ∈. As an example, consider the inversion lemma for a single-character RE, charInvert.
Intuitively, function charInvert specifies that if a string x :: xs matches RE $ y then the input string must be a single-character string, i.e. xs ≡ [ ] and x ≡ y. Other inversion lemmas are defined in our formalization. They follow the same structure as charInvert, producing as a result the necessary conditions for a RE semantics proof to hold, and are omitted for brevity.
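A possible statement and proof of this inversion lemma is sketched below. The encoding of the semantics is our own assumption (constructor names eps, chr, cat, inl, inr and star are hypothetical and avoid the overloading used in the paper); only the name charInvert and its meaning come from the text.

```agda
open import Data.Char using (Char)
open import Data.List using (List; []; _∷_; _++_)
open import Data.Product using (_×_; _,_)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

infixl 5 _+_
infixl 6 _•_
infix  7 _⋆
infix  8 $_

data Regex : Set where
  ∅ ε     : Regex
  $_      : Char → Regex
  _•_ _+_ : Regex → Regex → Regex
  _⋆      : Regex → Regex

infix 4 _∈⟦_⟧

data _∈⟦_⟧ : List Char → Regex → Set where
  eps  : [] ∈⟦ ε ⟧
  chr  : (a : Char) → (a ∷ []) ∈⟦ $ a ⟧
  cat  : ∀ {l r xs ys zs} → xs ∈⟦ l ⟧ → ys ∈⟦ r ⟧
       → zs ≡ xs ++ ys → zs ∈⟦ l • r ⟧
  inl  : ∀ {l xs} (r : Regex) → xs ∈⟦ l ⟧ → xs ∈⟦ l + r ⟧
  inr  : ∀ {r xs} (l : Regex) → xs ∈⟦ r ⟧ → xs ∈⟦ l + r ⟧
  star : ∀ {e xs} → xs ∈⟦ ε + (e • (e ⋆)) ⟧ → xs ∈⟦ e ⋆ ⟧

-- Only chr can produce a proof for the RE $ y, and it forces
-- the string to be the singleton y ∷ [].
charInvert : ∀ {x y xs} → (x ∷ xs) ∈⟦ $ y ⟧ → (x ≡ y) × (xs ≡ [])
charInvert (chr _) = refl , refl
```

A single clause suffices because every other constructor has an index that cannot be unified with $ y or with a non-empty string.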

Derivatives, Smart Constructors and Parsing
Formally, the derivative of a formal language L ⊆ Σ⋆ with respect to a symbol a ∈ Σ is the language formed by the suffixes of the words of L that start with a, with that leading a removed.
An algorithm for computing the derivative of a language represented by a RE as another RE is due to Brzozowski [7]. It relies on a function (called ν) that determines whether a RE accepts the empty string (by returning ε or ∅, respectively):

ν(∅) = ∅
ν(ε) = ε
ν(a) = ∅
ν(e e') = ε if ν(e) = ν(e') = ε, and ∅ otherwise
ν(e + e') = ε if ν(e) = ε or ν(e') = ε, and ∅ otherwise
ν(e⋆) = ε

Decidability of ν(e) is proved by function ν[ ], which is defined by induction over the structure of the input RE e and returns a proof that the empty string is accepted or not, using the Agda type of decidable propositions, Dec P.
The definition of ν[ ] uses some of the inversion lemmas about RE semantics. Lemma •invert states that if the empty string is in the language of l • r (where l and r are arbitrary REs) then the empty string belongs to the languages of both l and r. Lemma +invert is defined similarly for union.
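The decision procedure and its two inversion lemmas might be written as in the sketch below. The names ν[_], •invert and +invert follow the text; everything else is an assumption of this sketch, including the encoding of the semantics (with hypothetical constructor names eps, chr, cat, inl, inr, star) and the use of the standard library's sum type for the result of +invert.

```agda
open import Data.Char using (Char)
open import Data.List using (List; []; _∷_; _++_)
open import Data.Product using (_×_; _,_; proj₁; proj₂)
open import Data.Sum using (_⊎_; inj₁; inj₂; [_,_]′)
open import Relation.Nullary using (Dec; yes; no)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

infixl 5 _+_
infixl 6 _•_
infix  7 _⋆
infix  8 $_

data Regex : Set where
  ∅ ε     : Regex
  $_      : Char → Regex
  _•_ _+_ : Regex → Regex → Regex
  _⋆      : Regex → Regex

infix 4 _∈⟦_⟧

data _∈⟦_⟧ : List Char → Regex → Set where
  eps  : [] ∈⟦ ε ⟧
  chr  : (a : Char) → (a ∷ []) ∈⟦ $ a ⟧
  cat  : ∀ {l r xs ys zs} → xs ∈⟦ l ⟧ → ys ∈⟦ r ⟧
       → zs ≡ xs ++ ys → zs ∈⟦ l • r ⟧
  inl  : ∀ {l xs} (r : Regex) → xs ∈⟦ l ⟧ → xs ∈⟦ l + r ⟧
  inr  : ∀ {r xs} (l : Regex) → xs ∈⟦ r ⟧ → xs ∈⟦ l + r ⟧
  star : ∀ {e xs} → xs ∈⟦ ε + (e • (e ⋆)) ⟧ → xs ∈⟦ e ⋆ ⟧

-- If [] splits as xs ++ ys, both parts must be empty.
•invert : ∀ {l r} → [] ∈⟦ l • r ⟧ → ([] ∈⟦ l ⟧) × ([] ∈⟦ r ⟧)
•invert (cat {xs = []}    {ys = []}    pl pr refl) = pl , pr
•invert (cat {xs = []}    {ys = _ ∷ _} _  _  ())
•invert (cat {xs = _ ∷ _}              _  _  ())

+invert : ∀ {l r xs} → xs ∈⟦ l + r ⟧ → (xs ∈⟦ l ⟧) ⊎ (xs ∈⟦ r ⟧)
+invert (inl _ p) = inj₁ p
+invert (inr _ p) = inj₂ p

-- Decide whether the empty string is in the language of e.
ν[_] : (e : Regex) → Dec ([] ∈⟦ e ⟧)
ν[ ∅ ]   = no (λ ())
ν[ ε ]   = yes eps
ν[ $ _ ] = no (λ ())
ν[ l • r ] with ν[ l ] | ν[ r ]
... | yes pl | yes pr = yes (cat pl pr refl)
... | yes _  | no ¬pr = no (λ p → ¬pr (proj₂ (•invert p)))
... | no ¬pl | _      = no (λ p → ¬pl (proj₁ (•invert p)))
ν[ l + r ] with ν[ l ] | ν[ r ]
... | yes pl | _      = yes (inl r pl)
... | no _   | yes pr = yes (inr l pr)
... | no ¬pl | no ¬pr = no (λ p → [ ¬pl , ¬pr ]′ (+invert p))
ν[ e ⋆ ] = yes (star (inl (e • (e ⋆)) eps))
```

Each no case is justified by feeding a hypothetical membership proof through the matching inversion lemma, which is why the lemmas are needed before ν[_] can be defined.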

Smart Constructors
In order to define Brzozowski derivatives, we follow Owens et al. [8]. We use smart constructors to identify equivalent REs modulo identity and nullable elements, ε and ∅, respectively. RE equivalence is denoted by e ≈ e' and is defined as usual [1]. The equivalence axioms maintained by smart constructors are:

e + ∅ ≈ e        ∅ + e ≈ e
e ∅ ≈ ∅          ∅ e ≈ ∅
e ε ≈ e          ε e ≈ e
∅⋆ ≈ ε           ε⋆ ≈ ε

These axioms are kept as invariants using functions that preserve them while building REs. As a convention, a smart constructor is named by prefixing the constructor name with a backquote. In the case of union, the definition of the smart constructor differs from the plain constructor only when one of the parameters denotes the empty language:

_`+_ : (e e' : Regex) → Regex
∅ `+ e' = e'
e `+ ∅ = e
e `+ e' = e + e'

In the case of concatenation, we need to deal with the possibilities of each parameter being empty (denoting the empty language) or the empty string. If one of them is empty (∅) the result is also empty, and the empty string is the identity for concatenation.
For Kleene star, both ∅ and ε are replaced by ε.
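The three smart constructors described above can be sketched together as follows. The union case mirrors the definition given in the text; the concatenation and star cases are our reconstruction from the prose, and the fixities and the two checks (check₁, check₂) are assumptions added for illustration.

```agda
open import Data.Char using (Char)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

infixl 5 _+_ _`+_
infixl 6 _•_ _`•_
infix  7 _⋆ _`⋆
infix  8 $_

data Regex : Set where
  ∅ ε     : Regex
  $_      : Char → Regex
  _•_ _+_ : Regex → Regex → Regex
  _⋆      : Regex → Regex

-- Union: ∅ is the identity.
_`+_ : (e e′ : Regex) → Regex
∅ `+ e′ = e′
e `+ ∅  = e
e `+ e′ = e + e′

-- Concatenation: ∅ is absorbing and ε is the identity.
_`•_ : (e e′ : Regex) → Regex
∅ `• e′ = ∅
ε `• e′ = e′
e `• ∅  = ∅
e `• ε  = e
e `• e′ = e • e′

-- Star: both ∅⋆ and ε⋆ collapse to ε.
_`⋆ : Regex → Regex
∅ `⋆ = ε
ε `⋆ = ε
e `⋆ = e ⋆

-- The simplifications hold by computation on concrete inputs.
check₁ : (∅ `+ ($ 'a')) ≡ ($ 'a')
check₁ = refl

check₂ : (($ 'a') `• ε) ≡ ($ 'a')
check₂ = refl
```

Because the catch-all clauses only reduce once the arguments are known constructors, proofs about these functions for arbitrary REs need a case split on each argument, which is the source of the long pattern-matching proofs mentioned in the introduction.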
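Since all smart constructors produce equivalent REs, they preserve the parsing relation. This property is stated as a soundness and completeness lemma for each smart constructor with respect to RE membership proofs: membership in the RE built by a smart constructor holds if and only if membership in the RE built by the corresponding plain constructor also holds. For union, for instance, xs ∈ e `+ e' holds if and only if xs ∈ e + e' also holds.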

Brzozowski Derivatives and their Properties
Intuitively, the derivative L_a of a language L with respect to a symbol a is the set of strings generated by stripping the leading a from the strings in L that start with a. Formally: L_a = {w | aw ∈ L}. Regular languages are closed under the derivative operation, and Brzozowski defined an elegant method for computing derivatives of REs [7]. Formally, the derivative of a RE e with respect to a symbol a, denoted by ∂_a(e), is defined by recursion on e's structure as follows:

∂_a(∅) = ∅
∂_a(ε) = ∅
∂_a(b) = ε if b = a, and ∅ otherwise
∂_a(e e') = ∂_a(e) e' + ν(e) ∂_a(e')
∂_a(e + e') = ∂_a(e) + ∂_a(e')
∂_a(e⋆) = ∂_a(e) e⋆

This function has an immediate translation to Agda. Notice that the derivative function uses smart constructors to quotient the resulting REs with respect to the equivalence axioms presented in Section 4.1, and the RE emptiness test. In the symbol case (constructor $), function ≟ is used to produce evidence for the equality of two Char values.
Proof. Both directions are proved by induction on the structure of e, using the soundness (completeness) lemmas for smart constructors and decidability of the emptiness test.
Proof. By induction on the structure of e using the completeness lemmas for smart constructors and decidability of the emptiness test.
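To make the recursion concrete, the sketch below gives an uncertified rendering of the derivative and of a matcher built from it. It deliberately simplifies the formalization: nullability is a Bool-valued function (nullable, a hypothetical name) instead of the proof-producing ν[ ], and no soundness or completeness proof is attached. The name ∂[_,_] follows the notation used later in the text; matches and check are assumptions of this sketch.

```agda
open import Data.Bool using (Bool; true; false; _∧_; _∨_)
open import Data.Char using (Char; _≟_)
open import Data.List using (List; []; _∷_)
open import Relation.Nullary using (yes; no)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

infixl 5 _+_ _`+_
infixl 6 _•_ _`•_
infix  7 _⋆ _`⋆
infix  8 $_

data Regex : Set where
  ∅ ε     : Regex
  $_      : Char → Regex
  _•_ _+_ : Regex → Regex → Regex
  _⋆      : Regex → Regex

-- Smart constructors (see Section 4.1).
_`+_ : (e e′ : Regex) → Regex
∅ `+ e′ = e′
e `+ ∅  = e
e `+ e′ = e + e′

_`•_ : (e e′ : Regex) → Regex
∅ `• e′ = ∅
ε `• e′ = e′
e `• ∅  = ∅
e `• ε  = e
e `• e′ = e • e′

_`⋆ : Regex → Regex
∅ `⋆ = ε
ε `⋆ = ε
e `⋆ = e ⋆

-- Boolean emptiness test (a simplification of ν[ ]).
nullable : Regex → Bool
nullable ∅        = false
nullable ε        = true
nullable ($ _)    = false
nullable (e • e′) = nullable e ∧ nullable e′
nullable (e + e′) = nullable e ∨ nullable e′
nullable (e ⋆)    = true

-- Brzozowski derivative of e with respect to the symbol a.
∂[_,_] : Regex → Char → Regex
∂[ ∅ , a ] = ∅
∂[ ε , a ] = ∅
∂[ $ b , a ] with b ≟ a
... | yes _ = ε
... | no  _ = ∅
∂[ e • e′ , a ] with nullable e
... | true  = (∂[ e , a ] `• e′) `+ ∂[ e′ , a ]
... | false = ∂[ e , a ] `• e′
∂[ e + e′ , a ] = ∂[ e , a ] `+ ∂[ e′ , a ]
∂[ e ⋆ , a ] = ∂[ e , a ] `• (e `⋆)

-- Matching: derive by each symbol, then test for the empty string.
matches : Regex → List Char → Bool
matches e []       = nullable e
matches e (x ∷ xs) = matches (∂[ e , x ]) xs

-- "abb" is matched by a b⋆.
check : matches (($ 'a') • (($ 'b') ⋆)) ('a' ∷ 'b' ∷ 'b' ∷ []) ≡ true
check = refl
```

In the check, the first derivative simplifies to b⋆ thanks to the smart constructors, each further derivative by b yields b⋆ again, and b⋆ is nullable.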

Antimirov's Partial Derivatives and its Properties
RE derivatives were introduced by Brzozowski to construct a DFA (deterministic finite automaton) from a given RE. Partial derivatives were introduced by Antimirov [17] as a method to construct an NFA (nondeterministic finite automaton). The main insight of partial derivatives for building NFAs is building a set of REs which collectively accept the same strings as Brzozowski derivatives. Algebraic properties of set operations ensure that the ACUI equations (footnote 2) hold. Function ∇_a(e) computes the set of partial derivatives of a given RE e with respect to a symbol a:

∇_a(∅) = ∅
∇_a(ε) = ∅
∇_a(b) = {ε} if b = a, and ∅ otherwise
∇_a(e e') = ∇_a(e) ⊙ e' ∪ ∇_a(e') if ν(e) = ε, and ∇_a(e) ⊙ e' otherwise
∇_a(e + e') = ∇_a(e) ∪ ∇_a(e')
∇_a(e⋆) = ∇_a(e) ⊙ e⋆

Function ∇_a(e) uses the operator S ⊙ e', which concatenates RE e' at the right of every e ∈ S, i.e. S ⊙ e' = {e • e' | e ∈ S}.
Our Agda implementation models sets as lists of regular expressions.

Regexes = List Regex
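Under this list representation, the concatenation operator and the partial derivative function might be sketched as below. The name ∇[_,_] follows the notation used later in this section; the operator symbol ⊙, the Bool-valued nullable test and the use of list append for set union are assumptions of the sketch, and no proofs are attached.

```agda
open import Data.Bool using (Bool; true; false; _∧_; _∨_)
open import Data.Char using (Char; _≟_)
open import Data.List using (List; []; _∷_; _++_)
open import Relation.Nullary using (yes; no)

infixl 5 _+_
infixl 6 _•_ _⊙_
infix  7 _⋆
infix  8 $_

data Regex : Set where
  ∅ ε     : Regex
  $_      : Char → Regex
  _•_ _+_ : Regex → Regex → Regex
  _⋆      : Regex → Regex

-- Sets of REs are modelled as lists.
Regexes : Set
Regexes = List Regex

-- Boolean emptiness test (a simplification of ν[ ]).
nullable : Regex → Bool
nullable ∅        = false
nullable ε        = true
nullable ($ _)    = false
nullable (e • e′) = nullable e ∧ nullable e′
nullable (e + e′) = nullable e ∨ nullable e′
nullable (e ⋆)    = true

-- Concatenate e at the right of every RE in the list.
_⊙_ : Regexes → Regex → Regexes
[]        ⊙ e = []
(e′ ∷ es) ⊙ e = (e′ • e) ∷ (es ⊙ e)

-- Antimirov partial derivatives of e with respect to the symbol a.
∇[_,_] : Regex → Char → Regexes
∇[ ∅ , a ] = []
∇[ ε , a ] = []
∇[ $ b , a ] with b ≟ a
... | yes _ = ε ∷ []
... | no  _ = []
∇[ e • e′ , a ] with nullable e
... | true  = (∇[ e , a ] ⊙ e′) ++ ∇[ e′ , a ]
... | false = ∇[ e , a ] ⊙ e′
∇[ e + e′ , a ] = ∇[ e , a ] ++ ∇[ e′ , a ]
∇[ e ⋆ , a ] = ∇[ e , a ] ⊙ (e ⋆)
```

Lists may contain duplicates, so this representation does not enforce idempotence of union; that does not affect which strings are accepted, only the size of the result.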
The operator that concatenates a RE at the right of every e ∈ S is defined by induction on S, and the function that computes partial derivatives for a given RE is a direct translation of the mathematical notation to Agda code. In order to prove relevant properties about partial derivatives, we define a relation that specifies when a string is accepted by some set of REs.
Footnote 2: Associativity, Commutativity and Idempotence with Unit elements axioms for REs [7].

data ∈ : List Char → Regexes → Set where
  here : s ∈ e → s ∈ e :: es
  there : s ∈ es → s ∈ e :: es

Essentially, a value of type s ∈ S indicates that s is accepted by some RE in S. The next lemmas on the membership relation s ∈ S and list concatenation are used to prove soundness and completeness of partial derivatives. Using these previous results about ∈, we can state soundness and completeness theorems for partial derivatives. Let e be an arbitrary RE and a an arbitrary symbol. Soundness means that if a string s is accepted by some RE in ∇[ e , a ] then (a :: s) ∈ e. The completeness theorem shows that the other direction of the soundness implication also holds. Both implications hold by induction on e's structure.

Parsing
Assume that we are given a RE e and a string s and we want to determine whether s ∈ e. We can use RE derivatives to build such a test by extending the definition of derivatives to strings as follows [8]:

∂_ε(e) = e
∂_{as}(e) = ∂_s(∂_a(e))

Note that s ∈ e if and only if ε ∈ ∂_s(e), which is true whenever ν(∂_s(e)) = ε. Owens et al. define a relation between REs and strings, called the matching relation, as:

e ∼ ε ⇔ ν(e) = ε
e ∼ aw ⇔ ∂_a(e) ∼ w

A simple inductive proof shows that s ∈ e if and only if e ∼ s.
For our purposes, understanding RE parsing as a matching relation is not adequate, because RE-based text search tools like GNU grep show every prefix and substring of a given input that matches a RE. Since our interest is in determining which prefixes and substrings of the input string match a given RE, we define datatypes that represent the fact that a given RE matches a prefix or a substring of some string.
We say that RE e matches a prefix of string xs if there exist strings ys and zs such that xs ≡ ys ++ zs and ys ∈ e. The IsPrefix datatype encodes this concept. Datatype IsSub specifies when a RE e matches a substring of xs: there must exist strings ys, zs and ws such that xs ≡ ys ++ zs ++ ws and zs ∈ e hold. We could represent the prefix and substring predicates using dependent products, but for code clarity we choose to define the types IsPrefix and IsSub.
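For comparison, the dependent-product alternative mentioned above could look like the following sketch. It is not part of the formalization: IsPrefix′, IsSub′ and a-prefix are hypothetical names, the existential ∃ comes from the standard library's Data.Product, and the encoding of the semantics (with constructor names eps, chr, cat, inl, inr, star) is our own assumption.

```agda
open import Data.Char using (Char)
open import Data.List using (List; []; _∷_; _++_)
open import Data.Product using (∃; _×_; _,_)
open import Relation.Binary.PropositionalEquality using (_≡_; refl)

infixl 5 _+_
infixl 6 _•_
infix  7 _⋆
infix  8 $_

data Regex : Set where
  ∅ ε     : Regex
  $_      : Char → Regex
  _•_ _+_ : Regex → Regex → Regex
  _⋆      : Regex → Regex

infix 4 _∈⟦_⟧

data _∈⟦_⟧ : List Char → Regex → Set where
  eps  : [] ∈⟦ ε ⟧
  chr  : (a : Char) → (a ∷ []) ∈⟦ $ a ⟧
  cat  : ∀ {l r xs ys zs} → xs ∈⟦ l ⟧ → ys ∈⟦ r ⟧
       → zs ≡ xs ++ ys → zs ∈⟦ l • r ⟧
  inl  : ∀ {l xs} (r : Regex) → xs ∈⟦ l ⟧ → xs ∈⟦ l + r ⟧
  inr  : ∀ {r xs} (l : Regex) → xs ∈⟦ r ⟧ → xs ∈⟦ l + r ⟧
  star : ∀ {e xs} → xs ∈⟦ ε + (e • (e ⋆)) ⟧ → xs ∈⟦ e ⋆ ⟧

-- e matches a prefix ys of xs, with zs the remaining input.
IsPrefix′ : List Char → Regex → Set
IsPrefix′ xs e = ∃ λ ys → ∃ λ zs → (xs ≡ ys ++ zs) × (ys ∈⟦ e ⟧)

-- e matches a substring zs of xs, between ys and ws.
IsSub′ : List Char → Regex → Set
IsSub′ xs e =
  ∃ λ ys → ∃ λ zs → ∃ λ ws → (xs ≡ ys ++ zs ++ ws) × (zs ∈⟦ e ⟧)

-- The RE a matches the prefix "a" of the string "ab".
a-prefix : IsPrefix′ ('a' ∷ 'b' ∷ []) ($ 'a')
a-prefix = ('a' ∷ []) , ('b' ∷ []) , refl , chr 'a'
```

The nested tuples carry the same information as the Prefix and Sub constructors, but pattern matching on them is noisier, which is consistent with the stated preference for dedicated datatypes.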
data IsPrefix (xs : List Char) (e : Regex) : Set where
  Prefix : xs ≡ ys ++ zs → ys ∈ e → IsPrefix xs e

data IsSub (xs : List Char) (e : Regex) : Set where
  Sub : xs ≡ ys ++ zs ++ ws → zs ∈ e → IsSub xs e

Using these datatypes we can state and prove the following relevant properties of prefixes and substrings, which are immediate consequences of these definitions.
Function IsPrefixDec decides whether a given RE e matches a prefix of xs by induction on the structure of xs, using Lemmas 9 and 10, the decidable emptiness test ν[ ] and Theorem 1. Intuitively, IsPrefixDec first checks whether the current RE e accepts the empty string. In this case, [ ] is returned as a prefix. Otherwise, it takes the first symbol x of the input and verifies whether RE ∂[ e , x ] matches a prefix of the remaining input. If this is the case, a prefix including x is built from the result of the recursive call to IsPrefixDec; if no prefix is matched, a proof of such impossibility is constructed using Lemma 10.

... | no ¬p with IsPrefixDec xs (∂[ e , x ])
... | no ¬p | (yes (Prefix ys zs eq wit)) = yes (Prefix (x :: ys) zs (cong (_::_ x) eq) (∂-sound wit))
... | no ¬pn | (no ¬p) = no (¬IsPrefix-:: ¬pn ¬p)

Function IsSubDec is also defined by induction on the structure of the input string xs, using IsPrefixDec to check whether it is possible to match a prefix of xs with e. In this case, a substring is built from this prefix. If there is no such prefix, a recursive call is made to check whether there is a substring match, returning such a substring or a proof that it does not exist.
The functions defined so far for computing prefixes and substrings use Brzozowski derivatives. Functions for building prefixes and substrings using Antimirov's partial derivatives are similar to the Brzozowski-derivative-based ones. The main differences between them are in the lemmas used to prove decidability of the prefix and substring relations. Such lemmas are slightly modified versions of Lemmas 9 and 10 that consider the relation ∈ on sets of REs, and are omitted for brevity.

Implementation Details and Experiments
From the formalized algorithm we built a tool for RE parsing in the style of GNU grep [3]. We have built a simple parser combinator library for parsing RE syntax, using the Agda Standard Library and its support for calling Haskell functions through its foreign function interface. Experimentation with our tool involved a comparison of its performance with GNU grep [3] (grep), Google's regular expression library re2 [18] and the Haskell RE parsing algorithms haskell-regexp, described in [10]. We ran the RE parsing experiments on a machine with an Intel Core i7 1.7 GHz and 8 GB of RAM running Mac OS X 10.12.3; the results were collected and the median of several test runs was computed.
We used the same experiments as those used in [19]; these consist of parsing files containing thousands of occurrences of symbol a, using the RE (a + b + ab)⋆, and parsing files containing thousands of occurrences of ab, using the same RE. Results are presented in Figures 1 and 2, respectively.
When compared with highly optimized tools like grep or Google's re2 library, our tool performs poorly, but its performance is similar to that of the algorithms developed in [10]. The cause of this inefficiency needs further investigation, but we suspect that it is due to the following: 1) our algorithm relies on Brzozowski's definition for RE parsing, which needs to quotient the resulting REs; 2) we use lists to represent sets of Antimirov's partial derivatives. We believe that better data structures for representing sets, together with appropriate disambiguation strategies like greedy parsing [20] and POSIX [19], would improve the efficiency of our algorithm without sacrificing correctness. We leave the formalization of disambiguation strategies and the use of more efficient data structures for future work.

Related Work
Recently, derivative-based parsing has received a lot of attention. Owens et al. were the first to present a functional encoding of RE derivatives and use it for parsing and DFA building. They use derivatives to build scanner generators for ML and Scheme [8]; no formal proof of correctness was presented.
Might et al. [9] report on the use of derivatives for parsing not only RLs but also context-free languages. They use derivatives to handle context-free grammars (CFGs) and develop an equational theory for compaction that allows for efficient CFG parsing using derivatives. Implementation of derivatives for CFGs are described
Fischer et al. describe an algorithm for RE-based parsing based on weighted automata in Haskell [10]. The paper describes the design evolution of this algorithm as a dialog between three people. Their implementation has competitive performance when compared with Google's RE library [18]. This work also does not consider formal proofs of RE parsing.
An algorithm for POSIX RE parsing is described in [19]. The main idea of the article is to adapt derivative parsing to construct parse trees incrementally to solve both matching and submatching for REs. In order to improve the efficiency of the proposed algorithm, Sulzmann et al. use a bit encoded representation of RE parse trees. Textual proofs of correctness of the proposed algorithm are presented in an appendix.
Certified algorithms for parsing have also received attention recently. Firsov et al. describe a certified algorithm for RE parsing by converting an input RE to an equivalent NFA represented as a boolean matrix [4]. A matrix library based on some "block" operations [22] was developed and used to construct an Agda formalization of NFA-based parsing [11]. Compared to our work, a NFA-based formalization requires much more infrastructure (such as a matrix library). No experiments with the certified algorithm were reported.
Almeida et al. [23] describe a Coq formalization of partial derivatives and their equivalence with automata. Partial derivatives were introduced by Antimirov [24] as an alternative to Brzozowski derivatives, since they avoid the need to quotient the resulting REs with respect to the ACUI axioms. Almeida et al.'s motivation is to use this formalization as a basis for a decision procedure for RE equivalence.
Ausaf et al. [25] describe a formalization, in Isabelle/HOL [26], of the POSIX matching algorithm proposed by Sulzmann et al. [19]. They give a constructive characterization of what a POSIX matching is and prove that such a matching is unique for a given RE and string. No experiments with the verified algorithm are reported.
Ribeiro and Du Bois [27] describe the formalization, in Agda, of a RE parsing algorithm that produces a bit representation of its parse tree. The algorithm computes bit-codes using Brzozowski derivatives, and they prove that the produced codes are equivalent to parse trees, ensuring soundness and completeness w.r.t. an inductive RE semantics. Like our work, Ribeiro and Du Bois developed a tool for RE matching and ran some experiments with it, but they did not consider partial derivatives to produce parsing evidence (in their case, bit-codes and proofs that such codes are equivalent to RE parse trees).
Lopes et al. [28] describe an Idris formalization of a RE parsing tool using Brzozowski's derivatives. Like our work, they proved both soundness and completeness lemmas for the derivative operation and used datatypes for denoting prefix and substring proofs for a given input RE and string. Lopes et al. [28] also mention that they use natural numbers in Peano notation to represent alphabet symbols as their respective ASCII codes, instead of using Idris's type for characters. According to the authors, the reason for this design choice is the way that Idris deals with propositional equality for primitive types, like Char. Equalities of values of these types only reduce on concrete primitive values; this causes the computation of proofs to stop under variables whose type is a primitive one [28].

Conclusion
In this work, we describe a complete formalization of derivative-based parsing for REs in Agda. Our tool supports algorithms based on Brzozowski's derivatives and Antimirov's partial derivatives to find all prefixes and substrings that match a given input RE. The developed formalization has 1145 lines of code, organized in 20 modules. We have proven 39 theorems and lemmas to complete the development. Most of them are immediate pattern-matching functions over inductive datatypes and were omitted from this text for brevity. The complete Agda formalization and instructions on how to build and use it are available in the project's online repository [12].
As future work, we intend to develop certified programs for greedy and POSIX RE parsing using derivatives [19,20] and to investigate ways to obtain a formalized but simple and efficient RE parsing tool.