Parser combinators are straightforward to construct, ‘readable’, modular, well-structured and easily maintainable. They have been used extensively in the prototyping of compilers and processors for domain-specific languages such as natural language interfaces to databases, where complex and varied semantic actions are closely integrated with syntactic processing. In 1989, Richard Frost and John Launchbury demonstrated use of parser combinators to construct Natural Language interpreters. Graham Hutton also used higher-order functions for basic parsing in 1992. S.D. Swierstra also exhibited the practical aspects of parser combinators in 2001. In 2008, Frost, Hafiz and Callaghan described a set of parser-combinators in Haskell that solve the long standing problem of accommodating left-recursion, and work as a complete top-down parsing tool in polynomial time and space.
In functional programming, parser combinators can be used to build basic parsers and to construct complex parsers for rules (that define nonterminals) from other parsers. A production-rule of a context-free grammar (CFG) may have one or more ‘alternatives’ and each alternative may consist of a sequence of non-terminal(s) and/or terminal(s), or the alternative may consist of a single non-terminal or terminal or ‘empty’. Parser combinators allow parsers to be defined in an embedded style, in code which is similar in structure to the rules of the grammar. As such, implementations can be thought of as executable specifications with all of the associated advantages. In order to achieve this, one has to define a set of combinators or infix operators to ‘glue’ different terminals and non-terminals to form a complete rule.
#input the members of which are accessed through an index j. Recognizers are functions which take an index j as argument and which return a set of indices. Each index in the result set corresponds to a position at which the parser successfully finished recognizing a sequence of tokens that began at position j. An empty result set indicates that the recognizer failed to recognize any sequence beginning at j. A non-empty result set indicates the recognizer ends at different positions successfully. (Note that, as we are defining results as a set, we cannot express ambiguity as it would require repeated entries in the set. Use of ‘list’ would solve the problem.) Following the definitions of two basic recognizers for terminals, we define two major combinators for alternative and sequencing:
empty recognizer is a function which always succeeds returning a singleton set containing the current position:term ’x’ for a terminal x is a function which takes an index j as input, and if j is less than #input and if the token at position j in the input corresponds to the terminal x, it returns a singleton set containing j + 1, otherwise it returns the empty set.
<+>, which is used as an infix operator between two recognizers p and q. The <+> applies both of the recognizers on the same input position j and sums up the results returned by both of the recognizers, which is eventually returned as the final result.
*> combinator. Like <+>, it is also used as an infix operator between two recognizers – p and q. But it applies the first recognizer p to the input position j, and if there is any successful result of this application, then the second recognizer q is applied to every element of the result set returned by the first recognizer. The *> ultimately returns the union of these applications of q.s ::= ‘x’ s s | ɛ. Using the combinators defined earlier, we can modularly define executable notations of this grammar in a modern functional language (e.g. Haskell) as s = term ‘x’ *> s *> s <+> empty. When the recognizer s is applied on an input sequence xxxx at position 1, according to the above definitions it would return a result set {5,4,3,2,1}. Note that in real implementation if result set is defined as data type that supports repetition (i.e. list), then we can have the resulting list with all possible ambiguous results like [5, 4, 3, 2, 1,…., 5, 4, 3, 2,……] .Like any top-down recursive descent parsing, the conventional parser combinators (like the combinators described above)won’t terminate while processing a left-recursive grammar (i.e. s ::= s *> s *> term ‘x’|empty). A recognition algorithm that accommodates ambiguous grammars with direct left-recursive rules is described by Frost and Hafiz in 2006. The algorithm curtails the otherwise ever-growing left-recursive parse by imposing depth restrictions. That algorithm was extended to a complete parsing algorithm to accommodate indirect as well as direct left-recursion in polynomial time, and to generate compact polynomial-size representations of the potentially-exponential number of parse trees for highly-ambiguous grammars by Frost, Hafiz and Callaghan in 2007. This extended algorithm accommodates indirect left-recursion by comparing its ‘computed-context’ with ‘current-context’. The same authors also described their implementation of a set of parser combinators written in the Haskell programming language based on the same algorithm. The X-SAIGA site has more about the algorithms and implementation details.