In this section we take a short break from introducing new language features in order to show how to use currying, higher order functions, partial application, fates and infix notation to build a concise recursive-descent backtracking parser.
Combinator parsing is a pretty technique well worth learning in and of itself; it is also an excellent exercise in simplifying code by thinking functionally.
Today parsers generated by tools like yacc are usually used when generating parsers for programming languages. However, recursive descent parsers still have advantages in fields like natural language processing, where the grammar may be not be LALR(1), or where it may be necessary to work with ambiguous grammars, returning all possible parses of a sentence and then using semantic constraints to select the most probable one.
Here are the parser rules for a small fragment of English:
verb = match [ "eats", "throws", "eat", "throw" ]; noun = match [ "boy", "girl", "apple", "ball" ]; article = match [ "the", "a" ]; adjective = match [ "big", "little", "good", "bad" ]; adverb = match [ "quickly", "slowly" ]; qualified_noun = noun | adjective & noun; qualified_verb = verb | adverb & verb; noun_phrase = qualified_noun | article & qualified_noun; sentence = ( noun_phrase & qualified_verb & noun_phrase # "The little boy quickly throws the ball" | qualified_verb & noun_phrase # "Eat the apple" | noun_phrase & qualified_verb # "The girl slowly eats" | qualified_verb # "Eat" );
That code is straight Mythryl, although it may not seem so at first glance. To make those rules work, we must first define a few support functions.
We start by defining a binary-tree datastructure:
Parsetree = PAIR (Parsetree, Parsetree) | TOKEN String ;
This will hold the syntax tree generated by our parser.
As support we define a simple function to print out our parsetrees:
fun parsetree_to_string (TOKEN string) => string; parsetree_to_string (PAIR (parsetree1, parsetree2)) => sprintf "(%s %s)" (parsetree_to_string parsetree1) (parsetree_to_string parsetree2); end;
The elements of our parser are parse functions, each of which attempts to match some pattern such as a verb phrase or noun phrase against part of the input string. Our parse functions have the type
Parse_Function = Success_Fate -> Failure_Fate -> List(String) -> Void;
We use fate passing to handle backtracking: We call the Success_Fate function if our parse function succeeds in matching its assigned pattern at the current location within the input text, otherwise we call the Failure_Fate. The final List(String) argument is the list of input tokens yet to be parsed.
The fate function types are:
Failure_Fate = Void -> Void; Success_Fate = Parsetree -> # Parsetree for substring just matched. Failure_Fate -> List(String) -> # Input tokens not yet matched. Void;
We begin with a parse function which attempts to match the next input token against a list of words:
in = list::in; # ``word in [ "abc", "def" ]'' is TRUE iff word == "abc" or word == "def". fun match words success_fate failure_fate [] : Void => failure_fate (); # No token to match. match words success_fate failure_fate (token ! tokens) => if (string::to_lower(token) in words) success_fate (TOKEN token) failure_fate tokens; else failure_fate (); # Next token does not match. fi; end;
This function is reasonably straightforward. If the next input token matches one of our words we construct a TOKEN syntax tree node representing our successfully matched one-word syntax subtree and pass it to our success_fate, otherwise we call our failure_fate.
Next we define an "and" function which matches two patterns consecutively in the input. This function takes as input two parse functions describing the subpatterns, and returns a parse function which will match their concatenation.
Because they build new parse functions by combining existing parse functions such functions are often called combining forms, or simply combinators.
The type of our combinator is
(Parser, Parser) -> Parser
To improve the readability of our grammar rules, we use the infix operator & to name this function:
fun parse_fn_1 & parse_fn_2 = \\ success_fate = parse_fn_1 (\\ parsetree_1 = parse_fn_2 (\\ parsetree_2 = success_fate (PAIR (parsetree_1, parsetree_2)) ) );
This function is also quite straightforward. We call parser_1 with a success fate which calls parser_2 with a success fate which constructs a syntax tree PAIR node combining the two syntax subtrees they construct for it and then passes that PAIR node to our own original success fate.
Note how we use partial application of curried functions to simplify the code. Both parser_1 and parser_2 take more arguments than explicitly shown. We could have written the same function as
fun parse_fn_1 & parse_fn_2 = \\ success_fate = \\ failure_fate = \\ tokens = parse_fn_1 success_fate_1 failure_fate tokens where fun success_fate_1 parsetree_1 failure_fate tokens = parse_fn_2 success_fate_2 failure_fate tokens where fun success_fate_2 parsetree_2 failure_fate tokens = success_fate (PAIR (parsetree_1, parsetree_2)) failure_fate tokens; end; end;
By using partial application we have cut our code in half.
Next we define a complementary "or" function which matches in the input either one of two given parse functions. Once again, to improve readability, we give it a compact infix name instead of a conventional prefix alphabetic name:
fun parse_fn_1 | parse_fn_2 = \\ success_fate = \\ failure_fate = \\ tokens = parse_fn_1 success_fate_1 failure_fate_1 tokens where fun success_fate_1 parsetree ignored_failure_fate tokens = success_fate parsetree failure_fate tokens; fun failure_fate_1 () = parse_fn_2 success_fate failure_fate tokens; end;
This function is much like the preceding one, except that here we synthesize a failure fate as well as a success fate.
We now have all the machinery in place for the grammar rule functions illustrated at the top of this section:
verb = match [ "eats", "throws", "eat", "throw" ]; noun = match [ "boy", "girl", "apple", "ball" ]; article = match [ "the", "a" ]; adjective = match [ "big", "little", "good", "bad" ]; adverb = match [ "quickly", "slowly" ]; qualified_noun = noun | adjective & noun; qualified_verb = verb | adverb & verb; noun_phrase = qualified_noun | article & qualified_noun; sentence = ( noun_phrase & qualified_verb & noun_phrase # "The little boy quickly throws the ball" | qualified_verb & noun_phrase # "Eat the apple" | noun_phrase & qualified_verb # "The girl slowly eats" | qualified_verb # "Eat" );
Note how we once again use partial application of curried functions to keep the code concise. For example the first rule above can be written
fun verb success_fate failure_fate tokens = match [ "eats", "throws" ] success_fate failure_fate tokens;
but if we do that with all the rules they will be much harder to read and maintain.
Here are the final four functions needed to produce a functioning mini-parser for our fragment of English:
fun parse string = sentence toplevel_success_fate toplevel_failure_fate (string_to_words string) where fun toplevel_success_fate parsetree failure_fate tokens = printf "Successful parse: %s\n" (parsetree_to_string parsetree); fun toplevel_failure_fate () = print "No parse found.\n"; string_to_words = string::tokens char::is_space; end;
Putting it all together, here is our complete parser package:
package parse1 { in = list::in; Parsetree = PAIR (Parsetree, Parsetree) | TOKEN String ; fun parsetree_to_string (TOKEN string) => string; parsetree_to_string (PAIR (parsetree1, parsetree2)) => sprintf "(%s %s)" (parsetree_to_string parsetree1) (parsetree_to_string parsetree2); end; # A parse function which matches any word in a given list: # fun match words success_fate failure_fate [] : Void => failure_fate (); # No token to match. match words success_fate failure_fate (token ! tokens) => if (string::to_lower(token) in words) success_fate (TOKEN token) failure_fate tokens; else failure_fate (); # Next token does not match. fi; end; # An 'and' parse combinator which requires that # the two given parse functions match successive # portions of the 'tokens' input: # fun parse_fn_1 & parse_fn_2 = \\ success_fate = parse_fn_1 (\\ parsetree_1 = parse_fn_2 (\\ parsetree_2 = success_fate (PAIR (parsetree_1, parsetree_2)) ) ); # An 'or' parse combinator which requires that # one of the two given parse functions # match a prefix of 'tokens': # fun parse_fn_1 | parse_fn_2 = \\ success_fate = \\ failure_fate = \\ tokens = parse_fn_1 success_fate_1 failure_fate_1 tokens where fun success_fate_1 parsetree ignored_failure_fate tokens = success_fate parsetree failure_fate tokens; fun failure_fate_1 () = parse_fn_2 success_fate failure_fate tokens; end; # Now a simple grammar for a small fragment of English: # verb = match [ "eats", "throws", "eat", "throw" ]; noun = match [ "boy", "girl", "apple", "ball" ]; article = match [ "the", "a" ]; adjective = match [ "big", "little", "good", "bad" ]; adverb = match [ "quickly", "slowly" ]; qualified_noun = noun | adjective & noun; qualified_verb = verb | adverb & verb; noun_phrase = qualified_noun | article & qualified_noun; sentence = ( noun_phrase & qualified_verb & noun_phrase # "The little boy quickly throws the ball" | qualified_verb & noun_phrase # "Eat the apple" | noun_phrase & qualified_verb # "The girl slowly eats" | qualified_verb # "Eat" ); # Finally, a toplevel function to drive it all: # fun parse string = sentence toplevel_success_fate toplevel_failure_fate (string_to_words string) where fun toplevel_success_fate parsetree failure_fate tokens = printf "Successful parse: %s\n" (parsetree_to_string parsetree); fun toplevel_failure_fate () = print "No parse found.\n"; string_to_words = string::tokens char::is_space; end; };
This code is in src/app/tut/combinator-parsing/parse1.pkg in the Mythryl source code distribution. You may try it out by doing
linux$ cd src/app/tut/combinator-parsing linux$ my eval: make "parse1.lib"; eval: parse1::parse "The boy quickly throws the little ball"; Successful parse: (((The boy) (quickly throws)) (the (little ball)))
Further conciseness:
At the risk of gilding the lily, recall that the Mythryl backticks operator is redefinable. For example, we can do:
linux$ my eval: fun backticks__op string = string::tokens char::is_space string; eval: `abc def`; ["abc", "def"]
Of course, now that we are comfortable with partially applied curried functions, we are more likely to write just:
linux$ my eval: backticks__op = string::tokens char::is_space; eval: `abc def`; ["abc", "def"]
Either way, this lets us replace
verb = match [ "eats", "throws", "eat", "throw" ]; noun = match [ "boy", "girl", "apple", "ball" ]; article = match [ "the", "a" ]; adjective = match [ "big", "little", "good", "bad" ]; adverb = match [ "quickly", "slowly" ];
by simply
verb = match `eats throws eat throw`; noun = match `boy girl apple ball`; article = match `the a`; adjective = match `big little good bad`; adverb = match `quickly slowly`;
If we wrap match into back__ticks by
fun backticks__op string = match (string::tokens char::is_space string);
we can abbreviate further to just
verb = `eats throws eat throw`; noun = `boy girl apple ball`; article = `the a`; adjective = `big little good bad`; adverb = `quickly slowly`;
Whether this is splendidly concise or dreadfully obscure depends on your taste and situation. Mythryl gives you the tools; how to use them is your decision.
Further reading:
Even Higher Order Functions for Parsing
Higher Order Functions for Parsing