Mythryl Functions: Parsing Combinators I

5.4.7 Mythryl Functions: Parsing Combinators I

In this section we take a short break from introducing new language features in order to show how to use currying, higher order functions, partial application, fates and infix notation to build a concise recursive-descent backtracking parser.

Combinator parsing is a pretty technique well worth learning in and of itself; it is also an excellent exercise in simplifying code by thinking functionally.

Today parsers generated by tools like yacc are usually used when generating parsers for programming languages. However, recursive descent parsers still have advantages in fields like natural language processing, where the grammar may be not be LALR(1), or where it may be necessary to work with ambiguous grammars, returning all possible parses of a sentence and then using semantic constraints to select the most probable one.

Here are the parser rules for a small fragment of English:

    verb      =  match [ "eats", "throws", "eat", "throw" ];
    noun      =  match [ "boy", "girl", "apple", "ball"   ];
    article   =  match [ "the", "a"                       ];
    adjective =  match [ "big", "little", "good", "bad"   ];
    adverb    =  match [ "quickly", "slowly"              ];

    qualified_noun =   noun   |   adjective  &  noun;
    qualified_verb =   verb   |   adverb     &  verb;

    noun_phrase    =             qualified_noun
                   | article  &  qualified_noun;

    sentence
        =
        ( noun_phrase  &  qualified_verb  &  noun_phrase     # "The little boy quickly throws the ball"
        |                 qualified_verb  &  noun_phrase     # "Eat the apple"
        | noun_phrase  &  qualified_verb                     # "The girl slowly eats"
        |                 qualified_verb                     # "Eat"
        );

That code is straight Mythryl, although it may not seem so at first glance. To make those rules work, we must first define a few support functions.

We start by defining a binary-tree datastructure:

    Parsetree = PAIR (Parsetree, Parsetree)
              | TOKEN String
              ;

This will hold the syntax tree generated by our parser.

As support we define a simple function to print out our parsetrees:

    fun parsetree_to_string (TOKEN string)
            =>
            string;

        parsetree_to_string (PAIR (parsetree1, parsetree2))
            =>
            sprintf "(%s %s)"
                (parsetree_to_string  parsetree1)
                (parsetree_to_string  parsetree2);
    end;

The elements of our parser are parse functions, each of which attempts to match some pattern such as a verb phrase or noun phrase against part of the input string. Our parse functions have the type

    Parse_Function
        =
        Success_Fate -> Failure_Fate -> List(String) -> Void;

We use fate passing to handle backtracking: We call the Success_Fate function if our parse function succeeds in matching its assigned pattern at the current location within the input text, otherwise we call the Failure_Fate. The final List(String) argument is the list of input tokens yet to be parsed.

The fate function types are:

    Failure_Fate
        =
        Void -> Void;

    Success_Fate
        =
        Parsetree            ->         # Parsetree for substring just matched.
        Failure_Fate ->
        List(String)         ->         # Input tokens not yet matched.
        Void;

We begin with a parse function which attempts to match the next input token against a list of words:

    in = list::in;      # ``word in [ "abc", "def" ]'' is TRUE iff word == "abc" or word == "def".

    fun match  words  success_fate  failure_fate  []   : Void
            =>
            failure_fate  ();                                               # No token to match.

        match  words  success_fate  failure_fate (token ! tokens)
            =>
            if (string::to_lower(token) in words)

                 success_fate  (TOKEN token) failure_fate  tokens;
            else
                 failure_fate  ();                                          # Next token does not match.
            fi;     
    end;

This function is reasonably straightforward. If the next input token matches one of our words we construct a TOKEN syntax tree node representing our successfully matched one-word syntax subtree and pass it to our success_fate, otherwise we call our failure_fate.

Next we define an "and" function which matches two patterns consecutively in the input. This function takes as input two parse functions describing the subpatterns, and returns a parse function which will match their concatenation.

Because they build new parse functions by combining existing parse functions such functions are often called combining forms, or simply combinators.

The type of our combinator is

    (Parser, Parser) -> Parser

To improve the readability of our grammar rules, we use the infix operator & to name this function:

    fun parse_fn_1 & parse_fn_2
        =
        \\  success_fate
            =
            parse_fn_1
                (\\ parsetree_1
                    =
                    parse_fn_2
                        (\\ parsetree_2
                            =
                            success_fate  (PAIR (parsetree_1, parsetree_2))
                        )
                );

This function is also quite straightforward. We call parser_1 with a success fate which calls parser_2 with a success fate which constructs a syntax tree PAIR node combining the two syntax subtrees they construct for it and then passes that PAIR node to our own original success fate.

Note how we use partial application of curried functions to simplify the code. Both parser_1 and parser_2 take more arguments than explicitly shown. We could have written the same function as

    fun parse_fn_1 & parse_fn_2
        =
        \\  success_fate
            =
        \\  failure_fate
            =
        \\  tokens
            =
            parse_fn_1  success_fate_1  failure_fate  tokens
            where
                fun success_fate_1  parsetree_1  failure_fate  tokens
                    =
                    parse_fn_2  success_fate_2  failure_fate  tokens
                    where
                        fun success_fate_2  parsetree_2  failure_fate  tokens
                            =
                            success_fate  (PAIR (parsetree_1, parsetree_2))  failure_fate  tokens;
                    end;
            end;

By using partial application we have cut our code in half.

Next we define a complementary "or" function which matches in the input either one of two given parse functions. Once again, to improve readability, we give it a compact infix name instead of a conventional prefix alphabetic name:

    fun parse_fn_1 | parse_fn_2
        =
        \\  success_fate
            =
        \\  failure_fate
            =
        \\  tokens
            =
            parse_fn_1  success_fate_1  failure_fate_1  tokens
            where
                fun success_fate_1  parsetree   ignored_failure_fate  tokens
                    =
                    success_fate  parsetree   failure_fate  tokens;

                fun failure_fate_1 ()
                    =
                    parse_fn_2  success_fate  failure_fate  tokens;
            end;

This function is much like the preceding one, except that here we synthesize a failure fate as well as a success fate.

We now have all the machinery in place for the grammar rule functions illustrated at the top of this section:

    verb      =  match [ "eats", "throws", "eat", "throw" ];
    noun      =  match [ "boy", "girl", "apple", "ball"   ];
    article   =  match [ "the", "a"                       ];
    adjective =  match [ "big", "little", "good", "bad"   ];
    adverb    =  match [ "quickly", "slowly"              ];

    qualified_noun =   noun   |   adjective  &  noun;
    qualified_verb =   verb   |   adverb     &  verb;

    noun_phrase    =             qualified_noun
                   | article  &  qualified_noun;

    sentence
        =
        ( noun_phrase  &  qualified_verb  &  noun_phrase     # "The little boy quickly throws the ball"
        |                 qualified_verb  &  noun_phrase     # "Eat the apple"
        | noun_phrase  &  qualified_verb                     # "The girl slowly eats"
        |                 qualified_verb                     # "Eat"
        );

Note how we once again use partial application of curried functions to keep the code concise. For example the first rule above can be written

    fun verb  success_fate  failure_fate  tokens
        =
        match [ "eats", "throws" ]  success_fate  failure_fate  tokens;

but if we do that with all the rules they will be much harder to read and maintain.

Here are the final four functions needed to produce a functioning mini-parser for our fragment of English:

    fun parse string
        =
        sentence
            toplevel_success_fate
            toplevel_failure_fate
            (string_to_words  string)

         where

            fun toplevel_success_fate  parsetree  failure_fate  tokens
                =
                printf "Successful parse: %s\n" (parsetree_to_string  parsetree);


            fun toplevel_failure_fate  ()
                =
                print  "No parse found.\n";


            string_to_words =  string::tokens  char::is_space;
        end;

Putting it all together, here is our complete parser package:

    package parse1 {

        in = list::in;

        Parsetree = PAIR (Parsetree, Parsetree)
                  | TOKEN String
                  ;

        fun parsetree_to_string (TOKEN string)
                =>
                string;

            parsetree_to_string (PAIR (parsetree1, parsetree2))
                =>
                sprintf "(%s %s)"
                    (parsetree_to_string  parsetree1)
                    (parsetree_to_string  parsetree2);
        end;



        # A parse function which matches any word in a given list:
        #
        fun match  words  success_fate  failure_fate  []   : Void
                =>
                failure_fate  ();                                               # No token to match.

            match  words  success_fate  failure_fate (token ! tokens)
                =>
                if (string::to_lower(token) in words)

                     success_fate  (TOKEN token) failure_fate  tokens;
                else
                     failure_fate  ();                                          # Next token does not match.
                fi;         
        end;


        # An 'and' parse combinator which requires that
        # the two given parse functions match successive
        # portions of the 'tokens' input:
        #
        fun parse_fn_1 & parse_fn_2
            =
            \\  success_fate
                =
                parse_fn_1
                    (\\ parsetree_1
                        =
                        parse_fn_2
                            (\\ parsetree_2
                                =
                                success_fate  (PAIR (parsetree_1, parsetree_2))
                            )
                    );


        # An 'or' parse combinator which requires that
        # one of the two given parse functions
        # match a prefix of 'tokens':
        #
        fun parse_fn_1 | parse_fn_2
            =
            \\  success_fate
                =
            \\  failure_fate
                =
            \\  tokens
                =
                parse_fn_1  success_fate_1  failure_fate_1  tokens
                where
                    fun success_fate_1  parsetree   ignored_failure_fate  tokens
                        =
                        success_fate  parsetree   failure_fate  tokens;

                    fun failure_fate_1 ()
                        =
                        parse_fn_2  success_fate  failure_fate  tokens;
                end;


        # Now a simple grammar for a small fragment of English:
        #
        verb      =  match [ "eats", "throws", "eat", "throw" ];
        noun      =  match [ "boy", "girl", "apple", "ball"   ];
        article   =  match [ "the", "a"                       ];
        adjective =  match [ "big", "little", "good", "bad"   ];
        adverb    =  match [ "quickly", "slowly"              ];

        qualified_noun =   noun   |   adjective  &  noun;
        qualified_verb =   verb   |   adverb     &  verb;

        noun_phrase    =             qualified_noun
                       | article  &  qualified_noun;

        sentence
            =
            ( noun_phrase  &  qualified_verb  &  noun_phrase     # "The little boy quickly throws the ball"
            |                 qualified_verb  &  noun_phrase     # "Eat the apple"
            | noun_phrase  &  qualified_verb                     # "The girl slowly eats"
            |                 qualified_verb                     # "Eat"
            );


        # Finally, a toplevel function to drive it all:
        #
        fun parse string
            =
            sentence
                toplevel_success_fate
                toplevel_failure_fate
                (string_to_words  string)

            where

                fun toplevel_success_fate  parsetree  failure_fate  tokens
                    =
                    printf "Successful parse: %s\n" (parsetree_to_string  parsetree);


                fun toplevel_failure_fate  ()
                    =
                    print  "No parse found.\n";


                string_to_words =  string::tokens  char::is_space;
            end;
    };

This code is in src/app/tut/combinator-parsing/parse1.pkg in the Mythryl source code distribution. You may try it out by doing

    linux$ cd src/app/tut/combinator-parsing
    linux$ my
    eval:  make "parse1.lib";
    eval:  parse1::parse "The boy quickly throws the little ball";
    Successful parse: (((The boy) (quickly throws)) (the (little ball)))

Further conciseness:

At the risk of gilding the lily, recall that the Mythryl backticks operator is redefinable. For example, we can do:

    linux$ my
    eval:  fun backticks__op string  =  string::tokens  char::is_space  string;
    eval:  `abc def`;

    ["abc", "def"]

Of course, now that we are comfortable with partially applied curried functions, we are more likely to write just:

    linux$ my
    eval:  backticks__op  =  string::tokens  char::is_space;
    eval:  `abc def`;

    ["abc", "def"]

Either way, this lets us replace

        verb      =  match [ "eats", "throws", "eat", "throw" ];
        noun      =  match [ "boy", "girl", "apple", "ball"   ];
        article   =  match [ "the", "a"                       ];
        adjective =  match [ "big", "little", "good", "bad"   ];
        adverb    =  match [ "quickly", "slowly"              ];

by simply

        verb      =  match `eats throws eat throw`;
        noun      =  match `boy girl apple ball`;
        article   =  match `the a`;
        adjective =  match `big little good bad`;
        adverb    =  match `quickly slowly`;

If we wrap match into back__ticks by

    fun backticks__op string  =  match  (string::tokens char::is_space string);

we can abbreviate further to just

        verb      =  `eats throws eat throw`;
        noun      =  `boy girl apple ball`;
        article   =  `the a`;
        adjective =  `big little good bad`;
        adverb    =  `quickly slowly`;

Whether this is splendidly concise or dreadfully obscure depends on your taste and situation. Mythryl gives you the tools; how to use them is your decision.

Further reading:

Even Higher Order Functions for Parsing
Higher Order Functions for Parsing

Comments and suggestions to: bugs@mythryl.org