5.4.10 More Regular Expressions

So far we have seen how to do string =~ regex matching and regex::replace_all substitutions. It is time to explore some of the other functions exported by package regex per the Regular_Expression_Matcher API.

The regex::find_first_match_to_regex function returns the first substring matching a regular expression, wrapped in THE, or NULL if no match is found:

    linux$ my
    eval:  regex::find_first_match_to_regex ./f.t/ "the fat father futzed";
    THE "fat"

The regex::find_all_matches_to_regex function returns all substrings matching a regular expression:

    linux$ my
    eval:  regex::find_all_matches_to_regex ./f.t/ "the fat father futzed";
    ["fat", "fat", "fut"]

Thus, recalling that in Perl regular expressions \w matches word constituents and \b matches at word boundaries, one way to break out the words in a string is:

    linux$ my
    eval:  regex::find_all_matches_to_regex ./\b\w+\b/ "the fat father futzed";
    ["the", "fat", "father", "futzed"]

Regular expressions use parentheses both for grouping subexpressions and for designating substring matches of interest. A number of regex functions center on processing such parenthesis-marked groupings.

For example, regex::find_first_match_to_regex_and_return_all_groups matches a regular expression once against a string, returning NULL if there is no match and otherwise THE list of all substrings matching groups (parenthesized subexpressions):

    linux$ my

    eval:  regex::find_first_match_to_regex_and_return_all_groups ./f.q/ "the fat father futzed";
    NULL

    eval:  regex::find_first_match_to_regex_and_return_all_groups ./f.t/ "the fat father futzed";
    THE []

    eval:  regex::find_first_match_to_regex_and_return_all_groups ./(f)(.)(t)/ "the fat father futzed";
    THE ["f", "a", "t"]

    eval:  regex::find_first_match_to_regex_and_return_all_groups ./((f(.))t)/ "the fat father futzed";
    THE ["fat", "fa", "a"]

Here:

In the first example there was no match, so the call returned NULL.
In the second example there was a match, but the regular expression contained no parenthesis-marked groupings, so the return list was empty.
In the third example the first match was against fat and the regular expression had three sets of parentheses, so the returned list contained three strings, each corresponding to the substring matched by one regular expression parenthesis-pair.
The fourth example is just like the third except that the parentheses placements are different, and thus also the corresponding returned strings.
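
Because the result is an option value, a caller will typically branch on THE versus NULL. The following is an untested sketch of that pattern: it assumes the usual Mythryl case ... esac form, and the helper name describe_match is made up for illustration:

    #!/usr/bin/mythryl

    # Hypothetical helper: report the groups of the first match, if any.
    fun describe_match  text
        =
        case (regex::find_first_match_to_regex_and_return_all_groups ./(f)(.)(t)/ text)
            THE groups => printf "Matched groups: %s\n" (strcat groups);
            NULL       => printf "No match.\n";
        esac;

    describe_match  "the fat father futzed";
    describe_match  "no such word here";

Going by the semantics described above, the first call should report the concatenated groups (fat) and the second should report that there was no match.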

The regex::find_first_match_to_ith_group i function does the same, except that it returns only the single selected parenthesis group match, or NULL if the regex fails to match the string.

By convention, group 0 is the complete matched string, hence regex::find_first_match_to_ith_group 0 regex is the same as regex::find_first_match_to_regex regex:

    linux$ my

    eval:  regex::find_first_match_to_regex       ./(f)(.)(t)/ "the fat father futzed";
    THE "fat"

    eval:  regex::find_first_match_to_ith_group 0 ./(f)(.)(t)/ "the fat father futzed";
    THE "fat"

    eval:  regex::find_first_match_to_ith_group 1 ./(f)(.)(t)/ "the fat father futzed";
    THE "f"

    eval:  regex::find_first_match_to_ith_group 2 ./(f)(.)(t)/ "the fat father futzed";
    THE "a"

    eval:  regex::find_first_match_to_ith_group 3 ./(f)(.)(t)/ "the fat father futzed";
    THE "t"

Hint: There is no regex call which explicitly returns the location of a match within a string, but it is easy to extract the leading string and compute its length. For example, to find the location of "foo" in a string (because .* is greedy, this locates the last "foo" if the string contains several):

    eval:  strlen (regex::find_first_match_to_ith_group 1 ./^(.*)foo/ "the fool on the hill");
    THE 4

The regex::find_all_matches_to_regex_and_return_values_of_ith_group i function works like regex::find_first_match_to_ith_group i, except that it returns the i-th parenthesis group match for every successful match of the regular expression against the target string:

    eval:  regex::find_all_matches_to_regex_and_return_values_of_ith_group 2 ./(f)(.)(t)/ "the fat father futzed";
    ["a", "a", "u"]
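
Since the result is an ordinary list of strings, it can be handed straight to other string functions. As a small sketch (not taken from the package documentation), the following combines it with strcat to collect the first letter of every word; the regular expression is an illustrative choice built from the \b and \w escapes introduced earlier:

    eval:  strcat (regex::find_all_matches_to_regex_and_return_values_of_ith_group 1 ./\b(\w)\w*\b/ "the fat father futzed");
    "tfff"

Here group 1 captures only the first word constituent of each word, so the four matches contribute "t", "f", "f" and "f".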

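Finally, the regex::find_all_matches_to_regex_and_return_all_values_of_all_groups function returns, for every match of the regular expression, the list of substrings matched by all of its parenthesis groups: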

    eval:  regex::find_all_matches_to_regex_and_return_all_values_of_all_groups ./(f)(.)(t)/ "the fat father futzed";
    [["f", "a", "t"], ["f", "a", "t"], ["f", "u", "t"]]
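
One natural use for this list-of-lists result is pulling apart repeated structured fields. The sketch below assumes a made-up input of space-separated key=value pairs; each inner list should hold one key and its value:

    eval:  regex::find_all_matches_to_regex_and_return_all_values_of_all_groups ./(\w+)=(\w+)/ "width=80 height=24";
    [["width", "80"], ["height", "24"]]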

We’ve already seen that regex::replace_all may be used to substitute a string for every regular expression match in a string:

    linux$ my

    eval:  regex::replace_all ./f.t/ "FAT" "the fat father futzed";
    "the FAT FATher FATzed"

There is a matching call which replaces only the first match:

    linux$ my

    eval:  regex::replace_first ./f.t/ "FAT" "the fat father futzed";
    "the FAT father futzed"

There is also a matching pair of functions which allow arbitrary substitutions at each regular expression matchpoint in the string by calling a user-supplied function to compute the replacement string.

The regex::replace_first_via_fn function returns the input string unchanged if there is no match; otherwise it calls the user-supplied function with a list of the strings matched by the parenthesis groups and splices the function's result in place of the match:

    linux$ my

    eval: regex::replace_first_via_fn  ./(f.t)/  {. toupper (strcat #stringlist); }  "the fat father futzed";
    "the FAT father futzed"

As you might expect, regex::replace_all_via_fn is identical except that it splices in replacements for all substrings matched by the regular expression:

    linux$ my

    eval: regex::replace_all_via_fn ./(f.t)/ {. toupper (strcat #stringlist); }  "the fat father futzed";
    "the FAT FATher FUTzed"
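
The user-supplied function need not merely transform the matched text; it can build any replacement string from the group list. As an illustrative sketch (the angle-bracket markers and the parameter name #groups are arbitrary choices), this wraps each match in brackets using only strcat:

    linux$ my

    eval: regex::replace_all_via_fn ./(f.t)/ {. strcat ["<", strcat #groups, ">"]; }  "the fat father futzed";
    "the <fat> <fat>her <fut>zed"

Since the regular expression has a single group, each call receives a one-element list, which the inner strcat flattens back into the matched text.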

For the ultimate in flexibility, the regex::regex_case function provides a case-style statement driven by regular expression pattern-matching.

The arguments consist of a text to be matched followed by a list of (regex, action-fn) pairs and a default action function.

Execution consists of matching each regex in order against the target text until one matches, at which point the corresponding action is invoked (with the substrings obtained from the match) and the result returned.

If no regex matches, the default action is executed and the result returned.

In any event, exactly one action function is invoked, exactly once:

    #!/usr/bin/mythryl

    fun diagnose  target_text
        =
        regex::regex_case
            target_text
            {  cases =>    [ (./utilize/,                       \\ _       = printf "This guy is verbose!\n"                      ),
                             (./weaponize/,                     \\ _       = printf "This guy is from the Pentagon!\n"            ),
                             (./(\b[bcdfghjklmnpqrstvwxz]+\b)/, \\ strings = printf "What is this '%s' word?!\n" (strcat strings) )
                           ],

               default =>  \\ _ = printf "I can deduce nothing.\n"
            };

    diagnose  "We must utilize our utmost efforts.";
    diagnose  "We must weaponize the chalkboards.";
    diagnose  "The crwth is revolting!";
    diagnose  "We are the people!";

When run, the above script produces:

    linux$ ./my-script
    This guy is verbose!
    This guy is from the Pentagon!
    What is this 'crwth' word?!
    I can deduce nothing.
    linux$
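
Because the regexes are tried in order, the position of a pair in the list matters when more than one regex could match. The following untested sketch illustrates this, and also has the actions return a value rather than print; it assumes that the action functions may return strings (the text above says only that the result is returned), and the function name classify is hypothetical:

    #!/usr/bin/mythryl

    # Hypothetical classifier: the first matching regex decides the answer.
    fun classify  text
        =
        regex::regex_case
            text
            {  cases =>    [ (./[0-9]/, \\ _ = "contains a digit"            ),
                             (./[A-Z]/, \\ _ = "contains an uppercase letter")
                           ],

               default =>  \\ _ = "plain"
            };

    printf "%s\n" (classify "Route 66");
    printf "%s\n" (classify "Route");
    printf "%s\n" (classify "route");

Under the ordering rule described above, "Route 66" satisfies both regexes but should be reported as containing a digit, since that case is listed first; "Route" should fall through to the second case and "route" to the default.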
