Homework 2. Naive parsing of context free grammars

It should be said right up front that this assignment is just a toy: real parsers are more complicated than this, and don't rely solely on the simple kinds of backtracking used here.

Introduction

You'd like to test the random sentence generator that you wrote for Homework 1. One way is to test it on actual CS 132 projects, but those projects aren't done yet and anyway you'd like a second opinion in case the student projects are incorrect. So you decide to write a simple parser generator. Given a grammar in the style of Homework 1, your program will generate a function that is a parser.

The key notion of this assignment is that of a matcher. A matcher is a function that inspects a given string of terminals to find a match for a prefix that corresponds to a nonterminal symbol of a grammar, and then checks whether the match is acceptable by testing whether a given acceptor succeeds on the corresponding rule list and suffix. For example, a matcher for awksub_grammar below might inspect the string ["3";"+";"4";"-"] and find two possible prefixes that match, namely ["3";"+";"4"] and ["3"]. The matcher will first apply the acceptor to a rule list used to generate this prefix, along with the suffix ["-"]. If this is accepted, the matcher will return whatever the acceptor returns. Otherwise, the matcher will apply the acceptor to a rule list used to generate the shorter prefix ["3"] and the suffix ["+";"4";"-"] and will return whatever the acceptor returns. If a matcher finds no matching prefixes, it returns the special value None. You should write your matchers so that the rule lists passed to acceptors make it easy to compute a derivation later; please see below for more on derivations.

As you can see by mentally executing the example, matchers sometimes need to try multiple alternatives and to backtrack to a later alternative if an earlier one is a blind alley.

An acceptor is a function that accepts a rule list and a suffix by returning some value wrapped inside the Some constructor. The acceptor rejects the rule list and suffix by returning None. For example, the acceptor (fun d -> function | "+"::t -> Some (d,"+"::t) | _ -> None) accepts any rule list but accepts only suffixes beginning with "+". Such an acceptor would cause the example matcher to fail on the prefix ["3";"+";"4"] (since the corresponding suffix begins with "-", not "+") but it would succeed on the prefix ["3"]. By convention, an acceptor that is successful returns Some (d,s), where d is a rule list that typically contains the acceptor's input rule list as a subset (because the acceptor may do further parsing, and therefore has applied more rules than before), and s is a tail of the input suffix (again, because the acceptor may have parsed more of the input, and has therefore consumed some of the suffix).

Typically if a acceptor call accept rl frag returns Some x, then x is a pair containing a rule list and a fragment (possibly rl and frag themselves). This allows the matcher's caller to retrieve the rule list for the matched prefix, along with an indication where the matched prefix ends (since it ends just before the suffix starts). Although this behavior is crucial for the internal acceptors used by your code, it is not required for top-level acceptors supplied by test programs: a top-level acceptor needs only to return any Some x value to succeed.

A derivation is a rule list that describes how to derive a sentence. For example, here is a derivation for ["3";"+";"4"]. After each rule is applied, the resulting list of terminals and nonterminals is given.

rule	after rule is applied
(at start)	Expr
Expr → Term Binop Expr	Term Binop Expr
Term → Num	Num Binop Expr
Num → "3"	"3" Binop Expr
Binop → "+"	"3" "+" Expr
Expr → Term	"3" "+" Term
Term → Num	"3" "+" Num
Num → "4"	"3" "+" "4"

At the top level this assignment uses leftmost derivations: the leftmost nonterminal is always the one that is expanded next. Internally, your code may use rule lists of any form; they need not be leftmost derivations.

Whenever there are several rules to try for a nonterminal, you should always try them left-to-right. For example, awksub_grammar below contains this:

     | Expr ->
         [[N Term; N Binop; N Expr];
          [N Term]]

and therefore, your matcher should attempt to use the rule "Expr → Term Binop Expr" before attempting to use the simpler rule "Expr → Term".

Definitions

symbol, symbol string, alternative list, production function, grammar

same as in Homework 1.

alternative

a member of an alternative list; the right hand side of a grammar rule.

rule

a pair consisting of a nonterminal value and one of the alternatives corresponding to that nonterminal.

derivation

a list of rules used to derive a phrase from a nonterminal. For example, the OCaml representation of the example derivation shown above is as follows:

 [(Expr, [N Term; N Binop; N Expr]);
  (Term, [N Num]);
  (Num, [T "3"]);
  (Binop, [T "+"]);
  (Expr, [N Term]);
  (Term, [N Num]);
  (Num, [T "4"])]

fragment

a list of terminal symbols.

acceptor

a curried function with two arguments: a rule list rl and a fragment frag. If the fragment is not acceptable, it returns None; otherwise it returns Somex for some value x.

matcher

a curried function with two arguments: a fragment frag and an acceptor accept. A matcher matches a prefix p of frag such that accept (when passed a rule list and the corresponding suffix) accepts the corresponding suffix (i.e., the suffix of frag that remains after p is removed). If there is such a match, the matcher returns whatever accept returns; otherwise it returns None.

Assignment

Write a function parse_prefixgram that returns a matcher for the grammar gram. When applied to a fragment frag and an acceptor accept, the matcher must return the first acceptable match of a prefix of frag, by trying the grammar rules in order; this is not necessarily the shortest nor the longest acceptable match. A match is considered to be acceptable if accept succeeds when given a rule list and the suffix fragment that immediately follows the matching prefix. When this happens, the matcher returns whatever the acceptor returned. If no acceptable match is found, the matcher returns None.

Also, write three good test cases for your parse_prefix function. These test cases should all be in the style of the test cases given below, but should cover different problem areas. Your test cases should be named test_1 through test_3 (note the underscores; this distinguishes your test cases from the standard ones given below). Your test cases should test at least two grammars of your own. You may reuse your test cases for Homework 1 as part of test cases for Homework 2.

Your code may use the List module, but it should use no other modules. Your code should be free of side effects. Simplicity is more important than efficiency, but your code should avoid using unnecessary time and space when it is easy to do so.

Assess your work by writing an after-action report that summarizes why you solved the problem the way you did, other approaches that you considered and rejected (and why you rejected them), and any weaknesses in your solution in the context of its intended application. This report should be a simple ASCII plain text file that consumes a page or so (at most 100 lines and 80 columns per line, please). See Resources for oral presentations and written reports for advice on how to write assessments; admittedly much of the advice there is overkill for the simple kind of report we're looking for here.

Unlike Homework 1, we are expecting some weaknesses here, so your assessment should talk about them. For example, we don't expect that your implementation will work with all possible grammars, but we would like to know which sort of grammars it will have trouble with.

Submit

We will test your program on SEASnet, so make sure that /usr/local/cs/bin is at the start of your path, using the same technique as in Homework 1.

Submit three files via CourseWeb. The file hw2.ml should define parse_prefix along with any auxiliary types and functions needed to define parse_prefix. The file hw2test.ml should contain your test cases. The file hw2.txt should hold your assessment. Please do not put your name, student ID, or other personally identifying information in your files.

Sample test cases

let accept_all derivation string = Some (derivation, string)
let accept_empty_suffix derivation = function
   | [] -> Some (derivation, [])
   | _ -> None

(* An example grammar for a small subset of Awk, derived from but not
   identical to the grammar in
   <http://www.cs.ucla.edu/classes/winter06/cs132/hw/hw1.html>.
   Note that this grammar is not the same as Homework 1.  *)

type awksub_nonterminals =
  | Expr | Term | Lvalue | Incrop | Binop | Num

let awksub_grammar =
  (Expr,
   function
     | Expr ->
         [[N Term; N Binop; N Expr];
          [N Term]]
     | Term ->
	 [[N Num];
	  [N Lvalue];
	  [N Incrop; N Lvalue];
	  [N Lvalue; N Incrop];
	  [T"("; N Expr; T")"]]
     | Lvalue ->
	 [[T"$"; N Expr]]
     | Incrop ->
	 [[T"++"];
	  [T"--"]]
     | Binop ->
	 [[T"+"];
	  [T"-"]]
     | Num ->
	 [[T"0"]; [T"1"]; [T"2"]; [T"3"]; [T"4"];
	  [T"5"]; [T"6"]; [T"7"]; [T"8"]; [T"9"]])

let test0 =
  ((parse_prefix awksub_grammar ["ouch"] accept_all) = None)

let test1 =
  ((parse_prefix awksub_grammar ["9"] accept_all)
   = Some ([(Expr, [N Term]); (Term, [N Num]); (Num, [T "9"])], []))

let test2 =
  ((parse_prefix awksub_grammar ["9"; "+"; "$"; "1"; "+"] accept_all)
   = Some
       ([(Expr, [N Term; N Binop; N Expr]); (Term, [N Num]); (Num, [T "9"]);
	 (Binop, [T "+"]); (Expr, [N Term]); (Term, [N Lvalue]);
	 (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Num]);
	 (Num, [T "1"])],
	["+"]))

let test3 =
  ((parse_prefix awksub_grammar ["9"; "+"; "$"; "1"; "+"] accept_empty_suffix)
   = None)

(* This one might take a bit longer.... *)
let test4 =
 ((parse_prefix awksub_grammar
     ["("; "$"; "8"; ")"; "-"; "$"; "++"; "$"; "--"; "$"; "9"; "+";
      "("; "$"; "++"; "$"; "2"; "+"; "("; "8"; ")"; "-"; "9"; ")";
      "-"; "("; "$"; "$"; "$"; "$"; "$"; "++"; "$"; "$"; "5"; "++";
      "++"; "--"; ")"; "-"; "++"; "$"; "$"; "("; "$"; "8"; "++"; ")";
      "++"; "+"; "0"]
     accept_all)
  = Some
     ([(Expr, [N Term; N Binop; N Expr]); (Term, [T "("; N Expr; T ")"]);
       (Expr, [N Term]); (Term, [N Lvalue]); (Lvalue, [T "$"; N Expr]);
       (Expr, [N Term]); (Term, [N Num]); (Num, [T "8"]); (Binop, [T "-"]);
       (Expr, [N Term; N Binop; N Expr]); (Term, [N Lvalue]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term; N Binop; N Expr]);
       (Term, [N Incrop; N Lvalue]); (Incrop, [T "++"]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term; N Binop; N Expr]);
       (Term, [N Incrop; N Lvalue]); (Incrop, [T "--"]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term; N Binop; N Expr]);
       (Term, [N Num]); (Num, [T "9"]); (Binop, [T "+"]); (Expr, [N Term]);
       (Term, [T "("; N Expr; T ")"]); (Expr, [N Term; N Binop; N Expr]);
       (Term, [N Lvalue]); (Lvalue, [T "$"; N Expr]);
       (Expr, [N Term; N Binop; N Expr]); (Term, [N Incrop; N Lvalue]);
       (Incrop, [T "++"]); (Lvalue, [T "$"; N Expr]); (Expr, [N Term]);
       (Term, [N Num]); (Num, [T "2"]); (Binop, [T "+"]); (Expr, [N Term]);
       (Term, [T "("; N Expr; T ")"]); (Expr, [N Term]); (Term, [N Num]);
       (Num, [T "8"]); (Binop, [T "-"]); (Expr, [N Term]); (Term, [N Num]);
       (Num, [T "9"]); (Binop, [T "-"]); (Expr, [N Term]);
       (Term, [T "("; N Expr; T ")"]); (Expr, [N Term]); (Term, [N Lvalue]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Lvalue]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Lvalue]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Lvalue; N Incrop]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Lvalue; N Incrop]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Incrop; N Lvalue]);
       (Incrop, [T "++"]); (Lvalue, [T "$"; N Expr]); (Expr, [N Term]);
       (Term, [N Lvalue; N Incrop]); (Lvalue, [T "$"; N Expr]); (Expr, [N Term]);
       (Term, [N Num]); (Num, [T "5"]); (Incrop, [T "++"]); (Incrop, [T "++"]);
       (Incrop, [T "--"]); (Binop, [T "-"]); (Expr, [N Term]);
       (Term, [N Incrop; N Lvalue]); (Incrop, [T "++"]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]); (Term, [N Lvalue; N Incrop]);
       (Lvalue, [T "$"; N Expr]); (Expr, [N Term]);
       (Term, [T "("; N Expr; T ")"]); (Expr, [N Term]);
       (Term, [N Lvalue; N Incrop]); (Lvalue, [T "$"; N Expr]); (Expr, [N Term]);
       (Term, [N Num]); (Num, [T "8"]); (Incrop, [T "++"]); (Incrop, [T "++"]);
       (Binop, [T "+"]); (Expr, [N Term]); (Term, [N Num]); (Num, [T "0"])],
      []))

let rec contains_lvalue = function
  | [] -> false
  | (Lvalue,_)::_ -> true
  | _::rules -> contains_lvalue rules

let accept_only_non_lvalues rules frag =
  if contains_lvalue rules
  then None
  else Some (rules, frag)

let test5 =
  ((parse_prefix awksub_grammar ["3"; "-"; "4"; "+"; "$"; "5"; "-"; "6"]
      accept_only_non_lvalues)
   = Some
      ([(Expr, [N Term; N Binop; N Expr]); (Term, [N Num]); (Num, [T "3"]);
	(Binop, [T "-"]); (Expr, [N Term]); (Term, [N Num]); (Num, [T "4"])],
       ["+"; "$"; "5"; "-"; "6"]))

Sample use of test cases

If you put the sample test cases into a file hw2sample.ml, you should be able to use it as follows to test your hw2.ml solution on the SEASnet implementation of OCaml. Similarly, the command #use "hw2test.ml";; should run your own test cases on your solution.

$ ocaml
        Objective Caml version 3.11.1

# #use "hw2.ml";;
…
val parse_prefix :
  'a * ('a -> ('a, 'b) symbol list list) ->
  'b list ->
  (('a * ('a, 'b) symbol list) list -> 'b list -> ('c list * 'd) option) ->
  ('c list * 'd) option = <fun>
…
# #use "hw2sample.ml";;
val accept_all : 'a -> 'b -> ('a * 'b) option = <fun>
val accept_empty_suffix : 'a -> 'b list -> ('a * 'c list) option = <fun>
type awksub_nonterminals = Expr | Term | Lvalue | Incrop | Binop | Num
val awksub_grammar :
  awksub_nonterminals *
  (awksub_nonterminals -> (awksub_nonterminals, string) symbol list list) =
  (Expr, <fun>)
val test0 : bool = true
val test1 : bool = true
val test2 : bool = true
val test3 : bool = true
val test4 : bool = true
val contains_lvalue : (awksub_nonterminals * 'a) list -> bool = <fun>
val accept_only_non_lvalues :
  (awksub_nonterminals * 'a) list ->
  'b -> ((awksub_nonterminals * 'a) list * 'b) option = <fun>
val test5 : bool = true
#

Hint

You can use a previous Homework 2 as a hint. It is a tough homework and is not the same problem but there are some common ideas. Look for the sample solution at the end.

© 2003, 2004, 2006, 2007, 2008, 2009 Paul Eggert. See copying rules.
$Id: hw2.html,v 1.44 2009/10/03 06:28:03 eggert Exp $