Homework 5. Continuation-based fragment analyzer

Introduction

This homework builds on the hint for Homework 2. That hint contained a simple pattern-matcher generator for DNA fragments. Your generator will also take a pattern and produce a pattern-matcher. However, your pattern-matcher will operate in a different way from the hint, a way that is more convenient for some applications.

As before, the key notion of this assignment is that of a matcher, which in this assignment is a procedure that inspects a given list to find a match for a list prefix that corresponds to a pattern, and then returns the corresponding list suffix. For example, a matcher M for the pattern (or c a) will succeed only on a list whose car is c or a, and will return the cdr of that list.

After a matcher succeeds, it can be backtracked into by invoking the procedure fail, which you will write. For example, the following code:

  (let ((suffix (M frag)))
    (if (and (pair? suffix) (eq? (car suffix) 'g))
        (fail))
    suffix)

calls M on a fragment and returns the first match whose corresponding suffix does not begin with g. This code will cause the example matcher to fail on (a g) but succeed on (c t).

As you can see by mentally executing the (a g) example, matchers sometimes need to try multiple alternatives and, when backtracked into, to return a later alternative if an earlier one is a blind alley.

Definitions

nucleotide
one of the four symbols a, c, g, and t, which stand for adenine, thymine, cytosine, and guanine, respectively. Please see Richard B. Hallick's Introduction to DNA Structure for more about nucleotides in biochemistry.
DNA fragment
a list of nucleotides.
DNA pattern
an object that represents a set of DNA fragments. It is said to match each such fragment. It takes one of the following forms:
a, c, g, t
A nucleotide nt matches the singleton list (nt). For example, a matches (a).
(junk k)
matches up to k nucleotides. k must be a nonnegative integer. If there are several possible matches, it uses the shortest match. For example, (junk 0) matches only the empty list (), and (junk 2) matches (), (a), (c), (g), (t), (a a), (a c), (a g), (a t), (c a), (c c), (c g), (c t), (g a), (g c), (g g), (g t), (t a), (t c), (t g), and (t t).
(or pat…)
matches any DNA fragment that any DNA pattern pat matches. If more than one pat matches, it uses the first match. For example, (or) does not match any list, and (or a g t) matches (a), (g), and (t).
(list pat…)
matches any concatenation of DNA fragments that are matched by each pat, respectively. If there are several possible matches, (list pat1 pat2 ) uses the first match for pat1 that is acceptable to (list pat2 ). For example, (list), which has zero patterns, matches only the empty list (), and (list a (junk 1) (or a t)) matches (a a), (a t), (a a a), (a a t), (a c a), (a c t), (a g a), (a g t), (a t a), and (a t t).
(* pat)
matches any concatenation of nonnull DNA fragments that are each matched by pat. If the matched concatenation is of one or more fragments, this pattern is equivalent to (list pat (* pat)). If there are several possible concatenations, it uses the concatenation of the least number of fragments. It therefore always matches the empty fragment first. For example, (* (or g (list a t))) matches (), (g), (a t), (g g), (g a t), (a t g), (a t a t), ….

A DNA pattern is not a procedure call. For example, you do not need to define a procedure called junk. The symbol junk is part of the DNA pattern (junk 1), but junk does not need to be defined as a procedure.

matcher
a procedure with one argument: a DNA fragment frag. When it matches a prefix of frag of length k it returns (list-tail frag k); see R6RS§11.9 for a definition of list-tail. It first returns the first match (according to the above rules), and each time that it is backtracked into, using fail, it returns the next match. When there are no more matches, it should fail.

Assignment

Write a procedure (make-matcher pat) that returns a matcher for the DNA pattern pat. When applied to a DNA fragment frag, the matcher should return the unmatched suffix after the first match of a prefix of frag, using the definition of "first" given by the DNA pattern rules described above; this is not necessarily the shortest nor the longest match. The matcher should handle backtracking as described above.

Write a procedure (fail) that fails; i.e., it causes the most recently returned matcher to backtrack and to produce the next answer to the matcher's caller. If there are no such matchers, (fail) should return #f (because it failed to fail).

Write a procedure (clear-choices) that erases all records of pending matchers. Between the time that (clear-choices) returns, and the next time a matcher returns a value that has another alternative, (fail) should return #f because no matchers are pending.

Your implementation of make-matcher must be free of side affects and should avoid using unnecessary storage. Tail recursion is desirable but is not required, and you will probably find it necessary to use non-tail-recursive procedures both in make-matcher and in the matchers that make-matcher returns.

The matchers that make-matcher returns should be as simple as possible. Whenever any work could be done in either make-matcher or in the generated matcher, the work should be done in make-matcher. The generated matcher may use side effects, but they should be kept as simple and isolated as possible. It is OK for generated matchers to call helper functions that you define, such as (fail), and to use helper global variables; these helper functions and variables should follow the same rules as the generated matchers.

In all cases it is an error if procedure arguments are not of the proper form. For example, it is an error if the argument to make-matcher is not a DNA pattern, if the first argument to a matcher or acceptor is not a DNA fragment, or if the second argument to a matcher is not an acceptor. Your solutino can assume that make-matcher and the matchers it returns are given valid arguments by the test program; however, you must insure that the matchers do not cause any errors when given valid arguments.

Submit

Submit two files. First, submit a file hw5.ss containing your code. Make sure that your code works with PLT Scheme, the Scheme implementation installed on SEASnet in /usr/local/cs/bin; if you type the shell command mzscheme the first line of output should be "Welcome to MzScheme v4.1.5 [3m], Copyright (c) 2004-2009 PLT Scheme Inc.".

Second, submit an ASCII text file hw4.txt that briefly describes how you arrived at your solution, focusing on alternative approaches that you rejected and why.

Please do not put your name, student ID, or other personally identifying information in your files.

Examples

The following examples show how make-matcher might be used in a test program. Please do not submit them as part of hw5.ss.

(define frag0 '())
(define frag1 '(a t g c t a))
;
; Scheme does not care about the newlines in the definition of frag2.
; From Scheme's point of view, they are merely extra white space that
; is ignored.  The newlines are present only to help humans understand
; how the patterns defined below are matched against frag2.
(define frag2 '(c c c g a t a a a a a a g t g t c g t
                a
                a g t a t a t g g a t a
                t a
                a g t a t a t g g a t a
                c g a t c c c t c g a t c t a))

; Most of the uses of "list" in the following pattern definitions
; are as a symbol that is part of a pattern, not as a procedure.
; However, there are two exceptions, one when defining pat3
; and the other when defining pat4.
(define pat1 '(list a t g c))
(define pat2 '(or
               (list a g t a t a t g g a t a)
               (list g t a g g c c g t)
               (list c c c g a t a a a a a a g t g t c g t)
               (list c g a t c c c (junk 1) c g a t c t a)))
(define pat3 (list 'list pat2 '(junk 2)))
(define pat4 (list '* pat3))

; For each pattern defined above, use "make-matcher" to create a
; matcher that matches the pattern.
(define matcher1 (make-matcher pat1))
(define matcher2 (make-matcher pat2))
(define matcher3 (make-matcher pat3))
(define matcher4 (make-matcher pat4))


; Return the first solution acceptable to ACCEPT.
(define (acceptable-match matcher frag accept)
  (clear-choices)
  (let ((r (matcher frag)))
    (or (accept r)
    (fail))))

; Return the first match.
(define (first-match matcher frag)
  (acceptable-match matcher frag (lambda (frag1) frag1)))

; Return true if the matcher matches all of the fragment.
(define (match-all? matcher frag)
  (acceptable-match matcher frag null?))

; Output all solutions.
(define (write-then-fail matcher frag)
  (clear-choices)
  (write (matcher frag))
  (newline)
  (fail))

; Some test cases.
(first-match matcher1 frag0) ; ⇒ #f

; A match must always match an entire prefix of a fragment.
; So, even though matcher1 finds a match in frag1,
; it does not find the match in (cons 'a frag1).
(first-match matcher1 frag1) ; ⇒ (t a)
(first-match matcher1 (cons 'a frag1)) ; ⇒ #f

(first-match matcher2 frag1) ; ⇒ #f
(first-match matcher2 frag2) ; ⇒ (a
;                                 a g t a t a t g g a t a
;                                 t a
;                                 a g t a t a t g g a t a
;                                 c g a t c c c t c g a t c t a)

; These matcher calls match the same prefix,
; so they return unmatched suffixes that are eq?.
(eq? (first-match matcher2 frag2)
     (first-match matcher3 frag2)) ; ⇒ #t

; matcher4 is lazy: it matches the empty fragment first,
; but you can force it to backtrack by insisting on progress.
(eq? (first-match matcher4 frag2)
     frag2) ; ⇒ #t
(eq? (first-match matcher2 frag2)
     (acceptable-match matcher4
           frag2
           (lambda (frag) (if (eq? frag frag2) #f frag))))
; ⇒ #t

; Here null? is being used as an acceptor.
; It accepts only the empty unmatched suffix,
; so it forces matcher4 to backtrack until all of frag2 is matched.
(match-all? matcher1 frag2) ; ⇒ #f
(match-all? matcher4 frag2) ; ⇒ #t

Hint

Start with the solution to an old homework assignment and modify it to use continuations rather than acceptor functions.

Here are some places you can find ideas on how to use continuations to implement backtracking in the presence of recursion:


© 2003, 2009 Paul Eggert. See copying rules.
$Id: hw5.html,v 1.27 2009/05/09 21:53:03 eggert Exp $