Your boss at Amalgamated Data Associates likes to use the DNA pattern matcher that you built for Homework 2, but has found that the matcher is way too slow, and has assigned you and your coworkers the job of finding ways to speed things up. Your coworkers are looking into other possibilities, but you've been charged with investigating the feasibility of doing the searches in parallel using multithreaded search.
You investigate the enormous variety of freely available software indexed by the Open Bioinformatics Foundation. After investigating the projects you settle on BioJava as the most promising project for your feasibility study. You look through the tutorials and the BioJava In Anger cookbook and, somewhat to your surprise, discover a program called MotifLister that is quite similar to Homework 2. You decide to use this as the basis of your feasibility study.
Unfortunately, the first thing that you discover is that MotifLister doesn't work. There's a typo in it.
Once the typo is fixed, the commands:

    javac MotifLister.java
    java MotifLister dna /u/cs/fac/eggert/opt/SunOS-5.8-sparc/biojava-20041015/biojava-20041015/demos/files/ChrI.prom.fasta AAAAAAG 3

should output the following:
    MotifLister is searching file /u/cs/fac/eggert/opt/SunOS-5.8-sparc/biojava-20041015/biojava-20041015/demos/files/ChrI.prom.fasta for the motif 'AAAAAAG' in frame 3.
    YAL068C-7235.2170 : [1137,1143]
    YAL068C-7235.2170 : [3753,3759]
    YAL068C-7235.2170 : [3831,3837]
    YAL065C-21525.11953 : [2949,2955]
    YAL065C-21525.11953 : [7257,7263]
    YAL064W-11953.21525 : [8490,8496]
    YAL063C-31572.27970 : [2283,2289]
    YAL062W-27970.31572 : [3069,3075]
    YAL060W-34708.35160 : [417,423]
    YAL058W-37154.37469 : [255,261]
    YAL054C-45903.45028 : [264,270]
    YAL053W-45028.45903 : [276,282]
    YAL039C-71792.69533 : [159,165]
    YAL039C-71792.69533 : [213,219]
    YAL029C-92906.92278 : [297,303]
    YAL002W-143164.144002 : [417,423]
    YAL001C-152258.151169 : [519,525]
    YAR047C-203393.201781 : [363,369]
    YAR050W-201781.203393 : [342,348]
    YAR066W-220490.221040 : [537,543]
    YAR071W-224855.225451 : [225,231]
    Total Hits = 21
The command:

    java ThreadedMotifLister dna /u/cs/fac/eggert/opt/SunOS-5.8-sparc/biojava-20041015/biojava-20041015/demos/files/ChrI.prom.fasta AAAAAAG 3 4

where the final argument is the number of threads, should behave like the above, except that it should split the DNA sequence into four segments of roughly equal size and search each segment in a separate Java thread, in the hope of finishing four times faster. The output need not be in the same order as MotifLister's. You may also generate extra lines of output to help you debug or time your program; these extra lines must all start with the character #.
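The split-search-join structure described above might be sketched as follows. This is only an illustration under stated assumptions: the sequence is a plain String rather than a BioJava object, the reading-frame argument is ignored, and the names ThreadedSearchSketch and search are hypothetical. Each worker owns a range of candidate start offsets, and the main thread waits for all workers before printing the total. The code sticks to language features available in Java 1.5.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: plain strings stand in for BioJava sequences,
// and the reading-frame argument is ignored.
class ThreadedSearchSketch {

    // Returns the 1-based start positions of every occurrence of motif in seq.
    // The order of the returned positions depends on thread scheduling.
    static List<Integer> search(final String seq, final String motif, int nThreads) {
        final List<Integer> hits =
            Collections.synchronizedList(new ArrayList<Integer>());
        // Number of offsets at which the motif could start.
        final int starts = Math.max(0, seq.length() - motif.length() + 1);
        List<Thread> workers = new ArrayList<Thread>();
        for (int t = 0; t < nThreads; t++) {
            // Worker t owns the candidate start offsets in [from, to).
            final int from = (int) ((long) starts * t / nThreads);
            final int to = (int) ((long) starts * (t + 1) / nThreads);
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    for (int i = from; i < to; i++) {
                        if (seq.regionMatches(i, motif, 0, motif.length())) {
                            hits.add(Integer.valueOf(i + 1));
                        }
                    }
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // wait until this segment has been searched
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers");
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String motif = "AAAAAAG";
        long begin = System.currentTimeMillis();
        List<Integer> hits = search("GAAAAAAGTTAAAAAAGC", motif, 4);
        for (Integer start : hits) {
            int end = start.intValue() + motif.length() - 1;
            System.out.println("[" + start + "," + end + "]");
        }
        System.out.println("Total Hits = " + hits.size());
        // Extra diagnostic lines start with '#'.
        System.out.println("# elapsed ms: " + (System.currentTimeMillis() - begin));
    }
}
```

Because the hit list is shared, it is wrapped with Collections.synchronizedList; giving each worker a private list and merging after the joins would be an equally reasonable choice.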
When debugging, you can use a Solaris command like truss -o tr java ThreadedMotifLister dna file.fasta AAAAG 3 4 to see the low-level details of how your program is switching threads and doing I/O. This command traces each system call and puts the trace output into the text file tr.
You can use the Solaris command psrinfo -v to find out how many CPUs are on your host. For this assignment we suggest that you not use more than 4 threads on SEASnet machines, to avoid overloading SEASnet resources.
Watch out for the case where an instance of the pattern straddles the boundary between the regions searched by two different threads. ThreadedMotifLister should report such instances correctly, just as MotifLister does.
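To illustrate the straddling problem, here is a small sketch that assumes each thread is handed its own slice of the sequence. The names BoundarySketch, countIn and countSegmented are hypothetical, plain strings stand in for BioJava sequences, and the segments are searched one after another rather than in threads, so that only the boundary arithmetic is on display. Extending each slice by the motif length minus one past its cut point is one way to catch a straddling match without counting any match twice.

```java
// Illustrative only: these helpers are hypothetical, not BioJava methods.
class BoundarySketch {

    // Counts occurrences of motif that lie entirely within seq[begin, end).
    static int countIn(String seq, String motif, int begin, int end) {
        int count = 0;
        for (int i = begin; i + motif.length() <= end; i++) {
            if (seq.regionMatches(i, motif, 0, motif.length())) {
                count++;
            }
        }
        return count;
    }

    // Sums per-segment counts; overlap is how far each segment
    // extends past its nominal cut point.
    static int countSegmented(String seq, String motif, int segments, int overlap) {
        int total = 0;
        for (int s = 0; s < segments; s++) {
            int begin = seq.length() * s / segments;
            int cut = seq.length() * (s + 1) / segments;
            int end = Math.min(seq.length(), cut + overlap);
            total += countIn(seq, motif, begin, end);
        }
        return total;
    }

    public static void main(String[] args) {
        String seq = "TTAAAAAAGTTT"; // the motif occupies 0-based offsets 2..8
        String motif = "AAAAAAG";
        // Two segments are cut at offset 6, in the middle of the motif.
        System.out.println(countSegmented(seq, motif, 2, 0));                  // prints 0
        System.out.println(countSegmented(seq, motif, 2, motif.length() - 1)); // prints 1
    }
}
```

With an overlap of motif.length() - 1, a segment examines exactly the start offsets from its own beginning up to, but not including, its cut point, so each start offset belongs to one segment only.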
If an error occurs (e.g., an I/O error), your program should print an error message on standard error and exit with nonzero status. The exact spelling of the diagnostic does not matter. If no error occurs, it should exit with status 0. Don't forget to check for errors when writing the offsets to standard output. Also, don't forget to check for invalid usages (e.g., missing or extra arguments, or a thread count that is not a positive integer).
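One way to approach these checks is sketched below. It is an assumption-laden illustration, not a required design: the names ExitStatusSketch, parseThreadCount and wroteCleanly are hypothetical, as are the diagnostic messages and the argument names in the usage line. The point it tries to show is that System.out is a PrintStream, which does not throw on write failures, so the program has to ask it explicitly with checkError().

```java
import java.io.PrintStream;

// Illustrative only: names and messages are hypothetical; the assignment
// does not prescribe the wording of diagnostics.
class ExitStatusSketch {

    // Parses a thread count, rejecting anything that is not a positive integer.
    static int parseThreadCount(String arg) {
        try {
            int n = Integer.parseInt(arg);
            if (n > 0) {
                return n;
            }
        } catch (NumberFormatException e) {
            // Not a number at all: fall through to the error below.
        }
        throw new IllegalArgumentException(
            "thread count must be a positive integer: " + arg);
    }

    // Flushes out and reports whether every write to it so far succeeded.
    static boolean wroteCleanly(PrintStream out) {
        return !out.checkError();
    }

    public static void main(String[] args) {
        try {
            if (args.length != 5) {
                throw new IllegalArgumentException(
                    "usage: java ThreadedMotifLister alphabet file motif frame threads");
            }
            int threads = parseThreadCount(args[4]);
            System.out.println("# using " + threads + " threads");
            if (!wroteCleanly(System.out)) {
                throw new IllegalStateException("error writing to standard output");
            }
        } catch (RuntimeException e) {
            System.err.println("ThreadedMotifLister: " + e.getMessage());
            System.exit(1); // nonzero status on any error
        }
        System.exit(0);
    }
}
```

Running the check once, after all output has been written, is usually enough, because a PrintStream remembers an earlier failure until it is asked.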
To turn in your assignment, submit files MotifLister.java and ThreadedMotifLister.java containing your programs. If you need to submit more Java files because you've added more classes, submit them all, one by one. Also, submit a file hw3.txt containing the summary for your boss. This file should be a plain ASCII text file containing at most 66 lines, with each line containing at most 80 characters. (Your boss is old-fashioned and prefers plain ASCII to fancy formats.) The first line of each file that you submit should be a comment containing your name and student ID.
Make sure that your code works with the 64-bit Java 1.5 platform and the BioJava snapshot (dated 20041015) that are installed on SEASnet. The command java -version should output the following text:
    java version "1.5.0"
    Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-b64)
    Java HotSpot(TM) 64-Bit Server VM (build 1.5.0-b64, mixed mode)
The command env | grep CLASSPATH should output the following text:
CLASSPATH=/u/cs/fac/eggert/opt/SunOS-5.8-sparc/biojava-20041015/biojava-20041015.jar:.
BioJava-20041015 documentation
The source code for our BioJava snapshot can be found in the directory /u/cs/fac/eggert/opt/SunOS-5.8-sparc/biojava-20041015/biojava-20041015 on SEASnet.
An overview and summary of the BioJava project can be found in Steven Meloan's article BioJava -- Java Technology Powers Toolkit for Deciphering Genomic Codes (June 2004).