How to Use Build 2 Scripts

Processing includes probe generation, validation, and load to database.

Step 4 must be performed on a computer (server) that is set up for
mega_blast processing. (All steps 1 thru 6 may be performed on this
same server.) Step 7 should be performed on your database computer.

##############################################################################
About 'qsub' job queuing.

Scripts with 'qsub' names are intended to use job-queuing software
compatible with Portable Batch System (PBS). We use Sun Grid Engine for
this purpose on our Solaris server.

For information about free and academic versions of PBS :
    http://www.openpbs.org

If you prefer not to install PBS on your server, you can still use the
'qsub' scripts. After each 'qsub' script completes, simply run each of
the '.bash' output files manually.

##############################################################################
(1) Prepare for the Probe Generation process.

Make a directory to hold all the scripts for this process, for example
'/home/yourID/sample/bin/'

Copy these scripts into that directory :
    qsub_generate_probes.pl
    multi_soop.pl
    create_megablast_files.sh
    create_megablast_file.pl
    qsub_megablast.pl
    qsub_megablast_validation.pl
    validate_megablast_output.pl
    qsub_separate_probes.pl
    separate_all_probes.pl

Create an environment variable UPROBE_BIN_DIR that points to this
directory. In bash, do it like this :
    UPROBE_BIN_DIR=/home/yourID/sample/bin
    export UPROBE_BIN_DIR

Create a subdirectory that will hold all data and configuration files.
In this example we will call it /home/yourID/sample

Copy these files into /home/yourID/sample/ :
  - .maf file containing alignments from which probes will be generated
    Scripts assume that .maf file name is like "chr22.maf"
    (Example file "chr25.maf" is actually a small part of chromosome 22.)
    Note that ALL .maf files in directory will be processed !
  - configuration file with scoring information e.g.
    score.cfg (See sample file 'score.cfg' available on the website.)

This subdirectory should contain no other content at this time. It will
contain many data files when all steps are completed.

Your current location ('path-to-working-directory', pwd) must remain in
this subdirectory AT ALL TIMES when you run the scripts, or data files
will not be found.

##############################################################################
(2) Generate candidate probes using multi_soop

syntax :
    /[path]/qsub_generate_probes.pl /[full_path_to_multi_soop_directory]

After this step completes, these new file types will be present :
    chr25.bash : Contains bash commands that were run by the Sun Grid
                 Engine. This file can be deleted.
    chr25.out  : Is a log file, contains various processing details.
                 Can be deleted, but you may be interested in the details.
    chr25.xml  : Contains multi_soop output in xml format.
                 This file is input to the next step.

##############################################################################
(3) For each probe in xml file, extract data fields needed for megablast.
    Output is broken into files containing only 25000 probes
    (this line count is configurable).

syntax :
    /[path]/create_megablast_files.sh

After this step completes, this new file type will be present :
    chr25.xml.1 : (25,000 probes) Probe data needed for input to mega_blast
    chr25.xml.2 : (25,000 probes)
    chr25.xml.n : (25,000 probes)

##############################################################################
(4) Run mega_blast against all probes. This is to check if each probe
    is unique within the subject genome (e.g. human).

Note : Before running, you must edit the script 'qsub_megablast.pl' so that
the generated command line has correct paths to both your 'megablast'
(executable) file and the subject blast database.

syntax :
    /[path]/qsub_megablast.pl

After this step completes, these new file types will be present :
    chr25.xml.1.mega_out : Output from mega_blast.
                           This is input to next step.
    mega_25_1.bash  : These 4 file types are process output from mega_blast.
    mega_25_1.out     If your mega_blast succeeded, these can be deleted.
    mega_25_1.e3842   Otherwise you may need them for investigating
    error.log         the problem.

##############################################################################
(5) Process the megablast output, identifying each probe as either
    'unique' or 'non-unique'.

syntax :
    /[path]/qsub_megablast_validation.pl

After this step completes, these new file types will be present :
    chr25.xml.1.valid : Output file containing 1 line per probe, with
                        indicator 'VALID' if unique.
                        This is input to the next step.
    chr25.xml.1.mega_out.sorted : Sorted output from mega_blast.
                        This can be deleted, but is useful to understand
                        validation processing.
    validate_25_1.mega_out.bash : These 2 file types can be deleted if you
    validate_25_1.mega_out.out    had no errors, otherwise may be needed
                                  to investigate the problem.

##############################################################################
(6) Generate separate 'valid' (unique) and 'invalid' (non-unique) files
    with full xml content for each probe. This involves combining data
    from '.valid' files with earlier .xml file for chromosome.

syntax :
    /[path]/qsub_separate_probes.pl

After this step completes, these new file types will be present :
    chr25.probes.xml.valid   : Output file containing unique probes with
                               full xml content for each probe. This is
                               for import into database in next step.
    chr25.probes.xml.invalid : Output file containing non-unique probes
                               with full xml content for each probe. This
                               is for import into database in next step.
    chr25_separate.bash       : These 2 file types can be deleted if you
    chr25.xml.separate_output : had no errors, otherwise may be needed
                                to investigate the problem.
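
For servers without PBS, the manual fallback described under
"About 'qsub' job queuing" can be sketched as a small bash helper. This is
only an illustration: the function name run_generated_scripts is
hypothetical and not part of the Build 2 scripts, and any glob pattern
passed to it is an assumption inferred from the example file names listed
in steps 2, 4, 5 and 6 (chr25.bash, mega_25_1.bash,
validate_25_1.mega_out.bash, chr25_separate.bash).

```shell
#!/usr/bin/env bash
# Sketch only: run_generated_scripts is a hypothetical helper, not one of
# the Build 2 scripts. It runs, one at a time, every file in the current
# directory whose name matches the given glob pattern.
# Assumes file names contain no spaces (true for the names in this README).

run_generated_scripts() {
    local pattern="$1"
    local script
    local failed=0

    # $pattern is left unquoted on purpose so the shell expands the glob.
    for script in $pattern; do
        # When nothing matches, the loop sees the literal pattern; skip it.
        [ -f "$script" ] || continue
        echo "Running $script"
        if ! bash "$script"; then
            echo "FAILED: $script" >&2
            failed=1
        fi
    done

    # Non-zero if at least one script failed.
    return "$failed"
}
```

For example, after step 4 completes one might run this from the data
directory :
    run_generated_scripts 'mega_*.bash'
Passing one pattern per step avoids re-running '.bash' files left over
from earlier steps.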
##############################################################################
##############################################################################

This concludes processing on your mega_blast server (we use Solaris).

Now copy the 2 files produced by step 6 (chrNN.probes.xml.valid and
chrNN.probes.xml.invalid) to your database server and continue with the
next step to load probes into the database (we use MySQL).

##############################################################################
##############################################################################
(7) Load unique and/or non-unique probes into database.
    This step does not create the database or the 'probes' table; it only
    inserts data from the files produced by the preceding step.

syntax :
    /[path]/load_probes.pl -file chrNN.probes.xml.valid -version [Build#] -type VALID
    /[path]/load_probes.pl -file chrNN.probes.xml.invalid -version [Build#] -type INVALID

This step does not produce any output file.

##############################################################################
##############################################################################
Summary of data files used in the Probe Generation process :

1. chrNN.maf                : original alignment file in '.maf' format
2. chrNN.xml                : candidate probes identified by multi_soop,
                              file in 'xml' format
3. chrNN.xml.n              : extracted probe data needed for mega_blast
4. chrNN.xml.n.mega_out     : mega_blast output
5. chrNN.xml.n.valid        : validation output
6. chrNN.probes.xml.valid   : Output xml file containing unique probes
   chrNN.probes.xml.invalid : Output xml file containing non-unique probes

##############################################################################
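
The file summary above can be turned into a quick progress check. The
sketch below reports, for one chromosome prefix, which of the summarized
file types exist in the current directory. The function names
pipeline_status and file_state are hypothetical (not part of the Build 2
scripts), and the sketch assumes the chunk suffix 'n' is purely numeric,
as in the chr25.xml.1 and chr25.xml.2 examples.

```shell
#!/usr/bin/env bash
# Sketch only: pipeline_status and file_state are hypothetical helpers,
# not Build 2 scripts. Run pipeline_status from the data directory.

# Print "present" or "missing" for a single file name.
file_state() {
    if [ -f "$1" ]; then echo present; else echo missing; fi
}

pipeline_status() {
    local chr="$1"            # chromosome prefix, e.g. chr25
    local f suffix
    local chunks=0 mega=0 valid=0

    for f in "$chr".xml.*; do
        [ -f "$f" ] || continue
        suffix="${f#"$chr".xml.}"      # part after "chrNN.xml."
        case "$suffix" in
            *.mega_out)   mega=$((mega + 1)) ;;      # chrNN.xml.n.mega_out
            *.valid)      valid=$((valid + 1)) ;;    # chrNN.xml.n.valid
            ''|*[!0-9]*)  ;;   # side files (.sorted, separate_output, ...)
            *)            chunks=$((chunks + 1)) ;;  # chrNN.xml.n
        esac
    done

    printf '%-26s %s\n' "$chr.maf" "$(file_state "$chr.maf")"
    printf '%-26s %s\n' "$chr.xml" "$(file_state "$chr.xml")"
    printf '%-26s %s\n' "$chr.xml.n" "$chunks file(s)"
    printf '%-26s %s\n' "$chr.xml.n.mega_out" "$mega file(s)"
    printf '%-26s %s\n' "$chr.xml.n.valid" "$valid file(s)"
    printf '%-26s %s\n' "$chr.probes.xml.valid" \
        "$(file_state "$chr.probes.xml.valid")"
    printf '%-26s %s\n' "$chr.probes.xml.invalid" \
        "$(file_state "$chr.probes.xml.invalid")"
}
```

Calling 'pipeline_status chr25' prints one line per file type. If the
counts for chr25.xml.n, chr25.xml.n.mega_out and chr25.xml.n.valid
differ, steps 4 or 5 may not have finished for every chunk.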