How to Use Nsoop Scripts Processing includes probe generation, validation, and final selection to file or database. Our nsoop processing runs on Sun Solaris. The process consists of perl and shell scripts, and should run without problem on any computer that has Perl and bash. Linux meets this requirement, as do some other UNIX-like environments. Before starting this process, please realize this pipeline was developed for whole-genome universal probe design in our local environment. It has proven efficient for parallel processing of large batches of probes and we have used this pipeline to design millions of probes at a time, for which this process is cost-effective. We have included all the programs an outside user would need to go through this process with or without using whole-genome alignments. For a single probe set this will take some effort, as it does take some time to walk through these steps. The more probes you design at at time, the more efficient this pipeline will be for you. Step 14 must be performed on a computer (server) that is set up for mega_blast processing. (All steps 1 thru 16 may be performed on this same server.) Step 17 should be performed on your database computer. ############################################################################## 1. Install nsoop files. Make a subdirectory 'nsoop'. Within 'nsoop' make subdirectories 'scripts' and 'control'. Move the downloaded scripts tar file into 'scripts'. Untar the file with this command : tar -xvf ./nsoop.scripts.tar . Now the 'scripts' directory will contain these scripts : nsoop.pl functions_parseNewickToPoints.pl function_makeSubtree.pl function_makeNewickFromPoints.pl function_choosePhase1ProbeFromNonGappedAlignmentUsingParsimony.pl function_getGCscore.pl function_getMaximumLikelihoodScoresFromNonGappedAlignment.pl Nspal_maf.pm Fitch.pm LocationConverter.pm makeChromosomeLengthsFile.pl sort_phase1_to_phase2.pl makeControlFiles.pl qualityPhase2CountsWithMismatchRestrictions.pl select_phase2_to_phase3.pl make_megablast_files_from_phase3.pl make_phase4_using_megablast_result.pl qsub_nsoop.pl qsub_megablast.pl loadProbesToOracle.pm writeSpacedPrimersAndProbes.pl qsub_correct_mm5_badScreen_maf.pl correct_mm5_badScreen_maf.pl Move the downloaded control tar file into 'control'. Untar the file with this command : tar -xvf ./nsoop.control.tar . Now the 'control' directory will contain these example files : newick_mouseRatHumanDog_rooted_withBranchLengths (newick tree file, rooted, with branch lengths) species_correlation.example.txt (required if species names differ between tree file and alignment file) chromosomes_mm5_rn3_hg17_cf1_gg2.txt (chromosome sizes, allows conversion of locations on minus strands) in.baseml.example (required for maximum likelihood scores, must be provided by user ! ) nsoop_baseml.example.ctl (required for maximum likelihood scores) ############################################################################## 2. Download and install software PAML (Phylogenetic Analysis Using Maximum-Likelihood). Nsoop produces maximum likelihood conservation scores from 'reconstruction' of ancestral nucleotide values at branch points in the phylogenetic tree. For these reconstructions the PAML software package is used (v3.13d only version tested with nsoop). PAML can also be used to estimate branch lengths and generate fixed rate parameters required by nsoop. Download and install PAML : http://abacus.gene.ucl.ac.uk/software/paml.html ############################################################################## ############################################################################## TIP 1. nsoop also selects and scores probes using maximum parsimony. The probes selected by maximum parsimony and maximum likelihood in our experience are very similar. For example, from human-dog-mouse-rat-chicken whole-genome alignments, >97% of the probes selected by both methods were identical, and the correlation between probe scores was >0.99. Step 4 outlines the process by which the required rate parameter and control file for use in PAML can be generated. While these files are required for running nsoop, if you are only going to use the probes selected by maximum parsimony and the assigned maximum parsimony scores and not use the maximum likelihood selected probes or scores, you can use the example files for the "in.baseml" and "nsoop_baseml.ctl" files basically as dummy files just to get nsoop to run. (ie, just change the name of the nsoop_baseml.example.ctl file to nsoop_baseml.ctl, and in.baseml.example to in.baseml.) Please note, by doing this the probes selected using maximum likelihood should be ignored, as should all maximum likelihood scores. ############################################################################## ############################################################################## 3. Generate or download a .maf aligment file. Alignments from specific regions of genomes can be retrieved using the UCSC genome table browser. Whole-genome alignments can also be downloaded from UCSC in .maf format. If you generate your own alignments, a later step will require the name of each sequence follow this convention : speciesidentifier.chromosomeorothername For example, the sequence name for a chromosome from a whole-genome assembly looks like this: mm5.chrX where mm5 designates mouse genome assembly version 5 and chrX designates the chromosome. If you generate your own alignments, the "chr" tag may not be relevant or informative. In that case something like: mm.CFTR could be used where mm represents mouse and CFTR refers to a specific segment of the genome. ################################################################################ 4. Nsoop uses the REV model of nucleotide substitution in PAML program 'baseml'. You must generate rate parameters that are appropriate to your sequences and phylogeny. For this use the 'baseml' program with these options : runmode = 0 model = 7 fix_alpha = 1 alpha = 0 clock = 0 nhomo = 0 RateAncestor = 0 # changes for later use cleandata = 1 fix_blength = 0 # changes for later use method = 0 Read the PAML documentation about how to run baseml, and to understand the effect of these options. Note that different options for baseml are used later, when you are running nsoop. The REV model in baseml uses 5 rate parameters. If you already have these 5 parameters, then just make a file in the same format as in.baseml.example. In this format, the -1 designates that the 5 values following the -1 always be used as the rate parameters. If you want to use the maximum likelihood probe selection and scoring and need to generate these parameters, you can do the following. Convert your .maf alignment file to a PHYLIP formatted file that baseml can read with the program: convertMafForPaml.pl (accompanying module : Nspal_maf_forPaml.pm). Once you convert your .maf file to a PHYLIP format you can use baseml to estimate the rates using your entire alignment (see baseml.ctl values above). (Note, this conversion program will translate only the alignments that include all the species you designate in the options. ie, if you have a human-mouse-rat .maf file, and you designate all three species in the options, then only alignmnets with all three species (human-mouse-rat)will be translated, and not the alignments with missing one of the species (human-mouse, human-rat and mouse-rat). Also note that a newick formatted tree file(unrooted) is required for this process. Branch length estimates generated with this step can be used below for the the tree file with branch lengths required below. Also note that your alignment input file needs to be named *.maf to be recognized by convertMafForPaml.pl.) Once you run baseml, the output file "mlb" will contain the branch length information and rate parameters at the end of the file. Here is an example of the end of a mlb output file. (hg17, (mm5, rn3), canFam1); (hg17: 0.08536, (mm5: 0.06249, rn3: 0.08944): 0.17136, canFam1: 0.15806); #These are the branch length estimates. #Note that the tree is unrooted. Detailed output identifying parameters Parameters in the rate matrix (REV) (Yang 1994 J Mol Evol 39:105-111): Rate parameters: 0.86359 0.16390 0.25135 0.30505 0.30066 #These are the rate parameters #for the in.baseml file. Base frequencies: 0.35244 0.16951 0.28397 0.19408 Rate matrix Q, Average Ts/Tv = 1.8343 -0.733003 0.443928 0.141143 0.147932 0.923009 -1.362667 0.262706 0.176953 0.175173 0.156814 -0.920539 0.588552 0.268643 0.154553 0.861176 -1.284373 Place the rate parameters in a file called 'in.baseml' following the convention in the in.baseml.example file. Again, the tree with branch lengths can be used as the required input tree for nsoop. Just note that we use a rooted tree. So in this case a branch with a specified length need to be added to this tree. For example, one branch of length 0.02 was added,and 0.01 was subtracted from the hg17 and canFam1 branches. The resulting rooted tree: ((hg17: 0.07536, (mm5: 0.06249, rn3: 0.08944): 0.17136): 0.02, canFam1: 0.14806); ############################################################################# 5. Each time you want to run nsoop and produce a set of probes, make a new tree of subdirectories under 'nsoop' to contain and organize all the files produced by the process. For example, if you want to refer to a run as 'testrun' , then create these subdirectories : ../nsoop/testrun ../nsoop/testrun/phase1 # to contain candidate-probe alignments within files by chromosome of maf reference species ../nsoop/testrun/phase2 # to contain candidate-probe alignments sorted by chromosome of probe species ../nsoop/testrun/phase3 # to contain probe alignments that meet score threshholds ../nsoop/testrun/megablast # to contain extract of phase3 probes in format suitable for megablast ../nsoop/testrun/phase4 # probe alignments now have megablast result : UNIQUE, NOT_FOUND, TOO_MANY, TOO_SIMILAR ../nsoop/testrun/phase5 # excludes NOT_FOUND probes; remaining probes ready for load to database ############################################################################## 6. These control files must be available : newick tree file (example : newick_mouseRatHumanDog_rooted_withBranchLengths) NOTE : branch lengths from tree file contribute directly to the calculated probe scores (parsimony and max likelihood) chromosome file (example : chromosomes_mm5_rn3_hg17_cf1_gg2.txt) NOTE : the chromosome files are required even in cases where the sequences aligned are not entire chromosomes. Essentially these files just tell the program the length of each input sequence. Chromosome length files can be downloaded from UCSC, or you can get this information and file using "makeChromosomeLengthsFile.pl". By default this program will read in all *.maf files in the current working directory and make a single chromosome.txt file that can be used when designing probes from any one or all the *.maf files in the current working directory. species correlation file (example : species_correlation.example.txt) (required if tree and alignment files use differing species names) ############################################################################## ############################################################################## TIP 2. The tree topology and branch lengths dictate how heavily matches are weighted in the probe score and are important in the ancestral sequence reconstruction. Thus, the relative lengths of the branches to one another is really the key for accurate probe scoring, more so than any specific values. ############################################################################## ############################################################################## 7. These files for PAML must be present in the current directory (e.g. in .../nsoop/testrun ) : baseml (This is the executable program 'baseml' which is part of PAML software.) nsoop_baseml.ctl (example : nsoop_baseml.example.ctl) Control parameters needed for baseml. in.baseml (example : in.baseml.example) Contains nucleotide-substitution parameters produced by earlier run of 'baseml'. NOTE : As mentioned in TIP 1., the in.baseml file is required by nsoop and should be created for each new dataset from which probes will be designed. Therefore,if you plan to use the maximum likelihood probe selection or scores, do not use the provided example of in.baseml for your probe generation. If however you want to ignore probes selected using maximum likelihood and all maximum likelihood probe scores and just use the maximum parsimony probe selection and scores generated by nsoop, then using the example in.baseml file just as a dummy file to get the program to run is an option. ############################################################################## 8. Syntax to generate probes from a single alignment file (.maf format) : ../scripts/nsoop.pl \ -chromosomeFile Note that if species names differ between the newick tree file and the alignment file, you must provide a 'species correlation' file by including this optional parameter : ../scripts/nsoop.pl \ -chromosomeFile -same A common error running nsoop is caused by misspelling or duplicate in the species list. ############################################################################## ############################################################################## TIP 3. If in this step or later steps a custom module (.pm) is not found in your existing path, you can dynamically add the location of a provided module (.pm) to your path like this. export PERLIB5LIB=$PERL5LIB:/path/to/where/the/.pm/file/is/on/your/computer/ ############################################################################## ############################################################################## 9. If you have a set of alignment files, run the script 'qsub_nsoop.pl' to create a bash shell script for each alignment file that will run nsoop.pl against the alignment file and write candidate probes to files. Edit qsub_nsoop.pl to define your parameters, as described in 5 and 6 above. qsub_nsoop.pl will create a new subdirectory for each alignment file, and outputs for that file will be located in that subdirectory. (This is necessary because baseml uses standard file names which will overwrite each other if nsoop runs on multiple processors.) Use of qsub_nsoop.pl requires the Portable Batch System to be installed on your computer. See related notes at end of this HOWTO. ############################################################################## 10. Populate 'phase1' with either the parsimony selected "best" probes (*.phase1.pars.internal.txt)or maximum-likelihood selected "best" probes (*.phase1.ML.internal.txt). (See TIP 1. on similarity of these methods.) Nsoop output includes both maximum-likelihood and parsimony files of candidate-probes for each input alignment file. The *.pars.* file contains probes selected using maximum parsimony. The *.ML.* file contains probes selected using maximum likelihood. Both files contain the probe score (either maximum parsimony or maximum likelihood) used to select the probe, as well as the corresponding parsimony and likelihood score. Decide whether you will use ML or parsimony criteria for probe selection and scoring, and copy the corresponding 'phase1' files from the subdirectories under 'testrun' directory into the 'phase1' subdirectory. ############################################################################## 11. Sort phase1 candidate probes into phase2 files by the probe species chromosome. Change directory (cd) into ......../nsoop/testrun/phase1 (which is where your phase1 files are now located). Run either sort_phase1_to_phase2.pl pars or sort_phase1_to_phase2.pl ML depending on the score type you have chosen. The is the probe species designation, like hg17, and is a required variable. The probes will be sorted by the chromosome location of the species designated. Now the phase1 subdirectory contains several new 'phase2' files for each original phase1 file. Move 'phase2.final' files to phase2 subdirectory. Delete the other 'phase2' files from phase1 subdirectory. This process orders the candidate probe sequences by "chromosome" position and is important for later steps. ############################################################################## 12. Now you will want to analyse the phase2 candidate probes to determine score threshhold(s) that provide suitable level of sequence conservation. In essence, you want to select a probe score cutoff applicable to your needs. We use both the probe score and number of mismatches between the probe sequence and the comparative sequences to decide on a cutoff. The basic logic is that you want to set the probe score cutoff such that your the probe sequence is likely to have 3 or fewer mismatches with your target species sequence. For example, if a human probe sequence has 3 or fewer mismatches with the orthologous chicken sequence, then it is highly likely that human probe sequence will have 3 or fewer mismatches with other mammals. (For more details on this, see the Uprobe website and the manuscript describing Uprobe, Kellner et al Genome Research 2005). To do this we look at the correlation of specific or combinations of mismatch criteria with probe scores for each type of alignment from which the probes were generated. (like hg17-mm5-rn3, or hg17-canFam1, etc). The programs makeControlFiles.pl and qualityPhase2CountsWithMismatchRestrictions.pl can be used to help make the probe score cutoff decisions. makeControlFiles.pl is used to generate two control files: one, mismatchRestriction.ctl, for designating specific mismatch criteria that are used in qualityPhase2CountsWithMismatchRestrictions.pl, and a second, scoreCutoff.ctl,that is used in the next step by select_phase2_to_phase3.pl to extract just the probes that meet a specified score cutoff. (Note, the "assembly" term refers to the species designator like hg17. A list looks like this: "hg17,mm5,rn3". Note, the abbreviation list is a comma separated list of 1 [Aa...Zz] characters representing species in the alignment.) syntax : makeControlFiles.pl OR syntax : makeControlFiles.pl Once you have modified the the .ctl file to your specifications (see mismatchRestriction.ctl output for editing instructions), qualityPhase2CountsWithMismatchRestrictions.pl can be used to generate a summary table of probe scores and mismatch criteria like this: #pattern score restricted-count unrestricted-count restr-cumulative unrestr-cumulative restr-percent restr-cumulative-percent # index 0 GM 3.044 41 41 41 41 100.0 100.0 GM 2.979 0 56 41 97 0.0 42.3 GM 2.914 0 38 41 135 0.0 30.4 GM 2.849 0 16 41 151 0.0 27.2 GM 2.783 0 21 41 172 0.0 23.8 GM 2.718 0 14 41 186 0.0 22.0 GM 2.653 0 6 41 192 0.0 21.4 GM 2.588 0 2 41 194 0.0 21.1 GM total 41 194 The information in each column is listed by the column headers (such as "pattern" etc). Briefly, for each alignment pattern (pattern),each probe score (score, note the probe scores are rounded to the 3rd position after the decimal for this summary file)is listed along with the number of probes that meet the user defined mismatch criteria (restricted-count), all probes with that score (unrestricted-count), the cumulative number of probes that meet the user defined mismatch criteria (restr-cumulative), the cumulative number of all probes (unrestr-cumulative), the percent of probes that meet the user defined mismatch criteria for that probe score (restr-percent), and finally the cumulative percentage of probes that meet the user defined mismatch criteria (restr-cumulative-percent). In this example, all of the probes that met the mismatch criteria had a score of 3.044, so this probe score would be appropriate as a cutoff. In most cases when probes are designed from alignments between 3 or more species, the probe score and mismatch criteria will not have a 1:1 correlation. For examples of how we have implemented the probe score cutoffs, see the description of the April_2005_carnivores_1.1 and JUN_2005_mammals_2 (http:/uprobe.genetics.emory.edu/direction.html). syntax : qualityPhase2CountsWithMismatchRestrictions.pl ############################################################################## 13. Apply your score threshholds to the candidate probes. Edit scores in scoreCutoff.ctl file. Run script with this syntax in phase3 directory: syntax : select_phase2_to_phase3.pl Script will create phase3 files in the current directory. Remember to give path to phase 2 files and .ctl file. Move *phase3* files to phase3 directory. ############################################################################## 14. Check uniqueness of selected probes using megablast. Change directory to 'megablast' subdirectory. Run this script to produce megablast input file into current directory : syntax : make_megablast_files_from_phase3.pl Edit script 'qsub_megablast.pl' to find your local megablast database file. Run script to megablast each input file (megablast results are written into current directory) : qsub_megablast.pl (These megablast jobs may run for hours, depending on size of probe files and megablast database.) ############################################################################## ############################################################################## TIP 4. It is not necessary to use qsub to run megablast. You can just run it manually. The parameters we use are: megablast -t 16 -N2 -W11 -e 0.6 -i INFILE -o *phase3.megablast_out.txt -d /path/to/your/blastdb/ -F F -D 3 The output format -D 3 is absolutely required for using make_phase4_using_megablast_result.pl. Also note that the -e value should be set such that all matches that have a bit score >36 are reported. The output file should follow the input naming format, for example, the megablast output file for hg17_chr1_pars_phase3.megablast_in.txt would be hg17_chr1_pars_phase3.megablast_out.txt ############################################################################## ############################################################################## Change directory to 'phase4'. Run this script to add a descriptor for the megablast result to each selected probe, writing phase4 files to current directory : syntax : make_phase4_using_megablast_result.pl Note that this program uses a set criteria for classifying probes based on the number and bit scores of matches found in the megablast search. That logic is as follows: 1. Unique-1 identical match was found to the expected genome position, no other matches above 40 bits,and fewer than 5 other matches with a bit score >36. 2. TOO_SIMILAR-1 identical match to the expected genome position and at least one other match >40 bits. 3. TOO_MANY-1 identical match to the expected genome position and at least 5 other matches >36 bits. 4. NOT_FOUND_nomatches- no matches >36 bits. 5. NOT_FOUND_toosimilar- no match to the expected genome position and 2 or more matches >40 bits. 6. NOT_FOUND_tomany- no match to the expected location in the genome and 6 matches >36 bits. 7. NOT_FOUND_unique- no match to the expected location in the genome, 1 match >40 bits or,fewer than 5 matches with a bit score >36. Criteria 1-3 are applicable when the probes are being compared to the same genome assembly as the probes were derived from. (ie, hg17 probes compared to a hg17 blastdb). Criteria 4-7 is applicable when the probes are designed from one sequence(s) and compared to a second sequence(s) blastdb. (ie, baboon probes compared to the macaque genome). By this logic, probes that fall in category 1,4 and 7 we would consider unique, and probes that fall in 2,3,5 and 6 are considered non-unique. ############################################################################## 15. At this point you can prepare the phase4 probe files to load them in a db (steps 16 and 17), or you can select probes directly from the phase4 file. To select the probes from phase4 file : syntax : writeSpacedPrimersAndProbes.pl OR syntax : writeSpacedPrimersAndProbes.pl 'Probe spacing' in this case refers to the optimal distance at which the probes should be spaced. The probe spacing logic here is a slightly modified version of what is used in soop. The 'megablast types', ie how the probes were classified based on the megablast hits, are listed below, The right column are the values to be used as input options in this program. UNIQUE UNIQUE TOO_SIMILAR TOO_SIMILAR TOO_MANY TOO_MANY NOT_FOUND_nomatch NF_nomatch NOT_FOUND_toosimilar NF_toosimilar NOT_FOUND_toomany NF_toomany NOT_FOUND_unique NF_unique Two output files are produced, *_select_primer and *_select_probe. The *_select_primer file has the primer sequences to make the overgo probes (1 pair per line). The *_select_probe file has the probe sequences in fasta format. The defline in these files includes the probe name, ############################################################################## 16. Process probes to phase 5. Copy probe files from phase4 into phase5. Then edit each file manually to remove probes that have megablast result 'NOT_FOUND'. ############################################################################## 17. Now the probes are ready to load into database. We load them into an Oracle database using this script : loadProbesToOracle.pl ############################################################################## About 'qsub' job queuing. Scripts with 'qsub' names are intended to use job-queuing software compatible with Portable Batch System (PBS). We use Sun Grid Engine for this purpose on our Solaris server. For information about free and academic versions of PBS : http://www.openpbs.org If you prefer not to install PBS on your server, you can still use the 'qsub' scripts. After each 'qsub' script completes, simply run each of the '.bash' output files manually. ##############################################################################