Purpose:  This document describes how data flows through the scripts our lab uses.
Process:

  1. Convert the AXT to AXTe format using convert_axt.pl.
  2. Generate probes from the AXTe file using sooper_xml.pl.
  3. Create megablast run files using create_megablast_file.pl.
  4. Megablast each run using megablast.
  5. Validate the megablast output against the run file using validate_megablast_output.pl, called from process_megablast_output.sh.
  6. Process the validation results using zzzoom_probes.pl.
    • Separate probes into valid and invalid files
    • Mask invalid probes in AXTe
    • Create mini AXTe file containing only alignments that need to be reprocessed
  7. Reprocess the mini AXTe file if any invalid probes remain.
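
The loop implied by step 7 can be sketched roughly as follows. This is only an illustration in Python, not the lab's actual driver: process_once is a hypothetical stand-in for steps 2-6, and the max_rounds safety limit is an assumption that does not appear in the process above.

```python
from typing import Callable, List, Tuple


def run_until_valid(
    axte: str,
    process_once: Callable[[str], Tuple[List[str], str]],
    max_rounds: int = 10,
) -> Tuple[List[str], int]:
    """Repeat steps 2-6 until a round leaves nothing to reprocess.

    process_once(axte) is hypothetical: it returns (valid_probes, mini_axte),
    where mini_axte is empty when no alignment held an invalid probe.
    max_rounds is an assumed safety limit, not part of the documented process.
    """
    valid: List[str] = []
    rounds = 0
    current = axte
    while current and rounds < max_rounds:
        found, current = process_once(current)
        valid.extend(found)
        rounds += 1
    return valid, rounds
```

Each round feeds only the mini AXTe back in, so alignments that already produced a valid probe are not processed again.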

Script Parameters

CONVERT_AXT.PL
This script processes an AXT file containing a two-species alignment. It takes two parameters, species and version, and adds that information to the AXT file to create an AXTe file. The AXT and AXTe formats are very similar; see the AXTe file format for details. No standard parameters exist for this script, since they change for each run.

SOOPER_XML.PL
This script accepts several file formats (axte, pipmaker, blast and fasta). It breaks the sequences into non-gapped, non-repetitive alignments. From each alignment it generates a 36-basepair probe when possible.

Standard Usage : sooper_xml.pl -r -i.88 -a<outputfile> <inputfile>
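
To illustrate what "non-gapped, non-repetitive" blocks might look like, here is a small Python sketch. It is not sooper_xml.pl's code. It assumes gaps are written as "-" and repeats as lowercase bases, and it simply takes the first 36 bases of the first block that is long enough; the real script's probe selection is not described here and may differ.

```python
PROBE_LENGTH = 36  # probe size stated in the text


def ungapped_blocks(seq_a, seq_b):
    """Yield (start_column, block_a, block_b) for each run of alignment
    columns with no gap and no lowercase base in either sequence.

    Assumptions: "-" marks a gap, lowercase marks a repeat.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    start = None
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        clean = a != "-" and b != "-" and a.isupper() and b.isupper()
        if clean and start is None:
            start = i
        elif not clean and start is not None:
            yield start, seq_a[start:i], seq_b[start:i]
            start = None
    if start is not None:
        yield start, seq_a[start:], seq_b[start:]


def first_probe(seq_a, seq_b, length=PROBE_LENGTH):
    """Return the first `length` bases of the first long-enough block,
    or None when no block is long enough (hypothetical selection rule)."""
    for _start, block_a, _block_b in ungapped_blocks(seq_a, seq_b):
        if len(block_a) >= length:
            return block_a[:length]
    return None
```

Returning None mirrors the "when possible" wording: an alignment with no clean block of at least 36 bases yields no probe.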

CREATE_MEGABLAST_FILE.PL
This script takes a probe XML file and reformats it to work with megablast. Because of the processing time megablast requires, a probe count of 3000 per file is recommended.

Standard Usage : create_megablast_file.pl -count 3000 -file <inputfile> -output <outputfile>
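
The effect of -count 3000 can be pictured as splitting the probe list into fixed-size batches. The Python sketch below shows only that idea; batch_probes is a hypothetical helper, and it does not reproduce the file format that create_megablast_file.pl writes.

```python
from typing import Iterator, List, Sequence


def batch_probes(probes: Sequence[str], count: int = 3000) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `count` probes.

    The default of 3000 is the recommended probe count per megablast file.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    for start in range(0, len(probes), count):
        yield list(probes[start:start + count])
```

For example, 7000 probes would give two batches of 3000 and a final batch of 1000.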

MEGABLAST
This program takes a file of sequences and returns a list of scores for matching sequences in the genome. Because of the computation time required, it is generally run from a script so that the load can be spread over all processors.

Standard Usage : megablast -t16 -N2 -W11 -e0.6 -i <infile> -o <outfile> -d <database> -FF -D3
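
One way to spread the runs over several processors is sketched below in Python. It is not the lab's run script. The megablast flags are copied from the standard usage line; the run_all helper, the worker count and the ".out" output suffix are assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def megablast_command(infile: str, outfile: str, database: str) -> List[str]:
    """Build the argument list for one run, using the standard usage flags."""
    return [
        "megablast", "-t16", "-N2", "-W11", "-e0.6",
        "-i", infile, "-o", outfile, "-d", database, "-FF", "-D3",
    ]


def run_all(infiles: Sequence[str], database: str, workers: int = 4) -> List[int]:
    """Run one megablast process per input file, at most `workers` at a time.

    Returns the exit codes in input order. The ".out" suffix and the default
    worker count are assumptions for this sketch.
    """
    commands = [megablast_command(f, f + ".out", database) for f in infiles]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

Threads are enough here because each worker only waits on an external megablast process.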

VALIDATE_MEGABLAST_OUTPUT.PL
This script takes the megablast output and the create_megablast_file.pl output to determine whether a probe is unique and located where we expect. If a sequence has a score of 70+ and the location matches, the probe is located where we expect. If a sequence has a score of 40+, or five sequences have scores of 30+, the probe sequence is not considered unique. The script creates an output file containing the result and the header for each probe.

USE VIA PROCESS_MEGABLAST_OUTPUT.SH

Standard Usage : validate_megablast_output.pl -file <input_file>  >  <output_file>
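
The scoring rules described for validate_megablast_output.pl can be sketched in Python as shown below. This is an interpretation, not the script's code. It assumes that the hit at the expected location does not count against uniqueness, and that a probe is "valid" only when it is both located and unique. The Hit type and classify function are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    score: float
    at_expected_location: bool


def classify(hits: List[Hit]) -> Dict[str, bool]:
    """Apply the 70+ / 40+ / five-at-30+ thresholds to one probe's hits."""
    # Located: a hit of 70+ at the location we expect.
    located = any(h.score >= 70 and h.at_expected_location for h in hits)
    # Assumption: the expected hit itself is excluded from the uniqueness test.
    others = [h for h in hits if not h.at_expected_location]
    strong = sum(1 for h in others if h.score >= 40)
    moderate = sum(1 for h in others if h.score >= 30)
    unique = strong == 0 and moderate < 5
    # Assumption: "valid" means located and unique.
    return {"located": located, "unique": unique, "valid": located and unique}
```

Under these assumptions, a probe with a single hit of 72 at the expected location is valid, while adding one off-target hit of 45 makes it non-unique.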

ZZZOOM_PROBES.PL
This script takes several files and separates the valid (unique) probes from the invalid (non-unique) ones. Given the -mask flag, it also masks out the invalid sequence in the AXTe file so that it won't be selected again. The script also generates a mini-AXTe file containing only the alignments that held invalid probes, so that as the number of invalid probes decreases, probes are not regenerated from alignments that were already processed successfully. The script automatically scans the current directory for all chr??.probes.xml.??.(in)valid files and processes them.

Standard Usage : zzzoom_probes.pl -mask -c <chr num> -b <batch size> (standard batch size: 3000)
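
The directory scan for chr??.probes.xml.??.(in)valid files might work along these lines. The Python sketch assumes each "?" stands for exactly one character, as in a shell glob; group_result_files is a hypothetical helper and not part of the script.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List

# Patterns taken from the text; "?" is assumed to match one character.
VALID_PATTERN = "chr??.probes.xml.??.valid"
INVALID_PATTERN = "chr??.probes.xml.??.invalid"


def group_result_files(names: Iterable[str]) -> Dict[str, List[str]]:
    """Sort file names into valid and invalid result files, ignoring the rest."""
    groups: Dict[str, List[str]] = {"valid": [], "invalid": []}
    for name in sorted(names):
        if fnmatchcase(name, VALID_PATTERN):
            groups["valid"].append(name)
        elif fnmatchcase(name, INVALID_PATTERN):
            groups["invalid"].append(name)
    return groups
```

Passing os.listdir(".") as names would apply the grouping to the current directory.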