{literal} {/literal}

Introduction

The purpose of the Uprobe project is to provide the community with an efficient means of experimental access to large-insert cloned DNA (ie, BAC clones) from the full spectrum of vertebrate genomes for which genomic BAC libraries are planned or are currently available. This is being accomplished by technology development aimed at improving the ability of 'universal' hybridization-based probes to identify genes/regions of interest from multiple species, and then the dissemination of these improved technologies through the creation of:

A more general introduction to the Uprobe project and details related to the above mentioned resources are provided below.

 

Note that you can also download whole-genome probe sets, computer programs, and experimental protocols from this website.

 

 

Pre-computed whole-genome probe sets are available for:

All Mammals.

AUG_2003_b1 is based on human-mouse alignments and has been experimentally validated.

OCT_2003_b2 is based on human-mouse-rat alignments and has been experimentally validated.

FEB_2004_mammals_1 was created by combining the approaches used to build the first two probe sets and has been experimentally validated. It was the recommended and default probe set for screening mammalian genomic libraries prior to JUN_2005_mammals_2.

JUN_2005_mammals_2 was created by enhancing the FEB_2004_mammals_1 with new probes based on human-mouse-rat-dog-chicken alignments and is the current recommended and default probe set for screening mammalian genomic libraries.

Rodents.

APR_2005_rodents_1.1 was designed with a new algorithm, nsoop_v2, using mouse-rat-human-dog alignments specifically for screening rodent libraries and is currently in the process of being experimentally validated. This set replaces OCT_2004_rodents_1.

Carnivores.

APR_2005_carnivores_1.1 was designed with a new algorithm, nsoop_v2, using dog-human-mouse-rat alignments specifically for screening carnivore libraries and has been experimentally validated. This set replaces JAN_2005_carnivores_1.

Marsupials.

JUN_2005_marsupials_1 was designed from human-opossum alignments and is the recommended probe set for screening marsupial libraries.

All Birds and Reptiles.

MAR_2004_birds/reptiles_1 is based on chicken-human alignments and has been experimentally validated.

 

On demand universal probe design for nonhuman primates:

On-demand universal probe design is now available for Apes and Old World monkeys, New World monkeys, Simians, and All primates.

 

 

Custom universal probe design:

Custom universal probe design can be performed on DNA sequence alignments of 2 or more species provided by the user. A step-by-step tutorial for the custom probe design process is provided here.

 

 

References:

Thomas JW, Prasad AB, Summers TJ, Lee-Lin SQ, Maduro VV, Idol JR, Ryan JF, Thomas PJ, McDowell JC, Green ED. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Research 12:1277-85, 2002.

 

Kellner WA, Sullivan RT, Carlson BH, NISC Comparative Sequencing Program, Thomas JW. Uprobe: A genome-wide universal probe resource for comparative physical mapping in vertebrates. Genome Research 15:166-173, 2005.

 

Sullivan RT, Morehouse CB, NISC Comparative Sequencing Program, Thomas JW. Uprobe 2008: an online resource for universal overgo hybridization-based probe retrieval and design. Nucleic Acids Research 36:W149-W153, 2008.

 

The Uprobe project has been supported by past funding from the NIH (R24RR022239 and U01MH068185). Comments are welcome and can be directed to James Thomas (jthomas@genetics.emory.edu).

 

 

 

Below is an overview of the goals and rationale behind the Uprobe project, a description of what universal probes are and how they are designed, details on specific whole-genome probe sets, and a record of updates and changes to the Uprobe site.

 

 

Rationale

Comparative sequencing is a particularly powerful tool for inferring function from genomic sequence. Thus, sequencing the same region in multiple species simultaneously would provide a valuable method for interpreting all genomic sequence as it is generated. Just as important is access to cloned DNA containing orthologous genes and regions in multiple species catalogued in the sequence. Such access would establish a new resource that could be used for comprehensive functional analysis of both coding and non-coding sequence. This new resource would consist of a series of gene alleles generated not by mutagenesis within a species, but by the divergence of sequence over evolutionary time (ie 'evolutionary alleles'). These 'evolutionary alleles' could then be used to experimentally dissect the function of genomic sequence in cell culture or a transgenic model. At this time, there is no published method for supplying the templates necessary for this type of sequencing or functional studies. Since multiple species sequence comparisons and physical mapping among vertebrates will be a tremendous resource for the functional annotation of the human genome and a starting point for experimental analysis, robust and effective means of generating such clone resources in a targeted manner would be of immeasurable value to the research community.

 

BAC libraries offer the means to selectively isolate a specific region of a genome or to support whole-genome mapping and sequencing efforts. As such, both the NHGRI (RFA-HG-01-002) and NSF (NSF 01-145) have made funding commitments to establish a large resource of BAC libraries from a diverse set of species. In fact, nearly 100 vertebrate BAC libraries are currently available to the public (Fig. 1). This new BAC library resource will provide a source for comparative sequencing and functional studies across a wide range of vertebrates. Critical to the future utility of these BAC libraries will be the availability of efficient and reliable methods for screening these libraries and assembling high-quality BAC maps that can be used by the entire biomedical research community. This is especially important for individual researchers who do not necessarily have the experience, technology or production needs of a genome center.

 

Current / Pending Vertebrate BAC Libraries
{include file="treeMap.tpl"}

 

Species-specific sequence resources traditionally used for isolating and constructing physical maps of regions of interest, such as ESTs or other random genomic sequences, are not available for most vertebrates. Thus, the traditional route of building physical maps for a region of interest using species-specific markers would not be possible for many of the species shown in Fig. 1. The goal of the first phase of the Uprobe project is to combine the principles of traditional comparative mapping with existing methods for screening BAC libraries to develop an efficient, practical and reliable experimental strategy for assembling BAC maps from a diverse set of vertebrates. Specifically, this proposal aims to remove the limitation of species-specific resources necessary for BAC library screening by designing and testing universal overgo probes that can be used on single or multiple BAC libraries. This will be accomplished by identifying evolutionarily conserved sequences between species, such as human and mouse, for which genomic sequence is available. As a result, this would provide an experimental and computational infrastructure aimed at the one essential part of BAC library screening not yet standardized: probe design. The strategy and methodologies proposed here would also dramatically reduce the cost of building physical maps by increasing the potential mapping throughput and decreasing the cost of marker reagents while maximizing the ability to compare genomic maps of divergent species. Through the establishment of a public database of universal probes, individual researchers would have a key resource necessary to construct physical maps (independent of whole-genome efforts) in their region and species of interest that would otherwise be difficult to build. Small clone-based physical maps assembled by individual researchers would also complement whole-genome efforts through the direct integration of both types of maps via the specific clones of interest.
Therefore, it is anticipated that this scalable methodology would greatly facilitate the use of future BAC libraries by individual laboratories and genome centers alike, and thus be a widespread means of using the power of comparative genomics for both communities' specific research goals.

 

The ongoing goals of the Uprobe project are as follows:

Modern genomic tools and resources will be critical for ongoing and future nonhuman primate research, and whole-genome sequencing efforts are underway for a limited number of nonhuman primates. Unfortunately, because of the sequencing strategies employed and associated costs, nonhuman primate genome assemblies will have hundreds of thousands of gaps and will not provide a definitive reference sequence like the finished human genome. Moreover, whole-genome sequences will not yield direct access to the experimental tools necessary to exploit this extensive genetic information. Bacterial artificial chromosome (BAC) libraries and clones are a proven and valuable genomic resource for the experimental utilization and functional characterization of genomic sequence and are currently available for eighteen species of nonhuman primates. Input from representatives of the nonhuman primate and biomedical research communities revealed strong support for a resource to facilitate access to these nonhuman primate genomic libraries (see appendix). The goal of this proposal is to develop a resource that will provide an effective and reliable means for the primate and biomedical research communities to isolate any specific gene or region of interest from one or all nonhuman primate BAC libraries. To do so, a web-based tool will be developed for the custom design of universal hybridization probes that can be used for the isolation of nonhuman primate BAC clones. Universal hybridization probes are a proven technology for the efficient targeted isolation of BAC clones from multiple species in parallel. However, this methodology has not been optimized for use in nonhuman primates. This proposal will establish those optimal parameters and provide them as preset values for the custom design of nonhuman primate universal probes by the public.
This custom universal hybridization probe design website will therefore facilitate access to the full spectrum of genetic diversity captured within all current and future nonhuman primate genomic libraries independent of, and as a complement to, whole-genome sequencing efforts. As a result, this proposal will yield an important avenue by which individual researchers can readily import nonhuman primate genomic clones into their own laboratories and experimental paradigms. The aims of this resource proposal are:

 

Aim 1. Develop a public resource for the custom design of universal hybridization probes for isolating nonhuman primate genomic clones. A robust universal probe design pipeline used previously in the creation of whole-genome probe sets will be adapted for the custom and on-demand design of universal probes for isolating nonhuman primate genomic clones from specific genes and regions of interest. A web-based interface to this custom probe design pipeline will allow the public to design probes from a series of default settings tailored to the efficient isolation of genomic clones from four specific clades of primates (1. all primates, 2. simians, 3. New World monkeys and 4. Old World monkeys and apes), or to design probes from their own sequence data.

 

Aim 2. Experimentally validate the nonhuman primate custom probe design resource. Small sets of universal probes designed from each of the four ‘default’ nonhuman primate universal probe design options will be selected for use in a small-scale targeted comparative mapping and sequencing project for experimental validation of the resource, and to provide a real world example of how to use the resource and the data it can produce.

 

 

What are universal probes?

The concept behind the use of universal probes is very simple. If a sequence is conserved between two divergent species, then it is likely to be conserved in other species as well. For example, if a sequence is conserved between human and chicken, then it is likely to be conserved among all mammals and all birds. Thus, a single sequence can act as an effective probe for screening genomic BAC libraries from many birds and mammals and alleviate the need for generating species-specific probes for every genomic library to be screened. In addition, since a single probe is used to screen multiple species, screening of BAC libraries can be done in parallel with identical hybridization and washing conditions. In the preliminary data for this project, universal probes for screening placental BAC libraries were designed based on sequence similarity between human and mouse sequence. These probes were tested and found to be effective at isolating BAC clones from a set of placental mammals (cat, dog, cow, pig, rat, baboon and chimpanzee) (Thomas et al, Parallel Construction of Orthologous Sequence-Ready Clone Contig Maps in Multiple Species. Genome Res, 2002. 12:1277-1285).

 

The specifics of the process are illustrated in the figure below. The probes themselves are called 'overgo' probes and consist of two partially complementary 22-mers that overlap by 8 bp and are radioactively labeled in a Klenow fill-in reaction with dATP and dCTP. Overgo probes were developed by John McPherson and have been used extensively by large and small labs alike to screen genomic libraries. Their specificity and uniform design parameters allow groups of probes to be hybridized together. Because primers are cheap and the radioactivity used for labeling a given probe is minimal, we strongly recommend (when feasible) using multiple probes from a given region for library screening rather than a single probe. We aim to design probes that will have at least a 50% chance of success in a given species; by using multiple probes, the likelihood of identifying clones of interest with one hybridization is therefore maximized. When large regions are targeted for isolation, spacing the probes every ~30 kb has proven very successful. This basic process is being used to design whole-genome probe sets for clusters of species, such as placental mammals, birds, reptiles and sub-groups of fish using whole-genome alignments.
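
The benefit of pooling several probes can be illustrated with a short calculation. The sketch below assumes that each probe succeeds independently and with the same probability, which is a simplification made here for illustration only; the function name is hypothetical and the 0.5 value is the per-probe design target mentioned above.

```python
def chance_of_at_least_one_hit(per_probe_success: float, n_probes: int) -> float:
    """Probability that at least one of n independent probes identifies a clone."""
    if not 0.0 <= per_probe_success <= 1.0:
        raise ValueError("per_probe_success must be between 0 and 1")
    if n_probes < 0:
        raise ValueError("n_probes must not be negative")
    # All probes fail with probability (1 - p)^n; the complement is the chance
    # that at least one probe succeeds.
    return 1.0 - (1.0 - per_probe_success) ** n_probes


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, round(chance_of_at_least_one_hit(0.5, n), 4))
```

Under these assumptions, three probes at the 50% target give a 0.875 chance that at least one identifies a clone in a given species, and four probes give 0.9375.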

 

 

 

Figure 2. Strategy for designing universal overgo hybridization probes based on human-mouse sequence alignments. Orthologous human and mouse genomic sequences are masked for repetitive elements (indicated by X's) and then aligned. Regions with high sequence conservation (indicated by vertical lines) are identified and used for designing probes. When possible, a single 36-bp human sequence from each alignment is chosen based on GC content and percent human-mouse sequence identity. A subset of these sequences is then chosen to optimize for inter-probe spacing (~30-40 kb). Three such conserved sequences are depicted in the figure, with greater details provided (in the box) for the middle one. At this stage, each selected 36-bp sequence is compared to all available human genomic sequence to confirm that it is single copy. Overlapping pairs of oligonucleotide primers are then synthesized for each sequence and used to generate double-stranded, radiolabeled (indicated by *'s) probes. The probes across a target region(s) are then pooled and used to screen arrayed BAC libraries, allowing the isolation of individual positive BACs.
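
The pair of overlapping oligonucleotides for a selected 36-bp sequence can be derived as sketched below. This is a minimal illustration, assuming the two 22-mers are taken from opposite strands so that their 3' ends anneal across the central 8 bp (22 + 22 - 8 = 36); the function names and the example sequence are hypothetical and are not taken from the soop program.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def overgo_oligos(target: str, oligo_len: int = 22, overlap: int = 8):
    """Split a target sequence into a forward and a reverse overgo oligo.

    The forward oligo is the first `oligo_len` bases of the target; the
    reverse oligo is the reverse complement of the last `oligo_len` bases.
    """
    target = target.upper()
    expected_len = 2 * oligo_len - overlap
    if len(target) != expected_len:
        raise ValueError(f"target must be {expected_len} bp long")
    if set(target) - set("ACGT"):
        raise ValueError("target may contain only A, C, G and T")
    forward = target[:oligo_len]
    reverse = reverse_complement(target[-oligo_len:])
    return forward, reverse


if __name__ == "__main__":
    example = "ACGT" * 9  # hypothetical 36-bp sequence
    fwd, rev = overgo_oligos(example)
    print(fwd)
    print(rev)
```

Filling in the single-stranded overhangs of the annealed pair regenerates the full 36-bp double-stranded sequence, which is where the label is incorporated.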

 

 

How the AUG_2003_b1 whole-genome probe set for screening mammalian libraries was generated.

Whole-genome human-mouse alignments (axtTight) (Schwartz et al, Human-mouse alignments with BLASTZ. Genome Res 2003, 13:103-107 and Waterston et al, Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420:520-562) between the April 2003 assembly (UCSC version hg15) of the human genome and the Feb. 2003 assembly (UCSC version mm3) of the mouse genome were downloaded from http://genome.ucsc.edu. This alignment file was then modified for use with a modified algorithm for probe design, soop. Common repetitive sequences were masked in the file and then, when possible, 1 candidate probe with >88% human-mouse sequence identity from each ungapped alignment was designed based on the human sequence. These candidate probes were then compared to the April 2003 human genome build by megablast (megablast -t 16 -N 2 -W 11 -e 0.6 -F F -D 3). The megablast output was used to confirm the location of the probes in the human genome, and to tag each probe as 'unique' or 'non-unique'. Unique probes had a single identical hit to the human genome assembly, no other hits with a bit score above 40, and fewer than 5 hits with a score above 36. These are very stringent criteria for calling a probe unique, and we feel that unique probes, to the best of our knowledge, represent single-copy sequences in the human genome and should be well suited for screening BAC libraries from other mammals. Non-unique probes also had one identical hit to the human assembly at the expected location, but had at least one other hit above 40 bits or 5 or more hits above 36 bits. While the non-unique probes are not single-copy in the human genome based on our criteria, we have kept them in our database for potential use in regions of the human genome that are duplicated or for isolating genes within a gene family. We do not recommend the use of non-unique probes unless there is no alternative unique probe available.
To increase the number of unique probes, only the non-unique probe sequences were masked in the human-mouse alignment file, and candidate probes were then designed again, only from alignments that had yielded a non-unique probe. Candidate probes were again compared to the human genome by megablast and designated unique or non-unique. This recursive process was repeated 2 times to yield 139,272 unique probes and 97,721 non-unique probes.
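
The unique/non-unique decision described above can be sketched as a small classifier. This is an illustration rather than the code used on uprobe: the `Hit` record, the function name and the example bit scores are hypothetical, and the count of hits above 36 bits is assumed here to be taken over the hits other than the identical one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    bit_score: float
    identical: bool  # True for a full-length, 100% identity match


def classify_probe(hits: List[Hit],
                   high_cutoff: float = 40.0,
                   low_cutoff: float = 36.0,
                   max_low_hits: int = 5) -> str:
    """Return 'unique' or 'non-unique' for one probe's megablast hits."""
    identical = [h for h in hits if h.identical]
    others = [h for h in hits if not h.identical]
    if not identical:
        raise ValueError("probe has no identical hit at its source location")
    if len(identical) > 1:
        return "non-unique"  # more than one perfect copy in the genome
    if any(h.bit_score > high_cutoff for h in others):
        return "non-unique"  # one strong secondary hit is enough
    # Assumption: only non-identical hits count toward the low-score limit.
    weak_hits = sum(1 for h in others if h.bit_score > low_cutoff)
    return "non-unique" if weak_hits >= max_low_hits else "unique"


if __name__ == "__main__":
    print(classify_probe([Hit(71.9, True)]))
    print(classify_probe([Hit(71.9, True), Hit(42.1, False)]))
```

Treating a second identical hit as non-unique follows the correction described under item 3 of the 12/2003-03/2004 updates.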

 

 

Figure 3. Results of the experimental validation of a sample set of AUG_2003_b1 universal probes.

To test the efficiency of the mammalian whole-genome probe set AUG_2003_b1, 48 probes were selected from 7 regions of the human genome for screening the marmoset (CHORI-259), galago (CHORI-256), rabbit (LBNL-1), bat (VMRC-7), shrew (SA_Ba), armadillo (VMRC-5), wallaby (ME_KBa), and platypus (OA_Bb) BAC libraries. After primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated, and is shown above. The distribution of the probes' percent identity between human and mouse was slightly enriched for higher-identity probes relative to the whole-genome set of unique probes (21% versus 16% at 100% identity, 19% versus 18% at 97% identity, 19% versus 21% at 94% identity, 23% versus 22% at 91% identity, and 19% versus 23% at 88% identity). However, because selecting probes for optimal physical spacing will strongly shift the selected probes toward higher percent identity, we believe this sample set provides an accurate measurement of the effectiveness of the unique whole-genome probe set. Representative clones have been sent to the NIH Intramural Sequencing Center for sequencing to confirm probe specificity.

 

 

Table 2. Summary of Universal Probes in AUG_2003_b1 (Mammalian)

 

 

 

 

Human Chromosome    Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                218712898               12,156           9,401
Chr2                237043677               13,752           8,883
Chr3                193607233               9,624            6,298
Chr4                186580523               6,748            4,411
Chr5                177524972               9,125            5,989
Chr6                166880541               6,465            5,533
Chr7                154546299               6,675            4,770
Chr8                141694337               5,573            3,724
Chr9                115187719               6,151            4,501
Chr10               130710874               6,501            4,143
Chr11               130709420               7,735            5,451
Chr12               129328334               5,491            3,994
Chr13               95511656                3,512            2,462
Chr14               87191216                5,017            3,338
Chr15               81117055                4,943            3,929
Chr16               79890795                4,568            3,013
Chr17               77480855                5,655            4,433
Chr18               74534531                3,474            1,967
Chr19               55780860                1,970            1,596
Chr20               59424990                3,150            2,008
Chr21               33924747                813              713
Chr22               34352072                1,100            1,071
ChrX                147686666               8,148            6,529
ChrY                22761097                10               480

Total               2832199938 bp           138,356          98,637

 

 

 

 

OCT_2003_b2 Mammalian Whole-Genome Probe set.

A sample set of orthologous genomic sequences from human, mouse, rat, dog, cat, cow and pig was used to empirically optimize the universal probe design process using human-mouse-rat alignments. A total of 2,863 36-bp probe sequences with 7 or fewer mismatches between human and mouse, and for which rat, dog, cat, cow and pig sequences were also available, were used as the basis of this process. For each substitution pattern between human-mouse-rat

(ie, human AAAAA

     mouse AATTT

     rat   ATATC

   pattern 12345)

 

a 'weight' was assigned based on the calculated percent identity for each pattern between the human nucleotide and the corresponding dog, cat, cow and pig nucleotide. The calculated values were:

weight = (sum of identical bases in dog, cat, cow and pig) / (total number of bases with that pattern x 4)

Pattern 1 = (67135 + 67461 + 66253 + 66623) / (72680 x 4) = 0.9200

Pattern 2 = (2472 + 2498 + 2399 + 2441) / (3003 x 4) = 0.8167

Pattern 3 = (1258 + 1263 + 1225 + 1215) / (1526 x 4) = 0.8127

Pattern 4 = (3476 + 3521 + 3399 + 3467) / (5885 x 4) = 0.5889

Pattern 5 = (318 + 330 + 304 + 310) / (501 x 4) = 0.6297

 

A score was calculated for each probe by counting the patterns and then summing the corresponding 'weights'. Because rat and mouse are at essentially equivalent distances from human, and only 0.004 separated the values for patterns 2 and 3, a single value, 0.8147, was used for both of those patterns. The correlation between the probe scores and the number of mismatches per probe in dog, cat, cow and pig was then calculated and compared to the correlation coefficient obtained using a probe score based solely on the number of mismatches between the human (probe) sequence and the mouse sequence. The correlation coefficient for the mouse mismatch score alone was 0.5425073 and for the new matrix 0.564635, indicating that adding the rat sequence and using this matrix does provide a better basis for designing universal mammalian probes. While the increase in the correlation is not large, this basic scoring matrix strategy can be used with larger numbers and/or more informative combinations of species (such as human-mouse-dog).
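
The scoring scheme can be written out in a few lines. The sketch below is an illustration, not the soop code itself: the function names are hypothetical, the weights are the values given above (with 0.8147 used for both patterns 2 and 3), and the three sequences are assumed to come from an ungapped alignment of equal length.

```python
# Weights per human-mouse-rat substitution pattern, as listed in the text.
PATTERN_WEIGHTS = {1: 0.9200, 2: 0.8147, 3: 0.8147, 4: 0.5889, 5: 0.6297}


def substitution_pattern(human: str, mouse: str, rat: str) -> int:
    """Classify one alignment column into patterns 1-5."""
    if human == mouse == rat:
        return 1  # all three identical
    if human == mouse:
        return 2  # only rat differs
    if human == rat:
        return 3  # only mouse differs
    if mouse == rat:
        return 4  # mouse and rat agree, human differs
    return 5      # all three differ


def probe_score(human: str, mouse: str, rat: str) -> float:
    """Sum the pattern weights over every column of an ungapped alignment."""
    if not (len(human) == len(mouse) == len(rat)):
        raise ValueError("aligned sequences must have the same length")
    columns = zip(human.upper(), mouse.upper(), rat.upper())
    return sum(PATTERN_WEIGHTS[substitution_pattern(h, m, r)] for h, m, r in columns)


if __name__ == "__main__":
    # The five example columns from the text map to patterns 1 through 5.
    print([substitution_pattern(h, m, r) for h, m, r in zip("AAAAA", "AATTT", "ATATC")])
    print(round(probe_score("A" * 36, "A" * 36, "A" * 36), 2))
```

A 36-bp probe that is identical in all three species scores 36 x 0.92 = 33.12, which is the highest score that appears in the probe score summary for OCT_2003_b2.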

 

The second major change to the probe design process was the selection of all probes that fell within the 44-56% GC range and met the minimum scoring requirement. In the previous build, only the 'best' probe was selected for each gap-free alignment between human and mouse. To provide the maximum number of probe options, we eliminated the 'best' criterion and now include all sequences that meet the set probe criteria.
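
Keeping every qualifying sequence rather than a single 'best' one amounts to a sliding-window filter. The sketch below covers only the GC criterion, with the GC range expressed as fractions (0.44 to 0.56) and assumed to be inclusive at both ends; the minimum-score requirement is left out, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a non-empty DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_windows(human_seq: str, probe_len: int = 36,
                      gc_min: float = 0.44, gc_max: float = 0.56):
    """Return (start, sequence) for every window whose GC fraction is in range."""
    human_seq = human_seq.upper()
    candidates = []
    for start in range(len(human_seq) - probe_len + 1):
        window = human_seq[start:start + probe_len]
        if gc_min <= gc_fraction(window) <= gc_max:
            candidates.append((start, window))
    return candidates


if __name__ == "__main__":
    example = "GC" * 9 + "AT" * 9  # hypothetical 36-bp sequence, 50% GC
    print(candidate_windows(example))
```

For a 36-bp window, this range corresponds to 16 to 20 G or C bases.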

 

This new algorithm was applied to the Multiz human-mouse-rat whole-genome alignment of human (UCSC hg15), mouse (UCSC mm3) and rat (UCSC rn2), generated by W. Miller and J. Kent and downloaded from http://www.genome.ucsc.edu (Blanchette et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14:708-715; RGSPC. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493-521). Based on analysis of our test data set, a probe score cutoff value of 31.83 was determined to be more stringent than the 4 or fewer mismatches between human and mouse used in AUG_2003_b1 (ie, it would include greater than 93% of all probes with 0, 1, 2 or 3 mismatches between human and mouse and exclude 60% of all probes with 4 mismatches between human and mouse). In addition, we have also edited the probe set to remove a small fraction of candidate sequences (<5%) that had properties that might compromise their general utility

(ie, a sequence that looked like this:GGCCGGGGGCCGCCCGGATATTATTTATAATATAT).

Specifically, probes with a gc_score (see soop.pl algorithm, OM40 (McPherson)) above 55.36 were removed.

 

 

Figure 4. Results of the experimental validation of a sample set of OCT_2003_b2 universal probes.

To test the efficiency of the mammalian whole-genome probe set OCT_2003_b2, 48 probes were selected from 11 regions of the human genome for screening the marmoset (CHORI-259), galago (CHORI-256), rabbit (LBNL-1), bat (VMRC-7), shrew (SA_Ba), armadillo (VMRC-5), elephant (VMRC-15), wallaby (ME_KBa), and platypus (OA_Bb) BAC libraries. After primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated, and is shown above. The test set of probes was selected to be an accurate sampling of the entire probe set (in terms of probe score). Representative clones have been sent to the NIH Intramural Sequencing Center for sequencing to confirm probe specificity.

 

A numerical summary of this probe build is listed below.

 

Table 3. Summary of Universal Probes in OCT_2003_b2 (Mammalian)

 

 

 

 

Human Chromosome    Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                218712898               457,441          268,561
Chr2                237043677               420,924          222,490
Chr3                193607233               318,413          162,270
Chr4                186580523               198,668          103,338
Chr5                177524972               270,796          149,901
Chr6                166880541               213,399          139,525
Chr7                154546299               212,745          122,716
Chr8                141694337               179,965          91,364
Chr9                115187719               221,804          124,513
Chr10               130710874               208,993          105,613
Chr11               130709420               286,737          159,738
Chr12               129328334               211,643          124,001
Chr13               95511656                103,488          58,366
Chr14               87191216                178,789          100,094
Chr15               81117055                189,412          108,324
Chr16               79890795                159,073          91,888
Chr17               77480855                237,188          144,407
Chr18               74534531                99,938           48,999
Chr19               55780860                84,060           53,058
Chr20               59424990                112,647          58,712
Chr21               33924747                34,475           20,460
Chr22               34352072                61,558           39,617
ChrX                147686666               224,952          149,438
ChrY                22761097                112              8,855

Total               2832199938 bp           4,687,220        2,656,248

 

 

Table 4. Summary of probe scores for OCT_2003_b2.

Score    Unique Probes    Non-Unique Probes
31.83    18498            12059
31.84    55823            31366
31.85    47247            27331
31.86    243              144
31.87    3523             2334
31.88    4897             2868
31.89    4049             2336
31.91    211              159
31.92    344500           196685
31.93    41675            25794
31.95    1548             1296
31.96    68556            39478
31.97    7857             4694
31.99    227              214
32.00    5452             3309
32.01    330              223
32.02    436743           245136
32.04    88869            53423
32.05    4183             3044
32.06    73638            42661
32.07    301              165
32.08    15043            8745
32.09    423              326
32.10    4859             2865
32.12    621              383
32.13    351825           194616
32.14    177216           104856
32.16    10215            6712
32.17    45930            26272
32.18    25930            15562
32.20    1009             622
32.21    2586             1616
32.22    1071             670
32.25    300994           174414
32.26    23845            15111
32.28    761              490
32.29    38227            22927
32.30    2206             1559
32.33    1384             943
32.35    402291           230841
32.37    54886            32709
32.38    1628             1082
32.39    42745            26130
32.41    4615             2955
32.43    1412             909
32.46    365954           200052
32.47    111929           67463
32.49    3715             2472
32.50    28042            16659
32.51    7772             5055
32.54    754              538
32.58    198871           119827
32.59    8568             5685
32.62    12207            7936
32.68    295510           169997
32.70    19308            12452
32.72    14440            8404
32.79    338178           176621
32.80    41974            25355
32.83    10312            5912
32.91    82401            48641
33.01    150224           81423
33.12    276969           133722

Total    4687220          2656248

 

 

 

A general translation of the scores of this build versus build 1 as determined by our test data set is:

Build 1    Build 2
100%       33.05+/-0.14
97%        32.71+/-0.24
94%        32.42+/-0.21
91%        32.12+/-0.22
88%        31.82+/-0.23

 

 

Updates_Changes: 12/2003-03/2004.

At the end of 2003 and the beginning of 2004 a substantial number of changes were made to uprobe.

 

1. Modification of the algorithm used for optimal spacing of the probes.

In the first edition of the probe spacing algorithm used on uprobe (based on the original soop program), if a probe could not be found within a range corresponding to 0.5-1.5 times the optimal spacing distance (osd), the first probe beyond this interval was selected as the next probe, even when a better probe lay just beyond it. Therefore, we have modified the spacing algorithm on uprobe to select the best probe beyond 1.5 X osd within a range equivalent to 0.15 X osd. The result of this change is an enhancement in selecting the best probes in all cases without a significant sacrifice in probe spacing.
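
The modified selection rule can be sketched as follows. This is an interpretation rather than the uprobe code: it assumes that probes are (position, score) pairs, that selection starts from the first probe, that the 'best' probe is the one with the highest score, and that the 0.15 X osd range is measured from the first probe found beyond 1.5 X osd. All names are hypothetical.

```python
def space_probes(probes, osd, low=0.5, high=1.5, rescue=0.15):
    """Pick a spaced subset of (position, score) probes along a region."""
    probes = sorted(probes)
    if not probes:
        return []
    selected = [probes[0]]  # assumption: start from the first probe
    while True:
        pos = selected[-1][0]
        ahead = [p for p in probes if p[0] > pos]
        # Preferred case: best-scoring probe within 0.5-1.5 x osd.
        window = [p for p in ahead if pos + low * osd <= p[0] <= pos + high * osd]
        if not window:
            beyond = [p for p in ahead if p[0] > pos + high * osd]
            if not beyond:
                break
            # Fallback: instead of taking the first probe past the interval,
            # take the best one within 0.15 x osd of that first probe.
            first_pos = beyond[0][0]
            window = [p for p in beyond if p[0] <= first_pos + rescue * osd]
        selected.append(max(window, key=lambda p: p[1]))
    return selected


if __name__ == "__main__":
    example = [(0, 32.0), (20000, 31.9), (30000, 33.1),
               (100000, 31.9), (103000, 33.0), (110000, 33.1)]
    print(space_probes(example, osd=30000))
```

In the example, no probe lies 15-45 kb past the probe at 30,000, so the fallback considers the probes at 100,000 and 103,000 (within 4.5 kb of the first one found) and keeps the higher-scoring of the two.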

 

2. Modification of original soop algorithm and sooper.xml.

In the original soop algorithm, there was a bug such that no probes were designed from the last ungapped alignment within a gapped alignment. This resulted in the design of fewer probes than was possible and was particularly problematic when there was only 1 ungapped alignment (ie, the first ungapped alignment was also the last ungapped alignment, thus no probes were designed). This bug was corrected in both the original soop.pl and sooper.xml. Corrected versions are available for download on uprobe.

 

3. AUG_2003_b1 and OCT_2003_b2 probe corrections.

A bug was detected in the algorithm that is used to determine whether or not a probe is unique. Specifically, in cases where 2 or more identical matches were identified by megablast on the same chromosome as the probe was derived from, these extra matches were not included in the determination of uniqueness. All probes in both builds were re-evaluated to take this discrepancy into account and have been corrected in both the query database and bulk download files. The fraction of probes that were affected was ~0.5%. In addition, a few probes (n=124) did not have the correct location in OCT_2003_b2. These have been corrected. The whole database download files for AUG_2003_b1 and OCT_2003_b2 did not have the correct distance to next probe value. This has been corrected.

 

4. Modification of query interface.

Changes to search interface include:

Default is now FEB_2004_mammals_1 with spacing of 30 kb.

Exact parameters from last query are displayed after a search.

A reset button has been added that returns the query options to the default.

"Search for" on query page now does not recognize comma delimited queries.

Sets of probes >20,000 can not be viewed efficiently at this time on the UCSC browser.

Sets of probes >65,534 can not be downloaded at this time. For very large downloads, we suggest using the files provided on the download page that include all overgos for a given probe set.

When a query returns >1000 probes, only the first 1000 are displayed on the page. All probes are included in the download files (except if the download limit is exceeded).

 

 

New probe sets:

FEB_2004_mammals_1

FEB_2004_mammals_1 is a merger of probes based on the selection criteria used in AUG_2003_b1 and OCT_2003_b2. The best OCT_2003_b2 probes in every 1 kb interval (220,283 unique and 172,946 non-unique) were merged with probes designed using human-mouse alignments (as in AUG_2003_b1, cutoff 88% identity, 258,392 unique, 171,411 non-unique, debugged version of sooper.xml) and filtered to produce a single non-redundant probe set. This merger of probes designed by the two approaches reduces the problems inherent to each approach. For example, while the OCT_2003_b2 approach uses human-mouse-rat alignments to select the probes, the current algorithm ignored alignments that did not include all three species. Thus, no probes were designed from regions with sequencing gaps in either the mouse assembly or rat assembly in OCT_2003_b2.
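
Selecting the best probe in every 1 kb interval can be sketched as a simple binning step. The sketch assumes fixed, non-overlapping 1,000-bp bins counted from coordinate 0 and that 'best' means the highest probe score, with the first probe kept on ties; neither detail is specified above, and the function name is hypothetical.

```python
def best_probe_per_interval(probes, interval=1000):
    """Keep the highest-scoring (position, score) probe in each interval."""
    best = {}
    for position, score in probes:
        bin_index = position // interval
        # Strict '>' keeps the first probe seen when scores are tied.
        if bin_index not in best or score > best[bin_index][1]:
            best[bin_index] = (position, score)
    return [best[b] for b in sorted(best)]


if __name__ == "__main__":
    example = [(100, 32.0), (900, 33.1), (1500, 31.9)]
    print(best_probe_per_interval(example))
```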

 

FEB_2004_mammals_1 probes were assigned two scores, one from each approach. In cases where only a human-mouse alignment, but not a human-mouse-rat alignment, was available, we estimated a human-mouse-rat score using the average score of all other probes with the same percent identity between human and mouse. OCT_2003_b2 scores are used for probe selection when the spacing option is invoked.
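A minimal sketch of the score estimation described above, assuming each probe is a dictionary with hypothetical keys `pct_identity` (human-mouse percent identity) and `hmr_score` (human-mouse-rat score, `None` when no three-species alignment exists). The field names and the data structure are illustrative only.

```python
from collections import defaultdict


def estimate_missing_scores(probes):
    """Fill in missing human-mouse-rat scores with the mean score of
    probes that share the same human-mouse percent identity."""
    totals = defaultdict(lambda: [0.0, 0])  # pct_identity -> [sum, count]
    for p in probes:
        if p["hmr_score"] is not None:
            totals[p["pct_identity"]][0] += p["hmr_score"]
            totals[p["pct_identity"]][1] += 1

    filled = []
    for p in probes:
        score = p["hmr_score"]
        if score is None and p["pct_identity"] in totals:
            total, count = totals[p["pct_identity"]]
            score = total / count
        # Probes with no scored peer at the same identity keep None.
        filled.append({**p, "hmr_score": score})
    return filled


probes = [
    {"pct_identity": 92, "hmr_score": 4.0},
    {"pct_identity": 92, "hmr_score": 6.0},
    {"pct_identity": 92, "hmr_score": None},
]
print(estimate_missing_scores(probes)[2]["hmr_score"])  # 5.0
```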

 

Since the criteria used to design the FEB_2004_mammals_1 probes are not different from those of the previous builds, the probe success rates can be estimated from prior experimental confirmation of the probes in AUG_2003_b1 and OCT_2003_b2.

 

Table 5. Probe Summary by Chromosome for FEB_2004_mammals_1.

 

Human Chromosome    Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                218712898               33459            31280
Chr2                237043677               32217            27228
Chr3                193607233               25103            20638
Chr4                186580523               17177            14189
Chr5                177524972               21974            18616
Chr6                166880541               18090            16487
Chr7                154546299               17193            16174
Chr8                141694337               14406            12071
Chr9                115187719               15744            14049
Chr10               130710874               16897            14114
Chr11               130709420               20653            17942
Chr12               129328334               16723            14721
Chr13               95511656                8714             7403
Chr14               87191216                12958            11013
Chr15               81117055                13027            12683
Chr16               79890795                12633            11354
Chr17               77480855                16470            15428
Chr18               74534531                8377             6443
Chr19               55780860                7222             6741
Chr20               59424990                8738             7240
Chr21               33924747                2833             2496
Chr22               34352072                4802             4361
ChrX                147686666               16545            16021
ChrY                22761097                31               1106

Total               2832199938              361986           319798

 

 

MAR_2004_birds/reptiles_1

MAR_2004_birds/reptiles_1 is based on chicken-human alignments of UCSC chicken genome assembly galGal2 and UCSC human genome assembly hg16 downloaded from http://genome.ucsc.edu. Preliminary data suggested that probes based on chicken sequence with >88% chicken-human identity will be useful for screening bird and reptile libraries. We used the latest version of sooper.xml to generate the probes with a cutoff of 88% identity. Probes were designed from the chicken sequence and then classified as unique or non-unique by megablast comparison to galGal2 using the criteria described above. Statistics from the build are given below, as is a summary of our experimental validation.

 

Table 6. Probe Summary by Chromosome for MAR_2004_birds/reptiles_1

 

 

 

 

Chicken Chromosome  Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                183744490               9874             3416
Chr1_random         1261352                 91               52
Chr2                143798269               7739             2600
Chr2_random         53846                   8                42
Chr3                105892232               7085             1748
Chr3_random         1565637                 298              125
Chr4                87964617                5311             1714
Chr4_random         1053861                 60               10
Chr5                54038396                4382             1509
Chr5_random         34329                   1                2
Chr6                33398103                2834             885
Chr6_random         3628                    0                0
Chr7                35405183                3479             1375
Chr7_random         2021                    0                3
Chr8                28179243                2884             803
Chr8_random         5240                    0                0
Chr9                23054450                2135             697
Chr10               18954178                1989             1076
Chr10_random        3515812                 177              73
Chr11               17999990                1544             640
Chr11_random        1104880                 360              86
Chr12               19041590                1541             508
Chr13               16797080                1447             672
Chr13_random        1106257                 150              63
Chr14               20157496                1741             644
Chr15               12220990                1371             545
Chr16               190259                  10               12
Chr16_random        244471                  8                22
Chr17               9893572                 1192             506
Chr18               8797585                 882              1300
Chr19               9317615                 1620             451
Chr20               13295085                1302             515
Chr21               6044995                 957              366
Chr22               2187313                 197              110
Chr23               5032209                 824              286
Chr24               5780194                 914              245
Chr24_random        95706                   8                22
Chr26               3666719                 650              283
Chr27               2501764                 373              244
Chr27_random        696249                  226              132
Chr28               4040991                 570              189
Chr28_random        5820                    0                12
Chr32               990310                  223              115
Chr32_random        28993                   0                0
ChrW                4135691                 458              124
ChrW_random         229903                  1                29
ChrZ                30832492                1840             497
ChrZ_random         14348615                850              326
ChrE22C19W28        47202                   14               8
ChrE26C13           213526                  67               18
ChrE50C23           10171                   1                4
ChrE64              1525                    0                0
Chr_Un              121198700               4032             11093

Totals              1054180845              73720            36197

 

 

 

 

Figure 5. Results of the experimental validation of a sample set of MAR_2004_birds/reptiles_1 universal probes.

To test the efficiency of the bird/reptile whole-genome probe set, MAR_2004_birds/reptiles_1, n=68 probes were selected from n=8 regions of the chicken genome for screening the turkey (CHORI-260), zebra finch (TG_Ba), emu (VMRC-16), alligator (VMRC-8), and tuatara (VMRC-12) BAC libraries. After primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated, and is shown above. Representative clones have been sent to the NIH Intramural Sequencing Center for sequencing to confirm probe specificity.

 

nsoop and nsoop_v2 correction

To fully utilize the increasing number of multiple-species whole-genome alignments, a new probe design algorithm, nsoop, was developed. Briefly, nsoop takes as input multiple-species alignments from N species, of which all, or just a subset, can be considered in the probe design process. A user-defined phylogeny (Newick format with branch lengths) is then used with maximum parsimony or maximum likelihood methods for ancestral sequence reconstruction at each node, and subsequent scoring is based on the sum of the branch lengths connecting nodes/tips with matching nucleotides. Instances where maximum parsimony does not resolve a position in the ancestral sequence to a single nucleotide are scored by taking the average of the scores calculated using each possible nucleotide. The maximum parsimony and maximum likelihood methods result in very similar sets of selected probes. For example, in a sample of 4,458,109 probes designed from human-chimp-mouse-rat-chicken alignments, 97.41% of the probes were selected by both the maximum parsimony and maximum likelihood scoring routines, and the correlation between the maximum parsimony and maximum likelihood scores was in excess of 0.998.

During the process of creating a second all-mammals probe set, an error in the intended scoring logic was detected in the original version of nsoop. Specifically, scoring of the cumulative branch lengths began at the root of the tree and went down the tree, scoring matches between nodes (inferred ancestral sequences) and the nodes and leaves/tips of the tree (observed sequences). Our intent was to start from the sequence from which the probe is designed (leaf), progress up the tree as far as possible (ie, as long as the ancestral node sequences matched the probe sequence), and then score from that point down the tree. For example, if the nucleotide position was identical between all observed species, and thus likely all ancestral sequences, the root would be the starting point for scoring and would result in the highest score possible for a single position. However, if the inferred ancestral sequence at the first node up from the observed sequence differs (representing, say, the most recent common ancestor of mouse and rat when making a probe from mouse sequence), then the lowest score possible would be assigned regardless of whether or not the position was conserved among any or all of the other species. Depending on the ancestral sequence reconstructions and the pattern of mismatches among the species, this could have a significant impact on scoring the probes. In particular, in some cases probes would receive higher or lower scores than intended. To correct this error in these and future probe sets, we implemented the intended scoring logic in nsoop_v2. We apologize for any inconvenience this might have caused and encourage users to contact us if they have further questions. The consequence of the scoring changes on the probe sets is outlined under APR_2005_rodents_1.1 and APR_2005_carnivores_1.1.
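The intended logic for a single alignment column can be sketched as follows. This is one possible reading of the description above, not the nsoop_v2 source: the dictionary-based tree representation, the function name, the toy tree, and its branch lengths are all hypothetical, and ancestral states are assumed to have been reconstructed already.

```python
def position_score(parent, branch_length, state, probe_leaf):
    """Score one alignment column under one reading of the intended logic.

    parent:        node -> parent node (the root maps to None)
    branch_length: node -> length of the branch joining the node to its parent
    state:         node -> nucleotide (observed at tips, inferred at ancestors)
    probe_leaf:    tip from which the probe is designed
    """
    base = state[probe_leaf]

    # Step 1: climb from the probe's tip while the ancestral state matches.
    top = probe_leaf
    while parent[top] is not None and state[parent[top]] == base:
        top = parent[top]

    # Build a child lookup so the tree can be walked downward.
    children = {}
    for node, par in parent.items():
        if par is not None:
            children.setdefault(par, []).append(node)

    # Step 2: walk down from that point, adding every branch whose lower
    # node still carries the probe's nucleotide.
    total = 0.0
    stack = [top]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if state[child] == base:
                total += branch_length[child]
                stack.append(child)
    return total


# Toy tree ((mouse:1, rat:1)MR:2, human:3)root with made-up branch lengths.
parent = {"root": None, "MR": "root", "human": "root", "mouse": "MR", "rat": "MR"}
branch_length = {"MR": 2.0, "human": 3.0, "mouse": 1.0, "rat": 1.0}

conserved = {"root": "A", "MR": "A", "human": "A", "mouse": "A", "rat": "A"}
print(position_score(parent, branch_length, conserved, "mouse"))  # 7.0
```

In this sketch, a mismatch at the first node above the probe's tip leaves nothing to walk down from, so the column receives the lowest possible score (0.0) even if other species are conserved.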

You can download and read more about nsoop_v2 here.

 

 

APR_2005_rodents_1.1

APR_2005_rodents_1.1 is a whole-genome probe set of universal probes specifically designed for screening rodent libraries. Mouse-rat-human-dog whole-genome alignments from http://genome.ucsc.edu were used to identify mouse sequences likely to be highly conserved with other rodents. A summary of the probe set is listed in Table 7.

An error in our probe scoring logic was discovered in the original version of nsoop, which required us to rebuild the previously released OCT_2004_rodents_1 and replace it with the APR_2005_rodents_1.1 probe set. Overall the probe sets are very similar: just over 17% (67,663 probes) of the probes in the previous OCT_2004_rodents_1 set did not meet the score cutoff under the corrected scoring logic. In addition, 258,429 (65%) of the OCT_2004_rodents_1 probes are also present in the APR_2005_rodents_1.1 build. Overall, the corrected scoring scheme resulted in the retention of more probes in the APR_2005_rodents_1.1 set and improved genome coverage. Please contact us if you have specific questions on this correction. We apologize for any problems this error might have caused.

 

 

A new algorithm, nsoop_v2, was used to select this probe set and assign probe scores. Briefly, the following phylogeny was used to generate probe scores: (((mouse: 0.02870, rat: 0.04165): 0.09006, human: 0.04529): 0.01591, dog: 0.08444);. The branch lengths were calculated with baseml in PAML, first using an unrooted tree with just these four species from a 29,431 bp alignment of the CFTR region on human chromosome 7. The distances for the two branches leading to dog in a rooted tree were then estimated using smaller alignments that also included either wallaby or opossum. Minimum probe scores for inclusion in the final probe set were determined using both the maximum likelihood scores and the number of mismatches. The vast majority of the probes conform to the following criteria:

 

   

Aligned                                  Criteria

Mouse-Rat                             0 mismatches

Mouse-Dog                            < 3 mismatches

Mouse-Human                       < 3 mismatches

Mouse-Human-Dog               < 3 mismatches for both mouse-human and mouse-dog

Mouse-Rat-Human                 < 2 mismatches mouse-rat, < 3 mismatches mouse-human

Mouse-Rat-Dog                      < 2 mismatches mouse-rat, < 3 mismatches mouse-dog

Mouse-Rat-Human-Dog         < 2 mismatches mouse-rat, < 5 mismatches for both mouse-human and mouse-dog
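The table above can be read as a small decision rule. The sketch below encodes it under the assumption that mismatch counts are supplied as a dictionary keyed by species name; the function name and input layout are hypothetical. Because only "the vast majority" of probes conform to these criteria, this is an approximation of the selection rather than the actual filter.

```python
def meets_rodent_criteria(mismatches):
    """Check the mismatch criteria listed for APR_2005_rodents_1.1.

    mismatches: dict mapping 'rat', 'human' and/or 'dog' to the number of
    mismatches against the mouse probe sequence. Species that are absent
    from the alignment are simply left out of the dict.
    """
    has_rat = "rat" in mismatches
    others = [s for s in ("human", "dog") if s in mismatches]

    if not has_rat and not others:
        return False  # mouse-only alignments are not covered by the table
    if has_rat and not others:
        return mismatches["rat"] == 0  # Mouse-Rat: 0 mismatches
    if has_rat and mismatches["rat"] >= 2:
        return False  # every other row with rat requires < 2 mouse-rat mismatches

    # The four-species row allows < 5; all remaining rows allow < 3.
    limit = 5 if (has_rat and len(others) == 2) else 3
    return all(mismatches[s] < limit for s in others)


print(meets_rodent_criteria({"rat": 1, "human": 4, "dog": 4}))  # True
print(meets_rodent_criteria({"rat": 1, "human": 4}))            # False
```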

 

These criteria are more conservative than those used for the whole-genome mammalian probes. To test the probe success rate of the rodent whole-genome universal probe set, n = 48 probes were used to screen the deer mouse (CHORI-233) and 13-lined ground squirrel (VMRC-20) BAC libraries. After the primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated as 46% for squirrel and 90% for deer mouse. Using a more stringent criterion of probe success (probes that identified at least two and fewer than 20 clones), the success rates for squirrel and deer mouse were 31% and 83%, respectively. A subset of representative clones identified with these probes has been sequenced by the NISC to evaluate the combined specificity of the probes. In the case of squirrel, 9/9 sequenced clones mapped to the orthologous target regions, and in the case of deer mouse, 21/21 clones mapped back to the targeted orthologous regions. A summary file of the test set of probes can be downloaded here, and a summary file of the mapped clones can be downloaded here.
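Both success-rate definitions used in the validation experiments reduce to simple counting. A minimal sketch, assuming the screen results are available as a list holding the number of positive BAC clones found for each probe tested; the function name and the example counts are made up for illustration and are not experimental data.

```python
def success_rates(clone_counts):
    """Return (standard, stringent) success rates for one library screen.

    standard:  fraction of probes positive for at least one BAC clone
    stringent: fraction of probes that identified at least two and
               fewer than 20 clones
    """
    tested = len(clone_counts)
    standard = sum(1 for c in clone_counts if c >= 1) / tested
    stringent = sum(1 for c in clone_counts if 2 <= c < 20) / tested
    return standard, stringent


# Hypothetical counts for four probes: no hit, one clone, three clones, 25 clones.
print(success_rates([0, 1, 3, 25]))  # (0.75, 0.25)
```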

 

 

Table 7. Probe Summary by chromosome for APR_2005_rodents_1.1

Mouse Chromosome    Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                185739816               20142            13092
Chr2                178128968               24567            16073
Chr3                151641779               15205            10255
Chr4                150169032               18988            12806
Chr5                140185730               16161            9916
Chr6                140598523               15580            10211
Chr7                123686188               16363            11613
Chr8                120458717               14542            8786
Chr9                117228887               16732            9994
Chr10               122927168               12282            7852
Chr11               118398857               21156            12611
Chr12               108019676               13045            8244
Chr13               109349262               11520            7732
Chr14               110323967               13069            7919
Chr15               98419177                10824            7058
Chr16               92679592                9949             6294
Chr17               86658738                9736             6819
Chr18               86685738                10174            6332
Chr19               56490660                8749             5601
ChrX                155777425               12501            10669
ChrY                37314788                19               728
Chr1_random         1774182                 0                170
Chr2_random         7873155                 60               1226
Chr3_random         2438032                 82               397
Chr4_random         8311544                 29               1323
Chr5_random         2808176                 19               199
Chr6_random         1859321                 5                142
Chr7_random         6635828                 496              758
Chr8_random         1478105                 5                157
Chr9_random         837411                  0                90
Chr10_random        722152                  0                54
Chr12_random        1861124                 35               244
Chr13_random        1725890                 0                185
Chr14_random        2063169                 14               191
Chr15_random        765702                  2                82
Chr16_random        1152388                 5                186
Chr17_random        1130230                 4                210
Chr18_random        1555732                 1                96
Chr19_random        597925                  0                138
ChrX_random         9460804                 11               906
ChrY_random         503236                  0                32
ChrUn_random        69030694                1492             5040

Total               2615467488              293564           202431

 

APR_2005_carnivores_1.1

APR_2005_carnivores_1.1 is a whole-genome probe set of universal probes specifically designed for screening carnivore libraries. Dog-human-mouse-rat whole-genome alignments from http://genome.ucsc.edu were used with nsoop_v2 to identify dog sequences likely to be highly conserved with other carnivores. A combination of mismatches and nsoop probe scores was used to determine inclusion of probes in this universal probe set. The starting mismatch criteria were < 3 dog-human mismatches and < 10 dog-rodent mismatches. A summary of the resulting probes is listed below, and a chromosome summary of the probe set is listed in Table 8.

This probe set replaces JAN_2005_carnivores_1, which was created using nsoop and was subsequently found to have an error in probe scoring logic. Comparison of the two probe sets indicates that 4% of the probes in JAN_2005_carnivores_1 (15,705) did not meet the criteria for inclusion in the APR_2005_carnivores_1.1 build and that 357,860 of the probes are common to both probe sets (81% of JAN_2005_carnivores_1 probes). The corrected scoring resulted in higher genome coverage for the new carnivore set. We apologize for any problems this error may have caused, and we would be happy to answer further questions on this error and correction.

 

Dog-Human Alignments (probe scores 4.86-5.24)

100% (50,360/50,360) of the probes in this probe score range met the mismatch criteria.

 

Dog-Human-Mouse Alignments (probe scores 8.83-9.52)

90.7% (9058/9992) of probes in this probe score range met the mismatch criteria.

50% (9058/18142) of probes that met the mismatch criteria were included.

 

Dog-Human-Rat Alignments (probe scores 9.29-9.98)

90.7% (1354/1493) of probes in this probe score range met the mismatch criteria.

29.6% (1354/4582) of probes that met the mismatch criteria were included.

 

Dog-Human-Mouse-Rat (probe scores 9.66-11.02)

75.1% (404306/538489) of probes in this probe score range met the mismatch criteria.

91.8% (404306/440330) of probes that met the mismatch criteria were included.

This is by far the largest group of probes.

 

 

Table 8. Probe Summary by chromosome for APR_2005_carnivores_1.1

Dog Chromosome      Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                120715446               17978            8238
Chr2                83844569                14889            6844
Chr3                91113804                13826            5611
Chr4                87862066                14640            6325
Chr5                88298129                20060            8536
Chr6                75429024                14434            6073
Chr7                79535956                14488            6165
Chr8                73732664                13538            6074
Chr9                50162806                14464            6553
Chr10               69071640                12602            5323
Chr11               72007474                13180            5693
Chr12               72134750                11727            5154
Chr13               62439878                8506             3563
Chr14               60250532                10308            4233
Chr15               63623497                10550            4718
Chr16               56674272                7122             3269
Chr17               63455904                11315            5180
Chr18               62421077                11536            5336
Chr19               53494395                7157             2880
Chr20               57438949                12594            5955
Chr21               49606594                7523             3512
Chr22               60967026                7711             3064
Chr23               52323076                7710             3356
Chr24               47284650                9029             3947
Chr25               51065444                6916             3179
Chr26               37554750                6132             2881
Chr27               45599871                7758             4304
Chr28               39232621                8776             3509
Chr29               41639609                5615             2184
Chr30               39917767                8993             3802
Chr31               37867089                4535             1758
Chr32               38726899                4894             2408
Chr33               31261292                4866             2054
Chr34               41783138                5951             2379
Chr35               26292642                3724             1859
Chr36               30762478                6096             2311
Chr37               30686873                6292             2666
Chr38               23298972                4133             1720
ChrX                121210679               18723            9031
ChrUn_random        69040064                1382             1158

Totals              2359828366              391673           172805

 

Experimental validation of this probe set was performed by screening a clouded leopard BAC library (CHORI-87). The success rate of the carnivore universal probes for the clouded leopard BAC library was 81% (i.e., 39/48 probes tested identified at least one leopard BAC clone). Representative BAC clones are currently being selected for sequencing to evaluate the specificity of this probe set. Using a more stringent criterion of probe success (probes that identified at least two and fewer than 20 clones), the success rate for clouded leopard was 73%. A subset of representative clones identified with these probes has been sequenced by the NISC to evaluate the combined specificity of the probes. 17/18 sequenced clouded leopard clones mapped to the orthologous target regions. A summary file of the test set of probes can be downloaded here, and a summary file of the mapped clones can be downloaded here.

 

 

February 2005 Updates

We have enhanced the search capabilities for the birds/reptiles, rodents and carnivores probe sets to better capture all of the information about gene names, products, etc. that is included in the xenomRNA and xenoRefSeq tracks.

 

The tree depicting available BAC libraries has also been updated to include new libraries and correct topology mistakes included in the earlier version of the tree.

 

April 2005 Updates

The OCT_2004_rodents_1 and JAN_2005_carnivores_1 probe sets were replaced with APR_2005_rodents_1.1 and APR_2005_carnivores_1.1. A corrected version of nsoop, nsoop_v2, was used to make these probe sets. Users are welcome to contact us for more details on this change.

 

JUN_2005_mammals_2

An updated whole-genome universal probe set for screening mammalian libraries, JUN_2005_mammals_2, was created by merging the FEB_2004_mammals_1 probe set with new probes designed from human-mouse-rat-dog-chicken whole-genome alignments to hg17 (NCBI build 35) using nsoop_v2 and the input phylogeny ((((mouse: 0.02589, rat: 0.02999): 0.07201, human: 0.03563): 0.01316, dog: 0.05622): 0.10466, chicken: 0.14417);. The score cutoffs for inclusion of newly designed probes in this mammalian probe release were set such that at least 75% of the new probes would have met the previous criteria set for mammals_1 and the mismatch criteria outlined below. As a consequence of merging sets of probes designed at different times with different criteria, a substantial fraction of the probes that met the score and mismatch criteria overlapped. To eliminate unnecessary redundancy in the final probe set, in instances where probes overlapped by more than 18 bp, the single best probe was retained and the other probe(s) were discarded. Unique probes that overlapped with non-unique probes by 30 or more bases were also discarded. In total, this probe set increased the genome coverage compared to the previous mammalian whole-genome probe set by ~6%, and is expected to have equivalent or enhanced success rates.
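The redundancy filter can be approximated with a greedy pass over the probes. The sketch below is an assumption about how such a filter might work, not the procedure actually used: probes are hypothetical `(start, end, score)` tuples with half-open coordinates on one chromosome, "best" is taken to mean the highest score, and the separate rule for unique versus non-unique probes is not modeled.

```python
def overlap_bp(a, b):
    """Number of bases shared by two half-open intervals (start, end, ...)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def remove_redundant(probes, max_overlap=18):
    """Greedily keep the best-scoring probes so that no two retained
    probes overlap by more than max_overlap bases."""
    kept = []
    for probe in sorted(probes, key=lambda p: p[2], reverse=True):
        if all(overlap_bp(probe, k) <= max_overlap for k in kept):
            kept.append(probe)
    return sorted(kept)


# Three hypothetical 36-mers; the first overlaps the second by 26 bp.
probes = [(0, 36, 5.0), (10, 46, 7.0), (30, 66, 6.0)]
print(remove_redundant(probes))  # [(10, 46, 7.0), (30, 66, 6.0)]
```

A greedy pass in score order is a simple choice; it does not guarantee the largest possible retained set.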

 

Mismatch criteria for JUN_2005_mammals_2:

Human-Dog: 2 or fewer mismatches

(Probe scores 3.61-3.78. 100% (47697/47697) of the probes met the mismatch criteria)

Human-dog-chicken: 3 or fewer human-dog mismatches

(Probe scores 12.33-12.74. 100% (56/56) of the probes met the mismatch criteria)

Human-mouse: 0 mismatches

(Probe scores 4.81. 100% (153/153) of the probes met the mismatch criteria)

Human-mouse-chicken: 2 or fewer human-mouse mismatches

(Probe scores 13.56-14.24. 100% (8/8) of the probes met the mismatch criteria)

Human-mouse-dog: 4 or fewer human-mouse and 3 or fewer human-dog mismatches

(Probe scores 6.68-7.31. 75% (5220/6905) of the probes met the mismatch criteria)

Human-mouse-dog-chicken: 4 or fewer human-mouse and 3 or fewer human-dog mismatches

(Probe scores 14.25-16.26. 75% (3074/4096) of the probes met the mismatch criteria)

Human-mouse-rat: 2 or fewer mismatches in human-mouse and human-rat

(Probe scores 5.83-5.89. 100% (529/529) of the probes met the mismatch criteria)

Human-mouse-rat-chicken: 4 or fewer human-mouse and human-rat mismatches

(Probe scores 13.16-15.32. 76% (769/1015) of the probes met the mismatch criteria)

Human-mouse-rat-dog: 3 or fewer human-dog mismatches and 4 or fewer mismatches in either human-mouse or human-rat

(Probe scores 7.69-8.38. 95% (148,825/156,150) of the probes met the mismatch criteria)

Human-mouse-rat-dog-chicken: less than 3 human-dog mismatches and 4 or fewer mismatches in either human-mouse or human-rat

(Probe scores 15.32-17.34. 84% (139,490/165,822) of the probes met the mismatch criteria)

Human-rat: 0 mismatches

(Probe scores 4.96. 100% (19/19) of the probes met the mismatch criteria)

Human-rat-chicken: 2 or fewer human-rat mismatches

(Probe scores 13.01-14.39. 100% (10/10) of the probes met the mismatch criteria)

Human-rat-dog: 4 or fewer human-rat and 3 or fewer human-dog mismatches

(Probe scores 7.24-7.45. 100% (228/228) of the probes met the mismatch criteria)

Human-rat-dog-chicken: 4 or fewer human-rat and 3 or fewer human-dog mismatches

(Probe scores 14.69-16.41. 75% (363/484) of the probes met the mismatch criteria)

 

Table 9. Probe Summary by chromosome for JUN_2005_mammals_2

Human Chromosome    Length w/o gaps (bp)    Unique Probes    Non-Unique Probes
Chr1                222827847               35717            29423
Chr2                237506229               34251            26304
Chr3                194635740               27264            20352
Chr4                187161218               18759            13847
Chr5                177702766               23171            17784
Chr6                167317699               20553            15559
Chr7                154759139               18133            15626
Chr8                142612826               15364            11751
Chr9                117781268               16995            13381
Chr10               131613628               17947            13579
Chr11               131130853               21053            16826
Chr12               130259811               18143            14427
Chr13               95559980                9565             7358
Chr14               88290585                13671            10419
Chr15               81341915                13829            12171
Chr16               78884754                13030            10579
Chr17               77800220                16486            14376
Chr18               74656155                9272             6327
Chr19               55785651                7365             6337
Chr20               59505253                9287             6811
Chr21               34171998                2956             2455
Chr22               34764810                4848             4495
ChrX                150394264               18107            16158
ChrY                24871691                29               1195

Total               2851336300              385795           307540

 

 

JUN_2005_marsupials_1

A whole-genome probe set was designed based on pairwise whole-genome alignments between hg17 (NCBI build 35) and opossum (monDom1) using soop_v2. Preliminary tests of probes designed with these comparisons indicated that 36-mers with four or fewer mismatches between human and opossum had a greater than 50% success rate when screening a wallaby library. Thus, this probe set consists of 36-mers with four or fewer mismatches between human and opossum. These criteria resulted in the development of 121,772 unique and 83,546 non-unique probes for screening marsupial libraries. Estimated genome coverage for this probe set is 3-fold lower than for JUN_2005_mammals_2, but the probe success rate is expected to be greater than 50% in marsupials.
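The selection rule for this set (36-mers with four or fewer human-opossum mismatches) is easy to express directly. A minimal sketch, assuming the two aligned 36-mers are available as gap-free strings of equal length; the function names and example sequences are hypothetical.

```python
def count_mismatches(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(1 for x, y in zip(seq_a.upper(), seq_b.upper()) if x != y)


def passes_marsupial_filter(human_seq, opossum_seq, probe_len=36, max_mismatches=4):
    """True for a 36-mer pair with four or fewer human-opossum mismatches."""
    if len(human_seq) != probe_len or len(opossum_seq) != probe_len:
        return False
    return count_mismatches(human_seq, opossum_seq) <= max_mismatches


human = "ACGT" * 9                 # made-up 36-mer
opossum = "TGCA" + human[4:]       # differs from it at the first four positions
print(passes_marsupial_filter(human, opossum))  # True
```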

 

Experimental validation of this probe set was performed by screening a wallaby BAC library. The success rate of the marsupial universal probes for the wallaby BAC library (ME_KBa) was 81% (i.e., 39/48 probes tested identified at least one wallaby BAC clone). Using a more stringent criterion of probe success (probes that identified at least two and fewer than 20 clones), the success rate for wallaby was 75%. A subset of representative clones identified with these probes has been sequenced by the NISC to evaluate the combined specificity of the probes. 8/9 clones mapped back to the targeted orthologous regions. A summary file of the test set of probes can be downloaded here, and a summary file of the mapped clones can be downloaded here.

 

Development of a new cross-species query option

Because the opossum assembly is highly fragmented and not anchored to chromosomes, we have developed a new option to facilitate the identification of probes of interest using cross-species queries. Specifically, users can now designate a 'query' genome to search for probes that is distinct from the genome from which the probes were designed (the 'reference' genome). For example, it is now possible to search by human chromosome location, gene name, accession number, etc. to retrieve probes designed from the inferred syntenic/orthologous locations in the opossum assembly. This cross-species query option is available for every species that contributed to the design of a given probe set. This option can be accessed here.

 

Development of a batch-query option

To facilitate the retrieval of probes from multiple locations in the genome, we have developed a batch-query interface. Users can now enter multiple search strings (either directly or by uploading a file) into the newly developed interface and receive the results in a text file via email. This option can be accessed here.

 

Development of an on demand universal probe design for Apes and Old world monkeys.

A whole-genome human-chimpanzee-rhesus monkey alignment of hg18-panTro1-rheMac2 was downloaded from the UCSC Browser and serves as the basis for the on demand design of probes for screening Ape and Old world monkey BAC libraries. Optimal universal probe design parameters were determined in a stepwise process:

 

i. design all possible probes from the human chromosome 7 hg18-panTro1-rheMac2 alignment file,

 

ii. evaluate the probe score distributions correlated with number of human-chimpanzee and human-rhesus monkey mismatches to set the minimum probe score cut-off values,

 

iii. determine the frequency of universal probes with scores greater than the above cut-off scores in 51 250-kb intervals that accurately represent the genome-wide distribution of divergence between human and chimpanzee.

 

The resulting probes are expected to have >90% probe success rate for all species within this clade and to be present at high densities across the entire genome. Specifically, with one exception on the Y chromosome, at least 84 ‘unique’ universal probes were identified in each of the 250 kb intervals sampled above.

 

To make universal probes on demand, users first identify and verify the region/regions of the genome of interest via a simple database query. Once identified, the server initiates our established probe design process targeted to the requested segment(s) of the genome, and the probes are returned to the user in a variety of formats via email. The on demand universal probe design for Apes and Old World Monkeys can be accessed here.

 

The predetermined fixed default values for the on demand universal probe design for Apes and Old world monkeys are as follows:

Human-chimpanzee                                None

Human-chimpanzee-rhesus monkey       91% of the probes have 0 human-chimpanzee mismatches and 1 or fewer human-rhesus monkey mismatches (probe score > 2.030)

Human-rhesus monkey                         100% of probes have 1 or fewer human-rhesus monkey mismatches (probe score > 1.856)

 

 

Experimental validation of the default parameters.

Figure 6. Results of experimental validation for the Apes & OWM on demand universal probes. To test the probe success rate of the universal probe design process for Apes & OWM, n = 48 probes were used to screen the Japanese macaque (CHORI-270), baboon (RPCI-41), colobus monkey (CHORI-272) and gibbon (CHORI-271) BAC libraries. After the primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated. Using a more stringent criterion of probe success (probes that identified at least two and fewer than 20 clones), the success rates for Japanese macaque, baboon, colobus monkey and gibbon were 94%, 83%, 83% and 94%, respectively. A subset of representative clones identified with these probes has been sequenced by the NISC to evaluate the combined specificity of the probes. In the case of Japanese macaque, 7/7 sequenced clones mapped to the orthologous target regions, 8/8 sequenced baboon clones mapped back to the targeted orthologous regions, 8/9 sequenced colobus monkey clones mapped to the orthologous target regions, and 10/10 sequenced gibbon clones mapped back to the targeted orthologous regions. A summary file of the test set of probes can be downloaded here, and a summary file of the mapped clones can be downloaded here.

Development of an on demand universal probe design for New world monkeys (October 2008).

A whole-genome human-chimpanzee-orangutan-rhesus-marmoset alignment of hg18-panTro2-panAbe2-rheMac2-calJac1 was downloaded from the UCSC Browser and serves as the basis for the on demand design of probes for screening New world monkey BAC libraries. Optimal universal probe design parameters were determined in a stepwise process:

 

i. design all possible probes from the regions of the marmoset genome assembly orthologous to human chromosome 7 using the hg18-panTro2-panAbe2-rheMac2-calJac1 alignment file,

 

ii. evaluate the probe score distributions correlated with number of mismatches to set the minimum probe score cut-off values,

 

iii. determine the frequency of universal probes with scores greater than the above cut-off scores in 51 250-kb intervals that accurately represent the genome-wide distribution of sequence divergence.

 

The resulting probes are expected to have >90% probe success rate for all species within this clade and to be present at high densities across the entire genome. Specifically, excluding an interval on the Y chromosome and an interval orthologous to Chr15 which is all segmental duplications, an average of 51 universal probes were identified in each of the 250 kb intervals sampled above (range 7-144 probes/250 kb).

 

To make universal probes on demand, users first identify and verify the region/regions of the genome of interest via a simple database query. Once identified, the server initiates our established probe design process targeted to the requested segment(s) of the genome, and the probes are returned to the user in a variety of formats via email. The on demand universal probe design for New World Monkeys can be accessed here.

 

The predetermined fixed default values for the on demand universal probe design for New world monkeys were set such that nearly all probes will have 2 or fewer mismatches between marmoset (reference New world monkey sequence for the designed probes) and each of the other primates used in the comparison (human, chimp, orangutan and rhesus).

 

A test set of n = 48 probes was hybridized to a pair of BAC libraries from representative New World Monkeys: Dusky Titi (LBNL-5) and Owl monkey (CHORI-258). The probe success rates were 87% for the Dusky Titi and 89% for the Owl monkey.

 

 

Development of on demand universal probe design options for Simians and All primates (October 2009).

A whole-genome human-chimpanzee-orangutan-rhesus-marmoset-tarsier-mouse lemur-galago alignment of hg18-panTro2-panAbe2-rheMac2-calJac1-tarSyr1-micMur1-otoGar1 was downloaded from the UCSC Browser and serves as the basis for the on demand design of probes for screening Simian and All Primate BAC libraries. Optimal universal probe design parameters were determined using the stepwise process described above for the Apes & OWM and NWM probe sets.

 

The resulting probes for screening Simian libraries are expected to have a >85% probe success rate for all species within this clade and to be present at high densities across the entire genome, comparable to the density described for the NWM probe design criteria. Probes designed with the All Primates criteria are expected to have a >80% probe success rate for all primate genomic libraries.

 

As with the previous primate probe design tools, to make universal probes on demand, users first identify and verify the region/regions of the genome of interest via a simple database query. Once identified, the server initiates our established probe design process targeted to the requested segment(s) of the genome, and the probes are returned to the user in a variety of formats via email. The on demand universal probe design for Simians can be accessed here and the All Primates here.

 

Simian Probe Validation. A test set of n = 48 probes was hybridized to a set of three BAC libraries from representative Simians: gibbon (CHORI-271), Colobus monkey (CHORI-272) and Dusky titi (LBNL-5). The probe success rates were 92% for the gibbon, 88% for the Colobus monkey, and 85% for the Dusky titi.

 

All Primates Validation. A test set of n = 32 probes was hybridized to a set of three BAC libraries from representative nonhuman primates: Japanese macaque (CHORI-270), Owl monkey (CHORI-258), and Black lemur (CHORI-273). The probe success rates were 100% for the Japanese macaque, 100% for the Owl monkey, and 91% for the Black lemur.