
Introduction

The purpose of the Uprobe project is to provide the community with an efficient means of experimental access to large-insert cloned DNA (i.e., BAC clones) from the full spectrum of vertebrate genomes for which genomic BAC libraries are planned or currently available. This is being accomplished by developing technology that improves the ability of 'universal' hybridization-based probes to identify genes/regions of interest from multiple species, and by disseminating these improved technologies through the creation of:

A more general introduction to the Uprobe project and details related to the above mentioned resources are provided below.

 

Note that you can also download whole-genome probe sets, computer programs, and experimental protocols from this website.

 

 

Pre-computed whole-genome probe sets are available for:

All Mammals.

AUG_2003_b1 is based on human-mouse alignments and has been experimentally validated.

OCT_2003_b2 is based on human-mouse-rat alignments and has been experimentally validated.

FEB_2004_mammals_1 was created by combining the approaches used to build the first two probe sets and has been experimentally validated, and is the current recommended and default probe set for screening mammalian genomic libraries.

JUN_2005_mammals_2 was created by enhancing the FEB_2004_mammals_1 with new probes based on human-mouse-rat-dog-chicken alignments and is the current recommended and default probe set for screening mammalian genomic libraries.

Rodents.

APR_2005_rodents_1.1 was designed with a new algorithm, nsoop_v2, using mouse-rat-human-dog alignments specifically for screening rodent libraries and is currently in the process of being experimentally validated. This set replaces OCT_2004_rodents_1.

Carnivores.

APR_2005_carnivores_1.1 was designed with a new algorithm, nsoop_v2, using dog-human-mouse-rat alignments specifically for screening carnivore libraries and has been experimentally validated. This set replaces JAN_2005_carnivores_1.

Marsupials.

JUN_2005_marsupials_1 was designed from human-opossum alignments and is the recommended probe set for screening marsupial libraries.

All Birds and Reptiles.

MAR_2004_birds/reptiles_1 is based on chicken-human alignments and has been experimentally validated.

 

On demand universal probe design for nonhuman primates:

On demand universal probe design for apes and Old World monkeys is based on near-identical sequences from human-chimpanzee-rhesus monkey whole-genome alignments and is currently being experimentally validated.

 

Similar resources are currently being developed for 3 clades of nonhuman primates: New world monkeys, Simians and All primates.

 

Custom universal probe design:

Custom universal probe design can be performed on DNA sequence alignments of 2 or more species provided by the user. A step-by-step tutorial for the custom probe design process is provided here.

 

 

References:

Thomas JW, Prasad AB, Summers TJ, Lee-Lin SQ, Maduro VV, Idol JR, Ryan JF, Thomas PJ, McDowell JC, Green ED. Parallel construction of orthologous sequence-ready clone contig maps in multiple species. Genome Research 12:1277-85, 2002.

 

Kellner WA, Sullivan RT, Carlson BH, NISC Comparative Sequencing Program, Thomas JW. Uprobe: A genome-wide universal probe resource for comparative physical mapping in vertebrates. Genome Research 15:166-173, 2005.

 

Sullivan RT, Morehouse CB, NISC Comparative Sequencing Program, Thomas JW. Uprobe 2008: an online resource for universal overgo hybridization-based probe retrieval and design. Nucleic Acids Research 36:W149-W153, 2008.

 

The Uprobe project is currently funded by a grant from NCRR (R24RR022239). Past funding was provided through the NIH BAC Resource Network by grant (U01MH068185). Comments are welcome and can be directed to James Thomas (jthomas@genetics.emory.edu).

 

 

 

Below is an overview of the goals and rationale behind the Uprobe project, a description of what universal probes are and how they are designed, details on specific whole-genome probe sets, and updates and changes to the Uprobe site.

 

 

Rationale

Comparative sequencing is a particularly powerful tool for inferring function from genomic sequence. Thus, sequencing the same region in multiple species simultaneously would provide a valuable method for interpreting all genomic sequence as it is generated. Just as important is access to cloned DNA containing orthologous genes and regions in multiple species catalogued in the sequence. Such access would establish a new resource that could be used for comprehensive functional analysis of both coding and non-coding sequence. This new resource would consist of a series of gene alleles generated not by mutagenesis within a species, but by the divergence of sequence over evolutionary time (i.e., 'evolutionary alleles'). These 'evolutionary alleles' could then be used to experimentally dissect the function of genomic sequence in cell culture or a transgenic model. At this time, there is no published method for supplying the templates necessary for this type of sequencing or functional studies. Since multiple species sequence comparisons and physical mapping among vertebrates will be a tremendous resource for the functional annotation of the human genome and a starting point for experimental analysis, robust and effective means of generating such clone resources in a targeted manner would be of immeasurable value to the research community.

 

BAC libraries offer the means to selectively isolate a specific region of a genome or to support whole-genome mapping and sequencing efforts. As such, both the NHGRI (RFA-HG-01-002) and NSF (NSF 01-145) have made funding commitments to establish a large resource of BAC libraries from a diverse set of species. In fact, nearly 100 vertebrate BAC libraries are currently available to the public (Fig. 1). This new BAC library resource will provide a source for comparative sequencing and functional studies across a wide range of vertebrates. Critical to the future utility of these BAC libraries will be the availability of efficient and reliable methods for screening these libraries and assembling high-quality BAC maps that can be used by the entire biomedical research community. This is especially important for individual researchers who do not necessarily have the experience, technology or production needs of a genome center.

 

Current / Pending Vertebrate BAC Libraries
{include file="treeMap.tpl"}

 

Species-specific sequence resources traditionally used for isolating and constructing physical maps of regions of interest (that is, ESTs or other random genomic sequences) are not available for most vertebrates. Thus, the traditional route of building physical maps for a region of interest using species-specific markers would not be possible for many of the species shown in Fig. 1. The goal of the first phase of the Uprobe project is to combine the principles of traditional comparative mapping with existing methods for screening BAC libraries to develop an efficient, practical and reliable experimental strategy for assembling BAC maps from a diverse set of vertebrates. Specifically, this proposal aims to remove the limitation of species-specific resources necessary for BAC library screening by the design and testing of universal overgo probes that can be used on single or multiple BAC libraries. This will be accomplished by identifying evolutionarily conserved sequences between species, such as human and mouse, for which genomic sequence is available. As a result, this would provide an experimental and computational infrastructure aimed at the one essential part of BAC library screening not yet standardized: probe design. The strategy and methodologies proposed here would also dramatically reduce the cost of building physical maps by increasing the potential mapping throughput and decreasing the cost of marker reagents while maximizing the ability to compare genomic maps of divergent species. Through the establishment of a public database of universal probes, individual researchers would have a key resource necessary to construct physical maps (independent of whole-genome efforts) in their region and species of interest that would otherwise be difficult to build. Small clone-based physical maps assembled by individual researchers would also complement whole-genome efforts through the direct integration of both types of maps via the specific clones of interest.
Therefore, it is anticipated that this scalable methodology would greatly facilitate the use of future BAC libraries by individual laboratories and genome centers alike, and thus be a widespread means of using the power of comparative genomics for both communities' specific research goals.

 

The ongoing goals of the Uprobe project are as follows:

Modern genomic tools and resources will be critical for ongoing and future nonhuman primate research and whole-genome sequencing efforts are underway for a limited number of nonhuman primates. Unfortunately, because of the sequencing strategies employed and associated costs, nonhuman primate genome assemblies will have hundreds-of-thousands of gaps and will not provide a definitive reference sequence like the finished human genome. Moreover, whole-genome sequences will not yield direct access to the experimental tools necessary to exploit this extensive genetic information. Bacterial artificial chromosome (BAC) libraries and clones are a proven and valuable genomic resource for the experimental utilization and functional characterization of genomic sequence and are currently available for eighteen species of nonhuman primates. Input from representatives of the nonhuman primate and biomedical research communities revealed strong support for a resource to facilitate access to these nonhuman primate genomic libraries (see appendix). The goal of this proposal is to develop a resource that will provide an effective and reliable means for the primate and biomedical research communities to isolate any specific gene or region of interest from one or all nonhuman primate BAC libraries. To do so, a web-based tool will be developed for the custom design of universal hybridization probes that can be used for the isolation of nonhuman primate BAC clones. Universal hybridization probes are a proven technology for the efficient targeted isolation of BAC clones from multiple species in parallel. However, this methodology has not been optimized for use in nonhuman primates. This proposal will establish those optimal parameters and provide them as preset values for the custom design of nonhuman primate universal probes by the public. 
This custom universal hybridization probe design website will therefore facilitate access to the full spectrum of genetic diversity captured within all current and future nonhuman primate genomic libraries independent of, and as a complement to, whole-genome sequencing efforts. As a result, this proposal will yield an important avenue by which individual researchers can readily import nonhuman primate genomic clones into their own laboratories and experimental paradigms. The aims of this resource proposal are:

 

Aim 1. Develop a public resource for the custom design of universal hybridization probes for isolating nonhuman primate genomic clones. A robust universal probe design pipeline used previously in the creation of whole-genome probe sets will be adapted for the custom and on demand design of universal probes for isolating nonhuman primate genomic clones from specific genes and regions of interest. A web-based interface to this custom probe design pipeline will allow the public to design probes from a series of default settings catered to the efficient isolation of genomic clones from four specific clades of primates (1. all primates, 2. simians, 3. new world monkeys and 4. old world monkeys and apes), or to design probes from their own sequence data.

 

Aim 2. Experimentally validate the nonhuman primate custom probe design resource. Small sets of universal probes designed from each of the four ‘default’ nonhuman primate universal probe design options will be selected for use in a small-scale targeted comparative mapping and sequencing project for experimental validation of the resource, and to provide a real world example of how to use the resource and the data it can produce.

 

 

What are universal probes?

The concept behind the use of universal probes is very simple. If a sequence is conserved between two divergent species, then it is likely to be conserved in other species as well. For example, if a sequence is conserved between human and chicken, then it is likely to be conserved among all mammals and all birds. Thus, a single sequence can act as an effective probe for screening genomic BAC libraries from many birds and mammals and alleviate the need for generating species-specific probes for every genomic library to be screened. In addition, since a single probe is used to screen multiple species, screening of BAC libraries can be done in parallel with identical hybridization and washing conditions. In the preliminary data for this project, universal probes for screening placental mammal BAC libraries were designed based on sequence similarity between human and mouse. These probes were tested and found to be effective at isolating BAC clones from a set of placental mammals (cat, dog, cow, pig, rat, baboon and chimpanzee) (Thomas et al, Parallel Construction of Orthologous Sequence-Ready Clone Contig Maps in Multiple Species. Genome Res, 2002. 12:1277-1285).

 

The specifics of the process are illustrated in the figure below. The probes themselves are called 'overgo' probes and consist of two 22-mers that are complementary across an 8-bp overlap and are radioactively labeled by a Klenow fill-in reaction with dATP and dCTP. Overgo probes were developed by John McPherson and have been used extensively by large and small labs alike to screen genomic libraries. Their specificity and uniform design parameters allow one to hybridize groups of probes together. Because primers are cheap and the radioactivity used for labeling a given probe is minimal, we strongly recommend (when feasible) that multiple probes from a given region, rather than a single probe, be used for library screening. We aim to design probes that will have at least a 50% chance of success in a given species; therefore, by using multiple probes, the likelihood of identifying clones of interest with one hybridization is maximized. When large regions are targeted for isolation, spacing the probes every ~30 kb has proven very successful. This basic process is being used to design whole-genome probe sets for clusters of species, such as placental mammals, birds, reptiles and sub-groups of fish using whole-genome alignments.
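As an illustration of the overgo layout described above, the following Python sketch splits a 36-bp target into two 22-mers that overlap by 8 bp, and estimates the chance that at least one of several probes succeeds. The function names are hypothetical, and the probability calculation assumes that probes succeed independently with the same per-probe rate, which is a simplification and not something stated by the project.

```python
# Minimal sketch; helper names are hypothetical, not part of the Uprobe software.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def overgo_oligos(target, oligo_len=22, overlap=8):
    """Split a target into two oligos whose 3' ends pair over `overlap` bases.

    With the defaults, the target must be 2 * 22 - 8 = 36 bp long.
    """
    expected = 2 * oligo_len - overlap
    if len(target) != expected:
        raise ValueError("target must be %d bp, got %d" % (expected, len(target)))
    forward = target[:oligo_len]                        # top strand, 5'->3'
    reverse = reverse_complement(target[-oligo_len:])   # bottom strand, 5'->3'
    return forward, reverse


def chance_of_at_least_one_hit(per_probe_success, n_probes):
    """P(at least one probe works), assuming independent probes (an assumption)."""
    return 1.0 - (1.0 - per_probe_success) ** n_probes


if __name__ == "__main__":
    target = "ACGTTGCAAGCTTAGGCCATGGATCCGTACGATCGA"  # arbitrary 36-bp example
    fwd, rev = overgo_oligos(target)
    print(fwd, rev)
    print(chance_of_at_least_one_hit(0.5, 3))  # 0.875
```

Under the independence assumption, three probes that each have a 50% chance of success give an 87.5% chance that at least one identifies a clone, which is the intuition behind recommending multiple probes per region.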

 

 

 

Figure 2. Strategy for designing universal overgo hybridization probes based on human-mouse sequence alignments. Orthologous human and mouse genomic sequences are masked for repetitive elements (indicated by X's) and then aligned. Regions with high sequence conservation (indicated by vertical lines) are identified and used for designing probes. When possible, a single 36-bp human sequence from each alignment is chosen based on GC content and percent human-mouse sequence identity. A subset of these sequences is then chosen to optimize for inter-probe spacing (~30-40 kb). Three such conserved sequences are depicted in the figure, with greater details provided (in the box) for the middle one. At this stage, each selected 36-bp sequence is compared to all available human genomic sequence to confirm that it is single copy. Overlapping pairs of oligonucleotide primers are then synthesized for each sequence and used to generate double-stranded, radiolabeled (indicated by *'s) probes. The probes across a target region(s) are then pooled and used to screen arrayed BAC libraries, allowing the isolation of individual positive BACs.
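The inter-probe spacing step mentioned in the Figure 2 caption could be approximated by a simple greedy pass over candidate positions. This is only a sketch: the greedy rule, the function name and the 30-kb default are assumptions made for illustration, and the actual selection procedure used by the project may differ.

```python
def select_spaced_probes(positions, min_spacing=30000):
    """Greedily keep candidate probe positions at least `min_spacing` bp apart.

    `positions` are start coordinates (bp) of candidate probes on one sequence.
    This greedy rule is an illustrative assumption, not the project's algorithm.
    """
    selected = []
    for pos in sorted(positions):
        if not selected or pos - selected[-1] >= min_spacing:
            selected.append(pos)
    return selected


if __name__ == "__main__":
    candidates = [0, 10000, 31000, 45000, 62000]
    print(select_spaced_probes(candidates))  # [0, 31000, 62000]
```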

 

 

How the AUG_2003_b1 whole-genome probe set for screening mammalian libraries was generated.

Whole-genome human-mouse alignments (axtTight) (Schwartz et al, Human-mouse alignments with BLASTZ. Genome Res 2003, 13:103-107 and Waterston et al, Initial sequencing and comparative analysis of the mouse genome. Nature 2002, 420:520-562) between the April 2003 assembly (UCSC version hg15) of the human genome and the Feb. 2003 assembly (UCSC version mm3) of the mouse genome were downloaded from http://genome.ucsc.edu. This alignment file was then modified for use with a modified algorithm for probe design, soop. Common repetitive sequences were masked in the file and then, when possible, 1 candidate probe with >88% human-mouse sequence identity from each ungapped alignment was designed based on the human sequence. These candidate probes were then compared to the April 2003 human genome build by megablast (megablast -t 16 -N 2 -W 11 -e 0.6 -F F -D 3). The megablast output was used to confirm the location of the probes in the human genome, and to tag each probe as 'unique' or 'non-unique'. Unique probes had a single identical hit to the human genome assembly, no other hits with a bit score above 40 and fewer than 5 hits with a score above 36. These are very stringent criteria for calling a probe unique, and we feel that unique probes, to the best of our knowledge, represent single-copy sequences in the human genome and should be well suited for screening BAC libraries from other mammals. Non-unique probes also had one identical hit to the human assembly at the expected location, but had at least one other hit above 40 bits or 5 hits above 36 bits. While the non-unique probes are not single-copy in the human genome based on our criteria, we have kept them in our database for potential use in regions of the human genome that are duplicated or for isolating genes within a gene family. We do not recommend the use of non-unique probes unless there is no alternative unique probe available.
To increase the number of unique probes, after masking just the non-unique probe sequences in the human-mouse alignment file, candidate probes were then designed only from alignments that yielded a non-unique probe. Candidate probes were again compared to the human genome by megablast and designated unique or non-unique. This recursive process was repeated 2 times to yield 139,272 unique probes and 97,721 non-unique probes.
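The unique/non-unique rule described above can be written as a small function. The sketch below assumes that the single identical hit at the expected location has already been found and set aside, and that the thresholds apply to the bit scores of the remaining hits; how the original pipeline counted the identical hit is not stated, so this is an interpretation. The function name is hypothetical.

```python
def classify_probe(other_hit_bit_scores):
    """Tag a probe from the bit scores of its non-identical megablast hits.

    Assumes the one identical hit at the expected location was already
    removed from the list (an interpretation of the criteria in the text).
    """
    n_above_40 = sum(1 for s in other_hit_bit_scores if s > 40)
    n_above_36 = sum(1 for s in other_hit_bit_scores if s > 36)
    if n_above_40 == 0 and n_above_36 < 5:
        return "unique"
    return "non-unique"


if __name__ == "__main__":
    print(classify_probe([]))            # unique
    print(classify_probe([38.2, 37.0]))  # unique
    print(classify_probe([42.1]))        # non-unique
```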

 

 

Figure 3. Results of the experimental validation of a sample set of AUG_2003_b1 universal probes.

To test the efficiency of the mammalian whole-genome probe set AUG_2003_b1, 48 probes were selected from 7 regions of the human genome for screening the marmoset (CHORI-259), galago (CHORI-256), rabbit (LBNL-1), bat (VMRC-7), shrew (SA_Ba), armadillo (VMRC-5), wallaby (ME_KBa), and platypus (OA_Bb) BAC libraries. After primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated, and is shown above. The distribution of the probes' percent identity between human and mouse was slightly enriched for higher-identity probes relative to the whole-genome set of unique probes (21% versus 16% at 100% identity, 19% versus 18% at 97% identity, 19% versus 21% at 94% identity, 23% versus 22% at 91% identity, and 19% versus 23% at 88% identity). However, because choosing probes for optimal physical spacing will greatly shift the selected probes toward higher percent identity, we believe this sample set reflects an accurate measurement of the effectiveness of the unique whole-genome probe set. Representative clones have been sent to the NIH Intramural Sequencing Center for sequencing to confirm probe specificity.

 

 

Table 2. Summary of Universal Probes in AUG_2003_b1 (Mammalian)

 

 

 

 

Human Chromosome  Length w/o gaps (bp)  Unique Probes  Non-Unique Probes
Chr1              218712898             12,156         9,401
Chr2              237043677             13,752         8,883
Chr3              193607233             9,624          6,298
Chr4              186580523             6,748          4,411
Chr5              177524972             9,125          5,989
Chr6              166880541             6,465          5,533
Chr7              154546299             6,675          4,770
Chr8              141694337             5,573          3,724
Chr9              115187719             6,151          4,501
Chr10             130710874             6,501          4,143
Chr11             130709420             7,735          5,451
Chr12             129328334             5,491          3,994
Chr13             95511656              3,512          2,462
Chr14             87191216              5,017          3,338
Chr15             81117055              4,943          3,929
Chr16             79890795              4,568          3,013
Chr17             77480855              5,655          4,433
Chr18             74534531              3,474          1,967
Chr19             55780860              1,970          1,596
Chr20             59424990              3,150          2,008
Chr21             33924747              813            713
Chr22             34352072              1,100          1,071
ChrX              147686666             8,148          6,529
ChrY              22761097              10             480
Total             2832199938 bp         138,356        98,637

 

 

 

 

OCT_2003_b2 Mammalian Whole-Genome Probe set.

A sample set of orthologous genomic sequences from human, mouse, rat, dog, cat, cow and pig was used to empirically optimize the universal probe design process using human-mouse-rat alignments. A total of 2,863 36-bp probe sequences with 7 or fewer mismatches between human and mouse, and for which rat, dog, cat, cow and pig sequences were also available, were used as the basis of this process. For each substitution pattern between human-mouse-rat

(i.e., human AAAAA

       mouse AATTT

       rat   ATATC

     pattern 12345)

 

a 'weight' was assigned based on the calculated percent identity for each pattern between the human nucleotide and the corresponding dog, cat, cow and pig nucleotide. The calculated values were:

weight = sum of identical bases (dog, cat, cow, pig) / (total number of bases of the pattern x 4)

Pattern 1 = (67135 + 67461 + 66253 + 66623) / (72680 x 4) = 0.9200

Pattern 2 = (2472 + 2498 + 2399 + 2441) / (3003 x 4) = 0.8167

Pattern 3 = (1258 + 1263 + 1225 + 1215) / (1526 x 4) = 0.8127

Pattern 4 = (3476 + 3521 + 3399 + 3467) / (5885 x 4) = 0.5889

Pattern 5 = (318 + 330 + 304 + 310) / (501 x 4) = 0.6297
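The arithmetic behind these five weights can be reproduced with a few lines of Python. The counts are taken directly from the values listed above; the variable and function names are hypothetical.

```python
# Per pattern: (identical-base counts in dog, cat, cow, pig), total bases of that pattern.
PATTERN_COUNTS = {
    1: ((67135, 67461, 66253, 66623), 72680),
    2: ((2472, 2498, 2399, 2441), 3003),
    3: ((1258, 1263, 1225, 1215), 1526),
    4: ((3476, 3521, 3399, 3467), 5885),
    5: ((318, 330, 304, 310), 501),
}


def pattern_weight(identical_counts, total_bases):
    """Fraction of bases identical to human, pooled over the test species."""
    return sum(identical_counts) / (total_bases * len(identical_counts))


WEIGHTS = {p: pattern_weight(c, n) for p, (c, n) in PATTERN_COUNTS.items()}

if __name__ == "__main__":
    for pattern in sorted(WEIGHTS):
        print("Pattern %d: %.4f" % (pattern, WEIGHTS[pattern]))
```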

 

A score was calculated for each probe by counting the patterns and then summing the corresponding 'weights'. Because rat and mouse are essentially equidistant from human, and only 0.004 separated the values for patterns 2 and 3, a single value, 0.8147, was used for both of those patterns. The correlation of the probe scores and number of mismatches per probe in dog, cat, cow and pig was then calculated and compared to the correlation coefficient using a probe score based solely on the number of mismatches between the human (probe) sequence and the mouse sequence. The correlation coefficient was 0.5425073 for the mouse mismatch score alone and 0.564635 for the new matrix, indicating that adding the rat sequence and using this matrix does provide a better basis for designing universal mammalian probes. While the increase in the correlation is not large, this basic scoring matrix strategy can be used with larger numbers and/or more informative combinations of species (such as human-mouse-dog).
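A minimal Python sketch of this scoring scheme is shown below. It classifies each aligned human-mouse-rat column into one of the five patterns and sums the corresponding weights, using the single merged value 0.8147 for patterns 2 and 3. The function names are hypothetical, and the sketch assumes gap-free, equal-length aligned sequences.

```python
# Weights from the text; patterns 2 and 3 share the merged value 0.8147.
WEIGHTS = {1: 0.9200, 2: 0.8147, 3: 0.8147, 4: 0.5889, 5: 0.6297}


def column_pattern(human, mouse, rat):
    """Return the substitution pattern (1-5) for one alignment column."""
    if human == mouse == rat:
        return 1  # all three identical
    if human == mouse:
        return 2  # only rat differs
    if human == rat:
        return 3  # only mouse differs
    if mouse == rat:
        return 4  # mouse and rat agree, human differs
    return 5      # all three differ


def probe_score(human, mouse, rat):
    """Sum the pattern weights over a gap-free human-mouse-rat alignment."""
    if not (len(human) == len(mouse) == len(rat)):
        raise ValueError("aligned sequences must have equal length")
    return sum(WEIGHTS[column_pattern(h, m, r)] for h, m, r in zip(human, mouse, rat))


if __name__ == "__main__":
    # The five-column example from the text maps to patterns 1 through 5.
    print([column_pattern(h, m, r) for h, m, r in zip("AAAAA", "AATTT", "ATATC")])
```

With these weights, a 36-bp probe that is identical in all three species scores 36 x 0.92 = 33.12, which is the maximum possible score.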

 

The second major change to the probe design process was the selection of all probes that fell within the 44-56% GC range (a GC fraction of 0.44-0.56) and met the minimum scoring requirement. In the previous build, only the 'best' probe was selected for each gap-free alignment between human and mouse. To provide the maximum number of probe options, we eliminated the 'best' criterion and now include all sequences that meet the set probe criteria.
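The GC-range part of this selection can be sketched as a sliding window over the human sequence of a gap-free alignment. The sketch assumes that every overlapping 36-bp window is a candidate, which the text does not state explicitly, and it omits the minimum score check. The function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_windows(human_seq, probe_len=36, gc_min=0.44, gc_max=0.56):
    """Return (start, window) for every window whose GC fraction is in range.

    Treating all overlapping windows as candidates is an assumption; the
    minimum probe score requirement is not applied here.
    """
    hits = []
    for start in range(len(human_seq) - probe_len + 1):
        window = human_seq[start:start + probe_len]
        if gc_min <= gc_fraction(window) <= gc_max:
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    example = "AT" * 9 + "GC" * 9  # 36 bp with a GC fraction of 0.5
    print(candidate_windows(example))
```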

 

This new algorithm was applied to the Multiz human-mouse-rat whole-genome alignment of Human (UCSC hg15), Mouse (UCSC mm3) and Rat (UCSC rn2), generated by W. Miller and J. Kent (Blanchette et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14:708-715; RGSPC. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493-521) and downloaded from http://www.genome.ucsc.edu. Based on analysis of our test data set, a probe score cutoff value of 31.83 was determined to be more stringent than the 4 or fewer mismatches between human and mouse used in AUG_2003_b1 (i.e., it would include greater than 93% of all probes with 0, 1, 2, or 3 mismatches between human and mouse and exclude 60% of all probes with 4 mismatches between human and mouse). In addition, we have also edited the probe set to remove a small fraction of candidate sequences (<5%) that had properties that might compromise their general utility

(e.g., a sequence that looked like this: GGCCGGGGGCCGCCCGGATATTATTTATAATATAT).

Specifically, probes with a gc_score (see soop.pl algorithm, OM40 (McPherson)) above 55.36 were removed.

 

 

Figure 4. Results of the experimental validation of a sample set of OCT_2003_b2 universal probes.

To test the efficiency of the mammalian whole-genome probe set OCT_2003_b2, 48 probes were selected from 11 regions of the human genome for screening the marmoset (CHORI-259), galago (CHORI-256), rabbit (LBNL-1), bat (VMRC-7), shrew (SA_Ba), armadillo (VMRC-5), elephant (VMRC-15), wallaby (ME_KBa), and platypus (OA_Bb) BAC libraries. After primary and secondary screens, probe-content information was merged with restriction-enzyme fingerprint content maps. Based on this information, the success rate (the fraction of probes tested that were positive for at least one BAC clone) in each species was calculated, and is shown above. The test set of probes was selected to be an accurate sampling of the entire probe set (in terms of probe score). Representative clones have been sent to the NIH Intramural Sequencing Center for sequencing to confirm probe specificity.

 

A numerical summary of this probe build is listed below.

 

Table 3. Summary of Universal Probes in OCT_2003_b2 (Mammalian)

 

 

 

 

Human Chromosome  Length w/o gaps (bp)  Unique Probes  Non-Unique Probes
Chr1              218712898             457,441        268,561
Chr2              237043677             420,924        222,490
Chr3              193607233             318,413        162,270
Chr4              186580523             198,668        103,338
Chr5              177524972             270,796        149,901
Chr6              166880541             213,399        139,525
Chr7              154546299             212,745        122,716
Chr8              141694337             179,965        91,364
Chr9              115187719             221,804        124,513
Chr10             130710874             208,993        105,613
Chr11             130709420             286,737        159,738
Chr12             129328334             211,643        124,001
Chr13             95511656              103,488        58,366
Chr14             87191216              178,789        100,094
Chr15             81117055              189,412        108,324
Chr16             79890795              159,073        91,888
Chr17             77480855              237,188        144,407
Chr18             74534531              99,938         48,999
Chr19             55780860              84,060         53,058
Chr20             59424990              112,647        58,712
Chr21             33924747              34,475         20,460
Chr22             34352072              61,558         39,617
ChrX              147686666             224,952        149,438
ChrY              22761097              112            8,855
Total             2832199938 bp         4,687,220      2,656,248

 

 

Table 4. Summary of probe scores for OCT_2003_b2.

Score   Unique Probes  Non-Unique Probes
31.83   18498          12059
31.84   55823          31366
31.85   47247          27331
31.86   243            144
31.87   3523           2334
31.88   4897           2868
31.89   4049           2336
31.91   211            159
31.92   344500         196685
31.93   41675          25794
31.95   1548           1296
31.96   68556          39478
31.97   7857           4694
31.99   227            214
32.00   5452           3309
32.01   330            223
32.02   436743         245136
32.04   88869          53423
32.05   4183           3044
32.06   73638          42661
32.07   301            165
32.08   15043          8745
32.09   423            326
32.10   4859           2865
32.12   621            383
32.13   351825         194616
32.14   177216         104856
32.16   10215          6712
32.17   45930          26272
32.18   25930          15562
32.20   1009           622
32.21   2586           1616
32.22   1071           670
32.25   300994         174414
32.26   23845          15111
32.28   761            490
32.29   38227          22927
32.30   2206           1559
32.33   1384           943
32.35   402291         230841