Yeast genome project


This is a compilation of information on the yeast genome project from different databases and sources.
Nazish Nehal,
M. Tech (Biotechnology),
University School of Biotechnology (USBT),
Guru Gobind Singh Indraprastha University (GGSIPU),
New Delhi (INDIA)

Published in: Technology

  1. Yeast Genome Project
  2. Introduction
     • Saccharomyces cerevisiae
       • Perhaps the most useful diploid yeast; it has been instrumental in winemaking, baking and brewing since ancient times
       • One of the most intensively studied eukaryotic model organisms in molecular and cell biology
       • Size: 5-10 µm in diameter
       • Year sequenced: 1996
       • Strain sequenced: S288C
       • Databases: Munich Information Centre for Protein Sequences (MIPS); Yeast Protein Database (YPD); Saccharomyces Genome Database (SGD)
     • Schizosaccharomyces pombe (fission yeast)
       • Used as a model organism in molecular and cell biology
       • Size: 3 to 4 µm in diameter and 7 to 14 µm in length
       • Year sequenced: 2002
       • Strain sequenced: 972h, by the European sequencing consortium (EUPOM) including 13 laboratories and the Wellcome Trust Sanger Institute; Cold Spring Harbor Laboratory
       • Databases: PomBase; Broad Institute; Saccharomyces Genome Database
  3. • Candida albicans
       • Most common human fungal pathogen
       • A diploid fungus that grows both as yeast and as filamentous cells; it is a causal agent of opportunistic oral and genital infections in humans and of candidal onychomycosis, an infection of the nail plate
       • Size: 2.0-7.0 µm in diameter, 3.0-8.5 µm in length
       • Year sequenced: 2004, by consortia formed by the Stanford technology centre
       • Strain sequenced: SC5314
       • Databases: Candida database; Broad Institute; Saccharomyces Genome Database
  4. • The baker's yeast Saccharomyces cerevisiae was the first eukaryote whose genome was entirely sequenced
     • Mitochondrial DNA was sequenced in segments in the 1980s
     • In 1989, it was decided to initiate a yeast sequencing project within the frame of the EU biotechnology programmes; some 35 European laboratories were initially involved in this enterprise [Vassarotti & Goffeau, 1992]
     • Chromosome III was the first chromosome to be completed, in 1992, followed by XI and II, both in 1994
     • The publication of the 315 kb sequence of yeast chromosome III was a remarkable scientific landmark, not only because it was the first eukaryotic chromosome ever to be sequenced, but primarily because it revealed how much remained to be understood in the genome of an otherwise extensively studied organism such as Saccharomyces cerevisiae
     • Soon after the project began, several other laboratories joined it and agreed upon an international collaboration that enabled the whole yeast genome sequence to be finalized in 1995
     • More than 600 scientists in Europe, North America and Japan became involved in this effort, and the entire sequence was released in April 1996
  5. Figure: Consortia involved in the yeast genome sequencing project (EU = 55.9%, UK = 17.6%, USA = 20.0%, Canada = 4.3%, Japan = 2.2%)
  6. Cloning and Mapping Procedures
     • The sequencing of chromosome III started from a collection of overlapping plasmid or phage lambda clones that were distributed by the DNA coordinator to the contracting laboratories. However, it soon became evident that ordered cosmid libraries were much more advantageous for large-scale sequencing.
     • To construct a library with as complete coverage as possible from as few clones as possible, the cloned DNA fragments should be randomly distributed on the DNA.
     • Under these conditions, the number of clones (N) in a library representing each genomic segment with a given probability (P) is N = ln(1 - P) / ln(1 - f), where f is the insert length expressed as a fraction of the genome size [Clarke & Carbon, 1976].
     • For example, with a size of 12,800 kb for the yeast genome and an assumed average insert length of 35 kb, a cosmid library containing 4600 random clones would represent the yeast genome at P = 99.99%, i.e. about twelve times the genome equivalent.
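The Clarke & Carbon relation on this slide can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the slide's own figures (12,800 kb genome, 35 kb inserts, 4600 clones). Under these inputs the formula gives roughly 3,360 clones for P = 99.99%, while 4600 clones correspond to about 12.6 genome equivalents and a coverage probability above 99.99%.

```python
import math


def clones_needed(p, insert_kb, genome_kb):
    """Clarke & Carbon: N = ln(1 - P) / ln(1 - f), with f = insert / genome."""
    f = insert_kb / genome_kb
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - f))


def probability_covered(n_clones, insert_kb, genome_kb):
    """Invert the formula: P = 1 - (1 - f)^N for a library of N random clones."""
    f = insert_kb / genome_kb
    return 1.0 - (1.0 - f) ** n_clones


def genome_equivalents(n_clones, insert_kb, genome_kb):
    """Total cloned DNA expressed in multiples of the genome size."""
    return n_clones * insert_kb / genome_kb


if __name__ == "__main__":
    # Figures taken from the slide; nothing here is measured data.
    print(clones_needed(0.9999, 35, 12800))
    print(probability_covered(4600, 35, 12800))
    print(genome_equivalents(4600, 35, 12800))
```

Rounding N up with `math.ceil` is a design choice of this sketch, since a library cannot contain a fractional clone.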
  7. • A low number of clones was of interest when setting up ordered yeast cosmid libraries, or specific sublibraries sorted out from an unordered cosmid library by colony hybridization, using specific chromosomal DNA purified by pulsed-field gel electrophoresis as a probe
     • The 'nested chromosomal fragmentation' method [Thierry & Dujon, 1992] was then applied to sort these clones rapidly
     • Finally, a set of overlapping cosmids was sufficient to build a contig of a specific chromosome
     • This approach has also been applied successfully to many of the other chromosomes sequenced in the yeast genome project
     • To facilitate sequencing and assembly of the sequences, contigs of overlapping cosmids and fine-resolution physical maps of the respective chromosomes were constructed first, either by classical mapping methods (fingerprints, cross-hybridization) or by novel methods developed for this programme, such as site-specific chromosome fragmentation [Thierry & Dujon, 1992] or the high-resolution cross-hybridization matrix [Scholler et al., 1995]
  8. Sequencing strategies and Sequence Assembly
     • In the European network, clones were distributed to the collaborating laboratories according to a scheme worked out by the DNA coordinators
     • Each contracting laboratory was free to apply sequencing strategies and techniques of its own, provided that the sequences were entirely determined on both strands and unambiguous readings were obtained
     • Two principal approaches were used to prepare subclones for sequencing:
       1) generation of sub-libraries by the use of a series of appropriate restriction enzymes, or from nested deletions of appropriate sub-fragments made by exonuclease III
       2) generation of shotgun libraries from whole cosmids or subcloned fragments by random shearing of the DNA
     • Sequencing by the Sanger technique was done:
       1) manually, labelling with [35S]dATP being the preferred method of monitoring
       2) by automated devices
     • Two types of devices for on-line detection with fluorescence labelling were employed:
       1) Applied Biosystems ABI373A
       2) Pharmacia A.L.F.
     • One laboratory used the direct blotting electrophoresis system from the GATC company (Konstanz)
     • Similar procedures were applied to the sequencing of chromosomes outside the European network. The American laboratories largely relied on machine-based large-scale sequencing.
  9. Sequencing Telomeres
     • The yeast chromosome telomeres presented a particular problem
     • Because of their repetitive sub-structures and the lack of appropriate restriction sites, they could not be cloned by conventional procedures, with only a few exceptions
     • Largely, telomeres were physically mapped relative to the terminal-most cosmid inserts using the I-SceI chromosome fragmentation procedure [Thierry & Dujon, 1992]
     • The sequences were then determined from specific plasmid clones obtained by 'telomere trap cloning', an elegant strategy developed by E. Louis at Oxford [Louis, 1994; Louis & Borts, 1995]
  10. Sequence Assembly
     • Within the European network, all original sequences were submitted by the collaborating laboratories to the Martinsried Institute of Protein Sequences (MIPS), which acted as an informatics centre
     • The sequences were kept in a data library, assembled into progressively growing contigs, and updated during the course of the project by the application of appropriate criteria in a number of quality controls, starting with chromosome XI
     • In collaboration with the DNA coordinators, the final chromosome sequences were derived
     • For the other yeast chromosomes too, automated procedures were employed for sequence assembly, based for example on:
       1) the program package developed at Cambridge [e.g. Dear & Staden, 1991]
       2) the ACeDB program developed for the C. elegans genome project [Thierry-Mieg & Durbin, 1992]
     • In any case, correct assembly of the sequences was guaranteed by establishing that the order of restriction sites predicted from the sequence was consistent with the physical maps of these sites that had been determined independently, and care was taken to perform quality controls that would result in a high accuracy
     • From theoretical considerations taking all types of errors together, it follows that with an average sequence accuracy of 99.9%
     • In practice, care was taken to minimize frameshift errors, which represented about two thirds of all sequencing errors and thus would have the most deleterious effects on gene interpretation. Meanwhile, all sequences have been systematically checked for errors again and corrected in the data libraries.
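The consistency check described on this slide (restriction sites predicted from the assembled sequence versus an independently determined physical map) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the function names, the 10% size tolerance and the toy sequence are hypothetical, and this is not the procedure the consortium actually ran. EcoRI (recognition site GAATTC, cutting after the first G) serves only as a familiar example enzyme.

```python
def cut_positions(seq, site, cut_offset):
    """Return cut coordinates for every occurrence of `site` (overlaps allowed)."""
    seq, site = seq.upper(), site.upper()
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + cut_offset)
        i = seq.find(site, i + 1)
    return cuts


def fragment_sizes(seq, site, cut_offset):
    """Sizes of the fragments a complete digest would give, in map order."""
    bounds = [0] + cut_positions(seq, site, cut_offset) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def agrees_with_map(predicted, mapped, tolerance=0.1):
    """True if both fragment lists have the same order and similar sizes."""
    if len(predicted) != len(mapped):
        return False
    return all(abs(p - m) <= tolerance * m for p, m in zip(predicted, mapped))


if __name__ == "__main__":
    # Toy sequence with two EcoRI sites; a real check would use a cosmid contig.
    toy = "AA" + "GAATTC" + "TTTT" + "GAATTC" + "AA"
    predicted = fragment_sizes(toy, "GAATTC", cut_offset=1)
    print(predicted, agrees_with_map(predicted, [3, 10, 7]))
```

Comparing ordered fragment lists rather than sorted ones is deliberate: a misassembly that swaps two segments changes the order of fragments even when the set of sizes is unchanged.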
  11. The sequences have been interpreted using the following principles:
       i. All intron splice site/branch-point pairs detected by using specially defined patterns were listed
       ii. All ORFs containing at least 100 contiguous sense codons and not contained entirely in a longer ORF were listed
       iii. Centromere and telomere regions, as well as tRNA genes and Ty elements or remnants thereof, were sought by comparison with previously characterized datasets
     • Similarity searches used FASTA, BLASTX and FLASH in combination with the Protein Sequence Database of PIR-International and other public databases
     • Protein signatures were detected by using the PROSITE dictionary, as well as BLOCKS and PRODOM domains
     • Analyses of base composition, nucleotide pattern frequencies, GC profiles and ORF distribution profiles were performed by using GCG programs or the X11 program package
     • For calculations of the GC content of ORFs, the algorithm CODONS was used
     • This information was compiled at the end of the sequencing project to annotate all genetic elements in the yeast genome
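Principle ii above (ORFs of at least 100 contiguous sense codons that are not contained entirely in a longer ORF) can be sketched as a six-frame scan. This is a minimal illustration under stated assumptions, not the project's annotation software: it assumes an ORF runs from the first ATG after a stop codon to the next in-frame stop, it ignores introns, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (start, end, strand) tuples in forward-strand, 0-based, end-exclusive coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:  # sense codons, ATG included
                        if strand == "+":
                            orfs.append((start, i + 3, strand))
                        else:  # map reverse-strand indices back to the forward strand
                            orfs.append((n - (i + 3), n - start, strand))
                    start = None
    # Drop any ORF that lies entirely inside a strictly longer one.
    return sorted(
        o for o in orfs
        if not any(
            p is not o and p[0] <= o[0] and o[1] <= p[1]
            and (p[1] - p[0]) > (o[1] - o[0])
            for p in orfs
        )
    )


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 120 + "TAA"  # one 121-codon ORF on the forward strand
    print(find_orfs(toy))
```

The containment filter compares coordinates across both strands and all frames, which is one possible reading of "not contained entirely in a longer ORF"; a stricter or looser reading would change only that final step.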
  12. Classification of S. cerevisiae genes
  13. ORF sizes in the S. cerevisiae genome
  14. • At the time the yeast genome sequencing project was finalized, comparison of the total sequence with public databases revealed:
       • Some 28.4% of the yeast ORFs corresponded either to previously known protein-encoding genes or to genes whose functions had been determined previously or during the course of the project
       • An estimated 5.6% of the total remained questionable ORFs
       • 66% of the total ORFs represented novel putative yeast genes
       • 14.8% of the total had homologues among gene products from yeast or other organisms whose functions are known
       • 14.4% of the total had recognizable motifs or weak homologies to genes of experimentally characterized functions
       • The remaining 37.7% of the total ORFs had either homologues to ORFs of unknown function on other
       • Thus, approximately 2200 of the yeast genes had to be categorized as 'genes of unknown function', sometimes called 'orphans'
     • A most useful inventory of the yeast proteins has been compiled in the Yeast Proteome Database (YPD) [Garrels et al., 1996] and is updated regularly
  15. The mystery of orphans
     • 'Orphans' are defined by the absence of known function and of structural homologs of known function, so it seems only natural that, with time, they will vanish
     • Functions of a few genes previously classified as orphans were reported during the sequencing project itself
     • The most striking result from the chromosome III sequence was that approximately half of all protein-coding ORFs revealed by the sequence had no clear-cut sequence homologs in any organism, including yeast itself
     • Thus, with the sequence of the first eukaryotic chromosome, it was the discovery of the extent of our ignorance, rather than the discovery of many new genes, that was the most conspicuous finding
     • Exact figures depend on the stringency criteria applied to determine the significance of sequence similarities; on average, 30-35% of all ORFs of the yeast genome are orphans
     • Even in the absence of homologs, computers can provide some clues about the nature of some orphans. For example, prediction of transmembrane segments resulted in the striking conclusion that up to 35-40% of the predicted proteins from chromosome III have transmembrane helices
     • Ultimately, the function of each sequence-predicted ORF can only be demonstrated by experiments
     • Total number of orphans in the yeast genome: about 2000
     • It is clear that orphans, by and large, are not fundamentally different from other yeast genes in terms of expression
     • If orphans are real genes, why were they not discovered before? Genome redundancy is a possible explanation. As sequencing progressed, structural homologs to earlier orphans were regularly discovered in the yeast genome. Statistically, however, there is no indication that orphans tend to be more frequently duplicated than the genes previously characterized by classical genetics or their structural homologs. If anything, the converse seems to be true.
  16. Gene Density and Gene Arrangement of Protein-encoding Genes in S. cerevisiae
     • From the number of genes and the total size of the yeast genome one arrives at a gene density
     • Gene density in all yeast chromosomes is rather similar
     • Excluding the ORFs contributed by the Ty elements, ORFs occupy on average 70% of the sequences. This leaves only limited space for the intergenic regions, which can be thought to harbour the major regulatory elements involved in chromosome maintenance, DNA replication and transcription
     • The compact nature of the S. cerevisiae genome is apparent when compared to more complex eukaryotic systems:
       • C. elegans contains a potential protein-encoding gene only every 5-6 kb [Hodgkin et al., 1994]
       • In the human genome, gene density had been estimated to be as low as one gene in 30 kb [Olson, 1993]; now that the draft sequence is available, this figure is one gene in about 100 kb
     • Schizosaccharomyces pombe possesses a lower gene density (one gene per 2.3 kb) than S. cerevisiae. The difference between the two yeast genomes appears to be due to the fact that in the fission yeast 40% of the genes contain introns, whereas only a minor fraction (< 5%) of the protein-encoding genes in S. cerevisiae are found to be interrupted by introns
  17. • Generally, ORFs appear to be rather evenly distributed between the two strands of the single chromosomes. In some chromosomes (e.g. I, II, VIII), there is a slight excess of coding capacity on one of the strands, the significance of which is not known
     • Average base composition of yeast DNA is 38.4% (G+C)
     • GC content of:
       1. protein-coding regions (40.2%)
       2. non-coding regions (35.1%)
     • Coding regions are evenly distributed between the two strands
     • Average ORF size is 1450 bp
     • The average sizes of inter-ORF regions vary between 630 and 945 bp for different chromosomes:
       1. 618 bp on average for 'divergent promoters' (36.2% GC)
       2. 326 bp for 'convergent terminators' (29.3% GC)
       3. 517 bp for 'promoter-terminator combinations' (34.2% GC)
     • Average base composition has been found to be symmetrical over the entire chromosomes
     • The base composition of the ORFs themselves shows a significant excess of homopurine pairs on the coding strand
     • Regional variations of base composition with similar amplitudes were first noted along chromosome III
     • A most interesting observation was that the compositional periodicity correlates with local gene density, reaching more than 85% in GC-rich regions, followed by segments of comparably lower gene density (50-55%) in AT-rich regions [Dujon et al., 1994]
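The G+C percentages and the regional variation in base composition quoted on this slide are the kind of figures a sliding-window calculation produces. The following is a minimal sketch, assuming plain upper- or lower-case DNA strings; the function names, window and step sizes are hypothetical choices for illustration and do not reproduce the GCG or X11 tools named in the source.

```python
def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def gc_profile(seq, window, step):
    """List of (window_start, gc_fraction) pairs along the sequence."""
    return [
        (start, gc_fraction(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "GGGGCCCC" + "AAAATTTT"  # a GC-rich block followed by an AT-rich block
    print(gc_fraction(toy))
    print(gc_profile(toy, window=8, step=8))
```

Ambiguous bases such as N are excluded from the denominator here, which is one common convention; counting them would lower the reported fraction.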
  18. • Functional elements of the yeast chromosome:
       1. Centromere
       2. Telomere
       3. Origins of replication
     • Complex and simple repeats:
       • The yeast genome is remarkably poor in repeated sequences
       • A unique constellation of repetitious sequences is found at the two ends of chromosome I. Approximately 30 kb in each subtelomeric region carry similar (but non-essential) genes and a 15 kb repeat
       • These terminal regions represent the yeast equivalent of heterochromatin, and the occurrence of this type of DNA suggests that its presence gives this chromosome the critical length required for proper stability and function
       • The 30 kb region can be removed from each end without affecting vegetative growth, although chromosome stability is considerably reduced
       • Besides the Ty elements, it is the rDNA on chromosome XII that contributes most significantly to repetitiveness
       • A cluster of some 15 tandem repeats (2 kb each) containing the CUP1 gene and contributing to polymorphic variation is found on chromosome VIII
       • Repeated stretches of short oligonucleotides exist. These include poly(A) or poly(T) tracts, alternating poly(AT) or poly(TG) tracts, and direct or inverted long repeats
  19. (S. cerevisiae)
  20. Genome Inventory of S. cerevisiae
  21. Graphical View of Protein Coding Genes of S. cerevisiae (as of Nov 20, 2013): S. cerevisiae gene products that are annotated to one or more terms in each GO aspect
     • Distribution of Gene Products among Biological Process Categories
     • Distribution of Gene Products among Molecular Function Categories
     • Distribution of Gene Products among Cellular Component Categories
  22. Genome Inventory of S. pombe (2004 and 2013)
  23. Genome Inventory of C. albicans
  24. Graphical View of Protein Coding Genes of C. albicans (as of Nov 20, 2013): C. albicans gene products that are annotated to one or more terms in each GO aspect
     • Distribution of Gene Products among Cellular Component Categories
     • Distribution of Gene Products among Biological Process Categories
     • Distribution of Gene Products among Molecular Function Categories
  25. Feature type (Total)         Saccharomyces cerevisiae   Schizosaccharomyces pombe   Candida albicans
      No. of genes                 6,607                      5,123                       6,214
      Chromosome length (bp)       12,157,105                 12,362,167                  14,324,315
      Nuclear genome (bp)          12,071,326                 12,342,737                  14,283,895
      Mitochondrial genome (bp)    85,779                     19,430                      40,420
      No. of chromosomes           16                         3                           8
      Mean coding length (bp)      1485                       1426                        1439
      No. of introns               272                        4730                        224
      Coding percentage            69.9 %                     57.5 %                      61.5 %
      Non-coding RNA               92                         450                         -
      GC content                   39 %                       36 %                        33.46 %
      Gene density (bp per gene)   2124                       2528                        2342
      Unique proteins              1104                       681                         1218
      Pseudogenes                  19                         29                          7
      Centromere                   16                         3                           8
      tRNA                         299                        171                         156
      rRNA                         27                         47                          6
      snRNA                        6                          7                           5
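As a quick arithmetic check on the comparison above, the "Chromosome length" figure for each species appears to be the sum of its nuclear and mitochondrial genome sizes. The sketch below tests that reading using only the three size rows from the slide; the dictionary layout and function names are hypothetical.

```python
# Genome sizes in bp, copied from the slide's comparison of the three yeasts.
GENOMES = {
    "S. cerevisiae": {"total": 12_157_105, "nuclear": 12_071_326, "mito": 85_779},
    "S. pombe": {"total": 12_362_167, "nuclear": 12_342_737, "mito": 19_430},
    "C. albicans": {"total": 14_324_315, "nuclear": 14_283_895, "mito": 40_420},
}


def totals_consistent(genome):
    """True if nuclear + mitochondrial size equals the reported total."""
    return genome["nuclear"] + genome["mito"] == genome["total"]


def mito_share(genome):
    """Mitochondrial genome as a fraction of the reported total."""
    return genome["mito"] / genome["total"]


if __name__ == "__main__":
    for name, genome in GENOMES.items():
        print(name, totals_consistent(genome), f"{mito_share(genome):.4%}")
```

Under this reading all three totals add up exactly, and the mitochondrial genome is well under 1% of the total in each species.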
  26. • Table 1: Frequency and Characteristics of Short Tandem Repeats in the Coding Sequences of Fungal Genomes
     • Table 2: Number, Abundance Ranking, and Proportion of Gene Products Containing the Indicated Interpro Protein Domain in yeast species and human
  27. Genetic and Physical maps
     • The genetic map of S. cerevisiae [Mortimer et al., 1992] has been of considerable value to yeast molecular biologists
     • DNA probes from some known genes were mapped to particular chromosomes for chromosomal walking. In the end, however, physical maps of all chromosomes were constructed without reference to the genetic maps
     • Apart from local expansion or contraction of the genetic map, and the fact that the overall frequency of meiotic recombination increases as chromosome size decreases, the orders of the genes positioned on the chromosomes by genetic and by physical mapping broadly agree
     • Thus, comparison of the physical and genetic maps shows that most of the linkages have been established to give the correct gene order, but that in many cases the relative distances derived from genetic mapping are imprecise. The obvious imprecision of the genetic maps may be due to the fact that different yeast strains were used in establishing the linkages
  28. Genetic and Physical map of yeast chromosome II
  29. Genetic redundancy in yeast
     • There is a considerable degree of internal genetic redundancy in the yeast genome
     • It is difficult to correlate physical redundancy completely with functional redundancy because, even in yeast, gene functions have been precisely defined only to a limited extent
     • Duplicated sequences are confined to nearly the entire coding region of these genes and do not extend into the intergenic regions
     • The corresponding gene products share high similarity in amino acid sequence, or are sometimes even identical, and may therefore be functionally redundant
     • Because of sequence differences within the promoter regions, gene expression should vary according to the nature of the regulatory elements or other (regulatory) constraints. It may well be that one gene copy is highly expressed while another is expressed at a low level; turning expression of a particular copy within a gene family on or off may depend on the differentiated status of the cell (such as mating type, sporulation, etc.)
     • Classical examples of redundant genes in subtelomeric regions are the yeast MEL, SUC, MGL and MAL genes
     • Subtelomeric regions of several yeast chromosomes share highly conserved segments, in some instances up to 30 kb, which carry duplicated genes whose functions are largely unknown
     • Duplicated genes have also been found in clusters, e.g. on chromosome II, and a cluster of three hexose transporter genes on chromosome VIII
     • Cluster Homology Regions (CHRs): comparison of the sequences of complete chromosomes with each other revealed large chromosome segments in which homologous genes are arranged in the same order, with the same relative transcriptional orientations, on two or more chromosomes. This is responsible for 30-40% of total redundancy
     • Chromosomes II and IV share the longest CHR, comprising a pair of pericentric regions of 170 and 120 kb, respectively, that share 18 pairs of homologous genes
     • Significance: whatever the relative timescale and mechanisms of the duplications, these events, followed by mutations affecting functional properties, offer a chance of improved environmental fitness. On the other hand, the high gene density in yeast indicates a strong tendency to maintain a compact genome; compensatory mechanisms must therefore exist to remove non-functional or superfluous gene copies
  30. Figure: View of 53 clustered gene duplications between the 16 chromosomes of yeast
  31. Table: Gene duplication in S. pombe and S. cerevisiae using NCBI BlastClust
  32. Sequence Variation among Yeast Strains
     • Polymorphism among different yeast strains is due to the following factors:
       1) variable number of gene copies from repeated gene families
       2) individual patterns caused by the presence or absence of particular Ty elements
       3) plasticity of the chromosome ends
       4) excisions or inversions of particular gene regions
       5) chromosome breakage, which has been found to occur in yeast, resulting in karyotypes deviating from the 'normal' picture
  33. Yeast Mitochondrial genome
     • The mitochondrial genes and their mosaic intronic structure were first identified in S. cerevisiae in 1998. The first mitochondrial gene ever sequenced was from S. cerevisiae
     • The multi-copy mitochondrial genome of S. cerevisiae is characterized by:
       • low gene density and high A+T content
       • highly heterogeneous base composition
       • a G+C content of the genes of approximately 30%
       • intergenic spacers composed of quasi-pure A+T stretches of several hundred base pairs, interrupted by more than 150 (G+C)-rich clusters ranging from 10 to 80 bp in length (this shows why scientists have sequenced the genes and neglected the intergenic regions)
     • The genome contains:
       • the genes for cytochrome c oxidase subunits I, II and III (cox1, cox2 and cox3)
       • ATP synthase subunits 6, 8 and 9 (atp6, atp8 and atp9), apocytochrome b (cytb), a ribosomal protein (var1)
       • several intron-related open reading frames (ORFs)
       • 7-8 replication origin-like (ori) elements
       • and it encodes 21S and 15S ribosomal RNAs, 24 tRNAs that can recognize all codons, and the 9S RNA component of RNase P
     • The cox1 gene and, to a lesser extent, the cytb, 21S RNA and 15S RNA genes constitute the largest blocks of higher G+C density
     • The atp6, atp9, cox2, cox3 and tRNA genes appear as small G+C-enriched islands in the middle of A+T and G+C cluster-rich regions
  34. Figure legend: Red - Exons; Grey - Introns; Yellow - rRNA; Green - tRNA; Dark blue - Ori elements
  35. Human-Yeast connection
     • By comparing the catalogue of human sequences available in the databases with the ORFs on the completed yeast chromosomes at the amino acid level, it is estimated that:
       • More than 30% of the yeast genes have homologues among the human genes
       • As expected, most of the genes of known function categorized in this way represent basic functions in both organisms
       • More similarities become apparent when ESTs are included in the analysis
       • The most compelling among these homologues are yeast genes that bear substantial similarity to human 'disease genes'
       • The yeast genome is 200 times smaller than the human one
       • The yeast genome is only 9-10 times less complex in its capacity to code for proteins
     • Applications:
       • Yeast may be a simple system in which to assay novel drugs or ligands, given the conservation of some basic mechanisms between yeast and human cells
       • This conservation makes some yeast genes important for the study of human genetics
  36. • S. cerevisiae genes related to human disease genes
     • S. cerevisiae genes related to nucleotide excision repair (NER) genes
  37. • S. pombe genes related to human disease genes
     • S. pombe genes related to human cancer genes
  38. • Figure: Comparison of homologous genes from different species
     • Figure: Orthologs in different species
  39. Figure: Comparison of proteins in S. pombe (S.p.), S. cerevisiae (S.c.) and C. elegans (C.e.). (a) Pie chart comparing the homology of proteins of S. pombe with those of S. cerevisiae and C. elegans; (b) Pie chart comparing the homology of proteins of S. cerevisiae with those of S. pombe and C. elegans
  40. • S. cerevisiae had a sequence approximately 60 times larger than any sequence previously attempted, which indicates why Goffeau felt compelled to invite the cooperation of a group of laboratories
     • At the time, the sequencing of model organisms such as S. cerevisiae appeared to be the logical step towards the eventual characterization of the human genome, a task that seemed beyond the scope of technology because of its tremendous size of 3,000 Mb
  41. Thank-you… By: Nazish Nehal, M. Tech (Biotechnology), University School of Biotechnology (USBT), Guru Gobind Singh Indraprastha University, New Delhi (INDIA)