When is a genome finished?
Keith Bradnam

These slides and notes are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

A talk given to the UC Davis Bits & Bites club, based on an earlier lecture I had given at UC Davis.
Keith Bradnam, March 2011

Part 1 - the sequence

We can think of ‘genome completion’ as referring to the sequence and/or the set of gene annotations. Let’s start with the sequence.

A brief history of genomics
1971 - Wu & Taylor determine the first ever DNA sequence (all 12 bp of it!)
1977 - Sanger et al. sequence the first ever (DNA-based) virus genome - 5,375 bp
1995 - First complete bacterial genome sequence (Haemophilus influenzae) - 1.83 Mb
1996 - First complete eukaryotic genome (Saccharomyces cerevisiae) - 12 Mb
1998 - First animal genome (Caenorhabditis elegans) - 100 Mb

It took 18 years after the structure of DNA was known before anyone could sequence it. The first DNA sequence was from the end of a bacteriophage lambda virus (written up in a 20-page paper). The first genome was actually an RNA viral genome, determined in 1975 by Fiers et al. The 1980s and 1990s saw the start of widespread DNA sequencing for genes of interest in species of interest. Moving to eukaryotic genome sequencing means determining multiple chromosomes and tackling bigger repeats (more assembly problems).

genomesonline.org
[Figure: bar chart of Complete and Incomplete genome projects, broken down into Bacteria, Archaea and Eukaryotes; bar labels 3,077 and 7,732]

Genomesonline.org tries to track all of the major genome projects out there. A lot of them are flagged as incomplete, and maybe some of those will never reach ‘completion’ status.

CAP criteria
1) Complete
2) Accessible
3) Permanent
Sydney Brenner

The great biologist may have won a Nobel prize for his work on development, he may have postulated the very existence of mRNA, and he may have co-discovered the triplet code... but he also came up with the CAP criteria. These criteria could pertain to any large-scale academic project, but they were conceived with reference to genome sequencing projects.

Homo sapiens
2000 - ‘working draft’ announced
2001 - ‘working draft’ published
2003 - ‘Finished’ version announced
2006 - Last chromosome finished
So it’s finished now right?
Ns make up ~9% of current genome

The human genome has been finished on several different dates, depending on how you define ‘finished’. Ns (unknown bases) still account for 9% of the 3.1 Gbp genome.

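As a rough illustration of how a figure like "Ns make up ~9%" can be measured, the sketch below counts N characters in FASTA-formatted text. The function name `n_fraction` and the toy record are made up for this example; a real genome would normally be streamed from a file rather than held in a string.

```python
def n_fraction(fasta_text):
    """Return the fraction of bases that are N (unknown) in FASTA text."""
    total = 0
    unknown = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        total += len(line)
        unknown += line.upper().count("N")
    return unknown / total if total else 0.0


# Toy record (hypothetical data): 20 bases, 2 of them unknown
toy = ">chr_example\nACGTNACGTA\nACGTACGTNA\n"
print(f"{n_fraction(toy):.1%}")
```

Note that this only counts bases that are present in the file as N; sequence that is missing from the assembly altogether is invisible to a count like this.
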
Drosophila melanogaster
2000 - genome published
~175 Mb genome
So it’s finished now right?
Ns make up ~4% of current genome

Drosophila has a much smaller genome, but a third of it is represented by the harder-to-sequence heterochromatin. This was the subject of a separate genome project that didn’t finish until 2007. The genome still has many Ns.

Arabidopsis thaliana
Published 2000: 115 Mb sequenced, 125 Mb genome
As of 2007: 119 Mb sequenced, 157 Mb genome
As of 2012: 119 Mb sequenced, 135 Mb genome
Ns make up ~0.2% of current genome

Published genome sizes are often based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced, but paradoxically the genome became less complete. This illustrates the difficulty of estimating genome size. The latest figures suggest that the genome is smaller again. Note that much of this missing genome is not present as Ns in the sequence you download. But the part you can download still has many unknown bases.

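The paradox is easier to see as arithmetic. This small sketch uses only the numbers on the slide; the helper name `percent_complete` is made up for the example.

```python
# (year, Mb sequenced, estimated genome size in Mb), taken from the slide
arabidopsis = [
    (2000, 115, 125),
    (2007, 119, 157),
    (2012, 119, 135),
]


def percent_complete(sequenced_mb, genome_mb):
    """Sequenced bases as a percentage of the estimated genome size."""
    return 100.0 * sequenced_mb / genome_mb


for year, sequenced, size in arabidopsis:
    pct = percent_complete(sequenced, size)
    print(f"{year}: {pct:.0f}% of the estimated genome sequenced")
```

On these figures the genome goes from about 92% complete in 2000 to about 76% in 2007 and back up to about 88% in 2012, even though the amount of sequence only ever increased.
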
Caenorhabditis elegans
1998 - ‘finished’ genome published
100 Mb genome (revised from 97 Mb)
2002 - last gap closed

Genome information for species such as C. elegans is curated by model organism databases (MODs), which ensure that the work goes on long after the initial publication announcing a ‘finished’ genome. Genome size was revised from 97 Mb to 100 Mb not long after publication.

Where’s my gene???
[Figure: one end of C. elegans chromosome X, with regions labelled by the year they were finished: 1997, 2000, 2001, 2002]

People will often know from traditional genetic experiments that their gene of interest is definitely present in a genome. However, it might not be present in the published genome sequence. The figure shows when different parts of one end of chromosome X of C. elegans were finished. The last 20 kbp region wasn’t finished until four years after the genome was published in 1998. This region contained predicted genes, so scientists may have been working on these genes while waiting for the sequence.

Caenorhabditis elegans
1998 - ‘finished’ genome published
100 Mb genome (revised from 97 Mb)
2002 - last gap closed
2004 - last N removed
So it’s finished now right?

Unlike the previous genomes, C. elegans has no Ns (but this took 6 years after publication to achieve).

Worm genome progress
[Figure: genome size (bp), 0 to 100,000,000, plotted against date, Jan-91 to May-06]

At a gross level, it looks like the worm genome did not change much after the year 2000...

Worm genome progress
[Figure: genome size (bp), 100,220,000 to 100,280,000, plotted against date, Sep-01 to Oct-05; annotation: 66 nt added, May 2010]

Here is a zoom in on the years 2001–2005: still lots of sequence changes happening. The last change on this graph represents a very small addition of 66 bp to the genome. Maybe this change will not make any difference to anyone in the world, but it still makes the genome sequence more accurate and closer to the biological truth. Not many genome projects are this devoted!

Saccharomyces cerevisiae
Published 1997
12 Mb genome
No gaps, no Ns
So it’s finished now right?
1,653 genome changes made since 1997
Last change made in February 2011

Like C. elegans, yeast is a species that benefits from coordinated efforts to finish the genome. In February 2011, the yeast genome sequence underwent corrections that affected 194 proteins. This happened in a genome that is tiny by today’s standards and that has been studied and curated for 15 years! What hope for larger, more complex genomes?

Part 2 - annotations

Maybe you don’t care about the state of the genome, as long as you have all of the genes present.

C. elegans annotations
[Figure: numbers of Genes and Proteins, y-axis 19,000 to 25,000, shown at genome publication (1998) and for each year from 2003 to 2011]

Since publication, the number of protein-coding loci in C. elegans has risen by about 1,500 genes. But the number of proteins that might arise from alternatively spliced products is much, much higher and shows no signs of slowing down.

C. elegans annotations
[Figure: numbers of Genes, Proteins and RNA genes, y-axis 0 to 25,000, shown at genome publication (1998) and for each year from 2003 to 2011]

When we consider RNA genes, it is surprising that there are now more RNA genes than protein-coding genes. How many more species have similar secrets in their genomes that have yet to be discovered, mostly because of our historical focus on protein-coding genes?

Core genes
You can identify ‘core’ genes that are highly conserved and that should be present in all species.
Our group identified a set of 458 core genes from 6 reference genomes:
Homo sapiens
Caenorhabditis elegans
Drosophila melanogaster
Arabidopsis thaliana
Saccharomyces cerevisiae
Schizosaccharomyces pombe
We can then test whether these are all present in any ‘finished’ genome.

Our lab developed a set of 458 ‘core genes’ that we believe should be present in every (complete) eukaryotic genome. In the past we’ve discovered that many published genomes are missing some of these genes from the genome sequence, even though they should be there. E.g. chicken has missing core genes even though those genes are represented by chicken EST sequences.

Ciona intestinalis
Version   N50         Core genes
v1.95     234,500     444
v2.0      2,571,800   425

Sometimes genomes get updates and assemblies are given a new version number. This might be associated with an increase in average scaffold size, but sometimes the number of core genes gets reduced.

Caenorhabditis sp. PS1010
Version   N50      Core genes
v4        9,446    454
v5        64,074   428

People can easily measure things like N50; it is harder to measure things like which genes are present (though people can use our free CEGMA tool!).

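For readers who have not met N50 before, here is a minimal sketch of the usual definition: sort the scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name `n50` and the example lengths are hypothetical, and this is not how CEGMA works; it only shows why N50 is the easy number to compute.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical scaffold lengths, not taken from any real assembly
scaffolds = [100, 80, 50, 30, 20, 10, 10]
print(n50(scaffolds))
```

Nothing in this calculation looks at what the scaffolds contain, which is why an assembly can gain N50 between versions while losing core genes.
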
S. cerevisiae
Genome sequence changes in Feb 2011 altered 194 protein sequences.
Last correction to gene structure due to mis-annotation was in Jan 2010
So just 13 years to produce a stable gene set!

Even in a simpler genome, the work of annotation goes on. Bear in mind that many model organism databases split genes into different categories based on evidence.

‘Finished’ eukaryotic genome sequences are not finished!
except maybe yeast

Not that this matters necessarily. 1% of a genome is better than no genome at all. At some level, the law of diminishing returns sets in. Ideally, we could produce a metric of ‘useful papers published per person-hour of database curator working on model organism database’. Just be aware that the genome you download today may change in future, and your results might not always be easily reproducible by someone using a different version.

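One hedged suggestion for the reproducibility point: record a checksum of the sequence you actually analysed alongside your results, so that someone else can tell whether they are working from the same version. The sketch below uses Python's standard `hashlib`; the function name `sequence_checksum` and the two toy 'releases' are invented for illustration. It ignores header lines and line wrapping, and it does not distinguish where one record ends and the next begins.

```python
import hashlib


def sequence_checksum(fasta_text):
    """SHA-256 of the sequence only (headers and line wrapping ignored)."""
    digest = hashlib.sha256()
    for line in fasta_text.splitlines():
        line = line.strip()
        if line and not line.startswith(">"):
            digest.update(line.upper().encode("ascii"))
    return digest.hexdigest()


# Two hypothetical 'releases' of the same region; the second gains 2 bases
release_a = ">chrX\nACGTACGT\n"
release_b = ">chrX\nACGTACGTTT\n"
print(sequence_checksum(release_a) == sequence_checksum(release_b))  # False
```

Quoting the database release name or number is still the first thing to do; a checksum just makes silent differences detectable.
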
CAP criteria
1) Complete
2) Accessible
3) Permanent
Sydney Brenner

Clearly they are not all complete. As for accessibility, it is not always easy to get hold of large datasets. Bandwidth represents a particular problem (it can be almost impossible to download GenBank from east coast to west coast using FTP). Also, online journals often end up breaking links to supplemental material. For the most part, they are permanent. The raw, unassembled read data, however, is not always kept.