When is a genome finished?

Keith Bradnam

These slides and notes are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.

A talk given to the UC Davis Bits & Bites club, based on an earlier lecture I had given at UC Davis.

Keith Bradnam, March 2011

Part 1 - the sequence

We can think of ‘genome completion’ as referring to the sequence and/or the set of gene
annotations. Let’s start with the sequence.

A brief history of genomics
1971 - Wu & Taylor determine the first ever DNA sequence (all 12 bp of it!)

1977 - Sanger et al. sequence the first ever (DNA-based) virus genome - 5,375 bp

1995 - First complete bacterial genome sequence (Haemophilus influenzae) - 1.83 Mb

1996 - First complete eukaryotic genome (Saccharomyces cerevisiae) - 12 Mb

1998 - First animal genome (Caenorhabditis elegans) - 100 Mb

It took 18 years after we learned the structure of DNA before anyone could sequence it. The first DNA sequence was from the end of
bacteriophage lambda (written up in a 20-page paper). The first genome was actually an RNA viral genome, determined in 1975 by Fiers
et al. The 1980s and 1990s saw the start of widespread DNA sequencing for genes of interest in species of interest. Moving to
eukaryotic genome sequencing means determining multiple chromosomes and tackling bigger repeats (more assembly problems).

genomesonline.org
[Bar chart: number of genome projects by status. Complete: 3,077; Incomplete: 7,732. Series: Bacteria, Archaea, Eukaryotes]

Genomesonline.org tries to track all of the major genome projects out there. A lot of them
are flagged as incomplete, and maybe some of those will never reach ‘completion’ status.

CAP criteria
1) Complete
2) Accessible
3) Permanent

Sydney Brenner
The great biologist may have won a Nobel prize for his work on development, he may have
postulated the very existence of mRNA, and he may have co-discovered the triplet code ...
but he also came up with the CAP criteria.

These criteria could pertain to any large-scale academic project, but they were conceived with
reference to genome sequencing projects.

Homo sapiens

2000 - ‘working draft’ announced

2001 - ‘working draft’ published

2003 - ‘Finished’ version announced

2006 - Last chromosome finished

So it’s finished now, right?

Ns make up ~9% of current genome

The human genome has been finished on several different dates, depending on how you define
‘finished’. Ns (unknown bases) still account for ~9% of the 3.1 Gbp genome.
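
One way to check a figure like this for any genome you download is to count the Ns directly. The following is a minimal sketch, assuming the genome is in plain FASTA format; the function name `n_fraction` and the toy sequence are made up for illustration and are not taken from any genome project.

```python
from collections import Counter
import io


def n_fraction(fasta_handle):
    """Return the fraction of bases that are N (upper or lower case) in a FASTA stream."""
    counts = Counter()
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip header lines
        counts.update(line.strip().upper())
    total = sum(counts.values())
    return counts["N"] / total if total else 0.0


# Toy example (hypothetical sequence): 6 Ns out of 14 bases
toy = io.StringIO(">chr_example\nACGTNNNNAC\nGTNN\n")
print(f"{n_fraction(toy):.1%}")
```

For a real genome you would pass an open file handle instead of the toy `StringIO` object.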

Drosophila melanogaster

2000 - genome published

~175 Mb genome
Ns make up ~4% of current genome

Drosophila has a much smaller genome, but a third of it is the harder-to-sequence
heterochromatin. This was the subject of a separate genome project that
didn’t finish until 2007.

The genome still has many Ns.

Arabidopsis thaliana

Published 2000

115 Mb sequenced,
125 Mb genome

As of 2007...
119 Mb sequenced,
157 Mb genome

As of 2012...
119 Mb sequenced,
135 Mb genome
Ns make up ~0.2% of current genome

Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was
sequenced, the estimate of its size had to be revised. So between 2000 and 2007 more sequence was produced,
but paradoxically the genome became less complete.

This illustrates the difficulty of estimating genome size. The latest figures suggest that the genome is smaller again. Note
that much of this missing genome is not present as Ns in the sequence you download. But the part you can download
still contains unknown bases.
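
The paradox is easier to see as arithmetic. This short sketch uses only the sequenced and estimated sizes from the slide; the helper name `percent_sequenced` is made up for illustration.

```python
# (year, Mb sequenced, estimated genome size in Mb), as given on the slide
snapshots = [
    (2000, 115, 125),
    (2007, 119, 157),
    (2012, 119, 135),
]


def percent_sequenced(sequenced_mb, estimated_mb):
    """Percentage of the estimated genome that has been sequenced."""
    return 100 * sequenced_mb / estimated_mb


for year, sequenced, estimated in snapshots:
    pct = percent_sequenced(sequenced, estimated)
    print(f"{year}: {sequenced} of {estimated} Mb = {pct:.1f}%")
```

This gives 92.0% for 2000, 75.8% for 2007 and 88.1% for 2012: the apparent completeness depends as much on the size estimate as on the amount of sequence.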

Caenorhabditis elegans

1998 - ‘finished’ genome published

100 Mb genome (revised from 97 Mb)

2002 - last gap closed

Genome information for species such as C. elegans is curated by model organism databases (MODs), which ensure that
the work goes on long after the initial publication announcing a ‘finished’ genome.

Genome size was revised from 97 Mb to 100 Mb not long after publication.

Where’s my gene???

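[Figure: one end of C. elegans chromosome X, with regions labeled by the year they were finished: 2002, 2001, 2000, 1997]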

People will often know from traditional genetic experiments that their gene of interest is definitely present in a genome...however, it
might not be present in the published genome sequence. The figure shows when the regions at one end of chromosome X of C.
elegans were finished. The last 20 kbp region wasn’t finished until four years after the genome was published in 1998. This region
contained predicted genes...maybe scientists were working on these genes while waiting for the sequence.


Caenorhabditis elegans

1998 - ‘finished’ genome published

100 Mb genome (revised from 97 Mb)

2002 - last gap closed

2004 - last N removed

Unlike the previous genomes, C. elegans has no Ns (but this took 6 years after publication to achieve).

Worm genome progress

[Line chart: genome size (bp), 0 to 100,000,000, against date, Jan-91 to May-06]

At a gross level, it looks like the worm genome did not change much after the year 2000....

Worm genome progress
[Line chart: genome size (bp), 100,220,000 to 100,280,000, against date, Sep-01 to Oct-05. Annotation: 66 nt added, May 2010]

Zooming in on the years 2001–2005 shows that lots of sequence changes were still happening. The last change on this graph represents a
very small addition of 66 bp to the genome. Maybe this change will not make any difference to anyone in the world, but it still makes
the genome sequence more accurate and closer to the biological truth.

Not many genome projects are this devoted!

Saccharomyces cerevisiae

Published 1997

12 Mb genome

No gaps, no Ns


1,653 genome changes made since 1997

Last change made in February 2011

Like C. elegans, yeast is a species which benefits from coordinated efforts to finish the genome.

In February 2011, the yeast genome sequence underwent corrections that affected 194 proteins. This happened in a – by
today’s standards – tiny genome which has been studied and curated for 15 years! What hope for larger, more complex
genomes?

Part 2 - annotations

Maybe you don’t care about the state of the genome, as long as you have all of the genes
present.

C. elegans annotations
[Line chart: numbers of genes and proteins (y-axis 19,000 to 25,000) by year: 1998 (genome publication), then 2003 to 2011]
Since publication, the number of protein-coding loci in C. elegans has risen by about 1,500
genes. But the number of proteins that might arise from alternatively spliced products is
much, much higher and shows no signs of slowing down.

C. elegans annotations
[Line chart: numbers of genes, proteins and RNA genes (y-axis 0 to 25,000) by year: 1998 (genome publication), then 2003 to 2011]
When we consider RNA genes, it is surprising that there are now more RNA genes than
protein-coding genes. How many more species have similar secrets in their genomes that
have yet to be discovered, mostly because of our historical focus on protein-coding genes?

Core genes
You can identify ‘core’ genes that are highly conserved
and that should be present in all species

Our group identified a set of 458 core genes from 6
reference genomes:
Homo sapiens
Drosophila melanogaster
Caenorhabditis elegans
Arabidopsis thaliana
Saccharomyces cerevisiae
Schizosaccharomyces pombe

We can then test whether these are all present in any
‘finished’ genome.
Our lab developed a set of 458 ‘core genes’ that we believe should be present in every
(complete) eukaryotic genome.

In the past we’ve discovered that many published genomes are missing some of these genes
from the genome sequence, even though they should be there. For example, the chicken genome is missing core
genes even though those genes are represented by chicken EST sequences.

Ciona intestinalis

Version    N50          Core genes
v1.95      234,500      444
v2.0       2,571,800    425

Sometimes genomes get updates and assemblies are given a new version number. This might
be associated with an increase in average scaffold size, but sometimes the number of core
genes gets reduced.
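
To put the two Ciona assemblies side by side, here is a small sketch that expresses the core gene counts as a fraction of the 458-gene set. The numbers come from the slide; the dictionary layout and the name `core_fraction` are just illustrative choices.

```python
CORE_SET_SIZE = 458  # size of the core gene set described above

# Values from the Ciona intestinalis slide
assemblies = {
    "v1.95": {"n50": 234_500, "core_genes": 444},
    "v2.0": {"n50": 2_571_800, "core_genes": 425},
}


def core_fraction(found, total=CORE_SET_SIZE):
    """Fraction of the core gene set that was found in an assembly."""
    return found / total


for version, stats in assemblies.items():
    pct = 100 * core_fraction(stats["core_genes"])
    print(f"{version}: N50 = {stats['n50']:,}, core genes = {pct:.1f}%")

n50_fold_change = assemblies["v2.0"]["n50"] / assemblies["v1.95"]["n50"]
core_genes_lost = assemblies["v1.95"]["core_genes"] - assemblies["v2.0"]["core_genes"]
print(f"N50 fold change: {n50_fold_change:.1f}x, core genes lost: {core_genes_lost}")
```

The N50 rises roughly 11-fold between versions, while core gene coverage falls from about 96.9% to 92.8% (19 fewer core genes).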

Caenorhabditis sp. PS1010

Version    N50       Core genes
v4         9,446     454
v5         64,074    428

It is easy to measure things like N50, but harder to measure which genes are
present (though people can use our free CEGMA tool!)
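
For reference, N50 really is simple to compute, which is part of why it gets reported so often. Below is a minimal sketch using the common definition (the length of the sequence at which the running total, going from longest to shortest, first reaches half of the assembly size). The scaffold lengths are invented toy values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and return the length at which
    the running total first reaches half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical): total is 290, half is 145
scaffolds = [80, 70, 50, 40, 30, 20]
print(n50(scaffolds))  # 80 + 70 = 150 >= 145, so N50 is 70
```

Note that nothing in this calculation looks at the sequence content, so a higher N50 says nothing about whether the genes are there.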

S. cerevisiae
Genome sequence changes in Feb 2011
altered 194 protein sequences.

Last correction to gene structure due to mis-annotation
was in Jan 2010

So just 13 years to produce a
stable gene set!

Even in a simpler genome, the work of annotation goes on.

Bear in mind that many model organism databases split genes into different categories
based on evidence.

‘Finished’ eukaryotic genome
sequences are not finished!

except maybe yeast

Not that this matters necessarily. 1% of a genome is better than no genome at all. At some
level, the law of diminishing returns sets in. Ideally, we could produce a metric of ‘useful
papers published per person-hour of database curator working on model organism
database’.

Just be aware that the genome you download today may change in future and your results
might not always be easily reproducible by someone using a different version.
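
One simple habit that helps with this is to record exactly which file you analysed. The sketch below is one possible approach, not something from the original talk: it computes a SHA-256 checksum of a genome file so it can be stored alongside your results. The file contents and the release label are hypothetical placeholders.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to handle large genomes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Toy stand-in for a downloaded genome file (hypothetical contents)
with tempfile.NamedTemporaryFile(suffix=".fa", delete=False) as tmp:
    tmp.write(b">toy\nACGT\n")
    toy_path = tmp.name

# A record you might keep with your analysis; the release label is a made-up example
record = {
    "file": os.path.basename(toy_path),
    "release": "hypothetical-release-1",
    "sha256": sha256_of_file(toy_path),
}
print(record)
os.remove(toy_path)
```

Anyone repeating the analysis can then compare checksums to confirm that they are working from the same version of the sequence.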

CAP criteria
1) Complete
2) Accessible
3) Permanent

Sydney Brenner
Clearly they are not all complete.

As for accessibility, it is not always easy to get hold of large datasets. Bandwidth represents a
particular problem (it can be almost impossible to download GenBank from east coast to
west coast using FTP). Also, online journals often end up breaking links to
supplemental material.

For the most part, they are permanent, though the raw, unassembled read data are not always kept.
