Genome Assembly:
the art of trying to make one
BIG thing from millions of
very small things
Keith Bradnam
@kbradnam
Image from Wellcome Trust
This was a talk given at UC Davis on 2015-01-28, presented to an audience of
graduate students.
Author: Keith Bradnam, Genome Center, UC Davis
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
flickr.com/incrediblehow/
Overview
1. What is genome assembly?
2. Why is it difficult?
3. Why is it important?
4. How do we know if an assembly is any good?
flickr.com/incrediblehow/
What is genome assembly?
A genome assembly is an attempt to accurately
represent an entire genome sequence from a
large set of very short DNA sequences.
Using a piece of bioinformatics software is just like running an experiment. Just
because you get an answer, it doesn't mean it will be the right answer. You should
always be prepared to tweak some parameters and re-run the experiment.
The ideal goal would be to end up with complete sequences for each chromosome at
each level of ploidy. E.g. diploid genomes would be assembled as two sets of
genome sequences.
'Large' is a relative term. We would expect that advances in sequencing technology
would mean that the number of sequences needed to assemble a genome is only
ever going to decrease.
'Short' is also a relative term. As technology improves, we expect to see our input
sequences get longer and longer until the steps of sequencing and assembly
essentially merge into one process.
It's a bit like trying to do the hardest
jigsaw puzzle you can imagine!
This is a jigsaw that I did for the benefit of your education! There are lots of
analogies that can be made between assembling genomes, and assembling jigsaws.
Sometimes we assemble regions of jigsaws that are locally accurate, but globally
misplaced (the top region circled in red). Sometimes we also assemble regions and
leave them to one side as we don't know where they should go. Many 'finished'
genome assemblies include sets of 'unanchored' sequences that are not positioned
on any chromosome.
Let's keep working on our jigsaw.
The hardest parts of a jigsaw tend to be repetitive regions (skies, sea, forests etc.).
The same is true for genome assemblies.
Sometimes we can use information to pair together two different completed
sections of a jigsaw. In this case, we can use our understanding of what a bridge
looks like to give us an approximate spacing between the two completed sections at
the top of this puzzle. We do similar things with genome assemblies and also end up
inserting approximately sized gaps between regions of sequence.
Is this good enough?
For a jigsaw, we would never ever call this 'finished', but for a genome assembly this
would represent an almost perfect sequence! All of the main details are present, you
can identify what the picture is showing (San Francisco), the edges are detailed
enough that we can accurately calculate the size of the jigsaw, and the parts that are
missing are mostly minor details.
Jigsaws often end up with a few missing pieces meaning that it is impossible to
complete the puzzle. Genome assemblies also end up with missing pieces because
they were never in the input set of sequences to begin with. This is because not all
sequencing technologies capture all locations in a genome.
With the exception of bacterial genomes, we never reach this point with genome
assembly. All published eukaryotic genomes are incomplete and contain errors.
Maybe yeast (Saccharomyces cerevisiae) and worm (Caenorhabditis elegans) are
the best examples we have of a near-complete reference genome for a eukaryotic
species.
flickr.com/incrediblehow/
Why is it difficult?
World's largest jigsaw puzzle
• Made by University of
Economics of Ho Chi Minh City
• 551,232 pieces
• 15 x 23 meters
The world's largest jigsaw has nothing on the world's largest genome assembly…
World's largest assembled genome
• Loblolly pine (Pinus taeda)
• 22 Gbp genome!
• ~80% repetitive
• 64x coverage
from tulsalandscape.com
This gargantuan effort featured the work of many people at UC Davis, led by the
efforts of David Neale's group.
What does 64x coverage mean?
Over 1.4 trillion bp of DNA were sequenced!
I.e. they had to sequence 64 times as much DNA as they ended up with in the final
output. Imagine if baking a cake was like this, and you had to use 64 times as many
ingredients in order to make one cake.
Some genome assembly projects are done with >100x coverage.
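The coverage arithmetic can be sketched in a few lines. This is only an illustration using the numbers on the slide (22 Gbp genome, 64x coverage); the function names are made up for this example.

```python
def bases_required(genome_size_bp: int, coverage: int) -> int:
    """Total bases to sequence for a given average depth of coverage."""
    return genome_size_bp * coverage


def average_coverage(total_bases_sequenced: int, genome_size_bp: int) -> float:
    """Average coverage: sequenced bases divided by genome size."""
    return total_bases_sequenced / genome_size_bp


LOBLOLLY_PINE_BP = 22_000_000_000  # 22 Gbp, from the slide

total = bases_required(LOBLOLLY_PINE_BP, 64)
print(f"{total:,} bp sequenced")
print(f"{average_coverage(total, LOBLOLLY_PINE_BP):.0f}x coverage")
```

22 Gbp at 64x works out to a little over 1.4 trillion bp, which matches the figure quoted on the slide.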
Biological challenges
for genome assembly
Problem                    Description
Repeats                    Many plant and animal genomes mostly consist of
                           repetitive sequences, some of which are longer than
                           length of sequencing reads.
Ploidy                     For many species, you have at least two copies of the
                           genome present. Level of heterozygosity is important.
Lack of reference genome   Reference-assisted assembly is a much easier problem
                           than de novo assembly. Even having genome from a
                           closely related species can help.
Ploidy is often a much bigger problem for plant genomes. E.g. some wheat species
are hexaploid. Genome assembly is sometimes performed on a genome for which
we already have a reference (e.g. if you sequenced your own genome, you could
align it to the human reference sequence). Otherwise, we are talking about de novo
assembly which is much, much harder.
from amazon.com
Returning to the jigsaw analogy…every jigsaw puzzle comes with a picture of the
puzzle on the box. This is a luxury not always available to genome assemblers.
When we are doing de novo assembly, it is a bit like doing a jigsaw without knowing
what it will look like.
Even with de novo assembly, we may have a distant relative with a known genome
sequence that can help with the assembly. A bit like assembling a jigsaw using a
blurred picture as a guide.
Jigsaws tell you how many pieces are in the puzzle (and what the dimensions of the
puzzle will be). We don't always know this for genome assembly. There are
measures for determining how big a genome might be, but these methods can
sometimes be misleading.
Other challenges
for genome assembly
Problem              Description
Cost                 In 2014 Illumina claimed the $1,000 genome barrier had
                     been broken (if you first spend ~$10 million on hardware).
Library prep         A critical, and often overlooked, step in the process.
Sequence diversity   Illumina, 454, Ion Torrent, PacBio, Oxford Nanopore: which
                     mix of sequence data will you be using?
Hardware             Some genome assemblers have very high CPU/RAM
                     requirements. Might need specialized cluster.
Expertise            Not always easy to even get assembly software installed,
                     let alone understand how to run it properly.
Software             There is a lot of choice out there.
The PRICE genome
assembler has 52
command-line options!!!
This is probably not the most complex, nor the most simple, genome assembler that
is out there. But how much time do you have to explore some of those 52
parameters that could affect the resulting genome assembly?
There are over 125 different tools available to
help assemble a genome!
Not all of these are comprehensive genome assemblers, some are tools to help with
specific aspects of the assembly process, or to help evaluate genome assemblies etc.
Still, this represents a bewildering amount of choice.
These six assembly tools were published in one month in 2014!
Before you assemble…
• You should remove adapter contamination
• You should remove sequence contamination
• You should trim sequences for low quality regions
After we have generated the raw sequence data, we still must run a few basic steps
to clean up our data prior to assembly. How straightforward are these steps?
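To give a flavor of the third step (trimming low quality regions), here is a minimal sketch. It assumes FASTQ-style quality strings in Phred+33 encoding and a made-up threshold of Q20, and it only trims the 3' end. Real trimming tools typically do more (sliding windows, adapter matching), so treat this as an illustration rather than a replacement for them.

```python
def trim_low_quality_tail(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Drop bases from the 3' end until the last base meets the threshold.

    seq    -- the read sequence
    qual   -- the quality string, one character per base (assumed Phred+33)
    min_q  -- hypothetical minimum Phred score to keep a trailing base
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 in Phred+33
trimmed_seq, trimmed_qual = trim_low_quality_tail("ACGTACGT", "IIIII#!!")
print(trimmed_seq, trimmed_qual)
```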
Tools for removing adapter
contamination from sequences
There are at least 34 different tools!
One of these tools has 27 different
command-line options
Even the first step of removing adapter contamination is something for which you
could spend a lot of time researching different software choices.
flickr.com/incrediblehow/
Why is it important?
Saccharomyces cerevisiae
• 12 Mbp genome
• Published in 1997
• First eukaryotic genome sequence
Not the first published genome — there were several bacterial genomes sequenced
in the preceding couple of years — but this was the first eukaryotic genome
sequence. Furthermore, this genome sequence has undergone continual
improvements and corrections since publication (the last set of changes were in
2011).
Caenorhabditis elegans
• ~100 Mbp genome
• Published in 1998
• First animal genome sequence
Arabidopsis thaliana
• First plant genome sequence
• Published in 2000
• Size?
• 2000 = 125 Mbp
• 2007 = 157 Mbp
• 2012 = 135 Mbp
As alluded to earlier, we don't always know for sure how big (or small) a genome is.
The Arabidopsis genome size has been corrected upwards and downwards since
publication. The amount of sequenced information as of today is about 119 Mbp.
And this is for the best understood plant genome that we know about!
Homo sapiens
• ~3 Gbp genome
• Finished?
• 'working draft' announced in 2000
• 'working draft' published in 2001
• completion announced in 2003
• complete sequence published in 2004
The human genome has also undergone improvements since the (many)
announcements regarding its completion (or near completion). There are only a
small number of species for which there is a dedicated group of people who seek to
continually improve the genome sequence and get closer to 'the truth'.
The 100,000 genomes project
There are lots of ongoing
genome sequencing projects
i5k Insect and other Arthropod
Genome Sequencing Initiative
Bigger numbers must be better, right? Some projects sequence genomes to align
back to a reference to look for the differences, others seek to characterize genomes
for which we have very little genomic information. The 100,000 genomes project in
England heralds the start of the mass sequencing of patients to understand disease.
We no longer have one
genome per species
• We have genome sequences representing different
strains and varieties of a species
• We have multiple genomes from different tissues of
the same individual (e.g. cancer genomes)
• We potentially will have genomes from different
time points or life stages of an individual
Imagine having your genome sequenced at birth from several different tissues and
getting 'genome health checks' throughout your life.
There is no point sequencing so many genomes
if we can't accurately assemble them!
Sequencing genomes is relatively easy. Putting that information together in a
meaningful way so as to make it useful to others…that's not so easy.
Bad genome assemblies #1
Length of 10 shortest sequences:
100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp!
The average
vertebrate gene is
about 25,000 bp
Everyone wants long sequences in a genome assembly. This may not always matter,
but in most cases they should hopefully be long enough to contain at least one gene.
These data are from a vertebrate genome sequence that someone asked me to look
at. Over half of the genome assembly was represented by sequences less than 150
bp! This is not much use to anyone.
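This kind of check is easy to script. The sketch below assumes you already have a list of sequence lengths (in bp) and uses the 150 bp cutoff mentioned above; the function name and the example lengths are made up for illustration.

```python
def fraction_in_short_sequences(lengths, max_len=150):
    """Fraction of the total assembly size held in sequences shorter than max_len."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    short = sum(n for n in lengths if n < max_len)
    return short / total


# Hypothetical toy assembly: two tiny fragments and one longer sequence
toy_lengths = [100, 50, 850]
print(f"{fraction_in_short_sequences(toy_lengths):.0%} of the assembly is in sequences under 150 bp")
```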
Bad genome assemblies #2
Ns = 90.6% !!!
Genome sequences
usually contain
unknown bases (Ns)
From another assembly that I was asked to look at. Even the 9% of the genome
which wasn't an 'N' was split into tiny little fragments. Completely unusable
information.
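Measuring the proportion of Ns takes only a few lines. This sketch assumes the assembly sequences are already loaded as plain strings (FASTA parsing is left out), and the function name is hypothetical.

```python
def percent_n(sequences):
    """Percentage of all bases in the assembly that are N (case-insensitive)."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    n_count = sum(s.upper().count("N") for s in sequences)
    return 100.0 * n_count / total


# Toy example: 16 of the 20 bases are unknown
print(f"Ns = {percent_n(['ACGTNNNNNN', 'nnnnnnnnnn']):.1f}%")
```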
Has anyone compared different assemblers to
work out which is the best?
I was wondering whether you would ask this…
A genome assembly competition
This was a genome assembly assessment exercise that I was involved with.
@assemblathon
It spawned a sequel.
Published in
Gigascience, 2013
3 species
21 teams
43 assemblies
52 Gbp of sequence!
Goals
• Assess 'quality' of genome assemblies
• Identify the best assemblers
• First need to define quality!
Who makes the best pizza in Davis?
An easy question to ask, but maybe not as straightforward as it seems…
Who makes the best pizza in Davis?
Freshest?
Cheapest?
Biggest?
Gluten free?
Healthiest?
Choice of toppings?
Delivery time?
Tastiest?
'Best' is subjective. If you are intolerant to gluten, then the best pizza place will be
the one that makes gluten-free pizzas.
Even if you focus on who makes the best 'tasting' pizzas, this is still very subjective.
Who makes the best genome assembly?
Image from flickr.com/dullhunk/
But surely this is not such a subjective topic when it comes to genome assembly?
Who makes the best genome assembly?
Longest contigs?
Fewest errors?
Lowest CPU demands?
Best deals with repeats?
Contains most genes?
Fastest?
Best resolves heterozygosity?
Easiest to install?
Longest scaffolds?
Image from flickr.com/dullhunk/
It is less subjective, but there are still many different ways we can think of when
trying to determine what makes a good genome assembly.
And the winner is…
• No winner!
• Some assemblers seemed to work well for one
species, but not for other species
• Some assemblies were good, as measured by one
metric, but not when measured by others
This result was disappointing to many who were hoping that we would provide a
resounding endorsement for assembler 'X'.
flickr.com/incrediblehow/
How do we know if an
assembly is any good?
Read
The fundamental input to a genome assembly is a set of sequencing reads.
Technology            Date         Typical read lengths
Sanger                ~1970–2000   750–1,000 bp
Solexa/Illumina       ~2005        ~25 bp
Illumina              ~2014        ~150–250 bp
Pacific Biosciences   ~2014        10–15 Kbp
Oxford Nanopore       ~2014        5–??? Kbp
Different technologies produce reads with very different length distributions, and
these technologies also increase the length of reads over time. Perhaps more
importantly, different technologies have different error profiles (where errors occur
in reads and types of error).
Read
Read pair
Insert size is known (approximately)
Typically, we work with pairs of reads separated by a short distance (< 1,000 bp) or
even overlapping. The insert size is not exact but can be modeled by a distribution
of sizes.
Mate pair (jumping pair)
Much larger insert size
Mate pairs are produced using a different preparation method and can be separated
by several thousand bp. These become very useful in genome assembly.
Should be able to make one contiguous sequence
from overlapping paired reads
Contig
For some sequencing technologies with long reads, you can simply see if there are
enough overlapping reads such that you can form a contiguous sequence, or contig.
For short read technologies such as Illumina, different mathematical approaches
are used to form contigs (e.g. De Bruijn graph approaches).
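As a rough illustration of the De Bruijn graph idea: each read is broken into overlapping k-mers, and every k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The sketch below only builds that graph. It is not how any particular assembler is implemented, and it ignores sequencing errors, reverse complements and the graph traversal needed to actually produce contigs. The function name and the choice of k are assumptions made for this example.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the list of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads; repeated k-mers appear as repeated edges
for prefix, suffixes in de_bruijn_edges(["ACGTT", "CGTTA"], k=3).items():
    print(prefix, "->", suffixes)
```

Reads shorter than k simply contribute no edges, which is one reason the choice of k matters for short-read data.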
Use mate pair information to link contigs
as part of a scaffolding process
Scaffold
Hopefully, you will have some mate pairs where one read from the pair matches one
contig, and the other matches another contig. You can then create a scaffold
sequence which spans the two contigs.
Use mate pair information to link contigs
as part of a scaffolding process
Scaffold
NNNNNNNNNNNNNN
The unknown region between contigs is replaced with Ns to represent unknown
bases. The lengths of these regions are sometimes approximations.
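The final step of writing out a scaffold can be sketched very simply: join the ordered contigs with runs of Ns whose lengths are the estimated gap sizes. This assumes the hard part (ordering and orienting contigs and estimating gaps from mate pair data) has already been done. The function name and inputs are hypothetical.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs, filling each estimated gap with that many Ns."""
    if not contigs:
        raise ValueError("need at least one contig")
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append("N" * gap)  # unknown bases between neighbouring contigs
        parts.append(contig)
    return "".join(parts)


print(build_scaffold(["ACGT", "TTGA"], [5]))
```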
Making contigs is a different process
to making scaffolds
Some assemblers do a better job at making contigs than they do at combining those
contigs into scaffolds. Sometimes you can use different tools to do each step.
Assembly size = sum length of scaffolds
209 Mbp
Let's consider a fictional assembly with a few scaffolds and contigs. The first thing
we calculate is the assembly size. This is simply the sum length of all sequences
included in the assembly.
Mean scaffold length is rarely used as a metric
Most genome assemblies contain
a lot of very short contigs
At one extreme, an assembly could include every read that wasn't included in a
contig. More likely, you will end up with some very short contigs which may not be
useful. Contigs/scaffolds below a user-defined length threshold are often excluded
from assemblies. All of these short sequences lower the mean length.
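Both numbers are trivial to compute. The sketch below uses the scaffold lengths (in Mbp) of the fictional 209 Mbp assembly from this talk; the optional minimum-length argument is a hypothetical stand-in for the user-defined threshold mentioned above.

```python
# Scaffold lengths in Mbp for the fictional assembly used in this talk
SCAFFOLDS_MBP = [50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2]


def assembly_size(lengths, min_length=0):
    """Sum of all sequence lengths at or above an optional length threshold."""
    return sum(n for n in lengths if n >= min_length)


def mean_length(lengths, min_length=0):
    """Mean length of the sequences at or above an optional length threshold."""
    kept = [n for n in lengths if n >= min_length]
    if not kept:
        raise ValueError("no sequences left after filtering")
    return sum(kept) / len(kept)


print(f"Assembly size = {assembly_size(SCAFFOLDS_MBP)} Mbp")
print(f"Mean length = {mean_length(SCAFFOLDS_MBP):.1f} Mbp")
```

Six of the thirteen scaffolds are only 2 or 3 Mbp long, and they pull the mean down to about 16 Mbp even though most of the assembly sits in much longer scaffolds.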
N50 length
The length of the sequence which takes the sum length
of all sequences past 50% of the total assembly size
This is the most widely-used metric to assess genome
assembly quality…sometimes it is the only metric.
This was first described in the human genome paper. It has since been mentioned
in just about every paper that has ever described a new genome sequence.
Calculating N50
Assembly size = 209 Mbp
50
40
35
25
20
15
10
3
3
2
2
2
2
It is sometimes easier to see how N50 is calculated by showing an example. Let's
start with the longest scaffold and add the lengths to a running total. We want to
stop when we have seen >50% of the total assembly size (i.e. >104.5 Mbp).
Calculating N50
Running total = 50 Mbp (50)
Running total = 90 Mbp (50 + 40)
Running total = 125 Mbp (50 + 40 + 35)
N50 length = 35 Mbp
Assembly size = 209 Mbp
Mean length = 16 Mbp
After looking at three scaffolds we now know what the N50 scaffold length is. For
real assemblies this is typically much higher than the mean length.
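The walk-through above translates directly into code. This is a minimal sketch that follows the rule used on these slides (stop once the running total passes 50% of the assembly size). Be aware that some tools stop at "at least 50%" instead, which can give a different answer in edge cases.

```python
def n50(lengths):
    """Length of the sequence that takes the running total past half the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):  # longest first
        running += length
        if running > total / 2:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


scaffolds_mbp = [50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2]
print(f"N50 length = {n50(scaffolds_mbp)} Mbp")
```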
Different assembly of the same genome
50
40
35
25
20
15
Assembly size = 185 Mbp
N50 length = 40 Mbp
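Let's assume we tweaked the parameters of our assembly software to exclude the
shortest scaffolds. This makes a smaller assembly but increases the N50 length.
This means that it is possible to boost N50 simply by throwing away sequences.
(Here 50 + 40 = 90 Mbp falls just short of half of 185 Mbp, so the 40 Mbp figure
treats that as reaching the halfway point; under the strict >50% rule the N50
becomes 40 Mbp once the 15 Mbp scaffold is excluded as well.)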
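Here is a small sketch of how a minimum-length cutoff changes N50, using the strict >50% rule from the worked example and the same fictional scaffold lengths (in Mbp). The cutoff values are made up purely for illustration.

```python
def n50(lengths):
    """N50 using the 'running total passes 50% of assembly size' rule."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running > total / 2:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


def drop_short(lengths, min_length):
    """Keep only sequences at or above a (hypothetical) minimum length."""
    return [n for n in lengths if n >= min_length]


scaffolds_mbp = [50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2]

for cutoff in (0, 15, 20):
    kept = drop_short(scaffolds_mbp, cutoff)
    print(f"cutoff {cutoff:>2} Mbp: size = {sum(kept)} Mbp, N50 = {n50(kept)} Mbp")
```

No new sequence information is gained at any cutoff; the assembly only gets smaller while the reported N50 stays the same or goes up.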
NG50 length
Like N50, but rather than use assembly size in the
calculation, use known (or estimated) genome size
In the Assemblathon contests, we used a new measure which enables a fairer
comparison between different assemblies (of the same genome).
Assume genome size is 240 Mbp
                 Assembly 1   Assembly 2
Assembly size    209 Mbp      185 Mbp
N50 length       35 Mbp       40 Mbp
NG50 length      35 Mbp       35 Mbp
If we knew what the actual genome size was (e.g. 240 Mbp) we can calculate the
NG50 scaffold length and see that it is the same for both assemblies.
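NG50 is the same calculation with the threshold taken from the genome size rather than the assembly size. This is a minimal sketch using the 240 Mbp genome size assumed on the slide; returning None when an assembly covers no more than half the genome is a choice made for this example.

```python
def ng50(lengths, genome_size):
    """Like N50, but the running total must pass half the genome size."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running > genome_size / 2:
            return length
    return None  # the assembly never passes half of the genome size


GENOME_SIZE_MBP = 240  # assumed genome size from the slide

assembly_1 = [50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2]  # 209 Mbp
assembly_2 = [50, 40, 35, 25, 20, 15]                        # 185 Mbp

print(f"Assembly 1 NG50 = {ng50(assembly_1, GENOME_SIZE_MBP)} Mbp")
print(f"Assembly 2 NG50 = {ng50(assembly_2, GENOME_SIZE_MBP)} Mbp")
```

Because the threshold (120 Mbp here) no longer depends on what was kept in the assembly, discarding short scaffolds does not move it.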
NG50 length
Use NG50 when making comparisons between
genome assemblies because N50 can be biased
And be warned…some people obsess over N50!
flickr.com/incrediblehow/
Metrics
Metric                             Notes
Assembly size                      How does it compare to expected size?
Number of sequences                How fragmented is your assembly?
N50 length (contigs & scaffolds)   Making contigs and making scaffolds
                                   are two different skills.
NG50 scaffold length               Becoming more common to see this used.
Coverage                           How much of some reference sequence
                                   is present in your assembly?
Errors                             Errors in alignment of assembly to reference
                                   sequence or to input read data.
Number of genes                    From comparison to reference transcriptome
                                   and/or set of known genes
This is a very brief summary that lists just some of the ways in which you could
describe your genome assembly.
Assembly size
[Bar chart: Assemblathon 2 bird genome assemblies. Assembly size in bp (axis from 0 to 2,000,000,000) for entries A to M.]
In Assemblathon 2, one assembly of the bird genome (a parrot) was very, very
small. Conversely, one assembly was almost twice the size of the estimated genome
(~1.2 Gbp). Bigger is not always better when it comes to assembly size.
Using core genes
• All genomes perform some core functions
(transcription, replication, translation etc.)
• Proteins involved tend to be highly conserved
• They should be present in every genome
CEGMA
This was an approach developed by our lab, originally to find a handful of genes in a
newly sequenced genome which could be used to train a species-specific gene
finder. We then adapted the technique to assess the gene space of a draft genome.
What is CEGMA?
• CEGMA (Core Eukaryotic Gene Mapping Approach)
• defines a set of 248 'Core Eukaryotic Genes' (CEGs)
• CEGs identified from genomes of: S. cerevisiae, S. pombe,
A. thaliana, C. elegans, D. melanogaster, and H. sapiens
• How many full-length CEGs are present in an assembly?
We expect these 248 genes to be present in all eukaryotes. CEGMA uses a
combination of software tools to find these genes. The number of core genes
present is assumed to reflect the proportion of all genes that are present in the
assembly. Sometimes genes are split across contigs or scaffolds; CEGMA can find
some of these and reports them as partial matches.
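Turning core gene counts into a completeness figure is simple arithmetic. The sketch below is not CEGMA's own code or output format. It just assumes you have counted how many of the 248 CEGs were found as full-length matches and how many only as partial matches, and the example counts are invented.

```python
TOTAL_CEGS = 248  # size of the core gene set described above


def ceg_completeness(n_complete, n_partial=0, total=TOTAL_CEGS):
    """Return (% full-length, % full-length or partial) of the core gene set."""
    if n_complete < 0 or n_partial < 0 or n_complete + n_partial > total:
        raise ValueError("counts must be non-negative and sum to at most the set size")
    pct_complete = 100.0 * n_complete / total
    pct_any = 100.0 * (n_complete + n_partial) / total
    return pct_complete, pct_any


# Hypothetical counts for two assemblies of the same genome
for name, complete, partial in [("assembly A", 124, 62), ("assembly B", 217, 20)]:
    full, any_match = ceg_completeness(complete, partial)
    print(f"{name}: {full:.1f}% full-length, {any_match:.1f}% including partial matches")
```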
Here are N50 scaffold lengths and number of core genes present in a variety of
genomes that I have looked at. There is a lot of variation. Some assemblies might
give you longer sequences (higher N50 values), but this is no guarantee that those
assemblies will contain more gene sequences. Likewise, assemblies with more gene
sequences may not necessarily have longer sequences.
Should you use CEGMA?
• CEGMA is not easy to install
• It is old and somewhat out of date
• You could use other transcript/protein data sets
instead of CEGMA
The principle of CEGMA could be used with a variety of different data. Maybe there
are a small number of full-length mRNAs available for your species of interest. If
you have multiple genome assemblies, you could simply see how they differ with
respect to the presence of those genes.
Other tools for evaluating assemblies
FRCbam (2012) REAPR (2013) kPAL (2014)
Just as it seems increasingly popular to develop new genome assemblers, there is a
growing demand (and supply) for tools to evaluate genome assemblies. Here are
three recent ones.
flickr.com/incrediblehow/
Summary
In conclusion…
• Genome assembly is not a solved problem
• If possible, try different genome assemblers
• Don't rely on one metric to assess quality
• Different metrics assess different aspects of quality
• Look at your genome assembly!
The last point is worth repeating. Is your genome 91% N? Do you have 3 bp
sequences in your assembly? These are easy things to check.
And remember, all genome assemblies should be thought of as 'work in progress'!
Further resources
http://acgt.me
@assemblathon
I use the Assemblathon Twitter account to tweet links to papers and resources that
describe tools relevant to the field of genome assembly. Normally only a few tweets
a day. My ACGT blog contains some posts relating to genome assembly, and I try to
write these with more of a general audience in mind.

Genome Assembly: the art of trying to make one BIG thing from millions of very small things

  • 1. Genome Assembly: the art of trying to make one BIG thing from millions of very small things Keith Bradnam @kbradnam Image from Wellcome Trust
  • 2. Genome Assembly: the art of trying to make one BIG thing from millions of very small things Keith Bradnam @kbradnam Image from Wellcome Trust This was a talk given at UC Davis on 2015-01-28, presented to an audience of graduate students. Author: Keith Bradnam, Genome Center, UC Davis This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • 4. 1. What is genome assembly? 2. Why is it difficult? 3. Why is it important? 4. How do we know if an assembly is any good?
  • 6. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences.
  • 7. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences.
  • 8. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences. Using a piece of bioinformatics software is just like running an experiment. Just because you get an answer, it doesn't mean it will be the right answer. You should always be prepared to tweak some parameters and re-run the experiment.
  • 9. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences.
  • 10. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences. The ideal goal would be to end up with complete sequences for each chromosome at each level of ploidy. E.g. diploid genomes would be assembled as two sets of genome sequences.
  • 11. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences.
  • 12. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences. 'Large' is a relative term. We would expect that advances in sequencing technology would mean that the number of sequences needed to assemble a genome is only ever going to decrease.
  • 13. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences.
  • 14. A genome assembly is an attempt to accurately represent an entire genome sequence from a large set of very short DNA sequences. 'Short' is also a relative term. As technology improves, we expect to see our input sequences get longer and longer until the steps of sequencing and assembly essentially merge into one process.
  • 15. It's a bit like trying to do the hardest jigsaw puzzle you can imagine!
  • 16.
  • 17. This is a jigsaw that I did for the benefit of your education! There are lots of analogies that can be made between assembling genomes, and assembling jigsaws.
  • 18.
  • 19. Sometimes we assemble regions of jigsaws that are locally accurate, but globally misplaced (the top region circled in red). Sometimes we also assemble regions and leave them to one side as we don't know where they should go. Many 'finished' genome assemblies include sets of 'unanchored' sequences that are not positioned on any chromosome.
  • 20.
  • 21. Let's keep working on our jigsaw.
  • 22.
  • 23. The hardest parts of a jigsaw tend to be repetitive regions (skies, sea, forests etc.). The same is true for genome assemblies.
  • 24.
  • 25. Sometimes we can use information to pair together two different completed sections of a jigsaw. In this case, we can use our understanding of what a bridge looks like to give us an approximate spacing between the two completed sections at the top of this puzzle. We do similar things with genome assemblies and also end up inserting approximately sized gaps between regions of sequence.
  • 26.
  • 27.
  • 28.
  • 29. Is this good enough? For a jigsaw, we would never ever call this 'finished', but for a genome assembly this would represent an almost perfect sequence! All of the main details are present, you can identify what the picture is showing (San Francisco), the edges are detailed enough that we can accurately calculate the size of the jigsaw, and the parts that are missing are mostly minor details.
  • 30.
  • 31. Jigsaws often end up with a few missing pieces meaning that it is impossible to complete the puzzle. Genome assemblies also end up with missing pieces because they were never in the input set of sequences to begin with. This is because not all sequencing technologies capture all locations in a genome.
  • 32.
  • 33. With the exception of bacterial genomes, we never reach this point with genome assembly. All published eukaryotic genomes are incomplete and contain errors. Maybe yeast (Saccharomyces cerevisiae) and worm (Caenorhabditis elegans) are the best examples we have a of near-complete reference genome for a eukaryotic species.
  • 35. World's largest jigsaw puzzle • Made by University of Economics of Ho Chi Minh City • 551,232 pieces • 15 x 23 meters
  • 36. World's largest jigsaw puzzle • Made by University of Economics of Ho Chi Minh City • 551,232 pieces • 15 x 23 meters The world's largest jigsaw has nothing on the world's largest genome assembly…
  • 39. World's largest assembled genome • Loblolly pine (Pinus taeda) • 22 Gbp genome! • ~80% repetitive • 64x coverage • Image from tulsalandscape.com This gargantuan effort featured the work of many people at UC Davis, led by the efforts of David Neale's group.
  • 41. What does 64x coverage mean? Over 1.4 trillion bp of DNA were sequenced! I.e. they had to sequence 64 times as much DNA as ended up in the final output. Imagine if baking a cake were like this, and you had to use 64 times as many ingredients to make one cake. Some genome assembly projects are done with >100x coverage.
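The coverage arithmetic on this slide can be checked in a few lines. This is only a sketch: the genome size and coverage come from the slides, while the function names and the read-count example are illustrative assumptions.

```python
# Figures from the slides: a 22 Gbp genome sequenced to 64x coverage.
GENOME_SIZE_BP = 22_000_000_000
COVERAGE = 64


def total_bases_needed(genome_size_bp, coverage):
    """Total sequenced bases required to reach a given average coverage."""
    return genome_size_bp * coverage


def coverage_from_reads(n_reads, mean_read_length_bp, genome_size_bp):
    """Average coverage implied by a read count and mean read length."""
    return n_reads * mean_read_length_bp / genome_size_bp


total = total_bases_needed(GENOME_SIZE_BP, COVERAGE)
print(f"{total:,} bp")  # 1,408,000,000,000 bp, i.e. just over 1.4 trillion
```

Coverage here is an average; real projects see some regions covered far more deeply than others, and some not at all.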
  • 43. Biological challenges for genome assembly • Repeats: many plant and animal genomes consist mostly of repetitive sequences, some of which are longer than the length of sequencing reads. • Ploidy: for many species, at least two copies of the genome are present. The level of heterozygosity is important. • Lack of reference genome: reference-assisted assembly is a much easier problem than de novo assembly. Even having a genome from a closely related species can help. Ploidy is often a much bigger problem for plant genomes, e.g. some wheat species are hexaploid. Genome assembly is sometimes performed on a genome for which we already have a reference (e.g. if you sequenced your own genome, you could align it to the human reference sequence). Otherwise, we are talking about de novo assembly, which is much, much harder.
  • 45. from amazon.com Returning to the jigsaw analogy…every jigsaw puzzle comes with a picture of the puzzle on the box. This is a luxury not always available to genome assemblers.
  • 46.
  • 47. When we are doing de novo assembly, it is a bit like doing a jigsaw without knowing what it will look like.
  • 48.
  • 49. Even with de novo assembly, we may have a distant relative with a known genome sequence that can help with the assembly. A bit like assembling a jigsaw using a blurred picture as a guide.
  • 50.
  • 51. Jigsaws tell you how many pieces are in the puzzle (and what the dimensions of the puzzle will be). We don't always know this for genome assembly. There are measures for determining how big a genome might be, but these methods can sometimes be misleading.
  • 53. Other challenges for genome assembly • Cost: in 2014 Illumina claimed the $1,000 genome barrier had been broken (if you first spend ~$10 million on hardware). • Library prep: a critical, and often overlooked, step in the process. • Sequence diversity: Illumina, 454, Ion Torrent, PacBio, Oxford Nanopore; which mix of sequence data will you be using? • Hardware: some genome assemblers have very high CPU/RAM requirements and might need a specialized cluster. • Expertise: it is not always easy to even get assembly software installed, let alone understand how to run it properly. • Software: there is a lot of choice out there.
  • 55. The PRICE genome assembler has 52 command-line options!!! This is probably not the most complex, nor the most simple, genome assembler that is out there. But how much time do you have to explore some of those 52 parameters that could affect the resulting genome assembly?
  • 58. There are over 125 different tools available to help assemble a genome! Not all of these are comprehensive genome assemblers; some are tools to help with specific aspects of the assembly process, or to help evaluate genome assemblies. Still, this represents a bewildering amount of choice.
  • 59. These six assembly tools were published in one month in 2014!
  • 61. Before you assemble… • You should remove adapter contamination • You should remove sequence contamination • You should trim sequences for low quality regions After we have generated the raw sequence data, we still must run a few basic steps to clean up our data prior to assembly. How straightforward are these steps?
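To give a feel for what two of these clean-up steps involve, here is a minimal sketch of 3' quality trimming and exact-match adapter clipping. It is not how any particular tool works: the function names, the Phred threshold of 20, and the adapter string in the example are assumptions for illustration. Real tools also handle partial adapters, mismatches, and paired reads.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Drop bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def clip_adapter(seq, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


# Hypothetical read: 10 bp of genomic sequence followed by adapter sequence.
adapter = "AGATCGGAAGAGC"  # placeholder adapter used only for this example
read = "ACGTACGTAC" + adapter
print(clip_adapter(read, adapter))  # ACGTACGTAC

# Hypothetical read whose last two bases have low quality scores.
seq, quals = trim_low_quality_tail("ACGTAC", [30, 30, 30, 30, 10, 5])
print(seq)  # ACGT
```

Even in this toy form there are already parameters to choose (the quality threshold, how strictly to match the adapter), which is part of why these "basic" steps are less straightforward than they sound.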
  • 63. Tools for removing adapter contamination from sequences There are at least 34 different tools! One of these tools has 27 different command-line options Even the first step of removing adapter contamination is something for which you could spend a lot of time researching different software choices.
  • 66. Saccharomyces cerevisiae • 12 Mbp genome • Published in 1997 • First eukaryotic genome sequence Not the first published genome — there were several bacterial genomes sequenced in the preceding couple of years — but this was the first eukaryotic genome sequence. Furthermore, this genome sequence has undergone continual improvements and corrections since publication (the last set of changes were in 2011).
  • 67. Caenorhabditis elegans • ~100 Mbp genome • Published in 1998 • First animal genome sequence
  • 69. Arabidopsis thaliana • First plant genome sequence • Published in 2000 • Size? • 2000 = 125 Mbp • 2007 = 157 Mbp • 2012 = 135 Mbp As alluded to earlier, we don't always know for sure how big (or small) a genome is. The Arabidopsis genome size has been corrected upwards and downwards since publication. The amount of sequenced information as of today is about 119 Mbp. And this is for the best understood plant genome that we know about!
  • 71. Homo sapiens • ~3 Gbp genome • Finished? • 'working draft' announced in 2000 • 'working draft' published in 2001 • completion announced in 2003 • complete sequence published in 2004 The human genome has also undergone improvements since the (many) announcements regarding its completion (or near completion). There are only a small number of species for which there is a dedicated group of people who seek to continually improve the genome sequence and get closer to 'the truth'.
  • 73. There are lots of ongoing genome sequencing projects • The 100,000 genomes project • i5k Insect and other Arthropod Genome Sequencing Initiative Bigger numbers must be better, right? Some projects sequence genomes to align back to a reference to look for the differences, others seek to characterize genomes for which we have very little genomic information. The 100,000 genomes project in England heralds the start of the mass sequencing of patients to understand disease.
  • 75. We no longer have one genome per species • We have genome sequences representing different strains and varieties of a species • We have multiple genomes from different tissues of the same individual (e.g. cancer genomes) • We potentially will have genomes from different time points or life stages of an individual Imagine having your genome sequenced at birth from several different tissues and getting 'genome health checks' throughout your life.
  • 77. There is no point sequencing so many genomes if we can't accurately assemble them! Sequencing genomes is relatively easy. Putting that information together in a meaningful way so as to make it useful to others…that's not so easy.
  • 79. Bad genome assemblies #1 Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp! The average vertebrate gene is about 25,000 bp Everyone wants long sequences in a genome assembly. This may not always matter, but in most cases they should hopefully be long enough to contain at least one gene. These data are from a vertebrate genome sequence that someone asked me to look at. Over half of the genome assembly was represented by sequences less than 150 bp! This is not much use to anyone.
  • 81. Bad genome assemblies #2 Ns = 90.6% !!! Genome sequences usually contain unknown bases (Ns) From another assembly that I was asked to look at. Even the 9% of the genome which wasn't an 'N' was split into tiny little fragments. Completely unusable information.
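Both of these problems (tiny sequences and assemblies that are mostly Ns) can be spotted with a very small script. The following is a sketch, assuming the assembly is a plain FASTA file; the function names and the toy records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def sanity_report(records, n_shortest=10):
    """Report total size, the shortest sequence lengths, and the percentage of Ns."""
    seqs = [seq.upper() for _, seq in records]
    lengths = sorted(len(s) for s in seqs)
    total = sum(lengths)
    n_count = sum(s.count("N") for s in seqs)
    return {
        "total_bp": total,
        "shortest": lengths[:n_shortest],
        "percent_n": 100.0 * n_count / total if total else 0.0,
    }


# Toy input; for a real assembly, pass an open file handle to read_fasta().
toy = [">scaffold1", "ACGTNNNNNN", ">scaffold2", "ACG"]
report = sanity_report(read_fasta(toy))
print(report["shortest"])             # [3, 10]
print(round(report["percent_n"], 1))  # 46.2
```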
  • 83. Has anyone compared different assemblers to work out which is the best? I was wondering whether you would ask this…
  • 85. A genome assembly competition This was a genome assembly assessment exercise that I was involved with.
  • 89. 3 species 21 teams 43 assemblies 52 Gbp of sequence!
  • 90. Goals • Assess 'quality' of genome assemblies • Identify the best assemblers • First need to define quality!
  • 92. Who makes the best pizza in Davis? An easy question to ask, but maybe not as straightforward as it seems…
  • 94. Who makes the best pizza in Davis? Freshest? Cheapest? Biggest? Gluten free? Healthiest? Choice of toppings? Delivery time? Tastiest? 'Best' is subjective. If you are intolerant to gluten, then the best pizza place will be the one that makes gluten-free pizzas.
  • 96. Who makes the best pizza in Davis? Freshest? Cheapest? Biggest? Gluten free? Healthiest? Choice of toppings? Delivery time? Tastiest? Even if you focus on who makes the best 'tasting' pizzas, this is still very subjective.
  • 98. Who makes the best genome assembly? Image from flickr.com/dullhunk/ But surely this is not such a subjective topic when it comes to genome assembly?
  • 100. Who makes the best genome assembly? Longest contigs? Fewest errors? Lowest CPU demands? Best deals with repeats? Contains most genes? Fastest? Best resolves heterozygosity? Easiest to install? Longest scaffolds? Image from flickr.com/dullhunk/ It is less subjective, but there are still many different ways we can think of when trying to determine what makes a good genome assembly.
  • 102. And the winner is… • No winner! • Some assemblers seemed to work well for one species, but not for other species • Some assemblies were good, as measured by one metric, but not when measured by others This result was disappointing to many who were hoping that we would provide a resounding endorsement for assembler 'X'.
  • 103. flickr.com/incrediblehow/ How do we know if an assembly is any good?
  • 105. Read The fundamental input to a genome assembly is a set of sequencing reads.
  • 107. Typical read lengths by technology (date) • Sanger (~1970–2000): 750–1,000 bp • Solexa/Illumina (~2005): ~25 bp • Illumina (~2014): ~150–250 bp • Pacific Biosciences (~2014): 10–15 Kbp • Oxford Nanopore (~2014): 5–??? Kbp Different technologies produce reads with very different length distributions, and read lengths for a given technology also tend to increase over time. Perhaps more importantly, different technologies have different error profiles (where errors occur in reads and which types of error occur).
  • 108. Read
  • 110. Read pair Insert size is known (approximately) Typically, we work with pairs of reads separated by a short distance (< 1,000 bp) or even overlapping. The insert size is not exact but can be modeled by a distribution of sizes.
  • 112. Mate pair (jumping pair) Much larger insert size Mate pairs are produced using a different preparation method and can be separated by several thousand bp. These become very useful in genome assembly.
  • 114. Contig: we should be able to make one contiguous sequence from overlapping paired reads. For some sequencing technologies with long reads, you can simply see if there are enough overlapping reads such that you can form a contiguous sequence, or contig. For short read technologies such as Illumina, different mathematical approaches are used to form contigs (e.g. De Bruijn graph approaches).
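As a rough illustration of the De Bruijn graph idea mentioned here, the sketch below breaks reads into k-mers, links each (k-1)-mer to the (k-1)-mers that follow it, and then walks the graph while the path is unambiguous. This is a toy under strong assumptions (error-free reads, one strand, no repeats, a known starting node) and does not represent any real assembler; the function names are invented for the example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor.

    Simplification: only out-degree is checked, and the walk stops if it
    would revisit a node (a cycle, as a repeat would create).
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCA.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
print(walk_unambiguous_path(graph, "ATG"))  # ATGGCGTGCA
```

Repeats longer than k create nodes with more than one successor, which is where the walk stops and where contigs end in practice.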
  • 116. Scaffold: use mate pair information to link contigs as part of a scaffolding process. Hopefully, you will have some mate pairs where one read from the pair matches one contig, and the other matches another contig. You can then create a scaffold sequence which spans the two contigs.
  • 118. Scaffold: use mate pair information to link contigs as part of a scaffolding process (contig, NNNNNNNNNNNNNN, contig). The unknown region between contigs is replaced with Ns to represent unknown bases. The lengths of these regions are sometimes approximations.
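The gap-filling step can be sketched as follows. This assumes a single mate pair, both contigs in the same orientation, and an insert size measured from the outer end of one read to the outer end of the other; the function names and coordinates are hypothetical, and real scaffolders combine evidence from many pairs.

```python
def estimate_gap(insert_size, left_contig_len, left_read_start, right_read_end):
    """Estimate the gap (bp) between two contigs linked by one mate pair.

    left_read_start: 0-based position where the first read starts on the left contig.
    right_read_end:  number of bases from the start of the right contig to the
                     far end of the second read.
    """
    bases_on_left = left_contig_len - left_read_start
    return insert_size - bases_on_left - right_read_end


def join_with_gap(left_contig, right_contig, gap_bp):
    """Build a scaffold by joining two contigs with a run of Ns."""
    if gap_bp < 0:
        raise ValueError("negative gap: the contigs may actually overlap")
    return left_contig + "N" * gap_bp + right_contig


# Hypothetical mate pair with a ~3,000 bp insert spanning two contigs.
print(estimate_gap(3000, left_contig_len=5000, left_read_start=4000, right_read_end=800))  # 1200
print(join_with_gap("ACGT", "TTGA", 5))  # ACGTNNNNNTTGA
```

Because the insert size is only known as a distribution, the resulting gap length is an estimate, which is why the number of Ns in a scaffold should not be read as exact.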
  • 120. Making contigs is a different process to making scaffolds Some assemblers do a better job at making contigs than they do at combining those contigs into scaffolds. Sometimes you can use different tools to do each step.
  • 122. Assembly size = sum length of scaffolds 209 Mbp Let's consider a fictional assembly with a few scaffolds and contigs. The first thing we calculate is the assembly size. This is simply the sum length of all sequences included in the assembly.
  • 124. Mean scaffold length is rarely used as a metric Most genome assemblies contain a lot of very short contigs At one extreme, an assembly could include every read that wasn't included in a contig. More likely, you will end up with some very short contigs which may not be useful. Contigs/scaffolds below a user-defined length threshold are often excluded from assemblies. All of these short sequences lower the mean length.
  • 126. N50 length The length of the sequence which takes the sum length of all sequences past 50% of the total assembly size This is the most widely-used metric to assess genome assembly quality…sometimes it is the only metric. This was first described in the human genome paper. It has since been mentioned in just about every paper that has ever described a new genome sequence.
  • 128. Calculating N50 Assembly size = 209 Mbp 50 40 35 25 20 15 10 3 3 2 2 2 2 It is sometimes easier to see how N50 is calculated by showing an example. Let's start with the longest scaffold and add the lengths to a running total. We want to stop when we have seen >50% of the total assembly size (i.e. >104.5 Mbp).
  • 129. Calculating N50 Assembly size = 209 Mbp 50 40 35 25 20 15 10 3 3 2 2 2 2 Running total = 50 Mbp
  • 133. Calculating N50 • Scaffold lengths (Mbp): 50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2 • Assembly size = 209 Mbp • N50 length = 35 Mbp • Mean length = 16 Mbp After looking at three scaffolds we now know what the N50 scaffold length is. This is typically much higher than the mean length.
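The running-total procedure for this example can be written out directly. This sketch follows the definition used on the slides (stop once the running total passes 50% of the assembly size); the function name is an assumption, and the lengths are the 13 scaffolds of the fictional 209 Mbp assembly.

```python
def n50(lengths):
    """Length of the sequence that takes the running total past 50% of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running > total / 2:
            return length
    return 0  # only reached for an empty assembly


# Scaffold lengths (Mbp) from the fictional assembly on the slides.
scaffolds = [50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2]

print(sum(scaffolds))                          # 209
print(n50(scaffolds))                          # 35
print(round(sum(scaffolds) / len(scaffolds)))  # 16
```

The running total goes 50, 90, 125, so the third scaffold (35 Mbp) is the one that crosses the 104.5 Mbp halfway mark.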
  • 135. Different assembly of the same genome 50 40 35 25 20 15 Assembly size = 185 Mbp N50 length = 40 Mbp Let's assume we tweaked the parameters of our assembly software to exclude the shortest scaffolds. This makes a smaller assembly but increases the N50 length. This means that it is possible to boost N50 simply by throwing away sequences.
  • 137. NG50 length Like N50, but rather than use assembly size in the calculation, use known (or estimated) genome size In the Assemblathon contests, we used a new measure which enables a fairer comparison between different assemblies (of the same genome).
  • 139. Assume genome size is 240 Mbp • First assembly: assembly size = 209 Mbp, N50 length = 35 Mbp, NG50 length = 35 Mbp • Second assembly: assembly size = 185 Mbp, N50 length = 40 Mbp, NG50 length = 35 Mbp If we knew what the actual genome size was (e.g. 240 Mbp), we could calculate the NG50 scaffold length and see that it is the same for both assemblies.
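NG50 uses the same running total as N50, but the halfway mark is fixed by the genome size rather than by the assembly size. A minimal sketch, using the scaffold lengths from the earlier slides and the assumed 240 Mbp genome; the function name is illustrative.

```python
def ng50(lengths, genome_size):
    """Like N50, but the threshold is 50% of the known (or estimated) genome size."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running > genome_size / 2:
            return length
    return None  # the assembly covers less than half of the genome


GENOME_SIZE = 240  # Mbp, assumed as on the slide

first_assembly = [50, 40, 35, 25, 20, 15, 10, 3, 3, 2, 2, 2, 2]  # 209 Mbp
second_assembly = [50, 40, 35, 25, 20, 15]                       # 185 Mbp

print(ng50(first_assembly, GENOME_SIZE))   # 35
print(ng50(second_assembly, GENOME_SIZE))  # 35
```

Because the threshold (120 Mbp here) no longer depends on what was kept in the assembly, discarding short scaffolds cannot change the result.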
  • 140. NG50 length: use NG50 when making comparisons between genome assemblies, because N50 can be biased. And be warned…some people obsess over N50!
  • 143. Metrics and notes • Assembly size: how does it compare to expected size? • Number of sequences: how fragmented is your assembly? • N50 length (contigs & scaffolds): making contigs and making scaffolds are two different skills. • NG50 scaffold length: becoming more common to see this used. • Coverage: how much of some reference sequence is present in your assembly? • Errors: errors in alignment of assembly to reference sequence or to input read data. • Number of genes: from comparison to reference transcriptome and/or set of known genes. This is a very brief summary that lists just some of the ways in which you could describe your genome assembly.
  • 145. Assembly size [bar chart: assembly size in bp, axis from 0 to 2,000,000,000, for Assemblathon 2 bird genome assemblies A to M] In Assemblathon 2, one assembly of the bird genome (a parrot) was very, very small. Conversely, one assembly was almost twice the size of the estimated genome (~1.2 Gbp). Bigger is not always better when it comes to assembly size.
  • 146. Using core genes • All genomes perform some core functions (transcription, replication, translation etc.) • Proteins involved tend to be highly conserved • They should be present in every genome
  • 148. CEGMA This was an approach developed by our lab, originally to find a handful of genes in a newly sequenced genome which could be used to train a species-specific gene finder. We then adapted the technique to assess the gene space of a draft genome.
  • 150. What is CEGMA? • CEGMA (Core Eukaryotic Gene Mapping Approach) • defines a set of 248 'Core Eukaryotic Genes' (CEGs) • CEGs identified from genomes of: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens • How many full-length CEGs are present in an assembly? We expect these 248 genes to be present in all eukaryotes. CEGMA uses a combination of software tools to find these genes. The number of core genes present is assumed to reflect the proportion of all genes that are present in the assembly. Sometimes genes are split across contigs or scaffolds; CEGMA can find some of these and reports them as partial matches.
  • 151.
  • 152. Here are N50 scaffold lengths and number of core genes present in a variety of genomes that I have looked at. There is a lot of variation. Some assemblies might give you longer sequences (higher N50 values), but this is no guarantee that those assemblies will contain more gene sequences. Likewise, assemblies with more gene sequences may not necessarily have longer sequences.
  • 154. Should you use CEGMA? • CEGMA is not easy to install • It is old and somewhat out of date • You could use other transcript/protein data sets instead of CEGMA The principle of CEGMA could be used with a variety of different data. Maybe there are a small number of full-length mRNAs available for your species of interest. If you have multiple genome assemblies, you could simply see how they differ with respect to the presence of those genes.
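The comparison suggested here boils down to simple set arithmetic once you know which genes were found in each assembly. The sketch below assumes that step has already been done with an aligner of your choice; the gene and assembly names are invented, and nothing here reflects CEGMA's own output format.

```python
def gene_presence_summary(expected_genes, found_by_assembly):
    """Summarize how many of the expected genes were found in each assembly."""
    expected = set(expected_genes)
    if not expected:
        raise ValueError("expected_genes must not be empty")
    summary = {}
    for name, found in found_by_assembly.items():
        present = expected & set(found)
        summary[name] = {
            "present": len(present),
            "percent": 100.0 * len(present) / len(expected),
            "missing": sorted(expected - present),
        }
    return summary


# Hypothetical full-length mRNAs and the assemblies in which each was found.
expected = ["geneA", "geneB", "geneC", "geneD"]
found = {
    "assembly_1": ["geneA", "geneB", "geneC"],
    "assembly_2": ["geneA", "geneD"],
}
for name, stats in gene_presence_summary(expected, found).items():
    print(name, stats["present"], stats["percent"], stats["missing"])
```

With the toy data above, assembly_1 contains 3 of the 4 genes (75%) and assembly_2 contains 2 (50%), regardless of which one has the longer scaffolds.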
  • 156. Other tools for evaluating assemblies FRCbam (2012) REAPR (2013) kPAL (2014) Just as it seems increasingly popular to develop new genome assemblers, there is a growing demand (and supply) for tools to evaluate genome assemblies. Here are three recent ones.
  • 159. In conclusion… • Genome assembly is not a solved problem • If possible, try different genome assemblers • Don't rely on one metric to assess quality • Different metrics assess different aspects of quality • Look at your genome assembly! The last point is worth repeating. Is your genome 91% N? Do you have 3 bp sequences in your assembly? These are easy things to check. And remember, all genome assemblies should be thought of as 'work in progress'!
  • 161. Further resources • http://acgt.me • @assemblathon I use the Assemblathon twitter account to tweet links to papers and resources that describe tools relevant to the field of genome assembly. Normally only a few tweets a day. My ACGT blog contains some posts relating to genome assembly, and I try to write these with more of a general audience in mind.