Thoughts on the feasibility of an Assemblathon 3 contest
Ian Korf
Please note: this is a draft version of a talk, i.e. these are slides I
prepared for Ian Korf to use at the Genome 10K meeting. His final
version of this talk will no doubt add or remove much material.
Keith Bradnam 2015-03-04
DNA sequencers keep on getting smaller…
…the challenges of genome assembly
seem to keep getting bigger.
flickr.com/incrediblehow/
Let the people speak…
*If* there were to be an Assemblathon 3, what
suggestions or ideas would you have for it?
Please tweet them using hashtag #A3wishlist
"Hybrid-approaches with PacBio, Nanopore, and Illumina data;
non-model systems; egalitarian genomics"
"Polyploid assembly and haplotype reconstruction"
"Give people assemblies + reads, competition on best prediction
which assembly is ‘best’ on different metrics"
"I vote axolotl or lungfish for Assemblathon 3!
Big, repetitive, interesting, useful genomes"
"Large/complex/repetitive marine genomes; high
heterozygosity (no inbred lines); crustacean/sharks;
Illumina 250 bp paired ends + PacBio + optical maps"
"PacBio vs Illumina assemblies, Illumina with low PacBio
coverage to fill gaps + correcting PacBio errors with Illumina"
"Polyploid (highly heterozygous) genome assembly challenge
with emphasis on sub-genome (haplotype) deconvolution"
A lot of people seem to want to assemble something
really difficult! This presumes that we have already
mastered assembly of haploid, low-repeat-content,
average-sized genomes.
Problems with Assemblathon 2
Too many species!
Community effort was diluted across different species
(only 2 teams assembled all 3 genomes). Multiple
species presented more data management issues.
One species?
Unrealistic amounts of sequence data available
285x coverage of parrot genome
It is not typical to sequence so much data for a genome
assembly. Most researchers cannot afford to pay for so
much sequencing.
Make the assembly challenge
representative of a real-world scenario
Give teams a virtual budget and let them
buy sequencing resources
$$$
Budget  Team  Illumina  Moleculo  PacBio  Oxford Nanopore
$5,000
Team A 20x 5x
Team B 40x
Team C 10x 10x
$50,000
Team A 30x 20x 5x 2x
Team B 50x 10x
Team C 75x 30x 10x
Could allow teams to 'buy'
sequences from a mix of platforms
Could potentially have two
different budgets available
(budgets here are just for illustrative reasons)
Fictional example to show different teams could use different strategies
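The virtual budget idea can be sketched in a few lines of Python. This is only an illustration: the per-platform prices, the `PRICE_PER_X` table, the function names, and the example order are all hypothetical placeholders and do not come from the slides or from any real price list.

```python
# Hypothetical cost in dollars per 1x of coverage on each platform.
# These numbers are placeholders for illustration, not real prices.
PRICE_PER_X = {
    "Illumina": 100.0,
    "Moleculo": 400.0,
    "PacBio": 500.0,
    "Oxford Nanopore": 800.0,
}


def purchase_cost(order):
    """Return the total virtual cost of an order mapping platform -> coverage (x)."""
    unknown = set(order) - set(PRICE_PER_X)
    if unknown:
        raise ValueError(f"unknown platform(s): {sorted(unknown)}")
    return sum(PRICE_PER_X[platform] * cov for platform, cov in order.items())


def within_budget(order, budget):
    """True if the order can be bought without exceeding the virtual budget."""
    return purchase_cost(order) <= budget


# A hypothetical order: the choice of platforms here is an assumption.
example_order = {"Illumina": 20, "PacBio": 5}
```

With these placeholder prices, `within_budget(example_order, 5000)` would let the organizers check a team's request before releasing the reads, and two budget tiers would simply be two calls with different `budget` values.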
Low amounts of useful validation data
PacBio data could have been held back to validate
assemblies, but it was released and then used by only a few
teams. There was no good transcript data. Fosmids and optical
maps were available, but not for all species.
• More Fosmid and/or BAC sequences?
• Transcript(ome) data?
• Long read sequence data?
• Synteny information?
• Tools such as Irys from BioNano Genomics?
Documentation for how assemblies were made
was often poor or missing altogether
• Require reproducible assembly instructions
at the time of submission
• Request better information relating to
the computer architecture used to make
the assembly
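One way to make these two requirements checkable at submission time is a small structured record. The sketch below is an assumption about what such a record might hold; the `SubmissionRecord` class, its field names, and the `missing_fields` helper are hypothetical and not part of any existing Assemblathon process.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionRecord:
    """Hypothetical metadata that would accompany each submitted assembly."""
    team: str
    assembly_file: str
    commands: list = field(default_factory=list)            # exact commands, in order
    software_versions: dict = field(default_factory=dict)   # tool name -> version
    cpu_cores: int = 0
    ram_gb: float = 0.0


def missing_fields(record):
    """Return the names of fields that are still empty or unset."""
    problems = []
    if not record.commands:
        problems.append("commands")
    if not record.software_versions:
        problems.append("software_versions")
    if record.cpu_cores <= 0:
        problems.append("cpu_cores")
    if record.ram_gb <= 0:
        problems.append("ram_gb")
    return problems
```

A submission could then be rejected, or at least flagged, whenever `missing_fields(record)` returns a non-empty list.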
Other considerations for Assemblathon 3
Two different sequence file formats have been developed that
can represent haplotype variation in a genome assembly
GFA
FASTG
Neither format seems to have been widely adopted,
and there are no (?) downstream bioinformatics tools
that work with these formats. Would requiring either
format deter participation?
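To make "representing haplotype variation" concrete, here is a minimal sketch of how a single two-allele site could be written as a graph. It assumes the tab-separated GFA 1 record layout (an `H` header line, `S` segment lines, `L` link lines); the function name and the segment names `left`, `hapA`, `hapB`, and `right` are hypothetical.

```python
def bubble_gfa(left, allele_a, allele_b, right):
    """Return GFA 1 text for one two-allele site flanked by shared sequence."""
    segments = {"left": left, "hapA": allele_a, "hapB": allele_b, "right": right}
    # Both alleles branch off the left flank and rejoin at the right flank.
    links = [("left", "hapA"), ("left", "hapB"), ("hapA", "right"), ("hapB", "right")]
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for name, seq in segments.items()]
    # "0M" marks a blunt join, i.e. no overlap between adjacent segments.
    lines += [f"L\t{src}\t+\t{dst}\t+\t0M" for src, dst in links]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bubble_gfa("ACGT", "A", "G", "TTGA"), end="")
```

A linear FASTA file would have to pick one of the two alleles, whereas the graph keeps both paths, which is the property the slide refers to.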
Encourage multiple entries per team?
assembly_1a.fasta
assembly_1b.fasta
Some of the better assemblies in Assemblathon 2 were
the 'experimental' entries.
What species?
How about an endangered species?
Assemblathon 3 could become a shining example of
conservation genomics, and choosing an endangered
species might help attract more community support.
Also good PR!
California Condor (Gymnogyps californianus)
Image from http://www.manataka.org/
Critically endangered. BAC resources may be available.
Tuatara (Sphenodon punctatus)
Image from https://student.societyforscience.org/
A 'living fossil'. Low risk of extinction. BAC libraries and
partial transcriptome exist.
Spiny rat (Tokudaia spp.)
Image from https://wikimedia.org/
Endangered. Transcriptome available.
But does it have to be a Genome 10K species?
If the species is eukaryotic and has a large genome, this
would still be useful to assess assemblers that could be
used for other Genome 10K species.
White abalone (Haliotis sorenseni)
Image from https://wikimedia.org/
Estimated genome size: 1.7–2.0 Gbp.
Native to California and Mexico.
Critically endangered — first marine invertebrate to be
listed under the Endangered Species Act.
Successfully bred the first white abalone in captivity in 2012.
"The restoration of the white abalone in the wild — the
first time this would ever have been attempted for a listed
marine — may depend on the genome being
sequenced."
Gary Cherr
Director, Bodega Marine Laboratory
Principal Investigator for abalone
captive breeding program
"There’s probably a few thousand left in the wild.
But because they’re so far apart, they’re effectively sterile.
Their population could be effectively extinct already."
Kristin Aquilino
Manager of abalone
captive breeding program
Summary
• People seem to want very different things
out of a possible Assemblathon 3 contest
• Trying to please everyone — rather than
focusing on something achievable and
helpful to the ultimate users of genome
assembly software — might not be the
most productive strategy
From Wikimedia commons
Three months later…
From http://flickr.com/markturner/
