Using field-based DNA sequencing to
accelerate phylogenomics
Joe Parker
Royal Botanic Gardens, Kew
Department of Zoology, Oxford University
30th November 2016
Outline
- Intro
- Real-time phylogenomics
- Ubiquitous sequencing
- Implications
Images – Wikimedia commons CC BY-SA
(clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
Phylogenetics → Phylogenomics
Stewart et al. (1987):
5 species
1 gene
130 amino acids
2 phylogenies
doi:10.1038/330401a0
Parker et al. (2013):
22 species
2,326 genes
600,000 amino acids
(~700,000,000 simulated)
14 test phylogenies
100 control phylogenies
doi:10.1038/nature12511
Real-time phylogenomics
Clockwise from top-right:
Authors’ own; ONT Ltd;
Quick et al. (2016)
http://dx.doi.org/10.1038/nature16996;
Loose et al. (2016)
http://dx.doi.org/10.1038/nmeth.3930
Nanopores…
Field-based DNA
sequencing
Field-based sequencing: design and questions
Arabidopsis sequencing in
Snowdonia National Park:
Congeneric species with genomes available:
- Arabidopsis thaliana
- Arabidopsis lyrata ssp. petraea
Sample, extract and prepare libraries in the field
Sequence in the field with MinION ‘Rapid 1D’
Replicate lab-based library prep and MiSeq data
Field-based sequencing: design and questions
Questions:
Will field-based DNA extraction, library preparation
and sequencing work?
Will the data produced be any good? In any
quantity?
Can we use it for congeneric species ID? By what
method?
What else might we do with the data?
Field-based sequencing: Real-time phylogenomics kit
Field-based sequencing: Yield and performance
Total field-sequenced yield
>300Mbp (~1.9x)
~80% accuracy
Sequencer yield affected by
reagent conditions and
sample quality as much as by
field conditions
Field-based sequencing: ID using BLAST
Assign ID using BLASTN versus reference
genomes; difference statistic calculated e.g.

BLAST 1 database (= true pos):            A. thaliana   A. lyrata¹    A. thaliana   A. lyrata¹
BLAST 2 database (= true neg):            A. lyrata     A. thaliana   A. lyrata     A. thaliana
Sequencing platform:                      ONT 1D        ONT 1D        MiSeq         MiSeq
Total reads:                              91,715        25,839        9,476,598     9,659,489
Total hits present DB1 only (TP):         10,322        76            1,078,986     1,491,474
Total hits present DB2 only (FP):         378           2             29,200        12,613
Total hits present in both @ e ≤ 0.001²:  22,386        101           4,251,153     3,829,870
Hit biases³:
  Cumulative length                       29,636,139    20,424        91,328,759    65,181,756
  Cumulative % identities                 850,070       119           25,692,193    19,737,356
  Cumulative evalues                      0.01          (0.00)        0.04          0.02
  Mean length                             1,324         202.22        21.48         17.02
  Mean % identities                       37.97         1.18          6.04          5.15
  Mean evalues                            4.80E-07      -3.89581E-11  1.06E-08      5.46E-09
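The slide does not spell out the difference statistic, but the table's means equal the cumulative values divided by the number of reads hitting both databases (for example 29,636,139 / 22,386 ≈ 1,324). Below is a minimal Python sketch, assuming each bias is the per-read difference (DB1 best hit minus DB2 best hit) summed over reads with hits in both databases. The `Hit` record and `hit_biases` function are hypothetical names for illustration, not the author's code, and the reads are made up.

```python
from typing import Dict, NamedTuple


class Hit(NamedTuple):
    """Best BLASTN hit for one read (hypothetical record for illustration)."""
    length: int     # alignment length
    pident: float   # percent identity
    evalue: float   # expectation value


def hit_biases(db1: Dict[str, Hit], db2: Dict[str, Hit]) -> dict:
    """Count reads by database hit, and accumulate DB1-minus-DB2 biases.

    Assumption: the 'difference statistic' is the per-read difference
    between the DB1 and DB2 best hits, summed over reads that hit both.
    """
    both = db1.keys() & db2.keys()
    n_both = len(both)
    cum_length = sum(db1[r].length - db2[r].length for r in both)
    cum_pident = sum(db1[r].pident - db2[r].pident for r in both)
    return {
        "db1_only": len(db1.keys() - db2.keys()),   # 'TP' row in the table
        "db2_only": len(db2.keys() - db1.keys()),   # 'FP' row in the table
        "both": n_both,
        "cum_length": cum_length,
        "cum_pident": cum_pident,
        "mean_length": cum_length / n_both if n_both else 0.0,
        "mean_pident": cum_pident / n_both if n_both else 0.0,
    }


# Toy reads (made up, not data from the talk).
db1_hits = {
    "r1": Hit(1500, 85.0, 1e-50),
    "r2": Hit(900, 80.0, 1e-20),
    "r3": Hit(400, 78.0, 1e-9),
}
db2_hits = {
    "r1": Hit(300, 75.0, 1e-8),
    "r2": Hit(700, 79.0, 1e-15),
    "r4": Hit(250, 74.0, 1e-5),
}
summary = hit_biases(db1_hits, db2_hits)
print(summary)
```

Under this assumption a positive bias favours the DB1 species, which matches the positive mean length and identity biases in every column of the table.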
[Figure panels: MinION, MiSeq; annotation: ~50% accuracy]
Field-based genomics: assembly and annotation

Species                     A. thaliana     A. thaliana     A. lyrata ssp. petraea  A. lyrata ssp. petraea
Data                        MiSeq, 300 bp   MiSeq + MinION  MiSeq                   Hybrid
Assembler                   ABySS           hybridSPAdes    ABySS                   hybridSPAdes
# contigs                   24,999          10,643          37,568                  85,599
Largest contig              89,717          413,462         101,114                 38,313
Total length                106,455,313     119,031,074     151,562,895             117,256,694
Reference length            119,667,750     119,667,750     183,707,801             183,707,801
N50                         7,853           48,730          9,605                   1,686
Unaligned length            7,121,882       6,737,059       36,669,847              35,287,390
Genome fraction (%)         82.0            88.7            53.4                    43.7
Duplication ratio           1.01            1.06            1.17                    1.02
# N's per 100 kbp           1.72            5.41            0.22                    7.09
# mismatches per 100 kbp    518             588             1,297                   1,097
# indels per 100 kbp        120             130             334                     271
Largest alignment           76,935          264,039         44,515                  17,201
Total aligned length        98,382,255      108,085,473     100,502,092             80,814,492
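For readers unfamiliar with the N50 row: it is the contig length at which the longest contigs, added in descending order, first cover half of the total assembly length. A minimal Python sketch of that standard definition follows. The function name `n50` and the toy contig lengths are illustrative assumptions, and the tool that produced the table may apply extra filters (such as a minimum contig length) that are not modelled here.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # no contigs supplied


# Toy assembly (made-up contig lengths, not data from the talk).
toy_contigs = [2, 3, 4, 5, 6]
print(n50(toy_contigs))
```

A higher N50 at a similar total length indicates a more contiguous assembly, which is how the hybrid A. thaliana column (48,730) compares with the MiSeq-only column (7,853).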
Field-based sequencing: Phylogenomics
- Coding sequences and proteins predicted with CEGMA from the
  hybridSPAdes assembly of MiSeq + MinION (lab- and
  field-sequenced) data
- Aligned to 248 CEGMA alignments (plus additional plants
  from JGI) with MUSCLE
- Quick RAxML phylogenies; TreeAnnotator majority-rule
  consensus
- Consensus support > 98%
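The consensus step can be illustrated by counting how often each clade appears across a set of gene trees. This is only a sketch of the majority-rule idea in Python, not the RAxML/TreeAnnotator workflow used in the talk. Representing each tree as a set of clades (each clade a frozenset of taxon names), the function name `majority_rule_clades`, and the four-taxon example are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List

Clade = FrozenSet[str]


def majority_rule_clades(
    trees: List[Iterable[Clade]], threshold: float = 0.5
) -> Dict[Clade, float]:
    """Return clades found in more than `threshold` of the trees.

    Each tree is given as the collection of clades (taxon sets) it contains.
    The value for each retained clade is its support: the fraction of trees
    in which it appears.
    """
    counts: Counter = Counter()
    for clades in trees:
        counts.update(set(clades))  # count each clade once per tree
    n_trees = len(trees)
    return {
        clade: count / n_trees
        for clade, count in counts.items()
        if count / n_trees > threshold
    }


# Toy gene trees on taxa A-D (made up, not data from the talk).
ab = frozenset({"A", "B"})
abc = frozenset({"A", "B", "C"})
cd = frozenset({"C", "D"})
gene_trees = [[ab, abc], [ab, cd], [ab, abc]]

support = majority_rule_clades(gene_trees)
for clade, value in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(clade), round(value, 2))
```

In this toy case the clade {A, B} is kept with full support, {A, B, C} is kept at two-thirds, and {C, D} is dropped because it appears in only one of three trees.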
… and extreme phylogenomics
Field-based sequencing
Conclusions
Field-based DNA extraction, library preparation and
sequencing are entirely feasible.
With current techniques, data can be produced in
quantities comparable to lab-based runs.
Data are of sufficient quality for congeneric ID using
simple, fast processes (BLAST) and for genomics.
Raw reads can even be used for informative
phylogenomics with minimal processing.
More adventures in real-time ID: Kew Science Festival 2016
Fast ID-by-sequencing:
Generate genomic data for
BLAST identification:
rapid-rough-reference (‘R3’)
Field-sequence unknown
(blinded) samples from panel
Use BLAST triggered by watch
daemon to instantly identify new
reads
Recompute sample ID in real time
using difference statistics
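The watch-daemon step above might look something like the following Python sketch, which polls a directory and hands each newly arrived read file to a callback. This is an assumption about the general shape of such a daemon, not the code used at the festival. The function names, the `*.fasta` file pattern and the polling interval are hypothetical, and the `-evalue 0.001` cutoff is borrowed from the e-value threshold in the earlier BLAST table. The demo uses a stand-in callback, so BLAST does not need to be installed to run it.

```python
import subprocess
import tempfile
import time
from pathlib import Path
from typing import Callable, Set


def blast_read_file(fasta: Path, db: str) -> str:
    """Run BLASTN on one FASTA file and return tabular (outfmt 6) output."""
    result = subprocess.run(
        ["blastn", "-query", str(fasta), "-db", db,
         "-outfmt", "6", "-evalue", "0.001"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def poll_once(read_dir: Path, seen: Set[Path],
              on_new: Callable[[Path], None]) -> int:
    """Call on_new for each unseen FASTA file; return how many were new."""
    new_files = sorted(p for p in read_dir.glob("*.fasta") if p not in seen)
    for path in new_files:
        on_new(path)
        seen.add(path)
    return len(new_files)


def watch(read_dir: Path, on_new: Callable[[Path], None],
          interval_s: float = 5.0) -> None:
    """Poll read_dir forever, processing new read files as they appear."""
    seen: Set[Path] = set()
    while True:
        poll_once(read_dir, seen, on_new)
        time.sleep(interval_s)


# Demo: one made-up read file, polled twice with a stand-in callback.
with tempfile.TemporaryDirectory() as tmp:
    reads = Path(tmp)
    (reads / "read_0001.fasta").write_text(">read_0001\nACGT\n")
    seen_files: Set[Path] = set()
    handled = []
    first = poll_once(reads, seen_files, handled.append)
    second = poll_once(reads, seen_files, handled.append)

print(first, second, [p.name for p in handled])
```

In real use the callback would wrap `blast_read_file` against each reference database and update the running difference statistics, so the sample ID can be recomputed after every new read.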
More adventures in real-time ID: Kew Science Festival 2016
All six species identified
correctly from blinded samples.
Fastest time-to-ID:
20 minutes
R3 databases:
Samples: 6
Avg yield: 32K reads / 46Mbp
N50: 7.4Kbp
12-48h sequencing
Science festival IDs:
6 samples, 3 days
Avg yield: 23K reads / 33Mbp
N50: 2.9Kbp
Implications of
ubiquitous
sequencing in
real-time
App store informatics
Real-time phylogenomics
Informatics:
Field-based sequencing
Real-time analyses
Asynchronous
computation
Phylogenomics:
Metrics on ‘tree space’
Relaxing orthology
Neutral models of
genomic evolution
Big genomic data for macroevolutionary questions
What is the tree of
life?
Is ‘sequence space’
constrained – is
evolution reproducible
(‘replay-the-tape’)?
Are higher clades
‘real’?
Why species?
Genomes?
Individuals?
Images – Wikimedia commons CC BY-SA
(clockwise from top left: Jeroen Rouwkema, @Nelsonramirezdearellano, author’s own, @soerfm)
Thanks
RBG Kew:
Alexander S.T. Papadopulos (@metallophyte)
Andrew Helmstetter (@ajhelmstetter)
Dion Devey, Robyn Cowan
ONT:
Dan Turner, Richard Ronan, Gerrard Coyne
