
Inference and informatics in a 'sequenced' world


Short lecture on my recent work on real-time phylogenomics, its implications for bioinformatics research, and future directions in genomic/phylogenetic modelling that explicitly account for phylogeny, synteny and identity through coloured graphs.

University of Reading, 2nd August 2017

Published in: Science

  1. Informatics and inference in a sequenced world. Dr. Joe Parker, Early Career Research Fellow (Phylogenomics), Royal Botanic Gardens, Kew. @lonelyjoeparker
  2. Joe Parker - background. [Figure: axis labels ‘VL 4 length’ and ‘Average VL 1 length’; groups ‘Neut -’ and ‘Neut +’.]
  3. Incredible times for bioscience. Images: Wikimedia Commons, CC BY-SA (clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73).
  4. Step back: molecular evolution. “Horizontal gene transfer occurs x times more frequently in these lineages, because of this biology.” “Convergent evolution is rare in most genes, in most organisms, but y times greater in these gene families … because of this biology.” “New chromosomes are created and destroyed at rates z and q in this reproductive strategy … because of this biology.”
  5. Field-based DNA sequencing
  6. Snowdonia, HelloWorld & ‘tent-seq’. A. thaliana and Arabidopsis lyrata: congeneric species; reference genomes available. Field-sequenced (MinION) and lab-sequenced (Illumina™). Orthogonal BLAST: 4 sample × sequencer combinations. Compare TRUE and FALSE rates for varying ID statistic cutoffs.
  7. Tasty pics. Conditions: 100% humidity; 6-13ºC. Essential kit: 800 W generator, 3 laptops, centrifuge, waterbath, polystyrene boxes (lots), kettle (…!). Yield: >400 Mbp data in three days; A. thaliana ~2.01x coverage.
  8. Field- vs. lab-sequenced sample ID. Match individual reads to each reference with BLAST. Compare match lengths in TRUE and FALSE cases. ‘Length bias’ ID stat: length_TRUE - length_FALSE. Compare TRUE and FALSE rates as the length bias cutoff varies. [Figure panels: MiSeq (lab); MinION (field).]
  9. Bitty data (1): partial queries. Subsample MinION output. Repeat ID pipeline, record mean ID stat s_bias. Replicates: N = 30. Simulate from 10^0 to 10^4 reads (≈ instant → hours).
  10. Bitty data (2): partial references. Take reference genome at high contiguity. Fragment randomly to target (low) contiguity. Repeat read identification using fragmented DB. Simulate N50 ≈ 1,000 bp to N50 ≈ 10 Mbp.
  11. Keeping it simple: Kew Science Festival. Six species: whole-genome skim samples with MinION in preparation. Build BLAST DBs from skimmed data. Select ‘unknown’ (blinded) sample, extract DNA and resequence in real time. Compare to partial DBs in six-way BLAST competition. Live ID?
  12. de novo genome assembly

      | Data | MiSeq only | MiSeq + MinION |
      | --- | --- | --- |
      | Assembler | ABySS | hybridSPAdes |
      | Illumina reads, 300 bp paired-end | 8,033,488 | 8,033,488 |
      | Illumina data (yield) | 2,418 Mbp | 2,418 Mbp |
      | MinION reads, R7.3 + R9 kits, N50 ~ 4,410 bp | - | 96,845 |
      | MinION data (yield) | - | 240 Mbp |
      | Approx. coverage | 19.49x | 19.49x + 2.01x |
      | Assembly key statistics: | | |
      | # contigs | 24,999 | 10,644 |
      | Longest contig | 90 kbp | 414 kbp |
      | N50 contiguity | 7,853 bp | 48,730 bp |
      | Fraction of reference genome (%) | 82 | 88 |
      | # N’s per 100 kbp | 1.7 | 5.4 |
      | # mismatches per 100 kbp | 518 | 588 |
      | # indels per 100 kbp | 120 | 130 |
      | Largest alignment | 76,935 bp | 264,039 bp |
      | CEGMA gene completeness estimate: | | |
      | # genes | 219 of 248 | 245 of 248 |
      | % genes | 88% | 99% |
  13. Wait – genes? Entire chloroplast genome (~150 kbp). Plastid coding loci. Individual field-sequenced MinION reads.
  14. Real-time phylogenomics. Filtered reads → gene models (TAIR10 CDS code; annotation: SNAP; 1:1 reciprocal BLAST) → multiple sequence alignments (MUSCLE, trimAl) → gene trees → consensus tree (*BEAST; RAxML, TreeAnnotator). [Figure: cumulative counts of unique genes and all genes (‘lab’ being transported!).]
  15. Emerging health threats & globalisation
      Acute oak decline: a syndrome-type oak disease
        • Unknown cause, no treatment
        • ca. 200 million oaks in GB … amenity & timber value: ~£500/tree
        • Emerged ca. 2004, spreading rapidly
        • Significant morbidity and mortality
      Defra ‘Futureproofing Plant Health’ initiative
        • Test field-based methods
        • Balanced survey of microbial community composition (healthy & affected individuals)
        • Overcome ascertainment bias
        • Pilot training of non-experts
        • Draw conclusions relevant to rapid-response plant health monitoring in the UK
      © 2016 Katy Reed / Forest Research
  16. Recap. From lab-based … to ‘app store’ genomics.
  17. Problems with phylogeny … and comparative genomics. Suh (2016) Zool. Scripta, doi:10.1111/zsc.12213; Zapata et al. (2016) PNAS 113:E4052-E4060, © 2016 National Academy of Sciences.
  18. Three-colour graphs: phylogeny, synteny & identity. Key: extant node; inferred node; synteny edge (physical connection); phylogeny edge (evolutionary connection); identity edge (organismal connection). [Figure: example graph with nodes labelled a, b, c, d, e, x, y, z.]
  19. Three-colour graphs: phylogeny, synteny & identity. Key: extant node; inferred node; synteny edge (physical connection); phylogeny edge (evolutionary connection); identity edge (organismal connection). [Figure panels: Duplication; Tetraploid hybrid formed; Diploidization; Inversion; HGT. Node labels include a1, b1, b’3, c1, x1, y1.]
  20. Final thoughts. Portable sequencing, by anyone, means really Big Data. Informatics connecting this data through explicit models is inference. Scalable, reproducible, sustainable research: bionode.js, Singularity.
  21. Thanks, funders, contacts and questions. Oxford Nanopore Technologies Ltd.: Dan Turner, Richard Ronan, Gerrard Coyne. RBG Kew: Alexander S.T. Papadopulos (@metallophyte), Andrew Helmstetter (@ajhelmstetter), Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T. Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch. QMUL: Laura Kelly, Kalina Davies, Steve Rossiter. Oxford: Aris Katzourakis, Oli Pybus, Jayna Raghwani. Others: Forest Research: Daegan Inward, Katy Reed; Dstl: Claire Lonstale, James Taylor; Birmingham: Nick Loman, Josh Quick; U. Utah: Bryn Dentinger; Imperial: James Rosindell. This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust. @lonelyjoeparker
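
Slide 8 defines the ‘length bias’ ID statistic as the match length against the TRUE reference minus the match length against the FALSE reference, and compares TRUE and FALSE rates as a cutoff varies. A minimal Python sketch of that bookkeeping follows. The decision rule (a read is a TRUE call when its bias exceeds the cutoff, a FALSE call when it falls below the negative cutoff), the function names and the alignment lengths are assumptions made for illustration, not the pipeline used in the talk.

```python
from typing import List, Sequence, Tuple


def length_bias(len_true: int, len_false: int) -> int:
    """Length-bias ID statistic for one read: length_TRUE - length_FALSE."""
    return len_true - len_false


def rates_at_cutoff(biases: Sequence[int], cutoff: int) -> Tuple[float, float]:
    """Return (TRUE rate, FALSE rate) at a given cutoff.

    Assumed rule: bias > cutoff is a TRUE call, bias < -cutoff is a FALSE
    call, and anything in between is left unassigned.
    """
    if not biases:
        raise ValueError("need at least one read")
    n = len(biases)
    true_rate = sum(1 for b in biases if b > cutoff) / n
    false_rate = sum(1 for b in biases if b < -cutoff) / n
    return true_rate, false_rate


# Hypothetical (made-up) BLAST alignment lengths per read:
# (length against the true reference, length against the false reference).
hits: List[Tuple[int, int]] = [(900, 100), (450, 400), (0, 300), (1200, 0), (250, 250)]
biases = [length_bias(t, f) for t, f in hits]

# Sweep a few cutoffs to see how the two rates trade off.
for cutoff in (0, 100, 500):
    true_rate, false_rate = rates_at_cutoff(biases, cutoff)
    print(f"cutoff={cutoff:>3}  TRUE={true_rate:.2f}  FALSE={false_rate:.2f}")
```

Raising the cutoff leaves more reads unassigned, so both rates can only stay level or fall; the useful cutoff is the one where the FALSE rate has dropped much faster than the TRUE rate.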
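
Slide 9 describes subsampling the MinION output, repeating the ID pipeline and recording the mean ID statistic over N = 30 replicates at each read count. The sketch below shows one way that resampling loop could look in Python. It assumes the per-read length-bias values have already been computed, samples reads without replacement, and uses a made-up pool of values; `subsample_means` is a hypothetical helper name.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


def subsample_means(
    biases: Sequence[float],
    sizes: Sequence[int],
    replicates: int = 30,
    seed: int = 0,
) -> Dict[int, List[float]]:
    """For each subsample size, draw `replicates` random subsets of reads
    (without replacement) and record the mean ID statistic of each subset."""
    rng = random.Random(seed)
    pool = list(biases)
    results: Dict[int, List[float]] = {}
    for size in sizes:
        if not 1 <= size <= len(pool):
            raise ValueError(f"cannot draw {size} reads from {len(pool)}")
        results[size] = [mean(rng.sample(pool, size)) for _ in range(replicates)]
    return results


# Made-up pool of per-read length-bias values standing in for real output.
pool_rng = random.Random(1)
pool = [pool_rng.randint(-500, 1500) for _ in range(10_000)]

# 10^0 to 10^4 reads, as on the slide.
sizes = [10 ** k for k in range(5)]
for size, means in subsample_means(pool, sizes).items():
    spread = max(means) - min(means)
    print(f"{size:>6} reads: spread of replicate means = {spread:.1f}")
```

The spread of the replicate means at each size gives a rough picture of how many reads are needed before the statistic settles.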
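
Slide 10 fragments a high-contiguity reference at random until it reaches a target (low) contiguity, measured as N50. The following Python sketch illustrates the idea on contig lengths only. It assumes breakpoints fall uniformly along the genome and that fragmentation stops as soon as N50 is at or below the target; the toy genome size and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def n50(lengths: Sequence[int]) -> int:
    """N50: the length of the contig at which the running total of contigs,
    sorted longest first, reaches at least half the total assembly length."""
    if not lengths:
        raise ValueError("no contigs")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for positive lengths")


def fragment_to_target(lengths: Sequence[int], target_n50: int, seed: int = 0) -> List[int]:
    """Cut contigs at random positions until N50 <= target_n50."""
    if target_n50 < 1:
        raise ValueError("target_n50 must be at least 1")
    rng = random.Random(seed)
    pieces = list(lengths)
    while n50(pieces) > target_n50:
        # Choosing a piece in proportion to its length, then a uniform cut
        # inside it, approximates a uniform breakpoint along the genome.
        i = rng.choices(range(len(pieces)), weights=pieces, k=1)[0]
        if pieces[i] < 2:
            continue  # a 1 bp piece cannot be cut further
        cut = rng.randint(1, pieces[i] - 1)
        pieces[i:i + 1] = [cut, pieces[i] - cut]
    return pieces


# Toy example: one 100 kbp contig fragmented down to N50 <= 1,000 bp.
pieces = fragment_to_target([100_000], target_n50=1_000, seed=1)
print(len(pieces), "pieces, N50 =", n50(pieces))
```

Recomputing N50 after every cut is quadratic in the number of pieces, which is fine for a toy example but would need a smarter stopping check for a real genome fragmented down to ~1,000 bp.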
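
Slides 18 and 19 describe graphs with two kinds of node (extant, inferred) and three colours of edge (synteny, phylogeny, identity). A minimal Python sketch of such a data structure follows. The class and method names are hypothetical, and the toy example reflects only one reading of the slide key (in particular, treating an identity edge as linking loci that belong to the same organism); it is not the author's model.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, FrozenSet, List, Set


class Colour(Enum):
    SYNTENY = "synteny"      # physical connection
    PHYLOGENY = "phylogeny"  # evolutionary connection
    IDENTITY = "identity"    # organismal connection


class ThreeColourGraph:
    """Undirected graph whose edges each carry one of three colours."""

    def __init__(self) -> None:
        self.extant: Dict[str, bool] = {}  # node name -> True if extant, False if inferred
        self.edges: Dict[Colour, Set[FrozenSet[str]]] = defaultdict(set)

    def add_node(self, name: str, extant: bool = True) -> None:
        self.extant[name] = extant

    def add_edge(self, u: str, v: str, colour: Colour) -> None:
        if u not in self.extant or v not in self.extant:
            raise KeyError("both endpoints must be added as nodes first")
        if u == v:
            raise ValueError("self-loops are not allowed in this sketch")
        self.edges[colour].add(frozenset((u, v)))

    def neighbours(self, node: str, colour: Colour) -> List[str]:
        """Nodes joined to `node` by an edge of the given colour, sorted by name."""
        return sorted(
            next(iter(pair - {node}))
            for pair in self.edges[colour]
            if node in pair
        )


# Toy example (assumed): loci a and b in two genomes, plus an inferred ancestor of a.
g = ThreeColourGraph()
for name in ("a1", "b1", "a2", "b2", "x1"):
    g.add_node(name)
g.add_node("a0", extant=False)            # inferred ancestral locus
g.add_edge("a1", "b1", Colour.SYNTENY)    # a and b are physically linked in genome 1
g.add_edge("a2", "b2", Colour.SYNTENY)    # and in genome 2
g.add_edge("a0", "a1", Colour.PHYLOGENY)  # a1 and a2 descend from a0
g.add_edge("a0", "a2", Colour.PHYLOGENY)
g.add_edge("a1", "x1", Colour.IDENTITY)   # assumed: same organism, no physical link

print(g.neighbours("a0", Colour.PHYLOGENY))
print(g.neighbours("a1", Colour.SYNTENY))
```

Keeping one edge set per colour makes it cheap to view the same nodes as a phylogeny, a synteny map or a set of organisms, which is the point of colouring the edges rather than building three separate graphs.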