
Real-time Phylogenomics: Joe Parker

Talk given at a technology/informatics company, London, Feb 2018.

Published in: Science

  1. 1. Real-time phylogenomics or ‘Some interesting problems in genomic big data’ Joe Parker Early-career Research Fellow, RBG Kew @lonelyjoeparker: joe.parker@kew.org
  2. 2. Incredible times for bioscience Images – Wikimedia commons CC BY-SA (clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
  3. 3. Background
  4. 4. MENU Some definitions Adventures with a genome sequencer Evolution is complex Real-time data & mass sequencing Final thoughts: the cosmology of life
  5. 5. 1. Definitions
  6. 6. Genes and genomes >ENA|AY819028|AY819028.1 Capsicum annuum cultivar Hot 1493 acyltransferase (Pun1) gene, complete cds. TCATTAGAAGGTCATACCGCTCCACGAAAATGCACCTTGAAAGATATAACACGGACAACGAATCATTATCCCCATCATCACTATTACTCCCACTTCC CTTGCACTCTTCACTGTCACCACTGACACTCCGCTTGGCAACATTTTCACTAGAATCGACGTAGTCGCTTATCTCCTTTAACTCCGAATCTGATTCG GACACCGACTGTTCAATTTTCTTTCTTTTTGAGACTTTTTCAACTGCTTCAGTTCTTCTTTTTCTACTGTTACCAGCGGTACCGGCTTTGCGTTTAGG AATGATGTTTTTTTTACCCATTTTCAACACAATCTACACCTAAAGAACAAATCTCCCATTTTTAGTTCATAGACCACAAGTCTATCAACAGAAATAACT CAATGCTCAAATGAACCCCCCTCCCCCCAAAAAAAAATTAACAAACACCCCACCATTAAACAGTTCACTACACAAACATACAATAACTGAACCAAAAT CCAACATGCAATATCAAAACACAACAATTACTAAAATCAAACTAATGCACCTAATCAAACTAATTAGCTATTAATATTTCAATTTTCACTATTTCAGCAA TCATGTTTTAAAAGAATTTCATACGTCTGAAAATTGATATATATCTAGGGCATTCTCATTTCATAGACCACGGGTCTACGGATAGACCTCGGGTCTAC GAACAGAAATAGGTCGCTGTTCAAATCAAAATGCCAAAATAACTCTTCAAACAACTATTATCCCACCATTCAACACTTCGTTGCTAAATAAACCACAA CTAAACCAAAACACCAAATTCGAAGAAAAAATTTCTACATCACTACGAGTTGATTAGCAAAAAAAAACGTTTAAATGGATCTAGAAATGATCGAAACT TGATTTTAACTAACCTTGCAAAGCAGCAACAACCCCTTAGTAGCTGGAGAAGAAGACGAAATGAAAATGGCATTTTTGGAAGAAGTAGTTTCAAAAG CAGGAGTTGGGAATTGAAGAGGAGAGAGAGGGTGGGTTTTTTTAAATATTGGAATAATTGGAGGGTGTTAGGTGTATTATATTAAATTTGTAAAGTT GTAAAAATGATGAATTGGTCCCTTGGCCGATGCGTGGGCCCCACTTTTTCATAAAAAATAAATCAAAAAGAAATTAAGTAGGTATTTGACAAATTAAT TTTGGAGGGTTCCTTCTTTGCCAATTATTCCCCACTAAGCTACTCTCATTCACTCTTATATTATAGATTATAGTATAAAGTAATACAAACTATGAATTG TTTTTATATTTTATTTTACAAGTTATGAATAGTGTTTATATAGGTCTCTATTTCCATACAATCACATTTTGTGGGCAGTTTTTTTGGGATTGTCACGAAG GCGAGGTTTGTTCATTTTGTGGAAAGAGAATTGGATTTCTACATTTTTATCATCTTCTAGGTGTGATGTTGATACTACTATTTGCCCAAATATTTGTTT TAAACATATTAATATTATGTATCAAAATGTGTACAATATAATTTAACACACGTGCAGTATGCATGTATCGCGAAACTAGTTAATTACATGCATCACATG TAATAGCAATAGTATTATTGTACGACGTACTAATATATTAGTATCTATTCTAGCTACTAATTTCCTCTTAACCGTCTCCATGCTGAAAACAACGCCACA GTGCAACGAGCCTTCTATAAAAGTTGAATTATATAAAAATAAGGTACAGTTTAGAAATAAAACTAACAAAAAGGTAACCTATAGTTTGGGGGTTGGGT 
AGAGGTTGTTTAGCCAGTAACTCTATTATTTCATTTCCTTTTGTCTATATAAGTGTATCCATATATGCAAGAAATGTCAACCGGCCAGCAGCATATAT TTATTTGTTAAATTAATTATGGCTTTTGCATTACCATCATCACTTGTTTCAGTTTGTAACAAATCTTTTATCAAACCTTCCTCTCTCACCCCCTCTACAC TTAGATTTCACAAGCTATCTTTCATCGATCAATCTTTAAGTAATATGTATATCCCTTGTGCATTTTTTTACCCTAAAGTACAACAAAGACTAGAAGACT CCAAAAATTCTGATGAGCTTTCCCATATAGCCCACTTGCTACAAACATCTCTATCACAAACTCTAGTCTCTTACTATCCTTATGCTGGAAAGTTGAAG GACAATGCTACTGTTGACTGTAACGATATGGGAGCTGAGTTCTTGAGTGTTCGAATAAAATGTTCCATGTCTGAAATTCTTGATCATCCTCATGCAT CTCTTGCAGAGAGCATAGTTTTGCCCAAGGATTTGCCTTGGGCGAATAATTGTGAAGGTGGTAATTTGCTTGTAGTTCAAGTAAGTAAGTTTGATTG TGGGGGAATAGCCATCAGTGTATGCTTTTCGCACAAGATTGGTGATGGTTGCTCTCTGCTTAATTTCCTTAATGATTGGTCTAGCGTTACTCGTGAT CATACGACAACAACTTTAGTTCCATCTCCTAGATTTGTAGGAGATTCAGTCTTCTCTACACAAAAATATGGTTCTCTCATTACGCCACAAATTTTGTC CGATCTCAACCAGTGCGTACAGAAAAGACTCATTTTTCCTACAGATAAGTTAGATGCACTTCGAGCTAAGGTAATACTACCATCGTCCATTATTGTTT GTCTTACGGTATTTTTGAAAAGAATAATATTTAATAGTCTTCTTGAGACATATTTCACTTAACAAGCCTAGGCTATTTAGTCTATTTGTAGAAGCTACT CTTAAACGCCTCACTTAGTTAATAGCACTCCACTTATTGGTGTCAAAAACTACTCTTGGACATGTCATTTACTTAATAACACTCCACTTAATTATCGAA
  7. 7. Alignment, assembly, annotation Li et al. (2011) EBI / NCBI / DDBJ
  8. 8. Three millennia of modelling life
  9. 9. Phylogenetic trees H. sapiens ATG CTC TAT GAG P. troglodytes ATG CTC TTT GAG G. gorilla ATG CTT TAT TAC P. troglodytes G. gorilla H. sapiens P. troglodytes 11 9 8
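A minimal sketch of the comparison behind this slide, assuming the three 12-base sequences are already aligned as printed. It counts identical sites for each pair of species. The function name `matching_sites` is illustrative, not from the talk. The resulting counts agree with the 11, 9 and 8 shown on the slide, although the transcript does not say what those numbers denote.

```python
from itertools import combinations

# Sequences as printed on the slide, with the codon spacing removed.
sequences = {
    "H. sapiens": "ATGCTCTATGAG",
    "P. troglodytes": "ATGCTCTTTGAG",
    "G. gorilla": "ATGCTTTATTAC",
}


def matching_sites(a: str, b: str) -> int:
    """Count aligned positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b))


# One entry per unordered pair of species.
pairwise = {
    (s1, s2): matching_sites(sequences[s1], sequences[s2])
    for s1, s2 in combinations(sequences, 2)
}

for (s1, s2), n in pairwise.items():
    print(f"{s1} vs {s2}: {n} of {len(sequences[s1])} sites identical")
```

The pair with the most identical sites (human and chimpanzee) is the pair a simple distance-based method would join first.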
  10. 10. Phylogenetic trees
  11. 11. 2. Adventures with DNA sequencing
  12. 12. Field-sequencing for real Conditions: 100% humidity; 6-13 °C Essential kit: 800 W generator, 3x laptops, Centrifuge, Waterbath, Polystyrene boxes (lots), Kettle(…!) Yield: >400 Mbp data in three days; A. thaliana ~2.01x coverage
  13. 13. Snowdonia, HelloWorld & ‘tent-seq’ A. thaliana Arabidopsis lyrata Congeneric species; Reference genomes available Field-sequenced (MinION) & Lab-sequenced (Illumina™) Orthogonal BLAST: 4 sample*sequencer combinations Compare TRUE & FALSE rates for varying ID statistic cutoffs
  14. 14. Field- vs. lab-sequenced sample ID Match individual reads to each reference with BLAST Compare match lengths in TRUE and FALSE cases ‘Length bias’ ID stat: length_TRUE - length_FALSE Compare TRUE & FALSE rates as length bias cutoff varies MiSeq (lab) MinION (field)
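The length-bias statistic on this slide can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the talk: the `ReadHits` record layout is hypothetical, the three-way calling rule (TRUE, FALSE, unassigned) is an assumption, and the four reads are toy values rather than results.

```python
from dataclasses import dataclass


@dataclass
class ReadHits:
    """Best alignment lengths for one read (hypothetical record layout)."""
    read_id: str
    length_true: int   # alignment length against the correct reference
    length_false: int  # alignment length against the other reference


def length_bias(hit: ReadHits) -> int:
    """Length bias as defined on the slide: length_TRUE - length_FALSE."""
    return hit.length_true - hit.length_false


def rates_at_cutoff(hits, cutoff):
    """Return (true_rate, false_rate, unassigned_rate) at one cutoff.

    Assumed rule (not from the slides): bias > cutoff is a TRUE call,
    bias < -cutoff is a FALSE call, anything else is left unassigned.
    """
    if not hits:
        raise ValueError("need at least one read")
    biases = [length_bias(h) for h in hits]
    n_true = sum(b > cutoff for b in biases)
    n_false = sum(b < -cutoff for b in biases)
    n = len(biases)
    return n_true / n, n_false / n, (n - n_true - n_false) / n


# Toy values for illustration only; these are not results from the talk.
toy_hits = [
    ReadHits("read1", 900, 120),
    ReadHits("read2", 450, 400),
    ReadHits("read3", 0, 210),
    ReadHits("read4", 1500, 0),
]

for cutoff in (0, 100, 500):
    print(cutoff, rates_at_cutoff(toy_hits, cutoff))
```

Raising the cutoff trades assigned reads for fewer FALSE calls, which is the comparison the slide describes for MiSeq and MinION data.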
  15. 15. Bitty data (1) partial queries Subsample MinION output Repeat ID pipeline, record mean ID stat s_bias Replicates: N = 30 Simulate from 10^0 – 10^4 reads (≈instant → hours)
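The subsampling experiment can be sketched like this, assuming a per-read bias value has already been computed for every read. The per-read values below are synthetic placeholders drawn from an arbitrary normal distribution, and `subsample_means` is a hypothetical helper name. Only the replicate count (30) and the range of sample sizes come from the slide.

```python
import random
from statistics import mean


def subsample_means(biases, n_reads, n_replicates=30, seed=0):
    """Mean bias of n_reads reads drawn without replacement, one value per replicate."""
    rng = random.Random(seed)
    return [mean(rng.sample(biases, n_reads)) for _ in range(n_replicates)]


# Synthetic placeholder data: 20,000 made-up per-read bias values.
_rng = random.Random(42)
toy_biases = [_rng.gauss(200.0, 400.0) for _ in range(20_000)]

for exponent in range(5):  # 10^0 ... 10^4 reads
    n_reads = 10 ** exponent
    reps = subsample_means(toy_biases, n_reads)
    print(f"{n_reads:>6} reads: min {min(reps):8.1f}  max {max(reps):8.1f}")
```

The spread between replicates narrows as more reads are drawn, which is how one would judge how few reads suffice for a stable identification.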
  16. 16. Bitty data (2) partial references Take reference genome at high contiguity Fragment randomly to target (low) contiguity Repeat read identification using fragmented DB Simulate N50 ≈1,000bp to N50 ≈ 10Mbp
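A rough sketch of fragmenting a reference to a target contiguity. N50 is computed in the usual way: the contig length at which the running total, taken from the longest contig downwards, first reaches half the assembly. The fragmentation rule (repeatedly cut the longest piece at a uniformly random point) is an assumption made for illustration, since the slide says only that the reference is fragmented randomly.

```python
import random


def n50(lengths):
    """Length of the contig at which the cumulative sum, longest first, reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def fragment_to_n50(lengths, target_n50, seed=0):
    """Cut the longest piece at a random point until N50 <= target_n50.

    Assumed procedure, not necessarily the one used in the talk.
    """
    if target_n50 < 1:
        raise ValueError("target_n50 must be at least 1")
    rng = random.Random(seed)
    pieces = list(lengths)
    while n50(pieces) > target_n50:
        longest = max(pieces)
        pieces.remove(longest)
        cut = rng.randint(1, longest - 1)  # both halves keep at least 1 bp
        pieces.extend([cut, longest - cut])
    return pieces


# Toy reference: a single 1 Mbp sequence fragmented to N50 <= 10 kbp.
pieces = fragment_to_n50([1_000_000], 10_000, seed=1)
print(len(pieces), "pieces, N50 =", n50(pieces))
```

Only lengths are tracked here. A real run would also have to cut the sequences themselves before rebuilding the BLAST database.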
  17. 17. Keeping it simple: Kew Science Festival Six species: whole genome- skim samples with MinION in preparation Build BLAST DBs from skimmed data Select ‘unknown’ (blinded) sample, extract DNA and resequence in real-time Compare to partial DBs in six-way BLAST competition Live ID ?
  18. 18. de novo genome assembly

      | Data | MiSeq only | MiSeq + MinION |
      |---|---|---|
      | Assembler | Abyss | hybridSPAdes |
      | Illumina reads, 300bp paired-end | 8,033,488 | 8,033,488 |
      | Illumina data (yield) | 2,418 Mbp | 2,418 Mbp |
      | MinION reads, R7.3 + R9 kits, N50 ~ 4,410bp | - | 96,845 |
      | MinION data (yield) | - | 240 Mbp |
      | Approx. coverage | 19.49x | 19.49x + 2.01x |
      | Assembly key statistics: | | |
      | # contigs | 24,999 | 10,644 |
      | Longest contig | 90 Kbp | 414 Kbp |
      | N50 contiguity | 7,853 bp | 48,730 bp |
      | Fraction of reference genome (%) | 82 | 88 |
      | Errors, per 100 kbp: | | |
      | # N’s | 1.7 | 5.4 |
      | # mismatches | 518 | 588 |
      | # indels | 120 | 130 |
      | Largest alignment | 76,935 bp | 264,039 bp |
      | CEGMA gene completeness estimate: | | |
      | # genes | 219 of 248 | 245 of 248 |
      | % genes | 88% | 99% |
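As a quick consistency check on the assembly figures, coverage is yield divided by genome size, so each yield and coverage pair implies a genome size. The sketch below uses only numbers from the slide. The variable and function names are illustrative.

```python
# Figures copied from the slide.
illumina_yield_mbp, illumina_coverage = 2418, 19.49
minion_yield_mbp, minion_coverage = 240, 2.01
minion_reads = 96_845
cegma_found = {"MiSeq only": 219, "MiSeq + MinION": 245}
cegma_total = 248


def implied_genome_size_mbp(yield_mbp, coverage):
    """Genome size implied by coverage = yield / genome size."""
    return yield_mbp / coverage


genome_from_illumina = implied_genome_size_mbp(illumina_yield_mbp, illumina_coverage)
genome_from_minion = implied_genome_size_mbp(minion_yield_mbp, minion_coverage)

# Mean MinION read length in bp (a mean, so not the same thing as the read N50).
mean_minion_read_bp = minion_yield_mbp * 1_000_000 / minion_reads

# CEGMA completeness as a percentage of the 248 core genes.
cegma_percent = {name: 100 * n / cegma_total for name, n in cegma_found.items()}

print(f"Implied genome size (Illumina): {genome_from_illumina:.1f} Mbp")
print(f"Implied genome size (MinION):   {genome_from_minion:.1f} Mbp")
print(f"Mean MinION read length:        {mean_minion_read_bp:.0f} bp")
for name, pct in cegma_percent.items():
    print(f"CEGMA completeness, {name}: {pct:.0f}%")
```

Both implied genome sizes come out near 120 Mbp, so the quoted coverage figures are mutually consistent to within a few percent.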
  19. 19. Wait – genes? Entire chloroplast genome (~150kbp) Plastid coding loci Individual field- sequenced MinION reads
  20. 20. Real-time phylogenomics Filtered reads Gene models TAIR10 CDS code Annotation SNAP 1:1 reciprocal BLAST Multiple sequence alignments MUSCLE Trimal Gene trees → Consensus tree *BEAST RAxML, TreeAnnotator Cumulative counts: Unique genes All genes (‘Lab’ being transported!)
  21. 21. 3. Life is complex
  22. 22. Evolution is complex Nakhleh, (2009); Suh (2016) Zool. Scripta. doi:10.1111/zsc.12213 Zapata et al. (2016) PNAS 113:E4052-E4060 ©2016 National Academy of Sciences
  23. 23. Networks Strimmer & Moulton (2000) MBE Solís-Lemus & Ané (2016) PLoS Genet. Nakhleh (2009) in: Heath & Ramakrishnan, eds., Springer
  24. 24. Key: Extant node Inferred node Synteny edge (physical connection) Phylogeny edge (evolutionary connection) Identity edge (organismal connection) Three-colour graphs: phylogeny, synteny & identity a b c d x y z e a a
  25. 25. Three-colour graphs: phylogeny, synteny & identity a1 b1 a2 b2 a3 b3 b’3 a4 b4 a5 b5 Duplication Key: Extant node Inferred node Synteny edge (physical connection) Phylogeny edge (evolutionary connection) Identity edge (organismal connection) a1 b1 b2 a2 a3 b3 c1 c3 c1 Inversion
  26. 26. Three-colour graphs: phylogeny, synteny & identity a1 b1 a2 b2 a3 b3 a4 b4 x4 y4 x3 y3 x1 y1 Tetraploid hybrid formed Diploidization (secondary loss) Key: Extant node Inferred node Synteny edge (physical connection) Phylogeny edge (evolutionary connection) Identity edge (organismal connection) a1 b1 a2 b2 x1 x5 x2 x3 x4 x7 x6 Horizontal gene transfer
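One way to hold a three-colour graph in code is an undirected graph whose edges each carry one of the three kinds named in the key. This is a sketch of a possible data structure, not the author's implementation. The class and method names are hypothetical, and the toy graph is not a reconstruction of any figure in the slides.

```python
from collections import defaultdict
from enum import Enum


class EdgeKind(Enum):
    SYNTENY = "synteny"      # physical connection
    PHYLOGENY = "phylogeny"  # evolutionary connection
    IDENTITY = "identity"    # organismal connection


class ThreeColourGraph:
    """Undirected graph whose edges each carry one of three kinds."""

    def __init__(self):
        self.extant = {}                   # node name -> True (extant) or False (inferred)
        self._adjacent = defaultdict(set)  # (node, kind) -> neighbouring nodes

    def add_node(self, name, extant=True):
        self.extant[name] = extant

    def add_edge(self, u, v, kind):
        for node in (u, v):
            if node not in self.extant:
                raise KeyError(f"unknown node: {node}")
        self._adjacent[(u, kind)].add(v)
        self._adjacent[(v, kind)].add(u)

    def neighbours(self, node, kind):
        return sorted(self._adjacent.get((node, kind), ()))


# Toy example: two linked loci in an inferred ancestor and in an extant descendant.
g = ThreeColourGraph()
g.add_node("a1", extant=False)
g.add_node("b1", extant=False)
g.add_node("a2")
g.add_node("b2")
g.add_edge("a1", "b1", EdgeKind.SYNTENY)
g.add_edge("a2", "b2", EdgeKind.SYNTENY)
g.add_edge("a1", "a2", EdgeKind.PHYLOGENY)
g.add_edge("b1", "b2", EdgeKind.PHYLOGENY)

print(g.neighbours("a2", EdgeKind.SYNTENY), g.neighbours("a2", EdgeKind.PHYLOGENY))
```

The identity kind is defined but left unused in the toy graph, because the transcript does not show which nodes those edges join.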
  27. 27. Step back: molecular evolution “Horizontal gene transfer occurs x more frequently in these lineages, because of this biology” “Convergent evolution is rare in most genes, in most organisms, but y times greater in these gene families …because of this biology” “New chromosomes are created & destroyed at z, q, rates in this reproductive strategy …because of this biology”
  28. 28. 4. Real big data
  29. 29. How big? How many?
      Species:
      • Mammals: 10^3 – 10^5
      • Animals: 10^6 – 10^7
      • Plants: 10^5 – 10^6
      • Bacteria: >10^6 ?
      • Fungi: >>10^5-???
      DNA sequencing machines: ~10^4 – 10^5 (each ~10^9 – 10^10 bp/day)
      (Organisms): (…a lot)

      | Example | Typical feature size (Mbp) |
      |---|---|
      | Largest known genomes | ~240,000 (10^11) |
      | Vascular plant | ~0.1 – 10,000 (10^8 - 10^10) |
      | Human genome | 3,000 (3x10^9) |
      | Most fungi | ~0.1 – 10 (10^5 - 10^7) |
      | Bacteria | ~0.1 – 1 (10^6) |
      | Viruses | ~0.01 (10^5) |
      | Mitochondria / chloroplasts | 0.017 / 0.2 (10^4; 10^5) |
      | ‘Barcoding’ locus | ~1000bp (10^2) |
      | Illumina read | 75-300bp |
      | Nanopore read | 100bp – >1Mbp |
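To put the machine counts in perspective, the slide's order-of-magnitude figures can be multiplied out. This is back-of-envelope arithmetic on the quoted ranges only. The conversion to human-genome equivalents counts raw bases at 1x and is an added illustration, not a claim from the talk.

```python
# Order-of-magnitude ranges quoted on the slide.
machines = (10**4, 10**5)                  # sequencing machines worldwide
bp_per_machine_per_day = (10**9, 10**10)   # output per machine per day
human_genome_bp = 3 * 10**9                # human genome size from the slide

low_bp_per_day = machines[0] * bp_per_machine_per_day[0]
high_bp_per_day = machines[1] * bp_per_machine_per_day[1]

# Raw bases expressed as whole human-genome equivalents at 1x.
low_genomes = low_bp_per_day // human_genome_bp
high_genomes = high_bp_per_day // human_genome_bp

print(f"Aggregate output: {low_bp_per_day:.0e} to {high_bp_per_day:.0e} bp/day")
print(f"Human-genome equivalents at 1x: {low_genomes:,} to {high_genomes:,} per day")
```

At both ends of the range, the raw daily output runs to thousands of human-genome equivalents.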
  30. 30. The tools aren’t in great shape but the prizes are there bionode.js bioboxes.org Singularity Portable sequencing, by anyone means really Big Data Informatics connecting this data through explicit models is inference Scalable, reproducible, sustainable research:
  31. 31. Ubiquitous / citizen-sequencing From lab-based… … to ‘app store’ genomics
  32. 32. Metagenomics… © 2016 Katy Reed / FR EPI2ME; Juul et al. (2016) / Metrichor.com
  33. 33. Health: defining ‘normal’ Credit: Darryl Leja, NHGRI
  34. 34. Food-chain: inputs and outputs
  35. 35. Environmental sampling (Species’ abundance varying in time and space) (AKA “trampling on the ecologists’ toes”)
  36. 36. Genomic observatories • [GBIF pic] • orms
  37. 37. The Tree (network) of Life iTOL: Ivica Letunic, Mariana Ruiz Villarreal
  38. 38. Final thoughts
  39. 39. Thanks, funders, contacts and questions Oxford Nanopore Technologies Ltd.: Dan Turner, Richard Ronan, Gerrard Coyne U Bangor: Alexander S.T. Papadopulos (@metallophyte) RBG Kew: Postdocs: Andrew Helmstetter (@ajhelmstetter); Tim Coker Thanks: Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T. Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch QMUL: Laura Kelly, Kalina Davies, Steve Rossiter Oxford: Aris Katzourakis, Oli Pybus, Jayna Raghwani Others: Forest Research: Daegan Inward, Katy Reed Dstl: Claire Lonstale, James Taylor Birmingham: Nick Loman, Josh Quick U. Utah: Bryn Dentinger Imperial: James Rosindell This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust. @lonelyjoeparker: joe.parker@kew.org
