Real-time Phylogenomics: Joe Parker

  1. 1. Real-time phylogenomics or ‘Some interesting problems in genomic big data’ Joe Parker Early-career Research Fellow, RBG Kew @lonelyjoeparker: joe.parker@kew.org
  2. 2. Incredible times for bioscience Images – Wikimedia commons CC BY-SA (clockwise from top left: Jeroen Rouwkema, @aGastya, author’s own, @RE73)
  3. 3. Background
  4. 4. MENU Some definitions Adventures with a genome sequencer Evolution is complex Real-time data & mass sequencing Final thoughts: the cosmology of life
  5. 5. 1. Definitions
  6. 6. Genes and genomes >ENA|AY819028|AY819028.1 Capsicum annuum cultivar Hot 1493 acyltransferase (Pun1) gene, complete cds. TCATTAGAAGGTCATACCGCTCCACGAAAATGCACCTTGAAAGATATAACACGGACAACGAATCATTATCCCCATCATCACTATTACTCCCACTTCC CTTGCACTCTTCACTGTCACCACTGACACTCCGCTTGGCAACATTTTCACTAGAATCGACGTAGTCGCTTATCTCCTTTAACTCCGAATCTGATTCG GACACCGACTGTTCAATTTTCTTTCTTTTTGAGACTTTTTCAACTGCTTCAGTTCTTCTTTTTCTACTGTTACCAGCGGTACCGGCTTTGCGTTTAGG AATGATGTTTTTTTTACCCATTTTCAACACAATCTACACCTAAAGAACAAATCTCCCATTTTTAGTTCATAGACCACAAGTCTATCAACAGAAATAACT CAATGCTCAAATGAACCCCCCTCCCCCCAAAAAAAAATTAACAAACACCCCACCATTAAACAGTTCACTACACAAACATACAATAACTGAACCAAAAT CCAACATGCAATATCAAAACACAACAATTACTAAAATCAAACTAATGCACCTAATCAAACTAATTAGCTATTAATATTTCAATTTTCACTATTTCAGCAA TCATGTTTTAAAAGAATTTCATACGTCTGAAAATTGATATATATCTAGGGCATTCTCATTTCATAGACCACGGGTCTACGGATAGACCTCGGGTCTAC GAACAGAAATAGGTCGCTGTTCAAATCAAAATGCCAAAATAACTCTTCAAACAACTATTATCCCACCATTCAACACTTCGTTGCTAAATAAACCACAA CTAAACCAAAACACCAAATTCGAAGAAAAAATTTCTACATCACTACGAGTTGATTAGCAAAAAAAAACGTTTAAATGGATCTAGAAATGATCGAAACT TGATTTTAACTAACCTTGCAAAGCAGCAACAACCCCTTAGTAGCTGGAGAAGAAGACGAAATGAAAATGGCATTTTTGGAAGAAGTAGTTTCAAAAG CAGGAGTTGGGAATTGAAGAGGAGAGAGAGGGTGGGTTTTTTTAAATATTGGAATAATTGGAGGGTGTTAGGTGTATTATATTAAATTTGTAAAGTT GTAAAAATGATGAATTGGTCCCTTGGCCGATGCGTGGGCCCCACTTTTTCATAAAAAATAAATCAAAAAGAAATTAAGTAGGTATTTGACAAATTAAT TTTGGAGGGTTCCTTCTTTGCCAATTATTCCCCACTAAGCTACTCTCATTCACTCTTATATTATAGATTATAGTATAAAGTAATACAAACTATGAATTG TTTTTATATTTTATTTTACAAGTTATGAATAGTGTTTATATAGGTCTCTATTTCCATACAATCACATTTTGTGGGCAGTTTTTTTGGGATTGTCACGAAG GCGAGGTTTGTTCATTTTGTGGAAAGAGAATTGGATTTCTACATTTTTATCATCTTCTAGGTGTGATGTTGATACTACTATTTGCCCAAATATTTGTTT TAAACATATTAATATTATGTATCAAAATGTGTACAATATAATTTAACACACGTGCAGTATGCATGTATCGCGAAACTAGTTAATTACATGCATCACATG TAATAGCAATAGTATTATTGTACGACGTACTAATATATTAGTATCTATTCTAGCTACTAATTTCCTCTTAACCGTCTCCATGCTGAAAACAACGCCACA GTGCAACGAGCCTTCTATAAAAGTTGAATTATATAAAAATAAGGTACAGTTTAGAAATAAAACTAACAAAAAGGTAACCTATAGTTTGGGGGTTGGGT 
AGAGGTTGTTTAGCCAGTAACTCTATTATTTCATTTCCTTTTGTCTATATAAGTGTATCCATATATGCAAGAAATGTCAACCGGCCAGCAGCATATAT TTATTTGTTAAATTAATTATGGCTTTTGCATTACCATCATCACTTGTTTCAGTTTGTAACAAATCTTTTATCAAACCTTCCTCTCTCACCCCCTCTACAC TTAGATTTCACAAGCTATCTTTCATCGATCAATCTTTAAGTAATATGTATATCCCTTGTGCATTTTTTTACCCTAAAGTACAACAAAGACTAGAAGACT CCAAAAATTCTGATGAGCTTTCCCATATAGCCCACTTGCTACAAACATCTCTATCACAAACTCTAGTCTCTTACTATCCTTATGCTGGAAAGTTGAAG GACAATGCTACTGTTGACTGTAACGATATGGGAGCTGAGTTCTTGAGTGTTCGAATAAAATGTTCCATGTCTGAAATTCTTGATCATCCTCATGCAT CTCTTGCAGAGAGCATAGTTTTGCCCAAGGATTTGCCTTGGGCGAATAATTGTGAAGGTGGTAATTTGCTTGTAGTTCAAGTAAGTAAGTTTGATTG TGGGGGAATAGCCATCAGTGTATGCTTTTCGCACAAGATTGGTGATGGTTGCTCTCTGCTTAATTTCCTTAATGATTGGTCTAGCGTTACTCGTGAT CATACGACAACAACTTTAGTTCCATCTCCTAGATTTGTAGGAGATTCAGTCTTCTCTACACAAAAATATGGTTCTCTCATTACGCCACAAATTTTGTC CGATCTCAACCAGTGCGTACAGAAAAGACTCATTTTTCCTACAGATAAGTTAGATGCACTTCGAGCTAAGGTAATACTACCATCGTCCATTATTGTTT GTCTTACGGTATTTTTGAAAAGAATAATATTTAATAGTCTTCTTGAGACATATTTCACTTAACAAGCCTAGGCTATTTAGTCTATTTGTAGAAGCTACT CTTAAACGCCTCACTTAGTTAATAGCACTCCACTTATTGGTGTCAAAAACTACTCTTGGACATGTCATTTACTTAATAACACTCCACTTAATTATCGAA
  7. 7. Alignment, assembly, annotation Li et al. (2011) EBI / NCBI / DDBJ
  8. 8. Three millennia of modelling life
  9. 9. Phylogenetic trees H. sapiens ATG CTC TAT GAG P. troglodytes ATG CTC TTT GAG G. gorilla ATG CTT TAT TAC P. troglodytes G. gorilla H. sapiens P. troglodytes 11 9 8
  10. 10. Phylogenetic trees
  11. 11. 2. Adventures with DNA sequencing
  12. 12. Field-sequencing for real Conditions 100% humidity; 6-13ºC Essential kit 800w generator 3x laptops Centrifuge Waterbath Polystyrene boxes (lots) Kettle(…!) Yield >400Mbp data in three days; A. thaliana ~2.01x coverage
  13. 13. Snowdonia, HelloWorld & ‘tent-seq’ A. thaliana Arabidopsis lyrata Congeneric species; Reference genomes available Field-sequenced (MinION) & Lab-sequenced (Illumina™) Orthogonal BLAST: 4 sample*sequencer combinations Compare TRUE & FALSE rates for varying ID statistic cutoffs
  14. 14. Field- vs. lab-sequenced sample ID Match individual reads to each reference with BLAST Compare match lengths in TRUE and FALSE cases ‘Length bias’ ID stat: lengthTRUE - lengthFALSE Compare TRUE & FALSE rates as length bias cutoff varies MiSeq (lab) MinION (field)
  15. 15. Bitty data (1) partial queries Subsample MinION output Repeat ID pipeline, record mean ID stat s_bias Replicates: N = 30 Simulate from 10^0 – 10^4 reads (≈instant → hours)
  16. 16. Bitty data (2) partial references Take reference genome at high contiguity Fragment randomly to target (low) contiguity Repeat read identification using fragmented DB Simulate N50 ≈1,000bp to N50 ≈ 10Mbp
  17. 17. Keeping it simple: Kew Science Festival Six species: whole genome- skim samples with MinION in preparation Build BLAST DBs from skimmed data Select ‘unknown’ (blinded) sample, extract DNA and resequence in real-time Compare to partial DBs in six-way BLAST competition Live ID ?
  18. 18. de novo genome assembly
      Data                                          MiSeq only    MiSeq + MinION
      Assembler                                     Abyss         hybridSPAdes
      Illumina reads, 300bp paired-end              8,033,488     8,033,488
      Illumina data (yield)                         2,418 Mbp     2,418 Mbp
      MinION reads, R7.3 + R9 kits, N50 ~ 4,410bp   -             96,845
      MinION data (yield)                           -             240 Mbp
      Approx. coverage                              19.49x        19.49x + 2.01x
      Assembly key statistics:
      # contigs                                     24,999        10,644
      Longest contig                                90 Kbp        414 Kbp
      N50 contiguity                                7,853 bp      48,730 bp
      Fraction of reference genome (%)              82            88
      Errors, per 100 kbp:
      # N’s                                         1.7           5.4
      # mismatches                                  518           588
      # indels                                      120           130
      Largest alignment                             76,935 bp     264,039 bp
      CEGMA gene completeness estimate:
      # genes                                       219 of 248    245 of 248
      % genes                                       88%           99%
  19. 19. Wait – genes? Entire chloroplast genome (~150kbp) Plastid coding loci Individual field- sequenced MinION reads
  20. 20. Real-time phylogenomics Filtered reads Gene models TAIR10 CDS code Annotation SNAP 1:1 reciprocal BLAST Multiple sequence alignments MUSCLE Trimal Gene trees → Consensus tree *BEAST RAxML, TreeAnnotator Cumulative counts: Unique genes All genes (‘Lab’ being transported!)
  21. 21. 3. Life is complex
  22. 22. Evolution is complex Nakhleh, (2009); Suh (2016) Zool. Scripta. doi:10.1111/zsc.12213 Zapata et al. (2016) PNAS 113:E4052-E4060 ©2016 National Academy of Sciences
  23. 23. Networks Strimmer & Moulton (2000) MBE Solís-Lemus & Ané (2016) PLoS Genet. Nakhleh (2009) in: Heath & Ramakrishnan, eds., Springer
  24. 24. Key: Extant node Inferred node Synteny edge (physical connection) Phylogeny edge (evolutionary connection) Identity edge (organismal connection) Three-colour graphs: phylogeny, synteny & identity a b c d x y z e a a
  25. 25. Three-colour graphs: phylogeny, synteny & identity a1 b1 a2 b2 a3 b3 b’3 a4 b4 a5 b5 Duplication Key: Extant node Inferred node Synteny edge (physical connection) Phylogeny edge (evolutionary connection) Identity edge (organismal connection) a1 b1 b2 a2 a3 b3 c1 c3 c1 Inversion
  26. 26. Three-colour graphs: phylogeny, synteny & identity a1 b1 a2 b2 a3 b3 a4 b4 x4 y4 x3 y3 x1 y1 Tetraploid hybrid formed Diploidization (secondary loss) Key: Extant node Inferred node Synteny edge (physical connection) Phylogeny edge (evolutionary connection) Identity edge (organismal connection) a1 b1 a2 b2 x1 x5 x2 x3 x4 x7 x6 Horizontal gene transfer
  27. 27. Step back: molecular evolution “Horizontal gene transfer occurs x more frequently in these lineages, because of this biology” “Convergent evolution is rare in most genes, in most organisms, but y times greater in these gene families …because of this biology” “New chromosomes are created & destroyed at z, q, rates in this reproductive strategy …because of this biology”
  28. 28. 4. Real big data
  29. 29. How big? How many?
      Species:
      • Mammals: 10^3 – 10^5
      • Animals: 10^6 – 10^7
      • Plants: 10^5 – 10^6
      • Bacteria: >10^6 ?
      • Fungi: >>10^5-???
      DNA sequencing machines: ~10^4 – 10^5 (each ~10^9 – 10^10 bp/day)
      (Organisms): (…a lot)
      Example                        Typical feature size (Mbp)
      Largest known genomes          ~240,000 (10^11)
      Vascular plant                 ~0.1 – 10,000 (10^8 – 10^10)
      Human genome                   3,000 (3x10^9)
      Most fungi                     ~0.1 – 10 (10^5 – 10^7)
      Bacteria                       ~0.1 – 1 (10^6)
      Viruses                        ~0.01 (10^5)
      Mitochondria / chloroplasts    0.017 / 0.2 (10^4; 10^5)
      ‘Barcoding’ locus              ~1000bp (10^2)
      Illumina read                  75-300bp
      Nanopore read                  100bp – >1Mbp
  30. 30. The tools aren’t in great shape but the prizes are there bionode.js bioboxes.org Singularity Portable sequencing, by anyone means really Big Data Informatics connecting this data through explicit models is inference Scalable, reproducible, sustainable research:
  31. 31. Ubiquitous / citizen-sequencing From lab-based… … to ‘app store’ genomics
  32. 32. Metagenomics… © 2016 Katy Reed / FR EPI2ME; Juul et al. (2016) / Metrichor.com
  33. 33. Health: defining ‘normal’ Credit: Darryl Leja, NHGRI
  34. 34. Food-chain: inputs and outputs
  35. 35. Environmental sampling (Species’ abundance varying in time and space) (AKA “trampling on the ecologists’ toes”)
  36. 36. Genomic observatories • [GBIF pic] • orms
  37. 37. The Tree (network) of Life iTOL: Ivica Letunic, Mariana Ruiz Villarreal
  38. 38. Final thoughts
  39. 39. Thanks, funders, contacts and questions Oxford Nanopore Technologies Ltd. Dan Turner, Richard Ronan, Gerrard Coyne U Bangor: Alexander S.T. Papadopulos (@metallophyte) RBG Kew: Postdocs: Andrew Helmstetter (@ajhelmstetter); Tim Coker Thanks: Dion Devey, Robyn Cowan, Tim Wilkinson, Stephen Dodsworth, Pepijn Kooij, Felix Forest, Bill Baker, Jan T. Kim, Jenny Williams, Abigail Barker, Mark Lee, Jim Clarkson, Mike Chester, Ester Gaya, Lisa Pokorny, Laszlo Csiba, Paul Wilkin, Richard Buggs, Mike Fay, Mark Chase, Ilia Leitch QMUL Laura Kelly, Kalina Davies, Steve Rossiter Oxford Aris Katzourakis, Oli Pybus, Jayna Raghwani Others Forest Research: Daegan Inward, Katy Reed Dstl: Claire Lonstale, James Taylor Birmingham: Nick Loman, Josh Quick U. Utah: Bryn Dentinger Imperial: James Rosindell This research was conducted in the Sackler Phylogenomics Laboratory and was supported by the Calleva Foundation Phylogenomic Research Programme and the Sackler Trust @lonelyjoeparker: joe.parker@kew.org
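
The ‘length bias’ ID statistic from slide 14 (lengthTRUE - lengthFALSE, with TRUE and FALSE rates compared as the cutoff varies) can be illustrated with a minimal Python sketch. It is not the talk's pipeline: the `ReadHits` record, the function names, the rule that a read is only called when its bias exceeds the cutoff in either direction, and the example alignment lengths are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadHits:
    """Best BLAST alignment lengths (bp) for one read (hypothetical record)."""
    read_id: str
    length_true: int   # best hit against the reference of the species actually sampled
    length_false: int  # best hit against the other, congeneric reference (0 if no hit)


def length_bias(hit: ReadHits) -> int:
    """Length bias ID statistic: lengthTRUE - lengthFALSE."""
    return hit.length_true - hit.length_false


def rates_at_cutoff(hits, cutoff):
    """Return (true_rate, false_rate) at a non-negative length-bias cutoff.

    Assumed rule: a read is assigned to the TRUE species if its bias is
    greater than the cutoff, to the FALSE species if its bias is below
    minus the cutoff, and is left unassigned otherwise.
    """
    if cutoff < 0:
        raise ValueError("cutoff must be non-negative")
    if not hits:
        raise ValueError("no reads supplied")
    n_true = sum(1 for h in hits if length_bias(h) > cutoff)
    n_false = sum(1 for h in hits if length_bias(h) < -cutoff)
    return n_true / len(hits), n_false / len(hits)


# Made-up alignment lengths, for illustration only.
hits = [
    ReadHits("r1", 1200, 300),  # bias  900
    ReadHits("r2", 450, 500),   # bias  -50
    ReadHits("r3", 800, 0),     # bias  800
    ReadHits("r4", 150, 140),   # bias   10
]

for cutoff in (0, 100):
    true_rate, false_rate = rates_at_cutoff(hits, cutoff)
    print(f"cutoff={cutoff}: TRUE rate={true_rate:.2f}, FALSE rate={false_rate:.2f}")
```

Raising the cutoff trades assigned reads for fewer wrong assignments, which is the comparison the slide describes for the MiSeq and MinION data.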

Editor's Notes

  • Definitions
    Genetic data, what it is, where it’s found, how we get it
    A genome / assembly
    Annotation and alignment
    A phylogeny or tree

  • Naming stuff
    The ladder of life
    Binomial / ontological naming
    Darwin and The Tree
    Networks
  • Portable sequencing: also long reads and real-time
  • Portable
    Real-time
    Long
    easy
  • Data in terrible conditions but anyone can do it
    Social media reach: The Atlantic, The Economist
  • Direct, explicit, orthogonal test – and can it work?
    Picture of experimental design

    Outline of the study
    In terms of bioinformatics questions
    Funding: a first pot and timeline…
  • We compare match lengths, and MinION allows long matches
  • EXPLAIN AXES: precision improves rapidly
  • EXPLAIN AXES: a partial REFERENCE would work, too
  • MORE FUNDING. SO simple a kid could do it? Yes
    The challenge I set myself: OK, it’s a simple experiment. Can I build a test simple enough that a child can understand it?

    SOCIAL MEDIA
    Funding: NANOPORE
  • Data from one time and place can and should be useful elsewhere
    lash a bit of proper genomics
  • Single reads match whole genes – meat & drink
  • EXPLAIN AXES postdoc-years PAPER ACCEPTED
  • Genomes come in all shapes and sizes
    Organisms too, life cycles
    (A)sexual reproduction; clonal replication
    Even genetic alphabet not fixed
    And mutation isn’t random
    Incongruence and reticulation
    Horizontal gene transfer
    Incomplete lineage sorting
    Hybridization
    Recombination

  • Networks attempt to summarise this
    Splits graphs, directed graphical models / planar graphs
  • Definition
    Features
    Generalised representations
    Phylogeny edge information workable-outable
    Other edges present in metadata; inferable
    Generative model; easy to interpret…

    Here’s a common framework for all these studies
    How to infer – sounds like a nightmare
    Many of the edges in this network are really there already
    Shifting paradigms, making linking easier

    Explicitly model phylogeny, synteny and identity
    Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena

    Any nodes connecting to an identity edge are considered completely connected

    Maximum # edges ~n (2n-1)/2
    Digraphs ~n!!
    Possible ancestors from one locus on n taxa essentially inverse func of when they coalesce (can have m generations of n ancestors until an event where n(m)<n(t))

  • EXAMPLES
    Gene duplication e.g. paralogue in animal
    Tetraploid formed then secondary diploidization, e.g. plant
    Inversion in a genome
    Unlinked loci (e.g. bacterial plasmids) and HGT.

    How to infer – sounds like a nightmare
    Many of the edges in this network are really there already
    Shifting paradigms, making linking easier

    Explicitly model phylogeny, synteny and identity
    Edge support reflects evidence; deviations from neutrality reflect hypotheses/models/phenomena
  • We need enough data to turn observations into empirical comparisons, into models and laws
    We know a lot about evolutionary mechanisms

    And a lot about (a handful of genomes)

    What we know tells us “it’s complicated”
    Most genes don’t have simple orthologues etc etc etc, horizontal etc


    But we don’t, really, have an empirical understanding of how they fit together, e.g.:
    - “horizontal gene transfer occurs x more frequently in these lineages, because of this biology”
    - adaptive molecular convergence is rare in most genes, in most organisms, but y times greater in these gene families because of this biology
    - new chromosomes are created (by duplication, endogenisation, polyploidy) and destroyed (by diploidization) at z rates in this reproductive strategy because of this biology
  • Global databases
    Algorithms, methods and theory
    Generally bespoke / slow / in-house
    Special sauce

    Formally linking datasets and models is inferring the network of life
    Shifts the job for bioinformatics from something it’s good at – sophisticated, incremental analysis

    To something computers in general are great at: linking elements

    In this case informatics doesn’t enable research, it is the process of inference

    It’s relatively easy to write a new standalone app to do x, or analyse some big dataset
    Reproducibility and scaling-up science mean we must work harder on the links

    Informatics as inference.
    The lonely astronomers.
  • HPCs to apps: Exponential data, linear understanding.
    Pause – to recap
    This is important because it’s where we tie it together and show my contribution:

    Portable sequencers, easier to use
    More places
    More experimenters
    More data
    More noise
    Efficient comparison?
    Dynamic computation?
    Clever hashing

    Portable, mass sequencing is really here
    Massive potential for de novo genomics; phylogenomics
    But while we’re accumulating information at an exponential rate, we’re integrating it linearly, in essence
    … where are we going?
  • Superset of species ID
    Distribution of species, sometimes functional focus
    We may not have positive controls
    We usually don’t know ‘normal’ distribution
    From a fringe idea to routine
  • Gut microbiome
    Many other tissues
    UTI
    Dental
    Cardiac
    Respiratory
    Not just human health; pathogen surveillance
  • Dodgy burgers and provenance
    Pests, crop inputs
    Feeds
    Supporting ecosystem health/services
  • Ecosystems; habitats; communities; niches
    Properly ecologists’ domain
    Thousands of species
    Abundances shift in time and space
    All trackable with DNA
    Where do ecosystem services come from?
    How healthy?
  • What’s really out there
    Longitudinal data
    Fixed locations
    Parallels with earth sensing
    Autonomous in-situ sensor platforms
  • Data collection in aggregate means we can asymptotically assemble the components we need for the Tree Of Life
    This is loosely defined as the Map Of The World for genomic stuff
    Not exactly simple, but a computational and engineering challenge, not really intellectually taxing (probably)
    Pretty much the biggest goal in evolution
  • The cosmology of life
    Why genomes/chromosomes?
    Why that size?
    Why organisms?
    Where is the root?
    Sequence-space and the network as a state-space
    Inflation
    Probability function of you
  • Funders
    Thanks
    Reach out
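
The three-colour graph described in the notes above (extant and inferred nodes joined by phylogeny, synteny and identity edges) can be sketched as a small data structure. This is an illustrative assumption, not the author's implementation: the class and method names are hypothetical, edges are stored undirected for simplicity (the notes also mention digraphs), and the toy example loosely follows the gene-duplication and unlinked-plasmid cases listed under EXAMPLES.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class EdgeKind(Enum):
    PHYLOGENY = "phylogeny"  # evolutionary connection
    SYNTENY = "synteny"      # physical connection
    IDENTITY = "identity"    # organismal connection


@dataclass
class ThreeColourGraph:
    """Hypothetical container for a three-colour graph."""
    extant: set = field(default_factory=set)
    inferred: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # items are (frozenset({u, v}), EdgeKind)

    def add_node(self, name, extant=True):
        (self.extant if extant else self.inferred).add(name)

    def add_edge(self, u, v, kind):
        if u == v:
            raise ValueError("self-loops are not allowed in this sketch")
        for node in (u, v):
            if node not in self.extant and node not in self.inferred:
                raise KeyError(f"unknown node: {node}")
        self.edges.add((frozenset((u, v)), kind))

    def neighbours(self, node, kind):
        """Nodes joined to `node` by an edge of the given colour."""
        return {
            other
            for pair, k in self.edges
            if k is kind and node in pair
            for other in pair - {node}
        }

    def edge_counts(self):
        """Number of edges of each colour."""
        return Counter(kind for _, kind in self.edges)


# Toy example: ancestral loci a and b; b duplicates in the descendant,
# which also carries an unlinked plasmid locus p.
g = ThreeColourGraph()
for name in ("a_anc", "b_anc"):
    g.add_node(name, extant=False)
for name in ("a", "b", "b_dup", "p"):
    g.add_node(name)

g.add_edge("a_anc", "b_anc", EdgeKind.SYNTENY)
g.add_edge("a", "b", EdgeKind.SYNTENY)
g.add_edge("b", "b_dup", EdgeKind.SYNTENY)
g.add_edge("a_anc", "a", EdgeKind.PHYLOGENY)
g.add_edge("b_anc", "b", EdgeKind.PHYLOGENY)
g.add_edge("b_anc", "b_dup", EdgeKind.PHYLOGENY)
g.add_edge("a", "p", EdgeKind.IDENTITY)  # same organism, not physically linked

print(g.edge_counts())
print(sorted(g.neighbours("b_anc", EdgeKind.PHYLOGENY)))
```

Keeping the edge colour as explicit data means the synteny and identity edges that the notes say are "really there already" in metadata can be loaded directly, leaving only the phylogeny edges to be inferred.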