
Phylogenomic methods for comparative evolutionary biology - University College Dublin MSc - Joe Parker - 24th October 2014


Invited research seminar given to MSc students at University College Dublin on 24th October 2013.

I introduce the discipline of phylogenomics - comparative phylogenetic analyses of DNA sequences across genomes - and some of the applications and recent breakthroughs in the field.

As an in-depth case study I explain the methods and significance of our 2013 Nature paper on adaptive genotypic molecular convergence in echolocating mammals.

I then highlight some of the avenues of study on the frontiers of current research.



  1. 1. High-throughput comparative genomics 24th October 2013 Joe Parker, Queen Mary University London
  2. 2. Topics 1. Introduction 2. Background: why phylogenomics? 3. Examples 4. Practice 5. Case study 6. On the horizon 7. Over the horizon
  3. 3. Aims • Context of phylogenomics: Next-generation sequencing (NGS) • Why phylogenomics? • Practical analyses • Future developments
  4. 4. 1. Our Research
  5. 5. Lab Interests • Ecology and evolution of traits • Echolocation, sociality • NGS data for population genetics and phylogenomics
  6. 6. Activities • Phylogeny estimation/comparison • Molecular correlates of evolution; – site substitutions, dN/dS, composition • Simulation • Dataset limitations (R-L): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey
  7. 7. 2. Background
  8. 8. Next-generation sequencing
  9. 9. Why phylogenomics, not -genetics? • Causes of discordant signal – Incomplete lineage sorting – Lateral transfer – Recombination – Introgression
  10. 10. Quantitative biology • Multiple configurations • Hyperparameters empirically investigated • Determine sensitivity of results
  11. 11. Distributions • Genome-scale data provides context • Identify outliers: genes / taxa / trees • Compare values across biological systems
  12. 12. Integration with ‘Omics • Multiple databases • Functional data • Bibliographic information
  13. 13. 3. Example studies
  14. 14. Tsagkogeorga et al. (in press)
  15. 15. Salichos & Rokas (2013)
  16. 16. Backström et al. (2013)
  17. 17. Lindblad-Toh et al. (2011)
  18. 18. 4. Practice
  19. 19. Source material • Samples • Storage • Purification • Library prep
  20. 20. Sequencing • Genome – Sanger – Illumina – Pyro /454 – SOLiD – PacBio • Transcriptome / RNA-seq – MyBAITS • HiSeq / MiSeq • IonTorrent
  21. 21. Infrastructure • Desktop machines • Computing clusters • Grid systems • Cloud-based computation
  22. 22. Assembly, Annotation • Assembly – To reference (mapping) – De novo • Annotation – By homology – De novo •SOAPdenovo •MAKER •Velvet •Bowtie / Cufflinks / Tophat •Trinity
  23. 23. Alignment • PRANK • MUSCLE • MAFFT • Clustal
  24. 24. Phylogeny inference • MrBayes • RAxML • BEAST • MP-EST • STAR
  25. 25. Phylogenetic analysis • BEAST • HYPHY • PAML • Pipelines • LRT
  26. 26. 5. Case study
  27. 27. Parker et al. (2013) • De novo genomes: – four taxa – 2,321 protein-coding loci – 801,301 codons • Published: – 18 genomes • ~69,000 simulated datasets • ~3,500 cluster cores
  28. 28. Our pipeline for detecting genome-wide convergence
  29. 29. mean = 0.05
  30. 30. mean = 0.05 mean = -0.01 mean = -0.08 
  31. 31. Development cycle • Design • Wireframe & specify tests • Implement – Alignment: loadSequences(), getSubstitutions() – Phylogeny: trimTaxa(), getMRCA() – DataSeries: calculateECDF(), randomise() – Regression: getResiduals(), predictInterval() • Review, refine & refactor
  32. 32. Parker et al. (2013)
  33. 33. Parker et al. (2013)
  34. 34. 6. On the horizon
  35. 35. Environmental metagenomics
  36. 36. Models of computation • Cloud resources: Unlimited flexibility, finite time • Development trade-off – Off-the-shelf – Bespoke • Exploratory work – Real time genomic transects? • Essential fundamental data missing from nearly every system; – Diversity; structure; substitution rates; dN/dS; recombination; dispersal; lateral transfer
  37. 37. Serialisation • Process data remotely • Freeze-dry objects, download to desktop • Implement new methods directly on previously-analysed data
  38. 38. 7. Over the horizon • Real-time phylogenetics • Field phylogenetics • Alignment-free analyses
  39. 39. Conclusions • Why phylogenomics? • Practice • Comparative approach • Statistical context
  40. 40. Thanks • Steve Rossiter (1), James Cotton (2), Elia Stupka (3) & Georgia Tsagkogeorga (1): (1) School of Biological and Chemical Sciences, Queen Mary, University of London; (2) Wellcome Trust Sanger Institute; (3) Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan • Chris Walker & Dan Traynor: Queen Mary GridPP High-throughput Cluster • Chaz Mein & Anna Terry: Barts and The London Genome Centre • Mahesh Pancholi: School of Biological and Chemical Sciences • BBSRC (UK); Queen Mary, University of London
  41. 41. Resources • My email: Joe Parker (Queen Mary University of London): j.d.parker@qmul.ac.uk • Parker, J., Tsagkogeorga, G., Cotton, J.A., Liu, Y., Provero, P., Stupka, E. & Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. Nature 502(7470):228-231. doi:10.1038/nature12511 • Tsagkogeorga, G., Parker, J., Stupka, E., Cotton, J.A. & Rossiter, S.J. (2013) Phylogenomic analyses elucidate evolutionary relationships of the bats (Chiroptera). Curr. Biol. in the press. • Salichos, L. & Rokas, A. (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497:327-331. doi:10.1038/nature12130 • Backström, N., Zhang, Q. & Edwards, S.V. (2013) Evidence from a House Finch (Haemorhous mexicanus) Spleen Transcriptome for Adaptive Evolution and Biased Gene Conversion in Passerine Birds. MBE 30(5):1046-50. doi:10.1093/molbev/mst033 • Lindblad-Toh, K., Garber, M., Zuk, O., Lin, M.F., Parker, B.J., et al. (2011) A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478:476-482. doi:10.1038/nature10530 • Degnan, J.H. & Rosenberg, N.A. (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. TREE 24(6):332-340. doi:10.1016/j.tree.2009.01.009 • The Tree Of Life: http://phylogenomics.blogspot.co.uk/ • RNA-seq For Everyone: http://rnaseq.uoregon.edu/index.html • Evo-Phylo: http://www.davelunt.net/evophylo/tag/phylogenomics/ • OpenHelix: http://blog.openhelix.eu/ • Our blogs: http://evolve.sbcs.qmul.ac.uk/rossiter/ (lab) and http://www.lonelyjoeparker.com/?cat=11 (Joe)
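
Slide 31 lists calculateECDF() and randomise() as DataSeries methods, and slide 11 stresses placing a value in the context of a genome-wide distribution. The slides do not show how those methods were implemented, so the following is only a minimal Python sketch of the general idea, assuming a plain permutation scheme. The function names are borrowed from the slide in Python style; null_quantile(), mean() and all parameters are hypothetical and are not taken from the published pipeline.

```python
import bisect
import random


def calculate_ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def ecdf(x):
        # bisect_right gives the number of sorted values that are <= x
        return bisect.bisect_right(ordered, x) / n

    return ecdf


def randomise(values, rng):
    """Return a shuffled copy of values, leaving the input untouched."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def mean(values):
    return sum(values) / len(values)


def null_quantile(group_a, group_b, n_permutations=1000, seed=1):
    """Hypothetical helper: where does the observed difference in means
    fall within a null distribution built by shuffling group labels?"""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    split = len(group_a)
    null_diffs = []
    for _ in range(n_permutations):
        shuffled = randomise(pooled, rng)
        null_diffs.append(mean(shuffled[:split]) - mean(shuffled[split:]))
    # ECDF of the null differences, evaluated at the observed difference
    return calculate_ecdf(null_diffs)(observed)
```

A value of null_quantile() close to 0 or 1 would mark the observed difference as an outlier relative to the shuffled data. Real analyses would need to choose the test statistic and the randomisation scheme to suit the data.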

Editor's Notes

  • Quick through this
  • Moore’s law, sequencing data etc
    Order-of-magnitude improvements:
    Sequencing throughput, accuracy
    Computational power
  • Concatenated, RAxML
    B) per-locus support counts; RAxML concat and coalescent gave H1 overall
  • Almost as many discrete gene trees as genes
  • Backstrom - approach as measuring exercise
  • Surveying
  • Technologies and tools, mature
  • Technologies and tools
  • SOAPdenovo-Trans
    SOAPdenovo-Trans is a de novo transcriptome assembler inherited from the SOAPdenovo2 framework, designed for assembling transcriptomes with alternative splicing and different expression levels. The assembler provides a more comprehensive way to construct full-length transcript sets compared to SOAPdenovo2.
    Velvet/Oases
    The Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian bacterial artificial chromosomes (BACs). These preliminary transcripts are transferred to Oases, which uses paired-end read and long-read information to build transcript isoforms.
    Trans-ABySS
    ABySS (Assembly By Short Sequences) is a parallel, paired-end sequence assembler. Trans-ABySS is a software pipeline written in Python and Perl for analyzing ABySS-assembled transcriptome contigs. This pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels and to identify potential polyadenylation sites and candidate gene-fusion events.
    Trinity
    Trinity first divides the sequence data into a number of de Bruijn graphs, each representing transcriptional variations at a single gene or locus. It then extracts full-length splicing isoforms and distinguishes transcripts derived from paralogous genes from each graph separately. Trinity consists of three independent software modules, which are used sequentially to produce transcripts:
    Inchworm assembles the RNA-Seq data into transcript sequences, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
    Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or a family or set of genes that share a conserved sequence). Chrysalis then partitions the full read set among these separate graphs.
    Butterfly then processes the individual graphs in parallel, tracing the paths of reads within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
    Cufflinks
    Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
    Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation.
  • Pervasive phylogenetic incongruence
    test for phylogenetic discordance attributable to genetic convergence,
    when applied to different contexts it could equally be used to measure discordance that has arisen by other processes,
    some of which will be more applicable to tropical systems:
    - Horizontal gene transfer among bacteria
    - Introgression across species barriers
    - Incomplete lineage sorting
  • RUNTIME --- ~weeks --> hours
    Object-oriented design
    Separation of code into modular objects
    Re-use methods through inheritance
    Abstraction of behaviour allows modifications to parts of the API without affecting other tested code
    Incorporate other libraries
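
The assembler notes above mention that Velvet and Trinity build de Bruijn graphs from reads. As a rough illustration of that data structure only, the Python sketch below builds a graph whose nodes are (k-1)-mers and walks a single unbranched path into a contig. It is a toy under strong assumptions (error-free reads, one strand, no coverage information, only out-degree checked) and does not reproduce what any of the named assemblers actually do; build_de_bruijn() and extend_contig() are hypothetical names.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # each k-mer is an edge from its prefix to its suffix
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Extend from start while the current node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (successor,) = graph[node]
        if successor in visited:
            break  # stop rather than loop around a cycle
        contig += successor[-1]
        visited.add(successor)
        node = successor
    return contig


if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTA"]  # two overlapping toy reads
    graph = build_de_bruijn(reads, k=4)
    print(extend_contig(graph, "ATG"))
```

The walk stops at the first node with zero or several successors, which is where a real assembler would have to resolve branches caused by repeats, sequencing errors or alternative isoforms.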