High-throughput comparative genomics
24th October 2013 
Joe Parker,
Queen Mary University of London
Topics 
1. Introduction 
2. Background: why phylogenomics?
3. Examples 
4. Practice 
5. Case study 
6. On the horizon 
7. Over the horizon
Aims 
• Context of phylogenomics: Next-generation 
sequencing (NGS) 
• Why phylogenomics?
• Practical analyses 
• Future developments
1. Our Research
Lab Interests 
• Ecology and evolution of traits 
• Echolocation, sociality 
• NGS data for population genetics and phylogenomics
Activities 
• Phylogeny estimation/comparison 
• Molecular correlates of evolution; 
– site substitutions, dN/dS, composition 
• Simulation 
• Dataset limitations 
(R-L): Joe Parker; Georgia Tsagkogeorga; Kalina Davies; Steve Rossiter; Xiuguang Mao; Seb Bailey
2. Background
Next-generation sequencing
Why phylogenomics, not -genetics?
• Causes of discordant signal 
– Incomplete lineage sorting 
– Lateral transfer 
– Recombination 
– Introgression
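One simple way to see discordant signal is to count how many loci support each distinct gene-tree topology. The following is a minimal sketch rather than anything taken from the talk: it assumes rooted gene trees are already available as nested tuples of taxon names, and the function names `canonical` and `topology_counts` and the three example trees are hypothetical.

```python
from collections import Counter


def canonical(tree):
    """Return an order-independent form of a rooted tree given as nested tuples."""
    if isinstance(tree, str):  # a leaf is just a taxon name
        return tree
    return frozenset(canonical(child) for child in tree)


def topology_counts(gene_trees):
    """Count how many loci support each distinct rooted topology."""
    return Counter(canonical(tree) for tree in gene_trees)


# Hypothetical gene trees for three loci over taxa A-D (illustrative only).
gene_trees = [
    (("A", "B"), ("C", "D")),
    (("B", "A"), ("D", "C")),  # same topology, children listed in another order
    (("A", "C"), ("B", "D")),  # a discordant locus
]

counts = topology_counts(gene_trees)
n_topologies = len(counts)
most_common_support = counts.most_common(1)[0][1]
print(n_topologies, most_common_support)
```

Counting topologies says nothing about which process (incomplete lineage sorting, lateral transfer, recombination or introgression) produced the disagreement; it only quantifies how much disagreement there is.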
Quantitative biology 
• Multiple configurations 
• Hyperparameters 
empirically investigated 
• Determine sensitivity of 
results
Distributions 
• Genome-scale data 
provides context 
• Identify outliers 
– Genes / taxa / trees
• Compare values across 
biological systems
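As an illustration of using a genome-wide distribution as context, the sketch below flags loci that sit in either tail of the empirical distribution of some per-locus statistic. It is an assumption-laden example: the locus names, the values, the 5% tail threshold and the helper names `tail_fractions` and `flag_outliers` are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def tail_fractions(values):
    """Return a function giving (fraction <= x, fraction >= x) for the sample."""
    ordered = sorted(values)
    n = len(ordered)

    def tails(x):
        lower = bisect_right(ordered, x) / n        # empirical CDF at x
        upper = (n - bisect_left(ordered, x)) / n   # upper-tail fraction at x
        return lower, upper

    return tails


def flag_outliers(per_locus, alpha=0.05):
    """Return loci whose value lies in the lowest or highest alpha fraction."""
    tails = tail_fractions(per_locus.values())
    return sorted(
        locus for locus, value in per_locus.items()
        if min(tails(value)) <= alpha
    )


# Hypothetical per-locus statistic: 28 unremarkable loci plus two extremes.
per_locus = {f"locus{i:02d}": 0.10 + 0.01 * i for i in range(1, 29)}
per_locus["locus29"] = -1.50
per_locus["locus30"] = 2.50

outliers = flag_outliers(per_locus)
print(outliers)
```

The same function can be applied to values per gene, per taxon or per tree, which is what makes the comparison across biological systems possible.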
Integration with ‘Omics 
• Multiple databases 
• Functional data 
• Bibliographic information
3. Example studies
Tsagkogeorga et al. (in press)
Salichos & Rokas (2013)
Backström et al. (2013)
Lindblad-Toh et al. (2011)
4. Practice
Source material 
• Samples 
• Storage 
• Purification 
• Library prep
Sequencing 
• Genome 
– Sanger 
– Illumina 
– Pyro /454 
– SOLiD 
– PacBio 
• Transcriptome / RNA-seq 
– MyBAITS 
• HiSeq / MiSeq 
• IonTorrent
Infrastructure 
• Desktop machines 
• Computing clusters 
• Grid systems 
• Cloud-based computation
Assembly, Annotation 
• Assembly 
– To reference 
(mapping) 
– De novo 
• Annotation 
– By homology 
– De novo 
• SOAPdenovo
• MAKER
• Velvet
• Bowtie / Cufflinks / TopHat
• Trinity
Alignment 
• PRANK 
• MUSCLE 
• MAFFT 
• Clustal
Phylogeny inference 
• MrBayes 
• RAxML 
• BEAST 
• MP-EST 
• STAR
Phylogenetic analysis 
• BEAST 
• HYPHY 
• PAML 
• Pipelines 
• LRT
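The likelihood ratio test (LRT) in the list above compares the log-likelihoods of two nested models. A minimal sketch follows, restricted to the case where the models differ by exactly one free parameter so that the chi-square tail probability can be computed with the standard library alone. The function name and the log-likelihood values are hypothetical, and tools such as PAML or HyPhy report the log-likelihoods rather than being called here.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """LRT for nested models that differ by one free parameter (1 d.f.)."""
    # Test statistic: twice the gain in log-likelihood, floored at zero to
    # guard against small negative values caused by optimisation noise.
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Chi-square survival function for 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for a null and an alternative model.
stat, p = likelihood_ratio_test(lnl_null=-1000.0, lnl_alt=-998.0)
print(f"2*deltaLnL = {stat:.2f}, p = {p:.4f}")
```

For comparisons with more degrees of freedom, or with parameters on a boundary, the reference distribution differs and this shortcut should not be used as written.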
5. Case study
Parker et al. (2013)
• De novo genomes: 
– four taxa 
– 2,321 protein-coding loci 
– 801,301 codons 
• Published: 
– 18 genomes 
• ~69,000 simulated datasets 
• ~3,500 cluster cores
Our pipeline for detecting genome-wide convergence
[Figure panels: mean = 0.05; mean = -0.01; mean = -0.08]

Development cycle 
Design 
Wireframe & 
specify tests 
Implement 
Alignment 
loadSequences() 
getSubstitutions() 
Phylogeny 
trimTaxa() 
getMRCA() 
DataSeries 
calculateECDF() 
randomise() 
Regression 
getResiduals() 
predictInterval() 
Review, refine 
& refactor
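To make the design, test and implement cycle concrete, here is a small Python sketch of an `Alignment` object with two of the method names shown on the slide. Only the names come from the slide (rendered here in Python's snake_case); the behaviour, the FASTA input format and the example sequences are assumptions for illustration and are not the pipeline's actual code.

```python
class Alignment:
    """Hypothetical alignment container: taxon name -> aligned sequence."""

    def __init__(self):
        self.sequences = {}

    def load_sequences(self, fasta_text):
        """Parse FASTA-formatted text and check all sequences share a length."""
        name = None
        for line in fasta_text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]
                self.sequences[name] = ""
            elif name is not None:
                self.sequences[name] += line.upper()
        if len({len(seq) for seq in self.sequences.values()}) > 1:
            raise ValueError("sequences are not aligned to a common length")
        return self

    def get_substitutions(self, taxon_a, taxon_b):
        """List (site, state_a, state_b) where two taxa differ, ignoring gaps."""
        a, b = self.sequences[taxon_a], self.sequences[taxon_b]
        return [
            (site, x, y)
            for site, (x, y) in enumerate(zip(a, b))
            if x != y and "-" not in (x, y)
        ]


# Made-up sequences, used only to exercise the two methods.
fasta = """>bat
ATGGCC-TA
>dolphin
ATGGCTTTA
>cow
ATGACCTTA
"""

aln = Alignment().load_sequences(fasta)
print(aln.get_substitutions("bat", "dolphin"))
```

Writing the expected result of each method as an assertion before implementing it is one way to follow the "specify tests" step of the cycle.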
Parker et al. (2013)
Parker et al. (2013)
6. On the horizon
Environmental metagenomics
Models of computation 
• Cloud resources: Unlimited 
flexibility, finite time 
• Development trade-off 
– Off-the-shelf 
– Bespoke 
• Exploratory work 
– Real time genomic transects? 
• Essential fundamental data missing 
from nearly every system; 
– Diversity; structure; substitution rates; 
dN/dS; recombination; dispersal; lateral 
transfer
Serialisation 
• Process data remotely 
• Freeze-dry objects, download to 
desktop 
• Implement new methods directly 
on previously-analysed data
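A minimal sketch of the "freeze-dry" idea, assuming Python's standard `pickle` and `gzip` modules stand in for whatever serialisation mechanism a pipeline actually uses. The `freeze` and `thaw` helpers, the result fields and the values are hypothetical.

```python
import gzip
import pickle


def freeze(results):
    """Serialise an in-memory results object to compressed bytes."""
    return gzip.compress(pickle.dumps(results, protocol=pickle.HIGHEST_PROTOCOL))


def thaw(blob):
    """Restore a results object from bytes produced by freeze()."""
    return pickle.loads(gzip.decompress(blob))


# Hypothetical per-locus results, as might be computed on a remote cluster.
results = {
    "locus_0001": {"lnL": -1234.5, "dn_ds": 0.12},
    "locus_0002": {"lnL": -987.6, "dn_ds": 0.48},
}

blob = freeze(results)   # in practice written to a file and downloaded
restored = thaw(blob)    # back on the desktop

# A "new method" applied to previously analysed data, with no re-analysis.
mean_dn_ds = sum(r["dn_ds"] for r in restored.values()) / len(restored)
print(round(mean_dn_ds, 2))
```

Pickled data should only be loaded from trusted sources, since unpickling can execute arbitrary code.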
7. Over the horizon 
• Real-time phylogenetics 
• Field phylogenetics 
• Alignment-free analyses
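As one illustration of what an alignment-free comparison can look like, the sketch below computes a Jaccard distance between the k-mer sets of two sequences. This is a generic technique chosen for the example, not a method named in the talk, and the sequences and the choice of k = 3 are arbitrary.

```python
def kmers(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=3):
    """1 - |shared k-mers| / |all k-mers|; 0 = identical sets, 1 = disjoint."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a and not b:
        return 0.0  # both sequences shorter than k: treat as identical
    return 1.0 - len(a & b) / len(a | b)


# Arbitrary example sequences; no alignment step is needed.
distance = jaccard_distance("ACGTAC", "ACGTTT", k=3)
print(round(distance, 3))
```

Pairwise distances of this kind can be fed to a distance-based tree-building method, at the cost of discarding the site-by-site homology that an alignment provides.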
Conclusions 
• Why phylogenomics? 
• Practice 
• Comparative approach 
• Statistical context
Thanks 
Steve Rossiter1, James Cotton2, Elia Stupka3 & Georgia Tsagkogeorga1 
1 School of Biological and Chemical Sciences, Queen Mary, University of London
2 Wellcome Trust Sanger Institute
3 Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan
Chris Walker & Dan Traynor
Queen Mary GridPP High-throughput Cluster
Chaz Mein & Anna Terry
Barts and The London Genome Centre
Mahesh Pancholi
School of Biological and Chemical Sciences
BBSRC (UK); Queen Mary, University of London
Resources
• My email: Joe Parker (Queen Mary University of London): j.d.parker@qmul.ac.uk
• Parker, J., Tsagkogeorga, G., Cotton, J.A., Liu, Y., Provero, P., Stupka, E. & Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. Nature 502(7470):228-231. doi:10.1038/nature12511
• Tsagkogeorga, G., Parker, J., Stupka, E., Cotton, J.A. & Rossiter, S.J. (2013) Phylogenomic analyses elucidate evolutionary relationships of the bats (Chiroptera). Curr. Biol. in the press.
• Salichos, L. & Rokas, A. (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497:327-331. doi:10.1038/nature12130
• Backström, N., Zhang, Q. & Edwards, S.V. (2013) Evidence from a House Finch (Haemorhous mexicanus) Spleen Transcriptome for Adaptive Evolution and Biased Gene Conversion in Passerine Birds. MBE 30(5):1046-50. doi:10.1093/molbev/mst033
• Lindblad-Toh, K., Garber, M., Zuk, O., Lin, M.F., Parker, B.J., et al. (2011) A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478:476-482. doi:10.1038/nature10530
• Degnan, J.H. & Rosenberg, N.A. (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. TREE 24(6):332-340. doi:10.1016/j.tree.2009.01.009
• The Tree Of Life: http://phylogenomics.blogspot.co.uk/ 
• RNA-seq For Everyone: http://rnaseq.uoregon.edu/index.html 
• Evo-Phylo: http://www.davelunt.net/evophylo/tag/phylogenomics/ 
• OpenHelix: http://blog.openhelix.eu/ 
• Our blogs: http://evolve.sbcs.qmul.ac.uk/rossiter/ (lab) and http://www.lonelyjoeparker.com/?cat=11 (Joe)

Phylogenomic methods for comparative evolutionary biology - University College Dublin MSc - Joe Parker - 24th October 2014


Editor's Notes

  • #7 Quick through this
  • #9 Moore’s law, sequencing data etc. Order-of-magnitude improvements: sequencing throughput, accuracy; computational power
  • #15 A) Concatenated, RAxML; B) per-locus support counts. RAxML concatenated and coalescent analyses gave H1 overall
  • #16 Almost as many discrete gene trees as genes
  • #17 Backström: approach as measuring exercise
  • #18 Surveying
  • #21 Technologies and tools, mature
  • #22 Technologies and tools
  • #23 SOAPdenovo-Trans: a de novo transcriptome assembler inherited from the SOAPdenovo2 framework, designed for assembling transcriptomes with alternative splicing and differing expression levels. It provides a more comprehensive way to construct full-length transcript sets than SOAPdenovo2.
    Velvet/Oases: the Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs of up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian bacterial artificial chromosomes (BACs). These preliminary transcripts are transferred to Oases, which uses paired-end read and long-read information to build transcript isoforms.
    Trans-ABySS: ABySS (Assembly By Short Sequences) is a parallel, paired-end sequence assembler. Trans-ABySS is a software pipeline written in Python and Perl for analyzing ABySS-assembled transcriptome contigs. The pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels and to identify potential polyadenylation sites and candidate gene-fusion events.
    Trinity: first divides the sequence data into a number of de Bruijn graphs, each representing transcriptional variations at a single gene or locus. It then extracts full-length splicing isoforms and distinguishes transcripts derived from paralogous genes from each graph separately. Trinity consists of three independent software modules, which are used sequentially to produce transcripts. Inchworm assembles the RNA-Seq data into transcript sequences, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts. Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or a family or set of genes that share a conserved sequence). Chrysalis then partitions the full read set among these separate graphs. Butterfly then processes the individual graphs in parallel, tracing the paths of reads within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms and teasing apart transcripts that correspond to paralogous genes.
    Cufflinks: a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide. Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation.
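The de Bruijn graph idea that these notes mention for Velvet and Trinity can be illustrated with a toy example. This is a sketch under strong assumptions (error-free reads, a single strand, no repeated (k-1)-mers) and not how any of the named assemblers is implemented; the reads, the value of k and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk(graph, start):
    """Follow unambiguous edges from start, spelling out a contig."""
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:   # dead end or branch: stop extending
            break
        node = successors.pop()
        if node in seen:           # cycle: stop rather than loop forever
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Three overlapping, error-free toy reads from one short sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn(reads, k=4)
contig = walk(graph, "ATG")
print(contig)
```

Real assemblers additionally handle sequencing errors, coverage differences, reverse complements and branching caused by repeats or alternative isoforms.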
  • #25 Technologies and tools
  • #26 Technologies and tools
  • #28 Pervasive phylogenetic incongruence. Although designed as a test for phylogenetic discordance attributable to genetic convergence, in other contexts the method could equally be used to measure discordance that has arisen by other processes, some of which will be more applicable to tropical systems: horizontal gene transfer among bacteria; introgression across species barriers; incomplete lineage sorting.
  • #39 Runtime: ~weeks to hours. Object-oriented design: separation of code into modular objects; re-use of methods through inheritance; abstraction of behaviour allows modifications to parts of the API without affecting other tested code; incorporation of other libraries.
  • #40 Pervasive phylogenetic incongruence. Although designed as a test for phylogenetic discordance attributable to genetic convergence, in other contexts the method could equally be used to measure discordance that has arisen by other processes, some of which will be more applicable to tropical systems: horizontal gene transfer among bacteria; introgression across species barriers; incomplete lineage sorting.
  • #41 Pervasive phylogenetic incongruence. Although designed as a test for phylogenetic discordance attributable to genetic convergence, in other contexts the method could equally be used to measure discordance that has arisen by other processes, some of which will be more applicable to tropical systems: horizontal gene transfer among bacteria; introgression across species barriers; incomplete lineage sorting.