This document outlines Joe Parker's research interests in phylogenomics and high-throughput comparative genomics at Queen Mary University of London. It discusses why phylogenomics is important, provides examples of past studies, and describes the lab's workflow and tools for sequencing, assembly, alignment, phylogeny inference, and phylogenetic analysis. It also presents a case study on detecting genome-wide convergence and discusses future directions including environmental metagenomics, cloud computing models, and real-time phylogenetics.
22. Assembly, Annotation
• Assembly
– To reference (mapping)
– De novo
• Annotation
– By homology
– De novo
• SOAPdenovo
• MAKER
• Velvet
• Bowtie / Cufflinks / Tophat
• Trinity
43. Models of computation
• Cloud resources: unlimited flexibility, finite time
• Development trade-off
– Off-the-shelf
– Bespoke
• Exploratory work
– Real time genomic transects?
• Essential fundamental data missing from nearly every system:
– Diversity; structure; substitution rates; dN/dS; recombination; dispersal; lateral transfer
44. Serialisation
• Process data remotely
• Freeze-dry objects, download to desktop
• Implement new methods directly on previously-analysed data
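A minimal sketch of the freeze-dry idea, assuming results are held in simple objects that can be written out as text and rebuilt later. The `LocusResult` class, its fields, and the `mean_delta_lnl` function are hypothetical names invented for illustration, not the lab's actual code; JSON is used here only because it is portable, and a language's native object serialisation would serve the same purpose.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class LocusResult:
    """Hypothetical container for the analysis output of one locus."""
    locus_id: str
    alignment_length: int
    site_lnl: dict = field(default_factory=dict)  # tree name -> per-site log-likelihoods


def freeze(result):
    """Serialise ('freeze-dry') a result to text so it can be downloaded."""
    return json.dumps(asdict(result))


def thaw(text):
    """Rebuild a LocusResult from previously serialised text."""
    return LocusResult(**json.loads(text))


def mean_delta_lnl(result, tree_a, tree_b):
    """A 'new method' written later: mean per-site lnL difference between two trees."""
    a = result.site_lnl[tree_a]
    b = result.site_lnl[tree_b]
    return sum(x - y for x, y in zip(a, b)) / len(a)
```

The expensive step (computing per-site likelihoods remotely) runs once; the thawed object can then be passed to `mean_delta_lnl` or any later method on the desktop without repeating it.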
45. 7. Over the horizon
• Real-time phylogenetics
• Field phylogenetics
• Alignment-free analyses
47. Thanks
Steve Rossiter¹, James Cotton², Elia Stupka³ & Georgia Tsagkogeorga¹
¹School of Biological and Chemical Sciences, Queen Mary, University of London
²Wellcome Trust Sanger Institute
³Center for Translational Genomics and Bioinformatics, San Raffaele Institute, Milan
Chris Walker & Dan Traynor
Queen Mary GridPP High-throughput Cluster
Chaz Mein & Anna Terry
Barts and The London Genome Centre
Mahesh Pancholi
School of Biological and Chemical Sciences
BBSRC (UK); Queen Mary, University of London
48. Resources
• My email: Joe Parker (Queen Mary University of London): j.d.parker@qmul.ac.uk
• Parker, J., Tsagkogeorga, G., Cotton, J.A., Liu, Y., Provero, P., Stupka, E. & Rossiter, S.J. (2013) Genome-wide signatures of convergent evolution in echolocating mammals. Nature 502(7470):228-231. doi:10.1038/nature12511
• Tsagkogeorga, G., Parker, J., Stupka, E., Cotton, J.A. & Rossiter, S.J. (2013) Phylogenomic analyses elucidate evolutionary relationships of the bats (Chiroptera). Curr. Biol. in the press.
• Salichos, L. & Rokas, A. (2013) Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497:327-331. doi:10.1038/nature12130
• Backström, N., Zhang, Q. & Edwards, S.V. (2013) Evidence from a House Finch (Haemorhous mexicanus) Spleen Transcriptome for Adaptive Evolution and Biased Gene Conversion in Passerine Birds. MBE 30(5):1046-50. doi:10.1093/molbev/mst033
• Lindblad-Toh, K., Garber, M., Zuk, O., Lin, M.F., Parker, B.J., et al. (2011) A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478:476–482. doi:10.1038/nature10530
• Degnan, J.H. & Rosenberg, N.A. (2009) Gene tree discordance, phylogenetic inference and the multispecies coalescent. TREE 24(6):332-340. doi:10.1016/j.tree.2009.01.009
• The Tree Of Life: http://phylogenomics.blogspot.co.uk/
• RNA-seq For Everyone: http://rnaseq.uoregon.edu/index.html
• Evo-Phylo: http://www.davelunt.net/evophylo/tag/phylogenomics/
• OpenHelix: http://blog.openhelix.eu/
• Our blogs: http://evolve.sbcs.qmul.ac.uk/rossiter/ (lab) and http://www.lonelyjoeparker.com/?cat=11 (Joe)
Editor's Notes
Quick through this
Moore’s law, sequencing data etc
Order-of-magnitude improvements:
Sequencing throughput, accuracy
Computational power
Concatenated, RAxML
B) per-locus support counts; RAxML concat and coalescent gave H1 overall
Almost as many discrete gene trees as genes
Backström - approach as measuring exercise
Surveying
Technologies and tools, mature
Technologies and tools
SOAPdenovo-Trans
SOAPdenovo-Trans is a de novo transcriptome assembler derived from the SOAPdenovo2 framework, designed for assembling transcriptomes with alternative splicing and varying expression levels. It constructs full-length transcript sets more comprehensively than SOAPdenovo2.
Velvet/Oases
The Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian bacterial artificial chromosomes (BACs).[15] These preliminary transcripts are transferred to Oases, which uses paired-end and long-read information to build transcript isoforms.[16]
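As a toy illustration of the de Bruijn graph idea (not Velvet's actual implementation), the sketch below links each (k-1)-mer to the (k-1)-mers that follow it in the reads, then extends a contig while the path is unambiguous. The function names are assumptions made for this example, and the walk checks only out-degree; a real assembler also handles in-degree, sequencing errors, reverse complements, and coverage.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def extend_unambiguous(graph, start):
    """Walk from `start` while the current node has exactly one distinct successor."""
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch point
        node = successors.pop()
        if node in seen:
            break  # stop rather than loop around a cycle
        seen.add(node)
        contig += node[-1]
    return contig
```

With two overlapping toy reads, `extend_unambiguous(de_bruijn_edges(["ATGGCG", "GGCGTA"], 4), "ATG")` walks through the shared k-mer and merges the reads into one contig.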
Trans-ABySS
ABySS is a parallel, paired-end sequence assembler. Trans-ABySS (Assembly By Short Sequences) is a software pipeline written in Python and Perl for analyzing ABySS-assembled transcriptome contigs. This pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels, identify potential polyadenylation sites, as well as candidate gene-fusion events.[17]
Trinity
Trinity[18] first divides the sequence data into a number of de Bruijn graphs, each representing transcriptional variations at a single gene or locus. It then extracts full-length splicing isoforms and distinguishes transcripts derived from paralogous genes from each graph separately. Trinity consists of three independent software modules, which are used sequentially to produce transcripts:
Inchworm assembles the RNA-Seq data into transcript sequences, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or a family or set of genes that share a conserved sequence). Chrysalis then partitions the full read set among these separate graphs.
Butterfly then processes the individual graphs in parallel, tracing the paths of reads within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.[19]
Cufflinks
Cufflinks[20] is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.
Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment. It does so by reducing the comparative assembly problem to a problem in maximum matching in bipartite graphs. In essence, Cufflinks implements a constructive proof of Dilworth's theorem by constructing a covering relation on the read alignments, and finding a minimum path cover on the directed acyclic graph for the relation.
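The reduction from minimum path cover to bipartite matching can be sketched as follows. This is a generic textbook version under simplifying assumptions, not Cufflinks' code: nodes are integers 0..n-1 standing in for read alignments, `edges` is a hypothetical list of "can be followed by" pairs, and the matching uses simple augmenting paths. For a chain cover in the Dilworth sense, the edges supplied would be those of the transitive closure of the relation.

```python
def min_path_cover(n, edges):
    """Size of a minimum vertex-disjoint path cover of a DAG on nodes 0..n-1.

    Each node is split into a 'left' and a 'right' copy; every DAG edge u->v
    becomes a bipartite edge left[u]-right[v]. Each matched edge joins two
    paths into one, so the cover size is n minus the maximum matching size.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    match_right = [-1] * n  # match_right[v] = left node matched to right copy of v

    def try_augment(u, visited):
        # Try to match left node u, re-routing earlier matches if needed.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    return n - matching
```

In the transcript-assembly reading, each path in the cover corresponds to one candidate transcript, so a smaller cover means a more parsimonious explanation of the reads.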
Pervasive phylogenetic incongruence
Our test detects phylogenetic discordance attributable to genetic convergence;
applied in other contexts, it could equally be used to measure discordance that has arisen by other processes,
some of which will be more applicable to tropical systems:
- Horizontal gene transfer among bacteria
- Introgression across species barriers
- Incomplete lineage sorting
RUNTIME --- ~weeks --> hours
Object-oriented design
Separation of code into modular objects
Re-use methods through inheritance
Abstraction of behaviour allows modifications to parts of the API without affecting other tested code
Incorporate other libraries
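A small sketch of the design points above, assuming a toy alignment represented as a list of equal-length strings. The class names (`SiteStatistic`, `GapFraction`, `IsVariable`) are hypothetical and chosen only to show modular objects, re-use through inheritance, and an abstract interface that lets one part change without touching the rest.

```python
from abc import ABC, abstractmethod


class SiteStatistic(ABC):
    """Abstract interface: any per-column statistic over an alignment."""

    @abstractmethod
    def per_site(self, column):
        """Return a number for one alignment column (a string of residues)."""

    def summarise(self, alignment):
        # Shared behaviour, inherited by every subclass: iterate over columns.
        columns = ("".join(col) for col in zip(*alignment))
        return [self.per_site(c) for c in columns]


class GapFraction(SiteStatistic):
    """Fraction of sequences with a gap at each site."""

    def per_site(self, column):
        return column.count("-") / len(column)


class IsVariable(SiteStatistic):
    """1.0 if a site has more than one distinct non-gap residue, else 0.0."""

    def per_site(self, column):
        residues = set(column) - {"-"}
        return 1.0 if len(residues) > 1 else 0.0
```

Adding a new statistic means writing one small subclass; the column-iteration logic in `summarise` is tested once and left alone.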