Bastien Boussau
LBBE, CNRS, Université de Lyon
Models of gene duplication, transfer and loss to study genome evolution
Collaborators
Lyon collaborators:
• Adrián Arellano Davín
• Gergely Szöllősi (Budapest)
• Vincent Daubin
• Eric Tannier
• Thomas Bigot
• Magali Semeria
• Manolo Gouy
• Laurent Duret
• Nicolas Lartillot
Austin/Illinois collaborators:
• Siavash Mirarab
• Md. Shamsuzzoha Bayzid
• Tandy Warnow
RevBayes collaborators:
• Sebastian Hoehna
• Michael Landis
• Tracy Heath
• Fredrik Ronquist
• Brian Moore
• John Huelsenbeck
• …
Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete
lineage sorting
• Birds
4. 2 vignettes
To study genome evolution:
1. One species tree
2. Thousands of gene trees
[Figure: species tree for species A, B, C, D along a time axis, with a discrete character (a, a, b, a) and a continuous character (0.1, 0.2, 0.2, 0.4) at the tips]
Why our current pipeline can be improved
• Gene alignments: error prone (genes are short); point estimates
• Gene trees: based on alignments; point estimates
• Species trees: based on gene trees
[Figure: gene trees evolving inside a species tree (species A, B, C, D; time axis), with events labeled D (duplication), DL (duplication and loss), LGT (lateral gene transfer), and ILS (incomplete lineage sorting)]
DL: Boussau et al., Genome Research 2013
DL+T: Szöllősi et al., PNAS 2013
ILS: Mirarab et al., Science 2014
PHYLDOG
Joint reconstruction of the species tree, gene trees, and numbers of duplications and losses
Input: all gene families (thousands of alignments)
Output: rooted species tree, numbers of duplications and losses, rooted gene trees
Probabilistic models:
• sequence evolution
• gene family evolution
[Figure: species tree for species A, B, C, D with per-branch duplication counts D1 to D6 and loss counts L1 to L6]
Boussau et al., Genome Research 2013
PHYLDOG: a model of gene duplication and loss
Assumptions
• Genes evolve along the species tree:
• birth events: duplications (rate of duplication)
• death events: losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
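The assumptions above describe a birth-death process on gene copies. As a minimal sketch (not PHYLDOG's own code), the following simulates the copy number of one gene family along a single branch of the species tree, with each copy duplicating or being lost independently. The function names, the rates, and the branch length are hypothetical values chosen for illustration.

```python
import math
import random


def simulate_copy_number(n_start, dup_rate, loss_rate, branch_length, rng):
    """Simulate the number of gene copies at the end of one branch.

    Each copy independently duplicates at rate dup_rate and is lost at
    rate loss_rate (a linear birth-death process, simulated event by event).
    """
    copies, elapsed = n_start, 0.0
    per_copy_rate = dup_rate + loss_rate
    while copies > 0 and per_copy_rate > 0:
        # Waiting time to the next event among all current copies.
        elapsed += rng.expovariate(copies * per_copy_rate)
        if elapsed > branch_length:
            break
        # The event is a duplication with probability dup / (dup + loss).
        if rng.random() < dup_rate / per_copy_rate:
            copies += 1
        else:
            copies -= 1
    return copies


def expected_copy_number(n_start, dup_rate, loss_rate, branch_length):
    """Expected copy number under the same linear birth-death process."""
    return n_start * math.exp((dup_rate - loss_rate) * branch_length)


# Hypothetical rates: compare the simulated mean with the analytical mean.
rng = random.Random(42)
draws = [simulate_copy_number(1, 0.3, 0.2, 1.0, rng) for _ in range(20000)]
mean_copies = sum(draws) / len(draws)
print(mean_copies, expected_copy_number(1, 0.3, 0.2, 1.0))
```

Because families and copies are assumed independent, a whole genome can be simulated by repeating this per family and per branch.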
Study of mammalian genome evolution
• Challenging but well-studied phylogeny
• 36 mammalian genomes available in Ensembl v. 57
• About 7000 gene families
• Correction for poorly sequenced genomes
PHYLDOG finds a good species tree
Sus scrofa
Felis catus
Ornithorhynchus anatinus
Oryctolagus cuniculus
Loxodonta africana
Mus musculus
Gorilla gorilla
Dipodomys ordii
Monodelphis domestica
Vicugna pacos
Macaca mulatta
Tupaia belangeri
Procavia capensis
Spermophilus tridecemlineatus
Pongo pygmaeus
Tursiops truncatus
Microcebus murinus
Callithrix jacchus
Equus caballus
Erinaceus europaeus
Tarsius syrichta
Choloepus hoffmanni
Ochotona princeps
Cavia porcellus
Pan troglodytes
Bos taurus
Rattus norvegicus
Homo sapiens
Otolemur garnettii
Dasypus novemcinctus
Echinops telfairi
Pteropus vampyrus
Macropus eugenii
Canis familiaris
Sorex araneus
Myotis lucifugus
Laurasiatheria
Afrotheria
Xenarthra
Marsupials
Primates
Glires
Quality of the gene trees
Comparison between:
• PhyML (used for the PhylomeDB and Homolens databases)
• TreeBeST (used for the Ensembl-Compara database)
• PHYLDOG
Two approaches:
• Looking at ancestral genome sizes
• Assessing how well one can recover ancestral syntenies using reconstructed gene trees (Bérard et al., Bioinformatics 2012)
PHYLDOG: better trees for better ancestral genomes
[Figure: mammalian species tree (same 36 species and clades as listed above) with axes from 0 to 10000; panels: PHYLDOG, TreeBeST, PhyML]
An example gene family
[Figure: the same gene family as reconstructed by TreeBeST and by PHYLDOG (scale bars 0.1 and 0.3). Tips are gene copies from Ornithorhynchus anatinus, Mus musculus, Cavia porcellus, Oryctolagus cuniculus, Canis familiaris, Bos taurus, Homo sapiens, Pongo pygmaeus, Equus caballus, Callithrix jacchus, Monodelphis domestica, and Spermophilus tridecemlineatus]
Boussau et al., Genome Research 2013
Recent improvements to PHYLDOG
• Easier installation using CMake or a virtual machine
• Better algorithms for gene tree inference
• Better algorithm for starting species tree
• Faster computations using the Phylogenetic Likelihood Library (PLL, A. Stamatakis group)
• Python scripts to help run the program
Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete
lineage sorting
• Birds
4. 2 vignettes
Gene transfers and the quixotic pursuit of the TOL
Doolittle WF, Science 1999
“The monistic concept of a single universal tree appears […] increasingly obsolete. […] [It is] no longer the most scientifically productive position to hold […] [It] accounts for only a minority of observations from genomes.”
Bapteste, O’Malley, Beiko, Ereshefsky, Gogarten, Franklin-Hall, Lapointe, Dupré, Dagan, Boucher, Martin, Biology Direct 2009.
exODT: a model of gene duplication, transfer, and loss
Assumptions
• Genes evolve along the species tree:
• birth events: duplications (rate of duplication), transfers (rate of receiving a gene)
• death events: losses (rate of loss)
• Each gene family is independent of other genes
• Each gene copy is independent of other copies
• Transfers can go through unsampled/extinct species
Szöllősi et al., Syst. Biol. a 2013
Better gene trees, fewer transfers
[Figure: two panels comparing the usual approach with ALE+DTL; y-axes: RF distance to real tree, and transfer events per family]
Szöllősi et al., Syst. Biol. b 2013
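The accuracy measure used here, the RF (Robinson-Foulds) distance to the real tree, counts the groupings present in one tree but not in the other. A minimal sketch follows, assuming rooted trees written as nested tuples of leaf names and counting clades rather than the unrooted bipartitions of the classical definition. The example trees and function names are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree given as nested tuples.

    Each clade is a frozenset of leaf names; single leaves and the full
    leaf set are excluded because every tree on the same leaves has them.
    """
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.add(below)
        return below

    found.discard(leaves_below(tree))
    return found


def rf_distance(tree_a, tree_b):
    """Number of clades found in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Hypothetical example: a "real" tree and a reconstructed tree.
real_tree = (("A", "B"), ("C", "D"))
inferred_tree = ((("A", "B"), "C"), "D")
print(rf_distance(real_tree, inferred_tree))
```

A distance of zero means the two trees contain exactly the same clades; larger values mean more disagreement.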
Application to real data: Cyanobacteria and Fungi
Cyanobacteria
• > 2.4 billion years old
• 40 species
• 1,200 to 4,500 protein coding genes
• 7,410 gene families
• Transfers are expected
Fungi (Dikarya)
• ~ 1 billion years old
• 28 species
• 5,200 to 10,000 protein coding genes
• 11,387 gene families
• Transfers should be less frequent
Both cases:
• fixed species tree, gene trees inferred using the Duplication, Transfer and Loss model
Szöllősi et al., under review
Cyanobacteria: 0.18 transfer per gene
Fungi: 0.07 transfer per gene
Szöllősi et al., under review
Comparing transfer rates
• Cyanobacteria and Fungi differ in their age: we can compare normalized numbers of events, T/(T+D)
• The Cyanobacteria and Fungi data sets differ in their number of species: we can perform rarefaction studies
Szöllősi et al., under review
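The two corrections above can be sketched in a few lines. This is an illustrative sketch only: the event counts and species names are hypothetical placeholders, not results from the study, and the rarefaction step only draws the random species subsets on which the analysis would be rerun.

```python
import random


def transfer_fraction(n_transfers, n_duplications):
    """Normalized number of transfers, T / (T + D)."""
    total = n_transfers + n_duplications
    if total == 0:
        raise ValueError("no transfer or duplication events to normalize")
    return n_transfers / total


def rarefaction_subsets(species, subset_size, n_replicates, rng):
    """Draw random species subsets of equal size for a rarefaction study."""
    if subset_size > len(species):
        raise ValueError("subset larger than the species set")
    return [sorted(rng.sample(species, subset_size))
            for _ in range(n_replicates)]


# Hypothetical event counts (not taken from the study).
fraction = transfer_fraction(n_transfers=30, n_duplications=70)

# Subsample a 40-species data set down to 28 species, the size of the
# smaller data set, so that both are compared at equal sampling depth.
larger_set = [f"species_{i:02d}" for i in range(40)]
subsets = rarefaction_subsets(larger_set, 28, 5, random.Random(0))
print(fraction, len(subsets), len(subsets[0]))
```

Normalizing by T + D removes the dependence on elapsed time, while subsampling removes the dependence on the number of sampled species.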
Similar transfer rates in Fungi and
Cyanobacteria
Szöllősi et al., under review
Using transfers to date clades
[Figure: species tree along a time axis; the relative order of some nodes is unknown (?)]
Because we can identify gene transfers, we have information for ordering the nodes of a species tree
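One way to see why transfers carry ordering information is to turn each transfer into an "older than" constraint and sort the nodes accordingly. The sketch below is a simplified illustration, not the method used in the talk. It assumes that each branch is named after the node at its lower end, and it uses a single constraint per transfer: the donor branch must begin before the recipient branch ends, which still holds if the gene passed through an unsampled or extinct lineage. The tree, the transfers, and the function name are hypothetical. It requires Python 3.9 or later for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter


def order_nodes(parent, transfers):
    """Return one node ordering, oldest first, compatible with the transfers.

    parent maps each node to its parent (None for the root).
    transfers is a list of (donor_branch, recipient_branch) pairs, where a
    branch is named after the node at its lower end.
    Raises CycleError if the transfers contradict each other.
    """
    must_be_older = {node: set() for node in parent}
    # Tree constraint: every parent is older than its child.
    for node, above in parent.items():
        if above is not None:
            must_be_older[node].add(above)
    # Transfer constraint: the top of the donor branch is older than the
    # bottom of the recipient branch.
    for donor, recipient in transfers:
        if parent[donor] is not None:
            must_be_older[recipient].add(parent[donor])
    return list(TopologicalSorter(must_be_older).static_order())


# Hypothetical tree ((A,B)X,(C,D)Y)root: the order of X and Y is unknown
# until a transfer from the branch above A into the branch above Y is found.
parent = {"root": None, "X": "root", "Y": "root",
          "A": "X", "B": "X", "C": "Y", "D": "Y"}
order = order_nodes(parent, [("A", "Y")])
print(order)
```

In this example the transfer implies that X is older than Y. Two transfers that imply opposite orders produce a cycle, which signals that at least one inferred transfer is wrong.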
Bayesian species tree inference
accounting for DTL events
• STRALE:
• A Bayesian probabilistic method that can interpret thousands of
gene trees in terms of:
• speciation events
• duplication events (D)
• transfer events (T)
• loss events (L)
• A method able to estimate the DTL rates
• A method able to reconstruct the species tree
• A method able to order the nodes of the species tree
Simulation to test the species tree reconstruction
• 20 species
• 200 gene families
[Figure: simulated and inferred species trees shown side by side, with tips numbered 0 to 19]
Conclusion on DTL models
• The use of DTL models shows that the number of gene
transfers has so far been overestimated
• DTL models can be used to study genome evolution
and in particular rates of gene transfer
• DTL models can be used to date the nodes of a species
phylogeny
• DTL models should provide a powerful tool to infer an
accurate account of the history of life
Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete
lineage sorting
• Birds
4. 2 vignettes
The multispecies coalescent
Rannala and Yang, Genetics 2003
• Divergence times in the species tree
• Divergence times in the gene trees
• Effective population sizes in the species tree
Faster alternatives to the multispecies coalescent use fixed gene trees
E.g.: MP-EST (Liu, Yu and Edwards, 2010)
Input: fixed gene trees
Output: species tree with branch lengths in coalescent units
Has been shown to be consistent, under one notable assumption: gene trees are correct.
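To make "branch lengths in coalescent units" concrete, here is a small worked example for three species under the multispecies coalescent. For a species tree ((A,B),C) whose internal branch has length T in coalescent units, the gene tree matches the species tree with probability 1 - (2/3)·exp(-T), and each of the two discordant topologies has probability (1/3)·exp(-T). The function name and the branch lengths tried are illustrative choices.

```python
import math


def triplet_topology_probabilities(internal_branch):
    """Gene tree topology probabilities for a three-species tree ((A,B),C).

    internal_branch is the length of the internal branch in coalescent
    units. The two lineages from A and B fail to coalesce on that branch
    with probability exp(-T); the three topologies are then equally likely.
    """
    if internal_branch < 0:
        raise ValueError("branch length must be non-negative")
    discordant = math.exp(-internal_branch) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


# Short internal branches give more incomplete lineage sorting.
for length in (0.1, 1.0, 5.0):
    print(length, triplet_topology_probabilities(length))
```

Methods that summarize fixed gene trees rely on these topology frequencies, which is why errors in the input gene trees matter.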
Errors in gene trees decrease the accuracy of
estimated species trees
Mirarab, Bayzid and Warnow, Syst. Biol. 2014
Statistical binning
[Figure: statistical binning pipeline, with species trees estimated by MP-EST]
Mirarab et al., Science 2014
Statistical binning improves species tree inference
Mirarab et al., Science 2014
Statistical binning also improves the estimation of the gene tree distribution
Mirarab et al., Science 2014
Statistical binning and birds
Jarvis et al., Science 2014
Improving statistical binning: weighted statistical binning
Mirarab et al., PLoS One, accepted
Practice: weighted binning and unweighted binning have about the same accuracy
Theory: weighted statistical binning can be shown to be consistent, unweighted statistical binning is not.
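The slides show binning only as figures, so the following is a heavily simplified sketch of the general idea and not the published implementation. It assumes that two genes may share a bin only if their trees do not disagree on any well-supported grouping, and it fills bins greedily while keeping them of similar size. The use of rooted clades, the support threshold, the greedy strategy, and all names and data are assumptions made for illustration.

```python
def clades_conflict(clade_a, clade_b):
    """Two clades conflict if they overlap without one containing the other."""
    return bool(clade_a & clade_b) and not (clade_a <= clade_b or clade_b <= clade_a)


def incompatible(gene_x, gene_y, threshold):
    """True if the genes disagree on clades supported at or above threshold."""
    strong_x = [clade for clade, support in gene_x if support >= threshold]
    strong_y = [clade for clade, support in gene_y if support >= threshold]
    return any(clades_conflict(a, b) for a in strong_x for b in strong_y)


def bin_genes(genes, threshold):
    """Greedily partition genes into bins of mutually compatible genes."""
    names = list(genes)
    conflicts = {
        name: {other for other in names
               if other != name
               and incompatible(genes[name], genes[other], threshold)}
        for name in names
    }
    bins = []
    # Place the most conflicting genes first; prefer the smallest valid bin.
    for name in sorted(names, key=lambda n: len(conflicts[n]), reverse=True):
        candidates = [b for b in bins if not conflicts[name] & set(b)]
        if candidates:
            min(candidates, key=len).append(name)
        else:
            bins.append([name])
    return bins


# Hypothetical genes: lists of (clade, bootstrap support) pairs.
genes = {
    "g1": [(frozenset("AB"), 90)],
    "g2": [(frozenset("AB"), 80)],
    "g3": [(frozenset("BC"), 95)],
    "g4": [(frozenset("BC"), 40)],  # weak support: conflicts with nothing
}
bins = bin_genes(genes, threshold=75)
print(bins)
```

In a pipeline of this kind, the alignments in each bin would then be combined and a tree estimated per bin before running a summary method such as MP-EST.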
Plan
1. Gene duplications and losses
• Mammalian genomes
2. Gene duplications, losses and transfers
• Fungi and Cyanobacteria
3. A fast approach to dealing with incomplete
lineage sorting
• Birds
4. 2 vignettes
RevBayes
• R-like language
• Model-based phylogenetics
• Many models of sequence evolution
• Models for dating
• Models for phylogeography
• Models for continuous traits
• Models for gene tree/species tree inference
• http://revbayes.net
• Sebastian Hoehna
• Michael Landis
• Tracy Heath
• Fredrik Ronquist
• Nicolas Lartillot
• Brian Moore
• John Huelsenbeck
• …
One more thing..
Conclusions
• We develop methods for gene tree and species
tree inference
• Improvement of gene trees and species trees in the
presence of:
• duplications and losses,
• transfers,
• incomplete lineage sorting
• Parallel algorithms applicable to genome-scale data
• We study the evolution of life, ancient and recent
Thanks!
