© Copyright 2016 by Paul O. Lewis
Entropy and information
in phylogenetics
Past-President
Society of Systematic Biologists
Paul O. Lewis
Evolution2016
Joint Annual Meeting of SSE, ASN, and SSB
Austin, Texas ~ 19 June 2016
What is information?
details, particulars, facts, figures, statistics, data;
knowledge, intelligence; instruction, advice,
guidance, direction, counsel, enlightenment; news,
word; hot tip; informal: info, lowdown, dope, dirt,
inside story, scoop, poop.
— Synonyms of information in the Oxford American
Writer’s Thesaurus
Does information=data?
Taxon 1   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 2   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 3   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 4   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 5   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 6   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 7   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 8   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 9   AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 10  AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 11  AAAAAAAAAAAAAAAAAAAAAAAAA
Taxon 12  AAAAAAAAAAAAAAAAAAAAAAAAA
Information=Data?
Taxon 1   GGGTTGAATGGGGTGCGACTTATTC
Taxon 2   GCGGCGATAGACTGCTACTACGTGC
Taxon 3   CCCGTGGATAGCGACGTCTACAAGA
Taxon 4   GGCTGTCGTAGCTTCCGTGTAATAC
Taxon 5   CCGGAGGCAAACACCCTGTTCCCCC
Taxon 6   GGGCAATATATATCCGCACCGCTCG
Taxon 7   AAGAGCCGACAAGTAGAATCGGGAT
Taxon 8   AGTAGCACAAGCGACACGGCAATAA
Taxon 9   GTCGTGTTTTACCAGAGGTTGCATA
Taxon 10  GCGTTGTAACACCCTTACCCTCTTT
Taxon 11  AGTACATGTATGTTTCCTTCGTTCG
Taxon 12  TGGGTTCCGCCCCGAGACGAGGCTC
The correct exposure for
phylogenetic inference
0.02 subst./site
Data simulated on the tree
above are nearly optimal for
phylogeny estimation
Taxon 1   ACGGTCGAGGCGTAGACTCGATCAA
Taxon 2   ACGGTCGAGGCGTAGACTCGATCAA
Taxon 3   ACGGTCGAGGCGTAGACTCGATCAA
Taxon 4   ACGGTCGATGCGTAGACTCGATCAA
Taxon 5   ACGGTCGACGCGTATACTCGATCAA
Taxon 6   ACGGTCGACGCGTATACTCGATCAA
Taxon 7   ACGGTCGACGCGGATACTCGATCAA
Taxon 8   ACGGTCGACGCGTATACTCGATCAA
Taxon 9   ACGGTTGACGCATATACTCGATCAA
Taxon 10  ACGGTTGACGCATATACTCGATCAA
Taxon 11  ACCGTTGACGCATATACTCGATCAA
Taxon 12  ACCGTTGACGCATATACTCGATCAA
Negatively skewed parsimony tree length distributions indicate information content (Fitch 1984)
[Figure: tree-length distributions for a "Noisy" and an "Informative" data set, with the most parsimonious tree marked]
The g1 statistic quantifies skewness, and hence information content (Hillis 1991; Huelsenbeck 1991)
[Figure: two tree-length distributions, one with g1 = 0.05 (slightly positive) and one with g1 = -0.96 (quite negative)]
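As a rough illustration of what g1 measures, the sketch below computes the sample skewness g1 = m3 / m2^(3/2) from a list of tree lengths. The function name and the two toy lists of tree lengths are assumptions made for illustration; they are not the data behind the figure.

```python
def g1_skewness(values):
    """Sample skewness g1 = m3 / m2**1.5, using the central moments m2 and m3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical tree lengths: a symmetric set, and a set with a long left tail
symmetric_lengths = [48, 49, 50, 50, 51, 52]
left_tailed_lengths = [30, 44, 48, 49, 50, 50, 50, 51]

print(g1_skewness(symmetric_lengths))    # no skew
print(g1_skewness(left_tailed_lengths))  # negative: a few trees are much shorter than the rest
```

A negative g1 means a small number of trees are much shorter than the bulk, which is the pattern the slide associates with informative data.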
Shuffling taxon assignments within characters (sites) removes hierarchical structure due to history (Archie 1989; Faith & Cranston 1991)
Taxon 1   A A C T G T
Taxon 2   A A C T G T
Taxon 3   C A G A T T
Taxon 4   C G G A C T
Taxon 5   C G G A C T
[Animation: the same 5-taxon matrix with the states at each site shuffled among taxa]
Shuffling tests easily differentiate random versus properly exposed data (Archie 1989; Faith & Cranston 1991)
[Figure: the unshuffled original data set compared with the shuffled data sets: "now that's significant!"]
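The shuffling step itself can be sketched in a few lines. The example below permutes the states among taxa independently at every site, using the 5-taxon, 6-site matrix from the shuffling slide. The function name and the seed are illustrative assumptions; scoring each shuffled matrix (for example by parsimony tree length) is not shown.

```python
import random

def shuffle_within_sites(alignment, rng):
    """Permute the states among taxa independently for every site (column)."""
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)
    # Transpose back so that each string is again one taxon's sequence
    return ["".join(row) for row in zip(*columns)]

# Rows are Taxon 1 to Taxon 5
original = ["AACTGT", "AACTGT", "CAGATT", "CGGACT", "CGGACT"]
shuffled = shuffle_within_sites(original, random.Random(1))

for before, after in zip(original, shuffled):
    print(before, "->", after)
```

Each site keeps exactly the same states, so base composition per site is untouched; only the association between states and taxa, and hence any hierarchical signal, is destroyed.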
Properly exposed data has lower nucleotide compositional entropy than saturated data (Xia et al. 2003)
Slowly evolving:
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTTGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTTAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTTGGGGTAGCCCTCAC
TGCGTGGCGTGGGGGTAGCCCTCAC
Saturated:
GGGTTGAATGGGGTGCGACTTATTC
GCGGCGATAGACTGCTACTACGTGC
CCCGTGGATAGCGACGTCTACAAGA
GGCTGTCGTAGCTTCCGTGTAATAC
CCGGAGGCAAACACCCTGTTCCCCC
GGGCAATATATATCCGCACCGCTCG
AAGAGCCGACAAGTAGAATCGGGAT
AGTAGCACAAGCGACACGGCAATAA
GTCGTGTTTTACCAGAGGTTGCATA
GCGTTGTAACACCCTTACCCTCTTT
AGTACATGTATGTTTCCTTCGTTCG
TGGGTTCCGCCCCGAGACGAGGCTC
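One way to make the comparison concrete is to average the Shannon entropy of the nucleotide frequencies over sites. The sketch below does this for the slowly evolving and saturated alignments shown on the slide. It is a simplified stand-in, and is not claimed to be the exact index of Xia et al. 2003; the function name is an assumption.

```python
import math
from collections import Counter

def mean_site_entropy(alignment):
    """Average Shannon entropy (bits) of the nucleotide frequencies at each site."""
    sites = list(zip(*alignment))
    total = 0.0
    for site in sites:
        n = len(site)
        total -= sum(c / n * math.log2(c / n) for c in Counter(site).values())
    return total / len(sites)

slowly_evolving = [
    "TGCGTGGCGTTGGGGTAGCCCTCAC", "TGCGTGGCGTTTGGGTAGCCCTCAC",
    "TGCGTGGCGTTGGGGTAGCCCTCAC", "TGCGTGGCGTTGGGGTAGCCCTCAC",
    "TGCGTGGCGTTGGGGTAGCCCTCAC", "TGCGTGGCGTTGGGGTAGCCCTTAC",
    "TGCGTGGCGTTGGGGTAGCCCTCAC", "TGCGTGGCGTTGGGGTAGCCCTCAC",
    "TGCGTGGCGTTGGGGTAGCCCTCAC", "TGCGTGGCGTTGGGGTAGCCCTCAC",
    "TGCGTGGCGTTGGGGTAGCCCTCAC", "TGCGTGGCGTGGGGGTAGCCCTCAC",
]
saturated = [
    "GGGTTGAATGGGGTGCGACTTATTC", "GCGGCGATAGACTGCTACTACGTGC",
    "CCCGTGGATAGCGACGTCTACAAGA", "GGCTGTCGTAGCTTCCGTGTAATAC",
    "CCGGAGGCAAACACCCTGTTCCCCC", "GGGCAATATATATCCGCACCGCTCG",
    "AAGAGCCGACAAGTAGAATCGGGAT", "AGTAGCACAAGCGACACGGCAATAA",
    "GTCGTGTTTTACCAGAGGTTGCATA", "GCGTTGTAACACCCTTACCCTCTTT",
    "AGTACATGTATGTTTCCTTCGTTCG", "TGGGTTCCGCCCCGAGACGAGGCTC",
]

print(mean_site_entropy(slowly_evolving))  # close to 0 bits per site
print(mean_site_entropy(saturated))        # approaches the 2-bit maximum for 4 states
```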
Plotting pairwise p-distance against model-corrected distance reveals overexposure graphically
[Figure: proportion different (y-axis, 0 to 0.5) against estimated distance (x-axis, 0 to 5) for 2nd codon positions and 3rd codon positions, annotated "??"]
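The shape of such a plot can be illustrated with the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The talk does not say which substitution model was used for the estimated distances, so Jukes-Cantor is an assumption here, chosen only because it has a closed form; the function names are also illustrative.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jc69_distance(p):
    """Jukes-Cantor corrected distance; it diverges as p approaches 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# The corrected distance grows much faster than p once p is large
for p in (0.05, 0.30, 0.60, 0.74):
    print(f"p = {p:.2f}  corrected = {jc69_distance(p):.3f}")
```

When the observed proportion of differences levels off while the corrected distance keeps growing, the points flatten out on the plot, which is the graphical signature of overexposure.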
The Bayesian framework provides a natural way to quantify information
[Figure: uniform probability density on θ = Pr(coin lands heads on any given flip), from 0.0 (2-tailed coin) through 0.5 (fair coin) to 1.0 (2-headed coin)]
The information in just 3 flips is enough to make trick coins impossible
[Figure: prior and posterior densities on θ = Pr(coin lands heads on any given flip), with the 2-tailed coin (θ = 0) and the 2-headed coin (θ = 1) marked]
The difference between prior and posterior measures information content (Brown 2014)
[Figure: prior and posterior densities on θ = Pr(coin lands heads on any given flip), plotted from 0.0 to 1.0]
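The coin example can be worked through numerically. With a uniform prior, h heads and t tails give a Beta(1 + h, 1 + t) posterior. The sketch below assumes the 3 flips came up 2 heads and 1 tail (the slides do not give the outcome) and uses the Kullback-Leibler divergence from the prior as one possible measure of the prior-to-posterior difference (the slides do not name a measure).

```python
import math

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, for a, b >= 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * theta ** (a - 1) * (1.0 - theta) ** (b - 1)

def kl_from_uniform_prior(a, b, steps=10_000):
    """KL divergence (bits) of a Beta(a, b) posterior from a uniform prior, by midpoint rule."""
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) / steps
        p = beta_pdf(theta, a, b)
        if p > 0.0:
            total += p * math.log2(p) / steps  # the prior density is 1 everywhere
    return total

heads, tails = 2, 1                # assumed outcome of the 3 flips
a, b = 1 + heads, 1 + tails

print(beta_pdf(0.0, a, b), beta_pdf(1.0, a, b))  # both trick coins get zero density
print(kl_from_uniform_prior(a, b))               # bits gained relative to the prior
```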
Information Theory
Dissonance
Additivity
Scaling Storm
Polytomy Rainbow
Why?
“Information is the resolution of uncertainty”
— Claude Shannon, 1948
The uncertainty Claude Shannon was
interested in resolving was “Which symbol was
last transmitted over a telegraph system?”
Sender chooses 1 of 8
possible symbols
Receiver must resolve which
symbol was sent
Information = number of
questions receiver needs to ask to
determine which symbol was sent
Any 1 of the 8 symbols can be identified by answering 3 yes/no questions
[Figure: a binary decision tree of yes/no questions (for example "circle?" and "blue?") that assigns each of the 8 symbols a 3-bit code: 111, 110, 101, 100, 011, 010, 001, 000]
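The same idea can be sketched as a halving search: each yes/no answer discards half of the remaining candidates, so 8 symbols need 3 questions. Numbering the symbols 0 to 7 and the function name are assumptions made for illustration.

```python
def identify(symbol_index, num_symbols=8):
    """Find symbol_index by repeatedly halving the candidate range.

    Returns the string of answers (1 = yes, 0 = no) and the number of questions.
    """
    low, high = 0, num_symbols
    answers = ""
    while high - low > 1:
        mid = (low + high) // 2
        in_upper_half = symbol_index >= mid  # one yes/no question
        answers += "1" if in_upper_half else "0"
        low, high = (mid, high) if in_upper_half else (low, mid)
    return answers, len(answers)

for symbol in range(8):
    print(symbol, identify(symbol))
```

The string of answers is the 3-bit code of the symbol, matching the codes 000 through 111 on the slide.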
Dichotomous keys embody Shannon's basic units of information
seeds?
  no: vascular?
    no: bryophyte
    yes: fern
  yes: flowers?
    no: gymnosperm
    yes: angiosperm
If each symbol has an equal chance of being chosen by the sender, then 3 bits are needed to identify each symbol on average
Code         111  110  101  100  011  010  001  000
Probability  1/8  1/8  1/8  1/8  1/8  1/8  1/8  1/8
Bits         3    3    3    3    3    3    3    3
entropy = 3 bits (entropy equals the average number of questions needed)
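The entropy figure follows from Shannon's formula H = -Σ p log2(p). A minimal sketch (the function name is an assumption):

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), skipping zero-probability symbols."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

uniform_8 = [1 / 8] * 8
print(entropy_bits(uniform_8))  # 3.0
```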
If 1 symbol is sent half the time, and 4 other symbols are equally probable, then only 2 bits are needed on average
Code         1    011  010  001  000
Probability  1/2  1/8  1/8  1/8  1/8
Only 1 bit is needed half the time; 3 bits are needed the other half of the time
entropy = 2 bits
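The average of 2 bits can be checked directly from the code words: weight each code length by its probability. The dictionary below simply pairs the 5 code words on the slide with their probabilities; the variable names are illustrative.

```python
import math

# Code word -> probability that the corresponding symbol is sent
code = {"1": 1 / 2, "011": 1 / 8, "010": 1 / 8, "001": 1 / 8, "000": 1 / 8}

expected_length = sum(len(word) * p for word, p in code.items())
entropy = -sum(p * math.log2(p) for p in code.values())

print(expected_length)  # 2.0 questions on average
print(entropy)          # 2.0 bits
```

Here the average code length equals the entropy because every probability is a power of 1/2, so each symbol can be given a code of exactly -log2(p) bits.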
If only 1 symbol is ever sent, then no
questions need be asked by the receiver,
and thus no information is required
Probability  1
entropy = 0 bits (0 questions need be asked)
In the previous examples, there is
no uncertainty at the receiving end
[Figure: the symbol received matches the symbol sent, 100% correct]
Noise means that the data received do not
contain enough information to unambiguously
identify the symbol transmitted
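[Figure: 3 bits are sent; the received code 101 leaves the symbol 101 with probability 73% and the symbols 100, 001 and 111 with probability 8.1% each, so only 1.6 of the 3 bits are received]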
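The numbers on this slide are consistent with a channel that flips each bit independently with probability 0.1. That error rate is an inference, not something the slide states, and the sketch below treats it as an assumption. With a uniform prior over the 8 symbols, the posterior given the received code has 1.4 bits of entropy left, so 3 - 1.4 = 1.6 bits of information arrive.

```python
import math
from itertools import product

def posterior_given_received(received, flip_prob):
    """Posterior over sent symbols, assuming a uniform prior and independent bit flips."""
    weights = {}
    for bits in product("01", repeat=len(received)):
        sent = "".join(bits)
        flips = sum(s != r for s, r in zip(sent, received))
        weights[sent] = flip_prob ** flips * (1 - flip_prob) ** (len(received) - flips)
    total = sum(weights.values())
    return {sent: w / total for sent, w in weights.items()}

posterior = posterior_given_received("101", flip_prob=0.1)  # 0.1 is an assumed error rate
entropy = -sum(p * math.log2(p) for p in posterior.values())
information = 3 - entropy

print(round(posterior["101"], 3))  # 0.729
print(round(posterior["100"], 3))  # 0.081
print(round(information, 1))       # 1.6
```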
If not all bits are transmitted, there will
also be uncertainty at the destination
[Figure: only the bits 10 arrive, so the symbols 100 and 101 are each 50% probable]
Estimating the phylogeny for 4 taxa involves identifying
1 symbol (a tree) from a total of 15 symbols
[Figure: the 15 possible trees for taxa A, B, C and D, coded 0000 through 1110; the code 0111 is sent]
3.9 bits sent, 3.9 bits received
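The 3.9 bits is log2(15): the information needed to single out 1 of 15 equally probable trees. A fixed-width binary code still needs 4 digits, which is why the codes on the slide have 4 bits.

```python
import math

num_trees = 15
information_bits = math.log2(num_trees)        # information needed to identify one tree
fixed_width_bits = math.ceil(information_bits)  # digits needed for a fixed-width code

print(round(information_bits, 1))  # 3.9
print(fixed_width_bits)            # 4
```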
Simulating sequence data on a tree captures information about that tree's topology
[Animation: the 15 candidate trees for taxa A, B, C and D, with the model tree marked, shown after simulating 0, 1, 10, 100 and 1000 sites]
1000 sites capture enough information to identify the tree topology chosen as the model tree
In 4-taxon simulations, information estimation works as you might expect
Relative rate    %I
0.01             18
0.1              99
1                100
10               64
100              1.5
info highest at optimal rate
Percent missing  %I
0                100
50               98
100              0
info decreases with missing data
Rate variance    %I
1                100
10               97
100              13
1000             0
info decreases with rate heterogeneity
Information can be
false information!
POLICE
Dissonance
Additivity
Scaling Storm
Polytomy Rainbow
Why?
Information Theory
Horizontal transfer results in conflicting
information about the placement of bloodroot
(Sanguinaria)
[Figure: two rps11 mtDNA trees of Bocconia and Eschscholzia (Papaveraceae), Oryza and Disporum (monocots), and Sanguinaria: one from the 5' end and one from the 3' end (horizontally transferred from monocots)]
Bergthorsson et al. 2003
The 5' data contains 2.9 of 3.9 bits of
information
[Figure: posterior distribution over the 15 possible trees for Sanguinaria (S), Bocconia (B), Eschscholzia (E), Disporum (D) and Oryza (O)]
74.5% information
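The slides do not spell out the formula, so the sketch below assumes that information is the maximum entropy, log2(15), minus the entropy of the posterior over tree topologies. The 5' tree probabilities are the rounded values 0.77, 0.12 and 0.11 from the table later in the talk. They give about 2.9 bits, or roughly 74%, close to the 74.5% on the slide; the small gap presumably reflects rounding of the probabilities.

```python
import math

def information_content(posterior, num_trees):
    """Return (bits, percent) of information: maximum entropy minus posterior entropy."""
    max_entropy = math.log2(num_trees)
    entropy = -sum(p * math.log2(p) for p in posterior if p > 0)
    bits = max_entropy - entropy
    return bits, 100 * bits / max_entropy

five_prime = [0.77, 0.12, 0.11]  # rounded posterior probabilities of the 3 sampled trees
bits, percent = information_content(five_prime, num_trees=15)

print(round(bits, 1))     # 2.9
print(round(percent, 1))
```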
Likewise, the 3' data captures 2.6 of 3.9
bits of information
[Figure: posterior distribution over the 15 possible trees for Sanguinaria (S), Bocconia (B), Eschscholzia (E), Disporum (D) and Oryza (O)]
66.8% information
What do you expect will happen if we concatenate the two data sets?
Data 1:
A ACGTACGTA
B ACGTACGTA
C CCATGCGCA
D GTACGCACA
E GTACGCACA
Data 2:
A ATATGTGTG
B GCGCACACA
C GCGCACACA
D ATATGTGTG
E ATATGGTTG
Concatenated:
A ACGTACGTA ATATGTGTG
B ACGTACGTA GCGCACACA
C CCATGCGCA GCGCACACA
D GTACGCACA ATATGTGTG
E GTACGCACA ATATGGTTG
[Figure: the concatenated data set yields a tree file for taxa A to E]
Concatenating the 3' and 5' data, we might
expect the conflict to be expressed as noise
[Figure: hypothetical posterior distribution over the 15 possible trees, with the 5' tree and the 3' tree marked]
Instead, we get all 3.9 bits of information needed, but
identify a tree that is neither the 3' nor the 5' tree!
[Figure: posterior distribution over the 15 possible trees, with the 5' tree and the 3' tree marked]
concatenated data contains 100% of info!
Each data set strongly rejects the other's favorite tree, so a mediocre tree wins everything
[Figure: the 5' tree and the 3' tree for Bocconia, Eschscholzia, Oryza, Disporum and Sanguinaria]
Topology          5'     3'     Concatenated
(((S,D),O),E,B)   ---    0.64   ---
(((S,O),D),E,B)   ---    0.18   ---
(((O,D),S),E,B)   0.11   0.18   1
(O,D,(B,(S,E)))   0.12   ---    ---
(O,D,(E,(S,B)))   0.77   ---    ---
Info              74.5%  66.8%  100%
5' data rejects the first 2 trees; 3' data rejects the last 2 trees
This loser wins everything: (((O,D),S),E,B)
Merging tree files provides a means of measuring information dissonance
[Figure: Data 1 yields Trees 1 and Data 2 yields Trees 2; the two tree files are combined into a Merged tree file, shown alongside the tree file from the Concatenated data]
Merged tree file says the same thing as individual tree files if there is no dissonance
Topology          5'     5'     Merged
(((O,D),S),E,B)   0.11   0.11   0.11
(O,D,(B,(S,E)))   0.12   0.12   0.12
(O,D,(E,(S,B)))   0.77   0.77   0.77
Info              74.5%  74.5%  74.5%
same, no dissonance
Dissonance is the difference between merged info and average info
Topology          5'     3'     Merged
(((S,D),O),E,B)   ---    0.64   0.32
(((S,O),D),E,B)   ---    0.18   0.09
(((O,D),S),E,B)   0.11   0.18   0.14
(O,D,(B,(S,E)))   0.12   ---    0.06
(O,D,(E,(S,B)))   0.77   ---    0.39
Info              74.5%  66.8%  48.6%
average info = 70.7%, merged info = 48.6%, dissonance = 22.1%
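The dissonance calculation can be reproduced from the table. The sketch below assumes that merging two equally sized tree samples amounts to averaging their posterior probabilities, and that information is computed as a percentage of log2(15). The topology strings are written with balanced parentheses. Because the table values are rounded, the results match the slide only approximately.

```python
import math

NUM_TREES = 15  # possible trees for the 5 taxa

def info_percent(posterior):
    """Information as a percentage of the maximum entropy log2(NUM_TREES)."""
    entropy = -sum(p * math.log2(p) for p in posterior.values() if p > 0)
    return 100 * (1 - entropy / math.log2(NUM_TREES))

def merge(post_a, post_b):
    """Posterior implied by pooling two equally sized tree samples."""
    trees = set(post_a) | set(post_b)
    return {t: (post_a.get(t, 0) + post_b.get(t, 0)) / 2 for t in trees}

five_prime = {"(((O,D),S),E,B)": 0.11, "(O,D,(B,(S,E)))": 0.12, "(O,D,(E,(S,B)))": 0.77}
three_prime = {"(((S,D),O),E,B)": 0.64, "(((S,O),D),E,B)": 0.18, "(((O,D),S),E,B)": 0.18}

merged = merge(five_prime, three_prime)
average_info = (info_percent(five_prime) + info_percent(three_prime)) / 2
dissonance = average_info - info_percent(merged)

print(round(info_percent(merged), 1))
print(round(dissonance, 1))
```

Merging a sample with itself leaves the posterior unchanged, so the dissonance is zero, as in the 5' versus 5' comparison.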
Additivity
Scaling Storm
Polytomy Rainbow
Why?
Information Theory
Dissonance
A sample of trees can be used to
build a conditional clade graph
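[Figure: two sampled trees for taxa A to F and the conditional clade graph built from them: ABCDEF splits as ABC|DEF (probability 1); ABC resolves as AB|C or A|BC (0.5 each); DEF resolves as D|EF or DE|F (0.5 each)]
Larget 2013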
Clade mixing-and-matching allows us to
greatly extend the reach of our sample
[Figure: the same conditional clade graph (ABCDEF splits as ABC|DEF; ABC resolves as AB|C or A|BC, 0.5 each; DEF resolves as D|EF or DE|F, 0.5 each), now yielding 4 trees for taxa A to F]
Larget 2013
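The mixing-and-matching can be sketched as a product of conditional clade probabilities. The dictionary below encodes the graph as read from the slide; the data structure and names are illustrative assumptions, not the representation used by Larget (2013).

```python
from itertools import product

# Conditional probability of each split, given its parent clade
conditional = {
    "ABCDEF": {"ABC|DEF": 1.0},
    "ABC": {"AB|C": 0.5, "A|BC": 0.5},
    "DEF": {"D|EF": 0.5, "DE|F": 0.5},
}

# A tree picks one resolution of ABC and one of DEF; its probability is the product
tree_probs = {}
for (abc, p_abc), (def_, p_def) in product(
    conditional["ABC"].items(), conditional["DEF"].items()
):
    tree_probs[(abc, def_)] = conditional["ABCDEF"]["ABC|DEF"] * p_abc * p_def

for tree, prob in tree_probs.items():
    print(tree, prob)
```

Two sampled trees supply two resolutions for each subclade, and combining them yields 4 trees, each with probability 0.25.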
Entropy, information, and dissonance can all be partitioned by clade
[Figure: the conditional clade graph (ABCDEF splits as ABC|DEF; ABC resolves as AB|C or A|BC; DEF resolves as D|EF or DE|F, 0.5 each) and its 4 trees for taxa A to F, each with probability 0.25]
Information = 7.9 bits = 6.7 + 0.6 + 0.6
The split ABC|DEF provides the largest contribution (6.7 bits) because here the 945 possible trees are cut down to just 9
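The partition can be checked with a few logarithms. Reading the 0.6-bit terms as "3 possible resolutions of a 3-taxon clade narrowed to 2 equally probable ones" is an interpretation, not something the slide states, but it reproduces the numbers shown.

```python
import math

root_split = math.log2(945) - math.log2(9)  # 945 possible trees cut down to 9
clade_abc = math.log2(3) - math.log2(2)     # 3 resolutions narrowed to 2 equally probable ones
clade_def = math.log2(3) - math.log2(2)
total = root_split + clade_abc + clade_def

print(round(root_split, 1), round(clade_abc, 1), round(clade_def, 1))  # 6.7 0.6 0.6
print(round(total, 1))                                                # 7.9
```

The total equals log2(945) - log2(4): the prior entropy over 945 trees minus the entropy of a posterior spread evenly over 4 trees.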
Two data sets simulated on trees differing only in the swapping of two tips illustrate that dissonance can pinpoint disagreement
All dissonance is attributed to the clade containing the swapped taxa
Scaling Storm
Polytomy Rainbow
Why?
Information Theory
Dissonance
Additivity
There are 5.6×10^26 distinct labeled unrooted binary tree topologies for 24 taxa
A computer examining 1 billion trees/second would have to start before the Big Bang in order to finish looking through all these trees!
An MCMC sample of 1 trillion trees is still 564 trillion times too small to sample each tree once
Bottom line: it is impossible to accurately estimate the entropy of a posterior representing zero information for any reasonable number of taxa
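These figures follow from the standard count of labeled unrooted binary trees, (2n - 5)!! = 3 × 5 × ... × (2n - 5) for n taxa. A minimal sketch (the function name is an assumption, and the age of the universe is taken as roughly 13.8 billion years):

```python
def num_unrooted_binary_trees(num_taxa):
    """(2n - 5)!! labeled unrooted binary topologies for n >= 3 taxa."""
    count = 1
    for k in range(3, 2 * num_taxa - 4, 2):
        count *= k
    return count

trees_24 = num_unrooted_binary_trees(24)
years_at_1e9_per_second = trees_24 / 1e9 / (365.25 * 24 * 3600)
shortfall = trees_24 / 1e12  # how many times too small a sample of 1 trillion trees is

print(f"{trees_24:.2e} trees")
print(f"{years_at_1e9_per_second:.2e} years at 1 billion trees/second")
print(f"{shortfall:.3e} times too small")
```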
If data contains zero information, inadequate sampling results in high estimated information content (10,000 trees sampled)
Taxa  Unrooted trees  Estimated information (%)
4     3               0
5     15              0
6     105             0
7     945             1
8     10,395          6
9     135,135         22
10    2,027,025       37
11    34,459,425      47
12    654,729,075     55
[Figure: a small dot shows how much tree space the sample covers; for 12 taxa, tree space is 65,473 times larger than the sample size]
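The bias in the table can be reproduced with a small simulation. The sketch below assumes that a zero-information posterior is a uniform distribution over all topologies, draws 10,000 tree indices at random, and estimates information from the sampled frequencies. Trees are represented only by integer labels, and the seed is arbitrary.

```python
import math
import random
from collections import Counter

def num_unrooted_binary_trees(num_taxa):
    """(2n - 5)!! labeled unrooted binary topologies for n >= 3 taxa."""
    count = 1
    for k in range(3, 2 * num_taxa - 4, 2):
        count *= k
    return count

def estimated_information(num_trees, sample_size, rng):
    """Percent information estimated from a uniform (zero-information) sample of trees."""
    counts = Counter(rng.randrange(num_trees) for _ in range(sample_size))
    entropy = -sum(c / sample_size * math.log2(c / sample_size) for c in counts.values())
    return 100 * (1 - entropy / math.log2(num_trees))

rng = random.Random(2016)
for taxa in range(4, 13):
    trees = num_unrooted_binary_trees(taxa)
    print(taxa, trees, round(estimated_information(trees, 10_000, rng)))
```

Once tree space is much larger than the sample, nearly every sampled tree is unique, so the estimated entropy cannot exceed log2(10,000), about 13.3 bits, however flat the true posterior is.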
Polytomy Rainbow
Why?
Information Theory
Dissonance
Additivity
Scaling Storm
Polytomy priors make it possible to estimate low information content accurately (Lewis, Holder & Holsinger 2005)
Resolution class 1 (the star tree): 1 tree
Resolution class 2: 25 trees
Resolution class 3: 105 trees
Resolution class 4 (fully resolved): 105 trees
Including polytomies more than doubles the size of tree space
Make each of the 4 resolution classes equally probable under the prior (0.25 each)
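The tree counts per resolution class can be generated with a standard counting recurrence. The recurrence is not taken from the slides: adding taxon n to a tree either attaches it to one of the m existing internal nodes, or splits one of the edges and creates a new internal node. The function name is an assumption.

```python
def trees_per_resolution_class(num_taxa):
    """Counts of labeled unrooted trees with 1, 2, ..., n - 2 internal nodes."""
    counts = {1: 1}  # 3 taxa: only the star tree
    for n in range(4, num_taxa + 1):
        new_counts = {}
        for m in range(1, n - 1):
            # Either split one of the n + m - 3 edges of a tree with m - 1 internal nodes,
            # or attach the new taxon to one of the m internal nodes of a tree with m
            new_counts[m] = (n + m - 3) * counts.get(m - 1, 0) + m * counts.get(m, 0)
        counts = new_counts
    return [counts[m] for m in sorted(counts)]

classes_6 = trees_per_resolution_class(6)
print(classes_6)       # [1, 25, 105, 105]
print(sum(classes_6))  # 236 trees in total, versus 105 fully resolved trees

# Flat resolution class prior: each class gets 1/4, shared equally among its trees
prior_per_tree = [1 / (len(classes_6) * count) for count in classes_6]
print(prior_per_tree)
```

Under this prior the star tree alone carries a quarter of the prior probability, which gives a chain a place to sit when the data contain little information.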
Flat resolution class prior is easy to sample even for a 24-taxon problem
[Figure: number of sampled trees (x-axis, 0 to 500) in each resolution class, 1 to 22; 10,000 trees sampled]
Info = 0.026%
Highly informative data sets place all posterior in the fully-resolved resolution class
[Figure: number of sampled trees (x-axis, 0 to 10,000) in each resolution class, 1 to 22; 10,000 trees sampled]
Info = 100%
All posterior on just 1 of the 5.6×10^26 possible trees!
The Bayesian approach is better at assessing the information content of 2nd vs. 3rd position sites
[Figure: proportion different (y-axis, 0 to 0.5) against estimated distance (x-axis, 0 to 5) for 2nd codon positions and 3rd codon positions, annotated "Saturated?"]
The Bayesian approach is better at assessing the
information content of 2nd vs. 3rd position sites
[Figure: proportion different (y-axis, 0 to 0.5) against estimated distance (x-axis, 0 to 5)]
0.005
0.700
3rd position sites:
info = 86.4%
2nd position sites:
info = 75.6%
3rd positions have
more information than
2nd positions!
Using the resolution class prior does not change the conclusion that 3rd position sites have more information than 2nd position sites
[Figure: two histograms of sampled trees by resolution class (classes 1 to 31), one per data set]
2nd position sites: info = 30.2%
3rd position sites: info = 54.9%
Why?
Information Theory
Dissonance
Additivity
Scaling Storm
Polytomy Rainbow
Why measure information content?
• Morphology vs. molecules
• Informed site-stripping
• Impact of missing data (missing taxa, missing genes, random)
• Partition gene tree conflict (dissonance)
• Profiling information content
• Divergence time analyses
Thanks!
~ UConn Collaborators ~
Ming-Hui Chen, Lynn Kuo, Louise Lewis, Karolina Fučíková,
Suman Neupane, Yu-Bo Wang, Daoyuan Shi
Supported by the National
Science Foundation
Department of Ecology and
Evolutionary Biology
http://dx.doi.org/10.1093/sysbio/syw042
Systematic Biology Advance Access
Literature Cited
Archie, J. W. 1989. A randomization test for phylogenetic information in systematic data. Systematic Zoology 38(3):239–252.
Bergthorsson, U., Adams, K. L., Thomason, B., & Palmer, J. D. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197–201.
Brown, J. M. 2014. Detection of implausible phylogenetic inferences using posterior predictive assessment of model fit. Systematic Biology 63(3):334–348.
Faith, D. P., & Cranston, P. S. 1991. Could a cladogram this short have arisen by chance alone?: on permutation tests for cladistic structure. Cladistics 7(1):1–28.
Fitch, W. M. 1984. Cladistic and other methods: problems, pitfalls, and potentials. Chapter 12 in: Duncan, T., and Stuessy, T. F. (eds.), Cladistics: perspectives on the reconstruction of evolutionary history. Papers presented at a workshop on the theory and application of cladistic methodology, March 22-28, 1981, University of California, Berkeley. Columbia University Press, New York.
Hillis, D. M. 1991. Discriminating between phylogenetic signal and random noise in DNA sequences. In M. M. Miyamoto & J. Cracraft (Eds.), Phylogenetic analysis of DNA sequences (pp. 278–284). New York: Oxford University Press.
Huelsenbeck, J. P. 1991. Tree-length distribution skewness: an indicator of phylogenetic information. Systematic Zoology 40(3):257–270.
Larget, B. 2013. The estimation of tree posterior probabilities using conditional clade probability distributions. Systematic Biology 62(4):501–511.
Lewis, P. O., Holder, M. T., & Holsinger, K. E. 2005. Polytomies and Bayesian phylogenetic inference. Systematic Biology 54(2):241–253.
Xia, X., Xie, Z., Salemi, M., Chen, L., & Wang, Y. 2003. An index of substitution saturation and its application. Molecular Phylogenetics and Evolution 26(1):1–7.
Claude Shannon photograph: http://www.itsoc.org/about/shannon
All other photographs by Paul O. Lewis

More Related Content

Recently uploaded

DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
LeenakshiTyagi
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
anilsa9823
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Lokesh Kothari
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
University of Hertfordshire
 

Recently uploaded (20)

DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
fundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomologyfundamental of entomology all in one topics of entomology
fundamental of entomology all in one topics of entomology
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 

Featured

Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 

Featured (20)

AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)

Paul Lewis SSB Past-President's address at Evol2016

  • 1. © Copyright 2016 by Paul O. Lewis Entropy and information in phylogenetics Past-President Society of Systematic Biologists Paul O. Lewis Evolution2016 Joint Annual Meeting of SSE, ASN, and SSB Austin, Texas ~ 19 June 2016
  • 2. What is information? details, particulars, facts, figures, statistics, data; knowledge, intelligence; instruction, advice, guidance, direction, counsel, enlightenment; news, word; hot tip; informal: info, lowdown, dope, dirt, inside story, scoop, poop. (Synonyms of information in the Oxford American Writer’s Thesaurus)
  • 3. Does information=data?
      Taxon 1   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 2   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 3   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 4   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 5   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 6   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 7   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 8   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 9   AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 10  AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 11  AAAAAAAAAAAAAAAAAAAAAAAAA
      Taxon 12  AAAAAAAAAAAAAAAAAAAAAAAAA
  • 5. Information=Data?
      Taxon 1   GGGTTGAATGGGGTGCGACTTATTC
      Taxon 2   GCGGCGATAGACTGCTACTACGTGC
      Taxon 3   CCCGTGGATAGCGACGTCTACAAGA
      Taxon 4   GGCTGTCGTAGCTTCCGTGTAATAC
      Taxon 5   CCGGAGGCAAACACCCTGTTCCCCC
      Taxon 6   GGGCAATATATATCCGCACCGCTCG
      Taxon 7   AAGAGCCGACAAGTAGAATCGGGAT
      Taxon 8   AGTAGCACAAGCGACACGGCAATAA
      Taxon 9   GTCGTGTTTTACCAGAGGTTGCATA
      Taxon 10  GCGTTGTAACACCCTTACCCTCTTT
      Taxon 11  AGTACATGTATGTTTCCTTCGTTCG
      Taxon 12  TGGGTTCCGCCCCGAGACGAGGCTC
  • 7. The correct exposure for phylogenetic inference. Data simulated on the tree above are nearly optimal for phylogeny estimation. [Figure: 12-taxon tree with scale bar 0.02 subst./site]
      Taxon 1   ACGGTCGAGGCGTAGACTCGATCAA
      Taxon 2   ACGGTCGAGGCGTAGACTCGATCAA
      Taxon 3   ACGGTCGAGGCGTAGACTCGATCAA
      Taxon 4   ACGGTCGATGCGTAGACTCGATCAA
      Taxon 5   ACGGTCGACGCGTATACTCGATCAA
      Taxon 6   ACGGTCGACGCGTATACTCGATCAA
      Taxon 7   ACGGTCGACGCGGATACTCGATCAA
      Taxon 8   ACGGTCGACGCGTATACTCGATCAA
      Taxon 9   ACGGTTGACGCATATACTCGATCAA
      Taxon 10  ACGGTTGACGCATATACTCGATCAA
      Taxon 11  ACCGTTGACGCATATACTCGATCAA
      Taxon 12  ACCGTTGACGCATATACTCGATCAA
  • 8. Negatively skewed parsimony tree length distributions indicate information content (Fitch 1984). [Figure: tree length distributions for a noisy and an informative data set, with the most parsimonious tree marked]
  • 9. The g1 statistic quantifies skewness, and hence information content (Hillis 1991; Huelsenbeck 1991). [Figure: two tree length distributions, one with g1 = 0.05 (slightly positive) and one with g1 = -0.96 (quite negative)]
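The skewness statistic on slide 9 can be sketched in a few lines. This is a minimal illustration, assuming g1 is the usual moment-based sample skewness (third central moment divided by the 1.5 power of the second); the tree lengths below are invented for the example and are not data from the talk.

```python
def g1_skewness(values):
    """Sample skewness g1 = m3 / m2**1.5, computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical tree lengths (in steps), made up for illustration only.
symmetric_lengths = [10, 11, 11, 12, 12, 12, 13, 13, 14]
left_skewed_lengths = [5, 9, 11, 12, 12, 13, 13, 13, 13]

print(g1_skewness(symmetric_lengths))    # no skew: symmetric distribution
print(g1_skewness(left_skewed_lengths))  # negative: long tail of short trees
```

A strongly negative value means a few trees are much shorter than the bulk, which is the pattern the slide associates with informative data.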
  • 10. Shuffling taxon assignments within characters (sites) removes hierarchical structure due to history (Archie 1989; Faith & Cranston 1991). [Figure: character matrix for Taxon 1 through Taxon 5]
  • 11. Shuffling taxon assignments within characters (sites) removes hierarchical structure due to history (Archie 1989; Faith & Cranston 1991). [Figure: character matrix for Taxon 1 through Taxon 5]
  • 12. Shuffling tests easily differentiate random versus properly exposed data (Archie 1989; Faith & Cranston 1991). [Figure annotations: "unshuffled original"; "now that's significant!"]
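The shuffling step described on slides 10 to 12 might look like the sketch below. It only performs the permutation (each site is shuffled across taxa independently); the test statistic that would then be compared between shuffled and unshuffled data is not shown. The function name and the toy alignment are hypothetical.

```python
import random

def shuffle_within_sites(alignment, rng):
    """Permute the states of every site (column) independently across taxa."""
    columns = [list(column) for column in zip(*alignment)]
    for column in columns:
        rng.shuffle(column)
    # Transpose back so that each string is again one taxon's sequence.
    return ["".join(row) for row in zip(*columns)]

# Hypothetical 5-taxon alignment, made up for illustration.
alignment = ["ACGTA", "ACGTT", "ACCTT", "GCCAT", "GCCAA"]
shuffled = shuffle_within_sites(alignment, random.Random(1))
for original_row, shuffled_row in zip(alignment, shuffled):
    print(original_row, shuffled_row)
```

Each site keeps its own state frequencies, so only the association of states with taxa (the hierarchical signal) is destroyed.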
  • 13. Properly exposed data has lower nucleotide compositional entropy than saturated data (Xia et al. 2003). [Figure: nucleotide composition (A, C, G, T) shown for each alignment]
      SLOWLY EVOLVING
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTTGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTTAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTTGGGGTAGCCCTCAC
      TGCGTGGCGTGGGGGTAGCCCTCAC
      SATURATED
      GGGTTGAATGGGGTGCGACTTATTC
      GCGGCGATAGACTGCTACTACGTGC
      CCCGTGGATAGCGACGTCTACAAGA
      GGCTGTCGTAGCTTCCGTGTAATAC
      CCGGAGGCAAACACCCTGTTCCCCC
      GGGCAATATATATCCGCACCGCTCG
      AAGAGCCGACAAGTAGAATCGGGAT
      AGTAGCACAAGCGACACGGCAATAA
      GTCGTGTTTTACCAGAGGTTGCATA
      GCGTTGTAACACCCTTACCCTCTTT
      AGTACATGTATGTTTCCTTCGTTCG
      TGGGTTCCGCCCCGAGACGAGGCTC
  • 14. Plotting pairwise p-distance against model-corrected distance reveals overexposure graphically. [Plot: proportion different (0 to 0.5) against estimated distance (0 to 5); series: 2nd codon positions and 3rd codon positions; annotation "??"]
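The two quantities on the axes of slide 14 can be computed as in the sketch below. The slide does not say which substitution model was used for the correction, so the Jukes-Cantor (JC69) formula here is an assumption chosen for simplicity; the two sequences are Taxon 1 and Taxon 12 from slide 7.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)

def jc69_distance(p):
    """Jukes-Cantor corrected distance (assumed model, not named in the talk)."""
    if p >= 0.75:
        return math.inf  # saturation: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

taxon_1 = "ACGGTCGAGGCGTAGACTCGATCAA"
taxon_12 = "ACCGTTGACGCATATACTCGATCAA"

p = p_distance(taxon_1, taxon_12)
d = jc69_distance(p)
print(p, d)
```

As p approaches its plateau the corrected distance grows without bound, which is why a flat curve in the plot signals overexposed (saturated) sites.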
  • 15. The Bayesian framework provides a natural way to quantify information. [Plot: uniform prior probability density over θ = Pr(coin lands heads on any given flip), from 0 to 1; θ = 0 is a 2-tailed coin, θ = 0.5 a fair coin, θ = 1 a 2-headed coin]
  • 16. The information in just 3 flips is enough to make trick coins impossible. [Plot: probability density over θ = Pr(coin lands heads on any given flip), from 0 to 1, with the 2-tailed and 2-headed coins marked]
  • 17. The difference between prior and posterior measures information content (Brown 2014). [Plot: prior and posterior densities over θ = Pr(coin lands heads on any given flip), from 0 to 1]
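The coin example on slides 15 to 17 can be reproduced with a conjugate update. This sketch assumes the uniform prior is Beta(1, 1) and that the 3 flips were 2 heads and 1 tail (the slides do not give the outcome); any outcome containing both a head and a tail drives the posterior density to zero at θ = 0 and θ = 1.

```python
import math

def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)

def posterior_density(theta, heads, tails):
    """Posterior for theta under a uniform Beta(1, 1) prior."""
    return beta_pdf(theta, heads + 1, tails + 1)

# Assumed outcome of the 3 flips: 2 heads, 1 tail.
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    prior = posterior_density(theta, 0, 0)   # no data: the prior itself
    post = posterior_density(theta, 2, 1)
    print(theta, prior, post)
```

The prior is flat at 1 everywhere, while the posterior vanishes at both ends, so the two trick coins are ruled out.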
  • 18. Outline: Information Theory; Dissonance; Additivity; Scaling Storm; Polytomy Rainbow; Why?
  • 19. “Information is the resolution of uncertainty” (Claude Shannon, 1948)
  • 20. The uncertainty Claude Shannon was interested in resolving was “Which symbol was last transmitted over a telegraph system?” The sender chooses 1 of 8 possible symbols. The receiver must resolve which symbol was sent. Information = number of questions the receiver needs to ask to determine which symbol was sent.
  • 21. Any 1 of the 8 symbols can be identified by answering 3 yes/no questions. [Figure: decision tree of yes/no questions ("circle?", then "blue?", then a question about one specific symbol); each answer contributes one bit, giving the codes 111, 110, 101, 100, 011, 010, 001, 000]
  • 22. Dichotomous keys embody Shannon's basic units of information. [Figure: key using the yes/no questions "vascular?", "seeds?" and "flowers?" to separate bryophyte, fern, gymnosperm and angiosperm]
  • 23. If each symbol has an equal chance (1/8) of being chosen by the sender, then 3 bits are needed to identify each symbol on average. Codes: 111, 110, 101, 100, 011, 010, 001, 000 (3 bits each). Entropy = 3; entropy equals the average number of questions needed.
  • 24. If 1 symbol is sent half the time and the 4 remaining symbols are equally probable (1/8 each), then only 2 bits are needed on average: only 1 bit is needed half the time, and 3 bits are needed the other half of the time. Codes: 1, 011, 010, 001, 000. Entropy = 2.
  • 25. If only 1 symbol is ever sent (probability 1), then no questions need be asked by the receiver, and thus no information is required. Entropy = 0.
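The three entropy values on slides 23 to 25 follow directly from Shannon's formula, H = -Σ p·log2(p). A minimal sketch (the function name is just for illustration):

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability symbols contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

uniform_8 = [1 / 8] * 8              # slide 23: all 8 symbols equally likely
one_common = [1 / 2] + [1 / 8] * 4   # slide 24: one symbol sent half the time
certain = [1.0]                      # slide 25: only one symbol is ever sent

for name, dist in [("uniform", uniform_8), ("one common", one_common), ("certain", certain)]:
    print(name, entropy_bits(dist))
```

The results match the average number of yes/no questions in each of the three decision trees: 3, 2 and 0.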
  • 26. In the previous examples, there is no uncertainty at the receiving end. [Figure: symbol received 100% correct]
  • 27. Noise means that the data received do not contain enough information to unambiguously identify the symbol transmitted. [Figure values: codes 111, 101, 100, 001; probabilities 73%, 8.1%, 8.1%, 8.1%; labels 3 bits and 1.6 bits]
  • 28. If not all bits are transmitted, there will also be uncertainty at the destination. [Figure values: 10 transmitted; 100 (50%) and 101 (50%)]
  • 29. Estimating the phylogeny for 4 taxa involves identifying 1 symbol (a tree) from a total of 15 symbols. [Figure: the 15 trees for taxa A, B, C, D with 4-bit codes 0000 through 1110; 3.9 bits sent, 3.9 bits received]
  • 30. Simulating sequence data on a tree captures information about that tree's topology. [Figure: the 15 trees for taxa A, B, C, D, one of them marked as the model tree]
  • 31. Simulating sequence data on a tree captures information about that tree's topology: 0 sites
  • 32. Simulating sequence data on a tree captures information about that tree's topology: 1 site
  • 33. Simulating sequence data on a tree captures information about that tree's topology: 10 sites
  • 34. Simulating sequence data on a tree captures information about that tree's topology: 100 sites
  • 35. Simulating sequence data on a tree captures information about that tree's topology: 1000 sites. 1000 sites capture enough information to identify the tree topology chosen as the model tree.
  • 36. In 4-taxon simulations, information estimation works as you might expect
      Relative rate     %I
      0.01              18
      0.1               99
      1                 100
      10                64
      100               1.5
      (info highest at optimal rate)
      Percent missing   %I
      0                 100
      50                98
      100               0
      (info decreases with missing data)
      Rate variance     %I
      1                 100
      10                97
      100               13
      1000              0
      (info decreases with rate heterogeneity)
  • 37. Information can be false information! [Photo showing the word "POLICE"]
  • 38. Outline: Dissonance; Additivity; Scaling Storm; Polytomy Rainbow; Why?; Information Theory
  • 39. Horizontal transfer results in conflicting information about the placement of bloodroot (Sanguinaria) (Bergthorsson et al. 2003; rps11 mtDNA). [Figure: one tree for the 5’ end and one for the 3’ end (horizontally transferred from monocots), each with Bocconia, Eschscholzia and Sanguinaria (Papaveraceae) and Oryza and Disporum (monocots)]
  • 40. The 5' data contains 2.9 of 3.9 bits of information (74.5% information). [Figure: posterior distribution over the 15 possible trees for taxa S, B, E, D and O]
  • 41. Likewise, the 3' data captures 2.6 of 3.9 bits of information (66.8% information). [Figure: posterior distribution over the 15 possible trees for taxa S, B, E, D and O]
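The bit counts on slides 40 and 41 are consistent with measuring information as the drop in entropy from a flat distribution over all 15 trees to the posterior. The sketch below assumes that definition and uses the rounded posterior probabilities from the topology table on slide 45, so it lands near, not exactly on, the slide's 2.9 bits and 74.5% for the 5' data.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def information(posterior, num_trees):
    """Return (bits, percent) of information, assuming a flat prior on trees."""
    h_max = math.log2(num_trees)           # entropy when nothing is known
    bits = h_max - entropy_bits(posterior)  # uncertainty resolved by the data
    return bits, 100.0 * bits / h_max

# Rounded posterior probabilities of the three sampled trees (5' data).
five_prime = [0.11, 0.12, 0.77]
bits, percent = information(five_prime, num_trees=15)
print(round(math.log2(15), 1), round(bits, 1), round(percent, 1))
```

With 15 candidate trees the maximum is log2(15), about 3.9 bits, which is the "3.9" denominator quoted on both slides.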
  • 42. What do you expect will happen if we concatenate the two data sets? [Figure: the concatenated data set gives one tree file with a tree for taxa A to E]
      Taxon  Data 1     Data 2
      A      ACGTACGTA  ATATGTGTG
      B      ACGTACGTA  GCGCACACA
      C      CCATGCGCA  GCGCACACA
      D      GTACGCACA  ATATGTGTG
      E      GTACGCACA  ATATGGTTG
      Concatenated: Data 1 followed by Data 2 for each taxon
  • 43. Concatenating the 3' and 5' data, we might expect the conflict to be expressed as noise. [Figure: hypothetical posterior distribution over the 15 possible trees, with the 5' tree and the 3' tree marked]
  • 44. Instead, we get all 3.9 bits of information needed, but identify a tree that is neither the 3' nor the 5' tree! The concatenated data contains 100% of info! [Figure: posterior distribution over the 15 possible trees, with the 5' tree and the 3' tree marked]
  • 45. Each data set strongly rejects the other's favorite tree, so a mediocre tree wins everything. [Figure: the 5' and 3' trees for Bocconia, Eschscholzia, Sanguinaria, Oryza and Disporum]
      Topology          5’      3’      Concatenated
      (((S,D),O),E,B)   ---     0.64    ---
      (((S,O),D),E,B)   ---     0.18    ---
      (((O,D),S),E,B)   0.11    0.18    1
      (O,D,(B,(S,E)))   0.12    ---     ---
      (O,D,(E,(S,B)))   0.77    ---     ---
      Info              74.5%   66.8%   100%
      Annotations: "5' data rejects these 2 trees" (first two rows); "3' data rejects these 2 trees" (last two rows); "This loser wins everything!" (third row)
  • 46. Merging tree files provides a means of measuring information dissonance. [Figure: Data 1 gives the tree file Trees 1 and Data 2 gives Trees 2; Trees 1 and Trees 2 are combined into the Merged tree file, shown next to the Concatenated tree file; Data 1 and Data 2 are the example alignments from slide 42]
  • 47. Merged tree file says the same thing as individual tree files if there is no dissonance
      Topology          5’      5’      Merged
      (((O,D),S),E,B)   0.11    0.11    0.11
      (O,D,(B,(S,E)))   0.12    0.12    0.12
      (O,D,(E,(S,B)))   0.77    0.77    0.77
      Info              74.5%   74.5%   74.5%
      Annotation: same, no dissonance
  • 48. Dissonance is the difference between merged info and average info
      Topology          5’      3’      Merged
      (((S,D),O),E,B)   ---     0.64    0.32
      (((S,O),D),E,B)   ---     0.18    0.09
      (((O,D),S),E,B)   0.11    0.18    0.14
      (O,D,(B,(S,E)))   0.12    ---     0.06
      (O,D,(E,(S,B)))   0.77    ---     0.39
      Info              74.5%   66.8%   48.6%
      Average info = 70.7; dissonance = 70.7 - 48.6 = 22.1
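The dissonance calculation on slides 47 and 48 can be sketched as follows. It assumes the two tree files are the same size, so merging them amounts to averaging tree frequencies, and it uses the rounded probabilities from the slide, so the result is close to, not identical with, the reported 22.1. The short topology labels are abbreviations introduced here.

```python
import math

NUM_TREES = 15  # candidate topologies in the bloodroot example

def info_percent(posterior):
    """Percent information relative to a flat distribution over all trees."""
    h_max = math.log2(NUM_TREES)
    h = sum(-p * math.log2(p) for p in posterior.values() if p > 0)
    return 100.0 * (h_max - h) / h_max

def merge(post_a, post_b):
    """Pool two equally sized tree samples by averaging tree frequencies."""
    trees = sorted(set(post_a) | set(post_b))
    return {t: 0.5 * (post_a.get(t, 0.0) + post_b.get(t, 0.0)) for t in trees}

def dissonance(post_a, post_b):
    """Average info of the two samples minus info of the merged sample."""
    average = 0.5 * (info_percent(post_a) + info_percent(post_b))
    return average - info_percent(merge(post_a, post_b))

five_prime = {"((O,D),S)": 0.11, "(B,(S,E))": 0.12, "(E,(S,B))": 0.77}
three_prime = {"((S,D),O)": 0.64, "((S,O),D)": 0.18, "((O,D),S)": 0.18}

print(round(dissonance(five_prime, five_prime), 1))   # identical samples
print(round(dissonance(five_prime, three_prime), 1))  # conflicting samples
```

Merging spreads the posterior over more trees when the samples disagree, which lowers the merged information and raises the dissonance.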
  • 49. Outline: Additivity; Scaling Storm; Polytomy Rainbow; Why?; Information Theory; Dissonance
  • 50. A sample of trees can be used to build a conditional clade graph (Larget 2013). [Figure: two sampled trees for taxa A to F and the graph built from them, with nodes ABCDEF, ABC|DEF, AB|C, A|BC, D|EF, DE|F and edge probabilities of 1 or 0.5]
  • 51. Clade mixing-and-matching allows us to greatly extend the reach of our sample (Larget 2013). [Figure: the same clade graph, now yielding four trees for taxa A to F]
  • 52. Entropy, information, and dissonance can all be partitioned by clade. Information = 7.9 bits = 6.7 + 0.6 + 0.6. The clade contributing 6.7 bits provides the largest contribution because here the 945 possible trees are cut down to just 9. [Figure: clade graph with conditional probabilities of 0.5 and four trees with probability 0.25 each]
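The mixing-and-matching on slides 50 to 52 can be illustrated by multiplying conditional clade probabilities. The split probabilities below are read from the slide's graph; treating a tree's probability as the product of its conditional splits follows the general idea credited to Larget (2013), but the data structure here is only a hypothetical sketch, and clades with a single possible split (probability 1) are omitted.

```python
import math
from itertools import product

# Conditional probability of each way a clade was split in the sample.
splits = {
    "ABCDEF": {"ABC|DEF": 1.0},
    "ABC": {"AB|C": 0.5, "A|BC": 0.5},
    "DEF": {"D|EF": 0.5, "DE|F": 0.5},
}

# Every combination of one split per clade is a distinct tree.
trees = {}
for combo in product(*(splits[clade].items() for clade in splits)):
    names = tuple(name for name, _ in combo)
    trees[names] = math.prod(prob for _, prob in combo)

entropy = sum(-p * math.log2(p) for p in trees.values())
info_bits = math.log2(945) - entropy  # 945 possible trees, as on slide 52

for names, prob in trees.items():
    print(names, prob)
print(round(info_bits, 1))
```

Two sampled trees thus support four trees at probability 0.25 each, and the total information comes out at about 7.9 bits, the figure given on slide 52.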
  • 53. Two data sets simulated on trees differing only in the swapping of two tips illustrate that dissonance can pinpoint disagreement
  • 54. Two data sets simulated on trees differing only in the swapping of two tips illustrate that dissonance can pinpoint disagreement. All dissonance is attributed to the clade containing the swapped taxa.
  • 55. Outline: Scaling Storm; Polytomy Rainbow; Why?; Information Theory; Dissonance; Additivity
  • 56. There are 5.6×10^26 distinct labeled unrooted binary tree topologies for 24 taxa
  • 57. There are 5.6×10^26 distinct labeled unrooted binary tree topologies for 24 taxa. A computer examining 1 billion trees/second would have to start before the Big Bang in order to finish looking through all these trees!
  • 58. There are 5.6×10^26 distinct labeled unrooted binary tree topologies for 24 taxa. An MCMC sample of 1 trillion trees is still 564 trillion times too small to sample each tree once.
  • 59. There are 5.6×10^26 distinct labeled unrooted binary tree topologies for 24 taxa. Bottom line: it is impossible to accurately estimate the entropy of a posterior representing zero information for any reasonable number of taxa.
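The numbers on slides 56 to 58 can be checked with the standard count of labeled unrooted binary topologies, (2n-5)!! for n taxa. The sketch below also converts the count into search time and sample shortfall; the 13.8-billion-year age of the universe is a commonly quoted figure that does not appear in the slides.

```python
def num_unrooted_trees(n_taxa):
    """(2n-5)!! labeled unrooted binary topologies for n_taxa >= 3."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

trees_24 = num_unrooted_trees(24)
seconds = trees_24 / 1e9                      # at 1 billion trees per second
years = seconds / (365.25 * 24 * 3600)
shortfall = trees_24 / 1e12                   # versus a sample of 1 trillion trees

print(f"{trees_24:.2e} trees, {years:.2e} years, {shortfall:.2e}-fold too small")
```

The same function reproduces the tree counts tabulated on slide 60, for example 654,729,075 trees for 12 taxa.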
  • 60. If data contains zero information, inadequate sampling results in high estimated information content (10,000 trees sampled)
      Taxa   Unrooted trees   Estimated information (%)
      4      3                0
      5      15               0
      6      105              0
      7      945              1
      8      10,395           6
      9      135,135          22
      10     2,027,025        37
      11     34,459,425       47
      12     654,729,075      55
  • 61. If data contains zero information, inadequate sampling results in high estimated information content (10,000 trees sampled; same table as slide 60). Annotations: "This little dot is how much tree space we've covered"; for 12 taxa, tree space is 65,473 times larger than the sample size.
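The bias shown in the table on slides 60 and 61 can be reproduced with a small simulation. This sketch assumes a plug-in estimator (entropy computed from observed tree frequencies) and models a zero-information posterior as uniform sampling over all topologies; the talk does not spell out the estimator, so treat both as assumptions.

```python
import math
import random
from collections import Counter

def estimated_info_percent(num_trees, sample_size, rng):
    """Plug-in information estimate from a uniform (zero-information) sample."""
    sample = Counter(rng.randrange(num_trees) for _ in range(sample_size))
    h = sum(-(c / sample_size) * math.log2(c / sample_size) for c in sample.values())
    h_max = math.log2(num_trees)
    return 100.0 * (h_max - h) / h_max

rng = random.Random(2016)
for taxa, num_trees in [(4, 3), (8, 10395), (12, 654729075)]:
    print(taxa, round(estimated_info_percent(num_trees, 10000, rng)))
```

With 12 taxa nearly every sampled tree is unique, so the estimated entropy cannot exceed log2(10,000), about 13.3 bits, out of a possible 29.3; that gap alone produces an apparent information content of roughly 55%.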
  • 62. Outline: Polytomy Rainbow; Why?; Information Theory; Dissonance; Additivity; Scaling Storm
  • 63. Polytomy priors make it possible to estimate low information content accurately (Lewis, Holder & Holsinger 2005). [Figure: four resolution classes containing 1 tree, 25 trees, 105 trees and 105 trees]
  • 64. Polytomy priors make it possible to estimate low information content accurately (Lewis, Holder & Holsinger 2005). Resolution class 1 is the star tree.
  • 65. Polytomy priors make it possible to estimate low information content accurately (Lewis, Holder & Holsinger 2005). Resolution class 1 is the star tree; resolution class 4 is fully resolved.
  • 66. Polytomy priors make it possible to estimate low information content accurately (Lewis, Holder & Holsinger 2005). Including polytomies more than doubles the size of tree space.
  • 67. Polytomy priors make it possible to estimate low information content accurately (Lewis, Holder & Holsinger 2005). Make each of the 4 resolution classes equally probable under the prior (0.25 each).
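The class sizes on slides 63 to 67 (1, 25, 105 and 105 trees) can be reproduced by counting unrooted trees by their number of internal nodes. The recurrence used below is a standard one for multifurcating trees: a new tip either attaches to an existing edge (creating a new internal node) or to an existing internal node. Matching the slide's numbers to 6 taxa is an inference from those counts; the flat-prior calculation follows slide 67.

```python
def resolution_class_counts(n_taxa):
    """Unrooted labeled trees with n_taxa tips, by number of internal nodes."""
    counts = {1: 1}  # 3 taxa: only the star tree
    for n in range(4, n_taxa + 1):
        new_counts = {}
        for m in range(1, n - 1):
            # Attach tip n to one of the (n + m - 3) edges of a tree with m - 1
            # internal nodes, or to one of the m internal nodes of a tree with m.
            new_counts[m] = (n + m - 3) * counts.get(m - 1, 0) + m * counts.get(m, 0)
        counts = new_counts
    return [counts[m] for m in sorted(counts)]

classes = resolution_class_counts(6)
print(classes, sum(classes))

# Flat resolution class prior: each class gets equal mass, shared by its trees.
per_tree_prior = [1.0 / len(classes) / size for size in classes]
print(per_tree_prior)
```

Under this prior the star tree alone carries a quarter of the prior mass, which is what makes a zero-information posterior easy to sample.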
  • 68. Flat resolution class prior easy to sample even for a 24-taxon problem. Info = 0.026%; 10,000 trees sampled. [Histogram: number of sampled trees (0 to 500) in each resolution class (1 to 22)]
  • 69. Highly informative data sets place all posterior in the fully-resolved resolution class. Info = 100%; 10,000 trees sampled. All posterior on just 1 of the 5.6×10^26 possible trees! [Histogram: number of sampled trees (0 to 10,000) in each resolution class (1 to 22)]
  • 70. The Bayesian approach is better at assessing the information content of 2nd vs. 3rd position sites. [Plot: proportion different (0 to 0.5) against estimated distance (0 to 5); series: 2nd codon positions and 3rd codon positions; annotation "Saturated?"]
  • 71. The Bayesian approach is better at assessing the information content of 2nd vs. 3rd position sites. 3rd position sites: info = 86.4%; 2nd position sites: info = 75.6%. 3rd positions have more information than 2nd positions! [Plot: same axes as slide 70, with the labels 0.005 and 0.700]
  • 72. Using the resolution class prior does not change the conclusion that 3rd position sites have more information than 2nd position sites. 2nd position sites: info = 30.2%; 3rd position sites: info = 54.9%. [Plot: same axes as slide 70, plus two histograms of sampled trees by resolution class (1 to 31), one scaled 0 to 4000 and one 0 to 1600]
  • 73. Outline: Why?; Information Theory; Dissonance; Additivity; Scaling Storm; Polytomy Rainbow
  • 74. Why measure information content? Morphology vs. molecules
  • 75. Why measure information content? Informed site-stripping. [Plot with axes running from 0 to 5 and from 0 to 0.5]
  • 76. Why measure information content? Impact of missing data (missing taxa, missing genes, random)
  • 77. Why measure information content? Partition gene tree conflict. [Figure: Data 1 with Trees 1 and Data 2 with Trees 2, linked by dissonance; same example alignments as slide 42]
  • 78. Why measure information content? Profiling information content
  • 79. Why measure information content? Divergence time analyses
  • 80. Thanks! UConn collaborators: Ming-Hui Chen, Lynn Kuo, Louise Lewis, Karolina Fučíková, Suman Neupane, Yu-Bo Wang, Daoyuan Shi. Supported by the National Science Foundation. Department of Ecology and Evolutionary Biology. Systematic Biology Advance Access: http://dx.doi.org/10.1093/sysbio/syw042
  • 81. Literature Cited
      Archie, J. W. 1989. A randomization test for phylogenetic information in systematic data. Systematic Zoology 38(3):239–252.
      Bergthorsson, U., Adams, K. L., Thomason, B., & Palmer, J. D. 2003. Widespread horizontal transfer of mitochondrial genes in flowering plants. Nature 424:197–201.
      Brown, J. M. 2014. Detection of implausible phylogenetic inferences using posterior predictive assessment of model fit. Systematic Biology 63(3):334–348.
      Faith, D. P., & Cranston, P. S. 1991. Could a cladogram this short have arisen by chance alone?: on permutation tests for cladistic structure. Cladistics 7(1):1–28.
      Fitch, W. M. 1984. Cladistic and other methods: problems, pitfalls, and potentials. Chapter 12 in: Duncan, T., & Stuessy, T. F. (eds.), Cladistics: perspectives on the reconstruction of evolutionary history. Papers presented at a workshop on the theory and application of cladistic methodology, March 22-28, 1981, University of California, Berkeley. Columbia University Press, New York.
      Hillis, D. M. 1991. Discriminating between phylogenetic signal and random noise in DNA sequences. Pp. 278–284 in: Miyamoto, M. M., & Cracraft, J. (eds.), Phylogenetic analysis of DNA sequences. Oxford University Press, New York.
      Huelsenbeck, J. P. 1991. Tree-length distribution skewness: an indicator of phylogenetic information. Systematic Zoology 40(3):257–270.
      Larget, B. 2013. The estimation of tree posterior probabilities using conditional clade probability distributions. Systematic Biology 62(4):501–511.
      Lewis, P. O., Holder, M. T., & Holsinger, K. E. 2005. Polytomies and Bayesian phylogenetic inference. Systematic Biology 54(2):241–253.
      Xia, X., Xie, Z., Salemi, M., Chen, L., & Wang, Y. 2003. An index of substitution saturation and its application. Molecular Phylogenetics and Evolution 26(1):1–7.
      Claude Shannon photograph: http://www.itsoc.org/about/shannon
      All other photographs by Paul O. Lewis