Using model-based statistical inference to learn about evolution

These are the slides I used for my promotion talk to associate member at the Fred Hutch. My abstract follows:

Our knowledge about much of biology is indirect: rather than directly observing a process, we observe some noisy result of that process. In addition, we almost never have a complete description mapping underlying processes to observations. Given these challenges, what framework can we use to understand biology?

In this talk I will describe the use of probabilistic models to learn about evolution from biological data. Starting with the more familiar terrain of solving equations and performing integration in math, I will describe how these same concepts are generalized to the probabilistic setting. I will illustrate how this works in practice with examples from our current research on reconstruction of evolutionary trees and maturation of antibody-making B cells.

Published in: Science

  1. Using model-based statistical inference to learn about evolution. Frederick “Erick” Matsen, http://matsen.fredhutch.org/, @ematsen
  2. My group develops mathematical and computational tools for model-based statistical inference on continuous and discrete mathematical objects motivated by evolutionary sequence analysis of microbes and the immune system.
  3. What is model-based statistical inference?
  4. Modern technology gives us the ability to observe in great detail.
  5. But very detailed observation is not the same as understanding. To understand, we need to simplify and abstract.
  6. What abstractions do we have at our disposal?
  7. 3
  8. $x$
  9. $x$ is useful and we love it dearly! $x$ allows us to describe knowledge in an implicit way: if $f(x) = y$, then we can work towards solving for $x$. Alternatively, one might be interested in taking the average of $f(x)$ between two values $a$ and $b$.
  10. Define $\int_a^b f(x)\,dx$ as area (figure with $a$ and $b$ marked on the axis).
  11. $1/(b - a) \cdot \int_a^b f(x)\,dx$ is average (figure with $a$ and $b$ marked on the axis; label: average on $(a, b)$).
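As a rough numerical illustration of the area and average definitions on the two slides above, here is a minimal sketch. The midpoint rule, the function names, and the example $f(x) = x^2$ are assumptions chosen for illustration, not something taken from the talk.

```python
def integrate(f, a, b, n=10_000):
    """Approximate the area under f on (a, b) with the midpoint rule."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def average(f, a, b):
    """Average of f on (a, b): the area divided by the interval length."""
    return integrate(f, a, b) / (b - a)


def square(x):
    # Illustrative choice of f; any real-valued function would do.
    return x * x


area = integrate(square, 0.0, 3.0)  # exact area is 9
mean = average(square, 0.0, 3.0)    # exact average is 3
```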
  12. Variables allow us to solve (diagram labels: ?, $x$, $y$). Problem 1: given $y$, solve for $x$. Problem 2: predict if a 10% bigger charge will hit the castle. Say the answer to this is $\mathrm{hit}_{10}(x)$, such that $\mathrm{hit}_{10}(x)$ is 1 if that will make the cannonball hit the castle, and 0 otherwise.
  13. Variables allow us to solve (diagram labels: ?, $x$, $y$) … in a deterministic framework.
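The two deterministic problems can be sketched in code under some loudly stated assumptions: $x$ is taken to be the cannon angle and $y$ the horizontal distance, the flight is drag-free, the muzzle speed is known, and a "10% bigger charge" is modeled as a 10% higher muzzle speed. The castle position and all function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def distance(angle, speed):
    """Horizontal distance y for launch angle x (radians), ignoring drag."""
    return speed ** 2 * math.sin(2 * angle) / G


def solve_angle(y, speed):
    """Problem 1: the low-trajectory angle x with distance(x, speed) == y."""
    return 0.5 * math.asin(G * y / speed ** 2)


def hit10(angle, speed, castle_near, castle_far):
    """Problem 2: 1 if a shot with 10% more speed lands on the castle, else 0."""
    landing = distance(angle, 1.1 * speed)
    return 1 if castle_near <= landing <= castle_far else 0


# Hypothetical numbers: the shot landed 500 m away at 100 m/s.
x = solve_angle(500.0, speed=100.0)
will_hit = hit10(x, speed=100.0, castle_near=600.0, castle_far=620.0)
```

With these made-up numbers the faster shot lands about 605 m away, so `will_hit` is 1.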
  14. Life is a probabilistic process. How do we abstract probabilistic quantities?
  15. $X$
  16. Random variables $X$ abstract variables. It doesn’t have a fixed value: we have to “ask” it for a value. Random variables are capricious, but they are well defined behind their stochastic exterior.
  17. Random variable sampling determined by distributions. Sometimes discrete: $P(\text{heads}) = 0.51$, $P(\text{tails}) = 0.49$. Sometimes continuous.
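A small sketch of "asking" a random variable for values, using the coin probabilities from the slide. The continuous distribution on the slide is not recoverable from the transcript, so the standard normal below is purely an assumed stand-in; the seed and sample size are arbitrary.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def flip(rng):
    """Ask the discrete random variable for a value."""
    return "heads" if rng.random() < 0.51 else "tails"


n = 100_000
flips = [flip(rng) for _ in range(n)]
heads_fraction = flips.count("heads") / n  # close to 0.51 for large n

# A continuous example: a standard normal (an assumed choice).
draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
sample_mean = sum(draws) / n  # close to 0 for large n
```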
  18. Working with random variables $X$: We can solve for $X$ in “equations” like $f(X) \sim Y$, obtaining expressions such as $P(X \mid Y)$; this is called inference. We can also average with respect to $X$: $\int f(X)\,dP(X \mid Y)$, where now we are averaging out with respect to a probability.
  19. Probabilistic approach to prediction (diagram labels: ?, $X$, $Y$). $Y$: horizontal distance traveled by a cannonball (random variable). $X$: cannon angle (inferred random variable). Problem 1: given observed distribution $Y$, infer distribution of $X$. Problem 2: find probability that a 10% bigger charge will hit castle. 1. Solve $f(X) = Y$ to get $P(X \mid Y)$. 2. Integrate $\int \mathrm{hit}_{10}(X)\,dP(X \mid Y)$.
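The two steps on the slide above can be sketched with a grid approximation. Everything specific here is an assumption for illustration: drag-free flight, a known muzzle speed, Gaussian measurement noise, a uniform prior on angles in $(0, \pi/4)$, the observed distances, and the castle position. It is one simple way to obtain $P(X \mid Y)$, not the method used in the talk.

```python
import math

G = 9.81        # gravitational acceleration, m/s^2
SPEED = 100.0   # assumed known muzzle speed, m/s
SIGMA = 10.0    # assumed measurement noise (standard deviation), m


def distance(angle, speed=SPEED):
    """The model f: horizontal distance for a launch angle, ignoring drag."""
    return speed ** 2 * math.sin(2 * angle) / G


def posterior(observed, n_grid=2000):
    """Grid approximation to P(X | Y) with a uniform prior on (0, pi/4)."""
    step = (math.pi / 4) / n_grid
    angles = [(i + 0.5) * step for i in range(n_grid)]
    log_like = [
        sum(-0.5 * ((y - distance(a)) / SIGMA) ** 2 for y in observed)
        for a in angles
    ]
    top = max(log_like)  # subtract the maximum to avoid underflow
    weights = [math.exp(v - top) for v in log_like]
    total = sum(weights)
    return angles, [w / total for w in weights]


def hit10(angle, castle_near=600.0, castle_far=620.0):
    """1 if a shot with 10% more speed lands on the (hypothetical) castle."""
    landing = distance(angle, 1.1 * SPEED)
    return 1 if castle_near <= landing <= castle_far else 0


observed = [495.0, 510.0, 500.0]          # made-up distance measurements
angles, probs = posterior(observed)       # step 1: P(X | Y)
mean_angle = sum(a * p for a, p in zip(angles, probs))
p_hit = sum(hit10(a) * p for a, p in zip(angles, probs))  # step 2: integrate
```

The answer to Problem 2 is now a probability rather than a 0 or 1, because the uncertainty in the angle is carried through the sum.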
  20. Biological experiments are measurements with uncertainty. Diagram labels: ?, $X$, $Y$; sequences CATTCTTGTACG, GTTCGGCGAAGA, GCGTAAAATAGG, AGGGGTTGCATG, CTTCACTGGCAT; expression level of certain genes; risk.
  21. Model-based statistical inference ✓ We can solve for $X$ in “equations” like $f(X) \sim Y$, inferring an unknown distribution for $X$ (what can we learn about the angle of the cannon). We can push uncertainty through an analysis using integrals like $\int_a^b f(X)\,dP(X \mid Y)$ (we don’t care what the angle of the cannon is really, we just want to know with what probability the shot is going to hit the castle!).
  22. Now, what is model-based statistical inference on discrete mathematical objects?
  23. Motivation: we would like to decide whether an individual has been superinfected, i.e. infected with a second viral variant in a separate event. (Figure panels: single infection, superinfection.)
  24. Integrate out phylogenetic uncertainty. Diagram labels: ?, $X$, $Y$; sequences CATTCTTGTACG, GTTCGGCGAAGA, GCGTAAAATAGG, AGGGGTTGCATG, CTTCACTGGCAT. To decide superinfection, we would like to calculate $\int_S f(X)\,dP(X \mid Y)$, where $X$ is now a phylogenetic-tree-valued random variable.
  25. Time to count your blessings. Real numbers are equipped with a total order ($3 < 4$). Real numbers are equipped with a simply-computed distance that is compatible with the total order ($|7 - 3| = 4$). Real numbers form a continuum ($2.9 < 2.95 < 3$).
  26. We can thus define the integral in the real-valued case: $\int_a^b f(x)\,dx$ and $\int_a^b f(X)\,dP(X \mid Y)$ (figure axes marked $a$ and $b$).
  27. Integrating over phylogenetic trees? Phylogenetic trees have discrete topologies; there is no canonical distance between them, nor a natural total order. But we still want to do inference and integration in this setting! (Figure sequences: ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ...) Joint work with postdoc Chris Whidden.
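One way to read the integral in this discrete setting, offered as a sketch rather than as the formulation used in the talk: if $X$ ranges over a finite set of tree topologies (written $\mathcal{T}$ here, a symbol introduced for illustration, and ignoring branch lengths), the integral becomes a posterior-weighted sum, which can in turn be approximated by averaging over $N$ trees $\tau_1, \dots, \tau_N$ sampled from the posterior.

```latex
\int f(X)\, dP(X \mid Y)
  \;=\; \sum_{\tau \in \mathcal{T}} f(\tau)\, P(\tau \mid Y)
  \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f(\tau_i),
  \qquad \tau_i \sim P(\cdot \mid Y).
```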
  28. Notion of proximity of trees?
  29. Subtree-prune-regraft (rSPR) definition. (Figure: example trees with leaves labeled 1 to 6.) These trees are then distance 1 apart.
  30. Tree graph connected by rSPR moves
  31. Tree inference bounces around graph
  32. Probability is # of visits to nodes
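The idea that inference "bounces around" a graph and that visit counts estimate probability can be sketched with a toy Metropolis-Hastings chain. The four-node graph, the node names, and the weights below are hypothetical stand-ins for trees and their unnormalized posterior; real phylogenetic MCMC software works on vastly larger graphs with likelihood-based weights.

```python
import random
from collections import Counter

# Hypothetical toy graph: nodes stand in for trees, edges for single moves.
NEIGHBORS = {
    "T1": ["T2", "T3"],
    "T2": ["T1", "T3"],
    "T3": ["T1", "T2", "T4"],
    "T4": ["T3"],
}
# Hypothetical unnormalized posterior weights; normalized: 0.4, 0.3, 0.2, 0.1.
WEIGHT = {"T1": 4.0, "T2": 3.0, "T3": 2.0, "T4": 1.0}


def run_chain(n_steps, seed=0):
    """Random walk on the graph; returns the fraction of visits per node."""
    rng = random.Random(seed)
    current = "T1"
    visits = Counter()
    for _ in range(n_steps):
        proposal = rng.choice(NEIGHBORS[current])
        # Metropolis-Hastings ratio, corrected for unequal neighbor counts.
        ratio = (WEIGHT[proposal] * len(NEIGHBORS[current])) / (
            WEIGHT[current] * len(NEIGHBORS[proposal])
        )
        if rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return {node: visits[node] / n_steps for node in NEIGHBORS}


estimate = run_chain(200_000)
```

The visit fractions in `estimate` should land close to the normalized weights, which is the sense in which probability is the number of visits to nodes.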
  33. Subset to high probability nodes. Node size proportional to posterior probability; color shows distance to highest probability tree.
  34. The top 4096 trees for a data set
  35. Graph effects matter. For more details: Chris Whidden and FM. Quantifying MCMC exploration of phylogenetic tree space. Systematic Biology 2015. … so what do we know about this graph?
  36. Is the tree graph positively curved?
  37. Is it flat?
  38. Is it negatively curved?
  39. (Figure labels: curvature, SPR distance, imbalanced, balanced.)
  40. Model-based statistical inference on discrete and continuous mathematical objects ✓ When we perform inference on $f(X) \sim Y$, we can have $X$ be something continuous, discrete, or continuous and discrete. Discrete-ness brings special challenges; graphs are helpful.
  41. Next: use model-based statistical inference to learn about adaptive immunity. Joint with Trevor Bedford (VIDD), Connor McCoy (now at Google), Vladimir Minin (UW Statistics), and Duncan Ralph (postdoc). Data from Harlan Robins (PHS/Adaptive).
  42. Jenner’s 1796 vaccine. A revolutionary advance.
  43. Where are we 200 years later? Vaccine trials still take a long time and are very costly.
  44. Where are we 200 years later? (Cartoon caption: “Just invented vaccines. I rock. LOL”) Vaccine trials still take a long time and are very costly.
  45. Vaccines manipulate the adaptive immune system. Current practice for trials: 1. Stimulate immune system. 2. Battle-test immune system via pathogen exposure. What can we learn from antibody-making B cells without battle-testing?
  46. Antibodies bind antigens
  47. B cell diversification process. (Figure labels: V genes, D genes, J genes; VDJ rearrangement including erosion and non-templated insertion; naive B cell; antigen; somatic hypermutation; affinity maturation; experienced B cell.)
  48. Overall goal: reconstruct process. (Figure sequences: ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ...; labels: reality, inference.)
  49. Why reconstruct B cell lineages? 1. Vaccine design. This one is really good. How can we elicit it?
  50. Why reconstruct B cell lineages? 1. Vaccine design
  51. Why reconstruct B cell lineages? 1. Vaccine design 2. Vaccine assay (figure label: ?)
  52. Why reconstruct B cell lineages? 1. Vaccine design 2. Vaccine assay 3. Evolutionary analysis to learn about underlying mechanisms
  53. Goal 1: how are antibodies “drafted”? (Figure sequences: ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ...; labels: reality, rearrangement groups.)
  54. “Solve” $f(X) \sim Y$, where $f$ is a statistical model of recombination and maturation, $X$ are parameters of that model (including clusters), and $Y$ are antibody repertoire sequences. (Figure labels: V genes, D genes, J genes; VDJ rearrangement including erosion and non-templated insertion; naive B cell; antigen; somatic hypermutation; affinity maturation; experienced B cell.)
  55. VDJ annotation problem: from where did each nucleotide come? (Figure labels: somatic hypermutation, sequencing primer, sequencing error, 3’V deletion, VD insertion, 5’D deletion, 3’D deletion, 5’J deletion, DJ insertion; stages: biological process, sequencing, inference.) This is a key first step in BCR sequence analysis.
  56. Rich probabilistic models work. (Figure: frequency, 0.0 to 0.3, against Hamming distance, 0 to 15; panel label HTTN; methods: partis (k=5), partis (k=1), ighutil, iHMMunealign, igblast, imgt.)
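For readers unfamiliar with the axis of the figure above, Hamming distance counts the positions at which two equal-length sequences differ. The sketch below is a generic implementation, not code from any of the tools named on the slide; the two inputs are just the visible prefixes of sequences shown elsewhere in the deck.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Visible nine-base prefixes of two sequences from the slides.
d = hamming("ACATGGCTC", "ATACGTTCC")  # differs at 5 of 9 positions
```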
  57. Integrate out annotation uncertainty for better clustering
  58. Goal 2: how are antibodies “revised”? Estimate per-residue level of natural selection on receptor sequences from healthy individuals. $\omega = dN/dS$ ■ Large $\omega$: diversifying sites ■ $\omega$ near 1: neutral sites ■ Small $\omega$: purifying sites
  59. In antibodies (figure labels: AAC, AAG, GTG, GTC; more likely, less likely).
  60. For selection (figure labels: CCA and CCT, Pro and Pro, synonymous; ACC and ATC, Thr and Ile, nonsynonymous). In antibodies (figure labels: AAC, AAG, GTG, GTC; more likely, less likely).
  61. For selection (figure labels: CCA and CCT, Pro and Pro, synonymous; ACC and ATC, Thr and Ile, nonsynonymous). In antibodies (figure labels: AAC, AAG, GTG, GTC; more likely, less likely). Solution: use “out-of-frame” sequences to determine neutral mutation rate.
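The synonymous versus nonsynonymous distinction on these slides can be checked mechanically. The sketch below uses only the slice of the standard genetic code needed for the codons that appear in the figure labels; the table and function names are made up for illustration, and a real analysis would use a full codon table.

```python
# Partial standard genetic code, covering only the codons on the slides.
CODON_TO_AMINO_ACID = {
    "CCA": "Pro", "CCT": "Pro",
    "ACC": "Thr", "ATC": "Ile",
    "AAC": "Asn", "AAG": "Lys",
    "GTG": "Val", "GTC": "Val",
}


def classify(before, after):
    """Label a codon change by whether the encoded amino acid changes."""
    old = CODON_TO_AMINO_ACID[before]
    new = CODON_TO_AMINO_ACID[after]
    return "synonymous" if old == new else "nonsynonymous"


pro_change = classify("CCA", "CCT")  # Pro stays Pro
thr_change = classify("ACC", "ATC")  # Thr becomes Ile
```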
  62. (Figure labels: antigen, light chain; purifying, neutral, diversifying.)
  63. Conclusion. We like to “solve equations” like $f(X) \sim Y$, where $X$ and $Y$ are random variables. We especially like the case when $Y$ is sequence data and $X$ is something weird. We can use these tools to learn about B cell receptor sequence evolution.
  64. Next steps: phylogenetics. Understand the impact of data on curvature. Extend work to other models of tree space. Use understanding to design biased proposals that don’t get stuck. Implement phylogenetic algorithms that can update trees given more sequences. Continue building community with phyloseminar.org and phylobabble.org.
  65. Next steps: B cells. (Figure sequences: ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ...; labels: reality, inference.) Learn more about the mutation process in B cell maturation to better reconstruct ancestral sequences; evolutionary dynamics. Etiology of Burkitt’s lymphoma.
  66. Next steps: B cells. Origin of protective antibodies; optimization of vaccination strategies. Watching immune repertoires evolve through time.
  67. Wish I had time to talk about: Evolution of innate immunity & viral antagonists; origin of SIVcpz. Founder HIV sequence identification for sieve analysis.
  68. Wish I had time to talk about: Human microbiome. Simian foamy virus variation; innate immune defense.
  69. Wish I had time to talk about: HIV superinfection. Drug resistance mutations.
  70. Thank you to my group members
  71. Thank you to the Fred Hutch community. Brilliant students, postdocs, and staff scientist collaborators. Computational biology program, esp. “scouts” and Marty. Fantastic admin support: Sara, Melissa, and Anissa. Fantastic computing support: esp. Dirk, Carl, Erik, and Michael. fredhutch.io supporters: Katie P, Dan G, and Garnet. Patience with my meddling: Larry, Myra, Jon C.
