Learning how antibodies are drafted and revised


Antibodies must recognize a great diversity of antigens to protect us from infectious disease. The binding properties of antibodies are determined by the sequences of their corresponding B cell receptors (BCRs). These BCR sequences are created in "draft" form by VDJ recombination, which randomly selects and deletes from the ends of V, D, and J genes, then joins them together with additional random nucleotides. If they pass initial screening and bind an antigen, these sequences then undergo an evolutionary process of mutation and selection, "revising" the BCR to improve binding to its cognate antigen. It has recently become possible to determine the antibody-determining BCR sequences resulting from this process in high throughput. Although these sequences implicitly contain a wealth of information about both antigen exposure and the process by which we learn to resist pathogens, this information can only be extracted using computer algorithms.
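The "drafting" step described above can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the gene names and sequences below are hypothetical placeholders, and deletion and insertion lengths are drawn uniformly, whereas the real distributions are allele-specific.

```python
import random

# Hypothetical toy germline segments; real V, D, and J genes are far longer.
V_GENES = {"V1": "ACATGGCTCAGG", "V2": "ATACGTTCCAGT"}
D_GENES = {"D1": "GGTACAACTGG", "D2": "TTACGGTTCAC"}
J_GENES = {"J1": "ACTGGTTCGACC", "J2": "ATCCGGTACTGG"}


def random_bases(rng, n):
    """Return n non-templated nucleotides drawn uniformly (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def simulate_rearrangement(rng, max_del=3, max_ins=4):
    """Draw one 'draft' sequence and return it with its true annotation."""
    v_gene = rng.choice(sorted(V_GENES))
    d_gene = rng.choice(sorted(D_GENES))
    j_gene = rng.choice(sorted(J_GENES))
    # Deletion lengths at the four segment ends that meet at the junctions.
    v_3p, d_5p, d_3p, j_5p = (rng.randint(0, max_del) for _ in range(4))
    v_germ, d_germ, j_germ = V_GENES[v_gene], D_GENES[d_gene], J_GENES[j_gene]
    parts = {
        "v_part": v_germ[: len(v_germ) - v_3p],
        "vd_insertion": random_bases(rng, rng.randint(0, max_ins)),
        "d_part": d_germ[d_5p : len(d_germ) - d_3p],
        "dj_insertion": random_bases(rng, rng.randint(0, max_ins)),
        "j_part": j_germ[j_5p:],
    }
    seq = "".join(parts.values())  # dicts preserve insertion order
    annotation = {"v_gene": v_gene, "d_gene": d_gene, "j_gene": j_gene, **parts}
    return seq, annotation


seq, annotation = simulate_rearrangement(random.Random(0))
print(seq, annotation)
```

Keeping the true annotation next to each simulated sequence is what makes such a simulator usable for benchmarking an annotation method.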

In this talk, I will describe two recent projects to develop model-based inferential tools for analyzing BCR sequences. In the first, we find that large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for VDJ annotation inference leads to significantly improved results. In the second, we investigate selection on BCRs using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection on millions of reads, which provides a more nuanced view of the constraints on framework and variable regions.

Published in: Science

  1. 1. Learning how antibodies are drafted and revised Frederick “Erick” Matsen Fred Hutchinson Cancer Research Center @ematsen http://matsen.fredhutch.org/ with Trevor Bedford (FH), Connor McCoy, Vladimir Minin (UW), and Duncan Ralph (FH)
  2. 2. Jenner’s 1796 vaccine   Where are we 200 years later?
  3. 3. RV144 HIV trial: 2003-2009 26,676 volunteers enrolled 16,395 volunteers randomized 125 infections $105,000,000 and 6 years   Prospective studies are expensive, slow, and entail complex moral issues. This does not lend itself to rapid vaccine development.   How might we guide vaccine development without disease exposure?
  4. 4. Vaccines manipulate the adaptive immune system     What can we learn from antibody-making B cells without battle-testing them through disease exposure?
  5. 5. Antibodies bind antigens Antigen Light chain Heavy chain
  6. 6. Too many antigens to code for directly [figure: the number of possible antigens is effectively unbounded (≈ ∞)]
  7. 7. B cell diversification process [figure labels: V genes, D genes, J genes; VDJ rearrangement, including erosion and non-templated insertion; naive B cell; antigen; affinity maturation; somatic hypermutation; experienced B cell]
  8. 8. What germline really looks like (Eichler and Breden groups)
  9. 9. Big aim: reconstruct from memory reads [figure: reads ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT...; reality vs. inference]
  10. 10. Why reconstruct B cell lineages? 1. Vaccine design. This one is really good. How can we elicit it?
  11. 11. Why reconstruct B cell lineages? 1. Vaccine design [figure labels: immunogen 1, immunogen 2]
  12. 12. Why reconstruct B cell lineages? 1. Vaccine design. 2. Vaccine assay.
  13. 13. Why reconstruct B cell lineages? 1. Vaccine design. 2. Vaccine assay. 3. Evolutionary analysis to learn about underlying mechanisms.
  14. 14. Goal 1: find rearrangement groups [figure: reads ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT...; reality vs. rearrangement groups]
  15. 15. VDJ annotation problem: from where did each nucleotide come? [figure: biological process (3’ V deletion, VD insertion, 5’ D deletion, 3’ D deletion, 5’ J deletion, DJ insertion, somatic hypermutation), then sequencing (sequencing primer, sequencing error), then inference] This is a key first step in BCR sequence analysis.
  16. 16. Data: Illumina reads from CDR3 locus [figure: biological process (3’ V deletion, VD insertion, 5’ D deletion, 3’ D deletion, 5’ J deletion, DJ insertion, somatic hypermutation), then sequencing (sequencing primer, sequencing error)] Total of about 15 million unique 130 nt sequences from memory B cell populations of three healthy individuals A, B, and C.
  17. 17. “Thread” reads onto structure [figure: V genes, D genes, J genes]
  18. 18. HMM intro: dishonest casino [figure: dice, labeled 6 6]
  19. 19. HMM intro: dishonest casino [figure: dice states with transition probabilities p and 1-p]
  20. 20. HMM intro: dishonest casino [figure: dice states with transition probabilities p and 1-p, and emitted rolls]
  21. 21. HMM intro: dishonest casino [figure: dice states with transition probabilities p and 1-p, unrolled over successive turns]
  22. 22. [figure: HMM state diagram with transition probabilities p and 1-p]
  23. 23. [figure: HMM state diagram with transition probabilities p and 1-p]
  24. 24. [figure: V genes, D genes, J genes]
  25. 25. [figure: V genes, D genes, J genes]
  26. 26. [figure: V genes, D genes, J genes]
  27. 27. Detour: write HMM inference package. We wanted to use HMMoC by G. Lunter (Bioinf 2007), then tried extending StochHMM by Lott & Korf (Bioinf 2014), but it ended up being a complete rewrite by Duncan to make ham. ham takes an HMM description in a concise and intuitive YAML format (for the CpG example, 440 characters for ham vs. 5,961 for HMMoC XML), is slightly faster and more memory efficient than HMMoC, and uses continuous integration via Docker. ham: https://github.com/psathyrella/ham. Then write BCR annotation package, partis: https://github.com/psathyrella/partis
  28. 28. What are probabilities? [figure: V genes, D genes, J genes]
  29. 29. Distributions are reproducibly weird! [figure: deletion-length distributions (x-axis: bases; y-axis: frequency) for individuals A, B, and C; panels: IGHV2-70*12 V 3' deletion, IGHD1-14*01 D 5' deletion, IGHD7-27*01 D 3' deletion, IGHJ4*02 J 5' deletion]
  30. 30. Distributions are reproducibly weird! [figure: per-position mutation frequency (x-axis: position; y-axis: mutation freq) for individuals A, B, and C; panels: IGHV3-23D*01, IGHV3-33*06]
  31. 31. Only insertions look simple [figure: insertion-length distributions (x-axis: bases; y-axis: frequency) for individuals A, B, and C; panels: VD insertion, DJ insertion]
  32. 32. Simulate sequences to benchmark [figure: biological process (3’ V deletion, VD insertion, 5’ D deletion, 3’ D deletion, 5’ J deletion, DJ insertion, somatic hypermutation), then sequencing (sequencing primer, sequencing error), then inference] Simulation code is independent from inference code.
  33. 33. Incorporating this complexity is good [figure: frequency vs. Hamming distance (0 to 15), labeled HTTN; methods compared: partis (k=5), partis (k=1), ighutil, iHMMune-align, igblast, imgt] but there are still a number of errors.
  34. 34. Remember goal: find rearrangement groups [figure: reads ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT...; reality vs. rearrangement groups]
  35. 35. Say we are given two sequences [figure: double roll of a single die per turn (one chain with transition probabilities p and 1-p) vs. two independent die rolling games (two such chains)]
  36. 36. Double roll ↔ Pair HMM [figure: HMM state diagram with transition probabilities p and 1-p]
  37. 37. Do two sequences come from a single rearrangement event? The forward algorithm for HMMs gives the probability of generating an observed sequence $x$ from a given HMM: $P(x) = \sum_{\text{paths } \sigma} P(x; \sigma)$. Similarly, the probability of generating two sequences $x$ and $y$ from the same path through the HMM (summed across paths) is $P(x, y) = \sum_{\text{paths } \sigma} P(x, y; \sigma)$.
  38. 38. [figure: V genes, D genes, J genes]
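The two sums over paths on slide 37 can be sketched on the dishonest casino example rather than on a real VDJ model. The switching probability `p` follows the slides; the loaded-die emission probabilities (0.5 for a six, 0.1 otherwise), the uniform start, and the restriction to equal-length sequences without indels are assumptions made only for illustration.

```python
STATES = ("fair", "loaded")
# Assumed emission probabilities: the loaded die shows a six half the time.
EMIT = {
    "fair": {roll: 1 / 6 for roll in range(1, 7)},
    "loaded": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def emit_column(state, column):
    """Probability that one hidden state emits every roll in the column."""
    prob = 1.0
    for roll in column:
        prob *= EMIT[state][roll]
    return prob


def forward(columns, p=0.05):
    """Sum the probability of the observed columns over all hidden paths."""
    # alpha[s] = P(columns seen so far, current hidden state is s)
    alpha = {s: 0.5 * emit_column(s, columns[0]) for s in STATES}
    for column in columns[1:]:
        alpha = {
            s: emit_column(s, column)
            * sum(alpha[r] * ((1 - p) if r == s else p) for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def prob_single(x, p=0.05):
    """P(x): one roll per turn."""
    return forward([(roll,) for roll in x], p)


def prob_pair(x, y, p=0.05):
    """P(x, y): two rolls per turn, both emitted from the same hidden path."""
    return forward(list(zip(x, y)), p)


x, y = [6, 6, 1, 6], [6, 5, 6, 6]
print(prob_pair(x, y) / (prob_single(x) * prob_single(y)))
```

A ratio above 1 means the two sequences are better explained by one shared hidden path than by two independent ones, which is the quantity the clustering on the later slides relies on.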
  39. 39. Do sets of sequences come from a single rearrangement event? $\frac{P(A \cup B)}{P(A)\,P(B)} = \frac{P(A \cup B \mid \text{single rearrangement})}{P(A, B \mid \text{independent rearrangements})}$. Use this for agglomerative clustering; stop when the ratio < 1.
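The clustering rule on slide 39 can be sketched as a greedy loop. The merge order (always take the pair with the largest ratio) is an assumption, and `toy_log_prob` with its two `TEMPLATES` is a hypothetical stand-in for the HMM likelihood, not the model used in the talk.

```python
import math
from itertools import combinations

# Hypothetical stand-in model: a cluster derives from one of two templates.
TEMPLATES = ("AAAAAAAA", "CCCCCCCC")


def toy_log_prob(cluster):
    """log P(cluster), summing over the template shared by all members."""
    total = 0.0
    for template in TEMPLATES:
        likelihood = 1.0 / len(TEMPLATES)
        for seq in cluster:
            for base, ref in zip(seq, template):
                likelihood *= 0.9 if base == ref else 0.1 / 3
        total += likelihood
    return math.log(total)


def agglomerate(seqs, log_prob):
    """Merge clusters greedily while the best likelihood ratio is >= 1."""
    clusters = [(s,) for s in seqs]
    while len(clusters) > 1:
        # Log of P(A u B) / (P(A) P(B)) for every pair of current clusters.
        best_log_ratio, (i, j) = max(
            (log_prob(a + b) - log_prob(a) - log_prob(b), (i, j))
            for (i, a), (j, b) in combinations(enumerate(clusters), 2)
        )
        if best_log_ratio < 0:  # ratio < 1: stop merging
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


reads = ["AAAAAAAA", "AAAAAACA", "CCCCCCCC", "CCCCGCCC"]
print(agglomerate(reads, toy_log_prob))
```

Passing the likelihood in as a function keeps the loop independent of the model, so an HMM-based `log_prob` could be substituted without changing the clustering code.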
  40. 40. Preliminary simulation   Integrate out annotation uncertainty and win.
  41. 41. Goal 2: how are antibodies revised?
  42. 42. First, investigate BCR mutation patterns affinity maturation antigen naive B cell experienced B cell clonal expansion somatic hypermutation
  43. 43. Use two-taxon “trees” for model fitting. Note: we know the ancestral state within V, D, and J. [figure: germline V, D, and J segments above the corresponding read segments; other regions are marked IGNORE] Our “trees” have an observed read on the bottom and the corresponding “ancestral” germline sequence on top, connected by a branch representing some amount of divergence.
  44. 44. General Time Reversible model fit [figure: fitted nucleotide substitution matrices (axes: read base and germline base, over A, G, C, T) for IGHV, IGHD, and IGHJ in individuals A, B, and C]
  45. 45. Best model according to AIC/BIC has different matrices and fixed rate multipliers for the different segments. [figure: mutation model; sequence $i$ (Seq. 1 to Seq. 3) has branch length $t_i$ on V, $r_D t_i$ on D, and $r_J t_i$ on J]
  46. 46. Branch length distribution under this best model [figure: histograms of ML branch length (0.0 to 0.4) vs. count, panels Individual A, Individual B, Individual C; rate multipliers in the order shown: IGHD 3.36 and IGHJ 0.62; IGHD 4.44 and IGHJ 0.62; IGHD 3.88 and IGHJ 0.63] D segments evolve substantially faster than V. J segments evolve more slowly than V. Individual A has a higher mutational load.
  47. 47. Next consider selection (Goal 2 cont’d) affinity maturation antigen naive B cell experienced B cell clonal expansion somatic hypermutation
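  48. 48. In antibodies [figure: codons AAC, AAG, GTG, GTC; labels: more likely, less likely]
  49. 49. For selection [figure: CCA and CCT both encode Pro (synonymous); ACC encodes Thr and ATC encodes Ile (nonsynonymous)] In antibodies [figure: codons AAC, AAG, GTG, GTC; labels: more likely, less likely]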
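The synonymous versus nonsynonymous distinction on slide 49 can be checked mechanically against the standard genetic code. The sketch below is illustrative; the function name `classify_change` is made up for this example.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_change(codon_before, codon_after):
    """Label a codon substitution by its effect on the encoded amino acid."""
    if codon_before == codon_after:
        return "unchanged"
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "nonsynonymous"


print(classify_change("CCA", "CCT"))  # Pro to Pro
print(classify_change("ACC", "ATC"))  # Thr to Ile
```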
  50. 50. Would like per-site selection inference: $\omega \equiv \frac{dN}{dS} \equiv \frac{\text{rate of non-synonymous substitution}}{\text{rate of synonymous substitution}}$ [figure: per-position mutation frequency (x-axis: position; y-axis: mutation freq) for IGHV3-23D*01, individuals A, B, and C]
  51. 51. Productive vs. out-of-frame receptors. Each cell may carry two IGH alleles, but only one is expressed. [figure: two V D J rearrangements, one with an insertion that disrupts frame]
  52. 52. $\omega \equiv \frac{dN}{dS} \equiv \frac{\text{rate of non-synonymous substitution}}{\text{rate of synonymous substitution}}$ [figure: out-of-frame $\lambda^S$ by site (IMGT numbering, 70 to 100; log scale 0.1 to 1.0) for individuals A, B, and C] Out-of-frame reads can be used to infer the neutral mutation rate!
  53. 53. $\omega_l$ is a ratio of rates in terms of observed neutral process. $\lambda_l^{(N-I)}$: nonsynonymous in-frame rate for site $l$; $\lambda_l^{(N-O)}$: nonsynonymous out-of-frame rate for site $l$; $\lambda_l^{(S-I)}$: synonymous in-frame rate for site $l$; $\lambda_l^{(S-O)}$: synonymous out-of-frame rate for site $l$. $\omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}}$
  54. 54. Renaissance count (Lemey, Minin… 2012) [figure: codon alignment of seq-1 to seq-5 with amino acid translations, and a sampled mutation history between ACC and ATC at one site] Use sampled mutation histories to estimate rates... but such estimates can be unstable.
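As a small worked illustration of the ratio on slide 53, the per-site ω can be computed from the four rates once they have been estimated. The function and argument names here are hypothetical, and the input values are made up for the example.

```python
def omega(nonsyn_in, nonsyn_out, syn_in, syn_out):
    """Per-site selection ratio.

    The in-frame rates are normalized by the matching out-of-frame rates,
    which serve as the neutral reference, before taking the
    nonsynonymous-to-synonymous ratio.
    """
    for name, rate in (("nonsyn_out", nonsyn_out), ("syn_in", syn_in), ("syn_out", syn_out)):
        if rate <= 0:
            raise ValueError(f"{name} must be positive to form the ratio")
    return (nonsyn_in / nonsyn_out) / (syn_in / syn_out)


# Made-up rates: nonsynonymous changes are half as frequent in-frame as
# out-of-frame, while synonymous changes are equally frequent in both.
print(omega(nonsyn_in=0.2, nonsyn_out=0.4, syn_in=0.3, syn_out=0.3))
```

Values below 1 point toward purifying selection and values above 1 toward diversifying selection; deciding whether a site differs meaningfully from 1 needs uncertainty estimates that this sketch does not model.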
  55. 55. Empirical Bayes regularization to stabilize estimates Say we are doing a per-county smoking survey. zero smokers? Really? Use all of the data to fit prior distribution of smoking prevalence, then with given observations obtain per-county posterior.
  56. 56. Estimating selection coefficient $\omega_l$. $\lambda_l^{(N-I)}$: nonsynonymous in-frame rate for site $l$; $\lambda_l^{(N-O)}$: nonsynonymous out-of-frame rate for site $l$; $\lambda_l^{(S-I)}$: synonymous in-frame rate for site $l$; $\lambda_l^{(S-O)}$: synonymous out-of-frame rate for site $l$. $\omega_l = \frac{\lambda_l^{(N-I)} / \lambda_l^{(N-O)}}{\lambda_l^{(S-I)} / \lambda_l^{(S-O)}}$
  57. 57. Overall IGHV selection map [figure, Individual A: distribution of median estimates of ω by site (IMGT numbering, 75 to 105; log scale 0.1 to 10.0), and distribution of classifications (purifying, neutral, diversifying) across IGHV genes as counts by site]
  58. 58. Similar across individuals [figure: counts by site (IMGT numbering, 75 to 105) of purifying, neutral, and diversifying classifications for individuals A, B, and C]
  59. 59. [figure labels: antigen, light chain; purifying, neutral, diversifying]
  60. 60. Conclusion. B cell receptors are “drafted” and “revised” randomly, but with remarkably consistent deletion and insertion patterns, and with remarkably consistent substitution and selection. We can learn about these processes using model-based inference. Paper on annotation with partis will be up soon. Selection analysis paper is up on arXiv.
  61. 61. Thank you. Trevor Bedford, Connor McCoy, Vladimir Minin & Duncan Ralph; Phil Bradley for doing structural work; molecular work done by Paul Lindau in Phil Greenberg’s lab with support from Harlan Robins and Adaptive Biotechnologies; Adaptive Biotechnologies computational biology team. National Science Foundation and National Institutes of Health; University of Washington Center for AIDS Research (CFAR); University of Washington eScience Institute; W. M. Keck Foundation.
  62. 62. Addenda
  63. 63. Measuring clustering agreement [figure: examples of good agreement and bad agreement between clusterings $C_x$ and $C_y$] Intuition: “how much variability is there in the color for $C_x$ amongst the items of a given color under $C_y$?”
  64. 64. Mutual information $I$. Think of cluster identity under $C_x$ for a uniformly selected point as a random variable $X$ (similarly for $C_y$ and $Y$): $I(X; Y) = H(X) - H(X \mid Y)$, where $H(X)$ is the entropy of $X$ (ignoring $Y$), and $H(X \mid Y)$ is the entropy of $X$ given the value for $Y$. $I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)\,p(y)}\right)$. $AMI(U, V) = \frac{MI(U, V) - E\{MI(U, V)\}}{\max\{H(U), H(V)\} - E\{MI(U, V)\}}$
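The entropy and mutual information on slide 64 can be computed directly from two label vectors. This sketch covers only the unadjusted quantities; the adjusted version also needs the expected mutual information under random labelings, which is omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """H(X) in nats for the cluster label of a uniformly selected point."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x_labels, y_labels):
    """I(X; Y) in nats between two clusterings of the same points."""
    if len(x_labels) != len(y_labels):
        raise ValueError("both clusterings must label the same points")
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))
    px, py = Counter(x_labels), Counter(y_labels)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


good = mutual_information([0, 0, 1, 1], [5, 5, 7, 7])  # same partition
bad = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])   # unrelated partition
print(good, bad, entropy([0, 0, 1, 1]))
```

Only the partition matters, not the label names, which is why the first pair scores as perfect agreement even though the labels differ.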
  65. 65. Estimates of the mutational process are quite consistent between individuals (each point is a single entry for one of the matrices for a pair of individuals.)
  66. 66. Branch length differences between productive, unproductive Unproductive rearrangements are more likely to be either: unchanged from germline, or more divergent.
  67. 67. Sites are generally under purifying selection [figure: histograms of median log10(ω) (−1 to 1) vs. count for individuals A, B, and C, split into purifying, neutral, and diversifying]
  68. 68. Similar across individuals (ii)
  69. 69. Distribution of amino acids at the beginning of CDR3: selection for aromatic amino acids? [figure: amino acid frequency; left of line = out-of-frame, right of line = in-frame]
  70. 70. Stabilize with empirical Bayes regularization. Assume that $\lambda_l$, the substitution rate at site $l$, comes from a Gamma distribution with shape $\alpha$ and rate $\beta$: $\lambda_l \sim \mathrm{Gamma}(\alpha, \beta)$. Model total substitution counts $C_l$ (sampled via stochastic mapping) for a site as Poisson with rate $\lambda_l$: $C_l \sim \mathrm{Poisson}(\lambda_l)$. Fit $\hat{\alpha}$ and $\hat{\beta}$ to all data, then draw rates from the posterior: $\lambda_l \mid C_l \sim \mathrm{Gamma}(C_l + \hat{\alpha}, 1 + \hat{\beta})$. We extended this regularization to the case of non-constant coverage.
  71. 71. Sequence counts by status for individuals A / B / C: functional 4,139,983 / 4,861,800 / 3,748,306; out-of-frame 533,919 / 794,845 / 558,246; stop 104,525 / 169,423 / 112,901
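The Gamma-Poisson regularization on slide 70 can be sketched in a few lines. The slide does not say how the prior is fit, so the method-of-moments fit below is an assumption, as are the function names and the made-up counts; constant coverage is assumed throughout.

```python
import statistics


def fit_gamma_prior(counts):
    """Method-of-moments fit of the Gamma(shape, rate) prior (an assumption).

    Under the Gamma-Poisson model the counts have mean shape/rate and
    variance shape/rate + shape/rate**2, which are solved for shape and rate.
    """
    mean = statistics.mean(counts)
    variance = statistics.pvariance(counts)
    if variance <= mean:
        raise ValueError("counts are not overdispersed; moment fit undefined")
    rate = mean / (variance - mean)
    shape = mean * rate
    return shape, rate


def posterior_params(count, shape, rate):
    """Posterior Gamma(shape, rate) for one site after observing its count."""
    return count + shape, 1 + rate


def posterior_mean(count, shape, rate):
    post_shape, post_rate = posterior_params(count, shape, rate)
    return post_shape / post_rate


site_counts = [0, 1, 2, 3, 10, 0, 4, 2]  # made-up per-site counts
alpha_hat, beta_hat = fit_gamma_prior(site_counts)
for c in (0, 10):
    print(c, round(posterior_mean(c, alpha_hat, beta_hat), 3))
```

A site with zero observed substitutions gets a small positive rate and an extreme site is pulled toward the overall mean, which is the stabilizing effect the smoking-survey analogy describes. Drawing from the posterior rather than taking its mean could use `random.gammavariate(post_shape, 1 / post_rate)`, since that function takes a scale rather than a rate.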
  72. 72. Correlation between sequence and GTR matrix   Each dot is a pair of genes.
  73. 73. Simulation results for selection inference [figure: three panels by site (0 to 100): estimated ω (log scale 0.1 to 10.0), with points marked by whether a synonymous change is possible (yes/no) and by type (purifying, neutral, diversifying); proportion of type N vs. S (0 to 1); coverage (0 to 1000)]
  74. 74. Omega distribution
  75. 75. Random facts. Mean length of D segment in individual A’s naive repertoire is 16.61. Subject A’s naive sequences were 37% CDR3. Divergence between the various germ-line V genes: > summary(dist.dna(allele_01, pairwise.deletion=TRUE, model='raw')) gives Min. 0.003846, 1st Qu. 0.201300, Median 0.344600, Mean 0.304700, 3rd Qu. 0.384900, Max. 0.539500
