Antibodies must recognize a great diversity of antigens to protect us from infectious disease. The binding properties of antibodies are determined by the sequences of their corresponding B cell receptors (BCRs). These BCR sequences are created in "draft" form by VDJ recombination, which randomly selects and deletes from the ends of V, D, and J genes, then joins them together with additional random nucleotides. If they pass initial screening and bind an antigen, these sequences then undergo an evolutionary process of mutation and selection, "revising" the BCR to improve binding to its cognate antigen. It has recently become possible to determine the antibody-determining BCR sequences resulting from this process in high throughput. Although these sequences implicitly contain a wealth of information about both antigen exposure and the process by which we learn to resist pathogens, this information can only be extracted using computer algorithms.
In this talk, I will describe two recent projects to develop model-based inferential tools for analyzing BCR sequences. In the first, we find that large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for VDJ annotation inference leads to significantly improved results. In the second, we investigate selection on BCRs using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection on millions of reads, which provides a more nuanced view of the constraints on framework and variable regions.
1. Learning how antibodies
are drafted and revised
Frederick “Erick” Matsen
Fred Hutchinson Cancer Research Center
@ematsen
http://matsen.fredhutch.org/
with Trevor Bedford (FH), Connor McCoy,
Vladimir Minin (UW), and Duncan Ralph (FH)
4. RV144 HIV trial: 2003-2009
26,676 volunteers enrolled
16,395 volunteers randomized
125 infections
$105,000,000 and 6 years
Prospective studies are expensive, slow, and entail complex moral issues.
This does not lend itself to rapid vaccine development.
How might we guide vaccine development without disease exposure?
5. Vaccines manipulate
the adaptive immune system
What can we learn from antibody-making B cells without
battle-testing them through disease exposure?
8. B cell diversification process
V genes D genes J genes
Affinity
maturation
Somatic hypermutation
VDJ
rearrangement
including
erosion and
nontemplated
insertion
AntigenNaive B cell
Experienced B cell
14. Why reconstruct B cell lineages?
...
1. Vaccine design
3. Evolutionary analysis to learn
about underlying mechanisms
2. Vaccine assay
15. Goal 1: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
rearrangement groups
16. VDJ annotation problem:
from where did each nucleotide come?
Somatichypermutation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
Inference
G
This is a key first step in BCR sequence analysis.
17. Data: Illumina reads from CDR3 locus
Somatic hypermut ation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
G
Total of about 15 million unique 130nt sequences from memory B cell
populations of three healthy individuals A, B, and C.
28. Detour: write HMM inference package
We wanted to use HMMoCby G Lunter (Bioinf 2007)…
then tried extending StochHMMby Lott & Korf (Bioinf 2014)…
but it ended up being a complete rewrite by Duncan to make ham.
Takes HMM description in concise & intuitive YAML format
(for CpG example, 440 chars for hamvs 5,961 for HMMoCXML)
slightly faster and more memory efficient than HMMoC
continuous integration via Docker
Then write BCR annotation package:
https://github.com/psathyrella/ham
https://github.com/psathyrella/partis
30. Distributions are reproducibly weird!
bases
0 5 10
frequency
0.0
0.1
0.2
0.3
0.4
IGHV270*12 V 3' deletion
A
B
C
IGHV270*12 V 3' deletion
bases
0 5 10
frequency
0.0
0.1
0.2
0.3
0.4
IGHD114*01 D 5' deletion
A
B
C
IGHD114*01 D 5' deletion
bases
0 5 10
frequency
0.0
0.2
0.4
0.6
IGHD727*01 D 3' deletion
A
B
C
IGHD727*01 D 3' deletion
bases
0 5 10
frequency
0.00
0.05
0.10
0.15
0.20
IGHJ4*02 J 5' deletion
A
B
C
IGHJ4*02 J 5' deletion
31. Distributions are reproducibly weird!
position
200 250
mutation freq
0.0
0.1
0.2
0.3
0.4
IGHV323D*01
A
B
C
IGHV323D*01
position
200 250
mutation freq
0.0
0.2
0.4
0.6
IGHV333*06
A
B
C
IGHV333*06
32. Only insertions look simple
bases
0 5 10 15
frequency
0.00
0.05
0.10
0.15
VD insertion
A
B
C
VD insertion
bases
0 5 10
frequency
0.0
0.1
0.2
DJ insertion
A
B
C
DJ insertion
33. Simulate sequences to benchmark
Somatichypermutation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
Inference
G
Simulation code independent from inference code.
34. Incorporating this complexity is good
hamming distance
0 5 10 15
frequency
0.0
0.1
0.2
0.3
HTTN
partis (k=5)
partis (k=1)
ighutil
iHMMunealign
igblast
imgt
HTTN
but there are still a number of errors.
35. Remember goal: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
rearrangement groups
36. Say we are given two sequences
1-p
p
1-p
2 ×
2 ×
Double roll
of a single die
per turn
1-p
p
1-p
1-p
p
1-p
+
Two independent
die rolling games
vs.
37. Double roll Pair HMM↔
p
1-p 1-p 1-p 1-p1-p
1-p 1-p 1-p 1-p1-p
p p p pp
1-p
1-p
p
...
...
...
...
1-p
1-p
38. Do two sequences come from a single
rearrangement event?
The forward algorithm for HMMs gives probability of generating
observed sequence from a given HMM:x
P(x) = P(x; σ),∑
paths σ
probability of generating two sequences and from the same path
through the HMM (summed across paths).
P(x, y) = P(x, y; σ),∑
paths σ
x y
40. Do sets of sequences come from a
single rearrangement event?
=
P(A ∪ B)
P(A)P(B)
P(A ∪ B | single rearrangement)
P(A, B | independent rearrangements)
Use this for agglomerative clustering; stop when the ratio < 1.
43. First, investigate BCR mutation patterns
affinity
maturation
antigen
naive B cell
experienced B cell
clonal
expansion
somatic hypermutation
44. Use two-taxon “trees” for model fitting
note: we know ancestral state within V, D, J.
VV DD JJ
IGNORE
IGNORE
IGNORE
IGNORE
Our “trees” have an observed read on the bottom and the corresponding
“ancestral” germline sequence on top, connected by a branch,
representing some amount of divergence.
45. model fitGeneral Time Reversible
Individual A Individual B Individual C
0.14
0.79
0.10
0.22
0.72
0.41
0.69
0.40
0.06
0.28
0.73
0.17
0.08
0.48
0.27
0.17
0.35
0.50
0.66
0.23
0.32
0.42
0.37
0.36
0.35
0.11
0.46
1.02
0.12
1.12
0.85
0.31
0.18
1.10
0.91
0.06
0.12
0.79
0.10
0.19
0.60
0.43
0.76
0.36
0.07
0.24
0.67
0.18
0.07
0.64
0.23
0.14
0.36
0.44
0.74
0.21
0.36
0.33
0.33
0.45
0.28
0.14
0.44
0.76
0.13
1.15
0.94
0.34
0.24
0.89
0.86
0.07
0.14
0.72
0.11
0.21
0.54
0.43
0.71
0.37
0.08
0.24
0.65
0.18
0.08
0.50
0.27
0.16
0.27
0.49
0.65
0.16
0.45
0.39
0.34
0.52
0.27
0.14
0.50
0.73
0.14
1.05
0.79
0.28
0.23
0.90
0.70
0.08
T
C
G
A
T
C
G
A
T
C
G
A
IGHVIGHDIGHJ
A G C T A G C T A G C T
read
germline
46. Best model according to AIC/BIC
… has different matrices and fixed rate multipliers
for the different segments.
V D J
Seq. 1
Seq. 2
Seq. 3
t2
t3
t1 rD
t1
rJ
t1
rD
t2
rJ
t2
rD
t3
rJ
t3
Mutation
Model
47. Branch length distribution under this best
model
IGHD rate: 3.36
IGHJ rate: 0.62
IGHD rate: 4.44
IGHJ rate: 0.62
IGHD rate: 3.88
IGHJ rate: 0.63
Individual A Individual B Individual C
0e+00
2e+05
4e+05
6e+05
8e+05
0.0e+00
5.0e+05
1.0e+06
1.5e+06
2.0e+06
0e+00
5e+05
1e+06
0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4
ML branch length
count
D segments evolve substantially faster than V
J segments evolve more slowly than V
Individual A has a higher mutational load.
48. Next consider selection (Goal 2 con’t)
affinity
maturation
antigen
naive B cell
experienced B cell
clonal
expansion
somatic hypermutation
50.
CCA CCT
Pro Pro
Thr Ile
ATCACC
synonymous
nonsynonymous
For selection
ὡὡ
AAC AAG
GTGGTC
more likely
less likely
In antibodies
51. Would like per-site selection inference
ω ≡ ≡
dN
dS
rate of non-synonymous substitution
rate of synonymous substitution
position
200 250
mutation freq
0.0
0.1
0.2
0.3
0.4
IGHV323D*01
A
B
C
IGHV323D*01
53. ω ≡ ≡
dN
dS
rate of non-synonymous substitution
rate of synonymous substitution
λS
out−of−frame
70 80 90 100
site (IMGT numbering)
0.1
1.0
individual A B C
Out-of-frame reads can be used to infer neutral mutation rate!
54. is a ratio of rates in terms of observed
neutral process
ωl
: nonsynonymous in-frame rate for site
: nonsynonymous out-of-frame rate for site
: synonymous in-frame rate for site
: synonymous out-of-frame rate for site
λ
(N−I)
l
l
λ
(N−O)
l
l
λ
(S−I)
l
l
λ
(S−O)
l
l
=ωl
/λ
(N−I)
l
λ
(N−O)
l
/λ
(S−I)
l
λ
(S−O)
l
55. Renaissance count (Lemey,Minin… 2012)
TGG CCG CGA
seq−5 CCT CAA ATC ACT CTA TGG CCG CGA
seq−2 CCA CAA ATC ACG TTA TGG CCG CGA
ArgPro Gln
Thr
Ile Thr Leu Trp Gln
Pro
seq−1 CCA CAA ACC ACG TTA TGG CAG
seq−3
CGA
CCT CAA ACC ACT CTA TGG CAG CGA
seq−4 CCT CAA ATC ACT CTA
ACC
ATC
ATC
ATC
ACC
ACC
ATC
ATC
ATC
ACC
ATC
ATC
ACC
ACC
ACC
ATC
ATC
mutation historysample
Use sampled
mutation histories
to estimate rates...
but such
estimates
can be unstable.
56. Empirical Bayes regularization
to stabilize estimates
Say we are doing a per-county smoking survey.
zero smokers? Really?
Use all of the data to fit prior distribution of smoking prevalence, then
with given observations obtain per-county posterior.
57. Estimating selection coefficient ωl
: nonsynonymous in-frame rate for site
: nonsynonymous out-of-frame rate for site
: synonymous in-frame rate for site
: synonymous out-of-frame rate for site
λ
(N−I)
l
l
λ
(N−O)
l
l
λ
(S−I)
l
l
λ
(S−O)
l
l
=ωl
/λ
(N−I)
l
λ
(N−O)
l
/λ
(S−I)
l
λ
(S−O)
l
58. Overall IGHV selection map
0.1
1.0
10.0
75 80 85 90 95 100 105
medianω
Individual A
0
50
100
150
200
75 80 85 90 95 100 105
Site (IMGT numbering)
count
purifying neutral diversifying
Distribution of
classifications
across IGHV genes
Distribution of
median estimates
of ω
59. Similar across individuals
Individual A
0
50
100
150
200
75 80 85 90 95 100 105
count
Individual B
0
50
100
150
200
75 80 85 90 95 100 105
count
Individual C
Site (IMGT numbering)
0
50
100
150
75 80 85 90 95 100 105
count
purifying neutral diversifying
62. Conclusion
B cell receptors are “drafted” and “revised” randomly, but
… with remarkably consistent deletion and insertion patterns
… with remarkably consistent substitution and selection
We can learn about these processes using model-based inference.
Paper on annotation with partiswill be up soon
is up on arXivSelection analysis paper
63. Thank you
Trevor Bedford, Connor McCoy, Vladimir Minin & Duncan Ralph
Phil Bradley for doing structural work
Molecular work done by Paul Lindau in Phil Greenberg’s lab with
support from Harlan Robins and Adaptive Biotechnologies
Adaptive Biotechnologies computational biology team
National Science Foundation and National Institute of Health
University of Washington Center for AIDS Research (CFAR)
University of Washington eScience Institute
W. M. Keck Foundation
65. Measuring clustering agreement
good agreement:
bad agreement:
Cx
Cy
Cx
Cy
Intuition: “how much variability is there in the color for amongst the
items of a given color under ?
Cx
Cy
66. Mutual information I
Think of cluster identity under for a uniformly selected point as a
random variable (similarly for and ):
Cx
X Cy Y
I(X; Y ) = H(X) − H(X|Y )
where is the entropy of (ignoring ), and is the
entropy of given the value for .
H(X) X Y H(X|Y )
X Y
I(X; Y ) = p(x, y) log ( )∑
y∈Y
∑
x∈X
p(x, y)
p(x) p(y)
AM I(U , V ) =
M I(U , V ) − E{M I(U , V )}
max {H(U ), H(V )} − E{M I(U , V )}
67. Estimates of the mutational process are quite
consistent between individuals
(each point is a single entry for one of the matrices for a pair of
individuals.)
68. Branch length differences between productive,
unproductive
Unproductive rearrangements are more likely to be either: unchanged
from germline, or more divergent.
69. Sites are generally under purifying selection
Individual A
Individual B
Individual C
0
200
400
600
800
0
200
400
600
800
0
200
400
600
800
−1 0 1
median log10(ω)
count
purifying diversifying neutral
countcount
71. Distribution of amino acids
beginning
of CDR3
selection
for aromatic
amino acids?Frequency: left of line = out-of-frame, right of line = in-frame
72. Stabilize with empirical Bayes regularization
Assume that , the substitution rate at site , comes from a Gamma
distribution with shape and rate :
λl l
α β
∼ Gamma(α, β).λl
Model total substitution counts (sampled via stochastic mapping) for a
site as Poisson with rate :λl
∼ Poisson( ),Cl λl
Fit and to all data, then draw rates from the posterior:α^ β^ λl
∣ ∼ Gamma( + , 1 + ).λl Cl Cl α^ β^
We extended this regularization to case of non-constant coverage.
73. Sequence counts
status A B C
functional 4,139,983 4,861,800 3,748,306
out-of-frame 533,919 794,845 558,246
stop 104,525 169,423 112,901
77. Random facts
Mean length of D segment in individual A’s naive repertoire is 16.61.
Subject A’s naive sequences were 37% CDR3
Divergence between the various germ-line V genes:
>summary(dist.dna(allele_01,pairwise.deletion=TRUE,model='raw'))
Min. 1stQu. Median Mean 3rdQu. Max.
0.0038460.2013000.3446000.3047000.3849000.539500