Learning how antibodies are drafted and revised

Learning how antibodies
are drafted and revised
Frederick “Erick” Matsen
Fred Hutchinson Cancer Research Center
@ematsen
http://matsen.fredhutch.org/
with Trevor Bedford (FH), Connor McCoy,
Vladimir Minin (UW), and Duncan Ralph (FH)

Jenner’s 1796 vaccine

Where are we 200 years later?

RV144 HIV trial: 2003-2009
26,676 volunteers enrolled
16,395 volunteers randomized
125 infections
$105,000,000 and 6 years

Prospective studies are expensive, slow, and entail complex moral issues.
This does not lend itself to rapid vaccine development.

How might we guide vaccine development without disease exposure?

Vaccines manipulate
the adaptive immune system

What can we learn from antibody-making B cells without
battle-testing them through disease exposure?

Antibodies bind antigens
Antigen
Light chain
Heavy chain

Too many antigens to code for directly
≈ ∞⋯ ∞∞

B cell diversification process
V genes D genes J genes
Affinity
maturation
Somatic hypermutation
VDJ
rearrangement
including
erosion and
nontemplated
insertion
AntigenNaive B cell
Experienced B cell

What germline really looks like
(Eichler and Breden groups)

Big aim: reconstruct from memory reads
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
inference

Why reconstruct B cell lineages?
...
1. Vaccine design
This one is really good.
How can we elicit it?

...
1. Vaccine design
immunogen 1
immunogen 2

...
1. Vaccine design
?
2. Vaccine assay

...
1. Vaccine design
3. Evolutionary analysis to learn
about underlying mechanisms
2. Vaccine assay

Goal 1: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
rearrangement groups

VDJ annotation problem:
from where did each nucleotide come?
Somatichypermutation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
Inference
G

This is a key first step in BCR sequence analysis.

Data: Illumina reads from CDR3 locus
Somatic hypermut ation
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
G
Total of about 15 million unique 130nt sequences from memory B cell
populations of three healthy individuals A, B, and C.

“Thread” reads onto structure
...
...
...

HMM intro: dishonest casino
6 6

6 6
1-p
1-p
p

6 6
1-p
1-p
p
6 6

6 6
1-p
1-p
p
6 6
p1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p

p
1-p 1-p 1-p 1-p1-p
1-p 1-p 1-p 1-p1-p
p p p pp
1-p
1-p
p
...
...
...
...
1-p
1-p

...
...
...

Detour: write HMM inference package

We wanted to use HMMoCby G Lunter (Bioinf 2007)…
then tried extending StochHMMby Lott & Korf (Bioinf 2014)…
but it ended up being a complete rewrite by Duncan to make ham.

Takes HMM description in concise & intuitive YAML format
(for CpG example, 440 chars for hamvs 5,961 for HMMoCXML)
slightly faster and more memory efficient than HMMoC
continuous integration via Docker

Then write BCR annotation package:
https://github.com/psathyrella/ham
https://github.com/psathyrella/partis

What are probabilities?
...
...
...

Distributions are reproducibly weird!
bases
0 5 10
frequency
0.0
0.1
0.2
0.3
0.4
IGHV270*12 V 3' deletion
A
B
C
IGHV270*12 V 3' deletion
bases
0 5 10
frequency
0.0
0.1
0.2
0.3
0.4
IGHD114*01 D 5' deletion
A
B
C
bases
0 5 10
frequency
0.0
0.2
0.4
0.6
A
B
C
bases
0 5 10
frequency
0.00
0.05
0.10
0.15
0.20
IGHJ4*02 J 5' deletion
A
B
C
IGHJ4*02 J 5' deletion

Distributions are reproducibly weird!
position
200 250
mutation freq
0.0
0.1
0.2
0.3
0.4
IGHV323D*01
A
B
C
IGHV323D*01
position
200 250
mutation freq
0.0
0.2
0.4
0.6
IGHV333*06
A
B
C
IGHV333*06

Only insertions look simple
bases
0 5 10 15
frequency
0.00
0.05
0.10
0.15
VD insertion
A
B
C
VD insertion
bases
0 5 10
frequency
0.0
0.1
0.2
DJ insertion
A
B
C
DJ insertion

Simulate sequences to benchmark

Somatichypermutation
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
Inference
G

Simulation code independent from inference code.

Incorporating this complexity is good
hamming distance
0 5 10 15
frequency
0.0
0.1
0.2
0.3
HTTN
partis (k=5)
partis (k=1)
ighutil
iHMMunealign
igblast
imgt
HTTN
but there are still a number of errors.

Remember goal: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
rearrangement groups

Say we are given two sequences
1-p
p
1-p
2 ×
2 ×
Double roll
of a single die
per turn
1-p
p
1-p
1-p
p
1-p
+
Two independent
die rolling games
vs.

Double roll Pair HMM↔
p
1-p 1-p 1-p 1-p1-p
1-p 1-p 1-p 1-p1-p
p p p pp
1-p
1-p
p
...
...
...
...
1-p
1-p

Do two sequences come from a single
rearrangement event?

The forward algorithm for HMMs gives probability of generating
observed sequence from a given HMM:x

P(x) = P(x; σ),∑
paths σ

probability of generating two sequences and from the same path
through the HMM (summed across paths).
P(x, y) = P(x, y; σ),∑
paths σ
x y

Do sets of sequences come from a
single rearrangement event?

=
P(A ∪ B)
P(A)P(B)
P(A ∪ B | single rearrangement)
P(A, B | independent rearrangements)

Use this for agglomerative clustering; stop when the ratio < 1.

Preliminary simulation

Integrate out annotation uncertainty and win.

Goal 2: how are antibodies revised?

First, investigate BCR mutation patterns
affinity
maturation
antigen
naive B cell
experienced B cell
clonal
expansion
somatic hypermutation

Use two-taxon “trees” for model fitting
note: we know ancestral state within V, D, J.
VV DD JJ
IGNORE
IGNORE
IGNORE
IGNORE

Our “trees” have an observed read on the bottom and the corresponding
“ancestral” germline sequence on top, connected by a branch,
representing some amount of divergence.

model fitGeneral Time Reversible
Individual A Individual B Individual C
0.14
0.79
0.10
0.22
0.72
0.41
0.69
0.40
0.06
0.28
0.73
0.17
0.08
0.48
0.27
0.17
0.35
0.50
0.66
0.23
0.32
0.42
0.37
0.36
0.35
0.11
0.46
1.02
0.12
1.12
0.85
0.31
0.18
1.10
0.91
0.06
0.12
0.79
0.10
0.19
0.60
0.43
0.76
0.36
0.07
0.24
0.67
0.18
0.07
0.64
0.23
0.14
0.36
0.44
0.74
0.21
0.36
0.33
0.33
0.45
0.28
0.14
0.44
0.76
0.13
1.15
0.94
0.34
0.24
0.89
0.86
0.07
0.14
0.72
0.11
0.21
0.54
0.43
0.71
0.37
0.08
0.24
0.65
0.18
0.08
0.50
0.27
0.16
0.27
0.49
0.65
0.16
0.45
0.39
0.34
0.52
0.27
0.14
0.50
0.73
0.14
1.05
0.79
0.28
0.23
0.90
0.70
0.08
T
C
G
A
T
C
G
A
T
C
G
A
IGHVIGHDIGHJ
A G C T A G C T A G C T
read
germline

Best model according to AIC/BIC
… has different matrices and fixed rate multipliers
for the different segments.
V D J
Seq. 1
Seq. 2
Seq. 3
t2
t3
t1 rD
t1
rJ
t1
rD
t2
rJ
t2
rD
t3
rJ
t3
Mutation
Model

Branch length distribution under this best
model
IGHD rate: 3.36
IGHJ rate: 0.62
IGHD rate: 4.44
IGHJ rate: 0.62
IGHD rate: 3.88
IGHJ rate: 0.63
Individual A Individual B Individual C
0e+00
2e+05
4e+05
6e+05
8e+05
0.0e+00
5.0e+05
1.0e+06
1.5e+06
2.0e+06
0e+00
5e+05
1e+06
0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4
ML branch length
count

D segments evolve substantially faster than V
J segments evolve more slowly than V
Individual A has a higher mutational load.

Next consider selection (Goal 2 con’t)
affinity
maturation
antigen
naive B cell
experienced B cell
clonal
expansion
somatic hypermutation

AAC AAG
GTGGTC
more likely
less likely
In antibodies

CCA CCT
Pro Pro
Thr Ile
ATCACC
synonymous
nonsynonymous
For selection
ὡὡ
AAC AAG
GTGGTC
more likely
less likely
In antibodies

Would like per-site selection inference

ω ≡ ≡
dN
dS
rate of non-synonymous substitution
rate of synonymous substitution

position
200 250
mutation freq
0.0
0.1
0.2
0.3
0.4
IGHV323D*01
A
B
C
IGHV323D*01

Productive vs. out-of-frame receptors

Each cell may carry two IGH alleles, but only one is expressed.

V D J
V D J
insertion that
disrupts frame

ω ≡ ≡
dN
dS
rate of non-synonymous substitution
rate of synonymous substitution
λS
out−of−frame
70 80 90 100
site (IMGT numbering)
0.1
1.0
individual A B C

Out-of-frame reads can be used to infer neutral mutation rate!

is a ratio of rates in terms of observed
neutral process
ωl
: nonsynonymous in-frame rate for site
: nonsynonymous out-of-frame rate for site
: synonymous in-frame rate for site
: synonymous out-of-frame rate for site
λ
(N−I)
l
l
λ
(N−O)
l
l
λ
(S−I)
l
l
λ
(S−O)
l
l

=ωl
/λ
(N−I)
l
λ
(N−O)
l
/λ
(S−I)
l
λ
(S−O)
l

Renaissance count (Lemey,Minin… 2012)
TGG CCG CGA
seq−5 CCT CAA ATC ACT CTA TGG CCG CGA
seq−2 CCA CAA ATC ACG TTA TGG CCG CGA
ArgPro Gln
Thr
Ile Thr Leu Trp Gln
Pro
seq−1 CCA CAA ACC ACG TTA TGG CAG
seq−3
CGA
CCT CAA ACC ACT CTA TGG CAG CGA
seq−4 CCT CAA ATC ACT CTA
ACC
ATC
ATC
ATC
ACC
ACC
ATC
ATC
ATC
ACC
ATC
ATC
ACC
ACC
ACC
ATC
ATC
mutation historysample
Use sampled
mutation histories
to estimate rates...
but such
estimates
can be unstable.

Empirical Bayes regularization
to stabilize estimates
Say we are doing a per-county smoking survey.
zero smokers? Really?
Use all of the data to fit prior distribution of smoking prevalence, then
with given observations obtain per-county posterior.

Estimating selection coefficient ωl
: nonsynonymous in-frame rate for site
: nonsynonymous out-of-frame rate for site
: synonymous in-frame rate for site
: synonymous out-of-frame rate for site
λ
(N−I)
l
l
λ
(N−O)
l
l
λ
(S−I)
l
l
λ
(S−O)
l
l

=ωl
/λ
(N−I)
l
λ
(N−O)
l
/λ
(S−I)
l
λ
(S−O)
l

Overall IGHV selection map
0.1
1.0
10.0
75 80 85 90 95 100 105
medianω
Individual A
0
50
100
150
200
75 80 85 90 95 100 105
Site (IMGT numbering)
count
purifying neutral diversifying
Distribution of
classifications
across IGHV genes
Distribution of
median estimates
of ω

Similar across individuals
Individual A
0
50
100
150
200
75 80 85 90 95 100 105
count
Individual B
0
50
100
150
200
75 80 85 90 95 100 105
count
Individual C
Site (IMGT numbering)
0
50
100
150
75 80 85 90 95 100 105
count
purifying neutral diversifying

antigen
light chain
purifying
neutral
diversifying

Conclusion

B cell receptors are “drafted” and “revised” randomly, but
… with remarkably consistent deletion and insertion patterns
… with remarkably consistent substitution and selection

We can learn about these processes using model-based inference.

Paper on annotation with partiswill be up soon
is up on arXivSelection analysis paper

Thank you
Trevor Bedford, Connor McCoy, Vladimir Minin & Duncan Ralph
Phil Bradley for doing structural work
Molecular work done by Paul Lindau in Phil Greenberg’s lab with
support from Harlan Robins and Adaptive Biotechnologies
Adaptive Biotechnologies computational biology team

National Science Foundation and National Institute of Health
University of Washington Center for AIDS Research (CFAR)
University of Washington eScience Institute
W. M. Keck Foundation

Measuring clustering agreement
good agreement:
bad agreement:
Cx
Cy
Cx
Cy

Intuition: “how much variability is there in the color for amongst the
items of a given color under ?
Cx
Cy

Mutual information I
Think of cluster identity under for a uniformly selected point as a
random variable (similarly for and ):
Cx
X Cy Y
I(X; Y ) = H(X) − H(X|Y )
where is the entropy of (ignoring ), and is the
entropy of given the value for .
H(X) X Y H(X|Y )
X Y

I(X; Y ) = p(x, y) log ( )∑
y∈Y
∑
x∈X
p(x, y)
p(x) p(y)

AM I(U , V ) =
M I(U , V ) − E{M I(U , V )}
max {H(U ), H(V )} − E{M I(U , V )}

Estimates of the mutational process are quite
consistent between individuals
(each point is a single entry for one of the matrices for a pair of
individuals.)

Branch length differences between productive,
unproductive
Unproductive rearrangements are more likely to be either: unchanged
from germline, or more divergent.

Sites are generally under purifying selection
Individual A
Individual B
Individual C
0
200
400
600
800
0
200
400
600
800
0
200
400
600
800
−1 0 1
median log10(ω)
count
purifying diversifying neutral
countcount

Similar across individuals (ii)

Distribution of amino acids
beginning
of CDR3
selection
for aromatic
amino acids?Frequency: left of line = out-of-frame, right of line = in-frame

Stabilize with empirical Bayes regularization
Assume that , the substitution rate at site , comes from a Gamma
distribution with shape and rate :
λl l
α β
∼ Gamma(α, β).λl

Model total substitution counts (sampled via stochastic mapping) for a
site as Poisson with rate :λl
∼ Poisson( ),Cl λl

Fit and to all data, then draw rates from the posterior:α^ β^ λl
∣ ∼ Gamma( + , 1 + ).λl Cl Cl α^ β^

We extended this regularization to case of non-constant coverage.

Sequence counts
status A B C
functional 4,139,983 4,861,800 3,748,306
out-of-frame 533,919 794,845 558,246
stop 104,525 169,423 112,901

Correlation between sequence and GTR matrix

Each dot is a pair of genes.

Simulation results for selection inference
● ● ● ● ● ● ● ● ●
●
● ● ● ● ● ●
● ● ● ●
●
●
● ● ● ●
●
● ● ● ●
● ●
● ● ●
● ● ● ● ● ● ●
● ●
●
●
● ● ● ●
●
● ●
●
● ●
● ●
● ● ● ● ● ● ● ● ●
●
● ● ● ●
● ●
●
● ●
●
●
●
● ● ●
● ●
●
●
● ●
● ● ●
● ● ● ● ●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ●
●
● ● ● ●
●
● ●
●
● ●
● ●
● ● ● ● ● ● ● ● ●
●
● ● ● ●
● ●
●
● ●
●
● ●
●
● ●
●
● ●
●
● ●
●
●
● ● ● ● ● ●
●
●
●
●
● ● ●
●
● ●
●
●
●●
●
●
●
0.1
1.0
10.0
0 25 50 75 100
site
ω
synonymous
change
possible?
● yes
no
type
●●
●●
●●
purifying
neutral
diversifying
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
site
Proportion
type
N
S
0
250
500
750
1000
0 25 50 75 100
site
coverage

Random facts
Mean length of D segment in individual A’s naive repertoire is 16.61.
Subject A’s naive sequences were 37% CDR3
Divergence between the various germ-line V genes:
>summary(dist.dna(allele_01,pairwise.deletion=TRUE,model='raw'))
Min. 1stQu. Median Mean 3rdQu. Max.
0.0038460.2013000.3446000.3047000.3849000.539500

Learning how antibodies are drafted and revised

Recommended

Recommended

More Related Content

Similar to Learning how antibodies are drafted and revised

Similar to Learning how antibodies are drafted and revised (20)

Recently uploaded

Recently uploaded (20)

Learning how antibodies are drafted and revised