SlideShare a Scribd company logo
1 of 77
Download to read offline
Learning how antibodies
are drafted and revised
Frederick “Erick” Matsen
Fred Hutchinson Cancer Research Center
@ematsen
http://matsen.fredhutch.org/
with Trevor Bedford (FH), Connor McCoy,
Vladimir Minin (UW), and Duncan Ralph (FH)
Jenner’s 1796 vaccine
 
Where are we 200 years later?
RV144 HIV trial: 2003-2009
26,676 volunteers enrolled
16,395 volunteers randomized
125 infections
$105,000,000 and 6 years
 
Prospective studies are expensive, slow, and entail complex moral issues.
This does not lend itself to rapid vaccine development.
 
How might we guide vaccine development without disease exposure?
Vaccines manipulate
the adaptive immune system
 
 
What can we learn from antibody-making B cells without
battle-testing them through disease exposure?
Antibodies bind antigens
Antigen
Light chain
Heavy chain
Too many antigens to code for directly
≈ ∞⋯ ∞∞
B cell diversification process
V genes D genes J genes
Affinity
maturation
Somatic hypermutation
VDJ 
rearrangement
including
erosion and
non­templated
insertion
AntigenNaive B cell
Experienced B cell
What germline really looks like
(Eichler and Breden groups)
Big aim: reconstruct from memory reads
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
inference
Why reconstruct B cell lineages?
...
1. Vaccine design
This one is really good.
How can we elicit it?
Why reconstruct B cell lineages?
...
1. Vaccine design
immunogen 1
immunogen 2
Why reconstruct B cell lineages?
...
1. Vaccine design
?
2. Vaccine assay
Why reconstruct B cell lineages?
...
1. Vaccine design
3. Evolutionary analysis to learn
about underlying mechanisms
2. Vaccine assay
Goal 1: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
rearrangement groups
VDJ annotation problem:
from where did each nucleotide come?
Somatichypermutation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
Inference
G
 
 
This is a key first step in BCR sequence analysis.
Data: Illumina reads from CDR3 locus
Somatic  hypermut ation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
G
Total of about 15 million unique 130nt sequences from memory B cell
populations of three healthy individuals A, B, and C.
“Thread” reads onto structure
V genes D genes J genes
...
...
...
HMM intro: dishonest casino
6 6
HMM intro: dishonest casino
6 6
1-p
1-p
p
HMM intro: dishonest casino
6 6
1-p
1-p
p
6 6
HMM intro: dishonest casino
6 6
1-p
1-p
p
6 6
p1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p
p
1-p 1-p 1-p 1-p1-p
1-p 1-p 1-p 1-p1-p
p p p pp
1-p
1-p
p
...
...
...
...
1-p
1-p
p
1-p 1-p 1-p 1-p1-p
1-p 1-p 1-p 1-p1-p
p p p pp
1-p
1-p
p
...
...
...
...
1-p
1-p
V genes D genes J genes
...
...
...
V genes D genes J genes
...
...
...
V genes D genes J genes
...
...
...
Detour: write HMM inference package
 
We wanted to use HMMoCby G Lunter (Bioinf 2007)…
then tried extending StochHMMby Lott & Korf (Bioinf 2014)…
but it ended up being a complete rewrite by Duncan to make ham.
 
Takes HMM description in concise & intuitive YAML format
(for CpG example, 440 chars for hamvs 5,961 for HMMoCXML)
slightly faster and more memory efficient than HMMoC
continuous integration via Docker
 
Then write BCR annotation package:
https://github.com/psathyrella/ham
https://github.com/psathyrella/partis
What are probabilities?
V genes D genes J genes
...
...
...
Distributions are reproducibly weird!
bases
0 5 10
frequency
0.0
0.1
0.2
0.3
0.4
IGHV2­70*12 ­­ V 3' deletion
A
B
C
IGHV2­70*12 ­­ V 3' deletion
bases
0 5 10
frequency
0.0
0.1
0.2
0.3
0.4
IGHD1­14*01 ­­ D 5' deletion
A
B
C
IGHD1­14*01 ­­ D 5' deletion
bases
0 5 10
frequency
0.0
0.2
0.4
0.6
IGHD7­27*01 ­­ D 3' deletion
A
B
C
IGHD7­27*01 ­­ D 3' deletion
bases
0 5 10
frequency
0.00
0.05
0.10
0.15
0.20
IGHJ4*02 ­­ J 5' deletion
A
B
C
IGHJ4*02 ­­ J 5' deletion
Distributions are reproducibly weird!
position
200 250
mutation freq
0.0
0.1
0.2
0.3
0.4
IGHV3­23D*01
A
B
C
IGHV3­23D*01
position
200 250
mutation freq
0.0
0.2
0.4
0.6
IGHV3­33*06
A
B
C
IGHV3­33*06
Only insertions look simple
bases
0 5 10 15
frequency
0.00
0.05
0.10
0.15
VD insertion
A
B
C
VD insertion
bases
0 5 10
frequency
0.0
0.1
0.2
DJ insertion
A
B
C
DJ insertion
Simulate sequences to benchmark
 
 
Somatichypermutation
Sequencing primerSequencing error
3’V deletion
VD insertion
5’D deletion
3’D deletion
5’J deletion
DJ insertion
Biological process
Sequencing
Inference
G
 
Simulation code independent from inference code.
Incorporating this complexity is good
hamming distance
0 5 10 15
frequency
0.0
0.1
0.2
0.3
HTTN
partis (k=5)
partis (k=1)
ighutil
iHMMunealign
igblast
imgt
HTTN
but there are still a number of errors.
Remember goal: find rearrangement groups
ACATGGCTC...
ATACGTTCC...
TTACGGTTC...
ATCCGGTAC...
ATACAGTCT...
...
reality
...
rearrangement groups
Say we are given two sequences
1-p
p
1-p
2 ×
2 ×
Double roll
of a single die
per turn
1-p
p
1-p
1-p
p
1-p
+
Two independent
die rolling games
vs.
Double roll Pair HMM↔
p
1-p 1-p 1-p 1-p1-p
1-p 1-p 1-p 1-p1-p
p p p pp
1-p
1-p
p
...
...
...
...
1-p
1-p
Do two sequences come from a single
rearrangement event?
 
The forward algorithm for HMMs gives probability of generating
observed sequence from a given HMM:x
 
P(x) = P(x; σ),∑
paths σ
 
probability of generating two sequences and from the same path
through the HMM (summed across paths).
P(x, y) = P(x, y; σ),∑
paths σ
x y
V genes D genes J genes
...
...
...
Do sets of sequences come from a
single rearrangement event?
 
=
P(A ∪ B)
P(A)P(B)
P(A ∪ B | single rearrangement)
P(A, B | independent rearrangements)
 
 
Use this for agglomerative clustering; stop when the ratio < 1.
Preliminary simulation
 
Integrate out annotation uncertainty and win.
Goal 2: how are antibodies revised?
First, investigate BCR mutation patterns
affinity
maturation
antigen
naive B cell
experienced B cell
clonal
expansion
somatic hypermutation
Use two-taxon “trees” for model fitting
note: we know ancestral state within V, D, J.
VV DD JJ
IGNORE
IGNORE
IGNORE
IGNORE
 
Our “trees” have an observed read on the bottom and the corresponding
“ancestral” germline sequence on top, connected by a branch,
representing some amount of divergence.
model fitGeneral Time Reversible
Individual A Individual B Individual C
0.14
0.79
0.10
0.22
0.72
0.41
0.69
0.40
0.06
0.28
0.73
0.17
0.08
0.48
0.27
0.17
0.35
0.50
0.66
0.23
0.32
0.42
0.37
0.36
0.35
0.11
0.46
1.02
0.12
1.12
0.85
0.31
0.18
1.10
0.91
0.06
0.12
0.79
0.10
0.19
0.60
0.43
0.76
0.36
0.07
0.24
0.67
0.18
0.07
0.64
0.23
0.14
0.36
0.44
0.74
0.21
0.36
0.33
0.33
0.45
0.28
0.14
0.44
0.76
0.13
1.15
0.94
0.34
0.24
0.89
0.86
0.07
0.14
0.72
0.11
0.21
0.54
0.43
0.71
0.37
0.08
0.24
0.65
0.18
0.08
0.50
0.27
0.16
0.27
0.49
0.65
0.16
0.45
0.39
0.34
0.52
0.27
0.14
0.50
0.73
0.14
1.05
0.79
0.28
0.23
0.90
0.70
0.08
T
C
G
A
T
C
G
A
T
C
G
A
IGHVIGHDIGHJ
A G C T A G C T A G C T
read
germline
Best model according to AIC/BIC
… has different matrices and fixed rate multipliers
for the different segments.
V D J
Seq. 1
Seq. 2
Seq. 3
t2
t3
t1 rD
t1
rJ
t1
rD
t2
rJ
t2
rD
t3
rJ
t3
Mutation
Model
Branch length distribution under this best
model
IGHD rate: 3.36
IGHJ rate: 0.62
IGHD rate: 4.44
IGHJ rate: 0.62
IGHD rate: 3.88
IGHJ rate: 0.63
Individual A Individual B Individual C
0e+00
2e+05
4e+05
6e+05
8e+05
0.0e+00
5.0e+05
1.0e+06
1.5e+06
2.0e+06
0e+00
5e+05
1e+06
0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4
ML branch length
count
 
D segments evolve substantially faster than V
J segments evolve more slowly than V
Individual A has a higher mutational load.
Next consider selection (Goal 2 con’t)
affinity
maturation
antigen
naive B cell
experienced B cell
clonal
expansion
somatic hypermutation
 
AAC AAG
GTGGTC
more likely
less likely
In antibodies
 
CCA CCT
Pro Pro
Thr Ile
ATCACC
synonymous
nonsynonymous
For selection
ὡὡ
AAC AAG
GTGGTC
more likely
less likely
In antibodies
Would like per-site selection inference
 
ω ≡ ≡
dN
dS
rate of non-synonymous substitution
rate of synonymous substitution
 
position
200 250
mutation freq
0.0
0.1
0.2
0.3
0.4
IGHV3­23D*01
A
B
C
IGHV3­23D*01
Productive vs. out-of-frame receptors
 
Each cell may carry two IGH alleles, but only one is expressed.
 
V D J
V D J
insertion that
disrupts frame
ω ≡ ≡
dN
dS
rate of non-synonymous substitution
rate of synonymous substitution
λS
out−of−frame
70 80 90 100
site (IMGT numbering)
0.1
1.0
individual A B C
 
 
Out-of-frame reads can be used to infer neutral mutation rate!
is a ratio of rates in terms of observed
neutral process
ωl
: nonsynonymous in-frame rate for site
: nonsynonymous out-of-frame rate for site
: synonymous in-frame rate for site
: synonymous out-of-frame rate for site
λ
(N−I)
l
l
λ
(N−O)
l
l
λ
(S−I)
l
l
λ
(S−O)
l
l
 
 
=ωl
/λ
(N−I)
l
λ
(N−O)
l
/λ
(S−I)
l
λ
(S−O)
l
Renaissance count (Lemey,Minin… 2012)
TGG CCG CGA
seq−5 CCT CAA ATC ACT CTA TGG CCG CGA
seq−2 CCA CAA ATC ACG TTA TGG CCG CGA
ArgPro Gln
Thr
Ile Thr Leu Trp Gln
Pro
seq−1 CCA CAA ACC ACG TTA TGG CAG
seq−3
CGA
CCT CAA ACC ACT CTA TGG CAG CGA
seq−4 CCT CAA ATC ACT CTA
ACC
ATC
ATC
ATC
ACC
ACC
ATC
ATC
ATC
ACC
ATC
ATC
ACC
ACC
ACC
ATC
ATC
mutation historysample
Use sampled
mutation histories
to estimate rates...
but such
estimates
can be unstable.
Empirical Bayes regularization
to stabilize estimates
Say we are doing a per-county smoking survey.
zero smokers? Really?
Use all of the data to fit prior distribution of smoking prevalence, then
with given observations obtain per-county posterior.
Estimating selection coefficient ωl
: nonsynonymous in-frame rate for site
: nonsynonymous out-of-frame rate for site
: synonymous in-frame rate for site
: synonymous out-of-frame rate for site
λ
(N−I)
l
l
λ
(N−O)
l
l
λ
(S−I)
l
l
λ
(S−O)
l
l
 
 
=ωl
/λ
(N−I)
l
λ
(N−O)
l
/λ
(S−I)
l
λ
(S−O)
l
Overall IGHV selection map
0.1
1.0
10.0
75 80 85 90 95 100 105
medianω
Individual A
0
50
100
150
200
75 80 85 90 95 100 105
Site (IMGT numbering)
count
purifying neutral diversifying
Distribution of
classifications
across IGHV genes
Distribution of
median estimates
of ω
Similar across individuals
Individual A
0
50
100
150
200
75 80 85 90 95 100 105
count
Individual B
0
50
100
150
200
75 80 85 90 95 100 105
count
Individual C
Site (IMGT numbering)
0
50
100
150
75 80 85 90 95 100 105
count
purifying neutral diversifying
antigen
light chain
purifying
neutral
diversifying
Conclusion
 
B cell receptors are “drafted” and “revised” randomly, but
… with remarkably consistent deletion and insertion patterns
… with remarkably consistent substitution and selection
 
We can learn about these processes using model-based inference.
 
Paper on annotation with partiswill be up soon
is up on arXivSelection analysis paper
Thank you
Trevor Bedford, Connor McCoy, Vladimir Minin & Duncan Ralph
Phil Bradley for doing structural work
Molecular work done by Paul Lindau in Phil Greenberg’s lab with
support from Harlan Robins and Adaptive Biotechnologies
Adaptive Biotechnologies computational biology team
 
National Science Foundation and National Institute of Health
University of Washington Center for AIDS Research (CFAR)
University of Washington eScience Institute
W. M. Keck Foundation
 
Addenda
Measuring clustering agreement
good agreement:
bad agreement:
Cx
Cy
Cx
Cy
 
 
Intuition: “how much variability is there in the color for amongst the
items of a given color under ?
Cx
Cy
Mutual information I
Think of cluster identity under for a uniformly selected point as a
random variable (similarly for and ):
Cx
X Cy Y
I(X; Y ) = H(X) − H(X|Y )
where is the entropy of (ignoring ), and is the
entropy of given the value for .
H(X) X Y H(X|Y )
X Y
 
I(X; Y ) = p(x, y) log ( )∑
y∈Y
∑
x∈X
p(x, y)
p(x) p(y)
 
AM I(U , V ) =
M I(U , V ) − E{M I(U , V )}
max {H(U ), H(V )} − E{M I(U , V )}
Estimates of the mutational process are quite
consistent between individuals
(each point is a single entry for one of the matrices for a pair of
individuals.)
Branch length differences between productive,
unproductive
Unproductive rearrangements are more likely to be either: unchanged
from germline, or more divergent.
Sites are generally under purifying selection
Individual A
Individual B
Individual C
0
200
400
600
800
0
200
400
600
800
0
200
400
600
800
−1 0 1
median log10(ω)
count
purifying diversifying neutral
countcount
Similar across individuals (ii)
Distribution of amino acids
beginning
of CDR3
selection
for aromatic
amino acids?Frequency: left of line = out-of-frame, right of line = in-frame
Stabilize with empirical Bayes regularization
Assume that , the substitution rate at site , comes from a Gamma
distribution with shape and rate :
λl l
α β
∼ Gamma(α, β).λl
 
Model total substitution counts (sampled via stochastic mapping) for a
site as Poisson with rate :λl
∼ Poisson( ),Cl λl
 
Fit and to all data, then draw rates from the posterior:α^ β^ λl
∣ ∼ Gamma( + , 1 + ).λl Cl Cl α^ β^
 
We extended this regularization to case of non-constant coverage.
Sequence counts
status A B C
functional 4,139,983 4,861,800 3,748,306
out-of-frame 533,919 794,845 558,246
stop 104,525 169,423 112,901
Correlation between sequence and GTR matrix
 
Each dot is a pair of genes.
Simulation results for selection inference
● ● ● ● ● ● ● ● ●
●
● ● ● ● ● ●
● ● ● ●
●
●
● ● ● ●
●
● ● ● ●
● ●
● ● ●
● ● ● ● ● ● ●
● ●
●
●
● ● ● ●
●
● ●
●
● ●
● ●
● ● ● ● ● ● ● ● ●
●
● ● ● ●
● ●
●
● ●
●
●
●
● ● ●
● ●
●
●
● ●
● ● ●
● ● ● ● ●
●
● ● ● ● ●
●
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ●
● ●
●
● ● ● ●
●
● ●
●
● ●
● ●
● ● ● ● ● ● ● ● ●
●
● ● ● ●
● ●
●
● ●
●
● ●
●
● ●
●
● ●
●
● ●
●
●
● ● ● ● ● ●
●
●
●
●
● ● ●
●
● ●
●
●
●●
●
●
●
0.1
1.0
10.0
0 25 50 75 100
site
ω
synonymous
change
possible?
● yes
no
type
●●
●●
●●
purifying
neutral
diversifying
0.00
0.25
0.50
0.75
1.00
0 25 50 75 100
site
Proportion
type
N
S
0
250
500
750
1000
0 25 50 75 100
site
coverage
Omega distribution
Random facts
Mean length of D segment in individual A’s naive repertoire is 16.61.
Subject A’s naive sequences were 37% CDR3
Divergence between the various germ-line V genes:
>summary(dist.dna(allele_01,pairwise.deletion=TRUE,model='raw'))
Min. 1stQu. Median Mean 3rdQu. Max.
0.0038460.2013000.3446000.3047000.3849000.539500

More Related Content

Similar to Learning how antibodies are drafted and revised

Mutation taster ppt
Mutation taster pptMutation taster ppt
Mutation taster pptHina Qaiser
 
ppgardner-lecture06-homologysearch.pdf
ppgardner-lecture06-homologysearch.pdfppgardner-lecture06-homologysearch.pdf
ppgardner-lecture06-homologysearch.pdfPaul Gardner
 
Lessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell linesLessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell linesChris Thorne
 
Human or Humanized Antibody
Human or Humanized AntibodyHuman or Humanized Antibody
Human or Humanized AntibodyCreative Biolabs
 
(050407)protein chip
(050407)protein chip(050407)protein chip
(050407)protein chipnamvgta
 
Antibody humanization
Antibody humanizationAntibody humanization
Antibody humanizationEchoHan4
 
Genetic algorithm
Genetic algorithmGenetic algorithm
Genetic algorithmgarima931
 
2008 PGSAS G-nomes
2008 PGSAS G-nomes2008 PGSAS G-nomes
2008 PGSAS G-nomesgfb1
 
2008 PGSAS G-nomes
2008 PGSAS G-nomes2008 PGSAS G-nomes
2008 PGSAS G-nomesgfb1
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Elia Brodsky
 
Machine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm FundamentalMachine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm FundamentalAnggi Andriyadi
 
Adaptive Molecular Evolution (dN/dS). 2011
Adaptive Molecular Evolution (dN/dS). 2011Adaptive Molecular Evolution (dN/dS). 2011
Adaptive Molecular Evolution (dN/dS). 2011Hernán Dopazo
 
CE-Symm, protein symmetry, and the evolution of protein folds
CE-Symm, protein symmetry, and the evolution of protein foldsCE-Symm, protein symmetry, and the evolution of protein folds
CE-Symm, protein symmetry, and the evolution of protein foldsDouglas Myers-Turnbull
 

Similar to Learning how antibodies are drafted and revised (20)

Mason abrf single_cell_2017
Mason abrf single_cell_2017Mason abrf single_cell_2017
Mason abrf single_cell_2017
 
Mutation taster ppt
Mutation taster pptMutation taster ppt
Mutation taster ppt
 
ppgardner-lecture06-homologysearch.pdf
ppgardner-lecture06-homologysearch.pdfppgardner-lecture06-homologysearch.pdf
ppgardner-lecture06-homologysearch.pdf
 
Lessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell linesLessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell lines
 
Human or Humanized Antibody
Human or Humanized AntibodyHuman or Humanized Antibody
Human or Humanized Antibody
 
Markers
MarkersMarkers
Markers
 
(050407)protein chip
(050407)protein chip(050407)protein chip
(050407)protein chip
 
Antibody humanization
Antibody humanizationAntibody humanization
Antibody humanization
 
Antibody humanization
Antibody humanizationAntibody humanization
Antibody humanization
 
Genetic algorithm
Genetic algorithmGenetic algorithm
Genetic algorithm
 
2008 PGSAS G-nomes
2008 PGSAS G-nomes2008 PGSAS G-nomes
2008 PGSAS G-nomes
 
2008 PGSAS G-nomes
2008 PGSAS G-nomes2008 PGSAS G-nomes
2008 PGSAS G-nomes
 
Artificial chromosomes
Artificial chromosomesArtificial chromosomes
Artificial chromosomes
 
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
Mastering RNA-Seq (NGS Data Analysis) - A Critical Approach To Transcriptomic...
 
RIP 24Feb2016v4
RIP 24Feb2016v4RIP 24Feb2016v4
RIP 24Feb2016v4
 
Microarray
MicroarrayMicroarray
Microarray
 
Machine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm FundamentalMachine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm Fundamental
 
Adaptive Molecular Evolution (dN/dS). 2011
Adaptive Molecular Evolution (dN/dS). 2011Adaptive Molecular Evolution (dN/dS). 2011
Adaptive Molecular Evolution (dN/dS). 2011
 
final_presentation
final_presentationfinal_presentation
final_presentation
 
CE-Symm, protein symmetry, and the evolution of protein folds
CE-Symm, protein symmetry, and the evolution of protein foldsCE-Symm, protein symmetry, and the evolution of protein folds
CE-Symm, protein symmetry, and the evolution of protein folds
 

Recently uploaded

Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PPRINCE C P
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 

Recently uploaded (20)

Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Artificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C PArtificial Intelligence In Microbiology by Dr. Prince C P
Artificial Intelligence In Microbiology by Dr. Prince C P
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 

Learning how antibodies are drafted and revised

  • 1. Learning how antibodies are drafted and revised Frederick “Erick” Matsen Fred Hutchinson Cancer Research Center @ematsen http://matsen.fredhutch.org/ with Trevor Bedford (FH), Connor McCoy, Vladimir Minin (UW), and Duncan Ralph (FH)
  • 2.
  • 3. Jenner’s 1796 vaccine   Where are we 200 years later?
  • 4. RV144 HIV trial: 2003-2009 26,676 volunteers enrolled 16,395 volunteers randomized 125 infections $105,000,000 and 6 years   Prospective studies are expensive, slow, and entail complex moral issues. This does not lend itself to rapid vaccine development.   How might we guide vaccine development without disease exposure?
  • 5. Vaccines manipulate the adaptive immune system     What can we learn from antibody-making B cells without battle-testing them through disease exposure?
  • 7. Too many antigens to code for directly ≈ ∞⋯ ∞∞
  • 8. B cell diversification process V genes D genes J genes Affinity maturation Somatic hypermutation VDJ  rearrangement including erosion and non­templated insertion AntigenNaive B cell Experienced B cell
  • 9. What germline really looks like (Eichler and Breden groups)
  • 10. Big aim: reconstruct from memory reads ACATGGCTC... ATACGTTCC... TTACGGTTC... ATCCGGTAC... ATACAGTCT... ... reality ... inference
  • 11. Why reconstruct B cell lineages? ... 1. Vaccine design This one is really good. How can we elicit it?
  • 12. Why reconstruct B cell lineages? ... 1. Vaccine design immunogen 1 immunogen 2
  • 13. Why reconstruct B cell lineages? ... 1. Vaccine design ? 2. Vaccine assay
  • 14. Why reconstruct B cell lineages? ... 1. Vaccine design 3. Evolutionary analysis to learn about underlying mechanisms 2. Vaccine assay
  • 15. Goal 1: find rearrangement groups ACATGGCTC... ATACGTTCC... TTACGGTTC... ATCCGGTAC... ATACAGTCT... ... reality ... rearrangement groups
  • 16. VDJ annotation problem: from where did each nucleotide come? Somatichypermutation Sequencing primerSequencing error 3’V deletion VD insertion 5’D deletion 3’D deletion 5’J deletion DJ insertion Biological process Sequencing Inference G     This is a key first step in BCR sequence analysis.
  • 17. Data: Illumina reads from CDR3 locus Somatic  hypermut ation Sequencing primerSequencing error 3’V deletion VD insertion 5’D deletion 3’D deletion 5’J deletion DJ insertion Biological process Sequencing G Total of about 15 million unique 130nt sequences from memory B cell populations of three healthy individuals A, B, and C.
  • 18. “Thread” reads onto structure V genes D genes J genes ... ... ...
  • 19. HMM intro: dishonest casino 6 6
  • 20. HMM intro: dishonest casino 6 6 1-p 1-p p
  • 21. HMM intro: dishonest casino 6 6 1-p 1-p p 6 6
  • 22. HMM intro: dishonest casino 6 6 1-p 1-p p 6 6 p1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p 1-p
  • 23. p 1-p 1-p 1-p 1-p1-p 1-p 1-p 1-p 1-p1-p p p p pp 1-p 1-p p ... ... ... ... 1-p 1-p
  • 24. p 1-p 1-p 1-p 1-p1-p 1-p 1-p 1-p 1-p1-p p p p pp 1-p 1-p p ... ... ... ... 1-p 1-p
  • 28. Detour: write HMM inference package   We wanted to use HMMoCby G Lunter (Bioinf 2007)… then tried extending StochHMMby Lott & Korf (Bioinf 2014)… but it ended up being a complete rewrite by Duncan to make ham.   Takes HMM description in concise & intuitive YAML format (for CpG example, 440 chars for hamvs 5,961 for HMMoCXML) slightly faster and more memory efficient than HMMoC continuous integration via Docker   Then write BCR annotation package: https://github.com/psathyrella/ham https://github.com/psathyrella/partis
  • 29. What are probabilities? V genes D genes J genes ... ... ...
  • 30. Distributions are reproducibly weird! bases 0 5 10 frequency 0.0 0.1 0.2 0.3 0.4 IGHV2­70*12 ­­ V 3' deletion A B C IGHV2­70*12 ­­ V 3' deletion bases 0 5 10 frequency 0.0 0.1 0.2 0.3 0.4 IGHD1­14*01 ­­ D 5' deletion A B C IGHD1­14*01 ­­ D 5' deletion bases 0 5 10 frequency 0.0 0.2 0.4 0.6 IGHD7­27*01 ­­ D 3' deletion A B C IGHD7­27*01 ­­ D 3' deletion bases 0 5 10 frequency 0.00 0.05 0.10 0.15 0.20 IGHJ4*02 ­­ J 5' deletion A B C IGHJ4*02 ­­ J 5' deletion
  • 31. Distributions are reproducibly weird! position 200 250 mutation freq 0.0 0.1 0.2 0.3 0.4 IGHV3­23D*01 A B C IGHV3­23D*01 position 200 250 mutation freq 0.0 0.2 0.4 0.6 IGHV3­33*06 A B C IGHV3­33*06
  • 32. Only insertions look simple bases 0 5 10 15 frequency 0.00 0.05 0.10 0.15 VD insertion A B C VD insertion bases 0 5 10 frequency 0.0 0.1 0.2 DJ insertion A B C DJ insertion
  • 33. Simulate sequences to benchmark     Somatichypermutation Sequencing primerSequencing error 3’V deletion VD insertion 5’D deletion 3’D deletion 5’J deletion DJ insertion Biological process Sequencing Inference G   Simulation code independent from inference code.
  • 34. Incorporating this complexity is good hamming distance 0 5 10 15 frequency 0.0 0.1 0.2 0.3 HTTN partis (k=5) partis (k=1) ighutil iHMMunealign igblast imgt HTTN but there are still a number of errors.
  • 35. Remember goal: find rearrangement groups ACATGGCTC... ATACGTTCC... TTACGGTTC... ATCCGGTAC... ATACAGTCT... ... reality ... rearrangement groups
  • 36. Say we are given two sequences 1-p p 1-p 2 × 2 × Double roll of a single die per turn 1-p p 1-p 1-p p 1-p + Two independent die rolling games vs.
  • 37. Double roll Pair HMM↔ p 1-p 1-p 1-p 1-p1-p 1-p 1-p 1-p 1-p1-p p p p pp 1-p 1-p p ... ... ... ... 1-p 1-p
  • 38. Do two sequences come from a single rearrangement event?   The forward algorithm for HMMs gives probability of generating observed sequence from a given HMM:x   P(x) = P(x; σ),∑ paths σ   probability of generating two sequences and from the same path through the HMM (summed across paths). P(x, y) = P(x, y; σ),∑ paths σ x y
  • 40. Do sets of sequences come from a single rearrangement event?   = P(A ∪ B) P(A)P(B) P(A ∪ B | single rearrangement) P(A, B | independent rearrangements)     Use this for agglomerative clustering; stop when the ratio < 1.
  • 41. Preliminary simulation   Integrate out annotation uncertainty and win.
  • 42. Goal 2: how are antibodies revised?
  • 43. First, investigate BCR mutation patterns affinity maturation antigen naive B cell experienced B cell clonal expansion somatic hypermutation
  • 44. Use two-taxon “trees” for model fitting note: we know ancestral state within V, D, J. VV DD JJ IGNORE IGNORE IGNORE IGNORE   Our “trees” have an observed read on the bottom and the corresponding “ancestral” germline sequence on top, connected by a branch, representing some amount of divergence.
  • 45. model fitGeneral Time Reversible Individual A Individual B Individual C 0.14 0.79 0.10 0.22 0.72 0.41 0.69 0.40 0.06 0.28 0.73 0.17 0.08 0.48 0.27 0.17 0.35 0.50 0.66 0.23 0.32 0.42 0.37 0.36 0.35 0.11 0.46 1.02 0.12 1.12 0.85 0.31 0.18 1.10 0.91 0.06 0.12 0.79 0.10 0.19 0.60 0.43 0.76 0.36 0.07 0.24 0.67 0.18 0.07 0.64 0.23 0.14 0.36 0.44 0.74 0.21 0.36 0.33 0.33 0.45 0.28 0.14 0.44 0.76 0.13 1.15 0.94 0.34 0.24 0.89 0.86 0.07 0.14 0.72 0.11 0.21 0.54 0.43 0.71 0.37 0.08 0.24 0.65 0.18 0.08 0.50 0.27 0.16 0.27 0.49 0.65 0.16 0.45 0.39 0.34 0.52 0.27 0.14 0.50 0.73 0.14 1.05 0.79 0.28 0.23 0.90 0.70 0.08 T C G A T C G A T C G A IGHVIGHDIGHJ A G C T A G C T A G C T read germline
  • 46. Best model according to AIC/BIC … has different matrices and fixed rate multipliers for the different segments. V D J Seq. 1 Seq. 2 Seq. 3 t2 t3 t1 rD t1 rJ t1 rD t2 rJ t2 rD t3 rJ t3 Mutation Model
  • 47. Branch length distribution under this best model IGHD rate: 3.36 IGHJ rate: 0.62 IGHD rate: 4.44 IGHJ rate: 0.62 IGHD rate: 3.88 IGHJ rate: 0.63 Individual A Individual B Individual C 0e+00 2e+05 4e+05 6e+05 8e+05 0.0e+00 5.0e+05 1.0e+06 1.5e+06 2.0e+06 0e+00 5e+05 1e+06 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 0.0 0.1 0.2 0.3 0.4 ML branch length count   D segments evolve substantially faster than V J segments evolve more slowly than V Individual A has a higher mutational load.
  • 48. Next consider selection (Goal 2 con’t) affinity maturation antigen naive B cell experienced B cell clonal expansion somatic hypermutation
  • 49.   AAC AAG GTGGTC more likely less likely In antibodies
  • 50.   CCA CCT Pro Pro Thr Ile ATCACC synonymous nonsynonymous For selection ὡὡ AAC AAG GTGGTC more likely less likely In antibodies
  • 51. Would like per-site selection inference   ω ≡ ≡ dN dS rate of non-synonymous substitution rate of synonymous substitution   position 200 250 mutation freq 0.0 0.1 0.2 0.3 0.4 IGHV3­23D*01 A B C IGHV3­23D*01
  • 52. Productive vs. out-of-frame receptors   Each cell may carry two IGH alleles, but only one is expressed.   V D J V D J insertion that disrupts frame
  • 53. ω ≡ ≡ dN dS rate of non-synonymous substitution rate of synonymous substitution λS out−of−frame 70 80 90 100 site (IMGT numbering) 0.1 1.0 individual A B C     Out-of-frame reads can be used to infer neutral mutation rate!
  • 54. is a ratio of rates in terms of observed neutral process ωl : nonsynonymous in-frame rate for site : nonsynonymous out-of-frame rate for site : synonymous in-frame rate for site : synonymous out-of-frame rate for site λ (N−I) l l λ (N−O) l l λ (S−I) l l λ (S−O) l l     =ωl /λ (N−I) l λ (N−O) l /λ (S−I) l λ (S−O) l
  • 55. Renaissance count (Lemey,Minin… 2012) TGG CCG CGA seq−5 CCT CAA ATC ACT CTA TGG CCG CGA seq−2 CCA CAA ATC ACG TTA TGG CCG CGA ArgPro Gln Thr Ile Thr Leu Trp Gln Pro seq−1 CCA CAA ACC ACG TTA TGG CAG seq−3 CGA CCT CAA ACC ACT CTA TGG CAG CGA seq−4 CCT CAA ATC ACT CTA ACC ATC ATC ATC ACC ACC ATC ATC ATC ACC ATC ATC ACC ACC ACC ATC ATC mutation historysample Use sampled mutation histories to estimate rates... but such estimates can be unstable.
  • 56. Empirical Bayes regularization to stabilize estimates Say we are doing a per-county smoking survey. zero smokers? Really? Use all of the data to fit prior distribution of smoking prevalence, then with given observations obtain per-county posterior.
  • 57. Estimating selection coefficient ωl : nonsynonymous in-frame rate for site : nonsynonymous out-of-frame rate for site : synonymous in-frame rate for site : synonymous out-of-frame rate for site λ (N−I) l l λ (N−O) l l λ (S−I) l l λ (S−O) l l     =ωl /λ (N−I) l λ (N−O) l /λ (S−I) l λ (S−O) l
  • 58. Overall IGHV selection map 0.1 1.0 10.0 75 80 85 90 95 100 105 medianω Individual A 0 50 100 150 200 75 80 85 90 95 100 105 Site (IMGT numbering) count purifying neutral diversifying Distribution of classifications across IGHV genes Distribution of median estimates of ω
  • 59. Similar across individuals Individual A 0 50 100 150 200 75 80 85 90 95 100 105 count Individual B 0 50 100 150 200 75 80 85 90 95 100 105 count Individual C Site (IMGT numbering) 0 50 100 150 75 80 85 90 95 100 105 count purifying neutral diversifying
  • 61.
  • 62. Conclusion   B cell receptors are “drafted” and “revised” randomly, but … with remarkably consistent deletion and insertion patterns … with remarkably consistent substitution and selection   We can learn about these processes using model-based inference.   Paper on annotation with partiswill be up soon is up on arXivSelection analysis paper
  • 63. Thank you Trevor Bedford, Connor McCoy, Vladimir Minin & Duncan Ralph Phil Bradley for doing structural work Molecular work done by Paul Lindau in Phil Greenberg’s lab with support from Harlan Robins and Adaptive Biotechnologies Adaptive Biotechnologies computational biology team   National Science Foundation and National Institute of Health University of Washington Center for AIDS Research (CFAR) University of Washington eScience Institute W. M. Keck Foundation  
  • 65. Measuring clustering agreement good agreement: bad agreement: Cx Cy Cx Cy     Intuition: “how much variability is there in the color for amongst the items of a given color under ? Cx Cy
  • 66. Mutual information I Think of cluster identity under for a uniformly selected point as a random variable (similarly for and ): Cx X Cy Y I(X; Y ) = H(X) − H(X|Y ) where is the entropy of (ignoring ), and is the entropy of given the value for . H(X) X Y H(X|Y ) X Y   I(X; Y ) = p(x, y) log ( )∑ y∈Y ∑ x∈X p(x, y) p(x) p(y)   AM I(U , V ) = M I(U , V ) − E{M I(U , V )} max {H(U ), H(V )} − E{M I(U , V )}
  • 67. Estimates of the mutational process are quite consistent between individuals (each point is a single entry for one of the matrices for a pair of individuals.)
  • 68. Branch length differences between productive, unproductive Unproductive rearrangements are more likely to be either: unchanged from germline, or more divergent.
  • 69. Sites are generally under purifying selection Individual A Individual B Individual C 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 −1 0 1 median log10(ω) count purifying diversifying neutral countcount
  • 71. Distribution of amino acids beginning of CDR3 selection for aromatic amino acids?Frequency: left of line = out-of-frame, right of line = in-frame
  • 72. Stabilize with empirical Bayes regularization Assume that , the substitution rate at site , comes from a Gamma distribution with shape and rate : λl l α β ∼ Gamma(α, β).λl   Model total substitution counts (sampled via stochastic mapping) for a site as Poisson with rate :λl ∼ Poisson( ),Cl λl   Fit and to all data, then draw rates from the posterior:α^ β^ λl ∣ ∼ Gamma( + , 1 + ).λl Cl Cl α^ β^   We extended this regularization to case of non-constant coverage.
  • 73. Sequence counts status A B C functional 4,139,983 4,861,800 3,748,306 out-of-frame 533,919 794,845 558,246 stop 104,525 169,423 112,901
  • 74. Correlation between sequence and GTR matrix   Each dot is a pair of genes.
  • 75. Simulation results for selection inference ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● 0.1 1.0 10.0 0 25 50 75 100 site ω synonymous change possible? ● yes no type ●● ●● ●● purifying neutral diversifying 0.00 0.25 0.50 0.75 1.00 0 25 50 75 100 site Proportion type N S 0 250 500 750 1000 0 25 50 75 100 site coverage
  • 77. Random facts Mean length of D segment in individual A’s naive repertoire is 16.61. Subject A’s naive sequences were 37% CDR3 Divergence between the various germ-line V genes: >summary(dist.dna(allele_01,pairwise.deletion=TRUE,model='raw')) Min. 1stQu. Median Mean 3rdQu. Max. 0.0038460.2013000.3446000.3047000.3849000.539500