Predictive Features of TCR Repertoire

Quantifiable predictive features define
epitope-specific T cell receptor repertoires
Thi Nguyen, Ph.D. Candidate
Graduate Biomedical Sciences | Immunology Theme
University of Alabama at Birmingham (UAB)
kimthi@uab.edu
Summer Journal Club
August 29th, 2018

Outline
1. T cell background: TCR diversity
2. Experiment workflow
3. General biochemical characteristics of epitope-specific CD8+ TCR
repertoires
4. Gene usage preference
5. TCRdist = measure difference between TCRs
6. CDR3 motif discovery
7. TCRdiv = measure TCR diversity
8. Nearest Neighbor Classifier

T cells
• A T cell (T lymphocyte) is a type of white blood cell that is critical for immune defense
• T cells can be distinguished from other leukocytes by the presence of a TCR
• Derived from hematopoietic stem cells in the bone marrow
• Mature in the thymus (as thymocytes)
• Comprise many subsets with distinct functions (helper, killer, regulatory)
• Unique ability to distinguish normal (self) cells from abnormal (non-self, cancerous, or
sick/dying) cells through pMHC binding (cell-mediated immunity)
• Upon TCR-pMHC binding plus costimulatory molecule binding, they become activated
• Depending on cytokine cues from the environment, they differentiate

TCR diversity: V(D)J recombination
Immunobiology: The Immune System in Health and Disease. 5th edition.
Janeway CA Jr, Travers P, Walport M, et al.
New York: Garland Science; 2001.
[Figure: germline gene segments and gene rearrangement]
• Theoretical estimate: 10^15 to 10^61 TCRs
• Observed: 10^6 TCRs

Experiment Workflow
[Workflow diagram]
• Mice (n=78): influenza (i.n.) or mCMV (i.p.); samples from BAL and spleen
• Humans (n=32): influenza, CMV, EBV; samples from PBMC
• Stain with tetramers, then sort epitope-specific CD8+ T cells
• Single-cell mRNA, then paired TCR amplification and sequencing
• N = 4635 paired TCR sequences

Paired TCRαβ amplification and sequencing
Pradyot Dash et al
Methods Mol. Biol. 2015

TCRαβ sequence analysis
• V and J genes were assigned using BLAST against the IMGT database
• CDR3 nucleotide and amino acid (aa) sequences were assigned based on the location of the
conserved cysteine in the V region and the FGXG motif in the J region
• The full CDR3 starts at C104 and ends at the F position (F118) of the FGXG motif
• The trimmed CDR3 starts at the 3rd position after C104 and ends at the 2nd position
before F118
• To handle degenerate J-gene FGXG motifs, J aa sequences were manually aligned to
define the F118 position before sequence analysis
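
The full-versus-trimmed CDR3 definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the full CDR3 is already available as an aa string that begins with C104 and ends with F118; the function name `trim_cdr3` and the example sequence are chosen here for illustration and are not taken from the paper.

```python
def trim_cdr3(full_cdr3: str) -> str:
    """Return the trimmed CDR3 from a full CDR3 string (C104 ... F118)."""
    if len(full_cdr3) < 6:
        raise ValueError("CDR3 is too short to trim")
    # Index 0 is C104, so index 3 is the 3rd position after it.
    # Index -1 is F118, so index -3 is the 2nd position before it;
    # the slice [3:-2] therefore keeps index 3 through index -3 inclusive.
    return full_cdr3[3:-2]


# Illustrative CDR3 beta string (assumption: begins with C104, ends with F118).
print(trim_cdr3("CASSIRSSYEQYF"))
```

Degenerate J-gene motifs, which the slide says were aligned manually, are not handled by this sketch.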

Extended table. TCR repertoire characteristics
• clonality = 1 - Simpson's diversity index (normalized by the size of the repertoire)
• Pshare = estimated rate at which a clone drawn from one subject has an aa sequence
identical to that of a clone from another subject

Extended Fig.1. CDR3 region characteristics of 10 epitope-specific TCR repertoires

How do they quantify gene preference?
Jensen-Shannon divergence (JSD), also called the total divergence to the average
• Measures the similarity between two probability distributions, i.e. quantifies how
distinguishable two distributions are from each other
• Based on the Kullback-Leibler divergence, but it is symmetric and always finite
• Basic form = entropy of the mixture minus the mixture of the entropies
• This can be generalized to any number of distributions P_1..P_n with arbitrary weights w_i:
JSD(P_1, ..., P_n) = H(sum_i w_i P_i) - sum_i w_i H(P_i), where H is the Shannon entropy
Gene usage preference = a normalized JSD between the gene frequencies of the
epitope-specific repertoire and a non-specific background from public datasets.
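
A minimal sketch of the weighted JSD, written as "entropy of the mixture minus the mixture of the entropies". It assumes each distribution is a list of probabilities over the same ordered set of genes, and it uses base-2 logarithms so that the two-distribution, equal-weight case lies between 0 and 1. The exact normalization used in the paper is not given on the slide, so none is applied here, and the gene frequencies in the example are made up.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(distributions, weights=None):
    """Weighted Jensen-Shannon divergence of several distributions."""
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights by default
    size = len(distributions[0])
    # Mixture distribution: weighted average of the inputs, gene by gene.
    mixture = [
        sum(w * d[i] for w, d in zip(weights, distributions))
        for i in range(size)
    ]
    mixture_of_entropies = sum(w * entropy(d) for w, d in zip(weights, distributions))
    return entropy(mixture) - mixture_of_entropies


# Hypothetical V-gene frequencies (same gene order in both lists).
epitope_specific = [0.70, 0.20, 0.10]
background = [0.30, 0.30, 0.40]
print(jsd([epitope_specific, background]))
```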

Gene correlation analysis
• Covariation between gene usage was quantified by adjusted mutual information
• The adjustment corrects for associations expected by chance, given the number and
frequencies of the observed genes
• To set the lower significance threshold in Fig. 1c, they randomly shuffled the genes in each of
the 60 gene-pairing lists 100 times and recomputed the adjusted mutual information
• The largest value observed in these 6000 random trials = lower significance threshold
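
The shuffling procedure can be sketched as follows. To stay dependency-free, this sketch scores each pairing with plain (unadjusted) mutual information as a stand-in; the paper uses adjusted mutual information, which would replace `mutual_information` here. The function names, the seed, and the toy gene lists are assumptions for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plain mutual information (in nats) between two paired label lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def shuffle_threshold(xs, ys, n_shuffles=100, seed=0):
    """Largest score seen after randomly shuffling one side of the pairing."""
    rng = random.Random(seed)
    shuffled = list(ys)
    best = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real association between xs and ys
        best = max(best, mutual_information(xs, shuffled))
    return best


# Toy pairing list: V and J genes that are perfectly coupled.
v_genes = ["V1"] * 10 + ["V2"] * 10
j_genes = ["J1"] * 10 + ["J2"] * 10
print(mutual_information(v_genes, j_genes), shuffle_threshold(v_genes, j_genes))
```

In the procedure described above, the maximum would be taken over all 60 pairing lists (6000 shuffles in total) rather than over a single list.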

Fig.1: V and J gene segment usage and covariation in epitope-specific
responses

Extended Data Figure 2: V and J gene segment usage and covariation
in epitope-specific responses

Extended Data Figure 3. Schematic overview of TCRdist
• Similarity between TCRs = similarity between their pMHC-contacting loops
• Loops are defined based on the IMGT CDR definitions, with modifications:
(1) Include CDR2.5
(2) Use the trimmed CDR3
• AAdist (alignment score) is derived from the BLOSUM62 matrix:
distance(a,a) = 0
distance(a,b) = min(4, 4 - BLOSUM62(a,b)) for a ≠ b, which reduces the
penalty for aa pairs with a positive BLOSUM62 score
• A gap penalty of 4 (8 for CDR3) = distance between a gap position and an aa
• A weight of 3 is applied to mismatches in the CDR3
• TCRdist = sum of the weighted AAdist values
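
A minimal sketch of the distance rules listed above, under several stated assumptions: the loops are already aligned to equal length with "-" marking gaps; only a handful of BLOSUM62 entries are included (a partial table, enough for the toy example); and the weight of 3 is applied to CDR3 mismatches but not to CDR3 gaps, which is one literal reading of the slide. The loop sequences and function names are made up, and this is not the authors' implementation.

```python
# Partial BLOSUM62 lookup: only the pairs needed for the toy example below.
BLOSUM62_SUBSET = {
    frozenset("ST"): 1,
    frozenset("KR"): 2,
    frozenset("IV"): 3,
    frozenset("AR"): -1,
}


def aa_distance(a, b):
    """distance(a,a) = 0; distance(a,b) = min(4, 4 - BLOSUM62(a,b))."""
    if a == b:
        return 0
    return min(4, 4 - BLOSUM62_SUBSET[frozenset((a, b))])


def loop_distance(loop_a, loop_b, gap_penalty=4, mismatch_weight=1):
    """Sum per-position distances over two pre-aligned, equal-length loops."""
    if len(loop_a) != len(loop_b):
        raise ValueError("this sketch expects pre-aligned loops of equal length")
    total = 0
    for a, b in zip(loop_a, loop_b):
        if a == b:
            continue
        if a == "-" or b == "-":
            total += gap_penalty  # gap against an amino acid
        else:
            total += mismatch_weight * aa_distance(a, b)
    return total


def tcrdist_single_chain(loops_a, loops_b):
    """Add up loop distances; CDR3 uses gap penalty 8 and mismatch weight 3."""
    total = 0
    for name in loops_a:
        if name == "cdr3":
            total += loop_distance(loops_a[name], loops_b[name], 8, 3)
        else:
            total += loop_distance(loops_a[name], loops_b[name], 4, 1)
    return total


# Toy, made-up loop sequences.
tcr_a = {"cdr1": "ST", "cdr2": "KD", "cdr3": "IAD-"}
tcr_b = {"cdr1": "SS", "cdr2": "RD", "cdr3": "VRDE"}
print(tcrdist_single_chain(tcr_a, tcr_b))  # 3 + 2 + (3 + 12 + 8) = 28
```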

BLOSUM62 matrix
BLOSUM = BLOcks SUbstitution Matrix
• Scores alignments between protein sequences (derived from local alignments, as opposed to PAM)
• Based on observed alignments
• Larger = higher sequence similarity => smaller evolutionary distance

Clustering and dimensionality reduction
• The TCR with the largest number of neighbors within the distance threshold is chosen
as a cluster center.
→ It and all its neighbors are removed from the repertoire.
→ The process is repeated until all TCRs have been clustered.
• The distance threshold was chosen to yield homogeneous clusters of sufficient size; the
same threshold was used for all repertoires.
• The result of this clustering was visualized with average-linkage hierarchical clustering
trees and TCR sequence logos.
• They also used 2D kernel PCA (scikit-learn, KernelPCA) to visualize the TCR
landscape. This attempts to preserve the similarity structure of the input data while reducing
its dimensionality.
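
The greedy clustering steps can be sketched as below. The sketch assumes a precomputed symmetric distance matrix, treats "within the threshold" as less than or equal to, and breaks ties by the lowest index; these choices, the function name, and the toy distances are assumptions rather than details from the paper.

```python
def greedy_cluster(dist, threshold):
    """Repeatedly pick the TCR with the most neighbors and remove its cluster."""
    remaining = set(range(len(dist)))
    clusters = []
    while remaining:
        def neighbors(i):
            # Unclustered TCRs within the threshold of TCR i (excluding itself).
            return {j for j in remaining if j != i and dist[i][j] <= threshold}

        # max() keeps the first maximum, so ties go to the lowest index.
        center = max(sorted(remaining), key=lambda i: len(neighbors(i)))
        members = neighbors(center) | {center}
        clusters.append((center, sorted(members)))
        remaining -= members
    return clusters


# Toy distances: five "TCRs" placed on a line at these positions.
positions = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
print(greedy_cluster(dist, threshold=1.5))
```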

Fig.2: TCRdist analysis of the M45 repertoire identifies clusters of
related receptors
TCR logos
• Summarize V and J gene usage, CDR3 aa sequence,
and inferred rearrangement structure of the CDR3
• 4 components:
1. V-gene logo (left): V-gene names are scaled by
frequency and stacked top to bottom from most to least
common
2. CDR3 sequence logo, where aa are scaled and ordered
by frequency and colored by chemical type
3. J-gene logo (right)
4. CDR3 rearrangement structure, where the genomic source regions for each
nucleotide column are represented by frequency-scaled
bars ordered top to bottom from V to D to J and colored
according to their frequency

Extended Fig.4:
2D projections of mouse epitope-specific TCR repertoire
• kernel PCA applied to TCRdist
• Colored based on gene segment usage
kernel PCA ~ nonlinear form of PCA

CDR3 motif discovery
• Motifs = fixed-length patterns consisting of aa positions, wildcard positions, and aa-group
positions (allowed groupings: (KR), (DE), (NQ), (ST), (FYWH), (AGSP), (VILM))
• Motif score = (observed - expected)^2 / expected
• observed = number of times the motif was observed in the TCR sequences
• expected = value from background TCRs (with V and J genes matched to the observed repertoire)
• Starting with two-position motifs scoring above a seed threshold, each motif was
iteratively extended by adding new specified positions that increase the motif score
• Motifs were sorted by score and filtered for redundancy
• Motifs scoring above a threshold were extended to include near-neighbour TCRs using
a stringent distance threshold => captures additional patterns
• The final set of motifs for each repertoire was visualized using TCR logos
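
A minimal sketch of motif matching and the chi-squared-style motif score. It assumes a motif is a list whose elements are "." (wildcard), a single aa, or a group string such as "KR"; that a sequence counts once if the motif occurs anywhere in it; and that the expected count is the background match rate scaled to the size of the observed repertoire. These conventions, the function names, and the toy sequences are illustrative assumptions.

```python
def matches(motif, window):
    """True if every motif position accepts the aligned amino acid."""
    return all(m == "." or aa in m for m, aa in zip(motif, window))


def count_matches(motif, cdr3s):
    """Number of sequences containing the motif at any offset."""
    k = len(motif)
    return sum(
        any(matches(motif, s[i:i + k]) for i in range(len(s) - k + 1))
        for s in cdr3s
    )


def motif_score(observed, expected):
    """(observed - expected)^2 / expected."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) ** 2 / expected


# Toy data: S, then any aa, then K or R.
motif = ["S", ".", "KR"]
repertoire = ["ASIRS", "GSLKT", "AAAAA"]
background = ["ASIRS", "AAAAA", "GGGGG", "TTTTT"]

observed = count_matches(motif, repertoire)
expected = count_matches(motif, background) / len(background) * len(repertoire)
print(observed, expected, motif_score(observed, expected))
```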

Amino Acid Groupings
https://en.wikipedia.org/wiki/Amino_acid

Fig.3: Enriched CDR3 sequence motifs define key features of epitope
specificity

TCRdiv metric to measure repertoire diversity
Simpson diversity index (D):
• Takes into account both the richness and the evenness of the population
• Measures the probability that 2 individuals randomly selected from a population
belong to the same class
• 0 ≤ D ≤ 1
• 1 means all individuals belong to the same class (no diversity)
• 0 means no two individuals share a class (maximal diversity)
TCRdiv:
• Estimates the expected value of a Gaussian function of the inter-sample distance that returns
1 if the samples are identical and exp(-(TCRdist(a,b)/s.d.)^2) otherwise
• s.d. = 18.45 for single-chain distances and twice that for paired analyses, based on
empirical assessments of receptor distance distributions for multiple epitopes
• TCRdiv = inverse of this estimate
http://www.countrysideinfo.co.uk/simpsons.htm
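
The two measures can be compared side by side in a short sketch. It assumes the Simpson index is estimated by drawing two individuals without replacement, and that TCRdiv averages the Gaussian similarity over all distinct pairs in a precomputed TCRdist matrix; the paper's exact estimator may differ in detail. Function names and toy values are assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def simpson_index(labels):
    """Probability that two individuals drawn without replacement share a class."""
    n = len(labels)
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def tcrdiv(dist, sd=18.45):
    """Inverse of the mean Gaussian similarity exp(-(d/sd)^2) over all pairs."""
    pairs = list(combinations(range(len(dist)), 2))
    # Identical receptors have distance 0, so their similarity is exp(0) = 1.
    mean_similarity = sum(
        math.exp(-((dist[i][j] / sd) ** 2)) for i, j in pairs
    ) / len(pairs)
    return 1.0 / mean_similarity


print(simpson_index(["a", "a", "b", "b"]))

# Toy single-chain distance matrices.
identical = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
spread = [[0, 40, 60], [40, 0, 50], [60, 50, 0]]
print(tcrdiv(identical), tcrdiv(spread))
```

Unlike the Simpson index, which only asks whether two clones are identical, the Gaussian term gives partial credit to receptors that are similar but not identical.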

Extended Fig.8. TCRdiv measures for each chain and paired chains

Nearest-neighbor (NN) distance classifier
• Receptor density within a repertoire = sampling density near each receptor
• NN-distance = weighted average distance to the nearest-neighbor receptors in the repertoire
• A small NN-distance means a higher local sampling density = many nearby neighbors
• They use the nearest 10% of the repertoire, with weights that decrease linearly from the
nearest to the farthest neighbor
• To compute the AUROC score for the NN-distance classifier, epitope-specific TCRs (positives)
and background receptors (negatives) were sorted by NN-distance
• ROC curve:
sensitivity = fractional recovery of epitope-specific receptors
1 - specificity = fractional recovery of background receptors
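
A minimal sketch of the NN-distance and the AUROC computed from it. It assumes the input is the list of TCRdist values from one query receptor to every receptor in the repertoire (with the query's own zero self-distance already excluded), and it uses weights k, k-1, ..., 1 for the k nearest neighbors as one possible linearly decreasing scheme. The AUROC is computed by pairwise comparison, which is equivalent to the area under the ROC curve when a lower NN-distance is taken as evidence for epitope specificity. All names and numbers are illustrative.

```python
def nn_distance(dists_to_repertoire, fraction=0.10):
    """Weighted average distance to the nearest `fraction` of the repertoire."""
    ordered = sorted(dists_to_repertoire)
    k = max(1, round(fraction * len(ordered)))
    nearest = ordered[:k]
    weights = [k - i for i in range(k)]  # k for the nearest, down to 1
    return sum(w * d for w, d in zip(weights, nearest)) / sum(weights)


def auroc(positive_scores, negative_scores):
    """Chance that a positive has a smaller NN-distance than a negative."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Toy distances from one query TCR to a 20-receptor repertoire.
print(nn_distance(list(range(1, 21))))

# Toy NN-distances for epitope-specific (positive) and background (negative) TCRs.
print(auroc([12.0, 20.0, 35.0], [30.0, 60.0, 80.0]))
```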

Nearest Neighbor Classifiers
http://user.it.uu.se/~kostis/Teaching/DM-05/Slides/classification01.pdf

KNN algorithm
http://user.it.uu.se/~kostis/Teaching/DM-05/Slides/classification01.pdf

Fig. 4. Quantifying the defining features of epitope-specific populations

Extended Fig 9. Specificity and Avidity of the dispersed TCR

Modeling gene rearrangement
• Each nucleotide of the CDR3 is assigned to V, D, J, or N-nucleotide insertion
so as to minimize the number of N nucleotides
• They sampled V and J gene segments from the observed receptors but generated
the junctional sequences from the inferred probability distributions for the number of
insertions and deletions in the background TCRs (public data)
• They also used these probability distributions to generate the random receptors that
formed one of the two control sets for the CDR3 motif discovery algorithm
Extended Fig. 8.

Summary
• Characterized 10 epitope-specific TCR repertoires of CD8+ T cells from ~4600 single-cell
paired TCR sequences:
→ gene segment usage
→ epitope selection
• Developed TCRdist to quantify similarity between TCRs based on their pMHC-contacting loops
• Developed TCRdiv to quantify TCR repertoire diversity
• Developed a distance-based classifier that can assign previously unobserved TCRs to
characterized repertoires
Significance:
• Potential application to the analysis of clinical TCR repertoire data where the target is
unknown, such as in TILs
• Proposes that, despite the tremendous diversity of TCRs, predictive models of
TCR-pMHC recognition can be developed
