This document summarizes the characterization of 10 epitope-specific CD8+ T cell receptor repertoires from over 4,600 single cells. Key findings include quantifying gene segment usage, epitope selection, TCR similarity using TCRdist, repertoire diversity with TCRdiv, and developing a distance-based classifier to assign unobserved TCR. The work demonstrates that predictive models for TCR-pMHC recognition may be possible despite tremendous TCR diversity, with potential applications to analyze clinical T cell receptor repertoire data.
1. Quantifiable predictive features define
epitope-specific T cell receptor repertoires
Thi Nguyen, Ph.D. Candidate
Graduate Biomedical Sciences | Immunology Theme
University of Alabama at Birmingham (UAB)
kimthi@uab.edu
Summer Journal Club
August 29th, 2018
3. T cells
• A T cell (T lymphocyte) is a type of white blood cell that is critical for immune defense
• T cells can be distinguished from other leukocytes by the presence of a T cell receptor (TCR)
• derived from hematopoietic stem cells in the bone marrow
• mature in the thymus (as thymocytes)
• have many subsets with distinct functions (helper, killer, regulatory)
• unique ability to distinguish normal (self) from abnormal (non-self, cancerous, or
sick/dying) cells through pMHC binding (cell-mediated immunity).
• Upon TCR-pMHC binding plus costimulatory molecule binding, they become activated
• Depending on cytokine cues from the environment, they differentiate
4. TCR diversity: V(D)J recombination
Immunobiology: The Immune System in Health and Disease. 5th edition.
Janeway CA Jr, Travers P, Walport M, et al.
New York: Garland Science; 2001.
germline
• Theoretical estimate: 10^15 to 10^61 TCRs
• Observed: 10^6 TCRs
Gene rearrangement
7. TCR𝛂𝝱 sequence analysis
• V and J genes were assigned using BLAST against the IMGT database
• CDR3 nucleotide and aa sequences were assigned based on the location of the
conserved cysteine in the V region and the FGXG motif in the J region.
• The full CDR3 starts at C104 and ends at the F position of the FGXG motif (F118)
• The trimmed CDR3 starts at the 3rd position after C104 and terminates at the 2nd position
before F118
• To handle degenerate J-gene FGXG motifs, J aa sequences were manually aligned to
define the F118 position before sequence analysis
8. Extended table. TCR repertoire characteristics
clonality = 1 – Simpson’s diversity index
(normalized by the size of the repertoire)
Pshare = estimated probability that a clone drawn
from one subject has an aa sequence
identical to a clone from another subject
9. Extended Fig.1. CDR3 region characteristics of 10 epitope-specific TCR repertoires
10. How do they quantify gene preference?
Jensen-Shannon divergence (JSD) (total divergence to the average)
• measures the similarity between two probability distributions, i.e. quantifies how
distinguishable two distributions are from each other.
• Based on the Kullback-Leibler divergence, but it is symmetric and always has a finite value
• Basic form = entropy of the mixture minus the mixture of the entropies
Gene usage preference = a normalized JSD between the gene frequencies of an epitope-specific
repertoire and a non-specific background from public datasets.
• This can be generalized to any number of distributions P_1, ..., P_n with arbitrary weights π_i:
JSD_π(P_1, ..., P_n) = H(Σ_i π_i P_i) − Σ_i π_i H(P_i), where H is the Shannon entropy
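The "entropy of the mixture minus the mixture of the entropies" definition can be sketched in a few lines of Python. This is only an illustration: the function names and the gene frequencies are hypothetical, and the normalization by the mean Shannon entropy follows the description in the notes rather than the authors' code.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (in bits) of a discrete distribution given as a list."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def jensen_shannon_divergence(dists, weights=None):
    """Entropy of the weighted mixture minus the weighted mean of the entropies."""
    n = len(dists)
    if weights is None:
        weights = [1.0 / n] * n  # equal weights give the basic two-distribution form
    mixture = [sum(w * d[i] for w, d in zip(weights, dists))
               for i in range(len(dists[0]))]
    mean_entropy = sum(w * shannon_entropy(d) for w, d in zip(weights, dists))
    return shannon_entropy(mixture) - mean_entropy

def normalized_jsd(p, q):
    """JSD divided by the mean Shannon entropy of p and q (assumed normalization)."""
    mean_h = 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    return jensen_shannon_divergence([p, q]) / mean_h if mean_h > 0 else 0.0

# Hypothetical V-gene frequencies over the same four genes
epitope_freqs = [0.70, 0.20, 0.05, 0.05]
background_freqs = [0.25, 0.25, 0.25, 0.25]
print(normalized_jsd(epitope_freqs, background_freqs))
```

A repertoire whose gene frequencies match the background gives a value of 0; the more skewed the epitope-specific usage, the larger the value.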
11. Gene correlation analysis
• Covariation in gene usage was quantified by adjusted mutual information
• The adjustment corrects for the number and frequencies of the observed genes, which can produce apparent clustering by chance
• To set the lower significance threshold in Fig. 1c, they randomly shuffled the genes in each of
the 60 gene pairing lists 100 times and recomputed the adjusted mutual information.
• The largest value observed in these 6,000 random trials = lower significance threshold
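The shuffling procedure above can be sketched as follows. To keep the example self-contained it uses plain mutual information rather than the adjusted mutual information used in the paper, and it processes a single pairing list; the overall threshold would be the maximum across all lists. Function names and gene labels are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Plain mutual information (bits) between the two genes in each pair."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (left[a] * right[b]))
               for (a, b), c in joint.items())

def shuffle_threshold(pairs, n_shuffles=100, seed=0):
    """Largest mutual information seen after randomly re-pairing the genes."""
    rng = random.Random(seed)
    left = [a for a, _ in pairs]
    right = [b for _, b in pairs]
    best = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(right)  # break any real association between the two genes
        best = max(best, mutual_information(list(zip(left, right))))
    return best

# Hypothetical V-alpha / V-beta pairings with a perfect association
pairs = [("TRAV1", "TRBV1")] * 10 + [("TRAV2", "TRBV2")] * 10
print(mutual_information(pairs), shuffle_threshold(pairs))
```

An observed value is then considered meaningful only if it exceeds the largest value obtained from the shuffled lists.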
12. Fig.1: V and J gene segment usage and covariation in epitope-specific
responses
13. Extended Data Figure 2: V and J gene segment usage and covariation
in epitope-specific responses
14. Extended Data Figure 3. Schematic overview of the TCRdist
• Similarity between TCRs = similarity between their pMHC-contacting loops.
• Loops are defined based on the IMGT CDR
definitions with modifications:
(1) Include CDR2.5
(2) Use trimmed CDR3
• AAdist (alignment score) is derived from the BLOSUM62 matrix
• TCRdist = Sum (weighted AAdist)
distance(a,a) = 0
distance(a,b) = min(4, 4 − BLOSUM62(a,b)), to reduce the
penalty for aa pairs with a positive BLOSUM62 score.
• A gap penalty of 4 (8 for CDR3) = distance between
a gap position and an aa.
• A weight of 3 is applied to mismatches in the CDR3.
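The scoring rules on this slide can be illustrated with a small sketch. Several things here are assumptions: the loops are taken to be pre-aligned strings of equal length with "-" marking gaps, only a handful of BLOSUM62 entries are included, the CDR3 weight is applied to substitutions but not to gaps, and all function and variable names are hypothetical. It is not the authors' implementation.

```python
# Excerpt of BLOSUM62 scores for a few residue pairs; a real implementation
# would load the full 20x20 matrix.
BLOSUM62_EXCERPT = {
    ("I", "V"): 3, ("L", "I"): 2, ("K", "R"): 2, ("D", "E"): 2,
    ("S", "T"): 1, ("A", "G"): 0, ("L", "D"): -4,
}
GAP = "-"

def blosum62(a, b):
    """Look up a score in the excerpt, in either residue order."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    if (b, a) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in the excerpt")

def aa_dist(a, b, gap_penalty=4):
    """AAdist: 0 for identity, the gap penalty for a gap, else min(4, 4 - BLOSUM62)."""
    if a == b:
        return 0
    if a == GAP or b == GAP:
        return gap_penalty
    return min(4, 4 - blosum62(a, b))

def loop_dist(seq_a, seq_b, weight=1, gap_penalty=4):
    """Sum AAdist over aligned positions; the weight scales substitutions only (assumption)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("loops must be pre-aligned to the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        d = aa_dist(a, b, gap_penalty)
        total += d if GAP in (a, b) else weight * d
    return total

def tcr_dist(loops_a, loops_b):
    """Sum of loop distances; CDR3 uses weight 3 and gap penalty 8."""
    total = 0
    for name in loops_a:
        if name == "cdr3":
            total += loop_dist(loops_a[name], loops_b[name], weight=3, gap_penalty=8)
        else:
            total += loop_dist(loops_a[name], loops_b[name])
    return total

# Hypothetical single-chain example with two loops
tcr_a = {"cdr1": "KAS", "cdr3": "AIL-D"}
tcr_b = {"cdr1": "RAT", "cdr3": "AVLGE"}
print(tcr_dist(tcr_a, tcr_b))
```

Capping the substitution distance at 4 means that strongly dissimilar residues (negative BLOSUM62 scores) all cost the same, while conservative substitutions cost less.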
15. BLOSUM62 matrix
BLOSUM = BLOcks SUbstitution Matrix
• Scores alignments between protein sequences (based on local alignments, as opposed to PAM)
• Based on observed alignments
• Larger score = higher sequence similarity => smaller evolutionary distance
16. Clustering and dimensionality reduction
• The TCR with the largest number of neighbors within the distance threshold is chosen
as a cluster center.
It and all its neighbors are removed from the repertoire,
and the process is repeated until all TCRs have been clustered.
• The distance threshold was chosen to yield homogeneous clusters of sufficient size;
the same threshold was used for all repertoires
• The result of this clustering method was visualized with average-linkage hierarchical clustering
trees and TCR sequence logos
• They also use 2D kernel PCA (scikit-learn, KernelPCA function) to visualize the TCR
landscape. This attempts to preserve the similarity structure of the input data while reducing
its dimensionality.
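The greedy clustering step described above might look roughly like this in Python. The sketch assumes a precomputed square distance matrix, treats "within the threshold" as less than or equal to it, recounts neighbors among the TCRs that are still unclustered, and breaks ties by lowest index; those details and all names are assumptions.

```python
def greedy_cluster(dist, threshold):
    """Repeatedly pick the TCR with the most remaining neighbors as a cluster center."""
    remaining = set(range(len(dist)))
    clusters = []

    def neighbors(i):
        # Neighbors are counted only among TCRs that are still unclustered.
        return [j for j in remaining if j != i and dist[i][j] <= threshold]

    while remaining:
        # sorted() makes tie-breaking deterministic (lowest index wins)
        center = max(sorted(remaining), key=lambda i: len(neighbors(i)))
        members = [center] + sorted(neighbors(center))
        clusters.append((center, members))
        remaining -= set(members)
    return clusters

# Hypothetical TCRdist matrix for five receptors: 0-2 are close, 3-4 are close
dist = [
    [0, 10, 12, 90, 95],
    [10, 0, 11, 92, 96],
    [12, 11, 0, 91, 94],
    [90, 92, 91, 0, 15],
    [95, 96, 94, 15, 0],
]
print(greedy_cluster(dist, threshold=20))
```

Each returned tuple holds the index of the cluster center and the indices of all cluster members.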
17. Fig.2: TCRdist analysis of the M45 repertoire identifies clusters of
related receptors
TCR logos
• summarize V and J gene usage, CDR3 aa sequence
and inferred rearrangement structure of the CDR3
• 4 components:
1. A V-gene logo (left): V-gene names are scaled by
frequency and stacked top to bottom from most to least
common
2. A CDR3 sequence logo, where aa are scaled and ordered
by frequency and colored by chemical type.
3. A J-gene logo (right)
4. The CDR3 rearrangement structure, where the genomic source regions for each
nucleotide column are represented by frequency-scaled
bars ordered top to bottom from V to D to J and colored
according to their frequency.
18. Extended Fig.4:
2D projections of mouse epitope-specific TCR repertoire
• kernel PCA applied to TCRdist
• Colored based on gene segment usage
kernel PCA ~ nonlinear form of PCA
19. CDR3 motif discovery
• Motifs = fixed-length patterns consisting of fixed aa positions, wildcard positions, and aa group
positions (allowed groupings: (K,R), (D,E), (N,Q), (S,T), (F,Y,W,H), (A,G,S,P), (V,I,L,M))
• motif score = (observed − expected)^2 / expected.
• observed = number of times the motif was observed in the TCR sequences
• expected = count from background TCRs (with V and J genes matched to the observed repertoire)
• Starting with two-position motifs scoring above a seed threshold, each motif was
iteratively extended by adding new specified positions that increase the motif score.
• Motifs were sorted by score and filtered for redundancy.
• Motifs scoring above a threshold were extended to include near-neighbour TCRs using
a stringent distance threshold => capture additional patterns
• The final set of motifs for each repertoire was visualized using TCR logos.
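The motif score can be illustrated with a toy example. In this sketch a motif is a list of positions, each either a single aa, a string of grouped aa, or "." for a wildcard; the expected count is taken as the background match fraction scaled to the repertoire size. The representation, the scaling, the function names and the sequences are all assumptions made for illustration.

```python
import re

def motif_to_regex(motif):
    """Turn a list of positions into a regular expression."""
    parts = []
    for pos in motif:
        if pos == ".":
            parts.append(".")            # wildcard position
        elif len(pos) == 1:
            parts.append(re.escape(pos))  # fixed amino acid
        else:
            parts.append("[" + pos + "]")  # amino acid group, e.g. "KR"
    return re.compile("".join(parts))

def count_matches(motif, cdr3s):
    """Number of CDR3 sequences that contain the motif anywhere."""
    pattern = motif_to_regex(motif)
    return sum(1 for s in cdr3s if pattern.search(s))

def motif_score(observed, expected):
    """Chi-squared-style score: (observed - expected)^2 / expected."""
    return (observed - expected) ** 2 / expected

# Hypothetical CDR3 sequences
epitope_cdr3s = ["CASSGTKYF", "CASGARQYF", "CASSLLQYF", "CAGGSRTF"]
background_cdr3s = ["CASSLGTKYF", "CASSPQYF", "CAVLLTF", "CASRDTQYF"]

motif = ["G", ".", "KR"]  # G, any residue, then K or R
observed = count_matches(motif, epitope_cdr3s)
expected = (count_matches(motif, background_cdr3s) / len(background_cdr3s)
            * len(epitope_cdr3s))
print(observed, expected, motif_score(observed, expected))
```

Motif extension would then try adding one more specified position at a time and keep the extension only if this score increases.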
22. TCRdiv metric to measure repertoire diversity
Simpson diversity index (D):
• takes into account both the
richness and evenness of the
population.
• Measures the probability that 2
individuals randomly selected from
a population will belong to the same class.
• 0 ≤ D ≤ 1
• D = 1: no diversity (all individuals belong to the same class)
• D = 0: infinite diversity
TCRdiv:
• Estimates the expected value of a Gaussian function of the inter-sample distance that returns
1 if the samples are identical and exp(−(TCRdist(a,b)/s.d.)^2) otherwise.
• s.d. = 18.45 for single-chain distances and twice that for paired analyses, based on
empirical assessments of the receptor distance distributions for multiple epitopes.
• TCRdiv = inverse of this estimate
http://www.countrysideinfo.co.uk/simpsons.htm
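The TCRdiv definition above can be sketched as an average over all pairs of sampled receptors. The pairwise averaging, the function names and the toy distance function are assumptions; only the Gaussian kernel and the s.d. of 18.45 come from the slide.

```python
import math
from itertools import combinations

def tcrdiv(receptors, dist_fn, sd=18.45):
    """Inverse of the mean Gaussian similarity over all pairs of sampled receptors."""
    pairs = list(combinations(receptors, 2))
    if not pairs:
        raise ValueError("need at least two receptors")
    total = 0.0
    for a, b in pairs:
        # Identical receptors count as 1; others decay with their distance.
        total += 1.0 if a == b else math.exp(-(dist_fn(a, b) / sd) ** 2)
    return math.inf if total == 0 else len(pairs) / total

def toy_dist(a, b):
    """Hypothetical stand-in for TCRdist: 12 per mismatched position."""
    return 12 * sum(x != y for x, y in zip(a, b))

# Hypothetical repertoire with one expanded clone and two related receptors
repertoire = ["CASSL", "CASSL", "CASSL", "CASSV", "CAGGV"]
print(tcrdiv(repertoire, toy_dist))
```

If every non-identical pair is treated as infinitely distant, the mean similarity reduces to the probability that two sampled receptors are identical, so the result becomes the inverse of a Simpson-style index.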
24. Nearest neighbor (NN) -distance classifier
• receptor density within a repertoire = sampling density near each receptor
• NN-distance = weighted average distance to the nearest-neighbor receptors in the repertoire
• small NN-distance means higher local sampling density = many nearby neighbors
• They use the nearest 10% of the repertoire, with weights that decrease linearly from the
nearest to the farthest neighbor.
• To compute the AUROC score for the NN-distance classifier, epitope-specific TCRs (positives)
and background receptors (negatives) were sorted by NN-distance
• ROC curve:
sensitivity = fractional recovery of epitope-specific receptors
1 − specificity = fractional recovery of background receptors
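A minimal sketch of the NN-distance score and the AUROC computation follows. The exact form of the linear weights (k, k-1, ..., 1), the rounding of the 10% cutoff and the function names are assumptions; the AUROC is computed in its pairwise form, which is equivalent to the area under the ROC curve obtained by sorting on NN-distance.

```python
def nn_distance(dists_to_repertoire, fraction=0.10):
    """Weighted average distance to the nearest `fraction` of a repertoire."""
    ordered = sorted(dists_to_repertoire)
    k = max(1, int(round(fraction * len(ordered))))
    nearest = ordered[:k]
    weights = [k - i for i in range(k)]  # nearest neighbor gets the largest weight
    return sum(w * d for w, d in zip(weights, nearest)) / sum(weights)

def auroc(positive_scores, negative_scores):
    """Chance that a random positive has a lower NN-distance than a random negative."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical distances from one receptor to a repertoire of 20 TCRs
print(nn_distance([10, 5] + [100] * 18))

# Hypothetical NN-distance scores for epitope-specific and background receptors
print(auroc([20, 35, 50], [45, 80, 120]))
```

A value near 1 means epitope-specific receptors almost always sit closer to the repertoire than background receptors; 0.5 means the score does not separate them.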
28. Extended Fig 9. Specificity and Avidity of the dispersed TCR
29. Modeling gene rearrangement
• Each nucleotide of the CDR3 is assigned to either the V, D, or J segment or to an N-nucleotide
insertion, so as to minimize the number of N nucleotides.
• They sampled V and J gene segments from the observed receptors but generated
the junctional sequences based on the inferred probability distributions for the numbers of
insertions and deletions in the background TCRs (public data).
• They also used these probability distributions to generate the random receptors that
formed one of the two control sets for the CDR3 motif discovery algorithm.
extended Fig.8.
30. Summary
• Characterized 10 epitope-specific TCR repertoires of CD8+ T cells from over 4,600 single
cells:
gene segment usage
epitope selection
• Developed TCRdist, a distance measure on the space of TCRs, to quantify similarity among TCRs
• Developed TCRdiv to quantify TCR repertoire diversity
• Developed a distance-based classifier that can assign previously unobserved TCRs to characterized
repertoires
Significance:
• potential application to the analysis of clinical TCR repertoire data where the target is
unknown, such as in tumor-infiltrating lymphocytes (TIL).
• proposes that, despite the tremendous diversity of TCRs, predictive models of
TCR-pMHC recognition can be developed.
Editor's Notes
The TCR is a heterodimeric surface receptor that mediates recognition of pathogen-associated epitopes through interaction with pMHC.
TCRs are generated by genomic rearrangement of the germline TCR loci, a process called V(D)J recombination, which has the potential to generate marked TCR diversity (theoretical estimates range from 10^15 to 10^61). However, only about 10^6 TCRs are observed, owing to several limitations (biological, experimental, technical).
The cartoon on the left shows the germline organization of the human T-cell receptor α and β loci:
A cluster of 61 Jα gene segments is located a considerable distance from the Vα gene segments. The Jα gene segments are followed by a single C gene, which contains separate exons for the constant and hinge domains and a single exon encoding the transmembrane and cytoplasmic regions.
The TCRβ locus (chromosome 7) has a different organization, with a cluster of 52 functional Vβ gene segments located distantly from two separate clusters each containing a single D gene segment, together with six or seven J gene segments and a single C gene. Each TCRβ C gene has separate exons encoding the constant domain, the hinge, the transmembrane region, and the cytoplasmic region (not shown).
The TCRα- and β-chain genes are composed of discrete segments that are joined by somatic recombination during development of the T cell.
For the α chain, a Vα gene segment rearranges to a Jα gene segment to create a functional V-region exon. Transcription and splicing of the VJα exon to Cα generates the mRNA that is translated to yield the T-cell receptor α-chain protein.
For the β chain, the variable domain is encoded in three gene segments, Vβ, Dβ, and Jβ. Rearrangement of these gene segments generates a functional VDJβ V-region exon that is transcribed and spliced to join to Cβ; the resulting mRNA is translated to yield the T-cell receptor β chain.
The α and β chains pair soon after their biosynthesis to yield the α:β T-cell receptor heterodimer, which can bind a particular pMHC presented by an antigen-presenting cell (APC).
TCRs that recognize the same pMHC molecule more often than not have completely different sequences, yet for the most part they also share certain similarities. This paper attempts to characterize the similarities and differences among the unique TCRs in epitope-specific repertoires, in the hope that this will enable predictive modeling of TCR sequences given a known epitope and vice versa: given a particular TCR sequence, predict the pMHC it binds.
Historically, studying the actual TCR sequences in a repertoire was not possible because only bulk sequencing was available. With recent advances in single-cell sequencing, there are many new possibilities for studying and modeling TCRs and the immune response.
First, they use flow cytometry to sort CD8+ T cells that are specific for a given viral epitope by staining the cells with pMHC tetramers.
After single-cell sorting, they subject these cells to paired TCR amplification: first isolate RNA, then make cDNA, and then perform 2 rounds of nested PCR to amplify CDR3α and CDR3β in parallel.
They then sequence these PCR products, translate them to aa sequences, and combine α with β to obtain the αβ co-expression profile.
Degeneracy of the genetic code = different nucleotide sequences encode the same polypeptide.
They first analyze this TCR repertoire dataset using established features such as length, charge, and hydrophobicity.
a, TCR repertoires of 10 epitope-specific populations.
Whereas substantial levels of sharing (publicity) were observed at the single-chain level, lower levels of sharing were observed for paired chains, with 3 epitopes (F2, m139 and pp65) having no public receptors in this dataset.
b, Biophysical characteristics of TCR repertoires of 10 epitope specific populations.
CDR3 length, charge, hydrophobicity, and inferred number of junctional nucleotide insertions for both single and paired chains as shown in the histograms. Different epitopes are colour-coded.
Here they show that the mean values for CDR3 length, charge and hydrophobicity are tightly clustered for the majority of the epitopes, and all these features show a great degree of overlap in their ranges.
B. Correlation between CDR3α β and antigenic peptides for charge, hydrophobicity, length, and N-insertions observed in all 10 epitopes.
They found negative correlations between CDR3 charge and peptide charge and between peptide length and CDR3 length, suggesting that charge and length complementarity may have a role in pMHC recognition.
To quantify gene preference, they constructed a background, non-epitope-selected repertoire by combining public data and compared the gene frequencies in each epitope-specific repertoire to those seen in the background set.
Fig 1A. Gene segment usage and gene–gene pairing landscapes are illustrated graphically using four vertical stacks (one for each V and J segment) connected by curved segments with thickness proportional to the number of TCRs with the respective gene pairing (each panel is labelled with the four gene segments atop their respective colour stacks and the epitope identifier in the top middle).
Genes are coloured by frequency within the repertoire with a fixed colour sequence used throughout the manuscript which begins red (most frequent), green (second most frequent), blue, cyan, magenta, and black.
Clonally expanded TCRs were reduced to a single data point for this analysis. The number of clones is indicated to the left of each panel. The enrichment of gene segments relative to background frequencies is indicated by up or down arrows, with each successive arrowhead corresponding to an additional twofold deviation.
Panel A shows the degree of dominance of single genes and pairwise gene associations. Each epitope-specific response is characterized by an overrepresentation of individual genes as well as gene pairing preferences. An example is the PB1 epitope, where Trav3-3, Tra26 and Trb2-3 are used in the single largest block of receptors.
B. Jensen-Shannon divergence between the observed gene usage of each epitope-specific CD8+ T cell repertoire and the background, normalized by the mean Shannon entropy. Higher values indicate stronger gene preference.
C. Shows the gene usage correlation between V and J segments within a chain or across chains.
To map the epitope-specific TCR landscape at high resolution and to quantify similarity between TCRs, they developed TCRdist, a distance measure on the space of TCRs guided by structural information on pMHC binding.
Each of the two TCRs being compared is first mapped to the amino acid sequence of its CDR loops (CDR1, CDR2, and CDR3 as well as an additional variable loop here labelled ‘CDR2.5’), as indicated by the black arrows leading from the coloured loop regions in the receptor structures to the corresponding amino acid sequences in the middle of the diagram. These CDR sequences are aligned based on the IMGT reference multiple sequence alignments, and a distance score (‘AAdist’) is computed for each position in the alignment using the BLOSUM62 similarity matrix according to the formula given in the box at the bottom left. The AAdist scores are weighted as shown in the ‘weight’ row (thereby increasing the contribution of the CDR3 regions) and summed to produce the final TCRdist score (shown at the right).
2A shows gene usage, just as in Fig. 1.
2B. They use the TCRdist values to perform a 2D kernel principal components analysis (PCA). The clusters are then coloured by Vα (left panel) and Vβ (right panel) gene usage. Three groups of receptors that correspond to the TCR logos and clusters depicted in c are indicated with dashed ellipses.
They thus obtain a coarse-grained visualization of each repertoire by mapping the high-dimensional TCR landscape into 2 dimensions, with each dot representing a TCR, clustered based on TCRdist.
To complement these landscape projections, they also constructed hierarchical trees and TCR logos.
TCR logos summarize the gene frequencies, CDR3 aa sequences and inferred rearrangement structures to further annotate these clusters.
Examination of these trees shows that the repertoires mostly contain dominant clusters of receptors defined by common V-J region usage but also by similarity of CDR3 motifs.
In addition to the core clusters of similar receptors, each repertoire also has divergent regions that are clearly distinct from each other (for the dendrograms and TCR logos of the other TCR repertoires, see Extended Data Figures 5-6).
Looking at the logos, many of the shared CDR3 residues are derived directly from genomic sequence, and thus reflect the biased gene usage.
Inspection of these projected landscapes allows us to identify subregions of the TCR repertoires that are tightly clustered (similar) and to associate them with gene segment usage.
Standard PCA always finds linear principal components to represent the data in lower dimensions. Sometimes we need non-linear principal components. If we apply standard PCA to data with non-linear structure, it will fail to find a good representative direction. Kernel PCA (KPCA) rectifies this limitation.
Kernel PCA just performs PCA in a new space.
It uses the kernel trick to find principal components in a different (possibly high-dimensional) space.
PCA finds new directions based on the covariance matrix of the original variables. It can extract at most P (number of features) eigenvalues. KPCA finds new directions based on the kernel matrix. It can extract up to n (number of observations) eigenvalues.
PCA allows us to reconstruct the pre-image using a few of the P eigenvectors. This may not be possible in KPCA.
Extracting principal components takes more time with KPCA than with standard PCA.
They hypothesize that motifs that are not germline-encoded are more likely to contribute to specificity. To identify these features directly, they performed a statistical analysis of overrepresented CDR3 motifs, taking into account the underlying sequence bias introduced by the rearrangement process.
The grouping of aa was based on their similarity in charge and hydrophobicity.
Here in Fig. 3, they show the top-scoring motifs for both CDR3α and CDR3β for the 10 repertoires, along with the residues that are enriched relative to the background distribution. Their results are also supported by the solved ternary structures for PA, BMLF and M1, in which the enriched non-germline residues either directly contact the pMHC or contribute to the stabilization of the CDR3 loop conformation.
They developed a new diversity metric that generalizes Simpson’s diversity index by capturing the similarity among receptors in addition to exact identity.
Shows the TCRdiv diversity measure for the 10 epitope-specific CD8+ T cell repertoires, for each chain as well as for the paired chains. Examining the TCRdiv scores clarifies trends seen in the earlier analysis, e.g. the PB1 repertoire shows low diversity in the α chain but high diversity in the β chain, whereas the opposite is true for the M38 repertoire.
Their landscape analysis suggests that each repertoire is composed of one or more groups of clustered receptors sharing similar sequence features, together with a more diverse set of outlying receptors.
So, to measure the receptor density within a repertoire and quantify the relative contributions of clustered and divergent TCRs, they developed a repertoire-specific NN score that captures the density of receptors surrounding each receptor.
TCRdiv diversity measures (a) and smoothed density profiles of the nearest-neighbour (NN) distance (b) are shown for each repertoire.
B shows that the majority of TCR repertoires exhibit a bimodal distribution of nearest-neighbour distances, with one peak at low NN distance, representing the dominant, densely sampled clusters, and a 2nd peak at larger NN distances, reflecting the outlier receptors.
They also confirmed the antigen specificity of these non-clustered receptors by cloning the receptors into TCR-null cells and measuring their ability to bind tetramers; indeed, these outlier receptors represent legitimate but unconventional epitope specificity.
(c) To test the predictive power of TCRdist, they defined a TCR classifier that assigns a given receptor to the repertoire with the lowest NN distance. They first measured the sensitivity vs specificity of the classifier for identifying epitope-specific receptors among a pool of randomly generated background receptors.
d, The area under these ROC curves (AUROC), a standard measure of classification success, is above 0.8 for all repertoires except pp65, which is also the one with the highest diversity.
e, Indeed, TCRdiv is negatively correlated with AUROC, with the most diverse repertoires being harder to discriminate from the background.
f, To validate the TCR classifier, they generated an additional dataset using index-sorted cells stained with 4 tetramers (NP, PA, PB1 and F2) from the airways of influenza-infected mice. Cells were sorted without the index tetramer information, sequenced, and assigned to one of the 4 epitopes or to a non-specific response using the NN classifier. The predictor correctly assigned most TCR sequences to the target epitope identified by tetramer staining, with AUROC = 0.9 for 3 of the epitopes; for F2, it is about 0.72 for single cells and 0.85 for clonotypes, possibly because this epitope has the fewest receptor sequences available to train the classifier.
Importantly, 85% of the correctly classified receptors were not previously observed, demonstrating the power of this approach to classify novel antigen-specific receptors. Also, a significant population of cells fell below the threshold for tetramer positivity yet were assigned to a specific epitope by the NN classifier. They hypothesize that these cells may be specific for the predicted epitope yet could not be identified by tetramer staining.
Assignment of TCR sequences from influenza-infected lung without prior knowledge of their tetramer specificity by NN-distance classifier. Tetramer binding (mean fluorescence intensity (MFI), x axis) is plotted against NN-distance score (y axis) for a validation set of T cell receptors (n = 856 TCRs; 352 clones) collected after development of the classifier. The solid vertical lines indicate the MFI thresholds used to define epitope-positive receptors, which are plotted with the colours given in the legend (receptors negative for all four tetramers are shown in grey). Raw MFI values were scaled to align the threshold values across tetramers. Dotted horizontal lines indicating a fixed NN-distance score are provided for visual reference.
A+B: They also confirmed the antigen specificity of these non-clustered receptors by cloning the receptors into TCR-null cells and measuring their ability to bind tetramers; indeed, these outlier receptors represent legitimate but unconventional epitope specificity.
C: The distribution of the tested TCRs (numbered 1–5, corresponding to their left-to-right occurrence) on an NN-distance plot, and
d, their V-J usage and CDR3 sequences with NN-distance scores, are shown.
E. Analysis of the mean fluorescence intensities (MFI) of the clustered and dispersed groups of receptors (separated by a visual threshold of 135 NN-distance score) shows no consistent segregation by avidity. Mean and standard error of the mean are shown.
f, PB1-specific TCRs derived from cells sorted by low, intermediate and high gating show overlapping distribution of NN-distance scores (n = 23 (low), 18 (intermediate), 23 (high) cells).
So basically there is no correlation between the avidity of binding and NN distance.
However, they found a strong correlation between receptor density and TCR generation probability, suggesting that ease of generation explains a portion of the variation in landscape structure.