SlideShare a Scribd company logo
1 of 1
Download to read offline
Immunosequencing reveals diagnostic signatures of
pathogen infection and HLA type in the T-cell repertoire
William S. DeWitt1
, Ryan O. Emerson1
, Marissa Vignali1
, Cindy Desmarais1
, Mark J. Rieder1
, Christopher S. Carlson2
, Jenna Gravley2
, John A. Hansen2
, Harlan S. Robins1,2
1
Adaptive Biotechnologies, Seattle, WA, USA; 2
Fred Hutchinson Cancer Research Center, Seattle, WA, USA
SUMMARY
We investigated the public T-cell response to
cytomegalovirus (CMV) by sequencing the
rearranged T-cell receptors (TCRs) of 640 subjects
(287 with and 353 without CMV). First, we
performed an experiment similar to a GWAS study,
in which we assessed the concordance of ~85
million unique TCRs obtained through
high-throughput sequencing with CMV serostatus.
We identified 142 TCRs that were significantly
associated with a positive CMV serostatus at P ≤
10-4
(FDR ≈ 0.20). We then trained a binary
classifier on these features, and were able to
predict CMV serostatus in a leave-one-out
cross-validation procedure with a diagnostic odds
ratio of 66.
Next, we determined that we could estimate the
HLA-association of each CMV-associated TCR by
assessing the over-representation of particular HLA
types among the subjects that carry those
CMV-associated TCRs. Of 142 CMV-associated
TCRs, 57 were HLA-associated at P ≤ 10-3
, and none
of these were significantly associated with multiple
HLA alleles.
We find a substantial concordance between our
results and the literature. Most previously reported
public CMV-specific TCRs are seen in our data,
athough only 6 are significantly associated with
CMV serostatus in this cohort. Five of these were
significantly HLA-associated, and in all cases the
associated allele agrees with previous findings
We separately investigated the association of TCRs
with each HLA-A allele present in the cohort, and
observed significant association (P ≤ 10-4
) for many
higher frequency HLA alleles. Cross validation with
binary classifier training resulted in high accuracy
(~96%) prediction for these alleles, suggesting the
possibility that HLA type can be inferred from
immunosequencing data.
In conclusion, we demonstrate the validity of
association studies using immunosequencing for
detection and HLA-association of public T-cell
responses to infection, and report that assessing
the presence of associated T-cell responses can
serve as a powerful diagnostic classifier.
CONCLUSIONS
1551 Eastlake Ave E., Ste. 200 • Seattle, WA 98102
For more information contact bhowie@adaptivebiotech.com
© 2015 Adaptive Biotechnologies Corporation. All Rights Reserved.
PST-90009-04 AA
@adaptivebiotech
SUBJECT DEMOGRAPHICS
* Race and ethnicity categories as defined by the NIH were
used. All Hispanic/Latino were classified as race = Unknown
and Ethnicity = Hispanic or Latino. The 26 individuals classified
as Unknown provided no information for either Race or
Ethnicity
▲Fig 1. Experimental and analytical flow. (a) We analyzed
peripheral blood samples from 640 healthy subjects (287 CMV-
and 353 CMV+) by high-throughput TCRβ profiling. (b) We
identified TCRβ sequences that were present in significantly more
CMV+ subjects than CMV- subjects, controlling FDR by
permutation of CMV status. These data were used to build a
classification model. (c) The model was tested using exhaustive
leave-one-out cross-validation, in which one sample was held out
and the process repeated from the beginning. The resulting
classification model was used to predict the CMV status of the
holdout subject.
dataset
353 CMV-
subjects
287 CMV+
subjects
immunosequence TCR
machine learning
CASSLIGVSSYNEQFF
27 CMV+; 2 CMV-; p=2.76x10-8
pathogen memory burden
#ofsubjects
repeat
640X
“leave-one-out”
?
?
a b c
FPR
TPR
370 270
-
+
353
287
observed
347
23
6
264
predicted
- +
FPR
TPR
361 279
-
+
353
287
observed
322
39
31
248
predicted
- +
CASSLIGVSSYNEQFF
27 CMV+; 2 CMV-; p=2.76x10-8
pathogen memory burden
#ofsubjects
INFERENCE OF CMV INFECTION STATUS
◄Fig 2. Machine learning results. (a) Classification
performance on the full and cross-validation (CV) datasets for
each p-value threshold, measured as the area under the ROC curve
(AUROC). The numbers correspond to the CMV-associated TCRβ
identified at each p-value threshold; we selected the dataset
corresponding to p ≤ 1x10-4
(boxed) for downstream analyses. (b)
False discovery rate (FDR) estimated for each p-value threshold
used in the identification of significantly CMV-associated TCRβ
sequences, using permutations of CMV status. (c) Distribution of
CMV memory burden (i.e., the proportion of each subject’s TCRβ
repertoire that matches our list of 142 CMV-associated TCRβ
sequences) among CMV+ and CMV- subjects. (d) ROC curves
calculated for the full and cross-validation datasets, and confusion
matrix calculated on the cross-validation dataset. The highest
accuracy (0.89, diagnostic odds ratio 66) is achieved when
classifying 86% of true positives correctly with a false positive rate
of 8%.
Sex CMV+ CMV- All
Female 150 147 297
Male 137 206 343
All 287 353 640
Race/ethnicity* CMV+ CMV- All
White 164 212 376
Black or African American 8 0 8
Asian 15 2 17
American Indian or Alaska Native 7 2 9
Native Hawaiian or other Pacific Islander 3 0 3
Hispanic or Latino 20 6 26
Unknown 70 131 201
All 287 353 640
Age (in years) CMV+ CMV- All
Mean 42 37 40
Median 44 38 41
Range 5-74 1 -70 1 -74
# of individuals of unknown age 31 56 87
Table 1. Demographics of the subjects in this study
CASSLAPGATNEKLFF et al., 2005
et al., 2005
et al., 2008
et al. 2011
CASSLIGVSSYNEQFF et al., 1999
et al., 2011
et al., 1999
CASSLQAGANEQFF et al., 2003
CASASANYGYTF et al., 2008
(*) Reported gene, followed by name in current nomenclature (as per IMGT); for Peggs et al. the identity of the J gene was deduced from the aminoacid sequence.
this study previous publications
◄Table 2. Concordance between this dataset and
previously published data. The table lists the CDR3 amino
acid sequence (from C to F), flanking V and J genes, and HLA
association for each of the 6 TCRβ sequences identified in this
study that had been previously reported as CMV-associated, and
compares these data to previous reports, indicating whether the V
and J genes and the HLA associations match.
◄Fig 3. HLA-restriction of CMV-associated TCRβ
sequences. Distribution of HLA-A (a) and HLA-B (b) alleles in
this cohort. (c) Each of the 142 CMV-associated TCRβ sequences
(p ≤ 1x10-4
) was tested for significant association with each HLA
allele, with a P value threshold of 1x10-3
. Of these, 57 are
significantly associated with an HLA-A and/or an HLA-B allele, and
none are significantly associated with more than a single allele
from each locus. The colored sequences correspond to those that
had been previously identified as CMV-associated (Table 2); in 5
cases we recapitulate the correct HLA association, while we did
not see a statistically significant HLA association for the remaining
sequence.
►Fig. 4. Incidence of previously reported CMV-reactive
TCRβ sequences in this dataset. (a) The incidence of each
previously published CMV-associated TCRβ sequences in our
cohort is plotted along the horizontal axis by decreasing total
incidence (CMV+ subjects above, and CMV- subjects below the
horizontal line). (b) The incidence of these TCRβ sequences in our
cohort of 640 subjects for each group of sequences is shown.
• Positive selection eliminates T-cells with receptors that do not bind MHC.
• Negative selection eliminates T-cells with receptors that bind too strongly to
MHC or self peptides presented by MHC.
• MHC polymorphism means thymic selection will censor different TCRs
across individuals.
• V(D)J recombination produces some TCRs with high enough probability
that they’ll arise in multiple individuals.
• Individuals with different HLA type will censor these common sequences
differently.
• There should be “public clones” associated with HLA type.
FDRAUROC
40 65
142
676 1,455 7,878 78,566
0.5
0.6
0.7
full
CV
0.8
0.9
1.0
10-5
10-4
10-3
10-2
10-6
0
0.2
0.4
0.6
0.8
1.0
p-value threshold
a
b
NumberofSubjects
100
80
60
40
20
0
0 0.01 0.02 0.03 0.04
CMV+
CMV-
TruePositiveRate
0
0.2
0.4
0.6
0.8
1.0
full: AUROC=0.98
CV: AUROC=0.93
0.2 0.4 0.6 0.8 1.00
False Positive Rate
c d
-
+
observed
322
39
31
248
predicted
- +
pathogen memory burden (%)
CMV-specific TCRs reported in the literature
incidenceinCMV-subjectsincidenceinCMV+subjects
30020010003002001000
A
B
5 10 50 100 500
incidence in this cohort
0
2
4
6
8
10
12
14
public
private
fractionofCMV-specificTCRs
reportedintheliterature(%)
INFERENCE OF HLA TYPE
▲Table 3. List of CMV-associated TCRs
identified in this study. Top clones are shown
(sorted by P value). For each TCR, the list includes the V
gene, the CDR3 sequence (in aminoacids) and the J
gene, the frequency of the TCR among CMV+ and CMV-
subjects, and a P value.
Frequency
in Subjects
TCRb (V gene|CDR3|J gene) CMV+ CMV- p-value
V09-01|CASSGQGAYEQYF|J02-07*01 60 11 3.31E-13
V05-01*01|CASSPDRVGQETQYF|J02-05*01 34 1 9.54E-12
V19-01|CASSIGPLEHNEQFF|J02-01*01 30 0 1.48E-11
V28-01*01|CASSIEGNQPQHF|J01-05*01 27 0 1.95E-10
V05-06*01|CASSLVAGGRETQYF|J02-05*01 51 11 2.35E-10
V07-02*01|CASSLEAEYEQYF|J02-07*01 30 1 2.71E-10
V24|CATSDGDEQFF|J02-01*01 41 6 4.73E-10
V07-06*01|CASSLAPGATNEKLFF|J01-04*01 36 4 9.13E-10
V07-06*01|CASSRGRQETQYF|J02-05*01 37 5 2.11E-09
V09-01|CASSAGQGVTYEQYF|J02-07*01 23 0 5.88E-09
V05-06*01|CASSLRREKLFF|J01-04*01 28 2 1.26E-08
V07-09|CASSLIGVSSYNEQFF|J02-01*01 27 2 2.76E-08
V07-09|CASSFPTSGQETQYF|J02-05*01 27 2 2.76E-08
V25-01*01|CASSPGDEQYF|J02-07*01 24 1 3.71E-08
...
▼Fig. 5. Cross validation accuracy. Confusion
matrixes for two of the highest-frequency HLA-A alleles
(A2, which has 288 feature TCRs, and A24, which has 31).
Cross validation with binary classifier training predicts
both alleles with high accuracy (96%).
50
100
150
200
250
300
350
frequency
0
2 1 3 24 11 26 29 68 32 30 31 23 25 3433 66 28 80 36 69 74
288 feature TCRs
31 feature TCRs
HLA-A allele
• We describe the identification of a set of public TCRs
that distinguish between individuals with and without a
chronic viral infection (i.e. cytomegalovirus, or CMV)
with high accuracy, using a combination of
high-throughput immunosequencing, statistical
association of particular TCRs with disease state, and
machine learning to classify disease state in new
subjects.
•Our approach constitutes a pathogen specific diagnostic
tool that employs a very general assay, which relies only
on a large training dataset coupled with
immunosequencing and sophisticated data analysis to
identify public T-cell responses.
• Since rearranged T-cell receptors store immunological
data to all pathogens in a common format, this method
has the potential to be used to assess multiple disease
states in parallel.
• We show that the method can also be applied to infer
HLA type for HLA-A alleles, suggesting the possibility
that HLA type is an additional feature that can be
inferred (even retroactively) from immunosequencing
data.

More Related Content

Similar to Poster_Nature2015Conference

Insights into the tumor microenvironment and therapeutic T cell manufacture r...
Insights into the tumor microenvironment and therapeutic T cell manufacture r...Insights into the tumor microenvironment and therapeutic T cell manufacture r...
Insights into the tumor microenvironment and therapeutic T cell manufacture r...Thermo Fisher Scientific
 
Transplantation in sensitized patients(seminar)
Transplantation in sensitized patients(seminar)Transplantation in sensitized patients(seminar)
Transplantation in sensitized patients(seminar)Vishal Golay
 
2. viral hepatitis in blood bank hospital civil
2. viral hepatitis in blood bank hospital civil2. viral hepatitis in blood bank hospital civil
2. viral hepatitis in blood bank hospital civilErwin Chiquete, MD, PhD
 
Themis Genetic Summary
Themis Genetic SummaryThemis Genetic Summary
Themis Genetic SummaryMiles Priar
 
EHA poster Genomic Analysis by MyAML with Chemotherapy
EHA poster Genomic Analysis by MyAML with ChemotherapyEHA poster Genomic Analysis by MyAML with Chemotherapy
EHA poster Genomic Analysis by MyAML with ChemotherapySuzanne M. Graham
 
TCRB chain convergence in chronic cytomegalovirus infection and cancer
TCRB chain convergence in chronic cytomegalovirus infection and cancerTCRB chain convergence in chronic cytomegalovirus infection and cancer
TCRB chain convergence in chronic cytomegalovirus infection and cancerThermo Fisher Scientific
 
Sergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord Blood
Sergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord BloodSergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord Blood
Sergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord BloodSingapore Society for Haematology
 
Purdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINALPurdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINALChristy Cooper
 
Histocompatibility in kidney transplantation
Histocompatibility in kidney transplantationHistocompatibility in kidney transplantation
Histocompatibility in kidney transplantationscienthiasanjeevani1
 
TEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia Líquida
TEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia LíquidaTEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia Líquida
TEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia LíquidaComunicaoIberlab
 
A computational framework for large-scale analysis of TCRβ immune repertoire ...
A computational framework for large-scale analysis of TCRβ immune repertoire ...A computational framework for large-scale analysis of TCRβ immune repertoire ...
A computational framework for large-scale analysis of TCRβ immune repertoire ...Thermo Fisher Scientific
 
bb_unit7RhSpring2011.ppt
bb_unit7RhSpring2011.pptbb_unit7RhSpring2011.ppt
bb_unit7RhSpring2011.pptAlokdev Mishra
 

Similar to Poster_Nature2015Conference (20)

Insights into the tumor microenvironment and therapeutic T cell manufacture r...
Insights into the tumor microenvironment and therapeutic T cell manufacture r...Insights into the tumor microenvironment and therapeutic T cell manufacture r...
Insights into the tumor microenvironment and therapeutic T cell manufacture r...
 
Minh's Poster
Minh's PosterMinh's Poster
Minh's Poster
 
Transplantation in sensitized patients(seminar)
Transplantation in sensitized patients(seminar)Transplantation in sensitized patients(seminar)
Transplantation in sensitized patients(seminar)
 
2. viral hepatitis in blood bank hospital civil
2. viral hepatitis in blood bank hospital civil2. viral hepatitis in blood bank hospital civil
2. viral hepatitis in blood bank hospital civil
 
Themis Genetic Summary
Themis Genetic SummaryThemis Genetic Summary
Themis Genetic Summary
 
EHA poster Genomic Analysis by MyAML with Chemotherapy
EHA poster Genomic Analysis by MyAML with ChemotherapyEHA poster Genomic Analysis by MyAML with Chemotherapy
EHA poster Genomic Analysis by MyAML with Chemotherapy
 
TCRB chain convergence in chronic cytomegalovirus infection and cancer
TCRB chain convergence in chronic cytomegalovirus infection and cancerTCRB chain convergence in chronic cytomegalovirus infection and cancer
TCRB chain convergence in chronic cytomegalovirus infection and cancer
 
Sergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord Blood
Sergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord BloodSergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord Blood
Sergio Querol - Advances in UCBT: UCB Banking, Making the Most of Cord Blood
 
ECCMID2005HPV(4c)
ECCMID2005HPV(4c)ECCMID2005HPV(4c)
ECCMID2005HPV(4c)
 
Purdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINALPurdue cancer center retreat poster Christy Cooper 12062014FINAL
Purdue cancer center retreat poster Christy Cooper 12062014FINAL
 
Histocompatibility in kidney transplantation
Histocompatibility in kidney transplantationHistocompatibility in kidney transplantation
Histocompatibility in kidney transplantation
 
TEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia Líquida
TEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia LíquidaTEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia Líquida
TEMPUS xT, Biópsia sólida vs TEMPUS xF, Biópsia Líquida
 
A computational framework for large-scale analysis of TCRβ immune repertoire ...
A computational framework for large-scale analysis of TCRβ immune repertoire ...A computational framework for large-scale analysis of TCRβ immune repertoire ...
A computational framework for large-scale analysis of TCRβ immune repertoire ...
 
females and hcv
females and hcvfemales and hcv
females and hcv
 
bb_unit7RhSpring2011.ppt
bb_unit7RhSpring2011.pptbb_unit7RhSpring2011.ppt
bb_unit7RhSpring2011.ppt
 
Sitc 2015 poster
Sitc 2015 posterSitc 2015 poster
Sitc 2015 poster
 
jclinpath00424-0026
jclinpath00424-0026jclinpath00424-0026
jclinpath00424-0026
 
Cancer Immunogenomics
Cancer ImmunogenomicsCancer Immunogenomics
Cancer Immunogenomics
 
Abstract (1)
Abstract (1)Abstract (1)
Abstract (1)
 
C3_Psarra.pdf
C3_Psarra.pdfC3_Psarra.pdf
C3_Psarra.pdf
 

Poster_Nature2015Conference

  • 1. Immunosequencing reveals diagnostic signatures of pathogen infection and HLA type in the T-cell repertoire William S. DeWitt1 , Ryan O. Emerson1 , Marissa Vignali1 , Cindy Desmarais1 , Mark J. Rieder1 , Christopher S. Carlson2 , Jenna Gravley2 , John A. Hansen2 , Harlan S. Robins1,2 1 Adaptive Biotechnologies, Seattle, WA, USA; 2 Fred Hutchinson Cancer Research Center, Seattle, WA, USA SUMMARY We investigated the public T-cell response to cytomegalovirus (CMV) by sequencing the rearranged T-cell receptors (TCRs) of 640 subjects (287 with and 353 without CMV). First, we performed an experiment similar to a GWAS study, in which we assessed the concordance of ~85 million unique TCRs obtained through high-throughput sequencing with CMV serostatus. We identified 142 TCRs that were significantly associated with a positive CMV serostatus at P ≤ 10-4 (FDR ≈ 0.20). We then trained a binary classifier on these features, and were able to predict CMV serostatus in a leave-one-out cross-validation procedure with a diagnostic odds ratio of 66. Next, we determined that we could estimate the HLA-association of each CMV-associated TCR by assessing the over-representation of particular HLA types among the subjects that carry those CMV-associated TCRs. Of 142 CMV-associated TCRs, 57 were HLA-associated at P ≤ 10-3 , and none of these were significantly associated with multiple HLA alleles. We find a substantial concordance between our results and the literature. Most previously reported public CMV-specific TCRs are seen in our data, athough only 6 are significantly associated with CMV serostatus in this cohort. Five of these were significantly HLA-associated, and in all cases the associated allele agrees with previous findings We separately investigated the association of TCRs with each HLA-A allele present in the cohort, and observed significant association (P ≤ 10-4 ) for many higher frequency HLA alleles. Cross validation with binary classifier training resulted in high accuracy (~96%) prediction for these alleles, suggesting the possibility that HLA type can be inferred from immunosequencing data. In conclusion, we demonstrate the validity of association studies using immunosequencing for detection and HLA-association of public T-cell responses to infection, and report that assessing the presence of associated T-cell responses can serve as a powerful diagnostic classifier. CONCLUSIONS 1551 Eastlake Ave E., Ste. 200 • Seattle, WA 98102 For more information contact bhowie@adaptivebiotech.com © 2015 Adaptive Biotechnologies Corporation. All Rights Reserved. PST-90009-04 AA @adaptivebiotech SUBJECT DEMOGRAPHICS * Race and ethnicity categories as defined by the NIH were used. All Hispanic/Latino were classified as race = Unknown and Ethnicity = Hispanic or Latino. The 26 individuals classified as Unknown provided no information for either Race or Ethnicity ▲Fig 1. Experimental and analytical flow. (a) We analyzed peripheral blood samples from 640 healthy subjects (287 CMV- and 353 CMV+) by high-throughput TCRβ profiling. (b) We identified TCRβ sequences that were present in significantly more CMV+ subjects than CMV- subjects, controlling FDR by permutation of CMV status. These data were used to build a classification model. (c) The model was tested using exhaustive leave-one-out cross-validation, in which one sample was held out and the process repeated from the beginning. The resulting classification model was used to predict the CMV status of the holdout subject. dataset 353 CMV- subjects 287 CMV+ subjects immunosequence TCR machine learning CASSLIGVSSYNEQFF 27 CMV+; 2 CMV-; p=2.76x10-8 pathogen memory burden #ofsubjects repeat 640X “leave-one-out” ? ? a b c FPR TPR 370 270 - + 353 287 observed 347 23 6 264 predicted - + FPR TPR 361 279 - + 353 287 observed 322 39 31 248 predicted - + CASSLIGVSSYNEQFF 27 CMV+; 2 CMV-; p=2.76x10-8 pathogen memory burden #ofsubjects INFERENCE OF CMV INFECTION STATUS ◄Fig 2. Machine learning results. (a) Classification performance on the full and cross-validation (CV) datasets for each p-value threshold, measured as the area under the ROC curve (AUROC). The numbers correspond to the CMV-associated TCRβ identified at each p-value threshold; we selected the dataset corresponding to p ≤ 1x10-4 (boxed) for downstream analyses. (b) False discovery rate (FDR) estimated for each p-value threshold used in the identification of significantly CMV-associated TCRβ sequences, using permutations of CMV status. (c) Distribution of CMV memory burden (i.e., the proportion of each subject’s TCRβ repertoire that matches our list of 142 CMV-associated TCRβ sequences) among CMV+ and CMV- subjects. (d) ROC curves calculated for the full and cross-validation datasets, and confusion matrix calculated on the cross-validation dataset. The highest accuracy (0.89, diagnostic odds ratio 66) is achieved when classifying 86% of true positives correctly with a false positive rate of 8%. Sex CMV+ CMV- All Female 150 147 297 Male 137 206 343 All 287 353 640 Race/ethnicity* CMV+ CMV- All White 164 212 376 Black or African American 8 0 8 Asian 15 2 17 American Indian or Alaska Native 7 2 9 Native Hawaiian or other Pacific Islander 3 0 3 Hispanic or Latino 20 6 26 Unknown 70 131 201 All 287 353 640 Age (in years) CMV+ CMV- All Mean 42 37 40 Median 44 38 41 Range 5-74 1 -70 1 -74 # of individuals of unknown age 31 56 87 Table 1. Demographics of the subjects in this study CASSLAPGATNEKLFF et al., 2005 et al., 2005 et al., 2008 et al. 2011 CASSLIGVSSYNEQFF et al., 1999 et al., 2011 et al., 1999 CASSLQAGANEQFF et al., 2003 CASASANYGYTF et al., 2008 (*) Reported gene, followed by name in current nomenclature (as per IMGT); for Peggs et al. the identity of the J gene was deduced from the aminoacid sequence. this study previous publications ◄Table 2. Concordance between this dataset and previously published data. The table lists the CDR3 amino acid sequence (from C to F), flanking V and J genes, and HLA association for each of the 6 TCRβ sequences identified in this study that had been previously reported as CMV-associated, and compares these data to previous reports, indicating whether the V and J genes and the HLA associations match. ◄Fig 3. HLA-restriction of CMV-associated TCRβ sequences. Distribution of HLA-A (a) and HLA-B (b) alleles in this cohort. (c) Each of the 142 CMV-associated TCRβ sequences (p ≤ 1x10-4 ) was tested for significant association with each HLA allele, with a P value threshold of 1x10-3 . Of these, 57 are significantly associated with an HLA-A and/or an HLA-B allele, and none are significantly associated with more than a single allele from each locus. The colored sequences correspond to those that had been previously identified as CMV-associated (Table 2); in 5 cases we recapitulate the correct HLA association, while we did not see a statistically significant HLA association for the remaining sequence. ►Fig. 4. Incidence of previously reported CMV-reactive TCRβ sequences in this dataset. (a) The incidence of each previously published CMV-associated TCRβ sequences in our cohort is plotted along the horizontal axis by decreasing total incidence (CMV+ subjects above, and CMV- subjects below the horizontal line). (b) The incidence of these TCRβ sequences in our cohort of 640 subjects for each group of sequences is shown. • Positive selection eliminates T-cells with receptors that do not bind MHC. • Negative selection eliminates T-cells with receptors that bind too strongly to MHC or self peptides presented by MHC. • MHC polymorphism means thymic selection will censor different TCRs across individuals. • V(D)J recombination produces some TCRs with high enough probability that they’ll arise in multiple individuals. • Individuals with different HLA type will censor these common sequences differently. • There should be “public clones” associated with HLA type. FDRAUROC 40 65 142 676 1,455 7,878 78,566 0.5 0.6 0.7 full CV 0.8 0.9 1.0 10-5 10-4 10-3 10-2 10-6 0 0.2 0.4 0.6 0.8 1.0 p-value threshold a b NumberofSubjects 100 80 60 40 20 0 0 0.01 0.02 0.03 0.04 CMV+ CMV- TruePositiveRate 0 0.2 0.4 0.6 0.8 1.0 full: AUROC=0.98 CV: AUROC=0.93 0.2 0.4 0.6 0.8 1.00 False Positive Rate c d - + observed 322 39 31 248 predicted - + pathogen memory burden (%) CMV-specific TCRs reported in the literature incidenceinCMV-subjectsincidenceinCMV+subjects 30020010003002001000 A B 5 10 50 100 500 incidence in this cohort 0 2 4 6 8 10 12 14 public private fractionofCMV-specificTCRs reportedintheliterature(%) INFERENCE OF HLA TYPE ▲Table 3. List of CMV-associated TCRs identified in this study. Top clones are shown (sorted by P value). For each TCR, the list includes the V gene, the CDR3 sequence (in aminoacids) and the J gene, the frequency of the TCR among CMV+ and CMV- subjects, and a P value. Frequency in Subjects TCRb (V gene|CDR3|J gene) CMV+ CMV- p-value V09-01|CASSGQGAYEQYF|J02-07*01 60 11 3.31E-13 V05-01*01|CASSPDRVGQETQYF|J02-05*01 34 1 9.54E-12 V19-01|CASSIGPLEHNEQFF|J02-01*01 30 0 1.48E-11 V28-01*01|CASSIEGNQPQHF|J01-05*01 27 0 1.95E-10 V05-06*01|CASSLVAGGRETQYF|J02-05*01 51 11 2.35E-10 V07-02*01|CASSLEAEYEQYF|J02-07*01 30 1 2.71E-10 V24|CATSDGDEQFF|J02-01*01 41 6 4.73E-10 V07-06*01|CASSLAPGATNEKLFF|J01-04*01 36 4 9.13E-10 V07-06*01|CASSRGRQETQYF|J02-05*01 37 5 2.11E-09 V09-01|CASSAGQGVTYEQYF|J02-07*01 23 0 5.88E-09 V05-06*01|CASSLRREKLFF|J01-04*01 28 2 1.26E-08 V07-09|CASSLIGVSSYNEQFF|J02-01*01 27 2 2.76E-08 V07-09|CASSFPTSGQETQYF|J02-05*01 27 2 2.76E-08 V25-01*01|CASSPGDEQYF|J02-07*01 24 1 3.71E-08 ... ▼Fig. 5. Cross validation accuracy. Confusion matrixes for two of the highest-frequency HLA-A alleles (A2, which has 288 feature TCRs, and A24, which has 31). Cross validation with binary classifier training predicts both alleles with high accuracy (96%). 50 100 150 200 250 300 350 frequency 0 2 1 3 24 11 26 29 68 32 30 31 23 25 3433 66 28 80 36 69 74 288 feature TCRs 31 feature TCRs HLA-A allele • We describe the identification of a set of public TCRs that distinguish between individuals with and without a chronic viral infection (i.e. cytomegalovirus, or CMV) with high accuracy, using a combination of high-throughput immunosequencing, statistical association of particular TCRs with disease state, and machine learning to classify disease state in new subjects. •Our approach constitutes a pathogen specific diagnostic tool that employs a very general assay, which relies only on a large training dataset coupled with immunosequencing and sophisticated data analysis to identify public T-cell responses. • Since rearranged T-cell receptors store immunological data to all pathogens in a common format, this method has the potential to be used to assess multiple disease states in parallel. • We show that the method can also be applied to infer HLA type for HLA-A alleles, suggesting the possibility that HLA type is an additional feature that can be inferred (even retroactively) from immunosequencing data.