Assign 2.0: software for the analysis of
Phred quality values for quality control of
HLA sequencing-based typing
D.C. Sayer (1,2,3), D.M. Goodridge (1,3), F.T. Christiansen (1,2)
Authors’ affiliations:
1 Department of Clinical Immunology and Biochemical Genetics, Royal Perth
Hospital, Wellington Street, Perth 6000, Western Australia, Australia
2 School of Surgery and Pathology, Division of Pathology, University of
Western Australia, Verdun Street, Nedlands, Western Australia, Australia
3 Conexio Genomics, PO Box 1670, Applecross, Western Australia, Australia
Correspondence to:
David C. Sayer
Department of Clinical Immunology and Biochemical Genetics
Royal Perth Hospital
Wellington Street
Perth 6000
Western Australia
Australia
Tel.: +61 8 92242899
Fax: +61 8 92242920
e-mail: david.sayer@health.wa.gov.au
Abstract: As improvements to DNA sequencing technology have resulted in
increasing the throughput of DNA sequencing, the bottleneck for high
throughput DNA sequencing-based typing (SBT) has shifted to sequence
analysis, genotyping and quality control (QC). Consistent high-quality DNA
sequence is required in order to reduce manual verification and editing of
sequence electropherograms. However, identifying systematic changes in
quality is difficult to achieve without the aid of sophisticated sequence
analysis programs dedicated to this purpose. We describe a computer
software program called Assign 2.0, which integrates sequence QC analysis
and genotyping in order to facilitate high-throughput SBT. Assign 2.0
performs an analysis of Phred quality values in order to produce quality
scores for a sample and a sequencing run. This enables sample-to-sample and
run-to-run QC monitoring and provides a mechanism for the comparison of
sequence quality between various genes, various reagents and various
protocols with the aim of improving the overall quality of DNA sequence data.
This, in turn, will result in reducing sequence analysis as a bottleneck for
high-throughput SBT.
Key words: assign; quality control; resequencing; sequencing
Received 14 January 2004, revised 6 April 2004, accepted
for publication 19 April 2004
Copyright © Blackwell Munksgaard 2004
doi: 10.1111/j.1399-0039.2004.00283.x
Tissue Antigens 2004: 64: 556–565
Printed in Denmark. All rights reserved

Recent advances in DNA-sequencing technology, including the
introduction of capillary DNA sequencers and improvements in
dye-labelling technology (1, 2), have simplified DNA-sequencing protocols
and have improved the ability to detect heterozygous sequences.
As a result, an increasing number of clinical and research
laboratories are using DNA sequencing in order to study genetic diversity.
This is particularly true for laboratories performing human leucocyte
antigen (HLA) typing for the matching of donors and recipients for
bone marrow transplantation. Several HLA genes are required for
matching for transplantation and each is highly polymorphic
(http://www.ebi.ac.uk/imgt/hla). Current genotyping approaches are
hierarchical and employ low typing resolution molecular techniques
that are relatively inexpensive and suitable for high throughput,
followed by DNA sequencing to provide high-resolution typing
when required. DNA sequencing is regarded as the gold standard
for HLA typing and therefore the ideal would be for DNA sequencing
to be the sole method for HLA typing. State-of-the-art DNA sequen-
cers provide the throughput requirements for most HLA-typing
laboratories. However, data analysis, including manual verification
of automated sequence base calling, allele assignment and quality
control (QC), is a significant impediment to high-throughput sequen-
cing-based typing (SBT).
HLA SBT is a complex multi-step process, which requires the
specific polymerase chain reaction (PCR) amplification of the region
to be sequenced, sequencing up to four polymorphic exons in both
directions, splicing the intron sequence and creating a single con-
catenated consensus sequence for analysis. The consensus sequence
is usually matched against a database of allele sequences in order to
identify those alleles, which are best matched to the test sequence.
Computer software programs, such as SeqScape® v2.0 Software
(SeqScape) from Applied Biosystems (Foster City, CA), perform base
calling, align forward and reverse complementary sequences, splice
intron sequence and produce a concatenated consensus sequence for
allele assignment. However, base calling may be unreliable, espe-
cially for heterozygous sequence, because an arbitrary threshold for
heterozygosity is assigned based on the percentage of one peak
within another. If the threshold is too low, the presence of any back-
ground may result in false calling of heterozygotes. If the threshold
is not set low enough, then some heterozygotes with low di-
deoxynucleotide incorporation may be incorrectly called homozygotes.
Therefore, manual verification by viewing the sequence electropherograms
(EPG) is required.
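The threshold trade-off described above can be illustrated with a minimal sketch. It assumes that a position is called heterozygous when the second-highest peak reaches a fixed fraction of the highest peak. The function name, the peak heights and the threshold values are hypothetical and are not taken from SeqScape or any other base caller.

```python
def call_position(primary_height: float, secondary_height: float,
                  threshold: float = 0.3) -> str:
    """Classify a position from the heights of its two tallest peaks.

    `threshold` is the minimum secondary/primary peak ratio required to
    call a heterozygote (an illustrative value, not a vendor default).
    """
    if primary_height <= 0:
        raise ValueError("primary_height must be positive")
    ratio = secondary_height / primary_height
    return "heterozygous" if ratio >= threshold else "homozygous"


# Background noise at 15% of the main peak: a low threshold calls a false heterozygote.
print(call_position(1000, 150, threshold=0.10))
# A true heterozygote with a weak second peak (25%): a high threshold misses it.
print(call_position(1000, 250, threshold=0.30))
```

No single threshold resolves both cases, which is why review of the EPG remains necessary when background is present.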
The requirement for manual sequence base-call verification and
sequence editing is highest, when the quality of the sequence is poor.
The ability to obtain and maintain high-quality sequence is critical to
improving the throughput capabilities of SBT. High-quality sequence
results in improved accuracy of base calling and removes the time
required for manual verification. As sequencing is being increasingly
used in a clinical setting, guidelines for sequence quality have been
suggested by groups, such as the Clinical Molecular Genetics Society
(http://cmgs.org/BPG/Guidelines/2002/data%20quality). However,
these guidelines tend to be subjective.
Unique and objective approaches to SBT QC are required. We
suggest that various combinations of alleles in heterozygous sam-
ples, each with its own unique sequence, are amplified in PCR and
sequencing reactions with various efficiencies, largely as a result of
the different melting temperatures and GC content. Thus, every
sample should have its own QC. Furthermore, as the sequence for
every sample is usually derived from concatenated bi-directionally
sequenced units (BSU) or exons as is the case for most HLA class-I
SBT assays (3), the basic unit of QC should be the BSU. We have
developed a computer software program that enables such QC to be
performed. We have integrated this with our allele assignment
software in order to provide a comprehensive sequence-analysis
software program, called Assign 2.0. Assign 2.0 is suitable for high-
throughput HLA SBT or any resequencing application.
The Assign 2.0 QC tools enable the analysis of several indicators
of sequence quality. However, the primary function of Assign 2.0 is
the analysis of Phred quality values (PQV) (4, 5). Phred is a software
program, which provides a probability that a base call within a
sequence is correct by using the algorithm $QV = -10 \log_{10}(P_E)$,
where $P_E$ is the probability that the base call is an error. Thus, a
PQV of 40 indicates that there is a one in 10,000 chance that the base
call is incorrect. However, this algorithm was developed for cloned
template and the same interpretations of base call accuracy may not
apply to heterozygous sequence from PCR products. Therefore, we
have investigated whether PQV can have a broader utility for the
assessment of SBT QC and provide a quality score for a sequenced
sample and a sequencing run or gel. We demonstrate a unique and
informative objective assessment of sequence quality following the
analysis of PQV that enables the setting of target specifications of
quality. As a result, we are able to monitor samples and sequencing
runs for deviations from target specifications (accuracy) and exces-
sive variability around target specifications (precision), thus meeting
the criteria for effective QC (6).
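The relationship between a Phred quality value and the error probability can be checked with a short sketch. The function names are ours and are not part of Phred or Assign 2.0; only the standard Phred definition is assumed.

```python
import math


def phred_to_error_probability(qv: float) -> float:
    """Probability that a base call with Phred quality value `qv` is wrong."""
    return 10 ** (-qv / 10)


def error_probability_to_phred(pe: float) -> float:
    """Phred quality value for an error probability `pe` (0 < pe <= 1)."""
    return -10 * math.log10(pe)


for qv in (20, 30, 40):
    pe = phred_to_error_probability(qv)
    print(f"PQV {qv}: error probability {pe:.0e} (about 1 in {round(1 / pe):,})")
```

Because the scale is logarithmic, a drop of 10 PQV units corresponds to a tenfold increase in the estimated error probability.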
Methods
Sequencing reactions were performed by means of Applied Biosystems
Big Dye® Terminator v3.0 sequencing chemistry. All sequencing was
performed on an Applied Biosystems ABI PRISM® 3730 Genetic Analyzer
(AB 3730). The AB 3730 is a 48-capillary automated DNA sequencer.
HLA-A, HLA-B and HLA-C SBT protocols
were developed in house. Each locus was typed by means of DNA
sequencing following locus-specific amplification and bi-directional
sequencing of exons 2 and 3. HLA-A and HLA-C were amplified with
a single set of amplification primers and HLA-B was amplified in two
PCRs in order to amplify the HLA-B alleles in two groups character-
ized by the alternate ‘TA’ and ‘CG’ dimorphism located in intron 1 (7).
The locus names HLA-BTA and HLA-BCG have been used in order
to indicate the alternative PCR amplifications. The DNA sequences
were analysed in a two-step process. First, the sequences
were analysed with the help of ABI PRISM SeqScape® Software
(SeqScape) in order to splice intron sequence, align forward and reverse
sequence strands and assign consensus sequence quality values. The
DNA sequence files in .xml format were then imported into Assign
2.0 for allele assignment and QC analysis. The data included in the
.xml files contain the consensus sequence base calls and the consensus
sequence PQV (CSPQV). The .xml files are named according to a
strict convention, which includes the sample name, the locus being
sequenced and the sequencing primer. In addition, the .xml file
storage system is organized by means of locus and sequencing
date in order to facilitate data retrieval and enable chronological
analysis of sequence QC data. Assign 2.0 QC tools perform inde-
pendent analysis of CSPQV of automated homozygous (CSPQV-hom)
and heterozygous (CSPQV-het) base calls for a single position, a
range of positions (e.g., exon 2 or exon 3) or a selected date range
for a selected locus. We present an analysis of data from HLA-A
SBT runs from 12 February 2003 to 7 July 2003. This included
1086 samples sequenced on 76 different sequencing runs.
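The independent treatment of homozygous and heterozygous base calls can be sketched as follows. This is an illustration only: it assumes that heterozygous consensus calls are written as IUPAC ambiguity codes (for example R for A/G), and the base calls and quality values are invented. It does not reproduce the Assign 2.0 implementation or the SeqScape .xml layout.

```python
from statistics import mean, stdev

HOMOZYGOUS_CODES = set("ACGT")  # any other symbol is treated as a heterozygous call


def split_cspqv(base_calls, quality_values):
    """Separate consensus quality values into homozygous and heterozygous lists."""
    hom, het = [], []
    for base, qv in zip(base_calls, quality_values):
        (hom if base in HOMOZYGOUS_CODES else het).append(qv)
    return hom, het


def summarize(values):
    """Return (mean, SD), or None when fewer than two values are available."""
    if len(values) < 2:
        return None
    return mean(values), stdev(values)


# Invented consensus for a short region; R and Y mark heterozygous positions.
calls = "ACGTRACGYT"
cspqv = [41, 40, 42, 39, 26, 41, 40, 43, 27, 42]
hom, het = split_cspqv(calls, cspqv)
print("CSPQV-hom:", summarize(hom))
print("CSPQV-het:", summarize(het))
```

Returning None for fewer than two values reflects the fact that some samples have few or no heterozygous positions.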
The Assign 2.0 allele assignment component of the software matches
the consensus test sequence against an HLA allele sequence library
generated from the IMGT/HLA database (http://www.ebi.ac.uk/imgt/hla).
The matching algorithm has been developed in order to enable high-
speed matching on multiple samples to facilitate high-throughput SBT.
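The matching algorithm itself is not described here, so the following is only a minimal sketch of the general idea of matching a consensus sequence against pairs of library alleles. It assumes that heterozygous positions are encoded with two-base IUPAC ambiguity codes. The allele names and sequences are hypothetical toy data and not real HLA alleles, and the exhaustive pairwise search is not optimized for speed.

```python
from itertools import combinations_with_replacement

# Two-base IUPAC ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_consensus(allele_a, allele_b):
    """Expected consensus sequence when two alleles are sequenced together."""
    return "".join(IUPAC[frozenset((a, b))] for a, b in zip(allele_a, allele_b))


def best_matches(test_sequence, library):
    """Return the allele pairs with the fewest mismatches to the test sequence.

    Each result is (mismatch_count, allele_1, allele_2, mismatched_positions),
    with positions numbered from 1.
    """
    scored = []
    for (name_a, seq_a), (name_b, seq_b) in combinations_with_replacement(
            sorted(library.items()), 2):
        expected = genotype_consensus(seq_a, seq_b)
        mismatches = [i + 1 for i, (t, e) in enumerate(zip(test_sequence, expected))
                      if t != e]
        scored.append((len(mismatches), name_a, name_b, mismatches))
    scored.sort()
    fewest = scored[0][0]
    return [s for s in scored if s[0] == fewest]


toy_library = {"allele_1": "ACGT", "allele_2": "ACAT", "allele_3": "TCGT"}
print(best_matches("ACRT", toy_library))
```

Reporting the mismatched positions alongside each pair mirrors the kind of result table discussed later, where mismatches guide manual review.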
Results
Assign 2.0 QC tools: QC analysis of PQV
As PQV have previously been reported to be an indicator of base call
accuracy (5) and therefore sequence quality, we examined the possi-
bility that analysis of CSPQV could be extrapolated in order to
provide useful QC data for sample and/or a sequencing run. The
hypothesis is that the mean and standard deviation (SD) of CSPQV
for all nucleotide positions will reflect the sequence quality of the
sample sequenced. Furthermore, the mean and SD CSPQV of all
nucleotides for all samples on a sequencing run will reflect the
quality of the sequencing run. However, in order to determine the
feasibility of this approach, we needed to determine the degree of
variability of base call CSPQV at the same site between various
sequences, which appeared visually to be of good quality, between
various samples. It is important to demonstrate that CSPQV only
varies because of changes in sequence quality. For this purpose, we
analysed the CSPQV at 100 conserved (and therefore homozygous)
positions within exon 2 of HLA-A SBT from 20 samples within
the same sequencing run. The results have been presented in Fig. 1.
The mean CSPQV between positions may differ slightly, but more
importantly the CSPQV at each position are reproducible between
various samples. All but three positions have SD of less than 5
CSPQV units and a coefficient of variation (CV) of 5% with a
mean CV value for all positions of 2.7%.
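The per-position reproducibility analysis can be sketched as below. The data are invented (three samples, four positions) rather than the 20 samples and 100 positions analysed in the study, and the use of the sample SD (n − 1 denominator) is an assumption, because the text does not state which estimator was used.

```python
from statistics import mean, stdev


def position_statistics(cspqv_by_sample):
    """Mean, SD and CV (%) of CSPQV at each position across samples.

    `cspqv_by_sample` is a list of per-sample lists that hold one CSPQV per
    conserved position, in the same position order for every sample.
    """
    stats = []
    for values in zip(*cspqv_by_sample):  # one tuple of values per position
        m = mean(values)
        sd = stdev(values)
        stats.append({"mean": m, "sd": sd, "cv_percent": 100 * sd / m})
    return stats


# Invented CSPQV for three samples at four conserved positions.
samples = [
    [40, 42, 38, 41],
    [42, 42, 39, 40],
    [44, 42, 37, 42],
]
for index, row in enumerate(position_statistics(samples), start=1):
    print(index, round(row["mean"], 2), round(row["sd"], 2), round(row["cv_percent"], 2))
```

A low SD at a position, as in the second column of the toy data, is what supports using a single mean over all positions as a quality indicator.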
While CSPQV are highly reproducible between samples, the
CSPQV of homozygous and heterozygous base calls are different.
This is demonstrated for two polymorphic positions (positions 165
and 170) within exon 2 of HLA-A in Fig. 2. Figure 2A shows the
frequency distribution of CSPQV-hom and CSPQV-het for position
165 of HLA-A. HLA-A alleles can be either A or G at this position.
The grey bars represent the frequency distribution of CSPQV-het
base calls (where both A and G are sequenced) and the black bars
represent the frequency distribution of homozygous base calls (this
includes both A and G base calls). Similarly, Fig. 2B is a frequency
histogram of the CSPQV at position 170. HLA-A alleles are also
either A or G at this position. For both positions, the distribution of
the CSPQV for heterozygous and homozygous positions is normally
distributed, but the CSPQV-het values are lower than CSPQV-hom
values. At position 165, the mean CSPQV-het is 27.10 and SD is 1.14
and the mean CSPQV-hom is 40.84 and SD is 1.75. For position 170,
the mean CSPQV-het is 25.48 and SD is 1.14 and the mean CSPQV-hom is
40.86 and SD is 1.53.
[Figure 1. Plot of mean CSPQV (left axis, 0–50) and SD of CSPQV (right axis, 0–20) against conserved sequence nucleotide positions within exon 2 of HLA-A.]
Fig. 1. The mean and standard deviation of consensus sequence PQV (CSPQV) at 100
conserved (therefore, homozygous) positions of exon 2 of HLA-A are shown from 20
consecutive unrelated samples. The mean CSPQV (the plot in the top half of the graph)
varies between positions within the same sequence, but the CSPQV at one position is
reproducible between samples as indicated by the low standard deviations. This
indicates that a mean value of all CSPQV-hom for a BSU should provide an indication
of sequence quality of the BSU. BSU, bi-directionally sequenced units; CSPQV-hom,
CSPQV of automated homozygous base calls; PQV, Phred quality values.

As a result of the findings described above, we suggest that:
1. The mean and/or SD values of CSPQV-hom of a BSU (i.e., the
various exons for HLA class-I) will provide good indicators of
sequence quality of the BSU. Some samples may not have
heterozygous positions and so CSPQV-het should not
be used as an indicator of sequence quality of a BSU.
2. The mean and/or SD values of all CSPQV-hom for all samples on a
sequencing run will provide good indicators of sequence quality of
the sequencing run.
3. Sequence quality ‘target’ (or ‘expected’) values can be calculated
from multiple data points and the mean and SD values of CSPQV
for individual BSU and sequencing runs can be compared to
expected values according to Shewhart rules for analysing
controls (6).
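The third suggestion can be illustrated with a minimal sketch of a control-limit check. It derives a target and mean ± 2 SD limits from historical values and flags a new value outside those limits. This is a single simple rule with invented data, not a full set of Shewhart rules and not the Assign 2.0 implementation.

```python
from statistics import mean, stdev


def control_limits(historical_values, k=2.0):
    """Return (lower, target, upper) as the mean and mean -/+ k SD of past values."""
    target = mean(historical_values)
    sd = stdev(historical_values)
    return target - k * sd, target, target + k * sd


def out_of_control(value, limits):
    """True when a new value lies outside the lower or upper control limit."""
    lower, _, upper = limits
    return value < lower or value > upper


# Invented historical mean CSPQV-hom values for one BSU.
history = [40, 41, 39, 40, 40]
limits = control_limits(history)
print("limits:", tuple(round(x, 2) for x in limits))
for new_mean in (40.5, 35.01):
    print(new_mean, "out of control" if out_of_control(new_mean, limits) else "in control")
```

The same check can be applied to SD values, where only the upper limit is normally of interest.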
In order to test these hypotheses, we performed a retrospective
analysis of SBT data for HLA obtained between 12 February 2003
and 7 July 2003.
Within-run QC analysis
The graphs shown in Figs. 3 and 5 are examples of CSPQV analysis
that can be performed by the Assign 2.0 QC tools in just a few
seconds. Analyses of CSPQV-hom data for exons 2 and 3, respec-
tively, for each of 24 samples of the HLA-A SBT run 10–05–03 have
been presented in Fig. 3(A, B). In both graphs, the mean and SD data
are mirror images such that a sample with a high mean CSPQV
usually has a low SD. Grey bars with a horizontal line through the
middle have been used in order to indicate the mean ± 2 SD of
CSPQV data calculated from all runs between 12 February 2003 and
7 July 2003.
[Figure 2. Frequency histograms of CSPQV; x-axis: PQV scores (1–49), y-axis: Frequency (%) (0–40). Panels per the caption: (A) HLA-A exon 2 position 165, heterozygous sequence mean = 27.10, SD = 0.90, homozygous sequence mean = 40.90, SD = 1.75; (B) HLA-A exon 2 position 170, heterozygous sequence mean = 25.48, SD = 1.14, homozygous sequence mean = 40.86, SD = 1.53.]
Fig. 2. The frequency histograms of consensus sequence PQV (CSPQV) at
homozygous (black bars) and heterozygous (grey bars) base calls have been shown for
two polymorphic positions (positions 165, Fig. 2A and 170, Fig. 2B) within exon 2 of
HLA-A for all samples (n = 1086 samples) sequenced between 12 February 2003 and 7
July 2003. The distribution of the CSPQV is bimodal with CSPQV of heterozygous base
calls being less than the CSPQV of homozygous base calls. These results indicate that
homozygous and heterozygous CSPQV should be considered independently if CSPQV
is used as a measure of sequence quality for a sample or a sequencing run as the
number of heterozygous positions varies between samples.

The exon 2 graph (Fig. 3A) reveals considerable variability
between samples, compared to the graph for exon 3 (Fig. 3B). This
indicates variability in sequence quality between the exon 2
sequences of the samples and consistent high-quality sequence for
exon 3 for all samples. Analysis of the sequence EPG for the forward
and reverse sequencing primers for exon 2 revealed that the
sequences from the forward sequencing primer contained high back-
ground for some samples, whereas the reverse sequencing primers
resulted in consistent good quality sequence (data not shown). The
CSPQV is deduced from PQV from both strands and poor quality
sequence on one strand is sufficient to reduce the CSPQV. The EPG
from the forward sequencing primer for some of the samples with
and without background have been shown in Fig. 4. A comparison of
the EPG and CSPQV-hom for these samples reveals that when the
background is high, i.e., the quality of sequence is poor (e.g., samples
13 and 19), the mean CSPQV-hom is low (35.01 and 33.58, respec-
tively) and SD is high (7.61 and 8.42, respectively). In samples where
there is no background, i.e., good quality sequence (e.g., samples 02,
21 and 06), the mean CSPQV is high (41.41, 41.30 and 41.03, respec-
tively) and the SD is low (2.2, 2.1 and 1.8, respectively).
These data demonstrate that mean and SD of CSPQV-hom are
sensitive and quantitative measurements of sequence quality.
With the exception of sample 3, the QC data for exon 3 indicate that
all sequence is of similar quality. Furthermore, all CSPQV-hom means
are greater than the expected mean CSPQV (horizontal line through the
middle of the grey bar) and all but one of the sample SDs are below the
expected SD. This indicates that the quality of sequence obtained for
exon 3 for all samples of this run is of greater quality than is expected.
For sample 3, only two of the 276 bases of exon 3 were included by the
SeqScape algorithm for analysis for one of the sequencing primers. As
a result, much of the sequence is single-stranded. The high PQV is an
anomaly of the SeqScape/Phred algorithm where the CSPQV may be
higher for single-strand sequence than for those with bi-directional
coverage. As a result, a SD was not calculated for this sample.
[Figure 3. Two panels, (A) Exon 2 (Run 05_10_03. Position: exon 2) and (B) Exon 3 (Run 05_10_03. Position: exon 3). In each panel the PQV-hom mean (left axis, 0–48) and PQV-hom SD (right axis, 0–20) are plotted for samples 01–24. Run annotations in the order extracted: Mean (this run) = 39.82, SD (this run) = 1.96; Mean (this run) = 4.00, SD (this run) = 1.99; Mean (this run) = 40.6, SD (this run) = 1.29; Mean (this run) = 3.41, SD (this run) = 0.81.]
Fig. 3. The mean and standard deviation (SD) of consensus sequence PQV (CSPQV) for
homozygous base calls within exon 2 (Fig. 3A) and exon 3 (Fig. 3B) have been shown
for each of 24 samples within an HLA-A SBT run (run ID = 05–10–03). The mean values
of each sample are plotted on the top part of each graph and are associated
with the Y-axis on the left hand side of the graph and the
SD values are plotted on the lower half of each graph
and the values are on the Y-axis on the right hand side of
the graph. The grey bars represent the mean ± 2 SD
limits of the mean and SD values of all samples for all
runs (n = 76 runs) between 12 February 2003 and 7 July
2003. The mean and SD plots are mirror images, such
that when the mean is high, the SD is low and vice versa.
The plots demonstrate why individual BSU, in this case
each exon, are analysed separately. The exon 2 data are
variable, indicating sequence of variable quality, with
the mean and SD CSPQV for two samples (e.g., samples
13 and 19) outside the expected limits. By contrast, the
exon 3 data are much more consistent with all values
being on or greater than the expected mean of the mean
CSPQV and all but one sample being below the mean of
the expected SD CSPQV. These data indicate a potential
problem of varying degree affecting exon 2 sequences
only. SBT, sequencing-based typing.

Between-run QC analysis
In contrast to Fig. 3, where sample-to-sample QC analysis within a
sequencing run is demonstrated, Fig. 5 demonstrates run-to-run
(between-run) QC analysis. Between-run analysis is performed by
plotting the mean and SD of CSPQV calculated from all positions
for all samples on a sequencing run. This has been demonstrated in
Fig. 5, where the CSPQV data for exons 2 and 3 are plotted for each
run between 12 February 2003 and 7 July 2003 (76 runs, 1086
samples). The grey bars represent the mean ± 2 SD of data from
all runs. The data from the sequencing run of 10–05–03 (as
demonstrated in Fig. 3) are indicated by the arrows and do not appear to
be significantly different from data from other runs. However, the
exon 2 mean and SD data from the 19 runs after the run of 10–05–03
indicate that there has been a change in sequence quality. For nine of
the last 19 runs, the mean CSPQV-hom is below the expected mean
and four of the nine are on the lower 2 SD limit. By contrast, only
one result of the previous 57 runs has been on the lower 2 SD limit.
Similarly for the SD data for CSPQV-hom, 14 of the last 19 runs have
SD greater than the expected SD value. This indicates a change of
sequence quality as a result of the variable sequence obtained with
the exon 2 forward sequencing primer shown in Fig. 4. It is of interest
to note that similar changes in sequence quality are not indicated by
the CSPQV-het data. It is not clear why this is the case, but it may be
because of the smaller number of heterozygous sequence positions,
some of which may be at positions where the background does not exist.
It is of interest to note that, although unlikely to be statistically
significant, the mean CSPQV-hom for exon 2 for all runs is higher
than the mean CSPQV-hom for exon 3 for all runs (exon 2 = 40.06,
exon 3 = 38.93). In addition, the SD is lower (mean SD for exon 2 is
3.99 and for exon 3 the mean SD is 5.25). This indicates that the
sequence quality for exon 2 is consistently better than the sequence
quality for exon 3. It is possible that this difference is because of the
inherent sequence differences between exon 2 and exon 3. However,
this difference may suggest that the conditions are not optimal for
exon 3. Table 1 lists the mean and SD of CSPQV-hom for the BSU
(i.e., exons 2 and 3) of HLA-A, HLA-B (both the HLA-BTA and HLA-
BCG HLA-B protocols) and HLA-C sequenced during the same
period. The exon 2 BSU sequence of HLA-A has the highest mean
PQV-hom and lowest SD, compared to all the other BSU for the other
loci. This indicates that the sequence quality obtained for the HLA-A
exon 2 BSU is better than the quality of sequence for the exon 3 BSU
of HLA-A and better than the sequence for all other BSU for the other
loci. The challenge now is to understand why this is the case and
optimize the sequencing conditions for the other loci to improve the
sequence quality at least to the level of the HLA-A exon 2 BSU.
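The between-run comparison described in this section, in which the runs below the expected mean and on the lower limit are counted for two periods, can be sketched as a simple tally. The run means, the expected mean and the lower limit below are invented for illustration and are not the study data.

```python
def tally_runs(run_means, expected_mean, lower_limit):
    """Count runs below the expected mean and runs on or below the lower limit."""
    return {
        "runs": len(run_means),
        "below_mean": sum(1 for m in run_means if m < expected_mean),
        "on_or_below_lower_limit": sum(1 for m in run_means if m <= lower_limit),
    }


# Invented run means for an earlier and a later period.
earlier = [40.5, 41.0, 39.8, 40.2]
recent = [39.5, 38.0, 40.1, 37.9]
expected_mean, lower_limit = 40.0, 38.0  # hypothetical target and lower 2 SD limit

print("earlier:", tally_runs(earlier, expected_mean, lower_limit))
print("recent: ", tally_runs(recent, expected_mean, lower_limit))
```

A rising proportion of runs below the expected mean, even when few runs cross the lower limit, is the kind of shift the run-to-run plots are intended to reveal.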
[Figure 4. Electropherograms (forward sequencing primer, exon 2) for selected samples from run 10–05–03, annotated with the following CSPQV-hom values.]

Sample       Mean PQV   SD
Sample 02    41.41      2.2
Sample 21    41.30      2.1
Sample 06    41.03      1.8
Sample 01    39.56      5.3
Sample 04    38.56      6.0
Sample 13    35.01      7.6
Sample 19    33.58      8.4

Fig. 4. The electropherogram (EPG) from a region of exon 2 for selected samples from
run 10–05–03 has been shown. The figure also includes the mean and SD CSPQV-hom
for the samples of the EPG. When the sequence quality is
good (no background noise), the CSPQV means are
high and SDs are low. As the background noise
increases, the mean CSPQV-hom decreases and the SD
increases. CSPQV is an indicator of sequence quality.
The background noise appears as non-specific peaks
usually smaller than the specific sequence peak.
CSPQV, consensus sequence PQV; CSPQV-hom,
CSPQV of automated homozygous base calls; PQV,
Phred quality values.

Allele assignment
An example of an HLA allele assignment result page has been shown
in Fig. 6. A unique feature of Assign 2.0 is that the result page
contains important QC information in addition to the HLA allele
assignment. The allele assignment is displayed as a list of allele
combinations within the library that are best matched with test
sequence. Mismatched positions include the sequence base call of
the test sample at this position and the expected base call for the
allele combination. Additional information, including the CSPQV of
the test sequence at the mismatched positions and whether there was
a discrepancy between forward and reverse strand base calls (FRD)
or whether the mismatched position was sequenced in a single
direction only (SS), is also shown. Base calls that have arisen from
sequencing one strand only are also indicated in the result table by
‘SS’ in the ‘Quality Values’ row (not present in the example in Fig. 6).
The QC information of the sample includes the number of bases
sequenced (e.g., n = 546 of the 546 bases which constitute exon
2 + exon 3 for HLA-A), the homozygous and heterozygous base call
CSPQV (CSPQV-hom and CSPQV-het) statistics (mean CSPQV-hom = 39.9
and SD = 4.3, mean CSPQV-het = 25.8 and SD = 2.1)
and the SS (0% for homozygous base calls, 0% for heterozygous
base calls) and FRD data (2% of homozygous and 0% of heterozygous
consensus base calls had FRD).
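The SS and FRD percentages can be illustrated with a minimal sketch. It assumes that the forward and reverse base calls are already aligned to the same coordinates and that a position not covered by a strand is marked with '-'. Unlike the result page described above, it reports one overall figure rather than separate figures for homozygous and heterozygous calls. The input strings are invented.

```python
def strand_qc(forward_calls, reverse_calls, missing="-"):
    """Percentage of single-strand (SS) positions and forward/reverse
    discrepancies (FRD) among the positions covered by at least one strand."""
    if len(forward_calls) != len(reverse_calls):
        raise ValueError("strands must be aligned to the same length")
    covered = ss = frd = 0
    for f, r in zip(forward_calls, reverse_calls):
        if f == missing and r == missing:
            continue  # position not sequenced on either strand
        covered += 1
        if f == missing or r == missing:
            ss += 1   # sequenced in one direction only
        elif f != r:
            frd += 1  # both strands sequenced, calls disagree
    if covered == 0:
        raise ValueError("no sequenced positions")
    return {"bases": covered,
            "ss_percent": 100 * ss / covered,
            "frd_percent": 100 * frd / covered}


print(strand_qc("ACGT-ACGTA", "ACTTAACGTA"))
```

In the toy input, position 3 is a discrepancy between strands and position 5 is covered by the reverse strand only.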
In the example shown in Fig. 6, there are two mismatches between
the test sequence and the best-matched alleles. Both mismatches
(position 282 and 448) are at positions, where there was an FRD.
An FRD indicates a base call error when sequencing in one direction
and high potential for an incorrect consensus base call. Such a
position is a priority for manual review. In addition, the base calls
at these positions are mismatched against all of the alleles in the
result table, indicating that the test sequence contains unique
polymorphisms or they are incorrect base calls. By contrast, the base call
at position 258 is ‘C’ and the CSPQV at this position is 42. This
indicates that ‘C’ has been called on both strands and a CSPQV of 42
indicates sequence of high quality and very low probability of an
incorrect base call.

[Figure 5. Four panels plotted against sequence run: Homozygous base calls by SBT run, HLA-A exon 2; Homozygous base calls by SBT run, HLA-A exon 3; Heterozygous base calls by SBT run, HLA-A exon 2; Heterozygous base calls by SBT run, HLA-A exon 3. Each panel shows the CSPQV mean (left axis) and the CSPQV SD (right axis, 0–20). Annotated all-run values for homozygous base calls: exon 2 mean = 40.06 and mean SD = 3.99; exon 3 mean = 38.93 and mean SD = 5.25.]
Fig. 5. Between-run monitoring of sequence quality has been shown. The mean and SD CSPQV-hom and CSPQV-het for all samples of each
run (n = 76 runs) for the period from 12 February 2003 to 7 July 2003 have been plotted for exons 2 and 3. The grey bars represent the
mean ± 2 SD limits for all values on each graph. As for Fig. 3(A,B), the mean values have been shown in the top half of the graph and the SD values have been
shown in the bottom half of each graph. The arrows show the values for the run 5_10_3 (from Fig. 3A,B). Despite the poor quality sequence for the forward
sequencing primer in exon 2 for some samples in runs that follow 5_10_03, the run mean does not fall out of the 2 SD limits (see the top left hand graph). However,
it is of interest to note that of the 19 runs following the run of 05–10–03, nine of the runs have a mean value below the expected mean and four of the nine runs
have values on the lower limit. By contrast, only one run in the previous 57 runs has been on the lower limit. This indicates a shift (decrease) in the mean CSPQV
for this assay, as a result of the suboptimal sequence obtained from the forward sequencing primer. The situation is similar for the SD values. Fourteen of the
last 19 run SD values are greater than the mean SD value for all runs, indicating a shift in the mean SD for this assay. By contrast, the exon 3 data indicate that
the quality of sequence has increased. Sixteen of the last 19 runs are above the expected mean CSPQV and 12 of the last 19 are below the expected SD. This
indicates an overall improvement of SBT of exon 3 of HLA-A. However, a specific problem exists with the forward sequencing primer of exon 2. The changes in
sequence quality demonstrated in the CSPQV-hom data are not reflected in the CSPQV-het data. CSPQV, consensus sequence PQV; CSPQV-het, CSPQV of
heterozygous base calls; CSPQV-hom, CSPQV of automated homozygous base calls; PQV, Phred quality values.
Confirmation of base calls at positions within the mismatch table
is performed by viewing the EPG in SeqScape. Any edits to the
sequence are then performed directly in Assign 2.0 and the result
table is updated without the need for re-analysing the sequence
against the allele sequence library (i.e., in real time). Following con-
firmation of all base calls, Assign 2.0 will produce a report listing the
alleles that are best matched to the test sequence. The operator can
then click to the next sample for analysis and the result table is
immediately updated with data from the next sample.
Discussion
We have described a sequence data analysis computer software
program called Assign 2.0 that combines allele assignment with a
comprehensive and effective quality control system. Thousands of
sequences can be analysed in seconds making Assign 2.0 suitable
for high throughput sequencing-based typing or any resequencing
project. We have used the sequence-based typing of the highly
polymorphic HLA-A locus to demonstrate the utility of Assign 2.0.
The unique feature of Assign 2.0 is the ability to analyse PQV in
order to provide a comprehensive QC analysis of SBT data. We have
demonstrated that the mean and SD of all CSPQV-hom within a BSU
are sensitive indicators of sequence quality for that sample. Similarly,
the CSPQV-hom data for all BSU for all samples within a sequencing
run provide QC data for that sequencing run. As a result, sample-
to-sample and run-to-run QC monitoring can be performed.
Furthermore, the normal distribution of mean PQV data indicates
that Shewhart control graphs can be used and changes in sequence
quality can be accurately monitored. These processes add very little
time to the SBT process and yet provide valuable QC data.
A retrospective analysis of all data from February 2003 to July
2003 generated in our laboratory revealed changes in sequence quality
associated with an intermittent increase in background with a single
sequencing primer in our HLA-A SBT assay. This resulted in a greater
than expected number of runs falling below the expected mean
CSPQV-hom. In addition, a comparison of CSPQV-hom data between
our HLA-A, HLA-B and HLA-C SBT assays revealed a difference in
sequence quality between the assays with HLA-A exon 2 providing
the best quality data. We are using Assign 2.0 to re-optimize
the HLA-B, HLA-C and HLA-A exon 3 assays so that
optimal-quality sequence data are obtained.
It is of interest to note that Phred was not designed to provide quality
values for heterozygous sequence (4, 5). However, the data shown in Fig. 2
demonstrate that CSPQV-het are normally distributed but with a much
lower mean than CSPQV-hom. Therefore, in theory, CSPQV-het can also
be used for monitoring sequence quality. In most cases, the mean and SD
values of CSPQV-hom were mirror images, indicating that either of these
values, or the coefficient of variation (CV (%) = SD × 100/mean), can be
used as an indicator of sequence quality. The data presented in this study
did not indicate that analysis of CSPQV-het provided as sensitive an
indicator of quality as CSPQV-hom. This is likely to be because of variable
and low numbers of heterozygous positions, compared to homozygous
positions within a sequence.
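As a worked illustration of the coefficient of variation, the sketch below applies CV (%) = SD × 100/mean to the CSPQV-hom means and SDs reported in Table 1. The function and variable names are hypothetical; only the formula and the input values come from this paper.

```python
def coefficient_of_variation(mean_pqv, sd_pqv):
    """Return CV (%) = SD * 100 / mean."""
    if mean_pqv <= 0:
        raise ValueError("mean must be positive")
    return sd_pqv * 100.0 / mean_pqv

# (mean, SD) of CSPQV-hom, copied from Table 1.
table1 = {
    ("HLA-A", "exon 2"): (40.06, 1.05),
    ("HLA-A", "exon 3"): (38.93, 1.58),
    ("HLA-C", "exon 2"): (39.33, 2.55),
    ("HLA-C", "exon 3"): (38.43, 2.81),
}

cv = {unit: coefficient_of_variation(m, sd) for unit, (m, sd) in table1.items()}

# List the units from lowest to highest CV (lowest relative variation first).
for unit, value in sorted(cv.items(), key=lambda item: item[1]):
    print(unit, f"{value:.2f}%")
```

With these inputs, HLA-A exon 2 has the lowest CV (about 2.6%) and HLA-C exon 3 the highest (about 7.3%), in line with the ranking suggested by the means and SDs themselves.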
The analysis of CSPQV in the ways we have described provides the
ability to assess the effect of reagents and SBT protocols on sequence
data quality. By improving the data obtained from SBT protocols, the
data analysis component of SBT protocols will be significantly reduced
and SBT will become a high-throughput protocol for measuring diversity.
In addition, the Assign 2.0 QC tools can be used for between-laboratory
comparison of data and provide a means of standardizing
SBT assays through workshops and QA exchange programs.
The applications of DNA sequencing are moving from the ‘sequence
factories’, where cloned DNA from a single chromosome is sequenced,
to studies of genetic diversity that include the sequencing of PCR
products of highly polymorphic genes from pairs of chromosomes.
These include research studies of evolution and population migration
(8) and clinical diagnostic applications (9–11). In addition, DNA sequencing
is being used by some laboratories for low to medium throughput
SNP analysis and de novo mutation detection (Ivo Gut, CNG, Paris,
France, personal communication). Appropriate QC is critical. Obtaining,
maintaining and monitoring sequence quality is required for all of
these applications. This manuscript describes a means by which
appropriate sequencing QC can be performed.
Assign v3.0 has been developed and does not require third-party
software, such as SeqScape, thus further improving the efficiency of SBT.
Table 1. Mean and standard deviation (SD) of CSPQV for homozygous base calls
(CSPQV-hom) of exon 2 and exon 3 of various HLA class-I SBT assays

            CSPQV-hom
            Exon 2            Exon 3
Locus       Mean     SD       Mean     SD
HLA-A       40.06    1.05     38.93    1.58
HLA-BCG     38.70    1.95     39.07    2.04
HLA-BTA     39.07    2.04     39.22    1.73
HLA-C       39.33    2.55     38.43    2.81

HLA-A exon 2 results in sequence quality with highest mean CSPQV and lowest SD, which may
reflect that the SBT conditions are better optimized for this BSU than the BSU of other loci. BSU,
bi-directionally sequenced units; CSPQV, consensus sequence PQV; PQV, Phred quality values;
SBT, sequencing-based typing.
Sayer et al : Quality control of SBT
Tissue Antigens 2004: 64: 556–565 563
[Fig. 6 image: allele assignment result page with panels labelled A to I, described in the key below]
A) Browse window for locating the .xml files for analysis.
B) Locus being typed. If the locus is indicated in the sample name, the locus selected in the “Locus” pane is overridden.
C) Maximum tolerance at which results are listed. Assign will list the best-matched alleles within the library with up to 31 mismatches.
D) Sample quality control information for the homozygous and heterozygous base calls: the mean and standard deviation of the Phred quality values,
the amount of sequence that was from a single strand (SS) and the percentage of base calls at which there was a forward/reverse strand base
call discrepancy.
E) ID of the sample for which the report is shown. The number of bases sequenced is also shown.
F) Results pane. It lists the alleles that are best matched with the test sequence, the number of sequence differences between the alleles and the test
sequence, and the base call information at positions that are discrepant between the test sequence and the best-matched alleles. This includes the
observed base calls of the test sample and the Phred quality value, which is colour coded: green for base calls of high quality that do not require review,
yellow for base calls that require review but are probably correct, and red for base calls that definitely require review because the position has
single strand coverage, there is a forward/reverse strand base call discrepancy or the sequence quality is very poor.
G) Editor window, which allows confirmation of the base calls. Once the base calls are confirmed, the final result can be determined and a report is generated.
H) List of samples that have been analysed. Selecting a sample ID results in immediate viewing of the SBT details as described above. Above the
sample IDs is the date of the release of the IMGT/HLA database.
I) Control panel, which includes access to the QC tools.
Fig. 6. A typical allele assignment result page. A detailed description of the result page is given in the key. The result page lists
the alleles that are best matched to the test sequence, ranked in order of best match. Mismatched sequence positions are listed across
the result page in sequence number order. For each position, the page shows the consensus sequence of the test sample, the Phred quality value of the
consensus sequence (CSPQV) base call, whether there was a forward and reverse strand base call discrepancy (FRD), whether the position was sequenced in both
directions (SS if the sequence was from a single strand only) and the corresponding sequence of the alleles within the table. The result page also includes
the total number of bases sequenced, the mean and standard deviation of CSPQV of the homozygous base calls and of the heterozygous base
calls, the number of positions (expressed as a percentage of the homozygous and heterozygous base calls) at which there were forward and reverse strand
base call discrepancies (FRD), and the total amount of SS sequence.
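The colour coding described in item F of the key can be sketched as a simple decision rule. This is a hypothetical illustration only: the function name and both Phred thresholds (`review_below`, `poor_below`) are assumptions, since the cut-off values used by Assign 2.0 are not given here.

```python
def review_colour(pqv, single_strand=False, fr_discrepancy=False,
                  review_below=30, poor_below=15):
    """Classify a consensus base call as 'green', 'yellow' or 'red'.

    review_below and poor_below are assumed thresholds chosen for
    illustration; they are not values taken from Assign 2.0.
    """
    # Red: single strand coverage, a forward/reverse strand discrepancy,
    # or very poor sequence quality. These always require review.
    if single_strand or fr_discrepancy or pqv < poor_below:
        return "red"
    # Yellow: should be reviewed, but the base call is probably correct.
    if pqv < review_below:
        return "yellow"
    # Green: high quality, no review required.
    return "green"

print(review_colour(45))                      # high-quality call
print(review_colour(25))                      # intermediate quality
print(review_colour(45, single_strand=True))  # single strand coverage
```

The order of the checks matters: a base call with a high quality value is still flagged red when it comes from a single strand or when the two strands disagree.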
References
1. Rosenblum BB, Lee LG, Spurgeon SL et al.
New dye-labeled terminators for improved
DNA sequencing patterns. Nucleic Acids Res
1997: 25: 4500–4.
2. Lee LG, Spurgeon SL, Heiner CR et al. New
energy transfer dyes for DNA sequencing.
Nucleic Acids Res 1997: 25: 2816–22.
3. Sayer DC, Whidborne R, De Santis D,
Rozemuller E, Christiansen FT, Tilanus M. A
multi centre evaluation of single-tube
amplification protocols for SBT of HLA-DRB1
and HLA-DRB3, 4, 5 are reproducible and
robust. HLA 2002. 2003. Tissue Antigens
2004: 63(5): 412–23.
4. Ewing B, Green P. Base-calling of automated
sequencer traces using phred. II. Error
probabilities. Genome Res 1998: 8: 186–94.
5. Ewing B, Hillier L, Wendl MC, Green P. Base-
calling of automated sequencer traces using
phred. I. Accuracy assessment. Genome Res
1998: 8: 175–85.
6. Shewhart WA. Economic Control of Quality of
Manufactured Product, 1st edn. New York:
Van Nostrand, 1931.
7. Cereb N, Yang SY. Dimorphic primers derived
from intron 1 for use in the molecular typing
of HLA-B alleles. Tissue Antigens 1997: 50:
74–6.
8. Malhi RS, Mortensen HM, Eshleman JA et al.
Native American mtDNA prehistory in the
American Southwest. Am J Phys Anthropol 2003:
120: 108–24.
9. Sayer DC, Land S, Gizzarelli L et al. A quality
assessment program (QAP) for genotypic
antiretroviral testing (GART) results in an
improvement in the detection of drug
resistance mutations. J Clin Microbiol 2003:
41: 227–36.
10. Sayer D, Whidborne R, Brestovac B, Trimboli F,
Witt C, Christiansen F. HLA-DRB1 DNA
sequencing based typing: an approach
suitable for high throughput typing including
unrelated bone marrow registry donors.
Tissue Antigens 2001: 57: 46–54.
11. Pryce TM, Palladino S, Kay D, Coombs GW.
Rapid identification of fungi by sequencing
the ITS1 and ITS2 regions using an
automated capillary electrophoresis system.
Med Mycol 2003: 41: 369–81.