Target-Decoy with Mass Binning: A Simple and Effective Validation
Method for Shotgun Proteomics Using High Resolution Mass
Spectrometry
Jong Wha J. Joo,†
Seungjin Na,‡
Je-Hyun Baek,†
Cheolju Lee,†
and Eunok Paek*,‡
Korea Institute of Science and Technology, Seoul, Republic of Korea, and Department of Mechanical and
Information Engineering, University of Seoul, Seoul, Republic of Korea
Received July 20, 2009
Abstract: Shotgun proteomics using mass spectrometry (MS) has become the method of choice for large-scale peptide and protein identification. The recent development of high-resolution mass spectrometers such as FT-ICR or Orbitrap makes it possible to identify peptides within only a few parts per million (ppm), and this is expected to dramatically improve the performance of peptide identification compared with low-resolution instruments. To fully exploit this significantly higher mass accuracy, however, appropriate data analysis methods are required. Here, we present a new target-decoy strategy, called Target-Decoy with Mass Binning, that utilizes high mass accuracy for the validation of peptide identifications, which remains a challenging problem in MS-based proteomics. When tested on various high-resolution MS data, our method was simple yet very effective and showed comparable or better performance than other validation methods.
Keywords: target-decoy • peptide identification • validation method • precursor mass error • mass binning • high-resolution mass spectrometry
1. Introduction
In recent years, along with technological developments in tandem mass spectrometry (MS/MS),1,2 many powerful database search algorithms have been developed and have made great advances in peptide/protein identification possible. Database search engines such as SEQUEST,3 Mascot,4 Phenyx,5 and X!Tandem6 select the best-scoring peptide-spectrum match (PSM) for every input MS/MS spectrum by comparing it with theoretical MS/MS spectra derived from a protein database. However, they do not guarantee that the top PSMs are correct because of deficiencies in their scoring models, and search results include many incorrect PSMs (in fact, the majority of searched MS/MS spectra). Thus, the assessment of database search results is very important and is still open to controversy.
To assess the reliability of database search results, various validation methods are currently in use. They fall mainly into two classes: machine learning based methods7,8 and target-decoy search methods.9 A machine learning based method combines multiple features reported by the adopted search software (for example, XCorr, deltaCn, SpRank, and so forth in the case of SEQUEST) into a single match quality score. With the combined score, it models the distribution of PSMs and classifies them as correct or incorrect. However, a scoring model based on training data may not work well on data from a different experimental environment, which would require new training. On the other hand, a target-decoy search method uses a decoy database, which is a reversed, randomized, or shuffled version of the standard protein sequence (target) database. This approach assumes that the distribution of incorrect PSMs from the target database is similar to that of PSMs from the decoy database. Under this assumption, a score cutoff corresponding to the desired false discovery rate (FDR) is calculated for each data set. A target-decoy method is favored in practice because of its simplicity and its robustness to the effects of database size, sample quality, experimental environment, and instrument type.9–11 Consequently, the advantages of the target-decoy search have been incorporated into machine learning based methods. Recent implementations of machine learning based methods12–15 estimate certain parameters of their models from a target-decoy search, or construct an initial classifier (dynamic training set) for further analyses, and incrementally improve their performance.
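The reversed-sequence variant of the decoy database mentioned above can be illustrated with a minimal sketch. The function name `reversed_decoy`, the `REV_` prefix, and the toy sequences are assumptions made for illustration only; they are not taken from the software used in this work.

```python
def reversed_decoy(proteins, prefix="REV_"):
    """Return a concatenated target + decoy database.

    proteins: dict mapping protein name -> amino acid sequence.
    Each decoy entry is the reversed target sequence under a prefixed
    name (the prefix is a hypothetical naming convention).
    """
    decoys = {prefix + name: seq[::-1] for name, seq in proteins.items()}
    # Targets first, decoys appended, mirroring a concatenated search.
    return {**proteins, **decoys}


# Toy example with two made-up sequences.
targets = {"P1": "MKWVTFISLLK", "P2": "AGLTR"}
db = reversed_decoy(targets)
```

Because every decoy entry has the same length and composition as its target, hits to the decoy half can serve as a rough model of random hits to the target half.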
Recently, with the advent of high-resolution MS instruments such as Fourier transform ion cyclotron resonance (FT-ICR) and electrostatic FT traps (Orbitrap), mass resolution has improved by 2 to 3 orders of magnitude compared with conventional MS instruments.16 These instruments allow peptides to be identified with a mass accuracy of only a few parts per million (ppm), which is expected to dramatically improve the performance of peptide identification by reducing the search space by orders of magnitude. However, several studies have reported that a very small search space does not necessarily lead to improved performance.17,18 Moreover, in spite of the high resolving power, a fair number of MS/MS spectra are generated with wrong precursor masses due to errors in deconvolution.19,20 To address these issues, new database search options and validation methods for high-resolution MS data have been introduced. Mascot and X!Tandem provide a search option to compensate for a misassignment of monoisotopic
masses. Another method, called PE-MMR,21 has been proposed to correct wrong precursor masses of MS/MS spectra. It generates unique mass classes (UMCs) from an LC/MS experiment and, prior to a database search, filters out or corrects MS/MS spectra by matching their precursor masses to the UMCs. PeptideProphet,22 a representative validation method, improved its performance on high-resolution MS data by using mass binning for accurate modeling of mass accuracy distributions.
* To whom correspondence should be addressed. Eunok Paek, Department of Mechanical and Information Engineering, University of Seoul, Seoul, Republic of Korea. E-mail: paek@uos.ac.kr. Fax: +82-2-2210-5575. Tel: +82-2-2210-2680.
† Korea Institute of Science and Technology.
‡ University of Seoul.
Journal of Proteome Research 2010, 9, 1150–1156. 10.1021/pr9006377. 2010 American Chemical Society.
Published on Web 11/13/2009
Here, we introduce a database search and validation strategy based on a target-decoy search, called Target-Decoy with Mass Binning (TDMB), to obtain as many reliable peptide identifications as possible from high-resolution MS data (to make the distinction clear, we refer to the original target-decoy method as TD). We examined the distribution of mass errors (precursor mass of a spectrum minus identified peptide mass) of search results from high-resolution MS data, and compared the search results under different search conditions (variation in precursor mass tolerance). These analyses led us to use a wide precursor mass tolerance for the search and an accurate binning of mass errors for validation. This strategy makes it possible to find the systematic mass error and the optimal mass error window for the MS instrument used. In our approach, score thresholds at a specified FDR are determined in combination with the mass error window. Because peptide identifications outside the optimal mass error window are almost always incorrect, the approach can achieve high specificity. It successfully dealt with issues related to high mass accuracy and incorrect assignment of monoisotopic masses, and gave more sensitive and accurate identifications than certain existing methods.
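To make the mass error definition concrete (precursor mass of a spectrum minus identified peptide mass), the following minimal sketch converts between dalton and ppm errors. The function names are hypothetical, and the choice of the peptide mass as the ppm reference is an assumption of this sketch.

```python
def mass_error_da(precursor_mass, peptide_mass):
    """Mass error in Da: precursor mass minus identified peptide mass."""
    return precursor_mass - peptide_mass


def mass_error_ppm(precursor_mass, peptide_mass):
    """Mass error in ppm, relative to the identified peptide mass (assumed)."""
    return (precursor_mass - peptide_mass) / peptide_mass * 1e6


def ppm_to_da(ppm, mass):
    """Width in Da of a ppm tolerance at a given mass."""
    return ppm * mass * 1e-6


# A 15 ppm tolerance at 2000 Da is only about 0.03 Da wide,
# which is tiny compared with a 3 Da search tolerance.
window_da = ppm_to_da(15, 2000.0)
```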
2. Materials and Methods
2.1. Sample Preparation. The 48 standard protein mixture
(Sigma U6133, Universal Proteomics Standard Set, UPS1) was
denatured in buffer (6 M urea, 0.05% SDS, 5 mM EDTA, and
50 mM ammonium bicarbonate, pH 8.0), reduced with 3 mM
Tris (2-carboxyethyl) phosphine hydrochloride for 30 min at
37 °C, and alkylated with 5 mM iodoacetamide for 30 min while
shaking at 50 °C. Protein sample was digested with trypsin
(Promega, Madison, WI) at a protein-to-trypsin ratio of 50:1
(w/w) for 16 h at 37 °C. SDS and other reagents were removed
from the digested protein samples using a mixed strong cation
exchange cartridge (OASIS, Waters). Peptides were eluted by
adding 5% ammonia in methanol, dried in a speed vacuum
centrifuge, and dissolved in 0.4% acetic acid prior to analysis.
Yeast cell lysate from wild-type yeast grown in rich medium
was obtained after cell disruption with glass-beads in lysis
buffer (7 M urea, 2 M thiourea, 2% CHAPS, 50 mM ammonium
bicarbonate, pH 8.0). Yeast lysate proteins were reduced with
3.7 mM Tris (2-carboxyethyl) phosphine hydrochloride for 30
min at 37 °C, and alkylated with 5 mM iodoacetamide for 60
min while shaking at 37 °C. After a 10-fold dilution of the
sample with 50 mM ammonium bicarbonate buffer (pH 8.0),
yeast lysate proteins were digested with trypsin (Promega,
Madison, WI) at a protein-to-trypsin ratio of 40:1 (w/w) for 12 h
at 37 °C. Digested sample was loaded and desalted on a C18
cartridge (OASIS, Waters). After washing with 0.1% TFA (trifluoroacetic acid) solution, desalted peptides were obtained by
adding 0.1% TFA/80% acetonitrile solution. Eluted peptides
were dried in a speed vacuum centrifuge, and dissolved in 0.4%
acetic acid prior to analysis.
2.2. Peptide Separation and Mass Spectrometry. A trypsin
digest of yeast lysate (6 µg) was separated by MudPIT, Multi-
dimensional Protein Identification Technology (10 salt frac-
tions). For a tryptic digest of 48 protein mixture, we performed
two MudPIT runs (13 salt fractions) and four 1D-LC runs
(reverse phase liquid chromatography) additionally. Peptide
samples were loaded onto a strong cation exchange resin (Luna,
5 µm, Phenomenex)-packed column (3 cm × 150 µm) and
eluted by adding a salt buffer in range of 0-600 mM am-
monium acetate. For reverse phase liquid chromatography,
peptide samples were loaded onto a C18 (Magic C18aq,
Michrom BioResources, Auburn, CA)-packed trap column and
separated using a capillary C18 column (20 cm × 75 µm)
coupled with a nanospray tip. Peptides were eluted using a 30-
min (48 protein mixture) and 60-min (yeast lysate) linear
gradient of 5-35% solution B in a 60-min and 120-min run
(Solution A, 0.1% formic acid in H2O; Solution B, 0.1% formic
acid in 100% acetonitrile). Elution was performed at a flow rate
of 300 nL/min using the Eksigent MDLC. Separated peptides
were analyzed by an LTQ-Orbitrap hybrid mass spectrometer
(ThermoElectron, San Jose, CA) in a data-dependent acquisition
manner. We used a duty cycle including one survey scan
(resolution: 100 000) in an Orbitrap and six consecutive full-
MS2 scans in an ion-trap (isolation width, 3 m/z; normalized
collision energy, 35%; exclusion duration, 1 min).
2.3. Database Search and Peptide Identification. For 48
protein mixture, from 369 302 scans, we obtained 119 035 dta
files using Bioworks 3.3.1 program (molecular weight range
800-6000), which were searched with a SEQUEST cluster
against the concatenated protein database consisting of 48
standard protein sequences and common contaminants ap-
pended to reversed IPI human sequences version 3.24 (67 100
sequences). 3 Da precursor mass tolerance was used for TDMB
and 15 ppm for TD. 1 Da fragment ion mass tolerance was used
with no enzyme option, variable modification at M (Oxidation,
+15.99492) and fixed modification at C (Carbamidomethyl,
+57.02146) for both TDMB and TD.
For yeast cell lysate, from 80 622 scans, 63 031 dta files
(molecular weight range 600-4200) were acquired from mzxml
using MzXML2Search program of TPP (Trans-Proteomic Pipe-
line, http://tools.proteomecenter.org/TPP.php, v4.0 rev2). Dta
files were searched with a SEQUEST cluster against the
concatenated sequence database containing yeast proteome
sequences (obtained from http://downloads.yeastgenome.org/sequence/genomic_sequence/orf_protein/archive/orf_trans_all.20080606.fasta) and common contaminants appended to
their reversed sequences. 3 Da precursor mass tolerance was
used for TDMB and PeptideProphet, and 15 ppm for TD and
PE-MMR. 1 Da fragment ion mass tolerance was used with no
enzyme option, variable modification at M (Oxidation) and
fixed modification at C (Carbamidomethyl) for all of the
SEQUEST searches. For PeptideProphet, TPP v4.0 rev2 was used
with mass binning and decoy option. PE-MMR v2.3.14 was used with default options (scan count of 2, 10 ppm mass tolerance, and hole of 5 scans were used for creating UMC, and dta tolerance of ±25 ppm and -1, -2, -3 Da mass correction were used for dta filtering). After dta filtration, 62 262 dta files were retained. In addition to the SEQUEST search, Mascot searches (version 2.2.06) were conducted against the same database as the SEQUEST search, using three different precursor mass tolerances: 3 Da, 15 ppm, and 15 ppm allowing up to two misassigned 13C (PEP_ISOTOPE_ERROR option, http://www.matrixscience.com/help/data_file_help.html). For these three Mascot searches, all parameters other than the precursor mass tolerance were identical: 1 Da fragment ion mass tolerance,
trypsin as enzyme, up to two missed cleavages, variable modification at M (Oxidation), fixed modification at C (Carbamidomethyl).
For all the search results, peptide identifications were obtained at FDR 1.5%. FDR was calculated as FDR = 2D/(T + D), where D is the number of decoy hits and T is the number of target hits above a score threshold.
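The FDR formula above, together with the selection of a score cutoff for a single score, can be sketched as follows. This is an illustrative single-score version only; the function names and the tuple layout of `psms` are assumptions and do not describe the actual TDMB software.

```python
def fdr(target_hits, decoy_hits):
    """FDR = 2D / (T + D) for a concatenated target-decoy search."""
    total = target_hits + decoy_hits
    return 2.0 * decoy_hits / total if total else 0.0


def lowest_threshold(psms, max_fdr=0.015):
    """Return the lowest score cutoff whose accepted set meets max_fdr.

    psms: list of (score, is_decoy) tuples (hypothetical layout).
    Returns None if no cutoff satisfies the FDR constraint.
    """
    ordered = sorted(psms, key=lambda p: p[0], reverse=True)
    best = None
    t = d = 0
    for i, (score, is_decoy) in enumerate(ordered):
        if is_decoy:
            d += 1
        else:
            t += 1
        # Evaluate only after all PSMs sharing this score are counted.
        if i + 1 < len(ordered) and ordered[i + 1][0] == score:
            continue
        if fdr(t, d) <= max_fdr:
            best = score
    return best


# Toy data: three target hits and one decoy hit.
toy = [(5.0, False), (4.0, False), (3.0, True), (2.0, False)]
strict = lowest_threshold(toy, max_fdr=0.1)
loose = lowest_threshold(toy, max_fdr=0.5)
```

With the toy data, the strict setting stops above the decoy hit, whereas the loose setting accepts all four PSMs because 2 × 1/(3 + 1) equals the 0.5 limit.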
2.4. Target-Decoy with Mass Binning. High-resolution mass
spectrometers, such as FTMS or Orbitrap, provide very accurate
precursor masses, often with errors of only a few parts per
million (ppm).16
Thus, it would be natural to perform a
database search with a small precursor mass tolerance, about
10-15 ppm, and apply a target-decoy strategy (TD). However,
it is well-known that a large portion of precursor masses are
not from their monoisotopes, mostly due to the limitation in
deconvolution methods.19,20
Here, we propose that such errors can be overcome by performing a database search with a precursor mass tolerance large enough to accommodate the monoisotopic peak, and applying a target-decoy strategy separately for each local region that corresponds to precursor mass errors of 0, ±1.00335, ±2.0067, ±3.01005 Da (for convenience, we will omit fractional parts and write these as 0, ±1, ±2, ±3). This strategy works because a high-resolution mass spectrometer gives accurate masses, with errors of only a few parts per million, unlike a low-resolution mass spectrometer, which gives mass errors of up to a few daltons.
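The assignment of a precursor mass error to one of the local regions 0, ±1, ±2, ±3 can be sketched as below. The spacing of 1.00335 Da is taken from the text; the function names, the clamping to ±3, and the use of the peptide mass as the ppm reference are assumptions of this sketch.

```python
ISOTOPE_SPACING = 1.00335  # Da, spacing between local regions as given in the text


def assign_bin(error_da, max_shift=3):
    """Return (k, residual_da): nearest region index and the leftover error."""
    k = round(error_da / ISOTOPE_SPACING)
    k = max(-max_shift, min(max_shift, k))  # clamp to the searched regions
    return k, error_da - k * ISOTOPE_SPACING


def in_window(error_da, peptide_mass, tol_ppm):
    """True if the residual error within its region is inside tol_ppm."""
    _, residual = assign_bin(error_da)
    return abs(residual) / peptide_mass * 1e6 <= tol_ppm


# A spectrum whose precursor was assigned to the first isotope peak:
k, residual = assign_bin(1.0040)
```

A hit with a mass error of 0.5 Da falls between regions and is rejected by any ppm-scale window, which is what makes the binned mass error a strong filter.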
First, TDMB performs “systematic mass error correction”. The systematic mass error is estimated from the mass error distribution of the target peptide identifications obtained by TD at FDR 1.5%, and the precursor masses are then corrected accordingly.
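A minimal sketch of such a correction is given below. The text does not state which estimator is applied to the mass error distribution, so the use of the median here is an assumption, as are the function names and the toy error values.

```python
from statistics import median


def estimate_systematic_error_ppm(errors_ppm):
    """Estimate the systematic offset from confident target hits.

    The median is an assumed estimator chosen for robustness to
    outliers; the original method may use a different statistic.
    """
    return median(errors_ppm)


def correct_precursor_mass(precursor_mass, systematic_ppm):
    """Remove a multiplicative ppm offset from a precursor mass."""
    return precursor_mass / (1.0 + systematic_ppm * 1e-6)


# Toy ppm errors of confident target hits, with one outlier.
offset = estimate_systematic_error_ppm([3.5, 4.0, 4.5, 12.0, 3.9])
corrected = correct_precursor_mass(2000.008, offset)
```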
After correcting the systematic mass error, for each local region that corresponds to precursor mass errors of 0, ±1, ±2, ±3 Da, TDMB exhaustively searches for the values of precursor mass tolerance, XCorr, and deltaCn, to be used as thresholds, that give the largest number of target hits at a given FDR. Unlike other methods that use learning algorithms and complex statistics, we did not use a combined score for thresholding but exhaustively searched all possible values in a user-defined range and increment for each of the three parameters, to get the best combination of three values. For SEQUEST results, the exhaustive search for threshold values was conducted between 5 and 100 ppm for precursor mass tolerance (with 1 ppm increment), from 0 to 4 for XCorr (with 0.01 increment), and from 0 to 0.18 for deltaCn (with 0.01 increment). These are the default values; the user can set the range and the increment in the program. Charge states from 2 to 5 are considered, and TDMB sets thresholds separately for each charge state and number of tryptic termini (NTT). As a result, for each charge state and NTT, we consider 96 × 401 × 19 (= 731 424) possible cutoff combinations of the three parameters, and among them, we choose the one that gives the largest number of true positives. Even though we exhaustively search 731 424 cases, the computational overhead is very small, and the search takes only a few seconds for large-scale data (over 100 000 dta files). For Mascot results, the same strategy was used as for the SEQUEST search, except that the Mascot Homology Threshold (MHT),18 which is the ion score minus the homology score, is used for thresholding. To ensure the statistical significance of the target-decoy results, the resulting identifications are removed when the numbers of target and decoy hits are too small for a given local region. For all the TDMB results, we used a minimum count of 100 identifications so that a reasonable number of target and decoy hits is used when thresholding for a given FDR. TDMB software can be obtained upon request to the corresponding author.
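The exhaustive threshold search for one local region, charge state, and NTT can be sketched as follows. This is a deliberately naive illustration and not the TDMB implementation: the dictionary keys, function names, and tie-breaking rule are assumptions, and a production version would need a faster counting scheme to reach the reported run times.

```python
from itertools import product


def frange(start, stop, step):
    """Inclusive list of evenly spaced values, rounded to avoid float drift."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 10) for i in range(n)]


# Default grids described in the text.
PPM_GRID = frange(5, 100, 1)      # 96 values
XCORR_GRID = frange(0, 4, 0.01)   # 401 values
DCN_GRID = frange(0, 0.18, 0.01)  # 19 values
N_COMBINATIONS = len(PPM_GRID) * len(XCORR_GRID) * len(DCN_GRID)


def grid_search(psms, ppm_grid, xcorr_grid, dcn_grid,
                max_fdr=0.015, min_count=100):
    """Return (targets, ppm, xcorr, deltacn) maximizing target hits, or None.

    psms: list of dicts with hypothetical keys 'abs_ppm', 'xcorr',
    'deltacn', 'is_decoy'. Ties keep the first combination found.
    """
    best = None
    for ppm, xc, dcn in product(ppm_grid, xcorr_grid, dcn_grid):
        t = d = 0
        for p in psms:
            if p["abs_ppm"] <= ppm and p["xcorr"] >= xc and p["deltacn"] >= dcn:
                if p["is_decoy"]:
                    d += 1
                else:
                    t += 1
        if t + d < min_count:
            continue  # too few hits for a reliable FDR estimate
        if 2.0 * d / (t + d) <= max_fdr and (best is None or t > best[0]):
            best = (t, ppm, xc, dcn)
    return best


# Toy data and a coarse grid, with min_count lowered for illustration.
toy = [
    {"abs_ppm": 2, "xcorr": 3.0, "deltacn": 0.20, "is_decoy": False},
    {"abs_ppm": 4, "xcorr": 2.5, "deltacn": 0.15, "is_decoy": False},
    {"abs_ppm": 50, "xcorr": 2.8, "deltacn": 0.10, "is_decoy": True},
    {"abs_ppm": 8, "xcorr": 1.0, "deltacn": 0.05, "is_decoy": True},
]
result = grid_search(toy, [5, 10, 100], [0.0, 2.0], [0.0, 0.1],
                     max_fdr=0.0, min_count=1)
```

In the toy case the tight 5 ppm window alone already excludes both decoy hits, so no score threshold is needed to keep the two target hits.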
3. Results and Discussion
3.1. The 48 Standard Protein Mixture. The 48 standard
protein mixture data set was searched with SEQUEST against
the target database appended to a decoy database composed
as described in the previous section. As the decoy database is
significantly larger than the target, all the identifications of
target peptides were regarded as correct.
Figure 1 shows precursor mass error distributions of hits to the target protein sequences (target hits) and hits to the decoy (reverse) protein sequences (decoy hits) from a SEQUEST search with 3 Da precursor mass tolerance. It can be seen that there are a fair number of spectra with incorrect precursor masses, that is, isotope masses other than monoisotope masses. Target hits around the precursor mass error of 0 Da indicate MS/MS spectra with the correct monoisotope masses as their precursor masses, while target hits around the precursor mass errors ±1, ±2, ±3 Da indicate MS/MS spectra with wrong precursor masses due to incorrect assignment of monoisotopic masses. Interestingly, the mass error distribution is concentrated almost entirely within a very small range (several ppm) around the precursor mass errors 0, ±1, ±2, ±3 Da. This means that the mass error can be a powerful measure of confidence when high mass accuracy data are searched with a wide precursor mass tolerance. Another observation is the systematic mass error. The mass error distribution of target hits around mass error 0 is shown in gray in the left inset of Figure 1, and the distribution is populated mostly within 4 ± 10 ppm. The 4 ppm offset is regarded as a systematic mass error, which can be caused by experimental variations. The systematic mass error can be corrected, as shown in the left inset of Figure 1, to give the distribution shown in black.
Figure 2 shows the XCorr distribution of the 48 protein mixture. Decoy hits were spread out over a wide range of precursor mass errors, while target hits around the precursor mass errors of 0, ±1, ±2, ±3 Da yielded high XCorr scores. On the basis of these observations, TDMB uses three parameters, precursor mass tolerance, deltaCn, and XCorr, for thresholding so that the largest number of target hits can be obtained at a given FDR. For a given FDR, TDMB finds an optimal precursor mass tolerance, XCorr, and deltaCn in each local region by exhaustively searching through all possible threshold value combinations.
We compared the number of identifications from TDMB and TD. Using high-resolution mass spectrometry (Orbitrap), it may seem natural to perform a database search with a small precursor mass tolerance (e.g., 15 ppm). When such a target-decoy scheme was applied, TD identified 19 886 true positives and 143 false positives at 1.5% FDR. At the same FDR, TDMB identified 25 485 true positives and 179 false positives. TDMB identified more unique peptides at the peptide level as well: 1764 peptides were identified by TDMB, while TD identified 1387 peptides at 1.5% FDR. In total, TDMB yielded many more identifications because it identified additional MS/MS spectra, those with isotope masses other than monoisotopes as their precursor masses, which TD missed. In addition, depending on the sample and experimental environment, TDMB corrects the systematic mass error and sets the precursor mass tolerance flexibly from 5 to 100 ppm to accommodate the largest number of identifications, while TD can admit only those hits within a fixed error window (15 ppm, for example).
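As a quick arithmetic check on the counts reported above (a derived illustration, not a figure stated in the original text), the relative gains of TDMB over TD at 1.5% FDR work out to roughly 28% at the spectrum level and 27% at the peptide level:

```latex
\[
\frac{25\,485 - 19\,886}{19\,886} = \frac{5\,599}{19\,886} \approx 0.282,
\qquad
\frac{1\,764 - 1\,387}{1\,387} = \frac{377}{1\,387} \approx 0.272 .
\]
```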
3.2. TDMB Performance and Comparison. We tested TDMB
using a complex yeast mixture and compared its performance
with other thresholding methods. Two search engines,
SEQUEST and Mascot, were used under different conditions
(variation in precursor mass tolerance).
Figure 1. Precursor mass error distribution of 48 standard protein mixture. Hits to the target protein sequences (target hits, blue bars) and hits to the decoy sequences (decoy hits, red bars) are from SEQUEST database search results when 3 Da precursor mass tolerance was used and no thresholding was applied. As the decoy database is significantly larger than the target, all the identifications of target peptides were regarded as correct. The x-axis shows precursor mass error (Da) and y-axis shows corresponding number of spectra. Target hits around the precursor mass error of 0 Da indicate MS/MS spectra with the correct monoisotope masses as their precursor masses, while target hits around the precursor mass errors ±1, ±2, ±3 Da indicate MS/MS spectra with incorrect precursor masses. Unlike target hits, decoy hits show relatively uniform distribution, as they are random hits (the inset on the right shows the number of target and decoy hits around the precursor mass error of 1 Da). The inset on the left shows precursor mass error distribution of hits to the target protein sequences around the precursor mass error 0 Da, before and after applying TDMB’s systematic mass correction option.
Figure 2. SEQUEST’s XCorr score distribution of 48 standard protein mixture. Target hits (blue) and decoy (reverse) hits (red) are from SEQUEST database search results where 3 Da precursor mass tolerance was used and no thresholding was applied. As the decoy database is significantly larger than the target, all the identifications of target peptides were regarded as correct. The x-axis shows precursor mass error (Da) and y-axis shows corresponding XCorr score of each MS/MS spectrum. Decoy hits are spread out over a wide range of precursor mass errors, while target hits are concentrated around the precursor mass errors of 0, ±1, ±2, ±3 Da.
First, SEQUEST searches were conducted with precursor mass tolerances of 3 Da and 15 ppm. For the two search results, the number of correct identifications as a function of the FDR threshold was determined using different thresholding methods: TD for the 3 Da and 15 ppm searches, TD with a mass error filter (15 ppm) for the 3 Da search, and TDMB for the 3 Da search. The results are shown in Figure 3. The performance of TDMB is superior to that of the others, while TD on the 3 Da search showed the worst performance. This shows that correcting the precursor mass error and using the optimal mass error window are very effective
for validation. We also compared TDMB with PeptideProphet22 and PE-MMR,21 because they had been upgraded or designed for high-resolution MS data, and could be used to identify MS/MS spectra with precursor masses other than monoisotopes.
As described in Materials and Methods, for PeptideProphet,
we performed a SEQUEST search with 3 Da precursor mass
tolerance, and for PE-MMR, we performed the search with 15
ppm precursor mass tolerance after its precursor mass correc-
tion procedure so that each method can obtain the most
identifications.
PeptideProphet recently upgraded its performance by introducing mass binning and decoy options. TDMB is a simple algorithm, but its performance is on a par with that of PeptideProphet: TDMB resulted in 33 791 target hits and PeptideProphet gave 33 752 at the same FDR of 1.5%. For PE-MMR, filtered MS/MS spectra were searched using SEQUEST with 15 ppm precursor mass tolerance and TD was applied, which is the process suggested by the authors of PE-MMR. First, PE-MMR performed a precursor mass correction of up to 3 Da and generated additional MS/MS spectra with different precursor masses, and thus could result in 3 times more spectra than the original data set before applying dta filtration. Then dta filtration was applied based on its mass clustering algorithm. Finally, a SEQUEST search with 15 ppm precursor mass tolerance was conducted and TD was applied. TDMB outperformed PE-MMR (33 791 and 32 674 identifications for TDMB and PE-MMR, respectively) at 1.5% FDR.
Figure 3. Number of identifications by TDMB, TD, PeptideProphet, and PE-MMR at different FDR. The x-axis indicates FDR and y-axis indicates the number of target identifications.
Figure 4. Mascot search results. The top left circle shows the number of identifications when 3 Da precursor mass tolerance was used for search and TDMB for thresholding. The bottom circle shows the number of identifications when 15 ppm precursor mass tolerance was used for search and TD for thresholding. The top right circle shows the number of identifications when 15 ppm precursor mass tolerance and “PEP_ISOTOPE_ERROR” option were used for search and TD for thresholding. Venn diagram of identifications from TDMB, TD, and TD with “PEP_ISOTOPE_ERROR” option (a) at a spectrum level, (b) at a peptide level (unique peptides) and (c) at a protein level when 1.5% FDR was applied. The numbers in parentheses represent the number of proteins identified by a single peptide hit.
Second, three Mascot database searches were conducted
with different parameters to compare the performance of
TDMB, TD, and TD using Mascot’s precursor mass correction
option (named “PEP_ISOTOPE_ERROR”). First, we performed
a database search with 3 Da precursor mass tolerance and
TDMB was used for thresholding. Second, 15 ppm precursor
mass tolerance was used for database search and TD was used
for thresholding. Third, 15 ppm precursor mass tolerance was
used, but “PEP_ISOTOPE_ERROR” option was added to the
search, and TD was used for thresholding. Utilizing accurate
mass from high-resolution mass spectrometry, after correction
of the systematic mass error, TDMB found the optimal precur-
sor mass error window, leading to the improvement of the
identification performance. Correction of the systematic mass
error improved the number of identifications from 30 896 to
31 274 at 1.5% FDR. If a small precursor mass tolerance, such
as 10-15 ppm, were used for database search, identifications
that are out of the tolerance range due to the systematic mass
error could never be identified. However, TDMB could accom-
modate such identifications. Figure 4 compares the identifica-
tion of TDMB, TD and TD with “PEP_ISOTOPE_ERROR” option
at the levels of spectrum, peptide, and protein, and TDMB
resulted in the most identifications.
We also tested the case of using TDMB instead of TD in the
third Mascot search, where 15 ppm precursor mass tolerance
and “PEP_ISOTOPE_ERROR” option were used. TDMB found
the optimal precursor mass tolerance of 6 ppm around precur-
sor mass error 0, charge state 3, and NTT 2, and lowered
threshold score to give more identifications. It resulted in as
many identifications as in the case of the first search, where 3
Da precursor mass tolerance was used for database search and
TDMB for thresholding. Even when using “PEP_ISOTOPE_
ERROR” option during Mascot search, TDMB turned out to be
very useful.
4. Conclusion
Validation of peptide identifications plays a key role in the data analysis of proteomics experiments based on mass spectrometry. Validation methods should therefore be developed hand in hand with advances in mass spectrometry and search algorithms.
A large portion of MS/MS spectra are left unidentified for various reasons. Our results show that a fair number of MS/MS spectra have wrong precursor masses because of an incomplete deconvolution method. Given that the deconvolution algorithm is incomplete, correctly identifying MS/MS spectra with isotope masses other than monoisotope masses as their precursor masses is important for increasing the sensitivity of identification. Here we presented a simple variation of the target-decoy method that shows performance comparable to a sophisticated probability-based method by utilizing the accurate mass from a high-resolution mass spectrometer. TDMB is as simple as TD and takes only a few seconds to process more than 100 000 spectra. TDMB can correctly identify MS/MS spectra with wrong isotope masses by performing a database search with a wide precursor mass tolerance, and makes use of the high mass accuracy of high-resolution mass spectrometers by setting thresholds separately within each mass error region. With a wide precursor mass tolerance, TDMB can also help identify spectra shifted by experimental error through its “systematic mass error correction” option. In addition, TDMB is useful not only for database search engines such as SEQUEST, which do not have a precursor mass correction option, but also for search engines such as Mascot, which do.
Acknowledgment. This work was supported by 21C
Frontier Functional Proteomics Project from Korean Ministry
of Education, Science & Technology (FPR-08-A1-020, FPR-08-
A1-030, and FPR-08-A1-090). S. Na was supported by Brain
Korea 21 (BK21) Project.
Note Added after ASAP Publication. This paper was
published on the Web on Nov 13, 2009, with an omission in
the Acknowledgment paragraph. The corrected version was
reposted on Jan 7, 2010.
References
(1) Steen, H.; Mann, M. The ABC’s (and XYZ’s) of peptide sequencing.
Nat. Rev. Mol. Cell Biol. 2004, 5, 699–711.
(2) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics.
Nature 2003, 422, 198–207.
(3) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate
tandem mass spectral data of peptides with amino acid sequences
in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 975–
989.
(4) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S.
Probability-based protein identification by searching sequence
databases using mass spectrometry data. Electrophoresis 1999, 20,
3551–3567.
(5) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. OLAV:
Towards high-throughput tandem mass spectrometry data iden-
tification. Proteomics 2003, 3, 1454–1463.
(6) Craig, R.; Beavis, R. C. A method for reducing the time required
to match protein sequences with tandem mass spectra. Rapid
Commun. Mass Spectrom. 2003, 17, 2310–2316.
(7) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical
statistical model to estimate the accuracy of peptide identifications
made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–
5392.
(8) Higdon, R.; Kolker, N.; Picone, A.; van Belle, G.; Kolker, E. LIP
index for peptide classification using MS/MS and SEQUEST search
via logistic regression. OMICS 2004, 8, 357–369.
(9) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
(10) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation
of proteomic data generated by tandem mass spectrometry. Nat.
Methods 2007, 4, 787–797.
(11) Higdon, R.; Hogan, J. M.; Van Belle, G.; Kolker, E. Randomized sequence
databases for tandem mass spectrometry peptide and protein
identification. OMICS 2005, 9, 364–379.
(12) Zhang, J.; Ma, J.; Dou, L.; Wu, S.; Qian, X.; Xie, H.; Zhu, Y.; He, F.
Bayesian nonparametric model for the validation of peptide
identification in shotgun proteomics. Mol. Cell. Proteomics 2009,
8 (3), 547–557.
(13) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J.
Semi-supervised learning for peptide identification from shotgun
proteomics datasets. Nat. Methods 2007, 4, 923–925.
(14) Ma, Z. Q.; Dasari, S.; Chambers, M. C.; Litton, M. D.; Sobecki, S. M.;
Zimmerman, L. J.; Halvey, P. J.; Schilling, B.; Drake, P. M.; Gibson,
B. W.; Tabb, D. L. IDPicker 2.0: Improved protein assembly with
high discrimination peptide identification filtering. J. Proteome Res.
2009, 8, 3872–3881.
(15) Choi, H.; Nesvizhskii, A. I. Semisupervised model-based validation
of peptide identifications in mass spectrometry-based proteomics.
J. Proteome Res. 2008, 7, 254–265.
(16) Makarov, A.; Denisov, E.; Lange, O.; Horning, S. Dynamic range
of mass accuracy in LTQ Orbitrap hybrid mass spectrometer. J. Am.
Soc. Mass Spectrom. 2006, 17, 977–82.
(17) Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292.
(18) Brosch, M.; Swamy, S.; Hubbard, T.; Choudhary, J. Comparison
of Mascot and X!Tandem performance for low and high accuracy
mass spectrometry and the development of an adjusted Mascot
threshold. Mol. Cell. Proteomics 2008, 7, 962–970.
(19) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Automated reduction
and interpretation of high resolution electrospray mass spectra
of large molecules. J. Am. Soc. Mass Spectrom. 2000, 11, 320–32.
(20) Park, K.; Yoon, J. Y.; Lee, S.; Paek, E.; Park, H.; Jung, H. J.; Lee,
S. W. Isotopic peak intensity ratio based algorithm for determi-
nation of isotopic clusters and monoisotopic masses of polypep-
tides from high-resolution mass spectrometric data. Anal. Chem.
2008, 80, 7294–7303.
(21) Shin, B.; Jung, H. J.; Hyung, S. W.; Kim, H.; Lee, D.; Lee, C.; Yu,
M. H.; Lee, S. W. Postexperiment monoisotopic mass filtering and
refinement (PE-MMR) of tandem mass spectrometric data in-
creases accuracy of peptide identification in LC/MS/MS. Mol. Cell.
Proteomics 2008, 7, 1124–1134.
(22) Choi, H.; Ghosh, D.; Nesvizhskii, A. I. Statistical Validation of Peptide Identifications in Large-Scale Proteomics Using the Target-Decoy Database Search Strategy and Flexible Mixture Modeling. J. Proteome Res. 2008, 7, 286–292.