Target-Decoy with Mass Binning: A Simple and Effective Validation
Method for Shotgun Proteomics Using High Resolution Mass
Spectrometry
Jong Wha J. Joo,† Seungjin Na,‡ Je-Hyun Baek,† Cheolju Lee,† and Eunok Paek*,‡
Korea Institute of Science and Technology, Seoul, Republic of Korea, and Department of Mechanical and
Information Engineering, University of Seoul, Seoul, Republic of Korea
Received July 20, 2009
Abstract: Shotgun proteomics using mass spectrometry (MS) has become the
method of choice for large-scale peptide and protein identification. The
recent development of high-resolution mass spectrometers such as FT-ICR or
Orbitrap makes it possible to identify peptides within only a few parts per
million (ppm), which is expected to dramatically improve the performance of
peptide identification compared to low-resolution instruments. To fully
exploit this significantly higher mass accuracy, however, appropriate data
analysis methods are required. Here, we present a new target-decoy strategy,
called Target-Decoy with Mass Binning, which utilizes high mass accuracy for
the validation of peptide identifications, a problem that remains challenging
in MS-based proteomics. When tested on various high-resolution MS data sets,
our method was simple yet very effective and showed comparable or better
performance than other validation methods.
Keywords: target-decoy • peptide identification • validation method •
precursor mass error • mass binning • high-resolution mass spectrometry
1. Introduction
In recent years, along with technological developments in tandem mass
spectrometry (MS/MS),1,2 many powerful database search algorithms have been
developed and have made great advances in peptide/protein identification
possible. Database search engines such as SEQUEST,3 Mascot,4 Phenyx5 and
X!Tandem6 select the best-scoring peptide-spectrum match (PSM) for every
input MS/MS spectrum based on a comparison with theoretical MS/MS spectra
from a protein database. However, they do not guarantee that the top PSMs
are correct because of deficiencies in their scoring models, and search
results include many incorrect PSMs (in fact, the majority of searched MS/MS
spectra). Thus, the assessment of database search results is very important
and still open to controversy.
To assess the reliability of database search results, various validation
methods are currently in use. There are two main approaches: machine
learning based methods7,8 and the target-decoy search method.9 A machine
learning based method combines multiple features related to the adopted
search software (for example, XCorr, deltaCn, SpRank, and so forth in the
case of SEQUEST) into a single match quality score. With the combined score,
it models the distribution of PSMs and classifies correct and incorrect
PSMs. However, a scoring model based on training data may not work well on
data from a different experimental environment, which would require new
training. On the other hand, a target-decoy search method uses a decoy
database, which is a reversed, randomized or shuffled version of the
standard protein sequence (target) database. This approach assumes that the
distribution of incorrect PSMs from the target database is similar to that
of PSMs from the decoy database. With this assumption, an optimized score
cutoff for the corresponding false discovery rate (FDR) is calculated for
each data set. A target-decoy method is favored in practice because of its
simplicity and robustness to the effects of database size, sample quality,
experimental environment and instrument types.9–11 Consequently, the
advantages of the target-decoy search have been incorporated into machine
learning based methods. Recent implementations of machine learning based
methods12–15 estimate certain parameters of their models from a target-decoy
search or construct an initial classifier (dynamic training set) for
further analyses and incrementally improve their performance.
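The construction of a concatenated target-decoy database described above can be illustrated with a minimal sketch. It assumes reversed sequences as decoys; the function names and the `REV_` prefix are hypothetical and are not taken from any particular search tool.

```python
def make_decoy(sequence):
    """Return the reversed sequence, one common way to build a decoy entry."""
    return sequence[::-1]


def concatenate_target_decoy(targets, prefix="REV_"):
    """Append a reversed copy of every target protein under a prefixed name.

    targets: dict mapping protein name -> amino acid sequence.
    The prefix is an illustrative convention so decoy hits can be counted later.
    """
    combined = dict(targets)
    for name, sequence in targets.items():
        combined[prefix + name] = make_decoy(sequence)
    return combined


# Toy sequences, for illustration only.
targets = {"P1": "MKTAYIAK", "P2": "GAVLIPFW"}
database = concatenate_target_decoy(targets)
```

A PSM whose best match carries the decoy prefix is then counted as a decoy hit when the FDR is estimated.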
Recently, with the advent of high-resolution MS instruments such as Fourier
transform ion cyclotron resonance (FT-ICR) and electrostatic FT traps
(Orbitrap), mass resolution has improved by 2 to 3 orders of magnitude
compared with conventional MS instruments.16 These instruments allow
peptides to be identified within only a few parts per million (ppm) mass
accuracy, which would be expected to dramatically improve the performance of
peptide identification by reducing the search space by orders of magnitude.
However, many studies have reported that a very small search space does not
necessarily lead to improved performance.17,18 In addition, in spite of the
high resolving power, a fair number of MS/MS spectra are generated with
wrong precursor masses due to errors in deconvolution.19,20 To address these
issues, new database search options and validation methods for
high-resolution MS data have been introduced. Mascot and X!Tandem provide a
search option to compensate for misassignment of monoisotopic masses.
Another method, called PE-MMR,21 has been proposed
* To whom correspondence should be addressed. Eunok Paek, Depart-
ment of Mechanical and Information Engineering, University of Seoul, Seoul,
Republic of Korea. E-mail: paek@uos.ac.kr. Fax: +82-2-2210-5575. Tel: +82-
2-2210-2680.
† Korea Institute of Science and Technology.
‡ University of Seoul.
1150 Journal of Proteome Research 2010, 9, 1150–1156 10.1021/pr9006377 © 2010 American Chemical Society
Published on Web 11/13/2009
so that it can correct wrong precursor masses of MS/MS spectra. It generated
unique mass classes (UMC) from an LC/MS experiment and, prior to a database
search, filtered out or corrected MS/MS spectra by matching the precursor
masses of MS/MS spectra to the UMC. PeptideProphet,22 the representative
validation method, improved its performance for high-resolution MS data by
using mass binning for accurate modeling of mass accuracy distributions.
Here, we introduce a database search and validation strategy based on a
target-decoy search, called Target-Decoy with Mass Binning (TDMB), to obtain
as many reliable peptide identifications as possible from high-resolution MS
data (to make the distinction clear, we refer to the original target-decoy
method as TD). We examined the distribution of mass errors (precursor mass
of a spectrum minus identified peptide mass) of search results from
high-resolution MS data, and compared the search results under different
search conditions (variation in precursor mass tolerance). These analyses
led us to set a wide precursor mass tolerance for the search and to use
accurate mass binning of mass errors for validation. This strategy made it
possible to find the systematic mass error and an optimal mass error window
for the MS instrument used. In our approach, score thresholds at a specified
FDR were determined in combination with the mass error window. Because
peptide identifications outside the optimal mass error window are almost
always incorrect, the approach can achieve high specificity. Our approach
successfully dealt with issues related to high mass accuracy and incorrect
assignment of monoisotopic masses, and gave more sensitive and accurate
identifications than certain existing methods.
2. Materials and Methods
2.1. Sample Preparation. The 48 standard protein mixture
(Sigma U6133, Universal Proteomics Standard Set, UPS1) was
denatured in buffer (6 M urea, 0.05% SDS, 5 mM EDTA, and
50 mM ammonium bicarbonate, pH 8.0), reduced with 3 mM
Tris (2-carboxyethyl) phosphine hydrochloride for 30 min at
37 °C, and alkylated with 5 mM iodoacetamide for 30 min while
shaking at 50 °C. Protein sample was digested with trypsin
(Promega, Madison, WI) at a protein-to-trypsin ratio of 50:1
(w/w) for 16 h at 37 °C. SDS and other reagents were removed
from the digested protein samples using a mixed strong cation
exchange cartridge (OASIS, Waters). Peptides were eluted by
adding 5% ammonia in methanol, dried in a speed vacuum
centrifuge, and dissolved in 0.4% acetic acid prior to analysis.
Yeast cell lysate from wild-type yeast grown in rich medium
was obtained after cell disruption with glass-beads in lysis
buffer (7 M urea, 2 M thiourea, 2% CHAPS, 50 mM ammonium
bicarbonate, pH 8.0). Yeast lysate proteins were reduced with
3.7 mM Tris (2-carboxyethyl) phosphine hydrochloride for 30
min at 37 °C, and alkylated with 5 mM iodoacetamide for 60
min while shaking at 37 °C. After a 10-fold dilution of the
sample with 50 mM ammonium bicarbonate buffer (pH 8.0),
yeast lysate proteins were digested with trypsin (Promega,
Madison, WI) at a protein-to-trypsin ratio of 40:1 (w/w) for 12 h
at 37 °C. Digested sample was loaded and desalted on a C18
cartridge (OASIS, Waters). After washing with 0.1% TFA (trifluoroacetic
acid) solution, desalted peptides were obtained by
adding 0.1% TFA/80% acetonitrile solution. Eluted peptides
were dried in a speed vacuum centrifuge, and dissolved in 0.4%
acetic acid prior to analysis.
2.2. Peptide Separation and Mass Spectrometry. A trypsin
digest of yeast lysate (6 µg) was separated by MudPIT, Multi-
dimensional Protein Identification Technology (10 salt frac-
tions). For a tryptic digest of 48 protein mixture, we performed
two MudPIT runs (13 salt fractions) and four 1D-LC runs
(reverse phase liquid chromatography) additionally. Peptide
samples were loaded onto a strong cation exchange resin (Luna,
5 µm, Phenomenex)-packed column (3 cm × 150 µm) and
eluted by adding a salt buffer in range of 0-600 mM am-
monium acetate. For reverse phase liquid chromatography,
peptide samples were loaded onto a C18 (Magic C18aq,
Michrom BioResources, Auburn, CA)-packed trap column and
separated using a capillary C18 column (20 cm × 75 µm)
coupled with a nanospray tip. Peptides were eluted using a 30-
min (48 protein mixture) and 60-min (yeast lysate) linear
gradient of 5-35% solution B in a 60-min and 120-min run
(Solution A, 0.1% formic acid in H2O; Solution B, 0.1% formic
acid in 100% acetonitrile). Elution was performed at a flow rate
of 300 nL/min using the Eksigent MDLC. Separated peptides
were analyzed by a LTQ-Orbitrap hybrid mass spectrometer
(ThermoElectron, San Jose, CA) in a data-dependent acquisition
manner. We used a duty cycle including one survey scan
(resolution: 100 000) in an Orbitrap and six consecutive full-
MS2 scans in an ion-trap (isolation width, 3 m/z; normalized
collision energy, 35%; exclusion duration, 1 min).
2.3. Database Search and Peptide Identification. For 48
protein mixture, from 369 302 scans, we obtained 119 035 dta
files using Bioworks 3.3.1 program (molecular weight range
800-6000), which were searched with a SEQUEST cluster
against the concatenated protein database consisting of 48
standard protein sequences and common contaminants ap-
pended to reversed IPI human sequences version 3.24 (67 100
sequences). 3 Da precursor mass tolerance was used for TDMB
and 15 ppm for TD. 1 Da fragment ion mass tolerance was used
with no enzyme option, variable modification at M (Oxidation,
+15.99492) and fixed modification at C (Carbamidomethyl,
+57.02146) for both TDMB and TD.
For yeast cell lysate, from 80 622 scans, 63 031 dta files
(molecular weight range 600-4200) were acquired from mzxml
using MzXML2Search program of TPP (Trans-Proteomic Pipe-
line, http://tools.proteomecenter.org/TPP.php, v4.0 rev2). Dta
files were searched with a SEQUEST cluster against the
concatenated sequence database containing yeast proteome
sequences (obtained from http://downloads.yeastgenome.org/
sequence/genomic_sequence/orf_protein/archive/orf_trans_
all.20080606.fasta) and common contaminants appended to
their reversed sequences. 3 Da precursor mass tolerance was
used for TDMB and PeptideProphet, and 15 ppm for TD and
PE-MMR. 1 Da fragment ion mass tolerance was used with no
enzyme option, variable modification at M (Oxidation) and
fixed modification at C (Carbamidomethyl) for all of the
SEQUEST searches. For PeptideProphet, TPP v4.0 rev2 was used
with mass binning and decoy option. PE-MMR v2.3.14 was used
with default options (scan count of 2, 10 ppm mass tolerance,
and hole of 5 scans were used for creating UMC, and dta
tolerance of ±25 ppm and -1, -2, -3 Da mass correction were
used for dta filtering). After dta filtration, 62 262 dta files were
retained. In addition to the SEQUEST search, Mascot searches
(version 2.2.06) were conducted against the same database as
the SEQUEST search, using three different precursor mass
tolerances: 3 Da, 15 ppm, and 15 ppm allowing up to two
misassigned 13C (PEP_ISOTOPE_ERROR option,
http://www.matrixscience.com/help/data_file_help.html). For these three
Mascot searches, other parameters except for the precursor
mass tolerance were equal: 1 Da fragment ion mass tolerance,
Target-Decoy with Mass Binning technical notes
Journal of Proteome Research • Vol. 9, No. 2, 2010 1151
trypsin as enzyme, up to two missed cleavages, variable
modification at M (Oxidation), fixed modification at C (Car-
bamidomethyl).
For all the search results, peptide identifications were
obtained at FDR 1.5%. FDR was calculated as FDR = 2D/(T +
D), where D and T are the numbers of decoy hits and target
hits, respectively, above a score threshold.
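The FDR estimate above, and the way a score cutoff follows from it, can be sketched as follows. This is a minimal illustration assuming each PSM is a hypothetical `(score, is_decoy)` pair with distinct scores; it is not the authors' implementation.

```python
def estimate_fdr(target_hits, decoy_hits):
    """FDR = 2D / (T + D) for the PSMs accepted above a score threshold."""
    total = target_hits + decoy_hits
    if total == 0:
        return 0.0
    return 2.0 * decoy_hits / total


def lowest_threshold_at_fdr(psms, max_fdr=0.015):
    """Return the lowest score cutoff whose accepted set satisfies max_fdr.

    psms: iterable of (score, is_decoy) pairs; scores are assumed distinct.
    Returns None if no cutoff satisfies the requested FDR.
    """
    best = None
    targets = decoys = 0
    # Walk from the best score downward, accepting one more PSM at each step.
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if estimate_fdr(targets, decoys) <= max_fdr:
            best = score
    return best


# Toy PSM list, for illustration only.
toy_psms = [(5.0, False), (4.0, False), (3.0, True), (2.0, False)]
cutoff = lowest_threshold_at_fdr(toy_psms, max_fdr=0.5)
```

With the toy list, accepting all four PSMs gives FDR = 2 × 1 / 4 = 0.5, so the loosest admissible cutoff is the lowest score.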
2.4. Target-Decoy with Mass Binning. High-resolution mass
spectrometers, such as FTMS or Orbitrap, provide very accurate
precursor masses, often with errors of only a few parts per
million (ppm).16 Thus, it would be natural to perform a
database search with a small precursor mass tolerance, about
10-15 ppm, and apply a target-decoy strategy (TD). However,
it is well-known that a large portion of precursor masses are
not monoisotopic masses, mostly due to limitations of
deconvolution methods.19,20 Here, we propose that such errors
can be overcome by performing a database search with a
precursor mass tolerance large enough to accommodate the
monoisotopic peak, and applying a target-decoy strategy
separately for each local region that corresponds to precursor
mass errors of 0, ±1.00335, ±2.0067, ±3.01005 Da (for convenience,
we will omit the fractional parts and write these as 0, ±1, ±2, ±3).
This strategy works because a high-resolution mass
spectrometer gives accurate masses, with errors of only a few
parts per million, unlike a low-resolution mass spectrometer
that gives mass errors of up to a few daltons.
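The assignment of a PSM to one of the local mass error regions can be sketched as below. It assumes the mass error is defined as precursor mass minus identified peptide mass, as stated in the Introduction, and uses the 1.00335 Da spacing quoted in the text; the function names are illustrative only.

```python
ISOTOPE_SPACING_DA = 1.00335  # spacing between local regions, as quoted in the text


def assign_region(mass_error_da, max_shift=3):
    """Return (k, residual_da) for the nearest region k in -max_shift..max_shift.

    k is the number of isotope spacings by which the precursor mass appears
    shifted; residual_da is the error left after removing that shift.
    """
    k = round(mass_error_da / ISOTOPE_SPACING_DA)
    k = max(-max_shift, min(max_shift, k))
    return k, mass_error_da - k * ISOTOPE_SPACING_DA


def residual_ppm(mass_error_da, peptide_mass_da, max_shift=3):
    """Region index and residual error expressed in ppm of the peptide mass."""
    k, residual_da = assign_region(mass_error_da, max_shift)
    return k, residual_da / peptide_mass_da * 1e6


# A 1500 Da peptide whose precursor mass is 1.0050 Da too heavy.
region, ppm = residual_ppm(1.0050, 1500.0)
```

In this example the PSM falls in the +1 region with a residual of about 1.1 ppm, so it would survive a tolerance of a few ppm even though its raw mass error is about 1 Da.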
First, TDMB performs “systematic mass error correction”. A
systematic mass error was estimated from the mass error
distribution of the target peptide identifications from TD at FDR
1.5% and then their precursor masses were corrected.
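A minimal sketch of this correction step follows. The text does not say how the systematic error is estimated from the distribution, so the median of the ppm errors of confident target identifications is used here as an assumption, and the correction is assumed to be a multiplicative rescaling of the precursor mass.

```python
from statistics import median


def ppm_error(observed_mass, theoretical_mass):
    """Mass error (observed minus theoretical) in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def estimate_systematic_ppm(confident_pairs):
    """Median ppm error over (observed, theoretical) pairs of confident target hits.

    Using the median is an assumption of this sketch, not a detail from the text.
    """
    return median(ppm_error(obs, theo) for obs, theo in confident_pairs)


def correct_mass(observed_mass, systematic_ppm):
    """Remove a multiplicative systematic offset from an observed precursor mass."""
    return observed_mass / (1.0 + systematic_ppm * 1e-6)


# Toy pairs with errors of about 4, 5 and 3 ppm, for illustration only.
pairs = [(1000.004, 1000.0), (2000.010, 2000.0), (1500.0045, 1500.0)]
shift = estimate_systematic_ppm(pairs)
corrected = correct_mass(1000.004, shift)
```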
After correcting the systematic mass error, for each local
region that corresponds to precursor mass errors of 0, ±1, ±2,
±3 Da, TDMB exhaustively searches for the values of
precursor mass tolerance, XCorr and deltaCn, to be used as
thresholds, that give the largest number of target hits at
a given FDR. Unlike other methods that use learning algorithms
and complex statistics, we did not use a combined score for
thresholding but exhaustively searched all possible values in a
user-defined range and increment for each of the three
parameters, to get the best combination of three values. For
SEQUEST results, the exhaustive search for threshold values
was conducted between 5 and 100 ppm for precursor mass
tolerance (with 1 ppm increment), from 0 to 4 for XCorr
(with 0.01 increment), and from 0 to 0.18 for deltaCn
(with 0.01 increment). These are default values; a user can
set the ranges and increments in the program.
Charge states from 2 to 5 are considered, and TDMB sets
thresholds separately for each charge state and number of tryptic
termini (NTT). As a result, for each charge state and
NTT, we consider 96 × 401 × 19 (= 731 424) possible cutoff
combinations of the three parameters, and among them we choose
the one that gives the largest number of true positives. Even
though we exhaustively search 731 424 cases, the computational
overhead is very small: it takes only a few seconds for
large-scale data (over 100 000 dta files). For Mascot results, the
same strategy was used as for the SEQUEST search, except that
the Mascot Homology Threshold (MHT),18 which is the ion score
minus the homology score, is used for thresholding. To
ensure the statistical significance of the target-decoy results,
the resulting identifications are removed when the
numbers of target and decoy hits are too small for a given local
region. For all the TDMB results, we used a minimum count
of 100 identifications so that a reasonable number of target
and decoy hits are used when thresholding for a given FDR.
TDMB software can be obtained upon request to the corresponding
author.
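The exhaustive threshold search can be sketched as a brute-force loop over the three grids. This is an unoptimized illustration, not the TDMB program: the PSM tuple layout and function names are hypothetical, and the minimum-count rule is read here as a lower bound on the number of accepted hits per group, which is an assumption.

```python
from itertools import product


def grid(start, stop, step):
    """Inclusive grid built from integer steps to avoid floating-point drift."""
    n = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n + 1)]


# Default SEQUEST ranges quoted in the text.
PPM_GRID = grid(5, 100, 1)          # 96 values
XCORR_GRID = grid(0, 4, 0.01)       # 401 values
DELTACN_GRID = grid(0, 0.18, 0.01)  # 19 values
N_COMBINATIONS = len(PPM_GRID) * len(XCORR_GRID) * len(DELTACN_GRID)


def best_thresholds(psms, max_fdr=0.015, ppm_grid=PPM_GRID,
                    xcorr_grid=XCORR_GRID, deltacn_grid=DELTACN_GRID,
                    min_count=100):
    """Search one charge/NTT/mass-region group for the best cutoff triple.

    psms: iterable of (residual_ppm, xcorr, deltacn, is_decoy).
    Returns (target_hits, ppm_tol, xcorr_min, deltacn_min), or None.
    """
    psms = list(psms)
    best = None
    for ppm_tol, xcorr_min, deltacn_min in product(ppm_grid, xcorr_grid, deltacn_grid):
        targets = decoys = 0
        for ppm, xcorr, deltacn, is_decoy in psms:
            if abs(ppm) <= ppm_tol and xcorr >= xcorr_min and deltacn >= deltacn_min:
                if is_decoy:
                    decoys += 1
                else:
                    targets += 1
        accepted = targets + decoys
        if accepted == 0 or accepted < min_count:  # assumed minimum-count rule
            continue
        fdr = 2.0 * decoys / accepted
        if fdr <= max_fdr and (best is None or targets > best[0]):
            best = (targets, ppm_tol, xcorr_min, deltacn_min)
    return best


# Toy group searched over coarse grids, for illustration only.
toy = [(2.0, 3.0, 0.20, False), (3.0, 2.5, 0.15, False),
       (50.0, 2.8, 0.10, True), (4.0, 1.0, 0.05, True)]
result = best_thresholds(toy, max_fdr=0.0, ppm_grid=[5, 100],
                         xcorr_grid=[0.0, 2.0], deltacn_grid=[0.0, 0.1],
                         min_count=1)
```

The loop visits every combination for every PSM, so running it over the full default grids in pure Python would be slow; a practical implementation would need sorting or cumulative counts to reach run times of a few seconds.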
3. Results and Discussion
3.1. The 48 Standard Protein Mixture. The 48 standard
protein mixture data set was searched with SEQUEST against
the target database appended to a decoy database composed
as described in the previous section. As the decoy database is
significantly larger than the target, all the identifications of
target peptides were regarded as correct.
Figure 1 shows precursor mass error distributions of hits to
the target protein sequences (target hits) and hits to the decoy
(reverse) protein sequences (decoy hits) from a SEQUEST
search with 3 Da precursor mass tolerance. It can be seen that
there are a fair number of spectra with incorrect precursor
masses, that is, isotope masses other than monoisotopic masses.
Target hits around the precursor mass error of 0 Da indicate
MS/MS spectra with the correct monoisotopic masses as their
precursor masses, while target hits around the precursor mass
errors ±1, ±2, ±3 Da indicate MS/MS spectra with wrong
precursor masses due to incorrect assignment of monoisotopic
masses. Interestingly, the mass error distribution is
concentrated almost entirely within a very small range (several ppm)
around the precursor mass errors 0, ±1, ±2, ±3 Da. This means that
the mass error can be a powerful measure of confidence when
high mass accuracy data are searched with a large precursor
mass tolerance. Another observation is the systematic mass
error. The mass error distribution of target hits around mass
error 0 is shown in gray, in the left inset of Figure 1, and the
distribution is populated mostly within 4 ± 10 ppm. The 4 ppm
offset is regarded as a systematic mass error, which can be caused
by experimental variations. The systematic mass error can be
corrected as shown in the left inset of Figure 1, to give the
distribution shown in black.
Figure 2 shows the XCorr distribution of the 48 protein
mixture. Decoy hits were spread out over a wide range of
precursor mass errors, while target hits around the precursor
mass errors of 0, ±1, ±2, ±3 Da yielded high XCorr scores. On
the basis of these observations, TDMB used three parameters,
precursor mass tolerance, deltaCn and XCorr, for thresholding
so that the largest number of target hits can be obtained at a
given FDR. For a given FDR, TDMB finds an optimal precursor
mass tolerance, XCorr, and deltaCn in each local region, by
exhaustively searching through all possible threshold value
combinations.
We compared the number of identifications from TDMB and
TD. Using high-resolution mass spectrometry (Orbitrap), it may
seem natural to perform a database search with a small
precursor mass tolerance (e.g., 15 ppm). When such a target-decoy
scheme was applied, TD identified 19 886 true positives
and 143 false positives at 1.5% FDR. At the same FDR, TDMB
identified 25 485 true positives and 179 false positives. TDMB
also identified more unique peptides: 1764
peptides were identified by TDMB, while TD identified 1387
peptides at 1.5% FDR. In total, TDMB yielded many more
identifications because it identified additional MS/MS spectra, those with
isotope masses other than monoisotopic masses as their precursor
masses, which TD missed. In addition, depending on the
sample and experimental environment, TDMB corrects the
systematic mass error and sets the precursor mass tolerance flexibly
from 5 to 100 ppm to accommodate the largest number of
identifications, while TD can admit only those hits within a
fixed error window of 15 ppm, for example.
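The counts reported above can be checked against the FDR definition FDR = 2D/(T + D) with a few lines of arithmetic. The sketch assumes that the reported true positives correspond to T and the reported false positives to D; that mapping is an interpretation, not something the text states explicitly.

```python
def fdr(target_hits, decoy_hits):
    """FDR = 2D / (T + D), as defined in Materials and Methods."""
    return 2.0 * decoy_hits / (target_hits + decoy_hits)


# Spectrum-level counts quoted in the text (assumed mapping: TP -> T, FP -> D).
td_fdr = fdr(19886, 143)
tdmb_fdr = fdr(25485, 179)

# Relative gain of TDMB over TD at the spectrum and unique-peptide levels.
spectrum_gain = 25485 / 19886 - 1.0
peptide_gain = 1764 / 1387 - 1.0

print(f"TD FDR   = {td_fdr:.4f}")
print(f"TDMB FDR = {tdmb_fdr:.4f}")
print(f"spectrum gain = {spectrum_gain:.1%}, peptide gain = {peptide_gain:.1%}")
```

Under this reading both results stay below the 1.5% target, and TDMB gains roughly 28% more spectra and 27% more unique peptides than TD.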
3.2. TDMB Performance and Comparison. We tested TDMB
using a Yeast complex mixture and compared its performance
with other thresholding methods. Two search engines,
SEQUEST and Mascot, were used under different conditions
(variation in precursor mass tolerance).
Figure 1. Precursor mass error distribution of 48 standard protein mixture. Hits to the target protein sequences (target hits, blue bars)
and hits to the decoy sequences (decoy hits, red bars) are from SEQUEST database search results when 3 Da precursor mass tolerance
was used and no thresholding was applied. As the decoy database is significantly larger than the target, all the identifications of target
peptides were regarded as correct. The x-axis shows precursor mass error (Da) and y-axis shows corresponding number of spectra.
Target hits around the precursor mass error of 0 Da indicate MS/MS spectra with the correct monoisotopic masses as their precursor
masses, while target hits around the precursor mass errors ±1, ±2, ±3 Da indicate MS/MS spectra with incorrect precursor masses.
Unlike target hits, decoy hits show relatively uniform distribution, as they are random hits (the inset on the right shows the number
of target and decoy hits around the precursor mass error of 1 Da). The inset on the left shows precursor mass error distribution of hits
to the target protein sequences around the precursor mass error 0 Da, before and after applying TDMB’s systematic mass correction
option.
Figure 2. SEQUEST’s XCorr score distribution of 48 standard protein mixture. Target hits (blue) and decoy (reverse) hits (red) are from
SEQUEST database search results where 3 Da precursor mass tolerance was used and no thresholding was applied. As the decoy
database is significantly larger than the target, all the identifications of target peptides were regarded as correct. The x-axis shows
precursor mass error (Da) and y-axis shows corresponding XCorr score of each MS/MS spectrum. Decoy hits are spread out over a
wide range of precursor mass errors, while target hits are concentrated around the precursor mass errors of 0, ±1, ±2, ±3 Da.
First, SEQUEST searches were conducted with precursor
mass tolerances of 3 Da and 15 ppm. For the two search results,
the number of correct identifications as a function of the FDR
threshold was determined using different thresholding methods: TD
for the 3 Da and 15 ppm searches, TD with a mass error filter (15
ppm) for the 3 Da search, and TDMB for the 3 Da search. The results
are shown in Figure 3. The performance of TDMB is superior
to the others, while TD for the 3 Da search showed the worst
performance. This shows that correcting the precursor mass error
and using the optimal mass error window are very effective
for validation. We also compared TDMB with PeptideProphet22
and PE-MMR,21 because they had been upgraded or designed
for high-resolution MS data, and could be used to identify MS/MS
spectra with precursor masses other than monoisotopic masses.
As described in Materials and Methods, for PeptideProphet,
we performed a SEQUEST search with 3 Da precursor mass
tolerance, and for PE-MMR, we performed the search with 15
ppm precursor mass tolerance after its precursor mass correction
procedure, so that each method could obtain the most
identifications.
PeptideProphet recently upgraded its performance by introducing
mass binning and decoy options. TDMB is a simple
algorithm but its performance is on a par with that of
PeptideProphet. TDMB resulted in 33 791 target hits and
PeptideProphet gave 33 752 at the same FDR of 1.5%. For PE-MMR,
filtered MS/MS spectra were searched using SEQUEST
with 15 ppm precursor mass tolerance and TD was applied,
which is the process suggested by the authors of PE-MMR. First,
PE-MMR performed a precursor mass correction of up to 3 Da
and generated additional MS/MS spectra with different precursor
masses, and thus could result in 3 times more spectra than
the original data set, before applying dta filtration. Then dta
filtration was applied based on its mass clustering algorithm.
Finally, a SEQUEST search with 15 ppm precursor mass
tolerance was conducted and TD was applied. TDMB outperformed
PE-MMR (33 791 and 32 674 for TDMB and PE-MMR,
respectively) at 1.5% FDR.
Figure 3. Number of identifications by TDMB, TD, PeptideProphet, and PE-MMR at different FDR. The x-axis indicates FDR and y-axis
indicates the number of target identifications.
Figure 4. Mascot search results. The top left circle shows the number of identifications when 3 Da precursor mass tolerance was used
for search and TDMB for thresholding. The bottom circle shows the number of identifications when 15 ppm precursor mass tolerance
was used for search and TD for thresholding. The top right circle shows the number of identifications when 15 ppm precursor mass
tolerance and “PEP_ISOTOPE_ERROR” option were used for search and TD for thresholding. Venn diagram of identifications from
TDMB, TD, and TD with “PEP_ISOTOPE_ERROR” option (a) at a spectrum level, (b) at a peptide level (unique peptides) and (c) at a
protein level when 1.5% FDR was applied. The numbers in parentheses represent the number of proteins identified by a single peptide
hit.
Second, three Mascot database searches were conducted
with different parameters to compare the performance of
TDMB, TD, and TD using Mascot’s precursor mass correction
option (named “PEP_ISOTOPE_ERROR”). First, we performed
a database search with 3 Da precursor mass tolerance and
TDMB was used for thresholding. Second, 15 ppm precursor
mass tolerance was used for database search and TD was used
for thresholding. Third, 15 ppm precursor mass tolerance was
used, but “PEP_ISOTOPE_ERROR” option was added to the
search, and TD was used for thresholding. Utilizing accurate
mass from high-resolution mass spectrometry, after correction
of the systematic mass error, TDMB found the optimal precur-
sor mass error window, leading to the improvement of the
identification performance. Correction of the systematic mass
error improved the number of identifications from 30 896 to
31 274 at 1.5% FDR. If a small precursor mass tolerance, such
as 10-15 ppm, were used for database search, identifications
that are out of the tolerance range due to the systematic mass
error could never be identified. However, TDMB could accom-
modate such identifications. Figure 4 compares the identifica-
tion of TDMB, TD and TD with “PEP_ISOTOPE_ERROR” option
at the levels of spectrum, peptide, and protein, and TDMB
resulted in the most identifications.
We also tested the case of using TDMB instead of TD in the
third Mascot search, where 15 ppm precursor mass tolerance
and “PEP_ISOTOPE_ERROR” option were used. TDMB found
the optimal precursor mass tolerance of 6 ppm around precur-
sor mass error 0, charge state 3, and NTT 2, and lowered
threshold score to give more identifications. It resulted in as
many identifications as in the case of the first search, where 3
Da precursor mass tolerance was used for database search and
TDMB for thresholding. Even when using “PEP_ISOTOPE_
ERROR” option during Mascot search, TDMB turned out to be
very useful.
4. Conclusion
Validation of peptide identifications plays a key role in the
data analysis of proteomics experiments based on mass
spectrometry. Validation methods should be developed hand in
hand with advances in mass spectrometry and search algorithms.
A large portion of MS/MS spectra are left unidentified due
to various reasons. Our results show that there are a fair
number of MS/MS spectra with wrong precursor masses due
to an incomplete deconvolution method. Given that the
deconvolution algorithm is incomplete, correctly identifying
MS/MS spectra with isotope masses other than monoisotope
masses as their precursor masses is important to increase
sensitivity of identification. Here we presented a simple variation
of a target-decoy method that shows performance comparable
to a sophisticated probability-based method, utilizing
accurate mass from a high-resolution mass spectrometer. TDMB
is as simple as TD and takes only a few seconds to process
more than 100 000 spectra. TDMB can correctly identify MS/MS
spectra with wrong isotope masses by performing a
database search with a wide precursor mass tolerance, and
makes use of high mass accuracy featured by high-resolution
mass spectrometers by setting thresholds separately within
each mass error region. With a wide precursor mass tolerance,
TDMB could help identify those spectra shifted by experimental
error with “systematic mass error correction” option. In addi-
tion, TDMB is useful not only for the database search engines
such as SEQUEST, which do not have a precursor mass
correction option, but also for search engines such as Mascot,
which has one.
Acknowledgment. This work was supported by 21C
Frontier Functional Proteomics Project from Korean Ministry
of Education, Science & Technology (FPR-08-A1-020, FPR-08-
A1-030, and FPR-08-A1-090). S. Na was supported by Brain
Korea 21 (BK21) Project.
Note Added after ASAP Publication. This paper was
published on the Web on Nov 13, 2009, with an omission in
the Acknowledgment paragraph. The corrected version was
reposted on Jan 7, 2010.
References
(1) Steen, H.; Mann, M. The ABC (and XYZ’s) of peptide sequencing.
Nat. Rev. Mol. Cell Biol. 2004, 5, 699–711.
(2) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics.
Nature 2003, 422, 198–207.
(3) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate
tandem mass spectral data of peptides with amino acid sequences
in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 975–
989.
(4) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S.
Probability-based protein identification by searching sequence
databases using mass spectrometry data. Electrophoresis 1999, 20,
3551–3567.
(5) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. OLAV:
Towards high-throughput tandem mass spectrometry data iden-
tification. Proteomics 2003, 3, 1454–1463.
(6) Craig, R.; Beavis, R. C. A method for reducing the time required
to match protein sequences with tandem mass spectra. Rapid
Commun. Mass Spectrom. 2003, 17, 2310–2316.
(7) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical
statistical model to estimate the accuracy of peptide identifications
made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–
5392.
(8) Higdon, R.; Kolker, N.; Picone, A.; van Belle, G.; Kolker, E. LIP
index for peptide classification using MS/MS and SEQUEST search
via logistic regression. OMICS 2004, 8, 357–369.
(9) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased
confidence in large-scale protein identifications by mass spectrometry.
Nat. Methods 2007, 4, 207–214.
(10) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation
of proteomic data generated by tandem mass spectrometry. Nat.
Methods 2007, 4, 787–797.
(11) Higdon, R. H. J.; Van Belle, G.; Kolker, E. Randomized sequence
databases for tandem mass spectrometry peptide and protein
identification. OMICS 2005, 9, 364–379.
(12) Zhang, J.; Ma, J.; Dou, L.; Wu, S.; Qian, X.; Xie, H.; Zhu, Y.; He, F.
Bayesian nonparametric model for the validation of peptide
identification in shotgun proteomics. Mol. Cell. Proteomics 2009,
8 (3), 547–557.
(13) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J.
Semi-supervised learning for peptide identification from shotgun
proteomics datasets. Nat. Methods 2007, 4, 923–925.
(14) Ma, Z. Q.; Dasari, S.; Chambers, M. C.; Litton, M. D.; Sobecki, S. M.;
Zimmerman, L. J.; Halvey, P. J.; Schilling, B.; Drake, P. M.; Gibson,
B. W.; Tabb, D. L. IDPicker 2.0: Improved protein assembly with
high discrimination peptide identification filtering. J. Proteome Res.
2009, 8, 3872–3881.
(15) Choi, H.; Nesvizhskii, A. I. Semisupervised model-based validation
of peptide identifications in mass spectrometry-based proteomics.
J. Proteome Res. 2008, 7, 254–265.
(16) Makarov, A.; Denisov, E.; Lange, O.; Horning, S. Dynamic range
of mass accuracy in LTQ Orbitrap hybrid mass spectrometer. J. Am.
Soc. Mass Spectrom. 2006, 17, 977–82.
(17) Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A
probability-based approach for high-throughput protein phos-
Target-Decoy with Mass Binning technical notes
Journal of Proteome Research • Vol. 9, No. 2, 2010 1155
phorylation analysis and site localization. Nat. Biotechnol. 2006,
24, 1285–1292.
(18) Brosch, M.; Swamy, S.; Hubbard, T.; Choudhary, J. Comparison
of Mascot and X!Tandem performance for low and high accuracy
mass spectrometry and the development of an adjusted Mascot
threshold. Mol. Cell. Proteomics 2008, 7, 962–970.
(19) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Automated reduction
and interpretation of high resolution electrospray mass spectra
of large molecules. J. Am. Soc. Mass Spectrom. 2000, 11, 320–32.
(20) Park, K.; Yoon, J. Y.; Lee, S.; Paek, E.; Park, H.; Jung, H. J.; Lee,
S. W. Isotopic peak intensity ratio based algorithm for determi-
nation of isotopic clusters and monoisotopic masses of polypep-
tides from high-resolution mass spectrometric data. Anal. Chem.
2008, 80, 7294–7303.
(21) Shin, B.; Jung, H. J.; Hyung, S. W.; Kim, H.; Lee, D.; Lee, C.; Yu,
M. H.; Lee, S. W. Postexperiment monoisotopic mass filtering and
refinement (PE-MMR) of tandem mass spectrometric data in-
creases accuracy of peptide identification in LC/MS/MS. Mol. Cell.
Proteomics 2008, 7, 1124–1134.
(22) Choi, H.; Ghosh, D.; Nesvizhskii, A. I. Statistical Validation of
Peptide Identifications in Large-Scale Proteomics Using the Target-
Decoy Database Search Strategy and Flexible Mixture Modeling.
J. Protome Res. 2008, 7, 286–292.


JPR2010_TDMB

  • 1. Target-Decoy with Mass Binning: A Simple and Effective Validation Method for Shotgun Proteomics Using High Resolution Mass Spectrometry
Jong Wha J. Joo,† Seungjin Na,‡ Je-Hyun Baek,† Cheolju Lee,† and Eunok Paek*,‡
Korea Institute of Science and Technology, Seoul, Republic of Korea, and Department of Mechanical and Information Engineering, University of Seoul, Seoul, Republic of Korea
Received July 20, 2009
Abstract: Shotgun proteomics using mass spectrometry (MS) has become the method of choice for large-scale peptide and protein identification. The recent development of high-resolution mass spectrometers such as FT-ICR or Orbitrap makes it possible to identify peptides with a mass accuracy of only a few parts per million (ppm), which is expected to improve peptide identification dramatically compared with low-resolution instruments. To fully exploit such significantly higher mass accuracy, however, appropriate data analysis methods are required. Here, we present a new target-decoy strategy, called Target-Decoy with Mass Binning, that utilizes high mass accuracy for the validation of peptide identifications, which remains a challenging problem in MS-based proteomics. When tested on various high-resolution MS data, our method was simple yet very effective and showed comparable or better performance than other validation methods.
Keywords: target-decoy • peptide identification • validation method • precursor mass error • mass binning • high-resolution mass spectrometry
1. Introduction
In recent years, along with technological developments in tandem mass spectrometry (MS/MS),1,2 many powerful database search algorithms have been developed and have made great advances in peptide/protein identification possible. Database search engines such as SEQUEST,3 Mascot,4 Phenyx5 and X!Tandem6 select the best-scoring peptide-spectrum match (PSM) for every input MS/MS spectrum based on a comparison with theoretical MS/MS spectra from a protein database. 
However, they do not guarantee that the top PSMs are correct because of deficiencies in their scoring models, and search results include many incorrect PSMs (in fact, the majority of searched MS/MS spectra). Thus, the assessment of database search results is very important and still open to controversy. To assess the reliability of database search results, various validation methods are currently in use. There are two main approaches: machine learning based methods7,8 and the target-decoy search method.9 A machine learning based method combines multiple features related to the adopted search software (for example, XCorr, deltaCn, SpRank, and so forth in the case of SEQUEST) into a single match quality score. With the combined score, it models the distribution of PSMs and classifies them as correct or incorrect. However, a scoring model based on training data may not work well on data from a different experimental environment, which would require new training. On the other hand, a target-decoy search method uses a decoy database, which is a reversed, randomized or shuffled version of the standard protein sequence (target) database. This approach assumes that the distribution of incorrect PSMs from the target database is similar to that of PSMs from the decoy database. With this assumption, an optimized score cutoff for the corresponding false discovery rate (FDR) is calculated for each data set. A target-decoy method is favored in practice because of its simplicity and its robustness to the effects of database size, sample quality, experimental environment and instrument type.9–11 Consequently, the advantages of the target-decoy search have been incorporated into machine learning based methods. 
Recent implementations of machine learning based methods12–15 estimate certain parameters of their models from a target-decoy search, or construct an initial classifier (dynamic training set) for further analyses and incrementally improve their performance. Recently, with the advent of high-resolution MS instruments such as Fourier transform ion cyclotron resonance (FT-ICR) and electrostatic FT traps (Orbitrap), mass resolution has been improved by 2 to 3 orders of magnitude compared with conventional MS instruments.16 These instruments allow peptides to be identified with a mass accuracy of only a few parts per million (ppm), which would be expected to improve peptide identification dramatically by reducing the search space by orders of magnitude. However, many studies have reported that a very small search space does not necessarily lead to improved performance.17,18 In addition, in spite of the high resolving power, a fair number of MS/MS spectra are generated with wrong precursor masses due to errors in deconvolution.19,20 To address these issues, new database search options and validation methods for high-resolution MS data have been introduced. Mascot and X!Tandem provide a search option to compensate for a misassignment of monoisotopic masses. Another method, called PE-MMR,21 has been proposed
* To whom correspondence should be addressed. Eunok Paek, Department of Mechanical and Information Engineering, University of Seoul, Seoul, Republic of Korea. E-mail: paek@uos.ac.kr. Fax: +82-2-2210-5575. Tel: +82-2-2210-2680.
† Korea Institute of Science and Technology.
‡ University of Seoul.
1150 Journal of Proteome Research 2010, 9, 1150–1156 10.1021/pr9006377 © 2010 American Chemical Society Published on Web 11/13/2009
  • 2. so that it can correct wrong precursor masses of MS/MS spectra. It generates unique mass classes (UMC) from an LC/MS experiment and, prior to a database search, filters out or corrects MS/MS spectra by matching their precursor masses to the UMC. PeptideProphet,22 the representative validation method, improved its performance for high-resolution MS data by using mass binning for accurate modeling of mass accuracy distributions. Here, we introduce a database search and validation strategy based on a target-decoy search, called Target-Decoy with Mass Binning (TDMB), to obtain as many reliable peptide identifications as possible from high-resolution MS data (to make the distinction clear, we refer to the original target-decoy method as TD). We examined the distribution of mass errors (precursor mass of a spectrum minus identified peptide mass) of search results from high-resolution MS data, and compared the search results under different search conditions (variation in precursor mass tolerance). These analyses led us to set a wide precursor mass tolerance for the search and to use an accurate binning of mass errors for validation. This strategy allowed us to find the systematic mass error and an optimal mass error window for the MS instrument used. In our approach, score thresholds at a specified FDR are determined in combination with the mass error window. Because peptide identifications outside the optimal mass error window are almost always incorrect, the approach can achieve high specificity. It successfully dealt with issues related to high mass accuracy and incorrect assignment of monoisotopic masses, and gave more sensitive and accurate identifications than certain existing methods.
2. Material and Methods
2.1. Sample Preparation. 
The 48 standard protein mixture (Sigma U6133, Universal Proteomics Standard Set, UPS1) was denatured in buffer (6 M urea, 0.05% SDS, 5 mM EDTA, and 50 mM ammonium bicarbonate, pH 8.0), reduced with 3 mM tris(2-carboxyethyl)phosphine hydrochloride for 30 min at 37 °C, and alkylated with 5 mM iodoacetamide for 30 min while shaking at 50 °C. The protein sample was digested with trypsin (Promega, Madison, WI) at a protein-to-trypsin ratio of 50:1 (w/w) for 16 h at 37 °C. SDS and other reagents were removed from the digested protein sample using a mixed strong cation exchange cartridge (OASIS, Waters). Peptides were eluted by adding 5% ammonia in methanol, dried in a speed vacuum centrifuge, and dissolved in 0.4% acetic acid prior to analysis. Yeast cell lysate from wild-type yeast grown in rich medium was obtained after cell disruption with glass beads in lysis buffer (7 M urea, 2 M thiourea, 2% CHAPS, 50 mM ammonium bicarbonate, pH 8.0). Yeast lysate proteins were reduced with 3.7 mM tris(2-carboxyethyl)phosphine hydrochloride for 30 min at 37 °C, and alkylated with 5 mM iodoacetamide for 60 min while shaking at 37 °C. After a 10-fold dilution of the sample with 50 mM ammonium bicarbonate buffer (pH 8.0), yeast lysate proteins were digested with trypsin (Promega, Madison, WI) at a protein-to-trypsin ratio of 40:1 (w/w) for 12 h at 37 °C. The digested sample was loaded and desalted on a C18 cartridge (OASIS, Waters). After washing with 0.1% TFA (trifluoroacetic acid) solution, desalted peptides were obtained by adding 0.1% TFA/80% acetonitrile solution. Eluted peptides were dried in a speed vacuum centrifuge, and dissolved in 0.4% acetic acid prior to analysis. 2.2. Peptide Separation and Mass Spectrometry. A trypsin digest of yeast lysate (6 µg) was separated by MudPIT, Multidimensional Protein Identification Technology (10 salt fractions). 
For a tryptic digest of the 48 protein mixture, we performed two MudPIT runs (13 salt fractions) and, additionally, four 1D-LC runs (reverse phase liquid chromatography). Peptide samples were loaded onto a column (3 cm × 150 µm) packed with a strong cation exchange resin (Luna, 5 µm, Phenomenex) and eluted by adding a salt buffer in the range of 0–600 mM ammonium acetate. For reverse phase liquid chromatography, peptide samples were loaded onto a C18 (Magic C18aq, Michrom BioResources, Auburn, CA)-packed trap column and separated using a capillary C18 column (20 cm × 75 µm) coupled with a nanospray tip. Peptides were eluted using a 30-min (48 protein mixture) or 60-min (yeast lysate) linear gradient of 5–35% solution B in a 60-min or 120-min run, respectively (Solution A, 0.1% formic acid in H2O; Solution B, 0.1% formic acid in 100% acetonitrile). Elution was performed at a flow rate of 300 nL/min using the Eksigent MDLC. Separated peptides were analyzed by an LTQ-Orbitrap hybrid mass spectrometer (ThermoElectron, San Jose, CA) in a data-dependent acquisition manner. We used a duty cycle including one survey scan (resolution: 100 000) in the Orbitrap and six consecutive full-MS2 scans in the ion trap (isolation width, 3 m/z; normalized collision energy, 35%; exclusion duration, 1 min). 2.3. Database Search and Peptide Identification. For the 48 protein mixture, from 369 302 scans, we obtained 119 035 dta files using the Bioworks 3.3.1 program (molecular weight range 800–6000), which were searched with a SEQUEST cluster against the concatenated protein database consisting of 48 standard protein sequences and common contaminants appended to reversed IPI human sequences version 3.24 (67 100 sequences). A precursor mass tolerance of 3 Da was used for TDMB and 15 ppm for TD. A fragment ion mass tolerance of 1 Da was used with the no-enzyme option, variable modification at M (Oxidation, +15.99492) and fixed modification at C (Carbamidomethyl, +57.02146) for both TDMB and TD. 
For yeast cell lysate, from 80 622 scans, 63 031 dta files (molecular weight range 600–4200) were acquired from mzXML using the MzXML2Search program of TPP (Trans-Proteomic Pipeline, http://tools.proteomecenter.org/TPP.php, v4.0 rev2). Dta files were searched with a SEQUEST cluster against the concatenated sequence database containing yeast proteome sequences (obtained from http://downloads.yeastgenome.org/sequence/genomic_sequence/orf_protein/archive/orf_trans_all.20080606.fasta) and common contaminants appended to their reversed sequences. A precursor mass tolerance of 3 Da was used for TDMB and PeptideProphet, and 15 ppm for TD and PE-MMR. A fragment ion mass tolerance of 1 Da was used with the no-enzyme option, variable modification at M (Oxidation) and fixed modification at C (Carbamidomethyl) for all of the SEQUEST searches. For PeptideProphet, TPP v4.0 rev2 was used with the mass binning and decoy options. PE-MMR v2.3.14 was used with default options (a scan count of 2, 10 ppm mass tolerance, and a hole of 5 scans for creating UMC; a dta tolerance of ±25 ppm and -1, -2, -3 Da mass correction for dta filtering). After dta filtration, 62 262 dta files were retained. In addition to the SEQUEST search, Mascot searches (version 2.2.06) were conducted against the same database as the SEQUEST search, using three different precursor mass tolerances: 3 Da, 15 ppm, and 15 ppm allowing up to two misassigned C13 (PEP_ISOTOPE_ERROR option, http://www.matrixscience.com/help/data_file_help.html). For these three Mascot searches, all parameters other than the precursor mass tolerance were equal: 1 Da fragment ion mass tolerance,
  • 3. trypsin as enzyme, up to two missed cleavages, variable modification at M (Oxidation), fixed modification at C (Carbamidomethyl). For all the search results, peptide identifications were obtained at FDR 1.5%. FDR was calculated as FDR = 2D/(T + D), where D is the number of decoy hits and T is the number of target hits above a score threshold. 2.4. Target-Decoy with Mass Binning. High-resolution mass spectrometers, such as FTMS or Orbitrap, provide very accurate precursor masses, often with errors of only a few parts per million (ppm).16 Thus, it would be natural to perform a database search with a small precursor mass tolerance, about 10–15 ppm, and apply a target-decoy strategy (TD). However, it is well-known that a large portion of precursor masses are not those of the monoisotopic peaks, mostly due to limitations in deconvolution methods.19,20 Here, we propose that such errors can be overcome by performing a database search with a precursor mass tolerance large enough to accommodate the monoisotopic peak, and applying a target-decoy strategy separately to each local region that corresponds to precursor mass errors of 0, ±1.00335, ±2.0067, ±3.01005 Da (for convenience, we will omit the fractional parts and write these as 0, ±1, ±2, ±3). This strategy works because a high-resolution mass spectrometer gives accurate masses, with errors of only a few parts per million, unlike a low-resolution mass spectrometer that gives mass errors of up to a few daltons. First, TDMB performs “systematic mass error correction”. A systematic mass error is estimated from the mass error distribution of the target peptide identifications from TD at FDR 1.5%, and the precursor masses are then corrected accordingly. 
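The FDR formula and the notion of a local mass-error region can be illustrated with a minimal Python sketch. This is not the TDMB software; the function names are hypothetical, and expressing the residual error in ppm relative to the identified peptide mass is an assumption made for the illustration.

```python
# Spacing between neighboring isotope peaks in Da, as quoted in the text.
ISOTOPE_SPACING_DA = 1.00335


def fdr(num_target: int, num_decoy: int) -> float:
    """FDR = 2D / (T + D) for the hits above a score threshold."""
    total = num_target + num_decoy
    if total == 0:
        return 0.0
    return 2.0 * num_decoy / total


def assign_mass_bin(precursor_mass: float, peptide_mass: float, max_offset: int = 3):
    """Return (k, residual_ppm) for one PSM.

    k is the nearest isotope offset (0, +-1, +-2, +-3) and residual_ppm is the
    mass error left after subtracting k * 1.00335 Da. Using the peptide mass as
    the ppm denominator is an assumption of this sketch.
    """
    error_da = precursor_mass - peptide_mass  # mass error as defined in the text
    k = round(error_da / ISOTOPE_SPACING_DA)
    k = max(-max_offset, min(max_offset, k))  # clamp to the 3 Da search window
    residual_da = error_da - k * ISOTOPE_SPACING_DA
    residual_ppm = residual_da / peptide_mass * 1e6
    return k, residual_ppm
```

For example, a precursor mass of 1501.00935 Da matched to a 1500.0 Da peptide falls in the +1 region with a residual error of about 4 ppm, so it would be evaluated against the thresholds of that region rather than discarded.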
After correcting the systematic mass error, for each local region that corresponds to precursor mass errors of 0, ±1, ±2, ±3 Da, TDMB exhaustively searches for the optimal values of precursor mass tolerance, XCorr, and deltaCn to be used as thresholds, that is, the values that give the largest number of target hits at a given FDR. Unlike other methods that use learning algorithms and complex statistics, we did not use a combined score for thresholding but exhaustively searched all possible values in a user-defined range and increment for each of the three parameters, to find the best combination of the three values. For SEQUEST results, the exhaustive search for threshold values was conducted from 5 to 100 ppm for precursor mass tolerance (with 1 ppm increment), from 0 to 4 for XCorr (with 0.01 increment), and from 0 to 0.18 for deltaCn (with 0.01 increment). These are the default values; a user can set the ranges and increments in the program. Charge states from 2 to 5 are considered, and TDMB sets thresholds separately for each charge state and number of tryptic termini (NTT). As a result, for each charge state and NTT, we consider 96 × 401 × 19 (= 731 424) possible cutoff combinations of the three parameters, and among them we choose the one that gives the largest number of true positives. Even though we exhaustively search 731 424 cases, the computational overhead is very small, and it takes only a few seconds for a large-scale data set (over 100 000 dta files). For Mascot results, the same strategy was used as for the SEQUEST search, except that the Mascot Homology Threshold (MHT),18 which is the ion score minus the homology score, is used for thresholding. To ensure the statistical significance of the target-decoy results, the resulting identifications are discarded when the numbers of target and decoy hits in a given local region are too small.
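The exhaustive threshold search can be illustrated with the following Python sketch. It is a naive version written for clarity, under our own assumptions, and is not the TDMB program: the function names and the tuple layout of a match are hypothetical, and ties between equally good combinations are resolved by keeping the first one found. The default grids reproduce the ranges and increments given above, and FDR is computed as 2D/(T + D).

```python
from itertools import product


def fdr(targets, decoys):
    """FDR = 2D / (T + D), as defined in the text; 0.0 when nothing passes."""
    total = targets + decoys
    return 2.0 * decoys / total if total else 0.0


def frange(start, stop, step):
    """Inclusive arithmetic grid, computed by index to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]


# Default grids from the text: 96 x 401 x 19 = 731 424 combinations.
PPM_GRID = frange(5, 100, 1)
XCORR_GRID = frange(0, 4, 0.01)
DELTACN_GRID = frange(0, 0.18, 0.01)


def best_thresholds(psms, max_fdr=0.015, ppm_grid=PPM_GRID,
                    xcorr_grid=XCORR_GRID, dcn_grid=DELTACN_GRID):
    """Find the cutoff triple that keeps the most target hits within max_fdr.

    psms: list of (abs_ppm_error, xcorr, delta_cn, is_decoy) tuples for one
    group (one local mass-error region, charge state, and NTT).
    Returns (target_count, ppm_tol, xcorr_min, dcn_min), or None if no
    combination satisfies the FDR limit.
    """
    best = None
    for ppm_tol, xcorr_min, dcn_min in product(ppm_grid, xcorr_grid, dcn_grid):
        t = d = 0
        for err, xcorr, dcn, is_decoy in psms:
            if err <= ppm_tol and xcorr >= xcorr_min and dcn >= dcn_min:
                if is_decoy:
                    d += 1
                else:
                    t += 1
        # Strict '>' keeps the first combination found among ties.
        if fdr(t, d) <= max_fdr and (best is None or t > best[0]):
            best = (t, ppm_tol, xcorr_min, dcn_min)
    return best
```

This direct loop costs one pass over the matches per combination, so it is practical only for small groups or coarse grids; a faster implementation would need sorting or cumulative counts, and the sketch makes no claim about how TDMB itself achieves its reported running time.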
For all the TDMB results, we used a minimum count of 100 identifications so that a reasonable number of target and decoy hits are used when thresholding for a given FDR. TDMB software can be obtained upon request to the corresponding author.

3. Results and Discussion

3.1. The 48 Standard Protein Mixture. The 48 standard protein mixture data set was searched with SEQUEST against the target database appended to a decoy database composed as described in the previous section. As the decoy database is significantly larger than the target, all the identifications of target peptides were regarded as correct. Figure 1 shows precursor mass error distributions of hits to the target protein sequences (target hits) and hits to the decoy (reverse) protein sequences (decoy hits) from a SEQUEST search with 3 Da precursor mass tolerance. It can be seen that there are a fair number of spectra with incorrect precursor masses, that is, isotope masses other than monoisotopic masses. Target hits around the precursor mass error of 0 Da indicate MS/MS spectra with the correct monoisotopic masses as their precursor masses, while target hits around the precursor mass errors of ±1, ±2, ±3 Da indicate MS/MS spectra with wrong precursor masses due to incorrect assignment of monoisotopic masses. Interestingly, the mass error distribution is concentrated almost entirely within a very small range (several ppm) around the precursor mass errors of 0, ±1, ±2, ±3 Da. This means that the mass error can be a powerful measure of confidence when high mass accuracy data are searched with a large precursor mass tolerance. Another observation is the systematic mass error. The mass error distribution of target hits around mass error 0 is shown in gray in the left inset of Figure 1, and the distribution is populated mostly within 4 ± 10 ppm. The 4 ppm offset is regarded as a systematic mass error, which can be caused by experimental variations.
The systematic mass error can be corrected as shown in the left inset of Figure 1, to give the distribution shown in black. Figure 2 shows the XCorr distribution of the 48 protein mixture. Decoy hits were spread out over a wide range of precursor mass errors, while target hits around the precursor mass errors of 0, ±1, ±2, ±3 Da yielded high XCorr scores. On the basis of these observations, TDMB used three parameters, precursor mass tolerance, deltaCn, and XCorr, for thresholding so that the largest number of target hits can be obtained at a given FDR. For a given FDR, TDMB finds an optimal precursor mass tolerance, XCorr, and deltaCn in each local region by exhaustively searching through all possible threshold value combinations.

We compared the number of identifications from TDMB and TD. Using high-resolution mass spectrometry (Orbitrap), it may seem natural to perform a database search with a small precursor mass tolerance (e.g., 15 ppm). When such a target-decoy scheme was applied, TD identified 19 886 true positives and 143 false positives at 1.5% FDR. At the same FDR, TDMB identified 25 485 true positives and 179 false positives. TDMB also identified more unique peptides: 1764 peptides were identified by TDMB, while TD identified 1387 peptides at 1.5% FDR. In total, TDMB yielded many more identifications because it identified additional MS/MS spectra that TD missed, namely those with isotope masses other than monoisotopic masses as their precursor masses. In addition, depending on the sample and experimental environment, TDMB corrects the systematic mass error and sets the precursor mass tolerance flexibly from 5 to 100 ppm to accommodate the largest number of identifications, while TD can admit only those hits within a fixed error window of, for example, 15 ppm.

Figure 1. Precursor mass error distribution of the 48 standard protein mixture. Hits to the target protein sequences (target hits, blue bars) and hits to the decoy sequences (decoy hits, red bars) are from SEQUEST database search results when 3 Da precursor mass tolerance was used and no thresholding was applied. As the decoy database is significantly larger than the target, all the identifications of target peptides were regarded as correct. The x-axis shows precursor mass error (Da) and the y-axis shows the corresponding number of spectra. Target hits around the precursor mass error of 0 Da indicate MS/MS spectra with the correct monoisotopic masses as their precursor masses, while target hits around the precursor mass errors of ±1, ±2, ±3 Da indicate MS/MS spectra with incorrect precursor masses. Unlike target hits, decoy hits show a relatively uniform distribution, as they are random hits (the inset on the right shows the number of target and decoy hits around the precursor mass error of 1 Da). The inset on the left shows the precursor mass error distribution of hits to the target protein sequences around the precursor mass error of 0 Da, before and after applying TDMB's systematic mass correction option.

Figure 2. SEQUEST's XCorr score distribution of the 48 standard protein mixture. Target hits (blue) and decoy (reverse) hits (red) are from SEQUEST database search results where 3 Da precursor mass tolerance was used and no thresholding was applied. As the decoy database is significantly larger than the target, all the identifications of target peptides were regarded as correct. The x-axis shows precursor mass error (Da) and the y-axis shows the corresponding XCorr score of each MS/MS spectrum. Decoy hits are spread out over a wide range of precursor mass errors, while target hits are concentrated around the precursor mass errors of 0, ±1, ±2, ±3 Da.

3.2. TDMB Performance and Comparison. We tested TDMB using a yeast complex mixture and compared its performance with other thresholding methods. Two search engines, SEQUEST and Mascot, were used under different conditions (variation in precursor mass tolerance).

First, SEQUEST searches were conducted with precursor mass tolerances of 3 Da and 15 ppm. For the two search results, the number of correct identifications as a function of the FDR threshold was determined using different thresholding methods: TD for the 3 Da and 15 ppm searches, TD with a mass error filter (15 ppm) for the 3 Da search, and TDMB for the 3 Da search. The results are shown in Figure 3. The performance of TDMB is superior to the others, while TD on the 3 Da search showed the worst performance. This shows that correcting the precursor mass error and using the optimal mass error window are very effective for validation. We also compared TDMB with PeptideProphet22 and PE-MMR,21 because they had been upgraded or designed for high-resolution MS data and could be used to identify MS/MS spectra with precursor masses other than monoisotopic masses. As described in Materials and Methods, for PeptideProphet we performed a SEQUEST search with 3 Da precursor mass tolerance, and for PE-MMR we performed the search with 15 ppm precursor mass tolerance after its precursor mass correction procedure, so that each method could obtain the most identifications.

PeptideProphet recently upgraded its performance by introducing mass binning and decoy options. TDMB is a simple algorithm, but its performance is on a par with that of PeptideProphet. TDMB resulted in 33 791 target hits and PeptideProphet gave 33 752 at the same FDR of 1.5%. For PE-MMR, filtered MS/MS spectra were searched using SEQUEST with 15 ppm precursor mass tolerance and TD was applied, which is the process suggested by the authors of PE-MMR. First, PE-MMR performed a precursor mass correction of up to 3 Da and generated additional MS/MS spectra with different precursor masses, and thus could result in 3 times more spectra than the original data set before applying dta filtration. Then dta filtration was applied based on its mass clustering algorithm. Finally, a SEQUEST search with 15 ppm precursor mass tolerance was conducted and TD was applied. TDMB outperformed PE-MMR (33 791 and 32 674 identifications for TDMB and PE-MMR, respectively) at 1.5% FDR.

Figure 3. Number of identifications by TDMB, TD, PeptideProphet, and PE-MMR at different FDR. The x-axis indicates FDR and the y-axis indicates the number of target identifications.

Figure 4. Mascot search results. The top left circle shows the number of identifications when 3 Da precursor mass tolerance was used for search and TDMB for thresholding. The bottom circle shows the number of identifications when 15 ppm precursor mass tolerance was used for search and TD for thresholding. The top right circle shows the number of identifications when 15 ppm precursor mass tolerance and the "PEP_ISOTOPE_ERROR" option were used for search and TD for thresholding. Venn diagram of identifications from TDMB, TD, and TD with the "PEP_ISOTOPE_ERROR" option (a) at a spectrum level, (b) at a peptide level (unique peptides), and (c) at a protein level when 1.5% FDR was applied. The numbers in parentheses represent the number of proteins identified by a single peptide hit.

Second, three Mascot database searches were conducted with different parameters to compare the performance of TDMB, TD, and TD using Mascot's precursor mass correction option (named "PEP_ISOTOPE_ERROR"). First, we performed a database search with 3 Da precursor mass tolerance and TDMB was used for thresholding. Second, 15 ppm precursor mass tolerance was used for the database search and TD was used for thresholding. Third, 15 ppm precursor mass tolerance was used, but the "PEP_ISOTOPE_ERROR" option was added to the search, and TD was used for thresholding. Utilizing accurate mass from high-resolution mass spectrometry, after correction of the systematic mass error, TDMB found the optimal precursor mass error window, leading to improved identification performance. Correction of the systematic mass error increased the number of identifications from 30 896 to 31 274 at 1.5% FDR. If a small precursor mass tolerance, such as 10-15 ppm, were used for the database search, identifications that fall outside the tolerance range due to the systematic mass error could never be identified. However, TDMB could accommodate such identifications. Figure 4 compares the identifications of TDMB, TD, and TD with the "PEP_ISOTOPE_ERROR" option at the levels of spectrum, peptide, and protein; TDMB resulted in the most identifications. We also tested the case of using TDMB instead of TD in the third Mascot search, where 15 ppm precursor mass tolerance and the "PEP_ISOTOPE_ERROR" option were used.
TDMB found the optimal precursor mass tolerance of 6 ppm around precursor mass error 0, charge state 3, and NTT 2, and lowered the threshold score to give more identifications. It resulted in as many identifications as in the case of the first search, where 3 Da precursor mass tolerance was used for the database search and TDMB for thresholding. Even when the "PEP_ISOTOPE_ERROR" option was used during the Mascot search, TDMB turned out to be very useful.

4. Conclusion

Validation of peptide identifications plays a key role in the data analysis of proteomics experiments based on mass spectrometry. As mass spectrometry and search algorithms advance, validation methods should be developed hand in hand with them. A large portion of MS/MS spectra are left unidentified for various reasons. Our results show that there are a fair number of MS/MS spectra with wrong precursor masses due to an incomplete deconvolution method. Given that the deconvolution algorithm is incomplete, correctly identifying MS/MS spectra with isotope masses other than monoisotopic masses as their precursor masses is important for increasing the sensitivity of identification. Here we presented a simple variation of a target-decoy method that shows performance comparable to a sophisticated probability-based method, utilizing accurate mass from a high-resolution mass spectrometer. TDMB is as simple as TD and takes only a few seconds to process more than 100 000 spectra. TDMB can correctly identify MS/MS spectra with wrong isotope masses by performing a database search with a wide precursor mass tolerance, and it makes use of the high mass accuracy of high-resolution mass spectrometers by setting thresholds separately within each mass error region. With a wide precursor mass tolerance, TDMB could help identify those spectra shifted by experimental error by using the "systematic mass error correction" option.
In addition, TDMB is useful not only for database search engines such as SEQUEST, which do not have a precursor mass correction option, but also for search engines such as Mascot, which has one.

Acknowledgment. This work was supported by 21C Frontier Functional Proteomics Project from Korean Ministry of Education, Science & Technology (FPR-08-A1-020, FPR-08-A1-030, and FPR-08-A1-090). S. Na was supported by Brain Korea 21 (BK21) Project.

Note Added after ASAP Publication. This paper was published on the Web on Nov 13, 2009, with an omission in the Acknowledgment paragraph. The corrected version was reposted on Jan 7, 2010.

References
(1) Steen, H.; Mann, M. The ABC (and XYZ's) of peptide sequencing. Nat. Rev. Mol. Cell Biol. 2004, 5, 699–711.
(2) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198–207.
(3) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 975–989.
(4) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(5) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. OLAV: Towards high-throughput tandem mass spectrometry data identification. Proteomics 2003, 3, 1454–1463.
(6) Craig, R.; Beavis, R. C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 2003, 17, 2310–2316.
(7) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
(8) Higdon, R.; Kolker, N.; Picone, A.; van Belle, G.; Kolker, E. LIP index for peptide classification using MS/MS and SEQUEST search via logistic regression. OMICS 2004, 8, 357–369.
(9) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
(10) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4, 787–797.
(11) Higdon, R. H. J.; Van Belle, G.; Kolker, E. Randomized sequence databases for tandem mass spectrometry peptide and protein identification. OMICS 2005, 9, 364–379.
(12) Zhang, J.; Ma, J.; Dou, L.; Wu, S.; Qian, X.; Xie, H.; Zhu, Y.; He, F. Bayesian nonparametric model for the validation of peptide identification in shotgun proteomics. Mol. Cell. Proteomics 2009, 8 (3), 547–557.
(13) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923–925.
(14) Ma, Z. Q.; Dasari, S.; Chambers, M. C.; Litton, M. D.; Sobecki, S. M.; Zimmerman, L. J.; Halvey, P. J.; Schilling, B.; Drake, P. M.; Gibson, B. W.; Tabb, D. L. IDPicker 2.0: Improved protein assembly with high discrimination peptide identification filtering. J. Proteome Res. 2009, 8, 3872–3881.
(15) Choi, H.; Nesvizhskii, A. I. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J. Proteome Res. 2008, 7, 254–265.
(16) Makarov, A.; Denisov, E.; Lange, O.; Horning, S. Dynamic range of mass accuracy in LTQ Orbitrap hybrid mass spectrometer. J. Am. Soc. Mass Spectrom. 2006, 17, 977–82.
(17) Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285–1292.
(18) Brosch, M.; Swamy, S.; Hubbard, T.; Choudhary, J. Comparison of Mascot and X!Tandem performance for low and high accuracy mass spectrometry and the development of an adjusted Mascot threshold. Mol. Cell. Proteomics 2008, 7, 962–970.
(19) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J. Am. Soc. Mass Spectrom. 2000, 11, 320–32.
(20) Park, K.; Yoon, J. Y.; Lee, S.; Paek, E.; Park, H.; Jung, H. J.; Lee, S. W. Isotopic peak intensity ratio based algorithm for determination of isotopic clusters and monoisotopic masses of polypeptides from high-resolution mass spectrometric data. Anal. Chem. 2008, 80, 7294–7303.
(21) Shin, B.; Jung, H. J.; Hyung, S. W.; Kim, H.; Lee, D.; Lee, C.; Yu, M. H.; Lee, S. W. Postexperiment monoisotopic mass filtering and refinement (PE-MMR) of tandem mass spectrometric data increases accuracy of peptide identification in LC/MS/MS. Mol. Cell. Proteomics 2008, 7, 1124–1134.
(22) Choi, H.; Ghosh, D.; Nesvizhskii, A. I. Statistical Validation of Peptide Identifications in Large-Scale Proteomics Using the Target-Decoy Database Search Strategy and Flexible Mixture Modeling. J. Proteome Res. 2008, 7, 286–292.

PR9006377