This document introduces several proteomics data standards developed by the Proteomics Standards Initiative (PSI), including mzML for mass spectrometry data, mzIdentML for peptide and protein identifications, mzQuantML for quantification data, and mzTab for final identification and quantification results. It describes how these standards address the need for data standardization in proteomics as the field has evolved. It also discusses how these standards have been implemented in proteomics databases, software tools, and data repositories like ProteomeXchange to facilitate data sharing and analysis.
1. Introduction to the PSI standard data formats
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
EuPA workshop on standardisation
Istanbul, 22 June 2016
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange
4. Standards are needed in life: also in bioinformatics…
With a small number of standards, data converters are feasible
Data standards are needed
6. Mass Spectrometry (MS)-based proteomics
• Many different workflows -> Many different data types -> Need for several data standards.
• Discovery mode:
  • Bottom-up proteomics
  • Data dependent acquisition
  • Data independent acquisition
  • Top down proteomics
• Targeted mode:
  • SRM (Selected Reaction Monitoring)
7. Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
8. HUPO Proteomics Standards Initiative
• Develops data format standards for proteomics.
• Both data representation and annotation standards.
• Involves data producers, database providers, software producers, publishers, …
• Active Workgroups: MI, MS, PI, Mod.
• Inter-group activities: MIAPE and Controlled Vocabularies.
• Started in 2002, so some experience already…
• One annual meeting in March-April, regular phone calls.
• Close interaction with the metabolomics community.
http://www.psidev.info
9. PSI Deliverables
• Minimum information (MIAPE) specifications: Format-independent specification of minimum information guidelines.
• Formats: Usually an XML schema (but also tab-delimited files) capable of representing the relevant Minimum Information, plus additional detailed data for the domain.
• Controlled vocabularies: Usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats.
• Databases and Tools: Foster software implementations to make the standards truly useful.
• Community interaction to ensure deposition of data in public repositories.
10. MIAPE guidelines
• Minimum Information About a Proteomics Experiment guidelines.
• Set of experimental and technical metadata that are needed to make one experiment reproducible.
• They cover different aspects: mass spectrometry, informatics (identification and quantification), particular techniques, etc.
• Published since 2008, but their adoption has been limited.
13. The typical dilemma
• Data standards need to be stable to promote adoption
• Proteomics standards need to evolve very rapidly:
  • Data is inherently very complex
  • Experimental techniques are evolving all the time
14. Existing data standards in proteomics
• MS data: mzML (also used in MS metabolomics).
• Protein and peptide identifications: mzIdentML.
• Peptide and protein quantification: mzQuantML (also supports small molecules).
• SRM transitions (for targeted proteomics): TraML.
• mzTab: identification and quantification results for peptides, proteins and small molecules (also used in MS metabolomics).
• Molecular interactions: PSI MI XML and MITAB.
www.psidev.info
15. Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final Results
• TraML: SRM
16. Data formats for mass spectra data
• Binary data
• XML-based files: mzData, mzXML, mzML
• Peak lists: .dta, .pkl, .mgf, .ms2
17. An example of a success story: mzML
• A data format for the storage and exchange of MS output files
• Designed by merging the best aspects of both mzData and mzXML
• Developed with full participation of academic researchers, hardware and software vendors
• Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats
• Captures spectra (raw data or peak lists), chromatograms and related metadata
• Version 1.0 released in June 2008, v1.1 released in June 2009
• Many implementations already exist
• Version 1.2 with enhanced compression considered for 2014
Martens et al., MCP, 2011
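As an illustration of what "captures spectra" involves in practice: mzML stores numeric arrays (m/z values, intensities) as base64 text, optionally zlib-compressed. The sketch below shows that round trip using only the Python standard library. The function names and the sample values are hypothetical, and it assumes 64-bit little-endian floats; a real reader should take precision and compression from the controlled-vocabulary terms attached to each array.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian 64-bit values, optionally zlib-compress, then base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse of encode_array: base64-decode, decompress if needed, unpack 64-bit floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // 8  # 8 bytes per 64-bit float
    return list(struct.unpack(f"<{count}d", raw))


# Hypothetical m/z values, for illustration only.
mz_values = [445.12, 446.12, 447.13]
encoded = encode_array(mz_values)
decoded = decode_array(encoded)
```

In day-to-day work one of the existing parser libraries would normally handle this step.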
18. An example of a success story: mzML
The most popular search engines support mzML
Many parser libraries available
Conversion from raw files into mzML
http://www.psidev.info/mzml_1_0_0
21. Data formats for output from search engines
mzIdentML, mzTab
Mascot .dat, Sequest .out, SpectrumMill .spo
pep.xml, prot.xml
Only qualitative data!
22. mzIdentML: peptide and protein identifications
• Overview
  • XML-based data standard for peptide and protein identifications, e.g. following database search and protein inference.
  • Sections for all PSMs, proteins/protein groups inferred, protocols/parameters etc.
• Timeline:
  • Original 1.0 version in Aug 2009.
  • Version 1.1 stable (Aug 2011).
  • Manuscript published in MCP in 2012*.
  • 2012-2015: Improving support for protein grouping, multiple search engines, pre-fractionation approaches and de novo sequencing.
• Now firmly embedded as part of ProteomeXchange submission process, and supported by lots of external software.
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
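To give a feel for the "sections for all PSMs", here is a minimal sketch that reads peptide-spectrum matches from an mzIdentML-like fragment with the Python standard library. The fragment is heavily simplified and is not schema-valid (real files nest these elements more deeply and carry many more attributes and CV terms); the IDs, the values, the namespace string and the helper name `passing_psms` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for mzIdentML 1.1; check it against the file you are reading.
NS = {"mzid": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Simplified, illustrative fragment (not a complete or valid mzIdentML file).
FRAGMENT = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
    <SpectrumIdentificationItem id="SII_1" rank="1" chargeState="2"
        peptide_ref="PEP_1" experimentalMassToCharge="500.25"
        calculatedMassToCharge="500.24" passThreshold="true"/>
    <SpectrumIdentificationItem id="SII_2" rank="2" chargeState="2"
        peptide_ref="PEP_2" experimentalMassToCharge="500.25"
        calculatedMassToCharge="500.30" passThreshold="false"/>
  </SpectrumIdentificationResult>
</MzIdentML>"""


def passing_psms(xml_text):
    """Return (id, peptide_ref, charge) for every PSM flagged as passing the threshold."""
    root = ET.fromstring(xml_text)
    hits = []
    for item in root.iterfind(".//mzid:SpectrumIdentificationItem", NS):
        if item.get("passThreshold") == "true":
            hits.append((item.get("id"), item.get("peptide_ref"), int(item.get("chargeState"))))
    return hits


psms = passing_psms(FRAGMENT)
```

For real files, the reference libraries listed later in this deck (for example jmzIdentML) are the safer route.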
24. Support for mzIdentML
[Diagram: search engine results + MS files; search engines; mzIdentML]
- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb's lab
- OpenMS
- PEAKS
- PeptideShaker (several open source tools)
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- X!Tandem (from PILEDRIVER version)
- Others: library for X!Tandem conversion, lab internal pipelines, …
- Crux
An increasing number of tools support export to mzIdentML 1.1
Updated list: http://www.psidev.info/tools-implementing-
25. Data visualisation: PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., MCP, 2016, in press
PRIDE Inspector Toolsuite supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab identification and quantification + all types of spectra files
https://github.com/PRIDE-Toolsuite/
26. PRIDE Inspector Toolsuite
https://github.com/PRIDE-Toolsuite/
New visualisation functionality for Protein Groups
28. Data formats for output from search engines: quantification results
• There is software that does:
  • Only quantification.
  • Identification and quantification together.
mzIdentML, mzTab
MaxQuant output files
OpenMS output files
Progenesis output files
…
30. mzQuantML: Standard for quantitative data
Overview
• XML-based standard for quantification data, following use of quant software
• Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios; rows are ProteinGroups, Proteins or Peptides
• Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
• Version 1.0 rc-1 submitted to the PSI process October 2011; Version 1.0 rc-2 June 2012; Re-submitted to PSI process in October 2012
• Completed PSI process in Feb 2013: version 1.0 release
• Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques (e.g. SILAC)
• Schema is fixed with each technique defined by separate semantic rules, implemented in validator software
• Manuscript published in MCP in summer 2013*
• Updated in 2013-2014 to support SRM as a new technique** (version 1.0.1 just submitted to the document process).
*Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
**Qi et al. PROTEOMICS, 2015
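The QuantLayer idea (a table whose columns are assays, study variables or ratios and whose rows are proteins, protein groups or peptides) can be pictured with a small in-memory model. This is only a conceptual sketch: the class, its fields and the example values are hypothetical and do not reproduce the mzQuantML schema or any library API.

```python
from dataclasses import dataclass, field


@dataclass
class QuantLayer:
    """Toy model of one table: fixed column IDs, one row of values per quantified object."""
    data_type: str                      # free-text label here; the real format uses CV terms
    column_ids: list                    # e.g. assay, study variable or ratio identifiers
    rows: dict = field(default_factory=dict)

    def add_row(self, object_id, values):
        # Every row must supply exactly one value per column.
        if len(values) != len(self.column_ids):
            raise ValueError("row length must match the number of columns")
        self.rows[object_id] = list(values)

    def value(self, object_id, column_id):
        return self.rows[object_id][self.column_ids.index(column_id)]


# Hypothetical assays and abundances, for illustration only.
layer = QuantLayer("normalised abundance", ["assay_1", "assay_2"])
layer.add_row("prot_A", [1200.0, 950.5])
layer.add_row("prot_B", [300.0, 310.0])
```

Keeping the column list separate from the rows mirrors the description above, where the same set of columns applies to every row in a layer.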
32. The last addition: mzTab - Aims and concept
• To provide a simple and efficient way of exchanging results from MS approaches.
• Simpler summary report of the experimental results:
  • Peptides and proteins identified in a given experimental setting
  • Small molecules identified
  • Reported quantification values
  • Technical and biological metadata
• Easier to parse and use by the research community, systems biologists as well as providers of knowledge bases.
• It can be used by non-experts in bioinformatics.
• It does not aim to replace mzIdentML and mzQuantML
33. mzTab - Sections
• Metadata (key-value pairs): basic information about experiment and sample
• Protein (table-based): basic information about protein identifications
• Peptide (table-based): information about quantified peptides
• PSM (table-based): information about identified spectra
• Small Molecule (table-based): basic information about identified small molecules
Griss et al., MCP, 2014
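Because mzTab is tab-delimited and each line starts with a short prefix naming its section (MTD for metadata; PRH/PRT, PEH/PEP, PSH/PSM and SMH/SML for the header and data rows of the four tables), a basic reader fits in a few lines. The sketch below assumes that prefix layout; the sample file is drastically reduced (a valid mzTab file has many more mandatory fields) and the function name `parse_mztab` is hypothetical.

```python
# Reduced, illustrative sample; not a valid mzTab file.
SAMPLE = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tIllustrative example",
    "PRH\taccession\tdescription",
    "PRT\tP12345\tExample protein",
    "PSH\tsequence\tPSM_ID\taccession",
    "PSM\tPEPTIDEK\t1\tP12345",
])

# Maps each header-line prefix to the prefix of its data rows.
HEADER_FOR = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def parse_mztab(text):
    """Split an mzTab-like text into a metadata dict and one list of row dicts per table."""
    metadata = {}
    headers = {}
    tables = {"PRT": [], "PEP": [], "PSM": [], "SML": []}
    for line in text.splitlines():
        if not line.strip() or line.startswith("COM"):
            continue  # skip blank lines and comment lines
        prefix, *fields = line.split("\t")
        if prefix == "MTD":
            metadata[fields[0]] = fields[1]
        elif prefix in HEADER_FOR:
            headers[HEADER_FOR[prefix]] = fields
        elif prefix in tables:
            tables[prefix].append(dict(zip(headers[prefix], fields)))
    return metadata, tables


metadata, tables = parse_mztab(SAMPLE)
```

This is the sense in which the format is "easier to parse": no XML tooling is needed, and the tables load directly into spreadsheet or data-frame software.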
36. mzTab - ongoing development in metabolomics
• More detailed modelling of MS metabolomics data
• Led by S. Neumann (COSMOS EU FP7 project).
• Extension from one to three sections.
Example file exists at https://github.com/sneumann/mtbls2/faahKO.mzTab
http://www.cosmos-fp7.eu/
38. Unify exchange of transitions with TraML
• PSI's TraML (Transitions Markup Language)
• Format for encoding SRM/MRM transitions
• Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
[Diagram: TraML shown together with journal articles, transitions databases, Excel sheets, SRM analysis software and instruments]
39. Unify exchange of transitions with TraML
Deutsch et al., MCP, 2012
40. PSI document process
• Every data standard has to undergo a thorough review process…
• In fact, in practice, two review processes happen in parallel: the PSI and manuscript review.
41. Data standard publications
mzML (data standard for MS data): Martens et al., MCP, 2011
mzIdentML (standard for peptide/protein IDs): Jones et al., MCP, 2012
TraML (for SRM transitions): Deutsch et al., MCP, 2012
mzQuantML (for quantitative data): Walzer et al., MCP, 2013
mzTab (peptide/protein ID and quantification): Griss et al., MCP, 2014
Some updates already going on (e.g. mzIdentML 1.2 about to be submitted)
42. Importance of making software available
jmzML (https://github.com/PRIDE-Utilities/jmzml) Cote et al., Proteomics, 2009
jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML) Reisinger et al., Proteomics, 2012
jmzReader (https://github.com/PRIDE-Utilities/jmzReader) Griss et al., Proteomics, 2012
jmzQuantML (https://github.com/UKQIDA/jmzquantml) Qi et al., Proteomics, 2014
jmzTab (https://github.com/PRIDE-Utilities/jmzTab) Xu et al., Proteomics, 2014
ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api) Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always open source and can be used by anyone!
44. Under development: Proteogenomics related formats
• Two formats are currently being developed: proBed and proBAM.
• Same overall objective: to map identified peptides to genome coordinates.
• Different level of detail:
  • proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
  • proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
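Since proBed builds on BED, the genome-coordinate part of a record can be pictured with the first six standard BED columns (chromosome, start, end, name, score, strand). The sketch below parses only those core columns; the additional proteomics-specific columns that proBed defines are deliberately left out, and the example line, the class and the function name are hypothetical.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int   # BED starts are 0-based
    end: int     # BED ends are exclusive
    name: str
    score: int
    strand: str


def parse_bed_line(line):
    """Parse the six core BED columns from one tab-delimited line; extra columns are ignored."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, name, score, strand = fields[:6]
    return BedInterval(chrom, int(start), int(end), name, int(score), strand)


# Hypothetical record: a peptide mapped to 30 bases on the forward strand of chr1.
record = parse_bed_line("chr1\t1000\t1030\tPEPTIDEK\t900\t+")
length = record.end - record.start
```

Because the start is 0-based and the end is exclusive, the mapped length is simply end minus start.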
46. And also… protein-protein interactions
PSI-XML: XML-based format
• Version 2.5 is the working version
• Version 3.0 under development
MITAB: tab-delimited format
48. ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• EU FP7 CA (01/2011 -> 06/2014).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
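A common identifier space makes dataset accessions easy to check and link programmatically. The sketch below assumes that PXD identifiers consist of "PXD" followed by six digits (as in PXD000001) and that ProteomeCentral accepts the accession in an `ID` query parameter; both are assumptions, not statements from this deck. The base URL is the ProteomeCentral address quoted elsewhere in the slides.

```python
import re

# Assumed accession shape: "PXD" plus six digits.
PXD_PATTERN = re.compile(r"PXD\d{6}")

# Base address as quoted in this deck; the "ID" parameter below is an assumption.
PROTEOMECENTRAL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def is_pxd(accession):
    """True if the whole string matches the assumed PXD accession pattern."""
    return PXD_PATTERN.fullmatch(accession) is not None


def dataset_url(accession):
    """Build a ProteomeCentral link for a syntactically valid accession."""
    if not is_pxd(accession):
        raise ValueError(f"not a PXD accession: {accession!r}")
    return f"{PROTEOMECENTRAL}?ID={accession}"


url = dataset_url("PXD000001")
```

Validating before building the link keeps typos in manuscripts or spreadsheets from turning into dead URLs.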
49. PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
50. ProteomeXchange data workflow
[Workflow diagram. Labels: Researcher's results; Metadata / Manuscript; Raw data*; Results; Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data); ProteomeCentral; Journals; UniProt/neXtProt; PeptideAtlas; GPMDB; Other DBs; Reprocessed results]
51. ProteomeXchange: 3,802 datasets up until 1st April, 2016
Origin:
885 USA
465 Germany
342 United Kingdom
264 China
194 France
158 Netherlands
136 Canada
126 Switzerland
107 Denmark
104 Spain
99 Australia
95 Japan
72 Belgium
68 Austria
63 Sweden
61 India
51 Norway
43 Taiwan
30 Italy
29 Brazil
28 Singapore
28 Finland
27 Ireland
27 Russia
26 Israel …
Type:
2429 PRIDE partial
1016 PRIDE complete
250 MassIVE
84 PeptideAtlas/PASSEL complete
23 Reprocessed
Publicly Accessible:
1973 datasets, 52% of all
91% PRIDE
5% MassIVE
4% PASSEL
Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 1758
2016: 452
Top Species studied by at least 20 datasets:
1526 Homo sapiens
485 Mus musculus
150 Saccharomyces cerevisiae
121 Arabidopsis thaliana
102 Rattus norvegicus
86 Escherichia coli
44 Bos taurus
35 Drosophila melanogaster
32 Glycine max
~ 700 species in total
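The figures on this slide are internally consistent, which a few lines of arithmetic confirm. The sketch below uses only numbers quoted above; the variable names are illustrative.

```python
# Counts copied from the slide.
datasets_per_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 452}
datasets_by_type = {
    "PRIDE partial": 2429,
    "PRIDE complete": 1016,
    "MassIVE": 250,
    "PeptideAtlas/PASSEL complete": 84,
    "Reprocessed": 23,
}
publicly_accessible = 1973

total_by_year = sum(datasets_per_year.values())
total_by_type = sum(datasets_by_type.values())

# Share of datasets that are public, rounded to the nearest percent.
public_share = round(100 * publicly_accessible / total_by_year)
```

Both breakdowns add up to the 3,802 datasets in the title, and 1973 of 3,802 rounds to the quoted 52%.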
52. PRIDE Archive submitted datasets up until 1st April, 2016
• In the last year: ~150 submitted datasets per month
• Size: ~220 TB
53. ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
55. Conclusions
• The PSI is a very active group, developing and maintaining data standards in the computational proteomics field.
• Big variety of data types in proteomics -> Several data standards available.
• Adoption of standards (as usual) takes some time.
• Public databases (e.g. ProteomeXchange) greatly benefit from them.
56. Conclusions: Reproducibility
• Share your data, and profit from others' data
• Deposit all data in publicly accessible repositories (ProteomeXchange)
• Use PSI standard formats (mzML, mzQuantML, mzTab)
• Include accessions in your manuscript
• Include as much detail as needed (Supp Mat!): "computational SOP"
• Software and DB versions used
• Ideally: share the whole workflow, that's the best documentation!
If I have seen further it is by standing on the shoulders of giants.
Isaac Newton (sort of)
57. Acknowledgements and further reading…
http://www.psidev.info