Proteomics data standards

Introduction to the PSI standard data formats
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK

Juan A. Vizcaíno
juan@ebi.ac.uk
EuPA workshop on standardisation
Istanbul, 22 June 2016
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange

Standards are needed in life: also in bioinformatics…
With a small number of standards, data converters are feasible
Data standards are needed

Taken from Biocomical, http://biocomicals.blogspot.com

Mass Spectrometry (MS)-based proteomics
• Many different workflows -> Many different data types -> Need for several data standards.
• Discovery mode:
  • Bottom-up proteomics
  • Data dependent acquisition
  • Data independent acquisition
  • Top down proteomics
• Targeted mode:
  • SRM (Selected Reaction Monitoring)

Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx

HUPO Proteomics Standards Initiative
• Develops data format standards for proteomics.
• Both data representation and annotation standards.
• Involves data producers, database providers, software producers, publishers, …
• Active Workgroups: MI, MS, PI, Mod.
• Inter-group activities: MIAPE and Controlled Vocabularies.
• Started in 2002, so some experience already…
• One annual meeting in March-April, regular phone calls.
• Close interaction with the metabolomics community.
http://www.psidev.info

PSI Deliverables
• Minimum information (MIAPE) specifications: Format-independent specification of minimum information guidelines.
• Formats: Usually an XML schema (but also tab-delimited files) capable of representing the relevant Minimum Information, plus additional detailed data for the domain.
• Controlled vocabularies: Usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats.
• Databases and Tools: Foster software implementations to make the standards truly useful.
• Community interaction to ensure deposition of data in public repositories.

MIAPE guidelines
• Minimum Information About a Proteomics Experiment guidelines.
• Set of experimental and technical metadata that are needed to make one experiment reproducible.
• They cover different aspects: mass spectrometry, informatics (identification and quantification), particular techniques, etc.
• Published since 2008, but their adoption has been limited.

PSI MS Controlled Vocabulary
Mayer et al., Database, 2013
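
To make the idea of an OBO-style controlled vocabulary concrete, here is a minimal sketch of reading term stanzas from OBO flat-file text. The two sample stanzas are hand-written and abbreviated for illustration (the released psi-ms.obo file is the authoritative source for accessions, names and parents), and the function name `parse_obo_terms` is hypothetical.

```python
from typing import Dict, List, Optional

# Hand-written sample in OBO flat-file syntax. The stanzas are illustrative
# and abbreviated; check the released psi-ms.obo for the authoritative terms.
SAMPLE_OBO = """\
[Term]
id: MS:1000511
name: ms level
is_a: MS:1000499 ! spectrum attribute

[Term]
id: MS:1000127
name: centroid spectrum
is_a: MS:1000525 ! spectrum representation
"""


def parse_obo_terms(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Return {term id: {tag: [values]}} for every [Term] stanza."""
    terms: Dict[str, Dict[str, List[str]]] = {}
    current: Optional[Dict[str, List[str]]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            continue
        if not line or current is None or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        tag = tag.strip()
        # Drop the trailing "! human-readable comment" that OBO allows.
        value = value.split(" ! ")[0].strip()
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms


if __name__ == "__main__":
    for accession, tags in parse_obo_terms(SAMPLE_OBO).items():
        print(accession, tags["name"][0], "is_a", tags.get("is_a", []))
```

The hierarchy comes from the `is_a` tags: each term points at its parent, so a format only needs to store the accession and the vocabulary supplies the meaning.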

The typical dilemma
• Data standards need to be stable to promote adoption
• Proteomics standards need to evolve very rapidly:
  • Data is inherently very complex
  • Experimental techniques are evolving all the time

Existing data standards in proteomics
• MS data: mzML (also used in MS metabolomics).
• Protein and peptide identifications: mzIdentML.
• Peptide and protein quantification: mzQuantML (also supports small molecules).
• SRM transitions (for targeted proteomics): TraML.
• mzTab: identification and quantification results for peptides, proteins and small molecules (also used in MS metabolomics).
• Molecular interactions: PSI MI XML and MITAB.
www.psidev.info

Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final Results
• TraML: SRM

Data formats for mass spectra data
• Binary data
• XML-based files: mzData, mzXML, mzML
• Peak lists: .dta, .pkl, .mgf, .ms2

An example of a success story: mzML
• A data format for the storage and exchange of MS output files
• Designed by merging the best aspects of both mzData and mzXML
• Developed with full participation of academic researchers, hardware and software vendors
• Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats
• Captures spectra (raw data or peak lists), chromatograms and related metadata
• Version 1.0 released in June 2008, v1.1 released in June 2009
• Many implementations already exist
• Version 1.2 with enhanced compression considered for 2014
Martens et al., MCP, 2011
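
As a rough illustration of how spectra end up inside an XML file: mzML stores each m/z or intensity array as base64 text holding little-endian 32- or 64-bit floats, optionally zlib-compressed. The sketch below decodes and encodes such an array with the Python standard library only. The function names are hypothetical, and in a real file the precision and compression are declared by cvParam elements on each binaryDataArray rather than passed in as arguments.

```python
import base64
import struct
import zlib
from typing import List, Sequence


def decode_binary_array(text: str, double_precision: bool = True,
                        zlib_compressed: bool = False) -> List[float]:
    """Decode the text content of one base64-encoded float array."""
    raw = base64.b64decode(text)
    if zlib_compressed:
        raw = zlib.decompress(raw)
    size, code = (8, "d") if double_precision else (4, "f")
    count = len(raw) // size
    # "<" forces little-endian byte order, as used for these arrays.
    return list(struct.unpack(f"<{count}{code}", raw))


def encode_binary_array(values: Sequence[float], double_precision: bool = True,
                        zlib_compressed: bool = False) -> str:
    """Inverse of decode_binary_array, handy for round-trip checks."""
    code = "d" if double_precision else "f"
    raw = struct.pack(f"<{len(values)}{code}", *values)
    if zlib_compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    mz_values = [100.5, 200.25, 300.125]  # made-up m/z values
    encoded = encode_binary_array(mz_values, zlib_compressed=True)
    print(encoded)
    print(decode_binary_array(encoded, zlib_compressed=True))
```

In practice one of the existing parser libraries would handle this step; the sketch only shows why the "binary inside XML" design stays readable with generic tools.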

An example of a success story: mzML
• The most popular search engines support mzML
• Many parser libraries available
• Conversion from raw files into mzML
http://www.psidev.info/mzml_1_0_0

Application of mzML to metabolomics

Data formats for output from search engines
• mzIdentML, mzTab
• mascot .dat, sequest .out, SpectrumMill .spo
• pep.xml, prot.xml
Only qualitative data!

mzIdentML: peptide and protein identifications
• Overview
  • XML-based data standard for peptide and protein identifications, e.g. following database search and protein inference.
  • Sections for all PSMs, proteins/protein groups inferred, protocols/parameters etc.
• Timeline:
  • Original 1.0 version in Aug 2009.
  • Version 1.1 stable (Aug 2011).
  • Manuscript published in MCP in 2012*.
  • 2012-2015: improving support for protein grouping, multiple search engines, pre-fractionation approaches and de novo sequencing.
• Now firmly embedded as part of the ProteomeXchange submission process, and supported by lots of external software.
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.

An example: XML snippet of mzIdentML
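
For readers without the slide image, the sketch below parses a hand-written, heavily simplified mzIdentML-style fragment (not the snippet from the original slide, and not schema-valid on its own). The element and attribute names follow mzIdentML 1.1 as far as I am aware; all ids, m/z values and the score are made up, and the helper name `list_psms` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

MZID_NS = "http://psidev.info/psi/pi/mzIdentML/1.1"

# Hand-written, abbreviated fragment: ids, m/z values and the score are
# invented, and a real file wraps this in many more elements.
SNIPPET = """\
<SpectrumIdentificationResult xmlns="http://psidev.info/psi/pi/mzIdentML/1.1"
    id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
  <SpectrumIdentificationItem id="SII_1_1" rank="1" chargeState="2"
      peptide_ref="PEP_1" experimentalMassToCharge="500.27"
      calculatedMassToCharge="500.26" passThreshold="true">
    <cvParam cvRef="PSI-MS" accession="MS:1001171" name="Mascot:score"
        value="45.2"/>
  </SpectrumIdentificationItem>
</SpectrumIdentificationResult>
"""


def list_psms(xml_text: str) -> List[Dict[str, object]]:
    """Collect one dictionary per peptide-spectrum match (PSM)."""
    root = ET.fromstring(xml_text)
    psms: List[Dict[str, object]] = []
    for result in root.iter(f"{{{MZID_NS}}}SpectrumIdentificationResult"):
        for item in result.findall(f"{{{MZID_NS}}}SpectrumIdentificationItem"):
            # Scores are expressed as controlled-vocabulary parameters.
            scores = {cv.get("name"): cv.get("value")
                      for cv in item.findall(f"{{{MZID_NS}}}cvParam")}
            psms.append({
                "spectrum": result.get("spectrumID"),
                "peptide_ref": item.get("peptide_ref"),
                "rank": int(item.get("rank")),
                "charge": int(item.get("chargeState")),
                "scores": scores,
            })
    return psms


if __name__ == "__main__":
    for psm in list_psms(SNIPPET):
        print(psm)
```

Note that the score is not a dedicated XML attribute: it is a cvParam whose accession ties it back to the PSI-MS controlled vocabulary, which is how the schema stays stable while search engines come and go.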

Support for mzIdentML
[Diagram labels: Search engine results + MS files; Search engines; mzIdentML]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- PeptideShaker (several open source tools)
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- X!Tandem (from PILEDRIVER version)
- Others: library for X!Tandem conversion, lab internal pipelines, …
- Crux
Updated list: http://www.psidev.info/tools-implementing-

Data visualisation: PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., MCP, 2016, in press
PRIDE Inspector Toolsuite supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab identification and quantification + all types of spectra files
https://github.com/PRIDE-Toolsuite/

New visualisation functionality for Protein Groups
https://github.com/PRIDE-Toolsuite/

Data formats for output from search engines: quantification results
• There is software that does:
  • Only quantification.
  • Identification and quantification together.
mzIdentML, mzTab
MaxQuant output files
OpenMS output files
Progenesis output files
…

Wide variety of quantification techniques

mzQuantML: Standard for quantitative data
Overview
• XML-based standard for quantification data, following use of quant software
• Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios; rows are ProteinGroups, Proteins or Peptides
• Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
• Version 1.0 rc-1 submitted to the PSI process October 2011; Version 1.0 rc-2 June 2012; resubmitted to PSI process in October 2012
• Completed PSI process in Feb 2013: version 1.0 release
• Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques (e.g. SILAC)
• Schema is fixed, with each technique defined by separate semantic rules, implemented in validator software
• Manuscript published in MCP in summer 2013*
• Updated in 2013-2014 to support SRM as a new technique** (version 1.0.1 just submitted to the document process).
*Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
**Qi et al. PROTEOMICS, 2015

The last addition: mzTab – Aims and concept
• To provide a simple and efficient way of exchanging results from MS approaches.
• Simpler summary report of the experimental results:
  • Peptides and proteins identified in a given experimental setting
  • Small molecules identified
  • Reported quantification values
  • Technical and biological metadata
• Easier to parse and use by the research community, systems biologists as well as providers of knowledge bases.
• It can be used by non-experts in bioinformatics.
• It does not aim to replace mzIdentML and mzQuantML.

mzTab - Sections
• Metadata (key-value pairs): basic information about experiment and sample
• Protein (table-based): basic information about protein identifications
• Peptide (table-based): information about quantified peptides
• PSM (table-based): information about identified spectra
• Small Molecule (table-based): basic information about identified small molecules
Griss et al., MCP, 2014
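
The section layout above maps directly onto line prefixes in the file: metadata lines start with MTD, and each table has a header line (PRH, PEH, PSH, SMH) followed by row lines (PRT, PEP, PSM, SML). The sketch below, a minimal reader rather than a validating parser, splits a small sample into those sections. The sample content is invented and shows only a few columns, whereas real files carry more mandatory columns; the function name `parse_mztab` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Header prefix -> row prefix for the four table sections.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}

# Invented sample with only a few columns per table.
SAMPLE_MZTAB = "\n".join("\t".join(fields) for fields in [
    ["MTD", "mzTab-version", "1.0.0"],
    ["MTD", "mzTab-mode", "Summary"],
    ["MTD", "mzTab-type", "Identification"],
    ["PRH", "accession", "description", "taxid"],
    ["PRT", "P12345", "Example protein", "9606"],
    ["PSH", "sequence", "PSM_ID", "accession", "charge"],
    ["PSM", "PEPTIDEK", "1", "P12345", "2"],
])

Row = Dict[str, str]


def parse_mztab(text: str) -> Tuple[Dict[str, str], Dict[str, List[Row]]]:
    """Return (metadata, tables) where tables maps PRT/PEP/PSM/SML to rows."""
    metadata: Dict[str, str] = {}
    headers: Dict[str, List[str]] = {}
    tables: Dict[str, List[Row]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "MTD" and len(fields) >= 3:
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = fields[1:]
        elif prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields[1:])))
        # Comment lines (COM) and unknown prefixes are skipped.
    return metadata, dict(tables)


if __name__ == "__main__":
    meta, tabs = parse_mztab(SAMPLE_MZTAB)
    print(meta)
    print(tabs["PRT"])
    print(tabs["PSM"])
```

Because every line is self-describing through its prefix, the same file also opens cleanly in a spreadsheet, which is the point of the "usable by non-experts" design goal.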

Metadata section - Example

Protein Section (label-free)

mzTab – ongoing development in metabolomics
• More detailed modelling of MS metabolomics data
• Led by S. Neumann (COSMOS EU FP7 project).
• Extension from one to three sections.
Example file exists at
https://github.com/sneumann/mtbls2/faahKO.mzTab
http://www.cosmos-fp7.eu/

Unify exchange of transitions with TraML
• PSI’s TraML (Transitions Markup Language)
• Format for encoding SRM/MRM transitions
• Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
[Diagram: TraML at the centre, connecting journal articles, transitions databases, Excel sheets, SRM analysis software and instruments]

Unify exchange of transitions with TraML
Deutsch et al., MCP, 2012

PSI document process
• Every data standard has to undergo a thorough review process…
• In fact, in practice, two review processes happen in parallel: the PSI and manuscript review.

Data standard publications
mzML (data standard for MS data) Martens et al., MCP, 2011
mzIdentML (standard for peptide/protein IDs) Jones et al., MCP, 2012
TraML (for SRM transitions) Deutsch et al., MCP, 2012
mzQuantML (for quantitative data) Walzer et al., MCP, 2013
mzTab (peptide/protein ID and quantification) Griss et al., MCP, 2014
Some updates already going on (e.g. mzIdentML 1.2 about to be submitted)

Importance of making software available
jmzML (https://github.com/PRIDE-Utilities/jmzml) Cote et al., Proteomics, 2009
jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML) Reisinger et al., Proteomics, 2012
jmzReader (https://github.com/PRIDE-Utilities/jmzReader) Griss et al., Proteomics, 2012
jmzQuantML (https://github.com/UKQIDA/jmzquantml) Qi et al., Proteomics, 2014
jmzTab (https://github.com/PRIDE-Utilities/jmzTab) Xu et al., Proteomics, 2014
ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api) Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always open source and can be used by anyone!

ProteoGenomics

Under development: Proteogenomics-related formats
• Two formats are under development: proBed and proBAM.
• Same overall objective: to map identified peptides to genome coordinates.
• Different level of detail:
  • proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
  • proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
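
To illustrate what "mapping a peptide to genome coordinates" looks like in a BED-style line, the sketch below builds the 12 standard BED columns for a peptide whose coding sequence spans two exons. It covers only the original BED columns that proBed builds on; the additional proteomics-specific columns of proBed are deliberately left out, and the coordinates, peptide and function name `peptide_to_bed12` are made up for illustration.

```python
from typing import List, Tuple


def peptide_to_bed12(chrom: str, blocks: List[Tuple[int, int]], name: str,
                     strand: str = "+", score: int = 0) -> str:
    """Build one BED12 line from 0-based, half-open genomic blocks."""
    if not blocks:
        raise ValueError("at least one genomic block is required")
    ordered = sorted(blocks)
    chrom_start = ordered[0][0]
    chrom_end = ordered[-1][1]
    # Block sizes are lengths; block starts are relative to chromStart.
    block_sizes = ",".join(str(end - start) for start, end in ordered)
    block_starts = ",".join(str(start - chrom_start) for start, _ in ordered)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart / thickEnd: whole feature
        "0",                     # itemRgb: no specific colour
        len(ordered), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


if __name__ == "__main__":
    # An 8-residue peptide covers 24 nucleotides, here split 12 + 12
    # across two exons (made-up coordinates).
    print(peptide_to_bed12("chr1", [(1000, 1012), (2000, 2012)], "PEPTIDEK"))
```

A line like this is what a genome browser needs to draw the peptide as two boxes joined across the intron.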

Provide your own data to genome browsers

And also… protein-protein interactions
PSI-XML: XML-based format
• Version 2.5 is the working version
• Version 3.0 under development
MITAB: tab-delimited format
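
As a sketch of how simple the tab-delimited variant is to consume: a MITAB 2.5 line has 15 tab-separated columns, multiple values in one column are separated by "|", and "-" marks an empty column. The dictionary keys below are my own labels for those columns (hypothetical names, not part of the format), and the sample line uses placeholder accessions.

```python
from typing import Dict, List

# Descriptive labels for the 15 MITAB 2.5 columns (labels are assumptions).
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_dbs",
    "interaction_ids", "confidence",
]

# Placeholder interaction: the accessions do not refer to real entries.
SAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "-", "-",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)',
    'psi-mi:"MI:0469"(IntAct)', "intact:EBI-0000000", "-",
])


def parse_mitab_line(line: str) -> Dict[str, List[str]]:
    """Split one MITAB 2.5 line into {column label: [values]}."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected 15 columns, got {len(fields)}")
    record: Dict[str, List[str]] = {}
    for key, value in zip(MITAB25_COLUMNS, fields):
        # Naive split: a "|" inside quoted text would need a real tokenizer.
        record[key] = [] if value == "-" else value.split("|")
    return record


if __name__ == "__main__":
    interaction = parse_mitab_line(SAMPLE_LINE)
    print(interaction["id_a"], "<->", interaction["id_b"])
    print(interaction["detection_methods"])
```

The XML format carries far more experimental detail; the tab-delimited form trades that away for one-line-per-interaction simplicity.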

ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• EU FP7 CA (01/2011 -> 06/2014).
• Common identifier space (PXD identifiers)
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014

PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016

ProteomeXchange data workflow
[Diagram labels. Researcher’s results: metadata / manuscript, raw data*, results. Receiving repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), PASSEL (SRM data), other DBs. ProteomeCentral: metadata. Also shown: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs, reprocessed results, raw data*.]

ProteomeXchange: 3,802 datasets up until 1st April, 2016

Origin:
885 USA
465 Germany
342 United Kingdom
264 China
194 France
158 Netherlands
136 Canada
126 Switzerland
107 Denmark
104 Spain
99 Australia
95 Japan
72 Belgium
68 Austria
63 Sweden
61 India
51 Norway
43 Taiwan
30 Italy
29 Brazil
28 Singapore
28 Finland
27 Ireland
27 Russia
26 Israel …

Type:
2429 PRIDE partial
1016 PRIDE complete
250 MassIVE
84 PeptideAtlas/PASSEL complete
23 Reprocessed

Publicly Accessible:
1973 datasets, 52% of all
91% PRIDE
5% MassIVE
4% PASSEL

Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 1758
2016: 452

Top Species studied by at least 20 datasets:
1526 Homo sapiens
485 Mus musculus
150 Saccharomyces cerevisiae
121 Arabidopsis thaliana
102 Rattus norvegicus
86 Escherichia coli
44 Bos taurus
35 Drosophila melanogaster
32 Glycine max
~ 700 species in total

PRIDE Archive submitted datasets up until 1st April, 2016
• In the last year: ~150 submitted datasets per month
• Size: ~ 220TB

ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
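
A small sketch of working with the common identifier space: it checks that an accession looks like a PXD identifier ("PXD" followed by six digits) and builds a ProteomeCentral link from the URL above. The `ID` query parameter and the helper name `dataset_url` are assumptions on my part rather than something stated on the slide.

```python
import re
from urllib.parse import urlencode

PXD_PATTERN = re.compile(r"^PXD\d{6}$")
PROTEOMECENTRAL_URL = "http://proteomecentral.proteomexchange.org/cgi/GetDataset"


def dataset_url(accession: str) -> str:
    """Return the ProteomeCentral page URL for a PXD accession."""
    normalised = accession.strip().upper()
    if not PXD_PATTERN.match(normalised):
        raise ValueError(f"not a PXD accession: {accession!r}")
    # Assumption: the dataset is selected with an "ID" query parameter.
    return PROTEOMECENTRAL_URL + "?" + urlencode({"ID": normalised})


if __name__ == "__main__":
    print(dataset_url("PXD000001"))
```

Validating the accession before building the link catches the common manuscript typo of a missing or extra digit.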

Data reuse is increasing
Vaudel et al., Proteomics, 2016

Conclusions
• The PSI is a very active group, developing and maintaining data standards in the computational proteomics field.
• Wide variety of data types in proteomics -> Several data standards available.
• Adoption of standards (as usual) takes some time.
• Public databases (e.g. ProteomeXchange) greatly benefit from them.

Conclusions: Reproducibility
• Share your data, and profit from others’ data
• Deposit all data in publicly accessible repositories (ProteomeXchange)
• Use PSI standard formats (mzML, mzQuantML, mzTab)
• Include accessions in your manuscript
• Include as much detail as needed (Supp Mat!): ‘computational SOP’
• Software and DB versions used
• Ideally: share the whole workflow; that’s the best documentation!
If I have seen further it is by standing on the shoulders of giants.
Isaac Newton (sort of)

Acknowledgments and further reading…
http://www.psidev.info

Questions?
