Introduction to the PSI standard data
formats
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
EuPA workshop on standardisation
Istanbul, 22 June 2016
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange
Standards are needed in life: also in bioinformatics…
With a small number of standards, data converters are feasible
Data standards are needed
Taken from Biocomical, http://biocomicals.blogspot.com
Mass Spectrometry (MS)-based proteomics
• Many different workflows -> Many different data types -> Need for several data standards.
• Discovery mode:
  • Bottom-up proteomics
  • Data dependent acquisition
  • Data independent acquisition
  • Top down proteomics
• Targeted mode:
  • SRM (Selected Reaction Monitoring)
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
HUPO Proteomics Standards Initiative
• Develops data format standards for proteomics.
• Both data representation and annotation standards.
• Involves data producers, database providers, software producers, publishers, …
• Active Workgroups: MI, MS, PI, Mod.
• Inter-group activities: MIAPE and Controlled Vocabularies.
• Started in 2002, so some experience already…
• One annual meeting in March-April, regular phone calls.
• Close interaction with the metabolomics community.
http://www.psidev.info
PSI Deliverables
• Minimum information (MIAPE) specifications: Format-independent specification of minimum information guidelines.
• Formats: Usually an XML schema (but also tab-delimited files) capable of representing the relevant Minimum Information, plus additional detailed data for the domain.
• Controlled vocabularies: Usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats.
• Databases and Tools: Foster software implementations to make the standards truly useful.
• Community interaction to ensure deposition of data in public repositories.
MIAPE guidelines
• Minimum Information About a Proteomics Experiment guidelines.
• Set of experimental and technical metadata that are needed to make one experiment reproducible.
• They cover different aspects: mass spectrometry, informatics (identification and quantification), particular techniques, etc.
• Published since 2008, but their adoption has been limited.
PSI MS Controlled Vocabulary
Mayer et al., Database, 2013
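To give a feel for what an OBO-style hierarchical vocabulary looks like to software, here is a minimal sketch that reads a tiny hand-written fragment into a dictionary keyed by accession. The two stanzas are illustrative and written from memory of the PSI-MS vocabulary, so accessions, names and parents should be checked against the released file; `parse_obo_terms` is a hypothetical helper, not part of any PSI library.

```python
# Illustrative, hand-written stanzas in OBO style (assumption: verify the
# accessions and names against the released PSI-MS vocabulary before use).
OBO_FRAGMENT = """\
[Term]
id: MS:1000511
name: ms level
is_a: MS:1000499 ! spectrum attribute

[Term]
id: MS:1000514
name: m/z array
is_a: MS:1000513 ! binary data array
"""


def parse_obo_terms(text):
    """Return {accession: {"name": str, "is_a": [parent accessions]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            # Start collecting tags for a new term stanza.
            current = {"name": None, "is_a": []}
            continue
        if line.startswith("["):
            # Any other stanza type (for example [Typedef]) is skipped.
            current = None
            continue
        if not line or current is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        # Drop the trailing "! human readable comment" if present.
        value = value.split(" ! ", 1)[0].strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


terms = parse_obo_terms(OBO_FRAGMENT)
print(terms["MS:1000511"]["name"])
```

The `is_a` links are what make the vocabulary hierarchical: a validator can walk them to check that a term used in a file is a descendant of the branch the format expects at that position.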
The typical dilemma
• Data standards need to be stable to promote adoption
• Proteomics standards need to evolve very rapidly:
  • Data is inherently very complex
  • Experimental techniques are evolving all the time
Existing data standards in proteomics
• MS data: mzML (also used in MS metabolomics).
• Protein and peptide identifications: mzIdentML.
• Peptide and protein quantification: mzQuantML (also supports small molecules).
• SRM transitions (for targeted proteomics): TraML.
• mzTab: identification and quantification results for peptides, proteins and small molecules (also used in MS metabolomics).
• Molecular interactions: PSI MI XML and MITAB.
www.psidev.info
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final Results
• TraML: SRM
Data formats for mass spectra data
• Binary data
• XML-based files: mzData, mzXML, mzML
• Peak lists: .dta, .pkl, .mgf, .ms2
An example of a success story: mzML
• A data format for the storage and exchange of MS output files
• Designed by merging the best aspects of both mzData and mzXML
• Developed with full participation of academic researchers, hardware and software vendors
• Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats
• Captures spectra (raw data or peak lists), chromatograms and related metadata
• Version 1.0 released in June 2008, v1.1 released in June 2009
• Many implementations already exist
• Version 1.2 with enhanced compression considered for 2014
Martens et al., MCP, 2011
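As a rough sketch of how spectra and their metadata are laid out in mzML, the example below reads the MS level of each spectrum from a heavily trimmed, hand-written fragment using only the Python standard library. The fragment is an assumption made for illustration: it is not schema-valid (a real file also carries a cvList, instrument configuration, binary data arrays and more), and `ms_levels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by mzML documents.
NS = {"mz": "http://psi.hupo.org/ms/mzml"}

# Hand-written, heavily trimmed fragment (assumption: NOT schema-valid).
MZML_FRAGMENT = """\
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="example_run">
    <spectrumList count="2">
      <spectrum index="0" id="scan=1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
"""


def ms_levels(xml_text):
    """Map each spectrum id to the value of its 'ms level' cvParam."""
    root = ET.fromstring(xml_text)
    levels = {}
    for spectrum in root.iterfind(".//mz:spectrum", NS):
        for param in spectrum.iterfind("mz:cvParam", NS):
            # Metadata is identified by controlled vocabulary accession,
            # not by the free-text name attribute.
            if param.get("accession") == "MS:1000511":
                levels[spectrum.get("id")] = int(param.get("value"))
    return levels


print(ms_levels(MZML_FRAGMENT))
```

For real files, one of the existing parser libraries is the better choice; loading a whole document into memory as done here does not scale to full acquisitions.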
An example of a success story: mzML
• The most popular search engines support mzML
• Many parser libraries available
• Conversion from raw files into mzML
http://www.psidev.info/mzml_1_0_0
Application of mzML to metabolomics
Data formats for output from search engines
• mzIdentML, mzTab
• Mascot .dat, SEQUEST .out, SpectrumMill .spo
• pep.xml, prot.xml
Only qualitative data!
mzIdentML: peptide and protein identifications
• Overview
  • XML-based data standard for peptide and protein identifications, e.g. following database search and protein inference.
  • Sections for all PSMs, proteins/protein groups inferred, protocols/parameters, etc.
• Timeline:
  • Original 1.0 version in Aug 2009.
  • Version 1.1 stable (Aug 2011).
  • Manuscript published in MCP in 2012*.
  • 2012-2015: improving support for protein grouping, multiple search engines, pre-fractionation approaches and de novo sequencing.
• Now firmly embedded as part of the ProteomeXchange submission process, and supported by lots of external software.
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
An example: XML snippet of mzIdentML
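The snippet shown on this slide was an image and is not reproduced in this transcript. As a stand-in, the sketch below uses a hand-written, heavily trimmed fragment to show how peptide-spectrum matches hang off each identified spectrum. Element and attribute names and the namespace are written from memory of mzIdentML 1.1 and should be checked against the schema; the fragment is not schema-valid, and `passing_psms` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace as recalled for mzIdentML 1.1 (assumption: verify against the schema).
NS = {"mzid": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Hand-written stand-in fragment (assumption: NOT schema-valid, ids are made up).
MZID_FRAGMENT = """\
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL_1">
        <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0"
            spectraData_ref="SD_1">
          <SpectrumIdentificationItem id="SII_1" rank="1" chargeState="2"
              peptide_ref="PEP_1" passThreshold="true"/>
          <SpectrumIdentificationItem id="SII_2" rank="2" chargeState="2"
              peptide_ref="PEP_2" passThreshold="false"/>
        </SpectrumIdentificationResult>
      </SpectrumIdentificationList>
    </AnalysisData>
  </DataCollection>
</MzIdentML>
"""


def passing_psms(xml_text):
    """List (spectrumID, peptide_ref, rank) for PSMs flagged as passing."""
    root = ET.fromstring(xml_text)
    hits = []
    for result in root.iterfind(".//mzid:SpectrumIdentificationResult", NS):
        for item in result.iterfind("mzid:SpectrumIdentificationItem", NS):
            if item.get("passThreshold") == "true":
                hits.append(
                    (result.get("spectrumID"),
                     item.get("peptide_ref"),
                     int(item.get("rank")))
                )
    return hits


print(passing_psms(MZID_FRAGMENT))
```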
Support for mzIdentML
[Diagram labels: search engines; mzIdentML; search engine results + MS files]
An increasing number of tools support export to mzIdentML 1.1:
- Mascot
- MSGF+
- Myrimatch and related tools from D. Tabb's lab
- OpenMS
- PEAKS
- PeptideShaker (several open source tools)
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (from version 5.0)
- X!Tandem (from PILEDRIVER version)
- Others: library for X!Tandem conversion, lab internal pipelines, …
- Crux
Updated list: http://www.psidev.info/tools-implementing-
Data visualisation: PRIDE Inspector Toolsuite
Wang et al., Nat. Biotechnology, 2012
Perez-Riverol et al., MCP, 2016, in press
PRIDE Inspector Toolsuite supports:
- PRIDE XML
- mzIdentML + all types of spectra files
- mzML
- mzTab identification and quantification + all types of spectra files
https://github.com/PRIDE-Toolsuite/
PRIDE Inspector Toolsuite
New visualisation functionality for Protein Groups
https://github.com/PRIDE-Toolsuite/
Data formats for output from search engines: quantification results
• There is software that does:
  • Only quantification.
  • Identification and quantification together.
• mzIdentML, mzTab
• MaxQuant output files
• OpenMS output files
• Progenesis output files
• …
Wide variety of quantification techniques
mzQuantML: Standard for quantitative data
Overview
• XML-based standard for quantification data, as produced by quantification software
• Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios; rows are ProteinGroups, Proteins or Peptides
• Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
• Version 1.0 rc-1 submitted to the PSI process in October 2011; version 1.0 rc-2 in June 2012; re-submitted to the PSI process in October 2012
• Completed the PSI process in Feb 2013 (version 1.0 release)
• Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques (e.g. SILAC)
• Schema is fixed, with each technique defined by separate semantic rules, implemented in validator software
• Manuscript published in MCP in summer 2013*
• Updated in 2013-2014 to support SRM as a new technique** (version 1.0.1 just submitted to the document process)
* Walzer et al., MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
** Qi et al., PROTEOMICS, 2015
The last addition: mzTab - Aims and concept
• To provide a simple and efficient way of exchanging results from MS approaches.
• Simpler summary report of the experimental results:
  • Peptides and proteins identified in a given experimental setting
  • Small molecules identified
  • Reported quantification values
  • Technical and biological metadata
• Easier to parse and use by the research community, systems biologists as well as providers of knowledge bases.
• It can be used by non-experts in bioinformatics.
• It does not aim to replace mzIdentML and mzQuantML.
mzTab - Sections
• Metadata (key-value pairs): Basic information about experiment and sample
• Protein (table-based): Basic information about protein identifications
• Peptide (table-based): Information about quantified peptides
• PSM (table-based): Information about identified spectra
• Small Molecule (table-based): Basic information about identified small molecules
Griss et al., MCP, 2014
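The sections listed above map onto line prefixes in the tab-delimited file: `MTD` for metadata, and a header/row pair for each table (`PRH`/`PRT`, `PEH`/`PEP`, `PSH`/`PSM`, `SMH`/`SML`), with `COM` marking comments. A minimal sketch of splitting a file along those prefixes follows. The fragment is hypothetical and omits most columns a valid mzTab file must carry, and `read_mztab` is an illustrative helper rather than the jmzTab API.

```python
# Hypothetical fragment: real mzTab files need many more mandatory
# metadata entries and table columns than are shown here.
MZTAB_FRAGMENT = (
    "MTD\tmzTab-version\t1.0.0\n"
    "MTD\tdescription\tHypothetical example\n"
    "\n"
    "PRH\taccession\tdescription\n"
    "PRT\tP12345\tExample protein\n"
    "\n"
    "PSH\tsequence\tPSM_ID\taccession\n"
    "PSM\tPEPTIDEK\t1\tP12345\n"
    "PSM\tSAMPLER\t2\tP12345\n"
)

# Each table header prefix announces the prefix of the rows that follow it.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text):
    """Split mzTab text into a metadata dict and per-section row lists."""
    metadata = {}
    headers = {}
    tables = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate sections
        fields = line.split("\t")
        prefix, rest = fields[0], fields[1:]
        if prefix == "COM":
            continue  # comment line
        if prefix == "MTD":
            metadata[rest[0]] = rest[1]
        elif prefix in HEADER_TO_ROW:
            row_prefix = HEADER_TO_ROW[prefix]
            headers[row_prefix] = rest
            tables[row_prefix] = []
        elif prefix in headers:
            # Pair each value with the column name from the section header.
            tables[prefix].append(dict(zip(headers[prefix], rest)))
    return metadata, tables


metadata, tables = read_mztab(MZTAB_FRAGMENT)
print(metadata["mzTab-version"], len(tables["PSM"]))
```

Because every line is self-describing through its prefix, the same file can also be opened in a spreadsheet or filtered with standard command-line tools, which is much of the format's appeal for non-experts.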
Metadata section - Example
Protein Section (label-free)
mzTab - ongoing development in metabolomics
• More detailed modelling of MS metabolomics data
• Led by S. Neumann (COSMOS EU FP7 project).
• Extension from one to three sections.
Example file exists at https://github.com/sneumann/mtbls2/faahKO.mzTab
http://www.cosmos-fp7.eu/
Unify exchange of transitions with TraML
• PSI's TraML (Transitions Markup Language)
• Format for encoding SRM/MRM transitions
• Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
[Diagram labels: journal articles, transitions databases, Excel sheets, SRM analysis software and instruments, connected through TraML]
Unify exchange of transitions with TraML
Deutsch et al., MCP, 2012
PSI document process
• Every data standard has to undergo a thorough review process…
• In practice, two review processes happen in parallel: the PSI review and the manuscript review.
Data standard publications
• mzML (data standard for MS data): Martens et al., MCP, 2011
• mzIdentML (standard for peptide/protein IDs): Jones et al., MCP, 2012
• TraML (for SRM transitions): Deutsch et al., MCP, 2012
• mzQuantML (for quantitative data): Walzer et al., MCP, 2013
• mzTab (peptide/protein ID and quantification): Griss et al., MCP, 2014
Some updates already going on (e.g. mzIdentML 1.2 about to be submitted)
Importance of making software available
• jmzML (https://github.com/PRIDE-Utilities/jmzml): Cote et al., Proteomics, 2009
• jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML): Reisinger et al., Proteomics, 2012
• jmzReader (https://github.com/PRIDE-Utilities/jmzReader): Griss et al., Proteomics, 2012
• jmzQuantML (https://github.com/UKQIDA/jmzquantml): Qi et al., Proteomics, 2014
• jmzTab (https://github.com/PRIDE-Utilities/jmzTab): Xu et al., Proteomics, 2014
• ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api): Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always open source and can be used by anyone!
ProteoGenomics
Under development: Proteogenomics related formats
• Two formats are under development: proBed and proBAM.
• Same overall objective: to map identified peptides to genome coordinates.
• Different level of detail:
  • proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
  • proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
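Since proBed builds on BED, a peptide-to-genome mapping starts from the familiar BED core fields. The sketch below parses only those standard fields (chromosome, 0-based start, exclusive end, name, score, strand) and keeps anything after them untouched. The example line is hypothetical, and the extra proBed-specific columns are deliberately not interpreted here because the format was still under development.

```python
from typing import List, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int        # 0-based, inclusive (standard BED convention)
    end: int          # exclusive (standard BED convention)
    name: str         # used here for the peptide sequence (assumption)
    strand: str
    extra: List[str]  # any further columns, left uninterpreted


def parse_bed_line(line: str) -> BedRecord:
    """Parse the standard BED core fields from one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    return BedRecord(
        chrom=fields[0],
        start=int(fields[1]),
        end=int(fields[2]),
        name=fields[3],
        strand=fields[5],   # fields[4] is the BED score, skipped here
        extra=fields[6:],
    )


# Hypothetical record: an 8-residue peptide covering 24 nucleotides.
record = parse_bed_line("chr1\t1000\t1024\tPEPTIDEK\t900\t+")
print(record.chrom, record.end - record.start, record.strand)
```

Keeping to the BED layout is what lets such files be loaded into genome browsers alongside genomics tracks.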
Provide your own data to genome browsers
And also… protein-protein interactions
• PSI-MI XML: XML-based format
  • Version 2.5 is the working version
  • Version 3.0 under development
• MITAB: tab-delimited format
ProteomeXchange Consortium
• Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
• Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
• EU FP7 CA (01/2011 -> 06/2014).
• Common identifier space (PXD identifiers).
• Two supported data workflows: MS/MS and SRM.
• Main objective: Make life easier for researchers.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
PRIDE (PRoteomics IDEntifications) database
http://www.ebi.ac.uk/pride
• PRIDE stores mass spectrometry (MS)-based proteomics data:
  • Peptide and protein expression data (identification and quantification)
  • Post-translational modifications
  • Mass spectra (raw data and peak lists)
  • Technical and biological metadata
  • Any other related information
• Full support for tandem MS approaches
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
ProteomeXchange data workflow
[Diagram labels]
• Submitted content: researcher's results, raw data*, metadata / manuscript
• Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data), other DBs
• ProteomeCentral
• Also shown: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs, reprocessed results
ProteomeXchange: 3,802 datasets up until 1st April, 2016
Origin:
885 USA
465 Germany
342 United Kingdom
264 China
194 France
158 Netherlands
136 Canada
126 Switzerland
107 Denmark
104 Spain
99 Australia
95 Japan
72 Belgium
68 Austria
63 Sweden
61 India
51 Norway
43 Taiwan
30 Italy
29 Brazil
28 Singapore
28 Finland
27 Ireland
27 Russia
26 Israel
…
Type:
2429 PRIDE partial
1016 PRIDE complete
250 MassIVE
84 PeptideAtlas/PASSEL complete
23 Reprocessed
Publicly Accessible:
1973 datasets, 52% of all
91% PRIDE
5% MassIVE
4% PASSEL
Datasets/year:
2012: 102
2013: 527
2014: 963
2015: 1758
2016: 452
Top Species studied by at least 20 datasets:
1526 Homo sapiens
485 Mus musculus
150 Saccharomyces cerevisiae
121 Arabidopsis thaliana
102 Rattus norvegicus
86 Escherichia coli
44 Bos taurus
35 Drosophila melanogaster
32 Glycine max
~ 700 species in total
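The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-uses the numbers quoted above; the variable names are introduced here for illustration.

```python
# All numbers are copied from the slide; nothing is taken from other sources.
TOTAL_DATASETS = 3802

datasets_by_type = {
    "PRIDE partial": 2429,
    "PRIDE complete": 1016,
    "MassIVE": 250,
    "PeptideAtlas/PASSEL complete": 84,
    "Reprocessed": 23,
}

datasets_by_year = {2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 452}

PUBLIC_DATASETS = 1973

# Both breakdowns add up to the headline total.
type_total = sum(datasets_by_type.values())
year_total = sum(datasets_by_year.values())

# Share of datasets that are publicly accessible, as a rounded percentage.
public_percent = round(100 * PUBLIC_DATASETS / TOTAL_DATASETS)

print(type_total, year_total, public_percent)
```

Both breakdowns sum to 3,802, and 1,973 public datasets correspond to the quoted 52%. Note that the 2016 figure covers only the first three months of the year.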
PRIDE Archive submitted datasets up until 1st April, 2016
• In the last year: ~150 submitted datasets per month
• Size: ~220 TB
ProteomeCentral: Portal for all PX datasets
http://proteomecentral.proteomexchange.org/cgi/GetDataset
Data reuse is increasing
Vaudel et al., Proteomics, 2016
Conclusions
• The PSI is a very active group, developing and maintaining data standards in the computational proteomics field.
• Big variety of data types in proteomics -> Several data standards available.
• Adoption of standards (as usual) takes some time.
• Public databases (e.g. ProteomeXchange) greatly benefit from them.
Conclusions: Reproducibility
• Share your data, and profit from others' data
• Deposit all data in publicly accessible repositories (ProteomeXchange)
• Use PSI standard formats (mzML, mzQuantML, mzTab)
• Include accessions in your manuscript
• Include as much detail as needed (Supp Mat!): a 'computational SOP'
• Software and DB versions used
• Ideally: share the whole workflow. That's the best documentation!
"If I have seen further it is by standing on the shoulders of giants." Isaac Newton (sort of)
Acknowledgments and further reading…
http://www.psidev.info
Questions?

TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
Ā 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
Ā 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
Ā 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
Ā 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
Ā 

Proteomics data standards

  • 1. Introduction to the PSI standard data formats
      • Dr. Juan Antonio Vizcaíno, Proteomics Team Leader, EMBL-EBI, Hinxton, Cambridge, UK
  • 2. Overview
      • Juan A. Vizcaíno, juan@ebi.ac.uk. EuPA workshop on standardisation, Istanbul, 22 June 2016
      • A couple of slides about the need for data standards
      • The Proteomics Standards Initiative
      • Existing data standards
      • Connection with ProteomeXchange
  • 3. Overview
      • A couple of slides about the need for data standards
      • The Proteomics Standards Initiative
      • Existing data standards
      • Connection with ProteomeXchange
  • 4. Standards are needed in life: also in bioinformatics…
      • With a small number of standards, data converters are feasible
      • Data standards are needed
  • 5. Taken from Biocomical, http://biocomicals.blogspot.com
  • 6. Mass Spectrometry (MS)-based proteomics
      • Many different workflows -> Many different data types -> Need for several data standards.
      • Discovery mode:
          • Bottom-up proteomics
          • Data dependent acquisition
          • Data independent acquisition
          • Top down proteomics
      • Targeted mode:
          • SRM (Selected Reaction Monitoring)
  • 7. Overview
      • A couple of slides about the need for data standards
      • The Proteomics Standards Initiative
      • Existing data standards
      • Connection with ProteomeXchange and IMEx
  • 8. HUPO Proteomics Standards Initiative
      • Develops data format standards for proteomics.
      • Both data representation and annotation standards.
      • Involves data producers, database providers, software producers, publishers, …
      • Active Workgroups: MI, MS, PI, Mod.
      • Inter-group activities: MIAPE and Controlled Vocabularies.
      • Started in 2002, so some experience already…
      • One annual meeting in March-April, regular phone calls.
      • Close interaction with the metabolomics community.
      • http://www.psidev.info
  • 9. PSI Deliverables
      • Minimum information (MIAPE) specifications: Format-independent specification of minimum information guidelines.
      • Formats: Usually an XML schema (but also tab-delimited files) capable of representing the relevant Minimum Information, plus additional detailed data for the domain.
      • Controlled vocabularies: Usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats.
      • Databases and Tools: Foster software implementations to make the standards truly useful.
      • Community interaction to ensure deposition of data in public repositories.
  • 10. MIAPE guidelines
      • Minimum Information About a Proteomics Experiment guidelines.
      • Set of experimental and technical metadata that are needed to make one experiment reproducible.
      • They cover different aspects: mass spectrometry, informatics (identification and quantification), particular techniques, etc.
      • Published since 2008, but their adoption has been limited
  • 11. PSI MS Controlled Vocabulary
      • Mayer et al., Database, 2013
  • 12. Overview
      • A couple of slides about the need of data standards
      • The Proteomics Standards Initiative
      • Existing data standards
      • Connection with ProteomeXchange
  • 13. The typical dilemma
      • Data standards need to be stable to promote adoption
      • Proteomics standards need to evolve very rapidly:
          • Data is inherently very complex
          • Experimental techniques are evolving all the time
  • 14. Existing data standards in proteomics
      • MS data: mzML (also used in MS metabolomics).
      • Protein and peptide identifications: mzIdentML.
      • Peptide and protein quantification: mzQuantML (also supports small molecules).
      • SRM transitions (for targeted proteomics): TraML.
      • mzTab: identification and quantification results for peptides, proteins and small molecules (also used in MS metabolomics).
      • Molecular interactions: PSI MI XML and MITAB.
      • www.psidev.info
  • 15. Current PSI Standard File Formats for MS
      • mzML: MS data
      • mzIdentML: Identification
      • mzQuantML: Quantitation
      • mzTab: Final Results
      • TraML: SRM
  • 16. Data formats for mass spectra data
      • Binary data
      • XML-based files: mzData, mzXML, mzML
      • Peak lists: .dta, .pkl, .mgf, .ms2
  • 17. An example of a success story: mzML
      • A data format for the storage and exchange of MS output files
      • Designed by merging the best aspects of both mzData and mzXML
      • Developed with full participation of academic researchers, hardware and software vendors
      • Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats
      • Captures spectra (raw data or peak lists), chromatograms and related metadata
      • Version 1.0 released in June 2008, v1.1 released in June 2009
      • Many implementations already exist
      • Version 1.2 with enhanced compression considered for 2014
      • Martens et al., MCP, 2011
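
To make the "spectra plus related metadata" point concrete, here is a minimal Python sketch of how a spectrum might be read from an mzML-like fragment. It is an illustration under stated assumptions, not a reference reader: the fragment is heavily abridged (a real mzML file carries more required elements, such as the CV list and instrument configuration), the peak values are invented, and the helper names `encode_array`, `build_example` and `read_spectra` are hypothetical. The namespace and the PSI-MS accessions used (ms level, m/z array, intensity array, 64-bit float, no compression) are the ones mzML 1.1 files normally use.

```python
import base64
import struct
import xml.etree.ElementTree as ET

NS = {"mz": "http://psi.hupo.org/ms/mzml"}
MS_LEVEL = "MS:1000511"         # PSI-MS term "ms level"
MZ_ARRAY = "MS:1000514"         # PSI-MS term "m/z array"
INTENSITY_ARRAY = "MS:1000515"  # PSI-MS term "intensity array"


def encode_array(values):
    """Pack floats as little-endian 64-bit values and base64-encode them."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(packed).decode("ascii")


def binary_array_xml(kind_accession, kind_name, values):
    """Build one uncompressed 64-bit binaryDataArray element (abridged)."""
    return (
        "<binaryDataArray>"
        '<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float"/>'
        '<cvParam cvRef="MS" accession="MS:1000576" name="no compression"/>'
        f'<cvParam cvRef="MS" accession="{kind_accession}" name="{kind_name}"/>'
        f"<binary>{encode_array(values)}</binary>"
        "</binaryDataArray>"
    )


def build_example():
    """Return an abridged, illustrative mzML fragment with invented peaks."""
    return (
        '<mzML xmlns="http://psi.hupo.org/ms/mzml"><run id="example_run">'
        '<spectrumList count="1">'
        '<spectrum id="scan=1" index="0" defaultArrayLength="3">'
        f'<cvParam cvRef="MS" accession="{MS_LEVEL}" name="ms level" value="2"/>'
        '<binaryDataArrayList count="2">'
        + binary_array_xml(MZ_ARRAY, "m/z array", [100.0, 200.5, 300.25])
        + binary_array_xml(INTENSITY_ARRAY, "intensity array", [10.0, 250.0, 42.0])
        + "</binaryDataArrayList></spectrum></spectrumList></run></mzML>"
    )


def read_spectra(xml_text):
    """Decode ms level, m/z and intensity arrays for every spectrum."""
    root = ET.fromstring(xml_text)
    spectra = []
    for spec in root.iterfind(".//mz:spectrum", NS):
        record = {"id": spec.get("id"), "ms_level": None}
        for param in spec.findall("mz:cvParam", NS):
            if param.get("accession") == MS_LEVEL:
                record["ms_level"] = int(param.get("value"))
        for array in spec.iterfind(".//mz:binaryDataArray", NS):
            accessions = {p.get("accession") for p in array.findall("mz:cvParam", NS)}
            raw = base64.b64decode(array.find("mz:binary", NS).text)
            # This sketch only handles uncompressed 64-bit floats.
            values = list(struct.unpack(f"<{len(raw) // 8}d", raw))
            if MZ_ARRAY in accessions:
                record["mz"] = values
            elif INTENSITY_ARRAY in accessions:
                record["intensity"] = values
        spectra.append(record)
    return spectra


if __name__ == "__main__":
    for spectrum in read_spectra(build_example()):
        print(spectrum)
```

The controlled-vocabulary accessions, rather than the element names, say what each binary array holds, which is why the reader inspects `cvParam` entries before decoding. For real files, an established parser library is the safer choice.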
  • 18. An example of a success story: mzML
      • The most popular search engines support mzML
      • Many parser libraries available
      • Conversion from raw files into mzML
      • http://www.psidev.info/mzml_1_0_0
  • 19. Application of mzML to metabolomics
  • 20. Current PSI Standard File Formats for MS
      • mzML: MS data
      • mzIdentML: Identification
      • mzQuantML: Quantitation
      • mzTab: Final Results
      • TraML: SRM
  • 21. Data formats for output from search engines
      • mzIdentML, mzTab
      • Mascot .dat, Sequest .out, SpectrumMill .spo
      • pep.xml, prot.xml
      • Only qualitative data!
  • 22. mzIdentML: peptide and protein identifications
      • Overview
          • XML-based data standard for peptide and protein identifications, e.g. following database search and protein inference.
          • Sections for all PSMs, proteins/protein groups inferred, protocols/parameters etc.
      • Timeline:
          • Original 1.0 version in Aug 2009.
          • Version 1.1 stable (Aug 2011).
          • Manuscript published in MCP in 2012*.
      • 2012-2015:
          • Improving support for protein grouping, multiple search engines, pre-fractionation approaches and de novo sequencing.
          • Now firmly embedded as part of the ProteomeXchange submission process, and supported by lots of external software.
      • * Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
  • 23. An example: XML snippet of mzIdentML
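
The snippet shown on this slide did not survive in the transcript. As a stand-in, the following Python sketch shows what a peptide-spectrum match (PSM) roughly looks like in mzIdentML 1.1 and how it might be read. It is an assumption-laden illustration, not the slide's original example: the fragment is abridged and would not validate against the schema, the peptide sequence, masses and score are invented, and `list_psms` is a hypothetical helper. The element and attribute names follow the mzIdentML 1.1 schema.

```python
import xml.etree.ElementTree as ET

NS = {"id": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Abridged, illustrative fragment: all values below are invented.
EXAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1" id="example" version="1.1.0">
  <SequenceCollection>
    <Peptide id="pep_1"><PeptideSequence>ELVISLIVESK</PeptideSequence></Peptide>
  </SequenceCollection>
  <DataCollection>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL_1">
        <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
          <SpectrumIdentificationItem id="SII_1" rank="1" chargeState="2" peptide_ref="pep_1"
              experimentalMassToCharge="615.36" calculatedMassToCharge="615.35" passThreshold="true">
            <cvParam cvRef="PSI-MS" accession="MS:1001171" name="Mascot:score" value="62.7"/>
          </SpectrumIdentificationItem>
        </SpectrumIdentificationResult>
      </SpectrumIdentificationList>
    </AnalysisData>
  </DataCollection>
</MzIdentML>"""


def list_psms(xml_text):
    """Return one dict per SpectrumIdentificationItem, with its peptide sequence."""
    root = ET.fromstring(xml_text)
    # Peptides are stored once and referenced by id from each PSM.
    sequences = {
        pep.get("id"): pep.findtext("id:PeptideSequence", namespaces=NS)
        for pep in root.iterfind(".//id:Peptide", NS)
    }
    psms = []
    for result in root.iterfind(".//id:SpectrumIdentificationResult", NS):
        for item in result.findall("id:SpectrumIdentificationItem", NS):
            scores = {
                p.get("name"): float(p.get("value"))
                for p in item.findall("id:cvParam", NS)
                if p.get("value")
            }
            psms.append({
                "spectrum_id": result.get("spectrumID"),
                "sequence": sequences.get(item.get("peptide_ref")),
                "charge": int(item.get("chargeState")),
                "rank": int(item.get("rank")),
                "pass_threshold": item.get("passThreshold") == "true",
                "scores": scores,
            })
    return psms


if __name__ == "__main__":
    for psm in list_psms(EXAMPLE):
        print(psm)
```

Search-engine scores are carried as controlled-vocabulary parameters rather than dedicated attributes, which is how one schema can accommodate results from many different search engines.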
  • 24. Support for mzIdentML
      • Search engines + MS files -> search engine results -> mzIdentML
      • An increasing number of tools support export to mzIdentML 1.1:
          • Mascot
          • MSGF+
          • Myrimatch and related tools from D. Tabb's lab
          • OpenMS
          • PEAKS
          • PeptideShaker (several open source tools)
          • ProCon (ProteomeDiscoverer, Sequest)
          • Scaffold
          • TPP via the idConvert tool (ProteoWizard)
          • ProteinPilot (from version 5.0)
          • X!Tandem (from PILEDRIVER version)
          • Crux
          • Others: library for X!Tandem conversion, lab internal pipelines, …
      • Updated list: http://www.psidev.info/tools-implementing-
  • 25. Data visualisation: PRIDE Inspector Toolsuite
      • PRIDE Inspector Toolsuite supports:
          • PRIDE XML
          • mzIdentML + all types of spectra files
          • mzML
          • mzTab identification and quantification + all types of spectra files
      • https://github.com/PRIDE-Toolsuite/
      • Wang et al., Nat. Biotechnology, 2012
      • Perez-Riverol et al., MCP, 2016, in press
  • 26. PRIDE Inspector Toolsuite
      • New visualisation functionality for Protein Groups
      • https://github.com/PRIDE-Toolsuite/
  • 27. Current PSI Standard File Formats for MS
      • mzML: MS data
      • mzIdentML: Identification
      • mzQuantML: Quantitation
      • mzTab: Final Results
      • TraML: SRM
  • 28. Data formats for output from search engines: quantification results
      • There is software that does:
          • Only quantification.
          • Identification and quantification together.
      • mzIdentML, mzTab
      • MaxQuant output files
      • OpenMS output files
      • Progenesis output files
      • …
  • 29. Wide variety of quantification techniques
  • 30. mzQuantML: Standard for quantitative data
      • Overview
          • XML-based standard for quantification data, following use of quant software
          • Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios; rows are ProteinGroups, Proteins or Peptides
          • Can also capture 2D coordinates of quantified regions in LC-MS (Features)
      • Timeline
          • Version 1.0 rc-1 submitted to the PSI process October 2011; Version 1.0 rc-2 June 2012; Re-submitted to PSI process in October 2012
          • Completed PSI process in Feb 2013: version 1.0 release
          • Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques (e.g. SILAC)
          • Schema is fixed, with each technique defined by separate semantic rules, implemented in validator software
          • Manuscript published in MCP in summer 2013*
          • Updated in 2013-2014 to support SRM as a new technique** (version 1.0.1 just submitted to the document process).
      • * Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
      • ** Qi et al. PROTEOMICS, 2015
  • 31. Current PSI Standard File Formats for MS
      • mzML: MS data
      • mzIdentML: Identification
      • mzQuantML: Quantitation
      • mzTab: Final Results
      • TraML: SRM
  • 32. The last addition: mzTab - Aims and concept
      • To provide a simple and efficient way of exchanging results from MS approaches.
      • Simpler summary report of the experimental results
          • Peptides and proteins identified in a given experimental setting
          • Small molecules identified
          • Reported quantification values
          • Technical and biological metadata
      • Easier to parse and use by the research community, systems biologists as well as providers of knowledge bases.
      • It can be used by non-experts in bioinformatics.
      • It does not aim to replace mzIdentML and mzQuantML
  • 33. mzTab - Sections
      • Metadata: basic information about experiment and sample (key-value pairs)
      • Protein: basic information about protein identifications (table-based)
      • Peptide: information about quantified peptides (table-based)
      • PSM: information about identified spectra (table-based)
      • Small Molecule: basic information about identified small molecules (table-based)
      • Griss et al., MCP, 2014
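
The section layout above maps directly onto line prefixes in an mzTab 1.0 file: `MTD` for metadata, a header/row pair for each table (`PRH`/`PRT`, `PEH`/`PEP`, `PSH`/`PSM`, `SMH`/`SML`), and `COM` for comments. The Python sketch below shows how little code is needed to split a file into those sections, which is the "easier to parse" argument in practice. It is a simplified illustration: the example content is invented and uses only a small subset of the columns a real file would contain, and `parse_mztab` is a hypothetical helper, not part of any PSI library.

```python
# Header prefix -> row prefix for the four table-based sections.
HEADER_TO_ROW = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}

# Invented, abridged example; real files carry many more columns.
EXAMPLE_LINES = [
    ["COM", "Illustrative file, not real data"],
    ["MTD", "mzTab-version", "1.0.0"],
    ["MTD", "mzTab-mode", "Summary"],
    ["MTD", "mzTab-type", "Identification"],
    ["PRH", "accession", "description", "best_search_engine_score[1]"],
    ["PRT", "P12345", "Example protein", "50.2"],
    ["PSH", "sequence", "PSM_ID", "accession", "charge"],
    ["PSM", "ELVISLIVESK", "1", "P12345", "2"],
]
EXAMPLE = "\n".join("\t".join(fields) for fields in EXAMPLE_LINES)


def parse_mztab(text):
    """Split mzTab text into a metadata dict and one list of rows per table."""
    metadata = {}
    headers = {}
    tables = {row_prefix: [] for row_prefix in HEADER_TO_ROW.values()}
    for line in text.splitlines():
        if not line.strip() or line.startswith("COM"):
            continue  # skip blank lines and comments
        prefix, *fields = line.split("\t")
        if prefix == "MTD":
            metadata[fields[0]] = fields[1]
        elif prefix in HEADER_TO_ROW:
            headers[HEADER_TO_ROW[prefix]] = fields
        elif prefix in tables:
            # Pair each value with the column name from the section header.
            tables[prefix].append(dict(zip(headers[prefix], fields)))
    return metadata, tables


if __name__ == "__main__":
    meta, sections = parse_mztab(EXAMPLE)
    print(meta)
    print(sections["PRT"])
    print(sections["PSM"])
```

Because every line declares its own section, the file can also be filtered with ordinary text tools or opened in a spreadsheet, which fits the stated aim of serving non-experts in bioinformatics.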
  • 34. Metadata section - Example
  • 35. Protein Section (label-free)
  • 36. mzTab - ongoing development in metabolomics
      • More detailed modelling of MS metabolomics data
      • Led by S. Neumann (COSMOS EU FP7 project).
      • Extension from one to three sections.
      • Example file exists at https://github.com/sneumann/mtbls2/faahKO.mzTab
      • http://www.cosmos-fp7.eu/
  • 37. Current PSI Standard File Formats for MS
      • mzML: MS data
      • mzIdentML: Identification
      • mzQuantML: Quantitation
      • mzTab: Final Results
      • TraML: SRM
  • 38. Unify exchange of transitions with TraML
      • PSI's TraML (Transitions Markup Language)
      • Format for encoding SRM/MRM transitions
      • Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
      • TraML connects: Journal Articles, Transitions Databases, Excel sheets, SRM Analysis Software, Instruments
  • 39. Unify exchange of transitions with TraML
      • Deutsch et al., MCP, 2012
  • 40. PSI document process
      • Every data standard has to undergo a thorough review process…
      • In practice, two review processes happen in parallel: the PSI review and the manuscript review.
  • 41. Data standard publications
      • mzML (data standard for MS data): Martens et al., MCP, 2011
      • mzIdentML (standard for peptide/protein IDs): Jones et al., MCP, 2012
      • TraML (for SRM transitions): Deutsch et al., MCP, 2012
      • mzQuantML (for quantitative data): Walzer et al., MCP, 2013
      • mzTab (peptide/protein ID and quantification): Griss et al., MCP, 2014
      • Some updates already going on (e.g. mzIdentML 1.2 about to be submitted)
  • 42. Importance of making software available
      • jmzML (https://github.com/PRIDE-Utilities/jmzml): Cote et al., Proteomics, 2009
      • jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML): Reisinger et al., Proteomics, 2012
      • jmzReader (https://github.com/PRIDE-Utilities/jmzReader): Griss et al., Proteomics, 2012
      • jmzQuantML (https://github.com/UKQIDA/jmzquantml): Qi et al., Proteomics, 2014
      • jmzTab (https://github.com/PRIDE-Utilities/jmzTab): Xu et al., Proteomics, 2014
      • ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api): Perez-Riverol et al., Bioinformatics, 2015
      • PSI promotes implementations. The reference libraries are always open source and can be used by anyone!
  • 43. ProteoGenomics
  • 44. Under development: Proteogenomics related formats
      • Two formats are being developed: proBed and proBAM.
      • Same overall objective: to map identified peptides to genome coordinates.
      • Different level of detail:
          • proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
          • proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
  • 45. Provide your own data to genome browsers
  • 46. And also… protein-protein interactions
      • PSI-XML: XML-based format
          • Version 2.5 is the working version
          • Version 3.0 under development
      • MITAB: tab-delimited format
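
As an illustration of the tab-delimited MITAB option, the Python sketch below reads one binary interaction in the 15-column MITAB 2.5 layout. Treat it as a sketch under assumptions: the interaction record, accessions and publication are invented, the short column names are hypothetical labels chosen for this example (the specification uses longer descriptive names), and `parse_mitab_line` is not part of any PSI library. Only the column order and the use of `-` for empty fields follow MITAB 2.5.

```python
# Hypothetical short names for the 15 MITAB 2.5 columns, in specification order.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication_ids",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_ids", "confidence",
]

# Invented record for illustration only.
EXAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:12345678",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])


def parse_mitab_line(line):
    """Map one MITAB 2.5 line to a dict; '-' (empty field) becomes None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(MITAB25_COLUMNS):
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    return {
        name: (None if value == "-" else value)
        for name, value in zip(MITAB25_COLUMNS, fields)
    }


if __name__ == "__main__":
    interaction = parse_mitab_line(EXAMPLE_LINE)
    print(interaction["id_a"], "<->", interaction["id_b"])
    print(interaction["detection_method"])
```

One line per binary interaction keeps the format easy to filter and merge, while the XML format remains the option for more detailed experimental descriptions.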
  • 47. Overview
      • A couple of slides about the need of data standards
      • The Proteomics Standards Initiative
      • Existing data standards
      • Connection with ProteomeXchange
  • 48. ProteomeXchange Consortium
      • Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
      • Includes PeptideAtlas (ISB, Seattle), PRIDE (Cambridge, UK) and (very recently) MassIVE (UCSD, San Diego).
      • EU FP7 CA (01/2011 -> 06/2014).
      • Common identifier space (PXD identifiers)
      • Two supported data workflows: MS/MS and SRM.
      • Main objective: Make life easier for researchers
      • http://www.proteomexchange.org
      • Vizcaíno et al., Nat Biotechnol, 2014
  • 49. PRIDE (PRoteomics IDEntifications) database
      • http://www.ebi.ac.uk/pride
      • PRIDE stores mass spectrometry (MS)-based proteomics data:
          • Peptide and protein expression data (identification and quantification)
          • Post-translational modifications
          • Mass spectra (raw data and peak lists)
          • Technical and biological metadata
          • Any other related information
      • Full support for tandem MS approaches
      • Martens et al., Proteomics, 2005
      • Vizcaíno et al., NAR, 2016
  • 50. ProteomeXchange data workflow
      • Submitted data: metadata / manuscript, raw data*, results (researcher's results)
      • Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data)
      • ProteomeCentral
      • Also shown: journals, UniProt/neXtProt, PeptideAtlas, GPMDB, other DBs, reprocessed results
  • 51. ProteomeXchange: 3,802 datasets up until 1st April, 2016
      • Origin: 885 USA, 465 Germany, 342 United Kingdom, 264 China, 194 France, 158 Netherlands, 136 Canada, 126 Switzerland, 107 Denmark, 104 Spain, 99 Australia, 95 Japan, 72 Belgium, 68 Austria, 63 Sweden, 61 India, 51 Norway, 43 Taiwan, 30 Italy, 29 Brazil, 28 Singapore, 28 Finland, 27 Ireland, 27 Russia, 26 Israel, …
      • Type: 2429 PRIDE partial, 1016 PRIDE complete, 250 MassIVE, 84 PeptideAtlas/PASSEL complete, 23 Reprocessed
      • Publicly accessible: 1973 datasets, 52% of all (91% PRIDE, 5% MassIVE, 4% PASSEL)
      • Datasets/year: 2012: 102, 2013: 527, 2014: 963, 2015: 1758, 2016: 452
      • Top species studied by at least 20 datasets: 1526 Homo sapiens, 485 Mus musculus, 150 Saccharomyces cerevisiae, 121 Arabidopsis thaliana, 102 Rattus norvegicus, 86 Escherichia coli, 44 Bos taurus, 35 Drosophila melanogaster, 32 Glycine max
      • ~700 species in total
  • 52. PRIDE Archive submitted datasets up until 1st April, 2016
      • In the last year: ~150 submitted datasets per month
      • Size: ~220 TB
  • 53. ProteomeCentral: Portal for all PX datasets
      • http://proteomecentral.proteomexchange.org/cgi/GetDataset
  • 54. Data reuse is increasing
      • Vaudel et al., Proteomics, 2016
  • 55. Conclusions
      • The PSI is a very active group, developing and maintaining data standards in the computational proteomics field.
      • Big variety of data types in proteomics -> Several data standards available.
      • Adoption of standards (as usual) takes some time.
      • Public databases (e.g. ProteomeXchange) greatly benefit from them.
  • 56. Conclusions: Reproducibility
      • Share your data, and profit from others' data
      • Deposit all data in publicly accessible repositories (ProteomeXchange)
      • Use PSI standard formats (mzML, mzQuantML, mzTab)
      • Include accessions in your manuscript
      • Include as much detail as needed (Supp Mat!): 'computational SOP'
      • Software and DB versions used
      • Ideally: share the whole workflow, that's the best documentation!
      • "If I have seen further it is by standing on the shoulders of giants." Isaac Newton (sort of)
  • 57. Acknowledgments and further reading…
      • http://www.psidev.info
  • 58. Questions?