Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Introduction to PSI Standard Data Formats
1. Introduction to the PSI standard data formats
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK
2. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Overview
• A couple of slides about the need of data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
3. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Overview
• A couple of slides about the need of data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
4. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Standards are needed in real life: also in bioinformatics…
With a small number
of standards,
converters are feasible
Data standards are needed
5. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Taken from Biocomicals, http://biocomicals.blogspot.com
6. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Mass Spectrometry (MS)-based proteomics
• Many different workflows -> Many different data
types -> Need for several data standards.
• Discovery mode:
• Bottom-up proteomics
• Data dependent acquisition (DDA)
• Data independent acquisition (DIA)
• Top down proteomics
• Targeted mode:
• SRM (Selected Reaction Monitoring)
7. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Overview
• A couple of slides about the need of data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
8. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
•Develops data standards for proteomics.
•Both data representation and annotation standards.
•Involves data producers, database providers, software producers,
publishers, everyone who wants to be involved…
•Active Workgroups: MI, MS, PI, Mod and the new QC.
•Inter-group activities: MIAPE and Controlled Vocabularies.
•Started in 2002, so some experience already…
•One annual meeting in March-April, regular phone calls.
•Close interaction with the metabolomics community (MSI).
http://www.psidev.info
HUPO Proteomics Standards Initiative
9. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
PSI Deliverables
•Minimum information (MIAPE) specifications: Format-independent
specification of minimum information guidelines.
•Formats: Usually XML-based (but also tab-delimited files), capable of
representing the relevant Minimum Information, plus additional detailed data
for the domain.
•Controlled vocabularies: Usually an OBO-style hierarchical controlled
vocabulary precisely defining the metadata that are encoded in the formats.
•Databases and Tools: Foster open software implementations to make the
standards truly useful.
•Community interaction to ensure deposition of data in public repositories.
10. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
PSI MS Controlled Vocabulary
Mayer et al., Database, 2013~2,600 terms by October 2016
11. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
The Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ontology-lookup/
12. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
MIAPE guidelines
• Minimum Information About a Proteomics Experiment
guidelines.
• Set of experimental and technical metadata that are needed
to make one experiment reproducible.
• They cover different aspects: mass spectrometry,
informatics (identification and quantification), particular
techniques, etc.
• Published since 2008, but their adoption has been limited
13. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Overview
• A couple of slides about the need of data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
14. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
The typical dilemma
•Data standards need to be stable to promote adoption
•Proteomics standards need to evolve very rapidly:
• Data is inherently very complex
• Experimental techniques are evolving all the time
15. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
•MS data: mzML (also used in MS metabolomics).
•Protein and peptide identification: mzIdentML.
•Peptide and protein quantification: mzQuantML.
•SRM transitions (for targeted proteomics): TraML.
•Molecular interactions: PSI MI XML and MITAB.
• mzTab: identification and quantification results for peptides,
proteins and small molecules (also used in MS metabolomics).
www.psidev.info
Existing data standards in proteomics
16. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Current PSI Standard File Formats for MS
• mzMLMS data
• mzIdentMLIdentification
• mzQuantMLQuantitation
• mzTabFinal Results
• TraMLSRM
17. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Binary data
mzData
mzXML
mzML
XML-based
files
.dta, .pkl, .mgf,
.ms2
Peak lists
Data formats for mass spectra data
18. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
An example of success story: mzML
• A data format for the storage and exchange of MS output files
• Designed by merging the best aspects of both mzData and mzXML
• Developed with full participation of academic researchers, hardware
and software vendors
• Expected to replace mzXML and mzData, but not expected to
completely replace vendor binary formats
• Captures spectra (raw data or peak lists), chromatograms and related
metadata
• Version 1.0 released in June 2008, v1.1 released in June 2009
• Many implementations already exist
• Version 1.2 with enhanced compression considered for the near
future.
Martens et al., MCP, 2011
19. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
An example of success story: mzML
Martens et al., MCP, 2011
20. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
An example of success story: mzML
The most popular search
engines support mzML
Many parser libraries available
Conversion from raw files
into mzMLhttp://www.psidev.info/mzml_1_0_0
22. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Current PSI Standard File Formats for MS
• mzMLMS data
• mzIdentMLIdentification
• mzQuantMLQuantitation
• mzTabFinal Results
• TraMLSRM
23. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzIdentML, mascot
.dat, sequest .out,
SpectrumMill .spo
pep.xml, prot.xml
Only qualitative data!
Data formats for output from search engines
24. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzIdentML: peptide and protein identifications
• Overview
• XML-based data standard for peptide and protein identifications e.g.
following database search and protein inference.
• Sections for all PSMs, proteins/protein groups inferred,
protocols/parameters etc.
• Timeline:
• Original 1.0 version in Aug 2009.
• Version 1.1 stable (Aug 2011).
• Manuscript published in MCP in 2012*.
• 2012-2016:
• Improving support for protein grouping multiple search engines, pre-fractionation
approaches and de novo sequencing.
• Now firmly embedded as part of ProteomeXchange submission
process, and supported by lots of external software.
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based
proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
25. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzIdentML: peptide and protein identifications
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based
proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
MzIdentML
CvList
AnalysisSoftwareList
AnalysisSampleCollection
SequenceCollection
AnalysisCollection
AnalysisProtocolCollection
DataCollection
URLsof controlledvocabularies
usedwithinthe file
Softwarepackagesused
Biologicalsamples analysed,
annotated with CV terms
Databaseentriesofprotein/ peptide
sequencesidentifiedandmodifications
Applicationofprotocol
inputs= externalspectra1..n
output= SpectrumIdentificationList1
SpectrumIdentificationProtocol
ProteinDetectionProtocol
SpectrumIdentificationProtocol
AdditionalSearchParams
ModificationParams
Enzymes
DatabaseFilters
Parametersfor theprotein
detectionprocedure
Inputs
AnalysisData
AnalysisData
SpectrumIdentificationList
Thedatabasesearched and theinput
fileconverted tomzIdentML
SpectrumIdentificationResult
SpectrumIdentificationItem
ProteinDetectionList
ProteinAmbiguityGroup
ProteinDetectionHypothesis
All identificationsmade from
searchingonespectrum
One(poly)peptide-
spectrummatch
Aset ofrelatedprotein
identificationse.g.conflicting
peptide-proteinassignments
Asingle proteinidentification
SpectrumIdentification
ProteinDetectionApplicationofprotocol
inputs= SpectrumIdentificationList1..n
output= ProteinDetectionList1
26. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzIdentML 1.1
Data standard for peptide
and protein identification
data
mzIdentML 1.2
2011-
2012
2017
New support for:
- Cross-linking approaches
- Peptide level scores
- Modification localization scores
- Proteogenomics approaches
Improved support for:
- Protein inference
- Pre-fractionation
- de novo sequencing
- Spectral library searches
Increasingly
supported
by the most-
used
proteomics
software
and
databases
jmzIdentML
mzid Library
ms-data-core-api
MyriMatch
ProteoAnnotator
PIA
ProCon
27. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Current PSI Standard File Formats for MS
• mzMLMS data
• mzIdentMLIdentification
• mzQuantMLQuantitation
• mzTabFinal Results
• TraMLSRM
28. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Wide variety of quantification techniques
29. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzQuantML: Standard for quantitative data
Overview
• XML-based standard for quantification data – following use of quant software
• Can report tables of data (QuantLayers), columns are: StudyVariables, Assays or Ratios;
rows are ProteinGroups, Proteins or Peptides
• Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
• Version 1.0 rc-1 submitted to the PSI process October 2011; Version 1.0 rc-2 June 2012; Re-
submitted to PSI process in October 2012
• Completed PSI process in Feb 2013 – version 1.0 release
• Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and
MS1 label techniques e.g. SILAC
• Schema is fixed with each technique defined by separate semantic rules, implemented in validator
software
• Manuscript published in MCP in summer 2013*
• Updated to support SRM as a new technique** (version 1.0.1 just submitted to the document
process).
*Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
**Qi et al. PROTEOMICS, 2015
30. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzQuantML: Standard for quantitative data
*Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
**Qi et al. PROTEOMICS, 2015
31. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Current PSI Standard File Formats for MS
• mzMLMS data
• mzIdentMLIdentification
• mzQuantMLQuantitation
• mzTabFinal Results
• TraMLSRM
32. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
The last addition: mzTab – Aims and concept
• To provide a simple and efficient way of exchanging results from MS
approaches.
• Simpler summary report of the experimental results
• Peptides and proteins identified in a given experimental setting
• Small molecules identified
• Reported quantification values
• Technical and biological metadata
• Easier to parse and use by the research community, systems
biologists as well as providers of knowledge bases.
• It can be used by non-experts in bioinformatics.
• It does not aim to replace mzIdentMl and mzQuantML
33. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzTab - Sections
• Basic information about experiment and sample
• Key-Value pairsMetadata
• Basic information about protein identifications
• Table-basedProtein
• Information about quantified peptides
• Table-basedPeptide
• Information about identified spectra
• Table-basedPSM
• Basic information about identified small molecules
• Table-basedSmall Molecule
Griss et al., MCP, 2014
35. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzTab – Current implementations
• jmzTab (Java API). Manuscript published in the journal Proteomics.
• mzTab Validator, PRIDE XML to mzTab converter (PRIDE team).
• Mascot (Matrix Science) exporter.
• MaxQuant: exporter in beta is available.
• ProteomeDiscoverer: Exporter in beta.
• OpenMS (version 1.10).
36. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
mzTab – ongoing development in metabolomics
• More detailed modelling of MS metabolomics data
(version 1.1):
• Led by A. Jones (Liverpool University)
• Extension from one to three sections for metabolomics.
• Also applicable to lipidomics data.
• Software will also be extended to support the new
version.
http://www.cosmos-fp7.eu/
37. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Current PSI Standard File Formats for MS
• mzMLMS data
• mzIdentMLIdentification
• mzQuantMLQuantitation
• mzTabFinal Results
• TraMLSRM
38. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Unify exchange of transitions with TraML
• PSI’s TraML (Transitions Markup Language)
• Format for encoding SRM/MRM transitions
• Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
Journal
Articles
Transitions
Databases
Excel
sheets
SRM
Analysis
Software
Instruments
TraML
39. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Unify exchange of transitions with TraML
Deutsch et al., MCP, 2012
40. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Java library for working with TraML files
It aims:
- command line & simple GUI
- TraML to TSV <-> TSV to TraML
- TSV vendor formats from TSQ, QTRAP5500, AgilentQQQ
Published: Helsens et al., JPR, 2011
TraML software implementations
41. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
PSI document process
•Every data standard has to undergo a
thorough review process…
•In fact, in practice, two review processes
happen in parallel: the PSI and
manuscript review.
42. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Data standard publications
mzML (data standard for MS data) Martens et al., MCP, 2011
mzIdentML (standard for peptide/protein IDs) Jones et al., MCP, 2012
TraML (for SRM transitions) Deutsch et al., MCP, 2012
mzQuantML (for quantitative data) Waltzer et al., MCP, 2013
mzTab (peptide/protein ID and quantification) Griss et al., MCP, 2014
Some updates already going on (e.g. mzIdentML 1.2)
43. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Importance of making software available
jmzML (https://github.com/PRIDE-Utilities/jmzml) Cote et al., Proteomics, 2009
jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML) Reisinger et al., Proteomics, 2012
jmzReader (https://github.com/PRIDE-Utilities/jmzReader) Griss et al., Proteomics, 2012
jmzQuantML (https://github.com/UKQIDA/jmzquantml) Qi et al., Proteomics, 2014
jmzTab (https://github.com/PRIDE-Utilities/jmzTab) Xu et al., Proteomics, 2014
ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api)Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always
open source and can be used by anyone!
44. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Under development: Proteogenomics related
formats
• Two ongoing formats are being developed: proBed and
proBAM.
• Same overall objective: to map identified peptides to
genome coordinates.
• Different level of detail:
• proBed is tab-delimited and simpler, based on the original BED
format. Less level of detail.
• proBAM is based in the original SAM/BAM formats, widely
used in genomics. Much higher level of detail.
45. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Provide your own data to genome browsers
46. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
And also… protein-protein interactions
PSI-XML: XML-based format
• Version 2.5 is the working version
• Version 3.0 under development
MITAB: tab-delimited format
47. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Overview
• A couple of slides about the need of data standards
• The Proteomics Standards Initiative
• Existing data standards
• Connection with ProteomeXchange and IMEx
48. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
ProteomeXchange: A Global, distributed proteomics
database
PASSEL
(SRM data)
PRIDE
(MS/MS data)
MassIVE
(MS/MS data)
Raw
ID/Q
Meta
jPOST
(MS/MS data)
Mandatory raw data deposition
since July 2015
• Goal: Development of a framework to allow standard data submission and
dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
New in 2016
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
49. Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
MBInfo
The IMEx Consortium (www.imexconsortium.org)
Orchard et al., Nat Methods, 2012