Introduction to the PSI standard data formats
Dr. Juan Antonio Vizcaíno
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. Vizcaíno
juan@ebi.ac.uk
WT Proteomics Bioinformatics Course 2018
Hinxton, 18 July 2018
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
Data standards are needed
Standards are needed in real life: also in bioinformatics…
With a small number of standards, converters are feasible
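The point about converters can be made concrete with a little arithmetic. The sketch below assumes the simplest possible model, which is not taken from the slides: without a shared standard every format needs a direct converter to every other format, whereas with one shared standard each format only needs a reader and a writer for that standard. The function names are illustrative.

```python
def pairwise_converters(n_formats: int) -> int:
    """Directed converters needed when every format converts straight to every other."""
    return n_formats * (n_formats - 1)


def hub_converters(n_formats: int) -> int:
    """Converters needed when every format only converts to and from one shared standard."""
    return 2 * n_formats


# Compare both strategies for a growing number of formats.
for n in (3, 10, 20):
    print(n, pairwise_converters(n), hub_converters(n))
```

Under this model the direct approach grows quadratically while the shared-standard approach grows linearly, which is one way to read "converters are feasible".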
Taken from Biocomicals, http://biocomicals.blogspot.com
Mass Spectrometry (MS)-based proteomics
• Many different workflows -> Many different data types ->
Need for several data standards.
• Discovery mode:
• Bottom-up proteomics
• Data dependent acquisition (DDA)
• Data independent acquisition (DIA)
• Top down proteomics
• Targeted mode:
• SRM/MRM/PRM (Selected/Multiple/Parallel Reaction Monitoring)
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
HUPO Proteomics Standards Initiative
http://www.psidev.info
• Develops data standards for proteomics.
• Both data representation and annotation standards.
• Involves data producers, database providers, software producers,
publishers, everyone who wants to be involved…
• Active Workgroups: MI, MS, PI, Mod and the new QC.
• Inter-group activities: MIAPE and Controlled Vocabularies.
• Started in 2002, so some experience already…
• One annual meeting in March-April, regular phone calls.
• Close interaction with the metabolomics community (MSI).
PSI Deliverables
•Minimum information (MIAPE) specifications: Format-independent
specification of minimum information guidelines.
•Formats: Usually XML-based (but also tab-delimited files), capable of
representing the relevant Minimum Information, plus additional detailed data
for the domain.
•Controlled vocabularies: Usually an OBO-style hierarchical controlled
vocabulary precisely defining the metadata that are encoded in the formats.
•Databases and Tools: Foster open software implementations to make the
standards truly useful.
•Community interaction to ensure deposition of data in public repositories.
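To illustrate what an OBO-style hierarchical controlled vocabulary looks like in practice, here is a minimal sketch that reads `[Term]` stanzas from OBO text and follows `is_a` links to a term's parents. The two terms and their `EX:` accessions are made up for the example and are not real PSI-MS entries; the parser handles only the small subset of the OBO syntax shown here.

```python
# Hypothetical vocabulary fragment; the EX: accessions are invented for illustration.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: instrument model

[Term]
id: EX:0000002
name: example mass spectrometer
is_a: EX:0000001 ! instrument model
"""


def parse_obo_terms(text):
    """Return {term id: {tag: [values]}} for every [Term] stanza in the text."""
    terms = {}
    current = None  # tags of the stanza being read, or None outside a [Term]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue
        if line.startswith("["):
            current = {} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header lines and non-Term stanzas are ignored
        tag, value = line.split(":", 1)
        value = value.split(" ! ")[0].strip()  # drop the trailing human-readable comment
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms


def parent_names(terms, term_id):
    """Names of the direct is_a parents of a term."""
    return [terms[parent]["name"][0] for parent in terms[term_id].get("is_a", [])]


terms = parse_obo_terms(SAMPLE_OBO)
print(parent_names(terms, "EX:0000002"))
```

Because file formats refer to terms by accession rather than by free text, a lookup like this is what lets software interpret the metadata consistently.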
PSI MS Controlled Vocabulary
Mayer et al., Database, 2013
~2,700 terms by June 2017
The Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ontology-lookup/
Overview
• A couple of slides about the need for data standards
• The Proteomics Standards Initiative
• Existing data standards
The typical dilemma
•Data standards need to be stable to promote adoption
•Proteomics standards need to evolve very rapidly:
• Data is inherently very complex
• Experimental techniques are evolving all the time
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
Data formats for mass spectra data
• Binary data
• XML-based files: mzData, mzXML, mzML
• Peak lists: .dta, .pkl, .mgf, .ms2
An example of a success story: mzML
• A data format for the storage and exchange of MS output files
• Designed by merging the best aspects of both mzData and mzXML
• Developed with full participation of academic researchers, hardware
and software vendors
• Expected to replace mzXML and mzData, but not expected to
completely replace vendor binary formats
• Captures spectra (raw data or peak lists), chromatograms and related
metadata
• Version 1.0 released in June 2008, v1.1 released in June 2009
• Many implementations already exist
• Version 1.2 with enhanced compression considered for the near
future.
Martens et al., MCP, 2011
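As a small illustration of how spectra are carried inside an XML file, mzML stores each peak array (m/z values, intensities) as base64-encoded text holding little-endian 32-bit or 64-bit floats, optionally zlib-compressed. The sketch below encodes and decodes such an array with the Python standard library only. In a real file the precision and compression are declared by controlled vocabulary terms on each array; here they are passed as plain arguments, and the function names are illustrative rather than part of any mzML library.

```python
import base64
import struct
import zlib


def _float_code(precision_bits):
    """struct format character for the requested float width."""
    if precision_bits not in (32, 64):
        raise ValueError("precision_bits must be 32 or 64")
    return "d" if precision_bits == 64 else "f"


def encode_binary_array(values, precision_bits=64, compressed=False):
    """Pack floats little-endian, optionally zlib-compress, then base64-encode."""
    raw = struct.pack("<%d%s" % (len(values), _float_code(precision_bits)), *values)
    if compressed:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary_array(encoded, precision_bits=64, compressed=False):
    """Reverse of encode_binary_array: base64-decode, decompress, unpack floats."""
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    code = _float_code(precision_bits)
    count = len(raw) // struct.calcsize("<" + code)
    return list(struct.unpack("<%d%s" % (count, code), raw))


mz_values = [100.0, 200.5, 300.25]
text = encode_binary_array(mz_values, precision_bits=64, compressed=True)
print(decode_binary_array(text, precision_bits=64, compressed=True))
```

In practice one of the existing parser libraries would do this step; the sketch only shows why an mzML file is text on the outside but numeric data on the inside.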
An example of a success story: mzML
• The most popular search engines support mzML
• Many parser libraries available
• Conversion from raw files into mzML
http://www.psidev.info/mzml_1_0_0
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
Data formats for output from search engines
• mzIdentML, Mascot .dat, SEQUEST .out, SpectrumMill .spo, pep.xml, prot.xml
• Only qualitative data!
mzIdentML
• XML-based data standard for peptide and protein identifications, e.g. following
database search and protein inference
• Sections for all PSMs, proteins/protein groups, protocols/parameters etc.
• Timeline:
• Original 1.0 version in Aug 2009
• Version 1.1 stable (Aug 2011); Original manuscript published in MCP in 2012*
• Well supported in lots of open source and commercial software
• Fully supported by ProteomeXchange resources
• 2012 onwards (mzIdentML 1.2): extended use cases
• Better support for protein grouping. Manuscript published in Proteomics **
• 2017: mzIdentML 1.2 release; manuscript published in MCP***
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass
spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
** Seymour, S. L., Farrah, T., Binz, P. A., Chalkley, R. J., et al., A standardized framing for reporting protein
identifications in mzIdentML 1.2. Proteomics 2014, 14, 2389-2399.
*** Vizcaíno, J. A., Mayer G., Perkins S., Barsnes H., et al., The mzIdentML Data Standard Version 1.2,
Supporting advances in Proteome Informatics. Molecular & Cellular Proteomics 2017, 16, 1275-1285.
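To give a feel for how the PSM section of an mzIdentML file can be read, here is a minimal sketch using the Python standard library. The XML fragment is a heavily trimmed, hypothetical example: it uses the mzIdentML element and attribute names for spectrum identifications but omits the namespace, the mandatory mass attributes, the scores and everything outside the list, so it would not validate against the schema. The function `top_ranked_psms` is an illustrative helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical fragment; real files carry far more attributes and cvParams.
SAMPLE = """\
<SpectrumIdentificationList id="SIL_1">
  <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
    <SpectrumIdentificationItem id="SII_1" peptide_ref="PEP_1" chargeState="2"
                                rank="1" passThreshold="true"/>
    <SpectrumIdentificationItem id="SII_2" peptide_ref="PEP_2" chargeState="2"
                                rank="2" passThreshold="false"/>
  </SpectrumIdentificationResult>
</SpectrumIdentificationList>
"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}Name' down to 'Name'."""
    return tag.rsplit("}", 1)[-1]


def top_ranked_psms(xml_text):
    """Return (spectrumID, peptide_ref, charge) for rank-1 PSMs that pass the threshold."""
    psms = []
    for result in ET.fromstring(xml_text).iter():
        if local_name(result.tag) != "SpectrumIdentificationResult":
            continue
        for item in result:
            if local_name(item.tag) != "SpectrumIdentificationItem":
                continue
            if item.get("rank") == "1" and item.get("passThreshold") == "true":
                psms.append((result.get("spectrumID"),
                             item.get("peptide_ref"),
                             int(item.get("chargeState"))))
    return psms


print(top_ranked_psms(SAMPLE))
```

For real files, one of the reference libraries listed later in this deck is a better starting point than hand-written XML handling.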
mzIdentML 1.1 (2011-2012) to mzIdentML 1.2 (2017)
Data standard for peptide and protein identification data
New support for:
- Cross-linking approaches
- Peptide level scores
- Modification localization scores
- Proteogenomics approaches
Improved support for:
- Protein inference
- Pre-fractionation
- de novo sequencing
- Spectral library searches
Increasingly supported by the most-used proteomics software and databases
jmzIdentML, mzid Library, ms-data-core-api, MyriMatch, ProteoAnnotator, PIA, ProCon
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
mzQuantML status
• XML-based standard for quantification data
• Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios;
rows are ProteinGroups, Proteins or Peptides
• Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
• Work started in Oct 2011, and progressed at various PSI meetings
• Completed PSI process in Feb 2013 – version 1.0 release
• Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ)
and MS1 label techniques e.g. SILAC*
• Updated in 2013-2014 to support SRM as a new technique**; mzqLibrary***
• 2015, mzQuantML 1.0.1 – minor update with SRM included
Open issues
• Not widely supported by software. No live development. Efforts are being put into
mzTab support instead
*Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
**Qi et al. PROTEOMICS, 2015, 15(18):3152-62
*** Qi et al PROTEOMICS 2015, 15, 2592-2596.
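The QuantLayer idea described above (a table whose columns are assays, study variables or ratios and whose rows are protein groups, proteins or peptides) can be pictured with a small in-memory structure. This is only a conceptual sketch: the class, its field names and the identifiers are invented for illustration and do not mirror the mzQuantML XML schema.

```python
from dataclasses import dataclass, field


@dataclass
class QuantLayer:
    """One table of quantitative values: columns are referenced by id, rows by object id."""
    column_refs: list                         # e.g. assay, study variable or ratio ids
    rows: dict = field(default_factory=dict)  # row object id -> list of values

    def add_row(self, object_ref, values):
        """Store one row, insisting on exactly one value per declared column."""
        if len(values) != len(self.column_refs):
            raise ValueError("expected %d values, got %d"
                             % (len(self.column_refs), len(values)))
        self.rows[object_ref] = list(values)

    def value(self, object_ref, column_ref):
        """Look up a single cell by row and column identifier."""
        return self.rows[object_ref][self.column_refs.index(column_ref)]


# Hypothetical protein-level layer with two assays.
layer = QuantLayer(column_refs=["assay_1", "assay_2"])
layer.add_row("protein_A", [10.0, 12.5])
layer.add_row("protein_B", [3.2, 3.0])
print(layer.value("protein_A", "assay_2"))
```

Keeping the column references separate from the rows is what lets the same table shape serve assays, study variables and ratios alike.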
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
mzTab – Aims and concept
• To provide a simple and efficient way of exchanging results from MS
approaches.
• Simpler summary report of the experimental results
• Peptides and proteins identified in a given experimental setting
• Small molecules identified
• Reported quantification values
• Technical and biological metadata
• Easier to parse and use by the research community, systems
biologists as well as providers of knowledge bases.
• It can be used by non-experts in bioinformatics.
• It does not aim to replace mzIdentML and mzQuantML
mzTab - Sections
• Metadata (key-value pairs): basic information about experiment and sample
• Protein (table-based): basic information about protein identifications
• Peptide (table-based): information about quantified peptides
• PSM (table-based): information about identified spectra
• Small Molecule (table-based): basic information about identified small molecules
Griss et al., MCP, 2014
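The section layout above maps directly onto how an mzTab file can be read: each line starts with a three-letter prefix that says which section it belongs to (MTD for metadata; PRH/PRT, PEH/PEP, PSH/PSM and SMH/SML for the header and data rows of the four tables). The sketch below splits a file into those sections. The sample content is hypothetical and far smaller than a valid file, which requires many more metadata entries and columns; `parse_mztab` is an illustrative helper, not the jmzTab API.

```python
HEADER_TO_SECTION = {"PRH": "protein", "PEH": "peptide", "PSH": "psm", "SMH": "small_molecule"}
ROW_TO_SECTION = {"PRT": "protein", "PEP": "peptide", "PSM": "psm", "SML": "small_molecule"}


def parse_mztab(text):
    """Split mzTab text into a metadata dict and one list of row dicts per table."""
    metadata, headers, tables = {}, {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "MTD":
            metadata[fields[1]] = fields[2]          # key-value pair
        elif prefix in HEADER_TO_SECTION:
            section = HEADER_TO_SECTION[prefix]
            headers[section] = fields[1:]            # column names for this table
            tables[section] = []
        elif prefix in ROW_TO_SECTION:
            section = ROW_TO_SECTION[prefix]
            # Assumes the header line came first, as in a well-formed file.
            tables[section].append(dict(zip(headers[section], fields[1:])))
        # Anything else (e.g. COM comment lines) is ignored.
    return metadata, tables


# Hypothetical, heavily shortened example; tabs are built explicitly so they survive copying.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["MTD", "mzTab-version", "1.0.0"],
    ["MTD", "description", "hypothetical example"],
    ["PRH", "accession", "description"],
    ["PRT", "EXAMPLE_1", "made-up protein"],
    ["PSH", "sequence", "PSM_ID", "accession"],
    ["PSM", "PEPTIDEK", "1", "EXAMPLE_1"],
])

metadata, tables = parse_mztab(SAMPLE)
print(metadata["mzTab-version"], tables["psm"][0]["sequence"])
```

That a dozen lines of code are enough to get at the tables is the practical meaning of "easier to parse" in the aims of the format.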
Metadata section - Example
Current PSI Standard File Formats for MS
• mzML: MS data
• mzIdentML: Identification
• mzQuantML: Quantitation
• mzTab: Final results
• TraML: SRM
Unify exchange of transitions with TraML
• PSI’s TraML (Transitions Markup Language)
• Format for encoding SRM/MRM transitions
• Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
[Diagram: TraML linking journal articles, transitions databases, Excel sheets, SRM analysis software and instruments]
PSI document process
•Every data standard has to undergo a
thorough review process…
•In practice, two review processes happen in parallel:
the PSI review and the manuscript review.
Proteogenomics related data formats
• Two formats are being developed: proBed
(version 1 available) and proBAM (still under review).
• Same overall objective: to map identified peptides to
genome coordinates.
• Different level of detail:
• proBed is tab-delimited and simpler, based on the original
BED format. Lower level of detail.
• proBAM is based on the original SAM/BAM formats, widely
used in genomics. Much higher level of detail.
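Since proBed builds on BED, the core idea of mapping a peptide to genome coordinates can be sketched with the first six standard BED columns alone: chromosome, 0-based start, exclusive end, name, score and strand. The additional proteomics-specific columns that proBed defines are deliberately left out, and the sample line is made up for illustration.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    """First six standard BED columns."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    name: str
    score: int
    strand: str

    @property
    def length(self):
        """Number of genomic bases covered, given the half-open coordinates."""
        return self.end - self.start


def parse_bed_line(line):
    """Parse one tab-delimited BED line, ignoring any columns after the sixth."""
    f = line.rstrip("\n").split("\t")
    return BedFeature(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), f[5])


# Hypothetical feature: a 10-residue peptide spans 30 bases when it maps to one exon.
feature = parse_bed_line("chr1\t1000\t1030\tpeptide_1\t900\t+")
print(feature.chrom, feature.start, feature.end, feature.length)
```

The half-open, 0-based convention is the detail most often mishandled when converting between BED-style files and 1-based formats.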
Proteogenomics related data formats
Provide your own data to genome browsers
TrackHubs in Genome Browsers
Data standard publications
• mzML (for MS data): Martens et al., MCP, 2011
• mzIdentML (peptide/protein IDs): Jones et al., MCP, 2012; Vizcaíno et al., MCP, 2017
• TraML (for SRM transitions): Deutsch et al., MCP, 2012
• mzQuantML (for quantitative data): Walzer et al., MCP, 2013
• mzTab (ID and quantification): Griss et al., MCP, 2014
• proBed & proBAM (proteogenomics): Menschaert et al., Genome Biology, 2018
Importance of making software available
• jmzML (https://github.com/PRIDE-Utilities/jmzml): Cote et al., Proteomics, 2009
• jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML): Reisinger et al., Proteomics, 2012
• jmzReader (https://github.com/PRIDE-Utilities/jmzReader): Griss et al., Proteomics, 2012
• jmzQuantML (https://github.com/UKQIDA/jmzquantml): Qi et al., Proteomics, 2014
• jmzTab (https://github.com/PRIDE-Utilities/jmzTab): Xu et al., Proteomics, 2014
• ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api): Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always
open source and can be used by anyone!
And also… protein-protein interactions
PSI-XML: XML-based format
• Version 2.5 is the working version
• Version 3.0 under development
MITAB: tab-delimited format
Summary slide
Deutsch et al., JPR, 2017
Do you want to learn more?
