Yasset Perez-Riverol Ph.D
github: github.com/ypriverol
twitter: @ypriverol
OpenMS: Quantitative proteomics at large scale
Proteomics Bioinformatics
EMBL-EBI, December 2016
Outline
• Introduction to OpenMS
    Modularity & Workflows
    Visualization
    Integration with other tools
• Two example workflows
    Protein identification
    Label-free quantification
Modularity is the degree to which a system's components may
be separated and recombined.
Modularity
tools for identification
DecoyDatabase
MascotAdapter
XTandemAdapter
MSGFPlusAdapter
PeptideIndexer
FalseDiscoveryRate
IDPosteriorErrorProbability
ConsensusID
LuciphorAdapter
HighResPrecursorMassCorrector
FidoAdapter
tools for quantification
PeakPickerHiRes
FeatureFinderMultiplex
FeatureFinderCentroided
SpectraMerger
NoiseFilterSGolay
ITRAQAnalyzer
IDMapper
IDConflictResolver
MapAlignerPoseClustering
MapRTTransformer
FeatureLinkerUnlabeledQT
ProteinQuantifier
tools for file handling
FileConverter
FileMerger
FileFilter
IDFileConverter
IDMerger
IDFilter
MzTabExporter
FileInfo
OpenMS ⇨ collection of 180 software tools
≈ 30 tools sufficient for standard workflows
OpenMS
OpenMS – an open-source framework for computational mass spectrometry
Portable: available on Windows, OSX, Linux
OpenMS TOPP tools – The OpenMS Proteomics Pipeline tools
• > 180 Building blocks: One application for each analysis step
• Vendor independent: Uses PSI standard formats
Can be integrated into various workflow systems
• TOPPAS – TOPP Pipeline Assistant
• Galaxy
• KNIME
KNIME and TOPPView
KNIME – KoNstanz Information MinEr
• Enables building customized workflows from OpenMS components.
TOPPView: an OpenMS data viewer.
• Based on standard file formats.
• Shows MS/MS information, peptides/proteins, and quantitative information.
KNIME – Workflow System
KNIME – KoNstanz Information MinEr
Industrial-strength general-purpose workflow system
Convenient and easy-to-use graphical user interface
Available for Windows, OSX, Linux at http://KNIME.org
[Screenshot: KNIME (CC BY-SA 4.0), with areas labelled Workflows, Nodes, Plots, Tables, and Console]
Workflow Builder: Data Flow
KNIME-OpenMS workflows are assembled from distinct nodes
Either tables or files are exchanged between nodes along the edges of the workflow
Configuration dialogs are used to set node parameters
Loops allow iterating sequentially over lists of data
Switches allow executing nodes or subworkflows depending on a condition
Scripting
KNIME permits the embedding of R code for advanced statistics
Embedding of R scripts using the R Snippet node
All plotting capabilities of R can be used as well
Peptide/Protein Identification
Task: Identify peptides in multiple samples
Mass spectra enter workflow on the left
Loop nodes permit execution of parts of the workflow
Identified proteins end up in result files (right side)
TOPPView: Visualization of the results
mzML, idXML
Workflow – Plug-In System
Task: Identify peptides in multiple samples
Combination of X!Tandem + OMSSA
Definition of QC parameters such as FDR, q-values, and p-values.
Complex and customized Workflows
Inference algorithm  X!Tandem         Mascot           MS-GF+           Merged
PIA                  1214  64 (5.3%)  1442  74 (5.1%)  1631  93 (5.7%)  1615 101 (6.2%)
Fido                  996  67 (6.7%)  1439  80 (5.6%)  1679  96 (5.7%)  1619 105 (6.5%)
ProteinLP             989  64 (6.5%)  1229  77 (2.3%)  1651  93 (5.6%)  1295 104 (8.0%)
MSBayesPro            749  24 (3.2%)   958  26 (2.7%)  1303  31 (2.4%)   963  36 (3.7%)
ProteinProphet       1027  64 (6.2%)  1282  73 (5.7%)  1629  91 (5.6%)  1629  99 (6.7%)
Audain E. & Uszkoreit J. et al., Journal of Proteomics, 2017
Best protein inference algorithm:
• 3 datasets
• 4 search engines
• 5 protein inference algorithms
• > 140 combinations
Some of the Identification nodes
IDPosteriorErrorProbability
Compute the posterior error probability for each PSM
Generate a new file with the corresponding values.
ConsensusID
Combine PSM identifications from multiple search
engines.
Generate a Combined PosteriorErrorProbability for
each PSM.
For each peptide ID, use the best score of any
search engine as the consensus score.
FalseDiscoveryRate
Estimate the false discovery rate at the peptide and protein level
from target-decoy search results.
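The idea behind an FDR filter can be sketched with a small target-decoy calculation. This is an illustration only, not the implementation of the FalseDiscoveryRate node: it assumes that higher scores are better and that every PSM carries a target/decoy flag, and the names `Psm` and `q_values` are hypothetical.

```python
from typing import List, NamedTuple


class Psm(NamedTuple):
    """Hypothetical PSM record: a score and a target/decoy flag."""
    score: float      # assumption: a higher score means a better match
    is_decoy: bool


def q_values(psms: List[Psm]) -> List[float]:
    """Return one q-value per PSM, in the order of the input list."""
    # Rank PSMs from best to worst score.
    order = sorted(range(len(psms)), key=lambda i: psms[i].score, reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if psms[i].is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among all PSMs scoring at least this well.
        fdrs.append(decoys / max(targets, 1))
    # The q-value is the smallest FDR at which a PSM is still accepted,
    # i.e. a running minimum taken from the worst score upwards.
    result = [0.0] * len(psms)
    best = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        best = min(best, fdrs[rank])
        result[order[rank]] = best
    return result
```

Keeping only PSMs whose q-value is below a chosen threshold (for example 0.01) corresponds to the kind of filtering an FDR step followed by IDFilter would perform in a workflow.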
Adapters and Complementary Nodes
FileMerger
This node takes two files (or file lists) as input and
outputs a merged list of both inputs. The order
corresponds to the order of the input lists and ports.
IDMerger
Merges several protein/peptide identification files
into one file.
PeptideIndexer
Refreshes the protein references for all peptide hits.
IDFilter
Filters results from protein or peptide identification
engines based on different criteria.
Quantitative Proteomics
• Relative Quantification
    Labeled
        In vivo: 14N/15N, SILAC
        In vitro: iTRAQ, TMT, 16O/18O
    Label-Free: Spectral Counting, MRM, Feature-Based
• Absolute Quantification: AQUA, SISCAPA
And many more…
Label-Free Quantification (LFQ)
Label-free quantification is probably the most natural way of quantifying
• No labeling required, which removes a source of error and keeps costs low
• Samples are acquired in separate measurements, so higher reproducibility is needed
• Manual analysis is difficult
• Scales very well with the number of samples: the analysis is the same for 2 or 100 samples
Feature-based LFQ - LC-MS Maps
Spectra are acquired at rates of up to dozens per second
Stacking the spectra yields peak maps
Resolution:
• Up to millions of points per spectrum
• Tens of thousands of spectra per LC run
Huge 2D datasets of up to hundreds of GB per sample
[Figure labels: Feature (eluting peptide); Quantification (3x over-expressed, …)]
Feature-based LFQ
1. Find features in all maps
2. Align maps
3. Link corresponding features
4. Identify features
5. Quantify features
6. Quantify proteins based on their peptides
Protein        Sample 1 : Sample 2 : Sample 3
NPC2_HUMAN     1.0 : 5.2 : 0.3
CD177_HUMAN    1.0 : 0.2 : 0.4
…
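The last step, going from peptide features to protein ratios, can be sketched as follows. This is a minimal illustration under stated assumptions, not the ProteinQuantifier implementation: protein abundance is taken as the plain sum of peptide feature intensities, ratios are expressed relative to sample 1, and the function name and the intensity values in the usage example are hypothetical (chosen to reproduce the ratios shown on the slide).

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (protein, peptide, feature intensity in each sample); illustrative only
Row = Tuple[str, str, List[float]]


def protein_ratios(rows: List[Row]) -> Dict[str, List[float]]:
    """Sum peptide intensities per protein and sample, then divide by sample 1."""
    totals: Dict[str, List[float]] = defaultdict(list)
    for protein, _peptide, intensities in rows:
        if not totals[protein]:
            totals[protein] = [0.0] * len(intensities)
        for k, value in enumerate(intensities):
            totals[protein][k] += value
    return {
        protein: [round(v / sums[0], 2) for v in sums]
        for protein, sums in totals.items()
        if sums[0] > 0  # a ratio to sample 1 is undefined without signal there
    }


rows = [
    ("NPC2_HUMAN", "peptide_a", [100.0, 500.0, 20.0]),
    ("NPC2_HUMAN", "peptide_b", [100.0, 540.0, 40.0]),
    ("CD177_HUMAN", "peptide_c", [50.0, 10.0, 20.0]),
]
for protein, ratios in protein_ratios(rows).items():
    print(protein, " : ".join(str(r) for r in ratios))
```

Summing is only one possible aggregation; averaging or using the most intense peptides are equally plausible choices and would change the ratios.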
Label-Free Workflow
Different algorithms have been proposed by the OpenMS community for
label-free quantification:
• Weisser H., Journal of Proteome Research (2013)
• Zhang B., Molecular & Cellular Proteomics (2016)
• Veit J., Journal of Proteome Research (2016)
• Ranninger C., Analytica Chimica Acta (2016)
DeMix-Q Algorithm and Workflow
Bo Zhang, Lukas Käll & Roman A. Zubarev, MCP (2016)
Reliable and reproducible Quantitation
LFQ Relevant nodes
FeatureFinderCentroided
Detects two-dimensional features in LC-MS data.
MapAlignerPoseClustering
Corrects retention time distortions between maps
using a pose clustering approach.
FeatureLinkerUnlabeledQT
Groups corresponding features from multiple maps.
ConsensusMapNormalizer
Normalizes the maps of one consensusXML file.
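What normalizing maps means can be illustrated with a simple median scaling. This is a sketch under assumptions, not the ConsensusMapNormalizer algorithm (the slides do not say which method it uses): each map is represented as a plain list of feature intensities, the first map serves as the reference, and the function name is hypothetical.

```python
from statistics import median
from typing import List


def normalize_to_first_map(maps: List[List[float]]) -> List[List[float]]:
    """Scale each map so that its median intensity equals that of the first map."""
    reference = median(maps[0])
    normalized = []
    for intensities in maps:
        # One global factor per map corrects overall loading differences.
        factor = reference / median(intensities)
        normalized.append([value * factor for value in intensities])
    return normalized
```

A single scaling factor per map assumes that most features do not change between samples, which is the usual premise behind this kind of normalization.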
OpenMS at Large Scale
Galaxy
WS-PGRADE/gUSE
KNIME
Each individual tool can be run from the command line, which makes it
possible to distribute the work across large HPC environments.
$> FileFilter -in myinfile.mzML -levels 2 -rt 100:1500 -out myoutfile.mzML
$> OpenSwathDecoyGenerator.exe -in OpenSWATH_SGS_AssayLibrary.TraML -out OpenSWATH_SGS_AssayLibrary_with_Decoys.TraML -method shuffle -append -exclude_similar -remove_unannotated
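Because every tool is an ordinary command-line program, a batch of runs can be prepared by generating one command per input file. The sketch below only builds the command strings and does not execute anything. It reuses the FileFilter options from the example above; the input file names, the output directory, and the `_ms2` suffix are hypothetical.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def file_filter_commands(inputs: List[str], out_dir: str) -> List[str]:
    """Build one FileFilter command line per input file, without running it."""
    commands = []
    for name in inputs:
        # Hypothetical naming scheme: <input stem>_ms2.mzML in out_dir.
        out_path = PurePosixPath(out_dir) / (PurePosixPath(name).stem + "_ms2.mzML")
        args = ["FileFilter", "-in", name, "-levels", "2",
                "-rt", "100:1500", "-out", str(out_path)]
        # Quote each argument so paths with spaces survive a shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


for line in file_filter_commands(["run1.mzML", "run2.mzML"], "filtered"):
    print(line)
```

Each printed line is independent of the others, so the lines could be handed to a cluster scheduler as separate jobs.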
Conclusions
• OpenMS: a modular workflow system
• Standard workflows:
SILAC, iTRAQ/TMT, label-free, SWATH, quality control
• Strong collaboration with other projects:
ProteoWizard, Thermo PD, KNIME, Fido, Percolator,
search engines, HUPO-PSI formats
How to run OpenMS workflows
• OpenMS, local installation
(Windows, OS X, Linux)
http://bit.ly/1J6lz6h
http://openms.de/workflows
• OpenMS in Proteome Discoverer
(LFQProfiler and RNPxl for PD 2.1)
http://openms.de/PD
• OpenMS in Galaxy
http://galaxy.uni-freiburg.de
• OpenMS in KNIME
https://tech.knime.org/community/bioinf/openms
