1. EBI is an Outstation of the European Molecular Biology Laboratory.
08/23/18
EMBL-EBI Proteomics data resources and services
Rafael JIMENEZ (EBI, Hinxton, UK)
4th Annual Forum for SMEs
Munich, October 18th-19th 2010
2. Context
Integration, standards and dissemination
• UniProt: Protein Sequences
• Reactome: Pathways
• IntAct: Interactions
• PRIDE: Mass Spec
• DAS, PSICQUIC, EnCore
• Annotation, Archive
4. PSI
• Proteomics Standards Initiative
• Work group of the Human Proteome Organization
• Defines community standards for data in proteomics
• … facilitating data comparison, exchange and verification
http://www.psidev.info/
5. PSI
• MIAPE: The Minimum Information About a Proteomics Experiment
• Data and metadata from proteomics experiments
  • Data: results
  • Metadata: data about the data
    • Where the samples came from
    • How the analyses were performed
7. PSI-MI
[Diagram: overview of PSI-MI resources]
• Data format (standard formats): PSI-MI XML, PSI-MITAB
• Tools: XML Java API, MITAB Java API, XMLMakerFlattener, XML Validator, MIF25_view.xsl, MIF25_compact.xsl, MIF25_expand.xsl
• Data distribution: PSICQUIC (servers, registry, clients)
• Controlled vocabulary: PSI-MI CV
• Reporting guideline: MIMIx
• Data submission: PSI-MI XML files, PSI Excel Sheet, PSI Web Form
• Website, data, tools
8. PSI - Molecular Interactions
• Work group of the Proteomics Standards Initiative
• Community coordination to ensure deposition of data in public repositories
• Concentrating on …
  • Annotation and representation of published MI data
  • Accessibility of MI data to the user community
• Data format: PSI-MI XML, PSI-MITAB
• Data distribution: PSICQUIC
• Controlled vocabulary: PSI-MI CV
• Reporting guideline (MIAPE): MIMIx
• Scoring: PSISCORE
http://www.psidev.info/MI
9. PSI-MI format
• Community standard for Molecular Interactions
• Jointly developed by major data providers: BIND,
CellZome, DIP, GSK, HPRD, Hybrigenics, IntAct, MINT, MIPS, Serono,
U. Bielefeld, U. Bordeaux, U. Cambridge, and others
• Collecting and combining data from different sources
has become easier
• Standardized annotation through PSI-MI ontologies
• Tools from different organizations can be chained, e.g.
IntAct data in Cytoscape.
psi-mi/xml25, psi-mi/tab25
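As a rough illustration of why a shared tabular format makes combining data easier, the sketch below splits one PSI-MITAB 2.5 line into its 15 tab-separated columns. The short column names and the example line are invented for illustration, and the naive split on "|" assumes that no quoted description contains that character.

```python
from typing import Dict, List

# Short names (chosen here, not official) for the 15 MITAB 2.5 columns, in file order.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab25_line(line: str) -> Dict[str, List[str]]:
    """Split one MITAB 2.5 line into named columns, each a list of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(MITAB25_COLUMNS):
        raise ValueError(f"expected {len(MITAB25_COLUMNS)} columns, got {len(fields)}")
    record = {}
    for name, raw in zip(MITAB25_COLUMNS, fields):
        # "-" marks an empty column; "|" separates multiple values (simplified).
        record[name] = [] if raw == "-" else raw.split("|")
    return record


# Made-up example line; accessions and the PubMed id are placeholders.
example_line = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Doe et al. (2010)", "pubmed:00000000",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])

record = parse_mitab25_line(example_line)
print(record["id_a"], record["id_b"], record["detection_method"])
```

Because every provider uses the same column order, the same function can read files from different sources before they are merged.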
12. Data distribution: PSICQUIC
• Proteomics Standards Initiative Common QUery InterfaCe.
• Community effort to standardise the way to access and retrieve data
from Molecular Interaction databases.
• Widely implemented by independent interaction data resources.
• Based on the PSI standard formats (PSI-MI XML and MITAB)
• Not limited to protein-protein interactions, also e.g.
• Drug-target interactions
• Simplified pathway data
• A registry listing resources implementing PSICQUIC
• Documentation: http://psicquic.googlecode.com
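A minimal sketch of what a uniform query interface means in practice: the same URL pattern can be pointed at any PSICQUIC provider. The path layout (`/current/search/query/`) and the parameter names (`format`, `firstResult`, `maxResults`) are written from memory of the PSICQUIC REST documentation and should be treated as assumptions, as should the IntAct base URL. The function only builds the URL and does not contact any server.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of one PSICQUIC provider; real ones are listed in the registry.
INTACT_PSICQUIC = "http://www.ebi.ac.uk/Tools/webservices/psicquic/intact/webservices"


def psicquic_query_url(base_url: str, miql: str, fmt: str = "tab25",
                       first_result: int = 0, max_results: int = 50) -> str:
    """Build a REST search URL for a MIQL query against one PSICQUIC service."""
    # The query is part of the path, so it must be percent-encoded.
    path = f"{base_url.rstrip('/')}/current/search/query/{quote(miql, safe='')}"
    params = urlencode({
        "format": fmt,                # e.g. tab25 for PSI-MITAB 2.5 output
        "firstResult": first_result,  # paging offset
        "maxResults": max_results,    # page size
    })
    return f"{path}?{params}"


print(psicquic_query_url(INTACT_PSICQUIC, "brca2", max_results=10))
```

Swapping `INTACT_PSICQUIC` for another provider's base URL is the only change needed to send the same query elsewhere, which is the point of the common interface.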
25. IMEx: The International Molecular Exchange Consortium
• Group of major public interaction data providers sharing
curation effort: DIP, IntAct, MINT, MPact, MatrixDB, MPIDB and BioGRID
• Independent molecular interaction resources
• Common curation standards for detailed curation
• Common data formats (PSI-MI XML, PSI-MITAB, PSICQUIC)
• Common accession number space
• Coordinated & non-redundant curation
• In production mode since February 2010
• Since 3/2009 supported by the European Commission
under PSIMEx, contract number FP7-HEALTH-2007-223411, with additional partners Vital-IT, Nature,
Wiley, BiaCore (GE), U. Maryland, CSIC, TU Munich, MIPS, SCBIT (Shanghai)
Imex.sf.net
26. IntAct
• Freely available, open-source database system
• Public repository of molecular interactions
• Interactions manually curated and reviewed by experts
• Interactions derived from literature or direct user submissions
• Topic-centric datasets (e.g. Cancer, Chromatin, MSD…)
• Analysis tools for interaction data
• EBI database (part of the IMEx consortium and the PSI-MI)
• Data updated every week: ftp://ftp.ebi.ac.uk/pub/databases/intact
• Data formats available:
http://www.ebi.ac.uk/intact
32. PSI-MSS
[Diagram: overview of PSI mass spectrometry resources]
PSI-MS
• Data format (standard formats): mzML, TraML
• Reporting guideline: MIAPE-MS
• Tools (validation, analysis, exporters, viewers, ...): ProDaC, OpenMS/TOPP, ProteoWizard, Proteios, TPP, X!Tandem, Myrimatch, InSilicoSpectro, NCBI C++ toolkit, Mascot, Phenyx, PEAKS, mzML_Exporter, CompassXport, Insilicos Viewer, jmzML, Pride Inspector, Pride Converter, …
Controlled vocabulary: PSI-MS
PSI-PI
• Data format (standard formats): mzIdentML, mzQuantML
• Reporting guideline: MIAPE-MSI
• Tools (validation, analysis, exporters, viewers, ...): mzIdentML validator, Mascot, OMSSA, Peaks, Phenyx, PLGS, ProCon, ProteinPilot, ProteinScape, SEQUEST, SpectraST, Spectrum Mill, X!Tandem, OpenMS/TOPP, Scaffold, TPP, Mascot Integra, MIAPE MSI exporter, CSV exporter, …
Tools, data, website: Pride Inspector, Pride Converter, Pride Biomart, Pride QProjects, PICR, OLS
33. PSI - Mass Spectrometry Standards
• Work group of the Proteomics Standards Initiative
• Community coordination to ensure deposition of data in public repositories
• Concentrating on …
  • Annotation and representation of published MS data
  • Accessibility of MS data to the user community
[Diagram: proteomics workflow and the formats that describe it]
• Workflow: Protein mixture → Separation (2D-SDS-PAGE) → Individual proteins → Spot cutting → Digestion (trypsin) → Peptides → Mass spectrometry (MALDI-TOF) → Peptide mass → Database search → Protein identification → Quantification → Protein quantification
• Formats: mzML, mzIdentML, mzQuantML, mzXML, mzData, analysisXML
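To give a feel for what an annotated, CV-driven format looks like to a program, here is a small sketch that counts spectra per MS level in an mzML-like document using only the Python standard library. The embedded sample is a stripped-down fragment written for this example and is not schema-valid mzML; the namespace and the PSI-MS accession MS:1000511 ("ms level") are the only details taken from the real format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = "http://psi.hupo.org/ms/mzml"
MS_LEVEL_ACCESSION = "MS:1000511"  # PSI-MS controlled vocabulary term "ms level"

# Hand-written toy fragment (not a complete or valid mzML file).
SAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="demo_run">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""


def count_ms_levels(mzml_text: str) -> dict:
    """Return {ms_level: number_of_spectra} for an mzML-like XML string."""
    root = ET.fromstring(mzml_text)
    levels = Counter()
    for spectrum in root.iter(f"{{{MZML_NS}}}spectrum"):
        # Only direct cvParam children of <spectrum> are inspected here.
        for param in spectrum.findall(f"{{{MZML_NS}}}cvParam"):
            if param.get("accession") == MS_LEVEL_ACCESSION:
                levels[int(param.get("value"))] += 1
    return dict(levels)


print(count_ms_levels(SAMPLE))
```

The lookup keys on the controlled vocabulary accession and ignores the human-readable name, which is how the CV keeps annotations comparable across tools.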
37. ProteomeXchange:
Enhancing Cooperation of Proteomics Data Repositories
• Group of major public Mass Spec data providers
• Single point of submission to proteomics repositories
• Encourage data exchange
• Common data formats (mzML, mzIdentML, mzQuantML)
• Common accession number space
• Coordinated & non-redundant data
• Since 2010 supported by the European Commission
38. Proposal structure
[Diagram: proposal structure]
• Data submission: local data management systems (ProHITS, MS-Lims, ProCon, Phenyx, OmicsHub, other LIMS) via Pride Converter
• Standards (implemented in Release 1, Release 2, Release 3): mzML, mzIdentML, mzQuantML
• Repositories: Pride (metadata, results), Peptidome (metadata, results), Tranche (raw data), linked by xref
• Central Dataset Look-up Service: MIAPE validation, accession number / reviewer login, notification, RSS feed
• Journals (data release / publications): Wiley Proteomics, NBT, JPR, MCP
• Secondary resources (data reprocessing and notification): Peptide Atlas, Uniprot, NIST spectrum libraries, …
39. The Proteomics Identifications Database
http://www.ebi.ac.uk/pride
• Centralized, standards-compliant, public data repository for proteomics identifications
• Open source
• Open data
• > 100,000,000 spectra
• ~ 4,000,000 protein identifications
• Detailed annotation of metadata
• Vizcaíno JA, Côté R, Reisinger F, Foster JM, Mueller M, Rameseder J, Hermjakob H, Martens L.
A guide to the Proteomics Identifications Database proteomics data repository.
Proteomics. 2009 Sep;9(18):4276-83.
PMID: 19662629
46. What is OLS?
• A unified, single point of query for over 69 ontologies
(updated daily) and upwards of 850,000 terms.
• A tool that offers online and programmatic access to query
ontologies about:
• Term names
• Synonyms
• Relationships
• Annotations
• Cross-references
• Reusable code components to integrate such functionality
in other projects
http://www.ebi.ac.uk/ontology-lookup/
Richard Cote
48. Why do you need ID mapping?
• Merging datasets to a common identifier space
• Finding all aliases/synonyms for an identifier
• (data integration – submissions!)
• Mapping from secondary IDs to more recent primary IDs
• (data “freshness”)
• Preparing data sets for specific tools
• Querying in various primary databases
• (data format requirements)
Richard Cote
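The first and third use cases above (merging datasets and moving from secondary to primary IDs) can be sketched with a toy lookup table. Everything here is hypothetical: the accessions, the replacement table and the function names are invented for illustration and do not reflect how PICR or any real database stores its mappings.

```python
from typing import Dict, Iterable, Set

# Hypothetical table: retired (secondary) accession -> accession that replaced it.
HYPOTHETICAL_REPLACEMENTS: Dict[str, str] = {
    "ACC_OLD1": "ACC_MID1",
    "ACC_MID1": "ACC_NEW1",
    "ACC_OLD2": "ACC_NEW2",
}


def to_primary(accession: str, replacements: Dict[str, str]) -> str:
    """Follow the replacement chain until an accession with no successor is reached."""
    seen: Set[str] = set()
    current = accession
    while current in replacements:
        if current in seen:
            raise ValueError(f"cyclic replacement chain at {current}")
        seen.add(current)
        current = replacements[current]
    return current


def merge_to_common_space(datasets: Iterable[Iterable[str]],
                          replacements: Dict[str, str]) -> Set[str]:
    """Map every accession to its primary form and take the union of all datasets."""
    merged: Set[str] = set()
    for dataset in datasets:
        merged.update(to_primary(acc, replacements) for acc in dataset)
    return merged


lab_a = ["ACC_OLD1", "ACC_NEW2"]
lab_b = ["ACC_MID1", "ACC_OLD2", "ACC_X"]
print(sorted(merge_to_common_space([lab_a, lab_b], HYPOTHETICAL_REPLACEMENTS)))
```

Without the mapping step the two lists would appear to share nothing; after it, both resolve to the same two primary accessions plus one unmapped entry.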
49. Protein identifier mapping is hard
• The basic problem: the same protein sequence is referred to by
multiple accession numbers assigned by multiple databases.
• No universal identifier scheme
• Redundant databases – multiple identifiers for the same sequence in
the same database
• Unstable identifiers (e.g. gi numbers)
• Obsolete and deleted identifiers (hypothetical proteins)
• Different production cycles for major databases
• Tools exist, but are limited in their database and
species coverage and in their usability and availability.
Richard Cote
50. PICR Home Page
• Submit accessions OR sequences (FASTA) with 500 entry interactive limit (no batch limit)
• Select output format
• Select one or many databases to map to in one request
• Limit search by taxonomy (pessimistic)
• Choose to return all mappings or only active ones
• Run search
Richard Cote
51. PICR Result Page – simple view
• Logical xref (hyperlinked)
• Inactive xref
• Secondary identifier
• Active xref (hyperlinked)
Richard Cote
52. Pride Inspector
• Open mzML and PRIDE XML files
• Browse the PRIDE database
• Facilitate publication reviews
53. DAS, The Distributed Annotation System
The Distributed Annotation System is…
• A network of biological data sources
• A Service Oriented Architecture (SOA)
• RESTful web service
• An example of federation
• Uniform access to multiple repositories of biological data.
• Repositories distributed in different geographical locations.
The DAS Protocol is…
• An integration platform
• A client-server protocol
• An agreed standard for web services
Andy Jenkinson
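Since DAS is a RESTful client-server protocol, a client request is just a URL. The sketch below builds a `features` request for one segment, following the `/das/<source>/features?segment=...` pattern as I recall it from the DAS 1.6 specification; the server and source names are placeholders, and the function does not perform any network access.

```python
from typing import Optional
from urllib.parse import urlencode


def das_features_url(server: str, source: str, segment: str,
                     start: Optional[int] = None, stop: Optional[int] = None) -> str:
    """Build a DAS 'features' request URL for one segment (e.g. a protein accession)."""
    # An optional range is appended to the segment as "id:start,stop".
    if start is not None and stop is not None:
        segment = f"{segment}:{start},{stop}"
    query = urlencode({"segment": segment}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Placeholder server and source names, for illustration only.
print(das_features_url("http://example.org", "demo_source", "P12345"))
print(das_features_url("http://example.org", "demo_source", "P12345", 1, 100))
```

Because every source answers the same kind of request, a client such as a protein viewer can query many servers in turn and overlay the returned features.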
54. Pride & Dasty3
• 74 Protein DAS sources!
• PRIDE: DAS 1.6
• Dasty3: protein client
55. PRIDE-Q: Why?
At present, PRIDE is a data repository to support publications
- We rely 100% on data submitted by researchers
- It is not possible to determine which data are good and which are not
- As a result, data in PRIDE cannot be used as supporting evidence for protein existence in UniProt
- Data cannot be reused by other resources either…
56. PRIDE-Q
Project funded by the Wellcome Trust (5 years)
Added value: high-quality data will go to a new resource, PRIDE-Q
Curation: automated rules, curator override
Data exports:
• Links, DAS track for all PRIDE data
• Quality controlled, e.g. “Protein Existence”, Expression Atlas from PRIDE-Q
57. Reactome
•Human pathway knowledgebase
•Manually curated
•Open source, open data
•Collaboration between EBI, OICR and NYU
•Online since 2003
•Matthews L, et al: Reactome knowledgebase of
human biological pathways and processes. Nucleic
Acids Res. 2008 Nov 3.
http://www.ebi.ac.uk/pride
59. New site! Coming soon …
http://reactome.oicr.on.ca
Main text
Navigation bar
60. The Pathway Browser
• Species selector
• Search & Analyze bar
• Sidebar
• Pathway Diagram Panel
• Details Panel (hidden)
• Thumbnail
61. The Pathway Browser - Pathway Diagrams
Boxes are proteins, sets or complexes.
Ovals are small molecules.
Green boxes are proteins or sets, blue are complexes.
Diagram labels: Catalyst, Input, Outputs, Compartment
Reaction node types: Transition, Binding, Dissociation, Omitted, Uncertain
Regulation: +ve, -ve
62. Pathway Analysis – Overrepresentation
‘Top-level’
Reveal next level
P-val, In set/In pathway
There are many different approaches to confidence scoring of individual molecular interactions. They can, for instance, be based on structural information, network topology, experimental conditions, similarity of the functional annotations of the interactors, or evolutionary conservation. It is therefore infeasible for a single, central scoring approach to combine all these methods. The obvious solution is decentralization.
In a decentralized setting, individual research groups focus on specific scoring approaches. For instance, a group with a focus on protein structures can provide confidence scores for the mutual interaction interfaces.
Each line is one piece of binary interaction evidence reported in a scientific publication
Evidence is grouped by molecule pair (allowing for subsequent filtering, should you need it)
Data can be downloaded in standard formats (see table header)
Only 30 interactions per page for speedy loading
One can customize the list of columns by clicking on “Change Column Displayed”
An integration platform for biological data: a way of bringing together data from different providers
Federation: unifies data sources that are different from each other
Species selector – you need to have a pathway selected for this to do anything
Sidebar in the PB displays the pathway hierarchy – on the Pathways tab.
Analyze, Annotate & Upload – i.e. A Control Panel. Contains tools and configuration options.
Search – not used in the exercises, takes some getting used to
Pathway Diagram panel is where diagrams appear, when selected from the hierarchy or a search result
Details panel – does what it says, responds to the selected object – N.B. Hidden by default!
Results – explain that the best results are at the top, but only ‘top-level’ pathways are shown. Colour gives significance. The order is by best-scoring pathway OR subpathway, which explains why the colours sometimes seem out of order. Use twisties to see subpathways, or Open All.
Numbers after the name are the p-value, then the number of proteins from the submitted set that match the pathway / total in the pathway.
Matching values twisty – lists the proteins from dataset that matched the pathway
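For readers who want to see where a p-value like this can come from, here is a minimal sketch of a one-sided hypergeometric overrepresentation test. The notes do not say which statistic Reactome actually uses, so the choice of test, the function name and the background size are all assumptions made for illustration.

```python
from math import comb


def overrepresentation_p_value(matches: int, pathway_size: int,
                               set_size: int, background_size: int) -> float:
    """P(at least `matches` pathway proteins in a random set of `set_size` proteins).

    matches         -- proteins from the submitted set found in the pathway ("In set")
    pathway_size    -- total proteins in the pathway ("In pathway")
    set_size        -- size of the submitted set
    background_size -- total proteins that could have been drawn (assumed known)
    """
    upper = min(pathway_size, set_size)
    total = comb(background_size, set_size)
    # Sum the hypergeometric probabilities for every outcome at least as extreme.
    favourable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, set_size - i)
        for i in range(matches, upper + 1)
    )
    return favourable / total


# Toy numbers: 5 of 5 submitted proteins fall in a 5-protein pathway, background of 10.
print(overrepresentation_p_value(5, 5, 5, 10))
```

A small p-value means the overlap between the submitted set and the pathway is larger than would be expected by chance under this model.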
Results list all pathways; pick one and it looks a bit like this, with colours overlaid on the pathway diagram. The box shows the result of right-clicking a complex to expand it.
Results form: click View to see a pathway. Colour indicates expression level; refer to the spectrum bar for values. Right-click to zoom into complexes.
Experiment Browser toolbar steps through multiple columns of expression data (if provided).