ELIXIR Implementation Study: “Mining the Proteome: Enabling Automated Processing and Analysis of Large-Scale Proteomics Data”
1. European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
Juan Antonio Vizcaíno
Mathias Walzer
European Bioinformatics Institute (EMBL-EBI)
juan@ebi.ac.uk, walzer@ebi.ac.uk
2. ELIXIR Webinar
11 April 2018
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• The implementation study
• Plans for the near future
Outline
3. ELIXIR Webinar
One slide intro to Mass Spectrometry proteomics
Hein et al., Handbook of Systems Biology, 2012
Proteins ≈ most drug targets
4. ELIXIR Webinar
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• The implementation study
• Plans for the near future
Outline
5. ELIXIR Webinar
• The goal of the ELIXIR proteomics community is to develop and maintain sustainable proteomics tools and data resources
• An essential part of the development will also be the ‘FAIRification’ of the resources (i.e. making the resources FAIR)
• Integrate proteomics bioinformatics activities in ELIXIR
Overall objectives
6. ELIXIR Webinar
• 11 ELIXIR nodes supported the application:
• Germany (co-lead) (O. Kohlbacher)
• Belgium (co-lead) (L. Martens)
• Czech Republic
• Denmark
• Ireland
• France
• Netherlands
• Spain
• Sweden
• United Kingdom
• EMBL-EBI (co-lead) (Juan A. Vizcaíno)
ELIXIR nodes supporting the new Community
7. ELIXIR Webinar
White paper as the basis for this Community
Vizcaíno et al., F1000Research, 2017
8. ELIXIR Webinar
Highlighting already existing resources and initiatives
Tools: Services and connectors to drive access and exploitation
Data: Sustaining Europe’s life science data infrastructure
Interoperability: Integration of data and services
Compute: Access, exchange and storage
Training: Professional skills for managing and exploiting data
10. ELIXIR Webinar
• PRIDE stores mass spectrometry (MS)-based proteomics data:
• Peptide and protein expression data (identification and quantification)
• Post-translational modifications
• Mass spectra (raw data and peak lists)
• Technical and biological metadata
• Any other related information
• Full support for tandem MS approaches
• Any type of data can be stored
• Leading ProteomeXchange
• From July 2017, an ELIXIR core resource
European leadership: the world-leading PRIDE database
http://www.ebi.ac.uk/pride/archive
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2016
11. ELIXIR Webinar
ProteomeXchange: A global, distributed proteomics database
• Member repositories: PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data), iProX (MS/MS data), PASSEL (SRM data)
• Data types: Raw, ID/Q, Meta
• Mandatory data deposition
• Framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017
12. ELIXIR Webinar
PRIDE data submissions and data growth
> 2,400 datasets submitted in 2017
September, November and December 2017 were the record months in terms of submitted datasets
[Figures: datasets submitted per month; datasets submitted per year]
13. ELIXIR Webinar
Stats: Data growth in EMBL-EBI resources
[Figure: data growth by data type: sequence data, micro-array, metabolomics, proteomics]
14. ELIXIR Webinar
Data re-use in proteomics is increasing
Data download volume for PRIDE Archive in 2017: 295 TB
[Figure: downloads in TBs per year, 2013 to 2017]
15. ELIXIR Webinar
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• The implementation study
• Plans for the near future
Outline
16. ELIXIR Webinar
Tools: Services and connectors to drive access and exploitation
Data: Sustaining Europe’s life science data infrastructure
Interoperability: Integration of data and services
Compute: Access, exchange and storage
Training: Professional skills for managing and exploiting data
ELIXIR Platforms
17. ELIXIR Webinar
• Title: “Mining the proteome: Enabling automated processing and analysis of large-scale proteomics data”.
• Development of open, reproducible, and robust analysis pipelines for DDA (Data Dependent Acquisition) approaches.
• Deployment in the EMBL-EBI “Embassy Cloud” (and optionally later other clouds)
• Connected to PRIDE, bringing analysis closer to the data.
• Who is involved?
• EMBL-EBI (Vizcaíno & Newhouse).
• ELIXIR-DE (Kohlbacher, EKUT; Eisenacher, RUB)
ELIXIR Implementation Study (Feb 2017-June 2018)
18. ELIXIR Webinar
Develop exemplary proteomics data analysis workflows and deploy them in the EMBL-EBI “Embassy Cloud”:
(1) Standard identification workflow
(2) Identification workflow for PTMs
(3) Quantification (label-free/label-based approaches)
(4) Quality Control (to aid data set interpretation/reanalysis evaluation)
(5) Versions of quantification approaches (including PTMs)
→ Connected to public proteomics data from
ELIXIR Implementation Study
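As a rough illustration of how the five workflows listed above could relate to one another, the sketch below models them as named steps with dependencies and resolves one valid execution order. The step names and the dependency edges are assumptions made for illustration only; the slides do not specify how the workflows are chained.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each workflow lists the workflows it builds on.
# These edges are illustrative assumptions, not taken from the study.
WORKFLOWS = {
    "identification": set(),
    "ptm_identification": {"identification"},
    "quantification": {"identification"},
    "quality_control": {"identification", "quantification"},
    "ptm_quantification": {"ptm_identification", "quantification"},
}


def execution_order(workflows):
    """Return one valid order in which the workflows could be run."""
    return list(TopologicalSorter(workflows).static_order())


order = execution_order(WORKFLOWS)
print(order)
```

Any order returned places a workflow after everything it depends on, so the standard identification workflow always comes first under these assumed edges.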
19. ELIXIR Webinar
Consolidating data access and provision of robust analysis pipelines
• Development of free-to-use, scalable, and user-friendly data analysis pipelines including cloud deployment
• Cloud-based data analysis pipelines’ appeal
1. Increasingly large datasets
2. Local struggle with the ‘compute task’
ELIXIR Implementation Study
20. ELIXIR Webinar
Cloud workflow in genomics:
Simplified workflow launcher
• One workflow
• For AWS
• Enabling co-analysis with the larger PanCancer dataset
“Running a >30x whole genome alignment is [...] roughly 4 days and ~$10 on a single m4.2xlarge instance.”*
*: http://icgc.org/working-pancancer-data-aws
Existing clouds for genomics… one example
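The quoted figures imply an average hourly cost that is easy to back out. The snippet below is plain arithmetic on the two numbers in the quote (about 4 days, about $10); it assumes the instance runs continuously for the whole period and says nothing about how the price was obtained.

```python
def implied_hourly_cost(total_cost_usd, days):
    """Average cost per hour for a job that runs continuously for `days` days."""
    hours = days * 24
    return total_cost_usd / hours


# Figures taken from the quote above: ~$10 over ~4 days.
hourly = implied_hourly_cost(10.0, 4)
print(f"{hourly:.3f} USD per instance-hour")
```

Under that assumption the job costs roughly ten cents per instance-hour over 96 hours.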
21. ELIXIR Webinar
Infrastructure as a Service:
• Compute power not necessarily local but remote
• Still from compute centres, but on a larger scale
The ‘service’ is:
• Customer gets infrastructure, but it’s virtualized
• This abstraction yields better utilisation and scalability (but...)
• Developer/Customer has to interface with these abstraction layers
What’s a cloud environment?
24. ELIXIR Webinar
ELIXIR proteomics use-case (soon proteomics community)
PROTEOMES (Proteoform centric, including PTMs and sequence variants)
AREA 1: Reproducible open analysis pipelines: DDA, DIA, targeted proteomics, and others
DATA PRODUCERS
PROTEOMICS DATA ANALYSIS & QC
UniProt
neXtProt
Protein Knowledge Bases
LIMS
Others
PRIDE
Proteogenomics and Proteotranscriptomics
AREA 2: Multi-omics integration
Proteometabolomics
SYSTEMS BIOLOGY & SYSTEMS MEDICINE
26. ELIXIR Webinar
We opted for the framework
Features:
• Tool modularisation
• Solutions for data handover between tools with standardised (PSI) formats
• Adapters for integrating third-party software (Search Engines, LuciPHOr, Fido, Percolator, etc.)
• Integration into various workflow systems as a basis
Analysis pipeline construction
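A minimal sketch of the modular idea described above: each tool is a small step that reads and writes a shared, standardised record, and third-party software is wrapped in an adapter that exposes the same interface. All class and function names here are hypothetical and do not correspond to the API of the chosen framework or of any tool named on the slide; the "search engine" is a stand-in function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


@dataclass
class Handover:
    """Stand-in for a standardised exchange file passed between tools."""
    spectra: List[str]
    identifications: List[str] = field(default_factory=list)


class Tool(Protocol):
    def run(self, data: Handover) -> Handover: ...


class ThirdPartyAdapter:
    """Wraps an external function so it looks like any other pipeline tool."""

    def __init__(self, external: Callable[[List[str]], List[str]]):
        self._external = external

    def run(self, data: Handover) -> Handover:
        return Handover(data.spectra, self._external(data.spectra))


class DecoyFilter:
    """Native tool: drops identifications flagged as decoys."""

    def run(self, data: Handover) -> Handover:
        kept = [i for i in data.identifications if not i.startswith("DECOY_")]
        return Handover(data.spectra, kept)


def fake_search_engine(spectra):
    # Hypothetical external tool: one hit per spectrum, odd ones marked decoy.
    return [("DECOY_" if n % 2 else "") + f"PEPTIDE{n}"
            for n, _ in enumerate(spectra)]


def run_pipeline(tools, data):
    """Pass the handover record through each tool in turn."""
    for tool in tools:
        data = tool.run(data)
    return data


result = run_pipeline(
    [ThirdPartyAdapter(fake_search_engine), DecoyFilter()],
    Handover(spectra=["s0", "s1", "s2"]),
)
print(result.identifications)
```

Because every step accepts and returns the same record type, steps can be reordered or swapped without touching their neighbours, which is the property the slide attributes to standardised handover formats.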
27. ELIXIR Webinar
Analysis pipeline construction
Kubernetes & container advantages
• Software in containers
→ readily usable and well-isolated modules
• Resilient system, working in different infrastructure environments
28. ELIXIR Webinar
Summarising the benefits of a cloud-based pipeline
• The containerisation of workflow steps makes execution resource-efficient and version-aware
• Compute infrastructure can be added dynamically; infrastructure is set up on demand (and released after use)
• Bring the analysis to the data
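One way to picture the "version aware" benefit above: if every workflow step is tied to a pinned container reference, two runs can be compared step by step. The sketch below is illustrative only; the image names under `example.org` and the version tags are hypothetical and do not refer to any container actually used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerStep:
    name: str
    image: str  # hypothetical image repository
    tag: str    # pinned version of the containerised tool

    @property
    def reference(self):
        return f"{self.image}:{self.tag}"


def provenance(steps):
    """Map each step name to the exact container reference it ran with."""
    return {s.name: s.reference for s in steps}


def changed_steps(old, new):
    """Names of steps whose container reference differs between two runs."""
    return sorted(k for k in new if old.get(k) != new[k])


run_a = provenance([
    ContainerStep("search", "example.org/search", "1.0.0"),
    ContainerStep("quant", "example.org/quant", "2.1.0"),
])
run_b = provenance([
    ContainerStep("search", "example.org/search", "1.1.0"),
    ContainerStep("quant", "example.org/quant", "2.1.0"),
])
print(changed_steps(run_a, run_b))
```

Here only the `search` step differs between the two runs, so a change in results can be traced to a specific tool version.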
30. ELIXIR Webinar
• PRIDE data connection into the cloud is being optimised
• The workflows are deployed into the EMBL-EBI
“Embassy Cloud” Portal and fitted with a dashboard as a
proof of concept.
• Conceptually, these workflows can be deployed in
different cloud infrastructures in the future so they can be
used openly by the wider community.
Current status
31. ELIXIR Webinar
• One slide intro to proteomics
• The ELIXIR Proteomics Community
• The implementation study
• Plans for the near future
Outline
32. ELIXIR Webinar
• Follow-up of the implementation study just mentioned.
• Title: “Extending open proteomics data analysis pipelines in the cloud: Additional tools and focus on scalability, supporting the dramatic growth of public proteomics data”
• It will start in August 2018 (1 year):
• Led by ELIXIR-Belgium (Martens).
• Participation of EMBL-EBI (Vizcaíno, Newhouse), ELIXIR-Germany (Kohlbacher), ELIXIR-France (Bouyssie), ELIXIR-Spain (Sabidó)
• It will include other tools and additional pipelines (CompOmics tools, QCloud, PROFI tools, etc.).
Just approved Implementation Study (2018-2019)
33. ELIXIR Webinar
• Assigned to the Community (10 ELIXIR nodes involved). It will start in June 2018 (1 year).
• Title: “Crowd-sourcing the annotation of public proteomics datasets to improve data reusability”.
• Apply software developed in the different nodes to improve automatic annotation pipelines linked to PRIDE (and QC assessment).
• Improve re-usability of public data.
Just approved Implementation Study (2018-2019)
34. ELIXIR Webinar
PROTEOMES (Proteoform centric, including PTMs and sequence variants)
UniProt
neXtProt
Protein Knowledge Bases
Proteogenomics and Proteotranscriptomics
AREA 3: Data management & Annotation
Metadata improvements; management of human identifiable data; data standards (e.g. for multi-omics approaches)
AREA 1: Reproducible open analysis pipelines: DDA, DIA, targeted proteomics, and others
AREA 2: Multi-omics integration
Proteometabolomics
DATA PRODUCERS
LIMS
SYSTEMS BIOLOGY & SYSTEMS MEDICINE
PROTEOMICS DATA ANALYSIS & QC
Others
DATA MANAGEMENT
PRIDE
35. ELIXIR Webinar
• Proteomics bioinformatics activities in Europe are very prominent world-wide
• Analysis infrastructure: work in progress
• Plans for the future:
• Data integration approaches with other ‘omics’ technologies (e.g. genomics, metabolomics, etc.).
• Add pipelines for other popular experimental techniques
• Improve data management practices (metadata annotation, management of clinical data, …)
Summary
36. ELIXIR Webinar
Acknowledgements
Thank you!
Proteomics team
Yasset Perez-Riverol
Andrew Jarnuczak
Tobias Ternent
Phenomenal
Pablo Moreno
Embassy cloud
David Ocaña
ELIXIR-DE
The OpenMS team:
Oliver Kohlbacher
Timo Sachsenberg
Julianus Pfeuffer
Martin Eisenacher
MDC
Chris Bielow
Sanger/ICR
Jyoti Choudhary
Hendrik Weisser
Embassy cloud portal
Jose A. Dianes