Programmatic Access to Biological Databases through EnCore and EnVision
1. Data Integration through Enfin and EnCore
Programmatic Access To Biological Databases (Perl)
22–26 February 2010
Rafael Jimenez
rafael@ebi.ac.uk
Updated: 12 February 2010
EnCORE
presentation
• EnCore
• EnVision
2. ENFIN Network of Excellence
• Brings together
experimentalists and
computational biologists to
develop the next generation of
informatics resources for
systems biology
• Funded by the European
Commission within its FP6
programme under the
thematic area ‘Life sciences,
genomics and biotechnology
for health’
• 20 partners in 13 countries
• www.enfin.org
EnCore
4. EnCore
• ENFIN Platform to enable mining data across various domains,
sources, formats and types
• Integrates database resources and analysis tools across different
disciplines
[Diagram: the user sends input to the EnCORE services and receives output, both in the standard EnXML format; the services are also reachable through EnVISION pages]
5. Diverse service world
• External data sources
• Access interfaces: SOAP, REST, Java API, Perl API, FTP, GUI, …
• Different formats: XML, CSV, Plain Text, JSON, …
• User: integration?
• Multiple manual connections
• Multiple technologies
• Multiple result files which have to be combined manually
• Much work to reproduce
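The manual combination step listed above can be illustrated with a minimal Python sketch. It assumes three invented payloads (CSV, JSON and XML) that carry the same kind of record; the payloads, field names and helper functions are hypothetical and do not correspond to any real EnCORE or database output.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Invented payloads standing in for three sources that answer in different formats.
CSV_PAYLOAD = "id,name\nX1,alpha\n"
JSON_PAYLOAD = '[{"id": "X2", "name": "beta"}]'
XML_PAYLOAD = '<entries><entry id="X3" name="gamma"/></entries>'


def from_csv(text):
    """Parse CSV rows into plain dictionaries."""
    return [{"id": r["id"], "name": r["name"]} for r in csv.DictReader(io.StringIO(text))]


def from_json(text):
    """Parse a JSON list of objects into plain dictionaries."""
    return [{"id": r["id"], "name": r["name"]} for r in json.loads(text)]


def from_xml(text):
    """Parse <entry> elements into plain dictionaries."""
    return [{"id": e.get("id"), "name": e.get("name")} for e in ET.fromstring(text).iter("entry")]


# One parser per format is needed before the records can be combined.
records = from_csv(CSV_PAYLOAD) + from_json(JSON_PAYLOAD) + from_xml(XML_PAYLOAD)
print(records)
```

Every additional source format needs its own parser, which is the overhead a single exchange format is meant to remove.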
6. Standardized EnCORE world
[Diagram: the heterogeneous external world (external data sources) on one side and the standardised EnCORE world on the other; EnCORE services exchange input and output with the user in the standard EnXML format, through API or WS access and through EnVISION pages]
7. … – Input – Output – Input – Output – …
… → Input (EnXML) → Service (EnCORE WS) → Output (EnXML) → …
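A minimal Python sketch of this input/output chaining follows. The `Dataset` class is a hypothetical stand-in for an EnXML document and does not reproduce the real schema; the toy service and its partner table are invented for illustration only.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import Callable, Dict, List


@dataclass
class Dataset:
    """Hypothetical stand-in for an EnXML document (not the real schema)."""
    molecules: List[str] = field(default_factory=list)
    # Each set groups molecule identifiers under a label.
    sets: Dict[str, List[str]] = field(default_factory=dict)


Service = Callable[[Dataset], Dataset]


def toy_interaction_service(data: Dataset) -> Dataset:
    """Pretend lookup: pair each input molecule with an invented partner."""
    partners = {"A": "B", "B": "C"}  # made-up data, for illustration only
    out = Dataset(molecules=list(data.molecules), sets=dict(data.sets))
    for mol in data.molecules:
        partner = partners.get(mol)
        if partner is None:
            continue
        if partner not in out.molecules:
            out.molecules.append(partner)
        out.sets[f"{mol}-{partner}"] = [mol, partner]
    return out


def run_chain(data: Dataset, services: List[Service]) -> Dataset:
    """Feed the output of each service into the next one."""
    return reduce(lambda acc, svc: svc(acc), services, data)


result = run_chain(Dataset(molecules=["A"]),
                   [toy_interaction_service, toy_interaction_service])
print(result.molecules)
print(sorted(result.sets))
```

Because every service takes and returns the same type, any output can be passed on as the next input without conversion.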
8. EnCORE services
From Inputs to Outputs
Input/Query: EnCORE dataset
• Database IDs
• Sequences
Program/Service: EnCORE webservice
• Enfin-IntAct
• Enfin-PRIDE
• Enfin-Affy2UniProt
• Enfin-PICR
• Enfin-Reactome
• Enfin-ArrayExpress
• Enfin-UniProt
• Enfin-BioModels
• Enfin-KEGG
• Enfin-G:GOSt
• Enfin-CellMINT
• Enfin-DOMAINATION
Output/Results: EnCORE results (positive / negative)
• Experiment: identifies the result
• Sets: contain the structure of the result
• Molecules: include the results
• Features: describe details of the result
9. EnCORE services
Example
Input/Query: EnCORE dataset
• Database ID (UniProt ID): P37173
Program/Service: EnCORE webservice
• Enfin-IntAct
Output/Results: EnCORE results (positive / negative)
• Experiment: ID4
• Sets: (1) EBI-296235, (2) EBI-1033040, (3) EBI-902913, EBI-902937, (4) EBI-296166, EBI-296246, (5) EBI-902913
• Molecules: (1) O35613, (2) P10600, (3) P07200, (4) Q9UER7, (5) Q99K41
• Features: No features
10. EnCORE services
Example (result as a table)
Input/Query: P37173
Program/Service: Enfin-IntAct
Output/Results:
#  Interactor A  Interactor B  Interaction IDs
1  P37173        O35613        EBI-296235
2  P37173        P10600        EBI-1033040
3  P37173        P07200        EBI-902913, EBI-902937
4  P37173        Q9UER7        EBI-296166, EBI-296246
5  P37173        Q99K41        EBI-902913
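The step from the sets and molecules of the previous slide to this table can be sketched in Python. The identifiers are the ones shown on the slides; the list-based layout and the function names are assumptions made for illustration, not the EnXML structure itself.

```python
QUERY = "P37173"

# Molecules and sets as numbered (1) to (5) on the previous slide.
MOLECULES = ["O35613", "P10600", "P07200", "Q9UER7", "Q99K41"]
SETS = [
    ["EBI-296235"],
    ["EBI-1033040"],
    ["EBI-902913", "EBI-902937"],
    ["EBI-296166", "EBI-296246"],
    ["EBI-902913"],
]


def build_rows(query, molecules, sets):
    """Pair the query with each partner molecule and its interaction IDs."""
    return [
        (index, query, partner, ", ".join(ids))
        for index, (partner, ids) in enumerate(zip(molecules, sets), start=1)
    ]


def format_table(rows):
    """Render the rows as left-aligned plain-text columns."""
    header = ("#", "Interactor A", "Interactor B", "Interaction IDs")
    table = [header] + [tuple(str(cell) for cell in row) for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


rows = build_rows(QUERY, MOLECULES, SETS)
print(format_table(rows))
```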
12. EnCore
Adapting EnCORE to Standards and Federation
13. Molecular Biology Database resources
[Pie chart: ~1440 molecular biology database resources by category]
• Human Genes and Diseases: 14%
• Proteomics Resources (20): 0%
• Other Molecular Biology Databases: 3%
• Immunological databases: 2%
• Plant databases: 8%
• Organelle databases: 2%
• Human and other Vertebrate Genomes: 8%
• Nucleotide Sequence Databases: 9%
• RNA sequence databases
• Protein sequence databases
• Structure Databases: 9%
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways: 9%
Source: M.Y. Galperin, G.R. Cochrane. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Research, 2008.
15. New EnCore approach
Standards and Federation
[Diagram labels: Domain 1; External data sources; Federated systems / Standards; WS; EnCORE wrapper; EnVISION pages (web interface)]
17. New EnCore approach
Standards and Federation
• Less development
• More sources
• Domain data integration
• Comparable results
• Automatic inclusion of new data sources
• Less maintenance
• More stable formats
• Easy to control changes
• Facilitates validation
• Extra value to the original data
18. New role for EnCore and EnVision
Extra value to the original data
• Integration of sources.
• Filtering redundancy (whenever possible)
• Interconnect results.
• Data analysis
• More visualization
[Diagram: Domain 1, Domain 2, Domain 3, Domain 4, Domain 5, Domain …]
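Integrating sources and filtering redundancy can be sketched as follows. This is a minimal illustration, assuming two hypothetical sources that report interactions as identifier pairs; the source names, the identifiers and the `merge_interactions` helper are all invented, and the slides do not describe how EnCORE actually performs this step.

```python
from collections import defaultdict


def merge_interactions(results_by_source):
    """Merge pair lists from several sources, keeping each unordered pair once."""
    merged = defaultdict(set)
    for source, pairs in results_by_source.items():
        for a, b in pairs:
            # (A, B) and (B, A) describe the same interaction.
            key = tuple(sorted((a, b)))
            merged[key].add(source)
    # Record which sources support each pair so results stay comparable.
    return {pair: sorted(sources) for pair, sources in sorted(merged.items())}


# Invented example data for two hypothetical sources in one domain.
results = {
    "source_one": [("P1", "P2"), ("P2", "P3")],
    "source_two": [("P2", "P1"), ("P3", "P4")],
}

merged = merge_interactions(results)
for pair, sources in merged.items():
    print(pair, sources)
```

Keeping the list of supporting sources with each merged entry is one way to add value to the original data rather than discarding provenance.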
20. EnVision
22. EnVISION interface
• Results for PRIDE, UniProt, IntAct, Reactome, CellMINT, PICR, BioModels, …
http://www.ebi.ac.uk/~rafael/enfin/presentations/EnVISION2_01.ppt
http://www.enfin.org/dokuwiki/
EnCORE
tutorial
Results per service
Example
This picture simplifies the idea behind EnCORE.
The input (our query) is contained in a standard XML format called EnXML.
We can run different services over this input.
The results come back in the same EnXML format.
The outputs can be used as inputs to other services.
This is a generic example of how an EnCORE service works.
A specific example:
The query is a protein accession (Acc).
We run the IntAct service.
We get the interaction results, described in EnXML terminology.
The same results in a table.
EnCORE facilitates building workflows.
EnVISION results are nice, but do not forget our initial integration problem.
For one domain (protein interactions, pathways, protein sequences, …) we might have several databases providing data.
EnCORE provides a great solution; however, it is incomplete if it cannot include more resources.
For EnCORE it is not feasible to develop and maintain so many wrappers.
Nonetheless, EnCORE can overcome this problem by using standards and federated systems.
EnVISION is an EnCORE interface.
With just one click, the user can run different services and get a quick overview of a dataset.
This example shows results for …
Here is an example of the potential of EnVISION.
In this example we used a dataset of more than 300 protein accessions.
In this screenshot, EnVISION found more than 500 pathways for this dataset.
EnVISION can link and display positive results in a pathway map.
Integrating biological data of various types and developing adapted bioinformatics tools are critical objectives for enabling research at the systems level. The European Network of Excellence ENFIN is developing an adapted infrastructure to connect databases and platforms, enabling both the generation of new bioinformatics tools and the experimental validation of computational predictions. Beyond the use of common standards to format individual datasets, there is a need for sophisticated informatics platforms that enable mining data across various domains, sources, formats and types. The aim of the EnCORE project is to integrate an extensive list of database resources and analysis tools across different disciplines in a computationally accessible and extensible manner, facilitating automated data retrieval and processing with a special focus on systems biology. The EnCORE platform is available as a collection of web services with a common standard format that is easy to integrate into workflow management software such as Taverna. EnCORE services are also accessible through EnVISION, a graphical web user interface that provides detailed information such as molecular interactions, biological pathways and computational models of pathways.