2. Table of contents
• Data integration
– Why do we need it?
– What is it?
– Problems and solutions
– Different approaches
– Important variables
– Tools
6. Why so many data sources?
• Many data types
• Many communities
• Different ways to structure data
• Control
• Reputation
• Easy publication
7. 23.08.18 7
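Data collection
[Figure: two panels, "Ideally" and "Reality". Annotators (A) supply databases (DB); the user reaches the data through a GUI, an API and WS. The "Reality" panel shows several databases, each with its own GUI, API and WS.]
Legend: A = Annotator; DB = Database; GUI = Graphical User Interface; API = Application programming interface; WS = Web Services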
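8. Utility of bioinformatics vs. scientific impact
[Figure: plot with axis labels "Utility of bioinformatics" and "Scientific impact". Region labels: "Too little bioinformatics"; "Too many databases, too diverse interfaces". Credit: Tim Hubbard]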
10. Data integration
[Figure: two panels, "NO integration" and "Integration". Components shown: databases (DB) with their GUI, API and WS, and a user working through a Graphical User Interface (GUI).]
Combining data residing in different sources
… providing users with a unified view of these data.
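11. Utility of bioinformatics vs. scientific impact
[Figure: plot with axis labels "Utility of bioinformatics" and "Scientific impact". Region labels: "Too little bioinformatics"; "Too many databases, too diverse interfaces". Additional label: "Integration of"]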
13. Problems
Many data sources
• Many sources to maintain
• New sources appearing
• Only 20% have a sustained future*
• How to find them?
Different query interfaces
data integration?
Variable results
• Formats
• Schemas
• Controlled vocabularies
• Minimum information guidelines
Redundant results
* Merali Z. et al. Databases in peril. Nature 2005.
14. Solutions
– Scientific and political independence of the databases
– Cross-database queries spanning domain and organizational boundaries
– Sharing and adoption rather than reinventing
– Adoption of standards
– Coordination to avoid redundant content
– Infrastructure to avoid volatile resources
– Registries to find resources and services
25. Federated databases
[Figure: layers labelled "Curators / Annotators", "Original data sources", "Third party implementations" and "Users", connected through SP and QI boxes.]
Examples:
• DAS
• PSICQUIC
• EnCORE
• RDF
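32. Integrating different domains
[Figure: panels "Integration per domain" and "Integrating different domains", with steps numbered 1 and 2 and the label "leverage". Boxes shown: SP and QI for Domain 1, Domain 2 and Domain …, plus one additional QI.]
SP = Common identifiers, Controlled vocabularies, Common formats, Common schemas, Minimum information guidelines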
33. Domain standards
• Standardization per domain
• Common identifiers
• Controlled vocabularies
• Common formats
• Common schemas
• Minimum information guidelines
• Common query interfaces
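The list above can be made concrete with a small sketch. It maps records from two differently structured sources onto one common schema and replaces free-text organism labels with a controlled term. The field names, the `ORGANISM_VOCAB` table and the `normalize` helper are all assumptions made for illustration; they do not come from any particular standard.

```python
from typing import Dict

# Hypothetical controlled vocabulary: many free-text labels, one agreed term.
ORGANISM_VOCAB: Dict[str, str] = {
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
}


def normalize(record: Dict[str, str], field_map: Dict[str, str]) -> Dict[str, str]:
    """Rename source-specific fields to the common schema and apply the vocabulary.

    field_map maps a common field name to the name used by the source.
    """
    common = {name: record[source_name].strip() for name, source_name in field_map.items()}
    label = common["organism"].lower()
    if label not in ORGANISM_VOCAB:
        raise ValueError(f"organism label not in controlled vocabulary: {label!r}")
    common["organism"] = ORGANISM_VOCAB[label]
    return common


# Two sources describing the same protein with different field names and labels.
from_source_a = normalize({"acc": "P04637 ", "species": "Human"},
                          {"protein_id": "acc", "organism": "species"})
from_source_b = normalize({"id": "P04637", "taxon": "H. sapiens"},
                          {"protein_id": "id", "organism": "taxon"})
```

Once both records share identifiers, field names and vocabulary, they can be compared directly, which is what makes a common query interface practical.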
35. Architecture
• Data warehousing
– Pull data from several resources into one resource.
– Main features:
• Data centralization
• High maintenance
• Data out of date
• Modifications (schema, format, content, …)
• Federation
– Data residing in different sources with a common standard protocol and query system.
– Main features:
• Fresh data (original)
• Data redundancy
• Data inconsistency
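A minimal sketch of the two architectures, assuming toy in-memory sources instead of real databases. The class names `Source`, `Warehouse` and `Federation` are hypothetical. The sketch only illustrates the trade-offs listed above: the warehouse answers from a central copy that can go out of date, while the federation always asks the original sources and therefore returns fresh but possibly redundant data.

```python
from typing import Dict, List

Record = Dict[str, str]


class Source:
    """Stand-in for an original data source that can change over time."""

    def __init__(self, name: str, rows: List[Record]) -> None:
        self.name = name
        self.rows = rows

    def dump(self) -> List[Record]:
        # Tag each row with its origin so provenance survives integration.
        return [dict(row, source=self.name) for row in self.rows]

    def query(self, gene: str) -> List[Record]:
        return [row for row in self.dump() if row["gene"] == gene]


class Warehouse:
    """Pulls data from several sources into one central copy."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources
        self.copy: List[Record] = []

    def refresh(self) -> None:
        # Maintenance step: must be rerun whenever a source changes.
        self.copy = [row for s in self.sources for row in s.dump()]

    def query(self, gene: str) -> List[Record]:
        return [row for row in self.copy if row["gene"] == gene]


class Federation:
    """Forwards each query to every source through one common interface."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources

    def query(self, gene: str) -> List[Record]:
        return [row for s in self.sources for row in s.query(gene)]
```

For example, after `warehouse.refresh()`, a row appended to one source shows up in `Federation.query` immediately but stays invisible to `Warehouse.query` until the next refresh. If two sources hold the same fact, the federation returns it twice.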
43. ID Mapping services
Protein Identifier Cross-Reference Service
http://www.ebi.ac.uk/Tools/picr/
Figure labels:
• Logical xref (hyperlinked)
• Inactive xref
• Secondary Identifier
• Active xref (hyperlinked)
Web services!
• REST
• SOAP
Credit: Richard Cote
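The idea behind an ID mapping service can be sketched with a local lookup table. This is not the PICR API: the `XREFS` table, the database names, the accessions and the `map_identifier` function are all hypothetical, and the meaning given to "active" here (the target entry still exists) is an assumption made for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Xref(NamedTuple):
    database: str
    accession: str
    active: bool  # assumption: False means the target entry was retired


# Hypothetical cross-reference table keyed by (database, accession).
XREFS: Dict[Tuple[str, str], List[Xref]] = {
    ("DB_A", "A001"): [
        Xref("DB_B", "B900", True),
        Xref("DB_C", "C123", False),
    ],
}


def map_identifier(database: str, accession: str, target: str,
                   include_inactive: bool = False) -> List[str]:
    """Return accessions in the target database that refer to the same entry."""
    hits = XREFS.get((database, accession), [])
    return [x.accession for x in hits
            if x.database == target and (x.active or include_inactive)]
```

A real service would expose the same lookup over REST or SOAP, so that clients do not need to hold the mapping table themselves.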
44. Standard formats/schemas
[Figure: map of exchange formats by domain and level of detail. Formats shown: BioPAX, PSI-MI 2, SBML, CellML. Categories: Database Exchange Formats; Simulation Model Exchange Formats. Domains shown: Genetic Interactions; Molecular Interactions (Pro:Pro, All:All); Interaction Networks (Molecular, Non-molecular; Pro:Pro, TF:Gene, Genetic); Regulatory Pathways; Metabolic Pathways; Biochemical Reactions; Small Molecules; Rate Formulas. Axes run from Low Detail to High Detail. Credit: Anatoly Sorokin]
47. Minimum information guidelines
• PSI: Proteomics Standards Initiative
– Work group of the Human Proteome Organization
– Defines community standards for data in proteomics
• … facilitating data comparison, exchange and verification
• MIAPE: The Minimum Information About a Proteomics Experiment
• Data and metadata from proteomics experiments
• Data: results
• Metadata: data about the data
• Where the samples came from
• How the analyses were performed
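A minimum information guideline amounts to a checklist that a submission must satisfy. The sketch below assumes a made-up submission layout and two made-up metadata keys, `sample_origin` and `analysis_method`, chosen to mirror the two bullets above. The actual MIAPE documents define their own, much more detailed requirements.

```python
from typing import Any, Dict, List

# Hypothetical required metadata keys, mirroring the slide:
# where the samples came from and how the analyses were performed.
REQUIRED_METADATA = ("sample_origin", "analysis_method")


def missing_metadata(submission: Dict[str, Any]) -> List[str]:
    """List the required metadata keys that are absent or empty."""
    metadata = submission.get("metadata", {})
    return [key for key in REQUIRED_METADATA
            if not str(metadata.get(key) or "").strip()]


complete = {
    "data": [0.12, 0.98],
    "metadata": {"sample_origin": "mouse liver", "analysis_method": "LC-MS/MS"},
}
incomplete = {"data": [0.12, 0.98], "metadata": {"sample_origin": "  "}}
```

A repository could refuse or flag any submission for which `missing_metadata` returns a non-empty list.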
As a biologist, I would prefer to see all the information in a single database.
Centralized databases have this mission.
They aim to collect all the information for one specific domain.
However …
Medium-size labs and organizations are capable of producing large amounts of data.
Then it becomes harder to submit data to centralized repositories.
Moreover, data producers like to control and structure their own databases, developing their own GUIs and access protocols.
For us, the users, it becomes harder to access the information.
For one specific domain we might find different databases using different GUIs. We might end up downloading data in different formats, which complicates the integration of results. After integration we might find high redundancy in our results.
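To illustrate the last point, here is a small sketch that reads results delivered in two different formats (CSV and JSON), brings them to one record layout and removes redundant entries. The field names in both inputs and the rule for deciding that two records are the same (gene and pathway compared case-insensitively) are assumptions made for the example.

```python
import csv
import io
import json
from typing import Dict, List

Record = Dict[str, str]


def from_csv(text: str) -> List[Record]:
    """Parse a CSV export with the (hypothetical) columns gene and pathway."""
    return [{"gene": row["gene"], "pathway": row["pathway"]}
            for row in csv.DictReader(io.StringIO(text))]


def from_json(text: str) -> List[Record]:
    """Parse a JSON export with the (hypothetical) keys symbol and pathway_name."""
    return [{"gene": row["symbol"], "pathway": row["pathway_name"]}
            for row in json.loads(text)]


def deduplicate(records: List[Record]) -> List[Record]:
    """Keep the first record for each (gene, pathway) pair, ignoring case."""
    seen = set()
    unique = []
    for record in records:
        key = (record["gene"].upper(), record["pathway"].lower())
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Each new source needs its own parser, which is the cost that common formats and schemas are meant to remove.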
This workflow searches for genes that reside in a QTL (Quantitative Trait Loci) region in the mouse, Mus musculus. The workflow requires three inputs: a chromosome name or number, a QTL start base pair position and a QTL end base pair position. Data is then extracted from BioMart to annotate each of the genes found in this region. The Entrez and UniProt identifiers are then sent to KEGG to obtain KEGG gene identifiers. The KEGG gene identifiers are then used to search for pathways in the KEGG pathway database.
The workflow is pathways_and_gene_annotations_for_qtl_phenotype_28303.
Execute it with:
chromosome = 17
start_position = 28500000
end_position = 32500000
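The chain of steps in this workflow can be sketched with local lookup tables standing in for the remote services. Nothing below calls BioMart or KEGG: the gene names, coordinates, identifiers and pathway labels are invented placeholders, and `genes_in_region` and `run_workflow` are hypothetical helpers that only show how the output of one step feeds the next.

```python
from typing import Dict, List

# Placeholder annotation table (stand-in for the BioMart step).
GENES = [
    {"gene": "geneA", "chromosome": "17", "start": 29000000, "end": 29010000},
    {"gene": "geneB", "chromosome": "17", "start": 40000000, "end": 40020000},
    {"gene": "geneC", "chromosome": "2", "start": 29000000, "end": 29010000},
]
# Placeholder identifier mapping and pathway lookup (stand-ins for the KEGG steps).
KEGG_IDS: Dict[str, str] = {"geneA": "kegg:0001"}
PATHWAYS: Dict[str, List[str]] = {"kegg:0001": ["pathway:X", "pathway:Y"]}


def genes_in_region(chromosome, start_position: int, end_position: int) -> List[str]:
    """Return genes on the chromosome that overlap the QTL interval."""
    return [g["gene"] for g in GENES
            if g["chromosome"] == str(chromosome)
            and g["start"] <= end_position
            and g["end"] >= start_position]


def run_workflow(chromosome, start_position: int, end_position: int) -> Dict[str, List[str]]:
    """Region -> genes -> KEGG gene identifiers -> pathways."""
    result: Dict[str, List[str]] = {}
    for gene in genes_in_region(chromosome, start_position, end_position):
        kegg_id = KEGG_IDS.get(gene)
        result[gene] = PATHWAYS.get(kegg_id, []) if kegg_id else []
    return result


pathways_by_gene = run_workflow(chromosome=17, start_position=28500000, end_position=32500000)
```

In the real workflow each lookup is a web service call orchestrated by the workflow engine; the structure of the chain is the same.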
The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification.
The PSI was founded at the HUPO meeting in Washington, April 28-29, 2002.
MIAPE: The Minimum Information About a Proteomics Experiment.
A guidance document specifying the data and metadata that should be collected from proteomics experiments:
where samples came from and how analyses were performed.
Data should be accompanied by context: 'metadata' ('data about the data').
Integration of biological data of various types and development of adapted bioinformatics tools represent critical objectives to enable research at the systems level. The European Network of Excellence ENFIN is engaged in developing an adapted infrastructure to connect databases and platforms, enabling both the generation of new bioinformatics tools and the experimental validation of computational predictions. Beyond the use of common standards to format individual datasets, there is a need for sophisticated informatics platforms to enable mining data across various domains, sources, formats and types. The aim of the EnCORE project is to integrate, across different disciplines, an extensive list of database resources and analysis tools in a computationally accessible and extensible manner, facilitating automated data retrieval and processing with a special focus on systems biology. The EnCORE platform is available as a collection of web services with a common standard format that is easy to integrate into workflow management software such as Taverna. Additionally, EnCORE services are accessible through EnVISION, a web graphical user interface providing detailed information such as molecular interactions, biological pathways and computational models of pathways.