GPKB Genomic Proteomic Knowledge Base
1. Group Meeting 2016-08-17, Tech
“GPKB: Genomic and Proteomic
Knowledge Base”
by Davide Chicco
davide.chicco@gmail.com
2. ● A data warehouse developed and maintained by my
former colleagues at Politecnico di Milano university
● Integration of several data sources:
● KEGG (Kyoto Encyclopedia of Genes and Genomes)
● OMIM (Online Mendelian Inheritance in Man)
● Gene Ontology Annotations (GOA)
● Gene Ontology (GO)
● Expasy Enzyme
● Entrez Gene
● Reactome
● UniProt
● BioCyc
● IntAct
Genomic and Proteomic Knowledge Base (GPKB)
(c) Flickr Vitlava: database-integration
3. ● Large numbers of biological datasets are available
worldwide
● In particular, biomolecular annotations (associations
between genes or gene products and biological function
features) can help scientists understand biology and
life science
● The hierarchical structure of the ontologies behind
these datasets can highlight semantic relationships
between data
Motivation
4. ● Implemented in PostgreSQL
● It can be downloaded or used through a web interface
● Dataset quantitative characteristics:
~ 20 million genes
~ 20 million proteins
~ 17 million gene annotations
~ 31 million protein annotations
● Some tables are simply imported from data sources
(GO, Reactome, etc)
● Other tables are inferred from the available datasets
Technical details and quantitative characteristics
5. ● Data tables available:
Technical details and quantitative characteristics
Image from M. Masseroli, et al. "Explorative search of distributed bio-data to answer complex biomedical questions." BMC
Bioinformatics 15.1 (2014): 1.
Green-gray boxes: data tables available in the general data warehouse and publicly
available on the web interface
Gray boxes: data tables available in the general data warehouse (to be made publicly
available in the future)
7. ● The Basic search functionality is available for searches
aimed at retrieving all information directly associated
with a single feature instance, either imported from
external sources or inferred based on the integrated
data
● For example, all annotations and interactions of a
specific gene or protein (e.g. the human insulin-like
growth factor 2 (somatomedin A) (IGF2) gene, Entrez
Gene ID 3481), or all genes and proteins annotated to
a particular biomedical feature instance, such as a
specific pathway or genetic disorder (e.g. Alzheimer
disease, OMIM ID 104300).
Basic search
13. ● Authors also implemented an enhanced functionality
and graphical interface for multi-feature search, named
Easy search.
● It supports the simple graphical composition of
complex queries on multiple features just by selecting
the required features in order, e.g. gene, pathway,
enzyme, biological function feature, genetic disorder,
clinical synopsis, etc.; if needed, display and filtering
constraints can be defined for any attribute of each
selected feature just by specifying them in the feature
window.
Easy search
14. ● Query example: relationship between genes, biological
function features and pathologies (e.g. in Muscular
dystrophy, Duchenne type).
● Using the Easy search functionality, the user can
select, in order, the gene feature, then the gene
associated biological function feature and genetic
disorder features, and then the genetic disorder
associated clinical synopsis feature; finally, before
submitting the query, a user who wants to investigate
only some related pathologies can specify them
as values of the name attribute in the genetic disorder
feature window.
Easy search
21. Exact count: runs an exact count of the query results;
otherwise the result count is estimated
22. Conceptual query (C): the query includes the
conceptually equivalent database items coming
from other data sources
23. Semantic expansion: when a query is executed with
semantic expansion for a feature, the result contains
not only the items that satisfy the query but also
semantically related, more general items based on the
feature ontologies
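The idea of semantic expansion can be illustrated with a minimal Python sketch. It is not GPKB's implementation: the toy ontology, the term IDs (`T:leaf`, `T:mid_a`, etc.) and the function names are all hypothetical, and the only assumption carried over from the slide is that "more general items" are the ancestors of a matched term in the feature ontology.

```python
from collections import deque

# Hypothetical toy ontology: each term maps to its direct parents,
# i.e. the more general terms it is linked to. IDs are made up.
PARENTS = {
    "T:leaf": ["T:mid_a", "T:mid_b"],
    "T:mid_a": ["T:root"],
    "T:mid_b": ["T:root"],
    "T:other": ["T:root"],
}


def ancestors(term, parents):
    """Return every more general term reachable from `term`."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:  # a term can be reached by several paths
                seen.add(parent)
                queue.append(parent)
    return seen


def expand_results(matched_terms, parents):
    """Add the ancestors of each matched term to the result set."""
    expanded = set(matched_terms)
    for term in matched_terms:
        expanded |= ancestors(term, parents)
    return expanded


if __name__ == "__main__":
    print(sorted(expand_results({"T:leaf"}, PARENTS)))
```

Without expansion the query on `T:leaf` returns only that term; with expansion it also returns the two intermediate terms and the root, while the unrelated sibling `T:other` stays out.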
24. Expand query: after obtaining results for an initial
query, expands the query only for the rows of the
previous query result that the user selected
25. Show all: shows all the query results
Only matching: shows only the query results
with matching values across all the selected features
26. “Find all the genes that are involved both in breast
cancer and in prostate cancer, and then retrieve all the
proteins that are encoded by one of those genes”
http://www.bioinformatics.deib.polimi.it/GPKB
Demo
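For readers without access to the web interface, the demo question can be sketched as a SQL query. This is only an illustration: the table and column names (`gene`, `disorder`, `gene2disorder`, `protein`) and the placeholder rows are hypothetical and do not reflect the real GPKB schema, and Python's built-in `sqlite3` stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Hypothetical miniature schema with placeholder data (not the GPKB schema).
SCHEMA = """
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE disorder (disorder_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE gene2disorder (gene_id INTEGER, disorder_id INTEGER);
CREATE TABLE protein (protein_id INTEGER PRIMARY KEY, name TEXT, gene_id INTEGER);

INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B'), (3, 'GENE_C');
INSERT INTO disorder VALUES (10, 'breast cancer'), (20, 'prostate cancer');
-- GENE_A is linked to both disorders, GENE_B and GENE_C to one each.
INSERT INTO gene2disorder VALUES (1, 10), (1, 20), (2, 10), (3, 20);
INSERT INTO protein VALUES
    (100, 'PROT_A1', 1), (101, 'PROT_A2', 1), (102, 'PROT_B1', 2);
"""

# Genes annotated to BOTH disorders (INTERSECT), then their proteins.
QUERY = """
SELECT p.name
FROM protein AS p
WHERE p.gene_id IN (
    SELECT gd.gene_id
    FROM gene2disorder AS gd
    JOIN disorder AS d ON d.disorder_id = gd.disorder_id
    WHERE d.name = ?
    INTERSECT
    SELECT gd.gene_id
    FROM gene2disorder AS gd
    JOIN disorder AS d ON d.disorder_id = gd.disorder_id
    WHERE d.name = ?
)
ORDER BY p.name
"""


def proteins_for_shared_genes(conn, first_disorder, second_disorder):
    """Proteins encoded by genes annotated to both given disorders."""
    rows = conn.execute(QUERY, (first_disorder, second_disorder)).fetchall()
    return [name for (name,) in rows]


def build_demo_db():
    """Create the in-memory toy database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    print(proteins_for_shared_genes(conn, "breast cancer", "prostate cancer"))
```

The two-step structure mirrors the wording of the question: the `INTERSECT` subquery finds the genes shared by the two disorders, and the outer query retrieves the proteins encoded by those genes.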
27. Main advantages of GPKB compared to other systems
(such as BioWarehouse, Biozon, etc):
1) flexible data schema and software architecture, to
facilitate data import
2) integration of datasets from different sources,
which highlights semantic relationships between data
elements
3) ability to answer multi-domain biomedical
questions
GPKB advantages
28. M. Masseroli, A. Canakoglu, and S. Ceri. "Integration
and querying of genomic and proteomic semantic
annotations for biomedical knowledge extraction"
IEEE/ACM Transactions on Computational Biology and
Bioinformatics 13.2 (2016): 209-219.
http://www.bioinformatics.deib.polimi.it/GPKB
Citation and web link