A Reason-Able View to the Web of Pathway Data - Presentation Transcript
Vassil Momtchev, Group Leader, Semantic Life Science Applications
A Reason-Able View to the Web of Pathway Data
2009/10/08
Objectives WITH NO 2009/10/08 Bio-IT World, Hannover
A Typical Question?
Select drugs related to asthma that are linked to a curated molecular interaction in the literature where the protein is known to cause an inflammatory response…
A Typical Answer!
Find all drugs related to asthma
Extract all proteins into a text file
Compose a long query that lists the proteins joined by OR
Get a filtered list of genes
For each gene, send a query to a molecular interaction database
A Typical Question?
Select all human genes that code for proteins with known molecular interactions and are analyzed with molecular techniques like 'Transfection'. Restrict the results to genes or proteins that are known drug targets…
Wrap-up: Typical Problems
Link data between different siloed applications
Hard to collaborate across domains due to information silos
Combine public and private information
Expensive process that could not be done on the fly
Put the information into context
Query cross-domain information
Make more interesting queries
No easy way to interpret the information
Analyze the knowledge locked into unstructured data
Database information is often outdated
Relationships are not enough to capture the case study questions
Hard to find information
Agenda
Ontotext
Put your data into context
Linked Life Data platform and Pathway and Interaction dataset
LifeSKIM: an application to find any existing information
Who Are We?
Ontotext is a semantic technology provider
Established in 2000 as part of Sirma Group
Sirma is a top-3 software house in Bulgaria, est. 1992, ~300 people
Since September 2008 a separate company
Staff: 40 employees in Bulgaria
Multiple affiliates and contractors around the world
Over 150 person-years invested in product development
Linked Life Data and the Pathway and Interaction Knowledge Base
Linked Life Data
RDF warehouse solution
Powered by the OWLIM semantic database
Pathway and Interaction Knowledge Base
Dataset that integrates molecular data
Part of the linked data cloud
Tested with case studies developed in collaboration with AstraZeneca
Select drugs related to asthma that are linked to a curated molecular interaction in the literature where the protein is known to cause an inflammatory response…
Select all human genes that code for proteins with known molecular interactions and are analyzed with molecular techniques like 'Transfection'. Restrict the results to genes or proteins that are known drug targets…
Isolated communities which could not reach cross-domain understanding
Increase the data abstraction level
RDF Technology
[Diagram: an RDF graph. Nodes: ERBB2, HER2, CD340, Q4H1F1, Q4H1F2, Protein, ENSG00000141736, 2064, Gene, Gene Ontology Term, GO:0005023, GO:0004872, GO:0005006, EGF receptor activity, receptor activity, peroxisome receptor. Edge labels: type, label, is_a, database cross-reference, hasProtein, hasGene.]
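The graph on this slide can be read as a set of subject-predicate-object statements. The sketch below is a minimal illustration using plain Python tuples rather than an RDF library. The node and edge names are taken from the slide, but which edge connects which pair of nodes is an assumption, because the layout of the diagram is not recoverable from the transcript. The `match` helper is a hypothetical name introduced for illustration.

```python
# Toy triple store: each statement is a (subject, predicate, object) tuple.
# Node and edge names come from the slide; how they are wired together
# here is an ASSUMPTION, since the diagram layout did not survive.
TRIPLES = {
    ("ERBB2", "type", "Protein"),
    ("ERBB2", "label", "HER2"),
    ("ERBB2", "label", "CD340"),
    ("ERBB2", "hasGene", "ENSG00000141736"),
    ("ENSG00000141736", "type", "Gene"),
    ("ENSG00000141736", "hasProtein", "ERBB2"),
    ("GO:0005006", "type", "Gene Ontology Term"),
    ("GO:0005006", "is_a", "GO:0004872"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All labels attached to the ERBB2 node.
labels = [o for _, _, o in match(TRIPLES, s="ERBB2", p="label")]
print(labels)  # ['CD340', 'HER2']
```

Because every fact has the same three-part shape, data from different sources can be merged by taking the union of their statements, provided the sources agree on identifiers.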
Linking Open Data Community Project
Use URIs to identify the information
Expose via dereferenceable URIs
Allows browsing of data spread across different servers, in the way HTML is browsed
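Dereferencing a URI usually relies on HTTP content negotiation: the client states in the `Accept` header which representation it wants. The sketch below only builds such a request with the Python standard library and does not send it. The resource URI is a hypothetical placeholder, and whether a given server answers with RDF/XML for this media type is an assumption.

```python
import urllib.request


def build_rdf_request(uri, media_type="application/rdf+xml"):
    """Build a GET request that asks the server for RDF instead of HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


# Hypothetical resource URI, used only for illustration.
req = build_rdf_request("http://example.org/resource/ERBB2")
print(req.get_method(), req.full_url, req.get_header("Accept"))

# To actually fetch the description, one would call:
#   urllib.request.urlopen(req).read()
```

The same URI can serve an HTML page to a browser and RDF to a program, which is what lets linked data be browsed across servers.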
LLD Integration Process
Data source identification
Flat files: special tailored transformer
OBO files: OBO to SKOS converter
XML: custom XSLT
RDBMS: RDBMS to RDF formatter
RDF
RDF warehouse
Reasoner
Instance mappings
Semantic annotations
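One step of this process, turning relational rows into RDF statements, might look roughly like the sketch below. It is not the formatter used in Linked Life Data: the function name `rows_to_triples`, the `http://example.org/` base URI, and the sample row are all hypothetical, and real converters also handle datatypes, foreign keys, and identifier normalization.

```python
def rows_to_triples(table, rows, key, base="http://example.org/"):
    """Turn relational rows into (subject, predicate, object) triples.

    Each row becomes one subject URI built from its primary key; every
    other non-empty column becomes one statement about that subject.
    """
    triples = []
    for row in rows:
        subject = f"{base}{table}/{row[key]}"
        for column, value in row.items():
            if column == key or value is None:
                continue  # skip the key itself and empty cells
            triples.append((subject, f"{base}{table}#{column}", value))
    return triples


# Hypothetical row standing in for a relational gene table.
rows = [{"id": 2064, "symbol": "ERBB2", "alias": None}]
for triple in rows_to_triples("gene", rows, key="id"):
    print(triple)
```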
Over 20 Different Sources
Number of statements: 4,792,035,475
Number of explicit statements: 2,218,239,691
Number of entities: 370,230,951

Data source | Description | RDF statements
Disease Ontology | Disease Ontology is a controlled | 446,066
Human Phenotype Ontology | The human phenotype ontology (HPO) intends | 70,911
Symptom Ontology | The symptom ontology was designed around | 4,163
DrugBank | The DrugBank database is a unique bioinformatics | 493,794
Diseasome | The diseasome website is a disease relationships | 69,546
DailyMed | DailyMed provides high quality information about | 116,992
SIDER | SIDER contains information on marketed medicines | 96,272
BioGRID | The Biological General Repository for Interaction Datasets | 1,892,897
INOH | INOH (Integrating Network Objects with Hierarchies) | 432,456
CellMap | The Cancer Cell Map contains selected | 173,914
HPRD | The Human Protein Reference Database | 18,05,651
HumanCYC | HumanCyc is a bioinformatics database that describes | 341,225
IMID | General Repository for Interaction Datasets. | 154,408
IntAct | IntAct provides a freely available, open source database | 11,005,555
Reactome | Reactome is a free, online, open-source, curated resource | 2,538,793
NCI-Nature | Nature pathway interaction database. | 333,415
KEGG | KEGG PATHWAY is a collection of manually drawn | 18,128,735
Entrez-Gene | Entrez Gene is a searchable database of genes | 107,193,308
PubMed | PubMed is a service of the U.S. National Library of Medicine | 807,851,455
UniProt | Major resource for protein sequences | 1,252,667,885
UMLS Metathesaurus | Database that contains information about biomedical | 12,420,882
UMLS Semantic network | Semantic categorization of terminology | 1,368
[Diagram: patterns for linking a resource X to a resource Y: namespace mapping, reference node, mismatched identifiers, value dereference, transitive link, semantic annotations.]
Semantic Annotations
[Diagram: a document with the text "This an example text of document that mentions COPD disease" (hasDocumentText) is processed by Natural Language Processing and linked by "mentions" to a concept. Concepts shown: COPD, Chronic Obstructive Airway Diseases, Bronchial Diseases, Respiration Disorders, umls:C0035204, umls:C0006261. Edge labels between concepts: broader, broaderTransitive.]
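The value of combining "mentions" with "broader" links is that a search for a general concept can also return documents that mention only a narrower one. The sketch below illustrates this with plain dictionaries. The hierarchy is an assumed reading of the slide (COPD under Bronchial Diseases under Respiration Disorders), and the document ids and the helper names `broader_transitive` and `documents_about` are hypothetical.

```python
# Direct "broader" links; this chain is an ASSUMED reading of the slide.
BROADER = {
    "COPD": ["Bronchial Diseases"],
    "Bronchial Diseases": ["Respiration Disorders"],
}

# Hypothetical output of the NLP step: document id -> mentioned concepts.
MENTIONS = {"doc1": ["COPD"], "doc2": ["Respiration Disorders"]}


def broader_transitive(concept):
    """All ancestors reachable by following 'broader' one or more times."""
    seen, stack = set(), list(BROADER.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def documents_about(concept):
    """Documents mentioning the concept or anything narrower than it."""
    return sorted(
        doc for doc, concepts in MENTIONS.items()
        if any(c == concept or concept in broader_transitive(c) for c in concepts)
    )


print(documents_about("Bronchial Diseases"))     # ['doc1']
print(documents_about("Respiration Disorders"))  # ['doc1', 'doc2']
```

In a reasoning-enabled store the transitive closure would typically be materialized or inferred by the engine instead of being computed in application code.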
Semantic Annotations #2
Executed over selected textual fields
Powered by standard and open-source NLP components
Very efficient parallelization techniques
The annotation process created, for UMLS and PubMed:
Over 705 million high-recall semantic annotations
Over 263 million high-precision semantic annotations
Linked Life Data Service
The Pathway and Interaction Knowledge Base is a free resource
LLD is a free public service available at http://linkedlifedata.com
The OWLIM engine is experimentally proven to scale up to:
20 billion RDF statements (15 billion explicit)
On a computer that costs less than $10,000
Conclusion
Ontotext is a company that provides very efficient software to manage semantic information
Link data between different siloed applications
Easily incrementally extensible
Put the information into context
Make more interesting queries
Manage knowledge derived from text mining
Acknowledgement
AstraZeneca
Bosse Andersson
LODD
BioRDF
HCLSIG
Ontotext
Deyan Peychev
Georgi Georgiev
Todor Primov
OWLIM team
The development of PIKB and Linked Life Data is partially funded by FP7 215535