A Reason-Able View to the Web of Pathway Data

  1. Vassil Momtchev, Group Leader, Semantic Life Science Applications. A Reason-Able View to the Web of Pathway Data. 2009/10/08
  2. Objectives WITH NO 2009/10/08 Bio-IT World, Hannover
  3. A Typical Question? Select drugs related to asthma that are linked to a curated molecular interaction in the literature where the protein is known to cause inflammatory response…
  4. A Typical Answer!
       • Find all drugs related to asthma
       • Extract all proteins in a text file
       • Compose a long query that lists the proteins joined by OR
       • Get a filtered list of genes
       • For each gene, send a query to a molecular interaction database
  5. A Typical Question? Select all human genes that code for proteins with known molecular interactions and are analyzed with molecular techniques like 'Transfection'; restrict the results to genes or proteins that are known drug targets…
  6. Wrap-up: Typical Problems
       • Link data between different siloed applications
           • Hard to collaborate across domains due to information silos
           • Combine public and private information
           • Expensive process that could not be done on the fly
       • Put the information into context
           • Query cross-domain information
           • Make more interesting queries
           • No easy way to interpret the information
       • Analyze the knowledge locked into unstructured data
           • Database information is often outdated
           • Relationships are not enough to capture the case study questions
           • Hard to find information
  7. Agenda
       • Ontotext
       • Put your data into context
       • Linked Life Data platform and Pathway and Interaction dataset
       • LifeSKIM, an application to find any existing information
  8. Who Are We?
       • Ontotext is a semantic technology provider
       • Established in 2000 as part of Sirma Group
           • Sirma is a top-3 software house in Bulgaria, est. 1992, ~300 people
           • Since September 2008 a separate company
       • Staff: 40 employees in Bulgaria
           • Multiple affiliates and contractors around the world
       • Over 150 person-years invested in product development
  9. What is Our Position?
       • Unique coverage of technology areas:
           • Semantic Databases: high-performance RDF DBMS, scalable reasoning
           • Semantic Search: text-mining (IE), Information Retrieval (IR)
           • Semantic Web Services and BPM: WS annotation, discovery, etc.
           • Web Mining: focused crawling, wrapping
           • Knowledge fusion: identity resolution, record linkage
       • Core business: development of semantic engines
           • Mostly product development and sales
           • Complemented by professional services
           • Joint ventures for vertical applications
  10. Our Application Domains
       • Ontotext technologies are used for various applications:
           • Data Integration (consolidation of multiple databases)
           • Knowledge & Content Management (enterprise search)
           • Business Intelligence
           • Web-mining/Web-intelligence
       • Major industries/markets
           • Life sciences and health care
           • Telecommunications
           • Media Archives, Media Research
           • Online recruitment
           • IP/Patent Research
           • Web search, Web 2.0 and Semantic Web start-ups
  11. Linked Life Data Pathway and Interaction Knowledge Base
       • Linked Life Data
           • RDF warehouse solution
           • Powered by OWLIM semantic database
       • Pathway and Interaction Knowledge Base
           • Dataset that integrates molecular data
           • Part of linked data cloud
       • Test with case studies developed in collaboration with AstraZeneca
  12. Select drugs related to asthma that are linked to a curated molecular interaction in the literature where the protein is known to cause inflammatory response…
  13. Select all human genes that code for proteins with known molecular interactions and are analyzed with molecular techniques like 'Transfection'; restrict the results to genes or proteins that are known drug targets…
  14. Holistic View of the Scientific Problems
       • Use data of more than 20 sources
       • Inter-link the information
       • Put it into new contexts
       • Switch different perspectives
       [Diagram label: Environmental Factors]
  15. The Challenge of the Holistic View
       • Extreme amount of data
       • Data is supported by different organizations
       • Information is highly distributed and redundant
       • Tons of flat file formats with special semantics
       • Knowledge is locked in vast data silos
       • Isolated communities which could not reach cross-domain understanding
       • Increase the data abstraction level
  16. RDF Technology
       [Diagram: an RDF graph of genes, proteins and Gene Ontology terms]
       • Node labels: ERBB2, HER2, CD340, Q4H1F1, Q4H1F2, ENSG00000141736, 2064, GO:0005023, GO:0004872, GO:0005006, EGF receptor activity, receptor activity, peroxisome receptor
       • Node types: Gene, Protein, Gene Ontology Term
       • Edge labels: type, label, is_a, database cross-reference, hasProtein, hasGene
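The graph on slide 16 can be read as a set of subject-predicate-object statements. The following is a minimal sketch in Python, not OWLIM's storage model: the identifiers and edge names are taken from the slide, but which node each edge connects, the `gene:` and `protein:` prefixes, and the `match` helper are assumptions made for illustration.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy statements. Names come from the slide; the pairing of nodes and edges
# is assumed, because the diagram layout did not survive extraction.
TRIPLES: List[Triple] = [
    ("gene:2064", "type", "Gene"),
    ("gene:2064", "label", "ERBB2"),
    ("gene:2064", "label", "HER2"),
    ("gene:2064", "hasProtein", "protein:Q4H1F1"),
    ("protein:Q4H1F1", "type", "Protein"),
    ("protein:Q4H1F1", "hasGene", "gene:2064"),
]


def match(s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern (None = wildcard)."""
    for triple in TRIPLES:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


# All labels attached to the gene node.
labels = sorted(obj for _, _, obj in match(s="gene:2064", p="label"))
print(labels)
```

Because every fact has the same three-part shape, adding a new data source only means adding more statements; no schema migration is needed.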
  17. Linking Open Data Community Project
       • Use URIs to identify the information
       • Expose via dereferenceable URIs
       • Allows browsing of data spread across different servers, in the way HTML is browsed
  18. LLD Integration Process
       [Diagram: data integration pipeline]
       • Data Source Identification
       • Input formats: flat files, OBO files, XML, RDBMS, RDF
       • Transformers: special tailored transformer, OBO to SKOS converter, custom XSLT, RDBMS to RDF formatter
       • Targets: RDF warehouse, Reasoner, Instance Mappings, Semantic Annotations
  19. Over 20 Different Sources
       • Number of statements: 4,792,035,475
       • Number of explicit statements: 2,218,239,691
       • Number of entities: 370,230,951

       Data source | Description (as truncated on the slide) | RDF statements
       Disease Ontology | Disease Ontology is a controlled | 446,066
       Human Phenotype Ontology | The human phenotype ontology (HPO) intends | 70,911
       Symptom Ontology | The symptom ontology was designed around | 4,163
       DrugBank | The DrugBank database is a unique bioinformatics | 493,794
       Diseasome | The diseasome website is a disease relationships | 69,546
       DailyMed | DailyMed provides high quality information about | 116,992
       SIDER | SIDER contains information on marketed medicines | 96,272
       BioGRID | The Biological General Repository for Interaction Datasets | 1,892,897
       INOH | INOH (Integrating Network Objects with Hierarchies) | 432,456
       CellMap | The Cancer Cell Map contains selected | 173,914
       HPRD | The Human Protein Reference Database | 1,805,651
       HumanCYC | HumanCyc is a bioinformatics database that describes | 341,225
       IMID | General Repository for Interaction Datasets. | 154,408
       IntAct | IntAct provides a freely available, open source database | 11,005,555
       Reactome | Reactome is a free, online, open-source, curated resource | 2,538,793
       NCI-Nature | Nature pathway interaction database. | 333,415
       KEGG | KEGG PATHWAY is a collection of manually drawn | 18,128,735
       Entrez-Gene | Entrez Gene is a searchable database of genes | 107,193,308
       PubMed | PubMed is a service of the U.S. National Library of Medicine | 807,851,455
       UniProt | Major resource for protein sequences | 1,252,667,885
       UMLS Metathesaurus | Database that contains information about biomedical | 12,420,882
       UMLS Semantic network | Semantic categorization of terminology | 1,368
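The two statement counts on slide 19 give a rough sense of how much the reasoner adds. The sketch below assumes that the total is the sum of explicit and inferred statements; the slide does not state this outright, so treat the derived numbers as an interpretation.

```python
# Figures quoted on the slide.
TOTAL_STATEMENTS = 4_792_035_475
EXPLICIT_STATEMENTS = 2_218_239_691

# Assumption: total = explicit + inferred.
inferred = TOTAL_STATEMENTS - EXPLICIT_STATEMENTS
expansion = TOTAL_STATEMENTS / EXPLICIT_STATEMENTS

print(f"inferred statements: {inferred:,}")
print(f"expansion factor: {expansion:.2f}")
```

Under that assumption, slightly more than half of the statements in the warehouse are produced by inference rather than loaded from the sources.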
  20. Schema Reasoning
       • Rule form: <premise 1>, <premise 2> ⇒ <conclusion>
       • Example: <C1,broader,C2>, <C2,broader,C3> ⇒ <C1,broaderTransitive,C3>
       [Diagram: a concept hierarchy with explicit and inferred links]
       • Node labels: COPD, Chronic Obstructive Airway Diseases, Bronchial Diseases, Respiration Disorders, umls:C0024117, umls:C0006261, umls:C0035204
       • Edge labels: rdf:type, broader, broaderTransitive (inferred)
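The rule on slide 20 can be sketched as a small forward-chaining loop. This is an illustration only, not how OWLIM implements reasoning: the function name is hypothetical, and the hierarchy COPD → Bronchial Diseases → Respiration Disorders is assumed from the slide's labels. Following SKOS, each explicit `broader` link is also treated as a `broaderTransitive` link, and the rule is applied until nothing new appears.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def infer_broader_transitive(broader: Set[Edge]) -> Set[Edge]:
    """Return all broaderTransitive pairs entailed by the broader pairs."""
    closure = set(broader)  # every broader link is also broaderTransitive
    changed = True
    while changed:  # repeat until a full pass adds nothing (fixpoint)
        changed = False
        for narrow, middle in list(closure):
            for start, wide in list(closure):
                if middle == start and (narrow, wide) not in closure:
                    closure.add((narrow, wide))
                    changed = True
    return closure


# Explicit statements (hierarchy assumed from the slide labels).
BROADER = {
    ("COPD", "Bronchial Diseases"),
    ("Bronchial Diseases", "Respiration Disorders"),
}

# Pairs that exist only because of inference.
inferred_only = infer_broader_transitive(BROADER) - BROADER
print(inferred_only)
```

A query for everything under "Respiration Disorders" can then use `broaderTransitive` and find COPD without walking the hierarchy at query time.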
  21. Instance Mapping
       [Diagram: a BioPAX physical entity linked to its UniProt entry]
       • BioPAX side: cpath:CPATH-94138, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236; properties biopax-2:PHYSICAL-ENTITY, biopax-2:SHORT-NAME, biopax-2:XREF, biopax-2:DB, biopax-2:ID; values UNIPROT, P29965, CD40L_HUMAN
       • UniProt side: uniprot:P29965; uniprot:mnemonic values CD40L_HUMAN, TNF5_HUMAN, TNFL5_HUMAN, CD4L_HUMAN
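The mapping on slide 21 can be sketched as a lookup from a cross-reference's database name and identifier to a target resource. This is a hedged illustration: the dictionaries, the `mapsTo` predicate and the `map_instances` function are hypothetical, and the choice of which cpath identifier is the entity and which is the cross-reference node is an assumption.

```python
from typing import Dict, List, Tuple

# Cross-reference nodes: database name plus accession (values from the slide).
XREFS: Dict[str, Dict[str, str]] = {
    "cpath:CPATH-LOCAL-8467065": {"db": "UNIPROT", "id": "P29965"},
}

# Which entity points at which cross-reference node (assumed pairing).
ENTITY_XREFS: Dict[str, List[str]] = {
    "cpath:CPATH-94138": ["cpath:CPATH-LOCAL-8467065"],
}

# Database name -> prefix used for resources in the target dataset.
NAMESPACES: Dict[str, str] = {"UNIPROT": "uniprot:"}


def map_instances() -> List[Tuple[str, str, str]]:
    """Turn (db, id) cross-references into direct links to target resources."""
    links = []
    for entity, xref_nodes in ENTITY_XREFS.items():
        for node in xref_nodes:
            xref = XREFS[node]
            prefix = NAMESPACES.get(xref["db"])
            if prefix is not None:  # skip databases with no known namespace
                links.append((entity, "mapsTo", prefix + xref["id"]))
    return links


links = map_instances()
print(links)
```

Once the direct link exists, a query can move from a pathway entity to the UniProt record without knowing how BioPAX encodes cross-references.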
  22. [Diagram: instance mapping patterns between a resource X and a resource Y]
       • Namespace mapping
       • Reference node
       • Mismatched identifiers
       • Value dereference
       • Transitive link
       • Semantic Annotations
  23. Semantic Annotations
       [Diagram: a document linked to a concept through Natural Language Processing]
       • Document text (hasDocumentText): "This is an example text of a document that mentions COPD disease"
       • Edge labels: hasDocumentText, mentions, broader, broaderTransitive
       • Concept labels: COPD, Chronic Obstructive Airway Diseases, Bronchial Diseases, Respiration Disorders, umls:C0006261, umls:C0035204
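The `mentions` link on slide 23 can be illustrated with a naive dictionary lookup. This is a deliberately simple stand-in for the NLP components the deck refers to, not a description of them: the lexicon, the `doc:1` identifier and the `annotate` function are hypothetical, and pairing COPD with umls:C0024117 is an assumption based on the identifiers shown on the reasoning slide.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicon: surface form -> concept identifier.
LEXICON: Dict[str, str] = {
    "COPD": "umls:C0024117",
    "Chronic Obstructive Airway Diseases": "umls:C0024117",
}


def annotate(doc_id: str, text: str,
             lexicon: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Emit one (document, 'mentions', concept) triple per concept found."""
    found = set()
    for term, concept in lexicon.items():
        # Whole-word, case-insensitive match of the surface form.
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.add((doc_id, "mentions", concept))
    return sorted(found)


TEXT = "This is an example text of a document that mentions COPD disease"
annotations = annotate("doc:1", TEXT, LEXICON)
print(annotations)
```

Combined with the inferred `broaderTransitive` links, a document annotated with COPD could also be retrieved by a search for the broader concept.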
  24. Semantic Annotations #2
       • Executed over selected textual fields
       • Powered by standard and open source NLP components
       • Very efficient parallelization techniques
       • Annotating UMLS and PubMed produced:
           • Over 705 million high-recall semantic annotations
           • Over 263 million high-precision semantic annotations
  27. Linked Life Data Service
       • Pathway and Interaction Knowledge Base is a free resource
       • LLD is a free public service available at http://linkedlifedata.com
       • OWLIM engine is experimentally proven to scale up to:
           • 20 billion RDF statements (15 billion explicit)
           • On a computer that costs less than $10,000
  28. Conclusion
       • Ontotext is a company that provides very efficient software to manage semantic information
       • Link data between different siloed applications
       • Easily and incrementally extensible
       • Put the information into context
       • Start more interesting queries
       • Manage knowledge derived from text mining
  29. Acknowledgement
       • AstraZeneca
           • Bosse Andersson
       • LODD
       • BioRDF
       • HCLSIG
       • Ontotext
           • Deyan Peychev
           • Georgi Georgiev
           • Todor Primov
           • OWLIM team
       The development of PIKB and Linked Life Data is partially funded by FP7 215535
  30. Questions?
