Sem tech 2011 v8


Phil Ashworth and Dean Allemang, "Building a semantic integration framework to support a federated query "

Published in: Education, Technology

  1. Building a semantic integration framework to support a federated query environment in 5 steps
       Philip Ashworth, UCB Celltech
       Dean Allemang, TopQuadrant
  2. Data Integration… Why?
       • The scope and knowledge of the life sciences expand every day
       • Every day we make new discoveries by experimenting (in the lab)
       • The lab generates data in large quantities, complementing the vast growth in external data
       • Bringing the data together is too difficult and time-consuming for the user
       • As a result, we rarely use the data we already have to make new discoveries
  3. Data Integration… Problems
       [Diagram labels: Warehouse DB; Project DB; Project Marts; Applications; App DBs; Registration; Query; DI]
  4. Data Integration… Problems
       • Demand for DI increases every day
       • Data doesn’t evolve into a larger, more beneficial platform
           • Where is the long-term benefit?
           • Driving ourselves around in circles
       • Just creating more data silos
           • Limited scope for reuse
       • Slow and difficult to modify or enhance
       • High maintenance
           • Multiple systems create more and more overhead
  5. Data Integration… Thoughts
       • Data integration is clearly evolving
       • But it is not fulfilling the needs
       • If we identify the need, can we see what we should be doing?
  6. Data Integration… Needs
       • Accessible data
       • True integration
       • Variety of sources
       • Align concepts
       • Data has context
       • All data for all projects
  7. Data Integration… There is a way!
       • Open Linked Data Cloud
       • Connected and linked data with context
       • Created by a community
       • Significant linking hubs appearing
       • Significant scientific content
       • A valuable resource that will only grow!
       • Something we can learn from!
  8. Data Integration… Starting an Evolutionary Leap
       • No one internally really knows about this
       • Can’t just rip and replace old systems
       • Have to do some groundwork
  9. Linked Data… The Quest
       • Technology projects
           • Emphasis on semantic web principles
       • Scientific projects
           • Data integration
           • Data visualisation (mash-ups)
  10. Linked Data… The Quest
       [Diagram labels: Highly Repetitive & Promiscuous; Highly Promiscuous & Repetitive]
  11. Linked Data
       • New approach
       • Develop a proof-of-concept (POC) semantic data integration framework
           • Easy to configure
           • Supports all projects
           • Builds an environment for the future
  12. The Idea
       [Architecture diagram. Layers shown:]
       • Applications
       • Business Process / Workflow Automation
       • REST Services (abstraction layer)
       • Semantic Integration Framework: knowledge collation, concept mapping, distributed query, result inference, aggregation
       • PURL
       • Data Sources: RDBMS (Oracle, PostgreSQL, MySQL), RDF triple store, MS Excel, TXT, Doc; exposed as RDF through SPARQL endpoints (label: Native)
       [Axis labels: increasing ease of development; decreasing knowledge of semantic technologies]
  13. Step 1: Data Sources
       • Expose data as RDF through SPARQL endpoints
       • Internal data sources
           • D2R SPARQL endpoints on RDBMS databases
               • Each modelled as the local concepts it represents
               • Don’t worry about the larger concept picture
           • Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets
           • TopBraid Ensemble and SPARQLMotion/SPIN scripts to convert static data to RDF
       [Diagram labels: RDBMS; D2R; SPARQL Endpoints; Virtuoso; RDF]
  14. Step 1: Data Sources
       • External data sources
           • SPARQL endpoints in LOD from Bio2RDF, LODD and others
           • Some stability, access and quality issues within these sources
           • Created an Amazon cloud server to host stable environments
           • Bio2RDF sources downloaded, stored and modified
           • Virtuoso (open source) used as triple store
       [Diagram labels: UCB Data Cloud; Linked Open Data Cloud; IDAC; MOC; PEP; Abysis; NBE Mart; SEQ; Bio2RDF; PDB; NBE WH; ITrack; PMT; LDAP; WKW; UCB PDB; Premier; Sider; Kegg cpd; Diseasome; Kegg gl; Kegg dr; chebi; Uniprot; ec; geneid; RDF]
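Step 1 puts every source behind a SPARQL endpoint, so a client talks to each one the same way. The following is a minimal sketch of forming such a request with the standard SPARQL Protocol `query` parameter. The endpoint URL and the query are assumptions for illustration (the slides give no real addresses), and the request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint address; not taken from the slides.
ENDPOINT = "http://d2r.example.org/sparql"

# Illustrative query: list ten labelled resources.
QUERY = """
SELECT ?s ?label WHERE {
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialisation of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would run the query against a live endpoint; the same function could serve a D2R or Virtuoso endpoint, since only the URL changes.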
  15. Step 2: Integration Framework
       • Why?
           • Linked Open Data: links within a source are manually created
           • To navigate the cloud you either
               • Learn the network
               • Discover the network as you go through (unguided)
           • Nothing understands the total connectivity of the concepts available to you
               • Difficult to know where to start
               • No idea whether a start point will lead you to the information you are looking for or might be interested in
               • Can’t query the cloud for specific information
       • The Integration Framework will resolve these issues
           • It will model the models to understand the connectivity
       • You shouldn’t have to know where to look for data
  16. Step 2: Integration Framework
       [Architecture diagram: Applications; Business Process / Workflow Automation; REST Services (abstraction layer); Semantic Integration Framework (knowledge collation, concept mapping, distributed query, result inference, aggregation); PURL; RDF; Data Sources. Annotations:]
       • Understand data sources (concepts, access, properties)
       • Understand links across sources
       • Automate some tasks
       • Accessible via services
       • Easy to wire up
       • Understand UCB concepts
       • Understand how UCB concepts fit with source concepts
  17. Step 2: Integration Framework
       • Integration Framework
           • Data source, concept and property registry
           • An ontology that utilises
               • VoID (enhanced) to capture data source information (endpoints)
               • SKOS to link local ontologies with UCB concepts
                   • UCB:Person -> db1:user, db2:employee, db3:actor
       • Built using TopBraid Suite
           • Ontology development (TopBraid Composer)
           • SPARQLMotion scripts to provide some automation
               • Creation of ontologies from endpoints, D2R mappings
               • Configuration assistance
  18. Step 2: Integration Framework
       [Diagram: the Integration Framework holds the DB1 Dataset Ontology (VoID) and the UCB Concept Ontology (SKOS), connected by narrowMatch links]
       • UCB:Antibody narrowMatch DB1:Antibody
       • UCB:Person narrowMatch DB1:User
       • UCB:Project narrowMatch DB1:Project
  19. Step 2: Integration Framework
       [Diagram: dataset ontologies (VoID) for DB1, DB2 and DB3, connected to the UCB Concept Ontology (SKOS) by four narrowMatch links]
       • UCB:Person linked to DB1:User, DB2:Person, DB3:Employee and DB3:Contact
  20. Step 2: Integration Framework
       [Diagram: dataset ontologies (VoID) for DB1, DB2 and DB3, connected to the UCB Concept Ontology (SKOS) by four narrowMatch links]
       • UCB:Person linked to DB1:User, DB2:Person, DB3:Employee and DB3:Contact
       • Linksets: Person_DB1_DB2, Person_DB1_DB3
  21. Step 2: Integration Framework
       [Diagram: Dataset Ontology (VoID) and UCB Concept Ontology (SKOS), with elements numbered 1 to 12]
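The registry described in Step 2 can be pictured as a small set of mapping statements plus a list of datasets. The sketch below holds the narrowMatch links from the diagrams as plain tuples and answers the two questions the framework needs: which local concepts stand behind a UCB concept, and which datasets hold them. The direction of the narrowMatch links (UCB concept as the broader one) and the endpoint URLs are assumptions; the actual framework was an ontology built in TopBraid, not Python.

```python
from collections import defaultdict

# (subject, predicate, object) statements mirroring the slide diagrams.
# Assumption: the UCB concept is the broader side of each narrowMatch link.
MAPPINGS = [
    ("UCB:Person", "skos:narrowMatch", "DB1:User"),
    ("UCB:Person", "skos:narrowMatch", "DB2:Person"),
    ("UCB:Person", "skos:narrowMatch", "DB3:Employee"),
    ("UCB:Person", "skos:narrowMatch", "DB3:Contact"),
    ("UCB:Antibody", "skos:narrowMatch", "DB1:Antibody"),
    ("UCB:Project", "skos:narrowMatch", "DB1:Project"),
]

# Hypothetical endpoint addresses, standing in for the VoID descriptions.
DATASETS = {
    "DB1": "http://db1.example.org/sparql",
    "DB2": "http://db2.example.org/sparql",
    "DB3": "http://db3.example.org/sparql",
}


def local_concepts(ucb_concept):
    """Return the local concepts mapped to a UCB concept, in declared order."""
    return [o for s, p, o in MAPPINGS
            if s == ucb_concept and p == "skos:narrowMatch"]


def datasets_for(ucb_concept):
    """Group the local concepts of a UCB concept by the dataset that holds them."""
    grouped = defaultdict(list)
    for concept in local_concepts(ucb_concept):
        dataset = concept.split(":", 1)[0]  # prefix names the dataset
        if dataset in DATASETS:
            grouped[dataset].append(concept)
    return dict(grouped)


person_sources = datasets_for("UCB:Person")
```

Adding a new source in this picture means adding one `DATASETS` entry and a few mapping tuples, which matches the slides' point that sources are added by describing connections.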
  22. Step 3: REST Services
       • REST services
           • Interaction point for applications
           • Expose simple and generic access to the Integration Framework
           • Remove the complexity of the framework and of how to ask questions of it
               • You don’t need to know how to make it work
           • You don’t need to know anything about the datasets or the concepts and properties held within them
           • Just ask simple questions in the UCB language
               • Tell me about UCB:Person “ashworth”
           • Built using SPARQLMotion/SPIN and exposed in TopBraid Live enterprise server
           • Two simple yet very effective services created
  23. Step 3: REST Services
       [Diagram: the Keyword Search and Get Info services working against the Dataset Ontology (VoID), the UCB Concept Ontology (SKOS) and DB1, DB2, DB3. Labelled interactions for Keyword Search:]
       • Find UCB:Person “phil”
       • Tell me the sub-types of UCB:Person
       • Tell me the datasets for the sub-types
       • Search DB1:User, DB2:Person, DB3:Employee, DB3:Contact
       • Can the linksets tell us any info?
       • Here are the resources for “phil”: ldap:U0xx10x, itrack:101, moc:scordisp etc.
  24. Step 3: REST Services
       [Diagram: the Keyword Search and Get Info services working against the Dataset Ontology (VoID), the UCB Concept Ontology (SKOS) and DB1, DB2, DB3. Labelled interactions for Get Info:]
       • Tell me about moc:scordisp
       • Tell me everything about this resource?
       • Tell me the super-types of all resources
       • Retrieve DB1:U0xx10x, DB2:scordisp, DB3:philscordis
       • Here is everything I know about it.
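The two services can be sketched as two functions over the concept mappings. This is an illustration only: the real services were SPARQLMotion/SPIN scripts on TopBraid Live, the in-memory `SOURCES` tables stand in for SPARQL endpoints, and every resource identifier and label below is invented.

```python
# Local concepts behind each UCB concept (from the Step 2 diagrams).
NARROW_MATCH = {
    "UCB:Person": ["DB1:User", "DB2:Person", "DB3:Employee", "DB3:Contact"],
}

# Toy stand-ins for the endpoints: (resource, local type, label) rows.
# All identifiers and labels are invented for illustration.
SOURCES = {
    "DB1": [("DB1:u1", "DB1:User", "Phil Example")],
    "DB2": [("DB2:p7", "DB2:Person", "phil.example")],
    "DB3": [("DB3:e3", "DB3:Employee", "Dana Other"),
            ("DB3:c9", "DB3:Contact", "Phil E.")],
}


def keyword_search(ucb_concept, keyword):
    """Find resources of any sub-type of a UCB concept whose label matches."""
    hits = []
    for local in NARROW_MATCH.get(ucb_concept, []):
        dataset = local.split(":", 1)[0]  # which source holds this sub-type
        for resource, rtype, label in SOURCES[dataset]:
            if rtype == local and keyword.lower() in label.lower():
                hits.append(resource)
    return hits


def get_info(resource):
    """Return what a source knows about a resource, plus its UCB super-types."""
    dataset = resource.split(":", 1)[0]
    for res, rtype, label in SOURCES.get(dataset, []):
        if res == resource:
            ucb_types = [c for c, local in NARROW_MATCH.items() if rtype in local]
            return {"resource": res, "local_type": rtype,
                    "ucb_types": ucb_types, "label": label}
    return None


found = keyword_search("UCB:Person", "phil")
```

The caller only ever names a UCB concept and a keyword; which datasets are searched is decided by the mappings, which is the point the slides make about not needing to know where the data lives.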
  25. Step 4: Building an Application 1
       • Data exploration environment
           • Search concepts
           • Display data
           • Allow link following
           • Deals with any concept defined in the UCB SKOS language
           • Uses the two framework services mentioned previously
           • Deployed in TopBraid Ensemble – Live
  26. Step 4: Data Exploration
       • UCB concepts
       • Search submitted to the “Keyword Search” service
  27. Step 4: Data Exploration
       • Results displayed
       • Index shows inference is already taking place
  28. Step 4: Data Exploration
       • Drag instance to basket; this initiates the “Get Info” service call
  29. Step 4: Data Exploration
       • Select instance
       • Data displayed per source
  30. Step 4: Data Exploration
       • Links to other data items
  31. Step 4: Data Exploration
       • Displays sparse data
       • Submit instance to the “Get Info” service
  32. Step 4: Data Exploration
       • More detailed information
  33. Step 4: Data Exploration
       • He has another interaction. Let’s explore.
  34. Step 4: Data Exploration
  35. Step 4: Data Exploration
       • Data cached as we navigated Concept Explorer. Can now be investigated.
  36. Step 4: Data Exploration
       • Structure concept
       • Keyword Search pulls data from internal and external data sources
       • Add to basket
       • After detailed information is retrieved, a second Structure has been identified without a keyword search
       • Integrated internal and external data
  37. Step 4: Data Exploration
  38. Step 4: Building an Application 2
       • Federated data gathering and marting
           • Data marting without the warehouse
           • New Mart REST service
               • SPARQLMotion/SPIN scripts
               • Dump_UCB:Antibody
           • Still uses the framework to integrate data
               • On-the-fly data integration
               • Gather RDF from data sources
           • Dump into tables
           • Data consumed by traditional query tools
           • Not particularly designed for this aspect… (slow)
               • But works!
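The "gather RDF, dump into tables" step amounts to pivoting triples into rows so that ordinary SQL tools can read them. The sketch below does this with an in-memory SQLite table. It is not the Dump_UCB:Antibody script itself: the triples, property names and the `mart` table name are all invented for illustration.

```python
import sqlite3

# Invented triples standing in for RDF gathered from several sources.
TRIPLES = [
    ("ab:1", "name", "Antibody A"),
    ("ab:1", "target", "Protein X"),
    ("ab:2", "name", "Antibody B"),
]


def dump_to_table(triples, columns):
    """Pivot (subject, property, value) triples into one row per subject."""
    rows = {}
    for subject, prop, value in triples:
        if prop in columns:
            rows.setdefault(subject, {})[prop] = value

    conn = sqlite3.connect(":memory:")
    col_sql = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f"CREATE TABLE mart (id TEXT PRIMARY KEY, {col_sql})")
    placeholders = ", ".join("?" for _ in range(len(columns) + 1))
    for subject, values in sorted(rows.items()):
        # Properties a subject lacks become NULL in its row.
        conn.execute(f"INSERT INTO mart VALUES ({placeholders})",
                     [subject] + [values.get(c) for c in columns])
    return conn


conn = dump_to_table(TRIPLES, ["name", "target"])
result = conn.execute("SELECT id, name, target FROM mart ORDER BY id").fetchall()
```

Once the rows are in a table, a traditional query tool can consume them without knowing that they came from RDF sources.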
  39. Step 4: Building an Application 3
       • Knowledge base creation
           • Gathering information can be a time-consuming exercise
               • But it is vital for projects to have
               • Different individuals have different ideas
                   • Relevance, sources, presentation etc.
           • Knowledge base to provide consistency for
               • Data gathered
               • Data sources used
               • Data presentation
           • ROI
               • 150-fold increase in efficiency
                   • 6 mins compared to > 16 hrs (over several weeks)
               • Information available to all at a central access point
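As a rough check on the quoted figures, taking the manual effort at its 16-hour lower bound:

```latex
\frac{16\ \text{h} \times 60\ \text{min/h}}{6\ \text{min}}
  = \frac{960\ \text{min}}{6\ \text{min}}
  = 160
```

Since the slide gives the manual effort as more than 16 hours, the ratio is at least this large, which is the same order as the 150-fold increase quoted.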
  40. Step 4: Knowledge Base
       [Diagram labels: Semantic Integration Framework; Keyword Search; Get Info; Data Sources; App Service]
       • “Tell me about the protein with Gene ID X”, and I want to know about literature refs, sequences, descriptions, structure… etc.
  41. Step 4: Knowledge Base
  42. Step 4: Knowledge Base
  43. Step 4: Knowledge Base
  44. Step 4: Knowledge Base
  45. Step 4: Knowledge Base
  46. Step 5: PURL Server
       • Removing URL dependencies
       • D2R publishes resolvable URLs that are specific to the server
       • Removing URL specificity with a PURL server
       • Allows each layer of the architecture to be removed without all the others having to be reconfigured
           • Level of independence / indirection
       • Only done on a limited scale
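The indirection a PURL server provides can be sketched as a prefix lookup: clients hold a persistent URL, and only the lookup table knows which server currently answers for it. A real PURL server answers with an HTTP redirect; this sketch models only the lookup, and all URLs in it are hypothetical.

```python
# Persistent prefix -> current server-specific prefix.
# All URLs are hypothetical examples.
purl_table = {
    "http://purl.example.org/person/": "http://d2r-server-1.example.org/resource/person/",
}


def resolve(purl, table):
    """Map a persistent URL to the server-specific URL currently behind it."""
    for prefix, target in table.items():
        if purl.startswith(prefix):
            return target + purl[len(prefix):]
    raise KeyError(f"no PURL mapping for {purl}")


before = resolve("http://purl.example.org/person/42", purl_table)

# Moving the data to another server changes one table entry;
# the persistent URL held by clients stays the same.
purl_table["http://purl.example.org/person/"] = "http://d2r-server-2.example.org/resource/person/"
after = resolve("http://purl.example.org/person/42", purl_table)
```

This is the sense in which a layer can be swapped out without reconfiguring the others: only the table entry changes.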
  47. Conclusions & Business value
       • We have built an extensible data integration framework
           • Shown how data integration can be an incremental process
               • Started with three datasets, more than 20 a few months later
               • By comparison, the warehouse took 18 months to add two new data sources
               • Adding a new source can take less than a day (whole process, including endpoint creation)
               • Creates an enterprise-wide “data fabric” rather than just one more application
           • Connect datasets together the way web pages fit together
               • Literally click from one dataset to the other
               • Dynamically mash up data from multiple sources
               • Add new sources by describing the connections, not by building a new application
  48. Conclusions & Business value
       • We have built a framework that
           • Differs from data integration applications the way the Web differs from earlier network technologies (ftp, archie)
               • Infrastructure allows new entities (pages, databases) to be added dynamically
               • Adding connections is as easy as specifying them
           • Provides data for all projects
               • Three very different applications have been demonstrated
               • All are able to use the same framework
               • Reuse
  49. Questions?