Building a semantic integration framework to support a federated query environment in 5 steps
Philip Ashworth, UCB Celltech
Dean Allemang, TopQuadrant
[Title-slide photo: Nele, living with lupus]
Data Integration… Why?
• The scope and knowledge of life sciences expands every day
• Every day we make new discoveries by experimenting in the lab
• Data is generated in the lab in large quantities, complementing the vast growth of external data
• It is too difficult and time-consuming for users to bring this data together
• As a result, we rarely make use of the data we already have to make new discoveries
Data Integration… Problems
[Diagram: the evolution of data integration, from a single application database with registration and query, to DI/query applications over multiple application databases, to project databases, and finally a warehouse database feeding project marts]
Data Integration… Problems
• Demand for DI increases every day, but the data doesn't evolve into a larger, more beneficial platform
  • Where is the long-term benefit?
  • We are driving ourselves around in circles
• We are just creating more data silos
  • Limited scope for reuse
• Slow and difficult to modify or enhance
• High maintenance
  • Multiple systems create more and more overhead
Data Integration… Thoughts
• Data integration is clearly evolving
• But it is not fulfilling the needs
• If we identify the need… can we see what we should be doing?
Data Integration… Needs
• All data for all projects
• Accessible data
• True integration
• Aligned concepts
• Data has context
• Variety of sources
Data Integration… There is a way!
• The Linked Open Data cloud: connected and linked data with context
• Created by a community
• A valuable resource that will only grow
• Significant scientific content; significant linking hubs appearing
• Something we can learn from!
Data Integration… Starting an Evolutionary Leap
• No one internally really knows about this
• We can't just rip and replace old systems
• We have to do some groundwork
Linked Data… The Quest
Technology projects
• Emphasis on semantic web principles
Scientific projects
• Data integration
• Data visualisation (mash-ups)
Linked Data… New Approach
Develop a proof-of-concept (POC) semantic data integration framework that is
• Easy to configure
• Able to support all projects
• The foundation of an environment for the future
The Idea
[Architecture diagram, top to bottom:]
• Applications
• Business process / workflow automation
• PURL and REST services (abstraction layer)
• Semantic Integration Framework: knowledge collation, concept mapping, distributed query, result inference, aggregation
• RDF SPARQL endpoints: native SPARQL endpoints and an RDF triple store
• Data sources: SQL databases (Oracle, Postgres, MySQL), MS Excel, TXT documents
Moving up the stack, ease of development increases and the knowledge of semantic technologies required decreases.
Step 1: Data Sources (RDF layer)
Expose data as RDF through SPARQL endpoints.
Internal data sources
• D2R SPARQL endpoints on RDBMS databases
  • Each modelled as the local concepts it represents
  • Don't worry about the larger concept picture
• Virtuoso RDF triple store (open source) to host RDF data created from spreadsheets
• TopBraid Ensemble and SPARQLMotion/SPIN scripts to convert static data to RDF
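To make the "local concepts only" point concrete, below is a minimal D2RQ mapping sketch of the kind a D2R SPARQL endpoint is configured with. The database, table and vocabulary names (db1:, USERS, etc.) are hypothetical illustrations, not taken from the UCB systems.

    @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
    @prefix map:  <#> .
    @prefix db1:  <http://example.ucb.com/db1/vocab/> .   # hypothetical local vocabulary

    # Connection to the relational source
    map:database a d2rq:Database ;
        d2rq:jdbcDSN    "jdbc:postgresql://localhost/projects" ;
        d2rq:jdbcDriver "org.postgresql.Driver" ;
        d2rq:username   "reader" .

    # Each row of the USERS table becomes a db1:User resource (a purely local concept)
    map:User a d2rq:ClassMap ;
        d2rq:dataStorage map:database ;
        d2rq:uriPattern  "user/@@USERS.LOGIN@@" ;
        d2rq:class       db1:User .

    # One column mapped to one property of that local concept
    map:User_fullName a d2rq:PropertyBridge ;
        d2rq:belongsToClassMap map:User ;
        d2rq:property          db1:fullName ;
        d2rq:column            "USERS.FULL_NAME" .

The point of Step 1 is that this mapping only has to be faithful to DB1's own schema; aligning db1:User with the wider UCB notion of a person happens later, in the framework.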
Step 1: Data Sources (RDF layer)
External data sources
• SPARQL endpoints in the LOD cloud from Bio2RDF, LODD and others
• Some stability, access and quality issues within these sources
• Created an Amazon cloud server to host stable environments
• Bio2RDF sources downloaded, stored and modified
• Virtuoso (open source) used as the triple store
[Diagram: the UCB data cloud, linking internal sources (MOC, NBE, LDAP, WH Mart, ITrack, Premier, Abysis, PEP, IDAC, WKW, SEQ, PMT) with Bio2RDF/LODD sources (ChEBI, PDB, GeneID, KEGG drug/glycan/compound, UniProt, SIDER, Diseasome)]
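Once a Bio2RDF source has been loaded into the local Virtuoso mirror, it can be probed like any other endpoint. A sketch of such a probe is shown below; the graph name and the use of rdfs:label are assumptions about how the mirrored data is organised, not details given in the slides.

    # Keyword probe against a locally mirrored external dataset (names assumed)
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT DISTINCT ?resource ?label
    FROM <http://bio2rdf.org/uniprot>
    WHERE {
      ?resource rdfs:label ?label .
      FILTER(CONTAINS(LCASE(STR(?label)), "lupus"))
    }
    LIMIT 25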
Step 2: Integration Framework: Why?
• In the Linked Open Data cloud, links within a source are manually created
• To navigate the cloud you either
  • learn the network, or
  • discover the network as you go (unguided)
• Nothing understands the total connectivity of the concepts available to you
  • It is difficult to know where to start
  • There is no way to know whether a start point will lead to the information you are looking for, or might be interested in
  • You can't query the cloud for specific information
The Integration Framework will resolve these issues
• It will model the models to understand the connectivity
• You shouldn't have to know where to look for data
Step 2: Integration Framework
[Diagram: the Semantic Integration Framework sits between the REST services / workflow / application layers and the RDF data sources. It must understand UCB concepts, understand how UCB concepts fit with source concepts, capture links across sources, describe the data sources (concepts, access, properties), be easy to wire up, automate some tasks, and be accessible via services.]
Step 2: Integration Framework
The Integration Framework is
• A data source, concept and property registry
• An ontology that utilises
  • VoID (enhanced) to capture data source information (endpoints)
  • SKOS to link local ontologies with UCB concepts
    • UCB:Person -> db1:user, db2:employee, db3:actor
Built using the TopBraid Suite
• Ontology development (TopBraid Composer)
• SPARQLMotion scripts to provide some automation
  • Creation of ontologies from endpoints and D2R mappings
  • Configuration assistance
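A sketch of what this registry might look like in Turtle is given below, combining the VoID and SKOS parts. The namespaces, the choice of skos:closeMatch as the mapping property, and the endpoint URLs are all assumptions made for illustration.

    @prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
    @prefix void:    <http://rdfs.org/ns/void#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix ucb:     <http://example.ucb.com/concept/> .     # hypothetical UCB namespace
    @prefix db1:     <http://example.ucb.com/db1/vocab/> .

    # UCB-wide concept, linked to the local class each source uses for it
    ucb:Person a skos:Concept ;
        skos:prefLabel  "Person" ;
        skos:closeMatch db1:User ,
                        <http://example.ucb.com/db2/vocab/Person> ,
                        <http://example.ucb.com/db3/vocab/Employee> .

    # VoID description of one data source and its SPARQL endpoint
    <http://example.ucb.com/void/db1> a void:Dataset ;
        dcterms:title        "DB1 (project tracking)" ;
        void:sparqlEndpoint  <http://d2r.ucb.example/db1/sparql> ;
        void:uriSpace        "http://example.ucb.com/db1/resource/" ;
        void:classPartition  [ void:class db1:User ] .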
Step 2: Integration Framework
[Diagram: the UCB Concept Ontology (SKOS) maps UCB:Person, UCB:Antibody and UCB:Project to DB1:User, DB1:Antibody and DB1:Project; the Dataset Ontology (VoID) describes DB1.]
Step 2: Integration Framework
[Diagram: UCB:Person now maps to DB1:User, DB2:Person, DB3:Employee and DB3:Contact; the Dataset Ontology (VoID) describes DB1, DB2 and DB3.]
Step 2: Integration Framework
[Diagram: linksets Person_DB1_DB2 and Person_DB1_DB3 record which datasets share Person instances, alongside the SKOS concept mappings and the VoID dataset descriptions.]
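The linksets themselves are naturally expressed with VoID as well. A minimal sketch, with the dataset URIs carried over from the registry sketch above and owl:sameAs assumed as the link predicate:

    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .

    # Records that DB1 and DB2 hold instance-level links for the Person concept
    <http://example.ucb.com/void/Person_DB1_DB2> a void:Linkset ;
        void:subjectsTarget <http://example.ucb.com/void/db1> ;
        void:objectsTarget  <http://example.ucb.com/void/db2> ;
        void:linkPredicate  owl:sameAs .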
Step 3: REST Services
• The interaction point for applications
• Expose simple, generic access to the Integration Framework
• Remove the complexity of the framework and of how to ask questions of it
  • You don't need to know how to make it work
  • You don't need to know anything about the datasets, or the concepts and properties held within them
• Just ask simple questions in the UCB language
  • Tell me about UCB:Person "ashworth"
• Built using SPARQLMotion/SPIN and exposed in the TopBraid Live enterprise server
• Two simple yet very effective services created: Keyword Search and Get Info
Step 3: REST Services: Keyword Search
Request: find UCB:Person "phil".
• The service asks the UCB Concept Ontology (SKOS) for the sub-types of UCB:Person, asks the Dataset Ontology (VoID) for the datasets that hold those sub-types, and checks whether the linksets add any information
• It then searches DB1:User, DB2:Person, DB3:Employee and DB3:Contact across DB1, DB2 and DB3
Response: here are the resources for "phil": ldap:U0xx10x, itrack:101, moc:scordisp, etc.
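Behind a Keyword Search call, the framework first has to consult its own registry to decide which local classes and endpoints are relevant. A rough sketch of that registry lookup, reusing the assumed layout from the earlier Turtle examples:

    # Which local classes correspond to UCB:Person, and which endpoints hold them?
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX void: <http://rdfs.org/ns/void#>
    PREFIX ucb:  <http://example.ucb.com/concept/>

    SELECT ?localClass ?endpoint
    WHERE {
      ucb:Person (skos:narrower|skos:closeMatch)+ ?localClass .
      ?dataset void:classPartition/void:class ?localClass ;
               void:sparqlEndpoint ?endpoint .
    }

Each (?localClass, ?endpoint) pair can then receive a keyword query of its own, and the hits are aggregated into the single answer returned to the caller.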
Step 3: REST Services: Get Info
Request: tell me about moc:scordisp.
• The service asks the ontologies for everything known about this resource and for the super-types of all resources
• It then retrieves DB1:U0xx10x, DB2:scordisp and DB3:philscordis from DB1, DB2 and DB3
Response: here is everything I know about it.
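The Get Info behaviour can be approximated with SPARQL 1.1 federation once the equivalent resources and their endpoints are known from the registry and linksets. A sketch follows; the endpoint URLs and resource URIs are illustrative, and support for a variable SERVICE target varies by engine.

    # Gather everything each source says about the equivalent resources
    CONSTRUCT { ?res ?p ?o }
    WHERE {
      VALUES (?endpoint ?res) {
        (<http://d2r.ucb.example/db1/sparql> <http://example.ucb.com/db1/resource/user/U0xx10x>)
        (<http://d2r.ucb.example/db3/sparql> <http://example.ucb.com/db3/resource/contact/philscordis>)
      }
      SERVICE ?endpoint { ?res ?p ?o }
    }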
Step 4: Building an Application (1)
A data exploration environment that
• Searches concepts
• Displays data
• Allows link following
• Deals with any concept defined in the UCB SKOS language
• Uses the two framework services described above
• Is deployed in TopBraid Ensemble / Live
Step 4: Data Exploration
[Screenshot] A search over UCB concepts is submitted to the Keyword Search service.
Step 4: Data Exploration
[Screenshot] The results are displayed; the index shows that inference is already taking place.
Step 4: Data Exploration
[Screenshot] Dragging an instance to the basket initiates a Get Info service call.
Step 4: Data Exploration
[Screenshot] Selecting an instance displays its data, grouped per source.
Step 4: Data Exploration
[Screenshot] Links to other data items are shown.
Step 4: Data Exploration
[Screenshot] Only sparse data is displayed, so the instance is submitted to the Get Info service.
Step 4: Data Exploration
[Screenshot] More detailed information is returned.
Step 4: Data Exploration
[Screenshot] He has another interaction; let's explore it.
Step 4: Data Exploration
[Screenshot] Data was cached as we navigated in the Concept Explorer and can now be investigated.
Step 4: Data Exploration: Integrated Internal and External Data
[Screenshot] A keyword search on a Structure concept pulls data from internal and external sources; after the detailed information is retrieved, a second structure is identified without a keyword search and is added to the basket.
Step 4: Building an Application (2)
Federated data gathering and marting
• Data marting without the warehouse
• A new Mart REST service
  • SPARQLMotion/SPIN scripts, e.g. Dump_UCB:Antibody
• Still uses the framework to integrate data
  • On-the-fly data integration
  • Gathers RDF from the data sources
  • Dumps it into tables
• The data can then be consumed by traditional query tools
• The framework was not particularly designed for this aspect (it is slow), but it works!
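The kind of query a Dump_UCB:Antibody script would issue against the integrated view is sketched below; the ucb: property names are assumptions, and each result row maps directly onto a row of the mart table.

    # Flatten integrated antibody data into one table for a traditional query tool
    PREFIX ucb: <http://example.ucb.com/concept/>
    SELECT ?antibody ?name ?project ?sequence
    WHERE {
      ?antibody a ucb:Antibody ;
                ucb:name ?name .
      OPTIONAL { ?antibody ucb:project  ?project }
      OPTIONAL { ?antibody ucb:sequence ?sequence }
    }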
Step 4: Building an Application (3)
Knowledge base creation
• Gathering information can be a time-consuming exercise, but it is vital for projects to have
• Different individuals have different ideas about relevance, sources, presentation, etc.
• A knowledge base provides consistency in
  • the data gathered
  • the data sources used
  • the data presentation
• ROI
  • Roughly a 150-fold increase in efficiency: about 6 minutes compared to more than 16 hours (spread over several weeks)
  • Information available to all at a central access point
Step 4: Knowledge Base
[Diagram: an application asks "Tell me about the protein with Gene ID X", requesting literature references, sequences, descriptions, structures, etc. The application calls the Keyword Search and Get Info services, which use the Semantic Integration Framework to pull the answers from the data sources.]
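A sketch of the sort of query that could assemble one facet of the "protein with Gene ID X" answer is shown below; the gene URI, graph layout and the way cross-references are detected are all assumptions for illustration, not the actual UCB scripts.

    # One facet of "tell me about the protein with Gene ID X": its label plus cross-references
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?facet ?value
    WHERE {
      VALUES ?gene { <http://bio2rdf.org/geneid:7124> }        # illustrative gene only
      {
        ?gene rdfs:label ?value .
        BIND("description" AS ?facet)
      }
      UNION
      {
        ?gene ?p ?value .
        FILTER(CONTAINS(LCASE(STR(?p)), "xref"))               # links out to PDB, UniProt, ...
        BIND("cross-reference" AS ?facet)
      }
    }

In a sketch like this, each requested facet (literature, sequences, structures, and so on) would get a similar branch before everything is collated into one document.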
Step 5: PURL Server
Removing URL dependencies
• D2R publishes resolvable URIs that are specific to the server that minted them
• A PURL server removes this URL specificity
• This allows each layer of the architecture to be replaced without all the others having to be reconfigured
  • A level of independence / indirection
• Only done on a limited scale so far
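One way to picture the indirection (all URIs below are hypothetical): data and applications refer to a stable PURL, and the PURL server redirects to whichever D2R host currently serves the record. The equivalence can also be recorded explicitly if desired, for example:

    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    # Stable PURL used by applications ...
    <http://purl.ucb.example/person/ashworth>
        # ... asserted equivalent to the server-specific URI the PURL currently redirects to
        owl:sameAs <http://d2rserver01.ucb.example:2020/resource/user/ashworth> .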
Conclusions & Business Value
We have built an extensible data integration framework
• It shows how data integration can be an incremental process
  • We started with three datasets and had more than 20 a few months later
  • By comparison, the warehouse took 18 months to add two new data sources
  • Adding a new source can take less than a day (the whole process, including endpoint creation)
• It creates an enterprise-wide "data fabric" rather than just one more application
  • Datasets connect together the way web pages fit together
  • You can literally click from one dataset to the other
  • Data from multiple sources can be mashed up dynamically
  • New sources are added by describing the connections, not by building a new application
Conclusions & Business Value
We have built a framework that
• Differs from data integration applications the way the Web differs from earlier network technologies (FTP, Archie)
  • The infrastructure allows new entities (pages, databases) to be added dynamically
  • Adding connections is as easy as specifying them
• Provides data for all projects
  • Three very different applications have been demonstrated
  • All of them use the same framework: reuse