Data Integration at the Ontology Engineering Group


Presentation on the data integration work being done at OEG-UPM, given at the CredIBLE workshop in Sophia-Antipolis (October 15th, 2012).


  1. Data integration at our group: ingredients and some prospects. CredIBLE workshop, Sophia-Antipolis, October 15th 2012. Oscar Corcho, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain. With contributions from: José Mora (OEG-UPM), Boris Villazón-Terrazas (OEG-UPM, now at iSOCO), Jean Paul Calbimonte (OEG-UPM), Freddy Priyatna (OEG-UPM), Carlos Buil-Aranda (OEG-UPM, now at PUC Chile)
  2. Our data integration needs, problems (and challenges)
     • Need to access heterogeneous relational data sources (mainly in the area of Geography)
       • Some of the databases are available in different DBMSs
       • And some of the data sources are available as spreadsheets
       • Furthermore, many of these datasets are already published as Linked Data
     • Need to submit SPARQL queries to distributed SPARQL endpoints
     • And data may be available from data streams (e.g., sensors)
  3. Ingredients
     [Figure: layered architecture, from the SemsorGrid4Env architecture. Top layer: thin applications (mashups). Middle layer: middleware for semantic data integration and querying, with five numbered ingredients: 1 RDB2RDF, 2 query rewriting optimisations, 3 sensor-based query rewriting, 4 federated query processing, 5 reasoning. Bottom layer: data sources (legacy data sources, sensor registries, sensor networks, Linked Open Data, spreadsheets).]
  4. Disclaimer: when I talk about ontology-based querying, I will normally be talking about SPARQL querying
  5. Section 1: RDB2RDF. In other words, how to make relational data available as RDF (and connected to ontologies)
  6. RDB2RDF. Motivation
     • A majority of dynamic Web content is backed by relational databases (RDB), and so are many enterprise systems.
     [Figure labels: transformation engine, transformation description.]
  7. RDB2RDF. Query rewriting for OBDA with mappings
     [Figure: a query Q goes through rewriting, which uses the mappings, and comes out as a query Q’.]
     There may be some mappings to translate between ontology and DB. The rewriting should consider those mappings.
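To illustrate the idea on this slide, here is a minimal Python sketch of mapping-aware rewriting. It is not R2O, ODEMapster or Morph code: the mapping dictionaries, the `ex:` terms, the table and column names and the `rewrite_triple_pattern` function are all hypothetical, chosen only to show how a query Q over the ontology could become a query Q’ over the database.

```python
# Hypothetical mappings from ontology terms to relational tables and columns.
# None of these names come from the slides; they are assumptions for illustration.
CLASS_MAPPINGS = {
    "ex:Person": {"table": "person", "key": "personID"},
}
PROPERTY_MAPPINGS = {
    "ex:name": {"table": "person", "key": "personID", "column": "name"},
}


def rewrite_triple_pattern(subject, predicate, obj):
    """Rewrite one SPARQL triple pattern (Q) into a SQL query string (Q')."""
    if predicate == "rdf:type":
        # Class membership: every key of the mapped table is an instance.
        m = CLASS_MAPPINGS[obj]
        return f"SELECT {m['key']} AS {subject.lstrip('?')} FROM {m['table']}"
    # Property pattern: project the key and the mapped column, skipping NULLs.
    m = PROPERTY_MAPPINGS[predicate]
    return (
        f"SELECT {m['key']} AS {subject.lstrip('?')}, "
        f"{m['column']} AS {obj.lstrip('?')} "
        f"FROM {m['table']} WHERE {m['column']} IS NOT NULL"
    )


q_type = rewrite_triple_pattern("?p", "rdf:type", "ex:Person")
q_name = rewrite_triple_pattern("?p", "ex:name", "?n")
print(q_type)
print(q_name)
```

A real rewriter would also have to join the SQL produced for several triple patterns, which is where the optimisations discussed later in the talk come in.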
  8. RDB2RDF. Existing approaches
     1. To build a new ontology from a database schema and content (direct mappings)
     2. To map the ontology created in approach (1) to a legacy ontology
     3. To map an existing DB to a legacy ontology
     [Figure labels: new ontology, existing ontology.]
  9. OEG’s background knowledge in RDB2RDF
     • R2O and ODEMapster
       • GaV wrapper generation (no mediators)
       • Syntactic sugar for the generation of SQL queries
       • Simple use of this language and processor in the domains of fund finding, cultural information, and fisheries
       • NeOn Toolkit plugin for common mappings
     Barrasa J, Corcho O, Gómez-Pérez A. (2004) R2O, an extensible and semantically based database-to-ontology mapping language. In: Proceedings of the Second Workshop on Semantic Web and Databases, SWDB 2004.
  10. R2O (Relational-to-Ontology) Language
     For concepts...
     • A view maps exactly one concept in the ontology.
     • A subset of the columns in the view maps a concept in the ontology.
     • A subset (selection) of the records of a database view maps a concept in the ontology.
     • A subset of the records of a database view maps a concept in the ontology, but the selection cannot be made using SQL.
     • One or more concepts can be extracted from a single data field (not in 1NF).
     For attributes...
     • A column in a database view maps directly an attribute or a relation.
     • A column in a database view maps an attribute or a relation after some transformation.
     • A set of columns in a database view maps an attribute or a relation.
  11. The W3C RDB2RDF Working Group
     • Created in 2007
     • W3C Recommendations in September 2012
       • R2RML: RDB to RDF Mapping Language -
       • Direct Mapping - direct-mapping/
       • R2RML and Direct Mapping Test Cases - 2rdf/test-cases/
       • RDB2RDF Implementation Report - 2rdf/implementation-report/
  12. R2RML example
     [Example shown as an image on the slide.]
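Since the example on this slide was an image, the following Python sketch illustrates what an R2RML-style triples map does when it is materialised: each table row yields a subject built from a template, a type triple, and one triple per mapped column. The dictionary layout only echoes R2RML terms (template, class, column) and is not R2RML syntax; the `http://example.org/` URIs, the sample rows and the `apply_mapping` function are hypothetical.

```python
# A hypothetical, simplified triples map (the keys echo R2RML terms).
MAPPING = {
    "template": "http://example.org/person/{personID}",
    "class": "http://example.org/Person",
    "predicate_object": [
        {"predicate": "http://example.org/name", "column": "name"},
    ],
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def apply_mapping(mapping, rows):
    """Turn relational rows (dicts) into a list of (s, p, o) triples."""
    triples = []
    for row in rows:
        # Build the subject URI by filling the template with column values.
        subject = mapping["template"].format(**row)
        triples.append((subject, RDF_TYPE, mapping["class"]))
        for po in mapping["predicate_object"]:
            value = row.get(po["column"])
            if value is not None:  # NULL columns produce no triple
                triples.append((subject, po["predicate"], value))
    return triples


rows = [
    {"personID": 1, "name": "Ana"},
    {"personID": 2, "name": None},
]
triples = apply_mapping(MAPPING, rows)
for triple in triples:
    print(triple)
```

Query rewriting, the topic of the next section, avoids this materialisation step by translating queries instead of data.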
  13. Existing implementations
     • OEG implementations
     • Implementation Report. Boris Villazón-Terrazas, Michael Hausenblas.
  14. Ongoing work
     • Provide a list of common patterns in R2RML transformations, so that they can be reused (increasing productivity)
       • Sequeda J, Priyatna F, Villazón-Terrazas B. Relational Database to RDF Mapping Patterns. In: Proceedings of the 3rd Workshop on Ontology Patterns (WOP2012).
       • Villazón-Terrazas B, Priyatna F. Building Ontologies by using Re-engineering Patterns and R2RML Mappings. In: Proceedings of the 3rd Workshop on Ontology Patterns (WOP2012).
     • Improve our support in Morph for all test cases
     • Adapt existing GUIs for the generation of mappings (such as the NeOn Toolkit’s)
  15. Section 2: R2RML query rewriting optimisations. In other words, how to optimise this query rewriting, so that query evaluation does not suffer from poor efficiency
  16. R2RML is now a W3C Recommendation
     • That’s very good to ensure wide uptake, but…
     • Implementations still suffer from their lack of efficiency
       • UltraWrap has shown that a similar performance can be obtained with direct mappings on high-end databases (Oracle, SQL Server)
       • What happens with low-end databases (MySQL)?
  17. Several works on SPARQL to SQL translation
     • Barrasa J, Corcho O, Gómez-Pérez A. (2004) R2O, an extensible and semantically based database-to-ontology mapping language. In: Proceedings of the Second Workshop on Semantic Web and Databases, SWDB 2004.
     • R. Cyganiak. A relational algebra for SPARQL. Digital Media Systems Laboratory, HP Laboratories Bristol. HPL-2005-170, 2005.
     • B. Elliott, E. Cheng, C. Thomas-Ogbuji, and Z.M. Ozsoyoglu. A complete translation from SPARQL into efficient SQL. In Proceedings of the 2009 International Database Engineering & Applications Symposium, pages 31-42. ACM, 2009.
     • A. Chebotko, S. Lu, and F. Fotouhi. Semantics preserving SPARQL-to-SQL translation. Data & Knowledge Engineering, 68(10):973-1000, 2009.
  18. Chebotko’s query rewriting
  19. Our proposal
  20. An example. BSBM08
     NATIVE
       SELECT r.title, r.text, r.reviewDate, p.personID,, r.rating1, r.rating2, r.rating3, r.rating4
       FROM review r, person p
       WHERE r.productID=55547 AND r.personID=p.personID AND r.language=en
       ORDER BY r.reviewDate desc
     CHEBOTKO
       SELECT var_rating2 AS rating2, var_reviewerName AS reviewerName, var_title AS title, var_rating1 AS rating1,
              var_reviewDate AS reviewDate, var_reviewer AS reviewer, var_rating3 AS rating3,
              var_rating4 AS rating4, var_text AS text
       FROM (SELECT *
       FROM (SELECT uri_rating41477446315 AS uri_rating41477446315, var_rating2 AS var_rating2,
              var_reviewer AS var_reviewer, uri_reviewDate750573656 AS uri_reviewDate750573656,
              var_rating4 AS var_rating4, var_rating1 AS var_rating1, var_text AS var_text,
              uri_title1963229325 AS uri_title1963229325, var_rating3 AS var_rating3,
              uri_reviewer2088452952 AS uri_reviewer2088452952, uri_rating21477446253 AS uri_rating21477446253,
              uri_text1457367120 AS uri_text1457367120, uri_rating31477446284 AS uri_rating31477446284,
              uri_rating11477446222 AS uri_rating11477446222, uri_reviewFor1499735727 AS uri_reviewFor1499735727,
              var_reviewDate AS var_reviewDate, var_title AS var_title, uri_language269987354 AS uri_language269987354,
              uri_Product555472014519903 AS uri_Product555472014519903, v_7634.var_review AS var_review,
              var_reviewerName AS var_reviewerName, uri_name1396749066 AS uri_name1396749066, var_lang AS var_lang
       FROM (SELECT uri_reviewer2088452952 AS uri_reviewer2088452952, v_6537.var_review AS var_review,
              uri_rating11477446222 AS uri_rating11477446222, uri_rating31477446284 AS uri_rating31477446284,
              uri_Product555472014519903 AS uri_Product555472014519903,
              uri_reviewFor1499735727 AS uri_reviewFor1499735727, var_rating2 AS var_rating2,
  21. An example. BSBM08
     OUR APPROACH
       SELECT var_rating2 AS rating2, var_reviewDate AS reviewDate, var_rating4 AS rating4, var_rating1 AS rating1,
              var_reviewer AS reviewer, var_rating3 AS rating3, var_reviewerName AS reviewerName,
              var_text AS text, var_title AS title
       FROM (SELECT *
       FROM (SELECT v_2660.var_reviewer AS var_reviewer, var_reviewDate AS var_reviewDate,
              var_review AS var_review, uri_rating31477446284 AS uri_rating31477446284,
              uri_rating21477446253 AS uri_rating21477446253, uri_title1963229325 AS uri_title1963229325,
              var_rating3 AS var_rating3, uri_reviewDate750573656 AS uri_reviewDate750573656,
              uri_reviewFor1499735727 AS uri_reviewFor1499735727, uri_language269987354 AS uri_language269987354,
              uri_name1396749066 AS uri_name1396749066, var_rating1 AS var_rating1,
              var_reviewerName AS var_reviewerName, var_lang AS var_lang,
              uri_Product555472014519903 AS uri_Product555472014519903, var_rating2 AS var_rating2,
              uri_rating41477446315 AS uri_rating41477446315, var_title AS var_title, var_rating4 AS var_rating4,
              var_text AS var_text, uri_rating11477446222 AS uri_rating11477446222,
              uri_text1457367120 AS uri_text1457367120, uri_reviewer2088452952 AS uri_reviewer2088452952
       FROM (SELECT v_8722.PERSONID AS var_reviewer, AS uri_name1396749066, v_8722.NAME AS var_reviewerName
             FROM PERSON v_8722
             WHERE (v_8722.NAME IS NOT NULL) ) v_2660
       INNER JOIN (SELECT v_3353.REVIEWDATE AS var_reviewDate, AS uri_rating11477446222,
              v_3353.REVIEWID AS var_review, v_3353.TEXT AS var_text, AS uri_reviewer2088452952,
              v_3353.RATING1 AS var_rating1, uri_rating21477446253, v_3353.TITLE AS var_title,
              AS uri_language269987354, AS uri_reviewDate750573656, AS uri_rating31477446284,
              http://www4.wiwiss.fu-
  22. Analysis with BSBM
     [Charts: results on SQL Server and on MySQL.]
  23. Ongoing work
     • Writing the paper describing our optimisations
     • Proposing a comprehensive benchmarking platform to test R2RML-compliant query rewriting systems
       • Extending our current work on the R2RML implementation test cases
  24. Section 3: Ontology-based sensor query rewriting. In other words, what happens if our data sources are not static, but data streams? Can we still use similar techniques?
  25. An example: SmartCities
     [Figure: environmental sensors and parking sensors, SmartSantander Project.]
  26. Flood risk alert: South East England
     [Figure: an emergency planner ("I have to make sense out of all this data") surrounded by data sources: data from the Web, wave data, environmental forecasts, defenses.]
     • Heterogeneity
     • Continuous querying
     • Streaming data
  27. Ingredients for Linked Sensor Data
     • Core ontological model
     • Additional domain ontologies
     • Guidelines for generation of identifiers
     • Sensor Web programming interfaces
     • Query processing engines
  28. Overview of the SSN ontology
     [Figure: the SSN ontology modules (Deployment, System, OperatingRestriction, Process, Device, PlatformSite, Data, Skeleton, MeasuringCapability, ConstraintBlock) with their main classes (System, Deployment, Platform, Device, Process, Input, Output, Sensor, SensingDevice, Sensing, SensorInput, SensorOutput, ObservationValue, Property, Observation, FeatureOfInterest, MeasurementCapability, Condition, OperatingRange, SurvivalRange, DeploymentRelatedProcess) and the properties that link them.]
     Compton M, Barnaghi P, Bermúdez L, García-Castro R, Corcho O, Cox S, Graybeal J, Hauswirth M, Henson C, Herzog A, Huang V, Janowicz K, Kelsey WD, Le Phuoc D, Lefort L, Leggieri M, Neuhaus H, Nikolov A, Page K, Passant A, Sheth A, Taylor K. The SSN Ontology of the W3C Semantic Sensor Network Incubator Group. Journal of Web Semantics. In press
  29. SSN Ontology with other Ontologies
     García-Castro R, Corcho O, Hill C. A Core Ontology Model for Semantic Sensor Web Infrastructures. International Journal of Semantic Web and Information Systems 8(1):22-42
  30. Queries to Sensor Data
     • Data Stream Management System, SNEEql: RSTREAM SELECT id, speed, direction FROM wind [NOW];
     • Complex Event Processors, Esper QL: SELECT wind_speed FROM min)
     • Sensor Data Middleware, GSN RESTful service: [0]=wind_sensor&field[0]=wind_speed&from=15/09/2011+05:00:00&to=15/09/2011+15:00:00
     • Sensor Data Middleware, Pachube RESTful service: 02T14:01:46Z&end=2011-09-02T17:01:46Z
     Querying through ontologies?
  31. SPARQL-Stream
       SELECT ?windspeed ?tidespeed
       FROM NAMED STREAM <> [NOW-10 MINUTES TO NOW-0 MINUTES]
       WHERE {
         ?WaveObs a ssn:Observation;
                  ssn:observationResult ?windspeed;
                  ssn:observedProperty sweetSpeed:WindSpeed.
         ?TideObs a ssn:Observation;
                  ssn:observationResult ?tidespeed;
                  ssn:observedProperty sweetSpeed:TideSpeed.
         FILTER (?tidespeed<?windspeed)
       }
     • Query processing closer to data
     • Use ontologies as conceptual model
     • Query virtual stream graphs
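The window `[NOW-10 MINUTES TO NOW-0 MINUTES]` in the query above restricts evaluation to recent stream elements. A minimal Python sketch of that selection follows. It assumes hypothetical timestamped tuples, a hypothetical `time_window` helper and a closed interval at both ends; it is not the SPARQL-Stream implementation.

```python
from datetime import datetime, timedelta


def time_window(stream, now, start_offset, end_offset):
    """Keep the elements whose timestamp lies in [now - start_offset, now - end_offset]."""
    lower = now - start_offset
    upper = now - end_offset
    return [item for item in stream if lower <= item["timestamp"] <= upper]


# Made-up observations; the reference time stands in for NOW.
now = datetime(2012, 10, 15, 12, 0, 0)
stream = [
    {"timestamp": now - timedelta(minutes=25), "windspeed": 7.5},
    {"timestamp": now - timedelta(minutes=8), "windspeed": 9.1},
    {"timestamp": now - timedelta(minutes=1), "windspeed": 8.4},
]

# Equivalent of [NOW-10 MINUTES TO NOW-0 MINUTES].
recent = time_window(stream, now, timedelta(minutes=10), timedelta(minutes=0))
print(recent)
```

In a continuous query the same selection is re-evaluated as time advances, so the set of tuples visible to the WHERE clause keeps changing.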
  32. SPARQL-Stream
       SELECT ?name ( AVG(?temperature) AS ?avgTemperature ) ( AVG(?humidity) AS ?avgHumidity )
       FROM NAMED STREAM <> [NOW - 1 HOURS SLIDE 1 HOURS]
       FROM <>
       FROM <>
       WHERE {
         ?sensor om-owl:generatedObservation ?temperatureObservation ;
                 om-owl:generatedObservation ?humidityObservation ;
                 om-owl:hasLocatedNearRel [ om-owl:hasLocation ?nearbyLocation ] .
         ?temperatureObservation om-owl:observedProperty weather:_AirTemperature ;
                 om-owl:result [ om-owl:floatValue ?temperature ] .
         ?humidityObservation om-owl:observedProperty weather:_RelativeHumidity ;
                 om-owl:result [ om-owl:floatValue ?humidity ] .
         { SELECT ?name WHERE {
             ?nearbyLocation gn:featureClass ?featureClass ;
                 gn:name | gn:officialName ?name ;
                 gn:population ?population .
             FILTER ( ?population > 15000 && REGEX(?featureClass, "P" , "i") ) } }
         UNION
         { SELECT ?name WHERE {
             ?nearbyLocation gn:parentFeature+ ?parentFeature .
             ?parentFeature gn:featureClass ?parentClass ;
                 gn:name | gn:officialName ?name ;
                 gn:population ?parentPopulation .
             FILTER ( ?parentPopulation > 15000 && REGEX(?parentClass, "P" , "i") ) } }
       }
       GROUP BY ?name
     Features highlighted on the slide: aggregates; static & streaming data; windows; filters and functions.
     Disclaimer: some features NYI (not yet implemented)
  33. Querying the Observations
     SPARQLStream query (sent by the client):
       SELECT ?waveheight
       FROM STREAM <> [NOW -10 MINUTES TO NOW STEP 1 MINUTE]
       WHERE {
         ?WaveObs a sea:WaveHeightObservation;
                  sea:hasValue ?waveheight;
       }
     R2RML mappings (used by query rewriting):
       :Wan4WindSpeed a rr:TriplesMapClass;
         rr:tableName "wan7";
         rr:subjectMap [ rr:template " {timed}";
                         rr:class ssn:ObservationValue;
                         rr:graph ssg:swissexsnow.srdf ];
         rr:predicateObjectMap [ rr:predicateMap [ rr:predicate ssn:hasQuantityValue ];
                                 rr:objectMap[ rr:column "sp_wind" ] ];
     Rewritten request (GSN API):
       :22001/multidata?vs[0]=wan7&field[0]=sp_wind
     [Figure labels: client, SPARQLStream, query rewriting, R2RML mappings, query processing, query processing engines, GSN API, sensor network, data translation from [tuples] to [triples].]
  34. Rewriting to different technologies
       SELECT ?windspeed
       FROM NAMED STREAM <http://swiss-> [NOW-10 MINUTE TO NOW-0 MINUTE]
       WHERE {
         ?WaveObs a ssn:Observation;
                  ssn:observationResult ?windspeed;
                  ssn:observedProperty sweetSpeed:WindSpeed.
       }
     Query rewriting goes through an algebra representation and then targets each technology:
     • Esper (CEP): SELECT wind_speed_scalar_av, timed FROM min)
     • SNEE (DSMS): SELECT wan7.wind_speed_scalar_av AS windspeed, wan7.timed AS windts FROM wan7[FROM NOW-10 MINUTES TO NOW]
     • GSN (Middleware): [0]=wan7&field[0]=wind_speed_scalar_av&from=15/05/2011+05:00:00&to=15/05/2011+15:00:00
     • Pachube (Middleware): 02T14:01:46Z&end=2011-09-02T17:01:46Z
     Calbimonte JP, Corcho O, Jeung H, Aberer K. Enabling Query Technologies for the Semantic Sensor Web. International Journal of Semantic Web and Information Systems 8(1):43-63
  35. Ongoing work
     • Benchmarking of ontology-based streaming data engines
       • Zhang Y, Pham MD, Corcho O, Calbimonte JP. SRBench: A Streaming RDF/SPARQL Benchmark. Proceedings of the 11th International Semantic Web Conference (ISWC2012)
     • Improve optimisations when joining static and streaming data
     • Automatic characterisation of sensor data streams
       • Useful in citizen science approaches (e.g., AirQualityEgg)
       • Calbimonte JP, Yan Z, Jeung H, Corcho O, Aberer K. Deriving Semantic Sensor Metadata from Raw Measurements. ISWC2012 5th International Workshop on Semantic Sensor Networks (SSN2012). CEUR Workshop Proceedings, Vol-904
  36. Section 4: Federated query processing. In other words, how can we access data from federated data sources?
  37. Example
     • We query the life science domain
       1. Using the Pubmed references obtained from the GeneID gene dataset, retrieve information about genes and their references in the Pubmed dataset.
       2. From Pubmed we access the information in the National Library of Medicine’s controlled vocabulary thesaurus, stored at the MeSH endpoint, so we have more complete information about such genes.
       3. Finally, we also access the HHPID endpoint, which is the knowledge base for the HIV-1 protein.
  38. Introduction
     • Question:
       • How can we access such an amount of RDF data in an integrated manner?
     • Current approaches
       • Replicate data in local stores, access it using existing RDF databases.
       • Execute individual queries and manually join data.
       • Use existing distributed query systems (starting to appear).
  39. Problem
     • Existing tools for distributed SPARQL query processing differ in the way they handle distribution
       • The SPARQL Working Group published the Federated Query document as a Last Call Working Draft
       • It homogenises the access to distributed RDF data repositories
       • SERVICE <> {...}
     • Problems in semantics: SERVICE ?X is not well defined
     • Current access to SPARQL endpoints is not optimal
       • Work on SPARQL distributed query optimization is beginning
  40. State of the Art
     • ANAPSID, RDF::Query, OpenAnzo, ARQ, Rasqal RDF Query Library
     • ANAPSID provides SPARQL optimization based on adaptive query processing operators
     • RDF::Query provides basic pattern reordering
       • Implement the federation using query predicates
       • List of SPARQL endpoints needed
       • Helps user to direct queries to remote datasets
       • FedX, SPLENDID, SemWIQ, NetworkedGraphs
       • All provide basic optimisations: pattern grouping (FedX), cost-based optimizations (SemWIQ, SPLENDID and recently FedX, NetworkedGraphs)
       • SPARQL 1.1 is mostly syntactic sugar
  41. Assumptions & Restrictions
     • Assumptions
       1. Users know how to create a query to the endpoints
       2. No statistics of any kind are available for the query processing system
       3. Data are distributed
     • Restrictions
       1. We only consider the Federation Extension of SPARQL 1.1
       2. We are not aware of the capabilities or implementation of the remote SPARQL server
       3. No registry of endpoints
  42. SERVICE Semantics
     Example (single dataset):
       SELECT ?name ?email
       WHERE {
         ?y :name ?name .
         ?y :email ?email
       }
     Example (federated):
       SELECT ?name ?email
       WHERE {
         SERVICE <> {?y :name ?name} .
         SERVICE <> {?y :email ?email}
       }
     • We extend [PAG09] with the semantics for SERVICE. [Definition shown as an image on the slide.]
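The formal definition on this slide was an image and is not reproduced here. As background, semantics in the style of [PAG09] evaluate graph patterns to sets of solution mappings and combine them by joining compatible mappings. The Python sketch below applies that join to the results of the two SERVICE blocks in the example. The endpoint results are made-up data, and `compatible` and `join` are hypothetical helpers, not SPARQL-DQP code.

```python
def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on every shared variable."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())


def join(omega1, omega2):
    """Join two sets of solution mappings by merging every compatible pair."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2 if compatible(m1, m2)]


# Made-up results, standing in for what the two remote endpoints might return.
names = [{"?y": ":a", "?name": "Ana"}, {"?y": ":b", "?name": "Bob"}]
emails = [{"?y": ":a", "?email": "ana@example.org"}]

result = join(names, emails)
print(result)
```

Only `:a` appears in both result sets, so only one joined mapping survives; `:b` has a name but no email and is dropped, as in a plain conjunctive pattern.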
  43. SERVICE Semantics
     Example:
       SELECT ?name
       WHERE {
         SERVICE ?X {?y :name ?name}
       }
  44. SPARQL Optimisation - OPTIONAL
     • We assume that we have no statistics of endpoints
       • This means that we cannot use cost-based optimisations
       • We will only focus on static optimisations
     • Besides the usual static optimisations (e.g. pushing down filters), SPARQL queries can be optimised if they contain OPTIONAL operators
       • The OPTIONAL operator is responsible for PSPACE-completeness in SPARQL [PAG09]
     • OPTIONAL is a key operator in SPARQL
  45. Well-designed patterns
     • Well-designed SPARQL patterns [PAG09]
       • A class of SPARQL patterns that adds a restriction on how variables may be used in OPTIONAL sub-patterns
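The restriction itself was shown as an image. As it is usually stated for [PAG09], a pattern is well designed if, for every sub-pattern (P1 OPT P2), any variable that occurs both in P2 and outside that sub-pattern also occurs in P1. The Python sketch below checks this condition on a tiny pattern encoding. The tuple encoding and both functions are hypothetical, and FILTER, UNION and SERVICE are deliberately not modelled.

```python
def variables(pattern):
    """Collect the variables (terms starting with '?') used in a pattern."""
    kind = pattern[0]
    if kind == "BGP":
        return {term for triple in pattern[1] for term in triple if term.startswith("?")}
    return variables(pattern[1]) | variables(pattern[2])  # AND / OPT


def is_well_designed(pattern, outside=frozenset()):
    """Check the OPTIONAL condition; `outside` holds variables used outside `pattern`."""
    kind = pattern[0]
    if kind == "BGP":
        return True
    left, right = pattern[1], pattern[2]
    left_vars, right_vars = variables(left), variables(right)
    # For (P1 OPT P2): variables of P2 that also occur outside must occur in P1.
    if kind == "OPT" and not (right_vars & outside) <= left_vars:
        return False
    return (is_well_designed(left, outside | right_vars)
            and is_well_designed(right, outside | left_vars))


good = ("OPT",
        ("BGP", [("?x", ":name", "?n")]),
        ("BGP", [("?x", ":email", "?e")]))
bad = ("AND",
       ("BGP", [("?x", ":name", "?n")]),
       ("OPT",
        ("BGP", [("?y", ":knows", "?z")]),
        ("BGP", [("?x", ":email", "?e")])))

print(is_well_designed(good), is_well_designed(bad))
```

In `bad`, the variable `?x` is used in the optional part and in the outer pattern but not in the mandatory side of the OPT, which is exactly the situation the restriction rules out.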
  46. Well-designed Patterns
     • We extended the notion of well-designed patterns for the SPARQL 1.1 Federation Extension
       • The previous rules also hold for SERVICE
  47. Implementation: SPARQL-DQP
     • SPARQL-DQP is implemented on top of OGSA-DAI and OGSA-DQP
       • OGSA-DAI is a Web service-based framework for accessing distributed data resources
       • OGSA-DQP adds distributed query processing infrastructure
     • We reuse some OGSA-DQP operators
     • We added RDF and SPARQL endpoint data access
       • RDB2RDF data resource
       • RDF data resource
       • SPARQL endpoint resources
     • Good behaviour for large datasets
     Buil C, Arenas M, Corcho O. Semantics and optimization of the SPARQL 1.1 federation extension. Proceedings of the 8th Extended Semantic Web Conference (ESWC2011). Springer-Verlag LNCS 6644, pages 1-15
  48. Ongoing Work
     • An extensive benchmark has been produced
       • Montoya G, Vidal ME, Corcho O, Ruckhaus E, Buil-Aranda C. Benchmarking Federated SPARQL Query Engines: Are Existing Testbeds Enough? In: Proceedings of the 11th International Semantic Web Conference (ISWC2012)
     • Focusing now on Adaptive Query Processing
       • Query processing should be adapted to the user’s specific needs and to specific network requirements
  49. Section 5: Entailment in query rewriting. In other words, how can we take into account the existence of ontologies in the query rewriting process, so as to provide simple entailment?
  50. Main approaches in the state of the art
     Expressiveness                 | Author              | System          | Output
     ELHIO¬                         | Pérez-Urbina et al. | REQUIEM         | [R] Datalog, UCQ
     Sticky-join [linear] datalog±  | Gottlob et al.      | Nyaya           | UCQ
     DL-LiteR, DL-LiteF             | Calvanese et al.    | QuOnto          | UCQ
     DL-LiteR                       | Chortaras et al.    | Rapid           | UCQ
     DL-LiteR [+EBox]               | Rosati et al.       | Presto & Prexto | NR-Datalog & UCQ
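Several of the systems above output a union of conjunctive queries (UCQ). To give a feel for what that means in the simplest case, the Python sketch below rewrites a single class atom into a union that also covers its subclasses. It only handles subclass axioms, which is far less than the listed systems support, and it is not the proposal presented in this talk; the axioms, class names and both functions are hypothetical.

```python
def subclasses(cls, subclass_of):
    """Return cls plus all its direct and indirect subclasses."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, parent in subclass_of:
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def rewrite_class_query(cls, subclass_of):
    """Rewrite '?x a cls' into a union of patterns, one per entailed class."""
    return [f"?x a {c}" for c in sorted(subclasses(cls, subclass_of))]


# Hypothetical TBox, given as (subclass, superclass) pairs.
AXIOMS = [
    (":Anemometer", ":WindSensor"),
    (":WindSensor", ":Sensor"),
    (":Thermometer", ":Sensor"),
]

union = rewrite_class_query(":Sensor", AXIOMS)
print(" UNION ".join(union))
```

Evaluating the union over the data returns instances of every subclass without materialising the inferred triples, at the cost of a larger query; the optimisations on the next slide aim at keeping that growth under control.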
  51. Optimizations in the rewriting
     • The rewriting can be optimized in several ways
       • Ontology preprocessing
       • Subsumption checks
       • Prioritize inferences
       • Constrain the searches
  52. Our proposal (José Mora)
  53. Conclusion and Future Work
     • We have proposed some small incremental improvements over the current state of the art in entailment-aware query rewriting
       • Need to integrate it with the rest of our work
       • This will happen during Fall 2012
  54. Final conclusions and future work
  55. Ingredients
     [Figure: the same layered architecture as on slide 3.]