
Session 1.5 supporting virtual integration of linked data with just-in-time query recompilation

Talk at SEMANTiCS 2017
www.semantics.cc


  1. 1. Supporting virtual integration of Linked Data with just-in-time query recompilation. Amsterdam, The Netherlands, September 12 2017. Alessandro Adamou (1), Mathieu d’Aquin (2), Carlo Allocca (1, 3), Enrico Motta (1). (1) Knowledge Media Institute, The Open University, UK; (2) Insight Centre for Data Analytics, NUI Galway, Ireland; (3) now Samsung Inc.
  2. 2. Outline •  Motivation •  Just-in-time query recompilation •  Implementation •  Experiments •  Perspectives
  3. 3. Virtual data integration •  No ETL process •  Naturally keeps data up-to-date •  Unlike data federation, there is still a designated node •  Favours project economies relying on networking rather than storage space •  Serious performance issues! •  Generally considered less robust •  Acquiring momentum in industry, per 2016 Gartner report •  Maintenance??
  4. 4. Pay-as-you-go integration 1.  Establish mappings between source schemas and global schema (go) 2.  Obtain feedback on mapping results, e.g. in terms of precision and recall (pay) 3.  Refine the mappings from step 1. Really “pays off” in virtual data integration. N.W. Paton, K. Christodoulou, A.A.A. Fernandes, B. Parsia, C. Hedeler: Pay-as-you-go data integration for linked data: opportunities, challenges and architectures. SWIM 2012: 3
  5. 5. Outline •  Motivation •  Just-in-time query recompilation •  Implementation •  Experiments •  Perspectives
  6. 6. •  Target query language other than SPARQL – Conjunctive, star-shaped queries (good for Web APIs) {hostname}{/attribute/value}+?{attribute}{&attribute}+ e.g. http://example.org/api/type/actor/name/clint_eastwood?filmwork
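The URL-like pattern above can be read as a list of attribute/value pairs (the path) followed by the attributes being asked for (the query string). The following is a minimal sketch of that reading, not the deck's implementation: `parseSourceQuery` is a hypothetical helper, and it assumes the hostname and any API prefix have already been stripped from the input.

```javascript
// Hypothetical helper (not from the slides): splits a source query such as
// "type/actor/name/clint_eastwood?filmwork" into its given attribute/value
// pairs and its requested attributes.
function parseSourceQuery(query) {
  // Everything before "?" is the path; everything after is the request list.
  const [path, wanted = ''] = query.split('?');
  const segments = path.split('/').filter(s => s.length > 0);
  if (segments.length % 2 !== 0) {
    throw new Error('Path must consist of attribute/value pairs: ' + path);
  }
  const pairs = {};
  for (let i = 0; i < segments.length; i += 2) {
    pairs[segments[i]] = decodeURIComponent(segments[i + 1]);
  }
  // Requested attributes are separated by "&".
  const requested = wanted.split('&').filter(s => s.length > 0);
  return { pairs, requested };
}

const parsed = parseSourceQuery('type/actor/name/clint_eastwood?filmwork');
console.log(parsed.pairs);     // { type: 'actor', name: 'clint_eastwood' }
console.log(parsed.requested); // [ 'filmwork' ]
```

The pairs correspond to the conjunctive conditions of the star-shaped query, and the requested list to the variables the client wants back.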
  7. 7. type/actor/name/clint_eastwood SELECT DISTINCT ?filmwork WHERE {
 { dbr:Clint_Eastwood a dbo:Actor ; ^(dbo:director|dbo:starring) ?filmwork } UNION { { ?x owl:sameAs dbr:Clint_Eastwood } UNION { ?x (movie:actor_name|movie:director_name) ?s FILTER (str(?s) = "Clint Eastwood")
 } . ?x foaf:made|^(movie:director|movie:actor) ?filmwork }} Equivalent SPARQL query for federated engine that supports DBpedia and LinkedMDB
  10. 10. type/actor/name/clint_eastwood SELECT DISTINCT ?filmwork ?eq WHERE { VALUES(?x) { ( dbr:Clint_Eastwood ) ( dbr:Clint_Eastwood_(actor) ) } ?x a dbo:Actor ; ^(dbo:director|dbo:starring) ?filmwork . ?filmwork owl:sameAs|^owl:sameAs ?eq } Alternative approach: DBpedia
  11. 11. type/actor/name/clint_eastwood SELECT DISTINCT ?filmwork ?eq WHERE { { VALUES(?x0) { ( dbr:Clint_Eastwood ) ( dbr:Clint_Eastwood_(actor) ) } ?x0 ^owl:sameAs ?x } UNION { ?x movie:actor_name "Clint Eastwood" } UNION { ?x movie:director_name "Clint Eastwood" } . { { ?x foaf:made ?filmwork } UNION { ?filmwork movie:director ?x } UNION { ?filmwork movie:actor ?x }
 } . ?filmwork owl:sameAs|^owl:sameAs ?eq } Alternative approach: LinkedMDB
  12. 12. •  Encode integrator’s knowledge of a dataset schema into a set of primitives, which will serve as “compilation units”. •  Managing the compilation units of a query using two types of structure: – Microcompilers – Query skeletons (or templates)
  13. 13. Microcompiler Let W be the set of all the attribute-value pairs and Σ the alphabet of a language; a microcompiler is a function φ : ℘W → Σ∗ that transforms sets of attribute-value pairs into a sequence of symbols in that language.
  14. 14. Microcompiler (JS ex.) var mc_x_dbp = function (type, name) { var pref = 'http://dbpedia.org/resource/'; /* capitalise the first letter of each underscore-separated word */ var idd = name.replace(/(^|_)[a-z]/g, function (f) { return f.toUpperCase(); }); return 'VALUES(?x_dbp){ ' + '( <' + pref + idd + '> )' + '( <' + pref + idd + '_(' + type + ')> )' + '} ?x_dbp'; }; type/actor/name/clint_eastwood VALUES(?x_dbp) { ( <http://dbpedia.org/resource/Clint_Eastwood> ) ( <http://dbpedia.org/resource/Clint_Eastwood_(actor)> ) } ?x_dbp
  15. 15. Microcompiler (JS ex. II) var mc_x_lmdb = function (type, name) { if (['actor', 'director'].indexOf(type) >= 0) { var sa = mc_x_dbp(type, name) + ' ^owl:sameAs ?x_lmdb'; var nam = makename(name); /* omitted for simplicity */ return '{ ' + sa + ' } UNION ' + '{ ?x_lmdb movie:actor_name "' + nam + '" } UNION ' + '{ ?x_lmdb movie:director_name "' + nam + '" }'; /* … */ } }; type/actor/name/clint_eastwood { VALUES(?x_dbp) { ( <http://dbpedia.org/resource/Clint_Eastwood> ) ( <http://dbpedia.org/resource/Clint_Eastwood_(actor)> ) } ?x_dbp ^owl:sameAs ?x_lmdb } UNION { ?x_lmdb movie:actor_name "Clint Eastwood" } UNION { ?x_lmdb movie:director_name "Clint Eastwood" }
  16. 16. Query skeleton A query skeleton, or query template, t is a member of (Σ∪C)∗, where C is an alphabet called the set of control symbols. Example I (DBpedia, {filmwork, name}): <[name]> ^(dbo:director|dbo:starring) ?[filmwork]? Example II (LinkedMDB, {filmwork, name}): { { <[name]> foaf:made ?[filmwork]? } UNION { ?[filmwork]? movie:director ?x_lmdb } UNION { ?[filmwork]? movie:actor ?x_lmdb } } . ?[filmwork]? owl:sameAs|^owl:sameAs ?eq
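The slides do not spell out how control symbols are resolved, so the following is only a sketch under stated assumptions: `?[attr]?` is taken to mark a requested attribute that becomes the SPARQL variable `?attr`, and `[attr]` is taken to be a slot filled with a string supplied for that attribute (for instance by a microcompiler). `fillSkeleton` and its `bindings` argument are hypothetical names.

```javascript
// Hypothetical sketch of skeleton instantiation. The meaning given to the
// control symbols here is an assumption, not taken from the slides.
function fillSkeleton(skeleton, bindings) {
  // Requested attributes: "?[attr]?" becomes the variable "?attr".
  const withVars = skeleton.replace(/\?\[(\w+)\]\?/g, (_, attr) => '?' + attr);
  // Given attributes: "[attr]" becomes the bound string; unbound slots fail.
  return withVars.replace(/\[(\w+)\]/g, (_, attr) => {
    if (!Object.prototype.hasOwnProperty.call(bindings, attr)) {
      throw new Error('Unbound attribute: ' + attr);
    }
    return bindings[attr];
  });
}

const skeleton = '<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?';
const filled = fillSkeleton(skeleton, {
  name: 'http://dbpedia.org/resource/Clint_Eastwood'
});
console.log(filled);
// <http://dbpedia.org/resource/Clint_Eastwood> ^(dbo:director|dbo:starring) ?filmwork
```

Throwing on an unbound slot gives a simple test for whether a skeleton can be completed with the attributes a given source query provides.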
  17. 17. JIT framework [Diagram: a source query enters the compiler, which uses a data source selection strategy and several sets of microcompilers and query skeletons, and emits one or more target queries.] Compiler: function Φ × ℘(Σ∪C)∗ × ℘W → L
  18. 18. Compilation strategies •  A manifest is a pair of sets of microcompilers and query skeletons •  Grouping into manifests for: – (a) data sources; – (b) entity types •  Data source selection algorithm produces a set of datasource-query pairs by finding satisfiable query skeletons (on paper)
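As a rough illustration of data source selection, the sketch below groups manifests by data source and treats a skeleton as satisfiable when every attribute it declares appears in the source query, either as a given pair or as a requested attribute. That satisfiability test, the manifest layout, the skeleton ids and `selectSources` itself are all assumptions made for the example; the algorithm in the paper may differ.

```javascript
// Hypothetical manifests grouped by data source. Each manifest pairs a set of
// microcompiler names with a set of query skeletons; ids are illustrative.
const manifests = {
  dbpedia: {
    microcompilers: ['mc_x_dbp'],
    skeletons: [
      { id: 'dbp-filmwork-by-name', attributes: ['filmwork', 'name'] },
      { id: 'dbp-party-by-country', attributes: ['party', 'country'] }
    ]
  },
  linkedmdb: {
    microcompilers: ['mc_x_lmdb'],
    skeletons: [
      { id: 'lmdb-filmwork-by-name', attributes: ['filmwork', 'name'] }
    ]
  }
};

// Returns datasource/skeleton pairs whose skeleton is satisfiable, here
// assumed to mean: all of its attributes occur in the source query.
function selectSources(allManifests, query) {
  const available = new Set([...Object.keys(query.pairs), ...query.requested]);
  const selected = [];
  for (const [source, manifest] of Object.entries(allManifests)) {
    for (const skeleton of manifest.skeletons) {
      if (skeleton.attributes.every(a => available.has(a))) {
        selected.push({ source, skeleton });
      }
    }
  }
  return selected;
}

// type/actor/name/clint_eastwood?filmwork
const query = { pairs: { type: 'actor', name: 'clint_eastwood' }, requested: ['filmwork'] };
for (const { source, skeleton } of selectSources(manifests, query)) {
  console.log(source + ' -> ' + skeleton.id);
}
```

Each selected pair would then be compiled into one target query for its data source.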
  19. 19. Outline •  Motivation •  Just-in-time query recompilation •  Implementation •  Experiments •  Perspectives
  20. 20. M. d'Aquin, A. Adamou, E. Daga, S. Liu, K. Thomas, E. Motta: Dealing with Diversity in a Smart-City Datahub. Semantics for Smarter Cities @ISWC 2014: 68-82. Big Data for Milton Keynes as a Smart City. Entity-centric data API based on a simplified version of the language in this presentation •  https://datahub.mksmart.org •  https://github.com/mk-smart/entity-centric-api
  21. 21. Implementation •  Reference open source implementation written in Java –  With support for SPARQL and HTTP dereferencing of RDF –  Includes JIT logic, custom experimental VDIS and HTTP API •  Accepts microcompilers in JavaScript •  Apache CouchDB map-reduce for atomically retrieving candidate compilation units
  22. 22. Outline •  Motivation •  Just-in-time query recompilation •  Implementation •  Experiments •  Perspectives
  23. 23. Experiments What is the price paid to turn a federated query engine into a virtual data integration system using JIT recompilation?
  24. 24. Experiments 1.  Benchmark of FedBench (1) queries translated into our target language 2.  Take a federated query engine (FedX) (2) 3.  Measure the time taken by FedX to execute the original FedBench SPARQL query –  On the live endpoints whenever possible 4.  Take each translated query and recompile it into one or more SPARQL queries (at most one per data source) –  Execute each query with FedX 5.  Measure for each: –  Increase in size of “correct” result set –  Recompilation overhead –  Overall turnaround time of queries (1) http://fedbench.fluidops.net (2) https://www.fluidops.com/en/company/knowledge/open_source
  25. 25. Experiments Example: FedBench Cross-domain CD3 CD3 (original) SELECT ?pres ?party ?page WHERE { ?pres rdf:type dbpedia-owl:President . ?pres dbpedia-owl:nationality dbpedia:United_States . ?pres dbpedia-owl:party ?party . ?x nytimes:topicPage ?page . ?x owl:sameAs ?pres } CD3C: type/president/country/united_states?party&webpage
  26. 26. Results I

      | Query | Result set VDI boost | Notes |
      |---|---|---|
      | FedBench Cross-Domain | | |
      | CD1C | m * 1.387 | |
      | CD2C | 52 new results | Plain FedX yielded no results |
      | CD3C | 67 new results | Plain FedX yielded no results, has SERVICE clause |
      | CD4C | m * 4480.0 | Some microcompilers perform queries |
      | CD5C | m * 1.0 | No increment from recompilation |
      | FedBench Life Sciences | | |
      | LS1C | m * 1.0 | Query could not be expanded |
      | LS2C | m * 1.0 | No increment from recompilation |
      | LS3C | 70981 results | Plain FedX crashed |
      | FedBench Linked Data | | |
      | LD5C | m * 3.677 | |
      | LD9C | 4 new results | Plain FedX yielded no results |
      | LD10C | m * 17.0 | |
      | LD11C | m * 1.65 | |
  27. 27. Results II

      | Query | Time (ms), FedX | Time (ms), FedX+JIT | JIT overhead | Query TAT |
      |---|---|---|---|---|
      | FedBench Cross-Domain | | | | |
      | CD1C | 300 ± 050 | 420 ± 109 | 400 ± 020 | 800 ± 120 |
      | CD2C | 175 ± 005 | 475 ± 055 | 432 ± 009 | 1500 ± 123 |
      | CD3C | 158 ± 004 | 446 ± 076 | 408 ± 106 | 1067 ± 048 |
      | CD4C | 8835 ± 954 | 420 ± 100 | 787 ± 165 | 7480 ± 569 |
      | CD5C | 851 ± 319 | 519 ± 145 | 448 ± 031 | 548 ± 061 |
      | FedBench Life Sciences | | | | |
      | LS1C | 795 ± 371 | 892 ± 043 | query could not be expanded | |
      | LS2C | 484 ± 166 | 420 ± 100 | 444 ± 061 | 370 ± 061 |
      | LS3C | !ERROR | 6653 ± 861 | query could not be expanded | |
      | FedBench Linked Data | | | | |
      | LD5C | 795 ± 371 | 801 ± 078 | 486 ± 017 | 1028 ± 099 |
      | LD9C | 484 ± 166 | 407 ± 023 | 390 ± 039 | 318 ± 061 |
      | LD10C | 189 ± 036 | 440 ± 018 | 416 ± 017 | 658 ± 101 |
      | LD11C | 387 ± 067 | 861 ± 057 | 406 ± 020 | 762 ± 095 |
  28. 28. Outline •  Motivation •  Just-in-time query recompilation •  Implementation •  Experiments •  Perspectives
  29. 29. Discussion •  Can compile star-shaped input queries into more complex target queries •  Overhead is mostly a fixed cost •  Proves to be mostly efficient when also effective (i.e. there is query expansion) •  Still cannot replace query federation optimisation strategies •  Manageability? We knew exactly how to proceed… –  However, we worked with ~ |A| · |MS| + |MT| microcompilers and query skeletons, where it could have been up to |MT + A| · |MS| + |MT|
  30. 30. Future work •  Optimisations to abate JIT overhead •  Application to chain-shaped queries and other query types •  Investigate other target languages •  Investigate templating languages for query skeletons •  Cascaded mappings applied at query time (no knowledge of dataset content or structure)
  31. 31. Thank You. Amsterdam, The Netherlands, September 12 2017. Alessandro Adamou (1), Mathieu d’Aquin (2), Carlo Allocca (1, 3), Enrico Motta (1). (1) Knowledge Media Institute, The Open University, UK; (2) Insight Centre for Data Analytics, NUI Galway, Ireland; (3) now Samsung Inc.
