Entity centric data_management_2013

Published in: Technology, Education

  1. Entity-Centric Data Management. Philippe Cudré-Mauroux, eXascale Infolab, University of Fribourg, Switzerland. IBM TJ Watson Research Center, July 24, 2013
  2. eXascale Infolab • New lab @ U. of Fribourg, Switzerland • Financed by Swiss Federal State / private foundations / companies • Big (non-relational) data management (Volume, Velocity, Variety) (… mostly)
  3. Entities
  4. Information Management • The story so far: – Strict separation between unstructured and structured data management infrastructures (Diagram labels: DBMS, JDBC, SQL on the structured side; Inverted Index, Keywords, HTTP on the unstructured side)
  5. Information Integration • Information integration is still one of the biggest CS problems out there (according to many, e.g., Gartner) • Information integration typically requires some sort of mediation 1. Unstructured data: keywords, synsets 2. Structured data: global schema, transitive closure of schemas (mostly syntactic) ⇒ nightmarish if 1 and 2 are taken separately, horror marathon if considered together
  6. Entities as Mediation • Rising paradigm – Store information at the entity granularity – Integrate information by inter-linking entities • Advantages? – Coarser granularity compared to keywords • More natural, e.g., brain functions similarly (or is it the other way around?) – Denormalized information compared to RDBMSs • Schema-later, heterogeneity, sparsity • Pre-computed joins, “Semantic” linking • Drawbacks?
  7. Example: Google’s Knowledge Graph
  8. Example (2): Web Data Integration http://data.semanticweb.org/person/philippe-cudre-mauroux/html
  9. Example (3): BBC Olympics
  10. Focusing on a few Core Problems 1. Extracting entities from text (ZenCrowd) 2. Searching for entities 3. Accessing entities (DNS^3) 4. Storing entities (Diplodocus[RDF])
  11. Extracting Entities • Extracting entities from text is an old problem… – … and is extremely hard, esp. for machines • Dozens of approaches have been suggested • What if – We want to combine approaches / frameworks? – We want to leverage both human computations & algorithms?
  12. ZenCrowd • Extracts entities from text using state-of-the-art techniques • Uses sets of algorithmic matchers to match entities to online concepts • Uses dynamic templating to create micro-matching tasks and publish them on MTurk • Combines both algorithmic and human matchers using probabilistic networks
  13. ZenCrowd Architecture (Diagram components: input HTML pages; entity extractors; LOD index; LOD Open Data Cloud; algorithmic matchers; micro-task manager; micro matching tasks; crowdsourcing platform; workers' decisions; probabilistic network; decision engine; output HTML+RDFa pages)
  14. “Black-Magic” Component • Probabilistic network to integrate a priori & a posteriori information – Agreement of good turkers & algorithms • Learning process – Constraints • Unicity • Equality (SameAs) – Giant probabilistic graph • Instantiated selectively (Figure: example probabilistic graph with variables w1, w2, l1, l2, l3, c11, c12, c13, c21, c22 and factors pw, lf, pl, u2-3, sa1-2)
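
The idea of combining a priori matcher scores with worker answers can be illustrated with a much simpler model than the probabilistic network on the slide. The sketch below is not ZenCrowd's factor graph: it assumes independent workers with a known reliability, applies Bayes' rule per candidate link, and then applies a unicity-style rule (at most one link per entity). The names `link_posterior` and `pick_link`, and all numbers, are hypothetical.

```python
from math import prod

def link_posterior(prior, votes):
    """Posterior probability that a candidate link is correct.

    prior: a priori probability from the algorithmic matchers (assumed input).
    votes: list of (reliability, says_correct) pairs, one per worker, where
           reliability is the assumed probability that the worker is right.
    """
    # Likelihood of the observed votes if the link is correct / incorrect,
    # under the simplifying assumption that workers answer independently.
    like_true = prod(r if v else 1.0 - r for r, v in votes)
    like_false = prod(1.0 - r if v else r for r, v in votes)
    num = prior * like_true
    den = num + (1.0 - prior) * like_false
    return num / den if den > 0 else prior

def pick_link(candidates, threshold=0.5):
    """Unicity-style rule: keep at most one link per entity.

    candidates: list of (link_uri, posterior) pairs for a single entity.
    Returns the most probable link, or None if none reaches the threshold.
    """
    best = max(candidates, key=lambda c: c[1], default=None)
    if best is None or best[1] < threshold:
        return None
    return best[0]
```

With no votes the posterior falls back to the matcher's prior, which is one way to read "minimal crowd involvement": only uncertain candidates need to be sent to workers.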
  15. Does it Work? • Improves avg. prec. by 0.14 on average! – Minimal crowd involvement – Embarrassingly parallel problem (Figures: worker precision vs. number of tasks for US workers and IN workers, with the top US worker marked; precision vs. top K workers)
  16. Searching for Entities • How can end-users reach entities? ⇒ Keyword search • On their names or attributes – Obviously not ideal • BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 • Query extension, query completion or pseudo-relevance feedback yield comparable (or worse) results
  17. Hybrid Entity Search • Main idea: combine unstructured and structured search – Inverted index to locate first candidates – Graph queries to refine the results • Graph traversals (queries on object properties) • Graph neighborhoods (queries on data type properties) (Figure: example graph in which TheDescendants has title "The Descendants", and GeorgeClooney (name "George Clooney", dateOfBirth May 6, 1961) and ShaileneW (name "Shailene Woodley", dateOfBirth Nov. 15, 1991) each have a playsIn edge; diagram labels: Inverted Index, Keywords, HTTP, DBMS, SPARQL)
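
A minimal sketch of the two-step idea on this slide, using the slide's own example entities as toy data. It assumes a whitespace-tokenized inverted index over literal values and a one-hop traversal over object properties; the function names are hypothetical, and the ranking functions of the real system are left out entirely.

```python
from collections import defaultdict

# Toy triples based on the slide's example graph.
TRIPLES = [
    ("TheDescendants", "title", "The Descendants"),
    ("GeorgeClooney", "name", "George Clooney"),
    ("GeorgeClooney", "dateOfBirth", "May 6, 1961"),
    ("GeorgeClooney", "playsIn", "TheDescendants"),
    ("ShaileneW", "name", "Shailene Woodley"),
    ("ShaileneW", "dateOfBirth", "Nov. 15, 1991"),
    ("ShaileneW", "playsIn", "TheDescendants"),
]
ENTITIES = {s for s, _, _ in TRIPLES}

def build_inverted_index(triples):
    """Index the tokens of literal values (datatype properties) by subject."""
    index = defaultdict(set)
    for s, _, o in triples:
        if o not in ENTITIES:  # the object is a literal, not an entity
            for token in o.lower().split():
                index[token].add(s)
    return index

def keyword_candidates(index, query):
    """Step 1: locate first candidates through the inverted index."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return hits

def expand_by_traversal(triples, seeds):
    """Step 2: follow object properties one hop in either direction."""
    expanded = set(seeds)
    for s, _, o in triples:
        if o in ENTITIES:
            if s in seeds:
                expanded.add(o)
            if o in seeds:
                expanded.add(s)
    return expanded

def hybrid_search(query):
    index = build_inverted_index(TRIPLES)
    seeds = keyword_candidates(index, query)
    return sorted(expand_by_traversal(TRIPLES, seeds))
```

Searching for `"descendants"` finds the film through the index and then reaches both actors through `playsIn`, which a keyword index alone would miss.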
  18. Architecture (Diagram components: LOD Cloud; index(); User; Query Annotation and Expansion; Inverted Index; RDF Store; Ranking Functions; query(); Entity Search; Keyword Query; intermediate top-k results; Graph-Enriched Results; Graph Traversals (queries on object properties); Neighborhoods (queries on datatype properties); Structured Inverted Index; WordNet; 3rd party search engines; Final Ranking Function; Pseudo-Relevance Feedback)
  19. Does it Work? • Up to 25% improvement over best IR (statistically significant) • Very modest impact on latency (17%) (Figure: precision-recall curves for the S1_1 BM25 baseline and S2_2 SAMEAS)
  20. Entity Registries, i.e., Whom to Ask for Entities? • (Search Engine+) VoID + SPARQL end-point • DOA • DNS [DNS^3] • Same_as service / Okkam IDs • P2P Mesh of entities [idMesh] • (Downscaled) Entity registries?
  21. How to Store Entities? • Fundamental impedance mismatch between graphs of entities and… – N-ary / decomposition storage model – Inverted Indices – Key-value paradigms
  22. Entity Storage • Materialize the joins! • Dense-pack the values • Provide new indices • Co-locate • Co-locate • Co-locate
  23. DipLODocus: System Architecture
  24. Main Idea - data structures
  25. Template Matching (Bottom-up ≠ schema)
  26. Molecule Clusters • extremely compact sub-graphs • precomputed joins
  27. List of Literals • extremely compact list of sorted values
  28. Basic operations - queries - triple patterns • Query 1: ?x type Student. ?x takesCourse Course0. • Query 2: ?x type Student. ?x takesCourse Course0. ?x takesCourse Course1. ⇒ intersection of sorted lists
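
The "intersection of sorted lists" step can be sketched as a linear merge. The example assumes each triple pattern has already been resolved to a sorted list of subject IDs; the IDs and the helper names (`intersect_sorted`, `answer`) are made up for illustration and are not taken from the system itself.

```python
from functools import reduce

def intersect_sorted(a, b):
    """Intersect two ascending lists of IDs in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def answer(pattern_lists):
    """Bindings of ?x satisfying every pattern; shortest lists are merged first."""
    if not pattern_lists:
        return []
    return reduce(intersect_sorted, sorted(pattern_lists, key=len))

# Hypothetical sorted subject-ID lists, one per triple pattern.
STUDENTS = [1, 2, 3, 5, 8]      # ?x type Student
TAKES_COURSE0 = [2, 3, 8, 9]    # ?x takesCourse Course0
TAKES_COURSE1 = [3, 4, 8]       # ?x takesCourse Course1

two_patterns = answer([STUDENTS, TAKES_COURSE0])
three_patterns = answer([STUDENTS, TAKES_COURSE0, TAKES_COURSE1])
```

Starting with the shortest list keeps intermediate results small, a common choice for this kind of conjunctive lookup.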
  29. Basic operations - queries - aggregates and analytics • ?x type Student. ?x age ?y filter (?y < 21)
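
A filter such as `?y < 21` maps naturally onto a sorted list of values: a binary search finds the cut-off, and everything before it qualifies. The sketch below assumes a hypothetical in-memory list of `(age, student_id)` pairs kept in ascending order; it illustrates the idea only and does not reproduce the system's actual layout.

```python
from bisect import bisect_left

# Hypothetical sorted list of (age, student_id) pairs.
AGES = [(18, 7), (19, 2), (20, 5), (21, 1), (23, 3)]

def students_younger_than(ages, limit):
    """IDs of students whose age is strictly below limit."""
    # (limit,) sorts before every (limit, id) pair, so the cut lands
    # on the first entry whose age is >= limit.
    cut = bisect_left(ages, (limit,))
    return [sid for _, sid in ages[:cut]]

def average_age_below(ages, limit):
    """Aggregate over the same prefix: mean age of the matching students."""
    cut = bisect_left(ages, (limit,))
    if cut == 0:
        return None
    return sum(age for age, _ in ages[:cut]) / cut
```

Because the matching values form a contiguous prefix, counting or averaging them needs one binary search plus a scan of that prefix.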
  30. Basic operations - queries - molecule queries • ?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.
  31. Results - LUBM - 100 Universities • 35x faster for OLTP • 300x faster for OLAP
  32. References • ZenCrowd [WWW 12, WWW 13, VLDB J. 13] • idMesh [WWW 09] • Hybrid Entity Search [SIGIR 12] • Downscaling Entity Registries [Downscale 12] • Diplodocus[RDF] [ISWC 11] • Graph Data Management for New Application Domains [VLDB 11]
  33. http://exascale.info IBM TJ Watson Research Center, July 24, 2013
