Entity centric data_management_2013
Published in: Technology, Education
Transcript

  • 1. Entity-Centric Data Management. Philippe Cudré-Mauroux, eXascale Infolab, University of Fribourg, Switzerland. IBM TJ Watson Research Center, July 24, 2013
  • 2. eXascale Infolab • New lab @ U. of Fribourg, Switzerland • Financed by Swiss Federal State / private foundations / companies • Big (non-relational) data management (Volume, Velocity, Variety) (… mostly)
  • 3. Entities
  • 4. Information Management • The story so far: – Strict separation between unstructured and structured data management infrastructures [Figure: DBMS reached via JDBC / SQL on one side; inverted index reached via keywords / HTTP on the other]
  • 5. Information Integration • Information integration is still one of the biggest CS problems out there (according to many, e.g., Gartner) • Information integration typically requires some sort of mediation 1. Unstructured Data: keywords, synsets 2. Structured Data: global schema, transitive closure of schemas (mostly syntactic) ⇒ nightmarish if 1 and 2 are taken separately, a horror marathon if considered together
  • 6. Entities as Mediation • Rising paradigm – Store information at the entity granularity – Integrate information by inter-linking entities • Advantages? – Coarser granularity compared to keywords • More natural, e.g., brain functions similarly (or is it the other way around?) – Denormalized information compared to RDBMSs • Schema-later, heterogeneity, sparsity • Pre-computed joins, “Semantic” linking • Drawbacks?
  • 7. Example: Google’s Knowledge Graph
  • 8. Example (2): Web Data Integration http://data.semanticweb.org/person/philippe-cudre-mauroux/html
  • 9. Example (3): BBC Olympics
  • 10. Focusing on a few Core Problems 1. Extracting entities from text (ZenCrowd) 2. Searching for entities 3. Accessing entities (DNS^3) 4. Storing entities (Diplodocus[RDF])
  • 11. Extracting Entities • Extracting entities from text is an old problem… – … and is extremely hard, esp. for machines • Dozens of approaches have been suggested • What if – We want to combine approaches / frameworks? – We want to leverage both human computations & algorithms?
  • 12. ZenCrowd • Extracts entities from text using state-of-the-art techniques • Uses sets of algorithmic matchers to match entities to online concepts • Uses dynamic templating to create micro-matching tasks and publish them on MTurk • Combines both algorithmic and human matchers using probabilistic networks
  • 13. ZenCrowd Architecture [Figure: architecture diagram. Labeled components: HTML Pages (input), Entity Extractors, LOD Index (Get Entity), LOD Open Data Cloud, Algorithmic Matchers, Probabilistic Network, Decision Engine, Micro-Task Manager, Micro Matching Tasks, Crowdsourcing Platform, Workers' Decisions, HTML+RDFa Pages (output)]
  • 14. “Black-Magic” Component • Probabilistic network to integrate a priori & a posteriori information – Agreement of good turkers & algorithms • Learning process – Constraints • Unicity • Equality (SameAs) – Giant probabilistic graph • Instantiated selectively [Figure: probabilistic graph with worker nodes (w1, w2), link nodes (l1, l2, l3), their priors and factors, and unicity (u2-3) and SameAs (sa1-2) constraints]
  • 15. Does it Work? • Improves avg. prec. by 0.14 on average! – Minimal crowd involvement – Embarrassingly parallel problem [Figures: worker precision vs. number of tasks for US and IN workers, with the top US worker highlighted; precision vs. top K workers]
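The idea on slide 14 of combining a priori information (algorithmic matcher confidence) with a posteriori information (crowd votes) can be illustrated with a much simpler model than the probabilistic network the slide describes. The sketch below is an assumption-laden stand-in, not ZenCrowd's actual method: it treats each worker as independent, uses a single hypothetical `reliability` number per worker, and ignores the unicity and SameAs constraints entirely.

```python
def link_posterior(prior, votes):
    """Naive Bayes-style estimate that a candidate entity link is correct.

    prior: matcher confidence in (0, 1), used as the a priori probability.
    votes: list of (reliability, says_match) pairs, one per worker, where
           reliability is the assumed probability that the worker is right.
    """
    p_true = prior
    p_false = 1.0 - prior
    for reliability, says_match in votes:
        if says_match:
            # A "match" vote is likely if the link is correct.
            p_true *= reliability
            p_false *= 1.0 - reliability
        else:
            # A "no match" vote is likely if the link is wrong.
            p_true *= 1.0 - reliability
            p_false *= reliability
    total = p_true + p_false
    # Fall back to "undecided" if both hypotheses have zero weight.
    return p_true / total if total > 0 else 0.5


# One reliable worker confirming an uncertain link raises its probability.
print(link_posterior(0.5, [(0.8, True)]))
```

Two equally reliable workers who disagree cancel each other out in this model, leaving the prior unchanged.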
  • 16. Searching for Entities • How can end-users reach entities? ⇒ Keyword search • On their names or attributes – Obviously not ideal • BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 • Query extension, query completion or pseudo-relevance feedback yield comparable (or worse) results
  • 17. Hybrid Entity Search • Main idea: combine unstructured and structured search – Inverted index to locate first candidates – Graph queries to refine the results • Graph traversals (queries on object properties) • Graph neighborhoods (queries on data type properties) [Figure: example graph in which GeorgeClooney (name "George Clooney", dateOfBirth May 6, 1961) and ShaileneW (name "Shailene Woodley", dateOfBirth Nov. 15, 1991) are linked by playsIn to TheDescendants (title "The Descendants"); inverted index reached via keywords / HTTP, DBMS reached via SPARQL]
  • 18. Architecture [Figure: architecture diagram. Labeled components: LOD Cloud, index(), query(), User, Keyword Query, Query Annotation and Expansion, WordNet, 3rd party search engines, Pseudo-Relevance Feedback, Entity Search, Inverted Index, Structured Inverted Index, RDF Store, Ranking Functions, intermediate top-k results, Graph Traversals (queries on object properties), Neighborhoods (queries on datatype properties), Graph-Enriched Results, Final Ranking Function]
  • 19. Does it Work? • Up to 25% improvement over best IR (statistically significant) • Very modest impact on latency (17%) [Figure: precision vs. recall curves; legend: S1_1 BM25 baseline S2_2 SAMEAS]
  • 20. Entity Registries, i.e., Whom to Ask for Entities? • (Search Engine+) VoID + SPARQL end-point • DOA • DNS [DNS^3] • Same_as service / Okkam IDs • P2P Mesh of entities [idMesh] • (Downscaled) Entity registries?
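The two-stage idea on slide 17 (inverted index first, graph refinement second) can be sketched on the slide's own example graph. This is a toy illustration under stated assumptions: the triple list, the `OBJECT_PROPERTIES` set, and both function names are hypothetical, there is no ranking, and only one-hop traversals are followed.

```python
from collections import defaultdict

# Toy graph taken from the slide's example; the encoding is an assumption.
TRIPLES = [
    ("TheDescendants", "title", "The Descendants"),
    ("GeorgeClooney", "name", "George Clooney"),
    ("GeorgeClooney", "dateOfBirth", "May 6, 1961"),
    ("GeorgeClooney", "playsIn", "TheDescendants"),
    ("ShaileneW", "name", "Shailene Woodley"),
    ("ShaileneW", "dateOfBirth", "Nov. 15, 1991"),
    ("ShaileneW", "playsIn", "TheDescendants"),
]
OBJECT_PROPERTIES = {"playsIn"}  # properties whose objects are entities


def build_inverted_index(triples):
    """Map each lowercase token of a literal value to the entities carrying it."""
    index = defaultdict(set)
    for subject, prop, obj in triples:
        if prop not in OBJECT_PROPERTIES:
            for token in obj.lower().split():
                index[token].add(subject)
    return index


def hybrid_search(query, triples, index):
    """Return (seed entities from keywords, seeds plus one-hop graph neighbors)."""
    seeds = set()
    for token in query.lower().split():
        seeds |= index.get(token, set())
    results = set(seeds)
    for subject, prop, obj in triples:
        if prop in OBJECT_PROPERTIES:
            if subject in seeds:
                results.add(obj)      # follow the edge forward
            if obj in seeds:
                results.add(subject)  # follow the edge backward
    return seeds, results


index = build_inverted_index(TRIPLES)
seeds, results = hybrid_search("descendants", TRIPLES, index)
print(sorted(seeds), sorted(results))
```

The keyword lookup only finds the film, while the graph step also surfaces the two actors linked to it, which is the kind of enrichment the slide attributes to graph traversals.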
  • 21. How to Store Entities? • Fundamental impedance mismatch between graphs of entities and… – N-ary / decomposition storage model – Inverted Indices – Key-value paradigms
  • 22. Entity Storage • Materialize the joins! • Dense-pack the values • Provide new indices • Co-locate • Co-locate • Co-locate
  • 23. DipLODocus: System Architecture
  • 24. Main Idea - data structures
  • 25. Template Matching (Bottom-up ≠ schema)
  • 26. Molecule Clusters • extremely compact sub-graphs • precomputed joins
  • 27. List of Literals • extremely compact list of sorted values
  • 28. Basic operations - queries - triple patterns. Query 1: ?x type Student. ?x takesCourse Course0. Query 2: ?x type Student. ?x takesCourse Course0. ?x takesCourse Course1. ⇒ intersection of sorted lists
  • 29. Basic operations - queries - aggregates and analytics: ?x type Student. ?x age ?y filter (?y < 21)
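The "intersection of sorted lists" mentioned on slide 28 can be sketched as a standard two-pointer merge. The integer subject IDs and the three example lists below are made up for illustration; how the system actually lays out and encodes these lists is not stated in the transcript.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of subject IDs in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical sorted lists of subjects matching each triple pattern.
students = [1, 2, 3, 5, 8]       # ?x type Student
takes_course0 = [2, 3, 8, 9]     # ?x takesCourse Course0
takes_course1 = [3, 4, 8]        # ?x takesCourse Course1

step = intersect_sorted(students, takes_course0)
print(intersect_sorted(step, takes_course1))
```

Each additional triple pattern on the same variable adds one more intersection, so a multi-pattern query reduces to a chain of linear merges.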
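A filter such as `?y < 21` is cheap when the literal values are kept sorted, as the "List of Literals" slide suggests. The sketch below assumes a hypothetical list of `(age, subject)` pairs sorted by age and uses a binary search to find the cut-off; the real on-disk layout is not described in the transcript.

```python
from bisect import bisect_left

# Hypothetical sorted list of (age, subject) pairs for the "age" property.
AGES = [(18, "s3"), (19, "s1"), (21, "s4"), (25, "s2")]


def subjects_below(sorted_pairs, bound):
    """Return subjects whose value is strictly below bound, via binary search."""
    # (bound,) sorts before every (bound, subject) pair, so the cut index
    # is the position of the first pair whose value is >= bound.
    cut = bisect_left(sorted_pairs, (bound,))
    return [subject for _, subject in sorted_pairs[:cut]]


print(subjects_below(AGES, 21))
```

The result would still need to be intersected with the list of subjects of type Student to answer the full query on the slide.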
  • 30. Basic operations - queries - molecule queries ?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.
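The molecule query on slide 30 asks for everything within two hops of one entity, which is where co-locating an entity with its precomputed joins pays off. The following is a rough sketch under loud assumptions: the triples, the nested-list "molecule" layout, and both function names are invented for illustration and do not reflect the system's actual data structures.

```python
from collections import defaultdict

# Hypothetical triples; "s1" is the entity whose name is 'Student1'.
TRIPLES = [
    ("s1", "name", "Student1"),
    ("s1", "takesCourse", "Course0"),
    ("s1", "advisor", "Prof0"),
    ("Course0", "taughtBy", "Prof0"),
    ("Prof0", "name", "Professor0"),
]


def build_molecules(triples):
    """Co-locate each root entity with its edges and their second-hop edges."""
    by_subject = defaultdict(list)
    for subject, prop, obj in triples:
        by_subject[subject].append((prop, obj))
    molecules = {}
    for root, edges in by_subject.items():
        # Materialize the join once: copy each neighbor's edges next to the root.
        molecules[root] = [(p, o, list(by_subject.get(o, []))) for p, o in edges]
    return molecules


def molecule_query(molecules, name):
    """Answer: ?a name <name>. ?a ?b ?c. ?c ?d ?e. without any further joins."""
    rows = []
    for a, edges in molecules.items():
        if any(p == "name" and o == name for p, o, _ in edges):
            for b, c, second_hop in edges:
                for d, e in second_hop:
                    rows.append((a, b, c, d, e))
    return rows


for row in molecule_query(build_molecules(TRIPLES), "Student1"):
    print(row)
```

Once the root is located, the whole answer is read from a single molecule, which is the design choice behind "precomputed joins" on the Molecule Clusters slide.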
  • 31. Results - LUBM - 100 Universities: 35x faster for OLTP, 300x faster for OLAP
  • 32. References • ZenCrowd [WWW 12, WWW 13, VLDB J. 13] • idMesh [WWW 09] • Hybrid Entity Search [SIGIR 12] • Downscaling Entity Registries [Downscale 12] • Diplodocus[RDF] [ISWC 11] • Graph Data Management for New Application Domains [VLDB 11]
  • 33. http://exascale.info IBM TJ Watson Research Center July 24, 2013
