Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Entity-Centric Data Management
Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg
Switzerland
ICDAR 2015
Augu...
eXascale Infolab (XI)
•  New lab @ DIUF, U. of Fribourg–Switzerland
•  Big Data infrastructures for social / semantic / sc...
Entities
Volume
■  amount of data
Velocity
■  speed of data in and out
Variety
■  range of data types and sources
[Gartner 2012] "B...
Information Management
•  The story so far:
–  Strict separation between unstructured and structured
data management infra...
The 3rd V: Information Integration
•  Information integration is still one of the biggest
CS problem out there (according ...
Entities as Mediation
•  Rising paradigm
–  Store information at the entity granularity
–  Integrate information by inter-...
Entity-Centric Data Management
Higher-level apps
•  Google Knowledge Graph
9
Example (1): Google’s Knowledge Graph
Example (2): BBC Olympics
Example (3): Web Data Integration
http://data.semanticweb.org/person/philippe-cudre-mauroux/html
Exposing Textual Data
•  The XI Pipeline
–  From text to (contextualized) entities
•  Runs on massive amounts of data (Spa...
Named Entity Recognition (NER)
Text
extraction
(Apache Tika)
List of
extracted
n-grams
n-gram
Indexing
foreach
Candidate
S...
Entity Linking
•  Linking entities to text is an old problem…
–  … and is extremely hard, esp. for machines
•  Dozens of a...
ZenCrowd
•  Three-stage blocking approach for integrating textual
data & entities
1.  Cheap inverted index selection of ca...
ZenCrowd Architecture
Micro
Matching
Tasks
HTML
Pages
HTML+ RDFa
Pages
LOD Open Data Cloud
Crowdsourcing
Platform
ZenCrowd...
Probabilistic Inference
•  Probabilistic network to integrate a priori & a
posteriori information
–  Agreement of good tur...
Does it Work?
•  Improves avg. prec. by 0.14 on average!
–  Minimal crowd involvement
–  Embarrassingly parallel problem
T...
Entity Typing
•  Entities can have many types (facets)
•  Which fine-grained types are most relevant given the context?
Th...
TRank
•  Fine-grained Typing
•  Tree of 447’260 types
•  Rooted on <owl:Thing>
•  Depth of 19
•  Ranks relevant types by a...
Enterprise search
Co-reference resolution
Literature browsing / recommendation
Some Applications
1⃣
2⃣
3⃣
Enterprise Search
•  Holy grail for many large companies
•  How can end-users reach entities?
⇒ Structured search
⇒ Keywor...
Hybrid Entity Search
The Descendants
TheDescendants
type
title
GeorgeClooney
George Clooney
name
May 6, 1961
dateOfBirth
t...
Architecture
LOD Cloud
index()
User
Query Annotation
and Expansion
Inverted Index
RDF
Store
Ranking
FunctionsRanking
Funct...
Does it Work?
•  Up to 25% improvement over best IR (stat. sign.)
•  Very modest impact on latency (17%)
0"
0.1"
0.2"
0.3"...
Co-Reference Resolution
•  Better co-reference resolution through the
knowledge bases & entity linking
Roman Prokofyev, Al...
Literature Browsing
http://sciencewise.info
3⃣
http://exascale.info
ICDAR 2015
August 25, 2015
Entity Storage (Diplodocus[RDF])
•  Materialize the joins!
•  Dense-pack the values
•  Provide new indices
•  Co-locate
• ...
Support for Provenance
Marcin Wylot, Philippe Cudré-Mauroux, Paul T. Groth: TripleProv: efficient processing of
lineage qu...
Upcoming SlideShare
Loading in …5
×

Entity-Centric Data Management

717 views

Published on

Keynote @ ICDAR 2015 on Entity-Centric Data Management

Published in: Internet
  • Be the first to comment

Entity-Centric Data Management

  1. 1. Entity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland ICDAR 2015 August 25, 2015
  2. 2. eXascale Infolab (XI) •  New lab @ DIUF, U. of Fribourg–Switzerland •  Big Data infrastructures for social / semantic / scientific data (… mostly) http://exascale.info/
  3. 3. Entities
  4. 4. Volume ■  amount of data Velocity ■  speed of data in and out Variety ■  range of data types and sources [Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" Context: The 3-Vs of Big Data
  5. 5. Information Management •  The story so far: –  Strict separation between unstructured and structured data management infrastructures DBMS JDBC SQL Inverted Index Keywords HTTP
  6. 6. The 3rd V: Information Integration •  Information integration is still one of the biggest CS problem out there (according to many e.g., Gartner) •  Information integration typically requires some sort of mediation 1.  Unstructured Data: keywords, synsets 2.  Structured Data: global schema, transitive closure of schemas (mostly syntactic) ⇒ nightmarish if 1 and 2 taken separately, horror marathon if considered together
  7. 7. Entities as Mediation •  Rising paradigm –  Store information at the entity granularity –  Integrate information by inter-linking entities •  Advantages? –  Coarser granularity compared to keywords •  More natural, e.g., brain functions similarly (or is it the other way around?) –  Denormalized information compared to RDBMSs •  Schema-later, heterogeneity, sparsity •  Pre-computed joins, “Semantic” linking •  Drawbacks?
  8. 8. Entity-Centric Data Management Higher-level apps
  9. 9. •  Google Knowledge Graph 9 Example (1): Google’s Knowledge Graph
  10. 10. Example (2): BBC Olympics
  11. 11. Example (3): Web Data Integration http://data.semanticweb.org/person/philippe-cudre-mauroux/html
  12. 12. Exposing Textual Data •  The XI Pipeline –  From text to (contextualized) entities •  Runs on massive amounts of data (Spark) Mention Extraction NER Entity Linking Entity Typing 1⃣ 2⃣ 3⃣
  13. 13. Named Entity Recognition (NER) Text extraction (Apache Tika) List of extracted n-grams n-gram Indexing foreach Candidate Selection List of selected n-grams Supervised Classi!er Ranked list of n-grams Lemmat ization n+1 grams merging Feature extractionFeature extractionFeatures POS Tagging frequency reweighting Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux: Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408 1⃣
  14. 14. Entity Linking •  Linking entities to text is an old problem… –  … and is extremely hard, esp. for machines •  Dozens of approaches have been suggested •  What if –  We want to combine approaches / frameworks? –  We want to leverage both human computations & algorithms? 2⃣
  15. 15. ZenCrowd •  Three-stage blocking approach for integrating textual data & entities 1.  Cheap inverted index selection of candidates 2.  Expensive similarity measures 3.  Crowdsource low confidence matches •  Uses dynamic templating to create micro-matching-tasks and publish them on MTurk •  Combines both algorithmic and human matchers using probabilistic networks Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux: ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. WWW 2012: 469-478
  16. 16. ZenCrowd Architecture Micro Matching Tasks HTML Pages HTML+ RDFa Pages LOD Open Data Cloud Crowdsourcing Platform ZenCrowd Entity Extractors LOD Index Get Entity Input Output Probabilistic Network Decision Engine Micro- TaskManager Workers Decisions Algorithmic Matchers
  17. 17. Probabilistic Inference •  Probabilistic network to integrate a priori & a posteriori information –  Agreement of good turkers & algorithms •  Learning process –  Constraints •  Unicity •  Equality (SameAs) –  Giant probabilistic graph •  Instantiated selectively w1 w2 l1 l2 pw1( ) pw2( ) lf1( ) lf2( ) pl1( ) pl2( ) l lf3 pl c11 c22 c12 c21 c13 u2-3( )sa1-2( )
  18. 18. Does it Work? •  Improves avg. prec. by 0.14 on average! –  Minimal crowd involvement –  Embarrassingly parallel problem Top$US$ Worker$ 0$ 0.5$ 1$ 0$ 250$ 500$ Worker&Precision& Number&of&Tasks& US$Workers$ IN$Workers$ 0.6$ 0.62$ 0.64$ 0.66$ 0.68$ 0.7$ 0.72$ 0.74$ 0.76$ 0.78$ 0.8$ 1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$ Precision) Top)K)workers)
  19. 19. Entity Typing •  Entities can have many types (facets) •  Which fine-grained types are most relevant given the context? Thing   American   Billionaires   People   from  King   County   People   from   Sea:le   Windows   People   Agent   Person   Living   People   American   People  of   Sco@sh   Descent   Harvard   University   People   American   Computer   Programmers   American   Philanthropists   3⃣
  20. 20. TRank •  Fine-grained Typing •  Tree of 447’260 types •  Rooted on <owl:Thing> •  Depth of 19 •  Ranks relevant types by analyzing the context •  Textual context •  Graph context •  Decision trees •  Linear regression Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, Karl Aberer: TRank: Ranking Entity Types Using the Web of Data. ISWC 2013: 640-656.
  21. 21. Enterprise search Co-reference resolution Literature browsing / recommendation Some Applications 1⃣ 2⃣ 3⃣
  22. 22. Enterprise Search •  Holy grail for many large companies •  How can end-users reach entities? ⇒ Structured search ⇒ Keyword search •  On their names or attributes –  Obviously not ideal •  BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 •  Query extension, query completion or pseudo-relevance feedback yield comparable (or worse) results Higher-level apps 1⃣
  23. 23. Hybrid Entity Search The Descendants TheDescendants type title GeorgeClooney George Clooney name May 6, 1961 dateOfBirth type ShaileneW Shailene Woodley name Nov. 15, 1991 dateOfBirth type playsIn playsIn •  Main idea: combine unstructured and structured search –  Inverted index to locate first candidates –  Graph queries to refine the results •  Graph traversals (queries on object properties) •  Graph neighborhoods (queries on data type properties) Inverted Index Keywords HTTP DBMS SPARQL
  24. 24. Architecture LOD Cloud index() User Query Annotation and Expansion Inverted Index RDF Store Ranking FunctionsRanking FunctionsRanking Functions query() Entity Search Keyword Query intermediate top-k results Graph-Enriched Results Graph Traversals (queries on object properties) Neighborhoods (queries on datatype properties) Structured Inverted Index WordNet 3rd party search engines Final Ranking Function Pseudo-Relevance Feedback Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux: Combining inverted indices and structured search for ad-hoc object retrieval. SIGIR 2012: 125-134
  25. 25. Does it Work? •  Up to 25% improvement over best IR (stat. sign.) •  Very modest impact on latency (17%) 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0" 0.2" 0.4" 0.6" 0.8" 1" Precision) Recall) S1_1" BM25"baseline" S2_2" SAMEAS"
  26. 26. Co-Reference Resolution •  Better co-reference resolution through the knowledge bases & entity linking Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015. Barack Obama called Angela Merkel last week; the president asked the chancellor whether… 2⃣
  27. 27. Literature Browsing http://sciencewise.info 3⃣
  28. 28. http://exascale.info ICDAR 2015 August 25, 2015
  29. 29. Entity Storage (Diplodocus[RDF]) •  Materialize the joins! •  Dense-pack the values •  Provide new indices •  Co-locate •  Co-locate •  Co-locate Marcin Wylot, Jigé Pont, Mariusz Wisniewski, Philippe Cudré-Mauroux: dipLODocus[RDF] - Short and Long-Tail RDF Analytics for Massive Webs of Data. ISWC 2011: 778-793
  30. 30. Support for Provenance Marcin Wylot, Philippe Cudré-Mauroux, Paul T. Groth: TripleProv: efficient processing of lineage queries in a native RDF store. WWW 2014: 455-466

×