2. eXascale Infolab (XI)
• New lab @ DIUF, U. of Fribourg–Switzerland
• Big Data infrastructures for social / semantic / scientific
data (… mostly)
http://exascale.info/
4. Volume
■ amount of data
Velocity
■ speed of data in and out
Variety
■ range of data types and sources
[Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization"
Context: The 3-Vs of Big Data
5. Information Management
• The story so far:
– Strict separation between unstructured and structured
data management infrastructures
DBMS
JDBC
SQL
Inverted Index
Keywords
HTTP
6. The 3rd V: Information Integration
• Information integration is still one of the biggest
CS problem out there (according to many e.g., Gartner)
• Information integration typically requires some
sort of mediation
1. Unstructured Data: keywords, synsets
2. Structured Data: global schema, transitive closure of
schemas (mostly syntactic)
⇒ nightmarish if 1 and 2 taken separately, horror
marathon if considered together
7. Entities as Mediation
• Rising paradigm
– Store information at the entity granularity
– Integrate information by inter-linking entities
• Advantages?
– Coarser granularity compared to keywords
• More natural, e.g., brain functions similarly (or is it the
other way around?)
– Denormalized information compared to RDBMSs
• Schema-later, heterogeneity, sparsity
• Pre-computed joins, “Semantic” linking
• Drawbacks?
11. Example (3): Web Data Integration
http://data.semanticweb.org/person/philippe-cudre-mauroux/html
12. Exposing Textual Data
• The XI Pipeline
– From text to (contextualized) entities
• Runs on massive amounts of data (Spark)
Mention
Extraction
NER
Entity
Linking
Entity
Typing
1⃣ 2⃣ 3⃣
13. Named Entity Recognition (NER)
Text
extraction
(Apache Tika)
List of
extracted
n-grams
n-gram
Indexing
foreach
Candidate
Selection
List of
selected
n-grams
Supervised
Classi!er
Ranked
list of
n-grams
Lemmat
ization
n+1 grams
merging
Feature
extractionFeature
extractionFeatures
POS
Tagging
frequency
reweighting
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux:
Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408
1⃣
14. Entity Linking
• Linking entities to text is an old problem…
– … and is extremely hard, esp. for machines
• Dozens of approaches have been suggested
• What if
– We want to combine approaches / frameworks?
– We want to leverage both human computations &
algorithms?
2⃣
15. ZenCrowd
• Three-stage blocking approach for integrating textual
data & entities
1. Cheap inverted index selection of candidates
2. Expensive similarity measures
3. Crowdsource low confidence matches
• Uses dynamic templating to create micro-matching-tasks
and publish them on MTurk
• Combines both algorithmic and human matchers using
probabilistic networks
Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux:
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing
techniques for large-scale entity linking. WWW 2012: 469-478
19. Entity Typing
• Entities can have many types (facets)
• Which fine-grained types are most relevant given the context?
Thing
American
Billionaires
People
from
King
County
People
from
Sea:le
Windows
People
Agent
Person
Living
People
American
People
of
Sco@sh
Descent
Harvard
University
People
American
Computer
Programmers
American
Philanthropists
3⃣
20. TRank
• Fine-grained Typing
• Tree of 447’260 types
• Rooted on <owl:Thing>
• Depth of 19
• Ranks relevant types by analyzing the context
• Textual context
• Graph context
• Decision trees
• Linear regression
Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, Karl Aberer:
TRank: Ranking Entity Types Using the Web of Data. ISWC 2013: 640-656.
22. Enterprise Search
• Holy grail for many large companies
• How can end-users reach entities?
⇒ Structured search
⇒ Keyword search
• On their names or attributes
– Obviously not ideal
• BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20
• Query extension, query completion or pseudo-relevance
feedback yield comparable (or worse) results
Higher-level apps
1⃣
23. Hybrid Entity Search
The Descendants
TheDescendants
type
title
GeorgeClooney
George Clooney
name
May 6, 1961
dateOfBirth
type
ShaileneW
Shailene Woodley
name
Nov. 15, 1991
dateOfBirth
type
playsIn
playsIn
• Main idea: combine unstructured and structured
search
– Inverted index to locate first candidates
– Graph queries to refine the results
• Graph traversals (queries on object properties)
• Graph neighborhoods (queries on
data type properties)
Inverted Index
Keywords
HTTP
DBMS
SPARQL
24. Architecture
LOD Cloud
index()
User
Query Annotation
and Expansion
Inverted Index
RDF
Store
Ranking
FunctionsRanking
FunctionsRanking
Functions
query()
Entity Search
Keyword Query
intermediate
top-k results
Graph-Enriched
Results
Graph Traversals
(queries on object
properties)
Neighborhoods
(queries on datatype
properties)
Structured
Inverted Index
WordNet
3rd party
search engines
Final Ranking
Function
Pseudo-Relevance Feedback
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux: Combining inverted indices and
structured search for ad-hoc object retrieval. SIGIR 2012: 125-134
25. Does it Work?
• Up to 25% improvement over best IR (stat. sign.)
• Very modest impact on latency (17%)
0"
0.1"
0.2"
0.3"
0.4"
0.5"
0.6"
0.7"
0" 0.2" 0.4" 0.6" 0.8" 1"
Precision)
Recall)
S1_1"
BM25"baseline"
S2_2"
SAMEAS"
26. Co-Reference Resolution
• Better co-reference resolution through the
knowledge bases & entity linking
Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and
Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015.
Barack Obama
called
Angela Merkel last week;
the president asked
the chancellor whether…
2⃣
29. Entity Storage (Diplodocus[RDF])
• Materialize the joins!
• Dense-pack the values
• Provide new indices
• Co-locate
• Co-locate
• Co-locate
Marcin Wylot, Jigé Pont, Mariusz Wisniewski, Philippe Cudré-Mauroux: dipLODocus[RDF] -
Short and Long-Tail RDF Analytics for Massive Webs of Data. ISWC 2011: 778-793
30. Support for Provenance
Marcin Wylot, Philippe Cudré-Mauroux, Paul T. Groth: TripleProv: efficient processing of
lineage queries in a native RDF store. WWW 2014: 455-466