SlideShare a Scribd company logo
Entity-Centric Data Management
Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg
Switzerland
ICDAR 2015
August 25, 2015
eXascale Infolab (XI)
•  New lab @ DIUF, U. of Fribourg–Switzerland
•  Big Data infrastructures for social / semantic / scientific
data (… mostly)
http://exascale.info/
Entities
Volume
■  amount of data
Velocity
■  speed of data in and out
Variety
■  range of data types and sources
[Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization"
Context: The 3-Vs of Big Data
Information Management
•  The story so far:
–  Strict separation between unstructured and structured
data management infrastructures
DBMS
JDBC
SQL
Inverted Index
Keywords
HTTP
The 3rd V: Information Integration
•  Information integration is still one of the biggest
CS problem out there (according to many e.g., Gartner)
•  Information integration typically requires some
sort of mediation
1.  Unstructured Data: keywords, synsets
2.  Structured Data: global schema, transitive closure of
schemas (mostly syntactic)
⇒ nightmarish if 1 and 2 taken separately, horror
marathon if considered together
Entities as Mediation
•  Rising paradigm
–  Store information at the entity granularity
–  Integrate information by inter-linking entities
•  Advantages?
–  Coarser granularity compared to keywords
•  More natural, e.g., brain functions similarly (or is it the
other way around?)
–  Denormalized information compared to RDBMSs
•  Schema-later, heterogeneity, sparsity
•  Pre-computed joins, “Semantic” linking
•  Drawbacks?
Entity-Centric Data Management
Higher-level apps
•  Google Knowledge Graph
9
Example (1): Google’s Knowledge Graph
Example (2): BBC Olympics
Example (3): Web Data Integration
http://data.semanticweb.org/person/philippe-cudre-mauroux/html
Exposing Textual Data
•  The XI Pipeline
–  From text to (contextualized) entities
•  Runs on massive amounts of data (Spark)
Mention
Extraction
NER
Entity
Linking
Entity
Typing
1⃣ 2⃣ 3⃣
Named Entity Recognition (NER)
Text
extraction
(Apache Tika)
List of
extracted
n-grams
n-gram
Indexing
foreach
Candidate
Selection
List of
selected
n-grams
Supervised
Classi!er
Ranked
list of
n-grams
Lemmat
ization
n+1 grams
merging
Feature
extractionFeature
extractionFeatures
POS
Tagging
frequency
reweighting
Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux:
Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408
1⃣
Entity Linking
•  Linking entities to text is an old problem…
–  … and is extremely hard, esp. for machines
•  Dozens of approaches have been suggested
•  What if
–  We want to combine approaches / frameworks?
–  We want to leverage both human computations &
algorithms?
2⃣
ZenCrowd
•  Three-stage blocking approach for integrating textual
data & entities
1.  Cheap inverted index selection of candidates
2.  Expensive similarity measures
3.  Crowdsource low confidence matches
•  Uses dynamic templating to create micro-matching-tasks
and publish them on MTurk
•  Combines both algorithmic and human matchers using
probabilistic networks
Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux:
ZenCrowd: leveraging probabilistic reasoning and crowdsourcing
techniques for large-scale entity linking. WWW 2012: 469-478
ZenCrowd Architecture
Micro
Matching
Tasks
HTML
Pages
HTML+ RDFa
Pages
LOD Open Data Cloud
Crowdsourcing
Platform
ZenCrowd
Entity
Extractors
LOD Index Get Entity
Input Output
Probabilistic
Network
Decision Engine
Micro-
TaskManager
Workers Decisions
Algorithmic
Matchers
Probabilistic Inference
•  Probabilistic network to integrate a priori & a
posteriori information
–  Agreement of good turkers & algorithms
•  Learning process
–  Constraints
•  Unicity
•  Equality (SameAs)
–  Giant probabilistic graph
•  Instantiated selectively
w1
w2
l1
l2
pw1( ) pw2( )
lf1( ) lf2( )
pl1( ) pl2( )
l
lf3
pl
c11
c22
c12
c21
c13
u2-3( )sa1-2( )
Does it Work?
•  Improves avg. prec. by 0.14 on average!
–  Minimal crowd involvement
–  Embarrassingly parallel problem
Top$US$
Worker$
0$
0.5$
1$
0$ 250$ 500$
Worker&Precision&
Number&of&Tasks&
US$Workers$
IN$Workers$
0.6$
0.62$
0.64$
0.66$
0.68$
0.7$
0.72$
0.74$
0.76$
0.78$
0.8$
1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$
Precision)
Top)K)workers)
Entity Typing
•  Entities can have many types (facets)
•  Which fine-grained types are most relevant given the context?
Thing	
  
American	
  
Billionaires	
  
People	
  
from	
  King	
  
County	
  
People	
  
from	
  
Sea:le	
  
Windows	
  
People	
  
Agent	
  
Person	
  
Living	
  
People	
  
American	
  
People	
  of	
  
Sco@sh	
  
Descent	
  
Harvard	
  
University	
  
People	
  
American	
  
Computer	
  
Programmers	
  
American	
  
Philanthropists	
  
3⃣
TRank
•  Fine-grained Typing
•  Tree of 447’260 types
•  Rooted on <owl:Thing>
•  Depth of 19
•  Ranks relevant types by analyzing the context
•  Textual context
•  Graph context
•  Decision trees
•  Linear regression
Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, Karl Aberer:
TRank: Ranking Entity Types Using the Web of Data. ISWC 2013: 640-656.
Enterprise search
Co-reference resolution
Literature browsing / recommendation
Some Applications
1⃣
2⃣
3⃣
Enterprise Search
•  Holy grail for many large companies
•  How can end-users reach entities?
⇒ Structured search
⇒ Keyword search
•  On their names or attributes
–  Obviously not ideal
•  BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20
•  Query extension, query completion or pseudo-relevance
feedback yield comparable (or worse) results
Higher-level apps
1⃣
Hybrid Entity Search
The Descendants
TheDescendants
type
title
GeorgeClooney
George Clooney
name
May 6, 1961
dateOfBirth
type
ShaileneW
Shailene Woodley
name
Nov. 15, 1991
dateOfBirth
type
playsIn
playsIn
•  Main idea: combine unstructured and structured
search
–  Inverted index to locate first candidates
–  Graph queries to refine the results
•  Graph traversals (queries on object properties)
•  Graph neighborhoods (queries on
data type properties)
Inverted Index
Keywords
HTTP
DBMS
SPARQL
Architecture
LOD Cloud
index()
User
Query Annotation
and Expansion
Inverted Index
RDF
Store
Ranking
FunctionsRanking
FunctionsRanking
Functions
query()
Entity Search
Keyword Query
intermediate
top-k results
Graph-Enriched
Results
Graph Traversals
(queries on object
properties)
Neighborhoods
(queries on datatype
properties)
Structured
Inverted Index
WordNet
3rd party
search engines
Final Ranking
Function
Pseudo-Relevance Feedback
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux: Combining inverted indices and
structured search for ad-hoc object retrieval. SIGIR 2012: 125-134
Does it Work?
•  Up to 25% improvement over best IR (stat. sign.)
•  Very modest impact on latency (17%)
0"
0.1"
0.2"
0.3"
0.4"
0.5"
0.6"
0.7"
0" 0.2" 0.4" 0.6" 0.8" 1"
Precision)
Recall)
S1_1"
BM25"baseline"
S2_2"
SAMEAS"
Co-Reference Resolution
•  Better co-reference resolution through the
knowledge bases & entity linking
Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and
Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015.
Barack Obama
called
Angela Merkel last week;
the president asked
the chancellor whether…
2⃣
Literature Browsing
http://sciencewise.info
3⃣
http://exascale.info
ICDAR 2015
August 25, 2015
Entity Storage (Diplodocus[RDF])
•  Materialize the joins!
•  Dense-pack the values
•  Provide new indices
•  Co-locate
•  Co-locate
•  Co-locate
Marcin Wylot, Jigé Pont, Mariusz Wisniewski, Philippe Cudré-Mauroux: dipLODocus[RDF] -
Short and Long-Tail RDF Analytics for Massive Webs of Data. ISWC 2011: 778-793
Support for Provenance
Marcin Wylot, Philippe Cudré-Mauroux, Paul T. Groth: TripleProv: efficient processing of
lineage queries in a native RDF store. WWW 2014: 455-466

More Related Content

What's hot

Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
Amit Sheth
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
Amit Sheth
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
SSSW
 
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data ScienceAI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
Optum
 
Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?
DATAVERSITY
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
Optum
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
Sören Auer
 
Semantic Integration Patterns
Semantic Integration PatternsSemantic Integration Patterns
Semantic Integration Patterns
Optum
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
Edward Curry
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the Web
Armin Haller
 
Structured Data for the Financial Industry
Structured Data for the Financial Industry Structured Data for the Financial Industry
Structured Data for the Financial Industry
sopekmir
 
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Stefan Urbanek
 
SemTecBiz 2012: Corporate Semantic Web
SemTecBiz 2012: Corporate Semantic WebSemTecBiz 2012: Corporate Semantic Web
SemTecBiz 2012: Corporate Semantic Web
Adrian Paschke
 
Web scraping
Web scrapingWeb scraping
Web scraping
Selecto
 
Creating knowledge out of interlinked data
Creating knowledge out of interlinked dataCreating knowledge out of interlinked data
Creating knowledge out of interlinked data
Sören Auer
 
One Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACLOne Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACL
Connected Data World
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
Neo4j
 
The Semantic Web: status and prospects
The Semantic Web: status and prospectsThe Semantic Web: status and prospects
The Semantic Web: status and prospects
Guus Schreiber
 
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
Connected Data World
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
Sören Auer
 

What's hot (20)

Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
 
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data ScienceAI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
AI, Knowledge Representation and Graph Databases -
 Key Trends in Data Science
 
Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?Graph Databases - Where Do We Do the Modeling Part?
Graph Databases - Where Do We Do the Modeling Part?
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Semantic Integration Patterns
Semantic Integration PatternsSemantic Integration Patterns
Semantic Integration Patterns
 
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
From Data Platforms to Dataspaces: Enabling Data Ecosystems for Intelligent S...
 
Knowledge graphs on the Web
Knowledge graphs on the WebKnowledge graphs on the Web
Knowledge graphs on the Web
 
Structured Data for the Financial Industry
Structured Data for the Financial Industry Structured Data for the Financial Industry
Structured Data for the Financial Industry
 
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
 
SemTecBiz 2012: Corporate Semantic Web
SemTecBiz 2012: Corporate Semantic WebSemTecBiz 2012: Corporate Semantic Web
SemTecBiz 2012: Corporate Semantic Web
 
Web scraping
Web scrapingWeb scraping
Web scraping
 
Creating knowledge out of interlinked data
Creating knowledge out of interlinked dataCreating knowledge out of interlinked data
Creating knowledge out of interlinked data
 
One Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACLOne Ontology, One Data Set, Multiple Shapes with SHACL
One Ontology, One Data Set, Multiple Shapes with SHACL
 
Knowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based SearchKnowledge Graphs - The Power of Graph-Based Search
Knowledge Graphs - The Power of Graph-Based Search
 
The Semantic Web: status and prospects
The Semantic Web: status and prospectsThe Semantic Web: status and prospects
The Semantic Web: status and prospects
 
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
From Taxonomies and Schemas to Knowledge Graphs: Parts 1 & 2
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
 

Similar to Entity-Centric Data Management

Large scale computing
Large scale computing Large scale computing
Large scale computing
Bhupesh Bansal
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
markgrover
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
Lars Marius Garshol
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
Tao Feng
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
Tao Feng
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
EUCLID project
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
itnewsafrica
 
Zen and the Art of Datanauting
Zen and the Art of DatanautingZen and the Art of Datanauting
Zen and the Art of Datanauting
OntologySystems
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
Vladimir Alexiev, PhD, PMP
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
Philippe Mizrahi
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
markgrover
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Roi Blanco
 
Human Computation for Big Data
Human Computation for Big DataHuman Computation for Big Data
Human Computation for Big Data
eXascale Infolab
 
IT webinar 2016
IT webinar 2016IT webinar 2016
IT webinar 2016
PR Cell, IIM Rohtak
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
Mark Grover
 
01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...
teodroscampaus
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
BIWUG
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
Joris Poelmans
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
CareerBuilder.com
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 

Similar to Entity-Centric Data Management (20)

Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Zen and the Art of Datanauting
Zen and the Art of DatanautingZen and the Art of Datanauting
Zen and the Art of Datanauting
 
Semantics and Machine Learning
Semantics and Machine LearningSemantics and Machine Learning
Semantics and Machine Learning
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Human Computation for Big Data
Human Computation for Big DataHuman Computation for Big Data
Human Computation for Big Data
 
IT webinar 2016
IT webinar 2016IT webinar 2016
IT webinar 2016
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...01-introduction.ppt the paper that you can unless you want to join me because...
01-introduction.ppt the paper that you can unless you want to join me because...
 
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
Spsbepoelmanssharepointbigdataclean 150421080105-conversion-gate02
 
How to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePointHow to build your own Delve: combining machine learning, big data and SharePoint
How to build your own Delve: combining machine learning, big data and SharePoint
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 

More from eXascale Infolab

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
eXascale Infolab
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
eXascale Infolab
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
eXascale Infolab
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
eXascale Infolab
 
Cikm 2018
Cikm 2018Cikm 2018
Cikm 2018
eXascale Infolab
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
eXascale Infolab
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
eXascale Infolab
 
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
eXascale Infolab
 
Crowd scheduling www2016
Crowd scheduling www2016Crowd scheduling www2016
Crowd scheduling www2016
eXascale Infolab
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
eXascale Infolab
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data
eXascale Infolab
 
SSSW 2015 Sense Making
SSSW 2015 Sense MakingSSSW 2015 Sense Making
SSSW 2015 Sense Making
eXascale Infolab
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
eXascale Infolab
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web Data
eXascale Infolab
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task Crowdsourcing
eXascale Infolab
 
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
eXascale Infolab
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
eXascale Infolab
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
eXascale Infolab
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big Data
eXascale Infolab
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
eXascale Infolab
 

More from eXascale Infolab (20)

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
 
Cikm 2018
Cikm 2018Cikm 2018
Cikm 2018
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
 
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
 
Crowd scheduling www2016
Crowd scheduling www2016Crowd scheduling www2016
Crowd scheduling www2016
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data
 
SSSW 2015 Sense Making
SSSW 2015 Sense MakingSSSW 2015 Sense Making
SSSW 2015 Sense Making
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web Data
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task Crowdsourcing
 
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big Data
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
 

Recently uploaded

University of California, Riverside diploma
University of California, Riverside diplomaUniversity of California, Riverside diploma
University of California, Riverside diploma
eufdev
 
Best CSS Animation Libraries for Web Developers
Best CSS Animation Libraries for Web DevelopersBest CSS Animation Libraries for Web Developers
Best CSS Animation Libraries for Web Developers
Shrestha Raaz
 
Maximizing Network Efficiency with Large Language Models (LLM)
Maximizing Network Efficiency with Large Language Models (LLM)Maximizing Network Efficiency with Large Language Models (LLM)
Maximizing Network Efficiency with Large Language Models (LLM)
Bangladesh Network Operators Group
 
Rent remote desktop server mangohost .net
Rent remote desktop server mangohost .netRent remote desktop server mangohost .net
Rent remote desktop server mangohost .net
pdfsubmission50
 
Understanding Threat Intelligence | What is Threat Intelligence
Understanding Threat Intelligence | What is Threat IntelligenceUnderstanding Threat Intelligence | What is Threat Intelligence
Understanding Threat Intelligence | What is Threat Intelligence
Lumiverse Solutions Pvt Ltd
 
Saint Louis University diploma
Saint Louis University diplomaSaint Louis University diploma
Saint Louis University diploma
eufdev
 
Team Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public servicesTeam Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public services
Bangladesh Network Operators Group
 
Career Development Advice for Network Engineers across the Pacific, presented...
Career Development Advice for Network Engineers across the Pacific, presented...Career Development Advice for Network Engineers across the Pacific, presented...
Career Development Advice for Network Engineers across the Pacific, presented...
APNIC
 
Best Skills to Learn for Freelancing.pdf
Best Skills to Learn for Freelancing.pdfBest Skills to Learn for Freelancing.pdf
Best Skills to Learn for Freelancing.pdf
Million-$-Knowledge {Million Dollar Knowledge}
 
Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...
Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...
Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...
mahigarg2024#G05
 
Lordsexch ID: An Ultimate Online Cricket ID Provider In India
Lordsexch ID: An Ultimate Online Cricket ID Provider In IndiaLordsexch ID: An Ultimate Online Cricket ID Provider In India
Lordsexch ID: An Ultimate Online Cricket ID Provider In India
exchangeid32
 
Effective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptxEffective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptx
AirtoryInc
 
upgrade to zabbix-7 0 como atualiza lts1
upgrade to zabbix-7 0 como atualiza lts1upgrade to zabbix-7 0 como atualiza lts1
upgrade to zabbix-7 0 como atualiza lts1
diogolsew
 
Portugal Dreamin 24 - How to easily use an API with Flows
Portugal Dreamin 24  - How to easily use an API with FlowsPortugal Dreamin 24  - How to easily use an API with Flows
Portugal Dreamin 24 - How to easily use an API with Flows
Thierry TROUIN ☁
 
Software Defined Networking, Concepts and Practical Implementations
Software Defined Networking, Concepts and Practical ImplementationsSoftware Defined Networking, Concepts and Practical Implementations
Software Defined Networking, Concepts and Practical Implementations
Bangladesh Network Operators Group
 
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
elbertablack
 
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
samyanvichadda
 
Week 1 - Pendidikan Pancasila - Gr 1.docx
Week 1 - Pendidikan Pancasila - Gr 1.docxWeek 1 - Pendidikan Pancasila - Gr 1.docx
Week 1 - Pendidikan Pancasila - Gr 1.docx
JunaManroe1
 
Do it again anti Republican shirt Do it again anti Republican shirt
Do it again anti Republican shirt Do it again anti Republican shirtDo it again anti Republican shirt Do it again anti Republican shirt
Do it again anti Republican shirt Do it again anti Republican shirt
exgf28
 
Trump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination ShirtTrump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination Shirt
exgf28
 

Recently uploaded (20)

University of California, Riverside diploma
University of California, Riverside diplomaUniversity of California, Riverside diploma
University of California, Riverside diploma
 
Best CSS Animation Libraries for Web Developers
Best CSS Animation Libraries for Web DevelopersBest CSS Animation Libraries for Web Developers
Best CSS Animation Libraries for Web Developers
 
Maximizing Network Efficiency with Large Language Models (LLM)
Maximizing Network Efficiency with Large Language Models (LLM)Maximizing Network Efficiency with Large Language Models (LLM)
Maximizing Network Efficiency with Large Language Models (LLM)
 
Rent remote desktop server mangohost .net
Rent remote desktop server mangohost .netRent remote desktop server mangohost .net
Rent remote desktop server mangohost .net
 
Understanding Threat Intelligence | What is Threat Intelligence
Understanding Threat Intelligence | What is Threat IntelligenceUnderstanding Threat Intelligence | What is Threat Intelligence
Understanding Threat Intelligence | What is Threat Intelligence
 
Saint Louis University diploma
Saint Louis University diplomaSaint Louis University diploma
Saint Louis University diploma
 
Team Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public servicesTeam Cymru Community Services,Overview of all public services
Team Cymru Community Services,Overview of all public services
 
Career Development Advice for Network Engineers across the Pacific, presented...
Career Development Advice for Network Engineers across the Pacific, presented...Career Development Advice for Network Engineers across the Pacific, presented...
Career Development Advice for Network Engineers across the Pacific, presented...
 
Best Skills to Learn for Freelancing.pdf
Best Skills to Learn for Freelancing.pdfBest Skills to Learn for Freelancing.pdf
Best Skills to Learn for Freelancing.pdf
 
Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...
Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...
Girls Call Mahipalpur 000XX00000 Provide Best And Top Girl Service And No1 in...
 
Lordsexch ID: An Ultimate Online Cricket ID Provider In India
Lordsexch ID: An Ultimate Online Cricket ID Provider In IndiaLordsexch ID: An Ultimate Online Cricket ID Provider In India
Lordsexch ID: An Ultimate Online Cricket ID Provider In India
 
Effective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptxEffective Tips for Creating the Best Rich Media Ads .pptx
Effective Tips for Creating the Best Rich Media Ads .pptx
 
upgrade to zabbix-7 0 como atualiza lts1
upgrade to zabbix-7 0 como atualiza lts1upgrade to zabbix-7 0 como atualiza lts1
upgrade to zabbix-7 0 como atualiza lts1
 
Portugal Dreamin 24 - How to easily use an API with Flows
Portugal Dreamin 24  - How to easily use an API with FlowsPortugal Dreamin 24  - How to easily use an API with Flows
Portugal Dreamin 24 - How to easily use an API with Flows
 
Software Defined Networking, Concepts and Practical Implementations
Software Defined Networking, Concepts and Practical ImplementationsSoftware Defined Networking, Concepts and Practical Implementations
Software Defined Networking, Concepts and Practical Implementations
 
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
Female Service Girls Call Delhi 9873940964 Provide Best And Top Girl Service ...
 
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
Vip Girls Call ServiCe Chennai X00XXX00XX Tanisha Best High Class Chennai Ava...
 
Week 1 - Pendidikan Pancasila - Gr 1.docx
Week 1 - Pendidikan Pancasila - Gr 1.docxWeek 1 - Pendidikan Pancasila - Gr 1.docx
Week 1 - Pendidikan Pancasila - Gr 1.docx
 
Do it again anti Republican shirt Do it again anti Republican shirt
Do it again anti Republican shirt Do it again anti Republican shirtDo it again anti Republican shirt Do it again anti Republican shirt
Do it again anti Republican shirt Do it again anti Republican shirt
 
Trump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination ShirtTrump Assassination Shirt Trump Assassination Shirt
Trump Assassination Shirt Trump Assassination Shirt
 

Entity-Centric Data Management

  • 1. Entity-Centric Data Management Philippe Cudré-Mauroux eXascale Infolab, University of Fribourg Switzerland ICDAR 2015 August 25, 2015
  • 2. eXascale Infolab (XI) •  New lab @ DIUF, U. of Fribourg–Switzerland •  Big Data infrastructures for social / semantic / scientific data (… mostly) http://exascale.info/
  • 4. Volume ■  amount of data Velocity ■  speed of data in and out Variety ■  range of data types and sources [Gartner 2012] "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" Context: The 3-Vs of Big Data
  • 5. Information Management •  The story so far: –  Strict separation between unstructured and structured data management infrastructures DBMS JDBC SQL Inverted Index Keywords HTTP
  • 6. The 3rd V: Information Integration •  Information integration is still one of the biggest CS problem out there (according to many e.g., Gartner) •  Information integration typically requires some sort of mediation 1.  Unstructured Data: keywords, synsets 2.  Structured Data: global schema, transitive closure of schemas (mostly syntactic) ⇒ nightmarish if 1 and 2 taken separately, horror marathon if considered together
  • 7. Entities as Mediation •  Rising paradigm –  Store information at the entity granularity –  Integrate information by inter-linking entities •  Advantages? –  Coarser granularity compared to keywords •  More natural, e.g., brain functions similarly (or is it the other way around?) –  Denormalized information compared to RDBMSs •  Schema-later, heterogeneity, sparsity •  Pre-computed joins, “Semantic” linking •  Drawbacks?
  • 9. •  Google Knowledge Graph 9 Example (1): Google’s Knowledge Graph
  • 10. Example (2): BBC Olympics
  • 11. Example (3): Web Data Integration http://data.semanticweb.org/person/philippe-cudre-mauroux/html
  • 12. Exposing Textual Data •  The XI Pipeline –  From text to (contextualized) entities •  Runs on massive amounts of data (Spark) Mention Extraction NER Entity Linking Entity Typing 1⃣ 2⃣ 3⃣
  • 13. Named Entity Recognition (NER) Text extraction (Apache Tika) List of extracted n-grams n-gram Indexing foreach Candidate Selection List of selected n-grams Supervised Classi!er Ranked list of n-grams Lemmat ization n+1 grams merging Feature extractionFeature extractionFeatures POS Tagging frequency reweighting Roman Prokofyev, Gianluca Demartini, Philippe Cudré-Mauroux: Effective named entity recognition for idiosyncratic web collections. WWW 2014: 397-408 1⃣
  • 14. Entity Linking •  Linking entities to text is an old problem… –  … and is extremely hard, esp. for machines •  Dozens of approaches have been suggested •  What if –  We want to combine approaches / frameworks? –  We want to leverage both human computations & algorithms? 2⃣
  • 15. ZenCrowd •  Three-stage blocking approach for integrating textual data & entities 1.  Cheap inverted index selection of candidates 2.  Expensive similarity measures 3.  Crowdsource low confidence matches •  Uses dynamic templating to create micro-matching-tasks and publish them on MTurk •  Combines both algorithmic and human matchers using probabilistic networks Gianluca Demartini, Djellel Eddine Difallah, Philippe Cudré-Mauroux: ZenCrowd: leveraging probabilistic reasoning and crowdsourcing techniques for large-scale entity linking. WWW 2012: 469-478
  • 16. ZenCrowd Architecture Micro Matching Tasks HTML Pages HTML+ RDFa Pages LOD Open Data Cloud Crowdsourcing Platform ZenCrowd Entity Extractors LOD Index Get Entity Input Output Probabilistic Network Decision Engine Micro- TaskManager Workers Decisions Algorithmic Matchers
  • 17. Probabilistic Inference •  Probabilistic network to integrate a priori & a posteriori information –  Agreement of good turkers & algorithms •  Learning process –  Constraints •  Unicity •  Equality (SameAs) –  Giant probabilistic graph •  Instantiated selectively w1 w2 l1 l2 pw1( ) pw2( ) lf1( ) lf2( ) pl1( ) pl2( ) l lf3 pl c11 c22 c12 c21 c13 u2-3( )sa1-2( )
  • 18. Does it Work? •  Improves avg. prec. by 0.14 on average! –  Minimal crowd involvement –  Embarrassingly parallel problem Top$US$ Worker$ 0$ 0.5$ 1$ 0$ 250$ 500$ Worker&Precision& Number&of&Tasks& US$Workers$ IN$Workers$ 0.6$ 0.62$ 0.64$ 0.66$ 0.68$ 0.7$ 0.72$ 0.74$ 0.76$ 0.78$ 0.8$ 1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$ Precision) Top)K)workers)
  • 19. Entity Typing •  Entities can have many types (facets) •  Which fine-grained types are most relevant given the context? Thing   American   Billionaires   People   from  King   County   People   from   Sea:le   Windows   People   Agent   Person   Living   People   American   People  of   Sco@sh   Descent   Harvard   University   People   American   Computer   Programmers   American   Philanthropists   3⃣
  • 20. TRank •  Fine-grained Typing •  Tree of 447’260 types •  Rooted on <owl:Thing> •  Depth of 19 •  Ranks relevant types by analyzing the context •  Textual context •  Graph context •  Decision trees •  Linear regression Alberto Tonon, Michele Catasta, Gianluca Demartini, Philippe Cudré-Mauroux, Karl Aberer: TRank: Ranking Entity Types Using the Web of Data. ISWC 2013: 640-656.
  • 21. Enterprise search Co-reference resolution Literature browsing / recommendation Some Applications 1⃣ 2⃣ 3⃣
  • 22. Enterprise Search •  Holy grail for many large companies •  How can end-users reach entities? ⇒ Structured search ⇒ Keyword search •  On their names or attributes –  Obviously not ideal •  BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 •  Query extension, query completion or pseudo-relevance feedback yield comparable (or worse) results Higher-level apps 1⃣
  • 23. Hybrid Entity Search The Descendants TheDescendants type title GeorgeClooney George Clooney name May 6, 1961 dateOfBirth type ShaileneW Shailene Woodley name Nov. 15, 1991 dateOfBirth type playsIn playsIn •  Main idea: combine unstructured and structured search –  Inverted index to locate first candidates –  Graph queries to refine the results •  Graph traversals (queries on object properties) •  Graph neighborhoods (queries on data type properties) Inverted Index Keywords HTTP DBMS SPARQL
  • 24. Architecture LOD Cloud index() User Query Annotation and Expansion Inverted Index RDF Store Ranking FunctionsRanking FunctionsRanking Functions query() Entity Search Keyword Query intermediate top-k results Graph-Enriched Results Graph Traversals (queries on object properties) Neighborhoods (queries on datatype properties) Structured Inverted Index WordNet 3rd party search engines Final Ranking Function Pseudo-Relevance Feedback Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux: Combining inverted indices and structured search for ad-hoc object retrieval. SIGIR 2012: 125-134
  • 25. Does it Work? •  Up to 25% improvement over best IR (stat. sign.) •  Very modest impact on latency (17%) 0" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0" 0.2" 0.4" 0.6" 0.8" 1" Precision) Recall) S1_1" BM25"baseline" S2_2" SAMEAS"
  • 26. Co-Reference Resolution •  Better co-reference resolution through the knowledge bases & entity linking Roman Prokofyev, Alberto Tonon, Michael Luggen, Loic Vouilloz, Djellel Eddine Difallah, and Philippe Cudre-Mauroux: SANAPHOR: Ontology-Based Coreference Resolution. ISWC 2015. Barack Obama called Angela Merkel last week; the president asked the chancellor whether… 2⃣
  • 29. Entity Storage (Diplodocus[RDF]) •  Materialize the joins! •  Dense-pack the values •  Provide new indices •  Co-locate •  Co-locate •  Co-locate Marcin Wylot, Jigé Pont, Mariusz Wisniewski, Philippe Cudré-Mauroux: dipLODocus[RDF] - Short and Long-Tail RDF Analytics for Massive Webs of Data. ISWC 2011: 778-793
  • 30. Support for Provenance Marcin Wylot, Philippe Cudré-Mauroux, Paul T. Groth: TripleProv: efficient processing of lineage queries in a native RDF store. WWW 2014: 455-466