Jiacheng Huang, Wei Hu*, Haoxuan Li, Yuzhong Qu
Nanjing University, China
* Corresponding author: whu@nju.edu.cn
Automated Comparative Table Generation for
Facilitating Human Intervention in Multi-Entity Resolution
SIGIR’18, July 8–12, Ann Arbor, MI, USA
Outline
 Introduction
 Knowledge graph
 (Crowd) Entity resolution
 Related work
 Our approach
 Experiments and results
 Conclusion
Knowledge graph (KG)
 Knowledge graph (KG) is a knowledge base used by Google
to enhance its search engine’s results
 Other famous knowledge bases
 DBpedia, Freebase, Wikidata, YAGO …
 Linked Open Data (LOD) cloud
 KGs have reached a scale of billions of entities!
 Problem: Many different entities refer to
the same real-world thing
Entity Resolution
 Entity resolution (ER): find different entities referring to the same real-world thing
 a.k.a. entity linkage, entity matching …
 also widely studied in DB and NLP
 resolve heterogeneity and achieve interoperability
 Crowd ER
 use humans, in addition to machines, to obtain
the truths of ER tasks
 Key issues
 How to present a single ER task?
 How to select “right” humans?
 How to pick tasks under a budget?
 ……
Little effort has been made on how to
present the critical information (such as
important properties and values) to help
complete a task efficiently and accurately
[Verroios et al., SIGMOD’17]
Related work
 Multi-entity resolution (MER)
1. Display multiple entities in the form of a list
 just like what is typically seen from a Web search engine
2. Use pairwise presentation
 compare two entities at a time and align similar properties between them
 Pros & cons for MER
1. List: remember and compare in mind
2. Pairwise: focus, but difficult to scale
 Both lose transitivity & grouping info
entities with similar properties & values: ⦿ match ⦿ nonmatch
e1 [dbp:Lil_Eazy-E]
– rdf:type : Person, MusicalArtist
– rdfs:label : Lil Eazy-E
– owl:sameAs : fb:m.01wf_p_
– birthDate : 1984-4-23
– birthPlace : Compton
– gender : male
– genre : Gangsta rap, Hip hop
– givenName : Eric Darnell Wright
(146 property-values in total)
e2 [fb:m.01wf_p_]
– alias : Eric Wright, Eazy-E
– date_of_birth : 1963-9-7
– gender : male
– genre : gangsta rap, hip hop
– name : Eazy-E
– place_of_birth : Compton
– profession : rapper, producer
– type : person, music.artist
(1,253 property-values in total)
e3 [wd:Q36804]
– rdfs:label : Eazy-E
– altLabel : Eric Lynn Wright
– date_of_birth : 1963-9-7
– desc : Gangsta rapper, producer
– genre : gangsta rap
– instance_of : human
– occupation : musician, rapper
– place_of_birth : Compton
(141 property-values in total)
group? | entity | givenName / alias / altLabel | rdf:type / type / instance_of | birthDate / date_of_birth / date_of_birth
1 | e1 | Eric Darnell Wright | Person, MusicalArtist | 1984-4-23
2 | e2 | Eric Wright, Eazy-E | person, music.artist | 1963-9-7

entity | givenName / alias / altLabel | birthDate / date_of_birth / date_of_birth
e1 | Eric Darnell Wright | 1984-4-23
e2 | Eric Wright, Eazy-E | 1963-9-7
e3 | Eric Lynn Wright | 1963-9-7
Outline
 Introduction
 Our approach
 Approach workflow
1. Holistic property matching
2. Goodness measurement
3. Comparative table generation
 Experiments and results
 Conclusion
Our approach: comparative table
 Comparative table
 arrange entities and properties as
row and column headers, resp.
 assign values in cells
 Workflow
1. Holistic property matching: similarity calculation → property clique derivation
2. Goodness measurement: discriminability, abundance, semantics & diversity
3. Comparative table generation: property clique selection → value selection
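As a rough illustration of the layout described above (entities as rows, property cliques as columns, values in cells), the sketch below builds a table from the running example. The dictionary layout and the `build_table` helper are assumptions made for illustration, not the authors' implementation.

```python
# Values copied from the running example; the data layout itself is hypothetical.
entities = {
    "e1": {"givenName": "Eric Darnell Wright",
           "rdf:type": "Person, MusicalArtist",
           "birthDate": "1984-4-23"},
    "e2": {"alias": "Eric Wright, Eazy-E",
           "type": "person, music.artist",
           "date_of_birth": "1963-9-7"},
    "e3": {"altLabel": "Eric Lynn Wright",
           "instance_of": "human",
           "date_of_birth": "1963-9-7"},
}

# A property clique maps each entity to the one property it contributes.
cliques = [
    {"e1": "givenName", "e2": "alias", "e3": "altLabel"},
    {"e1": "rdf:type", "e2": "type", "e3": "instance_of"},
    {"e1": "birthDate", "e2": "date_of_birth", "e3": "date_of_birth"},
]


def build_table(entities, cliques):
    """Return one row per entity with one cell per property clique."""
    table = {}
    for entity_id, props in entities.items():
        row = []
        for clique in cliques:
            prop = clique.get(entity_id)
            # An empty cell means the entity has no property in this clique.
            row.append(props.get(prop, "") if prop else "")
        table[entity_id] = row
    return table


table = build_table(entities, cliques)
for entity_id, row in table.items():
    print(entity_id, " | ".join(row))
```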
group? | entity | givenName / alias / altLabel | rdf:type / type / instance_of | birthDate / date_of_birth / date_of_birth
1 | e1 | Eric Darnell Wright | Person, MusicalArtist | 1984-4-23
2 | e2 | Eric Wright, Eazy-E | person, music.artist | 1963-9-7
3 | e3 | Eric Lynn Wright | human | 1963-9-7
[Figure: approach workflow]
Multiple entities → Holistic property matching (similarity calculation, prop. clique derivation) → Property cliques → Goodness measurement (discriminability, abundance, semantics, diversity) → Goodness scores → Comparative table generation (prop. clique selection, value selection) → Comparative table → Human intervention (divide into groups)
Example property cliques: {rdfs:label, name}, {givenName, alias, altLabel}, {rdf:type, type, instance_of}, …
Example goodness scores: 0.2 {givenName, alias, altLabel}; 0.4 {rdf:type, type, instance_of}; 0.5 {birthDate, date_of_birth}; …
Challenge: heterogeneity, large-scale vs. limited presentation space
1. Holistic property matching
 Heterogeneous properties
 Label, local name & value similarity, combined with logistic regression
 Property cliques for multiple entities
 restrict each property to match at most one other property
 choosing the pairs with the highest match probability estimates may lead to conflicts
 Holistic property matching
 maximize the overall match probability
estimate among all matched property pairs
 s.t. 1:1 matching constraint is satisfied
 NP-hard (3-dimensional assignment)
 Greedy algorithm
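The slide does not spell out the greedy algorithm, so the following is only one plausible sketch of it. It assumes that candidate pairs arrive as `((entity, property), (entity, property), probability)` tuples, that a hypothetical threshold of 0.5 filters weak pairs, and that a pair is accepted only if the merged clique still holds at most one property per entity.

```python
def greedy_holistic_matching(pairs, threshold=0.5):
    """Greedily accept high-probability pairs without breaking the 1:1 constraint.

    pairs: iterable of ((entity, prop), (entity, prop), probability).
    Returns (accepted pairs, cliques), each clique being {entity: prop}.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    members = {}  # clique root -> {entity: prop}
    accepted = []
    for a, b, prob in sorted(pairs, key=lambda t: t[2], reverse=True):
        if prob < threshold:
            break  # all remaining pairs are weaker
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # already in the same clique
        clique_a = members.setdefault(root_a, {root_a[0]: root_a[1]})
        clique_b = members.setdefault(root_b, {root_b[0]: root_b[1]})
        if clique_a.keys() & clique_b.keys():
            continue  # merging would put two properties of one entity together
        parent[root_b] = root_a
        clique_a.update(clique_b)
        del members[root_b]
        accepted.append((a, b, prob))
    cliques = [m for m in members.values() if len(m) > 1]
    return accepted, cliques


# Hypothetical match probabilities for the running example.
example_pairs = [
    (("e1", "givenName"), ("e2", "alias"), 0.9),
    (("e2", "alias"), ("e3", "altLabel"), 0.8),
    (("e1", "givenName"), ("e3", "rdfs:label"), 0.7),  # conflicts with altLabel
    (("e1", "rdfs:label"), ("e3", "rdfs:label"), 0.6),
    (("e1", "birthPlace"), ("e2", "genre"), 0.2),      # below the threshold
]
accepted, cliques = greedy_holistic_matching(example_pairs)
```

In this toy run the third pair is rejected because the clique that contains `givenName` already holds a property of e3, which is the kind of conflict that picking pairs independently would not catch.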
2. Goodness measurement
 Goodness of property cliques
1. Discriminability: a property clique that holds completely different or exactly identical values for all the entities may not be good
2. Abundance: a property clique whose values
are largely missing may be less convincing
3. Semantics gives extra scores to the ones
particularly useful, e.g., owl:sameAs
4. Diversity evaluates the redundancy between
different property cliques (MMR)
 2-phase combination: (discriminability + abundance + semantics) + diversity
 Goodness of values
 Longer length, less redundancy
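The two-phase combination above can be sketched as follows. The exact scoring functions are not given on the slide, so everything concrete here is an assumption: the tent-shaped discriminability curve, the placeholder weights, the Jaccard overlap used as redundancy, and the MMR trade-off `lam`.

```python
def abundance(values):
    """Fraction of entities that have a value in this clique (None = missing)."""
    return sum(v is not None for v in values) / len(values)


def discriminability(values):
    """Assumed tent curve: 0 if all values are identical or all are distinct."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return 0.0
    # x = 0 when every value is identical, x = 1 when every value is distinct.
    x = (len(set(present)) - 1) / (len(present) - 1)
    return 1.0 - abs(2.0 * x - 1.0)


def base_goodness(values, semantic_bonus=0.0, weights=(0.5, 0.3, 0.2)):
    """Phase 1: weighted sum of discriminability, abundance and semantics."""
    w_d, w_a, w_s = weights  # placeholder weights, not taken from the paper
    return (w_d * discriminability(values)
            + w_a * abundance(values)
            + w_s * semantic_bonus)


def jaccard(values_a, values_b):
    """Overlap of two cliques' value sets, used here as a redundancy proxy."""
    set_a = {v for v in values_a if v is not None}
    set_b = {v for v in values_b if v is not None}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mmr_rank(cliques, scores, lam=0.7):
    """Phase 2: maximal marginal relevance re-ranking for diversity."""
    remaining, ranked = list(cliques), []
    while remaining:
        def mmr(name):
            redundancy = max(
                (jaccard(cliques[name], cliques[s]) for s in ranked),
                default=0.0,
            )
            return lam * scores[name] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Toy cliques: "label" duplicates "name", so MMR pushes it below "dob".
toy = {
    "name": ["Eazy-E", "Eazy-E", "Lil Eazy-E"],
    "label": ["Eazy-E", "Eazy-E", "Lil Eazy-E"],
    "dob": ["1963-9-7", "1963-9-7", "1984-4-23"],
}
ranking = mmr_rank(toy, {"name": 0.9, "label": 0.85, "dob": 0.8})
```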
[Figures: discriminability vs. proportion of distinct values; goodness vs. proportion of entities and proportion of distinct values]
3. Comparative table generation
 Property clique selection
 Greedy method
 Given the maximal number of property cliques in a comparative table, simply
select the top property cliques with the best goodness
 cannot guarantee that each entity is described by at least several properties
 Optimal property clique selection
with entity coverage constraint
 NP-hard (set cover)
 𝐻(𝑁)-approximation
 Value selection
 model it based on the classic 0/1 knapsack
problem with a table cell size constraint
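Both selection steps can be sketched with textbook algorithms: a greedy heuristic in the style of weighted set cover for property cliques, and the classic 0/1 knapsack dynamic program for values. The cost model (cost = 1 / goodness), the `min_cover` parameter and the use of string length as value size are assumptions made for this sketch, not details confirmed by the slides.

```python
def select_cliques(coverage, goodness, min_cover=1):
    """Greedy cover: every entity should appear in at least min_cover cliques.

    coverage: clique -> set of entities that have a value in it.
    goodness: clique -> positive score (assumed cost is 1 / goodness).
    """
    need = {e: min_cover for ents in coverage.values() for e in ents}
    chosen, candidates = [], set(coverage)
    while any(need.values()) and candidates:
        def gain_per_cost(clique):
            gain = sum(1 for e in coverage[clique] if need[e] > 0)
            return gain * goodness[clique]  # gain / (1 / goodness)
        best = max(sorted(candidates), key=gain_per_cost)
        if gain_per_cost(best) == 0:
            break  # the coverage constraint cannot be satisfied
        chosen.append(best)
        candidates.remove(best)
        for e in coverage[best]:
            need[e] = max(0, need[e] - 1)
    return chosen


def select_values(values, capacity):
    """0/1 knapsack: maximize total goodness within a cell size budget.

    values: list of (text, goodness); the size of a value is len(text).
    """
    # best[c] = (total goodness, chosen texts) using at most c characters.
    best = [(0.0, [])] * (capacity + 1)
    for text, score in values:
        size = len(text)
        # Iterate capacities downwards so each value is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + score
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + [text])
    return best[capacity][1]


# Toy inputs with hypothetical goodness scores.
coverage = {
    "name": {"e1", "e2", "e3"},
    "type": {"e1", "e2", "e3"},
    "dob": {"e2", "e3"},
    "sameAs": {"e1"},
}
goodness = {"name": 0.2, "type": 0.4, "dob": 0.5, "sameAs": 0.9}
chosen = select_cliques(coverage, goodness, min_cover=2)

cell = select_values(
    [("Eric Wright", 0.6), ("Eazy-E", 0.5), ("Eric Lynn Wright", 0.9)],
    capacity=18,
)
```

The early `break` in `select_cliques` covers the case where no remaining clique can reduce the outstanding coverage demand.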
Outline
 Introduction
 Our approach
 Experiments and results
 Test on holistic property matching
 Test on property clique ranking
 Test on human intervention
 Conclusion
Test on holistic property matching
 Quality of matched property pairs
 “Official” property matches
 Label others by 3 graduate students
 484 matches, 1397 non-matches
 Quality of derived property cliques
 Compute connected components
 135 reference property cliques
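Deriving cliques from matched pairs by connected components can be sketched as a plain graph traversal. The prefixed property names below are hypothetical, and the function is an illustration rather than the evaluation code used in the experiments.

```python
from collections import defaultdict


def property_cliques(matched_pairs):
    """Group properties into cliques = connected components of the match graph."""
    graph = defaultdict(set)
    for a, b in matched_pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen, cliques = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collecting one component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        cliques.append(component)
    return cliques


reference_pairs = [
    ("dbp:givenName", "fb:alias"),
    ("fb:alias", "wd:altLabel"),
    ("dbp:birthDate", "fb:date_of_birth"),
]
reference_cliques = property_cliques(reference_pairs)
```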
 MER tasks
 10 popular domains, 25 DBpedia entities per domain as seeds
 Wikipedia disambiguation page, 2~4 Freebase, Wikidata, YAGO entities
 randomly select 10 entities to constitute an MER task
 250 tasks, 804 distinct real objects
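Quality of matched property pairs (classifiers)
System | Precision | Recall | F1-score
CTab (LR) | 0.868 | 0.727 | 0.791
LinReg | 0.824 | 0.73 | 0.773
DecTree | 0.893 | 0.669 | 0.763
SVM | 0.877 | 0.706 | 0.782

Quality of matched property pairs (Falcon, LogMap)
System | Precision | Recall | F1-score
CTab (LR) | 0.868 | 0.727 | 0.791
Falcon | 0.983 | 0.233 | 0.377
LogMap | 0.97 | 0.066 | 0.124

Quality of derived property cliques
System | NMI | Purity | V-measure
CTab | 0.789 | 0.869 | 0.787
K-medoids | 0.558 | 0.64 | 0.548
DBSCAN | 0.65 | 0.741 | 0.648
APCluster | 0.43 | 0.55 | 0.422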
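The F1-scores on this slide are the harmonic mean of precision and recall, which can be checked directly from the reported numbers. The pairing of values to systems below follows the order of the chart labels and is therefore an inference from the slide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported on the slide.
reported = {
    "CTab (LR)": (0.868, 0.727),
    "Falcon": (0.983, 0.233),
    "LogMap": (0.970, 0.066),
}
for system, (p, r) in reported.items():
    print(f"{system}: F1 = {f1_score(p, r):.3f}")
# CTab (LR): F1 = 0.791
# Falcon: F1 = 0.377
# LogMap: F1 = 0.124
```

Falcon and LogMap reach very high precision but low recall, which is why their F1-scores fall well below that of CTab (LR).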
Test on property clique ranking
1. Directly rank ref. property cliques
 Assess property clique derivation &
ranking together
 The Hausdorff version of the Kendall tau distance
 treat property clique rankings as
partial rankings of properties (the
properties with the same grade
and in the same clique are tied)
 Ablation study
 3 experienced humans score property cliques in each task
 Highly-useful (3), fairly-useful (2), marginally-useful (1) and useless (0)
 Comparative systems
 FACES (list) [Gunaratna et al., AAAI’15]
 C3D+P (pairwise) [Cheng et al., JWS’15]
 CTab, CTab (entropy), CTab (greedy)
Ranking results (P@1, P@5, P@10 and nDCG@5 use reference property cliques)
System | P@1 | P@5 | P@10 | nDCG@5 | KHaus
FACES | 0.176 | 0.310 | 0.290 | 0.239 | 0.753
C3D+P | 0.040 | 0.347 | 0.511 | 0.154 | 0.647
CTab (entropy) | 0.180 | 0.178 | 0.184 | 0.092 | 0.811
CTab (greedy) | 0.632 | 0.660 | 0.615 | 0.684 | 0.647
CTab | 0.756 | 0.754 | 0.643 | 0.798 | 0.615

Ablation study (KHaus)
System | Discr. | Abund. | Sem. | w/o Div. | Good
CTab (greedy) | 0.678 | 0.686 | 0.673 | 0.655 | 0.647
CTab | 0.675 | 0.633 | 0.815 | 0.618 | 0.615
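For readers unfamiliar with the ranking metrics, the sketch below computes P@k and nDCG@k from graded usefulness scores (0 to 3). The slides do not state which grades count as relevant for P@k or which DCG gain function was used, so the `relevant_from=2` cut-off and the exponential gain are assumptions.

```python
import math


def precision_at_k(grades, k, relevant_from=2):
    """Share of the top-k items whose grade reaches the assumed cut-off."""
    return sum(g >= relevant_from for g in grades[:k]) / k


def dcg_at_k(grades, k):
    """Discounted cumulative gain with the common (2^grade - 1) gain."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 2)
        for rank, g in enumerate(grades[:k])
    )


def ndcg_at_k(grades, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Hypothetical human grades for one ranked list of property cliques.
grades = [3, 2, 0, 1, 2]
print(precision_at_k(grades, 5), round(ndcg_at_k(grades, 5), 3))
```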
Test on human intervention
 60 graduate students (top-5/top-10), 30 orthogonal tasks per human, 100RMB
 Task difficulty does not differ statistically significantly among FACES, C3D+P, CTab
1. Completion time
2. Precision
 Break the entities in each
entity group down to pairs
3. Human scoring and comments
 For CTab, the minimum number of cover times could not always be satisfied
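Breaking entity groups down into pairs for the precision measure can be sketched as below. The helper names are hypothetical, and returning 1.0 when a participant forms no pairs at all is a convention chosen here, not something stated on the slide.

```python
from itertools import combinations


def group_pairs(groups):
    """All unordered entity pairs that share a group."""
    return {
        frozenset(pair)
        for group in groups
        for pair in combinations(sorted(group), 2)
    }


def pairwise_precision(predicted_groups, reference_groups):
    """Fraction of predicted same-group pairs that are correct."""
    predicted = group_pairs(predicted_groups)
    reference = group_pairs(reference_groups)
    if not predicted:
        return 1.0  # convention: no predicted pairs means no wrong pairs
    return len(predicted & reference) / len(predicted)


# Running example: e2 and e3 denote the same person, e1 does not.
reference = [{"e1"}, {"e2", "e3"}]
print(pairwise_precision([{"e1", "e2", "e3"}], reference))  # one of three pairs is correct
print(pairwise_precision([{"e1"}, {"e2", "e3"}], reference))
```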
Measure | FACES (L) | C3D+P (P) | CTab (T) | p-value | Post-hoc
Top-5 Time (s) | 153 | 208 | 96 | 0.01% | P < L < T
Top-5 Prec. | 0.63 | 0.69 | 0.77 | 0.07% | L, P < T
Top-10 Time (s) | 175 | 180 | 131 | 1.13% | L, P < T
Top-10 Prec. | 0.79 | 0.77 | 0.80 | 69.8% |

Questions [from 1: “totally disagree” to 5: “totally agree”] | FACES (L) | C3D+P (P) | CTab (T) | p-value | Post-hoc
Q1. The system provided adequate information of entities. | 3.11 | 3.17 | 3.70 | 0.76% | L, P < T
Q2. The system provided unsuperfluous information of entities. | 2.67 | 3.30 | 3.23 | 4.46% | L < T, P
Q3. The system helped me easily compare entities of interest. | 2.43 | 3.37 | 4.00 | < 0.01% | L < P < T
Q4. I found the system easy to use. | 3.00 | 3.13 | 3.70 | 2.28% | L, P < T
Conclusion
 Main contributions
1. Discovery of matched property cliques
2. Scoring functions to measure the goodness of property cliques and values
3. Optimal comparative table generation with the entity coverage constraint
 An 𝐻(𝑁)-approximation algorithm
4. Comparison to state-of-the-art methods and user study
 Accuracy of matched properties, effectiveness of goodness measures and user
satisfaction of comparative tables for MER
 Future work
 Combine comparative tables with other presentation enhancements
 Extend to other areas such as knowledge base summarization
Datasets & source code: http://ws.nju.edu.cn/ctab/
Acknowledgements
 National Natural Science Foundation of China (No. 61772264)
 Collaborative Innovation Center of Novel Software Technology and Industrialization
Thank you for your time!
SIGIR’18, July 8–12, Ann Arbor, MI, USA

More Related Content

Similar to Automated Comparative Table Generation for Facilitating Human Intervention in Multi-Entity Resolution

EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...Holistic Benchmarking of Big Linked Data
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with RJeffrey Breen
 
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...Dhruvil Badani
 
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...Dhruvil Badani
 
Social Network Analysis, Semantic Web and Learning Networks
Social Network Analysis, Semantic Web and Learning NetworksSocial Network Analysis, Semantic Web and Learning Networks
Social Network Analysis, Semantic Web and Learning NetworksRory Sie
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus typejins0618
 
2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...
2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...
2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...asahiushio1
 
Question Answering with Lydia
Question Answering with LydiaQuestion Answering with Lydia
Question Answering with LydiaJae Hong Kil
 
Anthem Ayn Rand Essay. Book quot;Anthemquot; by Ayn Rand Review Free Essay S...
Anthem Ayn Rand Essay.  Book quot;Anthemquot; by Ayn Rand Review Free Essay S...Anthem Ayn Rand Essay.  Book quot;Anthemquot; by Ayn Rand Review Free Essay S...
Anthem Ayn Rand Essay. Book quot;Anthemquot; by Ayn Rand Review Free Essay S...Heidi Marshall
 

Similar to Automated Comparative Table Generation for Facilitating Human Intervention in Multi-Entity Resolution (12)

EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with R
 
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
 
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
Visible Partisanship Convolutional Neural Networks for the Analysis of Politi...
 
Integrating Conflicting Data_PVERConf_May2011
Integrating Conflicting Data_PVERConf_May2011Integrating Conflicting Data_PVERConf_May2011
Integrating Conflicting Data_PVERConf_May2011
 
Social Network Analysis, Semantic Web and Learning Networks
Social Network Analysis, Semantic Web and Learning NetworksSocial Network Analysis, Semantic Web and Learning Networks
Social Network Analysis, Semantic Web and Learning Networks
 
2015 07-tuto2-clus type
2015 07-tuto2-clus type2015 07-tuto2-clus type
2015 07-tuto2-clus type
 
2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...
2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...
2021-04, EACL, T-NER: An All-Round Python Library for Transformer-based Named...
 
Question Answering with Lydia
Question Answering with LydiaQuestion Answering with Lydia
Question Answering with Lydia
 
Topical_Facets
Topical_FacetsTopical_Facets
Topical_Facets
 
TRank ISWC2013
TRank ISWC2013TRank ISWC2013
TRank ISWC2013
 
Anthem Ayn Rand Essay. Book quot;Anthemquot; by Ayn Rand Review Free Essay S...
Anthem Ayn Rand Essay.  Book quot;Anthemquot; by Ayn Rand Review Free Essay S...Anthem Ayn Rand Essay.  Book quot;Anthemquot; by Ayn Rand Review Free Essay S...
Anthem Ayn Rand Essay. Book quot;Anthemquot; by Ayn Rand Review Free Essay S...
 

Recently uploaded

Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝soniya singh
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSebastiano Panichella
 
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
LANDMARKS  AND MONUMENTS IN NIGERIA.pptxLANDMARKS  AND MONUMENTS IN NIGERIA.pptx
LANDMARKS AND MONUMENTS IN NIGERIA.pptxBasil Achie
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...NETWAYS
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...henrik385807
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Pooja Nehwal
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Kayode Fayemi
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxFamilyWorshipCenterD
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024eCommerce Institute
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Krijn Poppe
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxmavinoikein
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...NETWAYS
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024eCommerce Institute
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...NETWAYS
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfhenrik385807
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfhenrik385807
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...NETWAYS
 

Recently uploaded (20)

Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
Call Girls in Sarojini Nagar Market Delhi 💯 Call Us 🔝8264348440🔝
 
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with AerialistSimulation-based Testing of Unmanned Aerial Vehicles with Aerialist
Simulation-based Testing of Unmanned Aerial Vehicles with Aerialist
 
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
LANDMARKS  AND MONUMENTS IN NIGERIA.pptxLANDMARKS  AND MONUMENTS IN NIGERIA.pptx
LANDMARKS AND MONUMENTS IN NIGERIA.pptx
 
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
OSCamp Kubernetes 2024 | SRE Challenges in Monolith to Microservices Shift at...
 
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
CTAC 2024 Valencia - Sven Zoelle - Most Crucial Invest to Digitalisation_slid...
 
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
Navi Mumbai Call Girls Service Pooja 9892124323 Real Russian Girls Looking Mo...
 
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
Governance and Nation-Building in Nigeria: Some Reflections on Options for Po...
 
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptxGenesis part 2 Isaiah Scudder 04-24-2024.pptx
Genesis part 2 Isaiah Scudder 04-24-2024.pptx
 
George Lever - eCommerce Day Chile 2024
George Lever -  eCommerce Day Chile 2024George Lever -  eCommerce Day Chile 2024
George Lever - eCommerce Day Chile 2024
 
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
Presentation for the Strategic Dialogue on the Future of Agriculture, Brussel...
 
Work Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptxWork Remotely with Confluence ACE 2.pptx
Work Remotely with Confluence ACE 2.pptx
 
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
Open Source Camp Kubernetes 2024 | Monitoring Kubernetes With Icinga by Eric ...
 
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
Andrés Ramírez Gossler, Facundo Schinnea - eCommerce Day Chile 2024
 
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
OSCamp Kubernetes 2024 | A Tester's Guide to CI_CD as an Automated Quality Co...
 
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdfOpen Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
Open Source Strategy in Logistics 2015_Henrik Hankedvz-d-nl-log-conference.pdf
 
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Rohini Delhi 💯Call Us 🔝8264348440🔝
 
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
Night 7k Call Girls Noida Sector 128 Call Me: 8448380779
 
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls KolkataRussian Call Girls in Kolkata Vaishnavi 🤌  8250192130 🚀 Vip Call Girls Kolkata
Russian Call Girls in Kolkata Vaishnavi 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdfCTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
CTAC 2024 Valencia - Henrik Hanke - Reduce to the max - slideshare.pdf
 
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
OSCamp Kubernetes 2024 | Zero-Touch OS-Infrastruktur für Container und Kubern...
 

Automated Comparative Table Generation for Facilitating Human Intervention in Multi-Entity Resolution

  • 1. Jiacheng Huang, Wei Hu*, Haoxuan Li, Yuzhong Qu Nanjing University, China * Corresponding author: whu@nju.edu.cn Automated Comparative Table Generation for Facilitating Human Intervention in Multi-Entity Resolution SIGIR’18, July 8–12, Ann Arbor, MI, USA
  • 2. Outline  Introduction  Knowledge graph  (Crowd) Entity resolution  Related work  Our approach  Experiments and results  Conclusion 2Introduction ➤ Our approach ➤ Experiments and results ➤ Conclusion
  • 3. Knowledge graph (KG)  Knowledge graph (KG) is a knowledge base used by Google to enhance its search engine’s results  Other famous knowledge bases  DBpedia, Freebase, Wikidata, YAGO …  Linked Open Data (LOD) cloud  KGs have reached a scale in billions of entities!  Problem: Many different entities refer to the same real-world thing 3
  • 4. Entity Resolution  Entity resolution (ER): find different entities referring to the same  a.k.a. entity linkage, entity matching …  also widely studied in DB and NLP  resolve heterogeneity and achieve interoperability  Crowd ER  use humans, in addition to machines, to obtain the truths of ER tasks  Key issues  How to present a single ER task?  How to select “right” humans?  How to pick tasks under a budget?  …… 4 Little effort has been made on how to present the critical information (such as important properties and values) to help complete a task efficiently and accurately [Verroios et al., SIGMOD’17]
  • 5. Related work
      ◦ Multi-entity resolution (MER)
          1. Display multiple entities in the form of a list, just like what is typically seen in a Web search engine
          2. Use pairwise presentation: compare two entities at a time and align similar properties between them
      ◦ Pros & cons for MER
          1. List: humans have to remember and compare entities in mind
          2. Pairwise: focused, but difficult to scale
          ◦ Both lose transitivity & grouping info
      ◦ Running example: entities with similar properties & values
          ◦ e1 [dbp:Lil_Eazy-E] (146 property-values in total): rdf:type : Person, MusicalArtist; rdfs:label : Lil Eazy-E; owl:sameAs : fb:m.01wf_p_; birthDate : 1984-4-23; birthPlace : Compton; gender : male; genre : Gangsta rap, Hip hop; givenName : Eric Darnell Wright
          ◦ e2 [fb:m.01wf_p_] (1,253 property-values in total): alias : Eric Wright, Eazy-E; date_of_birth : 1963-9-7; gender : male; genre : gangsta rap, hip hop; name : Eazy-E; place_of_birth : Compton; profession : rapper, producer; type : person, music.artist
          ◦ e3 [wd:Q36804] (141 property-values in total): rdfs:label : Eazy-E; altLabel : Eric Lynn Wright; date_of_birth : 1963-9-7; desc : Gangsta rapper, producer; genre : gangsta rap; instance_of : human; occupation : musician, rapper; place_of_birth : Compton
      ◦ [Figure: the three entities shown as a list with truncated values, as a pairwise view of e1 and e2 with aligned properties and match / nonmatch options, and as a comparative table with a "group?" column]
  • 6. Outline
      ◦ Introduction
      ◦ Our approach
          ◦ Approach workflow
              1. Holistic property matching
              2. Goodness measurement
              3. Comparative table generation
      ◦ Experiments and results
      ◦ Conclusion
  • 7. Our approach: comparative table
      ◦ Comparative table
          ◦ arrange entities and properties as row and column headers, resp.
          ◦ assign values in cells
      ◦ Workflow
          1. Holistic property matching: similarity calculation → property clique derivation
          2. Goodness measurement: discriminability, abundance, semantics & diversity
          3. Comparative table generation: property clique selection → value selection
      ◦ Challenge: heterogeneity, large scale vs. limited presentation space
      ◦ [Figure: workflow. Multiple entities (e1, e2, e3) → holistic property matching → property cliques, e.g. {rdfs:label, name}, {givenName, alias, altLabel}, {rdf:type, type, instance_of} → goodness measurement → goodness scores, e.g. 0.2 {givenName, alias, altLabel}, 0.4 {rdf:type, type, instance_of}, 0.5 {birthDate, date_of_birth} → comparative table generation → comparative table → human intervention (divide into groups)]
      ◦ Example comparative table; columns: group? | givenName / alias / altLabel | rdf:type / type / instance_of | birthDate / date_of_birth / date_of_birth (the "group?" cell of each row is filled in by the human)
          ◦ e1 | Eric Darnell Wright | Person, MusicalArtist | 1984-4-23
          ◦ e2 | Eric Wright, Eazy-E | person, music.artist | 1963-9-7
          ◦ e3 | Eric Lynn Wright | human | 1963-9-7
  • 8. 1. Holistic property matching
      ◦ Heterogeneous properties
          ◦ Label, local name & value similarity, combined with logistic regression
      ◦ Property cliques for multiple entities
          ◦ restrict each property to match at most one other property
          ◦ choosing the pairs with the highest match probability estimates may lead to conflicts
      ◦ Holistic property matching
          ◦ maximize the overall match probability estimate among all matched property pairs
          ◦ s.t. the 1:1 matching constraint is satisfied
          ◦ NP-hard (3-dimensional assignment)
          ◦ Greedy algorithm
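The slide names a greedy algorithm but does not spell it out. The sketch below shows one plausible greedy strategy under stated assumptions: property pairs are visited in decreasing order of match probability, and two groups are merged only if the resulting clique contains at most one property per entity. The function name, the `threshold` parameter and the example probabilities are hypothetical and are not taken from the slides.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Prop = Tuple[str, str]  # (entity id, property name)


def greedy_property_cliques(
    pair_probs: Dict[Tuple[Prop, Prop], float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the slides
) -> List[FrozenSet[Prop]]:
    """Greedily merge property pairs into cliques under a 1:1 constraint."""
    parent: Dict[Prop, Prop] = {}
    members: Dict[Prop, Set[Prop]] = {}

    def find(p: Prop) -> Prop:
        # Union-find lookup; unseen properties start as singleton groups.
        if p not in parent:
            parent[p] = p
            members[p] = {p}
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Visit candidate pairs from most to least probable.
    for (a, b), prob in sorted(pair_probs.items(), key=lambda kv: -kv[1]):
        if prob < threshold:
            break
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        entities_a = {entity for entity, _ in members[ra]}
        entities_b = {entity for entity, _ in members[rb]}
        if entities_a & entities_b:
            # Merging would put two properties of one entity in the same clique.
            continue
        parent[rb] = ra
        members[ra] |= members.pop(rb)

    return [frozenset(group) for group in members.values() if len(group) > 1]


# Hypothetical match probabilities for the running example.
example_probs = {
    (("e1", "givenName"), ("e2", "alias")): 0.9,
    (("e2", "alias"), ("e3", "altLabel")): 0.8,
    (("e1", "rdfs:label"), ("e2", "alias")): 0.7,  # conflicts with the 0.9 pair
    (("e1", "rdfs:label"), ("e3", "rdfs:label")): 0.6,
    (("e1", "birthDate"), ("e2", "gender")): 0.1,
}
cliques = greedy_property_cliques(example_probs)
```

With these inputs the 0.7 pair is rejected because e1 already contributes `givenName` to that clique, which is the kind of conflict the slide warns about when pairs are chosen independently.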
  • 9. 2. Goodness measurement
      ◦ Goodness of property cliques
          1. Discriminability: a property clique that holds completely different or exactly identical values for all the entities may not be good
          2. Abundance: a property clique whose values are largely missing may be less convincing
          3. Semantics: gives extra scores to particularly useful properties, e.g., owl:sameAs
          4. Diversity: evaluates the redundancy between different property cliques (MMR)
          ◦ 2-phase combination: (discriminability + abundance + semantics) + diversity
      ◦ Goodness of values
          ◦ Longer length, less redundancy
      ◦ [Figure: plots of discriminability and goodness against the proportion of distinct values and the proportion of entities; three parameters set to 0.5, 0.3 and 0.2]
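The diversity phase refers to MMR (maximal marginal relevance). A minimal sketch of generic MMR re-ranking follows, assuming each clique already has a first-phase score and that some pairwise redundancy measure is available. The `trade_off` parameter, the clique labels and all numbers are hypothetical; the slides do not give the exact formula used in CTab.

```python
from typing import Callable, Dict, List, Optional


def mmr_rerank(
    scores: Dict[str, float],
    similarity: Callable[[str, str], float],
    trade_off: float = 0.7,  # hypothetical weight between score and redundancy
    k: Optional[int] = None,
) -> List[str]:
    """Order items by maximal marginal relevance."""
    remaining = set(scores)
    selected: List[str] = []
    limit = len(scores) if k is None else k

    while remaining and len(selected) < limit:

        def marginal(candidate: str) -> float:
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (similarity(candidate, s) for s in selected), default=0.0
            )
            return trade_off * scores[candidate] - (1 - trade_off) * redundancy

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=marginal)
        selected.append(best)
        remaining.remove(best)

    return selected


# Hypothetical first-phase scores and redundancy values.
clique_scores = {"labels": 0.9, "names": 0.85, "birth": 0.6}
pair_similarity = {frozenset({"labels", "names"}): 0.9}


def clique_similarity(a: str, b: str) -> float:
    return pair_similarity.get(frozenset({a, b}), 0.1)


ranking = mmr_rerank(clique_scores, clique_similarity, trade_off=0.5)
```

In this example the "names" clique has the second-highest score but is pushed to the end because it is highly redundant with "labels", which is already selected.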
  • 10. 3. Comparative table generation
      ◦ Property clique selection
          ◦ Greedy method
              ◦ Given the maximal number of property cliques in a comparative table, simply select the top property cliques with the best goodness
              ◦ cannot guarantee that each entity is described by at least several properties
          ◦ Optimal property clique selection with entity coverage constraint
              ◦ NP-hard (set cover)
              ◦ H(N)-approximation
      ◦ Value selection
          ◦ model it as the classic 0/1 knapsack problem with a table cell size constraint
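Value selection is described as a 0/1 knapsack problem. The sketch below is the textbook dynamic program under simple assumptions: each candidate value has a goodness score, its cost is its character length, and the cell size is a character budget. Separators between values are ignored, and the function name and example scores are hypothetical.

```python
from typing import List, Tuple


def select_values(values: List[Tuple[str, float]], cell_size: int) -> List[str]:
    """Pick the values with maximal total goodness that fit into one cell."""
    # best[c] = (total goodness, indices of chosen values) within capacity c
    best: List[Tuple[float, Tuple[int, ...]]] = [(0.0, ())] * (cell_size + 1)

    for index, (text, goodness) in enumerate(values):
        weight = len(text)
        # Iterate capacities downwards so each value is used at most once.
        for capacity in range(cell_size, weight - 1, -1):
            candidate = best[capacity - weight][0] + goodness
            if candidate > best[capacity][0]:
                best[capacity] = (candidate, best[capacity - weight][1] + (index,))

    return [values[i][0] for i in best[cell_size][1]]


# Hypothetical goodness scores for name-like values from the running example.
candidates = [("Eric Wright", 0.6), ("Eazy-E", 0.5), ("Eric Lynn Wright", 0.9)]
chosen = select_values(candidates, cell_size=18)
```

With an 18-character budget the two shorter values together (17 characters, total goodness 1.1) beat the single best value (16 characters, goodness 0.9), which is the trade-off the knapsack formulation captures.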
  • 11. Outline
      ◦ Introduction
      ◦ Our approach
      ◦ Experiments and results
          ◦ Test on holistic property matching
          ◦ Test on property clique ranking
          ◦ Test on human intervention
      ◦ Conclusion
  • 12. Test on holistic property matching
      ◦ MER tasks
          ◦ 10 popular domains, 25 DBpedia entities per domain as seeds
          ◦ Wikipedia disambiguation page, 2–4 Freebase, Wikidata, YAGO entities
          ◦ randomly select 10 entities to constitute an MER task
          ◦ 250 tasks, 804 distinct real objects
      ◦ Quality of matched property pairs
          ◦ “Official” property matches
          ◦ Others labeled by 3 graduate students
          ◦ 484 matches, 1397 non-matches
      ◦ Quality of derived property cliques
          ◦ Compute connected components
          ◦ 135 reference property cliques
      ◦ Results: matched property pairs, by classifier (Precision / Recall / F1-score)
          ◦ CTab (LR): 0.868 / 0.727 / 0.791
          ◦ LinReg: 0.824 / 0.73 / 0.773
          ◦ DecTree: 0.893 / 0.669 / 0.763
          ◦ SVM: 0.877 / 0.706 / 0.782
      ◦ Results: matched property pairs, against ontology matching tools (Precision / Recall / F1-score)
          ◦ CTab (LR): 0.868 / 0.727 / 0.791
          ◦ Falcon: 0.983 / 0.233 / 0.377
          ◦ LogMap: 0.97 / 0.066 / 0.124
      ◦ Results: derived property cliques (NMI / Purity / V-measure)
          ◦ CTab: 0.789 / 0.869 / 0.787
          ◦ K-medoids: 0.558 / 0.64 / 0.548
          ◦ DBSCAN: 0.65 / 0.741 / 0.648
          ◦ APCluster: 0.43 / 0.55 / 0.422
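As a quick sanity check on the reported figures, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it for three of the rows above; the helper name is hypothetical and only the standard formula is assumed.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported on the slide.
reported = {
    "CTab (LR)": (0.868, 0.727),
    "Falcon": (0.983, 0.233),
    "LogMap": (0.97, 0.066),
}

for system, (p, r) in reported.items():
    print(system, round(f1_score(p, r), 3))
# CTab (LR) 0.791
# Falcon 0.377
# LogMap 0.124
```

The recomputed values agree with the slide for these rows, and they make the contrast visible: the two ontology matching tools are very precise but recover only a small share of the matches.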
  • 13. Test on property clique ranking
      ◦ 3 experienced humans score property cliques in each task
          ◦ Highly-useful (3), fairly-useful (2), marginally-useful (1) and useless (0)
      ◦ Comparative systems
          ◦ FACES (list) [Gunaratna et al., AAAI’15]
          ◦ C3D+P (pairwise) [Cheng et al., JWS’15]
          ◦ CTab, CTab (entropy), CTab (greedy)
      1. Directly rank ref. property cliques
      2. Assess property clique derivation & ranking together
          ◦ The Hausdorff version of Kendall tau distance (KHaus)
          ◦ treat property clique rankings as partial rankings of properties (the properties with the same grade and in the same clique are tied)
      ◦ Results (P@1 / P@5 / P@10 / nDCG@5 use reference property cliques; KHaus listed last)
          ◦ FACES: 0.176 / 0.310 / 0.290 / 0.239; KHaus 0.753
          ◦ C3D+P: 0.040 / 0.347 / 0.511 / 0.154; KHaus 0.647
          ◦ CTab (entropy): 0.180 / 0.178 / 0.184 / 0.092; KHaus 0.811
          ◦ CTab (greedy): 0.632 / 0.660 / 0.615 / 0.684; KHaus 0.647
          ◦ CTab: 0.756 / 0.754 / 0.643 / 0.798; KHaus 0.615
      ◦ Ablation study, KHaus (Discr. / Abund. / Sem. / w/o Div. / Good)
          ◦ CTab (greedy): 0.678 / 0.686 / 0.673 / 0.655 / 0.647
          ◦ CTab: 0.675 / 0.633 / 0.815 / 0.618 / 0.615
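The nDCG@5 column is computed from the 0 to 3 usefulness grades. A minimal sketch of nDCG@k follows, assuming the common formulation with the grade itself as gain and a log2 rank discount; the slides do not state which gain function was used, so this variant is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(grades: Sequence[float], k: int) -> float:
    """Discounted cumulative gain of the first k grades, in ranked order."""
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(grades[:k], start=1)
    )


def ndcg_at_k(grades: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


# Grades of the property cliques in the order a system ranked them.
perfect = ndcg_at_k([3, 2, 1, 0], k=5)
swapped = ndcg_at_k([0, 3], k=5)
```

A ranking that lists cliques in descending grade order scores 1, while placing a useless clique ahead of a highly useful one lowers the score through the rank discount.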
  • 14. Test on human intervention
      ◦ 60 graduate students (top-5/top-10), 30 orthogonal tasks per human, 100 RMB
      ◦ Task difficulty does not differ significantly among FACES, C3D+P and CTab
      ◦ Measures
          1. Completion time
          2. Precision
              ◦ Break the entities in each entity group down to pairs
          3. Human scoring and comments
      ◦ For CTab, the least cover times could not always be satisfied
      ◦ Results (FACES (L) / C3D+P (P) / CTab (T); p-value; post-hoc)
          ◦ Top-5, Time (s): 153 / 208 / 96; 0.01%; P < L < T
          ◦ Top-5, Prec.: 0.63 / 0.69 / 0.77; 0.07%; L, P < T
          ◦ Top-10, Time (s): 175 / 180 / 131; 1.13%; L, P < T
          ◦ Top-10, Prec.: 0.79 / 0.77 / 0.80; 69.8%
      ◦ Questions, from 1: “totally disagree” to 5: “totally agree” (FACES (L) / C3D+P (P) / CTab (T); p-value; post-hoc)
          ◦ Q1. The system provided adequate information of entities. 3.11 / 3.17 / 3.70; 0.76%; L, P < T
          ◦ Q2. The system provided unsuperfluous information of entities. 2.67 / 3.30 / 3.23; 4.46%; L < T, P
          ◦ Q3. The system helped me easily compare entities of interest. 2.43 / 3.37 / 4.00; < 0.01%; L < P < T
          ◦ Q4. I found the system easy to use. 3.00 / 3.13 / 3.70; 2.28%; L, P < T
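The precision measure breaks each entity group down to pairs. The sketch below shows one way this could be computed, assuming a participant's answer and the ground truth are both given as entity-to-group assignments. The function names, the convention for an answer with no pairs, and the example data (including the extra entity e4) are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Set


def same_group_pairs(assignment: Dict[str, Hashable]) -> Set[FrozenSet[str]]:
    """All unordered entity pairs that share a group."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_precision(
    predicted: Dict[str, Hashable], gold: Dict[str, Hashable]
) -> float:
    """Share of predicted same-group pairs that are also same-group in gold."""
    predicted_pairs = same_group_pairs(predicted)
    if not predicted_pairs:
        return 0.0  # convention chosen here for answers without any pair
    return len(predicted_pairs & same_group_pairs(gold)) / len(predicted_pairs)


# A participant puts e1, e2 and e3 together; only e2 and e3 truly match.
answer = {"e1": "A", "e2": "A", "e3": "A", "e4": "B"}
truth = {"e1": 1, "e2": 2, "e3": 2, "e4": 3}
precision = pairwise_precision(answer, truth)
```

Here the answer yields three same-group pairs, of which only (e2, e3) is correct, so the pairwise precision is 1/3.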
  • 15. Outline
      ◦ Introduction
      ◦ Our approach
      ◦ Experiments and results
      ◦ Conclusion
  • 16. Conclusion
      ◦ Main contributions
          1. Discovery of matched property cliques
          2. Scoring functions to measure the goodness of property cliques and values
          3. Optimal comparative table generation with the entity coverage constraint
              ◦ An H(N)-approximation algorithm to obtain approximate solutions
          4. Comparison to state-of-the-art methods and user study
              ◦ Accuracy of matched properties, effectiveness of goodness measures and user satisfaction with comparative tables for MER
      ◦ Future work
          ◦ Combine comparative tables with other presentation enhancements
          ◦ Extend to other areas such as knowledge base summarization
  • 17. Datasets & source code: http://ws.nju.edu.cn/ctab/
      ◦ Acknowledgements
          ◦ National Natural Science Foundation of China (No. 61772264)
          ◦ Collaborative Innovation Center of Novel Software Technology and Industrialization
      ◦ Thank you for your time!
      ◦ SIGIR’18, July 8–12, Ann Arbor, MI, USA

Editor's Notes

  1. maximal marginal relevance
  2. ANOVA and LSD