Robust Group Linkage
Pei Li (1), Xin Luna Dong (2), Songtao Guo (3), Andrea Maurino (4), Divesh Srivastava (5)
(1) University of Zurich, (2) Google Inc., (3) LinkedIn, (4) University of Milan - Bicocca, (5) AT&T Labs - Research
Agenda
•  Motivations
•  Challenges
•  Two-Stage Solution
•  Pivot Identification
•  Group Linkage
•  Experiments
•  Related Work
•  Conclusions
Motivations
•  Group linkage: linking records that refer to different entities in the same group, rather than to the same entity.
•  social networks: grouping users by organization (e.g., LinkedIn)
•  search engines: identifying business chains (e.g., YellowPages)
•  Solution 1: require high value consistency (finds 0 chains)
•  Solution 2: match records with the same name (finds 1 chain)
•  Ground truth: 2 chains
Challenges: 1
Group linkage differs from record linkage:
•  learning weights for attributes falls short, since global and local values occur in the same attribute.
•  global phone of Swisscom: 0800 800 80X
•  local phone in Oerlikon branch: 0443 139 59X

Top-5 US business chains:
name             #store  #name   #phone  #URL  #catalog
SUBWAY           21,912  772     21,483  6     23
Bank of America  21,727  48      6,573   186   24
U-Haul           21,638  2,340   18,384  14    20
USPS             19,225  12,345  5,761   282   22
McDonald’s       17,289  2,401   16,607  568   47
Challenges: 2
Group linkage differs from record linkage:
•  it is non-trivial to distinguish global / local values from errors.
•  URL shared by 60 branches of Texas FBIns: txfb-ins.com ✔
•  URL shared by 2 branches: farmbureauinsurance-mi.com ✗
Challenges: 3
Group linkage differs from record linkage:
•  scalability is critical: a group can contain tens of thousands of members.
Two-stage Solution
Stage I:
•  identify sets of records that are highly likely to be in the same group, called pivots
•  collect strong evidence from the pivots, such as name and primary phone
Stage II:
•  cluster pivots and remaining records into groups
•  leverage the strong evidence from Stage I and be tolerant of local values
Example
ID   name              phone  state  URL domain
r1   Taco Casa                AL     tacocasa.com
r2   Taco Casa         900    AL     tacocasa.com
r3   Taco Casa         900    AL     tacocasa.com, tacocasatexas.com
r4   Taco Casa         900    AL
r5   Taco Casa         900    AL
r6   Taco Casa         701    TX     tacocasatexas.com
r7   Taco Casa         702    TX     tacocasatexas.com
r8   Taco Casa         703    TX     tacocasatexas.com
r9   Taco Casa         704    TX
r10  Elva’s Taco Casa         TX     tacodemar.com
...
Chain1: {r1-r5}, Chain2: {r6-r9}
Pivot
Stage I: identify subsets of records in the same group as pivots.
•  pivots contain highly similar records as strong evidence of a group.
•  pivots are robust in the presence of a few errors.
(Figure: the example table again, with a wrong URL flagged and two record sets, C1 and C2, marked.)
Pivot
•  Key idea:
•  Represent a set R of records as a similarity graph G;
•  A pivot is a connected sub-graph robust to a few node removals.
Similarity Graph
Undirected graph G: to represent a set R of records
•  a node represents a record r in R
•  two nodes are connected if they are very similar.
(Figure: similarity graph over records r1 to r10, with cliques C1, C2 and C3 marked.)
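A minimal sketch of how such a similarity graph could be built from the example records. The slides do not say what "very similar" means, so the rule in `very_similar` (same name, plus a shared phone or a shared URL domain) is purely an assumption for illustration, as are the function and variable names.

```python
from itertools import combinations

# Records from the example slide: (name, phone, state, URL domains).
RECORDS = {
    "r1": ("Taco Casa", None, "AL", {"tacocasa.com"}),
    "r2": ("Taco Casa", "900", "AL", {"tacocasa.com"}),
    "r3": ("Taco Casa", "900", "AL", {"tacocasa.com", "tacocasatexas.com"}),
    "r4": ("Taco Casa", "900", "AL", set()),
    "r5": ("Taco Casa", "900", "AL", set()),
    "r6": ("Taco Casa", "701", "TX", {"tacocasatexas.com"}),
    "r7": ("Taco Casa", "702", "TX", {"tacocasatexas.com"}),
    "r8": ("Taco Casa", "703", "TX", {"tacocasatexas.com"}),
    "r9": ("Taco Casa", "704", "TX", set()),
    "r10": ("Elva's Taco Casa", None, "TX", {"tacodemar.com"}),
}


def very_similar(a, b):
    """Hypothetical rule: same name, plus a shared phone or a shared URL domain."""
    same_name = a[0] == b[0]
    same_phone = a[1] is not None and a[1] == b[1]
    shared_url = bool(a[3] & b[3])
    return same_name and (same_phone or shared_url)


def build_similarity_graph(records):
    """Return an undirected graph as {record id: set of neighbor ids}."""
    graph = {rid: set() for rid in records}
    for u, v in combinations(records, 2):
        if very_similar(records[u], records[v]):
            graph[u].add(v)
            graph[v].add(u)
    return graph


graph = build_similarity_graph(RECORDS)
```

Under this assumed rule, r3 is the only record that links the Alabama records to the Texas ones (through its second URL), while r9 and r10 stay isolated.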
Pivot
A pivot is a connected sub-graph that is robust against a few
node removals.
Definition 1 (k-robustness): A graph G is k-robust if G is still connected after removing any k nodes. A clique is defined to be k-robust for any k.
(Figure: a graph over r1 to r8 that is not 1-robust.)
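The definition of k-robustness can be checked directly by brute force on small graphs. The sketch below is only an illustration: it assumes graphs are given as adjacency sets without self-loops, and it reads "removing k nodes" as "removing any set of at most k nodes", which is an interpretation rather than something the slide states. The function names are hypothetical.

```python
from itertools import combinations


def is_connected(graph, removed=frozenset()):
    """Check connectivity of the graph after dropping the nodes in `removed`."""
    nodes = [n for n in graph if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)


def is_clique(graph):
    """Every node is adjacent to every other node (no self-loops assumed)."""
    return all(len(graph[u]) == len(graph) - 1 for u in graph)


def is_k_robust(graph, k):
    """Brute-force test: cliques are k-robust by definition; otherwise the
    graph must stay connected after removing any set of at most k nodes."""
    if is_clique(graph):
        return True
    return all(
        is_connected(graph, frozenset(removed))
        for size in range(k + 1)
        for removed in combinations(graph, size)
    )


# Two triangles that share node "c": removing "c" disconnects the graph.
bowtie = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
# A cycle of four nodes survives any single removal but not every pair.
cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
```

The enumeration is exponential in k, so this is useful only as a reference for checking faster implementations on toy inputs.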
Pivot
We partition a graph G into a set of maximal k-robust sub-graphs.
Maximal k-robust partitioning of G: a partitioning of G into sub-graphs such that (1) each sub-graph is k-robust; (2) the result of merging any sub-graphs is not k-robust.
(Figure: the graph over r1 to r8 and its maximal 1-robust partitioning.)
Pivot
Definition (k-pivot): Records that belong to the same sub-graph in every maximal k-robust partitioning of G form a k-pivot of R. A pivot contains at least 2 records.
(Figure: the graph over r1 to r8, its maximal 1-robust partitioning, and the resulting pivot.)
Pivot Algorithm
•  Finding pivots in G can be reduced to the max-flow problem.
•  O(n^2.5), where n is the number of nodes in G
•  To improve scalability:
•  represent G by a simplified inverted index
•  Screening: reduce search space from G to sub-graphs in G.
•  considering unions of cliques in G as a whole
•  splitting sets of unions in G into sub-graphs
•  Apply Max-flow algorithm only on sub-graphs of G.
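To illustrate why max-flow is relevant here, the sketch below tests k-robustness with the textbook node-splitting construction: by Menger's theorem, a non-complete graph stays connected after removing any k nodes exactly when every pair of non-adjacent nodes is joined by more than k internally vertex-disjoint paths. This is a standard technique swapped in for illustration, not necessarily the authors' reduction, and it uses simple BFS augmentation rather than an algorithm achieving the bound quoted on the slide. All names are hypothetical.

```python
from collections import deque
from itertools import combinations


def vertex_disjoint_paths(adj, s, t):
    """Number of internally vertex-disjoint paths between s and t (max-flow)."""
    cap = {}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # residual (reverse) edge

    # Split every node v into (v, "in") -> (v, "out") with capacity 1,
    # so that at most one path can pass through it.
    for v in adj:
        add_edge((v, "in"), (v, "out"), len(adj) if v in (s, t) else 1)
        for w in adj[v]:
            add_edge((v, "out"), (w, "in"), 1)

    source, sink = (s, "out"), (t, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Push one unit of flow along the path found.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


def is_k_robust(adj, k):
    """Cliques are k-robust by definition; otherwise every non-adjacent pair
    needs more than k vertex-disjoint paths."""
    non_adjacent = [(u, v) for u, v in combinations(adj, 2) if v not in adj[u]]
    if not non_adjacent:
        return True
    return all(vertex_disjoint_paths(adj, u, v) > k for u, v in non_adjacent)


cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
```

Running the test on every non-adjacent pair is expensive on large graphs, which is the motivation for the indexing and screening steps listed on this slide.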
Pivot Algorithm - Screening
•  k = 1
1. merge cliques into unions, which are k-robust.
2. split sets of unions into sub-graphs by their common nodes.
(Figure: step-by-step animation on the example graph over r1 to r10, from cliques to unions to sub-graphs, ending with the identified pivot.)
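Step 1 can be sketched with a union-find over cliques. The slides do not state the merge criterion; the sketch assumes that two cliques are merged when they share more than k nodes, which is a sufficient condition for the union to be k-robust (after removing any k nodes, at least one shared node remains and connects both cliques). The function name and the input cliques are illustrative, and step 2 (splitting by common nodes) is not covered.

```python
def merge_cliques_into_unions(cliques, k):
    """Merge cliques that share more than k nodes, transitively, into unions.

    Assumed criterion: sharing more than k nodes keeps the merged set
    connected after any k node removals.
    """
    parent = list(range(len(cliques)))

    def find(i):
        # Find the representative of clique i, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            if len(cliques[i] & cliques[j]) > k:
                parent[find(i)] = find(j)

    unions = {}
    for i, clique in enumerate(cliques):
        unions.setdefault(find(i), set()).update(clique)
    return list(unions.values())


# Hypothetical cliques over the example records.
cliques = [
    {"r1", "r2", "r3"},
    {"r2", "r3", "r4", "r5"},
    {"r3", "r6", "r7", "r8"},
]
unions = merge_cliques_into_unions(cliques, k=1)
```

With k = 1, the first two cliques share two nodes and merge, while the third shares only r3 with them and stays a separate union; r3 is then a common node between the two unions.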
Agenda
•  Motivations
•  Challenges
•  Two-Stage Solution
•  Pivot Identification
•  Group Linkage
•  Experiments
•  Related Work
•  Conclusions
31
Group Linkage
•  Stage II: clustering pivots and remaining records into groups
•  weight attribute values by their popularity within a group
•  penalize less for differing local attribute values within the same group
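One way to picture popularity-based weighting is sketched below. The slides give no formulas, so both functions are assumptions made for illustration: `value_popularity` weights a value by the fraction of group members that carry it, and `record_group_score` is a hypothetical score in which a value unseen in the group simply adds nothing, rather than counting against the record.

```python
from collections import Counter


def value_popularity(group, attribute):
    """Fraction of records in the group that carry each value of the attribute."""
    counts = Counter(v for record in group for v in record.get(attribute, ()))
    return {value: n / len(group) for value, n in counts.items()}


def record_group_score(record, group, attributes):
    """Hypothetical score: mean, over attributes, of the popularity of the
    record's best-matching value in the group."""
    scores = []
    for attribute in attributes:
        popularity = value_popularity(group, attribute)
        matches = [popularity.get(v, 0.0) for v in record.get(attribute, ())]
        # A value unseen in the group scores 0 instead of a negative penalty,
        # so local values (for example a branch phone) cost little.
        scores.append(max(matches, default=0.0))
    return sum(scores) / len(attributes)


# Chain1 from the example slide (r1 to r5); each attribute maps to a tuple of values.
chain1 = [
    {"name": ("Taco Casa",), "url": ("tacocasa.com",)},
    {"name": ("Taco Casa",), "phone": ("900",), "url": ("tacocasa.com",)},
    {"name": ("Taco Casa",), "phone": ("900",),
     "url": ("tacocasa.com", "tacocasatexas.com")},
    {"name": ("Taco Casa",), "phone": ("900",)},
    {"name": ("Taco Casa",), "phone": ("900",)},
]
r9 = {"name": ("Taco Casa",), "phone": ("704",)}
r10 = {"name": ("Elva's Taco Casa",), "url": ("tacodemar.com",)}
ATTRIBUTES = ("name", "phone", "url")
```

In this toy scoring, the name shared by all of Chain1 acts as strong evidence, the rarely seen URL tacocasatexas.com carries little weight, and r9's different phone does not push its score below that of r10, which shares nothing with the group.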
Group Linkage
(Figure: the example table of r1 to r10, shown step by step with the following annotations.)
•  reward strong evidence
•  apply weak evidence
•  low penalty on local values
Experiments
•  Datasets:
•  18M business listings
•  590 attendees of SIGMOD’98
•  Measurements:
•  effectiveness: Precision / Recall / F-measure
•  efficiency: runtime
dataset  # records  # groups (size > 1)  group size  # singletons
BizLow   2446       1                    2446        0
BizAvg   2062       30                   [2, 308]    503
BizHigh  1149       14                   [33, 269]   0
SIGMOD   590        71                   [2, 41]     162
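The slides do not say how precision and recall are counted for a grouping. A common choice for clustering output, assumed here, is to count pairs of records placed in the same group. The sketch below computes pairwise precision, recall and F-measure; the function names are hypothetical.

```python
from itertools import combinations


def same_group_pairs(groups):
    """All unordered record pairs that share a group."""
    pairs = set()
    for group in groups:
        pairs.update(frozenset(p) for p in combinations(sorted(group), 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall and F-measure of a predicted grouping."""
    pred, gold = same_group_pairs(predicted), same_group_pairs(truth)
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(gold) if gold else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Ground truth from the example slide: Chain1 = r1-r5, Chain2 = r6-r9.
truth = [
    {"r1", "r2", "r3", "r4", "r5"},
    {"r6", "r7", "r8", "r9"},
]
# A prediction that matches all records with the same name into one chain.
predicted = [{"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9"}]
precision, recall, f_measure = pairwise_scores(predicted, truth)
```

Merging both chains into one keeps every true pair (full recall) but adds 20 false pairs out of 36, which is the kind of precision loss the same-name solution suffers from.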
Overall Results
•  Our solution obtains the highest F-measure (above .95)
Contribution of Components
•  PIVOT improves precision over the baselines by 79%, at the cost of 34% lower recall
•  clustering without PIVOT obtains an F-measure comparable to the baselines, with high precision
•  clustering with PIVOT obtains the best results
Pivot Quality
•  the baseline has lower recall, since it uses stricter criteria to identify pivots
•  SCREEN obtains results similar to PIVOT
Parameter k
•  setting k in [1, 4] performs well on most datasets
Scalability
(Figure: execution time (sec.) vs. # of records (%) for NAIVE, INDEX, SINDEX, UNION and PIVOT.)
•  NAIVE: applying Max-flow in graph G
•  INDEX: using inverted index
•  SINDEX: using simplified inverted index
•  UNION: merging cliques into unions
•  PIVOT: splitting sets of unions into sub-graphs
Related Work
Record Similarity:
•  classification-based approaches [FS69]
•  distance-based approaches [D08]
•  rule-based approaches [HS98]
Two-stage Clustering:
•  single node as pivot [LA99, WB09]
•  bi-connected component as pivot [CNT07]
•  agglomerative clustering result as pivot [YIO+10]
Group Linkage:
•  computing similarity of pre-specified groups of records, using
•  record similarity [OKL+07]
•  network evolution analysis [H10]
Conclusions
•  Group linkage is important, and differs from record linkage.
•  It is critical to cluster records into groups in two stages.
•  It is important to be robust against errors in the group.
•  Our two-stage algorithm is empirically accurate, efficient and
scalable.
References
•  [D08]: D. Dey. Entity matching in heterogeneous databases: A logistic regression approach. Decis. Support Syst., 44:740–747, 2008.
•  [FS69]: I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
•  [HS98]: M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9–37, 1998.
•  [LA99]: B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In KDD, pages 16–22, 1999.
•  [WB09]: D. T. Wijaya and S. Bressan. Ricochet: A family of unconstrained algorithms for graph clustering. In DASFAA, pages 153–167, 2009.
•  [CNT07]: N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking stable clusters in the blogosphere. In VLDB, pages 806–817, 2007.
•  [YIO+10]: M. Yoshida, M. Ikeda, S. Ono, I. Sato, and H. Nakagawa. Person name disambiguation by bootstrapping. In SIGIR, pages 10–17, 2010.
•  [OKL+07]: B. W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In ICDE, pages 496–505, 2007.
•  [H10]: S. Huang. Mixed group discovery: Incorporating group linkage with alternatively consistent social network analysis. International Conference on Semantic Computing, 0:369–376, 2010.
Thank You!