group_linkage@www15
1. Robust Group Linkage
Pei Li (1), Xin Luna Dong (2), Songtao Guo (3), Andrea Maurino (4), Divesh Srivastava (5)
(1) University of Zurich, (2) Google Inc., (3) LinkedIn, (4) University of Milan – Bicocca, (5) AT&T Labs - Research
3. Motivations
• Group linkage: linking records that refer to multiple entities in
the same group, rather than to the same entity.
• social networks: to group users by organizations (e.g., LinkedIn)
• search engines: to identify business chains (e.g., YellowPages)
9. Challenges: 1
Group linkage differs from record linkage:
• learning weights for attributes falls short, since global and
local values occur in the same attribute.
  • global phone of Swisscom: 0800 800 80X
  • local phone in Oerlikon branch: 0443 139 59X
Top-5 US business chains:
name             #store   #name   #phone   #URL  #catalog
SUBWAY           21,912     772   21,483      6        23
Bank of America  21,727      48    6,573    186        24
U-Haul           21,638   2,340   18,384     14        20
USPS             19,225  12,345    5,761    282        22
McDonald’s       17,289   2,401   16,607    568        47
10. Challenges: 2
Group linkage differs from record linkage:
• it is non-trivial to distinguish global / local values from errors.
• URL shared by 60 branches of Texas FBIns: txfb-ins.com ✔
• URL shared by 2 branches: farmbureauinsurance-mi.com ✗
11. Challenges: 3
Group linkage differs from record linkage:
• scalability is critical: a group can contain tens of thousands of
members.
13. Two-stage Solution
Stage I:
• identify sets of records that are highly likely to be in the same
group, called pivots
• collect strong evidence from pivots, such as name and primary phone
Stage II:
• cluster pivots and remaining records into groups
• leverage the strong evidence from Stage I while tolerating
local values
14. Example
ID   name              phone  state  URL domain
r1   Taco Casa                AL     tacocasa.com
r2   Taco Casa         900    AL     tacocasa.com
r3   Taco Casa         900    AL     tacocasa.com, tacocasatexas.com
r4   Taco Casa         900    AL
r5   Taco Casa         900    AL
r6   Taco Casa         701    TX     tacocasatexas.com
r7   Taco Casa         702    TX     tacocasatexas.com
r8   Taco Casa         703    TX     tacocasatexas.com
r9   Taco Casa         704    TX
r10  Elva’s Taco Casa         TX     tacodemar.com
•••
Chain1: {r1-r5}, Chain2: {r6-r9}
15. Pivot
Stage I: identify a subset of records in the same group as pivots.
• pivots contain highly similar records as strong evidence of a group.
• pivots are robust in the presence of a few errors.
[Figure: the example table of records r1-r10 from the Example slide, annotated with "wrong URL" and with the labels C1 and C2]
16. Pivot
• Key idea:
• Represent a set R of records as a similarity graph G;
• A pivot is a connected sub-graph robust to a few node removals.
17. Similarity Graph
Undirected graph G: to represent a set R of records
• a node represents a record r in R
• two nodes are connected if they are very similar.
[Figure: similarity graph over records r1-r10, with cliques C1, C2, and C3 marked]
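To make the construction concrete, here is a minimal sketch that builds a similarity graph over the example records r1-r10. The slides do not say what "very similar" means, so the rule used here (same name plus a shared phone or URL domain) and the names `RECORDS`, `very_similar`, and `similarity_graph` are assumptions for illustration only.

```python
from itertools import combinations

# Records from the Example slide: (name, phone, state, URL domains).
RECORDS = {
    "r1": ("Taco Casa", None, "AL", {"tacocasa.com"}),
    "r2": ("Taco Casa", "900", "AL", {"tacocasa.com"}),
    "r3": ("Taco Casa", "900", "AL", {"tacocasa.com", "tacocasatexas.com"}),
    "r4": ("Taco Casa", "900", "AL", set()),
    "r5": ("Taco Casa", "900", "AL", set()),
    "r6": ("Taco Casa", "701", "TX", {"tacocasatexas.com"}),
    "r7": ("Taco Casa", "702", "TX", {"tacocasatexas.com"}),
    "r8": ("Taco Casa", "703", "TX", {"tacocasatexas.com"}),
    "r9": ("Taco Casa", "704", "TX", set()),
    "r10": ("Elva's Taco Casa", None, "TX", {"tacodemar.com"}),
}


def very_similar(a, b):
    """Hypothetical rule: same name and a shared phone or URL domain."""
    name_a, phone_a, _, urls_a = a
    name_b, phone_b, _, urls_b = b
    if name_a != name_b:
        return False
    same_phone = phone_a is not None and phone_a == phone_b
    return same_phone or bool(urls_a & urls_b)


def similarity_graph(records):
    """Return an undirected graph as a dict: record ID -> set of neighbor IDs."""
    graph = {rid: set() for rid in records}
    for u, v in combinations(records, 2):
        if very_similar(records[u], records[v]):
            graph[u].add(v)
            graph[v].add(u)
    return graph


GRAPH = similarity_graph(RECORDS)
print(sorted(GRAPH["r3"]))
```

Under this assumed rule, r3 is the only node that links the AL records to the TX records, because of its second URL. A single erroneous value can therefore connect two groups, which is the situation the robustness requirement on pivots is meant to handle.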
18. Pivot
A pivot is a connected sub-graph that is robust against a few
node removals.
Definition 1 (k-robustness): A graph G is k-robust if G is still
connected after removing any k nodes. A clique is
defined to be k-robust for any k.
[Figure: example graph over r1-r8 that is not 1-robust]
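Definition 1 can be checked directly on small graphs. The following is a brute-force sketch, not the algorithm used in the talk: it tries every removal of up to k nodes, which is exponential in k and only suitable for illustration. The function names and the two sample graphs (`BOWTIE`, `CYCLE`) are hypothetical.

```python
from itertools import combinations


def is_connected(graph, nodes):
    """True if the sub-graph induced by `nodes` is connected."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def is_clique(graph, nodes):
    """True if every pair of `nodes` is adjacent."""
    return all(v in graph[u] for u, v in combinations(nodes, 2))


def is_k_robust(graph, nodes, k):
    """Brute-force check of Definition 1 on the sub-graph induced by `nodes`."""
    nodes = set(nodes)
    if is_clique(graph, nodes):
        return True  # a clique is k-robust for any k, by definition
    # Try every removal of 0..k nodes; size 0 covers plain connectivity.
    for size in range(k + 1):
        for removed in combinations(nodes, size):
            if not is_connected(graph, nodes - set(removed)):
                return False
    return True


# Two triangles sharing node "c": removing "c" disconnects the graph.
BOWTIE = {
    "a": {"b", "c"}, "b": {"a", "c"},
    "c": {"a", "b", "d", "e"},
    "d": {"c", "e"}, "e": {"c", "d"},
}
# A 4-cycle survives any single removal but not every pair of removals.
CYCLE = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}

print(is_k_robust(BOWTIE, BOWTIE, 1))  # False
print(is_k_robust(CYCLE, CYCLE, 1))    # True
```

Checking all removals of up to k nodes, rather than exactly k, avoids a vacuous "still connected" result when k is close to the number of nodes.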
19. Pivot
We partition a graph G into a set of maximal k-robust sub-graphs.
Maximal k-robust partitioning of G: a partitioning of G into sub-graphs
such that (1) each sub-graph is k-robust; (2) merging any of the
sub-graphs yields a sub-graph that is not k-robust.
[Figure: a graph over r1-r8 and its maximal 1-robust partitionings]
20. Pivot
Definition (k-pivot): Records that belong to the same sub-graph
in every maximal k-robust partitioning of G form a k-pivot of R.
A pivot contains at least 2 records.
[Figure: the maximal 1-robust partitionings of the graph over r1-r8, and the resulting pivot]
21. Pivot Algorithm
• Finding pivots in G can be reduced to the Max-flow problem.
  • O(n^2.5), where n is the number of nodes in G
• To improve scalability:
  • represent G by a simplified inverted index
  • Screening: reduce the search space from G to sub-graphs in G.
    • considering unions of cliques in G as a whole
    • splitting sets of unions in G into sub-graphs
  • Apply the Max-flow algorithm only on sub-graphs of G.
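The slides do not spell out the reduction to Max-flow. As a hedged illustration of why flow is relevant, the sketch below uses the standard node-splitting construction: by Menger's theorem, the minimum number of nodes separating two non-adjacent nodes equals a maximum flow in a graph where every node has capacity 1. A non-complete graph is then k-robust exactly when its vertex connectivity exceeds k. This is a plain Edmonds-Karp implementation run over all non-adjacent pairs. It is not the O(n^2.5) algorithm from the talk, and all function names are assumptions.

```python
from collections import deque
from itertools import combinations


def max_flow(cap, s, t):
    """Edmonds-Karp on a residual-capacity dict of dicts (mutated in place)."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        # Walk back from t to find the bottleneck, then push flow along the path.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck


def local_connectivity(graph, s, t):
    """Minimum number of nodes whose removal separates non-adjacent s and t."""
    big = len(graph)  # larger than any possible node cut, so effectively infinite
    cap = {}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # residual (reverse) edge

    for v in graph:  # split each node into "in" and "out" halves
        add_edge((v, "in"), (v, "out"), big if v in (s, t) else 1)
    for u in graph:
        for v in graph[u]:
            add_edge((u, "out"), (v, "in"), big)
    return max_flow(cap, (s, "in"), (t, "out"))


def vertex_connectivity(graph):
    """Global vertex connectivity; n - 1 for a complete graph."""
    best = len(graph) - 1
    for s, t in combinations(graph, 2):
        if t not in graph[s]:
            best = min(best, local_connectivity(graph, s, t))
    return best


def is_k_robust_by_flow(graph, k):
    """k-robustness via connectivity; assumes symmetric adjacency, no self-loops."""
    n = len(graph)
    complete = all(len(nbrs) == n - 1 for nbrs in graph.values())
    return complete or vertex_connectivity(graph) > k


CYCLE = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(vertex_connectivity(CYCLE))  # 2
```

Running a separate flow computation for every non-adjacent pair is costly on large graphs, which is consistent with the motivation given above for screening: restrict the flow computation to small sub-graphs.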
22. Pivot Algorithm - Screening
• k = 1
[Figure: similarity graph over r1-r10 with its cliques highlighted]
1. merge cliques into unions, which are k-robust.
[Figure: the same graph with cliques merged into unions]
2. split sets of unions into sub-graphs by their common nodes.
[Figure: the sub-graph remaining after screening, labeled as the pivot]
32. Group Linkage
• Stage II: clustering pivots and remaining records into groups
  • weight attribute values by their popularity within a group
  • penalize differences in local attribute values less within the same group
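The weighting idea can be sketched as follows. The slides give no formula, so this is an assumed scoring scheme rather than the authors' method: each attribute value is weighted by the fraction of records in the group that carry it, and a record scores against a group by summing the weights of its matching values. The names `value_popularity` and `record_to_group_score`, and the dictionary layout of the records, are hypothetical.

```python
from collections import Counter


def value_popularity(group, attribute):
    """Fraction of records in the group that carry each value of the attribute."""
    counts = Counter()
    for record in group:
        for value in record.get(attribute, ()):
            counts[value] += 1
    return {value: n / len(group) for value, n in counts.items()}


def record_to_group_score(record, group, attributes):
    """Sum, over attributes, the popularity of the record's best-matching value."""
    score = 0.0
    for attribute in attributes:
        popularity = value_popularity(group, attribute)
        matches = [popularity[v] for v in record.get(attribute, ()) if v in popularity]
        # A mismatch contributes 0 and carries no extra penalty in this sketch.
        score += max(matches, default=0.0)
    return score


# The Texas records r6-r8 from the Example slide, used as a group.
TX_GROUP = [
    {"name": {"Taco Casa"}, "phone": {"701"}, "url": {"tacocasatexas.com"}},
    {"name": {"Taco Casa"}, "phone": {"702"}, "url": {"tacocasatexas.com"}},
    {"name": {"Taco Casa"}, "phone": {"703"}, "url": {"tacocasatexas.com"}},
]
R9 = {"name": {"Taco Casa"}, "phone": {"704"}}
R10 = {"name": {"Elva's Taco Casa"}, "url": {"tacodemar.com"}}
ATTRIBUTES = ["name", "phone", "url"]

print(record_to_group_score(R9, TX_GROUP, ATTRIBUTES))   # 1.0
print(record_to_group_score(R10, TX_GROUP, ATTRIBUTES))  # 0.0
```

In this sketch the name and URL shared by the whole group act as strong evidence (weight 1.0), while each local phone number has weight 1/3, so r9's different phone number costs it little and its matching name still ties it to the group.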
33. Group Linkage
[Figure: the example table of records r1-r10 from the Example slide, annotated in sequence with: "reward strong evidence", "apply weak evidence", "low penalty on local values"]
50. Scalability
[Figure: execution time (sec.) versus # of records (%), comparing NAIVE, INDEX, SINDEX, UNION, and PIVOT]
• NAIVE: applying Max-flow in graph G
• INDEX: using inverted index
• SINDEX: using simplified inverted index
• UNION: merging cliques into unions
• PIVOT: splitting sets of unions into sub-graphs
52. Related Work
Record Similarity:
• classification-based approaches [FS69]
• distance-based approaches [D08]
• rule-based approaches [HS98]
Two-stage Clustering:
• single node as pivot [LA99, WB09]
• bi-connected component as pivot [CNT07]
• agglomerative clustering result as pivot [YIO+10]
Group Linkage:
• computing similarity of pre-specified groups of records, using
• record similarity [OKL+07]
• network evolution analysis [H10]
53. Conclusions
• Group linkage is important, and differs from record linkage.
• It is critical to cluster records into groups in two stages.
• It is important to be robust against errors in the group.
• Our two-stage algorithm is empirically accurate, efficient and
scalable.
54. References
• [D08]: D. Dey. Entity matching in heterogeneous databases: A logistic regression approach. Decis. Support Syst., 44:740–747, 2008.
• [FS69]: I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
• [HS98]: M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9–37, 1998.
• [LA99]: B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In KDD, pages 16–22, 1999.
• [WB09]: D. T. Wijaya and S. Bressan. Ricochet: A family of unconstrained algorithms for graph clustering. In DASFAA, pages 153–167, 2009.
• [CNT07]: N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking stable clusters in the blogosphere. In VLDB, pages 806–817, 2007.
• [YIO+10]: M. Yoshida, M. Ikeda, S. Ono, I. Sato, and H. Nakagawa. Person name disambiguation by bootstrapping. In SIGIR, pages 10–17, 2010.
• [OKL+07]: B. W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In ICDE, pages 496–505, 2007.
• [H10]: S. Huang. Mixed group discovery: Incorporating group linkage with alternatively consistent social network analysis. International Conference on Semantic Computing, 0:369–376, 2010.