group_linkage@www15

Robust Group Linkage
Pei Li¹, Xin Luna Dong², Songtao Guo³, Andrea Maurino⁴, Divesh Srivastava⁵
¹University of Zurich, ²Google Inc., ³LinkedIn, ⁴University of Milan-Bicocca, ⁵AT&T Labs-Research

Agenda
•  Motivations
•  Challenges
•  Two-Stage Solution
•  Pivot Identification
•  Group Linkage
•  Experiments
•  Related Work
•  Conclusions

Motivations
•  Group linkage: linking records that refer to different entities in the same group, rather than to the same entity.
•  social networks: to group users by organizations (e.g., LinkedIn)
•  search engines: to identify business chains (e.g., YellowPages)

•  Solution 1: require high value consistency (result: 0 chain)
•  Solution 2: match records with the same name (result: 1 chain)

Challenges: 1
Group linkage differs from record linkage:
•  learning weights for attributes falls short, since global and local values occur in the same attribute.
•  global phone of Swisscom: 0800 800 80X
•  local phone in Oerlikon branch: 0443 139 59X

Top-5 US business chains:
name             #store  #name   #phone  #URL  #catalog
SUBWAY           21,912  772     21,483  6     23
Bank of America  21,727  48      6,573   186   24
U-Haul           21,638  2,340   18,384  14    20
USPS             19,225  12,345  5,761   282   22
McDonald’s       17,289  2,401   16,607  568   47

Challenges: 2
•  it is non-trivial to distinguish global / local values from errors.
•  URL shared by 60 branches of Texas FBIns: txfb-ins.com ✔
•  URL shared by 2 branches: farmbureauinsurance-mi.com ✗

Challenges: 3
•  scalability is critical: a group can contain tens of thousands of
members.

Two-stage Solution
Stage I:
•  identify records highly likely to be in the same group, called pivots
•  collect strong evidence from pivots, such as name and primary phone
Stage II:
•  cluster pivots and remaining records into groups
•  leverage strong evidence (from Stage I) and tolerate local values

Example
ID   name              phone  state  URL domain
r1   Taco Casa                AL     tacocasa.com
r2   Taco Casa         900    AL     tacocasa.com
r3   Taco Casa         900    AL     tacocasa.com, tacocasatexas.com
r4   Taco Casa         900    AL
r5   Taco Casa         900    AL
r6   Taco Casa         701    TX     tacocasatexas.com
r9   Taco Casa         704    TX
r10  Elva’s Taco Casa         TX     tacodemar.com
•••
Chain1: {r1-r5}, Chain2: {r6-r9}

Pivot
Stage I: identify a subset of records in the same group as pivots.
•  pivots contain highly similar records as strong evidence of a group.
•  pivots are robust in the presence of a few errors.
[Figure: example records r4, r5 and r9, with a wrong URL marked; labels C1 and C2.]

Pivot
•  Key idea:
•  Represent a set R of records as a similarity graph G;
•  A pivot is a connected sub-graph robust to a few node removals.

Similarity Graph
Undirected graph G: to represent a set R of records
•  a node represents a record r in R
•  two nodes are connected if they are very similar.
[Figure: similarity graph over records r1-r10, with cliques C1, C2 and C3 marked.]
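
The construction above can be sketched in a few lines. This is only an illustration: the slides do not say how "very similar" is decided, so the rule in very_similar (same name plus a shared phone or URL domain) and the record layout are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical records, loosely based on the Taco Casa example.
records = {
    "r2": {"name": "Taco Casa", "phone": "900", "urls": {"tacocasa.com"}},
    "r3": {"name": "Taco Casa", "phone": "900",
           "urls": {"tacocasa.com", "tacocasatexas.com"}},
    "r6": {"name": "Taco Casa", "phone": "701", "urls": {"tacocasatexas.com"}},
    "r9": {"name": "Taco Casa", "phone": "704", "urls": set()},
}

def very_similar(a, b):
    """Assumed rule: same name and a shared phone or URL domain."""
    if a["name"] != b["name"]:
        return False
    same_phone = bool(a["phone"]) and a["phone"] == b["phone"]
    shared_url = bool(a["urls"] & b["urls"])
    return same_phone or shared_url

def build_similarity_graph(records):
    """Return an undirected graph as {record id: set of neighbor ids}."""
    adj = {rid: set() for rid in records}
    for u, v in combinations(records, 2):
        if very_similar(records[u], records[v]):
            adj[u].add(v)
            adj[v].add(u)
    return adj

graph = build_similarity_graph(records)
```

Here r2 and r3 are linked by phone and URL, r3 and r6 by the shared URL domain, and r9 stays isolated.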

Pivot
A pivot is a connected sub-graph that is robust against a few
node removals.
Definition 1 (k-robustness): A graph G is k-robust if G is still connected after removing any k nodes. A clique is defined to be k-robust for any k.
[Figure: example graph over records r1-r8 that is not 1-robust.]
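
Definition 1 can be checked directly on small graphs. The sketch below is a brute-force test, not the algorithm from the talk; it assumes the graph is given as a dict of neighbor sets, and it interprets "removing k nodes" as removing up to k nodes. The function names are made up for illustration.

```python
from itertools import combinations

def is_connected(adj, nodes):
    """True if the sub-graph induced by `nodes` is connected (or empty)."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def is_clique(adj, nodes):
    return all(v in adj[u] for u, v in combinations(nodes, 2))

def is_k_robust(adj, k):
    """Brute force: try every removal of up to k nodes (assumed reading)."""
    nodes = set(adj)
    if is_clique(adj, nodes):
        return True  # cliques are k-robust for any k by definition
    for size in range(k + 1):
        for removed in combinations(nodes, size):
            if not is_connected(adj, nodes - set(removed)):
                return False
    return True

# A 4-cycle survives any single removal but not the removal of a and c.
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```

The cost grows combinatorially with k and the graph size, so this is only useful for checking small examples by hand.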

Pivot
We partition a graph G into a set of maximal k-robust sub-graphs.
Maximal k-robust partitioning of G: a partitioning of G into sub-graphs such that (1) each sub-graph is k-robust; (2) the result of merging any sub-graphs is not k-robust.
[Figure: example graph over records r1-r8 and its maximal 1-robust partitioning.]

Pivot
Definition (k-pivot): Records that belong to the same sub-graph in every maximal k-robust partitioning of G form a k-pivot of R. A pivot contains at least 2 records.
[Figure: example graph over records r1-r8, its maximal 1-robust partitionings, and the resulting pivot.]
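
To make the two definitions concrete, the sketch below enumerates every maximal k-robust partitioning of a tiny graph and keeps the records that always stay together. It is an exhaustive illustration only (exponential in the number of nodes), not the algorithm used in the talk; all function names are hypothetical, and "removing k nodes" is read as removing up to k nodes.

```python
from itertools import combinations

def connected(adj, nodes):
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for v in adj[stack.pop()] & nodes:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes

def k_robust(adj, nodes, k):
    """Brute-force k-robustness of the sub-graph induced by `nodes`."""
    nodes = frozenset(nodes)
    if all(v in adj[u] for u, v in combinations(nodes, 2)):
        return True  # cliques are k-robust for any k
    for size in range(k + 1):
        for removed in combinations(nodes, size):
            rest = nodes - set(removed)
            if rest and not connected(adj, rest):
                return False
    return True

def set_partitions(items):
    """Yield every partition of `items` as a list of frozensets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i, block in enumerate(part):
            yield part[:i] + [block | {first}] + part[i + 1:]
        yield part + [frozenset({first})]

def maximal_k_robust_partitionings(adj, k):
    result = []
    for part in set_partitions(sorted(adj)):
        if not all(k_robust(adj, block, k) for block in part):
            continue  # condition (1) fails
        mergeable = any(
            k_robust(adj, frozenset().union(*combo), k)
            for size in range(2, len(part) + 1)
            for combo in combinations(part, size)
        )
        if not mergeable:  # condition (2) holds
            result.append(part)
    return result

def k_pivots(adj, k):
    """Groups of at least 2 nodes that share a block in every partitioning."""
    partitionings = maximal_k_robust_partitionings(adj, k)
    groups = {}
    for node in adj:
        signature = tuple(next(b for b in part if node in b)
                          for part in partitionings)
        groups.setdefault(signature, set()).add(node)
    return sorted(sorted(g) for g in groups.values() if len(g) >= 2)

# Two triangles sharing node c: c can go with either side.
bowtie = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d", "e"},
          "d": {"c", "e"}, "e": {"c", "d"}}
pivots = k_pivots(bowtie, 1)
```

In this toy graph there are two maximal 1-robust partitionings, {a, b, c} with {d, e} and {a, b} with {c, d, e}. Node c switches sides between them, so the 1-pivots are {a, b} and {d, e}.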

Pivot Algorithm
•  Finding pivots in G can be reduced to the Max-flow problem.
•  O(n^2.5), where n is the number of nodes in G
•  To improve scalability:
•  represent G by a simplified inverted index
•  Screening: reduce search space from G to sub-graphs in G.
•  considering unions of cliques in G as a whole
•  splitting sets of unions in G into sub-graphs
•  Apply Max-flow algorithm only on sub-graphs of G.
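
The slides do not spell out the reduction, so the following is not the authors' O(n^2.5) procedure. It is a minimal sketch of one standard link between robustness and max-flow: by Menger's theorem, a non-complete graph stays connected after removing up to k nodes exactly when every pair of non-adjacent nodes is joined by more than k internally vertex-disjoint paths, and that count is a unit-capacity max-flow on a node-split graph. The function names are hypothetical and the graph is assumed to have no self-loops.

```python
from collections import deque
from itertools import combinations

def vertex_disjoint_paths(adj, s, t):
    """Max number of internally vertex-disjoint paths between
    non-adjacent nodes s and t, computed as a unit-capacity max-flow."""
    cap = {}

    def add_arc(u, v):
        cap.setdefault(u, {})[v] = 1
        cap.setdefault(v, {}).setdefault(u, 0)  # residual arc

    for v in adj:
        add_arc((v, "in"), (v, "out"))      # each node usable once
        for w in adj[v]:
            add_arc((v, "out"), (w, "in"))  # each edge, both directions
    source, sink = (s, "out"), (t, "in")
    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        v = sink
        while parent[v] is not None:  # push one unit along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

def is_k_robust_by_flow(adj, k):
    non_adjacent = [(u, v) for u, v in combinations(sorted(adj), 2)
                    if v not in adj[u]]
    if not non_adjacent:
        return True  # a clique is k-robust for any k
    return all(vertex_disjoint_paths(adj, u, v) > k for u, v in non_adjacent)

cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

This version runs one max-flow per non-adjacent pair, which is why restricting it to small sub-graphs, as the screening step does, matters for scalability.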

Pivot Algorithm - Screening
•  k = 1
1. merge cliques into unions, which are k-robust.
2. split sets of unions into sub-graphs by their common nodes.
[Figure: step-by-step screening on the example graph over r1-r10 for k = 1, showing cliques, then unions, then the removal of nodes; the final step labels a pivot among the remaining nodes r4-r8.]
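
Step 1 can be sketched as follows. The slides do not state when two cliques are merged, so the criterion used here is an assumption: cliques that share more than k nodes are merged, which keeps the union connected after any k removals because at least one shared node always survives. The function name and the input cliques are hypothetical.

```python
from itertools import combinations

def merge_cliques_into_unions(cliques, k):
    """Merge cliques sharing more than k nodes (assumed criterion),
    transitively, using a small union-find over clique indices."""
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) > k:
            parent[find(i)] = find(j)

    unions = {}
    for i, clique in enumerate(cliques):
        unions.setdefault(find(i), set()).update(clique)
    return sorted(sorted(u) for u in unions.values())

# Hypothetical cliques; the first two share two nodes, so they merge for k = 1.
cliques = [{"r1", "r2", "r3"}, {"r2", "r3", "r4"}, {"r4", "r5"}, {"r6", "r7"}]
unions = merge_cliques_into_unions(cliques, k=1)
```

The resulting unions can still overlap in a few nodes (r4 above), which is what step 2 then uses to split sets of unions into sub-graphs.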

Group Linkage
•  Stage II: clustering pivots and remaining records into groups
•  weight attribute values based on popularities in a group
•  penalize less on local attribute values of the same group
Example, shown step by step on three records (reward strong evidence; apply weak evidence; low penalty on local values):
r4 Taco Casa 900 AL
r5 Taco Casa 900 AL
r9 Taco Casa 704 TX
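
One way to picture popularity-based weighting is the toy score below. It is not the scoring function from the talk, whose formula the slides do not give. It assumes that a matching value is rewarded by its popularity in the group, and that a mismatch is penalized by the popularity of the group's most common value, so attributes that already vary within the group (local values) cost little. All names and records are hypothetical.

```python
from collections import Counter

def value_popularity(group, attr):
    """Fraction of the group's records carrying each value of `attr`."""
    counts = Counter(r[attr] for r in group if r.get(attr))
    return {value: n / len(group) for value, n in counts.items()}

def record_group_score(record, group, attrs):
    score = 0.0
    for attr in attrs:
        popularity = value_popularity(group, attr)
        value = record.get(attr)
        if not value or not popularity:
            continue  # a missing value gives no evidence either way
        if value in popularity:
            score += popularity[value]          # reward shared values
        else:
            score -= max(popularity.values())   # small if values are local
    return score

# Hypothetical group: the name is global, phones are mostly local.
group = [
    {"name": "Taco Casa", "phone": "900"},
    {"name": "Taco Casa", "phone": "900"},
    {"name": "Taco Casa", "phone": "901"},
    {"name": "Taco Casa", "phone": "902"},
]
new_branch = {"name": "Taco Casa", "phone": "903"}
other_business = {"name": "Elva's Taco Casa", "phone": "900"}
score_branch = record_group_score(new_branch, group, ["name", "phone"])
score_other = record_group_score(other_business, group, ["name", "phone"])
```

With these numbers, a new branch with an unseen phone still scores 0.5, because the name match outweighs the mild phone penalty, while a record with a different name scores -0.5 despite sharing a phone.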

Experiments
•  Datasets:
•  18M business listings
•  590 attendees of SIGMOD’98
•  Measurements:
•  effectiveness: Precision / Recall / F-measure
•  efficiency: runtime
dataset  # records  # groups (size > 1)  group size  # singletons
BizLow   2446       1                    2446        0
BizAvg   2062       30                   [2, 308]    503
BizHigh  1149       14                   [33, 269]   0
SIGMOD   590        71                   [2, 41]     162
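
The slides do not say how precision and recall are computed for groups. A common choice for clustering results, assumed here, is to count pairs of records placed in the same group; the sketch below shows that pairwise variant with hypothetical function names and data.

```python
from itertools import combinations

def same_group_pairs(groups):
    """All unordered record pairs that share a group."""
    return {frozenset(pair)
            for group in groups
            for pair in combinations(sorted(group), 2)}

def pairwise_prf(predicted, truth):
    """Pairwise precision, recall and F-measure (assumed definition)."""
    pred, gold = same_group_pairs(predicted), same_group_pairs(truth)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical ground truth and output: r3 is put in the wrong group.
truth = [{"r1", "r2", "r3"}, {"r6", "r7"}]
predicted = [{"r1", "r2"}, {"r3", "r6", "r7"}]
precision, recall, f_measure = pairwise_prf(predicted, truth)
```

Both groupings contain four same-group pairs and agree on two of them, so precision, recall and F-measure are all 0.5 in this example.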

Overall Results
Our solution obtains the highest F-measure (above .95)

Contribution of Components
•  PIVOT improves precision over the baselines by 79%, at the cost of 34% lower recall.
•  Clustering without PIVOT obtains an F-measure comparable to that of the baselines with high precision.
•  Clustering with PIVOT obtains the best results.

Pivot Quality
•  Baseline has lower recall, since it uses stricter criteria to identify pivots.
•  SCREEN obtains results similar to PIVOT.

Parameter k
Setting k in [1, 4] performs well on most datasets.

Scalability
[Figure: execution time (sec., log scale from 5 to 50,000) vs. # of records (%), for NAIVE, INDEX, SINDEX, UNION and PIVOT.]
•  NAIVE: applying Max-flow in graph G
•  INDEX: using inverted index
•  SINDEX: using simplified inverted index
•  UNION: merging cliques into unions
•  PIVOT: splitting sets of unions into sub-graphs

Related Work
Record Similarity:
•  classification-based approaches [FS69]
•  distance-based approaches [D08]
•  rule-based approaches [HS98]
Two-stage Clustering:
•  single node as pivot [LA99, WB09]
•  bi-connected component as pivot [CNT07]
•  agglomerative clustering result as pivot [YIO+10]
Group Linkage:
•  computing similarity of pre-specified groups of records, using
•  record similarity [OKL+07]
•  network evolution analysis [H10]

Conclusions
•  Group linkage is important, and differs from record linkage.
•  It is critical to cluster records into groups in two stages.
•  It is important to be robust against errors in the group.
•  Our two-stage algorithm is empirically accurate, efficient and
scalable.

References
•  [D08]: D. Dey. Entity matching in heterogeneous databases: A logistic regression approach. Decis. Support Syst., 44:740–747, 2008.
•  [FS69]: I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.
•  [HS98]: M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2:9–37, 1998.
•  [LA99]: B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In KDD, pages 16–22, 1999.
•  [WB09]: D. T. Wijaya and S. Bressan. Ricochet: A family of unconstrained algorithms for graph clustering. In DASFAA, pages 153–167, 2009.
•  [CNT07]: N. Bansal, F. Chiang, N. Koudas, and F. W. Tompa. Seeking stable clusters in the blogosphere. In VLDB, pages 806–817, 2007.
•  [YIO+10]: M. Yoshida, M. Ikeda, S. Ono, I. Sato, and H. Nakagawa. Person name disambiguation by bootstrapping. In SIGIR, pages 10–17, 2010.
•  [OKL+07]: B. W. On, N. Koudas, D. Lee, and D. Srivastava. Group linkage. In ICDE, pages 496–505, 2007.
•  [H10]: S. Huang. Mixed group discovery: Incorporating group linkage with alternatively consistent social network analysis. International Conference on Semantic Computing, 0:369–376, 2010.
