Hwanjun Song (송환준), second-year Ph.D. student
Department of Industrial and Systems Engineering, KAIST
(Graduate School of Knowledge Service Engineering)
Parallel Clustering Algorithm Optimization for
Large-Scale Data Analytics
(Korean title: Clustering Algorithm Optimization for Large-Scale Data Analytics)
- Paper 1 (SIGKDD17)
- Paper 2 (SIGMOD18)
3
• Dividing data into meaningful groups according to the
characteristics found in the data
[Figure: customers, characterized by height and weight, are clustered into groups S, L, and XL]
4
• Marketing (Customer Segmentation)
– height, weight, …
• City-planning
– house type, value, geographical location
• Information Retrieval
– contents of documents
5
• Clustering Algorithms
– K-Means, EM, Hierarchical, Spectral, Ward, Mean Shift, Affinity
Propagation, DBSCAN, …
[Figure: example results of K-Means, Affinity Propagation, Mean Shift, Spectral, Ward, and DBSCAN on the same data]
6
• Use k centroids (virtual points)
• Clustering Error: $\sum_{i=1}^{k} \sum_{p \in C_i} \lVert \theta_i - p \rVert^2$, where $\theta_i$ is the $i$-th centroid (the mean point of cluster $C_i$)
• Iteratively finds locally optimal centroids
[Figure: K-Means clustering (K = 3) converging to a local optimum; panels (a) Iteration 1, (b) Iteration 2, (c) Iteration K show the centroids θ₁, θ₂, θ₃ moving]
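To make the iteration concrete, here is a minimal Lloyd-style K-Means sketch (illustrative only; the function and its defaults are my own, not from the slides):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd-style K-Means; converges to a *local* optimum."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        # (kept unchanged if a cluster happens to be empty).
        centroids = np.stack([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
    # Clustering error from the slide: sum of squared distances.
    return centroids, labels, float(dists.min(axis=1).sum())
```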
7
• Use k medoids (real points)
• Clustering Error: $\sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(\theta_i, p)$, where $\theta_i$ is the $i$-th medoid and $\mathrm{dist}$ is any metric
• Iteratively finds globally optimal medoids
[Figure: K-Medoids clustering (K = 3); panels (a) Iteration 1, (b) Iteration 2, (c) Iteration K show the medoids θ₁, θ₂, θ₃]
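A minimal sketch of this objective with a pluggable metric (names are illustrative; this is not the paper's code):

```python
import numpy as np

def kmedoids_error(points, medoids, dist=None):
    """K-Medoids clustering error: each point contributes its distance to
    the nearest medoid. Unlike K-Means, medoids are real data points and
    `dist` may be any metric (defaults to Euclidean here)."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))
    return sum(min(dist(p, m) for m in medoids) for p in points)
```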
8
• Two algorithm parameters: (ε, minPts)
• Captures clusters of arbitrary shape and does not require the number of clusters in advance
• Finds dense regions and expands them to form clusters
[Figure: (a) a core point p with at least minPts = 4 points within radius ε; (b) points q and o that are (directly) density-reachable from the core point p]
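A small sketch of these two definitions, assuming Euclidean distance (helper names are mine):

```python
import numpy as np

def is_core(points, p_idx, eps, min_pts):
    """p is a core point if at least min_pts points (p included) lie
    within radius eps of it."""
    dists = np.linalg.norm(points - points[p_idx], axis=1)
    return int((dists <= eps).sum()) >= min_pts

def directly_density_reachable(points, q_idx, p_idx, eps, min_pts):
    """q is directly density-reachable from p iff p is a core point and
    q lies in p's eps-neighborhood."""
    return (is_core(points, p_idx, eps, min_pts) and
            np.linalg.norm(points[q_idx] - points[p_idx]) <= eps)
```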
9
[Figure: "What Happens in an Internet Minute" (2017) infographic illustrating the volume of data generated every minute]
10
• A single machine is unlikely to handle data at today's typical big-data scale (high computational complexity)
• Distributed processing (Hadoop, Spark) has been adopted to increase the usability of clustering algorithms
[Figure: the data set is 1) partitioned across Workers 1 to N, 2) processed in parallel, and 3) merged into the final result]
- Paper 1 (SIGKDD17)
- Paper 2 (SIGMOD18)
12
• Still high computational complexity
• Adopts approximate techniques (e.g., sampling)
– Most studies have increased efficiency at the expense of
accuracy
[Figure: efficiency-accuracy trade-off; previous studies gain efficiency but fall short of the ideal of both high efficiency and high accuracy]
13
• Difficulty of finding optimal data partitions
[Figure: a poor choice of cuts gives partition P2 far more work than P1 and P3, so Workers 1 and 3 finish early and wait for Worker 2]
- Paper 1 (SIGKDD17)
- Paper 2 (SIGMOD18)
Hwanjun Song †, Jae-Gil Lee †* , Wook-Shin Han††
† Graduate School of Knowledge Service Engineering, KAIST
†† Department of Creative IT Engineering, POSTECH
* Corresponding Author
01
Background and Challenge
17
• Use k medoids (real points)
• Clustering Error: $\sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(\theta_i, p)$, where $\theta_i$ is the $i$-th medoid and $\mathrm{dist}$ is any metric
• Iteratively finds globally optimal medoids
[Figure: K-Medoids clustering (K = 3); panels (a) Iteration 1, (b) Iteration 2, (c) Iteration K show the medoids θ₁, θ₂, θ₃]
18
• Global Search (original K-Medoids)
– All points can be candidates for the next center
– The reason for its high computational complexity
• Local Search
– Only points inside the current cluster can be candidates for the next center
– The reason for its high clustering error (local optimum)
[Figure: global search considers every point as a candidate for the centers θ₁, θ₂; local search considers only points inside each current cluster]
19
• Entire Data Set
– The reason for high computational complexity
• Sample Data Set
– The reason for high clustering error (insufficient number of samples)
[Figure: centers θ₁, θ₂ found on the entire data vs. on a sample; too small a sample shifts them away from the optimum]
• Existing Methods
– Original K-Medoids: {Entire Data + Global Search}
– Method 1: {Entire Data + Local Search}
– Method 2: {Sampled Data + Global Search}
– Method 3: {Sampled Data + Local Search}
20
[Figure: (a) seeds, (b) optimal result, (c) locally optimal result; existing methods focus on efficiency, trading high accuracy for high efficiency]
21
• Initial centers (seeds) are the key to avoiding a local optimum
– High-quality: each seed is one of the points in each optimal cluster
• Perform local search (efficient) using high-quality seeds
[Figure: (a) low-quality seeds lead to a local optimum; (b) high-quality seeds, one inside each optimal cluster, lead to the correct clustering with θ₁, θ₂, θ₃]
22
• How to find high-quality seeds?
– Naïve: global search on the entire data (K-Medoids)
– Ours: global search on a sufficient number of sampled data
[Figure: global search on samples yields high-quality seeds]
23
• Propose a sample size large enough to find high-quality seeds that are not trapped in a local optimum
• Based on those seeds, run local search on the entire data
[Figure: (a) Phase I finds high-quality seeds; (b) Phase II performs local search on the entire data]
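A toy, sequential sketch of this two-phase idea (sample-based global seeding, then one local refinement pass over all data). It mirrors the structure described here but is not the paper's parallel implementation; the cheap stand-in for global search and all names are my own:

```python
import numpy as np

def error(points, medoids):
    # Sum over points of the distance to the nearest medoid.
    d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def two_phase_kmedoids(points, k, n_sample=200, trials=50, seed=0):
    """Phase I: (approximate) global search on a sample -> high-quality
    seeds. Phase II: one local-search pass over the entire data set."""
    rng = np.random.default_rng(seed)
    sample = points[rng.choice(len(points), n_sample, replace=False)]

    # Phase I: cheap stand-in for global search -- keep the best of many
    # random k-subsets of the sample.
    seeds = min((sample[rng.choice(n_sample, k, replace=False)]
                 for _ in range(trials)),
                key=lambda m: error(sample, m))

    # Phase II: assign every point to its nearest seed, then replace each
    # seed by the best point *inside* its cluster (local search; assumes
    # no cluster ends up empty).
    labels = np.linalg.norm(points[:, None, :] - seeds[None, :, :],
                            axis=2).argmin(axis=1)
    medoids = []
    for i in range(k):
        cluster = points[labels == i]
        within = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :],
                                axis=2).sum(axis=1)
        medoids.append(cluster[within.argmin()])
    return np.stack(medoids)
```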
02
Overview of PAMAE
• PAMAE consists of two phases
– Parallel Seeding and Parallel Refinement
25
[Figure: Phase 1 (Parallel Seeding) draws samples 1-3 of n points each from the data set and selects the best seeds; Phase 2 (Parallel Refinement) turns the Phase 1 result into the final clustering]
26
• Two parameters for sampling: n, m
– m: number of samples
– n: number of points in each sample
• Theorem 4.4: clustering error depending on n, m
– k: number of clusters
– If (n, m) = (40k, 5), then the upper bound is 1.01·φ(Θ_opt)
• Theorem 5.4: probability of finding the global optimum
– If (n, m) = (40k, 5), p is almost 1
Phase 1 error vs. optimal error: $E\left[\phi(\hat{\Theta}_{opt})\right] < \left(1 + \frac{2k}{nm}\right)\phi(\Theta_{opt})$
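Plugging the suggested setting (n, m) = (40k, 5) into this bound shows where the 1.01 factor in Theorem 4.4 comes from:

```latex
E\left[\phi(\hat{\Theta}_{opt})\right]
  < \left(1 + \frac{2k}{nm}\right)\phi(\Theta_{opt})
  = \left(1 + \frac{2k}{40k \cdot 5}\right)\phi(\Theta_{opt})
  = 1.01\,\phi(\Theta_{opt})
```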
03
Experimental Results
28
• Algorithms
Algorithm | Description | Implementation
PAM-MR | Entire Data + Local Search | by us
FAMES-MR | Entire Data + Local Search | by us
CLARA-MR | Sampled Data + Global Search | by us
CLARA-MR' | Sampled Data + Global Search | by us
GREEDI | Greedy Search | by us
MR-KMEDIAN | Sampled Data + Global Search | by us
PAMAE | proposed algorithm | by us
• Real-world data sets used for experiments
Data Set | # Points | # Dim | Size | Type
Covertype | 581,012 | 55 | 71 MB | float
Census1990 | 2,458,285 | 68 | 324 MB | float
Cosmo50 | 315,086,245 | 3 | 11.2 GB | float
TeraClickLog | 4,373,472,329 | 13 | 362 GB | float
29
• Cluster setting
– 12 Microsoft Azure D12v2 instances located in Japan
• Each instance has four cores, 28 GB RAM, and 200 GB of disk (SSD)
• Ten instances were used as worker nodes, and two instances were used as master nodes
• Relative Error
– The optimal clustering error itself takes a significant proportion of the total error in large-scale data sets, so we report the error relative to it:
$\mathit{RelativeError} = \phi_{\mathcal{REL}} = \frac{\phi - \phi_{OPT}}{\phi_{OPT}}$
30
• Total elapsed time of parallel k-Medoids algorithms
– PAMAE-Spark is the best and outperformed CLARA-MR by up to
5.4 times
[Figure: total elapsed time on (a) Covertype, (b) Census1990, (c) TeraClickLog]
31
• Relative error of parallel k-Medoids algorithms
[Figure: relative error on (a) Covertype, (b) Census1990, (c) TeraClickLog]
04
Conclusions
33
• The K-Medoids algorithm is not suitable for handling large amounts of data due to its high computational cost
• Most studies have increased efficiency at the expense of accuracy by using local search or sampling
• We propose PAMAE, composed of Parallel Seeding and Parallel Refinement, with theoretical proof
• PAMAE significantly outperforms the state-of-the-art parallel algorithms and produces the most accurate result
Hwanjun Song †, Jae-Gil Lee †*
† Graduate School of Knowledge Service Engineering, KAIST
* Corresponding Author
01
Background and Challenge
36
• One of the most widely used clustering algorithms
• Received the Test of Time Award at KDD 2014
• Captures clusters of arbitrary shape and does not require the number of clusters in advance
• Finds dense regions and expands them to form clusters
[Figure: (a) a core point p with at least minPts = 4 points within radius ε; (b) points q and o that are (directly) density-reachable from the core point p]
37
• Common belief for data partitioning
– Neighboring points must be assigned to the same partition to calculate the correct density $N_\varepsilon(p)$
• Region-based partitioning
[Figure: (a) the data set is split by cut 1 and cut 2 into 1) contiguous partitions P1-P3; (b) the resulting partitions carry 2) overlapping regions along the cuts]
38
1. Expensive data split
– Too many possible choices for cutting the space
– As the number of dimensions increases, the cost of partitioning increases
[Figure: choices 1 through N of cut 1 and cut 2, illustrating the many possible ways to split the space]
39
2. Load imbalance between data partitions
[Figure: cuts 1 and 2 give partition P2 far more points than P1 and P3; Workers 1 and 3 finish early and wait for Worker 2]
40
3. Duplicated points in overlapping regions
– The number of data points increases, so the total execution time increases
[Figure: (a) a data set of 27 points becomes (b) partitions P1-P3 with 40 points in total once points in the overlapping regions are duplicated]
41
• What if we use random partitioning?
1. Random partitioning is cheap, i.e., O(N)
2. All partitions have almost the same number and distribution of data points
3. All partitions can be disjoint from each other
[Figure: (a) the data set is randomly split into (b) disjoint partitions P1 and P2 of nearly identical size and distribution]
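A minimal sketch of this step (for brevity it assigns raw points to workers; RP-DBSCAN actually distributes cells, as the following slides describe; the function name is mine):

```python
import numpy as np

def random_partition(points, num_workers, seed=0):
    """O(N) random partitioning: each point is assigned to a uniformly
    random worker, so partitions are disjoint and have almost the same
    size and distribution."""
    rng = np.random.default_rng(seed)
    assignment = rng.integers(0, num_workers, size=len(points))
    return [points[assignment == w] for w in range(num_workers)]
```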
42
• How can we calculate the density of the ε-neighborhood of a point in a random partition?
[Figure: density calculation on a random partition; only 2 of p's ε-neighbors lie in the current partition and the rest lie in other partitions, so a naive count gives N_ε(p) = 2?]
43
• Estimating the number of points in other partitions using a compact summary structure
[Figure: density calculation on a random partition; adding the summarized sub-cell counts of other partitions to the local points yields N_ε(p) = 6]
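A simplified sketch of this estimate, assuming the summary maps sub-cell grid indices to point counts; the inclusion test on sub-cell centers is a simplification I chose, not the paper's exact rule:

```python
import numpy as np

def approx_density(p, local_points, summary, side, eps):
    """Approximate |N_eps(p)|: count local points exactly, then add the
    summarized counts of sub-cells (owned by other partitions) whose
    center lies within eps of p. `summary` maps sub-cell index -> count,
    `side` is the sub-cell side length (diag = rho * eps)."""
    count = int((np.linalg.norm(local_points - p, axis=1) <= eps).sum())
    for cell_idx, c in summary.items():
        center = (np.asarray(cell_idx) + 0.5) * side
        if np.linalg.norm(center - p) <= eps:
            count += c
    return count
```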
44
• Proposed a random partitioning method to solve the
three limitations of the region-based method
– Expensive split, load imbalance, and duplicated points
• Designed a highly compact summary of the entire
data set to enable the approximate density calculation
on a random partition
02
Overview of RP-DBSCAN
46
• Phase I: Data Partitioning
– Performs random partitioning
– Builds a compact summary and broadcasts it to all workers
• Phase II: Local Clustering
– Finds all directly density-reachable relationships in each partition
• Phase III: Merging
– Merges the results obtained from each partition
– Labels points based on the reachability relationships
47
• Randomly distributes the cells to multiple workers
• Builds a highly compact summary by adopting the concept of a sub-cell with diag(sub-cell) = ρε
– As ρ gets smaller, the space is summarized more precisely
[Figure: (a) the data set is gridded into cells with diag(cell) = ε; (b) the cells are randomly assigned to Worker 1 (P1) and Worker 2 (P2); (c) the summary is built from sub-cell counts with ρ = 0.5. Why diag(cell) = ε? Any two points in the same cell are at most ε apart, so a cell holding at least minPts points consists solely of core points.]
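A sketch of the implied cell bookkeeping (in d dimensions a diagonal of ρε gives a side length of ρε/√d; ρ = 1 yields the cells themselves; the helper name is mine):

```python
import numpy as np

def cell_index(point, eps, rho=1.0):
    """Grid index of the (sub-)cell containing `point`. With
    diag = rho * eps, the side length is rho * eps / sqrt(d); for the
    cells themselves (rho = 1), any two points sharing a cell are at
    most eps apart."""
    point = np.asarray(point, dtype=float)
    side = rho * eps / np.sqrt(len(point))
    return tuple(np.floor(point / side).astype(int))
```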
48
• Compactness of the summary structure
– Stores only the density of the sub-cell rather than the exact
position of each point
– Represents the position of the sub-cell with the ordering of the
sub-cells inside the cell
[Figure: with ρ = 0.5, each cell splits into four sub-cells ordered 0-3; instead of the exact coordinates of every point, the summary stores each non-empty sub-cell's order within its cell (2 bits are enough) together with its density, e.g., (order 3, density 3) and (order 2, density 2)]
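A toy sketch of that packing for ρ = 0.5 in two dimensions, where each cell has four sub-cells and each non-empty sub-cell's order fits in 2 bits (the layout is illustrative, not the paper's exact byte format):

```python
def pack_subcells(subcells):
    """Pack the non-empty sub-cells of one cell as (orders, densities).
    `subcells` is a list of (order, density) pairs; each order in 0..3
    takes 2 bits, so positions cost 2 bits per sub-cell instead of full
    coordinates."""
    orders = 0
    densities = []
    for i, (order, density) in enumerate(subcells):
        orders |= (order & 0b11) << (2 * i)  # 2 bits per sub-cell order
        densities.append(density)
    return orders, densities

# Example from the figure: sub-cells with orders 3 and 2, densities 3 and 2.
packed, dens = pack_subcells([(3, 3), (2, 2)])  # packed == 0b1011, dens == [3, 2]
```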
49
• Approximately calculates the density of the ε-neighborhood
• Finds all directly reachable relationships that appear across two cells in each partition
– Equivalent to finding all reachable relationships between all points
[Figure: (a) random partition P1 plus the summary of P2, with minPts = 4; (b) direct reachability between cells]
50
• Merges all local results obtained from each partition
• Labels points based on the reachability relationships
[Figure: (a) the merged result across P1 and P2; (b) the labeling result: Cluster 1, Cluster 2, and outliers]
03
Experimental Results
52
• Algorithms
Algorithm | Description | Implementation
ESP-DBSCAN | region-based: even-split w/ ρ-approximation | by us
CBP-DBSCAN | region-based: cost-based w/ ρ-approximation | by us
RBP-DBSCAN | region-based: reduced-boundary w/ ρ-approximation | by us
NG-DBSCAN | graph-based (vertex-centric) | open source
RP-DBSCAN | proposed algorithm | by us
53
• Real-world data sets used for experiments
– GeoLife contains user location data, Cosmo50 contains simulation data, OpenStreetMap contains GPS data, and TeraClickLog contains click-log data
– In particular, GeoLife is heavily skewed because a large proportion of users stayed in Beijing
Data Set | # Points | # Dim | Size | Type
GeoLife | 24,876,978 | 3 | 808 MB | float
Cosmo50 | 315,086,245 | 3 | 11.2 GB | float
OpenStreetMap | 2,770,238,904 | 2 | 77.1 GB | float
TeraClickLog | 4,373,472,329 | 13 | 362 GB | float
54
• Algorithm parameters
– ε = {1/8, 1/4, 1/2, 1} × the value that generates around 10 clusters in each data set
– minPts = 100
– ρ = 0.01
• Cluster setting
– 12 Microsoft Azure D12v2 instances located in South Korea
• Each instance has four cores, 28 GB RAM, and 200 GB of disk (SSD)
• Ten instances were used as worker nodes, and two instances were used as master nodes
55
[Figure: total elapsed time; bars capped at the 20,000 s time limit; up to 180x speedup on GeoLife]
• Total elapsed time of five parallel DBSCAN algorithms
– RP-DBSCAN was always the fastest
• Outperformed the state-of-the-art by up to 180 times in GeoLife data
– Only RP-DBSCAN finished for the largest data set
56
[Figure: load imbalance across workers; up to 600x on GeoLife]
• Load imbalance of five parallel DBSCAN algorithms
– The load imbalance of RP-DBSCAN was always close to 1
– Existing algorithms failed to achieve good load balance
• In GeoLife data, the load imbalance was up to 600
57
• Total number of points processed by the algorithms
– The total number of points processed by RP-DBSCAN was always equal to the size of the data set
– Except for GeoLife, the total number of points processed by the existing algorithms increased as ε increased
58
[Figure: clustering accuracy (Rand Index) of the algorithms]
59
• Our summary was very compact
– Its size was only 0.04% to 8.20% of the data size
04
Conclusions
61
• Region-based partitioning in existing algorithms induces a critical bottleneck in parallel processing
• We proposed a random partitioning method together with a highly compact summary of the entire data
• RP-DBSCAN achieves almost perfect load balance and does not duplicate points
• RP-DBSCAN dramatically outperforms the state-of-the-art parallel DBSCAN algorithms, by up to 180 times, without much loss of accuracy