Hwanjun Song (송환준), second-year Ph.D. student
Department of Industrial and Systems Engineering, KAIST
(Graduate School of Knowledge Service Engineering)
Parallel Clustering Algorithm Optimization for
Large-Scale Data Analytics
(Korean title: Clustering Algorithm Optimization for Large-Scale Data Analytics)
- Paper 1 (SIGKDD17)
- Paper 2 (SIGMOD18)
3
• Dividing data into meaningful groups according to the
characteristics found in the data
[Figure: customers, characterized by height and weight, are clustered into groups S, L, and XL]
4
• Marketing (Customer Segmentation)
– height, weight, …
• City-planning
– house type, value, geographical location
• Information Retrieval
– contents of documents
5
• Clustering Algorithms
– K-Means, EM, Hierarchical, Spectral, Ward, Mean Shift, Affinity
Propagation, DBSCAN, …
[Figure: example results of K-Means, Affinity Propagation, Mean Shift, Spectral, Ward, and DBSCAN on the same data]
6
• Use k centroids (virtual points)
• Clustering Error: $\sum_{i=1}^{k} \sum_{p \in C_i} \lVert \theta_i - p \rVert^2$, where $\theta_i$ is the $i$-th centroid (the mean point of cluster $C_i$)
• Iteratively finds locally optimal centroids
[Figure: K-Means clustering (K = 3) converging to a local optimum; panels (a) Iteration 1, (b) Iteration 2, (c) Iteration K show the centroids θ₁, θ₂, θ₃ moving]
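To make the iteration concrete, here is a minimal Lloyd-style K-Means sketch (illustrative only; the function and its defaults are my own, not from the slides):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd-style K-Means; converges to a *local* optimum."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its cluster
        # (kept unchanged if a cluster happens to be empty).
        centroids = np.stack([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
    # Clustering error from the slide: sum of squared distances.
    return centroids, labels, float(dists.min(axis=1).sum())
```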
7
• Use k medoids (real points)
• Clustering Error: $\sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(\theta_i, p)$, where $\theta_i$ is the $i$-th medoid and $\mathrm{dist}$ is any metric
• Iteratively finds globally optimal medoids
[Figure: K-Medoids clustering (K = 3); panels (a) Iteration 1, (b) Iteration 2, (c) Iteration K show the medoids θ₁, θ₂, θ₃]
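A minimal sketch of this objective with a pluggable metric (names are illustrative; this is not the paper's code):

```python
import numpy as np

def kmedoids_error(points, medoids, dist=None):
    """K-Medoids clustering error: each point contributes its distance to
    the nearest medoid. Unlike K-Means, medoids are real data points and
    `dist` may be any metric (defaults to Euclidean here)."""
    if dist is None:
        dist = lambda a, b: float(np.linalg.norm(a - b))
    return sum(min(dist(p, m) for m in medoids) for p in points)
```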
8
• Two algorithm parameters: (ε, minPts)
• Captures clusters of arbitrary shape and does not require the number of clusters in advance
• Finds dense regions and expands them to form clusters
[Figure: (a) a core point p with at least minPts = 4 points within radius ε; (b) points q and o that are (directly) density-reachable from the core point p]
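A small sketch of these two definitions, assuming Euclidean distance (helper names are mine):

```python
import numpy as np

def is_core(points, p_idx, eps, min_pts):
    """p is a core point if at least min_pts points (p included) lie
    within radius eps of it."""
    dists = np.linalg.norm(points - points[p_idx], axis=1)
    return int((dists <= eps).sum()) >= min_pts

def directly_density_reachable(points, q_idx, p_idx, eps, min_pts):
    """q is directly density-reachable from p iff p is a core point and
    q lies in p's eps-neighborhood."""
    return (is_core(points, p_idx, eps, min_pts) and
            np.linalg.norm(points[q_idx] - points[p_idx]) <= eps)
```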
9
[Figure: "What Happens in an Internet Minute" (2017) infographic illustrating the volume of data generated every minute]
10
• A single machine is unlikely to handle data at today's typical big-data scale (high computational complexity)
• Distributed processing (Hadoop, Spark) has been adopted to increase the usability of clustering algorithms
[Figure: the data set is 1) partitioned across Workers 1 to N, 2) processed in parallel, and 3) merged into the final result]
- Paper 1 (SIGKDD17)
- Paper 2 (SIGMOD18)
12
• Still high computational complexity
• Adopts approximate techniques (e.g., sampling)
– Most studies have increased efficiency at the expense of
accuracy
[Figure: efficiency-accuracy trade-off; previous studies gain efficiency but fall short of the ideal of both high efficiency and high accuracy]
13
• Difficulty of finding optimal data partitions
[Figure: a poor choice of cuts gives partition P2 far more work than P1 and P3, so Workers 1 and 3 finish early and wait for Worker 2]
- Paper 1 (SIGKDD17)
- Paper 2 (SIGMOD18)
Hwanjun Song †, Jae-Gil Lee †* , Wook-Shin Han††
† Graduate School of Knowledge Service Engineering, KAIST
†† Department of Creative IT Engineering, POSTECH
* Corresponding Author
01
Background and Challenge
17
• Use k medoids (real points)
• Clustering Error: $\sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(\theta_i, p)$, where $\theta_i$ is the $i$-th medoid and $\mathrm{dist}$ is any metric
• Iteratively finds globally optimal medoids
[Figure: K-Medoids clustering (K = 3); panels (a) Iteration 1, (b) Iteration 2, (c) Iteration K show the medoids θ₁, θ₂, θ₃]
18
• Global Search (original K-Medoids)
– All points can be candidates for the next center
– The reason for its high computational complexity
• Local Search
– Only points inside the current cluster can be candidates for the next center
– The reason for its high clustering error (local optimum)
[Figure: global search considers every point as a candidate for the centers θ₁, θ₂; local search considers only points inside each current cluster]
19
• Entire Data Set
– The reason for high computational complexity
• Sample Data Set
– The reason for high clustering error (insufficient number of samples)
[Figure: centers θ₁, θ₂ found on the entire data vs. on a sample; too small a sample shifts them away from the optimum]
• Existing Methods
– Original K-Medoids: {Entire Data + Global Search}
– Method 1: {Entire Data + Local Search}
– Method 2: {Sampled Data + Global Search}
– Method 3: {Sampled Data + Local Search}
20
[Figure: (a) seeds, (b) optimal result, (c) locally optimal result; existing methods focus on efficiency, trading high accuracy for high efficiency]
21
• Initial centers (seeds) are the key to avoiding a local optimum
– High-quality: each seed is one of the points in each optimal cluster
• Perform local search (efficient) using high-quality seeds
[Figure: (a) low-quality seeds lead to a local optimum; (b) high-quality seeds, one inside each optimal cluster, lead to the correct clustering with θ₁, θ₂, θ₃]
22
• How to find high-quality seeds?
– Naïve: global search on the entire data (K-Medoids)
– Ours: global search on a sufficient number of sampled data
[Figure: global search on samples yields high-quality seeds]
23
• Propose a sample size large enough to find high-quality seeds that are not trapped in a local optimum
• Based on those seeds, run local search on the entire data
[Figure: (a) Phase I finds high-quality seeds; (b) Phase II performs local search on the entire data]
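A toy, sequential sketch of this two-phase idea (sample-based global seeding, then one local refinement pass over all data). It mirrors the structure described here but is not the paper's parallel implementation; the cheap stand-in for global search and all names are my own:

```python
import numpy as np

def error(points, medoids):
    # Sum over points of the distance to the nearest medoid.
    d = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=2)
    return d.min(axis=1).sum()

def two_phase_kmedoids(points, k, n_sample=200, trials=50, seed=0):
    """Phase I: (approximate) global search on a sample -> high-quality
    seeds. Phase II: one local-search pass over the entire data set."""
    rng = np.random.default_rng(seed)
    sample = points[rng.choice(len(points), n_sample, replace=False)]

    # Phase I: cheap stand-in for global search -- keep the best of many
    # random k-subsets of the sample.
    seeds = min((sample[rng.choice(n_sample, k, replace=False)]
                 for _ in range(trials)),
                key=lambda m: error(sample, m))

    # Phase II: assign every point to its nearest seed, then replace each
    # seed by the best point *inside* its cluster (local search; assumes
    # no cluster ends up empty).
    labels = np.linalg.norm(points[:, None, :] - seeds[None, :, :],
                            axis=2).argmin(axis=1)
    medoids = []
    for i in range(k):
        cluster = points[labels == i]
        within = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :],
                                axis=2).sum(axis=1)
        medoids.append(cluster[within.argmin()])
    return np.stack(medoids)
```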
02
Overview of PAMAE
• PAMAE consists of two phases
– Parallel Seeding and Parallel Refinement
25
[Figure: Phase 1 (Parallel Seeding) draws samples 1-3 of n points each from the data set and selects the best seeds; Phase 2 (Parallel Refinement) turns the Phase 1 result into the final clustering]
26
• Two parameters for sampling: n, m
– m: number of samples
– n: number of points in each sample
• Theorem 4.4: clustering error depending on n, m
– k: number of clusters
– If (n, m) = (40k, 5), then the upper bound is 1.01·φ(Θ_opt)
• Theorem 5.4: probability of finding the global optimum
– If (n, m) = (40k, 5), p is almost 1
Phase 1 error vs. optimal error: $E\left[\phi(\hat{\Theta}_{opt})\right] < \left(1 + \frac{2k}{nm}\right)\phi(\Theta_{opt})$
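Plugging the suggested setting (n, m) = (40k, 5) into this bound shows where the 1.01 factor in Theorem 4.4 comes from:

```latex
E\left[\phi(\hat{\Theta}_{opt})\right]
  < \left(1 + \frac{2k}{nm}\right)\phi(\Theta_{opt})
  = \left(1 + \frac{2k}{40k \cdot 5}\right)\phi(\Theta_{opt})
  = 1.01\,\phi(\Theta_{opt})
```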
03
Experimental Results
28
• Algorithms
Algorithm | Description | Implementation
PAM-MR | Entire Data + Local Search | by us
FAMES-MR | Entire Data + Local Search | by us
CLARA-MR | Sampled Data + Global Search | by us
CLARA-MR' | Sampled Data + Global Search | by us
GREEDI | Greedy Search | by us
MR-KMEDIAN | Sampled Data + Global Search | by us
PAMAE | proposed algorithm | by us
• Real-world data sets used for experiments
Data Set | # Points | # Dim | Size | Type
Covertype | 581,012 | 55 | 71 MB | float
Census1990 | 2,458,285 | 68 | 324 MB | float
Cosmo50 | 315,086,245 | 3 | 11.2 GB | float
TeraClickLog | 4,373,472,329 | 13 | 362 GB | float
29
• Cluster setting
– 12 Microsoft Azure D12v2 instances located in Japan
• Each instance has four cores, 28 GB RAM, and 200 GB of disk (SSD)
• Ten instances were used as worker nodes, and two instances were used as master nodes
• Relative Error
– The optimal clustering error itself takes a significant proportion of the total error in large-scale data sets, so we report the error relative to it:
$\mathit{RelativeError} = \phi_{\mathcal{REL}} = \frac{\phi - \phi_{OPT}}{\phi_{OPT}}$
30
• Total elapsed time of parallel k-Medoids algorithms
– PAMAE-Spark is the best and outperformed CLARA-MR by up to
5.4 times
[Figure: total elapsed time on (a) Covertype, (b) Census1990, (c) TeraClickLog]
31
• Relative error of parallel k-Medoids algorithms
[Figure: relative error on (a) Covertype, (b) Census1990, (c) TeraClickLog]
04
Conclusions
33
• The K-Medoids algorithm is not suitable for handling large amounts of data due to its high computational cost
• Most studies have increased efficiency at the expense of accuracy by using local search or sampling
• We propose PAMAE, composed of Parallel Seeding and Parallel Refinement, with theoretical proof
• PAMAE significantly outperforms the state-of-the-art parallel algorithms and produces the most accurate result
Hwanjun Song †, Jae-Gil Lee †*
† Graduate School of Knowledge Service Engineering, KAIST
* Corresponding Author
01
Background and Challenge
36
• One of the most widely used clustering algorithms
• Received the Test of Time Award at KDD 2014
• Captures clusters of arbitrary shape and does not require the number of clusters in advance
• Finds dense regions and expands them to form clusters
[Figure: (a) a core point p with at least minPts = 4 points within radius ε; (b) points q and o that are (directly) density-reachable from the core point p]
37
• Common belief for data partitioning
– Neighboring points must be assigned to the same partition to calculate the correct density $N_\varepsilon(p)$
• Region-based partitioning
[Figure: (a) the data set is split by cut 1 and cut 2 into 1) contiguous partitions P1-P3; (b) the resulting partitions carry 2) overlapping regions along the cuts]
38
1. Expensive data split
– Too many possible choices for cutting the space
– As the number of dimensions increases, the cost of partitioning increases
[Figure: choices 1 through N of cut 1 and cut 2, illustrating the many possible ways to split the space]
39
2. Load imbalance between data partitions
[Figure: cuts 1 and 2 give partition P2 far more points than P1 and P3; Workers 1 and 3 finish early and wait for Worker 2]
40
3. Duplicated points in overlapping regions
– The number of data points increases, so the total execution time increases
[Figure: (a) a data set of 27 points becomes (b) partitions P1-P3 with 40 points in total once points in the overlapping regions are duplicated]
41
• What if we use random partitioning?
1. Random partitioning is cheap, i.e., O(N)
2. All partitions have almost the same number and distribution of data points
3. All partitions can be disjoint from each other
[Figure: (a) the data set is randomly split into (b) disjoint partitions P1 and P2 of nearly identical size and distribution]
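A minimal sketch of this step (for brevity it assigns raw points to workers; RP-DBSCAN actually distributes cells, as the following slides describe; the function name is mine):

```python
import numpy as np

def random_partition(points, num_workers, seed=0):
    """O(N) random partitioning: each point is assigned to a uniformly
    random worker, so partitions are disjoint and have almost the same
    size and distribution."""
    rng = np.random.default_rng(seed)
    assignment = rng.integers(0, num_workers, size=len(points))
    return [points[assignment == w] for w in range(num_workers)]
```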
42
• How can we calculate the density of the ε-neighborhood of a point in a random partition?
[Figure: density calculation on a random partition; only 2 of p's ε-neighbors lie in the current partition and the rest lie in other partitions, so a naive count gives N_ε(p) = 2?]
43
• Estimating the number of points in other partitions using a compact summary structure
[Figure: density calculation on a random partition; adding the summarized sub-cell counts of other partitions to the local points yields N_ε(p) = 6]
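A simplified sketch of this estimate, assuming the summary maps sub-cell grid indices to point counts; the inclusion test on sub-cell centers is a simplification I chose, not the paper's exact rule:

```python
import numpy as np

def approx_density(p, local_points, summary, side, eps):
    """Approximate |N_eps(p)|: count local points exactly, then add the
    summarized counts of sub-cells (owned by other partitions) whose
    center lies within eps of p. `summary` maps sub-cell index -> count,
    `side` is the sub-cell side length (diag = rho * eps)."""
    count = int((np.linalg.norm(local_points - p, axis=1) <= eps).sum())
    for cell_idx, c in summary.items():
        center = (np.asarray(cell_idx) + 0.5) * side
        if np.linalg.norm(center - p) <= eps:
            count += c
    return count
```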
44
• Proposed a random partitioning method to solve the
three limitations of the region-based method
– Expensive split, load imbalance, and duplicated points
• Designed a highly compact summary of the entire
data set to enable the approximate density calculation
on a random partition
02
Overview of RP-DBSCAN
46
• Phase I: Data Partitioning
– Performs random partitioning
– Builds a compact summary and broadcasts it to all workers
• Phase II: Local Clustering
– Finds all directly density-reachable relationships in each partition
• Phase III: Merging
– Merges the results obtained from each partition
– Labels points based on the reachability relationships
47
• Randomly distributes the cells to multiple workers
• Builds a highly compact summary by adopting the concept of a sub-cell with diag(sub-cell) = ρε
– As ρ gets smaller, the space is summarized more precisely
[Figure: (a) the data set is gridded into cells with diag(cell) = ε; (b) the cells are randomly assigned to Worker 1 (P1) and Worker 2 (P2); (c) the summary is built from sub-cell counts with ρ = 0.5. Why diag(cell) = ε? Any two points in the same cell are at most ε apart, so a cell holding at least minPts points consists solely of core points.]
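A sketch of the implied cell bookkeeping (in d dimensions a diagonal of ρε gives a side length of ρε/√d; ρ = 1 yields the cells themselves; the helper name is mine):

```python
import numpy as np

def cell_index(point, eps, rho=1.0):
    """Grid index of the (sub-)cell containing `point`. With
    diag = rho * eps, the side length is rho * eps / sqrt(d); for the
    cells themselves (rho = 1), any two points sharing a cell are at
    most eps apart."""
    point = np.asarray(point, dtype=float)
    side = rho * eps / np.sqrt(len(point))
    return tuple(np.floor(point / side).astype(int))
```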
48
• Compactness of the summary structure
– Stores only the density of the sub-cell rather than the exact
position of each point
– Represents the position of the sub-cell with the ordering of the
sub-cells inside the cell
[Figure: with ρ = 0.5, each cell splits into four sub-cells ordered 0-3; instead of the exact coordinates of every point, the summary stores each non-empty sub-cell's order within its cell (2 bits are enough) together with its density, e.g., (order 3, density 3) and (order 2, density 2)]
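A toy sketch of that packing for ρ = 0.5 in two dimensions, where each cell has four sub-cells and each non-empty sub-cell's order fits in 2 bits (the layout is illustrative, not the paper's exact byte format):

```python
def pack_subcells(subcells):
    """Pack the non-empty sub-cells of one cell as (orders, densities).
    `subcells` is a list of (order, density) pairs; each order in 0..3
    takes 2 bits, so positions cost 2 bits per sub-cell instead of full
    coordinates."""
    orders = 0
    densities = []
    for i, (order, density) in enumerate(subcells):
        orders |= (order & 0b11) << (2 * i)  # 2 bits per sub-cell order
        densities.append(density)
    return orders, densities

# Example from the figure: sub-cells with orders 3 and 2, densities 3 and 2.
packed, dens = pack_subcells([(3, 3), (2, 2)])  # packed == 0b1011, dens == [3, 2]
```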
49
• Approximately calculates the density of the ε-neighborhood
• Finds all directly reachable relationships that appear across two cells in each partition
– Equivalent to finding all reachable relationships between all points
[Figure: (a) random partition P1 plus the summary of P2, with minPts = 4; (b) direct reachability between cells]
50
• Merges all local results obtained from each partition
• Labels points based on the reachability relationships
[Figure: (a) the merged result across P1 and P2; (b) the labeling result: Cluster 1, Cluster 2, and outliers]
03
Experimental Results
52
• Algorithms
Algorithm | Description | Implementation
ESP-DBSCAN | region-based: even-split w/ ρ-approximation | by us
CBP-DBSCAN | region-based: cost-based w/ ρ-approximation | by us
RBP-DBSCAN | region-based: reduced-boundary w/ ρ-approximation | by us
NG-DBSCAN | graph-based (vertex-centric) | open source
RP-DBSCAN | proposed algorithm | by us
53
• Real-world data sets used for experiments
– GeoLife contains user location data, Cosmo50 contains simulation data, OpenStreetMap contains GPS data, and TeraClickLog contains click-log data
– In particular, GeoLife is heavily skewed because a large proportion of users stayed in Beijing
Data Set | # Points | # Dim | Size | Type
GeoLife | 24,876,978 | 3 | 808 MB | float
Cosmo50 | 315,086,245 | 3 | 11.2 GB | float
OpenStreetMap | 2,770,238,904 | 2 | 77.1 GB | float
TeraClickLog | 4,373,472,329 | 13 | 362 GB | float
54
• Algorithm parameters
– ε = {1/8, 1/4, 1/2, 1} × the value that generates around 10 clusters in each data set
– minPts = 100
– ρ = 0.01
• Cluster setting
– 12 Microsoft Azure D12v2 instances located in South Korea
• Each instance has four cores, 28 GB RAM, and 200 GB of disk (SSD)
• Ten instances were used as worker nodes, and two instances were used as master nodes
55
[Figure: total elapsed time; bars capped at the 20,000 s time limit; up to 180x speedup on GeoLife]
• Total elapsed time of five parallel DBSCAN algorithms
– RP-DBSCAN was always the fastest
• Outperformed the state-of-the-art by up to 180 times in GeoLife data
– Only RP-DBSCAN finished for the largest data set
56
[Figure: load imbalance across workers; up to 600x on GeoLife]
• Load imbalance of five parallel DBSCAN algorithms
– The load imbalance of RP-DBSCAN was always close to 1
– Existing algorithms failed to achieve good load balance
• In GeoLife data, the load imbalance was up to 600
57
• Total number of points processed by the algorithms
– The total number of points processed by RP-DBSCAN was always equal to the size of the data set
– Except for GeoLife, the total number of points processed by the existing algorithms increased as ε increased
58
[Figure: clustering accuracy (Rand Index) of the algorithms]
59
• Our summary was very compact
– Its size was only 0.04% to 8.20% of the data size
04
Conclusions
61
• Region-based partitioning in existing algorithms induces a critical bottleneck in parallel processing
• We proposed a random partitioning method together with a highly compact summary of the entire data
• RP-DBSCAN achieves almost perfect load balance and does not duplicate points
• RP-DBSCAN dramatically outperforms the state-of-the-art parallel DBSCAN algorithms, by up to 180 times, without much loss of accuracy