1. MULTI-CORE K-MEANS
BÖHM C.; PERDACHER M.; PLANT C.
SPEAKER: MARTIN PERDACHER
2. MULTI-CORE K-MEANS
INTRODUCTION
• K-means is a highly relevant use case for knowledge discovery on
big data
• We maximise the performance of K-means by applying two types
of parallelism:
• MIMD (Multiple Instruction Multiple Data)
• SIMD (Single Instruction Multiple Data)
• Avoid branching operations such as if-then:
• Encode cluster IDs and distances in joint variables
3. MIMD VS SIMD
IN A SHARED ENVIRONMENT
INTRODUCTION
• Coarse-grained parallelism (MIMD)
• OpenMP
• Fine-grained parallelism (SIMD)
• Advanced Vector eXtensions (AVX2)
• Auto-vectorization exists, but is far from being efficient.
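The two levels of parallelism can be sketched for the K-means assignment step as follows. This is a minimal illustration, not the authors' implementation: the function name `assignPoints` and the row-major data layout are assumptions, and the inner loop is left to the compiler rather than written with AVX intrinsics.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: assigns each of n points (row-major, d values per
// point) to the nearest of k centroids and returns one label per point.
std::vector<int> assignPoints(const std::vector<double>& points,
                              const std::vector<double>& centroids,
                              std::size_t n, std::size_t k, std::size_t d) {
    std::vector<int> label(n, 0);
    // Coarse-grained (MIMD): OpenMP splits the points across threads.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < static_cast<long long>(n); ++i) {
        const std::size_t p = static_cast<std::size_t>(i);
        double best = std::numeric_limits<double>::max();
        for (std::size_t j = 0; j < k; ++j) {
            double sum = 0.0;
            // Fine-grained (SIMD): this inner loop is what a vectorizer or
            // hand-written AVX code would process several values at a time.
            for (std::size_t t = 0; t < d; ++t) {
                const double diff = points[p * d + t] - centroids[j * d + t];
                sum += diff * diff;
            }
            if (sum < best) {  // plain branch, kept here for readability
                best = sum;
                label[p] = static_cast<int>(j);
            }
        }
    }
    return label;
}
```

Each thread writes only the labels of its own points, so the outer loop needs no synchronisation.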
7. AVX
INTELLIGENT REUSE OF REGISTERS
[Figure: the 16 AVX registers YMM0 to YMM15, used for 16 distance calculations between 4 data points and 4 centroids. The registers are grouped by what they hold:]
• 4 dimensions of the data points
• 4 dimensions of the centroids
• intermediate results (reserved)
• minimum distance for the assignment of the 4 points
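The blocking idea behind this register layout, computing all 16 distances between 4 points and 4 centroids in one pass so that every loaded coordinate is reused four times, can be sketched in portable C++. This sketch uses no intrinsics and does not reproduce the register assignment on the slide; the name `tileDistances` and the row-major layout are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Tile = std::array<std::array<double, 4>, 4>;

// Hypothetical helper: squared Euclidean distances between 4 points and
// 4 centroids, each stored row-major with d values per row.
// acc[p][c] is the squared distance between point p and centroid c.
Tile tileDistances(const double* pts, const double* ctr, std::size_t d) {
    Tile acc{};  // value-initialised: all 16 accumulators start at 0.0
    for (std::size_t t = 0; t < d; ++t) {
        for (std::size_t p = 0; p < 4; ++p) {
            const double x = pts[p * d + t];  // loaded once, reused 4 times
            for (std::size_t c = 0; c < 4; ++c) {
                const double diff = x - ctr[c * d + t];
                acc[p][c] += diff * diff;
            }
        }
    }
    return acc;
}
```

A 4x4 tile of accumulators is small enough that a compiler can keep it in registers, which is the effect the hand-written AVX version aims for explicitly.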
8. AVOID BRANCHING
BACKPACKED CLUSTER ID CODING
• How to determine the nearest centroid (the argmin over the distances ||xi-µj||) efficiently?
• AVX has primitives for min but not for argmin
• Idea: store the current clusterId j in the 8 least significant
bits of the current distance
[Figure: bit layout of a double: sign | exponent | fraction (52 bit), with the cluster-ID in the least significant fraction bits]
9. AVOID BRANCHING
BACKPACKED CLUSTER ID CODING
• Our technique automatically copies the clusterId along with the minimum distance
• This works even with SIMD primitives:
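[Figure: bit layout of a double: sign | exponent | fraction (52 bit), with the cluster-ID in the least significant fraction bits]
YMM15 := _mm256_min_pd (YMM14, YMM15)
Example with 4 lanes, each written as clusterId:distance
YMM15 (old):  2:9.5   3:16.3  2:12.8  1:15.0
YMM14:        4:10.9  4:18.7  4:16.5  4:12.3
YMM15 (new):  2:9.5   3:16.3  2:12.8  4:12.3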
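The backpacking trick can be tried out in scalar C++ before moving to AVX. The sketch below assumes non-negative distances stored as IEEE 754 doubles and at most 256 clusters; the helper names `backpack`, `unpackId` and `nearestCentroid` are hypothetical and do not come from the authors' source code. `std::min` stands in for one lane of `_mm256_min_pd`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: overwrite the 8 least significant fraction bits of
// a non-negative distance with the cluster ID.
double backpack(double distance, std::uint8_t clusterId) {
    std::uint64_t bits;
    std::memcpy(&bits, &distance, sizeof bits);
    bits = (bits & ~std::uint64_t{0xFF}) | clusterId;
    std::memcpy(&distance, &bits, sizeof bits);
    return distance;
}

// Hypothetical helper: read the cluster ID back from a coded distance.
std::uint8_t unpackId(double coded) {
    std::uint64_t bits;
    std::memcpy(&bits, &coded, sizeof bits);
    return static_cast<std::uint8_t>(bits & 0xFF);
}

// Running minimum over k distances (k <= 256 assumed). No if-then tests
// the comparison result: the ID travels inside the value that min selects.
std::uint8_t nearestCentroid(const double* dist, int k) {
    double best = backpack(dist[0], 0);
    for (int j = 1; j < k; ++j) {
        best = std::min(best, backpack(dist[j], static_cast<std::uint8_t>(j)));
    }
    return unpackId(best);
}
```

With the values from the last lane of the slide example, `std::min(backpack(15.0, 1), backpack(12.3, 4))` carries cluster ID 4, because 12.3 is the smaller distance and the ID only occupies its lowest bits.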
10. INFLUENCE ON THE DISTANCE?
BACKPACKED CLUSTER ID CODING
• How much does a backpacked clusterId change the distance?
• Not much:
If the true distance is 1.0 and the clusterId is 255, the coded distance is
1.000000000000057 (13 zeros after the decimal point)
• Not significantly:
The Euclidean distance involves a square root, which means that half
of the bits are numerically insignificant anyway
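[Figure: bit layout of a double: sign | exponent | fraction (52 bit); the upper part of the fraction is marked as numerically significant in ||xi-µj||, the lower part as cluster-ID: 26 bit]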
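The figure quoted on this slide can be checked with a short calculation, assuming the clusterId overwrites the 8 least significant fraction bits of an IEEE 754 double. The symbols $\tilde d$ (coded distance) and $c$ (clusterId) are introduced here for illustration only.

```latex
% Coded value for true distance d = 1.0 (fraction bits all zero) and c = 255:
\tilde d = 1 + c \cdot 2^{-52} = 1 + 255 \cdot 2^{-52}
         \approx 1 + 5.66 \times 10^{-14}
         \approx 1.000000000000057

% For any positive normalised distance d, one unit in the last place is at
% most 2^{-52} d, and the low 8 bits change by at most 255 such units:
\frac{|\tilde d - d|}{d} \le 255 \cdot 2^{-52} < 2^{-44}
```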
11. SETTING
PERFORMANCE EVALUATION
• 2 quad-core CPUs 2.4 GHz
- Intel Xeon E5-2609
- (Sandy Bridge micro-architecture)
- AVX1
• Cache
- 4x32 kB L1 data cache
- 4x256 kB L2 cache
- 10 MB (shared) L3 cache
• Software
- C++ (GNU g++)
• 5 iterations
• Synthetic data
- n up to 64 million
- k up to 20
- d up to 100
• Real data from UCI
- Forest Covertype (n=580000, d=54)
- Household data (n=2 million, d=7)
12. REAL DATA
RUN UNTIL CONVERGENCE
[Figure: bar chart, y-axis from 0 to 12, comparing No Vect., Autovect. and MKM on 1 core and on 8 cores for three data sets: Synthetic (12D), CoverType (54D) and Household (7D). Value labels shown on the chart: 51.2, 39.1, 55.3.]
13. SYNTHETIC DATA
DASHED LINE SHOWS IDEAL CURVE
New experiments for the SDM final version
n = 32 million; k = 40; d = 20

Measured runtime for 5 iterations (s):
# Threads  Autovect.  BLAS-KM  no ID coding  MKM
1          134.313    43.873   60.915        31.18
2          68.03      28.856   25.569        18.896
3          46.871     19.408   18.228        12.501
4          36.031     15.39    13.843        9.155
5          29.411     12.296   13.888        7.64
6          25.081     13.858   10.583        6.554
7          21.914     11.896   10.923        5.533
8          19.758     10.392   8.519         5.017

Ideal runtime (single-thread runtime divided by the number of threads):
# Threads  Autovect.   BLAS-KM     no ID coding  MKM
1          134.313     43.873      60.915        31.18
2          67.1565     21.9365     30.4575       15.59
3          44.771      14.6243333  20.305        10.3933333
4          33.57825    10.96825    15.22875      7.795
5          26.8626     8.7746      12.183        6.236
6          22.3855     7.31216667  10.1525       5.19666667
7          19.1875714  6.26757143  8.70214286    4.45428571
8          16.789125   5.484125    7.614375      3.8975
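The ideal values on this slide can be reproduced by dividing the single-thread runtime by the number of threads. A minimal sketch of that arithmetic follows; the helper names `idealRuntime` and `speedup` are hypothetical and are not taken from the slides.

```cpp
#include <cassert>
#include <cmath>

// Ideal runtime under perfect scaling: single-thread runtime / thread count.
double idealRuntime(double singleThreadSeconds, int threads) {
    return singleThreadSeconds / threads;
}

// Achieved speedup: single-thread runtime / measured runtime with p threads.
double speedup(double singleThreadSeconds, double measuredSeconds) {
    return singleThreadSeconds / measuredSeconds;
}
```

For MKM, `idealRuntime(31.18, 8)` gives the 3.8975 s listed for 8 threads, against a measured 5.017 s.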
[Figure: four line plots of runtime for 5 iterations (s) against the number of threads (1 to 8). Legend: Autovect., BLAS-KM, no ID coding, MKM.]
15. MULTI-CORE K-MEANS
Source code available at:
https://informatik.univie.ac.at/dm/downloads/
PaperId: 031_115