# K-Means with BSP


1. K-Means Clustering with BSP. Thomas Jungblut, Testberichte.de, 2012. Study assignment, 4th semester, HWR Berlin.
2. Content: What is K-Means Clustering? What is BSP? K-Means with BSP.
3. What is K-Means Clustering?
4. What is K-Means Clustering?
5. (Slide without text.)
6. What is K-Means Clustering? Unsupervised learning; a huge number of input vectors; k initial centers; a two-step iterative algorithm: • Assignment • Update
7. How do we parallelize K-Means?
8. What is BSP? BSP = Bulk Synchronous Parallel, a paradigm for designing parallel algorithms with two basic operations: • Send message • Barrier synchronization
9. What is BSP? (Diagram: processes P1, P2, P3 in a superstep of computation, sync, communication, sync.)
10. What is BSP? Messages are queued during the computation phase. Between two barrier synchronizations, messages are exchanged in bulk. Messages from the previous superstep are available in the next superstep.
11. K-Means with BSP: partition the dataset into equal-sized blocks.
12. K-Means with BSP: put the centers into RAM on each process; sum the assigned vectors into a new temporary center object; iterate sequentially over the vectors on disk.
13. K-Means with BSP. (Diagram: a copy of the centers on each process.)
14. K-Means with BSP: sums per center. • Center 1: sum = 25, summed 5 times • Center 2: sum = 50, summed 10 times • Center 3: sum = 10, summed 5 times
15. K-Means with BSP: send the sum. (Diagram: each process holds the centers and a sum.)
16. K-Means with BSP: send the sum. (Diagram: each process holds the centers and a sum.)
17. K-Means with BSP: add up the received sums into a total sum, divide by the total increments to get the means, and use the means as the new centers. • The same calculation runs on every process • Floating point error can be corrected by synchronizing when it exceeds a given threshold
18. K-Means with BSP. (Diagram: update, assignment, sync.)
19. K-Means with BSP: partition the vectors into equal-sized blocks (# blocks = # tasks) and put the centers in RAM. Assignment phase: iterate over the vectors on disk sequentially; sum up temporary centers with the assigned vectors; message all tasks with the sum and how often something was summed. Update phase: calculate the total sum over all received messages and average it; replace the old centers with the new centers and calculate convergence.
20. 20. Benchmark 16 Server, 256 Cores, 10G network 80 seconds! Possible starvation: add more servers
21. 21. Benchmark Logarithmic scaling Much better than linear scaling of MapReduce 24
22. 22. Misc Implementation on Github https://github.com/thomasjungblut/thomasjungblut- common/blob/master/src/de/jungblut/clustering/KMe ansBSP.java Will be comitted to Hama‘s ML-package soon https://issues.apache.org/jira/browse/HAMA-547 25