Unsupervised learning refers to a family of algorithms that try to find structure in unlabeled data. Spark’s MLlib module contains implementations of several unsupervised learning algorithms that scale to large datasets. In this talk, we’ll discuss how to use and implement large-scale machine learning algorithms with the Spark programming model, diving into MLlib’s K-means clustering and Principal Component Analysis (PCA).

Published in: Technology

- 1. Clustering with Spark Sandy Ryza / Data Science / Cloudera
- 2. Me ● Data scientist at Cloudera ● Recently led Apache Spark development at Cloudera ● Before that, committing on Apache Hadoop ● Before that, studying combinatorial optimization and distributed systems at Brown
- 3. Sometimes you find yourself with lots of stuff
- 4. Large Scale Learning
- 5. Network Packets
- 6. Detect Network Intrusions
- 7. Credit Card Transactions
- 8. Detect Fraud
- 9. Movie Viewings
- 10. Recommend Movies
- 11. Unsupervised Learning ● Learn hidden structure of your data ● Interpret new data as it relates to this structure
- 12. Two Main Problems ● Designing a system for processing huge data in parallel ● Taking advantage of it with algorithms that work well in parallel
- 13. MapReduce (diagram: twelve Map tasks feeding four Reduce tasks) Key advances by MapReduce: • Data locality: automatic split computation and launch of mappers close to the data • Fault tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware • Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems * CONFIDENTIAL - RESTRICTED
- 14. MapReduce (diagram: twelve Map tasks feeding four Reduce tasks) Limitations of MapReduce: • Each job reads data from HDFS • No concept of a session • Jobs are rigidly map-then-reduce
- 15. Spark is a general-purpose computation framework geared towards massive data, more flexible than MapReduce. Extra properties: • Leverages distributed memory • Full directed graph expressions for data-parallel computations • Improved developer experience. Yet retains: linear scalability, fault tolerance, and data locality
- 16. RDDs (diagram: bigfile.txt in HDFS → lines → numbers, each RDD split into partitions; sum returned to the driver)

      val lines = sc.textFile("bigfile.txt")
      val numbers = lines.map((x) => x.toDouble)
      numbers.sum()

- 17. RDDs (diagram: bigfile.txt in HDFS → lines → numbers, each RDD split into partitions; sum returned to the driver)

      val lines = sc.textFile("bigfile.txt")
      val numbers = lines.map((x) => x.toInt)
      numbers.cache()
        .sum()

- 18. (diagram: bigfile.txt → lines → numbers, with the numbers partitions feeding sum on the driver)

      numbers.sum()

- 19. Spark MLlib ● Supervised, discrete (classification): logistic regression (and regularized variants), linear SVM, naive Bayes, random decision forests (soon) ● Supervised, continuous (regression): linear regression (and regularized variants) ● Unsupervised, discrete (clustering): K-means ● Unsupervised, continuous (dimensionality reduction, matrix factorization): principal component analysis / singular value decomposition, alternating least squares
- 20. Spark MLlib ● Supervised, discrete (classification): logistic regression (and regularized variants), linear SVM, naive Bayes, random decision forests (soon) ● Supervised, continuous (regression): linear regression (and regularized variants) ● Unsupervised, discrete (clustering): K-means ● Unsupervised, continuous (dimensionality reduction, matrix factorization): principal component analysis / singular value decomposition, alternating least squares
- 21. Using it

      import org.apache.spark.mllib.clustering.KMeans

      val data = sc.textFile("kmeans_data.txt")
      val parsedData = data.map(_.split(' ').map(_.toDouble))

      // Cluster the data into two classes using KMeans
      val numIterations = 20
      val numClusters = 2
      val clusters = KMeans.train(parsedData, numClusters, numIterations)

- 22. K-Means ● Choose some initial centers ● Then alternate between two steps: ○ Assign each point to a cluster based on existing centers ○ Recompute cluster centers from the points in each cluster
- 23. K-Means - very parallelizable ● Alternate between two steps: ○ Assign each point to a cluster based on existing centers ■ Process each data point independently ○ Recompute cluster centers from the points in each cluster ■ Average across partitions
- 24. Finding the sum and count of points mapping to each center

      // Find the sum and count of points mapping to each center
      val totalContribs = data.mapPartitions { points =>
        val k = centers.length
        val dims = centers(0).vector.length
        val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
        val counts = Array.fill(k)(0L)
        points.foreach { point =>
          val (bestCenter, cost) = KMeans.findClosest(centers, point)
          costAccum += cost
          sums(bestCenter) += point.vector
          counts(bestCenter) += 1
        }
        val contribs = for (j <- 0 until k) yield {
          (j, (sums(j), counts(j)))
        }
        contribs.iterator
      }.reduceByKey(mergeContribs).collectAsMap()

- 25. Updating the cluster centers and costs

      // Update the cluster centers and costs
      var changed = false
      var j = 0
      while (j < k) {
        val (sum, count) = totalContribs(j)
        if (count != 0) {
          sum /= count.toDouble
          val newCenter = new BreezeVectorWithNorm(sum)
          if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
            changed = true
          }
          centers(j) = newCenter
        }
        j += 1
      }
      if (!changed) {
        logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
      }
      cost = costAccum.value

- 26. The Problem ● K-Means is very sensitive to the initial set of center points chosen. ● The best existing algorithm for choosing centers is highly sequential.
- 27. K-Means++ ● Start with a random point from the dataset ● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen ● Repeat until k initial centers are chosen
- 28. K-Means++ ● The initial clustering has expected cost within a factor of O(log k) of the optimum cost
- 29. K-Means++ ● Requires k passes over the data
- 30. K-Means|| ● Do only a few (~5) passes ● Sample m points on each pass ● Oversample ● Run K-Means++ on sampled points to find initial centers
- 31. Then on the full data...
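
The seeding and iteration steps described in slides 22 to 29 can be sketched on plain Scala collections. This is a minimal single-machine sketch, not MLlib's implementation: the names `KMeansSketch`, `initPlusPlus`, `step`, and `run` are hypothetical, points are assumed to be dense `Array[Double]` values of equal length, and the K-Means|| oversampling from slide 30 is not covered.

```scala
import scala.util.Random

object KMeansSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum

  // Index of the closest center and the squared distance to it.
  def findClosest(centers: Seq[Point], p: Point): (Int, Double) =
    centers.indices.map(j => (j, sqDist(centers(j), p))).minBy(_._2)

  // K-Means++ seeding: the first center is drawn uniformly at random; each
  // further center is drawn with probability proportional to its squared
  // distance from the closest center chosen so far. Needs one pass per center.
  def initPlusPlus(points: Seq[Point], k: Int, rng: Random): Vector[Point] = {
    var centers = Vector(points(rng.nextInt(points.length)))
    while (centers.length < k) {
      val weights = points.map(p => findClosest(centers, p)._2)
      val target = rng.nextDouble() * weights.sum
      // Walk the cumulative weights until the random target is passed.
      val cumulative = weights.scanLeft(0.0)(_ + _).tail
      val idx = cumulative.indexWhere(_ > target)
      // Fallback (assumption): if every weight is zero, take the last point.
      centers :+= points(if (idx >= 0) idx else points.length - 1)
    }
    centers
  }

  // One iteration: assign each point to its closest center, then move each
  // center to the mean of the points assigned to it.
  def step(points: Seq[Point], centers: Vector[Point]): Vector[Point] = {
    val byCenter = points.groupBy(p => findClosest(centers, p)._1)
    centers.indices.toVector.map { j =>
      byCenter.get(j) match {
        case Some(ps) =>
          Array.tabulate(centers(j).length)(d => ps.map(_(d)).sum / ps.length)
        case None => centers(j) // empty cluster: keep the old center
      }
    }
  }

  // Seed with K-Means++, then run a fixed number of iterations.
  def run(points: Seq[Point], k: Int, maxIterations: Int, seed: Long): Vector[Point] = {
    var centers = initPlusPlus(points, k, new Random(seed))
    for (_ <- 0 until maxIterations) centers = step(points, centers)
    centers
  }
}
```

Calling `KMeansSketch.run(points, 2, 10, 42L)` on a small in-memory `Seq[Array[Double]]` returns two centers. The distributed version in slides 24 and 25 follows the same two steps, but computes per-partition sums and counts and merges them with `reduceByKey` instead of grouping all points in one place.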
