- 1. Large Scale Learning with Apache Spark Sandy Ryza, Data Science, Cloudera
- 2. Me ● Data scientist at Cloudera ● Recently led Apache Spark development at Cloudera ● Before that, committing on Apache Hadoop ● Before that, studying combinatorial optimization and distributed systems at Brown
- 3. Sometimes you find yourself with lots of stuff
- 4. Large Scale Learning
- 5. Network Packets
- 6. Detect Network Intrusions
- 7. Credit Card Transactions
- 8. Detect Fraud
- 9. Movie Viewings
- 10. Recommend Movies
- 11. Two Main Problems ● Designing a system for processing huge data in parallel ● Taking advantage of it with algorithms that work well in parallel
- 12. System Requirements ● Scalability ● Programming model that abstracts away distributed ugliness ● Data-scientist friendly ○ High-level operators ○ Interactive shell (REPL) ● Efficiency for iterative algorithms
- 13. CONFIDENTIAL - RESTRICTED* MapReduce [diagram: 12 Map tasks feeding 4 Reduce tasks] Key advances by MapReduce: • Data locality: automatic split computation and launch of mappers appropriately • Fault tolerance: writing out intermediate results and restartable mappers meant the ability to run on commodity hardware • Linear scalability: combination of locality + programming model that forces developers to write generally scalable solutions to problems
- 14. Spark: Easy and Fast Big Data • Easy to Develop • Rich APIs in Java, Scala, Python • Interactive shell • 2-5× less code • Fast to Run • General execution graphs • In-memory storage • Up to 10× faster on disk, 100× in memory
- 15. What is Spark? Spark is a general-purpose computation framework geared towards massive data, more flexible than MapReduce. Extra properties: • Leverages distributed memory • Full directed graph expressions for data-parallel computations • Improved developer experience. Yet retains: linear scalability, fault tolerance and data locality
- 16. Spark introduces the concept of the RDD to take advantage of memory. RDD = Resilient Distributed Dataset • Defined by parallel transformations on data in stable storage
- 17. RDDs bigfile.txt
- 18. RDDs bigfile.txt lines val lines = sc.textFile("bigfile.txt")
- 19. RDDs bigfile.txt lines val lines = sc.textFile("bigfile.txt") numbers val numbers = lines.map(x => x.toDouble)
- 20. RDDs bigfile.txt lines val lines = sc.textFile("bigfile.txt") numbers val numbers = lines.map(x => x.toDouble) sum numbers.sum()
- 21. RDDs bigfile.txt lines val lines = sc.textFile("bigfile.txt") numbers Partition Partition Partition Partition Partition Partition HDFS sum Driver val numbers = lines.map(x => x.toDouble) numbers.sum()
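
The partition picture above can be imitated in plain Python as a rough sketch. No Spark is involved: the round-robin `split_into_partitions` helper is a hypothetical stand-in for HDFS blocks, and the "executors" are ordinary function calls.

```python
from typing import List


def split_into_partitions(lines: List[str], num_partitions: int) -> List[List[str]]:
    """Round-robin split; a hypothetical stand-in for HDFS blocks."""
    return [lines[i::num_partitions] for i in range(num_partitions)]


def partition_sum(partition: List[str]) -> float:
    # The map step (string -> float) and the partial sum both run
    # independently on each partition.
    return sum(float(x) for x in partition)


def distributed_sum(lines: List[str], num_partitions: int = 3) -> float:
    partitions = split_into_partitions(lines, num_partitions)
    # In Spark these calls would run in parallel on the executors.
    partial_sums = [partition_sum(p) for p in partitions]
    # The driver only combines one number per partition.
    return sum(partial_sums)


lines = ["1.5", "2.5", "3.0", "4.0", "5.0", "6.0"]
total = distributed_sum(lines)
```

Only the per-partition totals travel to the driver, which is why a sum needs no shuffle.
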
- 22. Shuffle bigfile.txt lines val lines = sc.textFile("bigfile.txt") numbers Partition Partition Partition Partition Partition Partition HDFS sum Driver val sorted = lines.map(_.toDouble).sortBy(x => x) sorted.sum()
- 23. Persistence and Fault Tolerance • User decides whether and how to persist • Disk • Memory • Transient (recomputed on each use) Observation: provides fault tolerance through the concept of lineage
- 24. Lineage • Reconstruct partitions that go down using the original steps we used to create them
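
A toy illustration of lineage, assuming nothing about Spark's internals: `ToyRDD` is a hypothetical class that remembers its parent and the function used to derive each partition, so a lost partition can be rebuilt on demand. The `cache` dictionary stands in for in-memory persistence.

```python
class ToyRDD:
    """A dataset defined by its parent and a per-partition function (its lineage)."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute
        self.parent = parent
        self.cache = {}          # partition index -> materialized data
        self.compute_calls = 0   # how often a partition had to be (re)built

    def map(self, f):
        # The child only records how to derive its data from the parent.
        return ToyRDD(self.num_partitions,
                      lambda parent_data: [f(x) for x in parent_data],
                      parent=self)

    def partition(self, i):
        if i in self.cache:
            return self.cache[i]
        self.compute_calls += 1
        if self.parent is None:
            data = self._compute(i)                          # read stable storage
        else:
            data = self._compute(self.parent.partition(i))   # replay the lineage
        self.cache[i] = data
        return data

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self.cache.pop(i, None)


storage = [["1", "2"], ["3", "4"], ["5", "6"]]   # stands in for bigfile.txt
lines = ToyRDD(len(storage), lambda i: list(storage[i]))
numbers = lines.map(float)

before = [numbers.partition(i) for i in range(3)]
numbers.lose_partition(1)
recovered = numbers.partition(1)   # rebuilt from `lines`, not from a replica
```

Only the lost partition is recomputed; the other partitions stay cached.
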
- 25. RDDs bigfile.txt lines val lines = sc.textFile("bigfile.txt") numbers Partition Partition Partition Partition Partition Partition HDFS sum Driver val numbers = lines.map(x => x.toInt) numbers.cache().sum()
- 26. numbers.sum() bigfile.txt lines numbers Partition Partition Partition sum Driver
- 27. Easy • Multi-language support • Interactive Shell Python lines = sc.textFile(...) lines.filter(lambda s: "ERROR" in s).count() Scala val lines = sc.textFile(...) lines.filter(s => s.contains("ERROR")).count() Java JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("ERROR"); } }).count();
- 28. Out of the Box Functionality • Hadoop Integration • Works with Hadoop Data • Runs under YARN • Libraries • MLlib • Spark Streaming • GraphX (alpha) • Roadmap • Language support: • Improved Python support • SparkR • Java 8 • Schema support in Spark's APIs • Better ML • Sparse Data Support • Model Evaluation Framework • Performance Testing
- 29. So back to ML • Hadoop Integration • Works with Hadoop Data • Runs under YARN • Libraries • MLlib • Spark Streaming • GraphX (alpha) • Roadmap • Language support: • Improved Python support • SparkR • Java 8 • Schema support in Spark's APIs • Better ML • Sparse Data Support • Model Evaluation Framework • Performance Testing
- 30. Spark MLlib Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
- 31. Spark MLlib Discrete Continuous Supervised Classification ● Logistic regression (and regularized variants) ● Linear SVM ● Naive Bayes ● Random Decision Forests (soon) Regression ● Linear regression (and regularized variants) Unsupervised Clustering ● K-means Dimensionality reduction, matrix factorization ● Principal component analysis / singular value decomposition ● Alternating least squares
- 32. Why Cluster Big Data? ● Learn the structure of your data ● Interpret new data as it relates to this structure
- 33. Anomaly Detection ● Anomalies as data points far away from any cluster
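
As a minimal sketch of this idea, a point can be flagged when its distance to the nearest cluster center exceeds a cutoff. The centers, sample points and the `threshold` value below are made up for illustration; in practice the centers would come from a trained clustering model.

```python
import math


def nearest_center_distance(point, centers):
    """Euclidean distance from a point to its closest cluster center."""
    return min(math.dist(point, c) for c in centers)


def find_anomalies(points, centers, threshold):
    # A point is anomalous when even its nearest center is far away.
    return [p for p in points if nearest_center_distance(p, centers) > threshold]


centers = [(0.0, 0.0), (10.0, 10.0)]                         # assumed, not learned here
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0), (0.1, -0.3)]
anomalies = find_anomalies(points, centers, threshold=2.0)
```

Each point is scored independently, so the check parallelizes the same way as the assignment step of K-means.
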
- 34. Feature Learning
- 35. Feature Learning
- 36. Feature Learning
- 37. Image patch features
- 38. Train a classifier on each cluster
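- 39. Using it

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      val data = sc.textFile("kmeans_data.txt")
      // Parse each space-separated line into a dense feature vector
      val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

      // Cluster the data into two classes using KMeans
      val numIterations = 20
      val numClusters = 2
      val clusters = KMeans.train(parsedData, numClusters, numIterations)
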
- 40. K-Means ● Alternate between two steps: o Assign each point to a cluster based on existing centers o Recompute cluster centers from the points in each cluster
- 41. K-Means - very parallelizable ● Alternate between two steps: o Assign each point to a cluster based on existing centers Process each data point independently o Recompute cluster centers from the points in each cluster Average across partitions
- 42.

      // Find the sum and count of points mapping to each center
      val totalContribs = data.mapPartitions { points =>
        val k = centers.length
        val dims = centers(0).vector.length
        val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
        val counts = Array.fill(k)(0L)
        points.foreach { point =>
          val (bestCenter, cost) = KMeans.findClosest(centers, point)
          costAccum += cost
          sums(bestCenter) += point.vector
          counts(bestCenter) += 1
        }
        val contribs = for (j <- 0 until k) yield {
          (j, (sums(j), counts(j)))
        }
        contribs.iterator
      }.reduceByKey(mergeContribs).collectAsMap()

- 43.

      // Update the cluster centers and costs
      var changed = false
      var j = 0
      while (j < k) {
        val (sum, count) = totalContribs(j)
        if (count != 0) {
          sum /= count.toDouble
          val newCenter = new BreezeVectorWithNorm(sum)
          if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
            changed = true
          }
          centers(j) = newCenter
        }
        j += 1
      }
      if (!changed) {
        logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
      }
      cost = costAccum.value

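
The same assign-and-average iteration can be sketched in plain Python, under the assumption that each inner list plays the role of one partition. The helper names (`partition_contribs`, `merge`, `kmeans_step`) are illustrative and are not part of MLlib.

```python
import math
from functools import reduce


def closest(centers, point):
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda j: math.dist(centers[j], point))


def partition_contribs(points, centers):
    """Per-partition work: sum and count of the points assigned to each center."""
    k, dims = len(centers), len(centers[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = closest(centers, p)
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += p[d]
    return sums, counts


def merge(a, b):
    """Combine the contributions of two partitions (the reduce step)."""
    sums = [[x + y for x, y in zip(sa, sb)] for sa, sb in zip(a[0], b[0])]
    counts = [x + y for x, y in zip(a[1], b[1])]
    return sums, counts


def kmeans_step(partitions, centers):
    sums, counts = reduce(merge, (partition_contribs(p, centers) for p in partitions))
    # A center that attracted no points keeps its old position.
    return [[s / n for s in total] if n else list(old)
            for total, n, old in zip(sums, counts, centers)]


partitions = [[(0.0, 0.0), (10.0, 10.0)],
              [(0.0, 2.0), (10.0, 12.0)]]
new_centers = kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

Each partition ships only k sums and k counts, so the data moved per iteration does not grow with the number of points.
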
- 44. The Problem ● K-Means is very sensitive to initial set of center points chosen. ● Best existing algorithm for choosing centers is highly sequential.
- 45. K-Means++ ● Start with a random point from the dataset ● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen ● Repeat until all initial centers are chosen
- 46. K-Means++ ● The initial centers have an expected cost within a factor of O(log k) of the optimum cost
- 47. K-Means++ ● Requires k passes over the data
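
A small sketch of K-Means++ seeding shows where the k passes come from: every new center requires rescanning all points to refresh their distance to the nearest chosen center. It assumes the data holds at least k distinct points; the function name and sample data are illustrative.

```python
import math
import random


def kmeans_pp(points, k, rng):
    """Pick k initial centers; each loop iteration is one pass over the data."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Far-away points are more likely to be picked; chosen points have
        # weight zero and cannot be picked again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
seeds = kmeans_pp(points, k=3, rng=random.Random(0))
```

The choice of each center depends on all earlier ones, which is what makes the procedure sequential.
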
- 48. K-Means|| ● Do only a few (~5) passes ● Sample m points on each pass ● Oversample ● Run K-Means++ on sampled points to find initial centers
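
A rough plain-Python sketch of this initialization follows. The slide leaves several details open, so these are assumptions: the oversampling factor defaults to 2k, each pass samples points independently with probability proportional to their squared distance from the current candidates, and candidates are weighted by how many points they attract before the final K-Means++ step. It also assumes at least k distinct points.

```python
import math
import random


def sq_dist_to_nearest(p, centers):
    return min(math.dist(p, c) ** 2 for c in centers)


def weighted_kmeans_pp(candidates, weights, k, rng):
    """K-Means++ over the small candidate set, run on a single machine."""
    centers = [rng.choices(candidates, weights=weights, k=1)[0]]
    while len(centers) < k:
        w = [wt * sq_dist_to_nearest(c, centers) for c, wt in zip(candidates, weights)]
        centers.append(rng.choices(candidates, weights=w, k=1)[0])
    return centers


def kmeans_parallel_init(points, k, rng, rounds=5, oversample=None):
    m = oversample if oversample is not None else 2 * k   # assumed default
    candidates = [rng.choice(points)]
    for _ in range(rounds):                                # a few passes, not k
        d2 = [sq_dist_to_nearest(p, candidates) for p in points]
        total = sum(d2)
        if total == 0:
            break
        # Each point is sampled independently, so one pass parallelizes well.
        candidates += [p for p, d in zip(points, d2)
                       if rng.random() < min(1.0, m * d / total)]
    if len(candidates) < k:
        # Unlikely after several rounds; top up so k centers can be returned.
        remaining = [p for p in points if p not in candidates]
        candidates += rng.sample(remaining, k - len(candidates))
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: math.dist(p, candidates[i]))
        weights[j] += 1
    return weighted_kmeans_pp(candidates, weights, k, rng)


points = [(cx + 0.1 * i, cy + 0.1 * j)
          for cx, cy in [(0, 0), (50, 0), (0, 50)]
          for i in range(5) for j in range(5)]
seeds = kmeans_parallel_init(points, k=3, rng=random.Random(42))
```

The expensive scans over the full dataset happen a fixed number of times, and the sequential K-Means++ step only sees the much smaller candidate set.
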
- 49. Then on the real data...
