
- 1. Large Scale Learning with Apache Spark Sandy Ryza, Data Science, Cloudera
- 2. Me ● Data scientist at Cloudera ● Recently led Apache Spark development at Cloudera ● Before that, committing on Apache Hadoop ● Before that, studying combinatorial optimization and distributed systems at Brown
- 3. Sometimes you find yourself with lots of stuff
- 4. Large Scale Learning
- 5. Network Packets
- 6. Detect Network Intrusions
- 7. Credit Card Transactions
- 8. Detect Fraud
- 9. Movie Viewings
- 10. Recommend Movies
- 11. Two Main Problems ● Designing a system for processing huge data in parallel ● Taking advantage of it with algorithms that work well in parallel
- 12. System Requirements ● Scalability ● Programming model that abstracts away distributed ugliness ● Data-scientist friendly ○ High-level operators ○ Interactive shell (REPL) ● Efficiency for iterative algorithms
- 13. CONFIDENTIAL - RESTRICTED* MapReduce [diagram: 12 Map tasks feeding 4 Reduce tasks] Key advances by MapReduce: • Data locality: automatic split computation and appropriate placement of mappers • Fault tolerance: writing out intermediate results and restartable mappers made it possible to run on commodity hardware • Linear scalability: a combination of locality and a programming model that forces developers to write generally scalable solutions to problems
- 14. Spark: Easy and Fast Big Data • Easy to develop ○ Rich APIs in Java, Scala, Python ○ Interactive shell ○ 2-5× less code • Fast to run ○ General execution graphs ○ In-memory storage ○ Up to 10× faster on disk, 100× in memory
- 15. What is Spark? Spark is a general-purpose computation framework geared towards massive data, more flexible than MapReduce. Extra properties: • Leverages distributed memory • Full directed-graph expressions for data-parallel computations • Improved developer experience. Yet retains: linear scalability, fault tolerance, and data locality
- 16. Spark introduces the concept of the RDD to take advantage of memory. RDD = Resilient Distributed Dataset • Defined by parallel transformations on data in stable storage
- 17. RDDs [diagram: bigfile.txt]
- 18. RDDs [diagram: bigfile.txt, lines] val lines = sc.textFile("bigfile.txt")
- 19. RDDs [diagram: bigfile.txt, lines, numbers] val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toDouble)
- 20. RDDs [diagram: bigfile.txt, lines, numbers, sum] val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toDouble); numbers.sum()
- 21. RDDs [diagram: bigfile.txt (HDFS), lines and numbers drawn as partitions, sum (driver)] val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toDouble); numbers.sum()
- 22. Shuffle [diagram: bigfile.txt (HDFS), lines and numbers drawn as partitions, sum (driver)] val lines = sc.textFile("bigfile.txt"); val sorted = lines.map(_.toDouble).sortBy(identity); sorted.sum()
- 23. Persistence and Fault Tolerance • User decides whether and how to persist ○ Disk ○ Memory ○ Transient (recomputed on each use) Observation: fault tolerance is provided through the concept of lineage
- 24. Lineage • Reconstruct partitions that go down using the original steps that created them
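As a rough illustration of the lineage idea, the sketch below rebuilds a lost partition by replaying the transformation that produced it. `ToyRDD` and its methods are hypothetical names invented for this example, not Spark's API, and the source partitions held in memory stand in for stable storage such as HDFS.

```python
# Toy illustration of lineage-based recovery. ToyRDD is a hypothetical class
# made up for this sketch; it is not how Spark implements RDDs.
class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # upstream dataset (None for source data)
        self.fn = fn          # per-element transformation applied to the parent
        # Partitions currently held; for source data this stands in for stable storage.
        self.cache = dict(enumerate(partitions)) if partitions is not None else {}
        self.num_partitions = (
            len(partitions) if partitions is not None else parent.num_partitions
        )

    def map(self, fn):
        # Record the transformation instead of running it (lazy, like lineage).
        return ToyRDD(parent=self, fn=fn)

    def compute(self, i):
        """Return partition i, rebuilding it from the parent if it is not held."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            raise RuntimeError("source partition missing from stable storage")
        return [self.fn(x) for x in self.parent.compute(i)]

    def persist(self):
        # Materialize every partition and keep it in memory.
        self.cache = {i: self.compute(i) for i in range(self.num_partitions)}
        return self

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


lines = ToyRDD(partitions=[["1", "2"], ["3", "4"], ["5"]])
numbers = lines.map(float).persist()
del numbers.cache[1]            # simulate losing one cached partition
total = sum(numbers.collect())  # partition 1 is recomputed from `lines`
```

Only the missing partition is recomputed; the other two are served from the cache.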
- 25. RDDs [diagram: bigfile.txt (HDFS), lines and numbers drawn as partitions, sum (driver)] val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toInt); numbers.cache().sum()
- 26. numbers.sum() [diagram: bigfile.txt, lines, numbers drawn as three partitions, sum (driver)]
- 27. Easy • Multi-language support • Interactive shell

  Python

      lines = sc.textFile(...)
      lines.filter(lambda s: "ERROR" in s).count()

  Scala

      val lines = sc.textFile(...)
      lines.filter(s => s.contains("ERROR")).count()

  Java

      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
      }).count();
- 28. Out of the Box Functionality • Hadoop integration ○ Works with Hadoop data ○ Runs under YARN • Libraries ○ MLlib ○ Spark Streaming ○ GraphX (alpha) • Roadmap ○ Language support: improved Python support, SparkR, Java 8 ○ Schema support in Spark's APIs ○ Better ML: sparse data support, model evaluation framework, performance testing
- 29. So back to ML (same outline as slide 28)
- 30. Spark MLlib, by task type (supervised or unsupervised) and output type (discrete or continuous) ● Supervised, discrete: classification ○ Logistic regression (and regularized variants) ○ Linear SVM ○ Naive Bayes ○ Random decision forests (soon) ● Supervised, continuous: regression ○ Linear regression (and regularized variants) ● Unsupervised, discrete: clustering ○ K-means ● Unsupervised, continuous: dimensionality reduction, matrix factorization ○ Principal component analysis / singular value decomposition ○ Alternating least squares
- 31. Spark MLlib (repeat of slide 30)
- 32. Why Cluster Big Data? ● Learn the structure of your data ● Interpret new data as it relates to this structure
- 33. Anomaly Detection ● Anomalies as data points far away from any cluster
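One way to read "far away from any cluster" is to threshold the distance from each point to its nearest cluster center. The sketch below assumes the centers have already been learned; the function names, the sample data and the threshold of 2.0 are all made up for illustration.

```python
import math


def nearest_center_distance(point, centers):
    """Euclidean distance from a point to its closest cluster center."""
    return min(math.dist(point, c) for c in centers)


def find_anomalies(points, centers, threshold):
    """Points whose nearest center is farther away than `threshold`."""
    return [p for p in points if nearest_center_distance(p, centers) > threshold]


# Hypothetical centers and points; the threshold is an arbitrary choice.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0), (0.1, -0.3)]
anomalies = find_anomalies(points, centers, threshold=2.0)
```

In practice the threshold would be picked from the distribution of distances seen in the training data rather than fixed by hand.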
- 34. Feature Learning
- 35. Feature Learning
- 36. Feature Learning
- 37. Image patch features
- 38. Train a classifier on each cluster
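- 39. Using it

      import org.apache.spark.mllib.clustering.KMeans

      val data = sc.textFile("kmeans_data.txt")
      // Parse each space-separated line into an array of doubles.
      // Newer MLlib releases take RDD[Vector]; wrap the array in Vectors.dense(...) there.
      val parsedData = data.map(_.split(' ').map(_.toDouble))

      // Cluster the data into two classes using KMeans
      val numIterations = 20
      val numClusters = 2
      val clusters = KMeans.train(parsedData, numClusters, numIterations)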
- 40. K-Means ● Alternate between two steps: ○ Assign each point to a cluster based on existing centers ○ Recompute cluster centers from the points in each cluster
- 41. K-Means - very parallelizable ● Alternate between two steps: ○ Assign each point to a cluster based on existing centers (process each data point independently) ○ Recompute cluster centers from the points in each cluster (average across partitions)
- 42. Find the sum and count of points mapping to each center:

      val totalContribs = data.mapPartitions { points =>
        val k = centers.length
        val dims = centers(0).vector.length
        val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
        val counts = Array.fill(k)(0L)
        points.foreach { point =>
          val (bestCenter, cost) = KMeans.findClosest(centers, point)
          costAccum += cost
          sums(bestCenter) += point.vector
          counts(bestCenter) += 1
        }
        val contribs = for (j <- 0 until k) yield {
          (j, (sums(j), counts(j)))
        }
        contribs.iterator
      }.reduceByKey(mergeContribs).collectAsMap()
- 43. Update the cluster centers and costs:

      var changed = false
      var j = 0
      while (j < k) {
        val (sum, count) = totalContribs(j)
        if (count != 0) {
          sum /= count.toDouble
          val newCenter = new BreezeVectorWithNorm(sum)
          if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
            changed = true
          }
          centers(j) = newCenter
        }
        j += 1
      }
      if (!changed) {
        logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
      }
      cost = costAccum.value
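The two Scala excerpts on slides 42 and 43 depend on Spark internals, so they do not run on their own. The following is a minimal single-machine sketch of the same pattern: each partition produces per-center sums and counts, the contributions are merged, and the new centers are the averages. All function names are invented for this example, and the plain loop over partitions stands in for work Spark would distribute.

```python
def closest(centers, point):
    """Index of the nearest center and the squared distance to it."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    best = min(range(len(centers)), key=dists.__getitem__)
    return best, dists[best]


def partition_contribs(points, centers):
    """Per-partition sums and counts of the points assigned to each center."""
    dims = len(centers[0])
    sums = [[0.0] * dims for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        j, _ = closest(centers, p)
        counts[j] += 1
        for d in range(dims):
            sums[j][d] += p[d]
    return sums, counts


def kmeans_step(partitions, centers):
    """One iteration: merge per-partition contributions, then average."""
    dims = len(centers[0])
    total_sums = [[0.0] * dims for _ in centers]
    total_counts = [0] * len(centers)
    for part in partitions:  # each partition could be processed independently
        sums, counts = partition_contribs(part, centers)
        for j in range(len(centers)):
            total_counts[j] += counts[j]
            for d in range(dims):
                total_sums[j][d] += sums[j][d]
    # Centers with no assigned points are left where they were.
    return [
        tuple(s / total_counts[j] for s in total_sums[j])
        if total_counts[j]
        else centers[j]
        for j in range(len(centers))
    ]


partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
new_centers = kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

Only the small per-center sums and counts cross partition boundaries, which is why the step parallelizes well.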
- 44. The Problem ● K-Means is very sensitive to the initial set of center points chosen. ● The best existing algorithm for choosing centers is highly sequential.
- 45. K-Means++ ● Start with a random point from the dataset ● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen ● Repeat until all initial centers are chosen
- 46. K-Means++ ● The initial clustering has expected cost within a factor of O(log k) of the optimum cost
- 47. K-Means++ ● Requires k passes over the data
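The K-Means++ seeding described on slides 45 to 47 can be sketched as follows. This is a plain single-machine version with made-up names; it weights candidates by squared distance to the nearest chosen center and assumes the data contains at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centers; assumes at least k distinct points exist."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # One full pass over the data per new center, hence k passes overall.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        # Already chosen points have weight 0, so they cannot be drawn again.
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


rng = random.Random(0)
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
centers = kmeans_pp_init(points, 2, rng)
```

Each new center depends on all centers chosen before it, which is what makes the procedure sequential.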
- 48. K-Means|| ● Do only a few (~5) passes ● Sample m points on each pass ● Oversample ● Run K-Means++ on sampled points to find initial centers
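A rough single-machine sketch of the K-Means|| idea follows. It makes several assumptions that the slide does not state: each point is sampled independently with probability proportional to its squared distance from the current candidates, the expected number of samples per pass `m` defaults to `2 * k`, and candidates are weighted by how many points are closest to them before the final K-Means++ step. All names are invented for this example.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cost_to_nearest(p, centers):
    return min(sq_dist(p, c) for c in centers)


def weighted_kmeans_pp(cands, weights, k, rng):
    """K-Means++ over weighted candidates; may return fewer than k centers."""
    centers = [rng.choices(cands, weights=weights, k=1)[0]]
    while len(centers) < k:
        w = [wt * cost_to_nearest(c, centers) for c, wt in zip(cands, weights)]
        if sum(w) == 0:  # no distinct candidate left to pick
            break
        centers.append(rng.choices(cands, weights=w, k=1)[0])
    return centers


def kmeans_parallel_init(points, k, rng, passes=5, m=None):
    m = 2 * k if m is None else m  # assumed default oversampling amount
    candidates = [rng.choice(points)]
    for _ in range(passes):
        # Each pass only needs distances to the current candidates, so the
        # per-point work could be spread across partitions.
        costs = [cost_to_nearest(p, candidates) for p in points]
        total = sum(costs)
        if total == 0:
            break
        candidates.extend(
            p for p, c in zip(points, costs)
            if rng.random() < min(1.0, m * c / total)
        )
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    # The candidate set is small, so the sequential step is cheap here.
    return weighted_kmeans_pp(candidates, weights, k, rng)


rng = random.Random(42)
points = [(x + 0.1 * i, y + 0.1 * i)
          for x, y in [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
          for i in range(10)]
centers = kmeans_parallel_init(points, 3, rng)
```

The number of passes over the full data is fixed in advance instead of growing with k; the sequential K-Means++ step runs only on the sampled candidates.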
- 49. Then on the real data...
