Large Scale Machine Learning with Apache Spark

Spark offers a number of advantages over its predecessor MapReduce that make it ideal for large-scale machine learning. For example, Spark includes MLlib, a library of machine learning algorithms for large data. The presentation covers the state of MLlib and the details of some of the scalable algorithms it includes.

Transcript

  • 1. Large Scale Learning with Apache Spark (Sandy Ryza, Data Science, Cloudera)
  • 2. Me ● Data scientist at Cloudera ● Recently led Apache Spark development at Cloudera ● Before that, committed on Apache Hadoop ● Before that, studied combinatorial optimization and distributed systems at Brown
  • 3. Sometimes you find yourself with lots of stuff
  • 4. Large Scale Learning
  • 5. Network Packets
  • 6. Detect Network Intrusions
  • 7. Credit Card Transactions
  • 8. Detect Fraud
  • 9. Movie Viewings
  • 10. Recommend Movies
  • 11. Two Main Problems ● Designing a system for processing huge data in parallel ● Taking advantage of it with algorithms that work well in parallel
  • 12. System Requirements ● Scalability ● Programming model that abstracts away distributed ugliness ● Data-scientist friendly ○ High-level operators ○ Interactive shell (REPL) ● Efficiency for iterative algorithms
  • 13. MapReduce. Key advances by MapReduce:
    ● Data locality: automatic split computation and launch of mappers close to the data
    ● Fault tolerance: writing out intermediate results and restartable mappers meant the ability to run on commodity hardware
    ● Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions
  • 14. Spark: Easy and Fast Big Data
    ● Easy to develop: rich APIs in Java, Scala, Python; interactive shell
    ● Fast to run: general execution graphs; in-memory storage
    ● 2-5× less code; up to 10× faster on disk, 100× in memory
  • 15. What is Spark? Spark is a general-purpose computation framework geared towards massive data, more flexible than MapReduce. Extra properties:
    ● Leverages distributed memory
    ● Full directed-graph expressions for data-parallel computations
    ● Improved developer experience
    Yet it retains linear scalability, fault tolerance, and data locality.
  • 16. Spark introduces the concept of the RDD (Resilient Distributed Dataset) to take advantage of memory. RDDs are defined by parallel transformations on data in stable storage.
  • 17. RDDs: bigfile.txt (a file in stable storage)
  • 18. RDDs: val lines = sc.textFile("bigfile.txt")
  • 19. RDDs: val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toDouble)
  • 20. RDDs: val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toDouble); numbers.sum()
  • 21. RDDs: the same pipeline, with numbers split into partitions on HDFS and the sum returned to the driver: val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toDouble); numbers.sum()
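    Taken together, the pipeline from these slides as a self-contained sketch; the local[*] master and a file of one number per line are assumptions made so it runs standalone:

      import org.apache.spark.{SparkConf, SparkContext}

      object RddPipeline {
        def main(args: Array[String]): Unit = {
          // Assumed setup: a local master; bigfile.txt holds one number per line.
          val conf = new SparkConf().setAppName("RddPipeline").setMaster("local[*]")
          val sc = new SparkContext(conf)

          val lines = sc.textFile("bigfile.txt")    // RDD[String], one element per line
          val numbers = lines.map(x => x.toDouble)  // transformation: lazy, nothing runs yet
          val total = numbers.sum()                 // action: triggers the distributed job
          println(s"sum = $total")

          sc.stop()
        }
      }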
  • 22. Shuffle: sorting requires repartitioning data across the cluster: val sorted = numbers.sortBy(identity); sorted.sum()
  • 23. Persistence and Fault Tolerance
    ● The user decides whether and how to persist: disk, memory, or transient (recomputed on each use)
    ● Observation: this provides fault tolerance through the concept of lineage
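    For instance, a minimal sketch of choosing a storage level, reusing numbers from the RDD slides; MEMORY_ONLY is just one option:

      import org.apache.spark.storage.StorageLevel

      // One call marks the RDD for caching; alternatives include DISK_ONLY and
      // MEMORY_AND_DISK. With no persist() at all the RDD stays transient and
      // is recomputed from lineage on each use.
      numbers.persist(StorageLevel.MEMORY_ONLY)
      numbers.sum()  // first action materializes and caches the partitions
      numbers.sum()  // second action reads the cached partitions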
  • 24. Lineage ● Reconstruct partitions that go down by replaying the original steps used to create them
  • 25. RDDs: val lines = sc.textFile("bigfile.txt"); val numbers = lines.map((x) => x.toInt); numbers.cache().sum()
  • 26. numbers.sum(): with numbers cached, a second action reads the cached partitions instead of recomputing from bigfile.txt
  • 27. Easy: multi-language support and an interactive shell
    Python:
      lines = sc.textFile(...)
      lines.filter(lambda s: "ERROR" in s).count()
    Scala:
      val lines = sc.textFile(...)
      lines.filter(s => s.contains("ERROR")).count()
    Java:
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("ERROR"); }
      }).count();
  • 28. Out of the Box Functionality
    ● Hadoop integration: works with Hadoop data, runs under YARN
    ● Libraries: MLlib, Spark Streaming, GraphX (alpha)
    ● Roadmap:
      ○ Language support: improved Python support, SparkR, Java 8
      ○ Schema support in Spark's APIs
      ○ Better ML: sparse data support, model evaluation framework, performance testing
  • 29. So back to ML: of the libraries above, the focus here is MLlib
  • 30. Spark MLlib (algorithms by learning task)
    ● Classification (supervised, discrete): logistic regression (and regularized variants), linear SVM, naive Bayes, random decision forests (soon)
    ● Regression (supervised, continuous): linear regression (and regularized variants)
    ● Clustering (unsupervised, discrete): K-means
    ● Dimensionality reduction / matrix factorization (unsupervised, continuous): principal component analysis / singular value decomposition, alternating least squares
  • 31. Spark MLlib: the same taxonomy, zooming in on clustering with K-means
  • 32. Why Cluster Big Data? ● Learn the structure of your data ● Interpret new data as it relates to this structure
  • 33. Anomaly Detection ● Anomalies as data points far away from any cluster
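    One way this could look in code, as a sketch: the distance threshold and the trained model are assumptions here (training a KMeansModel appears later in the deck), and this is not an MLlib anomaly-detection API:

      import org.apache.spark.mllib.clustering.KMeansModel
      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.rdd.RDD

      // Flag points whose distance to the nearest cluster center exceeds a
      // threshold. Illustrative sketch only.
      def anomalies(data: RDD[Vector], model: KMeansModel, threshold: Double): RDD[Vector] = {
        val centers = model.clusterCenters
        data.filter { point =>
          val nearestSq = centers.map(c => Vectors.sqdist(point, c)).min
          nearestSq > threshold * threshold  // compare squared distances
        }
      }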
  • 34. Feature Learning
  • 35. Feature Learning
  • 36. Feature Learning
  • 37. Image patch features
  • 38. Train a classifier on each cluster
  • 39. Using it
      val data = sc.textFile("kmeans_data.txt")
      val parsedData = data.map(_.split(' ').map(_.toDouble))
      // Cluster the data into two classes using KMeans
      val numIterations = 20
      val numClusters = 2
      val clusters = KMeans.train(parsedData, numClusters, numIterations)
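    As a hedged follow-up: in later MLlib versions the points are mllib.linalg Vectors rather than arrays, and the returned model can score clustering quality; a sketch under that assumption, reusing sc from the earlier slides:

      import org.apache.spark.mllib.clustering.KMeans
      import org.apache.spark.mllib.linalg.Vectors

      // The same pipeline against the Vector-based API, plus a simple quality metric.
      val parsed = sc.textFile("kmeans_data.txt")
        .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      val model = KMeans.train(parsed, 2, 20)   // k = 2, maxIterations = 20
      val wssse = model.computeCost(parsed)     // within-set sum of squared errors
      println(s"WSSSE = $wssse")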
  • 40. K-Means ● Alternate between two steps:
    ○ Assign each point to a cluster based on the existing centers
    ○ Recompute cluster centers from the points in each cluster
  • 41. K-Means is very parallelizable ● Alternate between two steps:
    ○ Assign each point to a cluster based on the existing centers: each data point is processed independently
    ○ Recompute cluster centers from the points in each cluster: average across partitions
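    A simplified sketch of one such iteration using plain RDD operations; the real MLlib code follows on the next two slides, and unlike it, this version drops centers that receive no points:

      import org.apache.spark.mllib.linalg.{Vector, Vectors}
      import org.apache.spark.rdd.RDD

      // One K-Means iteration: assign each point to its nearest center in
      // parallel, then average the points assigned to each center.
      def step(data: RDD[Vector], centers: Array[Vector]): Array[Vector] = {
        data
          .map { p =>
            val best = centers.indices.minBy(i => Vectors.sqdist(p, centers(i)))
            (best, (p.toArray, 1L))                    // (center id, (point, count))
          }
          .reduceByKey { case ((s1, n1), (s2, n2)) =>  // sum points and counts per center
            (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
          }
          .collectAsMap()                              // small: at most k entries
          .values
          .map { case (sum, n) => Vectors.dense(sum.map(_ / n)) }
          .toArray
      }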
  • 42. // Find the sum and count of points mapping to each center
      val totalContribs = data.mapPartitions { points =>
        val k = centers.length
        val dims = centers(0).vector.length
        val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
        val counts = Array.fill(k)(0L)
        points.foreach { point =>
          val (bestCenter, cost) = KMeans.findClosest(centers, point)
          costAccum += cost
          sums(bestCenter) += point.vector
          counts(bestCenter) += 1
        }
        val contribs = for (j <- 0 until k) yield {
          (j, (sums(j), counts(j)))
        }
        contribs.iterator
      }.reduceByKey(mergeContribs).collectAsMap()
  • 43. // Update the cluster centers and costs
      var changed = false
      var j = 0
      while (j < k) {
        val (sum, count) = totalContribs(j)
        if (count != 0) {
          sum /= count.toDouble
          val newCenter = new BreezeVectorWithNorm(sum)
          if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
            changed = true
          }
          centers(j) = newCenter
        }
        j += 1
      }
      if (!changed) {
        logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
      }
      cost = costAccum.value
  • 44. The Problem ● K-Means is very sensitive to the initial set of centers chosen ● The best existing algorithm for choosing centers is highly sequential
  • 45. K-Means++ ● Start with a random point from the dataset ● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen ● Repeat until all initial centers are chosen
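    A minimal local (non-distributed) sketch of that procedure; the Array[Double] point representation and the sqDist helper are assumptions for illustration, not MLlib's internals:

      import scala.collection.mutable.ArrayBuffer
      import scala.util.Random

      def sqDist(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def kMeansPlusPlus(points: Array[Array[Double]], k: Int, rand: Random): Array[Array[Double]] = {
        val centers = ArrayBuffer(points(rand.nextInt(points.length)))
        while (centers.length < k) {
          // Weight each point by its squared distance to the nearest chosen center
          val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
          // Roulette-wheel sample: probability proportional to weight
          var r = rand.nextDouble() * weights.sum
          var i = 0
          while (i < points.length - 1 && r > weights(i)) { r -= weights(i); i += 1 }
          centers += points(i)
        }
        centers.toArray
      }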
  • 46. K-Means++ ● The resulting initial centers have expected cost within an O(log k) factor of the optimal clustering cost
  • 47. K-Means++ ● Requires k passes over the data
  • 48. K-Means|| ● Do only a few (~5) passes over the data ● Sample m points on each pass, oversampling relative to the k centers needed ● Run K-Means++ on the sampled points to find the initial centers
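    A hedged sketch of those passes on an RDD, reusing sqDist and kMeansPlusPlus from the previous sketch; the oversampling factor l and the local K-Means++ finish follow the K-Means|| paper, not MLlib's exact code:

      import scala.util.Random
      import org.apache.spark.rdd.RDD

      // K-Means|| initialization sketch: a few passes, each keeping points with
      // probability proportional to their share of the total cost, scaled by an
      // oversampling factor l (commonly ~2k), so each pass samples ~l points.
      def kMeansParallel(data: RDD[Array[Double]], k: Int, passes: Int, l: Double): Array[Array[Double]] = {
        val centers = data.takeSample(withReplacement = false, num = 1).toBuffer
        for (_ <- 0 until passes) {
          val costs = data.map(p => centers.map(c => sqDist(p, c)).min)
          val totalCost = costs.sum()
          val sampled = data.zip(costs).filter { case (_, d) =>
            Random.nextDouble() < l * d / totalCost
          }.map(_._1).collect()
          centers ++= sampled
        }
        // Finish on the small sampled set with K-Means++ to get exactly k centers
        kMeansPlusPlus(centers.toArray, k, new Random())
      }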
  • 49. Then on the real data...