
MLlib: Spark's Machine Learning Library

"MLlib: Spark's Machine Learning Library" by Ameet Talwalkar at AMPCamp 5


  1. About Me • Postdoc in AMPLab • Led initial development of MLlib • Technical Advisor for Databricks • Assistant Professor at UCLA • Research interests include scalability and ease-of-use issues in statistical machine learning
  2. MLlib: Spark’s Machine Learning Library. Ameet Talwalkar, AMPCamp 5, November 20, 2014
  3. History and Overview • Example Applications • Ongoing Development
  4. MLbase and MLlib. MLbase aims to simplify development and deployment of scalable ML pipelines. The stack, from bottom to top: Apache Spark, MLlib, MLI / Pipelines, MLOpt. • MLlib: Spark’s core ML library (production) • MLI, Pipelines: APIs to simplify ML development (tables, matrices, optimization, ML pipelines; experimental testbeds) • MLOpt: a declarative layer to automate hyperparameter tuning
  5. History of MLlib. Initial release: • Developed by the MLbase team in the AMPLab (11 contributors) • Scala, Java • Shipped with Spark v0.8 (Sep 2013). 15 months later: • 80+ contributors from various organizations • Scala, Java, Python • Latest release part of Spark v1.1 (Sep 2014)
  6. What’s in MLlib? • Collaborative filtering for recommendation: Alternating Least Squares • Prediction: Lasso, Ridge Regression, Logistic Regression, Decision Trees, Naïve Bayes, Support Vector Machines • Clustering: K-Means • Optimization: gradient descent, L-BFGS • Many utilities: random data generation, linear algebra, feature transformations, statistics (testing, correlation), evaluation metrics (a sketch of the statistics utilities follows below)
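Most of these utilities take only a few lines to use. Below is a minimal, hedged sketch of the basic statistics API as of Spark 1.1; the input numbers are invented for illustration, and a SparkContext `sc` is assumed to be in scope, as in the slides' own examples:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Column-wise summary statistics over an RDD of vectors.
    val observations = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))
    val summary = Statistics.colStats(observations)
    println(summary.mean)      // per-column mean
    println(summary.variance)  // per-column variance

    // Pearson correlation between two series.
    val x = sc.parallelize(Seq(1.0, 2.0, 3.0))
    val y = sc.parallelize(Seq(10.0, 21.0, 29.0))
    println(Statistics.corr(x, y, "pearson"))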
  7. Benefits of MLlib • Part of Spark • Integrated data analysis workflow • Free performance gains. (MLlib sits alongside SparkSQL, Spark Streaming, and GraphX on top of Apache Spark.)
  8. Benefits of MLlib • Part of Spark • Integrated data analysis workflow • Free performance gains • Scalable • Python, Scala, Java APIs • Broad coverage of applications & algorithms • Rapid improvements in speed & robustness
  9. History and Overview • Example Applications: use cases, distributed ML challenges, code examples • Ongoing Development
  10. Clustering with K-Means. Given data points, find meaningful clusters.
  11.–14. Clustering with K-Means (four animation slides). Iterate: choose cluster centers, then assign points to clusters, until the assignments stabilize.
  15. Clustering with K-Means, distributed: • Data distributed by instance (point/row) • Smart initialization • Limited communication (# clusters << # instances)
  16. K-Means: Scala

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Load and parse data.
    val data = sc.textFile("kmeans_data.txt")
    val parsedData = data.map { x =>
      Vectors.dense(x.split(' ').map(_.toDouble))
    }.cache()

    // Cluster data into 5 classes using KMeans.
    val clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

    // Evaluate clustering error.
    val cost = clusters.computeCost(parsedData)
    println("Sum of squared errors = " + cost)
  17. K-Means: Python

    from numpy import array
    from pyspark.mllib.clustering import KMeans

    # Load and parse data.
    data = sc.textFile("kmeans_data.txt")
    parsedData = data.map(
        lambda line: array([float(x) for x in line.split(' ')])).cache()

    # Cluster data into 5 classes using KMeans.
    clusters = KMeans.train(parsedData, k = 5, maxIterations = 20)

    # Evaluate clustering error (squared distance to the nearest center,
    # matching computeCost in the Scala API).
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sum([x ** 2 for x in (point - center)])

    cost = (parsedData.map(lambda point: error(point))
                      .reduce(lambda x, y: x + y))
    print("Sum of squared errors = " + str(cost))
  18. Recommendation. Goal: recommend movies to users.
  19. Recommendation via collaborative filtering. Goal: recommend movies to users. Challenges: • Defining similarity • Dimensionality (millions of users/items) • Sparsity
  20. Recommendation. Solution: assume ratings are determined by a small number of latent factors, so the rating matrix is approximately the product of two small factor matrices. 25M users and 100K movies give 2.5 trillion possible ratings, but with 10 factors per user/movie there are only about 250M parameters (the arithmetic is spelled out below).
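Spelling out the slide's parameter-count arithmetic in LaTeX (k = 10 latent factors, as on the slide):

    R \approx U M^{\top}, \qquad
    U \in \mathbb{R}^{25\mathrm{M} \times 10}, \quad
    M \in \mathbb{R}^{100\mathrm{K} \times 10}

    \underbrace{25\mathrm{M} \times 100\mathrm{K}}_{\text{full matrix}}
      = 2.5 \times 10^{12} \ \text{entries}
    \qquad\text{vs.}\qquad
    \underbrace{(25\mathrm{M} + 100\mathrm{K}) \times 10}_{\text{factors}}
      \approx 2.5 \times 10^{8} \ \text{parameters}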
  21. Recommendation with Alternating Least Squares (ALS). Algorithm: alternately update the user factors and the movie factors, holding the other fixed.
  22. Recommendation with Alternating Least Squares (ALS). Algorithm: alternating update of user/movie factors. Factors can be updated in parallel, but one must be careful about communication.
  23. Recommendation with Alternating Least Squares (ALS)

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Load and parse the data.
    val data = sc.textFile("mllib/data/als/test.data")
    val ratings = data.map(_.split(',') match {
      case Array(user, item, rate) =>
        Rating(user.toInt, item.toInt, rate.toDouble)
    })

    // Build the recommendation model using ALS
    // (lambda is the regularization parameter).
    val model = ALS.train(ratings, rank = 10, iterations = 20, lambda = 0.01)

    // Evaluate the model on rating data.
    val usersProducts = ratings.map {
      case Rating(user, product, rate) => (user, product)
    }
    val predictions = model.predict(usersProducts)
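The slide stops after computing predictions. A hedged sketch of the remaining evaluation step, computing mean squared error by joining predictions back to the observed ratings (this follows the MLlib programming guide's example, not the talk itself):

    // Key both RDDs by (user, product) and join predicted to observed ratings.
    val predictionsKeyed = predictions.map {
      case Rating(user, product, rate) => ((user, product), rate)
    }
    val ratesAndPreds = ratings.map {
      case Rating(user, product, rate) => ((user, product), rate)
    }.join(predictionsKeyed)

    // Mean squared error over all (observed, predicted) pairs.
    val MSE = ratesAndPreds.map {
      case ((user, product), (r, p)) => math.pow(r - p, 2)
    }.mean()
    println("Mean Squared Error = " + MSE)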
  24. ALS: Today’s ML Exercise • Load 1M/10M ratings from MovieLens • Specify YOUR ratings on examples • Split examples into training/validation sets • Fit a model (Python or Scala) • Improve the model via parameter tuning (a sketch follows below) • Get YOUR recommendations
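A hedged sketch of the train/validation split and tuning loop this exercise calls for, reusing the `ratings` RDD from slide 23; the split fractions, seed, and parameter grid are invented for illustration:

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import org.apache.spark.rdd.RDD

    // Hold out 20% of the ratings for validation (illustrative split).
    val Array(training, validation) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

    // RMSE of a model on a rating set, via the join pattern from slide 23.
    def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
      val predictions = model.predict(data.map(r => (r.user, r.product)))
        .map(r => ((r.user, r.product), r.rating))
      val observed = data.map(r => ((r.user, r.product), r.rating))
      math.sqrt(observed.join(predictions).map {
        case (_, (obs, pred)) => math.pow(obs - pred, 2)
      }.mean())
    }

    // Grid-search two hyperparameters and keep the best model.
    val models = for (rank <- Seq(8, 12); lambda <- Seq(0.01, 0.1))
      yield ALS.train(training, rank, 20, lambda)
    val best = models.minBy(m => rmse(m, validation))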
  25. History and Overview • Example Applications • Ongoing Development: performance, new APIs
  26. Performance. [Figure: runtime in minutes of ALS on Amazon Reviews on 16 nodes, MLlib vs. Mahout, as the number of ratings grows from 0M to 800M.] On a dataset with 660M users, 2.4M items, and 3.5B ratings, MLlib runs in 40 minutes with 50 nodes.
  27. Performance: steady performance gains. [Figure: per-algorithm speedup of Spark 1.1 over Spark 1.0 for Decision Trees, ALS, K-Means, Logistic Regression, and Ridge Regression.] ~3X speedups on average.
  28. Algorithms. In Spark 1.2: • Random Forests (ensembles of Decision Trees) • Boosting. Under development: • Topic modeling • Many others (a hedged random-forest sketch follows below).
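A minimal, hedged sketch of the Spark 1.2 random-forest API; the data path and every parameter value here are illustrative choices, not from the talk:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    // Load a LabeledPoint dataset in LIBSVM format (illustrative path).
    val examples = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Train an ensemble of 10 trees for binary classification.
    val model = RandomForest.trainClassifier(
      examples,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),  // all features continuous
      numTrees = 10,
      featureSubsetStrategy = "auto",  // let MLlib pick features per split
      impurity = "gini",
      maxDepth = 4,
      maxBins = 32,
      seed = 12345)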
  29. ML Pipelines. Typical ML workflow: Training Data feeds Feature Extraction, then Model Training, then Model Testing (on held-out Test Data).
  30. ML Pipelines. A typical ML workflow is complex: additional training datasets, feature-extraction steps, and models (Other Training Data feeding Other Model Training) all feed into the same testing stage.
  31. ML Pipelines. The typical ML workflow is complex. Pipelines in 1.2 (alpha release): • Easy workflow construction • Standardized interface for model tuning • Testing & failing early. Inspired by the MLbase / Pipelines project (see Evan’s talk), in collaboration with Databricks; MLbase / MLOpt aims to autotune these pipelines. A sketch of the alpha API follows below.
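For flavor, a minimal sketch of the 1.2 alpha pipeline API: a three-stage text-classification pipeline. The stage choices and parameter values are illustrative, and `trainingData` is assumed to be a SchemaRDD with `text` and `label` columns:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Stage 1: split raw text into words.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    // Stage 2: hash words into fixed-length feature vectors.
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    // Stage 3: fit a logistic regression model on the features.
    val lr = new LogisticRegression()
      .setMaxIter(10)

    // A Pipeline chains the stages; fit() runs them in order.
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(trainingData)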
  32. Datasets: further integration with SparkSQL. ML pipelines require datasets that can: • Handle many data types (features) • Keep metadata about features • Select subsets of features for different parts of the pipeline • Join groups of features. ML Dataset = SchemaRDD. Inspired by the MLbase / MLI API (see the sketch below).
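A hedged sketch of how an ML dataset becomes a SchemaRDD in Spark 1.2; the case class, values, and table name are invented for illustration:

    import org.apache.spark.sql.SQLContext

    case class LabeledDocument(id: Long, text: String, label: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

    // An RDD of case classes becomes a SchemaRDD with named, typed columns,
    // so pipeline stages can refer to features by name.
    val training = sc.parallelize(Seq(
      LabeledDocument(0L, "spark mllib pipelines", 1.0),
      LabeledDocument(1L, "unrelated text", 0.0)))
    training.registerTempTable("training")

    // Select just the columns a given pipeline stage needs.
    val textAndLabel = sqlContext.sql("SELECT text, label FROM training")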
  33. Resources
    • MLlib Programming Guide: spark.apache.org/docs/latest/mllib-guide.html
    • Databricks training info: databricks.com/spark-training
    • Spark user lists & community: spark.apache.org/community.html
    • edX MOOC on Scalable Machine Learning: www.edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
    • 4-day BIDS minicourse in January
