Machine Learning with Spark MLlib


An introduction to Spark MLlib from the Apache Spark with Scala course. These slides present an overview of machine learning with Apache Spark MLlib.

For more background on machine learning see my other uploaded presentation "Machine Learning with Spark".

Published in: Data & Analytics


  1. Spark MLlib
  2. Overview
     • MLlib is Spark's library of machine learning (ML) functions, designed to run in parallel on clusters.
     • MLlib contains a variety of learning algorithms and invokes them on RDDs.
     • Some classic ML algorithms are not included in Spark MLlib because they were not designed for parallel execution.
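To make "designed to run in parallel" concrete, here is a minimal plain-Python sketch (not Spark code) of the pattern such algorithms rely on: each partition computes a partial result independently, and the partials are then combined. The toy data, the 1-D model `y = w * x`, and the name `partial_gradient` are assumptions made for illustration only.

```python
from functools import reduce
from typing import List, Tuple

Example = Tuple[float, float]  # (feature x, label y)


def partial_gradient(partition: List[Example], w: float) -> float:
    """Gradient of the squared error sum((w*x - y)^2) over one partition."""
    return sum(2.0 * (w * x - y) * x for x, y in partition)


# Hypothetical data set, already split into three partitions.
partitions = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(4.0, 8.0), (5.0, 10.0)],
]
w = 1.0

# "Map" step: every partition is processed independently of the others.
partials = [partial_gradient(p, w) for p in partitions]

# "Reduce" step: partial results are merged into the full gradient.
gradient = reduce(lambda a, b: a + b, partials)
```

An algorithm fits this model when its per-step work decomposes into independent partial results like these; algorithms that need every record to see every other record at each step are harder to distribute.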
  3. Overview
     • MLlib is divided into two packages:
       • spark.mllib contains the original API built on top of RDDs.
       • spark.ml provides a higher-level API built on top of DataFrames.
     • Using spark.ml is recommended because the DataFrame-based API is more versatile and flexible. The plan is to keep supporting spark.mllib alongside the development of spark.ml.
  4. Machine Learning Recap
     • Machine learning algorithms try to make predictions or decisions based on training data.
     • There are multiple types of learning problems, including classification, regression, and clustering, each with a different objective.
  5. Spark MLlib Data Types
     • MLlib contains a few specific data types, including Vector, LabeledPoint, Rating, Matrix (local and distributed), and various Model classes.
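As a rough illustration of what two of these types carry, the following plain-Python sketch models a sparse feature vector and a labeled point. The classes `SparseVec` and `Labeled` are hypothetical stand-ins written for this example; they are not the MLlib classes and do not mirror their exact constructors.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SparseVec:
    """Hypothetical stand-in for a sparse vector: a size plus non-zero entries."""
    size: int
    entries: Dict[int, float]  # index -> non-zero value

    def to_dense(self) -> List[float]:
        # Expand to a dense list, filling unspecified positions with 0.0.
        dense = [0.0] * self.size
        for index, value in self.entries.items():
            dense[index] = value
        return dense


@dataclass(frozen=True)
class Labeled:
    """Hypothetical stand-in for a labeled point: a label and its features."""
    label: float
    features: List[float]


# A 4-dimensional vector with two non-zero entries, used as training features.
sparse = SparseVec(size=4, entries={0: 1.0, 3: 2.5})
point = Labeled(label=1.0, features=sparse.to_dense())
```

The sparse form stores only the non-zero entries, which matters when feature vectors are wide and mostly zero, as with text features.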
  6. MLlib Supported Supervised Algorithm Methods
     • Binary classification problems: linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes
     • Multiclass classification problems: logistic regression, decision trees, random forests, naive Bayes
     • Regression problems: linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression
  7. MLlib Supported Unsupervised Models
     • K-means
     • Gaussian mixture
     • Power iteration clustering (PIC)
     • Latent Dirichlet allocation (LDA)
     • Bisecting k-means
     • Streaming k-means
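For readers new to the first model in this list, here is a minimal single-machine sketch of the basic k-means iteration (assign each point to its nearest center, then move each center to the mean of its points). It is plain Python, not the MLlib implementation; the function names, the fixed iteration count, and the toy points are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def closest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center nearest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points: Sequence[Point], initial: Sequence[Point], iterations: int = 10) -> List[Point]:
    centers = list(initial)
    for _ in range(iterations):
        # Assignment step: group points by their nearest center.
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            groups[closest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        for i, group in enumerate(groups):
            if group:  # leave a center unchanged if no point was assigned to it
                centers[i] = (
                    sum(p[0] for p in group) / len(group),
                    sum(p[1] for p in group) / len(group),
                )
    return centers


# Two well-separated toy clusters and two hand-picked starting centers.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers = kmeans(points, initial=[(0.0, 0.0), (10.0, 9.0)])
```

The assignment step is independent per point, which is what makes this family of algorithms a natural fit for partitioned data.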
  8. Recommender Systems
     • Collaborative filtering is commonly used for recommender systems.
     • spark.mllib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries.
     • spark.mllib uses the alternating least squares (ALS) algorithm to learn these latent factors.
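The idea of learning latent factors by alternating least squares can be sketched in a few lines of plain Python. This is a deliberately simplified version with a single latent factor per user and per product, and it is not the spark.mllib implementation; the function name `als_rank1`, the `reg` parameter, and the toy ratings are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Ratings = Dict[Tuple[int, int], float]  # (user, product) -> observed rating


def als_rank1(ratings: Ratings, n_users: int, n_items: int,
              iterations: int = 100, reg: float = 0.0) -> Tuple[List[float], List[float]]:
    """Alternating least squares with one latent factor per user and product."""
    users = [1.0] * n_users
    items = [1.0] * n_items
    for _ in range(iterations):
        # Fix product factors; solve the 1-D least-squares problem for each user.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            users[u] = sum(r * items[i] for i, r in seen) / (
                reg + sum(items[i] ** 2 for i, _ in seen))
        # Fix user factors; solve the same kind of problem for each product.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            items[i] = sum(r * users[u] for u, r in seen) / (
                reg + sum(users[u] ** 2 for u, _ in seen))
    return users, items


# Toy data: 2 users, 3 products; user 1 has not rated product 2.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0, (1, 0): 2.0, (1, 1): 4.0}
users, items = als_rank1(ratings, n_users=2, n_items=3)

# Predict the missing entry as the product of the two learned factors.
predicted = users[1] * items[2]
```

Because user 1's known ratings are exactly twice user 0's in this toy data, the prediction for the missing entry should approach 6.0. Real rating data is noisy, so a positive `reg` value and more than one latent factor would normally be used.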
  9. For more, visit