Chapter 11: Machine Learning with MLlib
Learning Spark
by Holden Karau et al.
Overview: Machine Learning with MLlib
 System Requirements
 Machine Learning Basics
 Data Types
 Algorithms
 Feature Extraction
 Statistics
 Classification and Regression
 Clustering
 Collaborative Filtering and Recommendation
 Dimensionality Reduction
 Model Evaluation
 Tips and Performance Considerations
 Pipeline API
 Conclusion
11.1 Overview
 MLlib’s design and philosophy are simple: it lets you
invoke various algorithms on distributed datasets,
representing all data as RDDs.
 It contains only parallel algorithms that run well on clusters.
 In Spark 1.0 and 1.1, MLlib’s interface is relatively low-level.
 In Spark 1.2, MLlib gains an additional pipeline API for building machine learning pipelines.
11.2 System Requirements
 MLlib requires some linear algebra libraries to be
installed on your machines.
 the gfortran runtime library
 To use MLlib in Python, you will need NumPy
 install the python-numpy or numpy package through your package manager on Linux
 or use a third-party scientific Python distribution such as Anaconda.
Edx and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
11.3 Machine Learning Basics
 Machine learning algorithms attempt to make
predictions or decisions based on training data.
 All learning algorithms require defining a set of features
for each item
 Most algorithms are defined only for numerical features
 specifically, a vector of numbers representing the value for each
feature
 Once data is represented as feature vectors, most
machine learning algorithms optimize a well-defined
mathematical function based on these vectors
 Finally, most learning algorithms have multiple
parameters that can affect results
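A minimal sketch of this representation in Python, assuming the pyspark shell (where a SparkContext named sc already exists); the features and labels here are invented for illustration:

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    # Each item is reduced to a numeric feature vector, e.g.
    # (word count, number of exclamation marks), plus a 0/1 label.
    training = sc.parallelize([
        LabeledPoint(1.0, Vectors.dense([120.0, 9.0])),  # spam
        LabeledPoint(0.0, Vectors.dense([45.0, 0.0])),   # not spam
    ])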
11.4 Data Types
 MLlib contains a few specific data types, located in:
 the org.apache.spark.mllib package (Java/Scala)
 the pyspark.mllib package (Python)
 The main ones are:
 Vector
 LabeledPoint
 Rating
 Various Model classes
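A short sketch of constructing each of these types in Python (assuming Spark 1.2's pyspark.mllib; the values are invented):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.recommendation import Rating

    dense = Vectors.dense([1.0, 2.0, 3.0])        # every entry stored
    sparse = Vectors.sparse(3, {0: 1.0, 2: 3.0})  # size + nonzero entries
    point = LabeledPoint(1.0, dense)              # label plus feature vector
    rating = Rating(1, 10, 5.0)                   # user 1, product 10, rating 5.0

Model classes are not constructed directly; they are returned by each algorithm's train method.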
11.5 Algorithms
 Feature Extraction
 mllib.feature package
 Statistics
 mllib.stat.Statistics class.
 Classification and Regression
 use the LabeledPoint class (in the mllib.regression package)
 Clustering
 K-means, as well as a variant called K-means||
 Collaborative Filtering and Recommendation
 mllib.recommendation.ALS class
 Dimensionality Reduction
 Model Evaluation
11.5.1 Feature Extraction
 TF-IDF (Term Frequency–Inverse Document Frequency)
 computes two statistics for each term in each document:
 the term frequency (TF)
 the inverse document frequency (IDF)
 MLlib has two algorithms that compute TF-IDF: HashingTF and IDF (see the sketch after this list)
 Scaling
 Normalization
 Word2Vec
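A minimal TF-IDF sketch in Python, assuming a SparkContext sc (the toy corpus is invented):

    from pyspark.mllib.feature import HashingTF, IDF

    # Each document is a list of terms.
    docs = sc.parallelize([
        "spark mllib supports tf idf".split(),
        "spark makes distributed datasets easy".split(),
    ])

    tf = HashingTF(numFeatures=10000)       # hash terms into a 10,000-dim vector
    tfVectors = tf.transform(docs).cache()  # cached: used by both fit and transform

    idfModel = IDF().fit(tfVectors)         # compute IDF weights over the corpus
    tfIdfVectors = idfModel.transform(tfVectors)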
11.5.2 Statistics
 MLlib offers several widely used statistics functions
that work directly on RDDs
 Statistics.colStats(rdd)
 Statistics.corr(rdd, method)
 Statistics.corr(rdd1, rdd2, method)
 Statistics.chiSqTest(rdd)
 Apart from these methods, RDDs containing
numeric data offer several basic statistics such as
mean(), stdev(), and sum()
 RDDs support sample() and sampleByKey() to build
simple and stratified samples of data.
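A brief sketch of these statistics functions in Python (assuming a SparkContext sc; the numbers are invented):

    import numpy as np
    from pyspark.mllib.stat import Statistics

    rows = sc.parallelize([
        np.array([1.0, 10.0]),
        np.array([2.0, 22.0]),
        np.array([3.0, 29.0]),
    ])

    summary = Statistics.colStats(rows)             # per-column summaries
    print(summary.mean(), summary.variance())
    corr = Statistics.corr(rows, method="pearson")  # correlation matrix

    # Numeric RDDs also offer basic statistics directly:
    nums = sc.parallelize([1.0, 2.0, 3.0])
    print(nums.mean(), nums.stdev(), nums.sum())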
11.5.3 Classification and Regression
 Classification and regression are two common forms
of supervised learning
 The difference between them:
 in classification, the predicted variable is discrete
 in regression, the predicted variable is continuous
 MLlib includes a variety of methods:
 Linear regression
 Logistic regression
 Support Vector Machines
 Naive Bayes
 Decision trees and random forests
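As a sketch, logistic regression on a toy, linearly separable dataset (assuming a SparkContext sc; the data is invented):

    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # Label is 1.0 when the first feature dominates the second.
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(0.0, [0.1, 0.9]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [0.9, 0.2]),
    ])

    model = LogisticRegressionWithSGD.train(data, iterations=100)
    print(model.predict([0.8, 0.1]))  # should fall in class 1

The other methods follow the same train/predict pattern, with algorithm-specific parameters.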
11.5.4 Clustering
 Clustering is the unsupervised learning task that
involves grouping objects into clusters of high
similarity
 MLlib includes the popular K-means algorithm for
clustering, as well as a variant called K-means||
 K-means|| is similar to the K-means++ initialization procedure often used in single-node settings.
 To invoke K-means:
 create a mllib.clustering.KMeans object (in Java/Scala)
 or call KMeans.train (in Python)
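A minimal K-means sketch in Python (assuming a SparkContext sc; the points are invented):

    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize([
        [0.0, 0.0], [0.1, 0.1],   # one tight group near the origin
        [9.0, 9.0], [9.1, 9.1],   # another far away
    ])

    model = KMeans.train(points, k=2, maxIterations=10,
                         initializationMode="k-means||")
    print(model.clusterCenters)         # the two learned centers
    print(model.predict([0.05, 0.05]))  # cluster ID for a new point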
11.5.5 Collaborative Filtering and
Recommendation
 Collaborative filtering:
 is a technique for recommender systems
 is attractive because it needs only a list of user/product interactions, not hand-built features for each item
 MLlib includes an implementation of Alternating
Least Squares (ALS)
 It is located in the mllib.recommendation.ALS class.
 To use ALS, you need to give it an RDD of
mllib.recommendation.Rating objects
 there are two variants of ALS: for explicit ratings (the default)
and for implicit ratings
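A minimal ALS sketch in Python (assuming a SparkContext sc; the ratings are invented):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([
        Rating(1, 10, 5.0),   # user 1 gave product 10 a rating of 5.0
        Rating(1, 20, 1.0),
        Rating(2, 10, 4.0),
        Rating(2, 30, 5.0),
    ])

    model = ALS.train(ratings, rank=10, iterations=5)  # explicit ratings
    # For implicit feedback (clicks, view counts, ...), use ALS.trainImplicit.
    print(model.predict(1, 30))  # predicted rating of product 30 for user 1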
11.5.6 Dimensionality Reduction
 Principal component analysis (PCA)
 the mapping to the lower-dimensional space is done such that the variance of the data in the lower-dimensional representation is maximized.
 PCA is currently available only in Java and Scala (as of MLlib 1.2).
 Singular value decomposition (SVD)
 The SVD factorizes an m × n matrix A into three matrices, A ≈ UΣVᵀ, where:
 U is an orthonormal matrix, whose columns are called left singular
vectors.
 Σ is a diagonal matrix with nonnegative entries in descending order, whose diagonal entries are called singular values.
 V is an orthonormal matrix, whose columns are called right singular
vectors.
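For dimensionality reduction, only the top k singular values are typically kept; the truncated factors then have these shapes (standard linear algebra, independent of MLlib):

    A \approx U_k \Sigma_k V_k^{T}, \qquad
    U_k \in \mathbb{R}^{m \times k}, \quad
    \Sigma_k \in \mathbb{R}^{k \times k}, \quad
    V_k \in \mathbb{R}^{n \times k}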
11.5.7 Model Evaluation
 In Spark 1.2, MLlib contains an experimental set of
model evaluation functions, though only in Java and
Scala.
 In future versions of Spark, the pipeline API is
expected to include evaluation functions in all
languages.
11.6 Tips and Performance Considerations
 Preparing Features
 Scale your input features.
 Featurize text correctly.
 Label classes correctly.
 Configuring Algorithms
 Caching RDDs to Reuse
 most MLlib algorithms are iterative, so cache() your input RDD; if it does not fit in memory, try persist(StorageLevel.DISK_ONLY).
 Recognizing Sparsity (see the sparse vector sketch after this list)
 Level of Parallelism
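A small sketch of the dense/sparse trade-off in Python (the vector is invented; as a rule of thumb, sparse vectors pay off when at most about 10% of entries are nonzero):

    from pyspark.mllib.linalg import Vectors

    # The same 100-dimensional vector, two ways.
    values = [0.0] * 100
    values[3], values[42] = 1.0, 7.0
    dense = Vectors.dense(values)                    # stores all 100 values
    sparse = Vectors.sparse(100, {3: 1.0, 42: 7.0})  # stores size + 2 entries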
11.7 Pipeline API
 Starting in Spark 1.2
 This API is similar to the pipeline API in scikit-learn.
 a pipeline is a series of algorithms (either feature
transformation or model fitting) that transform a dataset.
 Each stage of the pipeline may have parameters
 The pipeline API uses a uniform representation of datasets throughout: SchemaRDDs from Spark SQL
 The pipeline API is still experimental at the time of
writing
11.8 Conclusion
 The library ties directly to Spark’s other APIs
 letting you work on RDDs and get back results you
can use in other Spark functions.
 MLlib is one of the most actively developed parts of
Spark, so it is still evolving.
