Chapter 11: Machine Learning with MLlib
Learning Spark
by Holden Karau et al.
Overview: Machine Learning with MLlib
 System Requirements
 Machine Learning Basics
 Data Types
 Algorithms
 Feature Extraction
 Statistics
 Classification and Regression
 Clustering
 Collaborative Filtering and Recommendation
 Dimensionality Reduction
 Model Evaluation
 Tips and Performance Considerations
 Pipeline API
 Conclusion
11.1 Overview
 MLlib’s design and philosophy are simple: it lets you
invoke various algorithms on distributed datasets,
representing all data as RDDs.
 It contains only parallel algorithms that run well on clusters.
 In Spark 1.0 and 1.1, MLlib’s interface is relatively low-level.
 In Spark 1.2, MLlib gains an additional pipeline API for building machine learning pipelines.
11.2 System Requirements
 MLlib requires some linear algebra libraries to be
installed on your machines.
 the gfortran runtime library
 To use MLlib in Python, you will need NumPy
 install the python-numpy or numpy package through your package manager on Linux
 or use a third-party scientific Python distribution such as Anaconda.
Edx and Coursera Courses
 Introduction to Big Data with Apache Spark
 Spark Fundamentals I
 Functional Programming Principles in Scala
11.3 Machine Learning Basics
 Machine learning algorithms attempt to make
predictions or decisions based on training data.
 All learning algorithms require defining a set of features
for each item
 Most algorithms are defined only for numerical features
 specifically, a vector of numbers representing the value for each
feature
 Once data is represented as feature vectors, most
machine learning algorithms optimize a well-defined
mathematical function based on these vectors
 Finally, most learning algorithms have multiple
parameters that can affect results
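A minimal sketch of this representation in Python, assuming the pyspark shell (where a SparkContext named sc already exists); the features and labels here are invented for illustration:

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    # Each item is reduced to a numeric feature vector, e.g.
    # (word count, number of exclamation marks), plus a 0/1 label.
    training = sc.parallelize([
        LabeledPoint(1.0, Vectors.dense([120.0, 9.0])),  # spam
        LabeledPoint(0.0, Vectors.dense([45.0, 0.0])),   # not spam
    ])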
11.4 Data Types
 MLlib contains a few specific data types, located in:
 the org.apache.spark.mllib package (Java/Scala)
 the pyspark.mllib package (Python)
 The main ones are:
 Vector
 LabeledPoint
 Rating
 Various Model classes
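A short sketch of constructing each of these types in Python (assuming Spark 1.2's pyspark.mllib; the values are invented):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.recommendation import Rating

    dense = Vectors.dense([1.0, 2.0, 3.0])        # every entry stored
    sparse = Vectors.sparse(3, {0: 1.0, 2: 3.0})  # size + nonzero entries
    point = LabeledPoint(1.0, dense)              # label plus feature vector
    rating = Rating(1, 10, 5.0)                   # user 1, product 10, rating 5.0

Model classes are not constructed directly; they are returned by each algorithm's train method.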
11.5 Algorithms
 Feature Extraction
 mllib.feature package
 Statistics
 mllib.stat.Statistics class.
 Classification and Regression
 use the LabeledPoint class (in the mllib.regression package)
 Clustering
 K-means, as well as a variant called K-means||
 Collaborative Filtering and Recommendation
 mllib.recommendation.ALS class
 Dimensionality Reduction
 Model Evaluation
11.5.1 Feature Extraction
 TF-IDF (Term Frequency–Inverse Document Frequency)
 computes two statistics for each term in each document:
 the term frequency (TF)
 the inverse document frequency (IDF)
 MLlib has two algorithms that compute TF-IDF: HashingTF and IDF (see the sketch after this list)
 Scaling
 Normalization
 Word2Vec
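A minimal TF-IDF sketch in Python, assuming a SparkContext sc (the toy corpus is invented):

    from pyspark.mllib.feature import HashingTF, IDF

    # Each document is a list of terms.
    docs = sc.parallelize([
        "spark mllib supports tf idf".split(),
        "spark makes distributed datasets easy".split(),
    ])

    tf = HashingTF(numFeatures=10000)       # hash terms into a 10,000-dim vector
    tfVectors = tf.transform(docs).cache()  # cached: used by both fit and transform

    idfModel = IDF().fit(tfVectors)         # compute IDF weights over the corpus
    tfIdfVectors = idfModel.transform(tfVectors)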
11.5.2 Statistics
 MLlib offers several widely used statistics functions
that work directly on RDDs
 Statistics.colStats(rdd)
 Statistics.corr(rdd, method)
 Statistics.corr(rdd1, rdd2, method)
 Statistics.chiSqTest(rdd)
 Apart from these methods, RDDs containing
numeric data offer several basic statistics such as
mean(), stdev(), and sum()
 RDDs support sample() and sampleByKey() to build
simple and stratified samples of data.
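A brief sketch of these statistics functions in Python (assuming a SparkContext sc; the numbers are invented):

    import numpy as np
    from pyspark.mllib.stat import Statistics

    rows = sc.parallelize([
        np.array([1.0, 10.0]),
        np.array([2.0, 22.0]),
        np.array([3.0, 29.0]),
    ])

    summary = Statistics.colStats(rows)             # per-column summaries
    print(summary.mean(), summary.variance())
    corr = Statistics.corr(rows, method="pearson")  # correlation matrix

    # Numeric RDDs also offer basic statistics directly:
    nums = sc.parallelize([1.0, 2.0, 3.0])
    print(nums.mean(), nums.stdev(), nums.sum())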
11.5.3 Classification and Regression
 Classification and regression are two common forms
of supervised learning
 The difference between them:
 in classification, the predicted variable is discrete
 in regression, the predicted variable is continuous
 MLlib includes a variety of methods:
 Linear regression
 Logistic regression
 Support Vector Machines
 Naive Bayes
 Decision trees and random forests
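As a sketch, logistic regression on a toy, linearly separable dataset (assuming a SparkContext sc; the data is invented):

    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # Label is 1.0 when the first feature dominates the second.
    data = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(0.0, [0.1, 0.9]),
        LabeledPoint(1.0, [1.0, 0.0]),
        LabeledPoint(1.0, [0.9, 0.2]),
    ])

    model = LogisticRegressionWithSGD.train(data, iterations=100)
    print(model.predict([0.8, 0.1]))  # should fall in class 1

The other methods follow the same train/predict pattern, with algorithm-specific parameters.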
11.5.4 Clustering
 Clustering is the unsupervised learning task that
involves grouping objects into clusters of high
similarity
 MLlib includes the popular K-means algorithm for
clustering, as well as a variant called K-means||
 K-means|| is similar to the K-means++ initialization procedure often used in single-node settings.
 To invoke K-means:
 create a mllib.clustering.KMeans object (in Java/Scala)
 or call KMeans.train (in Python)
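A minimal K-means sketch in Python (assuming a SparkContext sc; the points are invented):

    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize([
        [0.0, 0.0], [0.1, 0.1],   # one tight group near the origin
        [9.0, 9.0], [9.1, 9.1],   # another far away
    ])

    model = KMeans.train(points, k=2, maxIterations=10,
                         initializationMode="k-means||")
    print(model.clusterCenters)         # the two learned centers
    print(model.predict([0.05, 0.05]))  # cluster ID for a new point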
11.5.5 Collaborative Filtering and
Recommendation
 Collaborative filtering:
 is a technique for recommender systems
 is attractive because it needs only a list of user/product interactions, not hand-built features for each item
 MLlib includes an implementation of Alternating
Least Squares (ALS)
 It is located in the mllib.recommendation.ALS class.
 To use ALS, you need to give it an RDD of
mllib.recommendation.Rating objects
 there are two variants of ALS: for explicit ratings (the default)
and for implicit ratings
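A minimal ALS sketch in Python (assuming a SparkContext sc; the ratings are invented):

    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([
        Rating(1, 10, 5.0),   # user 1 gave product 10 a rating of 5.0
        Rating(1, 20, 1.0),
        Rating(2, 10, 4.0),
        Rating(2, 30, 5.0),
    ])

    model = ALS.train(ratings, rank=10, iterations=5)  # explicit ratings
    # For implicit feedback (clicks, view counts, ...), use ALS.trainImplicit.
    print(model.predict(1, 30))  # predicted rating of product 30 for user 1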
11.5.6 Dimensionality Reduction
 Principal component analysis (PCA)
 the mapping to the lower-dimensional space is done such that the variance of the data in the lower-dimensional representation is maximized.
 PCA is currently available only in Java and Scala (as of MLlib 1.2).
 Singular value decomposition (SVD)
 The SVD factorizes an m × n matrix A into three matrices, A ≈ UΣVᵀ, where:
 U is an orthonormal matrix, whose columns are called left singular
vectors.
 Σ is a diagonal matrix with nonnegative entries in descending order, whose diagonal entries are called singular values.
 V is an orthonormal matrix, whose columns are called right singular
vectors.
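For dimensionality reduction, only the top k singular values are typically kept; the truncated factors then have these shapes (standard linear algebra, independent of MLlib):

    A \approx U_k \Sigma_k V_k^{T}, \qquad
    U_k \in \mathbb{R}^{m \times k}, \quad
    \Sigma_k \in \mathbb{R}^{k \times k}, \quad
    V_k \in \mathbb{R}^{n \times k}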
11.5.7 Model Evaluation
 In Spark 1.2, MLlib contains an experimental set of
model evaluation functions, though only in Java and
Scala.
 In future versions of Spark, the pipeline API is
expected to include evaluation functions in all
languages.
11.6 Tips and Performance Considerations
 Preparing Features
 Scale your input features.
 Featurize text correctly.
 Label classes correctly.
 Configuring Algorithms
 Caching RDDs to Reuse
 most MLlib algorithms are iterative, so cache() your input RDD; if it does not fit in memory, try persist(StorageLevel.DISK_ONLY).
 Recognizing Sparsity (see the sparse vector sketch after this list)
 Level of Parallelism
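A small sketch of the dense/sparse trade-off in Python (the vector is invented; as a rule of thumb, sparse vectors pay off when at most about 10% of entries are nonzero):

    from pyspark.mllib.linalg import Vectors

    # The same 100-dimensional vector, two ways.
    values = [0.0] * 100
    values[3], values[42] = 1.0, 7.0
    dense = Vectors.dense(values)                    # stores all 100 values
    sparse = Vectors.sparse(100, {3: 1.0, 42: 7.0})  # stores size + 2 entries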
11.7 Pipeline API
 Starting in Spark 1.2
 This API is similar to the pipeline API in scikit-learn.
 a pipeline is a series of algorithms (either feature
transformation or model fitting) that transform a dataset.
 Each stage of the pipeline may have parameters
 The pipeline API uses a uniform representation of datasets throughout: SchemaRDDs from Spark SQL
 The pipeline API is still experimental at the time of
writing
11.8 Conclusion
 The library ties directly to Spark’s other APIs
 letting you work on RDDs and get back results you
can use in other Spark functions.
 MLlib is one of the most actively developed parts of
Spark, so it is still evolving.
