Machine learning with Spark

© Copyright 2015 Hitachi Consulting
Applied Machine Learning
with Apache Spark
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.

Outline
 Overview on Data Mining
 Review on Spark Core Concepts
 Introducing Machine Learning with Spark
 Spark MLlib Data Types
 Statistics, Sampling, and Random Data Generation
 Spark ML Pipelines
 Data Pre-processing & Transformation
 Building and Evaluating Supervised ML Models (Classification and Regression)
 Other Unsupervised ML techniques (Clustering, Frequent Pattern Mining)
 Useful Resources

Data Mining Overview

Data Mining
… in a nutshell
Data
Mining
Machine
Learning
Statistics
Artificial
Intelligence
Databases
Other
Technologies
“Data mining, an interdisciplinary subfield of
computer science, is the computational
process of discovering patterns in large data
sets involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems.”
Other Related Technologies:
 Visualization
 Big Data
 High Performance Computing
 Cloud Computing
 Others..

Learning Paradigms
Data as the teacher, machine as the student…
Supervised Learning
Labelled data = data + output (predictable, target, response, class) variable
Learn the relationship between data and output
Unsupervised Learning
Unlabelled data
Learn associations, similarities, groups, etc.
Semi-supervised Learning
Partially labelled data
Online/Active Learning
Real-time learning on data streams
Reinforcement Learning
game theory, control theory, simulation-based
optimization, operations research, robotics, etc.

Data Mining Task
• Predicting the class of a given case – Supervised – Classification
• Estimating the value of a response variable – Supervised – Regression
• Partitioning the cases into similar groups – Unsupervised – Clustering
• Finding frequent (co-)occurring items – Unsupervised – Association Rules Discovery
• Finding similar cases of a given case – Both – Similarity Analysis
• Calculating the probability of variables – Both – Probabilistic Inference
• Forecasting future values – Supervised – Time Series Analysis
Important Terms:
• Learning Paradigms:
− Supervised
− Unsupervised
− Semi-supervised
− Others (Reinforcement learning, Active, etc.)
• Analytics Types:
− Predictive
− Descriptive (Exploratory)
− Prescriptive (Decisive)
Application Fields:
• Text Mining
• Information Retrieval
• (Social) Web Mining
• Speech Recognition
• Image Recognition
• Anomaly Detection
• State Transition Analysis
• Collaborative Filtering (Recommender systems)

Knowledge Discovery in Databases (KDD)
…or data science, if you like!
Cross Industry Standard Process for Data Mining (CRISP-DM)
[Figure: the CRISP-DM cycle, centred on the Data. Start → Understanding the Business → Understanding the Data → Preparing the Data → Modelling → Evaluation, Interpretation, Communication → Deployment]

Data Mining Implementation
Overall Procedure:
 Load dataset.
 Explore data (statistics, cardinality, variable types, correlations, dependencies etc.).
 Apply data transformations and pre-processing (type conversions, feature selection/extraction/construction/reduction, handling outliers and missing values, scaling, etc.).
 Split the data into training set and test set (cross validation).
 Train a model.
 Evaluate and tune the model.
 Interpret the model and communicate results.
 Productionize the model.
Always Build Multiple Models:
 Using different approaches.
 Using different algorithms.
 Using different parameters (parameter sweeping).
 Using different dataset representations.
Empirical Evaluation for Model Selection

Spark Core Concepts

What is Spark?
Lightning-fast Big Data Processing
General-purpose Big Data Processing
Integrates with HDFS
Graph Processing
Stream Processing
Machine Learning
Libraries
In-memory (fast)
Iterative
Processing
Interactive
Query
SQL
Scala – Python – Java – R – .NET

Spark Components
Spark and the zoo…
Hadoop Distributed File System (HDFS)
Spark
….
Yet Another Resource Negotiator (YARN)
NameNode
DataNode 1 DataNode 2 DataNode 3 DataNode N
Spark Core Engine (RDDs: Resilient Distributed Datasets)
Spark SQL
(structured data)
Spark Streaming
(real-time)
MLlib
(machine learning)
GraphX
(graph processing)
Scala
Java
Python
R
.NET (Mobius)

Spark Core Concepts
Key/Value (Pair) RDDs
Persisting & Removing
RDDs
Per-Partition Operations
Accumulators &
Broadcast Variables
Resilient Distributed Datasets (RDDs)
Transformations Actions

Spark SQL
DataFrames
Distributed collection of data organized into named columns
Conceptually equivalent to a table in a relational database or a data frame in
R/Python, with richer Spark optimizations
RDDs
Structured Data
Files
Hive Tables RDBMS

Introducing Machine Learning
with Spark

Machine Learning with Spark
Spark and Parallel Machine Learning
Running multiple learning algorithms, or multiple parameter configurations, in parallel
 Standard ML algorithms from various libraries (Weka, scikit-learn, caret, etc.) can be used
 Simple, as it needs no synchronization
 Less efficient with large datasets; the whole dataset needs to be loaded on each worker node
Parallelizing the execution of the learning algorithm on a given dataset
 The learning algorithm is specifically designed to run in parallel on a given dataset (the Spark ML algorithms)
 Efficient with large datasets; a dataset is partitioned across multiple worker nodes
 Involves synchronization and data shuffling
 Approximation methods might be used to reduce synchronization and data shuffling, which may affect model quality

Machine Learning with Spark
Spark Machine Learning Libraries
Spark MLlib
 Provides base data types to implement your own ML algorithms
 Provides utilities to perform related ML functions
 Provides algorithm classes that use RDDs (of vectors, matrices, etc.)
Spark ML
 Provides a uniform set of high-level APIs for ML operations (transform, model, evaluate)
 Provides algorithm classes that use DataFrames
 Provides a framework for creating practical ML pipelines
We will introduce the base data types in Spark MLlib, but focus on the
algorithms and transformations in Spark ML

Spark MLlib Data Types

Base data type for implementing ML algorithms
Local Vectors (Dense, Sparse)
Local Matrix (Dense, Sparse)
Distributed Matrix
 RowMatrix
 IndexedRowMatrix
 CoordinateMatrix
LabeledPoint
Rating

Vectors
 A vector represents the input feature values of a dataset example.
 Feature values in a vector have to be numeric.
 A collection of vectors represents a dataset, which can be parallelized as an RDD.
 A DenseVector stores the values of all the features, while a SparseVector only stores the values of the nonzero features (efficient with sparse datasets, e.g. text document vectors).
from pyspark.mllib.linalg import Vectors
v1 = Vectors.dense([1.1, -2, 0.3, 45, -3.5])
v2 = Vectors.sparse(5, [1, 3], [10, 20])
for i in range(0, 5):
    print("feature " + str(i) + ": " + str(v2[i]))
Create a vector of 5 features, where
only features 1 and 3 have values 10
and 20, respectively. The other
features are zeros
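To make the sparse representation concrete, the following plain-Python sketch (no Spark required) expands a (size, indices, values) triple into its dense form. The helper name `sparse_to_dense` is hypothetical and is not part of the MLlib API; it only illustrates what the sparse vector above stores.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size                 # every feature defaults to zero
    for index, value in zip(indices, values):
        dense[index] = float(value)      # only the listed positions are nonzero
    return dense

# same arguments as Vectors.sparse(5, [1, 3], [10, 20]) above
dense_v2 = sparse_to_dense(5, [1, 3], [10, 20])
print(dense_v2)
```

The sparse form stores two indices and two values here, while the dense form stores all five entries; the saving grows with the number of zero features.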

LabeledPoints
 Represents a labelled example in a supervised dataset
 Used in datasets for Classification and Regression Learning
 Consists of a numerical label (class/target), and a features vector
 Categorical class values need to have a numerical representation (i.e., class value index)
from pyspark.mllib.regression import LabeledPoint
example = LabeledPoint(1, [34, 56, 76])
print("input features: " + str(example.features) + " class index: " + str(example.label))

Matrices
 A widely-used data structure in linear algebra
 Spark MLlib supports dense and sparse, local and distributed matrices
from pyspark.mllib.linalg import *
# a 3 x 2 matrix needs exactly 6 values; they are stored in column-major order
denseMatrix_local = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])
from pyspark.mllib.linalg.distributed import *
v1 = Vectors.sparse(3, [1], [10])
v2 = Vectors.sparse(3, [0, 2], [5, 15])
v3 = Vectors.sparse(3, [0, 1, 2], [10, 20, 30])
sparseMatrix_distributed = RowMatrix(sc.parallelize([v1, v2, v3]))
print("rows:" + str(sparseMatrix_distributed.numRows()) + " - columns:" + str(sparseMatrix_distributed.numCols()))
print(sparseMatrix_distributed.rows.collect())
 Other distributed matrices: IndexedRowMatrix, CoordinateMatrix, and BlockMatrix
Create a Matrix with 3
rows and 2 columns
Each vector is a row in the matrix

Statistics, Sampling, and Random
Data Generation

Spark MLlib Utilities
Statistics
from pyspark.mllib.stat import *
 Statistics.colStats(dataset) → count, max, min, mean, variance, numNonzeros
 Statistics.corr(rdd1, rdd2, method=<"pearson" | "spearman">) → correlation coefficient between two equal-sized collections of numerical values (in two RDDs)
 Statistics.corr(dataset, method) → returns a correlation matrix between each pair of features in an RDD of feature vectors
 Statistics.chiSqTest(dataset) → performs a dependency (chi-squared) test between each input feature and the target label in a given dataset of LabeledPoints
 dataset.sampleByKey(withReplacement=<True|False>, fractions) → can be used with a dataset of LabeledPoints to perform stratified sampling (i.e., fetch a sample that approximately preserves the class value distribution of the original dataset). The dataset has to be an RDD of key/value pairs, where the key is the label, and the value is the LabeledPoint; fractions maps each key to its sampling rate
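As a rough illustration of what the "pearson" method measures, here is a plain-Python sketch of the Pearson correlation coefficient for two equal-sized lists. The function name `pearson` is hypothetical; it mirrors the textbook formula and says nothing about how Spark computes the value internally.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing relation
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing relation
print(r_up, r_down)
```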

Random Data Generation
from pyspark.mllib.random import *
import math
data = RandomRDDs.normalRDD(sc, 100)
mean = 1
variance = 4
data_new = data.map(lambda number: mean + (math.sqrt(variance) * number))
Generate 100 numbers (independent, identically distributed) whose values follow the standard normal distribution with mean = 0 and variance = 1: N(0, 1)
Make the generated values follow the normal distribution N(1, 4)
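The same shift-and-scale step can be checked without a cluster. The sketch below uses Python's standard `random` module in place of `RandomRDDs` (an assumption made purely so the example runs locally) and confirms that `mean + sqrt(variance) * z` moves N(0, 1) samples to approximately N(1, 4). The name `rescale` is hypothetical.

```python
import math
import random

def rescale(sample, mean, variance):
    """Map standard-normal values to a normal with the given mean and variance."""
    stdv = math.sqrt(variance)           # scale by the standard deviation
    return [mean + stdv * z for z in sample]

rng = random.Random(42)                  # fixed seed for reproducibility
standard = [rng.gauss(0.0, 1.0) for _ in range(100000)]
shifted = rescale(standard, mean=1.0, variance=4.0)

sample_mean = sum(shifted) / len(shifted)
sample_variance = sum((x - sample_mean) ** 2 for x in shifted) / (len(shifted) - 1)
print(sample_mean, sample_variance)
```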

Kernel Density Estimator
Kernel density estimation is a non-parametric method for estimating the
probability density of a variable from observed samples, without requiring
assumptions about the particular distribution that the samples are drawn from.
from pyspark.mllib.random import *
from pyspark.mllib.stat import *
data = RandomRDDs.normalRDD(sc, 100000)
kde = KernelDensity()
kde.setSample(data)
kde.setBandwidth(0.1)
densities = kde.estimate([0.0,1.0,2.0])
densities
Even without assuming that the sample follows a normal distribution with mean = 0 and stdv = 1, the kernel density estimator was able to estimate the probability density at 0, 1, and 2 from the sample
The technique is more useful with data that does not follow a known probability density function
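For intuition, a kernel density estimate can be sketched in a few lines of plain Python, assuming a Gaussian kernel whose standard deviation is the bandwidth. The function name `gaussian_kde` is hypothetical; the estimate at x is the average of the kernels centred on each sample point.

```python
import math

def gaussian_kde(sample, bandwidth, points):
    """Average of Gaussian kernels (std = bandwidth) centred on each sample."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
            for x in points]

# with a single sample at 0 and bandwidth 1, the estimate is the N(0, 1) density
densities = gaussian_kde([0.0], bandwidth=1.0, points=[0.0, 1.0])
print(densities)
```

A smaller bandwidth follows the sample more closely but gives a noisier estimate; a larger one gives a smoother estimate.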

Spark ML Overview

Spark ML Overview
A uniform set of high-level ML APIs
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine
multiple algorithms into a single pipeline, or workflow.
 Transformers – used for data pre-processing.
Input: DataFrame. Output: DataFrame
 Estimators – ML algorithm used to build a predictive model.
Input: DataFrame (with three columns: CaseIndex, Features, Class). Output: Model.
 Parameters – Configurations for Transformers and Estimators
 Pipeline – Chains Transformers and Estimators
ML Pipeline
[Figure: Dataset (DataFrame) → Transformer A (pre-processing) → … → Transformer Z (pre-processing) → Estimator (ML learning algorithm) → Model → Evaluation; Parameters configure the Transformers and the Estimator]

Spark ML Transformers

Overview
Text Feature Extraction
 TF-IDF (HashingTF and IDF)
 Word2Vec
 CountVectorizer
 Tokenizer
 StopWordsRemover
 n-gram
Features Vector
Preparation
 VectorAssembler
 VectorIndexer
 StringIndexer
 IndexToString
Feature Selection
 VectorSlicer
 RFormula
 ChiSqSelector
Feature Type Conversion
(Continuous ↔ Discrete)
 Binarizer
 Discrete Cosine Transform (DCT)
 OneHotEncoder
 Bucketizer
 QuantileDiscretizer
Feature Scaling
 Normalizer
 StandardScaler
 MaxAbsScaler
 MinMaxScaler
Feature
Construction
 SQLTransformer
 ElementwiseProduct
 PolynomialExpansion
Dimensionality Reduction
 PCA

Preparing a dataset for Spark ML Pipeline
 A dataset is received with a mix of nominal and numerical attributes
 The target variable can be either categorical or numerical, as well.
 All the dataset attributes (input and target) have to be numeric.
 For nominal attributes, numerical indexes should be used instead of the actual “textual” values
 The training dataset should have three attributes:
 CaseIndex: a unique identifier of a training instance
 Features: a Spark MLlib DenseVector or SparseVector to represent all the input variables
 Target: a numerical attribute to represent the class or the response variable in the classification or
regression problems, respectively.

Features vector preparation
from pyspark.ml.feature import *
dataframe = sqlContext.createDataFrame(
[
(1,34.45,20,'M','Y'),
(2,23.67,40,'M','Y'),
(3,78.23,20,'M','Y'),
(4,37.48,40,'L','Y'),
(5,48.32,20,'S','N'),
(6,67.45,40,'S','N')
]
,['caseId','score','category','level','label'])
dataframe.show()
dataframe.printSchema()

StringIndexer – encodes a string column of labels to a column of label indices
stringIndexer = StringIndexer(inputCol="label", outputCol="labelIndex")
model = stringIndexer.fit(dataframe)
dataframe_transformed = model.transform(dataframe).select('caseId','score','category','level','labelIndex')
dataframe_transformed.show()
We need to do the same for the “level” input variable
stringIndexer = StringIndexer(inputCol="level", outputCol="levelIndex")
model = stringIndexer.fit(dataframe_transformed)
dataframe_transformed = model.transform(dataframe_transformed).select('caseId','score','category','levelIndex','labelIndex')
The original textual value can be retrieved using the IndexToString transformation
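The indexing step can be pictured with a short plain-Python sketch. By default StringIndexer orders labels by frequency, so the most frequent label gets index 0. The function name `string_index` is hypothetical, and the alphabetical tie-breaking is an assumption of this sketch only.

```python
from collections import Counter

def string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # ties are broken alphabetically here (an assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return dict((label, float(index)) for index, label in enumerate(ordered))

# the "level" column of the example DataFrame
level_index = string_index(['M', 'M', 'M', 'L', 'S', 'S'])
print(level_index)
```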

VectorAssembler – combines a given list of columns into a single vector attribute, to represent the input feature set of the training data
assembler = VectorAssembler(inputCols=["score", "category", "levelIndex"], outputCol="features")
dataframe_transformed = assembler.transform(dataframe_transformed).select('caseId','features','labelIndex')
Note that the “category” attribute is treated as a numerical attribute.
If it needs to be treated as a nominal attribute, we can either use StringIndexer,
or let VectorIndexer decide whether to treat each attribute as nominal or numerical,
based on the maxCategories parameter. That is, if the number of distinct values is
less than or equal to maxCategories, the attribute will be indexed and treated as nominal.
indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=3)
indexerModel = indexer.fit(dataframe_transformed)
dataframe_transformed = indexerModel.transform(dataframe_transformed).select('caseId','features_indexed','labelIndex')
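The maxCategories rule described above can be sketched in plain Python: count the distinct values of each feature and treat the feature as nominal when that count does not exceed the limit. The function name `categorical_features` is hypothetical, and the levelIndex values assume frequency-based indexing (M = 0, S = 1, L = 2).

```python
def categorical_features(rows, max_categories):
    """Return the positions of features with at most max_categories distinct values."""
    categorical = set()
    for j in range(len(rows[0])):
        distinct = set(row[j] for row in rows)
        if len(distinct) <= max_categories:
            categorical.add(j)
    return categorical

# (score, category, levelIndex) values of the example dataset
rows = [(34.45, 20, 0.0), (23.67, 40, 0.0), (78.23, 20, 0.0),
        (37.48, 40, 2.0), (48.32, 20, 1.0), (67.45, 40, 1.0)]
nominal = categorical_features(rows, max_categories=3)
print(nominal)
```

With a limit of 3, "category" (2 distinct values) and "levelIndex" (3 distinct values) are treated as nominal, while "score" (6 distinct values) stays numerical.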

Feature type conversion
Binarizer – converts a numerical attribute to a binary (0/1) attribute by thresholding: values greater than the threshold become 1.0, and all other values become 0.0.
# assumes dataframe_transformed still has the 'score' column (i.e., before VectorAssembler was applied)
binarizer = Binarizer(threshold=40.0, inputCol="score", outputCol="score_binarized")
dataframe_transformed = binarizer.transform(dataframe_transformed)
dataframe_transformed.select('score_binarized','category','levelIndex','labelIndex').show()

OneHotEncoder – maps a column of category indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. The result vector is a SparseVector
# the encoder needs an indexed column, so index "level" first
stringIndexer = StringIndexer(inputCol="level", outputCol="levelIndex")
model = stringIndexer.fit(dataframe)
dataframe_transformed = model.transform(dataframe)
# in Spark 2.x OneHotEncoder is a Transformer; in Spark 3.x it is an Estimator, so call encoder.fit(...).transform(...) instead
encoder = OneHotEncoder(dropLast=False, inputCol="levelIndex", outputCol="levelFlags")
dataframe_transformed = encoder.transform(dataframe_transformed)
dataframe_transformed.select('score','category','levelFlags','label').show()
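What the encoder produces for a single category index can be sketched in plain Python. The function name `one_hot` is hypothetical and returns a dense list rather than a SparseVector; the `drop_last` flag mimics the dropLast parameter, under which the last category is encoded as all zeros (hence "at most a single one-value").

```python
def one_hot(index, num_categories, drop_last=False):
    """Encode a category index as a 0/1 list with at most one 1."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:                 # the dropped last category stays all zeros
        vector[int(index)] = 1.0
    return vector

print(one_hot(1, 3))                  # middle category of three
print(one_hot(2, 3, drop_last=True))  # last category, dropped
```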

QuantileDiscretizer – converts a numerical attribute to a binned categorical attribute, using the quantiles of the numerical attribute
Bucketizer – converts a numerical attribute to a nominal attribute by assigning numerical values to predefined buckets (ranges)
# assumes dataframe_transformed still has the 'score' column (i.e., before VectorAssembler was applied)
splits = [-float("inf"), 25.0, 35.0, 50.0, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="score", outputCol="score_buckets")
dataframe_transformed = bucketizer.transform(dataframe_transformed)
dataframe_transformed.select('score','score_buckets','category','levelIndex','labelIndex').show()
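The thresholding and bucketing rules can be sketched in plain Python on the example scores. The names `binarize` and `bucketize` are hypothetical helpers; the sketch assumes that a value above the threshold maps to 1.0, and that bucket i covers the half-open range from splits[i] up to, but not including, splits[i + 1].

```python
from bisect import bisect_right

def binarize(value, threshold):
    """1.0 if the value is strictly greater than the threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

def bucketize(value, splits):
    """Index of the bucket [splits[i], splits[i + 1]) that contains the value."""
    return float(bisect_right(splits, value) - 1)

splits = [-float("inf"), 25.0, 35.0, 50.0, float("inf")]
scores = [34.45, 23.67, 78.23, 37.48, 48.32, 67.45]
binarized = [binarize(s, 40.0) for s in scores]
buckets = [bucketize(s, splits) for s in scores]
print(binarized)
print(buckets)
```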

Feature scaling
Scaling is a very important pre-processing step when similarity/distance measures are involved. This includes
clustering and instance-based learning.
dataframe = sqlContext.createDataFrame(
[
(-0.4 ,40 ,4),
(-0.3 ,30 ,3),
(-0.5 ,50 ,5),
(-0.7 ,70 ,7),
(-0.2 ,20 ,2),
(-0.1 ,10 ,1)
],['var1','var2','var3'])
assembler = VectorAssembler(inputCols=["var1", "var2", "var3"], outputCol="features")
dataframe = assembler.transform(dataframe).select('features')

Feature scaling
scaler = StandardScaler(inputCol="features", outputCol="features_scaled1", withStd=True, withMean=False)
scalerModel = scaler.fit(dataframe)
scaledData = scalerModel.transform(dataframe)
scaledData.show(truncate=False)
scaler = MinMaxScaler(inputCol="features", outputCol="features_scaled2", min=0.0, max=1.0)
scalerModel = scaler.fit(dataframe)
scaledData = scalerModel.transform(dataframe)
scaledData.show(truncate=False)
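The two scalers above can be sketched per column in plain Python. The helper names are hypothetical. The first divides by the sample standard deviation without centring (matching withStd=True, withMean=False, and assuming the n - 1 denominator); the second rescales linearly so that the column minimum maps to 0.0 and the maximum to 1.0.

```python
from math import sqrt

def standard_scale(column):
    """Divide each value by the sample standard deviation (no centring)."""
    n = len(column)
    mean = sum(column) / float(n)
    stdv = sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [x / stdv for x in column]

def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale linearly so that min -> new_min and max -> new_max."""
    low, high = min(column), max(column)
    return [new_min + (x - low) * (new_max - new_min) / float(high - low)
            for x in column]

var2 = [40, 30, 50, 70, 20, 10]       # the var2 column of the example
var2_standard = standard_scale(var2)
var2_minmax = min_max_scale(var2)
print(var2_standard)
print(var2_minmax)
```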

Others
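Feature Selection (VectorSlicer, RFormula, ChiSqSelector) – Methods to select a subset of the features. VectorSlicer and RFormula select features that are specified explicitly, while ChiSqSelector is a supervised method that keeps the features that best predict the target variable.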
Dimensionality Reduction (Principal Component Analysis) - Converts a set of instances of
possibly correlated variables into a set of values of linearly uncorrelated variables called principal
components. Input: many correlated features. Output: a few uncorrelated features
Feature Construction
 SQLTransformer – SQL-based transformation
 Polynomial expansion – Expanding features into a polynomial space, which is formulated by an n-degree
combination of original dimensions. E.g. input x, n=3. Output x, x^2, x^3
 ElementwiseProduct – Multiplies each input vector by a provided “weight” vector, using element-wise
multiplication.
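The two feature-construction operations can be illustrated with plain Python on small inputs. The helper names are hypothetical, and the expansion sketch covers only the single-feature case from the example above (x, x^2, x^3).

```python
def polynomial_expansion(x, degree):
    """Expand a single feature x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]

def elementwise_product(vector, weights):
    """Multiply each component of the vector by the matching weight."""
    return [v * w for v, w in zip(vector, weights)]

expanded = polynomial_expansion(2, 3)
weighted = elementwise_product([1, 2, 3], [0, 1, 2])
print(expanded)
print(weighted)
```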

Classification

Spark ML Estimators
Classification algorithms
 Logistic regression
 Decision tree classifier
 Random forest classifier – Tree Ensemble
 Gradient-boosted tree classifier – Tree Ensemble
 Multilayer perceptron classifier – Artificial Neural Networks
 One-vs-Rest classifier – Uses binary classifiers for multiclass classification problems
 Naive Bayes – a probabilistic Bayesian model

Spark ML Estimators
Classification template
from pyspark.ml.classification import *
# split the processed dataset to training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
# initialize a classification algorithm
classification_algorithm = <classification algorithm>(labelCol="<Indexed Class Attribute>",
featuresCol="<Feature Vector Attribute>")
# create a classification model using the classification algorithm and the training set
classifier = classification_algorithm.fit(trainingData)
# make predictions using the constructed classifier and the test set (or a new unseen dataset)
predictions = classifier.transform(testData)
# display actual vs predicted classes
predictions.select("prediction", "<Indexed Class Attribute>", "<Feature Vector Attribute>").show()

Spark ML Estimators
Classification dataset
from pyspark.ml.feature import *
dataset_raw = sqlContext.sql("SELECT * FROM ds_car")
dataset_raw.show()
dataset_raw.printSchema()
# index the class attribute
stringIndexer = StringIndexer(inputCol="Class", outputCol="classIndex")
dataset_processed = stringIndexer.fit(dataset_raw).transform(dataset_raw)
# index each nominal input attribute (e.g., "buying" becomes "buyingIndex")
inputColumns = ['buying','maint','doors','persons','lug_boot','safety']
for column in inputColumns:
    stringIndexer = StringIndexer(inputCol=column, outputCol=column + "Index")
    dataset_processed = stringIndexer.fit(dataset_processed).transform(dataset_processed)
# assemble the indexed attributes into a single features vector
assembler = VectorAssembler(inputCols=[column + "Index" for column in inputColumns], outputCol="features")
dataset_processed = assembler.transform(dataset_processed).select('features','classIndex')

Spark ML Estimators
Classification example
from pyspark.ml.classification import *
# split the processed dataset into training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
decisionTree_algorithm = DecisionTreeClassifier(labelCol="classIndex", featuresCol="features")
# create a classification model using the classification algorithm and the training set
decisionTree_model = decisionTree_algorithm.fit(trainingData)
# make predictions using the constructed classifier and the test set (or a new unseen dataset)
predictions = decisionTree_model.transform(testData)
# display actual vs predicted classes
predictions.select("prediction", "classIndex", "features").show()

Regression

Spark ML Estimators
Regression algorithms
 Linear regression
 Generalized linear regression (Gaussian, Binomial, Poisson, Gamma)
 Decision tree regression
 Random forest regression – Tree Ensemble
 Gradient-boosted tree regression – Tree Ensemble
 Survival regression - Accelerated failure time (AFT) model
 Isotonic regression – Fits a monotonic (non-decreasing or non-increasing) function to the target attribute

Spark ML Estimators
Regression template
from pyspark.ml.regression import *
regression_algorithm = <regression algorithm>(labelCol="<Numerical Target Attribute>",
featuresCol="<Features Vector Attribute>")
# create a regression model using the regression algorithm and the training set
regression_model = regression_algorithm.fit(trainingData)
# make estimations using the constructed regression model and the test set (or a new unseen dataset)
predictions = regression_model.transform(testData)
# display actual vs estimated values
predictions.select("prediction", "<Numerical Target Attribute>", "<Features Vector Attribute>").show()

Spark ML Estimators
Regression dataset
dataset_raw = sqlContext.sql("SELECT * FROM ds_energy")
dataset_raw.show(5)
dataset_raw.printSchema()
from pyspark.ml.feature import *
assembler = VectorAssembler(inputCols=
['RelativeCompactness',
'SurfaceArea',
'WallArea',
'RoofArea',
'OverallHeight',
'Orientation',
'GlazingArea',
'GlazingAreaDistribution'], outputCol="features")
dataset_processed = assembler.transform(dataset_raw).select('features','HeatingLoad')
indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=10)
indexerModel = indexer.fit(dataset_processed)
dataset_processed = indexerModel.transform(dataset_processed).select('features_indexed','HeatingLoad')
dataset_processed.show(truncate=False)

Spark ML Estimators
Regression example
from pyspark.ml.regression import *
# split the processed dataset into training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
# initialize a Generalized Linear Model on the columns prepared in the regression dataset
GLM_algorithm = GeneralizedLinearRegression(labelCol="HeatingLoad", featuresCol="features_indexed",
    family="gaussian", link="identity", maxIter=10, regParam=0.3)
# create a regression model using the regression algorithm and the training set
GLM_model = GLM_algorithm.fit(trainingData)
# make estimations using the constructed regression model and the test set (or a new unseen dataset)
predictions = GLM_model.transform(testData)
# display actual vs estimated values
predictions.select("prediction", "HeatingLoad", "features_indexed").show()

Spark ML Estimators
Regression example
print("Coefficients: " + str(GLM_model.coefficients))
print("Intercept: " + str(GLM_model.intercept))
# Summarize the GLM_model over the training set and print out some metrics
summary = GLM_model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()

Model Evaluation

Model Evaluation
Classification model evaluation – Binary classification
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# the binary evaluator works on the raw prediction (score) column, not on the predicted class
evaluator = BinaryClassificationEvaluator(
    labelCol="classIndex", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
metricName parameter values
 areaUnderROC – area under the Receiver Operating Characteristic curve
 areaUnderPR – area under the Precision-Recall curve
Model Evaluation
Classification model evaluation – Multiclass classification
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
    labelCol="classIndex", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
metricName parameter values
 "f1" (default)
 "accuracy" (Spark 2.0 and later)
 "precision" (Spark 1.x only)
 "recall" (Spark 1.x only)
 "weightedPrecision"
 "weightedRecall"
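Two of these metrics can be sketched in plain Python from lists of actual and predicted class indices. The helper names are hypothetical. Accuracy is the fraction of correct predictions; weighted precision averages the per-class precision, weighting each class by its share of the actual labels.

```python
from collections import Counter

def accuracy(labels, predictions):
    """Fraction of cases whose predicted class equals the actual class."""
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / float(len(labels))

def weighted_precision(labels, predictions):
    """Per-class precision, averaged with weights = class share of the labels."""
    label_counts = Counter(labels)
    predicted_counts = Counter(predictions)
    correct_counts = Counter(l for l, p in zip(labels, predictions) if l == p)
    total = 0.0
    for cls, count in label_counts.items():
        predicted = predicted_counts[cls]
        precision = correct_counts[cls] / float(predicted) if predicted else 0.0
        total += precision * count / float(len(labels))
    return total

labels = [0, 0, 1, 1, 2]
predicted = [0, 1, 1, 1, 2]
acc = accuracy(labels, predicted)
w_precision = weighted_precision(labels, predicted)
print(acc, w_precision)
```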

Model Evaluation
Regression model evaluation
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(
    labelCol="HeatingLoad", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
metricName parameter values
 "rmse" (default) – Root Mean Square Error
 "mse" – Mean Square Error
 "r2" – Coefficient of determination (R squared)
 "mae" – Mean Absolute Error
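The four regression metrics can be written out in plain Python using their textbook definitions. The function name `regression_metrics` and the small data sample are hypothetical and serve only to show how the values relate to each other.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Return mse, rmse, mae and r2 for two equal-sized lists."""
    n = float(len(actual))
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - sum(e ** 2 for e in errors) / total_ss
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae, "r2": r2}

metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

Lower is better for rmse, mse, and mae, while r2 is better the closer it gets to 1.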

Model Selection and Parameter
Tuning

Model Selection and Parameter Tuning
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
algorithm = <classification or regression algorithm>(<parameters>)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine the best model using the evaluator.
paramGrid = ParamGridBuilder() \
    .addGrid(algorithm.<parameter1>, [0.1, 0.01]) \
    .addGrid(algorithm.<parameter2>, [0.0, 0.5, 1.0]) \
    .build()
# trainRatio=0.8: 80% of the data is used for training, and the remaining 20% for validation
tvs = TrainValidationSplit(estimator=algorithm,
                           estimatorParamMaps=paramGrid,
                           evaluator=<Classification or Regression Evaluator>(),
                           trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(trainingData)
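The grid that the builder produces is the Cartesian product of the value lists, so two values for one parameter and three for another yield six candidate configurations. A plain-Python sketch, with a hypothetical helper `build_grid` and placeholder parameter names:

```python
from itertools import product

def build_grid(param_values):
    """Return one dict per combination of the given parameter values."""
    names = sorted(param_values)
    value_lists = [param_values[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# "parameter1" and "parameter2" stand in for real algorithm parameters
grid = build_grid({"parameter1": [0.1, 0.01], "parameter2": [0.0, 0.5, 1.0]})
for configuration in grid:
    print(configuration)
```

Each configuration is trained and evaluated once, so the cost of the search grows multiplicatively with every parameter added to the grid.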

Spark ML Pipelines

Spark ML Pipeline
Pipeline example
from pyspark.ml import Pipeline
#data preparation (e.g., VectorAssembler, VectorIndexer, etc.)
transformer1 = …
transformer2 = …
transformer3 = …
#Model algorithm (e.g. DecisionTreeClassifier)
model_algorithm = …
#Pipeline which applies the transformations and the model building algorithm to the dataset
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3, model_algorithm])
model = pipeline.fit(training)
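The way a pipeline chains its stages during fitting can be sketched with a few toy classes in plain Python. All class names below are hypothetical and only mimic the idea: transformers are applied as they are, while estimators are fitted first and the resulting model is then used to transform the data for the following stages.

```python
class AddOne(object):
    """Toy transformer: adds 1 to every value."""
    def transform(self, data):
        return [x + 1 for x in data]

class MeanModel(object):
    """Toy model: subtracts the learned mean from every value."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, data):
        return [x - self.mean for x in data]

class MeanEstimator(object):
    """Toy estimator: 'learns' the mean of the data and returns a model."""
    def fit(self, data):
        return MeanModel(sum(data) / float(len(data)))

class SimplePipeline(object):
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it to get a model
                stage = stage.fit(data)
            fitted.append(stage)
            data = stage.transform(data)   # feed the output to the next stage
        return fitted

fitted_stages = SimplePipeline([AddOne(), MeanEstimator()]).fit([1, 2, 3])
print(fitted_stages[1].mean)
```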

Frequent Pattern Mining

Frequent Pattern Mining
Association rule discovery example
from pyspark.mllib.fpm import *
data = [['a','b','c'],['a','b'],['a','d'],['b','c'],['a','b','d']]
transactions = sc.parallelize(data)
# retrieve the patterns that occur in at least 40% of the dataset (transactions)
model = FPGrowth.train(transactions, minSupport=0.4)
result = model.freqItemsets().collect()
# retrieve the patterns that occurred 4 times or more
model.freqItemsets().filter(lambda itemset: itemset.freq>=4).collect()
# retrieve the patterns that contain the item 'c'
model.freqItemsets().filter(lambda itemset: 'c' in itemset.items).collect()
# retrieve the patterns that have 2 items or more
model.freqItemsets().filter(lambda itemset: len(itemset.items)>=2).collect()
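To see what "at least 40% occurrence" means on this tiny dataset, the frequent itemsets can be counted by brute force in plain Python. This is deliberately not the FP-growth algorithm: it enumerates every candidate itemset, which is only feasible for a handful of items. The function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Count every candidate itemset and keep those meeting the minimum support."""
    min_count = min_support * len(transactions)
    items = sorted(set(item for t in transactions for item in t))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            freq = sum(1 for t in transactions if set(candidate) <= set(t))
            if freq >= min_count:
                result[candidate] = freq
    return result

data = [['a', 'b', 'c'], ['a', 'b'], ['a', 'd'], ['b', 'c'], ['a', 'b', 'd']]
frequent = frequent_itemsets(data, min_support=0.4)
for itemset in sorted(frequent):
    print(itemset, frequent[itemset])
```

With 5 transactions, a support of 0.4 corresponds to at least 2 occurrences.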

Clustering

Clustering
Data clustering algorithms
 K-means – Spherical, centroid-based, non-parametric, partitioning, non-overlapping
 Latent Dirichlet allocation (LDA) – Probabilistic, usually used with Topic Models
 Power iteration clustering (PIC) – Graph Clustering
 Bisecting k-means – Hierarchical (Divisive, i.e., top-down)
 Gaussian Mixture Model (GMM) – Expectation Maximization (EM) algorithm.
Probabilistic, parametric, overlapping

Clustering
Data clustering example
from pyspark.mllib.clustering import *
data = [[1,2],[5,3],[3,4],[35,20],[25,10],[30,15]]
dataPoints = sc.parallelize(data)
# create 2 clusters using k-means. The initial centroids can be chosen either at random ('random')
# or using k-means|| ('k-means||'), a parallel variant of the k-means++ technique
clusters = KMeans.train(dataPoints, k=2, maxIterations=10, initializationMode='k-means||')
# assign data points to clusters
predictions = dataPoints.map(lambda point: clusters.predict(point))
assignments = dataPoints.zip(predictions)
# compute the sum of squared errors (SSE) of the clusters
sse = assignments.map(lambda pc: sum([error**2 for error in (pc[0] - clusters.centers[pc[1]])])).reduce(lambda error1, error2: error1 + error2)
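The same clustering can be traced by hand with a minimal plain-Python version of the standard k-means iteration (assign each point to its closest centroid, then move each centroid to the mean of its points). The helper names and the fixed initial centroids are assumptions of this sketch, chosen so that the result is deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest(point, centers):
    """Index of the centroid nearest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

def k_means(points, centers, max_iterations=10):
    for _ in range(max_iterations):
        groups = [[] for _ in centers]
        for point in points:
            groups[closest(point, centers)].append(point)
        # move each centroid to the mean of its points (keep it if it has none)
        centers = [[sum(dim) / float(len(group)) for dim in zip(*group)] if group else center
                   for group, center in zip(groups, centers)]
    return centers

data = [[1, 2], [5, 3], [3, 4], [35, 20], [25, 10], [30, 15]]
centers = k_means(data, centers=[[1, 2], [35, 20]])
sse = sum(squared_distance(p, centers[closest(p, centers)]) for p in data)
print(centers, sse)
```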

My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing , University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science , The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning,
– evolving neural networks, and
– data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org

Thank you!
