Machine Learning by Example - Apache Spark

Service Symphony Ltd
Apache Spark
Machine Learning by Example
Meeraj Kunnumpurath
25th of February 2017

Introduction
❖ Has worked as a technologist and software architect for a couple of decades at a number of leading
financial institutions in the UK
❖ Authored a number of books on Enterprise Java, Web Services and SOA
❖ Spoken at a number of technology conferences
❖ Founded Service Symphony Ltd in 2009, serving leading financial services customers
building mission-critical middleware
❖ Engineer with a keen interest in ML, AI and Data Science
❖ Blog: http://www.servicesymphony.com/blog
❖ Email: meeraj@servicesymphony.com
❖ Presentation: https://www.slideshare.net/MeerajKunnumpurath/machine-learning-by-example-apache-spark
❖ GitHub: https://github.com/kunnum/sandbox/tree/master/notebooks

Agenda
❖ Introduction to using ML with Apache Spark
❖ Hands-on example driven approach
❖ Not a deep dive into Apache Spark architecture
❖ Nor a deep dive into ML algorithms
❖ Examples built using Apache Zeppelin
❖ Some of the examples are from the Apache Spark
documentation

Apache Spark - Overview
❖ Open source large scale distributed data processing fabric
❖ Offers multiple components addressing different facets of data science for big and
fast data processing, ML, analytics and data ingestion
❖ Ability to process large amounts of data in memory, spanning multiple process
spaces
❖ Initially started as a research project at UC Berkeley
❖ Originally released under the BSD licence; a top-level ASF project since 2014,
licensed under ASL 2.0
❖ One of the most active open source projects, arguably the most active ASF project
❖ Adopted, extended and commercialised by multiple vendors playing in the data
science realm

Scala - Spark Natural Transition
❖ Interest in Spark stemmed from a deep interest in Scala
and functional programming
❖ Data processing ecosystem built around Scala, with a
strong synergy with Scala’s design motivations
❖ Extends Scala’s idiomatic functional programming
model beyond process boundaries
❖ Spark RDDs - Scala collections on steroids

ML Components
❖ Data Structures
❖ Vectors and Matrices
❖ Data Frames
❖ Feature Extractors and Transformers
❖ Estimators
❖ Models
❖ Pipelines
❖ Evaluators
❖ Tuning Aids

Spark ML - Pipeline Architecture
❖ DataFrame
❖ Estimator
❖ Transformer
❖ Pipeline
❖ Parameter

Training time flow
Pipeline in estimator mode
Pipeline.fit()
Creates a pipeline model

Test time flow
Pipeline in transformer mode
PipelineModel.transform()
Creates a DataFrame augmented with prediction columns
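
The two flows above can be sketched in plain Python. This is not the Spark ML API: MeanCenterer, MeanCentererModel, Doubler, SimplePipeline and SimplePipelineModel are hypothetical stand-ins, and rows are plain dicts rather than DataFrame rows. The sketch only illustrates that fitting a pipeline produces a pipeline model, and that the model's transform step adds columns to the input.

```python
# Conceptual sketch only: hypothetical classes, not the Spark ML API.

class MeanCenterer:
    """Estimator: learns the mean of a column during fit()."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        return MeanCentererModel(self.input_col, self.output_col, mean)


class MeanCentererModel:
    """Transformer produced by MeanCenterer.fit()."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Returns new rows with one extra column; the input is left untouched.
        return [dict(r, **{self.output_col: r[self.input_col] - self.mean})
                for r in rows]


class Doubler:
    """Plain transformer: nothing to learn, so it has no fit()."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        return [dict(r, **{self.output_col: 2.0 * r[self.input_col]})
                for r in rows]


class SimplePipeline:
    """Pipeline in estimator mode: fit() returns a SimplePipelineModel."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted, current = [], rows
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it first
                stage = stage.fit(current)
            fitted.append(stage)
            current = stage.transform(current)
        return SimplePipelineModel(fitted)


class SimplePipelineModel:
    """Pipeline in transformer mode: every stage is already fitted."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
test = [{"x": 4.0}]

pipeline = SimplePipeline([MeanCenterer("x", "centered"),
                           Doubler("centered", "prediction")])
model = pipeline.fit(train)        # training time flow
result = model.transform(test)     # test time flow
print(result)  # [{'x': 4.0, 'centered': 2.0, 'prediction': 4.0}]
```

The "prediction" column here is just a doubled value standing in for the output of a real model stage.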

Regression
❖ Supervised Learning Algorithm for predicting continuous labels
❖ Multiple Algorithms
❖ Linear Regression
❖ Generalised Linear Regression
❖ Decision Tree Regression
❖ Random Forest Regression
❖ Gradient Boosted Tree Regression
❖ Survival Regression
❖ Isotonic Regression
❖ Works with input feature vectors and labelled points
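
As a rough illustration of predicting a continuous label from labelled points, here is ordinary least squares for a single feature in plain Python. It is a sketch of the idea only: the function name fit_line and the sample data are made up, and Spark's own linear regression handles feature vectors, regularisation and distributed data, none of which is shown.

```python
# Conceptual sketch only: single-feature least squares, not the Spark ML API.

def fit_line(points):
    """points: list of (feature, label) pairs. Returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up labelled points that lie exactly on label = 2 * feature + 1.
labelled_points = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]

slope, intercept = fit_line(labelled_points)
prediction = slope * 5.0 + intercept   # predict the label for feature = 5.0
print(slope, intercept, prediction)
```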

Linear Regression - Notebook

Classification
❖ Supervised learning for predicting discrete labels
❖ Multiple algorithms
❖ Binomial and multinomial logistic regression
❖ Decision tree classifier
❖ Random forest classifier
❖ Gradient boosted tree classifier
❖ Multi-layer neural network classifier
❖ Naive Bayes Classifier
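
To make "predicting discrete labels" concrete, the following plain-Python sketch trains a binomial logistic regression on one feature with batch gradient descent. The function names, learning rate, epoch count and data are all assumptions chosen for illustration; Spark's implementation uses its own optimiser and is not shown here.

```python
# Conceptual sketch only: one-feature logistic regression, not the Spark ML API.
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(points, learning_rate=0.3, epochs=5000):
    """points: list of (feature, label) with label 0 or 1. Returns (w, b)."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in points) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Discrete label: 1 if the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Made-up data: small feature values belong to class 0, large ones to class 1.
points = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]

w, b = train_logistic(points)
labels = [predict(w, b, x) for x, _ in points]
print(labels)
```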

Clustering
❖ Unsupervised learning that groups feature vectors by
similarity
❖ Multiple algorithms
❖ K-Means Clustering
❖ LDA - Latent Dirichlet Allocation
❖ Bisecting K-Means
❖ Gaussian Mixture Model
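
A minimal K-Means sketch in plain Python shows what grouping vectors by similarity means in practice. It is not Spark's implementation: the function names and data are made up, and the initial centroids are passed in by hand, whereas a real implementation chooses them for you.

```python
# Conceptual sketch only: basic K-Means (Lloyd's algorithm), not the Spark ML API.

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def k_means(points, centroids, iterations=10):
    """Returns (centroids, clusters); clusters holds the assignments
    computed in the final iteration."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
initial_centroids = [points[0], points[3]]   # hand-picked starting centroids

centroids, clusters = k_means(points, initial_centroids)
print(centroids)
print(clusters)
```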

Collaborative Filtering
❖ Commonly used for recommender systems
❖ Uses ALS (Alternating Least Squares) to learn latent
factors in user to item association
❖ Default assumption is based on explicit feedback for
matrix factorization
❖ You can explicitly enable implicit preferences
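
The alternating idea behind ALS can be sketched in plain Python with a single latent factor per user and per item, fitted to explicit ratings. This is a deliberately tiny illustration under stated assumptions: the function name als_rank_one, the regularisation value and the ratings are made up, and Spark's ALS works with higher ranks and distributed data.

```python
# Conceptual sketch only: rank-one ALS on explicit ratings, not the Spark ML API.

def als_rank_one(ratings, n_users, n_items, reg=0.01, iterations=50):
    """ratings: dict mapping (user, item) to an explicit rating.
    Learns one latent factor per user and per item so that
    rating is approximately user_f[user] * item_f[item]."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed; solve regularised least squares per user.
        for u in range(n_users):
            seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            user_f[u] = (sum(r * item_f[i] for i, r in seen)
                         / (reg + sum(item_f[i] ** 2 for i, _ in seen)))
        # Hold user factors fixed; solve regularised least squares per item.
        for i in range(n_items):
            seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            item_f[i] = (sum(r * user_f[u] for u, r in seen)
                         / (reg + sum(user_f[u] ** 2 for u, _ in seen)))
    return user_f, item_f


# Made-up ratings; user 1 has not rated item 2.
ratings = {
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0,
}

user_f, item_f = als_rank_one(ratings, n_users=2, n_items=3)
predicted = user_f[1] * item_f[2]   # estimate for the missing rating
print(predicted)
```

Because user 1 rates items 0 and 1 twice as high as user 0 does, the estimate for the missing rating lands close to twice user 0's rating for item 2.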

Collaborative Filtering - Notebook

Model Tuning
❖ API to tune an individual estimator or the entire
pipeline using a normalised parameter model
❖ API to support k-fold cross validation
❖ API to evaluate performance on regression, as
well as binary and multiclass classification
❖ API for performing training validation split
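
The combination of a parameter grid, k-fold cross validation and an evaluation metric can be sketched in plain Python as follows. None of this is the Spark tuning API: the helper names, the one-feature ridge model, the grid values and the data are assumptions made for illustration.

```python
# Conceptual sketch only: grid search with k-fold cross validation,
# not the Spark ML tuning API.
import math


def k_fold_splits(rows, k):
    """Yield (train, validation) pairs; fold i validates on every k-th row
    starting at index i and trains on the rest."""
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation


def fit_ridge(rows, reg):
    """One-feature ridge regression through the origin; returns the weight."""
    return (sum(x * y for x, y in rows)
            / (sum(x * x for x, _ in rows) + reg))


def rmse(weight, rows):
    """Evaluator: root mean squared error of weight * x against the labels."""
    return math.sqrt(sum((weight * x - y) ** 2 for x, y in rows) / len(rows))


def cross_validate(rows, reg, k=3):
    """Average validation RMSE over k folds for one parameter value."""
    scores = [rmse(fit_ridge(train, reg), validation)
              for train, validation in k_fold_splits(rows, k)]
    return sum(scores) / len(scores)


# Made-up data that follows label = 2 * feature exactly.
rows = [(float(x), 2.0 * x) for x in range(1, 10)]
param_grid = [0.0, 1.0, 10.0, 100.0]   # candidate regularisation values

avg_rmse = {reg: cross_validate(rows, reg) for reg in param_grid}
best_reg = min(avg_rmse, key=avg_rmse.get)
print(best_reg, avg_rmse)
```

A training validation split is the same idea with a single train/validation pair instead of k folds, which is cheaper but gives a noisier estimate.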
