2. Introduction
❖ Working as a technologist and software architect for a couple of decades, at a number of
leading financial institutions in the UK
❖ Authored a number of books on Enterprise Java, Web Services and SOA
❖ Spoken at a number of technology conferences
❖ Founded Service Symphony Ltd in 2009 serving leading financial services customers
building mission critical middleware
❖ Engineer with a keen interest in ML, AI and Data Science
❖ Blog: http://www.servicesymphony.com/blog
❖ Email: meeraj@servicesymphony.com
❖ Presentation: https://www.slideshare.net/MeerajKunnumpurath/machine-learning-by-example-apache-spark
❖ GitHub: https://github.com/kunnum/sandbox/tree/master/notebooks
3. Agenda
❖ Introduction to using ML with Apache Spark
❖ Hands-on example driven approach
❖ Not a deep dive into Apache Spark Architecture
❖ Nor a deep dive into ML algorithms
❖ Examples built using Apache Zeppelin
❖ Some of the examples are from the Apache Spark
documentation
4. Apache Spark - Overview
❖ Open source large scale distributed data processing fabric
❖ Offers multiple components addressing different facets of data science for big and
fast data processing, ML, analytics and data ingestion
❖ Ability to process large amounts of data in memory spanning multiple process
spaces
❖ Initially started as a research project at UC Berkeley
❖ Originally released under the BSD license; a top-level ASF project licensed under
ASL 2.0 since 2014
❖ One of the most active open source projects, arguably the most active ASF project
❖ Adopted, extended and commercialised by multiple vendors playing in the data
science realm
8. Scala - Spark Natural Transition
❖ Interest in Spark stemmed from deep interest in Scala
and functional programming
❖ Data processing ecosystem built around Scala, with a
strong synergy with Scala’s design motivations
❖ Extends Scala’s idiomatic functional programming
model to reach beyond process boundaries
❖ Spark RDDs - Scala collections on steroids
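
The "collections on steroids" point can be illustrated with a minimal sketch. It assumes a local Spark installation (master `local[*]`); the object and method names are hypothetical, chosen only for this illustration. The same filter/map pipeline is written once against a Scala collection and once against an RDD.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical names, for illustration only.
object CollectionsVsRdd {
  // Sum of the squares of the even numbers, on an in-process Scala collection.
  def localSumOfEvenSquares(xs: Seq[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).sum

  // The same pipeline on an RDD: the closures are shipped to the executors,
  // so the work can span several processes or machines.
  def distributedSumOfEvenSquares(spark: SparkSession, xs: Seq[Int]): Int =
    spark.sparkContext
      .parallelize(xs)
      .filter(_ % 2 == 0)
      .map(x => x * x)
      .reduce(_ + _) // reduce fails on an empty RDD; fine for this non-empty input

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collections-vs-rdd")
      .getOrCreate()
    try {
      val xs = 1 to 10
      println(localSumOfEvenSquares(xs))
      println(distributedSumOfEvenSquares(spark, xs))
    } finally spark.stop()
  }
}
```

Both calls compute the same value; only the RDD version is partitioned and evaluated lazily until `reduce` triggers the job.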
12. ML Components
❖ Data Structures
❖ Vectors and Matrices
❖ Data Frames
❖ Feature Extractors and Transformers
❖ Estimators
❖ Models
❖ Pipelines
❖ Evaluators
❖ Tuning Aids
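
A few of the components listed above can be sketched together. This is a minimal illustration rather than the deck's own example: the column names `x1`, `x2` and `label` and the object name are hypothetical, and a local master is assumed. It shows dense and sparse vectors, a DataFrame, and `VectorAssembler` acting as a transformer.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names, for illustration only.
object MlBuildingBlocks {
  // Dense and sparse vectors are two encodings of the same values.
  val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
  val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

  // A transformer maps one DataFrame to another; here two numeric columns
  // are combined into a single vector column named "features".
  def assemble(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq((1.0, 2.0, 0.0), (3.0, 4.0, 1.0)).toDF("x1", "x2", "label")
    new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ml-building-blocks")
      .getOrCreate()
    try {
      println(dense.toArray.sameElements(sparse.toArray))
      assemble(spark).show(false)
    } finally spark.stop()
  }
}
```

Estimators differ from transformers in that they must first be fitted to data; fitting produces a model, which is itself a transformer.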
18. Spark ML - Pipeline Architecture
Training time flow
Pipeline in estimator mode
Pipeline.fit()
Creates a pipeline model
19. Spark ML - Pipeline Architecture
Test time flow
Pipeline in transformer mode
PipelineModel.transform()
Creates dataframe with augmented prediction columns
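
The training-time and test-time flows can be sketched as follows. The stages (tokenizer, hashing term frequency, logistic regression) and the tiny data set follow the pattern of the pipeline example in the Spark documentation; the object name, the feature count of 1000 and the local master are assumptions made for this sketch.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names, for illustration only.
object PipelineSketch {
  // Training time: the pipeline acts as an estimator and fit() returns a PipelineModel.
  def train(training: DataFrame): PipelineModel = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
      .setNumFeatures(1000)
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  // Test time: the fitted model acts as a transformer and appends prediction columns.
  def predict(model: PipelineModel, test: DataFrame): DataFrame =
    model.transform(test)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    import spark.implicits._
    try {
      val training = Seq(
        (0L, "a b c d e spark", 1.0),
        (1L, "b d", 0.0),
        (2L, "spark f g h", 1.0),
        (3L, "hadoop mapreduce", 0.0)
      ).toDF("id", "text", "label")
      val test = Seq((4L, "spark i j k"), (5L, "l m n")).toDF("id", "text")

      val model = train(training)
      predict(model, test).select("id", "text", "probability", "prediction").show(false)
    } finally spark.stop()
  }
}
```

The output DataFrame keeps the input columns and adds the intermediate `words` and `features` columns plus `rawPrediction`, `probability` and `prediction`.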
52. Collaborative Filtering
❖ Commonly used for recommender systems
❖ Uses ALS (Alternating Least Squares) to learn latent
factors in user to item association
❖ Default assumption is based on explicit feedback for
matrix factorization
❖ You can explicitly enable implicit preferences
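
A minimal sketch of ALS in Spark ML follows. The ratings are invented for illustration, and the column names, rank, iteration count and regularisation value are assumptions rather than recommendations. Explicit feedback is the default; `setImplicitPrefs(true)` switches to implicit preferences.

```scala
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names, for illustration only.
object AlsSketch {
  // Invented (userId, itemId, rating) triples.
  def sampleRatings(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (0, 0, 4.0f), (0, 1, 2.0f), (1, 0, 5.0f), (1, 2, 3.0f),
      (2, 1, 1.0f), (2, 2, 4.0f), (3, 0, 3.0f), (3, 2, 5.0f)
    )).toDF("userId", "itemId", "rating")

  // Learns latent user and item factors; implicitPrefs = false means explicit feedback.
  def fit(ratings: DataFrame, implicitPrefs: Boolean = false): ALSModel =
    new ALS()
      .setUserCol("userId")
      .setItemCol("itemId")
      .setRatingCol("rating")
      .setRank(5)        // number of latent factors (assumed value)
      .setMaxIter(5)
      .setRegParam(0.1)
      .setImplicitPrefs(implicitPrefs)
      .setSeed(42L)
      .fit(ratings)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("als-sketch")
      .getOrCreate()
    try {
      val ratings = sampleRatings(spark)
      val model = fit(ratings)
      model.userFactors.show(false)   // one latent vector per user
      model.transform(ratings).show() // adds a "prediction" column
    } finally spark.stop()
  }
}
```

With implicit preferences the rating column is treated as a confidence signal (for example, view or click counts) rather than as an explicit score.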
60. Model Tuning
❖ API to tune an individual estimator or the entire
pipeline using a normalised parameter model
❖ API to support k-fold cross validation
❖ API to evaluate performance on regression, as
well as binary and multiclass classification
❖ API for performing training validation split
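
The tuning APIs above can be sketched with k-fold cross validation over a small parameter grid. The synthetic data (label = 2x + 1), the grid values, the fold count and the object name are assumptions chosen to keep the example small. A single estimator is tuned here, but a whole `Pipeline` can be passed to `setEstimator` in the same way.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names, for illustration only.
object TuningSketch {
  // Synthetic, noise-free data: label = 2 * x + 1.
  def syntheticData(spark: SparkSession): DataFrame =
    spark.createDataFrame(
      (1 to 40).map(i => (Vectors.dense(i.toDouble), 2.0 * i + 1.0))
    ).toDF("features", "label")

  // 3-fold cross validation over three regularisation values, scored by RMSE.
  def tune(data: DataFrame): CrossValidatorModel = {
    val lr = new LinearRegression().setMaxIter(20)
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.0, 0.1, 1.0))
      .build()
    new CrossValidator()
      .setEstimator(lr)
      .setEstimatorParamMaps(grid)
      .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
      .setNumFolds(3)
      .setSeed(42L)
      .fit(data)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tuning-sketch")
      .getOrCreate()
    try {
      val cvModel = tune(syntheticData(spark))
      // One averaged metric per parameter combination in the grid.
      cvModel.avgMetrics.foreach(println)
    } finally spark.stop()
  }
}
```

`TrainValidationSplit` takes the same estimator, grid and evaluator but evaluates each combination once on a single split (configured with `setTrainRatio`), which is cheaper than k-fold cross validation but gives a noisier estimate.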