SlideShare a Scribd company logo
1 of 34
Download to read offline
Machine Learning Pipeline
with Spark ML
End to End Machine learning
https://github.com/RamkSwamy/sparkmlpipeline
● Ram Kuppuswamy
● Worked in Microsoft for 13yrs
● Co-Founder, Zinnia Systems
Pvt. Ltd.,
● Big data consultant and
trainer at datamantra.io
Agenda
● Machine Learning Pipeline
● Spark ML API
● Components of ML API
● Building a pipeline
● Persisting a model
● Evaluating a pipeline Model
● Cross validating pipeline Model
Machine learning
● Most of the developers think machine learning is mostly
learning algorithm
● Most of the big data libraries like MLLib, Mahout are
focused on implementing algorithm in distributed
manner
● But when you try to productionize an end to end solution
you will quickly realize that, machine learning is not just
about learning algorithm
● There are many other important steps to build an end to
end machine learning application
Stages of Machine Learning application
● Data Exploration
○ Read Data
○ Missing data
○ Look at correlation
○ Statistics of independent variables
● Data Preparation (Preprocess data)
■ Indexing the labels
■ Handling categorical variables
■ Numeric values in text data (wordtovec)
Stages of ML Application continued...
● Model training
● Model evaluation
● Model Tuning
● Repeat this process many times
Spark MLLib
● Only focused on model learning
● No standard way to do other steps of ML pipeline
● No way to combine all these steps and execute them
● Based on RDD API
● Though some of these steps are added later, they were
not uniform across the algorithms
Spark ML
● Provides higher-level API for construction and tuning of
ML workflows
● Built on top of DataFrames
● We are using Spark 2.0 which will have ML as the
library for Machine Learning going forward and MLLib
will be deprecated
Case Study
● We use the following dataset from the following
Machine Learning Repository:
http://archive.ics.uci.edu/ml/datasets/Census+Income
● They have given the training and test data in the
following 2 files separately:
○ Adult.data
○ Adult.test
● Objective: To predict if the income of an individual is
>50K or <=50K? by constructing a pipeline.
Abstractions of Spark ML
● Transformer
● Estimator
● Evaluator
● Pipeline
● Params
Data
Data Exploration
● Read the data and create a DataFrame
○ Util:loadSalaryCsvTrain
○ Util:loadSalaryCsvTest
● Looking at schema
○ SalaryDataSchema
● Look at Statistics of variables
○ SalaryDataExplore
Data Preparation
● Clean the data
○ cleanDataFrame Util
● Label Indexing
● Categorical handling
○ String Indexing
○ oneHot Encoding
Estimator
● An Estimator abstraction uses an algorithm which is
fitted on a DataFrame returning a model.
● It implements a method fit():
DF Estimator Model
Label Indexing
● We want to create the label which is the dependent
variable
● We have 2 different values a) >50K b) <=50K
● One will take the value of ‘0’ and the other ‘1’
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label
indices.
● The indices are in [0, numLabels), ordered by label
frequencies, so the most frequent label gets index 0
String Indexer
● We use the StringIndexer API to achieve this
● It encodes a string column of labels to a column of label
indices.
● SalaryLabelIndexing
Categorical handling
● We have many categorical fields such as occupation,
sex, workclass, relationship and marital_status etc.,
● They are all String types and we use the StringIndexer
to generate the indices and then use OneHotEncoder,
which maps a column of label indices to a column of
binary vectors, with at most a single one-value.
● This encoding allows algorithms which expect
continuous features, such as Logistic Regression, to
use categorical features.
StringIndxer
● It is an Estimator which uses StringIndexerModel to fit
the data
Example: StringIndexer
Transformer
● A Transformer is an abstraction which has an algorithm which transforms
one DataFrame to another.
● It implements a method transform()
DF DFTransformer
OneHotEncoder
● It is a Transformer which takes the data and converts
into a vector
● CategoricalForWorkClass
Vector assembler
● A feature transformer that merges multiple columns into
a vector column.
● The LogisticRegression model expects a column named
“features” as vector by default or you can set it.
● SalaryVectorAssembler
Building pipeline
Pipeline
● Chain of Transformers and Estimators
● Pipeline itself is an Estimator
● It is fitted on a DataFrame turning it into a model
● Once you have defined a pipeline, you can have the
training and test datasets go thru the same stages of
processing
○ buildOneHotPipeLine
○ buildPipeLineForFeaturePreparation
○ buildDataPrepPipeLine
Model training
Prediction with test data
Logistic regression
● We want to train the LogisticRegression model with
training data
● We create a pipeline, which will take the training data
and train the model with it and provides that model for
us to use it to predict with the test data
● Example code : LRTraining module
Model training with training data
● The model expects the feature vectors in a column
named “features” and the labels (dependent varible that
we are trying to predict) in a column named “label” by
default.
Model evaluation
Evaluator
● Area under ROC curve is used to measure the accuracy
of our model
● We use various evaluators such as
BinaryClassificationEvaluator() in this process
● Evaluator takes the data and metric related parameter
and evaluates the metric asked for, say area under
ROC curve or PR Curve etc.,
● It has evaluate() method
Evaluator continued
● Takes Data and Parameters and provides metrics
○ SalaryEvaluator
DF
Param
Evaluator Metric
Model selection
Cross-validator
● We want to tune the performance of our models
● It takes the following inputs:
○ Estimator : pipeline we have built
○ Parameter Grid : Regularization and num of iterations for LR
○ Evaluator: Binary classification evaluator
● Find best Parameters
○ SalaryCrossValidator
Recap ML Pipelines
Load Data
StringIndexer
OneHotEncoder
VectorAssembler
Pipeline
Evaluate
LogisticRegression
Transformer Estimator

More Related Content

What's hot

Pythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowPythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowFernando Ortega Gallego
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBCarol McDonald
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineSrivatsan Srinivasan
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Sparkdatamantra
 
Build an efficient Machine Learning model with LightGBM
Build an efficient Machine Learning model with LightGBMBuild an efficient Machine Learning model with LightGBM
Build an efficient Machine Learning model with LightGBMPoo Kuan Hoong
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge GraphsJeff Z. Pan
 
LDA Beginner's Tutorial
LDA Beginner's TutorialLDA Beginner's Tutorial
LDA Beginner's TutorialWayne Lee
 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature EngineeringAlice Zheng
 
MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle Databricks
 
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Edureka!
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowDatabricks
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnSarah Guido
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
mlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecyclemlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecycleDatabricks
 
Supervised learning
Supervised learningSupervised learning
Supervised learningankit_ppt
 

What's hot (20)

Pythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowPythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlow
 
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DBAnalyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
Analyzing Flight Delays with Apache Spark, DataFrames, GraphFrames, and MapR-DB
 
XGBoost & LightGBM
XGBoost & LightGBMXGBoost & LightGBM
XGBoost & LightGBM
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Real World End to End machine Learning Pipeline
Real World End to End machine Learning PipelineReal World End to End machine Learning Pipeline
Real World End to End machine Learning Pipeline
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
Build an efficient Machine Learning model with LightGBM
Build an efficient Machine Learning model with LightGBMBuild an efficient Machine Learning model with LightGBM
Build an efficient Machine Learning model with LightGBM
 
Introduction of Knowledge Graphs
Introduction of Knowledge GraphsIntroduction of Knowledge Graphs
Introduction of Knowledge Graphs
 
LDA Beginner's Tutorial
LDA Beginner's TutorialLDA Beginner's Tutorial
LDA Beginner's Tutorial
 
The How and Why of Feature Engineering
The How and Why of Feature EngineeringThe How and Why of Feature Engineering
The How and Why of Feature Engineering
 
MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle MLFlow: Platform for Complete Machine Learning Lifecycle
MLFlow: Platform for Complete Machine Learning Lifecycle
 
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
Scikit Learn Tutorial | Machine Learning with Python | Python for Data Scienc...
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
A Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-LearnA Beginner's Guide to Machine Learning with Scikit-Learn
A Beginner's Guide to Machine Learning with Scikit-Learn
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
mlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecyclemlflow: Accelerating the End-to-End ML lifecycle
mlflow: Accelerating the End-to-End ML lifecycle
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Pandas
PandasPandas
Pandas
 

Viewers also liked

Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelinesjeykottalam
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascentjeykottalam
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Spark Summit
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learningdatamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalystdatamantra
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibDatabricks
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache SparkDatabricks
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLPaco Nathan
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Hierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, Numenta
Hierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, NumentaHierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, Numenta
Hierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, NumentaWithTheBest
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonSimon Frid
 

Viewers also liked (20)

Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Apache spark with Machine learning
Apache spark with Machine learningApache spark with Machine learning
Apache spark with Machine learning
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Practical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlibPractical Machine Learning Pipelines with MLlib
Practical Machine Learning Pipelines with MLlib
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
Data Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAMLData Workflows for Machine Learning - Seattle DAML
Data Workflows for Machine Learning - Seattle DAML
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Hierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, Numenta
Hierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, NumentaHierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, Numenta
Hierarchical Temporal Memory: Computing Like the Brain - Matt Taylor, Numenta
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 

Similar to Machine learning pipeline with spark ml

Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Sparkcarl_pulley
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Sparkdatamantra
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC ...
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC                           ...Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC                           ...
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC ...PATHALAMRAJESH
 
Student mark Prediction application.pptx
Student mark Prediction application.pptxStudent mark Prediction application.pptx
Student mark Prediction application.pptxSHIVAMKUMAR988270
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeIdo Shilon
 
Model Driven Developing & Model Based Checking: Applying Together
Model Driven Developing & Model Based Checking: Applying TogetherModel Driven Developing & Model Based Checking: Applying Together
Model Driven Developing & Model Based Checking: Applying TogetherIosif Itkin
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopHolden Karau
 
Real world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.comReal world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.comMathieu Dumoulin
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroDaniel Marcous
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCgdgsurrey
 
INTERNSHIP presentation on machiine learning
INTERNSHIP presentation  on machiine learningINTERNSHIP presentation  on machiine learning
INTERNSHIP presentation on machiine learningrushikeshgarkal23ve
 
Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...
Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...
Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...onlinetraining24
 

Similar to Machine learning pipeline with spark ml (20)

MLlib with MLFlow.pdf
MLlib with MLFlow.pdfMLlib with MLFlow.pdf
MLlib with MLFlow.pdf
 
Porting R Models into Scala Spark
Porting R Models into Scala SparkPorting R Models into Scala Spark
Porting R Models into Scala Spark
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Aws autopilot
Aws autopilotAws autopilot
Aws autopilot
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC ...
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC                           ...Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC                           ...
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC ...
 
Student mark Prediction application.pptx
Student mark Prediction application.pptxStudent mark Prediction application.pptx
Student mark Prediction application.pptx
 
Production ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ wazeProduction ready big ml workflows from zero to hero daniel marcous @ waze
Production ready big ml workflows from zero to hero daniel marcous @ waze
 
Model Driven Developing & Model Based Checking: Applying Together
Model Driven Developing & Model Based Checking: Applying TogetherModel Driven Developing & Model Based Checking: Applying Together
Model Driven Developing & Model Based Checking: Applying Together
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Real world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.comReal world machine learning with Java for Fumankaitori.com
Real world machine learning with Java for Fumankaitori.com
 
Production-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to heroProduction-Ready BIG ML Workflows - from zero to hero
Production-Ready BIG ML Workflows - from zero to hero
 
MOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDCMOPs & ML Pipelines on GCP - Session 6, RGDC
MOPs & ML Pipelines on GCP - Session 6, RGDC
 
C2_W1---.pdf
C2_W1---.pdfC2_W1---.pdf
C2_W1---.pdf
 
INTERNSHIP presentation on machiine learning
INTERNSHIP presentation  on machiine learningINTERNSHIP presentation  on machiine learning
INTERNSHIP presentation on machiine learning
 
Online Datastage Training
Online Datastage TrainingOnline Datastage Training
Online Datastage Training
 
Datastage Online Training
Datastage Online TrainingDatastage Online Training
Datastage Online Training
 
Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...
Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...
Datastage Online Training|IBM Infosphere Datastage Training|Datastage 8.7 onl...
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle managementdatamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsdatamantra
 

More from datamantra (20)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 

Recently uploaded

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 

Recently uploaded (20)

CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 

Machine learning pipeline with spark ml

  • 1. Machine Learning Pipeline with Spark ML End to End Machine learning https://github.com/RamkSwamy/sparkmlpipeline
  • 2. ● Ram Kuppuswamy ● Worked in Microsoft for 13yrs ● Co-Founder, Zinnia Systems Pvt. Ltd., ● Big data consultant and trainer at datamantra.io
  • 3. Agenda ● Machine Learning Pipeline ● Spark ML API ● Components of ML API ● Building a pipeline ● Persisting a model ● Evaluating a pipeline Model ● Cross validating pipeline Model
  • 4. Machine learning ● Most of the developers think machine learning is mostly learning algorithm ● Most of the big data libraries like MLLib, Mahout are focused on implementing algorithm in distributed manner ● But when you try to productionize an end to end solution you will quickly realize that, machine learning is not just about learning algorithm ● There are many other important steps to build an end to end machine learning application
  • 5. Stages of Machine Learning application ● Data Exploration ○ Read Data ○ Missing data ○ Look at correlation ○ Statistics of independent variables ● Data Preparation (Preprocess data) ■ Indexing the labels ■ Handling categorical variables ■ Numeric values in text data (wordtovec)
  • 6. Stages of ML Application continued... ● Model training ● Model evaluation ● Model Tuning ● Repeat this process many times
  • 7. Spark MLLib ● Only focused on model learning ● No standard way to do other steps of ML pipeline ● No way to combine all these steps and execute them ● Based on RDD API ● Though some of these steps are added later, they were not uniform across the algorithms
  • 8. Spark ML ● Provides higher-level API for construction and tuning of ML workflows ● Built on top of DataFrames ● We are using Spark 2.0 which will have ML as the library for Machine Learning going forward and MLLib will be deprecated
  • 9. Case Study ● We use the following dataset from the following Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Census+Income ● They have given the training and test data in the following 2 files separately: ○ Adult.data ○ Adult.test ● Objective: To predict if the income of an individual is >50K or <=50K? by constructing a pipeline.
  • 10. Abstractions of Spark ML ● Transformer ● Estimator ● Evaluator ● Pipeline ● Params
  • 11. Data
  • 12. Data Exploration ● Read the data and create a DataFrame ○ Util:loadSalaryCsvTrain ○ Util:loadSalaryCsvTest ● Looking at schema ○ SalaryDataSchema ● Look at Statistics of variables ○ SalaryDataExplore
  • 13. Data Preparation ● Clean the data ○ cleanDataFrame Util ● Label Indexing ● Categorical handling ○ String Indexing ○ oneHot Encoding
  • 14. Estimator ● An Estimator abstraction uses an algorithm which is fitted on a DataFrame returning a model. ● It implements a method fit(): DF Estimator Model
  • 15. Label Indexing ● We want to create the label which is the dependent variable ● We have 2 different values a) >50K b) <=50K ● One will take the value of ‘0’ and the other ‘1’ ● We use the StringIndexer API to achieve this ● It encodes a string column of labels to a column of label indices. ● The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0
  • 16. String Indexer ● We use the StringIndexer API to achieve this ● It encodes a string column of labels to a column of label indices. ● SalaryLabelIndexing
  • 17. Categorical handling ● We have many categorical fields such as occupation, sex, workclass, relationship and marital_status etc., ● They are all String types and we use the StringIndexer to generate the indices and then use OneHotEncoder, which maps a column of label indices to a column of binary vectors, with at most a single one-value. ● This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
  • 18. StringIndxer ● It is an Estimator which uses StringIndexerModel to fit the data
  • 20. Transformer ● A Transformer is an abstraction which has an algorithm which transforms one DataFrame to another. ● It implements a method transform() DF DFTransformer
  • 21. OneHotEncoder ● It is a Transformer which takes the data and converts into a vector ● CategoricalForWorkClass
  • 22. Vector assembler ● A feature transformer that merges multiple columns into a vector column. ● The LogisticRegression model expects a column named “features” as vector by default or you can set it. ● SalaryVectorAssembler
  • 24. Pipeline ● Chain of Transformers and Estimators ● Pipeline itself is an Estimator ● It is fitted on a DataFrame turning it into a model ● Once you have defined a pipeline, you can have the training and test datasets go thru the same stages of processing ○ buildOneHotPipeLine ○ buildPipeLineForFeaturePreparation ○ buildDataPrepPipeLine
  • 27. Logistic regression ● We want to train the LogisticRegression model with training data ● We create a pipeline, which will take the training data and train the model with it and provides that model for us to use it to predict with the test data ● Example code : LRTraining module
  • 28. Model training with training data ● The model expects the feature vectors in a column named “features” and the labels (dependent varible that we are trying to predict) in a column named “label” by default.
  • 30. Evaluator ● Area under ROC curve is used to measure the accuracy of our model ● We use various evaluators such as BinaryClassificationEvaluator() in this process ● Evaluator takes the data and metric related parameter and evaluates the metric asked for, say area under ROC curve or PR Curve etc., ● It has evaluate() method
  • 31. Evaluator continued ● Takes Data and Parameters and provides metrics ○ SalaryEvaluator DF Param Evaluator Metric
  • 33. Cross-validator ● We want to tune the performance of our models ● It takes the following inputs: ○ Estimator : pipeline we have built ○ Parameter Grid : Regularization and num of iterations for LR ○ Evaluator: Binary classification evaluator ● Find best Parameters ○ SalaryCrossValidator
  • 34. Recap ML Pipelines Load Data StringIndexer OneHotEncoder VectorAssembler Pipeline Evaluate LogisticRegression Transformer Estimator