Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
© 2017 MapR Technologies
Spark Machine Learning
Carol McDonald
@caroljmcdonald
© 2017 MapR Technologies
Agenda
•  Introduction to Machine Learning Techniques
–  Classification
–  Clustering
•  Use Deci...
© 2017 MapR Technologies
What is Machine Learning?
Data Build ModelTrain Algorithm
Finds patterns
New Data Use Model
(pred...
© 2017 MapR Technologies
Examples of ML Algorithms
Supervised
•  Classification
–  Naïve Bayes
–  SVM
–  Random Decision
F...
© 2017 MapR Technologies
Supervised Algorithms use labeled data
Data
features
Build Model
New Data
features
Predict
Use Mo...
© 2017 MapR Technologies
Supervised Machine Learning: Classification & Regression
Classification
Identifies
category for i...
© 2017 MapR Technologies
Classification: Definition
Form of ML that:
•  Identifies which category an item belongs to
•  Us...
© 2017 MapR Technologies
If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck
swims
walks
quacks
Features:
walks...
© 2017 MapR Technologies
Car Insurance Fraud Example
•  What are we trying to predict?
–  This is the Label or Target outc...
© 2017 MapR Technologies
Label:
Amount of Fraud
Y
X
Feature: claimed amount
Data point: fraud amount,
claimed amount
AmntF...
© 2017 MapR Technologies
Credit Card Fraud Example
•  What are we trying to predict?
–  This is the Label:
–  The probabil...
© 2017 MapR Technologies
Label
Probabilty
of Fraud 1
X
Features: trans amount, type of store,
Time Location difference las...
© 2017 MapR Technologies
Supervised Learning: Classification & Regression
•  Classification:
–  identifies which category ...
© 2017 MapR Technologies
Examples of ML Algorithms
Machine Learning
Unsupervised
•  Clustering
–  K-means
•  Dimensionalit...
© 2017 MapR Technologies
Unsupervised Algorithms use Unlabeled data
Customer GroupsBuild ModelTrain Algorithm
Finds patter...
© 2017 MapR Technologies
Unsupervised Machine Learning: Clustering
Clustering
group news articles into different categories
© 2017 MapR Technologies
Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high simi...
© 2017 MapR Technologies
Clustering: Definition
•  Unsupervised learning task
•  Groups objects into clusters of high simi...
© 2017 MapR Technologies
Clustering: Example
•  Group similar objects
© 2017 MapR Technologies
Clustering: Example
•  Group similar objects
•  Use MLlib K-means algorithm
1.  Initialize coordi...
© 2017 MapR Technologies
Clustering: Example
•  Group similar objects
•  Use MLlib K-means algorithm
1.  Initialize coordi...
© 2017 MapR Technologies
Clustering: Example
•  Group similar objects
•  Use MLlib K-means algorithm
1.  Initialize coordi...
© 2017 MapR Technologies
Clustering: Example
•  Group similar objects
•  Use MLlib K-means algorithm
1.  Initialize coordi...
© 2017 MapR Technologies
Predict Churn
© 2017 MapR Technologies
ML Discovery Model Building
Model
Training/
Building
Training
Set
Test Model
Predictions
Test
Set...
© 2017 MapR Technologies
Telecom Customer Churn Data
•  State: string
•  Account length: integer
•  Area code: integer
•  ...
© 2017 MapR Technologies
Customer Churn Example
•  What are we trying to predict?
–  This is the Label:
–  Did the custome...
© 2017 MapR Technologies
Decision Trees
•  Decision Tree for Classification
prediction
•  Represents tree with nodes
•  IF...
© 2017 MapR Technologies
Example Decision Tree
© 2017 MapR Technologies
Spark ML workflow
© 2017 MapR Technologies
Spark ML workflow with a Pipeline
Pipeline
Estimator
Extract
Features
Load
Data
Train
Model
Estim...
© 2017 MapR Technologies
Zeppelin Notebook with Spark
Data
Engineer
Data
Scientist
© 2017 MapR Technologies
Load the data into a Dataframe: Define the Schema
case class Account(state: String, len: Integer,...
© 2017 MapR Technologies
Data
Frame
Load data
Load the data into a Dataset
val train: Dataset[Account] = spark.read.option...
© 2017 MapR Technologies
Dataset merged with Dataframe
in Spark 2.0, DataFrame APIs merged with Datasets APIs
© 2017 MapR Technologies
Extract the Features
Image reference O’Reilly Learning Spark
+
+
̶+
̶ ̶
Feature Vectors and Label...
© 2017 MapR Technologies
Data
Frame
Add column
Use StringIndexer to map Strings to Numbers
val ipindexer = new StringIndex...
© 2017 MapR Technologies
Data
Frame
Add column
Use StringIndexer to map churn True False to Numbers
Val labelindexer = new...
© 2017 MapR Technologies
Data
Frame
Load data Add column DataFrame +
Features
Use VectorAssembler to put features in vecto...
© 2017 MapR Technologies
Data
Frame
Load data transform
Estimator
val dTree = new DecisionTreeClassifier()
.setLabelCol("l...
© 2017 MapR Technologies
val pipeline = new Pipeline()
.setStages(Array(ipindexer, labelindexer, assembler, dTree))
Put Fe...
© 2017 MapR Technologies
Spark ML workflow with a Pipeline
Pipeline
Transfomers
Load
Data
estimator
Train model
Data
frame...
© 2017 MapR Technologies
K-fold Cross-Validation Process
Data
Model
Training/
Building
Training
Set
Test Model
Predictions...
© 2017 MapR Technologies
K-fold Cross-Validation Process
Data
Model
Training
Training
Set
Test Model
Predictions
Test
Set
...
© 2017 MapR Technologies
ML Cross-Validation Process
Data Model
Training
Set
Test Model
Predictions
Test
Set
Evaluate the ...
© 2017 MapR Technologies
K-fold Cross-Validation Process
Data
Model
Training/
Building
Training
Set
Test Model
Predictions...
© 2017 MapR Technologies
Cross Validation transformation estimation pipeline
Pipeline
Cross Validator
evaluatorParameter G...
© 2017 MapR Technologies
Parameter Tuning with CrossValidator with a Paramgrid
CrossValidator
•  Given:
–  Estimator
–  Pa...
© 2017 MapR Technologies
val cvModel = crossval.fit(ntrain)
Cross Validator fit a model to the data
Pipeline
Cross Validat...
© 2017 MapR Technologies
Evaluate the fitted model
Pipeline
Transfomers
Load
Data
estimator
Train model
Data
frame
Extract...
© 2017 MapR Technologies
fitted
model
Evaluate the Predictions from DecisionTree Estimator
Evaluator
transform
Test
featur...
© 2017 MapR Technologies
Area under the ROC curve
Accuracy is measured by the area under
the ROC curve.
The area measures ...
© 2017 MapR Technologies
To Learn More:
•  Read about and download example code
•  https://mapr.com/blog/churn-prediction-...
© 2017 MapR Technologies
To Learn More:
•  End to End Application for Monitoring Uber Data using Spark ML
•  https://mapr....
© 2017 MapR Technologies
To Learn More:
•  MapR Free ODT http://learn.mapr.com/
© 2017 MapR Technologies
For Q&A :
•  https://community.mapr.com/
•  https://community.mapr.com/community/answers/pages/qa
© 2017 MapR Technologies
Open Source Engines & Tools Commercial Engines & Applications
Enterprise-Grade Platform Services
...
© 2017 MapR Technologies
Q&A
ENGAGE WITH US
Upcoming SlideShare
Loading in …5
×

Live Machine Learning Tutorial: Churn Prediction

1,931 views

Published on

Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a service. Though originally used within the telecommunications industry, it has become common practice for banks, ISPs, insurance firms, and other verticals. More: http://info.mapr.com/WB_PredictingChurn_Global_DG_17.06.15_RegistrationPage.html

The prediction process is data-driven and often uses advanced machine learning techniques. In this webinar, we'll look at customer data, do some preliminary analysis, and generate churn prediction models – all with Spark machine learning (ML) and a Zeppelin notebook.

Spark’s ML library goal is to make machine learning scalable and easy. Zeppelin with Spark provides a web-based notebook that enables interactive machine learning and visualization.

In this tutorial, we'll do the following:

Review classification and decision trees
Use Spark DataFrames with Spark ML pipelines
Predict customer churn with Apache Spark ML decision trees
Use Zeppelin to run Spark commands and visualize the results

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Live Machine Learning Tutorial: Churn Prediction

  1. 1. © 2017 MapR Technologies Spark Machine Learning Carol McDonald @caroljmcdonald
  2. 2. © 2017 MapR Technologies Agenda •  Introduction to Machine Learning Techniques –  Classification –  Clustering •  Use Decision Tree to Predict Customer Churn
  3. 3. © 2017 MapR Technologies What is Machine Learning? Data Build ModelTrain Algorithm Finds patterns New Data Use Model (prediction function) Predictions Contains patterns Recognizes patterns
  4. 4. © 2017 MapR Technologies Examples of ML Algorithms Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD
  5. 5. © 2017 MapR Technologies Supervised Algorithms use labeled data Data features Build Model New Data features Predict Use Model
  6. 6. © 2017 MapR Technologies Supervised Machine Learning: Classification & Regression Classification Identifies category for item
  7. 7. © 2017 MapR Technologies Classification: Definition Form of ML that: •  Identifies which category an item belongs to •  Uses supervised learning algorithms –  Data is labeled Sentiment
  8. 8. © 2017 MapR Technologies If it Walks/Swims/Quacks Like a Duck …… Then It Must Be a Duck swims walks quacks Features: walks quacks swims Features:
  9. 9. © 2017 MapR Technologies Car Insurance Fraud Example •  What are we trying to predict? –  This is the Label or Target outcome: –  The amount of Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  The claim Amount
  10. 10. © 2017 MapR Technologies Label: Amount of Fraud Y X Feature: claimed amount Data point: fraud amount, claimed amount AmntFraud = intercept + coeff * claimedAmnt Car Insurance Fraud Regression Example
  11. 11. © 2017 MapR Technologies Credit Card Fraud Example •  What are we trying to predict? –  This is the Label: –  The probability of Fraud •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  transaction amount, type of merchant, distance from and time since last transaction
  12. 12. © 2017 MapR Technologies Label Probabilty of Fraud 1 X Features: trans amount, type of store, Time Location difference last trans. Fraud 0 Not Fraud .5 Credit Card Fraud Logistic Regression Example
  13. 13. © 2017 MapR Technologies Supervised Learning: Classification & Regression •  Classification: –  identifies which category (eg fraud or not fraud) •  Linear Regression: –  predicts a value (eg amount of fraud) •  Logistic Regression: –  predicts a probability (eg probability of fraud)
  14. 14. © 2017 MapR Technologies Examples of ML Algorithms Machine Learning Unsupervised •  Clustering –  K-means •  Dimensionality reduction –  Principal Component Analysis –  SVD Supervised •  Classification –  Naïve Bayes –  SVM –  Random Decision Forests •  Regression –  Linear –  Logistic
  15. 15. © 2017 MapR Technologies Unsupervised Algorithms use Unlabeled data Customer GroupsBuild ModelTrain Algorithm Finds patterns New Customer Purchase Data Use Model (prediction function) Predict Group Contains patterns Recognizes patterns Customer purchase data
  16. 16. © 2017 MapR Technologies Unsupervised Machine Learning: Clustering Clustering group news articles into different categories
  17. 17. © 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity
  18. 18. © 2017 MapR Technologies Clustering: Definition •  Unsupervised learning task •  Groups objects into clusters of high similarity –  Search results grouping –  Grouping of customers –  Anomaly detection –  Text categorization
  19. 19. © 2017 MapR Technologies Clustering: Example •  Group similar objects
  20. 20. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) x x x x x
  21. 21. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid x x x x x
  22. 22. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points x x x x x
  23. 23. © 2017 MapR Technologies Clustering: Example •  Group similar objects •  Use MLlib K-means algorithm 1.  Initialize coordinates to center of clusters (centroid) 2.  Assign all points to nearest centroid 3.  Update centroids to center of points 4.  Repeat until conditions met x x x x x
  24. 24. © 2017 MapR Technologies Predict Churn
  25. 25. © 2017 MapR Technologies ML Discovery Model Building Model Training/ Building Training Set Test Model Predictions Test Set Evaluate Results Historical Data Deployed Model Predictions Data Discovery, Model Creation Production Feature Extraction Feature Extraction New Data Customer Data Call Center Records Web Clickstream Server Logs ●  Churn Modelling
  26. 26. © 2017 MapR Technologies Telecom Customer Churn Data •  State: string •  Account length: integer •  Area code: integer •  International plan: string •  Voice mail plan: string •  Number vmail messages: integer •  Total day minutes: double •  Total day calls: integer •  Total day charge: double •  Total eve minutes: double •  Total eve calls: integer •  Total eve charge: double •  Total night minutes: double •  Total night calls: integer •  Total night charge: double •  Total intl minutes: double •  Total intl calls: integer •  Total intl charge: double •  Customer service calls: integer
  27. 27. © 2017 MapR Technologies Customer Churn Example •  What are we trying to predict? –  This is the Label: –  Did the customer churn? True or False •  What are the “if questions” or properties we can use to predict? –  These are the Features: –  Number of Customer service calls, Total day minutes …
  28. 28. © 2017 MapR Technologies Decision Trees •  Decision Tree for Classification prediction •  Represents tree with nodes •  IF THEN ELSE questions using features at each node •  Answers branch to child nodes If the number of customer service calls < 3 If the total day minutes > 200 Churned: T If the total day minutes < 200 Churned: F T Churned: T Churned: F F FF TT
  29. 29. © 2017 MapR Technologies Example Decision Tree
  30. 30. © 2017 MapR Technologies Spark ML workflow
  31. 31. © 2017 MapR Technologies Spark ML workflow with a Pipeline Pipeline Estimator Extract Features Load Data Train Model Estimator Data frame Transformer Cross Validate Pipeline Model TransformerTest Data frame Evaluate fit Train Load Data Evaluator Predict With model Extract Features Evaluator transform
  32. 32. © 2017 MapR Technologies Zeppelin Notebook with Spark Data Engineer Data Scientist
  33. 33. © 2017 MapR Technologies Load the data into a Dataframe: Define the Schema case class Account(state: String, len: Integer, acode: String, intlplan: String, vplan: String, numvmail: Double, tdmins: Double, tdcalls: Double, tdcharge: Double, temins: Double, tecalls: Double, techarge: Double, tnmins: Double, tncalls: Double, tncharge: Double, timins: Double, ticalls: Double, ticharge: Double, numcs: Double, churn: String) Input CSV File sample: KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
  34. 34. © 2017 MapR Technologies Data Frame Load data Load the data into a Dataset val train: Dataset[Account] = spark.read.option("inferSchema", "false") .schema(schema).csv("/user/user01/data/churn-bigml-80.csv").as[Account]
  35. 35. © 2017 MapR Technologies Dataset merged with Dataframe in Spark 2.0, DataFrame APIs merged with Datasets APIs
  36. 36. © 2017 MapR Technologies Extract the Features Image reference O’Reilly Learning Spark + + ̶+ ̶ ̶ Feature Vectors and Label Model Featurization Training Model Evaluation Best Model Label: Churned=T Features: Number customer Service calls Number day minutes Training Data Label: Churned=F Features: Number customer Service calls Number day minutes + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶ + + ̶+ ̶ ̶
  37. 37. © 2017 MapR Technologies Data Frame Add column Use StringIndexer to map Strings to Numbers val ipindexer = new StringIndexer() .setInputCol("intlplan") .setOutputCol("iplanIndex”) Data Frame
  38. 38. © 2017 MapR Technologies Data Frame Add column Use StringIndexer to map churn True False to Numbers Val labelindexer = new StringIndexer() .setInputCol(”churn") .setOutputCol(”label”) Data Frame
  39. 39. © 2017 MapR Technologies Data Frame Load data Add column DataFrame + Features Use VectorAssembler to put features in vector column val featureCols = Array(”temins", "iplanIndex", "tdmins", "tdcalls”…) val assembler = new VectorAssembler() .setInputCols(featureCols) .setOutputCol("features")
  40. 40. © 2017 MapR Technologies Data Frame Load data transform Estimator val dTree = new DecisionTreeClassifier() .setLabelCol("label") .setFeaturesCol("features") Create DecisionTree Estimator, Set Label and Features DataFrame + Features
  41. 41. © 2017 MapR Technologies val pipeline = new Pipeline() .setStages(Array(ipindexer, labelindexer, assembler, dTree)) Put Feature Transformers and Estimator in Pipeline Pipeline ipIndexer feature transform assembler Dtree estimatorlabelindexer feature transform assemble Features Produce model
  42. 42. © 2017 MapR Technologies Spark ML workflow with a Pipeline Pipeline Transfomers Load Data estimator Train model Data frame Extract Features evaluator Pipeline Model Test Data frame evaluator Use fitted model Train Load Data fit transform
  43. 43. © 2017 MapR Technologies K-fold Cross-Validation Process Data Model Training/ Building Training Set Test Model Predictions Test Set data is randomly split into K partition training and test dataset pairs
  44. 44. © 2017 MapR Technologies K-fold Cross-Validation Process Data Model Training Training Set Test Model Predictions Test Set Train algorithm with training dataset
  45. 45. © 2017 MapR Technologies ML Cross-Validation Process Data Model Training Set Test Model Predictions Test Set Evaluate the model with the Test Set
  46. 46. © 2017 MapR Technologies K-fold Cross-Validation Process Data Model Training/ Building Training Set Test Model Predictions Test Set Train/Test loop K times Repeat K times select the Model produced by the best-performing set of parameters
  47. 47. © 2017 MapR Technologies Cross Validation transformation estimation pipeline Pipeline Cross Validator evaluatorParameter Grid fit Set up a CrossValidator with: •  Parameter grid •  Estimator (pipeline) •  Evaluator Perform grid search based model selection
  48. 48. © 2017 MapR Technologies Parameter Tuning with CrossValidator with a Paramgrid CrossValidator •  Given: –  Estimator –  Parameter grid –  Evaluator •  Find best parameters and model val paramGrid = new ParamGridBuilder() .addGrid(dTree.maxDepth, Array(2,3,4,5,6,7)).build() val evaluator= new BinaryClassificationEvaluator() .setLabelCol("label") .setRawPredictionCol("prediction") val crossval = new CrossValidator() .setEstimator(pipeline) .setEvaluator(evaluator) .setEstimatorParamMaps(paramGrid) .setNumFolds(3)
  49. 49. © 2017 MapR Technologies val cvModel = crossval.fit(ntrain) Cross Validator fit a model to the data Pipeline Cross Validator evaluatorParameter Grid fit Pipeline Model fit a model to the data with provided parameter grid
  50. 50. © 2017 MapR Technologies Evaluate the fitted model Pipeline Transfomers Load Data estimator Train model Data frame Extract Features evaluator Pipeline Model Test Data frame evaluator transform Train Load Data Predict With model Extract Features fit
  51. 51. © 2017 MapR Technologies fitted model Evaluate the Predictions from DecisionTree Estimator Evaluator transform Test features val predictions = cvModel.transform(test) val accuracy = evaluator.evaluate(predictions) evaluate prediction accuracy
  52. 52. © 2017 MapR Technologies Area under the ROC curve Accuracy is measured by the area under the ROC curve. The area measures correct classifications •  An area of 1 represents a perfect test •  an area of .5 represents a worthless test
  53. 53. © 2017 MapR Technologies To Learn More: •  Read about and download example code •  https://mapr.com/blog/churn-prediction-sparkml/
  54. 54. © 2017 MapR Technologies To Learn More: •  End to End Application for Monitoring Uber Data using Spark ML •  https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine- learning-streaming-and-kafka-api-part-1/
  55. 55. © 2017 MapR Technologies To Learn More: •  MapR Free ODT http://learn.mapr.com/
  56. 56. © 2017 MapR Technologies For Q&A : •  https://community.mapr.com/ •  https://community.mapr.com/community/answers/pages/qa
  57. 57. © 2017 MapR Technologies Open Source Engines & Tools Commercial Engines & Applications Enterprise-Grade Platform Services DataProcessing Web-Scale Storage MapR-XD MapR-DB Search and Others Real Time Unified Security Multi-tenancy Disaster Recovery Global NamespaceHigh Availability MapR Streams Cloud and Managed Services Search and Others UnifiedManagementandMonitoring Search and Others Event StreamingDatabase Custom Apps MapR Converged Data Platform HDFS API POSIX, NFS Kakfa APIHBase API OJAI API
  58. 58. © 2017 MapR Technologies Q&A ENGAGE WITH US

×