Predictive Churn with H2O
Agenda
• Who Am I?
• IBM Data Science Experience
  • DSX Features Tour
• Predictive Churn
  • What is Predictive Churn (PC)?
  • The Lift as the main performance measure
  • Telco Dataset Example
• H2O Sparkling Water
  • What is H2O?
  • What is Apache Spark?
  • What is H2O Sparkling Water?
  • H2O REST API
• Building a Predictive Churn Model
  • Modelling with Spark Scala
  • Modelling with H2O Sparkling Water using R
  • Modelling with H2O Flow
• Deploying an H2O Model
  • Using Play Scala to build a REST endpoint
  • Deploying as a Docker Container in Google Cloud
  • Deployment with STEAM
Who Am I?
• Ndjido Ardo BAR : Data Scientist @ Davidson Consulting UAE
• Background : Research In Mathematics
• Now: Working in @DU Telecom as a Consultant
• Past:
• Worked @AXA (Paris)
• Worked @BearingPoint Hypercube
• Co-Founder of a StartUp (MLouma)
• Worked @Pasteur Institute: Involved in BioStatistical Research
@ndjido
IBM Data Science Experience
Better organise your Data Science projects
Collaborate with your team members
Learn from the community
https://datascience.ibm.com/
Predictive Churn
What is it?
Predictive churn is a set of methods for identifying the customers
most likely to stop using a given service, and so for forecasting
its churn rate. It is used mainly by marketing practitioners for
customer retention.
There are two main approaches:
1. Classification-based approach
2. Survival-analysis-based approach
Performance Measures
Confusion matrix (model prediction vs. actual outcome):

                     Actual: churn    Actual: no churn
Model: churn         TP               FP
Model: no churn      FN               TN

RECALL = TP / (TP + FN)
PRECISION = TP / (TP + FP)
LIFT = Precision / Overall churn rate = Recall / % Targeted Customers
Lift: the ratio between the precision among the targeted customers and
the overall churn rate (the support of the churn class). For instance,
a lift of N on the top 20% of the scored population means that targeting
those customers reaches N times as many churners as randomly picking
20% of the population.
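As a rough illustration of these measures, the sketch below computes recall, precision and lift on a made-up set of ten customers. The data, the 0.5 score threshold and the function names are assumptions for the example only; lift is computed here as the precision among the targeted customers divided by the overall churn rate, which matches the "N times more than a random pick" reading.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = churn, 0 = no churn)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def lift_at(actual, scores, fraction):
    """Lift when targeting the top `fraction` of customers ranked by score."""
    n = len(actual)
    k = max(1, round(n * fraction))
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    targeted = [label for _, label in ranked[:k]]
    precision_at_k = sum(targeted) / k   # churn rate among targeted customers
    base_rate = sum(actual) / n          # churn rate in the whole population
    return precision_at_k / base_rate


# Hypothetical toy data: 10 customers, 3 of whom actually churned.
actual = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.1, 0.8, 0.3, 0.4, 0.05, 0.35, 0.15, 0.25]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold

tp, fp, fn, tn = confusion_counts(actual, predicted)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
lift_top_20 = lift_at(actual, scores, 0.20)

print(f"recall={recall:.3f} precision={precision:.3f} lift@20%={lift_top_20:.3f}")
```

On this toy data the two highest-scored customers are both churners, so the top 20% has a churn rate of 100% against 30% overall, which is where the lift comes from.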
[Figure: ROC curve]
Our Dataset Example
Telco Churn Data Description: 21 Variables (Customer, Plan, Behaviour, other)
H2O Sparkling Water
What is H2O?
H2O is an open-source, fast, scalable machine learning platform
with deep learning capabilities. It is production-ready.
Key capabilities: Cloud Integration, Big Data Ecosystem, Open Source, Flexible Interface, Scalability and Performance, GPU Enablement, Rapid Model Deployment, Smart and Fast Algorithms, H2O Flow
• 100% open source
• Highly portable models deployed in Java (POJO) and Model Object Optimized (MOJO)
• Automated and streamlined scoring service deployment with REST API
• Distributed in-memory computing platform
• Distributed algorithms
• Fine-grain MapReduce
(source: H2O.ai)
High Level Architecture
• Data sources: HDFS, S3, NFS, Local, SQL
• Load Data: distributed, in-memory, with loss-less compression
• H2O Compute Engine: Exploratory & Descriptive Analysis; Feature Engineering & Selection; Supervised & Unsupervised Modeling; Model Evaluation & Selection; Predict; Data & Model Storage
• Production Scoring Environment: Model Export (Plain Old Java Object); Data Prep Export (Plain Old Java Object); Your Imagination
(source: H2O.ai)
Algorithms Overview
Supervised Learning
• Statistical Analysis
  • Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
  • Naïve Bayes
• Ensembles
  • Distributed Random Forest: classification or regression models
  • Gradient Boosting Machine: produces an ensemble of decision trees with increasingly refined approximations
• Deep Neural Networks
  • Deep learning: creates multi-layer feed-forward neural networks, starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
• Clustering
  • K-means: partitions observations into k clusters/groups of the same spatial size; can automatically detect the optimal k
• Dimensionality Reduction
  • Principal Component Analysis: linearly transforms correlated variables to independent components
  • Generalized Low Rank Models: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Anomaly Detection
  • Autoencoders: find outliers using nonlinear dimensionality reduction with deep learning
(source: H2O.ai)
What is Spark?
“Apache Spark is a fast and general engine for large-scale data processing.”
What is Sparkling Water?
H2O Sparkling Water is:
• a transparent integration of H2O with Apache Spark
• transparent use of H2O data structures and algorithms with Spark
• an extension of Spark with more sophisticated Machine Learning algorithms
Spark + H2O bring together:
• Powerful data preparation features
• NLP algorithms
• Scikit-Learn-like ML pipelines
• Advanced ML algorithms
• Powerful data compression
• Graphical UI (Flow)
• Model export as POJO
How does it work with Spark?
[Figure: how Sparkling Water works with Spark (source: H2O.ai)]
H2O REST API
Working with data (1/3)
Reading Data into H2O with Python/R/Flow
STEP 1: call the H2O import function
h2o_df = h2o.importFile("path/to/dataset.csv")
(source: H2O.ai)
Working with data (2/3)
Reading Data from HDFS into H2O with Python/R/Flow
STEP 2
2.1 The client makes a function call to the H2O import function.
2.2 An HTTP REST API request carrying the HDFS path is sent to H2O.
2.3 The H2O cluster initiates a distributed ingest.
2.4 The H2O nodes request the data (data.csv) from HDFS.
(source: H2O.ai)
Working with data (3/3)
Reading Data from HDFS into H2O with Python/R/Flow
STEP 3
3.1 HDFS provides the data (data.csv) to the H2O cluster.
3.2 The data is held as a distributed H2O Frame in the DKV.
3.3 A pointer to the data is returned in the REST API JSON response.
3.4 The client console holds an H2O Frame object: cluster IP, cluster port, and pointer to the data.
(source: H2O.ai)
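The request and the returned pointer can be sketched in a few lines of Python. This is only an illustration, not H2O's client code: the `/3/ImportFiles` route reflects my understanding of H2O's version 3 REST API, 54321 is H2O's usual default port, and the host, the HDFS path, the `data.hex` key and the names `FrameHandle` and `import_request_url` are placeholders invented for the example. No request is actually sent.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class FrameHandle:
    """Client-side pointer to a frame that lives in the cluster (step 3.4)."""
    cluster_ip: str
    cluster_port: int
    frame_key: str  # key of the distributed frame in the DKV


def import_request_url(cluster_ip: str, cluster_port: int, data_path: str) -> str:
    """Build the REST request that carries the data path to H2O (step 2.2)."""
    query = urlencode({"path": data_path})
    # Assumed route: the ImportFiles endpoint of the v3 REST API.
    return f"http://{cluster_ip}:{cluster_port}/3/ImportFiles?{query}"


# Placeholder cluster address and HDFS path.
url = import_request_url("localhost", 54321, "hdfs://namenode/data/data.csv")

# The client never holds the data itself, only where to find it.
handle = FrameHandle("localhost", 54321, "data.hex")

print(url)
print(handle)
```

The point of the design is that the data stays distributed in the cluster; the client only keeps the three values needed to address it.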
Building a Predictive Churn Model
Hands-on materials available at: https://github.com/ndjido/Predictive-Churn-Modeling-with-H2O/
Modelling Pipeline: 3 Approaches
1. Spark with Scala
2. H2O Sparkling Water with R
3. H2O Flow only
Hands-on
Deploying a Predictive Churn Model
Deployment Pipeline
[Diagram stages: Model Building, export POJO, Model POJO, REST API, Containerised App; the diagram also carries the label "local"]
Deployment Pipeline
POJO Integration in your Play App
H2O GenModel added to your Play App
DEMO TIME
Deployment with H2O Steam
“The Steam AI engine is an end-to-end platform that streamlines the entire process of building
and deploying smart applications. Now data scientists and developers can launch turnkey
compute environments for collaboratively training and deploying predictive models and integrate
those models into real-time smart applications”
Demo
Thank You!
Questions?
@ndjido
