Customer Analytics
Maximizing Revenue Through Deep Customer Knowledge
1
KARL MELO – Strategy and Analytics Consultant (2016)
2
Contents
1. Purpose of the document
2. Key Challenges
3. Proposed Solution
4. Data Analytics Case Study
5. Conclusion & Recommendations
6. Appendices
3
THE KEY OBJECTIVE
This document proposes advanced data analytics as the key solution for building intimate knowledge about our
customers’ behaviour, preferences and aspirations; an essential requirement for maximizing revenue in our current
competitive environment.
Note that some information is withheld/modified to protect the company’s anonymity.
Purpose of this document
THE CASE STUDY
A case study uses data from our beauty and personal care subsidiary to practically demonstrate how this could be
achieved. The data are from online distributions in a single market.
We Operate Across Diverse Markets…
4
… In Diverse Industries
TRADE FINANCE
FOOD PROCESSING SPORTS MANAGEMENT
BEAUTY &
PERSONAL CARE
5
6
Competition
We face threats from increasingly efficient
domestic competitors with local market
knowledge and from global players keen to
stake positions in high potential emerging and
frontier markets.
Customer Expectations
Although not yet on a par with advanced
economies, customer expectations in our major
markets, particularly our target demographic –
urban, educated, professionals – have been
rising rapidly.
Key Challenges
Globalization & Social Media
Rapid smartphone penetration is a key
driver of the global harmonisation of
customer preferences and expectations.
7
Proposed Solution
MAXIMIZE REVENUE
DEPLOY ADVANCED DATA ANALYTICS
• Behaviour
• Preferences
• Aspirations
DEVELOP INTIMATE CUSTOMER KNOWLEDGE
• Product development
• Customer care
• Marketing and Communications
LEVERAGE KNOWLEDGE
• Collect/Manage Data
• Explore Data
• Apply Statistical/Machine Learning
• Generate Insights
• Better loyalty
• Increased spend
• Improved acquisition
8
Case Study
Company: XYZ Cosmetics subsidiary
Industry: Beauty & Personal Care
Key Markets: Africa, Turkey, Middle East
9
Case Study Objectives
1. Discuss business applications of 3 machine learning methods applied to real customer data
2. Demonstrate the practical value of advanced data analytics for the company
3. Motivate investments in analytics capabilities across all subsidiaries
10
[Chart: Flat Customer Growth – old vs. new customers, 2011–2015]
[Chart: Rising Revenues – annual revenue, 2006–2015]
Flat customer growth is mostly explained by increasing competition. Growing revenues despite weak customer growth were due to repeated price increases.
Further price rises will be hard to implement given intensifying competition
Business Context
Average Customer Growth (2011–2015): 7.0%
Average Revenue Growth (2011–2015): 12.0%
11
Price Volume Revenue
Price increases are no longer
feasible in the current environment
Sales volumes must increase to
drive revenue growth
• New customer acquisition
• Reduce churn
• Increase customer spend
Business Objective
MAXIMIZE REVENUE
DEPLOY ADVANCED DATA ANALYTICS
• Behaviour
• Preferences
• Aspirations
DEVELOP INTIMATE CUSTOMER KNOWLEDGE
• Customer service
• Product development
• Marketing and Communications
LEVERAGE INSIGHTS
12
• Collect/Manage Data
• Explore Data
• Apply Statistical/Machine Learning
• Generate Insights
• Drive new customer acquisition • Reduce churn • Increase customer spend
Advanced data analytics can unlock deep customer insights
Group data objects such that members of
each group are more similar to each other
than to those in any other group
Our objective is to identify managerially
relevant subgroups in our customer database
CLUSTER ANALYSIS
Assign data objects to defined categories
based on a given set of variables
Our objective is to predict specified outcomes
of categorical variables e.g. customer will
churn/will not churn
CLASSIFICATION
13
3 machine learning methods demonstrate the power of analytics
REGRESSION
Estimate relationships between a set of input
variables and an outcome variable
Our objective is to predict numerical outcomes
for individual customers e.g. customer spend
14
THE DATA
The dataset is made up of 18,417 unique cases of transaction-level customer data collected
in a single market between January 2nd 2010 and December 31st 2015.
VARIABLES: Customer ID, Purchase date, Purchase amount
MARKETING METRICS:
• Recency (time since last purchase – days)
• Frequency (number of purchases)
• Average amount spent (per transaction)
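The three marketing metrics are straightforward to derive from the raw transaction variables. A minimal sketch in Python (the actual analysis was done in R); the field layout and sample rows below are illustrative, not from the real dataset:

```python
from datetime import date
from collections import defaultdict

# (customer_id, purchase_date, purchase_amount) - the three raw variables
transactions = [
    (101, date(2015, 3, 1), 35.0),
    (101, date(2015, 11, 20), 45.0),
    (102, date(2014, 6, 5), 22.0),
]
snapshot = date(2015, 12, 31)  # end of the observation window

by_customer = defaultdict(list)
for cid, d, amount in transactions:
    by_customer[cid].append((d, amount))

metrics = {}
for cid, rows in by_customer.items():
    dates = [d for d, _ in rows]
    amounts = [a for _, a in rows]
    metrics[cid] = {
        "recency": (snapshot - max(dates)).days,    # days since last purchase
        "frequency": len(rows),                     # number of purchases
        "avg_amount": sum(amounts) / len(amounts),  # per-transaction average
    }
```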
Historical purchase behaviour
The distribution is severely right skewed, indicating that a large majority of customers do not return often.
• 50% of customers have only ever made 1 purchase
• 30% of the database was active in the last year
• $58 average annual spend
• Half of all customers spend between $22 and $50
• A significant proportion of customers seem to have lapsed
15
16
CLUSTER ANALYSIS
Clustering algorithms partition customers into subgroups
based on their similarity to one another
17
The tree illustrates the grouping process
The hierarchical clustering algorithm
merges the data into ever larger clusters as
we move up the tree.
The analyst ‘cuts’ the tree at the height that
represents the optimal number of clusters.
The tops of the red rectangles indicate the
height at which our tree was cut,
partitioning customers into 5 clusters.
Each customer is represented by
one of thousands of data points at
the bottom of the tree.
5 distinct clusters are identified.
Red rectangles delimit the clusters.
The smallest is hard to distinguish.
18
4 of the 5 clusters are stable - They identify ‘real’ patterns in the data, not random associations
Stability Evaluation
Highly Stable
Reasonably Stable
Unstable
Cluster likely made up of
unusual cases that do not fit
anywhere else
19
• Customers in cluster 5 clearly separate themselves.
• The remaining 4 clusters overlap with each other to varying
degrees, suggesting some customers could plausibly be assigned
to more than one cluster.
• Clusters 1 and 4 are relatively compact, suggesting
customers in those segments are very similar to each other.
• The 2nd, 3rd and 5th clusters are more dispersed, indicating
that customers in those subgroups exhibit a wider range of
behaviours.
Observations
Cluster Scheme
Customers are clustered into subgroups based on similarities in their behaviours
The first 2 principal components capture as much information about the
clustering scheme as is possible in 2 dimensions.
Customers in cluster 5 spend more per purchase than all others.
Cluster 3 customers are the most frequent purchasers.
Cluster 1 customers have the highest churn rates.
Cluster 4 has the lowest churn, or newly acquired customers
fall disproportionately into that cluster.
20
Initial insights observable from pairwise visualizations
21
cluster   count   recency   frequency   amount spent
1         5,890   2,607     1.5         34.70
2         3,445   678       4.9         93.87
3         1,295   232       10.9        51.74
4         7,733   649       1.5         44.63
5         54      1,249     2.1         2,304.69
Cluster averages describe the typical customer in each subgroup
We can obtain richer cluster profiles
by using statistical analysis techniques
to dive deeper into our clusters
Example:
By measuring the individual contribution of the
variables to the composition of a cluster, we can
better understand the relative importance of
each one in explaining the behaviour of
customers in that cluster.
It is possible to identify the centre of gravity of each
cluster – the most typical customer.
The position of each customer relative to this centre can
be computed automatically, providing a very granular
view of customer behaviour.
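Cluster averages like those reported for our 5 clusters reduce to a simple group-and-average computation over the cluster assignments. An illustrative Python sketch (the deck's analysis was done in R), with two invented customers whose averages echo cluster 1's reported recency, frequency and amount spent:

```python
from collections import defaultdict

# (cluster, recency, frequency, amount_spent) per customer - invented data
customers = [
    (1, 2600, 1, 30.0),
    (1, 2614, 2, 39.4),
    (2, 600, 5, 90.0),
    (2, 756, 4, 97.0),
]

groups = defaultdict(list)
for cluster, *features in customers:
    groups[cluster].append(features)

# one profile row per cluster: count plus per-variable means
profiles = {}
for cluster, rows in sorted(groups.items()):
    n = len(rows)
    profiles[cluster] = {
        "count": n,
        "recency": sum(r[0] for r in rows) / n,
        "frequency": sum(r[1] for r in rows) / n,
        "amount_spent": sum(r[2] for r in rows) / n,
    }
```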
22
CLUSTERING ALGORITHM
POSSIBLE APPLICATIONS
• Create relevant customer segments around profiles
• Offer multiple levels of customer care
• Improve cross-selling/up-selling
• Develop products targeted at segments
• Design targeted marketing communications strategies
• Find natural subgroups in data
• Create rich customer profiles based on statistical analyses of subgroups
23
Classification
Classification models can predict which customers will either switch or stay loyal.
24
Classification
Algorithms are trained to learn the structure of a dataset of previously classified customers. The
resulting knowledge is used to define models that estimate the probability of a customer falling
into a given class (switch/stay loyal).
Data (Recency, Frequency, Average amount spent) → Algorithm → Output (switch/stay)
25
Classifier Performance Target
BASELINE MODEL: predicts all customers will remain loyal
USEFUL MODEL:
• Outperform the baseline at minimum
• Meet managerial objectives determined by project sponsors
26
Classification Evaluation Metrics
ACCURACY – fraction of the time the model's predictions are correct
SENSITIVITY – fraction of times the model correctly predicts an item to be in a given class
SPECIFICITY – fraction of times the model correctly predicts an item is not in a given class
ROC CURVES – plot sensitivity as a function of the 'false alarm' rate (1 – specificity)
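All four metrics fall out of a confusion matrix of predicted vs. actual labels. A small Python sketch with illustrative labels (1 = churn, 0 = stay loyal; the deck's models were built in R):

```python
def confusion(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

actual    = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # illustrative test labels
predicted = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # illustrative model output

tp, tn, fp, fn = confusion(actual, predicted)
accuracy    = (tp + tn) / len(actual)  # fraction of correct predictions
sensitivity = tp / (tp + fn)           # correct among actual churners
specificity = tn / (tn + fp)           # correct among actual loyal customers
false_alarm = 1 - specificity          # x-axis of one ROC curve point
```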
27
• 5 classification models were implemented on the data.
• Each method adopts a different approach to modelling the prediction problem and may perform
variably depending on the characteristics of the data.
Train models on data
subset (2014 customers)
Select best model based on
evaluation metrics
Test selected model on unseen
data (2015 customers)
Basic modelling process
28
Most sensitive. Performance similar to other models. Least specific – in this context, sensitivity matters more.
• Appendix 2 briefly describes the Random Forest algorithm.
• See slide 26 for definitions of the evaluation metrics.
K-Nearest Neighbour
Random Forests
Support Vector Machines
Algorithms
The Random Forests (RF) model was selected as the best suited for our purposes.
Key question: which model is most often correct when it predicts a customer will switch (sensitivity)?
Random Forests performance
Overall accuracy: 58%
Accurate 'stay loyal' predictions: 68%
Accurate 'churn' predictions: 42%
Random Forests test performance approximates 'true' performance.
29
Model Evaluation
30
[Chart: Accuracy, Sensitivity and Specificity – Training vs. Test Performance]
Evaluate the selected model’s predictive performance by testing on unseen data – 2015 customers
As expected, performance deteriorates when the model is tested on new data.
Model Evaluation
31
[Chart: Accuracy, Sensitivity and Specificity – Test Performance vs. Baseline Model]
• Our best model is slightly less accurate than the simplistic baseline model. Normally this would mean the model offers no additional value.
• However, the baseline has 0% sensitivity because it predicts all customers will remain loyal, even though we know the company experienced a churn rate of 40% in 2014/15.
• Our model is more sensitive and specific than the baseline.
Model Evaluation
Evaluate added value by comparing test performance to a baseline model
32
Poor predictive accuracy could be due to deficiencies in the model, in the data, or both.
33
Possible model deficiencies
• Mismatches between the inductive biases of models and the data,
e.g. linear models would struggle to correctly predict outcomes in non-linearly related data.
• Poorly tuned model parameters.
• The presence of unusual values in the data, which will prejudice some models.
 Model issues were carefully addressed during the modelling process.
 Key performance metrics are broadly similar across the different types of models, suggesting
underperformance is most likely due to deficiencies in the data.
34
Possible data deficiencies
Data quality
Inaccurate or inconsistently recorded data can have dramatic effects on predictive models.
There is no reason to suspect the quality of this data.
Feature space
The models rely on a very narrow feature space to predict the customer decision to churn or
remain loyal. It is more than likely that our 4 input variables do not sufficiently explain the
variation in our outcome of interest. Implementing the model on a larger dataset with additional
informative variables should produce a more accurate classifier.
PREDICTIVE
MODELLING
(Classification)
• Build a classification model that estimates probability of switching vs. staying
loyal for each customer
• Predict which customers will switch (attrition)
• Group customers predicted to switch and apply exploratory data analysis to find
common characteristics
APPLICATION
• Target marketing actions
• Seek specific feedback on products and customer care
• Integrate consumer insights into future efforts
35
36
Regression
Regression models can predict our customers’ future spending.
37
Regression
The algorithms are trained to learn the structure of a dataset that includes known spending outcomes.
The resulting knowledge is used to define models that predict the spending level of our customers in future periods.
Data (Average amount spent, Frequency, Recency) → Algorithm → Output (predicted spend)
38
Regression Performance Target
BASELINE MODEL: predicts every customer will spend the same amount they did the previous year
USEFUL MODEL:
• Outperform the baseline at minimum
• Meet managerial objectives determined by project sponsors
39
Regression Evaluation Metrics
ROOT MEAN SQUARED ERROR (RMSE) – difference between predicted and actual values
R SQUARED – quantifies the extent to which the model inputs explain the predicted outcome
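Both metrics are short formulas. A Python sketch with illustrative actual and predicted spending values (the deck's models were built in R):

```python
import math

actual    = [50.0, 80.0, 20.0, 60.0]  # illustrative observed spend
predicted = [55.0, 75.0, 25.0, 65.0]  # illustrative model predictions

n = len(actual)

# RMSE: square-root of the mean squared prediction error
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# R squared: share of the outcome's variance explained by the model
mean_a = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
```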
40
Train models on data
subset (2014 customers)
Select best model based on
evaluation metrics
Test selected model on unseen
data (2015 customers)
Basic modelling process
• 4 regression models were implemented on the data.
• Each method adopts a different approach to modelling the prediction problem
and may perform variably depending on the characteristics of the data.
41
The Random Forests model clearly performs best on both evaluation metrics
The model produces the smallest error of all the models. Its
predictions are closest to the actual values.
The random forest model produces the largest R Squared value.
This model best explains the variance observed in customer
spending.
• See slide 39 for definitions of the evaluation metrics.
• Appendix 2 contains a simplified description of the Random Forest algorithm.
• See appendix 3 for a brief discussion of the resampling methods used to compute performance metrics.
[Chart: RMSE – Baseline Model: 234, Test Performance: 48]
Model Evaluation
42
[Chart: RMSE – Training Performance: 43, Test Performance: 48]
Our best model is substantially less error prone (more accurate)
than the baseline model, which predicts every customer will spend
in 2015 the average amount spent by all customers in 2014.
As expected performance deteriorates when the model is tested on
new data.
The deterioration is minimal, an indication that the model captures
the fundamental relationship between the variables reasonably well.
Evaluate the selected model’s predictive performance by testing
on unseen data – 2015 customers
Evaluate added value by comparing test performance to a
baseline model
43
Combine customer spending and churn predictions
for additional insights

customer score = customer spending prediction (from the regression model)
× (1 − estimated propensity to switch (from the classification model))

The customer score is an estimate of customer value.
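The deck does not spell out the scoring arithmetic explicitly, but the example scores in the tables that follow are consistent with multiplying predicted spend by the probability of staying, i.e. 1 minus the propensity to switch. A sketch under that assumption:

```python
def customer_score(predicted_spend, propensity_to_switch):
    """Estimate customer value: predicted spend weighted by retention odds.

    Assumed formula; it reproduces the example scores in the deck's tables.
    """
    return predicted_spend * (1 - propensity_to_switch)

# customer 215460 from the 'high potential' table: spend 3,012, switch 60.4%
score = customer_score(3012, 0.604)  # ~1193, matching the table
```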
44
Customer ID   Propensity to Switch   Spending Predictions $   Customer Score
215460        60.40%                 3,012                    1193
164930        60.40%                 2,936                    1163
246530        64.80%                 2,483                    874
High potential customers
Very high propensity to switch, very high predicted spending
average annual spend = $58
We could predict each customer’s behaviour and
estimate their value to the company
This knowledge can be used to significantly raise
future revenues
Action
Identify very high spenders most likely to switch
Offer high-touch personalized care, targeted marketing communications etc.
Study profiles in depth and use that knowledge to precisely identify and court
prospects with similar features – even at higher cost.
45
average annual spend = $58
We could predict each customer’s behaviour and
estimate their value to the company
This knowledge can be used to significantly raise
future revenues
Highly valuable customers
Very low propensity to switch, high predicted spending
Customer ID   Propensity to Switch   Spending Predictions $   Customer Score
262640        3.40%                  932                      900
107180        3.40%                  925                      893
234510        1.40%                  1,147                    1131
Action
Identify high spenders least likely to switch
Reward loyalty and encourage to act as brand cheerleaders
Study profiles in depth and use that knowledge to precisely identify and
court prospects with similar features – even at higher cost.
46
average annual spend = $58
Low value customers
Very high propensity to switch, very low predicted spending
Customer ID   Propensity to Switch   Spending Predictions $   Customer Score
61450         84.20%                 9.99                     1.58
63200         87.40%                 9.99                     1.26
190450        84.20%                 11.95                    1.89
We could predict each customer’s behaviour and
estimate their value to the company
This knowledge can be used to rationalize marketing
expense and product development
Action
Identify low value customers
Study profiles in depth and define common features
Reduce marketing actions aimed at that group
Review and adapt service
Develop and propose more relevant products
PREDICTIVE
MODELLING
(Regression)
• Build a regression model that predicts each customer's future spending
• Combine spending predictions with churn probabilities to score customers by estimated value
• Group customers by score and apply exploratory data analysis to find common characteristics
APPLICATION
• Target marketing actions
• Seek specific feedback on products and customer care
• Integrate consumer insights into future efforts
47
48
Conclusion
Advanced analytics represents
a major strategic opportunity.
Significant investments in analytics capabilities
undertaken within the framework of a comprehensive digital
strategy would place the company in a strong position to
maintain/gain competitive advantage in key markets
49
Data Collection & Management
Identify data that supports strategy
Install systems to creatively source diverse types of data
Manage to ensure data quality
Ensure data is available to all internal users in friendly
formats
Analytics Tools
Acquire easy-to-use, off-the-shelf BI tools
Acquire advanced database and analytics tools
Analytics Skills
Train current staff on BI tools
Recruit & retain skilled advanced analytics
practitioners
Contract out complex projects
Recommendations
Advanced analytics are most powerful when deployed in
combination with business experience, acumen and domain knowledge.
50
Promote a culture that encourages data driven
decision-making at all levels of responsibility
51
THANK YOU
Appendices
52
Appendix 1 - Tools and Source Code
Appendix 2 - Algorithms
Appendix 3 - Brief discussions of some technical concepts
Appendix 4 - Limitations and Challenges
53
Appendix 1
Tools & Source Code
1.1 Tools
All machine learning methods were implemented in the R statistical programming software environment.
All statistical graphics were created in the R statistical programming software environment.
The map on slide 2 was created with Tableau software.

1.2 Source Code
Code and documentation are contained in 3 dynamic documents published at http://rpubs.com/melokarl
Refer to:
• Customer Analytics I – Statistical segmentation
• Customer Analytics II – Predicting customer attrition
• Customer Analytics III – Predicting customer spending
54
Appendix 2
Algorithms
This appendix offers simplified descriptions of the algorithms implemented in the analysis portion of this project.

2.1 Hierarchical Clustering Algorithm for Cluster Analysis
There are several clustering algorithms available to analysts, including the popular K-Means. For various technical reasons we chose to implement the
hierarchical clustering algorithm.
The algorithm begins by treating each of the n observations as an individual cluster. The 2 most similar clusters are merged, leaving n − 1 clusters.
Similarity between observations is determined by a dissimilarity measure; there are several such measures. We used Euclidean distance.
The process is repeated with clusters progressively merged into ever larger subgroups until we have a single cluster – the entire dataset. Groups of
observations cannot be merged using the same types of dissimilarity measures as those used for individual observations at the beginning of the process.
The notion of linkage defines appropriate dissimilarity measures for fusing groups of observations. We used Ward.D2 linkage.
The cluster tree on slide 18 is a graphical illustration of this process.
It is for the analyst to determine the number of clusters that represents the optimal cluster scheme (the optimal number of subgroups to which the
observations may be assigned). Cluster validation techniques are available to lend some rigour to the process. These techniques apply statistical
methods to help determine the number of “real” clusters present in the data.
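The merge-until-one-cluster process, and the 'cut' that stops it at a chosen number of groups, can be sketched compactly. The analysis used Euclidean distance with Ward.D2 linkage in R; this illustrative Python sketch substitutes single linkage (smallest pairwise distance) for brevity, with invented 2-dimensional points:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    # cluster-to-cluster dissimilarity = smallest pairwise distance
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points, n_clusters):
    # start with each observation as its own cluster
    clusters = [[i] for i in range(len(points))]
    # repeatedly merge the two closest clusters; stopping at n_clusters
    # corresponds to 'cutting the tree' at the matching height
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters

# two well-separated invented groups of customers (e.g. recency, frequency)
pts = [(1, 1), (2, 1), (1, 2), (10, 10), (11, 10), (10, 11)]
groups = agglomerate(pts, 2)
```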
55
2.2 Random Forests Algorithm
The random forests algorithm is an ensemble method that uses decision trees to predict numerical outcomes (regression) or the probability of
a target belonging to a given category (classification).
Ensemble methods aggregate the output of several weak models to produce one strong prediction. The resulting ensemble model is
capable of accurately modelling non-linear relationships. Each model makes predictions based on a set of machine-generated rules that can
be summarised in a decision tree.
Random forests models are made up of numerous decision trees (forests). Each time a split is considered in one of the trees, the decision rule
is based on a predictor variable drawn from a random sample of all predictors. Using a small random selection of predictors ensures the trees
are uncorrelated. This improves the model's ability to perform well on new data.
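A toy illustration of the ensemble idea: each tree is trained on a bootstrap sample with a randomly drawn predictor, and the forest predicts by majority vote. Depth-1 trees (stumps) and single-feature selection are simplifications for brevity; the data are invented (churn = 1 for high recency, low frequency):

```python
import random
from collections import Counter

def train_stump(X, y, feature):
    """Fit a depth-1 decision tree on a single feature by brute force."""
    best = None  # (error, threshold, label for the <= side)
    for t in sorted(set(row[feature] for row in X)):
        for left_label in (0, 1):
            preds = [left_label if row[feature] <= t else 1 - left_label for row in X]
            err = sum(p != actual for p, actual in zip(preds, y))
            if best is None or err < best[0]:
                best = (err, t, left_label)
    _, t, left_label = best
    return (feature, t, left_label)

def predict_stump(stump, row):
    feature, t, left_label = stump
    return left_label if row[feature] <= t else 1 - left_label

def random_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(n_features)         # random predictor per tree
        stumps.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return stumps

def predict_forest(stumps, row):
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]               # majority vote

# invented data: churners (1) have high recency and low purchase frequency
X = [[400, 1], [500, 1], [600, 2], [700, 1], [10, 8], [20, 12], [30, 6], [40, 9]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = random_forest(X, y)
```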
56
Appendix 3
Brief discussions of some
technical concepts

3.1 Principal components
A principal component is a linear combination of the original variables; the components are constructed to be mutually uncorrelated.
The chart on slide 20 shows a visual representation of the first 2 principal components of our cluster scheme. We used principal components in
this context as a technique to compress our multidimensional data to 2 dimensions to facilitate visualization. This process will cause some
information loss. Nevertheless, the technique captures the maximum amount of information possible in the lower dimensional space.
Considering that our cluster scheme is 3 dimensional, the first 2 principal components capture the essence of the analysis – especially as some
of the lost information is noise.
57
3.2 Computation of performance metrics
The performance metrics for the classification and regression models are estimated by running the algorithms on multiple bootstrap versions of
the data. Each run generates a unique estimate. The estimates for the performance of each algorithm on a given metric can be summarized in
distributions as shown in the example boxplots below.
The dark dots inside the rectangles show the median point of the distribution. The width of the rectangles shows the dispersion of the data – the
middle 50% of observations (1st to 3rd quartile).
The wider the rectangle, the less certain the estimate of the performance measure.
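The bootstrap procedure described above can be sketched as follows; the per-case error values and the metric (mean error) are illustrative:

```python
import random
import statistics

errors = [2.0, 3.5, 1.0, 4.0, 2.5, 3.0, 5.0, 1.5]  # illustrative per-case errors

def bootstrap_metric(values, runs=1000, seed=42):
    """One metric estimate per bootstrap resample of the data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(runs):
        sample = [rng.choice(values) for _ in values]  # resample with replacement
        estimates.append(statistics.mean(sample))      # one estimate per run
    return estimates

ests = bootstrap_metric(errors)
median = statistics.median(ests)              # the boxplot's dark dot
qs = statistics.quantiles(ests, n=4)
q1, q3 = qs[0], qs[2]                         # the box spans q1..q3
```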
58
Appendix 4
Limitations and Challenges
DATA LIMITATIONS
Machine learning algorithms recognize patterns in data. Complex patterns are more reliably identified in large datasets.
The extremely narrow dataset (2 core features) will probably prejudice the accuracy of some of our predictive models.
Nevertheless, this dataset is a faithful representation of the data currently available at most of our subsidiaries.
Context: the French telecommunications company Orange provided 2 datasets to analytics practitioners participating in the 2009 KDD
Cup. The large set contained 15,000 variables and the reduced version 230. The prediction tasks were similar to those presented here.
59
60
There are important limitations to the deployment of statistical clustering methods in production environments:
• The algorithm requires frequent updating. Most clustering algorithms cannot be fully automated and
require the supervision of an analyst, a significant handicap in production settings.
• A particular cluster scheme may capture seasonal effects, making it inapplicable at different
periods of time.
• Most clustering methods are not very robust to disturbances to the data.
• The method captures patterns in a dataset at a specific moment in time. This poses some challenges:
  - Customers continuously enter and leave the database
  - New customers may have different characteristics from old ones
  - Existing customers' behaviour may evolve over time
61
Statistical segmentation nevertheless remains a powerful and useful approach to finding
subgroups in customer databases:
 These methods identify natural patterns in multivariate datasets.
 Most commonly used non-statistical alternatives rely heavily on judgement and past practice.
 Judgement-based approaches can be severely compromised by various types of bias, personal
agendas and cognitive limitations in processing complex data.
The clustering method demonstrated above can be used to obtain a faithful description of
customer subgroups at a particular point in time and could serve to formulate hypotheses, guide
further investigations and generally inform non-statistical managerial segmentation solutions.
N.B: There is much interest and on-going research in adaptive clustering algorithms capable of responding to changes in the state of the world.

Customer analytics

  • 1.
    Customer Analytics Maximizing RevenueThrough Deep Customer Knowledge 1 KARL MELO – Strategy and Analytics Consultant (2016)
  • 2.
    2 Purpose of thedocument Key Challenges Proposed Solution Appendices Data Analytics Case Study Conclusion & Recommendations Contents 2 3 5 4 6 1
  • 3.
    3 THE KEY OBJECTIVE Thisdocument proposes advanced data analytics as the key solution for building intimate knowledge about our customers’ behaviour, preferences and aspirations; an essential requirement for maximizing revenue in our current competitive environment. Note that some information is withheld/modified to protect the company’s anonymity. Purpose of this document THE CASE STUDY A case study uses data from our beauty and personal care subsidiary to practically demonstrate how this could be achieved. The data are from online distributions in a single market.
  • 4.
    We Operate AcrossDiverse Markets… 4
  • 5.
    … In DiverseIndustries TRADE FINANCE FOOD PROCESSING SPORTS MANAGEMENT BEAUTY & PERSONAL CARE 5
  • 6.
    6 Competition We face threatsfrom increasingly efficient domestic competitors with local market knowledge and from global players keen to stake positions in high potential emerging and frontier markets. Customer Expectations Although not yet on a par with advanced economies, customer expectations in our major markets, particularly our target demographic – urban, educated, professionals – have been rising rapidly. Key Challenges Globalization & Social Media Rapid smart phone penetration is a key driver of the global harmonisation of customer preferences and expectations.
  • 7.
    7 Proposed Solution MAXIMIZE REVENUE DEPLOYADVANCED DATA ANALYTICS • Behaviour • Preferences • Aspirations DEVELOP INTIMATE CUSTOMER KNOWLEDGE • Product development • Customer care • Marketing and Communications LEVERAGE KNOWLEDGE • Collect/Manage Data • Explore Data • Apply Statistical/Machine Learning • Generate Insights • Better loyalty • Increased spend • Improved acquisition
  • 8.
    8 Company Case Study XYZ Cosmeticssubsidiary Beauty & Personal CareIndustry Key Markets Africa, Turkey, Middle East
  • 9.
    9 Case Study Objectives Demonstratethe practical value of advanced data analytics for the company2 Discuss business applications of 3 machine learning methods applied to real customer data1 3 Motivate investments in analytics capabilities across all subsidiaries
  • 10.
    10 e 4123 2411 3032 3312 3689 2878 21551611 1709 2011 2012 2013 2014 2015 Flat Customer Growth Old Customers New Customers 0 100,000 200,000 300,000 400,000 500,000 600,000 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 Rising Revenues Flat customer growth mostly explained by increasing competition Growing revenues despite weak customer growth were due to repeated price increases Further price rises will be hard to implement given intensifying competition Business Context 7.0%Average Customer Growth 2011 - 2015 12.0%Average Revenue Growth 2011 - 2015
  • 11.
    11 Price Volume Revenue Priceincreases are no longer feasible in the current environment Sales volumes must increase to drive revenue growth • New customer acquisition • Reduce churn • Increase customer spend Business Objective
  • 12.
    MAXIMIZE REVENUE DEPLOY ADVANCEDDATA ANALYTICS • Behaviour • Preferences • Aspirations DEVELOP INTIMATE CUSTOMER KNOWLEDGE • Customer service • Product development • Marketing and Communications LEVERAGE INSIGHTS 12 • Collect/Manage Data • Explore Data • Apply Statistical/Machine Learning • Generate Insights • Drive new customer acquisition • Reduce churn • Increase customer spend Advanced data analytics can unlock deep customer insights
  • 13.
    3 Group data objectssuch that members of each group are more similar to each other than to those in any other group Our objective is to identify managerially relevant subgroups in our customer database CLUSTER ANALYSIS 80% SCHEDULING Assign data objects to defined categories based on a given set of variables Our objective is to predict specified outcomes of categorical variables e.g. customer will churn/will not churn CLASSIFICATION 13 3 machine learning methods demonstrate the power of analytics REGRESSION Estimate relationships between a set of input variables and an outcome variable Our objective is to predict numerical outcomes for individual customers e.g. customer spend
  • 14.
    14 The dataset ismade up of 18,417 unique cases of transaction level customer data collected in a single market between January 2nd 2010 and December 31st 2015 Customer I.D THE DATA Frequency (number of purchases) DESCRIPTION VARIABLES Purchase amountPurchase date MARKETING METRICS Recency (time since last purchase - days) Average amount spent (per transaction)
  • 15.
    The distribution isseverely right skewed indicating that a large majority of customers do not return often. 50% of customers have only ever made 1 purchase 30%of the database was active in the last year $58 Historical purchase behaviour Half of all customers spend between $22 and $50A significant proportion of customers seem to have lapsed 15 average annual spend
  • 16.
    16 CLUSTER ANALYSIS Clustering algorithmspartition customers into subgroups based on their similarity to one another
  • 17.
    17 The tree illustratesthe grouping process The hierarchical clustering algorithm merges the data into ever larger clusters as we move up the tree. The analyst ‘cuts’ the tree at the height that represents the optimal number of clusters. The top of the red rectangles indicate the height at which our tree was cut, partitioning customers into 5 clusters. Each customer is represented by one of thousands of data points at the bottom of the tree. 5 distinct clusters are identified. Red rectangles delimit the clusters. The smallest is hard to distinguish
  • 18.
    18 4 of the5 clusters are stable - They identify ‘real’ patterns in the data, not random associations Stability Evaluation Highly Stable Reasonably Stable Unstable Cluster likely made up of unusual cases that do not fit anywhere else
  • 19.
    19 • Customers incluster 5 clearly separate themselves. • The remaining 4 clusters overlap with each other to varying degrees , suggesting some customers could well be assigned to either cluster. • Clusters 1 and 4 are relatively compact, suggesting customers in those segments are very similar to each other. • The 2nd, 3rd and 5th clusters are more dispersed, indicating that customers in those subgroups exhibit a wider range of behaviours. Observations Cluster Scheme Customers are clustered into subgroups based on similarities in their behaviours The first 2 principal components capture as much information about the clustering scheme as is possible in 2 dimensions.
  • 20.
    20 Initial insights observable from pairwise visualizations:
    • Customers in cluster 5 spend more per purchase than all others.
    • Cluster 3 customers are the most frequent purchasers.
    • Cluster 1 customers have the highest churn rates.
    • Cluster 4 has the lowest churn, or newly acquired customers fall disproportionately into that cluster.
  • 21.
    21 Cluster averages describe the typical customer in each subgroup:

    cluster   counts   recency   frequency   amount spent
    1          5,890     2,607         1.5          34.70
    2          3,445       678         4.9          93.87
    3          1,295       232        10.9          51.74
    4          7,733       649         1.5          44.63
    5             54     1,249         2.1       2,304.69

    We can obtain richer cluster profiles by using statistical analysis techniques to dive deeper into our clusters. Example: by measuring the individual contribution of the variables to the composition of a cluster, we can better understand the relative importance of each one in explaining the behaviour of customers in that cluster. It is also possible to identify the centre of gravity of each cluster, the most typical customer; the position of each customer relative to this centre can be computed automatically, providing a very granular view of customer behaviour.
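A profile table like the one above is a per-cluster aggregation. A minimal pandas sketch, assuming a hypothetical customer table that already carries a cluster label per row (values are invented):

```python
import pandas as pd

# Hypothetical customer table with an assigned cluster label per row.
df = pd.DataFrame({
    "cluster":   [1, 1, 2, 2, 2],
    "recency":   [2600, 2614, 700, 650, 684],
    "frequency": [1, 2, 5, 5, 5],
    "amount":    [30.0, 39.4, 90.0, 95.0, 96.6],
})

# One row per cluster: subgroup size plus the mean of each behavioural
# variable, mirroring the profile table on this slide.
profile = df.groupby("cluster").agg(
    counts=("recency", "size"),
    recency=("recency", "mean"),
    frequency=("frequency", "mean"),
    amount=("amount", "mean"),
)
print(profile)
```

Swapping `"mean"` for `"median"` gives a profile that is more robust to the extreme spenders seen in cluster 5.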
  • 22.
    22 CLUSTERING ALGORITHM: finds natural subgroups in data; creates rich customer profiles based on statistical analyses of subgroups.
    POSSIBLE APPLICATIONS:
    • Create relevant customer segments around profiles
    • Offer multiple levels of customer care
    • Improve cross-selling/up-selling
    • Develop products targeted at segments
    • Design targeted marketing communications strategies
  • 23.
    23 Classification: classification models can predict which customers will switch and which will stay loyal.
  • 24.
    24 Classification: algorithms are trained to learn the structure of a dataset of previously classified customers. The resulting knowledge is used to define models that estimate the probability of a customer falling into a given class (switch/stay loyal). Inputs: recency, frequency, average amount spent. Output: switch/stay.
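The training setup can be sketched with scikit-learn, using k-nearest neighbours (one of the algorithm families tried in this study). All data values below are invented for illustration; the original models were built in R:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training set: one row per previously classified customer,
# columns = [recency_days, frequency, average_amount]. Values are invented.
X = np.array([[30, 5, 80], [400, 1, 20], [15, 8, 120], [500, 1, 15],
              [60, 4, 60], [700, 1, 25], [20, 6, 90], [450, 2, 30]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = switched, 0 = stayed loyal

# Fit the model, then estimate the probability that a new customer switches:
# for k-NN this is the share of the 3 most similar customers who switched.
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)
p_switch = model.predict_proba(np.array([[600, 1, 18]]))[0, 1]
print(p_switch)
```

In practice the features would be standardized first, since recency in days dwarfs the other two columns in raw Euclidean distance.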
  • 25.
    25 Classifier Performance Target. BASELINE MODEL: predicts all customers will remain loyal. USEFUL MODEL: outperforms the baseline at a minimum; meets managerial objectives determined by project sponsors.
  • 26.
    26 Classification Evaluation Metrics.
    • ACCURACY: the fraction of the time the model's predictions are correct.
    • SENSITIVITY: the fraction of items in a given class that the model correctly predicts to be in that class.
    • SPECIFICITY: the fraction of items not in a given class that the model correctly predicts to be outside that class.
    • ROC CURVES: plot sensitivity as a function of the 'false alarm' rate (1 – specificity).
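The three metrics reduce to simple ratios over confusion-matrix counts. A sketch with invented counts, chosen to echo the churn-model figures reported later in the deck:

```python
# Confusion-matrix counts for a binary churn classifier (invented numbers):
tp, fn = 42, 58   # actual churners: correctly flagged / missed
tn, fp = 68, 32   # actual loyal customers: correctly cleared / false alarms

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct / all predictions
sensitivity = tp / (tp + fn)                # churners caught / actual churners
specificity = tn / (tn + fp)                # loyal cleared / actual loyal
false_alarm_rate = 1 - specificity          # the x-axis of an ROC curve

print(accuracy, sensitivity, specificity, false_alarm_rate)
```

A baseline that predicts "everyone stays" would score tp = 0, so its sensitivity is 0 regardless of its accuracy, which is exactly the point made on the baseline-comparison slide.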
  • 27.
    27 Basic modelling process: train models on a data subset (2014 customers); select the best model based on evaluation metrics; test the selected model on unseen data (2015 customers).
    • 5 classification models were implemented on the data.
    • Each method adopts a different approach to modelling the prediction problem and may perform variably depending on the characteristics of the data.
  • 28.
    28 Random Forests performance. Key question: which model is most often correct when it predicts a customer will switch (sensitivity)? Algorithms compared: K-Nearest Neighbour, Random Forests, Support Vector Machines. The Random Forests (RF) model was selected as best suited for our purposes: it is the most sensitive and the least specific, with performance otherwise similar to the other models, and in this context sensitivity matters more.
    • See slide 26 for definitions of the evaluation metrics.
    • Appendix 2 briefly describes the Random Forest algorithm.
  • 29.
    29 Model Evaluation: Random Forest test performance approximates 'true' performance.
    • Overall accuracy: 58%
    • Accurate 'stay loyal' predictions: 68%
    • Accurate 'churn' predictions: 42%
  • 30.
    30 Model Evaluation: evaluate the selected model's predictive performance by testing on unseen data (2015 customers). [Chart: accuracy, sensitivity and specificity, training vs. test performance, 0–80% scale.] As expected, performance deteriorates when the model is tested on new data.
  • 31.
    31 Model Evaluation: evaluate added value by comparing test performance to a baseline model. [Chart: accuracy, sensitivity and specificity, test performance vs. baseline model, 0–80% scale.]
    • Our best model is slightly less accurate than the simplistic baseline model. Normally this would mean the model offers no additional value.
    • However, the baseline has 0% sensitivity because it predicts all customers will remain loyal, even though we know the company experienced a churn rate of 40% in 2014/15.
    • Our model is more sensitive and specific than the baseline.
  • 32.
    32 Poor predictive accuracy could be due to deficiencies in the model, in the data, or both.
  • 33.
    33 Possible model deficiencies:
    • Mismatches between the inductive biases of models and the data, e.g. linear models would struggle to correctly predict outcomes in non-linearly related data.
    • Poorly tuned model parameters.
    • The presence of unusual values in the data will prejudice some models.
    Model issues were carefully addressed during the modelling process. Key performance metrics are broadly similar across the different types of models, suggesting underperformance is most likely due to deficiencies in the data.
  • 34.
    34 Possible data deficiencies.
    Data quality: inaccurate or inconsistently recorded data can have dramatic effects on predictive models. There is no reason to suspect the quality of this data.
    Feature space: the models rely on a very narrow feature space to predict the customer decision to churn or remain loyal. It is more than likely that our 4 input variables do not sufficiently explain the variation in our outcome of interest. Implementing the model on a larger dataset with additional informative variables should produce a more accurate classifier.
  • 35.
    35 PREDICTIVE MODELLING (Classification)
    • Build a classification model that estimates the probability of switching vs. staying loyal for each customer
    • Predict which customers will switch (attrition)
    • Group customers predicted to switch and apply exploratory data analysis to find common characteristics
    APPLICATION
    • Target marketing actions
    • Seek specific feedback on products and customer care
    • Integrate consumer insights into future efforts
  • 36.
    36 Regression: regression models can predict our customers' future spending.
  • 37.
    37 Regression: the algorithms are trained to learn the structure of a dataset that includes known spending outcomes. The resulting knowledge is used to define models that predict the spending level of our customers in future periods. Inputs: recency, frequency, average amount spent. Output: predicted spend.
  • 38.
    38 Regression Performance Target. BASELINE MODEL: predicts every customer will spend the same amount they did the previous year. USEFUL MODEL: outperforms the baseline at a minimum; meets managerial objectives determined by project sponsors.
  • 39.
    39 Regression Evaluation Metrics.
    • ROOT MEAN SQUARED ERROR (RMSE): the square root of the average squared difference between predicted and actual values; a measure of the typical size of the prediction error.
    • R SQUARED: quantifies the extent to which the model inputs explain the variation observed in the outcome.
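Both metrics are a few lines of arithmetic. A self-contained sketch with invented actual and predicted spend values:

```python
import math

# Actual vs. predicted annual spend for five customers (invented values).
actual = [40.0, 55.0, 22.0, 90.0, 58.0]
predicted = [45.0, 50.0, 30.0, 80.0, 60.0]
n = len(actual)

# RMSE: square root of the mean squared error, in the same units ($) as spend.
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# R squared: 1 minus the ratio of unexplained variation to total variation
# around the mean of the actual values.
mean_actual = sum(actual) / n
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot

print(round(rmse, 2), round(r_squared, 3))
```

Because RMSE is in dollars, it can be read directly against the ~$58 average annual spend to judge whether an error is material.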
  • 40.
    40 Basic modelling process: train models on a data subset (2014 customers); select the best model based on evaluation metrics; test the selected model on unseen data (2015 customers).
    • 4 regression models were implemented on the data.
    • Each method adopts a different approach to modelling the prediction problem and may perform variably depending on the characteristics of the data.
  • 41.
    41 The Random Forests model clearly performs best on both evaluation metrics. It produces the smallest error of all the models; its predictions are closest to the actual values. It also produces the largest R squared value; it best explains the variance observed in customer spending.
    • See slide 39 for definitions of the evaluation metrics.
    • Appendix 2 contains a simplified description of the Random Forest algorithm.
    • See appendix 3 for a brief discussion of the resampling methods used to compute performance metrics.
  • 42.
    42 Model Evaluation (RMSE): training performance 43; test performance 48; baseline model test performance 234.
    • Evaluate the selected model's predictive performance by testing on unseen data (2015 customers). As expected, performance deteriorates when the model is tested on new data. The deterioration is minimal, an indication that the model captures the fundamental relationship between the variables reasonably well.
    • Evaluate added value by comparing test performance to a baseline model. Our best model is substantially less error prone (more accurate) than the baseline model, which predicts every customer will spend in 2015 the average amount spent by all customers in 2014.
  • 43.
    43 Combine customer spending and churn predictions for additional insights: the spending prediction from the regression model (an estimate of customer value) and the probability estimate of the propensity to switch from the classification model together yield a customer score.
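The deck does not state the scoring formula explicitly, but the figures in the tables that follow are consistent with score = predicted spend × (1 − propensity to switch), which is used here as an assumption:

```python
# Assumed scoring rule: weight each customer's predicted spend by the
# probability that they stay (1 - propensity to switch). This formula is
# an inference from the published tables, not stated in the deck.
def customer_score(predicted_spend, propensity_to_switch):
    return predicted_spend * (1 - propensity_to_switch)

# Example rows echoing the tables on the following slides:
print(round(customer_score(3012, 0.604)))  # high spender likely to switch
print(round(customer_score(932, 0.034)))   # loyal, high-value customer
```

Read this way, the score is an expected retained spend, which is why a $9.99 spender with an 84% switch propensity scores under $2.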
  • 44.
    44 High potential customers: very high propensity to switch, very high predicted spending (company average annual spend = $58).

    Customer ID   Propensity to Switch   Spending Prediction $   Customer Score
    215460                      60.40%                   3,012             1193
    164930                      60.40%                   2,936             1163
    246530                      64.80%                   2,483              874

    We could predict each customer's behaviour and estimate their value to the company; this knowledge can be used to significantly raise future revenues. Action: identify very high spenders most likely to switch; offer high-touch personalized care, targeted marketing communications etc.; study profiles in depth and use that knowledge to precisely identify and court prospects with similar features, even at higher cost.
  • 45.
    45 Highly valuable customers: very low propensity to switch, high predicted spending (company average annual spend = $58).

    Customer ID   Propensity to Switch   Spending Prediction $   Customer Score
    262640                       3.40%                     932              900
    107180                       3.40%                     925              893
    234510                       1.40%                   1,147             1131

    We could predict each customer's behaviour and estimate their value to the company; this knowledge can be used to significantly raise future revenues. Action: identify high spenders least likely to switch; reward loyalty and encourage them to act as brand cheerleaders; study profiles in depth and use that knowledge to precisely identify and court prospects with similar features, even at higher cost.
  • 46.
    46 Low value customers: very high propensity to switch, very low predicted spending (company average annual spend = $58).

    Customer ID   Propensity to Switch   Spending Prediction $   Customer Score
    61450                       84.20%                    9.99             1.58
    63200                       87.40%                    9.99             1.26
    190450                      84.20%                   11.95             1.89

    We could predict each customer's behaviour and estimate their value to the company; this knowledge can be used to rationalize marketing expense and product development. Action: identify low value customers; study profiles in depth and define common features; reduce marketing actions aimed at that group; review and adapt service; develop and propose more relevant products.
  • 47.
    47 PREDICTIVE MODELLING (Regression)
    • Build a regression model that predicts each customer's future spending
    • Combine spending predictions with churn probabilities to score and rank customers by estimated value
    • Group customers by predicted value and apply exploratory data analysis to find common characteristics
    APPLICATION
    • Target marketing actions and personalized care at high-value customers
    • Rationalize marketing expense on low-value segments
    • Integrate consumer insights into future efforts
  • 48.
    48 Conclusion: advanced analytics represents a major strategic opportunity. Significant investments in analytics capabilities, undertaken within the framework of a comprehensive digital strategy, would place the company in a strong position to maintain or gain competitive advantage in key markets.
  • 49.
    49 Recommendations.
    Data Collection & Management: identify data that supports strategy; install systems to creatively source diverse types of data; manage to ensure data quality; ensure data is available to all internal users in user-friendly formats.
    Analytics Tools: acquire easy-to-use, off-the-shelf BI tools; acquire advanced database and analytics tools.
    Analytics Skills: train current staff on BI tools; recruit and retain skilled advanced analytics practitioners; contract out complex projects.
    Advanced analytics are most powerful when deployed in combination with business experience, acumen and domain knowledge.
  • 50.
    50 Promote a culture that encourages data-driven decision-making at all levels of responsibility.
  • 51.
  • 52.
    52 Appendices
    Appendix 1 - Tools and Source Code
    Appendix 2 - Algorithms
    Appendix 3 - Brief discussions of some technical concepts
    Appendix 4 - Limitations and Challenges
  • 53.
    53 Appendix 1: Tools & Source Code

    1.1 Tools
    All machine learning methods and all statistical graphics were implemented in the R statistical programming software environment. The map on slide 2 was created with Tableau software.

    1.2 Source Code
    Code and documentation are contained in 3 dynamic documents published at http://rpubs.com/melokarl. Refer to:
    • Customer Analytics I – Statistical segmentation
    • Customer Analytics II – Predicting customer attrition
    • Customer Analytics III – Predicting customer spending
  • 54.
    54 Appendix 2: Algorithms. This appendix offers simplified descriptions of the algorithms implemented in the analysis portion of this project.

    2.1 Hierarchical Clustering Algorithm for Cluster Analysis
    There are several clustering algorithms available to analysts, including the popular K-Means. For various technical reasons we chose to implement the hierarchical clustering algorithm. The algorithm begins by treating each of the n observations as an individual cluster. The 2 most similar clusters are merged, leaving n - 1 clusters. Similarity between observations is determined by a dissimilarity measure; there are several such measures, and we used Euclidean distance. The process is repeated, with clusters progressively merged into ever larger subgroups, until we have a single cluster: the entire dataset. Groups of observations cannot be merged using the same types of dissimilarity measures as those used for individual observations at the beginning of the process; the notion of linkage defines appropriate dissimilarity measures for fusing groups of observations. We used Ward.D2 linkage. The cluster tree on slide 18 is a graphical illustration of this process. It is for the analyst to determine the number of clusters that represents the optimal cluster scheme (the optimal number of subgroups to which the observations may be assigned). Cluster validation techniques are available to lend some rigour to the process; these techniques apply statistical methods to help determine the number of "real" clusters present in the data.
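The procedure described above can be sketched with SciPy, whose `ward` linkage on raw observations corresponds to R's hclust with ward.D2. The toy customer data below are invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy customer data: two well-separated behavioural groups with columns
# (recency, frequency, amount); means and spreads are invented.
group_a = rng.normal([30, 6, 90], [5, 1, 10], size=(20, 3))
group_b = rng.normal([500, 1, 25], [50, 0.5, 5], size=(20, 3))
X = np.vstack([group_a, group_b])

# Agglomerative clustering: Euclidean dissimilarity with Ward linkage.
tree = linkage(X, method="ward", metric="euclidean")

# 'Cut' the tree into 2 clusters, assigning each customer a label,
# the step shown by the red rectangles on the dendrogram slide.
labels = fcluster(tree, t=2, criterion="maxclust")
print(set(map(int, labels)))
```

Calling `scipy.cluster.hierarchy.dendrogram(tree)` draws the cluster tree itself, the analogue of the figure on slide 18.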
  • 55.
    55 2.2 Random Forests Algorithm
    The random forests algorithm is an ensemble method that uses decision trees to predict numerical outcomes (regression) or the probability of a target belonging to a given category (classification). Ensemble methods aggregate the output of several weak models, in this case decision trees, to produce one strong prediction; the resulting ensemble model is capable of accurately modelling non-linear relationships. Each component model makes predictions based on a set of machine-generated rules that can be summarised in a decision tree. Random forests models are made up of numerous decision trees (the forest). Each time a split is considered in one of the trees, the decision rule is based on a predictor variable drawn from a random sample of all predictors. Using a small random selection of predictors keeps the trees decorrelated, which improves the model's ability to perform well on new data.
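The two ideas above, many trees plus a random predictor subset at each split, map directly onto scikit-learn's parameters. A sketch on an invented non-linear regression problem (an assumption; the study's models were fitted in R):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Toy non-linear problem: spend as a noisy function of 2 invented features.
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) * 10 + X[:, 1] ** 2 + rng.normal(0, 1, size=200)

# n_estimators: number of trees in the forest.
# max_features=1: each split considers a random subset (here a single one)
# of the predictors, which decorrelates the trees.
forest = RandomForestRegressor(n_estimators=200, max_features=1, random_state=0)
forest.fit(X, y)

# The ensemble prediction is the average over all trees, so it can follow
# the non-linear relationship that a single linear model would miss.
pred = forest.predict(np.array([[5.0, 5.0]]))[0]
print(pred)
```

Swapping in `RandomForestClassifier` with the same parameters gives the churn-probability variant used in the classification section.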
  • 56.
    56 Appendix 3: Brief discussions of some technical concepts

    3.1 Principal components
    The chart on slide 20 shows a visual representation of the first 2 principal components of our cluster scheme. A principal component is a linear combination of the original variables; the components are mutually uncorrelated. We used principal components in this context as a technique to compress our multidimensional data to 2 dimensions to facilitate visualization. This process causes some information loss; nevertheless, the technique captures the maximum amount of information possible in the lower dimensional space. Considering that our cluster scheme is 3 dimensional, the first 2 principal components capture the essence of the analysis, especially as some of the lost information is noise.
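The compression step can be sketched in NumPy: principal components fall out of the singular value decomposition of the centred data. The 3-variable dataset below is invented, with two strongly correlated columns so most of the information lives in 2 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented 3-variable customer data with two strongly correlated columns.
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + rng.normal(0, 0.1, size=100)
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize; X is now centred

# SVD of the centred data: the columns of Vt.T are the principal components
# (uncorrelated linear combinations of the original variables).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                      # customer coordinates on the components
explained = s ** 2 / (s ** 2).sum()   # share of variance per component

# The first 2 columns of `scores` give the best possible 2-D view of the
# data, as used for the cluster chart; the remainder is information loss.
print(explained[:2].sum())
```

With one column nearly redundant, the first two components carry almost all of the variance, mirroring the argument that little is lost in the 2-D cluster chart.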
  • 57.
    57 3.2 Computation of performance metrics
    The performance metrics for the classification and regression models are estimated by running the algorithms on multiple bootstrap versions of the data. Each run generates a unique estimate. The estimates for the performance of each algorithm on a given metric can be summarized in distributions, as shown in the example boxplots below. The dark dots inside the rectangles show the median point of the distribution. The width of each rectangle shows the dispersion of the data: the middle 50% of observations (1st to 3rd quartile). The wider the rectangle, the less certain the estimate of the performance measure.
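The bootstrap idea can be shown with the standard library alone. Here an accuracy metric is recomputed on many resamples of invented per-customer outcomes, producing the kind of distribution the boxplots summarize:

```python
import random
import statistics

random.seed(0)
# Per-customer prediction outcomes for a classifier (1 = correct prediction);
# the 58% rate is invented to echo the accuracy reported in the case study.
outcomes = [1] * 58 + [0] * 42

# Bootstrap: resample customers with replacement many times and recompute
# the metric on each resample, producing a distribution of estimates.
estimates = []
for _ in range(1000):
    resample = random.choices(outcomes, k=len(outcomes))
    estimates.append(sum(resample) / len(resample))

# The spread of this distribution (drawn as boxplots in the appendix)
# conveys how certain the performance estimate is.
median = statistics.median(estimates)
q1, q2, q3 = statistics.quantiles(estimates, n=4)
print(median, q1, q3)
```

A larger customer sample would tighten the interquartile range, which is exactly how a narrower boxplot signals a more trustworthy metric.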
  • 58.
  • 59.
    59 DATA LIMITATIONS
    Machine learning algorithms recognize patterns in data, and complex patterns are more reliably identified in large datasets. The extremely narrow dataset (2 core features) will probably prejudice the accuracy of some of our predictive models. Nevertheless, this dataset is a faithful representation of the data currently available at most of our subsidiaries.
    Context: the French telecommunications company Orange provided 2 datasets to analytics practitioners participating in the 2009 KDD Cup data-mining competition. The large set contained 15,000 variables and the reduced version 230. The prediction tasks were similar to those presented here.
  • 60.
    60 There are important limitations to the deployment of statistical clustering methods in production environments.
    • Most clustering algorithms cannot be fully automated and require the supervision of an analyst, a significant handicap in production settings; the algorithm requires frequent updating.
    • A particular cluster scheme may capture seasonal effects, making it inapplicable at different periods of time.
    • Most clustering methods are not very robust to disturbances to the data.
    • The method captures patterns in a dataset at a specific moment in time. This poses some challenges: customers continuously enter and leave the database; new customers may have different characteristics from old ones; existing customers' behaviour may evolve over time.
  • 61.
    61 Statistical segmentation nevertheless remains a powerful and useful approach to finding subgroups in customer databases.
    • These methods identify natural patterns in multivariate datasets.
    • The most commonly used non-statistical alternatives rely heavily on judgement and past practice.
    • Judgement-based approaches can be severely compromised by various types of bias, personal agendas and cognitive limitations in processing complex data.
    The clustering method demonstrated above can be used to obtain a faithful description of customer subgroups at a particular point in time, and could serve to formulate hypotheses, guide further investigations and generally inform non-statistical managerial segmentation solutions.
    N.B.: There is much interest and ongoing research in adaptive clustering algorithms capable of responding to changes in the state of the world.