1) The document proposes using advanced data analytics to build knowledge of customer behavior, preferences, and aspirations in order to maximize revenue.
2) A case study uses data from an online beauty/personal care subsidiary to demonstrate how clustering, classification, and regression analyses can provide insights.
3) The analyses identify customer subgroups, predict which customers will churn, and forecast spending amounts. This knowledge can then be used to target marketing and improve customer retention and spending.
2. Contents
1. Purpose of the document
2. Key Challenges
3. Proposed Solution
4. Data Analytics Case Study
5. Conclusion & Recommendations
6. Appendices
3. Purpose of this document
THE KEY OBJECTIVE
This document proposes advanced data analytics as the key solution for building intimate knowledge of our
customers’ behaviour, preferences and aspirations: an essential requirement for maximizing revenue in our current
competitive environment.
THE CASE STUDY
A case study uses data from our beauty and personal care subsidiary to practically demonstrate how this could be
achieved. The data are from online distributions in a single market.
Note that some information is withheld or modified to protect the company’s anonymity.
5. … In Diverse Industries
TRADE FINANCE
FOOD PROCESSING
SPORTS MANAGEMENT
BEAUTY & PERSONAL CARE
6. Key Challenges
Competition
We face threats from increasingly efficient domestic competitors with local market knowledge, and from global
players keen to stake positions in high-potential emerging and frontier markets.
Customer Expectations
Although not yet on a par with advanced economies, customer expectations in our major markets, particularly among
our target demographic – urban, educated professionals – have been rising rapidly.
Globalization & Social Media
Rapid smartphone penetration is a key driver of the global harmonisation of customer preferences and expectations.
7. Proposed Solution
DEPLOY ADVANCED DATA ANALYTICS
• Collect/Manage Data
• Explore Data
• Apply Statistical/Machine Learning
• Generate Insights
DEVELOP INTIMATE CUSTOMER KNOWLEDGE
• Behaviour
• Preferences
• Aspirations
LEVERAGE KNOWLEDGE
• Product development
• Customer care
• Marketing and Communications
MAXIMIZE REVENUE
• Better loyalty
• Increased spend
• Improved acquisition
9. Case Study Objectives
1. Discuss business applications of 3 machine learning methods applied to real customer data
2. Demonstrate the practical value of advanced data analytics for the company
3. Motivate investments in analytics capabilities across all subsidiaries
10. Business Context
[Chart: Flat Customer Growth – old vs. new customers, 2011–2015]
[Chart: Rising Revenues – 2006–2015]
Average customer growth, 2011–2015: 7.0%
Average revenue growth, 2011–2015: 12.0%
Flat customer growth is mostly explained by increasing competition.
Growing revenues despite weak customer growth were due to repeated price increases.
Further price rises will be hard to implement given intensifying competition.
11. Business Objective
Price × Volume = Revenue
Price increases are no longer feasible in the current environment.
Sales volumes must therefore increase to drive revenue growth:
• New customer acquisition
• Reduce churn
• Increase customer spend
12. Advanced data analytics can unlock deep customer insights
DEPLOY ADVANCED DATA ANALYTICS
• Collect/Manage Data
• Explore Data
• Apply Statistical/Machine Learning
• Generate Insights
DEVELOP INTIMATE CUSTOMER KNOWLEDGE
• Behaviour
• Preferences
• Aspirations
LEVERAGE INSIGHTS
• Customer service
• Product development
• Marketing and Communications
MAXIMIZE REVENUE
• Drive new customer acquisition
• Reduce churn
• Increase customer spend
13. 3 machine learning methods demonstrate the power of analytics
CLUSTER ANALYSIS
Group data objects such that members of each group are more similar to each other than to those in any other group.
Our objective is to identify managerially relevant subgroups in our customer database.
CLASSIFICATION
Assign data objects to defined categories based on a given set of variables.
Our objective is to predict specified outcomes of categorical variables, e.g. customer will churn/will not churn.
REGRESSION
Estimate relationships between a set of input variables and an outcome variable.
Our objective is to predict numerical outcomes for individual customers, e.g. customer spend.
14. The Data
DESCRIPTION
The dataset is made up of 18,417 unique cases of transaction-level customer data collected
in a single market between January 2nd 2010 and December 31st 2015.
VARIABLES
• Customer ID
• Purchase date
• Purchase amount
MARKETING METRICS
• Recency (time since last purchase, in days)
• Frequency (number of purchases)
• Average amount spent (per transaction)
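The three marketing metrics can be derived directly from the raw transaction variables. A minimal Python sketch (the original analysis used R, per Appendix 1; the field names, dates and amounts below are invented for illustration):

```python
# Sketch: deriving recency, frequency and average spend per customer
# from transaction-level rows. Illustrative data, not the real dataset.
from datetime import date

transactions = [
    # (customer_id, purchase_date, amount)
    ("C001", date(2015, 3, 10), 40.0),
    ("C001", date(2015, 11, 2), 60.0),
    ("C002", date(2014, 7, 21), 25.0),
]

reference_date = date(2015, 12, 31)  # end of the observation window

metrics = {}
for cid, when, amount in transactions:
    m = metrics.setdefault(cid, {"last": when, "frequency": 0, "total": 0.0})
    m["last"] = max(m["last"], when)   # most recent purchase date
    m["frequency"] += 1                # number of purchases
    m["total"] += amount

for cid, m in metrics.items():
    m["recency"] = (reference_date - m["last"]).days   # days since last purchase
    m["avg_amount"] = m["total"] / m["frequency"]      # average per transaction
```

The same aggregation is what turns 18,417 raw transactions into one row of metrics per customer.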
15. Historical purchase behaviour
The distribution is severely right-skewed, indicating that a large majority of customers do not return often.
50% of customers have only ever made 1 purchase.
30% of the database was active in the last year.
Average annual spend: $58.
Half of all customers spend between $22 and $50.
A significant proportion of customers seem to have lapsed.
17. The tree illustrates the grouping process
Each customer is represented by one of thousands of data points at the bottom of the tree.
The hierarchical clustering algorithm merges the data into ever larger clusters as we move up the tree.
The analyst ‘cuts’ the tree at the height that represents the optimal number of clusters.
The tops of the red rectangles indicate the height at which our tree was cut, partitioning customers into 5 clusters.
5 distinct clusters are identified. Red rectangles delimit the clusters. The smallest is hard to distinguish.
18. Stability Evaluation
4 of the 5 clusters are stable – they identify ‘real’ patterns in the data, not random associations.
• Highly stable
• Reasonably stable
• Unstable – cluster likely made up of unusual cases that do not fit anywhere else
19. Cluster Scheme
Customers are clustered into subgroups based on similarities in their behaviours.
The first 2 principal components capture as much information about the clustering scheme as is possible in 2 dimensions.
Observations
• Customers in cluster 5 clearly separate themselves.
• The remaining 4 clusters overlap with each other to varying degrees, suggesting some customers could well be assigned to more than one cluster.
• Clusters 1 and 4 are relatively compact, suggesting customers in those segments are very similar to each other.
• The 2nd, 3rd and 5th clusters are more dispersed, indicating that customers in those subgroups exhibit a wider range of behaviours.
20. Initial insights observable from pairwise visualizations
• Customers in cluster 5 spend more per purchase than all others.
• Cluster 3 customers are the most frequent purchasers.
• Cluster 1 customers have the highest churn rates.
• Cluster 4 customers have the lowest churn rates, or newly acquired customers fall disproportionately into that cluster.
21. Cluster averages describe the typical customer in each subgroup

Cluster   Count   Recency   Frequency   Amount spent ($)
1         5,890   2,607     1.5         34.70
2         3,445   678       4.9         93.87
3         1,295   232       10.9        51.74
4         7,733   649       1.5         44.63
5         54      1,249     2.1         2,304.69

We can obtain richer cluster profiles by using statistical analysis techniques to dive deeper into our clusters.
Example:
By measuring the individual contribution of the variables to the composition of a cluster, we can better understand the relative importance of each one in explaining the behaviour of customers in that cluster.
It is possible to identify the centre of gravity of each cluster – the most typical customer. The position of each customer relative to this centre can be computed automatically, providing a very granular view of customer behaviour.
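Cluster averages like those in the table are simple per-group means over the assigned customers. A minimal pure-Python sketch (the values are invented for illustration, not the real data):

```python
# Sketch: computing per-cluster averages of the three metrics.
# Illustrative records only.
customers = [
    {"cluster": 1, "recency": 2500, "frequency": 1, "amount": 30.0},
    {"cluster": 1, "recency": 2700, "frequency": 2, "amount": 40.0},
    {"cluster": 2, "recency": 700, "frequency": 5, "amount": 90.0},
]

profiles = {}
for c in customers:
    p = profiles.setdefault(c["cluster"], {"n": 0, "recency": 0, "frequency": 0, "amount": 0.0})
    p["n"] += 1
    for k in ("recency", "frequency", "amount"):
        p[k] += c[k]

for p in profiles.values():
    for k in ("recency", "frequency", "amount"):
        p[k] /= p["n"]   # turn sums into means
```

In practice this is a one-line group-by operation in R or pandas; the loop just makes the arithmetic explicit.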
22. CLUSTERING ALGORITHM
• Find natural subgroups in data
• Create rich customer profiles based on statistical analyses of subgroups
POSSIBLE APPLICATIONS
• Create relevant customer segments around profiles
• Offer multiple levels of customer care
• Improve cross-selling/up-selling
• Develop products targeted at segments
• Design targeted marketing communications strategies
24. Classification
Algorithms are trained to learn the structure of a dataset of previously classified customers. The
resulting knowledge is used to define models that estimate the probability of a customer falling
into a given class (switch/stay loyal).
Data (Recency, Frequency, Average amount spent) → Algorithm → Output (switch/stay)
25. Classifier Performance Target
BASELINE MODEL: predicts all customers will remain loyal.
USEFUL MODEL:
• Outperforms the baseline at minimum
• Meets managerial objectives determined by project sponsors
26. Classification Evaluation Metrics
ACCURACY: fraction of the time the model’s predictions are correct.
SENSITIVITY: fraction of times the model correctly predicts an item to be in a given class.
SPECIFICITY: fraction of times the model correctly predicts an item is not in a given class.
ROC CURVES: plot sensitivity as a function of the ‘false alarm’ rate (1 – specificity).
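These definitions reduce to simple functions of a binary confusion matrix, treating ‘churn’ as the positive class. A Python illustration (the analysis itself was done in R; the example labels are invented):

```python
# Sketch: classification metrics from a binary confusion matrix.
def confusion(actual, predicted, positive="churn"):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, tn, fp, fn):   # true positive rate
    return tp / (tp + fn)

def specificity(tp, tn, fp, fn):   # true negative rate
    return tn / (tn + fp)

# A ROC curve plots sensitivity against (1 - specificity) as the
# classification threshold varies.
actual    = ["churn", "churn", "loyal", "loyal", "churn"]
predicted = ["churn", "loyal", "loyal", "churn", "churn"]
tp, tn, fp, fn = confusion(actual, predicted)
```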
27. Basic modelling process
• 5 classification models were implemented on the data.
• Each method adopts a different approach to modelling the prediction problem and may perform variably depending on the characteristics of the data.
Train models on a data subset (2014 customers) → Select the best model based on evaluation metrics → Test the selected model on unseen data (2015 customers)
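The train-then-test discipline above can be sketched with a deliberately simple 1-nearest-neighbour classifier, a member of one of the method families tried (k-NN). All data values are invented for illustration:

```python
# Sketch: train on one year, evaluate on the next, as described above.
# A 1-nearest-neighbour classifier on (recency, frequency, avg_amount).
def predict_1nn(train, query):
    # train: list of ((recency, frequency, avg_amount), label) pairs
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: dist(item[0], query))[1]

train_2014 = [
    ((900, 1, 30.0), "churn"),
    ((30, 6, 80.0), "loyal"),
    ((700, 2, 40.0), "churn"),
]
test_2015 = [((25, 5, 75.0), "loyal"), ((800, 1, 35.0), "churn")]

correct = sum(predict_1nn(train_2014, x) == y for x, y in test_2015)
test_accuracy = correct / len(test_2015)
```

Real k-NN implementations scale the features first; the point here is only the separation of training data (2014) from evaluation data (2015).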
28. Random Forests performance
Key question: which model is most often correct when it predicts a customer will switch (sensitivity)?
Algorithms tried: K-Nearest Neighbour, Random Forests, Support Vector Machines.
The Random Forests (RF) model was selected as the best suited for our purposes: its overall performance is similar to the other models, but it is the most sensitive. It is the least specific, but in this context sensitivity matters more.
Notes: Appendix 2 briefly describes the Random Forest algorithm. See slide 26 for definitions of the evaluation metrics.
29. Model Evaluation
Random Forest test performance approximates ‘true’ performance.
• Overall accuracy: 58%
• Accurate ‘stay loyal’ predictions: 68%
• Accurate ‘churn’ predictions: 42%
31. Model Evaluation
Evaluate added value by comparing test performance to a baseline model.
[Chart: Accuracy, Sensitivity and Specificity – test performance vs. baseline model]
Our best model is slightly less accurate than the simplistic baseline model. Normally this would mean the model offers no additional value.
However, the baseline has 0% sensitivity because it predicts all customers will remain loyal, even though we know the company experienced a churn rate of 40% in 2014/15.
Our model is more sensitive and specific than the baseline.
33. Possible model deficiencies
• Mismatches between the inductive biases of models and the data, e.g. linear models would struggle to correctly predict outcomes in non-linearly related data.
• Poorly tuned model parameters.
• The presence of unusual values in the data, which will prejudice some models.
Model issues were carefully addressed during the modelling process.
Key performance metrics are broadly similar across the different types of models, suggesting underperformance is most likely due to deficiencies in the data.
34. Possible data deficiencies
Data quality
Inaccurate or inconsistently recorded data can have dramatic effects on predictive models.
There is no reason to suspect the quality of this data.
Feature space
The models rely on a very narrow feature space to predict the customer decision to churn or remain loyal.
It is more than likely that our 4 input variables do not sufficiently explain the variation in our outcome of interest.
Implementing the model on a larger dataset with additional informative variables should produce a more accurate classifier.
35. PREDICTIVE MODELLING (Classification)
• Build a classification model that estimates the probability of switching vs. staying loyal for each customer
• Predict which customers will switch (attrition)
• Group customers predicted to switch and apply exploratory data analysis to find common characteristics
APPLICATION
• Target marketing actions
• Seek specific feedback on products and customer care
• Integrate consumer insights into future efforts
37. Regression
The algorithms are trained to learn the structure of a dataset that includes known spending outcomes.
The resulting knowledge is used to define models that predict the spending level of our customers in future periods.
Data (Recency, Frequency, Average amount spent) → Algorithm → Output (predicted spend)
38. Regression Performance Target
BASELINE MODEL: predicts every customer will spend the same amount they did the previous year.
USEFUL MODEL:
• Outperforms the baseline at minimum
• Meets managerial objectives determined by project sponsors
39. Regression Evaluation Metrics
ROOT MEAN SQUARED ERROR (RMSE): the square root of the average squared difference between predicted and actual values.
R SQUARED: quantifies the extent to which the model inputs explain the variation in the outcome.
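Both metrics are short formulas. A Python sketch with illustrative numbers (the analysis itself used R):

```python
# Sketch: the two regression evaluation metrics.
import math

def rmse(actual, predicted):
    # square root of the mean squared prediction error
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    # 1 minus (residual sum of squares / total sum of squares)
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100.0, 50.0, 30.0]      # illustrative spend values
predicted = [90.0, 55.0, 35.0]
error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```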
40. Basic modelling process
• 4 regression models were implemented on the data.
• Each method adopts a different approach to modelling the prediction problem and may perform variably depending on the characteristics of the data.
Train models on a data subset (2014 customers) → Select the best model based on evaluation metrics → Test the selected model on unseen data (2015 customers)
41. The Random Forests model clearly performs best on both evaluation metrics
The model produces the smallest error of all the models. Its predictions are closest to the actual values.
The random forest model produces the largest R Squared value. This model best explains the variance observed in customer spending.
Notes: See slide 39 for definitions of the evaluation metrics. Appendix 2 contains a simplified description of the Random Forest algorithm. See appendix 3 for a brief discussion of the resampling methods used to compute performance metrics.
42. Model Evaluation
Evaluate the selected model’s predictive performance by testing on unseen data – 2015 customers.
Evaluate added value by comparing test performance to a baseline model.
RMSE – baseline model: 234; test performance: 48.
RMSE – training performance: 43; test performance: 48.
Our best model is substantially less error-prone (more accurate) than the baseline model that predicts every customer will spend in 2015 the average amount spent by all customers in 2014.
As expected, performance deteriorates when the model is tested on new data. The deterioration is minimal, an indication that the model captures the fundamental relationship between the variables reasonably well.
43. Combine customer spending and churn predictions for additional insights
The customer spending prediction from the regression model and the probability estimate of the propensity to switch from the classification model are combined into a customer score – an estimate of customer value.
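The exact scoring formula is not stated here, but the published scores on the following slides are consistent with predicted spend discounted by the probability of switching. A sketch of that assumed combination (the formula is an inference, not quoted from the source):

```python
# Sketch: one plausible way to combine the two model outputs.
# ASSUMPTION: score = predicted spend x probability of staying.
# This reproduces the published example scores but is inferred, not quoted.
def customer_score(predicted_spend, propensity_to_switch):
    return predicted_spend * (1 - propensity_to_switch)

# First rows of the example tables: $3,012 at 60.4%, $932 at 3.4%
score_high_potential = customer_score(3012, 0.604)
score_high_value = customer_score(932, 0.034)
```

Under this reading, the score is an expected-value-style figure: a big predicted spend is worth little if the customer is very likely to leave.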
44. High potential customers
Very high propensity to switch, very high predicted spending (average annual spend = $58)

Customer ID   Propensity to Switch   Spending Prediction ($)   Customer Score
215460        60.40%                 3,012                     1193
164930        60.40%                 2,936                     1163
246530        64.80%                 2,483                     874

We could predict each customer’s behaviour and estimate their value to the company.
This knowledge can be used to significantly raise future revenues.
Action
• Identify very high spenders most likely to switch
• Offer high-touch personalized care, targeted marketing communications etc.
• Study profiles in depth and use the knowledge to precisely identify and court prospects with similar features – even at higher cost
45. Highly valuable customers
Very low propensity to switch, high predicted spending (average annual spend = $58)

Customer ID   Propensity to Switch   Spending Prediction ($)   Customer Score
262640        3.40%                  932                       900
107180        3.40%                  925                       893
234510        1.40%                  1,147                     1131

We could predict each customer’s behaviour and estimate their value to the company.
This knowledge can be used to significantly raise future revenues.
Action
• Identify high spenders least likely to switch
• Reward loyalty and encourage them to act as brand cheerleaders
• Study profiles in depth and use the knowledge to precisely identify and court prospects with similar features – even at higher cost
46. Low value customers
Very high propensity to switch, very low predicted spending (average annual spend = $58)

Customer ID   Propensity to Switch   Spending Prediction ($)   Customer Score
61450         84.20%                 9.99                      1.58
63200         87.40%                 9.99                      1.26
190450        84.20%                 11.95                     1.89

We could predict each customer’s behaviour and estimate their value to the company.
This knowledge can be used to rationalize marketing expense and product development.
Action
• Identify low value customers
• Study profiles in depth and define common features
• Reduce marketing actions aimed at that group
• Review and adapt service
• Develop and propose more relevant products
47. PREDICTIVE MODELLING (Regression)
• Build a regression model that predicts the amount each customer will spend in a future period
• Combine spending predictions with churn predictions to score each customer’s value to the company
• Group customers by score and apply exploratory data analysis to find common characteristics
APPLICATION
• Target marketing actions
• Seek specific feedback on products and customer care
• Integrate consumer insights into future efforts
48. Conclusion
Advanced analytics represents a major strategic opportunity.
Significant investments in analytics capabilities, undertaken within the framework of a comprehensive digital strategy, would place the company in a strong position to maintain or gain competitive advantage in key markets.
49. Recommendations
Data Collection & Management
• Identify data that supports strategy
• Install systems to creatively source diverse types of data
• Manage to ensure data quality
• Ensure data is available to all internal users in friendly formats
Analytics Tools
• Acquire easy-to-use, off-the-shelf BI tools
• Acquire advanced database and analytics tools
Analytics Skills
• Train current staff on BI tools
• Recruit and retain skilled advanced analytics practitioners
• Contract out complex projects
Advanced analytics are most powerful when deployed in combination with business experience, acumen and domain knowledge.
50. Promote a culture that encourages data-driven decision-making at all levels of responsibility.
52. Appendices
Appendix 1 - Tools and Source Code
Appendix 2 - Algorithms
Appendix 3 - Brief discussions of some technical concepts
Appendix 4 - Limitations and Challenges
53. Appendix 1 - Tools & Source Code
1.1 Tools
All machine learning methods were implemented in the R statistical programming software environment.
All statistical graphics were created in the R statistical programming software environment.
The map on slide 2 was created with Tableau software.
1.2 Source Code
Code and documentation are contained in 3 dynamic documents published at http://rpubs.com/melokarl
Refer to:
• Customer Analytics I – Statistical segmentation
• Customer Analytics II – Predicting customer attrition
• Customer Analytics III – Predicting customer spending
54. Appendix 2 - Algorithms
This appendix offers simplified descriptions of the algorithms implemented in the analysis portion of this project.
2.1 Hierarchical Clustering Algorithm for Cluster Analysis
There are several clustering algorithms available to analysts, including the popular K-Means. For various technical reasons we chose to implement the
hierarchical clustering algorithm.
The algorithm begins by treating each of the n observations as an individual cluster. The 2 most similar clusters are merged so that we are left with n − 1 clusters.
Similarity between observations is determined by a dissimilarity measure. There are several such measures. We used Euclidean distance.
The process is repeated with clusters progressively merged into ever larger subgroups until we have a single cluster – the entire dataset. Groups of
observations cannot be merged using the same types of dissimilarity measures as those used for individual observations at the beginning of the process.
The notion of linkage defines appropriate dissimilarity measures for fusing groups of observations. We used Ward.D2 linkage.
The cluster tree on slide 18 is a graphical illustration of this process.
It is for the analyst to determine the number of clusters that represents the optimal cluster scheme (the optimal number of subgroups to which the
observations may be assigned). Cluster validation techniques are available to lend some rigour to the process. These techniques apply statistical
methods to help determine the number of “real” clusters present in the data.
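The merge loop described above can be sketched in a few lines of Python. For brevity this toy version uses single linkage (distance between the closest members of two clusters) rather than the Ward.D2 linkage used in the analysis, and Euclidean distance as in the analysis; the data points are invented:

```python
# Toy sketch of agglomerative clustering: start with singletons,
# repeatedly merge the closest pair of clusters, stop at n_clusters.
# Single linkage is used here for simplicity (the analysis used Ward.D2).
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]          # each observation starts alone
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-linkage distance between clusters i and j
                d = min(euclidean(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
result = agglomerate(points, 2)
```

Cutting the tree at a given height corresponds to stopping the merge loop at the matching number of clusters.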
55. 2.2 Random Forests Algorithm
The random forests algorithm is an ensemble method that uses decision trees to predict numerical outcomes (regression) or the probability of
a target belonging to a given category (classification).
Ensemble methods aggregate the output of several weak models to produce one strong prediction. The resulting ensemble model is
capable of accurately modelling non-linear relationships. Each constituent model makes predictions based on a set of machine-generated rules that can
be summarised in a decision tree.
Random forests models are made up of numerous decision trees (forests). Each time a split is considered in one of the trees, the decision rule
is based on a predictor variable drawn from a random sample of all predictors. Using a small random selection of predictors reduces the correlation
between the trees. This improves the model’s ability to perform well on new data.
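The two randomization ideas (bootstrap samples of the rows, random choice of predictor at each split) can be illustrated with a toy ‘forest’ of one-split stump trees. Real random forests grow much deeper trees and optimize the split threshold; everything below is illustrative only:

```python
# Toy sketch of random-forest style randomization with decision stumps.
import random

def train_stump(rows, rng):
    feature = rng.randrange(len(rows[0][0]))      # random predictor choice
    values = sorted(x[feature] for x, _ in rows)
    threshold = values[len(values) // 2]          # crude split point
    left = [y for x, y in rows if x[feature] < threshold]
    right = [y for x, y in rows if x[feature] >= threshold]
    majority = lambda ys: max(set(ys), key=ys.count) if ys else rows[0][1]
    return feature, threshold, majority(left), majority(right)

def train_forest(rows, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(rows) for _ in rows]   # bootstrap sample of rows
        forest.append(train_stump(boot, rng))
    return forest

def predict(forest, x):
    votes = [(l if x[f] < t else r) for f, t, l, r in forest]
    return max(set(votes), key=votes.count)       # majority vote of the trees

rows = [((900, 1), "churn"), ((850, 1), "churn"),
        ((20, 6), "loyal"), ((35, 8), "loyal")]
forest = train_forest(rows, n_trees=25)
```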
56. Appendix 3 - Brief discussions of some technical concepts
3.1 Principal components
A principal component is a linear combination of the original variables; the components themselves are mutually uncorrelated.
The chart on slide 20 shows a visual representation of the first 2 principal components of our cluster scheme. We used principal components in
this context as a technique to compress our multidimensional data to 2 dimensions to facilitate visualization. This process will cause some
information loss. Nevertheless, the technique captures the maximum amount of information possible in the lower dimensional space.
Considering that our cluster scheme is 3 dimensional, the first 2 principal components capture the essence of the analysis – especially as some
of the lost information is noise.
57. 3.2 Computation of performance metrics
The performance metrics for the classification and regression models are estimated by running the algorithms on multiple bootstrap versions of
the data. Each run generates a unique estimate. The estimates for the performance of each algorithm on a given metric can be summarized in
distributions, as shown in the example boxplots below.
The dark dots inside the rectangles show the median point of the distribution. The width of each rectangle shows the dispersion of the data – the
middle 50% of observations (1st to 3rd quartile).
The wider the rectangle, the less certain the estimate of the performance measure.
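The bootstrap procedure above amounts to: resample with replacement, recompute the metric, and summarize the resulting distribution by its median and middle 50%. A seeded Python sketch with made-up outcomes:

```python
# Sketch: bootstrap distribution of a performance metric.
import random

def bootstrap_metric(values, metric, n_resamples=200, seed=42):
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]   # resample with replacement
        estimates.append(metric(sample))
    estimates.sort()
    median = estimates[len(estimates) // 2]
    q1 = estimates[len(estimates) // 4]
    q3 = estimates[3 * len(estimates) // 4]
    return median, (q1, q3)   # centre and the 'box' of the boxplot

# Example: bootstrap distribution of an accuracy-like proportion
outcomes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # 70% correct on this toy sample
median, iqr = bootstrap_metric(outcomes, lambda s: sum(s) / len(s))
```

The spread between q1 and q3 is exactly the rectangle width discussed above: the wider it is, the less certain the estimate.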
59. DATA LIMITATIONS
Machine learning algorithms recognize patterns in data. Complex patterns are more reliably identified in large datasets.
The extremely narrow dataset (2 core features) will probably prejudice the accuracy of some of our predictive models.
Nevertheless, this dataset is a faithful representation of the data currently available at most of our subsidiaries.
Context: the French telecommunications company Orange provided 2 datasets to analytics practitioners participating in the 2009 KDD Cup data mining
competition. The large set contained 15,000 variables and the reduced version 230. The prediction tasks were similar to those presented here.
60. Important limitations with the deployment of statistical clustering methods in production environments
• Most clustering algorithms cannot be fully automated and require the supervision of an analyst, a significant handicap in production settings.
• The algorithm requires frequent updating.
• A particular cluster scheme may capture seasonal effects, thus making it inapplicable at different periods of time.
• Most clustering methods are not very robust to disturbances to the data.
• The method captures patterns in a dataset at a specific moment in time. This poses some challenges:
  - Customers continuously enter and leave the database
  - New customers may have different characteristics from old ones
  - Existing customers’ behaviour may evolve over time
61. Statistical segmentation, nevertheless, remains a powerful and useful approach to finding subgroups in customer databases.
These methods identify natural patterns in multivariate datasets.
Most commonly used non-statistical alternatives rely heavily on judgement and past practice. Judgement-based approaches can be severely compromised by various types of bias, personal agendas and cognitive limitations in processing complex data.
The clustering method demonstrated above can be used to obtain a faithful description of customer subgroups at a particular point in time, and could serve to formulate hypotheses, guide further investigations and generally inform non-statistical managerial segmentation solutions.
N.B: There is much interest and on-going research in adaptive clustering algorithms capable of responding to changes in the state of the world.