Data Mining Techniques
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

Data Mining Techniques

on

  • 985 views

 

Statistics

Views

Total Views
985
Views on SlideShare
985
Embed Views
0

Actions

Likes
2
Downloads
56
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Data Mining Techniques Presentation Transcript

  • 1. Survival Data Mining Gordon S. Linoff Founder Data Miners, Inc. gordon@data-miners.com
  • 2. What to Expect from this Talk Background on survival analysis from a data miner’s perspective Introduction to key ideas in survival analysis – hazards – survival – competing risks Lots of examples – Stratification – Quantifying Loyalty Effort – Voluntary and Involuntary Churn – Forecasting – Time to Reactivation and Re-purchase ©2004 Data Miners, Inc. http://www.data-miners.com 2
  • 3. Who Am I? I am not a statistician Adept with databases and advanced algorithms Founded Data Miners with Michael Berry in 1998 We have written three books on data mining Have become very interested in survival analysis for mining customer data – survival data mining ©2004 Data Miners, Inc. http://www.data-miners.com 3
  • 4. What Does Data Mining Really Do? Provides ways to quantitatively measure what business users know or should know qualitatively Connects data to business practices Used to understand customers Occasionally, produces interesting predictive models ©2004 Data Miners, Inc. http://www.data-miners.com 4
  • 5. Data Mining Is About Customers Customer Relationship Lifetime New Established Former Prospect Customer Customer Customer Events • Initial Purchase • 2nd Purchase • Failure to Pay • Sign-Up • Usage • Voluntary Churn Customers Evolve Over Time ©2004 Data Miners, Inc. http://www.data-miners.com 5
  • 6. The State of the Customer Relationship Changes Over Time Starts P1 Starts P1 P2 P1 P1 P2 P1 P3 STOP P3 P1 P1 P2 Starts P1 STOP ©2004 Data Miners, Inc. http://www.data-miners.com 6
  • 7. Traditional Approach to Data Mining Uses Predictive Modeling Jan Feb Mar Apr May Jun Jul Aug Sep Model Set 4 3 2 1 +1 Score Set 4 3 2 1 +1 Data to build the model comes from the past Predictions for some fixed period in the future Present when new data is scored Models built with decision trees, neural networks, logistic regression, regression, and so on ©2004 Data Miners, Inc. http://www.data-miners.com 7
  • 8. Survival Data Mining Adds the Element of When Things Happen Time-to-event analysis Terminology comes from the medical world – which patients survive a treatment, which patients do not Can measure effects of variables (initial covariates or time-dependent covariates) on survival time Natural for understanding customers Can be used to quantify marketing efforts ©2004 Data Miners, Inc. http://www.data-miners.com 8
  • 9. Example Results (Made Up) 99.9% of Life bulbs will last 2000 hours Mean-time-to-failure for hard disk is 500,000 hours “A recent study published in the American Journal of Public Health found that ‘life expectancy’ among smokers who quit at age 35 exceeded that of continuing smokers by 6.9 to 8.5 years for men and 6.1 to 7.7 years for women.” [www.med.upenn.edu/tturc/pdf/benefits.pdf ] – Example of initial covariate Stopping smoking before age 50 increases lifespan by one year for every decade before 50 ©2004 Data Miners, Inc. http://www.data-miners.com 9
  • 10. Original Statistics Life table methods used by actuaries for a long, long time – These the are the methods we will be focused on Applied with vigor to medicine in the mid-20th Century Applied with vigor to manufacturing during 20th Century as well Took off with Sir David Cox’s Proportion Hazards Models in 1972 that provide effects of initial covariates ©2004 Data Miners, Inc. http://www.data-miners.com 10
  • 11. Survival for Marketing Has Some Differences We are happy with discrete time (probability vs rate) – Traditional survival analysis uses continuous time Marketers have hundreds of thousands or millions of examples – Traditional survival analysis might be done on dozens or hundreds of participants We have the benefit of a wealth of data – Traditional survival analysis looks at factors incorporated into the study We have to deal with “window” effects due to business practices and database reality – Traditional survival ignores left truncation ©2004 Data Miners, Inc. http://www.data-miners.com 11
  • 12. To Understand the Calculation, Let’s Focus on the End of the Relationship tenure is the duration START STOP of the relationship Challenge defining beginning of customer relationship Challenge defining end of customer relationship Challenge finding either in customer databases ©2004 Data Miners, Inc. http://www.data-miners.com 12
  • 13. How Long Will A Customer Survive? 100% 90% Survival always starts 80% at 100% and declines High End over time 70% P e rc e n t S u rv iv e d Regular 60% If everyone in the 50% model set stopped, then 40% survival goes to 0; otherwise it is always 30% greater than 0 20% 10% Survival is useful for 0% comparing different groups 0 12 24 36 48 60 72 84 96 108 120 Tenure (months) ©2004 Data Miners, Inc. http://www.data-miners.com 13
  • 14. Key Idea: Data is Censored Known tenure when customers stop Known Customer Start time Unknown tenure for active customers (censored) ©2004 Data Miners, Inc. http://www.data-miners.com 14
  • 15. Use Censored Data to Calculate Hazard Probabilities Hazard, h(t), at time t is the probability that a customer who has survived to time t will not survive to time t+1 # customers who stop at exactly time t h(t) = # customers at risk of stopping at time t Value of hazard depends on units of time – days, weeks, months, years Differs from traditional definition because time is discrete – hazard probability not hazard rate ©2004 Data Miners, Inc. http://www.data-miners.com 15
  • 16. Bathtub Hazard (Risk of Dying by Age) 3.0% 2.5% Bathtub hazard starts high, goes 2.0% down, and then Haz ard 1.5% increases again 1.0% Example from US 0.5% mortality tables 0.0% shows probability of 10-14 yrs 15-19 yrs 30-34 yrs 35-39 yrs 40-44 yrs 45-49 yrs 50-54 yrs 55-59 yrs 60-64 yrs 20-24 yrs 65-69 yrs 70-74 yrs 0-1 yrs 1-4 yrs 5-9 yrs 25-29 yrs dying at a given age A (Y ) ge ears ©2004 Data Miners, Inc. http://www.data-miners.com 16
  • 17. Hazards Are Like An X-Ray Into Customers Peak here is for non- 0.30% Hazard (Daily Risk of Churn) payment and end of 0.25% initial promotion 0.20% Bumpiness is due to within week variation 0.15% 0.10% 0.05% 0.00% 0 60 120 180 240 300 360 420 480 540 600 660 720 Tenure (Days) ©2004 Data Miners, Inc. http://www.data-miners.com 17
  • 18. Hazards Can Show Interesting Features of the Customer Lifecycle 8% Peak here is for non- 7% payment after one year 6% Daily Churn Hazard 5% 4% 3% Peak here is for non- payment after two years 2% 1% 0% 360 390 480 510 540 690 240 270 300 330 420 450 570 600 630 660 60 90 180 210 720 30 120 150 930 960 990 750 780 810 840 870 900 0 Tenure (Days After Start) ©2004 Data Miners, Inc. http://www.data-miners.com 18
  • 19. How Long Will A Customer Survive? Survival analysis answers the question by rephrasing it slightly: – What proportion of customers survive to time t? Survival at time t, S(t), is the probability that a customer will survive exactly to time t Calculation: – S(t) = S(t – 1) * (1 – h(t – 1)) – S(0) = 100% Given a set of hazards by time, survival can be easily calculated in SAS or a spreadsheet ©2004 Data Miners, Inc. http://www.data-miners.com 19
  • 20. Survival is Similar to Retention Retention is jagged. It is calculated on customers who 100% started exactly w weeks ago. 90% 80% Retention/Survival 70% Survival is smooth. 60% 50% It is calculated using customers 40% who started up to w weeks ago. 30% 20% 10% 0% 0 10 20 30 40 50 60 70 80 90 100 110 Tenure (Weeks) ©2004 Data Miners, Inc. http://www.data-miners.com 20
  • 21. Why Survival is Useful Compare hazards for different groups of customers Calculate median time to event – How long does it take for half the customers to leave? Calculate truncated mean tenure – What is the average customer tenure for the first year after starting? For the first two years? – Can be used to quantify effects ©2004 Data Miners, Inc. http://www.data-miners.com 21
  • 22. Median Lifetime is the Tenure Where Survival = 50% 100% 90% 80% High End 70% Percent Survived Regular 60% 50% 40% 30% 20% 10% 0% 0 12 24 36 48 60 72 84 96 108 120 Tenure (months) ©2004 Data Miners, Inc. http://www.data-miners.com 22
  • 23. Average Tenure Is The Area Under The Curves 100% 90% 80% High End 70% Percent Survived Regular 60% 50% 40% average 10-year tenure high end customers = 30% 73 months (6.1 years) average 10-year tenure 20% regular customers = 10% 44 months (3.7 years) 0% 0 12 24 36 48 60 72 84 96 108 120 Tenure ©2004 Data Miners, Inc. http://www.data-miners.com 23
  • 24. Survival to Quantify Marketing Efforts A company has a loyalty marketing effort designed to keep customers This effort costs money What is the value of the effort as measured in increased customer lifetimes? SOLUTION: survival analysis – How much longer to customers survive after accepting the offer? – How to quantify this in dollars and cents? ©2004 Data Miners, Inc. http://www.data-miners.com 24
  • 25. Loyalty Offers is an Example of a Time- Dependent Covariate Non-Responders have Responders have no “loyalty” date a “loyalty” date Open Circle indicates that customer has stopped before (or on) the analysis date time Time = 0 Time = n Time c ©2004 Data Miners, Inc. http://www.data-miners.com 25
  • 26. We Can Use Area to Quantify Results 100% 95% 90% How much Percent Retained is this 85% difference 80% worth? 75% 70% Group 1 Group 2 65% 0 2 4 6 8 10 12 14 16 18 Months After Intervention Chart shows survival after loyalty offer acceptance compared to similar group not given offer ©2004 Data Miners, Inc. http://www.data-miners.com 26
  • 27. We Can Use Area to Quantify Results 100% Increase in 95% 93.4% survival is 90% given by the Percent Retained 85% area between 85.6% the curves. 80% 75% For the first 70% year, area of 65% Group 1 Group 2 triangle is a good enough 0 2 4 6 8 10 12 14 16 18 Months After Intervention estimate Note: there are easy ways to calculate the exact value ©2004 Data Miners, Inc. http://www.data-miners.com 27
  • 28. Loyalty-Responsive Customers Are Doing Better Than Others Survival for first year after loyalty intervention – Group 1: 93.4% – Group 2: 85.6% – Increase for Group 1: 7.8% Average increase in lifetime for first year is 3.9% (assuming the 7.8% would have stayed, on average, 6 months) Assuming revenue of $400/year, loyalty responsive contribute an additional revenue of $15.60 during the first year This actually compares favorably to cost of loyalty program ©2004 Data Miners, Inc. http://www.data-miners.com 28
  • 29. Competing Risks: Customers May Leave for Many Reasons Customers may cancel – Voluntarily – Involuntarily – Migration Survival Analysis Can Show Competing Risks – overall S(t) is the product of Sr(t) for all risks We’ll walk through an example – Overall survival – Competing risks survival – Competing risks hazards – Stratified competing risks survival ©2004 Data Miners, Inc. http://www.data-miners.com 29
  • 30. Overall Survival for a Group of Customers 100% Overall 90% Sharp drop during this 80% period, otherwise 70% smooth 60% 50% 40% 30% 20% 10% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 ©2004 Data Miners, Inc. http://www.data-miners.com 30
  • 31. Competing Risks for Voluntary and Involuntary Churn Steep drop is 100% Involuntary associated with Voluntary 90% involuntary churn Overall 80% 70% 60% 50% 40% 30% 20% 10% Survival for each risk 0% is always greater than 0 30 overall survival 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 ©2004 Data Miners, Inc. http://www.data-miners.com 31
  • 32. What the Hazards Look Like As expected, steep 9% Involuntary drop has a very high Voluntary 8% hazard 7% 6% 5% 4% 3% 2% 1% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 ©2004 Data Miners, Inc. http://www.data-miners.com 32
  • 33. Most Involuntary Churn is from Non Credit Card Payers 100% Involuntary-CC Voluntary-CC 90% Involuntary-Other 80% Voluntary-Other 70% Steep drop associated 60% only with non-credit card 50% payers 40% 30% 20% 10% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 510 540 570 600 ©2004 Data Miners, Inc. http://www.data-miners.com 33
  • 34. Time to Reactivation When a customer stops, often they come back – this is winback or reactivation In this case, the “initial condition” is the stop The “final condition” is the restart Not all customers restart ©2004 Data Miners, Inc. http://www.data-miners.com 34
  • 35. Survival Can Be Applied to Other Events: Reactivation 100% 10% 90% 9% ("Risk" of Reactivating) (Remain Deactivated) 80% 8% Haz ard Probability 70% 7% 60% 6% Survival 50% 5% 40% 4% 30% 3% 20% 2% 10% 1% 0% 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 Days After Deactivation ©2004 Data Miners, Inc. http://www.data-miners.com 35
  • 36. Time to Next Order In retailing type businesses, customers make multiple purchases Survival analysis can be applied here, too The question is: how long to the next purchase? Initial state is: date of purchase Final state is: date of next purchase It is better to look at 1 – survival rather than survival ©2004 Data Miners, Inc. http://www.data-miners.com 36
  • 37. Time to Next Purchase, Stratified by Number of Previous Purchases 50% 45% 05-->06 40% 04-->05 35% 30% 03-->04 25% 20% 02-->03 15% ALL 10% 5% 01-->02 0% 0 30 60 90 120 150 180 210 240 270 300 330 360 390 420 450 480 ©2004 Data Miners, Inc. http://www.data-miners.com 37
  • 38. Customer-Centric Forecasting Using Survival Analysis Forecasting customer numbers is important in many businesses Survival analysis makes it possible to do the forecasts at the finest level of granularity – the customer level Survival analysis makes it possible to incorporate customer-specific factors Can be used to estimate restarts as well as stops ©2004 Data Miners, Inc. http://www.data-miners.com 38
  • 39. Using Survival for Customer Centric Forecasting Number Actual Predicted 130 120 110 100 90 80 70 60 50 40 30 20 10 0 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 Day of Month ©2004 Data Miners, Inc. http://www.data-miners.com 39
  • 40. The Forecasting Solution is a Bit Complicated Existing New Start Customer Forecast Base Do Existing Base Do New Start Forecast (EBF) Forecast (NSF) Existing Customer New Start Base Forecast Forecast Do Existing Base Do New Start Churn Churn Forecast Forecast (NSCF) (EBCF) Churn Churn Forecast Actuals Compare ©2004 Data Miners, Inc. http://www.data-miners.com 40
  • 41. Survival Data Mining Connects Data to Business Needs Provides ways to quantitatively measure what business users know or should know qualitatively Connects data to business practices Techniques such as survival analysis provide new ways of looking at customers Techniques such as customer-centric forecasting integrate data mining with business processes such as forecasting ©2004 Data Miners, Inc. http://www.data-miners.com 41