Survival Analysis of Web Users


  1. 1. “Survival” Analysis of Web Users. Dell Zhang, DCSIS, Birkbeck, University of London
  2. 2. Outline • What Is It • Why Is It Useful • Case Study – The Departure Dynamics of Wikipedia Editors
  3. 3. What Is It 3
  4. 4. Time-To-Event Data • Survival Analysis is a branch of statistics that deals with the modelling of time-to-event data – The outcome variable of interest is the time until an event occurs • e.g. death, disease, failure • e.g. recovery, marriage – It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.
  5. 5. Y X How to build a probabilistic model of Y ? 5
  6. 6. Y X How to build a probabilistic model of Y ? How to build a probabilistic model of Y given X ? 6
  7. 7. Y X How to build a probabilistic model of Y ? How to build a probabilistic model of Y given X ? 7
  8. 8. Censoring • A key problem in survival analysis – It occurs when we have some information about an individual’s survival time, but we don’t know the survival time exactly.
  9. 9. 9
  10. 10. Y X Options: 1) Wait for those patients to die? 2) Discard the censored data? 3) Use the censored data as if they were not censored? 4) …… 10
  11. 11. Goals • Survival Analysis attempts to answer questions such as – What fraction of a population will survive past a certain time? Of those that survive, at what rate will they die? – Can multiple causes of death be taken into account? – How do particular circumstances or characteristics increase or decrease the odds of survival?
  12. 12. • Censoring of data • Comparing groups – (1 treatment vs. 2 placebo) • Confounding or Interaction factors – Log WBC
  13. 13. Why Is It Useful for Online Marketing etc.
  14. 14. The Data Are There • Events meaningful to online marketing – Time to Clicking the Ad – Informational: Time to Finding the Wanted Info – Transactional: Time to Buying the Product – Social: Time to Joining/Leaving the Community – …… Time Matters!
  15. 15. Evidence-Based Marketing • Let’s work as (real) doctors – Users = Patients – Advertisement (Marketing) = Treatment. Survival Analysis brings the time dimension back to the centre stage.
  16. 16. 17
  17. 17. 18
  18. 18. Predict whether a new question asked on Stack Overflow will be closed when 19
  19. 19. Case Study: The Departure Dynamics of Wikipedia Editors
  20. 20. About 90,000 regularly active volunteer editors around the world
  21. 21. 22
  22. 22. Departure Dynamics • Who are likely to “die”? • How soon will they “die”? • Why do they “die”? “live” = stay in the editors’ community = keep editing; “die” = leave the editors’ community = stop editing (for 5 months)
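The definition of “die” above can be turned into survival data as (duration, event) pairs. The following is a minimal sketch, not the author's code: it assumes that “5 months” can be approximated as 150 days, and the function name, dates and observation end are hypothetical. An editor whose silence has not yet lasted the full threshold by the end of the data is right-censored.

```python
from datetime import date

# Assumption: "5 months" of inactivity is approximated as 150 days.
INACTIVITY_DAYS = 150

def lifetime_record(first_edit, last_edit, observation_end):
    """Return (duration_in_days, event_observed) for one editor."""
    duration = (last_edit - first_edit).days
    # The editor counts as "dead" only if the silence after the last edit
    # has already lasted the full threshold by the end of the data.
    silent_days = (observation_end - last_edit).days
    event_observed = silent_days >= INACTIVITY_DAYS
    return duration, event_observed

# Hypothetical editors, observed until a hypothetical cut-off date.
end = date(2010, 9, 1)
departed = lifetime_record(date(2009, 1, 1), date(2009, 3, 1), end)
censored = lifetime_record(date(2010, 1, 1), date(2010, 8, 15), end)
print(departed)  # event flag is True: departure observed
print(censored)  # event flag is False: still possibly active, so censored
```

The censored editor's duration is only a lower bound on the true lifetime, which is why it cannot be treated as an observed departure.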
  23. 23. Who are likely to “die”? (WikiChallenge) 24
  24. 24. 25
  25. 25. 2001-01-01 2010-04-01 2010-09-01 2001-06-01 2010-09-01 2011-02-01 26
  26. 26. 27
  27. 27. Behavioural Dynamics Features • Exponential Steps (months) • Web Search (SIGIR-2009), Social Tagging (WWW-2009), Language Modelling (ICTIR-2009)
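One way to read “exponential steps” is as edit counts over time windows whose lengths double at each step. The sketch below illustrates that idea only; the exact window lengths, the 30-day month and the function name are assumptions, not taken from the slides.

```python
from datetime import datetime, timedelta

# Hypothetical window lengths in months, doubling at each step.
WINDOWS_MONTHS = [1 / 16, 1 / 8, 1 / 4, 1 / 2, 1, 2, 4, 8, 16]
DAYS_PER_MONTH = 30  # simplifying assumption

def edit_count_features(edit_times, reference_time):
    """Count edits in each exponentially growing window before reference_time."""
    features = []
    for months in WINDOWS_MONTHS:
        window_start = reference_time - timedelta(days=months * DAYS_PER_MONTH)
        count = sum(1 for t in edit_times if window_start <= t < reference_time)
        features.append(count)
    return features

# Hypothetical edit history of one editor.
edits = [
    datetime(2010, 8, 31, 12),
    datetime(2010, 8, 20),
    datetime(2010, 6, 1),
    datetime(2009, 12, 25),
]
features = edit_count_features(edits, datetime(2010, 9, 1))
print(features)
```

Short windows capture very recent activity at fine resolution, while long windows summarise older history coarsely.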
  28. 28. 29
  29. 29. 30
  30. 30. 31
  31. 31. Gradient Boosted Trees (GBT) © 2008-2012 ~maniraptora
  32. 32. Gradient Boosted Trees (GBT) • The success of GBT in our task is probably attributable to – its ability to capture the complex nonlinear relationship between the target variable and the features, – its insensitivity to different feature value ranges as well as outliers, and – its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b). • GBT vs RF
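To make shrinkage and subsampling concrete, here is a toy sketch of stochastic gradient boosting for squared loss. It is not the system described in the slides: it uses depth-1 trees (stumps) only, and all function names and parameter values are illustrative assumptions.

```python
import random

def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, j, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]

def stump_predict(stump, row):
    j, threshold, left_value, right_value = stump
    return left_value if row[j] <= threshold else right_value

def fit_gbt(X, y, n_trees=100, shrinkage=0.1, subsample=0.5, seed=0):
    """Boost stumps on residuals (the negative gradient of squared loss)."""
    rng = random.Random(seed)
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    sample_size = min(len(y), max(2, int(subsample * len(y))))
    for _ in range(n_trees):
        # Subsampling: each stump sees only a random part of the data.
        idx = rng.sample(range(len(y)), sample_size)
        residuals = [y[i] - predictions[i] for i in idx]
        stump = fit_stump([X[i] for i in idx], residuals)
        stumps.append(stump)
        # Shrinkage: add only a fraction of each stump's correction.
        predictions = [p + shrinkage * stump_predict(stump, row)
                       for p, row in zip(predictions, X)]
    return base, shrinkage, stumps

def predict_gbt(model, row):
    base, shrinkage, stumps = model
    return base + shrinkage * sum(stump_predict(s, row) for s in stumps)

# Toy nonlinear target: y = x^2.
X = [[x] for x in range(-10, 11)]
y = [x * x for x in range(-10, 11)]
model = fit_gbt(X, y, n_trees=200)
print(predict_gbt(model, [0]), predict_gbt(model, [9]))
```

Because splits depend only on the ordering of feature values, rescaling a feature monotonically does not change the fitted stumps, which is one reading of the insensitivity to feature value ranges noted above.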
  33. 33. 34
  34. 34. 35
  35. 35. 36
  36. 36. 37
  37. 37. Final Result • The 2nd best valid algorithm in the WikiChallenge – RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution – Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features – WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
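For reference, RMSLE (root mean squared logarithmic error) can be computed as below. This sketch assumes the usual definition of the metric; the function name and the sample numbers are illustrative, not the challenge data.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared error between log(1 + predicted) and log(1 + actual)."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical predicted vs. actual future edit counts for four editors.
print(rmsle([0, 3, 10, 120], [0, 5, 8, 200]))
```

Working on the log scale means that an error of a few edits matters more for an occasional editor than for a very prolific one.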
  38. 38. How soon will they “die”? 39
  39. 39. 110,000 random samples birth & death January 2001 The evolution of Wikipedia editors community. 40
  40. 40. 110,000 random samples active editors January 2001 The evolution of Wikipedia editors community. 41
  41. 41. Survival Function: What fraction of a population will survive past a certain time?
  42. 42. Occasional Editors Customary Editors The histogram of Wikipedia editors lifetime. 43
  43. 43. Kaplan-Meier Estimator 44
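As a rough illustration of the Kaplan-Meier estimator, the sketch below multiplies, at each observed event time, the fraction of at-risk subjects that survive it. The data are made up and the function name is hypothetical; censored subjects stay in the risk set until their censoring time but never count as deaths.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up lifetimes in days; False marks a right-censored observation.
durations = [5, 8, 8, 12, 20, 20, 30]
events = [True, True, False, True, False, True, False]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```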
  44. 44. 45
  45. 45. The empirical survival function. 46
  46. 46. Normal Distribution: Probability Plot
  47. 47. Extreme Value Distribution: Probability Plot
  48. 48. Rayleigh Distribution: Probability Plot
  49. 49. Exponential Distribution: Probability Plot
  50. 50. Lognormal Distribution: Probability Plot
  51. 51. Weibull Distribution: Probability Plot
  52. 52. The survival function. 53
  53. 53. Weibull distribution 54
  54. 54. Expected Future Lifetime • median lifetime: 53 days
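The link between a Weibull survival function and its median can be sketched as follows. The slides do not give the fitted shape and scale, so the shape value here is purely hypothetical; only the 53-day median comes from the slides.

```python
import math

def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def weibull_median(shape, scale):
    """The time at which S(t) = 0.5, i.e. scale * ln(2) ** (1 / shape)."""
    return scale * math.log(2) ** (1 / shape)

def scale_from_median(median, shape):
    """Invert weibull_median to recover the scale from a known median."""
    return median / math.log(2) ** (1 / shape)

SHAPE = 0.5  # hypothetical shape parameter, not taken from the slides
scale = scale_from_median(53, SHAPE)
print(weibull_survival(53, SHAPE, scale))   # half of the editors remain
print(weibull_survival(365, SHAPE, scale))  # fraction remaining after a year
```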
  55. 55. Hazard Function: Of those that survive, at what rate will they die? The instantaneous potential per unit time for the event to occur, given that the individual has survived up to time t.
  56. 56. Bathtub Curve: http://en.wikipedia.org/wiki/Bathtub_curve
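The hazard function is related to the survival function by h(t) = -d/dt ln S(t). The sketch below checks this numerically against the closed-form Weibull hazard; the parameter values are arbitrary illustrations, and the function names are hypothetical.

```python
import math

def hazard_from_survival(survival, t, dt=1e-5):
    """Approximate h(t) = -d/dt ln S(t) with a central difference."""
    return -(math.log(survival(t + dt)) - math.log(survival(t - dt))) / (2 * dt)

def weibull_hazard(t, shape, scale):
    """Closed form: h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Arbitrary illustrative parameters.
shape, scale = 1.5, 10.0
survival = lambda t: math.exp(-((t / scale) ** shape))

numeric = hazard_from_survival(survival, 5.0)
exact = weibull_hazard(5.0, shape, scale)
print(numeric, exact)
```

With shape below 1 the Weibull hazard decreases over time, with shape above 1 it increases, and with shape equal to 1 it is constant (the exponential distribution).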
  57. 57. The hazard function. 58
  58. 58. The hazard function. 59
  59. 59. Conclusions • For customary Wikipedia editors, – the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days); – there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases; – more active editors tend to keep editing for a longer time.
  60. 60. Why do they “die”? 61
  61. 61. Covariates • LastEdit
  62. 62. 63
  63. 63. 64
  64. 64. Cox Proportional Hazards Model 65
  65. 65. Semi-Parametric • The semi-parametric property of the Cox model explains its popularity – The baseline hazard is left unspecified – Robust: it will closely approximate the correct parametric model – It uses a minimum of assumptions
  66. 66. Cox PH vs. Logistic 67
  67. 67. Maximum Likelihood Estimation 68
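For the Cox model, the quantity that is maximised is the partial likelihood, which involves only the coefficients and not the baseline hazard. The following is a minimal sketch of the partial log-likelihood with Breslow-style handling of ties; the function name and the tiny data set are made up for illustration.

```python
import math

def cox_partial_log_likelihood(beta, covariates, durations, events):
    """Sum over observed events of: score_i - log(sum of exp(score_j) over the risk set)."""
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, event) in enumerate(zip(durations, events)):
        if not event:
            continue  # censored subjects contribute only through risk sets
        risk_set = [math.exp(scores[j])
                    for j, t_j in enumerate(durations) if t_j >= t_i]
        total += scores[i] - math.log(sum(risk_set))
    return total

# Made-up data: the subject with covariate value 1 departs first.
covariates = [[1.0], [0.0], [0.0]]
durations = [2, 4, 6]
events = [True, True, False]

for b in (-1.0, 0.0, 1.0):
    print(b, cox_partial_log_likelihood([b], covariates, durations, events))
```

In practice the coefficients are found by numerically maximising this function, for example with Newton-Raphson iterations.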
  68. 68. Cox Proportional Hazards Model
      Covariate              β        se       z         p
      X1: namespace==Main    -0.1095  0.0172   -6.3664   0.1935e-9
      X2: log(1+cur_size)    -0.0688  0.0036  -19.2474   0.0000e-9
  69. 69. Hazard Ratio 70
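In a Cox model the hazard ratio for a covariate is exp(β). The sketch below applies this to the two coefficients reported on the previous slide and adds a Wald-type 95% interval, exp(β ± 1.96·se). The interval is an addition for illustration and does not appear in the slides.

```python
import math

# Coefficients and standard errors as reported in the Cox model table.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}

def hazard_ratio(beta, se, z=1.96):
    """Return (hazard ratio, lower bound, upper bound) of a Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

ratios = {name: hazard_ratio(beta, se) for name, (beta, se) in coefficients.items()}
for name, (hr, low, high) in ratios.items():
    print(f"{name}: HR={hr:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Both ratios are below 1, so under this model each covariate is associated with a lower hazard of departure: roughly 10% lower when the last edit is in the Main namespace, and roughly 7% lower per unit increase in log(1+cur_size).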
  70. 70. Adjusted Survival Curves 71
  71. 71. 72
  72. 72. Next Step 73
  73. 73. Cartoon: Ron Hipschman. Data: David Hand
  74. 74. Lightning Does Strike Twice! • Roy Sullivan, a former park ranger from Virginia – He was struck by lightning 7 times • 1942 (lost big-toe nail) • 1969 (lost eyebrows) • 1970 (left shoulder seared) • 1972 (hair set on fire) • 1973 (hair set on fire & legs seared) • 1976 (ankle injured) • 1977 (chest & stomach burned) – He committed suicide in September 1983.
  75. 75. A Lot More To Do • Multiple Occurrences of “Death” – Recurrent Event Survival Analysis (e.g., based on Counting Process) • Multiple Types of “Death” – Competing Risks Survival Analysis
  76. 76. Software Tools • R – The ‘survival’ package • Matlab – The ‘statistics’ toolbox • Python – The ‘statsmodels’ module?
  77. 77. References • David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta • John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi • Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr • Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct 2011. http://goo.gl/s2Dex
  78. 78. ? 79
  79. 79. 80
