“Survival” Analysis of
     Web Users
             Dell Zhang
DCSIS, Birkbeck, University of London



                                        1
Outline
• What Is It
• Why Is It Useful
• Case Study
  – The Departure Dynamics of Wikipedia Editors




                                                  2
What Is It




             3
Time-To-Event Data
• Survival Analysis is a branch of statistics which
  deals with the modelling of time-to-event data
  – The outcome variable of interest is time until an
    event occurs.
     • death, disease, failure
     • recovery, marriage
  – It is called reliability theory/analysis in
    engineering, and duration analysis/modelling in
    economics or sociology.

                                                        4
Y   X

        How to build
        a probabilistic model of Y ?

        How to build
        a probabilistic model of Y given X ?




                                               6
Censoring
• A key problem in survival analysis
  – It occurs when we have some information about
    individual survival time, but we don’t know the
    survival time exactly.




                                                      8
Y   X


        Options:

        1) Wait for those patients to die?

        2) Discard the censored data?

        3) Use the censored data as if they were
           not censored?

        4) ……
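As a rough illustration of why options 2 and 3 are problematic, the sketch below uses made-up survival times (an assumption, not data from this talk) and a study that stops observing at week 10. Each subject is stored as a (duration, event_observed) pair, which is the usual encoding of right-censored data.

```python
# Hypothetical true survival times in weeks (illustrative values only).
true_times = [2, 4, 6, 8, 12, 16, 20, 30]
STUDY_END = 10  # observation stops here, so longer times are right-censored

# Encode each subject as (observed duration, event_observed flag).
observations = [(min(t, STUDY_END), t <= STUDY_END) for t in true_times]


def mean(values):
    return sum(values) / len(values)


true_mean = mean(true_times)
# Option 2: discard censored subjects, keeping only the observed deaths.
discard_mean = mean([d for d, observed in observations if observed])
# Option 3: pretend each censored duration is an actual death time.
naive_mean = mean([d for d, _ in observations])

print(true_mean, discard_mean, naive_mean)
```

In this toy case both shortcuts underestimate the true mean survival time, which is the kind of bias that censoring-aware estimators are designed to avoid.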




                                               10
Goals
• Survival Analysis attempts to answer
  questions such as
  – What is the fraction of a population which will
    survive past a certain time? Of those that survive,
    at what rate will they die?
  – Can multiple causes of death be taken into
    account?
  – How do particular circumstances or characteristics
    increase or decrease the odds of survival?

                                                      11
• Censoring of data
• Comparing groups
   – (1 treatment vs. 2 placebo)
• Confounding or Interaction
  factors
   – Log WBC




                                   12
Why Is It Useful

for Online Marketing etc.




                            13
The Data Are There
• Events meaningful to online marketing
  – Time to Clicking the Ad
  – Informational: Time to Finding the Wanted Info
  – Transactional: Time to Buying the Product
  – Social: Time to Joining/Leaving the Community
  – ……

                                     Time Matters!

                                                     14
Evidence-Based Marketing
• Let’s work like (real) doctors
  – Users = Patients
  – Advertisement (Marketing) = Treatment

                      Survival Analysis brings
                        the time dimension
                      back to the centre stage.



                                                  15
Predict whether a new question asked on Stack Overflow will be closed
        when
                                                                 19
Case Study

The Departure Dynamics of
    Wikipedia Editors



                            20
About 90,000 regularly active volunteer editors around the world
Departure Dynamics
• Who are likely to “die”?
• How soon will they “die”?
• Why do they “die”?

     “live”= stay in the editors’ community
           = keep editing
     “die” = leave the editors’ community
           = stop editing (for 5 months)
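The "die" definition above can be turned into survival data roughly as follows. This is a minimal sketch: the 150-day approximation of "5 months", the function name, the example dates, and the choice to measure lifetime from first to last edit are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "5 months" is approximated as 150 days.
INACTIVITY_THRESHOLD = timedelta(days=150)


def label_editor(first_edit, last_edit, observation_end):
    """Return (lifetime_days, died) for one editor.

    died is True when the editor has been silent for at least the
    threshold by the end of observation; otherwise the lifetime is
    right-censored (the editor may still come back and edit).
    """
    lifetime_days = (last_edit - first_edit).days
    died = (observation_end - last_edit) >= INACTIVITY_THRESHOLD
    return lifetime_days, died


end = date(2011, 2, 1)  # illustrative end of the observation window
print(label_editor(date(2009, 1, 1), date(2009, 3, 1), end))   # silent for long: "dead"
print(label_editor(date(2009, 1, 1), date(2011, 1, 15), end))  # recently active: censored
```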

                                              23
Who are likely to “die”?

      (WikiChallenge)




                           24
2001-01-01                2010-04-01   2010-09-01




             2001-06-01                2010-09-01   2011-02-01



                                                        26
Behavioural Dynamics Features




Exponential Steps

                    months

                     Web Search (SIGIR-2009),
                     Social Tagging (WWW-2009),
                     Language Modelling (ICTIR-2009)
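One way to read "exponential steps" is that edits are counted in look-back windows whose lengths double each time. The sketch below follows that reading; the window lengths, the function name and the sample data are assumptions for illustration, not the exact feature set used in this work.

```python
def edits_in_windows(edit_ages_months, n_windows=8, shortest=1 / 16):
    """Count edits within look-back windows of doubling length.

    edit_ages_months: how many months before the prediction date each
    edit was made. Window k covers the last shortest * 2**k months.
    """
    widths = [shortest * 2 ** k for k in range(n_windows)]
    return [sum(1 for age in edit_ages_months if age <= w) for w in widths]


# Hypothetical editor: edits made 0.05, 0.3, 0.9, 3 and 7 months ago.
features = edits_in_windows([0.05, 0.3, 0.9, 3.0, 7.0])
print(features)
```

Doubling windows give fine resolution for recent behaviour and coarse resolution for the distant past while keeping the feature vector short.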

                                                  28
Gradient Boosted Trees (GBT)




                                         32
                       © 2008-2012 ~maniraptora
Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably
  attributable to
  – its ability to capture the complex nonlinear
    relationship between the target variable and the
    features,
  – its insensitivity to different feature value ranges as
    well as outliers, and
  – its resistance to overfitting via regularisation
    mechanisms such as shrinkage and subsampling
    (Friedman 1999a; 1999b).
• GBT vs RF
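Shrinkage and subsampling can be illustrated with a toy gradient boosting loop. The sketch below is a deliberately small stand-in (one feature, depth-1 trees, squared loss, invented data, hypothetical function names), not the model used in the WikiChallenge entry.

```python
import random


def fit_stump(xs, targets):
    """Least-squares regression stump; returns (threshold, left, right)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    if best is None:  # every x identical: predict the mean everywhere
        mean = sum(targets) / len(targets)
        return (xs[0], mean, mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, lv, rv = stump
    return lv if x <= threshold else rv


def fit_gbt(xs, ys, n_rounds=100, shrinkage=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    k = min(len(xs), max(2, int(subsample * len(xs))))
    for _ in range(n_rounds):
        rows = rng.sample(range(len(xs)), k)          # subsampling
        residuals = [ys[i] - preds[i] for i in rows]  # negative gradient of squared loss
        stump = fit_stump([xs[i] for i in rows], residuals)
        stumps.append(stump)
        # Shrinkage: only take a small step towards each new stump.
        preds = [p + shrinkage * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, shrinkage, stumps


def gbt_predict(model, x):
    base, shrinkage, stumps = model
    return base + shrinkage * sum(stump_predict(s, x) for s in stumps)


xs = list(range(20))
ys = [x * x for x in xs]  # a simple nonlinear target
model = fit_gbt(xs, ys)
print(gbt_predict(model, 3), gbt_predict(model, 17))
```

Each round fits a stump to the residuals on a random half of the rows and adds only a tenth of its prediction, which is how shrinkage and subsampling act as regularisers.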
                                                             33
Final Result
• The 2nd best valid algorithm in the
  WikiChallenge
  – RMSLE = 0.862582: 41.7% improvement over
    WMF’s in-house solution
  – Much simpler model than the top performing
    system: 21 behavioural dynamics features vs. 206
    features
  – WMF is now implementing this algorithm
    permanently and looks forward to using it in the
    production environment.
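For readers unfamiliar with the metric, RMSLE (root mean squared logarithmic error) is commonly computed as below. This sketch assumes the standard log(1 + x) form; the function name and sample values are illustrative.

```python
import math


def rmsle(predicted, actual):
    """Root mean squared logarithmic error over paired non-negative counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log1p(p) - math.log1p(a)) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


print(rmsle([0, 3, 10], [0, 3, 10]))  # a perfect prediction scores 0.0
```

Because errors are measured on a log scale, the metric penalises relative rather than absolute differences, which suits heavily skewed edit counts.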

                                                    38
How soon will they “die”?




                            39
Figure: The evolution of Wikipedia editors' community (110,000 random samples; axis label: January 2001). Panel: birth & death.

Figure: The evolution of Wikipedia editors' community (110,000 random samples; axis label: January 2001). Panel: active editors.
Survival Function

What is the fraction of a population which
will survive past a certain time?
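In standard notation, with T the random lifetime and F its cumulative distribution function, the survival function answering this question is:

```latex
S(t) = P(T > t) = 1 - F(t), \qquad S(0) = 1, \quad \lim_{t \to \infty} S(t) = 0
```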




                                             42
Occasional Editors                   Customary Editors




    The histogram of Wikipedia editors' lifetime.        43
Kaplan-Meier Estimator
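A minimal sketch of the Kaplan-Meier product-limit estimator follows. The data are hypothetical and the function name is an assumption; production work would normally rely on an established package instead.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct event time.

    S(t) is the product over event times t_i <= t of (1 - d_i / n_i),
    where d_i is the number of events at t_i and n_i the number still
    at risk just before t_i.
    """
    event_times = sorted({d for d, o in zip(durations, observed) if o})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, o in zip(durations, observed) if o and d == t)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical lifetimes in days; False marks a right-censored editor.
durations = [5, 8, 8, 12, 20, 30]
observed = [True, True, False, True, False, True]
for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 3))
```

Censored subjects never trigger a drop in the curve, but they still count in the at-risk set until their censoring time, which is how the estimator uses partial information.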




                         44
The empirical survival function.   46
Probability Plots
• Normal Distribution
• Extreme Value Distribution
• Rayleigh Distribution
• Exponential Distribution
• Lognormal Distribution
• Weibull Distribution
The survival function.   53
Weibull distribution
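For reference, the Weibull survival function and its median can be computed as in the sketch below. The scale and shape values are hypothetical placeholders, not the parameters fitted to the Wikipedia data.

```python
import math


def weibull_survival(t, scale, shape):
    """S(t) = exp(-(t / scale) ** shape) for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_median(scale, shape):
    """Time at which S(t) = 0.5, i.e. scale * (ln 2) ** (1 / shape)."""
    return scale * math.log(2) ** (1 / shape)


# Hypothetical parameters, not the values fitted in this study.
scale, shape = 100.0, 0.5
median = weibull_median(scale, shape)
print(median, weibull_survival(median, scale, shape))
```

A shape below 1 gives a hazard that falls over time, a shape of 1 reduces to the exponential distribution, and a shape above 1 gives a rising hazard.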




                       54
Expected Future Lifetime




              median lifetime: 53 days
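As a reminder of the standard definitions (stated here as textbook background, with t_0 denoting the age already reached), the expected future lifetime and the median lifetime are related to the survival function S by:

```latex
E[T - t_0 \mid T > t_0] = \frac{1}{S(t_0)} \int_{t_0}^{\infty} S(t)\, dt,
\qquad
S(t_{\mathrm{median}}) = 0.5
```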


                                         55
Hazard Function
Of those that survive, at what rate will they die?




   The instantaneous potential per unit time for the event to occur,
   given that the individual has survived up to time t.
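In symbols, with f the density and S the survival function of the lifetime T, the standard definition of the hazard function is:

```latex
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \ln S(t)
```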

                                                                   56
Bathtub Curve




http://en.wikipedia.org/wiki/Bathtub_curve   57
The hazard function.   58
The hazard function.   59
Conclusions
• For customary Wikipedia editors,
  – the survival function can be well described by a
    Weibull distribution (with a median lifetime of
    about 53 days);
  – there are two critical phases (0-2 weeks and 8-20
    weeks) when the hazard rate of becoming inactive
    increases;
  – more active editors tend to stay active in editing
    for a longer time.

                                                     60
Why do they “die”?




                     61
Covariates
Last
Edit




                    62
Cox Proportional Hazards Model
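In its usual textbook form, with h_0 the baseline hazard and X_1, ..., X_p the covariates, the Cox proportional hazards model is written as:

```latex
h(t \mid X) = h_0(t) \exp\left( \sum_{i=1}^{p} \beta_i X_i \right)
```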




                                 65
Semi-Parametric
• The semi-parametric property of the Cox
  model explains its popularity
  – The baseline hazard is left unspecified
  – Robust: its results closely approximate those of
    the correct parametric model
  – It makes a minimum of assumptions




                                                      66
Cox PH vs. Logistic




                      67
Maximum Likelihood Estimation
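For the Cox model, the coefficients are usually estimated by maximising a partial likelihood rather than a full likelihood. Assuming no tied event times, and writing R(t_j) for the set of individuals still at risk just before the j-th event time, the standard form is:

```latex
L(\beta) = \prod_{j \,:\, \text{event at } t_j}
           \frac{\exp(\beta^\top X_j)}{\sum_{l \in R(t_j)} \exp(\beta^\top X_l)}
```

The baseline hazard cancels out of each ratio, which is why it can stay unspecified.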




                                68
Cox Proportional Hazards Model


      Covariate               β         se        z          p
      X1: namespace==Main    -0.1095   0.0172   -6.3664    0.1935e-9
      X2: log(1+cur_size)    -0.0688   0.0036   -19.2474   0.0000e-9




                                                        69
Hazard Ratio
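A hazard ratio is obtained by exponentiating a Cox coefficient. The sketch below applies this to the two coefficients reported on the Cox model slide; the Wald-style 95% interval and the function name are assumptions added for illustration.

```python
import math

# Coefficients and standard errors as reported on the Cox model slide.
coefficients = {
    "namespace==Main": (-0.1095, 0.0172),
    "log(1+cur_size)": (-0.0688, 0.0036),
}


def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower, upper) using a Wald interval on the log scale."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


for name, (beta, se) in coefficients.items():
    hr, lower, upper = hazard_ratio(beta, se)
    print(f"{name}: HR={hr:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Both ratios fall below 1, so under the model a last edit in the Main namespace, or a larger value of log(1+cur_size), is associated with a lower hazard of leaving.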




               70
Adjusted Survival Curves




                           71
Next Step




            73
Cartoon: Ron Hipschman
Data: David Hand 74
Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
  – He was struck by lightning 7 times
     •   1942 (lost big-toe nail)
     •   1969 (lost eyebrows)
     •   1970 (left shoulder seared)
     •   1972 (hair set on fire)
     •   1973 (hair set on fire & legs seared)
     •   1976 (ankle injured)
     •   1977 (chest & stomach burned)
  – He committed suicide in September 1983.

                                                     75
A Lot More To Do
• Multiple Occurrences of “Death”
  – Recurrent Event Survival Analysis (e.g., based on
    Counting Process)
• Multiple Types of “Death”
  – Competing Risks Survival Analysis




                                                        76
Software Tools
• R
  – The ‘survival’ package
• Matlab
  – The ‘statistics’ toolbox
• Python
  – The ‘statsmodels’ module?




                                 77
References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning
  Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics.
  Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors
  Keep Active? In Proceedings of the 8th International Symposium on Wikis
  and Open Collaboration (WikiSym), Linz, Austria, Aug 2012.
  http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal
  Dynamics. The Computing Research Repository (CoRR) abs/1110.5051. Oct
  2011. http://goo.gl/s2Dex




                                                                         78
?

    79