This document discusses survival analysis and its application to analyzing the departure dynamics of Wikipedia editors. It begins by defining survival analysis and its goal of modeling time-to-event data using techniques that account for censoring. A case study is presented on analyzing data from 110,000 Wikipedia editors to determine who is likely to stop editing, how long they will continue editing, and why they stop. Statistical techniques like the Kaplan-Meier estimator, Cox proportional hazards models, and adjusted survival curves are used to analyze editing durations and identify covariates that impact the hazard rate of editors stopping contributions.
4. Time-To-Event Data
• Survival Analysis is a branch of statistics which deals with the modelling of time-to-event data.
– The outcome variable of interest is the time until an event occurs.
• death, disease, failure
• recovery, marriage
– It is called reliability theory/analysis in engineering, and duration analysis/modelling in economics or sociology.
5-7. Y, X
• How to build a probabilistic model of Y?
• How to build a probabilistic model of Y given X?
8. Censoring
• A key problem in survival analysis
– It occurs when we have some information about an individual’s survival time, but we don’t know the survival time exactly.
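To make this concrete, here is a minimal sketch (in Python, with made-up numbers) of how right-censored data are typically represented: each subject gets a recorded duration plus a flag saying whether the event was actually observed or the subject was censored first.

```python
import numpy as np

# Illustrative durations (e.g., weeks on study); the numbers are made up.
durations = np.array([5.0, 6.0, 6.0, 2.5, 4.0, 4.0, 9.0, 12.0])

# 1 = the event (e.g., death) was observed; 0 = right-censored,
# i.e., the subject left the study (or the study ended) while still
# event-free, so we only know the true event time exceeds the duration.
observed = np.array([1, 0, 1, 1, 1, 0, 1, 1])
```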
10. Y, X
Options:
1) Wait for those patients to die?
2) Discard the censored data?
3) Use the censored data as if they were not censored?
4) ……
11. Goals
• Survival Analysis attempts to answer questions such as
– What is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die?
– Can multiple causes of death be taken into account?
– How do particular circumstances or characteristics increase or decrease the odds of survival?
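As an illustration of the first question, the Kaplan-Meier estimator computes the survival function from (duration, event) pairs while properly accounting for censoring. A minimal sketch using the third-party lifelines library (one option alongside R's 'survival' package and Python's 'statsmodels', which the talk's software slide lists), reusing the made-up numbers above:

```python
import numpy as np
from lifelines import KaplanMeierFitter  # pip install lifelines

durations = np.array([5.0, 6.0, 6.0, 2.5, 4.0, 4.0, 9.0, 12.0])
observed = np.array([1, 0, 1, 1, 1, 0, 1, 1])  # 0 = right-censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)

# S(t): estimated fraction of the population surviving past each time t.
print(kmf.survival_function_)
print(kmf.median_survival_time_)  # time at which S(t) drops to 0.5
```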
12.
• Censoring of data
• Comparing groups
– (1 treatment vs. 2 placebo)
• Confounding or Interaction factors
– Log WBC
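Comparing groups such as the treatment vs. placebo setup above is commonly done with a log-rank test. A hedged sketch with lifelines; the two groups' arrays are hypothetical, not the actual study data:

```python
from lifelines.statistics import logrank_test

# Hypothetical data: group 1 = treatment, group 2 = placebo.
t_treat = [6, 6, 6, 7, 10, 13, 16, 22, 23]
e_treat = [1, 1, 1, 1, 0, 1, 1, 1, 1]   # 0 = censored
t_plac  = [1, 1, 2, 2, 3, 4, 4, 5, 5]
e_plac  = [1, 1, 1, 1, 1, 1, 1, 1, 1]

result = logrank_test(t_treat, t_plac,
                      event_observed_A=e_treat, event_observed_B=e_plac)
print(result.test_statistic, result.p_value)
```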
14. The Data Are There
• Events meaningful to online marketing
– Time to Clicking the Ad
– Informational: Time to Finding the Wanted Info
– Transactional: Time to Buying the Product
– Social: Time to Joining/Leaving the Community
– ……
Time Matters!
15. Evidence-Based Marketing
• Let’s work as (real) doctors
– Users = Patients
– Advertisement (Marketing) = Treatment
Survival Analysis brings the time dimension back to the centre stage.
23. Departure Dynamics
• Who are likely to “die”?
• How soon will they “die”?
• Why do they “die”?
“live” = stay in the editors’ community = keep editing
“die” = leave the editors’ community = stop editing (for 5 months)
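A sketch of how such (duration, event) pairs might be derived from raw edit logs under the 5-month inactivity rule. The column names (editor_id, edit_time), the snapshot date, the 150-day approximation of 5 months, and the pandas-based approach are all illustrative assumptions, not the talk's actual pipeline:

```python
import pandas as pd

INACTIVE = pd.Timedelta(days=150)  # roughly 5 months

# Hypothetical edit log: one row per edit.
log = pd.DataFrame({
    "editor_id": [1, 1, 1, 2, 2],
    "edit_time": pd.to_datetime([
        "2010-01-01", "2010-02-10", "2010-03-01",
        "2010-01-15", "2011-06-20",
    ]),
})
snapshot = pd.Timestamp("2011-09-01")  # when the data were collected

per_editor = log.groupby("editor_id")["edit_time"].agg(["min", "max"])
per_editor["duration_days"] = (per_editor["max"] - per_editor["min"]).dt.days
# "Dead" (stopped editing) if no edit within 5 months of the snapshot;
# otherwise the editor is still alive and the observation is censored.
per_editor["event"] = (snapshot - per_editor["max"]) > INACTIVE
print(per_editor[["duration_days", "event"]])
```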
33. Gradient Boosted Trees (GBT)
• The success of GBT in our task is probably attributable to
– its ability to capture the complex nonlinear relationship between the target variable and the features,
– its insensitivity to different feature value ranges as well as outliers, and
– its resistance to overfitting via regularisation mechanisms such as shrinkage and subsampling (Friedman 1999a; 1999b).
• GBT vs RF
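A hedged sketch of the technique using scikit-learn's GradientBoostingRegressor, showing where the two regularisation mechanisms named above appear as hyperparameters; the feature matrix and targets below are placeholders, not the WikiChallenge features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # placeholder behavioural features
y = X[:, 0] ** 2 + rng.normal(size=500)  # placeholder nonlinear target

gbt = GradientBoostingRegressor(
    n_estimators=300,
    learning_rate=0.05,  # shrinkage
    subsample=0.8,       # stochastic gradient boosting (subsampling)
    max_depth=3,
)
gbt.fit(X, y)
print(gbt.score(X, y))
```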
38. Final Result
• The 2nd best valid algorithm in the WikiChallenge
– RMSLE = 0.862582: 41.7% improvement over WMF’s in-house solution
– Much simpler model than the top performing system: 21 behavioural dynamics features vs. 206 features
– WMF is now implementing this algorithm permanently and looks forward to using it in the production environment.
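For reference, RMSLE (root mean squared logarithmic error), the WikiChallenge metric quoted above, can be computed as follows; the sample arrays are made up:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error:
    sqrt(mean((log(1 + y_pred) - log(1 + y_true))^2))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([3, 0, 12], [2, 1, 10]))  # made-up example
```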
56. Hazard Function
Of those that survive, at what rate will they die?
The instantaneous potential per unit time for the event to occur, given that the individual has survived to time t.
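Stated formally (the standard definition, added here for completeness): if T is the survival time with density f(t) and survival function S(t) = P(T > t), then

```latex
h(t) \;=\; \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \;=\; \frac{f(t)}{S(t)}
```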
60. Conclusions
• For customary Wikipedia editors,
– the survival function can be well described by a Weibull distribution (with a median lifetime of about 53 days);
– there are two critical phases (0-2 weeks and 8-20 weeks) when the hazard rate of becoming inactive increases;
– more active editors tend to keep active in editing for a longer time.
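A sketch of fitting a Weibull survival model to censored durations with lifelines' WeibullFitter, one way to obtain the kind of fit and median lifetime reported above; the data here are placeholders, not the Wikipedia dataset. lifelines parameterises S(t) = exp(-(t/λ)^ρ), so the median lifetime is λ·(ln 2)^(1/ρ).

```python
import numpy as np
from lifelines import WeibullFitter

# Placeholder editor lifetimes in days (not the actual Wikipedia data).
durations = np.array([10, 35, 53, 53, 80, 120, 200, 365])
observed = np.array([1, 1, 1, 0, 1, 1, 0, 0])  # 0 = still active (censored)

wf = WeibullFitter()
wf.fit(durations, event_observed=observed)
print(wf.lambda_, wf.rho_)       # scale and shape of S(t) = exp(-(t/lambda)^rho)
print(wf.median_survival_time_)  # lambda * ln(2) ** (1 / rho)
```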
66. Semi-Parametric
• The semi-parametric property of the Cox model => its popularity
– The baseline hazard is unspecified
– Robust: it will closely approximate the correct parametric model
– Using a minimum of assumptions
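A sketch of fitting a Cox proportional hazards model with lifelines' CoxPHFitter; the covariate names and values are placeholders. The baseline hazard is left unspecified, exactly as described above: only the covariate coefficients enter the partial likelihood, which depends on the order of event times rather than their exact values.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Placeholder data: duration, event flag, and two illustrative covariates.
df = pd.DataFrame({
    "T": [5, 6, 6, 2, 4, 4, 9, 12],
    "E": [1, 0, 1, 1, 1, 0, 1, 1],
    "edits_per_week": [3, 10, 1, 0.5, 2, 8, 4, 15],
    "registered_2007": [0, 1, 0, 0, 1, 1, 0, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="T", event_col="E")
cph.print_summary()  # hazard ratio for each covariate
```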
75. Lightning Does Strike Twice!
• Roy Sullivan, a former park ranger from Virginia
– He was struck by lightning 7 times
• 1942 (lost big-toe nail)
• 1969 (lost eyebrows)
• 1970 (left shoulder seared)
• 1972 (hair set on fire)
• 1973 (hair set on fire & legs seared)
• 1976 (ankle injured)
• 1977 (chest & stomach burned)
– He committed suicide in September 1983.
76. A Lot More To Do
• Multiple Occurrences of “Death”
– Recurrent Event Survival Analysis (e.g., based on Counting Process)
• Multiple Types of “Death”
– Competing Risks Survival Analysis
77. Software Tools
• R
– The ‘survival’ package
• Matlab
– The ‘statistics’ toolbox
• Python
– The ‘statsmodels’ module (statsmodels.duration), or the dedicated ‘lifelines’ package
78. References
• David G. Kleinbaum and Mitchel Klein. Survival Analysis: A Self-Learning Text. Springer, 3rd edition, 2011. http://goo.gl/wFtta
• John Wallace. How Big Data is Changing Retail Marketing Analytics. Webinar, Apr 2005. http://goo.gl/OlMmi
• Dell Zhang, Karl Prior, and Mark Levene. How Long Do Wikipedia Editors Keep Active? In Proceedings of the 8th International Symposium on Wikis and Open Collaboration (WikiSym), Linz, Austria, Aug 2012. http://goo.gl/On3qr
• Dell Zhang. Wikipedia Edit Number Prediction based on Temporal Dynamics. The Computing Research Repository (CoRR), abs/1110.5051, Oct 2011. http://goo.gl/s2Dex