Deep Time-to-Failure
Gianmario Spacagna
IBM #PartyCloud - Data Science Milan meetup
20 September 2018 @ Spirit of Milan
Companies' main assets
▫ Machinery
▫ Customers
Bathtub failure curve
Data availability (machinery)
▫ Historical time series of machine sensor telemetry
▫ Registered failures
▫ Real-time measurements from active machines
Predictive Maintenance
Data availability (customers)
▫ Historical sequences of customer behaviour
▫ Cancelled subscriptions
▫ Daily activities of active customers
Churn prediction
Time-to-failure events
▫ Patients' expected lifetime
▫ Political leadership duration
▫ Product wearing out
▫ Financial default
▫ Employees quitting
Data scarcity
▫ Sensor failures (not of the machine or user device itself)
▫ Loss of connectivity
▫ Errors in data collection
▫ IT blackouts
▫ Dropouts
▫ Study termination
▫ Active machines/users that have not “failed” yet
First part
Traditional Survival Analysis
Goals
1. Build a model that can estimate the probability
distribution of remaining time-to-failure
2. Train the model by exploiting censored historical data.
Right-censoring
Survival Function
S(t)=Pr(T>t)
T = failure event time
Probability that the time-to-failure is greater than t
Kaplan-Meier estimator
Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i / n_i)
d_i: events that happened at time t_i
n_i: individuals known to have survived up to time t_i
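As a quick sketch, this estimator can be computed with the lifelines Python library (the durations and event flags below are hypothetical):

    from lifelines import KaplanMeierFitter

    durations = [5, 6, 6, 2, 4, 4]   # observed durations (hypothetical)
    observed  = [1, 0, 0, 1, 1, 1]   # 1 = failure observed, 0 = right-censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)
    print(kmf.survival_function_)    # estimated S(t) at each event time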
Hazard function
h(t) = Pr[T = t | T ≥ t]
Probability that the failure happens at time t, given that the individual survived up to time t.
Can be interpreted as a measure of the hazard risk, i.e. the probability that the failure happens right now.
Cumulative Hazard function
Λ(t) = ∫₀ᵗ h(z) dz
Represents the integral of the hazard-rate.
Nelson-Aalen estimator
Λ̂(t) = ∑_{t_i ≤ t} d_i / n_i
d_i: events that happened at time t_i
n_i: individuals known to have survived up to time t_i
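The same (hypothetical) data also fits the Nelson-Aalen estimator in lifelines:

    from lifelines import NelsonAalenFitter

    naf = NelsonAalenFitter()
    naf.fit(durations, event_observed=observed)  # same data as the Kaplan-Meier sketch
    print(naf.cumulative_hazard_)                # estimated Λ(t)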
Survival Regression
We have covariates X that we would like to use to map each individual to their own survival/hazard function.
Features:
● age
● gender
● weight
● is a smoker?
● weekly sport time
Model
● Survival function
● Hazard function
Cox’s Proportional Hazard model
The log-hazard of an individual is a linear function of their static covariates, plus a population-level baseline hazard that changes over time.
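A minimal sketch with lifelines, assuming a DataFrame with hypothetical covariates plus duration and event columns:

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "age":       [63, 45, 52, 70],   # hypothetical covariates
        "is_smoker": [1, 0, 1, 0],
        "T":         [5, 6, 2, 4],       # duration
        "E":         [1, 0, 1, 1],       # 1 = failure observed
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="T", event_col="E")
    cph.print_summary()                  # fitted log-hazard coefficients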
Aalen’s Additive model
The hazard rate is a linear
function of multiple baselines
weighted by their corresponding
covariates
Cox’s Time Varying Proportional Hazard
model
If covariates change over time: X(t)
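A sketch with lifelines' CoxTimeVaryingFitter, which expects long-format data with one row per (individual, interval); the data here is hypothetical:

    import pandas as pd
    from lifelines import CoxTimeVaryingFitter

    long_df = pd.DataFrame({
        "id":    [1, 1, 2, 2],           # individual identifier
        "start": [0, 3, 0, 2],           # interval start
        "stop":  [3, 5, 2, 6],           # interval end
        "x":     [0.1, 0.4, 0.2, 0.9],   # covariate value over the interval
        "E":     [0, 1, 0, 0],           # event observed at the end of the interval
    })

    ctv = CoxTimeVaryingFitter()
    ctv.fit(long_df, id_col="id", event_col="E", start_col="start", stop_col="stop")
    ctv.print_summary()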
Survival Regression
Limitations
▫ Traditional survival regression models can only handle categorical or static numerical attributes
▫ Cox's Time Varying Proportional Hazard model cannot perform predictions, because that would require knowing the future values of the covariates
▫ The models do not take into account the sequence of measurements, only the current status
▫ Non-parametric models are hard to generalize
Second part
Embrace the Weibull and Deep Learning euphoria!
Weibull Distribution
α = λ: scale parameter
β = k: shape parameter
Universal PDF for scientists and engineers
Source: https://ragulpr.github.io/2016/12/22/WTTE-RNN-Hackless-churn-modeling/#embrace-the-weibull-euphoria
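For reference, a minimal sketch of the Weibull density and survival function in terms of α (scale) and β (shape):

    import numpy as np

    def weibull_pdf(t, alpha, beta):
        # f(t) = (β/α) (t/α)^(β-1) exp(-(t/α)^β)
        return (beta / alpha) * (t / alpha) ** (beta - 1) * np.exp(-((t / alpha) ** beta))

    def weibull_survival(t, alpha, beta):
        # S(t) = exp(-(t/α)^β)
        return np.exp(-((t / alpha) ** beta))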
Weibull Time-to-Event Recurrent Neural Net
“An algorithm & philosophy about
predicting when things will happen.”
Egil Martinsson
https://github.com/ragulpr/wtte-rnn
Recurrent Neural Networks
WTTE-RNN architecture
The output layer consists of the two Weibull parameters (α, β)
Loss function
u = E = 1: uncensored (failure observed)
u = E = 0: censored
Source: https://ragulpr.github.io/assets/draft_master_thesis_martinsson_egil_wtte_rnn_2016.pdf
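A sketch of this censored Weibull log-likelihood as a Keras loss (function and variable names are mine):

    import keras.backend as K

    def weibull_loglik(y_true, y_pred, eps=1e-8):
        t, u = y_true[..., 0], y_true[..., 1]   # observed time, event indicator
        a, b = y_pred[..., 0], y_pred[..., 1]   # Weibull alpha, beta
        scaled = (t + eps) / a
        # uncensored (u=1): log f(t) up to a constant; censored (u=0): log S(t)
        loglik = u * (K.log(b) + b * K.log(scaled)) - K.pow(scaled, b)
        return -K.mean(loglik)                  # minimize the negative log-likelihood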
WTTE-RNN in action
Censored interval
Deep Time-to-Failure
▫ Extension of WTTE for failure events
▫ Only one single event to predict (the failure)
▫ Prediction is done at the end of the observed sequence instead of at each step
Failure event ?
Case Study: NASA jet engine degradation
Raw data
Data preparation
▫ Selected 17 relevant features
▫ Values normalized between -1 and 1
▫ Maximum lookback period = 100
▫ Each point corresponds to the subsequence between time 0 and t, up to the failure event
▫ Shorter sequences padded with a special mask value of -99 (as sketched below)
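A sketch of how a single padded subsequence could be built (helper names are mine):

    import numpy as np

    MASK, LOOKBACK = -99.0, 100

    def build_subsequence(seq, t, lookback=LOOKBACK, mask=MASK):
        """Left-pad the subsequence seq[0:t+1] with the mask value."""
        sub = seq[: t + 1][-lookback:]                    # keep at most `lookback` steps
        padded = np.full((lookback, seq.shape[1]), mask)
        padded[-len(sub):] = sub
        return padded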
Train/test splits
train_x (20631, 100, 17)
train_y (20631, 2)
test_x (100, 100, 17)
test_y (100, 2)
X contains:
axis 0: subsequence identifier
axis 1: time (100 steps)
axis 2: covariate features
Y contains:
column 0 (T'): latest observation time
column 1 (E'): 1 if the failure event was observed, 0 otherwise
Training with Censored Data
T = failure time
T'(t) = min(T, t)
E'(t) = 1 if T ≤ t (observed), else 0 (censored)
Only the full sequence observes the failure; all other subsequences have E' = 0 (see the sketch below).
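A direct translation of these label definitions into a small helper (the name is mine):

    def censored_label(T, t):
        """Label (T', E') for the subsequence truncated at time t, given failure time T."""
        T_prime = min(T, t)             # T'(t) = min(T, t)
        E_prime = 1 if T <= t else 0    # failure observed only if it falls inside the window
        return T_prime, E_prime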
Build the model
LSTM Vs. GRU
RNN parameters
▫ stateful = False
- each subsequence is independent
▫ return_state=False
- only return the output of the recurrent layer
▫ return_sequences = False
- unlike WTTE-RNN, where it is True
- we modeled the problem to give one prediction at the end of each subsequence instead
Initialize alpha and beta
The parameter alpha is proportional to the mean failure time, so we initialize it with the mean of the observed failure times.
Beta relates to the variance of the distribution, so we cap it by setting max_beta_value to 100 time steps.
Output layer activations
We replaced the softplus activations from the original paper with:
alpha neuron (a): alpha = exp(a) * init_alpha
beta neuron (b): beta = sigmoid(b) * max_beta_value
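A minimal Keras sketch of the architecture with these activations; the init_alpha and max_beta values are placeholders:

    from keras.models import Sequential
    from keras.layers import Masking, GRU, Dense, Lambda
    import keras.backend as K

    init_alpha, max_beta = 180.0, 100.0          # e.g. mean observed failure time, beta cap

    def output_activation(x):
        a = K.exp(x[..., 0:1]) * init_alpha      # alpha = exp(a) * init_alpha
        b = K.sigmoid(x[..., 1:2]) * max_beta    # beta = sigmoid(b) * max_beta_value
        return K.concatenate([a, b], axis=-1)

    model = Sequential([
        Masking(mask_value=-99.0, input_shape=(100, 17)),  # skip padded timesteps
        GRU(20, activation="tanh", return_sequences=False),
        Dense(2),
        Lambda(output_activation),
    ])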
Architecture diagram
Trainable parameters
Tunable Hyper-parameters
GRU:
▫ activation='tanh'
▫ recurrent_dropout=0.25
▫ dropout=0.3
Optimizer:
▫ algorithm: adam
▫ lr=0.01 (learning_rate)
▫ clipnorm=1
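Putting the listed settings together, a compile-and-fit sketch (epoch and batch-size values are placeholders):

    from keras.optimizers import Adam

    model.compile(
        loss=weibull_loglik,                     # censored Weibull loss sketched earlier
        optimizer=Adam(lr=0.01, clipnorm=1),
    )
    model.fit(train_x, train_y,
              epochs=100, batch_size=128,        # placeholder values
              validation_data=(test_x, test_y))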
Training
Loss function over epochs
● Validation loss < training loss
● Test data contains only full sequences:
- easier to predict
- higher accuracy
- lower loss
Training biases and weights
Inspecting Final Recurrent Activations
[Diagram: the covariates (x1, x2, x3) of the i-th subsequence feed a hidden layer of 20 GRUs, which encodes the subsequence into a 20-dim vector; 2 output nodes produce the Weibull parameters 𝛂 and 𝛃, compared against the ground truth (T, E).]
Inspecting Hidden Recurrent States
[Diagram: at each time t_i the input covariates (x1, x2, x3) of the subsequence update the 20 recurrent states, giving the encoding at time t_i, which yields the Weibull parameters 𝛂 and 𝛃 and the output T at time t_i.]
Debugging nan weights
Can happen if the beta value is too large (e.g. max_beta > 1000)
▫ If you need to reduce beta:
- downscale the time axis
- use shorter sequences
▫ If using the TensorFlow backend, set epsilon to 1e-10
▫ Add a clipvalue of 0.5 or less to the optimizer
▫ Clip the log-likelihood
▫ Pre-train the output layer (transfer learning)
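A sketch of the epsilon and gradient-clipping fixes with the Keras backend API:

    import keras.backend as K
    from keras.optimizers import Adam

    K.set_epsilon(1e-10)                         # smaller epsilon for the TensorFlow backend
    optimizer = Adam(lr=0.01, clipvalue=0.5)     # clip gradient values to avoid nan weights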
Evaluating global distribution
Evaluating last subsequences
Weibull distributions
One curve for each engine
Only the last subsequence is evaluated
Dots represent the expected mode (maximum likelihood), computed as sketched below
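The mode of a Weibull distribution has a closed form for β > 1; a small sketch:

    def weibull_mode(alpha, beta):
        # the pdf peaks at alpha*((beta-1)/beta)^(1/beta) when beta > 1, at 0 otherwise
        return alpha * ((beta - 1.0) / beta) ** (1.0 / beta) if beta > 1 else 0.0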
𝛃 (∝ variance) Vs. 𝛂 (∝ mean)
Precision Vs. T
Residual Errors
Single engine TTF probability distribution over time
All of the Weibull distributions of T estimated at each timestep, for one engine
Single engine T prediction over time
T: time-to-failure
t: current time
Conclusions
We learned a technique, deep-ttf, which extends WTTE-RNN to the specific case of predicting a single failure event.
The strengths of this approach are:
▫ It consumes raw time series or sequences
▫ It trains with both censored and uncensored data
▫ It produces probabilistic predictions with confidence intervals
▫ It can be applied to any survival regression problem
References
Tutorial link: https://github.com/gm-spacagna/deep-ttf
NASA data: https://c3.nasa.gov/dashlink/resources/139/
wtte-rnn: https://github.com/ragulpr/wtte-rnn
R. Cawley et al., Analysis of WTTE-RNN variants that improve performance
A. S. Palau et al., Recurrent Neural Networks for real-time distributed collaborative prognostics
