Deep Time-to-Failure
Gianmario Spacagna
IBM #PartyCloud - Data Science Milan meetup
20 September 2018 @ Spirit of Milan
Companies' main assets
▫ Machinery
▫ Customers
Bathtub failure curve
Data availability (machinery)
▫ Historical time series of machine sensor telemetry
▫ Registered failures
▫ Real-time measurements from active machines
Predictive Maintenance
Data availability (customers)
▫ Historical sequences of customer behaviour
▫ Cancelled subscriptions
▫ Daily activities of active customers
Churn prediction
Time-to-failure events
▫ Patients' expected lifetime
▫ Political leadership duration
▫ Product wearing out
▫ Financial default
▫ Employees quitting
Data scarcity
▫ Sensor failures (not of the machine or user device itself)
▫ Loss of connectivity
▫ Errors in data collection
▫ IT blackouts
▫ Dropouts
▫ Study termination
▫ Active machines/users that have not “failed” yet
First part
Traditional Survival Analysis
Goals
1. Build a model that can estimate the probability
distribution of remaining time-to-failure
2. Train the model by exploiting censored historical data.
Right-censoring
Survival Function
S(t)=Pr(T>t)
T = failure event time
Probability that the time-to-failure is greater than t
Kaplan-Meier estimator
Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i / n_i)
d_i: events that happened at time t_i
n_i: individuals known to have survived up to time t_i
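As a quick sketch, this estimator can be computed with the lifelines Python library (the durations and event flags below are hypothetical):

    from lifelines import KaplanMeierFitter

    durations = [5, 6, 6, 2, 4, 4]   # observed durations (hypothetical)
    observed  = [1, 0, 0, 1, 1, 1]   # 1 = failure observed, 0 = right-censored

    kmf = KaplanMeierFitter()
    kmf.fit(durations, event_observed=observed)
    print(kmf.survival_function_)    # estimated S(t) at each event time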
Hazard function
h(t) = Pr[T = t | T ≥ t]
Probability that the failure happens at time t, given that the individual survived up to time t.
Can be interpreted as a measure of the hazard risk, i.e. the probability that the failure happens right now.
Cumulative Hazard function
Λ(t) = ∫₀ᵗ h(z) dz
Represents the integral of the hazard-rate.
Nelson-Aalen estimator
Λ̂(t) = ∑_{t_i ≤ t} d_i / n_i
d_i: events that happened at time t_i
n_i: individuals known to have survived up to time t_i
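The same (hypothetical) data also fits the Nelson-Aalen estimator in lifelines:

    from lifelines import NelsonAalenFitter

    naf = NelsonAalenFitter()
    naf.fit(durations, event_observed=observed)  # same data as the Kaplan-Meier sketch
    print(naf.cumulative_hazard_)                # estimated Λ(t)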
Survival Regression
We have covariates X that we would like to use to map each individual to their own survival/hazard function.
Features:
● age
● gender
● weight
● is a smoker?
● weekly sport time
Model
● Survival function
● Hazard function
Cox’s Proportional Hazard model
The log-hazard of an individual is a linear function of their static covariates, plus a population-level baseline hazard that changes over time.
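A minimal sketch with lifelines, assuming a DataFrame with hypothetical covariates plus duration and event columns:

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "age":       [63, 45, 52, 70],   # hypothetical covariates
        "is_smoker": [1, 0, 1, 0],
        "T":         [5, 6, 2, 4],       # duration
        "E":         [1, 0, 1, 1],       # 1 = failure observed
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="T", event_col="E")
    cph.print_summary()                  # fitted log-hazard coefficients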
Aalen’s Additive model
The hazard rate is a linear
function of multiple baselines
weighted by their corresponding
covariates
Cox’s Time Varying Proportional Hazard
model
If covariates change over time: X(t)
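A sketch with lifelines' CoxTimeVaryingFitter, which expects long-format data with one row per (individual, interval); the data here is hypothetical:

    import pandas as pd
    from lifelines import CoxTimeVaryingFitter

    long_df = pd.DataFrame({
        "id":    [1, 1, 2, 2],           # individual identifier
        "start": [0, 3, 0, 2],           # interval start
        "stop":  [3, 5, 2, 6],           # interval end
        "x":     [0.1, 0.4, 0.2, 0.9],   # covariate value over the interval
        "E":     [0, 1, 0, 0],           # event observed at the end of the interval
    })

    ctv = CoxTimeVaryingFitter()
    ctv.fit(long_df, id_col="id", event_col="E", start_col="start", stop_col="stop")
    ctv.print_summary()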
Survival Regression
Limitations
▫ Traditional survival regression models can only handle categorical or static numerical attributes
▫ Cox's Time Varying Proportional Hazard model cannot perform predictions, because that would require knowing the future values of the covariates
▫ The models do not take into account the sequence of measurements, only the current status
▫ Non-parametric models are hard to generalize
Second part
Embrace the Weibull and Deep Learning euphoria!
Weibull Distribution
α = λ: scale parameter
β = k: shape parameter
Universal PDF for scientists and engineers
Source: https://ragulpr.github.io/2016/12/22/WTTE-RNN-Hackless-churn-modeling/#embrace-the-weibull-euphoria
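For reference, a minimal sketch of the Weibull density and survival function in terms of α (scale) and β (shape):

    import numpy as np

    def weibull_pdf(t, alpha, beta):
        # f(t) = (β/α) (t/α)^(β-1) exp(-(t/α)^β)
        return (beta / alpha) * (t / alpha) ** (beta - 1) * np.exp(-((t / alpha) ** beta))

    def weibull_survival(t, alpha, beta):
        # S(t) = exp(-(t/α)^β)
        return np.exp(-((t / alpha) ** beta))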
Weibull Time-to-Event Recurrent Neural Net
“An algorithm & philosophy about
predicting when things will happen.”
Egil Martinsson
https://github.com/ragulpr/wtte-rnn
Recurrent Neural Networks
WTTE-RNN architecture
The output layer consists of the two Weibull parameters (α, β)
Loss function
u = E = 1: uncensored (failure observed)
u = E = 0: censored
Source: https://ragulpr.github.io/assets/draft_master_thesis_martinsson_egil_wtte_rnn_2016.pdf
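A sketch of this censored Weibull log-likelihood as a Keras loss (function and variable names are mine):

    import keras.backend as K

    def weibull_loglik(y_true, y_pred, eps=1e-8):
        t, u = y_true[..., 0], y_true[..., 1]   # observed time, event indicator
        a, b = y_pred[..., 0], y_pred[..., 1]   # Weibull alpha, beta
        scaled = (t + eps) / a
        # uncensored (u=1): log f(t) up to a constant; censored (u=0): log S(t)
        loglik = u * (K.log(b) + b * K.log(scaled)) - K.pow(scaled, b)
        return -K.mean(loglik)                  # minimize the negative log-likelihood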
WTTE-RNN in action
Censored interval
Deep Time-to-Failure
▫ Extension of WTTE for failure events
▫ Only one single event to predict (the failure)
▫ Prediction is done at the end of the observed sequence instead of at each step
Failure event ?
Case Study: NASA jet engine degradation
Raw data
Data preparation
▫ Selected 17 relevant features
▫ Values normalized between -1 and 1
▫ Maximum lookback period = 100
▫ Each point corresponds to the subsequence between time 0 and t, up to the failure event
▫ Shorter sequences padded with a special mask value of -99 (as sketched below)
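A sketch of how a single padded subsequence could be built (helper names are mine):

    import numpy as np

    MASK, LOOKBACK = -99.0, 100

    def build_subsequence(seq, t, lookback=LOOKBACK, mask=MASK):
        """Left-pad the subsequence seq[0:t+1] with the mask value."""
        sub = seq[: t + 1][-lookback:]                    # keep at most `lookback` steps
        padded = np.full((lookback, seq.shape[1]), mask)
        padded[-len(sub):] = sub
        return padded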
Train/test splits
train_x (20631, 100, 17)
train_y (20631, 2)
test_x (100, 100, 17)
test_y (100, 2)
X contains:
axis 0: subsequence identifier
axis 1: time (100 steps)
axis 2: covariate features
Y contains:
column 0 (T'): latest observation time
column 1 (E'): 1 if the failure event was observed, 0 otherwise
Training with Censored Data
T = failure time
T'(t) = min(T, t)
E'(t) = 1 if T ≤ t (observed), else 0 (censored)
Only the full sequence observes the failure; all other subsequences have E' = 0 (see the sketch below).
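A direct translation of these label definitions into a small helper (the name is mine):

    def censored_label(T, t):
        """Label (T', E') for the subsequence truncated at time t, given failure time T."""
        T_prime = min(T, t)             # T'(t) = min(T, t)
        E_prime = 1 if T <= t else 0    # failure observed only if it falls inside the window
        return T_prime, E_prime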
Build the model
LSTM Vs. GRU
RNN parameters
▫ stateful = False
- each subsequence is independent
▫ return_state=False
- only return the output of the recurrent layer
▫ return_sequences = False
- unlike WTTE-RNN, where it is True
- we modeled the problem to give one prediction at the end of each subsequence instead
Initialize alpha and beta
The parameter alpha is proportional to the mean failure time, so we initialize it with the mean of the observed failure times.
Beta relates to the variance of the distribution, so we cap it by setting max_beta_value to 100 time steps.
Output layer activations
We replaced the softplus activations from the original paper with:
alpha neuron (a): alpha = exp(a) * init_alpha
beta neuron (b): beta = sigmoid(b) * max_beta_value
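A minimal Keras sketch of the architecture with these activations; the init_alpha and max_beta values are placeholders:

    from keras.models import Sequential
    from keras.layers import Masking, GRU, Dense, Lambda
    import keras.backend as K

    init_alpha, max_beta = 180.0, 100.0          # e.g. mean observed failure time, beta cap

    def output_activation(x):
        a = K.exp(x[..., 0:1]) * init_alpha      # alpha = exp(a) * init_alpha
        b = K.sigmoid(x[..., 1:2]) * max_beta    # beta = sigmoid(b) * max_beta_value
        return K.concatenate([a, b], axis=-1)

    model = Sequential([
        Masking(mask_value=-99.0, input_shape=(100, 17)),  # skip padded timesteps
        GRU(20, activation="tanh", return_sequences=False),
        Dense(2),
        Lambda(output_activation),
    ])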
Architecture diagram
Trainable parameters
Tunable Hyper-parameters
GRU:
▫ activation='tanh'
▫ recurrent_dropout=0.25
▫ dropout=0.3
Optimizer:
▫ algorithm: adam
▫ lr=0.01 (learning_rate)
▫ clipnorm=1
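Putting the listed settings together, a compile-and-fit sketch (epoch and batch-size values are placeholders):

    from keras.optimizers import Adam

    model.compile(
        loss=weibull_loglik,                     # censored Weibull loss sketched earlier
        optimizer=Adam(lr=0.01, clipnorm=1),
    )
    model.fit(train_x, train_y,
              epochs=100, batch_size=128,        # placeholder values
              validation_data=(test_x, test_y))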
Training
Loss function over epochs
● Validation loss < training loss
● Test data contains only full sequences:
- easier to predict
- higher accuracy
- lower loss
Training biases and weights
Inspecting Final Recurrent Activations
[Diagram: the covariates (x1, x2, x3) of the i-th subsequence feed a hidden layer of 20 GRUs, which encodes the subsequence into a 20-dim vector; 2 output nodes produce the Weibull parameters 𝛂 and 𝛃, compared against the ground truth (T, E).]
Inspecting Hidden Recurrent States
[Diagram: at each time t_i the input covariates (x1, x2, x3) of the subsequence update the 20 recurrent states, giving the encoding at time t_i, which yields the Weibull parameters 𝛂 and 𝛃 and the output T at time t_i.]
Debugging nan weights
Can happen if the beta value is too large (e.g. max_beta > 1000)
▫ If you need to reduce beta:
- downscale the time axis
- use shorter sequences
▫ If using the TensorFlow backend, set epsilon to 1e-10
▫ Add a clipvalue of 0.5 or less to the optimizer
▫ Clip the log-likelihood
▫ Pre-train the output layer (transfer learning)
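A sketch of the epsilon and gradient-clipping fixes with the Keras backend API:

    import keras.backend as K
    from keras.optimizers import Adam

    K.set_epsilon(1e-10)                         # smaller epsilon for the TensorFlow backend
    optimizer = Adam(lr=0.01, clipvalue=0.5)     # clip gradient values to avoid nan weights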
Evaluating global distribution
Evaluating last subsequences
Weibull distributions
One curve for each engine
Only the last subsequence is evaluated
Dots represent the expected mode (maximum likelihood), computed as sketched below
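The mode of a Weibull distribution has a closed form for β > 1; a small sketch:

    def weibull_mode(alpha, beta):
        # the pdf peaks at alpha*((beta-1)/beta)^(1/beta) when beta > 1, at 0 otherwise
        return alpha * ((beta - 1.0) / beta) ** (1.0 / beta) if beta > 1 else 0.0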
𝛃 (∝ variance) Vs. 𝛂 (∝ mean)
Precision Vs. T
Residual Errors
Single engine TTF probability distribution over time
All of the Weibull distributions of T estimated at each timestep, for one engine
Single engine T prediction over time
T: time-to-failure
t: current time
Conclusions
We learned a technique, deep-ttf, which extends WTTE-RNN to the specific case of predicting a single failure event.
The strengths of this approach are:
▫ It consumes raw time series or sequences
▫ It trains with both censored and uncensored data
▫ It produces probabilistic predictions with confidence intervals
▫ It can be applied to any survival regression problem
References
Tutorial link: https://github.com/gm-spacagna/deep-ttf
NASA data: https://c3.nasa.gov/dashlink/resources/139/
wtte-rnn: https://github.com/ragulpr/wtte-rnn
R. Cawley et al., Analysis of WTTE-RNN variants that improve performance
A. S. Palau et al., Recurrent Neural Networks for real-time distributed collaborative prognostics
