1) The document discusses machine learning techniques for dynamic data environments where the data distribution may change over time.
2) It analyzes trade-offs between retraining models on new data versus using transfer learning when changes are detected, finding that transfer learning can alleviate the bias-variance trade-off.
3) The effectiveness of transfer learning depends on factors like the relative amounts of same-distribution and different-distribution data and the complexity of the model.
9. [Figure: Data → Model → Prediction; after a change: New Data → New Model → Prediction]
• New data is often scarce (esp. right after change)
• It is unclear when the change actually happened
However …
10. What to do?
• Can / should we enhance our model robustness by increasing the new training data sample size by leveraging historical data?
• Should we retrain the model immediately when change is detected, or later, once more new data has become available?
Ø Augment the new data set!
11. What to do?
• Bias vs. Variance Trade-off
• Can / should we enhance our model robustness by increasing the new training data sample size by leveraging historical data?
Ø Transfer learning paradigm
• Exploration vs. Exploitation Trade-off
• Should we retrain the model immediately when change is detected, or later, once more new data has become available?
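The idea of enlarging the scarce post-change sample with historical data can be sketched as a weighted least-squares fit that pools the new sample with down-weighted historical data. The function name, the linear model, and the weight `w_hist` below are illustrative assumptions, not the method from the slides:

```python
import numpy as np

def fit_weighted_linear(X_new, y_new, X_hist, y_hist, w_hist=0.5):
    """Least-squares fit on new data pooled with down-weighted historical data.

    w_hist is a hypothetical knob: 0 -> retrain on the new data only,
    1 -> equal-weight pooling of historical and new samples.
    Weighted least squares is implemented by scaling rows with sqrt(weight).
    """
    X = np.vstack([X_new, X_hist]).astype(float)
    y = np.concatenate([y_new, y_hist]).astype(float)
    w = np.concatenate([np.ones(len(y_new)),
                        np.full(len(y_hist), float(w_hist))])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef
```

With `w_hist = 0` this reduces exactly to retraining on the new sample; intermediate values trade the bias of pre-change data against the variance of the small post-change sample.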
14. Theoretical Analysis
• Difference in data environment (pre-change vs. post-change) modeled as sample selection
– s = 1 : diff-distribution
– s = 0 : same-distribution
• Empirical risk minimization (ERM)
– Minimize the average loss over the training sample
• Weight each example based on its sample-selection status
15. • Expected risk in the target data
• Empirical risk using both same- and diff-distribution data
• Empirical risk using only same-distribution data
To transfer or not transfer
S-S : Same-distribution source data (q)
S-D : Diff-distribution source data (p)
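The weighted empirical risk from the ERM setup above can be written down directly: each source example carries a weight derived from its sample-selection status (same- vs. diff-distribution). A minimal sketch with squared loss and a linear model; the function name and weighting scheme are illustrative assumptions:

```python
import numpy as np

def weighted_empirical_risk(beta, X, y, w):
    """Weighted empirical risk: (sum_i w_i * loss_i) / (sum_i w_i).

    w_i encodes the sample-selection indicator: same-distribution points
    keep w_i = 1, while diff-distribution points can be down-weighted
    (0 < w_i < 1) or dropped entirely (w_i = 0, i.e. non-transfer).
    """
    losses = (np.asarray(X) @ np.asarray(beta) - np.asarray(y)) ** 2
    return float(np.average(losses, weights=w))
```

Setting all diff-distribution weights to zero recovers the "retraining (dropping)" strategy compared later, while equal weights recover naive pooling.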
16. • Δd : difference of the upper bounds on the loss between non-transfer learning and transfer learning
Effectiveness of Transfer Learning
Relative size of diff- vs. same-distribution data examples
Complexity of the model
Extent of data change
17. Effectiveness of Transfer Learning
• Depends on …
• The amount of same-distribution source data (q) relative to the diff-distribution source data (p)
• The number of predictors being used in the prediction model (b)
• The extent of change across the source and the target data sets (a/b)
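These factors can be probed with a small simulation: fit a linear model either on pooled pre- and post-change data (transfer) or on post-change data only (retrain), and compare test MSE after the change. All sample sizes, dimensions, and shift magnitudes below are illustrative assumptions, not values from the slides:

```python
import numpy as np

def compare_strategies(n_same=5, n_diff=300, d=3, shift=0.1, n_reps=50, seed=0):
    """Average post-change test MSE of transfer (pool diff + same data)
    vs. non-transfer (fit on same-distribution data only).

    shift plays the role of the extent of change; n_same vs. n_diff plays
    the role of the q vs. p balance. All numbers are illustrative.
    """
    rng = np.random.default_rng(seed)
    mse_transfer, mse_retrain = 0.0, 0.0
    for _ in range(n_reps):
        beta_old = rng.normal(size=d)
        beta_new = beta_old + shift  # post-change coefficients
        X_diff = rng.normal(size=(n_diff, d))
        y_diff = X_diff @ beta_old + rng.normal(size=n_diff)
        X_same = rng.normal(size=(n_same, d))
        y_same = X_same @ beta_new + rng.normal(size=n_same)
        X_test = rng.normal(size=(1000, d))
        y_test = X_test @ beta_new + rng.normal(size=1000)
        # Transfer: pool both sources; non-transfer: same-distribution only.
        b_t, *_ = np.linalg.lstsq(np.vstack([X_diff, X_same]),
                                  np.concatenate([y_diff, y_same]), rcond=None)
        b_r, *_ = np.linalg.lstsq(X_same, y_same, rcond=None)
        mse_transfer += np.mean((X_test @ b_t - y_test) ** 2) / n_reps
        mse_retrain += np.mean((X_test @ b_r - y_test) ** 2) / n_reps
    return mse_transfer, mse_retrain
```

With scarce post-change data and a small shift, pooling tends to win; with ample post-change data and a large shift, retraining wins, matching the stated dependence on q/p and the extent of change.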
20. • ADWIN algorithm: monitoring the out-of-sample prediction error of a pre-trained model
[Figure: two segments of 1,000 data points each, r = 1 before the change and r = 0 after]
Detecting Changes in Data Patterns
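ADWIN's adaptive windowing with Hoeffding-style cut bounds is more involved than fits on a slide; as a simplified stand-in, here is a fixed-window mean-shift detector over the error stream. The function name, window size, and threshold are illustrative assumptions, not the ADWIN algorithm itself:

```python
import statistics

def detect_change(errors, window=40, threshold=3.0):
    """Return the first index at which a shift in mean error is flagged,
    or None. A simplified stand-in for ADWIN: slide a window over the
    error stream and compare the means of its two halves; flag a change
    when the gap exceeds `threshold` pooled standard errors.
    """
    half = window // 2
    for t in range(window, len(errors) + 1):
        old = errors[t - window:t - half]
        new = errors[t - half:t]
        mu0, mu1 = statistics.fmean(old), statistics.fmean(new)
        s0, s1 = statistics.pstdev(old), statistics.pstdev(new)
        se = ((s0 ** 2 + s1 ** 2) / half) ** 0.5 or 1e-12
        if abs(mu1 - mu0) / se > threshold:
            return t - 1
    return None
```

In the slides' setting, `errors` would be the out-of-sample prediction errors of the pre-trained model; a flagged index is the estimated change point.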
21. Ø In response to changes …
• Using transfer learning
– Transfer – weighting / equal weight
• Using only same-distribution source data
– Retraining (Dropping)
Ø Performance metrics
• Mean squared error (MSE)
– MSE = Bias² + Variance
Analysis Strategies Compared
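The decomposition MSE = Bias² + Variance used as the performance metric above can be checked numerically with a deliberately biased estimator (a shrunken sample mean; all numbers are illustrative):

```python
import random
import statistics

def mse_decomposition(true_mu=2.0, n=10, shrink=0.8, reps=5000, seed=0):
    """Monte-Carlo estimates of MSE, squared bias, and variance for the
    shrunken sample mean shrink * mean(x) of n Gaussian draws around
    true_mu. Over the same set of replications the identity
    MSE = Bias^2 + Variance holds exactly.
    """
    rng = random.Random(seed)
    est = [shrink * statistics.fmean(rng.gauss(true_mu, 1.0) for _ in range(n))
           for _ in range(reps)]
    mean_est = statistics.fmean(est)
    bias_sq = (mean_est - true_mu) ** 2
    variance = statistics.pvariance(est, mu=mean_est)
    mse = statistics.fmean((e - true_mu) ** 2 for e in est)
    return mse, bias_sq, variance
```

Shrinking trades variance (reduced by shrink²) for bias, which is exactly the trade that down-weighting diff-distribution data makes in the transfer strategies above.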
28. Ø Contributions
• Understand the effectiveness of transfer learning from a sample
selection perspective
• Trade-offs in response to changes in data patterns
– Bias-variance trade-off is alleviated by strategic transfer learning
– The tension of the exploration-exploitation trade-off differs between the two alternative strategies (using transfer learning or not).
Ø Implications for data analytics practice
• Consistently monitoring the prediction performance and reconsidering the fitness of the prediction model
• Development of a model that represents the changing environment
• Optimization of the waiting time needed for a reliable model adjustment
– Value (cost) of prediction error?
– Value of change detection accuracy?
Conclusions