Machine Learning in Dynamic Data Environments:
Trade-offs in Response to Changes
Jungpil Hahn
jungpil@nus.edu.sg
Machine Learning
Data → Model → Prediction
S : Source Data
T : Target Data
The ML paradigm works well when…
• Trained model is accurate
• Target data is similar to source data
• The world doesn’t change
Change is the only
constant
~ Heraclitus
(circa 500BC)
What can we do when change happens?
Data → Model → Prediction
New Data → New Model → Prediction
However …
• New data is often scarce (esp. right after the change)
• It is unclear when the change actually happened
What to do?
• Can (and should) we make the model more robust by enlarging the
new training sample with historical data?
• Should we retrain the model immediately when a change is detected,
or later, when more new data has become available?
➢ Augment the new data set!
What to do?
• Bias vs. Variance Trade-off
• Can (and should) we make the model more robust by enlarging the
new training sample with historical data?
➢ Transfer learning paradigm
• Exploration vs. Exploitation Trade-off
• Should we retrain the model immediately when a change is detected,
or later, when more new data has become available?
Model Setup
Theoretical Analysis
Theoretical Analysis
• Difference in data environment (pre-change vs. post-change) as
sample selection
– r = 1 : diff-distribution
– r = 0 : same-distribution
• Empirical risk minimization (ERM)
– Minimize
• Weight based on sample selection:
• Expected risk in target data
• Empirical risk using same- and diff-distribution data
• Empirical risk using only same-distribution data
To transfer or not to transfer
S-S : Same-distribution source data (q)
S-D : Diff-distribution source data (p)
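The choice between the two empirical risks above can be sketched as a weighted least-squares fit. This is a minimal illustration, not the deck's method: the weight `w_diff` on diff-distribution examples is a hypothetical knob (the slide's weighting formula did not survive in the text), and the coefficient vectors and sample sizes are made up. Setting `w_diff=0` uses only same-distribution data (no transfer); `w_diff=1` pools both sets with equal weight.

```python
import numpy as np

def weighted_erm(X, y, r, w_diff):
    """Minimize sum_i w_i * (y_i - x_i @ beta)**2 for a linear model.

    r[i] = 1 marks a diff-distribution example, r[i] = 0 a same-distribution one.
    w_diff is an assumed weight on diff-distribution examples; same-distribution
    examples always get weight 1.
    """
    w = np.where(r == 1, w_diff, 1.0)
    sw = np.sqrt(w)
    # Weighted least squares: rescale each row by the square root of its weight.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

rng = np.random.default_rng(0)
p, q, k = 200, 20, 3                       # p diff-distribution, q same-distribution examples
X = rng.normal(size=(p + q, k))
r = np.r_[np.ones(p, dtype=int), np.zeros(q, dtype=int)]
beta_old = np.array([1.0, 1.0, 1.0])       # illustrative pre-change coefficients
beta_new = np.array([1.0, 1.0, 2.0])       # illustrative post-change coefficients
y = np.where(r == 1, X @ beta_old, X @ beta_new) + rng.normal(scale=0.1, size=p + q)

beta_drop = weighted_erm(X, y, r, w_diff=0.0)    # same-distribution data only
beta_equal = weighted_erm(X, y, r, w_diff=1.0)   # equal-weight transfer
print("no transfer:", beta_drop)
print("equal-weight transfer:", beta_equal)
```

Intermediate values of `w_diff` interpolate between the two fits, which is one way to picture the trade-off the following slides analyze.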
• Dd : difference of upper bounds in loss between non-transfer
learning and transfer learning
Effectiveness of Transfer Learning
Relative size of diff- vs. same-distribution data examples
Complexity of the model
Extent of data change
Effectiveness of Transfer Learning
• Depends on …
• The amount of same-distribution source data (q) relative to the
diff-distribution source data (p)
• The number of predictors used in the prediction model (b)
• The extent of change between the source and the target data sets
(a/b)
Numerical Analysis
Simulate Changing Data Pattern
• Linear model: y=x×β+ε
• β=!
• k ∈ {10, 20, …, 50, 60}
– x = (x1, x2, x3, …, xk) is multivariate normal with covariance matrix (σij), where σij = 0.5 for i≠j and 1 for i=j.
• ε follows a normal distribution. To keep R² = 0.6, var(ε) equals
– var(xβ) × (1 − R²) / R²
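• Selection model: Pr(r = 0 | x, y) is a function of a linear combination of x and y
• The coefficient on x is held fixed
• The coefficient on y takes values in {0.3, 0.5, …, 1.5}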
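The data-generating part of this setup can be sketched as follows. This is a rough illustration under stated assumptions: β is taken to be a vector of ones and x is given zero mean, neither of which is legible in the slide text; the function name `simulate` and its parameters are hypothetical. The noise variance follows from R² = var(xβ) / (var(xβ) + var(ε)). The selection model is left out because its exact form is not recoverable here.

```python
import numpy as np

def simulate(n, k, r2=0.6, rho=0.5, seed=0):
    """Draw n observations from y = x @ beta + eps with equicorrelated predictors."""
    rng = np.random.default_rng(seed)
    sigma = np.full((k, k), rho)           # off-diagonal covariances 0.5
    np.fill_diagonal(sigma, 1.0)           # unit variances
    beta = np.ones(k)                      # ASSUMPTION: true beta not legible in the slides
    X = rng.multivariate_normal(np.zeros(k), sigma, size=n)   # ASSUMPTION: zero mean
    signal_var = beta @ sigma @ beta       # var(x @ beta)
    noise_var = signal_var * (1.0 - r2) / r2
    y = X @ beta + rng.normal(scale=np.sqrt(noise_var), size=n)
    return X, y, beta, noise_var

X, y, beta, noise_var = simulate(n=20000, k=10)
# Empirical check that the noise level gives roughly the intended R^2.
r2_hat = 1.0 - np.var(y - X @ beta) / np.var(y)
print(f"noise variance: {noise_var:.3f}, empirical R^2: {r2_hat:.3f}")
```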
Detecting Changes in Data Patterns
• ADWIN (ADaptive WINdowing) algorithm:
• monitoring the out-of-sample prediction error of a pre-trained model
[Figure: data stream of 1,000 data points with r=1 followed by 1,000 data points with r=0]
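The monitoring idea can be sketched with a simplified ADWIN-style detector. This is not the full ADWIN algorithm (which keeps a compressed, bucketed window); it is a naive version that tests every split of the window against a Hoeffding-style threshold and assumes the monitored errors are scaled to [0, 1]. The function name, `delta`, `min_sub`, and the synthetic error stream are all illustrative choices.

```python
import numpy as np

def adwin_style_detect(errors, delta=0.002, min_sub=30):
    """Return stream positions where the mean error appears to have shifted.

    The window is split at every admissible point; a change is flagged when the
    two sub-window means differ by more than a Hoeffding-style threshold.
    On a flag, the older sub-window is dropped.
    """
    window, detections = [], []
    for t, e in enumerate(errors):
        window.append(e)
        n = len(window)
        if n < 2 * min_sub:
            continue
        csum = np.cumsum(np.asarray(window))
        n0 = np.arange(min_sub, n - min_sub + 1)   # size of the older part
        n1 = n - n0                                # size of the newer part
        mean0 = csum[n0 - 1] / n0
        mean1 = (csum[-1] - csum[n0 - 1]) / n1
        m = 1.0 / (1.0 / n0 + 1.0 / n1)            # harmonic-mean sample size
        eps = np.sqrt(np.log(4.0 * n / delta) / (2.0 * m))
        excess = np.abs(mean0 - mean1) - eps
        if excess.max() > 0:
            cut = int(n0[np.argmax(excess)])
            window = window[cut:]                  # keep only the newer part
            detections.append(t)
    return detections

# Synthetic error stream: 1,000 pre-change points, then 1,000 post-change points.
rng = np.random.default_rng(42)
pre = np.clip(0.1 + 0.05 * rng.standard_normal(1000), 0.0, 1.0)
post = np.clip(0.5 + 0.05 * rng.standard_normal(1000), 0.0, 1.0)
detections = adwin_style_detect(np.concatenate([pre, post]))
print("change flagged at stream positions:", detections)
```

The detector needs some post-change observations before the gap in means exceeds the threshold, so the flag is raised with a delay after the true change point.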
Analysis Strategies Compared
➢ In response to changes …
• Using transfer learning
– Transfer – weighting / equal weight
• Using only same-distribution source data
– Retraining (Dropping)
➢ Performance metrics
• Mean squared error (MSE)
– MSE = Bias² + Variance
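A small Monte Carlo sketch can make the comparison concrete: it contrasts retraining (dropping old data) with equal-weight transfer and splits each strategy's MSE into Bias² and Variance. All numbers here (k, p, q, the coefficient vectors, the test point `x0`, unit noise variance) are illustrative assumptions, not values from the deck.

```python
import numpy as np

rng = np.random.default_rng(1)
k, p, q, reps = 5, 200, 15, 500            # p old (diff-distribution), q new examples
beta_old = np.ones(k)
beta_new = np.ones(k)
beta_new[0] = 1.3                          # assumed small change in one coefficient
x0 = np.ones(k)                            # fixed test point
target = x0 @ beta_new                     # noise-free target under the new regime

def fit(X, y):
    """Ordinary least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

preds = {"retrain": [], "transfer": []}
for _ in range(reps):
    X_old = rng.normal(size=(p, k))
    y_old = X_old @ beta_old + rng.normal(size=p)
    X_new = rng.normal(size=(q, k))
    y_new = X_new @ beta_new + rng.normal(size=q)
    # Retraining: use only the scarce post-change data.
    preds["retrain"].append(x0 @ fit(X_new, y_new))
    # Equal-weight transfer: pool old and new data.
    preds["transfer"].append(x0 @ fit(np.vstack([X_old, X_new]), np.r_[y_old, y_new]))

results = {}
for name, vals in preds.items():
    vals = np.asarray(vals)
    bias2 = (vals.mean() - target) ** 2
    variance = vals.var()
    mse = np.mean((vals - target) ** 2)    # equals bias2 + variance
    results[name] = {"bias2": bias2, "variance": variance, "mse": mse}
    print(f"{name:8s} bias^2={bias2:.3f} variance={variance:.3f} mse={mse:.3f}")
```

Under these assumptions retraining is nearly unbiased but noisy, while transfer accepts some bias in exchange for lower variance. With a larger change between regimes the ranking can reverse.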
Trade-off #1: Retraining vs. Transfer Learning
Retraining vs. Transfer Learning
[Figure panels: Bias², Variance]
MSE = Bias² + Variance
Trade-off #2: Now or Later
[Figure panels: Retrain, Transfer]
Trade-off #2: Now or Later
[Figure: Retrain − Transfer]
Conclusions
➢ Contributions
• Understand the effectiveness of transfer learning from a sample
selection perspective
• Trade-offs in response to changes in data patterns
– Bias-variance trade-off is alleviated by strategic transfer learning
– The tension of the exploration-exploitation trade-off differs between the
two alternative strategies (using transfer learning or not).
➢ Implications for data analytics practice
• Continuous monitoring of prediction performance and
reconsideration of the prediction model's fit
• Development of models that represent the changing environment
• Optimization of the waiting time needed for a reliable model adjustment
– Value (cost) of prediction error?
– Value of change detection accuracy?
