SlideShare a Scribd company logo
1 of 58
Time Series Forecasting at Scale
MarcelloTomasini
Sr. Data Scientist
Marcello.Tomasini@veritas.com
Copyright © 2018 Veritas Technologies.2
Agenda
1 Forecasting 101
2 Storage Forecasting Use Case & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning
Copyright © 2018 Veritas Technologies.3
Agenda
1 Forecasting 101
2 Storage Forecasting Use Case & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning
Forecasting is the process of making predictions of
the future based on past and present data
Copyright © 2018 Veritas Technologies4
Definition
Nobody/Nothing can predict the future.
Copyright © 2018 Veritas Technologies5
Disclaimer…
Nobody/Nothing can exactly predict the future.
Copyright © 2018 Veritas Technologies6
Disclaimer…
The predictability of an event or a quantity depends on several factors:
– how well we understand the factors that contribute to it;
– how much data are available;
– whether the forecasts can affect the thing we are trying to forecast.
What is normally assumed is that the way in which the environment is changing will
continue into the future.
Copyright © 2018 Veritas Technologies7
What can be forecasted?
• Explanatory model
• Time series model
Copyright © 2018 Veritas Technologies8
Explanatory Model vsTime Series Model
Additive Decomposition
Multiplicative Decomposition
yt is the data, St is the seasonal component, Tt is the trend-cycle component, and Rt is
the remainder component, all at period t
Copyright © 2018 Veritas Technologies9
Trend and Seasonality
Copyright © 2018 Veritas Technologies10
Additive vs Multiplicative Seasonality
Copyright © 2018 Veritas Technologies11
Trend and Seasonality Decomposition
https://otexts.com/fpp2
Copyright © 2018 Veritas Technologies.12
Agenda
1 Forecasting 101
2 Storage Forecasting Use Case & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning
• 10,000+ NetBackupAppliances actively using Auto Support
• Telemetry data reported daily
– 1 point every 15 minutes (96 points/day) per storage type (MSDP, Log, etc)
– 2+ years of data (0.5M+ points/appliance  ~5B points)
• Storage forecast use cases
– Resource planning
– Workload anomalies
– Possible data unavailability or SLA violations
– Sales opportunity
Copyright © 2018 Veritas Technologies.13
Storage Usage Forecast @Veritas Predictive Insights
https://www.veritas.com/product/backup-and-recovery/netbackup-appliances
Storage Usage Forecast @Veritas Predictive Insights
Time System Log Share Advanced
Disk
MSDP MSDP
Catalog
NBU
Catalog
Config
2016-04-11T17:34:55.37602 14.0 15.00415 0.0 947.537800 6312.08 1.66180 0.0 21.13150
2016-04-11T17:49:51.58820 18.0 15.01245 0.0 947.542024 6312.14 1.66230 0.0 21.36500
Params: {
Appliance ID,
Storage Pools,
...
}
Storage Usage Reliability Score Reliability Score [0-100]
+
Evidence
Storage Forecast
Algorithm
Forecast Data
ConstraintsWeighting
Time System Log Share Advanced
Disk
MSDP MSDP
Catalog
NBU
Catalog
Config
2016-04-12 19.0 15.01345 0.0 947.5300 6312.16 1.66380 0.0 21.456
2016-04-13 20.5 15.01456 0.0 947.5424 6312.18 1.66430 0.0 21.635
Copyright © 2018 Veritas Technologies15
Storage Usage Forecast @Veritas Predictive Insights
https://www.veritas.com/product/appliances/predictive-insights
NOTE: mockup UI for confidentiality
Data Ponds
Copyright © 2018 Veritas Technologies.16
Architecture
User Interface
ML Platform
WKFL
Workflow + Pipelines
ML ETL
Pod1 Pod2
ETL
Pod5
ML
PodN
Scale accordingly to workload needs
Kube
Node1
Kube
Node2
Kube
Node3
Kube
NodeN
Management Services (REST API)
Telemetry Data Source Timeseries DataPlatform Support and Graph Data
Copyright © 2018 Veritas Technologies.17
Setup
https://github.com/influxdata/influxdb-relay
https://github.com/vente-privee/influxdb-relay
K8s cluster:
4 x m4.2xlarge (gen purpose)
8 vCPU
32 GiB RAM
Influx:
2 x r4.xlarge (mem opt)
4 vCPU
30.5 GiB RAM
gp2 SSD 3000 IOPS
Nginx load balancer:
5 workers
4096 connections/worker
• Use cases:
– Store storage / cpu usage / disk iops / network iops data
– Used by multiple algorithms: storage forecasting, workload anomaly detection, etc
– Serve REST API
• Pro:
– Naturally deal with group by time and simple query language (try MongoDB…)
– Performance
– Free/open source HA setup available
– Python DataFrameClient
– RESTAPI
Copyright © 2018 Veritas Technologies.18
Why InfluxDB
• Model selection
– Which model is the best?
• Model Evaluation
– How accurate is the model that is currently running in production?
• Model Update
– How to keep tuning/improving the model?
ALL OFTHIS MUST BE DONE UNSUPERVISED!
Copyright © 2018 Veritas Technologies.19
Challenges of Automation
Copyright © 2018 Veritas Technologies.20
Agenda
1 Forecasting 101
2 Storage Forecasting Use Case & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning
• Model selection
– Which model is the best?
• Model Evaluation
– How accurate is the model that is currently running in production?
• Model Update
– How to keep tuning/improving the model?
Copyright © 2018 Veritas Technologies.21
Challenges of Automation
• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend changepoints?
• How do we deal with trend & seasonality?
• How do we choose algorithm parameters?
Copyright © 2018 Veritas Technologies.22
Model Selection
• How do we deal with missing values?
Data imputation
• How do we deal with outliers?
• How do we deal with trend changepoints?
• How do we deal with trend & seasonality?
• How do we choose algorithm parameters?
Copyright © 2018 Veritas Technologies.23
Model Selection
• How do we deal with missing values?
Data imputation  Problem: ”holes” in the data
Copyright © 2018 Veritas Technologies.24
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
Just remove them
• How do we deal with trend changepoints?
• How do we deal with trend & seasonality?
• How do we choose algorithm parameters?
Copyright © 2018 Veritas Technologies.25
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
Just remove them  Problem: automated outliers detection?
• How do we deal with trend changepoints?
• How do we deal with trend & seasonality?
• How do we choose algorithm parameters?
Copyright © 2018 Veritas Technologies.26
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend & seasonality?
Trend/Seasonality decomposition or seasonal model
• How do we deal with trend change points?
• How do we choose algorithm parameters?
Copyright © 2018 Veritas Technologies.27
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend & seasonality?
Trend/Seasonality decomposition or seasonal model
 Problem: seasonality period & multiple seasonalities
• How do we deal with trend change points?
• How do we choose algorithm parameters?
Copyright © 2018 Veritas Technologies.28
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend & seasonality?
• How do we deal with trend change points?
Copyright © 2018 Veritas Technologies.29
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend & seasonality?
• How do we deal with trend change points?
• How do we choose algorithm parameters?
CrossValidation (Backtesting)
Copyright © 2018 Veritas Technologies.30
Model Selection
• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend & seasonality?
• How do we deal with trend change points?
• How do we choose algorithm parameters?
CrossValidation (Backtesting)  Problem: thousands of models…
Copyright © 2018 Veritas Technologies.31
Model Selection
• Bad News:
– Real world data is messy
– We must tackle each case, no way around it
– Impossible deadlines
DS/ML/AI in production is hard!
Copyright © 2018 Veritas Technologies.32
Model Selection
• Bad News:
– Real world data is messy
– We must tackle each case, no way around it
– Impossible deadlines
DS/ML/AI in production is hard!
• Good News:
– Somebody took care of most of the issues already…
Copyright © 2018 Veritas Technologies.33
Model Selection
• Decomposable time series model
• Incorporate trend changes in the growth model by explicitly defining changepoints
• Fourier series to provide a flexible model of periodic effects (seasonality)
• Holidays as an additional regressor
• Observations do not need to be regularly spaced
• Handle missing values
• Robust to outliers
Copyright © 2018 Veritas Technologies34
Facebook Prophet
https://research.fb.com/prophet-forecasting-at-scale/
Copyright © 2018 Veritas Technologies.35
Agenda
1 Forecasting 101
2 Storage Forecasting Use Case & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning
• Model selection
– Which model is the best?
• Model Evaluation
– How accurate is the model that is currently running in production?
• Model Update
– How to keep tuning/improving the model?
Copyright © 2018 Veritas Technologies.36
Challenges of Automation
Copyright © 2018 Veritas Technologies37
CrossValidation: ExpandingWindowValidation
https://eng.uber.com/omphalos/
https://doi.org/10.1016/S0169-2070(00)00065-0
Copyright © 2018 Veritas Technologies38
CrossValidation: SlidingWindow Backtesting
https://eng.uber.com/omphalos/
Copyright © 2018 Veritas Technologies39
CrossValidation: Experimental Results
https://doi.org/10.1016/j.ijforecast.2006.03.001
https://doi.org/10.1371/journal.pone.0174202
• Problem:
– Thousands of series
– Possibly more than one model per time series
– One-off/batch run means results are always outdated
Copyright © 2018 Veritas Technologies.40
Backtesting vs OnlineTesting
• Problem:
– Thousands of series
– Possibly more than one model per time series
– One-off/batch run means results are always outdated
• Solution:
– Save previous forecasts
– Do multi-horizon forecasts
– Compute error as we generate new forecasts
Copyright © 2018 Veritas Technologies.41
Backtesting vs OnlineTesting
• Forecast error as a time series
– Per appliance
– Per horizon
– Per storage type
Copyright © 2018 Veritas Technologies.42
Backtesting vs OnlineTesting
Forecast Data
• Solution 1:
– 1 measurement histories
– 1 measurement forecasts
• Pro:
– Simple
• Problem:
– Fetching data twice
– Can’t have forecasts in same table as histories
• Series cardinality!
• Tags:
– Type
– Serial
– History_cutoff
• Fields
– Yhat
– Yhat lower
– Yhat upper
– Cap
– Cutoff
Copyright © 2018 Veritas Technologies.43
Forecast Data
• Solution 2:
– 1 measurement per appliance
– Contains both history and forecast
• Pro:
– Scales well
• Cons:
– Retention policy on forecast data
• Tags:
– Type
– Serial
– Is_history
– History_cutoff
• Fields
– Yhat
– Yhat lower
– Yhat upper
– Cap
– Cutoff
Copyright © 2018 Veritas Technologies.44
Accuracy Data
• Solution 1:
– 1 measurement
• Pro:
– Easy cross-analysis
• Cons:
– Sorry you can’t do that aka series cardinality
• Tags:
– Pool_type
– Serial
– Horizon
– Train_freq
– Sample_periods
– Other hyperparams
• Fields
– coverage
– mape
– mdape
– smape
– etc
Copyright © 2018 Veritas Technologies.45
Accuracy Data
• Solution 2:
– 1 measurement per appliance
• Pro:
– Now we scale
• Cons:
– Cross-analysis means fetching from 10K+
measurements
– Hyperparams are different for different
algorithms
– Hyperparams cardinality can be really high
• Tags:
– Pool_type
– Serial
– Horizon
– Train_freq
– Sample_periods
– Other hyperparams
• Fields
– coverage
– mape
– mdape
– smape
– etc
Copyright © 2018 Veritas Technologies.46
Accuracy Data
• Solution 3:
– 1 measurement per appliance
– Externalize hyperparams
• Pro:
– Enables some cool stuff for online tuning
• Cons:
– Cross-analysis means fetching from 10K+
measurements and reference another DB
• Tags:
– Pool_type
– Serial
– Horizon
– Hyperparams_doc_id
• Fields
– coverage
– mape
– mdape
– smape
– Etc
• Hyperparam doc (in ArangoDB):
– Model type
– Set of hyperparams
Copyright © 2018 Veritas Technologies.47
Copyright © 2018 Veritas Technologies.48
Agenda
1 Forecasting 101
2 Storage Forecasting @Veritas & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning
• Model selection
– Which model is the best?
• Model Evaluation
– How accurate is the model that is currently running in production?
• Model Update
– How to keep tuning/improving the model?
Copyright © 2018 Veritas Technologies.49
Challenges of Automation
• Thousands of models
• Each model needs to be tuned for a specific time series
• Need to adapt to changes in underlying process
• Backtesting is computationally too expensive but we have the online validation data!
Copyright © 2018 Veritas Technologies.50
Model Update Problems
• Hyperparameter optimization
where f(x) represents an objective score to minimize— such as RMSE— evaluated on
the validation set.
• Bayesian optimization uses a surrogate function:
to build a probability model of the objective.
• Selection function used to choose next set of hyperparameters.
Copyright © 2018 Veritas Technologies.51
Sequential Model-BasedOptimization (SMBO)
Copyright © 2018 Veritas Technologies.52
Sequential Model-BasedOptimization (SMBO)
http://krasserm.github.io/2018/03/21/bayesian-optimization
Thank you!
Copyright © 2018 Veritas Technologies. All rights reserved. Veritas and the Veritas Logo are trademarks or registered trademarks of Veritas Technologies or its affiliates in the
U.S. and other countries. Other names may be trademarks of their respective owners.
This document is provided for informational purposes only and is not intended as advertising. All warranties relating to the information in this document, either express or
implied, are disclaimed to the maximum extent allowed by law. The information in this document is subject to change without notice.
MarcelloTomasini
Marcello.Tomasini@veritas.com
https://www.linkedin.com/in/marcellotomasini
Q & AQ & A
Copyright © 2018 Veritas Technologies.54
Copyright © 2018 Veritas Technologies.55
Am I Missing Something?
Where is Deep Learning!
Copyright © 2018 Veritas Technologies.56
Am I Missing Something?
• Too many parameters for the data & cross-training leads to “average” models
• Needs careful data preparation (normalization, trend, seasonality, etc)
• Computational cost of training
Copyright © 2018 Veritas Technologies.57
Deep Learning
https://eng.uber.com/omphalos/
Copyright © 2018 Veritas Technologies.58
M4 Forecasting Competition
https://eng.uber.com/m4-forecasting-competition
https://doi.org/10.1016/j.ijforecast.2018.06.001

More Related Content

Similar to Forecasting at Scale with Marcello Tomasini

Similar to Forecasting at Scale with Marcello Tomasini (20)

5 Clear Signs You Need Security Policy Automation
5 Clear Signs You Need Security Policy Automation5 Clear Signs You Need Security Policy Automation
5 Clear Signs You Need Security Policy Automation
 
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
AI as a Service, Build Shared AI Service Platforms Based on Deep Learning Tec...
 
Real world IoT for enterprises
Real world IoT for enterprisesReal world IoT for enterprises
Real world IoT for enterprises
 
Machine Learning and Industrie 4.0
Machine Learning and Industrie 4.0Machine Learning and Industrie 4.0
Machine Learning and Industrie 4.0
 
AI challanges - Cse day-2018.04.12
AI challanges - Cse day-2018.04.12AI challanges - Cse day-2018.04.12
AI challanges - Cse day-2018.04.12
 
From zero to one - How we evolved our test automation processes and mindset i...
From zero to one - How we evolved our test automation processes and mindset i...From zero to one - How we evolved our test automation processes and mindset i...
From zero to one - How we evolved our test automation processes and mindset i...
 
Barga Galvanize Sept 2015
Barga Galvanize Sept 2015Barga Galvanize Sept 2015
Barga Galvanize Sept 2015
 
20181129 keynote augmented intelligence and artificial intelligence
20181129 keynote augmented intelligence and artificial intelligence20181129 keynote augmented intelligence and artificial intelligence
20181129 keynote augmented intelligence and artificial intelligence
 
Machine Learning for Finance Master Class
Machine Learning for Finance Master Class Machine Learning for Finance Master Class
Machine Learning for Finance Master Class
 
The Machine Learning Audit
The Machine Learning AuditThe Machine Learning Audit
The Machine Learning Audit
 
Mtc strategy-briefing-houston-pd m-05212018-3
Mtc strategy-briefing-houston-pd m-05212018-3Mtc strategy-briefing-houston-pd m-05212018-3
Mtc strategy-briefing-houston-pd m-05212018-3
 
Optimize supply chains using machine learning superpowers webinar deck
Optimize supply chains using machine learning superpowers webinar deckOptimize supply chains using machine learning superpowers webinar deck
Optimize supply chains using machine learning superpowers webinar deck
 
Validation
ValidationValidation
Validation
 
Big Data Analytics: From Insights to Production
Big Data Analytics: From Insights to ProductionBig Data Analytics: From Insights to Production
Big Data Analytics: From Insights to Production
 
"From Insights to Production with Big Data Analytics", Eliano Marques, Senior...
"From Insights to Production with Big Data Analytics", Eliano Marques, Senior..."From Insights to Production with Big Data Analytics", Eliano Marques, Senior...
"From Insights to Production with Big Data Analytics", Eliano Marques, Senior...
 
Machine Learning and AI in Risk Management
Machine Learning and AI in Risk ManagementMachine Learning and AI in Risk Management
Machine Learning and AI in Risk Management
 
How to make existing business applications io t ready
How to make existing business applications io t readyHow to make existing business applications io t ready
How to make existing business applications io t ready
 
Meter Asset Tracking and the Power of Analytics in an AMI System
Meter Asset Tracking and the Power of Analytics in an AMI SystemMeter Asset Tracking and the Power of Analytics in an AMI System
Meter Asset Tracking and the Power of Analytics in an AMI System
 
The Tableau Experience Kaunas - TOC Sales and Marketing prezentacija
The Tableau Experience Kaunas - TOC Sales and Marketing prezentacijaThe Tableau Experience Kaunas - TOC Sales and Marketing prezentacija
The Tableau Experience Kaunas - TOC Sales and Marketing prezentacija
 
Companion by Minitab - Seeing the unknown identifying risk and quantifying pr...
Companion by Minitab - Seeing the unknown identifying risk and quantifying pr...Companion by Minitab - Seeing the unknown identifying risk and quantifying pr...
Companion by Minitab - Seeing the unknown identifying risk and quantifying pr...
 

More from InfluxData

How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
InfluxData
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
InfluxData
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
InfluxData
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
InfluxData
 

More from InfluxData (20)

Announcing InfluxDB Clustered
Announcing InfluxDB ClusteredAnnouncing InfluxDB Clustered
Announcing InfluxDB Clustered
 
Best Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow EcosystemBest Practices for Leveraging the Apache Arrow Ecosystem
Best Practices for Leveraging the Apache Arrow Ecosystem
 
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
How Bevi Uses InfluxDB and Grafana to Improve Predictive Maintenance and Redu...
 
Power Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDBPower Your Predictive Analytics with InfluxDB
Power Your Predictive Analytics with InfluxDB
 
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
How Teréga Replaces Legacy Data Historians with InfluxDB, AWS and IO-Base
 
Build an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING StackBuild an Edge-to-Cloud Solution with the MING Stack
Build an Edge-to-Cloud Solution with the MING Stack
 
Meet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using RustMeet the Founders: An Open Discussion About Rewriting Using Rust
Meet the Founders: An Open Discussion About Rewriting Using Rust
 
Introducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud DedicatedIntroducing InfluxDB Cloud Dedicated
Introducing InfluxDB Cloud Dedicated
 
Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB Gain Better Observability with OpenTelemetry and InfluxDB
Gain Better Observability with OpenTelemetry and InfluxDB
 
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
How a Heat Treating Plant Ensures Tight Process Control and Exceptional Quali...
 
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...How Delft University's Engineering Students Make Their EV Formula-Style Race ...
How Delft University's Engineering Students Make Their EV Formula-Style Race ...
 
Introducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage EngineIntroducing InfluxDB’s New Time Series Database Storage Engine
Introducing InfluxDB’s New Time Series Database Storage Engine
 
Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena Start Automating InfluxDB Deployments at the Edge with balena
Start Automating InfluxDB Deployments at the Edge with balena
 
Understanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage EngineUnderstanding InfluxDB’s New Storage Engine
Understanding InfluxDB’s New Storage Engine
 
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDBStreamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
Streamline and Scale Out Data Pipelines with Kubernetes, Telegraf, and InfluxDB
 
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
Ward Bowman [PTC] | ThingWorx Long-Term Data Storage with InfluxDB | InfluxDa...
 
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
Scott Anderson [InfluxData] | New & Upcoming Flux Features | InfluxDays 2022
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts | InfluxDays 2022
 
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
Steinkamp, Clifford [InfluxData] | Welcome to InfluxDays 2022 - Day 2 | Influ...
 
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
Steinkamp, Clifford [InfluxData] | Closing Thoughts Day 1 | InfluxDays 2022
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 

Forecasting at Scale with Marcello Tomasini

  • 1. Time Series Forecasting at Scale MarcelloTomasini Sr. Data Scientist Marcello.Tomasini@veritas.com
  • 2. Copyright © 2018 Veritas Technologies.2 Agenda 1 Forecasting 101 2 Storage Forecasting Use Case & Setup 3 Automation: Model Selection 4 Automation:Validation a.k.a. Backtesting 5 Automation: OnlineTuning
  • 3. Copyright © 2018 Veritas Technologies.3 Agenda 1 Forecasting 101 2 Storage Forecasting Use Case & Setup 3 Automation: Model Selection 4 Automation:Validation a.k.a. Backtesting 5 Automation: OnlineTuning
  • 4. Forecasting is the process of making predictions of the future based on past and present data Copyright © 2018 Veritas Technologies4 Definition
  • 5. Nobody/Nothing can predict the future. Copyright © 2018 Veritas Technologies5 Disclaimer…
  • 6. Nobody/Nothing can exactly predict the future. Copyright © 2018 Veritas Technologies6 Disclaimer…
  • 7. The predictability of an event or a quantity depends on several factors: – how well we understand the factors that contribute to it; – how much data are available; – whether the forecasts can affect the thing we are trying to forecast. What is normally assumed is that the way in which the environment is changing will continue into the future. Copyright © 2018 Veritas Technologies7 What can be forecasted?
  • 8. • Explanatory model • Time series model Copyright © 2018 Veritas Technologies8 Explanatory Model vsTime Series Model
  • 9. Additive Decomposition Multiplicative Decomposition yt is the data, St is the seasonal component, Tt is the trend-cycle component, and Rt is the remainder component, all at period t Copyright © 2018 Veritas Technologies9 Trend and Seasonality
  • 10. Copyright © 2018 Veritas Technologies10 Additive vs Multiplicative Seasonality
  • 11. Copyright © 2018 Veritas Technologies11 Trend and Seasonality Decomposition https://otexts.com/fpp2
  • 12. Copyright © 2018 Veritas Technologies.12 Agenda 1 Forecasting 101 2 Storage Forecasting Use Case & Setup 3 Automation: Model Selection 4 Automation:Validation a.k.a. Backtesting 5 Automation: OnlineTuning
  • 13. • 10,000+ NetBackupAppliances actively using Auto Support • Telemetry data reported daily – 1 point every 15 minutes (96 points/day) per storage type (MSDP, Log, etc) – 2+ years of data (0.5M+ points/appliance  ~5B points) • Storage forecast use cases – Resource planning – Workload anomalies – Possible data unavailability or SLA violations – Sales opportunity Copyright © 2018 Veritas Technologies.13 Storage Usage Forecast @Veritas Predictive Insights https://www.veritas.com/product/backup-and-recovery/netbackup-appliances
  • 14. Storage Usage Forecast @Veritas Predictive Insights Time System Log Share Advanced Disk MSDP MSDP Catalog NBU Catalog Config 2016-04-11T17:34:55.37602 14.0 15.00415 0.0 947.537800 6312.08 1.66180 0.0 21.13150 2016-04-11T17:49:51.58820 18.0 15.01245 0.0 947.542024 6312.14 1.66230 0.0 21.36500 Params: { Appliance ID, Storage Pools, ... } Storage Usage Reliability Score Reliability Score [0-100] + Evidence Storage Forecast Algorithm Forecast Data ConstraintsWeighting Time System Log Share Advanced Disk MSDP MSDP Catalog NBU Catalog Config 2016-04-12 19.0 15.01345 0.0 947.5300 6312.16 1.66380 0.0 21.456 2016-04-13 20.5 15.01456 0.0 947.5424 6312.18 1.66430 0.0 21.635
  • 15. Copyright © 2018 Veritas Technologies15 Storage Usage Forecast @Veritas Predictive Insights https://www.veritas.com/product/appliances/predictive-insights NOTE: mockup UI for confidentiality
  • 16. Data Ponds Copyright © 2018 Veritas Technologies.16 Architecture User Interface ML Platform WKFL Workflow + Pipelines ML ETL Pod1 Pod2 ETL Pod5 ML PodN Scale accordingly to workload needs Kube Node1 Kube Node2 Kube Node3 Kube NodeN Management Services (REST API) Telemetry Data Source Timeseries DataPlatform Support and Graph Data
  • 17. Copyright © 2018 Veritas Technologies.17 Setup https://github.com/influxdata/influxdb-relay https://github.com/vente-privee/influxdb-relay K8s cluster: 4 x m4.2xlarge (gen purpose) 8 vCPU 32 GiB RAM Influx: 2 x r4.xlarge (mem opt) 4 vCPU 30.5 GiB RAM gp2 SSD 3000 IOPS Nginx load balancer: 5 workers 4096 connections/worker
  • 18. • Use cases: – Store storage / cpu usage / disk iops / network iops data – Used by multiple algorithms: storage forecasting, workload anomaly detection, etc – Serve REST API • Pro: – Naturally deal with group by time and simple query language (try MongoDB…) – Performance – Free/open source HA setup available – Python DataFrameClient – RESTAPI Copyright © 2018 Veritas Technologies.18 Why InfluxDB
  • 19. • Model selection – Which model is the best? • Model Evaluation – How accurate is the model that is currently running in production? • Model Update – How to keep tuning/improving the model? ALL OFTHIS MUST BE DONE UNSUPERVISED! Copyright © 2018 Veritas Technologies.19 Challenges of Automation
  • 20. Copyright © 2018 Veritas Technologies.20 Agenda 1 Forecasting 101 2 Storage Forecasting Use Case & Setup 3 Automation: Model Selection 4 Automation:Validation a.k.a. Backtesting 5 Automation: OnlineTuning
  • 21. • Model selection – Which model is the best? • Model Evaluation – How accurate is the model that is currently running in production? • Model Update – How to keep tuning/improving the model? Copyright © 2018 Veritas Technologies.21 Challenges of Automation
  • 22. • How do we deal with missing values? • How do we deal with outliers? • How do we deal with trend changepoints? • How do we deal with trend & seasonality? • How do we choose algorithm parameters? Copyright © 2018 Veritas Technologies.22 Model Selection
  • 23. • How do we deal with missing values? Data imputation • How do we deal with outliers? • How do we deal with trend changepoints? • How do we deal with trend & seasonality? • How do we choose algorithm parameters? Copyright © 2018 Veritas Technologies.23 Model Selection
  • 24. • How do we deal with missing values? Data imputation  Problem: ”holes” in the data Copyright © 2018 Veritas Technologies.24 Model Selection
  • 25. • How do we deal with missing values? • How do we deal with outliers? Just remove them • How do we deal with trend changepoints? • How do we deal with trend & seasonality? • How do we choose algorithm parameters? Copyright © 2018 Veritas Technologies.25 Model Selection
  • 26. • How do we deal with missing values? • How do we deal with outliers? Just remove them  Problem: automated outliers detection? • How do we deal with trend changepoints? • How do we deal with trend & seasonality? • How do we choose algorithm parameters? Copyright © 2018 Veritas Technologies.26 Model Selection
  • 27. • How do we deal with missing values? • How do we deal with outliers? • How do we deal with trend & seasonality? Trend/Seasonality decomposition or seasonal model • How do we deal with trend change points? • How do we choose algorithm parameters? Copyright © 2018 Veritas Technologies.27 Model Selection
  • 28. • How do we deal with missing values? • How do we deal with outliers? • How do we deal with trend & seasonality? Trend/Seasonality decomposition or seasonal model  Problem: seasonality period & multiple seasonalities • How do we deal with trend change points? • How do we choose algorithm parameters? Copyright © 2018 Veritas Technologies.28 Model Selection
  • 29. • How do we deal with missing values? • How do we deal with outliers? • How do we deal with trend & seasonality? • How do we deal with trend change points? Copyright © 2018 Veritas Technologies.29 Model Selection
  • 30. • How do we deal with missing values? • How do we deal with outliers? • How do we deal with trend & seasonality? • How do we deal with trend change points? • How do we choose algorithm parameters? CrossValidation (Backtesting) Copyright © 2018 Veritas Technologies.30 Model Selection
  • 31. • How do we deal with missing values? • How do we deal with outliers? • How do we deal with trend & seasonality? • How do we deal with trend change points? • How do we choose algorithm parameters? CrossValidation (Backtesting)  Problem: thousands of models… Copyright © 2018 Veritas Technologies.31 Model Selection
  • 32. • Bad News: – Real world data is messy – We must tackle each case, no way around it – Impossible deadlines DS/ML/AI in production is hard! Copyright © 2018 Veritas Technologies.32 Model Selection
  • 33. • Bad News: – Real world data is messy – We must tackle each case, no way around it – Impossible deadlines DS/ML/AI in production is hard! • Good News: – Somebody took care of most of the issues already… Copyright © 2018 Veritas Technologies.33 Model Selection
  • 34. • Decomposable time series model • Incorporate trend changes in the growth model by explicitly defining changepoints • Fourier series to provide a flexible model of periodic effects (seasonality) • Holidays as an additional regressor • Observations do not need to be regularly spaced • Handle missing values • Robust to outliers Copyright © 2018 Veritas Technologies34 Facebook Prophet https://research.fb.com/prophet-forecasting-at-scale/
  • 35. Copyright © 2018 Veritas Technologies.35 Agenda 1 Forecasting 101 2 Storage Forecasting Use Case & Setup 3 Automation: Model Selection 4 Automation:Validation a.k.a. Backtesting 5 Automation: OnlineTuning
  • 36. • Model selection – Which model is the best? • Model Evaluation – How accurate is the model that is currently running in production? • Model Update – How to keep tuning/improving the model? Copyright © 2018 Veritas Technologies.36 Challenges of Automation
  • 37. Copyright © 2018 Veritas Technologies37 CrossValidation: ExpandingWindowValidation https://eng.uber.com/omphalos/ https://doi.org/10.1016/S0169-2070(00)00065-0
  • 38. Copyright © 2018 Veritas Technologies38 CrossValidation: SlidingWindow Backtesting https://eng.uber.com/omphalos/
  • 39. Copyright © 2018 Veritas Technologies39 CrossValidation: Experimental Results https://doi.org/10.1016/j.ijforecast.2006.03.001 https://doi.org/10.1371/journal.pone.0174202
  • 40. • Problem: – Thousands of series – Possibly more than one model per time series – One-off/batch run means results are always outdated Copyright © 2018 Veritas Technologies.40 Backtesting vs OnlineTesting
  • 41. • Problem: – Thousands of series – Possibly more than one model per time series – One-off/batch run means results are always outdated • Solution: – Save previous forecasts – Do multi-horizon forecasts – Compute error as we generate new forecasts Copyright © 2018 Veritas Technologies.41 Backtesting vs OnlineTesting
  • 42. • Forecast error as a time series – Per appliance – Per horizon – Per storage type Copyright © 2018 Veritas Technologies.42 Backtesting vs OnlineTesting
  • 43. Forecast Data • Solution 1: – 1 measurement histories – 1 measurement forecasts • Pro: – Simple • Problem: – Fetching data twice – Can’t have forecasts in same table as histories • Series cardinality! • Tags: – Type – Serial – History_cutoff • Fields – Yhat – Yhat lower – Yhat upper – Cap – Cutoff Copyright © 2018 Veritas Technologies.43
  • 44. Forecast Data • Solution 2: – 1 measurement per appliance – Contains both history and forecast • Pro: – Scales well • Cons: – Retention policy on forecast data • Tags: – Type – Serial – Is_history – History_cutoff • Fields – Yhat – Yhat lower – Yhat upper – Cap – Cutoff Copyright © 2018 Veritas Technologies.44
  • 45. Accuracy Data • Solution 1: – 1 measurement • Pro: – Easy cross-analysis • Cons: – Sorry you can’t do that aka series cardinality • Tags: – Pool_type – Serial – Horizon – Train_freq – Sample_periods – Other hyperparams • Fields – coverage – mape – mdape – smape – etc Copyright © 2018 Veritas Technologies.45
  • 46. Accuracy Data • Solution 2: – 1 measurement per appliance • Pro: – Now we scale • Cons: – Cross-analysis means fetching from 10K+ measurements – Hyperparams are different for different algorithms – Hyperparams cardinality can be really high • Tags: – Pool_type – Serial – Horizon – Train_freq – Sample_periods – Other hyperparams • Fields – coverage – mape – mdape – smape – etc Copyright © 2018 Veritas Technologies.46
  • 47. Accuracy Data • Solution 3: – 1 measurement per appliance – Externalize hyperparams • Pro: – Enables some cool stuff for online tuning • Cons: – Cross-analysis means fetching from 10K+ measurements and reference another DB • Tags: – Pool_type – Serial – Horizon – Hyperparams_doc_id • Fields – coverage – mape – mdape – smape – Etc • Hyperparam doc (in ArangoDB): – Model type – Set of hyperparams Copyright © 2018 Veritas Technologies.47
  • 48. Copyright © 2018 Veritas Technologies.48 Agenda 1 Forecasting 101 2 Storage Forecasting @Veritas & Setup 3 Automation: Model Selection 4 Automation:Validation a.k.a. Backtesting 5 Automation: OnlineTuning
  • 49. • Model selection – Which model is the best? • Model Evaluation – How accurate is the model that is currently running in production? • Model Update – How to keep tuning/improving the model? Copyright © 2018 Veritas Technologies.49 Challenges of Automation
  • 50. • Thousands of models • Each model needs to be tuned for a specific time series • Need to adapt to changes in underlying process • Backtesting is computationally too expensive but we have the online validation data! Copyright © 2018 Veritas Technologies.50 Model Update Problems
  • 51. • Hyperparameter optimization where f(x) represents an objective score to minimize— such as RMSE— evaluated on the validation set. • Bayesian optimization uses a surrogate function: to build a probability model of the objective. • Selection function used to choose next set of hyperparameters. Copyright © 2018 Veritas Technologies.51 Sequential Model-BasedOptimization (SMBO)
  • 52. Copyright © 2018 Veritas Technologies.52 Sequential Model-BasedOptimization (SMBO) http://krasserm.github.io/2018/03/21/bayesian-optimization
  • 53. Thank you! Copyright © 2018 Veritas Technologies. All rights reserved. Veritas and the Veritas Logo are trademarks or registered trademarks of Veritas Technologies or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. This document is provided for informational purposes only and is not intended as advertising. All warranties relating to the information in this document, either express or implied, are disclaimed to the maximum extent allowed by law. The information in this document is subject to change without notice. MarcelloTomasini Marcello.Tomasini@veritas.com https://www.linkedin.com/in/marcellotomasini
  • 54. Q & AQ & A Copyright © 2018 Veritas Technologies.54
  • 55. Copyright © 2018 Veritas Technologies.55 Am I Missing Something?
  • 56. Where is Deep Learning! Copyright © 2018 Veritas Technologies.56 Am I Missing Something?
  • 57. • Too many parameters for the data & cross-training leads to “average” models • Needs careful data preparation (normalization, trend, seasonality, etc) • Computational cost of training Copyright © 2018 Veritas Technologies.57 Deep Learning https://eng.uber.com/omphalos/
  • 58. Copyright © 2018 Veritas Technologies.58 M4 Forecasting Competition https://eng.uber.com/m4-forecasting-competition https://doi.org/10.1016/j.ijforecast.2018.06.001

Editor's Notes

  1. So here is the agenda for today: We will first go through some basic definitions for those that are not familiar with forecasting Then we will discuss what are the challenges of automating forecasting and how some of them can be solved Then we will look on how to validate our model performance Finally we will discuss how to approach the hyperparameter optimization of models deployed at scale
  2. So here is the agenda for today: We will first go through some basic definitions for those that are not familiar with forecasting Then we will discuss what are the challenges of automating forecasting and how some of them can be solved Then we will look on how to validate our model performance Finally we will discuss how to approach the hyperparameter optimization of models deployed at scale
  3. So what is forecasting? Forecasting is trying to predict the future based on the past data
  4. Before we go further, let’s just make one thing clear…
  5. …all it comes down to this statement from a famous statistician. Models are not perfect, they always have an error term. What we are trying to do is to have the model perform better than chance
  6. Sentence in red: this is the so called stationary data assumption. That is mean, variance, and autocorrelation do not change over time.
  7. An explanatory model is useful because it incorporates information about other variables, rather than only historical values of the variable to be forecast. However, there are several reasons a forecaster might select a time series model rather than an explanatory or mixed model. First, the system may not be understood, and even if it was understood it may be extremely difficult to measure the relationships that are assumed to govern its behavior. Second, it is necessary to know or forecast the future values of the various predictors in order to be able to forecast the variable of interest, and this may be too difficult. Third, the main concern may be only to predict what will happen, not to know why it happens. Finally, the time series model may give more accurate forecasts than an explanatory or mixed model.
  8. Left: additive Right: multiplicative
  9. Remainder: mean zero, no significant autocorrelation
  10. So here is the agenda for today: We will first go through some basic definitions for those that are not familiar with forecasting Then we will discuss what are the challenges of automating forecasting and how some of them can be solved Then we will look on how to validate our model performance Finally we will discuss how to approach the hyperparameter optimization of models deployed at scale
  11. NGINX runs on same machine as one of Influx Nodes
  12. So here is the agenda for today: We will first go through some basic definitions for those that are not familiar with forecasting Then we will discuss what are the challenges of automating forecasting and how some of them can be solved Then we will look on how to validate our model performance Finally we will discuss how to approach the hyperparameter optimization of models deployed at scale
  13. DS = Data Science (the stats part)
  14. So here is the agenda for today: We will first go through some basic definitions for those that are not familiar with forecasting Then we will discuss what are the challenges of automating forecasting and how some of them can be solved Then we will look on how to validate our model performance Finally we will discuss how to approach the hyperparameter optimization of models deployed at scale
  15. Paper Link: Tashman, Leonard J. "Out-of-sample tests of forecasting accuracy: an analysis and review." International journal of forecasting 16.4 (2000): 437-450.
  16. Paper Links: Hyndman, Rob J., and Anne B. Koehler. "Another look at measures of forecast accuracy." International journal of forecasting 22.4 (2006): 679-688. Chen, Chao, Jamie Twycross, and Jonathan M. Garibaldi. "A new accuracy measure based on bounded relative error for time series forecasting." PloS one 12.3 (2017): e0174202.
  17. So here is the agenda for today: We will first go through some basic definitions for those that are not familiar with forecasting Then we will discuss what are the challenges of automating forecasting and how some of them can be solved Then we will look on how to validate our model performance Finally we will discuss how to approach the hyperparameter optimization of models deployed at scale
  18. Objective score: Computed from online accuracy data Result of several forecast runs give an algorithm and a specific set of hyperparameters
  19. Uber #1 Paper Link: Makridakis, Spyros, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: Results, findings, conclusion and way forward." International Journal of Forecasting 34.4 (2018): 802-808.