Forecasting at Scale with Marcello Tomasini

Time Series Forecasting at Scale
MarcelloTomasini
Sr. Data Scientist
Marcello.Tomasini@veritas.com

Copyright © 2018 Veritas Technologies.2
Agenda
1 Forecasting 101
2 Storage Forecasting Use Case & Setup
3 Automation: Model Selection
4 Automation:Validation a.k.a. Backtesting
5 Automation: OnlineTuning

Agenda
1 Forecasting 101

Forecasting is the process of making predictions of
the future based on past and present data
Copyright © 2018 Veritas Technologies4
Definition

Nobody/Nothing can predict the future.
Disclaimer…

Nobody/Nothing can exactly predict the future.
Disclaimer…

The predictability of an event or a quantity depends on several factors:
– how well we understand the factors that contribute to it;
– how much data are available;
– whether the forecasts can affect the thing we are trying to forecast.
What is normally assumed is that the way in which the environment is changing will
continue into the future.
What can be forecasted?

• Explanatory model
• Time series model
Explanatory Model vsTime Series Model

Additive Decomposition
Multiplicative Decomposition
yt is the data, St is the seasonal component, Tt is the trend-cycle component, and Rt is
the remainder component, all at period t
Trend and Seasonality

Additive vs Multiplicative Seasonality

Trend and Seasonality Decomposition
https://otexts.com/fpp2

Agenda
1 Forecasting 101

• 10,000+ NetBackupAppliances actively using Auto Support
• Telemetry data reported daily
– 1 point every 15 minutes (96 points/day) per storage type (MSDP, Log, etc)
– 2+ years of data (0.5M+ points/appliance  ~5B points)
• Storage forecast use cases
– Resource planning
– Workload anomalies
– Possible data unavailability or SLA violations
– Sales opportunity
Storage Usage Forecast @Veritas Predictive Insights
https://www.veritas.com/product/backup-and-recovery/netbackup-appliances

Time System Log Share Advanced
Disk
MSDP MSDP
Catalog
NBU
Catalog
Config
2016-04-11T17:34:55.37602 14.0 15.00415 0.0 947.537800 6312.08 1.66180 0.0 21.13150
2016-04-11T17:49:51.58820 18.0 15.01245 0.0 947.542024 6312.14 1.66230 0.0 21.36500
Params: {
Appliance ID,
Storage Pools,
...
}
Storage Usage Reliability Score Reliability Score [0-100]
+
Evidence
Storage Forecast
Algorithm
Forecast Data
ConstraintsWeighting
Time System Log Share Advanced
Disk
MSDP MSDP
Catalog
NBU
Catalog
Config
2016-04-12 19.0 15.01345 0.0 947.5300 6312.16 1.66380 0.0 21.456
2016-04-13 20.5 15.01456 0.0 947.5424 6312.18 1.66430 0.0 21.635

https://www.veritas.com/product/appliances/predictive-insights
NOTE: mockup UI for confidentiality

Data Ponds
Architecture
User Interface
ML Platform
WKFL
Workflow + Pipelines
ML ETL
Pod1 Pod2
ETL
Pod5
ML
PodN
Scale accordingly to workload needs
Kube
Node1
Kube
Node2
Kube
Node3
Kube
NodeN
Management Services (REST API)
Telemetry Data Source Timeseries DataPlatform Support and Graph Data

Setup
https://github.com/influxdata/influxdb-relay
https://github.com/vente-privee/influxdb-relay
K8s cluster:
4 x m4.2xlarge (gen purpose)
8 vCPU
32 GiB RAM
Influx:
2 x r4.xlarge (mem opt)
4 vCPU
30.5 GiB RAM
gp2 SSD 3000 IOPS
Nginx load balancer:
5 workers
4096 connections/worker

• Use cases:
– Store storage / cpu usage / disk iops / network iops data
– Used by multiple algorithms: storage forecasting, workload anomaly detection, etc
– Serve REST API
• Pro:
– Naturally deal with group by time and simple query language (try MongoDB…)
– Performance
– Free/open source HA setup available
– Python DataFrameClient
– RESTAPI
Why InfluxDB

• Model selection
– Which model is the best?
• Model Evaluation
– How accurate is the model that is currently running in production?
• Model Update
– How to keep tuning/improving the model?
ALL OFTHIS MUST BE DONE UNSUPERVISED!
Challenges of Automation

Agenda
1 Forecasting 101

• Model selection
• Model Update

• How do we deal with missing values?
• How do we deal with outliers?
• How do we deal with trend changepoints?
• How do we deal with trend & seasonality?
• How do we choose algorithm parameters?
Model Selection

Data imputation
Model Selection

Data imputation  Problem: ”holes” in the data
Model Selection

Just remove them
Model Selection

Just remove them  Problem: automated outliers detection?
Model Selection

Trend/Seasonality decomposition or seasonal model
• How do we deal with trend change points?
Model Selection

Trend/Seasonality decomposition or seasonal model
 Problem: seasonality period & multiple seasonalities
Model Selection

Model Selection

CrossValidation (Backtesting)
Model Selection

CrossValidation (Backtesting)  Problem: thousands of models…
Model Selection

• Bad News:
– Real world data is messy
– We must tackle each case, no way around it
– Impossible deadlines
DS/ML/AI in production is hard!
Model Selection

• Bad News:
– Real world data is messy
– We must tackle each case, no way around it
– Impossible deadlines
DS/ML/AI in production is hard!
• Good News:
– Somebody took care of most of the issues already…
Model Selection

• Decomposable time series model
• Incorporate trend changes in the growth model by explicitly defining changepoints
• Fourier series to provide a flexible model of periodic effects (seasonality)
• Holidays as an additional regressor
• Observations do not need to be regularly spaced
• Handle missing values
• Robust to outliers
Facebook Prophet
https://research.fb.com/prophet-forecasting-at-scale/

Agenda
1 Forecasting 101

• Model selection
• Model Update

CrossValidation: ExpandingWindowValidation
https://eng.uber.com/omphalos/
https://doi.org/10.1016/S0169-2070(00)00065-0

CrossValidation: SlidingWindow Backtesting

CrossValidation: Experimental Results
https://doi.org/10.1016/j.ijforecast.2006.03.001
https://doi.org/10.1371/journal.pone.0174202

• Problem:
– Thousands of series
– Possibly more than one model per time series
– One-off/batch run means results are always outdated
Backtesting vs OnlineTesting

• Problem:
– Thousands of series
– Possibly more than one model per time series
– One-off/batch run means results are always outdated
• Solution:
– Save previous forecasts
– Do multi-horizon forecasts
– Compute error as we generate new forecasts

• Forecast error as a time series
– Per appliance
– Per horizon
– Per storage type

Forecast Data
• Solution 1:
– 1 measurement histories
– 1 measurement forecasts
• Pro:
– Simple
• Problem:
– Fetching data twice
– Can’t have forecasts in same table as histories
• Series cardinality!
• Tags:
– Type
– Serial
– History_cutoff
• Fields
– Yhat
– Yhat lower
– Yhat upper
– Cap
– Cutoff

Forecast Data
• Solution 2:
– 1 measurement per appliance
– Contains both history and forecast
• Pro:
– Scales well
• Cons:
– Retention policy on forecast data
• Tags:
– Type
– Serial
– Is_history
– History_cutoff
• Fields
– Yhat
– Yhat lower
– Yhat upper
– Cap
– Cutoff

Accuracy Data
• Solution 1:
– 1 measurement
• Pro:
– Easy cross-analysis
• Cons:
– Sorry you can’t do that aka series cardinality
• Tags:
– Pool_type
– Serial
– Horizon
– Train_freq
– Sample_periods
– Other hyperparams
• Fields
– coverage
– mape
– mdape
– smape
– etc

Accuracy Data
• Solution 2:
• Pro:
– Now we scale
• Cons:
– Cross-analysis means fetching from 10K+
measurements
– Hyperparams are different for different
algorithms
– Hyperparams cardinality can be really high
• Tags:
– Pool_type
– Serial
– Horizon
– Train_freq
– Sample_periods
– Other hyperparams
• Fields
– coverage
– mape
– mdape
– smape
– etc

Accuracy Data
• Solution 3:
– Externalize hyperparams
• Pro:
– Enables some cool stuff for online tuning
• Cons:
– Cross-analysis means fetching from 10K+
measurements and reference another DB
• Tags:
– Pool_type
– Serial
– Horizon
– Hyperparams_doc_id
• Fields
– coverage
– mape
– mdape
– smape
– Etc
• Hyperparam doc (in ArangoDB):
– Model type
– Set of hyperparams

Agenda
1 Forecasting 101
2 Storage Forecasting @Veritas & Setup

• Model selection
• Model Update

• Thousands of models
• Each model needs to be tuned for a specific time series
• Need to adapt to changes in underlying process
• Backtesting is computationally too expensive but we have the online validation data!
Model Update Problems

• Hyperparameter optimization
where f(x) represents an objective score to minimize— such as RMSE— evaluated on
the validation set.
• Bayesian optimization uses a surrogate function:
to build a probability model of the objective.
• Selection function used to choose next set of hyperparameters.
Sequential Model-BasedOptimization (SMBO)

Sequential Model-BasedOptimization (SMBO)
http://krasserm.github.io/2018/03/21/bayesian-optimization

Thank you!
Copyright © 2018 Veritas Technologies. All rights reserved. Veritas and the Veritas Logo are trademarks or registered trademarks of Veritas Technologies or its affiliates in the
U.S. and other countries. Other names may be trademarks of their respective owners.
This document is provided for informational purposes only and is not intended as advertising. All warranties relating to the information in this document, either express or
implied, are disclaimed to the maximum extent allowed by law. The information in this document is subject to change without notice.
MarcelloTomasini
Marcello.Tomasini@veritas.com
https://www.linkedin.com/in/marcellotomasini

Q & AQ & A

Am I Missing Something?

Where is Deep Learning!
Am I Missing Something?

• Too many parameters for the data & cross-training leads to “average” models
• Needs careful data preparation (normalization, trend, seasonality, etc)
• Computational cost of training
Deep Learning

M4 Forecasting Competition
https://eng.uber.com/m4-forecasting-competition
https://doi.org/10.1016/j.ijforecast.2018.06.001

Forecasting at Scale with Marcello Tomasini

Recommended

Recommended

More Related Content

Similar to Forecasting at Scale with Marcello Tomasini

Similar to Forecasting at Scale with Marcello Tomasini (20)

More from InfluxData

More from InfluxData (20)

Recently uploaded

Recently uploaded (20)

Forecasting at Scale with Marcello Tomasini

Editor's Notes