How can we scale forecasting to many people and problems within an organization? Here I argue for a strategy that reduces model building to feature construction, so that building a forecaster becomes similar to building a classifier.
Automatic Forecasting at Scale
Sean J. Taylor
12 Aug 2015
Joint Statistical Meetings
Many Forecasting Problems at Facebook
• capacity planning: servers, switches, people, even food
• user / advertiser growth
• goal setting for teams (with respect to forecast)
• detecting anomalies
• “trending” units
Business Time Series Have Similar Attributes
• composed of multiple “units”
(e.g. countries, users, advertisers, hardware units)
• units are “born” at different times, can exit the sample
• growth curves are common (e.g. saturating a market)
• complex, human-scale seasonality, holidays and events
• structural breaks as exogenous changes happen
(e.g. new products, redesigns, site outages)
• missing data
Results of my search for forecasting advice
▪ carefully clean, scale, and fix missingness in data
▪ try many kinds of models
▪ use model selection procedures based on (penalized)
goodness-of-fit or just ocular goodness-of-fit
▪ lots of tacit knowledge involved — experienced
forecasters have earned a lot of credibility
Why is building a forecaster harder
than building a classifier?
How most people build a classifier:
1. Choose a loss function.
2. Gather as much data as possible and construct
potentially useful features.
3. Train models using different amounts of regularization.
4. Choose the one that predicts the best out-of-sample
using some cross-validation procedure.
With a flexible enough learner, the only time a human
needs to intervene is during feature construction!
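A minimal sketch of steps 3 and 4, assuming squared loss and a one-feature ridge regression as a stand-in for the learner. The function names (`fit_ridge_1d`, `cv_loss`, `select_lambda`) are illustrative, not from the talk.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x (one feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_loss(xs, ys, lam, k=3):
    """Mean squared out-of-sample error over k interleaved folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, lam)
        total += sum((ys[i] - w * xs[i]) ** 2 for i in held_out)
    return total / n


def select_lambda(xs, ys, lambdas, k=3):
    """Step 4: keep the regularization strength with the lowest CV loss."""
    return min(lambdas, key=lambda lam: cv_loss(xs, ys, lam, k))
```

Nothing in this loop needs human judgment once the features exist, which is the point of the slide.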
Forecasting as (special) supervised learning
▪ state-features constructed from historical data
▪ time-based features for seasonality, events, etc.
▪ off-the-shelf regularized regression (glmnet, VW)
▪ use simulated forecasts to estimate expected loss
When you have a
look like a
$\arg\min_{\beta} \; \|y - X\beta\|_2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2$
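A small sketch of evaluating a penalized objective of this kind. It assumes the conventional elastic-net form, in which the residual norm and the L2 penalty are both squared; that squaring, and the function name, are assumptions rather than something the slide states.

```python
def elastic_net_objective(X, y, beta, lam1, lam2):
    """Squared error plus L1 and squared-L2 penalties on the coefficients."""
    residuals = [yi - sum(b * x for b, x in zip(beta, row))
                 for row, yi in zip(X, y)]
    squared_error = sum(r * r for r in residuals)
    l1 = sum(abs(b) for b in beta)            # encourages sparse coefficients
    l2_squared = sum(b * b for b in beta)     # shrinks coefficients toward zero
    return squared_error + lam1 * l1 + lam2 * l2_squared
```

Setting `lam2 = 0` gives a lasso-style objective and `lam1 = 0` gives a ridge-style one.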
Fixed-Horizon Forecasting Regression
Regressors are generated from past state:
$y_{t+H} = f(y_t, y_{t-1}, y_{t-2}, \ldots)$
$y_{t+H} = \alpha y_t + \ldots$
State features from one-sided kernel-
Can use any weighted statistic to generate features:
mean, variance, quantiles, etc.
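One way to sketch such features, assuming an exponential one-sided kernel (weights that decay with lag and never look past time t). The kernel choice, the `bandwidth` parameter, and the function names are assumptions for illustration.

```python
import math


def one_sided_weights(n_obs, bandwidth):
    """Normalized exponential weights for lags 0..n_obs-1 (lag 0 = most recent)."""
    raw = [math.exp(-lag / bandwidth) for lag in range(n_obs)]
    total = sum(raw)
    return [w / total for w in raw]


def state_features(history, bandwidth=7.0):
    """history: values observed up to time t, ordered oldest to newest."""
    if not history:
        raise ValueError("need at least one past observation")
    weights = one_sided_weights(len(history), bandwidth)
    recent_first = history[::-1]
    wmean = sum(w * y for w, y in zip(weights, recent_first))
    wvar = sum(w * (y - wmean) ** 2 for w, y in zip(weights, recent_first))
    return {"last": recent_first[0], "wmean": wmean, "wvar": wvar}
```

Calling `state_features` with several bandwidths yields a family of short-memory and long-memory features, and the regularized regression decides which ones matter.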
Assumption: local smoothness
Assume parameters vary smoothly over forecast horizon
(same as assuming forecast is locally smooth).
$y_{t+H} = \alpha_H \cdot y_t + \ldots$
for each horizon $H$
Adding Seasonality Features
Add components to the model that represent deterministic
functions of time:
▪ cyclic cubic splines for yearly seasonality
▪ day-of-week, day-of-year, hour-of-day dummy variables
▪ smooth curves around known holidays
$y_{t+H} = f(y_t, y_{t-1}, y_{t-2}, \ldots) + g(t + H)$
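A sketch of the deterministic time component g(t + H), under two simplifications that are assumptions here: yearly seasonality uses sine/cosine (Fourier) terms instead of cyclic cubic splines, and the holiday curve is a Gaussian bump centered on the holiday. The function name and parameters are illustrative.

```python
import math
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def time_features(target_day, holidays=(), n_harmonics=2, holiday_width=3.0):
    """Features of the target date t + H only; no observed values are used."""
    feats = {}
    # Day-of-week dummy variables (date.weekday() returns 0 for Monday).
    for idx, name in enumerate(DAY_NAMES):
        feats["dow_" + name] = 1.0 if target_day.weekday() == idx else 0.0
    # Smooth yearly cycle from the position within the year.
    year_frac = (target_day.timetuple().tm_yday - 1) / 365.25
    for k in range(1, n_harmonics + 1):
        angle = 2 * math.pi * k * year_frac
        feats["year_sin_%d" % k] = math.sin(angle)
        feats["year_cos_%d" % k] = math.cos(angle)
    # Smooth bump that peaks on each known holiday and decays with distance.
    for name, day in holidays:
        gap = (target_day - day).days
        feats["holiday_" + name] = math.exp(-0.5 * (gap / holiday_width) ** 2)
    return feats
```

Because these features depend only on the calendar, they are known for any future date at forecast time.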
Input Data for Training

State features at time t:
t     last   mean
1/1   -      -
1/2   5      5
1/3   9      7

Targets and time features at time t+H:
t+H   y    Mon   Tue
1/1   5    1     0
1/2   9    0     1
1/3   14   0     0

Joined training rows:
t+H   t     H   y    last   mean   Mon   Tue
1/2   1/1   1   9    -      -      0     1
1/3   1/1   2   14   -      -      0     0
1/3   1/2   1   14   5      5      0     0
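The join behind this training input can be sketched as follows. It assumes integer day indices and that state features at time t use only observations strictly before t, which is how "last" and "mean" behave in the example. The function name is hypothetical.

```python
def build_training_rows(series, horizons):
    """series: list of (day, y) pairs sorted by day, with integer day indices.

    Returns one row per (forecast date t, horizon H) whose target day
    t + H has an observed value.
    """
    y_by_day = dict(series)
    rows = []
    for t, _ in series:
        past = [y for day, y in series if day < t]
        last = past[-1] if past else None
        mean = sum(past) / len(past) if past else None
        for h in horizons:
            target_day = t + h
            if target_day in y_by_day:
                rows.append({
                    "target_day": target_day,
                    "t": t,
                    "H": h,
                    "y": y_by_day[target_day],   # value to be predicted
                    "last": last,                # state features known at t
                    "mean": mean,
                })
    return rows
```

Time features for the target day (such as the Mon and Tue dummies) would be joined on `target_day` in the same way.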
Making it hierarchical
We want to borrow information about processes across
units. Huge opportunity because:
1. We know more about “new” time series than we think if
we are willing to assume they are generated from a
2. The more examples from a family of time series
processes we have, the better we are able to learn about
its structure. Example: stock market.
3. Precision gains from borrowing information.
One weird trick for hierarchical models
Global parameters + unit-specific parameters:
$y_{i,t+H} = \alpha y_t + \ldots + \alpha_i y_t + \ldots$
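One possible reading of this trick, offered as an assumption because the slide's formula is incomplete: give every feature a global copy and a unit-specific copy, then let regularization shrink the unit-specific coefficients toward zero. The feature-naming scheme below is hypothetical.

```python
def hierarchical_features(unit_id, base_features):
    """Duplicate each feature into a shared copy and a per-unit copy."""
    feats = {}
    for name, value in base_features.items():
        feats["global:" + name] = value                  # shared across all units
        feats["unit=" + unit_id + ":" + name] = value    # deviation for this unit
    return feats
```

A unit with little history keeps its unit-specific coefficients near zero and falls back on the global ones, which is where the borrowing of information comes from.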
▪ BIG DATA: optimization-based techniques are difficult to
use here because
▪ Online learning using SGD/Adagrad/Adadelta works well
here AND we can update parameters for different loss
functions and regularization parameters at the same time.
▪ Other bonus for online learning: incremental learning on
data sorted by time!
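A sketch of a single online pass that trains several models at once, assuming plain SGD (not Adagrad or Adadelta), squared loss, and an L2 penalty applied only to the features present in each example. Predicting before updating on time-sorted rows gives an out-of-sample loss estimate in the same pass. Names and defaults are illustrative.

```python
def sgd_pass(rows, l2_grid, lr=0.05):
    """rows: iterable of (features dict, y), sorted by time.

    Returns the fitted weights and the mean progressive squared error
    (each prediction is made before that example is used for an update)
    for every L2 strength in l2_grid.
    """
    models = {l2: {} for l2 in l2_grid}
    losses = {l2: 0.0 for l2 in l2_grid}
    n = 0
    for feats, y in rows:
        n += 1
        for l2, w in models.items():
            pred = sum(w.get(k, 0.0) * v for k, v in feats.items())
            err = pred - y
            losses[l2] += err ** 2
            # Gradient step on 0.5 * err^2 + 0.5 * l2 * w^2 for active features.
            for k, v in feats.items():
                wk = w.get(k, 0.0)
                w[k] = wk - lr * (err * v + l2 * wk)
    return models, {l2: total / n for l2, total in losses.items()}
```

The same loop extends to several loss functions by keeping one weight dictionary per (loss, regularization) pair.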
Model Selection via
We have two sets of hyper-parameters:
1. regularization of the model coefficients.
2. amount of differencing we do before
Just like in the classification version of the
problem, we choose the model that
empirically forecasts the best by selecting
K simulated forecast dates.
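A sketch of selection by simulated forecasts. For each simulated forecast date, only the data before that date is visible, and the forecast is scored against what actually happened. The two candidate forecasters are simple stand-ins for models with different hyper-parameters; all names are illustrative.

```python
def simulated_forecast_error(series, cutoffs, horizon, forecaster):
    """Mean absolute error over the simulated forecast dates."""
    errors = []
    for cutoff in cutoffs:
        history = series[:cutoff]                 # data available at forecast time
        actual = series[cutoff - 1 + horizon]     # value `horizon` steps after the last observation
        errors.append(abs(forecaster(history, horizon) - actual))
    return sum(errors) / len(errors)


def last_value(history, horizon):
    """Forecast that the series stays at its latest value."""
    return history[-1]


def drift(history, horizon):
    """Extrapolate the average historical slope (needs at least two points)."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope * horizon


def select_model(series, cutoffs, horizon, candidates):
    """Return the name of the best candidate and all candidate scores."""
    scores = {name: simulated_forecast_error(series, cutoffs, horizon, f)
              for name, f in candidates.items()}
    return min(scores, key=scores.get), scores
```

Each cutoff must leave at least `horizon` later observations in the series so the forecast can be scored.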
Predictive Intervals with Quantile Regression
Very important to quantify uncertainty about a forecast.
Often we’d prefer that people not even look at the point
Once you’re in the land of regularized linear regression, you
can get predictive intervals simply by changing the loss
function to quantile loss.
Directly optimizing the model for the correct amount of
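A sketch of the quantile (pinball) loss and its subgradient, which is all an SGD-based learner needs in order to target a quantile instead of the mean. The helper `best_constant` is included only to show that minimizing this loss recovers an empirical quantile; the names are illustrative.

```python
def pinball_loss(y, pred, q):
    """Quantile loss for target quantile q in (0, 1)."""
    diff = y - pred
    return q * diff if diff >= 0 else (q - 1) * diff


def pinball_grad(y, pred, q):
    """Subgradient of the loss with respect to the prediction."""
    return -q if y > pred else 1 - q


def best_constant(ys, q, candidates):
    """Constant prediction among candidates with the lowest total quantile loss."""
    return min(candidates, key=lambda c: sum(pinball_loss(y, c, q) for y in ys))
```

Fitting one model at q = 0.05 and another at q = 0.95 gives the lower and upper ends of a 90% predictive interval.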
▪ online feature scaling
▪ feature hashing
▪ stochastic gradient descent (and Adagrad, Adadelta)
▪ fitting several models simultaneously on the same data
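A sketch of feature hashing: named features map to a fixed number of buckets, so new units or new features never require rebuilding a dictionary. Using MD5 for a stable hash and a sign bit to reduce collision bias are choices made for this example, not details from the talk.

```python
import hashlib


def hash_features(feats, n_bits=18):
    """Map {name: value} to {bucket index: value} with 2**n_bits buckets."""
    n_buckets = 1 << n_bits
    hashed = {}
    for name, value in feats.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h % n_buckets                               # bucket from the low bits
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0        # sign from the top bit
        hashed[idx] = hashed.get(idx, 0.0) + sign * value
    return hashed
```

This pairs naturally with unit-specific feature names, since the number of distinct names grows with the number of units while the weight vector stays a fixed size.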
Scaling to More People/Problems
1. Start with a single use-case and nail it.
2. Parameterize that solution: adding new problems
should simply be configuration.
3. Work on model/fitting procedure, then run all previous
models for diagnostics.
4. Provide easy tools for model criticism — top predictive
errors, examples with under/over coverage, etc.
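A sketch of two model-criticism helpers for step 4: empirical interval coverage and a list of the worst forecasts. The record layout (`unit`, `actual`, `forecast`) is a hypothetical format chosen for the example.

```python
def coverage(actuals, lowers, uppers):
    """Fraction of actual values that fall inside their predictive interval."""
    inside = sum(1 for y, lo, hi in zip(actuals, lowers, uppers) if lo <= y <= hi)
    return inside / len(actuals)


def top_errors(records, k=3):
    """The k records with the largest absolute forecast error, worst first."""
    return sorted(records,
                  key=lambda r: abs(r["actual"] - r["forecast"]),
                  reverse=True)[:k]
```

Coverage far from the nominal level (for example 0.6 for a 90% interval) flags units where the intervals are too narrow, and coverage near 1.0 flags intervals that are too wide.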
▪ Different kinds of “at scale” — people and problems are
more important than size of data
▪ If a model/technique is hard to use, it’s worth thinking
about what it would take for a non-expert to use it.
▪ Making problems look like regularized linear regression is
▪ Forecasting can be made into a very special kind of
▪ Email me with comments/feedback: firstname.lastname@example.org