Retail Demand Forecasting with Machine Learning
Ronald P. (Ron) Menich
mlconf NYC 27 Mar 2015
GO, TEAM!
▪ Syrine Besbes
▪ Wafa Hwess
▪ Rihab Ben Aicha
▪ Abhijit Oka
▪ Mark Tabladillo
▪ Ahmed Yassine Khaili
▪ Nikolaos Vasiloglou
▪ Eugene Kamarchik
▪ Kurt Stirewalt
▪ Andy Dean
▪ Firas Aloui
▪ Molham Aref
▪ Rafael Gonzalez-Coloni
Forgive me if I’ve missed someone
PREDICTIX’ CORE RETAIL DECISION SUPPORT OFFERINGS
▪ Planning
▪ Assortment Planning
▪ Merchandise Financial Planning
▪ Item Planning
▪ Forecasting
▪ Machine-learning models
▪ All demand drivers
▪ Internal (promo, price, etc.)
▪ External (weather, competition, events, etc.)
▪ Supply Chain Optimization
▪ Network flow optimization
▪ Optimize for profit
GETTING DEMAND FORECASTING RIGHT TRANSLATES TO $$$
▪ Size of the problem
▪ 62 billion weekly forecasts (150K active SKUs × 8,000 stores × 52 weeks)
▪ Many TBs of data
▪ 3,000 computing cores elastically provisioned
▪ Forecast accuracy
▪ Measured 25% to 50% reduction in MAPE
▪ The harder the problem the better the improvement
▪ Measured reduction of bias in forecasts
▪ Benefits
▪ $125M from inventory reductions alone
▪ 20% ongoing benefit
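The problem-size figure above can be checked with a few lines of arithmetic. The sketch below also shows one common way to compute MAPE and bias; the slide does not define either metric, so these formulas and the function names `mape` and `bias` are assumptions, not necessarily the definitions used in the measurements.

```python
# Problem size from the slide: SKUs x stores x weeks.
active_skus = 150_000
stores = 8_000
weeks = 52
weekly_forecasts = active_skus * stores * weeks
print(f"{weekly_forecasts:,}")  # 62,400,000,000, i.e. roughly 62 billion


def mape(actuals, forecasts):
    """Mean absolute percentage error (assumed definition).

    Periods with zero actual sales are skipped, because the
    percentage error is undefined there.
    """
    pairs = [(a, f) for a, f in zip(actuals, forecasts) if a != 0]
    return sum(abs(a - f) / abs(a) for a, f in pairs) / len(pairs)


def bias(actuals, forecasts):
    """Total forecast minus total actual, as a fraction of total actual
    (assumed definition). Negative means under-forecasting overall."""
    return (sum(forecasts) - sum(actuals)) / sum(actuals)
```

With actuals `[100, 200]` and forecasts `[110, 180]`, each period is off by 10%, so `mape` gives 0.1, while `bias` is slightly negative because the total forecast (290) is below the total actual (300).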
IN THE BEGINNING, DEMAND FORECASTING SEEMED SIMPLE...
Time-series forecasting
…BUT THEN EVER GREATER COMPLEXITY AROSE
A Last year’s sales
B Manual partitioning of data, different TS models for different partitions
C Croston’s for sparse, Winters for dense
D Forecast at aggregate levels, spread down
J if/then/else assignment of different TS algorithms
...
N Have user manually map a new SKU to an existing one
...
O Have user manually inject local market knowledge
L Linear regression for promotions
Alarm Clock: Demand forecasts. But are they really “simple”?
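To make item C concrete, here is a minimal sketch of the textbook form of Croston's method for sparse (intermittent) demand. It is not taken from the talk; the function name `croston` and the default `alpha` are illustrative assumptions.

```python
def croston(demand, alpha=0.1):
    """Textbook Croston's method: forecast of demand per period.

    Smooths the size of non-zero demands (z) and the interval between
    them (p) separately, then forecasts z / p per period.
    """
    z = None  # smoothed non-zero demand size
    p = None  # smoothed interval between non-zero demands
    q = 1     # periods elapsed since the last non-zero demand
    for d in demand:
        if d > 0:
            if z is None:
                # Initialize from the first non-zero demand.
                z, p = float(d), float(q)
            else:
                z = alpha * d + (1 - alpha) * z
                p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return 0.0 if z is None else z / p


# A SKU that sells 4 units every third period.
print(croston([0, 0, 4, 0, 0, 4, 0, 0, 4], alpha=0.5))
```

Note that the method sees only the sales history. It has no way to use promotions, price, or weather, which is part of the complexity this slide points at.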
…AND SO NOW WE ASK THE QUESTION
A Last year’s sales
B Manual partitioning of data, different TS models for different partitions
C Croston’s for sparse demand, Winters for dense
D Forecast at different hierarchical levels, spread down
J Automated if/then/else assignment of different TS algorithms
...
N Have user manually map a new SKU to an existing one
...
O Have user manually inject local market knowledge
L Linear regression for promo
Alarm Clock: Demand forecasts. But are they really “simple”?
REALLY?
Machine learning can provide a modern, simpler, theoretically sound, and more extensible alternative for retail demand forecasting.
CAUSAL FACTORS DRIVE RETAIL DEMAND
How much additional demand was generated for Post Cereals because these were on promotion?
How much does the $4 in-store coupon contribute to the total uplift?
Does the table highlighting the $1.50 coupon and the final offer price drive any additional uplift?
Competition
Weather
SO AN ATTRIBUTE-BASED FORECASTING APPROACH IS APT
Inputs include:
• Product Attributes
(including text descriptions e.g. reviews)
• Hierarchies
• Competitor Data
• Promotions
• Pricing
• Display
• Store Attributes
• Local events
• Weather
• Customer data
• ...
CLOUD ELASTICITY
Machine Learning:
• 2-way interactions
• 3-way
• 4-way
Predictive Analytics
What If on
price/promo/display
changes
Demand Forecasts
▪ Basic products
▪ New products
▪ Short lifecycle
▪ Customer specific
▪ ...
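One way to picture the attribute-based inputs above is as a sparse feature vector per observation. The sketch below is illustrative only: the attribute names, values, and the helper `encode_observation` are hypothetical and not taken from the talk.

```python
def encode_observation(attributes, numeric=None):
    """Encode categorical attributes as sparse one-hot features.

    Returns a dict {feature_name: value}; features that are absent
    are implicitly zero.
    """
    features = {f"{key}={value}": 1.0 for key, value in attributes.items()}
    features.update(numeric or {})
    return features


# Hypothetical existing SKU and brand-new SKU in the same store.
existing = encode_observation(
    {"sku": "cereal-123", "brand": "Post", "store": "S042", "promo": "coupon"},
    numeric={"price": 3.49},
)
new_sku = encode_observation(
    {"sku": "cereal-999", "brand": "Post", "store": "S042", "promo": "none"},
    numeric={"price": 3.99},
)

# The new SKU has no sales history of its own, but it shares brand and
# store features with items the model has already seen.
print(sorted(set(existing) & set(new_sku)))
```

Under this kind of encoding, whatever a model has learned for shared attributes such as brand or store carries over to SKUs with no history, which is presumably why new products appear among the forecast types.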
POSSIBLE SUPERVISED LEARNING MODELS
Random forests
Restricted Boltzmann machines
Deep learning
We chose factorization machines for several reasons
● Linear regression heritage of market mix modeling
● SGD/online suitability for handling large data sets
● Trend can be modeled
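For readers unfamiliar with factorization machines, the sketch below shows the standard second-order prediction: a linear-regression part plus pairwise interactions modeled through latent factor vectors. It is a generic illustration, not the Predictix implementation; the function name `fm_predict` and the dict-based sparse layout are assumptions.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine prediction.

    x:  sparse features {name: value}
    w0: global bias
    w:  linear weights {name: weight}
    v:  latent factors {name: [k floats]}

    The pairwise term sum_{i<j} <v_i, v_j> x_i x_j is computed with the
    usual identity 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2],
    which is linear in the number of non-zero features.
    """
    linear = sum(w.get(i, 0.0) * xi for i, xi in x.items())
    k = len(next(iter(v.values()), []))
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * xi for i, xi in x.items() if i in v)
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items() if i in v)
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Two active features with k = 2 latent factors each (made-up numbers).
x = {"promo=coupon": 1.0, "brand=Post": 1.0}
w = {"promo=coupon": 0.5, "brand=Post": -0.5}
v = {"promo=coupon": [1.0, 2.0], "brand=Post": [3.0, 4.0]}
print(fm_predict(x, w0=1.0, w=w, v=v))
```

Dropping the pairwise term leaves an ordinary linear regression, which matches the "linear regression heritage" bullet. Every parameter has a simple gradient, so the model can be trained with SGD one observation at a time.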
ZERO-FILLING --- KNOWING WHY DEMAND DID AND DIDN’T OCCUR AND WHEN
● Unlike product recommender systems, retail forecasting must predict the timing of when demand will happen (not just the rating whenever it happens)
● An observation of sales might have a (sku, store, day) primary key
○ Was the product on the shelf, available to be sold?
○ How much was sold, if any?
● In many retail contexts, the vast majority of observations have zero sales
○ Recent example: zero-sales observations account for >97.5% of the training set
○ It is important to know why demand was zero
Extreme Case:
Demand only occurs when there’s a discount
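A minimal sketch of zero-filling under stated assumptions: sales records exist only for (sku, store, day) keys with positive units, and a separate availability set says when the product was on the shelf. The function `zero_fill` and both data structures are hypothetical, not described in the talk.

```python
def zero_fill(sales, on_shelf):
    """Build training targets including explicit zero-sales observations.

    sales:    {(sku, store, day): units} for days with recorded sales
    on_shelf: set of (sku, store, day) keys where the item was available

    Days when the item was not available are left out entirely: zero
    sales there say nothing about demand.
    """
    # A recorded sale implies the item was available that day.
    keys = set(on_shelf) | set(sales)
    return {key: sales.get(key, 0) for key in sorted(keys)}


# Hypothetical item: on the shelf days 1-3, out of stock on day 4.
sales = {("sku1", "s1", 2): 3}
on_shelf = {("sku1", "s1", day) for day in (1, 2, 3)}

filled = zero_fill(sales, on_shelf)
zero_share = sum(1 for units in filled.values() if units == 0) / len(filled)
print(filled)
print(zero_share)
```

Days 1 and 3 become genuine zero-demand observations, while day 4 is excluded. This separates "nobody wanted it" from "it could not be bought".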
EXAMPLE FORECASTS - TOYS
Training set
Test set
EXAMPLE FORECASTS - SEASONAL GROCERY ITEM
Training on the left and middle
One month of holdout / test at the very right
EXAMPLE FORECASTS - QUICK SERVICE RESTAURANT
For very dense data (few zeros), almost unbiased forecasts with WAPE values below 12.5% can be achieved
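The slide does not define WAPE. The sketch below assumes the common definition of weighted absolute percentage error: total absolute error divided by total actual demand.

```python
def wape(actuals, forecasts):
    """Weighted absolute percentage error (assumed definition):
    sum of |actual - forecast| divided by sum of |actual|."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    total_actual = sum(abs(a) for a in actuals)
    return total_error / total_actual


# Made-up numbers; note the zero-sales period in the middle.
print(wape([100, 0, 50], [90, 5, 55]))
```

Unlike a per-period percentage error, this stays well defined when individual periods have zero sales, as long as total actual demand is non-zero.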
NEW SKUS CAN READILY BE FORECASTED
REPLACEMENT SKUS CAN BE READILY FORECASTED
CHALLENGES / ONGOING WORK
● Zero-filling / training set cardinality control using weighted least squares
● Global effects and 2-way interactions are easily trainable, but 3-way and higher-order interactions require judicious feature engineering
● Parallel learning / consensus of learners
● Visualization / explanation of hidden factors used for interaction modeling
● Automated pruning of non-important attributes
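The first bullet names weighted least squares as a way to control training-set size, but the slide gives no details. One plausible reading, sketched here purely as an assumption, is to keep every non-zero observation, keep only a random fraction of the zero-sales observations, and up-weight the kept zeros so the weighted squared loss still approximates the loss on the full set. The function names and the `keep_fraction` parameter are hypothetical.

```python
import random


def downsample_zeros(observations, keep_fraction, seed=0):
    """Assumed scheme: keep all non-zero rows, sample the zero rows.

    observations: list of (features, units)
    Returns a list of (features, units, weight). Kept zeros get weight
    1 / keep_fraction to stand in for the zeros that were dropped.
    """
    rng = random.Random(seed)
    sample = []
    for features, units in observations:
        if units != 0:
            sample.append((features, units, 1.0))
        elif rng.random() < keep_fraction:
            sample.append((features, units, 1.0 / keep_fraction))
    return sample


def weighted_squared_error(sample, predict):
    """Weighted least-squares objective over the reduced training set."""
    return sum(w * (y - predict(x)) ** 2 for x, y, w in sample)


# Made-up data: 1,000 zero-sales rows and 10 rows with 2 units sold.
observations = [({}, 0)] * 1000 + [({}, 2)] * 10
sample = downsample_zeros(observations, keep_fraction=0.1)
print(len(sample), weighted_squared_error(sample, lambda x: 0.0))
```

The reduced set is roughly a tenth of the original size, yet each retained zero counts ten times in the objective, so the zeros keep their overall influence on the fit.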
THANK YOU.

Ronald Menich, Chief Data Scientist, Predictix, LLC at MLconf NYC
