Data Science Meetup
Passenger forecasting at KLM
From idea to meals on board
The science
Forecasting passengers
by Alexander Backus
The data science product life cycle
[Figure: life cycle from IDEA to PRODUCT through the stages IDEATE, EXPERIMENT, INDUSTRIALIZE]
Understanding the data-value chain
[Figure: data -> PREDICT -> insight -> DECIDE -> action -> MEASURE -> value (€). Labels: value proposition (passenger forecasts), user (supply meals), business objectives (optimal catering)]
Predicting the number of passengers that will board a flight
[Figure: timeline from planning to departure, with forecast horizons marked along it]
System requirements
We want accurate passenger forecasts
For specific upcoming flights
At any moment before departure
System design
[Figure: flight and booking data -> machine-learning algorithm (PREDICT), $f(\mathbf{x}) = y$ -> forecasted passengers; boarded passengers flow back into the algorithm as a feedback loop]
System output
Full conditional probability density?
[Figure: probability density over forecasted passengers (low to high), marking Q10, the mean $\mathbb{E}[Y]$ and Q90; the mean is labeled MVP]
Minimizing change management need
Current process is based on the number of expected passengers
Regression: $f(x) = y$
[Figure: data -> PREDICT -> insight (passenger forecasts) -> DECIDE -> action (user, supply chain process)]
System inputs
Booked passengers equals forecasted passengers?
Last-minute bookings
No-shows
Aircraft changes
Varied data sources: bookings x datetime x location x aircraft
Multi-timescale forecasting
Fit one model with temporal indicators
Hours to departure, a.k.a. query moment
[Figure (mock, for illustration purposes): bookings, from 0 to max, against hours to departure, with the 24-hour mark indicated]
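One way to read "fit one model with temporal indicators" is to give every flight one training row per query moment, with hours to departure as an ordinary feature. The sketch below assumes that layout; the snapshot values are made up, and the column names are borrowed from the `df.head()` sample in this deck.

```python
import pandas as pd

# Illustrative booking snapshots: bookings on hand at several query moments.
snapshots = pd.DataFrame({
    "flight_id": [1122, 1122, 1122, 1123, 1123, 1123],
    "hours_to_departure": [72, 24, 4, 72, 24, 4],
    "booked_pax": [108, 150, 164, 90, 105, 116],
})
boarded = pd.DataFrame({"flight_id": [1122, 1123], "boarded_pax": [166, 118]})

# One training row per (flight, query moment). hours_to_departure is just
# another feature, so a single model can serve every forecast horizon.
training = snapshots.merge(boarded, on="flight_id", how="left")
X = training[["hours_to_departure", "booked_pax"]]
y = training["boarded_pax"]
```

The alternative, one model per horizon, would split the data into small sets and need separate maintenance for each.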
Defining the target
Facilitate learning: offset with booking number
$y' = y - x_{\text{booked}}$
Focus on learning interactions with booking numbers
[Figure: boarded passengers against booked passengers; axis ticks 100, 200, 300]
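A minimal sketch of the offset target, using the `booked_pax` and `boarded_pax` column names from the `df.head()` sample in this deck. The rows are illustrative and the helper `to_passengers` is a hypothetical name, not part of the original code.

```python
import pandas as pd

# Illustrative rows; column names follow the deck's df.head() sample.
df = pd.DataFrame({
    "booked_pax": [108, 105, 176],
    "boarded_pax": [166, 118, 180],
})

# Offset target: the model learns the deviation from the booking count.
df["target_offset"] = df["boarded_pax"] - df["booked_pax"]

def to_passengers(predicted_offset, booked_pax):
    """Hypothetical helper: add the booking count back to an offset prediction."""
    return predicted_offset + booked_pax

restored = to_passengers(df["target_offset"], df["booked_pax"])
```

At prediction time the booking count is known, so adding it back to the predicted offset recovers a passenger forecast on the original scale.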
Performance metrics
Business: customer satisfaction / cost reduction
User: undersupply / oversupply
Model: mean absolute error
Mean absolute error
Closely related to business goals
$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
[Figure: boarded passengers against forecasted passengers; axis ticks 200, 220, 240]
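The mean absolute error can be computed directly from its definition. The sketch below uses made-up passenger counts; the function name is an illustrative choice.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between boarded and forecasted passengers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

boarded = [200, 220, 240]      # illustrative values
forecasted = [210, 215, 240]   # illustrative values
mae = mean_absolute_error(boarded, forecasted)  # (10 + 5 + 0) / 3
```

Because the error is expressed in passengers, it reads directly as the average number of meals over- or undersupplied per flight.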
Metric slicing
Cabin class: business, economy
Flight group: intercontinental, Europe
Differential impact on business process
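Metric slicing can be sketched as a group-by over the per-flight absolute errors. The column names `cabin_class`, `flight_group` and `forecast_pax` are assumptions for this illustration, and the numbers are made up.

```python
import pandas as pd

# Illustrative forecasts; column names are assumed for this sketch.
results = pd.DataFrame({
    "cabin_class": ["business", "business", "economy", "economy"],
    "flight_group": ["intercontinental", "europe", "intercontinental", "europe"],
    "boarded_pax": [30, 20, 250, 100],
    "forecast_pax": [28, 23, 260, 96],
})
results["abs_error"] = (results["boarded_pax"] - results["forecast_pax"]).abs()

# MAE per slice: one number per cabin class, and one per flight group.
mae_by_cabin = results.groupby("cabin_class")["abs_error"].mean()
mae_by_group = results.groupby("flight_group")["abs_error"].mean()
```

A single overall MAE can hide a slice where errors are costly, which is why the per-slice view matters when the impact on the business process differs.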
Performance visualization
[Figure (mock, for illustration purposes): MAE per slice against hours to departure, from long-term to day of departure; density of residuals around 0, with undersupply and oversupply on either side]
Validation procedure
[Figure: the full historical data set (2016, 2017, 2018) is divided by a temporal split (rolling window) into a train set and a test set; a grouped random split (shuffled folds) sets aside the validation set]
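The two kinds of split can be sketched as follows. This assumes the grouping key is the flight, which would keep all query moments of one flight on the same side of the split; the slide does not name the key. The data and the `departure_year` column are illustrative, and the rolling window is not shown.

```python
import numpy as np
import pandas as pd

# Illustrative data: two query moments (rows) per flight.
df = pd.DataFrame({
    "flight_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "departure_year": [2016, 2016, 2016, 2016, 2017, 2017,
                       2017, 2017, 2018, 2018, 2018, 2018],
})

# Temporal split: the most recent year is held out as the test set.
test = df[df["departure_year"] == 2018]
history = df[df["departure_year"] < 2018]

# Grouped random split: shuffle flight ids, not rows, so that every row
# of a flight lands in either the train set or the validation set.
rng = np.random.default_rng(0)
flight_ids = rng.permutation(history["flight_id"].unique())
n_val = len(flight_ids) // 4
val_ids = list(flight_ids[:n_val])
validation = history[history["flight_id"].isin(val_ids)]
train = history[~history["flight_id"].isin(val_ids)]
```

Splitting by row instead of by flight would let the model see a flight's outcome in training and then be validated on another query moment of that same flight.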
df.head()
flight_id  hours_to_departure  weekday  booked_pax  destination  capacity  …  boarded_pax
1122       72                  mon      108         LHR          340       …  166
1123       46                  sat      105         CDG          120       …  118
1124       202                 tue      176         AMS          NaN       …  180
1125       4                   mon      284         NYC          340       …  296
1126       25                  thu      267         NaN          280       …  276
Gradient boosting decision trees
Cuts through mixed-type high-dimensional tabular data with few assumptions
Decision trees
Objective: predict boarded passengers (minimize MAE loss)
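[Figure: a decision tree fit on example training samples. Split: booked > 120. If true, a second split on holiday == True gives leaf predictions 10 (true) and 6 (false); if false, the leaf prediction is -2]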
Tree ensembles
Averaging multiple instances
[Figure: tree 1 splits on booked > 120, then on holiday == True (leaves 10, 6, -2); tree 2 splits on destination == NYC (leaves -1 if true, 1 if false)]
$\hat{y} = f(\mathbf{x}) = 6 - 1 = 5$
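The worked example on this slide can be traced in a few lines. The branch layout below is a reading of the flattened figure (which leaf sits under which branch is an assumption), and the function names are illustrative.

```python
def tree_1(booked, holiday):
    """First tree: split on bookings, then on holiday."""
    if booked > 120:
        return 10 if holiday else 6
    return -2

def tree_2(destination):
    """Second tree: split on destination."""
    return -1 if destination == "NYC" else 1

def predict(booked, holiday, destination):
    # The slide's worked example adds the outputs of the two trees.
    return tree_1(booked, holiday) + tree_2(destination)

# A well-booked, non-holiday flight to NYC: 6 from tree 1, -1 from tree 2.
y_hat = predict(booked=150, holiday=False, destination="NYC")
```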
Gradient boosting
Homing-in on mistakes
[Figure: sequential fitting. Tree 1 (booked > 120, holiday == True) is fit to the training samples; tree 2 (destination == NYC) is fit to the errors that tree 1 leaves behind]
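Sequential fitting can be illustrated with one-split trees (stumps) in plain NumPy. This is a simplified sketch, not LightGBM: it fits each stump by squared error rather than the L1 objective used in the deck, and the data and helper names are made up.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold on x that best splits the residuals (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    """Return the left or right leaf value for each sample."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([-4.0, -2.0, 0.0, 6.0, 8.0, 10.0])

prediction = np.zeros_like(y)
errors = []
for _ in range(3):
    stump = fit_stump(x, y - prediction)   # fit the next tree to what is still wrong
    prediction = prediction + predict_stump(stump, x)
    errors.append(float(np.mean((y - prediction) ** 2)))
```

Each pass targets the residual of the running prediction, so the training error recorded in `errors` shrinks with every added stump.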
from sklearn.pipeline import Pipeline
from lightgbm.sklearn import LGBMRegressor

# some_fancy_preprocessing_pipeline, X_train, y_train and fit_params are defined elsewhere
estimator = Pipeline(steps=[
    ('preprocessor', some_fancy_preprocessing_pipeline),
    ('regressor', LGBMRegressor(n_estimators=1000,
                                objective='regression_l1',  # L1 loss, i.e. MAE
                                categorical_feature='auto',
                                use_missing=True))
])
estimator.fit(X_train, y_train, **fit_params)
Tuning the algorithm
What happens if we keep boosting? Overfitting!
Regularization with learning rate: $\hat{y}_t = \hat{y}_{t-1} + \eta f_t(x)$
Early stopping based on validation set:
[Figure: train loss and validation loss against iteration; boosting stops where the validation loss no longer improves]
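The effect of the learning rate and of stopping on validation loss can be sketched with scikit-learn's `GradientBoostingRegressor`, used here as a stand-in for LightGBM. The data is synthetic, and the best iteration is picked after the fact from `staged_predict` rather than by halting training early.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=400)   # synthetic, noisy target
X_train, y_train = X[:300], y[:300]
X_val, y_val = X[300:], y[300:]

# A small learning rate shrinks every tree's contribution (eta in the formula).
model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)

# Validation MAE after each boosting iteration.
val_loss = [mean_absolute_error(y_val, pred) for pred in model.staged_predict(X_val)]
best_iteration = int(np.argmin(val_loss)) + 1   # number of trees to keep
```

Plotting `val_loss` against the iteration number gives the validation curve from the slide; trees beyond `best_iteration` fit noise in the training set.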
Tuning the algorithm
Key hyperparameters
Splitting: max_bin, bagging_fraction, feature_fraction
Pruning: num_leaves, max_depth, min_data_in_leaf
Finding optimal hyperparameter settings
Sequential model-based optimization: model expected improvement
Balance between exploitation and exploration
Tree-structured Parzen Estimator (TPE), as implemented in hyperopt
[Figure: sampled points over hyperparameter 1 and hyperparameter 2; more sampling in high-score regions]
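from hyperopt import hp, Trials, tpe, fmin

# Keys carry the 'regressor__' prefix so that Pipeline.set_params can route them.
space = {'regressor__max_depth': hp.quniform('max_depth', low=3, high=12, q=3),
         'regressor__feature_fraction': hp.uniform('feature_fraction', low=0.3, high=1.0),
         'regressor__learning_rate': hp.loguniform('learning_rate', low=-5, high=-3)}

def objective(params):
    # quniform samples floats; max_depth must be an integer
    params['regressor__max_depth'] = int(params['regressor__max_depth'])
    # Note: eval_set reaches the regressor as-is, without the pipeline's preprocessing.
    # early_stopping_rounds as a fit argument applies to LightGBM < 4.0;
    # newer releases use the early_stopping callback instead.
    fit_params = dict(regressor__eval_set=[(X_val, y_val)],
                      regressor__early_stopping_rounds=5)
    estimator.set_params(**params)
    estimator.fit(X_train, y_train, **fit_params)
    # best_score_ is nested as {eval set name: {metric name: value}}; fmin needs a scalar
    return estimator._final_estimator.best_score_['valid_0']['l1']

trials = Trials()
best_params = fmin(fn=objective, space=space, algo=tpe.suggest,
                   max_evals=10, trials=trials)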
Experiment successful!
Time for a real test
Superior performance to the current system
Trimming the feature set
Pave the road to production
[Figure (mock, for illustration purposes): gain per feature, with a cut-off below which features are dropped]
From notebooks to software package
├── README.md
├── paxfor
│   ├── features.py
│   ├── pipeline.py
│   ├── model.py
│   ├── train.py
│   └── settings.py
├── requirements.txt
├── setup.py
├── tests
└── notebooks
Shadow deployment
Predicting on real-time production data
[Figure: a real-time data feed goes to both the current system and MOBS; the forecast drives the user's action in the supply chain process]
Challenge: Training-serving skew
[Figure (mock, for illustration purposes): variable $x_1$ in historical data (OLD) against variable $x_1$ in the real-time production environment (NEW), showing differing values]
Solution: residual learning
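Step 1: fit estimator: $f_1(\mathbf{x}_A) = y$
Step 2: fit residual estimator: $f_2(\mathbf{x}_B) = y - f_1(\mathbf{x}_B)$
Step 3: predict: $\hat{y} = f_1(\mathbf{x}_B) + f_2(\mathbf{x}_B)$
[Figure: historical sources $\mathbf{x}_A$ train $f_1$ on target $y$; for new sources $\mathbf{x}_B$, $f_1(\mathbf{x}_B)$ is subtracted from $y$ to form the residual target for $f_2$, which also receives extra features; adding $f_1(\mathbf{x}_B)$ and $f_2(\mathbf{x}_B)$ gives the forecast $\hat{y}$]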
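The three steps can be sketched on synthetic data, with simple least-squares lines standing in for the gradient-boosted models. The skew is simulated as a constant shift of the input in the new source; the shift, the data and the helper `fit_linear` are assumptions of this illustration, and the extra features for the second estimator are left out.

```python
import numpy as np

def fit_linear(x, y):
    """Least-squares line; returns a prediction function."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return lambda x_new: slope * x_new + intercept

rng = np.random.default_rng(0)

# Historical source A: plenty of data.
x_a = rng.uniform(0, 100, size=1000)
y_a = 2.0 * x_a + rng.normal(scale=1.0, size=1000)

# New source B: the same variable arrives shifted (the skew), with less data.
x_true = rng.uniform(0, 100, size=100)
x_b = x_true + 5.0
y_b = 2.0 * x_true + rng.normal(scale=1.0, size=100)

f1 = fit_linear(x_a, y_a)               # step 1: fit on historical data
f2 = fit_linear(x_b, y_b - f1(x_b))     # step 2: fit the residual on new-source data
y_hat = f1(x_b) + f2(x_b)               # step 3: add both predictions

mae_f1_only = float(np.mean(np.abs(y_b - f1(x_b))))
mae_combined = float(np.mean(np.abs(y_b - y_hat)))
```

In this toy setting the first estimator alone is off by roughly the size of the shift, and the residual estimator absorbs it. The comparison is in-sample, so it only illustrates the mechanism.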
Shadow deployment successful!
Proven superior performance to the current system
Benchmark beaten
[Figure (mock, for illustration purposes): mean absolute error against hours to departure, from long-term to day of departure, for the current system and MOBS]
The data science product life cycle
[Figure: life cycle from IDEA to PRODUCT through the stages IDEATE, EXPERIMENT, INDUSTRIALIZE]
Key take-aways
Understanding the data-value chain is key to defining the machine-learning problem
Get business stakeholders committed by demonstrating value in a live test
Simplicity over complexity: think minimal viable to get to production

