Data Science for Business Managers - An intro to ROI for predictive analytics
Akın Osman Kazakçı, MINES ParisTech
Balazs Kégl, École Polytechnique, CNRS
[Diagram: external data flows into a database (X), which feeds a prediction engine driving visualisation, automated actions, and notifications]
At the heart of the digital transformation lies the “data”. The value of data is revealed through prediction.
Levels of transformation through data
• reporting: what happened in the past? (reflection)
• dashboards and real-time monitoring: what is happening now? (reactivity)
• prediction: what will happen next? (proactivity)
How can we accelerate a digital transformation process by leveraging data?
Building value-driven data projects
The following questions need to be answered, in this order:
1. What knowledge would increase our profits?
2. What data do we need to collect?
3. What ML methods are appropriate?
Discussion
• Are standard innovation methodologies fit for digital transformation projects?
• (Can we apply C-K design theory to this?)
Two key aspects
• Do I have all the relevant data?
  Strategic data watch: is there any new source of data I can use?
• Do I have the best predictive accuracy?
  How do I make sure that I’m working with the best possible predictive models?
Data hunt
• Do I have all the relevant data?
Exercise: Data hunt
During your transition to predictive analytics, you may need to update your databases to include more variables with potential explanatory power.
• A travel IT systems company has some air traffic / passenger data.
• They are interested in predicting passenger flux between 20 airports in the US.
• Data for 720 days, for each pair of airports.
• So, “one” variable.
• How can we augment this dataset?
• Which variables can be added?
• Where can we find the data?
Potential sources for relevant factors:
• K1 Events
• K2 Plane accidents
• K3 Calendar
• K4 Delay causes
• K5 Alternative transportation
• K6 Safety
• K7 Data on airports
• K8 Similar data
• K9 Oil price
• K10 Average domestic air fares
• K11 Town’s population
• K12 Town’s attractiveness
(20+ participants (students), analysed by Yohann Sitruk; a sketch of how such variables could be joined into the dataset follows below)
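Below is a minimal sketch, in Python with pandas, of how such an augmentation could work: joining a holiday calendar (K3) and a daily oil price (K9) onto the passenger-flux table. The column names and toy values are illustrative assumptions, not the company’s actual data.

```python
# Hypothetical sketch: augmenting the passenger-flux table with external variables.
import pandas as pd

# Original data: one row per (date, origin, destination) pair.
flux = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-01", "2016-07-04", "2016-07-04"]),
    "origin": ["JFK", "JFK", "LAX"],
    "destination": ["LAX", "ORD", "ORD"],
    "passengers": [4120, 3890, 2570],
})

# External source 1 (K3): a calendar of public holidays.
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-04"]),
    "is_holiday": [True],
})

# External source 2 (K9): daily oil price.
oil = pd.DataFrame({
    "date": pd.to_datetime(["2016-07-01", "2016-07-04"]),
    "oil_price_usd": [48.3, 46.6],
})

# Left-join both sources on the date; missing holiday entries mean "not a holiday".
augmented = flux.merge(holidays, on="date", how="left").merge(oil, on="date", how="left")
augmented["is_holiday"] = augmented["is_holiday"].fillna(False)
print(augmented)
```

The same pattern applies to per-airport variables (town population, airport data), joined on the origin or destination column instead of the date.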
Model quality
• Do I have the best predictive accuracy?
Plan
1. Train & test paradigm
2. Prediction error and quality metrics
3. ROI in data science projects
Building a data science model
• …involves a great deal of trial and error
• little if any theory-based, model-based design
• even research (development of new algorithms) is (mostly) trial and error
• the data scientist’s best friend is a well-designed experimental studio facilitating fast iterations
• How can we control the quality of the ensuing model?
Train & test paradigm
• Data-driven predictors should work well on future (unseen) data
• use historical data to select and fit a model, then use the model to make predictions on new data
• but we only have historical data: how do we “simulate” past and future on existing data?
Train & test paradigm
[Diagram: the data is split into a Train part and a Test part; develop a model on the training set, test the model on the test set, then change the test set]
Train & test paradigm
[Diagram: the train/test split rotates through successive portions of the data]
Cycling through the data in this manner is called cross-validation. This is a powerful and important concept for building robust models (a short sketch in code follows below).
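A minimal sketch of the train & test paradigm and of cross-validation with scikit-learn (the library referenced later in the deck); the synthetic dataset and the logistic regression model are illustrative stand-ins, not the ones from the slides.

```python
# Sketch: hold-out evaluation (train/test split) and cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# "Simulate" past and future: fit on the train split, evaluate on the held-out test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Cross-validation: cycle the test set through the data (5 folds here).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```

Each fold plays the role of “future” data exactly once, which is what makes the accuracy estimate a fair preview of performance on unseen data.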
Question
• Assume your management has decided to outsource your predictive model-building activity.
• How would you evaluate the various partners?
Plan
1. Train & test paradigm
2. Prediction error and (quality) metrics
3. ROI in data science projects
Back to classification
[Figure: two standard models on the same data; a simple linear model misclassifies many red and blue items, while a complex non-linear model separates the data better (again)]
What would be a suitable metric that characterises model performance in the above case?
Prediction error
[Figure: the two standard models, M1 and M2]
A candidate metric: the number of misclassified points (red or blue). According to this criterion, M1 seems worse than M2, assuming both models avoid over/under-fitting (is this the case here?).
Model performance
[Figure: a list of metrics from scikit-learn, a widely used ML software library]
Choice of the metric is important. Ideally, it should be tied to a business objective.
Model performance - a simple case -
Two basic notions:
- False positives
- False negatives
Example:
1. the model predicts cancer for a patient who does not have cancer (a false positive)
2. the model predicts that a patient does not have cancer while she actually has it (a false negative)
Note that the costs of these errors are not identical. This is true in most cases. Can you give other examples?
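A minimal sketch of counting the two error types with scikit-learn’s confusion matrix; the labels are toy values for the cancer-screening example (1 = has cancer, 0 = does not).

```python
# Sketch: extracting false positives and false negatives from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground truth
y_pred = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (predicted cancer, patient is healthy):", fp)
print("false negatives (predicted healthy, patient has cancer):", fn)
```

Once the two counts are separated, each can be weighted by its own business cost, which is the bridge to the ROI calculation in the next part.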
Plan
1. Train & test paradigm
2. Prediction error and (quality) metrics
3. ROI in data science projects
Calculating ROI for improving predictive accuracy
Think about ad-targeting companies. Assume, for the sake of example, the following (fictitious) figures.
The company monitors 100 million page loads per hour by internet users. Within the short duration of a page load, the company must predict whether the user will click on an advertisement.
The company pays $0.10 for showing the advertisement in the dedicated zone of the page. It makes $0.17 if the user clicks on the ad. How does model performance affect profitability?
Assume the model causes 5% false positives and 10% false negatives over the 100 million predictions.
That is 15 million wrong predictions, per hour!
Cost of the false positives: 100M × 0.05 × $0.10 = $500,000
Cost of the false negatives: 100M × 0.10 × $0.07 = $700,000 (where $0.07 is the missed margin, $0.17 - $0.10)
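The same calculation written out as a short script; the figures are the fictitious ones stated above, and the cost of a false negative is taken to be the missed margin of $0.07 per click.

```python
# Sketch: hourly cost of prediction errors in the (fictitious) ad-targeting example.
page_loads_per_hour = 100_000_000
cost_per_impression = 0.10   # $ paid to show the ad
revenue_per_click = 0.17     # $ earned if the user clicks

fp_rate = 0.05  # ad shown, user does not click
fn_rate = 0.10  # ad not shown, user would have clicked

fp_cost = page_loads_per_hour * fp_rate * cost_per_impression
fn_cost = page_loads_per_hour * fn_rate * (revenue_per_click - cost_per_impression)

print(f"FP cost per hour: ${fp_cost:,.0f}")
print(f"FN cost per hour: ${fn_cost:,.0f}")
print(f"Total hourly cost of prediction errors: ${fp_cost + fn_cost:,.0f}")
```

Any improvement in the false-positive or false-negative rates converts directly into hourly savings, which is how the ROI of a better model can be estimated.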
Calculating ROI for improving predictive accuracy
The previous example was for (binary) classification. What happens in the case of “regression”?
Example: predicting the remaining lifetime of devices.
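A minimal sketch of how a regression error could be translated into money for the remaining-lifetime example; the cost structure (a fixed cost per day of over- or under-estimation) is an illustrative assumption, not a figure from the slides.

```python
# Sketch: turning remaining-lifetime prediction errors into a maintenance cost.
import numpy as np

true_lifetime = np.array([120, 90, 200, 45])   # actual remaining days (toy values)
predicted = np.array([110, 100, 180, 60])      # model predictions

cost_per_day_early = 50.0    # replacing a device too early wastes useful life
cost_per_day_late = 200.0    # replacing it too late risks failure in service

error = predicted - true_lifetime               # negative = early, positive = late
cost = np.where(error < 0, -error * cost_per_day_early, error * cost_per_day_late)

print("mean absolute error (days):", np.abs(error).mean())
print("total cost of prediction errors:", cost.sum())
```

The asymmetric costs play the same role as the unequal false-positive / false-negative costs in classification: the right quality metric is the one that reflects them.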
How to improve predictive accuracy?
How to reach the best predictive accuracy?
• Customer analytics: churn, pricing, lead scoring, credit scoring, up- & cross-sales
• Risk & production: fraud / insurance, compliance, safety analysis, cyber-security, manufacturing
• Operations: maintenance, fault analysis, logistics, HR, procurement
Better Predictions = More Value

Integrating & increasing data science capabilities is hard
[Diagram: the many functions involved: Finance, Sales, Marketing, Engineering, Purchasing, HR, Accounting, Manufacturing, Planning, IT, DS R&D]
Main obstacles:
• Skill gap: shortage of data scientists; not enough skilled people; PhDs are expensive and in high demand (McKinsey, 2016); unawareness of the latest techniques and experimental methods
• Development gap: lack of adapted infrastructures and systems; limited resources & time; lack of management practices and appropriate experimental tools
• Deployment gap: it takes months to go from development to deployment, and by the time a model is ready to be deployed in production, the world has changed (distribution shifts); 78% of companies have no automated procedures and 50% recode from scratch (Dataiku Production Survey Report)
Most companies operate with under-performing models.
Ex. a 10% improvement in sales prediction = a 1% decrease in stock-outs = a 100M€ increase in sales for a retail giant
Developing a predictive model is an experimental process
ML has produced a large variety of algorithms, each of which has tunable parameters:
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction Algorithms
- Gradient Boost & AdaBoost
- …
The number of such (hyper)parameters can vary anywhere from 1 to ~100. Trying every combination is not possible (see the sketch below for what is typically done instead).
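A minimal sketch of the usual workaround with scikit-learn: instead of trying every combination, sample a limited number of hyperparameter configurations at random and score each one with cross-validation. The model and parameter ranges are illustrative assumptions.

```python
# Sketch: randomized hyperparameter search instead of exhaustive enumeration.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
}

# Sample 20 configurations and evaluate each with 5-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validated score:", search.best_score_)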
A particular instrument for extending the “search” for the best model is crowdsourcing
• 20% improvement over the baseline model used by physicists (from 3.2 to 3.8) in detecting Higgs particles
• Hundreds of models produced and tested by the participants
(Source: B. Kégl / AppStat@LAL, Learning to discover: classification for discovery)
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
http://www.ramp.studio
Amazing improvement - in just 3 days -
Some numbers
• 100+ participants, working on the same problem
• 411+ models, in just 3 days
• Starting kit scores: Combined = 0.131, Err = 0.090, MARE = 0.212
• Final best submission: Combined = 0.032 (75% improvement), Err = 0.015 (80%), MARE = 0.065 (~70%)
• The blended model is even better: 0.023 on the combined score (better than Saclay, hooray!)
• These improvements are amazing
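For readers unfamiliar with the term, “blending” typically means combining the predictions of several models, often by a simple average. A minimal sketch on toy data (not the actual RAMP submissions) is below.

```python
# Sketch: blending two models by averaging their predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [Ridge(alpha=1.0), RandomForestRegressor(n_estimators=200, random_state=0)]
predictions = [m.fit(X_train, y_train).predict(X_test) for m in models]

for m, p in zip(models, predictions):
    print(type(m).__name__, "MAE:", mean_absolute_error(y_test, p))

# The blend: an unweighted average of the individual predictions.
blend = np.mean(predictions, axis=0)
print("Blended MAE:", mean_absolute_error(y_test, blend))
```

The blend is not guaranteed to beat every individual model, but averaging diverse, reasonably good models is a standard way to squeeze extra accuracy out of a crowdsourced pool of submissions.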
Workshop
• Assume you are all working in various branches of the same group.
• The executive committee decides to run a company-wide initiative to elaborate a roadmap for accelerating the digital transition.
• Steps:
  • Split into 5 teams of 8 persons
  • 30-45 min: each group generates as many prediction problems as possible, with direct relevance to their work (any of the company branches)
  • 60 min: build a priority list, depending on:
    • availability or accessibility of the required data
    • ROI and potential gain (it’s ok to be approximate, but try to come up with informed estimations)
  • 30 min: choose 3 applications and report to the whole group (debriefing)