How to
Ruin your
Business
with
Data
Science
Dr. Ingo Mierswa
Founder & President
@ingomierswa
Amazon
http://www.allthatiknow.com/wp-content/uploads/2014/01/tesco-profitability-and-big-data.png
The Rise and Fall of Tesco
“…the sudden downfall of the British retailer Tesco, ranked by some as the second largest
retailer in the world after Wal-Mart. The U.K. chain was considered a pioneer in the use of
customer loyalty programs intended to gather data on consumer preferences.”
“…scandal prompted Tesco to write off an astounding $416.7 million in profits. The scandal
also led to the resignation of Tesco Chairman Sir Richard Broadbent late last month.”
“While it’s clear that shady accounting and cutthroat retail competition played the primary role in
Tesco’s downfall, observers note that the retailer’s foray into big data also backfired.”
“…consumer sentiment turned against Tesco as customers chafed at data collection efforts
like the loyalty program that were increasingly perceived as gimmicky. The loyalty program
appears to have been undone by consumer perception that Tesco got far more in the bargain
than did shoppers.”
Excerpts from: https://www.datanami.com/2014/11/07/tescos-collapse-cautionary-tale-big-data/
Analyzing
Aliens
Original work by Dan Henebery and Josiah Davis:
http://www.questionable-economics.com/what-do-we-know-about-aliens/
Aliens are Fans of the X-Files
Total reported UFO sightings per year since 1963
Source: NUFORC & http://www.questionable-economics.com/what-do-we-know-about-aliens/
Aliens Work Hard, Party Hard
Proportion of all reported UFO sightings by hour and day
Source: NUFORC & http://www.questionable-economics.com/what-do-we-know-about-aliens/
Aliens Love America. And Fireworks.
Average reported UFO sightings per week since 2010
Source: NUFORC & http://www.questionable-economics.com/what-do-we-know-about-aliens/
Confusion between correlation and causation
Wrong model validation
Focusing on models vs. data & data prep
Confusion between
Correlation and Causation
In early 2004, Governor Rod Blagojevich announced a
plan to mail one book a month to every child in Illinois
from the time they were born until they entered
kindergarten. The plan would cost $26 million a year.
The more books are available at home,
the higher children's test scores are.
Further analysis showed that students from homes with several books
performed better academically, even if they had never read the books!
Source: http://freakonomics.com/2008/12/10/the-blagojevich-upside/
Demo
Modeling the Titanic accident.
Confusion between Correlation and Causation
What to do to avoid problems
• Check for correlations before modeling and understand what they mean
• Consider removing factors that correlate too strongly with the target or
with other factors
• Take out all information which is not available at the point of prediction
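The checks above can be sketched in a few lines. This is a minimal illustration, not the demo from the talk: the data is made up, and the column names (age, got_lifeboat_seat), the helper names, and the 0.95 threshold are all assumptions chosen for the example. It flags any factor that correlates almost perfectly with the target, which is often a sign that the factor would not be known at the point of prediction.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def suspicious_features(columns, target, threshold=0.95):
    """Names of columns whose absolute correlation with the target exceeds threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) > threshold]

# Toy, made-up data: 1 = survived, 0 = did not survive.
survived = [1, 0, 1, 1, 0, 0, 1, 0]
columns = {
    "age": [22, 38, 26, 35, 54, 2, 27, 14],
    # Hypothetical leak: only known after the outcome, so it mirrors the target.
    "got_lifeboat_seat": [1, 0, 1, 1, 0, 0, 1, 0],
}

flagged = suspicious_features(columns, survived)
print(flagged)
```

A flagged factor is not automatically wrong, but it deserves a closer look before it goes into a model.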
Wrong Model Validation
Demo
Impact of wrong model validation.
I got it… but how can this ruin me?
Let’s imagine a company is losing $200 Million per year due to customer churn.
A machine learning model has been created and, with an improper validation, has been shown to
reduce this churn by 20%, i.e. by $40 Million.
The retention measures it triggers cost $20 Million, but given the reduction in churn volume of
$40 Million, it’s still a very good investment.
But as it turns out, the model was not properly validated and after spending the $20 Million,
the reduction in churn was only 5%, i.e. $10 Million revenue savings.
As a consequence, the expected $20 Million gain has turned into a $10 Million loss. Ouch.
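The arithmetic behind this scenario, written out as a small sketch. All amounts are in $ Million and come from the slide; the function name net_gain and its parameter names are made up for illustration.

```python
def net_gain(annual_churn_loss, churn_reduction_pct, program_cost):
    """Net result in $ Million: churn saved minus what the program cost."""
    saved = annual_churn_loss * churn_reduction_pct / 100
    return saved - program_cost

# What the improperly validated model promised: 20% less churn.
expected = net_gain(200, 20, 20)
# What actually happened: 5% less churn, same spend.
actual = net_gain(200, 5, 20)

print(expected, actual)
```

The gap between the two results is the price of trusting an optimistic validation.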
Wrong Model Validation
What to do to avoid problems
• Ignore training errors completely
• Always use cross-validation
• All data transformations which work across rows need to be inside of the
cross-validation
• Take out all information which is not available at the point of prediction
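A minimal sketch of what "inside of the cross-validation" means, using only the Python standard library. It is not the demo from the talk: the nearest-centroid classifier is a stand-in model, the data is a made-up toy set, and all function names are assumptions. The point to notice is that the scaling parameters are learned from the training rows of each fold only, so no information from the held-out rows leaks into the model.

```python
import random
import statistics

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_scaler(values):
    """Learn mean and standard deviation from the given rows."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero
    return mean, std

def cross_validate(xs, ys, k=5):
    """Mean hold-out accuracy of a nearest-centroid classifier."""
    accuracies = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]

        # Row-spanning transformation fitted INSIDE the fold, on training rows only.
        mean, std = fit_scaler([xs[i] for i in train_idx])
        scaled = [(x - mean) / std for x in xs]

        # Stand-in model: one centroid per class, computed from training rows.
        centroids = {}
        for label in {ys[i] for i in train_idx}:
            members = [scaled[i] for i in train_idx if ys[i] == label]
            centroids[label] = statistics.fmean(members)

        # Score on the held-out rows only; the training error is never looked at.
        correct = 0
        for i in test_idx:
            predicted = min(centroids, key=lambda c: abs(scaled[i] - centroids[c]))
            correct += predicted == ys[i]
        accuracies.append(correct / len(test_idx))
    return statistics.fmean(accuracies)

# Toy, made-up data: two well-separated classes.
xs = list(range(1, 11)) + list(range(101, 111))
ys = [0] * 10 + [1] * 10
accuracy = cross_validate(xs, ys, k=5)
print(accuracy)
```

Fitting the scaler once on all rows before splitting would look simpler, but it is exactly the mistake the bullet above warns against.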
Focusing on Models vs.
Data & Data Preparation
Following Technology Hypes
The approach that you see from many less-experienced data scientists.
Receive the data.
Don't investigate the data at all.
Think of the most complicated-sounding model possible and just mindlessly
plug your data into said model.
Present the results of the model in a way that does not help others develop
insight about the problem, and is not particularly actionable.
Demo
Feature engineering vs. modeling.
Focusing on Models vs. Data & Data Preparation
What to do to avoid problems
• Start with a plan and question in mind
• Define success before you start
• Understand the business problem and collaborate with stakeholders
• Use common sense and invent new features (feature engineering)
• Simple models are often better and more robust
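As a small illustration of the feature engineering bullet, the sketch below derives two features from raw customer fields before any model is involved. Everything in it is hypothetical: the record layout (signup, last_purchase, n_purchases), the function name, and the dates are invented for the example, and whether such features help depends on the business problem at hand.

```python
from datetime import date

def add_recency_features(customer, today):
    """Return a copy of the record with two derived features added."""
    enriched = dict(customer)
    enriched["days_since_last_purchase"] = (today - customer["last_purchase"]).days
    tenure_days = (today - customer["signup"]).days
    # Purchases per 30 days of tenure; guard against same-day signups.
    enriched["purchases_per_month"] = customer["n_purchases"] * 30 / max(tenure_days, 1)
    return enriched

# Made-up customer record.
customer = {
    "signup": date(2016, 1, 1),
    "last_purchase": date(2016, 10, 2),
    "n_purchases": 10,
}
features = add_recency_features(customer, today=date(2016, 10, 27))
print(features["days_since_last_purchase"], features["purchases_per_month"])
```

A simple model fed with features like these is often easier to explain to stakeholders than a complex model fed with raw dates.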
Our Mission
Real data science, fast and simple.
We do not compromise on the quality of results or completeness.
We focus on making data scientists and teams more productive.
We empower more people to do and use real data science.
RapidMiner Highlights
Gartner, Magic Quadrant for Data Science Platforms, 14 February 2016. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only
those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims
all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Key Take-aways
Check for correlations before modeling
Always cross-validate models and data prep
Use common sense and feature engineering before
trying fancy models
Dr. Ingo Mierswa
Founder & President
@ingomierswa
@rapidminer

How to Ruin your Business with Data Science & Machine Learning by Ingo Mierswa
