Honey, I Shrunk the Target Variable!
Florian Wilhelm
Common pitfalls when transforming the target variable and
how to exploit transformations
Berlin, April 12th 2022
Mathematical Modelling
Data Science to Production & MLOps
Personalisation & RecSys
Uncertainty Quantification & Causality
Python Data Stack
Creator of PyScaffold
@FlorianWilhelm
FlorianWilhelm
FlorianWilhelm.info
Dr. Florian Wilhelm
Head of Data Science @ inovex
inovex is an IT project house
with focus on digital transformation
› Product Discovery · Product Ownership
› Web · UI/UX · Replatforming · Microservices
› Mobile · Apps · Smart Devices · Robotics
› Big Data & Business Intelligence Platforms
› Data Science · Data Products · Search · Deep Learning
› Data Center Automation · DevOps · Cloud · Hosting
› Agile Training · Technology Training · Coaching
Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg
www.inovex.de/en
Using technology to inspire our clients.
And ourselves.
Recap about Metrics
Choosing the Right Metric
› (R)MSE is most often used in practice
› Scikit-Learn’s regressors mostly use MSE as default
In which Use-Cases does (R)MSE make sense?
Little Recap about Metrics
             Quadratic   Absolute
Difference   (R)MSE      MAE
Relation     RMSPE       MAPE
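A minimal sketch of the four metrics in plain Python, assuming the usual textbook definitions. The function names are illustrative and not taken from the talk's repository.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: quadratic penalty on the difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: absolute penalty on the difference."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmspe(y_true, y_pred):
    """Root mean squared percentage error: quadratic penalty on the relation (y_true != 0)."""
    return math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error: absolute penalty on the relation (y_true != 0)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```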
Our Use-Case
How much should I sell my car for?
A model fitted on many sold cars and their features could provide a fair market value.
Our Use-Case Setting
1. take the used-cars database from Kaggle with 370k cars having
features: vehicle type, model, registration date, gearbox,
powerPS, mileage, fuel type, brand and price
2. build a model to estimate the price based on these features
and treat this as a fair market value
3. decide what’s a good/fair/bad price based on this fair
market value
source-code: https://github.com/FlorianWilhelm/used-cars-log-trans/
Question 1:
What’s worse? Selling 10 equal cars
with an actual price of 50,000 € and
1. getting the actual price for 9
but only 40,000 € for the last car or
2. getting 49,000 € for every car?
● For (R)MSE option 1 is much worse
● For MAE both options are equally good/bad
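The two options can be checked by hand. A small sketch using the numbers from this slide; the helper functions are illustrative.

```python
import math

actual = [50_000.0] * 10
option_1 = [50_000.0] * 9 + [40_000.0]  # nine exact prices, one car 10,000 € too low
option_2 = [49_000.0] * 10              # every car 1,000 € too low


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


for name, sold in [("option 1", option_1), ("option 2", option_2)]:
    print(name, "RMSE:", round(rmse(actual, sold), 1), "MAE:", mae(actual, sold))
# option 1: RMSE is about 3162.3, MAE is 1000.0
# option 2: RMSE is 1000.0, MAE is 1000.0
```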
Question 2:
Which one is worse?
Getting 1,000 € less if your
car’s actual value is
1. 100,000 € or
2. 10,000 €?
● For RMSE & MAE this makes no difference
● For RMSPE & MAPE option 2 is much worse
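The same check for Question 2, as a sketch. With a single car, RMSE equals MAE (the absolute error) and RMSPE equals MAPE (the relative error). The function name is illustrative.

```python
def errors(actual, predicted):
    """Return (absolute error, relative error) for one car."""
    abs_error = abs(actual - predicted)
    return abs_error, abs_error / actual


err_expensive = errors(100_000.0, 99_000.0)  # 1,000 € less on a 100,000 € car
err_cheap = errors(10_000.0, 9_000.0)        # 1,000 € less on a 10,000 € car

print("absolute errors:", err_expensive[0], err_cheap[0])  # identical
print("relative errors:", err_expensive[1], err_cheap[1])  # ten times larger for the cheap car
```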
Learning 1:
The right metric depends on the
use-case and will affect your results!
What does minimizing (R)MSE actually Mean?
Minimizing MSE
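Y is a continuous random variable, and we look for the ŷ that minimizes E[(Y − ŷ)²].
Derive with respect to ŷ and set to 0:
the minimizer ŷ = E[Y] is actually the Mean!
An analogous proof shows that the Median minimizes MAE.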
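The formulas of this slide did not survive the text extraction. A standard derivation, assuming Y has a density f (this notation is an assumption, not taken from the slide):

```latex
\begin{align*}
\operatorname{MSE}(\hat{y}) &= \mathbb{E}\big[(Y-\hat{y})^2\big]
  = \int_{-\infty}^{\infty} (y-\hat{y})^2 f(y)\,\mathrm{d}y \\
\frac{\mathrm{d}}{\mathrm{d}\hat{y}}\operatorname{MSE}(\hat{y})
  &= -2\int_{-\infty}^{\infty} (y-\hat{y})\, f(y)\,\mathrm{d}y = 0 \\
\Rightarrow\quad \hat{y} &= \int_{-\infty}^{\infty} y\, f(y)\,\mathrm{d}y = \mathbb{E}[Y]
  \qquad \text{since } \int_{-\infty}^{\infty} f(y)\,\mathrm{d}y = 1 \\[6pt]
\frac{\mathrm{d}}{\mathrm{d}\hat{y}}\,\mathbb{E}\big[\lvert Y-\hat{y}\rvert\big]
  &= P(Y<\hat{y}) - P(Y>\hat{y}) = 0
  \quad\Rightarrow\quad P(Y \le \hat{y}) = \tfrac{1}{2}
\end{align*}
```

The second derivative of the MSE is 2 > 0, so the stationary point is a minimum. The last line is the MAE analogue: the minimizer is the median.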
For the Math Skeptics…
Learning 2:
The mean (expected value) minimizes (R)MSE
and the median minimizes MAE.
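A numerical check for the skeptics, as a sketch on synthetic skewed data. The lognormal sample and the candidate grid are assumptions chosen for illustration. It searches the constant prediction with the lowest MSE and the lowest MAE and compares them with the sample mean and median.

```python
import random
import statistics

random.seed(0)
sample = [random.lognormvariate(0.0, 1.0) for _ in range(2001)]  # skewed: mean != median


def mse_of(c):
    """MSE when predicting the constant c for every observation."""
    return sum((y - c) ** 2 for y in sample) / len(sample)


def mae_of(c):
    """MAE when predicting the constant c for every observation."""
    return sum(abs(y - c) for y in sample) / len(sample)


mean = statistics.mean(sample)
median = statistics.median(sample)

candidates = [i / 100 for i in range(1, 501)]  # grid 0.01, 0.02, ..., 5.00
best_mse = min(candidates, key=mse_of)
best_mae = min(candidates, key=mae_of)

print("mean:  ", round(mean, 2), " best constant for MSE:", best_mse)
print("median:", round(median, 2), " best constant for MAE:", best_mae)
```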
Shrinking the Prices with Log
Distribution of Prices
Distribution of Prices and LogNormal Fit
Not perfectly lognormal,
which will be important later
Minimizing (R)MSE with log(price)
What we are going to do:
1. Take log(price) as target variable
2. Minimize (R)MSE to find ŷ
3. Transform ŷ back with exp(ŷ)
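The three steps can be sketched on toy data. The synthetic cars (price depends only on age), the coefficients and the helper `fit_line` are assumptions for illustration; the talk itself uses the Kaggle data and a real ML library.

```python
import math
import random

random.seed(42)

# Synthetic stand-in for the used-cars data: log(price) falls linearly with age.
ages = [random.uniform(0.0, 20.0) for _ in range(5000)]
prices = [math.exp(10.0 - 0.15 * age + random.gauss(0.0, 0.4)) for age in ages]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, i.e. minimal MSE."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var
    return mean_y - slope * mean_x, slope


# 1. Take log(price) as target variable
log_prices = [math.log(p) for p in prices]

# 2. Minimize (R)MSE to find ŷ (here: least squares in log space)
intercept, slope = fit_line(ages, log_prices)


# 3. Transform ŷ back with exp(ŷ)
def predict_price(age):
    return math.exp(intercept + slope * age)


print("predicted price of a 5-year-old car:", round(predict_price(5.0)))
```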
Minimizing (R)MSE with log(price) is …
… the Median?!?
Mathematically, in case of a lognormal residual distribution:
› taking the log, minimizing (R)MSE and transforming back with exp leads to the median.
› if we want the mean, we need to correct the result in log space by adding σ²/2, half the variance of the log-space residuals, before applying exp.
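A quick check on synthetic lognormal prices, as a sketch. The parameters `mu` and `sigma` are assumptions, and the "model" is just the best constant prediction. For a lognormal variable the median is exp(μ) and the mean is exp(μ + σ²/2), which is where the σ²/2 correction comes from.

```python
import math
import random
import statistics

random.seed(1)
mu, sigma = 8.0, 0.5  # assumed parameters of the synthetic prices
prices = [random.lognormvariate(mu, sigma) for _ in range(100_000)]
log_prices = [math.log(p) for p in prices]

# The best constant prediction in log space under (R)MSE is the mean of the logs.
y_hat_log = statistics.fmean(log_prices)
sigma2 = statistics.pvariance(log_prices, y_hat_log)  # residual variance in log space

naive = math.exp(y_hat_log)                   # plain back-transformation
corrected = math.exp(y_hat_log + sigma2 / 2)  # back-transformation with correction

print("exp(ŷ):          ", round(naive))
print("median of prices:", round(statistics.median(prices)))
print("exp(ŷ + σ²/2):   ", round(corrected))
print("mean of prices:  ", round(statistics.fmean(prices)))
```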
On our data (not perfectly lognormal)
a bit higher than the “actual” mean of 6807
And there is much more…
Correction terms when applying log to a target variable
with lognormal residuals and minimizing (R)MSE:
(R)MSE MAE MAPE RMSPE
Proofs under https://www.inovex.de/de/blog/honey-i-shrunk-the-target-variable/
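The table values did not survive the text extraction. The sketch below uses the corrections that follow from the moments of a lognormal distribution (+σ²/2 for (R)MSE, 0 for MAE, −σ² for MAPE, −3σ²/2 for RMSPE) and checks them against the best constant prediction computed directly on a synthetic sample. These values are derived here and should be compared with the linked proofs; data and names are assumptions.

```python
import math
import random
import statistics

# Additive correction in log space, in units of σ² (lognormal residuals assumed).
CORRECTION = {"rmse": 0.5, "mae": 0.0, "mape": -1.0, "rmspe": -1.5}


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


random.seed(7)
y = [random.lognormvariate(0.0, 0.5) for _ in range(100_000)]
log_y = [math.log(v) for v in y]
y_hat_log = statistics.fmean(log_y)
sigma2 = statistics.pvariance(log_y, y_hat_log)

# Best constant prediction per metric, computed directly on the raw sample.
inv = [1 / v for v in y]
empirical = {
    "rmse": statistics.fmean(y),                  # mean
    "mae": statistics.median(y),                  # median
    "mape": weighted_median(y, inv),              # median weighted by 1/y
    "rmspe": sum(inv) / sum(w * w for w in inv),  # E[1/Y] / E[1/Y²]
}

for metric, factor in CORRECTION.items():
    via_log = math.exp(y_hat_log + factor * sigma2)
    print(f"{metric:>5}: direct {empirical[metric]:.3f}  via log + correction {via_log:.3f}")
```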
Learning 3:
Transforming your target might change the
metric you are actually minimizing!
Transforming the Target Variable for Fun & Profit
What To Do If Your Metric Is Not Supported?
Imagine you want to optimise for RMSPE and your data has
a lognormal residual distribution, but the ML library you
are using only supports (R)MSE.
One More Time. Instead of doing…
model fit with (R)MSE
1. Fitting a model using (R)MSE as loss/metric
2. Evaluating our predictions with another
metric, e.g. MAD, MAPE, RMSPE
… We Do for Our Use-Case…
transform → model fit with (R)MSE → correction & transform
1. Log transformation
2. Fitting a model using (R)MSE as loss/metric
3. Correction & back-transformation
4. Evaluating our predictions with another
metric, e.g. MAD, MAPE, RMSPE
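The four steps, sketched end to end for RMSPE on synthetic cars. Everything here is an assumption for illustration: the data generator, the least-squares helper standing in for "a library that only supports (R)MSE", and the −3σ²/2 correction, which holds for lognormal residuals.

```python
import math
import random

random.seed(3)


def make_cars(n):
    """Synthetic cars: log(price) is linear in age with normal noise."""
    ages = [random.uniform(0.0, 20.0) for _ in range(n)]
    prices = [math.exp(10.0 - 0.15 * a + random.gauss(0.0, 0.5)) for a in ages]
    return ages, prices


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, i.e. minimal MSE."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = cov / sum((a - mean_x) ** 2 for a in x)
    return mean_y - slope * mean_x, slope


def rmspe(y_true, y_pred):
    return math.sqrt(sum(((t - p) / t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


train_age, train_price = make_cars(5000)
test_age, test_price = make_cars(5000)

# 1. Log transformation
train_log = [math.log(p) for p in train_price]

# 2. Fitting a model using (R)MSE as loss
intercept, slope = fit_line(train_age, train_log)
residuals = [t - (intercept + slope * a) for a, t in zip(train_age, train_log)]
sigma2 = sum(r * r for r in residuals) / len(residuals)

# 3. Correction & back-transformation
uncorrected = [math.exp(intercept + slope * a) for a in test_age]
corrected = [math.exp(intercept + slope * a - 1.5 * sigma2) for a in test_age]

# 4. Evaluating the predictions with the metric we care about
rmspe_uncorrected = rmspe(test_price, uncorrected)
rmspe_corrected = rmspe(test_price, corrected)
print("RMSPE without correction:", round(rmspe_uncorrected, 3))
print("RMSPE with correction:   ", round(rmspe_corrected, 3))
```

On real data the residuals are rarely perfectly lognormal, so the gain should always be verified on a hold-out set as in step 4.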
Let’s Apply This In Our Use-Case
Improvements over raw target when using a log transformation & correction
and evaluating the final prediction under a given metric, e.g. MAPE, …
In case of the Kaggle competition the
transformation was key for winning
negative numbers mean improvement
Want to know more?
blog.inovex.de
https://www.inovex.de/de/blog/honey-i-shrunk-the-target-variable/
Thank you!
Florian Wilhelm
Head of Data Science
inovex GmbH
Schanzenstraße 6-20
Kupferhütte 1.13
51063 Köln
florian.wilhelm@inovex.de
Linear Models & Normal Distribution
Recap: Linear Model
raw features
(non-linear) functions, feature engineering
weights to fit
true latent (unknown) outcome
noise
observations/samples
Normal Distribution
Cathedral Distribution
Linear model with a single, binary feature variable x and random noise.
Appendix Learning:
The residuals of a linear model should be
normally distributed, not the target variable.
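The cathedral example can be simulated, as a sketch. The weights 2.0 and 6.0 and the use of kurtosis as a rough normality indicator are assumptions for illustration. The target is bimodal, yet the residuals of the fitted linear model look normal.

```python
import random
import statistics

random.seed(5)


def kurtosis(values):
    """Fourth standardized moment; about 3 for a normal distribution."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mean)
    return statistics.fmean([(v - mean) ** 4 for v in values]) / var ** 2


n = 20_000
x = [random.randint(0, 1) for _ in range(n)]               # single binary feature
y = [2.0 + 6.0 * xi + random.gauss(0.0, 1.0) for xi in x]  # linear model plus normal noise

# For one binary feature the least-squares fit is the mean of y per group.
group_mean = {
    g: statistics.fmean([yi for xi, yi in zip(x, y) if xi == g]) for g in (0, 1)
}
residuals = [yi - group_mean[xi] for xi, yi in zip(x, y)]

print("kurtosis of target:   ", round(kurtosis(y), 2))          # far from 3: two towers
print("kurtosis of residuals:", round(kurtosis(residuals), 2))  # close to 3
```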

More Related Content

What's hot

Get file
Get fileGet file
Get file
atelier t*h
 
IDEO - Design thinking workshop 2016
IDEO - Design thinking workshop 2016IDEO - Design thinking workshop 2016
IDEO - Design thinking workshop 2016
Center for Entrepreneurship (C4E), University of Cyprus
 
Ikea Strategies
Ikea Strategies Ikea Strategies
Ikea Strategies
Mita Angela M. Dimalanta
 
IKEA Porter's Five Forces and Value Chain Analysis
IKEA Porter's Five Forces and Value Chain AnalysisIKEA Porter's Five Forces and Value Chain Analysis
IKEA Porter's Five Forces and Value Chain Analysis
National University of Malaysia
 
The design thinking transformation in business
The design thinking transformation in businessThe design thinking transformation in business
The design thinking transformation in business
Cathy Wang
 
IKEA Marketing Excellence
IKEA Marketing ExcellenceIKEA Marketing Excellence
IKEA Marketing Excellence
Varshit Kumar
 
Design Thinking explained
Design Thinking explainedDesign Thinking explained
Design Thinking explained
Twan van den Broek
 
Kickstart your Product Backlog with Innovation Games
Kickstart your Product Backlog with Innovation GamesKickstart your Product Backlog with Innovation Games
Kickstart your Product Backlog with Innovation Games
Frederic Vandaele
 
Innovation vs. Creativity
Innovation vs. CreativityInnovation vs. Creativity
Innovation vs. Creativity
Saneel Radia
 
Problems IKEA faced with foreign market entry | MKT382 RST
Problems IKEA faced with foreign market entry | MKT382 RSTProblems IKEA faced with foreign market entry | MKT382 RST
Problems IKEA faced with foreign market entry | MKT382 RST
Rifatul Sazal
 
philips versus matsushita
 philips versus matsushita  philips versus matsushita
philips versus matsushita
Shamli Sharma
 
8 steps to innovation: An introduction
8 steps to innovation: An introduction8 steps to innovation: An introduction
8 steps to innovation: An introduction
vpdabholkar
 
Solonia.Teodros_Introduction to Design Thinking.pdf
Solonia.Teodros_Introduction to Design Thinking.pdfSolonia.Teodros_Introduction to Design Thinking.pdf
Solonia.Teodros_Introduction to Design Thinking.pdf
YellowExperiments
 
Design Thinking 101 by Natalie Nixon of Figure 8 Thinking
Design Thinking 101 by Natalie Nixon of Figure 8 ThinkingDesign Thinking 101 by Natalie Nixon of Figure 8 Thinking
Design Thinking 101 by Natalie Nixon of Figure 8 Thinking
Natalie W. Nixon, PhD
 
Sosiaalisen median katsaus 07/2022
Sosiaalisen median katsaus 07/2022Sosiaalisen median katsaus 07/2022
Sosiaalisen median katsaus 07/2022
Harto Pönkä
 
IKEA Case Study: SWOT Analysis
IKEA Case Study: SWOT AnalysisIKEA Case Study: SWOT Analysis
IKEA Case Study: SWOT Analysis
InstantAssignmentHelpAustralia
 
Jeff Bezos' 2016 Letter to Amazon Shareholders
Jeff Bezos' 2016 Letter to Amazon ShareholdersJeff Bezos' 2016 Letter to Amazon Shareholders
Jeff Bezos' 2016 Letter to Amazon Shareholders
Razin Mustafiz
 
Google Analytics, analytiikka ja evästeet -tietosuojanäkökulma
Google Analytics, analytiikka ja evästeet -tietosuojanäkökulmaGoogle Analytics, analytiikka ja evästeet -tietosuojanäkökulma
Google Analytics, analytiikka ja evästeet -tietosuojanäkökulma
Harto Pönkä
 
The definition of R&D following the Frascati manual
The definition of R&D following the Frascati manualThe definition of R&D following the Frascati manual
The definition of R&D following the Frascati manual
LEYTON
 
Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...
Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...
Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...
EpicenterUSA
 

What's hot (20)

Get file
Get fileGet file
Get file
 
IDEO - Design thinking workshop 2016
IDEO - Design thinking workshop 2016IDEO - Design thinking workshop 2016
IDEO - Design thinking workshop 2016
 
Ikea Strategies
Ikea Strategies Ikea Strategies
Ikea Strategies
 
IKEA Porter's Five Forces and Value Chain Analysis
IKEA Porter's Five Forces and Value Chain AnalysisIKEA Porter's Five Forces and Value Chain Analysis
IKEA Porter's Five Forces and Value Chain Analysis
 
The design thinking transformation in business
The design thinking transformation in businessThe design thinking transformation in business
The design thinking transformation in business
 
IKEA Marketing Excellence
IKEA Marketing ExcellenceIKEA Marketing Excellence
IKEA Marketing Excellence
 
Design Thinking explained
Design Thinking explainedDesign Thinking explained
Design Thinking explained
 
Kickstart your Product Backlog with Innovation Games
Kickstart your Product Backlog with Innovation GamesKickstart your Product Backlog with Innovation Games
Kickstart your Product Backlog with Innovation Games
 
Innovation vs. Creativity
Innovation vs. CreativityInnovation vs. Creativity
Innovation vs. Creativity
 
Problems IKEA faced with foreign market entry | MKT382 RST
Problems IKEA faced with foreign market entry | MKT382 RSTProblems IKEA faced with foreign market entry | MKT382 RST
Problems IKEA faced with foreign market entry | MKT382 RST
 
philips versus matsushita
 philips versus matsushita  philips versus matsushita
philips versus matsushita
 
8 steps to innovation: An introduction
8 steps to innovation: An introduction8 steps to innovation: An introduction
8 steps to innovation: An introduction
 
Solonia.Teodros_Introduction to Design Thinking.pdf
Solonia.Teodros_Introduction to Design Thinking.pdfSolonia.Teodros_Introduction to Design Thinking.pdf
Solonia.Teodros_Introduction to Design Thinking.pdf
 
Design Thinking 101 by Natalie Nixon of Figure 8 Thinking
Design Thinking 101 by Natalie Nixon of Figure 8 ThinkingDesign Thinking 101 by Natalie Nixon of Figure 8 Thinking
Design Thinking 101 by Natalie Nixon of Figure 8 Thinking
 
Sosiaalisen median katsaus 07/2022
Sosiaalisen median katsaus 07/2022Sosiaalisen median katsaus 07/2022
Sosiaalisen median katsaus 07/2022
 
IKEA Case Study: SWOT Analysis
IKEA Case Study: SWOT AnalysisIKEA Case Study: SWOT Analysis
IKEA Case Study: SWOT Analysis
 
Jeff Bezos' 2016 Letter to Amazon Shareholders
Jeff Bezos' 2016 Letter to Amazon ShareholdersJeff Bezos' 2016 Letter to Amazon Shareholders
Jeff Bezos' 2016 Letter to Amazon Shareholders
 
Google Analytics, analytiikka ja evästeet -tietosuojanäkökulma
Google Analytics, analytiikka ja evästeet -tietosuojanäkökulmaGoogle Analytics, analytiikka ja evästeet -tietosuojanäkökulma
Google Analytics, analytiikka ja evästeet -tietosuojanäkökulma
 
The definition of R&D following the Frascati manual
The definition of R&D following the Frascati manualThe definition of R&D following the Frascati manual
The definition of R&D following the Frascati manual
 
Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...
Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...
Stanford and the Silicon Valley Ecosystem - Tom Byers - 2013 HBCU Innovation ...
 

Similar to Honey I Shrunk the Target Variable! Common pitfalls when transforming the target variable and how to exploit transformations.

USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSUSING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
HCL Technologies
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
Charles Vestur
 
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
Edge AI and Vision Alliance
 
VSSML18. Practical Workshops
VSSML18. Practical WorkshopsVSSML18. Practical Workshops
VSSML18. Practical Workshops
BigML, Inc
 
Importance of Computer In Petroleum Engineering
Importance of Computer In Petroleum EngineeringImportance of Computer In Petroleum Engineering
Importance of Computer In Petroleum Engineering
EngineerSaeedOfficial
 
MongoDB World 2018: Building Intelligent Apps with MongoDB & Google Cloud
MongoDB World 2018: Building Intelligent Apps with MongoDB & Google CloudMongoDB World 2018: Building Intelligent Apps with MongoDB & Google Cloud
MongoDB World 2018: Building Intelligent Apps with MongoDB & Google Cloud
MongoDB
 
Unlocking the Power of Integer Programming
Unlocking the Power of Integer ProgrammingUnlocking the Power of Integer Programming
Unlocking the Power of Integer Programming
Florian Wilhelm
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Sri Ambati
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
gdgsurrey
 
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
Geon-Hong Kim
 
P01executive Summary Yy2009mm03dd16
P01executive Summary Yy2009mm03dd16P01executive Summary Yy2009mm03dd16
P01executive Summary Yy2009mm03dd16
guest558440c
 
Lpp through graphical analysis
Lpp through graphical analysis Lpp through graphical analysis
Lpp through graphical analysis
YuktaBansal1
 
Energy Management Solution - iARMS-EMS/PMS
Energy Management Solution - iARMS-EMS/PMSEnergy Management Solution - iARMS-EMS/PMS
Energy Management Solution - iARMS-EMS/PMS
Envision Enterprise Solutions America Inc.
 
Can Machine Learning Models be Trusted? Explaining Decisions of ML Models
Can Machine Learning Models be Trusted? Explaining Decisions of ML ModelsCan Machine Learning Models be Trusted? Explaining Decisions of ML Models
Can Machine Learning Models be Trusted? Explaining Decisions of ML Models
Darek Smyk
 
ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case
ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case
ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case
eProsima
 
Building Custom
Machine Learning Algorithms
with Apache SystemML
Building Custom
Machine Learning Algorithms
with Apache SystemMLBuilding Custom
Machine Learning Algorithms
with Apache SystemML
Building Custom
Machine Learning Algorithms
with Apache SystemML
sparktc
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Succeeding with Functional-first Programming in Enterprise
Succeeding with Functional-first Programming in EnterpriseSucceeding with Functional-first Programming in Enterprise
Succeeding with Functional-first Programming in Enterprisedsyme
 
IBM i & Data Science in the AI era.
IBM i & Data Science in the AI era.  IBM i & Data Science in the AI era.
IBM i & Data Science in the AI era.
Benoit Marolleau
 
"Custom ML Models for Each User", Siamion Karasik
"Custom ML Models for Each User", Siamion Karasik"Custom ML Models for Each User", Siamion Karasik
"Custom ML Models for Each User", Siamion Karasik
Fwdays
 

Similar to Honey I Shrunk the Target Variable! Common pitfalls when transforming the target variable and how to exploit transformations. (20)

USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSUSING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
 
Building a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to ZBuilding a performing Machine Learning model from A to Z
Building a performing Machine Learning model from A to Z
 
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
“An Industry Standard Performance Benchmark Suite for Machine Learning,” a Pr...
 
VSSML18. Practical Workshops
VSSML18. Practical WorkshopsVSSML18. Practical Workshops
VSSML18. Practical Workshops
 
Importance of Computer In Petroleum Engineering
Importance of Computer In Petroleum EngineeringImportance of Computer In Petroleum Engineering
Importance of Computer In Petroleum Engineering
 
MongoDB World 2018: Building Intelligent Apps with MongoDB & Google Cloud
MongoDB World 2018: Building Intelligent Apps with MongoDB & Google CloudMongoDB World 2018: Building Intelligent Apps with MongoDB & Google Cloud
MongoDB World 2018: Building Intelligent Apps with MongoDB & Google Cloud
 
Unlocking the Power of Integer Programming
Unlocking the Power of Integer ProgrammingUnlocking the Power of Integer Programming
Unlocking the Power of Integer Programming
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx2024-02-24_Session 1 - PMLE_UPDATED.pptx
2024-02-24_Session 1 - PMLE_UPDATED.pptx
 
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
[OFW 14] Prediction of Flow Characteristics by Applying Machine Learning of S...
 
P01executive Summary Yy2009mm03dd16
P01executive Summary Yy2009mm03dd16P01executive Summary Yy2009mm03dd16
P01executive Summary Yy2009mm03dd16
 
Lpp through graphical analysis
Lpp through graphical analysis Lpp through graphical analysis
Lpp through graphical analysis
 
Energy Management Solution - iARMS-EMS/PMS
Energy Management Solution - iARMS-EMS/PMSEnergy Management Solution - iARMS-EMS/PMS
Energy Management Solution - iARMS-EMS/PMS
 
Can Machine Learning Models be Trusted? Explaining Decisions of ML Models
Can Machine Learning Models be Trusted? Explaining Decisions of ML ModelsCan Machine Learning Models be Trusted? Explaining Decisions of ML Models
Can Machine Learning Models be Trusted? Explaining Decisions of ML Models
 
ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case
ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case
ROS 2 AI Integration Working Group 1: ALMA, SustainML & ROS 2 use case
 
Building Custom
Machine Learning Algorithms
with Apache SystemML
Building Custom
Machine Learning Algorithms
with Apache SystemMLBuilding Custom
Machine Learning Algorithms
with Apache SystemML
Building Custom
Machine Learning Algorithms
with Apache SystemML
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Succeeding with Functional-first Programming in Enterprise
Succeeding with Functional-first Programming in EnterpriseSucceeding with Functional-first Programming in Enterprise
Succeeding with Functional-first Programming in Enterprise
 
IBM i & Data Science in the AI era.
IBM i & Data Science in the AI era.  IBM i & Data Science in the AI era.
IBM i & Data Science in the AI era.
 
"Custom ML Models for Each User", Siamion Karasik
"Custom ML Models for Each User", Siamion Karasik"Custom ML Models for Each User", Siamion Karasik
"Custom ML Models for Each User", Siamion Karasik
 

More from Florian Wilhelm

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
Florian Wilhelm
 
An Interpretable Model for Collaborative Filtering Using an Extended Latent D...
An Interpretable Model for Collaborative Filtering Using an Extended Latent D...An Interpretable Model for Collaborative Filtering Using an Extended Latent D...
An Interpretable Model for Collaborative Filtering Using an Extended Latent D...
Florian Wilhelm
 
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...
Florian Wilhelm
 
Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AI
Florian Wilhelm
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
Florian Wilhelm
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
Florian Wilhelm
 
How mobile.de brings Data Science to Production for a Personalized Web Experi...
How mobile.de brings Data Science to Production for a Personalized Web Experi...How mobile.de brings Data Science to Production for a Personalized Web Experi...
How mobile.de brings Data Science to Production for a Personalized Web Experi...
Florian Wilhelm
 
Deep Learning-based Recommendations for Germany's Biggest Vehicle Marketplace
Deep Learning-based Recommendations for Germany's Biggest Vehicle MarketplaceDeep Learning-based Recommendations for Germany's Biggest Vehicle Marketplace
Deep Learning-based Recommendations for Germany's Biggest Vehicle Marketplace
Florian Wilhelm
 
Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...
Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...
Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...
Florian Wilhelm
 
Declarative Thinking and Programming
Declarative Thinking and ProgrammingDeclarative Thinking and Programming
Declarative Thinking and Programming
Florian Wilhelm
 
Which car fits my life? - PyData Berlin 2017
Which car fits my life? - PyData Berlin 2017Which car fits my life? - PyData Berlin 2017
Which car fits my life? - PyData Berlin 2017
Florian Wilhelm
 
PyData Meetup Berlin 2017-04-19
PyData Meetup Berlin 2017-04-19PyData Meetup Berlin 2017-04-19
PyData Meetup Berlin 2017-04-19
Florian Wilhelm
 
Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...
Florian Wilhelm
 

More from Florian Wilhelm (13)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
An Interpretable Model for Collaborative Filtering Using an Extended Latent D...
An Interpretable Model for Collaborative Filtering Using an Extended Latent D...An Interpretable Model for Collaborative Filtering Using an Extended Latent D...
An Interpretable Model for Collaborative Filtering Using an Extended Latent D...
 
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...
Matrix Factorization for Collaborative Filtering Is Just Solving an Adjoint L...
 
Uncertainty Quantification in AI
Uncertainty Quantification in AIUncertainty Quantification in AI
Uncertainty Quantification in AI
 
Performance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use casePerformance evaluation of GANs in a semisupervised OCR use case
Performance evaluation of GANs in a semisupervised OCR use case
 
Bridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to ProductionBridging the Gap: from Data Science to Production
Bridging the Gap: from Data Science to Production
 
How mobile.de brings Data Science to Production for a Personalized Web Experi...
How mobile.de brings Data Science to Production for a Personalized Web Experi...How mobile.de brings Data Science to Production for a Personalized Web Experi...
How mobile.de brings Data Science to Production for a Personalized Web Experi...
 
Deep Learning-based Recommendations for Germany's Biggest Vehicle Marketplace
Deep Learning-based Recommendations for Germany's Biggest Vehicle MarketplaceDeep Learning-based Recommendations for Germany's Biggest Vehicle Marketplace
Deep Learning-based Recommendations for Germany's Biggest Vehicle Marketplace
 
Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...
Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...
Deep Learning-based Recommendations for Germany's Biggest Online Vehicle Mark...
 
Declarative Thinking and Programming
Declarative Thinking and ProgrammingDeclarative Thinking and Programming
Declarative Thinking and Programming
 
Which car fits my life? - PyData Berlin 2017
Which car fits my life? - PyData Berlin 2017Which car fits my life? - PyData Berlin 2017
Which car fits my life? - PyData Berlin 2017
 
PyData Meetup Berlin 2017-04-19
PyData Meetup Berlin 2017-04-19PyData Meetup Berlin 2017-04-19
PyData Meetup Berlin 2017-04-19
 
Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...Explaining the idea behind automatic relevance determination and bayesian int...
Explaining the idea behind automatic relevance determination and bayesian int...
 

Recently uploaded

FAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable PredictionsFAIR & AI Ready KGs for Explainable Predictions
FAIR & AI Ready KGs for Explainable Predictions
Michel Dumontier
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
kumarmathi863
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
Health Advances
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
Honey I Shrunk the Target Variable! Common pitfalls when transforming the target variable and how to exploit transformations.

  • 1. Honey, I Shrunk the Target Variable! Common pitfalls when transforming the target variable and how to exploit transformations. Florian Wilhelm, Berlin, April 12th 2022
  • 2. Dr. Florian Wilhelm, Head of Data Science @ inovex: Mathematical Modelling · Data Science to Production & MLOps · Personalisation & RecSys · Uncertainty Quantification & Causality · Python Data Stack · Creator of PyScaffold · @FlorianWilhelm · FlorianWilhelm · FlorianWilhem.info
  • 3. inovex is an IT project house with focus on digital transformation › Product Discovery · Product Ownership › Web · UI/UX · Replatforming · Microservices › Mobile · Apps · Smart Devices · Robotics › Big Data & Business Intelligence Platforms › Data Science · Data Products · Search · Deep Learning › Data Center Automation · DevOps · Cloud · Hosting › Agile Training · Technology Training · Coaching Karlsruhe · Pforzheim · Stuttgart · München · Köln · Hamburg www.inovex.de/en Using technology to inspire our clients. And ourselves.
  • 5. Choosing the Right Metric › (R)MSE is most often used in practice › Scikit-Learn’s regressors mostly use MSE as the default. In which Use-Cases does (R)MSE make sense?
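  • 6. Little Recap about Metrics: a 2×2 grid with the columns Quadratic and Absolute and the rows Difference and Relation. Difference: (R)MSE (quadratic) and MAE (absolute). Relation: RMSPE (quadratic) and MAPE (absolute).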
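The formulas on this slide did not survive as text. Written out in their usual form, the four metrics of the grid look as follows; the notation ($y_i$ for the actual value, $\hat{y}_i$ for the prediction, $n$ for the number of samples) is assumed here and not taken from the slides.

```latex
\begin{aligned}
\mathrm{RMSE}  &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
  &&\text{(quadratic, difference)} \\
\mathrm{MAE}   &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
  &&\text{(absolute, difference)} \\
\mathrm{RMSPE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{y_i}\right)^2}
  &&\text{(quadratic, relation)} \\
\mathrm{MAPE}  &= \frac{1}{n}\sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert
  &&\text{(absolute, relation)}
\end{aligned}
```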
  • 8. How much should I sell my car for? A model fitted on many sold cars and their features could provide a fair market value
  • 9. Our Use-Case Setting 1. take the used-cars database from Kaggle with 370k cars having the features: vehicle type, model, registration date, gearbox, powerPS, mileage, fuel type, brand and price 2. build a model to estimate the price based on these features and treat this as a fair market value 3. decide what’s a good/fair/bad price based on this fair market value. Source code: https://github.com/FlorianWilhelm/used-cars-log-trans/
  • 10. Question 1: What’s worse? Selling 10 equal cars with an actual price of 50,000 € each and 1. getting the actual price for 9 but only 40,000 € for the last car, or 2. getting 49,000 € for every car? ● For (R)MSE, option 1 is much worse ● For MAE, both options are equally good/bad
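The answer to Question 1 can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers from the slide; the helper names `rmse` and `mae` are chosen here for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error of paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [50_000] * 10                  # ten equal cars worth 50,000 € each
option_1 = [50_000] * 9 + [40_000]      # nine at full price, one 10,000 € too low
option_2 = [49_000] * 10                # every car 1,000 € too low

rmse_1, mae_1 = rmse(actual, option_1), mae(actual, option_1)
rmse_2, mae_2 = rmse(actual, option_2), mae(actual, option_2)

print(f"option 1: RMSE={rmse_1:.0f} MAE={mae_1:.0f}")
print(f"option 2: RMSE={rmse_2:.0f} MAE={mae_2:.0f}")
```

The total shortfall is 10,000 € in both options, which is why MAE cannot tell them apart, while the squared error punishes the single large miss in option 1.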
  • 11. Question 2: Which one is worse? Getting 1,000 € less if your car’s actual value is 1. 100,000 € or 2. 10,000 €? ● For RMSE & MAE, this makes no difference ● For RMSPE & MAPE, option 2 is much worse
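For a single car, RMSE and MAE both reduce to the absolute error, and RMSPE and MAPE both reduce to the relative error. A small sketch with the slide's numbers (the dictionary layout is just an illustration):

```python
# (actual value, amount received) for the two options of Question 2
cars = {
    "option 1": (100_000, 99_000),
    "option 2": (10_000, 9_000),
}

results = {}
for name, (actual, received) in cars.items():
    absolute_error = abs(actual - received)      # what RMSE and MAE see
    relative_error = absolute_error / actual     # what RMSPE and MAPE see
    results[name] = (absolute_error, relative_error)
    print(f"{name}: absolute error = {absolute_error} €, relative error = {relative_error:.0%}")
```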
  • 12. Learning 1: The right metric depends on the use-case and will affect your results!
  • 13. What does minimizing (R)MSE actually Mean?
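  • 14. Minimizing MSE: Let $Y$ be a continuous random variable with density $f$, so that $\mathrm{MSE}(\hat{y}) = \int (y - \hat{y})^2 f(y)\,dy$. Differentiate with respect to $\hat{y}$ and set to 0: $-2\int (y - \hat{y}) f(y)\,dy = 0$, hence $\hat{y} = \int y f(y)\,dy = E[Y]$, which is actually the Mean! The proof for MAE and the Median is analogous.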
  • 15. For the Math Skeptics…
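The slide's own demonstration is not preserved in the transcript. A minimal numerical check could look like the sketch below: it draws a skewed synthetic sample (an assumption, not the car data) and searches a grid of constant predictions for the one with the lowest MSE and the lowest MAE.

```python
import random
import statistics

random.seed(0)
# Skewed synthetic sample, so that mean and median clearly differ
sample = [random.lognormvariate(0.0, 1.0) for _ in range(2_000)]

def mse(constant):
    """Mean squared error of predicting the same constant for every sample."""
    return sum((y - constant) ** 2 for y in sample) / len(sample)

def mae(constant):
    """Mean absolute error of predicting the same constant for every sample."""
    return sum(abs(y - constant) for y in sample) / len(sample)

candidates = [i / 100 for i in range(0, 501)]   # 0.00, 0.01, ..., 5.00
best_for_mse = min(candidates, key=mse)
best_for_mae = min(candidates, key=mae)

print(f"MSE minimiser {best_for_mse:.2f} vs. mean   {statistics.fmean(sample):.2f}")
print(f"MAE minimiser {best_for_mae:.2f} vs. median {statistics.median(sample):.2f}")
```

Up to the grid resolution of 0.01, the two minimisers coincide with the sample mean and the sample median.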
  • 16. Learning 2: The mean (expected value) minimizes (R)MSE and the median minimizes MAE.
  • 17. Shrinking the Prices with Log
  • 19. Distribution of Prices and LogNormal Fit: Not perfectly lognormal, which will be important later
  • 20. Minimizing (R)MSE with log(price). What we are going to do: 1. Take log(price) as target variable 2. Minimize (R)MSE to find ŷ 3. Transform ŷ back with exp(ŷ)
  • 21. Minimizing (R)MSE with log(price) is …
  • 22. … the Median?!? Mathematically, in case of a lognormal residual distribution: › taking the log, minimizing for RMSE and transforming back with exp will lead to the median. › if we want the mean, we need to correct the transformed result by adding $\sigma^2/2$ before applying exp, where $\sigma^2$ is the variance of the residuals in log space. On our data (not perfectly lognormal), the corrected estimate is a bit higher than the “actual” mean of 6807. Image source: https://www.pinterest.de/pin/494973815284951824/, uploaded by Jittanisa Sukaphatana
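The effect can be reproduced on synthetic data. The sketch below assumes perfectly lognormal "prices" with made-up parameters (`mu`, `sigma` are hypothetical, not fitted to the car data) and uses a constant model, for which minimizing MSE on the log scale simply yields the mean of the log prices.

```python
import math
import random
import statistics

random.seed(1)
mu, sigma = 8.0, 0.5    # hypothetical lognormal parameters
prices = [random.lognormvariate(mu, sigma) for _ in range(20_000)]

# Steps 1 and 2: take the log and minimise MSE with a constant model (= mean of the logs)
log_prices = [math.log(p) for p in prices]
y_hat_log = statistics.fmean(log_prices)
sigma2 = statistics.pvariance(log_prices, mu=y_hat_log)   # residual variance in log space

# Step 3: transform back, once without and once with the sigma^2 / 2 correction
back_transformed = math.exp(y_hat_log)
corrected = math.exp(y_hat_log + sigma2 / 2)

print(f"exp(y_hat)             = {back_transformed:8.1f}  median = {statistics.median(prices):8.1f}")
print(f"exp(y_hat + sigma^2/2) = {corrected:8.1f}  mean   = {statistics.fmean(prices):8.1f}")
```

The plain back-transformation lands close to the sample median, and only the corrected version lands close to the sample mean.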
  • 23. And there is much more… Correction terms when applying log to a target variable with lognormal residuals and minimizing (R)MSE, depending on the metric you actually care about: (R)MSE, MAE, MAPE, RMSPE. Proofs at https://www.inovex.de/de/blog/honey-i-shrunk-the-target-variable/
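The table of correction terms is not preserved in this transcript. For a lognormal variable with log-space parameters μ and σ², the standard closed-form minimisers are exp(μ + σ²/2) for (R)MSE, exp(μ) for MAE, exp(μ − σ²) for MAPE and exp(μ − 3σ²/2) for RMSPE. They are stated here as an assumption about what the slide showed, and the sketch below checks them numerically. The parameters are hypothetical, and a deterministic grid of quantiles stands in for a random lognormal sample to keep sampling noise out of the comparison.

```python
import math
from statistics import NormalDist

mu, sigma = 0.0, 0.5     # hypothetical log-space parameters
n = 2_000
std_normal = NormalDist()
# Evenly spaced quantiles of a lognormal distribution instead of random draws
sample = [math.exp(mu + sigma * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]

# Per-sample losses; minimising their sum also minimises the (rooted) mean
losses = {
    "RMSE": lambda y, c: (y - c) ** 2,
    "MAE": lambda y, c: abs(y - c),
    "MAPE": lambda y, c: abs((y - c) / y),
    "RMSPE": lambda y, c: ((y - c) / y) ** 2,
}
closed_form = {
    "RMSE": math.exp(mu + sigma**2 / 2),
    "MAE": math.exp(mu),
    "MAPE": math.exp(mu - sigma**2),
    "RMSPE": math.exp(mu - 1.5 * sigma**2),
}

candidates = [i / 200 for i in range(100, 301)]   # 0.500, 0.505, ..., 1.500
best = {}
for name, loss in losses.items():
    best[name] = min(candidates, key=lambda c: sum(loss(y, c) for y in sample))
    print(f"{name:5s} grid minimiser {best[name]:.3f}  closed form {closed_form[name]:.3f}")
```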
  • 24. Learning 3: Transforming your target might change the metric you are actually minimizing!
  • 25. Transforming the Target Variable for Fun & Profit
  • 26. What To Do If Your Metric Is Not Supported? Imagine you want to optimise for RMSPE and your data has a lognormal residual distribution, but the ML library you are using only supports (R)MSE.
  • 27. One More Time. Instead of doing… (diagram: model fit with (R)MSE) 1. Fitting a model using (R)MSE as loss/metric 2. Evaluating our predictions with another metric, e.g. MAD, MAPE, RMSPE
  • 28. … We Do for Our Use-Case… (diagram: transform → model fit with (R)MSE → correction & transform) 1. Log transformation 2. Fitting a model using (R)MSE as loss/metric 3. Correction & back-transformation 4. Evaluating our predictions with another metric, e.g. MAD, MAPE, RMSPE
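The four steps can be sketched end to end on synthetic data. Everything concrete below is an assumption made for illustration: the single feature `age`, the coefficients, the noise level, the in-sample evaluation and the use of the RMSPE correction term −3σ²/2 for lognormal residuals. It is not the code from the talk's repository.

```python
import math
import random
import statistics

random.seed(42)
# Synthetic stand-in for the car data: log(price) is linear in age plus normal noise
ages = [random.uniform(0, 15) for _ in range(5_000)]
prices = [math.exp(10.0 - 0.12 * age + random.gauss(0.0, 0.6)) for age in ages]

# 1. Log transformation
log_prices = [math.log(p) for p in prices]

# 2. Fit with MSE: ordinary least squares for log_price = intercept + slope * age
mean_x, mean_y = statistics.fmean(ages), statistics.fmean(log_prices)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, log_prices))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x
residuals = [y - (intercept + slope * x) for x, y in zip(ages, log_prices)]
sigma2 = statistics.pvariance(residuals)

# 3. Correction & back-transformation
def predict(age, correction=0.0):
    """Back-transformed prediction; the correction is added in log space."""
    return math.exp(intercept + slope * age + correction)

# 4. Evaluate with the metric we actually care about (here on the training data)
def rmspe(actual, predicted):
    return math.sqrt(sum(((a - p) / a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

rmspe_plain = rmspe(prices, [predict(age) for age in ages])
rmspe_corrected = rmspe(prices, [predict(age, -1.5 * sigma2) for age in ages])
print(f"RMSPE without correction: {rmspe_plain:.3f}")
print(f"RMSPE with correction:    {rmspe_corrected:.3f}")
```

The model itself is only ever fitted with MSE; the correction term alone moves the back-transformed prediction towards the RMSPE-optimal value.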
  • 29. Let’s Apply This In Our Use-Case: Improvements over the raw target when using a log transformation & correction and evaluating the final prediction under a given metric, e.g. MAPE, … (negative numbers mean improvement). In case of the Kaggle competition, the transformation was key for winning.
  • 31. Want to know more? blog.inovex.de https://www.inovex.de/de/blog/honey-i-shrunk-the-target-variable/
  • 32. Thank you! Florian Wilhelm Head of Data Science inovex GmbH Schanzenstraße 6-20 Kupferhütte 1.13 51063 Köln florian.wilhelm@inovex.de
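  • 34. Recap: Linear Model: $y_i = \sum_j w_j \phi_j(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, where $i$ indexes the observations/samples, $x_i$ are the raw features, $\phi_j$ are (non-linear) functions (feature engineering), $w_j$ are the weights to fit, $\sum_j w_j \phi_j(x_i)$ is the true latent (unknown) outcome and $\epsilon_i$ is noise following a Normal Distribution.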
  • 35. Cathedral Distribution: Linear model with a single, binary feature variable x and random noise.
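A small simulation makes the point of this slide concrete. The numbers (weight 10, unit noise, balanced binary feature) are assumptions, and excess kurtosis is used here as a crude stand-in for looking at the histogram: it is close to 0 for a normal distribution and strongly negative for a distribution with two well-separated peaks.

```python
import random
import statistics

random.seed(7)
n = 10_000
xs = [random.randint(0, 1) for _ in range(n)]            # single binary feature
ys = [10.0 * x + random.gauss(0.0, 1.0) for x in xs]     # linear model plus normal noise

# With one binary feature, the least-squares fit of y = w0 + w1 * x
# reduces to predicting the mean of each of the two groups
group_mean = {
    value: statistics.fmean([y for x, y in zip(xs, ys) if x == value])
    for value in (0, 1)
}
residuals = [y - group_mean[x] for x, y in zip(xs, ys)]

def excess_kurtosis(values):
    """Fourth standardised central moment minus 3 (about 0 for normal data)."""
    mean = statistics.fmean(values)
    variance = statistics.fmean([(v - mean) ** 2 for v in values])
    fourth = statistics.fmean([(v - mean) ** 4 for v in values])
    return fourth / variance**2 - 3.0

target_kurtosis = excess_kurtosis(ys)
residual_kurtosis = excess_kurtosis(residuals)
print(f"excess kurtosis of the target:    {target_kurtosis:+.2f}")
print(f"excess kurtosis of the residuals: {residual_kurtosis:+.2f}")
```

The target is clearly not normal, yet the residuals are, which is all the linear model asks for.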
  • 36. Appendix Learning: The residuals of a linear model should be normally distributed, not the target variable.