Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow
Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow
Get data >> features/inputs and what we want to predict (the target)
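The notes mention the examples are Python with scikit-learn. A minimal sketch of the "get data" step, assuming scikit-learn's iris toy dataset as a stand-in for the toy data used in the talk:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X = features/inputs, y = what we want to predict
X, y = load_iris(return_X_y=True)

# Hold back a test set so later scores are measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)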
Simple ML workflow
Baseline model >> accuracy score (perfect = 1.0); baseline = 0.44
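A sketch of the baseline described in the speaker notes, continuing from the data-loading sketch above: a "dumb" model that predicts the most frequently occurring class, scored with accuracy (the 0.44 is the talk's own figure, not something this toy example will reproduce):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Predict everything as the most frequently occurring class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

print("Baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))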
Simple ML workflow
Model selection >> best model = Random Forest
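A sketch of the model selection step, continuing from above and assuming a handful of standard scikit-learn classifiers compared with cross-validation (the slide only reports that a random forest came out best):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Compare mean cross-validated accuracy and keep the best scorer
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 3))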
Simple ML workflow
Hyperparameter optimisation >>
Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
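A sketch of the hyperparameter optimisation step with GridSearchCV, continuing from above; the grid is illustrative and simply includes the best parameters reported on the slide:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best score:", search.best_score_)
print("Best params:", search.best_params_)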
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow
Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem
Source: flaticon.com
4 is bigger than 1, so there must be a relationship between these rows
Source: flaticon.com
1 = neutered male
2 = spayed female
3 = intact male
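A minimal sketch of label encoding as shown on the slide, using pandas and an assumed column name from the animal shelter data; the integers are an arbitrary mapping, which is exactly where the problem comes from:

import pandas as pd

df = pd.DataFrame({"sex_upon_outcome": ["Neutered Male", "Spayed Female", "Intact Male"]})

# Replace each category with an arbitrary integer, as on the slide
mapping = {"Neutered Male": 1, "Spayed Female": 2, "Intact Male": 3}
df["sex_encoded"] = df["sex_upon_outcome"].map(mapping)

# The machine only sees the numbers and the relationships between them
# (e.g. 3 > 1), not the categories they stand for
print(df)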
Solution: One hot encoding
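A sketch of the one hot encoding solution on the same assumed column: each category becomes its own 0/1 column, so no artificial ordering is implied:

import pandas as pd

df = pd.DataFrame({"sex_upon_outcome": ["Neutered Male", "Spayed Female", "Intact Male"]})

one_hot = pd.get_dummies(df, columns=["sex_upon_outcome"], prefix="sex")
print(one_hot)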
Ordinal data
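For ordinal data the speaker notes suggest mapping the value to its numerical equivalent instead, e.g. "age upon outcome" into days. A sketch, with the column name and value format assumed:

import pandas as pd

df = pd.DataFrame({"age_upon_outcome": ["3 weeks", "5 months", "2 years"]})

unit_in_days = {"day": 1, "days": 1, "week": 7, "weeks": 7,
                "month": 30, "months": 30, "year": 365, "years": 365}

def age_to_days(age):
    number, unit = age.split()
    return int(number) * unit_in_days[unit]

# 21, 150, 730 days: now the machine can see that 2 years > 5 months
df["age_in_days"] = df["age_upon_outcome"].apply(age_to_days)
print(df)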
Problem: Won’t work for all variables
366 different unique values = 366 new features
Solution: Feature engineering?
Single colour vs. multi colour
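A sketch of one hand-engineered feature: collapsing the 366 raw colour values into a single colour vs. multi colour flag, assuming multi-coloured animals are recorded with a "/" as in the shelter data:

import pandas as pd

df = pd.DataFrame({"color": ["Black", "Black/White", "Tan", "Brown Tabby/White"]})

# 1 if the animal has more than one colour listed, 0 otherwise
df["is_multi_colour"] = df["color"].str.contains("/").astype(int)
print(df)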
Problem: We will lose a lot of information
Source: thetelegraph.com
Solution: Weight of evidence
For each colour (e.g. Tan):
WOE = log( (p_i / p) / (n_i / n) )
p_i = number of times Tan appears in the positive class (1)
p = total number of samples in the positive class (1)
n_i = number of times Tan appears in the negative class (0)
n = total number of samples in the negative class (0)
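A sketch of the weight of evidence calculation above, computed per colour with pandas on toy data (column names are assumed; the 0.5 smoothing constant is an addition to avoid dividing by, or taking the log of, zero):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color":  ["Tan", "Tan", "Black", "Black", "Tan", "White"],
    "target": [1, 1, 0, 0, 0, 1],
})

p = (df["target"] == 1).sum()  # total number of samples in the positive class
n = (df["target"] == 0).sum()  # total number of samples in the negative class

def woe(targets):
    p_i = (targets == 1).sum()  # times this colour appears in the positive class
    n_i = (targets == 0).sum()  # times this colour appears in the negative class
    return np.log(((p_i + 0.5) / p) / ((n_i + 0.5) / n))

df["color_woe"] = df.groupby("color")["target"].transform(woe)
print(df)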
Solution: Weight of evidence
Output is a positive or negative number: positive means the colour is associated with the positive class, negative with the negative class
Solution(s)
WOE is one of many solutions for this
Problem(s)
Source: Photo by Louis Reed on Unsplash
Scikit-learn pipelines
Solution
pip install category_encoders
Pipeline example
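The pipeline code slide isn't reproduced here, so this is a hedged sketch of the kind of thing it shows: per-column preprocessing (one hot encoding plus a weight of evidence encoder from category_encoders) chained with the model, with column names assumed from the animal shelter data:

import category_encoders as ce
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

preprocess = ColumnTransformer([
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["sex_upon_outcome", "animal_type"]),
    ("woe", ce.WOEEncoder(), ["color", "breed"]),
], remainder="passthrough")

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

# Preprocessing is applied at the same time as fitting / predicting:
# pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)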
Less time
But still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening
Find me at...

Data Preparation and the Importance of How Machines Learn by Rebecca Vickery

Editor's Notes

  • #4 Before I get into talking more about ML I will recap what it is. For the purposes of this talk I am only going to cover supervised ML. In supervised ML we provide the algorithm with labelled examples. The algorithm learns a mapping and uses this mapping/pattern/rules to predict unlabelled data.
  • #5 This mapping of inputs to outputs is based on the algorithm performing a large number of mathematical computations very quickly. The maths behind most machine learning models is mostly based on linear algebra and calculus. I am not going to talk in detail about the maths, I think there is 1 v simple equation in the talk. But the fact that in ml machines learn using maths is important to the talk.
  • #6 So to build up this picture of how machines learn I’ll walk through what a simple ML workflow looks like. I’ll be showing a few code examples. This is all Python based, I mainly work in Python, and will be using the open source ML library scikit-learn. Is anyone familiar?
  • #8 Create a dumb model. In this case I have created a model that predicts everything as the most frequently occurring class. This is so that you know that the improvements you make or cleverer models you create really are clever and aren’t doing anything stupid.
  • #11 This is a basic workflow on toy data but in the real world things are never that simple. For example I have yet to ever see a 100% accuracy score on real data. The other thing to note is that because this is a toy data set all the features were numerical. In real life this is very rarely the case. You will most often be dealing with different data types.
  • #12 So what happens when we have a data set like this? This data is taken from the website Kaggle - an ML competition website. It contains attributes of various pets that have been placed in an animal shelter. The goal of this competition is to predict whether the animal will have a successful outcome or not. All the data in this is categorical or string apart from the target variable. Introduce Kaggle.
  • #13 So this is what happens when we try to run ml on this. We get an error. Because the algorithms only understand numbers. So we need to translate all these features into numbers. A language that the machine can understand and learn from.
  • #14 And this is not a simple process because, as much as ML is often hyped as being this wonder tool, if humans don’t think very carefully about the data that they are feeding the machine then machines can be really dumb. Machines can learn patterns in data but they cannot think beyond the data and the format that data comes in. So humans need to do the thinking. This is why when talking about data science this joke is often made. But it is actually very true.
  • #15 https://www.thedailybeast.com/why-doctors-arent-afraid-of-better-more-efficient-ai-diagnosing-cancer Machines can learn but they can’t think… yet. Humans need to do the thinking - humans supply the context.
  • #16 In reality this is what a real ml workflow looks like. The data preparation part, especially when we are dealing with something like the animal shelter data set is one of the most time consuming parts.
  • #17 So how do we do this conversion? One solution is called label encoding. Talk through what it is.
  • #18 The problem is that all the machine sees is numbers and the relationships between them. It does not have any context beyond that. It doesn’t know that this is a mapping to something meaningful from a human perspective. So we need a better way to represent these numbers.
  • #19 One solution is known as one hot encoding. Explain.
  • #20 But one hot encoding can’t just be used for all categorical features. For example with the age upon outcome there is a relationship between the rows. 1 year is smaller than 2 years. It is important that the machine is able to capture this context too. So with this feature it makes sense to map it to its equivalent numerical representation, in this case into days.
  • #21 One hot encoding also doesn’t work for features where there is high cardinality. Or in other words there are a large number of unique values in the feature. For example color has 366 different values. If we did one hot encoding we would create 366 new columns which would make the dataset extremely wide and sparse. This is a problem because it can add a lot of noise to the data and lead to overfitting.
  • #22 One solution to this is to engineer new intuitive features. So again this goes back to the importance of humans doing the thinking and why, when people list out the skills data scientists need, they list domain knowledge. You can also use data analysis to try to understand some relationships in the data to derive these features, and you will also need to do some trial and error. So one example of a feature we could engineer is single colour pets vs multi colour pets. My cats - stuff. Maybe there is a relationship there.
  • #23 The problem with this approach is that, even with the best will in the world, the most intuition or domain knowledge, and lots of data analysis, by just doing feature engineering you lose one of the advantages of ML: that, used correctly, it can pick up on hidden patterns that a human could not see. With feature engineering we may miss things like this. Has anyone heard about this? That black cats are finding it much harder to find homes because they don’t photograph well.
  • #24 There are solutions to this. We can use maths to calculate features that attempt to capture the patterns in these features. One example of this is weight of evidence. The output is a positive or negative number; if the number is positive then the appearance of Tan is a positive influencer of the positive outcome. The machine can learn this relationship if it is present across enough samples.
  • #26 There are many different methods to compute these.
  • #27 In machine learning we need to experiment with different techniques and combinations of techniques to find the best solution for the data set. If we have to manually code all these solutions then this can be very time consuming.
  • #28 Fortunately there are two solutions for this. Scikit-learn has a feature called pipelines. Pipelines allow you to chain together steps. These steps can be a number of things but the important one in this example is the preprocessing steps. We can chain together all the different methods for preprocessing, apply them to the desired columns, and then when we perform model fitting the preprocessing is applied at the same time.
  • #29 To code something like weight of evidence is a lot of code, a lot of logic to work through making sure calculations are correct and so on. If we had to repeat this to try all these different encoders it would take a very long time. Fortunately there is a Python library called category_encoders that does this work for us.
  • #30 Let’s look at an example. Talk through.
  • #31 So this makes this process quite a lot easier but there is still a lot of work to do on feature engineering.
  • #32 The CEO of Kaggle once said this having observed winning and losing solutions in Kaggle competitions. By handcrafted he is talking about feature engineering. One competition had a dataset containing a number of features about cars at auction, for example mileage, age, make, model, colour, and the task was to predict which ones would be a good buy and which would be a lemon. The winning entry grouped the cars into unusual colours (so not commonly occurring) and usual colours. And this turned out to be one of the most predictive features and won them the competition.