Data Preparation and the Importance of How Machines Learn
Rebecca Vickery, Data Scientist, Holiday Extras
Machine learning
Source: Google images
Simple ML workflow
Get data >> baseline model >> model selection >> model tuning >> predict
Simple ML workflow
Get data >> features/inputs and what we want to predict (the target)
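The notes mention the examples are Python with scikit-learn. A minimal sketch of the "get data" step, assuming scikit-learn's iris toy dataset as a stand-in for the toy data used in the talk:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X = features/inputs, y = what we want to predict
X, y = load_iris(return_X_y=True)

# Hold back a test set so later scores are measured on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)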
Simple ML workflow
Baseline model >> accuracy score (perfect = 1.0); baseline = 0.44
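A sketch of the baseline described in the speaker notes, continuing from the data-loading sketch above: a "dumb" model that predicts the most frequently occurring class, scored with accuracy (the 0.44 is the talk's own figure, not something this toy example will reproduce):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Predict everything as the most frequently occurring class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

print("Baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))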
Simple ML workflow
Model selection >> best model = Random Forest
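A sketch of the model selection step, continuing from above and assuming a handful of standard scikit-learn classifiers compared with cross-validation (the slide only reports that a random forest came out best):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Compare mean cross-validated accuracy and keep the best scorer
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(name, round(scores.mean(), 3))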
Simple ML workflow
Hyperparameter optimisation >>
Best score = 1.0
Best params = {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 500}
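A sketch of the hyperparameter optimisation step with GridSearchCV, continuing from above; the grid is illustrative and simply includes the best parameters reported on the slide:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 500],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best score:", search.best_score_)
print("Best params:", search.best_params_)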
Source: Google images
What happens when we have this data set?
Source: thedailybeast.com
Actual ML workflow
Get data >> data preparation >> feature engineering >> baseline model >> model selection >> model tuning >> predict
Label encoding
Problem
Source: flaticon.com
4 is bigger than 1, so there must be a relationship between these rows
Source: flaticon.com
1 = neutered male
2 = spayed female
3 = intact male
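A minimal sketch of label encoding as shown on the slide, using pandas and an assumed column name from the animal shelter data; the integers are an arbitrary mapping, which is exactly where the problem comes from:

import pandas as pd

df = pd.DataFrame({"sex_upon_outcome": ["Neutered Male", "Spayed Female", "Intact Male"]})

# Replace each category with an arbitrary integer, as on the slide
mapping = {"Neutered Male": 1, "Spayed Female": 2, "Intact Male": 3}
df["sex_encoded"] = df["sex_upon_outcome"].map(mapping)

# The machine only sees the numbers and the relationships between them
# (e.g. 3 > 1), not the categories they stand for
print(df)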
Solution: One hot encoding
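A sketch of the one hot encoding solution on the same assumed column: each category becomes its own 0/1 column, so no artificial ordering is implied:

import pandas as pd

df = pd.DataFrame({"sex_upon_outcome": ["Neutered Male", "Spayed Female", "Intact Male"]})

one_hot = pd.get_dummies(df, columns=["sex_upon_outcome"], prefix="sex")
print(one_hot)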
Ordinal data
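For ordinal data the speaker notes suggest mapping the value to its numerical equivalent instead, e.g. "age upon outcome" into days. A sketch, with the column name and value format assumed:

import pandas as pd

df = pd.DataFrame({"age_upon_outcome": ["3 weeks", "5 months", "2 years"]})

unit_in_days = {"day": 1, "days": 1, "week": 7, "weeks": 7,
                "month": 30, "months": 30, "year": 365, "years": 365}

def age_to_days(age):
    number, unit = age.split()
    return int(number) * unit_in_days[unit]

# 21, 150, 730 days: now the machine can see that 2 years > 5 months
df["age_in_days"] = df["age_upon_outcome"].apply(age_to_days)
print(df)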
Problem: Won’t work for all variables
366 different unique values = 366 new features
Solution: Feature engineering?
Single colour vs. multi colour
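A sketch of one hand-engineered feature: collapsing the 366 raw colour values into a single colour vs. multi colour flag, assuming multi-coloured animals are recorded with a "/" as in the shelter data:

import pandas as pd

df = pd.DataFrame({"color": ["Black", "Black/White", "Tan", "Brown Tabby/White"]})

# 1 if the animal has more than one colour listed, 0 otherwise
df["is_multi_colour"] = df["color"].str.contains("/").astype(int)
print(df)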
Problem: We will lose a lot of information
Source: thetelegraph.com
Solution: Weight of evidence
For each colour (e.g. Tan):
WOE = log( (p_i / p) / (n_i / n) )
p_i = number of times Tan appears in the positive class (1)
p = total number of samples in the positive class (1)
n_i = number of times Tan appears in the negative class (0)
n = total number of samples in the negative class (0)
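A sketch of the weight of evidence calculation above, computed per colour with pandas on toy data (column names are assumed; the 0.5 smoothing constant is an addition to avoid dividing by, or taking the log of, zero):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color":  ["Tan", "Tan", "Black", "Black", "Tan", "White"],
    "target": [1, 1, 0, 0, 0, 1],
})

p = (df["target"] == 1).sum()  # total number of samples in the positive class
n = (df["target"] == 0).sum()  # total number of samples in the negative class

def woe(targets):
    p_i = (targets == 1).sum()  # times this colour appears in the positive class
    n_i = (targets == 0).sum()  # times this colour appears in the negative class
    return np.log(((p_i + 0.5) / p) / ((n_i + 0.5) / n))

df["color_woe"] = df.groupby("color")["target"].transform(woe)
print(df)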
Solution: Weight of evidence
Output is a positive or negative number: positive means the colour is associated with the positive class, negative with the negative class
Solution(s)
WOE is one of many solutions for this
Problem(s)
Source: Photo by Louis Reed on Unsplash
Scikit-learn pipelines
Solution
pip install category_encoders
Pipeline example
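The pipeline code slide isn't reproduced here, so this is a hedged sketch of the kind of thing it shows: per-column preprocessing (one hot encoding plus a weight of evidence encoder from category_encoders) chained with the model, with column names assumed from the animal shelter data:

import category_encoders as ce
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

preprocess = ColumnTransformer([
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), ["sex_upon_outcome", "animal_type"]),
    ("woe", ce.WOEEncoder(), ["color", "breed"]),
], remainder="passthrough")

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

# Preprocessing is applied at the same time as fitting / predicting:
# pipeline.fit(X_train, y_train); pipeline.score(X_test, y_test)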
Less time
But still some work to do
“There are only two Machine Learning approaches that win competitions: Handcrafted & Neural Networks.”
Anthony Goldbloom, CEO & Founder, Kaggle
Thanks for listening
Find me at...

Data Preparation and the Importance of How Machines Learn by Rebecca Vickery

Editor's Notes

  • #4 Before I get into talking more about ML I will recap what it is. For the purposes of this talk I am only going to cover supervised ML. In supervised ML we provide the algorithm with labelled examples. The algorithm learns a mapping and uses this mapping/pattern/rules to predict unlabelled data.
  • #5 This mapping of inputs to outputs is based on the algorithm performing a large number of mathematical computations very quickly. The maths behind most machine learning models is mostly based on linear algebra and calculus. I am not going to talk in detail about the maths, I think there is 1 v simple equation in the talk. But the fact that in ml machines learn using maths is important to the talk.
  • #6 So to build up this picture of how machines learn I’ll walk through what a simple ML workflow looks like. I’ll be showing a few code examples. This is all Python based, I mainly work in Python, and will be using the open source ML library scikit-learn. Is anyone familiar?
  • #8 Create a dumb model. In this case I have created a model that predicts everything as the most frequently occurring class. This is so that you know that the improvements you make or cleverer models you create really are clever and aren’t doing anything stupid.
  • #11 This is a basic workflow on toy data but in the real world things are never that simple. For example I have yet to ever see a 100% accuracy score on real data. The other thing to note is that because this is a toy data set all the features were numerical. In real life this is very rarely the case. You will most often be dealing with different data types.
  • #12 So what happens when we have a data set like this? This data is taken from the website Kaggle - an ML competition website. It contains attributes of various pets that have been placed in an animal shelter. The goal of this competition is to predict whether the animal will have a successful outcome or not. All the data in this is categorical or string apart from the target variable. Introduce Kaggle.
  • #13 So this is what happens when we try to run ml on this. We get an error. Because the algorithms only understand numbers. So we need to translate all these features into numbers. A language that the machine can understand and learn from.
  • #14 And this is not a simple process because, as much as ML is often hyped as being this wonder tool, if humans don’t think very carefully about the data that they are feeding the machine then machines can be really dumb. Machines can learn patterns in data but they cannot think beyond the data and the format that data comes in. So humans need to do the thinking. This is why when talking about data science this joke is often made. But it is actually very true.
  • #15 https://www.thedailybeast.com/why-doctors-arent-afraid-of-better-more-efficient-ai-diagnosing-cancer Machines can learn but they can’t think… yet. Humans need to do the thinking - humans supply the context.
  • #16 In reality this is what a real ml workflow looks like. The data preparation part, especially when we are dealing with something like the animal shelter data set is one of the most time consuming parts.
  • #17 So how do we do this conversion? One solution is called label encoding. Talk through what it is.
  • #18 The problem is that all the machine sees is numbers and the relationships between them. It does not have any context beyond that. It doesn’t know that this is a mapping to something meaningful from a human perspective. So we need a better way to represent these numbers.
  • #19 One solution is known as one hot encoding. Explain.
  • #20 But one hot encoding can’t just be used for all categorical features. For example with the age upon outcome there is a relationship between the rows. 1 year is smaller than 2 years. It is important that the machine is able to capture this context too. So with this feature it makes sense to map it to its equivalent numerical representation, in this case into days.
  • #21 One hot encoding also doesn’t work for features where there is high cardinality. Or in other words there are a large number of unique values in the feature. For example color has 366 different values. If we did one hot encoding we would create 366 new columns which would make the dataset extremely wide and sparse. This is a problem because it can add a lot of noise to the data and lead to overfitting.
  • #22 One solution to this is to engineer new intuitive features. So again this goes back to the importance of humans doing the thinking and why, when people list out the skills data scientists need, they list domain knowledge. You can also use data analysis to try to understand some relationships in the data to derive these features, and you will also need to do some trial and error. So one example of a feature we could engineer is single colour pets vs multi colour pets. My cats - stuff. Maybe there is a relationship there.
  • #23 The problem with this approach is that, even with the best will in the world, the most intuition or domain knowledge, and lots of data analysis, by just doing feature engineering you lose one of the advantages of ML: that, used correctly, it can pick up on hidden patterns that a human could not see. With feature engineering we may miss things like this. Has anyone heard about this? That black cats are finding it much harder to find homes because they don’t photograph well.
  • #24 There are solutions to this. We can use maths to calculate features that attempt to capture the patterns in these features. One example of this is weight of evidence. The output is a positive or negative number; if the number is positive then the appearance of Tan is a positive influencer of the positive outcome. The machine can learn this relationship if it is present across enough samples.
  • #26 There are many different methods to compute these.
  • #27 In machine learning we need to experiment with different techniques and combinations of techniques to find the best solution for the data set. If we have to manually code all these solutions then this can be very time consuming.
  • #28 Fortunately there are two solutions for this. Scikit-learn has a feature called pipelines. Pipelines allow you to chain together steps. These steps can be a number of things but the important one in this example is the preprocessing steps. We can chain together all the different methods for preprocessing, apply them to the desired columns, and then when we perform model fitting the preprocessing is applied at the same time.
  • #29 To code something like weight of evidence is a lot of code, a lot of logic to work through making sure calculations are correct and so on. If we had to repeat this to try all these different encoders it would take a very long time. Fortunately there is a Python library called category_encoders that does this work for us.
  • #30 Let’s look at an example. Talk through.
  • #31 So this makes this process quite a lot easier but there is still a lot of work to do on feature engineering.
  • #32 The CEO of Kaggle once said this having observed winning and losing solutions in Kaggle competitions. By handcrafted he is talking about feature engineering. One competition had a dataset containing a number of features about cars at auction, for example mileage, age, make, model, colour, and the task was to predict which ones would be a good buy and which would be a lemon. The winning entry grouped the cars into unusual colours (so not commonly occurring) and usual colours. And this turned out to be one of the most predictive features and won them the competition.