End-to-End Machine Learning Project
Eng Teong Cheah | Microsoft MVP for AI
Working with Real Data
Working with Real Data?
When you are learning about Machine Learning it is best to actually
experiment with real-world data, not just artificial datasets. Fortunately, there
are thousands of open datasets to choose from, ranging across all sorts of
domains. Here are a few places you can look to get data:
- Popular open data repositories:
- UC Irvine Machine Learning Repository
- Kaggle datasets
- Amazon’s AWS datasets
- Meta portals (they list open data repositories):
- http://dataportals.org/
- http://opendatamonitor.eu/
- http://quandl.com/
Working with Real Data?
- Other pages listing many popular open data repositories:
- Wikipedia’s list of Machine Learning datasets
- Quora.com question
- Datasets subreddit
Look at the Big Picture
The first task you are asked to perform is to build a model of housing prices in
California using the California census data. This data has metrics such as the
population, median income, median housing price, and so on for each block
group in California. Block groups are the smallest geographical unit for which the
US Census Bureau publishes sample data (a block group typically has a
population of 600 to 3,000 people). We will just call them “districts” for short.
Your model should learn from this data and be able to predict the median
housing price in any district, given all the other metrics.
Frame the Problem
Question:
What exactly is the business objective? Building a model is probably not the end
goal. How does the company expect to use and benefit from this model?
This is important because it will determine how you frame the problem, what
algorithms you will select, what performance measure you will use to evaluate
your model, and how much effort you should spend tweaking it.
Frame the Problem
Answer:
Your model’s output (a prediction of a district’s median housing price) will be
fed to another Machine Learning system, along with many other signals. This
downstream system will determine whether it is worth investing in a given area
or not. Getting this right is critical, as it directly affects revenue.
Frame the Problem
Question:
What does the current solution look like (if any)? It will often give you a reference
performance, as well as insights on how to solve the problem.
Frame the Problem
Answer:
The district housing prices are currently estimated manually by experts: a team
gathers up-to-date information about a district (excluding median housing
prices), and they use complex rules to come up with an estimate. This is costly
and time-consuming, and their estimates are not great; their typical error rate
is about 15%.
Frame the Problem
With all this information you are now ready to start designing your system.
First, you need to frame the problem:
Is it supervised, unsupervised, or Reinforcement Learning?
Is it a classification task, a regression task, or something else?
Should you use batch learning or online learning techniques?
It is clearly a typical supervised learning task since you are
given labeled training examples (each instance comes with the
expected output, i.e., the district’s median housing price).
Moreover, it is also a typical regression task, since you are
asked to predict a value. More specifically, this is a
multivariate regression problem since the system will use
multiple features to make a prediction (it will use the district’s
population, the median income, etc.).
By contrast, in the earlier life-satisfaction example you predicted
a value from just one feature, GDP per capita, so that was a
univariate regression problem.
Finally, there is no continuous flow of data coming into the
system, there is no particular need to adjust to changing data
rapidly, and the data is small enough to fit in memory, so plain
batch learning should do just fine.
Select a Performance Measure
A typical performance measure for regression problems is the Root Mean
Square Error (RMSE). It measures the standard deviation of the errors the
system makes in its predictions.
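For reference, RMSE is computed as follows, where m is the number of instances, x⁽ⁱ⁾ is the i-th instance's feature vector, y⁽ⁱ⁾ is its label, and h is the prediction function:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^{2}}$$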
Check the Assumptions
It is good practice to list and verify the assumptions that were made so far (by
you or others); this can catch serious issues early on.
For example, your system outputs district prices, and we assume that these
prices are going to be used as such. But what if the downstream system
actually converts the prices into categories (e.g., “cheap”, “medium”, or
“expensive”) and then uses those categories instead of the prices themselves?
Check the Assumptions
In this case, getting the price perfectly right is not important at all; your system
just needs to get the category right. If that’s so, then the problem should have
been framed as a classification task, not a regression task. You don’t want to find
this out after working on a regression system for months.
Fortunately, after talking with the team in charge of the downstream system,
you are confident that they do indeed need the actual prices, not categories.
Get the Data
It’s time to get your hands dirty.
Don’t hesitate to pick up your laptop and walk through the following code
examples in a Jupyter notebook.
Create the Workspace
1. You need to have Python installed.
2. Create a workspace directory for your Machine Learning code and datasets.
You need a number of Python modules: Jupyter, NumPy, Pandas, Matplotlib
and Scikit-Learn.
Download the Data
In typical environments your data would be available in a relational database
(or some other common datastore) and spread across multiple
tables/documents/files.
To access it, you would first need to get your credentials and access
authorizations, and familiarize yourself with the data schema.
In this project, however, things are much simpler: you will download a single
compressed file, housing.tgz, which contains a comma-separated value (CSV)
file called housing.csv with all the data.
Download the Data
Here is the function to fetch data:
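The original slide showed a code screenshot; below is a minimal sketch of such a fetch function, assuming the dataset is hosted in the book's handson-ml GitHub repository (adjust HOUSING_URL to your actual data source):

```python
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract housing.csv into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
```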
Take a Quick Look at the Data Structure
Let’s take a look at the top five rows using the DataFrame’s head() method.
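A minimal sketch of loading the CSV with Pandas and peeking at it (the path assumes the extraction step above):

```python
import pandas as pd

housing = pd.read_csv("datasets/housing/housing.csv")
housing.head()  # shows the top five rows
```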
Take a Quick Look at the Data Structure
The info() method is useful to get a quick description of the data, in particular
the total number of rows, and each attribute’s type and number of non-null
values.
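For example, continuing with the housing DataFrame loaded above:

```python
housing.info()  # row count, plus dtype and non-null count for each attribute
```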
Take a Quick Look at the Data Structure
Call the hist() method on the whole dataset, and it will plot a histogram for
each numerical attribute. For example, you can see that slightly over 800
districts have a median_house_value equal to about $500,000.
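A sketch of that call (the bin count and figure size are illustrative choices):

```python
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20, 15))  # one histogram per numerical attribute
plt.show()
```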
Create a Test Set
It may sound strange to voluntarily set aside part of the data at this stage.
After all, you have only taken a quick glance at the data, and surely you should
learn a whole lot more about it before you decide what algorithms to use,
right?
This is true, but your brain is an amazing pattern detection system, which
means that it is highly prone to overfitting: if you look at the test set, you may
stumble upon some seemingly interesting pattern in the test data that leads
you to select a particular kind of Machine Learning model.
When you estimate the generalization error using the test set, your estimate will
be too optimistic and you will launch a system that will not perform as well as
expected. This is called data snooping bias.
Create a Test Set
Creating a test set is theoretically quite simple: just pick some instances
randomly, typically 20% of the dataset, and set them aside:
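A sketch of such a function (in practice you would also fix the random seed, or hash instance identifiers, so the test set stays stable across runs):

```python
import numpy as np

def split_train_test(data, test_ratio):
    """Randomly split a DataFrame into a training set and a test set."""
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
```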
Create a Test Set
You can then use this function like this:
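For example (Scikit-Learn’s built-in train_test_split provides the same behavior, plus a random_state parameter for reproducible splits):

```python
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")
```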
Discover and Visualize the Data to Gain Insights
First, make sure you have put the test set aside and you are only exploring the
training set. Also, if the training set is very large, you may want to sample an
exploration set, to make manipulations easy and fast. In our case, the set is
quite small so you can just work directly on the full set.
Visualizing Geographical Data
Since there is geographical information (latitude and longitude), it is a good
idea to create a scatterplot of all districts to visualize the data.
Visualizing Geographical Data
This looks like California all right, but other than that it is hard to see any
particular pattern. Setting the alpha option to 0.1 makes it much easier to
visualize the places where there is a high density of data points.
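A minimal sketch (this assumes the housing DataFrame from earlier):

```python
import matplotlib.pyplot as plt

# alpha=0.1 makes dense areas stand out; drop it for the plain scatterplot
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()
```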
Looking for Correlations
The correlation coefficient ranges from -1 to 1.
When it is close to 1, it means that there is a strong positive correlation.
For example, the median house value tends to go up when the median
income goes up.
When the coefficient is close to -1, it means that there is a strong negative
correlation; you can see a small negative correlation between the latitude and
the median house value (i.e., prices have a slight tendency to go down when
you go north).
Finally, coefficients close to zero mean that there is no linear correlation.
Looking for Correlations
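The original slide here showed only a screenshot; a sketch of computing these correlations with Pandas (numeric_only skips the text attribute on recent Pandas versions):

```python
corr_matrix = housing.corr(numeric_only=True)  # Pearson's r between attributes
corr_matrix["median_house_value"].sort_values(ascending=False)
```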
Experimenting with Attribute Combinations
One last thing you may want to do before actually preparing the data for
Machine Learning algorithms is to try out various attribute combinations.
For example, the total number of rooms in a district is not very useful if you
don’t know how many households there are.
What you really want is the number of rooms per household.
Similarly, the total number of bedrooms by itself is not very useful: you
probably want to compare it to the number of rooms. And the population per
household also seems like an interesting attribute combination to look at.
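For example, using the dataset's column names:

```python
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
```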
Prepare the Data for Machine Learning Algorithms
Instead of just doing this manually, you should write functions to do that, for
several good reasons:
1. This will allow you to reproduce these transformations easily on any dataset
(e.g., the next time you get a fresh dataset).
2. You will gradually build a library of transformation functions that you can
reuse in future projects.
3. You can use these functions in your live system to transform the new data
before feeding it to your algorithms.
4. This will make it possible for you to easily try various transformations and
see which combination of transformations works best.
Data Cleaning
Most Machine Learning algorithms cannot work with missing features, so
let’s create a few functions to take care of them. You noticed earlier that the
total_bedrooms attribute has some missing values, so let’s fix this. You have
three options:
1. Get rid of the corresponding districts.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.). Each option
is sketched below.
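A Pandas sketch of the three options (Scikit-Learn's SimpleImputer class is a more convenient way to do option 3; note the first two calls return new DataFrames):

```python
housing.dropna(subset=["total_bedrooms"])    # option 1: drop those districts
housing.drop("total_bedrooms", axis=1)       # option 2: drop the whole attribute
median = housing["total_bedrooms"].median()  # option 3: impute the median
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
```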
Handling Text and Categorical Attributes
Earlier we left out the categorical attribute ocean_proximity because it is a text
attribute so we cannot compute its median. Most Machine Learning algorithms
prefer to work with numbers anyway.
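A common fix is to one-hot encode the categories; a sketch with Scikit-Learn (0.20+), where each category becomes one binary attribute:

```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])
# Result is a SciPy sparse matrix with one column per category
```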
Custom Transformers
Although Scikit-Learn provides many useful transformers, you will need to write
your own for tasks such as custom cleanup operations or combining specific
attributes. You will want your transformer to work seamlessly with Scikit-Learn
functionalities (such as pipelines), and since Scikit-Learn relies on duck
typing (not inheritance), all you need is to create a class and implement three
methods: fit() (returning self), transform(), and fit_transform().
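A sketch of such a transformer that adds the combined attributes discussed earlier (the column indices are assumptions that must match your attribute order; inheriting TransformerMixin supplies fit_transform() for free):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Indices into the numerical attribute array (assumed column order)
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household,
                         population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```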
Feature Scaling
One of the most important transformations you need to apply to your data is
feature scaling.
With few exceptions, Machine Learning algorithms don’t perform well when
the input numerical attributes have very different scales.
This is the case for the housing data: the total number of rooms ranges from
about 6 to 39,320, while the median incomes only range from 0 to 15. Note
that scaling the target values is generally not required.
There are two common ways to get all attributes to have the same scale:
min-max scaling and standardization.
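A sketch of both (assuming housing_num holds just the numerical attributes):

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()  # shifts and rescales each attribute into [0, 1]
std_scaler = StandardScaler()    # zero mean, unit variance; less harmed by outliers
housing_num_scaled = std_scaler.fit_transform(housing_num)
```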
Transformation Pipelines
There are many data transformation steps that need to be executed in the right
order. Fortunately, Scikit-Learn provides the Pipeline class to help with such
sequences of transformations. Here is a small pipeline for the numerical
attributes:
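A sketch of such a pipeline (SimpleImputer lives in sklearn.impute as of 0.20; CombinedAttributesAdder is the custom transformer sketched above):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill missing values
    ("attribs_adder", CombinedAttributesAdder()),   # add combined attributes
    ("std_scaler", StandardScaler()),               # feature scaling
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
```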
Select and Train a Model
At last! You framed the problem, you got the data and explored it, you sampled
a training set and a test set, and you wrote transformation pipelines to clean up
and prepare your data for Machine Learning algorithms automatically. You are
now ready to select and train a Machine Learning model.
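For example, a first baseline model (housing_prepared and housing_labels are assumed to be the outputs of the preparation step):

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
```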
Fine-Tune Your Model
Let’s assume that you now have a shortlist of promising models. You now need
to fine-tune them.
Grid Search
One way to do that would be to fiddle with the hyperparameters manually,
until you find a great combination of hyperparameter values. This would be
very tedious work, and you may not have time to explore many combinations.
Instead, you should get Scikit-Learn’s GridSearchCV to search for you. All you
need to do is tell it which hyperparameters you want it to experiment with, and
what values to try out, and it will evaluate all the possible combinations of
hyperparameter values, using cross-validation.
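A sketch (the grid values are illustrative, not tuned; housing_prepared and housing_labels as above):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
]
grid_search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(housing_prepared, housing_labels)
print(grid_search.best_params_)
```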
Randomized Search
This approach has two main benefits:
- If you let the randomized search run for, say, 1,000 iterations, this approach
will explore 1,000 different values for each hyperparameter (instead of just a
few values per hyperparameter with the grid search approach).
- You have more control over the computing budget you want to allocate to
hyperparameter search, simply by setting the number of iterations.
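A sketch with RandomizedSearchCV (the distributions are illustrative):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(), param_distribs,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error")
rnd_search.fit(housing_prepared, housing_labels)
```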
Ensemble Methods
Another way to fine-tune your system is to try to combine the models that
perform best. The group (or “ensemble”) will often perform better than the
best individual model (just like Random Forests perform better than the
individual Decision Trees they rely on), especially if the individual models make
very different types of errors.
Evaluate Your System on the Test Set
After tweaking your models for a while, you eventually have a system that
performs sufficiently well. Now is the time to evaluate the final model on the
test set. There is nothing special about this process; just get the predictors and
the labels from your test set, run your full_pipeline to transform the data (call
transform(), not fit_transform()!), and evaluate the final model on the test set.
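A sketch of that final step (grid_search, test_set, and full_pipeline are assumed names from the earlier steps, full_pipeline being the complete numerical-plus-categorical preparation pipeline):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_          # best model from fine-tuning
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)  # transform, NOT fit_transform
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
```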
Launch, Monitor and Maintain Your System
Finally, you will generally want to train your models on a regular basis using
fresh data. You should automate this process as much as possible.
If you don’t, you are very likely to refresh your model only every six months (at
best), and your system’s performance may fluctuate severely over time.
If your system is an online learning system, you should make sure you save
snapshots of its state at regular intervals so you can easily roll back to a
previously working state.
Thanks!
Does anyone have any questions?
Twitter: @walkercet
Blog: https://ceteongvanness.wordpress.com
Resources
Hands-On Machine Learning with Scikit-Learn and TensorFlow