Introduction to Machine Learning
Explore ML
Welcome to
Explore ML!
Day 1
Evolution of
Machines
What is a machine?
What is learning?
Learning is any process by
which a system improves
performance from
experience.
Herbert Alexander Simon
Quickdraw Game - Discussion
1. How does the game work?
2. How is it recognising your drawings?
3. How could we program this?
What is Machine
Learning?
Machine Learning is concerned with computer
programs that automatically improve their
performance through experience.
AI
ML
DL
Supervised Learning:
Regression
Regression analysis is a
statistical method that
helps us to analyze and
understand the relationship
between two or more
variables of interest.
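To make the idea of a relationship between two variables concrete, here is a minimal sketch of a straight-line least-squares fit in plain Python. The data (hours studied versus exam score) and the helper name `fit_line` are made up for illustration and are not part of the original slides.

```python
# Ordinary least-squares fit of a straight line: y = slope * x + intercept.
# The data below is invented purely for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) that minimise the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 54, 56, 58, 60]  # exactly 50 + 2 * hours

slope, intercept = fit_line(hours_studied, exam_score)
predicted = slope * 6 + intercept  # predicted score for 6 hours of study
print(slope, intercept, predicted)  # 2.0 50.0 62.0
```

Real data is rarely this clean, so the fitted line would normally only approximate the points.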
Classification
The program learns from
a given dataset of
observations and then
assigns each new
observation to one of a
number of classes
or groups.
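One simple way to picture this is a nearest-centroid classifier: learn the average of each class from labelled examples, then assign a new observation to the class whose average is closest. The sketch below is only one of many possible classifiers, and the fruit measurements and function names are hypothetical.

```python
# Nearest-centroid classifier in plain Python (illustrative sketch only).

def centroid(points):
    """Average each feature over a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train(labelled):
    """labelled maps a class label to a list of feature tuples."""
    return {label: centroid(pts) for label, pts in labelled.items()}

def classify(model, point):
    """Return the label whose centroid is closest to the new point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: sq_dist(model[label], point))

# Made-up training data: (diameter in cm, weight in g).
training_data = {
    "apple": [(7.0, 150.0), (7.5, 160.0), (8.0, 170.0)],
    "cherry": [(2.0, 8.0), (2.2, 9.0), (2.4, 10.0)],
}

model = train(training_data)
print(classify(model, (2.1, 9.5)))    # cherry
print(classify(model, (7.8, 165.0)))  # apple
```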
Unsupervised Learning
Association
It is a machine learning
and data mining
technique that finds
important relations
between variables or
features in a data set.
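A common way to measure such relations is with the support and confidence of a rule like "bread implies butter". The sketch below computes both on a handful of invented shopping baskets; the data and function names are assumptions for illustration.

```python
# Support and confidence for a simple association rule (illustrative sketch).

baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

bread_butter_support = support({"bread", "butter"})  # 2 of 4 baskets
rule_confidence = confidence({"bread"}, {"butter"})  # 2 of the 3 bread baskets
print(bread_butter_support)       # 0.5
print(round(rule_confidence, 3))  # 0.667
```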
Clustering
A way of grouping the
data points into
different clusters,
each consisting of
similar data points.
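As a rough illustration, the sketch below runs a one-dimensional version of k-means, one popular clustering algorithm. The values, the starting centres and the function name `k_means_1d` are chosen by hand for the example; real implementations usually pick starting centres randomly.

```python
# One-dimensional k-means with fixed starting centres (illustrative sketch).

def k_means_1d(values, centres, iterations=10):
    """Alternate between assigning points and updating centres."""
    clusters = [[] for _ in centres]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Update step: move each centre to the mean of its cluster.
        centres = [
            sum(c) / len(c) if c else centres[i]
            for i, c in enumerate(clusters)
        ]
    return centres, clusters

values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centres, clusters = k_means_1d(values, centres=[0.0, 5.0])
print(centres)   # [1.5, 9.5]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 9.5, 10.0]]
```

Note that no labels were given: the two groups emerge only from how close the values are to each other.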
Anomaly Detection
It is the process of
identifying unexpected
items or events in data
sets, which differ from
the norm.
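One simple (and far from the only) way to flag such items is to measure how many standard deviations each value lies from the mean. In the sketch below, the readings and the threshold of 2 standard deviations are assumptions picked for the example.

```python
# Flag values that lie far from the mean (z-score method, illustrative sketch).
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 50]  # invented data with one odd reading
anomalies = find_anomalies(readings)
print(anomalies)  # [50]
```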
Reinforcement Learning:
Learn from mistakes
Reinforcement learning is a machine learning training method based on
rewarding desired behaviors and/or punishing undesired ones
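The reward-and-punish loop can be sketched with a tiny agent that keeps a score for each action and nudges it towards the reward it receives. Everything here (the actions, the rewards, the exploration rate and the function name `train_agent`) is a hypothetical toy setup, not a full reinforcement learning algorithm.

```python
# Toy reward-driven learner (illustrative sketch, not a full RL algorithm).
import random

def train_agent(rewards, episodes=200, epsilon=0.1, learning_rate=0.5, seed=0):
    """Learn a value estimate per action from rewards (+) and punishments (-)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    values = {action: 0.0 for action in rewards}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(list(rewards))    # explore: try something random
        else:
            action = max(values, key=values.get)  # exploit: best action so far
        reward = rewards[action]
        # Move the estimate a step towards the reward just received.
        values[action] += learning_rate * (reward - values[action])
    return values

# Desired behaviour is rewarded, undesired behaviour is punished.
rewards = {"sit": 1.0, "bark": -1.0, "jump": -1.0}
values = train_agent(rewards)
best_action = max(values, key=values.get)
print(best_action)  # sit
```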
Reinforcement at work
How do I start solving a
problem with ML?
First, familiarise yourself with what data is
available.
Feature Handling
Goal of Feature Handling
Preparing a proper input dataset that is compatible with the requirements
of the machine learning algorithm.
According to one survey, data scientists spend about 60% of their
time on data preparation.
In Feature Handling, you will learn...
Handling categorical data
● Nominal variables
● Ordinal variables
● One hot encoding
● Label/ordinal/integer encoding
Missing/invalid values
● Mean method
● Median method
● Mode method
Before we move further,
Categorical Variables
A variable whose values are one or more categories.
Nominal Variables
A nominal variable comprises a finite set of discrete values with no
relationship between those values.
These are variables whose values cannot be placed in any order.
Ordinal variables
An ordinal variable comprises a finite set of discrete values with a ranked
ordering between values.
These are variables where we can find a certain order, relation or
rank between the values.
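One Hot Encoding
For categories with no natural order,
forcing an ordinal relationship via
an ordinal encoding lets the model
assume an ordering that does not
exist, which may result in poor
performance or unexpected results.
One-hot encoding avoids this by
giving each category its own
binary (0/1) column.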
Ordinal Encoding
In ordinal encoding, each
unique category value is
assigned an integer value.
Unfortunately, data in real life usually has
issues
Looking at a real-life dataset
Consider a dataset that gives you information
about the people aboard the Titanic, such as
their ages, sexes, sibling counts, embarkation
points and whether or not they survived the
disaster.
Based on this, you have to predict whether an
arbitrary passenger on the Titanic would survive
the sinking.
What will happen if we directly jump into
solving the problem?
Real-life datasets almost always have
missing values
For example, not every passenger’s age will have been recorded.
There are multiple reasons why this could happen.
Reasons
● Simply put, it’s difficult to collect data.
● Sometimes data is lost.
● Data can also be corrupted.
● People may not be comfortable with sharing data.
Handling missing values
Mean, Median, Mode
Statistical approach to handle the missing values
Mean
In this method, any missing values in a column are replaced with the mean
of that column.
Assume that we have a dataset of some patients in which the age
attribute has some missing values. We have to deal with them, or else it
will be a recipe for disaster.
Cons of using this method
● The mean is extremely sensitive to outliers present in the data set.
● A fill value skewed by outliers is a major threat to any machine learning
model and can badly degrade its results.
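The sketch below shows mean imputation in plain Python and how a single outlier distorts the fill value. The ages are invented, missing values are represented as `None` by assumption, and `impute_mean` is a hypothetical helper name.

```python
# Mean imputation and its sensitivity to outliers (illustrative sketch).
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

ages = [25, 30, None, 35, 30]
print(impute_mean(ages))  # [25, 30, 30, 35, 30]

# A single wrongly recorded age drags the fill value far from typical ages.
ages_with_outlier = [25, 30, None, 35, 300]
print(impute_mean(ages_with_outlier))  # [25, 30, 97.5, 35, 300]
```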
Median
Another technique is median imputation, in which the missing values are
replaced with the median value of the entire feature column.
Cons of using this method
● It doesn’t factor in the correlations between features. It only works at
the column level.
● It will give poor results on encoded categorical features (do NOT use it
on categorical features).
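A minimal sketch of median imputation in plain Python, using invented ages with `None` standing in for missing values (an assumption for the example). Note how the outlier of 300 barely affects the fill value, unlike with the mean.

```python
# Median imputation (illustrative sketch).
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

ages_with_outlier = [25, 30, None, 35, 300]
median_filled = impute_median(ages_with_outlier)
print(median_filled)  # [25, 30, 32.5, 35, 300]
```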
Mode
Another technique is mode imputation, in which the missing values are
replaced with the mode (the most frequent value) of the entire
feature column.
Cons of using this method
● It also doesn’t factor in the correlations between features.
● It can introduce bias in the data.
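A minimal sketch of mode imputation in plain Python, which also shows the bias it can introduce: the most frequent value becomes even more frequent after filling. The embarkation codes are invented, and `None` marks a missing value by assumption.

```python
# Mode imputation and the bias it can introduce (illustrative sketch).
from statistics import mode

def impute_mode(column):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    fill = mode(observed)
    return [fill if v is None else v for v in column]

embarked = ["S", "C", "S", None, "Q", "S", None]
mode_filled = impute_mode(embarked)
print(mode_filled)  # ['S', 'C', 'S', 'S', 'Q', 'S', 'S']

# Share of "S" among known values before filling, and among all values after.
share_before = embarked.count("S") / sum(v is not None for v in embarked)
share_after = mode_filled.count("S") / len(mode_filled)
print(share_before, round(share_after, 3))  # 0.6 0.714
```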
Quick Recap!
-- TASK --
Suppose you have a basket.
It’s filled with some fresh fruits.
Arrange the different fruits in different places.
How did we learn?
Things you can expect
tomorrow
Introduction to advanced ML topics used to
solve real-life problems
Intuition behind each concept, not just the
high-level understanding
Applying these concepts on a custom dataset
and experimenting with the results in a
hands-on session
Lots of fun, learning and exclusive Google
goodies!
