Explore ML day 1

Introduction to Machine Learning
Explore ML

Learning is any process by
which a system improves
performance from
experience.
Herbert Alexander Simon

Quickdraw Game - Discussion
1. How does the game work?
2. How is it recognising your drawings?
3. How could we program this?

What is Machine
Learning?
Machine Learning is concerned with computer
programs that automatically improve their
performance through experience.

Regression
Regression analysis is a
statistical method that
helps us to analyze and
understand the relationship
between two or more
variables of interest.
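A minimal sketch of the idea, assuming the simplest case of one input variable and a straight-line relationship. The function name `fit_line` and the hours/scores numbers are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalised covariance of x and y, and unnormalised variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # estimated score for 6 hours of study
print(slope, intercept, predicted)
```

The fitted slope describes how much the score changes per extra hour, which is the "relationship between variables" that regression is after.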

Classification
Program learns from
the given dataset or
observations and then
classifies new
observation into a
number of classes
or groups.
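One simple way to picture this is a nearest-neighbour rule: a new observation gets the class of the most similar known example. This is only an illustrative sketch, not the method the slides prescribe; the function name and the animal measurements are invented.

```python
import math


def nearest_neighbour_label(training, point):
    """Return the label of the training example closest to `point`."""
    closest = min(training, key=lambda example: math.dist(example[0], point))
    return closest[1]


# Made-up observations: (height_cm, weight_kg) with a known class.
training = [
    ((20.0, 4.0), "cat"),
    ((25.0, 5.0), "cat"),
    ((60.0, 30.0), "dog"),
    ((55.0, 25.0), "dog"),
]

label_big = nearest_neighbour_label(training, (58.0, 28.0))
label_small = nearest_neighbour_label(training, (22.0, 4.5))
print(label_big, label_small)
```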

Association
It is a machine learning
and data mining
technique that finds
important relations
between variables or
features in a data set.

Clustering
A way of grouping the
data points into
different clusters,
consisting of similar
data points.

Anomaly Detection
It is the process of
identifying unexpected
items or events in data
sets, which differ from
the norm.
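A rough sketch of one common rule of thumb, assuming "the norm" means "close to the average": flag any value that lies more than a chosen number of standard deviations from the mean. The threshold of 2.0 and the sensor readings are assumptions for illustration.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values further than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > threshold * stdev]


# Made-up sensor readings with one obviously unusual value.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = find_anomalies(readings)
print(anomalies)
```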

Learn from mistakes
Reinforcement learning is a machine learning training method based on
rewarding desired behaviors and punishing undesired ones.

How do I start solving a
problem with ML?

First, familiarise yourself with what data is
available.

Goal of Feature Handling
Preparing a proper input dataset that is compatible with the requirements
of the machine learning algorithm.

According to a survey, data scientists spend 60% of their
time on data preparation.

In Feature Handling, you will learn...
Handling categorical data
● Nominal variables
● Ordinal variables
● One hot encoding
● Label/ordinal/integer encoding
Missing and invalid values
● Mean method
● Median method
● Mode method

Before we move further,
Categorical Variables
A variable whose values are drawn from a limited set of categories.

Nominal Variables
A variable that comprises a finite set of discrete values with no
relationship between those values.
The categories cannot be ranked or placed in any meaningful order.

Ordinal variables
A variable that comprises a finite set of discrete values with a ranked
ordering between values.
The categories can be ranked or placed in a meaningful order.

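One Hot Encoding
For nominal variables, forcing an
ordinal relationship via an ordinal
encoding lets the model assume a
natural ordering between categories
that does not exist, which may
result in poor performance or
unexpected results.
One hot encoding avoids this by
giving each category its own
binary (0/1) column.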
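A minimal sketch of one hot encoding in plain Python: each category becomes its own 0/1 column, so no ordering is implied. The helper name `one_hot_encode` and the port codes are illustrative assumptions.

```python
def one_hot_encode(values):
    """Return the category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Illustrative nominal data: port codes with no natural order.
ports = ["S", "C", "Q", "S"]
categories, encoded = one_hot_encode(ports)

print(categories)
for row in encoded:
    print(row)
```

Every row contains exactly one 1, in the column of its category.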

Ordinal Encoding
In ordinal encoding, each
unique category value is
assigned an integer value.
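A minimal sketch of ordinal encoding, assuming the caller supplies the categories in their ranked order. The helper name `ordinal_encode` and the size labels are made up for illustration.

```python
def ordinal_encode(values, ordered_categories):
    """Map each category to its position in `ordered_categories`."""
    rank = {category: index for index, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


# Illustrative ordinal data: sizes have a meaningful order.
sizes = ["medium", "small", "large", "small"]
encoded_sizes = ordinal_encode(sizes, ["small", "medium", "large"])
print(encoded_sizes)
```

Because small < medium < large really holds, the integers 0 < 1 < 2 carry useful information here; for nominal data they would not.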

Unfortunately, data in real life usually has
issues

Looking at a real-life dataset
Consider a dataset that gives you information
about the people aboard the Titanic, such as
their age, sex, sibling count, embarkation
point and whether or not they survived the
disaster.
Based on this, you have to predict whether an
arbitrary passenger on the Titanic would survive
the sinking.

What will happen if we directly jump into
solving the problem?

Real life datasets almost always have
missing values
For example, not every passenger’s age will be recorded.
There are multiple reasons why this could happen.

Reasons
● Simply put, it’s difficult to collect data.
● Sometimes data is lost.
● Data can also be corrupted.
● People may not be comfortable with sharing data.
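Before choosing a fix, it helps to measure how much is actually missing. The sketch below assumes the data is a list of dictionaries with `None` marking a missing value; the passenger rows are invented, not taken from the real Titanic dataset.

```python
def count_missing(rows):
    """Count None values per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts


# Invented passenger records; None marks a missing value.
passengers = [
    {"age": 22.0, "sex": "male", "embarked": "S"},
    {"age": None, "sex": "female", "embarked": "C"},
    {"age": 35.0, "sex": "female", "embarked": None},
    {"age": None, "sex": "male", "embarked": "S"},
]

missing = count_missing(passengers)
print(missing)
```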

Handling missing values
Mean, Median, Mode

Statistical approach to handle the missing values
Mean

Mean
In this method, any missing values in a column are replaced with the mean
of that column.
Assume that we have a dataset of some patients in which the age
attribute has some missing values. We have to handle these, or else
they become a recipe for disaster.
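A minimal sketch of mean imputation, assuming missing ages are stored as `None`. The helper name `impute_mean` and the ages are illustrative.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative patient ages with two missing entries.
ages = [20.0, None, 30.0, 40.0, None]
filled_ages = impute_mean(ages)
print(filled_ages)
```

The mean is computed from the observed values only, then written into every gap in that column.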

Cons of using this method
● The mean is extremely sensitive to outliers present in the data set.
● A fill value skewed by an outlier distorts the whole column and can
seriously degrade a machine learning model.
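To see the sensitivity concretely, the sketch below adds a single bad entry to a small, made-up age column and compares how far the mean moves with how far the median moves.

```python
import statistics

# Made-up ages, then the same column with one data-entry error (400).
ages = [25, 30, 35, 40, 45]
ages_with_outlier = ages + [400]

mean_before = statistics.mean(ages)
mean_after = statistics.mean(ages_with_outlier)
median_before = statistics.median(ages)
median_after = statistics.median(ages_with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

One wrong value drags the mean from 35 to roughly 95.83, while the median only shifts from 35 to 37.5.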

Median

Another technique is median imputation in which the missing values are
replaced with the median value of the entire feature column.

Cons of using this method
● Doesn’t account for correlations between features; it only works at
the column level.
● Gives poor results on encoded categorical features (do NOT use it
on categorical features).

Mode

Another technique is mode imputation in which the missing values are
replaced with the mode value or most frequent value of the entire
feature column.

● It also doesn’t account for correlations between features.
● It can introduce bias in the data.
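A minimal sketch of mode imputation on a categorical column, assuming missing entries are `None`. The helper name `impute_mode` and the port codes are made up; the share calculation at the end hints at how the bias arises.

```python
import statistics


def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


# Illustrative port codes with two missing entries.
ports = ["S", "C", None, "S", "Q", None]
filled_ports = impute_mode(ports)

# Share of "S" among known values before, and among all values after.
share_before = ports.count("S") / sum(v is not None for v in ports)
share_after = filled_ports.count("S") / len(filled_ports)
print(filled_ports, share_before, round(share_after, 2))
```

Every gap becomes "S", so its share grows from half of the known values to two thirds of the column, which over-represents the majority category.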

-- TASK --
Suppose you have a basket.
It’s filled with some fresh fruits.
Arrange the different fruits in different places.

Things you can expect
tomorrow

Introduction to advanced ML topics used to
solve real-life problems

Intuition behind each concept, not just the
high-level understanding

Applying these concepts on a custom dataset
and experimenting with the results in a
hands-on session

Lots of fun, learning and exclusive Google
goodies!
