2. What is Machine Learning?
There is no single agreed definition, but two are widely quoted.
Arthur Samuel (1959):
Machine learning: "Field of study that gives computers the ability to learn
without being explicitly programmed"
Example:
Samuel wrote a checkers-playing program.
He had the program play 10,000 games against itself.
It worked out which board positions were good and bad depending on wins and losses.
3. Another definition: Tom Mitchell (1997)
Well posed learning problem: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E."
Applied to the checkers example:
E = the experience of playing 10,000 games
T = the task of playing checkers
P = whether the program wins or loses
7. Supervised Learning (Train me)
It is a data mining task of inferring a function from labeled training data.
The training data consist of a set of training examples.
In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal).
Unsupervised Learning (I am self-sufficient in learning)
It learns from data that has not been labeled, classified or categorized.
Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.
8. Reinforcement Learning (My life, my rules! Hit & Trial)
It is the ability of an agent to interact with its environment and find out which actions lead to the best outcome. It follows the hit-and-trial (trial-and-error) method.
The agent receives a reward point for a correct action and a penalty for a wrong one, and the model trains itself on the basis of the reward points gained.
Reinforcement learning differs from supervised learning in that supervised training data comes with the answer key, so the model is trained on the correct answers themselves, whereas in reinforcement learning there is no answer key and the agent decides what to do to perform the given task.
In the absence of a training dataset, it is bound to learn from its own experience.
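The trial-and-error loop described above can be sketched with a tiny two-action problem. This is only an illustration under stated assumptions: the reward probabilities, the epsilon-greedy strategy, and the function name `run_bandit` are hypothetical choices, not something the slides prescribe.

```python
import random

def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error: try actions, keep a running average reward."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)     # how often each action was tried
    values = [0.0] * len(reward_probs)   # average reward observed per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any action at random.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action that has looked best so far.
            action = max(range(len(values)), key=values.__getitem__)
        # The environment answers with a reward of 1 or 0; there is no answer key.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts

# Hypothetical environment: action 1 pays off far more often than action 0.
values, counts = run_bandit([0.2, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
print("best action:", best_action, "estimated values:", [round(v, 2) for v in values])
```

The agent is never told which action is correct; it only sees rewards and gradually prefers the action that earned more of them.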
11. Real-life example
Task: arrange the fruits into groups.
No. | Size  | Colour | Shape                                      | Fruit name
1   | Big   | Red    | Rounded shape with a depression at the top | Apple
2   | Small | Red    | Heart-shaped to nearly globular            | Cherry
3   | Big   | Green  | Long curving cylinder                      | Banana
4   | Small | Green  | Round to oval, bunch shape cylindrical     | Grape
12. For Supervised Learning
You have already learned the physical characteristics of the fruits from previous work,
so arranging fruits of the same type in one place is easy.
That previous work is called training data in data mining.
You can learn from the training data because it includes a response variable.
The response variable is simply the decision variable: here, the fruit name.
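As a rough sketch of the supervised case, the fruit table can be treated as labeled training data where the fruit name is the response variable. The "model" here is just a lookup of previously seen feature combinations; the function names `train` and `predict` are hypothetical and chosen for illustration only.

```python
# Training data copied from the fruit table: (size, colour) -> fruit name.
training_data = [
    (("big", "red"), "apple"),
    (("small", "red"), "cherry"),
    (("big", "green"), "banana"),
    (("small", "green"), "grape"),
]

def train(examples):
    """'Learn' a function from labeled examples by memorizing features -> label."""
    return {features: label for features, label in examples}

def predict(model, features):
    """Return the learned label, or None for a feature combination never seen."""
    return model.get(features)

model = train(training_data)
print(predict(model, ("small", "red")))
```

Real supervised learners generalize beyond exact matches, but the ingredients are the same: input features, a known response, and a function inferred from the pairs.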
13. For Unsupervised Learning
This time you know nothing about the fruits; this is the first time you have seen them, and you have no clue what they are.
So how will you arrange them? What will you do first?
You will pick a physical characteristic of the fruits and arrange them by it.
14. Suppose we consider color first:
• RED COLOR GROUP: apples and cherries.
• GREEN COLOR GROUP: bananas and grapes.
Now consider size along with color:
• RED COLOR AND BIG SIZE: apples.
• RED COLOR AND SMALL SIZE: cherries.
• GREEN COLOR AND BIG SIZE: bananas.
• GREEN COLOR AND SMALL SIZE: grapes.
This type of learning is known as unsupervised learning.
Clustering comes under unsupervised learning.
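The grouping above can be sketched in a few lines. This is a simplification: it groups by exact attribute values rather than running a real clustering algorithm, and the helper name `group_by` is hypothetical.

```python
from collections import defaultdict

# Observations described only by physical characteristics (size, color).
# The name in the last field is never used to form groups; it is kept
# solely so the resulting groups are readable.
fruits = [
    ("big", "red", "apple"),
    ("small", "red", "cherry"),
    ("big", "green", "banana"),
    ("small", "green", "grape"),
]

def group_by(items, key):
    """Collect item names into groups that share the same key value."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item[2])
    return dict(groups)

by_color = group_by(fruits, key=lambda f: f[1])
by_color_and_size = group_by(fruits, key=lambda f: (f[1], f[0]))
print(by_color)
print(by_color_and_size)
```

Adding the second characteristic splits each color group further, which mirrors the two grouping passes on the slide.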
17. Selecting the Right Algorithm
Selecting a machine learning algorithm is a process of trial and error.
It is also a trade-off between specific characteristics of the algorithms, such as:
Speed of training
Memory usage
Predictive accuracy on new data
Transparency or interpretability (how easily you can
understand the reasons an algorithm makes its predictions)
18. Supervised Learning: Classification and Regression
Classification
Classification techniques predict discrete responses, for example, whether an email is genuine or spam, or whether a tumor is small, medium, or large. Classification models are trained to classify data into categories. Applications include medical imaging, speech recognition, and credit scoring.
If the data can be separated into specific groups or classes, use classification algorithms.
Regression
Regression techniques predict continuous responses, for example, changes in temperature or fluctuations in electricity demand. Applications include forecasting stock prices, handwriting recognition, and acoustic signal processing.
If the nature of your response is a real number, such as temperature or the time until failure for a piece of equipment, use regression techniques.
19. Let’s take a closer look at the most commonly used
classification and regression algorithms.
20. Binary vs. Multiclass Classification
When working on a classification problem, begin by determining whether the problem is binary or multiclass.
Binary Classification
A single training or test item (instance) can only be assigned to one of two classes, for example, determining whether an email is genuine or spam.
Multiclass Classification
An instance can be assigned to one of more than two classes, for example, training a model to classify an image as a dog, cat, or other animal. It requires a more complex model.
22. Other Examples for Classification
Binary Classification
Put a tennis ball into the Color or no-Color bin (color)
(Medical Test) Determine if a patient has a certain disease or not
(Quality Control Test) Decide if a product should be sold or discarded
(IR Test) Determine if a document should be in the search results or not
Multi-Class Classification
Put a tennis ball into the Green, Orange, or White ball bin (color)
Decide if an email is advertisement, newsletter, phishing, hack, or
personal.
Classify a document into Yahoo! Categories
(Optical Recognition) Classify a scanned character into digit (0..9)
23. Support Vector Machine
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that is used mostly for classification problems.
In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate.
Classification is then performed by finding the hyper-plane that best separates the two classes.
[Figure: hyper-plane]
24. How Support Vector Machine Works
[Figure: margin]
Classifies data by finding the linear decision boundary
(hyperplane) that separates all data points of one class
from those of the other class.
The best hyperplane for an SVM is the one with the
largest margin between the two classes, when the data is
linearly separable.
If the data is not linearly separable, a loss function is used
to penalize points on the wrong side of the hyperplane.
SVMs sometimes use a kernel transform to transform nonlinearly separable data into
higher dimensions where a linear decision boundary can be found.
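To make the "largest margin" idea concrete, the sketch below compares two candidate hyperplanes on a small, linearly separable data set. The points and the candidates `H1` and `H2` are hypothetical, and this is not how an SVM is actually trained (a real SVM solves an optimization problem); it only shows how the margin of a given hyperplane w·x + b = 0 can be measured.

```python
import math

def score(w, b, x):
    """Signed value of w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def separates(w, b, positives, negatives):
    """True if all positive points score > 0 and all negative points score < 0."""
    return (all(score(w, b, x) > 0 for x in positives)
            and all(score(w, b, x) < 0 for x in negatives))

def margin(w, b, points):
    """Smallest distance from any point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(score(w, b, x)) / norm for x in points)

# Hypothetical, linearly separable toy data.
positives = [(3.0, 3.0), (4.0, 3.0), (4.0, 4.0)]
negatives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Hypothetical candidate hyperplanes given as (w, b).
candidates = {
    "H1": ((1.0, 0.0), -1.5),   # vertical line x = 1.5
    "H2": ((1.0, 1.0), -3.5),   # diagonal line x + y = 3.5
}

valid = {name: margin(w, b, positives + negatives)
         for name, (w, b) in candidates.items()
         if separates(w, b, positives, negatives)}
best = max(valid, key=valid.get)
print(best, valid)
```

Both candidates classify every point correctly, but the diagonal line stays farther from the closest points, so it is the better choice under the largest-margin criterion.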
25. Identify the right hyper-plane
Case 1: Which is the right hyper-plane?
“Select the hyper-plane which segregates the two classes better”.
In this case, hyper-plane “B” does this job best.
Case 2: Which is the right hyper-plane?
Here, the margin for hyper-plane C is larger than for both A and B.
Hence, the right hyper-plane is C.
27. Identify the right hyper-plane
Case 1: Which is the right hyper-plane?
SVM selects the hyper-plane that classifies the classes accurately before maximizing the margin.
Here, hyper-plane B has a classification error and A classifies all points correctly.
Therefore, the right hyper-plane is A.
Case 2: Which is the right hyper-plane?
When no straight line can separate the classes, SVM solves the problem by introducing an additional feature.
Here, we add a new feature z = x^2 + y^2 (kernel transformation).
Plotting the data points on the x and z axes then lets a straight line separate the two classes.
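The effect of the extra feature z = x^2 + y^2 can be checked on a small example. The points below are hypothetical (one class near the origin, the other farther out), and the helper name `add_radial_feature` is an assumption made for this sketch.

```python
def add_radial_feature(point):
    """Map (x, y) to (x, z) using the new feature z = x^2 + y^2."""
    x, y = point
    return (x, x * x + y * y)

# Hypothetical toy data: no straight line in the (x, y) plane separates
# a cluster at the origin from points surrounding it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4), (0.2, 0.2)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5), (1.4, 1.6)]

inner_z = [add_radial_feature(p)[1] for p in inner]
outer_z = [add_radial_feature(p)[1] for p in outer]

# In (x, z) space the horizontal line z = threshold separates the classes.
threshold = (max(inner_z) + min(outer_z)) / 2
print("threshold on z:", threshold)
print(all(z < threshold for z in inner_z), all(z > threshold for z in outer_z))
```

After the transformation, every inner point lies below the threshold and every outer point above it, so a linear boundary in the new space does the job.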
28. Support Vector Machine: best used...
For data that has exactly two classes (you can also use it for multiclass classification with a technique called error-correcting output codes)
For high-dimensional, nonlinearly separable data
29. Pros and Cons Associated with SVM
Pros:
It works really well when there is a clear margin of separation.
It is effective in high-dimensional spaces.
It is effective in cases where the number of dimensions is greater than the number of samples.
It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Cons:
It doesn’t perform well on large data sets because the required training time is high.
It also doesn’t perform very well when the data set is noisy, i.e. when the target classes overlap.
SVM doesn’t directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
30. Discriminant Analysis
Discriminant analysis (DA) is a technique for analyzing data when the criterion or
dependent variable is categorical and the predictor or independent variables are
interval in nature.
It is a technique to discriminate between two or more mutually exclusive and
exhaustive groups on the basis of some explanatory variables.
Types of Discriminant Analysis (DA)
1. Linear DA: when the criterion / dependent variable has two categories
Example: adopters & non-adopters
2. Multiple DA: when three or more categories are involved
Example: SHG1, SHG2, SHG3
31. How DA Works
Assumptions
1. Sample size (n)
Group sizes of the dependent variable should not be grossly different (e.g. 80:20). The sample size should be at least five times the number of independent variables.
2. Normal distribution
Each of the independent variables is normally distributed.
3. Homogeneity of variances / covariances
All variables have linear and homoscedastic relationships.
32. 4. Outliers
Outliers should not be present in the data. DA is highly sensitive to the inclusion of outliers.
5. Non-multicollinearity
There should be no multicollinearity among the independent variables.
33. 6. Mutually exclusive
The groups must be mutually exclusive, with every subject or case belonging to only one group.
7. Classification
Each of the allocations for the dependent categories in the initial classification is correctly classified.
34. Discriminant Analysis Model
The discriminant analysis model involves linear combinations of the following form:
$D = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k$
where
D = discriminant score
b's = discriminant coefficients or weights
X's = predictors or independent variables
The coefficients, or weights (b), are estimated so that the groups differ as much
as possible on the values of the discriminant function.
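A minimal sketch of how a discriminant score is computed and used follows. The coefficient values and the cutoff are hypothetical (in practice they are estimated from data), and classifying by comparing D with a cutoff is an assumption that applies to the two-group case.

```python
def discriminant_score(intercept, weights, predictors):
    """D = b0 + b1*X1 + ... + bk*Xk for a single case."""
    if len(weights) != len(predictors):
        raise ValueError("need exactly one weight per predictor")
    return intercept + sum(b * x for b, x in zip(weights, predictors))

# Hypothetical coefficients and cutoff, chosen only for illustration.
b0 = -4.0
b = [0.5, 1.0]
cutoff = 0.0

def classify(predictors):
    """Assign a case to one of two groups by comparing its score with the cutoff."""
    return "group 1" if discriminant_score(b0, b, predictors) > cutoff else "group 2"

print(discriminant_score(b0, b, [6.0, 2.0]))   # -4 + 0.5*6 + 1.0*2
print(classify([6.0, 2.0]), classify([2.0, 1.0]))
```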
35. Applications of Discriminant Analysis Model
Discriminant analysis has been successfully used for many applications, as long as the problem can be transformed into a classification problem.
DA can be used for original applications also.
1. Identification
To identify the type of customers who are likely to buy a certain product in a store.
Using a simple questionnaire survey, we can get the features of customers.
DA will help to select which features can describe the group membership of buy or not buy the product.
36. 2. Decision Making
A doctor diagnosing an illness can be seen as deciding which disease the patient has.
This problem can be transformed into a classification problem by assigning the patient to one of a number of possible disease groups based on observation of the symptoms.
3. Prediction
The question “Will it rain today?” can be thought of as prediction.
A prediction problem can be thought of as assigning “today” to one of the two possible groups, rain and dry.
37. 4. Pattern recognition
Distinguishing pedestrians from dogs and cars in a captured image sequence of traffic data is a classification problem.
5. Learning
Teaching a robot to talk can be seen as a classification problem.
It assigns frequency, pitch, tune, and many other measurements of sound to many groups of words.
38. Naïve Bayes Model
It is a classification technique based on Bayes’ theorem with an assumption of independence among predictors.
It is easy to build and particularly useful for very large datasets.
It learns and predicts very fast and it does not require much storage.
It has one assumption: all features must be independent of each other.
It still returns very good accuracy in practice even when the independence assumption does not hold.
39. Applications of Naïve Bayes Model
1. Real-time Prediction
2. Multi-class Prediction
3. Text Classification / Spam Filtering / Sentiment Analysis
4. Recommendation System
40. Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: $P(x)$
– Conditional probability: $P(x_1 \mid x_2)$, $P(x_2 \mid x_1)$
– Relationship: $P(x_1, x_2) = P(x_2 \mid x_1)\,P(x_1) = P(x_1 \mid x_2)\,P(x_2)$
– Independence: $P(x_2 \mid x_1) = P(x_2)$, $P(x_1 \mid x_2) = P(x_1)$, $P(x_1, x_2) = P(x_1)\,P(x_2)$
• Bayesian Rule
$P(c \mid x) = \dfrac{P(x \mid c)\,P(c)}{P(x)}$
Discriminative: $P(c \mid x)$. Generative: $P(x \mid c)\,P(c)$.
42. Example to Understand Bayes’ Theorem
There are 2 boxes. Box 1 contains 2 white balls and 3 red balls; Box 2 contains 4 white balls and 5 red balls. One ball is drawn at random from one of the boxes and is found to be red. Find the probability that it was drawn from the second box.
Solution
Let red ball = R, white ball = W, Box 1 = A, Box 2 = B.
Probability of selecting Box 1: $P(A) = \frac{1}{2}$
Probability of selecting Box 2: $P(B) = \frac{1}{2}$
Probability of getting a red ball from Box 1: $P(R \mid A) = \frac{3}{5}$
43. Probability of getting a red ball from Box 2: $P(R \mid B) = \frac{5}{9}$
Probability that the red ball was drawn from the second box:
$P(B \mid R) = \dfrac{P(R \mid B)\,P(B)}{P(R \mid B)\,P(B) + P(R \mid A)\,P(A)}$
This is called Bayes’ theorem.
$P(B \mid R) = \dfrac{\frac{5}{9} \cdot \frac{1}{2}}{\frac{5}{9} \cdot \frac{1}{2} + \frac{3}{5} \cdot \frac{1}{2}} = \dfrac{25}{52} \approx 0.481 = 48.1\%$
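The arithmetic of the two-box example can be verified with exact fractions. This is a small sketch; the function name `posterior` and the dictionary layout are assumptions made for illustration. Evaluated exactly, the expression comes to 25/52, roughly 0.481.

```python
from fractions import Fraction

def posterior(prior, likelihood, hypothesis):
    """P(hypothesis | evidence) by Bayes' theorem over mutually exclusive hypotheses."""
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    return likelihood[hypothesis] * prior[hypothesis] / evidence

prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}    # P(A), P(B)
p_red = {"A": Fraction(3, 5), "B": Fraction(5, 9)}    # P(R|A), P(R|B)

p_b_given_red = posterior(prior, p_red, "B")
print(p_b_given_red, round(float(p_b_given_red), 3))
```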
44. Example to Understand Bayes’ Theorem
With the tabulation of 100 people below, what is the conditional probability that a certain member of the school is a ‘Teacher’ given that he is a ‘Man’?
        | Female | Male | Total
Teacher | 8      | 12   | 20
Student | 32     | 48   | 80
Total   | 40     | 60   | 100
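One way to work the question out is to compute the conditional probability twice: directly from the counts and via Bayes' theorem. The sketch below uses only the numbers in the table; the variable names are illustrative.

```python
from fractions import Fraction

# Counts copied from the table: counts[role][sex].
counts = {
    "Teacher": {"Female": 8, "Male": 12},
    "Student": {"Female": 32, "Male": 48},
}
total = sum(sum(row.values()) for row in counts.values())

p_male = Fraction(sum(row["Male"] for row in counts.values()), total)
p_teacher = Fraction(sum(counts["Teacher"].values()), total)
p_male_given_teacher = Fraction(counts["Teacher"]["Male"],
                                sum(counts["Teacher"].values()))

# Definition: P(Teacher | Male) = P(Teacher and Male) / P(Male)
direct = Fraction(counts["Teacher"]["Male"], total) / p_male
# Bayes' theorem: P(Teacher | Male) = P(Male | Teacher) * P(Teacher) / P(Male)
via_bayes = p_male_given_teacher * p_teacher / p_male

print(direct, via_bayes)
```

Both routes give 12/60 = 1/5, so the probability that a man in this group is a teacher is 20%.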
45. The Naïve Bayes Model
The Bayes Rule provides the formula for the probability of Y given X. But in real-world problems, you typically have multiple X variables.
When the features are independent, we can extend the Bayes Rule to what is called Naive Bayes.
It is called ‘Naive’ because of the naive assumption that the X’s are independent of each other.
48. Naive Bayes Example
Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These
are the 3 possible classes of the Y variable.
49. For the sake of computing the probabilities, let’s aggregate the training data to form a counts table.
50. Step 1: Compute the ‘prior’ probabilities for each class of fruit.
P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Orange) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20
Step 2: Compute the probability of the evidence, which goes in the denominator.
P(x1=Long) = 500 / 1000 = 0.50
P(x2=Sweet) = 650 / 1000 = 0.65
P(x3=Yellow) = 800 / 1000 = 0.80
51. Step 3: Compute the likelihood of the evidence, which goes in the numerator.
P(x1=Long | Y=Banana) = 400 / 500 = 0.80
P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70
P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90
So, the overall likelihood of the evidence for Banana = 0.8 * 0.7 * 0.9 = 0.504
52. Step 4: Substitute all three results into the Naive Bayes formula to get the probability that the fruit is a banana.
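The substitution in Step 4 can be sketched with the numbers from Steps 1 to 3. One assumption to note: the denominator below is the product of the three evidence probabilities, following the way Step 2 sets them up. That denominator is the same for every class, so when only the most likely class is needed, comparing numerators is enough.

```python
import math

prior_banana = 500 / 1000                           # Step 1: P(Y=Banana)
evidence = [500 / 1000, 650 / 1000, 800 / 1000]     # Step 2: P(Long), P(Sweet), P(Yellow)
likelihood = [400 / 500, 350 / 500, 450 / 500]      # Step 3: P(x_i | Y=Banana)

# Step 4: likelihood of the evidence times the prior, over the evidence.
numerator = math.prod(likelihood) * prior_banana    # 0.504 * 0.50
denominator = math.prod(evidence)                   # 0.50 * 0.65 * 0.80
p_banana = numerator / denominator

print(round(numerator, 3), round(denominator, 3), round(p_banana, 3))
```

With these inputs the numerator is 0.252 and the denominator 0.26, giving a probability of about 0.97 that a long, sweet, yellow fruit is a banana.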
53. Nearest Neighbor Algorithm
A simple analogy: tell me about your friends (who your neighbors are), and I will tell you who you are.
55. What is KNN (K-Nearest Neighbor)?
A powerful classification algorithm used in pattern recognition.
K-nearest neighbors stores all available cases and classifies new cases based on a similarity measure (e.g. a distance function).
One of the top data mining algorithms used today.
A non-parametric, lazy learning algorithm (an instance-based learning method).
56. When do we use KNN
KNN can be used for both classification and regression predictive problems.
However, it is more widely used in classification problems in the industry.
To evaluate any technique we generally look at 3 important aspects
It is commonly used for its ease of interpretation and low calculation time.
63. Training error rate and Validation error rate
Segregate the training and validation sets from the initial dataset, then plot the validation error curve to find the optimal value of K. This value of K should be used for all predictions.
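The procedure above can be sketched without plotting by computing the validation error for a few candidate values of K and keeping the smallest. Everything in this example is hypothetical: the one-dimensional toy data (with one deliberately noisy training label), the candidate values, and the helper names.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (1-D feature)."""
    neighbors = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def error_rate(train, validation, k):
    """Fraction of validation points that KNN with this k gets wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in validation)
    return wrong / len(validation)

# Hypothetical toy data as (feature, label); the point at 2.6 is a noisy label.
train = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.6, "b"), (3.0, "a"),
         (4.0, "b"), (4.5, "b"), (5.0, "b"), (5.5, "b")]
validation = [(1.2, "a"), (2.4, "a"), (2.7, "a"), (4.8, "b")]

errors = {k: error_rate(train, validation, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)   # first K with the lowest validation error
print(errors, "best K:", best_k)
```

With K = 1 the noisy training point misleads two validation predictions; larger K values outvote it, which is the kind of pattern a validation error curve reveals.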
66. Distance from John to the others using Euclidean distance
George to John distance = $\sqrt{(35 - 37)^2 + (35 - 50)^2 + (3 - 2)^2} \approx 15.17$
Rachel to John distance = $\sqrt{(22 - 37)^2 + (50 - 50)^2 + (2 - 2)^2} = 15.00$
Steve to John distance = $\sqrt{(63 - 37)^2 + (200 - 50)^2 + (1 - 2)^2} \approx 152.24$
Tom to John distance = $\sqrt{(59 - 37)^2 + (170 - 50)^2 + (1 - 2)^2} \approx 122.00$
Tom to John distance = $\sqrt{(25 - 37)^2 + (40 - 50)^2 + (4 - 2)^2} \approx 15.75$
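The distance calculations above can be reproduced and then used to pick the nearest neighbors. The feature rows are copied from the calculations; choosing k = 3 is an assumption for illustration, and because the slide labels two rows "Tom", the second one is given a distinguishing key here.

```python
import math

john = (37, 50, 2)

# Feature rows taken from the distance calculations on the slide.
# Two rows are labeled "Tom" there, so the second is disambiguated.
others = {
    "George": (35, 35, 3),
    "Rachel": (22, 50, 2),
    "Steve": (63, 200, 1),
    "Tom": (59, 170, 1),
    "Tom (second row)": (25, 40, 4),
}

# math.dist computes the Euclidean distance between two points.
distances = {name: math.dist(john, row) for name, row in others.items()}

k = 3  # hypothetical choice of K
nearest = sorted(distances, key=distances.get)[:k]

for name in sorted(distances, key=distances.get):
    print(f"{name}: {distances[name]:.2f}")
print("nearest:", nearest)
```

The second feature has by far the largest range, so it dominates the distances; in practice features are often rescaled before KNN for that reason.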
69. Types of Regression
1. Simple Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression
70. Form of Linear Regression
$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots + b_k X_k$
Y is the response
The b values are called the model coefficients. These values are “learned” during the model fitting/training step.
$b_0$ is the intercept
$b_1$ is the coefficient for $X_1$ (the first feature)
$b_k$ is the coefficient for $X_k$ (the kth feature)
71. Steps for Training Linear Regression
1. Model Coefficients/Parameters
Training a linear regression model means finding the coefficients of the linear function that best describes the relationship between the input variables and the response.
2. Cost Function (Loss Function)
Building a linear model means minimizing the error the algorithm makes in its predictions. That error is measured by a chosen function, called the cost function.
3. Estimate the Coefficients
For that task there is a mathematical algorithm called gradient descent.
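The three steps can be sketched for the simplest case of one feature. This is an illustration under assumptions: mean squared error is used as the cost function, the learning rate and number of epochs are arbitrary choices that happen to work for this toy data, and `fit_line` is a hypothetical name.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit y = b0 + b1*x by gradient descent on the mean squared error."""
    b0, b1 = 0.0, 0.0                      # step 1: the coefficients to learn
    n = len(xs)
    for _ in range(epochs):
        # Step 2: the cost is the mean of squared residuals; these are its gradients.
        residuals = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        grad_b0 = 2 / n * sum(residuals)
        grad_b1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        # Step 3: move the coefficients a small step against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

# Toy data that lies exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(xs, ys)
print(round(b0, 4), round(b1, 4))
```

A learning rate that is too large makes the updates diverge, and one that is too small makes training slow, so it usually needs tuning for the data at hand.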
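72. Model evaluation metrics for regression
Regression models need evaluation metrics designed for comparing continuous values.
Root Mean Squared Error (RMSE) is one of the most commonly used:
$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
where $y_i$ is the observed value and $\hat{y}_i$ is the model’s prediction for observation $i$.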
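A short sketch of the metric follows, taking RMSE in its usual sense: the square root of the mean squared difference between observed and predicted values. The sample numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
print(round(rmse(actual, predicted), 4))   # squared errors 1, 0, 1, 4 -> sqrt(6/4)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, and the result is in the same units as the response.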
76. Learn More About Machine Learning through Online Courses
1. Coursera – Machine Learning – Andrew Ng – Stanford University
2. Machine Learning for Intelligent Systems – Kilian Weinberger