MACHINE LEARNING
What is MACHINE LEARNING?
 There is no single well-defined definition, but:
 Arthur Samuel (1959):
Machine learning: "Field of study that gives computers the ability to learn
without being explicitly programmed"
 Samuel wrote a checkers-playing program
 He had the program play 10,000 games against itself
 It worked out which board positions were good and bad depending on
wins/losses
 Tom Mitchell (1997):
Well-posed learning problem: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E."
Another definition…
For the checkers example:
 E = the experience of playing 10,000 games
 T = the task of playing checkers
 P = the probability of winning (win or loss)
How MACHINE LEARNING Works
Is Machine Learning Magic?
No.
It is more like gardening.
Seeds = Algorithms
Nutrients = Data
Gardener = You
Plants = Programs
Types of MACHINE LEARNING
Supervised Learning (Train me)
 It is the machine learning task of inferring a function from labeled training data.
 The training data consist of a set of training examples.
 In supervised learning, each example is a pair consisting of an input object (typically a
vector) and the desired output value (also called the supervisory signal).
Unsupervised Learning (I am self-sufficient in learning)
 It learns from data that has not been labeled, classified, or categorized.
 Instead of responding to feedback, unsupervised learning identifies commonalities in
the data and reacts based on the presence or absence of such commonalities in each new
piece of data.
Reinforcement Learning (My life, my rules! (Hit & Trial))
 It is the ability of an agent to interact with the environment and find out the best
outcome. It follows the concept of the trial-and-error method.
 The agent is rewarded for a correct answer and penalized for a wrong one; on the basis of
the positive reward points gained, the model trains itself.
 Reinforcement learning differs from supervised learning in that, in supervised
learning, the training data comes with the answer key, so the model is trained with the
correct answer itself, whereas in reinforcement learning there is no answer key; the
reinforcement agent decides what to do to perform the given task.
 In the absence of a training dataset, it is bound to learn from its own experience, as in the sketch below.
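To make the reward loop concrete, here is a minimal trial-and-error sketch, assuming a two-armed bandit with payout probabilities 0.3 and 0.7 and an ε-greedy action rule (all of these are illustrative assumptions):

```python
import random

# Minimal trial-and-error (reinforcement) sketch: a two-armed bandit.
# Assumed setup: arm 0 pays a reward with probability 0.3, arm 1 with 0.7.
PAYOUT = [0.3, 0.7]

value = [0.0, 0.0]   # running estimate of each action's reward
count = [0, 0]       # how often each action has been tried
epsilon = 0.1        # exploration rate

for step in range(1000):
    # Explore occasionally; otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = max((0, 1), key=lambda a: value[a])
    reward = 1.0 if random.random() < PAYOUT[action] else 0.0
    # Update the running average reward for the chosen action.
    count[action] += 1
    value[action] += (reward - value[action]) / count[action]

print("learned action values:", value)  # value[1] should approach 0.7
```

The agent is never told which arm is better; it discovers this purely from the rewards it experiences.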
Supervised Learning
• Labeled data
• Direct feedback
• Predict outcome/future
Unsupervised Learning
• No labels
• No feedback
• Find hidden structure
Reinforcement Learning
• Decision process
• Reward system
• Learn a series of actions
Supervised Learning vs. Unsupervised Learning
Real-life example
The task is to arrange the fruits below into groups.
NO  SIZE   COLOUR  SHAPE                                        FRUIT NAME
1   Big    Red     Rounded shape with a depression at the top   Apple
2   Small  Red     Heart-shaped to nearly globular              Cherry
3   Big    Green   Long curving cylinder                        Banana
4   Small  Green   Round to oval, bunch-shape cylindrical       Grape
For Supervised Learning
 You have already learned about the physical characteristics of fruits from previous work.
 So you arrange fruits of the same type together.
 Your previous work is called training data in data mining.
 You have already learned things from your training data; this is because of the
response variable.
 The response variable is simply the decision variable.
For Unsupervised Learning
 This time we don't know anything about the fruits; honestly speaking, this
is the first time you have seen them. You have no clue about them.
 So, how will we arrange them?
 What will we do first?
 We will take a fruit and arrange the fruits by considering the physical
characteristics of that particular fruit.
 Suppose we have considered colour:
•RED COLOUR GROUP: apples & cherries.
•GREEN COLOUR GROUP: bananas & grapes.
 Now consider size along with the previous consideration:
•RED COLOUR AND BIG SIZE: apples.
•RED COLOUR AND SMALL SIZE: cherries.
•GREEN COLOUR AND BIG SIZE: bananas.
•GREEN COLOUR AND SMALL SIZE: grapes.
 This type of learning is known as unsupervised learning.
 Clustering comes under unsupervised learning; a sketch follows.
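A minimal clustering sketch, assuming a numeric encoding of the two features used above (size: 0 = small, 1 = big; colour: 0 = green, 1 = red) and scikit-learn's KMeans:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each fruit encoded as [size, colour]; the 0/1 encoding is an
# illustrative assumption.
fruits = np.array([
    [1, 1],  # apple:  big,   red
    [0, 1],  # cherry: small, red
    [1, 0],  # banana: big,   green
    [0, 0],  # grape:  small, green
])

# Ask for four clusters; no labels are given, only the features.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(fruits)
print(kmeans.labels_)  # four distinct cluster labels, one per fruit
```

With colour alone (one feature) k-means with two clusters would recover exactly the red and green groups described above.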
Machine Learning Techniques
MACHINE LEARNING
SUPERVISED LEARNING
 CLASSIFICATION: Support Vector Machines; Discriminant Analysis; Naïve Bayes; Nearest Neighbor; Decision Trees; Neural Networks
 REGRESSION: Linear Regression, GLM; SVR, GPR; Ensemble Methods; Decision Trees; Neural Networks
UNSUPERVISED LEARNING
 CLUSTERING: K-Means, K-Medoids, Fuzzy C-Means; Hierarchical; Gaussian Mixture; Neural Networks; Hidden Markov Model
Selecting the Right Algorithm
 Selecting a machine learning algorithm is a process of trial and error.
 It’s also a trade-off between specific characteristics of the algorithms,
such as:
 Speed of training
 Memory usage
 Predictive accuracy on new data
 Transparency or interpretability (how easily you can
understand the reasons an algorithm makes its predictions)
SUPERVISED LEARNING
Classification
Classification techniques predict discrete responses: for example, whether an email is
genuine or spam, or whether a tumor is small, medium, or large. Classification models
are trained to classify data into categories. Applications include medical imaging,
speech recognition, and credit scoring.
If the data can be separated into specific groups or classes, use classification
algorithms.
Regression
Regression techniques predict continuous responses: for example, changes in
temperature or fluctuations in electricity demand. Applications include forecasting
stock prices, handwriting recognition, and acoustic signal processing.
If the nature of your response is a real number, such as temperature or the time until
failure for a piece of equipment, use regression techniques.
 Let’s take a closer look at the most commonly used
classification and regression algorithms.
Binary vs. Multiclass Classification
When working on a classification problem, begin by determining whether
the problem is binary or multiclass.
Binary Classification
A single training or test item (instance) can only be assigned to one of two classes:
for example, determining whether an email is genuine or spam.
Multiclass Classification
An instance can be assigned to one of more than two classes: for example, training a
model to classify an image as a dog, cat, or other animal. It requires a more complex
model.
Binary vs. Multiclass Classification
[Figure: two scatter plots over features 𝑋1 and 𝑋2. Left (Binary): points from two classes, marked × and ∆, fall into two separable groups. Right (Multiclass): points from more than two classes.]
Other examples for Classification
 Binary Classification
 Put a tennis ball into the Color or no-Color bin (color)
 (Medical Test) Determine if a patient has a certain disease or not
 (Quality Control Test) Decide if a product should be sold or discarded
 (IR Test) Determine if a document should be in the search results or not
 Multi-Class Classification
 Put a tennis ball into the Green, Orange, or White ball bin (color)
 Decide if an email is advertisement, newsletter, phishing, hack, or
personal.
 Classify a document into Yahoo! Categories
 (Optical Recognition) Classify a scanned character into digit (0..9)
Support Vector Machine
 “Support Vector Machine” (SVM) is a supervised
machine learning algorithm which is used
mostly for classification problems.
 In this algorithm, each data item is plotted as a point in n-
dimensional space (where n is the number of
features), with the value of each feature being
the value of a particular coordinate.
 Then, classification is performed by finding the
hyper-plane that differentiates the two classes
best.
[Figure: two classes separated by a hyper-plane, with the margin marked on either side.]
How Support Vector Machine Works
Classifies data by finding the linear decision boundary
(hyperplane) that separates all data points of one class
from those of the other class.
The best hyperplane for an SVM is the one with the
largest margin between the two classes, when the data is
linearly separable.
If the data is not linearly separable, a loss function is used
to penalize points on the wrong side of the hyperplane.
 SVMs sometimes use a kernel transformation to map nonlinearly separable data into
higher dimensions where a linear decision boundary can be found.
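A minimal sketch of both the linear and the kernelized variant, assuming toy blob data and scikit-learn's SVC (the data and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy two-class data; in practice use your own feature matrix and labels.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear SVM: finds the maximum-margin hyperplane for separable data.
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# RBF-kernel SVM: implicitly maps the data to a higher-dimensional space
# where a linear boundary may exist (the kernel trick).
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear accuracy:", linear_svm.score(X_test, y_test))
print("rbf accuracy:", rbf_svm.score(X_test, y_test))
```

The C parameter controls how strongly misclassified points are penalized, which is the loss-function idea mentioned above.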
Identify the right hyper-plane
 “Select the hyper-plane which segregates the two classes
better.”
 In this scenario, hyper-plane “B” performs this job best.
Which is the right hyper-plane?
 Here, the margin for hyper-plane C is larger than the margins
of both A and B.
 Hence, the right hyper-plane is C.
Identify the right hyper-plane
 SVM selects the hyper-plane which classifies the classes
accurately prior to maximizing the margin.
 Here, hyper-plane B has a classification error and A has
classified everything correctly.
 Therefore, the right hyper-plane is A.
Which is the right hyper-plane?
 When the two classes cannot be separated by a straight line (for example, when one
class encircles the other), SVM solves this problem by introducing an additional feature.
Here, we will add a new feature z = x² + y². (Kernel Transformation)
 Now, let's plot the data points on the x and z axes:
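A minimal numeric sketch of this transformation, assuming ring-shaped toy data (one class around the origin, one class on a circle of radius 3):

```python
import numpy as np

# Toy data: an inner cluster (class 0) surrounded by a ring (class 1).
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
inner = rng.normal(0, 0.5, (100, 2))                   # class 0
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)]  # class 1
X = np.vstack([inner, outer])

# Kernel-style feature: z = x^2 + y^2 (squared distance from the origin).
z = X[:, 0] ** 2 + X[:, 1] ** 2

# In (x, z) space a horizontal line separates the classes:
# inner points have small z, ring points have z = 9 exactly.
print("inner z range:", z[:100].min().round(2), z[:100].max().round(2))
print("outer z range:", z[100:].min().round(2), z[100:].max().round(2))
```

No straight line separates the classes in (x, y), but a simple threshold on z does.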
Support Vector Machine is best used…
For data that has exactly two classes (you can also use it for multiclass classification with
a technique called error-correcting output codes)
For high-dimensional, nonlinearly separable data
Pros and Cons associated with SVM
Pros:
 It works really well with a clear margin of separation.
 It is effective in high-dimensional spaces.
 It is effective in cases where the number of dimensions is greater than the number of samples.
 It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
Cons:
 It doesn't perform well when we have a large dataset, because the required training time is higher.
 It also doesn't perform very well when the dataset has more noise, i.e., the target classes are
overlapping.
 SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold
cross-validation.
Discriminant Analysis
Discriminant analysis (DA) is a technique for analyzing data when the criterion or
dependent variable is categorical and the predictor or independent variables are
interval in nature.
 It is a technique to discriminate between two or more mutually exclusive and
exhaustive groups on the basis of some explanatory variables
Types of Discriminant Analysis (DA)
1. Linear DA: when the criterion /
dependent variable has two categories
Example: adopters & non-adopters
2. Multiple DA: when three or more
categories are involved
Example: SHG1, SHG2, SHG3
How DA Works
Assumptions
1. Sample Size (n)
 Group sizes of the dependent variable should not be grossly different (e.g., 80:20). The
sample size should be at least five times the number of independent variables.
2. Normal Distribution
 Each of the independent variables is normally distributed.
3. Homogeneity of variances / covariances
 All variables have linear and homoscedastic relationships.
4. Outliers
 Outliers should not be present in the data. DA is highly sensitive to the inclusion of
outliers.
5. Non-multicollinearity
 There should be NO MULTICOLLINEARITY among the independent variables.
6. Mutually exclusive
 The groups must be mutually exclusive, with every subject or case belonging to
only one group.
7. Classification
 Each of the allocations for the dependent categories in the initial classification is
correctly classified.
Discriminant Analysis Model
 The discriminant analysis model involves linear combinations of the following
form
𝐷 = 𝑏0+𝑏1𝑋1 + 𝑏2𝑋2 + 𝑏3𝑋3 + ……….+ 𝑏𝑘𝑋𝑘
 where
 D = discriminant score
 b's = discriminant coefficients or weights
 X's = predictor or independent variables
 The coefficients, or weights (b), are estimated so that the groups differ as much
as possible on the values of the discriminant function.
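A minimal sketch of fitting this model, assuming toy two-group data and scikit-learn's LinearDiscriminantAnalysis (the data values are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data: two groups measured on two interval-scaled predictors.
X = np.array([[2.0, 3.1], [1.8, 2.9], [2.2, 3.3],   # group 0
              [4.0, 5.2], [4.2, 5.0], [3.9, 5.1]])  # group 1
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)
print("weights (b1..bk):", lda.coef_)       # discriminant coefficients
print("intercept (b0):", lda.intercept_)
print("predicted group:", lda.predict([[3.0, 4.0]]))
```

The fitted coefficients play the role of the b values in the discriminant function D above.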
Applications of the Discriminant Analysis Model
 Discriminant analysis has been successfully used for many applications, as long
as we can transform the problem into a classification problem.
 DA can be used for a wide range of applications:
1. Identification
 To identify the type of customer that is likely to buy a certain product in a store.
 Using a simple questionnaire survey, we can get the features of customers.
 DA will help select which features describe the group membership of
buying or not buying the product.
2. Decision Making
 A doctor diagnosing an illness may be seen as deciding which disease the patient has.
 This problem can be transformed into a classification problem by assigning the
patient to one of a number of possible disease groups based on the observation of
the symptoms.
3. Prediction
 The question “will it rain today?” can be thought of as prediction.
 The prediction problem can be thought of as assigning “today” to one of the two
possible groups, rain and dry.
4. Pattern Recognition
 Distinguishing pedestrians from dogs and cars in a captured image sequence of traffic
data is a classification problem.
5. Learning
 Scientists teaching a robot to learn to talk can be seen as a classification
problem.
 It assigns frequency, pitch, tune, and many other measurements of sound into
many groups of words.
Naïve Bayes Model
 It is a classification technique based on Bayes' theorem with an assumption of
independence among predictors.
 It is easy to build and particularly useful for very large datasets.
 It learns and predicts very fast, and it does not require lots of storage.
 Its key assumption: all features must be independent of each other.
 It still returns very good accuracy in practice, even when the independence
assumption does not hold.
Applications of the Naïve Bayes Model
1. Real-time Prediction
2. Multi-Class Prediction
3. Text Classification / Spam Filtering / Sentiment Analysis
4. Recommendation Systems
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(x)
– Conditional probability: P(𝑥1|𝑥2), P(𝑥2|𝑥1)
– Relationship: P(𝑥1, 𝑥2) = P(𝑥2|𝑥1)P(𝑥1) = P(𝑥1|𝑥2)P(𝑥2)
– Independence: P(𝑥2|𝑥1) = P(𝑥2), P(𝑥1|𝑥2) = P(𝑥1), P(𝑥1, 𝑥2) = P(𝑥1)P(𝑥2)
Bayesian Rule
P(c|x) = P(x|c)P(c) / P(x)
(Posterior = Likelihood × Prior / Evidence. The posterior P(c|x) is the discriminative quantity; the likelihood P(x|c) is the generative quantity.)
Example to Understand Bayes' Theorem
An experiment involves 2 boxes. Box 1 contains 2 white balls and 3 red balls; Box 2
contains 4 white balls and 5 red balls. One ball is drawn at random from one of the
boxes and is found to be red. Find the probability that it was drawn from the second
box.
Solution
Let red ball = R, white ball = W, Box 1 = A, Box 2 = B.
Probability of selecting Box 1: P(A) = 1/2
Probability of selecting Box 2: P(B) = 1/2
Probability of getting a red ball from Box 1: P(R|A) = 3/5
Probability of getting a red ball from Box 2: P(R|B) = 5/9
Probability that the red ball was drawn from the second box:
P(B|R) = P(R|B)·P(B) / [P(R|B)·P(B) + P(R|A)·P(A)]
This is called Bayes' theorem.
P(B|R) = (5/9 · 1/2) / (5/9 · 1/2 + 3/5 · 1/2) = (5/9) / (5/9 + 3/5) = 25/52 ≈ 0.481 = 48.1%
Example to Understand Bayes' Theorem
With the below tabulation of 100 people, what is the conditional probability that a
certain member of the school is a ‘Teacher’ given that he is a ‘Man’?

         Female  Male  Total
Teacher     8     12     20
Student    32     48     80
Total      40     60    100
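Working this out from the table: P(Teacher|Man) = P(Teacher and Man) / P(Man) = (12/100) / (60/100) = 12/60 = 0.2, i.e., a 20% chance.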
The Naïve Bayes Model
 The Bayes Rule provides the formula for the probability of Y given X. But, in real-
world problems, you typically have multiple X variables
 When the features are independent, we can extend the Bayes Rule to what is
called Naive Bayes
 It is called ‘Naive’ because of the naive assumption that the X’s are independent
of each other.
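Written out, using the same independent-evidence denominator as the worked example below: for a class Y and features 𝑥1, …, 𝑥𝑛,
P(Y | 𝑥1, …, 𝑥𝑛) = P(Y) · P(𝑥1|Y) · … · P(𝑥𝑛|Y) / [P(𝑥1) · … · P(𝑥𝑛)]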
Naive Bayes Example
 Say you have 1000 fruits which could be either ‘banana’, ‘orange’ or ‘other’. These
are the 3 possible classes of the Y variable.
For the sake of computing the probabilities, let’s aggregate the
training data to form a counts table like this.
Step 1: Compute the ‘Prior’ probabilities for each class of
fruits.
P(Y=Banana) = 500 / 1000 = 0.50
P(Y=Orange) = 300 / 1000 = 0.30
P(Y=Other) = 200 / 1000 = 0.20.
Step 2: Compute the probability of evidence that goes in the
denominator:
P(x1=Long) = 500 / 1000 = 0.50
P(x2=Sweet) = 650 / 1000 = 0.65
P(x3=Yellow) = 800 / 1000 = 0.80
Step 3: Compute the likelihoods of the evidence that go
in the numerator:
P(x1=Long | Y=Banana) = 400 / 500 = 0.80
P(x2=Sweet | Y=Banana) = 350 / 500 = 0.70
P(x3=Yellow | Y=Banana) = 450 / 500 = 0.90
So the overall likelihood of the evidence for Banana =
0.8 × 0.7 × 0.9 = 0.504
Step 4: Substitute all three quantities into the Naive Bayes formula
to get the probability that it is a banana.
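Carrying out Step 4 with the quantities from Steps 1–3:
P(Y=Banana | Long, Sweet, Yellow) = (0.50 × 0.504) / (0.50 × 0.65 × 0.80) = 0.252 / 0.26 ≈ 0.97
so the fruit is classified as a banana with roughly 97% probability.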
Nearest Neighbor Algorithm
 Simple analogy: “Tell me who your friends (neighbors) are, and I
will tell you who you are.”
Other Names for the Nearest Neighbor Algorithm
 K-Nearest Neighbors
 Memory-Based Reasoning
 Example-Based Reasoning
 Instance-Based Learning
 Lazy Learning
What is KNN (K-Nearest Neighbors)?
 A powerful classification algorithm used in pattern recognition.
 K-nearest neighbors stores all available cases and classifies new
cases based on a similarity measure (e.g., a distance function).
 One of the top data mining algorithms used today.
 A non-parametric lazy learning algorithm (an instance-based learning
method).
When do we use KNN?
 KNN can be used for both classification and regression predictive problems.
However, it is more widely used for classification problems in industry.
 To evaluate any technique we generally look at three important aspects: ease of
interpreting the output, calculation time, and predictive power.
 KNN is commonly used for its ease of interpretation and low calculation time.
How does KNN work?
KNN has the following basic steps:
1. Calculate distance
2. Find closest neighbors
3. Vote for labels
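A minimal sketch of these three steps, assuming toy one-dimensional data and scikit-learn's KNeighborsClassifier, which performs the distance computation, neighbor search, and voting internally:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: points on a line with two class labels.
X_train = np.array([[1], [2], [3], [10], [11], [12]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k = 3: each prediction is a majority vote among the 3 nearest
# neighbors (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

print(knn.predict([[2.5], [10.5]]))  # expected: [0 1]
```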
Effect of K in KNN
How to choose factor K
Training error rate and validation error rate
 Segregate training and validation sets from the initial dataset, then
 Plot the validation error curve to get the optimal value of K. This value of K
should be used for all predictions; a sketch follows.
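A minimal sketch of this procedure, assuming a toy dataset and candidate K values from 1 to 20 (all illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset, split into training and validation parts.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)

# Measure validation error for each candidate K.
errors = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = 1.0 - knn.score(X_val, y_val)

best_k = min(errors, key=errors.get)
print("optimal K:", best_k)  # the K with the lowest validation error
```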
Distance Measure for Continuous Variables
 Minkowski distance: 𝑑(𝑥, 𝑦) = (Σ𝑖=1..𝑛 |𝑥𝑖 − 𝑦𝑖|^𝑝)^(1/𝑝), where 𝑝 = 1 gives the Manhattan distance and 𝑝 = 2 the Euclidean distance.
Example: Distance measure from John to the others using the Euclidean
distance
George to John distance = √[(35 − 37)² + (35 − 50)² + (3 − 2)²] = 15.16
Rachel to John distance = √[(22 − 37)² + (50 − 50)² + (2 − 2)²] = 15
Steve to John distance = √[(63 − 37)² + (200 − 50)² + (1 − 2)²] = 152.23
Tom to John distance = √[(59 − 37)² + (170 − 50)² + (1 − 2)²] = 122
Tom to John distance = √[(25 − 37)² + (40 − 50)² + (4 − 2)²] = 15.74
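These hand computations can be checked with a short sketch (the feature values are copied from the rows above):

```python
import numpy as np

john = np.array([37, 50, 2])
others = {
    "George": np.array([35, 35, 3]),
    "Rachel": np.array([22, 50, 2]),
    "Steve": np.array([63, 200, 1]),
    "Tom": np.array([59, 170, 1]),
}

# Euclidean distance = Minkowski distance with p = 2.
for name, point in others.items():
    print(name, "to John:", round(float(np.linalg.norm(point - john)), 2))
```

The printed values match the table above up to rounding.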
Linear Regression
In a regression problem, the goal of the algorithm is to predict a real-valued output.
Types of Regression
1. Simple Linear Regression
2. Polynomial Regression
3. Support Vector Regression
4. Decision Tree Regression
5. Random Forest Regression
Form of Linear Regression
𝑌 = 𝑏0+𝑏1𝑋1 + 𝑏2𝑋2 + 𝑏3𝑋3 + ……….+ 𝑏𝑘𝑋𝑘
 Y is the response
 b values are called the model coefficients. These values are “learned”
during the model fitting/training step.
 𝑏0 is the intercept
 𝑏1 is the coefficient for X1 (the first feature)
 𝑏𝑘 is the coefficient for 𝑋𝑘 (the kth feature)
Steps for Training Linear Regression
1. Model Coefficients/Parameters
When training a linear regression model, we are trying to find the
coefficients of the linear function that best describe the relationship between the
input variables and the output.
2. Cost Function (Loss Function)
When building a linear model, we try to minimize the error the
algorithm makes in its predictions, and we do that by choosing a function that
measures the error, also called the cost function.
3. Estimate the Coefficients
For that task there is a mathematical algorithm called Gradient Descent; a sketch follows.
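A minimal gradient-descent sketch for simple linear regression, assuming toy data generated from y = 2x + 1 plus noise (data, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Toy data generated from y = 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)

b0, b1 = 0.0, 0.0  # initial coefficients
lr = 0.01          # learning rate

for _ in range(2000):
    error = (b0 + b1 * x) - y
    # Gradients of the mean squared error cost with respect to b0 and b1.
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()

print("learned intercept b0:", round(b0, 2))  # should be near 1.0
print("learned slope b1:", round(b1, 2))      # should be near 2.0
```

Each iteration nudges the coefficients in the direction that reduces the cost function.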
Model Evaluation Metrics for Regression
 It is necessary to use evaluation metrics designed for comparing continuous values.
 Root Mean Squared Error (RMSE) is one of the best evaluation methods:
RMSE = √[ (1/𝑛) Σ𝑖=1..𝑛 (𝑦𝑖 − 𝑦̂𝑖)² ]
where 𝑦̂𝑖 is the model's prediction for the ith observation.
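A short standalone check of the formula, assuming small illustrative arrays of true and predicted values:

```python
import numpy as np

# RMSE compares predictions with true values; toy arrays for illustration.
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.2, 7.0, 9.5])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("RMSE:", round(float(rmse), 3))  # 0.381 for these values
```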
Decision Tree Algorithm
Example
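A minimal decision-tree sketch, assuming the classic iris dataset and a depth-2 tree (all choices illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow decision tree on the classic iris dataset.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned if/then rules, one line per node.
print(export_text(tree, feature_names=list(iris.feature_names)))
```

A decision tree classifies by asking a sequence of threshold questions about the features, which is why its predictions are easy to interpret.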
Learn More About Machine Learning through Online Courses
1. Coursera – Machine Learning – Andrew Ng – Stanford University
2. Machine Learning for Intelligent Systems – Kilian Weinberger