Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes
classifier.
Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a
Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training
Algorithm, Attribute selection measures- Gini impurity; Entropy, Regularization
Hyperparameters, Regression Trees, Linear Support vector machines.
UNIT-2
Conditional probability
• Terms
• P(h) : prior probability of h
• P(D) : prior probability that training data D will be observed
• P(D|h) : probability of observing data D given that hypothesis h holds (the likelihood)
• P(h|D) : posterior probability of h , given D
• Bayes Rule
 Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to
calculate the posterior probability P(h|D) from the prior probability P(h), together with
P(D) and P(D|h).
Bayes’ rule
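In symbols, Bayes' rule states:

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

i.e. the posterior probability of a hypothesis is its likelihood times its prior, normalized by the probability of the data.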
• A classification model, like an ANN or a decision tree, but one based on the computation of probabilities; hence, a probabilistic model.
• Let X be the feature matrix of samples.
• Let y be the corresponding vector of labels/classes.
• A single sample is represented by (x, y_j), where x = (x_1, x_2, ..., x_d) is the feature vector and y_j is its class.
• y_j is a particular class, not the class of sample j.
maximum a posteriori (MAP) hypothesis
 We can determine the MAP hypotheses by using Bayes theorem to calculate the posterior
probability of each candidate hypothesis.
The MAP hypothesis:
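Written out (the standard derivation the slide refers to):

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$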
argmax is most commonly used in machine learning
for finding the class with the largest predicted
probability.
Notice that in the final step we dropped the term P(D) because it is a constant independent of h.
Naïve Bayes classifier
The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of
attribute values and where the target function f(x) can take on any value from some finite set V.
A set of training examples of the target function is provided, and a new instance is presented, described by the
tuple of attribute values (a1, a2, ..., an). The learner is asked to predict the target value, or classification, for this new instance.
The Bayesian approach to classifying the new instance is to assign the most probable target
value, v_MAP, given the attribute values (a1, a2, ..., an) that describe the instance.
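In symbols, and using the naive Bayes assumption that the attribute values are conditionally independent given the target value:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$$

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)$$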
where v_NB denotes the target value output by the naive Bayes classifier.
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
New instance (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong)
use the naive Bayes classifier and the training data  to classify the below new instance:
 Our task is to predict the target value (yes or no) of the target concept PlayTennis for
this new instance.
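Working the required probability estimates out from the table above:
P(PlayTennis = yes) = 9/14 = 0.64, P(PlayTennis = no) = 5/14 = 0.36
P(Outlook = sunny | yes) = 2/9, P(Outlook = sunny | no) = 3/5
P(Temperature = cool | yes) = 3/9, P(Temperature = cool | no) = 1/5
P(Humidity = high | yes) = 3/9, P(Humidity = high | no) = 4/5
P(Wind = strong | yes) = 3/9, P(Wind = strong | no) = 3/5
P(yes) · P(sunny|yes) · P(cool|yes) · P(high|yes) · P(strong|yes) ≈ 0.0053
P(no) · P(sunny|no) · P(cool|no) · P(high|no) · P(strong|no) ≈ 0.0206
Since 0.0206 > 0.0053, the naive Bayes classifier outputs v_NB = no.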
Thus, the naive Bayes classifier assigns the target value PlayTennis = no to this new instance,
based on the probability estimates learned from the training data.
What is the Sigmoid Function?
It is a mathematical function that can take any real value and map it into the range between 0 and 1, with a curve shaped like the letter "S". The sigmoid function is also called the logistic function:
σ(z) = 1 / (1 + e^(-z))
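A minimal NumPy sketch of the sigmoid (the function name and the sample inputs are just illustrative):

import numpy as np

def sigmoid(z):
    # squashes any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-6.0, -2.0, 0.0, 2.0, 6.0])))
# approximately [0.0025 0.119 0.5 0.881 0.9975]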
Logistic Regression
Logistic Regression: Logistic regression is a type of regression used for classification
problems, where the output variable is categorical in nature. Logistic regression uses a logistic
function to predict the probability of the input belonging to a particular category.
OR
Logistic regression (logit regression) is a type of regression analysis used for predicting
the outcome of a categorical dependent variable.
In logistic regression, the dependent variable (Y) is binary (0, 1) and the independent variables (X) are continuous in nature.
Define logistic regression?
Logistic Regression sub Topics
1)Estimating Probabilities
2)Training and Cost Function
3)Decision Boundaries
4)Softmax Regression
Logistic Regression passes its output through the "sigmoid function" (or "logistic function") rather than using a plain linear function, and it is trained with a more sophisticated cost function than Linear Regression.
1) Estimating Probabilities
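In standard form (a sketch; the slide's own equations are images): the model computes a weighted sum of the input features (plus a bias term) and outputs the logistic of that sum as the estimated probability:

$$\hat{p} = h_{\theta}(\mathbf{x}) = \sigma(\theta^{T}\mathbf{x})$$

The prediction is then ŷ = 1 if p̂ ≥ 0.5, and ŷ = 0 otherwise.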
3. Decision Boundaries
Loading the iris dataset:
• Load the iris dataset and print it.
• Print the names of the four features.
• Print the names of the target (iris flowers): 0 = setosa, 1 = versicolor, 2 = virginica.
• Print the integers representing the species of each observation.
• Size of the feature matrix: 150 rows and 4 columns.
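A minimal scikit-learn sketch of those steps (the print statements are illustrative, not the original slide code):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # names of the four features
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(iris.target)          # integers 0/1/2: species of each observation
print(iris.data.shape)      # (150, 4): 150 rows and 4 columns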
The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4
cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a
smaller petal width, ranging from 0.1 cm to 1.8 cm.
Just like the other linear models, Logistic Regression models can be regularized using ℓ1 or ℓ2
penalties. Scikit-Learn actually adds an ℓ2 penalty by default.
Figure 4-24 shows the same dataset but this time displaying two features: petal width and length. Once trained,
the Logistic Regression classifier can estimate the probability that a new flower is an Iris-Virginica based on
these two features. The dashed line represents the points where the model estimates a 50% probability: this is the
model’s decision boundary. Note that it is a linear boundary.
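A minimal sketch of how such a classifier might be trained with scikit-learn (the test flower with 5 cm long, 2 cm wide petals is just an illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]                 # petal length and width
y = (iris.target == 2).astype(int)   # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X, y)

# estimated probabilities [not-Virginica, Virginica] and predicted class for one new flower
print(log_reg.predict_proba([[5, 2]]))
print(log_reg.predict([[5, 2]]))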
4. Softmax Regression
The Logistic Regression model can be generalized to support multiple classes
directly, without having to train and combine multiple binary classifiers. This is called
Softmax Regression, or Multinomial Logistic Regression.
The idea of Softmax Regression: when given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.
The softmax function (Equation 4-20):
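In its standard form:

$$\hat{p}_k = \sigma\bigl(\mathbf{s}(\mathbf{x})\bigr)_k = \frac{\exp\bigl(s_k(\mathbf{x})\bigr)}{\sum_{j=1}^{K} \exp\bigl(s_j(\mathbf{x})\bigr)}$$

where K is the number of classes and s_k(x) is the score of class k for the instance x; the class with the highest estimated probability is the prediction.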
Decision Trees
For example, the instance
(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind=Strong)
would be sorted down the leftmost branch of this decision tree and would therefore be classified as a
negative instance (i.e., the tree predicts that PlayTennis = no).
 Decision trees classify instances by sorting them down the tree from the root to some
leaf node, which provides the classification of the instance.
 Each node in the tree specifies a test of some attribute of the instance, and each branch
descending from that node corresponds to one of the possible values for this attribute.
 An instance is classified by starting at the root node of the tree, testing the attribute
specified by this node, then moving down the tree branch corresponding to the value of the
attribute in the given example.
 This process is then repeated for the subtree rooted at the new node.
DECISION TREE REPRESENTATION
THE BASIC DECISION TREE LEARNING ALGORITHM
ID3 Algorithm
Top-down, greedy search through space of possible
decision trees
What is top-down?
How to start tree?
What attribute should represent the root?
Information gain is precisely the measure used by ID3 to
select the best attribute at each step in growing the tree.
Question?
How do you determine which
attribute best classifies data?
Answer: Entropy!
•Information gain:
• Statistical quantity measuring how well an
attribute classifies the data.
• Calculate the information gain for each attribute.
• Choose attribute with greatest information gain.
•Example: PlayTennis
• Four attributes used for classification:
• Outlook = {Sunny,Overcast,Rain}
• Temperature = {Hot, Mild, Cool}
• Humidity = {High, Normal}
• Wind = {Weak, Strong}
• One predicted (target) attribute (binary)
• PlayTennis = {Yes,No}
• Given 14 Training examples
• 9 positive
• 5 negative
Information gain
 The higher the information gain, the more effective the attribute is in classifying the training data.
 Information gain is the expected reduction in entropy from knowing the value of attribute A:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) * Entropy(Sv)
where Values(A) is the set of possible values of attribute A, and Sv is the subset of S for which A has value v.
Equivalently, the information gain G(S, A) for an attribute A:
G(S, A) = E(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) * E(Sv)
Entropy
Boolean functions with the same number of ones and zeros have the largest entropy.
• Positive examples and Negative examples from set S:
H(S) = - p+ log2(p+) - p- log2(p- )
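For the 14 PlayTennis training examples (9 positive, 5 negative):
H(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.940
This is the value E(S) = 0.94 used in the information-gain calculations below.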
• The information gain for Outlook is:
• G(S,Outlook) = E(S) – [5/14 * E(Outlook=sunny) + 4/14 * E(Outlook=overcast) + 5/14 * E(Outlook=rain)]
• G(S,Outlook) = E([9+,5−]) – [5/14 * E([2+,3−]) + 4/14 * E([4+,0−]) + 5/14 * E([3+,2−])]
• G(S,Outlook) = 0.94 – [5/14 * 0.971 + 4/14 * 0.0 + 5/14 * 0.971]
• G(S,Outlook) = 0.246
Recall the information gain of an attribute A: G(S, A) = E(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) * E(Sv)
• G(S,Temperature) = 0.94 – [4/14 * E(Temperature=hot) + 6/14 * E(Temperature=mild) + 4/14 * E(Temperature=cool)]
• G(S,Temperature) = 0.94 – [4/14 * E([2+,2−]) + 6/14 * E([4+,2−]) + 4/14 * E([3+,1−])]
• G(S,Temperature) = 0.94 – [4/14 * 1.0 + 6/14 * 0.918 + 4/14 * 0.811]
• G(S,Temperature) = 0.029
• G(S,Humidity) = 0.94 – [7/14 * E(Humidity=high) + 7/14 * E(Humidity=normal)]
• G(S,Humidity) = 0.94 – [7/14 * E([3+,4−]) + 7/14 * E([6+,1−])]
• G(S,Humidity) = 0.94 – [7/14 * 0.985 + 7/14 * 0.592]
• G(S,Humidity) = 0.1515
 Let
 Values(Wind) = {Weak, Strong}
 S = [9+, 5−]
 SWeak = [6+, 2−]
 SStrong = [3+, 3−]
 Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 * Entropy(SWeak) − 6/14 * Entropy(SStrong)
= 0.94 − 8/14 * 0.811 − 6/14 * 1.00
= 0.048
which attribute to test at the root?
 Which attribute should be tested at the root?
 Gain(S, Outlook) = 0.246
 Gain(S, Humidity) = 0.151
 Gain(S, Wind) = 0.048
 Gain(S, Temperature) = 0.029
 Outlook provides the best prediction for the target
 Let's grow the tree:
 add to the tree a successor for each possible value of
Outlook
 partition the training samples according to the value of
Outlook
Consider the following set of training examples:
a) What is the entropy of this collection of training examples with respect to the target function
classification?
b) What is the information gain of a2 relative to these training examples?
Problems on Decision Trees
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:,2:] # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2,random_state=42)
tree_clf.fit(X, y)
from sklearn.tree import export_graphviz
export_graphviz(
tree_clf,
out_file="iris_tree.dot",
feature_names=iris.feature_names[2:],
class_names=iris.target_names,
rounded=True,
filled=True )
from graphviz import Source
Source.from_file("iris_tree.dot")
Training and Visualizing a Decision Tree
(Figure: the trained decision tree, with True/False branches from each test node leading down to leaf nodes.)
ID3: algorithm
ID3(X, T, Attrs)
X: training examples
T: target attribute (e.g. PlayTennis)
Attrs: other attributes, initially all attributes
1) Create Root node
If all X's are +, return Root with class +
If all X's are −, return Root with class −
If Attrs is empty, return Root with class = the most common value of T in X
else
2) A ← the attribute from Attrs that best* classifies X.
The decision attribute for Root ← A.
For each possible value vi of A:
- add a new branch below Root, for the test A = vi
- Xi ← subset of X with A = vi
- If Xi is empty, then add a new leaf with class = the most common value of T in X
else add the subtree generated by ID3(Xi, T, Attrs − {A})
End
3) return Root
* The best attribute is the one with the highest information gain.
Making Predictions
Let’s see how the tree represented in Figure 6-1 makes predictions.
 Suppose you find an iris flower and you want to classify it. You start at the root node (depth
0, at the top): this node asks whether the flower’s petal length is smaller than 2.45 cm. If it is,
then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node
(i.e., it does not have any children nodes), so it does not ask any questions: you can simply look
at the predicted class for that node and the Decision Tree predicts that your flower is an Iris-
Setosa (class=setosa).
 Now suppose you find another flower, but this time the petal length is greater than 2.45 cm.
You must move down to the root’s right child node (depth 1, right), which is not a leaf node,
so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower
is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2,
right). It’s really that simple.
Estimating Class Probabilities
A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first it
traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of
class k in this node.
For example, suppose you have found a flower whose petals are 5 cm long and 1.5 cm wide.
The corresponding leaf node is the depth-2 left node, so the Decision Tree should output the following
probabilities: 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54).
And of course if you ask it to predict the class, it should output Iris-Versicolor (class 1) since it has the
highest probability.
Let’s check this:
>>> tree_clf.predict_proba([[5, 1.5]])
array([[0. , 0.90740741, 0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])
Attribute selection measures- Gini impurity
 By default, the Gini impurity measure is used, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy".
In Machine Learning, entropy is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class.
So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
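In their standard form, for a node i with p_{i,k} the ratio of class-k instances among the training instances in node i:

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{2} \qquad\qquad H_i = -\sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k}\,\log_2\!\bigl(p_{i,k}\bigr)$$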
The CART Training Algorithm
The only difference between ID3 and CART for building a decision tree is the criterion for
selecting the optimal attribute: in ID3 the attribute selection criterion is information gain, whereas for CART it is the Gini index.
What is the CART algorithm in machine learning?
The CART algorithm is a classification algorithm that builds a decision tree on the basis of Gini's impurity index.
ID3 → (Iterative Dichotomiser 3; uses entropy and information gain for classification problems and standard deviation reduction for regression problems)
CART → (Classification And Regression Tree; used for both classification and regression)
The Gini index value goes from 0 (perfectly pure) to 1 (perfectly impure).
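For classification, the cost function that CART minimizes at each split is, in its standard form:

$$J(k, t_k) = \frac{m_{left}}{m}\,G_{left} + \frac{m_{right}}{m}\,G_{right}$$

where k is the feature and t_k the threshold being considered, G_left/G_right measure the impurity of the left/right subsets, and m_left/m_right are the numbers of instances in the left/right subsets.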
Training examples: the PlayTennis data set (shown earlier).
Gini index(attribute) = weighted average of the Gini index of each attribute value:
Gini index(A) = Σ_{v ∈ Values(A)} (|Sv| / |S|) * Gini index(Sv)
Gini index(each attribute value) = 1 − Σ_k (p_k)², where p_k is the proportion of class k among the examples having that attribute value.
Decision Trees are also capable of performing regression tasks. Let’s build a
regression tree using Scikit-Learn’s DecisionTreeRegressor class, training it on a noisy
quadratic dataset with max_depth=2
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5                   # a single random input feature
y_quad = X_quad ** 2 + 0.025 * np.random.randn(200, 1)  # quadratic target plus small Gaussian noise
                                                        # (small noise scale assumed, consistent with the leaf values quoted below)
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)
Regression Trees
This tree looks very similar to the classification tree you built earlier. The main difference is that instead
of predicting a class in each node, it predicts a value.
For example, suppose you want to make a prediction for a new instance with x1 = 0.2.
The root node asks whether x1 <= 0.197. Since it is not, the algorithm goes to the right child node, which
asks whether x1 <= 0.772. Since it is, the algorithm goes to the left child node.
This is a leaf node, and it predicts value = 0.111.
This prediction is the average target value of the 110 training instances associated with this leaf node,
and it results in a mean squared error equal to 0.015 over these 110 instances.
This model’s predictions are represented on the left of Figure 6-5. If you set max_depth=3, you get
the predictions represented on the right.
The CART algorithm works mostly the same way as earlier, except that instead of
trying
to split the training set in a way that minimizes impurity, it now tries to split the
training set in a way that minimizes the MSE. Equation 6-4 shows the cost function
that the algorithm tries to minimize.
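In standard form (a reconstruction of the equation referenced above), the regression split cost is:

$$J(k, t_k) = \frac{m_{left}}{m}\,\mathrm{MSE}_{left} + \frac{m_{right}}{m}\,\mathrm{MSE}_{right} \qquad \mathrm{MSE}_{node} = \frac{1}{m_{node}} \sum_{i \in node} \bigl(\hat{y}_{node} - y^{(i)}\bigr)^{2}, \quad \hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$$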
 Outliers are data points that don't fit the pattern of the rest of the data set. A simple way to detect the outliers in a given data set is to plot its boxplot: the points located outside the whiskers of the boxplot are the outliers.
Linear Support vector machines
 A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning
model, capable of performing linear or nonlinear classification, regression, and
even outlier detection.
SVMs are particularly well suited for classification of complex but small- or medium-size
datasets.
 A Support Vector Machine (SVM) performs classification by finding the hyperplane
that maximizes the margin between the two classes.
A hyperplane is a decision boundary that differentiates the two classes in
SVM.
SVM on the Iris data set: the two classes can clearly be separated with a straight line (they are linearly separable).
 The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision
boundary is represented by the dashed line is so bad that it does not even separate the classes properly.
The other two models work perfectly on this training set, but their decision boundaries come so close to
the instances that these models will probably not perform as well on new instances.
 In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier;
this line not only separates the two classes but also stays as far away from the closest training instances as
possible.
 Notice that adding more training instances “off the street” will not affect the decision boundary
at all: it is fully determined (or “supported”) by the instances located on the edge of the street.
These instances are called the support vectors (they are circled in Figure 5-1).
1) Linear SVM Classification
2) Soft Margin Classification
 If we strictly impose that all instances be off the street and on the correct side, this is called hard margin classification.
 There are two main issues with hard margin classification. First, it only works if the data is linearly separable, and second, it is quite sensitive to outliers. Figure 5-3 shows the iris dataset with just one additional outlier: on the left, it is impossible to find a hard margin, and on the right the decision boundary ends up very different from the one we saw in Figure 5-1 without the outlier, and it will probably not generalize as well.
 To avoid these issues it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.
 In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparameter: a smaller
C value leads to a wider street but more margin violations. Figure 5-4 shows the decision
boundaries and margins of two soft margin SVM classifiers on a nonlinearly separable dataset. On
the left, using a low C value the margin is quite large, but many instances end up on the street.
On the right, using a high C value the classifier makes fewer margin violations but ends up with a
smaller margin.
 However, it seems likely that the first classifier will generalize better: in fact even on this training set
it makes fewer prediction errors, since most of the margin violations are actually on the correct side of
the decision boundary.
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2) # Iris-Virginica
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1,random_state=42))
svm_clf.fit(X, y)
Then, as usual, you can use the model to make predictions:
>>> X_new = [[5.5, 1.7], [5.0, 1.5]]
>>> svm_clf.predict(X_new)
array([ True, False])
The first plant is classified as an Iris virginica, while the second is not.