- 1. UNIT-2: Generative Classifiers: Classifying with Bayesian decision theory, Bayes' rule, Naïve Bayes classifier. Discriminative Classifiers: Logistic Regression, Decision Trees: Training and Visualizing a Decision Tree, Making Predictions, Estimating Class Probabilities, The CART Training Algorithm, Attribute selection measures - Gini impurity, Entropy; Regularization Hyperparameters, Regression Trees, Linear Support Vector Machines.
- 6. • Terms • P(h): prior probability of hypothesis h • P(D): prior probability that the training data D will be observed • P(D|h): probability of observing D given that hypothesis h holds • P(h|D): posterior probability of h, given D • Bayes' Rule: Bayes' theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):
P(h|D) = P(D|h) P(h) / P(D)
- 7. • A classification model, like an ANN or a decision tree, but based on the computation of probabilities, hence a probabilistic model. • Let X be the feature matrix of samples and y the corresponding vector of labels/classes. • A single sample is represented by (x, y_j), where x = (x_1, x_2, ..., x_d) is the feature vector and y_j is the class. • Here y_j denotes a particular class, not the class of sample j.
- 8. Maximum a posteriori (MAP) hypothesis: we can determine the MAP hypothesis by using Bayes' theorem to calculate the posterior probability of each candidate hypothesis:
h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)
argmax is most commonly used in machine learning for finding the class with the largest predicted probability. Notice that in the final step we dropped the term P(D) because it is a constant independent of h.
- 9. Naïve Bayes classifier: The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2, ..., an). The learner is asked to predict the target value, or classification, for this new instance. The Bayesian approach to classifying the new instance is to assign the most probable target value, v_MAP, given the attribute values (a1, a2, ..., an) that describe the instance:
v_MAP = argmax_{v_j ∈ V} P(v_j | a1, a2, ..., an) = argmax_{v_j ∈ V} P(a1, a2, ..., an | v_j) P(v_j)
Assuming the attribute values are conditionally independent given the target value, this simplifies to
v_NB = argmax_{v_j ∈ V} P(v_j) Π_i P(a_i | v_j)
where v_NB denotes the target value output by the naive Bayes classifier.
- 11. PlayTennis training examples:
Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
D1  | Sunny    | Hot         | High     | Weak   | No
D2  | Sunny    | Hot         | High     | Strong | No
D3  | Overcast | Hot         | High     | Weak   | Yes
D4  | Rain     | Mild        | High     | Weak   | Yes
D5  | Rain     | Cool        | Normal   | Weak   | Yes
D6  | Rain     | Cool        | Normal   | Strong | No
D7  | Overcast | Cool        | Normal   | Strong | Yes
D8  | Sunny    | Mild        | High     | Weak   | No
D9  | Sunny    | Cool        | Normal   | Weak   | Yes
D10 | Rain     | Mild        | Normal   | Weak   | Yes
D11 | Sunny    | Mild        | Normal   | Strong | Yes
D12 | Overcast | Mild        | High     | Strong | Yes
D13 | Overcast | Hot         | Normal   | Weak   | Yes
D14 | Rain     | Mild        | High     | Strong | No
- 13. Use the naive Bayes classifier and the training data to classify the following new instance: (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong). Our task is to predict the target value (yes or no) of the target concept PlayTennis for this new instance.
- 14. New instance: (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong). The naive Bayes classifier assigns the target value PlayTennis = no to this new instance, based on the probability estimates learned from the training data.
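The probability estimates behind this conclusion can be read directly from the training table above:
P(PlayTennis = yes) = 9/14, P(PlayTennis = no) = 5/14
P(sunny|yes) = 2/9, P(sunny|no) = 3/5
P(cool|yes) = 3/9, P(cool|no) = 1/5
P(high|yes) = 3/9, P(high|no) = 4/5
P(strong|yes) = 3/9, P(strong|no) = 3/5
P(yes) · P(sunny|yes) · P(cool|yes) · P(high|yes) · P(strong|yes) = 9/14 · 2/9 · 3/9 · 3/9 · 3/9 ≈ 0.0053
P(no) · P(sunny|no) · P(cool|no) · P(high|no) · P(strong|no) = 5/14 · 3/5 · 1/5 · 4/5 · 3/5 ≈ 0.0206
Since 0.0206 > 0.0053, v_NB = no. Normalizing the two quantities gives a conditional probability of about 0.0206 / (0.0206 + 0.0053) ≈ 0.795 that PlayTennis = no for this instance.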
- 15. What is the Sigmoid Function? It is a mathematical function that takes any real value and maps it to a value between 0 and 1, with a curve shaped like the letter "S". The sigmoid function is also called the logistic function: σ(z) = 1 / (1 + e^(−z)).
- 17. Logistic Regression: Logistic regression is a type of regression used for classification problems, where the output variable is categorical in nature. Logistic regression uses a logistic function to predict the probability that an input belongs to a particular category. OR Logistic regression (logit regression) is a type of regression analysis used for predicting the outcome of a categorical dependent variable. In logistic regression, the dependent variable (Y) is binary (0, 1) and the independent variables (X) are continuous in nature. Define Logistic Regression?
- 18. Logistic Regression sub-topics: 1) Estimating Probabilities 2) Training and Cost Function 3) Decision Boundaries 4) Softmax Regression
- 19. 1) Estimating Probabilities: Instead of outputting the result of a linear function directly, Logistic Regression passes the linear combination of the inputs through the sigmoid (logistic) function, producing an estimated probability: p̂ = h_θ(x) = σ(θᵀx), where σ(t) = 1 / (1 + e^(−t)). The model then predicts ŷ = 1 if p̂ ≥ 0.5, and ŷ = 0 otherwise.
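A minimal sketch of this probability estimate in Python; the parameter values below are invented purely for illustration, they are not from the slides:

import numpy as np

def sigmoid(t):
    # logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

theta = np.array([-1.5, 0.75])   # assumed parameters [bias, weight], for illustration only
x = np.array([1.0, 4.0])         # input vector [1, feature], with a constant 1 for the bias term
p_hat = sigmoid(theta @ x)       # estimated probability of the positive class, about 0.82 here
y_pred = int(p_hat >= 0.5)       # predicted class label: 1, since p_hat >= 0.5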
- 23. Loading the iris dataset: • Load the iris dataset • Print the iris dataset • Print the names of the four features • Print the names of the targets (iris flowers)
- 24. Print the integers representing the species of each observation: 0 = setosa, 1 = versicolor, 2 = virginica. Size of the feature matrix: 150 rows and 4 columns.
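A minimal sketch of the steps listed on these two slides, using scikit-learn's load_iris (the printed shapes and label encoding are exactly as described above):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)   # names of the four features
print(iris.target_names)    # names of the targets: ['setosa' 'versicolor' 'virginica']
print(iris.target)          # integers 0/1/2 encoding the species of each observation
print(iris.data.shape)      # (150, 4): size of the feature matrix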
- 26. The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm.
- 27. Just like the other linear models, Logistic Regression models can be regularized using ℓ1 or ℓ2 penalties. Scikit-Learn actually adds an ℓ2 penalty by default. Figure 4-24 shows the same dataset, but this time displaying two features: petal width and petal length. Once trained, the Logistic Regression classifier can estimate the probability that a new flower is an Iris-Virginica based on these two features. The dashed line represents the points where the model estimates a 50% probability: this is the model's decision boundary. Note that it is a linear boundary.
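A minimal sketch of how such a classifier is trained with scikit-learn, here using only petal width to detect Iris-Virginica (the plotting code for the figures is omitted, and the two test flowers are made-up examples):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 3:]                  # petal width (cm)
y = (iris.target == 2).astype(int)    # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression(random_state=42)   # uses an l2 penalty by default
log_reg.fit(X, y)

print(log_reg.predict_proba([[1.7], [1.5]]))    # estimated probabilities for two new flowers
print(log_reg.predict([[1.7], [1.5]]))          # predicted classes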
- 28. 4. Softmax Regression: The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. This is called Softmax Regression, or Multinomial Logistic Regression. The idea is that, when given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.
- 29. The softmax function (Equation 4-20):
p̂_k = σ(s(x))_k = exp(s_k(x)) / Σ_{j=1..K} exp(s_j(x))
where K is the number of classes, s(x) is the vector containing the scores of each class for the instance x, and σ(s(x))_k is the estimated probability that x belongs to class k, given the scores of each class for that instance.
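A minimal sketch of Softmax Regression on the iris data with scikit-learn; the C value is just an example, and with the default lbfgs solver scikit-learn fits a multinomial (softmax) model whenever the target has more than two classes:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]   # petal length and width
y = iris.target        # three classes: setosa, versicolor, virginica

softmax_reg = LogisticRegression(C=10, random_state=42)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))         # most likely class for a 5 cm x 2 cm petal
print(softmax_reg.predict_proba([[5, 2]]))   # softmax probability estimates for each class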
- 31. Decision Trees: For example, the instance (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong) would be sorted down the leftmost branch of this decision tree and would therefore be classified as a negative instance (i.e., the tree predicts that PlayTennis = no).
- 32. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node. DECISION TREE REPRESENTATION
- 33. THE BASIC DECISION TREE LEARNING ALGORITHM: the ID3 algorithm performs a top-down, greedy search through the space of possible decision trees. What does top-down mean? How do we start the tree, i.e., which attribute should represent the root? Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree.
- 34. Question? How do you determine which attribute best classifies data? Answer: Entropy! •Information gain: • Statistical quantity measuring how well an attribute classifies the data. • Calculate the information gain for each attribute. • Choose attribute with greatest information gain.
- 35. •Example: PlayTennis • Four attributes used for classification: • Outlook = {Sunny,Overcast,Rain} • Temperature = {Hot, Mild, Cool} • Humidity = {High, Normal} • Wind = {Weak, Strong} • One predicted (target) attribute (binary) • PlayTennis = {Yes,No} • Given 14 Training examples • 9 positive • 5 negative
- 36. Information gain: the higher the information gain, the more effective the attribute is at classifying the training data. It is the expected reduction in entropy from knowing the value of attribute A:
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
where Values(A) is the set of possible values of A, and S_v is the subset of S for which A has value v.
- 37. Entropy: Boolean collections with the same number of ones and zeros (equally many positive and negative examples) have the largest entropy.
- 38. Entropy: for a set S containing positive and negative examples, with proportions p+ and p−,
H(S) = − p+ log2(p+) − p− log2(p−)
- 39. The information gain for Outlook is:
• G(S, Outlook) = E(S) − [5/14 · E(Outlook=sunny) + 4/14 · E(Outlook=overcast) + 5/14 · E(Outlook=rain)]
• G(S, Outlook) = E([9+,5−]) − [5/14 · E([2+,3−]) + 4/14 · E([4+,0−]) + 5/14 · E([3+,2−])]
• G(S, Outlook) = 0.94 − [5/14 · 0.971 + 4/14 · 0.0 + 5/14 · 0.971]
• G(S, Outlook) = 0.246
- 40. • G(S, Temperature) = 0.94 − [4/14 · E(Temperature=hot) + 6/14 · E(Temperature=mild) + 4/14 · E(Temperature=cool)]
• G(S, Temperature) = 0.94 − [4/14 · E([2+,2−]) + 6/14 · E([4+,2−]) + 4/14 · E([3+,1−])]
• G(S, Temperature) = 0.94 − [4/14 · 1.0 + 6/14 · 0.918 + 4/14 · 0.811]
• G(S, Temperature) = 0.029
• G(S, Humidity) = 0.94 − [7/14 · E(Humidity=high) + 7/14 · E(Humidity=normal)]
• G(S, Humidity) = 0.94 − [7/14 · E([3+,4−]) + 7/14 · E([6+,1−])]
• G(S, Humidity) = 0.94 − [7/14 · 0.985 + 7/14 · 0.592]
• G(S, Humidity) = 0.151
- 41. Let Values(Wind) = {Weak, Strong}, S = [9+, 5−], S_Weak = [6+, 2−], S_Strong = [3+, 3−]. Information gain from knowing Wind:
Gain(S, Wind) = Entropy(S) − (8/14) · Entropy(S_Weak) − (6/14) · Entropy(S_Strong)
= 0.94 − (8/14) · 0.811 − (6/14) · 1.00
= 0.048
- 42. Which attribute should be tested at the root? Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029. Outlook provides the best prediction for the target. Let's grow the tree: add to the tree a successor for each possible value of Outlook, and partition the training samples according to the value of Outlook.
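A minimal Python sketch of these computations; the counts are taken from the PlayTennis table above, and the helper functions are illustrative rather than from the slides:

import math

def entropy(pos, neg):
    # H(S) = -p+ log2(p+) - p- log2(p-), with 0 * log2(0) treated as 0
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def gain(total_pos, total_neg, splits):
    # splits: list of (pos, neg) counts, one pair per attribute value
    total = total_pos + total_neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(total_pos, total_neg) - remainder

print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))   # Outlook      -> ~0.246
print(gain(9, 5, [(3, 4), (6, 1)]))           # Humidity     -> ~0.151
print(gain(9, 5, [(6, 2), (3, 3)]))           # Wind         -> ~0.048
print(gain(9, 5, [(2, 2), (4, 2), (3, 1)]))   # Temperature  -> ~0.029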
- 43. Problems on Decision Trees: Consider the following set of training examples. a) What is the entropy of this collection of training examples with respect to the target function classification? b) What is the information gain of a2 relative to these training examples?
- 44. Training and Visualizing a Decision Tree:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

iris = load_iris()
X = iris.data[:, 2:]   # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
Source.from_file("iris_tree.dot")
- 47. ID3: algorithm ID3(X, T, Attrs)
X: training examples; T: target attribute (e.g. PlayTennis); Attrs: other attributes, initially all attributes
1) Create Root node
   If all X's are +, return Root with class +
   If all X's are −, return Root with class −
   If Attrs is empty, return Root with class = most common value of T in X
   else
2) A ← the attribute from Attrs that best* classifies X
   The decision attribute for Root ← A
   For each possible value vi of A:
     - add a new branch below Root, for the test A = vi
     - Xi ← subset of X with A = vi
     - If Xi is empty, add a new leaf with class = the most common value of T in X
       else add the subtree generated by ID3(Xi, T, Attrs − {A})
   End
3) return Root
* The best attribute is the one with the highest information gain.
- 48. Making Predictions: Let's see how the tree represented in Figure 6-1 makes predictions. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower's petal length is smaller than 2.45 cm. If it is, then you move down to the root's left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any children nodes), so it does not ask any questions: you can simply look at the predicted class for that node, and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa). Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root's right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It's really that simple.
- 51. Estimating Class Probabilities: A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node. For example, suppose you have found a flower whose petals are 5 cm long and 1.5 cm wide. The corresponding leaf node is the depth-2 left node, so the Decision Tree should output the following probabilities: 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54). And of course if you ask it to predict the class, it should output Iris-Versicolor (class 1) since it has the highest probability. Let's check this:
>>> tree_clf.predict_proba([[5, 1.5]])
array([[0. , 0.90740741, 0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])
- 52. Attribute selection measures - Gini impurity: By default, the Gini impurity measure is used, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy". In Machine Learning, entropy is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class. So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
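For example, switching the tree built earlier to the entropy criterion is a one-line change (a minimal sketch, reusing the same petal features as on the earlier slide):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, 2:], iris.target   # petal length and width, as before

# same tree as before, but using entropy instead of the default Gini impurity
tree_clf_entropy = DecisionTreeClassifier(max_depth=2, criterion="entropy", random_state=42)
tree_clf_entropy.fit(X, y)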
- 53. The CART Training Algorithm
- 54. What is the CART algorithm in machine learning? The CART algorithm is a classification algorithm that builds a decision tree on the basis of Gini's impurity index. The only difference between ID3 and CART when building a decision tree is the criterion for selecting the optimal attribute: in ID3 the attribute selection criterion is information gain, whereas in CART it is the Gini index. ID3 → Iterative Dichotomiser 3 (uses entropy and information gain for classification problems and standard deviation reduction for regression problems). CART → Classification And Regression Tree (used for both classification and regression). The Gini index ranges from 0 (perfectly pure) to 1 (perfectly impure).
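For reference, the cost that CART minimizes at each split when classifying (the standard form of the CART classification cost function) is:
J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right
where k is the feature and t_k the threshold being considered, G_left/right is the impurity (e.g. Gini) of the left/right subset, and m_left/right is the number of instances in the left/right subset. CART picks the (k, t_k) pair that minimizes this cost and then repeats the process recursively on the resulting subsets.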
- 56. Play Tennis Data set
- 57. Gini index of an attribute = weighted average of the Gini impurities of its attribute values:
Gini index(attribute) = Σ_v (|S_v| / |S|) · Gini(attribute value v)
Gini(attribute value) = 1 − Σ_c [P(c)]²
where P(c) is the proportion of examples of class c among the examples having that attribute value.
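A small sketch of this computation for the Outlook attribute of the PlayTennis data (counts taken from the table above; the function names are just illustrative):

def gini(counts):
    # Gini impurity of one attribute value: 1 minus the sum of squared class proportions
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_index(splits):
    # weighted average of the per-value Gini impurities
    total = sum(sum(counts) for counts in splits)
    return sum(sum(counts) / total * gini(counts) for counts in splits)

# Outlook: Sunny = (2 yes, 3 no), Overcast = (4 yes, 0 no), Rain = (3 yes, 2 no)
print(gini_index([(2, 3), (4, 0), (3, 2)]))   # ~0.343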
- 65. Regression Trees: Decision Trees are also capable of performing regression tasks. Let's build a regression tree using Scikit-Learn's DecisionTreeRegressor class, training it on a noisy quadratic dataset with max_depth=2:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(42)
X_quad = np.random.rand(200, 1) - 0.5                      # a single random input feature
y_quad = X_quad ** 2 + 0.025 * np.random.randn(200, 1)     # quadratic target plus a small amount of noise

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_quad, y_quad)
- 66. This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x1 = 0.2. The root node asks whether x1 <= 0.197. Since it is not, the algorithm goes to the right child node, which asks whether x1 <= 0.772. Since it is, the algorithm goes to the left child node. This is a leaf node, and it predicts value = 0.111. This prediction is the average target value of the 110 training instances associated with this leaf node, and it results in a mean squared error equal to 0.015 over these 110 instances.
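A quick check of that traversal, using the tree_reg model trained on the previous slide:

print(tree_reg.predict([[0.2]]))   # approximately 0.111, the value of the depth-2 left leaf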
- 67. This model’s predictions are represented on the left of Figure 6-5. If you set max_depth=3, you get the predictions represented on the right.
- 68. The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE. Equation 6-4 shows the cost function that the algorithm tries to minimize.
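The cost function referred to here (Equation 6-4) has the standard CART regression form:
J(k, t_k) = (m_left / m) · MSE_left + (m_right / m) · MSE_right
where MSE_node = (1 / m_node) · Σ_{i ∈ node} (ŷ_node − y^(i))² and ŷ_node = (1 / m_node) · Σ_{i ∈ node} y^(i), i.e. each node predicts the average target value of the training instances assigned to it.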
- 71. Outliers are data points that don't fit the pattern of the rest of the data set. The simplest way to detect the outliers in a given data set is to plot a boxplot of the data set: the points that lie outside the whiskers of the boxplot are the outliers in the data set.
- 72. Linear Support vector machines A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. SVMs are particularly well suited for classification of complex but small- or medium-size datasets. A Support Vector Machine (SVM) performs classification by finding the hyperplane that maximizes the margin between the two classes. A hyperplane is a decision boundary that differentiates the two classes in SVM.
- 73. SVM on Iris Data set The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. Notice that adding more training instances “off the street” will not affect the decision boundary at all: it is fully determined (or “supported”) by the instances located on the edge of the street. These instances are called the support vectors (they are circled in Figure 5-1). 1) Linear SVM Classification
- 74. 2) Soft Margin Classification: If we strictly impose that all instances must be off the street and on the correct side, this is called hard margin classification. There are two main issues with hard margin classification: first, it only works if the data is linearly separable, and second, it is quite sensitive to outliers. Figure 5-3 shows the iris dataset with just one additional outlier: on the left, it is impossible to find a hard margin, and on the right the decision boundary ends up very different from the one we saw in Figure 5-1 without the outlier, and it will probably not generalize as well. To avoid these issues it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.
- 75. In Scikit-Learn’s SVM classes, you can control this balance using the C hyperparameter: a smaller C value leads to a wider street but more margin violations. Figure 5-4 shows the decision boundaries and margins of two soft margin SVM classifiers on a nonlinearly separable dataset. On the left, using a low C value the margin is quite large, but many instances end up on the street. On the right, using a high C value the classifier makes fewer margin violations but ends up with a smaller margin. However, it seems likely that the first classifier will generalize better: in fact even on this training set it makes fewer prediction errors, since most of the margin violations are actually on the correct side of the decision boundary.
- 76. from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 2)   # Iris-Virginica

svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)

Then, as usual, you can use the model to make predictions:
>>> X_new = [[5.5, 1.7], [5.0, 1.5]]
>>> svm_clf.predict(X_new)
array([ True, False])
The first plant is classified as an Iris-Virginica, while the second is not.