Supervised Learning Decision Trees Explained

Introduction to
Supervised Learning
In supervised learning, you train your model on a labelled dataset that means we have both raw
input data as well as its results. We split our data into a training dataset and test dataset where the
training dataset is used to train our network whereas the test dataset acts as new data for
predicting results or to see the accuracy of our model.
Hence, in supervised learning, our model learns from seen results the same as a teacher teaches
his students because the teacher already knows the results. Accuracy is what we achieve in
supervised learning as model perfection is usually high.

The model performs fast because the training time taken is less as we already have desired results
in our dataset. This model predicts accurate results on unseen data or new data without even
knowing a prior target. In some of the supervised learning models, we revert back the output
result to learn more in order to achieve the highest possible accuracy.
The algorithm learns the input patterns that generate the expected output and now once the
algorithm is trained it can be used to predict the correct output of a never seen input.

In this image above you can see that we are feeding raw inputs as an image of apple to the
algorithm as a part of the algorithm we have a supervisor who keeps on correcting the machine or
who keeps on training the machines or keeps on telling him that yes it is an apple or no it is not an
apple, things like that.
So this process keeps on repeating until we get a final trained model, once the model is ready it
can easily predict the correct output of a never seen input.

Applications of Supervised Learning
•Sentiment Analysis: It is a natural language processing technique in which we analyze and categorize some
meaning out of the given text data. For example, if we are analyzing tweets of people and want to predict whether
a tweet is a query, complaint, suggestion, opinion or news, we will simply use sentiment analysis.
•Recommendations: Every e-Commerce site or media, all of them use the recommendation system to recommend
their products and new releases to their customers or users on the basis of their activities. Netflix, Amazon,
Youtube, Flipkart are earning huge profits with the help of their recommendation system.
•Spam Filtration: Detecting spam emails is indeed a very helpful tool, this filtration techniques can easily detect
any sort of virus, malware or even harmful URLs. In recent studies, it was found that about 56.87 per cent of all
emails revolving around the internet were spam in March 2017 which was a major drop from April 2014's 71.1
percent spam share.

Some algorithms for supervised learning
1.Linear Regression
2.Random Forest
3.Support Vector Machines (SVM)

Decision Tree Learning
Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is and what
the corresponding output is in the training data) where the data is continuously split according to a certain
parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the
decisions or the final outcomes. And the decision nodes are where the data is split.
•A decision tree simply asks a question, and based on the answer (Yes/No), it further split
the tree into subtrees.
•Below diagram explains the general structure of a decision tree:

Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision tree:
•Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
•The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further
after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are
called the child nodes.

How does the Decision Tree
algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from
the root node of the tree. This algorithm compares the values of root attribute with the
record (real dataset) attribute and, based on the comparison, follows the branch and
jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-
nodes and move further. It continues the process until it reaches the leaf node of the
tree. The complete process can be better understood using the below algorithm:

•Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
•Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
•Step-3: Divide the S into subsets that contains possible values for the best attributes.
•Step-4: Generate the decision tree node, which contains the best attribute.
•Step-5: Recursively make new decision trees using the subsets of the dataset created in
step -3. Continue this process until a stage is reached where you cannot further classify
the nodes and called the final node as a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or Not.

Attribute Selection Measures
While implementing a Decision tree, the main issue arises that how to
select the best attribute for the root node and for sub-nodes. So, to solve such problems
there is a technique which is called as Attribute selection measure or ASM. By this
measurement, we can easily select the best attribute for the nodes of the tree. There are
two popular techniques for ASM, which are:
•Information Gain
•Gini Index

1. Information Gain:
•Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
•It calculates how much information a feature provides us about a class.
•According to the value of information gain, we split the node and build the decision tree.
•A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:
Information Gain= Entropy(S) - [(Weighted Avg) *Entropy(each feature)

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:
Entropy(s)= -P(yes)log2 P(yes)- P(no) log2 P(no)
Where,
◦ S= Total number of samples
◦ P(yes)= probability of yes
◦ P(no)= probability of no

2. Gini Index:
•Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
•An attribute with the low Gini index should be preferred as compared to the high Gini
index.
•It only creates binary splits, and the CART algorithm uses the Gini index to create binary
splits.
•Gini index can be calculated using the below formula:
Gini Index= 1- ∑jPj
2

Advantages of the Decision Tree
•It is simple to understand as it follows the same process which a human follow while
making any decision in real-life.
•It can be very useful for solving decision-related problems.
•It helps to think about all the possible outcomes for a problem.
•There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree
•The decision tree contains lots of layers, which makes it complex.
•It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
•For more class labels, the computational complexity of the decision tree may increase.

Issues in Decision Tree LEARNING
How to avoid overfitting the Decision tree model
Overfitting is one of the major problem for every model in machine learning. If model is
overfitted it will poorly generalized to new samples. To avoid decision tree from
overfitting we remove the branches that make use of features having low
importance. This method is called as Pruning or post-pruning. This way we will
reduce the complexity of tree, and hence improves predictive accuracy by the reduction
of overfitting.

Pruning should reduce the size of a learning tree without reducing predictive accuracy as
measured by a cross-validation set. There are 2 major Pruning techniques.
•Minimum Error: The tree is pruned back to the point where the cross-validated error is a
minimum.
•Smallest Tree: The tree is pruned back slightly further than the minimum error.
Technically the pruning creates a decision tree with cross-validation error within 1
standard error of the minimum error.

Early Stop or Pre-pruning
An alternative method to prevent overfitting is to try and stop the tree-building process
early, before it produces leaves with very small samples. This heuristic is known as early
stopping but is also sometimes known as pre-pruning decision trees.
At each stage of splitting the tree, we check the cross-validation error. If the error does
not decrease significantly enough then we stop. Early stopping may underfit by stopping
too early. The current split may be of little benefit, but having made it, subsequent splits
more significantly reduce the error.

Early stopping and pruning can be used together, separately, or not at all. Post pruning
decision trees is more mathematically rigorous, finding a tree at least as good as early
stopping. Early stopping is a quick fix heuristic. If used together with pruning, early
stopping may save time. After all, why build a tree only to prune it back again?

Decision Trees: ID3 Algorithm
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm
iteratively (repeatedly) dichotomizes(divides) features into two or more groups at each
step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision
tree.
In simple words, the top-down approach means that we start building the tree from the
top and the greedy approach means that at each iteration we select the best feature at
the present moment to create a node.

What are the characteristics of ID3
algorithm?
1.ID3 uses a greedy approach that's why it does not guarantee an optimal solution; it can
get stuck in local optimums.
2.ID3 can overfit to the training data (to avoid overfitting, smaller decision trees should be
preferred over larger ones).
3.This algorithm usually produces small trees, but it does not always produce the
smallest possible tree.
4.ID3 is harder to use on continuous data (if the values of any given attribute is
continuous, then there are many more places to split the data on this attribute, and
searching for the best value to split by can be time consuming).

What are the steps in ID3
algorithm?
The steps in ID3 algorithm are as follows:
1.Calculate entropy for dataset.
2.For each attribute/feature.
2.1. Calculate entropy for all its categorical values.
2.2. Calculate information gain for the feature.
3.Find the feature with maximum information gain.
4.Repeat it until we get the desired tree.

Here,dataset is of binary classes(yes and no), where 9 out of 14 are "yes" and 5 out of
14 are "no".
Complete entropy of dataset is:

Here, the attribute with maximum information gain
is Outlook. So, the decision tree built so far

Here, when Outlook == overcast, it is of pure class(Yes).
Now, we have to repeat same procedure for the data with rows consist of Outlook value
as Sunny and then for Outlook value as Rain.
Now, finding the best attribute for splitting the data with Outlook=Sunny values{ Dataset
rows = [1, 2, 8, 9, 11]}.

Here, the attribute with maximum information gain is Humidity. So,
the decision tree built so far -

Here, when Outlook = Sunny and Humidity = High, it is a pure class of category "no". And
When Outlook = Sunny and Humidity = Normal, it is again a pure class of category "yes".
Therefore, we don't need to do further calculations.
Now, finding the best attribute for splitting the data with Outlook=Sunny values{ Dataset
rows = [4, 5, 6, 10, 14]}

Here, the attribute with maximum information gain is Wind. So,
the decision tree built so far -

Here, when Outlook = Rain and Wind = Strong, it is a pure class of category "no". And
When Outlook = Rain and Wind = Weak, it is again a pure class of category "yes".
And this is our final desired tree for the given dataset.

Instance-based learning
• Instance-based learning is a family of learning algorithms
that, instead of performing explicit generalization,
compares new problem instances with instances seen in
training, which have been stored in memory.
•It is also known as memory-based learning or lazy-learning

For example,
If we were to create a spam filter with an instance-based
learning algorithm, instead of just flagging emails
that are already marked as spam emails, our
spam filter would be programmed to also flag
emails that are very similar to them.
This requires a measure of resemblance between two
emails. A similarity measure between two emails could be
the same sender or the repetitive use of the same
keywords or something else.

Advantages:
1.Instead of estimating for the entire instance set,
local approximations can be made to the target
function.
2.This algorithm can adapt to new data easily,
one which is collected as we go .

Disadvantages:
1.Classification costs are high
2.Large amount of memory required to store the
data, and each query involves starting the
identification of a local model from scratch.

K-Nearest Neighbor(KNN) Algorithm
•K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
•K-NN algorithm assumes the similarity between the new case/data and available cases
and put the new case into the category that is most similar to the available categories.
•K-NN algorithm stores all the available data and classifies a new data point based on the
similarity. This means when new data appears then it can be easily classified into a well
suite category by using K- NN algorithm.
•K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.

•K-NN is a non-parametric algorithm, which means it does not make any assumption
on underlying data.
•It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs an
action on the dataset.
•KNN algorithm at the training phase just stores the dataset and when it gets new data,
then it classifies that data into a category that is much similar to the new data.

Example:
•Suppose, we have an image of a creature that looks similar to cat and dog, but we want
to know either it is a cat or dog. So for this identification, we can use the KNN algorithm,
as it works on a similarity measure. Our KNN model will find the similar features of the
new data set to the cats and dogs images and based on the most similar features it will
put it in either cat or dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1, so this data point will lie in which of these categories. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular dataset. Consider the below diagram:

How to select the value of K in the K-NN
Algorithm?
•There is no particular way to determine the best value
for "K", so we need to try some values to find the best
out of them. The most preferred value for K is 5.
•A very low value for K such as K=1 or K=2, can be
noisy and lead to the effects of outliers in the model.
•Large values for K are good, but it may find some
difficulties.

How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
•Step-1: Select the number K of the neighbors
•Step-2: Calculate the Euclidean distance of K number of neighbors
•Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
•Step-4: Among these k neighbors, count the number of the data points in each category.
•Step-5: Assign the new data points to that category for which the number of the neighbor
is maximum.
•Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

•Firstly, we will choose the number of neighbors, so we will choose the k=5.
•Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:

By calculating the Euclidean distance we got the nearest neighbors, as three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:
•As we can see the 3 nearest neighbors are from category
A, hence this new data point must belong to category A.

Advantages of KNN Algorithm:
•It is simple to implement.
•It is robust to the noisy training data
•It can be more effective if the training
data is large.

Disadvantages of KNN Algorithm:
•Always needs to determine the value of K
which may be complex some time.
•The computation cost is high because of
calculating the distance between the data
points for all the training samples.

Support Vector Machine Algorithm
The goal of the SVM algorithm is to create the best line or
decision boundary that can segregate n-dimensional space into
classes so that we can easily put the new data point in the correct
category in the future.
This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in
creating the hyperplane. These extreme cases are called as
support vectors, and hence algorithm is termed as Support
Vector Machine.

Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:

Example:
•Suppose we see a strange cat that also has some features of dogs,
•so if we want a model that can accurately identify whether it is a cat or dog, so such a
model can be created by using the SVM algorithm.
•We will first train our model with lots of images of cats and dogs so that it can learn
about different features of cats and dogs,
• and then we test it with this strange creature.
•So as support vector creates a decision boundary between these two data (cat and dog)
and choose extreme cases (support vectors), it will see the extreme case of cat and dog.

On the basis of the support vectors, it will classify it as a cat. Consider the below
diagram:
SVM algorithm can be used for Face detection, image
classification, text categorization, etc.

Hyperplane and Support Vectors in the SVM
algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find out the best decision
boundary that helps to classify the data points. This best boundary is known as
the hyperplane of SVM.
The dimensions of the hyperplane depend on the features present in the
dataset, which means if there are 2 features (as shown in image), then
hyperplane will be a straight line. And if there are 3 features, then hyperplane
will be a 2-dimension plane.
We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position
of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane,
hence called a Support vector.

Types of SVM
SVM can be of two types:
•Linear SVM: Linear SVM is used for linearly separable data, which
means if a dataset can be classified into two classes by using a single
straight line, then such data is termed as linearly separable data, and
classifier is used called as Linear SVM classifier.
•Non-linear SVM: Non-Linear SVM is used for non-linearly separated
data, which means if a dataset cannot be classified by using a straight
line, then such data is termed as non-linear data and classifier used is
called as Non-linear SVM classifier.

How does SVM works?
Linear SVM:The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates
in either green or blue. Consider the below image

So as it is 2-d space so by just using a straight line, we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the
below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called as a hyperplane. SVM algorithm finds the closest point of
the lines from both the classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called as margin. And the goal of SVM is to
maximize this margin. The hyperplane with maximum margin is called the optimal
hyperplane.

Non-Linear SVM:If data is linearly arranged, then we can separate it by using a straight
line, but for non-linear data, we cannot draw a single straight line. Consider the below
image:

So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z=x2 +y2
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we
convert it in 2d space with z=1, then it will become as:
Hence we get a circumference of radius 1 in case of non-
linear data.

Artificial Neural Networks
Introduction:
The term "Artificial neural network" refers to a biologically inspired sub-field of artificial
intelligence modeled after the brain. Similar to a human brain has neurons
interconnected to each other, artificial neural networks also have neurons that are linked
to each other in various layers of the networks. These neurons are known as nodes.

What is Artificial Neural Network?
The above figure illustrates the
typical
diagram of Biological Neural
Network.
The typical Artificial Neural Network looks
something like the above figure.

Relationship between Biological neural network and artificial
neural network:
Biological Neural Network Artificial Neural Network
Dendrites Inputs
Cell nucleus Nodes
Synapse Weights
Axon Output

The architecture of an artificial neural
network:
Artificial Neural Network mimic the behavior of human brain to solve the complex data
driven problem.
Artificial Neural Network primarily consists of three layers:

Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer presents in-between input and output layers. It performs all the
calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.

https://www.youtube.com/watch?v=aircAruvnKk
https://www.youtube.com/watch?v=vpOLiDyhNUA

The artificial neural network takes input and computes the weighted sum of the inputs
and includes a bias. This computation is represented in the form of a transfer function.
It determines weighted total is passed as an input to an activation function to produce the
output. Activation functions choose whether a node should fire or not. Only those who are fired
make it to the output layer. There are distinctive activation functions available that can be
applied upon the sort of task we are performing.

What is perceptron?
A perceptron consists of four parts: input values, weights and a bias, a weighted sum,
and activation function.
Assume we have a single neuron and three inputs x1, x2, x3 multiplied by the
weights w1, w2, w3 respectively as shown below,

The idea is simple, given the numerical value of the inputs and the weights, there is a
function, inside the neuron, that will produce an output. The question now is, what is this
function?
One function may look like
This function is called the weighted sum because it is the sum of the weights and inputs.
This looks like a good function, but what if we wanted the outputs to fall into a certain
range say
0 to 1.

We can do this by using something known as an activation function. An activation
function is a function that converts the input given (the input, in this case, would be the
weighted sum) into a certain output based on a set of rules.

There are different kinds of activation functions that exist, for example:
1.Hyperbolic Tangent: used to output a number from -1 to 1.
2.Logistic Function: used to output a number from 0 to 1.
3.The last thing we are missing is the bias. The bias is a threshold the perceptron must
reach before the output is produced. So the final neuron equation looks like:

Typically the bias is represented near
the inputs),
Notice that the activation function takes in the
weighted sum plus the bias as inputs to create a single
output. Using the Logistical Function this output will be
between 0 and 1.

Why are perceptron's used?
Perceptrons are the building blocks of neural networks. It is typically used for supervised
learning of binary classifiers. This is best explained through an example.
Let’s take a simple perceptron. In this perceptron we have an input x and y, which is
multiplied with the weights wx and wy respectively, it also contains a bias.

Let’s also create a graph with two different categories of data represented with red and
blue dots.

Notice that the x-axis is labeled after the input x and the y-axis
is labeled after the input y.
Suppose our goal was to separates this data so that there
is a distinction between the blue dots and the red dots.
How can we use the perceptron to do this?
A perceptron can create a decision boundary for a binary
classification, where a decision boundary is regions of space
on a graph that separates different data points.

Let’s play with the function to better understand this. We can say,
wx = -0.5
wy = 0.5
and b = 0
Then the function for the perceptron will look like,
0.5x + 0.5y = 0
and the graph will look like,

Let’s suppose that the activation function, in this case, is a
simple step function that outputs either 0 or 1. The perceptron
function will then label the blue dots as 1 and the red dots as 0.
In other words,
if 0.5x + 0.5y => 0, then 1
if 0.5x + 0.5y < 0, then 0.
Therefore, the function 0.5x + 0.5y = 0 creates a decision
boundary that separates the red and blue points.

Overall, we see that a perceptron can do basic
classification using a decision boundary.

Supervised Learning Decision Trees Explained

Supervised Learning Decision Trees Explained

Recommended

Recommended

More Related Content

Similar to Supervised Learning Decision Trees Explained

Similar to Supervised Learning Decision Trees Explained (20)

Recently uploaded

Recently uploaded (20)

Supervised Learning Decision Trees Explained