1. Decision Tree, Naive Bayes, Association Rule Mining, Support Vector Machine, KNN, K-Means Clustering, Random Forest
Presented to
Prof. Vibhakar Mansotra
Dean of Mathematical Science, University of Jammu
Presented by
Akanksha Bali
Research Scholar, Batch 2019, University of Jammu
2. Contents
Decision Tree
Naive Bayes Classifier
Support Vector Machine
Association Rule Mining
Apriori Algorithm
K Nearest Neighbour
K-Means Clustering
Random Forest
3. Decision Trees
A decision tree is a flowchart-like tree structure in which the data is repeatedly split according to a certain parameter.
Each internal node (decision node) denotes a test on an attribute.
Each branch represents an outcome of the test.
There are two main types of decision trees:
Classification trees (yes/no types)
The example above is a classification tree, where the outcome is a categorical variable such as 'fit' or 'unfit'.
Regression trees (continuous data types)
Here the decision or outcome variable is continuous, e.g. a number like 12.
4. Entropy and Information Gain
Entropy
Entropy, also called Shannon entropy and denoted H(S) for a finite set S, is a measure of the amount of uncertainty or randomness in the data.
H(S) = − ∑ p(x) log2 p(x)
Information gain
Information gain, also called Kullback-Leibler divergence and denoted IG(S, A) for a set S, is the effective change in entropy after deciding on a particular attribute A. It measures the relative change in entropy with respect to the independent variables.
IG(S, A) = H(S) − H(S, A)
IG(S, A) = H(S) − ∑ P(x) · H(x)
where IG(S, A) is the information gain from applying feature A, H(S) is the entropy of the entire set, the second term is the weighted entropy after applying feature A, and p(x) is the probability of event x.
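These two formulas are easy to check numerically. Below is a minimal Python sketch (function names and structure are illustrative, not from the slides) that computes H(S) and IG(S, A) from positive/negative counts and reproduces Gain(S, Humidity) = 0.151 for the [9+, 5−] split used on the following slides.

import math

def entropy(pos, neg):
    # Shannon entropy H(S) of a set with `pos` positive and `neg` negative examples
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is treated as 0
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, subsets):
    # IG(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    # `parent` and each subset are (pos, neg) count pairs
    total = sum(p + n for p, n in subsets)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in subsets)
    return entropy(*parent) - weighted

# PlayTennis split on Humidity: S = [9+, 5-], High = [3+, 4-], Normal = [6+, 1-]
print(entropy(9, 5))                               # ~0.940
print(information_gain((9, 5), [(3, 4), (6, 1)]))  # ~0.151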
5. Top-Down Induction of Decision Trees (ID3)
The ID3 algorithm performs the following tasks recursively (a compact code sketch follows the list):
1. Create a root node for the tree.
2. If all examples are positive, return the leaf node 'positive'.
3. Else if all examples are negative, return the leaf node 'negative'.
4. Calculate the entropy of the current state, H(S).
5. For each attribute, calculate the entropy with respect to the attribute 'x', denoted H(S, x).
6. Calculate the information gain IG(S, x).
7. Select the attribute which has the maximum value of IG(S, x).
8. Remove the attribute that offers the highest IG from the set of attributes.
9. Repeat until we run out of attributes, or the decision tree has all leaf nodes.
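A minimal recursive sketch of these steps, assuming the training examples are plain Python dicts and the target column is passed by name (all identifiers are illustrative):

import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    # IG(S, x) = H(S) - weighted entropy of the subsets induced by attribute x
    total = len(examples)
    remainder = 0.0
    for value in set(e[attr] for e in examples):
        subset = [e for e in examples if e[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:          # steps 2-3: pure node -> leaf
        return labels[0]
    if not attributes:                 # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    # steps 4-7: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    rest = [a for a in attributes if a != best]   # step 8
    tree = {best: {}}
    for value in set(e[best] for e in examples):  # step 9: recurse per value
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, rest, target)
    return tree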
7. Selecting the Next Attribute
S = [9+, 5−], E = 0.940
Humidity: High → [3+, 4−] (E = 0.985); Normal → [6+, 1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151
Wind: Weak → [6+, 2−] (E = 0.811); Strong → [3+, 3−] (E = 1.0)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
Humidity provides greater information gain than Wind with respect to the target classification.
8. Selecting the Next Attribute
S = [9+, 5−], E = 0.940
Outlook: Sunny → [2+, 3−] (E = 0.971); Overcast → [4+, 0−] (E = 0.0); Rain → [3+, 2−] (E = 0.971)
Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
9. Selecting the Next Attribute
The information gain values for the four attributes are:
• Gain(S, Outlook) = 0.247
• Gain(S, Humidity) = 0.151
• Gain(S, Wind) = 0.048
• Gain(S, Temperature) = 0.029
where S denotes the collection of training examples.
12. Converting a Tree to Rules
The learned tree: Outlook is the root; the Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch predicts Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes). Reading one rule off each root-to-leaf path gives the following rules (a short code sketch follows them):
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
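These five rules translate directly into code; a minimal sketch as a plain Python function (a hypothetical helper, not from any library):

def play_tennis(outlook, humidity, wind):
    # PlayTennis rules R1-R5 read off the tree above
    if outlook == "Sunny":
        return "No" if humidity == "High" else "Yes"   # R1, R2
    if outlook == "Overcast":
        return "Yes"                                   # R3
    if outlook == "Rain":
        return "No" if wind == "Strong" else "Yes"     # R4, R5
    raise ValueError("unknown outlook")

print(play_tennis("Rain", "High", "Weak"))  # Yes (rule R5)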
14. Avoid Overfitting
Stop growing the tree when a split is not statistically significant, or
grow the full tree, then post-prune.
15. NAÏVE BAYES ALGORITHM
Bayesian classification represents a supervised learning method as well as a statistical method for classification.
It can solve diagnostic and predictive problems.
It is named after Thomas Bayes (1700–61).
It works on the principle of conditional probability as given by Bayes' theorem.
16. Derivation
D : a set of tuples.
Each tuple is an n-dimensional attribute vector X = (x1, x2, x3, ..., xn).
Let there be m classes: C1, C2, C3, ..., Cm.
Maximum a posteriori hypothesis:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)   (Bayes' theorem)
17. Problem Statement
Consider the given data set, apply the Naive Bayes algorithm, and predict which type of fruit it is if the fruit has the following properties:
Fruit = {yellow, sweet, long}

Fruit    Yellow   Sweet   Long   Total
Orange      350     450      0     650
Banana      400     300    350     400
Others       50     100     50     150
Total       800     850    400    1200
18. Problem
Step 1: Compute the prior probability for each class of fruit:
P(C=Orange) = 650/1200 = 0.54
P(C=Banana) = 400/1200 = 0.33
P(C=Others) = 150/1200 = 0.125
Step 2: Compute the probability of each piece of evidence:
P(X1=Long) = 400/1200 = 0.33
P(X2=Sweet) = 850/1200 = 0.708
P(X3=Yellow) = 800/1200 = 0.66
Step 3: Compute the probability of each class given each piece of evidence (read off the table columns):
P(C=Orange|X1=Long) = 0/400 = 0
P(C=Orange|X2=Sweet) = 450/850 = 0.52
P(C=Orange|X3=Yellow) = 350/800 = 0.43
P(C=Banana|X1=Long) = 350/400 = 0.875
P(C=Banana|X2=Sweet) = 300/850 = 0.35
P(C=Banana|X3=Yellow) = 400/800 = 0.5
P(C=Others|X1=Long) = 50/400 = 0.125
P(C=Others|X2=Sweet) = 100/850 = 0.117
P(C=Others|X3=Yellow) = 50/800 = 0.0625
19. Problem
Step 5: Convert these into likelihoods P(feature|class) for Orange using Bayes' theorem:
P(Yellow|Orange) = P(Orange|Yellow) · P(Yellow) / P(Orange) = (0.43 · 0.66) / 0.54 = 0.53
P(Sweet|Orange) = 0.69
P(Long|Orange) = 0
Step 6: Multiply the likelihoods for each class:
P(fruit|Orange) = 0.53 · 0.69 · 0 = 0
In the same way, P(fruit|Banana) = 1 · 0.75 · 0.875 = 0.65
P(fruit|Others) = 0.33 · 0.66 · 0.33 = 0.072
Step 7: Prediction: the type of fruit is Banana.
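A short Python sketch reproducing this arithmetic directly from the counts in the table on slide 17. Note that the slides compare the products of the likelihoods only; a full Naive Bayes score would also multiply in the prior P(class), which does not change the winner here.

# counts[class] = (yellow, sweet, long, total), taken from the table on slide 17
counts = {
    "Orange": (350, 450, 0, 650),
    "Banana": (400, 300, 350, 400),
    "Others": (50, 100, 50, 150),
}

scores = {}
for fruit, (yellow, sweet, long_, total) in counts.items():
    # likelihoods P(feature | class), e.g. P(Yellow|Banana) = 400/400 = 1.0
    scores[fruit] = (yellow / total) * (sweet / total) * (long_ / total)

print(scores)                       # Orange: 0.0, Banana: ~0.66, Others: ~0.07
print(max(scores, key=scores.get))  # Banana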
20. Association Rule Mining
Association rule learning is a rule-based machine learning method for discovering interesting relations between variables.
Using association rule learning, a supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
22. Association Rule Mining
Important concepts of association rule mining:
The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {butter, bread} ⇒ {milk} has a confidence of supp({butter, bread, milk}) / supp({butter, bread}) = 0.2/0.2 = 1 in the database, which means that the rule is correct for 100% of the transactions containing butter and bread (100% of the time a customer buys butter and bread, milk is bought as well).
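A minimal sketch of these two definitions in Python. The five transactions below are illustrative only (the original example database is not reproduced in this deck); they are chosen to be consistent with the quoted values supp({milk, bread, butter}) = 0.2 and conf({butter, bread} ⇒ {milk}) = 1.

def support(transactions, itemset):
    # supp(X): fraction of transactions that contain every item in X
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # conf(X => Y) = supp(X u Y) / supp(X)
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

transactions = [          # illustrative five-transaction database
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]
print(support(transactions, {"milk", "bread", "butter"}))       # 0.2
print(confidence(transactions, {"butter", "bread"}, {"milk"}))  # 1.0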
23. APRIORI ALGORITHM
The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the
database to accumulate the count for each item, and collecting
those items that satisfy minimum support.
The resulting set is denoted L1.
Next, L1 is used to find L2, the set of frequent 2-itemsets, which
is used to find L3, and so on, until no more frequent k-itemsets
can be found.
The finding of each Lk requires one full scan of the database.
24. Problem Statement
For the given transaction dataset, generate rules using the Apriori algorithm. Take minimum support = 50% and minimum confidence = 50%.

Transaction ID   Items Purchased
I1               A, B, C
I2               A, C
I3               A, D
I4               B, E, F
25. Problem Statement
Step 1: Create the table of 1-item frequent itemsets and calculate their support.

Items   Frequency   Support
{A}     3           3/4 = 75%
{B}     2           2/4 = 50%
{C}     2           2/4 = 50%
{D}     1           1/4 = 25%
{E}     1           1/4 = 25%
{F}     1           1/4 = 25%
26. Problem Statement
Step 2: Keep the rows whose support is equal to or greater than 50%.

Items   Frequency   Support
{A}     3           3/4 = 75%
{B}     2           2/4 = 50%
{C}     2           2/4 = 50%
27. Problem Statement
Step 3: Create the table of 2-item frequent itemsets and calculate their frequency and support.

Items    Frequency   Support
{A, B}   1           1/4 = 25%
{A, C}   2           2/4 = 50%
{B, C}   1           1/4 = 25%
28. Problem Statement
Step 4: Keep the rows whose support is equal to or greater than 50%, then formulate the final rules and calculate their confidence.

Items    Frequency   Support
{A, C}   2           2/4 = 50%

Association rule   Support   Confidence   Confidence %
A -> C             2         2/3 = 0.66   66%
C -> A             2         2/2 = 1      100%
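A compact level-wise sketch of this whole example in plain Python (itertools only, no external library), which reproduces L1 = {A}, {B}, {C}, L2 = {A, C} and the two rules above:

from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# level-wise search: frequent 1-itemsets, then 2-itemsets, and so on
items = sorted(set().union(*transactions))
frequent = {}
k, candidates = 1, [frozenset([i]) for i in items]
while candidates:
    level = {c: support(c) for c in candidates if support(c) >= min_support}
    frequent.update(level)
    # candidate (k+1)-itemsets are built only from items in the frequent k-itemsets
    keep = sorted(set().union(*level)) if level else []
    k += 1
    candidates = [frozenset(c) for c in combinations(keep, k)]

print(frequent)  # {A}: 0.75, {B}: 0.5, {C}: 0.5, {A, C}: 0.5

# rules from the frequent 2-itemset {A, C}
for lhs, rhs in [({"A"}, {"C"}), ({"C"}, {"A"})]:
    conf = support(lhs | rhs) / support(lhs)
    print(lhs, "->", rhs, f"confidence = {conf:.2f}")  # 0.67 and 1.00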
29. SUPPORT VECTOR MACHINE
A Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression challenges.
It is mostly used in classification problems.
We perform classification by finding the hyperplane that differentiates the two classes as well as possible.
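A minimal classification sketch using scikit-learn's SVC with a linear kernel (assuming scikit-learn and NumPy are installed); the 2-D toy data is illustrative, not from the slides.

import numpy as np
from sklearn.svm import SVC

# two illustrative, linearly separable classes in 2-D
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# a linear kernel looks for the separating hyperplane with the widest margin
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)           # the training points that define the margin
print(clf.predict([[3, 4], [7, 7]]))  # [0 1]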
32. Support Vector Machine
Pros:
It works really well when there is a clear margin of separation.
It is effective in high-dimensional spaces.
It is effective in cases where the number of dimensions is greater than the number of samples.
It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Cons:
It doesn't perform well on large data sets, because the required training time is higher.
It also doesn't perform very well when the data set has more noise, i.e. the target classes are overlapping.
34. K Nearest Neighbour
• K-Nearest Neighbors is one of the most basic yet
essential classification algorithms in Machine Learning.
It belongs to the supervised learning domain and finds
intense application in pattern recognition, data mining
and intrusion detection.
• It was first described in the early 1950s.
• Gained popularity, when increased computing power
became available.
• Used widely in area of pattern recognition and
statistical estimation.
34
35. Closeness
The Euclidean distance between two points or tuples, say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
dist(X1, X2) = √( ∑ (x1i − x2i)² ), where the sum runs over i = 1, ..., n.
37. Example
• We have data from a questionnaire survey and objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good

Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Guess the classification of this new tissue.
38. Step 1: Initialize and define k.
Let's say k = 3.
(Always choose k as an odd number if the number of classes is even, to avoid a tie in the class prediction.)
Step 2: Compute the distance between the input sample and the training samples.
- The coordinates of the input sample are (3, 7).
- Instead of the Euclidean distance, we calculate the squared Euclidean distance.

X1   X2   Squared Euclidean distance
7    7    (7−3)² + (7−7)² = 16
7    4    (7−3)² + (4−7)² = 25
3    4    (3−3)² + (4−7)² = 9
1    4    (1−3)² + (4−7)² = 13
39. Step 3: Sort the distances and determine the nearest neighbours based on the k-th minimum distance.

X1   X2   Squared Euclidean distance   Rank (minimum distance)   In the 3 nearest neighbours?
7    7    16                           3                         Yes
7    4    25                           4                         No
3    4    9                            1                         Yes
1    4    13                           2                         Yes
(X1 = acid durability in seconds, X2 = strength in kg/square meter)
40. Step 4: Take the 3 nearest neighbours and gather the category Y of each.

X1   X2   Squared Euclidean distance   Rank (minimum distance)   In the 3 nearest neighbours?   Y = Category of the nearest neighbour
7    7    16                           3                         Yes                            Bad
7    4    25                           4                         No                             -
3    4    9                            1                         Yes                            Good
1    4    13                           2                         Yes                            Good
41. Step 5: Apply a simple majority vote.
Use the simple majority of the categories of the nearest neighbours as the prediction for the query instance.
We have 2 'Good' and 1 'Bad'. Thus we conclude that the new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7 falls in the 'Good' category. (A short code sketch of the whole procedure follows.)
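The whole worked example (steps 2 to 5) fits in a few lines of Python; a minimal sketch:

from collections import Counter

# training samples: (acid durability, strength, class)
train = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
query = (3, 7)
k = 3

# step 2: squared Euclidean distance to the query point
distances = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label)
             for x1, x2, label in train]

# steps 3-4: sort by distance and keep the k nearest neighbours
neighbours = sorted(distances)[:k]
print(neighbours)  # [(9, 'Good'), (13, 'Good'), (16, 'Bad')]

# step 5: simple majority vote
prediction = Counter(label for _, label in neighbours).most_common(1)[0][0]
print(prediction)  # Good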
42. K-Means Clustering
Contents
Introduction
Algorithm
Example
Application
43. K-Means Clustering Algorithm
Clustering: the process of grouping a set of objects into classes of similar objects.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
It is the commonest form of unsupervised learning.
Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given.
In principle, the optimal partition is achieved by minimising the sum of squared distances from each point to the 'representative object' (centroid) of its cluster, e.g. using the Euclidean distance:
d(x, m_k) = ∑ (x_n − m_k)², summing over the N points x_n assigned to centroid m_k.
45. A simple example showing the implementation of the k-means algorithm (using k = 2).
46. Step 1:
Initialization: we randomly choose the following two centroids (k = 2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
47. Step 2:
Thus, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
Their new centroids are m1 = (1.83, 2.33) and m2 = (4.12, 5.38).
48. Step 3:
Now, using these centroids, we compute the Euclidean distance of each object to each centroid, as shown in the table.
The new clusters are therefore {1, 2} and {3, 4, 5, 6, 7}.
The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1).
49. Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
Therefore, there is no change in the clusters.
Thus the algorithm comes to a halt here, and the final result consists of the 2 clusters {1, 2} and {3, 4, 5, 6, 7}.
50. Example
Consider the following data set, consisting of the scores of two variables on each of seven individuals:

Subject   A     B
1         1.0   1.0
2         1.5   2.0
3         3.0   4.0
4         5.0   7.0
5         3.5   5.0
6         4.5   5.0
7         3.5   4.5
51. Example
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial cluster means, giving:

          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)
52. Example
The remaining individuals are now examined in sequence and allocated to the cluster to which they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new member is added.

        Cluster 1                                Cluster 2
Step    Individuals   Mean Vector (centroid)     Individuals   Mean Vector (centroid)
1       1             (1.0, 1.0)                 4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)                 4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                 4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                 4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                 4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                 4, 5, 6, 7    (4.1, 5.4)
53. Example
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

            Individuals   Mean Vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)
54. Example
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each individual's distance to its own cluster mean and to that of the opposite cluster.

Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
55. Example
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2, distance 1.8) than to its own (Cluster 1, distance 2.1), so it is relocated to Cluster 2, giving the new partition:

            Individuals     Mean Vector (centroid)
Cluster 1   1, 2            (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7   (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations occur. However, in this example each individual is now nearer its own cluster mean than that of the other cluster, and the iteration stops, with the latest partitioning as the final cluster solution.
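A short NumPy sketch of the standard batch k-means procedure on the same seven points, initialised with individuals 1 and 4 as above; on this data it converges to the same final clusters {1, 2} and {3, 4, 5, 6, 7}.

import numpy as np

# the seven individuals from slide 50 (columns A and B)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# initial centroids: the two individuals furthest apart (1 and 4)
centroids = X[[0, 3]].copy()

for _ in range(100):
    # assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each centroid becomes the mean of its cluster
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(labels)     # [0 0 1 1 1 1 1] -> clusters {1, 2} and {3, 4, 5, 6, 7}
print(centroids)  # approximately [[1.25, 1.5], [3.9, 5.1]]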
56. Applications
Clustering helps marketers improve their customer base and work on target areas. It helps group people (according to different criteria such as willingness, purchasing power, etc.) based on their similarity in many ways related to the product under consideration.
Clustering helps in the identification of groups of houses on the basis of their value, type and geographical location.
Clustering is used to study earthquakes. Based on the areas hit by an earthquake in a region, clustering can help analyse the next probable location where an earthquake can occur.
57. Random Forest
Contents
Random Forest Introduction
Pseudocode
Prediction Pseudocode
Example
Random Forest vs Decision Tree
Advantages
Disadvantages
Application
58. Random Forest
The random forest algorithm is a supervised classification and regression algorithm.
It randomly creates a forest with several trees.
59. Random Forest Pseudocode
1. Randomly select "k" features from the total "m" features, where k << m.
2. Among the "k" features, calculate the node "d" using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until "l" nodes have been reached.
5. Build the forest by repeating steps 1 to 4 "n" times to create "n" trees.
60. Prediction Pseudocode
To make predictions with the trained random forest, use the pseudocode below:
Take the test features and use the rules of each randomly created decision tree to predict the outcome, and store the predicted outcome (target).
Calculate the votes for each predicted target.
Take the predicted target with the most votes as the final prediction from the random forest algorithm.
61. Example
Day Outlook Humidity Wind Play
D1 Sunny High Weak Yes
D2 Sunny High Strong No
D3 Overcast High Weak Yes
D4 Rain High Weak Yes
D5 Rain Normal Weak Yes
D6 Rain Normal Strong No
D7 Overcast Normal Strong Yes
D8 Sunny High Weak No
D9 Sunny Normal Weak Yes
D10 Rain Normal Weak Yes
D11 Sunny Normal Strong Yes
D12 Overcast High Strong Yes
D13 Overcast Normal Weak Yes
D14 Rain High Strong No
62. Example
Will the game happen if the weather conditions are Outlook = Rain, Humidity = High, Wind = Weak? Play = ?
Step 1: Divide the data into smaller subsets.
Step 2: The subsets need not be distinct; some subsets may overlap.
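A sketch of the same prediction with scikit-learn's RandomForestClassifier (assuming scikit-learn and pandas are available); the categorical attributes are one-hot encoded, and the table is the one from slide 61.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# the 14-day weather data from slide 61
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Play":     ["Yes", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

X = pd.get_dummies(data[["Outlook", "Humidity", "Wind"]]).astype(int)
y = data["Play"]

# each of the 100 trees is grown on a bootstrap sample with random feature subsets
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# the query from slide 62: Outlook=Rain, Humidity=High, Wind=Weak
query = pd.get_dummies(pd.DataFrame(
    [{"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}]))
query = query.reindex(columns=X.columns, fill_value=0).astype(int)

print(forest.predict(query))        # majority vote of the individual trees
print(forest.feature_importances_)  # relative feature importance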
64. Advantages
Random forest is considered a highly accurate and robust method.
It is far less prone to overfitting than a single decision tree.
The algorithm can be used in both classification and regression problems.
Random forests can also handle missing values.
You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.
65. Disadvantages
Computing a large number of trees can take longer than expected.
The model is difficult to interpret compared to a decision tree.
66. Random Forest vs Decision Trees
A random forest is a set of multiple decision trees.
Deep decision trees may suffer from overfitting, but a random forest prevents overfitting by building trees on random subsets of the data and features.
Decision trees are computationally faster.
A random forest is difficult to interpret, while a decision tree is easily interpretable and can be converted to rules.