Topic 7
Rule-Based Machine Learning, Clustering,
and Association Rules
Dr. Sunu Wibirama
Lecture Module for the course Kecerdasan Buatan (Artificial Intelligence)
Course code: UGMx 001001132012
June 28, 2022
1 Course Learning Outcomes
This topic addresses CPMK 5 (course learning outcome 5): the ability to define several
classical machine learning techniques (linear regression, rule-based machine learning,
probabilistic machine learning, clustering) as well as the basic concepts of deep learning
and its implementation in image recognition (convolutional neural networks).
The indicators that this outcome has been achieved are: understanding the basic concept
of decision trees, being able to compute entropy and information gain, and understanding
the concepts of clustering and association rules.
2 Scope of the Material
The material in this topic covers the following:
a) Introduction to Decision Tree: this part discusses the concept of generalization in
machine learning. A fundamental task in machine learning is to derive rules from a
collection of data. These rules are then used to predict the class or category of data
that the system has never seen before. In some cases, however, there is more than one
way to perform the classification, and the tree generated from the rules can take several
possible forms. This is where information theory comes in, to determine the role and
position of each attribute in building the tree.
b) Entropy and Information Gain: this part discusses information theory as a way to
measure the degree of disorder in a dataset. The larger the entropy of an attribute, the
smaller its contribution to separating the classes or categories; the smaller the entropy,
the larger its contribution to separating the classes or categories of the data at hand.
The attribute that gives the largest reduction in entropy, i.e., the highest information
gain, is the attribute placed at the root node of the decision tree.
c) Clustering: this part discusses one of the best-known unsupervised learning techniques,
clustering. Clustering is commonly used to find patterns or shared characteristics in
data. With clustering, data can be divided into several categories, and clustering is
often the first step toward labeling the data.
d) Association Rules: this unsupervised learning technique is frequently used to give
recommendations to e-commerce customers based on how often a product appears in
transactions, or how often two products are bought together. This part discusses three
metrics commonly used in association rules: support, confidence, and lift.
Sunu Wibirama
sunu@ugm.ac.id
Department of Electrical and Information Engineering
Faculty of Engineering
Universitas Gadjah Mada
INDONESIA
Introduction to Decision Tree (Part 01)
Kecerdasan Buatan | Artificial Intelligence
Version: January 2022
Review: Key concepts of AI
• Machines learn from experience…
• Through examples, analogy or discovery
• Adapting…
• Changes in response to interaction
• Generalizing…
• To use experience to form a response to
novel situations (i.e., unseen data)
• Machine learning is the branch of Artificial
Intelligence concerned with building
systems that generalize from examples
Can a machine learn?
• From a limited set of examples, you
should be able to generalize.
• Humans can easily generalize if
the data are simple enough.
• If the data are complicated, human
ability is limited and prone to error.
• That is when we get a machine to do the task.
What rules can you extract to predict the outcome?
J.D. Kelleher, B.M. Namee, A. D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics, MIT Press, 2015
Credit scoring dataset
What rules can you extract to predict the outcome?
J.D. Kelleher, B.M. Namee, A. D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics, MIT Press, 2015
Can you extract the rules manually?
More complicated credit scoring dataset
What rules can you extract to predict the outcome?
Not as easy as the first case, right?
That’s when machine learning helps us!
Introduction to Decision Tree (Part 02)
Goal of machine learning
Using a set of training data, find the best rule (model) that
generalizes well (labels or predicts the outcome correctly)
on new data (unseen data).
Four types of classification techniques
Rule-based Machine Learning: Decision Trees
• A map of the reasoning process, good at solving classification
problems (Negnevitsky, 2005)
• A decision tree represents a number of different attributes and
values
• Nodes represent attributes*
• Branches represent values of the attributes
• A path through the tree represents a decision
• A tree can be associated with rules
Note: attributes = features
Why Decision Trees?
• Decision Trees (DT) are one of the most popular data mining
tools (with linear and logistic regression)
• They are:
• Easy to understand
• Easy to implement
• Computationally cheap
• Almost all data mining packages include DT
• They have advantages for model comprehensibility, which is
important for:
• model evaluation
• communication to non-technical stakeholders
Example: Ice-cream
Outlook Temperature Holiday Season Result
Overcast Mild Yes Don’t Sell
Sunny Mild Yes Sell
Sunny Hot No Sell
Overcast Hot No Don’t Sell
Sunny Cold No Don’t Sell
Overcast Cold Yes Don’t Sell
*overcast = cloudy
Example: Ice-cream
• When should an ice-cream seller
attempt to sell ice-cream?
• Could you write a set of rules?
• How would you acquire the
knowledge?
• You might learn by experience:
• For example, experience of:
• ‘Outlook’: Overcast or Sunny
• ‘Temperature’: Hot, Mild or Cold
• ‘Holiday Season’: Yes or No
Generalisation
• What should the seller do when:
• 'Outlook': Sunny, 'Temperature': Hot, 'Holiday Season': Yes -> Sell
• What about:
• 'Outlook': Overcast, 'Temperature': Hot, 'Holiday Season': Yes -> Sell
Let's visualize the rules
Example 1: Ice-cream (the rules visualized as a decision tree)
Root node and internal nodes = attributes; branches = values of the attributes; leaves = decisions.

Outlook = Sunny -> Temperature
    Temperature = Hot  -> Sell
    Temperature = Mild -> Holiday Season
        Holiday Season = Yes -> Sell
        Holiday Season = No  -> Don't Sell
    Temperature = Cold -> Don't Sell
Outlook = Overcast -> Holiday Season
    Holiday Season = No  -> Don't Sell
    Holiday Season = Yes -> Temperature
        Temperature = Hot  -> Sell
        Temperature = Mild -> Don't Sell
        Temperature = Cold -> Don't Sell
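The tree can be read as a small set of nested rules. Below is a minimal Python sketch of those rules; the function name and string values are illustrative, and the Overcast branch follows the layout reconstructed above:

def predict_ice_cream(outlook, temperature, holiday_season):
    # Follows the decision tree sketched above; returns "Sell" or "Don't Sell".
    if outlook == "Sunny":
        if temperature == "Hot":
            return "Sell"
        if temperature == "Mild":
            return "Sell" if holiday_season == "Yes" else "Don't Sell"
        return "Don't Sell"                      # Cold
    else:                                        # Overcast
        if holiday_season == "No":
            return "Don't Sell"
        return "Sell" if temperature == "Hot" else "Don't Sell"

# The two generalisation queries from the previous slide:
print(predict_ice_cream("Sunny", "Hot", "Yes"))      # Sell
print(predict_ice_cream("Overcast", "Hot", "Yes"))   # Sell (a case not present in the table)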
Construction
• Concept learning:
• Inducing concepts from examples
• Different algorithms used to construct a
tree based upon the examples
• The most popular is Iterative Dichotomizer 3
(ID3, proposed by Quinlan, 1986)
• But:
• Different trees can be constructed from
the same set of examples
• Real-life data are noisy and often contradictory
Introduction to Decision Tree (Part 03)
Ambiguous Trees
Consider the following data (pewarna = food coloring, pengawet = preservatives):

Item   X      Y      Class
1      False  False  +
2      True   False  +
3      False  True   -
4      True   True   -
Ambiguous Trees (1st option): Y as root node
Y = True  -> {3, 4} : Negative
Y = False -> {1, 2} : Positive
Ambiguous Trees (2nd option): X as root node
X = True  -> Y ({2, 4})
    Y = False -> {2} : Positive
    Y = True  -> {4} : Negative
X = False -> Y ({1, 3})
    Y = False -> {1} : Positive
    Y = True  -> {3} : Negative

Different trees can be constructed from the same set of examples.
Which tree is the best? It depends on the choice of attribute at each node of the tree.
Information Theory in Decision Tree (Part 01)
Information Theory
• We can use information theory to help us understand:
• Which attribute is the best to choose for a particular node of the tree
• This is the node that is the best at separating the required
predictions, and hence which leads to the best (or at least a good)
tree
• 'Information theory addresses both the limitations and the possibilities of
communication' (MacKay, 2003:16):
• Measuring information content
• Probability (measure of chance) and entropy (measure of disorder)
MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press.
Choosing attributes
• Entropy:
• A measure of disorder / unexpectedness / uncertainty / surprise / randomness
• High entropy means the data have high variance and thus contain a lot of
information and/or noise
• For c classification categories:
• an attribute a that has value v
• the probability of an example with a = v being in category i is p_i
• the entropy is E(a = v) = -sum over i of p_i log2(p_i)
Information Theory in Decision Tree (Part 02)
The dataset (whether to locate a new bar at a given location).
The attributes of each example are:
1. City or town: Yes or No
2. Has a university nearby: Yes or No
3. Type of nearby housing estate: None, Small, Medium, Large
4. Quality of public transport: Good, Average, Poor
5. The number of schools nearby: Small, Medium, Large
Class: + (Yes, locate the new bar)  /  - (No, don't locate the new bar)
Entropy example
• Choice of attributes:
• City/Town, University, Housing Estate,
Industrial Estate, Transport and
Schools
• Let’s compute entropy of City/Town
• City/Town: is either Y or N
• For Y: 7 positive examples, 3 negative
• For N: 4 positive examples, 6 negative
For Y (Yes):
7 positives, 3 negatives
For N (No):
4 positives, 6 negatives
Information Theory in Decision Tree (Part 03)
Entropy example
• City/Town as root node:
• For c=2 (positive and negative)
classification categories
• Attribute a=City/Town that has value v=Y
• Probability of v=Y being in category positive
= 7/10
• Probability of v=Y being in category
negative
= 3/10
For Y (Yes):
7 positives, 3 negatives
Entropy example
• City/Town as root node:
• For c=2 (positive and negative) classification
categories
• Attribute a=City/Town that has value v=Y
• Entropy E is:
E (City/Town = Y)
= (-7/10 x log2 7/10) + (- 3/10 x log2 3/10)
= -[0.7 x -0.51 + 0.3 x -1.74 ]
= 0.881
For Y (Yes):
7 positives, 3 negatives
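These two entropy values can be checked with a few lines of Python (a minimal sketch; the helper name is illustrative):

import math

def entropy(counts):
    # E = -sum(p_i * log2(p_i)) for a list of class counts, e.g. [positives, negatives].
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([7, 3]), 3))   # 0.881 -> E(City/Town = Y)
print(round(entropy([4, 6]), 3))   # 0.971 -> E(City/Town = N)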
Information Theory in Decision Tree (Part 04)
Entropy Example
• City/Town as root node:
• For c=2 (positive and negative)
classification categories
• Attribute a=City/Town that has value v=N
• Probability of v=N being in category
positive
=4/10
• Probability of v=N being in category
negative
= 6/10
For N (No):
4 positives, 6 negatives
Entropy Example
• City/Town as root node:
• For c=2 (positive and negative)
classification categories
• Attribute a=City/Town that has value v=N
• Entropy E is:
E (City/Town = N)
= (-4/10 x log2 4/10) + (- 6/10 x log2 6/10)
= 0.971
For N (No):
4 positives, 6 negatives
Entropy Example
• If the purity of the instances increases, the entropy decreases
• High entropy means high disorder / uncertainty
• In the example below, the split with 7 (+) and 3 (-) has lower entropy (0.881)
because the purity of the instances is higher (City/Town = Y tends toward the (+) class)
E(City/Town = Y) = 0.881 -> 7 (+) and 3 (-)
E(City/Town = N) = 0.971 -> 4 (+) and 6 (-)
Entropy
Ten instances consist of two classes : + and -
Source: F. Provost and T. Fawcett, Data Science for Business, O’Reilly Media, 2013
High entropy
means high
disorder /
uncertainty
Choosing attributes
• Information gain:
• The expected reduction in entropy (high is good)
• The entropy of the whole example set T is E(T)
• The examples for which attribute a takes value v_j form the subset T_j, with entropy E(T_j)
• The gain is: Gain(T, a) = E(T) - sum over j of ( |T_j| / |T| ) x E(T_j)
• |T| = total number of samples = 20
• |T_j| = number of samples for which a takes value v_j (the Y/N values, for example)
Root of tree
• For root of tree, there are 20 examples:
• For c=2 (positive and negative)
classification categories
• Probability of being positive class with
11 examples
=11/20
• Probability of being negative with
9 examples
= 9/20
Information gain example
• For root of tree there are 20
examples:
• For c=2 (positive and negative)
classification categories
• Entropy of all training examples E(T) is:
|T | = 20
E(T) = (-11/20 x log2 11/20) +
(- 9/20 x log2 9/20)
= 0.993
Entropy example
E(City/Town = Y) = 0.881 —> 7 (+) and 3 (-)
E(City/Town = N) = 0.971 —> 4 (+) and 6 (-)
Total sample for E(City/Town = Y) = 10
Total sample for E(City/Town = N) = 10
Information gain example
• City/Town as root node:
• 10 examples for a=City/Town and value v=Y
• |Tj=Y | = 10 E(Tj=Y) = 0.881
• 10 examples for a=City/Town and value v=N
• |Tj=N | = 10 E(Tj=N) = 0.971
Information Theory in Decision Tree (Part 05)
Now we compute the information gain of another attribute: Transport
E(T) = 0.993
Compute the entropy of transport
• Transport:
• For c=2 (positive and negative) classification
categories
• Attribute a=Transport that has value v=G
• Probability of v=G being in category positive = 5/5 = 1
• Probability of v=G being in category negative = 0/5 = 0
E(Transport = G) = (-5/5 x log2 5/5) + (0) = 0
Quality of public transport:
Good, Average, Poor
Compute the entropy of transport
• Transport:
• For c=2 (positive and negative) classification
categories
• Attribute a=Transport that has value v=A
• Probability of v=A being in category positive
= 3/7 = 0.429
• Probability of v=A being in category negative
= 4/7 = 0.571
E(Transport = A) = (-3/7 x log2 3/7) + (-4/7 x log2 4/7)
= -[3/7 x -1.22 + 4/7 x -0.808]
= 0.524 + 0.461 = 0.985
Quality of public transport:
Good, Average, Poor
Compute the entropy of transport
• Transport:
• For c=2 (positive and negative) classification
categories
• Attribute a=Transport that has value v=P
• Probability of v=P being in category positive
= 3/8 = 0.375
• Probability of v=P being in category negative
= 5/8 = 0.625
E(Transport = P) = (-3/8 x log2 3/8) + (-5/8 x log2 5/8)
= - [3/8 x -1.415 + 5/8 x -0.678]
= 0.530 + 0.424 = 0.954
Quality of public transport:
Good, Average, Poor
Information gain of transport
Gain(T, Transport) = 0.993 – ((5/20 x 0) + (7/20 x 0.985) + (8/20 x 0.954))
= 0.993 - (0.345 + 0.382)
= 0.266
A = average
P = poor
G = good
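The same calculation in Python (a minimal sketch; the class counts come from the slides above and the helper names are illustrative):

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(total_counts, subsets):
    # Gain(T, a) = E(T) - sum over j of |Tj|/|T| * E(Tj), where the subsets Tj
    # are the groups of examples induced by the values of attribute a.
    n = sum(total_counts)
    return entropy(total_counts) - sum(sum(s) / n * entropy(s) for s in subsets)

# The whole training set has 11 positive and 9 negative examples.
# Transport splits it into G: 5+/0-, A: 3+/4-, P: 3+/5-.
print(round(gain([11, 9], [[5, 0], [3, 4], [3, 5]]), 3))   # 0.266
# City/Town splits it into Y: 7+/3-, N: 4+/6-.
print(round(gain([11, 9], [[7, 3], [4, 6]]), 3))           # 0.067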
Information gain: City/Town vs. Transport
Gain(T, City/Town) = 0.993 - (10/20 x 0.881 + 10/20 x 0.971) = 0.067
Gain(T, Transport) = 0.266
Transport reduces the entropy far more than City/Town, so it is the better choice for the root node.
Choosing attributes
• Choose as the root node the attribute that
gives the highest Information Gain
• In this case the attribute Transport,
with IG = 0.266
• The branches from the root node then become
the values associated with that attribute
• Recursively calculate the IG of the remaining
attributes/nodes
• Filter the examples by attribute value
Recursive example for selecting
the next attribute
• Example, with Transport as the root
node:
• Select the examples where Transport is
Average: (1, 3, 6, 8, 11, 15, 17)
• Use only these examples to construct
this branch of the tree
• Select the next attribute by computing the
Information Gain (IG) of each remaining
attribute and taking the highest one
• Repeat for each value of Transport
(Poor, Good); a compact sketch of this procedure follows below
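A compact sketch of this recursive, ID3-style construction in Python. The data representation (a list of dicts with a 'class' key) and the function names are assumptions made for illustration, not Quinlan's original formulation. Applied to the small X/Y example from the earlier slides, it selects Y as the root, matching the first of the two trees shown there:

import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    classes = {e["class"] for e in examples}
    if len(classes) == 1:                       # pure subset: return a leaf
        return classes.pop()
    if not attributes:                          # no attributes left: majority vote
        return Counter(e["class"] for e in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    tree = {best: {}}
    for value in {e[best] for e in examples}:   # one branch per value of the chosen attribute
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree

data = [
    {"X": False, "Y": False, "class": "+"},
    {"X": True,  "Y": False, "class": "+"},
    {"X": False, "Y": True,  "class": "-"},
    {"X": True,  "Y": True,  "class": "-"},
]
print(id3(data, ["X", "Y"]))   # {'Y': {False: '+', True: '-'}} (branch order may vary)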
Final tree (Callan 2003:243), with Transport as the root node:
• Transport = Good: examples {7, 12, 16, 19, 20} -> Positive (leaf)
• Transport = Poor: examples {2, 4, 5, 9, 10, 13, 14, 18} -> split on Industrial Estate (Y/N):
    {5, 9, 14} : Positive;  {2, 4, 10, 13, 18} : Negative
• Transport = Average: examples {1, 3, 6, 8, 11, 15, 17} -> split on Housing Estate (L/M/S/N):
    two branches end directly in leaves: {6} Negative and {8} Negative;
    the branch containing {11, 17} splits further on Industrial Estate: {11} Positive, {17} Negative;
    the branch containing {1, 3, 15} splits further on University: {1, 3} Positive, {15} Negative

Reminder of the attributes:
1. City or town: Yes or No
2. Has a university nearby: Yes or No
3. Type of nearby housing estate: None, Small, Medium, Large
4. Quality of public transport: Good, Average, Poor
5. The number of schools nearby: Small, Medium, or Large
Clustering (Part 01)
Unsupervised learning
• Learning without a teacher: there is no label / class in the data
• Typically used:
• when we want to explore unlabeled data for the
first time (to create a training set for later prediction
or classification)
• when we want to profile voters based on their
characteristics and online activities (useful for a
political campaign)
• when we want to find associations between
customer products (recommender systems in e-commerce)
Clustering analysis
The daily expenditures on food (X1) and clothing (X2) of five persons
are shown in a table below
Clustering analysis
• The numbers are fictitious and not at all realistic, but the example will
help us explain the essential features of cluster analysis as simply as
possible. The data in the table are plotted in the figure below.
Clustering analysis
• Inspection of the figure suggests that the five
observations form two clusters.
• The first consists of persons a and d, and the second of
b, c and e.
• It can be noted that the observations in each cluster are
similar to one another with respect to expenditures on
food (X1) and clothing (X2), and that the two clusters
are quite distinct from each other.
• This inspection was possible because only two
variables were involved in grouping the observations.
The question is: Can a procedure be devised for similarly
grouping observations when there are more than two
variables or attributes?
Measures of distances for variables
• Clustering methods require a more precise
definition of "similarity" ("closeness",
"proximity") of observations and clusters.
• When the grouping is based on variables,
it is natural to employ the familiar concept of
distance.
• Consider the right figure as a map showing
two points, i and j, with coordinates (X1i,X2i)
and (X1j ,X2j), respectively.
Euclidean distance
• The Euclidean distance between the two
points is the hypotenuse of the triangle ABC:
D(i, j) = sqrt( (X1i - X1j)^2 + (X2i - X2j)^2 )
• An observation i is declared to be closer
(more similar) to j than to observation k if
D(i, j) < D(i, k).
• An alternative measure is the squared
Euclidean distance. In the figure, the squared
distance between the two points i and j is
D^2(i, j) = (X1i - X1j)^2 + (X2i - X2j)^2
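Both measures are one-liners in Python (a minimal sketch; the function names are illustrative):

import math

def euclidean(p, q):
    # p and q are observations given as (X1, X2) tuples.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def squared_euclidean(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))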
Clustering methods
• Nearest neighbor (or single linkage) method
• Furthest neighbor (or complete linkage) method
• K-means method
Nearest neighbor method
• One of the simplest methods is to treat the distance between the two
nearest observations, one from each cluster, as the distance between
the two clusters.
• This is known as the nearest neighbor (or single linkage) method.
Nearest neighbor method
Let us suppose that Euclidean distance is the appropriate measure of proximity.
We begin with each of the five observations forming its own cluster.
The distance between each pair of observations is shown in the figure below.
Nearest neighbor method
• For example, the distance between a and b is
D(a, b) = sqrt( (X1a - X1b)^2 + (X2a - X2b)^2 ).
• Observations b and e are nearest (most similar)
and, as shown in figure (b), are grouped in the
same cluster.
• Assuming the nearest neighbor method is used, the
distance between the cluster (be) and another
observation is the smaller of the distances between that
observation, on the one hand, and b and e, on the
other. For example, D(a, (be)) = min{ D(a, b), D(a, e) }.
Nearest neighbor method
• Observations a and d are nearest, with distance 1.414
• We arbitrarily select (a, d) as the new cluster
• The distances between (be) and (ad), and between c and (ad),
are then computed in the same way, taking the smallest pairwise
distance in each case
Nearest neighbor method
• We finally merge (be) with c to form the cluster (bce), shown below
Nearest neighbor method
• The grouping of these two clusters, it will be noted, occurs at a distance of
6.325, a much greater distance than that at which the earlier groupings took
place.
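The whole single-linkage procedure fits in a short Python sketch. The coordinates below are an assumption, not the values from the original table (which appears only as a figure); they were chosen to be consistent with the distances quoted in the slides, e.g. D(a, d) = 1.414 and the final merge at 6.325:

import math

# Assumed daily expenditures (X1 = food, X2 = clothing), for illustration only.
points = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}

def dist(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def single_linkage(c1, c2):
    # Nearest neighbor distance between two clusters (sets of observation labels).
    return min(dist(points[i], points[j]) for i in c1 for j in c2)

clusters = [{name} for name in points]          # start: each observation is its own cluster
while len(clusters) > 1:
    # find the pair of clusters with the smallest single-linkage distance
    i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]))
    d = single_linkage(clusters[i], clusters[j])
    print("merge", sorted(clusters[i]), "+", sorted(clusters[j]), "at distance", round(d, 3))
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] | clusters[j]]

# Under these assumed coordinates the merges are: (b, e) at 1.118, (a, d) at 1.414,
# c joins (b, e) at 1.414, and the two remaining clusters merge at 6.325.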
Nearest neighbor method
• The groupings and the distance at
which these took place are also shown
in the tree diagram (dendrogram)
Nearest neighbor method
• One usually searches the dendrogram for
large jumps in the grouping distance as
guidance in arriving at the number of
groups.
• In this illustration, it is clear that the elements
in each of the clusters (ad) and (bce) are close
(they were merged at a small distance)
• However, the clusters are distant (the
distance at which they merge is large).
• Thus, we conclude that there are two clusters
instead of one big cluster.
Furthest neighbor method
• Under the furthest neighbor (or complete linkage) method, the
distance between two clusters is the distance between their most distant members.
Furthest neighbor method
• The distances between all pairs of observations shown
in the figure are the same as with the nearest neighbor
method.
• Therefore, the furthest neighbor method also calls for
grouping b and e at Step 1.
• However, the distances between (be), on the one hand,
and the clusters (a), (c), and (d), on the other, are
different, because they are now measured to the furthest
member of (be) rather than to the nearest.
Furthest neighbor method
• The four clusters remaining at Step 2 and the distances between these
clusters are shown below
• The nearest clusters are (a) and (d), which are now grouped into the
cluster (ad). The remaining steps are similarly executed.
Clustering (Part 02)
K-means clustering
• k-means clustering is a technique used to uncover
categories.
• In the retail sector, it can be used to categorize both products
and customers.
• k represents the number of categories identified, with each
category’s average (mean) characteristics being appreciably
different from that of other categories.
Determining cluster membership
• Specify the number of clusters arbitrarily.
• We can then determine cluster membership.
This involves a simple iterative process.
• We will illustrate this process with a 2-cluster
example:
Step 1: Start by making a guess on where the
central points of each cluster are. Let’s call
these pseudo-centers, since we do not yet
know if they are actually at the center of their
clusters.
Determining cluster membership
Step 2: Assign each data point to the
nearest pseudo-center (measured by
Euclidean distance).
By doing so, we have just formed
clusters, with each cluster comprising
all data points associated with its
pseudo-center.
Determining cluster membership
Step 3: Update the location of each
cluster’s pseudo-center, such that it is
now indeed in the center of all its
members (cluster’s centroid).
NOTE: The cluster centroid is the point with
coordinates equal to the average values of
the variables for the observations in that
cluster.
Determining cluster membership
Step 4: Repeat the steps of re-assigning
cluster members (Step 2) and re-locating
cluster centers (Step 3), until there are
no more changes to cluster membership.
A minimal sketch of the whole procedure is given below.
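A minimal Python sketch of these four steps (the function and variable names are illustrative; in practice one would typically use a library implementation such as scikit-learn's KMeans):

import math

def kmeans(points, centers, max_iter=100):
    # points: list of (x, y) tuples; centers: initial pseudo-centers (Step 1).
    for _ in range(max_iter):
        # Step 2: assign each data point to its nearest pseudo-center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            clusters[nearest].append(p)
        # Step 3: move each center to the centroid (mean) of its members
        new_centers = [tuple(sum(coord) / len(coord) for coord in zip(*members))
                       if members else centers[k]
                       for k, members in enumerate(clusters)]
        # Step 4: stop when the centers (and hence the memberships) no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers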
K-means method
Suppose two clusters are to be formed for the observations listed in
a table below, showing the daily expenditures on food (X1)
and clothing (X2) of five persons
K-means method
• Step 1: we begin by arbitrarily assigning a, b and d to Cluster 1, and
c and e to Cluster 2. The cluster centroids are calculated as shown
in the table.
K-means method
• The cluster centroid is the point with
coordinates equal to the average values of
the variables for the observations in that
cluster.
• Thus, the centroid of Cluster 1 is the point
(X1 = 3.67, X2 = 3.67), and that of Cluster
2 the point (8.75, 2). The two centroids are
marked by C1 and C2.
• The cluster's centroid, therefore, can be
considered the center of the observations
in the cluster.
K-means method
• We now calculate the distance between a
and the two centroids, C1 and C2.
• Observe that a is closer to the centroid of
Cluster 1, to which it is currently assigned.
Therefore, a is not reassigned.
• Next, we calculate the distance between b
and the two cluster centroids in the same way.
K-means method
• Step 2: since b is closer to Cluster 2's centroid than to that of
Cluster 1, it is reassigned to Cluster 2. The new cluster centroids
are calculated as shown in figure (a).
K-means method
• The new centroids are plotted, and the distances of the observations from the new cluster
centroids are computed (the nearest centroid is marked with an asterisk in the original table).
• Every observation now belongs to the cluster whose centroid it is nearest to, and the k-means
method stops. The elements of the two clusters are shown in the table.
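As a quick check of the assignment step, the coordinates below are an assumption chosen to reproduce the centroids quoted above (C1 = (3.67, 3.67) for the initial cluster {a, b, d} and C2 = (8.75, 2) for {c, e}); they are not taken from the original table:

import math

pts = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}
c1 = (3.67, 3.67)   # centroid of the initial Cluster 1 = {a, b, d}
c2 = (8.75, 2.0)    # centroid of the initial Cluster 2 = {c, e}
for name, p in pts.items():
    d1, d2 = math.dist(p, c1), math.dist(p, c2)
    print(name, "-> Cluster", 1 if d1 < d2 else 2)
# a and d stay in Cluster 1; b is reassigned to Cluster 2, joining c and e,
# after which recomputing the centroids produces no further changes.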
Association Rules (Part 01)
Supermarket’s problem
• When we go grocery shopping, we often
have a standard list of things to buy.
• Each shopper has a distinctive list,
depending on one’s needs and
preferences.
• A housewife might buy healthy ingredients
for a family dinner, while a bachelor might
buy fruits and chips.
• Understanding these buying patterns can
help to increase sales in several ways.
Supermarket’s problem
If there is a pair of items, X and Y, that are
frequently bought together:
• Both X and Y can be placed on the same
shelf, so that buyers of one item would be
prompted to buy the other.
• Promotional discounts could be applied
to just one out of the two items.
• Advertisements on X could be targeted at
buyers who purchase Y.
• X and Y could be combined into a new
product, such as having Y in flavors of X.
While we may know that certain items are
frequently bought together, the question is, how
do we uncover these associations?
Association rules (1/3)
Table 1. Example Transactions
• Association rules analysis is a technique to uncover how items are
associated to each other. There are three common ways to
measure association.
• Measure 1: Support. This says how popular an itemset is, as
measured by the proportion of transactions in which an itemset
appears.
• In Table 1, the support of {apple} is 4 out of 8, or 50%. Itemsets
can also contain multiple items. For instance, the support of {apple,
beer, rice} is 2 out of 8, or 25%.
• If you discover that sales of items beyond a certain proportion tend
to have a significant impact on your profits, you might consider
using that proportion as your support threshold.
• You may then identify itemsets with support values above this
threshold as significant itemsets.
Association rules (2/3)
• Measure 2: Confidence. This says how likely item Y is purchased
when item X is purchased, expressed as {X -> Y}.
• This is measured by the proportion of transactions with item X, in
which item Y also appears.
In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.
• One drawback of the confidence measure is that it might misrepresent
the importance of an association.
• This is because it only accounts for how popular apples are, but not
beers.
• If beers are also very popular in general, there will be a higher
chance that a transaction containing apples will also contain
beers, thus inflating the confidence measure.
• To account for the base popularity of both constituent items, we use a
third measure called lift.
Table 1. Example Transactions
Support {apple, beer} : 3 / 8
Support {apple} : 4 / 8
Confidence {apple -> beer} : (3/8) ÷ (4/8) = 3/4
Association rules (3/3)
Measure 3: Lift. This says how likely item Y is
purchased when item X is purchased, while
controlling for how popular item Y is.
In Table 1, the lift of {apple -> beer} is 1,
which implies no association between items.
A lift value greater than 1:
item Y is likely to be bought if item X is bought
A lift value less than 1:
item Y is unlikely to be bought if item X is bought.
Table 1. Example Transactions
Support {apple, beer} : 3/8
Support {apple} : 4/8
Support {beer} : 6/8
Support {apple} x Support {beer} : 24/64
Lift {apple -> beer} : (3/8) ÷ (24/64) = 1
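All three measures can be computed from a list of transactions in a few lines of Python. The transactions below are a made-up toy list for illustration (they are not the contents of Table 1, which appears only as a figure), and the function names are illustrative:

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    # How often rhs appears among the transactions that contain lhs.
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    # Confidence corrected for how popular rhs is on its own.
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

transactions = [
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "milk"},
    {"beer", "milk"},
]
print(support(transactions, {"apple"}))               # 0.75
print(confidence(transactions, {"apple"}, {"beer"}))  # 0.666...
print(lift(transactions, {"apple"}, {"beer"}))        # 0.888..., i.e. slightly below 1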
Illustration of association rules
The network graph shows associations between selected items in
a supermarket.
Larger circles imply higher support, while red circles imply
higher lift. Several purchase patterns can be observed:
• The most popular transaction was of pip and tropical fruits (#1)
• Another popular transaction was of onions and other
vegetables (#2)
• If someone buys meat spreads, he is likely to have bought
yogurt as well (#3)
• Relatively many people buy sausage along with sliced cheese
(#4)
• If someone buys tea, he is likely to have bought fruit as well,
possibly inspiring the production of fruit-flavored tea (#5)
Association Rules (Part 02)
How to use support, confidence, and lift
• The {beer -> soda} rule has the highest
confidence at 20% (see Table 3)
• However, both beer and soda appear
frequently across all transactions (see
Table 2), so their association could simply
be a fluke.
• This is confirmed by the lift value of
{beer -> soda}, which is 1, implying no
association between beer and soda.
Table 2. Support of individual items
Table 3. Association measures for beer-related rules
How to use support, confidence, and lift
• On the other hand, the {beer -> male cosmetics}
rule has a low confidence, due to few purchases
of male cosmetics in general (see Table 3)
• However, whenever someone does buy male
cosmetics, he is very likely to buy beer as well,
as inferred from a high lift value of 2.6 (see
Table 3)
• The converse is true for {beer -> berries}.
• With a lift value below 1, we may conclude that
if someone buys berries, he would likely be
averse to beer.
Table 2. Support of individual items
Table 3. Association measures for beer-related rules
Apriori algorithm
• The apriori principle can reduce the number of itemsets we need to
examine.
• Put simply, the apriori principle states that if an itemset is infrequent,
then all its subsets must also be infrequent.
• This means that if {beer} was found to be infrequent, we can expect
{beer, pizza} to be equally or even more infrequent.
• So in consolidating the list of popular itemsets, we need not consider
{beer, pizza}, nor any other itemset configuration that contains beer
Apriori algorithm
Using the apriori principle, the number of itemsets that
have to be examined can be pruned, and the list of
popular itemsets can be obtained in the following steps (a minimal sketch follows the list):
Step 0. Start with itemsets containing just a single item,
such as {apple} and {pear}.
Step 1. Determine the support for itemsets. Keep the
itemsets that meet your minimum support threshold, and
remove itemsets that do not.
Step 2. Using the itemsets you have kept from
Step 1, generate all the possible itemset configurations.
Step 3. Repeat Steps 1 & 2 until there are no more new
itemsets.
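A minimal Python sketch of these steps (the toy transactions and the threshold are for illustration only; real projects usually rely on an existing implementation such as the apriori function in the mlxtend package):

from itertools import chain

def apriori(transactions, min_support):
    # Returns every itemset whose support meets min_support, built level by level.
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Steps 0-1: single-item itemsets that meet the threshold
    items = set(chain.from_iterable(transactions))
    frequent = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    result, k = set(frequent), 2
    while frequent:
        # Step 2: build k-item candidates from the surviving (k-1)-item sets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Step 1 again: keep only the candidates that meet the threshold
        frequent = {c for c in candidates if sup(c) >= min_support}
        result |= frequent
        k += 1                                   # Step 3: repeat until nothing new appears
    return result

transactions = [frozenset(t) for t in
                ({"beer", "chips"}, {"beer", "chips", "pizza"}, {"chips", "pizza"}, {"beer"})]
for itemset in sorted(apriori(transactions, min_support=0.5), key=len):
    print(set(itemset))   # beer, chips, pizza, then {beer, chips} and {chips, pizza}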
Apriori Algorithm
• In the accompanying illustration, {apple} was determined to
have low support; hence it was removed, and all
other itemset configurations that contain apple need
not be considered.
• This reduced the number of itemsets to consider by
more than half.
• Note that the support threshold that you pick in Step
1 could be based on formal analysis or past
experience.
• If you discover that sales of items beyond a certain
proportion tend to have a significant impact on your
profits, you might consider using that proportion as
your support threshold.
Finding item rules with high confidence or lift
• We have seen how the apriori algorithm can be
used to identify itemsets with high support.
• The same principle can also be used to identify
item associations with high confidence or lift.
• Finding rules with high confidence or lift is less
computationally taxing once high-support
itemsets have been identified, because
confidence and lift values are calculated using
support values.
Finding item rules with high confidence or lift
• Take for example the task of finding high-confidence
rules.
• If the rule {beer, chips -> apple} has low confidence, all
other rules with the same constituent items and with
apple on the right hand side would have low confidence
too.
• Specifically, the rules
{beer -> apple, chips}
{chips -> apple, beer}
would have low confidence as well.
• As before, lower level candidate item rules can be
pruned using the apriori algorithm, so that fewer
candidate rules need to be examined.
Limitations
• Computationally Expensive.
• Even though the apriori algorithm reduces the number of candidate itemsets to consider,
this number could still be huge when store inventories are large or when the support
threshold is low.
• However, an alternative solution would be to reduce the number of comparisons by using
advanced data structures, to sort candidate itemsets more efficiently.
• Spurious (fake) Associations.
• Analysis of large inventories would involve more itemset configurations, and the support
threshold might have to be lowered to detect certain associations.
• However, lowering the support threshold might also increase the number of spurious
associations detected.
Modul Topik 8 - Kecerdasan BuatanModul Topik 8 - Kecerdasan Buatan
Modul Topik 8 - Kecerdasan Buatan
 
Modul Topik 5 - Kecerdasan Buatan
Modul Topik 5 - Kecerdasan BuatanModul Topik 5 - Kecerdasan Buatan
Modul Topik 5 - Kecerdasan Buatan
 
Modul Topik 3 - Kecerdasan Buatan
Modul Topik 3 - Kecerdasan BuatanModul Topik 3 - Kecerdasan Buatan
Modul Topik 3 - Kecerdasan Buatan
 
Pengantar Mata Kuliah Kecerdasan Buatan.pdf
Pengantar Mata Kuliah Kecerdasan Buatan.pdfPengantar Mata Kuliah Kecerdasan Buatan.pdf
Pengantar Mata Kuliah Kecerdasan Buatan.pdf
 
Introduction to Artificial Intelligence - Pengenalan Kecerdasan Buatan
Introduction to Artificial Intelligence - Pengenalan Kecerdasan BuatanIntroduction to Artificial Intelligence - Pengenalan Kecerdasan Buatan
Introduction to Artificial Intelligence - Pengenalan Kecerdasan Buatan
 
Mengenal Eye Tracking (Introduction to Eye Tracking Research)
Mengenal Eye Tracking (Introduction to Eye Tracking Research)Mengenal Eye Tracking (Introduction to Eye Tracking Research)
Mengenal Eye Tracking (Introduction to Eye Tracking Research)
 

Recently uploaded

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...Call Girls in Nagpur High Profile
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZTE
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
 
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
ZXCTN 5804 / ZTE PTN / ZTE POTN / ZTE 5804 PTN / ZTE POTN 5804 ( 100/200 GE Z...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 

Machine Learning Techniques

  • 4. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Can a machine learn? • From a limited set of examples, you should be able to generalize. • Humans can easily generalize if the data are simple enough • If the data are complicated, human ability is limited and prone to error. • We get a machine to do this task. 3 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 What rules can you extract to predict the outcome? 4 J.D. Kelleher, B.M. Namee, A. D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics, MIT Press, 2015 Credit scoring dataset
  • 5. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 What rules can you extract to predict the outcome? 5 J.D. Kelleher, B.M. Namee, A. D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics, MIT Press, 2015 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Can you extract the rules manually? 6 More complicated credit scoring dataset
  • 6. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 What rules can you extract to predict the outcome? 7 Not as easy as the first case, right? That’s when machine learning helps us! sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 8 End of File
  • 7. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Introduction to Decision Tree (Part 02) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 What rules can you extract to predict the outcome? 2 J.D. Kelleher, B.M. Namee, A. D’Arcy, Fundamentals of Machine Learning for Predictive Data Analytics, MIT Press, 2015 Credit scoring dataset
  • 8. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Goal of machine learning? 3 Using a set of training data to find the best rule (model) that generalizes well (does a good labeling task / predicts the outcome) on new data (unseen data) training data new data sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Four types of classification techniques
  • 9. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Rule-based Machine Learning: Decision Trees • A map of the reasoning process, good at solving classification problems (Negnevitsky, 2005) • A decision tree represents a number of different attributes and values • Nodes represent attributes* • Branches represent values of the attributes • Path through a tree represents a decision • Tree can be associated with rules 5 Note: attributes = features sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Why Decision Trees? • Decision Trees (DT) are one of the most popular data mining tools (with linear and logistic regression) • They are: • Easy to understand • Easy to implement • Computationally cheap • Almost all data mining packages include DT • They have advantages for model comprehensibility, which is important for: • model evaluation • communication to non-technical stakeholders 6
  • 10. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Example: Ice-cream 7
      Outlook    Temperature   Holiday Season   Result
      Overcast   Mild          Yes              Don’t Sell
      Sunny      Mild          Yes              Sell
      Sunny      Hot           No               Sell
      Overcast   Hot           No               Don’t Sell
      Sunny      Cold          No               Don’t Sell
      Overcast   Cold          Yes              Don’t Sell
      *overcast = cloudy
      sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 Example: Ice-cream • When should an ice-cream seller attempt to sell ice-cream? • Could you write a set of rules? • How would you acquire the knowledge? • You might learn by experience: • For example, experience of: • ‘Outlook’: Overcast or Sunny • ‘Temperature’: Hot, Mild or Cold • ‘Holiday Season’: Yes or No 8
  • 11. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 9 Generalisation • What should the seller do when: • ‘Outlook’: Sunny • ‘Temperature’: Hot • ‘Holiday Season’: Yes • What about: • ‘Outlook’: Overcast • ‘Temperature’: Hot • ‘Holiday Season’: Yes 9 Sell Sell Let’s visualize the rules sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 10 Example 1: Ice-cream 10 [Decision tree figure] Root node: Outlook. Sunny branch -> Temperature: Hot -> Sell; Mild -> Holiday Season (Yes -> Sell, No -> Don’t Sell); Cold -> Don’t Sell. Overcast branch -> Holiday Season: No -> Don’t Sell; Yes -> Temperature (Hot -> Sell, Mild -> Don’t Sell, Cold -> Don’t Sell). Figure labels: Root node, Branch (value of attribute), Leaf node.
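To make the tree-to-rules mapping concrete, here is a minimal sketch (Python, added for illustration; the function name is my own) that encodes the ice-cream tree above as nested if/else rules. The leaf for the unseen Overcast/Hot/Yes case follows the answer shown on the Generalisation slide.

```python
# Hedged sketch: the ice-cream decision tree written as nested if/else rules.
# Attribute and value names follow the slide's table; the Overcast/Hot/Yes leaf
# ("Sell") is taken from the Generalisation slide rather than the training table.
def should_sell(outlook, temperature, holiday_season):
    if outlook == "Sunny":
        if temperature == "Hot":
            return "Sell"
        if temperature == "Mild":
            return "Sell" if holiday_season == "Yes" else "Don't Sell"
        return "Don't Sell"                                  # Cold
    # Overcast branch
    if holiday_season == "No":
        return "Don't Sell"
    return "Sell" if temperature == "Hot" else "Don't Sell"  # Mild/Cold -> Don't Sell

print(should_sell("Sunny", "Hot", "Yes"))      # Sell
print(should_sell("Overcast", "Hot", "Yes"))   # Sell
```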
  • 12. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 11 Construction • Concept learning: • Inducing concepts from examples • Different algorithms used to construct a tree based upon the examples • Most popular Iterative Dichotomizer 3 (called ID3, proposed by Quinlan, 1986) • But: • Different trees can be constructed from the same set of examples • Real-life is noisy and often contradictory sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 12 12 End of File
  • 13. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Introduction to Decision Tree (Part 03) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Example: Ice-cream 2 [Decision tree figure, repeated from Part 02] Root node: Outlook. Sunny branch -> Temperature: Hot -> Sell; Mild -> Holiday Season (Yes -> Sell, No -> Don’t Sell); Cold -> Don’t Sell. Overcast branch -> Holiday Season: No -> Don’t Sell; Yes -> Temperature (Hot -> Sell, Mild -> Don’t Sell, Cold -> Don’t Sell). Figure labels: Root node, Branch (value of attribute), Leaf node.
  • 14. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Construction • Concept learning: • Inducing concepts from examples • Different algorithms used to construct a tree based upon the examples • Most popular Iterative Dichotomizer 3 (called ID3, proposed by Quinlan, 1986) • But: • Different trees can be constructed from the same set of examples • Real-life is noisy and often contradictory sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Ambiguous Trees 4 Item X Y Class 1 False False + 2 True False + 3 False True - 4 True True - Consider the following data: pengawet pewarna Pewarna = food coloring Pengawet = preservatives
  • 15. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Ambiguous Trees 5 Y {3,4} Negative {1,2} Positive True False 1st option: Y as root node sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Ambiguous Trees 6 X {2,4} Y {1,3} Y {2} Positive {4} Negative True False True False {1} Positive {3} Negative True False Different trees can be constructed from the same set of examples. Which tree is the best? Based upon choice of attributes at each node in the tree 2nd option: X as root node
  • 16. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 7 End of File
  • 17. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Information Theory in Decision Tree (Part 01) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Ambiguous Trees 2 Item X Y Class 1 False False + 2 True False + 3 False True - 4 True True - Consider the following data: pengawet pewarna Pewarna = food coloring Pengawet = preservatives
  • 18. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Ambiguous Trees 3 Y {3,4} Negative {1,2} Positive True False 1st option: Y as root node sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Ambiguous Trees 4 X {2,4} Y {1,3} Y {2} Positive {4} Negative True False True False {1} Positive {3} Negative True False Different trees can be constructed from the same set of examples. Which tree is the best? Based upon choice of attributes at each node in the tree 2nd option: X as root node
  • 19. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Information Theory • We can use information theory to help us understand: • Which attribute is the best to choose for a particular node of the tree • This is the node that is the best at separating the required predictions, and hence which leads to the best (or at least a good) tree • ‘Information Theory addresses both the limitations and the possibilities of communication’ (MacKay, 2003:16): • Measuring information content • Probability (measure of chance) and entropy (measure of disorder) 5 MacKay, D.J.C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Choosing attributes • Entropy: • Measure of disorder / unexpectedness / uncertainty / surprise / randomness • High entropy means the data has high variance and thus contains a lot of information and/or noise. • For c classification categories: • Attribute a that has value v • Probability of being in category i is pi • Entropy is: E(a = v) = - Σ (pi x log2 pi), summed over the c categories 6
  • 20. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 7 End of File
  • 21. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Information Theory in Decision Tree (Part 02) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Choosing attributes • Entropy: • Measure of disorder / unexpectedness / uncertainty / surprise / randomness • High entropy means the data has high variance and thus contains a lot of information and/or noise. • For c classification categories: • Attribute a that has value v • Probability of being in category i is pi • Entropy is: E(a = v) = - Σ (pi x log2 pi), summed over the c categories 2
  • 22. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 3 The attributes define whether the example is a: 1. City or town: Yes or No 2. Has a university nearby: Yes or No 3. Type of nearby housing estate: None, Small, Medium, Large 4. Quality of public transport: Good, Average, Poor 5. The number of school nearby: Small, Medium, Large Class : + (Yes, locate new bar) - ( No, don’t locate new bar) sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Entropy example • Choice of attributes: • City/Town, University, Housing Estate, Industrial Estate, Transport and Schools • Let’s compute entropy of City/Town • City/Town: is either Y or N • For Y: 7 positive examples, 3 negative • For N: 4 positive examples, 6 negative 4
  • 23. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 5 For Y (Yes): 7 positives, 3 negatives sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 6 For N (No): 4 positives, 6 negatives
  • 24. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 7 End of File
  • 25. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Information Theory in Decision Tree (Part 03) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 2 The attributes define whether the example is a: 1. City or town: Yes or No 2. Has a university nearby: Yes or No 3. Type of nearby housing estate: None, Small, Medium, Large 4. Quality of public transport: Good, Average, Poor 5. The number of school nearby: Small, Medium, Large Class : + (Yes, locate new bar) - ( No, don’t locate new bar)
  • 26. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Entropy example • Choice of attributes: • City/Town, University, Housing Estate, Industrial Estate, Transport and Schools • Let’s compute entropy of City/Town • City/Town: is either Y or N • For Y: 7 positive examples, 3 negative • For N: 4 positive examples, 6 negative 3 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 4 For Y (Yes): 7 positives, 3 negatives
  • 27. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 5 For N (No): 4 positives, 6 negatives sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Entropy example • City/Town as root node: • For c=2 (positive and negative) classification categories • Attribute a=City/Town that has value v=Y • Probability of v=Y being in category positive = 7/10 • Probability of v=Y being in category negative = 3/10 6 For Y (Yes): 7 positives, 3 negatives
  • 28. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Entropy example • City/Town as root node: • For c=2 (positive and negative) classification categories • Attribute a=City/Town that has value v=Y • Entropy E is: E (City/Town = Y) = (-7/10 x log2 7/10) + (- 3/10 x log2 3/10) = -[0.7 x -0.51 + 0.3 x -1.74 ] = 0.881 7 For Y (Yes): 7 positives, 3 negatives sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 8 End of File
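As a quick arithmetic check of the entropy values above, here is a small sketch (Python, added for illustration; not part of the original slides) of the two-class entropy formula.

```python
import math

def entropy(pos, neg):
    """Entropy of a two-class node holding `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log2(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(round(entropy(7, 3), 3))   # 0.881 -> E(City/Town = Y)
print(round(entropy(4, 6), 3))   # 0.971 -> E(City/Town = N)
```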
  • 29. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Information Theory in Decision Tree (Part 04) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Entropy example • Choice of attributes: • City/Town, University, Housing Estate, Industrial Estate, Transport and Schools • City/Town: is either Y or N • For Y: 7 positive examples, 3 negative • For N: 4 positive examples, 6 negative 2
  • 30. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 3 For Y (Yes): 7 positives, 3 negatives sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 4 For N (No): 4 positives, 6 negatives
  • 31. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Entropy Example • City/Town as root node: • For c=2 (positive and negative) classification categories • Attribute a=City/Town that has value v=N • Probability of v=N being in category positive =4/10 • Probability of v=N being in category negative = 6/10 5 For N (No): 4 positives, 6 negatives sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Entropy Example • City/Town as root node: • For c=2 (positive and negative) classification categories • Attribute a=City/Town that has value v=N • Entropy E is: E (City/Town = N) = (-4/10 x log2 4/10) + (- 6/10 x log2 6/10) = 0.971 6 For N (No): 4 positives, 6 negatives
  • 32. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Entropy Example • If the purity of the instances increases, the entropy decreases • High entropy means high disorder / uncertainty • In the example below, 7(+) and 3(-) has lower entropy (0.881) because the purity of the instances is higher (7 City/Town = Y tends to show (+) class) 7 E(City/Town = Y) = 0.881 —> 7 (+) and 3 (-) E(City/Town = N) = 0.971 —> 4 (+) and 6 (-) sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 Entropy 8 Ten instances consist of two classes : + and - Source: F. Provost and T. Fawcett, Data Science for Business, O’Reilly Media, 2013 High entropy means high disorder / uncertainty
  • 33. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 9 Choosing attributes • Information gain: • Expected reduction in entropy (high is good) • Entropy of the whole example set T is E(T) • Examples where attribute a takes value vj form the subset Tj, with entropy E(Tj) • Gain is: Gain(T, a) = E(T) - Σj (|Tj| / |T|) x E(Tj) • |T| = total samples = 20 • |Tj| = number of samples with value vj (Y/N values) 9 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 10 Root of tree • For root of tree, there are 20 examples: • For c=2 (positive and negative) classification categories • Probability of being positive class with 11 examples = 11/20 • Probability of being negative with 9 examples = 9/20 10
  • 34. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 11 Information gain example • For root of tree there are 20 examples: • For c=2 (positive and negative) classification categories • Entropy of all training examples E(T) is: |T | = 20 E(T) = (-11/20 x log2 11/20) + (- 9/20 x log2 9/20) = 0.993 11 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 12 Entropy example 12 E(City/Town = Y) = 0.881 —> 7 (+) and 3 (-) E(City/Town = N) = 0.971 —> 4 (+) and 6 (-) Total sample for E(City/Town = Y) = 10 Total sample for E(City/Town = N) = 10
  • 35. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 13 Information gain example • City/Town as root node: • 10 examples for a=City/Town and value v=Y • |Tj=Y | = 10 E(Tj=Y) = 0.881 • 10 examples for a=City/Town and value v=N • |Tj=N | = 10 E(Tj=N) = 0.971 13 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 14 14 End of File
  • 36. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Information Theory in Decision Tree (Part 05) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Choosing attributes • Information gain: • Expected reduction in entropy (high is good) • Entropy of the whole example set T is E(T) • Examples where attribute a takes value vj form the subset Tj, with entropy E(Tj) • Gain is: Gain(T, a) = E(T) - Σj (|Tj| / |T|) x E(Tj) • |T| = total samples = 20 • |Tj| = number of samples with value vj (Y/N values) 2
  • 37. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Information gain example • For root of tree there are 20 examples: • For c=2 (positive and negative) classification categories • Entropy of all training examples E(T) is: |T | = 20 E(T) = (-11/20 x log2 11/20) + (- 9/20 x log2 9/20) = 0.993 3 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Information gain example • City/Town as root node: • 10 examples for a=City/Town and value v=Y • |Tj=Y | = 10 E(Tj=Y) = 0.881 • 10 examples for a=City/Town and value v=N • |Tj=N | = 10 E(Tj=N) = 0.971 4
  • 38. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Now, we compute information gain of another attribute: Transport sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 6 The attributes define whether the example is a: 1. City or town: Yes or No 2. Has a university nearby: Yes or No 3. Type of nearby housing estate: None, Small, Medium, Large 4. Quality of public transport: Good, Average, Poor 5. The number of school nearby: Small, Medium, or Large E(T) = 0.993
  • 39. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Compute the entropy of transport • Transport: • For c=2 (positive and negative) classification categories • Attribute a=Transport that has value v=G • Probability of v=G being in category positive = 5/5 = 1 • Probability of v=G being in category negative = 0/5 = 0 7 E(Transport = G) = (-5/5 x log2 5/5) + (0) = 0 Quality of public transport: Good, Average, Poor sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 Compute the entropy of transport • Transport: • For c=2 (positive and negative) classification categories • Attribute a=Transport that has value v=A • Probability of v=A being in category positive = 3/7 = 0.429 • Probability of v=A being in category negative = 4/7 = 0.571 8 E(Transport = A) = (-3/7 x log2 3/7) + (-4/7 x log2 4/7) = -[3/7 x -1.22 + 4/7 x -0.808] = 0.524 + 0.461 = 0.985 Quality of public transport: Good, Average, Poor
  • 40. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 9 Compute the entropy of transport • Transport: • For c=2 (positive and negative) classification categories • Attribute a=Transport that has value v=P • Probability of v=P being in category positive = 3/8 = 0.375 • Probability of v=P being in category negative = 5/8 = 0.625 9 E(Transport = P) = (-3/8 x log2 3/8) + (-5/8 x log2 5/8) = - [3/8 x -1.415 + 5/8 x -0.678] = 0.530 + 0.424 = 0.954 Quality of public transport: Good, Average, Poor sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 10 Information gain of transport 10 Gain(T, Transport) = 0.993 – ((5/20 x 0) + (7/20 x 0.985) + (8/20 x 0.954)) = 0.993 - (0.345 + 0.382) = 0.266 A = average P = poor G = good
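The same calculation can be scripted. Below is a hedged sketch (Python, illustrative only; function names are my own) that recomputes Gain(T, Transport) from the class counts read off the slides (parent 11(+)/9(-); Good 5/0, Average 3/4, Poor 3/5) and, for comparison, the gain of City/Town.

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def information_gain(parent, splits):
    # Gain(T, a) = E(T) - sum_j (|Tj| / |T|) * E(Tj)
    total = sum(parent)
    return entropy(*parent) - sum((p + n) / total * entropy(p, n) for p, n in splits)

# Transport: Good 5(+)/0(-), Average 3(+)/4(-), Poor 3(+)/5(-); parent 11(+)/9(-).
print(round(information_gain((11, 9), [(5, 0), (3, 4), (3, 5)]), 3))   # 0.266
# City/Town: Yes 7(+)/3(-), No 4(+)/6(-).
print(round(information_gain((11, 9), [(7, 3), (4, 6)]), 3))           # 0.067
```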
  • 41. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 11 Information gain: City/Town vs. Transport 11 Gain(T, City/Town) = 0.993 - ((10/20 x 0.881) + (10/20 x 0.971)) = 0.067, while Gain(T, Transport) = 0.266 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 12 Choosing attributes • Choose the root node as the attribute that gives the highest Information Gain • In this case attribute Transport, IG = 0.266 • Branches from the root node then become the values associated with the attribute • Recursive calculation of IG of attributes/nodes • Filter examples by attribute value 12
  • 42. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 13 Recursive example for selecting the next attribute • Example: with Transport as the root node: • Select examples where Transport is Average: (1, 3, 6, 8, 11, 15, 17) • Use only these examples to construct this branch of the tree. • Select other attributes by computing the Information Gain (IG), get the highest one. • Repeat for each value of Transport (Poor, Good) 13 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 14 Final tree 14 Transport {7,12,16,19,20} Positive A P G {8} Negative {6} Negative {1,3,6,8,11,15,17} Housing Estate L M S N {11,17} Industrial Estate {17} Negative {11} Positive Y N {1,3,15} University {15} Negative {1,3} Positive Y N Callan 2003:243 {5,9,14} Positive {2,4,10,13,18} Negative {2,4,5,9,10,13,14,18} Industrial Estate Y N The attributes define whether the example is a: 1. City or town: Yes or No 2. Has a university nearby: Yes or No 3. Type of nearby housing estate: None, Small, Medium, Large 4. Quality of public transport: Good, Average, Poor 5. The number of school nearby: Small, Medium, or Large
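ID3 itself is only described in outline on the slides (pick the attribute with the highest information gain, split, and recurse on each branch). The following compact sketch is my own illustrative implementation of that idea, not the lecturer's code; run on the small X/Y "Ambiguous Trees" example, it places Y at the root, matching the earlier discussion of the two candidate trees.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attributes:                             # nothing left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attributes if a != best])
    return tree

# The X/Y "Ambiguous Trees" example: attribute index 0 = X (pengawet), 1 = Y (pewarna).
rows = [("False", "False"), ("True", "False"), ("False", "True"), ("True", "True")]
labels = ["+", "+", "-", "-"]
print(id3(rows, labels, attributes=[0, 1]))   # Y (index 1) ends up at the root
```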
  • 43. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 15 15 End of File
  • 44. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Clustering (Part 01) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Unsupervised learning • Learning without a teacher: no labels / classes in the data • Normally used: • when we want to explore unlabeled data for the first time (to create a training set for later prediction or classification) • when we want to do voter profiling based on their characteristics and online activities (useful for political campaigns) • when we want to find associations between customer products (recommender systems in e-commerce).
  • 45. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Clustering analysis The daily expenditures on food (X1) and clothing (X2) of five persons are shown in a table below sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Clustering analysis • The numbers are fictitious and not at all realistic, but the example will help us explain the essential features of cluster analysis as simply as possible. The data in the table are plotted in the figure below.
  • 46. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Clustering analysis • Inspection of the figure suggests that the five observations form two clusters. • The first consists of persons a and d, and the second of b, c and e. • It can be noted that the observations in each cluster are similar to one another with respect to expenditures on food (X1) and clothing (X2), and that the two clusters are quite distinct from each other. • This inspection was possible because only two variables were involved in grouping the observations. The question is: Can a procedure be devised for similarly grouping observations when there are more than two variables or attributes? sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Measures of distances for variables • Clustering methods require a more precise definition of "similarity" ("closeness", "proximity") of observations and clusters. • When the grouping is based on variables, it is natural to employ the familiar concept of distance. • Consider the right figure as a map showing two points, i and j, with coordinates (X1i,X2i) and (X1j ,X2j), respectively.
  • 47. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 Euclidean distance • The Euclidean distance between the two points is the hypotenuse of the triangle ABC: D(i,j) = sqrt((X1i - X1j)^2 + (X2i - X2j)^2) • An observation i is declared to be closer (more similar) to j than to observation k if D(i,j) < D(i,k). • An alternative measure is the squared Euclidean distance. In the figure, the squared distance between the two points i and j is D2(i,j) = (X1i - X1j)^2 + (X2i - X2j)^2 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 Clustering methods • Nearest neighbor (or single linkage) method • Furthest neighbor (or complete linkage) method • K-means method
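A minimal sketch of the two distance measures (Python, added for illustration). The sample points are hypothetical; the slide's actual expenditure values are not reproduced in this transcript.

```python
import math

def euclidean(p, q):
    """Straight-line distance between 2-D points p = (X1, X2) and q = (X1, X2)."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def squared_euclidean(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# Two hypothetical expenditure points (food, clothing), not the slide's table values.
i, j = (2.0, 4.0), (1.0, 5.0)
print(round(euclidean(i, j), 3))   # 1.414
print(squared_euclidean(i, j))     # 2.0
```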
  • 48. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 9 Nearest neighbor method • One of the simplest methods is to treat the distance between the two nearest observations, one from each cluster, as the distance between the two clusters. • This is known as the nearest neighbor (or single linkage) method. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 10 Nearest neighbor method Let us suppose that Euclidean distance is the appropriate measure of proximity. We begin with each of the five observations forming its own cluster. The distance between each pair of observations is shown in the figure below.
  • 49. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 11 Nearest neighbor method • For example, the distance between a and b is • Observations b and e are nearest (most similar) and, as shown in figure (b), are grouped in the same cluster. • Assuming the nearest neighbor method is used, the distance between the cluster (be) and another observation is the smaller of the distances between that observation, on the one hand, and b and e, on the other. For example, sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 12 Nearest neighbor method • Observations a and d are the nearest, with distance 1.414 • We arbitrarily select (a,d) as the new cluster • Then, the distance between (be) and (ad) is • while that between c and (ad) is
  • 50. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 13 Nearest neighbor method • We finally merge (be) with c to form the cluster (bce) shown below sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 14 Nearest neighbor method • The grouping of these two clusters, it will be noted, occurs at a distance of 6.325, a much greater distance than that at which the earlier groupings took place.
  • 51. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 15 Nearest neighbor method • The groupings and the distance at which these took place are also shown in the tree diagram (dendrogram) sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 16 Nearest neighbor method • One usually searches the dendrogram for large jumps in the grouping distance as guidance in arriving at the number of groups. • In this illustration, it is clear that the elements in each of the clusters (ad) and (bce) are close (they were merged at a small distance) • However, the clusters are distant (the distance at which they merge is large). • Thus, we conclude that there are two clusters instead of one big cluster. Largest jumps
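For readers who want to reproduce this kind of dendrogram, here is a hedged sketch using SciPy's hierarchical clustering. The five 2-D points are hypothetical stand-ins for the expenditure table (chosen so that {a, d} and {b, c, e} come out as the two groups); "single" linkage is the nearest neighbor method, and "complete" linkage is the furthest neighbor method discussed next.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (food, clothing) expenditures for persons a..e, not the slide's values.
points = np.array([[2.0, 4.0],    # a
                   [8.0, 2.0],    # b
                   [9.0, 3.0],    # c
                   [1.0, 5.0],    # d
                   [8.5, 1.0]])   # e

single = linkage(points, method="single")      # nearest neighbor (single linkage)
complete = linkage(points, method="complete")  # furthest neighbor (complete linkage)

# Cut each tree into two clusters and print the membership of a..e.
print(fcluster(single, t=2, criterion="maxclust"))
print(fcluster(complete, t=2, criterion="maxclust"))
```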
  • 52. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 17 Furthest neighbor method • Under the furthest neighbor (or complete linkage) method, the distance between two clusters is the distance between their two most distant members, one from each cluster. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 18 Furthest neighbor method • The distances between all pairs of observations shown in the figure are the same as with the nearest neighbor method. • Therefore, the furthest neighbor method also calls for grouping b and e at Step 1. • However, the distances between (be), on the one hand, and the clusters (a), (c), and (d), on the other, are different:
  • 53. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 19 Furthest neighbor method • The four clusters remaining at Step 2 and the distances between these clusters are shown below • The nearest clusters are (a) and (d), which are now grouped into the cluster (ad). The remaining steps are similarly executed. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 20 20 End of File
  • 54. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Clustering (Part 02) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 K-means clustering • k-means clustering is a technique used to uncover categories. • In the retail sector, it can be used to categorize both products and customers. • k represents the number of categories identified, with each category’s average (mean) characteristics being appreciably different from that of other categories.
  • 55. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Determining cluster membership • Specify the number of clusters arbitrarily. • We can then determine cluster membership. This involves a simple iterative process. • We will illustrate this process with a 2-cluster example: Step 1: Start by making a guess on where the central points of each cluster are. Let’s call these pseudo-centers, since we do not yet know if they are actually at the center of their clusters. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Determining cluster membership Step 2: Assign each data point to the nearest pseudo-center (measured by euclidean distance). By doing so, we have just formed clusters, with each cluster comprising all data points associated with its pseudo-center.
  • 56. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 5 Determining cluster membership Step 3: Update the location of each cluster’s pseudo-center, such that it is now indeed in the center of all its members (cluster’s centroid). NOTE: The cluster centroid is the point with coordinates equal to the average values of the variables for the observations in that cluster. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 6 Determining cluster membership Step 4: Repeat the steps of re-assigning cluster members (Step 2) and re-locating cluster centers (Step 3), until there are no more changes to cluster membership.
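The four steps above translate almost directly into code. Below is a minimal k-means sketch (Python, illustrative; names and data are my own, not the slides') that initializes pseudo-centers, assigns points to the nearest center, recomputes centroids, and stops when membership no longer changes.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                    # Step 1: initial pseudo-centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                               # Step 2: assign to nearest center
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        new_centers = [                                # Step 3: move centers to centroids
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:                     # Step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

# Hypothetical (food, clothing) expenditures for five persons.
points = [(2, 4), (8, 2), (9, 3), (1, 5), (8.5, 1)]
centers, clusters = kmeans(points, k=2)
print(centers)
print(clusters)
```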
  • 57. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 7 K-means method Suppose two clusters are to be formed for the observations listed in a table below, showing the daily expenditures on food (X1) and clothing (X2) of five persons sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 8 K-means method • Step 1: we begin by arbitrarily assigning a, b and d to Cluster 1, and c and e to Cluster 2. The cluster centroids are calculated as shown in the table.
  • 58. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 9 K-means method • The cluster centroid is the point with coordinates equal to the average values of the variables for the observations in that cluster. • Thus, the centroid of Cluster 1 is the point (X1 = 3.67, X2 = 3.67), and that of Cluster 2 the point (8.75, 2). The two centroids are marked by C1 and C2. • The cluster's centroid, therefore, can be considered the center of the observations in the cluster. sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 10 K-means method • We now calculate the distance between a and the two centroids: • Observe that a is closer to the centroid of Cluster 1, to which it is currently assigned. Therefore, a is not reassigned. • Next, we calculate the distance between b and the two cluster centroids:
  • 59. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 11 K-means method • Step 2: since b is closer to Cluster 2's centroid than to that of Cluster 1, it is reassigned to Cluster 2. The new cluster centroids are calculated as shown in figure (a). sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 12 K-means method • The new centroids are plotted. The distances of the observations from the new cluster centroids are as follows (an asterisk indicates the nearest centroid): • Every observation belongs to the cluster to the centroid of which it is nearest, and the k- means method stops. The elements of the two clusters are shown in the table.
  • 60. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 13 13 End of File
  • 61. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 1 Sunu Wibirama sunu@ugm.ac.id Department of Electrical and Information Engineering Faculty of Engineering Universitas Gadjah Mada INDONESIA Association Rules (Part 01) Kecerdasan Buatan | Artificial Intelligence Version: January 2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 2 Supermarket’s problem • When we go grocery shopping, we often have a standard list of things to buy. • Each shopper has a distinctive list, depending on one’s needs and preferences. • A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy fruits and chips. • Understanding these buying patterns can help to increase sales in several ways.
  • 62. 27/06/2022 sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 3 Supermarket’s problem If there is a pair of items, X and Y, that are frequently bought together: • Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other. • Promotional discounts could be applied to just one out of the two items. • Advertisements on X could be targeted at buyers who purchase Y. • X and Y could be combined into a new product, such as having Y in flavors of X. While we may know that certain items are frequently bought together, the question is, how do we uncover these associations? sunu@ugm.ac.id Copyright © 2022 Sunu Wibirama | Do not distribute without permission @sunu_wibirama 4 Association rules (1/3) Table 1. Example Transactions • Association rules analysis is a technique to uncover how items are associated to each other. There are three common ways to measure association. • Measure 1: Support. This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. • In Table 1, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%. • If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits, you might consider using that proportion as your support threshold. • You may then identify itemsets with support values above this threshold as significant itemsets.
  • 63. Association rules (2/3) (Table 1. Example Transactions)
    • Measure 2: Confidence. This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}.
    • It is measured by the proportion of transactions containing item X in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.
    • One drawback of the confidence measure is that it might misrepresent the importance of an association.
    • This is because it only accounts for how popular apples are, but not beers.
    • If beers are also very popular in general, there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the confidence measure.
    • To account for the base popularity of both constituent items, we use a third measure called lift.
    • From Table 1: support {apple, beer} = 3/8; support {apple} = 4/8; confidence {apple -> beer} = (3/8) / (4/8) = 3/4.
    Association rules (3/3) (Table 1. Example Transactions)
    • Measure 3: Lift. This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is.
    • In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items.
    • A lift value greater than 1 means item Y is likely to be bought if item X is bought; a lift value less than 1 means item Y is unlikely to be bought if item X is bought.
    • From Table 1: support {apple, beer} = 3/8; support {apple} = 4/8; support {beer} = 6/8; support {apple} × support {beer} = 24/64; lift {apple -> beer} = (3/8) / (24/64) = 1. A worked check of these formulas is given below.
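As a quick check of the two formulas, the fractions quoted from Table 1 can be plugged in directly; this is just the slide's arithmetic restated, with no new data.

```python
from fractions import Fraction

sup_apple      = Fraction(4, 8)   # support {apple}
sup_beer       = Fraction(6, 8)   # support {beer}
sup_apple_beer = Fraction(3, 8)   # support {apple, beer}

# Confidence {apple -> beer} = support{apple, beer} / support{apple}
confidence = sup_apple_beer / sup_apple
print(confidence)                 # 3/4, i.e. 75%

# Lift {apple -> beer} = support{apple, beer} / (support{apple} * support{beer})
lift = sup_apple_beer / (sup_apple * sup_beer)
print(lift)                       # 1 -> no association between apple and beer
```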
  • 64. Illustration of association rules
    The network graph shows associations between selected items in a supermarket. Larger circles imply higher support, while red circles imply higher lift. Several purchase patterns can be observed:
    • The most popular transaction was of pip and tropical fruits (#1).
    • Another popular transaction was of onions and other vegetables (#2).
    • If someone buys meat spreads, he is likely to have bought yogurt as well (#3).
    • Relatively many people buy sausage along with sliced cheese (#4).
    • If someone buys tea, he is likely to have bought fruit as well, possibly inspiring the production of fruit-flavored tea (#5).
    End of File
  • 65. Association Rules (Part 02)
    Sunu Wibirama, sunu@ugm.ac.id
    Department of Electrical and Information Engineering, Faculty of Engineering, Universitas Gadjah Mada, INDONESIA
    Kecerdasan Buatan | Artificial Intelligence. Version: January 2022
    How to use support, confidence, and lift
    • The {beer -> soda} rule has the highest confidence at 20% (see Table 3).
    • However, both beer and soda appear frequently across all transactions (see Table 2), so their association could simply be a fluke.
    • This is confirmed by the lift value of {beer -> soda}, which is 1, implying no association between beer and soda.
    (Table 2. Support of individual items; Table 3. Association measures for beer-related rules)
  • 66. How to use support, confidence, and lift
    • On the other hand, the {beer -> male cosmetics} rule has a low confidence, due to few purchases of male cosmetics in general (see Table 3).
    • However, whenever someone does buy male cosmetics, he is very likely to buy beer as well, as inferred from the high lift value of 2.6 (see Table 3).
    • The converse is true for {beer -> berries}.
    • With a lift value below 1, we may conclude that if someone buys berries, he is likely to be averse to beer.
    (Table 2. Support of individual items; Table 3. Association measures for beer-related rules)
    Apriori algorithm
    • The apriori principle can reduce the number of itemsets we need to examine.
    • Put simply, the apriori principle states that if an itemset is infrequent, then all its supersets must also be infrequent.
    • This means that if {beer} was found to be infrequent, we can expect {beer, pizza} to be equally or even more infrequent.
    • So, in consolidating the list of popular itemsets, we need not consider {beer, pizza}, nor any other itemset configuration that contains beer.
  • 67. Apriori algorithm
    Using the apriori principle, the number of itemsets that have to be examined can be pruned, and the list of popular itemsets can be obtained in these steps (a short sketch follows this list):
    Step 0. Start with itemsets containing just a single item, such as {apple} and {pear}.
    Step 1. Determine the support for the itemsets. Keep the itemsets that meet your minimum support threshold, and remove those that do not.
    Step 2. Using the itemsets kept from Step 1, generate all the possible larger itemset configurations.
    Step 3. Repeat Steps 1 and 2 until there are no more new itemsets.
    • As seen in the animation, {apple} was determined to have low support, hence it was removed, and all other itemset configurations that contain apple need not be considered.
    • This reduced the number of itemsets to consider by more than half.
    • Note that the support threshold you pick in Step 1 could be based on formal analysis or past experience.
    • If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits, you might consider using that proportion as your support threshold.
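A minimal sketch of this level-by-level procedure, reusing the hypothetical transaction list from the support example above; the 25% threshold and the helper names are illustrative, not from the slides.

```python
from itertools import combinations

# Hypothetical transactions (same stand-in as in the support example above).
transactions = [
    {"apple", "beer", "rice"}, {"apple", "beer", "rice"}, {"apple", "beer"},
    {"apple"}, {"beer"}, {"beer"}, {"beer"}, {"rice"},
]
min_support = 0.25  # illustrative threshold

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 0: start with itemsets containing just a single item.
items = sorted({item for t in transactions for item in t})
level = [frozenset([i]) for i in items]
frequent = []

while level:
    # Step 1: keep only the itemsets that meet the support threshold.
    kept = [s for s in level if support(s) >= min_support]
    frequent.extend(kept)
    # Step 2: combine surviving itemsets into candidates one item larger.
    size = len(level[0]) + 1
    candidates = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
    # Apriori pruning: a candidate is viable only if all its subsets survived.
    kept_set = set(kept)
    level = [c for c in candidates
             if all(frozenset(sub) in kept_set for sub in combinations(c, size - 1))]
    # Step 3: the loop repeats until no new itemsets are generated.

for s in frequent:
    print(set(s), support(s))
```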
  • 68. Finding item rules with high confidence or lift
    • We have seen how the apriori algorithm can be used to identify itemsets with high support.
    • The same principle can also be used to identify item associations with high confidence or lift.
    • Finding rules with high confidence or lift is less computationally taxing once high-support itemsets have been identified, because confidence and lift values are calculated using support values.
    • Take, for example, the task of finding high-confidence rules.
    • If the rule {beer, chips -> apple} has low confidence, then all other rules with the same constituent items and with apple on the right-hand side would have low confidence too.
    • Specifically, the rules {beer -> apple, chips} and {chips -> apple, beer} would have low confidence as well.
    • As before, lower-level candidate item rules can be pruned using the apriori principle, so that fewer candidate rules need to be examined. A sketch of this rule-generation step follows.
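The sketch below generates every rule X -> Y from one frequent itemset and filters by confidence. This is a brute-force illustration of the rule-generation step rather than the pruned search described above; the support values come from the same hypothetical transactions used earlier, and the 60% confidence threshold is illustrative.

```python
from itertools import combinations

# Hypothetical transactions (same stand-in as before).
transactions = [
    {"apple", "beer", "rice"}, {"apple", "beer", "rice"}, {"apple", "beer"},
    {"apple"}, {"beer"}, {"beer"}, {"beer"}, {"rice"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def rules_from(itemset, min_confidence=0.6):
    """Generate rules X -> Y from a frequent itemset, keeping only those whose
    confidence = support(itemset) / support(X) meets the threshold."""
    itemset = frozenset(itemset)
    kept = []
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            rhs = itemset - lhs
            confidence = support(itemset) / support(lhs)
            if confidence >= min_confidence:
                kept.append((set(lhs), set(rhs), confidence))
    return kept

for lhs, rhs, conf in rules_from({"apple", "beer", "rice"}):
    print(f"{lhs} -> {rhs}: confidence = {conf:.2f}")
```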
  • 69. Limitations
    • Computationally expensive. Even though the apriori algorithm reduces the number of candidate itemsets to consider, this number can still be huge when store inventories are large or when the support threshold is low. An alternative is to reduce the number of comparisons by using advanced data structures to sort candidate itemsets more efficiently.
    • Spurious (fake) associations. Analysis of large inventories involves more itemset configurations, and the support threshold might have to be lowered to detect certain associations. However, lowering the support threshold might also increase the number of spurious associations detected.
    End of File