This document discusses association rule mining and clustering. Association rule mining aims to identify relationships between different items in transactional data using measures like support, confidence and lift. The Apriori algorithm is described as a popular method for mining association rules. Clustering techniques discussed include k-means, k-medoids, and hierarchical agglomerative clustering. K-means groups objects by assigning them to centroids while k-medoids uses actual objects as cluster centers. Hierarchical clustering creates nested clusters by successively merging or dividing clusters based on distance measures.
2. Association Rule Mining
Association rule mining asks "what goes with what".
Association rule mining is a technique to identify underlying relations between different items.
Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transactions.
The process of identifying associations between products is known as market basket analysis.
3. More profit can be generated if the relationships between the items purchased in different transactions can be identified.
For instance, if items A and B are frequently bought together, several steps can be taken to increase profit. For example:
A and B can be placed together, so that a customer who buys one of the products does not have to go far to buy the other.
People who buy one of the products can be targeted with an advertising campaign for the other.
A collective discount can be offered on these products if the customer buys both of them.
A and B can be packaged together.
5. Association Rules
An association rule is interpreted as an "if-then" statement.
Association rules are probabilistic in nature.
Some possible association rules are:
{Bread} -> {Eggs}
{Bread, Cereal} -> {Eggs}
A collection of one or more items is called an itemset.
7. The possible associations can be many.
We may be interested in finding only the strong associations.
But how do we find strong associations?
Answer: Support, Confidence & Lift.
Support and confidence are the measures used to confirm a rule as a strong association rule.
These two measures express the degree of uncertainty about the rule.
The antecedent and consequent must be disjoint sets.
8. Theory of the Apriori Algorithm
There are three major components of the Apriori algorithm:
Support (prevalence/popularity)
Confidence (predictability) – how likely the consequent is purchased when the antecedent is purchased
Lift (interest) – how much more often the items occur together than expected by chance
9. Three key terms to determine rules
Lift = 1 means there is no association between products A and B.
Lift > 1 means products A and B are more likely to be bought together.
Lift < 1 means products A and B are unlikely to be bought together.
Support(X) = freq(X)/N
Support(Y) = freq(Y)/N
Some algorithms take the support (a fraction) and some take the support count (a raw frequency).
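As an illustration of these measures, here is a minimal Python sketch that computes support, confidence and lift for a rule X -> Y over a small, made-up list of transactions (the transaction data is purely illustrative).

```python
# Minimal sketch: support, confidence and lift for a rule X -> Y
# over a small, hypothetical transaction list.
transactions = [
    {"Bread", "Eggs", "Milk"},
    {"Bread", "Cereal"},
    {"Bread", "Eggs"},
    {"Cereal", "Eggs"},
    {"Bread", "Cereal", "Eggs"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item of the itemset: freq(X)/N
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

X, Y = {"Bread"}, {"Eggs"}
sup_rule = support(X | Y, transactions)            # support of X -> Y
confidence = sup_rule / support(X, transactions)   # P(Y | X)
lift = confidence / support(Y, transactions)       # > 1: positive association

print(f"support={sup_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```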
11. Steps in the Apriori algorithm
Step 1: Generate all frequent itemsets.
Candidate generation forms the possible combinations of itemsets via a join operation.
A frequent itemset is an itemset that appears in at least the minimum-support number of transactions in the transaction database.
Eg: {A}, {B}, {C} – frequent itemsets at k=1
{A,B} {A,C} {B,C} – candidate generation at k=2
Step 2: Generate strong association rules.
12. Step 1: Finding the frequent itemsets
Let k=1.
Generate frequent itemsets of length 1.
Repeat until no new frequent itemsets are identified:
Increment k and create a candidate list of k-itemsets by performing a join operation on pairs of frequent (k-1)-itemsets in the list.
Prune candidate itemsets containing any subset of length (k-1) that is infrequent.
Count the support of each candidate by scanning the DB.
Eliminate candidates that are infrequent, leaving the list with only those that are frequent.
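A compact Python sketch of Step 1, following the join-and-prune loop above. The function name and the minimum support count parameter are illustrative, not from the original slides.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return {itemset: support_count} for all frequent itemsets (sketch)."""
    # k = 1: start from frequent single items
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    current = {s for s in items
               if sum(1 for t in transactions if s <= t) >= min_support_count}
    k = 1
    while current:
        for s in current:
            frequent[s] = sum(1 for t in transactions if s <= t)
        k += 1
        # Join step: combine pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count support by scanning the database; keep only frequent candidates
        current = {c for c in candidates
                   if sum(1 for t in transactions if c <= t) >= min_support_count}
    return frequent
```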
14. Step 2: Generate strong rules
Formulate all the possible antecedent -> consequent combinations of each frequent itemset.
Calculate the confidence of each rule.
Choose the rules with the highest confidence.
Also calculate the lift and check whether the rule has lift > 1.
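Continuing the sketch, Step 2 can be written as follows, reusing the support counts produced by the frequent-itemset function above (N is the total number of transactions; min_confidence is a hypothetical threshold).

```python
from itertools import combinations

def generate_rules(frequent, N, min_confidence):
    """Yield (antecedent, consequent, confidence, lift) for strong rules (sketch)."""
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                # Both subsets are in `frequent` by the Apriori property
                confidence = count / frequent[antecedent]
                lift = confidence / (frequent[consequent] / N)
                if confidence >= min_confidence and lift > 1:
                    yield antecedent, consequent, confidence, lift
```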
17. To speed up the process,
Set a minimum value for support and confidence.
This means that we are only interested in finding rules for items that have a certain minimum presence (support) and a minimum value for co-occurrence with other items (confidence).
Extract all the subsets having a support higher than the minimum threshold.
Select all the rules from the subsets with a confidence value higher than the minimum threshold.
Order the rules in descending order of lift.
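In practice this thresholding workflow is usually delegated to a library. The sketch below assumes the third-party mlxtend package and a small made-up transaction list; the calls reflect common mlxtend usage and may differ slightly across versions.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["Bread", "Eggs"], ["Bread", "Cereal"], ["Bread", "Cereal", "Eggs"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Keep only itemsets above the minimum support threshold
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Keep only rules above the minimum confidence threshold, then sort by lift
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
rules = rules.sort_values("lift", ascending=False)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```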
18. Advantage
Every subset of a frequent itemset is also a frequent itemset (the Apriori property).
This reduces the number of candidates being considered, because only itemsets whose support count is greater than the minimum support count are expanded.
Any candidate itemset can be pruned if it has an infrequent subset.
19. Some other ARM Algorithms
FP Growth
AIS
SETM
ECLAT
20. Practice Problem
Find the support, confidence and lift for the rule {Apples, Milk} -> {Cheese}.
Apply the Apriori algorithm to find the frequent itemsets.
22. Cluster Analysis
Clustering is the process of grouping objects which are similar.
Clustering is an unsupervised learning technique.
Objects within a cluster are similar, and objects in different clusters are dissimilar.
The objects can be grouped based on attributes/features or by relationships with other objects (distance or similarity).
Clustering does not require assumptions about category labels that tag objects with prior identifiers.
Clustering is subjective (or problem dependent) and can be used to summarize data.
23. Applications
Customer relationship management
Information retrieval
Data compression
Image processing
Marketing
Medicine
Pattern recognition
24. Similarity Measurement
Grouping is done based on closeness or similarity.
One way of doing this is to measure the distance between objects.
Distance measurement methods:
Euclidean distance
Manhattan distance
Chebyshev distance
Percentage disagreement
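A short Python sketch of the first three distance measures for two numeric feature vectors (pure Python; the only assumption is that both vectors have the same length).

```python
def euclidean(a, b):
    # Square root of the sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

print(euclidean((1, 1), (2, 4)), manhattan((1, 1), (2, 4)), chebyshev((1, 1), (2, 4)))
```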
28. Percent Disagreement
Suited for features that are categorical in nature.
Distance(Oi, Oj) = 100 * (number of features k where Oik ≠ Ojk) / n
n represents the number of features.
Distance(O1, O2) = 100 * (1/4) = 25%
Distance(O1, O3) = 100 * (2/4) = 50%
Distance(O2, O3) = 100 * (3/4) = 75%
Object | Gender | Age bracket | Income level | BP
O1 | M | 20-30 | Low | Normal
O2 | M | 30-40 | Low | Normal
O3 | F | 20-30 | Medium | Normal
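The O1/O2/O3 computation above can be reproduced with a small Python sketch for categorical feature vectors.

```python
def percent_disagreement(oi, oj):
    # Percentage of features on which the two objects disagree
    mismatches = sum(1 for a, b in zip(oi, oj) if a != b)
    return 100 * mismatches / len(oi)

O1 = ("M", "20-30", "Low", "Normal")
O2 = ("M", "30-40", "Low", "Normal")
O3 = ("F", "20-30", "Medium", "Normal")

print(percent_disagreement(O1, O2))  # 25.0
print(percent_disagreement(O1, O3))  # 50.0
print(percent_disagreement(O2, O3))  # 75.0
```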
29. Types of Clustering
Partitional – we construct various partitions and
then evaluate them by some criteria
K-means
K-medoids
Hierarchical – we create hierarchical
decomposition of the set of objects using some
criterion
Bottom up – agglomerative
Initially, each point is a cluster
Repeatedly combine the two nearest clusters into one
Top-down – divisive
Start with one cluster and recursively split it
30. K-means
Step 1: Choose k objects arbitrarily from D as the initial cluster centers.
Step 2: Repeat
Step 3: Calculate the distance between each data point and each cluster center.
Step 4: Assign each object to the cluster of the nearest center.
Step 5: Update the cluster means.
Step 6: Until no change.
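A minimal Python sketch of these steps using squared Euclidean distance; for simplicity the initial centers are the first k objects, whereas real implementations usually pick them at random.

```python
def kmeans(points, k, max_iter=100):
    # Step 1: choose k objects as the initial cluster centers
    centers = [list(p) for p in points[:k]]
    for _ in range(max_iter):                       # Step 2: repeat
        clusters = [[] for _ in range(k)]
        for p in points:
            # Steps 3-4: assign each object to the nearest center
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Step 5: update the cluster means (keep the old center if a cluster is empty)
        new_centers = [[sum(col) / len(c) for col in zip(*c)] if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                  # Step 6: until no change
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)], k=2)
print(centers)  # roughly (0.67, 1) and (2.5, 4.5) for this data (order may vary)
```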
32. Example
As a simple illustration of the k-means algorithm, consider the following data set consisting of the scores of two variables for each of five individuals. This data set is to be grouped into two clusters.
Subject X1 X2
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
33. Choose the cluster centroids
Group | Individual | Mean vector (centroid)
Group 1 | A | (1, 1)
Group 2 | D | (2, 4)
34. Calculate the distance between each pair of objects using a euclidean/manhattan/Chebyshev distance measure (the values below are Chebyshev distances).
Subject  A  B  C  D  E
A        0
B        1  0
C        1  2  0
D        3  4  2  0
E        4  5  3  1  0
35. Calculate the distance (using euclidean/manhattan/Chebyshev) of each individual to each chosen centroid. Mark 1 under the cluster with the minimum distance. Eg: for object A, min(0, 3) = 0, so put 1 under cluster 1. Rearrange the clusters and recompute the centroids.
Object | Cluster 1 (1,1) A | Cluster 2 (2,4) D
A | 0 | 3
B | 1 | 4
C | 1 | 2
D | 3 | 0
E | 4 | 1

Object | Cluster 1 (1,1) | Cluster 2 (2,4)
A | 1 | 0
B | 1 | 0
C | 1 | 0
D | 0 | 1
E | 0 | 1
New centroids:
Cluster 1: ((1+1+0)/3, (1+0+2)/3) = (2/3, 3/3) ≈ (0.67, 1)
Cluster 2: ((2+3)/2, (4+5)/2) = (5/2, 9/2) = (2.5, 4.5)
36. Repeat until there is no change in the centroids.
Object | Cluster 1 (0.67, 1) | Cluster 2 (2.5, 4.5)
A | 0.33 | 3.5
B | 1 | 4.5
C | 1 | 2.5
D | 3 | 0.5
E | 4 | 0.5

Object | Cluster 1 (0.67, 1) | Cluster 2 (2.5, 4.5)
A | 1 | 0
B | 1 | 0
C | 1 | 0
D | 0 | 1
E | 0 | 1

New centroids:
Cluster 1: ((1+1+0)/3, (1+0+2)/3) = (2/3, 3/3) ≈ (0.67, 1)
Cluster 2: ((2+3)/2, (4+5)/2) = (5/2, 9/2) = (2.5, 4.5)
The centroids are unchanged, so the algorithm stops.
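The hand computation above can be checked with a short numpy sketch that repeats the Chebyshev assignment and centroid update on the same five points.

```python
import numpy as np

points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)  # A..E
centroids = np.array([[1, 1], [2, 4]], dtype=float)                        # A and D

for _ in range(10):
    # Chebyshev distance from every point to every centroid
    dist = np.abs(points[:, None, :] - centroids[None, :, :]).max(axis=2)
    labels = dist.argmin(axis=1)
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)  # approximately [[0.67, 1.], [2.5, 4.5]]
print(labels)     # [0 0 0 1 1] -> A, B, C in cluster 1; D, E in cluster 2
```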
37. Stopping Criteria for K-means
The data points assigned to each cluster remain the same.
The centroids remain the same.
The distance of data points from their centroid is at a minimum.
A fixed number of iterations has been reached (insufficient iterations → poor results; choose the maximum iterations wisely).
38. Model Metrics
Within-cluster sum of squares
The sum of the squared deviations between each observation and its cluster centroid.
A smaller sum of squares means the clusters are more compact.
Between-cluster sum of squares
Measures the squared average distance between the cluster centroids.
A larger value implies the clusters are well separated.
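A small numpy sketch of both metrics, given points, cluster labels and centroids. It uses a common definition of the between-cluster sum of squares (the size-weighted squared distance of each centroid from the overall mean); the function names and example values are illustrative.

```python
import numpy as np

def within_cluster_ss(points, labels, centroids):
    # Sum of squared distances from each observation to its own cluster centroid
    return sum(np.sum((points[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

def between_cluster_ss(points, labels, centroids):
    # Size-weighted squared distances from each centroid to the overall mean
    overall = points.mean(axis=0)
    return sum(np.sum(labels == j) * np.sum((c - overall) ** 2)
               for j, c in enumerate(centroids))

points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1])
centroids = np.array([[2/3, 1], [2.5, 4.5]])
print(within_cluster_ss(points, labels, centroids))
print(between_cluster_ss(points, labels, centroids))
```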
40. k-Medoids
K-means is sensitive to outliers.
In k-medoids, instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.
The k-medoids clustering algorithm:
Select k objects as the initial representative objects (medoids).
Repeat:
Assign each remaining point to its closest medoid.
Randomly select a non-representative object Oi.
Compute the total cost S of swapping a medoid m with Oi.
If S < 0, then swap m with Oi to form the new set of medoids.
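A simplified Python sketch of this swap-based procedure. The total cost is the sum of each point's distance to its nearest medoid; the distance function and number of swap trials are illustrative choices, and the random selections use the standard library's random module.

```python
import random

def total_cost(points, medoids, dist):
    # Sum of each point's distance to its nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, dist, trials=100):
    medoids = random.sample(points, k)                 # initial representative objects
    cost = total_cost(points, medoids, dist)
    for _ in range(trials):
        m = random.choice(medoids)                     # a current medoid
        o = random.choice([p for p in points if p not in medoids])  # non-representative object
        candidate = [o if x == m else x for x in medoids]
        new_cost = total_cost(points, candidate, dist)
        if new_cost - cost < 0:                        # S < 0: the swap improves the clustering
            medoids, cost = candidate, new_cost
    return medoids

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
print(k_medoids([(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)], k=2, dist=manhattan))
```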
47. Agglomerative clustering
It is a hierarchical clustering method – specifically, it uses a bottom-up approach.
Idea: ensure nearby points end up in the same cluster.
Start with a collection of n singleton clusters.
Each cluster contains one data point.
Repeat until only one cluster is left:
Find the pair of clusters that is closest: min D(ci, cj).
Merge the clusters ci, cj into a new cluster cij.
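A minimal Python sketch of this bottom-up loop, using complete linkage (the maximum pairwise distance, as in the worked example on the next slide); the data points and distance function are illustrative.

```python
def agglomerative(points, dist, linkage=max):
    # Start with one singleton cluster per point
    clusters = [[i] for i in range(len(points))]
    merges = []
    # Cluster-to-cluster distance: linkage over all pairwise point distances
    d = lambda ci, cj: linkage(dist(points[a], points[b]) for a in ci for b in cj)
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them
        pairs = [(d(ci, cj), i, j) for i, ci in enumerate(clusters)
                 for j, cj in enumerate(clusters) if i < j]
        best, i, j = min(pairs)
        merges.append((clusters[i], clusters[j], best))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [clusters[i] + clusters[j]]
    return merges

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
for left, right, height in agglomerative([(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)], manhattan):
    print(left, "+", right, "merged at distance", height)
```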
49. Merge items 3 and 5. For example, d(1,3) = 3 and d(1,5) = 11, so with complete linkage D(1,"35") = max(3, 11) = 11. This gives us the new distance matrix. The items with the smallest distance get clustered next; this will be 2 and 4.
     35  24   1
35    0
24   10   0
1    11   9   0
52. Modern Clustering methods
Hierarchical clustering
BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies
CURE – Clustering Using REpresentatives
ROCK – RObust Clustering using linKs (for categorical data)
Partitional clustering
CLARA – Clustering LARge Applications
CLARANS – Clustering Large Applications based upon RANdomized Search
K-modes
53. Other Clustering methods
Density based clustering
DENCLUE – DENsity-based CLUstEring
DBSCAN – Density-Based Spatial Clustering of Applications with Noise
OPTICS – Ordering Points To Identify the Clustering Structure
Grid based methods
STING – STatistical INformation Grid
WaveCluster
Model based methods
COBWEB
CLASSIT