Clustering and Association Rules 
Case 4 
NOVEMBER 24, 2014 
GROUP 7 
Sushmita Dey 
Nikolaos Minas 
Allan Kuo
Prof Shaonan Tian
Clustering 
• Clustering is a popular unsupervised learning method.
• It groups a set of points together in a cluster. Objects that differ from each other are grouped in different clusters. Distance is used as the metric to separate objects into clusters.
Clustering 
• Objects within the same cluster are closer to each other than to objects in a different cluster.
• We used [...] from the iris data set to apply [...]
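The "closer to each other" idea can be checked directly on the iris data. The sketch below is only an illustration: using all four numeric columns and Euclidean distance are assumptions, since the slide does not say which columns or which distance were used.

```r
# Numeric measurements only; Species is the label and is left out.
# Assumption: all four measurement columns are used.
measurements <- iris[, 1:4]

# Pairwise Euclidean distances between all 150 flowers (assumed metric).
d <- as.matrix(dist(measurements, method = "euclidean"))

# In the built-in data set, rows 1-50 are setosa and rows 51-100 are versicolor.
setosa_block <- d[1:50, 1:50]
within_setosa <- mean(setosa_block[upper.tri(setosa_block)])
setosa_to_versicolor <- mean(d[1:50, 51:100])

within_setosa          # average distance between two setosa flowers
setosa_to_versicolor   # average distance from a setosa to a versicolor flower
```

If the first number is clearly smaller than the second, the species form a compact group that a distance-based method can separate.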
K-Means Clustering 
• We use the kmeans() function (part of R's stats package) together with the “fpc” package.
• We started with the number of clusters equal to [...], and the result was [...] of pure cluster, [...] of slightly less pure cluster, and a mixture of [...] and [...]
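A minimal sketch of a k-means run on iris follows. The value centers = 3, the seed, nstart = 25, and the use of all four numeric columns are assumptions for illustration, not the group's recorded settings.

```r
set.seed(42)  # k-means starts from random centers; a fixed seed makes the run repeatable

# Assumption: cluster on the four numeric measurement columns.
measurements <- iris[, 1:4]

# Assumption: three clusters, with 25 random restarts to avoid a poor local optimum.
fit <- kmeans(measurements, centers = 3, nstart = 25)

# Cross-tabulate cluster membership against the known species to judge purity.
purity <- table(cluster = fit$cluster, species = iris$Species)
purity
```

A row of the table with counts in only one species column is a pure cluster; a row spread over two species is a mixed one. If fpc is installed, fpc::plotcluster(measurements, fit$cluster) draws the clusters on discriminant coordinates, which are labeled dc 1 and dc 2.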
K-Means Clustering 
• Figure 1: cluster plot, points labeled 1-3 (axes: dc 1, dc 2)
• Figure 2: cluster plot, points labeled 1-4 (axes: dc 1, dc 2)
Hierarchical Clustering with hclust()
• We used the hclust() function (part of R's stats package) together with the “fpc” package
• We used Ward's minimum variance method to create clusters
• We started with [...] and went up to [...]
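The hierarchical run can be sketched as below. The method name "ward.D2", the Euclidean distance, and the cuts at 3 and 4 clusters (taken from the figure captions) are assumptions; the slides do not record the exact call.

```r
# Assumption: cluster on the four numeric measurement columns.
measurements <- iris[, 1:4]

# hclust() works on a distance object, not on the raw data.
d <- dist(measurements, method = "euclidean")

# "ward.D2" applies Ward's minimum variance criterion to unsquared distances.
tree <- hclust(d, method = "ward.D2")

# Cut the same tree at two heights to get 3 and then 4 clusters.
groups3 <- cutree(tree, k = 3)
groups4 <- cutree(tree, k = 4)

table(cluster = groups3, species = iris$Species)
table(cluster = groups4, species = iris$Species)
```

Because both cuts come from one tree, the 4-cluster solution is always the 3-cluster solution with one cluster split in two.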
Hierarchical Clustering 
• Figure 5: Centroid Plot with 3 Clusters (axes: dc 1, dc 2)
• Figure 6: Centroid Plot with 4 Clusters (axes: dc 1, dc 2)
Association Rules 
• Association rule mining is a popular unsupervised learning method.
• Association rules are used in [...] in retail stores to find which items are frequently purchased together.
Association Rules 
• Association rules are best suited to finding associations between items in a large set of transactional data
• A typical rule may be represented as:
• {peanut butter, jelly} -> {[...]}
• If peanut butter and jelly are purchased, then [...]
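How such a rule is scored can be shown with plain R on a handful of made-up transactions. Everything in the data below is hypothetical, including "bread" as the right-hand side, which is chosen only for illustration.

```r
# Five made-up transactions; "bread" is a hypothetical consequent.
transactions <- list(
  c("peanut butter", "jelly", "bread"),
  c("peanut butter", "jelly", "bread", "milk"),
  c("peanut butter", "bread"),
  c("jelly", "milk"),
  c("bread", "milk")
)

# Support: fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

lhs <- c("peanut butter", "jelly")
rhs <- "bread"

rule_support    <- support(c(lhs, rhs))            # how often the whole rule occurs
rule_confidence <- rule_support / support(lhs)     # P(rhs | lhs)
rule_lift       <- rule_confidence / support(rhs)  # confidence relative to the rhs baseline

c(support = rule_support, confidence = rule_confidence, lift = rule_lift)
```

A lift above 1 means the right-hand item shows up more often alongside the left-hand items than it does in transactions overall.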
Apriori Algorithm 
• The Apriori algorithm is used to learn association rules in a large transactional dataset.
• The Apriori algorithm employs a simple a priori belief as a heuristic: all subsets of a frequent itemset must also be frequent.
• We used the arules package from R to analyze the Groceries dataset.
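The pruning heuristic can be sketched in base R. This is a simplified two-pass illustration on made-up transactions with an assumed threshold, not the arules implementation.

```r
# Five made-up transactions and an assumed minimum support threshold.
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("milk", "eggs"),
  c("bread", "milk", "butter")
)
min_support <- 0.5

# Fraction of transactions that contain every item in `items`.
support <- function(items) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

# Pass 1: keep only the single items that are frequent.
items <- sort(unique(unlist(transactions)))
item_support <- vapply(items, support, numeric(1))
frequent_items <- items[item_support >= min_support]

# Pass 2: build candidate pairs from frequent items only. A pair that contains
# an infrequent item cannot be frequent, so it is never generated or counted.
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidates, support, numeric(1))
frequent_pairs <- candidates[pair_support >= min_support]

frequent_items
frequent_pairs
```

Here "eggs" fails the threshold in the first pass, so no pair containing it is ever counted. On a store-sized item list this pruning is what keeps the search tractable.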
Groceries Data Sets
Data Exploration 
• We install and load the package using the commands install.packages("arules") and library(arules).
• We use R functions to explore the Groceries dataset.
• We use the dim() function to find the dimensions of the Groceries dataset
• We use the inspect() function from the "arules" package to view the first 10 transactions in the dataset.
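These exploration steps can be sketched as follows, assuming the arules package is already installed (the install line is left commented out).

```r
# install.packages("arules")  # run once if the package is not yet installed
library(arules)

data(Groceries)            # grocery transactions shipped with arules

dim(Groceries)             # number of transactions, number of distinct items
inspect(Groceries[1:10])   # print the first ten transactions
```

Groceries is a sparse transactions object rather than a data frame, which is why inspect() is used instead of head() or print().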
Data Exploration 
• We use the output of the summary() function on the dataset to find the most frequently purchased item ([...]), the average number of items per transaction ([...]), and the number of items in the largest transaction (32)
• We use the itemFrequencyPlot() function to create plots from the dataset for visual exploration
• We plotted item frequency for all the items and for items with at least 10% support
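A sketch of this step is below. summary() prints the figures mentioned above; the itemFrequency() and size() helper calls are an addition here that pulls the same quantities out as values.

```r
library(arules)
data(Groceries)

summary(Groceries)                 # most frequent items and transaction size distribution

freq  <- itemFrequency(Groceries)  # relative support of every item
sizes <- size(Groceries)           # number of items in each transaction

names(which.max(freq))             # most frequently purchased item
mean(sizes)                        # average number of items per transaction
max(sizes)                         # number of items in the largest transaction

itemFrequencyPlot(Groceries)                 # every item
itemFrequencyPlot(Groceries, support = 0.1)  # only items with at least 10% support
```

The support argument of itemFrequencyPlot() filters the bars, so the second plot stays readable while the first one shows the long tail of rarely bought items.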
Item frequency plot (all items)
Item frequency plot (items with 10% support)
Association Rules
• We use the Apriori algorithm from the arules package to generate a set of association rules.
• We generated [...] rules using support = [...] and confidence = [...], chosen by trying out different values of support and confidence.
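Trying out different thresholds can be sketched as a small grid search. The support and confidence values below are placeholders chosen for illustration, not the values the group settled on.

```r
library(arules)
data(Groceries)

# Placeholder thresholds; row 1 is the loosest setting, row 4 the strictest.
grid <- expand.grid(support = c(0.005, 0.01), confidence = c(0.25, 0.5))

# Run apriori() once per setting and record how many rules come back.
grid$n_rules <- mapply(function(s, conf) {
  rules <- apriori(
    Groceries,
    parameter = list(support = s, confidence = conf),
    control = list(verbose = FALSE)  # suppress the progress log
  )
  length(rules)
}, grid$support, grid$confidence)

grid
```

Raising either threshold can only shrink the rule set, so the table shows how quickly the number of rules drops and helps pick a setting that leaves a manageable number to inspect.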
Association Rules
• We use the summary() function on the rule set to find the rule length distribution, with [...] rules containing one item.
• We found that the generated rule sets have a lift quality metric of [...]
• We use the inspect() and sort() functions to generate [...] sorted by [...]
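Reviewing the rule set might look like the sketch below. The thresholds are placeholders, and sorting by lift is an assumption, since the sort key is missing from the slide.

```r
library(arules)
data(Groceries)

# Placeholder thresholds, not the group's recorded values.
rules <- apriori(
  Groceries,
  parameter = list(support = 0.01, confidence = 0.25),
  control = list(verbose = FALSE)
)

summary(rules)   # rule length distribution plus support, confidence and lift ranges

# Assumption: rank by lift. sort() on a rule set is decreasing by default.
by_lift <- sort(rules, by = "lift")
inspect(by_lift[1:5])   # the five rules with the highest lift
```

Passing by = "support" or by = "confidence" instead ranks the same rules on the other quality measures.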
