Unit 4
Association Rule Mining &
Clustering
Association Rule mining
 Association rule mining is about “what goes
with what”.
 Association rule mining is a technique to
identify underlying relations between
different items.
 Given a set of transactions, find rules
that will predict the occurrence of an
item based on the occurrences of other
items in the transactions.
 The process of identifying
associations between products is called
market basket analysis.
 More profit can be generated if the
relationship between the items purchased in
different transactions can be identified.
 For instance, if items A and B are frequently bought
together, several steps can be taken
to increase profit. For example,
 A and B can be placed together, so that when a
customer buys one of the products they do not have
to go far to find the other.
 People who buy one of the products can be
targeted through an advertisement campaign to buy
the other.
 Collective discounts can be offered on these
products if the customer buys both of them.
 Both A and B can be packaged together.
 Applications
Market basket analysis
Cross-marketing
Catalog design, etc.
Association Rules
 An association rule is
interpreted as an
“if-then” statement.
 Association rules are
probabilistic in nature.
 Some possible
association rules are
 {Bread} -> {Eggs}
 {Bread, Cereal} ->
{Eggs}
 A collection of one or
more items is called an itemset.
 {Bread, Cereal} -> {Eggs}
X => Y
If X, then Y
Antecedent -> Consequent
 The possible associations can be many.
We are usually interested in finding the strong
associations.
 But how do we find strong associations?
 Answer: Support, Confidence & Lift.
 Support and Confidence are the
measures used to confirm a rule as a strong
association rule.
 These two measures express how
strongly the data support the rule.
 The antecedent and consequent must be
disjoint sets.
Theory of Apriori Algorithm
 There are three major components of the
Apriori algorithm:
 Support (prevalence/popularity)
 Confidence (predictability) – how likely the
consequent is purchased given the antecedent
 Lift (interest) – how much more often the rule
occurs than expected by chance
Three key terms to determine
rules
Support(X) = freq(X)/N
Support(Y) = freq(Y)/N
Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Lift(X => Y) = Confidence(X => Y) / Support(Y)
Lift = 1 means there is no association between products A
and B.
Lift > 1 means products A and B are more likely to be
bought together.
Lift < 1 means products A and B are unlikely to be
bought together.
Note: some algorithms take the support (a fraction of N)
and some take the support count (a raw frequency).
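To make these measures concrete, here is a minimal Python sketch (an illustration, not part of the original slides; the transaction list is invented) computing support, confidence and lift for a rule X -> Y:

```python
def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # P(Y | X) = Support(X ∪ Y) / Support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

def lift(X, Y, transactions):
    # Confidence of X -> Y relative to the baseline support of Y
    return confidence(X, Y, transactions) / support(Y, transactions)

transactions = [
    {"Bread", "Eggs", "Cereal"},
    {"Bread", "Eggs"},
    {"Bread", "Milk"},
    {"Cereal", "Eggs"},
]
print(support({"Bread"}, transactions))               # 0.75
print(confidence({"Bread"}, {"Eggs"}, transactions))  # ≈ 0.67
print(lift({"Bread"}, {"Eggs"}, transactions))        # ≈ 0.89, i.e. < 1
```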
Example
Steps in Apriori algorithm
 Step 1: Generate all frequent item sets
 Candidate generation forms the possible
combinations of itemsets via the
join operation.
 A frequent itemset is
an itemset that appears in at
least the minimum support number of transactions in
the transaction database.
 E.g.: {A},{B},{C} – frequent itemsets at k=1
 {A,B} {A,C} {B,C} – candidate generation at
k=2
 Step 2: Generate strong association rules
Step 1: Finding the frequent
itemsets
 Let k=1
 Generate frequent item sets of length 1
 Repeat until no new frequent item sets are
identified:
 Create a candidate list of k-itemsets by
performing a join operation on pairs of (k-1)-
itemsets in the list.
 Prune candidate item sets containing subsets of
length k-1 that are infrequent.
 Count the support of each candidate by
scanning the DB.
 Eliminate candidates that are infrequent,
leaving only those that are frequent
(see the sketch below).
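A minimal Python sketch of this level-wise search, assuming transactions are given as sets and the threshold is a support count:

```python
from itertools import combinations

def apriori_frequent(transactions, min_count):
    # k = 1: count each individual item
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates with any infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))}
        # Count support by scanning the transaction database
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent  # maps each frequent itemset to its support count
```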
Step 2: Generate Strong Rules
 Formulate all the possible rules from the
frequent itemsets.
 Calculate the confidence of each rule.
 Choose the rules which have higher confidence.
 Also calculate the lift and check whether the rule has
lift > 1 (see the sketch below).
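A minimal sketch of this step, reusing `combinations` and the `apriori_frequent` output from the previous sketch (`min_conf` is an assumed parameter):

```python
def strong_rules(all_frequent, n_transactions, min_conf):
    rules = []
    for itemset, count in all_frequent.items():
        if len(itemset) < 2:
            continue
        sup_xy = count / n_transactions
        # Split the itemset into every antecedent/consequent pair
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = sup_xy / (all_frequent[antecedent] / n_transactions)
                lift = conf / (all_frequent[consequent] / n_transactions)
                if conf >= min_conf and lift > 1:
                    rules.append((set(antecedent), set(consequent), conf, lift))
    return sorted(rules, key=lambda rule: -rule[3])  # descending lift
```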
Example
 To speed up the process,
 Set a minimum value for support and confidence.
This means that we are only interested in finding
rules for items that have a certain minimum
presence (i.e., support) and a minimum value
for co-occurrence with other items (i.e., confidence).
 Extract all the subsets having a higher value of
support than the minimum threshold.
 Select all the rules from the subsets with a confidence
value higher than the minimum threshold.
 Order the rules by descending order of lift.
Advantage
 Every subset of a frequent itemset is also a frequent
itemset (the Apriori property).
 This reduces the number of candidates being
considered by only exploring the itemsets whose
support count is greater than the minimum
support count.
 Any itemset that has an infrequent subset
can be pruned.
Some other ARM Algorithms
 FP Growth
 AIS
 SETM
 ECLAT
Practice Problem
 Find the Support,
Confidence and Lift
for the rule
{Apples, Milk} -> {Cheese}
 Apply the Apriori
algorithm to find the
frequent item sets
Cluster Analysis
 Clustering is a process of grouping objects
which are similar.
 Clustering is an unsupervised learning
technique.
 Objects within a cluster are similar and objects of
different clusters are dissimilar.
 The objects can be grouped based on
attributes/features or by relationships with
other objects (distance or similarity).
 Clustering does not require assumptions
about category labels that tag objects with
prior identifiers.
 Clustering is subjective (or problem
dependent) and can summarize data into a
small number of meaningful groups.
 Applications
 Customer relationship management
 Information retrieval
 Data compression
 Image processing
 Marketing
 Medicine
 Pattern recognition
Similarity Measurement
 Grouping is done based on the closeness or
similarity
 One way of doing this, is measure the
distance
 Distance Measurement Methods
 Euclidean Distance
 Manhattan Distance
 Chebychev Distance
 Percentage Disagreement
Euclidean Distance
 Calculate the distance on the raw
data
Distance(Oi, Oj) = sqrt(∑k (Oik – Ojk)²)
Distance(O1, O2) = sqrt((5-8)² + (6-9)² + (4-3)² + (9-2)²) ≈ 8.25
Distance(O1, O3) = sqrt((5-3)² + (6-4)² + (4-5)² + (9-3)²) ≈ 6.7
Object X1 X2 X3 X4
O1 5 6 4 9
O2 8 9 3 2
O3 3 4 5 3
Manhattan Distance
 Simply the average absolute difference across dimensions
 Distance(Oi, Oj) = (1/n) ∑k |Oik – Ojk|
 n represents the number of features
Distance(O1, O2) = 1/4 (|5-8| + |6-9| + |4-3| + |9-2|) = 14/4 = 3.5
Distance(O1, O3) = 1/4 (|5-3| + |6-4| + |4-5| + |9-3|) = 11/4 = 2.75
Distance(O2, O3) = 1/4 (|8-3| + |9-4| + |3-5| + |2-3|) = 13/4 = 3.25
Object X1 X2 X3 X4
O1 5 6 4 9
O2 8 9 3 2
O3 3 4 5 3
Chebychev Distance
 Max difference across dimensions
 Distance(Oi, Oj) = maxk |Oik – Ojk|
Distance(O1, O2) = max(|5-8|, |6-9|, |4-3|, |9-2|) = 7
Distance(O1, O3) = max(|5-3|, |6-4|, |4-5|, |9-3|) = 6
Distance(O2, O3) = max(|8-3|, |9-4|, |3-5|, |2-3|) = 5
Object X1 X2 X3 X4
O1 5 6 4 9
O2 8 9 3 2
O3 3 4 5 3
Percent Disagreement
 Suited for features which are categorical in nature
 Distance(Oi, Oj) = 100 × [number of features where Oik ≠ Ojk] / n
 n represents the number of features
Distance(O1, O2) = 100 × (1/4) = 25%
Distance(O1, O3) = 100 × (2/4) = 50%
Distance(O2, O3) = 100 × (3/4) = 75%
Object Gender Age bracket Income level BP
O1 M 20-30 Low Normal
O2 M 30-40 Low Normal
O3 F 20-30 Medium Normal
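The four measures can be sketched in a few lines of Python; the values in the comments reproduce the worked examples above:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_avg(a, b):
    # These slides use the *average* absolute difference (divide by n)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def chebychev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def percent_disagreement(a, b):
    # For categorical features: share of positions whose values differ
    return 100 * sum(x != y for x, y in zip(a, b)) / len(a)

o1, o2 = (5, 6, 4, 9), (8, 9, 3, 2)
print(euclidean(o1, o2))      # ≈ 8.25
print(manhattan_avg(o1, o2))  # 3.5
print(chebychev(o1, o2))      # 7
print(percent_disagreement(("M", "20-30", "Low", "Normal"),
                           ("M", "30-40", "Low", "Normal")))  # 25.0
```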
Types of Clustering
 Partitional – we construct various partitions and
then evaluate them by some criterion
 K-means
 K-medoids
 Hierarchical – we create a hierarchical
decomposition of the set of objects using some
criterion
 Bottom-up – agglomerative
 Initially, each point is a cluster
 Repeatedly combine the two nearest clusters into one
 Top-down – divisive
 Start with one cluster and recursively split it
K-means
Step 1: Choose k objects arbitrarily from D as the
initial cluster centers
Step 2: Repeat
Step 3: Calculate the distance from each
data point to each cluster center
Step 4: Assign each object to the cluster of the
nearest center
Step 5: Update the cluster means
Step 6: Until no change
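A minimal sketch of these steps in Python, assuming numeric points and squared Euclidean distance for the assignment step (the worked example below happens to use Chebychev, but any of the earlier measures can be substituted):

```python
def kmeans(points, init_centers, max_iter=100):
    # init_centers lets us reproduce the worked example, which starts
    # from A (1,1) and D (2,4).
    centers = [list(c) for c in init_centers]
    k = len(centers)
    for _ in range(max_iter):
        # Steps 3-4: assign each point to the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Step 5: update each center to the mean of its cluster
        new_centers = [[sum(d) / len(c) for d in zip(*c)] if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # Step 6: stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

# On the example data this converges to ≈ (0.67, 1) and (2.5, 4.5):
# kmeans([(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)], [(1, 1), (2, 4)])
```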
Example
 As a simple illustration of the k-means algorithm,
consider the following data set consisting of the
scores of two variables on each of five
individuals. This data set is to be grouped into two
clusters.
Subject X1 X2
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
 Choose the cluster centroids
Individual / Mean vector (centroid):
Group 1: A (1, 1)
Group 2: D (2, 4)
 Calculate the distance between each pair of objects
using the Euclidean/Manhattan/Chebychev distance
measure (the matrix below matches the Chebychev distance)
Subject A B C D E
A 0
B 1 0
C 1 2 0
D 3 4 2 0
E 4 5 3 1 0
 Calculate the distance (using Euclidean/
Manhattan/Chebychev) of each individual to each
chosen centroid. Put a 1 under the cluster with the minimum
distance. E.g.: for object A, min(0, 3) = 0, so put a 1 under
cluster 1. Rearrange the clusters and recompute the
centroids.
Object   Cluster 1 (1,1) A   Cluster 2 (2,4) D
A        0                   3
B        1                   4
C        1                   2
D        3                   0
E        4                   1

Assignments (1 = assigned):
Object   Cluster 1 (1,1)   Cluster 2 (2,4)
A        1                 0
B        1                 0
C        1                 0
D        0                 1
E        0                 1

New centroids:
Cluster 1: ((1+1+0)/3, (1+0+2)/3) = (2/3, 3/3) ≈ (0.67, 1)
Cluster 2: ((2+3)/2, (4+5)/2) = (5/2, 9/2) = (2.5, 4.5)
 Repeat until there is no change in the centroids
Object   Cluster 1 (0.67,1)   Cluster 2 (2.5,4.5)
A        0.33                 3.5
B        1                    4.5
C        1                    2.5
D        3                    0.5
E        4                    0.5

Assignments (1 = assigned):
Object   Cluster 1 (0.67,1)   Cluster 2 (2.5,4.5)
A        1                    0
B        1                    0
C        1                    0
D        0                    1
E        0                    1

New centroids (unchanged, so the algorithm stops):
Cluster 1: ((1+1+0)/3, (1+0+2)/3) = (2/3, 3/3) ≈ (0.67, 1)
Cluster 2: ((2+3)/2, (4+5)/2) = (5/2, 9/2) = (2.5, 4.5)
Stopping Criteria for K-Means
 The datapoints assigned to each cluster
remain the same
 Centroids remain the same
 The distance of datapoints from their centroid is
minimal
 A fixed number of iterations has been reached
(insufficient iterations → poor results, so choose the
maximum iterations wisely)
Model Metrics
 Within-cluster sum of squares
 The sum of the squared deviations between each
observation and its cluster centroid.
 A smaller sum of squares means a more compact cluster.
 Between-cluster sum of squares
 Measures the squared average
distance between all centroids.
 A larger value implies the clusters are well separated.
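A small sketch of both metrics, assuming clusters and centers in the shape returned by the `kmeans` sketch above (the between-cluster function is one simple reading of the slide's "squared average distance between all centroids"):

```python
def wcss(clusters, centers):
    # Sum of squared deviations of each point from its own centroid;
    # smaller means more compact clusters
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cluster in zip(centers, clusters) for p in cluster)

def bcss(centers):
    # Mean squared distance over all centroid pairs;
    # larger means better-separated clusters
    pairs = [(c1, c2) for i, c1 in enumerate(centers)
             for c2 in centers[i + 1:]]
    return sum(sum((a - b) ** 2 for a, b in zip(c1, c2))
               for c1, c2 in pairs) / len(pairs)
```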
Practice Problem
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
k-Medoids
 K-means is sensitive to outliers
 k-medoids – instead of taking the mean value of the
objects in the cluster as a reference point, a medoid can
be used, which is the most centrally located object in a
cluster
 The k-medoids clustering algorithm (sketched below):
 Select k points as the initial representative objects (medoids)
 Repeat
 Assign each point to its closest medoid
 Randomly select a non-representative object Oi
 Compute the total cost S of swapping a medoid m
with Oi
 If S < 0, then swap m with Oi to form the new set of
medoids
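A minimal PAM-style sketch of this loop, assuming points are tuples and `d` is any distance function from earlier in this unit:

```python
import random

def total_cost(points, medoids, d):
    # Sum of distances from each point to its nearest medoid
    return sum(min(d(p, m) for m in medoids) for p in points)

def k_medoids(points, k, d, max_iter=100):
    medoids = random.sample(points, k)           # initial representatives
    cost = total_cost(points, medoids, d)
    for _ in range(max_iter):
        # Randomly pick a non-medoid Oi and a medoid m to try swapping
        o = random.choice([p for p in points if p not in medoids])
        m = random.choice(medoids)
        trial = [o if x == m else x for x in medoids]
        s = total_cost(points, trial, d) - cost  # swap cost S
        if s < 0:                                # keep the swap if it helps
            medoids, cost = trial, cost + s
    return medoids
```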
K-medoids example
Agglomerative clustering
 It is a hierarchical clustering method –
specifically, it uses a bottom-up approach
 Idea: ensure nearby points end up in the
same cluster
 Start with a collection of n singleton
clusters
Each cluster contains one data point
 Repeat until only one cluster is left:
Find the pair of clusters that is closest: min
D(ci, cj)
Merge the clusters ci, cj into a new cluster
cij
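A minimal Python sketch, assuming items are numbered 0..n-1 and `dist` maps each pair frozenset({i, j}) to its distance; complete linkage (the largest pairwise distance) is used, which matches the D(1,"35") = 11 step in the example below:

```python
def agglomerative(dist, n):
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage
        a, b = min(((x, y) for i, x in enumerate(clusters)
                    for y in clusters[i + 1:]),
                   key=lambda pair: max(dist[frozenset({p, q})]
                                        for p in pair[0] for q in pair[1]))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b)))  # merge order defines the dendrogram
    return merges
```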
Example
 For a given dataset, form the distance matrix
 Merge items 3 and 5. Under complete linkage, the
distance from each remaining item to the new cluster
is the larger of its two old distances: d(1,3) = 3 and
d(1,5) = 11, so D(1,"35") = 11. This gives us the
new distance matrix. The items with the smallest
distance get clustered next. This will be 2 and 4.
     35   24   1
35   0
24   10   0
1    11   9    0
Form the dendrogram.
Practice
Modern Clustering methods
 Hierarchical clustering
 BIRCH – Balanced Iterative Reducing and Clustering
using Hierarchies
 CURE – Clustering Using REpresentatives
 ROCK – RObust Clustering using linKs (for categorical data)
 Partitive clustering
 CLARA – Clustering LARge Applications
 CLARANS – Clustering Large Applications based on
RANdomized Search
 K-modes
Other Clustering methods
 Density based clustering
 DENCLUE – DENsity-based CLUstEring
 DBSCAN – Density-Based Spatial Clustering of
Applications with Noise
 OPTICS – Ordering Points To Identify the Clustering
Structure
 Grid based methods
 STING – STatistical INformation Grid
 WaveCluster
 Model based methods
 COBWEB
 CLASSIT