Module 4
Mining Frequent Patterns, Associations and Correlations
By Dr. Ramkumar T (ramkumar.thirunavukarasu@vit.ac.in)
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that
occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
• Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications
– Basket data analysis, cross-marketing, catalog design, sales campaign analysis,
Web log (click-stream) analysis, and DNA sequence analysis.
Association Rule Mining
• Proposed by Agrawal R, Imielinski T, and Swami AN in
"Mining Association Rules between Sets of Items in Large
Databases", SIGMOD, June 1993
• An important data mining model studied extensively by the
database and data mining community
• Initially used for Market Basket Analysis to find how items
purchased by customers are related
• Assumes all data are categorical
• Not directly applicable to numeric data, which must be discretized first
Basic Terms
• An item – a single article in a basket
• Itemset – Collection of one or more items
I = {i1, i2, …, im}
• k-itemset – An itemset that contains k items
{Bread} – a 1-itemset
{Bread, Milk} – a 2-itemset
{Bread, Milk, Diaper} – a 3-itemset
Basic Terms
• Transaction (T) – the set of items purchased in one basket
• A transactional dataset – a set of transactions
{T1, T2, …, Tn}
Association Rule Measures
• An Association Rule is an implication of the form X → Y,
where X and Y are itemsets drawn from I
• Example: {Milk, Diaper} → {Beer}
• Objective measures – a data-driven approach for evaluating the
quality of association patterns; domain-independent and
requiring minimal input from users
• Subjective measures – knowledge acquired through domain
expertise
• Widely used objective measures in ARM: support and
confidence
Support (s)
• Support measures how frequently an item/itemset occurs
• The support s of an itemset X is the percentage of transactions in the
transaction database D that contain X
• The support s of the rule X → Y in the transaction database D is the
support of the itemset X ∪ Y
Confidence (c)
• The confidence c of the rule X → Y is the percentage of transactions
containing X that also contain Y:
confidence(X → Y) = support(X ∪ Y) / support(X)
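A minimal sketch of both measures on a toy transaction list (the items and numbers here are illustrative, not from the slides):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X): how often Y co-occurs given X."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Diaper"}, transactions))               # 0.5
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.5
```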
Association Rule Mining
Given:
- a set I of all items;
- a database D of transactions;
- a minimum support threshold s;
- a minimum confidence threshold c.
Find:
- all association rules X → Y that satisfy the
minimum support s and minimum confidence c.
How?
- Step 1: Frequent Itemset Generation - find all sets of
items that have minimum support (frequent itemsets)
- Step 2: Rule Generation - generate association rules
that satisfy the minimum confidence threshold from
the frequent itemsets found in the previous step.
Apriori Algorithm
• The Apriori algorithm mines frequent itemsets for single-dimensional
Boolean association rules
• The algorithm is named for the fact that it uses
prior knowledge of frequent itemsets
• It applies an iterative approach known as level-wise search, where
frequent k-itemsets are used to explore (k+1)-itemsets
• The Apriori property is used to reduce the search space
• Apriori property: "All nonempty subsets of a frequent itemset must
also be frequent", also called the anti-monotone property
• Anti-monotone in the sense that if a set cannot pass a test, all its
supersets will fail the same test as well
Data representation
• Market basket data can be represented in a binary
format.
• Each row corresponds to a transaction and each
column corresponds to an item.
• An item can be treated as a binary variable whose
value is one if the item is present in a transaction
and zero otherwise.
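For instance, a minimal sketch (with illustrative item names) that converts a transaction list into this binary matrix:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Diaper", "Beer"},
]

items = sorted(set().union(*transactions))   # one column per item
matrix = [[int(item in t) for item in items] for t in transactions]

print(items)        # ['Beer', 'Bread', 'Diaper', 'Milk']
for row in matrix:
    print(row)      # e.g. [0, 1, 0, 1] for {Bread, Milk}
```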
Data representation for MBA
[Figure: an example market-basket dataset in the binary format described above]
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length (k+1) candidate itemsets from length k
frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm—An Example (min_sup = 2)

Transaction database TDB:

Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan → C1 (candidate 1-itemsets with counts):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
Prune {D} (support 1 < 2) → L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

Self-join L1 → C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
Prune → L2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2

Self-join L2 → C3: {B,C,E}
3rd scan → L3: {B,C,E}: 2
Generation of Strong Association Rules
• For each frequent itemset l, generate all nonempty proper subsets s, and
output the rule s → (l − s) whenever
confidence = support_count(l) / support_count(s) ≥ min_conf
• Rules that satisfy both min_sup and min_conf are called strong
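As a worked check using the example above: take l = {B, C, E} with support count 2 and min_conf = 66%. The candidate rules and their confidences (support_count(l) / support_count(s)) are:

{B,C} → E: 2/2 = 100%
{C,E} → B: 2/2 = 100%
{B,E} → C: 2/3 ≈ 67%
B → {C,E}: 2/3 ≈ 67%
C → {B,E}: 2/3 ≈ 67%
E → {B,C}: 2/3 ≈ 67%

All six meet the 66% threshold, so all six are strong rules here.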
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are
        contained in t;
    Lk+1 = candidates in Ck+1 with support ≥ min_support;
end
return ∪k Lk;
Implementation of Apriori
• How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
• Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4 = {abcd}
Maximal Frequent Itemset
• The number of frequent itemsets generated
by the Apriori algorithm can often be very
large.
• Hence it is beneficial to identify a small
representative set from which every
frequent itemset can be derived.
• One such approach uses maximal
frequent itemsets.
Maximal Frequent Itemset
• A maximal frequent itemset is a frequent
itemset for which none of its immediate
supersets are frequent.
• E.g., {A,C} is a maximal frequent itemset in
the example problem, since its superset
{A,C,E} is infrequent.
• Maximal frequent itemsets provide a
compact representation of all the
frequent itemsets for a particular dataset.
Closed Frequent Itemset
• An itemset is closed in a dataset if no
proper superset has the same support
count.
• E.g., {B,E} is a closed frequent itemset
with a support count of 3, while its superset
{B,C,E} has a support count of only 2.
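A small sketch of both definitions, reusing the `apriori` function and `tdb` example from the earlier snippet. Note that checking only frequent supersets suffices for closedness: if a superset had the same support count as a frequent itemset, it would itself be frequent.

```python
def maximal_and_closed(frequent):
    """frequent: {itemset (tuple): support count}, e.g. from apriori()."""
    maximal, closed = [], []
    for X, sup in frequent.items():
        supersets = [Y for Y in frequent if set(X) < set(Y)]
        if not supersets:                                # no frequent superset
            maximal.append(X)
        if all(frequent[Y] != sup for Y in supersets):   # none with equal count
            closed.append(X)
    return maximal, closed

maximal, closed = maximal_and_closed(apriori(tdb, 2))
print(sorted(maximal))  # [('A', 'C'), ('B', 'C', 'E')]
print(sorted(closed))   # [('A', 'C'), ('B', 'C', 'E'), ('B', 'E'), ('C',)]
```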
Illustration / Exercise
[Figure slides: worked illustration and practice exercise (content not extracted)]

Improving the Efficiency of Apriori
[Figure slides: enhancements to Apriori, including the idea behind the partitioning algorithm (details not extracted)]
FP Tree Mining
 Apriori: uses a generate-and-test approach – generates
candidate itemsets and tests if they are frequent
– Generation of candidate itemsets is expensive (in both space
and time)
– Support counting is expensive
• Subset checking (computationally expensive)
• Multiple Database scans (I/O)
 FP-Growth: allows frequent itemset discovery without
candidate itemset generation. Two step approach:
– Step 1: Build a compact data structure called the FP-tree
• Built using 2 passes over the data-set.
– Step 2: Extract frequent itemsets directly from the FP-tree
Definition of FP-tree
 A frequent pattern tree is defined as follows.
 It consists of one root labeled "null", a set of item-prefix
subtrees as the children of the root, and a
frequent-item header table.
 Each node in an item-prefix subtree consists of three
fields: item-name, count, and node-link.
 Each entry in the frequent-item header table consists
of two fields: (1) item-name and (2) head of node-link.
Frequent Pattern Tree — Example
[Figure: an example frequent pattern tree with its header table (content not extracted)]
Step 1: FP-Tree Construction
FP-Tree is constructed using 2 passes over the
data-set:
Pass 1:
– Scan data and find support for each item.
– Discard infrequent items.
– Sort frequent items in decreasing order based on
their support.
Use this order when building the FP-Tree, so
common prefixes can be shared.
Step 1: FP-Tree Construction
Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads 1 transaction at a time and maps it to a path
2. A fixed order is used, so paths can overlap when transactions
share items (i.e., when they have the same prefix).
– In this case, counters are incremented
3. Pointers are maintained between nodes containing the same
item, creating singly linked lists (dotted lines)
– The more paths that overlap, the higher the compression. FP-tree
may fit in memory.
4. Frequent itemsets extracted from the FP-Tree.
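A minimal construction sketch along these lines; the node structure follows the FP-tree definition above, and `fp_insert` assumes each transaction has already been filtered and sorted by the pass-1 order (names are mine):

```python
class FPNode:
    """One node of an item-prefix subtree: item-name, count, node-link."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}        # item -> FPNode
        self.node_link = None     # next node carrying the same item

def fp_insert(root, ordered_items, header):
    """Map one ordered transaction onto a path, sharing common prefixes."""
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is not None:     # shared prefix: just bump the counter
            child.count += 1
        else:                     # open a new branch
            child = FPNode(item, node)
            node.children[item] = child
            # thread the node onto the header table's linked list
            # (prepending here; implementations often append at the tail)
            child.node_link = header.get(item)
            header[item] = child
        node = child
```

Pass 1 (counting and ordering) and the mining of conditional pattern bases are omitted from this sketch.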
Step 1: FP-Tree Construction (Example)
[Figure: the FP-tree grown transaction by transaction (content not extracted)]
FP-Tree size
 The FP-Tree usually has a smaller size than the uncompressed
data - typically many transactions share items (and hence
prefixes).
– Best case scenario: all transactions contain the same set of items.
• 1 path in the FP-tree
– Worst case scenario: every transaction has a unique set of items
(no items in common)
• Size of the FP-tree is at least as large as the original data.
• Storage requirements for the FP-tree are higher - need to store the pointers
between the nodes and the counters.
 The size of the FP-tree depends on how the items are ordered
 Ordering by decreasing support is typically used but it does not
always lead to the smallest tree (it's a heuristic).
Step 2: Frequent Itemset Generation
FP-Growth extracts frequent itemsets from
the FP-tree.
Bottom-up algorithm - from the leaves
towards the root
Divide and conquer - Decompose both the
mining task and DB according to the
frequent patterns obtained so far
Comparing an Apriori-like Method to the FP-tree
• Apriori-like method may generate an exponential
number of candidates in the worst case.
• FP-tree does not generate an exponential number
of nodes.
• Because items are ordered in support-descending
order, the FP-tree structure is usually
highly compact.
FP Tree Pros and Cons
Advantages of FP-Growth
– only 2 passes over data-set
– "compresses" the data-set
– no candidate generation
– much faster than Apriori
Disadvantages of FP-Growth
– FP-Tree may not fit in memory!!
– FP-Tree is expensive to build
Construct FP-tree from a Transaction Database
 Let the minimum support be 20%
1. Scan DB once, find frequent 1-itemset
(single item pattern)
2. Sort frequent items in frequency
descending order, f-list
3. Scan DB again, construct FP-tree
Frequent 1-itemset | Support count
I1 | 6
I2 | 7
I3 | 6
I4 | 2
I5 | 2
Construct FP-tree from a Transaction Database

TID  | Items bought     | (Ordered) frequent items
T100 | {I1, I2, I5}     | {I2, I1, I5}
T200 | {I2, I4}         | {I2, I4}
T300 | {I2, I3}         | {I2, I3}
T400 | {I1, I2, I4}     | {I2, I1, I4}
T500 | {I1, I3}         | {I1, I3}
T600 | {I2, I3}         | {I2, I3}
T700 | {I1, I3}         | {I1, I3}
T800 | {I1, I2, I3, I5} | {I2, I1, I3, I5}
T900 | {I1, I2, I5}     | {I2, I1, I5}
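Continuing the `FPNode`/`fp_insert` sketch from the construction section above, feeding these nine transactions in f-list order (I2, I1, I3, I4, I5) reproduces the counts on the tree's main branches:

```python
flist = ["I2", "I1", "I3", "I4", "I5"]   # support-descending order (f-list)
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I5"},
]

root, header = FPNode(None, None), {}
root.count = 0
for t in db:
    fp_insert(root, [i for i in flist if i in t], header)

print(root.children["I2"].count)                 # 7
print(root.children["I2"].children["I1"].count)  # 4
print(root.children["I1"].count)                 # 2 (the {I1, I3} transactions)
```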
[Figure: the completed FP-tree with its header table for this database (content not extracted)]
Why Is FP-Growth the Winner?
• Divide-and-conquer:
– leads to focused search of smaller databases
• Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic operations—counting local frequent items and
building sub FP-tree, no pattern search and matching
Evaluation of Association Rules
• Association rule algorithms tend to produce too many rules – many
of them are "uninteresting"
• A rule is considered subjectively uninteresting unless it
• reveals unexpected information about the data, or
• provides useful knowledge that can lead to profitable actions.
• How can we know some rules are not good (even with high
confidence and support)?
• This leads to defining various interestingness measures
– Traditionally: support, confidence
– Newer: lift, etc.
Strong rules are not necessarily Interesting
• The support and confidence measures are insufficient for filtering
out uninteresting association rules
• To tackle this weakness, a correlation measure can be used to
augment the support-confidence framework
• Measure of dependent/correlated events: correlation (lift)
• Hence a typical association rule can be represented in the form
A → B [support, confidence, correlation]
Contingency Table (or) Confusion Matrix
Read the scenario
• Consider a population of 100 people in which there are 50
researchers and 50 non-researchers. 80 out of the 100 people are
coffee drinkers and 20 are coffee abstainers. Suppose 35
researchers drink coffee and the remaining 15 do not. It follows
that 45 non-researchers drink coffee and remaining 5 do not.
• How do we form the contingency table?
Contingency Table for the Problem

               | Coffee drinker | Coffee abstainer | Sum (row)
Researcher     | 35             | 15               | 50
Non-researcher | 45             | 5                | 50
Sum (col.)     | 80             | 20               | 100

Calculate the support and confidence of (Researcher → coffee drinker)
Calculate the support and confidence of (Researcher → coffee abstainer)
Interestingness Measure: Correlations (Lift)
The lift (correlation) of A and B is

corr(A,B) = P(A ∪ B) / (P(A) · P(B))

where P(A ∪ B) is the probability that a transaction contains both A and B.

if corr(A,B) = 1, then A and B are independent of each other
if corr(A,B) > 1, then A and B are positively correlated
if corr(A,B) < 1, then A and B are negatively correlated

For the contingency table above:

corr(Researcher, Coffee drinker) = 0.35 / ((0.50)(0.80)) = 0.35 / 0.40 = 0.875

Since corr < 1, being a researcher and drinking coffee are negatively correlated.
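The same computation as a small sketch, with the raw counts taken from the contingency table above:

```python
def lift(n_ab, n_a, n_b, n_total):
    """corr(A,B) = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

print(lift(35, 50, 80, 100))   # 0.875 -> negatively correlated
```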
Answer It
• Suppose we are interested in analyzing transactions at AllElectronics
with respect to the purchase of computer games and
videos. Of the 10,000 transactions analyzed, the data show that
6,000 of the customer transactions included computer games,
while 7,500 included videos and 4,000 included both computer
games and videos.
• Check whether the rule computer game → video is really
interesting for a minimum support of 0.30 and a minimum
confidence of 0.66.
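One way to work it out, in the same style as the earlier snippets:

```python
n_total, n_game, n_video, n_both = 10_000, 6_000, 7_500, 4_000

support = n_both / n_total      # 0.40   >= 0.30 -> meets min_sup
confidence = n_both / n_game    # ~0.667 >= 0.66 -> meets min_conf
corr = (n_both / n_total) / ((n_game / n_total) * (n_video / n_total))

print(support, confidence, corr)  # 0.4 0.666... 0.888...
# corr < 1: games and videos are negatively correlated, so this "strong"
# rule is misleading despite clearing both thresholds.
```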
Multiple-level Association Rules
• Items often form a hierarchy
• Flexible support settings: items at the lower levels are expected
to have lower support
• The transaction database can be encoded based on dimensions and
levels

Example hierarchy: Milk [support = 10%], with children
2% Milk [support = 6%] and Skim Milk [support = 4%]

Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
(2% Milk qualifies; Skim Milk at 4% does not)
Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
(both 2% Milk and Skim Milk qualify)
ML/MD Associations with Flexible Support Constraints
• Why flexible support constraints?
– Real life occurrence frequencies vary greatly
• Diamond, watch, pens in a shopping basket
– Uniform support may not be an interesting model
• A flexible model
– The lower the level, the more dimension combinations, and the longer the
pattern, the smaller the support that is usually appropriate
– General rules should be easy to specify and understand
– Special items and special group of items may be specified individually
and have higher priority
Multi-dimensional Association
• Single-dimensional rules:
buys(X, "milk") → buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
– Inter-dimension assoc. rules (no repeated predicates)
age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
– Hybrid-dimension assoc. rules (repeated predicates)
age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
• Categorical attributes
– finite number of possible values, no ordering among values
• Quantitative attributes
– numeric, implicit ordering among values
Techniques for Mining MD Associations
• Search for frequent k-predicate sets:
– Example: {age, occupation, buys} is a 3-predicate set
– Techniques can be categorized by how quantitative attributes (such as age) are treated
1. Using static discretization of quantitative attributes
– Quantitative attributes are statically discretized by using
predefined concept hierarchies
2. Quantitative association rules
– Quantitative attributes are dynamically discretized into
"bins" based on the distribution of the data
3. Distance-based association rules
– This is a dynamic discretization process that considers the
distance between data points
Quantitative Association Rules
 Numeric attributes are dynamically discretized
 such that the confidence or compactness of the rules mined
is maximized
 2-D quantitative association rules: Aquan1 ∧ Aquan2 → Acat
 Example: age(X, "30-34") ∧ income(X, "24K-48K")
→ buys(X, "high resolution TV")
Usage of Binning Methods
• Binning methods do not capture the semantics of interval data
• Distance-based partitioning, more meaningful discretization
considering:
– density/number of points in an interval
– "closeness" of points in an interval
Price ($): 7, 20, 22, 50, 51, 53

Equi-width (width $10): [0,10], [11,20], [21,30], [31,40], [41,50], [51,60]
Equi-depth (depth 2): [7,20], [22,50], [51,53]
Distance-based: [7,7], [20,22], [50,53]
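A small sketch contrasting the first two schemes on this price list. The equal-width buckets here follow `v // width`, whose boundaries differ slightly from the table's edges; distance-based partitioning would additionally cluster values by the gaps between them:

```python
prices = [7, 20, 22, 50, 51, 53]

def equi_width(values, width=10):
    """Bucket k holds values in [k*width, (k+1)*width)."""
    buckets = {}
    for v in values:
        buckets.setdefault(v // width, []).append(v)
    return buckets

def equi_depth(values, depth=2):
    """Sort, then cut into consecutive runs of `depth` values each."""
    s = sorted(values)
    return [s[i:i + depth] for i in range(0, len(s), depth)]

print(equi_width(prices))  # {0: [7], 2: [20, 22], 5: [50, 51, 53]}
print(equi_depth(prices))  # [[7, 20], [22, 50], [51, 53]] -> matches the table
```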