Mining frequent patterns: Basic concepts - Market Basket Analysis - Frequent itemsets,
Closed itemsets - Association rules - Introduction - Apriori algorithm: theoretical
approach - Apply Apriori algorithm on dataset-1 - Apply Apriori algorithm on dataset-2 -
Generating association rules from frequent itemsets - Improving efficiency of Apriori -
Pattern growth approach - Mining frequent itemsets using vertical data format - Strong
rules vs. weak rules - Association analysis to correlation analysis - Comparison of
pattern evaluation measures
 Frequent patterns are patterns (e.g., item sets, subsequences, or
substructures) that appear frequently in a data set.
 A set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent itemset.
 A subsequence, such as buying first a PC, then a digital camera, and then a
memory card, if it occurs frequently in a shopping history database, is a
(frequent) sequential pattern.
 A substructure can refer to different structural forms, such as subgraphs or
subtrees, which may be combined with itemsets or subsequences.
 If a substructure occurs frequently, it is called a (frequent) structured
pattern.
 It helps in data classification, clustering, and other data mining tasks.
 Frequent pattern mining searches for recurring relationships in a given
data set.
Market Basket Analysis: A Motivating Example
 Frequent item set mining leads to the discovery of associations and
correlations among items in large transactional or relational data sets.
 It helps in many business decision-making processes such as catalog
design, cross-marketing, and customer shopping behavior analysis.
 Market basket analysis analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping baskets”.
 Discovery of these associations can help retailers develop marketing strategies.
 To learn more about the buying habits of customers, market basket analysis may be
performed on the retail data of customer transactions at a store.
 The results can be used to plan marketing or advertising strategies, or to design a
new catalog.
 They can also help in designing different store layouts.
One strategy
 Items that are frequently purchased together can be placed in close proximity to
further encourage the combined sale of such items.
Ex:
Computer hardware and software
Alternative strategy
 placing hardware and software at opposite ends of the store may tempt
customers who purchase such items to pick up other items along the
way.
ex:
Expensive computer and antivirus.
 Market basket analysis can also help retailers plan which items to put on
sale at reduced prices.
Ex:
Computer and printer
 each item has a Boolean variable, representing the presence or
absence of that item.
 Each basket can then be represented by a Boolean vector of values
assigned to these variables.
 The Boolean vectors can be analyzed for buying patterns that reflect
items that are frequently purchased together.
 These patterns can be represented in the form of association rules.
 computer ⇒ antivirus software [support = 2%, confidence = 60%].
 Support and confidence respectively reflect the usefulness and certainty of discovered rules.
 A support of 2% for Rule means that 2% of all the transactions under
analysis show that computer and antivirus software are purchased together.
 A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software.
 Association rules are considered interesting if they satisfy both a minimum
support threshold and a minimum confidence threshold.
 These thresholds can be set by users or domain experts.
 Additional analysis can be performed to discover interesting statistical
correlations between associated items.
 A set of items is referred to as an item set.
 An item set that contains k items is a k-item set.
 The set {computer, antivirus software} is a 2-itemset.
 The occurrence frequency of an item set is the number of transactions
that contain the item set.
 This is also known, simply, as the frequency, support
count, or count of the item set.
support(A⇒B) =P(A ∪B)
confidence(A⇒B) =P(B|A)
 Support represents the popularity of an itemset, measured over all transactions.
 Support is an indication of how frequently the items appear in the data.
 A support of 5% means the itemset appears in 5% of the transactions in the database.
 Support(A -> B) = Support count(A ∪ B) / total number of transactions
 Confidence indicates how often the if-then statement is found to be true.
 It measures how often the items in B appear in transactions that contain A.
 A confidence of 60% means that 60% of the customers who purchased milk and bread also
bought butter.
 Confidence(A -> B) = Support count(A ∪ B) / Support count(A)
 If a rule satisfies both minimum support and minimum confidence,
it is a strong rule.
Support_count(X) : Number of transactions in which X appears.
If X is A union B then it is the number of transactions in which A and B both are present.
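To make these definitions concrete, here is a minimal Python sketch that computes support and confidence for a rule from a list of transactions. The transactions, itemsets, and thresholds are illustrative assumptions, not data from this text.

# Minimal sketch: support and confidence from a transaction list (toy data).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "egg"},
    {"milk", "bread", "egg", "butter"},
    {"milk", "butter"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset: P(A ∪ B)."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(B|A) = support_count(A ∪ B) / support_count(A)."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

A = {"milk", "bread"}
B = {"butter"}
print("support   :", support(A | B, transactions))    # 2/5 = 0.4
print("confidence:", confidence(A, B, transactions))  # 2/3 ≈ 0.67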
 Association rule mining finds interesting associations and relationships
among large sets of data items.
 This rule shows how frequently an itemset occurs in a transaction.
 Ex: Market Based Analysis.
 An association rule is an implication expression of the form X -> Y, where X and Y
are any two itemsets.
Ex: {milk, bread} -> {egg}.
 association rule mining can be viewed as a two-step process:
 1. Find all frequent item sets: each of these item sets will occur at least as
frequently as a predetermined minimum support count, min_sup.
 2. Generate strong association rules from the frequent item sets: these
rules must satisfy minimum support and minimum confidence.
 Maximal Item set: An item set is maximal frequent if none of its supersets
are frequent.
 Steps:
 Examine the frequent item sets that appear at the border between the
infrequent and frequent item sets.
 Identify all of its immediate supersets.
 If none of the immediate supersets are frequent, the item set is maximal
frequent.
 Closed Item set: An item set is
closed in a data set if there exists no
superset that has the same support
count as this original item set.
 Example (support counts in parentheses): if itemset a(3) has supersets ab(2), ac(1),
and ad(2), then a is closed, because no superset has the same support count as a.
 K- Item set: Item set which contains K items is a K-item set.
 So it can be said that an itemset is frequent if its support count is greater than or
equal to the minimum support count.
 A major challenge in mining frequent item sets from a large data set is the
fact that such mining often generates a huge number of item sets satisfying
the minimum support (min sup) threshold, especially when min sup is set
low.
 This is because if an item set is frequent, each of its subsets is frequent as
well.
 For example, a frequent itemset of length 100, such as {a1, a2, ..., a100}, contains
C(100, 1) = 100 frequent 1-itemsets, C(100, 2) frequent 2-itemsets, and, in total,
2^100 − 1 frequent sub-itemsets.
 The Apriori algorithm was the first algorithm proposed for frequent itemset mining.
 It was introduced by R. Agrawal and R. Srikant in 1994, improving on earlier
frequent-itemset algorithms, and came to be known as Apriori.
 This algorithm uses two steps “join” and “prune” to reduce the search
space.
 It is an iterative approach to discover the most frequent item sets.
 The Apriori property: if an itemset is not frequent, neither is any of its supersets.
 If P(I) < the minimum support threshold, then I is not frequent.
 If I is not frequent, then P(I + A) < the minimum support threshold, so I + A is not
frequent, for any additional item or itemset A.
 Hence, if an itemset has support below minimum support, all of its supersets also fall
below minimum support and can be ignored.
 Join Step: This step generates candidate (K+1)-itemsets from frequent K-itemsets by
joining the set of K-itemsets with itself.
 Prune Step: This step scans the support count of each candidate in the database. If a
candidate does not meet minimum support, it is regarded as infrequent and removed.
This step is performed to reduce the size of the candidate itemsets.
 This data mining technique follows the join and prune steps iteratively until no new
frequent itemsets can be generated.
 A minimum support threshold is given in the problem or it is assumed by
the user.
 #1) In the first iteration of the algorithm, each item is taken as a candidate
1-itemset. The algorithm counts the occurrences of each item.
 #2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets whose
occurrence satisfies min_sup is determined. Only those candidates whose count is
greater than or equal to min_sup are taken ahead to the next iteration; the others
are pruned.
 #3) Next, the frequent 2-itemsets satisfying min_sup are discovered. For this, in the
join step the candidate 2-itemsets are generated by joining L1 with itself.
 #4) The 2-itemset candidates are pruned using the min_sup threshold. The table now
contains only 2-itemsets that meet min_sup.
 #5) The next iteration forms 3-itemsets using the join and prune steps. This iteration
uses the antimonotone property: the 2-itemset subsets of each candidate 3-itemset must
themselves satisfy min_sup.
 If all 2-itemset subsets are frequent then the superset will be frequent otherwise it
is pruned.
 #6) The next step forms 4-itemsets by joining the frequent 3-itemsets with themselves
and pruning any candidate whose subsets do not meet the min_sup criterion. The
algorithm stops when no further frequent itemsets can be generated.
 Method:
◦ Initially, scan DB once to get frequent 1-itemset
◦ Generate length (k+1) candidate item sets from length k frequent
item sets
◦ Terminate when no frequent or candidate set can be generated
 To improve the efficiency of the level-wise generation of frequent item
sets, the apriori property is used to reduce the search space
◦ All non-empty subsets of a frequent itemset must also be frequent
◦ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 large itemset: itemset with support > s
 candidate itemset: itemset that may have support > s
Worked example (min_sup = 2):

Transaction database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1 (candidate 1-itemsets with support counts):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (frequent 1-itemsets, after pruning {D}):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates from joining L1 with itself):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with support counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (frequent 2-itemsets):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (candidate from joining L2 with itself): {B, C, E}

3rd scan → L3 (frequent 3-itemsets): {B, C, E}: 2
 How to generate candidates?
◦ Step 1: self-joining Lk
◦ Step 2: pruning
 Example of Candidate-generation
◦ L3={abc, abd, acd, ace, bcd}
◦ Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
◦ Pruning:
 acde is removed because ade is not in L3
◦ C4 = {abcd}
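The join and prune steps can be expressed directly in code. The sketch below is a simplified illustration under the assumptions above (the function name apriori_gen and the toy L3 input are only suggestive): it joins sorted (k−1)-itemsets that agree on their first k−2 items and prunes candidates with an infrequent (k−1)-subset.

from itertools import combinations

def apriori_gen(Lk_prev):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.

    Lk_prev: set of frozensets, each of size k-1.
    Join: combine two itemsets whose first k-2 sorted items agree.
    Prune: drop any candidate with a (k-1)-subset not in Lk_prev.
    """
    prev = sorted(tuple(sorted(s)) for s in Lk_prev)
    k = len(prev[0]) + 1 if prev else 0
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            # join step: first k-2 items equal, last items differ
            if prev[i][:-1] == prev[j][:-1]:
                c = frozenset(prev[i]) | frozenset(prev[j])
                # prune step: every (k-1)-subset must be frequent
                if all(frozenset(sub) in Lk_prev
                       for sub in combinations(sorted(c), k - 1)):
                    candidates.add(c)
    return candidates

L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
print(sorted("".join(sorted(c)) for c in apriori_gen(L3)))  # ['abcd']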
Algorithm: Apriori. Find frequent itemsets using an iterative, level-wise approach
based on candidate generation.

Input: D, a database of transactions; min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.

Method:
(1)  L1 = find_frequent_1-itemsets(D);
(2)  for (k = 2; Lk−1 ≠ ∅; k++) {
(3)      Ck = apriori_gen(Lk−1);
(4)      for each transaction t ∈ D {      // scan D for counts
(5)          Ct = subset(Ck, t);           // get the subsets of t that are candidates
(6)          for each candidate c ∈ Ct
(7)              c.count++;
(8)      }
(9)      Lk = {c ∈ Ck | c.count ≥ min_sup}
(10) }
(11) return L = ∪k Lk;

procedure apriori_gen(Lk−1: frequent (k − 1)-itemsets)
(1)  for each itemset l1 ∈ Lk−1
(2)      for each itemset l2 ∈ Lk−1
(3)          if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k − 2] = l2[k − 2])
             ∧ (l1[k − 1] < l2[k − 1]) then {
(4)              c = l1 ✶ l2;              // join step: generate candidates
(5)              if has_infrequent_subset(c, Lk−1) then
(6)                  delete c;             // prune step: remove unfruitful candidate
(7)              else add c to Ck;
(8)          }
(9)  return Ck;
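As a complement to the pseudocode, here is a compact, self-contained Python sketch of the level-wise loop. It is an illustrative implementation under stated assumptions (transactions given as sets, min_sup as an absolute count); it generates candidates by taking unions of pairs of frequent (k−1)-itemsets and pruning, rather than following the textbook join line by line.

from collections import Counter
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    item_counts = Counter(item for t in transactions for item in t)
    Lk = {frozenset([i]): c for i, c in item_counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # candidate generation: unions of pairs of frequent (k-1)-itemsets
        prev = list(Lk)
        Ck = {a | b for i, a in enumerate(prev) for b in prev[i + 1:]
              if len(a | b) == k}
        # prune candidates that have an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # scan the database to count candidate support
        counts = Counter()
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# The four-transaction database from the worked example, min_sup = 2.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(tdb, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)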
 Once the frequent itemsets from transactions in a database D have been found, we can
generate strong association rules from them,
 where strong association rules satisfy both minimum support and minimum confidence.
 This can be done using the following equation for confidence.
 confidence(A ⇒ B) = P(B|A) = support count(A ∪B)/ support count(A) .
 The conditional probability is expressed in terms of itemset support count,
where support count(A ∪B) is the number of transactions containing the
itemsets A ∪B, and support count(A) is the number of transactions containing
the itemset A.
 Based on this equation, association rules can be generated as follows:
 For each frequent itemset l, generate all nonempty subsets of l.
 For every nonempty subset s of l, output the rule “s ⇒ (l − s)”
if (support count(l) / support count(s)) ≥ min conf, where min conf is the
minimum confidence threshold.
 Because the rules are generated from frequent itemsets, each one
automatically satisfies the minimum support.
 Frequent itemsets can be stored ahead of time in hash tables along with their
counts so that they can be accessed quickly.
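A short Python sketch of this rule-generation step. It is illustrative only: it assumes the frequent itemsets and their support counts are already available (e.g., from an Apriori run), and the toy input and min_conf value are assumptions.

from itertools import combinations

def generate_rules(frequent, min_conf):
    """Yield (antecedent, consequent, confidence) for every strong rule.

    frequent: dict mapping frozenset itemsets to their support counts.
    For each frequent itemset l and nonempty proper subset s, output
    s => (l - s) when support_count(l) / support_count(s) >= min_conf.
    """
    for l, l_count in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = l_count / frequent[s]
                if conf >= min_conf:
                    yield s, l - s, conf

# assumed toy input: counts of frequent itemsets from a prior mining step
frequent = {
    frozenset("B"): 3, frozenset("C"): 3, frozenset("E"): 3,
    frozenset("BC"): 2, frozenset("BE"): 3, frozenset("CE"): 2,
    frozenset("BCE"): 2,
}
for s, rest, conf in generate_rules(frequent, min_conf=0.7):
    print(set(s), "=>", set(rest), round(conf, 2))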
 Example: Given the following table and the frequent item set X = {I1, I2, I5},
generate the association rules.
 The nonempty subsets of X are:
{I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}
 The resulting association rules, of the form s ⇒ (X − s), are:
{I1, I2} ⇒ {I5}, {I1, I5} ⇒ {I2}, {I2, I5} ⇒ {I1}, {I1} ⇒ {I2, I5}, {I2} ⇒ {I1, I5},
{I5} ⇒ {I1, I2}
 If the minimum confidence threshold is, say, 70%, then only the second, third, and
last rules are output, because these are the only ones generated that are strong.
(i) Hash-based technique (hashing itemsets into corresponding buckets):
A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck,
for k > 1.
For example, when scanning each transaction in the database to generate the frequent
1-itemsets, L1, we can generate all the 2-itemsets for each transaction, hash (i.e.,
map) them into the different buckets of a hash table structure, and increase the
corresponding bucket counts.
A 2-itemset whose corresponding bucket count in the hash table is below the support
threshold cannot be frequent and thus should be removed from the candidate set.
Such a hash-based technique may substantially reduce the number of candidate
k-itemsets examined (especially when k = 2).
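A minimal sketch of the hash-based idea, assuming a toy hash function and bucket count (both are illustrative choices, not part of the text): while scanning transactions, every 2-itemset is hashed into a bucket, and any 2-itemset whose whole bucket total stays below min_sup can be dropped from C2.

from collections import Counter
from itertools import combinations

def hash_bucket(pair, n_buckets=7):
    """Toy hash function for a 2-itemset (an illustrative assumption)."""
    a, b = sorted(pair)
    return (hash(a) * 31 + hash(b)) % n_buckets

def prune_c2_with_hash(transactions, min_sup, n_buckets=7):
    bucket_counts = Counter()
    pair_to_bucket = {}
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            b = hash_bucket(pair, n_buckets)
            bucket_counts[b] += 1
            pair_to_bucket[frozenset(pair)] = b
    # a 2-itemset can be frequent only if its whole bucket count reaches min_sup
    return {p for p, b in pair_to_bucket.items() if bucket_counts[b] >= min_sup}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(len(prune_c2_with_hash(tdb, min_sup=2)))  # surviving candidate 2-itemsets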
(ii) Transaction reduction (reducing the number of transactions scanned in future
iterations):
A transaction that does not contain any frequent k-itemsets cannot contain any
frequent (k + 1)-itemsets.
Therefore, such a transaction can be marked or removed from further
consideration because subsequent database scans for j-itemsets, where j > k, will
not need to consider such a transaction.
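A sketch of this idea in Python (illustrative; Lk is assumed to be the set of frequent k-itemsets found in the current pass, and the toy data below adds one transaction with no frequent 2-itemset just to show a reduction):

def reduce_transactions(transactions, Lk):
    """Keep only transactions that contain at least one frequent k-itemset.

    A transaction with no frequent k-itemset cannot contain any frequent
    (k+1)-itemset, so later scans may ignore it.
    """
    return [t for t in transactions if any(itemset <= t for itemset in Lk)]

tdb = [frozenset(t) for t in
       [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}, {"A", "D"}]]
L2 = [frozenset(s) for s in [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]]
print(len(reduce_transactions(tdb, L2)))  # transactions kept for the next scan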
(iii) Partitioning (partitioning the data to find candidate item sets):
A partitioning technique can be used that requires just two database scans to mine
the frequent item sets.
It consists of two phases.
In phase I, the algorithm divides the transactions of D into n non overlapping
partitions.
If the minimum relative support threshold for transactions in D is min sup, then
the minimum support count for a partition is min sup × the number of transactions
in that partition.
For each partition, all the local frequent itemsets (i.e., the itemsets frequent within
the partition) are found.
A local frequent itemset may or may not be frequent with respect to the entire
database, D.
Partitioning (partitioning the data to find candidate item sets):
However, any itemset that is potentially frequent with respect to D must occur as
a frequent itemset in at least one of the partitions.
Therefore, all local frequent itemsets are candidate itemsets with respect to D.
The collection of frequent itemsets from all partitions forms the global candidate
itemsets with respect to D.
In phase II, a second scan of D is conducted in which the actual support of each
candidate is assessed to determine the global frequent itemsets.
Partition size and the number of partitions are set so that each partition can fit into
main memory and therefore be read only once in each phase.
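A sketch of the two-phase partitioning idea, under stated assumptions: min_sup is given as a fraction, partitions are formed by simple slicing, and a brute-force local miner (a stand-in for any frequent-itemset algorithm, limited here to small itemsets) is used for each in-memory partition.

from collections import Counter
from itertools import combinations

def local_frequent(partition, min_count, max_len=3):
    """Brute-force local miner (stand-in): frequent itemsets up to max_len."""
    counts = Counter()
    for t in partition:
        for r in range(1, min(max_len, len(t)) + 1):
            for s in combinations(sorted(t), r):
                counts[frozenset(s)] += 1
    return {s for s, c in counts.items() if c >= min_count}

def partition_mine(transactions, min_sup_fraction, n_partitions=2):
    parts = [transactions[i::n_partitions] for i in range(n_partitions)]
    # phase I: local frequent itemsets per partition form the global candidates
    candidates = set()
    for p in parts:
        local_min = max(1, int(min_sup_fraction * len(p)))
        candidates |= local_frequent(p, local_min)
    # phase II: one full scan of D to count the actual support of each candidate
    counts = Counter()
    for t in transactions:
        for c in candidates:
            if c <= frozenset(t):
                counts[c] += 1
    min_count = min_sup_fraction * len(transactions)
    return {c: n for c, n in counts.items() if n >= min_count}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(len(partition_mine(tdb, min_sup_fraction=0.5)))  # number of global frequent itemsets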
(iv) Sampling (mining on a subset of the given data):
 The basic idea of the sampling approach is to pick a random sample S of
the given data D, and then search for frequent itemsets in S instead of D.
 we trade off some degree of accuracy against efficiency.
 The sample size of S is such that the search for frequent itemsets in S can be done
in main memory, and so only one scan of the transactions in S is required overall.
 Because we are searching for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets.
 To reduce this possibility, use a lower support threshold than minimum support to
find the frequent itemsets local to S (denoted L_S).
 The rest of the database is then used to compute the actual frequencies of each
itemset in L_S.
Sampling (mining on a subset of the given data):
 A mechanism is used to determine whether all the global frequent itemsets are
included in L_S.
 If L_S actually contains all the frequent itemsets in D, then only one scan of D is
required.
 Otherwise, a second pass can be done to find the frequent itemsets that
were missed in the first pass.
 The sampling approach is especially beneficial when efficiency is of utmost
importance, such as in computationally intensive applications that must be run
frequently.
(v) Dynamic itemset counting (adding candidate itemsets at different points
during a scan):
 A dynamic itemset counting technique was proposed in which the database
is partitioned into blocks marked by start points.
 In this variation, new candidate itemsets can be added at any start point,
unlike in Apriori, which determines new candidate itemsets only
immediately before each complete database scan.
 The technique uses the count-so-far as the lower bound of the actual count.
 If the count-so-far passes the minimum support, the itemset is added into
the frequent itemset collection and can be used to generate longer
candidates. This leads to fewer database scans than with Apriori for finding
all the frequent itemsets.
 play basketball ⇒ eat cereal [40%, 66.7%]
 play basketball ⇒ not eat cereal [20%, 33.3%]
 Measure of dependent/correlated events: lift
 lift(A, B) = P(A ∪ B) / (P(A) P(B))
 If lift = 1, the occurrence of itemset A is independent of the occurrence of itemset
B; otherwise, they are dependent and correlated.

Contingency table (5000 transactions):
              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

lift(basketball, cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(basketball, not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
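The same calculation as a short Python sketch; the counts come from the contingency table above.

def lift(count_ab, count_a, count_b, n):
    """lift(A, B) = P(A ∪ B) / (P(A) * P(B)), computed from raw counts."""
    return (count_ab / n) / ((count_a / n) * (count_b / n))

n = 5000
print(round(lift(2000, 3000, 3750, n), 2))  # basketball & cereal     -> 0.89
print(round(lift(1000, 3000, 1250, n), 2))  # basketball & not cereal -> 1.33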
 Researchers have studied many pattern evaluation measures even before
the start of in-depth research on scalable methods for mining frequent
patterns. Recently, several other pattern evaluation measures have attracted
interest.
 Given two itemsets, A and B, the all_confidence measure of A and B is defined as
all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}.
 Given two itemsets, A and B, the max_confidence measure of A and B is defined as
max_conf(A, B) = max{P(A|B), P(B|A)}.
 The max_conf measure is the maximum confidence of the two association rules,
“A ⇒ B” and “B ⇒ A.”
 Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as Kulc)
is defined as Kulc(A, B) = (1/2)(P(A|B) + P(B|A)).
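All three measures can be computed directly from support counts, as in this short sketch (the support counts used in the example call are illustrative assumptions).

def all_conf(sup_ab, sup_a, sup_b):
    """all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)} = min{P(A|B), P(B|A)}."""
    return sup_ab / max(sup_a, sup_b)

def max_conf(sup_ab, sup_a, sup_b):
    """max_conf(A, B) = max{P(A|B), P(B|A)}."""
    return max(sup_ab / sup_a, sup_ab / sup_b)

def kulc(sup_ab, sup_a, sup_b):
    """Kulc(A, B) = (1/2) * (P(A|B) + P(B|A))."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

# illustrative support counts: sup(A) = 3000, sup(B) = 3750, sup(A ∪ B) = 2000
print(all_conf(2000, 3000, 3750), max_conf(2000, 3000, 3750), kulc(2000, 3000, 3750))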