CS501: DATABASE AND DATA MINING
Mining Frequent Patterns
WHAT IS FREQUENT PATTERN ANALYSIS?

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications
  - Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
WHY IS FREQUENT PATTERN MINING IMPORTANT?

- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cube and cube-gradient
  - Semantic data compression: fascicles
  - Broad applications
BASIC CONCEPTS: FREQUENT PATTERNS AND ASSOCIATION RULES

- Itemset X = {x1, …, xk}
- Find all the rules X ⇒ Y with minimum support and confidence
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction containing X also contains Y

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules (support, confidence):
- A ⇒ D (60%, 100%)
- D ⇒ A (60%, 75%)

(Venn diagram: customers who buy beer, customers who buy diapers, customers who buy both.)
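As a minimal sketch, the support and confidence figures above can be checked directly from the transaction table; the `transactions` dictionary and helper names below are illustrative, not from the slides.

```python
# Transaction table from the slide (transaction-id -> items bought)
transactions = {
    10: {"A", "B", "D"},
    20: {"A", "C", "D"},
    30: {"A", "D", "E"},
    40: {"B", "E", "F"},
    50: {"B", "C", "D", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(X, Y):
    """Conditional probability that a transaction containing X also contains Y."""
    both = sum(1 for t in transactions.values() if set(X) | set(Y) <= t)
    just_x = sum(1 for t in transactions.values() if set(X) <= t)
    return both / just_x

print(support({"A", "D"}))       # 0.6  -> 60% support for A => D
print(confidence({"A"}, {"D"}))  # 1.0  -> A => D holds with 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D => A holds with 75% confidence
```

The rules A ⇒ D (60%, 100%) and D ⇒ A (60%, 75%) on the slide fall out directly.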
CLOSED PATTERNS AND MAX-PATTERNS

- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- Closed patterns are a lossless compression of the frequent patterns
- Reduces the # of patterns and rules
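These two definitions can be checked by brute force on the five-transaction table from the previous slide. The enumeration below is a sketch for toy data only (it examines every itemset), and all helper names are illustrative.

```python
from itertools import combinations

transactions = [
    {"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
    {"B", "E", "F"}, {"B", "C", "D", "E", "F"},
]
min_count = 3  # supmin = 50% of 5 transactions, i.e. at least 3 occurrences

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = set().union(*transactions)

# Brute-force enumeration of all frequent itemsets (fine only for toy data)
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(sorted(items), k):
        c = count(frozenset(combo))
        if c >= min_count:
            frequent[frozenset(combo)] = c

def is_closed(X):
    # Closed: no super-pattern has the same support; by anti-monotonicity
    # it suffices to check one-item extensions.
    return all(count(X | {i}) < frequent[X] for i in items - X)

def is_max(X):
    # Max: no frequent super-pattern exists.
    return not any(X < Y for Y in frequent)

closed = {X for X in frequent if is_closed(X)}
maximal = {X for X in frequent if is_max(X)}
```

On this data the frequent itemsets are {A, B, D, E, AD}; {A} is frequent but not closed, because its super-pattern AD has the same support (3), while every max-pattern is also closed.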
SCALABLE METHODS FOR MINING FREQUENT PATTERNS

- The downward closure property of frequent patterns
  - Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori
  - Frequent pattern growth (FP-growth)
  - Vertical data format approach
APRIORI: A CANDIDATE GENERATION-AND-TEST APPROACH

- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated
THE APRIORI ALGORITHM: AN EXAMPLE (Supmin = 2)

Database TDB:

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan -> C1 (candidate 1-itemsets with counts):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (frequent 1-itemsets; {D} pruned):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidate 2-itemsets):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan -> counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2 (frequent 2-itemsets):
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (candidate 3-itemsets):
{B, C, E}

3rd scan -> L3:
{B, C, E}: 2
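The whole example can be reproduced with a compact level-wise Apriori sketch; this is an illustrative implementation of the method described above, not code from the slides.

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def apriori(transactions, min_sup):
    # 1st scan: count single items, keep the frequent ones (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    levels = [Lk]
    k = 2
    while Lk:
        # Self-join: unions of frequent (k-1)-itemsets that have size k
        joined = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: drop candidates with an infrequent (k-1)-subset
        Ck = {c for c in joined
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan the DB to count candidate supports, keep the frequent ones
        supc = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c: s for c, s in supc.items() if s >= min_sup}
        if Lk:
            levels.append(Lk)
        k += 1
    return levels

levels = apriori(transactions, min_sup)
print(levels[2])  # the single frequent 3-itemset {B, C, E} with support 2
```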
IMPORTANT DETAILS OF APRIORI

- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning:
    - acde is removed because ade is not in L3
  - C4 = {abcd}
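The self-join and prune steps can be sketched as one small function; the loose join below (union any two frequent k-itemsets) over-generates slightly compared with the classic prefix-based join, but the prune step removes the extras, so the result matches the example. Names are illustrative.

```python
from itertools import combinations

def gen_candidates(Lk):
    """Self-join frequent k-itemsets, then prune candidates
    that have an infrequent k-subset."""
    k = len(next(iter(Lk)))
    joined = {a | b for a in Lk for b in Lk if a != b and len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
C4 = gen_candidates(L3)
print(C4)  # only abcd survives; acde is pruned because ade is not in L3
```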
HOW TO COUNT SUPPORTS OF CANDIDATES?

- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
BOTTLENECK OF FREQUENT-PATTERN MINING

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100:
    - # of scans: 100
    - # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
FREQUENT ITEMSET GENERATION

- Apriori uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent
  - Generation of candidate itemsets is expensive (in both space and time)
  - Support counting is expensive
    - Subset checking (computationally expensive)
    - Multiple database scans (I/O)
- FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach:
  - Step 1: build a compact data structure called the FP-tree
    - Built using 2 passes over the data set
  - Step 2: extract frequent itemsets directly from the FP-tree
STEP 1: FP-TREE CONSTRUCTION

The FP-tree is constructed using 2 passes over the data set.

Pass 1:
- Scan the data and find the support of each item.
- Discard infrequent items.
- Sort the frequent items in decreasing order of support.
- Use this order when building the FP-tree, so common prefixes can be shared.
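Pass 1 amounts to one counting scan plus a sort; a minimal sketch on the TDB example from the Apriori slide (variable names are illustrative):

```python
from collections import Counter

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

# One scan: support count per item
counts = Counter(item for t in transactions for item in t)

# Keep frequent items, sorted by decreasing support (ties broken alphabetically)
order = [i for i, c in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
         if c >= min_sup]
print(order)  # ['B', 'C', 'E', 'A'] -- D (support 1) is discarded
```

This fixed order is what Pass 2 uses when inserting each transaction, so shared prefixes collapse into shared paths.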
STEP 1: FP-TREE CONSTRUCTION

Pass 2: nodes correspond to items and have a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (when they have the same prefix). In this case, the shared counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines).
4. The more paths overlap, the higher the compression; the FP-tree may then fit in memory.
5. Frequent itemsets are extracted from the FP-tree.
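The construction above can be sketched in a few lines. This is an illustrative implementation, not the slides' own code; the `header` dictionary of node lists stands in for the dotted-line linked lists.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Pass 1: support counts and the fixed decreasing-support order
    counts = Counter(i for t in transactions for i in t)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    order = {item: rank for rank, (item, c) in enumerate(ranked) if c >= min_sup}
    # Pass 2: map each transaction to a path, sharing prefixes
    root = Node(None, None)
    header = defaultdict(list)  # item -> all nodes for that item
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1  # overlapping path: just bump the counter
    return root, header

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
root, header = build_fp_tree(transactions, 2)
# Support of an item = sum of the counters along its header list
print(sum(n.count for n in header["C"]))  # 3
```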
STEP 1: FP-TREE CONSTRUCTION (EXAMPLE)
FP-TREE SIZE

- The FP-tree is usually smaller than the uncompressed data, because typically many transactions share items (and hence prefixes).
  - Best case: all transactions contain the same set of items, giving a single path in the FP-tree.
  - Worst case: every transaction has a unique set of items (no items in common).
    - The size of the FP-tree is at least as large as the original data.
    - Storage requirements for the FP-tree are higher, since the pointers between nodes and the counters must also be stored.
- The size of the FP-tree depends on how the items are ordered.
  - Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).
STEP 2: FREQUENT ITEMSET GENERATION

- FP-Growth extracts frequent itemsets from the FP-tree.
- Bottom-up algorithm: it works from the leaves towards the root.
- Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
- First, extract the prefix path sub-trees ending in an item(set). (Hint: use the linked lists.)
PREFIX PATH SUB-TREES (EXAMPLE)
STEP 2: FREQUENT ITEMSET GENERATION

- Each prefix path sub-tree is processed recursively to extract the frequent itemsets. The solutions are then merged.
  - E.g., the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, etc.
- Divide and conquer approach.
CONDITIONAL FP-TREE

- The FP-tree that would be built if we only considered transactions containing a particular itemset (and then removed that itemset from all transactions).
- Example: the FP-tree conditional on e.
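The database a conditional FP-tree is built from is easy to state in code. The slide's figure is not reproduced here, so the toy `transactions` list below is hypothetical, used only to show the transformation:

```python
def conditional_db(transactions, item):
    """Transactions that contain `item`, with `item` removed: the database
    the FP-tree conditional on `item` would be built from."""
    return [t - {item} for t in transactions if item in t]

# Hypothetical toy transactions (not the slide's example data)
transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"b", "c", "e"}]
print(conditional_db(transactions, "e"))  # the three transactions that contained e
```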
EXAMPLE

Let minSup = 2 and extract all frequent itemsets containing e.

1. Obtain the prefix path sub-tree for e.
EXAMPLE

2. Check whether e is a frequent item by adding the counts along its linked list (the dotted line). If so, extract it.
   - Yes, count = 3, so {e} is extracted as a frequent itemset.
3. As e is frequent, find frequent itemsets ending in e, i.e. de, ce, be and ae.
EXAMPLE

4. Use the conditional FP-tree for e to find frequent itemsets ending in de, ce and ae.
   - Note that be is not considered, as b is not in the conditional FP-tree for e.
   - For each of them (e.g. de), find the prefix paths from the conditional tree for e, extract frequent itemsets, generate the conditional FP-tree, etc. (recursive).
EXAMPLE

- Example: e -> de -> ade ({d,e} and {a,d,e} are found to be frequent)
- Example: e -> ce ({c,e} is found to be frequent)
RESULT

Frequent itemsets found (ordered by suffix and the order in which they are found).
DISCUSSION

- Advantages of FP-Growth
  - Only 2 passes over the data set
  - "Compresses" the data set
  - No candidate generation
  - Much faster than Apriori
- Disadvantages of FP-Growth
  - The FP-tree may not fit in memory!
  - The FP-tree is expensive to build
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION

- ECLAT: for each item, store a list of transaction ids (tids): the vertical data layout (TID-lists).
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION

- Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.
- Advantage: very fast support counting
- Disadvantage: intermediate tid-lists may become too large for memory

Example:
A  = {1, 4, 5, 6, 7, 8, 9}
B  = {1, 2, 5, 7, 8, 10}
A ∧ B → AB = {1, 5, 7, 8}
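In code, the vertical layout and the intersection step are both one-liners; the `tid_lists` helper is an illustrative name, and the A and B lists are the ones from the example above.

```python
def tid_lists(transactions):
    """Vertical data layout: item -> set of ids of transactions containing it."""
    tids = {}
    for tid, items in transactions.items():
        for item in items:
            tids.setdefault(item, set()).add(tid)
    return tids

# Tid-lists from the example; the support of {A, B} is one set intersection
A = {1, 4, 5, 6, 7, 8, 9}
B = {1, 2, 5, 7, 8, 10}
AB = A & B
print(sorted(AB))  # [1, 5, 7, 8]
print(len(AB))     # 4: support count of {A, B}
```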
INTERESTINGNESS MEASURE: CORRELATIONS (LIFT)

- play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  - The overall % of students eating cereal is 75% > 66.7%.
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
LIFT

- Measure of dependent/correlated events:

  lift = P(A ∪ B) / (P(A) P(B))

- Lift = 1: A and B are independent
- Lift > 1: A and B are positively correlated
- Lift < 1: A and B are negatively correlated

From the table above (B = plays basketball, C = eats cereal):

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
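The two lift values follow directly from the contingency table; a quick arithmetic check (variable names are illustrative):

```python
# Contingency table counts from the slide (5000 students in total)
n = 5000
basketball = 3000        # play basketball
cereal = 3750            # eat cereal
both = 2000              # basketball and cereal
bb_not_cereal = 1000     # basketball and not cereal

def lift(joint, a, b, total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from counts out of `total`."""
    return (joint / total) / ((a / total) * (b / total))

print(round(lift(both, basketball, cereal, n), 2))               # 0.89
print(round(lift(bb_not_cereal, basketball, n - cereal, n), 2))  # 1.33
```

Since 0.89 < 1, basketball and cereal are negatively correlated, matching the slide's conclusion that the 66.7%-confidence rule is misleading.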

More Related Content

What's hot (20)

Trees, Binary Search Tree, AVL Tree in Data Structures
Trees, Binary Search Tree, AVL Tree in Data Structures Trees, Binary Search Tree, AVL Tree in Data Structures
Trees, Binary Search Tree, AVL Tree in Data Structures
 
Binary Trees
Binary TreesBinary Trees
Binary Trees
 
BINARY SEARCH TREE
BINARY SEARCH TREEBINARY SEARCH TREE
BINARY SEARCH TREE
 
THREADED BINARY TREE AND BINARY SEARCH TREE
THREADED BINARY TREE AND BINARY SEARCH TREETHREADED BINARY TREE AND BINARY SEARCH TREE
THREADED BINARY TREE AND BINARY SEARCH TREE
 
Trees (data structure)
Trees (data structure)Trees (data structure)
Trees (data structure)
 
1.1 binary tree
1.1 binary tree1.1 binary tree
1.1 binary tree
 
Trees
TreesTrees
Trees
 
Tree
TreeTree
Tree
 
Tree traversal techniques
Tree traversal techniquesTree traversal techniques
Tree traversal techniques
 
Binary Search Tree and AVL
Binary Search Tree and AVLBinary Search Tree and AVL
Binary Search Tree and AVL
 
Tree Data Structure by Daniyal Khan
Tree Data Structure by Daniyal KhanTree Data Structure by Daniyal Khan
Tree Data Structure by Daniyal Khan
 
Binary search trees
Binary search treesBinary search trees
Binary search trees
 
Binary Search Tree
Binary Search TreeBinary Search Tree
Binary Search Tree
 
Binary tree and Binary search tree
Binary tree and Binary search treeBinary tree and Binary search tree
Binary tree and Binary search tree
 
trees in data structure
trees in data structure trees in data structure
trees in data structure
 
Trees data structure
Trees data structureTrees data structure
Trees data structure
 
Tree in data structure
Tree in data structureTree in data structure
Tree in data structure
 
1.5 binary search tree
1.5 binary search tree1.5 binary search tree
1.5 binary search tree
 
Ch13 Binary Search Tree
Ch13 Binary Search TreeCh13 Binary Search Tree
Ch13 Binary Search Tree
 
Binary Search Tree
Binary Search TreeBinary Search Tree
Binary Search Tree
 

Similar to CS501: DATABASE AND DATA MINING - Frequent Pattern Mining

Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithmPradip Kumar
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...ijsrd.com
 
Association Rule.ppt
Association Rule.pptAssociation Rule.ppt
Association Rule.pptSowmyaJyothi3
 
Association Rule.ppt
Association Rule.pptAssociation Rule.ppt
Association Rule.pptSowmyaJyothi3
 
Pattern Discovery Using Apriori and Ch-Search Algorithm
 Pattern Discovery Using Apriori and Ch-Search Algorithm Pattern Discovery Using Apriori and Ch-Search Algorithm
Pattern Discovery Using Apriori and Ch-Search Algorithmijceronline
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns associationDeepaR42
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodShani729
 
Mining Frequent Itemsets.ppt
Mining Frequent Itemsets.pptMining Frequent Itemsets.ppt
Mining Frequent Itemsets.pptNBACriteria2SICET
 
ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptChellamuthuHaripriya
 
Interval intersection
Interval intersectionInterval intersection
Interval intersectionAabida Noman
 
Associations.ppt
Associations.pptAssociations.ppt
Associations.pptQuyn590023
 

Similar to CS501: DATABASE AND DATA MINING - Frequent Pattern Mining (20)

Fp growth algorithm
Fp growth algorithmFp growth algorithm
Fp growth algorithm
 
FP-growth.pptx
FP-growth.pptxFP-growth.pptx
FP-growth.pptx
 
My6asso
My6assoMy6asso
My6asso
 
Unit 3.pptx
Unit 3.pptxUnit 3.pptx
Unit 3.pptx
 
Data Mining
Data MiningData Mining
Data Mining
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
Simulation and Performance Analysis of Long Term Evolution (LTE) Cellular Net...
 
6 module 4
6 module 46 module 4
6 module 4
 
Association Rule.ppt
Association Rule.pptAssociation Rule.ppt
Association Rule.ppt
 
Association Rule.ppt
Association Rule.pptAssociation Rule.ppt
Association Rule.ppt
 
Pattern Discovery Using Apriori and Ch-Search Algorithm
 Pattern Discovery Using Apriori and Ch-Search Algorithm Pattern Discovery Using Apriori and Ch-Search Algorithm
Pattern Discovery Using Apriori and Ch-Search Algorithm
 
pattern mninng.ppt
pattern mninng.pptpattern mninng.ppt
pattern mninng.ppt
 
Mining frequent patterns association
Mining frequent patterns associationMining frequent patterns association
Mining frequent patterns association
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Mining Frequent Itemsets.ppt
Mining Frequent Itemsets.pptMining Frequent Itemsets.ppt
Mining Frequent Itemsets.ppt
 
ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .ppt
 
21 FP Tree
21 FP Tree21 FP Tree
21 FP Tree
 
Interval intersection
Interval intersectionInterval intersection
Interval intersection
 
Associations.ppt
Associations.pptAssociations.ppt
Associations.ppt
 

More from Kamal Singh Lodhi (16)

Introduction to Data Structure
Introduction to Data Structure Introduction to Data Structure
Introduction to Data Structure
 
Stack Algorithm
Stack AlgorithmStack Algorithm
Stack Algorithm
 
Data Structure (MC501)
Data Structure (MC501)Data Structure (MC501)
Data Structure (MC501)
 
Cs501 trc drc
Cs501 trc drcCs501 trc drc
Cs501 trc drc
 
Cs501 transaction
Cs501 transactionCs501 transaction
Cs501 transaction
 
Cs501 rel algebra
Cs501 rel algebraCs501 rel algebra
Cs501 rel algebra
 
Cs501 intro
Cs501 introCs501 intro
Cs501 intro
 
Cs501 fd nf
Cs501 fd nfCs501 fd nf
Cs501 fd nf
 
Cs501 dm intro
Cs501 dm introCs501 dm intro
Cs501 dm intro
 
Cs501 data preprocessingdw
Cs501 data preprocessingdwCs501 data preprocessingdw
Cs501 data preprocessingdw
 
Cs501 concurrency
Cs501 concurrencyCs501 concurrency
Cs501 concurrency
 
Cs501 cluster analysis
Cs501 cluster analysisCs501 cluster analysis
Cs501 cluster analysis
 
Cs501 classification prediction
Cs501 classification predictionCs501 classification prediction
Cs501 classification prediction
 
Attribute Classification
Attribute ClassificationAttribute Classification
Attribute Classification
 
Real Time ImageVideo Processing with Applications in Face Recognition
Real Time ImageVideo Processing with  Applications in Face Recognition   Real Time ImageVideo Processing with  Applications in Face Recognition
Real Time ImageVideo Processing with Applications in Face Recognition
 
Flow diagram
Flow diagramFlow diagram
Flow diagram
 

Recently uploaded

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentInMediaRes1
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxEyham Joco
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...jaredbarbolino94
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfUjwalaBharambe
 

Recently uploaded (20)

Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
Meghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media ComponentMeghan Sutherland In Media Res Media Component
Meghan Sutherland In Media Res Media Component
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...Historical philosophical, theoretical, and legal foundations of special and i...
Historical philosophical, theoretical, and legal foundations of special and i...
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdfFraming an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
Framing an Appropriate Research Question 6b9b26d93da94caf993c038d9efcdedb.pdf
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 

CS501: DATABASE AND DATA MINING - Frequent Pattern Mining

  • 1. CS501: DATABASE AND DATA MINING Mining Frequent Patterns1
  • 2. WHAT IS FREQUENT PATTERN ANALYSIS?  Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set  First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining  Motivation: Finding inherent regularities in data  What products were often purchased together?— Beer and diapers?!  What are the subsequent purchases after buying a PC?  What kinds of DNA are sensitive to this new drug?  Can we automatically classify web documents?  Applications  Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. 2
  • 3. WHY IS FREQ. PATTERN MINING IMPORTANT?  Discloses an intrinsic and important property of data sets  Forms the foundation for many essential data mining tasks  Association, correlation, and causality analysis  Sequential, structural (e.g., sub-graph) patterns  Pattern analysis in spatiotemporal, multimedia, time- series, and stream data  Classification: associative classification  Cluster analysis: frequent pattern-based clustering  Data warehousing: iceberg cube and cube-gradient  Semantic data compression: fascicles  Broad applications 3
  • 4. BASIC CONCEPTS: FREQUENT PATTERNS AND ASSOCIATION RULES  Itemset X = {x1, …, xk}  Find all the rules X  Y with minimum support and confidence support, s, probability that a transaction contains X ∪ Y confidence, c, conditional probability that a transaction having X also contains Y 4 Let supmin = 50%, confmin = 50% Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3} Association rules: A  D (60%, 100%) D  A (60%, 75%) Customer buys diaper Customer buys both Customer buys beer Transaction-id Items bought 10 A, B, D 20 A, C, D 30 A, D, E 40 B, E, F 50 B, C, D, E, F
  • 5. CLOSED PATTERNS AND MAX- PATTERNS  A long pattern contains a combinatorial number of sub- patterns, e.g., {a1, …, a100} contains (1 100 ) + (2 100 ) + … + (1 1 0 0 0 0 ) = 2100 – 1 = 1.27*1030 sub-patterns!  Solution: Mine closed patterns and max-patterns instead  An itemset X is closed if X is frequent and there exists no super-pattern Y X, with the same support as X  An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y X  Closed pattern is a lossless compression of freq. patterns Reducing the # of patterns and rules 5
  • 6. SCALABLE METHODS FOR MINING FREQUENT PATTERNS  The downward closure property of frequent patterns Any subset of a frequent itemset must be frequent If {beer, diaper, nuts} is frequent, so is {beer, diaper} i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}  Scalable mining methods: Three major approaches Apriori Freq. pattern growth Vertical data format approach 6
  • 7. APRIORI: A CANDIDATE GENERATION-AND-TEST APPROACH  Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!  Method: Initially, scan DB once to get frequent 1-itemset Generate length (k+1) candidate itemsets from length k frequent itemsets Test the candidates against DB Terminate when no frequent or candidate set can be generated 7
  • 8. THE APRIORI ALGORITHM—AN EXAMPLE 8 Database TDB 1st scan C1 L1 L2 C2 C2 2nd scan C3 L33rd scan Tid Items 10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E Itemset sup {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 Itemset sup {A} 2 {B} 3 {C} 3 {E} 3 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} Itemset sup {A, B} 1 {A, C} 2 {A, E} 1 {B, C} 2 {B, E} 3 {C, E} 2 Itemset sup {A, C} 2 {B, C} 2 {B, E} 3 {C, E} 2 Itemset {B, C, E} Itemset sup {B, C, E} 2 Supmin = 2
  • 9. IMPORTANT DETAILS OF APRIORI  How to generate candidates?  Step 1: self-joining Lk  Step 2: pruning  How to count supports of candidates?  Example of Candidate-generation  L3={abc, abd, acd, ace, bcd}  Self-joining: L3*L3  abcd from abc and abd  acde from acd and ace  Pruning:  acde is removed because ade is not in L3  C4={abcd} 9
HOW TO COUNT SUPPORTS OF CANDIDATES?
 Why is counting supports of candidates a problem?
 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash tree
 A leaf node of the hash tree contains a list of itemsets and counts
 An interior node contains a hash table
 Subset function: finds all the candidates contained in a transaction
10
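The hash tree itself is more involved, but the subset function's job can be illustrated with a plain dictionary standing in for the tree (the hash tree merely makes the candidate lookup sub-linear). Using C2 from the slide-8 example:

```python
from itertools import combinations

# C2 candidates from the Apriori example; a dict stands in for the hash tree.
C2 = {frozenset(p) for p in [('A', 'B'), ('A', 'C'), ('A', 'E'),
                             ('B', 'C'), ('B', 'E'), ('C', 'E')]}
counts = {c: 0 for c in C2}

transactions = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for t in transactions:
    # Subset function: enumerate every 2-subset of the transaction and
    # bump the counter of each subset that is a stored candidate.
    for pair in combinations(sorted(t), 2):
        key = frozenset(pair)
        if key in counts:
            counts[key] += 1
```

A transaction like {A,B,C,E} contributes to six candidates at once, which is why matching candidates against transactions efficiently matters.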
BOTTLENECK OF FREQUENT-PATTERN MINING
 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates lots of candidates
 To find the frequent itemset i1i2…i100:
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30!
 Bottleneck: candidate generation-and-test
 Can we avoid candidate generation?
11
FREQUENT ITEMSET GENERATION
 Apriori uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent
 Generation of candidate itemsets is expensive (in both space and time)
 Support counting is expensive
 Subset checking (computationally expensive)
 Multiple database scans (I/O)
 FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach:
 Step 1: build a compact data structure called the FP-tree
 Built using 2 passes over the data set
 Step 2: extract frequent itemsets directly from the FP-tree
12
STEP 1: FP-TREE CONSTRUCTION
 The FP-tree is constructed using 2 passes over the data set.
 Pass 1:
 Scan the data and find the support of each item
 Discard infrequent items
 Sort the frequent items in decreasing order of support; use this order when building the FP-tree, so that common prefixes can be shared
13
STEP 1: FP-TREE CONSTRUCTION
 Pass 2: nodes correspond to items and have a counter
 1. FP-Growth reads one transaction at a time and maps it to a path
 2. A fixed order is used, so paths can overlap when transactions share items (i.e., when they have the same prefix); in this case, the counters are incremented
 3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines)
 4. The more paths overlap, the higher the compression; the FP-tree may then fit in memory
 Frequent itemsets are then extracted from the FP-tree.
14
STEP 1: FP-TREE CONSTRUCTION (EXAMPLE)
15
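The slide's figure is not reproduced here, so as a stand-in, this is a minimal two-pass FP-tree construction over the slide-8 Apriori database (min_sup = 2). Class and variable names are mine, and the header table of node lists plays the role of the dotted-line linked lists:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_sup):
    # Pass 1: find each item's support, discard infrequent items, and fix a
    # global order (decreasing support, ties broken alphabetically).
    support = defaultdict(int)
    for t in transactions:
        for item in t:
            support[item] += 1
    freq = {i for i, s in support.items() if s >= min_sup}
    rank = {i: r for r, i in enumerate(
        sorted(freq, key=lambda i: (-support[i], i)))}

    # Pass 2: insert each transaction as a path, sharing common prefixes.
    root = FPNode(None, None)
    header = defaultdict(list)  # item -> its nodes (stands in for linked lists)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
root, header = build_fptree(db, min_sup=2)
```

With this data the item order is B, C, E, A (D is discarded); three of the four transactions share the B prefix, so the tree has only two branches off the root.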
FP-TREE SIZE
 The FP-tree is usually smaller than the uncompressed data, because many transactions typically share items (and hence prefixes)
 Best-case scenario: all transactions contain the same set of items → 1 path in the FP-tree
 Worst-case scenario: every transaction has a unique set of items (no items in common)
 The FP-tree is then at least as large as the original data
 Its storage requirements are higher, since it must also store the pointers between nodes and the counters
 The size of the FP-tree depends on how the items are ordered
 Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic)
16
STEP 2: FREQUENT ITEMSET GENERATION
 FP-Growth extracts frequent itemsets from the FP-tree
 Bottom-up algorithm: from the leaves towards the root
 Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
 First, extract the prefix-path sub-trees ending in an item(set) (hint: use the linked lists)
17
STEP 2: FREQUENT ITEMSET GENERATION
 Each prefix-path sub-tree is processed recursively to extract the frequent itemsets; solutions are then merged
 E.g., the prefix-path sub-tree for e is used to extract the frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
 Divide-and-conquer approach
19
CONDITIONAL FP-TREE
 The FP-tree that would be built if we only considered the transactions containing a particular itemset (and then removed that itemset from all transactions)
 Example: the FP-tree conditional on e
20
EXAMPLE
Let minSup = 2 and extract all frequent itemsets containing e.
 1. Obtain the prefix-path sub-tree for e:
21
EXAMPLE
 2. Check whether e is a frequent item by adding the counts along the linked list (dotted line). If so, extract it.
 Yes, count = 3, so {e} is extracted as a frequent itemset.
 3. As e is frequent, find the frequent itemsets ending in e, i.e., de, ce, be and ae.
22
EXAMPLE
 4. Use the conditional FP-tree for e to find the frequent itemsets ending in de, ce and ae
 Note that be is not considered, as b is not in the conditional FP-tree for e.
 For each of them (e.g., de), find the prefix paths from the conditional tree for e, extract the frequent itemsets, generate the conditional FP-tree, etc. (recursive)
23
EXAMPLE
 Example: e → de → ade ({d,e} and {a,d,e} are found to be frequent)
 Example: e → ce ({c,e} is found to be frequent)
24
RESULT
Frequent itemsets found (ordered by suffix and by the order in which they are found):
25
DISCUSSION
 Advantages of FP-Growth:
 Only 2 passes over the data set
 "Compresses" the data set
 No candidate generation
 Much faster than Apriori
 Disadvantages of FP-Growth:
 The FP-tree may not fit in memory!
 The FP-tree is expensive to build
26
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
 ECLAT: for each item, store a list of transaction ids (tids); this is the vertical data layout (TID-lists)
27
ECLAT: ANOTHER METHOD FOR FREQUENT ITEMSET GENERATION
 Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets
 Advantage: very fast support counting
 Disadvantage: intermediate tid-lists may become too large for memory
 Example: A = {1, 4, 5, 6, 7, 8, 9} ∧ B = {1, 2, 5, 7, 8, 10} → AB = {1, 5, 7, 8}
28
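The tid-list intersection above is the core operation; a compact recursive Eclat sketch over the slide-8 example database (function and variable names are mine) shows how it drives the whole search:

```python
def eclat(prefix, items, min_sup, out):
    # items: list of (item, tid-set) pairs; out maps itemset -> support.
    for i, (item, tids) in enumerate(items):
        out[frozenset(prefix + [item])] = len(tids)
        # Grow the prefix: intersect this tid-list with each later one
        # (very fast support counting, no database scan needed).
        suffix = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_sup:
                suffix.append((other, common))
        eclat(prefix + [item], suffix, min_sup, out)

# Build the vertical layout: item -> set of transaction ids.
db = {10: {'A', 'C', 'D'}, 20: {'B', 'C', 'E'}, 30: {'A', 'B', 'C', 'E'}, 40: {'B', 'E'}}
vertical = {}
for tid, bought in db.items():
    for item in bought:
        vertical.setdefault(item, set()).add(tid)

min_sup = 2
out = {}
start = sorted(((i, t) for i, t in vertical.items() if len(t) >= min_sup),
               key=lambda p: p[0])
eclat([], start, min_sup, out)
```

It recovers the same frequent itemsets as Apriori and FP-Growth on this data, e.g. {B,C,E} with support 2; note how the intermediate tid-sets (`common`) are exactly the lists that can blow up in memory on dense data.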
INTERESTINGNESS MEASURE: CORRELATIONS (LIFT)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
29
LIFT
 Measure of dependent/correlated events: lift
 lift = P(A ∪ B) / (P(A) × P(B))
 lift = 1: A and B are independent
 lift > 1: A and B are positively correlated
 lift < 1: A and B are negatively correlated
 lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
 lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
30
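Plugging the contingency table into the lift formula (numbers taken from the slide; the helper function is just for illustration):

```python
n = 5000  # total number of students

def lift(count_ab, count_a, count_b):
    # P(A ∪ B) / (P(A) * P(B)); as on the earlier support slide,
    # "A ∪ B" is read as "a transaction contains both A and B".
    return (count_ab / n) / ((count_a / n) * (count_b / n))

lift_b_c = lift(2000, 3000, 3750)      # basketball and cereal
lift_b_not_c = lift(1000, 3000, 1250)  # basketball and not cereal
```

lift(B, C) ≈ 0.89 < 1 confirms that playing basketball and eating cereal are negatively correlated, despite the 66.7% confidence of the rule, while lift(B, ¬C) ≈ 1.33 > 1 shows the positive correlation the confidence measure missed.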