Frequent itemsets and association rules
Viet-Trung Tran
1	
  
Credits
•  Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
2	
  
Supermarket shelf management
•  Goal: Identify items that are bought together
by sufficiently many customers
•  Approach: Process sales data to find
dependencies among items
•  A classic rule: 
– if someone buys diapers and milk, they are likely to buy
beer
3	
  
The market-basket model
•  A large set of items
•  A large set of baskets
•  Each basket is a small
subset of items
•  Want to discover
association rules
–  People who buy {a,b,c} tend
to buy {x,y,z}
4	
  
Generalization
•  Many-to-many mapping between two kinds
of things
– But we ask about connections among "items", not
"baskets"
– Items and baskets are abstract
•  products/shopping
•  words/documents
•  drugs/patients
5	
  
Application
•  Products = items; sets of products = baskets
•  Amazon: people who buy X also buy Y
•  Real market baskets: chain stores keep terabytes of data
about what customers buy together
–  Tells how typical customers navigate stores
–  Run a sale on diapers and milk, but raise the price of beer
6	
  
Application [2]
•  Documents = items; sentences = baskets
(a document is "in" a basket if it contains that sentence)
–  Items that appear together too often could represent
plagiarism
•  Patients = baskets; drugs and side effects = items
–  Detect combinations of drugs that result in particular side effects
7	
  
Application [3]
•  Finding community in graphs (e.g., Twitter)
8	
  
Frequent itemsets
•  Simplest question: find sets of
items that appear together
"frequently" in baskets
•  Support for itemset I
–  Number of baskets containing all
items in I
•  Given a support threshold s
–  Sets of items that appear in at least s
baskets are called frequent
itemsets
9	
  
Example
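A small worked illustration of support and frequent itemsets. The baskets below are hypothetical, invented for this sketch, and the brute-force enumeration is only meant for toy data:

```python
from itertools import combinations

# Hypothetical baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "bread"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def frequent_itemsets(baskets, s, max_size=2):
    """Enumerate all itemsets up to `max_size` and keep those with support >= s."""
    items = sorted(set().union(*baskets))
    result = {}
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            count = support(set(combo), baskets)
            if count >= s:
                result[combo] = count
    return result

# With threshold s = 3, every single item is frequent, but only some pairs are.
freq = frequent_itemsets(baskets, s=3)
```

Here {beer, diapers} has support 3 and is frequent, while {bread, diapers} appears in only 2 baskets and is not.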
10	
  
Association rules
•  If-then rules about the contents of baskets
•  {i1,i2,...,ik} -> j means: "if a basket contains
all of i1,...,ik then it is likely to contain j"
•  Confidence of this association rule is the
probability of j given I = {i1,i2,...,ik}:
conf(I -> j) = support(I ∪ {j}) / support(I)
11	
  
Observation
•  Not all high-confidence rules are interesting
– The rule X -> milk may have high confidence simply because
milk is purchased very often, independently of X
•  Interest of an association rule I -> j
– Difference between its confidence and the fraction
of baskets that contain j
– Interesting rules are those with high positive or negative
interest values
12	
  
Example
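A small worked illustration of confidence and interest. The baskets are hypothetical, and the function names are chosen for this sketch only:

```python
# Hypothetical baskets, invented for illustration only.
baskets = [
    {"milk", "bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "bread"},
]

def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for b in baskets if itemset <= b)

def confidence(I, j, baskets):
    """conf(I -> j) = support(I plus j) / support(I).
    Assumes I appears in at least one basket."""
    return support(I | {j}, baskets) / support(I, baskets)

def interest(I, j, baskets):
    """Confidence minus the fraction of all baskets that contain j."""
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)

# {diapers} -> beer: confidence 3/3, beer is in 4 of 5 baskets, interest about +0.2
# {bread} -> milk:   confidence 3/4, milk is in 4 of 5 baskets, interest about -0.05
```

The second rule has fairly high confidence but near-zero interest: milk is simply common, which is the situation the observation above warns about.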
13	
  
Finding association rules
•  Goal: find all association rules with
support >= s and confidence >= c
•  Hard part: finding the frequent itemsets
– If {i1,i2,...,ik} -> j has high support and
confidence, then both {i1,i2,...,ik} and {i1,i2,...,ik,j}
will be frequent
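Once the frequent itemsets and their supports are known, the rules can be read off without touching the baskets again. A minimal sketch, assuming the supports are already available in a dictionary (the function name and the sample supports are hypothetical):

```python
def rules_from_frequent(freq, c):
    """freq maps frozenset -> support count for every frequent itemset.
    Returns (left_side, j, confidence) for each rule with confidence >= c."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for j in itemset:
            left = itemset - {j}
            # `left` is a subset of a frequent itemset, so it is frequent
            # too and its support is guaranteed to be in `freq`.
            conf = sup / freq[left]
            if conf >= c:
                rules.append((left, j, conf))
    return rules

# Hypothetical supports, invented for illustration only.
freq = {
    frozenset({"beer"}): 4,
    frozenset({"diapers"}): 3,
    frozenset({"milk"}): 4,
    frozenset({"beer", "diapers"}): 3,
    frozenset({"beer", "milk"}): 3,
}
rules = rules_from_frequent(freq, c=0.9)
```

With these numbers only {diapers} -> beer survives the confidence threshold of 0.9.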
14	
  
Itemsets: computation models
•  The hardest problem often turns out to be
finding the frequent pairs
– Frequent pairs are common, frequent triples are rare:
the probability of being frequent drops
exponentially with size, while the number of
sets grows more slowly with size
•  First concentrate on pairs, and
then extend to larger itemsets
15	
  
Naive algorithm
•  Read the file once, counting in main memory the occurrences of each pair
•  For each basket of n items, generate its n(n-1)/2
pairs with two nested loops
•  Fails if (#items)^2 exceeds main memory
– #items can be 100K (Walmart) or 10B (web pages)
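The naive approach can be sketched as follows. The function name is an assumption of this sketch; the point is that the counter may end up with one entry per distinct pair, which is what exhausts memory:

```python
from collections import Counter

def naive_pair_counts(baskets):
    """Single pass: count every pair of items that co-occurs in a basket."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # canonical order so (a, b) == (b, a)
        n = len(items)
        # Two nested loops generate the n(n-1)/2 pairs of this basket.
        for a in range(n):
            for b in range(a + 1, n):
                counts[(items[a], items[b])] += 1
    return counts

counts = naive_pair_counts([[1, 2, 3], [1, 2, 4]])
```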
16	
  
A-priori algorithm [1]
•  A two-pass approach limits the need
for memory
•  Key idea: monotonicity
–  If a set of items I appears in at least s
baskets, so does every subset J of I
•  Contrapositive for pairs
–  If item i does not appear in s baskets,
then no pair including i can appear in s
baskets
17	
  
A-priori algorithm [2]
•  Pass 1: Read baskets and count in main
memory the occurrences of each individual
item
–  Items that appear >= s times are the frequent
items
•  Pass 2: Read baskets again and count only those
pairs where both elements are frequent
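A minimal sketch of the two passes for pairs. It assumes `read_baskets` is a caller-supplied function that re-reads the data from the start on every call (standing in for one pass over the file); that name is hypothetical:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(read_baskets, s):
    """Two-pass A-Priori restricted to pairs."""
    # Pass 1: count individual items.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(set(basket))
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two elements are both frequent.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}

data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
frequent_pairs = apriori_pairs(lambda: iter(data), s=2)
```

In this toy input, item d is infrequent, so no pair containing d is ever counted in pass 2.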
18	
  
Main-Memory: Picture of A-Priori
[Figure: main-memory layout. Pass 1 holds the item counts; pass 2 holds the frequent items and the counts of pairs of frequent items (candidate pairs).]
19	
  
Frequent triples, etc.
20	
  
PCY (Park-Chen-Yu) Algorithm
•  Observation: in pass 1 of A-Priori, most memory is idle
–  We store only individual item counts
–  Can we use the idle memory to reduce the memory required in pass 2?
•  Pass 1 of PCY: in addition to item counts, maintain a hash
table with as many buckets as fit in memory
–  Keep a count for each bucket into which pairs of items are hashed
–  For each bucket just keep the count, not the actual pairs that hash to the bucket!
21	
  
PCY Algorithm [2]
–  Pass 1:
•  Count the exact frequency of each item
•  Take pairs of items {i,j}, hash them into
B buckets, and count the number of
pairs that hash to each bucket
–  Pass 2:
•  For a pair {i,j} to be a candidate for a frequent pair,
its singletons {i}, {j} have to be frequent and the pair
has to hash to a frequent bucket (one whose count is >= s)
22	
  
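[Figure: pass 1 of PCY. Items 1…N are counted exactly; pairs are hashed into buckets 1…B.
Basket 1: {1,2,3} yields pairs {1,2} {1,3} {2,3}
Basket 2: {1,2,4} yields pairs {1,2} {1,4} {2,4}
Example bucket counts: 3, 1, 2]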
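The two PCY passes can be sketched as below. This is an illustration under assumptions: Python's built-in `hash` stands in for the hash function, `read_baskets` is a hypothetical callable that re-reads the data on each call, and the list of booleans between the passes is one common way to remember which buckets are frequent:

```python
from collections import Counter
from itertools import combinations

def pcy_pairs(read_baskets, s, num_buckets):
    """Two-pass PCY restricted to pairs."""
    # Pass 1: exact item counts, plus one counter per hash bucket.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in read_baskets():
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[hash(pair) % num_buckets] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= s}
    # Only one flag per bucket is kept for pass 2, not the counts.
    frequent_bucket = [c >= s for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent AND
    # the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            if frequent_bucket[hash(pair) % num_buckets]:
                pair_counts[pair] += 1
    return {p: c for p, c in pair_counts.items() if c >= s}

data = [[1, 2, 3], [1, 2, 4]]
frequent_pairs = pcy_pairs(lambda: iter(data), s=2, num_buckets=3)
```

The bucket filter can only discard pairs that cannot be frequent, so the result does not depend on the hash function or on the number of buckets; those only affect how much memory pass 2 needs.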
Frequent Itemsets in <= 2 Passes
•  A-Priori, PCY, etc., take k passes to find frequent
itemsets of size k
•  Can we use fewer passes?
•  Use 2 or fewer passes for all sizes,
but may miss some frequent itemsets
–  Random sampling
–  SON (Savasere, Omiecinski, and Navathe)
–  Toivonen
24	
  
Random Sampling [1]
•  Take a random sample of the market baskets
•  Run A-Priori or one of its improvements
in main memory
–  So we don’t pay for disk I/O each
time we increase the size of itemsets
–  Reduce the support threshold proportionally
to match the sample size
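A minimal sketch of the sampling idea for pairs. A plain in-memory pair count stands in for A-Priori here, and the function name, the `fraction` parameter, and the `seed` parameter are assumptions of this sketch:

```python
import random
from collections import Counter
from itertools import combinations

def sampled_frequent_pairs(baskets, s, fraction, seed=0):
    """Estimate frequent pairs from a random sample of the baskets."""
    rng = random.Random(seed)
    # Keep each basket independently with probability `fraction`.
    sample = [b for b in baskets if rng.random() < fraction]
    # Scale the threshold down to match the sample size.
    scaled_s = s * fraction
    counts = Counter()
    for basket in sample:
        counts.update(combinations(sorted(set(basket)), 2))
    return {p for p, c in counts.items() if c >= scaled_s}

baskets = [[1, 2, 3], [1, 2, 4], [1, 2]]
estimate = sampled_frequent_pairs(baskets, s=2, fraction=1.0)
```

The result is only an estimate: a pair that is frequent in the whole file may be missed in the sample, and a pair that is frequent in the sample may not be frequent overall.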
25	
  
[Figure: main memory holds a copy of the sample baskets plus space for counts.]
26	
  
SON Algorithm [1]
•  Repeatedly read small subsets of the baskets into
main memory and run an in-memory algorithm to
find all frequent itemsets
–  Note: we are not sampling, but processing the entire file
in memory-sized chunks
•  An itemset becomes a candidate if it is found to be
frequent in any one or more subsets of the baskets.
27	
  
SON Algorithm [2]
•  On a second pass, count all the candidate
itemsets and determine which are frequent in the
entire set
•  Key “monotonicity” idea: an itemset cannot be
frequent in the entire set of baskets unless it is
frequent in at least one subset (with the support
threshold scaled down to the size of that subset).
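The two SON passes can be sketched for pairs as follows. This is an illustration under assumptions: the baskets are given as a list that is split into equal chunks, a plain in-memory count plays the role of the in-memory algorithm, and the function names are hypothetical:

```python
import math
from collections import Counter
from itertools import combinations

def count_pairs(baskets):
    """In-memory count of every pair in the given baskets."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return counts

def son_frequent_pairs(baskets, s, num_chunks):
    """Two-pass SON restricted to pairs."""
    n = len(baskets)
    if n == 0:
        return {}
    size = math.ceil(n / num_chunks)
    chunks = [baskets[i:i + size] for i in range(0, n, size)]

    # Pass 1: a pair becomes a candidate if it is frequent in any
    # chunk, using a threshold scaled to the chunk's share of the data.
    candidates = set()
    for chunk in chunks:
        local_s = s * len(chunk) / n
        candidates |= {p for p, c in count_pairs(chunk).items() if c >= local_s}

    # Pass 2: count only the candidates over the entire data set.
    totals = Counter()
    for basket in baskets:
        items = set(basket)
        for a, b in candidates:
            if a in items and b in items:
                totals[(a, b)] += 1
    return {p: c for p, c in totals.items() if c >= s}

baskets = [[1, 2, 3], [1, 2, 4], [1, 2], [3, 4], [1, 3], [2, 4]]
frequent_pairs = son_frequent_pairs(baskets, s=3, num_chunks=2)
```

Because the local thresholds add up to s, a pair with total support >= s must reach its local threshold in at least one chunk, so pass 1 produces no false negatives and pass 2 removes the false positives.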
SON – Distributed Version
•  SON lends itself to distributed data mining 
•  Baskets distributed among many nodes 
–  Compute frequent itemsets at each node
–  Distribute candidates to all nodes
–  Accumulate the counts of all candidates
28	
  
SON: Map/Reduce
•  Phase 1: Find candidate itemsets
–  Map?
–  Reduce?
•  Phase 2: Find true frequent itemsets
–  Map?
–  Reduce?
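One possible answer to the Map and Reduce questions above, sketched for pairs. It is an illustration only: the function names are hypothetical, and a small in-process `shuffle` helper stands in for the grouping step that a real Map/Reduce framework would perform:

```python
from collections import Counter, defaultdict
from itertools import combinations

def local_pair_counts(chunk):
    counts = Counter()
    for basket in chunk:
        counts.update(combinations(sorted(set(basket)), 2))
    return counts

def map1(chunk, s, total):
    """Phase 1 map: emit (pair, 1) for each pair frequent within this chunk."""
    local_s = s * len(chunk) / total
    return [(p, 1) for p, c in local_pair_counts(chunk).items() if c >= local_s]

def reduce1(pair, values):
    """Phase 1 reduce: any pair emitted at least once is a candidate."""
    return pair

def map2(chunk, candidates):
    """Phase 2 map: emit (candidate, its count within this chunk)."""
    counts = local_pair_counts(chunk)
    return [(p, counts[p]) for p in candidates if counts[p] > 0]

def reduce2(pair, values, s):
    """Phase 2 reduce: sum the counts and keep the truly frequent pairs."""
    total = sum(values)
    return (pair, total) if total >= s else None

def shuffle(key_values):
    """Group values by key, as the framework would between map and reduce."""
    grouped = defaultdict(list)
    for k, v in key_values:
        grouped[k].append(v)
    return grouped

def son_mapreduce(chunks, s):
    total = sum(len(c) for c in chunks)
    phase1 = shuffle(kv for c in chunks for kv in map1(c, s, total))
    candidates = {reduce1(p, vs) for p, vs in phase1.items()}
    phase2 = shuffle(kv for c in chunks for kv in map2(c, candidates))
    results = (reduce2(p, vs, s) for p, vs in phase2.items())
    return dict(r for r in results if r is not None)

chunks = [[[1, 2, 3], [1, 2, 4], [1, 2]], [[3, 4], [1, 3], [2, 4]]]
frequent_pairs = son_mapreduce(chunks, s=3)
```

Each chunk corresponds to the baskets held by one node, so both map steps run without any communication between nodes.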
29	
  

More Related Content

What's hot

Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
Shani729
 
Apriori and Eclat algorithm in Association Rule Mining
Apriori and Eclat algorithm in Association Rule MiningApriori and Eclat algorithm in Association Rule Mining
Apriori and Eclat algorithm in Association Rule Mining
Wan Aezwani Wab
 

What's hot (20)

Lect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithmLect6 Association rule & Apriori algorithm
Lect6 Association rule & Apriori algorithm
 
Apriori algorithm
Apriori algorithm Apriori algorithm
Apriori algorithm
 
Sequential Pattern Mining and GSP
Sequential Pattern Mining and GSPSequential Pattern Mining and GSP
Sequential Pattern Mining and GSP
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Eclat algorithm in association rule mining
Eclat algorithm in association rule miningEclat algorithm in association rule mining
Eclat algorithm in association rule mining
 
Apriori Algorithm.pptx
Apriori Algorithm.pptxApriori Algorithm.pptx
Apriori Algorithm.pptx
 
Apriori
AprioriApriori
Apriori
 
UNIT-4.pptx
UNIT-4.pptxUNIT-4.pptx
UNIT-4.pptx
 
Community Detection
Community Detection Community Detection
Community Detection
 
Deep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and RegularizationDeep Feed Forward Neural Networks and Regularization
Deep Feed Forward Neural Networks and Regularization
 
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
K-MEDOIDS CLUSTERING  USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...K-MEDOIDS CLUSTERING  USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
K-MEDOIDS CLUSTERING USING PARTITIONING AROUND MEDOIDS FOR PERFORMING FACE R...
 
Data mining fp growth
Data mining fp growthData mining fp growth
Data mining fp growth
 
Rules of data mining
Rules of data miningRules of data mining
Rules of data mining
 
Apriori and Eclat algorithm in Association Rule Mining
Apriori and Eclat algorithm in Association Rule MiningApriori and Eclat algorithm in Association Rule Mining
Apriori and Eclat algorithm in Association Rule Mining
 
Data Mining: Outlier analysis
Data Mining: Outlier analysisData Mining: Outlier analysis
Data Mining: Outlier analysis
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Spatial data mining
Spatial data miningSpatial data mining
Spatial data mining
 
Data storage and indexing
Data storage and indexingData storage and indexing
Data storage and indexing
 
Association rule mining.pptx
Association rule mining.pptxAssociation rule mining.pptx
Association rule mining.pptx
 

Viewers also liked

лекция № 1 ока опорно-двиг
лекция № 1 ока опорно-двиглекция № 1 ока опорно-двиг
лекция № 1 ока опорно-двиг
lali100226
 
Babson College Advising
Babson College Advising Babson College Advising
Babson College Advising
jladino7
 
Apple
Apple Apple
Customer service culture
Customer service culture Customer service culture
Customer service culture
Mahmoud Dargaly
 
Ninong's video
Ninong's videoNinong's video
Ninong's video
mtlacap818
 
Universidad nacional de chimborazo.docx periferitos de entarada
Universidad  nacional de chimborazo.docx periferitos de entaradaUniversidad  nacional de chimborazo.docx periferitos de entarada
Universidad nacional de chimborazo.docx periferitos de entarada
jhon pintag
 
CHEN90023 Lachlan Russell 389374
CHEN90023 Lachlan Russell 389374CHEN90023 Lachlan Russell 389374
CHEN90023 Lachlan Russell 389374
Lachlan Russell
 
Schneider Electric offer in French
Schneider Electric offer in FrenchSchneider Electric offer in French
Schneider Electric offer in French
Aoqi Liu
 
100%open innovation methods for charities toolkit session
100%open innovation methods for charities   toolkit session100%open innovation methods for charities   toolkit session
100%open innovation methods for charities toolkit session
Nesta
 

Viewers also liked (20)

Φάρμακα κοινωνικών δραστηριοτήτων ή φάρμακα σχετι
Φάρμακα κοινωνικών δραστηριοτήτωνή φάρμακα σχετιΦάρμακα κοινωνικών δραστηριοτήτωνή φάρμακα σχετι
Φάρμακα κοινωνικών δραστηριοτήτων ή φάρμακα σχετι
 
Guía de cinemática2016
Guía de  cinemática2016Guía de  cinemática2016
Guía de cinemática2016
 
Equity and Transparency in the New Province of Humanity
Equity and Transparency in the New Province of HumanityEquity and Transparency in the New Province of Humanity
Equity and Transparency in the New Province of Humanity
 
лекция № 1 ока опорно-двиг
лекция № 1 ока опорно-двиглекция № 1 ока опорно-двиг
лекция № 1 ока опорно-двиг
 
Global CCS Institute - Day 1 - Panel 2 - CCS in Developing Countries
Global CCS Institute - Day 1 - Panel 2 - CCS in Developing CountriesGlobal CCS Institute - Day 1 - Panel 2 - CCS in Developing Countries
Global CCS Institute - Day 1 - Panel 2 - CCS in Developing Countries
 
Babson College Advising
Babson College Advising Babson College Advising
Babson College Advising
 
A Journey to the Summit of Kilimanjaro
A Journey to the Summit of KilimanjaroA Journey to the Summit of Kilimanjaro
A Journey to the Summit of Kilimanjaro
 
Citations needed for the sum of all human knowledge: Wikidata as the missing ...
Citations needed for the sum of all human knowledge: Wikidata as the missing ...Citations needed for the sum of all human knowledge: Wikidata as the missing ...
Citations needed for the sum of all human knowledge: Wikidata as the missing ...
 
Apple
Apple Apple
Apple
 
The DigiMarketing Imperative
The DigiMarketing ImperativeThe DigiMarketing Imperative
The DigiMarketing Imperative
 
Breaking the taboo: Legacy and in memory fundraising
Breaking the taboo: Legacy and in memory fundraisingBreaking the taboo: Legacy and in memory fundraising
Breaking the taboo: Legacy and in memory fundraising
 
Customer service culture
Customer service culture Customer service culture
Customer service culture
 
Ninong's video
Ninong's videoNinong's video
Ninong's video
 
2014 presentation 2014_fr
2014 presentation 2014_fr2014 presentation 2014_fr
2014 presentation 2014_fr
 
A journey in the world of animals. english
A journey in the world of animals. englishA journey in the world of animals. english
A journey in the world of animals. english
 
Universidad nacional de chimborazo.docx periferitos de entarada
Universidad  nacional de chimborazo.docx periferitos de entaradaUniversidad  nacional de chimborazo.docx periferitos de entarada
Universidad nacional de chimborazo.docx periferitos de entarada
 
CHEN90023 Lachlan Russell 389374
CHEN90023 Lachlan Russell 389374CHEN90023 Lachlan Russell 389374
CHEN90023 Lachlan Russell 389374
 
Schneider Electric offer in French
Schneider Electric offer in FrenchSchneider Electric offer in French
Schneider Electric offer in French
 
100%open innovation methods for charities toolkit session
100%open innovation methods for charities   toolkit session100%open innovation methods for charities   toolkit session
100%open innovation methods for charities toolkit session
 
Article overview: Deep Neural Networks Reveal a Gradient in the Complexity of...
Article overview: Deep Neural Networks Reveal a Gradient in the Complexity of...Article overview: Deep Neural Networks Reveal a Gradient in the Complexity of...
Article overview: Deep Neural Networks Reveal a Gradient in the Complexity of...
 

Similar to 2 association rules

3.1 mining frequent patterns with association rules-mca4
3.1 mining frequent patterns with association rules-mca43.1 mining frequent patterns with association rules-mca4
3.1 mining frequent patterns with association rules-mca4
Azad public school
 
Associations1
Associations1Associations1
Associations1
mancnilu
 

Similar to 2 association rules (20)

Data Mining Lecture_3.pptx
Data Mining Lecture_3.pptxData Mining Lecture_3.pptx
Data Mining Lecture_3.pptx
 
6 module 4
6 module 46 module 4
6 module 4
 
MODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptxMODULE 5 _ Mining frequent patterns and associations.pptx
MODULE 5 _ Mining frequent patterns and associations.pptx
 
Mining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association RulesMining Frequent Patterns And Association Rules
Mining Frequent Patterns And Association Rules
 
Lecture3 assoc rules
Lecture3 assoc rulesLecture3 assoc rules
Lecture3 assoc rules
 
Chapter 01 Introduction DM.pptx
Chapter 01 Introduction DM.pptxChapter 01 Introduction DM.pptx
Chapter 01 Introduction DM.pptx
 
Data mining techniques unit III
Data mining techniques unit IIIData mining techniques unit III
Data mining techniques unit III
 
Association Rules
Association RulesAssociation Rules
Association Rules
 
Association Rules
Association RulesAssociation Rules
Association Rules
 
Dm unit ii r16
Dm unit ii   r16Dm unit ii   r16
Dm unit ii r16
 
DM -Unit 2-PPT.ppt
DM -Unit 2-PPT.pptDM -Unit 2-PPT.ppt
DM -Unit 2-PPT.ppt
 
Rules of data mining
Rules of data miningRules of data mining
Rules of data mining
 
Dynamic itemset counting
Dynamic itemset countingDynamic itemset counting
Dynamic itemset counting
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
DMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association RulesDMTM 2015 - 05 Association Rules
DMTM 2015 - 05 Association Rules
 
DMTM Lecture 16 Association rules
DMTM Lecture 16 Association rulesDMTM Lecture 16 Association rules
DMTM Lecture 16 Association rules
 
3.1 mining frequent patterns with association rules-mca4
3.1 mining frequent patterns with association rules-mca43.1 mining frequent patterns with association rules-mca4
3.1 mining frequent patterns with association rules-mca4
 
Associations1
Associations1Associations1
Associations1
 
Associations.ppt
Associations.pptAssociations.ppt
Associations.ppt
 
Greedy method1
Greedy method1Greedy method1
Greedy method1
 

More from Viet-Trung TRAN

Dynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value StoreDynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value Store
Viet-Trung TRAN
 
Pregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnPregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớn
Viet-Trung TRAN
 
Mapreduce simplified-data-processing
Mapreduce simplified-data-processingMapreduce simplified-data-processing
Mapreduce simplified-data-processing
Viet-Trung TRAN
 

More from Viet-Trung TRAN (20)

Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
Bắt đầu tìm hiểu về dữ liệu lớn như thế nào - 2017
 
Dynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value StoreDynamo: Amazon’s Highly Available Key-value Store
Dynamo: Amazon’s Highly Available Key-value Store
 
Pregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớnPregel: Hệ thống xử lý đồ thị lớn
Pregel: Hệ thống xử lý đồ thị lớn
 
Mapreduce simplified-data-processing
Mapreduce simplified-data-processingMapreduce simplified-data-processing
Mapreduce simplified-data-processing
 
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của FacebookTìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
Tìm kiếm needle trong Haystack: Hệ thống lưu trữ ảnh của Facebook
 
giasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case studygiasan.vn real-estate analytics: a Vietnam case study
giasan.vn real-estate analytics: a Vietnam case study
 
Giasan.vn @rstars
Giasan.vn @rstarsGiasan.vn @rstars
Giasan.vn @rstars
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
Large-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on SparkLarge-Scale Geographically Weighted Regression on Spark
Large-Scale Geographically Weighted Regression on Spark
 
Recent progress on distributing deep learning
Recent progress on distributing deep learningRecent progress on distributing deep learning
Recent progress on distributing deep learning
 
success factors for project proposals
success factors for project proposalssuccess factors for project proposals
success factors for project proposals
 
GPSinsights poster
GPSinsights posterGPSinsights poster
GPSinsights poster
 
OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents OCR processing with deep learning: Apply to Vietnamese documents
OCR processing with deep learning: Apply to Vietnamese documents
 
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
Paper@Soict2015: GPSInsights: towards a scalable framework for mining massive...
 
Deep learning for nlp
Deep learning for nlpDeep learning for nlp
Deep learning for nlp
 
Introduction to BigData @TCTK2015
Introduction to BigData @TCTK2015Introduction to BigData @TCTK2015
Introduction to BigData @TCTK2015
 
From neural networks to deep learning
From neural networks to deep learningFrom neural networks to deep learning
From neural networks to deep learning
 
From decision trees to random forests
From decision trees to random forestsFrom decision trees to random forests
From decision trees to random forests
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filtering
 

Recently uploaded

Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
gajnagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
gajnagarg
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
HyderabadDolls
 
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
HyderabadDolls
 

Recently uploaded (20)

Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service AvailableVastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
Lake Town / Independent Kolkata Call Girls Phone No 8005736733 Elite Escort S...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 

2 association rules

  • 1. Frequent item set and Association rules Viet-Trung Tran 1  
  • 2. Credits •  Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University 2  
  • 3. Supermarket shelf management •  Goal: Identify items that are bought together by sufficiently many customers •  Approach: Process sales data to find dependencies among items •  A classic rule: – if someone buy diaper and milk, he/she will buy beer 3  
  • 4. The market-basket model •  A large set of items •  A large set of baskets •  Each basket is a small subset of items •  Want to discover association rules –  People who buy {a,b,c} tend to buy {x,y,z} 4  
  • 5. Generalization •  Many-to-many mapping between two kinds of things – But asks about connections among "items", not baskets – Items and baskets are abstract •  products/shopping •  words/documents •  drugs/patients 5  
  • 6. Application •  Products = items , sets of products = baskets •  Amazon people who buy X also buy Y •  Real market baskets: chain stores keep TB of data about what customers buy together –  Tell how typical customers navigate stores –  Run sale on diaper and milk but raise the price of beer 6  
  • 7. Application [2] •  Documents = items; sentences = baskets –  Items that appear together too often could represent plagairism •  Patients = items; drugs & side-effect = baskets –  detect combinations of drugs that result in side-effect 7  
  • 8. Application [3] •  Finding community in graphs (e.g., Twitter) 8  
  • 9. Frequent itemsets •  Simplest question: find set of items that appear together "frequently" in baskets •  Support for itemset I –  Number of baskets containing all items in I •  Given a support threadshold s –  Set of items that appear in at least s baskets are called frequent itemsets 9  
  • 11. Association rules •  If-then rules about the contents of baskets •  {i1,i2,...,ik} -> j means: "if a basket contains all of i then it is likely to contain j" •  Confidence of this association rule is the probability of j given I = {i1, i2,...,ik} 11  
  • 12. Observation •  Not all high confidence rules are interesting – The rule X -> milk is high confidence but it is just milk is purchased very often •  Interest of an association rule I -> j – Different between its confidence and the fraction of baskets that contain J – Interest on those with high positive or negative 12  
  • 14. Finding association rules •  Goal: finding all association rules with support >= s and confidence >= c •  Hard part: finding the frequent itemsets – If {i1,i2,...,ik} -> j has high support and confidence, then bot {i1,i2,...ik} and {i1,i2,...,ik,j} will be frequent 14  
  • 15. Itemsets: computation models • The hardest problem is often finding the frequent pairs – The probability of being frequent drops exponentially with itemset size, while the number of sets grows more slowly with size • First concentrate on pairs, then extend to larger itemsets
  • 16. Naive algorithm • Read the file once, counting pairs in main memory • For each basket of n items, generate its n(n-1)/2 pairs with two nested loops • Fails if (#items)^2 exceeds main memory – #items can be 100K (Walmart) or 10B (web pages)
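The naive single-pass count can be sketched as follows. The function name `count_pairs_naive` and the in-memory list of baskets are assumptions for illustration; a real run would stream baskets from a file.

```python
from collections import Counter


def count_pairs_naive(baskets):
    """Count every pair of items in one pass, keeping all counts in memory."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        n = len(items)
        # Two nested loops generate the n(n-1)/2 pairs of this basket.
        for a in range(n):
            for b in range(a + 1, n):
                counts[(items[a], items[b])] += 1
    return counts


pair_counts = count_pairs_naive([[1, 2, 3], [1, 2, 4]])
print(pair_counts[(1, 2)])  # the pair {1,2} occurs in both baskets
```

The counter may end up holding an entry for every distinct pair that occurs, which is the memory problem the slide points out.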
  • 17. A-priori algorithm [1] • A two-pass approach limits the need for memory • Key idea: monotonicity – If a set of items I appears at least s times, so does every subset J of I • Contrapositive for pairs – If item i does not appear in s baskets, then no pair including i can appear in s baskets
  • 18. A-priori algorithm [2] • Pass 1: Read baskets and count in main memory the occurrences of each individual item • Items that appear >= s times are the frequent items • Pass 2: Read baskets again and count only those pairs where both elements are frequent
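The two passes for pairs can be sketched as below. As an assumption for illustration, `read_baskets` is a zero-argument callable that returns a fresh iterator over the baskets each time, standing in for re-reading the file from disk; the names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(read_baskets, s):
    """Two-pass A-Priori restricted to pairs.

    Returns (frequent_items, {pair: support}) for pairs with support >= s.
    """
    # Pass 1: count individual items.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose two elements are both frequent.
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(item for item in set(basket) if item in frequent_items)
        pair_counts.update(combinations(kept, 2))

    frequent_pairs = {pair: c for pair, c in pair_counts.items() if c >= s}
    return frequent_items, frequent_pairs


# Hypothetical toy data.
data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
items, pairs = apriori_pairs(lambda: iter(data), s=2)
print(sorted(items), pairs)
```

Item "d" is dropped after pass 1, so no pair containing it is ever counted in pass 2.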
  • 19. Main-Memory: Picture of A-Priori [Figure: in pass 1, main memory holds the item counts; in pass 2, it holds the frequent items and the counts of pairs of frequent items (candidate pairs)]
  • 21. PCY (Park-Chen-Yu) Algorithm • Observation: In pass 1 of A-Priori, most memory is idle – We store only individual item counts – Can we use the idle memory to reduce the memory required in pass 2? • Pass 1 of PCY: In addition to item counts, maintain a hash table with as many buckets as fit in memory – Keep a count for each bucket into which pairs of items are hashed • For each bucket just keep the count, not the actual pairs that hash to the bucket!
  • 22. PCY Algorithm [2] – Pass 1: • Count the exact frequency of each item • Take pairs of items {i,j}, hash them into B buckets, and count the number of pairs that hash to each bucket – Pass 2: • For a pair {i,j} to be a candidate for a frequent pair, its singletons {i}, {j} have to be frequent and the pair has to hash to a frequent bucket! [Figure: items 1…N; Basket 1: {1,2,3} yields pairs {1,2} {1,3} {2,3}; Basket 2: {1,2,4} yields pairs {1,2} {1,4} {2,4}; the pairs are hashed into buckets 1…B, each holding only a count]
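A compact sketch of the two PCY passes for pairs follows. It assumes integer item IDs; the hash function `bucket_of`, the bucket count of 11, and the `read_baskets` callable (a stand-in for re-reading the file) are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def bucket_of(i, j, num_buckets):
    """Hypothetical hash of a pair of integer item IDs into a bucket index."""
    return (i * 31 + j) % num_buckets


def pcy_pairs(read_baskets, s, num_buckets=11):
    """Return {pair: support} for pairs with support >= s, using PCY."""
    # Pass 1: count items, and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in read_baskets():
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[bucket_of(i, j, num_buckets)] += 1

    frequent_items = {item for item, c in item_counts.items() if c >= s}
    # Between passes: one flag per bucket replaces the full bucket counts.
    frequent_bucket = [c >= s for c in bucket_counts]

    # Pass 2: count a pair only if both items are frequent
    # and the pair hashes to a frequent bucket.
    pair_counts = Counter()
    for basket in read_baskets():
        items = sorted(item for item in set(basket) if item in frequent_items)
        for i, j in combinations(items, 2):
            if frequent_bucket[bucket_of(i, j, num_buckets)]:
                pair_counts[(i, j)] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= s}


data = [[1, 2, 3], [1, 2, 4]]  # the two baskets from the slide's figure
frequent = pcy_pairs(lambda: iter(data), s=2)
print(frequent)
```

The bucket filter can only discard pairs that cannot be frequent, because a frequent pair always makes its own bucket frequent; the final answer therefore does not depend on the hash function, only the amount of counting in pass 2 does.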
  • 23. Frequent Itemsets in ≤ 2 Passes
  • 24. Frequent Itemsets in ≤ 2 Passes • A-Priori, PCY, etc., take k passes to find frequent itemsets of size k • Can we use fewer passes? • Use 2 or fewer passes for all sizes, but may miss some frequent itemsets – Random sampling – SON (Savasere, Omiecinski, and Navathe) – Toivonen
  • 25. Random Sampling [1] • Take a random sample of the market baskets • Run A-Priori or one of its improvements in main memory – So we don’t pay for disk I/O each time we increase the size of itemsets – Reduce the support threshold proportionally to match the sample size [Figure: main memory holds a copy of the sample baskets plus space for counts]
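The sampling idea can be sketched for pairs as follows. This assumes each basket is kept independently with probability `p` and the threshold is scaled to `p * s`; the function names, the fixed `seed`, and the toy data are illustrative assumptions. The result is an approximation: it can contain pairs that are not frequent in the full data and can miss pairs that are.

```python
import random
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, threshold):
    """Exact in-memory count of pairs; keep those with count >= threshold."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    return {pair: c for pair, c in counts.items() if c >= threshold}


def sample_frequent_pairs(baskets, s, p, seed=0):
    """Estimate the frequent pairs from a random sample of the baskets."""
    rng = random.Random(seed)
    # Keep each basket with probability p.
    sample = [basket for basket in baskets if rng.random() < p]
    # Scale the support threshold to the expected sample size.
    return frequent_pairs(sample, p * s)


# Hypothetical toy data: 200 baskets, full-data threshold s = 100.
baskets = [[1, 2, 3], [1, 2, 4], [1, 2], [3, 4]] * 50
estimate = sample_frequent_pairs(baskets, s=100, p=0.1)
print(sorted(estimate))
```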
  • 26. SON Algorithm [1] • Repeatedly read small subsets of the baskets into main memory and run an in-memory algorithm to find all frequent itemsets – Note: we are not sampling, but processing the entire file in memory-sized chunks • An itemset becomes a candidate if it is found to be frequent in any one or more subsets of the baskets.
  • 27. SON Algorithm [2] • On a second pass, count all the candidate itemsets and determine which are frequent in the entire set • Key “monotonicity” idea: an itemset cannot be frequent in the entire set of baskets unless it is frequent in at least one subset.
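The two SON passes can be sketched as below. Two assumptions are made for illustration: each chunk uses the support threshold scaled by its share of the baskets (an itemset is locally frequent when count × n ≥ s × chunk size, written with integers to avoid rounding), and itemsets are limited to `max_size` items. The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_itemsets(baskets, max_size=2):
    """Brute-force in-memory count of all itemsets up to max_size items."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return counts


def son(baskets, s, chunk_size, max_size=2):
    """Return {itemset: support} for itemsets with support >= s."""
    n = len(baskets)
    # Pass 1: an itemset is a candidate if it is frequent in at least
    # one chunk under the proportionally scaled threshold.
    candidates = set()
    for start in range(0, n, chunk_size):
        chunk = baskets[start:start + chunk_size]
        local = count_itemsets(chunk, max_size)
        candidates.update(
            itemset for itemset, c in local.items() if c * n >= s * len(chunk)
        )
    # Pass 2: count only the candidates over the entire data set.
    totals = Counter()
    for basket in baskets:
        basket = set(basket)
        for itemset in candidates:
            if itemset <= basket:
                totals[itemset] += 1
    return {itemset: c for itemset, c in totals.items() if c >= s}


# Hypothetical toy data.
baskets = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}, {"a", "b"}, {"d"}]
result = son(baskets, s=3, chunk_size=2)
print({tuple(sorted(k)): v for k, v in result.items()})
```

With the scaled threshold, an itemset that misses the local threshold in every chunk has total support below s, so pass 1 produces no false negatives and pass 2 removes the false positives.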
  • 28. SON – Distributed Version • SON lends itself to distributed data mining • Baskets distributed among many nodes – Compute frequent itemsets at each node – Distribute candidates to all nodes – Accumulate the counts of all candidates
  • 29. SON: Map/Reduce • Phase 1: Find candidate itemsets – Map? – Reduce? • Phase 2: Find true frequent itemsets – Map? – Reduce?
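One possible answer to the Map/Reduce questions is simulated below in plain Python, with no actual MapReduce framework involved. It is a sketch under stated assumptions: each chunk plays the role of one map task, the local threshold is scaled by the chunk's share of the baskets, and all function names (`map1`, `reduce1`, `map2`, `reduce2`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def itemsets_of(basket, max_size=2):
    """Yield every itemset of up to max_size items contained in the basket."""
    items = sorted(set(basket))
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def map1(chunk, s, n):
    """Phase 1 map: emit (itemset, 1) for each locally frequent itemset."""
    local = Counter(itemset for basket in chunk for itemset in itemsets_of(basket))
    return [(itemset, 1) for itemset, c in local.items() if c * n >= s * len(chunk)]


def reduce1(pairs):
    """Phase 1 reduce: ignore the values; the distinct keys are the candidates."""
    return {itemset for itemset, _ in pairs}


def map2(chunk, candidates):
    """Phase 2 map: emit (candidate, count of that candidate in this chunk)."""
    local = Counter()
    for basket in chunk:
        basket = set(basket)
        for itemset in candidates:
            if itemset <= basket:
                local[itemset] += 1
    return list(local.items())


def reduce2(pairs, s):
    """Phase 2 reduce: sum the counts and keep itemsets with support >= s."""
    totals = Counter()
    for itemset, c in pairs:
        totals[itemset] += c
    return {itemset: c for itemset, c in totals.items() if c >= s}


def son_mapreduce(chunks, s):
    n = sum(len(chunk) for chunk in chunks)
    candidates = reduce1([kv for chunk in chunks for kv in map1(chunk, s, n)])
    return reduce2([kv for chunk in chunks for kv in map2(chunk, candidates)], s)


# Hypothetical toy data: three chunks, as if stored on three nodes.
chunks = [
    [{"a", "b"}, {"a", "c"}],
    [{"a", "b", "c"}, {"b", "c"}],
    [{"a", "b"}, {"d"}],
]
frequent = son_mapreduce(chunks, s=3)
print({tuple(sorted(k)): v for k, v in frequent.items()})
```

In a real deployment the candidate set from phase 1 would have to be shipped to every phase 2 map task, which matches the "distribute candidates to all nodes" step of the distributed version.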