Slides detailing the ideas behind frequent itemset mining and the a-priori algorithm. See code at GitHub: https://github.com/jaduimstra/ProductRecommendation
2. Frequent Itemsets
Try to identify the items that are frequently bought together
Example: people who buy a, b, c tend to buy d, e
Amazon:
– Keeps log of what you've bought
– Uses logs of all users to find items that are frequently bought together
3. Typical Problem
● A large set of items
● A large set of baskets
● Each basket has a small subset of items
● Define 'frequent' itemsets as those that appear in at least s baskets, where s is the 'support threshold'
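The definition of 'frequent' can be sketched in a few lines of Python. This is only an illustration: the toy baskets and the helper names `support` and `is_frequent` are made up for this example and are not taken from the slides or the linked repository.

```python
# Hypothetical toy data: each basket is a small set of items.
baskets = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c", "d"},
    {"a", "b", "c", "d"},
]


def support(itemset, baskets):
    """Return the number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def is_frequent(itemset, baskets, s):
    """An itemset is frequent if it appears in at least s baskets."""
    return support(itemset, baskets) >= s


s = 3  # support threshold (chosen arbitrarily for the toy data)
print(support({"a", "b"}, baskets), is_frequent({"a", "b"}, baskets, s))
print(support({"c", "d"}, baskets), is_frequent({"c", "d"}, baskets, s))
```

With s = 3, the pair {a, b} is frequent (it appears in 3 baskets) while {c, d} is not (it appears in 2).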
6. Computing Association Rules
1. Read data from disk. Data is typically stored basket-by-basket
2. Generate pairs, triples, quadruples, etc. of items as each basket is read
3. Count the number of occurrences of each itemset
4. Calculate confidence based on the support for itemsets
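The four steps above might look roughly like the following naive, single-pass sketch. Several things here are assumptions rather than content from the slides: the baskets are held in an in-memory list instead of being read from disk, the function names are hypothetical, and confidence is taken in its usual sense, confidence(I -> J) = support(I and J together) / support(I).

```python
from collections import Counter
from itertools import combinations


def count_itemsets(baskets, max_size):
    """Count every itemset of size 1..max_size, one basket at a time."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # sorted so each itemset has one canonical tuple
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def confidence(counts, antecedent, consequent):
    """Assumed standard definition: support(I union J) / support(I)."""
    left = tuple(sorted(set(antecedent)))
    both = tuple(sorted(set(antecedent) | set(consequent)))
    if counts[left] == 0:
        raise ValueError("antecedent never appears in the data")
    return counts[both] / counts[left]


# Hypothetical toy data standing in for a file read basket-by-basket.
baskets = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c", "d"],
    ["a", "b", "c", "d"],
]

counts = count_itemsets(baskets, max_size=3)
conf = confidence(counts, ["a", "b"], ["c"])
print(counts[("a", "b")], counts[("a", "b", "c")], conf)
```

Here {a, b} appears in 3 baskets and {a, b, c} in 2, so the rule {a, b} -> c has confidence 2/3. Note that this approach counts every possible itemset, which is exactly what becomes a problem at scale.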
BUT...
7. ...If the data is large
1. Disk I/O will slow processing: the fastest way is to sequentially read the entire data set, rather than randomly accessing different baskets
2. Itemset counting is limited by storing counts in memory; if the counts spill to disk, disk I/O will further slow computation
   1. For single items, memory is O(n), where n is the number of items
   2. For pairs of items, memory is O(n^2)
   3. Quickly run out of memory for large n
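A quick back-of-the-envelope calculation shows how fast the count table grows. The sketch below assumes the worst case of one counter for every possible itemset and 4 bytes per counter; both figures, and the choice of n = 100,000, are illustrative assumptions and not numbers from the slides.

```python
from math import comb

BYTES_PER_COUNT = 4  # assumption: one 4-byte integer per itemset count


def count_table_bytes(n_items, itemset_size):
    """Bytes needed to hold one counter per possible itemset of the given size."""
    return comb(n_items, itemset_size) * BYTES_PER_COUNT


n = 100_000  # hypothetical number of distinct items
singles = count_table_bytes(n, 1)  # 400,000 bytes
pairs = count_table_bytes(n, 2)    # 19,999,800,000 bytes, roughly 20 GB
print(singles, pairs)
```

Counting single items is cheap, but the number of possible pairs is n(n-1)/2, so the pair table alone already needs about 20 GB under these assumptions.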
8. A-priori Algorithm
Uses multiple passes through the data and counts only selected itemsets
Main idea
– If a set of items I appears in at least s baskets, so does every subset J of I
– Contrapositive for pairs:
• If item i does not appear in s baskets, then no pair including i can appear in s baskets
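For pairs, the idea can be sketched as two passes over the data. This is a minimal illustration under stated assumptions, not the code from the linked repository: `apriori_pairs` is a hypothetical name, and the baskets are an in-memory list that can be iterated twice, whereas a real run would re-read the file from disk on each pass.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, s):
    """Return {pair: count} for pairs appearing in at least s baskets."""
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose items are both frequent.
    # Any pair containing an infrequent item cannot be frequent, so it is skipped.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c for pair, c in pair_counts.items() if c >= s}


# Hypothetical toy data.
baskets = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c", "d"],
    ["a", "b", "c", "d"],
]

frequent_pairs = apriori_pairs(baskets, s=3)
print(frequent_pairs)
```

With s = 3, item d appears in only 2 baskets, so no pair containing d is ever counted in the second pass. The memory saving comes from pass 2 needing counters only for pairs of frequent items rather than for all possible pairs.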