Market Basket Analysis
using Apriori algorithm
on “Groceries” dataset
Submitted By:
MadhuKiran P C20-085
Sai Vinod P C20-131
Sesha Sai Harsha C20-142
Contents
Overview
Apriori algorithm
The data
Transformed data to dummy flag variables
Program flow
Top 12 most frequent items
Results: Top 12 rules by “support”
Results: Top 12 rules by “confidence”
Results: Top 12 rules by “lift”
Web
Discussion
References
Overview:
Identifies frequently purchased groceries from the given transactional data
Uses the SPSS Modeler Apriori modelling node to calculate support, confidence and lift for
association rules
Lists the top 12 most frequently bought items and the top 10 item combinations (rules) by
support, confidence and lift
Apriori algorithm:
The Apriori algorithm uses a simple a priori belief as a guideline for reducing the association
rule search space: all subsets of a frequent item-set must also be frequent. Equivalently, any
item-set that contains an infrequent subset can be discarded without being counted
The support of an item-set or rule measures how frequently it occurs in the data, as a fraction
of all transactions
A rule's confidence is a measurement of its predictive power or accuracy. For a rule from X to Y,
it is defined as the support of the item-set containing both X and Y divided by the support of
the item-set containing only X
Lift is a measure of how much more likely an item is to be purchased, relative to its typical
purchase rate, given that another item has been purchased. For a rule from X to Y, it is the
confidence of the rule divided by the support of Y
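The three measures defined above can be sketched in a few lines of Python. This is only an illustration: the five baskets are hypothetical toy data (not rows from the Groceries dataset), and the function names `support`, `confidence` and `lift` are chosen here for readability.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy baskets, used only to illustrate the three measures.
transactions: List[FrozenSet[str]] = [
    frozenset({"whole milk", "butter", "yogurt"}),
    frozenset({"whole milk", "butter"}),
    frozenset({"whole milk", "rolls/buns"}),
    frozenset({"other vegetables", "butter", "whole milk"}),
    frozenset({"rolls/buns", "soda"}),
]


def support(itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str]) -> float:
    """support(X and Y together) divided by support(X)."""
    return support(set(x) | set(y)) / support(x)


def lift(x: Iterable[str], y: Iterable[str]) -> float:
    """confidence(X -> Y) divided by support(Y)."""
    return confidence(x, y) / support(y)


x, y = {"butter"}, {"whole milk"}
print(support(x | y), confidence(x, y), lift(x, y))
```

In this toy data every basket with butter also contains whole milk, so the confidence of the rule is 1, and the lift is above 1 because whole milk appears in only four of the five baskets.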
The data:
citrus fruit, semi-finished bread, margarine, ready soups
tropical fruit, yogurt, coffee
whole milk
pip fruit, yogurt, cream cheese, meat spreads
other vegetables, whole milk, condensed milk, long life bakery product
whole milk, butter, yogurt, rice, abrasive cleaner
rolls/buns
other vegetables, UHT milk, rolls/buns, bottled beer, liquor (appetizer)
potted plants
whole milk, cereals
tropical fruit, other vegetables, white bread, bottled water, chocolate
citrus fruit, tropical fruit, whole milk, butter, curd
beef
frankfurter, rolls/buns, soda
The dataset was created by researchers at the Department of Information Systems and
Operations, Wirtschaftsuniversität Wien, Austria
The “Groceries” data set contains 1 month (30 days) of real-world point-of-sale transaction data
from a typical local grocery outlet. The data set contains 9835 transactions and the items are
aggregated to 169 categories
Item categories have been used instead of brands, for simplicity. So “milk” can refer to any
brand of milk.
Transformed data to dummy flag variables:
Transaction  citrus fruit  tropical fruit  whole milk  pip fruit  other vegetables  rolls/buns  potted plants  beef
1            1             0               0           0          0                 0           0              0
2            0             1               0           0          0                 0           0              0
3            0             0               1           0          0                 0           0              0
4            0             0               0           1          0                 0           0              0
5            0             0               1           0          1                 0           0              0
6            0             0               1           0          0                 0           0              0
7            0             0               0           0          0                 1           0              0
8            0             0               0           0          1                 1           0              0
9            0             0               0           0          0                 0           1              0
10           0             0               1           0          0                 0           0              0
11           0             1               0           0          1                 0           0              0
12           1             1               1           0          0                 0           0              0
13           0             0               0           0          0                 0           0              1
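The transformation from item lists to dummy flags can be sketched as follows. This is a minimal illustration in plain Python rather than the SPSS Modeler step used in the project. It uses the first five sample transactions and the eight item columns shown in the table; the helper name `to_flags` is hypothetical.

```python
from typing import List

# The eight item columns shown in the flag table.
columns: List[str] = [
    "citrus fruit", "tropical fruit", "whole milk", "pip fruit",
    "other vegetables", "rolls/buns", "potted plants", "beef",
]

# First five sample transactions from the data section.
baskets: List[List[str]] = [
    ["citrus fruit", "semi-finished bread", "margarine", "ready soups"],
    ["tropical fruit", "yogurt", "coffee"],
    ["whole milk"],
    ["pip fruit", "yogurt", "cream cheese", "meat spreads"],
    ["other vegetables", "whole milk", "condensed milk", "long life bakery product"],
]


def to_flags(basket: List[str], columns: List[str]) -> List[int]:
    """Return one 0/1 flag per column: 1 if the item is in the basket."""
    present = set(basket)
    return [1 if column in present else 0 for column in columns]


flags = [to_flags(basket, columns) for basket in baskets]
for number, row in enumerate(flags, start=1):
    print(number, *row)
```

On the full dataset the same idea yields one flag column per item category, so items outside the eight columns used here (for example yogurt) would get columns of their own.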
Program flow:
Convert the dataset to dummy flag variables
Load the dataset into the SPSS Modeler environment
Use the Data Audit node to confirm that the matrix has 169 columns (corresponding to 169 item
categories) and 9835 rows (corresponding to 9835 transactions)
Apply the Apriori modelling node with a minimum support of 5% and a minimum confidence of 30%
to generate association rules, and report the lift of each rule
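For readers without SPSS Modeler, the rule-generation step can be approximated with a small textbook Apriori sketch. This is not the Apriori node's implementation: the functions `apriori` and `rules`, the toy baskets and the thresholds are all assumptions made for illustration, and the minimum support here applies to the whole item-set, which may differ from how the node applies its support threshold.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every item-set with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def frequent_only(candidates):
        counts = {c: sum(c <= b for b in baskets) for c in candidates}
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    level = frequent_only({frozenset([item]) for b in baskets for item in b})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-item-sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: all (k-1)-subsets of a candidate must themselves be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent_only(candidates)
        frequent.update(level)
        k += 1
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples
    for rules with a single-item consequent."""
    found = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            consequent = frozenset([item])
            antecedent = itemset - consequent
            conf = sup / frequent[antecedent]
            if conf >= min_confidence:
                found.append((antecedent, consequent, sup, conf,
                              conf / frequent[consequent]))
    return found


# Hypothetical toy baskets and thresholds, chosen only to keep the output small.
toy = [
    {"whole milk", "butter"},
    {"whole milk", "butter", "yogurt"},
    {"whole milk", "yogurt"},
    {"other vegetables", "root vegetables"},
    {"whole milk", "other vegetables"},
]
frequent = apriori(toy, min_support=0.4)
found = rules(frequent, min_confidence=0.6)
for antecedent, consequent, sup, conf, lift in found:
    print(sorted(antecedent), "->", sorted(consequent),
          round(sup, 3), round(conf, 3), round(lift, 3))
```

The prune step is where the a priori belief pays off: a candidate such as {whole milk, butter, yogurt} is dropped without being counted because its subset {butter, yogurt} is not frequent.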
Top 12 most frequent items:
[Figure: bar chart “Top 12 most frequent items”; bar values in descending order: 2513, 1903, 1809,
1715, 1372, 1087, 1072, 1032, 969, 924, 875, 814]
Results: Top 12 rules by “support”:
Consequent        Antecedent                       Support %  Confidence %  Lift
other vegetables  whole milk                       25.310     30.300        1.568
whole milk        other vegetables                 19.318     39.698        1.568
whole milk        rolls/buns                       18.443     31.542        1.246
other vegetables  yogurt                           14.011     32.570        1.686
whole milk        yogurt                           14.011     39.646        1.566
whole milk        bottled water                    11.270     30.789        1.216
other vegetables  root vegetables                  10.832     44.280        2.292
whole milk        root vegetables                  10.832     45.087        1.781
other vegetables  tropical fruit                   10.395     33.801        1.750
whole milk        tropical fruit                   10.395     39.130        1.546
Results: Top 12 rules by “confidence”:
Consequent        Antecedent                       Support %  Confidence %  Lift
whole milk        butter                           5.701      49.616        1.960
whole milk        curd                             5.642      48.320        1.909
whole milk        domestic eggs                    6.459      47.856        1.891
whole milk        root vegetables                  10.832     45.087        1.781
other vegetables  root vegetables                  10.832     44.280        2.292
whole milk        whipped/sour cream               7.333      43.936        1.736
other vegetables  yogurt and whole milk            5.555      43.045        2.228
whole milk        beef                             5.351      40.872        1.615
whole milk        margarine                        6.211      40.845        1.614
other vegetables  whipped/sour cream               7.333      40.755        2.110
Results: Top 12 rules by “lift”:
Consequent        Antecedent                       Support %  Confidence %  Lift
root vegetables   beef                             5.351      33.243        3.069
root vegetables   other vegetables and whole milk  7.669      31.749        2.931
yogurt            curd                             5.642      34.884        2.490
other vegetables  root vegetables                  10.832     44.280        2.292
other vegetables  yogurt and whole milk            5.555      43.045        2.228
yogurt            other vegetables and whole milk  7.669      31.179        2.225
other vegetables  whipped/sour cream               7.333      40.755        2.110
other vegetables  pork                             5.846      38.155        1.975
other vegetables  beef                             5.351      38.147        1.975
whole milk        butter                           5.701      49.616        1.960
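The Lift column can be cross-checked against the other columns of the result tables. The sketch below assumes that "Support %" is the support of the antecedent; the tables suggest this because the value repeats for every rule that shares an antecedent, but the source does not state it. Under that assumption, a rule's lift is its confidence divided by the support of its consequent, and the consequent's support can be read from a row where that item is the antecedent.

```python
# Rows copied from the result tables:
# (consequent, antecedent, support %, confidence %, reported lift)
rows = [
    ("whole milk", "butter", 5.701, 49.616, 1.960),
    ("root vegetables", "beef", 5.351, 33.243, 3.069),
    ("yogurt", "curd", 5.642, 34.884, 2.490),
    ("other vegetables", "whole milk", 25.310, 30.300, 1.568),
]

# Assumption: an item's own support is the Support % listed where it is the antecedent.
item_support = {
    "whole milk": 25.310,
    "root vegetables": 10.832,
    "yogurt": 14.011,
    "other vegetables": 19.318,
}

checks = []
for consequent, antecedent, sup, conf, reported_lift in rows:
    computed_lift = conf / item_support[consequent]
    checks.append((antecedent, consequent, computed_lift, reported_lift))
    print(f"{antecedent} -> {consequent}: "
          f"computed {computed_lift:.3f}, reported {reported_lift:.3f}")

# Joint support of {whole milk, other vegetables}: antecedent support x confidence.
joint_support = 25.310 * 30.300 / 100
print(f"joint support: {joint_support:.3f}")
```

Each computed lift agrees with the reported value to three decimals. The joint support of about 7.669% also matches the Support % listed for the antecedent "other vegetables and whole milk", which is consistent with the assumption.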
Web:
➔ We can observe that customers who buy pastry, citrus fruit and sausage stand out as a group
➔ This means, for example, that a customer who buys one of these three products is more likely
to buy the others as well
Discussion:
We can see that the top rules when sorted by “support” and “confidence” are dominated by
“whole milk” and “other vegetables”, which are the two most frequently bought items overall
However, when sorted by “lift”, rules with other consequents rise to the top, such as “root
vegetables” given “beef” (lift 3.069) and “yogurt” given “curd” (lift 2.490). A lift value
greater than 1 implies that the LHS and RHS item-sets occur together more often than would be
expected by chance if they were independent
Although such market basket analysis may yield many rules, not all of them are useful. Some are
trivial, some are inexplicable, and only a few are actionable. Further analysis, domain
knowledge and common sense are often required to judge the real-world usefulness of the rules
References:
Dataset download link (via the “arules” package):
http://cran.r-project.org/web/packages/arules/index.html
R. Agrawal and R. Srikant (1994), “Fast algorithms for mining association rules”, in Proceedings
of the 20th International Conference on Very Large Databases, pp. 487-499.
M. Hahsler, K. Hornik and T. Reutterer (2006), “Implications of probabilistic data modelling for
mining association rules”, in Studies in Classification, Data Analysis, and Knowledge
Organization: from Data and Information Analysis to Knowledge Engineering, pp. 598–605.
Brett Lantz, “Machine Learning with R”, Packt Publishing.