Association Rule Mining
Project submitted
by
Pallab Das
MLIS (Digital Library)
Jadavpur University
What is ARM?
Association rule mining is a procedure that aims to
discover frequently occurring patterns, correlations, or
associations in datasets found in various kinds of
databases, such as:
• RDBMS
• Transactional databases
• and other forms of repositories.
{Bread} ==> {Milk}
{Soda} ==> {Chips}
{Bread} ==> {Jam}
Tid Items
1   Bread, Peanuts, Milk, Fruit, Jam
2   Bread, Jam, Soda, Chips, Milk, Fruit
3   Steak, Jam, Soda, Chips, Bread
4   Jam, Soda, Peanuts, Milk, Bread
5   Jam, Soda, Chips, Milk, Bread
6   Fruit, Soda, Chips, Milk
7   Fruit, Soda, Peanuts, Milk
• Association rule mining was defined in the
1990s
• Rakesh Agrawal, Tomasz Imieliński and Arun
Swami developed an algorithm-based way to
find relationships between items using point-
of-sale (POS) systems.
Rakesh Agrawal
Tomasz Imieliński
Arun Swami
• Most machine learning algorithms work with
numeric datasets and hence tend to be
mathematical.
• However, association rule mining is suitable
for non-numeric, categorical data and
requires little more than simple counting.
An association rule has 2
parts:
• an antecedent (if) and
• a consequent (then)
Story
• A famous story about association rule mining
is the "beer and diaper" story.
• Supermarket analysts discovered that
customers (presumably young men) who buy
diapers also tend to buy beer.
• This anecdote became popular as an example
of how unexpected association rules might be
found from everyday data.
If - then
• An antecedent is something that’s found in
data,
• and a consequent is an item that is found in
combination with the antecedent.
Have a look at this rule, for instance:
• “If a customer buys bread, he’s 70% likely
to also buy milk.”
In the above association rule, bread is the
antecedent and milk is the consequent.
Syntax
• The syntax of an association rule can be written as:
• buys(X, “computer”) ==> buys(X, “software”)
[support = 1%; confidence = 50%]
X = customer
Confidence 50% = if a customer buys a computer, there is a 50% chance of
also buying software.
Support 1% = 1% of all the transactions in the database contain both
computer and software.
Example with Compound Condition
• age(X, “25..39”) ^ income(X, “30K..35K”) ==> buys(X, “iPhone”)
[support = 2%; confidence = 60%]
• X = a customer whose age is between 25 and 39
• Income = the customer’s income is between 30K and 35K
• Confidence 60% = if a customer’s age is 25 – 39 and their income is 30K – 35K,
then there is a 60% chance of them buying an iPhone.
• Support 2% = 2% of transactions satisfy all three conditions together.
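To make the notation concrete, the following is a minimal Python sketch (an illustration added here, not part of the original slides) computing the support and confidence of the compound rule over a small hypothetical customer table; the field names and values are invented for illustration.

# Hypothetical customer records, made up for this illustration only.
customers = [
    {"age": 28, "income": 32_000, "buys_iphone": True},
    {"age": 35, "income": 31_000, "buys_iphone": False},
    {"age": 45, "income": 40_000, "buys_iphone": True},
    {"age": 30, "income": 34_000, "buys_iphone": True},
]

def antecedent(c):
    # age(X, "25..39") AND income(X, "30K..35K")
    return 25 <= c["age"] <= 39 and 30_000 <= c["income"] <= 35_000

both = sum(1 for c in customers if antecedent(c) and c["buys_iphone"])
ante = sum(1 for c in customers if antecedent(c))

support = both / len(customers)   # P(antecedent and consequent) = 2/4 = 50%
confidence = both / ante          # P(consequent | antecedent) = 2/3 ~ 67%
print(f"support = {support:.0%}, confidence = {confidence:.0%}")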
Some well-known techniques
• Some of the well-known techniques are:
Market Basket Analysis
Apriori Algorithm
FP Growth Algorithm
Market Basket Analysis
Support and Confidence
Consider the following example of library circulation --
Modern History ==> World War II {Support: 9%, Confidence: 65%}
Support: the percentage of circulation transactions (T) in a certain library
that contain books on both Modern History (A) and World War II (B)
together. (9% of all transactions contain these two books together.)
Support (A ==> B) = P(A ∪ B)
Confidence: the probability that a user who has already borrowed Modern
History (A) also borrows World War II (B) (65%).
Confidence (A ==> B) = P(B | A)
Tid = Transaction Id
Items – A, B, C, D
Total support = total number of transactions (5)
Support = occurrence / total support
Confidence of a rule X ==> Y
= (occurrence of X and Y together / occurrence of X) * 100
P(B|A) = probability of purchasing item B given item A

TId Items
1   A, B, C
2   A, B, D
3   B, C
4   A, C
5   B, C, D
Calculating Support
Support = occurrence / total support

TId Items
1   A, B, C
2   A, B, D
3   B, C
4   A, C
5   B, C, D

Total support = 5
Support {A, B} = 2/5 = 40%
Support {B, C} = 3/5 = 60%
Support {A, B, C} = 1/5 = 20%

• Items A & B occur together in 2 transactions.
• Dividing by the total support (5) gives Support {A, B} = 40%.
• Calculating the same way, we get
• Support {B, C} = 60%
• Support {A, B, C} = 20%
The itemset with the highest support is {B, C}.
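A minimal Python sketch (an addition, not from the slides) that reproduces the support figures above from the five transactions:

# The five transactions from the table above (items A, B, C, D).
transactions = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"B", "C"},
    {"A", "C"},
    {"B", "C", "D"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    count = sum(1 for t in transactions if set(itemset) <= t)
    return count / len(transactions)

print(support({"A", "B"}))       # 0.4  -> 40%
print(support({"B", "C"}))       # 0.6  -> 60%
print(support({"A", "B", "C"}))  # 0.2  -> 20%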
Calculate Confidence
Confidence = occurrence of X and Y together / occurrence of X

TId Items
1   A, B, C
2   A, B, D
3   B, C
4   A, C
5   B, C, D

Confidence {A ==> B} = 2/3 = 66%
Confidence {B ==> C} = 3/4 = 75%
Confidence {A, B ==> C} = 1/2 = 50%

A has been purchased 3 times.
B has been purchased along with A 2 times.
Confidence {A ==> B} = 2/3 = 66%
This means that 66% of customers who bought
item A also bought item B.
In short, the probability of buying item B
given item A is P(B|A) = 66%.
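Continuing the sketch above, confidence can be computed the same way; the function below assumes the transactions list defined in the previous snippet:

def confidence(antecedent, consequent):
    # Support count of (X union Y) divided by the support count of X.
    x, xy = set(antecedent), set(antecedent) | set(consequent)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if xy <= t)
    return count_xy / count_x

print(confidence({"A"}, {"B"}))       # 2/3 ~ 0.66
print(confidence({"B"}, {"C"}))       # 3/4 = 0.75
print(confidence({"A", "B"}, {"C"}))  # 1/2 = 0.50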
Benefits of Market Basket Analysis:
 Store Layout:
Based on the insights from market basket analysis, we can
organize the store layout to increase revenues.
 Marketing Messages:
Targeted marketing messages help our customers with useful suggestions
instead of annoying them with marketing blasts.
 Content Placement:
Online publishers and bloggers can display the content a
reader is most likely to read next. This reduces bounce
rate, improves engagement and results in better performance in
search results.
 Recommendation Engines:
Recommendation engines are already used by popular
companies like Netflix, Amazon, Facebook, etc.
Apriori Algorithm
• The Apriori algorithm uses frequent itemsets to
generate association rules.
• It is based on the concept that any subset of a
frequent itemset must also be a frequent itemset.
Frequent Itemset
• A frequent itemset is an itemset whose support
value is greater than a threshold value.
• Frequent itemset = support > threshold value
We have 5 items and 5 transactions (T1 – T5).
Suppose the minimum support count is 2.

TID Items
T1  1, 3, 4
T2  2, 3, 5
T3  1, 2, 3, 5
T4  2, 5
T5  1, 3, 5
1st Iteration
We have the itemset and support for the 5
items in table C1.

C1
ItemSet Support
{1}     3
{2}     3
{3}     4
{4}     1
{5}     4

Here item no. 4 has to be eliminated,
as its support value is less than the
minimum support: 1 < 2.

After eliminating item no. 4,
we get the frequent itemset table F1.

F1
ItemSet Support
{1}     3
{2}     3
{3}     4
{5}     4
2nd Iteration

C2
Item set Support
{1,2}    1
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3

Itemset {1,2} will be eliminated
because its support value is less
than the minimum threshold (2).

F2
Item set Support
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3
C3 – candidate 3-itemsets
{1,2,3}
{1,2,5}
{1,3,5}
{2,3,5}
Apriori Algorithm – Pruning
3rd Iteration
Item set   2-item subsets        All in F2?
{1,2,3}    {1,2},{1,3},{2,3}     NO
{1,2,5}    {1,2},{1,5},{2,5}     NO
{1,3,5}    {1,3},{1,5},{3,5}     YES
{2,3,5}    {2,3},{2,5},{3,5}     YES
Here we can see that the subset {1,2} is not in F2, so the candidates
containing it are removed.
4th Iteration

F3
Item set Support
{1,3,5}  2
{2,3,5}  2

C4
Item set  Support
{1,2,3,5} 1

The candidate itemset {1,2,3,5} is
eliminated because its support is
less than the minimum threshold (2).
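The candidate-generation, pruning and support-counting steps walked through above can be put together in a short Python sketch; this is a simplified illustration (assuming a minimum support count of 2), not production code:

from itertools import combinations

transactions = [
    {1, 3, 4},     # T1
    {2, 3, 5},     # T2
    {1, 2, 3, 5},  # T3
    {2, 5},        # T4
    {1, 3, 5},     # T5
]
MIN_SUPPORT = 2  # minimum support count used in the walkthrough

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# F1: frequent 1-itemsets (C1 filtered by minimum support)
items = sorted({i for t in transactions for i in t})
frequent = [frozenset({i}) for i in items if support_count({i}) >= MIN_SUPPORT]

k = 2
while frequent:
    print(f"F{k - 1}:", sorted(sorted(s) for s in frequent))
    # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Pruning: a candidate survives only if all its (k-1)-subsets are frequent
    candidates = [c for c in candidates
                  if all(frozenset(sub) in frequent
                         for sub in combinations(c, k - 1))]
    # Support counting: keep candidates that meet the minimum support
    frequent = [c for c in candidates if support_count(c) >= MIN_SUPPORT]
    k += 1
# Prints F1, F2 and F3 exactly as in the tables above; no 4-itemset survives.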
Apriori Algorithm – Subset Creation
Assume our minimum confidence value is 60%.
For I = {1,3,5}, the subsets are {1,3}, {1,5}, {3,5}, {1}, {3}, {5}
For I = {2,3,5}, the subsets are {2,3}, {2,5}, {3,5}, {2}, {3}, {5}
Applying rules to itemset F3
• Rule 1: {1,3} ==> ({1,3,5} - {1,3}), i.e. 1&3 ==> 5
Confidence = support(1,3,5) / support(1,3) = 2/3 = 66.66% > 60%
• Rule 2: {1,5} ==> ({1,3,5} - {1,5}), i.e. 1&5 ==> 3
Confidence = support(1,3,5) / support(1,5) = 2/2 = 100% > 60%
• Rule 3: {3,5} ==> ({1,3,5} - {3,5}), i.e. 3&5 ==> 1
Confidence = support(1,3,5) / support(3,5) = 2/3 = 66.66% > 60%
• Rule 4: {1} ==> ({1,3,5} - {1}), i.e. 1 ==> 3&5
Confidence = support(1,3,5) / support(1) = 2/3 = 66.66% > 60%
• Rule 5: {3} ==> ({1,3,5} - {3}), i.e. 3 ==> 1&5
ₓ Confidence = support(1,3,5) / support(3) = 2/4 = 50% < 60%
• Rule 6: {5} ==> ({1,3,5} - {5}), i.e. 5 ==> 1&3
ₓ Confidence = support(1,3,5) / support(5) = 2/4 = 50% < 60%
• Rules 1 – 4 are established.
• Rules 5 – 6 are eliminated.
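The rule-generation step above can likewise be sketched in a few lines of Python. It assumes the same transactions and a minimum confidence of 60%, and prints the six rules for {1,3,5} (plus the analogous rules for {2,3,5}, whose subsets are listed above but not worked out on the slides):

from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
MIN_CONFIDENCE = 0.60

def support_count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

for itemset in ({1, 3, 5}, {2, 3, 5}):
    for size in (2, 1):  # antecedents of size 2, then size 1
        for antecedent in combinations(sorted(itemset), size):
            consequent = itemset - set(antecedent)
            conf = support_count(itemset) / support_count(antecedent)
            verdict = "established" if conf >= MIN_CONFIDENCE else "eliminated"
            print(f"{set(antecedent)} ==> {consequent}: {conf:.2%} ({verdict})")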
Advantages of the Apriori algorithm
• 1. Easy to implement
• 2. Uses the large itemset property
FP Growth Algorithm
• FP Growth stands for Frequent Pattern Growth.
• The FP growth algorithm is an improvement over the
Apriori algorithm. It is used for finding frequent
itemsets in a transaction database without
candidate generation.
• FP growth represents frequent items in a
frequent pattern tree, or FP-tree.
Advantages of the FP growth algorithm:
• Faster than the Apriori algorithm
• No candidate generation
• Only two passes over the dataset
Disadvantages of the FP growth algorithm:
• The FP-tree may not fit in memory
• The FP-tree is expensive to build
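As a hedged illustration only (the slides do not mention any particular library), the open-source mlxtend package ships an FP-growth implementation; a sketch over the grocery baskets from the earlier table might look like this, assuming mlxtend and pandas are installed:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# The seven baskets from the earlier Tid/Items table
baskets = [
    ["Bread", "Peanuts", "Milk", "Fruit", "Jam"],
    ["Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"],
    ["Steak", "Jam", "Soda", "Chips", "Bread"],
    ["Jam", "Soda", "Peanuts", "Milk", "Bread"],
    ["Jam", "Soda", "Chips", "Milk", "Bread"],
    ["Fruit", "Soda", "Chips", "Milk"],
    ["Fruit", "Soda", "Peanuts", "Milk"],
]

# One-hot encode the baskets into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Frequent itemsets (support >= 50%) without candidate generation,
# then association rules filtered by confidence >= 60%
itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])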
Some Other Types of Association Rules
• Sequential pattern mining discovers subsequences that
are common to more than minsup sequences in a
sequence database, where minsup is set by the user. A
sequence is an ordered list of transactions.
• K-optimal pattern discovery provides an alternative to
the standard approach to association rule learning,
which requires that each pattern appear frequently in
the data.
• Contrast set learning is a form of associative
learning. Contrast set learners use rules that differ
meaningfully in their distribution across subsets.
ARM – applications in library and
information science
Helps in suggesting books to users.
Helps in understanding user needs.
Helps in understanding trends in users' reading habits.
Helps in acquiring books according to user needs.
Helps in understanding various statistical data
about the library.