Association Rule Mining
Project submitted
by
Pallab Das
MLIS (Digital Library)
Jadavpur University
What is ARM?
Association rule mining is a procedure that aims to
discover frequently occurring patterns, correlations, or
associations in datasets found in various kinds of
databases, such as:
• RDBMS
• Transactional databases
• and other forms of repositories.
{Bread} ==> {Milk}
{Soda} ==> {Chips}
{Bread} ==> {Jam}
Tid Items
1   Bread, Peanuts, Milk, Fruit, Jam
2   Bread, Jam, Soda, Chips, Milk, Fruit
3   Steak, Jam, Soda, Chips, Bread
4   Jam, Soda, Peanuts, Milk, Bread
5   Jam, Soda, Chips, Milk, Bread
6   Fruit, Soda, Chips, Milk
7   Fruit, Soda, Peanuts, Milk
• Association rule mining was defined in the
1990s
• Rakesh Agrawal, Tomasz Imieliński and Arun
Swami developed an algorithm-based way to
find relationships between items using point-
of-sale (POS) systems.
Rakesh Agrawal
Tomasz Imieliński
Arun Swami
• Most machine learning algorithms work with
numeric datasets and hence tend to be
mathematical.
• However, association rule mining is suitable
for non-numeric, categorical data and
requires little more than simple counting.
An association rule has 2
parts:
• an antecedent (if) and
• a consequent (then)
Story
• A famous story about association rule mining
is the "beer and diaper" story.
• Supermarket analysts discovered that
customers (presumably young men) who buy
diapers also tend to buy beer.
• This anecdote became popular as an example
of how unexpected association rules might be
found from everyday data.
If - then
• An antecedent is something that’s found in
data,
• and a consequent is an item that is found in
combination with the antecedent.
Have a look at this rule, for instance:
• “If a customer buys bread, he’s 70% likely
to also buy milk.”
In the above association rule, bread is the
antecedent and milk is the consequent.
Syntax
• The syntax of an association rule can be written as:
• buys(X, “computer”) ==> buys(X, “software”)
[support = 1%; confidence = 50%]
X = customer
Confidence 50% = if a customer buys a computer, there is a 50% chance of
also buying software.
Support 1% = 1% of all the transactions in the database contain both
computer and software.
Example with Compound Condition
• age(X, “25..39”) ^ income(X, “30K..35K”) ==> buys(X, “iPhone”)
[support = 2%; confidence = 60%]
• X = a customer whose age is between 25 and 39
• Income = the customer’s income is between 30K and 35K
• Confidence 60% = if a customer’s age is 25 – 39 and their income is 30K – 35K,
then there is a 60% chance of them buying an iPhone.
• Support 2% = 2% of transactions satisfy all three conditions together.
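To make the notation concrete, the following is a minimal Python sketch (an illustration added here, not part of the original slides) computing the support and confidence of the compound rule over a small hypothetical customer table; the field names and values are invented for illustration.

# Hypothetical customer records, made up for this illustration only.
customers = [
    {"age": 28, "income": 32_000, "buys_iphone": True},
    {"age": 35, "income": 31_000, "buys_iphone": False},
    {"age": 45, "income": 40_000, "buys_iphone": True},
    {"age": 30, "income": 34_000, "buys_iphone": True},
]

def antecedent(c):
    # age(X, "25..39") AND income(X, "30K..35K")
    return 25 <= c["age"] <= 39 and 30_000 <= c["income"] <= 35_000

both = sum(1 for c in customers if antecedent(c) and c["buys_iphone"])
ante = sum(1 for c in customers if antecedent(c))

support = both / len(customers)   # P(antecedent and consequent) = 2/4 = 50%
confidence = both / ante          # P(consequent | antecedent) = 2/3 ~ 67%
print(f"support = {support:.0%}, confidence = {confidence:.0%}")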
Some well-known techniques
• Some of the well-known techniques are:
Market Basket Analysis
Apriori Algorithm
FP Growth Algorithm
Market Basket Analysis
Support and Confidence
Consider the following example of library circulation --
Modern History ==> World War II {Support: 9%, Confidence: 65%}
Support: the percentage of circulation transactions (T) in a certain library
that contain books on both Modern History (A) and World War II (B)
together. (9% of all transactions contain these two books together.)
Support (A ==> B) = P(A ∪ B)
Confidence: the probability that a user who has already borrowed Modern
History (A) also borrows World War II (B) (65%).
Confidence (A ==> B) = P(B | A)
Tid = Transaction Id
Items – A, B, C, D
Total support = total number of transactions (5)
Support = occurrence / total support
Confidence of a rule X ==> Y
= (occurrence of X and Y together / occurrence of X) * 100
P(B|A) = probability of purchasing item B given item A

TId Items
1   A, B, C
2   A, B, D
3   B, C
4   A, C
5   B, C, D
Calculating Support
Support = occurrence / total support

TId Items
1   A, B, C
2   A, B, D
3   B, C
4   A, C
5   B, C, D

Total support = 5
Support {A, B} = 2/5 = 40%
Support {B, C} = 3/5 = 60%
Support {A, B, C} = 1/5 = 20%

• Items A & B occur together in 2 transactions.
• Dividing by the total support (5) gives Support {A, B} = 40%.
• Calculating the same way, we get
• Support {B, C} = 60%
• Support {A, B, C} = 20%
The itemset with the highest support is {B, C}.
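A minimal Python sketch (an addition, not from the slides) that reproduces the support figures above from the five transactions:

# The five transactions from the table above (items A, B, C, D).
transactions = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"B", "C"},
    {"A", "C"},
    {"B", "C", "D"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    count = sum(1 for t in transactions if set(itemset) <= t)
    return count / len(transactions)

print(support({"A", "B"}))       # 0.4  -> 40%
print(support({"B", "C"}))       # 0.6  -> 60%
print(support({"A", "B", "C"}))  # 0.2  -> 20%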
Calculate Confidence
Confidence = occurrence of X and Y together / occurrence of X

TId Items
1   A, B, C
2   A, B, D
3   B, C
4   A, C
5   B, C, D

Confidence {A ==> B} = 2/3 = 66%
Confidence {B ==> C} = 3/4 = 75%
Confidence {A, B ==> C} = 1/2 = 50%

A has been purchased 3 times.
B has been purchased along with A 2 times.
Confidence {A ==> B} = 2/3 = 66%
This means that 66% of customers who bought
item A also bought item B.
In short, the probability of buying item B
given item A is P(B|A) = 66%.
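Continuing the sketch above, confidence can be computed the same way; the function below assumes the transactions list defined in the previous snippet:

def confidence(antecedent, consequent):
    # Support count of (X union Y) divided by the support count of X.
    x, xy = set(antecedent), set(antecedent) | set(consequent)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if xy <= t)
    return count_xy / count_x

print(confidence({"A"}, {"B"}))       # 2/3 ~ 0.66
print(confidence({"B"}, {"C"}))       # 3/4 = 0.75
print(confidence({"A", "B"}, {"C"}))  # 1/2 = 0.50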
Benefits of Market Basket Analysis:
 Store Layout:
Based on the insights from market basket analysis, we can
organize the store layout to increase revenues.
 Marketing Messages:
Targeted marketing messages help our customers with useful suggestions
instead of annoying them with marketing blasts.
 Content Placement:
Online publishers and bloggers can display the content a
reader is most likely to read next. This reduces bounce
rate, improves engagement and results in better performance in
search results.
 Recommendation Engines:
Recommendation engines are already used by popular
companies like Netflix, Amazon, Facebook, etc.
Apriori Algorithm
• The Apriori algorithm uses frequent itemsets to
generate association rules.
• It is based on the concept that any subset of a
frequent itemset must also be a frequent itemset.
Frequent Itemset
• A frequent itemset is an itemset whose support
value is greater than a threshold value.
• Frequent itemset = support > threshold value
We have 5 items and 5 transactions (T1 – T5).
Suppose the minimum support count is 2.

TID Items
T1  1, 3, 4
T2  2, 3, 5
T3  1, 2, 3, 5
T4  2, 5
T5  1, 3, 5
1st Iteration
We have the itemset and support for the 5
items in table C1.

C1
ItemSet Support
{1}     3
{2}     3
{3}     4
{4}     1
{5}     4

Here item no. 4 has to be eliminated,
as its support value is less than the
minimum support: 1 < 2.

After eliminating item no. 4,
we get the frequent itemset table F1.

F1
ItemSet Support
{1}     3
{2}     3
{3}     4
{5}     4
2nd Iteration

C2
Item set Support
{1,2}    1
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3

Itemset {1,2} will be eliminated
because its support value is less
than the minimum threshold (2).

F2
Item set Support
{1,3}    3
{1,5}    2
{2,3}    2
{2,5}    3
{3,5}    3
C3 – candidate 3-itemsets
{1,2,3}
{1,2,5}
{1,3,5}
{2,3,5}
Apriori Algorithm – Pruning
3rd Iteration
Item set   2-item subsets        All in F2?
{1,2,3}    {1,2},{1,3},{2,3}     NO
{1,2,5}    {1,2},{1,5},{2,5}     NO
{1,3,5}    {1,3},{1,5},{3,5}     YES
{2,3,5}    {2,3},{2,5},{3,5}     YES
Here we can see that the subset {1,2} is not in F2, so the candidates
containing it are removed.
4th Iteration

F3
Item set Support
{1,3,5}  2
{2,3,5}  2

C4
Item set  Support
{1,2,3,5} 1

The candidate itemset {1,2,3,5} is
eliminated because its support is
less than the minimum threshold (2).
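The candidate-generation, pruning and support-counting steps walked through above can be put together in a short Python sketch; this is a simplified illustration (assuming a minimum support count of 2), not production code:

from itertools import combinations

transactions = [
    {1, 3, 4},     # T1
    {2, 3, 5},     # T2
    {1, 2, 3, 5},  # T3
    {2, 5},        # T4
    {1, 3, 5},     # T5
]
MIN_SUPPORT = 2  # minimum support count used in the walkthrough

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# F1: frequent 1-itemsets (C1 filtered by minimum support)
items = sorted({i for t in transactions for i in t})
frequent = [frozenset({i}) for i in items if support_count({i}) >= MIN_SUPPORT]

k = 2
while frequent:
    print(f"F{k - 1}:", sorted(sorted(s) for s in frequent))
    # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Pruning: a candidate survives only if all its (k-1)-subsets are frequent
    candidates = [c for c in candidates
                  if all(frozenset(sub) in frequent
                         for sub in combinations(c, k - 1))]
    # Support counting: keep candidates that meet the minimum support
    frequent = [c for c in candidates if support_count(c) >= MIN_SUPPORT]
    k += 1
# Prints F1, F2 and F3 exactly as in the tables above; no 4-itemset survives.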
Apriori Algorithm – Subset Creation
Assume our minimum confidence value is 60%.
For I = {1,3,5}, the subsets are {1,3}, {1,5}, {3,5}, {1}, {3}, {5}
For I = {2,3,5}, the subsets are {2,3}, {2,5}, {3,5}, {2}, {3}, {5}
Applying rules to itemset F3
• Rule 1: {1,3} ==> ({1,3,5} - {1,3}), i.e. 1&3 ==> 5
Confidence = support(1,3,5) / support(1,3) = 2/3 = 66.66% > 60%
• Rule 2: {1,5} ==> ({1,3,5} - {1,5}), i.e. 1&5 ==> 3
Confidence = support(1,3,5) / support(1,5) = 2/2 = 100% > 60%
• Rule 3: {3,5} ==> ({1,3,5} - {3,5}), i.e. 3&5 ==> 1
Confidence = support(1,3,5) / support(3,5) = 2/3 = 66.66% > 60%
• Rule 4: {1} ==> ({1,3,5} - {1}), i.e. 1 ==> 3&5
Confidence = support(1,3,5) / support(1) = 2/3 = 66.66% > 60%
• Rule 5: {3} ==> ({1,3,5} - {3}), i.e. 3 ==> 1&5
ₓ Confidence = support(1,3,5) / support(3) = 2/4 = 50% < 60%
• Rule 6: {5} ==> ({1,3,5} - {5}), i.e. 5 ==> 1&3
ₓ Confidence = support(1,3,5) / support(5) = 2/4 = 50% < 60%
• Rules 1 – 4 are established.
• Rules 5 – 6 are eliminated.
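The rule-generation step above can likewise be sketched in a few lines of Python. It assumes the same transactions and a minimum confidence of 60%, and prints the six rules for {1,3,5} (plus the analogous rules for {2,3,5}, whose subsets are listed above but not worked out on the slides):

from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
MIN_CONFIDENCE = 0.60

def support_count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

for itemset in ({1, 3, 5}, {2, 3, 5}):
    for size in (2, 1):  # antecedents of size 2, then size 1
        for antecedent in combinations(sorted(itemset), size):
            consequent = itemset - set(antecedent)
            conf = support_count(itemset) / support_count(antecedent)
            verdict = "established" if conf >= MIN_CONFIDENCE else "eliminated"
            print(f"{set(antecedent)} ==> {consequent}: {conf:.2%} ({verdict})")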
Advantages of the Apriori algorithm
• 1. Easy to implement
• 2. Uses the large itemset property
FP Growth Algorithm
• FP Growth stands for Frequent Pattern Growth.
• The FP growth algorithm is an improvement over the
Apriori algorithm. It is used for finding frequent
itemsets in a transaction database without
candidate generation.
• FP growth represents frequent items in a
frequent pattern tree, or FP-tree.
Advantages of the FP growth algorithm:
• Faster than the Apriori algorithm
• No candidate generation
• Only two passes over the dataset
Disadvantages of the FP growth algorithm:
• The FP-tree may not fit in memory
• The FP-tree is expensive to build
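As a hedged illustration only (the slides do not mention any particular library), the open-source mlxtend package ships an FP-growth implementation; a sketch over the grocery baskets from the earlier table might look like this, assuming mlxtend and pandas are installed:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# The seven baskets from the earlier Tid/Items table
baskets = [
    ["Bread", "Peanuts", "Milk", "Fruit", "Jam"],
    ["Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"],
    ["Steak", "Jam", "Soda", "Chips", "Bread"],
    ["Jam", "Soda", "Peanuts", "Milk", "Bread"],
    ["Jam", "Soda", "Chips", "Milk", "Bread"],
    ["Fruit", "Soda", "Chips", "Milk"],
    ["Fruit", "Soda", "Peanuts", "Milk"],
]

# One-hot encode the baskets into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

# Frequent itemsets (support >= 50%) without candidate generation,
# then association rules filtered by confidence >= 60%
itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])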
Some Other Types of Association Rules
• Sequential pattern mining discovers subsequences that
are common to more than minsup sequences in a
sequence database, where minsup is set by the user. A
sequence is an ordered list of transactions.
• K-optimal pattern discovery provides an alternative to
the standard approach to association rule learning,
which requires that each pattern appear frequently in
the data.
• Contrast set learning is a form of associative
learning. Contrast set learners use rules that differ
meaningfully in their distribution across subsets.
ARM – applications in library and
information science
Helps in suggesting books to users.
Helps in understanding user needs.
Helps in understanding trends in users' reading habits.
Helps in acquiring books according to user needs.
Helps in understanding various statistical data
about the library.