This document outlines the process of association rule mining using the Apriori algorithm. It begins with definitions of key terms like frequent itemsets, support, and confidence. It then explains how the Apriori algorithm reduces the search space using the Apriori property to only consider potentially frequent itemsets. Finally, it provides examples applying the Apriori algorithm to transaction databases to generate strong association rules that meet minimum support and confidence thresholds.
Association rule mining finds frequent patterns and correlations among items in transaction databases. It involves two main steps:
1) Frequent itemset generation: Finds itemsets that occur together in a minimum number of transactions (above a support threshold). This is done efficiently using the Apriori algorithm.
2) Rule generation: Generates rules from frequent itemsets where the confidence (fraction of transactions with left hand side that also contain right hand side) is above a minimum threshold. Rules are a partitioning of an itemset into left and right sides.
Association analysis is a technique used to uncover relationships between items in transactional data. It involves finding frequent itemsets whose occurrence exceeds a minimum support threshold, and then generating association rules from these itemsets that satisfy minimum confidence. The Apriori algorithm is commonly used for this task, as it leverages the Apriori property to prune the search space - if an itemset is infrequent, its supersets cannot be frequent. It performs multiple database scans to iteratively grow frequent itemsets and extract high confidence rules.
Association rule mining and Apriori algorithm (Hina Firdaus)
The document discusses association rule mining and the Apriori algorithm. It provides an overview of association rule mining, which aims to discover relationships between variables in large datasets. The Apriori algorithm is then explained as a popular algorithm for association rule mining that uses a bottom-up approach to generate frequent itemsets and association rules, starting from individual items and building up patterns by combining items. The key steps of Apriori involve generating candidate itemsets, counting their support from the dataset, and pruning unpromising candidates to create the frequent itemsets.
This course is all about data mining: how to obtain optimized results, the main types of techniques, and how to apply them.
Association rule mining is used to find interesting relationships among data items in large datasets. It can help with business decision making by analyzing customer purchasing patterns. For example, market basket analysis looks at what items are frequently bought together. Association rules use support and confidence metrics, where support is the probability an itemset occurs and confidence is the probability that a rule is correct. The Apriori algorithm is commonly used to generate association rules by first finding frequent itemsets that meet a minimum support threshold across multiple passes of the data. It then generates rules from those itemsets if they meet a minimum confidence. Association rule mining has various applications and can provide useful insights but also has computational limitations.
Introduction to the Bayesian classifier. It describes the basic algorithm and applications of Bayesian classification, explained with the help of numerical problems.
This presentation introduces naive Bayesian classification. It begins with an overview of Bayes' theorem and defines a naive Bayes classifier as one that assumes conditional independence between predictor variables given the class. The document provides examples of text classification using naive Bayes and discusses its advantages of simplicity and accuracy, as well as its limitation of assuming independence. It concludes that naive Bayes is a commonly used and effective classification technique.
The document discusses the Apriori algorithm for frequent pattern mining. It begins with an introduction to frequent pattern analysis and its importance. The basic concepts of support, confidence and association rule mining are explained. The Apriori algorithm works in two steps - first it finds frequent itemsets by scanning the database and filtering out infrequent itemsets, then it generates strong association rules from the frequent itemsets using a minimum support and confidence threshold. An example is shown to illustrate how the Apriori algorithm processes a transactional database to find frequent itemsets and association rules. The limitations of Apriori include its multiple database scans which impact efficiency.
The document discusses frequent pattern mining and the Apriori algorithm. It introduces frequent patterns as frequently occurring sets of items in transaction data. The Apriori algorithm is described as a seminal method for mining frequent itemsets via multiple passes over the data, generating candidate itemsets and pruning those that are not frequent. Challenges with Apriori include multiple database scans and large number of candidate sets generated.
This document discusses data mining techniques, including the data mining process and common techniques like association rule mining. It describes the data mining process as involving data gathering, preparation, mining the data using algorithms, and analyzing and interpreting the results. Association rule mining is explained in detail, including how it can be used to identify relationships between frequently purchased products. Methods for mining multilevel and multidimensional association rules are also summarized.
Data mining involves discovering patterns from large data sources and has evolved from database technology. It includes data cleaning, integration, selection, transformation, mining, evaluation, and presentation. Mining can occur on different data sources and involves characterizing, associating, classifying, clustering, and identifying outliers and trends in data. Major issues include scalability, noise handling, pattern evaluation, and privacy concerns.
This document discusses rule-based classification. It describes how rule-based classification models use if-then rules to classify data. It covers extracting rules from decision trees and directly from training data. Key points include using sequential covering algorithms to iteratively learn rules that each cover positive examples of a class, and measuring rule quality based on both coverage and accuracy to determine the best rules.
This slide deck builds an understanding of the intuition and the mathematics/statistics behind association rule mining. The presentation starts by highlighting the difference between causation and correlation. This is followed by the Apriori algorithm and the metrics used with it; each metric is discussed in detail. Finally, a formulation is developed in a classification setting that can be used to generate rules, i.e. rule mining.
Other Reference: https://www.slideshare.net/JustinCletus/mining-frequent-patterns-association-and-correlations
The document discusses random forest, an ensemble classifier that uses multiple decision tree models. It describes how random forest works by growing trees using randomly selected subsets of features and samples, then combining the results. The key advantages are better accuracy compared to a single decision tree, and no need for parameter tuning. Random forest can be used for classification and regression tasks.
The document discusses the Apriori algorithm, which is used for mining frequent itemsets from transactional databases. It begins with an overview and definition of the Apriori algorithm and its key concepts like frequent itemsets, the Apriori property, and join operations. It then outlines the steps of the Apriori algorithm, provides an example using a market basket database, and includes pseudocode. The document also discusses limitations of the algorithm and methods to improve its efficiency, as well as advantages and disadvantages.
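The join, count, and prune steps described above can be sketched compactly. This is an illustrative level-wise implementation, not pseudocode from any of the documents summarized here; the toy database is hypothetical:

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise Apriori: grow frequent itemsets one item at a time."""
    n = len(db)
    # L1: frequent 1-itemsets from a first database scan
    items = {i for t in db for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) / n >= min_support}
    all_frequent = set(freq)
    k = 2
    while freq:
        # Join step: unions of (k-1)-itemsets that yield k-itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Count step: one pass over the database per level
        freq = {c for c in candidates
                if sum(c <= t for t in db) / n >= min_support}
        all_frequent |= freq
        k += 1
    return all_frequent

db = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
print(len(apriori(db, 0.6)))  # 8 frequent itemsets at 60% support
```

The `while` loop makes one scan per level, which is exactly the multiple-database-scans limitation the summaries mention.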
This document provides an overview of decision trees, including:
- Decision trees classify records by sorting them down the tree from root to leaf node, where each leaf represents a classification outcome.
- Trees are constructed top-down by selecting the most informative attribute to split on at each node, usually based on information gain.
- Trees can handle both numerical and categorical data and produce classification rules from paths in the tree.
- Examples of decision tree algorithms like ID3 that use information gain to select the best splitting attribute are described. The concepts of entropy and information gain are defined for selecting splits.
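The entropy and information-gain definitions used by ID3-style splitting can be written directly from their formulas. The tiny weather-style dataset below is a made-up illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from splitting the rows on attribute attr."""
    n = len(labels)
    gain = entropy(labels)
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Hypothetical example: splitting on "windy" perfectly separates the classes,
# so the gain equals the full starting entropy of 1 bit.
rows = [{"windy": True}, {"windy": True}, {"windy": False}, {"windy": False}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "windy"))  # 1.0
```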
Mining Frequent Patterns, Association and Correlations (Justin Cletus)
This document summarizes Chapter 6 of the book "Data Mining: Concepts and Techniques" which discusses frequent pattern mining. It introduces basic concepts like frequent itemsets and association rules. It then describes several scalable algorithms for mining frequent itemsets, including Apriori, FP-Growth, and ECLAT. It also discusses optimizations to Apriori like partitioning the database and techniques to reduce the number of candidates and database scans.
Project for System Analysis and Design (IS-6410).
By performing customer segmentation, the following three objectives can be achieved with the implementation of this new analytics system:
1. Track the difference between loyal customers and visitors, and perform heat-map analysis of their browsing patterns.
2. Understand customer demographics and focus on highly profitable segments.
3. Empower our Marketing department to make better strategic decisions in terms of online ads/campaigns.
Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute.
These patterns are then utilized to predict the values of the target attribute in future data instances.
Unsupervised learning: The data have no target attribute.
We want to explore the data to find some intrinsic structures in them.
Naive Bayes is a classifier based on Bayes' theorem. It predicts membership probabilities for each class, i.e. the probability that a given record or data point belongs to a particular class.
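Those per-class membership probabilities come from multiplying the class prior by the conditional probability of each feature value given the class. A minimal sketch for categorical features, with Laplace (add-one) smoothing and a hypothetical weather-style example:

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (illustrative sketch)."""

    def fit(self, X, y):
        self.priors = Counter(y)                   # class counts
        self.n = len(y)
        self.vocab = [set(col) for col in zip(*X)] # distinct values per feature
        self.counts = defaultdict(Counter)         # (class, feature) -> value counts
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                self.counts[(label, j)][v] += 1
        return self

    def predict(self, row):
        best, best_p = None, 0.0
        for c, nc in self.priors.items():
            p = nc / self.n                        # P(class)
            for j, v in enumerate(row):            # x P(feature_j = v | class), smoothed
                p *= (self.counts[(c, j)][v] + 1) / (nc + len(self.vocab[j]))
            if p > best_p:
                best, best_p = c, p
        return best

X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
y = ["no", "no", "yes", "yes"]
clf = NaiveBayes().fit(X, y)
print(clf.predict(("sunny", "hot")))  # "no"
```

The "naive" part is the product in `predict`: features are multiplied as if independent given the class.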
Data Mining, Knowledge Discovery Process, Classification (Dr. Abdul Ahad Abro)
The document provides an overview of data mining techniques and processes. It discusses data mining as the process of extracting knowledge from large amounts of data. It describes common data mining tasks like classification, regression, clustering, and association rule learning. It also outlines popular data mining processes like CRISP-DM and SEMMA that involve steps of business understanding, data preparation, modeling, evaluation and deployment. Decision trees are presented as a popular classification technique that uses a tree structure to split data into nodes and leaves to classify examples.
This Logistic Regression presentation will help you understand how a Logistic Regression algorithm works in Machine Learning. In this tutorial video, you will learn what Supervised Learning is, what a classification problem is and some associated algorithms, what Logistic Regression is, how it works with simple examples, the maths behind Logistic Regression, how it differs from Linear Regression, and Logistic Regression applications. At the end, you will also see an interesting demo in Python on how to predict the number present in an image using Logistic Regression.
Below topics are covered in this Machine Learning Algorithms Presentation:
1. What is supervised learning?
2. What is classification? What are some of its solutions?
3. What is logistic regression?
4. Comparing linear and logistic regression
5. Logistic regression applications
6. Use case - Predicting the number in an image
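The mechanics behind topics 3 and 4 above (the sigmoid squashing a linear model into a probability) can be sketched with plain gradient descent on the log-loss. This is an illustrative toy, not the presentation's own demo code, and the one-feature dataset is invented:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on the log-loss; bias folded in as w[0]."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            x = [1.0] + list(row)                       # prepend bias input
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - target
            for i, xi in enumerate(x):
                grad[i] += err * xi
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def predict(w, row):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0] + list(row))))

# Toy separable data: label 1 when the single feature exceeds roughly 2
X = [(0.5,), (1.0,), (1.5,), (2.5,), (3.0,), (3.5,)]
y = [0, 0, 0, 1, 1, 1]
w = train_logistic(X, y)
print(predict(w, (0.2,)) < 0.5, predict(w, (3.8,)) > 0.5)  # True True
```

Unlike linear regression, the output is bounded in (0, 1) and read as a class probability, which is the comparison the presentation draws.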
What is Machine Learning: Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised, and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems.
- - - - - - -
2.1 Data Mining - Classification Basic Concepts (Krish_ver2)
This document discusses classification and decision trees. It defines classification as predicting categorical class labels using a model constructed from a training set. Decision trees are a popular classification method that operate in a top-down recursive manner, splitting the data into purer subsets based on attribute values. The algorithm selects the optimal splitting attribute using an evaluation metric like information gain at each step until it reaches a leaf node containing only one class.
The document provides an overview of the Naive Bayes algorithm for classification problems. It begins by explaining that Naive Bayes is a supervised learning algorithm based on Bayes' theorem. It then explains the key aspects of Naive Bayes:
- It assumes independence between features (naive) and uses Bayes' theorem to calculate probabilities (Bayes).
- Bayes' theorem is used to calculate the probability of a hypothesis given observed data.
- An example demonstrates how Naive Bayes classifies weather data to predict whether to play or not play.
The document concludes by discussing the advantages, disadvantages, applications, and types of Naive Bayes models, as well as providing Python code to implement a Naive Bayes classifier.
MODULE 5 _ Mining frequent patterns and associations.pptx (nikshaikh786)
1. The FP-Growth algorithm constructs an FP-tree to store transaction data, with frequent items listed in descending order of frequency.
2. It then uses a divide-and-conquer strategy to mine the conditional pattern base of each frequent item prefix, extracting combinations of frequent items.
3. This recursively mines the frequent patterns from the conditional FP-tree for each prefix path, without generating a large number of candidate itemsets.
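Step 1 above, the frequency-descending ordering of each transaction before it is inserted into the FP-tree, can be shown in a few lines. This is a sketch of that preprocessing pass only (the tree construction and conditional mining are omitted), using an invented toy database:

```python
from collections import Counter

def order_transactions(db, min_count):
    """One scan to count items, then rewrite each transaction with
    infrequent items dropped and the rest sorted by descending global
    frequency -- the insertion order for the FP-tree."""
    counts = Counter(i for t in db for i in t)
    rank = {i: c for i, c in counts.items() if c >= min_count}
    return [sorted((i for i in t if i in rank), key=lambda i: (-rank[i], i))
            for t in db]

db = [{"a", "b", "c"}, {"b", "c"}, {"a", "b", "d"}, {"b", "e"}]
print(order_transactions(db, 2))
# [['b', 'a', 'c'], ['b', 'c'], ['b', 'a'], ['b']]
```

Because every transaction now starts with the most frequent items, shared prefixes compress well in the tree, which is what lets FP-Growth avoid candidate generation.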
The document discusses association rule mining. It defines frequent itemsets as itemsets whose support is greater than or equal to a minimum support threshold. Association rules are implications of the form X → Y, where X and Y are disjoint itemsets. Support and confidence are used to evaluate rules. The Apriori algorithm is introduced as a two-step approach to generate frequent itemsets and rules by pruning the search space using an anti-monotonic property of support.
This document discusses association rule mining, which aims to find rules that predict the occurrence of items based on other items in transactions. It defines key concepts like frequent itemsets, support, confidence and association rules. It describes the Apriori algorithm for efficiently mining frequent itemsets by leveraging the anti-monotone property of support. The document also discusses challenges in applying association rule mining to categorical and continuous attributes in record data.
This document discusses association rule mining. It begins by defining the task of association rule mining as finding rules that predict the occurrence of items based on other items in transactions. It then describes how association rules are generated in two steps: first finding frequent itemsets whose support is above a minimum threshold, and then generating rules from those itemsets. The key challenge is that a brute force approach is computationally prohibitive due to the huge number of possible rules, so techniques like the Apriori algorithm are used to prune the search space.
The document summarizes a seminar presentation comparing the Apriori and FP-Growth algorithms for association rule mining. It introduces association rule mining and definitions like frequent itemsets and association rules. It then provides details on the Apriori algorithm, which uses a generate-and-test methodology with a join step to generate candidate itemsets and a prune step to determine frequent itemsets. The FP-Growth algorithm is also discussed but with less detail. The presentation aims to compare the performance of these two common association rule mining algorithms.
This document provides an overview of unsupervised learning techniques including k-means clustering and association rule mining. It begins with introductions to the speaker and tutorial topics. It then contrasts supervised vs unsupervised learning, describing how k-means is used for clustering without labels and how association rules can discover relationships between items. The document provides examples of applying these techniques in domains like retail, sports, email marketing and healthcare. It also includes visualizations and discusses important concepts for k-means like data transformation and for association rules like support, confidence and lift. Homework questions are asked about preparing data for these algorithms in Orange.
The document provides an overview of data mining concepts including association rules, classification, clustering, and applications of data mining. It discusses association rule mining algorithms like Apriori and FP-growth, decision tree algorithms for classification, and k-means clustering. It also lists some commercial data mining tools and potential applications in various domains like marketing, risk analysis, manufacturing, and bioinformatics.
The document provides an overview of data mining concepts including association rules, classification, clustering, and applications of data mining. It discusses association rule mining algorithms like Apriori and FP-growth, decision tree algorithms for classification, and K-means clustering. Commercial data mining tools from companies like Oracle, SAS, and IBM are also mentioned. The document concludes that data mining can be used to discover patterns in many types of data and the results may include association rules, sequential patterns, and classification trees.
The document provides an overview of data mining concepts including association rules, classification, and clustering algorithms. It introduces data mining and knowledge discovery processes. Association rule mining aims to find relationships between variables in large datasets using the Apriori and FP-growth algorithms. Classification algorithms build a model to predict class membership for new records based on a decision tree. Clustering algorithms group similar records together without predefined classes.
Lect6 Association rule & Apriori algorithmhktripathy
The document discusses the Apriori algorithm for mining association rules from transactional data. The Apriori algorithm uses a level-wise search where frequent itemsets are used to explore longer itemsets. It determines frequent itemsets by identifying individual frequent items and extending them to larger sets as long as they meet a minimum support threshold. The algorithm takes advantage of the fact that subsets of frequent itemsets must also be frequent to prune the search space. It performs candidate generation and pruning to efficiently identify all frequent itemsets in the transactional data.
Frequent pattern mining is an analytical algorithm that is used by businesses and, is accessible in some self-serve business intelligence solutions. The FP Growth analytical technique finds frequent patterns, associations, or causal structures from data sets in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories.
This presentation summarizes association rule mining and the Apriori algorithm. It defines association rules as being of the form X->Y, where X and Y are disjoint itemsets. The strength of a rule is determined by its support and confidence. The Apriori algorithm uses a level-wise approach to generate frequent itemsets and association rules from transaction data, starting with individual items and building up itemsets of increasing size as long as they meet a minimum support threshold. It pioneered the use of support-based pruning to efficiently mine frequent itemsets and rules from large datasets.
The document discusses association rule learning, which analyzes data to find patterns and relationships between attributes or items. Association rules have two parts - an antecedent (if) and consequent (then) that occur frequently together. For example, people who buy bread often also buy milk. The Apriori algorithm is commonly used to generate association rules and considers support, confidence and lift to determine strong rules. Support measures how often an itemset occurs, confidence measures the likelihood of the consequent given the antecedent, and lift measures their independence while accounting for item popularity.
This document provides an overview of chapter 6 from the textbook "Data Mining: Concepts and Techniques" which discusses mining association rules from large databases. The chapter covers association rule mining, the Apriori algorithm for finding frequent itemsets, methods to improve Apriori's efficiency such as hashing and partitioning, and the FP-growth method for mining frequent patterns without candidate generation by compressing a database into a frequent-pattern tree.
This document discusses techniques for customer relationship management (CRM) using data mining. It begins by introducing common data mining applications in retail, banking, and telecommunications. It then discusses how data mining can be used throughout the customer lifecycle to perform tasks like up-selling, cross-selling, and customer retention. The document proceeds to explain various data mining techniques including descriptive techniques like clustering and association rule mining as well as predictive techniques like classification, regression, and decision trees. It concludes by discussing major issues in the field of data mining.
ASSOCIATION Rule plus MArket basket Analysis.pptxSherishJaved
This document provides an introduction to association rule mining. It discusses how association rules are used to find frequent patterns and correlations among items in transactional databases. The document outlines the Apriori algorithm, which is the most influential algorithm for mining association rules. Apriori works in two steps: (1) finding all frequent itemsets whose support is above a minimum threshold, and (2) generating association rules from the frequent itemsets that have confidence above a minimum. An example of applying Apriori to a sample transactional database is provided to illustrate the algorithm.
Profitable Itemset Mining using WeightsIRJET Journal
This document presents two new algorithms - Profitable Apriori and Profitable FP-Growth - that extend traditional association rule mining algorithms to consider profit and quantity of items. Traditional algorithms like Apriori and FP-Growth are binary and do not account for these factors. The new algorithms incorporate profit per unit and quantity to generate the most profitable itemsets. They are compared based on memory usage, runtime, and number of patterns produced. Both algorithms generate the same profitable patterns, but Profitable FP-Growth uses more memory while Profitable Apriori has a longer runtime due to candidate generation. The algorithms aim to identify truly profitable patterns for effective marketing unlike traditional methods.
3. 6/30/2019 By:Tekendra Nath Yogi 3
Association Analysis
• Association rule analysis is a technique to uncover (mine) how items are associated with each other.
• Such uncovered associations between items are called association rules.
• When to mine association rules?
– Scenario:
• You are a sales manager.
• A customer bought a PC and a digital camera recently.
• What should you recommend to her next?
• Association rules are helpful in making your recommendation.
4. Contd…
• Frequent patterns (itemsets):
– Frequent patterns are patterns that appear frequently in a data set.
– E.g.,
• In a transaction data set, {milk, bread} is a frequent pattern.
• In a shopping-history database, first buying a PC, then a digital camera, and then a memory card is another example of a frequent pattern.
– Finding frequent patterns plays an essential role in mining association rules.
5. Frequent Pattern Mining
• Frequent pattern mining searches for recurring relationships in a given data set.
• Frequent pattern mining helps in the discovery of interesting associations between itemsets.
• Such associations are applicable in many business decision-making processes, such as:
– Catalog design
– Basket data analysis
– Cross-marketing
– Sales campaign analysis
– Web log (click stream) analysis, etc.
6. Market Basket Analysis
• Market basket analysis is a typical example of frequent pattern (itemset) mining for association rules.
• It analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.
• Applications: making marketing strategies.
• Example of an association rule: milk ⇒ bread
7. Definitions
• Itemset:
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ):
– Frequency of occurrence of an itemset
– E.g., σ({Milk, Bread, Diaper}) = 2
• Support (s):
– Fraction of transactions that contain an itemset
– E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset:
– An itemset whose support is greater than or equal to a minsup threshold

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
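The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not library code: the helper names `support_count` and `support` are ours, and the transactions are the five from the table.

```python
# A minimal sketch of the support-count and support definitions above,
# using the five-transaction database from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in db if set(itemset) <= t)

def support(itemset, db):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, db) / len(db)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```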
8. Definitions
• Association rule:
– An implication expression of the form X ⇒ Y, where X and Y are itemsets
– E.g., {Milk, Diaper} ⇒ {Beer}
• It is not very difficult to develop algorithms that will find such associations in a large database.
• The problem is that such an algorithm will also uncover many other associations that are of very little value.
• It is necessary to introduce some measures to distinguish interesting associations from non-interesting ones.
9. Contd….
• Rule evaluation metrics:
– Support (s): how popular an itemset is (prevalence), measured as the proportion of transactions in which the itemset appears.
– For example, in a data set of 8 transactions where {apple} appears in 4, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items: if {apple, beer, rice} appears in 2 of the 8 transactions, its support is 2 out of 8, or 25%.
10. Contd….
• Rule evaluation metrics:
– Confidence: how likely item Y is to be purchased when item X is purchased (predictability), written {X ⇒ Y}. It is measured as the proportion of transactions containing X in which Y also appears. For example, if 4 transactions contain apple and 3 of them also contain beer, the confidence of {apple ⇒ beer} is 3 out of 4, or 75%.
11. Example1
• Given data set D:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

• What is the support and confidence of the rule {Milk, Diaper} ⇒ {Beer}?
• Support: percentage of tuples that contain {Milk, Diaper, Beer}:
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4 = 40%
• Confidence: number of tuples that contain {Milk, Diaper, Beer} divided by the number of tuples that contain {Milk, Diaper}:
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67 = 67%
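The example's arithmetic can be checked with a short script. The `sigma` helper is our own shorthand for the support count defined earlier, applied to the rule {Milk, Diaper} ⇒ {Beer}.

```python
# Recomputing Example1's numbers for the rule {Milk, Diaper} => {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset, db):
    """Support count: transactions containing every item of the itemset."""
    return sum(1 for t in db if itemset <= t)

lhs, rhs = {"Milk", "Diaper"}, {"Beer"}
s = sigma(lhs | rhs, transactions) / len(transactions)   # support of the rule
c = sigma(lhs | rhs, transactions) / sigma(lhs, transactions)  # confidence
print(round(s, 2), round(c, 2))  # 0.4 0.67
```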
12. Example2
• Given data set D:

TID | date     | items_bought
100 | 10/10/99 | {F, A, D, B}
200 | 15/10/99 | {D, A, C, E, B}
300 | 19/10/99 | {C, A, B, E}
400 | 20/10/99 | {B, A, D}

• What is the support and confidence of the rule {B, D} ⇒ {A}?
• Support: percentage of tuples that contain {A, B, D} = 3/4 × 100 = 75%
• Confidence: (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 × 100 = 100%
13. Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• If a rule A ⇒ B [support, confidence] satisfies min_sup and min_confidence, then it is a strong rule.
• So, we can say that the goal of association rule mining is to find all strong rules.
14. Contd….
• Brute-force approach to association rule mining:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
• Computationally prohibitive!
15. Contd….
• How? Given the same data set D as in Example1, examples of rules with their support and confidence are:
{Milk, Diaper} ⇒ {Beer} (s=0.4, c=0.67)
{Milk, Beer} ⇒ {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} ⇒ {Milk} (s=0.4, c=0.67)
{Beer} ⇒ {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Beer} (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Beer} (s=0.4, c=0.5)
– Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}. Rules originating from the same itemset have identical support but can have different confidence. Thus, we may decouple the support and confidence requirements.
16. Contd….
• Association rule mining is a two-step approach:
1. Frequent itemset generation
– Generate all itemsets whose support ≥ minsup
2. Rule generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive!
17. Contd…
• Given d items, there are 2^d possible candidate itemsets, which is why frequent itemset generation is still computationally expensive!
18. Contd…
• Reducing the number of candidates by using the Apriori principle:
– The Apriori principle states that:
• If an itemset is frequent, then all of its subsets must also be frequent.
– E.g., if {beer, diaper, nuts} is frequent, so is {beer, diaper}.
• The Apriori principle holds due to the following anti-monotone property of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
• i.e., the support of an itemset never exceeds the support of its subsets.
• E.g., for the data set D above:
s(Bread) ≥ s(Bread, Beer)
s(Milk) ≥ s(Bread, Milk)
s(Diaper, Beer) ≥ s(Diaper, Beer, Coke)
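The anti-monotone property can be verified directly on the five-transaction data set. The `s` helper is our own shorthand for support, used here only to check the inequalities from the slide.

```python
# Checking the anti-monotone property of support on the slide's data set:
# for every X subset of Y, s(X) >= s(Y).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):
    """Support: fraction of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(s({"Bread"}), s({"Bread", "Beer"}))                    # 0.8 0.4
print(s({"Diaper", "Beer"}), s({"Diaper", "Beer", "Coke"}))  # 0.6 0.2
```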
19. Contd…
• How is the Apriori property used in the algorithm?
– If any itemset is infrequent, its supersets need not be generated or tested!
– E.g., if itemset {a, b} is found to be infrequent, then all of its supersets are pruned: we do not need to take them into account at all.
20. The Apriori Algorithm
1. Initially, scan the DB once to get the candidate 1-itemsets C1 and find the frequent 1-itemsets from C1; put them into L1 (k=1).
2. Use Lk to generate a collection of candidate itemsets Ck+1 of size (k+1).
3. Scan the database to find which itemsets in Ck+1 are frequent, and put them into Lk+1.
4. If Lk+1 is not empty, set k = k+1 and GOTO step 2 (i.e., terminate when no frequent or candidate set can be generated).
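The four steps above might be sketched in Python roughly as follows. This is a minimal level-wise implementation under our own naming (the `apriori` function and its return shape are assumptions of this sketch, not the canonical pseudocode).

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise search: L1 from one DB scan, then repeatedly join Lk into
    candidates C(k+1), prune by the Apriori property, and scan the DB to keep
    the frequent ones.  Returns {frozenset: support_count}."""
    db = [frozenset(t) for t in transactions]
    counts = Counter(frozenset([item]) for t in db for item in t)
    L = {iset: n for iset, n in counts.items() if n >= min_sup}
    frequent, k = dict(L), 1
    while L:
        prev = set(L)
        # Join step: unions of frequent k-itemsets that have size k+1.
        cand = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        cand = {c for c in cand
                if all(frozenset(sub) in prev for sub in combinations(c, k))}
        # Scan step: count candidate supports against the database.
        counts = {c: sum(1 for t in db if c <= t) for c in cand}
        L = {iset: n for iset, n in counts.items() if n >= min_sup}
        frequent.update(L)
        k += 1
    return frequent

# The TDB from the Example1 slide, with min_sup = 2:
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, min_sup=2)[frozenset({"B", "C", "E"})])  # 2
```

On this database the only frequent 3-itemset is {B, C, E}: candidates such as {A, B, C} are pruned because their subset {A, B} is already infrequent.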
21. Generating association rules from frequent itemsets
• Strong association rules (rules satisfying both minimum support and minimum confidence) are generated from the frequent itemsets as follows: for each frequent itemset, output every binary partition of it whose confidence meets the minimum confidence threshold.
• Each rule generated from a frequent itemset automatically satisfies the minimum support.
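The rule-generation step can be sketched as follows. The `gen_rules` helper and the input shape (frozenset → support count, as an Apriori pass would produce) are assumptions of this sketch.

```python
from itertools import combinations

def gen_rules(frequent, min_conf):
    """Generate (lhs, rhs, confidence) rules from frequent itemsets.

    `frequent` maps frozenset -> support count.  Every rule inherits the
    support of its itemset, so only confidence needs checking."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = count / frequent[lhs]  # sigma(X U Y) / sigma(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# {B, E} is frequent with count 3 in the TDB example; B and E each count 3.
frequent = {frozenset({"B"}): 3, frozenset({"E"}): 3, frozenset({"B", "E"}): 3}
for lhs, rhs, conf in gen_rules(frequent, min_conf=0.75):
    print(lhs, "=>", rhs, conf)  # both directions, confidence 1.0
```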
22. Contd…
• Example1: Use the Apriori algorithm to generate strong association rules from the following transaction database. Use min_sup=2 and min_confidence=75%.

Database TDB
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E
30. Contd…
• Step 2: Generating association rules:
• The data contain the frequent itemset X = {I1, I2, I5}. What are the association rules that can be generated from X?
• The nonempty proper subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}. The resulting association rules, each listed with its confidence, are:
– I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%
– I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%
– I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%
– I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33%
– I2 ⇒ I1 ∧ I5, confidence = 2/7 = 29%
– I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%
• Here, the minimum confidence threshold is 70%, so only the second, third, and last rules are output, because these are the only ones generated that are strong.
31. Contd...
• Problems with the Apriori algorithm:
– It is costly to handle a huge number of candidate sets.
• For example, if there are 10^4 frequent (large) 1-itemsets, the Apriori algorithm will need to generate more than 10^7 candidate 2-itemsets. Moreover, to discover a frequent pattern of size 100, it must generate more than 2^100 ≈ 10^30 candidates in total.
– Candidate generation is an inherent cost of the Apriori algorithm, no matter what implementation technique is applied.
– To mine large data sets for long patterns, this algorithm is NOT a good idea.
– When the database is scanned to check Ck for creating Lk, a large number of transactions will be scanned even if they do not contain any k-itemset.
32. Frequent Pattern (FP) Growth Approach for Mining Frequent Itemsets
• The frequent pattern growth approach mines frequent patterns without candidate generation.
• FP-growth mainly involves two steps:
– Build a compact data structure called the FP-tree, and
– Then extract frequent itemsets directly from the FP-tree.
33. Contd….
• FP-tree construction from a transactional DB:
– The FP-tree is constructed using two passes over the data set:
– Pass 1:
1. Scan the DB to find the frequent 1-itemsets:
a. Scan the DB and find the support for each item.
b. Discard infrequent items.
2. Sort the frequent items in descending order of their frequency (support count).
3. Sort the items in each transaction in that same descending order. Use this order when building the FP-tree, so common prefixes can be shared.
– Pass 2: Scan the DB again and construct the FP-tree:
1. FP-growth reads one transaction at a time and maps it to a path.
2. Because a fixed order is used, paths can overlap when transactions share items.
3. Pointers are maintained between nodes containing the same item (dotted lines in the figures).
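Pass 1 is simple enough to sketch directly. The `order_transactions` helper is our own name; it counts supports, drops infrequent items, and rewrites each transaction in descending-support order (with an alphabetical tie-break, matching the ordering used in the examples that follow).

```python
from collections import Counter

def order_transactions(transactions, min_sup):
    """Pass 1 of FP-tree construction: count item supports, drop infrequent
    items, and rewrite each transaction in descending support order so that
    common prefixes can be shared in the tree."""
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_sup}
    key = lambda item: (-counts[item], item)  # tie-break alphabetically
    return [sorted((i for i in t if i in frequent), key=key)
            for t in transactions]

db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
print(order_transactions(db, 2)[0])  # ['I2', 'I1', 'I5']
```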
34. Contd…
Fig: Flow chart for FP-tree construction process
35. Contd..
• Mining frequent patterns using the FP-tree:
– Start from each frequent length-1 pattern (called the suffix pattern),
– construct its conditional pattern base (the set of prefix paths in the FP-tree co-occurring with the suffix pattern),
– then construct its (conditional) FP-tree.
– Pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from its conditional FP-tree.
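The steps above can be sketched compactly in Python. Note the simplification: instead of a pointer-linked FP-tree, this sketch recurses directly on conditional pattern bases (lists of prefix paths with counts), which computes the same frequent itemsets on small examples; the `fpgrowth` name and structure are our own.

```python
from collections import Counter

def fpgrowth(transactions, min_sup):
    """Frequent itemsets via pattern growth: for each frequent item, build
    its conditional pattern base (prefix paths) and recurse on it, growing
    the suffix pattern.  Returns {frozenset: support_count}."""
    def mine(paths, suffix, out):
        counts = Counter()
        for path, n in paths:
            for item in path:
                counts[item] += n
        for item, c in counts.items():
            if c < min_sup:
                continue
            pattern = suffix | {item}
            out[frozenset(pattern)] = c
            # Conditional pattern base: the prefix of `item` in each path.
            cond = []
            for path, n in paths:
                if item in path:
                    prefix = path[:path.index(item)]
                    if prefix:
                        cond.append((prefix, n))
            mine(cond, pattern, out)
        return out

    # Order each transaction by descending global support (pass 1).
    counts = Counter(i for t in transactions for i in t)
    ordered = [sorted([i for i in t if counts[i] >= min_sup],
                      key=lambda i: (-counts[i], i)) for t in transactions]
    return mine([(t, 1) for t in ordered], frozenset(), {})

# The nine-transaction database from the next example, min_sup = 2:
db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]
patterns = fpgrowth(db, 2)
print(patterns[frozenset({"I1", "I2", "I5"})])  # 2
```

Because transactions are kept in a fixed descending-support order, each conditional base contains only higher-ordered items, so every frequent itemset is generated exactly once.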
36. Contd…
• Example1: Find all frequent itemsets (frequent patterns) in the following database using the FP-growth algorithm. Take minimum support = 2.

TID | List of item IDs
1   | I1, I2, I5
2   | I2, I4
3   | I2, I3
4   | I1, I2, I4
5   | I1, I3
6   | I2, I3
7   | I1, I3
8   | I1, I2, I3, I5
9   | I1, I2, I3

• Now we will build an FP-tree of that database.
• Itemsets are considered in descending order of their support count.
37. Contd…
• Constructing 1-itemsets and counting the support count for each item:

Itemset | Support count
I1      | 6
I2      | 7
I3      | 6
I4      | 2
I5      | 2

• Discarding all infrequent itemsets: since min_sup = 2, every item here is frequent, so none are discarded.
38. Contd…
• Sorting the frequent 1-itemsets in descending order of their support count:

Itemset | Support count
I2      | 7
I1      | 6
I3      | 6
I4      | 2
I5      | 2
39. Contd…
• Now, ordering the items of each transaction in D according to the frequent 1-itemset order above:

TID | List of items  | Ordered items
1   | I1, I2, I5     | I2, I1, I5
2   | I2, I4         | I2, I4
3   | I2, I3         | I2, I3
4   | I1, I2, I4     | I2, I1, I4
5   | I1, I3         | I1, I3
6   | I2, I3         | I2, I3
7   | I1, I3         | I1, I3
8   | I1, I2, I3, I5 | I2, I1, I3, I5
9   | I1, I2, I3     | I2, I1, I3
40. Contd…
• Now drawing the FP-tree by inserting the ordered transactions one by one.
• For transaction 1 (I2, I1, I5), a single path is created:

null → I2:1 → I1:1 → I5:1
49. Contd…
• After inserting all nine ordered transactions, the complete FP-tree is:

null
├── I2:7
│   ├── I1:4
│   │   ├── I5:1
│   │   ├── I4:1
│   │   └── I3:2
│   │       └── I5:1
│   ├── I4:1
│   └── I3:2
└── I1:2
    └── I3:2

• To facilitate tree traversal, an item header table (I2:7, I1:6, I3:6, I4:2, I5:2) is built so that each item points to its occurrences in the tree via a chain of node links (shown as dotted lines in the original figure).
• FP-tree construction is over! Now we need to find the conditional pattern base and conditional FP-tree for each item.
55. Contd…
• Summary of the conditional pattern bases, conditional FP-trees, and frequent patterns generated:

Item | Conditional pattern base     | Conditional FP-tree     | Frequent patterns generated
I5   | {I2 I1: 1}, {I2 I1 I3: 1}    | ⟨I2: 2, I1: 2⟩          | {I2, I5}: 2, {I1, I5}: 2, {I2, I1, I5}: 2
I4   | {I2 I1: 1}, {I2: 1}          | ⟨I2: 2⟩                 | {I2, I4}: 2
I3   | {I2 I1: 2}, {I2: 2}, {I1: 2} | ⟨I2: 4, I1: 2⟩, ⟨I1: 2⟩ | {I2, I3}: 4, {I1, I3}: 4, {I2, I1, I3}: 2
I1   | {I2: 4}                      | ⟨I2: 4⟩                 | {I2, I1}: 4
56. Example 2
• Example2: Find all frequent itemsets (frequent patterns) in the following database using the FP-growth algorithm. Take minimum support = 3.
57. Contd….
• FP-tree construction:
• Finding the frequent 1-itemsets and sorting this set in descending order of support count (frequency):

Item | Frequency
f    | 4
c    | 4
a    | 3
b    | 3
m    | 3
p    | 3

• Then, forming the sorted frequent-item transactions from the transaction data set D:

TID | Items bought             | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
58. Contd…
• With min_support = 3, inserting transaction 100 ({f, c, a, m, p}) creates a single path:

root → f:1 → c:1 → a:1 → m:1 → p:1
59. Contd…
• Inserting transaction 200 ({f, c, a, b, m}) shares the prefix f, c, a:

root
└── f:2
    └── c:2
        └── a:2
            ├── m:1
            │   └── p:1
            └── b:1
                └── m:1
60. Contd…
• Inserting transaction 300 ({f, b}) adds a new child of f:

root
└── f:3
    ├── c:2
    │   └── a:2
    │       ├── m:1
    │       │   └── p:1
    │       └── b:1
    │           └── m:1
    └── b:1
61. Contd…
• Inserting transaction 400 ({c, b, p}) starts a new branch at the root, since it shares no prefix with f:

root
├── f:3
│   ├── c:2
│   │   └── a:2
│   │       ├── m:1
│   │       │   └── p:1
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
62. Contd…
• Inserting transaction 500 ({f, c, a, m, p}) completes the FP-tree:

root
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

Header Table (each entry heads a chain of node links into the tree):
Item | Frequency
f    | 4
c    | 4
a    | 3
b    | 3
m    | 3
p    | 3
63. Contd…
• Mining frequent patterns using the FP-tree: start with the last item in the order (i.e., p).
• Follow the node pointers and traverse only the paths containing p.
• Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.
• Conditional pattern base for p: fcam:2, cb:1.
• Construct a new FP-tree from this base by merging all paths and keeping only the nodes that appear at least min_support times. This leaves only one branch, c:3.
• Thus we derive only one frequent pattern containing p besides p itself: the pattern cp ({c, p}: 3).
64. Contd…
• Move to the next least frequent item in the order, i.e., m.
• Follow the node pointers and traverse only the paths containing m.
• Accumulate all of the transformed prefix paths of that item to form its conditional pattern base.
• m-conditional pattern base: fca:2, fcab:1.
• m-conditional FP-tree (contains only the path fca:3):

{}
└── f:3
    └── c:3
        └── a:3

• All frequent patterns that include m: m, fm, cm, am, fcm, fam, cam, fcam.
66. Why Is Frequent Pattern Growth Fast?
• Performance studies show that FP-growth is an order of magnitude faster than Apriori.
• Reasoning:
– No candidate generation, no candidate test
– Uses a compact data structure
– Eliminates repeated database scans
– The basic operations are counting and FP-tree building
67. Handling Categorical Attributes
• So far, we have used only transaction data for mining association rules.
• The data can be in transaction form or table form.
• Transaction form:
t1: a, b
t2: a, c, d, e
t3: a, d, f
• Table form:

Attr1 | Attr2 | Attr3
a     | b     | d
b     | c     | e

• Table data need to be converted to transaction form for association rule mining.
68. Contd…
• To convert a table data set to a transaction data set, simply change each value to an attribute–value pair.
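The conversion can be sketched in a few lines. The column names are the ones from the table-form slide; the dict/set representation is our own choice for illustration.

```python
# Converting table-form data to transaction form: each cell becomes an
# "Attribute=value" item, so identical values in different columns remain
# distinguishable.
rows = [
    {"Attr1": "a", "Attr2": "b", "Attr3": "d"},
    {"Attr1": "b", "Attr2": "c", "Attr3": "e"},
]
transactions = [{f"{attr}={val}" for attr, val in row.items()} for row in rows]
print(sorted(transactions[0]))  # ['Attr1=a', 'Attr2=b', 'Attr3=d']
```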
69. Contd…
• Each attribute–value pair is considered an item.
• Using only the values is not sufficient in transaction form, because different attributes may have the same values.
• For example, without including the attribute names, the value a for Attribute1 and the value a for Attribute2 are not distinguishable.
• After the conversion, the transaction-form data can be used in mining.
70. Homework
• What is the aim of association rule mining? Why is this aim important in some applications?
• Define the concepts of support and confidence for an association rule.
• Show how the Apriori algorithm works on an example dataset.
• What is the basis of the Apriori algorithm? Describe the algorithm briefly. Which step of the algorithm can become a bottleneck?
• A database has five transactions. Let min_sup = 60% and min_conf = 80%. Find all frequent itemsets using the Apriori algorithm. List all the strong association rules.
71. Contd…
• Show, using an example, how the FP-tree algorithm solves the association rule mining (ARM) problem.
• Perform ARM using FP-growth on the following data set with minimum support = 50% and confidence = 75%:

Transaction ID | Items
1              | Bread, Cheese, Eggs, Juice
2              | Bread, Cheese, Juice
3              | Bread, Milk, Yogurt
4              | Bread, Juice, Milk
5              | Cheese, Juice, Milk