Lecture 14 - Advanced topics in association rules


  1. Introduction to Machine Learning. Lecture 14: Advanced Topics in Association Rules Mining. Albert Orriols i Puig (aorriols@salle.url.edu). Artificial Intelligence – Machine Learning, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull.
  2. Recap of Lecture 13. Ideas come from market basket analysis (MBA). Let's go shopping! Customer 1: milk, eggs, sugar, bread. Customer 2: milk, eggs, cereal, bread. Customer 3: eggs, sugar. What do my customers buy? Which products are bought together? Aim: find associations and correlations between the different items that customers place in their shopping baskets.
  3. Recap of Lecture 13. Apriori on an example database TDB (min support = 2):
     Database TDB:  Tid 10: {A, C, D}   Tid 20: {B, C, E}   Tid 30: {A, B, C, E}   Tid 40: {B, E}
     1st scan (C1 → L1): {A}:2, {B}:3, {C}:3, {E}:3  ({D}:1 is pruned)
     2nd scan (C2 → L2): candidates {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; frequent {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
     3rd scan (C3 → L3): candidate {B,C,E}; frequent {B,C,E}:2
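The trace above is easy to reproduce. A minimal Apriori sketch in Python (the function name is mine, and the join step is the naive pairwise union, without the lecture's pruning refinements):

```python
def apriori(transactions, min_sup):
    # level 1: frequent single items
    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items
             if sum(s <= t for t in transactions) >= min_sup}
    frequent = set(level)
    while level:
        k = len(next(iter(level))) + 1
        # join step: union pairs of frequent itemsets into k-candidates
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_sup}
        frequent |= level
    return frequent

tdb = [frozenset('ACD'), frozenset('BCE'), frozenset('ABCE'), frozenset('BE')]
print(sorted(map(sorted, apriori(tdb, 2))))
# [['A'], ['A','C'], ['B'], ['B','C'], ['B','C','E'], ['B','E'],
#  ['C'], ['C','E'], ['E']]
```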
  4. Recap of Lecture 13. Challenges: Apriori scans the database multiple times; most often, there is a high number of candidates; and support counting for the candidates can be time-consuming. Several methods try to improve these points by reducing the number of database scans, shrinking the number of candidates, and counting the support of candidates more efficiently.
  5. Today's Agenda. Starting a journey through some advanced topics in ARM: mining frequent patterns without candidate generation, multiple-level AR, sequential pattern mining, quantitative association rules, mining class association rules, beyond support & confidence, and applications.
  6. Revisiting Candidate Generation. Remember Apriori? Use the previous frequent (k-1)-itemsets to generate the candidate k-itemsets, then count the itemsets' support by scanning the database. Bottleneck in the process: candidate generation. Suppose 100 items. The first level of the tree has 100 nodes; the second level has C(100, 2) nodes; in general, the number of k-itemsets is C(100, k).
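To make the blow-up concrete, these counts can be computed directly (a small illustration, not part of the slides):

```python
from math import comb

# Size of the candidate space over 100 items: C(100, k) grows explosively.
for k in (1, 2, 3, 5):
    print(k, comb(100, k))
# 1 100
# 2 4950
# 3 161700
# 5 75287520
```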
  7. Can We Avoid Candidate Generation? Build an auxiliary structure that gathers statistics about the itemsets in order to avoid candidate generation. Use an FP-tree: it avoids multiple scans of the data, follows a divide-and-conquer methodology, and avoids candidate generation. Outline of the process: (1) generate an FP-tree; (2) mine the FP-tree.
  8. Building the FP-Tree.
     TID 1: {F,A,C,D,G,I,M,P} → sorted FIS {F,C,A,M,P}
     TID 2: {A,B,C,F,L,M,O} → {F,C,A,B,M}
     TID 3: {B,F,H,J,O} → {F,B}
     TID 4: {B,C,K,S,P} → {C,B,P}
     TID 5: {A,F,C,E,L,P,M,N} → {F,C,A,M,P}
     Scan the DB for the first time and identify the frequent items: <(F:4), (C:4), (A:3), (B:3), (M:3), (P:3)>. Sort the items of each transaction according to this frequency order (last column above).
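A minimal sketch of this first scan in Python (first_scan is a hypothetical name). Note that ties between equally frequent items can be broken arbitrarily; the alphabetical tie-break used here puts C before F, whereas the slides put F first:

```python
from collections import Counter

# First FP-growth scan: count item supports, drop infrequent items, and
# rewrite each transaction sorted by descending support.
def first_scan(transactions, min_support):
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    rank = {i: r for r, i in enumerate(
        sorted(frequent, key=lambda i: (-frequent[i], i)))}
    sorted_db = [sorted((i for i in t if i in frequent), key=rank.get)
                 for t in transactions]
    return sorted_db, frequent

db = [{'F','A','C','D','G','I','M','P'}, {'A','B','C','F','L','M','O'},
      {'B','F','H','J','O'}, {'B','C','K','S','P'},
      {'A','F','C','E','L','P','M','N'}]
sorted_db, freq = first_scan(db, min_support=3)
print(sorted_db[0])   # ['C', 'F', 'A', 'M', 'P']
```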
  9. Building the FP-Tree. Scan the DB a second time to build the tree. After reading TID 1 ({F,C,A,M,P}), the tree is a single path:
     root → F:1 → C:1 → A:1 → M:1 → P:1
  10. Building the FP-Tree. After reading TID 2 ({F,C,A,B,M}), the shared prefix F, C, A is reused and a new branch hangs from A:
     root → F:2 → C:2 → A:2, with A:2 → M:1 → P:1 and A:2 → B:1 → M:1
  11. Building the FP-Tree. After reading TID 3 ({F,B}), F is incremented and a new child B:1 hangs from F:
     root → F:3, with F:3 → C:2 → A:2 → {M:1 → P:1, B:1 → M:1} and F:3 → B:1
  12. Building the FP-Tree. After reading TID 4 ({C,B,P}), a new branch starts at the root:
     root → F:3 → … (unchanged) and root → C:1 → B:1 → P:1
  13. Building the FP-Tree. After reading TID 5 ({F,C,A,M,P}), the counts along the first path are incremented:
     root → F:4 → C:3 → A:3 → {M:2 → P:2, B:1 → M:1}, F:4 → B:1, and root → C:1 → B:1 → P:1
  14. Building the FP-Tree. The final tree, together with a header table over the items F, C, A, B, M, P. Build an index (the header table) to access the nodes quickly and traverse the tree: each entry points at the first node carrying that item, and all nodes of an item are chained by node-links.
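A sketch of the construction, continuing from the first_scan sketch above (the Node class and insert function are hypothetical names, not the lecture's code):

```python
# FP-tree node: item label, count, parent pointer, children by item, and a
# node-link to the next node carrying the same item.
class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children, self.link = {}, None

def insert(root, transaction, header):
    node = root
    for item in transaction:          # items already support-sorted
        child = node.children.get(item)
        if child is not None:
            child.count += 1          # shared prefix: just bump the count
        else:
            child = Node(item, node)
            node.children[item] = child
            if item in header:        # thread onto the item's node-link chain
                tail = header[item]
                while tail.link:
                    tail = tail.link
                tail.link = child
            else:                     # header table: first node of the chain
                header[item] = child
        node = child

root, header = Node(None, None), {}
for t in sorted_db:                   # sorted_db from the first-scan sketch
    insert(root, t, header)
```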
  15. Mining the FP-Tree. Properties used to mine the FP-tree. Node-link property: all possible itemsets in which a frequent item a is included can be found by following a's node-links. Example: item P has a support of 3, and there are two paths in the FP-tree reaching P: (1) {F,C,A,M,P} with count 2, and (2) {C,B,P} with count 1.
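With the header table in place, the node-link property turns support counting into a walk along a single chain; continuing the sketch above:

```python
# Support of an item = sum of the counts along its node-link chain.
def support(header, item):
    total, node = 0, header.get(item)
    while node is not None:
        total += node.count
        node = node.link
    return total

print(support(header, 'P'))   # 3 = 2 + 1, one node per path reaching P
```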
  16. Mining the FP-Tree. Prefix-path property: to calculate the frequent patterns for a node a in a path P, only the prefix subpath of node a in P needs to be accumulated, and the frequency count of every node in the prefix path should carry the same count as node a. Example: node P's item M is involved in (F:4, C:3, A:3, M:2, P:2). Take the prefix of the path up to M, (F:4, C:3, A:3), and adjust the counts to 2, giving (F:2, C:2, A:2). So F, C, and A co-occur with M.
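The prefix-path property is what the conditional pattern base rests on. Continuing the sketch, the base of an item can be collected by walking its node-links and climbing to the root, carrying each node's own count:

```python
# Conditional pattern base of an item: for every node on its node-link
# chain, collect the prefix path back to the root with that node's count.
def conditional_pattern_base(header, item):
    base, node = [], header.get(item)
    while node is not None:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
        node = node.link
    return base

print(conditional_pattern_base(header, 'M'))
# [(['C', 'F', 'A'], 2), (['C', 'F', 'A', 'B'], 1)]
# (C before F only because of the alphabetical tie-break chosen earlier)
```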
  17. Mining the FP-Tree. Fragment growth: let α be an itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then, the support of α ∪ β is equivalent to the support of β in B. For M, we had (F:2, C:2, A:2) and (F:1, C:1, A:1, B:1). Therefore, the frequent patterns include {(F,C,A,M):2}, {(F,C,M):2}, …
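Fragment growth is what makes the mining recursive: build a conditional FP-tree from each item's pattern base and mine it for longer patterns ending in that item. A sketch continuing the code above (it reuses Counter, Node, insert, support, and conditional_pattern_base; inserting each path count times is wasteful but keeps the code short):

```python
def fp_growth(header, suffix, min_support, results):
    for item in list(header):
        pattern = [item] + suffix
        results.append((pattern, support(header, item)))
        base = conditional_pattern_base(header, item)
        counts = Counter()                    # item supports within the base
        for path, c in base:
            for i in path:
                counts[i] += c
        keep = {i for i, c in counts.items() if c >= min_support}
        cond_root, cond_header = Node(None, None), {}
        for path, c in base:
            kept = sorted((i for i in path if i in keep),
                          key=lambda i: (-counts[i], i))
            for _ in range(c):                # insert the path c times
                insert(cond_root, kept, cond_header)
        if cond_header:                       # recurse on the conditional tree
            fp_growth(cond_header, pattern, min_support, results)

results = []
fp_growth(header, [], 3, results)   # header from the build sketch above
print(len(results))                 # 18 frequent itemsets at min support 3
```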
  18. Is FP-growth Faster than Apriori? As the support threshold goes down, the number of itemsets increases dramatically, and FP-growth does not need to generate candidates and test them.
  19. Is FP-growth Faster than Apriori? Both FP-growth and Apriori scale linearly with the number of transactions, but FP-growth is more efficient.
  20. Next Class. Advanced topics in association rule mining.
  21. Introduction to Machine Learning. Lecture 14: Advanced Topics in Association Rules Mining. Albert Orriols i Puig (aorriols@salle.url.edu). Artificial Intelligence – Machine Learning, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull.
