Interval Intersection Technique used in Data Mining (for frequent itemsets)
Data mining turns a large collection of data into knowledge.
This presentation also covers the Partitioning Algorithm,
which performs better than the Apriori algorithm.
2. • Data mining turns a large collection of data into knowledge
• Process of extracting patterns from data
• Knowledge discovery from data (KDD)
5. Agrawal & Srikant, 1994
Proposed after analyzing supermarket transactional datasets.
Uses breadth-first search (a level-wise search method).
1. First phase: candidate itemset generation
2. Second phase: support counting
6. Scan the DB once to get the frequent 1-itemsets.
Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
Scan the DB and remove the infrequent candidates.
Terminate when no frequent or candidate set can be generated.
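The level-wise loop above can be sketched in Python. This is a minimal illustration, not a reference implementation; the `apriori` name and the dict-of-frozensets output are my own conventions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: frequent k-itemsets seed the (k+1)-candidates.
    transactions: list of item sets; returns {itemset: support}."""
    # Scan once for frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates,
        # pruning any candidate with an infrequent k-subset (Apriori principle).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in frequent
                                       for s in combinations(u, k)):
                candidates.add(u)
        # Scan the DB to count the candidates; drop the infrequent ones.
        frequent = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t)
            if sup >= min_sup:
                frequent[c] = sup
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

On the 8-transaction dataset used later in these slides, with min_sup = 4, this yields the eleven frequent itemsets shown there (A, C, E, F, AC, AE, AF, CF, EF, ACF, AEF).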
7. • Apriori pruning principle:
“Any subset of a frequent pattern must be frequent”
• If {beer, chips, nuts} is frequent, so is {beer, chips}, i.e.,
every transaction having {beer, chips, nuts} also contains
{beer, chips}.
8. START
→ Read each item in the transactions
→ Calculate the support of every item
→ Support >= min_supp? No: remove the item. Yes: insert the item into the frequent itemsets
→ Find the confidence for each non-empty subset
→ Confidence >= min_conf? No: remove the subset. Yes: insert the rule into the strong rules
→ STOP
9. TID ITEMS
1 A,C,D,F
2 A,B,E
3 B,F
4 A,C,E,F
5 A,D,E,F
6 A,B,E,F
7 A,C,F
8 A,C,E,F
8 transactions -> represented by INTEGERS
6 items -> represented by ALPHABETS
Find the frequent itemsets using APRIORI
10. C1 (candidate 1-itemsets): A 7, B 3, C 4, D 2, E 5, F 7
L1 (frequent 1-itemsets): A 7, C 4, E 5, F 7
C2: AC 4, AE 5, AF 6, CE 2, CF 4, EF 4
L2: AC 4, AE 5, AF 6, CF 4, EF 4
C3: ACE 2, ACF 4, AEF 4
L3: ACF 4, AEF 4
FREQUENT ITEMSETS USING APRIORI (min_supp = 4):
A 7, C 4, E 5, F 7, AC 4, AE 5, AF 6, CF 4, EF 4, ACF 4, AEF 4
12. • Disadvantages of Apriori:
Requires multiple scans of the transaction database: to find the itemsets with supp >= min_supp, the database must be scanned at every level.
During pass 1, most memory is idle.
Assumes the transaction database is memory resident.
High disk I/O overhead.
Huge number of candidates: if the transaction DB has 10^4 frequent 1-itemsets, Apriori will generate about 10^7 candidate 2-itemsets.
Tedious workload of support counting for the candidates.
13. Partitioning Algorithm:
Transactions in DB
→ Divide the DB into n partitions
→ Phase 1: find the frequent itemsets local to each partition (1 scan)
→ Combine all local frequent itemsets to form the candidate itemsets
→ Phase 2: find the global frequent itemsets among the candidates (1 scan)
→ Frequent itemsets in DB
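The two-phase flow above can be sketched as follows. This is a toy illustration: `partition_mining` and the brute-force `local_frequent` helper are my own names, the helper stands in for a real Apriori pass on each partition, and the proportionally scaled local threshold is the standard Partition-algorithm choice.

```python
from itertools import combinations

def local_frequent(transactions, min_sup):
    """Brute-force local miner for one partition (stand-in for Apriori;
    fine for tiny partitions, exponential in transaction width)."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                s = frozenset(combo)
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c >= min_sup}

def partition_mining(transactions, min_sup_ratio, n_parts):
    """Two-phase Partition algorithm over a list of item sets.
    Phase 1 (scan 1): locally frequent itemsets of each partition, at a
    proportionally scaled threshold, are unioned into the candidate set.
    Phase 2 (scan 2): one pass over the whole DB counts every candidate."""
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    candidates = set()
    for part in parts:
        local_min = max(1, int(min_sup_ratio * len(part)))
        candidates |= local_frequent(part, local_min)

    global_min = min_sup_ratio * len(transactions)
    result = {}
    for c in candidates:
        sup = sum(1 for t in transactions if c <= t)
        if sup >= global_min:
            result[c] = sup
    return result
```

The correctness argument is the classic one: any itemset frequent in the whole DB must be frequent in at least one partition at the scaled threshold, so the phase-2 recount cannot miss it.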
14. Overcomes the memory problem for large databases.
The objective is to reduce the disk I/O overhead.
15. Interval Intersection
An interval such as [3, 6] defines a range between two real numbers, written [a, b].
Let x be any real number in this interval; then
a <= x <= b,
where a is the starting number and b is the ending number of the interval.
Intersection is an operation on two intervals, expressed mathematically as:
for intervals X = [Xa, Xb] and Y = [Ya, Yb], it is denoted by X ∩ Y and given by
{Z | Z ∈ X and Z ∈ Y} = [max(Xa, Ya), min(Xb, Yb)].
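The formula translates directly into code. A small sketch (the `intersect` name is mine; an empty intersection, where the computed start exceeds the end, is returned as `None`):

```python
def intersect(x, y):
    """Intersection of intervals x = (xa, xb) and y = (ya, yb):
    X ∩ Y = [max(xa, ya), min(xb, yb)], or None when empty."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None
```

For example, `intersect((1, 2), (2, 3))` gives `(2, 2)`, while `intersect((1, 1), (3, 4))` gives `None`.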
16. Minimum memory is used.
The least time is consumed for calculating the support counts.
Using this technique, only two scans of the dataset are required, which reduces the number of scans and makes the process faster.
18. • The negative border is used to store those item sets whose support count is less than the minimum support count.
19. Input: Dataset D and Gmin_sup.
Output: Frequent item set list. // Stored in FIL
SCL = NULL // initialize the support count list
N_Border = NULL // initialize the negative border
1. P = Partition(Dataset D) // partition dataset D into N parts
2. for each partition P, 1 to N
3. repeat until no further item sets are found, i.e. FILk = ϕ // FIL: frequent item set list
4. for i = 1 to k // k-length item sets
   for i = 1:
     scan the dataset and store the support counts in SCL, i.e.
     SCL = SCL ∪ Sup_count(Itemi)
     while (SCL != empty)
     {
       if (min_sup > SC(Itemi))
         N_Border = N_Border ∪ {Itemi}
       SCL = SCL - {Itemi}
     }
   for k >= 2:
     Interval_Intersection(interval set A, interval set B, list FIL)
     return FIL
5. Merge(FIL1, FIL2, ..., FILN)
   return (FIL)
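The i = 1 pass of this pseudocode — a single scan that builds each item's interval-set representation (runs of consecutive TIDs containing the item), records support counts, and splits items between the frequent list and the negative border — might look like this. The `first_pass` name and return shape are my own, not the paper's:

```python
def first_pass(transactions, min_sup):
    """One scan of a partition. Builds each item's interval set (maximal
    runs [start_tid, end_tid] of consecutive transactions containing the
    item), then splits items into FIL and N_Border by support count."""
    intervals = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            runs = intervals.setdefault(item, [])
            if runs and runs[-1][1] == tid - 1:
                runs[-1][1] = tid          # extend the current run
            else:
                runs.append([tid, tid])    # start a new run
    # Support count of an interval set: sum of (end - start + 1).
    scl = {item: sum(e - s + 1 for s, e in runs)
           for item, runs in intervals.items()}
    fil = {item: runs for item, runs in intervals.items()
           if scl[item] >= min_sup}
    n_border = {item: runs for item, runs in intervals.items()
                if scl[item] < min_sup}
    return fil, n_border, scl
```

On the four-transaction example used later in these slides, this produces interval sets such as A = [1,2], [4,4] and F = [1,1], [3,4], with D (support 1) landing in the negative border.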
20. SCL = null; N_Border = null
→ Partition dataset D into N parts
→ For each partition, repeat until no further item sets are found:
  for i = 1: scan the dataset and store the support counts in SCL, i.e. SCL = SCL ∪ Sup_count(Itemi);
  while (SCL != empty): if min_sup > SC(Itemi) (yes), then N_Border = N_Border ∪ {Itemi} and SCL = SCL - {Itemi}
  for k >= 2: Interval_Intersection(interval set A, interval set B, list FIL)
→ Merge(FIL1, FIL2, ..., FILN)
→ Return (FIL)
21. Input: Results of all the partitions. // FIL1, FIL2, ..., FILN
Output: List of frequent item sets. // Final results in FIL
FIL = NULL // initialize the frequent item set list
if (Itemi ∈ FIL)
{
  SC(Itemi) = SC(Itemi).FIL + SC(Itemi).FILl // support counts are added
}
else
{
  FIL = FIL ∪ {Itemi} // item is inserted into FIL
  while (FIL != empty)
  {
    if (SC(Itemi) > GMin_Sup)
      continue;
    else if (Itemi ∈ N_Border)
    {
      SC(Itemi) = SC(Itemi).FILl + SC(Itemi).N_Border
      if (SC(Itemi) > GMin_Sup)
        continue;
      else
        FIL = FIL - {Itemi}; // item is removed from FIL
    }
  }
}
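A simplified version of this merge step, treating each partition's FIL and negative border as dicts from item set to support count (my representation, not the paper's):

```python
def merge(partition_fils, partition_borders, gmin_sup):
    """Phase-2 merge: sum each item set's local support counts across
    partitions; for item sets still below the global threshold, add any
    counts recorded in the negative borders before deciding to drop them."""
    fil = {}
    for part_fil in partition_fils:
        for itemset, sup in part_fil.items():
            fil[itemset] = fil.get(itemset, 0) + sup
    for itemset in list(fil):
        if fil[itemset] >= gmin_sup:
            continue
        # Borderline item set: add support from the negative borders.
        for border in partition_borders:
            fil[itemset] += border.get(itemset, 0)
        if fil[itemset] < gmin_sup:
            del fil[itemset]  # still infrequent globally
    return fil
```

For instance, with local counts {"A": 3, "B": 2} and {"A": 4}, borders {} and {"B": 1}, and a global threshold of 4, only A (total 7) survives; B totals 3 even after the border top-up and is dropped.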
22. FIL = NULL
item FIL
Support count is
added
Item inserted
While ( FIL != empty)
else
Item is removed
SC(item)
>
Gmin_sup
p
if
Item N_Bordercon
tinu
e
yes
no
SC=SC.FIL
+
SC.N_Bord
er
SC(item)
>
Gmin_sup
p
con
tinu
e
yes
no
yes
23. Tid ITEMS
1 A,C,D,F
2 A,B,E
3 B,F
4 A,C,E,F
Support values: A = 3, B = 2, C = 2, D = 1, E = 2, F = 3
1-itemsets in interval-set representation:
A = [1,2], [4,4]
B = [2,3]
C = [1,1], [4,4]
E = [2,2], [4,4]
F = [1,1], [3,4]
2-itemsets in interval-set representation:
AB = [2,2]
AC = [1,1], [4,4]
AE = [2,2], [4,4]
AF = [1,1], [4,4]
BC = []
BE = [2,2]
BF = [3,3]
CE = [4,4]
CF = [1,1], [4,4]
EF = [4,4]
Support count = Σ (end - start + 1)
Negative border: D, AB, BE, BF, CE, EF, ACE, AEF
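Intersecting two interval sets, as in the 2-itemset table above, can be done with a linear merge over the sorted runs. A sketch with my own function names:

```python
def intersect_sets(xs, ys):
    """Intersect two interval sets (sorted, disjoint runs of TIDs) by
    pairwise interval intersection, merge-style in O(len(xs) + len(ys))."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out

def support(runs):
    """Support count of an interval set: sum of (end - start + 1)."""
    return sum(e - s + 1 for s, e in runs)
```

This reproduces the table's values, e.g. A ∩ B = [2,2], A ∩ F = [1,1], [4,4] (support 2), and B ∩ C = [] for the empty case.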
24. Parameters      | Apriori Algorithm                     | PFIMII Algorithm
Complexity          | More complex due to the many scans    | Less complex due to only two scans
Number of scans     | Three scans of the dataset            | Two scans of the dataset
Execution time      | More time consuming                   | Less time consuming
Results             | Same results as the PFIMII algorithm  | Same results as the Apriori algorithm
28. • The proposed PFIMII algorithm creates many partitions of the dataset and performs the task of finding frequent item sets in parallel on each partition.
• Many previous algorithms make multiple scans of the dataset to determine the support counts and frequent item sets, which makes the process time consuming and inefficient. PFIMII takes only two scans of the dataset, making the task less complex and more efficient.
• It needs less access to the disk-resident database.
• The algorithm finds frequent item sets in parallel on the various partitions of the dataset, which makes it faster.
29. • Yungho Leu, Vania Utami, "A new frequent item set mining algorithm based on interval intersection", in Proceedings of the International Conference on Machine Learning and Cybernetics, Guangzhou, 12-15 April 2015.
• Agrawal, R.; Imielinski, T.; Swami, A., "Mining Association Rules between Sets of Items in Large Databases", ACM SIGMOD Conference, Washington DC, USA, 1993.
• Amit Siwach, Neelam Duhan, Parul Tomar, "PFIMII: Parallel Frequent Itemset Mining using Interval Intersection", Data Mining.
• Jiawei Han and Micheline Kamber, "Frequent item set mining methods", in Data Mining: Concepts and Techniques.
• Moore, R. E., R. Baker Kearfott and M. J. Cloud, "Introduction to Interval Analysis", SIAM, 2009.