Interval Intersection Technique used in Data Mining (for frequent itemsets)
Data mining turns a large collection of data into knowledge.
This presentation also covers the Partitioning Algorithm,
which performs better than the Apriori algorithm.
2. • Data mining turns a large collection of data into knowledge
• Process of extracting patterns from data
• Knowledge discovery from data (KDD)
5. Agrawal & Srikant, 1994
Proposed after analyzing supermarket transactional datasets.
Uses breadth-first search (a level-wise search method).
1. First phase: candidate itemset generation
2. Second phase: support counting
6. Scan the DB once to get the frequent 1-itemsets.
Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
Scan the DB and remove the infrequent candidates.
Terminate when no frequent or candidate set can be generated.
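The level-wise loop above can be sketched in Python. This is a minimal illustration, not a reference implementation; the `apriori` name and the dict-of-frozensets output are my own conventions.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: frequent k-itemsets seed the (k+1)-candidates.
    transactions: list of item sets; returns {itemset: support}."""
    # Scan once for frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates,
        # pruning any candidate with an infrequent k-subset (Apriori principle).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in frequent
                                       for s in combinations(u, k)):
                candidates.add(u)
        # Scan the DB to count the candidates; drop the infrequent ones.
        frequent = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t)
            if sup >= min_sup:
                frequent[c] = sup
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

On the 8-transaction dataset used later in these slides, with min_sup = 4, this yields the eleven frequent itemsets shown there (A, C, E, F, AC, AE, AF, CF, EF, ACF, AEF).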
7. • Apriori pruning principle:
“Any subset of a frequent pattern must be frequent”
• If {beer, chips, nuts} is frequent, so is {beer, chips}, i.e.,
every transaction having {beer, chips, nuts} also contains
{beer, chips}.
8. START
→ Read each item in the transactions
→ Calculate the support of every item
→ Support >= min_supp? No: remove the item. Yes: insert the item into the frequent itemsets
→ Find the confidence for each non-empty subset
→ Confidence >= min_conf? No: remove the subset. Yes: insert the rule into the strong rules
→ STOP
9. TID ITEMS
1 A,C,D,F
2 A,B,E
3 B,F
4 A,C,E,F
5 A,D,E,F
6 A,B,E,F
7 A,C,F
8 A,C,E,F
8 transactions -> represented by INTEGERS
6 items -> represented by ALPHABETS
Find the frequent itemsets using APRIORI
10. C1 (candidate 1-itemsets): A 7, B 3, C 4, D 2, E 5, F 7
L1 (frequent 1-itemsets): A 7, C 4, E 5, F 7
C2: AC 4, AE 5, AF 6, CE 2, CF 4, EF 4
L2: AC 4, AE 5, AF 6, CF 4, EF 4
C3: ACE 2, ACF 4, AEF 4
L3: ACF 4, AEF 4
FREQUENT ITEMSETS USING APRIORI (min_supp = 4):
A 7, C 4, E 5, F 7, AC 4, AE 5, AF 6, CF 4, EF 4, ACF 4, AEF 4
12. • Disadvantages of Apriori:
Requires multiple scans of the transaction database: to find the itemsets with supp >= min_supp, the database must be scanned at every level.
During pass 1, most memory is idle.
Assumes the transaction database is memory resident.
High disk I/O overhead.
Huge number of candidates: if the transaction DB has 10^4 frequent 1-itemsets, Apriori will generate about 10^7 candidate 2-itemsets.
Tedious workload of support counting for the candidates.
13. Partitioning Algorithm:
Transactions in DB
→ Divide the DB into n partitions
→ Phase 1: find the frequent itemsets local to each partition (1 scan)
→ Combine all local frequent itemsets to form the candidate itemsets
→ Phase 2: find the global frequent itemsets among the candidates (1 scan)
→ Frequent itemsets in DB
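The two-phase flow above can be sketched as follows. This is a toy illustration: `partition_mining` and the brute-force `local_frequent` helper are my own names, the helper stands in for a real Apriori pass on each partition, and the proportionally scaled local threshold is the standard Partition-algorithm choice.

```python
from itertools import combinations

def local_frequent(transactions, min_sup):
    """Brute-force local miner for one partition (stand-in for Apriori;
    fine for tiny partitions, exponential in transaction width)."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                s = frozenset(combo)
                counts[s] = counts.get(s, 0) + 1
    return {s for s, c in counts.items() if c >= min_sup}

def partition_mining(transactions, min_sup_ratio, n_parts):
    """Two-phase Partition algorithm over a list of item sets.
    Phase 1 (scan 1): locally frequent itemsets of each partition, at a
    proportionally scaled threshold, are unioned into the candidate set.
    Phase 2 (scan 2): one pass over the whole DB counts every candidate."""
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]

    candidates = set()
    for part in parts:
        local_min = max(1, int(min_sup_ratio * len(part)))
        candidates |= local_frequent(part, local_min)

    global_min = min_sup_ratio * len(transactions)
    result = {}
    for c in candidates:
        sup = sum(1 for t in transactions if c <= t)
        if sup >= global_min:
            result[c] = sup
    return result
```

The correctness argument is the classic one: any itemset frequent in the whole DB must be frequent in at least one partition at the scaled threshold, so the phase-2 recount cannot miss it.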
14. Overcomes the memory problem for large databases.
The objective is to reduce the disk I/O overhead.
15. Interval Intersection
An interval such as [3, 6] defines a range between two real numbers, written [a, b].
Let x be any real number in this interval; then
a <= x <= b,
where a is the starting number and b is the ending number of the interval.
Intersection is an operation on two intervals, expressed mathematically as:
for intervals X = [Xa, Xb] and Y = [Ya, Yb], it is denoted by X ∩ Y and given by
{Z | Z ∈ X and Z ∈ Y} = [max(Xa, Ya), min(Xb, Yb)].
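The formula translates directly into code. A small sketch (the `intersect` name is mine; an empty intersection, where the computed start exceeds the end, is returned as `None`):

```python
def intersect(x, y):
    """Intersection of intervals x = (xa, xb) and y = (ya, yb):
    X ∩ Y = [max(xa, ya), min(xb, yb)], or None when empty."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None
```

For example, `intersect((1, 2), (2, 3))` gives `(2, 2)`, while `intersect((1, 1), (3, 4))` gives `None`.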
16. Minimum memory is used.
The least time is consumed for calculating the support counts.
Using this technique, only two scans of the dataset are required, which reduces the number of scans and makes the process faster.
18. • The negative border is used to store those item sets whose support count is less than the minimum support count.
19. Input: Dataset D and Gmin_sup.
Output: Frequent item set list. // Stored in FIL
SCL = NULL // initialize the support count list
N_Border = NULL // initialize the negative border
1. P = Partition(Dataset D) // partition dataset D into N parts
2. for each partition P, 1 to N
3. repeat until no further item sets are found, i.e. FILk = ϕ // FIL: frequent item set list
4. for i = 1 to k // k-length item sets
   for i = 1:
     scan the dataset and store the support counts in SCL, i.e.
     SCL = SCL ∪ Sup_count(Itemi)
     while (SCL != empty)
     {
       if (min_sup > SC(Itemi))
         N_Border = N_Border ∪ {Itemi}
       SCL = SCL - {Itemi}
     }
   for k >= 2:
     Interval_Intersection(interval set A, interval set B, list FIL)
     return FIL
5. Merge(FIL1, FIL2, ..., FILN)
   return (FIL)
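The i = 1 pass of this pseudocode — a single scan that builds each item's interval-set representation (runs of consecutive TIDs containing the item), records support counts, and splits items between the frequent list and the negative border — might look like this. The `first_pass` name and return shape are my own, not the paper's:

```python
def first_pass(transactions, min_sup):
    """One scan of a partition. Builds each item's interval set (maximal
    runs [start_tid, end_tid] of consecutive transactions containing the
    item), then splits items into FIL and N_Border by support count."""
    intervals = {}
    for tid, t in enumerate(transactions, start=1):
        for item in t:
            runs = intervals.setdefault(item, [])
            if runs and runs[-1][1] == tid - 1:
                runs[-1][1] = tid          # extend the current run
            else:
                runs.append([tid, tid])    # start a new run
    # Support count of an interval set: sum of (end - start + 1).
    scl = {item: sum(e - s + 1 for s, e in runs)
           for item, runs in intervals.items()}
    fil = {item: runs for item, runs in intervals.items()
           if scl[item] >= min_sup}
    n_border = {item: runs for item, runs in intervals.items()
                if scl[item] < min_sup}
    return fil, n_border, scl
```

On the four-transaction example used later in these slides, this produces interval sets such as A = [1,2], [4,4] and F = [1,1], [3,4], with D (support 1) landing in the negative border.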
20. SCL = null; N_Border = null
→ Partition dataset D into N parts
→ For each partition, repeat until no further item sets are found:
  for i = 1: scan the dataset and store the support counts in SCL, i.e. SCL = SCL ∪ Sup_count(Itemi);
  while (SCL != empty): if min_sup > SC(Itemi) (yes), then N_Border = N_Border ∪ {Itemi} and SCL = SCL - {Itemi}
  for k >= 2: Interval_Intersection(interval set A, interval set B, list FIL)
→ Merge(FIL1, FIL2, ..., FILN)
→ Return (FIL)
21. Input: Results of all the partitions. // FIL1, FIL2, ..., FILN
Output: List of frequent item sets. // Final results in FIL
FIL = NULL // initialize the frequent item set list
if (Itemi ∈ FIL)
{
  SC(Itemi) = SC(Itemi).FIL + SC(Itemi).FILl // support counts are added
}
else
{
  FIL = FIL ∪ {Itemi} // item is inserted into FIL
  while (FIL != empty)
  {
    if (SC(Itemi) > GMin_Sup)
      continue;
    else if (Itemi ∈ N_Border)
    {
      SC(Itemi) = SC(Itemi).FILl + SC(Itemi).N_Border
      if (SC(Itemi) > GMin_Sup)
        continue;
      else
        FIL = FIL - {Itemi}; // item is removed from FIL
    }
  }
}
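A simplified version of this merge step, treating each partition's FIL and negative border as dicts from item set to support count (my representation, not the paper's):

```python
def merge(partition_fils, partition_borders, gmin_sup):
    """Phase-2 merge: sum each item set's local support counts across
    partitions; for item sets still below the global threshold, add any
    counts recorded in the negative borders before deciding to drop them."""
    fil = {}
    for part_fil in partition_fils:
        for itemset, sup in part_fil.items():
            fil[itemset] = fil.get(itemset, 0) + sup
    for itemset in list(fil):
        if fil[itemset] >= gmin_sup:
            continue
        # Borderline item set: add support from the negative borders.
        for border in partition_borders:
            fil[itemset] += border.get(itemset, 0)
        if fil[itemset] < gmin_sup:
            del fil[itemset]  # still infrequent globally
    return fil
```

For instance, with local counts {"A": 3, "B": 2} and {"A": 4}, borders {} and {"B": 1}, and a global threshold of 4, only A (total 7) survives; B totals 3 even after the border top-up and is dropped.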
22. FIL = NULL
item FIL
Support count is
added
Item inserted
While ( FIL != empty)
else
Item is removed
SC(item)
>
Gmin_sup
p
if
Item N_Bordercon
tinu
e
yes
no
SC=SC.FIL
+
SC.N_Bord
er
SC(item)
>
Gmin_sup
p
con
tinu
e
yes
no
yes
23. Tid ITEMS
1 A,C,D,F
2 A,B,E
3 B,F
4 A,C,E,F
Support values: A = 3, B = 2, C = 2, D = 1, E = 2, F = 3
1-itemsets in interval-set representation:
A = [1,2], [4,4]
B = [2,3]
C = [1,1], [4,4]
E = [2,2], [4,4]
F = [1,1], [3,4]
2-itemsets in interval-set representation:
AB = [2,2]
AC = [1,1], [4,4]
AE = [2,2], [4,4]
AF = [1,1], [4,4]
BC = []
BE = [2,2]
BF = [3,3]
CE = [4,4]
CF = [1,1], [4,4]
EF = [4,4]
Support count = Σ (end - start + 1)
Negative border: D, AB, BE, BF, CE, EF, ACE, AEF
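Intersecting two interval sets, as in the 2-itemset table above, can be done with a linear merge over the sorted runs. A sketch with my own function names:

```python
def intersect_sets(xs, ys):
    """Intersect two interval sets (sorted, disjoint runs of TIDs) by
    pairwise interval intersection, merge-style in O(len(xs) + len(ys))."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo = max(xs[i][0], ys[j][0])
        hi = min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out

def support(runs):
    """Support count of an interval set: sum of (end - start + 1)."""
    return sum(e - s + 1 for s, e in runs)
```

This reproduces the table's values, e.g. A ∩ B = [2,2], A ∩ F = [1,1], [4,4] (support 2), and B ∩ C = [] for the empty case.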
24. Parameters      | Apriori Algorithm                     | PFIMII Algorithm
Complexity          | More complex due to the many scans    | Less complex due to only two scans
Number of scans     | Three scans of the dataset            | Two scans of the dataset
Execution time      | More time consuming                   | Less time consuming
Results             | Same results as the PFIMII algorithm  | Same results as the Apriori algorithm
28. • The proposed PFIMII algorithm creates many partitions of the dataset and performs the task of finding frequent item sets in parallel on each partition.
• Many previous algorithms make multiple scans of the dataset to determine the support counts and frequent item sets, which makes the process time consuming and inefficient. PFIMII takes only two scans of the dataset, making the task less complex and more efficient.
• It needs less access to the disk-resident database.
• The algorithm finds frequent item sets in parallel on the various partitions of the dataset, which makes it faster.
29. • Yungho Leu, Vania Utami, "A new frequent item set mining algorithm based on interval intersection", in Proceedings of the International Conference on Machine Learning and Cybernetics, Guangzhou, 12-15 April 2015.
• Agrawal, R.; Imielinski, T.; Swami, A., "Mining Association Rules between Sets of Items in Large Databases", ACM SIGMOD Conference, Washington DC, USA, 1993.
• Amit Siwach, Neelam Duhan, Parul Tomar, "PFIMII: Parallel Frequent Itemset Mining using Interval Intersection", Data Mining.
• Jiawei Han and Micheline Kamber, "Frequent item set mining methods", in Data Mining: Concepts and Techniques.
• Moore, R. E., R. Baker Kearfott and M. J. Cloud, "Introduction to Interval Analysis", SIAM, 2009.