Horizontal format data mining with extended bitmaps

Question
• Is it possible to combine the benefits of
vertical data formats with the efficiency
of bitmap operations to mine association
rules in a distributed environment?

Association Rule Mining
• Finding interesting relationships
between variables.
• Finding the subsets of items that occur in
at least a chosen minimum number of
transactions (the minimum support).
• A form of pattern recognition.
• Commonly explained through market basket analysis.

Sample (Toy ) Data
Set
TID Item ID’s

T100 I1, I2, I5

T200 I2, I4

T300 I1, I2

T400 I2, I5

Apriori
• Fundamental algorithm for association
rule mining.
• Mines frequent patterns from a horizontal
data format, in which items are grouped
by transaction.
• The i-th stage identifies all frequent
i-item sets.
• Each stage has two steps:
  > Candidate generation.
  > Candidate counting.
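
A minimal sketch of the two Apriori steps on the toy data set, assuming a minimum support of 2 (the value used later in these slides). The function name `apriori` and the dictionary layout are illustrative choices, not taken from the slides.

```python
from itertools import combinations

# Toy data set from the slides (horizontal format: items grouped by transaction).
TRANSACTIONS = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I1", "I2"},
    "T400": {"I2", "I5"},
}


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    items = sorted({item for t in transactions.values() for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Candidate counting: one full scan of the data set per stage.
        counts = {
            c: sum(1 for t in transactions.values() if c <= t)
            for c in candidates
        }
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-sets into (k+1)-sets and keep
        # only those whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


frequent = apriori(TRANSACTIONS, min_support=2)
```

Note that the counting step rescans every transaction at each stage, which is the cost the vertical format aims to avoid.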

Vertical Form
• Items are mapped to the transactions that contain them.
• Vertical format mining only has to parse
the data set once to get the 1-itemsets.
• To generate itemsets from the 2-itemsets
onward, it only needs to refer to the
itemsets of the previous round.
• This eliminates re-scanning the data set in
each round to count itemset frequencies.
• Typically more efficient than mining the horizontal form.
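
A small sketch of the vertical idea, assuming each item is stored with the set of transaction IDs that contain it. The helper names `to_vertical` and `support` are hypothetical.

```python
from collections import defaultdict

# Toy data set from the slides, in horizontal format.
HORIZONTAL = [
    ("T100", ["I1", "I2", "I5"]),
    ("T200", ["I2", "I4"]),
    ("T300", ["I1", "I2"]),
    ("T400", ["I2", "I5"]),
]


def to_vertical(horizontal):
    """Single pass over the data set: map each item to its transaction IDs."""
    vertical = defaultdict(set)
    for tid, items in horizontal:
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


def support(vertical, itemset):
    """Support of an itemset = size of the intersection of its TID sets."""
    return len(set.intersection(*(vertical[item] for item in itemset)))


vertical = to_vertical(HORIZONTAL)
support_i1_i2 = support(vertical, ["I1", "I2"])
```

After the single pass in `to_vertical`, every support count is an intersection of stored TID sets, so the original data set is not read again.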

Bitmaps
• Compactly store individual bits.
• Exploit bit-level parallelism effectively.
• Each bit is 0 or 1; 1 indicates that the item is present.
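
To illustrate bit-level parallelism, the sketch below stores one bit per transaction in a plain Python integer, so a single AND compares all transactions at once. The bit order (T100 as the lowest bit) and the helper names are assumptions made for this example.

```python
# Bit position of each transaction: T100 is bit 0, T400 is bit 3.
TIDS = ["T100", "T200", "T300", "T400"]


def to_bitmap(present):
    """Set bit i when the i-th transaction contains the item."""
    bits = 0
    for position, tid in enumerate(TIDS):
        if tid in present:
            bits |= 1 << position
    return bits


def popcount(bits):
    """Number of 1 bits, i.e. number of transactions."""
    return bin(bits).count("1")


I1 = to_bitmap({"T100", "T300"})
I2 = to_bitmap({"T100", "T200", "T300", "T400"})
I5 = to_bitmap({"T100", "T400"})

# One AND per pair of items yields the transactions containing both.
support_i1_i2 = popcount(I1 & I2)
support_i2_i5 = popcount(I2 & I5)
```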

Combined?
• The algorithm takes a horizontal data set.
• With one pass over the data set, it
constructs a bitmap-based data
structure.
• This structure is in vertical format.
• The structure facilitates efficient mining
of association rules.

Sample (Toy ) Data
Set
TID Item ID’s

T100 I1, I2, I5

Horizontal
Format T200 I2, I4

T300 I1, I2

T400 I2, I5

[Figure: animation that builds the data structure from the toy data set, one transaction at a time. Labels shown: Ordered Item Array, Master Array, Associated Items, Bitmap, Final Data Structure.]

Final Data Structure

Master Array   Count   Associated Items   Bitmap rows
I1             2       I2, I5             1 1 / 1 0
I2             4       I5, I4             1 0 / 0 1 / 1 0
I4             1       -                  -
I5             2       -                  -
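
The construction walked through in the slides above can be sketched as follows. This is one reading of the animation, not the original implementation: it assumes items within a transaction are sorted lexicographically, that each item keeps bitmap columns only for the items that follow it in a transaction, and that a transaction with no following items adds to the count but adds no bitmap row. The names `build` and `bitmap_rows` are hypothetical.

```python
# Toy data set from the slides, in transaction order T100..T400.
TRANSACTIONS = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I1", "I2"],
    ["I2", "I5"],
]


def build(transactions):
    """One pass over the data set; returns the master array as a dict."""
    master = {}
    for items in transactions:
        ordered = sorted(items)  # the ordered item array for this transaction
        for position, item in enumerate(ordered):
            entry = master.setdefault(
                item, {"count": 0, "associated": [], "rows": []}
            )
            entry["count"] += 1
            later = ordered[position + 1:]
            if not later:
                continue
            row = 0
            for other in later:
                if other not in entry["associated"]:
                    entry["associated"].append(other)  # new bitmap column
                row |= 1 << entry["associated"].index(other)
            entry["rows"].append(row)
    return master


def bitmap_rows(entry):
    """Expand each integer row into a list of 0/1 values, one per column."""
    width = len(entry["associated"])
    return [[(row >> col) & 1 for col in range(width)] for row in entry["rows"]]


master = build(TRANSACTIONS)
```

Storing each row as an integer means older rows implicitly hold 0 in any column added later, so no row has to be rewritten when a new associated item appears.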

Counting Frequent Item Sets

Minimum Support = 2

No. of Items   Frequent Item Sets
1              I1, I2, I5
2              I1-I2, I2-I5
3              -

[Figure: the final data structure shown beside the table; in the last frame an extra column is appended to each bitmap row.]
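
A hedged sketch of how the frequent item sets in the table can be read off the final data structure. The structure is hard-coded here as it appears on the final slide; the brute-force enumeration of column combinations is an assumption chosen for brevity and does no pruning.

```python
from itertools import combinations

# Final data structure from the slides: count, associated items, bitmap rows.
STRUCTURE = {
    "I1": {"count": 2, "associated": ["I2", "I5"], "rows": [[1, 1], [1, 0]]},
    "I2": {"count": 4, "associated": ["I5", "I4"],
           "rows": [[1, 0], [0, 1], [1, 0]]},
    "I4": {"count": 1, "associated": [], "rows": []},
    "I5": {"count": 2, "associated": [], "rows": []},
}


def frequent_itemsets(structure, min_support):
    """Return {tuple(itemset): support}, mining each master entry on its own."""
    result = {}
    for head, entry in structure.items():
        if entry["count"] < min_support:
            continue
        result[(head,)] = entry["count"]
        columns = entry["associated"]
        for size in range(1, len(columns) + 1):
            for combo in combinations(range(len(columns)), size):
                # AND the chosen columns row by row, then count the 1s.
                support = sum(
                    1 for row in entry["rows"] if all(row[c] for c in combo)
                )
                if support >= min_support:
                    itemset = (head,) + tuple(columns[c] for c in combo)
                    result[itemset] = support
    return result


frequent = frequent_itemsets(STRUCTURE, min_support=2)
```

With a minimum support of 2 this yields I1, I2, I5, I1-I2 and I2-I5, and no 3-item set, matching the table.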

Insights
• The algorithm performs better than
Apriori in most scenarios.
• Data structure generation dominates
the total time in most cases.
• As an aside: can this be made into a
distributed mining algorithm?

Turns out this can be done rather easily.
The algorithm lends itself to MapReduce-like
distributed processing.
Each master array index is self-contained,
so the indexes can be mined in parallel.
[Figure: the I1 master array entry on its own (count 2, associated items I2 and I5, bitmap rows 1 1 and 1 0).]
Data structure generation -> Map phase
Result accumulation -> Reduce phase
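
One possible reading of this mapping, sketched in a single process. The slides do not show the distributed design, so the split into `map_phase`, `shuffle` and `reduce_phase` is an assumption, and tuples stand in for bitmap rows to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

# Toy data set from the slides, in transaction order T100..T400.
TRANSACTIONS = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I1", "I2"],
    ["I2", "I5"],
]


def map_phase(partition):
    """Emit (head item, later items): raw material for one master entry."""
    pairs = []
    for items in partition:
        ordered = sorted(items)
        for position, head in enumerate(ordered):
            pairs.append((head, tuple(ordered[position + 1:])))
    return pairs


def shuffle(mapped):
    """Group the emitted rows by master array index."""
    groups = defaultdict(list)
    for pairs in mapped:
        for head, later in pairs:
            groups[head].append(later)
    return groups


def reduce_phase(head, rows, min_support):
    """Mine one master array entry; no other entry is consulted."""
    found = {}
    if len(rows) < min_support:
        return found
    found[(head,)] = len(rows)
    associated = sorted({item for later in rows for item in later})
    for size in range(1, len(associated) + 1):
        for combo in combinations(associated, size):
            support = sum(1 for later in rows if set(combo) <= set(later))
            if support >= min_support:
                found[(head,) + combo] = support
    return found


def mine(transactions, min_support, partitions=2):
    chunks = [transactions[i::partitions] for i in range(partitions)]
    mapped = [map_phase(chunk) for chunk in chunks]  # one worker per chunk
    result = {}
    for head, rows in shuffle(mapped).items():  # each head is independent
        result.update(reduce_phase(head, rows, min_support))
    return result


frequent = mine(TRANSACTIONS, min_support=2)
```

Because each call to `reduce_phase` touches only one master array index, the loop in `mine` could be handed to separate workers without any shared state.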

What Does the Future Hold?
• Make this distributed.
• Java is not the best option. Use C so
that memory allocation can be
controlled directly.
• Experiment with bitmap compression
techniques.
