Horizontal format data mining with extended bitmaps
Published in: Technology
Transcript

  • 1. Question?
      • Is it possible to leverage the benefits of vertical data formats, combined with the efficiency of bitmap operations, to mine association rules in a distributed environment?
  • 2. Association Rule Mining?
      • Finding interesting relationships between variables.
      • Finding the subsets of items that occur together in at least a chosen minimum number of the itemsets in a set of itemsets.
      • Pattern recognition.
      • Explained by market basket analysis.
  • 3. Sample (Toy ) Data SetTID Item ID’sT100 I1, I2, I5T200 I2, I4T300 I1, I2T400 I2, I5
  • 4. Apriori
      • Fundamental algorithm for association rule mining.
      • Mines frequent patterns from a horizontal data format, in which items are grouped by transaction.
      • The i-th stage identifies all frequent i-element sets.
      • Two steps:
          > Candidate generation.
          > Candidate counting.
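The two Apriori steps named on this slide can be sketched as follows. This is a minimal illustration on the toy data set from slide 3, not the implementation measured in the deck; the names `TRANSACTIONS` and `apriori` are assumptions introduced for the example.

```python
from itertools import combinations

# Toy data set from slide 3 (horizontal format: transaction -> items).
TRANSACTIONS = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I1", "I2"},
    "T400": {"I2", "I5"},
}


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    items = sorted({item for t in transactions.values() for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Candidate counting: one full pass over the data set per stage.
        counts = {
            c: sum(1 for t in transactions.values() if c <= t)
            for c in candidates
        }
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-sets into (k+1)-sets and
        # drop any candidate that has an infrequent k-element subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


result = apriori(TRANSACTIONS, min_support=2)
```

With a minimum support of 2, `result` holds the three frequent single items and the two frequent pairs. Note that the counting step rescans every transaction at each stage, which is the cost the vertical format aims to avoid.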
  • 5. Vertical Form
      • Transactions are grouped by item.
      • Vertical format data mining only has to parse the data set once to get the itemsets.
      • From the 2-itemsets onward, itemset generation only needs to refer to the previous itemsets.
      • Eliminates parsing through the data set in each round to count the frequency of itemsets.
      • More efficient than the horizontal form.
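A small sketch of the vertical form described on this slide, assuming a plain dictionary that maps each item to the set of transaction IDs containing it. The names `to_vertical` and `vertical` are hypothetical; the deck does not show its own representation at this point.

```python
from collections import defaultdict

# Toy data set from slide 3 (horizontal format).
TRANSACTIONS = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I1", "I2"],
    "T400": ["I2", "I5"],
}


def to_vertical(transactions):
    """Single pass over the data set: item -> set of transaction IDs."""
    vertical = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)


vertical = to_vertical(TRANSACTIONS)

# The support of a 2-itemset is the size of an intersection of two
# 1-itemset entries; the original data set is not read again.
support_i1_i2 = len(vertical["I1"] & vertical["I2"])
support_i1_i5 = len(vertical["I1"] & vertical["I5"])
```

Here `support_i1_i2` is 2 (T100 and T300) and `support_i1_i5` is 1 (T100 only).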
  • 6. Bitmaps
      • Compactly store individual bits.
      • Exploit bit-level parallelism effectively.
      • 0's and 1's.
      • 1 indicates existence.
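To illustrate the bitmap idea, the sketch below stores one bitmap per item as a Python integer, where bit k is 1 if the k-th transaction of the toy data set contains the item. The bit layout and the helper `support` are assumptions made for this example, not the layout used in the deck.

```python
# Bit positions: bit 0 = T100, bit 1 = T200, bit 2 = T300, bit 3 = T400.
I1 = 0b0101  # present in T100 and T300
I2 = 0b1111  # present in all four transactions
I4 = 0b0010  # present in T200
I5 = 0b1001  # present in T100 and T400


def support(bitmap):
    """Number of set bits, i.e. number of transactions marked with 1."""
    return bin(bitmap).count("1")


# A single AND compares all transactions at once (bit-level parallelism).
both_i1_i2 = I1 & I2
both_i1_i5 = I1 & I5
```

`support(both_i1_i2)` is 2 and `support(both_i1_i5)` is 1, matching a manual count over the toy data set.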
  • 7. Combined?
      • The algorithm takes a horizontal data set.
      • In one pass over the data set, it constructs a bitmap-based data structure.
      • This structure is in vertical format.
      • The structure facilitates efficient mining of association rules.
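The one-pass construction can be sketched as below. This is a reconstruction inferred from the diagram slides that follow, so treat the layout as an assumption: each item gets a count, a list of associated items (items that follow it in item order within some transaction), and one bitmap row per transaction containing the item. The name `build_structure` and the dictionary keys are hypothetical.

```python
# Toy data set from slide 3, as a list of transactions.
TRANSACTIONS = [
    ["I1", "I2", "I5"],  # T100
    ["I2", "I4"],        # T200
    ["I1", "I2"],        # T300
    ["I2", "I5"],        # T400
]


def build_structure(transactions):
    """Build the per-item structure in a single pass over the data set."""
    master = {}
    for items in transactions:
        ordered = sorted(items)
        for pos, item in enumerate(ordered):
            entry = master.setdefault(
                item, {"count": 0, "assoc": [], "rows": []}
            )
            entry["count"] += 1
            row = 0
            for other in ordered[pos + 1:]:
                # Give each newly seen associated item the next bit position.
                if other not in entry["assoc"]:
                    entry["assoc"].append(other)
                row |= 1 << entry["assoc"].index(other)
            # One bitmap row per transaction; bit j refers to assoc[j].
            entry["rows"].append(row)
    return master


master = build_structure(TRANSACTIONS)
```

For the toy data set this gives I1 a count of 2 with associated items I2 and I5, and I2 a count of 4 with associated items I5 and I4, which agrees with the counts visible in the diagram slides. Rows created before a new associated item appears simply have a 0 in that item's bit.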
  • 8. Sample (Toy ) Data SetTID Item ID’sT100 I1, I2, I5T200 I2, I4T300 I1, I2T400 I2, I5
  • 9. Sample (Toy ) Data Set TID Item ID’s T100 I1, I2, I5HorizontalFormat T200 I2, I4 T300 I1, I2 T400 I2, I5
  • 10-20. [Diagram sequence: the toy data set is processed one transaction at a time to build the data structure. Labels shown: Ordered Item Array (I1, I2, I4, I5), Master Array, Associated Items, Bitmap, Final Data Structure. Item counts in the final structure: I1 = 2, I2 = 4, I4 = 1, I5 = 2.]
  • 21-22. Counting Frequent Item Sets (Minimum Support = 2)

        No. of Items   Frequent Item Sets
        1              I1, I2, I5
        2              I1-I2, I2-I5
        3              -

      [Diagram: the final data structure is shown next to the table.]
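The counting step can be sketched on a hand-written copy of the final structure. The layout of `MASTER` is an assumption reconstructed from the diagram (count, associated items, and one bitmap row per transaction, with bit j standing for the j-th associated item), and `mine_item` is a hypothetical helper, not the deck's code.

```python
from itertools import combinations

# Assumed final structure for the toy data set.
MASTER = {
    "I1": {"count": 2, "assoc": ["I2", "I5"], "rows": [0b11, 0b01]},
    "I2": {"count": 4, "assoc": ["I5", "I4"], "rows": [0b01, 0b10, 0b00, 0b01]},
    "I4": {"count": 1, "assoc": [], "rows": [0b0]},
    "I5": {"count": 2, "assoc": [], "rows": [0b0, 0b0]},
}


def mine_item(item, entry, min_support):
    """Frequent itemsets that start with `item`, using only its own entry."""
    found = {}
    if entry["count"] >= min_support:
        found[(item,)] = entry["count"]
    n = len(entry["assoc"])
    for size in range(1, n + 1):
        for combo in combinations(range(n), size):
            mask = sum(1 << j for j in combo)
            # A row supports the itemset if every bit of the mask is set.
            support = sum(1 for row in entry["rows"] if (row & mask) == mask)
            if support >= min_support:
                names = tuple(entry["assoc"][j] for j in combo)
                found[(item,) + names] = support
    return found


frequent = {}
for item, entry in MASTER.items():
    frequent.update(mine_item(item, entry, min_support=2))
```

With a minimum support of 2 this yields I1, I2, I5 as frequent single items and I1-I2, I2-I5 as frequent pairs, with no frequent 3-itemsets, which matches the table on the slide. The sketch enumerates every combination of associated items, so it is only meant for small examples.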
  • 23. Results
  • 24. Insights
      • The algorithm performs better than Apriori in most scenarios.
      • Data structure generation dominates the total time in most cases.
      • As an aside…
      • Can this be made into a distributed mining algorithm?
  • 25. Turns out this can be done rather easily.
      • The algorithm lends itself to MapReduce-like distributed processing.
      • Each master array index is self-contained, so the indexes can be mined in parallel.
      • [Diagram: the master array entry for I1, with associated items I2 and I5, count 2, and bitmap rows 1 1 and 1 0.]
      • Data structure generation -> Map phase
      • Result accumulation -> Reduce phase
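A rough, single-machine sketch of the map and reduce split described on this slide. It is a stand-in under stated assumptions: plain tuples replace the bitmap rows, a thread pool stands in for distributed workers, and the names `map_transaction`, `reduce_item`, and `mine` are hypothetical. The point it illustrates is that each item's group is self-contained, so groups can be processed independently.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Toy data set from slide 3.
TRANSACTIONS = [
    ["I1", "I2", "I5"],
    ["I2", "I4"],
    ["I1", "I2"],
    ["I2", "I5"],
]


def map_transaction(items):
    """Map: emit (item, items that follow it) for each item in a transaction."""
    ordered = sorted(items)
    return [(item, tuple(ordered[pos + 1:])) for pos, item in enumerate(ordered)]


def reduce_item(item, tails, min_support):
    """Reduce: count itemsets starting with `item` from its own group only."""
    counts = Counter()
    for tail in tails:
        for size in range(len(tail) + 1):
            for combo in combinations(tail, size):
                counts[(item,) + combo] += 1
    return {s: n for s, n in counts.items() if n >= min_support}


def mine(transactions, min_support):
    # Map phase plus grouping by key (one group per master array index).
    groups = defaultdict(list)
    for items in transactions:
        for item, tail in map_transaction(items):
            groups[item].append(tail)
    # Reduce phase: groups share no state, so they can run in parallel.
    result = {}
    with ThreadPoolExecutor() as pool:
        parts = pool.map(
            lambda kv: reduce_item(kv[0], kv[1], min_support), groups.items()
        )
        for part in parts:
            result.update(part)
    return result


frequent = mine(TRANSACTIONS, min_support=2)
```

Because every itemset is keyed by its first item in sorted order, no two groups can produce the same itemset, so merging the partial results needs no conflict handling.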
  • 26. What Does the Future Hold?
      • Make this distributed.
      • Java is not the best option. Use C so that memory allocation can be controlled the way we want.
      • Experiment with bitmap compression techniques.
  • 27. Summary
