Question?
• Is it possible to combine the benefits of
  vertical data formats with the efficiency
  of bitmap operations to mine association
  rules in a distributed environment?
Association Rule Mining
• Finding interesting relationships
  between variables.
• Finding the item subsets that are common
  to at least a chosen minimum number of
  itemsets (transactions) in the data set.
• A form of pattern recognition.
• Commonly explained through market basket analysis.
Sample (Toy ) Data
         Set
TID        Item ID’s

T100       I1, I2, I5


T200       I2, I4


T300       I1, I2


T400       I2, I5
Apriori
• The fundamental algorithm for association
  rule mining.
• Mines frequent patterns from a horizontal
  data format, in which each transaction
  lists the items it contains.
• The i-th stage identifies all frequent
  i-element sets.
• Each stage has two steps:
  > Candidate generation.
  > Candidate counting.
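
The two steps can be sketched in Python on the toy data set. This is a minimal illustration rather than the implementation behind the slides; the names `TRANSACTIONS`, `apriori` and `min_support` are assumptions made for this example.

```python
from itertools import combinations

# Toy data set from the slides, in horizontal format (TID -> items).
TRANSACTIONS = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I1", "I2"},
    "T400": {"I2", "I5"},
}


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    items = sorted({i for t in transactions.values() for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Candidate counting: one full scan of the data set per stage.
        counts = {c: 0 for c in candidates}
        for t in transactions.values():
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        # and keep only those whose k-element subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


result = apriori(TRANSACTIONS, min_support=2)
```

Note that the data set is scanned once per stage, which is the cost the vertical format tries to avoid.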
Vertical Form
• Each item lists the transactions that contain it.
• Vertical format data mining only has to parse
  the dataset once to get the itemsets.
• From the 2nd round of itemset generation
  onward, it only needs to refer to the itemsets
  of the previous round.
• This eliminates parsing through the dataset in
  every round to count the frequency of the
  itemsets.
• More efficient than the horizontal form.
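
As a rough sketch of the idea, the horizontal data can be turned into per-item TID sets in one pass, after which larger itemsets are counted by intersecting those sets. The function names `to_vertical` and `mine_vertical` are made up for this example and are not taken from the slides.

```python
# Toy data set from the slides, in horizontal format (TID -> items).
TRANSACTIONS = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I1", "I2"},
    "T400": {"I2", "I5"},
}


def to_vertical(transactions):
    """Single pass over the horizontal data: item -> set of TIDs."""
    vertical = {}
    for tid, items in transactions.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def mine_vertical(vertical, min_support):
    """Frequent itemsets found by intersecting TID sets, round by round."""
    level = {
        (item,): tids
        for item, tids in sorted(vertical.items())
        if len(tids) >= min_support
    }
    frequent = {}
    while level:
        frequent.update({k: len(v) for k, v in level.items()})
        next_level = {}
        keys = list(level)
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                # Join two itemsets that share everything but the last item.
                if a[:-1] == b[:-1]:
                    tids = level[a] & level[b]
                    if len(tids) >= min_support:
                        next_level[a + (b[-1],)] = tids
        level = next_level  # only the previous round is consulted
    return frequent


vertical = to_vertical(TRANSACTIONS)
frequent = mine_vertical(vertical, min_support=2)
```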
Bitmaps
• Compactly store individual bits.
• Exploit bit-level parallelism effectively.
• Each bit is 0 or 1.
• A 1 indicates that the item is present.
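
A small sketch of the bitmap idea, assuming one bit per transaction and using a Python integer as the bit container (the slides do not prescribe a representation; `TIDS` and `make_bitmap` are names invented here). A single bitwise AND then compares all transactions at once.

```python
# Bit position 0 is T100, position 1 is T200, and so on.
TIDS = ["T100", "T200", "T300", "T400"]


def make_bitmap(present_tids):
    """Set bit i when the item occurs in TIDS[i]."""
    bits = 0
    for pos, tid in enumerate(TIDS):
        if tid in present_tids:
            bits |= 1 << pos
    return bits


i1 = make_bitmap({"T100", "T300"})
i2 = make_bitmap({"T100", "T200", "T300", "T400"})
i5 = make_bitmap({"T100", "T400"})

# One AND covers every transaction; counting the 1 bits gives the support.
support_i1_i2 = bin(i1 & i2).count("1")
support_i2_i5 = bin(i2 & i5).count("1")
support_i1_i5 = bin(i1 & i5).count("1")
```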
Combined?
• The algorithm takes a horizontal data set.
• In a single pass over the data set it
  constructs a bitmap-based data
  structure.
• This structure is in vertical format.
• The structure facilitates efficient mining
  of association rules.
Sample (Toy ) Data
         Set
TID        Item ID’s

T100       I1, I2, I5


T200       I2, I4


T300       I1, I2


T400       I2, I5
Sample (Toy ) Data
                     Set
                 TID         Item ID’s

                 T100        I1, I2, I5

Horizontal
Format           T200        I2, I4


                 T300        I1, I2


                 T400        I2, I5
Building the Data Structure
• The items form an Ordered Item Array: I1, I2, I4, I5.
  This is the Master Array.
• Each Master Array entry holds the item's count, its
  Associated Items (the items that follow it in a
  transaction) and a Bitmap with one column per
  associated item.
• Each transaction that contains the item together with
  at least one later item adds one row to the bitmap.

After T100 (I1, I2, I5)
Item   Count   Associated Items   Bitmap
I1     1       I2, I5             [1 1]
I2     1       I5                 [1]
I4     -       -                  -
I5     1       -                  -

After T200 (I2, I4)
Item   Count   Associated Items   Bitmap
I1     1       I2, I5             [1 1]
I2     2       I5, I4             [1 0] [0 1]
I4     1       -                  -
I5     1       -                  -

Final Data Structure, after T300 (I1, I2) and T400 (I2, I5)
Item   Count   Associated Items   Bitmap
I1     2       I2, I5             [1 1] [1 0]
I2     4       I5, I4             [1 0] [0 1] [1 0]
I4     1       -                  -
I5     2       -                  -
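
The construction can be sketched as a single pass over the transactions. This is one possible reading of the slides, not the original code: the dictionary layout, the use of an integer per bitmap row, and the names `build_structure` and `row_bits` are all assumptions made for this example.

```python
# Toy data set from the slides, in horizontal format.
TRANSACTIONS = [
    ("T100", ["I1", "I2", "I5"]),
    ("T200", ["I2", "I4"]),
    ("T300", ["I1", "I2"]),
    ("T400", ["I2", "I5"]),
]


def build_structure(transactions):
    """One pass over the horizontal data; returns the master array as a dict."""
    master = {}
    for _tid, items in transactions:
        items = sorted(items)
        for pos, item in enumerate(items):
            entry = master.setdefault(
                item, {"count": 0, "associated": [], "rows": []}
            )
            entry["count"] += 1
            later = items[pos + 1:]
            if not later:
                continue  # nothing follows this item, so no bitmap row
            row = 0
            for other in later:
                if other not in entry["associated"]:
                    entry["associated"].append(other)  # new bitmap column
                row |= 1 << entry["associated"].index(other)
            entry["rows"].append(row)
    return master


def row_bits(entry):
    """Expand each integer row into a list of 0/1, one per associated item."""
    width = len(entry["associated"])
    return [[(row >> col) & 1 for col in range(width)] for row in entry["rows"]]


structure = build_structure(TRANSACTIONS)
```

Storing each row as an integer means that a column added later (such as I4 under I2) is implicitly 0 in all earlier rows, so no rows need to be rewritten.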
Counting Frequent Item Sets
Minimum Support = 2

No. of Items   Frequent Item Sets
1              I1, I2, I5
2              I1-I2, I2-I5
3              -

• 1-item sets follow from the counts; 2-item sets follow
  from the column sums of each bitmap.
• For 3-item sets a further column is added to each
  bitmap: [1] [0] for I1 and [0] [0] [0] for I2, so no
  3-item set reaches the minimum support.
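
The counting step can be sketched directly on the final data structure. The version below is a brute-force illustration that tries every combination of bitmap columns for each entry; the slides do not say how the original implementation enumerates or prunes combinations, and `STRUCTURE` and `frequent_item_sets` are names chosen for this example.

```python
from itertools import combinations

# Final data structure from the slides:
# item -> (count, associated items, bitmap rows).
STRUCTURE = {
    "I1": (2, ["I2", "I5"], [[1, 1], [1, 0]]),
    "I2": (4, ["I5", "I4"], [[1, 0], [0, 1], [1, 0]]),
    "I4": (1, [], []),
    "I5": (2, [], []),
}


def frequent_item_sets(structure, min_support):
    """Return {item tuple: support} for every frequent item set."""
    result = {}
    for item, (count, associated, rows) in structure.items():
        if count < min_support:
            continue  # an infrequent item cannot start a frequent set
        result[(item,)] = count
        for size in range(1, len(associated) + 1):
            for cols in combinations(range(len(associated)), size):
                # AND the chosen columns row by row, then sum the 1s.
                support = sum(all(row[c] for c in cols) for row in rows)
                if support >= min_support:
                    key = (item,) + tuple(associated[c] for c in cols)
                    result[key] = support
    return result


found = frequent_item_sets(STRUCTURE, min_support=2)
```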
Results
Insights
• The algorithm performs better than
  Apriori in most scenarios.
• Data structure generation dominates
  the total time in most cases.
• As an aside: can this be turned into a
  distributed mining algorithm?
Turns out this can be done rather easily.
The algorithm lends itself to MapReduce-like
  distributed processing.
Each master array index is self-contained
  (for example I1: count 2, associated items
  I2, I5, bitmap [1 1] [1 0]),
  so the indexes can be mined in parallel.
Data structure generation -> Map phase
Result accumulation -> Reduce phase
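
To illustrate why self-contained indexes help, the sketch below mines each master array entry independently and then accumulates the partial results. It is only an analogy for the map and reduce phases: a local thread pool stands in for distributed workers, the structure is hard-coded instead of being generated in the map phase, only pairs are counted, and `mine_entry` and `mine_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# Final data structure from the slides:
# item -> (count, associated items, bitmap rows).
STRUCTURE = {
    "I1": (2, ["I2", "I5"], [[1, 1], [1, 0]]),
    "I2": (4, ["I5", "I4"], [[1, 0], [0, 1], [1, 0]]),
    "I4": (1, [], []),
    "I5": (2, [], []),
}
MIN_SUPPORT = 2


def mine_entry(entry):
    """Work for one master array index; needs no data from other indexes."""
    item, (count, associated, rows) = entry
    found = {}
    if count >= MIN_SUPPORT:
        found[(item,)] = count
        for col, other in enumerate(associated):
            support = sum(row[col] for row in rows)  # column sum
            if support >= MIN_SUPPORT:
                found[(item, other)] = support
    return found


def mine_parallel(structure, workers=2):
    """Mine every index concurrently, then accumulate the results."""
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(mine_entry, structure.items()):
            merged.update(partial)  # result accumulation
    return merged


merged = mine_parallel(STRUCTURE)
```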
What Does the Future Hold?
• Make this distributed.
• Java is not the best option. Use C so
  we can control memory allocation the
  way we want.
• Experiment with bitmap compression
  techniques.
Summary

Horizontal format data mining with extended bitmaps
