The Merits of Bitset Compression Techniques for Mining Association Rules from Big Data
Hamid Fadishei*, Sahar Doustian, Parisa Saadati
University of Bojnord, Iran
*fadishei@ub.ac.ir
TOPHPC 2017 24-26 APRIL, TEHRAN, IRAN 1
Outline
TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 2
◦ Big Data Analytics relies on Set Math
◦ Set Math is accelerated by Compressed Bitsets
◦ Compressed Bitsets are implemented by Some Algorithm, Another Algorithm, Yet Another Algorithm, ...
◦ Comparative Study of these algorithms (focus of this paper!)
Frequent Pattern Mining
Seeking Frequent Patterns in Big Data
The problem of finding the itemsets that occur in more transactions than a predefined “support” threshold [2]
Transaction   Basket Items
1             Bread, Beer, Diaper
2             Beer, Milk, Diaper
3             Beer, Diaper, Milk, Nuts
4             Bread, Milk, Diaper
5             Beer, Diaper
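To make the problem statement concrete, the following minimal Python sketch counts the support of an itemset over the five example baskets and enumerates the frequent itemsets by brute force. The names `TRANSACTIONS`, `support`, and `frequent_itemsets` are illustrative and do not come from the slides, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from itertools import combinations

# The five example baskets, keyed by transaction id.
TRANSACTIONS = {
    1: {"Bread", "Beer", "Diaper"},
    2: {"Beer", "Milk", "Diaper"},
    3: {"Beer", "Diaper", "Milk", "Nuts"},
    4: {"Bread", "Milk", "Diaper"},
    5: {"Beer", "Diaper"},
}


def support(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in transactions.values() if wanted <= basket)


def frequent_itemsets(min_support, transactions=TRANSACTIONS):
    """Brute-force enumeration, for illustration only (exponential in item count)."""
    items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = support(candidate, transactions)
            if count >= min_support:  # inclusive threshold: an assumption
                result[candidate] = count
    return result
```

Brute force is only workable on toy data like this; the algorithms on the following slides exist to avoid enumerating every candidate.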
Algorithms for Frequent Pattern Mining
Many algorithms exist; some of the best known are:
◦ Apriori [3]: scans the data multiple times to generate the itemsets of length 1, then 2, then 3, and so on
◦ FPGrowth [4]: constructs a tree and recursively extracts patterns from it
◦ ECLAT [5]: utilizes a vertical representation of the dataset
The present study uses ECLAT.
Vertical Representation of Data
Horizontal form (one row per transaction):
Transaction   Items
1             Bread, Beer, Diaper
2             Beer, Milk, Diaper
3             Beer, Diaper, Milk, Nuts
4             Bread, Milk, Diaper
5             Beer, Diaper
Vertical form (one row per item):
Item     Transactions
Bread    1, 4
Beer     1, 2, 3, 5
Milk     2, 3, 4
Diaper   1, 2, 3, 4, 5
Nuts     3
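The conversion from the horizontal to the vertical form can be sketched in a few lines of Python. The function name `to_vertical` and the dictionary layout are illustrative assumptions, not part of the original study.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Map each item to the sorted list of transaction ids that contain it."""
    tidlists = defaultdict(list)
    for tid in sorted(transactions):        # visit ids in order so lists stay sorted
        for item in transactions[tid]:
            tidlists[item].append(tid)
    return dict(tidlists)


# Horizontal form: transaction id -> items in that basket.
horizontal = {
    1: ["Bread", "Beer", "Diaper"],
    2: ["Beer", "Milk", "Diaper"],
    3: ["Beer", "Diaper", "Milk", "Nuts"],
    4: ["Bread", "Milk", "Diaper"],
    5: ["Beer", "Diaper"],
}

vertical = to_vertical(horizontal)
```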
Bitset Encoding of Vertical Form
Each item's transaction list becomes one bitset: bit t is 1 if the item occurs in transaction t.
Item     1 2 3 4 5
Bread    1 0 0 1 0
Beer     1 1 1 0 1
Milk     0 1 1 1 0
Diaper   1 1 1 1 1
Nuts     0 0 1 0 0
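As a rough illustration of this encoding, the sketch below packs each transaction list into a plain Python integer, with transaction t stored in bit t - 1. The helper names `encode_bitset` and `as_row` are hypothetical, and an uncompressed integer stands in for the compressed bitset libraries compared later.

```python
def encode_bitset(tids):
    """Pack transaction ids into an int; transaction t maps to bit t - 1."""
    bits = 0
    for tid in tids:
        bits |= 1 << (tid - 1)
    return bits


def as_row(bits, n_transactions):
    """Render a bitset as a table row, transaction 1 first."""
    return " ".join(str((bits >> i) & 1) for i in range(n_transactions))


# Vertical form of the example dataset: item -> transaction ids.
vertical = {
    "Bread": [1, 4],
    "Beer": [1, 2, 3, 5],
    "Milk": [2, 3, 4],
    "Diaper": [1, 2, 3, 4, 5],
    "Nuts": [3],
}

bitsets = {item: encode_bitset(tids) for item, tids in vertical.items()}
```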
ECLAT Algorithm
Repeatedly extends patterns with further items and calculates the support of each pattern by counting the ones in its bitset
Item           1 2 3 4 5
Bread          1 0 0 1 0
Beer           1 1 1 0 1
Milk           0 1 1 1 0
Diaper         1 1 1 1 1
Nuts           0 0 1 0 0
Pattern        1 2 3 4 5   Support
Beer, Milk     0 1 1 0 0   2
Beer, Diaper   1 1 1 0 1   4
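The pattern-extension step can be sketched as a depth-first search over bitsets. This is a minimal Python illustration under stated assumptions, not the eclat.c code used in the study: Python integers stand in for the bitsets, the names `popcount`, `eclat`, and `BITSETS` are hypothetical, and the support threshold is treated as inclusive.

```python
def popcount(bits):
    """Number of ones in a non-negative int bitset."""
    return bin(bits).count("1")


def eclat(bitsets, min_support):
    """Depth-first ECLAT sketch; returns {itemset tuple: support}."""
    frequent = {}

    def extend(prefix, candidates):
        # Every (item, bits) pair in `candidates` is already frequent with `prefix`.
        for i, (item, bits) in enumerate(candidates):
            pattern = prefix + (item,)
            frequent[pattern] = popcount(bits)
            longer = []
            for other, other_bits in candidates[i + 1:]:
                joined = bits & other_bits            # intersection by AND
                if popcount(joined) >= min_support:   # support by counting ones
                    longer.append((other, joined))
            extend(pattern, longer)

    start = [(item, bits) for item, bits in sorted(bitsets.items())
             if popcount(bits) >= min_support]
    extend((), start)
    return frequent


# Example bitsets; transaction t is stored in bit t - 1.
BITSETS = {
    "Bread": 0b01001,
    "Beer": 0b10111,
    "Milk": 0b01110,
    "Diaper": 0b11111,
    "Nuts": 0b00100,
}
```

With a minimum support of 3, this sketch reports Beer, Diaper, Milk, {Beer, Diaper}, and {Diaper, Milk} as frequent.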
ECLAT Algorithm
The most time-consuming parts are set operations:
◦ Calculating intersections by ANDing bitsets
◦ Calculating cardinalities by counting ones
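On an uncompressed bitset both operations reduce to a loop over machine words. The sketch below assumes a bitset stored as a list of 64-bit words; the layout and the names `intersect` and `cardinality` are illustrative and do not describe any particular library.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def intersect(a, b):
    """AND two uncompressed bitsets stored as equal-length lists of 64-bit words."""
    return [(x & y) & WORD_MASK for x, y in zip(a, b)]


def cardinality(words):
    """Count the set bits across all words."""
    return sum(bin(w).count("1") for w in words)


# Two bitsets of two words each (128 transactions).
left = [0b1011, WORD_MASK]
right = [0b0110, 0b1]
both = intersect(left, right)
```

The cost of both loops grows with the number of words, which is why compressing long runs of identical words can pay off.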
Bitset Compression
Bitset Compression Methods
Many bitset compression techniques exist; the present study focuses on four of them:
◦ EWAH [7] and CONCISE [8]: simple, based on RLE
◦ Roaring [9] and BitMagic [10]: more sophisticated
Bitset Compression: EWAH
EWAH uses run-length encoding (RLE)
It defines two types of words:
◦ Marker words, for sparse parts (long runs of identical bits)
◦ Dirty words, for incompressible parts
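The marker/dirty distinction can be illustrated with a deliberately simplified word-level run-length encoder. This is not the real EWAH word layout, which packs the fill bit, run length, and dirty-word count into a single marker word; the tuple representation and all names here are assumptions made for readability.

```python
WORD_BITS = 32
FULL = (1 << WORD_BITS) - 1


def compress(words):
    """Collapse runs of all-zero or all-one words into ("marker", fill_bit, run_length).

    Any other word is kept verbatim as ("dirty", word).
    """
    out = []
    for w in words:
        if w in (0, FULL):
            fill = 1 if w == FULL else 0
            if out and out[-1][0] == "marker" and out[-1][1] == fill:
                out[-1] = ("marker", fill, out[-1][2] + 1)   # extend the current run
            else:
                out.append(("marker", fill, 1))
        else:
            out.append(("dirty", w))
    return out


def decompress(encoded):
    """Expand markers back into explicit words."""
    words = []
    for entry in encoded:
        if entry[0] == "marker":
            _, fill, run = entry
            words.extend([FULL if fill else 0] * run)
        else:
            words.append(entry[1])
    return words


def cardinality(encoded):
    """Count set bits without expanding the runs."""
    total = 0
    for entry in encoded:
        if entry[0] == "marker":
            total += entry[1] * entry[2] * WORD_BITS
        else:
            total += bin(entry[1]).count("1")
    return total
```

A run of a thousand empty words costs one entry here, while a bitset with no runs gains nothing, which matches the sparse versus incompressible split on the slide.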
Bitset Compression: CONCISE
Similar to EWAH, with one addition:
◦ A marker word can carry a single-bit exception
◦ This tends to reduce memory usage
Bitset Compression: Roaring
Roaring is more sophisticated
It uses hybrid containers:
◦ Sparse parts are stored in array containers
◦ Dense parts are stored in bitmap containers
Containers are organized as a two-level tree
◦ The root is small and can usually fit into the CPU cache
[Figure: a Roaring bitmap as an array of containers]
◦ Array container (bits: 0x0000, cardinality: 1000): 0, 62, 124, 186, 248, 310, ..., 61938
◦ Array container (bits: 0x0001, cardinality: 100): 0, 1, 2, 3, 4, 5, ..., 99
◦ Bitmap container (bits: 0x0002, cardinality: 2^15): 1, 0, 1, 0, 1, 0, ..., 0
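The container layout in the figure can be approximated with a short Python sketch. It is a simplification under stated assumptions: a dictionary keyed by the 16 high bits stands in for the top-level array, a Python integer stands in for the bitmap container, and the 4096-element switch point follows the published Roaring design. The names `build`, `contains`, and `cardinality` are illustrative.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # chunks with more values than this use a bitmap container


def build(values):
    """Group 32-bit integers by their 16 high bits, then pick a container per chunk."""
    chunks = {}
    for v in values:
        chunks.setdefault(v >> 16, set()).add(v & 0xFFFF)
    containers = {}
    for key, lows in chunks.items():
        if len(lows) <= ARRAY_LIMIT:
            containers[key] = ("array", sorted(lows))    # sparse chunk
        else:
            bitmap = 0
            for low in lows:
                bitmap |= 1 << low
            containers[key] = ("bitmap", bitmap)         # dense chunk
    return containers


def contains(containers, value):
    """Membership test: binary search in arrays, bit test in bitmaps."""
    entry = containers.get(value >> 16)
    if entry is None:
        return False
    kind, data = entry
    low = value & 0xFFFF
    if kind == "array":
        i = bisect_left(data, low)
        return i < len(data) and data[i] == low
    return (data >> low) & 1 == 1


def cardinality(containers):
    """Total number of stored integers."""
    return sum(len(d) if kind == "array" else bin(d).count("1")
               for kind, d in containers.values())


# Data matching the figure: 1000 multiples of 62, 100 consecutive values in the
# second chunk, and every even value in the third chunk.
values = ([62 * i for i in range(1000)]
          + [(1 << 16) + i for i in range(100)]
          + [(2 << 16) + i for i in range(0, 1 << 16, 2)])
roaring = build(values)
```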
Bitset Compression: BitMagic
Another container-based bitset compression technique
Similar to Roaring, but simpler:
◦ Does not use array containers
◦ Does not exploit binary search when calculating intersections
◦ Does not use heuristics to decide which container type should hold the result of an operation
Experiments
Experimental Framework
[Figure: block diagram of the experimental framework]
◦ ECLAT (eclat.c, etc) runs on top of a bitset interface (wrapper.h) with compile-time binding
◦ The bitset interface is backed by EWAH (wrapper_ewah.cpp), BitMagic (wrapper_bm.cpp), Roaring (wrapper_roaring.c), and CONCISE (wrapper_concise.cpp)
◦ Energy and performance monitoring (stats.c) feeds the output results
◦ Real-world datasets: Instacart (~3M transactions) and Kosarak (~1M transactions)
◦ Synthetic datasets: produced by the IBM Quest generator from params
◦ Server under test: Core i5, 32GB RAM
Studying the effects of…
◦ Dataset size
◦ Minimum support
◦ Transaction length
◦ Item count
◦ Frequent pattern count
…on
◦ Run time
◦ Memory usage
◦ Energy consumption
Effect of dataset size
Effect of minsup
Effect of transaction length
Effect of item count
Count of frequent itemsets
Conclusions and Lessons Learnt
Different bitset compression techniques can exhibit dramatically different behavior.
No single technique is the best choice in every case.
More sophisticated does not always mean better.
Devising an advisory layer is promising future work.
◦ A framework that predicts, at runtime, the lower-level technique likely to perform best
Thanks, many thanks!
Espidan village, Bojnord, Iran

More Related Content

Recently uploaded

Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Stationsiddharthteach18
 
Artificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfArtificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfKira Dess
 
engineering chemistry power point presentation
engineering chemistry  power point presentationengineering chemistry  power point presentation
engineering chemistry power point presentationsj9399037128
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisDr.Costas Sachpazis
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfSkNahidulIslamShrabo
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Studentskannan348865
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelDrAjayKumarYadav4
 
01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...
01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...
01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...AshwaniAnuragi1
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfEr.Sonali Nasikkar
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptjigup7320
 
Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligencemahaffeycheryld
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Ramkumar k
 
Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...IJECEIAES
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationEmaan Sharma
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailingAshishSingh1301
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsMathias Magdowski
 
☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...
☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...
☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...mikehavy0
 
Databricks Generative AI FoundationCertified.pdf
Databricks Generative AI FoundationCertified.pdfDatabricks Generative AI FoundationCertified.pdf
Databricks Generative AI FoundationCertified.pdfVinayVadlagattu
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashidFaiyazSheikh
 

Recently uploaded (20)

Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Station
 
Artificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdfArtificial intelligence presentation2-171219131633.pdf
Artificial intelligence presentation2-171219131633.pdf
 
engineering chemistry power point presentation
engineering chemistry  power point presentationengineering chemistry  power point presentation
engineering chemistry power point presentation
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdf
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Students
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...
01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...
01-vogelsanger-stanag-4178-ed-2-the-new-nato-standard-for-nitrocellulose-test...
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdfInstruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
Instruct Nirmaana 24-Smart and Lean Construction Through Technology.pdf
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) ppt
 
Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligence
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)
 
Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...
 
History of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & ModernizationHistory of Indian Railways - the story of Growth & Modernization
History of Indian Railways - the story of Growth & Modernization
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailing
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...
☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...
☎️Looking for Abortion Pills? Contact +27791653574.. 💊💊Available in Gaborone ...
 
Databricks Generative AI FoundationCertified.pdf
Databricks Generative AI FoundationCertified.pdfDatabricks Generative AI FoundationCertified.pdf
Databricks Generative AI FoundationCertified.pdf
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded Systems
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

The Merits of Bitset Compression Techniques for Mining Association Rules from Big Data

  • 1. The Merits of Bitset Compression Techniques for Mining Association Rules from Big Data Hamid Fadishei*, Sahar Doustian, Parisa Saadati University of Bojnord, Iran *fadishei@ub.ac.ir TOPHPC 2017 24-26 APRIL, TEHRAN, IRAN 1
  • 2. Outline TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 2
  • 3. TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 3 Big Data Analytics Outline
  • 4. TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 4 Big Data Analytics Set Math Relies on Outline
  • 5. TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 5 Big Data Analytics Set Math Relies on Compressed Bitsets Accelerated by Outline
  • 6. TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 6 Big Data Analytics Set Math Relies on Compressed Bitsets Accelerated by Some Algorithm Another Algorithm Yet Another Algorithm ... Implemented by Outline
  • 7. TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 7 Big Data Analytics Set Math Relies on Compressed Bitsets Accelerated by Some Algorithm Another Algorithm Yet Another Algorithm ... Implemented by Comparative Study Outline
  • 8. TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 8 Big Data Analytics Set Math Relies on Compressed Bitsets Accelerated by Some Algorithm Another Algorithm Yet Another Algorithm ... Implemented by Comparative Study Outline Focus of this paper!
  • 9. Frequent Pattern Mining TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 9
  • 10. Seeking Frequent Patterns in Big Data Problem of finding the itemsets whose occurrence count is more than a predefined “support” [2] TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 10
  • 11. Seeking Frequent Patterns in Big Data Problem of finding the itemsets whose occurrence count is more than a predefined “support” TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 11 Transaction Basket Items 1 Bread, Beer, Diaper 2 Beer, Milk, Diaper 3 Beer, Diaper, Milk, Nuts 4 Bread, Milk, Diaper 5 Beer, Diaper
  • 12. Seeking Frequent Patterns in Big Data Problem of finding the itemsets whose occurrence count is more than a predefined “support” TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 12 Transaction Basket Items 1 Bread, Beer, Diaper 2 Beer, Milk, Diaper 3 Beer, Diaper, Milk, Nuts 4 Bread, Milk, Diaper 5 Beer, Diaper
  • 13. Algorithms for Frequent Pattern Mining Many algorithms, some of the most well known ones are: ◦ Apriori [3] Scans data multiple times to generate the itemsets of length 1, then 2, then 3, and so on ◦ FPGrowth [4] Constructs a tree and recursively extracts patterns from it ◦ ECLAT [5] Utilizes vertical representation of dataset The present study uses ECLAT TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 13
  • 14. Vertical Presentation of Data TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 14 Item Transactions Bread 1, 4 Beer 1, 2, 3, 5 Milk 2, 3, 4 Diaper 1, 2, 3, 4, 5 Nuts 3 Transaction Items 1 Bread, Beer, Diaper 2 Beer, Milk, Diaper 3 Beer, Diaper, Milk, Nuts 4 Bread, Milk, Diaper 5 Beer, Diaper
  • 15. Bitset Encoding of Vertical Form TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 15 Item Transactions Bread 1, 4 Beer 1, 2, 3, 5 Milk 2, 3, 4 Diaper 1, 2, 3, 4, 5 Nuts 3 1 2 3 4 5 Bread 1 0 0 1 0 Beer 1 1 1 0 1 Milk 0 1 1 1 0 Diaper 1 1 1 1 1 Nuts 0 0 1 0 0 Transaction Items 1 Bread, Beer, Diaper 2 Beer, Milk, Diaper 3 Beer, Diaper, Milk, Nuts 4 Bread, Milk, Diaper 5 Beer, Diaper
  • 16. ECLAT Algorithm Repeatedly appends patterns and calculates support by counting ones TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 16 1 2 3 4 5 Bread 1 0 0 1 0 Beer 1 1 1 0 1 Milk 0 1 1 1 0 Diaper 1 1 1 1 1 Nuts 0 0 1 0 0
  • 17. ECLAT Algorithm Repeatedly appends patterns and calculates support by counting ones TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 17 1 2 3 4 5 Bread 1 0 0 1 0 Beer 1 1 1 0 1 Milk 0 1 1 1 0 Diaper 1 1 1 1 1 Nuts 0 0 1 0 0
  • 18. ECLAT Algorithm Repeatedly appends patterns and calculates support by counting ones TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 18 1 2 3 4 5 Bread 1 0 0 1 0 Beer 1 1 1 0 1 Milk 0 1 1 1 0 Diaper 1 1 1 1 1 Nuts 0 0 1 0 0 1 2 3 4 5 Beer, Milk 1 0 0 1 0
  • 19. ECLAT Algorithm: repeatedly extends patterns and calculates support by counting ones. Item bitsets (transactions 1 to 5): Bread 1 0 0 1 0; Beer 1 1 1 0 1; Milk 0 1 1 1 0; Diaper 1 1 1 1 1; Nuts 0 0 1 0 0. Combined patterns: Beer, Milk 0 1 1 0 0 (support 2); Beer, Diaper 1 1 1 0 1 (support 4)
  • 23. ECLAT Algorithm: the most time-consuming parts are set operations ◦ Calculating intersections, by ANDing bitsets ◦ Calculating cardinalities, by counting ones
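The two set operations named on slide 23 are enough to write a small depth-first ECLAT. The sketch below uses uncompressed Python integers as bitsets and is not the authors' eclat.c. Function names are hypothetical, and treating a pattern as frequent when its count is `>= minsup` is an assumption about the threshold convention.

```python
TIDS = {
    "Bread": [1, 4],
    "Beer": [1, 2, 3, 5],
    "Milk": [2, 3, 4],
    "Diaper": [1, 2, 3, 4, 5],
    "Nuts": [3],
}
# Bit (tid - 1) is set when the item occurs in transaction tid.
BITSETS = {item: sum(1 << (t - 1) for t in tids) for item, tids in TIDS.items()}


def popcount(bits):
    """Cardinality of a bitset: the number of ones."""
    return bin(bits).count("1")


def eclat(prefix, candidates, minsup, out):
    """Extend prefix with each candidate; candidates are (item, bitset) pairs."""
    for i, (item, bits) in enumerate(candidates):
        pattern = prefix + (item,)
        out[pattern] = popcount(bits)
        extensions = []
        for other, other_bits in candidates[i + 1:]:
            joined = bits & other_bits  # intersection by ANDing bitsets
            if popcount(joined) >= minsup:  # assumed threshold convention
                extensions.append((other, joined))
        if extensions:
            eclat(pattern, extensions, minsup, out)


def mine(bitsets, minsup):
    """Return {pattern tuple: support} for all frequent patterns."""
    frequent = [(item, b) for item, b in sorted(bitsets.items())
                if popcount(b) >= minsup]
    out = {}
    eclat((), frequent, minsup, out)
    return out


FREQUENT = mine(BITSETS, minsup=3)
```

Every recursive step performs one AND and one popcount per candidate pair, which is why the choice of bitset implementation dominates the run time.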
  • 24. Bitset Compression
  • 27. Bitset Compression Methods: many bitset compression techniques exist; the present study focuses on four of them ◦ EWAH [7] and CONCISE [8]: simple, based on run-length encoding (RLE) ◦ Roaring [9] and BitMagic [10]: more sophisticated
  • 28. Bitset Compression: EWAH. EWAH uses RLE compression. It defines two types of words ◦ Marker: for sparse parts, encoding runs of identical words ◦ Dirty: for incompressible parts, stored verbatim
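The marker/dirty idea on slide 28 can be illustrated with a simplified word-aligned run-length encoder. This is only a sketch of the principle: it does not reproduce the actual EWAH word layout, and the tuple format, `WORD_BITS`, and the function names are assumptions made for this example.

```python
WORD_BITS = 32
FULL = (1 << WORD_BITS) - 1  # a word with every bit set


def rle_encode(words):
    """Collapse runs of all-zero or all-one words; keep mixed words verbatim."""
    encoded = []
    for w in words:
        if w in (0, FULL):
            bit = 1 if w == FULL else 0
            last = encoded[-1] if encoded else None
            if last and last[0] == "marker" and last[1] == bit:
                # Extend the current run instead of storing another word.
                encoded[-1] = ("marker", bit, last[2] + 1)
            else:
                encoded.append(("marker", bit, 1))
        else:
            encoded.append(("dirty", w))
    return encoded


def rle_decode(encoded):
    """Expand the encoded form back into the original list of words."""
    words = []
    for entry in encoded:
        if entry[0] == "marker":
            words.extend([FULL if entry[1] else 0] * entry[2])
        else:
            words.append(entry[1])
    return words


WORDS = [0, 0, 0, 0x5, FULL, FULL, 0]
ENCODED = rle_encode(WORDS)
```

Long stretches of empty transactions collapse into a single marker entry, while irregular words cost one entry each, so the benefit depends heavily on how the ones are distributed.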
  • 29. Bitset Compression: CONCISE. Similar to EWAH, with one addition: ◦ A marker word can carry a single-bit exception ◦ This tends to reduce memory usage
  • 30. Bitset Compression: Roaring. Roaring is more sophisticated. It uses a notion of hybrid containers ◦ Sparse parts are stored in array containers ◦ Dense parts are stored in bitmap containers. Containers are organized as a two-level tree ◦ The root is small and can usually fit into the CPU cache. Figure: an array of three containers: an array container (bits 0x0000, cardinality 1000) holding 0, 62, 124, 186, 248, 310, ..., 61938; an array container (bits 0x0001, cardinality 100) holding 0, 1, 2, 3, 4, 5, ..., 99; a bitmap container (bits 0x0002, cardinality 2^15) holding the bit pattern 1 0 1 0 1 0 ... 0
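The hybrid-container layout on slide 30 can be sketched as follows. This is an illustration, not the CRoaring implementation: a Python dict stands in for the sorted top-level array, and a Python integer stands in for the 65536-bit bitmap. The cutoff of 4096 values per array container comes from the published Roaring design rather than from the slides, and all names are hypothetical.

```python
from bisect import bisect_left

ARRAY_MAX = 4096  # array-container cutoff in the Roaring design (not on the slide)


def build_containers(values):
    """Group values by their high 16 bits and pick a container type per group."""
    chunks = {}
    for v in sorted(set(values)):
        chunks.setdefault(v >> 16, []).append(v & 0xFFFF)
    containers = {}
    for key, lows in chunks.items():
        if len(lows) <= ARRAY_MAX:
            containers[key] = ("array", lows)  # sparse: sorted 16-bit values
        else:
            bitmap = 0
            for low in lows:
                bitmap |= 1 << low
            containers[key] = ("bitmap", bitmap)  # dense: one bit per value
    return containers


def contains(containers, v):
    """Membership test: find the container by key, then search inside it."""
    entry = containers.get(v >> 16)
    if entry is None:
        return False
    kind, payload = entry
    low = v & 0xFFFF
    if kind == "array":
        i = bisect_left(payload, low)  # binary search in the sorted array
        return i < len(payload) and payload[i] == low
    return bool((payload >> low) & 1)


# Data mirroring the slide's figure: 1000 multiples of 62, then 100
# consecutive integers, then every even number in the third 2^16 block.
VALUES = ([62 * i for i in range(1000)]
          + [65536 + i for i in range(100)]
          + list(range(2 * 65536, 3 * 65536, 2)))
CONTAINERS = build_containers(VALUES)
```

Choosing the representation per 2^16-value block is what lets the structure adapt to data that is sparse in some regions and dense in others.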
  • 31. Bitset Compression: BitMagic. Another container-based bitset compression technique. Similar to Roaring, but simpler ◦ Does not use array containers ◦ Does not exploit binary search for calculating intersections ◦ Does not use heuristics to decide which container type should be used for the results of operations
  • 32. Experiments
  • 33. Experimental Framework. Diagram components: inputs are a Real-world Dataset, or Params fed to the IBM Quest Generator to produce a Synthetic Dataset; ECLAT (eclat.c, etc.) runs on a Bitset interface (wrapper.h, compile-time binding) implemented by EWAH (wrapper_ewah.cpp), BitMagic (wrapper_bm.cpp), Roaring (wrapper_roaring.c), and CONCISE (wrapper_concise.cpp); Energy and Performance Monitoring (stats.c); Output Results
  • 36. Experimental Framework, real-world datasets: Instacart (~3M transactions), Kosarak (~1M transactions)
  • 37. Experimental Framework, server under test: Core i5, 32GB RAM
  • 38. Studying the effects of… Dataset Size, Minimum Support, Transaction Length, Item Count, Frequent Pattern Count
  • 39. …on Run Time, Memory Usage, Energy Consumption
  • 40. Effect of dataset size
  • 42. Effect of minsup
  • 44. Effect of transaction length
  • 46. Effect of item count
  • 48. Count of frequent itemsets
  • 50. Conclusions and Lessons Learnt ◦ Different bitset compression techniques can exhibit dramatically different behavior ◦ No single technique is always the best choice ◦ More sophisticated does not always mean better ◦ Devising an advisory layer is promising future work: a framework that predicts, at runtime, which lower-level technique is likely to perform best
  • 51. Thanks, many thanks! Espidan village, Bojnord, Iran