The document discusses techniques for mining association rules from big data using compressed bitsets. It compares the performance of the ECLAT frequent pattern mining algorithm using different bitset compression methods like EWAH, CONCISE, Roaring and BitMagic. The experiments vary dataset size, minimum support, transaction length, item count and frequent pattern count to analyze the effect on runtime, memory usage and energy consumption. The results show that more sophisticated compression does not always perform better and that the best technique depends on the specific problem characteristics.
1. The Merits of Bitset Compression Techniques for Mining Association Rules from Big Data
Hamid Fadishei*, Sahar Doustian, Parisa Saadati
University of Bojnord, Iran
*fadishei@ub.ac.ir
TOPHPC 2017 24-26 APRIL, TEHRAN, IRAN 1
8. Outline
Big Data Analytics
◦ relies on Set Math
◦ which is accelerated by Compressed Bitsets
◦ which are implemented by Some Algorithm, Another Algorithm, Yet Another Algorithm, ...
◦ Comparative Study of these: focus of this paper!
TOPHPC 2019 23-25 APRIL, TEHRAN, IRAN 8
10. Seeking Frequent Patterns in Big Data
The problem of finding the itemsets whose occurrence count exceeds a predefined “support” threshold [2]
Transaction | Basket Items
1 | Bread, Beer, Diaper
2 | Beer, Milk, Diaper
3 | Beer, Diaper, Milk, Nuts
4 | Bread, Milk, Diaper
5 | Beer, Diaper
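As a worked example, the baskets above can be counted directly. The sketch below is only an illustration: the threshold `MIN_SUPPORT` and the function name `count_itemsets` are assumptions introduced here, not values or names from the slides.

```python
from collections import Counter
from itertools import combinations

# Market-basket transactions copied from the table above.
transactions = {
    1: {"Bread", "Beer", "Diaper"},
    2: {"Beer", "Milk", "Diaper"},
    3: {"Beer", "Diaper", "Milk", "Nuts"},
    4: {"Bread", "Milk", "Diaper"},
    5: {"Beer", "Diaper"},
}

def count_itemsets(transactions, size):
    """Count how many transactions contain each itemset of the given size."""
    counts = Counter()
    for items in transactions.values():
        for itemset in combinations(sorted(items), size):
            counts[itemset] += 1
    return counts

MIN_SUPPORT = 2  # hypothetical threshold chosen for this example

pair_counts = count_itemsets(transactions, 2)
# Keep the pairs whose occurrence count is more than the support threshold.
frequent_pairs = {p: c for p, c in pair_counts.items() if c > MIN_SUPPORT}
print(frequent_pairs)
```

With this threshold only two pairs survive: Beer with Diaper (4 baskets) and Diaper with Milk (3 baskets).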
13. Algorithms for Frequent Pattern Mining
Many algorithms exist; some of the best known are:
◦ Apriori [3]: scans the data multiple times to generate the itemsets of length 1, then 2, then 3, and so on
◦ FPGrowth [4]: constructs a tree and recursively extracts patterns from it
◦ ECLAT [5]: uses a vertical representation of the dataset
The present study uses ECLAT
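The vertical representation that ECLAT relies on can be illustrated on the same toy baskets. This is a minimal sketch using plain Python sets; the name `to_vertical` is hypothetical and the code is not taken from the study.

```python
from collections import defaultdict

# Horizontal layout: transaction id -> items (the toy baskets from the slides).
horizontal = {
    1: {"Bread", "Beer", "Diaper"},
    2: {"Beer", "Milk", "Diaper"},
    3: {"Beer", "Diaper", "Milk", "Nuts"},
    4: {"Bread", "Milk", "Diaper"},
    5: {"Beer", "Diaper"},
}

def to_vertical(horizontal):
    """Invert the dataset: item -> set of ids of the transactions containing it."""
    vertical = defaultdict(set)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)

# The support of an itemset is the size of the intersection of its items' id sets.
beer_milk = vertical["Beer"] & vertical["Milk"]
print(sorted(beer_milk), len(beer_milk))
```

In this layout the data is scanned once, and every later support count becomes a set intersection.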
23. ECLAT Algorithm
The most time-consuming parts are set operations:
◦ Calculating intersections by ANDing bitsets
◦ Calculating cardinalities by counting the 1 bits
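The two operations above can be sketched with uncompressed bitsets stored as Python integers. This is an illustrative depth-first ECLAT under stated assumptions, not the implementation measured in the study: `MIN_SUPPORT`, `popcount` and the toy baskets are choices made for this example.

```python
def popcount(bits):
    """Cardinality of a bitset stored as a Python int (number of 1 bits)."""
    return bin(bits).count("1")

def eclat(prefix, candidates, min_support, out):
    """Depth-first ECLAT over (item, bitset) pairs.

    Bit t of a bitset is 1 when transaction t contains the itemset.
    """
    for i, (item, bits) in enumerate(candidates):
        itemset = prefix + (item,)
        out[itemset] = popcount(bits)
        extended = []
        for other, other_bits in candidates[i + 1:]:
            both = bits & other_bits            # intersection by ANDing
            if popcount(both) > min_support:    # cardinality by counting 1 bits
                extended.append((other, both))
        eclat(itemset, extended, min_support, out)

baskets = [
    ["Bread", "Beer", "Diaper"],
    ["Beer", "Milk", "Diaper"],
    ["Beer", "Diaper", "Milk", "Nuts"],
    ["Bread", "Milk", "Diaper"],
    ["Beer", "Diaper"],
]

# Vertical layout: one bitset per item, with bit t set for transaction t.
vertical = {}
for tid, items in enumerate(baskets):
    for item in items:
        vertical[item] = vertical.get(item, 0) | (1 << tid)

MIN_SUPPORT = 2  # hypothetical threshold chosen for this example
frequent_items = sorted(
    (item, bits) for item, bits in vertical.items() if popcount(bits) > MIN_SUPPORT
)
result = {}
eclat((), frequent_items, MIN_SUPPORT, result)
print(result)
```

Every candidate extension costs one AND and one bit count, which is why the choice of bitset representation dominates the runtime.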
27. Bitset Compression Methods
Bitset compression techniques
◦ Many of them
◦ The present study focuses on 4 of them
◦ Simple, based on RLE: EWAH [7], CONCISE [8]
◦ More sophisticated: Roaring [9], BitMagic [10]
28. Bitset Compression: EWAH
EWAH uses RLE compression
It defines two types of words
◦ Marker – for sparse parts
◦ Dirty – for incompressible parts
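The marker/dirty split can be sketched as word-level run-length encoding. The code below is a simplified illustration, not the real EWAH layout: actual EWAH packs the run information into a single machine word, whereas this sketch uses Python tuples, and the 64-bit word size is an assumption.

```python
WORD_BITS = 64                      # assumed word size for this sketch
ALL_ONES = (1 << WORD_BITS) - 1

def rle_compress(words):
    """Collapse runs of identical clean words; keep every other word verbatim."""
    out = []
    for w in words:
        if w in (0, ALL_ONES):
            bit = 1 if w else 0
            if out and out[-1][0] == "marker" and out[-1][1] == bit:
                # Extend the current run instead of storing another word.
                out[-1] = ("marker", bit, out[-1][2] + 1)
            else:
                out.append(("marker", bit, 1))
        else:
            out.append(("dirty", w))
    return out

def rle_decompress(tokens):
    """Expand the tokens back into the original list of words."""
    words = []
    for tok in tokens:
        if tok[0] == "marker":
            words.extend([ALL_ONES if tok[1] else 0] * tok[2])
        else:
            words.append(tok[1])
    return words

words = [0] * 1000 + [0b1011] + [ALL_ONES] * 3 + [0] * 20
tokens = rle_compress(words)
print(len(words), "words ->", len(tokens), "tokens")
```

Long runs of empty words collapse to one token each, while a word with mixed bits costs a full token of its own.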
29. Bitset Compression: CONCISE
Similar to EWAH, with one addition:
◦ A marker word can carry a single-bit exception
◦ This tends to reduce memory usage
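The memory effect of the single-bit exception can be sketched by counting words with and without it. This is a simplified model under stated assumptions: it handles only runs of zero blocks, it counts words rather than producing the real CONCISE encoding, and the 31-bit block size and bit positions are choices made for this example.

```python
BLOCK_BITS = 31  # assumed payload bits per word in this sketch

def to_blocks(positions, n_blocks):
    """Build a list of blocks with the given bit positions set."""
    blocks = [0] * n_blocks
    for p in positions:
        blocks[p // BLOCK_BITS] |= 1 << (p % BLOCK_BITS)
    return blocks

def words_plain_rle(blocks):
    """One word per run of zero blocks, one word per non-zero block."""
    count, in_run = 0, False
    for b in blocks:
        if b == 0:
            if not in_run:
                count += 1
            in_run = True
        else:
            count += 1
            in_run = False
    return count

def words_with_exception(blocks):
    """Same, but a block with exactly one set bit is folded into the zero run after it."""
    count, i, n = 0, 0, len(blocks)
    while i < n:
        b = blocks[i]
        single_bit = b != 0 and b & (b - 1) == 0
        if b == 0 or (single_bit and i + 1 < n and blocks[i + 1] == 0):
            i += 1
            while i < n and blocks[i] == 0:   # absorb the zero run
                i += 1
        else:
            i += 1                            # literal word
        count += 1
    return count

blocks = to_blocks([5, 126, 217, 219], n_blocks=9)
print(words_plain_rle(blocks), words_with_exception(blocks))
```

On very sparse data, where isolated bits are separated by long empty stretches, each isolated bit no longer costs a separate literal word.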
30. Bitset Compression: Roaring
Roaring is more sophisticated
Uses a notion of hybrid containers
◦ Sparse parts are stored in array containers
◦ Dense parts in bitmap containers
Containers are organized as a two-level tree
◦ Small root that can usually fit into CPU cache
Figure: array of containers
◦ Array container (Bits: 0x0000, Cardinality: 1000): 0, 62, 124, 186, 248, 310, ..., 61938
◦ Array container (Bits: 0x0001, Cardinality: 100): 0, 1, 2, 3, 4, 5, ..., 99
◦ Bitmap container (Bits: 0x0002, Cardinality: 2^15): 1, 0, 1, 0, 1, 0, ..., 0
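The hybrid-container idea can be sketched on the three example sets from the figure. This is a toy model, not the Roaring library: the names `build`, `contains` and `cardinality` are hypothetical, and the 4096-element limit for array containers comes from the Roaring paper rather than from the slides.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # array/bitmap threshold from the Roaring paper, not from the slides

def build(values):
    """Group 32-bit integers by their high 16 bits into per-chunk containers."""
    chunks = {}
    for v in values:
        chunks.setdefault(v >> 16, set()).add(v & 0xFFFF)
    index = []  # first level: (key, container) pairs sorted by key
    for key in sorted(chunks):
        lows = sorted(chunks[key])
        if len(lows) <= ARRAY_LIMIT:
            container = ("array", lows)       # sparse chunk: sorted 16-bit values
        else:
            bitmap = 0
            for low in lows:
                bitmap |= 1 << low            # dense chunk: one bit per value
            container = ("bitmap", bitmap)
        index.append((key, container))
    return index

def contains(index, v):
    """Membership test: locate the chunk by key, then search inside the container."""
    keys = [k for k, _ in index]
    pos = bisect_left(keys, v >> 16)
    if pos == len(keys) or keys[pos] != v >> 16:
        return False
    kind, data = index[pos][1]
    low = v & 0xFFFF
    if kind == "array":
        j = bisect_left(data, low)
        return j < len(data) and data[j] == low
    return (data >> low) & 1 == 1

def cardinality(container):
    kind, data = container
    return len(data) if kind == "array" else bin(data).count("1")

values = (
    [62 * i for i in range(1000)]                      # 0, 62, ..., 61938
    + [(1 << 16) + i for i in range(100)]              # 100 consecutive values
    + [(2 << 16) + i for i in range(0, 1 << 16, 2)]    # every even value in the chunk
)
index = build(values)
summary = [(key, c[0], cardinality(c)) for key, c in index]
print(summary)
```

The first level stays small, three keys here, while each chunk picks the representation that suits its density.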
31. Bitset Compression: BitMagic
Another container-based bitset compression technique
Similar to Roaring, but simpler
◦ Does not use array containers
◦ Does not exploit binary search for calculating intersections
◦ Does not use heuristics to decide which container type should be used for results of operations
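The binary-search point can be illustrated with two generic ways of intersecting sorted arrays. Neither function is taken from Roaring or BitMagic; both are textbook sketches, and the input arrays are made up for this example.

```python
from bisect import bisect_left

def intersect_merge(a, b):
    """Linear merge of two sorted arrays: about len(a) + len(b) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_binary_search(small, large):
    """Look up each element of the small array in the large one by binary search."""
    out = []
    lo = 0
    for x in small:
        lo = bisect_left(large, x, lo)   # resume from the previous position
        if lo == len(large):
            break
        if large[lo] == x:
            out.append(x)
    return out

small = [62, 500, 6200]
large = list(range(0, 62000, 62))        # 1000 sorted values
print(intersect_merge(small, large), intersect_binary_search(small, large))
```

When one operand is much smaller than the other, the binary-search variant touches far fewer elements than the linear merge.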
50. Conclusions and Lessons Learnt
Different bitset compression techniques can exhibit dramatically different behavior.
No single technique is always the best choice.
More sophisticated does not always mean better.
Devising an advisory layer is promising future work.
◦ A framework that predicts, at runtime, the lower-level technique likely to perform best