1. An Efficient Map-Reduce Algorithm for Computing
Formal Concepts from Binary Data
Lalit Kumar
University of Cincinnati
2. Finding Concepts in Binary Datasets
Binary dataset (rows are objects, columns are attributes):
   a b c d e f g
0  0 1 1 0 1 1 1
1  1 0 1 1 0 0 1
2  1 1 0 0 0 0 0
3  1 0 1 1 0 1 0
4  1 0 1 0 1 0 1
Concept:
A set of objects that share the same value for a certain set of properties.
Formal Concepts: ~Closed Item Sets
Intent {a, c, g}, Extent {1, 4}:
   a c g
1  1 1 1
4  1 1 1
Intent {c, g}, Extent {0, 1, 4}:
   c g
0  1 1
1  1 1
4  1 1
Intent {c, f}, Extent {0, 3}:
   c f
0  1 1
3  1 1
…
3. Too Many Concepts Even in a Sparse Dataset
•List of all concepts in the previous example, each written as <extent, intent>.
C1 = <{0, 1, 2, 3, 4}, {}>
C2 = <{1, 2, 3, 4}, {a}>
C3 = <{2}, {a, b}>
C4 = <{}, {a, b, c, d, e, f, g}>
C5 = <{1, 3, 4}, {a, c}>
C6 = <{1, 3}, {a, c, d}>
C7 = <{3}, {a, c, d, f}>
C8 = <{1}, {a, c, d, g}>
C9 = <{4}, {a, c, e, g}>
C10 = <{1, 4}, {a, c, g}>
C11 = <{0, 2}, {b}>
C12 = <{0}, {b, c, e, f, g}>
C13 = <{0, 1, 3, 4}, {c}>
C14 = <{0, 4}, {c, e, g}>
C15 = <{0, 3}, {c, f}>
C16 = <{0, 1, 4}, {c, g}>
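The list above can be reproduced with a brute-force sketch. This is not the algorithm proposed in these slides; it simply closes every subset of objects, which also shows why exhaustive enumeration does not scale (the number of subsets doubles with every added object). The names `CONTEXT`, `common_attributes`, `objects_having` and `all_concepts` are illustrative assumptions.

```python
from itertools import combinations

# Object -> attributes with value 1, copied from the example table on slide 2.
CONTEXT = {
    0: set("bcefg"),
    1: set("acdg"),
    2: set("ab"),
    3: set("acdf"),
    4: set("aceg"),
}
ATTRIBUTES = set("abcdefg")


def common_attributes(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return shared


def objects_having(attributes):
    """Objects that possess every attribute in `attributes`."""
    return {obj for obj, attrs in CONTEXT.items() if attributes <= attrs}


def all_concepts():
    """Brute force: close every subset of objects and keep the distinct results."""
    found = set()
    for size in range(len(CONTEXT) + 1):
        for subset in combinations(CONTEXT, size):
            intent = common_attributes(subset)
            extent = objects_having(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found


concepts = all_concepts()
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

For this 5 x 7 example only 32 subsets are examined; the same loop over a dataset with thousands of rows would be infeasible.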
4. Lattice of Concepts
All concepts can be arranged in a lattice using the subset ordering.
Parent Concept: <{1, 3, 4}, {a, c}>
Parent Concept: <{0, 1, 4}, {c, g}>
New Concept: <{1, 4}, {a, c, g}> (Intersection (∩) on Extents, Union (U) on Intents)
A concept can be computed from its parents by performing the Set Intersection and Set Union operations.
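The step from two parent concepts to a new concept can be sketched in a few lines. The function name `meet` and the tuple layout `(extent, intent)` are assumptions made for illustration. Note that in general the union of two intents may still need to be closed against the data; for the pair shown on this slide it is already closed.

```python
def meet(parent_a, parent_b):
    """Combine two (extent, intent) pairs: intersect extents, union intents."""
    extent_a, intent_a = parent_a
    extent_b, intent_b = parent_b
    return extent_a & extent_b, intent_a | intent_b


# The two parent concepts from the slide (C5 and C16 in the earlier list).
c5 = ({1, 3, 4}, {"a", "c"})
c16 = ({0, 1, 4}, {"c", "g"})

child = meet(c5, c16)
print(child)
```

The result is the pair with extent {1, 4} and intent {a, c, g}, which is C10 in the earlier list.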
5. Sufficient Set
Figure: Complete lattice of the concepts C1 to C16. Concepts in the Sufficient Set are shown in green boxes.
Consequence:
We don’t need to make the entire lattice explicit. A smaller set may be sufficient to generate
the rest of the lattice.
Sufficient Set:
A subset of the lattice from which the entire lattice can be generated; it is not necessarily a minimal such set.
6. Existing Algorithms
•Find all the concepts of the lattice.
•Use DFS-based algorithms.
•Computationally intractable: the number of concepts can grow exponentially with the size of the dataset.
•Even Map-Reduce algorithms take a long time and need multiple iterations
for the DFS search.
7. Previous Work
Figure: search tree rooted at <C1,0>, with nodes such as <C11,2>, <C13,3>, <C2,1> and <C12,3>. All the nodes are concepts.
•Each level of the tree is processed using Map-Reduce to enumerate
formal concepts.
•Effectively, it is an implementation of DFS and BFS using Map-Reduce.
8. Previous Work…
•Distributed Algorithm for computing formal concepts using Map-Reduce Framework.
(Authors: Petr Krajca and Vilem Vychodil, 2009)
Figure (taken from the mentioned paper):
Starting concept (the concept having either the largest intent or the largest extent)
→ Iteration#1: map (derive new concepts), reduce (remove redundant concepts), store results (first level of the tree)
→ Iteration#2: map (derive new concepts), reduce (remove redundant concepts), store results (second level of the tree)
→ …
This step needs to be performed for each node of the DFS tree.
9. Previous Work…
•Distributed Formal Concept Analysis Algorithms Based on an Iterative Map-Reduce
Framework. (Authors: Biao Xu, Ruairí de Fréin, Eric Robson, and Mícheál Ó Foghlú, 2012)
Figure (taken from the mentioned paper):
While (!isLastClosure(closure)), runMapReduce() is executed again.
In each iteration, every data split (Data Split #1 … Data Split #n) is handled by a Map task that runs computeClosure() and emits pairs <atr1, localClosure1>, <atr2, localClosure2>, …
Each Reducer (Reducer#1 … Reducer#n) runs merging() and check() on these pairs to produce the Closure.
10. Previous Work (Sequential Algorithms)…
•A Biclustering algorithm for extracting bit-patterns from binary datasets.
(Authors: Rodriguez-Baena DS, Perez-Pulido AJ and Aguilar-Ruiz JS, 2011)
•A Fast Algorithm for Computing All Intersections of Objects in a Finite Semi-Lattice.
(Author: Sergei O. Kuznetsov, 1993)
•In-Close, a Fast Algorithm for Computing Formal Concepts.
(Author: Simon Andrews, 2009)
•A Local Approach to Concept Generation.
(Authors: Anne Berry, Jean-Paul Bordat, and Alain Sigayret, 2006)
11. Our Approach
•Single iteration of Map-Reduce.
•Find only the Sufficient Set of concepts.
•Use a single-processor system to enumerate those parts of the lattice that
may be of interest.
This helps conquer the complexity.
12. Our Algorithm
Binary Dataset (input)
→ Map-Reduce (single iteration)
→ Sufficient Set Generation (single-processor machine)
→ Lattice Enumeration (single-processor machine)
→ Complete Lattice (output)
The configurations of the Map-Reduce cluster and of the single-processor machine are detailed later in the presentation.
13. Map-Reduce
•Phase#1
•Needs only one iteration of Map-Reduce.
•Lists all the attributes of the dataset with their index.
•Condenses the entire dataset to a very small and manageable size without
information loss.
•The reducer output can be easily processed to enumerate the required Sufficient
Set, from which the entire lattice of Formal Concepts can be generated.
14. Map-Reduce...
Phase#1 (example)
Input data (rows are objects, columns are attributes):
   a b c d e f g
1  1 1 0 1 0 1 0
2  1 0 1 0 1 0 1
3  0 1 1 1 0 1 1
4  0 1 0 1 1 0 0
5  1 0 0 1 1 1 0
6  0 1 1 0 0 1 1
Formal concepts in the Sufficient Set are generated from the input data, and the
Sufficient Set is then used to enumerate all the formal concepts in the entire lattice.
15. Map-Reduce...
Mapper Output
Key (Intent) Value (Extent)
a 2
a 5
a, b 1
b 4
b, c 6
b, c, d 3
c 2
d 1
d, e 4
d, e, f 5
e 2
f 1
f, g 3
f, g 6
g 2
Processing flow of Mapper: Input Data (the 6 x 7 table from the previous slide) → Mapper Output (the table above).
Total number of outputs in the table = 15
The key/value pair is flipped: the value is treated as the key (Intent) and the key as the value (Extent).
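The mapper table is consistent with one simple rule: for each row, every maximal run of consecutive 1s becomes a key, and the object id becomes the value. This rule is inferred from the example tables and is not stated on the slides, so the sketch below is an assumption; `map_row`, `ROWS` and `ATTRIBUTES` are illustrative names, and the code runs as plain Python rather than as a Hadoop job.

```python
ATTRIBUTES = "abcdefg"

# Object id -> row bits, copied from the input data of the Phase#1 example.
ROWS = {
    1: "1101010",
    2: "1010101",
    3: "0111011",
    4: "0101100",
    5: "1001110",
    6: "0110011",
}


def map_row(object_id, bits):
    """Yield (run_of_attributes, object_id) for each maximal run of 1s in a row."""
    run = []
    for name, bit in zip(ATTRIBUTES, bits):
        if bit == "1":
            run.append(name)
        elif run:
            # A 0 ends the current run, so emit it with the object id as value.
            yield tuple(run), object_id
            run = []
    if run:
        yield tuple(run), object_id


mapper_output = sorted(
    pair for object_id, bits in ROWS.items() for pair in map_row(object_id, bits)
)
for intent, object_id in mapper_output:
    print(", ".join(intent), object_id)
```

Under this assumed rule the sketch emits 15 pairs in the same order as the mapper table.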
16. Map-Reduce Phase...
Reducer Output
Key (Intent) Value (Extent)
<a> 2,5
<a, b> 1
<b> 4
<b, c> 6
<b, c, d> 3
<c> 2
<d> 1
<d, e> 4
<d, e, f> 5
<e> 2
<f> 1
<f, g> 3,6
<g> 2
Processing flow of Reducer: Mapper Output (the table from the previous slide) → Reducer Output (the table above).
Total number of outputs in the table = 13
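The reducer step shown in the tables amounts to grouping the mapper pairs by key. A minimal sketch in plain Python (not Hadoop code) follows; `MAPPER_OUTPUT` is copied from the mapper table, and `reduce_pairs` is an illustrative name.

```python
from collections import defaultdict

# (intent, object id) pairs copied from the mapper output table.
MAPPER_OUTPUT = [
    ("a", 2), ("a", 5), ("a,b", 1), ("b", 4), ("b,c", 6), ("b,c,d", 3),
    ("c", 2), ("d", 1), ("d,e", 4), ("d,e,f", 5), ("e", 2), ("f", 1),
    ("f,g", 3), ("f,g", 6), ("g", 2),
]


def reduce_pairs(pairs):
    """Group object ids by intent, as a reducer does for pairs sharing a key."""
    grouped = defaultdict(list)
    for intent, object_id in pairs:
        grouped[intent].append(object_id)
    return {intent: sorted(ids) for intent, ids in grouped.items()}


reducer_output = reduce_pairs(MAPPER_OUTPUT)
for intent, extent in reducer_output.items():
    print(f"<{intent}>", ",".join(map(str, extent)))
```

The 15 mapper pairs collapse to 13 entries because the keys `a` and `f,g` each occur twice.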
17. Sufficient Set Generation
•Phase#2
•Use the Reducer output to generate the Sufficient Set.
•Processing is performed on a single-processor machine, as the reducer
output is much smaller than the original dataset.
•Enumerate the formal concepts of the Sufficient Set, which can be used to generate
the entire lattice.
18. Sufficient Set Generation...
Input for Sufficient Set Generation
Intent Extent
<a> 2,5
<a, b> 1
<b> 4
<b, c> 6
<b, c, d> 3
<c> 2
<d> 1
<d, e> 4
<d, e, f> 5
<e> 2
<f> 1
<f, g> 3,6
<g> 2
Processing flow for Sufficient Set generation
Output after processing
Intent Extent
<b, c, d, f, g> 3
<a, d, e, f> 5
<a, b, d, f> 1
<b, c> 3,6
<b, c, f, g> 6
<d, e> 4,5
<b, d, e> 4
<f, g> 3,6
<a> 1,2,5
<b> 1,3,4,6
<b, d, e> 4
<c> 2,3,6
<a, c, e, g> 2
<d> 1,3,4,5
<a, b, d, f> 1
<e> 2,4,5
<a, c, e, g> 2
<f> 1,3,5,6
<a, b, d, f> 1
<g> 2,3,6
<a, c, e, g> 2
Operations used in this step:
Taking “∩” on Intent and “U” on Extent
Taking “U” on Intent and “∩” on Extent
19. Sufficient Set Generation...
Processing flow for Sufficient Set generation…
Output from previous step
Intent Extent
<b, c, d, f, g> 3
<a, d, e, f> 5
<a, b, d, f> 1 (R1)
<b, c> 3,6 (M1) (R2)
<b, c, f, g> 6 (M1) (R2)
<d, e> 4,5
<b, d, e> 4 (R3)
<f, g> 3,6 (M1) (R2)
<a> 1,2,5
<b> 1,3,4,6
<b, d, e> 4 (R3)
<c> 2,3,6 (M2)
<a, c, e, g> 2 (R4)
<d> 1,3,4,5
<a, b, d, f> 1 (R1)
<e> 2,4,5
<a, c, e, g> 2 (R4)
<f> 1,3,5,6
<a, b, d, f> 1 (R1)
<g> 2,3,6 (M2)
<a, c, e, g> 2 (R4)
Output after removal of redundant entries (Sufficient Set)
Intent Extent
<b, c, d, f, g> 3
<a, d, e, f> 5
<a, b, d, f> 1
<b, c, f, g> 3,6
<d, e> 4,5
<b, d, e> 4
<a> 1,2,5
<b> 1,3,4,6
<c, g> 2,3,6
<a, c, e, g> 2
<d> 1,3,4,5
<e> 2,4,5
<f> 1,3,5,6
Operations used in this step:
Taking “U” on Intent and “∩” on Extent
Taking “∩” on LHS and “U” on RHS
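The transition between the two tables on this slide can be reproduced with two grouping passes. This reading of the (M…) and (R…) marks is inferred from the tables and is an assumption: entries with the same extent are merged by uniting their intents, then entries with the same intent are collapsed by uniting their extents. `ROWS` and `merge_and_deduplicate` are illustrative names.

```python
from collections import defaultdict

# (intent, extent) rows copied from the "Output from previous step" table.
ROWS = [
    ("bcdfg", {3}), ("adef", {5}), ("abdf", {1}), ("bc", {3, 6}),
    ("bcfg", {6}), ("de", {4, 5}), ("bde", {4}), ("fg", {3, 6}),
    ("a", {1, 2, 5}), ("b", {1, 3, 4, 6}), ("bde", {4}), ("c", {2, 3, 6}),
    ("aceg", {2}), ("d", {1, 3, 4, 5}), ("abdf", {1}), ("e", {2, 4, 5}),
    ("aceg", {2}), ("f", {1, 3, 5, 6}), ("abdf", {1}), ("g", {2, 3, 6}),
    ("aceg", {2}),
]


def merge_and_deduplicate(rows):
    """Merge rows sharing an extent, then collapse rows sharing an intent."""
    # Pass 1: same extent -> union of intents (e.g. <b, c> and <f, g> on 3,6).
    by_extent = defaultdict(set)
    for intent, extent in rows:
        by_extent[frozenset(extent)] |= set(intent)
    # Pass 2: same intent -> union of extents (also drops exact duplicates).
    by_intent = defaultdict(set)
    for extent, intent in by_extent.items():
        by_intent[frozenset(intent)] |= extent
    return {(intent, frozenset(extent)) for intent, extent in by_intent.items()}


sufficient_set = merge_and_deduplicate(ROWS)
for intent, extent in sorted(sufficient_set, key=lambda e: sorted(e[0])):
    print("<" + ", ".join(sorted(intent)) + ">", ",".join(map(str, sorted(extent))))
```

On this example the 21 input rows reduce to the 13 entries of the Sufficient Set table, although the printed order differs from the slide.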
20. Lattice Enumeration
•Phase#3
•Use the Sufficient Set to enumerate all the Formal Concepts in the lattice.
•Selective generation of formal concepts using the Sufficient Set.
•Since the Sufficient Set is very small relative to the given dataset, it
can be stored and processed to generate the lattice as and when required.
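The slides do not spell out the enumeration procedure, so the following is only one possible sketch. It assumes that every entry of the Sufficient Set is a valid concept and that the extent of every single attribute appears in it, which holds for the 13-entry example. Extents are closed under intersection, and each intent is then read off the Sufficient Set without touching the original dataset. `SUFFICIENT_SET`, `ALL_OBJECTS` and `enumerate_lattice` are illustrative names.

```python
# (intent, extent) entries copied from the Sufficient Set of the worked example.
SUFFICIENT_SET = [
    ("bcdfg", {3}), ("adef", {5}), ("abdf", {1}), ("bcfg", {3, 6}),
    ("de", {4, 5}), ("bde", {4}), ("a", {1, 2, 5}), ("b", {1, 3, 4, 6}),
    ("cg", {2, 3, 6}), ("aceg", {2}), ("d", {1, 3, 4, 5}), ("e", {2, 4, 5}),
    ("f", {1, 3, 5, 6}),
]
ALL_OBJECTS = frozenset({1, 2, 3, 4, 5, 6})


def enumerate_lattice(sufficient_set, all_objects):
    """Return every (extent, intent) pair derivable from the sufficient set."""
    entries = [(frozenset(i), frozenset(e)) for i, e in sufficient_set]
    # Start from the top extent (all objects) and the stored extents.
    extents = {all_objects} | {extent for _, extent in entries}
    # Close the extents under pairwise intersection until nothing new appears.
    frontier = set(extents)
    while frontier:
        new = {a & b for a in frontier for b in extents} - extents
        extents |= new
        frontier = new
    # The intent of an extent is the union of the intents of all stored
    # entries whose extent contains it (empty for the top concept).
    lattice = set()
    for extent in extents:
        intent = frozenset().union(*(i for i, e in entries if extent <= e))
        lattice.add((extent, intent))
    return lattice


lattice = enumerate_lattice(SUFFICIENT_SET, ALL_OBJECTS)
for extent, intent in sorted(lattice, key=lambda c: (-len(c[0]), sorted(c[0]))):
    print(sorted(extent), sorted(intent))
```

For example, intersecting the extents of `<a>` and `<d>` gives {1, 5}, and the entries whose extents contain {1, 5} are `<a>`, `<d>` and `<f>`, so the derived concept has intent {a, d, f}.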
23. Test Setup
Distributed Computing Configuration
Total Machines 4
Processor Intel Xeon (64 bit)
Ethernet Card 100Mbps
Main Memory (RAM) 4GB (each machine)
Operating System CentOS 6.5 (64 bit)
Hadoop Version 2.6
Java Version 1.7 (Oracle)
Java IDE NetBeans (Version 8.2)
Hard Disk Drive 500GB (each machine)
Stand-Alone Machine Configuration
Processor Intel Xeon (64 bit)
Main Memory (RAM) 8GB
Operating System CentOS 7 (64 bit)
Python Version 2.7
Hard Disk Drive 500GB
24. Results
Algorithm                                                     Mushroom              Anon-Web              Census Income
NextClosure (Sequential)                                      618 sec               14671 sec             18230 sec
CloseByOne (Sequential)                                       2543 sec              656 sec               7465 sec
MRGanter (Map-Reduce)                                         20269 sec (5 nodes)   20110 sec (3 nodes)   9654 sec (11 nodes)
MRCbo (Map-Reduce)                                            241 sec (11 nodes)    693 sec (11 nodes)    803 sec (11 nodes)
MRGanter+ (Map-Reduce)                                        198 sec (9 nodes)     496 sec (9 nodes)     358 sec (9 nodes)
Our Algorithm (Map-Reduce)                                    42 (10 nodes)         26 (10 nodes)         69 (nodes)
Our Algorithm (Enumeration of Sufficient Set) (Single M/c)    6 sec                 24 sec                97 sec
Our Algorithm (Enumeration of Entire Lattice) (Single M/c)    291 sec               165 sec               653 sec
Number of Concepts in Sufficient Set                          117                   365                   147
All the datasets are taken from the UCI repository.
25. Scalability Test
Execution Time for Plant Dataset (22632 rows x 70 columns)
Dataset Size                 Map-Reduce execution time (sec)   Sufficient Set generation time (sec)   Number of concepts in Sufficient Set
22632 x 70                   22                                64                                     661
226320 x 70 (Dataset*10)     64                                780                                    661
452640 x 70 (Dataset*20)     206                               1773                                   661
678960 x 70 (Dataset*30)     441                               2898                                   661
905280 x 70 (Dataset*40)     874                               3982                                   661
1131600 x 70 (Dataset*50)    1480                              5334                                   661
The Plant dataset is taken from the UCI repository.
26. Results with 2 different data densities
Comparison of Our Algorithm with BiBit Algorithm
Input data density = 10%
#Rows   #Columns   Time (sec) by BiBit   Total Time (sec) by Our Algorithm
20      50         0.01                  17
50      50         0.01                  18
100     50         0.79                  19
500     50         10.3                  20
1000    50         16.8                  21
5000    50         66.3                  22
10000   50         128.5                 29
Input data density = 60%
#Rows   #Columns   Time (sec) by BiBit   Total Time (sec) by Our Algorithm
20      50         4.2                   17
50      50         5.0                   17
100     50         7.3                   18
500     50         40.5                  18
1000    50         67.7                  19
5000    50         316.8                 29
10000   50         633.8                 53
27. Our Contribution
•The output generated by our algorithm is much smaller than the complete lattice (without
information loss).
•The generated Sufficient Set contains all the information needed to generate
the complete lattice.
•The Sufficient Set can also be used to generate concepts of a desired length
without revisiting the original dataset.