An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland describing algorithms for distributed Formal Concept Analysis
ABSTRACT
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration, we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+, where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.
Accepted for publication at the International Conference for Formal Concept Analysis 2012.
Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Ruairí de Fréin: rdefrein (at) gmail (dot) com
bibtex:
@incollection{Xu2012,
  year={2012},
  isbn={978-3-642-29891-2},
  booktitle={Formal Concept Analysis},
  volume={7278},
  series={Lecture Notes in Computer Science},
  editor={Domenach, Florent and Ignatov, Dmitry I. and Poelmans, Jonas},
  doi={10.1007/978-3-642-29892-9_26},
  title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
  url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
  publisher={Springer Berlin Heidelberg},
  keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
  author={Xu, Biao and de Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
  pages={292--308}
}
DOWNLOAD
The article on arXiv: http://arxiv.org/abs/1210.2401
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework
Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Telecommunications Software & Systems Group
Waterford Institute of Technology
ICFCA 2012, Leuven, Belgium
Biao Xu et al. Distributed FCA Algorithms
Outline
1. Motivation
   - The Basic Problems of Current FCA Algorithms
   - Related Work
2. Our Solution
   - Adopt Iterative MapReduce Framework
   - FCA Algorithms Adaptation
3. Evaluation
4. Future Work
Applying FCA algorithms in real-world applications
- Time-consuming for large, high-dimensional data.

Table: Execution time of traditional FCA algorithms (in seconds).
  Dataset      mushroom   anon-web    census-income
  Size         8124×125   32711×294   103950×133
  NextClosure  618        14671       18230
  CloseByOne   2543       656         7465

- Hard to deal with distributed databases: data volume, communication, privacy, security.
Little existing work on distributed FCA algorithms
- A distributed version of CloseByOne based on Hadoop MapReduce:
  Petr Krajca et al. Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework. IDA, 2009.
Differences in our work:
- We use an iterative MapReduce runtime, Twister.
- We mine formal concepts in the fewest iterations.
Features of the MapReduce Framework
- Divide-and-conquer strategy: a map function plus a reduce function.

Table: Partitioned datasets S1 = (O_S1, P, I_S1), holding objects 1-3, and S2 = (O_S2, P, I_S2), holding objects 4-6, over the shared attribute set P = {a, b, c, d, e, f, g}.

- Move the algorithm to the nodes rather than moving the datasets.
- Utilize a cluster, not only a single machine.
- Fault tolerance.
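The divide-and-conquer pattern above can be illustrated with a minimal single-machine sketch (not Twister code): each partition is mapped independently into key/value pairs, and the reduce step merges values that share a key. The mapper here, which emits per-attribute object sets, is a hypothetical example, not the paper's mapper.

```python
from collections import defaultdict

def map_phase(partition):
    """Hypothetical mapper: emit (attribute, {object}) pairs for one partition."""
    pairs = []
    for obj, attrs in partition.items():
        for a in attrs:
            pairs.append((a, {obj}))
    return pairs

def reduce_phase(pairs):
    """Reducer: union all object sets that share an attribute key."""
    merged = defaultdict(set)
    for key, objs in pairs:
        merged[key] |= objs
    return dict(merged)

# Assumed toy partitions of one context, split by object.
S1 = {1: {"a", "b"}, 2: {"a"}}
S2 = {3: {"b"}}
result = reduce_phase(map_phase(S1) + map_phase(S2))
print(result)  # attribute extents over the whole context
```

Because each `map_phase` call touches only its own partition, the map work can run on the node that stores that partition, which is the point of moving the algorithm to the data.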
MapReduce Data Flow
Figure: MapReduce data flow. Input splits (Split 0-2) are processed by map tasks; intermediate results are sorted, copied, and merged before the reduce tasks emit the output partitions (Part 0, Part 1).
Twister: an Iterative MapReduce Runtime
- A lightweight MapReduce runtime developed by Indiana University.
- Efficient support for iterative MapReduce computations.

Table: Comparison between Twister and Hadoop
  Twister                         Hadoop
  Long-running map/reduce tasks   Single-step map/reduce
  Native iteration support        Job chaining
  Static & dynamic data           Static data only
Twister Architecture
Figure: Twister architecture. A master node runs the main program and the Twister driver, which handles data distribution, collection, and partition file creation. Each worker node runs a Twister daemon with a worker pool of cacheable map and reduce tasks backed by local disk; master and workers communicate over a pub/sub broker network.
Decomposing the FCA Algorithm
- The map phase produces local concepts F^Y_{Si} on each partition.
- The reduce phase generates global concepts by merging the local concepts from the mappers.

Theorem: Given the closures F^Y_{S1}, ..., F^Y_{Sn} computed from n disjoint partitions, F^Y_S = F^Y_{S1} ∩ ... ∩ F^Y_{Sn}.

We name our algorithms with the prefix MR: MRCbo, MRGanter, MRGanter+.
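The theorem can be checked with a minimal single-machine sketch, assuming a toy context and representing each partition as a dict from object to attribute set (the data and helper names here are illustrative, not the paper's implementation):

```python
def closure(context, Y, attrs):
    """Closure Y'' in a (sub)context given as {object: attribute set}."""
    extent = [o for o, row in context.items() if Y <= row]
    if not extent:                    # empty extent: closure is all attributes,
        return set(attrs)             # the identity element for intersection
    return set.intersection(*(context[o] for o in extent))

P = {"a", "b", "c"}                   # assumed attribute set
S1 = {1: {"a", "b"}, 2: {"a", "c"}}   # two disjoint object partitions
S2 = {3: {"a", "b", "c"}}
full = {**S1, **S2}

for Y in ({"a"}, {"b"}, {"b", "c"}):
    # Reduce step: intersect the local closures from each partition.
    local = closure(S1, Y, P) & closure(S2, Y, P)
    assert local == closure(full, Y, P)
```

The empty-extent case matters: a partition containing no object with all attributes of Y contributes the full attribute set, which leaves the intersection unchanged.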
MRGanter Work Flow
Figure: MRGanter work flow. Each data split is processed by a map task that runs computeClosure(); the paired reduce task merges the local closures (merging()) and checks the result (check()). The driver loops, while(!isLastClosure(Closure)) runMapReduce(), with each mapper emitting key/value pairs (atr_i, localClosure_i) per iteration. Static data is labeled S and dynamic data is labeled D.
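The driver loop iterates Ganter's NextClosure step, one closure per round. A minimal single-machine sketch of that iteration (in MRGanter each closure() call would instead be a MapReduce job intersecting per-partition local closures; the toy context and helper names are assumptions for illustration):

```python
def closure(context, Y, attrs):
    """Closure Y'' in a context given as {object: attribute set}."""
    extent = [o for o, row in context.items() if Y <= row]
    return set(attrs) if not extent else set.intersection(*(context[o] for o in extent))

def all_intents(context, attrs):
    """Enumerate all concept intents in lectic order (Ganter's NextClosure)."""
    order = sorted(attrs)
    A = closure(context, set(), attrs)          # first closure: the closure of ∅
    intents = [A]
    while A != set(attrs):                      # plays the role of isLastClosure()
        for m in reversed(order):               # scan attributes, largest first
            if m in A:
                A = A - {m}                     # truncate A above m
            else:
                B = closure(context, A | {m}, attrs)
                if all(x >= m for x in B - A):  # lectic test: no new attribute below m
                    A = B                       # B is the next closure
                    intents.append(A)
                    break
    return intents

ctx = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}  # assumed toy context
print(all_intents(ctx, {"a", "b", "c"}))
```

Each pass through the while loop corresponds to one MapReduce round in MRGanter, which is why reducing the number of iterations (MRGanter+) matters for distributed performance.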
Running example of MRGanter and MRGanter+ (one iteration table per algorithm).

d    p_i  F1 from S1    F2 from S2   F
∅    g    {c,g}         {b,c,f,g}    {c,g}
     f    {b,d,f}       {f}          {f}
     e    {a,c,e,g}     {d,e}        {e}
     d    {b,d,f}       {d,e}        {d}
     c    {c,g}         {b,c,f,g}    {c,g}
     b    {b,d,f}       {b}          {b}
     a    {a}           {a,d,e,f}    {a}
{f}  g    {b,c,d,f,g}   {b,c,f,g}    {b,c,f,g}
     e    {a,c,e,g}     {d,e}        {e}
     d    {b,d,f}       {d,e}        {d}
     c    {c,g}         {b,c,f,g}    {c,g}
     b    {b,d,f}       {b}          {b}
     a    {a}           {a,d,e,f}    {a}
{e}  g    {a,c,e,g}     {a,...,g}    {a,c,e,g}
     f    {a,...,g}     {a,d,e,f}    {a,d,e,f}
     d    {b,d,f}       {d,e}        {d}
     c    {c,g}         {b,c,f,g}    {c,g}
     b    {b,d,f}       {b}          {b}
     a    {a}           {a,d,e,f}    {a}
{d}  g    {b,c,d,f,g}   {a,...,g}    {b,c,d,f,g}
     f    {b,d,f}       {a,d,e,f}    {d,f}
     e    {a,...,g}     {d,e}        {d,e}
     c    {c,g}         {b,c,f,g}    {c,g}
     b    {b,d,f}       {b}          {b}
     a    {a}           {a,d,e,f}    {a}

d     p_i  F1 from S1    F2 from S2   F
∅     g    {c,g}         {b,c,f,g}    {c,g}
      f    {b,d,f}       {f}          {f}
      e    {a,c,e,g}     {d,e}        {e}
      d    {b,d,f}       {d,e}        {d}
      c    {c,g}         {b,c,f,g}    {c,g}
      b    {b,d,f}       {b}          {b}
      a    {a}           {a,d,e,f}    {a}
{c,g} f    {b,c,d,f,g}   {b,c,f,g}    {b,c,f,g}
      e    {a,c,e,g}     {a,...,g}    {a,c,e,g}
      d    {b,c,d,f,g}   {a,...,g}    {b,c,d,f,g}
      b    {b,d,f}       {b}          {b}
      a    {a}           {a,d,e,f}    {a}
{f}   g    {b,c,d,f,g}   {b,c,f,g}    {b,c,f,g}
      e    {a,c,e,g}     {d,e}        {e}
      d    {b,d,f}       {d,e}        {d}
      c    {c,g}         {b,c,f,g}    {c,g}
      b    {b,d,f}       {b}          {b}
      a    {a}           {a,d,e,f}    {a}
{e}   g    {a,c,e,g}     {a,...,g}    {a,c,e,g}
      f    {a,...,g}     {a,d,e,f}    {a,d,e,f}
      d    {b,d,f}       {d,e}        {d}
      c    {c,g}         {b,c,f,g}    {c,g}
      b    {b,d,f}       {b}          {b}
      a    {a}           {a,d,e,f}    {a}
Efficiency of MR*
Table: Execution time in seconds. For the distributed algorithms the fastest time is reported, with the number of machines in round brackets.

  Dataset      mushroom    anon-web    census-income
  Concepts     219010      129009      96531
  Density      17.36%      1.03%       6.7%
  NextClosure  618         14671       18230
  CloseByOne   2543        656         7465
  MRCbo        241 (11)    693 (11)    803 (11)
  MRGanter     20269 (5)   20110 (3)   9654 (11)
  MRGanter+    198 (9)     496 (9)     358 (11)
Scalability of MR* (1)
Figure: Mushroom dataset: CPU time (seconds, log scale) versus number of nodes for MRGanter+, MRCbo and MRGanter. MRGanter+ outperforms MRCbo and MRGanter when dense data is processed.
Scalability of MR* (2)
Figure: Anon-web dataset: CPU time (seconds, log scale) versus number of nodes for MRGanter+, MRCbo and MRGanter. MRGanter+ is faster when more than 3 nodes are used.
Scalability of MR* (3)
Figure: Census dataset: CPU time (seconds, log scale) versus number of nodes for MRGanter+, MRCbo and MRGanter. MRGanter+ is fastest when a large dataset is processed.
Future Work
- Explore the effect of data distribution between cluster nodes.
- Examine MR* performance with larger dataset sizes.
- Extend our approach by reducing the size of intermediate data.