An Efficient Map-Reduce Algorithm for Computing
Formal Concepts from Binary Data
Lalit Kumar
University of Cincinnati
Finding Concepts in Binary Datasets
Objects (rows) x Attributes (columns)
     a  b  c  d  e  f  g
0    0  1  1  0  1  1  1
1    1  0  1  1  0  0  1
2    1  1  0  0  0  0  0
3    1  0  1  1  0  1  0
4    1  0  1  0  1  0  1
Concept:
The set of objects that share the same value for a certain set of properties.
Formal Concepts: ~Closed Item Sets
Intent {a, c, g}, Extent {1, 4}
     a  c  g
1    1  1  1
4    1  1  1
Intent {c, g}, Extent {0, 1, 4}
     c  g
0    1  1
1    1  1
4    1  1
Intent {c, f}, Extent {0, 3}
     c  f
0    1  1
3    1  1
… …
Too Many Concepts Even in a Sparse Dataset
•List of all concepts in the previous example.
C1 = <{0, 1, 2, 3, 4}, {}>
C2 = <{1, 2, 3, 4}, {a}>
C3 = <{2}, {a, b}>
C4 = <{}, {a, b, c, d, e, f, g}>
C5 = <{1, 3, 4}, {a, c}>
C6 = <{1, 3}, {a, c, d}>
C7 = <{3}, {a, c, d, f}>
C8 = <{1}, {a, c, d, g}>
C9 = <{4}, {a, c, e, g}>
C10 = <{1, 4}, {a, c, g}>
C11 = <{0, 2}, {b}>
C12 = <{0}, {b, c, e, f, g}>
C13 = <{0, 1, 3, 4}, {c}>
C14 = <{0, 4}, {c, e, g}>
C15 = <{0, 3}, {c, f}>
C16 = <{0, 1, 4}, {c, g}>
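The list above can be checked with a brute-force sketch. This is not the algorithm of the talk: it simply closes every subset of objects into an <extent, intent> pair, which is only feasible for a table this small. The names `TABLE`, `common_attributes`, `objects_having` and `all_concepts` are illustrative assumptions.

```python
from itertools import combinations

ATTRIBUTES = "abcdefg"
# Rows of the example table: object id -> bit string over attributes a..g.
TABLE = {
    0: "0110111",
    1: "1011001",
    2: "1100000",
    3: "1011010",
    4: "1010101",
}

def row_intent(bits):
    """Attributes whose bit is 1 in a row."""
    return frozenset(a for a, bit in zip(ATTRIBUTES, bits) if bit == "1")

ROWS = {obj: row_intent(bits) for obj, bits in TABLE.items()}

def common_attributes(objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    intent = frozenset(ATTRIBUTES)
    for obj in objects:
        intent &= ROWS[obj]
    return intent

def objects_having(intent):
    """Objects whose row contains every attribute in `intent`."""
    return frozenset(obj for obj, row in ROWS.items() if intent <= row)

def all_concepts():
    """Brute force: close every subset of objects into an <extent, intent> pair."""
    concepts = set()
    for size in range(len(ROWS) + 1):
        for objects in combinations(ROWS, size):
            intent = common_attributes(objects)
            concepts.add((objects_having(intent), intent))
    return concepts

concepts = all_concepts()
print(len(concepts))  # 16
```

The count matches C1 to C16; for example, the pair <{1, 4}, {a, c, g}> (C10) is in the result.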
Lattice of Concepts
All concepts can be arranged in a lattice using the subset ordering.
Parent Concept: <{1, 3, 4}, {a, c}>    Parent Concept: <{0, 1, 4}, {c, g}>
New Concept: <{1, 4}, {a, c, g}>
Intersection (∩) on Extents, Union (U) on Intents
A concept can be computed from its parents by performing set intersection on their extents and set union on their intents.
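The parent-to-child step on this slide can be written as a two-line sketch. The function name `meet` is an assumption. In general formal concept analysis the union of two intents may still need to be closed against the data; in this example the union {a, c, g} is already the intent of C10.

```python
def meet(parent1, parent2):
    """Combine two <extent, intent> pairs as on the slide:
    intersection on extents, union on intents."""
    extent1, intent1 = parent1
    extent2, intent2 = parent2
    return (extent1 & extent2, intent1 | intent2)

c5 = (frozenset({1, 3, 4}), frozenset("ac"))
c16 = (frozenset({0, 1, 4}), frozenset("cg"))
child = meet(c5, c16)
print(sorted(child[0]), sorted(child[1]))  # [1, 4] ['a', 'c', 'g']
```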
Sufficient Set
[Figure: complete lattice of the concepts C1 to C16; the concepts in the Sufficient Set are shown as green boxes.]
Consequence:
We don’t need to make the entire lattice explicit. A smaller set may be sufficient to generate the rest of the lattice.
Sufficient Set:
A subset of the lattice that is enough to generate the entire lattice; it is not necessarily a minimal set.
Existing Algorithms
•Find all the concepts of the lattice.
•Use DFS-based algorithms.
•Computationally intractable.
•Even Map-Reduce algorithms take a long time and need multiple iterations for the DFS search.
Previous Work
[Figure: search tree rooted at <C1,0>, with nodes such as <C11,2>, <C13,3> and <C2,1>. All the nodes are concepts.]
•Each level of the tree is processed using Map-Reduce to enumerate formal concepts.
•Effectively, it is an implementation of DFS and BFS using Map-Reduce.
Previous Work…
•Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework.
(Authors: Petr Krajca and Vilem Vychodil, 2009)
[Figure, taken from the mentioned paper:]
Starting concept (the concept having either the largest intent or the largest extent)
Iteration#1: map: derive new concepts; reduce: remove redundant concepts; store results (first level of the tree)
Iteration#2: map: derive new concepts; reduce: remove redundant concepts; store results (second level of the tree)
...
This step needs to be performed for each node of the DFS tree.
Previous Work…
•Distributed Formal Concept Analysis Algorithms Based on an Iterative Map-Reduce Framework.
(Authors: Biao Xu, Ruairí de Fréin, Eric Robson, and Mícheál Ó Foghlú, 2012)
[Figure, taken from the mentioned paper:]
While (!isLastClosure(closure)): runMapReduce()
Map (Data Split #1 … Data Split #n): computeClosure() emits <atr1, localClosure1>, <atr2, localClosure2>, …
Reduce (Reducer #1 … Reducer #n): merging(), check()
Output: Closure
Previous Work (Sequential Algorithms)…
•A Biclustering Algorithm for Extracting Bit-Patterns from Binary Datasets.
(Authors: Rodriguez-Baena DS, Perez-Pulido AJ and Aguilar-Ruiz JS, 2011)
•A Fast Algorithm for Computing All Intersections of Objects in a Finite Semi-Lattice.
(Author: Sergei O. Kuznetsov, 1993)
•In-Close, a Fast Algorithm for Computing Formal Concepts.
(Author: Simon Andrews, 2009)
•A Local Approach to Concept Generation.
(Authors: Anne Berry, Jean-Paul Bordat, and Alain Sigayret, 2006)
Our Approach
•Single iteration of Map-Reduce.
•Find only Sufficient Set of concepts.
•Use a single-processor system to enumerate those parts of the lattice that may be of interest.
This helps conquer the complexity.
Our Algorithm
Binary Dataset (input)
→ Map-Reduce (single iteration)
→ Sufficient Set Generation (single-processor machine)
→ Lattice Enumeration (single-processor machine)
→ Complete Lattice (output)
The configurations of the Map-Reduce cluster and of the single-processor machine are detailed later in the presentation.
Map-Reduce
•Phase#1
•Needs only one iteration of Map-Reduce.
•List all the attributes of the dataset with their index.
•Condense the entire dataset to a very small and manageable size without information loss.
•Reducer output can be easily processed to enumerate the required Sufficient Set, from which the entire lattice of Formal Concepts can be generated.
Map-Reduce...
Phase#1 (example)
Objects (rows) x Attributes (columns)
     a  b  c  d  e  f  g
1    1  1  0  1  0  1  0
2    1  0  1  0  1  0  1
3    0  1  1  1  0  1  1
4    0  1  0  1  1  0  0
5    1  0  0  1  1  1  0
6    0  1  1  0  0  1  1
Formal concepts in the Sufficient Set are generated from the input data, and the Sufficient Set is then used to enumerate all the formal concepts in the entire lattice.
Map-Reduce...
Processing flow of Mapper
Input Data: the 6 x 7 binary dataset of the Phase#1 example.
Mapper Output
Key (Intent)    Value (Extent)
a               2
a               5
a, b            1
b               4
b, c            6
b, c, d         3
c               2
d               1
d, e            4
d, e, f         5
e               2
f               1
f, g            3
f, g            6
g               2
Total number of outputs in the table = 15
The key/value pair is flipped: the value is treated as the key and the key as the value.
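The slides do not spell out the mapper rule. The table is consistent with emitting, for each row, one key per maximal run of consecutive 1-valued attributes, with the object id as the value. The sketch below rests on that assumption; `map_row` and `DATASET` are illustrative names, and no Hadoop API is used.

```python
from itertools import groupby

ATTRIBUTES = "abcdefg"

def map_row(object_id, bits):
    """Emit one (intent, object_id) pair per maximal run of consecutive 1s.
    Assumed rule, inferred from the Mapper Output table."""
    pairs = []
    position = 0
    for bit, run in groupby(bits):
        length = len(list(run))
        if bit == "1":
            pairs.append((ATTRIBUTES[position:position + length], object_id))
        position += length
    return pairs

# Rows of the Phase#1 example: object id -> bit string over attributes a..g.
DATASET = {
    1: "1101010",
    2: "1010101",
    3: "0111011",
    4: "0101100",
    5: "1001110",
    6: "0110011",
}

mapper_output = [pair for obj, bits in DATASET.items() for pair in map_row(obj, bits)]
print(len(mapper_output))  # 15
```

Under this assumption the sketch reproduces all 15 pairs of the table, for example ("ab", 1), ("d", 1) and ("f", 1) for object 1.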
Map-Reduce Phase...
Processing flow of Reducer
Input: the Mapper Output (15 key/value pairs).
Reducer Output
Key (Intent)    Value (Extent)
<a>             2,5
<a, b>          1
<b>             4
<b, c>          6
<b, c, d>       3
<c>             2
<d>             1
<d, e>          4
<d, e, f>       5
<e>             2
<f>             1
<f, g>          3,6
<g>             2
Total number of outputs in the table = 13
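The reduce step groups the object ids that share an intent key. A minimal sketch in plain Python, standing in for the shuffle and reduce of the framework; `reduce_pairs` is an illustrative name and the input is copied from the Mapper Output table.

```python
from collections import defaultdict

# Mapper output as listed on the slide: (intent, object id).
MAPPER_OUTPUT = [
    ("a", 2), ("a", 5), ("ab", 1), ("b", 4), ("bc", 6), ("bcd", 3),
    ("c", 2), ("d", 1), ("de", 4), ("def", 5), ("e", 2), ("f", 1),
    ("fg", 3), ("fg", 6), ("g", 2),
]

def reduce_pairs(pairs):
    """Group object ids by intent key, as the shuffle and reduce steps would."""
    grouped = defaultdict(list)
    for intent, object_id in pairs:
        grouped[intent].append(object_id)
    return {intent: sorted(ids) for intent, ids in grouped.items()}

reducer_output = reduce_pairs(MAPPER_OUTPUT)
print(len(reducer_output))   # 13
print(reducer_output["fg"])  # [3, 6]
```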
Sufficient Set Generation
•Phase#2
•Use Reducer output to generate the Sufficient Set.
•Processing is performed on a single-processor machine, as the reducer output is much smaller than the input dataset.
•Enumerate the formal concepts of the Sufficient Set, which can be used to generate the entire lattice.
Sufficient Set Generation...
Processing flow for Sufficient Set generation
Input for Sufficient Set Generation: the Reducer Output (13 entries).
Operations applied: taking “∩” on Intent and “U” on Extent; taking “U” on Intent and “∩” on Extent.
Output after processing
Intent              Extent
<b, c, d, f, g>     3
<a, d, e, f>        5
<a, b, d, f>        1
<b, c>              3,6
<b, c, f, g>        6
<d, e>              4,5
<b, d, e>           4
<f, g>              3,6
<a>                 1,2,5
<b>                 1,3,4,6
<b, d, e>           4
<c>                 2,3,6
<a, c, e, g>        2
<d>                 1,3,4,5
<a, b, d, f>        1
<e>                 2,4,5
<a, c, e, g>        2
<f>                 1,3,5,6
<a, b, d, f>        1
<g>                 2,3,6
<a, c, e, g>        2
Sufficient Set Generation...
Processing flow for Sufficient Set generation…
Output from previous step (markers M1, M2 and R1 to R4 as shown on the slide)
Intent              Extent
<b, c, d, f, g>     3
<a, d, e, f>        5
<a, b, d, f>        1          (R1)
<b, c>              3,6        (M1) (R2)
<b, c, f, g>        6          (M1) (R2)
<d, e>              4,5
<b, d, e>           4          (R3)
<f, g>              3,6        (M1) (R2)
<a>                 1,2,5
<b>                 1,3,4,6
<b, d, e>           4          (R3)
<c>                 2,3,6      (M2)
<a, c, e, g>        2          (R4)
<d>                 1,3,4,5
<a, b, d, f>        1          (R1)
<e>                 2,4,5
<a, c, e, g>        2          (R4)
<f>                 1,3,5,6
<a, b, d, f>        1          (R1)
<g>                 2,3,6      (M2)
<a, c, e, g>        2          (R4)
Operations applied: taking “U” on Intent and “∩” on Extent; taking “∩” on LHS and “U” on RHS.
Output after removal of redundant entries: Sufficient Set
Intent              Extent
<b, c, d, f, g>     3
<a, d, e, f>        5
<a, b, d, f>        1
<b, c, f, g>        3,6
<d, e>              4,5
<b, d, e>           4
<a>                 1,2,5
<b>                 1,3,4,6
<c, g>              2,3,6
<a, c, e, g>        2
<d>                 1,3,4,5
<e>                 2,4,5
<f>                 1,3,5,6
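The merge and redundancy-removal steps of the slide are not reproduced here. As a hedged alternative, the sketch below uses a standard construction from formal concept analysis: it rebuilds each object's row from the reducer output, then collects the object concepts and the attribute concepts, which are known to generate every concept of a lattice. All function names are assumptions. On this example it yields 12 of the 13 entries above; the entry <d, e> 4,5 is not produced by this construction.

```python
from collections import defaultdict

# Reducer output from the slide: intent key -> object ids.
REDUCER_OUTPUT = {
    "a": [2, 5], "ab": [1], "b": [4], "bc": [6], "bcd": [3], "c": [2],
    "d": [1], "de": [4], "def": [5], "e": [2], "f": [1], "fg": [3, 6], "g": [2],
}

def rebuild_rows(reducer_output):
    """Union the intent keys per object to recover each object's full intent."""
    rows = defaultdict(set)
    for key, object_ids in reducer_output.items():
        for obj in object_ids:
            rows[obj] |= set(key)
    return {obj: frozenset(attrs) for obj, attrs in rows.items()}

def object_and_attribute_concepts(reducer_output):
    """Return {intent: extent} for all object concepts and attribute concepts."""
    rows = rebuild_rows(reducer_output)

    def extent_of(intent):
        return frozenset(obj for obj, row in rows.items() if intent <= row)

    def intent_of(extent):
        return frozenset.intersection(*(rows[obj] for obj in extent))

    concepts = {}
    # Object concepts: each row is an intent; its extent is every object containing it.
    for row in rows.values():
        concepts[row] = extent_of(row)
    # Attribute concepts: close each single attribute that occurs in the data.
    for attribute in set().union(*rows.values()):
        extent = extent_of(frozenset(attribute))
        concepts[intent_of(extent)] = extent
    return concepts

generators = object_and_attribute_concepts(REDUCER_OUTPUT)
print(len(generators))  # 12
```

That the rows can be rebuilt from the reducer output alone illustrates the "without information loss" claim for this example.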
Lattice Enumeration
•Phase#3
•Use the Sufficient Set to enumerate all the Formal Concepts in the lattice.
•Selective generation of formal concepts using the Sufficient Set.
•Since the Sufficient Set is very small relative to the given dataset, it can be stored and processed to generate the lattice as and when required.
Lattice Enumeration...
Processing flow for Complete Lattice enumeration
Input: the Sufficient Set (13 entries) produced in Phase#2.
Operations applied: taking “∩” on Intent and “U” on Extent; taking “U” on Intent and “∩” on Extent.
Intermediate results ((SS) markers as shown on the slide)
Intent              Extent
<d, f>              3,5
<b, d, f>           1,3
<b, c, f, g>        3,6        (SS)
<d>                 3,4,5      (SS)
<b, d>              3,4
<b>                 1,3,4,6    (SS)
<c, g>              2,3,6      (SS)
<c, g>              2,3        (SS)
<d>                 1,3,4,5    (SS)
<f>                 1,3,5,6    (SS)
<b, c, g>           3,6        (SS)
<b, d>              1,3,4
<b, e>              4          (SS)
<b, f>              1,3,6
<d, e>              4,5        (SS)
<d, f>              1,3,5
Output after processing
Intent              Extent
<b, c, d, f, g>     3
<a, d, e, f>        5
<a, b, d, f>        1
<b, c, f, g>        3,6
<d, e>              4,5
<b, d, e>           4
<a>                 1,2,5
<b>                 1,3,4,6
<c, g>              2,3,6
<a, c, e, g>        2
<d>                 1,3,4,5
<e>                 2,4,5
<f>                 1,3,5,6
<d, f>              3,5
<b, d, f>           1,3
<b, d>              3,4
<a, d, f>           1,5
<a, e>              2,5
<b, f>              1,3,6
<b, d>              1,4
<a, d>              1,5
<a, e>              2,5
<a, f>              1,5
<b, d>              1,3,4
<b, f>              1,3,6
<d, f>              1,3,5
Lattice Enumeration...
Processing flow for Complete Lattice enumeration…
Output from previous step: the 26 entries listed under "Output after processing".
Complete Lattice
Intent                     Extent
<b, c, d, f, g>            3
<a, d, e, f>               5
<a, b, d, f>               1
<b, c, f, g>               3,6
<d, e>                     4,5
<b, d, e>                  4
<a>                        1,2,5
<b>                        1,3,4,6
<c, g>                     2,3,6
<a, c, e, g>               2
<d>                        1,3,4,5
<e>                        2,4,5
<f>                        1,3,5,6
<d, f>                     1,3,5
<b, f>                     1,3,6
<b, d>                     1,3,4
<b, d, f>                  1,3
<a, e>                     2,5
<a, d, f>                  1,5
<a, b, c, d, e, f, g>      Empty
<> (Empty)                 1,2,3,4,5,6
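One way to read the enumeration step is as a fixed-point closure: repeatedly combine two entries by taking “∩” on their intents and “U” on their extents, merging entries that share an intent, until nothing changes. The sketch below follows that reading. The loop structure and the names `SUFFICIENT_SET` and `enumerate_lattice` are assumptions, not the authors' implementation.

```python
from itertools import combinations

# Sufficient Set from the slide: intent -> extent.
SUFFICIENT_SET = {
    "bcdfg": {3}, "adef": {5}, "abdf": {1}, "bcfg": {3, 6}, "de": {4, 5},
    "bde": {4}, "a": {1, 2, 5}, "b": {1, 3, 4, 6}, "cg": {2, 3, 6},
    "aceg": {2}, "d": {1, 3, 4, 5}, "e": {2, 4, 5}, "f": {1, 3, 5, 6},
}

def enumerate_lattice(sufficient_set, all_attributes):
    """Close the set under (intent intersection, extent union) until stable."""
    lattice = {frozenset(i): frozenset(e) for i, e in sufficient_set.items()}
    changed = True
    while changed:
        changed = False
        for (i1, e1), (i2, e2) in combinations(list(lattice.items()), 2):
            intent = i1 & i2
            extent = lattice.get(intent, frozenset()) | e1 | e2
            if lattice.get(intent) != extent:
                lattice[intent] = extent
                changed = True
    # The concept with every attribute and an empty extent is added explicitly.
    lattice.setdefault(frozenset(all_attributes), frozenset())
    return lattice

lattice = enumerate_lattice(SUFFICIENT_SET, "abcdefg")
print(len(lattice))  # 21
```

The result has the same 21 intent/extent pairs as the Complete Lattice table, and the original dataset is never read.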
Test Setup
Distributed Computing Configuration
Total Machines        4
Processor             Intel Xeon (64 bit)
Ethernet Card         100Mbps
Main Memory (RAM)     4GB (each machine)
Operating System      CentOS 6.5 (64 bit)
Hadoop Version        2.6
Java Version          1.7 (Oracle)
Java IDE              NetBeans (Version 8.2)
Hard Disk Drive       500GB (each machine)
Stand-Alone Machine Configuration
Processor             Intel Xeon (64 bit)
Main Memory (RAM)     8GB
Operating System      CentOS 7 (64 bit)
Python Version        2.7
Hard Disk Drive       500GB
Results
Dataset                                                      Mushroom               Anon-Web               Census Income
NextClosure (Sequential)                                     618 sec                14671 sec              18230 sec
CloseByOne (Sequential)                                      2543 sec               656 sec                7465 sec
MRGanter (Map-Reduce)                                        20269 sec (5 nodes)    20110 sec (3 nodes)    9654 sec (11 nodes)
MRCbo (Map-Reduce)                                           241 sec (11 nodes)     693 sec (11 nodes)     803 sec (11 nodes)
MRGanter+ (Map-Reduce)                                       198 sec (9 nodes)      496 sec (9 nodes)      358 sec (9 nodes)
Our Algorithm (Map-Reduce)                                   42 (10 nodes)          26 (10 nodes)          69 (nodes)
Our Algorithm (Enumeration of Sufficient Set) (Single M/c)   6 sec                  24 sec                 97 sec
Our Algorithm (Enumeration of Entire Lattice) (Single M/c)   291 sec                165 sec                653 sec
Number of Concepts in Sufficient Set                         117                    365                    147
All the datasets are taken from the UCI repository.
Scalability Test
Execution Time for Plant Dataset (22632 rows x 70 columns)
Dataset Size                    Execution time (sec) for Map-Reduce    Sufficient Set Generation time (sec)    Number of Concepts in Sufficient Set
22632 x 70                      22                                     64                                      661
226320 x 70 (Dataset*10)        64                                     780                                     661
452640 x 70 (Dataset*20)        206                                    1773                                    661
678960 x 70 (Dataset*30)        441                                    2898                                    661
905280 x 70 (Dataset*40)        874                                    3982                                    661
1131600 x 70 (Dataset*50)       1480                                   5334                                    661
The Plant dataset is taken from the UCI repository.
Results with 2 different data densities
Comparison of Our Algorithm with BiBit Algorithm
Input data density = 10%
#Rows     #Columns    Time (sec) by BiBit    Total Time (sec) by Our Algorithm
20        50          0.01                   17
50        50          0.01                   18
100       50          0.79                   19
500       50          10.3                   20
1000      50          16.8                   21
5000      50          66.3                   22
10000     50          128.5                  29
Input data density = 60%
#Rows     #Columns    Time (sec) by BiBit    Total Time (sec) by Our Algorithm
20        50          4.2                    17
50        50          5.0                    17
100       50          7.3                    18
500       50          40.5                   18
1000      50          67.7                   19
5000      50          316.8                  29
10000     50          633.8                  53
Our Contribution
•The output generated by our algorithm is much smaller than the given dataset (without information loss).
•The generated Sufficient Set contains all the information needed to generate the complete lattice.
•The Sufficient Set can also be used to generate concepts of a desired length without revisiting the original dataset.
Questions?

More Related Content

What's hot

Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Alexander Litvinenko
 
Comparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimizationComparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimizationPranamesh Chakraborty
 
Appendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution modelAppendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution modelCora Li
 
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...IJERA Editor
 
Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...Cora Li
 
Fast Algorithm for Computing the Discrete Hartley Transform of Type-II
Fast Algorithm for Computing the Discrete Hartley Transform of Type-IIFast Algorithm for Computing the Discrete Hartley Transform of Type-II
Fast Algorithm for Computing the Discrete Hartley Transform of Type-IIijeei-iaes
 
Application of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationApplication of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationPranamesh Chakraborty
 
GATE Computer Science Solved Paper 2004
GATE Computer Science Solved Paper 2004GATE Computer Science Solved Paper 2004
GATE Computer Science Solved Paper 2004Rohit Garg
 
A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...Alexander Decker
 
Non-local Neural Network
Non-local Neural NetworkNon-local Neural Network
Non-local Neural NetworkHiroshi Fukui
 
Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...
Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...
Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...IJNSA Journal
 
Communication systems solution manual 5th edition
Communication systems solution manual 5th editionCommunication systems solution manual 5th edition
Communication systems solution manual 5th editionTayeen Ahmed
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structuresDavid Gleich
 
YaPingPresentation
YaPingPresentationYaPingPresentation
YaPingPresentationYa-Ping Wang
 
Move your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeMove your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeJeffrey Breen
 
論文紹介 Fast imagetagging
論文紹介 Fast imagetagging論文紹介 Fast imagetagging
論文紹介 Fast imagetaggingTakashi Abe
 

What's hot (19)

Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...Identification of unknown parameters and prediction of missing values. Compar...
Identification of unknown parameters and prediction of missing values. Compar...
 
Comparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimizationComparative study of algorithms of nonlinear optimization
Comparative study of algorithms of nonlinear optimization
 
Appendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution modelAppendix of heterogeneous cellular network user distribution model
Appendix of heterogeneous cellular network user distribution model
 
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
FPGA Implementation of A New Chien Search Block for Reed-Solomon Codes RS (25...
 
Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...Appendix of downlink coverage probability in heterogeneous cellular networks ...
Appendix of downlink coverage probability in heterogeneous cellular networks ...
 
Fast Algorithm for Computing the Discrete Hartley Transform of Type-II
Fast Algorithm for Computing the Discrete Hartley Transform of Type-IIFast Algorithm for Computing the Discrete Hartley Transform of Type-II
Fast Algorithm for Computing the Discrete Hartley Transform of Type-II
 
Gate-Cs 2006
Gate-Cs 2006Gate-Cs 2006
Gate-Cs 2006
 
Gate-Cs 2009
Gate-Cs 2009Gate-Cs 2009
Gate-Cs 2009
 
Application of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimizationApplication of recursive perturbation approach for multimodal optimization
Application of recursive perturbation approach for multimodal optimization
 
GATE Computer Science Solved Paper 2004
GATE Computer Science Solved Paper 2004GATE Computer Science Solved Paper 2004
GATE Computer Science Solved Paper 2004
 
A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...A common unique random fixed point theorem in hilbert space using integral ty...
A common unique random fixed point theorem in hilbert space using integral ty...
 
Non-local Neural Network
Non-local Neural NetworkNon-local Neural Network
Non-local Neural Network
 
Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...
Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...
Encryption Quality Analysis and Security Evaluation of CAST-128 Algorithm and...
 
Communication systems solution manual 5th edition
Communication systems solution manual 5th editionCommunication systems solution manual 5th edition
Communication systems solution manual 5th edition
 
Iterative methods with special structures
Iterative methods with special structuresIterative methods with special structures
Iterative methods with special structures
 
F31 book-arith-pres-pt3
F31 book-arith-pres-pt3F31 book-arith-pres-pt3
F31 book-arith-pres-pt3
 
YaPingPresentation
YaPingPresentationYaPingPresentation
YaPingPresentation
 
Move your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R codeMove your data (Hans Rosling style) with googleVis + 1 line of R code
Move your data (Hans Rosling style) with googleVis + 1 line of R code
 
論文紹介 Fast imagetagging
論文紹介 Fast imagetagging論文紹介 Fast imagetagging
論文紹介 Fast imagetagging
 

Similar to LalitBDA2015V3

Graph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphGraph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphAndrew Yongjoon Kong
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideWooSung Choi
 
K means clustering
K means clusteringK means clustering
K means clusteringAhmedasbasb
 
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...IRJET Journal
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介Masayuki Matsushita
 
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Ruairi de Frein
 
Descriptive analytics in r programming language
Descriptive analytics in r programming languageDescriptive analytics in r programming language
Descriptive analytics in r programming languageAshwini Mathur
 
Digital electronics k map comparators and their function
Digital electronics k map comparators and their functionDigital electronics k map comparators and their function
Digital electronics k map comparators and their functionkumarankit06875
 
DeepXplore: Automated Whitebox Testing of Deep Learning
DeepXplore: Automated Whitebox Testing of Deep LearningDeepXplore: Automated Whitebox Testing of Deep Learning
DeepXplore: Automated Whitebox Testing of Deep LearningMasahiro Sakai
 
Pert 05 aplikasi clustering
Pert 05 aplikasi clusteringPert 05 aplikasi clustering
Pert 05 aplikasi clusteringaiiniR
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with RYanchang Zhao
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R台灣資料科學年會
 
Massively distributed environments and closed itemset mining
Massively distributed environments and closed itemset miningMassively distributed environments and closed itemset mining
Massively distributed environments and closed itemset miningMehdi Zitouni
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfcookie1969
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUsSri Ambati
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenarioNaresh Bala
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classificationYanchang Zhao
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
 
Introduction to julia
Introduction to juliaIntroduction to julia
Introduction to julia岳華 杜
 

Similar to LalitBDA2015V3 (20)

Graph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraphGraph analysis platform comparison, pregel/goldenorb/giraph
Graph analysis platform comparison, pregel/goldenorb/giraph
 
An optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slideAn optimal and progressive algorithm for skyline queries slide
An optimal and progressive algorithm for skyline queries slide
 
K means clustering
K means clusteringK means clustering
K means clustering
 
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
Breaking a Stick to form a Pentagon with Positive Integers using Programming ...
 
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
クラウドDWHとしても進化を続けるPivotal Greenplumご紹介
 
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...
 
Descriptive analytics in r programming language
Descriptive analytics in r programming languageDescriptive analytics in r programming language
Descriptive analytics in r programming language
 
Digital electronics k map comparators and their function
Digital electronics k map comparators and their functionDigital electronics k map comparators and their function
Digital electronics k map comparators and their function
 
DeepXplore: Automated Whitebox Testing of Deep Learning
DeepXplore: Automated Whitebox Testing of Deep LearningDeepXplore: Automated Whitebox Testing of Deep Learning
DeepXplore: Automated Whitebox Testing of Deep Learning
 
Pert 05 aplikasi clustering
Pert 05 aplikasi clusteringPert 05 aplikasi clustering
Pert 05 aplikasi clustering
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
 
[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R[系列活動] Data exploration with modern R
[系列活動] Data exploration with modern R
 
Massively distributed environments and closed itemset mining
Massively distributed environments and closed itemset miningMassively distributed environments and closed itemset mining
Massively distributed environments and closed itemset mining
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
Intro to Machine Learning for GPUs
Intro to Machine Learning for GPUsIntro to Machine Learning for GPUs
Intro to Machine Learning for GPUs
 
Datastage real time scenario
Datastage real time scenarioDatastage real time scenario
Datastage real time scenario
 
RDataMining slides-regression-classification
RDataMining slides-regression-classificationRDataMining slides-regression-classification
RDataMining slides-regression-classification
 
R programming language
R programming languageR programming language
R programming language
 
Scalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduceScalable and Adaptive Graph Querying with MapReduce
Scalable and Adaptive Graph Querying with MapReduce
 
Introduction to julia
Introduction to juliaIntroduction to julia
Introduction to julia
 

LalitBDA2015V3

  • 1. An Efficient Map-Reduce Algorithm for Computing Formal Concepts from Binary Data Lalit Kumar University of Cincinnati
  • 2. Finding Concepts in Binary Datasets a b c d e f g 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 2 1 1 0 0 0 0 0 3 1 0 1 1 0 1 0 4 1 0 1 0 1 0 1 Objects Attributes Concept: Is the set of objects sharing the same value for a certain set of properties. Formal Concepts: ~Closed Item Sets a c g 1 1 1 1 4 1 1 1 Intent Extent c g 0 1 1 1 1 1 4 1 1 Intent Extent c f 0 1 1 3 1 1 Intent Extent … …
  • 3. Too Many Concepts Even in a Sparse Dataset •List of all concepts in the previous example. C1 = <{0, 1, 2, 3, 4}, {}> C2 = <{1, 2, 3, 4}, {a}> C3 = <{2}, {a, b}> C4 = <{}, {a, b, c, d, e, f, g}> C5 = <{1, 3, 4}, {a, c}> C6 = <{1, 3}, {a, c, d}> C7 = <{3}, {a, c, d, f}> C8 = <{1}, {a, c, d, g}> C9 = <{4}, {a, c, e, g}> C10 = <{1, 4}, {a, c, g}> C11 = <{0, 2}, {b}> C12 = <{0}, {b, c, e, f, g}> C13 = <{0, 1, 3, 4}, {c}> C14 = <{0, 4}, {c, e, g}> C15 = <{0, 3}, {c, f}> C16 = <{0, 1, 4}, {c, g}>
  • 4. Lattice of Concepts All concepts can be arranged in a lattice using the subset ordering <{1, 3, 4}, {a, c}> <{0, 1, 4}, {c, g}> <{1, 4}, {a, c, g}> Intersection (∩) on Extents Union (U) on Intents Concept can be computed from its parents by performing Set Union and Set Intersection operation. Parent Concept Parent Concept New Concept
  • 5. Sufficient Set <C1> <C2> <C11> <C13> <C15> <C3> <C5> <C12> <C16> <C10> <C14> <C6> <C9> <C4> <C8> <C7> Complete Lattice Concepts in Sufficient Set (green boxes) Consequence: We don’t need to make explicit the entire lattice. A smaller set may be sufficient to generate the rest of the lattice Sufficient Set: Subset of lattice required to generate entire lattice but not necessarily a minimal set.
  • 6. Existing Algorithms •Find all the concepts of the lattice. •Use DFS based algorithms. •Computationally intractable. •Even Map-Reduce algorithms takes long time and multiple iteration in DFS search.
  • 7. Previous Work <C1,0> <C11,2> <C13,3> <C2,1>C6 C15 C16 C14 1 3 5 2 6 4 0 <C12,3>C4 C12 C12 C12 C4 <C15,6> C12 <C14,5> C12 C6 <C16,7> 3 5 2 4 6 5 3 4 6 3 6 5 <C3,2>C9 C10 C7 C6 <C5,3> C4 C4 C4 C4 <C3,2> <C9,5> <C10,7> <C6,4> C7 C4 … … … … … … •Each level of the tree is processed using Map-Reduce to enumerate formal concept •Effectively, it’s implementation of DFS and BFS using Map-Reduce All the nodes are concepts
  • 8. Previous Work… •Distributed Algorithm for computing formal concepts using Map-Reduce Framework. (Author: Pert Krajca and Vilem Vychodil, 2009) Starting concept (concept having either largest intent or extent) map: derive new concepts reduce: remove redundant concepts map: derive new concepts reduce: remove redundant concepts . . . Iteration#1 Iteration#2 Store results (first level of the tree) Store results (second level of the tree) * Figure is taken from the mentioned paper This step needs to be performed for each node of the DFS tree
  • 9. Previous Work… •Distributed Formal Concept Analysis Algorithms Based on an Iterative Map-Reduce Framework. (Author: Biao Xu, Ruair´ı de Fr´ein, Eric Robson, and M´ıche´al O Foghl´u, 2012) While (!isLastClosure(closure)) Map computeClosure() * Figure is taken from the mentioned paper runMapReduce() Data Split # 1 Data Split # n Map computeClosure() S S D D ………….. <atr1 , localClosure1> <atr2 , localClosure2> <atri , localClosurei> <atr1 , localClosure1> <atr2 , localClosure2> <atrj , localClosurej> Reducer#1 merging() check() Reducer# n merging() check() ………….. Closure D
  • 10. Previous Work (Sequential Algorithms)… •A Biclustering algorithm for extracting bit-patterns from binary datasets. (Author: Rodriguez-Baena DS, Perez-Pulido AJ and Aguilar-Ruiz JS, 2011) •A Fast Algorithm for Computing All Intersections of Objects in a Finite Semi-Lattice (Sergei O. Kuznetsov, 1993) •In-Close, a Fast Algorithm for Computing Formal Concepts. (Author: Simon Andrews, 2009) •A Local Approach to Concept Generation. (Anne Berry, Jean-Paul Bordat, and Alain Sigayret, 2006)
  • 11. Our Approach •Single iteration of Map-Reduce. •Find only Sufficient Set of concepts. •Use single processor system to enumerate those parts of lattice that may be of interest. This helps conquer the complexity
  • 12. Our Algorithm
      Binary Dataset (input)
        -> Map-Reduce (single iteration)
        -> Sufficient Set Generation (single-processor machine)
        -> Lattice Enumeration (single-processor machine)
        -> Complete Lattice (output)
      The configuration of the Map-Reduce cluster and of the single-processor machine is detailed later in the presentation.
  • 13. Map-Reduce
      • Phase#1
      • Needs only one iteration of Map-Reduce.
      • Lists all the attributes of the dataset with their index.
      • Condenses the entire dataset to a much smaller, manageable size without information loss.
      • The reducer output can easily be processed to enumerate the required Sufficient Set, from which the entire lattice of formal concepts can be generated.
  • 14. Map-Reduce... Phase#1 (example)
      Input data (rows are objects, columns are attributes):
            a  b  c  d  e  f  g
        1   1  1  0  1  0  1  0
        2   1  0  1  0  1  0  1
        3   0  1  1  1  0  1  1
        4   0  1  0  1  1  0  0
        5   1  0  0  1  1  1  0
        6   0  1  1  0  0  1  1
      Formal concepts in the Sufficient Set are generated from the input data, and the Sufficient Set is then used to enumerate all the formal concepts in the entire lattice.
  • 15. Map-Reduce... Processing flow of Mapper
      Input data: the 6 x 7 binary table of the Phase#1 example (objects 1 to 6, attributes a to g).
      Mapper output:
        Key (Intent)   Value (Extent)
        a              2
        a              5
        a, b           1
        b              4
        b, c           6
        b, c, d        3
        c              2
        d              1
        d, e           4
        d, e, f        5
        e              2
        f              1
        f, g           3
        f, g           6
        g              2
      Total number of outputs in the table = 15.
      The key/value pair is flipped: the value is treated as the key and the key as the value.
  • 16. Map-Reduce Phase... Processing flow of Reducer
      Input: the 15 key/value pairs of the mapper output.
      Reducer output:
        Key (Intent)   Value (Extent)
        <a>            2,5
        <a, b>         1
        <b>            4
        <b, c>         6
        <b, c, d>      3
        <c>            2
        <d>            1
        <d, e>         4
        <d, e, f>      5
        <e>            2
        <f>            1
        <f, g>         3,6
        <g>            2
      Total number of outputs in the table = 13.
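
The mapper and reducer tables above can be reproduced with a short sketch. It is not the author's Hadoop code: it assumes, from the example tables only, that the mapper emits one key per maximal run of consecutive 1s in a row, and that the reducer simply groups objects by key. The names `ATTRS`, `ROWS`, `mapper`, and `reducer` are hypothetical.

```python
from collections import defaultdict
from itertools import groupby

ATTRS = "abcdefg"
# Example dataset from the slides: object id -> bit string over attributes a..g.
ROWS = {
    1: "1101010",
    2: "1010101",
    3: "0111011",
    4: "0101100",
    5: "1001110",
    6: "0110011",
}

def mapper(obj, bits):
    """Yield (intent, object) for every maximal run of consecutive 1s.

    Assumption: this run-based rule is inferred from the example tables,
    not stated explicitly in the slides.
    """
    col = 0
    for bit, run in groupby(bits):
        length = len(list(run))
        if bit == "1":
            yield tuple(ATTRS[col:col + length]), obj
        col += length

def reducer(pairs):
    """Group the objects that share the same intent key."""
    grouped = defaultdict(list)
    for intent, obj in pairs:
        grouped[intent].append(obj)
    return dict(grouped)

mapped = [pair for obj, bits in ROWS.items() for pair in mapper(obj, bits)]
reduced = reducer(mapped)

for intent, extent in sorted(reduced.items()):
    print("<" + ", ".join(intent) + ">", ",".join(map(str, extent)))
```

Under this assumption the sketch yields 15 mapper pairs and 13 reducer entries, matching the totals stated on the two slides.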
  • 17. Sufficient Set Generation
      • Phase#2
      • Use the reducer output to generate the Sufficient Set.
      • Processing is performed on a single-processor machine, as the reducer output is much smaller than the input dataset.
      • Enumerate the formal concepts of the Sufficient Set, which can be used to generate the entire lattice.
  • 18. Sufficient Set Generation... Processing flow for Sufficient Set generation
      Input for Sufficient Set generation: the 13 reducer output entries (<a> 2,5 … <g> 2).
      The figure applies two kinds of steps: taking "∩" on Intent and "U" on Extent, and taking "U" on Intent and "∩" on Extent.
      Output after processing:
        Intent             Extent
        <b, c, d, f, g>    3
        <a, d, e, f>       5
        <a, b, d, f>       1
        <b, c>             3,6
        <b, c, f, g>       6
        <d, e>             4,5
        <b, d, e>          4
        <f, g>             3,6
        <a>                1,2,5
        <b>                1,3,4,6
        <b, d, e>          4
        <c>                2,3,6
        <a, c, e, g>       2
        <d>                1,3,4,5
        <a, b, d, f>       1
        <e>                2,4,5
        <a, c, e, g>       2
        <f>                1,3,5,6
        <a, b, d, f>       1
        <g>                2,3,6
        <a, c, e, g>       2
  • 19. Sufficient Set Generation... Processing flow for Sufficient Set generation…
      Output from the previous step. The tags (R1) to (R4) and (M1), (M2) mark groups of entries that are combined when redundant entries are removed:
        Intent             Extent
        <b, c, d, f, g>    3
        <a, d, e, f>       5
        <a, b, d, f>       1         (R1)
        <b, c>             3,6       (M1) (R2)
        <b, c, f, g>       6         (M1) (R2)
        <d, e>             4,5
        <b, d, e>          4         (R3)
        <f, g>             3,6       (M1) (R2)
        <a>                1,2,5
        <b>                1,3,4,6
        <b, d, e>          4         (R3)
        <c>                2,3,6     (M2)
        <a, c, e, g>       2         (R4)
        <d>                1,3,4,5
        <a, b, d, f>       1         (R1)
        <e>                2,4,5
        <a, c, e, g>       2         (R4)
        <f>                1,3,5,6
        <a, b, d, f>       1         (R1)
        <g>                2,3,6     (M2)
        <a, c, e, g>       2         (R4)
      The figure labels the steps: taking "U" on Intent and "∩" on Extent; taking "∩" on LHS and "U" on RHS.
      Output after removal of redundant entries (the Sufficient Set):
        Intent             Extent
        <b, c, d, f, g>    3
        <a, d, e, f>       5
        <a, b, d, f>       1
        <b, c, f, g>       3,6
        <d, e>             4,5
        <b, d, e>          4
        <a>                1,2,5
        <b>                1,3,4,6
        <c, g>             2,3,6
        <a, c, e, g>       2
        <d>                1,3,4,5
        <e>                2,4,5
        <f>                1,3,5,6
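
One way to reproduce the 13 entries of the Sufficient Set from the reducer output is sketched below. It is an assumption rather than the author's exact procedure: it rebuilds each object's attribute set from the reducer entries, then applies the standard formal-concept derivation operators (extent of an attribute set, intent of an object set) to every reducer key and to every object. The names `REDUCED`, `extent`, `intent`, and `closure` are hypothetical.

```python
from collections import defaultdict

# Reducer output from the example: intent key -> objects emitted with that key.
REDUCED = {
    "a": [2, 5], "ab": [1], "b": [4], "bc": [6], "bcd": [3], "c": [2],
    "d": [1], "de": [4], "def": [5], "e": [2], "f": [1], "fg": [3, 6], "g": [2],
}
ALL_ATTRS = frozenset("abcdefg")

# Rebuild each object's attributes as the union of the keys it appears under,
# so the original dataset does not have to be read again.
rows = defaultdict(set)
for key, objs in REDUCED.items():
    for obj in objs:
        rows[obj] |= set(key)

def extent(attrs):
    """All objects that have every attribute in attrs."""
    return frozenset(o for o, row in rows.items() if set(attrs) <= row)

def intent(objs):
    """All attributes shared by every object in objs."""
    shared = set(ALL_ATTRS)
    for o in objs:
        shared &= rows[o]
    return frozenset(shared)

def closure(attrs):
    """Smallest formal concept (intent, extent) whose intent contains attrs."""
    ext = extent(attrs)
    return intent(ext), ext

# Close every reducer key and every object's attribute set; duplicates collapse.
sufficient = {closure(key) for key in REDUCED}
sufficient |= {closure(row) for row in rows.values()}

for i, e in sorted(sufficient, key=lambda c: sorted(c[0])):
    print("<" + ", ".join(sorted(i)) + ">", ",".join(map(str, sorted(e))))
```

On the example data this produces the same 13 intent/extent pairs as the table above, for instance `<c, g>` with extent 2,3,6 arising from both `<c>` and `<g>`.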
  • 20. Lattice Enumeration
      • Phase#3
      • Use the Sufficient Set to enumerate all the formal concepts in the lattice.
      • Formal concepts can be generated selectively from the Sufficient Set.
      • Since the Sufficient Set is very small relative to the given dataset, it can be stored and processed to generate the lattice as and when required.
  • 21. Lattice Enumeration... Processing flow for Complete Lattice enumeration
      Input (the Sufficient Set):
        Intent             Extent
        <b, c, d, f, g>    3
        <a, d, e, f>       5
        <a, b, d, f>       1
        <b, c, f, g>       3,6
        <d, e>             4,5
        <b, d, e>          4
        <a>                1,2,5
        <b>                1,3,4,6
        <c, g>             2,3,6
        <a, c, e, g>       2
        <d>                1,3,4,5
        <e>                2,4,5
        <f>                1,3,5,6
      The figure applies two kinds of steps: taking "∩" on Intent and "U" on Extent, and taking "U" on Intent and "∩" on Extent.
      Output after processing: the 13 Sufficient Set entries above, followed by the newly derived entries:
        Intent             Extent
        <d, f>             3,5
        <b, d, f>          1,3
        <b, d>             3,4
        <a, d, f>          1,5
        <a, e>             2,5
        <b, f>             1,3,6
        <b, d>             1,4
        <a, d>             1,5
        <a, e>             2,5
        <a, f>             1,5
        <b, d>             1,3,4
        <b, f>             1,3,6
        <d, f>             1,3,5
      Figure annotations (entries marked (SS) are in the Sufficient Set): <d, f> <3,5>; <b, d, f> <1,3>; <b, c, f, g> <3,6>(SS); <d> <3,4,5>(SS); <b, d> <3,4>; <b> <1,3,4,6>(SS); <c, g> <2,3,6>(SS); <c, g> <2,3>(SS); <d> <1,3,4,5>(SS); <f> <1,3,5,6>(SS); <b, c, g> <3,6>(SS); <b, d> <1,3,4>; <b, e> <4>(SS); <b, f> <1,3,6>; <d, e> <4,5>(SS); <d, f> <1,3,5>.
  • 22. Lattice Enumeration... Processing flow for Complete Lattice enumeration…
      Output from the previous step: the 13 Sufficient Set entries plus the 13 derived entries (<d, f> 3,5; <b, d, f> 1,3; <b, d> 3,4; <a, d, f> 1,5; <a, e> 2,5; <b, f> 1,3,6; <b, d> 1,4; <a, d> 1,5; <a, e> 2,5; <a, f> 1,5; <b, d> 1,3,4; <b, f> 1,3,6; <d, f> 1,3,5).
      Complete Lattice (after processing):
        Intent                   Extent
        <b, c, d, f, g>          3
        <a, d, e, f>             5
        <a, b, d, f>             1
        <b, c, f, g>             3,6
        <d, e>                   4,5
        <b, d, e>                4
        <a>                      1,2,5
        <b>                      1,3,4,6
        <c, g>                   2,3,6
        <a, c, e, g>             2
        <d>                      1,3,4,5
        <e>                      2,4,5
        <f>                      1,3,5,6
        <d, f>                   1,3,5
        <b, f>                   1,3,6
        <b, d>                   1,3,4
        <b, d, f>                1,3
        <a, e>                   2,5
        <a, d, f>                1,5
        <a, b, c, d, e, f, g>    Empty
        Empty <>                 1,2,3,4,5,6
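
The step from the Sufficient Set to the complete lattice can be sketched as a fixpoint computation. This is an illustration under stated assumptions, not the author's implementation: it repeatedly intersects the intents of two known entries, takes the union of their extents, merges entries that share an intent, and finally adds the concept with the full attribute set and an empty extent. The names `SUFFICIENT` and `enumerate_lattice` are hypothetical.

```python
from itertools import combinations

# Sufficient Set from the slides: (intent, extent) pairs.
SUFFICIENT = [
    ("bcdfg", {3}), ("adef", {5}), ("abdf", {1}), ("bcfg", {3, 6}),
    ("de", {4, 5}), ("bde", {4}), ("a", {1, 2, 5}), ("b", {1, 3, 4, 6}),
    ("cg", {2, 3, 6}), ("aceg", {2}), ("d", {1, 3, 4, 5}),
    ("e", {2, 4, 5}), ("f", {1, 3, 5, 6}),
]

def enumerate_lattice(sufficient, all_attrs):
    """Close the entries under intent intersection, merging extents by union."""
    concepts = {frozenset(i): set(e) for i, e in sufficient}
    changed = True
    while changed:
        changed = False
        # combinations() snapshots the current entries before the loop body runs.
        for (i1, e1), (i2, e2) in combinations(list(concepts.items()), 2):
            meet = i1 & i2      # "∩" on intent
            ext = e1 | e2       # "U" on extent
            if not ext <= concepts.get(meet, set()):
                concepts.setdefault(meet, set()).update(ext)
                changed = True
    # The concept holding every attribute; no object in the example has them all.
    concepts.setdefault(frozenset(all_attrs), set())
    return concepts

lattice = enumerate_lattice(SUFFICIENT, "abcdefg")
for i, e in sorted(lattice.items(), key=lambda c: (len(c[0]), sorted(c[0]))):
    print("<" + ", ".join(sorted(i)) + ">", sorted(e) or "Empty")
```

Partial extents such as `<b, d>` 1,4 and `<b, d>` 3,4 are merged into `<b, d>` 1,3,4 during the loop, which mirrors the intermediate rows shown on the slides. The sketch works only from the Sufficient Set and never reads the original dataset.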
  • 23. Test Setup
      Distributed Computing Configuration
        Total Machines        4
        Processor             Intel Xeon (64 bit)
        Ethernet Card         100Mbps
        Main Memory (RAM)     4GB (each machine)
        Operating System      CentOS 6.5 (64 bit)
        Hadoop Version        2.6
        Java Version          1.7 (Oracle)
        Java IDE              NetBeans (Version 8.2)
        Hard Disk Drive       500GB (each machine)
      Stand-Alone Machine Configuration
        Processor             Intel Xeon (64 bit)
        Main Memory (RAM)     8GB
        Operating System      CentOS 7 (64 bit)
        Python Version        2.7
        Hard Disk Drive       500GB
  • 24. Results
        Algorithm                                                       Mushroom               Anon-Web               Census Income
        NextClosure (Sequential)                                        618 sec                14671 sec              18230 sec
        CloseByOne (Sequential)                                         2543 sec               656 sec                7465 sec
        MRGanter (Map-Reduce)                                           20269 sec (5 nodes)    20110 sec (3 nodes)    9654 sec (11 nodes)
        MRCbo (Map-Reduce)                                              241 sec (11 nodes)     693 sec (11 nodes)     803 sec (11 nodes)
        MRGanter+ (Map-Reduce)                                          198 sec (9 nodes)      496 sec (9 nodes)      358 sec (9 nodes)
        Our Algorithm (Map-Reduce)                                      42 (10 nodes)          26 (10 nodes)          69 (nodes)
        Our Algorithm (Enumeration of Sufficient Set, single machine)   6 sec                  24 sec                 97 sec
        Our Algorithm (Enumeration of Entire Lattice, single machine)   291 sec                165 sec                653 sec
        Number of Concepts in Sufficient Set                            117                    365                    147
      All the datasets are taken from the UCI repository.
  • 25. Scalability Test
      Execution time for the Plant dataset (22632 rows x 70 columns):
        Dataset Size                   Map-Reduce execution time (sec)   Sufficient Set generation time (sec)   Number of Concepts in Sufficient Set
        22632 x 70                     22                                64                                     661
        226320 x 70 (Dataset*10)       64                                780                                    661
        452640 x 70 (Dataset*20)       206                               1773                                   661
        678960 x 70 (Dataset*30)       441                               2898                                   661
        905280 x 70 (Dataset*40)       874                               3982                                   661
        1131600 x 70 (Dataset*50)      1480                              5334                                   661
      The Plant dataset is taken from the UCI repository.
  • 26. Results with 2 different data densities
      Comparison of Our Algorithm with the BiBit Algorithm
      Input data density = 10%
        #Rows    #Columns   Time (sec) by BiBit   Total Time (sec) by Our Algorithm
        20       50         0.01                  17
        50       50         0.01                  18
        100      50         0.79                  19
        500      50         10.3                  20
        1000     50         16.8                  21
        5000     50         66.3                  22
        10000    50         128.5                 29
      Input data density = 60%
        #Rows    #Columns   Time (sec) by BiBit   Total Time (sec) by Our Algorithm
        20       50         4.2                   17
        50       50         5.0                   17
        100      50         7.3                   18
        500      50         40.5                  18
        1000     50         67.7                  19
        5000     50         316.8                 29
        10000    50         633.8                 53
  • 27. Our Contribution
      • The output generated by our algorithm is much smaller than the input dataset, without information loss.
      • The generated Sufficient Set contains all the information needed to generate the complete lattice.
      • The Sufficient Set can also be used to generate concepts of a desired length without revisiting the original dataset.
