Generate-Test-and-Aggregate is a class of algorithms from which efficient MapReduce programs can be derived automatically.
MapReduce is a useful and popular programming model for large-scale parallel processing. However, for many complex problems it is not easy to develop efficient parallel algorithms that fit the MapReduce paradigm well.
The generator-based parallelization approach was developed to simplify parallel programming through an automatic generation and optimization mechanism. Efficient parallel algorithms are derived from users' naive but correct programs by generators that exploit optimization theorems from the field of skeletal parallel programming. The resulting algorithms are in a form well suited to implementation with MapReduce.
With this approach, a large class of generate-and-test computations can be programmed and executed efficiently over MapReduce, so a novel programming interface and framework can be built on top of MapReduce that addresses the difficulties of both programmability and efficiency. This paper introduces such a framework: users need only concentrate on writing naive but correct programs, and we show that many generate-and-test computations can be implemented easily and efficiently with it over MapReduce.
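As a toy illustration of the naive generate-and-test style of specification (the problem and all names below are invented for illustration, not taken from the framework described above), the following sketch generates all candidates, tests them against a predicate, and aggregates the survivors:

```python
from functools import reduce

# Hypothetical toy problem: the heaviest ascending subsequence of a list.
# A naive generate-and-test spec is exponential; the framework's point is
# that generators can transform such a spec into an efficient MapReduce form.

def sublists(xs):
    """Generate every subsequence of xs (exponentially many)."""
    result = [[]]
    for x in xs:
        result += [s + [x] for s in result]
    return result

def is_ascending(s):
    return all(a < b for a, b in zip(s, s[1:]))

def solve(xs):
    candidates = sublists(xs)                                  # generate
    good = [s for s in candidates if is_ascending(s)]          # test
    return reduce(lambda a, b: b if sum(b) > sum(a) else a,
                  good, [])                                    # aggregate

print(solve([3, 1, 4, 1, 5]))  # heaviest ascending subsequence
```

The naive spec is clearly correct but exponential; the framework's generators are what would turn such a specification into an efficient parallel program.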
The problem considered is that of finding frequent subpaths in a database of paths over a fixed undirected graph. This problem arises in applications such as predicting congestion in computer and vehicular traffic networks. An algorithm called AFS is developed, based on the classic frequent-itemset mining algorithm Apriori, but with efficiency improved from exponential in the transaction size to quadratic by exploiting the underlying graph structure. This improvement makes AFS feasible for practical input path sizes. It is also proved that a natural generalization of the frequent-subpaths problem admits no solution faster than Apriori.
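The AFS algorithm itself is not reproduced here, but the structural observation it exploits can be sketched: because candidate patterns are contiguous subpaths, a path of length n contributes only O(n) candidates per level, not exponentially many subsets. A minimal level-wise counter under that assumption (data invented):

```python
from collections import Counter

def frequent_subpaths(paths, min_support):
    """Level-wise counting of contiguous subpaths of increasing length."""
    frequent, k = {}, 1
    while True:
        counts = Counter()
        for p in paths:
            # contiguous windows of length k; a set avoids double-counting
            windows = {tuple(p[i:i + k]) for i in range(len(p) - k + 1)}
            counts.update(windows)
        level = {w: c for w, c in counts.items() if c >= min_support}
        if not level:
            return frequent
        frequent.update(level)
        k += 1

db = [[1, 2, 3, 4], [2, 3, 4], [1, 2, 4]]
print(frequent_subpaths(db, 2))
```

Each level scans the database once and, per path, generates a number of windows linear in the path length, which is the source of the polynomial (rather than exponential) behaviour.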
A Comparative Study of AI Techniques for Solving Hybrid Flow Shop (HFS) Sched... - Majid_ahmed
This document summarizes a study that compares AI techniques for solving hybrid flow shop scheduling problems, specifically genetic algorithm (GA), simulated annealing (SA), and tabu search (TS). It first explains the components and concepts of each technique. Then it shows how they are applied to solve hybrid flow shop scheduling problems. Experimental results using benchmark problems show that TS generated the best results, finding acceptable solutions in 6 of 12 problem sets, while SA found solutions in 3 sets and GA in 3 sets. The best GA results used specific crossover operators. Increasing the number of inner steps in TS to generate neighborhoods also improved results.
GAN Explained Simply (What is this? Gum? It's GAN.) - Hansol Kang
The document discusses generative adversarial networks (GANs). It begins with an introduction to GANs, describing their concept and training process. It then reviews a seminal GAN paper, discussing its mathematical formulation of GAN training as a minimax game and theoretical results showing global optimality can be achieved. The document concludes by outlining the configuration, implementation, and flowchart for a GAN experiment.
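The minimax formulation reviewed in that paper is the standard GAN objective, in which the discriminator D and the generator G play a two-player game:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The global-optimality result referred to above states that this game has its optimum when the generator's distribution equals the data distribution, at which point the optimal discriminator outputs 1/2 everywhere.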
The document discusses Adaptable Constrained Genetic Programming (ACGP), which aims to automate the discovery of heuristics to guide the genetic programming search. It describes how ACGP develops first-order and second-order heuristics based on patterns observed in high-performing individuals, and uses these heuristics to bias mutation, crossover and regrowth. Experimental results on a target equation with explicit second-order structure show that ACGP with second-order heuristics outperforms both standard GP and ACGP with only first-order heuristics. The document concludes that ACGP is effective at discovering and exploiting problem structure through its adaptive heuristic approach.
This document discusses efficient solving techniques for answer set programming (ASP). It begins with an introduction to ASP, including its declarative programming paradigm based on stable model semantics. Computational tasks for ASP like model generation, optimum answer set search, and cautious reasoning are described along with their complexities. The document outlines the architecture of an ASP solver, covering input preprocessing, propagation methods, and learning heuristics. Model-guided and core-guided algorithms for optimum answer set search are also summarized.
Using Simulation to Investigate Requirements Prioritization Strategies - CS, NcState
The document summarizes a study that used simulation to investigate different requirements prioritization strategies, including plan-based (PB), agile-based (AG), and hybrid approaches. The simulation found that a hybrid strategy that prioritizes requirements based on value/cost, pruning low-value items, performed best overall. It dominated other strategies in terms of achieving the optimal frontier and balancing benefits and costs. The study concludes that combinations of strategies may work better than extreme plan-based or agile-based approaches alone.
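The value/cost idea with low-value pruning can be sketched in miniature (the data, names, and threshold below are invented for illustration; this is not the study's simulation model):

```python
# Hybrid prioritization sketch: prune low-value requirements,
# then order the rest by value-to-cost ratio.

def prioritize(requirements, min_value):
    kept = [r for r in requirements if r["value"] >= min_value]  # prune
    return sorted(kept, key=lambda r: r["value"] / r["cost"], reverse=True)

reqs = [
    {"name": "login", "value": 8, "cost": 2},
    {"name": "export", "value": 3, "cost": 3},
    {"name": "themes", "value": 1, "cost": 4},
]
order = prioritize(reqs, min_value=2)
print([r["name"] for r in order])
```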
The document discusses ggplot2, a grammar of graphics plotting package for R. It introduces key concepts of ggplot2 including the layered grammar of graphics model and its components. These components - data, aesthetic mappings, statistical transformations, geometric objects, scales, coordinates, and faceting - provide flexibility to build complex plots from data. The document provides examples using ggplot2 to visualize birth and death rate data and explore the diamonds dataset.
Scalable frequent itemset mining using heterogeneous computing par apriori a... - ijdpsjournal
Association rule mining is one of the principal tasks of data mining: it finds frequent itemsets in large volumes of data in order to produce summarized models of mined rules. These models are extended to generate association rules in applications such as e-commerce, bioinformatics, associations between image content and non-image features, and analysis of sales effectiveness in the retail industry. With rapidly growing databases, the major challenge is mining frequent itemsets in a very short time; ideally, the processing time should remain nearly constant as the data grows. Because high-performance computing offers many processors and many cores, consistent runtime performance for association rule mining over very large databases becomes achievable, so we must rely on high-performance parallel and/or distributed computing. In our literature survey, we studied sequential Apriori algorithms and identified the fundamental problems in both the sequential and parallel settings. We propose ParApriori, a parallel algorithm for GPGPUs, and analyze its results. We find that the proposed algorithm improves computing time and maintains consistent performance under increasing load. The empirical analysis also verifies the algorithm's efficiency and scalability over a series of datasets on a many-core GPU platform.
This slide was used in the "Mathematics of Logistics" seminar at Nishinari Laboratory, Faculty of Engineering, the University of Tokyo.
references:
1. Mikio Kubo (2007). The Mathematics of Logistics (ロジスティクスの数理). Kyoritsu Shuppan.
2. Dimitri P. Bertsekas (2005). Dynamic Programming and Optimal Control, Vols. 1 and 2, 4th edition. Athena Scientific.
Nowadays an enormous amount of data is generated through the Internet of Things (IoT) as technologies advance and people use them in day-to-day activities; such data is termed Big Data, with its own characteristics and challenges. Frequent itemset mining algorithms aim to discover frequent itemsets in a transactional database, but as the dataset size increases, it can no longer be handled by traditional frequent itemset mining. The MapReduce programming model handles large datasets, but its large communication cost reduces execution efficiency. This work proposes a new k-means pre-processing technique applied to the BigFIM algorithm. ClustBigFIM uses a hybrid approach: k-means clustering to generate clusters from huge datasets, then Apriori and Eclat to mine frequent itemsets from the generated clusters using the MapReduce programming model. Results show that the execution efficiency of the ClustBigFIM algorithm is increased by applying the k-means clustering algorithm before BigFIM as a pre-processing step.
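The k-means pre-clustering step can be sketched in miniature (a toy one-dimensional version with invented data; ClustBigFIM itself runs clustering over MapReduce on transactional data):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D points; returns sorted cluster centers."""
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        # keep the old center if a cluster empties out
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(data, 2))
```

Once the data is partitioned this way, each cluster can be mined independently, which is the communication-saving idea behind the pre-processing step.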
The document summarizes several improved algorithms that aim to address the drawbacks of the Apriori algorithm for association rule mining. It discusses six different approaches: 1) An intersection and record filter approach that counts candidate support only in transactions of sufficient length and uses set intersection; 2) An approach using set size and frequency to prune insignificant candidates; 3) An approach that reduces the candidate set and memory usage by only searching frequent itemsets once to delete candidates; 4) A partitioning approach that divides the database; 5) An approach using vertical data format to reduce database scans; and 6) A distributed approach to parallelize the algorithm across machines.
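Approach 5, the vertical data format, can be illustrated with a minimal sketch (data invented): each item maps to the set of transaction ids (its tidset) that contain it, and support counting becomes set intersection instead of a repeated database scan:

```python
# Vertical representation: item -> set of transaction ids (tidset).
transactions = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}

def tidsets(db):
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

v = tidsets(transactions)
support_ab = len(v["a"] & v["b"])  # transactions containing both a and b
print(support_ab)
```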
This document outlines an introduction to R graphics using ggplot2 presented by the Harvard MIT Data Center. The presentation introduces key concepts in ggplot2 including geometric objects, aesthetic mappings, statistical transformations, scales, faceting, and themes. It uses examples from the built-in mtcars dataset to demonstrate how to create common plot types like scatter plots, box plots, and regression lines. The goal is for students to be able to recreate a sample graphic by the end of the workshop.
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. A case study applies this approach to a nuclear reactor assembly model by extracting active subspaces from individual pin cell models to build a reduced order model in a more computationally efficient way than using the full assembly model.
The document discusses different string matching algorithms:
1. The naive string matching algorithm compares characters in the text and pattern sequentially to find matches.
2. The Rabin-Karp algorithm uses hashing to quickly rule out positions where the pattern cannot occur, performing a full character comparison only when the hashes match.
3. The finite automata approach models the pattern as the states of an automaton so the text can be scanned efficiently for matches.
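A compact sketch of the Rabin-Karp idea (a standard textbook formulation, with an arbitrary small modulus chosen for illustration): compare rolling hashes first, and fall back to a character comparison only on a hash match:

```python
def rabin_karp(text, pattern, base=256, mod=101):
    """Return the starting indices of all occurrences of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)          # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # verify with a real comparison only when hashes collide
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:                     # roll the window one character
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits

print(rabin_karp("ababcabc", "abc"))
```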
A PREFIXED-ITEMSET-BASED IMPROVEMENT FOR APRIORI ALGORITHM - csandit
Association rule mining is a very important part of data mining, used to find interesting patterns in transaction databases. The Apriori algorithm is one of the most classical algorithms for association rules, but it has an efficiency bottleneck. In this article we propose a prefixed-itemset-based data structure for candidate itemset generation; with the help of this structure we improve the efficiency of the classical Apriori algorithm.
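The candidate-generation step that the prefixed-itemset structure accelerates can be sketched as follows (this is the standard Apriori prefix join with subset pruning, not the paper's data structure; data invented):

```python
from itertools import combinations

def candidate_gen(frequent_k):
    """Join sorted frequent k-itemsets sharing a (k-1)-prefix into (k+1)-candidates."""
    frequent_k = sorted(frequent_k)
    known = set(frequent_k)
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        if a[:-1] == b[:-1]:                      # common (k-1)-prefix
            cand = tuple(sorted(set(a) | set(b)))
            # Apriori pruning: every k-subset must itself be frequent
            if all(tuple(s) in known for s in combinations(cand, len(a))):
                candidates.add(cand)
    return candidates

f2 = [("a", "b"), ("a", "c"), ("b", "c")]
print(candidate_gen(f2))
```

Grouping itemsets by their shared prefix, as the proposed structure does, makes this join cheap because only itemsets under the same prefix need to be paired.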
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc... - Ruairi de Frein
An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland describing algorithms for distributed Formal Concept Analysis
ABSTRACT
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.
Accepted for publication at the International Conference for Formal Concept Analysis 2012.
Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Ruairí de Fréin: rdefrein (at) gmail (dot) com
bibtex:
@incollection{
year={2012},
isbn={978-3-642-29891-2},
booktitle={Formal Concept Analysis},
volume={7278},
series={Lecture Notes in Computer Science},
editor={Domenach, Florent and Ignatov, Dmitry I. and Poelmans, Jonas},
doi={10.1007/978-3-642-29892-9_26},
title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
publisher={Springer Berlin Heidelberg},
keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
author={Xu, Biao and de Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
pages={292-308}
}
DOWNLOAD
The article on arXiv: http://arxiv.org/abs/1210.2401
Dual-time Modeling and Forecasting in Consumer Banking (2016) - Aijun Zhang
Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure with horizontal axis representing the calendar time and the vertical axis representing the lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
Among other fields, we share the potential applications in quantitative risk management, and demonstrate a large-scale credit risk analysis powered by big data computing.
This document summarizes an article from the International Journal of Electronics and Communication Engineering & Technology. The article presents a new optimization method called Self Accelerated Smart Particle Swarm Optimization (SASPSO) for solving nonlinear programming problems. SASPSO updates particle positions based on the personal best and global best positions without requiring velocity equations. This reduces parameters and computational cost compared to standard PSO. The SASPSO method is tested on benchmark nonlinear problems and shows improved accuracy and number of optimal solutions compared to genetic algorithms.
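For contrast with SASPSO's velocity-free update, here is a minimal sketch of standard PSO with velocities (the parameters and test function are arbitrary choices for illustration, not taken from the article):

```python
import random

random.seed(0)  # deterministic run for illustration

def pso(f, n_particles=10, iters=50, w=0.5, c1=1.5, c2=1.5):
    """Minimize f over one dimension with standard velocity-based PSO."""
    xs = [random.uniform(-10, 10) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]                 # personal best positions
    gbest = min(xs, key=f)        # global best position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = random.random(), random.random()
            # inertia + cognitive pull toward pbest + social pull toward gbest
            vs[i] = (w * vs[i]
                     + c1 * r1 * (pbest[i] - xs[i])
                     + c2 * r2 * (gbest - xs[i]))
            xs[i] += vs[i]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
                if f(xs[i]) < f(gbest):
                    gbest = xs[i]
    return gbest

best = pso(lambda x: x * x)
print(best)
```

SASPSO, as summarized above, drops the velocity recurrence and moves particles directly from the personal-best and global-best positions, which removes the inertia and velocity parameters from this update.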
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. It presents a case study applying this approach to a 7x7 nuclear fuel assembly benchmark model, extracting active subspaces from individual fuel pin cell models to build a reduced order model in a more computationally efficient way.
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines.
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they run in parallel on multiple nodes with little or no synchronization, and they have recently been implemented successfully for a range of difficult practical problems. However, existing theories mostly rest on fairly restrictive assumptions on the delays and cannot explain the convergence and speedup properties of such algorithms. This talk gives an overview of distributed optimization and discusses new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Analyzing high-frequency time series is increasingly useful with the current explosion in the availability of these data in several application areas, including but not limited to, climate, finance, health analytics, transportation, etc. This talk will give an overview of two statistical frameworks that could be useful for analyzing high-frequency financial time series leading to quantification of financial risk. These include a distribution free approach using penalized estimating functions for modeling inter-event durations and an approximate Bayesian approach for modeling counts of events in regular intervals. A few other potentially useful lines of research in this area will also be introduced.
Machine learning Algorithms with a Sagemaker demo - Hridyesh Bisht
An algorithm is a set of steps to solve a problem. Supervised learning uses labeled training data to teach models patterns which they can then use to predict labels for new unlabeled data. Unsupervised learning uses clustering and pattern detection to analyze and group unlabeled data. SageMaker is a fully managed service that allows users to build, train and deploy machine learning models and includes components for managing notebooks, labeling data, and deploying models through endpoints.
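The supervised-learning idea above (labeled training data used to predict labels for new, unlabeled points) can be illustrated with a toy one-nearest-neighbour rule; no SageMaker API is involved and all data below is invented:

```python
def nearest_neighbor(train, point):
    """train: list of (feature_vector, label) pairs; returns the predicted label."""
    def dist(a, b):
        # squared Euclidean distance; no square root needed for comparison
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda pair: dist(pair[0], point))
    return label

train = [((0.0, 0.0), "cat"), ((5.0, 5.0), "dog")]
print(nearest_neighbor(train, (4.0, 4.5)))
```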
THE NEW HYBRID COAW METHOD FOR SOLVING MULTI-OBJECTIVE PROBLEMS - ijfcstjournal
In this article, the hybrid COAW algorithm, which combines the Cuckoo Optimization Algorithm with the simple additive weighting method, is presented to solve multi-objective problems. The cuckoo algorithm is an efficient and structured method for solving nonlinear continuous problems. The Pareto frontiers created by the proposed COAW algorithm are exact and well dispersed; the method finds Pareto frontiers quickly and identifies their beginning and end points properly. To validate the proposed algorithm, several experimental problems were analyzed, and the results indicate the effectiveness of the COAW algorithm for solving multi-objective problems.
The document summarizes several advanced policy gradient methods for reinforcement learning, including trust region policy optimization (TRPO), proximal policy optimization (PPO), and using the natural policy gradient with the Kronecker-factored approximation (K-FAC). TRPO frames policy optimization as solving a constrained optimization problem to limit policy updates, while PPO uses a clipped objective function as a pessimistic bound. Both methods improve upon vanilla policy gradients. K-FAC provides an efficient way to approximate the natural policy gradient using the Fisher information matrix. The document reviews the theory and algorithms behind these methods.
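The clipped objective that PPO maximizes in place of TRPO's explicit KL constraint is, in standard notation:

```latex
L^{\mathrm{CLIP}}(\theta) =
  \hat{\mathbb{E}}_t\!\left[
    \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
  \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here the probability ratio r_t measures how far the new policy has moved from the old one, and the min with the clipped term makes the objective a pessimistic bound that removes any incentive to move the ratio outside [1-ε, 1+ε].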
The president of the National Banking and Securities Commission (CNBV) reported that 93% of the savers affected by the Ficrea case requested payment of the deposit insurance before the June 17 deadline. Although some details remain to be settled, such as the bankruptcy (concurso mercantil) process, this high participation rate means that a large part of the conflict generated by the Ficrea fraud has concluded. However, some dissatisfied savers plan to demonstrate to demand a
The document discusses various tests used to measure the workability of concrete, including slump tests, compacting factor tests, and Vee-Bee tests. The slump test measures how much the concrete sinks in a standard cone after it is removed. The compacting factor test measures the ratio of partially compacted concrete weight to fully compacted concrete weight. The Vee-Bee test indirectly measures workability for concretes that cannot be tested by the slump test.
The document discusses ggplot2, a grammar of graphics plotting package for R. It introduces key concepts of ggplot2 including the layered grammar of graphics model and its components. These components - data, aesthetic mappings, statistical transformations, geometric objects, scales, coordinates, and faceting - provide flexibility to build complex plots from data. The document provides examples using ggplot2 to visualize birth and death rate data and explore the diamonds dataset.
Scalable frequent itemset mining using heterogeneous computing par apriori a...ijdpsjournal
Association Rule mining is one of the dominant tasks of data mining, which concerns in finding frequent
itemsets in large volumes of data in order to produce summarized models of mined rules. These models are
extended to generate association rules in various applications such as e-commerce, bio-informatics,
associations between image contents and non image features, analysis of effectiveness of sales and retail
industry, etc. In the vast increasing databases, the major challenge is the frequent itemsets mining in a
very short period of time. In the case of increasing data, the time taken to process the data should be
almost constant. Since high performance computing has many processors, and many cores, consistent runtime
performance for such very large databases on association rules mining is achieved. We, therefore,
must rely on high performance parallel and/or distributed computing. In literature survey, we have studied
the sequential Apriori algorithms and identified the fundamental problems in sequential environment and
parallel environment. In our proposed ParApriori, we have proposed parallel algorithm for GPGPU, and
we have also done the results analysis of our GPU parallel algorithm. We find that proposed algorithm
improved the computing time, consistency in performance over the increasing load. The empirical analysis
of the algorithm also shows that efficiency and scalability is verified over the series of datasets
experimented on many core GPU platform.
This slide was used in the "Mathematics of Logistics" seminar at Nishinari Laboratory, Faculty of Engineering, the University of Tokyo.
references:
1.久保幹雄 (2007) 『ロジスティクスの数理』 共立出版
2.Dimitri P. Bertsekas (2005). Dynamic Programming and Optimal Control. Athena Scientific. Vol 1,2. 4th edition.
Now a day enormous amount of data is getting explored through Internet of Things (IoT) as technologies
are advancing and people uses these technologies in day to day activities, this data is termed as Big Data
having its characteristics and challenges. Frequent Itemset Mining algorithms are aimed to disclose
frequent itemsets from transactional database but as the dataset size increases, it cannot be handled by
traditional frequent itemset mining. MapReduce programming model solves the problem of large datasets
but it has large communication cost which reduces execution efficiency. This proposed new pre-processed
k-means technique applied on BigFIM algorithm. ClustBigFIM uses hybrid approach, clustering using kmeans
algorithm to generate Clusters from huge datasets and Apriori and Eclat to mine frequent itemsets
from generated clusters using MapReduce programming model. Results shown that execution efficiency of
ClustBigFIM algorithm is increased by applying k-means clustering algorithm before BigFIM algorithm as
one of the pre-processing technique.
The document summarizes several improved algorithms that aim to address the drawbacks of the Apriori algorithm for association rule mining. It discusses six different approaches: 1) An intersection and record filter approach that counts candidate support only in transactions of sufficient length and uses set intersection; 2) An approach using set size and frequency to prune insignificant candidates; 3) An approach that reduces the candidate set and memory usage by only searching frequent itemsets once to delete candidates; 4) A partitioning approach that divides the database; 5) An approach using vertical data format to reduce database scans; and 6) A distributed approach to parallelize the algorithm across machines.
This document outlines an introduction to R graphics using ggplot2 presented by the Harvard MIT Data Center. The presentation introduces key concepts in ggplot2 including geometric objects, aesthetic mappings, statistical transformations, scales, faceting, and themes. It uses examples from the built-in mtcars dataset to demonstrate how to create common plot types like scatter plots, box plots, and regression lines. The goal is for students to be able to recreate a sample graphic by the end of the workshop.
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. A case study applies this approach to a nuclear reactor assembly model by extracting active subspaces from individual pin cell models to build a reduced order model in a more computationally efficient way than using the full assembly model.
The document discusses different string matching algorithms:
1. The naive string matching algorithm compares characters in the text and pattern sequentially to find matches.
2. The Robin-Karp algorithm uses hashing to quickly determine if the pattern is present in the text before doing full comparisons.
3. Finite automata models the pattern as states in an automaton to efficiently search the text for matches.
A PREFIXED-ITEMSET-BASED IMPROVEMENT FOR APRIORI ALGORITHMcsandit
Association rules is a very important part of data mining. It is used to find the interesting patterns from transaction databases. Apriori algorithm is one of the most classical algorithms
of association rules, but it has the bottleneck in efficiency. In this article, we proposed a prefixed-itemset-based data structure for candidate itemset generation, with the help of the structure we managed to improve the efficiency of the classical Apriori algorithm.
Talk on Optimization for Deep Learning, which gives an overview of gradient descent optimization algorithms and highlights some current research directions.
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduc...Ruairi de Frein
An article from the Telecommunications Software & Systems Group, Waterford Institute of Technology, Ireland describing algorithms for distributed Formal Concept Analysis
ABSTRACT
While many existing formal concept analysis algorithms are efficient, they are typically unsuitable for distributed implementation. Taking the MapReduce (MR) framework as our inspiration we introduce a distributed approach for performing formal concept mining. Our method has its novelty in that we use a light-weight MapReduce runtime called Twister which is better suited to iterative algorithms than recent distributed approaches. First, we describe the theoretical foundations underpinning our distributed formal concept analysis approach. Second, we provide a representative exemplar of how a classic centralized algorithm can be implemented in a distributed fashion using our methodology: we modify Ganter's classic algorithm by introducing a family of MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the algorithm's lineage. To evaluate the factors that impact distributed algorithm performance, we compare our MR* algorithms with the state-of-the-art. Experiments conducted on real datasets demonstrate that MRGanter+ is efficient, scalable and an appealing algorithm for distributed problems.
Accepted for publication at the International Conference for Formal Concept Analysis 2012.
Project participants: Biao Xu, Ruairí de Fréin, Eric Robson, Mícheál Ó Foghlú
Ruairí de Fréin: rdefrein (at) gmail (dot) com
bibtex:
@incollection{
year={2012},
isbn={978-3-642-29891-2},
booktitle={Formal Concept Analysis},
volume={7278},
series={Lecture Notes in Computer Science},
editor={Domenach, Florent and Ignatov, Dmitry I. and Poelmans, Jonas},
doi={10.1007/978-3-642-29892-9_26},
title={Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework},
url={http://dx.doi.org/10.1007/978-3-642-29892-9_26},
publisher={Springer Berlin Heidelberg},
keywords={Formal Concept Analysis; Distributed Mining; MapReduce},
author={Xu, Biao and de Fréin, Ruairí and Robson, Eric and Ó Foghlú, Mícheál},
pages={292-308}
}
Download the article from arXiv: http://arxiv.org/abs/1210.2401
Dual-time Modeling and Forecasting in Consumer Banking (2016) (Aijun Zhang)
Longitudinal and survival data are naturally observed with multiple origination dates. They form a dual-time data structure with horizontal axis representing the calendar time and the vertical axis representing the lifetime. In this talk we discuss how to model dual-time data based on a decomposition strategy and how to forecast over the time horizon. Various statistical techniques are used for treating fixed and random effects.
Among other fields, we share the potential applications in quantitative risk management, and demonstrate a large-scale credit risk analysis powered by big data computing.
This document summarizes an article from the International Journal of Electronics and Communication Engineering & Technology. The article presents a new optimization method called Self Accelerated Smart Particle Swarm Optimization (SASPSO) for solving nonlinear programming problems. SASPSO updates particle positions based on the personal best and global best positions without requiring velocity equations. This reduces parameters and computational cost compared to standard PSO. The SASPSO method is tested on benchmark nonlinear problems and shows improved accuracy and number of optimal solutions compared to genetic algorithms.
This document describes a multi-level reduced order modeling approach with robust error bounds. It discusses applying dimensionality reduction algorithms to extract active subspaces from reduced complexity models, then equipping the reduced model with an error bound. It presents a case study applying this approach to a 7x7 nuclear fuel assembly benchmark model, extracting active subspaces from individual fuel pin cell models to build a reduced order model in a more computationally efficient way.
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of natural policy gradient and propose to optimize both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region; hence we call our method Actor Critic using Kronecker-Factored Trust Region (ACKTR). To the best of our knowledge, this is the first scalable trust region natural gradient method for actor-critic methods. It is also a method that learns non-trivial tasks in continuous control as well as discrete control policies directly from raw pixel inputs. We tested our approach across discrete domains in Atari games as well as continuous domains in the MuJoCo environment. With the proposed methods, we are able to achieve higher rewards and a 2- to 3-fold improvement in sample efficiency on average, compared to previous state-of-the-art on-policy actor-critic methods. Code is available at https://github.com/openai/baselines.
Asynchronous parallel algorithms are developed to solve massive optimization problems in distributed data systems; they can run in parallel on multiple nodes with little or no synchronization. Recently they have been implemented successfully to solve a range of difficult practical problems. However, existing theories are mostly based on fairly restrictive assumptions about the delays and cannot explain the convergence and speedup properties of such algorithms. In this talk we give an overview of distributed optimization and discuss new theoretical results on the convergence of the asynchronous parallel stochastic gradient algorithm with unbounded delays. Simulated and real data are used to demonstrate the practical implications of these theoretical results.
Analyzing high-frequency time series is increasingly useful with the current explosion in the availability of these data in several application areas, including but not limited to, climate, finance, health analytics, transportation, etc. This talk will give an overview of two statistical frameworks that could be useful for analyzing high-frequency financial time series leading to quantification of financial risk. These include a distribution free approach using penalized estimating functions for modeling inter-event durations and an approximate Bayesian approach for modeling counts of events in regular intervals. A few other potentially useful lines of research in this area will also be introduced.
Machine Learning Algorithms with a SageMaker Demo (Hridyesh Bisht)
An algorithm is a set of steps to solve a problem. Supervised learning uses labeled training data to teach models patterns which they can then use to predict labels for new unlabeled data. Unsupervised learning uses clustering and pattern detection to analyze and group unlabeled data. SageMaker is a fully managed service that allows users to build, train and deploy machine learning models and includes components for managing notebooks, labeling data, and deploying models through endpoints.
The New Hybrid COAW Method for Solving Multi-Objective Problems (ijfcstjournal)
In this article, using the Cuckoo Optimization Algorithm and the simple additive weighting method, the hybrid COAW algorithm is presented to solve multi-objective problems. The cuckoo algorithm is an efficient and structured method for solving nonlinear continuous problems. The Pareto frontiers created by the proposed COAW algorithm are exact and have good dispersion. The method finds Pareto frontiers quickly and identifies their beginning and end points properly. To validate the proposed algorithm, several experimental problems were analyzed; the results indicate the effectiveness of the COAW algorithm for solving multi-objective problems.
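The quality of a Pareto frontier rests on the dominance relation. A minimal sketch of dominance filtering for minimization problems (illustrative only, not the COAW implementation):

```python
def dominates(p, q):
    """p dominates q (minimization): no worse in every objective, better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def pareto_front(points):
    """Keep only the points no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 5), (2, 3), (3, 4), (4, 1)]
print(pareto_front(pts))   # -> [(1, 5), (2, 3), (4, 1)]
```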
The document summarizes several advanced policy gradient methods for reinforcement learning, including trust region policy optimization (TRPO), proximal policy optimization (PPO), and using the natural policy gradient with the Kronecker-factored approximation (K-FAC). TRPO frames policy optimization as solving a constrained optimization problem to limit policy updates, while PPO uses a clipped objective function as a pessimistic bound. Both methods improve upon vanilla policy gradients. K-FAC provides an efficient way to approximate the natural policy gradient using the Fisher information matrix. The document reviews the theory and algorithms behind these methods.
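For concreteness, PPO's clipped surrogate objective can be sketched per sample as follows (an illustrative sketch, not the summarized document's code):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    Taking the min makes the bound pessimistic, discouraging large policy steps."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

print(ppo_clip_objective(1.5, 1.0))    # positive advantage: gain is capped
print(ppo_clip_objective(0.5, -1.0))   # negative advantage: loss is not hidden
```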
The president of the National Banking and Securities Commission reported that 93% of the savers affected by the Ficrea case requested payment of the deposit insurance before the June 17 deadline. Although some details remain to be defined, such as the bankruptcy proceedings, this high participation rate means that much of the conflict generated by the Ficrea fraud has concluded. However, some dissatisfied savers plan to demonstrate to demand a…
The document discusses various tests used to measure the workability of concrete, including slump tests, compacting factor tests, and Vee-Bee tests. The slump test measures how much the concrete sinks in a standard cone after it is removed. The compacting factor test measures the ratio of partially compacted concrete weight to fully compacted concrete weight. The Vee-Bee test indirectly measures workability for concretes that cannot be tested by the slump test.
The document discusses calculating reactions to loads applied to beams. It begins by defining key terms like beams, loads, forces, and equilibrium. It explains that reactions must be calculated to balance applied loads and achieve static equilibrium. The document then provides examples of calculating reactions on simple beams using free body diagrams and the principles of moment and force equilibrium. Reactions are found by taking moments and forces around supports and setting equations equal to zero.
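The moment-and-force procedure it describes can be sketched for the simplest case, a simply supported beam with point loads (function name and example values are illustrative):

```python
def simple_beam_reactions(span, loads):
    """Reactions of a simply supported beam under point loads.

    loads: list of (magnitude, distance-from-left-support) pairs.
    Taking moments about the left support: R_right * span = sum(P * x);
    vertical equilibrium then gives R_left."""
    r_right = sum(p * x for p, x in loads) / span
    r_left = sum(p for p, _ in loads) - r_right
    return r_left, r_right

# a 10 kN load placed 4 m along a 10 m beam
print(simple_beam_reactions(10.0, [(10.0, 4.0)]))   # -> (6.0, 4.0)
```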
Aggregate Impact Value: Calculation and Uses (Shahryar Amin)
This document describes a test to determine the aggregate impact value (AIV) of coarse aggregates. The AIV test measures the percentage of fines created when aggregates are subjected to a specified amount of impact. The test involves sieving aggregates into different sizes, filling a metal cylinder 1/3 full with coarse aggregates, subjecting it to hammer blows, then determining the weight of particles that pass through a 2.36mm sieve. The AIV percentage is calculated using the weights before and after impact. An AIV below 50% indicates aggregates suitable for construction, while above 50% suggests poorer quality.
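The final calculation reduces to a ratio of weights before and after impact. A minimal sketch, with illustrative example weights:

```python
def aggregate_impact_value(w_sample, w_fines):
    """AIV (%) = weight of fines passing the 2.36 mm sieve after impact,
    divided by the original sample weight, times 100.
    Lower values indicate tougher aggregate."""
    return 100.0 * w_fines / w_sample

# e.g. 52.5 g of fines from a 350 g sample
print(aggregate_impact_value(w_sample=350.0, w_fines=52.5))   # -> 15.0
```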
This document discusses viscosity testing for bitumen used in road pavements. It defines viscosity as the resistance to flow and explains that viscosity testing determines the consistency and strength of bitumen at different temperatures. The document outlines different types of viscometers used to measure the time required for bitumen to flow through an orifice at standardized temperatures, and how the results are interpreted to select bitumen with an appropriate viscosity for use in road construction and maintenance.
This test determines the resistance of coarse aggregate to sudden shock or impact: a cup is filled with aggregate, tamped, and struck; the sample is then sieved, and the percentage that breaks is calculated from the weights retained on and passing specific sieves. A lower percentage indicates greater resistance to impact.
Indian highways use road signs to convey important safety information to drivers. Signs communicate speed limits, road conditions, directions to cities and towns, and other driving rules. Uniform signage across India's vast highway system helps drivers safely navigate regardless of their starting point or destination.
This presentation contains the IS concrete mix design method and the basics of concrete design mix. It covers: objectives of mix design; grades of concrete; nominal mix and design mix; factors affecting choice of mix design; methods of concrete mix design; and the IS method of design.
Class 1: Moisture Content and Specific Gravity (Geotechnical Engineering) (Hossam Shafiq I)
This document provides an introduction to a geotechnical engineering laboratory course at Texas Tech University. It includes information about the course syllabus, schedule, report format, and the objectives and procedures for the first lab which involves determining the moisture content, unit weight, and specific gravity of a soil sample. The significance of understanding soil properties for civil engineers is discussed. Key relationships between the weight and volume of the solid, water, and air phases in a soil sample are also explained.
Class 3 (b): Soil Classification (Geotechnical Engineering) (Hossam Shafiq I)
The document discusses soil classification systems used in civil engineering, focusing on the Unified Soil Classification System (USCS). It describes the components and process of the USCS, including determining the percentages of gravel, sand, and fines based on sieve analysis. For soils with over 5% fines, the Atterberg limits test is used to classify on the plasticity chart. An example classification problem is worked through, showing a soil classified as SP-SM based on its grain size distribution and plasticity characteristics.
A PowerPoint Presentation on Superstructure (kuntansourav)
The document discusses different types of stone and brick masonry, including rubble masonry, ashlar masonry, and classifications within each. It also covers topics like doors, louvers, glazing, windows, ventilation, staircases, scaffolding, and shoring. Stone masonry uses stone units bonded with mortar, while brick masonry uses individual bricks laid in a pattern. Staircases require specific widths, heights, materials and other design elements to be safe and functional. Scaffolding and shoring are used to support structures during construction.
Introduction to the System of Coplanar Forces (Engineering Mechanics) (mashnil Gaddapawar)
This document provides an overview of engineering mechanics. It discusses three main classifications of mechanics: mechanics of deformable bodies, mechanics of fluids, and mechanics of rigid bodies. Mechanics of deformable bodies deals with how forces are distributed inside bodies and cause stresses and deformations. Mechanics of fluids concerns liquids and gases and their applications in engineering. Mechanics of rigid bodies examines bodies that do not deform under forces. The document also outlines fundamental concepts in mechanics like length, time, displacement, velocity, and acceleration. It introduces important mechanical laws developed by Sir Isaac Newton like Newton's three laws of motion and Newton's law of universal gravitation. Other topics covered include units of measurement, force, characteristics and classification of forces, and resolution
Friction is the force that opposes the motion of objects in contact with one another. It causes objects to slow down and stop moving even without an apparent force being applied. Friction occurs due to bumps and hollows between surfaces and is greater on rougher surfaces, causing slower motion. While friction has disadvantages like causing wear and reducing efficiency, it also has advantages such as enabling brakes to stop moving vehicles and allowing objects to be gripped. Friction can be reduced by smoothing surfaces or adding lubricants between surfaces.
Nearly all water in the world contains contaminants, even in the absence of nearby pollution-causing activities
Many dissolved minerals, carbon compounds, and microbes find their way into drinking water as it comes in contact with air and soil
When pollutant and contaminant levels in drinking water are high, they may affect household routines and be detrimental to human health
The only way to ensure that your water supply is safe is to have a periodic laboratory water quality analysis done on your drinking water. Hach India is the leading provider of high-end water quality analysis equipment in India.
1. Superstructure construction includes column, beam, floor, wall and roof located above ground level. Materials used are timber, steel and concrete.
2. Timber floor construction involves plank wood supported by timber joists and beams. Reinforced concrete uses column and beam construction with formwork, steel bar installation and concrete pouring.
3. Load bearing walls support loads and transfer to foundation, with minimum thickness of one brick. Non-load bearing walls only support own weight and are half brick thickness.
This document discusses different types of forces. It begins by explaining that moving objects are said to be in motion. It then states that a push or pull acting on an object is called a force. The document goes on to list and briefly describe four main types of forces: gravitational force, magnetic force, nuclear force, and muscular force.
Water quality can be assessed through various physical, chemical, and biological indicators. It depends on factors like geology, ecosystem, and human activities. Standards are set based on intended uses like drinking, industrial, or environmental. Water is sampled and tested using on-site or laboratory methods to monitor these indicators. Maintaining adequate water quality is important for public health and ecosystem protection.
Friction opposes the motion of objects and is caused by bumps on surfaces sticking together when they touch. There are three main types of friction: static friction between non-moving surfaces, sliding friction between surfaces moving past each other, and rolling friction between rolling objects and surfaces. Adding sand to tires increases rolling friction and helps cars move on slippery surfaces by providing more traction between the tires and the ground. This relates to Newton's Second Law, as increasing friction generates a greater net force to overcome inertia according to the formula F=ma.
Automated Machine Learning via Sequential Uniform Designs (Aijun Zhang)
This document introduces automated machine learning (AutoML) and sequential uniform design-based hyperparameter optimization (SeqUDHO). It discusses existing hyperparameter optimization methods and proposes using sequential uniform design. Numerical experiments demonstrate that SeqUDHO outperforms other methods like random search, Bayesian optimization, and grid search on both simulated complex surfaces and real-world classification tasks with SVM, XGBoost, and CNN algorithms. Future work is outlined to improve the approach.
The document proposes a hybrid algorithm combining genetic algorithm and cuckoo search optimization to solve job shop scheduling problems. It aims to minimize makespan (completion time of all jobs) by scheduling jobs on machines. The genetic algorithm is used to explore the search space but can get trapped in local optima. Cuckoo search optimization performs local search faster than genetic algorithm and helps avoid local optima. Experimental results on benchmark problems show the hybrid algorithm yields better solutions in terms of makespan and runtime compared to genetic algorithm and ant colony optimization algorithms.
Comparison Between the Genetic Algorithms Optimization and Particle Swarm Optimization (IAEME Publication)
Close-range photogrammetry network design refers to the process of placing a set of cameras in order to achieve photogrammetric tasks. The main objective of this paper is to find the best locations for two or three camera stations. Genetic algorithm optimization and Particle Swarm Optimization are developed to determine the optimal camera stations for computing three-dimensional coordinates. In this research, a mathematical model representing genetic algorithm optimization and Particle Swarm Optimization for the close-range photogrammetry network is developed. The paper also gives the sequence of field operations and computational steps for this task. A test field is included to reinforce the theoretical aspects.
Comparison between the genetic algorithms optimization and particle swarm optimization (IAEME Publication)
The document compares the genetic algorithms optimization and particle swarm optimization methods for designing close range photogrammetry networks. It presents the genetic algorithm and particle swarm optimization as two popular meta-heuristic algorithms inspired by natural evolution and collective animal behavior, respectively. The document develops mathematical models representing the genetic algorithm and particle swarm optimization for close range photogrammetry network design and evaluates them in a test field to reinforce the theoretical aspects.
Multiprocessor Scheduling and Performance Evaluation Using Elitist Non-dominated Sorting Genetic Algorithm (ijcsa)
Task scheduling plays an important part in the improvement of parallel and distributed systems. The task scheduling problem has been shown to be NP-hard, and deterministic techniques are too time-consuming to solve it. Algorithms have been developed to schedule tasks in distributed environments, but they focus on a single objective; the problem becomes more complex when two objectives are considered. This paper presents a bi-objective independent task scheduling algorithm using the elitist Non-dominated Sorting Genetic Algorithm (NSGA-II) to minimize makespan and flowtime. The algorithm generates Pareto-optimal solutions for this bi-objective task scheduling problem. NSGA-II is evaluated on a set of benchmark instances; the experimental results show that it generates efficient optimal schedules.
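Both objectives are straightforward to compute for a given schedule: makespan is the latest machine finish time, while flowtime sums every task's completion time. A minimal sketch (illustrative, not the paper's code):

```python
def makespan_and_flowtime(assignment, times):
    """assignment: machine index per task; times: processing time per task.
    Tasks on a machine run back-to-back in the given order; flowtime
    accumulates each task's finish time."""
    finish = {}
    flowtime = 0.0
    for task, m in enumerate(assignment):
        finish[m] = finish.get(m, 0.0) + times[task]
        flowtime += finish[m]
    return max(finish.values()), flowtime

# 4 tasks scheduled on 2 machines
print(makespan_and_flowtime([0, 1, 0, 1], [3.0, 2.0, 1.0, 4.0]))  # -> (6.0, 15.0)
```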
Performance Analysis of Genetic Algorithm as a Stochastic Optimization Tool i… (paperpublications3)
Abstract: Engineering design problems are complex by nature because of their critical objective functions involving many variables and constraints. Engineers have to ensure compatibility with the imposed specifications while keeping manufacturing costs low. Moreover, the methodology may vary according to the design problem.
The main issue is to choose the proper tool for optimization. In earlier days, a design problem was optimized by conventional techniques such as gradient search, evolutionary optimization, and random search, known as classical methods.
The method must be chosen properly depending on the nature of the problem; an incorrect choice may fail to give the optimal solution, so these methods are less robust.
Nowadays soft-computing techniques are widely used for optimizing a function, and they are more robust. The genetic algorithm is one such method: an effective tool in the realm of stochastic (non-classical) optimization. The algorithm produces many strings and generations to reach the optimal point.
The main objective of the paper is to optimize engineering design problems using the genetic algorithm and to analyze how effectively and closely the algorithm reaches the optimum. We choose a mathematical expression for the objective function in terms of the design variables and optimize it under given constraints using GA.
Chap. 8: Optimization for Training Deep Models (Young-Geun Choi)
Internal lab seminar material, summarizing and excerpting Chapter 8 of Goodfellow et al. (2016), Deep Learning, MIT Press. It introduces the optimization methods commonly used to minimize the objective function when training deep neural networks.
The document discusses building human-based software estimation models that are accurate, intuitive, and easy to understand. It presents an approach using correlation and scale factors between estimated and actual effort. Experiments on a dataset of 178 samples show that combining correlation and scale factors into a decision tree achieves up to 93.3% accuracy. The resulting model bridges expert and algorithmic estimation methods.
A Tabu Search Heuristic for the Generalized Assignment Problem (Sandra Long)
This document proposes a Tabu search heuristic for solving the generalized assignment problem (GAP), which is an NP-hard combinatorial optimization problem. The algorithm uses a relaxed formulation of the GAP that allows for infeasible solutions by including a penalty term for capacity violations. It employs simple neighborhood structures and dynamically adjusts the penalty weighting over time using both recent and medium-term memory. Computational experiments show that the algorithm provides good quality solutions efficiently compared to other heuristic methods for the GAP.
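The relaxed objective can be sketched as assignment cost plus a weighted capacity-violation penalty, so the search can pass through infeasible solutions (names and example values are illustrative, not the paper's formulation):

```python
def gap_penalized_cost(assign, cost, demand, capacity, alpha):
    """Relaxed GAP objective: assignment cost plus alpha per unit of
    agent-capacity violation, keeping infeasible moves explorable.
    cost[m][j], demand[m][j]: cost/resource of task j on agent m."""
    total = sum(cost[m][j] for j, m in enumerate(assign))
    load = [0.0] * len(capacity)
    for j, m in enumerate(assign):
        load[m] += demand[m][j]
    violation = sum(max(0.0, l - c) for l, c in zip(load, capacity))
    return total + alpha * violation

# two tasks both forced onto agent 0, which only has room for one
print(gap_penalized_cost([0, 0],
                         cost=[[2.0, 3.0], [4.0, 1.0]],
                         demand=[[5.0, 5.0], [5.0, 5.0]],
                         capacity=[5.0, 5.0],
                         alpha=10.0))   # -> 55.0
```

In the heuristic described above, the weight alpha would be adjusted dynamically as the search proceeds.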
Two-Stage Eagle Strategy with Differential Evolution (Xin-She Yang)
The document describes a two-stage optimization strategy called the Eagle Strategy (ES) that combines global and local search algorithms to improve search efficiency. It evaluates applying ES to differential evolution (DE), a popular evolutionary algorithm. ES first uses randomization like Levy flights for global exploration, then switches to DE for intensive local search around promising solutions. The authors validate ES-DE on test functions, finding it requires only 9.7-24.9% of the function evaluations of pure DE. They also apply it to real-world pressure vessel and gearbox design problems, achieving solutions with 14.9-17.7% fewer function evaluations than pure DE.
Accelerated life testing plans are designed under multiple-objective considerations, with the resulting Pareto-optimal solutions classified and reduced using a neural network and data envelopment analysis, respectively.
This document provides an introduction to Bayesian optimization and techniques used by SigOpt to optimize machine learning models and simulations. It discusses how Bayesian optimization uses a probabilistic model and acquisition function to efficiently search parameter spaces to find optimal configurations. Key aspects covered include Gaussian process and random forest regression models, expected improvement acquisition functions, and software packages that employ these methods like Spearmint, Hyperopt, and SMAC.
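The expected-improvement acquisition mentioned above has a closed form under a Gaussian posterior. A minimal sketch for minimization (illustrative, not SigOpt's implementation):

```python
import math

def expected_improvement(mu, sigma, best):
    """EI for minimization under a Gaussian posterior N(mu, sigma^2):
    EI = (best - mu) * Phi(z) + sigma * phi(z), with z = (best - mu) / sigma."""
    if sigma <= 0.0:
        return max(0.0, best - mu)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # phi(z)
    return (best - mu) * cdf + sigma * pdf

# at the current best (mu = best), EI is pure exploration: sigma * phi(0)
print(round(expected_improvement(0.0, 1.0, 0.0), 4))
```

Candidates maximizing this quantity balance exploiting low predicted means against exploring high-variance regions.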
International Refereed Journal of Engineering and Science (IRJES) (irjes)
The core of the IRJES vision is to disseminate new knowledge and technology for the benefit of all, from academic research and professional communities to industry professionals, across a range of topics in computer science and engineering. It also provides a venue for high-caliber researchers, practitioners, and PhD students to present ongoing research and development in these areas.
On the Effect of Exploration Strategies on Maintenanc… (Zephyrin Soh, Ptidej Team, 2013-03-21)
This document presents an empirical study that investigates developers' program exploration strategies. The goal is to understand how developers navigate through a program's entities in order to help them more efficiently. The study analyzes developers' interaction histories to identify common exploration strategies and examines relationships between strategies and other factors like task type and expertise level. The results could help evaluate developer performance, improve comprehension models, and guide less experienced developers.
A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce (Yu Liu)
This document describes a homomorphism-based framework for systematic parallel programming with MapReduce. The framework introduces a systematic approach to automatically generate fully parallelized and scalable MapReduce programs. It provides algorithmic programming interfaces that allow users to focus on the algebraic properties of problems, hiding the details of MapReduce. The framework was implemented on top of Hadoop and evaluated on several test problems, demonstrating good scalability and parallelism. Future work could decrease system overhead, optimize performance further, and extend the framework to more complex data structures like trees and graphs.
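The framework's central algebraic idea, that a list homomorphism splits into a per-chunk map phase and an associative reduce phase, can be sketched as follows (an illustrative sketch, not the framework's actual API):

```python
from functools import reduce

def hom(f, op, chunks):
    """A list homomorphism h with h(x ++ y) = h(x) `op` h(y) and h([a]) = f(a).
    With an associative op, each chunk is processed independently (the 'map'
    phase) and partial results combine in any grouping (the 'reduce' phase)."""
    partial = [reduce(op, map(f, c)) for c in chunks]   # map phase, per split
    return reduce(op, partial)                          # reduce phase

# sum of squares as a homomorphism, computed over two independent 'splits'
print(hom(lambda x: x * x, lambda a, b: a + b, [[1, 2], [3, 4]]))  # -> 30
```

Associativity of the combining operator is exactly what makes the reduce phase safe to parallelize and regroup arbitrarily across workers.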
This document describes AURA, a hybrid approach to identify framework evolution. AURA uses call dependency graphs and code element similarity to identify replacement rules between framework versions. It aims to identify one-to-many, many-to-one, deleted, and cascading replacement rules automatically without requiring framework developer involvement or thresholds. The approach is inspired by and improves upon previous work on identifying framework evolution rules.
Similar to Implementing Generate-Test-and-Aggregate Algorithms on Hadoop (20)
A TPC Benchmark of Hive LLAP and Comparison with Presto (Yu Liu)
This is a TPC-H/DS benchmark of both Hive LLAP (Low Latency Analytical Processing) and Presto, comparing the two popular big-data query engines. The results show significant advantages for Hive LLAP in performance and durability.
Cloud Era Transactional Processing: Problems, Strategies and Solutions (Yu Liu)
The document discusses challenges and solutions for transactional processing in the cloud era. It covers modeling transactional consistency constraints, choosing appropriate consistency models like causal consistency, and state-of-the-art academic research in coordination avoidance, consistency models, and hardware efforts to improve transaction processing performance. The document provides definitions of consistency models and isolation levels and compares different approaches.
The document discusses natural language processing (NLP) for medical documents, specifically retrieving International Classification of Diseases (ICD) codes from free-text medical reports. It summarizes a medical NLP shared task called MedNLPDoc that aimed to retrieve information from Japanese medical reports. The highest performing system used a rule-based approach, showing rules can still outperform machine learning for medical NLP. Collaboration between researchers and enterprises was encouraged to resolve gaps between academic research and real-world requirements.
Survey on Parallel/Distributed Search Engines (Yu Liu)
This document summarizes a survey on parallel and distributed search engines. It discusses how web search tasks like crawling billions of documents, indexing terabytes of data, and responding to thousands of queries simultaneously require a parallel or distributed approach. It then provides examples of distributed search engines and technologies like MapReduce, and discusses challenges in distributed search like resource representation, selection, and result merging. Finally, it surveys parallel implementations of clustering algorithms and challenges in parallelizing hierarchical agglomerative clustering with MapReduce.
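The indexing task mentioned above is the canonical MapReduce example: mappers emit (term, document) pairs and reducers group them into postings lists. A miniature in-process simulation (illustrative only):

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (term, doc_id) pair for each distinct term in a document."""
    return [(term, doc_id) for term in set(text.split())]

def reduce_phase(pairs):
    """Reduce: group postings by term into the inverted index."""
    index = defaultdict(set)
    for term, doc in pairs:
        index[term].add(doc)
    return index

docs = {1: "big data search", 2: "distributed search engines"}
pairs = [p for d, t in docs.items() for p in map_phase(d, t)]
index = reduce_phase(pairs)
print(sorted(index["search"]))   # -> [1, 2]
```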
Paper Introduction: Combinatorial Optimization on Graphs of Bounded Treewidth (Yu Liu)
These slides introduce the paper: H. L. Bodlaender and A. M. C. A. Koster, "Combinatorial Optimization on Graphs of Bounded Treewidth," Comput. J., vol. 51, no. 3, pp. 255-269, Nov. 2007.
Paper Introduction: Combinatorial Model and Bounds for Target Set Selection (Yu Liu)
The paper Combinatorial Model and Bounds for Target Set Selection, by Eyal Ackerman, Oren Ben-Zwi, and Guy Wolfovitz, presents:
1. a combinatorial model for the dynamic activation process in influence networks;
2. representations of the Perfect Target Set Selection Problem and its variants as linear integer programs;
3. combinatorial lower and upper bounds on the size of the minimum Perfect Target Set.
An Accumulative Computation Framework on MapReduce (PPL 2013) (Yu Liu)
The document discusses an accumulative computation framework on MapReduce clusters. It presents examples of accumulative computation programs and benchmarks their performance on MapReduce. The experiments show the framework can process large datasets in a reasonable time and achieves near-linear speedup when increasing CPUs, demonstrating the efficiency and scalability of the approach. The accumulative computation pattern and framework simplify parallelizing problems that have data dependencies and allow encoding many parallel computations.
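A classic accumulative computation threads an accumulation parameter through the list; elimSmallers, which drops every element smaller than some earlier element, is a standard example. A sequential sketch (illustrative; the framework parallelizes this pattern because the running maximum combines associatively):

```python
def elim_smallers(xs):
    """Drop every element smaller than some element before it.
    The running maximum 'acc' is the accumulation parameter threaded
    left to right through the list."""
    out, acc = [], float("-inf")
    for x in xs:
        if x >= acc:
            out.append(x)
        acc = max(acc, x)
    return out

print(elim_smallers([3, 1, 4, 1, 5, 9, 2, 6]))   # -> [3, 4, 5, 9]
```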
An Introduction of Recent Research on MapReduce (2011) (Yu Liu)
This document summarizes recent research on MapReduce. It outlines papers presented at the MAPREDUCE11 conference and Hadoop World 2010, including papers on resource attribution in data clusters, shared-memory MapReduce implementations, static type checking of MapReduce programs, QR factorizations, genome indexing, and optimizing data selection. It also summarizes talks and lists several interesting papers on topics like distributed data processing.
Introduction of A Lightweight Stage-Programming Framework (Yu Liu)
The lightweight stage-programming framework introduced in these slides can be used to build efficient parallel DSLs that can be transformed into MapReduce programs. To understand these slides, please first read http://www.slideshare.net/YuLiu19/a-generatetestaggregate-parallel-programming-library-on-spark.
Start From A MapReduce Graph Pattern-recognize AlgorithmYu Liu
This document summarizes a presentation on developing a MapReduce algorithm to recognize patterns in large graphs by finding connected components. It discusses:
- Motivation to study parallel graph algorithms and frameworks like MapReduce and Pregel
- The problem of finding link patterns in graphs by extracting connected components
- Background on semantic web and linked open data modeled as RDF graphs
- A naive O(2Ck)-iteration MapReduce algorithm to find connected components between pairs of datasets
- Examples and analysis of the algorithm's complexity and communication costs
Introduction of the Design of A High-level Language over MapReduce -- The Pig...Yu Liu
Pig is a platform for analyzing large datasets that uses Pig Latin, a high-level language, to express data analysis programs. Pig Latin programs are compiled into MapReduce jobs and executed on Hadoop. Pig Latin provides data manipulation constructs like SQL as well as user-defined functions. The Pig system compiles programs through optimization, code generation, and execution on Hadoop. Future work focuses on additional optimizations, non-Java UDFs, and interfaces like SQL.
On Extending MapReduce - Survey and ExperimentsYu Liu
It talks a survey and my experiments on extending MapReduce programming model. A BSP-based MapReduce interface was implemented and evaluated, which shows dramatically improvement on performance.
Introduction to Ultra-succinct representation of ordered trees with applicationsYu Liu
The document summarizes a paper on ultra-succinct representations of ordered trees. It introduces tree degree entropy, a new measure of information in trees. It presents a succinct data structure that uses nH*(T) + O(n log log n / log n) bits to represent an ordered tree T with n nodes, where H*(T) is the tree degree entropy. This representation supports computing consecutive bits of the tree's DFUDS representation in constant time. It also supports computing operations like lowest common ancestor, depth, and level-ancestor in constant time using an auxiliary structure of O(n(log log n)2 / log n) bits.
On Implementation of Neuron Network(Back-propagation)Yu Liu
This document outlines Yu Liu's work implementing and comparing different parallel versions of a neural network using backpropagation. It discusses motivations for parallel programming practice and library study. It provides an introduction to neural networks and backpropagation algorithms. Three implementations are compared: sequential C++ STL, Skelton library, and Intel TBB. Benchmark results show improved speedups from parallel versions. Remaining challenges are also noted, like addressing local minima problems and testing on larger data.
ScrewDriver Rebirth: Generate-Test-and-Aggregate Framework on HadoopYu Liu
This document describes Yu Liu's ScrewDriver Rebirth framework for implementing the generate-test-aggregate algorithm on Hadoop. The framework uses semiring structures to represent the generate, test, and aggregate functions. It defines Generator and Aggregater classes to implement generation and aggregation. The framework allows fusing operations by lifting semirings and defining new generators. Examples show various generators, tests, and aggregators run on Hadoop to evaluate performance improvements over the previous version.
A Homomorphism-based MapReduce Framework for Systematic Parallel ProgrammingYu Liu
The document outlines a homomorphism-based framework for parallel programming on MapReduce. It introduces homomorphisms and theorems about them. The framework represents lists as sets of key-value pairs distributed across nodes. Functions are implemented using this representation and MapReduce, allowing easy parallelization of problems like maximum prefix sum that are otherwise complex on MapReduce.
Towards Systematic Parallel Programming over MapReduceYu Liu
This document discusses programming with MapReduce and proposes a calculational approach for systematic parallel programming over MapReduce. It begins with background on MapReduce and examples of MapReduce programming. It then discusses issues with directly mapping sequential algorithms to MapReduce. The document proposes expressing computations as list homomorphisms, which can be automatically implemented as MapReduce jobs. It presents an interface for defining sequential functions as fold and unfold and discusses implementing list homomorphisms in MapReduce by representing lists and intermediate data. It evaluates the performance of the homomorphism-based approach.
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Implementing Generate-Test-and-Aggregate Algorithms on Hadoop
Yu Liu (The Graduate University for Advanced Studies), Sebastian Fischer (National Institute of Informatics), Kento Emoto (University of Tokyo), and Zhenjiang Hu (National Institute of Informatics)
September 28, 2011

Outline
Background
Motivation and Objective
Design and implementation
Performance test
Conclusion and future work
MapReduce
Computation proceeds in three phases: map, shuffle, and reduce.
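The three phases can be sketched in plain Java as an in-memory word count. This is only an illustration of the model, not the Hadoop API; the class and method names are ours:

```java
import java.util.*;

// Minimal in-memory sketch of MapReduce's three phases (word count).
// Illustrates the model only; it does not use the Hadoop API.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // map: emit a (word, 1) pair for every word in every input line
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String w : line.split("\\s+"))
                if (!w.isEmpty())
                    pairs.add(new AbstractMap.SimpleEntry<>(w, 1));

        // shuffle: group all values belonging to the same key together
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // reduce: sum the grouped counts per key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            int sum = 0;
            for (int v : g.getValue()) sum += v;
            result.put(g.getKey(), sum);
        }
        return result;
    }
}
```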
Programming with MapReduce
Programmers need to implement Mapper and Reducer classes (in Hadoop).
The main difficulties of MapReduce programming:
Nontrivial problems are usually difficult to compute in a divide-and-conquer fashion.
Efficiency of parallel algorithms is difficult to obtain.
Generate-Test-and-Aggregate Algorithms
A Generate-Test-and-Aggregate (GTA for short) algorithm consists of:
generate, which generates all possible solution candidates;
test, which filters the intermediate data;
aggregate, which computes a summary of the valid intermediate data.
GTA is a very useful and common strategy for a large class of problems.
An Example: Knapsack Problem
Fill a knapsack with items, each of a certain value and weight, such that the total value of the packed items is maximal while adhering to the weight restriction of the knapsack.
picture from Wikipedia
A knapsack program (GTA algorithm):
knapsack = maxvalue ◦ filter ◦ sublists
E.g., there are 3 items: (1kg, $1), (1kg, $2), (2kg, $2)
sublists [(1kg, $1), (1kg, $2), (2kg, $2)]
= [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
[(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)]
Suppose the capacity of the knapsack is 2 kg.
filter [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(1kg, $1), (1kg, $2), (2kg, $2)],
[(1kg, $1), (2kg, $2)], [(1kg, $2)], [(1kg, $2), (2kg, $2)], [(2kg, $2)]
= [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)]
maxvalue [ ], [(1kg, $1)], [(1kg, $1), (1kg, $2)], [(2kg, $2)], [(1kg, $2)]
= $3
This program is simple but inefficient, because it generates exponentially many (2^n) intermediate candidates.
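The naive composition maxvalue ◦ filter ◦ sublists can be written down directly as a sequential program. A sketch with our own names (an item is a {weight, value} pair), which indeed materializes all 2^n sublists:

```java
import java.util.*;

// Naive sequential GTA knapsack: maxvalue . filter . sublists.
// Deliberately inefficient: O(2^n) candidates; item = {weight, value}.
public class NaiveKnapsack {
    // generate: all sublists (order-preserving subsets) of the item list
    static List<List<int[]>> sublists(List<int[]> items) {
        List<List<int[]>> result = new ArrayList<>();
        result.add(new ArrayList<>());          // the empty candidate
        for (int[] item : items) {
            List<List<int[]>> extended = new ArrayList<>();
            for (List<int[]> s : result) {
                List<int[]> t = new ArrayList<>(s);
                t.add(item);
                extended.add(t);
            }
            result.addAll(extended);            // keep both: with and without item
        }
        return result;
    }

    // test: keep candidates whose total weight fits the capacity
    static boolean fits(List<int[]> s, int capacity) {
        int w = 0;
        for (int[] item : s) w += item[0];
        return w <= capacity;
    }

    // aggregate: maximum total value among the valid candidates
    public static int knapsack(List<int[]> items, int capacity) {
        int best = 0;
        for (List<int[]> s : sublists(items))
            if (fits(s, capacity)) {
                int v = 0;
                for (int[] item : s) v += item[1];
                best = Math.max(best, v);
            }
        return best;
    }
}
```

On the 3-item example above with capacity 2 kg this yields $3, matching the hand computation.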
Theorems for Generating Efficient Parallel GTA Programs
Efficient parallel programs can be derived from users' naive but correct programs written in terms of a generate, a test, and an aggregate function [Emoto et al., 2011]:
aggregate ◦ test ◦ generate ⇒ list homomorphism
List homomorphisms are a class of recursive functions that match very well with the divide-and-conquer paradigm [Bird, 87; Cole, 95].
Emoto's theorem holds under the following assumptions:
aggregate is a semiring homomorphism.
test is a list homomorphism.
generate is polymorphic over semiring structures.
Motivation and Objective
Emoto's fusion theorem shows a possible way to systematically implement efficient parallel programs from GTA algorithms.
We need to evaluate this approach by implementing a practical library, which should:
have an easy-to-use programming interface that helps users design GTA algorithms;
be able to generate efficient parallel programs on MapReduce (Hadoop).
System Overview
Implementation on Hadoop
MapReducer is an interface for list homomorphisms:
h [ ] = id⊕
h [a] = f a
h (x ++ y) = h x ⊕ h y

public interface MapReducer<Elem, Val, Res> {
    public Val identity();
    public Val element(Elem elem);
    public Val combine(Val left, Val right);
    public Res postprocess(Val val);
}
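As an illustration, the list homomorphism sum can be expressed against this interface. SumMapReducer and its sequential run method are our own sketch, not part of the library:

```java
import java.util.List;

// The interface from the slide, restated here so the example is self-contained.
interface MapReducer<Elem, Val, Res> {
    Val identity();
    Val element(Elem elem);
    Val combine(Val left, Val right);
    Res postprocess(Val val);
}

// sum as a list homomorphism: h [] = 0, h [a] = a, h (x ++ y) = h x + h y.
public class SumMapReducer implements MapReducer<Integer, Integer, Integer> {
    public Integer identity() { return 0; }
    public Integer element(Integer elem) { return elem; }
    public Integer combine(Integer left, Integer right) { return left + right; }
    public Integer postprocess(Integer val) { return val; }

    // Sequential reference semantics of the homomorphism (a fold);
    // on a cluster, combine is applied across per-node partial results.
    public Integer run(List<Integer> xs) {
        Integer acc = identity();
        for (Integer x : xs) acc = combine(acc, element(x));
        return postprocess(acc);
    }
}
```

Because combine is associative with identity(), partial sums computed on different nodes can be merged in any grouping.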
Aggregator defines a semiring homomorphism (A, ⊕, ⊗) → (S, ⊕′, ⊗′):

public interface Aggregator<A, S> {
    public S zero();
    public S one();
    public S singleton(A a);
    public S plus(S left, S right);
    public S times(S left, S right);
}
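For example, the knapsack aggregate maxvalue can be viewed as a homomorphism into the max-plus semiring: plus picks the better of two alternative candidates, times combines values within one candidate. The MaxValue class below is our illustrative sketch, not the library's:

```java
// The interface from the slide, restated here so the example is self-contained.
interface Aggregator<A, S> {
    S zero();
    S one();
    S singleton(A a);
    S plus(S left, S right);
    S times(S left, S right);
}

// maxvalue for knapsack as a homomorphism into the max-plus semiring
// (Int ∪ {-∞}, max, +): zero = -∞ (no candidate), one = 0 (the empty
// candidate), plus = max over alternatives, times = + within a candidate.
public class MaxValue implements Aggregator<Integer, Long> {
    static final long NEG_INF = Long.MIN_VALUE / 2;  // halved to avoid overflow in times
    public Long zero() { return NEG_INF; }
    public Long one() { return 0L; }
    public Long singleton(Integer value) { return value.longValue(); }
    public Long plus(Long left, Long right) { return Math.max(left, right); }
    public Long times(Long left, Long right) { return left + right; }
}
```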
Test is almost a list homomorphism; it inherits MapReducer:

public interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}
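For instance, the knapsack weight restriction can be phrased as such a Test: the homomorphism part sums the weights (the Key), and postprocess compares the total against the capacity. WeightTest is our own sketch, not the library's:

```java
// The interfaces from the slides, restated here so the example is self-contained.
interface MapReducer<Elem, Val, Res> {
    Val identity();
    Val element(Elem elem);
    Val combine(Val left, Val right);
    Res postprocess(Val val);
}
interface Test<Elem, Key> extends MapReducer<Elem, Key, Boolean> {}

// Knapsack weight test: the homomorphism sums item weights into the Key,
// and postprocess turns that key into the pass/fail verdict.
public class WeightTest implements Test<int[], Integer> {  // item = {weight, value}
    private final int capacity;
    public WeightTest(int capacity) { this.capacity = capacity; }
    public Integer identity() { return 0; }
    public Integer element(int[] item) { return item[0]; }
    public Integer combine(Integer left, Integer right) { return left + right; }
    public Boolean postprocess(Integer totalWeight) { return totalWeight <= capacity; }
}
```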
Generator implements a MapReducer:
polymorphic over semirings: the constructor takes an instance of Aggregator
filter embedding: the embed function returns a new generator

public abstract class Generator<Elem, Single, Val, Res>
        implements MapReducer<Elem, Val, Res> {
    // The constructor takes an instance of Aggregator
    public Generator(Aggregator<Single, Val> aggregator) { ... }

    // take an instance of Test and return a new instance of Generator
    public <Key> Generator<Elem, Single, WritableMap<Key, Val>, Res>
            embed(final Test<Single, Key> test) {
        final Generator<Elem, Single, Val, Res> base = this;
        return new Generator<Elem, Single, WritableMap<Key, Val>, Res>
            (new Aggregator<Single, WritableMap<Key, Val>>() { ... });
    }

    public Val process(List<Elem> list) { ... }
    ...
}
1. Users make their own Generator, Test, and Aggregator by extending/implementing the ones provided by the library (the library supplies commonly used Generators and Aggregators).
2. An instance of Generator, which is itself an efficient list homomorphism, is created at run time on each worker node.
3. This list homomorphism is then executed by Hadoop in parallel.
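To see what filter embedding buys, the fused knapsack computation can be sketched sequentially: the lifted semiring works on maps from the test's key (accumulated weight) to the best value reaching that weight, so intermediate data stays polynomial in the input rather than exponential. All names here are ours, not the library's:

```java
import java.util.*;

// Sequential sketch of the fused GTA knapsack after filter embedding:
// the lifted value is a map from accumulated weight (the test's key,
// capped at capacity) to the best total value reaching that weight.
public class FusedKnapsack {
    public static int knapsack(List<int[]> items, int capacity) {  // item = {weight, value}
        // lifted "one": only the empty candidate, with weight 0 and value 0
        Map<Integer, Integer> best = new HashMap<>();
        best.put(0, 0);
        for (int[] item : items) {
            Map<Integer, Integer> next = new HashMap<>(best);  // candidates without item
            for (Map.Entry<Integer, Integer> e : best.entrySet()) {
                int w = e.getKey() + item[0];
                if (w > capacity) continue;          // the test, folded into generation
                int v = e.getValue() + item[1];
                next.merge(w, v, Math::max);         // plus = max in the value semiring
            }
            best = next;
        }
        // postprocess: best value over all weights that pass the test
        int answer = 0;
        for (int v : best.values()) answer = Math.max(answer, v);
        return answer;
    }
}
```

At any point the map holds at most capacity + 1 entries, so the whole run is O(n · capacity) instead of O(2^n).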
Java Code
Let’s have a look at the actual implementation of GTA Knapsack...
Performance Evaluation
Environment (hardware): we configured clusters with 2, 4, 8, 16, and 32 nodes (virtual machines); each computing/data node has one CPU (VM, Xeon E5530 @ 2.4 GHz, 1 core) and 3 GB memory.
Test data: 10^2 × 2^20 (≈ 10^8) knapsack items (3.2 GB). Each item's weight is between 0 and 10, and the capacity of the knapsack is 100.
Evaluation on Hadoop
The knapsack program scales well as nodes are added to the cluster.
Conclusion
The implementation of the GTA library on Hadoop can:
hide the technical details of MapReduce (Hadoop);
automatically perform parallelization and optimization;
generate MapReduce programs with good scalability;
make coding, testing, and code reuse much simpler.
Future Work
Optimization of the current framework for better performance
Extension of the current framework
Other approaches to systematic parallel programming
Thanks
Questions?
The project is hosted on
http://screwdriver.googlecode.com
Appendix: Computation over Semirings
Definition (Semiring)
Given a set S and two binary operations ⊕ and ⊗, the triple (S, ⊕, ⊗) is called a semiring if and only if:
(S, ⊕) is a commutative monoid with identity element id⊕;
(S, ⊗) is a monoid with identity element id⊗;
⊗ is associative and distributes over ⊕;
id⊕ is a zero of ⊗: id⊕ ⊗ a = a ⊗ id⊕ = id⊕.
(Int, +, ×) is a semiring; (Int ∪ {−∞}, max, +) is another (the max-plus semiring).
Definition (Semiring homomorphism)
Given two semirings (S, ⊕, ⊗) and (S′, ⊕′, ⊗′), a function hom : S → S′ is a semiring homomorphism from (S, ⊕, ⊗) to (S′, ⊕′, ⊗′) iff it is a monoid homomorphism from (S, ⊕) to (S′, ⊕′) and also a monoid homomorphism from (S, ⊗) to (S′, ⊗′).
Theorem (Filter-Embedding Fusion)
Given a set A, a finite monoid (M, ⊙), a monoid homomorphism hom from ([A], ++) to (M, ⊙), a semiring (S, ⊕, ⊗), a semiring homomorphism aggregate from ({[A]}, ∪, ×++) to (S, ⊕, ⊗), a function ok : M → Bool, and a polymorphic semiring generator generate, the following equation holds:

aggregate ◦ filter (ok ◦ hom) ◦ generate(∪, ×++) (λx → {[x]})
= postprocess(M, ok) ◦ generate(⊕M, ⊗M) (λx → aggregate(M) {[x]})

The result of the fusion is an efficient algorithm in the form of a list homomorphism.
List Homomorphism
List homomorphisms [Bird, 87; Cole, 95] are a class of recursive functions.
Definition of List Homomorphism
A function h is a list homomorphism if there is an associative operator ⊙ such that for any lists x and y:
h (x ++ y) = h(x) ⊙ h(y)
where ++ is list concatenation, h [a] = f a, and h(x) ⊙ id = h(x) with id the identity element of ⊙.
Instance of a list homomorphism:
sum [a] = a
sum (x ++ y) = sum x + sum y
A list homomorphism can be automatically parallelized by MapReduce [Liu et al., Euro-Par 2011].
Evaluation on Hadoop
We test 3.2GB data on {2 , 4, 8, 16, 32} nodes clusters and 32
GB data on {32, 64} nodes clusters
2 nodes 4 nodes 8 nodes 16 nodes 32 nodes 64 nodes
time(sec.) 1602 882 482 317 961 511
speedup – × 1.82 × 1.83 × 1.52 – × 1.88