A Generate-Test-Aggregate Parallel Programming Library on Spark
1. A Generate-Test-Aggregate
Parallel Programming Library
Yu Liu1, Kento Emoto2, Zhenjiang Hu3
1The Graduate University for Advanced Studies
2The University of Tokyo
3National Institute of Informatics
PPoPP PMAM 2013
Systematic Parallel Programming for MapReduce
2. Outline
Introduction to GTA
The GTA library
Implementation strategy
Programming interface
Automatic parallelization and optimization
Applications and evaluations
Conclusions
4. The GTA Programming Methodology
Simple programming pattern
1. Generate all possible solution candidates;
2. Test and filter candidates;
3. Aggregate the valid candidates.
Expressive and code efficient
Covers a large class of problems
Automatic optimization and parallelization
~ Kento Emoto et al. [ESOP’12]
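The three steps above can be sketched as a plain higher-order function. This is only a naive illustration of the pattern, not the library's code: the names `GtaPattern`, `sublists`, `gta`, and the even-sum toy problem are assumptions made for this example.

```scala
object GtaPattern {
  // Generate: all sublists (order-preserving sub-sequences) of xs; 2^n of them.
  def sublists[A](xs: List[A]): List[List[A]] =
    xs.foldRight(List(List.empty[A])) { (x, acc) =>
      acc.flatMap(s => List(s, x :: s))
    }

  // The three GTA steps composed naively: every candidate is materialized,
  // filtered by the test, and the survivors are handed to the aggregator.
  def gta[A, C, R](generate: List[A] => List[C])(test: C => Boolean)(
      aggregate: List[C] => R)(input: List[A]): R =
    aggregate(generate(input).filter(test))

  // Hypothetical toy problem: count the sublists of [1, 2, 3] with an even sum.
  val evenSumCount: Int =
    gta[Int, List[Int], Int](xs => sublists(xs))(_.sum % 2 == 0)(_.length)(List(1, 2, 3))
}
```

Written this way the specification is short, but the cost is exponential in the input length, which is the problem the optimization in GTA is meant to remove.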
5. An Example: The Knapsack Problem
Writing a parallel (MapReduce) program for the
knapsack problem is not easy.
6. input: [ (1 $, 2 Kg), (2 $, 6 Kg), (3 $, 10 Kg) ]
weight limit = 15 Kg
generate:
[ [ ], [ (1 $, 2 Kg) ], [ (2 $, 6 Kg) ], [ (3 $, 10 Kg) ],
[ (1 $, 2 Kg), (2 $, 6 Kg) ], [ (1 $, 2 Kg), (3 $, 10 Kg) ],
[ (2 $, 6 Kg), (3 $, 10 Kg) ],
[ (1 $, 2 Kg), (2 $, 6 Kg), (3 $, 10 Kg) ] ]
test: [ true, true, true, true, true, true, false, false ]
filter: [ [ ], [ (1 $, 2 Kg) ], [ (2 $, 6 Kg) ], [ (3 $, 10 Kg) ],
[ (1 $, 2 Kg), (2 $, 6 Kg) ], [ (1 $, 2 Kg), (3 $, 10 Kg) ] ]
aggregate: 0 $, 1 $, 2 $, 3 $, 3 $, 4 $ (maximum: 4 $)
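The worked example can be reproduced with a few lines of plain Scala. This is a naive sketch for illustration, not the GTA library; `NaiveKnapsack` and its members are names assumed here, and items are modelled as (value in $, weight in Kg) pairs.

```scala
object NaiveKnapsack {
  // An item is a (value in $, weight in Kg) pair, as in the example.
  type Item = (Int, Int)

  // generate: every selection of items (2^n candidates for n items).
  def generate(items: List[Item]): List[List[Item]] =
    items.foldRight(List(List.empty[Item])) { (x, acc) =>
      acc.flatMap(s => List(s, x :: s))
    }

  // test: the total weight of a candidate must not exceed the limit.
  def withinLimit(limit: Int)(candidate: List[Item]): Boolean =
    candidate.map(_._2).sum <= limit

  // aggregate: the best total value among the valid candidates.
  // The empty selection always passes when limit >= 0, so max is defined.
  def maxValue(candidates: List[List[Item]]): Int =
    candidates.map(_.map(_._1).sum).max

  def knapsack(items: List[Item], limit: Int): Int =
    maxValue(generate(items).filter(c => withinLimit(limit)(c)))

  def main(args: Array[String]): Unit = {
    val items = List((1, 2), (2, 6), (3, 10))
    println(knapsack(items, 15)) // best value for the example input
  }
}
```

For the input above, six of the eight candidates pass the test and the best value is 4 $.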
7. Naively implementing Knapsack is inefficient (O(2^n)).
Input (length)    Time (ms)
8                 30
12                86
16                97
20                2829
24                java.lang.OutOfMemoryError: Java heap space
Performance of the naïve Knapsack program
The GTA fusion theorem is introduced to resolve this efficiency problem.
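To give a feel for what fusing generate, test, and aggregate buys, here is a hand-written sketch that never materializes the candidates. It is the classic weight-indexed dynamic program, written under the assumption that the fused program behaves similarly; it is not the output of the library's fusion, and `FusedKnapsack` is a name made up for this example.

```scala
object FusedKnapsack {
  // Fold over the items while keeping, for each reachable total weight
  // (at most limit), only the best total value seen so far.
  // Assumes limit >= 0 and non-negative weights, so the table has at most
  // limit + 1 entries instead of 2^n candidates.
  def knapsack(items: List[(Int, Int)], limit: Int): Int = {
    val start = Map(0 -> 0) // the empty selection: weight 0, value 0
    val table = items.foldLeft(start) { case (best, (value, weight)) =>
      // Extend every known selection with the current item if it still fits.
      val extended = for {
        (w, v) <- best.toList
        if w + weight <= limit
      } yield (w + weight, v + value)
      // Merge, keeping the larger value when two selections share a weight.
      extended.foldLeft(best) { case (m, (w, v)) =>
        m.updated(w, math.max(v, m.getOrElse(w, v)))
      }
    }
    table.values.max
  }
}
```

The running time is proportional to the number of items times the weight limit, so an input of length 24 is no longer a problem in this sketch.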
9. Definitions of G, T, A
Class Name    Algebraic Structure
Generator     polymorphic semiring generator
Predicate     almost list homomorphism
Aggregator    semiring homomorphism
Ref: K. Emoto [ESOP’12]
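A small sketch may help make "polymorphic semiring generator" concrete. The trait and instances below are assumptions for illustration only (they are not the library's definitions): the same `sublists` generator produces all candidates when given a bag-of-lists semiring, and directly produces the best total when given the max-plus semiring.

```scala
object SemiringSketch {
  // A semiring: plus with identity zero, times with identity one,
  // where times distributes over plus.
  trait Semiring[S] {
    def zero: S
    def one: S
    def plus(a: S, b: S): S
    def times(a: S, b: S): S
  }

  // Sublist generator, polymorphic in the semiring: each element is either
  // taken (f(x)) or skipped (one); the choices are combined with times.
  def sublists[A, S](xs: List[A])(f: A => S)(sr: Semiring[S]): S =
    xs.foldRight(sr.one)((x, acc) => sr.times(sr.plus(f(x), sr.one), acc))

  // Bags of lists: plus is union, times is pairwise concatenation.
  def bagOfLists[A]: Semiring[List[List[A]]] = new Semiring[List[List[A]]] {
    val zero: List[List[A]] = Nil
    val one: List[List[A]] = List(Nil)
    def plus(a: List[List[A]], b: List[List[A]]): List[List[A]] = a ++ b
    def times(a: List[List[A]], b: List[List[A]]): List[List[A]] =
      for (x <- a; y <- b) yield x ++ y
  }

  // Max-plus on Int; Int.MinValue stands in for minus infinity.
  val maxPlus: Semiring[Int] = new Semiring[Int] {
    val zero: Int = Int.MinValue
    val one: Int = 0
    def plus(a: Int, b: Int): Int = math.max(a, b)
    def times(a: Int, b: Int): Int =
      if (a == zero || b == zero) zero else a + b
  }
}
```

For example, `SemiringSketch.sublists(List(1, 2, 3))(x => List(List(x)))(SemiringSketch.bagOfLists[Int])` yields all eight sublists, while `SemiringSketch.sublists(List(1, 2, 3))((x: Int) => x)(SemiringSketch.maxPlus)` yields the maximum sublist sum without building any list. An aggregator being a semiring homomorphism is what makes these two routes agree.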
10. Main Contributions
The implementation of a GTA library
A simple and statically typed GTA-DSL is implemented
Algebraic structures, and computations and transformations on them, are implemented
Evaluation of GTA methodology
15. G‧T‧A Programming DSL
Users write GTA expressions like:
generate(g:GEN) filter(t:Predicate)* aggregate(a:Aggregator)
GEN, Aggregator, and Predicate are Scala traits defined in the GTA library.
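The shape of such an expression can be mimicked with a toy chain of method calls. Everything below is a hypothetical stand-in written for this note: `ToyGen`, `ToyPredicate`, `ToyAggregator`, `Pipeline`, and `on` are not the library's traits or API, and the toy evaluates naively without any fusion.

```scala
object ToyGtaDsl {
  // Hypothetical stand-ins for the library's traits; not the real definitions.
  trait ToyGen[A, C] { def candidates(input: List[A]): List[C] }
  trait ToyPredicate[C] { def holds(c: C): Boolean }
  trait ToyAggregator[C, R] { def combine(cs: List[C]): R }

  // A pipeline: one generator, zero or more filters, then one aggregator.
  final class Pipeline[A, C](input: List[A], gen: ToyGen[A, C], tests: List[ToyPredicate[C]]) {
    def filter(t: ToyPredicate[C]): Pipeline[A, C] = new Pipeline(input, gen, tests :+ t)
    def aggregate[R](a: ToyAggregator[C, R]): R =
      a.combine(gen.candidates(input).filter(c => tests.forall(_.holds(c))))
  }

  final class Source[A](input: List[A]) {
    def generate[C](g: ToyGen[A, C]): Pipeline[A, C] = new Pipeline(input, g, Nil)
  }
  def on[A](input: List[A]): Source[A] = new Source(input)

  // Example instances: all sublists, a sum bound, and the maximum sum.
  val allSublists: ToyGen[Int, List[Int]] = new ToyGen[Int, List[Int]] {
    def candidates(input: List[Int]): List[List[Int]] =
      input.foldRight(List(List.empty[Int]))((x, acc) => acc.flatMap(s => List(s, x :: s)))
  }
  def sumAtMost(limit: Int): ToyPredicate[List[Int]] = new ToyPredicate[List[Int]] {
    def holds(c: List[Int]): Boolean = c.sum <= limit
  }
  val maxSum: ToyAggregator[List[Int], Int] = new ToyAggregator[List[Int], Int] {
    def combine(cs: List[List[Int]]): Int = cs.map(_.sum).max
  }

  // generate, filter, aggregate over the weights 2, 6, 10 with limit 15.
  val result: Int =
    on(List(2, 6, 10)).generate(allSublists).filter(sumAtMost(15)).aggregate(maxSum)
}
```

Several `filter` calls can be chained before `aggregate`, which mirrors the `*` in the expression above.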
18. Implementation of GTA
Fusion/Optimization
The main difficulties:
How to define a polymorphic generator
How to define a predicate for test
How to define intermediate data structures and other algebraic structures
20. More Examples
More examples in the paper and source package:
Extended Knapsack problems
The maximum-segments-sum problem
Finding the most probable sequence (Viterbi algorithm)
More information at: https://bitbucket.org/inii/gtalib
21. G‧T‧A Building Blocks
Our library provides commonly used G‧T‧A building blocks; users can also
implement their own generators, tests, and aggregators.
22. Performance Evaluations
Evaluations on EdubaseCluster (Cloud)
– Up to 32 VM nodes, each with 3 GB RAM and one single-core CPU
– Executed on Spark, an in-memory MapReduce (MR) cluster framework
26. Conclusions
We show that GTA can be implemented efficiently
GTA-DSL can simplify parallel programming
Simple programming model
Good code efficiency
GTA-DSL is architecture independent
27. Future Work
Enrich the library with more building blocks in
terms of G, T, A
GTA-DSL can be extended to process more
complex data structures such as trees and graphs