2. YongChul Kwon
Joint work* with Magdalena Balazinska, Bill Howe
University of Washington
*“Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions,” in the proceedings of the 1st ACM Symposium on Cloud Computing (SoCC), 2010
3. Motivation
• Science is becoming a data analysis problem
• MapReduce with Hadoop is an attractive solution
– Easy API, declarative layer, seamless scalability, …
• Computational skew can make it hard to get high performance
– e.g., 14 hours vs. 70 minutes
• Challenges include:
– Partitioning data to avoid computational skew
– Implementing a hierarchical parallel merge
• SkewReduce:
– Automatically produces a data partitioning and merge plan
4. Example Science Application:
Extracting Celestial Objects
• Input pixels
– { (x,y,r,g,b,ir,uv,…) }
• Coordinates
• Light intensities
• …
• Output features
– List of celestial objects
• Star
• Galaxy
• Planet
• Asteroid
• …
[Image: M34 from the Sloan Digital Sky Survey]
5. Scientific feature extraction applications
• Astronomy: e.g., identify celestial objects
– 2D arrays of pixel intensities; each element is a point in the sky at the time the image was taken
• Climate and ocean: e.g., understand phenomena in these systems
– 3D regions of atmosphere and oceans represented as arrays or meshes, simulating behavior over time by solving a set of governing equations
• Cosmology: e.g., study the structure and changes in the universe
– 4D models of clouds of particles influenced by gravity, used to analyze the origin and evolution of the universe
• Flow cytometry: e.g., count and examine microscopic particles
– Scattered light used to recognize microorganisms in water; an enormous volume of events clustered in a 6D space corresponding to different wavelengths of light
These application domains all reason about the multi-dimensional space in which the data is embedded as well as the data itself
6. Parallel Feature Extraction
• Partition multi-dimensional input data
• Extract features from each partition
• Merge (or reconcile) features
• Finalize output
[Figure: input data flows through Map (extract) and a hierarchical Reduce (merge) to produce features]
7. Partition
• Bounding-box algorithms are used to make semantically correct partitions of the data
– Determine the (axis, point) at which to split each partition
What you want is for the partitions to have the same runtime!
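As a concrete illustration of the splitting step, here is a minimal sketch (not the SkewReduce implementation; all names are ours) that recursively splits 2D points at the median of the widest axis until partitions fall under a size cap. Note it balances record counts, which is exactly the low-fidelity proxy for runtime that later slides improve on:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative bounding-box partitioner: pick the (axis, point) to
// split at, recurse on each half. Names are hypothetical, not the
// SkewReduce API.
public class BBoxPartitioner {
    /** Split points into partitions of at most maxSize elements. */
    public static List<double[][]> partition(double[][] points, int maxSize) {
        List<double[][]> out = new ArrayList<>();
        split(points, maxSize, out);
        return out;
    }

    private static void split(double[][] pts, int maxSize, List<double[][]> out) {
        if (pts.length <= maxSize) { out.add(pts); return; }
        // Choose the axis (0 = x, 1 = y) with the larger spatial extent.
        final int axis = extent(pts, 0) >= extent(pts, 1) ? 0 : 1;
        double[][] sorted = pts.clone();
        Arrays.sort(sorted, (p, q) -> Double.compare(p[axis], q[axis]));
        int mid = sorted.length / 2;  // split point = median along that axis
        split(Arrays.copyOfRange(sorted, 0, mid), maxSize, out);
        split(Arrays.copyOfRange(sorted, mid, sorted.length), maxSize, out);
    }

    private static double extent(double[][] pts, int axis) {
        double lo = Double.POSITIVE_INFINITY, hi = Double.NEGATIVE_INFINITY;
        for (double[] p : pts) { lo = Math.min(lo, p[axis]); hi = Math.max(hi, p[axis]); }
        return hi - lo;
    }
}
```

Equal-size partitions are only equal-runtime when work per record is uniform, which the Extract slide shows is not the case.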
8. Extract
• Apply an application-specific feature extraction algorithm to the data within a bounding box
• The algorithm's complexity may depend on the relationships among data points in the box
– O(N log N) with 0 neighbors per particle, up to O(N²) with N neighbors per particle
Relationships among data in the space can lead to computational skew!
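A toy demonstration of this point (our own example, not from the talk): two partitions with the same number of particles can require vastly different amounts of work, because the number of interacting pairs within a radius depends on density, not count.

```java
// Same N, very different work: pair counting is the O(N^2) worst case
// that makes dense regions slow. Names here are illustrative.
public class SkewDemo {
    /** Count pairs of 1D points within `radius` of each other. */
    static long interactingPairs(double[] xs, double radius) {
        long pairs = 0;
        for (int i = 0; i < xs.length; i++)
            for (int j = i + 1; j < xs.length; j++)
                if (Math.abs(xs[i] - xs[j]) <= radius) pairs++;
        return pairs;
    }

    public static void main(String[] args) {
        int n = 1000;
        double[] spread = new double[n], dense = new double[n];
        for (int i = 0; i < n; i++) { spread[i] = i * 10.0; dense[i] = 0.0; }
        System.out.println(interactingPairs(spread, 1.0)); // 0 neighbors per particle
        System.out.println(interactingPairs(dense, 1.0));  // every particle neighbors every other
    }
}
```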
9. Hierarchical parallel merge
• Partitions must be merged based on relationships between bounding boxes
• Data near the edges of boxes must be taken into account
• Some data can be set aside during extract/merge and re-introduced in finalize to reduce data copying
Hierarchical merge requires a MapReduce driver program to schedule merges in the correct order
Set aside data not needed in a future merge
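The scheduling the driver must do can be sketched as a simple reduction tree (a minimal stand-in, not the SkewReduce driver; class names are ours): adjacent partitions merge pairwise, and each level of the tree is one round of MapReduce merge jobs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative merge driver: combine partitions pairwise, level by
// level, until one remains. The real reconcile step would only touch
// features near the shared boundary of two adjacent bounding boxes.
public class MergeDriver {
    /** One partition's extracted features. */
    static class Part {
        List<String> features = new ArrayList<>();
        Part(String... fs) { for (String f : fs) features.add(f); }
    }

    /** Merge two adjacent partitions (stand-in for boundary reconciliation). */
    static Part merge(Part left, Part right) {
        Part m = new Part();
        m.features.addAll(left.features);
        m.features.addAll(right.features);
        return m;
    }

    /** Pairwise-merge until one partition remains; returns rounds scheduled. */
    static int mergeAll(List<Part> parts) {
        int rounds = 0;
        while (parts.size() > 1) {
            List<Part> next = new ArrayList<>();
            for (int i = 0; i + 1 < parts.size(); i += 2)
                next.add(merge(parts.get(i), parts.get(i + 1)));
            if (parts.size() % 2 == 1) next.add(parts.get(parts.size() - 1));
            parts = next;  // each level = one round of merge jobs
            rounds++;
        }
        return rounds;
    }
}
```

With 2^k partitions this schedules k rounds, which is why merge order matters for the driver.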
13. Impact of micro partitions?
• It works!
• But framework/merge overhead can be large!
• To find the sweet spot, need to try different granularities!
[Chart: completion time (hours, 0–16) vs. number of partitions (256, 1024, 4096, 8192)]
Can we find a good partitioning plan that incurs less overhead?
14. Solution 2? Manual partitioning
• Repeat
– Solve using MapReduce
– Divide only those partitions that take too long
• Until balanced
[Diagram: iteration 1 runs partitions a, b, c, d; iteration 2 splits the slow ones into a1, a2, b1, b2 while c and d are unchanged]
Can we find a good partitioning plan without such trial and error?
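The trial-and-error loop above can be sketched as follows (our own illustration with simulated runtimes; in reality every iteration is a full, expensive MapReduce run over the data, which is precisely what SkewReduce's sample-based optimizer avoids):

```java
import java.util.ArrayList;
import java.util.List;

// Manual repartitioning loop: run the job, split only the partitions
// whose runtime exceeds a budget, repeat until balanced. Runtimes are
// simulated numbers; names are illustrative.
public class ManualRefine {
    /** Split any partition whose runtime exceeds the budget in half. */
    static List<Double> refineOnce(List<Double> runtimes, double budget) {
        List<Double> next = new ArrayList<>();
        for (double t : runtimes) {
            if (t > budget) { next.add(t / 2); next.add(t / 2); }  // divide the slow one
            else next.add(t);                                      // keep the balanced one
        }
        return next;
    }

    /** Repeat until every partition fits the budget; count the iterations. */
    static int refineUntilBalanced(List<Double> runtimes, double budget) {
        int iterations = 0;
        while (runtimes.stream().anyMatch(t -> t > budget)) {
            runtimes = refineOnce(runtimes, budget);
            iterations++;  // each iteration = one real MapReduce run
        }
        return iterations;
    }
}
```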
21. Impact of Cost Function
[Chart: completion time (hours, 0–16) of the Astro workload under three cost functions: Data Size, Histogram 1D, Histogram 3D]
Higher fidelity = Better performance
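To make the cost-function comparison concrete, here is a sketch of the idea (interface and class names are ours, not the published SkewReduce API): the optimizer asks a user-defined function to estimate a candidate partition's runtime from a data sample. A pure data-size cost ignores density; a histogram-based cost captures dense regions and tracks actual runtime more closely.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative user-defined cost functions of increasing fidelity.
interface CostFunction {
    double cost(double[] sampleX);  // estimated runtime, arbitrary units
}

// Lowest fidelity: cost proportional to record count.
class DataSizeCost implements CostFunction {
    public double cost(double[] s) { return s.length; }
}

// Higher fidelity: a 1D histogram, assuming work within a bucket is
// quadratic in the bucket's count (mimicking extractors that compare
// nearby points).
class Histogram1DCost implements CostFunction {
    private final double bucketWidth;
    Histogram1DCost(double bucketWidth) { this.bucketWidth = bucketWidth; }
    public double cost(double[] s) {
        Map<Long, Long> counts = new HashMap<>();
        for (double x : s)
            counts.merge((long) Math.floor(x / bucketWidth), 1L, Long::sum);
        double c = 0;
        for (long n : counts.values()) c += (double) n * n;  // quadratic per bucket
        return c;
    }
}
```

Under this sketch, two samples of equal size get equal DataSizeCost, but the one concentrated in a single bucket gets a much higher Histogram1DCost, matching the chart's higher-fidelity-wins trend.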
22. Highlights of Evaluation
• Sample size
– Representativeness of the sample is important
– A 1% sample size worked well
• Runtime of SkewReduce optimization
– Less than 15% of the actual runtime of the resulting SkewReduce plan
• Data volume in the Merge phase when using set-aside
– Total data volume during Merge = 1% of input data
• Details in the SkewReduce paper
23. Conclusion
• Scientific analysis should be easy to write, scalable, and have predictable performance
• Skew is a general problem; solutions are needed
• SkewReduce
– API for feature-extracting functions
– Scalable execution
– Good performance in spite of skew
• Cost-based partition optimization using a data sample
• Next step is to handle skew in arbitrary MapReduce systems
• Looking for your examples of computational skew
• Current implementation can be made available to the fearless
27. Want Some Code?

Example Extractor:

public class MyExtract extends PExtractOP {
  public class MyExtractMapper extends SkewReduceMapper {
    // 0.20.x New API
    public void run(Context context) {
      // your extractor code
    }
  }
  protected Job createJob(Configuration conf) {
    // configure Job object in 0.20.x New API
  }
}

Example SkewReduce Driver:

public class MyApp extends SkewReduceDriver {
  …
  public LExtractorOP createExtractOp() {
    // return logical OP – physical OP template
  }
  public LMergeOP createMergeOp() {
    // return logical OP – physical OP template
  }
  …
  public Partition getRootPartition() {
    // we provide several default implementations
  }
}
28. Summary of Contributions
• Given a feature extraction application
– Possibly with computational skew
• SkewReduce
– Automatically partitions input data
– Improves runtime by reducing computational skew
• Key technique: user-defined cost functions