©2009 HP Confidential
Jerry Rolia
Principal Scientist, Automated Infrastructure Lab, Hewlett Packard Labs
October 12, 2010
Techniques to use Hadoop with
scientific data
Joint work* with YongChul Kwon, Magdalena Balazinska, and Bill Howe
University of Washington
*"Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined
Functions," in the proceedings of the 1st ACM Symposium on Cloud Computing (SoCC), 2010
Motivation
• Science is becoming a data analysis problem
• MapReduce with Hadoop is an attractive solution
– Easy API, declarative layer, seamless scalability, …
• Computational skew can make it hard to get high performance
– e.g., 14 hours vs. 70 minutes
• Challenges include:
– Partitioning data to avoid computational skew
– Implementing a hierarchical parallel merge
• SkewReduce:
– Automatically produces a data partitioning and merge plan
Example Science Application:
Extracting Celestial Objects
• Input pixels
– { (x,y,r,g,b,ir,uv,…) }
• Coordinates
• Light intensities
• …
• Output features
– List of celestial objects
• Star
• Galaxy
• Planet
• Asteroid
• …
[Image: M34 from the Sloan Digital Sky Survey]
Scientific feature extraction applications
• Astronomy: e.g., identify celestial objects
– 2D arrays of pixel intensities; each element is a point in the sky at the time the
image was taken
• Climate and ocean: e.g., understand phenomena in these systems
– 3D regions of atmosphere and oceans using arrays or meshes, simulating
behavior over time by solving a set of governing equations
• Cosmology: e.g., study the structure of and changes in the universe
– 4D models of clouds of particles influenced by gravity to analyze the origin and
evolution of the universe
• Flow cytometry: e.g., counting and examining microscopic particles
– Scattered light used to recognize microorganisms in water, enormous volume
of events clustered in a 6D space corresponding to different wavelengths of
light
These application domains all reason about the multi-dimensional
space in which the data is embedded as well as the data itself
Parallel Feature Extraction
• Partition multi-dimensional input data
• Extract features from each partition
• Merge (or reconcile) features
• Finalize output
[Diagram: INPUT DATA → Map → Hierarchical Reduce → Features]
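The four steps above can be sketched end to end. The snippet below is an illustrative serial version in Python with hypothetical helper names, 1-D data, and a toy distance-based feature extractor; the real system runs Extract as parallel map tasks and Merge as a hierarchical reduce.

```python
# Illustrative serial sketch of the partition / extract / merge / finalize
# pipeline on 1-D data (hypothetical helpers; the real system runs Extract
# in parallel map tasks and Merge as a hierarchical reduce).

def partition(points, num_parts):
    """Split 1-D points into equal-width bounding intervals."""
    lo, hi = min(points), max(points)
    width = (hi - lo) / num_parts or 1.0
    parts = [[] for _ in range(num_parts)]
    for p in points:
        parts[min(int((p - lo) / width), num_parts - 1)].append(p)
    return parts

def extract(part):
    """Toy feature extractor: cluster points lying within 1.0 of each other."""
    clusters = []
    for p in sorted(part):
        if clusters and p - clusters[-1][-1] <= 1.0:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

def merge(a, b):
    """Reconcile features from adjacent boxes: join the two clusters that
    come within 1.0 of each other across the shared edge."""
    if a and b and min(b[0]) - max(a[-1]) <= 1.0:
        return a[:-1] + [a[-1] + b[0]] + b[1:]
    return a + b

def run(points, num_parts=4):
    """Finalize: fold the per-partition features into one feature list."""
    features = [extract(p) for p in partition(points, num_parts)]
    out = features[0]
    for f in features[1:]:      # done pairwise here; hierarchically in practice
        out = merge(out, f)
    return out
```

The merge step is what makes the decomposition correct: clusters split across a partition edge are rejoined, so the partitioned run produces the same features as a single-box run.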
Partition
• Bounding box algorithms are used to make semantically correct partitions of the data
– Determine the (axis, point) at which to split a partition
What you want is for all partitions to have the same runtime!
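One simple way to choose the (axis, point) is to split the widest axis at its median, so each child holds half the points. A minimal illustrative sketch; the function name and the median policy are assumptions, not necessarily the system's own algorithm:

```python
# Illustrative (axis, point) chooser: split the widest axis at its median
# (assumed policy for this sketch).

def choose_split(points):
    """Return (axis, threshold, left, right) for a bounding-box split."""
    dims = len(points[0])
    # pick the axis with the largest spread
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    vals = sorted(p[axis] for p in points)
    threshold = vals[len(vals) // 2]           # median along the widest axis
    left = [p for p in points if p[axis] < threshold]
    right = [p for p in points if p[axis] >= threshold]
    return axis, threshold, left, right
```

A median split balances point counts, but as the slide stresses, equal counts do not imply equal runtimes; that gap is exactly what the cost-based partitioning described later addresses.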
Extract
• Apply an application-specific feature extraction algorithm to the data
within a bounding box
• Algorithm complexity may depend on the relationships among
data points in the box
– Ranges from O(N log N) with 0 neighbors per particle
to O(N²) with N neighbors per particle
Relationships among data in the space can lead to computational
skew!
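This can be made concrete with a toy experiment: two boxes with the same number of points but very different densities make a neighbor-based extractor do wildly different amounts of work. An illustrative sketch with hypothetical names, where a naive all-pairs neighbor count stands in for the extractor's work:

```python
# Toy demonstration that extract cost tracks neighbor density, not just N.
from itertools import combinations

def neighbor_pairs(points, radius):
    """Count pairs of 2-D points within `radius` of each other."""
    r2 = radius * radius
    return sum(1 for (x1, y1), (x2, y2) in combinations(points, 2)
               if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= r2)

# Same N = 100, radically different amounts of work:
sparse = [(10.0 * i, 0.0) for i in range(100)]   # no point near any other
dense  = [(0.001 * i, 0.0) for i in range(100)]  # every point near every other
```

The sparse box yields zero neighbor pairs while the dense box yields all N(N-1)/2 of them, so equal-sized partitions can still carry wildly unequal runtimes.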
Hierarchical parallel merge
• Partitions must be merged based on relationships between bounding boxes
• Data near the edges of boxes is taken into account
• Some data can be set aside during extract/merge and re-introduced in
finalize to reduce data copying
[Diagram: hierarchical merge of features from adjacent bounding boxes;
at each level, data not needed in a future merge is set aside]
Hierarchical merge requires a map-reduce driver program to
schedule merges in the correct order
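The set-aside idea can be sketched as follows: after Extract, only clusters within the interaction radius of a box edge need to travel to the Merge step; interior clusters are held back until Finalize. An illustrative 1-D sketch with hypothetical names:

```python
# Illustrative 1-D set-aside: only clusters near a box edge can interact
# with a neighboring partition, so only they are shipped to Merge; the
# rest wait for Finalize (hypothetical names and radius).

def split_for_merge(clusters, box_lo, box_hi, radius):
    """Split clusters into (boundary, set_aside) lists."""
    boundary, set_aside = [], []
    for c in clusters:
        near_edge = any(p - box_lo <= radius or box_hi - p <= radius for p in c)
        (boundary if near_edge else set_aside).append(c)
    return boundary, set_aside
```

Since most clusters in a large box are interior, this keeps the data volume flowing through the merge tree small.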
Finalize
• Features are integrated with the set-aside data for the final output
Problem: Computational Skew
• The top red line runs for 1.5 hours
[Chart: per-task time vs. task ID for the Local Clustering (MAP) and
Merge (REDUCE) phases; annotations: 5 minutes, 35 minutes]
Solution 1?
Micro partition
• Assign a tiny amount of work to each task to reduce skew
Impact of micro partitions?
• It works!
• Framework/merge overhead can be large!
• To find the sweet spot, you need to try different granularities!
[Chart: completion time (hours) vs. number of partitions, for 256, 1024, 4096, and 8192 partitions]
Can we find a good partitioning plan that incurs less overhead?
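The trade-off behind the sweet spot can be illustrated with a toy model: every task pays a fixed framework/merge overhead, so splitting a skewed task into micro tasks lowers the maximum task time but inflates total overhead. A sketch with illustrative numbers and a simple longest-first greedy assignment:

```python
# Toy cost model for the granularity trade-off: every task pays a fixed
# scheduling/merge overhead, and tasks go longest-first to the
# least-loaded worker (illustrative policy and numbers).

def completion_time(task_times, per_task_overhead, workers):
    loads = [0.0] * workers
    for t in sorted(task_times, reverse=True):
        i = loads.index(min(loads))          # least-loaded worker
        loads[i] += t + per_task_overhead
    return max(loads)

coarse = [100, 5, 5, 5]              # one skewed task dominates
fine = [10] * 10 + [5, 5, 5]         # skewed task split into 10 micro tasks
```

With 4 workers and overhead 1, `coarse` finishes at 101 while `fine` finishes at 34; but pushing the same work into 100 unit-size tasks with overhead 5 takes 150, which is why overly fine partitions backfire.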
Solution 2?
Manual partition
• Repeat
– Solve using MapReduce
– Only subdivide the partitions that take too long
• Until balanced
Can we find a good partitioning plan
without such trial and error?
[Diagram: partitions a, b, c, d; in iteration 2, the slow partitions a and b
are subdivided into a1, a2, b1, b2 while c and d are left unchanged]
SkewReduce approach
[Diagram: a data sample, the cluster configuration, and cost functions feed
the SkewReduce Optimizer, which emits a runtime plan of numbered partitions]
• Goal: minimize expected total runtime
• Output: SkewReduce runtime plan
– Bounding boxes for data partitions
– Schedule: longest jobs first, followed by the hierarchical parallel merge
Cost functions
• Two cost functions:
– Feature cost: (bounding box, sample, sample rate) → cost
– Merge cost: (bounding boxes, sample, sample rate) → cost
• E.g.:
– Estimate the density of points in the sample
– Characterize it using histograms
– Use micro-benchmarks to relate density to execution time
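A minimal sketch of such a feature cost function for 1-D data (hypothetical name; the O(n log n) per-cell charge and its coefficients are assumptions that would come from micro-benchmarks in a real deployment):

```python
# Sketch of a feature cost function (bounding box, sample, sample rate) -> cost:
# histogram the sample to estimate density, scale counts to the full data,
# and charge each cell an assumed O(n log n) extraction cost.
import math

def feature_cost(box, sample, sample_rate, cells=8):
    lo, hi = box
    width = (hi - lo) / cells
    counts = [0] * cells
    for x in sample:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), cells - 1)] += 1
    cost = 0.0
    for c in counts:
        n = c / sample_rate          # estimated points in the full data
        if n > 1:
            cost += n * math.log(n)
    return cost
```

For two samples of the same size, the clustered one is predicted to be more expensive than the uniform one, which is what lets the optimizer treat dense regions differently from sparse ones.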
Search Partition Plan
• Greedy top-down search
– Split if total expected runtime improves
• Evaluate costs for subpartitions and merge
• Estimate new runtime
[Diagram: an original partition with expected cost 100 is split into
subpartitions 1, 2, 3 with costs 50, 50, and 10; Schedule 1 of the pieces
completes at time 60, Schedule 2 at time 110]
Partition Plan
• Partition based on cluster and predicted feature
extraction/merge costs
• Stop partitioning when overhead exceeds benefit
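The greedy search and its stopping rule can be sketched together: recursively split at the median while max(child costs) plus the split overhead beats the parent's cost (children run in parallel), and stop when the overhead exceeds the benefit. Illustrative Python, with a quadratic stand-in playing the role of the user-supplied cost function:

```python
# Sketch of the greedy top-down partition search with a stopping rule.
# `cost` stands in for the user-defined cost function.

def plan(cost, points, overhead, depth=0, max_depth=8):
    if len(points) < 2 or depth >= max_depth:
        return [points]
    mid = sorted(points)[len(points) // 2]
    left = [p for p in points if p < mid]
    right = [p for p in points if p >= mid]
    if not left or not right:
        return [points]
    # children run in parallel, so a split costs max(child) + overhead
    if max(cost(left), cost(right)) + overhead >= cost(points):
        return [points]              # overhead exceeds the benefit: stop
    return (plan(cost, left, overhead, depth + 1)
            + plan(cost, right, overhead, depth + 1))

quad = lambda pts: len(pts) ** 2     # quadratic stand-in for extraction cost
```

With a quadratic cost and moderate overhead, `plan` keeps splitting until partitions are small; raising the overhead makes it stop immediately, mirroring the "stop partitioning when overhead exceeds benefit" rule above.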
Evaluation
• Distributed Friends of Friends
– Astro: Gravitational simulation snapshot
• 900 M particles, 18 GB
– Seaflow: flow cytometry survey
• 59 M observations, 1.9 GB
• 8 node cluster
– Dual quad core CPU, 16 GB RAM
– Hadoop 0.20.1 + custom patch in MapReduce API
Does SkewReduce work?
• SkewReduce plan yields faster running times
[Chart: relative runtime for Astro and Seaflow under each plan]

                                128 MB   16 MB   4 MB   2 MB  Manual  SkewReduce
Astro (18 GB, 3D), hours:         14.1     8.8    4.1    5.7     2.0         1.6
Seaflow (1.9 GB, 3D), minutes:    87.2    63.1   77.7   98.7       -        14.1

(The 128 MB through 2 MB columns are plain MapReduce runs at that partition granularity.)
With 1 hour of preparation, the SkewReduce plan improves on the otherwise best plan
Impact of Cost Function
[Chart: completion time (hours) on Astro for three cost functions:
Data Size, Histogram 1D, Histogram 3D]
Higher fidelity = better performance
Highlights of Evaluation
• Sample size
– Representativeness of sample is important
– 1% sample size worked well
• Runtime of the SkewReduce optimization
– Less than 15% of the runtime of the resulting SkewReduce plan
• Data volume in the Merge phase when using set-aside
– Total volume during Merge = 1% of input data
• Details in the SkewReduce paper
Conclusion
• Scientific analysis should be easy to write, scalable, and
have predictable performance
• Skew is a general problem; solutions are needed
• SkewReduce
– API for feature-extracting applications
– Scalable execution
– Good performance in spite of skew
• Cost-based partition optimization using a data sample
• Next step: handle skew in arbitrary map-reduce systems
• Looking for your examples of computational skew
• The current implementation can be made available to the fearless
BACKUP
SkewReduce API
[Diagram: INPUT, divided into bounding boxes, flows through PROCESS,
emitting FEATURE and SETASIDE data; MERGE combines FEATUREs across
bounding boxes, again emitting FEATURE and SETASIDE; FINALIZE
produces the OUTPUT]
SkewReduce: Prototype Architecture
[Diagram: the optimizer's output (bounding boxes) and user algorithms feed
the SkewReduce Runtime, which issues a map-only job per Extract/Merge step;
a Job Completion Monitor sends asynchronous notifications to the Job
Scheduler, which schedules each new task; a Pig script is shown as part of
the flow]
Want Some Code?

Example Extractor:

public class MyExtract extends PExtractOP {
    public class MyExtractMapper extends SkewReduceMapper {
        // Hadoop 0.20.x new API
        public void run(Context context) {
            // your extractor code
        }
    }
    protected Job createJob(Configuration conf) {
        // configure the Job object in the 0.20.x new API
    }
}

Example SkewReduce Driver:

public class MyApp extends SkewReduceDriver {
    …
    public LExtractorOP createExtractOp() {
        // return logical OP – physical OP template
    }
    public LMergeOP createMergeOp() {
        // return logical OP – physical OP template
    }
    …
    public Partition getRootPartition() {
        // we provide several default implementations
    }
}
Summary of Contributions
• Given a feature extraction application
– Possibly with computational skew
• SkewReduce
– Automatically partitions input data
– Improves runtime by reducing computational skew
• Key technique: user-defined cost functions
HP - Jerome Rolia - Hadoop World 2010