©2009 HP Confidential
Jerry Rolia
Principal Scientist, Automated Infrastructure Lab, Hewlett Packard Labs
October 12, 2010
Techniques to use Hadoop with
scientific data
Joint work* with YongChul Kwon, Magdalena Balazinska, and Bill Howe
University of Washington
*"Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined
Functions," in the proceedings of the 1st ACM Symposium on Cloud Computing (SoCC), 2010
Motivation
• Science is becoming a data analysis problem
• MapReduce with Hadoop is an attractive solution
– Easy API, declarative layer, seamless scalability, …
• Computational skew can make it hard to get high performance
– e.g., 14 hours vs. 70 minutes
• Challenges include:
– Partitioning data to avoid computational skew
– Implementing a hierarchical parallel merge
• SkewReduce:
– Automatically produces a data partitioning and merge plan
Example Science Application:
Extracting Celestial Objects
• Input pixels
– { (x,y,r,g,b,ir,uv,…) }
• Coordinates
• Light intensities
• …
• Output features
– List of celestial objects
• Star
• Galaxy
• Planet
• Asteroid
• …
[Image: M34 from the Sloan Digital Sky Survey]
Scientific feature extraction applications
• Astronomy: e.g., identify celestial objects
– 2D arrays of pixel intensities; each element is a point in the sky at the time the
image was taken
• Climate and ocean: e.g., understand phenomena in these systems
– 3D regions of atmosphere and oceans using arrays or meshes, simulating
behavior over time by solving a set of governing equations
• Cosmology: e.g., study the structure of and changes in the universe
– 4D models of clouds of particles influenced by gravity to analyze the origin and
evolution of the universe
• Flow cytometry: e.g., counting and examining microscopic particles
– Scattered light used to recognize microorganisms in water, enormous volume
of events clustered in a 6D space corresponding to different wavelengths of
light
These application domains all reason about the multi-dimensional
space in which the data is embedded as well as the data itself
Parallel Feature Extraction
• Partition multi-dimensional input data
• Extract features from each partition
• Merge (or reconcile) features
• Finalize output
[Diagram: INPUT DATA → Map → Hierarchical Reduce → Features]
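The four steps above can be sketched end to end. The snippet below is an illustrative serial version in Python with hypothetical helper names, 1-D data, and a toy distance-based feature extractor; the real system runs Extract as parallel map tasks and Merge as a hierarchical reduce.

```python
# Illustrative serial sketch of the partition / extract / merge / finalize
# pipeline on 1-D data (hypothetical helpers; the real system runs Extract
# in parallel map tasks and Merge as a hierarchical reduce).

def partition(points, num_parts):
    """Split 1-D points into equal-width bounding intervals."""
    lo, hi = min(points), max(points)
    width = (hi - lo) / num_parts or 1.0
    parts = [[] for _ in range(num_parts)]
    for p in points:
        parts[min(int((p - lo) / width), num_parts - 1)].append(p)
    return parts

def extract(part):
    """Toy feature extractor: cluster points lying within 1.0 of each other."""
    clusters = []
    for p in sorted(part):
        if clusters and p - clusters[-1][-1] <= 1.0:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

def merge(a, b):
    """Reconcile features from adjacent boxes: join the two clusters that
    come within 1.0 of each other across the shared edge."""
    if a and b and min(b[0]) - max(a[-1]) <= 1.0:
        return a[:-1] + [a[-1] + b[0]] + b[1:]
    return a + b

def run(points, num_parts=4):
    """Finalize: fold the per-partition features into one feature list."""
    features = [extract(p) for p in partition(points, num_parts)]
    out = features[0]
    for f in features[1:]:      # done pairwise here; hierarchically in practice
        out = merge(out, f)
    return out
```

The merge step is what makes the decomposition correct: clusters split across a partition edge are rejoined, so the partitioned run produces the same features as a single-box run.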
Partition
• Bounding box algorithms are used to make semantically correct partitions of the data
– Determine the (axis, point) at which to split a partition
What you want is for all partitions to have the same runtime!
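One simple way to choose the (axis, point) is to split the widest axis at its median, so each child holds half the points. A minimal illustrative sketch; the function name and the median policy are assumptions, not necessarily the system's own algorithm:

```python
# Illustrative (axis, point) chooser: split the widest axis at its median
# (assumed policy for this sketch).

def choose_split(points):
    """Return (axis, threshold, left, right) for a bounding-box split."""
    dims = len(points[0])
    # pick the axis with the largest spread
    axis = max(range(dims),
               key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    vals = sorted(p[axis] for p in points)
    threshold = vals[len(vals) // 2]           # median along the widest axis
    left = [p for p in points if p[axis] < threshold]
    right = [p for p in points if p[axis] >= threshold]
    return axis, threshold, left, right
```

A median split balances point counts, but as the slide stresses, equal counts do not imply equal runtimes; that gap is exactly what the cost-based partitioning described later addresses.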
Extract
• Apply an application-specific feature extraction algorithm to the data
within a bounding box
• Algorithm complexity may depend on the relationships among
data points in the box
– Ranges from O(N log N) with 0 neighbors per particle
to O(N²) with N neighbors per particle
Relationships among data in the space can lead to computational
skew!
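This can be made concrete with a toy experiment: two boxes with the same number of points but very different densities make a neighbor-based extractor do wildly different amounts of work. An illustrative sketch with hypothetical names, where a naive all-pairs neighbor count stands in for the extractor's work:

```python
# Toy demonstration that extract cost tracks neighbor density, not just N.
from itertools import combinations

def neighbor_pairs(points, radius):
    """Count pairs of 2-D points within `radius` of each other."""
    r2 = radius * radius
    return sum(1 for (x1, y1), (x2, y2) in combinations(points, 2)
               if (x1 - x2) ** 2 + (y1 - y2) ** 2 <= r2)

# Same N = 100, radically different amounts of work:
sparse = [(10.0 * i, 0.0) for i in range(100)]   # no point near any other
dense  = [(0.001 * i, 0.0) for i in range(100)]  # every point near every other
```

The sparse box yields zero neighbor pairs while the dense box yields all N(N-1)/2 of them, so equal-sized partitions can still carry wildly unequal runtimes.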
Hierarchical parallel merge
• Partitions must be merged based on relationships between bounding boxes
• Data near the edges of boxes is taken into account
• Some data can be set aside during extract/merge and re-introduced in
finalize to reduce data copying
[Diagram: hierarchical merge of features from adjacent bounding boxes;
at each level, data not needed in a future merge is set aside]
Hierarchical merge requires a map-reduce driver program to
schedule merges in the correct order
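The set-aside idea can be sketched as follows: after Extract, only clusters within the interaction radius of a box edge need to travel to the Merge step; interior clusters are held back until Finalize. An illustrative 1-D sketch with hypothetical names:

```python
# Illustrative 1-D set-aside: only clusters near a box edge can interact
# with a neighboring partition, so only they are shipped to Merge; the
# rest wait for Finalize (hypothetical names and radius).

def split_for_merge(clusters, box_lo, box_hi, radius):
    """Split clusters into (boundary, set_aside) lists."""
    boundary, set_aside = [], []
    for c in clusters:
        near_edge = any(p - box_lo <= radius or box_hi - p <= radius for p in c)
        (boundary if near_edge else set_aside).append(c)
    return boundary, set_aside
```

Since most clusters in a large box are interior, this keeps the data volume flowing through the merge tree small.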
Finalize
• Features are integrated with the set-aside data for the final output
Problem: Computational Skew
• The top red line runs for 1.5 hours
[Chart: per-task time vs. task ID for the Local Clustering (MAP) and
Merge (REDUCE) phases; annotations: 5 minutes, 35 minutes]
Solution 1?
Micro partition
• Assign a tiny amount of work to each task to reduce skew
Impact of micro partitions?
• It works!
• Framework/merge overhead can be large!
• To find the sweet spot, you need to try different granularities!
[Chart: completion time (hours) vs. number of partitions, for 256, 1024, 4096, and 8192 partitions]
Can we find a good partitioning plan that incurs less overhead?
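The trade-off behind the sweet spot can be illustrated with a toy model: every task pays a fixed framework/merge overhead, so splitting a skewed task into micro tasks lowers the maximum task time but inflates total overhead. A sketch with illustrative numbers and a simple longest-first greedy assignment:

```python
# Toy cost model for the granularity trade-off: every task pays a fixed
# scheduling/merge overhead, and tasks go longest-first to the
# least-loaded worker (illustrative policy and numbers).

def completion_time(task_times, per_task_overhead, workers):
    loads = [0.0] * workers
    for t in sorted(task_times, reverse=True):
        i = loads.index(min(loads))          # least-loaded worker
        loads[i] += t + per_task_overhead
    return max(loads)

coarse = [100, 5, 5, 5]              # one skewed task dominates
fine = [10] * 10 + [5, 5, 5]         # skewed task split into 10 micro tasks
```

With 4 workers and overhead 1, `coarse` finishes at 101 while `fine` finishes at 34; but pushing the same work into 100 unit-size tasks with overhead 5 takes 150, which is why overly fine partitions backfire.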
Solution 2?
Manual partition
• Repeat
– Solve using MapReduce
– Only subdivide the partitions that take too long
• Until balanced
Can we find a good partitioning plan
without such trial and error?
[Diagram: partitions a, b, c, d; in iteration 2, the slow partitions a and b
are subdivided into a1, a2, b1, b2 while c and d are left unchanged]
SkewReduce approach
[Diagram: a data sample, the cluster configuration, and cost functions feed
the SkewReduce Optimizer, which emits a runtime plan of numbered partitions]
• Goal: minimize expected total runtime
• Output: SkewReduce runtime plan
– Bounding boxes for data partitions
– Schedule: longest jobs first, followed by the hierarchical parallel merge
Cost functions
• Two cost functions:
– Feature cost: (bounding box, sample, sample rate) → cost
– Merge cost: (bounding boxes, sample, sample rate) → cost
• E.g.:
– Estimate the density of points in the sample
– Characterize it using histograms
– Use micro-benchmarks to relate density to execution time
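A minimal sketch of such a feature cost function for 1-D data (hypothetical name; the O(n log n) per-cell charge and its coefficients are assumptions that would come from micro-benchmarks in a real deployment):

```python
# Sketch of a feature cost function (bounding box, sample, sample rate) -> cost:
# histogram the sample to estimate density, scale counts to the full data,
# and charge each cell an assumed O(n log n) extraction cost.
import math

def feature_cost(box, sample, sample_rate, cells=8):
    lo, hi = box
    width = (hi - lo) / cells
    counts = [0] * cells
    for x in sample:
        if lo <= x < hi:
            counts[min(int((x - lo) / width), cells - 1)] += 1
    cost = 0.0
    for c in counts:
        n = c / sample_rate          # estimated points in the full data
        if n > 1:
            cost += n * math.log(n)
    return cost
```

For two samples of the same size, the clustered one is predicted to be more expensive than the uniform one, which is what lets the optimizer treat dense regions differently from sparse ones.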
Search Partition Plan
• Greedy top-down search
– Split if total expected runtime improves
• Evaluate costs for subpartitions and merge
• Estimate new runtime
[Diagram: an original partition with expected cost 100 is split into
subpartitions 1, 2, 3 with costs 50, 50, and 10; Schedule 1 of the pieces
completes at time 60, Schedule 2 at time 110]
Partition Plan
• Partition based on cluster and predicted feature
extraction/merge costs
• Stop partitioning when overhead exceeds benefit
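The greedy search and its stopping rule can be sketched together: recursively split at the median while max(child costs) plus the split overhead beats the parent's cost (children run in parallel), and stop when the overhead exceeds the benefit. Illustrative Python, with a quadratic stand-in playing the role of the user-supplied cost function:

```python
# Sketch of the greedy top-down partition search with a stopping rule.
# `cost` stands in for the user-defined cost function.

def plan(cost, points, overhead, depth=0, max_depth=8):
    if len(points) < 2 or depth >= max_depth:
        return [points]
    mid = sorted(points)[len(points) // 2]
    left = [p for p in points if p < mid]
    right = [p for p in points if p >= mid]
    if not left or not right:
        return [points]
    # children run in parallel, so a split costs max(child) + overhead
    if max(cost(left), cost(right)) + overhead >= cost(points):
        return [points]              # overhead exceeds the benefit: stop
    return (plan(cost, left, overhead, depth + 1)
            + plan(cost, right, overhead, depth + 1))

quad = lambda pts: len(pts) ** 2     # quadratic stand-in for extraction cost
```

With a quadratic cost and moderate overhead, `plan` keeps splitting until partitions are small; raising the overhead makes it stop immediately, mirroring the "stop partitioning when overhead exceeds benefit" rule above.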
Evaluation
• Distributed Friends of Friends
– Astro: Gravitational simulation snapshot
• 900 M particles, 18 GB
– Seaflow: flow cytometry survey
• 59 M observations, 1.9 GB
• 8 node cluster
– Dual quad core CPU, 16 GB RAM
– Hadoop 0.20.1 + custom patch in MapReduce API
Does SkewReduce work?
• SkewReduce plan yields faster running times
[Chart: relative runtime for Astro and Seaflow under each plan]

                                128 MB   16 MB   4 MB   2 MB  Manual  SkewReduce
Astro (18 GB, 3D), hours:         14.1     8.8    4.1    5.7     2.0         1.6
Seaflow (1.9 GB, 3D), minutes:    87.2    63.1   77.7   98.7       -        14.1

(The 128 MB through 2 MB columns are plain MapReduce runs at that partition granularity.)
With 1 hour of preparation, the SkewReduce plan improves on the otherwise best plan
Impact of Cost Function
[Chart: completion time (hours) on Astro for three cost functions:
Data Size, Histogram 1D, Histogram 3D]
Higher fidelity = better performance
Highlights of Evaluation
• Sample size
– Representativeness of sample is important
– 1% sample size worked well
• Runtime of the SkewReduce optimization
– Less than 15% of the runtime of the resulting SkewReduce plan
• Data volume in the Merge phase when using set-aside
– Total volume during Merge = 1% of input data
• Details in the SkewReduce paper
Conclusion
• Scientific analysis should be easy to write, scalable, and
have predictable performance
• Skew is a general problem; solutions are needed
• SkewReduce
– API for feature-extracting applications
– Scalable execution
– Good performance in spite of skew
• Cost-based partition optimization using a data sample
• Next step: handle skew in arbitrary map-reduce systems
• Looking for your examples of computational skew
• The current implementation can be made available to the fearless
BACKUP
SkewReduce API
[Diagram: INPUT, divided into bounding boxes, flows through PROCESS,
emitting FEATURE and SETASIDE data; MERGE combines FEATUREs across
bounding boxes, again emitting FEATURE and SETASIDE; FINALIZE
produces the OUTPUT]
SkewReduce: Prototype Architecture
[Diagram: the optimizer's output (bounding boxes) and user algorithms feed
the SkewReduce Runtime, which issues a map-only job per Extract/Merge step;
a Job Completion Monitor sends asynchronous notifications to the Job
Scheduler, which schedules each new task; a Pig script is shown as part of
the flow]
Want Some Code?

Example Extractor:

public class MyExtract extends PExtractOP {
    public class MyExtractMapper extends SkewReduceMapper {
        // Hadoop 0.20.x new API
        public void run(Context context) {
            // your extractor code
        }
    }
    protected Job createJob(Configuration conf) {
        // configure the Job object in the 0.20.x new API
    }
}

Example SkewReduce Driver:

public class MyApp extends SkewReduceDriver {
    …
    public LExtractorOP createExtractOp() {
        // return logical OP – physical OP template
    }
    public LMergeOP createMergeOp() {
        // return logical OP – physical OP template
    }
    …
    public Partition getRootPartition() {
        // we provide several default implementations
    }
}
Summary of Contributions
• Given a feature extraction application
– Possibly with computational skew
• SkewReduce
– Automatically partitions input data
– Improves runtime by reducing computational skew
• Key technique: user-defined cost functions
HP - Jerome Rolia - Hadoop World 2010