©2009 HP Confidential
Techniques to use Hadoop with scientific data
Jerry Rolia
Principal Scientist, Automated Infrastructure Lab, Hewlett Packard Labs
October 12, 2010
Joint work* with YongChul Kwon, Magdalena Balazinska, Bill Howe (University of Washington)
*“Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions,” appears in the proceedings of the 1st ACM Cloud Computing Symposium, 2010
Motivation
• Science is becoming a data analysis problem
• MapReduce with Hadoop is an attractive solution
– Easy API, declarative layer, seamless scalability, …
• Computational skew can make it hard to get high performance
• e.g., 14 hours vs. 70 minutes
• Challenges include:
– Partitioning data to avoid computational skew
– Implementing a hierarchical parallel merge
• SkewReduce:
– To automatically output a data partitioning and merge plan
Example Science Application:
Extracting Celestial Objects
• Input pixels
– { (x,y,r,g,b,ir,uv,…) }
• Coordinates
• Light intensities
• …
• Output features
– List of celestial objects
• Star
• Galaxy
• Planet
• Asteroid
• …
[Figure: M34 from Sloan Digital Sky Survey]
Scientific feature extraction applications
• Astronomy: e.g., identify celestial objects
– 2D arrays of pixel intensities; each element corresponds to a point in the sky and the time the image was taken
• Climate and ocean: e.g., understand phenomena in these systems
– 3D regions of atmosphere and oceans using arrays or meshes, simulating behavior over time by solving a set of governing equations
• Cosmology: e.g., study the structure of and changes in the universe
– 4D models of clouds of particles influenced by gravity to analyze the origin and evolution of the universe
• Flow cytometry: e.g., counting and examining microscopic particles
– Scattered light used to recognize microorganisms in water; an enormous volume of events clustered in a 6D space corresponding to different wavelengths of light
These application domains all reason about the multi-dimensional
space in which the data is embedded as well as the data itself
Parallel Feature Extraction
• Partition multi-dimensional input data
• Extract features from each partition
• Merge (or reconcile) features
• Finalize output
[Figure: INPUT DATA, Map, Hierarchical Reduce, Features]
Partition
• Bounding box algorithms are used to make
semantically correct partitions of data
Determine (axis, point) to split partition
What you want is for the partitions to have the same runtime!
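The (axis, point) split can be sketched as follows. This is a minimal illustration, not the presenters' implementation: the `BoundingBox` class and the `medianSplitPoint` helper are hypothetical names. The median only balances the number of sample points on each side, which does not guarantee equal runtime; that gap is what the cost functions later in the deck are meant to close.

```java
import java.util.Arrays;

/** Axis-aligned bounding box over d dimensions; lo is inclusive, hi is exclusive. */
final class BoundingBox {
    final double[] lo;
    final double[] hi;

    BoundingBox(double[] lo, double[] hi) {
        this.lo = lo.clone();
        this.hi = hi.clone();
    }

    /** True if the point lies inside this box. */
    boolean contains(double[] p) {
        for (int d = 0; d < lo.length; d++) {
            if (p[d] < lo[d] || p[d] >= hi[d]) return false;
        }
        return true;
    }

    /** Splits this box at (axis, point) into a lower and an upper box. */
    BoundingBox[] split(int axis, double point) {
        if (point <= lo[axis] || point >= hi[axis]) {
            throw new IllegalArgumentException("split point outside box");
        }
        double[] lowerHi = hi.clone();
        double[] upperLo = lo.clone();
        lowerHi[axis] = point;
        upperLo[axis] = point;
        return new BoundingBox[] { new BoundingBox(lo, lowerHi), new BoundingBox(upperLo, hi) };
    }

    /** Median of the sample along one axis: a split point that balances point counts only. */
    static double medianSplitPoint(double[][] sample, int axis) {
        double[] coords = new double[sample.length];
        for (int i = 0; i < sample.length; i++) coords[i] = sample[i][axis];
        Arrays.sort(coords);
        return coords[coords.length / 2];
    }
}
```

Splitting at the median is the simplest choice; a cost-aware splitter would pick the point where the estimated runtimes of the two halves are closest.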
Extract
• Apply application specific feature extraction algorithm to the
data within a bounding box
• Algorithm complexity may depend on the relationships among
data points in the box
O(N log N) with 0 neighbors per particle ~ O(N^2) with N neighbors per particle
Relationships among data in the space can lead to computational
skew!
Hierarchical parallel merge
• Partitions must be merged based on relationships between bounding boxes
• Data near edges of boxes is taken into account
• Some data can be set aside during extract/merge and reintroduced in finalize to reduce data copying
Hierarchical merge requires a MapReduce driver program to schedule merges in the correct order
[Figure: hierarchical merge of Features; at each merge, set aside data not needed in a future merge]
Finalize
• Features are integrated with set aside data for
final output
Problem: Computational Skew
• The top red line runs for 1.5 hours
[Figure: task timeline (Time vs. TaskID) for Local Clustering (MAP) and Merge (REDUCE); annotations: 5 minutes, 35 minutes]
Solution 1?
Micro partition
• Assign tiny amount of work
to each task to reduce skew
Impact of micro partitions?
• It works!
• Framework/merge overhead
can be large!
• To find the sweet spot, need to try different granularities!
[Figure: completion time (hours, 0 to 16) vs. # of partitions (256, 1024, 4096, 8192)]
Can we find a good partitioning plan
that incurs less overhead?
Solution 2?
Manual partition
• Repeat
– Solve using map reduce
– Only divide those
partitions that take too
long
• Until balanced
Can we find a good partitioning plan
without such trial and error?
[Figure: Iteration 1 shows partitions a, b, c, d; Iteration 2 shows partitions a1, a2, b1, b2, c, d]
SkewReduce approach
[Figure: SkewReduce Optimizer; labels: Sample, Cluster configuration, Cost functions, partitions numbered 1 to 15]
• Goal: minimize expected total runtime
• Output: SkewReduce runtime plan
– Bounding boxes for data partitions
– Schedule for longest jobs first and hierarchical parallel merge
Cost functions
• Two cost functions:
– Feature cost: (Bounding box, sample, sample rate) → cost
– Merge cost: (Bounding boxes, sample, sample rate) → cost
• E.g.,
– Estimate density of points in sample
– Characterize using histograms
– Use micro-benchmarks to relate to execution time
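A feature cost function along the lines of the example above could look like the sketch below. Everything here is an assumption made for illustration: the class name `HistogramCost`, the 1D equal-width histogram, the linear-plus-pairwise cost model, and the two coefficients, which stand in for values that micro-benchmarks would supply.

```java
/** Hypothetical feature-cost estimator: (bounding interval, sample, sample rate) -> cost. */
final class HistogramCost {
    // Assumed coefficients; in practice these would come from micro-benchmarks.
    static final double SECONDS_PER_POINT = 1e-6;
    static final double SECONDS_PER_PAIR = 1e-9;

    /** Counts sample points per equal-width bin over [lo, hi) along one axis. */
    static int[] histogram(double[][] sample, int axis, double lo, double hi, int bins) {
        int[] counts = new int[bins];
        double width = (hi - lo) / bins;
        for (double[] p : sample) {
            double x = p[axis];
            if (x < lo || x >= hi) continue;   // outside the bounding interval
            int b = Math.min(bins - 1, (int) ((x - lo) / width));
            counts[b]++;
        }
        return counts;
    }

    /** Estimated seconds: scale each bin by the sample rate, then charge linear plus pairwise work. */
    static double featureCost(double[][] sample, double sampleRate,
                              int axis, double lo, double hi, int bins) {
        double cost = 0.0;
        for (int c : histogram(sample, axis, lo, hi, bins)) {
            double n = c / sampleRate;         // estimated true point count in this bin
            cost += SECONDS_PER_POINT * n + SECONDS_PER_PAIR * n * n;
        }
        return cost;
    }
}
```

Because of the pairwise term, a sample concentrated in one bin is charged more than the same number of points spread across bins, which is the density effect that a plain data-size estimate cannot see.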
Search Partition Plan
• Greedy top-down search
– Split if total expected runtime improves
• Evaluate costs for subpartitions and merge
• Estimate new runtime
[Figure: Original partition with cost 100; Possible Split into tasks 1, 2, 3 with costs 50, 50, 10; over Time, Schedule 1 = 60 and Schedule 2 = 110]
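The split test can be sketched with the numbers from the figure. The reading used here is an assumption: two subpartitions costing 50 each plus a merge costing 10, which gives 60 when the subpartitions run in parallel and 110 when they run one after another. The class and method names are hypothetical, and longest-job-first list scheduling stands in for whatever scheduler the optimizer actually models.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class SplitDecision {
    /** Longest-job-first list scheduling: returns the makespan of the tasks on `slots` workers. */
    static double makespan(double[] taskCosts, int slots) {
        double[] sorted = taskCosts.clone();
        Arrays.sort(sorted);                              // ascending order
        PriorityQueue<Double> finishTimes = new PriorityQueue<>();
        for (int i = 0; i < slots; i++) finishTimes.add(0.0);
        for (int i = sorted.length - 1; i >= 0; i--) {    // assign the longest task first
            double earliest = finishTimes.poll();         // least-loaded worker
            finishTimes.add(earliest + sorted[i]);
        }
        double max = 0.0;
        for (double t : finishTimes) max = Math.max(max, t);
        return max;
    }

    /** Expected runtime if the partition is split: run the subpartitions, then merge them. */
    static double splitRuntime(double[] subCosts, double mergeCost, int slots) {
        return makespan(subCosts, slots) + mergeCost;
    }

    /** Greedy rule: split only if the expected total runtime improves. */
    static boolean shouldSplit(double originalCost, double[] subCosts, double mergeCost, int slots) {
        return splitRuntime(subCosts, mergeCost, slots) < originalCost;
    }
}
```

With two free slots, `shouldSplit(100, new double[] {50, 50}, 10, 2)` accepts the split because 60 is less than 100. With a single slot the same split costs 110 and is rejected, which is why the plan depends on the cluster configuration as well as on the data.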
Partition Plan
• Partition based on cluster and predicted feature
extraction/merge costs
• Stop partitioning when overhead exceeds benefit
Evaluation
• Distributed Friends of Friends
– Astro: Gravitational simulation snapshot
• 900 M particles, 18 GB
– Seaflow: flow cytometry survey
• 59 M observations, 1.9 GB
• 8 node cluster
– Dual quad core CPU, 16 GB RAM
– Hadoop 0.20.1 + custom patch in MapReduce API
Does SkewReduce work?
• SkewReduce plan yields faster running time
[Figure: relative runtime by partitioning plan for Astro and Seaflow]
Dataset                128 MB   16 MB   4 MB   2 MB   Manual   SkewReduce   Unit
Astro (18 GB, 3D)      14.1     8.8     4.1    5.7    2.0      1.6          Hours
Seaflow (1.9 GB, 3D)   87.2     63.1    77.7   98.7   -        14.1         Minutes
The figure also carries the labels: MapReduce; 1 hour preparation, improve on otherwise best plan
Impact of Cost Function
[Figure: Astro completion time (hours, 0 to 16) by cost function: Data Size, Histogram 1D, Histogram 3D]
Higher fidelity = Better performance
Highlights of Evaluation
• Sample size
– Representativeness of sample is important
– 1% sample size worked well
• Runtime of SkewReduce optimization
– Less than 15% of real runtime of SkewReduce plan
• Data volume in Merge phase using set aside
– Total volume during Merge = 1% of input data
• Details in the SkewReduce paper
Conclusion
• Scientific analysis should be easy to write, scalable, and have predictable performance
• Skew is a general problem; solutions are needed
• SkewReduce
– API for feature-extracting applications
– Scalable execution
– Good performance in spite of skew
• Cost-based partition optimization using a data sample
• Next step is to handle skew in arbitrary map-reduce
systems
• Looking for your examples of computational skew
• Current implementation can be made available to the
fearless
BACKUP
SkewReduce API
[Figure: SkewReduce API data flow]
PROCESS: (INPUT, Bounding Box) → (FEATURE, SETASIDE)
MERGE: (FEATURE, FEATURE, Bounding Box) → (FEATURE, SETASIDE)
FINALIZE: (FEATURE, SETASIDE) → OUTPUT
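To make the PROCESS / MERGE / FINALIZE shape concrete, here is a toy 1D friends-of-friends clustering written in that style. It is a sketch under stated assumptions, not the SkewReduce API: the class `Fof1D`, its method names, the `{min, max, count}` cluster encoding, and the linking length `EPS` are all hypothetical. Clusters within `EPS` of a box edge stay in the feature because they may still grow; the rest are set aside until finalize.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy 1D friends-of-friends clustering in a PROCESS / MERGE / FINALIZE shape. */
final class Fof1D {
    static final double EPS = 1.0;   // linking length (assumed value)

    /** A cluster is {min, max, count}; a box is {lo, hi}. */
    static final class Result {
        final List<double[]> feature = new ArrayList<>();   // clusters that may still grow
        final List<double[]> setAside = new ArrayList<>();  // clusters that are already final
    }

    /** Keeps clusters within EPS of a box edge as features; sets the rest aside. */
    static Result classify(List<double[]> clusters, double[] box) {
        Result r = new Result();
        for (double[] c : clusters) {
            boolean nearEdge = c[0] - box[0] < EPS || box[1] - c[1] < EPS;
            (nearEdge ? r.feature : r.setAside).add(c);
        }
        return r;
    }

    /** Joins clusters (sorted by min) whose gap is below EPS. */
    static List<double[]> link(List<double[]> sorted) {
        List<double[]> out = new ArrayList<>();
        for (double[] c : sorted) {
            double[] last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (last != null && c[0] - last[1] < EPS) {
                last[1] = Math.max(last[1], c[1]);
                last[2] += c[2];
            } else {
                out.add(c.clone());
            }
        }
        return out;
    }

    /** PROCESS: cluster the points of one partition. */
    static Result process(double[] points, double[] box) {
        double[] sorted = points.clone();
        Arrays.sort(sorted);
        List<double[]> singletons = new ArrayList<>();
        for (double p : sorted) singletons.add(new double[] { p, p, 1 });
        return classify(link(singletons), box);
    }

    /** MERGE: reconcile the features of two partitions inside their combined box. */
    static Result merge(List<double[]> left, List<double[]> right, double[] mergedBox) {
        List<double[]> all = new ArrayList<>(left);
        all.addAll(right);
        all.sort((a, b) -> Double.compare(a[0], b[0]));
        return classify(link(all), mergedBox);
    }

    /** FINALIZE: combine the remaining features with everything set aside earlier. */
    static List<double[]> finalizeOutput(List<double[]> feature, List<double[]> setAside) {
        List<double[]> out = new ArrayList<>(feature);
        out.addAll(setAside);
        return out;
    }
}
```

Only the clusters near a boundary travel into the merge step, which illustrates how setting data aside can keep the merge volume small relative to the input.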
SkewReduce: Prototype Architecture
[Figure: prototype architecture; labels: Optimizer output (Bounding Boxes), User algorithms, Pig Script, SkewReduce Runtime, Schedule Extract, Schedule Merge, Job Completion Monitor, Job Scheduler, Map-only job per Extract/Merge, Asynchronous notification, Schedule new task]
Want Some Code?
Example Extractor
public class MyExtract extends PExtractOP {
    public class MyExtractMapper extends SkewReduceMapper {
        // 0.20.x New API
        public void run(Context context) {
            // your extractor code
        }
    }

    protected Job createJob(Configuration conf) {
        // configure Job object in 0.20.x New API
    }
}
Example SkewReduce Driver
public class MyApp extends SkewReduceDriver {
    …
    public LExtractorOP createExtractOp() {
        // return logical OP – physical OP template
    }

    public LMergeOP createMergeOp() {
        // return logical OP – physical OP template
    }
    …
    public Partition getRootPartition() {
        // we provide several default implementations
    }
}
Summary of Contributions
• Given a feature extraction application
– Possibly with computation skew
• SkewReduce
– Automatically partitions input data
– Improves runtime by reducing computation skew
• Key technique: user-defined cost functions

More Related Content

What's hot

Big Linked Data Querying - ExtremeEarth Open Workshop
Big Linked Data Querying - ExtremeEarth Open WorkshopBig Linked Data Querying - ExtremeEarth Open Workshop
Big Linked Data Querying - ExtremeEarth Open WorkshopExtremeEarth
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and ReuseVasia Kalavri
 
Determining the k in k-means with MapReduce
Determining the k in k-means with MapReduceDetermining the k in k-means with MapReduce
Determining the k in k-means with MapReduceThibault Debatty
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersXiao Qin
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...DB Tsai
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...newmooxx
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
CloudClustering: Toward an Iterative Data Processing Pattern on the CloudCloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
CloudClustering: Toward an Iterative Data Processing Pattern on the CloudAnkur Dave
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...Xiao Qin
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkDB Tsai
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsRobert Grossman
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorSubhas Kumar Ghosh
 
GEOPROCESSING IN QGIS
GEOPROCESSING IN QGISGEOPROCESSING IN QGIS
GEOPROCESSING IN QGISSwetha A
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesVasia Kalavri
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReducePietro Michiardi
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Databricks
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Spark Summit
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...Databricks
 

What's hot (20)

Big Linked Data Querying - ExtremeEarth Open Workshop
Big Linked Data Querying - ExtremeEarth Open WorkshopBig Linked Data Querying - ExtremeEarth Open Workshop
Big Linked Data Querying - ExtremeEarth Open Workshop
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 
Determining the k in k-means with MapReduce
Determining the k in k-means with MapReduceDetermining the k in k-means with MapReduce
Determining the k in k-means with MapReduce
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
2015-06-15 Large-Scale Elastic-Net Regularized Generalized Linear Models at S...
 
A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...A time energy performance analysis of map reduce on heterogeneous systems wit...
A time energy performance analysis of map reduce on heterogeneous systems wit...
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
CloudClustering: Toward an Iterative Data Processing Pattern on the CloudCloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud
 
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod...
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Hadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparatorHadoop secondary sort and a custom comparator
Hadoop secondary sort and a custom comparator
 
GEOPROCESSING IN QGIS
GEOPROCESSING IN QGISGEOPROCESSING IN QGIS
GEOPROCESSING IN QGIS
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open Issues
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
Cardinality Estimation through Histogram in Apache Spark 2.3 with Ron Hu and ...
 
MapReduce
MapReduceMapReduce
MapReduce
 
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
Large-Scale Lasso and Elastic-Net Regularized Generalized Linear Models (DB T...
 
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
ADMM-Based Scalable Machine Learning on Apache Spark with Sauptik Dhar and Mo...
 

Similar to HP - Jerome Rolia - Hadoop World 2010

Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit
 
Performance Models for Apache Accumulo
Performance Models for Apache AccumuloPerformance Models for Apache Accumulo
Performance Models for Apache AccumuloSqrrl
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...Feng Li
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesGlobus
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskASI Data Science
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez DataWorks Summit
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCOlga Lavrentieva
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
Data quality evaluation & orbit identification from scatterometer
Data quality evaluation & orbit identification from scatterometerData quality evaluation & orbit identification from scatterometer
Data quality evaluation & orbit identification from scatterometerMudit Dholakia
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintijccsa
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Clusterairbots
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET Journal
 

Similar to HP - Jerome Rolia - Hadoop World 2010 (20)

Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
Accumulo Summit 2015: Performance Models for Apache Accumulo: The Heavy Tail ...
 
Project Matsu
Project MatsuProject Matsu
Project Matsu
 
Performance Models for Apache Accumulo
Performance Models for Apache AccumuloPerformance Models for Apache Accumulo
Performance Models for Apache Accumulo
 
PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...PEARC17:A real-time machine learning and visualization framework for scientif...
PEARC17:A real-time machine learning and visualization framework for scientif...
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All Scales
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
Automated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with DaskAutomated Data Exploration: Building efficient analysis pipelines with Dask
Automated Data Exploration: Building efficient analysis pipelines with Dask
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
try
trytry
try
 
Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez Graphene – Microsoft SCOPE on Tez
Graphene – Microsoft SCOPE on Tez
 
Resisting skew accumulation
Resisting skew accumulationResisting skew accumulation
Resisting skew accumulation
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
Data quality evaluation & orbit identification from scatterometer
Data quality evaluation & orbit identification from scatterometerData quality evaluation & orbit identification from scatterometer
Data quality evaluation & orbit identification from scatterometer
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
 
CUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce ClusterCUDA performance study on Hadoop MapReduce Cluster
CUDA performance study on Hadoop MapReduce Cluster
 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19

HP - Jerome Rolia - Hadoop World 2010

  • 1. ©2009 HP Confidential Jerry Rolia Principal Scientist, Automated Infrastructure Lab, Hewlett Packard Labs October 12, 2010 Techniques to use Hadoop with scientific data
  • 2. Joint work* with YongChul Kwon, Magdalena Balazinska, Bill Howe (University of Washington) *“Skew-Resistant Parallel Processing of Feature-Extracting Scientific User-Defined Functions,” appears in the proceedings of the 1st ACM Cloud Computing Symposium, 2010
  • 3. Motivation • Science is becoming a data analysis problem • MapReduce with Hadoop is an attractive solution – Easy API, declarative layer, seamless scalability, … • Computational skew can make it hard to get high performance • e.g., 14 hours vs. 70 minutes • Challenges include: – Partitioning data to avoid computational skew – Implementing a hierarchical parallel merge • SkewReduce: – To automatically output a data partitioning and merge plan
  • 4. Example Science Application: Extracting Celestial Objects • Input pixels – { (x,y,r,g,b,ir,uv,…) } • Coordinates • Light intensities • … • Output features – List of celestial objects • Star • Galaxy • Planet • Asteroid • … [Image: M34 from Sloan Digital Sky Survey]
  • 5. Scientific feature extraction applications • Astronomy: e.g., identify celestial objects – 2D arrays of pixel intensities, each element corresponds to a point in the sky and the time the image was taken • Climate and ocean: e.g., understand phenomena in systems – 3D regions of atmosphere and oceans using arrays or meshes, simulating behavior over time by solving a set of governing equations • Cosmology: e.g., study the structure and changes in the universe – 4D models of clouds of particles influenced by gravity to analyze the origin and evolution of the universe • Flow cytometry: e.g., counting and examining microscopic particles – Scattered light used to recognize microorganisms in water, enormous volume of events clustered in a 6D space corresponding to different wavelengths of light • These application domains all reason about the multi-dimensional space in which the data is embedded as well as the data itself.
  • 6. Parallel Feature Extraction • Partition multi-dimensional input data • Extract features from each partition • Merge (or reconcile) features • Finalize output [Figure: input data passes through Map and Hierarchical Reduce to produce features]
  • 7. Partition • Bounding box algorithms are used to make semantically correct partitions of data • Determine (axis, point) to split a partition • What you want is for the partitions to have the same runtime!
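The (axis, point) split on the Partition slide can be illustrated with a small sketch. The class name `BoundingBox`, its half-open interval convention, and its methods are assumptions made for this example; they are not taken from the SkewReduce code base.

```java
/** Hypothetical axis-aligned box over d dimensions: lo[i] <= x[i] < hi[i]. */
final class BoundingBox {
    final double[] lo;
    final double[] hi;

    BoundingBox(double[] lo, double[] hi) {
        if (lo.length != hi.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        this.lo = lo.clone();
        this.hi = hi.clone();
    }

    /** True if the point lies inside this half-open box. */
    boolean contains(double[] p) {
        for (int i = 0; i < lo.length; i++) {
            if (p[i] < lo[i] || p[i] >= hi[i]) {
                return false;
            }
        }
        return true;
    }

    /** Splits this box at `point` along `axis` and returns {lower, upper}. */
    BoundingBox[] split(int axis, double point) {
        if (point <= lo[axis] || point >= hi[axis]) {
            throw new IllegalArgumentException("split point outside box");
        }
        double[] lowerHi = hi.clone();
        double[] upperLo = lo.clone();
        lowerHi[axis] = point;   // lower half ends at the split point
        upperLo[axis] = point;   // upper half starts at the split point
        return new BoundingBox[] {
            new BoundingBox(lo, lowerHi), new BoundingBox(upperLo, hi)
        };
    }
}
```

Because the boxes are half-open, every data point falls into exactly one of the two halves. Choosing which (axis, point) to use so that both halves have similar runtime is the harder problem the rest of the talk addresses.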
  • 8. Extract • Apply application specific feature extraction algorithm to the data within a bounding box • Algorithm complexity may depend on the relationships among data points in the box – From O(N log N) at 0 neighbors per particle to O(N^2) at N neighbors per particle • Relationships among data in the space can lead to computational skew!
  • 9. Hierarchical parallel merge • Partitions must be merged based on relationships between bounding boxes • Data near edges of boxes taken into account • Some data can be set aside during extract/merge and re-introduced in finalize to reduce data copying • Hierarchical merge requires a map reduce driver program to schedule merges in correct order [Figure: features flow up the merge hierarchy; at each level, data not needed in a future merge is set aside]
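The ordering constraint on the merge slide (a merge can run only after the partitions it combines are finished) can be sketched as a post-order walk over a partition tree. `PartitionNode` and `mergeOrder` are hypothetical names for this illustration, and a real driver would launch independent merges in parallel rather than produce one flat list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical binary partition tree; leaves are extract-only partitions. */
final class PartitionNode {
    final String name;
    final PartitionNode left;
    final PartitionNode right;

    PartitionNode(String name) {
        this(name, null, null);
    }

    PartitionNode(String name, PartitionNode left, PartitionNode right) {
        this.name = name;
        this.left = left;
        this.right = right;
    }

    boolean isLeaf() {
        return left == null && right == null;
    }

    /** Returns merge names so that both children precede their parent. */
    static List<String> mergeOrder(PartitionNode root) {
        List<String> order = new ArrayList<>();
        collect(root, order);
        return order;
    }

    private static void collect(PartitionNode node, List<String> order) {
        if (node == null || node.isLeaf()) {
            return;               // leaves need no merge step
        }
        collect(node.left, order);
        collect(node.right, order);
        order.add(node.name);     // safe to merge once both sides are done
    }
}
```

For a tree whose root combines merges "ab" and "cd", the walk yields "ab", "cd", and then the root, which is one valid order among several.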
  • 10. Finalize • Features are integrated with set aside data for final output
  • 11. Problem: Computational Skew • The top red line runs for 1.5 hours [Figure: task timeline, TaskID vs. time, showing the Local Clustering (MAP) and Merge (REDUCE) phases; annotations: skew, 5 minutes, 35 minutes]
  • 12. Solution 1? Micro partition • Assign a tiny amount of work to each task to reduce skew
  • 13. Impact of micro partitions? • It works! • Framework/merge overhead can be large! • To find sweet spot, need to try different granularities! • Can we find a good partitioning plan that incurs less overhead? [Figure: completion time (hours) vs. number of partitions (256, 1024, 4096, 8192)]
  • 14. Solution 2? Manual partition • Repeat – Solve using map reduce – Only divide those partitions that take too long • Until balanced • Can we find a good partitioning plan without such trial and error? [Figure: iteration 1 uses partitions a, b, c, d; iteration 2 uses a1, a2, b1, b2, c, d]
  • 15. SkewReduce approach • Goal: minimize expected total runtime • Output: SkewReduce runtime plan – Bounding boxes for data partitions – Schedule for longest jobs first and hierarchical parallel merge [Figure: a sample, the cost functions, and the cluster configuration feed the SkewReduce Optimizer, which outputs a runtime plan over partitions numbered 1 to 15]
  • 16. Cost functions • Two cost functions: – Feature cost: (Bounding box, sample, sample rate) → cost – Merge cost: (Bounding boxes, sample, sample rate) → cost • E.g., – Estimate density of points in sample – Characterize using histograms – Use micro-benchmarks to relate to execution time
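A feature cost function with the (bounding box, sample, sample rate) → cost shape from the slide might look like the sketch below. It is only an illustration: it counts sample points instead of building a histogram, it assumes an N log N cost model, and the constant `SECONDS_PER_OP` is a made-up stand-in for a value a micro-benchmark would provide.

```java
/** Hypothetical feature cost: scale the sample count up, then apply a cost model. */
final class FeatureCost {
    /** Assumed per-operation time in seconds; a real value would come from a micro-benchmark. */
    static final double SECONDS_PER_OP = 1e-6;

    /**
     * Estimates extraction time for the half-open box [lo, hi).
     * sampleRate is the fraction of the input that was sampled, e.g. 0.01.
     */
    static double estimate(double[] lo, double[] hi, double[][] sample, double sampleRate) {
        long inBox = 0;
        for (double[] p : sample) {
            boolean inside = true;
            for (int i = 0; i < lo.length; i++) {
                if (p[i] < lo[i] || p[i] >= hi[i]) {
                    inside = false;
                    break;
                }
            }
            if (inside) {
                inBox++;
            }
        }
        double n = inBox / sampleRate;            // estimated points in the full data
        if (n < 2) {
            return n * SECONDS_PER_OP;            // avoid log of values below 2
        }
        double log2n = Math.log(n) / Math.log(2);
        return SECONDS_PER_OP * n * log2n;        // assumed N log N model
    }
}
```

A density-aware variant would replace the single count with per-cell histogram counts, so that crowded regions where extraction approaches quadratic cost are charged more than sparse ones.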
  • 17. Search Partition Plan • Greedy top-down search – Split if total expected runtime improves • Evaluate costs for subpartitions and merge • Estimate new runtime [Figure: original partition with cost 100; a possible split with costs 50, 50, and 10; Schedule 1 = 60, Schedule 2 = 110; horizontal axis: time]
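The split test on this slide can be worked through with its own numbers. One reading of the figure, assumed here, is that the two subpartitions cost 50 each and the merge costs 10: with two free slots the split finishes at 50 + 10 = 60, which beats 100, while with one slot it takes 50 + 50 + 10 = 110, which does not. The class `SplitDecision` and its longest-first placement are a hypothetical sketch, not the optimizer's actual code.

```java
import java.util.Arrays;

/** Hypothetical check: does splitting a partition reduce expected runtime? */
final class SplitDecision {
    /** Finish time when tasks are placed longest first on the least-loaded slot. */
    static double makespan(double[] costs, int slots) {
        double[] sorted = costs.clone();
        Arrays.sort(sorted);                          // ascending order
        double[] load = new double[slots];
        for (int i = sorted.length - 1; i >= 0; i--) {   // walk from longest to shortest
            int least = 0;
            for (int s = 1; s < slots; s++) {
                if (load[s] < load[least]) {
                    least = s;
                }
            }
            load[least] += sorted[i];
        }
        double max = 0;
        for (double l : load) {
            max = Math.max(max, l);
        }
        return max;
    }

    /** Split only if subpartition work plus the merge beats the unsplit cost. */
    static boolean shouldSplit(double original, double[] subCosts, double mergeCost, int slots) {
        return makespan(subCosts, slots) + mergeCost < original;
    }
}
```

The same split is worthwhile or not depending on how many slots are free, which is why the optimizer takes the cluster configuration as an input.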
  • 18. Partition Plan • Partition based on cluster and predicted feature extraction/merge costs • Stop partitioning when overhead exceeds benefit
  • 19. Evaluation • Distributed Friends of Friends – Astro: Gravitational simulation snapshot • 900 M particles, 18 GB – Seaflow: flow cytometry survey • 59 M observations, 1.9 GB • 8 node cluster – Dual quad core CPU, 16 GB RAM – Hadoop 0.20.1 + custom patch in MapReduce API
  • 20. Does SkewReduce work? • SkewReduce plan yields faster running time • Astro (18 GB, 3D), hours: MapReduce 128 MB 14.1; 16 MB 8.8; 4 MB 4.1; 2 MB 5.7; Manual 2.0; SkewReduce 1.6 • Seaflow (1.9 GB, 3D), minutes: MapReduce 128 MB 87.2; 16 MB 63.1; 4 MB 77.7; 2 MB 98.7; Manual -; SkewReduce 14.1 • 1 hour preparation, improve on otherwise best plan [Figure: bar chart of relative runtime for each plan]
  • 21. Impact of Cost Function • Higher fidelity = Better performance [Figure: Astro completion time (hours) by cost function: Data Size, Histogram 1D, Histogram 3D]
  • 22. Highlights of Evaluation • Sample size – Representativeness of sample is important – 1% sample size worked well • Runtime of SkewReduce optimization – Less than 15% of real runtime of SkewReduce plan • Data volume in Merge phase using set aside – Total volume during Merge = 1% of input data • Details in the SkewReduce paper
  • 23. Conclusion • Scientific analysis should be easy to write, scalable, and have predictable performance • Skew is a general problem, solutions are needed • SkewReduce – API for feature extraction – Scalable execution – Good performance in spite of skew • Cost-based partition optimization using a data sample • Next step is to handle skew in arbitrary map-reduce systems • Looking for your examples of computational skew • Current implementation can be made available to the fearless
  • 26. SkewReduce: Prototype Architecture Bounding Boxes SkewReduce Runtime Extract Job Completion Monitor Schedule Merge Pig Script Job Scheduler Map-only job per Extract/Merge Asynchronous notification Schedule new task Optimizer output User algorithms
  • 27. Want Some Code?
    Example Extractor:
      public class MyExtract extends PExtractOP {
          public class MyExtractMapper extends SkewReduceMapper { // 0.20.x New API
              public void run(Context context) {
                  // your extractor code
              }
          }

          protected Job createJob(Configuration conf) {
              // configure Job object in 0.20.x New API
          }
      }
    Example SkewReduce Driver:
      public class MyApp extends SkewReduceDriver {
          …
          public LExtractorOP createExtractOp() {
              // return logical OP – physical OP template
          }

          public LMergeOP createMergeOp() {
              // return logical OP – physical OP template
          }
          …
          public Partition getRootPartition() {
              // we provide several default implementations
          }
      }
  • 28. Summary of Contributions • Given a feature extraction application – Possibly with computation skew • SkewReduce – Automatically partitions input data – Improves runtime by reducing computation skew • Key technique: user-defined cost functions