WBDB 2014 Benchmarking Virtualized Hadoop Clusters

t_ivanov

This work investigates the performance of Big Data applications in virtualized Hadoop environments. It evaluates and compares the performance of applications running on a virtualized Hadoop cluster with separated data and computation layers against a standard Hadoop installation. http://clds.sdsc.edu/wbdb2014.de/program

Benchmarking Virtualized Hadoop Clusters
Todor Ivanov, Roberto V. Zicari
Big Data Lab, Goethe University Frankfurt
Alejandro Buchmann
Database and Distributed Systems, TU Darmstadt
5th Workshop on Big Data Benchmarking 2014
Outline
• Virtualizing Hadoop
• Measuring Performance
– Iterative Experimental Approach
– Platform Setup
– Experiments
– Summary of Results
• Lessons Learned
• Next Steps
Virtualizing Hadoop
• Motivation
– Hadoop-as-a-service (e.g. Amazon Elastic MapReduce)
– Automated deployment and cost-effective management
– Dynamically scalable cluster size (e.g. number of nodes, resource allocation)
• Challenges
– I/O overhead
– Network overhead (message communication and data transfer)
• Related Work: virtualized vs. physical Hadoop
– Virtualized Hadoop has an estimated overhead of 2-10% (reported in [1], [2], [3])
[1] Buell, J.: A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5. Tech. White Pap. VMware Inc. (2011).
[2] Buell, J.: Virtualized Hadoop Performance with VMware vSphere 5.1. Tech. White Pap. VMware Inc. (2013).
[3] Microsoft: Performance of Hadoop on Windows in Hyper-V Environments. Tech. White Pap. Microsoft (2013).
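The overhead range quoted above is a relative slowdown of a virtualized run against a physical baseline. A minimal sketch of that calculation follows; the function name and the sample run times are hypothetical and are not taken from the cited reports.

```python
def relative_overhead(baseline_seconds: float, measured_seconds: float) -> float:
    """Return the slowdown of a measured run versus a baseline, in percent.

    A positive value means the measured (e.g. virtualized) run was slower.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (measured_seconds - baseline_seconds) / baseline_seconds * 100.0


# Hypothetical run times in seconds, for illustration only.
physical_seconds = 200.0
virtualized_seconds = 212.0

overhead = relative_overhead(physical_seconds, virtualized_seconds)
print(f"Overhead: {overhead:.1f}%")
```

A negative result would indicate that the virtualized run finished faster than the baseline.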
Objectives of Our Research
Investigate and compare the performance of standard and separated data-compute cluster configurations.
• How does application performance change on a data-compute cluster?
• Which types of applications are better suited to data-compute clusters?
[Figure: Standard Cluster vs. Data-Compute Cluster]
Methodology: Iterative Experimental Approach
I. Choose a Big Data Benchmark
II. Configure Hadoop Cluster
III. Perform Experiments
IV. Evaluate Results
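The four steps above can be sketched as a simple loop over cluster configurations and workloads. This is only an illustration under stated assumptions: the function names, the three repetitions per run, and the `fake_run` stand-in (which returns placeholder durations instead of launching a real benchmark) are all hypothetical and not part of the original slides.

```python
from statistics import mean


def run_experiments(configs, workloads, run_once, repeats=3):
    """Steps II-III: run every workload on every configuration.

    `run_once(config, workload)` must return a duration in seconds.
    The repeat count of 3 is an assumption, not a value from the slides.
    """
    results = {}
    for config in configs:
        for workload in workloads:
            durations = [run_once(config, workload) for _ in range(repeats)]
            results[(config, workload)] = mean(durations)
    return results


def evaluate(results, baseline, candidate):
    """Step IV: ratio of candidate time to baseline time (>1 means slower)."""
    workloads = sorted({workload for (_, workload) in results})
    return {w: results[(candidate, w)] / results[(baseline, w)] for w in workloads}


def fake_run(config, workload):
    """Placeholder runner with made-up durations; NOT measured data."""
    base = {"WordCount": 100.0, "TestDFSIOEnhanced": 200.0}[workload]
    return base * (1.25 if config == "data-compute" else 1.0)


# Step I: the chosen benchmark supplies the workload names.
results = run_experiments(
    configs=["standard", "data-compute"],
    workloads=["WordCount", "TestDFSIOEnhanced"],
    run_once=fake_run,
)
ratios = evaluate(results, baseline="standard", candidate="data-compute")
print(ratios)
```

In a real setup, `fake_run` would be replaced by a function that launches the benchmark on the configured cluster and returns the measured run time.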
Step I: Intel HiBench
• Benchmark suite for Hadoop (developed by Intel in 2010) (Huang et al. [4])
• 4 categories, 10 workloads & 3 types
• Metrics: Time (Sec) & Throughput (Bytes/Sec)
Category          No  Workload                 Tools          Type
Micro Benchmarks  1   Sort                     MapReduce      IO Bound
Micro Benchmarks  2   WordCount                MapReduce      CPU Bound
Micro Benchmarks  3   TeraSort                 MapReduce      Mixed
Micro Benchmarks  4   TestDFSIOEnhanced        MapReduce      IO Bound
Web Search        5   Nutch Indexing           Nutch, Lucene  Mixed
Web Search        6   Page Rank                Pegasus        Mixed
Machine Learning  7   Bayesian Classification  Mahout         Mixed
Machine Learning  8   K-means Clustering       Mahout         Mixed
Analytical Query  9   Join                     Hive           Mixed
Analytical Query  10  Aggregation              Hive           Mixed
[4] Huang, S. et al.: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. Data Engineering Workshops (ICDEW), 2010.
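The workload table and the two metrics above can be captured in a small data structure. The sketch below is illustrative only: the `Workload` class and helper names are hypothetical, and throughput is assumed to be input size divided by elapsed time, which the slides do not spell out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    number: int
    name: str
    category: str
    tools: str
    kind: str  # "IO Bound", "CPU Bound" or "Mixed"


# Rows copied from the HiBench workload table above.
WORKLOADS = [
    Workload(1, "Sort", "Micro Benchmarks", "MapReduce", "IO Bound"),
    Workload(2, "WordCount", "Micro Benchmarks", "MapReduce", "CPU Bound"),
    Workload(3, "TeraSort", "Micro Benchmarks", "MapReduce", "Mixed"),
    Workload(4, "TestDFSIOEnhanced", "Micro Benchmarks", "MapReduce", "IO Bound"),
    Workload(5, "Nutch Indexing", "Web Search", "Nutch, Lucene", "Mixed"),
    Workload(6, "Page Rank", "Web Search", "Pegasus", "Mixed"),
    Workload(7, "Bayesian Classification", "Machine Learning", "Mahout", "Mixed"),
    Workload(8, "K-means Clustering", "Machine Learning", "Mahout", "Mixed"),
    Workload(9, "Join", "Analytical Query", "Hive", "Mixed"),
    Workload(10, "Aggregation", "Analytical Query", "Hive", "Mixed"),
]


def group_by_type(workloads):
    """Map each workload type to the names of the workloads of that type."""
    groups = defaultdict(list)
    for workload in workloads:
        groups[workload.kind].append(workload.name)
    return dict(groups)


def throughput_bytes_per_sec(input_bytes: int, duration_sec: float) -> float:
    """Assumed definition: bytes processed divided by elapsed seconds."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return input_bytes / duration_sec


groups = group_by_type(WORKLOADS)
for kind, names in groups.items():
    print(f"{kind}: {', '.join(names)}")
```

Grouping by type makes it easy to pick one representative IO-bound, CPU-bound, and mixed workload when comparing cluster configurations.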

LLMOps with Azure Machine Learning prompt flow
 
Machine Learning Basics for Dummies (no math!)
Machine Learning Basics for Dummies (no math!)Machine Learning Basics for Dummies (no math!)
Machine Learning Basics for Dummies (no math!)
 
CSS Notes in PDF, Easy to understand. For beginner to advanced. ...
CSS Notes in PDF, Easy to understand. For beginner to advanced.              ...CSS Notes in PDF, Easy to understand. For beginner to advanced.              ...
CSS Notes in PDF, Easy to understand. For beginner to advanced. ...
 
Reliable, Remote Computation at All Scales
Reliable, Remote Computation at All ScalesReliable, Remote Computation at All Scales
Reliable, Remote Computation at All Scales
 
2024 Trends Transforming Enterprise Resource Planning
2024 Trends Transforming Enterprise Resource Planning2024 Trends Transforming Enterprise Resource Planning
2024 Trends Transforming Enterprise Resource Planning
 
Cybersecurity Measures For Remote Workers.pdf
Cybersecurity Measures For Remote Workers.pdfCybersecurity Measures For Remote Workers.pdf
Cybersecurity Measures For Remote Workers.pdf
 
An Introduction to Globus for Researchers
An Introduction to Globus for ResearchersAn Introduction to Globus for Researchers
An Introduction to Globus for Researchers
 
Globus for System Administrators
Globus for System AdministratorsGlobus for System Administrators
Globus for System Administrators
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Managing multicast/igmp stream on Docker
Managing multicast/igmp stream on DockerManaging multicast/igmp stream on Docker
Managing multicast/igmp stream on Docker
 
Passbolt Introduction and Usage for secret managment
Passbolt Introduction and Usage for secret managmentPassbolt Introduction and Usage for secret managment
Passbolt Introduction and Usage for secret managment
 

WBDB 2014 Benchmarking Virtualized Hadoop Clusters

  • 1. Benchmarking Virtualized Hadoop Clusters
       Todor Ivanov, Roberto V. Zicari
       Big Data Lab, Goethe University Frankfurt
       Alejandro Buchmann
       Database and Distributed Systems, TU Darmstadt
       5th Workshop on Big Data Benchmarking 2014
  • 2. Outline
       • Virtualizing Hadoop
       • Measuring Performance
         – Iterative Experimental Approach
         – Platform Setup
         – Experiments
         – Summary of Results
       • Lessons Learned
       • Next Steps
  • 3. Virtualizing Hadoop
       • Motivation
         – Hadoop-as-a-service (e.g. Amazon Elastic MapReduce)
         – Automated deployment and cost-effective management
         – Dynamically scalable cluster size (e.g. # of nodes, resource allocation)
       • Challenges
         – I/O overhead
         – Network overhead (message communication and data transfer)
       • Related Work: virtualized vs. physical Hadoop
         → Virtualized Hadoop has an estimated overhead ranging between 2-10% (reported in [1], [2], [3])
       [1] Buell, J.: A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5. Tech. White Pap. VMware Inc. (2011).
       [2] Buell, J.: Virtualized Hadoop Performance with VMware vSphere 5.1. Tech. White Pap. VMware Inc. (2013).
       [3] Microsoft: Performance of Hadoop on Windows in Hyper-V Environments. Tech. White Pap. Microsoft. (2013).
  • 4. Objectives of Our Research
       Investigate and compare the performance of standard and separated data-compute cluster configurations.
       • How does application performance change on a data-compute cluster?
       • What types of applications are more suitable for data-compute clusters?
       [Figure: Standard Cluster vs. Data-Compute Cluster]
  • 5. Methodology: Iterative Experimental Approach
       I. Choose a Big Data Benchmark
       II. Configure Hadoop Cluster
       III. Perform Experiments
       IV. Evaluate Results
  • 6. Step I: Intel HiBench
       • Benchmark suite for Hadoop (developed by Intel in 2010) (Huang et al. [4])
       • 4 categories, 10 workloads & 3 types
       • Metrics: Time (Sec) & Throughput (Bytes/Sec)

       Category           No.  Workload                 Tools          Type
       Micro Benchmarks   1    Sort                     MapReduce      IO Bound
       Micro Benchmarks   2    WordCount                MapReduce      CPU Bound
       Micro Benchmarks   3    TeraSort                 MapReduce      Mixed
       Micro Benchmarks   4    TestDFSIOEnhanced        MapReduce      IO Bound
       Web Search         5    Nutch Indexing           Nutch, Lucene  Mixed
       Web Search         6    Page Rank                Pegasus        Mixed
       Machine Learning   7    Bayesian Classification  Mahout         Mixed
       Machine Learning   8    K-means Clustering       Mahout         Mixed
       Analytical Query   9    Join                     Hive           Mixed
       Analytical Query   10   Aggregation              Hive           Mixed

       [4] Huang, S. et al.: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. Data Engineering Workshops (ICDEW), 2010.
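The two HiBench metrics are related, since throughput is derived from the measured duration. A minimal sketch, assuming throughput is simply the input size in bytes divided by the elapsed seconds and that "GB" means 2**30 bytes (both are assumptions, and the 600-second duration is made up for illustration, not a measured result):

```python
GIB = 1024 ** 3  # assumption: the slides' "GB" is treated as 2**30 bytes


def throughput_bytes_per_sec(input_bytes: int, duration_sec: float) -> float:
    """Return processed bytes per second for one benchmark run."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return input_bytes / duration_sec


# Hypothetical run: 10 GB of input processed in 600 seconds.
example_throughput = throughput_bytes_per_sec(10 * GIB, 600.0)
print(f"{example_throughput:.0f} bytes/sec")
```

Because throughput here is only a rescaling of time for a fixed input size, comparing clusters by normalized time or by throughput should rank them the same way.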
  • 7. Step II: Platform Setup
       • Platform layer (Hadoop Cluster)
         – vSphere Big Data Extension integrating Serengeti Server (version 1.0)
         – VM template hosting CentOS
         – Apache Hadoop (version 1.2.1) with default parameters:
           • 200MB Java heap size
           • 64MB block size
           • Replication factor 3
       • Management layer (Virtualization)
         – VMware vSphere 5.1
         – ESXi and vCenter Servers
       • Hardware layer: Dell PowerEdge T420 server
         – 2 x Intel Xeon E5-2420 (1.9 GHz), 6-core CPUs
         – 32GB RAM
         – 4 x 1 TB WD SATA disks
       [Figure: layer stack with Application (HiBench Benchmark), Platform (Hadoop Cluster), Management (Virtualization), and Hardware (CPUs, Memory, Storage)]
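To make the default HDFS parameters concrete, the following sketch works out how many 64MB blocks a file occupies and how much raw storage replication factor 3 nominally requires. It assumes binary units (1 MB = 2**20 bytes) and a single contiguous file per size. The replicated figure is nominal only, since HDFS cannot place more replicas than there are data nodes. The function names are illustrative, not part of any Hadoop API.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

BLOCK_SIZE = 64 * MIB  # from the slide: 64MB block size
REPLICATION = 3        # from the slide: replication factor 3


def hdfs_blocks(file_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks needed for one file (ceiling division)."""
    return -(-file_bytes // block_size)


def nominal_replicated_bytes(file_bytes: int, replication: int = REPLICATION) -> int:
    """Raw bytes stored if every block gets the full replica count."""
    return file_bytes * replication


for size_gb in (10, 20, 50):  # input sizes used in the experiments
    size = size_gb * GIB
    print(size_gb, "GB:", hdfs_blocks(size), "blocks,",
          nominal_replicated_bytes(size) // GIB, "GB replicated")

# A 100MB file, as used in the TestDFSIOEnh runs, spans two blocks.
print(hdfs_blocks(100 * MIB))
```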
  • 8. (Known) Limitations
       • Single physical server (no physical network)
       • VMware ESXi server hypervisor
       • Testing with default configurations (Serengeti & Hadoop)
       • Time constraints:
         – Input data sizes: 10/20/50GB
         – 3 test repetitions
  • 9. Step II: Comparison Factors
       The number of utilized VMs in the compared clusters should be equal.
       • Each additional VM increases the hypervisor overhead (reported in [2], [5], [6])
       • Utilizing more VMs may improve the overall system performance [2]
       The utilized hardware resources in a cluster should be equal.
       [2] Buell, J.: Virtualized Hadoop Performance with VMware vSphere 5.1. Tech. White Pap. VMware Inc. (2013).
       [5] Li, J. et al.: Performance Overhead Among Three Hypervisors: An Experimental Study using Hadoop Benchmarks. Big Data (BigData Congress), 2013.
       [6] Ye, K. et al.: vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration. Cluster Computing Workshops (CLUSTER WORKSHOPS), 2012.
  • 10. Step II: Comparison Standard1/Data-Compute1
       Standard Cluster vs. Data-Compute Cluster with:
       1) equal utilized hardware resources
       2) equal number of utilized VMs
       ∆ = difference in performance
  • 11. Step II: Comparison Standard2/Data-Compute3
       Standard Cluster vs. Data-Compute Cluster with:
       1) equal utilized hardware resources
       2) equal number of utilized VMs
       ∆ = difference in performance
  • 12. Step II: Comparison Data-Compute1/2/3
       Data-Compute Cluster vs. Data-Compute Cluster with:
       1) equal utilized hardware resources
       ∆ = difference in performance
  • 13. Step II: All Cluster Configurations
  • 14. Step III & IV: CPU Bound - WordCount
       • Configuration: 4 map/1 reduce tasks, 10/20/50 GB input data sizes
       • Times normalized with respect to baseline Standard1
       • 38-47% better performance for Data-Compute cluster
       • Data-Compute1 (2CW & 1DW) ≈ Data-Compute2 (2CW & 2DW)

       Equal Number of VMs
       Data Size (GB)   Diff. (%) Standard1/Data-Comp1 (3 VMs)   Diff. (%) Standard2/Data-Comp3 (6 VMs)
       10               -40                                      -38
       20               -41                                      -42
       50               -43                                      -47

       Chart: Ratio to Standard1 by Data Size (GB)
       Cluster       10     20     50
       Standard1     1.00   1.00   1.00
       Standard2     1.75   1.74   1.74
       Data-Comp1    0.71   0.71   0.70
       Data-Comp2    0.71   0.71   0.70
       Data-Comp3    1.26   1.22   1.19
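The slides do not spell out how the normalized times and the Diff. (%) columns are computed. The sketch below shows one reading that is consistent with the charted values: times are divided by the Standard1 time, and the difference between two clusters is taken relative to the second cluster named in the column header. This formula is inferred from the numbers, not stated by the authors, and because the charted ratios are rounded it matches the table only to within about one percentage point. The raw timings passed to `normalize` are hypothetical.

```python
def normalize(times: dict, baseline: str) -> dict:
    """Divide every cluster's time by the baseline cluster's time."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


def diff_percent(first: float, second: float) -> float:
    """Inferred formula: (second - first) / second * 100.

    A negative value means the second cluster needed less time.
    """
    return (second - first) / second * 100.0


# Hypothetical raw timings in seconds, only to show the normalization step.
print(normalize({"Standard1": 200.0, "Data-Comp1": 142.0}, "Standard1"))

# WordCount ratios to Standard1, read off the chart (10, 20, 50 GB).
wordcount = {
    "Standard1": [1.00, 1.00, 1.00],
    "Standard2": [1.75, 1.74, 1.74],
    "Data-Comp1": [0.71, 0.71, 0.70],
    "Data-Comp3": [1.26, 1.22, 1.19],
}

for i, size_gb in enumerate((10, 20, 50)):
    d1 = diff_percent(wordcount["Standard1"][i], wordcount["Data-Comp1"][i])
    d2 = diff_percent(wordcount["Standard2"][i], wordcount["Data-Comp3"][i])
    print(f"{size_gb} GB: Standard1/Data-Comp1 {d1:.1f}%, Standard2/Data-Comp3 {d2:.1f}%")
```

Dividing by the second cluster's time rather than the first explains why differences below -100% can appear in the read benchmark tables.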
  • 15. Step III & IV: Read I/O Bound – TestDFSIOEnh (1)
       • Configuration: 100MB file size, 10/20/50 GB input data sizes
       • Read times normalized with respect to baseline Standard1
       • Standard1 (Standard Cluster) performs best

       Equal Number of VMs
       Data Size (GB)   Diff. (%) Standard1/Data-Comp1 (3 VMs)   Diff. (%) Standard2/Data-Comp3 (6 VMs)
       10               68                                       -18
       20               71                                       -30
       50               73                                       -46

       Chart: Ratio to Standard1 by Data Size (GB)
       Cluster       10     20     50
       Standard1     1.00   1.00   1.00
       Standard2     1.83   1.93   1.87
       Data-Comp1    3.08   3.39   3.66
       Data-Comp2    1.51   1.71   1.78
       Data-Comp3    1.55   1.48   1.28
  • 16. Step III & IV: Read I/O Bound – TestDFSIOEnh (2)
       • Configuration: 100MB file size, 10/20/50 GB input data sizes
       • Read times normalized with respect to baseline Standard1
       • Data-Comp1 (2CW & 1DW) > Data-Comp2 (2CW & 2DW) > Data-Comp3 (3CW & 3DW)
         → More data nodes improve read performance in a Data-Compute cluster.

       Different Number of VMs
       Data Size (GB)   Diff. (%) Data-Comp1/2 (3 VMs vs. 4 VMs)   Diff. (%) Data-Comp2/3 (4 VMs vs. 6 VMs)
       10               -104                                       3
       20               -99                                        -15
       50               -106                                       -39

       Chart: same ratios to Standard1 as on slide 15.
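The claim that more data nodes improve read performance can be checked directly against the charted read ratios. The sketch below orders the three Data-Compute configurations by their number of data workers and tests whether the normalized read time falls at every step. The data layout and function name are illustrative choices, and the values are the rounded ratios from the chart.

```python
# Read-time ratios to Standard1 from the chart, keyed by input size in GB.
read_ratio = {
    "Data-Comp1": {"data_workers": 1, "ratio": {10: 3.08, 20: 3.39, 50: 3.66}},
    "Data-Comp2": {"data_workers": 2, "ratio": {10: 1.51, 20: 1.71, 50: 1.78}},
    "Data-Comp3": {"data_workers": 3, "ratio": {10: 1.55, 20: 1.48, 50: 1.28}},
}


def improves_with_more_data_workers(size_gb: int) -> bool:
    """True if the read ratio strictly decreases as data workers are added."""
    ordered = sorted(read_ratio.values(), key=lambda c: c["data_workers"])
    ratios = [c["ratio"][size_gb] for c in ordered]
    return all(a > b for a, b in zip(ratios, ratios[1:]))


for size_gb in (10, 20, 50):
    print(size_gb, "GB:", improves_with_more_data_workers(size_gb))
```

With these values the trend holds at 20 and 50 GB. At 10 GB, Data-Comp3 is slightly slower than Data-Comp2, which corresponds to the positive 3% entry in the table, so the benefit of the third data worker shows mainly at the larger input sizes.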
  • 17. Step III & IV: Write I/O Bound – TestDFSIOEnh (1)
       • Configuration: 100MB file size, 10/20/50 GB input data sizes
       • Write times normalized with respect to baseline Standard1
       • Data-Compute cluster (Data-Comp1, Data-Comp3) performs better

       Equal Number of VMs
       Data Size (GB)   Diff. (%) Standard1/Data-Comp1 (3 VMs)   Diff. (%) Standard2/Data-Comp3 (6 VMs)
       10               -10                                      4
       20               -21                                      -14
       50               -24                                      -1

       Chart: Ratio to Standard1 by Data Size (GB)
       Cluster       10     20     50
       Standard1     1.00   1.00   1.00
       Standard2     0.84   1.08   1.00
       Data-Comp1    0.91   0.83   0.81
       Data-Comp2    0.73   0.86   0.95
       Data-Comp3    0.87   0.95   0.99
  • 18. Step III & IV: Write I/O Bound – TestDFSIOEnh (2)
       • Configuration: 100MB file size, 10/20/50 GB input data sizes
       • Write times normalized with respect to baseline Standard1
       • Data-Comp1 (2CW & 1DW) < Data-Comp3 (3CW & 3DW)
         → Having 2 extra Data Worker nodes increases the write overhead up to 19% in a Data-Compute cluster.
       • Data-Comp3 (6 VMs) outperforms Standard1 (3 VMs)

       Different Number of VMs
       Data Size (GB)   Diff. (%) Data-Comp1/3 (3 VMs vs. 6 VMs)   Diff. (%) Standard1/Data-Comp3 (3 VMs vs. 6 VMs)
       10               -4                                         -15
       20               13                                         -6
       50               19                                         -1

       Chart: same ratios to Standard1 as on slide 17.
  • 19. Summary of Results
       • Compute-intensive (i.e. CPU bound) workloads are suitable for Data-Compute clusters (up to 47% faster).
       • Read-intensive (i.e. read I/O bound) workloads are suitable for Standard clusters.
         – For Data-Compute clusters, adding more data nodes improves the read performance (up to 39% better, e.g. Data-Compute2/Data-Compute3).
       • Write-intensive (i.e. write I/O bound) workloads are suitable for Data-Compute clusters (up to 15% faster, e.g. Standard1/Data-Compute3).
         – A lower number of data nodes results in better write performance.
  • 20. Lessons Learned
       • Factors influencing cluster performance*:
         – Overall number of virtual nodes (VMs) in a cluster
         – Choice of cluster type (Standard or Data-Compute Hadoop cluster)
         – Number of nodes of each type (compute and data nodes) in a Data-Compute cluster
       * Note: subject to the known limitations (slide 8).
  • 21. Next Steps
       • Repeat the experiments on a virtualized multi-node cluster
       • Evaluate virtualized performance with other workloads
       • Run experiments with larger data sets
       • Repeat the experiments using other virtualization platforms (e.g. OpenStack)
  • 22. Thank you!
       Questions & Feedback are very welcome!
       Contact info: Todor Ivanov, todor@dbis.cs.uni-frankfurt.de
       http://www.bigdata.uni-frankfurt.de/