SlideShare a Scribd company logo
1 of 6
Download to read offline
HadoopDistributions:
EvaluatingCloudera,
Hortonworks,andMapR
inMicro-benchmarksand
Real-worldApplications
VladimirStarostenkov,SeniorR&DDeveloper,
KirillGrigorchuk,HeadofR&DDepartment
© 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be
reproduced or transmitted in any form or by any means without written permission from the author.
2
+1 650 395-7002 engineering@altoros.com
www.altoros.com | twitter.com/altoros
© 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in
any form or by any means without written permission from the author.
Table of Contents
1. Introduction ................................................................................................................................................4
2. Tools, Libraries, and Methods....................................................................................................................5
2.1 Micro benchmarks................................................................................................................................................................. 6
2.1.1 WordCount................................................................................................................................................................ 6
2.1.2 Sort............................................................................................................................................................................... 7
2.1.3 TeraSort...................................................................................................................................................................... 7
2.1.4 Distributed File System I/O.................................................................................................................................. 7
2.2 Real-world applications....................................................................................................................................................... 7
2.2.1 PageRank................................................................................................................................................................... 8
2.2.2 Bayes ........................................................................................................................................................................... 8
3. What Makes This Research Unique?..........................................................................................................9
3.1 Testing environment............................................................................................................................................................ 9
4. Results........................................................................................................................................................11
4.1 Overall cluster performance.............................................................................................................................................11
4.2 Hortonworks Data Platform (HDP).................................................................................................................................12
4.3 Cloudera’s Distribution Including Apache Hadoop (CDH) ....................................................................................14
4.4 MapR ........................................................................................................................................................................................15
5. Conclusion.................................................................................................................................................18
Appendix A: Main Features and Their Comparison Across Distributions................................................19
Appendix B: Overview of the Distributions................................................................................................21
1. MapR...........................................................................................................................................................................................21
2. Cloudera....................................................................................................................................................................................22
3. Hortonworks............................................................................................................................................................................23
Appendix C: Performance Results for Each Benchmarking Test...............................................................24
1. Real-world applications........................................................................................................................................................24
1.1 Bayes.............................................................................................................................................................................24
1.2 PageRank.....................................................................................................................................................................25
3
+1 650 395-7002 engineering@altoros.com
www.altoros.com | twitter.com/altoros
© 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in
any form or by any means without written permission from the author.
2. Micro benchmarks .................................................................................................................................................................26
2.1 Distributed File System I/O (DFSIO) ...................................................................................................................26
2.2 Hive aggregation......................................................................................................................................................27
2.3 Sort................................................................................................................................................................................28
2.4 TeraSort........................................................................................................................................................................29
2.5 WordCount.................................................................................................................................................................30
Appendix D: Performance Results for Each Test Sectioned by Distribution............................................32
1. MapR...........................................................................................................................................................................................32
2. Hortonworks............................................................................................................................................................................42
3. Cloudera....................................................................................................................................................................................52
Appendix E: Disk Benchmarking .................................................................................................................62
1. DFSIO (read) benchmark......................................................................................................................................................62
2. DFSIO (write) benchmark ....................................................................................................................................................63
Appendix F: Parameters used to optimize Hadoop Jobs...........................................................................64
4


5



6
+1 (650) 265-2266 engineering@altoros.com
www.altoros.com | twitter.com/altoros
© 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in
any form or by any means without written permission from the author.
4. Results
4.1 Overall cluster performance
Throughput in bytes per second was measured for a cluster that consisted of 4, 8, 12, and 16
DataNodes (Figures 2-7). The throughput of 8-, 12-, and 16-node clusters was compared
against the throughput of a 4-node cluster in each benchmark test. The speed of data
processing of 8-, 12-, and 16-node clusters was divided by the throughput of a 4-node
cluster. These values demonstrate cluster scalability in each of the tests. The higher the value
is, the better.
Although data consistency may be guaranteed by a hosting/cloud provider, to employ the
advantages of data locality, Hadoop requires using its internal data replication.
The overall performance results of the MapR distribution in all benchmark tests
Cluster performance scales linearly under the WordCount workload. It behaves the same in
running PageRank until the cluster reaches an I/O bottleneck. The results of other
benchmarks strongly correlated with DFSIO.
To download the full version of this 65-page research (with 83 performance
diagrams for each distribution measured under 7 workloads), please visit:
www.altoros.com/hadoop_benchmark
0
1
2
3
4
5
BAYES DFSIOE HIVEAGGR PAGERANK SORT TERASORT WORDCOUNT
MapR
Overall cluster performance
4 8 12 16

More Related Content

Viewers also liked

How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB MongoDB
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDBMongoDB
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemMahabubur Rahaman
 
Big data advance topics - part 2.pptx
Big data   advance topics - part 2.pptxBig data   advance topics - part 2.pptx
Big data advance topics - part 2.pptxMoldovan Radu Adrian
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
빅데이터 처리에 있어서 이미지 비디오 데이터의 분석
빅데이터 처리에 있어서 이미지 비디오 데이터의 분석빅데이터 처리에 있어서 이미지 비디오 데이터의 분석
빅데이터 처리에 있어서 이미지 비디오 데이터의 분석JeongHeon Lee
 
마인즈랩소개자료 20150616
마인즈랩소개자료 20150616마인즈랩소개자료 20150616
마인즈랩소개자료 20150616Taejoon Yoo
 
i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석
i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석
i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석Taejoon Yoo
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXBMC Software
 
비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례
비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례
비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례JeongHeon Lee
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseHortonworks
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 

Viewers also liked (17)

How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB
 
How Retail Banks Use MongoDB
How Retail Banks Use MongoDBHow Retail Banks Use MongoDB
How Retail Banks Use MongoDB
 
Introduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop EcosystemIntroduction to Apache Hadoop Ecosystem
Introduction to Apache Hadoop Ecosystem
 
Big data advance topics - part 2.pptx
Big data   advance topics - part 2.pptxBig data   advance topics - part 2.pptx
Big data advance topics - part 2.pptx
 
Pivotal hawq internals
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internals
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
빅데이터 처리에 있어서 이미지 비디오 데이터의 분석
빅데이터 처리에 있어서 이미지 비디오 데이터의 분석빅데이터 처리에 있어서 이미지 비디오 데이터의 분석
빅데이터 처리에 있어서 이미지 비디오 데이터의 분석
 
마인즈랩소개자료 20150616
마인즈랩소개자료 20150616마인즈랩소개자료 20150616
마인즈랩소개자료 20150616
 
i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석
i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석
i-VOC (Voice of the Customer Big Data Analytics Solution) 고객의소리 분석
 
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAXHow Big Data and Hadoop Integrated into BMC ControlM at CARFAX
How Big Data and Hadoop Integrated into BMC ControlM at CARFAX
 
비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례
비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례
비정형 데이터를 기반으로 한 빅데이터 필요기술 및 적용사례
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
Enabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical EnterpriseEnabling the Real Time Analytical Enterprise
Enabling the Real Time Analytical Enterprise
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 

More from Altoros

Maturing with Kubernetes
Maturing with KubernetesMaturing with Kubernetes
Maturing with KubernetesAltoros
 
Kubernetes Platform Readiness and Maturity Assessment
Kubernetes Platform Readiness and Maturity AssessmentKubernetes Platform Readiness and Maturity Assessment
Kubernetes Platform Readiness and Maturity AssessmentAltoros
 
Journey Through Four Stages of Kubernetes Deployment Maturity
Journey Through Four Stages of Kubernetes Deployment MaturityJourney Through Four Stages of Kubernetes Deployment Maturity
Journey Through Four Stages of Kubernetes Deployment MaturityAltoros
 
SGX: Improving Privacy, Security, and Trust Across Blockchain Networks
SGX: Improving Privacy, Security, and Trust Across Blockchain NetworksSGX: Improving Privacy, Security, and Trust Across Blockchain Networks
SGX: Improving Privacy, Security, and Trust Across Blockchain NetworksAltoros
 
Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...
Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...
Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...Altoros
 
A Zero-Knowledge Proof: Improving Privacy on a Blockchain
A Zero-Knowledge Proof:  Improving Privacy on a BlockchainA Zero-Knowledge Proof:  Improving Privacy on a Blockchain
A Zero-Knowledge Proof: Improving Privacy on a BlockchainAltoros
 
Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.Altoros
 
Containers and Kubernetes
Containers and KubernetesContainers and Kubernetes
Containers and KubernetesAltoros
 
Distributed Ledger Technology for Over-the-Counter Trading
Distributed Ledger Technology for Over-the-Counter TradingDistributed Ledger Technology for Over-the-Counter Trading
Distributed Ledger Technology for Over-the-Counter TradingAltoros
 
5-Step Deployment of Hyperledger Fabric on Multiple Nodes
5-Step Deployment of Hyperledger Fabric on Multiple Nodes5-Step Deployment of Hyperledger Fabric on Multiple Nodes
5-Step Deployment of Hyperledger Fabric on Multiple NodesAltoros
 
Deploying Kubernetes on GCP with Kubespray
Deploying Kubernetes on GCP with KubesprayDeploying Kubernetes on GCP with Kubespray
Deploying Kubernetes on GCP with KubesprayAltoros
 
UAA for Kubernetes
UAA for KubernetesUAA for Kubernetes
UAA for KubernetesAltoros
 
Troubleshooting .NET Applications on Cloud Foundry
Troubleshooting .NET Applications on Cloud FoundryTroubleshooting .NET Applications on Cloud Foundry
Troubleshooting .NET Applications on Cloud FoundryAltoros
 
Continuous Integration and Deployment with Jenkins for PCF
Continuous Integration and Deployment with Jenkins for PCFContinuous Integration and Deployment with Jenkins for PCF
Continuous Integration and Deployment with Jenkins for PCFAltoros
 
How to Never Leave Your Deployment Unattended
How to Never Leave Your Deployment UnattendedHow to Never Leave Your Deployment Unattended
How to Never Leave Your Deployment UnattendedAltoros
 
Cloud Foundry Monitoring How-To: Collecting Metrics and Logs
Cloud Foundry Monitoring How-To: Collecting Metrics and LogsCloud Foundry Monitoring How-To: Collecting Metrics and Logs
Cloud Foundry Monitoring How-To: Collecting Metrics and LogsAltoros
 
Smart Baggage Tracking: End-to-End Sensor-Based Solution
Smart Baggage Tracking: End-to-End Sensor-Based SolutionSmart Baggage Tracking: End-to-End Sensor-Based Solution
Smart Baggage Tracking: End-to-End Sensor-Based SolutionAltoros
 
Navigating the Ecosystem of Pivotal Cloud Foundry Tiles
Navigating the Ecosystem of Pivotal Cloud Foundry TilesNavigating the Ecosystem of Pivotal Cloud Foundry Tiles
Navigating the Ecosystem of Pivotal Cloud Foundry TilesAltoros
 
AI as a Catalyst for IoT
AI as a Catalyst for IoTAI as a Catalyst for IoT
AI as a Catalyst for IoTAltoros
 
Over-Engineering: Causes, Symptoms, and Treatment
Over-Engineering: Causes, Symptoms, and TreatmentOver-Engineering: Causes, Symptoms, and Treatment
Over-Engineering: Causes, Symptoms, and TreatmentAltoros
 

More from Altoros (20)

Maturing with Kubernetes
Maturing with KubernetesMaturing with Kubernetes
Maturing with Kubernetes
 
Kubernetes Platform Readiness and Maturity Assessment
Kubernetes Platform Readiness and Maturity AssessmentKubernetes Platform Readiness and Maturity Assessment
Kubernetes Platform Readiness and Maturity Assessment
 
Journey Through Four Stages of Kubernetes Deployment Maturity
Journey Through Four Stages of Kubernetes Deployment MaturityJourney Through Four Stages of Kubernetes Deployment Maturity
Journey Through Four Stages of Kubernetes Deployment Maturity
 
SGX: Improving Privacy, Security, and Trust Across Blockchain Networks
SGX: Improving Privacy, Security, and Trust Across Blockchain NetworksSGX: Improving Privacy, Security, and Trust Across Blockchain Networks
SGX: Improving Privacy, Security, and Trust Across Blockchain Networks
 
Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...
Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...
Using the Cloud Foundry and Kubernetes Stack as a Part of a Blockchain CI/CD ...
 
A Zero-Knowledge Proof: Improving Privacy on a Blockchain
A Zero-Knowledge Proof:  Improving Privacy on a BlockchainA Zero-Knowledge Proof:  Improving Privacy on a Blockchain
A Zero-Knowledge Proof: Improving Privacy on a Blockchain
 
Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.Crap. Your Big Data Kitchen Is Broken.
Crap. Your Big Data Kitchen Is Broken.
 
Containers and Kubernetes
Containers and KubernetesContainers and Kubernetes
Containers and Kubernetes
 
Distributed Ledger Technology for Over-the-Counter Trading
Distributed Ledger Technology for Over-the-Counter TradingDistributed Ledger Technology for Over-the-Counter Trading
Distributed Ledger Technology for Over-the-Counter Trading
 
5-Step Deployment of Hyperledger Fabric on Multiple Nodes
5-Step Deployment of Hyperledger Fabric on Multiple Nodes5-Step Deployment of Hyperledger Fabric on Multiple Nodes
5-Step Deployment of Hyperledger Fabric on Multiple Nodes
 
Deploying Kubernetes on GCP with Kubespray
Deploying Kubernetes on GCP with KubesprayDeploying Kubernetes on GCP with Kubespray
Deploying Kubernetes on GCP with Kubespray
 
UAA for Kubernetes
UAA for KubernetesUAA for Kubernetes
UAA for Kubernetes
 
Troubleshooting .NET Applications on Cloud Foundry
Troubleshooting .NET Applications on Cloud FoundryTroubleshooting .NET Applications on Cloud Foundry
Troubleshooting .NET Applications on Cloud Foundry
 
Continuous Integration and Deployment with Jenkins for PCF
Continuous Integration and Deployment with Jenkins for PCFContinuous Integration and Deployment with Jenkins for PCF
Continuous Integration and Deployment with Jenkins for PCF
 
How to Never Leave Your Deployment Unattended
How to Never Leave Your Deployment UnattendedHow to Never Leave Your Deployment Unattended
How to Never Leave Your Deployment Unattended
 
Cloud Foundry Monitoring How-To: Collecting Metrics and Logs
Cloud Foundry Monitoring How-To: Collecting Metrics and LogsCloud Foundry Monitoring How-To: Collecting Metrics and Logs
Cloud Foundry Monitoring How-To: Collecting Metrics and Logs
 
Smart Baggage Tracking: End-to-End Sensor-Based Solution
Smart Baggage Tracking: End-to-End Sensor-Based SolutionSmart Baggage Tracking: End-to-End Sensor-Based Solution
Smart Baggage Tracking: End-to-End Sensor-Based Solution
 
Navigating the Ecosystem of Pivotal Cloud Foundry Tiles
Navigating the Ecosystem of Pivotal Cloud Foundry TilesNavigating the Ecosystem of Pivotal Cloud Foundry Tiles
Navigating the Ecosystem of Pivotal Cloud Foundry Tiles
 
AI as a Catalyst for IoT
AI as a Catalyst for IoTAI as a Catalyst for IoT
AI as a Catalyst for IoT
 
Over-Engineering: Causes, Symptoms, and Treatment
Over-Engineering: Causes, Symptoms, and TreatmentOver-Engineering: Causes, Symptoms, and Treatment
Over-Engineering: Causes, Symptoms, and Treatment
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Recently uploaded (20)

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Hadoop Distributions: Evaluating Cloudera, Hortonworks, and MapR in Micro-benchmarks and Real-world Applications

  • 1. HadoopDistributions: EvaluatingCloudera, Hortonworks,andMapR inMicro-benchmarksand Real-worldApplications VladimirStarostenkov,SeniorR&DDeveloper, KirillGrigorchuk,HeadofR&DDepartment © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author.
  • 2. 2 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. Table of Contents 1. Introduction ................................................................................................................................................4 2. Tools, Libraries, and Methods....................................................................................................................5 2.1 Micro benchmarks................................................................................................................................................................. 6 2.1.1 WordCount................................................................................................................................................................ 6 2.1.2 Sort............................................................................................................................................................................... 7 2.1.3 TeraSort...................................................................................................................................................................... 7 2.1.4 Distributed File System I/O.................................................................................................................................. 7 2.2 Real-world applications....................................................................................................................................................... 7 2.2.1 PageRank................................................................................................................................................................... 8 2.2.2 Bayes ........................................................................................................................................................................... 8 3. What Makes This Research Unique?..........................................................................................................9 3.1 Testing environment............................................................................................................................................................ 9 4. Results........................................................................................................................................................11 4.1 Overall cluster performance.............................................................................................................................................11 4.2 Hortonworks Data Platform (HDP).................................................................................................................................12 4.3 Cloudera’s Distribution Including Apache Hadoop (CDH) ....................................................................................14 4.4 MapR ........................................................................................................................................................................................15 5. Conclusion.................................................................................................................................................18 Appendix A: Main Features and Their Comparison Across Distributions................................................19 Appendix B: Overview of the Distributions................................................................................................21 1. MapR...........................................................................................................................................................................................21 2. Cloudera....................................................................................................................................................................................22 3. Hortonworks............................................................................................................................................................................23 Appendix C: Performance Results for Each Benchmarking Test...............................................................24 1. Real-world applications........................................................................................................................................................24 1.1 Bayes.............................................................................................................................................................................24 1.2 PageRank.....................................................................................................................................................................25
  • 3. 3 +1 650 395-7002 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. 2. Micro benchmarks .................................................................................................................................................................26 2.1 Distributed File System I/O (DFSIO) ...................................................................................................................26 2.2 Hive aggregation......................................................................................................................................................27 2.3 Sort................................................................................................................................................................................28 2.4 TeraSort........................................................................................................................................................................29 2.5 WordCount.................................................................................................................................................................30 Appendix D: Performance Results for Each Test Sectioned by Distribution............................................32 1. MapR...........................................................................................................................................................................................32 2. Hortonworks............................................................................................................................................................................42 3. Cloudera....................................................................................................................................................................................52 Appendix E: Disk Benchmarking .................................................................................................................62 1. DFSIO (read) benchmark......................................................................................................................................................62 2. DFSIO (write) benchmark ....................................................................................................................................................63 Appendix F: Parameters used to optimize Hadoop Jobs...........................................................................64
  • 6. 6 +1 (650) 265-2266 engineering@altoros.com www.altoros.com | twitter.com/altoros © 2013 Altoros Systems, Inc. Any unauthorized republishing, rewriting or use of this material is prohibited. No part of this resource may be reproduced or transmitted in any form or by any means without written permission from the author. 4. Results 4.1 Overall cluster performance Throughput in bytes per second was measured for a cluster that consisted of 4, 8, 12, and 16 DataNodes (Figures 2-7). The throughput of 8-, 12-, and 16-node clusters was compared against the throughput of a 4-node cluster in each benchmark test. The speed of data processing of 8-, 12-, and 16-node clusters was divided by the throughput of a 4-node cluster. These values demonstrate cluster scalability in each of the tests. The higher the value is, the better. Although data consistency may be guaranteed by a hosting/cloud provider, to employ the advantages of data locality, Hadoop requires using its internal data replication. The overall performance results of the MapR distribution in all benchmark tests Cluster performance scales linearly under the WordCount workload. It behaves the same in running PageRank until the cluster reaches an I/O bottleneck. The results of other benchmarks strongly correlated with DFSIO. To download the full version of this 65-page research (with 83 performance diagrams for each distribution measured under 7 workloads), please visit: www.altoros.com/hadoop_benchmark 0 1 2 3 4 5 BAYES DFSIOE HIVEAGGR PAGERANK SORT TERASORT WORDCOUNT MapR Overall cluster performance 4 8 12 16