A Discussion of Hadoop Use Cases & 
Runtime Environments 
Tom Phelan, Chief Architect of BlueData 
Los Angeles Hadoop Users Group 
Sept 25, 2014
… or when should I virtualize my Hadoop cluster?
First - Some Definitions
Physical Hadoop Cluster
• AKA “bare metal” installation
• The Hadoop distribution is installed as an application on top of the operating system.
• A set of physical servers runs the various Hadoop services, forming the Hadoop cluster.
  – File system (HDFS, NameNode, etc.)
  – Processing framework (JobTracker, etc.)
• The original design goal was reduced cost, not necessarily improved performance.
Physical Hadoop Cluster
[Diagram: one Controller server (NameNode, JobTracker) and two Worker servers (DataNode, TaskTracker each); every server has three local disks.]
Virtual Hadoop Cluster
• The Hadoop distribution is installed as an application running within a collection of virtual machines.
• A virtual machine is software that presents an abstraction identical to the underlying hardware. In general, software running within the VM cannot tell the difference from a physical server.
• If the collection of virtual machines is spread across more than one physical server, it is typically referred to as a cloud. The cloud can be either public or private.
• The virtualization technology used can be one of:
  – Type I hypervisor, e.g. VMware ESX
  – Type II hypervisor, e.g. KVM
  – Linux Containers, e.g. LXC
Virtual Hadoop Cluster – Public Cloud
• IaaS – infrastructure as a service
• The type of virtualization is unknown
• Typically the physical hosts are not located within the enterprise data center
• Data security can be an issue
• Can be expensive
Examples: AWS, Azure
Virtual Hadoop Cluster – Private Cloud
• IaaS – infrastructure as a service
• The type of virtualization is known but not specified
• Typically the physical hosts are located within the enterprise data center
• Data security is enforced by the enterprise
• Can be expensive
Examples: VMware vSphere, OpenStack, CloudStack
Virtual Hadoop Cluster – Private Cloud – Hypervisor
• IaaS – infrastructure as a service
• The type of virtualization is a Type I or Type II hypervisor, generically referred to as “hypervisor” or “virtual machine”
• Typically the physical hosts are located within the enterprise data center
• Data security is enforced by the enterprise
• Strong fault isolation: a fault in the VM cannot cause the physical cluster to crash
• Strong resource partitioning
• Moderate amount of “overhead” to implement the virtualization
• Can be expensive
Examples: VMware vSphere, OpenStack
Virtual Hadoop Cluster – Private Cloud – Containers
• IaaS – infrastructure as a service
• The type of virtualization is Linux Containers (LXC)
• Typically the physical hosts are located within the enterprise data center
• Data security is enforced by the enterprise
• Currently weak fault isolation: a fault in a container can cause the physical cluster to crash
• Moderate resource partitioning
• Low amount of “overhead” to implement the virtualization
• Can be expensive
Examples: Docker, Mesos, CoreOS, LXC
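The hypervisor and container slides differ on three attributes: fault isolation, resource partitioning, and overhead. A minimal Python sketch of that comparison follows. The names `Profile`, `PROFILES`, and `candidates` are hypothetical, and the ratings are copied from the slides rather than measured.

```python
from dataclasses import dataclass
from typing import List

# Ordinal scale for the qualitative ratings used on the slides.
LEVELS = {"weak": 0, "low": 0, "moderate": 1, "strong": 2}


@dataclass(frozen=True)
class Profile:
    fault_isolation: str        # "weak", "moderate", or "strong"
    resource_partitioning: str  # "weak", "moderate", or "strong"
    overhead: str               # "low" or "moderate"


# Ratings as stated on the slides, not benchmark results.
PROFILES = {
    "hypervisor": Profile("strong", "strong", "moderate"),
    "container": Profile("weak", "moderate", "low"),
}


def candidates(min_fault_isolation: str) -> List[str]:
    """Return technologies meeting the isolation floor, lowest overhead first."""
    floor = LEVELS[min_fault_isolation]
    ok = [name for name, p in PROFILES.items()
          if LEVELS[p.fault_isolation] >= floor]
    return sorted(ok, key=lambda name: LEVELS[PROFILES[name].overhead])


print(candidates("weak"))    # both qualify; the lower-overhead option comes first
print(candidates("strong"))  # only the hypervisor meets a strong isolation floor
```

The sketch treats overhead as the tie-breaker once the isolation requirement is met, which is one reasonable reading of the trade-off the slides describe.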
Virtual Hadoop Cluster - Hypervisor
[Diagram: a Controller VM (NameNode, JobTracker) and two Worker VMs (DataNode, TaskTracker each), every VM presenting a Hadoop server with three vDisks; the VMs are hosted on four cloud servers, each with three physical disks.]
Virtual Hadoop Cluster - Containers
[Diagram: a Controller container (JobTracker) and two Worker containers (TaskTracker each), plus a NameNode and three DataNodes, on four cloud servers, each with three disks.]
Virtual Hadoop Cluster - Hypervisors
[Diagram: four cloud servers, each with three disks.]
Virtual Hadoop Cluster – Private Cloud – Data Paravirtualization
• Paravirtualization means that the abstraction the virtualization software provides is similar, but not identical, to the underlying hardware.
• The differences are designed to reduce the virtualization “overhead” by taking advantage of some knowledge about the tasks running in the virtual machine.
Examples: BlueData
Virtual Hadoop Cluster – Paravirtualization
[Diagram: a Controller VM (NameNode, JobTracker) and two Worker VMs (DataNode, TaskTracker each), every VM presenting a Hadoop server with three vDisks, hosted on four cloud servers with three disks each; three Data Connection elements link to storage labeled NFS, HDFS, and GlusterFS.]
In which situations should an enterprise run 
their Hadoop jobs in a virtual or physical 
environment?
Evaluation based on: 
Faster … 
– Deployment 
– Runtime 
Easier … 
– Deployment 
– Management 
Cheaper … 
– Hardware costs 
– Management costs
Questions not to ask 
• How fast does the job need to run?
• How much does the cluster cost?
• How easy is it to use?
Any application can be run with the needed speed in either a 
virtual or physical environment if enough money is spent. 
Any tool is easy to use once you are familiar with it. 
Other attributes indicate if the best solution is with physical or 
virtual clusters.
Answers 
There are multiple clusters and each is lightly used. 
– Virtual cluster 
There is one cluster, it runs a single Hadoop query job. It 
runs 7 x 24 and demands instant response. 
– Physical cluster
Answers 
Test & Dev environment where Hadoop clusters need to be 
built quickly and have short lifespan. Each developer gets 
their own cluster. No security concerns. 
– Virtual cluster - LXC 
An environment with multiple Hadoop applications constantly 
running and requiring access to a common data set. No 
expected change in applications or load. 
– Physical cluster
Answers 
IAAS environment with multiple external customers each with 
different QoS agreements, Hadoop distros, and data 
security needs. 
– Virtual cluster - Hypervisor
Those scenarios are too easy!
What the obvious answers tell us:
• Situations that require many distinct Hadoop clusters, clusters that need frequent provisioning, or clusters with a relatively short lifespan are well suited to virtualized Hadoop.
• Flexibility and speed of cluster creation are critical.
• Situations that require few distinct Hadoop clusters with long lifespans and static configurations are well suited to bare-metal Hadoop.
• No reason to pay the virtualization “tax” in exchange for flexibility.
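The rule of thumb on this slide can be sketched as a small Python function. This is only an illustration: the function name `recommend` and its boolean parameters are hypothetical, and real decisions also depend on the fault isolation, security, and data source questions raised elsewhere in the talk.

```python
def recommend(many_clusters: bool, frequent_provisioning: bool,
              short_lifespan: bool, static_config: bool) -> str:
    """Rough physical-vs-virtual heuristic paraphrasing the slide."""
    # Any one of these needs makes flexibility worth the virtualization "tax".
    if many_clusters or frequent_provisioning or short_lifespan:
        return "virtual"
    # Few, long-lived clusters with a static configuration: bare metal.
    if static_config:
        return "physical"
    # The slide does not cover this combination.
    return "undecided"


# Test & dev: each developer gets a quickly built, short-lived cluster.
print(recommend(many_clusters=True, frequent_provisioning=True,
                short_lifespan=True, static_config=False))

# One cluster running a single query job 7 x 24 with no expected change.
print(recommend(many_clusters=False, frequent_provisioning=False,
                short_lifespan=False, static_config=True))
```

The two calls mirror the test-and-dev and single-query-job scenarios from the Answers slides and return "virtual" and "physical" respectively.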
Questions to ask
• How many clusters will be needed?
  – Over what time span?
• What is the life span of the clusters?
• Will the clusters have idle time?
• What are the fault isolation needs?
• What is the source of the big data?
Other questions to ask*
• Are multiple levels of job priority required?
• Are multiple levels of data security required?
• Is resource usage tracking/billing required?
* The implementation of these may differ between distributions of Hadoop, so the level of effort to implement them in virtual and physical environments may also differ.
Use Case I 
Large manufacturing company 
– Internal Customers 
Started out with one Hadoop cluster 
– Success! 
Soon everyone wanted one. 
– Many lightly used 
– Different configurations. 
– “Cluster Sprawl” 
Virtual Clusters 
Either hypervisor or LXC could be used
Use Case II 
Development group within a large tech company 
– Internal Customers 
Built out a physical Hadoop cluster. 
– No data security requirement 
– No expectation of growth in foreseeable future. 
Single Physical Cluster
Use Case III 
Large service company. 
– Internal and External Customers 
No use of Hadoop. 
Fault containment and data security required 
Small IT department tasked with all Hadoop support for the company. 
– No clear way to predict growth. 
Virtual Clusters - hypervisor
Use Case IV 
Large tech company 
– Internal Customers 
Constant stream of low priority jobs 
Bursty stream of very high priority, low latency, jobs 
No future demand growth. 
Very Hadoop savvy IT organization. 
Two Physical Clusters 
OR virtual clusters
Use Case V 
Startup online service company 
– Selling information gathered using unique Hadoop analytics 
External Customers 
Multiple data sources 
Customer data security requirements 
Rapid growth of customer base 
No in-house Hadoop expertise 
Virtual Clusters 
Could benefit from paravirtualization
Q & A
Contact 
Tom Phelan 
tap@bluedata.com
Virtual Hadoop Cluster
[Diagram: a Controller VM (NameNode, JobTracker) and two Worker VMs (DataNode, TaskTracker each), every VM presenting a Hadoop server with three vDisks; the VMs are hosted on four cloud servers, each with three physical disks.]

Why Virtualization is important by Tom Phelan of BlueData

  • 1. A Discussion of Hadoop Use Cases & Runtime Environments Tom Phelan, Chief Architect of BlueData Los Angeles Hadoop Users Group Sept 25, 2014
  • 2. … or when should I virtualize my Hadoop cluster?
  • 3. First - Some Definitions
  • 4. Physical Hadoop Cluster  AKA “bare metal” installation  The Hadoop distribution is installed as an application on top of the operating system.  A set of physical servers run the various Hadoop services, forming the Hadoop cluster. • File System (HDFS, NameNode etc) • Processing Framework (JobTracker etc)  Original design goal was reduced Cost and not necessarily improved performance.
  • 5. Physical Hadoop Cluster NameNode JobTracker Server Disk Disk Disk DataNode TaskTracker Server Disk Disk Disk DataNode TaskTracker Server Disk Disk Disk Controller Worker Worker
  • 6. Virtual Hadoop Cluster  The Hadoop distribution is installed as an application running within the context of a collection of virtual machines.  A virtual machine is software that presents an abstraction that is identical to the underlying hardware. In general, the software running within the VM cannot tell the difference from a physical server.  If the collection of virtual machines is spread across more than one physical server, it is typically referred to as a cloud. The cloud can be either public or private.  The type of virtualization technology used can be one of: • Type I Hypervisor, VMware ESX • Type II Hypervisor, KVM • Linux Containers, LXC
  • 7. Virtual Hadoop Cluster – Public Cloud  IaaS – infrastructure as a service  The type of virtualization is unknown  Typically the physical hosts are not located within the enterprise data center  Data security can be an issue  Can be expensive Examples: AWS, Azure
  • 8. Virtual Hadoop Cluster – Private Cloud  IaaS – infrastructure as a service  The type of virtualization is known but not specified  Typically the physical hosts are located within the enterprise data center  Data security enforced by the enterprise  Can be expensive Examples: VMware vSphere, OpenStack, CloudStack
  • 9. Virtual Hadoop Cluster – Private Cloud - Hypervisor  IaaS – infrastructure as a service  The type of virtualization is a Type I or Type II hypervisor, generically referred to as “hypervisor” or “virtual machine”  Typically the physical hosts are located within the enterprise data center  Data security enforced by the enterprise  Strong fault isolation - a fault in the VM cannot cause the physical cluster to crash  Strong resource partitioning  Moderate amount of “overhead” to implement the virtualization.  Can be expensive Examples: VMware vSphere, OpenStack
  • 10. Virtual Hadoop Cluster – Private Cloud - Containers  IaaS – infrastructure as a service  The type of virtualization is Linux Containers, LXC  Typically the physical hosts are located within the enterprise data center  Data security enforced by the enterprise  Currently weak fault isolation - a fault in the container can cause the physical cluster to crash  Moderate resource partitioning  Low amount of “overhead” to implement the virtualization.  Can be expensive Examples: Docker, Mesos, CoreOS, LXC
  • 11. Virtual Hadoop Cluster - Hypervisor Controller VM NameNode JobTracker Hadoop Server vDisk vDisk vDisk Worker VM Cloud Server Disk Disk Disk Cloud Server Worker VM Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk DataNode TaskTracker Hadoop Server vDisk vDisk vDisk DataNode TaskTracker Hadoop Server vDisk vDisk vDisk
  • 12. Virtual Hadoop Cluster - Containers JobTracker Controller Container Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk TaskTracker Worker Container TaskTracker Worker Container NameNode DataNode DataNode DataNode
  • 13. Virtual Hadoop Cluster- Hypervisors Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk
  • 14. Virtual Hadoop Cluster – Private Cloud – Data Para Virtualization  Paravirtualization means that the abstraction the virtualization software provides is similar, but not identical, to the underlying hardware.  The differences are designed to reduce the virtualization “overhead” by taking advantage of some knowledge about the tasks running in the virtual machine. Examples: BlueData
  • 15. Virtual Hadoop Cluster – Paravirtualization Controller VM NameNode JobTracker Hadoop Server vDisk vDisk vDisk Worker VM Cloud Server Disk Disk Disk Cloud Server Worker VM Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk DataNode TaskTracker Hadoop Server vDisk vDisk vDisk DataNode TaskTracker Hadoop Server vDisk vDisk vDisk Data Connection Data Connection Data Connection NFS HDFS GlusterFS
  • 16. In which situations should an enterprise run their Hadoop jobs in a virtual or physical environment?
  • 17. Evaluation based on: Faster … – Deployment – Runtime Easier … – Deployment – Management Cheaper … – Hardware costs – Management costs
  • 18. Questions not to ask •How fast does the job need to run? •How much does the cluster cost? •How easy is it to use? Any application can be run with the needed speed in either a virtual or physical environment if enough money is spent. Any tool is easy to use once you are familiar with it. Other attributes indicate if the best solution is with physical or virtual clusters.
  • 19. Answers There are multiple clusters and each is lightly used. – Virtual cluster There is one cluster, it runs a single Hadoop query job. It runs 7 x 24 and demands instant response. – Physical cluster
  • 20. Answers Test & Dev environment where Hadoop clusters need to be built quickly and have short lifespan. Each developer gets their own cluster. No security concerns. – Virtual cluster - LXC An environment with multiple Hadoop applications constantly running and requiring access to a common data set. No expected change in applications or load. – Physical cluster
  • 21. Answers IaaS environment with multiple external customers, each with different QoS agreements, Hadoop distros, and data security needs. – Virtual cluster - Hypervisor
  • 22. Those scenarios are too easy!
  • 23. What the obvious answers tell us:  Situations that require many distinct Hadoop clusters, or clusters that require frequent provisioning, or clusters that have a relatively short lifespan are well suited for virtualized Hadoop.  Flexibility and speed of cluster creation are critical.  Situations that require few distinct Hadoop clusters, have long lifespans, and static configurations are well suited for Bare Metal Hadoop.  No reason to pay virtualization “tax” in exchange for flexibility.
  • 24. Questions to ask  How many clusters will be needed? • Over what time span?  What is the life span of the clusters?  Will the clusters have idle time?  What are the fault isolation needs?  What is the source of the big data?
  • 25. Other questions to ask*  Are multiple levels of job priority required?  Are multiple levels of data security required?  Is resource usage tracking/billing required? * The implementation of these may differ between distributions of Hadoop, so the level of effort to implement them in virtual and physical environments may be different.
  • 26. Use Case I Large manufacturing company – Internal Customers Started out with one Hadoop cluster – Success! Soon everyone wanted one. – Many lightly used – Different configurations. – “Cluster Sprawl” Virtual Clusters Either hypervisor or LXC could be used
  • 27. Use Case II Development group within a large tech company – Internal Customers Built out a physical Hadoop cluster. – No data security requirement – No expectation of growth in foreseeable future. Single Physical Cluster
  • 28. Use Case III Large service company. – Internal and External Customers No use of Hadoop. Fault containment and data security required Small IT department tasked with all Hadoop support for the company. – No clear way to predict growth. Virtual Clusters - hypervisor
  • 29. Use Case IV Large tech company – Internal Customers Constant stream of low priority jobs Bursty stream of very high priority, low latency, jobs No future demand growth. Very Hadoop savvy IT organization. Two Physical Clusters OR virtual clusters
  • 30. Use Case V Startup online service company – Selling information gathered using unique Hadoop analytics External Customers Multiple data sources Customer data security requirements Rapid growth of customer base No in-house Hadoop expertise Virtual Clusters Could benefit from paravirtualization
  • 31. Q & A
  • 32. Contact Tom Phelan tap@bluedata.com
  • 33. Virtual Hadoop Cluster NameNode Controller VM JobTracker Hadoop Server vDisk vDisk vDisk Worker VM Cloud Server Disk Disk Disk Cloud Server Worker VM Disk Disk Disk Cloud Server Disk Disk Disk Cloud Server Disk Disk Disk DataNode TaskTracker Hadoop Server vDisk vDisk vDisk DataNode TaskTracker Hadoop Server vDisk vDisk vDisk
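
The guidance in slides 19 to 24 can be read as a small decision rule. The Python sketch below is only an illustration of that reading, not something taken from the talk: the `Workload` fields, the `recommend` function, and the `many_clusters` threshold of 3 are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Answers to the 'questions to ask' (field names are hypothetical)."""
    cluster_count: int            # how many distinct Hadoop clusters are needed
    frequent_provisioning: bool   # clusters are created and torn down often
    short_lived: bool             # clusters have a relatively short lifespan
    needs_strong_isolation: bool  # fault containment or per-customer data security


def recommend(w: Workload, many_clusters: int = 3) -> str:
    """Return a coarse recommendation; the threshold is an assumption."""
    # Strong fault isolation and data security point to a hypervisor,
    # since the slides describe container isolation as currently weak.
    if w.needs_strong_isolation:
        return "virtual: hypervisor"
    # Many clusters, frequent provisioning, or short lifespans favor
    # virtualization; either technology can serve when isolation is not a concern.
    if w.cluster_count >= many_clusters or w.frequent_provisioning or w.short_lived:
        return "virtual: containers or hypervisor"
    # Few, long-lived, static clusters: no reason to pay the virtualization "tax".
    return "physical"


if __name__ == "__main__":
    dev_test = Workload(cluster_count=10, frequent_provisioning=True,
                        short_lived=True, needs_strong_isolation=False)
    print(recommend(dev_test))
```

Real deployments weigh more factors than this (idle time, data sources, in-house expertise), so the function is best treated as a checklist in code form rather than a verdict.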

Editor's Notes

  1. Before we go further, we need to discuss the definition of virtual and physical cluster
  2. Before we go further, we need to discuss the definition of virtual and physical cluster
  3. Before we go further, we need to discuss the definition of virtual and physical cluster
  4. Before we go further, we need to discuss the definition of virtual and physical cluster
  5. Before we go further, we need to discuss the definition of virtual and physical cluster
  6. Before we go further, we need to discuss the definition of virtual and physical cluster
  7. Note: there is no mention of the words “better”, “faster”, “cheaper”, etc.
  8. VMware, Mirantis, Cloudera, Hortonworks, etc. each have studies and whitepapers saying that physical is faster or virtual is faster, and that virtual is cheaper or physical is cheaper. Cloudera Manager, Ambari: physical deployment is as easy as virtual deployment. Nikita Ivanov (GridGain): Hadoop was not designed for speed; it was designed for cost reduction.
  9. Qualcomm
  10. IBM internal (menlo)
  11. Orange
  12. LinkedIn
  13. Power usage startup
  14. Before we go further, we need to discuss the definition of virtual and physical cluster