Hadoop in the Clouds, Virtualization and Virtual Machines


Published on

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • First let us take a look at the term virtualization.
    Virtualization is a general CS term that predates the term virtual machine by decades.
  • A thing to remember that Hadoop moves the computation to the data.
    Shortcut for local access directly from the OS,
  • Now lets turn to HDFS
  • VM on one extreme and a Hadoop App service such as EMR on the other end

    All except Amazon run HDP
  • Interleaved HDFS cluster
    We have one customer interleave a HDFS cluster across compute racks
  • VMs have been important for reducing the TCO for traditional enterprise processing
  • Yarn’s underlying resource model is very flexible and has an extensible abstraction for different resource types
    First memory,
    Now Cpu and different kinds of memory
    Next IO and Network bandwidths
  • Hadoop in the Clouds, Virtualization and Virtual Machines

    1. 1. © Hortonworks Inc. 2013 Hadoop in Clouds, Virtualization and Virtual Machines Page 1 Sanjay Radia Founder & Architect Hortonworks Inc Suresh Srinivas Founder, Director of Engineering Hortonworks Inc.
    2. 2. © Hortonworks Inc. 2013 Hello Sanjay Radia • Founder, Hortonworks • Part of the Hadoop team at Yahoo! since 2007 –Chief Architect of Hadoop Core at Yahoo! –Long time Apache Hadoop PMC and Committer –Designed and developed several key Hadoop features • Prior –Data center automation, virtualization, Java, HA, OSs, File Systems (Startup, Sun Microsystems, …) –Ph.D., University of Waterloo Page 2
    3. 3. © Hortonworks Inc. 2013 Overview • Virtualization – A General CS concept –What resources does Hadoop virtualize? • Public Clouds –A spectrum of offerings • Private Clouds –VMs and Hadoop –Test, dev, Poc clusters –Production clusters –Guideline for VMs Page 3
    4. 4. © Hortonworks Inc. 2013 Virtualization – a General CS Concept Defn: Slice/dice a resource across multiple users so that – Each user has an impression of a dedicated resource – CPU, Virtual memory, … – A user has the impression of a much larger resource – Virtual memory, tiered storage Examples • Virtualize the CPU, virtual memory – Abstractions: process (threads) and its virtual memory • Virtualize Storage – Abstractions: file system, Luns, etc • Virtualize the entire machine (CPU, memory, devices) – Abstraction: VMs • Virtualize Hadoop cluster resources across users and tenants – Next slide … Page 4
    5. 5. © Hortonworks Inc. 2013 So what about Virtualization in a Hadoop cluster? Hadoop lets the OS virtualize the hardware resources. Hadoop instead virtualizes the network and each node’s OS resources across the cluster 5 Hadoop is Big-Data “Operating System”
    6. 6. © Hortonworks Inc. 2013 Hadoop’s Architecture Page 6 NameNode RM File System Namespace Block Management Compute Resource Management Task placement Tasks DISC 1 DISC 2 DISC N DataNode TaskManager Block delete Replicate Heartbeat BlockReport Heartbeat Client Read/Write Data File create, delete, modify BlocksBlocksBlocks Slave Node Task Management Slave Node Tens to Thousands Slave Nodes
    7. 7. © Hortonworks Inc. 2013 What about Virtualization in Hadoop (1) • Hadoop virtualize compute, storage and network resources across the cluster tenants – Abstractions: Yarn application container, HDFS files, … – Yarn virtualizes each node’s OS resources across tenants – Yarn scheduler focuses on CPU and memory today but at lowest level the resource request is fairly general – IO, network bandwidth can be added – Yarn supports a diverse set of applications beyond, MR, SQL, ETL, … – Allows sharing the cluster infrastructure – Slider: run services on Hadoop cluster – Yarn container is a process (and soon Linux containers) to provide isolation – can be other things including VMs – Capacity scheduler guarantees capacity across multiple cluster tenants – Can go beyond guaranteed capacity if available, but subject to preemption – Limits, Queues and sub-queues for flexible management – Uses OS limits, and Linux containers to limit resource usage Hadoop was designed for multi-tenancy Page 7
    8. 8. © Hortonworks Inc. 2013 What about Virtualization in Hadoop (2) • Hadoop virtualize compute, storage and network resources across the cluster tenants – Abstraction: Yarn application container, HDFS files, … – Yarn virtualizes compute/memory across nodes for tenants – (covered in previous slide) –HDFS – A shared file system across tenants with quota support – Soon: – memory for intermediate data that is virtualized like virtual memory – Tiering across storage types Hadoop was designed for multi-tenancy Page 8
    9. 9. © Hortonworks Inc. 2013 Take Away for rest of the talk • Hadoop for the public cloud –Most based on VMs due to strong isolation requirement –Main challenge is your data as clusters are spun up and down – We describe best practices and what is coming • Hadoop for the private cloud –For test and dev clusters – Both Hadoop on raw machines and VMs are a good idea –For production clusters – Hadoop natively provides excellent elasticity, resource management and multi- tenancy – If you must use VMs – Understand your motivations well and configure accordingly –For VMs, we describe some best practices Page 9
    10. 10. © Hortonworks Inc. 2013 Public Clouds Page 10
    11. 11. © Hortonworks Inc. 2012© Hortonworks Inc. 2013. Confidential and Proprietary. Public Clouds – A Spectrum Page 11 Virtual Machines Hadoop App as a Service Run/Manage Your Hadoop Cluster On VMs Just submit your job/app Hosted Deployed Managed Hadoop Clusters Managed “virtual” cluster that uses a shared Hadoop. Just submit jobs/apps MS Azure AWS EC2 Rackspace Cloud AWS EMR Cluster deployed on VMs or private hosts with easy Cluster Management & Job Management tools MS HDInsight (VMs) Rackspace (VMs or Host in cage) ATT Altiscale You Spin up/down cluster Data on external store (S3, Azure storage) across Cluster lifecycles Automatic Spin up/down Data on external store (S3 Azure storage) across Cluster lifecycles, Shared HDFS but Isolated Namespaces Shared Yarn but isolation via Containers or VMs Hosted: Cluster/Data persists
    12. 12. © Hortonworks Inc. 2013 Cluster Lifecycle and Data Persistence • For a few: the Cluster and HDFS data persists –Rackspace (Hosted and managed private cluster) –Altiscale • For most: clusters spun up and down –Main challenge: need to persist data across cluster lifecycle – Local data disappears when cluster is spun down –External K-V store – Access remote storage (slow) – Copy, then process, process, process, spin down (e.g. Netflix) –How can we do better? – Interleaved storage cluster across rack (an example later) – Shadow HDFS that caches K-V Store – great for reading – Use the K-V store as 4th lazy HDFS replica – works for writing Page 12
    13. 13. © Hortonworks Inc. 2013 Use of External Storage when cluster is Spinup/Spindown • Shadow mount external storage (S3) – great for reading – Cache one or two replicas at mount time – when missing, fetch from source – Hadoop apps access cached S3 data on local HDFS • Use S3 to store blocks and check-pointed image – works for writes – Write lazy of 4th replica – When cluster is spun down, all block and image must be flushed to S3 Page 13 DataNode Shadow Mount (Cache) S3S3 Checkpoint Image +Blocks Lazy write Of 4th replica Local HDFS
    14. 14. © Hortonworks Inc. 2013 An Interesting “Hadoop as service” configuration • HDFS Cluster for data external to the compute cluster – but stripped to allow rack level access • Persists when the dynamic customer clusters are spun-down • Much better performance than external S3… Page 14 Core Switch ` Switch …… ` Switch …… ` Switch …… Storage HDFS cluster - Stores data that persists when compute clusters are spun down Transient Hadoop cluster - Clusters (VMs) are spun up and down - Have transient local storage
    15. 15. © Hortonworks Inc. 2013 VMs and Hadoop in Private Clouds Page 15
    16. 16. © Hortonworks Inc. 2013 Virtual Machines and Hadoop • Traditional Motivation for VMs – Easier to install and manage – Elasticity – expand and shrink service capacity – Relies on expensive shared storage so that data can be accessed from any VM – Infrastructure consolidation and improved utilization • Some of the VM benefits require shared storage – But external shared storage (SAN/NAS) does not make sense for Hadoop • Potential Motivation for Hadoop on VMs – Public clouds mostly use VMs for isolation reasons – Motivation for VMs for private clouds – Test, dev clusters-on-fly to allow independent version and cluster management – Share infrastructure with other non-Hadoop VMs – Production clusters … – Elasticity ? – Multi-tenancy isolation? Page 16
    17. 17. © Hortonworks Inc. 2013 Virtual Machines and Hadoop • Elasticity in Hadoop via VMs –Elastic DataNodes? - Not a good idea –lots of data to be copied –Elastic Compute Nodes – works but some challenges – Need to allocate local disk partition for shuffle data – VM per-tenant will fragment your local storage - This can be fixed by moving tmp storage to HDFS –will be enabled in the future – Local read optimization (short circuit read) is not available to VM • Resource management & Isolation excellent in Hadoop –Yarn has flexible resource model (CPU, Memory, others coming) – It manages across the nodes –Yarn’s Capacity scheduler gives capacity guarantees –Isolation supported via Linux Containers –Docker helps with image management Page 17
    18. 18. © Hortonworks Inc. 2013 Guideline for VMs for Hadoop • VMs for Test, Dev, Poc, Staging Clusters works well –Can share Host with non-Hadoop VMs –For such use cases having independent clusters is fine – Do not have lots of data or storage fragmentation is less concern – Independent clusters allow each to have a different Hadoop version –Some customers: a native shared Hadoop cluster for testing apps • Production Hadoop – VMs generally less attractive But if you do, consider motivation, and configure accordingly (next slide) –Hadoop has built-in – elasticity for CPU and Storage – resource management – capacity guarantees for multi-tenancy – isolation for multi-tenancy Page 18
    19. 19. © Hortonworks Inc. 2013 CN… Expandd compute Configurations for Hadoop VMs • Model 1 is generally better – ComputeNode/DataNode shortcut optimization – Less storage fragmentation (intermediate shuffle data) – Share cluster: Hadoop has excellent Tenant/Resource isolation – Use Model 2 if you need need additional tenant isolation but shared data Page 20 Base cluster: Data+Compute on each VM CN DN Model 1 Base HDFS cluster: a DataNode on a VM DN Per-tenant compute cluster CN … Model 2 • Share the cluster unless you need per-tenant Hadoop-version or more isolation
    20. 20. © Hortonworks Inc. 2013 Summary • Hadoop Virtualizes the Cluster’s HW Resources across Tenants • Hadoop for the Public Cloud – Most based on VMs due to strong complete isolation requirement – Altiscale, Rackspace offer additional models – Main challenge is data persistence as clusters are spun up/down – We describe best practices and what is coming • Hadoop for the Private Cloud – For test, dev, Poc clusters – Both Hadoop on raw machines and VMs are a good idea – VMs allow “cluster-on-the-fly” plus ability to control version for each user – For production clusters – Hadoop natively provides excellent elasticity, resource management and multi-tenancy – If you must use VMs – Understand your motivations well and configure accordingly – For VMs, follow the best practices we provided Page 21
    21. 21. © Hortonworks Inc. 2013 Thank You Page 22 Q & A