Hello OpenStack, Meet Hadoop



Hadoop is often viewed as needing racks of dedicated boxes, despite the fact that, in sheer numbers, the majority of Hadoop clusters ever created have been brought up on public cloud infrastructures, particularly Amazon's. Yet the rest of datacenter computing is moving towards virtualization, be it in-cloud startups or in-enterprise IT departments. Some organizations are standing up private clouds: a rack or two of servers with an API for VM creation. Hadoop can live there; it just needs to integrate better. At the same time, OpenStack is emerging as the de facto standard open source cloud platform for private use, and is available publicly from a number of cloud infrastructure service providers. This talk looks at what we've done, and are doing, to integrate Hadoop with OpenStack. This takes it beyond Hadoop's current support for Amazon's infrastructure, making a combined Hadoop + OpenStack cluster something to consider in-house and in-cloud. This work is being done in collaboration with members of the OpenStack community, showing that cloud and big data projects can not only co-exist but also co-develop their platforms.

Published in: Technology
  • Mirantis
  • Note:
      – Not recommended: one cluster having multiple data nodes on the same hypervisor node
      – Allowed: multiple clusters each having a data node on the same hypervisor node
      – Allowed: one data node and multiple compute nodes per hypervisor
  • Object Store (codenamed "Swift") provides object storage. It allows you to store or retrieve files (but not mount directories like a fileserver). Several companies provide commercial storage services based on Swift. These include KT, Rackspace (from which Swift originated) and Internap. Swift is also used internally at many large companies to store their data.
    Image (codenamed "Glance") provides a catalog and repository for virtual disk images. These disk images are most commonly used in OpenStack Compute. While this service is technically optional, any cloud of size will require it.
    Compute (codenamed "Nova") provides virtual servers upon demand. Rackspace and HP provide commercial compute services built on Nova, and it is used internally at companies like Mercado Libre and NASA (where it originated).
    Dashboard (codenamed "Horizon") provides a modular web-based user interface for all the OpenStack services. With this web GUI, you can perform most operations on your cloud, like launching an instance, assigning IP addresses and setting access controls.
    Identity (codenamed "Keystone") provides authentication and authorization for all the OpenStack services. It also provides a service catalog of services within a particular OpenStack cloud.
    Network (codenamed "Quantum") provides "network connectivity as a service" between interface devices managed by other OpenStack services (most likely Nova). The service works by allowing users to create their own networks and then attach interfaces to them. Quantum has a pluggable architecture to support many popular networking vendors and technologies.
    Block Storage (codenamed "Cinder") provides persistent block storage to guest VMs. This project was born from code originally in Nova (the nova-volume service described below). In the Folsom release, both the nova-volume service and the separate volume service are available.
    File storage (NAS): no support. Currently, OpenStack Compute does not have any native support for this type of file storage inside of an instance. However, there is a Gluster storage connector for OpenStack that enables the use of the GlusterFS file system as a back-end for the Image service.
  • 1. What is RDO?
      * Distribution of OpenStack
          - The OpenStack project produces code. Packaging, integration, installation and support are left to distributors and partners
          - In its current form, OpenStack is a toolbox for creating an IaaS cloud; RDO allows you to get started quickly
      * For RHEL, CentOS, Scientific Linux and other RHEL clones, and for Fedora
          - There is a demand for being able to try out OpenStack on the industry's most successful enterprise Linux platform
          - We welcome users and experiences from the Red Hat Enterprise Linux ecosystem, which includes CentOS and Scientific Linux
          - We also want to make it easy for users of Fedora to try the version of OpenStack they are interested in without necessarily upgrading their entire operating system
      * Community-driven
          - The RDO community site is a wiki and a forum. We welcome the participation of community members sharing knowledge and helping each other
          - Support offered with RDO is of a standard which can be expected from a community-supported project; we encourage anyone who is looking for enterprise-level support to upgrade to Red Hat OpenStack
  • Every conversation with customers about the Hadoop deployment model ends with one word: "flexibility". Customers want to be able to deploy Hadoop on-premises, on physical or virtual infrastructure, and in the cloud. In the cloud, OpenStack is emerging, or rather has emerged, as the hands-down dominant open source cloud management platform.
  • Hello OpenStack, Meet Hadoop

    1. © Hortonworks Inc. 2013. Hadoop meet OpenStack. Himanshu Bari, Hortonworks; Ilya Elterman, Mirantis; John Spiedel, Hortonworks. June 26th, 2013
    2. Disclaimer
       • This document may contain product features and technology directions that are under development or may be under development in the future.
       • Technical feasibility, market demand, user feedback, and the Apache Software Foundation community development process can all affect timing and final delivery.
       • This document’s description of these features and technology directions does not represent a contractual commitment from Hortonworks to deliver these features in any generally available product.
       • Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
    3. Agenda
       • Why Hadoop on OpenStack
       • Savanna controller deep dive
       • Hortonworks OpenStack plugin
       • DEMO
    4. Why Hadoop & OpenStack? Marries two of the largest open source movements
       Hadoop provides a greenfield use case
       • Net new workload
       • Needs scale-out infrastructure
       • Shared platform
       OpenStack provides the perfect cloud platform
       • Operational agility
       • Supports scale-out architecture
       • Deployment choice across public, private, and hybrid clouds
       1. Open source communities provide the fastest path to innovation
       2. Open source is changing the game as economics and accessibility serve to accelerate cloud & big data market trends
       3. Both are attracting major ecosystem players: IBM, RHT, HP, RAX, etc.
    5. Project Savanna to accelerate integration
       Project Savanna: automate deployment of Apache Hadoop on OpenStack
       (Diagram labels: OpenStack Infrastructure; Savanna Elastic Hadoop Controller; Swift storage; Hadoop Cluster nodes; Ambari Hadoop management; callouts 1 to 3)
       1. Cluster templates: deploy preconfigured Hadoop clusters in seconds from Horizon or Ambari
       2. HDFS-Swift connectors: move data between HDFS and Swift object storage
       3. Simplified elasticity
    6. Provisioning
       Benefits/Use cases
       • Frequent dev/test/staging cluster provision requests
       • Migrations from staging to prod and vice versa
       • Reduce operator error in cluster provisioning
       • Migrate away from Amazon EMR for ad hoc analytics requests for experimentation
       Phase-1 features
       • Cluster and node level templates for self-provisioning
       • Template operations like save/import
       • Move data between HDFS & Swift object store
       Phase-2 features
       • Job flow based cluster provisioning
    7. Elasticity
       Benefits/Use cases
       • Commission a new node or decommission a node for maintenance
       • For dev/test/staging clusters: automatically vary cluster data & compute capacity based on tenant, workload, time of day, resource utilization, etc.
       • Automatically vary compute capacity only for production clusters
       Phase-1 features
       • Hadoop cluster node add/remove from OpenStack
       • Cluster operations like destroy cluster fired from OpenStack
       Phase-2 features
       • Rule based cluster node elasticity
    8. Multi-tenancy
       Benefits/Use cases
       • Common infrastructure for Hadoop and non-Hadoop workloads
       • Simplify maintenance through version isolation
       • Resource isolation to support varying SLAs based on tenant and workload
       • Simplify chargeback/showback
       Phase-1 features
       • Hadoop virtualization extensions support
       • Ability to pin VMs to group of physical hosts
       • Keystone integration with Ambari
       • One Ambari instance per tenant
       Phase-2 features
       • Keystone enhancements to support job flow to tenant mapping
    9. Agenda
       • Why Hadoop on OpenStack
       • Savanna controller deep dive
       • Hortonworks OpenStack plugin
       • DEMO
    10. OpenStack - cloud management platform
        • Integrated (Apache License)
        • Multi-hypervisor & guest OS support
        • Components: Glance (Image Service), Keystone (Identity Service), Horizon, Neutron, Nova, Cinder (Block Store), Swift (Object Store), Ceilometer (Metering), Heat (Orchestration), Savanna (Hadoop)
    11. Project Savanna logical architecture
        (Diagram labels: OpenStack Infrastructure: Network, Storage, Security, Compute; Savanna Controller: API, Hadoop Provisioning, Configuration Templates, Configuration, Elasticity, Orchestration, On-demand jobs execution, Plugin manager; Hortonworks OpenStack plugin; Horizon + Savanna UI; Hadoop Cluster; Ambari + API)
    12. Savanna Architecture
        (Diagram labels: Savanna Python Client; REST API; Cluster Configuration Manager; Horizon; Savanna Pages; Keystone; Auth; DAL; Nova; Glance; Swift; Provisioning Plugin; VM Manager; Image Registry; Hadoop VMs)
    13. Savanna key features
        • Node group and cluster templates
        • Cluster scaling (add/remove nodes)
        • Hadoop cluster topology configuration parameters
          – Data node anti-affinity
          – HDFS location
          – Swift integration
        • Plugin mechanism for integration with different Hadoop distributions
        • Plugin implementations
          – Hortonworks Data Platform OpenStack plugin (uses Apache Ambari)
          – Vanilla Apache Hadoop (no Apache Ambari): reference implementation with pre-built image
    14. HDFS reliability on VMs
        (Diagram labels: Compute, DN, Data Block)
    15. Data node anti-affinity
        (Diagram labels: Compute, DN, TT, Cluster A, Cluster B)
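The anti-affinity rule for data nodes can be illustrated with a small placement sketch. This is not Savanna's scheduler; the function names and the least-loaded heuristic are assumptions made for illustration. It only enforces the rule from the speaker notes: data nodes of one cluster never share a hypervisor, while different clusters may.

```python
from collections import defaultdict


def place_data_nodes(clusters, hypervisors):
    """Assign each cluster's data nodes to distinct hypervisors.

    clusters: dict mapping cluster name -> number of data nodes.
    hypervisors: list of hypervisor host names (hypothetical).
    Returns a dict mapping cluster name -> list of chosen hypervisors.
    """
    load = defaultdict(int)  # data nodes already placed per hypervisor
    placement = {}
    for name, count in clusters.items():
        if count > len(hypervisors):
            raise ValueError(
                f"cluster {name!r} needs {count} hypervisors, "
                f"only {len(hypervisors)} available"
            )
        # Prefer the least-loaded hosts; sorted() is stable, so ties keep
        # the caller's ordering.
        chosen = sorted(hypervisors, key=lambda h: load[h])[:count]
        for host in chosen:
            load[host] += 1
        placement[name] = chosen
    return placement


def violates_anti_affinity(hosts):
    """True if two data nodes of one cluster share a hypervisor."""
    return len(set(hosts)) != len(hosts)


placement = place_data_nodes({"A": 3, "B": 2}, ["hv1", "hv2", "hv3"])
```

Here clusters A and B end up sharing hypervisors with each other, which is allowed, but neither cluster has two of its own data nodes on one host.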
    16. HADOOP-8545: Swift for Hadoop
        (Diagram labels: Swift; Local HDFS; Hadoop Job #1, Hadoop Job #2, ..., Hadoop Job #N)
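Jobs reach Swift through filesystem URLs. The sketch below assumes the `swift://<container>.<service>/<path>` form used by the Hadoop Swift filesystem; the helper name and the sample container and service names are hypothetical.

```python
from urllib.parse import urlparse


def parse_swift_url(url):
    """Split swift://<container>.<service>/<path> into its three parts."""
    parsed = urlparse(url)
    if parsed.scheme != "swift":
        raise ValueError(f"not a swift URL: {url!r}")
    # The host part carries both the container and the configured service name.
    container, sep, service = parsed.netloc.partition(".")
    if not (container and sep and service):
        raise ValueError(f"expected <container>.<service> host in {url!r}")
    return container, service, parsed.path


# Hypothetical container "logs" on a service configured as "mycloud".
container, service, path = parse_swift_url(
    "swift://logs.mycloud/2013/06/26/part-00000"
)
```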
    17. HDFS placement options
        • Ephemeral drive: /var/lib/nova/instances/instance-xxx/disk -> /mnt/ephemeral
        • Block storage volume: Cinder Volume -> /mnt/volume
        • Bare drive support: /dev/sdb -> /mnt/sdb
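As a rough illustration of how these placement options might feed the data node configuration, the sketch below builds a comma-separated value for `dfs.data.dir` (the Hadoop 1.x property for data node storage directories). The mount points come from the slide; the `hdfs/data` subdirectory and the helper name are assumptions.

```python
# Mount points taken from the slide; keys are hypothetical option names.
MOUNT_POINTS = {
    "ephemeral": "/mnt/ephemeral",
    "cinder": "/mnt/volume",
    "bare": "/mnt/sdb",
}


def dfs_data_dir(options, subdir="hdfs/data"):
    """Build a comma-separated dfs.data.dir value for the chosen options."""
    unknown = [o for o in options if o not in MOUNT_POINTS]
    if unknown:
        raise ValueError(f"unknown placement option(s): {unknown}")
    # One storage directory per selected mount point, in the order given.
    return ",".join(f"{MOUNT_POINTS[o]}/{subdir}" for o in options)


value = dfs_data_dir(["ephemeral", "cinder"])
```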
    18. Savanna key features
        • API to execute Map/Reduce jobs without exposing details of underlying infrastructure (similar to AWS EMR)
        • User-friendly UI for ad hoc analytics queries based on Hive or Pig
        • Network configuration support, integration with Neutron (OpenStack Networking, earlier Quantum)
    19. Agenda
        • Why Hadoop on OpenStack
        • Savanna Controller deep dive
        • Hortonworks OpenStack plugin
        • DEMO
    20. Hortonworks Data Platform OpenStack plugin
        • Provision HDP cluster using Ambari
        • Supports generic or pre-packaged VM images
        • Supports standard Savanna configuration templates
        • Supports Ambari templates (aka blueprints) configuration
          – https://issues.apache.org/jira/browse/AMBARI-1783
    21. HDP OpenStack plugin and Ambari
        • Ambari services installed on cluster hosts
          – Ambari Server and Ambari Agent
        • HDP plugin uses Ambari REST API
          – Define cluster topology
          – Configure Hadoop services
          – Install Hadoop services on all VMs
          – Start Hadoop services
        • Monitor and manage cluster with Ambari
          – Ambari UI
          – Ambari REST API
    22. Red Hat RDO
        RDO is a freely available, community-supported distribution of OpenStack, packaged and integrated for Red Hat Enterprise Linux and its clones, and for Fedora. http://openstack.redhat.com
    23. Apache Ambari templates (aka blueprints)
        Preconfigured information across all clusters using this template
        • HDP stack information
          – Services, components and packages
          – Description
          – Package dependencies
        • Hadoop topology: component / host group mapping
        • Hadoop configuration: all Hadoop configuration for the cluster (hundreds of parameters and their values)
        Per-cluster pluggable data
        • User names
        • Passwords
        • Host names
        • Host VM flavors (CPU/Mem)
        • Node count per host group
    24. Demo
        • Provision a Hadoop cluster on OpenStack
          – Savanna with OpenStack UI extensions
          – HDP Plugin
          – Ambari Templates
          – HDP stack including metrics and alerts
        • Monitor cluster using Ambari UI
    25. Specify topology and configure
        • Node Group templates
          – Specify host/component mappings
          – Specify VM flavor
          – Specify node scoped configurations
        • Cluster templates
          – Specify node groups
          – Specify VM image
          – Specify cluster scoped configurations
        • Upload templates (aka Ambari blueprints)
          – Specifies topology and configuration
          – Used to create Cluster Template
    26. Savanna Controller: Provision VMs
        • Savanna provisions OpenStack VMs based on configured Node Groups
        • Node Groups: Master: 1, Slave: 2
        (Diagram labels: Savanna; OpenStack; Master VM; Slave VM 1; Slave VM 2)
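The step from node groups to VMs can be sketched as a simple expansion. The `NodeGroup` structure, the flavor names and the naming scheme below are hypothetical and only mirror the "Master: 1, Slave: 2" example on the slide; Savanna's real templates carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    name: str    # e.g. "Master" or "Slave"
    count: int   # number of VMs to provision for this group
    flavor: str  # VM flavor name (hypothetical values below)


def expand_node_groups(groups):
    """Return one (vm_name, flavor) pair per VM to provision."""
    vms = []
    for group in groups:
        if group.count < 1:
            raise ValueError(f"node group {group.name!r} needs at least one VM")
        for index in range(1, group.count + 1):
            # Number the VMs only when a group has more than one member.
            if group.count == 1:
                vm_name = f"{group.name} VM"
            else:
                vm_name = f"{group.name} VM {index}"
            vms.append((vm_name, group.flavor))
    return vms


vms = expand_node_groups(
    [NodeGroup("Master", 1, "m1.large"), NodeGroup("Slave", 2, "m1.medium")]
)
```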
    27. HDP OpenStack Plugin: Install Ambari
        • HDP plugin remotely installs Ambari services
        • Ambari is installed from public/private repo
        • Ambari Agents register with Ambari Server
        (Diagram labels: Savanna; HDP Plugin; OpenStack; Master VM: Ambari Server, Ambari DB, Ambari Agent; Slave VM 1: Ambari Agent; Slave VM 2: Ambari Agent)
    28. HDP OpenStack Plugin: Define topology and configure
        • HDP plugin specifies cluster topology and configuration via Ambari REST API
        • Ambari stores topology and configuration in its DB
        • Ambari is installed from public/private repo
        • Ambari Agents register with Ambari Server
        (Diagram labels: Savanna; HDP Plugin; REST API; OpenStack; Master VM: Ambari Server, Ambari DB, Ambari Agent; Slave VM 1: Ambari Agent; Slave VM 2: Ambari Agent)
    29. HDP OpenStack Plugin: Install Hadoop Services
        • HDP plugin sets state of all services to INSTALLED via Ambari REST API
        • Ambari installs services on each host from public/private HDP repos
        • Ambari pushes configurations to each host
        • Service installation is asynchronous
        • HDP Plugin polls install status via Ambari REST API
        (Diagram labels: Savanna; HDP Plugin; REST API; OpenStack; Master VM: Ambari Server, Ambari DB, Ambari Agent, NN, JT; Slave VM 1: Ambari Agent, DN, TT; Slave VM 2: Ambari Agent, DN, TT)
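A state change of this kind might look roughly like the request built below. The sketch only constructs the HTTP request and does not send it. The payload shape, the `X-Requested-By` header, the cluster name `demo` and the credentials are assumptions modelled on the Ambari v1 REST API and may differ between Ambari versions; this is not the HDP plugin's actual code.

```python
import base64
import json
import urllib.request


def service_state_request(base_url, cluster, state, user, password):
    """Build (but do not send) a PUT that sets every service to `state`."""
    if state not in ("INSTALLED", "STARTED"):
        raise ValueError(f"unsupported target state: {state!r}")
    # Assumed payload shape: the desired state for all services in the cluster.
    body = {"ServiceInfo": {"state": state}}
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/clusters/{cluster}/services",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "sketch",  # assumed to be required by Ambari
        },
    )


# Hypothetical host, cluster name and credentials.
request = service_state_request(
    "http://ambarihost:8080", "demo", "INSTALLED", "admin", "admin"
)
```

Passing `"STARTED"` instead of `"INSTALLED"` would build the corresponding start request; sending it would be a matter of `urllib.request.urlopen(request)` against a live Ambari server.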
    30. HDP OpenStack Plugin: Start Services
        • HDP plugin sets state of all services to STARTED via Ambari REST API
        • Ambari starts all services on all hosts
        • Service start is asynchronous
        • HDP Plugin polls start status via Ambari REST API
        (Diagram labels: Savanna; HDP Plugin; REST API; OpenStack; Master VM: Ambari Server, Ambari DB, Ambari Agent, NN, JT; Slave VM 1: Ambari Agent, DN, TT; Slave VM 2: Ambari Agent, DN, TT)
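Because installation and start are asynchronous, the caller has to poll for completion. A minimal polling loop, under stated assumptions, could look like this: `fetch_status` is a hypothetical callable that would wrap a GET against the Ambari REST API, and the status strings are assumed values rather than a confirmed list.

```python
import time


def wait_for_request(fetch_status, interval=5.0, timeout=1800.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until the asynchronous request finishes or the timeout expires.

    fetch_status: zero-argument callable returning a status string.
    sleep and clock are injectable so the loop can be exercised offline.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == "COMPLETED":
            return status
        if status in ("FAILED", "ABORTED", "TIMEDOUT"):
            raise RuntimeError(f"request ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError("gave up waiting for the request to finish")
        sleep(interval)


# Offline usage with canned replies instead of a live Ambari server.
replies = iter(["PENDING", "IN_PROGRESS", "COMPLETED"])
result = wait_for_request(lambda: next(replies), sleep=lambda _seconds: None)
```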
    31. Apache Ambari: Monitor Cluster
        • Use Ambari UI to monitor the cluster
          – Use hostname where Ambari Server is running
          – Default port is 8080
        • Use Ambari REST API to monitor the cluster
          – Use hostname where Ambari Server is running
          – Default port is 8080
          – ‘clusters’ is root resource
          – Example URL: http://ambarihost:8080/api/v1/clusters
          – REST API documentation: https://github.com/apache/ambari/blob/trunk/ambari-server/docs/api/v1/index.md
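A small sketch of working with the `clusters` root resource follows. The URL pattern and default port come from the slide; the response shape (an `items` list whose entries carry `Clusters.cluster_name`) is an assumption based on the Ambari v1 API, and the sample response is made up for illustration.

```python
import json


def clusters_url(host, port=8080):
    """URL of the 'clusters' root resource on an Ambari server."""
    return f"http://{host}:{port}/api/v1/clusters"


def cluster_names(response_text):
    """Extract cluster names from a clusters-collection response body."""
    payload = json.loads(response_text)
    return [item["Clusters"]["cluster_name"] for item in payload.get("items", [])]


# Made-up response body; a real call would GET clusters_url("ambarihost").
sample = '{"items": [{"Clusters": {"cluster_name": "demo"}}]}'
names = cluster_names(sample)
```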
    32. Summary
        • OpenStack provides operational agility and deployment choice
        • Hadoop is a net new workload and a perfect app for OpenStack
        • Integration marries two of the largest open source movements
          – Community-driven innovation outpaces any single vendor
          – Both are attracting major ecosystem players: IBM, RHT, HP, RAX, etc.
        Project Savanna: automate deployment of Apache Hadoop on OpenStack
    33. Learn More & Get Involved!
        • Download Hortonworks Data Platform: www.hortonworks.com/download
        • Follow: @hortonworks
        • Project Savanna: https://launchpad.net/Savanna and https://wiki.openstack.org/wiki/Savanna/HowToParticipate
        • Email questions to: hbari@hortonworks.com, ielterman@mirantis.com