Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Docker based Hadoop provisioning - anywhere
April 16th, 2015
Janos Matyas
Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Overview
• Introduction
• Goals and motivations
• Technology stack
• How it works
• Results/achievements/future plans
• Demo and Q&A
Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Goals and motivations
• Full Hadoop stack provisioning – everywhere
• Automate and unify the process
• Zero-configuration approach
• Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
• Provide tooling - UI, REST API and CLI/shell
• Secure and multi-tenant
• SLA policy based autoscaling
Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Technology stack
• Docker
• Swarm
• Consul
• Apache Ambari
Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Docker
• Container based virtualization
• Lightweight and portable
• Build once, run anywhere
• Ease of packaging applications
• Automated and scripted
• Isolated
Page6 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Docker – How it works
• Containers are isolated, but share OS and
bins/libraries
• No need to emulate hardware
Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Swarm
• Native clustering for Docker
• Distributed container orchestration
• Same API as Docker
Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Swarm – How it works
• Swarm managers/agents
• Discovery services
• Advanced scheduling
Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Consul
• Service discovery/registry
• Health checking
• Key/Value store
• DNS
• Multi datacenter aware
Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Consul – How it works
• Consul servers/agents
• Consistency through a quorum (RAFT)
• Scalability due to gossip based protocol (SWIM)
• Decentralized and fault tolerant
• Highly available
• Consistency over availability (CP)
• Multiple interfaces - HTTP and DNS
• Support for watches
Page11 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Ambari
• Easy Hadoop cluster provisioning
• Management and monitoring
• Key feature - Blueprints
• REST API, CLI shell
• Extensible
• Stacks
• Services
• Views
Page12 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Apache Ambari – How it works
• Ambari server/agents
• Define a blueprint (blueprint.json)
• Define a host mapping (hostmapping.json)
• Post the cluster create
Page13 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cloudbreak
Cloudbreak is a cloud-agnostic Hadoop as a
Service API. Abstracts the provisioning and ease
management and monitoring of on-demand
clusters.
Cloudbreak is a powerful left surf that
breaks over a coral reef, a mile off
southwest the island of Tavarua, Fiji.
Page14 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cloudbreak
• Benefits
• Zero configuration
• Elastic
• Secure
• Infrastructure agnostic
• Heterogenous clusters
• Auto-scaling
• Main REST resources
• /template – specify an instance group infrastructure
• /stack – creates an infrastructure based on a template
• /blueprint – describes a Hadoop cluster
• /cluster – creates a Hadoop cluster
Page15 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cloudbreak – How it works
• Start VMs - with a running Docker daemon
• Cloudbreak Bootstrap
• Start Consul Cluster
• Start Swarm Cluster (Consul for discovery)
• Start Ambari servers/agents - Swarm API
• Ambari services registered in Consul (Registrator)
• Post Blueprint
Page16 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cloudbreak - Features
• Support Kerberized clusters
• Cloudbreak “recipes”
• Automate host configuration
• Pre/post Ambari lifecycle hooks
• Services reconfiguration
• Automate/execute custom actions
• Side – effects
• Ambari CLI/shell and Groovy based client
• Cloud Foundry’s UAA Dockerized
• Munchausen – bootstrap Swarm with Consul
• Dockerized full Hadoop stack (Apache Hadoop 42K+, Ambari 8K+, Spark 6K+ downloads)
Page17 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Cloudbreak - Hadoop as a Service API
• Public tech preview
• Microsoft Azure
• Amazon AWS
• Google Cloud Platform
• OpenStack
• Private tech preview – R&D
• Bare metal
• Rackspace Managed Cloud
• HP Helion Public Cloud
*integration SPI is available
Page18 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Periscope
Periscope is a heuristic Hadoop scheduler
associated with a QoS profile. Built on
YARN schedulers, cloud and VM resource
management API's it allows to associate
SLA's to applications and customers.
Periscope is a powerful, fast, thick and top-
to-bottom right-hander, eastward from
Sumbawa's famous west-coast.
Page19 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Periscope
• Benefits
• Zero configuration
• Metric and time based alarms
• SLA policy based autoscaling
• Secure
• Hostgroup specific
• Main REST resources
• /clusters – specify a cluster to be monitored
• /alerts– time and metric based
• /policies – specify an SLA policy for a cluster based on an alarm
• /applications – specify an SLA policy for an application (under development)
Page20 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Periscope – How it works
• Configures/monitors alarms in Ambari
• Setup alarm, cooldown periods
• Manages cluster sizes
• Allow to associate SLA scaling policies to alarms
• Orchestrates Cloudbreak to up/downscale the cluster
Page21 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Demo and Q&A

Docker based Hadoop provisioning - anywhere

  • 1.
    Page1 © HortonworksInc. 2011 – 2015. All Rights Reserved Docker based Hadoop provisioning - anywhere April 16th, 2015 Janos Matyas
  • 2.
    Page2 © HortonworksInc. 2011 – 2015. All Rights Reserved Overview • Introduction • Goals and motivations • Technology stack • How it works • Results/achievements/future plans • Demo and Q&A
  • 3.
    Page3 © HortonworksInc. 2011 – 2015. All Rights Reserved Goals and motivations • Full Hadoop stack provisioning – everywhere • Automate and unify the process • Zero-configuration approach • Same process through a cluster lifecycle (Dev, QA, UAT, Prod) • Provide tooling - UI, REST API and CLI/shell • Secure and multi-tenant • SLA policy based autoscaling
  • 4.
    Page4 © HortonworksInc. 2011 – 2015. All Rights Reserved Technology stack • Docker • Swarm • Consul • Apache Ambari
  • 5.
    Page5 © HortonworksInc. 2011 – 2015. All Rights Reserved Docker • Container based virtualization • Lightweight and portable • Build once, run anywhere • Ease of packaging applications • Automated and scripted • Isolated
  • 6.
    Page6 © HortonworksInc. 2011 – 2015. All Rights Reserved Docker – How it works • Containers are isolated, but share OS and bins/libraries • No need to emulate hardware
  • 7.
    Page7 © HortonworksInc. 2011 – 2015. All Rights Reserved Swarm • Native clustering for Docker • Distributed container orchestration • Same API as Docker
  • 8.
    Page8 © HortonworksInc. 2011 – 2015. All Rights Reserved Swarm – How it works • Swarm managers/agents • Discovery services • Advanced scheduling
  • 9.
    Page9 © HortonworksInc. 2011 – 2015. All Rights Reserved Consul • Service discovery/registry • Health checking • Key/Value store • DNS • Multi datacenter aware
  • 10.
    Page10 © HortonworksInc. 2011 – 2015. All Rights Reserved Consul – How it works • Consul servers/agents • Consistency through a quorum (RAFT) • Scalability due to gossip based protocol (SWIM) • Decentralized and fault tolerant • Highly available • Consistency over availability (CP) • Multiple interfaces - HTTP and DNS • Support for watches
  • 11.
    Page11 © HortonworksInc. 2011 – 2015. All Rights Reserved Apache Ambari • Easy Hadoop cluster provisioning • Management and monitoring • Key feature - Blueprints • REST API, CLI shell • Extensible • Stacks • Services • Views
  • 12.
    Page12 © HortonworksInc. 2011 – 2015. All Rights Reserved Apache Ambari – How it works • Ambari server/agents • Define a blueprint (blueprint.json) • Define a host mapping (hostmapping.json) • Post the cluster create
  • 13.
    Page13 © HortonworksInc. 2011 – 2015. All Rights Reserved Cloudbreak Cloudbreak is a cloud-agnostic Hadoop as a Service API. Abstracts the provisioning and ease management and monitoring of on-demand clusters. Cloudbreak is a powerful left surf that breaks over a coral reef, a mile off southwest the island of Tavarua, Fiji.
  • 14.
    Page14 © HortonworksInc. 2011 – 2015. All Rights Reserved Cloudbreak • Benefits • Zero configuration • Elastic • Secure • Infrastructure agnostic • Heterogenous clusters • Auto-scaling • Main REST resources • /template – specify an instance group infrastructure • /stack – creates an infrastructure based on a template • /blueprint – describes a Hadoop cluster • /cluster – creates a Hadoop cluster
  • 15.
    Page15 © HortonworksInc. 2011 – 2015. All Rights Reserved Cloudbreak – How it works • Start VMs - with a running Docker daemon • Cloudbreak Bootstrap • Start Consul Cluster • Start Swarm Cluster (Consul for discovery) • Start Ambari servers/agents - Swarm API • Ambari services registered in Consul (Registrator) • Post Blueprint
  • 16.
    Page16 © HortonworksInc. 2011 – 2015. All Rights Reserved Cloudbreak - Features • Support Kerberized clusters • Cloudbreak “recipes” • Automate host configuration • Pre/post Ambari lifecycle hooks • Services reconfiguration • Automate/execute custom actions • Side – effects • Ambari CLI/shell and Groovy based client • Cloud Foundry’s UAA Dockerized • Munchausen – bootstrap Swarm with Consul • Dockerized full Hadoop stack (Apache Hadoop 42K+, Ambari 8K+, Spark 6K+ downloads)
  • 17.
    Page17 © HortonworksInc. 2011 – 2015. All Rights Reserved Cloudbreak - Hadoop as a Service API • Public tech preview • Microsoft Azure • Amazon AWS • Google Cloud Platform • OpenStack • Private tech preview – R&D • Bare metal • Rackspace Managed Cloud • HP Helion Public Cloud *integration SPI is available
  • 18.
    Page18 © HortonworksInc. 2011 – 2015. All Rights Reserved Periscope Periscope is a heuristic Hadoop scheduler associated with a QoS profile. Built on YARN schedulers, cloud and VM resource management API's it allows to associate SLA's to applications and customers. Periscope is a powerful, fast, thick and top- to-bottom right-hander, eastward from Sumbawa's famous west-coast.
  • 19.
    Page19 © HortonworksInc. 2011 – 2015. All Rights Reserved Periscope • Benefits • Zero configuration • Metric and time based alarms • SLA policy based autoscaling • Secure • Hostgroup specific • Main REST resources • /clusters – specify a cluster to be monitored • /alerts– time and metric based • /policies – specify an SLA policy for a cluster based on an alarm • /applications – specify an SLA policy for an application (under development)
  • 20.
    Page20 © HortonworksInc. 2011 – 2015. All Rights Reserved Periscope – How it works • Configures/monitors alarms in Ambari • Setup alarm, cooldown periods • Manages cluster sizes • Allow to associate SLA scaling policies to alarms • Orchestrates Cloudbreak to up/downscale the cluster
  • 21.
    Page21 © HortonworksInc. 2011 – 2015. All Rights Reserved Demo and Q&A

Editor's Notes

  • #2 Two days ago I was working for SequenceIQ, as the CTO.
  • #3  ----- Meeting Notes (10/04/15 20:35) ----- SequenceIQ been acquired. Started February, quickly gain trackion around June.
  • #4  ----- Meeting Notes (10/04/15 20:38) ----- We were doing this over and over again. Scripted, Ansible, tried everything and all existing tools.
  • #5  ----- Meeting Notes (10/04/15 20:38) ----- Architecturally most important components
  • #7  ----- Meeting Notes (10/04/15 20:56) ----- Under the hood is built on: 1. cgroup and namespacing capabilities of the Linux kernel 2. Docker image specification - filesystem composed of layers, presented as one cohesive filesystem Recommended 3.8, works from 2.6.2 3. Libcontainer specification - namespacing, filesystem, resources (cgroups)
  • #8  ----- Meeting Notes (10/04/15 20:56) ----- Docker simplifies things - on one host. We span up containers remotely on many hosts- how? Swarm pulls together many Docker engines - presents as one virtual Docker Engine.
  • #9  ----- Meeting Notes (10/04/15 20:56) ----- Steps: Can span us Docker containers remotely on hosts considering: 1. Resource management - aware of the cluster resources (e.g. can schedule it with bin packing - anywhere where 1GB memory is available) or randomly 2. Constraints using labels (label one node and stsrt the container based on labels) 3. Affinity - containers can be co-scheduled (link, vollumes-from, net=container on the same host)
  • #10  ----- Meeting Notes (10/04/15 21:05) ----- We have a dynamic scaling cluster where nodes are coming/leaving but also failing. Register services in consul, like Ambari services Zookeeper, doozerd, etcd – same as Consul, requires a quorom, offer strong consistency, but not datacenter aware Zookeeper: no service discovery, offers primitive K/V, no DNS, does not go through DC Zookeeper provides ephemeral nodes – but stil clients need to habe keep-alive connections
  • #11 Agent – long running daemon, serves DNS and HTTP interface, every node Client – an agent that forwards all RPC to server. Takes part in LAN gossip Server - participates in RAFT quorum, responds to RPC, WAN gossip Datacenter – low latency, high bandwith private network Gossip – TCP and UDP UNICAST. Usually Broadcast/Multicast does not work in cloud Strong consistency: Service catalog stores all the nodes, service instances, health check data, ACLs, and Key/Value information. It is strongly consistent, and replicated using the consensus protocol. Gossip – eventual consistency, updates to catalog comes through gossip, thus state can lag behind until is reconciled.
  • #12 Most likely you’ve seen an Ambari session Its extensible : Stacks – set of services, multiple versions (e.g. HDP 2.1, HDP 2.2, Bigtop) Services – e.g HDFS, Kafka, Zeppelin Views – capability to add visualization, management and monitoring capabilities of a new “application”
  • #13 Pre-install the server and agents.
  • #14 Combining all these – welcome Cloudbreak. Zero configuration way to provision HDP cluters – anywhere by the push of a button, CLI or API. One consistent infrastructure agnostic API.
  • #15  ----- Meeting Notes (10/04/15 21:47) ----- Expand on points No configuration, need to have a running infrastructure. Any size - 200 nodes in 8 min. OAuth2, gateway (Knox will come), TLS Since YARN - Different services - different instance types: e.g. Spark - high memory, Kafka - high disk thorughput but memory as well to buffer active read/writes Scale based on load
  • #16 View from 10000 meter high Only thing we need is a Docker daemon. All cloud providers are going towards Docker
  • #17 Kerberos – we take the pain (Dockerized a Kerberos server) Recipes – built on Consul events, read results from the K/V store Anybody can push his own plugin: we use plugn – instal lyour plugin, and use it from Cloudbreak We did different projects, fixed quite a few interesting problems.
  • #20 Zero config, does not require pre-installation Can set alarms – based on alarms SLA policies. ----- Meeting Notes (10/04/15 22:04) ----- New features in hadoop 2.6 Our contribution, plus lots of others (move applications between queues), admission control - reserve capacity over time Most likely Vinod explained all these.
  • #21 Mention Baywatch ELK ElasticSearch, Logstash, Kibana – aggregate logs and metrics.
  • #22 Will be a Webex