Docker Based Hadoop Provisioning

Janos Matyas / CTO / SequenceIQ Inc.

GOAL / MOTIVATION
TECHNOLOGY STACK
PROBLEM RESOLUTION / HOW IT WORKS
RESULTS / ACHIEVEMENTS
OVERVIEW

GOAL / MOTIVATION
 Ease Hadoop provisioning – everywhere
 Automate and unify the process
 Arbitrary cluster size
 Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
 (Auto) scaling Hadoop
 QoS

OUR APPROACH
 Use Docker
 Build cloud-specific ‘Dockerized’ images
 Provision the cluster
 Use Ambari

DOCKER
 Lightweight, portable
 Build once, run anywhere
 VM – without the overhead of a VM
 Isolated containers
 Automated and scripted

DOCKER – CONTAINERS vs. VMs
 Containers are isolated, but share OS and,
where appropriate, bins/libraries

APACHE AMBARI – ARCHITECTURE
 Easy Hadoop cluster provisioning
 Management and monitoring
 Key features – blueprints
 REST API

APACHE AMBARI – CREATE CLUSTER
 Define a blueprint (POST /api/v1/blueprints)
 Create cluster (POST /api/v1/clusters/mycluster)

HADOOP PROVISIONG ISSUES
 Each cloud provider has a proprietary API
 Create images for each provider
 Network configuration
 Service discovery
 Resize, failover, member join support

OUR APPROACH – DETAILS
 Build your Docker image
 Install or pre-install Hadoop services with Ambari
 Install Serf and dnsmasq
 Build your cloud image
 Use Ansible to create an image

BUILD DOCKER IMAGES
 Create the Dockerfile
 Have Docker.io to build the image
 Optionally pre-install services
 Use Ambari
 Push image to Docker.io
 Licensing questions

BUILD CLOUD IMAGES
 Use a Docker ready base image
 Use Ansible to provision the image template
 Pull the Docker images
 Apply custom infrastructure
 Use cloud provider specific playbooks
 AWS EC2
 Azure

ANSIBLE
 Configuration as data
 Simplest way to automate IT
 Secure and agentless
 Goal oriented
 One playbook – multiple modules
 We use it to “burn” cloud images/templates

PROVISIONING – ISSUES
 FQDN
 /etc/hosts is read-only in Docker
 Everybody needs to know everybody
 DNS
 Single point of failure
 Dynamic cluster – nodes joining, leaving, failing
 Routing
 Cloud – ability to inter-host container routing
 Collision free private IP range for Docker bridge
 We need predefined host names/IP addresses
 /etc/hosts is read-only in Docker
 Use Ansible to provision the image template
 Pull the Docker images
 Start a DNS server
 Use it as a reference docker run -dns <IP_OF_DNS>
 Nodes need to know each other

PROVISIONING – SOLUTION
 FQDN
 Use –h and –dns Docker params
 DNS
 dnsmasq is running on each Docker container
 Serf member-xxx events trigger dnsmasq reconfiguration
 Routing
 Docker bridge configuration – follows a convention

SERF
 Gossip based membership
 Service discovery
 Decentralized
 Lightweight, fault tolerant
 Highly available
 DevOps friendly
 Keep an eye on Consul, Open vSwitch, pipework

SERF – DECENTRALIZED SERVICE DISCOVERY
 Gossip instead of heartbeat
 LAN, WAN profiles
 Provides membership information
 Event handlers: member_join, member_leave, member_failed, member-
update, member-reap, user
 Query

SERF – MEMBERSHIP, EVENT HANDLERS

DNSMASQ
 Network infrastructure for small networks
 Lightweight DNS, DHCP server
 Comes with most Linux distributions

AWS EC2 – HADOOP CLUSTER
 Use EC2 REST API to provision instances (from Dockerized image)
 Start Docker containers
 One Ambari server
 N-1 Ambari agents connecting to server
 Connect ambari-shell to
 Define blueprint

AWS EC2 – NETWORK SECURITY
 Create a VPC
 Configure subnets
 Routing tables
 Security gateway
 Set ACL
 Configure VPN

AWS EC2 - CLOUDFORMATION
 Manually set up VPC is too complicated
 Use CloudFormation
 Manage the stack together
 Template-based
 Environments under version control
 Customizable at runtime
 No extra charge
"VpcId" : {
"Type" : "String",
"Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
},
"SubnetId" : {
"Type" : "String",
"Description" : "SubnetId of an existing subnet (for the primary
network) in your Virtual Private Cloud (VPC)"
},
"SecondaryIPAddressCount" : {
"Type" : "Number",
"Default" : "1",
"MinValue" : "1",
"MaxValue" : "5",
"Description" : "Number of secondary IP addresses to assign to the
network interface (1-5)",
"ConstraintDescription": "must be a number from 1 to 5."
},
"SSHLocation" : {
"Description" : "The IP address range that can be used to SSH to the
EC2 instances",
"Type": "String",
"MinLength": "9",
"MaxLength": "18",
"Default": "0.0.0.0/0",
"AllowedPattern": "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})/
(d{1,2})",
"ConstraintDescription": "must be a valid IP CIDR range of the form
x.x.x.x/x."
}
},

CLOUDBREAK
Cloudbreak is a powerful left surf that
breaks over a coral reef, a mile off
southwest the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic
Hadoop as a Service API. Abstracts
the provisioning and ease
management and monitoring of on-
demand clusters.
Provisioning Hadoop has never been easier

CLOUDBREAK
 Benefits
 Elastic
 Scalable
 Blueprints
 Flexible
 Main REST resources
 /template – specify a cluster infrastructure
 /stack – creates a cloud infrastructure built from a template
 /blueprint – describes a Hadoop cluster
 /cluster – creates a Hadoop cluster

RESULTS AND ACHIEVEMENTS
 Hadoop as a Service API
 Available for EC2 and Azure cloud
 OpenStack, bare metal is coming soon
 Open source under Apache 2 licence
 Same goals as Apache Ambari Launchpad project
 What's next?

HADOOP SERVICES - AS A SERVICE
 Leverage YARN
 Slider (Hoya) providers
 HBase, Accumulo
 SequenceIQ providers - Flume, Tomcat
 YARN -1964
 QoS for YARN – heuristic scheduler
 Platform as a Service API

BANZAI PIPELINE
Banzai Pipeline is a surf reef break located
in Hawaii, off Ehukai Beach Park in
Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful
application development
platform for building on-
demand data and job pipelines
running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST

THANK YOU
 Get the code: https://github.com/sequenceiq
 Read about: http://blog.sequenceiq.com
 Facebook: http://facebook.com/sequenceiq
 Twitter: http://twitter.com/sequenceiq
 LinkedIn: http://linkedin.com/sequenceiq
 Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE

Docker Based Hadoop Provisioning

More Related Content

What's hot

Viewers also liked

Similar to Docker Based Hadoop Provisioning

More from DataWorks Summit

Recently uploaded

Docker Based Hadoop Provisioning

Editor's Notes