Janos Matyas / CTO / SequenceIQ Inc.
OVERVIEW
 GOAL / MOTIVATION
 TECHNOLOGY STACK
 PROBLEM RESOLUTION / HOW IT WORKS
 RESULTS / ACHIEVEMENTS
GOAL / MOTIVATION
 Ease Hadoop provisioning – everywhere
 Automate and unify the process
 Arbitrary cluster size
 Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
 (Auto) scaling Hadoop
 QoS
OUR APPROACH
 Use Docker
 Build cloud-specific ‘Dockerized’ images
 Provision the cluster
 Use Ambari
DOCKER
 Lightweight, portable
 Build once, run anywhere
 VM – without the overhead of a VM
 Isolated containers
 Automated and scripted
DOCKER – CONTAINERS vs. VMs
 Containers are isolated, but share OS and,
where appropriate, bins/libraries
APACHE AMBARI – ARCHITECTURE
 Easy Hadoop cluster provisioning
 Management and monitoring
 Key features – blueprints
 REST API
APACHE AMBARI – CREATE CLUSTER
 Define a blueprint (POST /api/v1/blueprints)
 Create cluster (POST /api/v1/clusters/mycluster)
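
The two calls above can be sketched as plain HTTP requests. This is a minimal illustration that only builds the requests and does not send them. The server address, host group names, component lists, stack version and host names are all assumptions made for the example, and authentication is left out. Ambari's blueprint documentation registers a blueprint under its name, so the sketch posts to /api/v1/blueprints/<name>.

```python
import json
from urllib.request import Request

AMBARI = "http://ambari.example.internal:8080"  # hypothetical server address


def _post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request for the Ambari REST API."""
    return Request(
        AMBARI + path,
        data=json.dumps(payload).encode("utf-8"),
        # Ambari rejects modifying calls that lack the X-Requested-By header.
        headers={"X-Requested-By": "ambari", "Content-Type": "application/json"},
        method="POST",
    )


def blueprint_request(name: str) -> Request:
    """Step 1: describe the cluster layout as a blueprint (illustrative content)."""
    blueprint = {
        "Blueprints": {"blueprint_name": name, "stack_name": "HDP", "stack_version": "2.1"},
        "host_groups": [
            {
                "name": "master",
                "cardinality": "1",
                "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}],
            },
            {
                "name": "worker",
                "cardinality": "1+",
                "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}],
            },
        ],
    }
    return _post(f"/api/v1/blueprints/{name}", blueprint)


def cluster_request(cluster: str, blueprint: str, hosts_by_group: dict) -> Request:
    """Step 2: map concrete hosts onto the blueprint's host groups."""
    body = {
        "blueprint": blueprint,
        "host_groups": [
            {"name": group, "hosts": [{"fqdn": fqdn} for fqdn in hosts]}
            for group, hosts in hosts_by_group.items()
        ],
    }
    return _post(f"/api/v1/clusters/{cluster}", body)
```

Sending either request would be a matter of passing it to urllib.request.urlopen once credentials are added.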
HADOOP PROVISIONING ISSUES
 Each cloud provider has a proprietary API
 Create images for each provider
 Network configuration
 Service discovery
 Resize, failover, member join support
OUR APPROACH – DETAILS
 Build your Docker image
 Install or pre-install Hadoop services with Ambari
 Install Serf and dnsmasq
 Build your cloud image
 Use Ansible to create an image
 Provision the cluster
BUILD DOCKER IMAGES
 Create the Dockerfile
 Have Docker.io build the image
 Optionally pre-install services
 Use Ambari
 Push image to Docker.io
 Licensing questions
BUILD CLOUD IMAGES
 Use a Docker ready base image
 Use Ansible to provision the image template
 Pull the Docker images
 Apply custom infrastructure
 Use cloud provider specific playbooks
 AWS EC2
 Azure
ANSIBLE
 Configuration as data
 Simplest way to automate IT
 Secure and agentless
 Goal oriented
 One playbook – multiple modules
 We use it to “burn” cloud images/templates
PROVISIONING – ISSUES
 FQDN
 /etc/hosts is read-only in Docker
 Everybody needs to know everybody
 DNS
 Single point of failure
 Dynamic cluster – nodes joining, leaving, failing
 Routing
 Cloud – ability to route between containers on different hosts
 Collision free private IP range for Docker bridge
 We need predefined host names/IP addresses
 /etc/hosts is read-only in Docker
 Use Ansible to provision the image template
 Pull the Docker images
 Start a DNS server
 Reference it with docker run --dns <IP_OF_DNS>
 Nodes need to know each other
PROVISIONING – SOLUTION
 FQDN
 Use -h and --dns Docker params
 DNS
 dnsmasq is running on each Docker container
 Serf member-xxx events trigger dnsmasq reconfiguration
 Routing
 Docker bridge configuration – follows a convention
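
A rough sketch of the solution above: the slides do not spell out the bridge convention, so this example assumes one /24 per host carved out of a private /16 and indexed by the instance's launch index. The base range, image name, hostname and DNS address are hypothetical. Only the docker run flags -d, -h and --dns are taken from Docker itself.

```python
import ipaddress


def bridge_subnet(launch_index: int, base: str = "172.17.0.0/16") -> ipaddress.IPv4Network:
    """Pick a per-host /24 for the Docker bridge so that hosts never collide.

    Assumed convention: host number N gets the N-th /24 of the base range.
    """
    subnets = list(ipaddress.ip_network(base).subnets(new_prefix=24))
    if not 0 <= launch_index < len(subnets):
        raise ValueError(f"launch index {launch_index} does not fit into {base}")
    return subnets[launch_index]


def docker_run_args(image: str, hostname: str, dns_ip: str) -> list:
    """Build the docker run command line with a fixed hostname and DNS server."""
    ipaddress.ip_address(dns_ip)  # raises ValueError on a malformed address
    return ["docker", "run", "-d", "-h", hostname, "--dns", dns_ip, image]
```

The returned list can be handed to subprocess.run on a host that has Docker installed.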
SERF
 Gossip based membership
 Service discovery
 Decentralized
 Lightweight, fault tolerant
 Highly available
 DevOps friendly
 Keep an eye on Consul, Open vSwitch, pipework
SERF – DECENTRALIZED SERVICE DISCOVERY
 Gossip instead of heartbeat
 LAN, WAN profiles
 Provides membership information
 Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
 Query
SERF – GOSSIPING
SERF – MEMBERSHIP, EVENT HANDLERS
DNSMASQ
 Network infrastructure for small networks
 Lightweight DNS, DHCP server
 Comes with most Linux distributions
AWS EC2 – HADOOP CLUSTER
 Use EC2 REST API to provision instances (from Dockerized image)
 Start Docker containers
 One Ambari server
 N-1 Ambari agents connecting to server
 Connect ambari-shell to
 Define blueprint
 Provision the cluster
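
The "one server, N-1 agents" layout can be sketched as a small planning step. The choice of the first host as the Ambari server and the host names used below are assumptions for illustration only.

```python
def assign_roles(hosts: list) -> list:
    """Give the first host the Ambari server role and every other host an agent role.

    Each entry records which server the container should connect to.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    server, *agents = hosts
    plan = [{"host": server, "role": "ambari-server", "server": server}]
    plan += [{"host": host, "role": "ambari-agent", "server": server} for host in agents]
    return plan
```

For three hosts this yields one server entry followed by two agent entries that all point at the first host.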
AWS EC2 – NETWORK SECURITY
 Create a VPC
 Configure subnets
 Routing tables
 Security gateway
 Set ACL
 Configure VPN
AWS EC2 - CLOUDFORMATION
 Manually setting up a VPC is too complicated
 Use CloudFormation
 Manage the stack together
 Template-based
 Environments under version control
 Customizable at runtime
 No extra charge
  "VpcId" : {
    "Type" : "String",
    "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
  },
  "SubnetId" : {
    "Type" : "String",
    "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
  },
  "SecondaryIPAddressCount" : {
    "Type" : "Number",
    "Default" : "1",
    "MinValue" : "1",
    "MaxValue" : "5",
    "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
    "ConstraintDescription": "must be a number from 1 to 5."
  },
  "SSHLocation" : {
    "Description" : "The IP address range that can be used to SSH to the EC2 instances",
    "Type": "String",
    "MinLength": "9",
    "MaxLength": "18",
    "Default": "0.0.0.0/0",
    "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
    "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
  }
},
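
The SSHLocation constraints can be checked locally before a stack is launched. The sketch below mirrors the length limits from the template fragment and assumes the pattern is the usual dotted-quad CIDR regex, with \d digit classes and escaped dots. The function and dictionary names are made up for this example.

```python
import re

# Constraints copied from the SSHLocation parameter; the pattern is written as a Python raw string.
SSH_LOCATION_RULES = {
    "MinLength": 9,
    "MaxLength": 18,
    "AllowedPattern": r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})",
}


def is_valid_ssh_location(value: str, rules: dict = SSH_LOCATION_RULES) -> bool:
    """Check a candidate value against the length limits and the allowed pattern."""
    if not rules["MinLength"] <= len(value) <= rules["MaxLength"]:
        return False
    # The whole value has to match the pattern, not just a substring of it.
    return re.fullmatch(rules["AllowedPattern"], value) is not None
```

The pattern only checks the shape of the value, so an out-of-range string such as 999.999.999.999/99 would still pass. A stricter check could use ipaddress.ip_network.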
CLOUDBREAK
Cloudbreak is a powerful left surf break
over a coral reef, a mile southwest of
the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic
Hadoop as a Service API. It abstracts
the provisioning and eases the
management and monitoring of
on-demand clusters.
Provisioning Hadoop has never been easier
CLOUDBREAK
 Benefits
 Elastic
 Scalable
 Blueprints
 Flexible
 Main REST resources
 /template – specify a cluster infrastructure
 /stack – creates a cloud infrastructure built from a template
 /blueprint – describes a Hadoop cluster
 /cluster – creates a Hadoop cluster
RESULTS AND ACHIEVEMENTS
 Hadoop as a Service API
 Available for EC2 and Azure cloud
 OpenStack and bare metal are coming soon
 Open source under Apache 2 licence
 Same goals as Apache Ambari Launchpad project
 What's next?
HADOOP SERVICES - AS A SERVICE
 Leverage YARN
 Slider (Hoya) providers
 HBase, Accumulo
 SequenceIQ providers - Flume, Tomcat
 YARN-1964
 QoS for YARN – heuristic scheduler
 Platform as a Service API
BANZAI PIPELINE
Banzai Pipeline is a surf reef break located
in Hawaii, off Ehukai Beach Park in
Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful
application development
platform for building
on-demand data and job pipelines
running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
THANK YOU
 Get the code: https://github.com/sequenceiq
 Read about: http://blog.sequenceiq.com
 Facebook: http://facebook.com/sequenceiq
 Twitter: http://twitter.com/sequenceiq
 LinkedIn: http://linkedin.com/sequenceiq
 Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE

Docker Based Hadoop Provisioning


Editor's Notes

  • #14 YAML
  • #15 Dev – env : use default Docker bridge (easy)
  • #16 -h for hostname, --dns to specify the DNS service to use. Convention: AMI launch index
  • #18 Fire and forget. Waits for answer – limited response collection