Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Docker based Hadoop provisioning - Hadoop Summit 2014

19,394 views

Published on

Docker based Hadoop provisioning in the cloud and on-premise/physical hardware

Published in: Software, Technology
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • http://dbmanagement.info/Tutorials/Hadoop.htm
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Docker based Hadoop provisioning - Hadoop Summit 2014

  1. 1. Janos Matyas / CTO / SequenceIQ Inc.
  2. 2. GOAL / MOTIVATION TECHNOLOGY STACK PROBLEM RESOLUTION / HOW IT WORKS RESULTS / ACHIEVEMENTS OVERVIEW
  3. 3. GOAL / MOTIVATION  Ease Hadoop provisioning – everywhere  Automate and unify the process  Arbitrary cluster size  Same process through a cluster lifecycle (Dev, QA, UAT, Prod)  (Auto) scaling Hadoop  QoS
  4. 4. OUR APPROACH  Use Docker  Build cloud-specific ‘Dockerized’ images  Provision the cluster  Use Ambari
  5. 5. DOCKER  Lightweight, portable  Build once, run anywhere  VM – without the overhead of a VM  Isolated containers  Automated and scripted
  6. 6. DOCKER – CONTAINERS vs. VMs  Containers are isolated, but share OS and, where appropriate, bins/libraries
  7. 7. APACHE AMBARI – ARCHITECTURE  Easy Hadoop cluster provisioning  Management and monitoring  Key features – blueprints  REST API
  8. 8. APACHE AMBARI – CREATE CLUSTER  Define a blueprint (POST /api/v1/blueprints)  Create cluster (POST /api/v1/clusters/mycluster)
  9. 9. HADOOP PROVISIONG ISSUES  Each cloud provider has a proprietary API  Create images for each provider  Network configuration  Service discovery  Resize, failover, member join support
  10. 10. OUR APPROACH – DETAILS  Build your Docker image  Install or pre-install Hadoop services with Ambari  Install Serf and dnsmasq  Build your cloud image  Use Ansible to create an image  Provision the cluster
  11. 11. BUILD DOCKER IMAGES  Create the Dockerfile  Have Docker.io to build the image  Optionally pre-install services  Use Ambari  Push image to Docker.io  Licensing questions
  12. 12. BUILD CLOUD IMAGES  Use a Docker ready base image  Use Ansible to provision the image template  Pull the Docker images  Apply custom infrastructure  Use cloud provider specific playbooks  AWS EC2  Azure
  13. 13. ANSIBLE  Configuration as data  Simplest way to automate IT  Secure and agentless  Goal oriented  One playbook – multiple modules  We use it to “burn” cloud images/templates
  14. 14. PROVISIONING – ISSUES  FQDN  /etc/hosts is read-only in Docker  Everybody needs to know everybody  DNS  Single point of failure  Dynamic cluster – nodes joining, leaving, failing  Routing  Cloud – ability to inter-host container routing  Collision free private IP range for Docker bridge
  15. 15. PROVISIONING – SOLUTION  FQDN  Use –h and –dns Docker params  DNS  dnsmasq is running on each Docker container  Serf member-xxx events trigger dnsmasq reconfiguration  Routing  Docker bridge configuration – follows a convention
  16. 16. SERF  Gossip based membership  Service discovery  Decentralized  Lightweight, fault tolerant  Highly available  DevOps friendly  Keep an eye on Consul, Open vSwitch, pipework
  17. 17. SERF – DECENTRALIZED SERVICE DISCOVERY  Gossip instead of heartbeat  LAN, WAN profiles  Provides membership information  Event handlers: member_join, member_leave, member_failed, member- update, member-reap, user  Query
  18. 18. SERF – GOSSIPING
  19. 19. SERF – MEMBERSHIP, EVENT HANDLERS
  20. 20. DNSMASQ  Network infrastructure for small networks  Lightweight DNS, DHCP server  Comes with most Linux distributions
  21. 21. AWS EC2 – HADOOP CLUSTER  Use EC2 REST API to provision instances (from Dockerized image)  Start Docker containers  One Ambari server  N-1 Ambari agents connecting to server  Connect ambari-shell to  Define blueprint  Provision the cluster
  22. 22. AWS EC2 – NETWORK SECURITY  Create a VPC  Configure subnets  Routing tables  Security gateway  Set ACL  Configure VPN
  23. 23. AWS EC2 - CLOUDFORMATION  Manually set up VPC is too complicated  Use CloudFormation  Manage the stack together  Template-based  Environments under version control  Customizable at runtime  No extra charge "VpcId" : { "Type" : "String", "Description" : "VpcId of your existing Virtual Private Cloud (VPC)" }, "SubnetId" : { "Type" : "String", "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)" }, "SecondaryIPAddressCount" : { "Type" : "Number", "Default" : "1", "MinValue" : "1", "MaxValue" : "5", "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)", "ConstraintDescription": "must be a number from 1 to 5." }, "SSHLocation" : { "Description" : "The IP address range that can be used to SSH to the EC2 instances", "Type": "String", "MinLength": "9", "MaxLength": "18", "Default": "0.0.0.0/0", "AllowedPattern": "(d{1,3}).(d{1,3}).(d{1,3}).(d{1,3})/ (d{1,2})", "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x." } },
  24. 24. CLOUDBREAK Cloudbreak is a powerful left surf that breaks over a coral reef, a mile off southwest the island of Tavarua, Fiji. Cloudbreak is a cloud-agnostic Hadoop as a Service API. Abstracts the provisioning and ease management and monitoring of on- demand clusters. Provisioning Hadoop has never been easier
  25. 25. CLOUDBREAK  Benefits  Elastic  Scalable  Blueprints  Flexible  Main REST resources  /template – specify a cluster infrastructure  /stack – creates a cloud infrastructure built from a template  /blueprint – describes a Hadoop cluster  /cluster – creates a Hadoop cluster
  26. 26. RESULTS AND ACHIEVEMENTS  Hadoop as a Service API  Available for EC2 and Azure cloud  OpenStack, bare metal is coming soon  Open source under Apache 2 licence  Same goals as Apache Ambari Launchpad project  What's next?
  27. 27. HADOOP SERVICES - AS A SERVICE  Leverage YARN  Slider (Hoya) providers  HBase, Accumulo  SequenceIQ providers - Flume, Tomcat  YARN -1964  QoS for YARN – heuristic scheduler  Platform as a Service API
  28. 28. BANZAI PIPELINE Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore. Banzai Pipeline is a RESTful application development platform for building on- demand data and job pipelines running on Hadoop YARN. Banzai Pipeline is a big data API for the REST
  29. 29. THANK YOU  Get the code: https://github.com/sequenceiq  Read about: http://blog.sequenceiq.com  Facebook: http://facebook.com/sequenceiq  Twitter: http://twitter.com/sequenceiq  LinkedIn: http://linkedin.com/sequenceiq  Contact: janos.matyas@sequenceiq.com FEEL FREE TO CONTRIBUTE

×