Docker based Hadoop provisioning - Hadoop Summit 2014


  • Thanks for coming – today we will talk about Docker-based Hadoop provisioning.
    A quick introduction of who we are: a young startup from Budapest, Hungary. Janos Matyas – CTO, open source contributor, Hadoop YARN evangelist.
  • Why did we start this at all – there are so many options.
    We repeated the same steps over and over – and scripted them. Still, we felt something was missing.
    See bullet points.
  • We went through many different approaches – bare metal, cloud VMs, and so on – and ended up using Docker.
    We tested many provisioning frameworks – Ambari is the one.
  • Quick question – how many of you have used Docker before?
    Docker is a container-based virtualization framework. Unlike traditional virtualization, Docker is fast, lightweight and easy to use. Docker lets you create containers holding all the dependencies of an application. Each container is kept isolated from the others; only the host kernel is shared.
  • I can run 5-6 containers with less overhead than one VirtualBox VM. No SOCKS proxy, etc.
  • The ‘provisioning’ framework. No need to go into details – there were pretty good sessions about Ambari.
    Blueprints: tech preview in 1.5.1, fully supported in 1.6. Blueprint = stack definition + component layout.
    REST API – we have created and open sourced an Ambari client + shell (come and join the Ambari meetup today at 3:30).
  • Now, the issues.
    Do it again and again – for each cloud provider.
    Create the image – but how do you know the requirements when building an image each and every time?
    Network – this is a big issue. EC2 has its API, Azure has its own. OpenStack has a network-as-a-service component – Neutron. SDN – software-defined networking!
    Everything is dynamic – how do you do service discovery?
    Extra features – a fully dynamic Hadoop cluster.
  • Will expand on these shortly.
    Sounds too easy – let's get into the details.
  • A Docker image is described by a Dockerfile – like a Vagrantfile for VirtualBox, for example.
    You want trusted builds – use them.
    Faster provisioning – a 100+ node Hadoop cluster in less than 5 minutes? Come and join the Ambari meetup.
    Licensing – Ganglia or Nagios (BSD and GPL). Hortonworks Hadoop – Apache 2.
    Bigtop is coming…
  • Amazon Linux – Red Hat based – is now Docker-ready. The OpenStack Nova hypervisor supports Docker.
    Apply the network and other infrastructure-related configuration.
    Remember the licensing – use our Ansible script to build your cloud image. Or modify it.
  • The IT automation war – Ansible vs. Chef, Puppet.
    Ansible configurations are simple data descriptions of your infrastructure (both human-readable and machine-parsable).
    It needs only SSH.
  • Dev environment: use the default Docker bridge (easy).
    Everything talks to each other.
    DNS – heavy management overhead.
  • -h for the hostname, --dns to specify the DNS server to use.
    Convention: the AMI launch index.
  • Serf is a decentralized solution for cluster membership, failure detection and orchestration.
    Compare Serf with ZooKeeper, etcd and doozerd: the latter three have server nodes that require a quorum to operate – strong consistency.
    Serf – eventual consistency.
    The most important thing is that it is gossip based – will expand shortly.
    Decentralized – all nodes are equal.
  • Fire and forget.
    Query waits for answers – limited response collection.
    Custom event handlers.
    Tags – e.g. Ambari server, host groups, etc.
  • Load increases – how does the cluster know that there is a new member?
  • Running on each Docker container – updated by Serf events.
  • Amazon supports Docker natively.
    Start N nodes, passing our userdata script at startup.
    Start the containers – they will find each other using Serf.
    Use the shell, the REST API or the Ambari UI.
  • You need security – it is strongly recommended to use your own VPC instead of the default VPC.
    Use different availability zones for maximum uptime.
  • Whoever has set up a VPC knows – it can be scripted. It is harder to decommission / change / delete components than to add them.
    Use CloudFormation.
  • This is an easy but still error-prone process – though it helps a lot.
    We built an API on top and automated the whole process.
    We are not a service provider – this is an API.
  • Elastic – arbitrary number of nodes.
    Scalable – follows your workload changes.
    Blueprints – supports different cluster blueprints.
    Flexible – use your favorite cloud, bring your own Hadoop – one common API.
  • One API – any size, anywhere.
    Why we needed Cloudbreak – this is not the end of the story.
  • We wanted to have a Platform as a Service API.
    We are YARN evangelists – wanted to run everything on YARN.
    Community driven.
    Heuristic scheduler.
  • A fully dynamic big data pipeline.
    Build your pipeline and run it dynamically / on demand. All pre-coded – zero coding, only configuration.
    Data pipeline – run services on demand, short or long term. Start when needed, stop when idle. Apply ETL on demand.
    Job pipeline – all major ML libraries are supported (Mahout, MLlib), plus 44 other MR jobs (correlations, joins, summarizations, filtering, sort, sharding, shuffle).
    Streaming pipeline – Spark based.
    Custom SDK – abstracts the complexity behind MR and Spark.
  • Subscribe to the beta test.
    We have contributed to several Apache and other open source projects.
    A Babel of languages at SequenceIQ: Java and Scala are the defaults. Groovy is used very often. Then Go – Docker + Serf – we had to learn Go to fix things. Ansible for IT.
    We strongly suggest using Docker – we use it everywhere: CI/CD, cloud.
    For a demo, come and join the Ambari meetup.
    Thanks for coming. Q&A. Join me after, or follow us through one of the social media channels listed.
    1. Janos Matyas / CTO / SequenceIQ Inc.
    3. GOAL / MOTIVATION
       • Ease Hadoop provisioning – everywhere
       • Automate and unify the process
       • Arbitrary cluster size
       • Same process through the cluster lifecycle (Dev, QA, UAT, Prod)
       • (Auto) scaling Hadoop
       • QoS
    4. OUR APPROACH
       • Use Docker
       • Build cloud-specific ‘Dockerized’ images
       • Provision the cluster
       • Use Ambari
    5. DOCKER
       • Lightweight, portable
       • Build once, run anywhere
       • VM – without the overhead of a VM
       • Isolated containers
       • Automated and scripted
    6. DOCKER – CONTAINERS vs. VMs
       • Containers are isolated, but share the OS and, where appropriate, bins/libraries
    7. APACHE AMBARI – ARCHITECTURE
       • Easy Hadoop cluster provisioning
       • Management and monitoring
       • Key features – blueprints
       • REST API
    8. APACHE AMBARI – CREATE CLUSTER
       • Define a blueprint (POST /api/v1/blueprints)
       • Create cluster (POST /api/v1/clusters/mycluster)
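The two REST calls above take JSON payloads. As a hedged sketch, a minimal blueprint might look like the following (the blueprint name, host-group layout and HDP stack version are illustrative, not the exact ones used in the talk):

```json
{
  "host_groups": [
    {
      "name": "master",
      "components": [ { "name": "NAMENODE" }, { "name": "RESOURCEMANAGER" } ],
      "cardinality": "1"
    },
    {
      "name": "slave",
      "components": [ { "name": "DATANODE" }, { "name": "NODEMANAGER" } ],
      "cardinality": "1+"
    }
  ],
  "Blueprints": {
    "blueprint_name": "multi-node-hdfs-yarn",
    "stack_name": "HDP",
    "stack_version": "2.1"
  }
}
```

The blueprint only describes the stack and component layout; the cluster-creation call then maps concrete hosts onto the host groups.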
    9. HADOOP PROVISIONING ISSUES
       • Each cloud provider has a proprietary API
       • Create images for each provider
       • Network configuration
       • Service discovery
       • Resize, failover, member-join support
    10. OUR APPROACH – DETAILS
       • Build your Docker image
         • Install or pre-install Hadoop services with Ambari
         • Install Serf and dnsmasq
       • Build your cloud image
         • Use Ansible to create an image
       • Provision the cluster
    11. BUILD DOCKER IMAGES
       • Create the Dockerfile
       • Build the image
       • Optionally pre-install services
         • Use Ambari
       • Push the image
       • Licensing questions
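A minimal Dockerfile sketch for such an image could look roughly like this (the base image, packages and exposed port are assumptions for illustration, not the exact SequenceIQ build):

```dockerfile
# Hypothetical sketch of an Ambari-ready image; package names are assumptions
FROM centos:6

# Basic tooling the Ambari server/agent setup would need
RUN yum install -y tar curl openssh-server openssh-clients

# Serf (membership / service discovery) and dnsmasq (DNS) would be installed
# here as well – exact install steps elided, see the Serf and dnsmasq slides

# Ambari web UI port
EXPOSE 8080
CMD ["/usr/sbin/sshd", "-D"]
```

Building once and pushing the result means every node of the cluster starts from an identical, pre-baked image instead of being configured from scratch.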
    12. BUILD CLOUD IMAGES
       • Use a Docker-ready base image
       • Use Ansible to provision the image template
         • Pull the Docker images
         • Apply custom infrastructure
       • Use cloud provider specific playbooks
         • AWS EC2
         • Azure
    13. ANSIBLE
       • Configuration as data
       • Simplest way to automate IT
       • Secure and agentless
       • Goal oriented
       • One playbook – multiple modules
       • We use it to “burn” cloud images/templates
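As an illustration of “configuration as data”, a minimal playbook for burning a Docker-ready image template might look like this (the host group and image name are assumptions):

```yaml
# Hypothetical playbook sketch: prepare a Docker-ready cloud image template
- hosts: image-builder
  become: yes
  tasks:
    - name: Install Docker
      yum:
        name: docker
        state: present

    - name: Pull the (hypothetical) Hadoop image so instances boot pre-loaded
      command: docker pull sequenceiq/ambari
```

Because the playbook is plain YAML over SSH, the same description can be replayed against an EC2 or Azure image builder with only the inventory changing.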
    14. PROVISIONING – ISSUES
       • FQDN
         • /etc/hosts is read-only in Docker
         • Everybody needs to know everybody
       • DNS
         • Single point of failure
         • Dynamic cluster – nodes joining, leaving, failing
       • Routing
         • Cloud – ability to route containers between hosts
         • Collision-free private IP range for the Docker bridge
    15. PROVISIONING – SOLUTION
       • FQDN
         • Use the -h and --dns Docker params
       • DNS
         • dnsmasq is running on each Docker container
         • Serf member-xxx events trigger dnsmasq reconfiguration
       • Routing
         • Docker bridge configuration – follows a convention
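The conventions above can be sketched in a few lines of shell (the image name, domain and addresses are assumptions; NODE_INDEX would come from the EC2 AMI launch index mentioned earlier):

```shell
# NODE_INDEX would be read from the instance metadata (AMI launch index)
NODE_INDEX=2
HOSTNAME="amb${NODE_INDEX}.mycorp.com"        # fixed FQDN, passed via -h
DNS_SERVER="172.17.0.2"                       # container running dnsmasq (assumed address)
BRIDGE_CIDR="172.17.$((NODE_INDEX + 1)).1/24" # collision-free per-host bridge (assumed convention)

# The docker run command this convention yields (echoed rather than executed here)
CMD="docker run -d -h ${HOSTNAME} --dns ${DNS_SERVER} sequenceiq/ambari"
echo "$CMD"
```

Deriving both the hostname and the bridge subnet from the launch index is what makes the addressing deterministic without any central coordination.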
    16. SERF
       • Gossip based membership
       • Service discovery
       • Decentralized
       • Lightweight, fault tolerant
       • Highly available
       • DevOps friendly
       • Keep an eye on Consul, Open vSwitch, pipework
    17. SERF – DECENTRALIZED SERVICE DISCOVERY
       • Gossip instead of heartbeats
       • LAN, WAN profiles
       • Provides membership information
       • Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
       • Query
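A Serf event handler is just an executable: Serf exports the event name in $SERF_EVENT and passes one member per line on stdin ("name address role tags"). A hypothetical member-join handler feeding dnsmasq might look like this (the hosts-file path is an assumption; removal on member-leave is omitted):

```shell
# Hosts file that dnsmasq would be configured to read (path is an assumption)
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts.serf}"

# Append an "address name" entry for every member announced on stdin
handle_member_join() {
  while read -r name address _; do
    echo "$address $name" >> "$HOSTS_FILE"
  done
}

# Dispatch only when Serf actually fired the event; member-leave and
# member-failed handlers would remove entries and are omitted here
if [ "${SERF_EVENT:-}" = "member-join" ]; then
  handle_member_join
fi
```

After updating the file, the real handler would signal dnsmasq to re-read it, which is how the "Serf events trigger dnsmasq reconfiguration" step on slide 15 closes the loop.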
    18. SERF – GOSSIPING
    20. DNSMASQ
       • Network infrastructure for small networks
       • Lightweight DNS, DHCP server
       • Comes with most Linux distributions
    21. AWS EC2 – HADOOP CLUSTER
       • Use the EC2 REST API to provision instances (from the Dockerized image)
       • Start Docker containers
         • One Ambari server
         • N-1 Ambari agents connecting to the server
       • Connect ambari-shell to the server
       • Define a blueprint
       • Provision the cluster
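The last two steps can be sketched against the Ambari REST API (the server address, credentials and host names below are assumptions; the blueprint itself is assumed to be registered already):

```shell
# Ambari server reachable inside the VPC (address is an assumption)
AMBARI="http://amb0.mycorp.com:8080"

# Host-group -> hosts mapping sent at cluster-creation time; it references
# the blueprint registered in the previous step
CLUSTER_TEMPLATE='{
  "blueprint": "multi-node-hdfs-yarn",
  "host_groups": [
    { "name": "master", "hosts": [ { "fqdn": "amb1.mycorp.com" } ] },
    { "name": "slave",  "hosts": [ { "fqdn": "amb2.mycorp.com" } ] }
  ]
}'

# Requires a live Ambari server, hence commented out in this sketch:
# curl -u admin:admin -H "X-Requested-By: ambari" \
#      -X POST -d "$CLUSTER_TEMPLATE" "$AMBARI/api/v1/clusters/mycluster"
```

The FQDNs in the mapping are exactly the hostnames the containers were started with via -h, which is why the deterministic naming convention matters.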
    22. AWS EC2 – NETWORK SECURITY
       • Create a VPC
       • Configure subnets
       • Routing tables
       • Security gateway
       • Set ACLs
       • Configure VPN
    23. AWS EC2 – CLOUDFORMATION
       • Manually setting up a VPC is too complicated
       • Use CloudFormation
         • Manage the stack together
         • Template-based
         • Environments under version control
         • Customizable at runtime
         • No extra charge

         "VpcId" : {
           "Type" : "String",
           "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
         },
         "SubnetId" : {
           "Type" : "String",
           "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
         },
         "SecondaryIPAddressCount" : {
           "Type" : "Number",
           "Default" : "1",
           "MinValue" : "1",
           "MaxValue" : "5",
           "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
           "ConstraintDescription" : "must be a number from 1 to 5."
         },
         "SSHLocation" : {
           "Description" : "The IP address range that can be used to SSH to the EC2 instances",
           "Type" : "String",
           "MinLength" : "9",
           "MaxLength" : "18",
           "Default" : "",
           "AllowedPattern" : "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
           "ConstraintDescription" : "must be a valid IP CIDR range of the form x.x.x.x/x."
         }
       },
    24. CLOUDBREAK
       • Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.
       • Cloudbreak is a cloud-agnostic Hadoop-as-a-Service API. It abstracts provisioning and eases the management and monitoring of on-demand clusters.
       • Provisioning Hadoop has never been easier
    25. CLOUDBREAK
       • Benefits
         • Elastic
         • Scalable
         • Blueprints
         • Flexible
       • Main REST resources
         • /template – specifies a cluster infrastructure
         • /stack – creates a cloud infrastructure built from a template
         • /blueprint – describes a Hadoop cluster
         • /cluster – creates a Hadoop cluster
    26. RESULTS AND ACHIEVEMENTS
       • Hadoop-as-a-Service API
       • Available for the EC2 and Azure clouds
       • OpenStack and bare metal are coming soon
       • Open source under the Apache 2 license
       • Same goals as the Apache Ambari Launchpad project
       • What's next?
    27. HADOOP SERVICES – AS A SERVICE
       • Leverage YARN
       • Slider (Hoya) providers – HBase, Accumulo
       • SequenceIQ providers – Flume, Tomcat
       • YARN-1964
       • QoS for YARN – heuristic scheduler
       • Platform-as-a-Service API
    28. BANZAI PIPELINE
       • Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.
       • Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.
       • Banzai Pipeline is a big data API for the REST
    29. THANK YOU
       • Get the code:
       • Read about:
       • Facebook:
       • Twitter:
       • LinkedIn:
       • Contact:
       • FEEL FREE TO CONTRIBUTE