Hadoop Meetup Jan 2019 - Hadoop On Azure

Íñigo Goiri of Microsoft presents the state of running Hadoop on Azure, Microsoft's cloud computing platform. He discusses Azure features for running offline workloads cheaply and the modifications made to Hadoop to take advantage of them.

This talk is from the Apache Hadoop Contributors Meetup on January 30, 2019, hosted by LinkedIn in Mountain View.

  1. Hadoop on Azure (presenter: Íñigo Goiri)
  2. Commercial options
     • Azure offers Hadoop and Spark:
       • HDInsight
       • Azure Databricks
     • Our target:
       • “Raw” VMs
       • Pure Hadoop OSS
       • Fast creation and scaling
  3. Building OSS Hadoop on Azure
     • Azure DevOps for building
     • Periodic sync to trunk
     • Build on VM with OSS Docker image
     • Output ‘tgz’ to Azure Blob Storage
  4. Deploying a cluster
     • Azure Resource Manager (ARM) template
       • JSON file describing resources
     • Main resources:
       • Virtual Machine Scale Sets (VMSS)
       • Virtual Network
       • Network Security Group
       • Load Balancer
       • Public IP
       • Internal DNS
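To make the "JSON file describing resources" idea concrete, here is a minimal sketch that assembles the outer shape of an ARM template for the resource kinds listed on this slide. It is not the template used in the talk: the resource names are hypothetical, the private DNS zone is only one assumed way to get internal DNS, and a deployable template would also need `apiVersion`, `location`, and `properties` on every resource.

```python
import json

# (ARM resource type, hypothetical resource name) for each item on the slide.
# The private DNS zone is an assumption; the slide only says "Internal DNS".
RESOURCES = [
    ("Microsoft.Network/virtualNetworks", "hadoop-vnet"),
    ("Microsoft.Network/networkSecurityGroups", "hadoop-nsg"),
    ("Microsoft.Network/publicIPAddresses", "hadoop-ip"),
    ("Microsoft.Network/loadBalancers", "hadoop-lb"),
    ("Microsoft.Network/privateDnsZones", "hadoop.internal"),
    ("Microsoft.Compute/virtualMachineScaleSets", "workers"),
]

SCHEMA = ("https://schema.management.azure.com/schemas/"
          "2019-04-01/deploymentTemplate.json#")


def build_template_skeleton(resources=RESOURCES):
    """Return the top-level structure of an ARM template as a dict.

    Only `type` and `name` are filled in, so the result illustrates the
    layout and is not deployable as-is.
    """
    return {
        "$schema": SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {},
        "resources": [{"type": t, "name": n} for t, n in resources],
    }


if __name__ == "__main__":
    print(json.dumps(build_template_skeleton(), indent=2))
```

Keeping the resource list as data makes it easy to emit one scale set per VM role, which matches the per-role VMSS layout the deck describes.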
  5. VM creation and startup
     • Cloud-init script
       • YAML syntax similar to Docker
       • Kubernetes (AKS) does not add much
     • Download code and install
       • Hadoop, Docker, ZooKeeper, scripts,…
     • Set up environment variables
     • Discover other services (e.g., ZooKeeper)
     • Start services
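The "discover other services" step could be as simple as deriving host names from a role prefix and an index, in the style of the deck's own examples (nn0, zk2, rm1). The sketch below builds a ZooKeeper connection string that way. The helper names and the optional domain suffix are hypothetical, and 2181 is ZooKeeper's default client port rather than a value taken from the slides.

```python
def role_hosts(role, count, domain=""):
    """Return host names such as zk0, zk1, ... for a role with `count` VMs."""
    suffix = f".{domain}" if domain else ""
    return [f"{role}{i}{suffix}" for i in range(count)]


def zookeeper_quorum(count, port=2181, domain=""):
    """Build a comma-separated host:port list, e.g. for ha.zookeeper.quorum."""
    return ",".join(f"{host}:{port}" for host in role_hosts("zk", count, domain))


if __name__ == "__main__":
    print(zookeeper_quorum(3))
```

A startup script could export the resulting string as an environment variable before starting the Hadoop daemons, so no host list has to be baked into the VM image.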
  6. VM roles (VMSS)
     • 3 x NameNodes
       • ZooKeeper
       • Journal Nodes
       • Routers (RBF)
     • 2 x Resource Managers
     • N x Workers
       • DataNode
       • Node Manager
  7. Network
     • Virtual Network for all VMs
     • Load Balancer
       • isActive servlet (HADOOP-15707)
     • Public IPs
       • External DNS
       • Firewall
     • Internal DNS
       • Locate components (e.g., nn0, zk2, and rm1)
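A load balancer can use the isActive servlet as a health probe so that traffic only reaches the active manager. The sketch below imitates such a probe from the client side. It assumes the servlet is served at `/isActive` on the daemon's HTTP port and answers 200 only when the daemon is active; confirm both against HADOOP-15707 for your Hadoop version. The function names and the injectable `opener` parameter are illustrative.

```python
import urllib.error
import urllib.request


def is_active(host, port, path="/isActive", timeout=2.0,
              opener=urllib.request.urlopen):
    """Return True if the daemon answers the probe with HTTP 200.

    Any HTTP error status, refused connection, or timeout counts as
    "not active", which is how a health probe would treat a standby.
    """
    url = f"http://{host}:{port}{path}"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def pick_active(hosts, port, **kwargs):
    """Return the first host that reports itself active, or None."""
    for host in hosts:
        if is_active(host, port, **kwargs):
            return host
    return None
```

For example, `pick_active(["nn0", "nn1", "nn2"], 9870)` would probe three NameNodes on 9870, the default NameNode HTTP port in Hadoop 3; adjust the port if your cluster overrides it.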
  8. Worker nodes
     • Node Manager (YARN)
       • Docker for long-running services
     • DataNode (HDFS)
       • Use VM local disks
       • Leverage PROVIDED storage
         • Mount external storage (S3, ADLS, HDFS,…)
         • Local HDFS as caching
  9. Creation performance and scalability
     • Create new cluster
       • 3-5 minutes
     • Add 100 workers
       • <3 minutes
     • Add 1000 workers
       • 900 in <3 minutes
       • Long tail (<15 minutes)
  10. Low priority VMs
      • ~80% price discount
      • Can be evicted at any time
        • Larger VMs more likely to be evicted
        • 30 seconds notification
        • Possible to decommission (NM and DN)
      • Ideal for worker nodes
      • Mix of low-priority and reserved VMs
      • [Diagram: low-priority and reserved VMs, plus managers]
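The slide does not say how workers receive the 30-second eviction notice. One plausible channel is Azure's Scheduled Events, which a VM can poll through the Instance Metadata Service and which reports evictions as `Preempt` events. The sketch below only parses such a document and calls a decommission hook. The `api-version` value, the sample payload, and the `decommission` callback are assumptions for illustration, not details from the talk.

```python
# Instance Metadata Service endpoint for Scheduled Events. Requests must carry
# the header "Metadata: true" and only work from inside an Azure VM. The
# api-version here is an assumption; check the current Azure documentation.
SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2019-08-01"
)


def preempted_resources(document):
    """Return the resource names that have a pending Preempt event."""
    names = []
    for event in document.get("Events", []):
        if event.get("EventType") == "Preempt":
            names.extend(event.get("Resources", []))
    return names


def handle_events(document, vm_name, decommission):
    """Call `decommission(vm_name)` if this VM is about to be evicted.

    `decommission` is a hypothetical hook, e.g. one that asks the
    NodeManager and DataNode on this VM to stop accepting work.
    """
    if vm_name in preempted_resources(document):
        decommission(vm_name)
        return True
    return False


if __name__ == "__main__":
    sample = {
        "DocumentIncarnation": 1,
        "Events": [{
            "EventId": "hypothetical-id",
            "EventType": "Preempt",
            "ResourceType": "VirtualMachine",
            "Resources": ["worker_12"],
            "EventStatus": "Scheduled",
        }],
    }
    handle_events(sample, "worker_12", lambda name: print("draining", name))
```

With only about 30 seconds of notice, the hook has to be cheap: it can stop new containers and block writes, but it cannot wait for a full HDFS re-replication.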
  11. Proposed changes to OSS Hadoop
      • Hadoop Registry to find managers
      • Improve PROVIDED storage (HDFS)
      • Improve Dynamic Resource for NMs (YARN)
  12. Hadoop Registry to find managers
      • Currently:
        • Script to set DNS names (e.g., nn2.hadooptest.com, rm0.hadooptest.com)
        • Configuration file with hard-coded values
        • Possible to use DNS resolution (HDFS-14118)
        • YARN Registry to find YARN services
          • Moved to Hadoop Registry
      • New approach:
        • Managers (e.g., NN or RM) register when starting
        • Workers (e.g., DN or NM) use registry to find managers
        • Dynamic subclusters (RBF)
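The register-then-lookup flow in the new approach can be illustrated with a toy in-memory registry. This is not the Hadoop Registry API: the real registry is a ZooKeeper-backed service, and the class, method names, and addresses below are hypothetical stand-ins that only show who writes and who reads.

```python
class ToyRegistry:
    """In-memory stand-in for a service registry, keyed by role."""

    def __init__(self):
        self._entries = {}  # role -> {instance name -> address}

    def register(self, role, name, address):
        """Called by a manager (e.g. NN or RM) when it starts."""
        self._entries.setdefault(role, {})[name] = address

    def unregister(self, role, name):
        """Called when a manager shuts down; unknown names are ignored."""
        self._entries.get(role, {}).pop(name, None)

    def lookup(self, role):
        """Called by a worker (e.g. DN or NM) to find its managers."""
        return sorted(self._entries.get(role, {}).values())


if __name__ == "__main__":
    registry = ToyRegistry()
    registry.register("namenode", "nn0", "nn0:8020")
    registry.register("namenode", "nn1", "nn1:8020")
    print(registry.lookup("namenode"))
```

Because workers resolve managers at runtime, adding or replacing a manager needs no configuration push, which is also what would make dynamic subclusters practical.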
  13. Improve PROVIDED storage
      • Currently:
        • Generate FS image at start time
        • Propagate alias map to DNs
      • New approach:
        • Dynamic mount points
        • HA support
        • Lazy loading replicas metadata on DNs
  14. Improve Dynamic Resource Config for NMs
      • VMs can change size (CPU)
        • Harvesting [OSDI’16]
      • Leverage Resource Options (YARN-291, YARN-996)
      • Container preemption
      • Container priorities (OPPORTUNISTIC)
      • Extend current interfaces
      • Integrate with Resource Monitor
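One existing interface for changing a NodeManager's capacity at runtime is the `yarn rmadmin -updateNodeResource` admin command. The slide does not say that this is the mechanism the talk relies on, so treat the following as a sketch of how a resize notification could be turned into that command. The helper name and the example node ID are hypothetical, and the command is only built here, not executed.

```python
def update_node_resource_cmd(node_id, memory_mb, vcores, overcommit_timeout=None):
    """Build the argv for `yarn rmadmin -updateNodeResource`.

    node_id is the NodeManager's host:port ID. The optional overcommit
    timeout is passed through unchanged; check the YARN documentation for
    its unit in your version.
    """
    if memory_mb <= 0 or vcores <= 0:
        raise ValueError("memory_mb and vcores must be positive")
    cmd = ["yarn", "rmadmin", "-updateNodeResource",
           node_id, str(memory_mb), str(vcores)]
    if overcommit_timeout is not None:
        cmd.append(str(overcommit_timeout))
    return cmd


if __name__ == "__main__":
    # Hypothetical example: a worker VM shrinks to 4 cores and 16 GB.
    print(" ".join(update_node_resource_cmd("worker7:45454", 16384, 4)))
```

Running the resulting command through `subprocess.run` on a host with ResourceManager admin rights would apply the change. What happens to containers that no longer fit is the preemption and priority question raised on this slide.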
  15. Future work
      • Improve Security
        • Currently network rules
        • Integration with Azure Active Directory
        • Delegation tokens propagation
      • Changes to OSS
        • Hadoop Registry
        • PROVIDED storage
        • NM Dynamic Resource
      • Open source scripts?