Atmosphere 2016 - Diptanu Choudhury - Taming the public clouds with nomad

Distributed cluster schedulers are becoming increasingly popular. They present a good abstraction for running workloads at “warehouse scale” on public and private clouds by decoupling the workload from compute, network and storage resources.

This talk covers the operational challenges of running a cluster scheduler to serve highly available services across multiple geographies and in a heterogeneous runtime environment. We will go into detail on what is needed from a cluster scheduler: managing multiple runtime and virtualization platforms, providing observability, running maintenance on hardware and software, etc.

Published in: Technology

  1. 1. HASHICORP Taming the modern public clouds with Nomad Diptanu Gon Choudhury @diptanu
  2. 2. HASHICORP Evolution of compute infrastructure 1995 2000 2015
  3. 3. HASHICORP Evolution of compute infrastructure
  4. 4. HASHICORP Evolution of compute infrastructure Global Public Cloud AWS - US-West-2 AWS - US-East-1 GCP - US-Central-1 Private Clouds Private Clouds
  5. 5. HASHICORP Challenges of the modern cloud 10s of 1000s of compute nodes to manage Compute clusters are spread across the globe Static and offline partitioning of clusters are no longer efficient
  6. 6. HASHICORP Challenges of the modern cloud Heterogeneous API for accessing compute infrastructure Heterogeneous primitives for managing network, secrets, etc.
  7. 7. HASHICORP Evolution of application architecture SOA and Micro Services are replacing monoliths Distributed Systems are the new normal
  8. 8. HASHICORP Challenges in running modern services Orchestrated deployment and rollback strategies More modes of failure
  9. 9. Operator Datacenter Skywalker Vader Leia Solo
  10. 10. Operator Datacenter PYTHON PYTHON GOLANG GOLANG GOLANG Skywalker Vader Leia Solo
  11. 11. Operator Datacenter RUBY PYTHON PYTHON PYTHON GOLANG GOLANG GOLANG GOLANG NODE Skywalker Vader Leia Solo
  12. 12. Operator Datacenter RUBY PYTHON PYTHON PYTHON GOLANG GOLANG GOLANG GOLANG NODE Skywalker Vader Leia Solo RUBY VADER LEIA SOLO 192.168.1.4 192.168.1.5 192.168.1.7 192.168.1.253 88:45:13:B6:87:C4 94:CE:4F:C8:54:C3 CA:9A:3D:7F:8B:CB 72:30:9C:0D:1E:74 Randomly kills applications
  13. 13. Operator Datacenter RUBY PYTHON PYTHON PYTHON GOLANG GOLANG GOLANG GOLANG NODE Skywalker Leia Solo RUBY VADER LEIA SOLO 192.168.1.4 192.168.1.5 192.168.1.7 192.168.1.253 88:45:13:B6:87:C4 94:CE:4F:C8:54:C3 CA:9A:3D:7F:8B:CB 72:30:9C:0D:1E:74 Randomly kills applications Vader
  14. 14. Operator Datacenter RUBY PYTHON PYTHON PYTHON GOLANG GOLANG GOLANG GOLANG NODE Skywalker Leia Solo RUBY VADER LEIA SOLO 192.168.1.4 192.168.1.5 192.168.1.7 192.168.1.253 88:45:13:B6:87:C4 94:CE:4F:C8:54:C3 CA:9A:3D:7F:8B:CB 72:30:9C:0D:1E:74 Randomly kills applications Vader PYTHON PYTHON PYTHON
  15. 15. Operator Datacenter RUBY GOLANG GOLANG GOLANG GOLANG NODE Skywalker Leia Solo RUBY VADER LEIA SOLO 192.168.1.4 192.168.1.5 192.168.1.7 192.168.1.253 88:45:13:B6:87:C4 94:CE:4F:C8:54:C3 CA:9A:3D:7F:8B:CB 72:30:9C:0D:1E:74 Randomly kills applications Vader PYTHON PYTHON PYTHON
  16. 16. Operator Datacenter RUBY GOLANG GOLANG GOLANG GOLANG NODE Skywalker Leia Solo RUBY VADER LEIA SOLO 192.168.1.4 192.168.1.9 192.168.1.7 192.168.1.253 88:45:13:B6:87:C4 94:CE:4F:C8:54:C3 CA:9A:3D:7F:8B:CB 72:30:9C:0D:1E:74 Rebuilt on 04/20/2016 Vader PYTHON PYTHON PYTHON
  17. 17. Operator Datacenter RUBY GOLANG GOLANG GOLANG GOLANG NODE Skywalker Leia Solo RUBY VADER LEIA SOLO 192.168.1.4 192.168.1.9 192.168.1.7 192.168.1.253 88:45:13:B6:87:C4 94:CE:4F:C8:54:C3 CA:9A:3D:7F:8B:CB 72:30:9C:0D:1E:74 Rebuilt on 04/20/2016 Vader PYTHON PYTHON PYTHON
  18. 18. This does not scale
  19. 19. HASHICORP Cluster Schedulers to the rescue Decouple Work from Resources Better Quality of Service Higher Resource Utilization
  20. 20. Nomad HASHICORP Multi-Datacenter Multi-Region Flexible Workloads Job Priorities Bin Packing Large Scale Operationally Simple
  21. 21. HASHICORP Nomad as Cluster Scheduler Bin Packing Job Queueing Over-Subscription Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  22. 22. HASHICORP Nomad as the Cluster Scheduler Abstraction API Contracts Standardization Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  23. 23. HASHICORP Nomad as the Cluster Scheduler Priorities Resource Isolation Pre-emption Higher Resource Utilization Decouple Work from Resources Better Quality of Service
  24. 24. HASHICORP Job Specification Declares what to run
  25. 25. HASHICORP example.nomad

          # Define our simple redis job
          job "redis" {
            # Run only in us-east-1
            datacenters = ["us-east-1"]
            # Define the single redis task using Docker
            task "redis" {
              driver = "docker"
              config {
                image = "redis:latest"
              }
              resources {
                cpu    = 500 # MHz
                memory = 256 # MB
                network {
                  mbits         = 10
                  dynamic_ports = ["redis"]
                }
              }
            }
          }

  26. 26. HASHICORP Job Specification Nomad determines where and manages how to run
  27. 27. HASHICORP Job Specification Abstract work from resources
  28. 28. HASHICORP Supports multiple Clouds, DCs and Regions Resources across DCs are presented as a single pool Developers can target multiple datacenters in the same job file Unified interface for developers across clouds
  29. 29. HASHICORP Unified interface across hybrid clouds AWS GCP Azure On-Prem DC Nomad Job Spec
  30. 30. HASHICORP Single Region Architecture SERVER SERVER SERVER CLIENT CLIENT CLIENT DC1 DC2 DC3 FOLLOWER LEADER FOLLOWER REPLICATION FORWARDING REPLICATION FORWARDING RPC RPC RPC
  31. 31. HASHICORP Multi Region Architecture SERVER SERVER SERVER FOLLOWER LEADER FOLLOWER REPLICATION FORWARDING REPLICATION REGION B GOSSIP REPLICATION REPLICATION FORWARDING REGION FORWARDING REGION A SERVER FOLLOWER SERVER SERVER LEADER FOLLOWER
  32. 32. Nomad HASHICORP Region is Isolation Domain 1-N Datacenters Per Region Flexibility to do 1:1 (Consul) Scheduling Boundary
  33. 33. Data Model ALLOCATION JOB EVALUATION NODE
  34. 34. Evaluation ~= State Change
  35. 35. Evaluations Create / Update / Delete Job Node Up / Node Down Allocation Failed
  36. 36. Evaluations SCHEDULER func(Evaluation) => []AllocationUpdates
  37. 37. Evaluations SCHEDULER func(Evaluation) => []AllocationUpdates Service, Batch, System
  38. 38. HASHICORP Scheduler Architecture Concurrent and optimistic scheduling Event-driven invocation of schedulers No head-of-line blocking for different types of workloads
  39. 39. HASHICORP Client Architecture Broad OS Support Host Fingerprinting Pluggable Drivers
  40. 40. HASHICORP Drivers Execute Tasks Provide Resource Isolation
  41. 41. HASHICORP Containerized Virtualized Standalone Docker Qemu / KVM Java Jar Static Binaries Rocket
  42. 42. HASHICORP Containerized Virtualized Standalone Docker Rocket Windows Server Containers Qemu / KVM Hyper-V Xen Java Jar Static Binaries C#
  43. 43. HASHICORP Maintenance Primitives First-class support for doing maintenance on nodes Drain allocations running on a node

          nomad node-drain -enable 149cc920
          Are you sure you want to enable drain mode for node "149cc920"? [y/N]

  44. 44. HASHICORP Service Discovery Aware Allows developers to define services exposed by a job Keep services and checks synced
  45. 45. HASHICORP example.nomad

          job "redis" {
            task "redis" {
              ………
              service {
                name = "binstore"
                tags = ["env:staging", "stack:beta"]
                port = "http"
                check {
                  name     = "binstore-http"
                  type     = "http"
                  path     = "/status"
                  interval = "30s"
                  timeout  = "2s"
                }
              }
              …………
            }
          }

  46. 46. HASHICORP System Job Scheduler Runs a job on every node in the cluster Great for running monitoring, logging, and auditing software
  47. 47. HASHICORP Log Management Takes care of rotating logs of services Log forwarding coming soon
  48. 48. HASHICORP Thanks! https://github.com/hashicorp/nomad https://www.nomadproject.io/
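
Slides 20 and 21 name bin packing as one way Nomad raises resource utilization. The transcript does not describe the algorithm, so the following is only a minimal best-fit sketch in Python, not Nomad's implementation: among the nodes that can hold a task, it picks the one that would be left with the least spare capacity. `Node`, `Task`, `fits`, `leftover` and `place` are hypothetical names; the CPU (MHz) and memory (MB) units follow the `resources` block on slide 25.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A compute node (hypothetical model, not Nomad's node structure)."""
    name: str
    cpu_total: int  # MHz
    mem_total: int  # MB
    cpu_free: int   # MHz still unreserved
    mem_free: int   # MB still unreserved


@dataclass
class Task:
    """Resources a task asks for, as in the slide 25 job file."""
    name: str
    cpu: int  # MHz
    mem: int  # MB


def fits(node: Node, task: Task) -> bool:
    """True if the node has enough free CPU and memory for the task."""
    return node.cpu_free >= task.cpu and node.mem_free >= task.mem


def leftover(node: Node, task: Task) -> float:
    """Fraction of the node left unused after placement, summed over CPU and memory."""
    cpu_left = (node.cpu_free - task.cpu) / node.cpu_total
    mem_left = (node.mem_free - task.mem) / node.mem_total
    return cpu_left + mem_left


def place(task: Task, nodes: List[Node]) -> Optional[Node]:
    """Best-fit placement: choose the feasible node with the least leftover."""
    candidates = [n for n in nodes if fits(n, task)]
    if not candidates:
        return None  # nothing fits; a real scheduler might queue the job
    best = min(candidates, key=lambda n: leftover(n, task))
    # Reserve the resources so later placements see the updated node.
    best.cpu_free -= task.cpu
    best.mem_free -= task.mem
    return best
```

Calling `place` repeatedly with the slide 25 redis task (500 MHz, 256 MB) fills a nearly full node before it touches an idle one, which keeps whole nodes free for larger tasks.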

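Slides 34 to 37 describe an evaluation as a state change and a scheduler as a function `func(Evaluation) => []AllocationUpdates`, and slide 46 says the system scheduler runs a job on every node. A rough Python sketch of that shape, assuming a much simpler data model than Nomad's: the scheduler compares the nodes that should run the job with the nodes that currently do, and returns the difference as updates. All class, field and trigger names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Evaluation:
    """A state change that asks the scheduler to look at one job."""
    job: str
    trigger: str  # illustrative label, e.g. "job-created" or "node-down"


@dataclass(frozen=True)
class AllocationUpdate:
    """One change the scheduler wants applied to the cluster."""
    action: str  # "place" or "stop"
    job: str
    node: str


def system_scheduler(
    evaluation: Evaluation,
    ready_nodes: Set[str],
    allocations: Dict[str, Set[str]],
) -> List[AllocationUpdate]:
    """Reconcile a system-style job: one allocation on every ready node.

    `allocations` maps a job name to the set of nodes currently running it.
    """
    running = allocations.get(evaluation.job, set())
    # Ready nodes that lack the job need a new allocation.
    updates = [
        AllocationUpdate("place", evaluation.job, node)
        for node in sorted(ready_nodes - running)
    ]
    # Allocations on nodes that are no longer ready are stopped.
    updates += [
        AllocationUpdate("stop", evaluation.job, node)
        for node in sorted(running - ready_nodes)
    ]
    return updates
```

Because the function only reads state and returns a list of proposed changes, several such schedulers could run concurrently on different evaluations, which is in the spirit of the "concurrent and optimistic scheduling" bullet on slide 38.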