
Large-Scale Automated Storage on Kubernetes - Matt Schallert OSCON 2019


Managing large stateful applications is tough.

Matt Schallert outlines how Uber used Kubernetes to automate the management of a challenging stateful workload—M3DB, its sharded, replicated, multizone time series database—and examines the operational challenges the company faced while scaling M3DB from a handful of clusters to over 40 clusters across multiple data centers and cloud providers, all while trying to create an environment-agnostic solution for open source users.

Matt then demonstrates methods of managing stateful workloads in a declarative manner to ease operational burden, an approach that made orchestrating M3DB easier. You’ll see how M3DB’s declarative approach to cluster management can be extended to other workloads using its common set of open source libraries.

Along the way, Matt shares lessons learned that you can apply to a variety of stateful workloads across bare metal and cloud environments, whether you’re running under an orchestration system or managing instances directly. You’ll walk away with advice for managing stateful systems at scale and lessons to bear in mind when considering an orchestration system for state management.



  1. 1. Large-Scale Automated Storage on Kubernetes Matt Schallert @mattschallert SRE @ Uber
  2. 2. About Me ● SRE @ Uber since 2016 ● M3 - Uber’s OSS Metrics Platform @mattschallert
  3. 3. Agenda ● Background: Metrics @ Uber ● Scaling Challenges ● Automation ● M3DB on Kubernetes @mattschallert
  4. 4. 2015 @mattschallert
  5. 5. Mid 2016 @mattschallert
  6. 6. ● Cassandra stack scaled to needs but was expensive ○ Human cost: operational overhead ○ Infra cost: lots of high-IO servers ● Scaling challenges inevitable, needed to reduce complexity Mid 2016 @mattschallert
  7. 7. M3DB ● Open source distributed time series database ● Highly compressed (11x), fast, compaction-free storage ● Sharded, replicated (zone & rack-aware) ● Production late 2016 @mattschallert
  8. 8. M3DB ● Reduced operational overhead ● Runs on commodity hardware with SSDs ● Strongly consistent cluster membership backed by etcd @mattschallert
  9. 9. m3cluster ● Reusable cluster management libraries created for M3DB ○ Topology representation, shard distribution algos ○ Service discovery, runtime config, leader election ○ Etcd (strongly consistent KV store) as source of truth for all components ● Similar to Apache Helix ○ Helix is Java-heavy, we’re a Go-based stack
  10. 10. m3cluster ● Topologies represented as desired goal states, M3DB converges @mattschallert
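
To make the goal-state convergence concrete, here is a minimal Go sketch. The types and helpers (fetchPlacement, localShards) are simplified stand-ins rather than M3DB's actual code; in the real system the placement is read from etcd and missing shards are bootstrapped by streaming data from peers.

    package main

    import (
        "fmt"
        "time"
    )

    // Simplified stand-ins for m3cluster's placement types.
    type Shard struct{ ID uint32 }

    type Placement map[string][]Shard // instance ID -> shards it should own

    // fetchPlacement and localShards are hypothetical stubs; in M3DB the desired
    // placement lives in etcd and the local view comes from data on disk.
    func fetchPlacement() Placement    { return Placement{"node-1": {{ID: 0}, {ID: 1}}} }
    func localShards() map[uint32]bool { return map[uint32]bool{0: true} }

    // converge repeatedly compares the desired shard set for this node against
    // what it currently owns and acts on the difference.
    func converge(nodeID string) {
        for {
            desired := fetchPlacement()[nodeID]
            owned := localShards()
            for _, s := range desired {
                if !owned[s.ID] {
                    fmt.Printf("bootstrapping shard %d\n", s.ID) // stream data from peers
                }
            }
            time.Sleep(10 * time.Second)
        }
    }

    func main() { converge("node-1") }
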
  11. 11. m3cluster: Placement @mattschallert
  12. 12. Late 2016 @mattschallert
  13. 13. Post-M3DB Deployment ● M3DB solved most pressing issue (Cassandra) ○ Improved stability, reduced costs ● System as a whole: still a high operational overhead @mattschallert
  14. 14. Fragmented representation of components @mattschallert
  15. 15. System Representation (2016) @mattschallert
  16. 16. System Representation (2016)
  17. 17. System Representation (2016)
  18. 18. System Representation (2016)
  19. 19. System Representation (2016)
  20. 20. System Representation ● Systems represented as… ○ Lists fetched from Uber config service ○ Static lists deployed via Puppet ○ Static lists coupled with service deploys ○ m3cluster placements @mattschallert
  21. 21. Fragmentation ● Had smaller components rather than one unifying view of the system as a whole ● Difficult to reason about ○ “Where do I need to make this change?” ○ Painful onboarding for new engineers ● Impediment to automation ○ Replace host: update config, build N services, deploy
  22. 22. System Representation (2016)
  23. 23. System Representation: 2018 @mattschallert
  24. 24. m3cluster benefits ● Common libraries for all workloads: stateful (M3DB), semi-stateful (aggregation tier), stateless (ingestion) ○ Implement “distribute shards across racks according to disk size” once, reuse everywhere ● Single source of truth: everything stored in etcd @mattschallert
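
As a sketch of what "implement it once" buys, the snippet below hands out shards in proportion to a per-instance weight such as disk size. It is deliberately naive (no replication, no isolation-group spreading) and is not m3cluster's actual algorithm.

    package main

    import "fmt"

    // Instance is a simplified stand-in for an m3cluster placement instance.
    type Instance struct {
        ID             string
        IsolationGroup string // e.g. rack (unused in this naive sketch)
        Weight         uint32 // e.g. proportional to disk size
        Shards         []uint32
    }

    // assign hands out shard IDs, always picking the instance with the lowest
    // shards-per-weight ratio, so instances with bigger disks receive
    // proportionally more shards.
    func assign(numShards int, instances []*Instance) {
        for shard := 0; shard < numShards; shard++ {
            best := instances[0]
            for _, inst := range instances[1:] {
                if float64(len(inst.Shards))/float64(inst.Weight) <
                    float64(len(best.Shards))/float64(best.Weight) {
                    best = inst
                }
            }
            best.Shards = append(best.Shards, uint32(shard))
        }
    }

    func main() {
        instances := []*Instance{
            {ID: "a", IsolationGroup: "rack-1", Weight: 100},
            {ID: "b", IsolationGroup: "rack-2", Weight: 200},
            {ID: "c", IsolationGroup: "rack-3", Weight: 100},
        }
        assign(12, instances)
        for _, inst := range instances {
            fmt.Println(inst.ID, inst.Shards) // "b" ends up with roughly twice as many shards
        }
    }
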
  25. 25. Operational Improvements ● Easier to reason about the entire system ○ Unifying view: Placements ● Operations much easier, possible to automate ○ Just modifying goal state in etcd ● One config per deployment (no instance-specific) ○ Cattle vs. pets: instances aren’t special, treat them as a whole
  26. 26. M3DB Goal State ● Shard assignments stored in etcd ○ M3DB nodes observe desired state, react, update shard state ● Imperative actions such as “add a node” are really changes in declarative state ○ Easy for both people or tooling to interact with @mattschallert
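
A minimal sketch of "add a node" as a change in declarative state, using simplified stand-ins for the Placement and Shard types shown later in the deck. Which shards the new node should receive, and the write back to etcd, are real concerns that are elided here.

    package main

    import "fmt"

    // Minimal stand-ins for the Placement and Shard types shown in the protos
    // at the end of the deck.
    type Shard struct {
        ID    uint32
        State string // INITIALIZING, AVAILABLE, LEAVING
    }

    type Instance struct {
        ID     string
        Shards []Shard
    }

    type Placement struct {
        Instances map[string]*Instance
    }

    // addNode expresses the imperative operation "add a node" as a goal-state
    // change: it only edits the desired placement. The new node later notices
    // its INITIALIZING shards, streams the data, and marks them AVAILABLE.
    func addNode(p *Placement, id string, shardIDs []uint32) {
        inst := &Instance{ID: id}
        for _, s := range shardIDs {
            inst.Shards = append(inst.Shards, Shard{ID: s, State: "INITIALIZING"})
        }
        p.Instances[id] = inst
    }

    func main() {
        p := &Placement{Instances: map[string]*Instance{}}
        addNode(p, "node-4", []uint32{7, 12})
        fmt.Printf("%+v\n", p.Instances["node-4"])
    }
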
  27. 27. M3DB: Adding a Node @mattschallert
  28. 28. M3DB: Adding a Node
  29. 29. M3DB: Adding a Node @mattschallert
  30. 30. M3DB: Adding a Node @mattschallert
  31. 31. M3DB Goal State ● Goal-state primitives built with future automation in mind ● m3cluster interacts with single source of truth ● Vs. imperative changes: ○ No instance-level operations ○ No state or steps to keep track of: can always reconcile state of world (restarts, etc. safe) @mattschallert
  32. 32. System Representation: 2018 @mattschallert
  33. 33. M3DB Operations ● Retooled entire stack as goal state-driven, dynamic ○ Operations simplified, but still triggered by a person ● M3DB requires more care to manage than stateless ○ Takes time to stream data @mattschallert
  34. 34. 2016 Clusters 2 @mattschallert
  35. 35. 2018 Clusters 40+ @mattschallert
  36. 36. Automating M3DB ● Wanted goal state at a higher level ○ Clusters as cattle ● M3DB can only do so much; needed to expand scope ● Chose to build on top of Kubernetes ○ Value for M3DB OSS community ○ Internal Uber use cases @mattschallert
  37. 37. Kubernetes @ 35,000 ft ● Container orchestration system ● Support for stateful workloads ● Self-healing ● Extensible ○ CRD: define your own API resources ○ Operator: custom controller @mattschallert
  38. 38. ● Pod: base unit of work ○ Group of related containers deployed together ● Pods can come and go ○ Self-healing → pods move in response to failures ● No pods are special Kubernetes Driving Principles @mattschallert
  39. 39. Kubernetes Controllers

    for {
        desired := getDesiredState()
        current := getCurrentState()
        makeChanges(desired, current)
    }

  (Writing Controllers guide)
  40. 40. M3DB & Kubernetes M3DB: ● Goal state driven ● Single source of truth ● Nodes work to converge Kubernetes: ● Goal state driven ● Single source of truth ● Components work to converge @mattschallert
  41. 41. M3DB Operator ● Manages M3DB clusters running on Kubernetes ○ Creation, deletion, scaling, maintenance ● Performs actions a person would have to do in full cluster lifecycle ○ Updating M3DB clusters ○ Adding more instances @mattschallert
  42. 42. M3DB Operator ● Users define desired M3DB clusters ○ “9 node cluster, spread across 3 zones, with remote SSDs attached” ● Operator updates desired Kubernetes state, waits for convergence, updates M3DB desired state @mattschallert
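
A rough sketch of that ordering, with hypothetical stubs (scaleStatefulSet, readyPods, inPlacement, addToPlacement) standing in for the real client-go and M3DB API calls the operator makes:

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stubs standing in for the real Kubernetes (client-go) and
    // M3DB API calls.
    func scaleStatefulSet(cluster string, replicas int) { fmt.Println("scale", cluster, "to", replicas) }
    func readyPods(cluster string) int                  { return 9 }
    func inPlacement(pod string) bool                   { return false }
    func addToPlacement(pod string)                     { fmt.Println("add", pod, "to M3DB placement") }

    // reconcile sketches the ordering on this slide: change the Kubernetes
    // desired state first, wait for it to converge, then change the M3DB
    // desired state so the new pods are assigned shards.
    func reconcile(cluster string, desiredReplicas int) {
        scaleStatefulSet(cluster, desiredReplicas)

        for readyPods(cluster) < desiredReplicas {
            time.Sleep(5 * time.Second) // wait for Kubernetes to converge
        }

        for i := 0; i < desiredReplicas; i++ {
            pod := fmt.Sprintf("%s-%d", cluster, i)
            if !inPlacement(pod) {
                addToPlacement(pod)
            }
        }
    }

    func main() { reconcile("m3db-cluster-a", 9) }

Because each pass derives its work from the current state, the operator can be restarted at any point and pick back up where it left off.
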
  43. 43. M3DB Operator: Create Example @mattschallert
  44. 44. @mattschallert
  45. 45. @mattschallert
  46. 46. @mattschallert
  47. 47. M3DB Operator: Day 2 Operations ● Scaling a cluster ○ M3DB adds 1-by-1, operator handles many ● Replacing instances, updating config, etc. ● All operations follow same pattern ○ Look at Kubernetes state, look at M3DB state, effect change @mattschallert
  48. 48. M3DB Operator: Success Factors ● Bridge between goal state-driven APIs (anti-footgun) ○ Can’t accidentally remove more pods than desired ○ Can’t accidentally remove two M3DB instances ○ Operator can be restarted, pick back up ● Share similar concepts, expanded scope ○ Failure domains, resource quantities, etc.
  49. 49. ● Finding common abstractions made mental model easier ○ Implementation can be separate, interface the same ● + Goal-state driven approach simplified operations ○ External factors (failures, deploys) don’t disrupt steady state Lessons Learned: M3 @mattschallert
  50. 50. ● Embracing Kubernetes principles in M3DB made Kubernetes onboarding easier ○ No pods are special, may move around ○ Self-healing after failures ● Instances as cattle, not pets ○ No instance-level operations or config Lessons Learned: Kubernetes @mattschallert
  51. 51. ● Any orchestration system implies complexity ○ Your own context: is it worth it? ○ How does your own system respond to failures, rescheduling, etc.? ● Maybe not a good fit if you have single special instances ○ M3DB can tolerate failures Considerations @mattschallert
  52. 52. ● Requirements & guarantees of your system will influence your requirements for automation ○ M3DB strongly consistent: wanted strongly consistent cluster membership Considerations @mattschallert
  53. 53. ● Building common abstractions for all our workloads eased mental model ● Following goal-state driven approach eased automation ● Closely aligned with Kubernetes principles Recap @mattschallert
  54. 54. ● github.com/m3db/m3 ○ M3DB, m3cluster, m3aggregator ○ github.com/m3db/m3db-operator ● m3db.io & docs.m3db.io ● Please leave feedback on O’Reilly site! Thank you! @mattschallert
  55. 55. Pre-2015
  56. 56. Early 2015
  57. 57. M3DB Operator: Today ● [ THIS SLIDE IS SKIPPED ] ● https://github.com/m3db/m3db-operator ● Managing 1 M3DB cluster ~as easy as 10
  58. 58. Late 2015
  59. 59. Mid 2016
  60. 60. Post-M3DB Deployment
  61. 61. System Interfaces
  62. 62. System Interfaces ● Ingestion path used multiple text-based line protocols ○ Difficult to encode metric metadata ○ Redundant serialization costs ● Storage layers used various RPC formats
  63. 63. System Representation (2016)
  64. 64. M3DB Operator [SLIDE SKIPPED] ROUGH DRAFT DIAGRAM
  65. 65. ● [ Transition to Kubernetes, possibly using edge vs. level triggered ] ○ Level triggered: reacting to current state. Actions such as “replace this node” can’t be missed. ○ Edge triggered: reacting to events. Events can be missed. ● [ Maybe something about idempotency requirement ? ] Edge vs. Level Triggered Logic
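
A small Go sketch of the distinction, using hypothetical desiredReplicas/observedReplicas stubs: the level-triggered loop re-reads the world on every pass, so a missed event is eventually repaired and actions must be idempotent, while the edge-triggered handler only ever sees the events it happens to receive.

    package main

    import (
        "fmt"
        "time"
    )

    // Hypothetical stubs: a real controller would read desired state from the
    // Kubernetes API and observed state from the cluster itself.
    func desiredReplicas() int  { return 3 }
    func observedReplicas() int { return 2 }

    // Level-triggered: periodically re-read the *current* state of the world,
    // so an action like "replace this node" cannot be missed even if the
    // controller was down when the node failed. Actions must be idempotent
    // because they may be evaluated again on the next pass.
    func levelTriggered() {
        for range time.Tick(30 * time.Second) {
            if diff := desiredReplicas() - observedReplicas(); diff != 0 {
                fmt.Println("reconciling, replica diff =", diff)
            }
        }
    }

    // Edge-triggered: react only to discrete events. If the "node failed"
    // event is dropped (restart, missed watch), nothing ever repairs the cluster.
    func edgeTriggered(events <-chan string) {
        for ev := range events {
            fmt.Println("handling event:", ev)
        }
    }

    func main() {
        go edgeTriggered(make(chan string))
        levelTriggered()
    }
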
  66. 66. m3cluster: Placement

    message Instance {
        string id
        string isolation_group
        string zone
        uint32 weight
        string endpoint
        repeated Shard shards
        ...
    }

    message Placement {
        map<string, Instance> instances
        uint32 replica_factor
        uint32 num_shards
        bool is_sharded
    }
  67. 67. m3cluster: Shard

    enum ShardState {
        INITIALIZING = 0;
        AVAILABLE = 1;
        LEAVING = 2;
    }

    message Shard {
        uint32 id = 1;
        ShardState state = 2;
        string source_id = 3;
    }
  68. 68. M3cluster: all

    enum ShardState {
        INITIALIZING = 0;
        AVAILABLE = 1;
        LEAVING = 2;
    }

    message Shard {
        uint32 id = 1;
        ShardState state = 2;
        string source_id = 3;
    }

    message Instance {
        string id = 1;
        string isolation_group = 2;
        string zone = 3;
        uint32 weight = 4;
        string endpoint = 5;
        repeated Shard shards = 6;
        ...
    }

    message Placement {
        map<string, Instance> instances = 1;
        uint32 replica_factor = 2;
        uint32 num_shards = 3;
        bool is_sharded = 4;
    }
  69. 69. ● Consistency of operations ○ Many actors in the system ○ Failures can happen, may re-reconcile states ○ Kubernetes & m3cluster leverage etcd Prerequisites for Orchestration
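
One way to get that consistency is a compare-and-swap write against etcd, sketched below with the etcd clientv3 client. The key name and the no-op update function are illustrative, not M3DB's actual schema; the transaction commits only if the key's revision is unchanged since it was read.

    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // casUpdate reads a key, applies an update function, and writes the result
    // back only if nobody else modified the key in between, so two actors
    // (a person and an operator, or two operators) cannot silently clobber
    // each other's changes.
    func casUpdate(cli *clientv3.Client, key string, update func([]byte) []byte) error {
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        get, err := cli.Get(ctx, key)
        if err != nil || len(get.Kvs) == 0 {
            return fmt.Errorf("read %q: %v", key, err)
        }
        kv := get.Kvs[0]

        txn, err := cli.Txn(ctx).
            If(clientv3.Compare(clientv3.ModRevision(key), "=", kv.ModRevision)).
            Then(clientv3.OpPut(key, string(update(kv.Value)))).
            Commit()
        if err != nil {
            return err
        }
        if !txn.Succeeded {
            return fmt.Errorf("key %q changed concurrently; re-read and retry", key)
        }
        return nil
    }

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // Illustrative key name; a real update would modify the placement value.
        _ = casUpdate(cli, "_placement/m3db-cluster-a", func(old []byte) []byte {
            return old
        })
    }
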
  70. 70. ● [ Lessons learned orchestrating M3DB generally ] ● [ Learnings specific to Kubernetes ] Lessons Learned
