Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosystem - Pulsar Summit SF 2022

In this talk, learn how Toast leverages our Envoy control-plane to manage blue-green deploys of Pulsar consumers, and how this has helped drive adoption across the engineering organization. Dive into the history of Pulsar at Toast, starting from its introduction in 2019 to provide event-driven architecture across a rapidly scaling restaurant software platform. We will detail some of the hurdles that we encountered gaining buy-in across a diverse set of teams, and dive deep into how we enforce best practices and integrate with our service control plane.


  1. 1. Pulsar Summit San Francisco Hotel Nikko August 18 2022 Use Case Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosystem Kai Levy & Zach Walsh Toast, Inc.
  2. 2. Kai and Zach both work on Toast’s Scale team, building shared infrastructure and solving problems of messaging, routing and persistence at scale. Kai Levy Senior Software Engineer Toast Zach Walsh Senior Software Engineer Toast
  3. 3. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  4. 4. We empower the restaurant community to delight their guests, do what they love, and thrive
  5. 5. Toast’s technology platform
  6. 6. Toast’s microservice ecosystem How it started How it’s going
  7. 7. How it’s going (with Pulsar)
  8. 8. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change Data Capture (CDC) A History of Pulsar at Toast
  9. 9. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change Data Capture (CDC) A History of Pulsar at Toast 2019 Pulsar pilot ● Initial exploration & testing ● Cluster productionalization ● First features, such as migrating change data capture
  10. 10. Persistence & Stability Seamless Pulsar failover ● RabbitMQ: potential stability issues + in-memory data-storage = lost messages ○ Manual maintenance was a big burden ● Pulsar’s data replication & automatic topic balancing eliminated these concerns
  11. 11. Horizontal Scalability broker 0 … ● Supports adding more topics without manual provisioning ● Throughput has grown more than 5x without any change in architecture broker 1 broker 2 broker 3 broker n
  12. 12. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change Data Capture (CDC) A History of Pulsar at Toast 2019 Pulsar pilot ● Initial exploration & testing ● Cluster productionalization ● First features, such as migrating change data capture 2020 Full-fledged adoption ● Teams across Toast rapidly built features on top of Pulsar to help restaurants survive the pandemic ● Decorated streams built on Pulsar, which enabled more scalable consumers
  13. 13. CDC notify-topic Domain service (Source of Truth) service2 service1 service3 Full-fledged adoption … serviceN
  14. 14. CDC data decorator service notify-topic decorated-stream Domain service (Source of Truth) service1 … serviceN Full-fledged adoption
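The decorated-stream flow in the two diagrams above can be sketched roughly as follows. This is a minimal Python model of our reading of the slides, not Toast's implementation: `NotifyEvent`, `DecoratedEvent`, `decorate`, `run_decorator`, and the in-memory `source_of_truth` dict are all hypothetical stand-ins for the CDC notification, the decorated event, the decorator service, and the domain service.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class NotifyEvent:
    """Hypothetical thin CDC notification: only says which entity changed."""
    entity_id: str


@dataclass(frozen=True)
class DecoratedEvent:
    """Hypothetical decorated event: carries the full entity state."""
    entity_id: str
    payload: dict


def decorate(event: NotifyEvent, source_of_truth: Dict[str, dict]) -> Optional[DecoratedEvent]:
    """Look the entity up once so that downstream consumers do not have to."""
    payload = source_of_truth.get(event.entity_id)
    if payload is None:
        return None  # entity is not visible in the source of truth; skip it
    return DecoratedEvent(event.entity_id, dict(payload))


def run_decorator(notify_topic: List[NotifyEvent],
                  source_of_truth: Dict[str, dict]) -> List[DecoratedEvent]:
    """Consume the notify-topic and produce the decorated-stream."""
    decorated = (decorate(event, source_of_truth) for event in notify_topic)
    return [event for event in decorated if event is not None]
```

In this model the domain service is read once per change by the decorator, rather than once per change by each of service1 … serviceN, which is one plausible reason the slides describe decorated streams as enabling more scalable consumers.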
  15. 15. Full-fledged adoption ● Order status notifications: delivery & curbside arrival notifications for consumers, helping restaurants pivot to digital ● Tip pool tracking: tip pooling information is kept up to date with order information ● Loyalty points accrual: consumer-facing loyalty programs help Toast restaurants thrive ● Restaurant availability: third-party platforms are notified when a restaurant goes offline
  16. 16. 2018 Asynchronous messaging with RabbitMQ ● Order syncing between devices ● Change data capture (CDC) A History of Pulsar at Toast 2019 Pulsar pilot ● Initial exploration & testing ● Cluster productionalization ● First features, such as migrating change data capture 2020 Full-fledged adoption ● Teams across Toast rapidly built features on top of Pulsar to help restaurants survive the pandemic ● Decorated streams built on Pulsar, which enabled more scalable consumers 2022 Next-gen order processing ● Critical replatforming projects in development will help Toast reach the next level of scale ● Event-driven architecture being widely used for new features
  17. 17. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  18. 18. Pulsar adoption has grown steadily (chart: user adoption, linear)
  19. 19. Toast client libraries Providing Toast-specific functionality for free 1. Out-of-box authentication 2. Dead-letter topic guidance (+ topic registries) 3. Metric instrumentation 4. Message parsing 5. Pulsar client configuration +
  20. 20. Authentication & authorization ● Automatic service authentication provided by the client libraries ○ Easy to use with any of our supported application frameworks ● Contributed a patch into the public Java client library
  21. 21. Dead-Letter Topics ● Standards for undeliverable messages ○ Per-subscription DLQs, or automatic acknowledgement after redelivery ○ Integrated with service configuration
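The two standards named on this slide (a per-subscription DLQ, or automatic acknowledgement after redelivery) can be illustrated with a small decision function. This is a sketch under assumptions: it does not use the Pulsar client, and `process_with_policy`, `max_redeliveries`, and the list standing in for a dead-letter topic are hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Callable, List, Optional


class Outcome(Enum):
    ACKED = "acked"                  # handler succeeded
    DEAD_LETTERED = "dead_lettered"  # moved to the subscription's DLQ
    AUTO_ACKED = "auto_acked"        # given up on and acknowledged


def process_with_policy(
    message: str,
    handler: Callable[[str], None],
    max_redeliveries: int,
    dead_letter_topic: Optional[List[str]],
) -> Outcome:
    """Try the handler once plus up to `max_redeliveries` redeliveries."""
    for _ in range(1 + max_redeliveries):
        try:
            handler(message)
            return Outcome.ACKED
        except Exception:
            continue  # stands in for a negative ack followed by redelivery
    if dead_letter_topic is not None:
        dead_letter_topic.append(message)  # per-subscription DLQ
        return Outcome.DEAD_LETTERED
    return Outcome.AUTO_ACKED  # no DLQ configured: acknowledge and move on
```

Either way the undeliverable message stops being redelivered, so it cannot block the subscription; the difference is whether it is kept for later inspection.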
  22. 22. Topic registries with Terraform: developers write infrastructure as code ● Started with in-house provider ○ Now migrating to StreamNative provider ● Lets us manage namespace authorization ● Provides defaults for retention & persistence ● Central place for discovering events
  23. 23. Metrics ● Automatically report over 2 dozen metrics ○ Consistent across services ● Critical for operations & monitoring ● Added our own custom metrics ● Adding APM integrations ackLatency ackTimeouts auto-acknowledgements
  24. 24. Message Parsing We parse Protobuf messages into friendly Kotlin data classes ● Our open-source, Kotlin-first protocol buffer compiler ● One-line usage for engineers building on our client
  25. 25. Configuration recommendations Providing guidance around client settings ● Producer batching ● Acknowledgement timeout ● Receiver queue size ● Redelivery delay ● Unique consumer & producer names Starting Pulsar consumer status recorder with config: { "topicNames" : [ "persistent://…" ], "topicsPattern" : null, "subscriptionName" : "...", "subscriptionType" : "Shared", "subscriptionMode" : "Durable", "receiverQueueSize" : 1000, "acknowledgementsGroupTimeMicros" : 100000, "negativeAckRedeliveryDelayMicros" : 500000, "maxTotalReceiverQueueSizeAcrossPartitions" : 50000, "consumerName" : null, "ackTimeoutMillis" : 30000, "tickDurationMillis" : 1000, "priorityLevel" : 0, "maxPendingChuckedMessage" : 10, "autoAckOldestChunkedMessageOnQueueFull" : false, "expireTimeOfIncompleteChunkedMessageMillis" : 60000, "cryptoFailureAction" : "FAIL", "properties" : { }, "readCompacted" : false, "subscriptionInitialPosition" : "Latest", "patternAutoDiscoveryPeriod" : 60, "regexSubscriptionMode" : "PersistentOnly",
  26. 26. But something is still missing…
  27. 27. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast (the problem) Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  28. 28. Deployment and elevation practices service v1 service v2 HTTP ingress control plane
  29. 29. Deployment and elevation practices service v2 service v1 HTTP ingress control plane
  30. 30. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  31. 31. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  32. 32. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  33. 33. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  34. 34. Deployment and elevation practices service v1 service v2 HTTP ingress control plane service v2
  35. 35. service v1 Deployment and elevation practices service v2 HTTP ingress control plane service v2
  36. 36. shared pulsar subscription Deploying changes to Pulsar consumers is risky service v1 service v2 service v1 service v2
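To see why this is risky, recall that a Pulsar Shared subscription distributes messages round-robin across all connected consumers. The toy model below (plain Python, no Pulsar client; `dispatch_shared` and the consumer names are hypothetical) shows that new-version instances start taking production messages as soon as they join the subscription.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def dispatch_shared(messages: Iterable[str], consumers: List[str]) -> Dict[str, List[str]]:
    """Round-robin messages across every consumer on the subscription."""
    received: Dict[str, List[str]] = {name: [] for name in consumers}
    for message, consumer in zip(messages, cycle(consumers)):
        received[consumer].append(message)
    return received


# Two v1 instances and two freshly deployed v2 instances share one subscription.
received = dispatch_shared(
    [f"m{i}" for i in range(8)],
    ["v1-a", "v1-b", "v2-a", "v2-b"],
)
v2_share = sum(len(msgs) for name, msgs in received.items() if name.startswith("v2"))
```

Here `v2_share` is 4 of the 8 messages: the unvalidated version handles half of production traffic from the moment it connects.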
  37. 37. Mismatch in tooling Our platform for request-driven service deploys was well ahead of our Pulsar platform, causing developer frustration
  38. 38. User frustration
  39. 39. Principle of least surprise “In interface design, always do the least surprising thing.” - Basics of the Unix Philosophy
  40. 40. Elevations & deploys should be safe, easy, uneventful!
  41. 41. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast (the solution) Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  42. 42. Pulsar operational tooling: elevations & deploys weren’t easy on Pulsar. Contrast: REST services & Pulsar (in 2019)
      Question | REST services | Pulsar consumers
      Can I validate my deploy before prod traffic? | ✅ | ❌
      Can I validate with a small amount of prod traffic? | ✅ | ❌
      Can I easily roll back? | ✅ | ❌
      Can I easily roll forward? | ✅ | ❌
  43. 43. Pulsar Consumer Elevation Requirements 1. Elevate traffic to new consumers as they are set to “active” in the control plane. 2. Avoid building a single point of failure. 3. Make this reusable for other background processes at Toast. 4. No performance hit or extra infrastructure.
  44. 44. Some options we considered Message Router Pattern incoming topic Deploy N Deploy N + 1 Router Control Plane blue topic green topic
  45. 45. Some options we considered Message Router Pattern - Problems incoming topic Deploy N Deploy N + 1 Router Control Plane blue topic green topic ● But, the router is a single point of failure ● More infrastructure to monitor ● Two hops per message
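A rough model of the Message Router Pattern makes the listed problems concrete. This is a hypothetical Python sketch, not code from the talk: `Router`, `active_color`, and the in-memory lists standing in for the blue and green topics are assumptions for illustration.

```python
from typing import Callable, Dict, List


class Router:
    """Hypothetical router: reads the incoming topic, republishes to blue or green."""

    def __init__(self, active_color: Callable[[], str]) -> None:
        self._active_color = active_color  # stands in for a control-plane lookup
        self.topics: Dict[str, List[str]] = {"blue": [], "green": []}

    def route(self, message: str) -> str:
        """First hop was incoming topic -> router; this republish is the second hop."""
        color = self._active_color()
        self.topics[color].append(message)
        return color
```

Every message passes through `route`, so if the router stops, neither deploy receives anything; that is the single point of failure and the extra hop called out on the slide.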
  46. 46. Some options we considered Feature Flags ● Apps use a feature flag to know whether to connect ● But, not integrated with our control plane ● Requires more setup for each consumer incoming topic Deploy N Deploy N + 1 FF Off FF On
  47. 47. Some options we considered Pausing Inactive Consumers ● The Feature Flag approach is close ○ No extra infrastructure ○ No extra hops ● But, we’d need to integrate it into our control plane ● Is this possible with Pulsar? incoming topic Deploy N Deploy N + 1 inactive active
  48. 48. Let’s see what the Pulsar source code has to say about pausing consumers. What does Pulsar provide? In Consumer.java:
  49. 49. Will pause() and resume() work? What do operations look like if inactive consumers call pause()?
      Question | Pulsar consumers | Pulsar consumers with pause()
      Can I validate my deploy before prod traffic? | ❌ | ✅
      Can I validate with a small amount of prod traffic? | ❌ | ❌
      Can I easily roll back? | ❌ | ✅
      Can I easily roll forward? | ❌ | ✅
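The pause() column can be illustrated with a toy dispatcher in which paused consumers receive nothing. This is a simplified model rather than the Pulsar client: `PausableConsumer` only borrows the method names `pause()` and `resume()` from the Java `Consumer` interface, and `dispatch` is a hypothetical stand-in for the broker.

```python
from typing import List


class PausableConsumer:
    """Stand-in for a consumer exposing pause() and resume()."""

    def __init__(self, name: str, paused: bool = False) -> None:
        self.name = name
        self.paused = paused
        self.received: List[str] = []

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


def dispatch(messages: List[str], consumers: List[PausableConsumer]) -> None:
    """Deliver each message round-robin to consumers that are not paused."""
    index = 0
    for message in messages:
        eligible = [c for c in consumers if not c.paused]
        if not eligible:
            return  # everything is paused; messages stay in the backlog
        eligible[index % len(eligible)].received.append(message)
        index += 1
```

With deploy N active and deploy N + 1 paused, the new deploy can be started and checked without taking traffic; rolling forward or back is a matter of swapping which side is paused. The model is all-or-nothing per consumer, which matches the ❌ for validating with a small amount of prod traffic.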
  50. 50. How Would You Solve This? How do we get each consumer to call pause() or resume() at the right time? ● Pausing Pulsar consumers is easy. Knowing when to pause is hard. ● A central control plane component owns this data ● Let’s just poll that service ● What would that look like? (diagram: control plane, service Z)
  51. 51. What’s Wrong With This? ● Used to be the pattern for service discovery at Toast ● Subject to thundering herd ● Now, we leverage Envoy control plane service Z
  52. 52. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  53. 53. How We Leverage Envoy Envoy at Toast
  54. 54. Envoy is a reverse proxy Deployed as a sidecar, forwards requests to their destination Envoy acts as a proxy, forwarding requests upstream. my-service menus GET /menus/v2/menuItems GET /v2/menuItems envoy
  55. 55. Envoy is eventually consistent Routing changes are pushed asynchronously Envoy sidecars across the fleet are pushed updates within ~1-2min of the change. Control Plane …
  56. 56. Envoy knows service status It gets a push each time any deploy goes active or inactive We can leverage this to pause() or resume() consumers.
  57. 57. Envoy direct responses Using an interesting Envoy feature to avoid single points of failure It can intercept requests and reply with a direct response! This gets the status info into the process where the Consumer is running. *magic config* GET /sidecar/v1/elevation/active { "active": true } my-service envoy
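The slide does not show the actual "magic config", so the following is only a guess at its shape. Envoy routes support a `direct_response` action with a `status` and a `body`, and Envoy accepts its configuration as JSON as well as YAML; the Python helper below (a hypothetical `elevation_route` function) builds such a route entry for the path and body shown on the slide. How Toast's control plane generates and pushes the real route is an assumption left out of this sketch.

```python
import json


def elevation_route(active: bool) -> dict:
    """Build a route that answers locally instead of forwarding upstream."""
    return {
        "match": {"path": "/sidecar/v1/elevation/active"},
        "direct_response": {
            "status": 200,
            # The body is a JSON string, e.g. {"active": true}
            "body": {"inline_string": json.dumps({"active": active})},
        },
    }


# What a control plane might push to the sidecars of an active deploy.
print(json.dumps(elevation_route(True), indent=2))
```

Because the sidecar answers the request itself, the application never calls a central service to learn its status, which is how this approach avoids a single point of failure.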
  58. 58. Agenda Toast’s microservice ecosystem + Pulsar Blue/green deployments at Toast Driving Pulsar adoption Our Envoy Proxy control plane “The Pulsar Toggle”
  59. 59. The Pulsar Toggle
  60. 60. “Pulsar Toggle” implementation Leveraging our Envoy Control Plane to toggle Pulsar consumers A thread polls the locally-running Envoy instance and toggles the Pulsar consumer as needed
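A minimal sketch of such a polling thread is shown below, in Python with only the standard library. It is not Toast's implementation: `apply_status`, `start_toggle`, `fetch_from_sidecar`, the `base_url` argument, and the choice to keep the previous state when the sidecar cannot be reached are all assumptions. The `consumer` argument is any object with `pause()` and `resume()` methods, mirroring the names on the Pulsar Java `Consumer`.

```python
import json
import threading
import urllib.request
from typing import Callable


def fetch_from_sidecar(base_url: str) -> str:
    """GET the elevation status from the local Envoy (base_url is an assumption)."""
    url = base_url + "/sidecar/v1/elevation/active"
    with urllib.request.urlopen(url, timeout=2) as response:
        return response.read().decode("utf-8")


def apply_status(body: str, consumer) -> bool:
    """Pause or resume the consumer to match the sidecar's answer."""
    active = bool(json.loads(body)["active"])
    if active:
        consumer.resume()
    else:
        consumer.pause()
    return active


def start_toggle(fetch_status: Callable[[], str], consumer,
                 interval_seconds: float, stop: threading.Event) -> threading.Thread:
    """Poll in a daemon thread until `stop` is set."""

    def loop() -> None:
        while not stop.is_set():
            try:
                apply_status(fetch_status(), consumer)
            except Exception:
                pass  # assumption: keep the previous state if the poll fails
            stop.wait(interval_seconds)

    thread = threading.Thread(target=loop, name="pulsar-toggle", daemon=True)
    thread.start()
    return thread
```

In an application this might be wired up as `start_toggle(lambda: fetch_from_sidecar(base_url), consumer, 5.0, stop_event)`. Each poll stays on the local host, so adding consumers adds no load on the control plane.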
  61. 61. Some “gotchas” ● Eventually consistent: consumers don’t pause immediately; updates propagate with some latency ● Start paused: there wasn’t a way to subscribe in a paused state, so we made a patch to the Java client ● More advanced elevation patterns: currently we can’t support percent elevations of Pulsar traffic onto new deploys ● Receiver queue size: it is critically important to tune this consumer parameter
  62. 62. Results ● ~30 Toggle users in prod, across Pulsar consumers & background workers ● 0 outages: no added load on any critical systems ● 2 contributions to open source: the Java client & the Camel integration
  63. 63. Increased adoption ● 2x new topics: developers are adding topics at twice the rate since the Pulsar Toggle was released (chart: user adoption, linear)
  64. 64. Users love it! Positive feedback from satisfaction surveys with our users ● 65% increase in reported ease of use when deploying Pulsar consumer changes ● 46% decrease in reported risk associated with deploying Pulsar consumer changes
  65. 65. Key Takeaways ● Integration: strong integration with existing systems is critical for org-wide adoption. ● Ease of use: as we make our Pulsar platform easier to use, we see more and more adoption. ● Stability: Pulsar’s stability through big growth has been a killer feature for us.
  66. 66. Kai Levy & Zach Walsh Thank you! klevy@toasttab.com zachary.walsh@toasttab.com Pulsar Summit San Francisco Hotel Nikko August 18 2022 We’re Hiring! careers.toasttab.com
