
Reactive by example - at Reversim Summit 2015


Explaining the Reactive Manifesto through a real-world case study.
This is a cool story about the evolution of our monitoring infrastructure, from the naive approach to a super-resilient system.

How do we manage to handle 4M metrics / minute, and over 1K concurrent connections?
What strategies did we try to apply, and where did they fail?
What are the techniques and technologies we use in order to achieve this?
How do we handle errors, and failures at this scale?
What can we still improve?



  1. Reactive By Example - Eran Harel (@eran_ha)
  2. source: The Reactive Manifesto
  3. Responsive: The system responds in a timely manner if at all possible. (source: The Reactive Manifesto)
  4. Resilient: The system stays responsive in the face of failure. (source: The Reactive Manifesto)
  5. Elastic: The system stays responsive under varying workload. (source: The Reactive Manifesto)
  6. Message Driven: Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation, location transparency, and provides the means to delegate errors as messages. (source: The Reactive Manifesto)
  7. Case Study: Scaling our metric delivery system
  8. Graphite ● Graphite is a highly scalable real-time graphing system. ● Graphite performs two pretty simple tasks: storing numbers that change over time and graphing them.
  9. Graphite
  10. Graphite plain-text Protocol: <metric path> <value> <unix epoch timestamp>\n For example: servers.foo1.load.shortterm 4.5 1286269260\n
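The protocol is simple enough to render in a few lines of Java (Gruffalo's own stack). A minimal sketch; the class and method names are illustrative, not part of any real client library:

```java
// Sketch: building one Graphite plain-text protocol line.
// Format: <metric path> <value> <unix epoch timestamp>\n
public class GraphiteLine {
    // Formats a single metric in the plain-text protocol.
    public static String format(String path, double value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    public static void main(String[] args) {
        // The example from the slide:
        System.out.print(format("servers.foo1.load.shortterm", 4.5, 1286269260L));
    }
}
```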
  11. Brief History - take I: App -> Graphite. This kept us going for a while… The I/O interrupts were too much for Graphite.
  12. Brief History - take II: App -> LogStash -> RabbitMQ -> LogStash -> Graphite. The LogStash on localhost couldn’t handle the load; it crashed and hung on a regular basis. The horror...
  13. Brief History - take III: App -> Gruffalo -> RabbitMQ -> LogStash -> Graphite. The queue-consuming LogStash was way too slow. Queue build-up hung RabbitMQ and stopped the producers on Gruffalo. Total failure.
  14. Brief History - take IV: App -> Gruffalo -> Graphite (single carbon relay). A single relay couldn’t take all the load, and losing it means Graphite is 100% unavailable.
  15. Brief History - take V: App -> Gruffalo -> Graphite (multiple carbon relays). Great success, but not for long. As we grew our metric count we had to take additional measures to make it stable.
  16. Introducing Gruffalo ● Gruffalo acts as a proxy to Graphite; it ○ Uses non-blocking IO (Netty) ○ Protects Graphite from the herd of clients, minimizing context switches and interrupts ○ Replicates metrics between Data Centers ○ Batches metrics ○ Increases the Graphite availability
  17. Metrics Delivery HL Design [diagram: metrics fan out through multiple carbon relays in DC1 and DC2]
  18. Graphite (Gruffalo) Clients ● GraphiteReporter ● Collectd ● StatsD ● JmxTrans ● Bucky ● netcat ● Slingshot
  19. Metrics Clients Behavior ● Most clients open up a fresh connection, once per minute, and publish ~1,000 - 5,000 metrics ● Each metric is flushed immediately
  20. Scale (Metrics / Min): More than 4M metrics per minute sent to Graphite
  21. Scale (Concurrent Connections)
  22. Scale (bps)
  23. Hardware ● We handle the load using 2 Gruffalo instances in each Data Center (4 cores each) ● A single instance can handle the load, but we need redundancy
  24. The Gruffalo Pipeline (Inbound): IdleStateHandler (helps detect dropped / leaked connections) -> LineFramer -> StringDecoder -> BatchHandler (handling ends here unless the batch is full, 4KB) -> PublishHandler -> Graphite Client
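The batch handler's role can be sketched without Netty: accumulate metric lines and hand them downstream only once the batch reaches the size threshold (4KB per the slide). A minimal, illustrative version; the class, method names, and the in-memory "flushed" list stand in for writing to a carbon relay:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching idea: buffer metric lines, flush on a 4KB threshold.
public class MetricBatcher {
    static final int BATCH_SIZE_BYTES = 4096;
    private final StringBuilder batch = new StringBuilder();
    private final List<String> flushed = new ArrayList<>();

    // Appends one metric line; flushes when the accumulated batch is full.
    public void append(String metricLine) {
        batch.append(metricLine);
        if (batch.length() >= BATCH_SIZE_BYTES) {
            flush();
        }
    }

    // In the real system this would write the batch to a carbon relay;
    // here we just collect it so the behavior is observable.
    public void flush() {
        if (batch.length() > 0) {
            flushed.add(batch.toString());
            batch.setLength(0);
        }
    }

    public List<String> flushedBatches() { return flushed; }
}
```

Batching like this is what keeps per-metric flushes from the clients (each metric is flushed immediately on their side) from turning into per-metric writes toward Graphite.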
  25. The Graphite Client Pipeline (Outbound): IdleStateHandler (helps detect dropped connections) -> StringDecoder -> StringEncoder -> GraphiteHandler (handles reconnects, back-pressure, and dropped connections)
  26. Graphite Client Load Balancing [diagram: metric batches fanned out across Carbon Relay 1, Carbon Relay 2, ... Carbon Relay n]
  27. Graphite Client Retries ● A connection to a carbon relay may be down. But we have more than one relay. ● We make a noble attempt to find a target to publish metrics to, even if some relay connections are down.
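One way to combine the load balancing and the "noble attempt" is round-robin selection that skips relays whose connection is down and gives up only when every relay is unavailable. A sketch under those assumptions; all names are illustrative:

```java
import java.util.List;

// Sketch: round-robin relay selection that skips down connections.
public class RelaySelector {
    private final List<Relay> relays;
    private int next = 0;

    public RelaySelector(List<Relay> relays) { this.relays = relays; }

    // Tries each relay at most once, starting from the round-robin cursor.
    // Returns null only if all relay connections are down.
    public Relay pickTarget() {
        for (int attempts = 0; attempts < relays.size(); attempts++) {
            Relay candidate = relays.get(next);
            next = (next + 1) % relays.size();
            if (candidate.isConnected()) {
                return candidate;
            }
        }
        return null; // total outage: the caller must buffer or drop
    }

    public interface Relay {
        boolean isConnected();
    }
}
```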
  28. Graphite Client Reconnects: Processes crash, the network is *not* reliable, and timeouts do occur...
  29. Graphite Client Metric Replication ● For DR purposes we replicate each metric to 2 Data Centers. ● ...Yes, it can be done elsewhere… ● Sending millions of metrics across the WAN, to a remote data center, is what brings most of the challenges.
  30. Handling Graceless Disconnections ● We came across an issue where an unreachable data center was not detected by the TCP stack. ● This renders the outbound channel unwritable. ● Solution: Trigger reconnection when no writes are performed on a connection for 10 sec.
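The write-idleness check can be boiled down to timestamp bookkeeping. In Gruffalo this is presumably what the IdleStateHandler in the outbound pipeline provides; the standalone sketch below takes the clock as a parameter so the logic is testable, and all names are illustrative:

```java
// Sketch: if no write has completed on the outbound channel for 10 seconds,
// assume the peer is unreachable (even though TCP hasn't noticed) and reconnect.
public class WriteIdleWatchdog {
    static final long IDLE_LIMIT_MILLIS = 10_000;
    private long lastWriteMillis;

    public WriteIdleWatchdog(long nowMillis) { this.lastWriteMillis = nowMillis; }

    // Call whenever a write succeeds on the connection.
    public void onWrite(long nowMillis) { lastWriteMillis = nowMillis; }

    // Called periodically by a timer; true means "tear down and reconnect".
    public boolean shouldReconnect(long nowMillis) {
        return nowMillis - lastWriteMillis >= IDLE_LIMIT_MILLIS;
    }
}
```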
  31. Queues Everywhere ● SO_BACKLOG - the queue of incoming connections ● EventLoop queues (inbound and outbound) ● NIC driver queues - and on each device on the way 0_o
  32. Why are queues bad? ● If queues grow unbounded, at some point the process will exhaust all available RAM and crash, or become unresponsive. ● At this point you need to apply either ○ Back-Pressure ○ Drop requests: SLA-- ○ Crash: is this an option?
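The full-queue trade-off is easiest to see with a bounded queue that supports both responses: block the producer (back-pressure) or drop the message (SLA--). A sketch; the class name and capacity are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: a bounded inbox with the two full-queue strategies from the slide.
public class BoundedInbox {
    private final BlockingQueue<String> queue;
    private long dropped = 0;

    public BoundedInbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Back-pressure: blocks the caller until there is room.
    public void submitBlocking(String msg) throws InterruptedException {
        queue.put(msg);
    }

    // Load shedding: drops the message when the queue is full (SLA--).
    public boolean submitOrDrop(String msg) {
        boolean accepted = queue.offer(msg);
        if (!accepted) dropped++;
        return accepted;
    }

    public long droppedCount() { return dropped; }
    public int size() { return queue.size(); }
}
```

The third option from the slide, crashing, is just letting the unbounded version run until the allocator gives up.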
  33. Why are queues bad? ● Queues can add latency of up to the queue length times the per-item service time (in the worst case).
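A worked example of that worst case, with illustrative numbers: an item entering a queue behind N items, each taking s to process, waits roughly N * s before it is served.

```java
// Worked example: worst-case queueing delay is depth * per-item service time.
public class QueueLatency {
    public static double worstCaseWaitMillis(int queueDepth, double serviceMillisPerItem) {
        return queueDepth * serviceMillisPerItem;
    }

    public static void main(String[] args) {
        // 10,000 queued metrics at 1 ms each => ~10 seconds of added latency.
        System.out.println(worstCaseWaitMillis(10_000, 1.0) + " ms");
    }
}
```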
  34. Back-Pressure ● When one component is struggling to keep up, the system as a whole needs to respond in a sensible way. ● Back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it.
  35. Back-Pressure (take I) ● Netty sends an event when the channel writability changes ● We use this to stop / resume reads from all inbound connections, and stop / resume accepting new connections ● This isn’t enough under high loads
  36. Back-Pressure (take II) ● We implemented throttling based on the outstanding message count ● Set up metrics and observe before applying this
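Outstanding-message throttling amounts to a counter of in-flight messages that caps how far the inbound side may read ahead of the outbound side. A sketch under that assumption; the names and the limit are illustrative, and in Netty the "pause reads" side would be something like `channel.config().setAutoRead(false)`:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: cap the number of in-flight messages between inbound and outbound.
public class OutstandingThrottle {
    private final int limit;
    private final AtomicInteger outstanding = new AtomicInteger();

    public OutstandingThrottle(int limit) { this.limit = limit; }

    // Called before reading more from inbound connections;
    // false means "pause reads until some messages drain".
    public boolean tryAcquire() {
        while (true) {
            int current = outstanding.get();
            if (current >= limit) return false;
            if (outstanding.compareAndSet(current, current + 1)) return true;
        }
    }

    // Called when a message has been handed off downstream.
    public void release() { outstanding.decrementAndGet(); }

    public int outstanding() { return outstanding.get(); }
}
```

The slide's advice applies here: pick the limit by observing real outstanding counts under load, not by guessing.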
  37. Idle / Leaked Inbound Connections Detection ● Broken connections can’t be detected by the receiving side. ● Half-open connections can be caused by crashes (process, host, routers), unplugged network cables, etc. ● Solution: We close all idle inbound connections
  38. The Load Balancing Problem ● TCP Keep-alive? ● HAProxy? ● DNS? ● Something else?
  39. Consul Client-Side Load Balancing ● We register Gruffalo instances in Consul ● Clients use Consul DNS and resolve a random host on each metrics batch ● This makes scaling, maintenance, and deployments easy with zero client code changes :)
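Since Consul's DNS interface returns the healthy instances as ordinary A records, the client side reduces to "resolve, then pick one at random per batch". A sketch; the `gruffalo.service.consul` name in the comment is a plausible Consul-style name, not one from the talk:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Random;

// Sketch: client-side load balancing via DNS - resolve all A records
// for the service name and pick one at random per metrics batch.
public class DnsLoadBalancer {
    private static final Random RANDOM = new Random();

    public static InetAddress pickHost(String serviceName) throws UnknownHostException {
        InetAddress[] addresses = InetAddress.getAllByName(serviceName);
        return addresses[RANDOM.nextInt(addresses.length)];
    }

    public static void main(String[] args) throws UnknownHostException {
        // With Consul DNS this would be a name like "gruffalo.service.consul".
        System.out.println(pickHost("localhost"));
    }
}
```

One caveat worth knowing: the JVM caches DNS lookups, so a setup like this needs the cache TTL (`networkaddress.cache.ttl`) kept low for the random pick to actually rotate.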
  40. Auto Scaling? [What can be done to achieve auto-scaling?]
  41. Questions? “Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback.” (source: The Reactive Manifesto)
  42. Wouldn’t you want to do this daily? We’re recruiting ;)