Reactive By Example
Eran Harel - @eran_ha
source: http://www.reactivemanifesto.org/
The Reactive Manifesto
Responsive
The system responds in a timely manner if at
all possible.
source: http://www.reactivemanifesto.org/
Resilient
The system stays responsive in the face of
failure.
source: http://www.reactivemanifesto.org/
Elastic
The system stays responsive under varying
workload.
source: http://www.reactivemanifesto.org/
Message Driven
Reactive Systems rely on asynchronous message-passing
to establish a boundary between components that ensures
loose coupling, isolation, location transparency, and
provides the means to delegate errors as messages.
source: http://www.reactivemanifesto.org/
Case Study
Scaling our metric delivery system
Graphite
● Graphite is a highly scalable real-time
graphing system.
● Graphite performs two pretty simple tasks:
storing numbers that change over time and
graphing them.
● Sources:
○ http://graphite.wikidot.com/faq
○ http://aosabook.org/en/graphite.html
Graphite
http://aosabook.org/en/graphite.html
Graphite plain-text Protocol
<dotted.metric.name> <value> <unix epoch>\n
For example:
servers.foo1.load.shortterm 4.5 1286269260\n
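A minimal sketch of producing and parsing one line of the plain-text protocol shown above. The function names `format_metric` and `parse_metric` are hypothetical helpers for illustration, not part of Graphite or Gruffalo.

```python
import time
from typing import Optional, Tuple


def format_metric(name: str, value: float, timestamp: Optional[int] = None) -> str:
    """Render one metric as '<name> <value> <unix epoch>\\n'."""
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("metric name must be non-empty and contain no whitespace")
    if timestamp is None:
        timestamp = int(time.time())  # default to "now" in whole seconds
    return f"{name} {value} {timestamp}\n"


def parse_metric(line: str) -> Tuple[str, float, int]:
    """Split one protocol line back into (name, value, timestamp)."""
    name, value, timestamp = line.split()
    return name, float(value), int(timestamp)
```

Each metric is one newline-terminated line, so a sender can concatenate many lines into a single write.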
Brief History - take I
App -> Graphite
This kept us going for a while…
The I/O interrupts were too much for Graphite.
Brief History - take II
App -> LogStash -> RabbitMQ -> LogStash ->
Graphite
The LogStash on localhost couldn’t handle the load; it crashed and
hung on a regular basis.
The horror...
Brief History - take III
App -> Gruffalo -> RabbitMQ -> LogStash -> Graphite
The queue consuming LogStash was way too slow.
Queue build-up hung RabbitMQ and stopped the producers on
Gruffalo.
Total failure.
Brief History - take IV
App -> Gruffalo -> Graphite (single carbon relay)
A single relay couldn’t take all the load, and losing it means
Graphite is 100% unavailable.
Brief History - take V
App -> Gruffalo -> Graphite (multi carbon relays)
Great success, but not for long.
As we grew our metric count we had to take additional measures
to make it stable.
Introducing Gruffalo
● Gruffalo acts as a proxy to Graphite; it
○ Uses non-blocking IO (Netty)
○ Protects Graphite from the herd of clients,
minimizing context switches and interrupts
○ Replicates metrics between Data Centers
○ Batches metrics
○ Increases the Graphite availability
https://github.com/outbrain/gruffalo
Metrics Delivery HL Design
[Diagram: three Carbon Relays in DC1 and three Carbon Relays in DC2]
Graphite (Gruffalo) Clients
● GraphiteReporter
● Collectd
● StatsD
● JmxTrans
● Bucky
● netcat
● Slingshot
Metrics Clients Behavior
● Most clients open up a fresh connection,
once per minute, and publish ~1000K -
5000K metrics
● Each metric is flushed immediately
Scale (Metrics / Min)
More than 4M metrics per minute sent to Graphite
Scale (Concurrent Connections)
Scale (bps)
Hardware
● We handle the load using 2 Gruffalo
instances in each Data Center (4 cores
each)
● A single instance can handle the load, but
we need redundancy
The Gruffalo Pipeline (Inbound)
IdleState Handler -> Line Framer -> String Decoder -> Batch Handler ->
Publish Handler -> Graphite Client
● IdleState Handler: helps detect dropped / leaked connections
● Batch Handler: handling ends here unless the batch is full (4KB)
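The batching step can be sketched as follows. This is an illustration under assumptions, not Gruffalo's actual handler: the class name `MetricBatcher` and the `publish` callback are hypothetical, and only the 4KB threshold comes from the slide.

```python
from typing import Callable, List


class MetricBatcher:
    """Accumulates metric lines and publishes them once the batch is full."""

    def __init__(self, publish: Callable[[str], None], max_bytes: int = 4096):
        self._publish = publish      # called with one concatenated batch
        self._max_bytes = max_bytes  # 4KB by default, as on the slide
        self._lines: List[str] = []
        self._size = 0

    def add(self, line: str) -> None:
        """Buffer one line; handling ends here unless the batch is full."""
        self._lines.append(line)
        self._size += len(line.encode("utf-8"))
        if self._size >= self._max_bytes:
            self.flush()

    def flush(self) -> None:
        """Publish whatever is buffered and start a new batch."""
        if self._lines:
            self._publish("".join(self._lines))
            self._lines = []
            self._size = 0
```

Batching trades a little delay for far fewer downstream writes; a real handler would also flush when the connection closes or goes idle.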
The Graphite Client Pipeline (Outbound)
IdleState Handler -> String Decoder -> String Encoder -> Graphite Handler
● IdleState Handler: helps detect dropped connections
● Graphite Handler: handles reconnects, back-pressure, and dropped
connections
Graphite Client Load Balancing
Metric batches -> Carbon Relay 1, Carbon Relay 2, ..., Carbon Relay n
Graphite Client Retries
● A connection to a carbon relay may be
down. But we have more than one relay.
● We make a noble attempt to find a target to
publish metrics to, even if some relay
connections are down.
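One way to sketch the load balancing and retry behaviour described above. The slides do not say which selection policy is used, so round-robin is an assumption here, and `RelayBalancer` and `is_up` are hypothetical names.

```python
from typing import Callable, Optional, Sequence


class RelayBalancer:
    """Round-robin over relays, skipping the ones that are currently down."""

    def __init__(self, relays: Sequence[str], is_up: Callable[[str], bool]):
        if not relays:
            raise ValueError("at least one relay is required")
        self._relays = list(relays)
        self._is_up = is_up  # assumed health check, e.g. "is the channel connected?"
        self._next = 0

    def pick(self) -> Optional[str]:
        """Return the next healthy relay, or None if every relay is down."""
        for _ in range(len(self._relays)):
            relay = self._relays[self._next]
            self._next = (self._next + 1) % len(self._relays)
            if self._is_up(relay):
                return relay
        return None
```

Each batch tries every relay at most once, so a fully unavailable Graphite yields `None` instead of looping forever; the caller then decides whether to drop or apply back-pressure.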
Graphite Client Reconnects
Processes crash, the network is *not* reliable,
and timeouts do occur...
Graphite Client Metric Replication
● For DR purposes we replicate each metric to
2 Data Centers.
● ...Yes it can be done elsewhere…
● Sending millions of metrics across the WAN to a
remote data center is what brings most of the
challenges
Handling Graceless Disconnections
● We came across an issue where an
unreachable data center was not detected
by the TCP stack.
● This renders the outbound channel
unwritable
● Solution: Trigger reconnection when no
writes are performed on a connection for 10
sec.
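The write-idle rule above can be sketched like this. In Netty this kind of timer is typically provided by IdleStateHandler, which appears in the pipeline slides; the class below is a hypothetical stand-alone illustration with an injectable clock so it can be tested.

```python
import time
from typing import Callable


class WriteIdleWatchdog:
    """Triggers a reconnect when no write has happened for timeout_s seconds."""

    def __init__(self, reconnect: Callable[[], None], timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._reconnect = reconnect
        self._timeout_s = timeout_s  # 10 sec, as on the slide
        self._clock = clock
        self._last_write = clock()

    def on_write(self) -> None:
        """Record a successful write on the connection."""
        self._last_write = self._clock()

    def check(self) -> bool:
        """Call periodically; returns True if a reconnect was triggered."""
        if self._clock() - self._last_write >= self._timeout_s:
            self._reconnect()
            self._last_write = self._clock()  # restart the idle window
            return True
        return False
```

The watchdog relies on application-level silence rather than on the TCP stack, which is the point of the slide: the stack never reported the unreachable data center.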
Queues Everywhere
● SO_BACKLOG - queue of incoming
connections
● EventLoop queues (inbound and outbound)
● NIC driver queues - and on each device on
the way 0_o
Why are queues bad?
● If queues grow unbounded, at some point,
the process will exhaust all available RAM
and crash, or become unresponsive.
● At this point you need to apply either
○ Back-Pressure
○ Drop requests: SLA--
○ Crash: is this an option?
Why are queues bad?
● Queues can increase latency in proportion to
the size of the queue (in the worst case).
● When one component is struggling to keep up,
the system as a whole needs to respond
in a sensible way.
● Back-pressure is an important feedback
mechanism that allows systems to gracefully
respond to load rather than collapse under it.
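A small sketch of the two points above: a bounded queue forces an explicit decision when it is full, and the worst-case wait grows with queue length. The helper names are hypothetical; `queue.Queue` is the Python standard library class.

```python
import queue


def offer(q: "queue.Queue", item: object) -> bool:
    """Try to enqueue without blocking; False means 'full, decide what to do'."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # caller applies back-pressure or drops the item


def worst_case_wait_s(queue_len: int, service_rate_per_s: float) -> float:
    """Time the last item waits if everything ahead of it is served first."""
    return queue_len / service_rate_per_s
```

For example, 1000 queued items drained at 100 items per second means the newest item waits about 10 seconds, which is why an unbounded queue hides overload instead of fixing it.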
Back-Pressure
Back-Pressure (take I)
● Netty sends an event when the channel
writability changes
● We use this to stop / resume reads from all
inbound connections, and stop / resume
accepting new connections
● This isn’t enough under high loads
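A rough model of writability-based back-pressure, assuming high and low water marks on pending outbound bytes (the scheme Netty uses for channel writability). `WritabilityGate` and its thresholds are hypothetical; the `on_change` callback stands in for stopping or resuming inbound reads.

```python
from typing import Callable


class WritabilityGate:
    """Reports writability changes as pending bytes cross the water marks."""

    def __init__(self, high: int, low: int, on_change: Callable[[bool], None]):
        if low > high:
            raise ValueError("low water mark must not exceed high water mark")
        self._high = high
        self._low = low
        self._on_change = on_change  # e.g. pause (False) or resume (True) reads
        self._pending = 0
        self._writable = True

    def enqueue(self, nbytes: int) -> None:
        """Bytes were queued for writing."""
        self._pending += nbytes
        if self._writable and self._pending > self._high:
            self._writable = False
            self._on_change(False)

    def flushed(self, nbytes: int) -> None:
        """Bytes were actually written to the socket."""
        self._pending = max(0, self._pending - nbytes)
        if not self._writable and self._pending < self._low:
            self._writable = True
            self._on_change(True)
```

The gap between the two marks prevents rapid flapping between paused and resumed states.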
Back-Pressure (take II)
● We implemented throttling based on
outstanding messages count
● Set up metrics and observe before applying
this
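Throttling on outstanding message count might look like the sketch below. The class name and the limit are hypothetical, and the code assumes single-threaded use (for example from one event loop), so it takes no locks.

```python
class OutstandingThrottle:
    """Caps the number of messages sent but not yet completed."""

    def __init__(self, max_outstanding: int):
        if max_outstanding <= 0:
            raise ValueError("max_outstanding must be positive")
        self._max = max_outstanding
        self._outstanding = 0

    @property
    def outstanding(self) -> int:
        return self._outstanding

    def try_send(self) -> bool:
        """Reserve a slot; False means the caller should pause reading."""
        if self._outstanding >= self._max:
            return False
        self._outstanding += 1
        return True

    def on_complete(self) -> None:
        """A write finished (successfully or not); release its slot."""
        if self._outstanding > 0:
            self._outstanding -= 1
```

Exporting `outstanding` as a metric first, as the slide advises, shows what a sensible limit is before the throttle is switched on.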
Idle / Leaked Inbound Connections
Detection
● Broken connections can’t be detected by the
receiving side.
● Half-Open connections can be caused by
crashes (process, host, routers), unplugging
network cables, etc
● Solution: We close all idle inbound
connections
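Closing idle inbound connections can be sketched as a periodic sweep over last-read times. The slides give no timeout value, so the one used by the caller is an assumption, and `IdleReaper` is a hypothetical name.

```python
import time
from typing import Callable, Dict, List


class IdleReaper:
    """Tracks the last read per connection and reports the idle ones."""

    def __init__(self, idle_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_read: Dict[str, float] = {}

    def on_read(self, conn_id: str) -> None:
        """Record inbound activity for a connection."""
        self._last_read[conn_id] = self._clock()

    def reap(self) -> List[str]:
        """Return and forget connections idle for at least the timeout."""
        now = self._clock()
        idle = [c for c, t in self._last_read.items()
                if now - t >= self._idle_timeout_s]
        for conn_id in idle:
            del self._last_read[conn_id]  # the caller closes the socket
        return idle
```

Because most clients reconnect once per minute anyway, closing a silent connection costs little and frees resources held by half-open peers.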
The Load Balancing Problem
● TCP Keep-alive?
● HAProxy?
● DNS?
● Something else?
Consul Client Side Load Balancing
● We register Gruffalo instances in Consul
● Clients use Consul DNS and resolve a
random host on each metrics batch
● This makes scaling, maintenance, and
deployments easy with zero client code
changes :)
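Client-side balancing over DNS can be sketched as: resolve the service name, then pick a random address for each batch. The function names are hypothetical, and so is the service name in the usage note; only the standard `socket.getaddrinfo` and `random.choice` calls are relied on.

```python
import random
import socket
from typing import List, Sequence


def resolve_hosts(service_name: str, port: int) -> List[str]:
    """Resolve a DNS name to the distinct addresses currently registered."""
    infos = socket.getaddrinfo(service_name, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def pick_host(hosts: Sequence[str]) -> str:
    """Choose one address at random for the next metrics batch."""
    if not hosts:
        raise LookupError("no healthy instances resolved")
    return random.choice(list(hosts))
```

A client would call something like `pick_host(resolve_hosts("gruffalo.service.consul", 2003))` before each batch, so instances that are added or removed are picked up without client code changes.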
Auto Scaling?
[What can be done to achieve auto-scaling?]
Questions?
“Systems built as Reactive Systems are more flexible,
loosely-coupled and scalable. This makes them easier to
develop and amenable to change. They are significantly
more tolerant of failure and when failure does occur they
meet it with elegance rather than disaster. Reactive
Systems are highly responsive, giving users effective
interactive feedback.”
source: http://www.reactivemanifesto.org/
Wouldn’t you want to do this daily?
We’re recruiting ;)
Links
● http://www.reactivemanifesto.org/
● https://www.youtube.com/watch?v=IGW5VcnJLuU
● https://github.com/outbrain/gruffalo
● http://aosabook.org/en/graphite.html
● http://ferd.ca/queues-don-t-fix-overload.html

Reactive Systems Case Study: Scaling a Metrics Delivery System

  • 1. Reactive By Example Eran Harel - @eran_ha
  • 3. Responsive The system responds in a timely manner if at all possible. source: http://www.reactivemanifesto.org/
  • 4. Resilient The system stays responsive in the face of failure. source: http://www.reactivemanifesto.org/
  • 5. Elastic The system stays responsive under varying workload. source: http://www.reactivemanifesto.org/
  • 6. Message Driven Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation, location transparency, and provides the means to delegate errors as messages. source: http://www.reactivemanifesto.org/
  • 7. Case Study Scaling our metric delivery system
  • 8. Graphite ● Graphite is a highly scalable real-time graphing system. ● Graphite performs two pretty simple tasks: storing numbers that change over time and graphing them. ● Sources: ○ http://graphite.wikidot.com/faq ○ http://aosabook.org/en/graphite.html
  • 10. Graphite plain-text Protocol <dotted.metric.name> <value> <unix epoch>\n For example: servers.foo1.load.shortterm 4.5 1286269260\n
  • 11. Brief History - take I App -> Graphite This kept us going for a while… The I/O interrupts were too much for Graphite.
  • 12. Brief History - take II App -> LogStash -> RabbitMQ -> LogStash -> Graphite The LogStash on localhost couldn’t handle the load, and crashed and hung on a regular basis. The horror...
  • 13. Brief History - take III App -> Gruffalo -> RabbitMQ -> LogStash -> Graphite The queue-consuming LogStash was way too slow. Queue build-up hung RabbitMQ and stopped the producers on Gruffalo. Total failure.
  • 14. Brief History - take IV App -> Gruffalo -> Graphite (single carbon relay) A single relay couldn’t take all the load, and losing it means Graphite is 100% unavailable.
  • 15. Brief History - take V App -> Gruffalo -> Graphite (multi carbon relays) Great success, but not for long. As we grew our metric count we had to take additional measures to make it stable.
  • 16. Introducing Gruffalo ● Gruffalo acts as a proxy to Graphite; it ○ Uses non-blocking IO (Netty) ○ Protects Graphite from the herd of clients, minimizing context switches and interrupts ○ Replicates metrics between Data Centers ○ Batches metrics ○ Increases the Graphite availability https://github.com/outbrain/gruffalo
  • 17. Metrics Delivery HL Design [Diagram: three Carbon Relays in each of two data centers, DC1 and DC2]
  • 18. Graphite (Gruffalo) Clients ● GraphiteReporter ● Collectd ● StatsD ● JmxTrans ● Bucky ● netcat ● Slingshot
  • 19. Metrics Clients Behavior ● Most clients open up a fresh connection, once per minute, and publish ~1000K - 5000K metrics ● Each metric is flushed immediately
  • 20. Scale (Metrics / Min) More than 4M metrics per minute are sent to Graphite
  • 23. Hardware ● We handle the load using 2 Gruffalo instances in each Data Center (4 cores each) ● A single instance can handle the load, but we need redundancy
  • 24. The Gruffalo Pipeline (Inbound) IdleState Handler -> Line Framer -> String Decoder -> Batch Handler -> Publish Handler -> Graphite Client ● IdleState Handler: helps detect dropped / leaked connections ● Batch Handler: handling ends here unless the batch is full (4KB)
  • 25. The Graphite Client Pipeline (Outbound) IdleState Handler -> String Decoder -> String Encoder -> Graphite Handler ● Graphite Handler: handles reconnects, back-pressure, and dropped connections ● IdleState Handler: helps detect dropped connections
  • 26. Graphite Client Load Balancing [Diagram: metric batches distributed across Carbon Relay 1, Carbon Relay 2, ... Carbon Relay n]
  • 27. Graphite Client Retries ● A connection to a carbon relay may be down. But we have more than one relay. ● We make a noble attempt to find a target to publish metrics to, even if some relay connections are down.
  • 28. Graphite Client Reconnects Processes crash, the network is *not* reliable, and timeouts do occur...
  • 29. Graphite Client Metric Replication ● For DR purposes we replicate each metric to 2 Data Centers. ● ...Yes it can be done elsewhere… ● Sending millions of metrics across the WAN, to a remote data center is what brings most of the challenges
  • 30. Handling Graceless Disconnections ● We came across an issue where an unreachable data center was not detected by the TCP stack. ● This renders the outbound channel unwritable. ● Solution: Trigger reconnection when no writes are performed on a connection for 10 seconds.
  • 31. Queues Everywhere ● SO_BACKLOG - queue of incoming connections ● EventLoop queues (inbound and outbound) ● NIC driver queues - and on each device on the way 0_o
  • 32. Why are queues bad? ● If queues grow unbounded, at some point, the process will exhaust all available RAM and crash, or become unresponsive. ● At this point you need to apply either ○ Back-Pressure ○ Drop requests: SLA-- ○ Crash: is this an option?
  • 33. Why are queues bad? ● Queues can increase latency in proportion to the size of the queue (in the worst case).
  • 34. Back-Pressure ● When one component is struggling to keep up, the system as a whole needs to respond in a sensible way. ● Back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it.
  • 35. Back-Pressure (take I) ● Netty sends an event when the channel writability changes ● We use this to stop / resume reads from all inbound connections, and stop / resume accepting new connections ● This isn’t enough under high loads
  • 36. Back-Pressure (take II) ● We implemented throttling based on outstanding messages count ● Set up metrics and observe before applying this
  • 37. Idle / Leaked Inbound Connections Detection ● Broken connections can’t be detected by the receiving side. ● Half-open connections can be caused by crashes (process, host, routers), unplugging network cables, etc. ● Solution: We close all idle inbound connections
  • 38. The Load Balancing Problem ● TCP Keep-alive? ● HAProxy? ● DNS? ● Something else?
  • 39. Consul Client Side Load Balancing ● We register Gruffalo instances in Consul ● Clients use Consul DNS and resolve a random host on each metrics batch ● This makes scaling, maintenance, and deployments easy with zero client code changes :)
  • 40. Auto Scaling? [What can be done to achieve auto-scaling?]
  • 41. Questions? “Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback.” source: http://www.reactivemanifesto.org/
  • 42. Wouldn’t you want to do this daily? We’re recruiting ;)
  • 43. Links ● http://www.reactivemanifesto.org/ ● https://www.youtube.com/watch?v=IGW5VcnJLuU ● https://github.com/outbrain/gruffalo ● http://aosabook.org/en/graphite.html ● http://ferd.ca/queues-don-t-fix-overload.html
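
The "Back-Pressure (take II)" slide describes throttling based on the count of outstanding messages. The following is a minimal sketch of that idea, written in Python rather than on top of Netty. The class name `OutstandingThrottle`, the `high_watermark` / `low_watermark` parameters, and the example thresholds are all hypothetical; the slides do not say how Gruffalo implements this or which limits it uses.

```python
class OutstandingThrottle:
    """Hypothetical helper: counts messages sent but not yet completed and
    reports whether inbound reads should currently be paused."""

    def __init__(self, high_watermark: int, low_watermark: int) -> None:
        if not 0 <= low_watermark < high_watermark:
            raise ValueError("require 0 <= low_watermark < high_watermark")
        self.high_watermark = high_watermark  # pause reads at or above this
        self.low_watermark = low_watermark    # resume reads at or below this
        self.outstanding = 0
        self.reads_paused = False

    def on_send(self, count: int = 1) -> bool:
        """Record messages handed to the outbound side.

        Returns True if inbound reads should be paused."""
        self.outstanding += count
        if self.outstanding >= self.high_watermark:
            self.reads_paused = True
        return self.reads_paused

    def on_complete(self, count: int = 1) -> bool:
        """Record messages whose write finished (successfully or not).

        Returns True if inbound reads should remain paused."""
        self.outstanding = max(0, self.outstanding - count)
        if self.outstanding <= self.low_watermark:
            self.reads_paused = False
        return self.reads_paused


# Usage: with assumed thresholds of 3 and 1, reads pause on the third
# outstanding message and resume once only one remains outstanding.
throttle = OutstandingThrottle(high_watermark=3, low_watermark=1)
paused_after_sends = [throttle.on_send() for _ in range(3)]
paused_after_completions = [throttle.on_complete() for _ in range(2)]
```

Using two thresholds instead of one keeps reads from flapping on and off when the outstanding count hovers around a single limit. As the slide advises, the thresholds would need to be chosen from observed metrics rather than guessed.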