Resilient design 101
Avishai Ish-Shalom
github.com/avishai-ish-shalom | @nukemberg | avishai.is@wix.com
Wix in numbers
~ 500 Engineers
~ 1500 employees
~ 100M users
~ 500 microservices
Wix Engineering Locations
▪ Israel: Tel-Aviv, Be’er Sheva
▪ Lithuania: Vilnius
▪ Ukraine: Kyiv, Dnipro
01 Queues
Queues are everywhere!
▪ Futures/Executors
▪ Sockets
▪ Locks (DB Connection pools)
▪ Callbacks in node.js/Netty
Anything async?!
Queues
▪ Incoming load (arrival rate)
▪ Service from the queue (service rate)
▪ Service discipline (FIFO/LIFO/Priority)
▪ Latency = Wait time + Service time
▪ Service time independent of queue
It varies
▪ Arrival rate fluctuates
▪ Service times fluctuate
▪ Delays accumulate
▪ Idle time wasted
Queues are almost always full or near-empty!
Capacity & Latency
▪ Latency (and queue size) rises to infinity as utilization approaches 1
▪ For QoS, keep ρ << 0.75
▪ Decent latency requires overcapacity
ρ = arrival rate / service rate (utilization)
Implications
Infinite queues:
▪ Memory pressure / OOM
▪ High latency
▪ Stale work
Always limit queue size!
Work item TTL*
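A minimal sketch of the two rules above (cap the queue, expire old work), written in Python for brevity. The names `WorkItem` and `BoundedWorkQueue` are hypothetical and not from the talk; the TTL check on dequeue is one possible policy among several.

```python
import queue
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkItem:
    payload: str
    ttl_seconds: float
    enqueued_at: float = field(default_factory=time.monotonic)

    def is_stale(self) -> bool:
        # An item is stale once it has waited longer than its TTL.
        return time.monotonic() - self.enqueued_at > self.ttl_seconds


class BoundedWorkQueue:
    """FIFO queue with a hard size cap; stale items are skipped on dequeue."""

    def __init__(self, max_size: int) -> None:
        self._queue = queue.Queue(maxsize=max_size)

    def offer(self, item: WorkItem) -> bool:
        """Return False when full instead of growing without bound."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            return False

    def poll(self) -> Optional[WorkItem]:
        """Return the next fresh item, or None if nothing usable is queued."""
        while True:
            try:
                item = self._queue.get_nowait()
            except queue.Empty:
                return None
            if not item.is_stale():
                return item


work = BoundedWorkQueue(max_size=2)
accepted = [work.offer(WorkItem(f"job-{i}", ttl_seconds=5.0)) for i in range(3)]
print(accepted)  # the third offer is rejected because the queue is full
```

Rejecting at `offer` keeps memory bounded; dropping stale items at `poll` avoids spending capacity on work nobody is waiting for.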
Latency & Service time
λ = wait time
σ = service time
ρ = utilization
Utilization fluctuates!
▪ 10% fluctuation at ρ = 0.5 will hardly affect latency (~ 1.1x)
▪ 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency)
▪ Be careful when overloading resources
▪ During peak load we must be extra careful
▪ Highly varied load must be capped
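The ~1.1x and ~10x figures above are consistent with the textbook M/M/1 queue, where mean latency is σ / (1 − ρ). That model is an assumption here, since the slide's own formula did not survive in this transcript; the function names are illustrative.

```python
def mm1_latency(service_time: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: sigma / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


def latency_ratio(utilization: float, fluctuation: float = 0.10) -> float:
    """Latency growth when utilization rises by a relative `fluctuation`."""
    base = mm1_latency(1.0, utilization)
    bumped = mm1_latency(1.0, utilization * (1.0 + fluctuation))
    return bumped / base


for rho in (0.5, 0.75, 0.9):
    print(f"rho={rho}: 10% more load -> {latency_ratio(rho):.1f}x latency")
```

Under this model the same 10% swing costs about 1.1x at ρ = 0.5, about 1.4x at ρ = 0.75, and about 10x at ρ = 0.9.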
Practical advice
▪ Use chokepoints (throttling/load shedding)
▪ Plan for low utilization of slow resources
Example
Resource            | Latency | Planned utilization
RPC thread pool     | 1ms     | 0.75
DB connection pool  | 10ms    | 0.5
Backpressure
▪ Internal queues fill up and cause latency
▪ Front layer will continue sending traffic
▪ We need to inform the client that we’re out of capacity
▪ E.g.: Blocking client, HTTP 503, finite queues for thread pools
Backpressure
▪ Blocking code has backpressure by default
▪ Executors, remote calls and async code need explicit backpressure
▪ E.g. producer/consumer through Kafka
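As an illustration of explicit backpressure for an executor, here is a sketch in Python, whose `ThreadPoolExecutor` queues submitted work without a bound. The `BoundedExecutor` wrapper is hypothetical: it uses a semaphore so that `submit` blocks the caller, and optionally rejects after a timeout, once too much work is in flight.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class BoundedExecutor:
    """Thread pool whose submit() blocks once too much work is in flight."""

    def __init__(self, workers: int, max_queued: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # Permits cover running tasks plus tasks waiting inside the pool.
        self._permits = threading.BoundedSemaphore(workers + max_queued)

    def submit(self, fn, *args, timeout=None) -> Future:
        # Blocking here is the backpressure; give up after `timeout` seconds.
        if not self._permits.acquire(timeout=timeout):
            raise RuntimeError("executor saturated, rejecting work")
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._permits.release()
            raise
        # Return the permit when the task finishes, fails, or is cancelled.
        future.add_done_callback(lambda _: self._permits.release())
        return future

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

With `timeout=None` the producer simply slows down to the consumers' pace; with a short timeout the caller can turn saturation into an error such as HTTP 503.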
Load shedding
▪ A tradeoff between latency and error rate
▪ Cap the queue size / throttle arrival rate
▪ Reject excess work or send to fallback service
Example: Facebook uses LIFO queue and rejects stale work
http://queue.acm.org/detail.cfm?id=2839461
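One common way to throttle arrival rate is a token bucket. The talk does not name a specific algorithm, so this is a swapped-in illustration; `TokenBucket` and `handle` are hypothetical names, and the sketch is not thread-safe.

```python
import time


class TokenBucket:
    """Admit at most `rate` requests per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int, clock=time.monotonic) -> None:
        self._rate = rate
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def handle(request: str, bucket: TokenBucket) -> str:
    # Shed excess load early instead of letting it queue up.
    if not bucket.try_acquire():
        return "503 Service Unavailable"
    return f"200 OK: {request}"
```

The rejected branch is where a fallback service could be called instead of returning an error.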
02 Thread Pools
Jetty architecture
[Diagram: Socket, Acceptor thread, Thread pool (QTP)]
Too many threads
▪ O/S also has a queue
▪ Threads take memory, FDs, etc
▪ What about shared resources?
Bad QoS, GC storms, ungraceful degradation
Not enough threads
▪ Work will queue up
▪ Not enough RUNNING threads
High latency, low resource utilization
Capacity/Latency tradeoffs
When optimizing for Latency:
For low latency, resources must be available when needed
Keep the queue empty
▪ Block or apply backpressure
▪ Keep the queue small
▪ Overprovision
Capacity/Latency tradeoffs
When optimizing for Capacity
For max capacity, resources must always have work waiting
Keep the queue full
▪ We use a large queue to buffer work
▪ Queueing increases latency
▪ Queue size >> concurrency
How many threads?
▪ Assuming CPU is the limiting resource
▪ Compute by maximal load (opt. latency)
▪ With a Grid: How many cores???
Java Concurrency in Practice (http://jcip.net/)
How many threads?
How to compute?
▪ Transaction time = W + C (W = wait time, C = CPU time)
▪ C ~ Total CPU time / throughput
▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target)
▪ Memory and other resource limits
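The sizing rule from Java Concurrency in Practice, which the slide cites, is N_threads = N_cpu × U × (1 + W/C). A small sketch of that arithmetic follows; the function name and the sample numbers are illustrative, not measurements from the talk.

```python
import math
import os
from typing import Optional


def pool_size(wait_ms: float, compute_ms: float,
              target_utilization: float = 0.5,
              cores: Optional[int] = None) -> int:
    """Thread count from N_cpu * U * (1 + W/C), rounded up."""
    if compute_ms <= 0:
        raise ValueError("compute_ms must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    n_cpu = cores or os.cpu_count() or 1
    threads = n_cpu * target_utilization * (1 + wait_ms / compute_ms)
    return max(1, math.ceil(threads))


# 8 cores, U = 0.5, each transaction waits 30 ms and computes for 10 ms.
print(pool_size(wait_ms=30, compute_ms=10, target_utilization=0.5, cores=8))  # -> 16
```

The result is only an upper bound from the CPU's point of view; memory and other resource limits may force a smaller pool.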
What about async servers?
Async servers architecture
[Diagram: Socket, O/S (epoll, syscalls), Event loop, Callbacks]
Async systems
▪ Event loop callback/handler queue
▪ The callback queue is unlimited (!!!)
▪ Event loop can block (ouch)
▪ No inherent concurrency limit
▪ No backpressure (*)
Async systems - overload
▪ No preemption -> no QoS
▪ No backpressure -> overload
▪ Hard to tune
▪ Hard to limit concurrency/queue size
▪ Hard to debug
So what’s the point?
▪ High concurrency
▪ More control
▪ I/O heavy servers
Still evolving… let’s revisit in a few years?
03 Little’s Law
Little’s law
▪ Holds for all distributions
▪ For “stable” systems
▪ Holds for systems and their subsystems
▪ “Throughput” is either Arrival rate or Service rate depending on the context. Be careful!
L = λ⋅W
L = Avg clients in the system
λ = Avg Throughput
W = Avg Latency
Using Little’s law
▪ How many requests queued inside the system?
▪ Verifying load tests / benchmarks
▪ Calculating latency when no direct measurement is possible
Go watch Gil Tene’s "How NOT to Measure Latency"
Read Benchmarking Blunders and Things That Go Bump in the Night
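A short worked example of using L = λ⋅W to sanity-check a load test. The helper names and the benchmark numbers are made up for illustration.

```python
def avg_in_system(throughput_per_s: float, avg_latency_s: float) -> float:
    """Little's law: L = lambda * W."""
    return throughput_per_s * avg_latency_s


def avg_latency(in_system: float, throughput_per_s: float) -> float:
    """Rearranged: W = L / lambda, useful when latency cannot be measured."""
    return in_system / throughput_per_s


def is_consistent(concurrency: float, throughput_per_s: float,
                  avg_latency_s: float, tolerance: float = 0.1) -> bool:
    """Check reported benchmark numbers against Little's law."""
    expected = avg_in_system(throughput_per_s, avg_latency_s)
    return abs(expected - concurrency) <= tolerance * concurrency


# A report claims 2000 req/s at 50 ms average latency with 20 concurrent clients.
print(avg_in_system(2000, 0.050))      # about 100 requests in flight are implied
print(is_consistent(20, 2000, 0.050))  # the reported numbers do not add up
print(avg_latency(20, 2000))           # 20 clients at 2000 req/s implies ~10 ms
```

When the check fails, either the throughput or the latency figure is being measured at the wrong place, which is the kind of benchmarking mistake the references above discuss.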
04 Timeouts
How not to timeout
People use arbitrary timeout values
▪ DB timeout > Overall transaction timeout
▪ Cache timeout > DB latency
▪ Huge unrealistic timeouts
▪ Refusing to return errors
P.S: connection timeout, read timeout & transaction timeout are not the same thing
Deciding on timeouts
Use the distribution, Luke!
▪ Resources/Errors tradeoff
▪ Cumulative distribution chart
▪ Watch out for multiple modes
▪ Context, context, context
Timeouts should be derived from real world constraints!
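To make the resources/errors tradeoff concrete, here is a sketch that picks a timeout from a measured latency distribution. The sample data is invented (a fast cache-hit mode, a slower DB mode, one outlier), and nearest-rank is just one of several percentile definitions.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical latencies in ms: 90 cache hits, 9 DB reads, 1 pathological call.
latencies_ms = [5] * 90 + [40] * 9 + [2000]

for p in (50, 90, 99, 100):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

With this data a 40 ms timeout accepts a 1% error rate, while guaranteeing zero timeouts would mean holding resources for up to 2000 ms. The median (5 ms) says nothing about the second mode, which is why the whole distribution matters.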
UX numbers every developer needs to know
▪ Smooth motion perception threshold: ~ 20ms
▪ Immediate reaction threshold: ~ 100ms
▪ Delay perception threshold: ~ 300ms
▪ Focus threshold: ~ 1sec
▪ Frustration threshold: ~ 10sec
Google's RAIL model
UX powers of 10
Hardware latency numbers every developer needs to know
▪ SSD Disk seek: 0.15ms
▪ Magnetic disk seek: ~ 10ms
▪ Round trip within same datacenter: ~ 0.5ms
▪ Packet roundtrip US->EU->US: ~ 150ms
▪ Send 1M over typical user WAN: ~ 1sec
Latency numbers every developer needs to know (updated)
Timeout Budgets
▪ Decide on global timeouts
▪ Pass context object
▪ Each stage decrements budget
▪ Local timeouts according to budget
▪ If budget too low, terminate preemptively
Think microservices
Example
Global: 500ms
Stage            | Used  | Budget | Timeout
Authorization    | 6ms   | 494ms  | 100ms
Data fetch (DB)  | 123ms | 371ms  | 200ms
Processing       | 47ms  | 324ms  | 371ms
Rendering        | 89ms  | 235ms  | 324ms
Audit            | 2ms   | -      | -
Filter           | 10ms  | 223ms  | 233ms
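The budget mechanics in the example can be sketched as a small context object that each stage consults. `TimeoutBudget` and `BudgetExhausted` are hypothetical names; the injectable clock exists only so the example can replay the first rows of the table in milliseconds.

```python
import time


class BudgetExhausted(Exception):
    """Raised when too little of the global budget is left to start a stage."""


class TimeoutBudget:
    def __init__(self, total, clock=time.monotonic) -> None:
        self._clock = clock
        self._deadline = clock() + total

    def remaining(self) -> float:
        return max(0.0, self._deadline - self._clock())

    def timeout_for(self, local_cap=None, minimum=0.0) -> float:
        """Timeout for the next stage: the remaining budget, optionally capped."""
        left = self.remaining()
        if left <= minimum:
            # Terminate preemptively instead of starting work that cannot finish.
            raise BudgetExhausted(f"only {left} left")
        return left if local_cap is None else min(left, local_cap)


# Replay the example with a fake clock counting milliseconds.
now_ms = [0]
budget = TimeoutBudget(total=500, clock=lambda: now_ms[0])
print(budget.timeout_for(local_cap=100))  # authorization: capped at 100
now_ms[0] += 6                            # authorization used 6 ms
print(budget.timeout_for(local_cap=200))  # data fetch: capped at 200
now_ms[0] += 123                          # data fetch used 123 ms
print(budget.timeout_for())               # processing: whatever is left, 371
```

In a microservice call chain the remaining budget would typically travel with the request, so downstream services can apply the same rule.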
The debt buyer
▪ Transactions may return eventually after timeout
▪ Does the client really have to wait?
▪ Timeout and return error/default response to client (50ms)
▪ Keep waiting asynchronously (1 sec)
Can’t be used when client is expecting data back
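A sketch of the pattern using Python's `asyncio`: answer the client after a short timeout, but keep waiting for the real operation in the background up to a hard limit. The function names, the default reply, and the demo timings are assumptions for illustration.

```python
import asyncio

_background_tasks = set()  # hold references so pending waiters are not collected


async def _finish_later(task, remaining_s):
    """Keep waiting for the operation, up to the hard limit."""
    try:
        await asyncio.wait_for(task, timeout=remaining_s)
    except asyncio.TimeoutError:
        pass  # wait_for cancelled the task; a real system would log this
    except Exception:
        pass  # likewise, a late failure should be logged or retried


async def call_with_debt_buyer(operation, client_timeout_s, hard_timeout_s, default):
    task = asyncio.create_task(operation)
    try:
        # shield() keeps the task alive when the client-facing timeout fires.
        return await asyncio.wait_for(asyncio.shield(task), timeout=client_timeout_s)
    except asyncio.TimeoutError:
        waiter = asyncio.create_task(
            _finish_later(task, hard_timeout_s - client_timeout_s))
        _background_tasks.add(waiter)
        waiter.add_done_callback(_background_tasks.discard)
        return default


async def demo():
    async def slow_audit_write():
        await asyncio.sleep(0.2)
        return "stored"

    reply = await call_with_debt_buyer(slow_audit_write(), 0.05, 1.0, default="accepted")
    print(reply)              # the client gets the default reply right away
    await asyncio.sleep(0.3)  # demo only: let the background write finish


asyncio.run(demo())
```

As the slide notes, this only works when the client can proceed without the real result, for example fire-and-forget writes.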
Questions?
Thank You

Resilient Design 101 (JeeConf 2017)

  • 1. Resilient design 101 Avishai Ish-Shalom github.com/avishai-ish-shalom @nukemberg avishai.is@wix.com
  • 2. Wix in numbers ~ 500 Engineers ~ 1500 employees ~ 100M users ~ 500 micro services Lithuania Ukraine Vilnius Kyiv Dnipro Wix Engineering Locations Israel Tel-Aviv Be’er Sheva
  • 4. Queues are everywhere! ▪ Futures/Executors ▪ Sockets ▪ Locks (DB Connection pools) ▪ Callbacks in node.js/Netty Anything async?!
  • 5. Queues ▪ Incoming load (arrival rate) ▪ Service from the queue (service rate) ▪ Service discipline (FIFO/LIFO/Priority) ▪ Latency = Wait time + Service time ▪ Service time independent of queue
  • 6. It varies ▪ Arrival rate fluctuates ▪ Service times fluctuate ▪ Delays accumulate ▪ Idle time wasted Queues are almost always full or near-empty!
  • 7. Capacity & Latency ▪ Latency (and queue size) rises to infinity as utilization approaches 1 ▪ For QoS ρ << 0.75 ▪ Decent latency -> over capacity ρ = arrival rate / service rate (utilization)
  • 8. Implications Infinite queues: ▪ Memory pressure / OOM ▪ High latency ▪ Stale work Always limit queue size! Work item TTL*
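A minimal sketch of a size-limited queue with a per-item TTL, assuming a single consumer that drains the queue in batches. The names `WorkItem` and `drain_fresh` are hypothetical, and the `now` parameter exists only to make the example deterministic:

```python
import queue
import time

class WorkItem:
    """A unit of work stamped with the time after which it is stale."""
    def __init__(self, payload, ttl_seconds, now=None):
        self.payload = payload
        created = time.monotonic() if now is None else now
        self.expires_at = created + ttl_seconds

def drain_fresh(q, now=None):
    """Pop everything currently queued; return (fresh payloads, stale count)."""
    now = time.monotonic() if now is None else now
    fresh, stale = [], 0
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return fresh, stale
        if item.expires_at <= now:
            stale += 1  # expired while waiting: skip it instead of serving it
        else:
            fresh.append(item.payload)

work = queue.Queue(maxsize=3)  # finite queue: put_nowait raises queue.Full beyond 3 items
work.put_nowait(WorkItem("a", ttl_seconds=0.5, now=0.0))
work.put_nowait(WorkItem("b", ttl_seconds=5.0, now=0.0))
fresh, stale = drain_fresh(work, now=1.0)  # one second later, "a" has expired
```

Checking the TTL at dequeue time keeps the producer path cheap; the cost is that stale items still occupy queue slots until a consumer reaches them.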
  • 9. Latency & Service time λ = wait time σ = service time ρ = utilization
  • 10. Utilization fluctuates! ▪ A 10% fluctuation at ρ = 0.5 will hardly affect latency (~ 1.1x) ▪ A 10% fluctuation at ρ = 0.9 will kill you (~ 10x latency) ▪ Be careful when overloading resources ▪ During peak load we must be extra careful ▪ Highly varied load must be capped
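The ~1.1x and ~10x figures can be reproduced under one assumption that the slides do not state explicitly: an M/M/1 queue, where mean latency is σ / (1 − ρ). A rough sketch (`latency_factor` and `slowdown` are illustrative names):

```python
def latency_factor(rho):
    """Mean latency in units of service time for an M/M/1 queue: 1 / (1 - rho)."""
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - rho)

def slowdown(rho, fluctuation=0.10):
    """Latency growth when utilization rises by `fluctuation` (relative to rho)."""
    return latency_factor(rho * (1 + fluctuation)) / latency_factor(rho)

for rho in (0.5, 0.9):
    print(f"rho={rho}: {slowdown(rho):.1f}x")
```

At ρ = 0.5 the bump moves utilization to 0.55; at ρ = 0.9 it moves it to 0.99, which is where the order-of-magnitude jump comes from.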
  • 11. Practical advice ▪ Use chokepoints (throttling/load shedding) ▪ Plan for low utilization of slow resources Example:
      Resource            | Latency | Planned utilization
      RPC thread pool     | 1ms     | 0.75
      DB connection pool  | 10ms    | 0.5
  • 12. Backpressure ▪ Internal queues fill up and cause latency ▪ Front layer will continue sending traffic ▪ We need to inform the client that we’re out of capacity ▪ E.g.: Blocking client, HTTP 503, finite queues for threadpools
  • 13. Backpressure ▪ Blocking code has backpressure by default ▪ Executors, remote calls and async code need explicit backpressure ▪ E.g. producer/consumer through Kafka
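One way to make backpressure explicit is a finite queue in front of the workers that answers with an error when full. This is only a sketch: the `accept` function and the 202/503 return values are illustrative, not a real HTTP server:

```python
import queue

requests = queue.Queue(maxsize=2)  # finite queue in front of the worker pool

def accept(request):
    """Return an HTTP-like status: 202 if queued, 503 if we are out of capacity."""
    try:
        requests.put_nowait(request)
        return 202
    except queue.Full:
        return 503  # explicit signal so the front layer can back off

statuses = [accept(f"req-{i}") for i in range(4)]
```

Replacing `put_nowait` with `put(request, timeout=...)` gives the blocking-client flavour instead: the producer is slowed down rather than rejected.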
  • 14. Load shedding ▪ A tradeoff between latency and error rate ▪ Cap the queue size / throttle arrival rate ▪ Reject excess work or send to fallback service Example: Facebook uses LIFO queue and rejects stale work http://queue.acm.org/detail.cfm?id=2839461
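A generic sketch of a capped LIFO queue that sheds the oldest work when full. It illustrates the idea only and is not Facebook's actual implementation; `SheddingLifoQueue` is a hypothetical name:

```python
from collections import deque

class SheddingLifoQueue:
    """Capped LIFO queue: serves the newest item first, sheds the oldest when full."""

    def __init__(self, max_size):
        self._items = deque()
        self._max_size = max_size
        self.shed = []  # rejected work, kept only so the example can inspect it

    def push(self, item):
        if len(self._items) >= self._max_size:
            self.shed.append(self._items.popleft())  # drop the oldest (stalest) item
        self._items.append(item)

    def pop(self):
        return self._items.pop()  # newest first (LIFO)

q = SheddingLifoQueue(max_size=3)
for i in range(5):
    q.push(i)
served = [q.pop() for _ in range(3)]
```

In a real service the shed items would get an error response or go to a fallback service instead of being collected in a list.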
  • 16. Jetty architecture Thread pool (QTP) Socket Acceptor thread
  • 17. Too many threads ▪ O/S also has a queue ▪ Threads take memory, FDs, etc ▪ What about shared resources? Result: bad QoS, GC storms, ungraceful degradation. Not enough threads ▪ Work will queue up ▪ Not enough RUNNING threads Result: high latency, low resource utilization.
  • 18. Capacity/Latency tradeoffs When optimizing for Latency: For low latency, resources must be available when needed Keep the queue empty ▪ Block or apply backpressure ▪ Keep the queue small ▪ Overprovision
  • 19. Capacity/Latency tradeoffs When optimizing for Capacity For max capacity, resources must always have work waiting Keep the queue full ▪ We use a large queue to buffer work ▪ Queueing increases latency ▪ Queue size >> concurrency
  • 20. How many threads? ▪ Assuming CPU is the limiting resource ▪ Compute by maximal load (opt. latency) ▪ With a Grid: How many cores??? Java Concurrency in Practice (http://jcip.net/)
  • 21. How many threads? How to compute? ▪ Transaction time = W + C ▪ C ~ Total CPU time / throughput ▪ U ~ 0.5 – 0.7 (account for O/S, JVM, GC - and 0.75 utilization target) ▪ Memory and other resource limits
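Assuming the computation refers to the sizing rule from Java Concurrency in Practice, N_threads = N_cpu · U · (1 + W/C), it can be sketched as follows. The input numbers are hypothetical:

```python
import math

def thread_count(n_cpus, target_utilization, wait_ms, compute_ms):
    """JCIP sizing rule: N_threads = N_cpu * U * (1 + W / C)."""
    if compute_ms <= 0:
        raise ValueError("compute time must be positive")
    return math.ceil(n_cpus * target_utilization * (1 + wait_ms / compute_ms))

# Hypothetical numbers: 8 cores, U = 0.5, 45 ms waiting and 5 ms on CPU per transaction.
threads = thread_count(n_cpus=8, target_utilization=0.5, wait_ms=45, compute_ms=5)
```

The result is an upper bound from the CPU's point of view; memory and other resource limits may force a smaller pool.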
  • 22. What about async servers?
  • 23. Async servers architecture Socket Event loop epoll Callbacks O/S Syscalls
  • 24. Async systems ▪ Event loop callback/handler queue ▪ The callback queue is unlimited (!!!) ▪ Event loop can block (ouch) ▪ No inherent concurrency limit ▪ No backpressure (*)
  • 25. Async systems - overload ▪ No preemption -> no QoS ▪ No backpressure -> overload ▪ Hard to tune ▪ Hard to limit concurrency/queue size ▪ Hard to debug
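Since the event loop imposes no concurrency limit of its own, one has to be added by hand. A minimal sketch using asyncio, with a hypothetical `ConcurrencyLimiter` that rejects work once too many handlers are in flight:

```python
import asyncio

class ConcurrencyLimiter:
    """Rejects new work once `max_in_flight` handlers are already running."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0

    async def run(self, handler):
        if self._in_flight >= self._max:
            return "rejected"  # shed load instead of growing the callback queue
        self._in_flight += 1
        try:
            return await handler()
        finally:
            self._in_flight -= 1

async def handler():
    await asyncio.sleep(0.01)  # stands in for non-blocking I/O
    return "ok"

async def main():
    limiter = ConcurrencyLimiter(max_in_flight=2)
    return await asyncio.gather(*(limiter.run(handler) for _ in range(5)))

results = asyncio.run(main())
```

A plain counter is enough here because the check and the increment run without an `await` between them, so no other callback can interleave.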
  • 26. So what’s the point? ▪ High concurrency ▪ More control ▪ I/O heavy servers Still evolving… let’s revisit in a few years?
  • 28. Little’s law ▪ Holds for all distributions ▪ For “stable” systems ▪ Holds for systems and their subsystems ▪ “Throughput” is either arrival rate or service rate, depending on the context. Be careful! L = λ⋅W, where L = avg clients in the system, λ = avg throughput, W = avg latency
  • 29. Using Little’s law ▪ How many requests queued inside the system? ▪ Verifying load tests / benchmarks ▪ Calculating latency when no direct measurement is possible Go watch Gil Tene’s "How NOT to Measure Latency" Read Benchmarking Blunders and Things That Go Bump in the Night
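A small worked example of these uses, with hypothetical load-test numbers. Both functions are just L = λ⋅W rearranged:

```python
def clients_in_system(throughput_per_sec, latency_sec):
    """Little's law: L = lambda * W."""
    return throughput_per_sec * latency_sec

def latency_from(clients, throughput_per_sec):
    """Rearranged: W = L / lambda, for when latency cannot be measured directly."""
    return clients / throughput_per_sec

# Hypothetical load test: 200 req/s sustained with 50 clients kept in flight.
implied_latency = latency_from(clients=50, throughput_per_sec=200)
in_flight = clients_in_system(throughput_per_sec=200, latency_sec=0.25)
```

If a benchmark reports 200 req/s with 50 concurrent clients but claims an average latency far from 0.25 s, one of the three numbers is being measured incorrectly.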
  • 31. How not to timeout People use arbitrary timeout values ▪ DB timeout > Overall transaction timeout ▪ Cache timeout > DB latency ▪ Huge unrealistic timeouts ▪ Refusing to return errors P.S: connection timeout, read timeout & transaction timeout are not the same thing
  • 32. Deciding on timeouts Use the distribution, Luke! ▪ Resources/Errors tradeoff ▪ Cumulative distribution chart ▪ Watch out for multiple modes ▪ Context, context, context
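A rough sketch of deriving a timeout from measured latencies instead of picking an arbitrary value. The sample data is made up and deliberately has two modes. `percentile` uses the nearest-rank method:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def error_rate_at(samples, timeout_ms):
    """Fraction of requests that would have timed out with this timeout."""
    return sum(1 for s in samples if s > timeout_ms) / len(samples)

# Hypothetical latencies (ms) with two modes: cache hits near 5 ms, misses near 120 ms.
latencies_ms = [4, 5, 5, 6, 5, 4, 6, 5, 118, 125]
timeout_ms = percentile(latencies_ms, 80)
```

Here the 80th percentile sits at the end of the fast mode, so using it as the timeout turns every cache miss into an error. That is the resources/errors tradeoff, and the reason to look at the whole distribution rather than a single percentile.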
  • 33. Timeouts should be derived from real world constraints!
  • 34. UX numbers every developer needs to know ▪ Smooth motion perception threshold: ~ 20ms ▪ Immediate reaction threshold: ~ 100ms ▪ Delay perception threshold: ~ 300ms ▪ Focus threshold: ~ 1sec ▪ Frustration threshold: ~ 10sec Google's RAIL model UX powers of 10
  • 35. Hardware latency numbers every developer needs to know ▪ SSD Disk seek: 0.15ms ▪ Magnetic disk seek: ~ 10ms ▪ Round trip within same datacenter: ~ 0.5ms ▪ Packet roundtrip US->EU->US: ~ 150ms ▪ Send 1M over typical user WAN: ~ 1sec Latency numbers every developer needs to know (updated)
  • 36. Timeout Budgets ▪ Decide on global timeouts ▪ Pass context object ▪ Each stage decrements budget ▪ Local timeouts according to budget ▪ If budget too low, terminate preemptively Think microservices Example (global timeout: 500ms):
      Stage            | Used  | Budget | Timeout
      Authorization    | 6ms   | 494ms  | 100ms
      Data fetch (DB)  | 123ms | 371ms  | 200ms
      Processing       | 47ms  | 324ms  | 371ms
      Rendering        | 89ms  | 235ms  | 324ms
      Audit            | 2ms   | -      | -
      Filter           | 10ms  | 223ms  | 233ms
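A minimal sketch of such a context object, replaying the first four stages of the example. `TimeoutBudget` and its methods are hypothetical names, and the "used" times are fed in by hand. Real code would measure them with a monotonic clock:

```python
class TimeoutBudget:
    """Context object carrying the remaining global budget, in milliseconds."""

    def __init__(self, total_ms):
        self.remaining_ms = total_ms

    def timeout_for(self, local_cap_ms=None, minimum_ms=0):
        """Local timeout: the remaining budget, optionally capped per stage."""
        if self.remaining_ms <= minimum_ms:
            raise TimeoutError("budget too low, terminating early")
        if local_cap_ms is None:
            return self.remaining_ms
        return min(local_cap_ms, self.remaining_ms)

    def spend(self, used_ms):
        self.remaining_ms -= used_ms

budget = TimeoutBudget(500)
# (stage, local cap in ms or None, time actually used in ms), taken from the example
stages = [("Authorization", 100, 6), ("Data fetch (DB)", 200, 123),
          ("Processing", None, 47), ("Rendering", None, 89)]
timeouts = []
for name, cap, used in stages:
    timeouts.append(budget.timeout_for(cap))  # timeout granted before the stage runs
    budget.spend(used)                        # budget shrinks by what the stage used
```

Across service boundaries the remaining budget would travel with the request (for example as a header), so that each microservice can apply the same logic.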
  • 37. The debt buyer ▪ Transactions may return eventually after timeout ▪ Does the client really have to wait? ▪ Timeout and return error/default response to client (50ms) ▪ Keep waiting asynchronously (1 sec) Can’t be used when client is expecting data back
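The pattern can be sketched with a thread pool: the client gets a default response after a short timeout while the slow call keeps running in the background. `slow_write`, `handle` and the timings are illustrative, and the sketch omits the second, longer timeout on the background wait:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

executor = ThreadPoolExecutor(max_workers=1)
completed = []

def slow_write(record):
    time.sleep(0.2)  # stands in for a slow downstream call
    completed.append(record)
    return "stored"

def handle(record, client_timeout=0.05):
    future = executor.submit(slow_write, record)
    try:
        return future.result(timeout=client_timeout)
    except FutureTimeout:
        return "accepted"  # default response; the write keeps running in the background

response = handle("order-1")
executor.shutdown(wait=True)  # in a real server the pool would simply stay alive
```

Timing out on `future.result` does not cancel the submitted call, which is exactly what makes the work complete after the client has already been answered.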