Load balancing theory and practice

Welcome
Me:
• Dave Rosenthal
• Co-founder of FoundationDB
• Spent last three years building a distributed
transactional NoSQL database
• It’s my birthday

Any time you have multiple computers working on a
job, you have a load balancing problem!

Warning
There is an ugly downside to learning about load
balancing: TSA checkpoints, grocery store lines,
and traffic lights may become even more
frustrating.

What is load balancing?
Wikipedia: “…methodology to distribute
workload across multiple computers … to
achieve optimal resource utilization, maximize
throughput, minimize response time, and avoid
overload”

All part of the latency curve

The latency curve
[Figure: latency (log scale, 1 to 10000) versus jobs/second, with four regions along the curve: Nominal, Interesting, Saturation, Overload]

Goal for real-time systems

[Figure: latency (log scale, 1 to 10000) versus jobs/second. Goal: low latency at a given load]

Goal for batch systems

[Figure: latency (log scale, 1 to 10000) versus jobs/second. Goal: high jobs/sec at a reasonable latency]

The latency curve
Better load balancing strategies
can dramatically improve both
latency and throughput
[Figure: latency in ms (log scale, 1 to 1000) versus load (0 to 1)]

Load balancing tensions
• We want to shorten queues in the
system to yield better latency
• We want to lengthen queues to keep a
“buffer” of work that keeps workers busy during
irregular traffic and yields better throughput
• For distributed systems, equalizing queue
lengths sounds good

Can we just limit queue sizes?
[Figure: % of dropped jobs (0 to 40) versus queued job limit (0 to 20)]

Simple strategies
Global job queue: for slow tasks
Round robin: for highly uniform situations
Random: probably won’t screw you
Sticky: for cacheable situations
Fastest of N tries: trade throughput for
latency. I recommend N = 2 or 3.
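
Three of the strategies above (round robin, random, sticky) can be sketched as small "picker" functions. This is only an illustration: the function names, the server list, and the use of a CRC32 hash for stickiness are assumptions, not something the slides specify.

```python
import itertools
import random
import zlib

def round_robin(servers):
    """Cycle through the servers in order; suits highly uniform jobs."""
    cycle = itertools.cycle(servers)
    return lambda key=None: next(cycle)

def random_choice(servers, rng=random):
    """Pick a server uniformly at random for every job."""
    return lambda key=None: rng.choice(servers)

def sticky(servers):
    """Send the same key to the same server so its cache stays warm.
    CRC32 is an arbitrary (hypothetical) choice of stable hash."""
    return lambda key: servers[zlib.crc32(key.encode()) % len(servers)]

if __name__ == "__main__":
    servers = ["s1", "s2", "s3"]
    pick = round_robin(servers)
    print([pick() for _ in range(4)])
    print(sticky(servers)("user-42"))
```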

Use a global queue if possible
[Figure: latency under 80% load (log scale, 0.1 to 10) versus cluster size (1 to 10), for random assignment and for a global job queue]
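
A small simulation can reproduce the flavor of this comparison. The sketch below is not the model behind the slide: it assumes Poisson arrivals, exponentially distributed job durations with mean 1, and first-come-first-served servers, and all function names are made up for illustration.

```python
import heapq
import random

def make_jobs(n_jobs, n_servers, load, seed=0):
    """Return (arrival_time, service_time) pairs at the given load."""
    rng = random.Random(seed)
    arrival_rate = load * n_servers  # mean service time is 1
    t, jobs = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)
        jobs.append((t, rng.expovariate(1.0)))
    return jobs

def global_queue_latency(jobs, n_servers):
    """One shared FIFO queue: each job takes the server that frees up first."""
    free_at = [0.0] * n_servers
    heapq.heapify(free_at)
    total = 0.0
    for arrival, service in jobs:
        start = max(arrival, heapq.heappop(free_at))
        heapq.heappush(free_at, start + service)
        total += start + service - arrival
    return total / len(jobs)

def random_assignment_latency(jobs, n_servers, seed=1):
    """Each job is sent to a random server and waits in that server's queue."""
    rng = random.Random(seed)
    free_at = [0.0] * n_servers
    total = 0.0
    for arrival, service in jobs:
        s = rng.randrange(n_servers)
        start = max(arrival, free_at[s])
        free_at[s] = start + service
        total += start + service - arrival
    return total / len(jobs)

if __name__ == "__main__":
    for size in range(1, 11):
        jobs = make_jobs(20000, size, 0.8)
        print(size, global_queue_latency(jobs, size),
              random_assignment_latency(jobs, size))
```

With a single server the two strategies are identical; the gap opens as the cluster grows, because a shared queue never leaves one server idle while jobs wait behind another.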

Options for information transfer
• None (rare)
• Latency (most common)
• Failure detection
• Explicit
– Load average
– Queue length
– Response times

FoundationDB’s approach
1. Send the request to a random one of three servers
2. The server either answers the query or replies “busy” if its
queue is longer than the queue limit estimate
3. Queries that got a “busy” reply are sent to a second random
server with the “must do” flag set.

Queue limit = 25 * 2^(20*P)
• A global queue limit is implicitly shared by estimating
the fraction of incoming requests (P) that are flagged
“must do”
• Converges to a P(redirect)/queue-size equilibrium
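
The three steps and the queue limit formula can be sketched roughly as follows. This is a toy illustration, not FoundationDB's code: the Server class, the running-fraction estimate of P, and the choice of a different server for the retry are all assumptions made here.

```python
import random

def queue_limit(p_must_do):
    """Queue limit = 25 * 2^(20*P), as given on the slide."""
    return 25 * 2 ** (20 * p_must_do)

class Server:
    """Hypothetical server that estimates P as a simple running fraction."""
    def __init__(self):
        self.queue = []
        self.seen = 0
        self.must_do_seen = 0

    def offer(self, request, must_do=False):
        self.seen += 1
        if must_do:
            self.must_do_seen += 1
        p = self.must_do_seen / self.seen
        # Reply "busy" (False) only to requests that may still be redirected.
        if not must_do and len(self.queue) > queue_limit(p):
            return False
        self.queue.append(request)
        return True

def send(request, servers, rng=random):
    """Try one random server; on "busy", retry elsewhere with must_do set."""
    first, second = rng.sample(servers, 2)
    if first.offer(request):
        return first
    second.offer(request, must_do=True)
    return second

if __name__ == "__main__":
    servers = [Server() for _ in range(3)]
    chosen = send("get key", servers)
    print(len(chosen.queue), queue_limit(0.0), queue_limit(0.05))
```

Because the limit grows exponentially in P, a cluster that is redirecting many requests quickly allows longer queues everywhere, which in turn lowers the redirect rate.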

FDB latency curve before/after
[Figure: latency (log scale, 0.1 to 100) versus operations per second (0 to 1,200,000), before and after]

Tackling load balancing
• Queuing theory: One useful insight
• Simulation: Do this
• Instrumentation: Do this
• Control theory: Know how to avoid this
• Operations research: Read about this for fun
– Blackett: Shield planes where they are not shot!

The one insight: Little’s law
Q = R*W

• (Q)ueue size = (R)ate * (W)ait-time
• Q is the average number of jobs in the system
• R is the average arrival rate (jobs/second)
• W is the average wait time (seconds)
• For any (!) steady-state systems
– Or sub-systems, or joint systems, or…

Little’s law example 1
Q = R*W

• We get 1,000,000 requests per second (R=1E6)
• We take 100 ms to service each request
• (Q = 1E6*0.100)
• Little’s Law: Average queue depth is 100,000!

Little’s law example 2
W = Q/R

• We have 100 users in the system making
continuous requests (Q=100)
• We get 10,000 requests per second
• (W = 100 / 10,000)
• Little’s Law: Average wait time is 10 ms
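
Both examples are one-line calculations. The helper names below are made up for illustration; the numbers are the ones from the two slides.

```python
def queue_size(rate_per_second, wait_seconds):
    """Little's law: Q = R * W."""
    return rate_per_second * wait_seconds

def wait_time(queue, rate_per_second):
    """Little's law rearranged: W = Q / R."""
    return queue / rate_per_second

if __name__ == "__main__":
    # Example 1: 1,000,000 requests/second, 100 ms each.
    print(queue_size(1e6, 0.100), "jobs in the system on average")
    # Example 2: 100 users in the system, 10,000 requests/second.
    print(wait_time(100, 10_000), "seconds average wait")
```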

Little’s law ramifications
Q = R*W

• In distributed system:
– R scales up
– W remains the same, or gets a bit worse
• To maintain performance, you’re going to
need a whole lot of jobs in flight

The rest of queuing theory
Erlang
• A language
• A man (Agner Krarup Erlang)
• And a unit! (Q from Little’s law AKA offered load is
measured in dimensionless Erlang units)
• Erlang-B formula (for limited-length queues)
• Erlang-C formula (P(waiting))

Abandon hope

Math for queuing theory
[Figure: complexity of math (log scale, 1 to 10000) versus real-world applicability, with Little’s law at the low-complexity end]

Simulation
The best way to explore distributed system
behavior

Quiz
Model: Jobs of random durations. 80% load.
Goal: Minimize average job latency.

What to work a bit more on?
• First task received
• Last task received
• Shortest task
• Longest task
• Random task
• Task with least work remaining
• Task with most work remaining
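
The quiz can be explored with a small time-stepped simulation. The sketch below is not the simulation behind the slides: it assumes a single worker, integer job sizes drawn uniformly from 1 to 19 ticks, and one tick of work per scheduling decision, and every name in it is hypothetical.

```python
import random

def make_jobs(n_ticks=20000, arrival_prob=0.08, max_size=19, seed=0):
    """Jobs as (arrival_tick, size). Mean size 10 * 0.08 arrivals/tick = 80% load."""
    rng = random.Random(seed)
    return [(t, rng.randint(1, max_size))
            for t in range(n_ticks) if rng.random() < arrival_prob]

def by(field, pick):
    """Build a policy that picks the min or max of one task field."""
    return lambda active, rng: pick(range(len(active)),
                                    key=lambda k: active[k][field])

# A task is [arrival_tick, size, remaining]; `active` stays in arrival order.
# Each policy returns the index of the task to work on for the next tick.
POLICIES = {
    "first task received": lambda active, rng: 0,
    "last task received": lambda active, rng: len(active) - 1,
    "shortest task": by(1, min),
    "longest task": by(1, max),
    "random task": lambda active, rng: rng.randrange(len(active)),
    "least work remaining": by(2, min),
    "most work remaining": by(2, max),
}

def mean_latency(policy, jobs, seed=1):
    """Run until every job finishes; return mean ticks from arrival to finish."""
    rng = random.Random(seed)
    active, next_job, finished, total, tick = [], 0, 0, 0, 0
    while finished < len(jobs):
        while next_job < len(jobs) and jobs[next_job][0] <= tick:
            arrival, size = jobs[next_job]
            active.append([arrival, size, size])
            next_job += 1
        if active:
            k = policy(active, rng)
            active[k][2] -= 1
            if active[k][2] == 0:
                total += tick + 1 - active[k][0]
                active.pop(k)
                finished += 1
        tick += 1
    return total / len(jobs) if jobs else 0.0

if __name__ == "__main__":
    jobs = make_jobs()
    for name, policy in POLICIES.items():
        print(f"{name:24s} {mean_latency(policy, jobs):8.1f}")
```

Raising `arrival_prob` toward 0.095 gives the 95% load case, where the differences between policies become much larger.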

Simulation results at 80% load
[Figure: bar chart of latency (0 to 50) for each strategy: first task received, last task received, shortest task, longest task, random task, task with least work remaining, task with most work remaining]

Simulation results at 95% load
[Figure: bar chart of latency (log scale, 1 to 100000) for each strategy: first task received, last task received, shortest task, longest task, random task, task with least work remaining, task with most work remaining]

FoundationDB’s approach
• The strategy validated using simulation is used for a
single server’s fiber scheduling
• High priority: Work on the next task to finish
• But be careful to enqueue incoming work
from the network with highest priority: we
want to know about all our jobs to make good
decisions
• Low priority: Catch up with housekeeping (e.g.
non-log writing)

Load spikes
[Figure: two panels comparing a low load system and a high load system]

Bursts of job requests can destroy latency. The
effect is quadratic: a burst produces a queue of
size B that lasts for a time proportional to B. On highly
loaded systems, the effect is multiplied by 1/(1 - load),
leading to huge latency impacts.
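
A back-of-the-envelope version of this argument, under a simple fluid approximation that is an assumption here and not taken from the slides: the backlog from a burst of B jobs drains at the spare capacity, service_rate * (1 - load), and the total waiting added is the area of the resulting triangle.

```python
def burst_cost(burst_size, service_rate, load):
    """Return (drain_time_seconds, total_job_seconds_of_extra_waiting).

    Fluid approximation: the backlog shrinks linearly from burst_size to 0
    at the spare capacity service_rate * (1 - load). Requires 0 <= load < 1.
    """
    spare_rate = service_rate * (1.0 - load)
    drain_time = burst_size / spare_rate
    total_wait = 0.5 * burst_size * drain_time  # area of the backlog triangle
    return drain_time, total_wait

if __name__ == "__main__":
    for load in (0.5, 0.9):
        for burst in (100, 200):
            print(load, burst, burst_cost(burst, 1000.0, load))
```

Doubling the burst quadruples the total waiting, and moving from 50% to 90% load multiplies it by five.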

Burst-avoiding tip
1. Search for any delay/interval in your system
2. If system correctness depends on the
delay/interval being exact, first fix that
3. Now change that delay/interval to randomly
wait 0.8-1.2 times the nominal time on each
execution

YMMV, but this tends to diffuse system events more
evenly in time and help utilization and latency.
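
Step 3 can be as small as the following sketch. The helper name and the heartbeat usage are hypothetical; the 0.8 to 1.2 range is the one from the slide.

```python
import random

def jittered(nominal_seconds, rng=random):
    """Return a delay drawn uniformly from 0.8x to 1.2x the nominal time."""
    return nominal_seconds * rng.uniform(0.8, 1.2)

if __name__ == "__main__":
    # Example: a periodic task with a nominal 60 second interval would
    # call time.sleep(jittered(60.0)) on each iteration.
    print([round(jittered(60.0), 1) for _ in range(5)])
```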

Overload
[Figure: latency (log scale, 1 to 10000) versus jobs/second, with the overload region highlighted]

Overload
What happens when work comes in too fast?
• Somewhere in your system a queue is going to
get huge. Where?
• Lowered efficiency due to:
– Sloshing
– Poor caching
• Unconditional acceptance of new work means
no information transfer to previous system!

Overload (cont’d): Sloshing
Loading 10 million rows into a popular NoSQL K/V
store shows sloshing

12.5 minutes

Overload (cont’d): No sloshing
Loading 10 million rows into FDB shows smooth
behavior:

System queuing

[Diagram: three nodes (Node 1, Node 2, Node 3), each with its own queue; work items A to E wait to enter the system]

System queuing

[Diagram: the same three nodes; item A has moved into a queue, and B to E are still waiting]

Internal queue buildup

[Diagram: the same three nodes; items A, B, and C have built up in a queue inside the system, with D and E following]

Even queues, external buildup

[Diagram: the same three nodes with A, B, and C spread across the queues; D, E, and later items wait outside the system]

Our approach
“Ratekeeper”
• Active management of internal queue sizes
prevents sloshing
• Avoids every subcomponent needing its own
well-tuned load balancing strategy
• Explicitly send queue information at 10 Hz back to
a centrally-elected control algorithm
• When queues get large, slow system input
• Pushes latency into an external queue at the
front of the system using “tickets”

Ratekeeper in action
[Figure: operations per second (0 to 1,400,000) versus seconds (0 to 600)]

What can go wrong
Well, we are controlling the queue depths of the
system, so, basically, everything in control
theory…

Namely, oscillation:

Recognizing oscillation
• Something moving up and down :)
– Look for low utilization of parallel resources
– Zoom in!
• Think about sources of feedback: is there
some way that a machine getting more
jobs done leads to either less or more work for
that machine in the future? (probably yes)

What oscillation looks like
[Figure: utilization % (0 to 70) of Node A and Node B, x-axis from 1 to 5]

What oscillation looks like
[Figure: zoomed-in view, utilization % (-20 to 120) of Node A and Node B, x-axis from 2 to 2.3]

Avoiding oscillation
• This is control theory—avoid if possible!
• The major thing to know: control gets harder
as frequencies get higher. (e.g. Bose
headphones)
• Two strategies:
– Control on a longer time scale
– Introduce a low-pass filter in the control loop (e.g.
exponential moving average)
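
An exponential moving average is only a few lines. This is a generic sketch, not code from any particular system; the class name and the alpha value are illustrative choices.

```python
class ExponentialMovingAverage:
    """Low-pass filter: each update moves the estimate a fraction alpha
    of the way toward the new sample (0 < alpha <= 1)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = sample  # start from the first sample
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value

if __name__ == "__main__":
    ema = ExponentialMovingAverage(alpha=0.1)
    # A signal that flips between 0 and 100 on every sample.
    for i in range(200):
        smoothed = ema.update(100.0 if i % 2 else 0.0)
    print(smoothed)
```

Feeding the control loop the smoothed value instead of the raw one means it reacts to the trend and mostly ignores the fast flipping; a smaller alpha filters more but responds more slowly.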

Instrumentation
If you can’t measure, you can’t make it better

Things that might be nice to measure:
• Latencies
• Queue lengths
• Causes of latency?

Measuring latencies
Our approach:
• We want information about the distribution, not
just the average
• We use a “Distribution” class
– addSample(X)
– Stores 500+ samples
– Throws away half of them when it hits 1000
samples, and halves probability of accepting new
samples
– Also tracks exact min, max, mean, and stddev
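
A rough Python sketch of such a class follows. It is a reconstruction from the bullets above, not the original code: the slide does not say which half of the samples is discarded (a random half is assumed here), and the method names and the percentile helper are made up.

```python
import math
import random

class Distribution:
    def __init__(self, capacity=1000, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.samples = []
        self.accept_probability = 1.0
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.min = math.inf
        self.max = -math.inf

    def add_sample(self, x):
        # Exact statistics see every sample.
        self.count += 1
        self.total += x
        self.total_sq += x * x
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        # The stored samples are a thinned subset of the stream.
        if self.rng.random() < self.accept_probability:
            self.samples.append(x)
            if len(self.samples) >= self.capacity:
                self.rng.shuffle(self.samples)  # assumption: drop a random half
                del self.samples[self.capacity // 2:]
                self.accept_probability /= 2

    @property
    def mean(self):
        return self.total / self.count

    @property
    def stddev(self):
        variance = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(0.0, variance))

    def percentile(self, q):
        """Approximate percentile (0 <= q <= 1) from the stored samples."""
        ordered = sorted(self.samples)
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

if __name__ == "__main__":
    d = Distribution()
    for _ in range(100000):
        d.add_sample(random.expovariate(1.0))
    print(d.mean, d.stddev, d.percentile(0.5), d.percentile(0.99), d.max)
```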

Measuring queue lengths
Our approach:
• Track the % of time that a queue is at zero length
• Measure queue length snapshots at intervals
• Watch out for oscillations
– Slow ones you can see
– Fast ones look like noise (which, unfortunately, is also
what noise looks like)
– “Zoom in” to exclude the possibility of micro-
oscillations

Measuring latency from blocking
• Easy to calculate:
– L = (b0^2 + b1^2 + … + bN^2) / elapsed
– Total all squared seconds of blocking time over some
interval, divide by the duration of the interval.
• Measures impact of unavailability on mean
latency from random traffic
• Example: Is server’s slow latency explained by
this lock?
• Doesn’t count catch-up time.
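
The formula translates directly into code. The function name and the sample blocking times below are hypothetical.

```python
def blocking_latency(blocking_seconds, elapsed_seconds):
    """L = (b0^2 + b1^2 + ... + bN^2) / elapsed, as on the slide.

    blocking_seconds: duration of each blocking episode within the interval.
    elapsed_seconds: duration of the whole measurement interval.
    """
    return sum(b * b for b in blocking_seconds) / elapsed_seconds

if __name__ == "__main__":
    # Example: a lock was held for 100 ms and 200 ms during a 10 second window.
    print(blocking_latency([0.1, 0.2], 10.0))
```

Squaring makes one long block count far more than many short ones with the same total duration, which matches how they affect requests that arrive at random times.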

Summary
Thanks for listening, and remember:
• Everything has a latency curve
• Little’s law
• Randomize regular intervals
• Validate designs with simulation
• Instrument

May your queues be small, but not empty
david.rosenthal@foundationdb.com

Prioritization/QOS
• Can help in systems under partial load
• Vital in systems that handle batch and real-
time loads simultaneously
• Be careful that high priority work doesn’t
generate other high priority work plus other
jobs in the queue. This can lead to poor
utilization analogous to the internal queue
buildup case.

Congestion pricing
• My favorite topic
• Priority isn’t just a function of the benefit of
your job
• To be a good citizen, you should subtract the
costs to others
• For example, jumping into the front of a long
queue has costs proportional to the queue
size

Other FIFO alternatives?
• LIFO
– Avoids the reason to line up early
– In situations where there is adequate capacity to
serve everyone, can yield better waiting times for
everyone involved
