Boxcar
A self-balancing distributed services protocol
R.B. Boyer
Software Engineer
Resume
I help people get jobs.
I solve interesting problems.
Boxcar was the solution to a problem:
Building
How we build products: Simple, Fast, Comprehensive, Relevant
How we build systems: Simple, Fast, Resilient, Scalable
Simple
“I want my application to be more complicated”
- No one ever
Complexity creates confusion.
Confusion breeds bugs.
Fast
“I want my application to be slower”
- No one ever
A speed test added 500 milliseconds of latency per search and saw 20% fewer searches.
Speed is a feature.
Resilient
“I want my users to experience outages”
- No one ever
Programs crash. Machines die.
Minimize vulnerability to any failure.
Scalable
“My system will only need to support 10 users”
- No one ever
Scale with MORE machines, not BIGGER machines.
TL;DR:
Indeed aggregates jobs from job sites and offers job search to job seekers:
Job Sites → Aggregation → Job Search → Job Seekers
Challenge: keep this Simple, Fast, Resilient, and Scalable.
Options:
Share data access?
Example: Shared Database
The Main Application, Analysis Tool, Billing Application, Intern Project, Other Intern Project, Email Tool, ... all connect directly to the Main Database.
This is an anti-pattern.
On a long enough timeline... Maintenance Nightmare
Share data access, but insulate data from consumers.
Shared Database: every consumer connects directly to the Main Database.
Insulated Database: only the Main Service touches the Main Database; the Main Application, Analysis Tool, Billing Application, Intern Project, Other Intern Project, Email Tool, ... all go through the Main Service.
Service?
Many Clients reach one Service over the NETWORK.
The Service hides the icky technical stuff (Databases, Logging, Caches, Business Logic, ...) behind a Client API.
Client API:
    Service.getJobs([12345, 62])
Icky Technical Details:
    SELECT * FROM jobs AS j
    LEFT JOIN companyinfo AS ci ON j.id=ci.job_id
    LEFT JOIN locations AS loc ON loc.id=j.location_id
    WHERE j.id IN (12345, 62)
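The same split between a small client API and the SQL behind it can be sketched in Python. This is a minimal illustration, not Indeed's Doc Service: `DocService` and `get_jobs` are hypothetical stand-ins for `Service.getJobs`, and the in-memory SQLite schema (including the `title`, `name`, and `city` columns) is invented for the example. Callers only ever see `get_jobs`; the join stays inside the service.

```python
import sqlite3


class DocService:
    """Hypothetical service: callers see get_jobs(); the SQL stays inside."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def get_jobs(self, job_ids):
        if not job_ids:
            return []
        placeholders = ",".join("?" for _ in job_ids)
        sql = (
            "SELECT j.id, j.title, ci.name, loc.city FROM jobs AS j "
            "LEFT JOIN companyinfo AS ci ON j.id = ci.job_id "
            "LEFT JOIN locations AS loc ON loc.id = j.location_id "
            f"WHERE j.id IN ({placeholders}) ORDER BY j.id"
        )
        rows = self._conn.execute(sql, list(job_ids)).fetchall()
        # Clients get plain records, never rows, cursors, or table names.
        return [{"id": r[0], "title": r[1], "company": r[2], "city": r[3]} for r in rows]


def demo_db():
    """Hypothetical schema and data, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE locations (id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT, location_id INTEGER);
        CREATE TABLE companyinfo (job_id INTEGER, name TEXT);
        INSERT INTO locations VALUES (1, 'Austin');
        INSERT INTO jobs VALUES (62, 'Engineer', 1), (12345, 'Designer', NULL);
        INSERT INTO companyinfo VALUES (62, 'Acme');
    """)
    return conn


service = DocService(demo_db())
jobs = service.get_jobs([12345, 62])
```

If the tables are later split, cached, or moved to another store, only `DocService` changes; the call `service.get_jobs([12345, 62])` stays the same for every consumer.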
Service Oriented Architecture
Boxcar
Boxcar is a... self-balancing distributed services protocol
Origin Story
There was a life before Boxcar.
There were services before Boxcar.
Pick one: Doc Service
Document Serving Service, aka “Doc Service”
http://go.indeed.com/docservice
Doc Service controls access to JOBS.
Building Blocks:
Webapp: wants jobs
Doc Service: controls access to jobs
Docstore: stores jobs
Build it
Webapp → Doc Service → Docstore
Mission Accomplished
But is it good?
How we build systems: Simple, Fast, Resilient, Scalable
Goodness Metric:
Simple deploys
Efficient networking (Fast)
Resilient
Horizontally scalable
Webapp → Doc Service → Docstore
✘ Resilient: there is only one of each, so any single failure takes the whole stack down.
Add Resilience
Run two copies of the whole stack behind a Front-end Load Balancer:
Front-end Load Balancer → 2 × (Webapp → Doc Service → Docstore)
Siloed Stacks
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
? Horizontally scalable
Scaling Silos
Adding a third Webapp means adding a third Doc Service and a third Docstore too: each silo grows only as a whole stack.
Need bigger and bigger machines
Vertical Scaling
Siloed Stacks
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✘ Horizontally scalable

Services Version 1
Improve scalability
Put a Doc Service Load Balancer between the Webapps and the Doc Services, so each tier can grow on its own:
Front-end Load Balancer → 3 × Webapp → Doc Service Load Balancer → 2 × Doc Service → Docstores
Per-Service Balancer
~ Simple deploys
? Efficient networking (Fast)
? Resilient
✓ Horizontally scalable
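Proxying isn’t free
Webapp → Doc Service Load Balancer → Doc Service
✘ 2x Bandwidth: every request and response crosses the network twice, once between the Webapp and the balancer and again between the balancer and the Doc Service.
Per-Service Balancer
~ Simple deploys
✘ Efficient networking (Fast)
? Resilient
✓ Horizontally scalable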
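The 2x figure is simple arithmetic: a proxy puts every byte on the wire once per network hop. A small Python check, assuming a hypothetical 2 KB request and 50 KB response (the sizes are illustrative, not from the talk):

```python
def bytes_on_wire(request_bytes: int, response_bytes: int, hops: int) -> int:
    """Bytes crossing the network for one call that traverses `hops` links.
    Each hop carries the request one way and the response back."""
    return (request_bytes + response_bytes) * hops


REQUEST, RESPONSE = 2_000, 50_000   # hypothetical sizes in bytes

direct = bytes_on_wire(REQUEST, RESPONSE, hops=1)    # Webapp -> Doc Service
proxied = bytes_on_wire(REQUEST, RESPONSE, hops=2)   # Webapp -> balancer -> Doc Service
print(direct, proxied)   # 52000 104000
```

Whatever the payload sizes, the ratio is fixed by the hop count, which is why removing the middle hop halves the traffic.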
Resilience
The Doc Service Load Balancer is a SINGLE POINT OF FAILURE.
Need two balancers
...and a way to balance between them?
Load Balancer Balancing:
Master / Slave
Share IP address
Heartbeat between nodes
Complex
Resilience
Front-end Load Balancer → 3 × Webapp → 2 × Doc Service Load Balancer → 2 × Doc Service → Docstores
Best explained by our Operations folks: “Redundant Array of Inexpensive Datacenters” http://go.indeed.com/raid
Per-Service Balancer
~ Simple deploys
✘ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable

Services Version 2
Reduce network waste
Remove the Doc Service Load Balancer: each Webapp connects directly to the Doc Services and takes them in turn.
Front-end Load Balancer → 3 × Webapp → 2 × Doc Service → Docstores
Naive Round Robin
✓ Simple deploys
? Efficient networking (Fast)
? Resilient
✓ Horizontally scalable
Direct Connections: Webapp → Doc Service, ✓ 1x Bandwidth
Naive Round Robin
✓ Simple deploys
✓ Efficient networking (Fast)
? Resilient
✓ Horizontally scalable
Server A, Server B: a REQUEST that fails on one server (✘) is retried on the other, which serves it.
Naive Round Robin
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable
? Balanced
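The retry behavior in the two-server frames can be sketched in Python as naive round robin with failover. This is a minimal illustration under assumed names: servers are modeled as plain callables that raise `ConnectionError` when down, and `RoundRobinClient`, `server_a`, and `server_b` are hypothetical.

```python
from itertools import cycle


class RoundRobinClient:
    """Naive round robin: take servers in turn; on failure, retry on the next one."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self._servers = list(servers)
        self._ring = cycle(self._servers)

    def call(self, request):
        last_error = None
        for _ in range(len(self._servers)):      # try each server at most once
            server = next(self._ring)
            try:
                return server(request)
            except ConnectionError as err:
                last_error = err                 # that server is down: move on
        raise last_error


def server_a(request):
    raise ConnectionError("Server A is down")


def server_b(request):
    return f"Server B handled {request}"


client = RoundRobinClient([server_a, server_b])
results = [client.call(f"REQUEST {i}") for i in range(3)]
```

Every request still succeeds because the failure of Server A is absorbed by a retry on Server B, which is why the checklist can mark round robin as resilient.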
Slow and Fast servers: round robin hands both the same number of requests, but the Slow server finishes them more slowly than they arrive. Its backlog keeps growing until it Can’t keep up.
Naive Round Robin
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable
✘ Balanced
NOPE
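A toy discrete-time model makes the Slow/Fast problem concrete. This Python sketch assumes requests are dealt strictly in turn and that each server completes a fixed number per tick; the capacities are hypothetical. Total capacity (1 + 3 per tick) equals the arrival rate of 4 per tick, yet the Slow server's backlog still grows without bound, because round robin ignores how fast each server is.

```python
def round_robin_backlog(ticks, arrivals_per_tick, capacities):
    """Deal requests round robin; server i completes up to capacities[i] per tick.
    Returns each server's queue length after `ticks` ticks."""
    queues = [0] * len(capacities)
    next_server = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            queues[next_server] += 1
            next_server = (next_server + 1) % len(capacities)
        for i, capacity in enumerate(capacities):
            queues[i] = max(0, queues[i] - capacity)
    return queues


# Server 0 is Slow (1 request per tick), server 1 is Fast (3 per tick).
backlog = round_robin_backlog(ticks=100, arrivals_per_tick=4, capacities=[1, 3])
print(backlog)  # [100, 0]
```

The Fast server sits partly idle while the Slow server falls one request further behind every tick.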
Ensure balance
Start from the Per-Service Balancer: Front-end Load Balancer → Webapps → Doc Service Load Balancer → Doc Services → Docstores.
Distribute! Replace the shared Doc Service Load Balancer with a small balancer (B) running alongside each client.
Boxcar: Front-end Load Balancer → Web Apps, each with its own B → Doc Services → Docstores
Naive Round Robin vs. Per-Service Balancer
The Boxcar balancing algorithm is simple.
Gist:
Servers assign a numeric value to connections.
Clients use the connection with the lowest numeric value to service each request.
Server A numbers its connections: Slot 0, Slot 1, Slot 2, Slot 3, Slot 4, ...
Slot Numbers:
Just numbers
No limit
NOT a priority
ONLY used for balancing
LOW slot numbers are the BEST slot numbers.
Server A: Slots 0, 1, and 4 are USED; Slots 2 and 3 are free.
Client 2 connects (“Hello!”) and Server A assigns it Slot 2, the lowest slot not already USED.
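One way a server could hand out slot numbers, sketched in Python under the assumption (consistent with the animations, but not spelled out in the talk) that a new connection always receives the lowest slot number not currently in use. `SlotAllocator`, `acquire`, and `release` are hypothetical names.

```python
import heapq


class SlotAllocator:
    """Hypothetical server-side bookkeeping: hand out the lowest free slot number."""

    def __init__(self):
        self._next_unused = 0   # smallest slot number never handed out
        self._freed = []        # min-heap of released slot numbers

    def acquire(self):
        # Every freed slot is below _next_unused, so the heap minimum
        # (when present) is the lowest free slot overall.
        if self._freed:
            return heapq.heappop(self._freed)
        slot = self._next_unused
        self._next_unused += 1
        return slot

    def release(self, slot):
        heapq.heappush(self._freed, slot)   # assumes each slot is released once


server_a = SlotAllocator()
slots = [server_a.acquire() for _ in range(5)]   # Slots 0..4
server_a.release(2)                              # a connection on Slot 2 closes
client_2_slot = server_a.acquire()               # Client 2 says hello
```

There is no upper limit on slot numbers: once the low slots are taken, the next connection simply gets the next integer.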
Client 2 holds long-lived connections to Server A and Server B, with slots 0, 2, 12, 29, 30, and 57.
Clients are greedy (“MINE!”): slot 2 beats slot 50.
Want the best connections
Continually look for better connections
Close the worst connections
A background thread maintains the connection pool.
The background thread opens a new connection and is assigned Slot 17. Slot 17 is better than the worst connection in the pool, Slot 57, so the client closes Slot 57 (✘) and keeps Slot 17.
Pool before: Slots 0, 2, 12, 29, 30, 57
Pool after: Slots 0, 2, 12, 17, 29, 30
This continues forever.
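The background thread's decision for each newly opened connection can be sketched in Python. This assumes a fixed pool size and compares connections only by slot number, ignoring which server a slot belongs to. `ConnectionPool` and `offer` are hypothetical names, and actually closing a socket (which frees the slot on its server) is left out.

```python
class ConnectionPool:
    """Hypothetical client-side pool, tracked by the slot numbers servers assigned."""

    def __init__(self, size):
        self.size = size
        self.slots = []   # kept sorted, best (lowest) first

    def offer(self, new_slot):
        """Consider a freshly opened connection. Returns the slot of the
        connection to close (the old worst, or the new one if it is no
        better), or None if the pool simply had room."""
        if len(self.slots) < self.size:
            self.slots.append(new_slot)
            self.slots.sort()
            return None
        worst = self.slots[-1]
        if new_slot < worst:
            self.slots[-1] = new_slot   # swap out the worst connection
            self.slots.sort()
            return worst
        return new_slot                 # not an improvement: hang it up


pool = ConnectionPool(size=6)
for slot in [0, 2, 12, 29, 30, 57]:
    pool.offer(slot)
closed = pool.offer(17)     # the new Slot 17 beats the worst, Slot 57
rejected = pool.offer(40)   # Slot 40 is no better than the worst, Slot 30
```

Running `offer` in a loop forever is what keeps every client drifting toward the lowest slots available anywhere.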
Incoming Requests: GetJobs()
Client 2’s pool: Slot 0 ACTIVE, Slot 2 ACTIVE, Slot 12 [idle], Slot 29 ACTIVE, Slot 30 [idle], Slot 57 [idle]
The request takes the idle connection with the lowest slot number, Slot 12, which becomes ACTIVE.
Connections are NOT established on-demand.
Requests to Busy Pool: GetJobs() arrives while Slots 0, 2, 12, 29, 30, and 57 are all ACTIVE, so there is no idle connection to use.
Sizing the pool properly is imperative!
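Request dispatch over an already-open pool can be sketched in Python. The slides do not say what Boxcar does when every connection is busy (wait, fail, or something else), so this hypothetical `dispatch` just returns `None` in that case; that gap is exactly why the pool has to be sized for peak concurrency.

```python
def dispatch(pool):
    """pool maps slot number -> True if that connection is busy (ACTIVE).
    Send the request on the idle connection with the lowest slot number and
    mark it busy. Returns None when every connection is busy."""
    idle = [slot for slot, busy in pool.items() if not busy]
    if not idle:
        return None          # hypothetical: no connection is opened on demand
    slot = min(idle)
    pool[slot] = True
    return slot


pool = {0: True, 2: True, 12: False, 29: True, 30: False, 57: False}
first = dispatch(pool)    # Slot 12
second = dispatch(pool)   # Slot 30
third = dispatch(pool)    # Slot 57
fourth = dispatch(pool)   # None: the pool is busy
```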
Gist Redux
Servers assign a numeric value to connections.
Clients use the connection with the lowest numeric value to service each request.
Balanced load is emergent behavior.
Load Balancing Simulations
Setup: Server A and Server B; each client keeps a pool of two connections. “X: A0, B0” means Client X holds slot 0 on Server A and slot 0 on Server B.
Client X connects to both servers and gets slot 0 on each. It also tries slot 1 on each server, but those are no better than what it holds, so it hangs them up.
Steady-state balance: X: A0, B0
New Clients Join
Client Y gets slot 1 on each server (Y: A1, B1); later probes of slot 2 are no improvement and are closed. Client Z then gets slot 2 on each server (Z: A2, B2); probes of slot 3 are closed.
Steady-state balance: X: A0, B0 · Y: A1, B1 · Z: A2, B2
Server Failure
Server B dies, and every client loses its Server B connection. To refill their pools, the clients open new connections to Server A: Z gets A3, X gets A4, and Y gets A5.
Steady-state balance: X: A0, A4 · Y: A1, A5 · Z: A2, A3
Server Restored
Server B comes back with no connections. Z gets B0 and, since 0 < 3, closes A3. Y gets B1 and, since 1 < 5, closes A5. X gets B2 and, since 2 < 4, closes A4.
Steady-state balance: X: A0, B2 · Y: A1, B1 · Z: A2, B0
Client Shutdown
Client Y shuts down, freeing A1 and B1. Z takes B1 and, since 1 < 2, closes A2 (Z: B0, B1). X then takes A1 and, since 1 < 2, closes B2 (X: A0, A1).
Steady-state balance: X: A0, A1 · Z: B0, B1
Client Rejoins
Client Y reconnects and gets slot 2 on each server.
Steady-state balance: X: A0, A1 · Y: A2, B2 · Z: B0, B1
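A toy re-creation of these scenarios in Python, assuming the rules as read from the animations: a server hands out its lowest free slot; each client keeps two connections, periodically probes every live server with a new connection, and keeps the probe only if it beats its current worst slot. `Server`, `Client`, and `settle` are hypothetical names, and this sketch probes in a fixed order that differs from the slides, so individual slot numbers may not match the animations; the per-server connection counts are what it checks.

```python
class Server:
    """Hypothetical server: hands out the lowest free slot number."""

    def __init__(self, name):
        self.name, self.up, self.used = name, True, set()

    def connect(self):
        slot = 0
        while slot in self.used:
            slot += 1
        self.used.add(slot)
        return slot

    def hang_up(self, slot):
        self.used.discard(slot)

    def fail(self):
        self.up = False
        self.used.clear()          # every connection to this server is lost


class Client:
    """Hypothetical client: keeps `size` connections and chases lower slots."""

    def __init__(self, size=2):
        self.size, self.pool = size, []   # pool holds (server, slot) pairs

    def maintain(self, servers):
        self.pool = [(s, n) for s, n in self.pool if s.up]   # forget dead servers
        for server in (s for s in servers if s.up):
            slot = server.connect()                  # probe with a new connection
            if len(self.pool) < self.size:
                self.pool.append((server, slot))
                continue
            worst = max(self.pool, key=lambda conn: conn[1])
            if slot < worst[1]:
                worst[0].hang_up(worst[1])           # close the worst connection
                self.pool[self.pool.index(worst)] = (server, slot)
            else:
                server.hang_up(slot)                 # probe was no better

    def shutdown(self):
        for server, slot in self.pool:
            server.hang_up(slot)
        self.pool = []


def settle(clients, servers, rounds=5):
    """Let every client run its background maintenance a few times, then
    report how many connections each live server holds."""
    for _ in range(rounds):
        for client in clients:
            client.maintain(servers)
    return {s.name: len(s.used) for s in servers if s.up}


a, b = Server("A"), Server("B")
x, y, z = Client(), Client(), Client()
history = [settle([x], [a, b])]              # one client
history.append(settle([x, y, z], [a, b]))    # new clients join
b.fail()
history.append(settle([x, y, z], [a, b]))    # server failure
b.up = True
history.append(settle([x, y, z], [a, b]))    # server restored
y.shutdown()
history.append(settle([x, z], [a, b]))       # client shutdown
history.append(settle([x, y, z], [a, b]))    # client rejoins
print(history)
```

`history` records connections per live server after each event: 1 and 1, then 3 and 3, then 6 on Server A alone while Server B is down, 3 and 3 after B returns, 2 and 2 after Client Y leaves, and 3 and 3 after it rejoins. No event needs a central coordinator to restore the balance.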
Why does this Balance?
Connections are like running water seeking lower ground.
[Figure: connections, stacked in slots on each server, flow toward the lowest open slots until they settle into a Roughly Equal Distribution across servers.]
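The water analogy can be checked with an idealized Python model. It assumes each server's used slots are dense (0 up to its count minus one), so its lowest open slot equals its connection count, and that each new connection takes the lowest open slot across all servers. Clients, probes, and gaps are ignored, and `fill` is a hypothetical name. Under those assumptions the counts never differ by more than one.

```python
def fill(num_servers, num_connections):
    """Place connections one at a time on the server whose lowest open slot
    is lowest. Returns the number of connections per server."""
    used = [0] * num_servers   # server i's slots 0..used[i]-1 are taken
    for _ in range(num_connections):
        # With dense slots, server i's lowest open slot is used[i].
        best = min(range(num_servers), key=lambda i: used[i])
        used[best] += 1
    return used


print(fill(3, 10))   # [4, 3, 3]
```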
Edge cases
Server A: slot 0, slot 1, both held by Client X
Server B: slot 0, slot 1, both held by Client Z
Balanced, but not ideal: each client depends on a single server.
When Client Z’s connections to Server B go away, Client Z is left with an EMPTY POOL!
✘ Resilient
Fix by adding entropy, aka “Table Shaking”
Table Shaking:
Servers regularly hang up on connections.
Clients expect failed connections.
Failures are retried on new connections.
Table Shaking turns this (Client X connected only to Server A, Client Z connected only to Server B) into this: Client X and Client Z each hold connections to both Server A and Server B. YAY!
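Both halves of table shaking can be sketched in Python. The slides do not say how often servers hang up or how retries are bounded, so `shake`, `call`, the `fraction` parameter, and the three-attempt limit are all hypothetical. The server side drops a random subset of its connections; the client side treats a dropped connection as routine and retries on a fresh one.

```python
import random


def shake(slots, rng, fraction):
    """Hypothetical server-side table shaking: hang up on each connection with
    probability `fraction`. Returns (kept, hung_up)."""
    kept, hung_up = [], []
    for slot in slots:
        (hung_up if rng.random() < fraction else kept).append(slot)
    return kept, hung_up


def call(request, connection, reconnect, attempts=3):
    """Hypothetical client: a failed connection is expected, so the request is
    retried on a freshly opened connection instead of being reported as an error."""
    for _ in range(attempts):
        try:
            return connection(request)
        except ConnectionError:
            connection = reconnect()
    raise ConnectionError(f"{request} failed on {attempts} connections")


def hung_up_connection(request):
    raise ConnectionError("server hung up")


def fresh_connection(request):
    return f"served {request}"


rng = random.Random(42)
kept, hung_up = shake([0, 1, 2, 3], rng, fraction=1.0)   # hang up on everything
result = call("GetJobs()", hung_up_connection, reconnect=lambda: fresh_connection)
```

The reconnect is where the entropy pays off: each new connection goes through the normal slot assignment again, giving a client stuck on one server a chance to land on another.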
Balancing Tricks: Handicapping
Handicapping is Server Self-quarantine.
Exploit slot number assignment:
Unhealthy servers inflate slot numbers.
Clients naturally avoid these servers.
[Figure: an Unhealthy server hands out inflated slot numbers, so connections flow toward the healthy servers: graceful degradation.]
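Handicapping can be sketched in Python by letting an unhealthy server start its slot numbering at a large offset. The slides only say that unhealthy servers inflate slot numbers; the `handicap` offset, its value of 100, and the class name are assumptions made for illustration.

```python
class Server:
    """Hypothetical server that hands out its lowest free slot number, starting
    from `handicap` (0 when healthy, inflated when unhealthy)."""

    def __init__(self, name, handicap=0):
        self.name, self.handicap, self.used = name, handicap, set()

    def connect(self):
        slot = self.handicap
        while slot in self.used:
            slot += 1
        self.used.add(slot)
        return slot


healthy, unhealthy = Server("A"), Server("B", handicap=100)

# A greedy client offered one new connection from each keeps the lower slot.
offers = {s.name: s.connect() for s in (healthy, unhealthy)}
preferred = min(offers, key=offers.get)

for _ in range(150):
    healthy.connect()          # Server A fills up: slots 1..150
late_offers = {s.name: s.connect() for s in (healthy, unhealthy)}
late_preferred = min(late_offers, key=late_offers.get)
```

In this sketch the unhealthy server is avoided but not cut off: once the healthy server's slots climb past the handicap, Server B's slots look better again, which is one way to get graceful degradation instead of a hard outage.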
Is Boxcar good?
Boxcar
✓ Simple deploys
✓ Efficient networking (Fast)
? Resilient
✓ Horizontally scalable
? Balanced
Clients are pessimistic: failure is expected.
Boxcar
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable
? Balanced
Balance Connections, Not Requests
Balancing Review: with Naive Round Robin, the Slow server Can’t keep up.
The problem was that requests (connections) piled up.
Boxcar has a fixed number of connections: there’s nothing to pile up.
A Client holds two connections: Slot 7 to a Slow Server and Slot 9 to a Fast Server.
Each request uses the lowest-numbered idle connection, so it goes to Slot 7 whenever Slot 7 is free. While the Slow Server is still working, Slot 7 stays busy and the next request goes to Slot 9; the Fast Server finishes quickly and frees Slot 9 again.
Result: Slot 7 (Slow Server) handled 2 requests; Slot 9 (Fast Server) handled 4 requests.
Slow servers handle fewer requests.
No overloaded servers.
All requests are serviced.
Load balancing is probabilistic.
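The Slot 7 / Slot 9 animation can be reproduced with a small Python model. It assumes one request arrives per tick and fixed service times of 3 ticks on the Slow Server and 1 tick on the Fast Server; these numbers are hypothetical, chosen so the outcome can be compared with the slide's 2 vs 4. `simulate` is a hypothetical name.

```python
def simulate(ticks, durations):
    """One request arrives per tick and takes the idle connection with the
    lowest slot number. durations maps slot -> ticks that server needs per request.
    Returns (requests handled per slot, requests that found no idle connection)."""
    busy_until = {slot: 0 for slot in durations}
    handled = {slot: 0 for slot in durations}
    no_idle = 0
    for t in range(ticks):
        idle = [slot for slot, until in busy_until.items() if until <= t]
        if not idle:
            no_idle += 1
            continue
        slot = min(idle)                        # lowest slot number wins
        busy_until[slot] = t + durations[slot]  # connection is ACTIVE until then
        handled[slot] += 1
    return handled, no_idle


# Slot 7 goes to the Slow Server (3 ticks per request), Slot 9 to the Fast Server (1 tick).
print(simulate(6, {7: 3, 9: 1}))   # ({7: 2, 9: 4}, 0)
```

Slot 7 wins every tie, yet the Slow Server ends up with half as many requests, because its connection is busy two thirds of the time. Under round robin, both servers would have received 3 requests each.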
Boxcar
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable
✓ Balanced
Good enough for Indeed
Serves well over a BILLION requests every day
Fundamental technology powering over 20 different services
In production since 2009
Service Oriented Architecture
Q&A
[@IndeedEng] Boxcar: A self-balancing distributed services protocol
Video available at: http://www.youtube.com/watch?v=E1ok08TVxDw

Indeed's flagship job search product has evolved over the years to meet new challenges. It began as a single, monolithic web application. This grew larger and increasingly complex as we built new features. To remedy this growing problem, we implemented a service-oriented architecture to improve system availability, scalability, and maintainability. We examined common practices for service-oriented architectures, and we discovered ways to improve on the state of the art. We developed these ideas into a new framework called Boxcar. In this talk, we will discuss the scaling problems we solved, the innovative ideas behind Boxcar, and how we built the scalable architecture that we now use throughout our systems.

R.B. Boyer is a Software Engineer who has been with Indeed since late 2007. Over the years he has worked on a variety of projects, including distributed storage, authentication, and service architectures.

Published in: Technology, Business
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
5,589
On Slideshare
0
From Embeds
0
Number of Embeds
23
Actions
Shares
0
Downloads
14
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

[@IndeedEng] Boxcar: A self-balancing distributed services protocol

  1. 1. Boxcar A self-balancing distributed services protocol
  2. 2. R.B. Boyer Software Engineer Resume
  3. 3. I help people get jobs.
  4. 4. I solve interesting problems.
  5. 5. Boxcar was the solution to a problem:
  6. 6. Building
  7. 7. How we build products Simple Fast Comprehensive Relevant
  8. 8. How we build products Simple Fast Comprehensive Relevant
  9. 9. How we build systems Simple Fast Resilient Scalable
  10. 10. Simple “I want my application to be more complicated” - No one ever
  11. 11. Complexity creates confusion
  12. 12. Complexity creates confusion Confusion breeds bugs
  13. 13. Fast “I want my application to be slower” - No one ever
  14. 14. conducted a speed test
  15. 15. +500 milliseconds of latency per search
  16. 16. 20% fewer searches
  17. 17. Speed is a feature
  18. 18. Resilient “I want my users to experience outages” - No one ever
  19. 19. Programs crash
  20. 20. Programs crash Machines die
  21. 21. Minimize vulnerability to any failure
  22. 22. Scalable “My system will only need to support 10 users” - No one ever
  23. 23. Scale with MORE machines
  24. 24. Scale with MORE machines Not BIGGER machines
  25. 25. TL;DR:
  26. 26. Indeed Jobs Sites Job Seekers
  27. 27. Aggregation Jobs Sites Job Search Job Seekers
  28. 28. Job Search Aggregation
  29. 29. Challenge! Job Search Aggregation
  30. 30. Challenge: keep this Simple Fast Resilient Scalable
  31. 31. Options:
  32. 32. Share data access?
  33. 33. Example: Shared Database
  34. 34. Shared Database Main Database
  35. 35. Shared Database Main Application Main Database
  36. 36. Shared Database Main Application Analysis Tool Main Database
  37. 37. Shared Database Main Application Analysis Tool Billing Application Main Database
  38. 38. Shared Database Main Application Main Database Analysis Tool Billing Application Intern Project
  39. 39. Shared Database Main Application Main Database Analysis Tool Billing Application Intern Project Other Intern Project
  40. 40. Shared Database Main Application Main Database Analysis Tool Billing Application Intern Project Other Intern Project Email Tool
  41. 41. Shared Database Main Application Main Database Analysis Tool This is an anti-pattern Billing Application Intern Project Other Intern Project Email Tool
  42. 42. On a long enough timeline...
  43. 43. Maintenance Nightmare
  44. 44. Share data access
  45. 45. Share data access Insulate data from consumers
  46. 46. Shared Database Main Application Main Database Analysis Tool Billing Application Intern Project Other Intern Project Email Tool
  47. 47. Insulated Database Main Application Main Database Main Service Analysis Tool Billing Application Intern Project Other Intern Project Email Tool
  48. 48. Service?
  49. 49. Service Client Client Client Client
  50. 50. Client Client Client Client NETWORK Service
  51. 51. Client Client Client Client NETWORK Service Icky Technical Stuff
  52. 52. Service NETWORK Databases Client Client Client Client Logging Caches Business Logic ... Client API Icky Technical Details
  53. 53. Client API Service.getJobs([12345, 62])
  54. 54. Icky Technical Details SELECT * FROM jobs AS j LEFT JOIN companyinfo AS ci ON j.id=ci.job_id LEFT JOIN locations AS loc ON loc.id=j.location_id WHERE j.id IN (12345, 62)
  55. 55. Service Oriented Architecture
  56. 56. Service Oriented Architecture
  57. 57. Boxcar
  58. 58. Boxcar is a... self-balancing distributed services protocol
  59. 59. Origin Story
  60. 60. There was a life before Boxcar
  61. 61. There were services before Boxcar
  62. 62. Pick one:
  63. 63. Doc Service
  64. 64. Document Serving Service aka “Doc Service” http://go.indeed.com/docservice
  65. 65. Doc Service controls access to JOBS
  66. 66. Building Blocks Webapp Wants jobs Doc Service Controls access to jobs Docstore Stores jobs
  67. 67. Build it
  68. 68. Webapp
  69. 69. Webapp Doc Service Docstore
  70. 70. Webapp Doc Service Docstore
  71. 71. Mission Accomplished Webapp Doc Service Docstore
  72. 72. But is it good?
  73. 73. How we build systems Simple Fast Resilient Scalable
  74. 74. Goodness Metric Simple deploys Efficient networking (Fast) Resilient Horizontally scalable
  75. 75. Webapp Doc Service Docstore
  76. 76. Webapp Doc Service Docstore
  77. 77. ✘ Resilient Webapp Doc Service Docstore
  78. 78. Add Resilience
  79. 79. Webapp Doc Service Docstore
  80. 80. Webapp Webapp Doc Service Doc Service Docstore Docstore
  81. 81. Front-end Load Balancer Webapp Webapp Doc Service Doc Service Docstore Docstore
  82. 82. Siloed Stacks Front-end Load Balancer Webapp Webapp Doc Service Doc Service Docstore Docstore
  83. 83. Siloed Stacks ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ? Horizontally scalable
  84. 84. Scaling Silos Front-end Load Balancer Webapp Webapp Doc Service Doc Service Docstore Docstore
  85. 85. Scaling Silos Front-end Load Balancer Webapp Webapp Webapp Doc Service Doc Service Docstore Docstore
  86. 86. Scaling Silos Front-end Load Balancer Webapp Webapp Webapp Doc Service Doc Service Doc Service Docstore Docstore Docstore
  87. 87. Need bigger and bigger machines
  88. 88. Vertical Scaling
  89. 89. Siloed Stacks ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✘ Horizontally scalable
  90. 90. Siloed Stacks ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✘ Horizontally scalable Services Version 1
  91. 91. Improve scalability
  92. 92. Front-end Load Balancer Webapp Webapp Webapp Doc Service Doc Service Doc Service Docstore Docstore Docstore
  93. 93. Front-end Load Balancer Webapp Webapp Webapp Doc Service Load Balancer Doc Service Doc Service Docstore Docstore
  94. 94. Per-Service Balancer Front-end Load Balancer Webapp Webapp Webapp Doc Service Load Balancer Doc Service Doc Service Docstore Docstore
  95. 95. Per-Service Balancer ~ Simple deploys ? Efficient networking (Fast) ? Resilient ✓ Horizontally scalable
  96. 96. Proxying isn’t free ✘2x Bandwidth Webapp Doc Service Load Balancer Doc Service
  97. 97. Per-Service Balancer ~ Simple deploys ✘ Efficient networking (Fast) ? Resilient ✓ Horizontally scalable
  98. 98. Resilience Front-end Load Balancer Webapp Webapp Webapp SINGLE POINT OF FAILURE Doc Service Doc Service Docstore Docstore
  99. 99. Need two balancers
  100. 100. Need two balancers ...and a way to balance between them?
  101. 101. Load Balancer Balancing Master / Slave Share IP address Heartbeat between nodes Complex
  102. 102. Resilience Front-end Load Balancer Webapp Webapp Webapp Doc Service Load Balancer Doc Service Load Balancer Doc Service Doc Service Docstore Docstore
  103. 103. Best explained by our Operations folks: “Redundant Array of Inexpensive Datacenters” http://go.indeed.com/raid
  104. 104. Per-Service Balancer ~ Simple deploys ✘ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable
  105. 105. Per-Service Balancer ~ Simple deploys ✘ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable Services Version 2
  106. 106. Reduce network waste
  107. 107. Front-end Load Balancer Webapp Webapp Webapp Doc Service Load Balancer Doc Service Doc Service Docstore Docstore
  108. 108. Front-end Load Balancer Webapp Webapp Webapp Doc Service Doc Service Docstore Docstore
  109. 109. Naive Round Robin Front-end Load Balancer Webapp Webapp Webapp Doc Service Doc Service Docstore Docstore
  110. 110. Naive Round Robin ✓ Simple deploys ? Efficient networking (Fast) ? Resilient ✓ Horizontally scalable
  111. 111. Direct Connections Webapp ✓1x Bandwidth Doc Service
  112. 112. Naive Round Robin ✓ Simple deploys ✓ Efficient networking (Fast) ? Resilient ✓ Horizontally scalable
  113. 113. Server A Server B
  114. 114. Server A Server B
  115. 115. REQUEST Server A Server B
  116. 116. ✘ REQUEST Server A Server B
  117. 117. ✘ REQUEST Server A Server B
  118. 118. REQUEST Server A Server B
  119. 119. Server A Server B REQUEST
  120. 120. Naive Round Robin ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable
  121. 121. Naive Round Robin ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable
  122. 122. Naive Round Robin ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable ? Balanced
  123. 123. Slow Fast
  124. 124. Slow Fast
  125. 125. Slow Fast
  126. 126. Slow Fast
  127. 127. Slow Fast
  128. 128. Slow Fast
  129. 129. Slow Fast
  130. 130. Slow Fast
  131. 131. Slow Fast
  132. 132. Slow Fast
  133. 133. Slow Fast
  134. 134. Can’t keep up Slow Fast
  135. 135. Naive Round Robin ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable ✘ Balanced
  136. 136. Naive Round Robin ✓ Simple deploys ✓ Efficient networking (Fast) ✓ Resilient ✓ Horizontally scalable ✘ Balanced NOPE
  137. 137. Ensure balance
  138. 138. Front-end Load Balancer Webapp Webapp Webapp Doc Service Load Balancer Doc Service Doc Service Docstore Docstore
  139. 139. Front-end Load Balancer Webapp Webapp Webapp Distribute! Doc Service Doc Service Docstore Docstore
  140. 140. Front-end Load Balancer Webapp Webapp B Webapp B B Doc Service Doc Service Docstore Docstore
  141. 141. Front-end Load Balancer Web App Web App B Web App B Doc Service Doc Service Docstore Docstore B
  142. 142. Boxcar Front-end Load Balancer Web App Web App B Web App B Doc Service Doc Service Docstore Docstore B
  143. 143. Naive Round Robin Per-Service Balancer
  144. 144. The Boxcar balancing algorithm is simple
  145. 145. Gist Servers assign numeric value to connections Clients use the connection with the lowest numeric value to service each request
146–148. [Diagram: Server A’s connections are numbered Slot 0, Slot 1, Slot 2, Slot 3, Slot 4, ...]
149. Slot Numbers
Just numbers
No limit
NOT a priority
ONLY used for balancing
150. LOW slot numbers are the BEST slot numbers
151–155. [Diagram: Server A has Slots 0, 1, and 4 USED; Client 2 connects (“Hello!”) and is given Slot 2, which becomes USED]
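A minimal server-side sketch, assuming the server always hands out the lowest slot number not currently in use. That rule is inferred from the example where Client 2 receives Slot 2 while Slot 3 is also free; `SlotAllocator` is a hypothetical name.

```python
import heapq
from typing import List, Set


class SlotAllocator:
    """Server side: give each new connection the lowest slot number not in use.

    Assumption: the lowest-free-slot rule is inferred from the slides,
    not taken from Boxcar's source.
    """

    def __init__(self) -> None:
        self._next_unused = 0           # every slot >= this has never been handed out
        self._released: List[int] = []  # min-heap of slots freed by closed connections
        self._in_use: Set[int] = set()

    def acquire(self) -> int:
        # Released slots are always below _next_unused, so check them first.
        if self._released:
            slot = heapq.heappop(self._released)
        else:
            slot = self._next_unused
            self._next_unused += 1
        self._in_use.add(slot)
        return slot

    def release(self, slot: int) -> None:
        self._in_use.remove(slot)  # KeyError if the slot was not in use
        heapq.heappush(self._released, slot)
```

Replaying the slide: with Slots 0, 1, and 4 in use and 2 and 3 free, the next connection gets Slot 2.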
156–157. [Diagram: Client 2 holds long-lived connections to Server A and Server B, with Slots 0, 2, 12, 29, 30, and 57]
158–159. Clients are greedy: “MINE!” [image: slots 50 and 2]
Want the best connections
Continually look for better connections
Close the worst connections
160. A background thread maintains the connection pool
161–167. [Animation: a connection with Slot 17 becomes available; Client 2 opens it and closes its worst connection, Slot 57, leaving Slots 0, 2, 12, 17, 29, and 30]
This continues forever
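A sketch of one step of the greedy background thread, assuming a fixed pool size and that an offered slot is kept only if it is strictly lower than the worst slot held. `GreedyPool` and `offer()` are illustrative names, not Boxcar's API.

```python
import bisect
from typing import List, Optional


class GreedyPool:
    """Client-side pool that keeps the lowest slot numbers it can get."""

    def __init__(self, size: int, slots: Optional[List[int]] = None) -> None:
        self.size = size
        self.slots = sorted(slots or [])  # held slot numbers, best (lowest) first

    def offer(self, slot: int) -> Optional[int]:
        """Consider a newly available slot; return the slot to hang up, if any."""
        if len(self.slots) < self.size:
            bisect.insort(self.slots, slot)  # pool not full: keep it
            return None
        worst = self.slots[-1]
        if slot >= worst:
            return slot                      # no better than what we hold: decline it
        self.slots.pop()                     # close the worst connection...
        bisect.insort(self.slots, slot)      # ...and keep the better one
        return worst
```

With the slide's pool (0, 2, 12, 29, 30, 57), an offer of Slot 17 closes Slot 57, while an offer of Slot 40 is declined.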
168–169. Incoming requests: Client 2 has Slots 0, 2, and 29 ACTIVE and Slots 12, 30, and 57 idle; a GetJobs() request takes Slot 12, the lowest idle slot
170. Connections are NOT established on-demand
171–172. Requests to a busy pool: all six slots (0, 2, 12, 29, 30, 57) are ACTIVE, so GetJobs() fails with an ERROR ✘
173. Sizing the pool properly is imperative!
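A sketch of request dispatch against a fixed pool, assuming each request checks out the idle connection with the lowest slot and that an exhausted pool raises an error instead of opening a new connection. `ConnectionPool`, `checkout`, `checkin`, and `PoolExhausted` are hypothetical names.

```python
from typing import Dict, Iterable


class PoolExhausted(Exception):
    """Raised when every pooled connection is busy."""


class ConnectionPool:
    """Request dispatch: use the idle connection with the lowest slot number."""

    def __init__(self, slots: Iterable[int]) -> None:
        self._active: Dict[int, bool] = {slot: False for slot in slots}

    def checkout(self) -> int:
        idle = [slot for slot, active in self._active.items() if not active]
        if not idle:
            # Connections are not opened on demand, so a busy pool is an error.
            raise PoolExhausted("all connections are ACTIVE")
        slot = min(idle)
        self._active[slot] = True
        return slot

    def checkin(self, slot: int) -> None:
        self._active[slot] = False
```

Replaying the slides: with 0, 2, and 29 busy, the next request gets Slot 12; once all six slots are busy, the next request fails.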
174. Gist Redux
Servers assign a numeric value to each connection
Clients use the connection with the lowest numeric value to service each request
175. Balanced load is emergent behavior
176. Load Balancing Simulations
177–183. Servers A and B; Client X joins and takes Slot 0 on each server (X: 0, 0). Steady-state balance
184–194. New clients join: Client Y takes Slot 1 on each server (Y: 1, 1), then Client Z takes Slot 2 on each (Z: 2, 2). Higher slots offered along the way are declined. Steady-state balance
195–205. Server failure: Server B goes down and every client loses one connection; Server A hands out Slots 3, 4, and 5 as replacements (X: 0, 4; Y: 1, 5; Z: 2, 3). Steady-state balance
206–219. Server restored: Server B comes back; Z trades Slot 3 for B’s Slot 0 (0<3), Y trades Slot 5 for Slot 1 (1<5), and X trades Slot 4 for Slot 2 (2<4) (X: 0, 2; Y: 1, 1; Z: 0, 2). Steady-state balance
220–229. Client shutdown: Client Y leaves, freeing Slot 1 on both servers; Z trades a Slot 2 for a Slot 1 (1<2), then X does the same (1<2) (X: 0, 1; Z: 0, 1). Steady-state balance
230–234. Client rejoins: Client Y comes back and takes Slot 2 on each server (Y: 2, 2). Steady-state balance
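A toy, deterministic model of these simulations. Its rules are simplifying assumptions, not Boxcar code: each server hands out its lowest free slot, each client keeps `pool_size` connections and swaps its worst slot only for a strictly lower offer, ties go to the alphabetically first server, and clients react in a fixed order. Because the reaction order is fixed, individual slot assignments can differ from the slides; the per-server connection counts at each steady state are what the sketch reproduces.

```python
from typing import Dict, List, Set, Tuple

Conn = Tuple[str, int]  # (server name, slot number)


class Cluster:
    """Toy model of the load balancing simulations (illustrative, not Boxcar code)."""

    def __init__(self, servers: List[str], pool_size: int) -> None:
        self.used: Dict[str, Set[int]] = {s: set() for s in servers}
        self.pools: Dict[str, List[Conn]] = {}
        self.pool_size = pool_size

    def lowest_free(self, server: str) -> int:
        slot = 0
        while slot in self.used[server]:
            slot += 1
        return slot

    def settle(self) -> None:
        """Repeat greedy improvements until no client can do better."""
        changed = True
        while changed and self.used:
            changed = False
            for pool in self.pools.values():
                # Best offer across live servers; ties go to the alphabetically first name.
                slot, server = min((self.lowest_free(s), s) for s in self.used)
                if len(pool) >= self.pool_size:
                    worst = max(pool, key=lambda conn: conn[1])
                    if slot >= worst[1]:
                        continue  # nothing strictly better on offer
                    pool.remove(worst)  # hang up the worst connection
                    self.used[worst[0]].discard(worst[1])
                pool.append((server, slot))
                self.used[server].add(slot)
                changed = True

    def add_client(self, name: str) -> None:
        self.pools[name] = []
        self.settle()

    def remove_client(self, name: str) -> None:
        for server, slot in self.pools.pop(name):
            self.used[server].discard(slot)
        self.settle()

    def fail_server(self, server: str) -> None:
        del self.used[server]
        for pool in self.pools.values():
            pool[:] = [conn for conn in pool if conn[0] != server]  # connections drop
        self.settle()

    def restore_server(self, server: str) -> None:
        self.used[server] = set()
        self.settle()

    def counts(self) -> Dict[str, int]:
        return {server: len(slots) for server, slots in self.used.items()}
```

Replaying the slides' sequence (X, Y, Z join; B fails; B returns; Y leaves) gives 3/3, then 6 on A alone, then 3/3 again, then 2/2.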
235. Why does this balance?
236. Connections are like running water seeking lower ground
237–249. [Animation: connections settle into the lowest open slots across the servers] Roughly Equal Distribution
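One informal way to see why the distribution comes out roughly equal, under simplifying assumptions that are not stated in the talk: each server hands out its lowest free slot, every client's pool is full, and a client swaps its worst connection only for a strictly lower slot. In a steady state, the per-server connection counts then differ by at most one.

```latex
% n_S : number of open connections on server S
% f_S : lowest free slot on server S (the slot S would hand out next)
% w_c : highest (worst) slot held by client c
\begin{align*}
&\text{Steady state: no client can improve, so } w_c \le \min_S f_S \text{ for every client } c.\\
&\text{Always } f_S \le n_S \text{, since } n_S \text{ slots cannot fill all of } 0,\dots,n_S.\\
&\text{No holes: if } f_S < n_S, \text{ some connection on } S \text{ uses a slot } s \ge n_S > f_S,\\
&\qquad \text{whose holder } c \text{ has } w_c \ge s > f_S, \text{ so } c \text{ would swap. Hence } f_S = n_S.\\
&\text{For any } T \text{ with } n_T \ge 1, \text{ slot } n_T - 1 \text{ on } T \text{ is held by some client } c:\\
&\qquad n_T - 1 \le w_c \le \min_S f_S = \min_S n_S.\\
&\text{Therefore } \max_T n_T - \min_S n_S \le 1.
\end{align*}
```

The bound only speaks to per-server counts; it says nothing about how each client's own connections are spread, which is exactly the gap the edge cases expose.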
250. Edge cases
251–252. Client X holds Slots 0 and 1 on Server A; Client Z holds Slots 0 and 1 on Server B. Balanced but not ideal
253–255. When Client Z’s connections to Server B drop, Client Z has an EMPTY POOL! ✘ Resilient
256–257. Fix by adding entropy, aka “Table Shaking”
258–261. Table Shaking
Servers regularly hang up on connections
Clients expect failed connections
Failures are retried on new connections
Bad configurations are less likely
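A client-side sketch of “failures are retried on new connections”, assuming connections are plain callables keyed by slot number and that a server hang-up surfaces as `ConnectionError`. `call_with_retry` is a hypothetical helper; a real client would also have its background thread refill the pool.

```python
from typing import Callable, Dict, Optional

# Hypothetical: a callable that sends a request over one pooled connection.
Connection = Callable[[str], str]


def call_with_retry(pool: Dict[int, Connection], request: str, attempts: int = 3) -> str:
    """Send `request` on the lowest-slot connection; if the server has hung up,
    drop that connection and retry on the next one."""
    last_error: Optional[ConnectionError] = None
    for _ in range(attempts):
        if not pool:
            break
        slot = min(pool)
        try:
            return pool[slot](request)
        except ConnectionError as err:  # e.g. the server hung up (table shaking)
            last_error = err
            del pool[slot]               # never reuse a connection that failed
    raise ConnectionError(f"request failed after retries: {last_error}")
```

Because hang-ups are routine, the failed connection is simply discarded and the request succeeds on the next one.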
262–265. Table Shaking turns this (Client X on Server A only, Client Z on Server B only) into this: connections mixed across both servers. YAY!
266. Balancing Tricks: Handicapping
267. Handicapping is server self-quarantine
268–270. Handicapping
Exploit slot number assignment
Unhealthy servers inflate slot numbers
Clients naturally avoid these servers
271–279. [Animation: an Unhealthy server inflates its slot numbers, and connections drain toward the healthy servers] graceful degradation
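A sketch of handicapping, assuming an unhealthy server inflates its slots by a flat offset (`handicap`, a hypothetical parameter; the slides only say slot numbers are inflated). Clients keep connecting wherever the offered slot is lowest, so they avoid the handicapped server without being told it is unhealthy, yet still use it when nothing better is available.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Server:
    name: str
    handicap: int = 0  # hypothetical: slots start here while the server is unhealthy
    used: Set[int] = field(default_factory=set)

    def offer(self) -> int:
        """Lowest free slot at or above the handicap."""
        slot = self.handicap
        while slot in self.used:
            slot += 1
        return slot


def connect(servers: List[Server]) -> Server:
    """Client side: open a connection wherever the offered slot is lowest."""
    server = min(servers, key=lambda s: s.offer())
    server.used.add(server.offer())
    return server
```

With Server B handicapped by 100, ten new connections all land on healthy Server A; B still accepts a connection when it is the only server left, which is the graceful degradation the slides describe.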
280. Is Boxcar good?
281. Boxcar
✓ Simple deploys
✓ Efficient networking (Fast)
? Resilient
✓ Horizontally scalable
? Balanced
282–283. Clients are pessimistic
Failure is expected
284. Boxcar
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable
? Balanced
285. Balance Connections, Not Requests
286. Balancing Review: Naive Round Robin
287–298. [Animation: requests alternate between a Slow server and a Fast server; requests pile up on the Slow server until it can’t keep up]
299. The problem was that requests (connections) piled up
300–301. Boxcar has a fixed number of connections: there’s nothing to pile up
302–326. [Animation: a Client holds Slot 7 on a Slow Server and Slot 9 on a Fast Server; requests go to the lowest idle slot, so while Slot 7 is busy they overflow to Slot 9. Result: Slow Server 2 requests, Fast Server 4 requests]
327. Slow servers handle fewer requests
328. No overloaded servers
329. All requests are serviced
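A tick-based sketch of the Slow/Fast example, assuming one request arrives per tick, each request takes the lowest-slot idle connection, the Slow Server (Slot 7) needs 3 ticks per request, and the Fast Server (Slot 9) needs 1. `dispatch` is a hypothetical helper; with these assumed timings, six requests happen to split 2 and 4, as on the slides.

```python
from typing import Dict


def dispatch(requests: int, service_ticks: Dict[int, int]) -> Dict[int, int]:
    """Send one request per tick to the lowest-slot idle connection.

    `service_ticks` maps a slot number to how many ticks its server needs per
    request. Returns how many requests each slot handled.
    """
    busy_until = {slot: 0 for slot in service_ticks}
    handled = {slot: 0 for slot in service_ticks}
    for tick in range(requests):
        idle = [slot for slot, until in busy_until.items() if until <= tick]
        if not idle:
            raise RuntimeError(f"pool exhausted at tick {tick}")
        slot = min(idle)  # lowest slot number wins
        busy_until[slot] = tick + service_ticks[slot]
        handled[slot] += 1
    return handled
```

The Slow Server never builds a queue: while Slot 7 is busy the next request simply uses Slot 9, so slowness shows up as fewer requests handled rather than a backlog.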
330. Load balancing is probabilistic
331–332. Boxcar
✓ Simple deploys
✓ Efficient networking (Fast)
✓ Resilient
✓ Horizontally scalable
✓ Balanced
333. Good enough for Indeed
334. Services well over a BILLION requests every day
335. Fundamental technology
336. Powering over 20 different services
337. In production since 2009
338. Service Oriented Architecture
339. Q&A