Slides from my DevSecOps Days talk (09/2020) about scaling Slack during explosive growth. In this talk I shared some ideas on how Slack managed to scale its service during the explosive growth in demand we saw at the start of the COVID-19 pandemic, with a worldwide move to working from home. In the talk I covered what I meant by explosive growth, how we had invested in scalability over time across the different layers of Slack's architecture, and what we do when things go wrong.
6. IT and security · Marketing · Design · HR · Finance · File sharing · Dev tools · Communications · Analytics · Support · Productivity · Sales
Slack integrates with 1,800+ tools teams use daily
7. SOME STATS
130K+ paid customers
150+ countries
65 Fortune 100 customers
2,000+ employees
18. Client to edge (At the edge)
[Diagram: apps/chatbots and Slack clients connect over WebSocket and HTTPS to Edge POPs, which front the API]
Link: Achieving massive scale in a brave new (front-end) world
19. Message handling and caching (At the edge)
[Diagram: websocket connections terminate at an Edge POP, where a Golang edge cache serves them]
Link: Flannel: An Application-Level Edge Cache to Make Slack Scale
36. Scaling our data engines
[Diagram: Edge POPs → load balancing → API farms (A, B, C) → search engine, caches, data shards, job queue and job executors]
65B database queries per day · 9 PB of database storage
38. Incident management at Slack (in Slack)
When things go wrong
Incident bot and incident channels at the centre of our response
[Screenshot: incident channel #incd-53716, showing responders joining the channel and a coordination thread]
But we also have a couple of backup strategies if/when Slack is not available.
40. How it felt that day
When things go wrong
All Hands on Deck
41. Houston, we've got a problem.
[Diagram: the same architecture view as before, with Edge POPs, load balancing, API farms (A, B, C), search engine, caches, data shards, job queue and executors]
A problem in any of your tiers has the potential to affect your whole service
Thanks for having me! It's a pleasure to be here, and I hope everybody is doing alright in these difficult days we are living through. This is probably going to be the least security-focused talk in the conference, but I hope it is still relevant to you.
Spaniard
Moved to Melbourne 8 years ago after 1 year in freezing Cambridge in the UK. I am an open source, IT leadership and DevOps enthusiast.
I've been with Slack for a year, running the Cloud Engineering team split between Melbourne and San Francisco. I'm a co-organizer of the DevOps Girls community.
This is a made-up timeline to use as a talking point. Our capacity to scale is due to the improvements and investments we have made over the years, and those on the timeline are some examples. One thing that has helped is growing with our customers: as we took on bigger orgs, we had to scale our systems to meet their needs (number of users, messages, etc.). @channel example. Capacity planning / load testing.
Disasterpiece Theater - chaos engineering
I recommend the detailed talk from Demmer, a principal engineer, on how we've implemented some of these improvements.
Guidance from leadership: look after ourselves, keep Slack up and keep delivering value to customers
Engineering teams' focus: looking for ceilings, bottlenecks and known problems in our architecture/infrastructure/services, and figuring out where we had to scale our systems
As an example: some systems had to scale from supporting 100,000 requests per minute to sometimes double that capacity at peak
Many different techniques gave us the flexibility required to be able to react. We will cover some today.
An important component was also managing the increased cost of our infrastructure and how to optimize it
#help-cloud-econ
Even if Lego pieces are quite cool, I'll be using a super-simplified version of a system architecture view to walk through the different scalability improvements we applied over time
Due to time constraints on my talk a lot of the slides will have a link that expands on the particular topics for those interested
All the engineering stats presented are based on figures of April 2020
We have different types of clients: apps / chatbots (our own and third parties'), the Slack client for laptops and mobile, and web clients.
We deploy our own Edge POPs (points of presence) to terminate those connections closer to our customers. We have Edge POPs in most AWS regions around the world. There are two ways of connecting: WebSockets and HTTPS.
We also do a lot of processing at the edge, making things faster for users but also offloading a lot of the work from the main APIs.
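As a hedged sketch of this connection model, here is roughly what a client opening a persistent websocket to an edge endpoint could look like in Go; the URL and the gorilla/websocket library are illustrative assumptions, not Slack's actual client code.

```go
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// TLS-terminated websocket ("wss") against the nearest edge POP.
	// The hostname is hypothetical.
	conn, _, err := websocket.DefaultDialer.Dial("wss://edge.example.com/ws", nil)
	if err != nil {
		log.Fatal("dial:", err)
	}
	defer conn.Close()

	// Real-time events arrive pushed over the persistent connection.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal("read:", err)
		}
		log.Printf("event: %s", msg)
	}
}
```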
For example, we do a lot of the message massaging/processing/enrichment at the edge. A previous version of our edge stack was composed of a series of Java processes handling real-time messaging. In order to scale, an initial process that handles websocket connections and caching, called Flannel, was created in Go.
A good example: when a user first connects to a particular team in the morning, we preload a lot of information.
There is information required for almost any message transaction that is loaded from cache close to 100% of the time (for example, team_ids).
We can see in this slide how one super common action benefits from caching channels and users when we start typing in the quick-switcher.
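To make the caching idea concrete, here is a minimal sketch, in the spirit of Flannel, of a per-team in-memory edge cache that preloads users and channels and answers quick-switcher prefix queries locally; all type and function names are illustrative, not Flannel's actual code.

```go
package edgecache

import (
	"strings"
	"sync"
)

// Entry is a cached user or channel (illustrative shape).
type Entry struct {
	ID   string
	Name string
}

// TeamCache holds one team's users and channels in memory at the edge.
type TeamCache struct {
	mu       sync.RWMutex
	users    []Entry
	channels []Entry
}

// Preload warms the cache when the first user of a team connects,
// so later lookups never have to go back to the main API.
func (c *TeamCache) Preload(users, channels []Entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.users, c.channels = users, channels
}

// QuickSwitch answers the prefix queries the quick-switcher issues
// as you type, straight from the cache.
func (c *TeamCache) QuickSwitch(prefix string) []Entry {
	c.mu.RLock()
	defer c.mu.RUnlock()
	prefix = strings.ToLower(prefix)
	var out []Entry
	for _, list := range [][]Entry{c.channels, c.users} {
		for _, e := range list {
			if strings.HasPrefix(strings.ToLower(e.Name), prefix) {
				out = append(out, e)
			}
		}
	}
	return out
}
```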
Another example is file upload processing. Instead of doing this centrally, we delegate it to our Edge POPs.
The upload happens faster as it's closer to the user, and we can do all the image processing there (metadata additions, security checks, setting up permissions, etc.).
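A hedged sketch of what edge-side upload handling could look like in Go; the endpoint, the form field name and the choice of a checksum as the example processing step are illustrative assumptions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"net/http"
)

// handleUpload accepts the file at the POP nearest the user, so work
// like checksumming and metadata extraction happens here rather than
// centrally.
func handleUpload(w http.ResponseWriter, r *http.Request) {
	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, "bad upload", http.StatusBadRequest)
		return
	}
	defer file.Close()

	// Edge-local processing: hash the content as it streams in.
	h := sha256.New()
	size, err := io.Copy(h, file)
	if err != nil {
		http.Error(w, "read failed", http.StatusInternalServerError)
		return
	}

	// Security checks and permission setup would also run here before
	// the object is handed off to central storage (omitted in this sketch).
	fmt.Fprintf(w, "stored %s (%d bytes, sha256=%x)\n", header.Filename, size, h.Sum(nil))
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```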
Apps, chatbots and sometimes even normal clients may have the capability of overloading your API.
Two techniques can be used here to control this behaviour:
Rate limiting
Client in degraded mode
We can also implement rate limiting between services using native language capabilities or more recently Envoy
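Rate limiting is the easier of the two techniques to sketch. Below is a minimal per-client token bucket in Go using golang.org/x/time/rate; the limits and keying on the Authorization header are illustrative assumptions, not Slack's actual policy.

```go
package middleware

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// limiterFor returns the token bucket for a given client key, creating
// it on first use: 10 requests/second with bursts of up to 20.
func limiterFor(client string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[client]
	if !ok {
		l = rate.NewLimiter(10, 20)
		limiters[client] = l
	}
	return l
}

// RateLimited wraps an API handler and rejects callers over budget;
// a 429 tells well-behaved apps and chatbots to back off.
func RateLimited(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.Header.Get("Authorization")).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```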
A good portion of the Slack APIs were originally written in PHP.
At some point these were migrated to a combination of:
Hack: a strongly typed version of PHP
HHVM: a high-performance execution engine for Hack
Both were originally developed by Facebook and then open-sourced.
One of the motivations to move to Hack was developer productivity.
We can divide our API into different farms depending on the tasks they perform, which allows us to adapt the underlying infrastructure.
For example:
API for collecting statistics
For background jobs
This is the current implementation.
We are working on simplifying some of the components and using SQS.
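As a hedged illustration of the SQS direction, enqueueing a background job could look roughly like this with the AWS SDK for Go; the queue URL and payload shape are assumptions for the example.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := sqs.New(sess)

	// Job executors on the other side poll the queue and run the work.
	// Queue URL and message body are hypothetical.
	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/jobs"),
		MessageBody: aws.String(`{"type":"send_notification","user_id":"U123"}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```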
Originally developed by YouTube and open-sourced, Vitess provides a sharding and topology management solution that sits on top of MySQL. You don't need to modify your applications, as they connect to VTGates as if they were accessing a single database. In the background it allows us to shard data not only by team but also by different tables (user, channel, etc.), regardless of the database servers serving them. We moved from master-master to master-slave and managed failover.
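A small sketch of what "no application changes" means in practice: the application talks to VTGate with a standard MySQL driver, as if it were a single database, and Vitess routes the query to the right shard behind the scenes. The host, keyspace and schema below are illustrative.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Same DSN format as any MySQL connection, but pointed at VTGate
	// (hypothetical host and keyspace).
	db, err := sql.Open("mysql", "app@tcp(vtgate.internal:3306)/slack")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Plain SQL; sharding by team/user/channel happens behind VTGate.
	var name string
	err = db.QueryRow(
		"SELECT name FROM channels WHERE team_id = ? AND id = ?",
		"T123", "C456",
	).Scan(&name)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("channel:", name)
}
```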
----
Kubernetes
In a nutshell
Anyone can trigger an incident using the /assemble command (a sketch of how such a command can be wired up follows at the end of this section)
We have a 24x7 rotation of incident commanders and Customer Experience liaisons
When paged by /assemble, they join the channel where someone called /assemble, and a video call is set up for rapid categorization
Depending on severity, we use the incident bot to create a new incident channel and, for high-severity incidents, notify the relevant parties
Required responders and SMEs are added to the channel with the /escalate command
A lot of the context of how we responded lives in the channel, and it's super valuable for extracting lessons from the incident
Automated processes to create CAN reports (Conditions, Actions, Needs) and a Jira ticket to coordinate
The incident commander makes sure that once we are All Clear, someone is assigned to run an incident review
If Slack is not available for response: a separate Slack workspace, Zoom, Google Group
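For the curious, here is a hedged sketch of how a slash command like /assemble could be wired up. Slack posts form-encoded fields such as command, user_id and channel_id to a configured URL; everything beyond parsing those fields (the paging and channel-creation steps) is a hypothetical placeholder, not our actual incident bot.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// assembleHandler receives the slash-command POST from Slack and kicks
// off the incident workflow.
func assembleHandler(w http.ResponseWriter, r *http.Request) {
	if r.FormValue("command") != "/assemble" {
		http.Error(w, "unknown command", http.StatusBadRequest)
		return
	}
	channel := r.FormValue("channel_id")
	user := r.FormValue("user_id")

	// Hypothetical next steps: page the on-call incident commander and
	// CE liaison, then set up a video call for rapid categorization.
	log.Printf("incident raised by %s in %s", user, channel)

	// Immediate in-channel acknowledgement back to the caller.
	fmt.Fprint(w, "Assembling responders, hang tight.")
}

func main() {
	http.HandleFunc("/slash/assemble", assembleHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```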