Slides from my DevSecOps Days talk (09/2020) about scaling Slack during explosive growth. In this talk I shared some ideas on how Slack managed to scale its service during the explosive growth in demand we saw at the start of the COVID-19 pandemic, with a worldwide move to working from home. In the talk I covered what I meant by explosive growth, how we had invested in scalability over time across the different layers of Slack's architecture, and what we do when things go wrong.
6. IT and security · Marketing · Design · HR · Finance · File sharing · Dev tools · Communications · Analytics · Support · Productivity · Sales
Slack integrates with 1,800+ tools teams use daily
7. SOME STATS
130K+ paid customers
150+ countries
65 Fortune 100 customers
2,000+ employees
18. Client to edge (At the edge)
[Diagram: apps/chatbots and Slack clients connect over WebSocket and HTTPS to Edge POPs, which front the API]
Link: Achieving massive scale in a brave new (front-end) world
19. Message handling and caching (At the edge)
[Diagram: websocket connections terminate at an Edge POP, where a Golang edge cache serves them]
Link: Flannel: An Application-Level Edge Cache to Make Slack Scale
36. Scaling our data engines
[Diagram: Edge POPs → load balancing → API farms (A, B, C) → search engine, caches, data shards, job queue and job executors]
65B database queries per day · 9 PB of database storage
38. Incident management at Slack (in Slack)
When things go wrong
Incident bot and incident channels at the centre of our response
[Screenshot: incident channel #incd-53716, showing responders joining the channel and a coordination thread]
But we also have a couple of backup strategies if/when Slack is not available.
40. How it felt that day
When things go wrong
All Hands on Deck
41. Houston, we've got a problem.
[Diagram: the same architecture view as before, with Edge POPs, load balancing, API farms (A, B, C), search engine, caches, data shards, job queue and executors]
A problem in any of your tiers has the potential to affect your whole service
Thanks for having me! It's a pleasure to be here, and I hope everybody is doing alright in these difficult days we are living through. This is probably going to be the least security-focused talk in the conference, but I hope it is still relevant to you.
Spaniard
Moved to Melbourne 8 years ago after 1 year in freezing Cambridge in the UK. I am an open source, IT leadership and DevOps enthusiast.
I've been with Slack for a year, running the Cloud Engineering team split between Melbourne and San Francisco. I'm a co-organizer of the DevOps Girls community.
This is a made-up timeline to use as a talking point. Our capacity to scale is due to the improvements and investments we have made over the years, and those on the timeline are some examples. One thing that has helped is growing with our customers: as we took on bigger orgs, we had to scale our systems to meet their needs (number of users, messages, etc.). @channel example. Capacity planning / load testing.
Disasterpiece Theater - chaos engineering
I recommend the detailed talk from Demmer, a principal engineer, on how we've implemented some of these improvements.
Guidance from leadership: look after ourselves, keep Slack up and keep delivering value to customers
Engineering teams' focus: looking for ceilings, bottlenecks and known problems in our architecture/infrastructure/services, and figuring out where we had to scale our systems
As an example: some systems had to scale from supporting 100,000 requests per minute to sometimes double that capacity at peak
Many different techniques gave us the flexibility required to be able to react. We will cover some today.
An important component was also managing the increased cost of our infrastructure and how to optimize it
#help-cloud-econ
Even if Lego pieces are quite cool, I'll be using a super-simplified version of a system architecture view to walk through the different scalability improvements we applied over time
Due to time constraints on my talk a lot of the slides will have a link that expands on the particular topics for those interested
All the engineering stats presented are based on figures of April 2020
We have different types of clients: apps / chatbots (our own and third parties'), the Slack client for laptops and mobile, and web clients.
We deploy our own Edge POPs (points of presence) to terminate those connections closer to our customers. We have Edge POPs in most AWS regions around the world. There are two ways of connecting: WebSockets and HTTPS.
We also do a lot of processing at the edge, making things faster for users but also offloading a lot of the work from the main APIs.
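As a hedged sketch of this connection model, here is roughly what a client opening a persistent websocket to an edge endpoint could look like in Go; the URL and the gorilla/websocket library are illustrative assumptions, not Slack's actual client code.

```go
package main

import (
	"log"

	"github.com/gorilla/websocket"
)

func main() {
	// TLS-terminated websocket ("wss") against the nearest edge POP.
	// The hostname is hypothetical.
	conn, _, err := websocket.DefaultDialer.Dial("wss://edge.example.com/ws", nil)
	if err != nil {
		log.Fatal("dial:", err)
	}
	defer conn.Close()

	// Real-time events arrive pushed over the persistent connection.
	for {
		_, msg, err := conn.ReadMessage()
		if err != nil {
			log.Fatal("read:", err)
		}
		log.Printf("event: %s", msg)
	}
}
```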
For example, we do a lot of the message massaging/processing/enrichment at the edge. A previous version of our edge stack was composed of a series of Java processes handling real-time messaging. In order to scale, an initial process that handles websocket connections and caching, called Flannel, was created in Go.
A good example: when a user first connects to a particular team in the morning, we preload a lot of information.
There is information required for almost any message transaction that is loaded from cache close to 100% of the time (for example, team_ids).
We can see in this slide how one super common action benefits from caching channels and users when we start typing in the quick-switcher.
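To make the caching idea concrete, here is a minimal sketch, in the spirit of Flannel, of a per-team in-memory edge cache that preloads users and channels and answers quick-switcher prefix queries locally; all type and function names are illustrative, not Flannel's actual code.

```go
package edgecache

import (
	"strings"
	"sync"
)

// Entry is a cached user or channel (illustrative shape).
type Entry struct {
	ID   string
	Name string
}

// TeamCache holds one team's users and channels in memory at the edge.
type TeamCache struct {
	mu       sync.RWMutex
	users    []Entry
	channels []Entry
}

// Preload warms the cache when the first user of a team connects,
// so later lookups never have to go back to the main API.
func (c *TeamCache) Preload(users, channels []Entry) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.users, c.channels = users, channels
}

// QuickSwitch answers the prefix queries the quick-switcher issues
// as you type, straight from the cache.
func (c *TeamCache) QuickSwitch(prefix string) []Entry {
	c.mu.RLock()
	defer c.mu.RUnlock()
	prefix = strings.ToLower(prefix)
	var out []Entry
	for _, list := range [][]Entry{c.channels, c.users} {
		for _, e := range list {
			if strings.HasPrefix(strings.ToLower(e.Name), prefix) {
				out = append(out, e)
			}
		}
	}
	return out
}
```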
Another example is file upload processing. Instead of doing this centrally, we delegate it to our Edge POPs.
The upload happens faster as it's closer to the user, and we can do all the image processing there (metadata additions, security checks, setting up permissions, etc.).
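A hedged sketch of what edge-side upload handling could look like in Go; the endpoint, the form field name and the choice of a checksum as the example processing step are illustrative assumptions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"log"
	"net/http"
)

// handleUpload accepts the file at the POP nearest the user, so work
// like checksumming and metadata extraction happens here rather than
// centrally.
func handleUpload(w http.ResponseWriter, r *http.Request) {
	file, header, err := r.FormFile("file")
	if err != nil {
		http.Error(w, "bad upload", http.StatusBadRequest)
		return
	}
	defer file.Close()

	// Edge-local processing: hash the content as it streams in.
	h := sha256.New()
	size, err := io.Copy(h, file)
	if err != nil {
		http.Error(w, "read failed", http.StatusInternalServerError)
		return
	}

	// Security checks and permission setup would also run here before
	// the object is handed off to central storage (omitted in this sketch).
	fmt.Fprintf(w, "stored %s (%d bytes, sha256=%x)\n", header.Filename, size, h.Sum(nil))
}

func main() {
	http.HandleFunc("/upload", handleUpload)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```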
Apps, chatbots and sometimes even normal clients may have the capability of overloading your API.
Two techniques can be used here to control this behaviour:
Rate limiting
Client in degraded mode
We can also implement rate limiting between services using native language capabilities or more recently Envoy
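Rate limiting is the easier of the two techniques to sketch. Below is a minimal per-client token bucket in Go using golang.org/x/time/rate; the limits and keying on the Authorization header are illustrative assumptions, not Slack's actual policy.

```go
package middleware

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{}
)

// limiterFor returns the token bucket for a given client key, creating
// it on first use: 10 requests/second with bursts of up to 20.
func limiterFor(client string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[client]
	if !ok {
		l = rate.NewLimiter(10, 20)
		limiters[client] = l
	}
	return l
}

// RateLimited wraps an API handler and rejects callers over budget;
// a 429 tells well-behaved apps and chatbots to back off.
func RateLimited(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiterFor(r.Header.Get("Authorization")).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```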
A good portion of the Slack APIs were originally written in PHP.
At some point these were migrated to a combination of:
Hack: a strongly typed version of PHP
HHVM: a high-performance execution engine for Hack
Both were originally developed by Facebook and then open-sourced.
One of the motivations to move to Hack was developer productivity.
We can divide our API into different farms depending on the tasks they perform, which allows us to adapt the underlying infrastructure.
For example:
API for collecting statistics
For background jobs
This is the current implementation.
We are working on simplifying some of the components and using SQS.
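As a hedged illustration of the SQS direction, enqueueing a background job could look roughly like this with the AWS SDK for Go; the queue URL and payload shape are assumptions for the example.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := sqs.New(sess)

	// Job executors on the other side poll the queue and run the work.
	// Queue URL and message body are hypothetical.
	_, err := svc.SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String("https://sqs.us-east-1.amazonaws.com/123456789012/jobs"),
		MessageBody: aws.String(`{"type":"send_notification","user_id":"U123"}`),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```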
Originally developed by YouTube and open-sourced, Vitess provides a sharding and topology management solution that sits on top of MySQL. You don't need to modify your applications, as they connect to VTGates as if they were accessing a single database. In the background it allows us to shard data not only by team but also by different tables (user, channel, etc.), regardless of the database servers serving them. We moved from master-master to master-slave and managed failover.
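A small sketch of what "no application changes" means in practice: the application talks to VTGate with a standard MySQL driver, as if it were a single database, and Vitess routes the query to the right shard behind the scenes. The host, keyspace and schema below are illustrative.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Same DSN format as any MySQL connection, but pointed at VTGate
	// (hypothetical host and keyspace).
	db, err := sql.Open("mysql", "app@tcp(vtgate.internal:3306)/slack")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Plain SQL; sharding by team/user/channel happens behind VTGate.
	var name string
	err = db.QueryRow(
		"SELECT name FROM channels WHERE team_id = ? AND id = ?",
		"T123", "C456",
	).Scan(&name)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("channel:", name)
}
```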
----
Kubernetes
In a nutshell
Anyone can trigger an incident using the /assemble command (a sketch of how such a command can be wired up follows at the end of this section)
We have a 24x7 rotation of incident commanders and Customer Experience liaisons
When paged by /assemble, they join the channel where someone called /assemble, and a video call is set up for rapid categorization
Depending on severity, we use the incident bot to create a new incident channel and, for high-severity incidents, notify the relevant parties
Required responders and SMEs are added to the channel with the /escalate command
A lot of the context of how we responded lives in the channel, and it's super valuable for extracting lessons from the incident
Automated processes to create CAN reports (Conditions, Actions, Needs) and a Jira ticket to coordinate
The incident commander makes sure that once we are All Clear, someone is assigned to run an incident review
If Slack is not available for response: a separate Slack workspace, Zoom, Google Group
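For the curious, here is a hedged sketch of how a slash command like /assemble could be wired up. Slack posts form-encoded fields such as command, user_id and channel_id to a configured URL; everything beyond parsing those fields (the paging and channel-creation steps) is a hypothetical placeholder, not our actual incident bot.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// assembleHandler receives the slash-command POST from Slack and kicks
// off the incident workflow.
func assembleHandler(w http.ResponseWriter, r *http.Request) {
	if r.FormValue("command") != "/assemble" {
		http.Error(w, "unknown command", http.StatusBadRequest)
		return
	}
	channel := r.FormValue("channel_id")
	user := r.FormValue("user_id")

	// Hypothetical next steps: page the on-call incident commander and
	// CE liaison, then set up a video call for rapid categorization.
	log.Printf("incident raised by %s in %s", user, channel)

	// Immediate in-channel acknowledgement back to the caller.
	fmt.Fprint(w, "Assembling responders, hang tight.")
}

func main() {
	http.HandleFunc("/slash/assemble", assembleHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```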