How Slack Works
Keith Adams
kma@slack-corp.com @keithmadams facebook.com/kma

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2hoevvV.

Keith Adams takes a tour of Slack's infrastructure, from native and web clients, through the edge, into the Slack datacenter, and around the various services that provide real-time messaging, search, voice calls, and custom emoji. Filmed at qconsf.com.

Keith Adams is Chief Architect at Slack. Before Slack, he was an engineer at Facebook, where he contributed to search infrastructure, led work on the HipHop Virtual Machine, and helped start Facebook AI Research. He was also an early engineer at VMware.

Published in: Technology

How Slack Works

  1. 1. How Slack Works Keith Adams kma@slack-corp.com @keithmadams facebook.com/kma
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/slack-infrastructure
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. What is Slack?
  5. 5. What is Slack? Voice Calls! Platform! Something about Bots!!
  6. 6. Persistent Group Messaging But first it was a Service
  7. 7. In this talk ● How Slack works today ➞ Application logic ➞ Persistence ➞ Real-time messaging ➞ Deferring work for later ● Problems ● What we’re doing about them
  8. 8. Also in this talk ● Flaws ● Challenges ● Mistakes ● Dead-ends ● Future directions
  9. 9. Slack Scale ● 4M DAU, 5.8M WAU Peak simultaneous connected: 2.5M ● > 2H / weekday for each active user > 10H / weekday connected ● Half of DAU outside US
  10. 10. Slack House Style ● Conservative technical taste Most supporting technologies are >10 years old ● Willing to write a little code Choose low coupling, fitness-to-purpose over DRY ● Minimalism Choose something we already operate over something new and tailor-made Shallow, transparent stack of abstractions
  11. 11. Cartoon Architecture of Slack MySQL Job Queue Message Server WebApp
  12. 12. Case Study: Login and Receive Messages slack.com POST /api/rtm.start?token=xoxo--&...
  13. 13. Slack’s webapp codebase ● PHP monolith of app logic <1MLoC ● Scaled-out LAMP stack app Memcache wrapped around sharded MySQL ● Recently migrated to HHVM Performance, hacklang
  14. 14. World’s shortest PHP-at-Slack FAQ ● Q: I hear/believe/have experienced PHP to be terrible. A: It sort of is, but it also works well. ● Q: I’m skeptical. A: You’re in good company! Check out this blog post. But we should probably get on with the talk at hand ... ● Q: Sounds good. A: Right-o.
  15. 15. Login and Receive Messages: the “mains” slack.com main0 main1 SELECT db_shard FROM teams WHERE domain = %domain
  16. 16. Login and Receive Messages: the shards slack.com main0 main1 main0 main1 main0 main1 main0 main1 Shard123 a Shard123 b SELECT * FROM channels WHERE team_id = 711 ...
  17. 17. MySQL Shards ● Source of truth for most customer data Teams, users, channels, messages, comments, emoji, ... ● Replication across two DCs Available for 1-DC failure ● Sharded by team For performance, fault isolation, and scalability
  18. 18. Why MySQL? ● Many, many thousands of server-years of working ● The relational model is a good discipline ● Experience ● Tooling Not because of ACID, though
  19. 19. Master-Master Replication www1 Shard123 a Shard123 b www17
  20. 20. MMR Complications ● Choosing A in CAP terms ● Conflicts are possible ➞ Most resolved automatically ➞ Some manually, by operator action(!) ● INSERT ON DUPLICATE KEY UPDATE … ● Partitioning by team saves us ➞ Team writes cannot overlap ➞ Even teams use “left” head, odd teams use “right” head
  21. 21. Case Study: Login and Receive Messages slack.com { "ok": true, "url": "wss://ms9.slack-msgs.com/websocket/7I5yBpcvk", … }
  22. 22. Rtm.start payload ● Rtm.start returns an image of the whole team ● Architecture of clients ➞ Eventually consistent snapshot of whole team ➞ Updates trickle in through the web socket ● Guarantees responsive clients ● ...once connection is established
  23. 23. Cartoon Architecture of Slack MySQL Job Queue Message Server WebApp
  24. 24. Persist, broadcast messages Message Delivery Message Server WebApp
  25. 25. Wrinkles in Message Server ● Race between rtm.start and connection to MS ➞ Event log mechanism ● Glitches, delays, net partitions while persisting ➞ In-memory queue of pending sends ➞ Queue depth sensitive barometer of system health ● Most messages are presence
  26. 26. Link unfurling Deferring Work Search indexing Exports/Imports Job Queue (Redis) WebApp Job Workers
  27. 27. Putting it all together mains shards Message Server WebApp
  28. 28. Things missing from the cartoon ● Memcache wrapped around many DB accesses ➞ Case-by-case ➞ Manual ● Computed data service (CDS) ➞ Provides ML models via Thrift interface ● Rate-limiting around critical services ● Search! ➞ Solr ➞ Team-partitioned ➞ fed from job queue workers
  29. 29. Slack Today: The Good Parts ● Team-partitioning ➞ Easy scaling to lots of teams ➞ Isolates failures and perf problems ➞ Makes customer complaints easy to field ➞ Natural fit for a paid product ● Per-team Message Server ➞ Low-latency broadcasts
  30. 30. Some Hard Cases
  31. 31. Hard scenarios ● Mains failures ● Rtm.start on large teams ● Mass reconnects
  32. 32. Mains failure ● 1 master fails, partner takes over ● If both fail? ➞ Many users can proceed via memcache ➞ For the rest Slack is down ➞ Quite possible if failure was load-induced
  33. 33. Rtm.start for large teams ● Returns image of entire team ● Channel membership is O(n²) for n users
  34. 34. Mass reconnects ● A large team loses, then regains, office Internet connectivity ● n users perform O(n²) rtm.start operations ● Can ‘melt’ the team shard
  35. 35. What are we going to do about it?
  36. 36. Scale-out mains ● Replace mains spof ● With what? We’re not sure yet ● Kicking the tires carefully on a scary change
  37. 37. Rtm.start for large teams ● Incremental work ➞ Current p95,p99: 221ms, 660ms ● Core problem: channel membership is O(n²) ● Change APIs so clients can load channel members lazily ● Much harder than it sounds!
  38. 38. Mass reconnects ● Introducing flannel ● Application-level edge cache
  39. 39. Pre-Flannel Message Delivery Message Server WebApp
  40. 40. Message Server
  41. 41. Flannel status ● On for a few teams ● Rolling out to you soon with any luck
  42. 42. Phew
  43. 43. Stuff I had to leave out ● Lots of client tech! ● Voice ● Backups ● Data warehouse ● Search ● Deploying code ● Monitoring and alerting
  44. 44. Wrapping up ● Sketch of how Slack works ➞ Application Logic ➞ Persistence ➞ Real-time messaging ➞ Asynchronous Work ● Problems ● What we’re doing about them
  45. 45. There is a lot left to do slack.com/jobs
  46. 46. ...
  47. 47. Deployable Message Server ● Channel-sharded message bus ● Flannel discovers Channel servers via Consul ➞ Scatters user writes ➞ Gathers channel reads ● Failures do not need reconnects
  48. 48. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/slack-infrastructure
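
Slides 15, 16, and 20 describe a two-step lookup: the "mains" map a team domain to a shard, and team-ID parity then picks which head of the master-master pair takes that team's writes. The following is a minimal Python sketch of that routing, not Slack's code. The team rows, the host-name format, and the mapping of the "left" head to `a` and the "right" head to `b` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamRow:
    team_id: int
    db_shard: int


# Hypothetical stand-in for the "mains" teams table; the rows are invented.
MAINS_TEAMS = {
    "acme": TeamRow(team_id=711, db_shard=123),
    "globex": TeamRow(team_id=712, db_shard=123),
}


def lookup_team(domain: str) -> TeamRow:
    """Mirror of: SELECT db_shard FROM teams WHERE domain = %domain."""
    return MAINS_TEAMS[domain]


def preferred_head(team_id: int) -> str:
    """Even teams use the 'left' head, odd teams the 'right' head.

    Assumption: 'left' is replica 'a' and 'right' is replica 'b'.
    """
    return "a" if team_id % 2 == 0 else "b"


def route(domain: str) -> str:
    """Return the (hypothetical) host name that should take this team's writes."""
    row = lookup_team(domain)
    return f"shard{row.db_shard}{preferred_head(row.team_id)}"


print(route("acme"))    # team 711 is odd, so the 'b' head
print(route("globex"))  # team 712 is even, so the 'a' head
```

Because every write for a given team lands on one head, two teams sharing a shard cannot conflict with each other, which matches the slide's point that partitioning by team is what makes master-master replication manageable.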

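
Slides 33 and 37 state that channel membership in the rtm.start payload is O(n²) for n users. One way to see why, under a deliberately crude model: if both the number of channels and the typical channel size grow in proportion to team size, the count of (channel, member) pairs grows with the square of the team size. The proportions below (one channel per two users, each channel holding a tenth of the team) are invented for illustration and do not come from the talk.

```python
def membership_entries(n_users: int) -> int:
    """Count (channel, member) pairs in a team snapshot under a toy model.

    Assumptions (hypothetical): one channel per two users, and each
    channel contains one tenth of the team.
    """
    channels = n_users // 2
    members_per_channel = n_users // 10
    return channels * members_per_channel


for n in (1_000, 2_000, 4_000):
    print(n, membership_entries(n))
```

In this model, doubling the team size quadruples the membership data that a single rtm.start has to return, which is the motivation the slides give for letting clients load channel members lazily.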
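
Slides 34 and 38 pair the mass-reconnect problem with Flannel, described only as an "application-level edge cache". The slides do not say how Flannel is built, so the sketch below is a generic read-through cache with a time-to-live, shown only to illustrate why an edge cache helps when many clients from one team ask for the same snapshot at once. `EdgeCache`, `fetch_team_snapshot`, and the snapshot shape are hypothetical names chosen for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class EdgeCache:
    """Generic read-through cache keyed by team ID (not Flannel's actual design)."""

    def __init__(
        self,
        fetch: Callable[[int], Any],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[int, Tuple[float, Any]] = {}

    def get(self, team_id: int) -> Any:
        now = self._clock()
        hit = self._entries.get(team_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served at the edge; the backend is not touched
        value = self._fetch(team_id)  # cache miss: one trip to the backend
        self._entries[team_id] = (now, value)
        return value


backend_calls = []


def fetch_team_snapshot(team_id: int) -> dict:
    """Hypothetical stand-in for the expensive backend snapshot query."""
    backend_calls.append(team_id)
    return {"team_id": team_id, "channels": ["general"]}


cache = EdgeCache(fetch_team_snapshot)
# Simulate 1,000 clients from team 711 reconnecting at once.
snapshots = [cache.get(711) for _ in range(1000)]
print(len(backend_calls))
```

In this toy setup, the thousand reconnecting clients trigger a single backend fetch, so the team shard sees one snapshot query rather than one per client. A real edge cache would also need to keep its snapshot current as updates arrive, which this sketch leaves out.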