Scaling Twitter To Go After
the Fail Whale
Jonathan Reichhold - Twitter Engineering
Early Twitter....
2010 World Cup Challenge
• Tweet and user requests growing
exponentially (good problem)
Load....
Monolithic Architecture
• Ruby on Rails
• Temporally-sharded MySQL
• Memcached
• ~60 engineers
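A minimal sketch of what "temporally sharded" routing can look like, assuming each shard owns one contiguous time range. The shard names and boundaries below are hypothetical and are not taken from the talk.

```python
from bisect import bisect_right
from datetime import datetime, timezone

# Hypothetical layout: each shard holds rows created on or after its start
# time and before the start time of the next shard.
SHARD_STARTS = [
    datetime(2009, 1, 1, tzinfo=timezone.utc),
    datetime(2009, 7, 1, tzinfo=timezone.utc),
    datetime(2010, 1, 1, tzinfo=timezone.utc),
]
SHARD_NAMES = ["tweets_2009h1", "tweets_2009h2", "tweets_2010h1"]


def shard_for(created_at: datetime) -> str:
    """Return the name of the shard whose time range contains created_at."""
    index = bisect_right(SHARD_STARTS, created_at) - 1
    if index < 0:
        raise ValueError("timestamp predates the oldest shard")
    return SHARD_NAMES[index]
```

Under this scheme every new row carries the latest timestamp, so all fresh writes land on the newest shard while older shards sit mostly idle.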
Stabilize & Understand
• Learn & make improvements
• Don’t just survive
Be Realistic & Ambitious
• Prioritize what can be fixed and timeframes
for doing it
• Sometimes need the duct tape
• Find patterns and improvements for the long term
A Bad Approach
• Flip switches/branches/other until fixed
http://www.flickr.com/photos/chrism70/1144424032
Science
Step 1: Trustworthy Data
• https://blog.twitter.com/2013/observability-at-twitter
Step 2: Set Expectations
• Being on-call is a job and during high stress
will burn folks out
• Maintain calm and order
Post Mortems
• Improvement becomes part of process
• Stress makes system stronger not weaker
Teamwork
• All of this made possible by amazing team
and management
• Culture
Capacity Planning &
Forecast
• Just in time but realistic
• Figure out real buffers
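One way to make "real buffers" concrete is to size a fleet so that each host keeps a fixed share of its capacity in reserve at forecast peak. This is only a sketch: the function name, its parameters, and the numbers in the usage note are illustrative assumptions, not figures from the talk.

```python
import math


def hosts_needed(peak_rps: float, per_host_rps: float, buffer_fraction: float) -> int:
    """Hosts required to serve peak_rps while leaving buffer_fraction of
    every host's capacity unused as headroom."""
    if not 0 <= buffer_fraction < 1:
        raise ValueError("buffer_fraction must be in [0, 1)")
    usable_rps = per_host_rps * (1 - buffer_fraction)
    return math.ceil(peak_rps / usable_rps)
```

For example, `hosts_needed(100_000, 250, 0.3)` reserves 30% of each host and asks for 572 hosts, compared with 400 hosts when no buffer is kept.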
Longer Term Changes
• Architecture changes take time and
changes in organization
Improve Efficiency
• Rails/Ruby -> Scala & JVM
• 200-300 RPS -> 10,000-20,000
• Single process per request -> Finagle
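The throughput figures above imply a range for the improvement factor. The arithmetic below assumes both ranges describe the same unit of capacity (for example, one host), which the slide does not state.

```python
# Throughput ranges quoted on the slide, in requests per second.
OLD_RPS = (200, 300)
NEW_RPS = (10_000, 20_000)

# Most conservative reading: best old figure against worst new figure.
lowest_gain = NEW_RPS[0] / OLD_RPS[1]

# Most optimistic reading: worst old figure against best new figure.
highest_gain = NEW_RPS[1] / OLD_RPS[0]

print(f"improvement between {lowest_gain:.0f}x and {highest_gain:.0f}x")
```

So the move corresponds to roughly a 33x to 100x gain in requests served per unit of capacity.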
Service Orientation
• Make changes at interface boundary, not in single monolith
• Team interactions simplified
• Core nouns and verbs
Move out of public
cloud
• Flexibility and latency demand at some
point
• Hard problem
• Datacenter as failure domain
• Mesos
Dynamic Configuration
• Update routes and compare live vs
dark/new
• Quickly adjust to issues
• Faster and less fragile deploys
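Comparing live against dark/new traffic can be sketched as follows: every request is served by the live path, the same request is also sent to the new path, and only the differences are recorded. All names here are hypothetical, and this is not a description of Twitter's actual routing layer.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]
Mismatch = Tuple[str, str, str]  # (request, live response, dark response)


def shadow_compare(
    request: str, live: Handler, dark: Handler, mismatches: List[Mismatch]
) -> str:
    """Serve the request from the live handler and record any difference
    from the dark handler. The dark result never reaches the caller."""
    live_response = live(request)
    try:
        dark_response = dark(request)
    except Exception as error:  # a failing dark path must not hurt users
        mismatches.append((request, live_response, f"error: {error!r}"))
        return live_response
    if dark_response != live_response:
        mismatches.append((request, live_response, dark_response))
    return live_response
```

Because callers only ever see the live response, the new route can be exercised with production traffic and promoted once the mismatch list stays empty.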
Improve storage
• Gizzard for MySQL
• Improve Memcached
• Storage as a service
• Snowflake IDs
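A minimal sketch of Snowflake-style ID generation, assuming the commonly described 64-bit layout: a millisecond timestamp in the high bits, a 10-bit worker ID, and a 12-bit per-millisecond sequence. The class name, the injectable clock, and the epoch constant are assumptions made for illustration.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1288834974657  # assumed custom epoch, in milliseconds
WORKER_BITS = 10
SEQUENCE_BITS = 12
SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1


class SnowflakeSketch:
    def __init__(self, worker_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock  # returns the current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # Sequence exhausted; wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - CUSTOM_EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )
```

IDs built this way need no coordination between workers and sort roughly by creation time, which removes the dependency on a single database auto-increment column.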
Development Speed
• Startups live and die by development speed
• Make easier to ship but contain damage
Conclusion
• Fail whale is now an endangered species
• Went from event driven spikes to pushing
continuous reliability improvements where events
became trivial
Tweet Spikes Today
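• New Tweets per second (TPS) record: 143,199 TPS.
Typical day: more than 500 million Tweets sent; average 5,700 TPS.
(August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST)
• https://blog.twitter.com/2013/new-tweets-per-second-record-and-how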
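The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers from the full slide text (143,199 TPS record, 500 million Tweets per day, 5,700 TPS average); the variable names are illustrative.

```python
TWEETS_PER_DAY = 500_000_000
RECORD_TPS = 143_199
QUOTED_AVERAGE_TPS = 5_700
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if 500 million Tweets were spread evenly over one day.
implied_average_tps = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the record spike sits above the quoted average rate.
spike_ratio = RECORD_TPS / QUOTED_AVERAGE_TPS

print(f"implied average: {implied_average_tps:.0f} TPS")
print(f"record spike: {spike_ratio:.1f}x the quoted average")
```

The implied average of about 5,787 TPS is close to the quoted 5,700, and the record spike is roughly 25 times the average load.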
Final Thoughts
• Marathon not a sprint. Maintain systems
and yourself
• We are hiring to make system even better
Endangered: Fail Whale
Jonathan
Reichhold
@jreichhold
Questions?
• https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
• https://blog.twitter.com/2013/observability-at-twitter

#Surgeconf Scaling Twitter to go After the Fail Whale


Originally known for a "fail whale" that appeared frequently on the site, Twitter has changed significantly to make sure we are available, without a blip, no matter what is happening around the world.

This goal felt unattainable three years ago, when the 2010 World Cup put Twitter squarely in the center of a real-time, global conversation. The influx of Tweets (from every shot on goal, penalty kick, and yellow or red card) repeatedly took its toll and made Twitter unavailable for short periods of time. Engineering worked through the nights during this time, desperately trying to find and implement order-of-magnitude efficiency gains. Unfortunately, those gains were quickly swamped by Twitter’s rapid growth, and engineering had started to run out of low-hanging fruit to fix.

After that experience, we stepped back and determined that we needed to re-architect the site to support the continued growth of Twitter and to keep it running smoothly. Since then we’ve worked hard to make sure that the service is resilient to the world’s impulses. We’re now able to withstand events like Castle in the Sky viewings, the Super Bowl, and the global New Year’s Eve celebration. This re-architecture has not only made the service more resilient when traffic spikes to record highs, but also provides a more flexible platform on which to build more features faster, including synchronizing direct messages across devices, Twitter cards that allow Tweets to become richer and contain more content, and a rich search experience that includes stories and users. And more features are coming.

This talk will cover some of the lessons learned and changes made to not only grow, but also to become more resilient to world events and less fragile to whales.

Published in: Technology
  • Introduce self.
  • The job of a startup is to take risks and push quickly, but to learn from them. Twitter's genesis valued quick design and feature iteration over stability. We were known for the fail whale. Events dominated Twitter traffic. Engineers were limited. We were slow to ramp up with traffic (impedance mismatch).
  • Limited resources (people, machines, and algorithms). The whale was a meme and what we were known for. Rapid iteration on features and infrastructure. The 2010 FIFA World Cup forced the organization to understand reliability as a feature. While we didn't completely break, we suffered a lot.
  • It allowed us to iterate fast, but became a bottleneck.
  • Harden the code and algorithms for scale and for the nature of a distributed system. Changing the structure and systems in response to external stresses made the system more resilient to later challenges. Fail whale extinct.
  • Most people start with the pattern they know. In general, nothing is learned.
  • Instead, propose a hypothesis and test it. Did the system change as expected? Learn what works (and what doesn’t) and use that understanding for the next iteration.
  • Be rigorous about recording and validating data. Lots of time can be wasted on invalid data. Signal vs. noise. False alerts.
  • Being on-call is a job, and during high stress it will burn folks out. Make sure you have good communications (phone numbers, IRC/Campfire, IM, chat, etc.) that work for the group. Make sure folks, both above and below you, are kept informed regularly and can focus on problems. Maintain calm and order.

    1. Scaling Twitter To Go After the Fail Whale Jonathan Reichhold - Twitter Engineering
    2. Early Twitter....
    3. 2010 World Cup Challenge • Tweet and user requests growing exponentially (good problem)
    4. Load....
    5. Monolithic Architecture • Ruby on Rails • Temporally-sharded MySQL • Memcached • ~60 engineers
    6. Stabilize & Understand • Learn & make improvements • Don’t just survive
    7. Be Realistic & Ambitious • Prioritize what can be fixed and timeframes for doing it • Sometimes need the duct tape • Find patterns and improvements for the long term
    8. A Bad Approach • Flip switches/branches/other until fixed http://www.flickr.com/photos/chrism70/1144424032
    9. Science
    10. Step 1: Trustworthy Data • https://blog.twitter.com/2013/observability-at-twitter
    11. Step 2: Set Expectations • Being on-call is a job and during high stress will burn folks out • Maintain calm and order
    12. Post Mortems • Improvement becomes part of process • Stress makes system stronger not weaker
    13. Teamwork • All of this made possible by amazing team and management • Culture
    14. Capacity Planning & Forecast • Just in time but realistic • Figure out real buffers
    15. Longer Term Changes • Architecture changes take time and changes in organization
    16. Improve Efficiency • Rails/Ruby -> Scala & JVM • 200-300 RPS -> 10,000-20,000 • Single process per request -> Finagle
    17. Service Orientation • Make changes at interface boundary, not in single monolith • Team interactions simplified • Core nouns and verbs
    18. Move out of public cloud • Flexibility and latency demand at some point • Hard problem • Datacenter as failure domain • Mesos
    19. Dynamic Configuration • Update routes and compare live vs dark/new • Quickly adjust to issues • Faster and less fragile deploys
    20. Improve storage • Gizzard for MySQL • Improve Memcached • Storage as a service • Snowflake IDs
    21. Development Speed • Startups live and die by development speed • Make easier to ship but contain damage
    22. Conclusion • Fail whale is now an endangered species • Went from event driven spikes to pushing continuous reliability improvements where events became trivial
    23. Tweet Spikes Today • New Tweets per second (TPS) record: 143,199 TPS. Typical day: more than 500 million Tweets sent; average 5,700 TPS. (August 2 at 7:21:50 PDT; August 3 at 11:21:50 JST) • https://blog.twitter.com/2013/new-tweets-per-second-record-and-how
    24. Final Thoughts • Marathon not a sprint. Maintain systems and yourself • We are hiring to make system even better
    25. Endangered: Fail Whale Jonathan Reichhold @jreichhold
    26. Questions? • https://blog.twitter.com/2013/new-tweets-per-second-record-and-how • https://blog.twitter.com/2013/observability-at-twitter