#Surgeconf Scaling Twitter to go After the Fail Whale

•Download as PPT, PDF•

2 likes•785 views

Originally known for a "fail whale" that occurred frequently on the site, Twitter has changed significantly to make sure we are available no matter what is happening around the world without a blip. This goal felt unattainable three years ago, when the 2010 World Cup put Twitter squarely in the center of a real-time, global conversation. The influx of Tweets—from every shot on goal, penalty kick, and yellow or red card—repeatedly took its toll and made Twitter unavailable for short periods of time. Engineering worked throughout the nights during this time, desperately trying to find and implement order-of-magnitudes of efficiency gains. Unfortunately, those gains were quickly swamped by Twitter’s rapid growth, and engineering had started to run out of low-hanging fruit to fix. After that experience, we determined we needed to step back. We then determined we needed to re-architect the site to support the continued growth of Twitter and to keep it running smoothly. Since then we’ve worked hard to make sure that the service is resilient to the world’s impulses. We’re now able to withstand events like Castle in the Sky viewings, the Super Bowl, and the global New Year’s Eve celebration. This re-architecture has not only made the service more resilient when traffic spikes to record highs, but also provides a more flexible platform on which to build more features faster, including synchronizing direct messages across devices, Twitter cards that allow Tweets to become richer and contain more content, and a rich search experience that includes stories and users. And more features are coming. This talk will cover some of the lessons learned and changes made to not only grow, but also to become more resilient to world events and less fragile to whales.

Technology

Scaling Twitter To Go After
the Fail Whale
Jonathan Reichhold - Twitter Engineering

2010 World Cup Challenge
• Tweet and user requests growing
exponentially (good problem)

Monolithic Architecture
• Ruby on Rails
• Temporally-sharded MySQL
• Memcached
• ~60 engineers

Stabilize & Understand
• Learn & make improvements
• Don’t just survive

Be Realistic & Ambitious
• Prioritize what can be fixed and timeframes
for doing it
• Sometimes need the duct tape
• Find patterns and improvements for the
long term

A Bad
Approach
• Flip
switches/branches/other
until fixed
http://www.flickr.com/photos/chrism70/1144424032

Step 1:Trustworty Data
• https://blog.twitter.com/2013/observability-at-twitter

Step 2: Set Expectations
• Being on-call is a job and during high stress
will burn folks out
• Maintain calm and order

Post Mortems
• Improvement becomes part of process
• Stress makes system stronger not weaker

Teamwork
• All of this made possible by amazing team
and management
• Culture

Capacity Planning &
Forecast
• Just in time but realistic
• Figure out real buffers

Longer Term Changes
• Architecture changes take time and
changes in organization

Improve Efficiency
• Rails/Ruby -> Scala & JVM
• 200-300 RPS -> 10,000-20,000
• Single process per request -> Finagle

Service Orientation
• Make changes at
interface
boundary, not in
single monolith
• Team interactions
simplified
• Core nouns and
verbs

Move out of public
cloud
• Flexibility and latency demand at some
point
• Hard problem
• Datacenter as failure domain
• Mesos

Dynamic Configuration
• Update routes and compare live vs
dark/new
• Quickly adjust to issues
• Faster and less fragile deploys

Improve storage
• Gizzard for MySQL
• Improve Memcached
• Storage as a service
• Snowflake IDs

Development Speed
• Startups live and die by development speed
• Make easier to ship but contain damage

Conclusion
• Fail whale is now an endangered species
• Went from event driven spikes to pushing
continuous reliability improvements where
events became trivial

Tweet Spikes Today
• New Tweets per second (TPS) record: 143,199 TPS.
Typical day: more than 500 million Tweets sent; average
5,700 TPS. (August 2 at 7:21:50 PDT;August 3 at
11:21:50 JST)
• https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

Final Thoughts
• Marathon not a sprint. Maintain systems
and yourself
• We are hiring to make system even better

Endangered: Fail Whale
Jonathan
Reichhold
@jreichhold

Questions?
• https://blog.twitter.com/2013/new-tweets-per-seco
• https://blog.twitter.com/2013/observability-
at-twitter

What's hot

Cassandra Core Concepts - Cassandra Day TorontoJon Haddad

CQRSFabian Vilers

11 Goals of High Functioning SQL DevelopersIke Ellis

MySQL 5.7 - the first few monthsSimon J Mudd

MySQL Failover and OrchestratorSimon J Mudd

Sustaining Your CareerScott Lowe

Experiences testing dev versions of MySQL and why it is good for youSimon J Mudd

Containerization: The DevOps Revolution SoftServe

Kanban @ nine.chMatías E. Fernández

Show Me the Numbers: Automated Browser colleenfry

DevOps in the Real WorldMax Yermakhanov

Tech writing in a continuous deployment environmentChristine Burwinkle

SPA vs. MPAMehmet Ali Tastan

New Server in an HourMike Hillwig

In Memory Cahce StructureMehmet Ali Tastan

MySQL X protocol - Talking to MySQL Directly over the WireSimon J Mudd

You've Launched! Now What?Amye Scavarda

Diagnosing Problems in Production - CassandraJon Haddad

What's hot (18)

Cassandra Core Concepts - Cassandra Day Toronto

CQRS

11 Goals of High Functioning SQL Developers

MySQL 5.7 - the first few months

MySQL Failover and Orchestrator

Sustaining Your Career

Experiences testing dev versions of MySQL and why it is good for you

Containerization: The DevOps Revolution

Kanban @ nine.ch

Show Me the Numbers: Automated Browser

DevOps in the Real World

Tech writing in a continuous deployment environment

SPA vs. MPA

New Server in an Hour

In Memory Cahce Structure

MySQL X protocol - Talking to MySQL Directly over the Wire

You've Launched! Now What?

Diagnosing Problems in Production - Cassandra

Similar to #Surgeconf Scaling Twitter to go After the Fail Whale

DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity...Matt Ray

Kafka Summit SF 2017 - Running Kafka for Maximum Painconfluent

Into The Box 2020 Keynote Day 1Ortus Solutions, Corp

A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...confluent

What we learned at pass summit in 2018Red Gate Software

Project Sherpa: How RightScale Went All in on DockerRightScale

Chirp 2010: Scaling TwitterJohn Adams

John adams talk cloudyJohn Adams

Gaurav_CVGaurav Srivastav

Building data intensive applicationsAmit Kejriwal

Dibi Conference 2012Scott Rutherford

Scaling a High Traffic Web Application: Our Journey from Java to PHP120bi

Scaling High Traffic Web ApplicationsAchievers Tech

E2 evc 3-2-1-rule - mikeresselerMike Resseler

Industrial Data ScienceNiko Vuokko

Stored Procedure Superpowers: A Developer’s GuideVoltDB

The Death of the Star SchemaDATAVERSITY

Scaling Systems: Architectures that growGibraltar Software

Who wants to be a DBA? Roles and ResponsibilitiesKevin Kline

[System design] Design a tweeter-like systemAree Oh

Similar to #Surgeconf Scaling Twitter to go After the Fail Whale (20)

DevOpsDays Austin: Helping Horses Become Unicorns, Chef's Operations Maturity...

Kafka Summit SF 2017 - Running Kafka for Maximum Pain

Into The Box 2020 Keynote Day 1

A Marriage of Lambda and Kappa: Supporting Iterative Development of an Event ...

What we learned at pass summit in 2018

Project Sherpa: How RightScale Went All in on Docker

Chirp 2010: Scaling Twitter

John adams talk cloudy

Gaurav_CV

Building data intensive applications

Dibi Conference 2012

Scaling a High Traffic Web Application: Our Journey from Java to PHP

Scaling High Traffic Web Applications

E2 evc 3-2-1-rule - mikeresseler

Industrial Data Science

Stored Procedure Superpowers: A Developer’s Guide

The Death of the Star Schema

Scaling Systems: Architectures that grow

Who wants to be a DBA? Roles and Responsibilities

[System design] Design a tweeter-like system

Recently uploaded

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5

Connecting the Dots for Information Discovery.pdfNeo4j

Rise of the Machines: Known As Drones...Rick Flair

TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey

Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq

Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González

Scale your database traffic with Read & Write split using MySQL RouterMydbops

Decarbonising Buildings: Making a net-zero built environment a realityIES VE

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3

UiPath Community: Communication Mining from Zero to HeroUiPathCommunity

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada

Manual 508 Accessibility Compliance AuditSkynet Technologies

So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda

Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes

What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina

DevEX - reference for building teams, processes, and platformsSergiu Bodiu

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal

Data governance with Unity Catalog PresentationKnoldus Inc.

Recently uploaded (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...

Connecting the Dots for Information Discovery.pdf

Rise of the Machines: Known As Drones...

TeamStation AI System Report LATAM IT Salaries 2024

Genislab builds better products and faster go-to-market with Lean project man...

Generative Artificial Intelligence: How generative AI works.pdf

Scale your database traffic with Read & Write split using MySQL Router

Decarbonising Buildings: Making a net-zero built environment a reality

New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx

UiPath Community: Communication Mining from Zero to Hero

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024

Manual 508 Accessibility Compliance Audit

So einfach geht modernes Roaming fuer Notes und Nomad.pdf

Assure Ecommerce and Retail Operations Uptime with ThousandEyes

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes

What is DBT - The Ultimate Data Build Tool.pdf

DevEX - reference for building teams, processes, and platforms

Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...

Data governance with Unity Catalog Presentation

#Surgeconf Scaling Twitter to go After the Fail Whale

1. Scaling Twitter To Go After the Fail Whale Jonathan Reichhold - Twitter Engineering

2. Early Twitter....

3. 2010 World Cup Challenge • Tweet and user requests growing exponentially (good problem)

4. Load....

5. Monolithic Architecture • Ruby on Rails • Temporally-sharded MySQL • Memcached • ~60 engineers

6. Stabilize & Understand • Learn & make improvements • Don’t just survive

7. Be Realistic & Ambitious • Prioritize what can be fixed and timeframes for doing it • Sometimes need the duct tape • Find patterns and improvements for the long term

8. A Bad Approach • Flip switches/branches/other until fixed http://www.flickr.com/photos/chrism70/1144424032

9. Science

10. Step 1:Trustworty Data • https://blog.twitter.com/2013/observability-at-twitter

11. Step 2: Set Expectations • Being on-call is a job and during high stress will burn folks out • Maintain calm and order

12. Post Mortems • Improvement becomes part of process • Stress makes system stronger not weaker

13. Teamwork • All of this made possible by amazing team and management • Culture

14. Capacity Planning & Forecast • Just in time but realistic • Figure out real buffers

15. Longer Term Changes • Architecture changes take time and changes in organization

16. Improve Efficiency • Rails/Ruby -> Scala & JVM • 200-300 RPS -> 10,000-20,000 • Single process per request -> Finagle

17. Service Orientation • Make changes at interface boundary, not in single monolith • Team interactions simplified • Core nouns and verbs

18. Move out of public cloud • Flexibility and latency demand at some point • Hard problem • Datacenter as failure domain • Mesos

19. Dynamic Configuration • Update routes and compare live vs dark/new • Quickly adjust to issues • Faster and less fragile deploys

20. Improve storage • Gizzard for MySQL • Improve Memcached • Storage as a service • Snowflake IDs

21. Development Speed • Startups live and die by development speed • Make easier to ship but contain damage

22. Conclusion • Fail whale is now an endangered species • Went from event driven spikes to pushing continuous reliability improvements where events became trivial

23. Tweet Spikes Today • New Tweets per second (TPS) record: 143,199 TPS. Typical day: more than 500 million Tweets sent; average 5,700 TPS. (August 2 at 7:21:50 PDT;August 3 at 11:21:50 JST) • https://blog.twitter.com/2013/new-tweets-per-second-record-and-how

24. Final Thoughts • Marathon not a sprint. Maintain systems and yourself • We are hiring to make system even better

25. Endangered: Fail Whale Jonathan Reichhold @jreichhold

26. Questions? • https://blog.twitter.com/2013/new-tweets-per-seco • https://blog.twitter.com/2013/observability- at-twitter

Editor's Notes

Introduce self.....
Job of startups is to take risk and push quickly but learn from it. Twitter's genesis valued quick design and feature iteration over stability. We were known for the fail whale Events dominating Twitter traffic Limited engineers. Slow to ramp up with traffic (impedence mismatch)
Limited resources (people, machines, and algorithms) Whale was a meme and what we were known for Rapid iteration on features and infrastructure 2010 FIFA World Cup forced an understanding of reliability as a feature on the organization. While we didn't completely break we suffered lots.
Allowed us to iterate fast. Became a bottleneck
Harden the code and algorithms for scale and nature of distributed system Changing the structure and systems to external stresses made the system more resilient to later challenges. Fail whale extinct
Most people start with pattern they know Nothing is learned in general
Instead propose a hypothesis and test Did the system change as expected? Learn what works (and doesn’t) and use understanding for next iteration
Be rigorous on recording data and validating Lots of time can be wasted on invalid data Signal vs noise False Alerts
Being on-call is a job and during high stress will burn folks out Make sure you have good communications (phone numbers, IRC/Campfire, IM, Chat, etc) that work for group Make sure folks are kept informed regularly and can focus on problems. Both above and below you Maintain calm and order

#Surgeconf Scaling Twitter to go After the Fail Whale

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Similar to #Surgeconf Scaling Twitter to go After the Fail Whale

Similar to #Surgeconf Scaling Twitter to go After the Fail Whale (20)

Recently uploaded

Recently uploaded (20)

#Surgeconf Scaling Twitter to go After the Fail Whale

Editor's Notes