Fixing Twitter
... and Finding your own Fail Whale




            John Adams
        Twitter Operations
        <jna@twit...
Operations
• Small team, growing rapidly.
• What do we do?
 • Software Performance (back-end)
 • Availability
 • Capacity ...
Managed Services
• Dedicated team (NTTA)
• 24/7 Hands on remote support
• No clouds. We tried that!
 • Need raw processing...
752%
2008 Growth
[Chart: Unique Visitors (in Millions), y-axis 0 to 5, x-axis starting Dec 07]
That was only the beginning...




previous
 graph!
Uniques




 Not slowing down, despite what outsiders say.
  Hard for outsiders to measure API usage!
Growth = Pain
 + an appreciation for Institutionalized Fear
Mantra!


   Find Weakest
       Point




   Metrics +
Logs + Science =
    Analysis
Mantra!


   Find Weakest    Take Corrective
       Point           Action




   Metrics +         Process
Logs + Science...
Mantra!


   Find Weakest    Take Corrective    Move to Next
       Point           Action         Weakest Point




   Me...
Find the Weakest Point

  • Metrics + Graphs
   • Individual metrics are irrelevant
  • Logs
  • SCIENCE!
  • Find out wha...
Instrument Everything




                        (cc) seenoevil@flickr
Monitoring

 • Graph and report critical metrics in as near
   real time as possible
 • You already have the tools.
  • RR...
Dashboards
 • “Criticals” view
 • Smokeping/MRTG
 • Google Analytics
  • Not just for
     HTTP 200s/SEO
 • XML Feeds from...
Analyze
  • Turn data into information
   • Where is the code base going?
   • Are things worse than they were?
     • Und...
Forecasting                   Curve-fitting for capacity planning
                               (R, fityk, Mathematica, Cur...
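As a rough illustration of curve-fitting for capacity planning, the sketch below fits a straight line by least squares and solves for the day the fitted line reaches the signed and unsigned 32-bit limits that the slide ties to status_id. The linear model and the sample points are assumptions made for the example; the slide does not say which curve was fitted or show the underlying data.

```python
# Hypothetical sketch: least-squares line fit, then solve for the day the
# fitted line reaches a limit. The sample data is invented for illustration.

SIGNED_INT32_MAX = 2**31 - 1
UNSIGNED_INT32_MAX = 2**32 - 1


def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot


def day_limit_reached(slope, intercept, limit):
    """Day (x value) at which the fitted line crosses `limit`."""
    return (limit - intercept) / slope


# Invented samples: day number and the highest status_id seen on that day.
days = [0, 30, 60, 90, 120]
max_ids = [1_000_000_000, 1_200_000_000, 1_400_000_000,
           1_600_000_000, 1_800_000_000]

slope, intercept, r2 = fit_line(days, max_ids)
signed_day = day_limit_reached(slope, intercept, SIGNED_INT32_MAX)
unsigned_day = day_limit_reached(slope, intercept, UNSIGNED_INT32_MAX)
print(f"r^2={r2:.3f} signed limit: day {signed_day:.0f}, "
      f"unsigned limit: day {unsigned_day:.0f}")
```

The same approach works for disk space or network bandwidth: fit the history, then solve for the date the fit crosses the capacity line.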
Deploys

  • Graph time-of-deploy alongside server
      CPU and Latency
  • Display time-of-last-deploy on dashboard


 ...
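A numeric counterpart to graphing deploy times next to latency is to compare a metric just before and just after a deploy. The function below is a minimal sketch under assumed inputs: `deploy_impact`, the sample format, and the latency values are all hypothetical.

```python
def deploy_impact(samples, deploy_ts, window):
    """Compare a metric's mean in the `window` seconds before and after a deploy.

    `samples` is a list of (unix_timestamp, value) pairs.
    Returns (mean_before, mean_after), or None if either side has no samples.
    """
    before = [v for ts, v in samples if deploy_ts - window <= ts < deploy_ts]
    after = [v for ts, v in samples if deploy_ts <= ts < deploy_ts + window]
    if not before or not after:
        return None
    return sum(before) / len(before), sum(after) / len(after)


# Invented latency samples in milliseconds; the deploy happens at t=300.
latency_ms = [(0, 210), (100, 190), (200, 200),
              (300, 320), (400, 280), (500, 300)]
result = deploy_impact(latency_ms, deploy_ts=300, window=300)
print(result)
```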
Whale-Watcher
•   Simple shell script,
    •   MASSIVE WIN.
•   Whale = HTTP 503 (timeout)
•   Robot = HTTP 500 (error)
• ...
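The slide describes Whale-Watcher as a shell script; the sketch below only illustrates the same idea in Python and is not the original tool. It assumes a hypothetical log format in which each line starts with an epoch timestamp followed by the HTTP status, counts 503s and 500s in the last `window` lines, and compares the whale rate to a threshold.

```python
from collections import deque

WHALE, ROBOT = 503, 500  # status codes named on the slide


def whale_watch(lines, threshold, window=100_000):
    """Return (whales_per_second, robots_per_second, alarm) for the log tail."""
    tail = deque(lines, maxlen=window)  # keep only the last `window` lines
    whales = robots = 0
    first_ts = last_ts = None
    for line in tail:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        try:
            ts, status = float(fields[0]), int(fields[1])
        except ValueError:
            continue
        if first_ts is None:
            first_ts = ts
        last_ts = ts
        if status == WHALE:
            whales += 1
        elif status == ROBOT:
            robots += 1
    if first_ts is None:
        return 0.0, 0.0, False
    elapsed = max(last_ts - first_ts, 1.0)  # avoid dividing by zero
    whales_per_second = whales / elapsed
    return whales_per_second, robots / elapsed, whales_per_second > threshold


# Invented log lines in the assumed "<epoch> <status> <path>" format.
sample_log = [
    "100 200 /home",
    "101 503 /timeline",
    "102 503 /timeline",
    "103 500 /search",
    "110 200 /home",
]
wps, rps, alarm = whale_watch(sample_log, threshold=0.1)
if alarm:
    print("Thar be whales! Call in ops.")
```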
Take Action !
Feature “Darkmode”
 • Specific site controls to enable and
   disable computationally or IO-Heavy site
   function
 • The “...
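One way to picture darkmode switches is a small registry of named flags whose changes are recorded for reporting. This is a hypothetical in-memory sketch: the class name, switch names, and audit format are assumptions, and a real deployment would need durable storage, since the deck notes elsewhere that cache evictions lost darkmode flags.

```python
import time


class DarkmodeSwitches:
    """In-memory feature switches; every change is appended to an audit log."""

    def __init__(self, feature_names):
        self._enabled = {name: True for name in feature_names}
        self.audit_log = []  # (timestamp, who, feature, enabled) tuples

    def is_enabled(self, feature):
        return self._enabled[feature]

    def set(self, feature, enabled, who):
        if feature not in self._enabled:
            raise KeyError(f"unknown switch: {feature}")
        self._enabled[feature] = enabled
        self.audit_log.append((time.time(), who, feature, enabled))

    def emergency_stop(self, who):
        """Disable every switchable feature at once."""
        for feature in self._enabled:
            self.set(feature, False, who)


# Hypothetical switch names, for illustration only.
switches = DarkmodeSwitches(["search", "timeline_refresh", "sms_delivery"])
switches.set("search", False, who="ops-oncall")
if switches.is_enabled("search"):
    print("run the expensive search query")
else:
    print("search is dark: serve a static notice instead")
```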
Configuration
Management
• Start automated configuration management
  EARLY in your company.
• Don’t wait until it’s too lat...
Configuration
Management
   • Complex Environment
   • Multiple Admins
   • Unknown Interactions
   • Solution: 2nd set of ...
Process through Reviews
Reviewboard
         www.review-board.org

 • SVN pre-commit hook causes a failure if
   the log message doesn’t include
 ...
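A pre-commit hook of the kind described can be sketched as follows. Subversion passes the repository path and transaction name to the hook and rejects the commit on a non-zero exit; `svnlook log -t` reads the pending log message. The function names and the case-insensitive match are assumptions, not the hook Twitter used.

```python
import subprocess
import sys


def has_review_marker(log_message, marker="reviewed"):
    """True if the commit message mentions the marker (case-insensitive)."""
    return marker in log_message.lower()


def main(argv):
    """Hook entry point: argv is [script, REPOS, TXN] as passed by Subversion."""
    repos, txn = argv[1], argv[2]
    log_message = subprocess.run(
        ["svnlook", "log", "-t", txn, repos],
        capture_output=True, text=True, check=True,
    ).stdout
    if not has_review_marker(log_message):
        # Subversion relays stderr from a failing hook back to the committer.
        print("Commit rejected: log message must include 'reviewed'.",
              file=sys.stderr)
        return 1
    return 0

# Installed as hooks/pre-commit, the script would end with:
#     sys.exit(main(sys.argv))
```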
Improve
Communication


                Campfire
Subsystems
Many limiting factors in the request pipeline

        Apache                      Rails
      MPM Model                  ...
Make an attack plan.
 Symptom    Bottleneck      Vector       Solution

                            HTTP
Bandwidth   Netwo...
CPU: More with Less
• 40% reduction in CPU by replacing dual-
  and quad-core machines with 8-core machines
• Switching from AMD ...
Rails
• Stop blaming Rails.
• Analysis found:
 • Caching + Cache invalidation problems
 • Bad queries generated by ActiveR...
Disk is the new Tape.
• Social Networking application profile has
  many O(n^y) operations.
• Page requests have to happ...
Caching
  • We’re the real-time web, but lots of caching
    opportunity
  • Most caching strategies rely on long TTLs
   ...
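To make the FNV and separate-pool points concrete, here is a small sketch that hashes a cache key with 32-bit FNV-1a and uses the result to pick a server inside a per-data-type pool. The choice of the FNV-1a variant, the modulo distribution, and the pool and host names are assumptions for illustration; the slides do not specify how libmemcached was configured.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


# Hypothetical pools: one per data type, so churn in one kind of data
# cannot evict another kind.
POOLS = {
    "timeline": ["tl-cache-1:11211", "tl-cache-2:11211", "tl-cache-3:11211"],
    "user": ["user-cache-1:11211", "user-cache-2:11211"],
}


def pick_server(data_type: str, key: str) -> str:
    """Choose a server for `key` within the pool reserved for `data_type`."""
    pool = POOLS[data_type]
    return pool[fnv1a_32(key.encode("utf-8")) % len(pool)]


print(pick_server("user", "user:12345"))
```

FNV needs only an XOR and a multiply per byte, which is far cheaper than computing an MD5 digest for every cache lookup.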
Caching   50% decrease in load with Native C
                gem + libmemcached
Cache Money!
• Active Record Plugin
 • Cache when reading from the DB
 • Cache when writing to the DB
• Transparently prov...
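The read and write behaviour listed for Cache Money is a write-through pattern. Cache Money itself is a Ruby ActiveRecord plugin; the Python sketch below only illustrates the pattern, with plain dicts standing in for the database and the cache and with hypothetical names throughout.

```python
class WriteThroughCache:
    """Read-through on reads, write-through on writes."""

    def __init__(self, database, cache):
        self.database = database  # stand-in for the real database
        self.cache = cache        # stand-in for memcached
        self.db_reads = 0         # counts reads that had to hit the database

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache on a miss
        return value

    def write(self, key, value):
        self.database[key] = value
        self.cache[key] = value  # keep the cache in step with the write


store = WriteThroughCache(database={"user:1": "alice"}, cache={})
store.read("user:1")          # miss: goes to the database, fills the cache
store.read("user:1")          # hit: served from the cache
store.write("user:2", "bob")  # written to both stores
store.read("user:2")          # hit: no database read needed
print(store.db_reads)
```

Callers never touch the cache directly, which is what removes the hand-written set/get code.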
Caching

 • “Cache Everything!” not the best policy
 • Invalidating caches at the right time is
   difficult.
 • Cold Cache...
Memcached
• memcached isn’t perfect.
 • Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for
  impo...
API + Caching (search)
• Cache and control abusive clients
• Varnish between two Apache Virtual Hosts
  (failover to anoth...
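Removing cache-busting query strings before hashing means two requests that differ only in a throwaway parameter map to the same cache entry. In production this would live in the cache's own configuration; the Python sketch below just shows the normalization step. The parameter names in `CACHE_BUSTERS`, the example host, and the sorting of the remaining parameters are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names of parameters that clients append to defeat caching.
CACHE_BUSTERS = {"_", "rand", "nocache"}


def cache_key(url: str) -> str:
    """Return the URL with cache-busting parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in CACHE_BUSTERS
    ]
    kept.sort()  # assumption: parameter order does not change the response
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


first = cache_key("http://search.example.com/search.json?q=whale&_=12345")
second = cache_key("http://search.example.com/search.json?_=999&q=whale")
print(first == second)
```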
Relational Databases
not a Panacea
• Good for:
 • Users, Relational Data, Transactions
• Bad:
 • Queues. Polling operation...
Queues
• Many message queue solutions on the
  market
• At high loads, most perform poorly when
  used in ‘durable’ mode.
...
Kestrel
Falco tinnunculus




  • Works like memcache (same protocol)
  • SET = enqueue | GET = dequeue
  • No strict orde...
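The queue semantics on this slide can be mimicked in a few lines. Kestrel is written in Scala and speaks the memcache protocol over the network; the sketch below is only an in-memory Python model of the behaviour described (SET enqueues, GET dequeues, servers share no state, so ordering across servers is not strict). Class names are hypothetical.

```python
import random
from collections import deque


class QueueServer:
    """One independent server: SET appends to a named queue, GET pops from it."""

    def __init__(self):
        self._queues = {}

    def set(self, queue_name, item):
        self._queues.setdefault(queue_name, deque()).append(item)

    def get(self, queue_name):
        queue = self._queues.get(queue_name)
        return queue.popleft() if queue else None


class QueueClient:
    """Spreads work over servers at random; servers never talk to each other."""

    def __init__(self, servers, rng=None):
        self.servers = list(servers)
        self.rng = rng or random.Random()

    def enqueue(self, queue_name, item):
        self.rng.choice(self.servers).set(queue_name, item)

    def dequeue(self, queue_name):
        # Try the servers in random order until one has an item.
        for server in self.rng.sample(self.servers, len(self.servers)):
            item = server.get(queue_name)
            if item is not None:
                return item
        return None


client = QueueClient([QueueServer(), QueueServer()], rng=random.Random(42))
for job in ["job-1", "job-2", "job-3"]:
    client.enqueue("jobs", job)
drained = [client.dequeue("jobs") for _ in range(3)]
print(sorted(drained))
```

Every job comes back exactly once, but the order depends on which server each job landed on.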
Asynchronous
Requests
• Inbound traffic consumes a mongrel
• Outbound traffic consumes a mongrel
• The request pipeline shou...
Don’t make services
dependent
• Move operations out of the synchronous
  request cycle
 • Email
 • Complex object generati...
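Moving work out of the synchronous request cycle can be sketched with a queue and a background worker: the request handler only enqueues a job and returns, and a daemon does the slow part. The handler name, the job format, and the use of an in-process `queue.Queue` in place of a real message queue are assumptions for the example.

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for a real message queue
delivered = []        # records what the worker has "sent"


def handle_request(user, text):
    """Request path: enqueue the slow work and return immediately."""
    jobs.put(("email", user, text))
    return "accepted"


def worker():
    """Daemon loop: do the slow work outside the request cycle."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            jobs.task_done()
            break
        _kind, user, text = job
        delivered.append((user, text))  # a real worker would send the email
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = handle_request("alice", "Welcome!")
jobs.join()     # wait for the demo job to finish
jobs.put(None)  # stop the worker
thread.join()
print(status, delivered)
```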
Daemons
• Many different types at Twitter.
• The number of daemons has to match the workload
 • Early Kestrel would crash if queue...
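The transcript of this slide also lists killing daemons after n requests. A worker loop that retires itself after a fixed number of jobs could look like the sketch below; `run_worker` and its arguments are hypothetical, and a supervisor process is assumed to start a fresh worker whenever one exits.

```python
from collections import deque


def run_worker(fetch_job, handle_job, max_jobs):
    """Handle jobs until the source is empty or `max_jobs` have been handled.

    Returns the number of jobs handled, after which the process would exit
    and a supervisor (assumed here) would start a replacement.
    """
    handled = 0
    while handled < max_jobs:
        job = fetch_job()
        if job is None:
            break  # nothing left to do
        handle_job(job)
        handled += 1
    return handled


# Invented workload: five pending jobs, workers that retire after three.
pending = deque(range(5))
done = []


def fetch():
    return pending.popleft() if pending else None


first_run = run_worker(fetch, done.append, max_jobs=3)
second_run = run_worker(fetch, done.append, max_jobs=3)
print(first_run, second_run, done)
```

Exiting on a schedule keeps any slow memory growth in a long-running process bounded.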
MySQL Challenges
• Replication Delay
 • Single threaded. Slow.
• Social Networking not good for RDBMS
 • N x N relationshi...
MySQL

• Replication delay and cache eviction
  produce inconsistent results to the end
  user.
• Locks create resource co...
Database Replication
  • Major issues around users and statuses
    tables
  • Multiple functional masters (FRP, FWP)
  • ...
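The transcript of this slide recommends killing long-running queries (mkill). The sketch below is not mkill; it only shows the selection logic over rows shaped like MySQL's `SHOW PROCESSLIST` output (`Id`, `Command`, `Time`, `Info`) and emits the corresponding `KILL` statements. The threshold and the sample rows are invented.

```python
def kill_statements(processlist, max_seconds):
    """Return KILL statements for client queries running longer than max_seconds.

    Only rows whose Command is "Query" are considered, so idle connections
    and replication threads (for example "Binlog Dump") are left alone.
    """
    return [
        f"KILL {row['Id']}"
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]


# Invented rows in the shape of SHOW PROCESSLIST output.
processlist = [
    {"Id": 11, "Command": "Query", "Time": 2, "Info": "SELECT ..."},
    {"Id": 12, "Command": "Query", "Time": 95, "Info": "SELECT ..."},
    {"Id": 13, "Command": "Sleep", "Time": 400, "Info": None},
    {"Id": 14, "Command": "Binlog Dump", "Time": 9000, "Info": None},
]
statements = kill_statements(processlist, max_seconds=30)
print(statements)
```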
status.twitter.com

• Keep users in the loop, or suffer.
• Hosted on different service (Tumblr)
• No matter how little inf...
Key Points

• Databases not always the best store.
• Instrument everything.
• Use metrics to make decisions, not guesses.
...
Thanks!
Twitter Open Source (Apache License):

- CacheMoney Gem (Write through Caching)
http://github.com/nkallen/cache-mo...
Fixing Twitter Velocity2009


This is my keynote talk, which I gave at O'Reilly Velocity in 2009.

  • What are we going to talk about? We’re going to talk about fixing Twitter, but
    we’re also going to talk about how you can apply what we’ve done to your site.
  • talk about architecture but (amazon sqs vs kestrel.. microseconds in kestrel)

    250mS obsession w/ instant retry


    No clouds, but we still use S3.
  • Since then we’ve experienced phenomenal growth
  • Of course, how do you find these points?

    You need instrumentation. Every metric you can get your hands on needs to be graphed and recorded at reasonable rates that do not cause harm to the running environment.
  • Unless detrimental to performance!
  • We have one of these, do you?
  • talk about the status ID problem
    Don’t be afraid to go to the scientific / statistical side of the company
    Can do this with disk space, network bandwidth, anything. stats = good
  • Talk about why we whale. Whale vs. timeout vs
  • talk about automated configuration management talk
  • Alternatives are immature, or complex.
    Lots of people
  • talk about cpu contention.
    apache cannot be allowed to overload rails, there are limits.
    Everyone has a listen queue. When they overload, you get waits. How much wait can you deal with?
  • Moore’s law = 12.5% year on year as we refresh hardware
  • so now, where are you going to store all this data?
  • Facebook’s memcache isn’t much better.
  • I’m sure the thing everyone says is “more caching.” It’s a good start but it’s not
    always the answer to all problems. Memcached can be unreliable. It can crash.

    Memcache SEGV from code push
  • ????? disclosure ? Abuse ?
  • we&amp;#x2019;ll talk about the databases in a bit
  • RabbitMQ, OpenMQ = well-documented problem: if you create a durable (persistent) queue, performance drops off dramatically, there is no resilience from failure,
    and it cannot back to disk reliably.

    Durable option kills performance. they were all slower than kestrel
  • everything consumes resources.
    talk about passenger and threaded model, and problems with mod_rewrite.
  • Most people design for fast clients that can dequeue quickly.
    Deal with the bottleneck.
  • and a couple of final notes...
  • Fixing Twitter Velocity2009

    1. Fixing Twitter ... and Finding your own Fail Whale John Adams Twitter Operations <jna@twitter.com>
    2. Operations • Small team, growing rapidly. • What do we do? • Software Performance (back-end) • Availability • Capacity Planning (metrics-driven) • Configuration Management • We don’t deal with the physical plant.
    3. Managed Services • Dedicated team (NTTA) • 24/7 Hands on remote support • No clouds. We tried that! • Need raw processing power, latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems.
    4. 752% 2008 Growth • [Chart: Unique Visitors (in Millions)]
    5. 752% 2008 Growth • [Chart: Unique Visitors (in Millions), Dec 07 to Dec 08, y-axis 0 to 5]
    6. That was only the beginning... previous graph!
    7. Uniques • Not slowing down, despite what outsiders say. • Hard for outsiders to measure API usage!
    8. Growth = Pain + an appreciation for Institutionalized Fear
    9. Mantra! • Find Weakest Point: Metrics + Logs + Science = Analysis
    10. Mantra! • Find Weakest Point: Metrics + Logs + Science = Analysis • Take Corrective Action: Process
    11. Mantra! • Find Weakest Point: Metrics + Logs + Science = Analysis • Take Corrective Action: Process • Move to Next Weakest Point: Repeatability
    12. Find the Weakest Point • Metrics + Graphs • Individual metrics are irrelevant • Logs • SCIENCE! • Find out what the actionable items are.
    13. Instrument Everything (cc) seenoevil@flickr
    14. Monitoring • Graph and report critical metrics in as near real time as possible • You already have the tools. • RRD • Ganglia + custom gMetric scripts • MRTG
    15. Dashboards • “Criticals” view • Smokeping/MRTG • Google Analytics • Not just for HTTP 200s/SEO • XML Feeds from managed services • Data Porn!
    16. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
    17. Forecasting • Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) • [Chart: status_id growth with fitted curve, r² = 0.99, marking the “Twitpocalypse” at the signed int (32 bit) and unsigned int (32 bit) limits]
    18. Deploys • Graph time-of-deploy alongside server CPU and Latency • Display time-of-last-deploy on dashboard • [Screenshot: last deploy times]
    19. Whale-Watcher • Simple shell script, • MASSIVE WIN. • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 100,000 lines of aggregated daemon / www logs • “Whales per Second” > W_threshold • Thar be whales! Call in ops.
    20. Take Action !
    21. Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-Heavy site function • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
    22. Configuration Management • Start automated configuration management EARLY in your company. • Don’t wait until it’s too late. • Twitter started within the first few months.
    23. Configuration Management • Complex Environment • Multiple Admins • Unknown Interactions • Solution: 2nd set of eyes.
    24. Process through Reviews
    25. Reviewboard www.review-board.org • SVN pre-commit hook causes a failure if the log message doesn’t include ‘reviewed’ • SVN post-commit hook informs people what changed via email • Watches the entire SVN tree
    26. Improve Communication • Campfire
    27. Subsystems
    28. Many limiting factors in the request pipeline • Apache: MPM Model, MaxClients, TCP Listen queue depth • Rails (mongrel): 2:1 oversubscribed to cores • Memcached: # connections • MySQL: # db connections • Varnish (search): # threads
    29-33. Make an attack plan.
        Symptom     Bottleneck   Vector         Solution
        Bandwidth   Network      HTTP Latency   Servers++
        Timeline    Database     Update Delay   Better algorithm
        Search      Database     Delays         DBs++, Code
        Updates     Algorithm    Latency        Algorithms
    34. CPU: More with Less • 40% reduction in CPU by replacing dual- and quad-core machines with 8-core machines • Switching from AMD to Intel Xeon = 30% gain • Saved data center space, power, cost per month. • Not the best option if you own machines. Capital expenditure = hard to realize new technology gains.
    35. Rails • Stop blaming Rails. • Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Queue Latency • Memcache / Page Cache Corruption • Replication Lag
    36. Disk is the new Tape. • Social Networking application profile has many O(n^y) operations. • Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms • Web 2.0 isn’t possible without lots of RAM • What to do?
    37. Caching • We’re the real-time web, but lots of caching opportunity • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter now largest contributor to libmemcached
    38. Caching • 50% decrease in load with Native C gem + libmemcached
    39. Cache Money! • Active Record Plugin • Cache when reading from the DB • Cache when writing to the DB • Transparently provides caching • Removes need for set/get cache code • Open Source!
    40. Caching • “Cache Everything!” not the best policy • Invalidating caches at the right time is difficult. • Cold Cache problem • Network Memory Bus != Infinite
    41. Memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Data and Hash Corruption (even in 1.2.6) • Exposed corruption issue with specific inputs causing SEGV and unexpected behavior
    42. API + Caching (search) • Cache and control abusive clients • Varnish between two Apache Virtual Hosts (failover to another backend if Varnish dies) • Remove Cache busting query strings before applying hash algorithm • Using ESI to cache jQuery requests when specifying a callback= parameter - big win.
    43. Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Caching. • You don’t need ACID for everything. • Enter the message queue...
    44. Queues • Many message queue solutions on the market • At high loads, most perform poorly when used in ‘durable’ mode. • Erlang based queues work well (RabbitMQ), but you need in house Erlang experience. • We wrote our own. • Kestrel to the rescue!
    45. Kestrel Falco tinnunculus • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
    46. Asynchronous Requests • Inbound traffic consumes a mongrel • Outbound traffic consumes a mongrel • The request pipeline should not be used to handle 3rd party communications or back-end work. • Daemons, Daemons, Daemons.
    47. Don’t make services dependent • Move operations out of the synchronous request cycle • Email • Complex object generation (timelines) • 3rd party services (bit.ly, sms, etc.)
    48. Daemons • Many different types at Twitter. • The number of daemons has to match the workload • Early Kestrel would crash if queues filled • “Seppuku” patch • Kill daemons after n requests • Long-running daemons = low memory
    49. MySQL Challenges • Replication Delay • Single threaded. Slow. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal • Sharding importance • Disk issues (FS Choice, noatime, scheduling algorithm)
    50. MySQL • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
    51. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the write DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
    52. status.twitter.com • Keep users in the loop, or suffer. • Hosted on different service (Tumblr) • No matter how little information you have available.
    53. Key Points • Databases not always the best store. • Instrument everything. • Use metrics to make decisions, not guesses. • Don’t make services dependent • Process asynchronously when possible
    54. Thanks! Twitter Open Source (Apache License): - CacheMoney Gem (Write through Caching) http://github.com/nkallen/cache-money/tree/master - Libmemcached http://tangent.org/552/libmemcached.html - Kestrel (Memcache-like message queue) http://github.com/robey/kestrel - mod_memcache_block (Apache 2.x Limiter/blocker) http://github.com/netik/mod_memcache_block