SlideShare a Scribd company logo
Fixing Twitter
... and Finding your own Fail Whale




            John Adams
        Twitter Operations
        <jna@twitter.com>
Operations
• Small team, growing rapidly.
• What do we do?
 • Software Performance (back-end)
 • Availability
 • Capacity Planning (metrics-driven)
 • Configuration Management
• We don’t deal with the physical plant.
Managed Services
• Dedicated team (NTTA)
• 24/7 Hands on remote support
• No clouds. We tried that!
 • Need raw processing power, latency too
     high in existing cloud offerings
• Frees us to deal with real, intellectual,
  computer science problems.
752%
2008 Growth
           5



         3.75



          2.5



         1.25



           0
                Dec 07    Feb 08   Apr 08       Jun 08   Aug 08   Oct 08 Dec 08
                Unique Visitors (in Millions)
That was only the beginning...




previous
 graph!
Uniques




 Not slowing down, despite what outsiders say.
  Hard for outsiders to measure API usage!
Growth = Pain
 + an appreciation for Institutionalized Fear
Mantra!


   Find Weakest
       Point




   Metrics +
Logs + Science =
    Analysis
Mantra!


   Find Weakest    Take Corrective
       Point           Action




   Metrics +         Process
Logs + Science =
    Analysis
Mantra!


   Find Weakest    Take Corrective    Move to Next
       Point           Action         Weakest Point




   Metrics +         Process         Repeatability
Logs + Science =
    Analysis
Find the Weakest Point

  • Metrics + Graphs
   • Individual metrics are irrelevant
  • Logs
  • SCIENCE!
  • Find out what the actionable items are.
Instrument Everything




                        (cc) seenoevil@flickr
Monitoring

 • Graph and report critical metrics in as near
   real time as possible
 • You already have the tools.
  • RRD
  • Ganglia + custom gMetric scripts
  • MRTG
Dashboards
 • “Criticals” view
 • Smokeping/MRTG
 • Google Analytics
  • Not just for
     HTTP 200s/SEO
 • XML Feeds from
   managed services
 • Data Porn!
Analyze
  • Turn data into information
   • Where is the code base going?
   • Are things worse than they were?
     • Understand the impact of the last
        software deploy
      • Run check scripts during and after
        deploys
  • Capacity Planning, not Fire Fighting!
Forecasting                   Curve-fitting for capacity planning
                               (R, fityk, Mathematica, CurveFit)



              unsigned int (32 bit)
                Twitpocolypse



  status_id

                                      signed int (32 bit)
                                        Twitpocolypse




                                                  r2=0.99
Deploys

  • Graph time-of-deploy along side server
      CPU and Latency
  • Display time-of-last-deploy on dashboard


 last deploy times
Whale-Watcher
•   Simple shell script,
    •   MASSIVE WIN.
•   Whale = HTTP 503 (timeout)
•   Robot = HTTP 500 (error)
•   Examines last 100,000 lines of aggregated
    daemon / www logs
•   “Whales per Second” > Wthreshold
    •   Thar be whales! Call in ops.
Take Action !
Feature “Darkmode”
 • Specific site controls to enable and
   disable computationally or IO-Heavy site
   function
 • The “Emergency Stop” button
 • Changes logged and reported to all teams
 • Around 60 switches we can throw
 • Static / Read-only mode
Configuration
Management
• Start automated configuration management
  EARLY in your company.
• Don’t wait until it’s too late.
• Twitter started within the first few months.
Configuration
Management
   • Complex Environment
   • Multiple Admins
   • Unknown Interactions
   • Solution: 2nd set of eyes.
Process through Reviews
Reviewboard
         www.review-board.org

 • SVN pre-commit hook causes a failure if
   the log message doesn’t include
   ‘reviewed’
 • SVN post-commit hook informs people
   what changed via email
 • Watches the entire SVN tree
Improve
Communication


                Campfire
Subsystems
Many limiting factors in the request pipeline

        Apache                      Rails
      MPM Model                  (mongrel)
      MaxClients            2:1 oversubscribed
 TCP Listen queue depth
                                  to cores

                                 Memcached
                                # connections


                                    MySQL
   Varnish (search)            # db connections
      # threads
Make an attack plan.
 Symptom    Bottleneck      Vector       Solution

                            HTTP
Bandwidth   Network                     Servers++
                           Latency
                                          Better
 Timeline   Database     Update Delay
                                        algorithm
                                         DBs++
  Search    Database        Delays
                                          Code
 Updates    Algorithm      Latency      Algorithms
CPU: More with Less
• Reduction in 40% of CPU by replacing dual
  and quad core machines with 8 core
• Switching from AMD to Intel Xeon = 30%
  gain
• Saved data center space, power, cost per
  month.
• Not the best option if you own machines.
  Capital expenditure = hard to realize new
  technology gains.
Rails
• Stop blaming Rails.
• Analysis found:
 • Caching + Cache invalidation problems
 • Bad queries generated by ActiveRecord,
    resulting in slow queries against the db
 • Queue Latency
 • Memcache / Page Cache Corruption
 • Replication Lag
Disk is the new Tape.
• Social Networking application profile has
  many   O(ny)   operations.
• Page requests have to happen in < 500mS
  or users start to notice. Goal: 250-300mS
• Web 2.0 isn’t possible without lots of RAM
• What to do?
Caching
  • We’re the real-time web, but lots of caching
    opportunity
  • Most caching strategies rely on long TTLs
    (>60 s)
  • Separate memcache pools for different data
    types to prevent eviction
  • Optimize Ruby Gem to libmemcached +
    FNV Hash instead of Ruby + MD5
  • Twitter now largest contributor to
    libmemcached
Caching   50% decrease in load with Native C
                gem + libmemcached
Cache Money!
• Active Record Plugin
 • Cache when reading from the DB
 • Cache when writing to the DB
• Transparently provides caching
 • Removes need for set/get cache code
 • Open Source!
Caching

 • “Cache Everything!” not the best policy
 • Invalidating caches at the right time is
   difficult.
 • Cold Cache problem
 • Network Memory Bus != Infinite
Memcached
• memcached isn’t perfect.
 • Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for
  important configuration data
  (loss of darkmode flags, for example)
• Data and Hash Corruption (even in 1.2.6)
 • Exposed corruption issue with specific
    inputs causing SEGV and unexpected
    behavior
API + Caching (search)
• Cache and control abusive clients
• Varnish between two Apache Virtual Hosts
  (failover to another backend if Varnish
  dies)
• Remove Cache busting query strings before
  applying hash algorithm
• Using ESI to cache jQuery requests when
  specifying a callback= parameter - big win.
Relational Databases
not a Panacea
• Good for:
 • Users, Relational Data, Transactions
• Bad:
 • Queues. Polling operations. Caching.
• You don’t need ACID for everything.
• Enter the message queue...
Queues
• Many message queue solutions on the
  market
• At high loads, most perform poorly when
  used in ‘durable’ mode.
• Erlang based queues work well
  (RabbitMQ), but you need in house Erlang
  experience.
• We wrote our own.
 • Kestrel to the rescue!
Kestrel
Falco tinnunculus




  • Works like memcache (same protocol)
  • SET = enqueue | GET = dequeue
  • No strict ordering of jobs
  • No shared state between servers
  • Written in Scala.
Asynchronous
Requests
• Inbound traffic consumes a mongrel
• Outbound traffic consumes a mongrel
• The request pipeline should not be used to
  handle 3rd party communications or
  back-end work.
• Daemons, Daemons, Daemons.
Don’t make services
dependent
• Move operations out of the synchronous
  request cycle
 • Email
 • Complex object generation (timelines)
 • 3rd party services (bit.ly, sms, etc.)
Daemons
• Many different types at Twitter.
• # of daemons have to match the workload
 • Early Kestrel would crash if queues filled
• “Seppaku” patch
 • Kill daemons after n requests
• Long-running daemons = low memory
MySQL Challenges
• Replication Delay
 • Single threaded. Slow.
• Social Networking not good for RDBMS
 • N x N relationships and social graph /
    tree traversal
 • Sharding importance
 • Disk issues (FS Choice, noatime,
    scheduling algorithm)
MySQL

• Replication delay and cache eviction
  produce inconsistent results to the end
  user.
• Locks create resource contention for
  popular data
Database Replication
  • Major issues around users and statuses
    tables
  • Multiple functional masters (FRP, FWP)
  • Make sure your code reads and writes to
    the write DBs. Reading from master = slow
    death
    • Monitor the DB. Find slow / poorly
      designed queries
  • Kill long running queries before they kill
    you (mkill)
status.twitter.com

• Keep users in the loop, or suffer.
• Hosted on different service (Tumblr)
• No matter how little information you have
  available.
Key Points

• Databases not always the best store.
• Instrument everything.
• Use metrics to make decisions, not guesses.
• Don’t make services dependent
• Process asynchronously when possible
Thanks!
Twitter Open Source (Apache License):

- CacheMoney Gem (Write through Caching)
http://github.com/nkallen/cache-money/tree/master

- Libmemcached
http://tangent.org/552/libmemcached.html

- Kestrel (Memcache-like message queue)
http://github.com/robey/kestrel

- mod_memcache_block (Apache 2.x Limiter/blocker)
http://github.com/netik/mod_memcache_block

More Related Content

What's hot

Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
DataStax Academy
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
George Cao
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
Jagadish Venkatraman
 
Diagnosing MySQL performance problems
Diagnosing  MySQL performance problemsDiagnosing  MySQL performance problems
Diagnosing MySQL performance problems
Justin Swanhart
 
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
DataStax
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
bcoverston
 
What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?
DATAVERSITY
 
Linkedin NUS QCon 2009 slides
Linkedin NUS QCon 2009 slidesLinkedin NUS QCon 2009 slides
Linkedin NUS QCon 2009 slides
ruslansv
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Amazon Web Services
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
Vadim Semenov
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
Adrian Cockcroft
 
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Lightbend
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
zznate
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
Guido Schmutz
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoopLen Bass
 
keyvi the key value index @ Cliqz
keyvi the key value index @ Cliqzkeyvi the key value index @ Cliqz
keyvi the key value index @ Cliqz
Hendrik Muhs
 
Reactive Supply To Changing Demand
Reactive Supply To Changing DemandReactive Supply To Changing Demand
Reactive Supply To Changing Demand
Jonas Bonér
 
a real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxxa real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at DevoxxNathan Bijnens
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydb
Daniel Austin
 

What's hot (19)

Webinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in ProductionWebinar: Diagnosing Apache Cassandra Problems in Production
Webinar: Diagnosing Apache Cassandra Problems in Production
 
Thousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/OThousands of Threads and Blocking I/O
Thousands of Threads and Blocking I/O
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
 
Diagnosing MySQL performance problems
Diagnosing  MySQL performance problemsDiagnosing  MySQL performance problems
Diagnosing MySQL performance problems
 
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
 
What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?What Drove Wordnik Non-Relational?
What Drove Wordnik Non-Relational?
 
Linkedin NUS QCon 2009 slides
Linkedin NUS QCon 2009 slidesLinkedin NUS QCon 2009 slides
Linkedin NUS QCon 2009 slides
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Using apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at DatadogUsing apache spark for processing trillions of records each day at Datadog
Using apache spark for processing trillions of records each day at Datadog
 
Cmg06 utilization is useless
Cmg06 utilization is uselessCmg06 utilization is useless
Cmg06 utilization is useless
 
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast DataAkka Streams And Kafka Streams: Where Microservices Meet Fast Data
Akka Streams And Kafka Streams: Where Microservices Meet Fast Data
 
Nyc summit intro_to_cassandra
Nyc summit intro_to_cassandraNyc summit intro_to_cassandra
Nyc summit intro_to_cassandra
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Error in hadoop
Error in hadoopError in hadoop
Error in hadoop
 
keyvi the key value index @ Cliqz
keyvi the key value index @ Cliqzkeyvi the key value index @ Cliqz
keyvi the key value index @ Cliqz
 
Reactive Supply To Changing Demand
Reactive Supply To Changing DemandReactive Supply To Changing Demand
Reactive Supply To Changing Demand
 
a real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxxa real-time architecture using Hadoop and Storm at Devoxx
a real-time architecture using Hadoop and Storm at Devoxx
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydb
 

Similar to Fixing twitter

Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
John Adams
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
confluent
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
FoundationDB
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
LinkedIn
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
Javier García Magna
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
confluent
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
DataStax
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
DataStax Academy
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
Rodrigo Campos
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
Adaryl "Bob" Wakefield, MBA
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
Krishna Gade
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
Hakka Labs
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
Jon Haddad
 
SeaJUG May 2012 mybatis
SeaJUG May 2012 mybatisSeaJUG May 2012 mybatis
SeaJUG May 2012 mybatisWill Iverson
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
Rodrigo Campos
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
David Martínez Rego
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 

Similar to Fixing twitter (20)

Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
 
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
What's inside the black box? Using ML to tune and manage Kafka. (Matthew Stum...
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
 
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd KnownCassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
Cassandra Community Webinar: MySQL to Cassandra - What I Wish I'd Known
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 
Cassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in ProductionCassandra Day London 2015: Diagnosing Problems in Production
Cassandra Day London 2015: Diagnosing Problems in Production
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics Solution
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at PinterestDataEngConf SF16 - Scalable and Reliable Logging at Pinterest
DataEngConf SF16 - Scalable and Reliable Logging at Pinterest
 
Diagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - CassandraDiagnosing Problems in Production - Cassandra
Diagnosing Problems in Production - Cassandra
 
SeaJUG May 2012 mybatis
SeaJUG May 2012 mybatisSeaJUG May 2012 mybatis
SeaJUG May 2012 mybatis
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Voldemort Nosql
Voldemort NosqlVoldemort Nosql
Voldemort Nosql
 

More from Roger Xia

机器学习推动金融数据智能
机器学习推动金融数据智能机器学习推动金融数据智能
机器学习推动金融数据智能Roger Xia
 
Code reviews
Code reviewsCode reviews
Code reviewsRoger Xia
 
Python introduction
Python introductionPython introduction
Python introductionRoger Xia
 
Learning notes ruby
Learning notes rubyLearning notes ruby
Learning notes rubyRoger Xia
 
Converged open platform for enterprise
Converged open platform for enterpriseConverged open platform for enterprise
Converged open platform for enterpriseRoger Xia
 
Code reviews
Code reviewsCode reviews
Code reviewsRoger Xia
 
E commerce search strategies
E commerce search strategiesE commerce search strategies
E commerce search strategiesRoger Xia
 
Indefero source code_managment
Indefero source code_managmentIndefero source code_managment
Indefero source code_managmentRoger Xia
 
Web Services Atomic Transactio
 Web Services Atomic Transactio Web Services Atomic Transactio
Web Services Atomic Transactio
Roger Xia
 
Web service through cxf
Web service through cxfWeb service through cxf
Web service through cxfRoger Xia
 
Q con london2011-matthewwall-whyichosemongodbforguardiancouk
Q con london2011-matthewwall-whyichosemongodbforguardiancoukQ con london2011-matthewwall-whyichosemongodbforguardiancouk
Q con london2011-matthewwall-whyichosemongodbforguardiancouk
Roger Xia
 
Spring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataSpring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_data
Roger Xia
 
Consistency-New-Generation-Databases
Consistency-New-Generation-DatabasesConsistency-New-Generation-Databases
Consistency-New-Generation-DatabasesRoger Xia
 
Java explore
Java exploreJava explore
Java explore
Roger Xia
 
Mongo db实战
Mongo db实战Mongo db实战
Mongo db实战
Roger Xia
 
Ca siteminder
Ca siteminderCa siteminder
Ca siteminder
Roger Xia
 
Eclipse plug in mylyn & tasktop
Eclipse plug in mylyn & tasktopEclipse plug in mylyn & tasktop
Eclipse plug in mylyn & tasktop
Roger Xia
 
新浪微博架构猜想
新浪微博架构猜想新浪微博架构猜想
新浪微博架构猜想
Roger Xia
 

More from Roger Xia (20)

机器学习推动金融数据智能
机器学习推动金融数据智能机器学习推动金融数据智能
机器学习推动金融数据智能
 
Code reviews
Code reviewsCode reviews
Code reviews
 
Python introduction
Python introductionPython introduction
Python introduction
 
Learning notes ruby
Learning notes rubyLearning notes ruby
Learning notes ruby
 
Converged open platform for enterprise
Converged open platform for enterpriseConverged open platform for enterprise
Converged open platform for enterprise
 
Code reviews
Code reviewsCode reviews
Code reviews
 
E commerce search strategies
E commerce search strategiesE commerce search strategies
E commerce search strategies
 
Saml
SamlSaml
Saml
 
JavaEE6
JavaEE6JavaEE6
JavaEE6
 
Indefero source code_managment
Indefero source code_managmentIndefero source code_managment
Indefero source code_managment
 
Web Services Atomic Transactio
 Web Services Atomic Transactio Web Services Atomic Transactio
Web Services Atomic Transactio
 
Web service through cxf
Web service through cxfWeb service through cxf
Web service through cxf
 
Q con london2011-matthewwall-whyichosemongodbforguardiancouk
Q con london2011-matthewwall-whyichosemongodbforguardiancoukQ con london2011-matthewwall-whyichosemongodbforguardiancouk
Q con london2011-matthewwall-whyichosemongodbforguardiancouk
 
Spring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_dataSpring one2gx2010 spring-nonrelational_data
Spring one2gx2010 spring-nonrelational_data
 
Consistency-New-Generation-Databases
Consistency-New-Generation-DatabasesConsistency-New-Generation-Databases
Consistency-New-Generation-Databases
 
Java explore
Java exploreJava explore
Java explore
 
Mongo db实战
Mongo db实战Mongo db实战
Mongo db实战
 
Ca siteminder
Ca siteminderCa siteminder
Ca siteminder
 
Eclipse plug in mylyn & tasktop
Eclipse plug in mylyn & tasktopEclipse plug in mylyn & tasktop
Eclipse plug in mylyn & tasktop
 
新浪微博架构猜想
新浪微博架构猜想新浪微博架构猜想
新浪微博架构猜想
 

Recently uploaded

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 

Fixing twitter

  • 1. Fixing Twitter ... and Finding your own Fail Whale John Adams Twitter Operations <jna@twitter.com>
  • 2. Operations • Small team, growing rapidly. • What do we do? • Software Performance (back-end) • Availability • Capacity Planning (metrics-driven) • Configuration Management • We don’t deal with the physical plant.
  • 3. Managed Services • Dedicated team (NTTA) • 24/7 Hands on remote support • No clouds. We tried that! • Need raw processing power, latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems.
  • 4. 752% 2008 Growth 5 3.75 2.5 1.25 0 Dec 07 Feb 08 Apr 08 Jun 08 Aug 08 Oct 08 Dec 08 Unique Visitors (in Millions)
  • 5. That was only the beginning... previous graph!
  • 6. Uniques Not slowing down, despite what outsiders say. Hard for outsiders to measure API usage!
  • 7. Growth = Pain + an appreciation for Institutionalized Fear
  • 8. Mantra! Find Weakest Point Metrics + Logs + Science = Analysis
  • 9. Mantra! Find Weakest Take Corrective Point Action Metrics + Process Logs + Science = Analysis
  • 10. Mantra! Find Weakest Take Corrective Move to Next Point Action Weakest Point Metrics + Process Repeatability Logs + Science = Analysis
  • 11. Find the Weakest Point • Metrics + Graphs • Individual metrics are irrelevant • Logs • SCIENCE! • Find out what the actionable items are.
  • 12. Instrument Everything (cc) seenoevil@flickr
  • 13. Monitoring • Graph and report critical metrics in as near real time as possible • You already have the tools. • RRD • Ganglia + custom gMetric scripts • MRTG
  • 14. Dashboards • “Criticals” view • Smokeping/MRTG • Google Analytics • Not just for HTTP 200s/SEO • XML Feeds from managed services • Data Porn!
  • 15. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 16. Forecasting Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) unsigned int (32 bit) Twitpocolypse status_id signed int (32 bit) Twitpocolypse r2=0.99
  • 17. Deploys • Graph time-of-deploy along side server CPU and Latency • Display time-of-last-deploy on dashboard last deploy times
  • 18. Whale-Watcher • Simple shell script, • MASSIVE WIN. • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 100,000 lines of aggregated daemon / www logs • “Whales per Second” > Wthreshold • Thar be whales! Call in ops.
  • 20. Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-Heavy site function • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
  • 21. Configuration Management • Start automated configuration management EARLY in your company. • Don’t wait until it’s too late. • Twitter started within the first few months.
  • 22. Configuration Management • Complex Environment • Multiple Admins • Unknown Interactions • Solution: 2nd set of eyes.
  • 24. Reviewboard www.review-board.org • SVN pre-commit hook causes a failure if the log message doesn’t include ‘reviewed’ • SVN post-commit hook informs people what changed via email • Watches the entire SVN tree
  • 27. Many limiting factors in the request pipeline Apache Rails MPM Model (mongrel) MaxClients 2:1 oversubscribed TCP Listen queue depth to cores Memcached # connections MySQL Varnish (search) # db connections # threads
  • 28. Make an attack plan. Symptom Bottleneck Vector Solution HTTP Bandwidth Network Servers++ Latency Better Timeline Database Update Delay algorithm DBs++ Search Database Delays Code Updates Algorithm Latency Algorithms
  • 29. CPU: More with Less • Reduction in 40% of CPU by replacing dual and quad core machines with 8 core • Switching from AMD to Intel Xeon = 30% gain • Saved data center space, power, cost per month. • Not the best option if you own machines. Capital expenditure = hard to realize new technology gains.
  • 30. Rails • Stop blaming Rails. • Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Queue Latency • Memcache / Page Cache Corruption • Replication Lag
  • 31. Disk is the new Tape. • Social Networking application profile has many O(ny) operations. • Page requests have to happen in < 500mS or users start to notice. Goal: 250-300mS • Web 2.0 isn’t possible without lots of RAM • What to do?
  • 32. Caching • We’re the real-time web, but lots of caching opportunity • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter now largest contributor to libmemcached
  • 33. Caching 50% decrease in load with Native C gem + libmemcached
  • 34. Cache Money! • Active Record Plugin • Cache when reading from the DB • Cache when writing to the DB • Transparently provides caching • Removes need for set/get cache code • Open Source!
  • 35. Caching • “Cache Everything!” not the best policy • Invalidating caches at the right time is difficult. • Cold Cache problem • Network Memory Bus != Infinite
  • 36. Memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Data and Hash Corruption (even in 1.2.6) • Exposed corruption issue with specific inputs causing SEGV and unexpected behavior
  • 37. API + Caching (search) • Cache and control abusive clients • Varnish between two Apache Virtual Hosts (failover to another backend if Varnish dies) • Remove Cache busting query strings before applying hash algorithm • Using ESI to cache jQuery requests when specifying a callback= parameter - big win.
  • 38. Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Caching. • You don’t need ACID for everything. • Enter the message queue...
  • 39. Queues • Many message queue solutions on the market • At high loads, most perform poorly when used in ‘durable’ mode. • Erlang based queues work well (RabbitMQ), but you need in house Erlang experience. • We wrote our own. • Kestrel to the rescue!
  • 40. Kestrel Falco tinnunculus • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
  • 41. Asynchronous Requests • Inbound traffic consumes a mongrel • Outbound traffic consumes a mongrel • The request pipeline should not be used to handle 3rd party communications or back-end work. • Daemons, Daemons, Daemons.
  • 42. Don’t make services dependent • Move operations out of the synchronous request cycle • Email • Complex object generation (timelines) • 3rd party services (bit.ly, sms, etc.)
  • 43. Daemons • Many different types at Twitter. • # of daemons have to match the workload • Early Kestrel would crash if queues filled • “Seppaku” patch • Kill daemons after n requests • Long-running daemons = low memory
  • 44. MySQL Challenges • Replication Delay • Single threaded. Slow. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal • Sharding importance • Disk issues (FS Choice, noatime, scheduling algorithm)
  • 45. MySQL • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
  • 46. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the write DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
  • 47. status.twitter.com • Keep users in the loop, or suffer. • Hosted on different service (Tumblr) • No matter how little information you have available.
  • 48. Key Points • Databases not always the best store. • Instrument everything. • Use metrics to make decisions, not guesses. • Don’t make services dependent • Process asynchronously when possible
  • 49. Thanks! Twitter Open Source (Apache License): - CacheMoney Gem (Write through Caching) http://github.com/nkallen/cache-money/tree/master - Libmemcached http://tangent.org/552/libmemcached.html - Kestrel (Memcache-like message queue) http://github.com/robey/kestrel - mod_memcache_block (Apache 2.x Limiter/blocker) http://github.com/netik/mod_memcache_block