SlideShare a Scribd company logo
Scaling Twitter
John Adams
Twitter Operations



                     “Talk Cloudy to Me” 2011
John Adams / @netik
•   Early Twitter employee

•   Operations + Security Team

•   Keynote Speaker: O’Reilly Velocity 2009, 2010

•   O’Reilly Web 2.0 Speaker (2008, 2010)

•   Previous companies: Inktomi, Apple, c|net
Operations

•   Support the site and the developers

•   Make it performant

•   Capacity Planning (metrics-driven)

•   Configuration Management

•   Improve existing architecture
2,940               3,085
Tweets per Second   Tweets per Second
  Japan Scores!       Lakers Win!
6939 Tweets per second
  Japanese New Year
Where’s the cloud?
•   Amazon S3 for Avatar, Background, and Asset
    Storage. Few if any EC2 instances (test only.)

•   Direct S3 is expensive

    •   Cost for data in, data out, data stored

•   Cloudfront is cheaper than direct S3

    •   Cloudfront to S3 origins = free

    •   Use International CDNs on top of Cloudfront
CDNs and Versioning
•   Hard to purge quickly from CDNs

•   DMCA, image takedowns, emergency changes

•   CDN images, javascript, and other static files
    get long-lived expire times from origin

•   Version entire static asset store on deploys

     •   /static/NNNNNN/directory/file...
#newtwitter is an API client
Nothing works the
first time.
•   Scale site using best available technologies

•   Plan to build everything more than once.

•   Most solutions work to a certain level of scale,
    and then you must re-evaluate to grow.

•   This is a continual process.
UNIX friends fail at scale
•   Cron

    •   Add NTP, and many machines executing the
        same thing cause “micro” outages across the site.

•   Syslog

    •   Truncation, data loss, aggregation issues

•   RRD

    •   Data rounding over time
Operations Mantra


     Find
    Weakest
     Point


   Metrics +
Logs + Science =
   Analysis
Operations Mantra


     Find            Take
    Weakest        Corrective
     Point          Action


   Metrics +
Logs + Science =    Process
   Analysis
Operations Mantra

                                  Move to
     Find            Take
                                   Next
    Weakest        Corrective
                                  Weakest
     Point          Action
                                   Point


   Metrics +
Logs + Science =    Process     Repeatability
   Analysis
MTTD
MTTR
Sysadmin 2.0 (Devops)
•   Don’t be a just a sysadmin anymore.

•   Think of Systems management as a
    programming task (puppet, chef, cfengine...)

•   No more silos, or lobbing things over the wall

•   We’re all on the same side. Work Together!
Monitoring
•   Twitter graphs and reports critical metrics in
    as near to real time as possible

•   If you build tools against our API, you should
    too.

•   Use this data to inform the public

    •   dev.twitter.com - API availability

    •   status.twitter.com
Data Analysis
•   Instrumenting the world pays off.

•   “Data analysis, visualization, and other
    techniques for seeing patterns in data are
    going to be an increasingly valuable skill set.
    Employers take notice!”
          “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
Profiling
•   Low-level

•   Identify bottlenecks inside of core tools

    •   Latency, Network Usage, Memory leaks

•   Methods

    •   Network services:

        •   tcpdump + tcpdstat, yconalyzer

    •   Introspect with Google perftools
Forecasting
                              Curve-fitting for capacity planning
                               (R, fityk, Mathematica, CurveFit)



              unsigned int (32 bit)
                Twitpocolypse



  status_id

                                      signed int (32 bit)
                                        Twitpocolypse




                                                  r2=0.99
Loony (tool)
•   Accesses central machine database (MySQL)

    •   Python, Django, Paraminko SSH

    •   Ties into LDAP

•   Filter and list machines, find asset data

•   On demand changes across machines
Murder
•   Bittorrent based replication for deploys
    (Python w/libtorrent)

•   ~30-60 seconds to update >1k machines

•   Uses our machine database to find destination
    hosts

•   Legal P2P
Configuration Management

  •   Start automated configuration management
      EARLY in your company.

  •   Don’t wait until it’s too late.

  •   Twitter started within the first few months.
Virtualization
•   A great start, but...

    •   Images are not a stable way to replicate
        configuration

    •   Changes made to deployed images cannot be
        easily detected and repaired

    •   Typically the only solution is to redeploy
Puppet
•   Puppet + SVN

    •   Hundreds of modules

    •   Runs constantly

    •   Post-Commit hooks for config+sanity chceks

•   No one logs into machines

•   Centralized Change
Issues with Centralized
Management
•   Complex Environment

•   Multiple Admins

•   Unknown Interactions

•   Solution: 2nd set of eyes.
Process through Reviews
Logging
•   Syslog doesn’t work at high traffic rates

    •   No redundancy, no ability to recover from
        daemon failure

•   Moving large files around is painful

•   Solution:

    •   Scribe
Scribe
•   Fault-tolerant data transmission

•   Simple data model, easy to extend

•   Log locally, then scribe to aggregation nodes

•   Slow-start recovery when aggregator goes down

•   Twitter patches

    •   LZO compression and direct HDFS
Hadoop for Ops
•   Once the data’s scribed to HDFS you can:

    •   Aggregate reports across thousands of
        servers

    •   Produce application level metrics

    •   Use map-reduce to gain insight into your
        systems.
Analyze
•   Turn data into information

    •   Where is the code base going?

    •   Are things worse than they were?

        •   Understand the impact of the last software
            deploy

        •   Run check scripts during and after deploys

•   Capacity Planning, not Fire Fighting!
Dashboard
•   “Criticals” view

•   Smokeping/MRTG

•   Google Analytics

    •   Not just for HTTP
        200s/SEO

•   XML Feeds from
    managed services
Whale Watcher
•   Simple shell script, Huge Win

•   Whale = HTTP 503 (timeout)

•   Robot = HTTP 500 (error)

•   Examines last 60 seconds of aggregated
    daemon / www logs

•   “Whales per Second” > Wthreshold

    •   Thar be whales! Call in ops.
Deploy Watcher
Sample window: 300.0 seconds
First start time:
Mon Apr 5 15:30:00 2010 (Mon Apr   5 08:30:00 PDT 2010)
Second start time:
Tue Apr 6 02:09:40 2010 (Mon Apr   5 19:09:40 PDT 2010)

PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB049 CANARY APACHE: ALL OK
WEB049 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY OTHER: ALL OK
Deploys
•   Block deploys if site in error state

•   Graph time-of-deploy along side server CPU
    and Latency

•   Display time-of-last-deploy on dashboard

•   Communicate deploys in Campfire to teams


       ^^ last deploy times ^^
Feature “Decider”
•   Controls to enable and disable
    computationally or I/O heavy features

•   Our “Emergency Stop” button, and feature
    release control. All features go through it.

•   Changes logged and reported to all teams

•   Hundreds of knobs we can turn.

•   Not boolean, expressed as % of userbase
subsystems
request flow
                       Load Balancers


                          Apache


                       Rails (Unicorn)


             FlockDB      Kestrel        Memcached


                 MySQL        Cassandra


Monitoring               Daemons                     Mail Servers
Many limiting factors in the request pipeline

        Apache                    Rails
      Worker Model              (unicorn)
       MaxClients
                           2:1 oversubscribed
  TCP Listen queue depth
                                 to cores


                                Memcached
                               # connections

                                   MySQL
    Varnish (search)          # db connections
       # threads
Unicorn Rails Server
•   Connection push to socket polling model

•   Deploys without Downtime

•   Less memory and 30% less CPU

•   Shift from ProxyPass to Proxy Balancer

    •   mod_proxy_balancer lies about usage

    •   Race condition in counters patched
Rails
•   Front-end (Scala/Java back-end)

•   Not to blame for our issues. Analysis found:

    •   Caching + Cache invalidation problems

    •   Bad queries generated by ActiveRecord,
        resulting in slow queries against the db

    •   Garbage Collection issues (20-25%)

•   Replication Lag
memcached
•   Network Memory Bus isn’t infinite

•   Evictions make the cache unreliable for
    important configuration data
    (loss of darkmode flags, for example)

•   Segmented into pools for better performance

•   Examine slab allocation and watch for high
    use/eviction rates on individual slabs using
    peep. Adjust slab factors and size accordingly.
Decomposition

•   SOA: Service Orientated Architecture

•   Take application and decompose into services

•   Admin the services as separate units

•   Decouple the services from each other
Asynchronous Requests
•   Executing work during the web request is
    expensive

•   The request pipeline should not be used to
    handle 3rd party communications or
    back-end work.

    •   Move work to queues

    •   Run daemons against queues
Thrift
•   Cross-language services framework

•   Originally developed at Facebook

•   Now an Apache project

•   Seamless operation between C++, Java,
    Python, PHP, Ruby, Erlang, Perl, Haskell, C#,
    Cocoa, Smalltalk, OCaml (phew!)
Kestrel
•   Works like memcache (same protocol)

•   SET = enqueue | GET = dequeue

•   No strict ordering of jobs

•   No shared state between servers

•   Written in Scala. Open Source.
Daemons
•   Many different types at Twitter.

•   # of daemons have to match the workload

•   Early Kestrel would crash if queues filled

•   “Seppaku” patch

    •   Kill daemons after n requests

    •   Long-running leaky daemons = low memory
Flock DB                         Flock DB

•   Gizzard sharding
    framework                    Gizzard
•   Billions of edges

•   MySQL backend
                         Mysql    Mysql     Mysql
•   Open Source


 •   http://github.com/twitter/gizzard
Disk is the new Tape.
•   Social Networking application profile has
    many O(ny) operations.

•   Page requests have to happen in < 500mS or
    users start to notice. Goal: 250-300mS

•   Web 2.0 isn’t possible without lots of RAM

•   What to do?
Caching
•   We’re “real time”, but still lots of caching
    opportunity

•   Most caching strategies rely on long TTLs (>60 s)

•   Separate memcache pools for different data types
    to prevent eviction

•   Optimize Ruby Gem to libmemcached + FNV
    Hash instead of Ruby + MD5

•   Twitter largest contributor to libmemcached
Caching
•   “Cache Everything!” not the best policy, as

•   Invalidating caches at the right time is
    difficult.

•   Cold Cache problem; What happens after
    power or system failure?

•   Use cache to augment db, not to replace
MySQL
•   We have many MySQL servers

•   Increasingly used more and more as key/value
    store

•   Many instances spread out through the
    Gizzard sharding framework
MySQL Challenges
•   Replication Delay

    •   Single threaded replication = pain.

•   Social Networking not good for RDBMS

    •   N x N relationships and social graph / tree
        traversal - we have FlockDB for that

    •   Disk issues

        •   FS Choice, noatime, scheduling algorithm
Database Replication
•   Major issues around users and statuses tables

•   Multiple functional masters (FRP, FWP)

•   Make sure your code reads and writes to the
    write DBs. Reading from master = slow death

    •   Monitor the DB. Find slow / poorly designed
        queries

•   Kill long running queries before they kill you
    (mkill)
Key Points
•   Use cloud where necessary

•   Instrument everything.

•   Use metrics to make decisions, not intuition.

•   Decompose your app and remove
    dependencies between services

•   Process asynchronously when possible
Questions?
Thanks!
•   We support and use Open Source

    •   http://twitter.com/about/opensource

•   Work at scale - We’re hiring.

    •   @jointheflock

More Related Content

What's hot

Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
P. Taylor Goetz
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
didip
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
Jagadish Venkatraman
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
DECK36
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
DataWorks Summit
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
Shyam Raj
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
DataWorks Summit/Hadoop Summit
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming Frameworks
Huafeng Wang
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
bcoverston
 
FunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang EcosystemFunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang Ecosystem
Robert Virding
 
Erlang factory 2011 london
Erlang factory 2011 londonErlang factory 2011 london
Erlang factory 2011 london
Paolo Negri
 
Combining the strength of erlang and Ruby
Combining the strength of erlang and RubyCombining the strength of erlang and Ruby
Combining the strength of erlang and Ruby
Martin Rehfeld
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with Zabbix
Gerger
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughput
Paolo Negri
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
Paolo Negri
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_final
Ramya Sunil
 

What's hot (17)

Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming Frameworks
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
 
FunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang EcosystemFunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang Ecosystem
 
Erlang factory 2011 london
Erlang factory 2011 londonErlang factory 2011 london
Erlang factory 2011 london
 
Combining the strength of erlang and Ruby
Combining the strength of erlang and RubyCombining the strength of erlang and Ruby
Combining the strength of erlang and Ruby
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with Zabbix
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughput
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_final
 

Viewers also liked

Listening
ListeningListening
Listening
Liz Vera
 
Benedict arnold
Benedict arnoldBenedict arnold
Benedict arnold
Shad Whitesell
 
Chapter 20
Chapter 20Chapter 20
Chapter 20
Liz Vera
 
John adams powerpoint word
John adams powerpoint wordJohn adams powerpoint word
John adams powerpoint word
jessieandkeilea
 
The presidency of john adams
The presidency of john adamsThe presidency of john adams
The presidency of john adams
Allison Barnette
 
John Adams' presidency 8.37
John Adams' presidency 8.37John Adams' presidency 8.37
John Adams' presidency 8.37
Blake Harris
 
John Adams Presidency
John Adams PresidencyJohn Adams Presidency
John Adams Presidency
jmorrow123
 
John adams
John adamsJohn adams
John adams
kirillz123
 
Thomas jefferson
Thomas jeffersonThomas jefferson
Thomas jefferson
shalbert
 
John Adams Presidential Term
John Adams Presidential TermJohn Adams Presidential Term
John Adams Presidential Term
st3vo66
 

Viewers also liked (10)

Listening
ListeningListening
Listening
 
Benedict arnold
Benedict arnoldBenedict arnold
Benedict arnold
 
Chapter 20
Chapter 20Chapter 20
Chapter 20
 
John adams powerpoint word
John adams powerpoint wordJohn adams powerpoint word
John adams powerpoint word
 
The presidency of john adams
The presidency of john adamsThe presidency of john adams
The presidency of john adams
 
John Adams' presidency 8.37
John Adams' presidency 8.37John Adams' presidency 8.37
John Adams' presidency 8.37
 
John Adams Presidency
John Adams PresidencyJohn Adams Presidency
John Adams Presidency
 
John adams
John adamsJohn adams
John adams
 
Thomas jefferson
Thomas jeffersonThomas jefferson
Thomas jefferson
 
John Adams Presidential Term
John Adams Presidential TermJohn Adams Presidential Term
John Adams Presidential Term
 

Similar to John adams talk cloudy

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
xlight
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
Roger Xia
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
liujianrong
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
Javier García Magna
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWS
Brett McLain
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
confluent
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
John Adams
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
Particular Software
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices
💡 Tomasz Kogut
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
Docker, Inc.
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
Ovais Tariq
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
FoundationDB
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Redis Labs
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
Rodrigo Campos
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
SolarWinds Loggly
 
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National PoliceCodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
Bert Jan Schrijver
 
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Codemotion
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
Rodrigo Campos
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
jhugg
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
LinkedIn
 

Similar to John adams talk cloudy (20)

Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWS
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National PoliceCodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
 
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 

Recently uploaded

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
AstuteBusiness
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
Jason Yip
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
marufrahmanstratejm
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
Data Hops
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 

Recently uploaded (20)

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |Astute Business Solutions | Oracle Cloud Partner |
Astute Business Solutions | Oracle Cloud Partner |
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Public CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptxPublic CyberSecurity Awareness Presentation 2024.pptx
Public CyberSecurity Awareness Presentation 2024.pptx
 
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3FREE A4 Cyber Security Awareness  Posters-Social Engineering part 3
FREE A4 Cyber Security Awareness Posters-Social Engineering part 3
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 

John adams talk cloudy

  • 1. Scaling Twitter John Adams Twitter Operations “Talk Cloudy to Me” 2011
  • 2. John Adams / @netik • Early Twitter employee • Operations + Security Team • Keynote Speaker: O’Reilly Velocity 2009, 2010 • O’Reilly Web 2.0 Speaker (2008, 2010) • Previous companies: Inktomi, Apple, c|net
  • 3. Operations • Support the site and the developers • Make it performant • Capacity Planning (metrics-driven) • Configuration Management • Improve existing architecture
  • 4. 2,940 3,085 Tweets per Second Tweets per Second Japan Scores! Lakers Win!
  • 5. 6939 Tweets per second Japanese New Year
  • 6. Where’s the cloud? • Amazon S3 for Avatar, Background, and Asset Storage. Few if any EC2 instances (test only.) • Direct S3 is expensive • Cost for data in, data out, data stored • Cloudfront is cheaper than direct S3 • Cloudfront to S3 origins = free • Use International CDNs on top of Cloudfront
  • 7. CDNs and Versioning • Hard to purge quickly from CDNs • DMCA, image takedowns, emergency changes • CDN images, javascript, and other static files get long-lived expire times from origin • Version entire static asset store on deploys • /static/NNNNNN/directory/file...
  • 8. #newtwitter is an API client
  • 9. Nothing works the first time. • Scale site using best available technologies • Plan to build everything more than once. • Most solutions work to a certain level of scale, and then you must re-evaluate to grow. • This is a continual process.
  • 10. UNIX friends fail at scale • Cron • Add NTP, and many machines executing the same thing cause “micro” outages across the site. • Syslog • Truncation, data loss, aggregation issues • RRD • Data rounding over time
  • 11. Operations Mantra Find Weakest Point Metrics + Logs + Science = Analysis
  • 12. Operations Mantra Find Take Weakest Corrective Point Action Metrics + Logs + Science = Process Analysis
  • 13. Operations Mantra Move to Find Take Next Weakest Corrective Weakest Point Action Point Metrics + Logs + Science = Process Repeatability Analysis
  • 14. MTTD
  • 15. MTTR
  • 16. Sysadmin 2.0 (Devops) • Don’t be a just a sysadmin anymore. • Think of Systems management as a programming task (puppet, chef, cfengine...) • No more silos, or lobbing things over the wall • We’re all on the same side. Work Together!
  • 17. Monitoring • Twitter graphs and reports critical metrics in as near to real time as possible • If you build tools against our API, you should too. • Use this data to inform the public • dev.twitter.com - API availability • status.twitter.com
  • 18. Data Analysis • Instrumenting the world pays off. • “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
  • 19. Profiling • Low-level • Identify bottlenecks inside of core tools • Latency, Network Usage, Memory leaks • Methods • Network services: • tcpdump + tcpdstat, yconalyzer • Introspect with Google perftools
  • 20. Forecasting Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) unsigned int (32 bit) Twitpocolypse status_id signed int (32 bit) Twitpocolypse r2=0.99
  • 21. Loony (tool) • Accesses central machine database (MySQL) • Python, Django, Paraminko SSH • Ties into LDAP • Filter and list machines, find asset data • On demand changes across machines
  • 22. Murder • Bittorrent based replication for deploys (Python w/libtorrent) • ~30-60 seconds to update >1k machines • Uses our machine database to find destination hosts • Legal P2P
  • 23. Configuration Management • Start automated configuration management EARLY in your company. • Don’t wait until it’s too late. • Twitter started within the first few months.
  • 24. Virtualization • A great start, but... • Images are not a stable way to replicate configuration • Changes made to deployed images cannot be easily detected and repaired • Typically the only solution is to redeploy
  • 25. Puppet • Puppet + SVN • Hundreds of modules • Runs constantly • Post-Commit hooks for config+sanity chceks • No one logs into machines • Centralized Change
  • 26. Issues with Centralized Management • Complex Environment • Multiple Admins • Unknown Interactions • Solution: 2nd set of eyes.
  • 28. Logging • Syslog doesn’t work at high traffic rates • No redundancy, no ability to recover from daemon failure • Moving large files around is painful • Solution: • Scribe
  • 29. Scribe • Fault-tolerant data transmission • Simple data model, easy to extend • Log locally, then scribe to aggregation nodes • Slow-start recovery when aggregator goes down • Twitter patches • LZO compression and direct HDFS
  • 30. Hadoop for Ops • Once the data’s scribed to HDFS you can: • Aggregate reports across thousands of servers • Produce application level metrics • Use map-reduce to gain insight into your systems.
  • 31. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 32. Dashboard • “Criticals” view • Smokeping/MRTG • Google Analytics • Not just for HTTP 200s/SEO • XML Feeds from managed services
  • 33. Whale Watcher • Simple shell script, Huge Win • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 60 seconds of aggregated daemon / www logs • “Whales per Second” > Wthreshold • Thar be whales! Call in ops.
  • 34. Deploy Watcher Sample window: 300.0 seconds First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010) Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010) PRODUCTION APACHE: ALL OK PRODUCTION OTHER: ALL OK WEB049 CANARY APACHE: ALL OK WEB049 CANARY BACKEND SERVICES: ALL OK DAEMON031 CANARY BACKEND SERVICES: ALL OK DAEMON031 CANARY OTHER: ALL OK
  • 35. Deploys • Block deploys if site in error state • Graph time-of-deploy along side server CPU and Latency • Display time-of-last-deploy on dashboard • Communicate deploys in Campfire to teams ^^ last deploy times ^^
  • 36. Feature “Decider” • Controls to enable and disable computationally or I/O heavy features • Our “Emergency Stop” button, and feature release control. All features go through it. • Changes logged and reported to all teams • Hundreds of knobs we can turn. • Not boolean, expressed as % of userbase
  • 38. request flow Load Balancers Apache Rails (Unicorn) FlockDB Kestrel Memcached MySQL Cassandra Monitoring Daemons Mail Servers
  • 39. Many limiting factors in the request pipeline Apache Rails Worker Model (unicorn) MaxClients 2:1 oversubscribed TCP Listen queue depth to cores Memcached # connections MySQL Varnish (search) # db connections # threads
  • 40. Unicorn Rails Server • Connection push to socket polling model • Deploys without Downtime • Less memory and 30% less CPU • Shift from ProxyPass to Proxy Balancer • mod_proxy_balancer lies about usage • Race condition in counters patched
  • 41. Rails • Front-end (Scala/Java back-end) • Not to blame for our issues. Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Garbage Collection issues (20-25%) • Replication Lag
  • 42. memcached • Network Memory Bus isn’t infinite • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Segmented into pools for better performance • Examine slab allocation and watch for high use/eviction rates on individual slabs using peep. Adjust slab factors and size accordingly.
  • 43. Decomposition • SOA: Service Orientated Architecture • Take application and decompose into services • Admin the services as separate units • Decouple the services from each other
  • 44. Asynchronous Requests • Executing work during the web request is expensive • The request pipeline should not be used to handle 3rd party communications or back-end work. • Move work to queues • Run daemons against queues
  • 45. Thrift • Cross-language services framework • Originally developed at Facebook • Now an Apache project • Seamless operation between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, OCaml (phew!)
  • 46. Kestrel • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala. Open Source.
  • 47. Daemons • Many different types at Twitter. • # of daemons have to match the workload • Early Kestrel would crash if queues filled • “Seppaku” patch • Kill daemons after n requests • Long-running leaky daemons = low memory
  • 48. Flock DB Flock DB • Gizzard sharding framework Gizzard • Billions of edges • MySQL backend Mysql Mysql Mysql • Open Source • http://github.com/twitter/gizzard
  • 49. Disk is the new Tape. • Social Networking application profile has many O(ny) operations. • Page requests have to happen in < 500mS or users start to notice. Goal: 250-300mS • Web 2.0 isn’t possible without lots of RAM • What to do?
  • 50. Caching • We’re “real time”, but still lots of caching opportunity • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter largest contributor to libmemcached
  • 51. Caching • “Cache Everything!” not the best policy, as • Invalidating caches at the right time is difficult. • Cold Cache problem; What happens after power or system failure? • Use cache to augment db, not to replace
  • 52. MySQL • We have many MySQL servers • Increasingly used more and more as key/value store • Many instances spread out through the Gizzard sharding framework
  • 53. MySQL Challenges • Replication Delay • Single threaded replication = pain. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal - we have FlockDB for that • Disk issues • FS Choice, noatime, scheduling algorithm
  • 54. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the write DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
  • 55. Key Points • Use cloud where necessary • Instrument everything. • Use metrics to make decisions, not intuition. • Decompose your app and remove dependencies between services • Process asynchronously when possible
  • 57. Thanks! • We support and use Open Source • http://twitter.com/about/opensource • Work at scale - We’re hiring. • @jointheflock