SlideShare a Scribd company logo
1 of 57
Download to read offline
Scaling Twitter
John Adams
Twitter Operations



                     “Talk Cloudy to Me” 2011
John Adams / @netik
•   Early Twitter employee

•   Operations + Security Team

•   Keynote Speaker: O’Reilly Velocity 2009, 2010

•   O’Reilly Web 2.0 Speaker (2008, 2010)

•   Previous companies: Inktomi, Apple, c|net
Operations

•   Support the site and the developers

•   Make it performant

•   Capacity Planning (metrics-driven)

•   Configuration Management

•   Improve existing architecture
2,940               3,085
Tweets per Second   Tweets per Second
  Japan Scores!       Lakers Win!
6939 Tweets per second
  Japanese New Year
Where’s the cloud?
•   Amazon S3 for Avatar, Background, and Asset
    Storage. Few if any EC2 instances (test only.)

•   Direct S3 is expensive

    •   Cost for data in, data out, data stored

•   Cloudfront is cheaper than direct S3

    •   Cloudfront to S3 origins = free

    •   Use International CDNs on top of Cloudfront
CDNs and Versioning
•   Hard to purge quickly from CDNs

•   DMCA, image takedowns, emergency changes

•   CDN images, javascript, and other static files
    get long-lived expire times from origin

•   Version entire static asset store on deploys

     •   /static/NNNNNN/directory/file...
#newtwitter is an API client
Nothing works the
first time.
•   Scale site using best available technologies

•   Plan to build everything more than once.

•   Most solutions work to a certain level of scale,
    and then you must re-evaluate to grow.

•   This is a continual process.
UNIX friends fail at scale
•   Cron

    •   Add NTP, and many machines executing the
        same thing cause “micro” outages across the site.

•   Syslog

    •   Truncation, data loss, aggregation issues

•   RRD

    •   Data rounding over time
Operations Mantra


     Find
    Weakest
     Point


   Metrics +
Logs + Science =
   Analysis
Operations Mantra


     Find            Take
    Weakest        Corrective
     Point          Action


   Metrics +
Logs + Science =    Process
   Analysis
Operations Mantra

                                  Move to
     Find            Take
                                   Next
    Weakest        Corrective
                                  Weakest
     Point          Action
                                   Point


   Metrics +
Logs + Science =    Process     Repeatability
   Analysis
MTTD
MTTR
Sysadmin 2.0 (Devops)
•   Don’t be a just a sysadmin anymore.

•   Think of Systems management as a
    programming task (puppet, chef, cfengine...)

•   No more silos, or lobbing things over the wall

•   We’re all on the same side. Work Together!
Monitoring
•   Twitter graphs and reports critical metrics in
    as near to real time as possible

•   If you build tools against our API, you should
    too.

•   Use this data to inform the public

    •   dev.twitter.com - API availability

    •   status.twitter.com
Data Analysis
•   Instrumenting the world pays off.

•   “Data analysis, visualization, and other
    techniques for seeing patterns in data are
    going to be an increasingly valuable skill set.
    Employers take notice!”
          “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
Profiling
•   Low-level

•   Identify bottlenecks inside of core tools

    •   Latency, Network Usage, Memory leaks

•   Methods

    •   Network services:

        •   tcpdump + tcpdstat, yconalyzer

    •   Introspect with Google perftools
Forecasting
                              Curve-fitting for capacity planning
                               (R, fityk, Mathematica, CurveFit)



              unsigned int (32 bit)
                Twitpocolypse



  status_id

                                      signed int (32 bit)
                                        Twitpocolypse




                                                  r2=0.99
Loony (tool)
•   Accesses central machine database (MySQL)

    •   Python, Django, Paraminko SSH

    •   Ties into LDAP

•   Filter and list machines, find asset data

•   On demand changes across machines
Murder
•   Bittorrent based replication for deploys
    (Python w/libtorrent)

•   ~30-60 seconds to update >1k machines

•   Uses our machine database to find destination
    hosts

•   Legal P2P
Configuration Management

  •   Start automated configuration management
      EARLY in your company.

  •   Don’t wait until it’s too late.

  •   Twitter started within the first few months.
Virtualization
•   A great start, but...

    •   Images are not a stable way to replicate
        configuration

    •   Changes made to deployed images cannot be
        easily detected and repaired

    •   Typically the only solution is to redeploy
Puppet
•   Puppet + SVN

    •   Hundreds of modules

    •   Runs constantly

    •   Post-Commit hooks for config+sanity chceks

•   No one logs into machines

•   Centralized Change
Issues with Centralized
Management
•   Complex Environment

•   Multiple Admins

•   Unknown Interactions

•   Solution: 2nd set of eyes.
Process through Reviews
Logging
•   Syslog doesn’t work at high traffic rates

    •   No redundancy, no ability to recover from
        daemon failure

•   Moving large files around is painful

•   Solution:

    •   Scribe
Scribe
•   Fault-tolerant data transmission

•   Simple data model, easy to extend

•   Log locally, then scribe to aggregation nodes

•   Slow-start recovery when aggregator goes down

•   Twitter patches

    •   LZO compression and direct HDFS
Hadoop for Ops
•   Once the data’s scribed to HDFS you can:

    •   Aggregate reports across thousands of
        servers

    •   Produce application level metrics

    •   Use map-reduce to gain insight into your
        systems.
Analyze
•   Turn data into information

    •   Where is the code base going?

    •   Are things worse than they were?

        •   Understand the impact of the last software
            deploy

        •   Run check scripts during and after deploys

•   Capacity Planning, not Fire Fighting!
Dashboard
•   “Criticals” view

•   Smokeping/MRTG

•   Google Analytics

    •   Not just for HTTP
        200s/SEO

•   XML Feeds from
    managed services
Whale Watcher
•   Simple shell script, Huge Win

•   Whale = HTTP 503 (timeout)

•   Robot = HTTP 500 (error)

•   Examines last 60 seconds of aggregated
    daemon / www logs

•   “Whales per Second” > Wthreshold

    •   Thar be whales! Call in ops.
Deploy Watcher
Sample window: 300.0 seconds
First start time:
Mon Apr 5 15:30:00 2010 (Mon Apr   5 08:30:00 PDT 2010)
Second start time:
Tue Apr 6 02:09:40 2010 (Mon Apr   5 19:09:40 PDT 2010)

PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB049 CANARY APACHE: ALL OK
WEB049 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY OTHER: ALL OK
Deploys
•   Block deploys if site in error state

•   Graph time-of-deploy along side server CPU
    and Latency

•   Display time-of-last-deploy on dashboard

•   Communicate deploys in Campfire to teams


       ^^ last deploy times ^^
Feature “Decider”
•   Controls to enable and disable
    computationally or I/O heavy features

•   Our “Emergency Stop” button, and feature
    release control. All features go through it.

•   Changes logged and reported to all teams

•   Hundreds of knobs we can turn.

•   Not boolean, expressed as % of userbase
subsystems
request flow
                       Load Balancers


                          Apache


                       Rails (Unicorn)


             FlockDB      Kestrel        Memcached


                 MySQL        Cassandra


Monitoring               Daemons                     Mail Servers
Many limiting factors in the request pipeline

        Apache                    Rails
      Worker Model              (unicorn)
       MaxClients
                           2:1 oversubscribed
  TCP Listen queue depth
                                 to cores


                                Memcached
                               # connections

                                   MySQL
    Varnish (search)          # db connections
       # threads
Unicorn Rails Server
•   Connection push to socket polling model

•   Deploys without Downtime

•   Less memory and 30% less CPU

•   Shift from ProxyPass to Proxy Balancer

    •   mod_proxy_balancer lies about usage

    •   Race condition in counters patched
Rails
•   Front-end (Scala/Java back-end)

•   Not to blame for our issues. Analysis found:

    •   Caching + Cache invalidation problems

    •   Bad queries generated by ActiveRecord,
        resulting in slow queries against the db

    •   Garbage Collection issues (20-25%)

•   Replication Lag
memcached
•   Network Memory Bus isn’t infinite

•   Evictions make the cache unreliable for
    important configuration data
    (loss of darkmode flags, for example)

•   Segmented into pools for better performance

•   Examine slab allocation and watch for high
    use/eviction rates on individual slabs using
    peep. Adjust slab factors and size accordingly.
Decomposition

•   SOA: Service Orientated Architecture

•   Take application and decompose into services

•   Admin the services as separate units

•   Decouple the services from each other
Asynchronous Requests
•   Executing work during the web request is
    expensive

•   The request pipeline should not be used to
    handle 3rd party communications or
    back-end work.

    •   Move work to queues

    •   Run daemons against queues
Thrift
•   Cross-language services framework

•   Originally developed at Facebook

•   Now an Apache project

•   Seamless operation between C++, Java,
    Python, PHP, Ruby, Erlang, Perl, Haskell, C#,
    Cocoa, Smalltalk, OCaml (phew!)
Kestrel
•   Works like memcache (same protocol)

•   SET = enqueue | GET = dequeue

•   No strict ordering of jobs

•   No shared state between servers

•   Written in Scala. Open Source.
Daemons
•   Many different types at Twitter.

•   # of daemons have to match the workload

•   Early Kestrel would crash if queues filled

•   “Seppaku” patch

    •   Kill daemons after n requests

    •   Long-running leaky daemons = low memory
Flock DB                         Flock DB

•   Gizzard sharding
    framework                    Gizzard
•   Billions of edges

•   MySQL backend
                         Mysql    Mysql     Mysql
•   Open Source


 •   http://github.com/twitter/gizzard
Disk is the new Tape.
•   Social Networking application profile has
    many O(ny) operations.

•   Page requests have to happen in < 500mS or
    users start to notice. Goal: 250-300mS

•   Web 2.0 isn’t possible without lots of RAM

•   What to do?
Caching
•   We’re “real time”, but still lots of caching
    opportunity

•   Most caching strategies rely on long TTLs (>60 s)

•   Separate memcache pools for different data types
    to prevent eviction

•   Optimize Ruby Gem to libmemcached + FNV
    Hash instead of Ruby + MD5

•   Twitter largest contributor to libmemcached
Caching
•   “Cache Everything!” not the best policy, as

•   Invalidating caches at the right time is
    difficult.

•   Cold Cache problem; What happens after
    power or system failure?

•   Use cache to augment db, not to replace
MySQL
•   We have many MySQL servers

•   Increasingly used more and more as key/value
    store

•   Many instances spread out through the
    Gizzard sharding framework
MySQL Challenges
•   Replication Delay

    •   Single threaded replication = pain.

•   Social Networking not good for RDBMS

    •   N x N relationships and social graph / tree
        traversal - we have FlockDB for that

    •   Disk issues

        •   FS Choice, noatime, scheduling algorithm
Database Replication
•   Major issues around users and statuses tables

•   Multiple functional masters (FRP, FWP)

•   Make sure your code reads and writes to the
    write DBs. Reading from master = slow death

    •   Monitor the DB. Find slow / poorly designed
        queries

•   Kill long running queries before they kill you
    (mkill)
Key Points
•   Use cloud where necessary

•   Instrument everything.

•   Use metrics to make decisions, not intuition.

•   Decompose your app and remove
    dependencies between services

•   Process asynchronously when possible
Questions?
Thanks!
•   We support and use Open Source

    •   http://twitter.com/about/opensource

•   Work at scale - We’re hiring.

    •   @jointheflock

More Related Content

What's hot

Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Pythondidip
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?Jagadish Venkatraman
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.DECK36
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksHuafeng Wang
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverstonbcoverston
 
FunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang EcosystemFunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang EcosystemRobert Virding
 
Erlang factory 2011 london
Erlang factory 2011 londonErlang factory 2011 london
Erlang factory 2011 londonPaolo Negri
 
Combining the strength of erlang and Ruby
Combining the strength of erlang and RubyCombining the strength of erlang and Ruby
Combining the strength of erlang and RubyMartin Rehfeld
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixGerger
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputPaolo Negri
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Paolo Negri
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_finalRamya Sunil
 

What's hot (17)

Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Super Sizing Youtube with Python
Super Sizing Youtube with PythonSuper Sizing Youtube with Python
Super Sizing Youtube with Python
 
ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?ApacheCon BigData - What it takes to process a trillion events a day?
ApacheCon BigData - What it takes to process a trillion events a day?
 
PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.PHP Backends for Real-Time User Interaction using Apache Storm.
PHP Backends for Real-Time User Interaction using Apache Storm.
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Resource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache StormResource Aware Scheduling in Apache Storm
Resource Aware Scheduling in Apache Storm
 
Functional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming FrameworksFunctional Comparison and Performance Evaluation of Streaming Frameworks
Functional Comparison and Performance Evaluation of Streaming Frameworks
 
Open west 2015 talk ben coverston
Open west 2015 talk ben coverstonOpen west 2015 talk ben coverston
Open west 2015 talk ben coverston
 
FunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang EcosystemFunctionalConf '16 Robert Virding Erlang Ecosystem
FunctionalConf '16 Robert Virding Erlang Ecosystem
 
Erlang factory 2011 london
Erlang factory 2011 londonErlang factory 2011 london
Erlang factory 2011 london
 
Combining the strength of erlang and Ruby
Combining the strength of erlang and RubyCombining the strength of erlang and Ruby
Combining the strength of erlang and Ruby
 
Monitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with ZabbixMonitoring Oracle Database Instances with Zabbix
Monitoring Oracle Database Instances with Zabbix
 
Erlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughputErlang as a cloud citizen, a fractal approach to throughput
Erlang as a cloud citizen, a fractal approach to throughput
 
Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"Erlang factory SF 2011 "Erlang and the big switch in social games"
Erlang factory SF 2011 "Erlang and the big switch in social games"
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Hadoop engineering bo_f_final
Hadoop engineering bo_f_finalHadoop engineering bo_f_final
Hadoop engineering bo_f_final
 

Viewers also liked

Chapter 20
Chapter 20Chapter 20
Chapter 20Liz Vera
 
John adams powerpoint word
John adams powerpoint wordJohn adams powerpoint word
John adams powerpoint wordjessieandkeilea
 
The presidency of john adams
The presidency of john adamsThe presidency of john adams
The presidency of john adamsAllison Barnette
 
John Adams' presidency 8.37
John Adams' presidency 8.37John Adams' presidency 8.37
John Adams' presidency 8.37Blake Harris
 
John Adams Presidency
John Adams PresidencyJohn Adams Presidency
John Adams Presidencyjmorrow123
 
Thomas jefferson
Thomas jeffersonThomas jefferson
Thomas jeffersonshalbert
 
John Adams Presidential Term
John Adams Presidential TermJohn Adams Presidential Term
John Adams Presidential Termst3vo66
 

Viewers also liked (10)

Listening
ListeningListening
Listening
 
Benedict arnold
Benedict arnoldBenedict arnold
Benedict arnold
 
Chapter 20
Chapter 20Chapter 20
Chapter 20
 
John adams powerpoint word
John adams powerpoint wordJohn adams powerpoint word
John adams powerpoint word
 
The presidency of john adams
The presidency of john adamsThe presidency of john adams
The presidency of john adams
 
John Adams' presidency 8.37
John Adams' presidency 8.37John Adams' presidency 8.37
John Adams' presidency 8.37
 
John Adams Presidency
John Adams PresidencyJohn Adams Presidency
John Adams Presidency
 
John adams
John adamsJohn adams
John adams
 
Thomas jefferson
Thomas jeffersonThomas jefferson
Thomas jefferson
 
John Adams Presidential Term
John Adams Presidential TermJohn Adams Presidential Term
John Adams Presidential Term
 

Similar to John adams talk cloudy

Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...Javier García Magna
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSBrett McLain
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Painconfluent
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009John Adams
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenParticular Software
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices💡 Tomasz Kogut
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDocker, Inc.
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scaleOvais Tariq
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDBFoundationDB
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented DesignRodrigo Campos
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly SolarWinds Loggly
 
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...Codemotion
 
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National PoliceCodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National PoliceBert Jan Schrijver
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profitRodrigo Campos
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
 

Similar to John adams talk cloudy (20)

Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
 
Scaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWSScaling a MeteorJS SaaS app on AWS
Scaling a MeteorJS SaaS app on AWS
 
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum PainKafka Summit SF 2017 - Running Kafka for Maximum Pain
Kafka Summit SF 2017 - Running Kafka for Maximum Pain
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
 
The impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves GoelevenThe impact of cloud NSBCon NY by Yves Goeleven
The impact of cloud NSBCon NY by Yves Goeleven
 
Exploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservicesExploring Twitter's Finagle technology stack for microservices
Exploring Twitter's Finagle technology stack for microservices
 
DCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at NetflixDCSF19 Container Security: Theory & Practice at Netflix
DCSF19 Container Security: Theory & Practice at Netflix
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
Microservices in action at the Dutch National Police - Bert Jan Schrijver - C...
 
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National PoliceCodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
CodeMotion Amsterdam 2018 - Microservices in action at the Dutch National Police
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 

Recently uploaded

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

John adams talk cloudy

  • 1. Scaling Twitter John Adams Twitter Operations “Talk Cloudy to Me” 2011
  • 2. John Adams / @netik • Early Twitter employee • Operations + Security Team • Keynote Speaker: O’Reilly Velocity 2009, 2010 • O’Reilly Web 2.0 Speaker (2008, 2010) • Previous companies: Inktomi, Apple, c|net
  • 3. Operations • Support the site and the developers • Make it performant • Capacity Planning (metrics-driven) • Configuration Management • Improve existing architecture
  • 4. 2,940 3,085 Tweets per Second Tweets per Second Japan Scores! Lakers Win!
  • 5. 6939 Tweets per second Japanese New Year
  • 6. Where’s the cloud? • Amazon S3 for Avatar, Background, and Asset Storage. Few if any EC2 instances (test only.) • Direct S3 is expensive • Cost for data in, data out, data stored • Cloudfront is cheaper than direct S3 • Cloudfront to S3 origins = free • Use International CDNs on top of Cloudfront
  • 7. CDNs and Versioning • Hard to purge quickly from CDNs • DMCA, image takedowns, emergency changes • CDN images, javascript, and other static files get long-lived expire times from origin • Version entire static asset store on deploys • /static/NNNNNN/directory/file...
  • 8. #newtwitter is an API client
  • 9. Nothing works the first time. • Scale site using best available technologies • Plan to build everything more than once. • Most solutions work to a certain level of scale, and then you must re-evaluate to grow. • This is a continual process.
  • 10. UNIX friends fail at scale • Cron • Add NTP, and many machines executing the same thing cause “micro” outages across the site. • Syslog • Truncation, data loss, aggregation issues • RRD • Data rounding over time
  • 11. Operations Mantra Find Weakest Point Metrics + Logs + Science = Analysis
  • 12. Operations Mantra Find Take Weakest Corrective Point Action Metrics + Logs + Science = Process Analysis
  • 13. Operations Mantra Move to Find Take Next Weakest Corrective Weakest Point Action Point Metrics + Logs + Science = Process Repeatability Analysis
  • 14. MTTD
  • 15. MTTR
  • 16. Sysadmin 2.0 (Devops) • Don’t be a just a sysadmin anymore. • Think of Systems management as a programming task (puppet, chef, cfengine...) • No more silos, or lobbing things over the wall • We’re all on the same side. Work Together!
  • 17. Monitoring • Twitter graphs and reports critical metrics in as near to real time as possible • If you build tools against our API, you should too. • Use this data to inform the public • dev.twitter.com - API availability • status.twitter.com
  • 18. Data Analysis • Instrumenting the world pays off. • “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
  • 19. Profiling • Low-level • Identify bottlenecks inside of core tools • Latency, Network Usage, Memory leaks • Methods • Network services: • tcpdump + tcpdstat, yconalyzer • Introspect with Google perftools
  • 20. Forecasting Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) unsigned int (32 bit) Twitpocolypse status_id signed int (32 bit) Twitpocolypse r2=0.99
  • 21. Loony (tool) • Accesses central machine database (MySQL) • Python, Django, Paraminko SSH • Ties into LDAP • Filter and list machines, find asset data • On demand changes across machines
  • 22. Murder • Bittorrent based replication for deploys (Python w/libtorrent) • ~30-60 seconds to update >1k machines • Uses our machine database to find destination hosts • Legal P2P
  • 23. Configuration Management • Start automated configuration management EARLY in your company. • Don’t wait until it’s too late. • Twitter started within the first few months.
  • 24. Virtualization • A great start, but... • Images are not a stable way to replicate configuration • Changes made to deployed images cannot be easily detected and repaired • Typically the only solution is to redeploy
  • 25. Puppet • Puppet + SVN • Hundreds of modules • Runs constantly • Post-Commit hooks for config+sanity chceks • No one logs into machines • Centralized Change
  • 26. Issues with Centralized Management • Complex Environment • Multiple Admins • Unknown Interactions • Solution: 2nd set of eyes.
  • 28. Logging • Syslog doesn’t work at high traffic rates • No redundancy, no ability to recover from daemon failure • Moving large files around is painful • Solution: • Scribe
  • 29. Scribe • Fault-tolerant data transmission • Simple data model, easy to extend • Log locally, then scribe to aggregation nodes • Slow-start recovery when aggregator goes down • Twitter patches • LZO compression and direct HDFS
  • 30. Hadoop for Ops • Once the data’s scribed to HDFS you can: • Aggregate reports across thousands of servers • Produce application level metrics • Use map-reduce to gain insight into your systems.
  • 31. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 32. Dashboard • “Criticals” view • Smokeping/MRTG • Google Analytics • Not just for HTTP 200s/SEO • XML Feeds from managed services
  • 33. Whale Watcher • Simple shell script, Huge Win • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 60 seconds of aggregated daemon / www logs • “Whales per Second” > Wthreshold • Thar be whales! Call in ops.
  • 34. Deploy Watcher Sample window: 300.0 seconds First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010) Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010) PRODUCTION APACHE: ALL OK PRODUCTION OTHER: ALL OK WEB049 CANARY APACHE: ALL OK WEB049 CANARY BACKEND SERVICES: ALL OK DAEMON031 CANARY BACKEND SERVICES: ALL OK DAEMON031 CANARY OTHER: ALL OK
  • 35. Deploys • Block deploys if site in error state • Graph time-of-deploy along side server CPU and Latency • Display time-of-last-deploy on dashboard • Communicate deploys in Campfire to teams ^^ last deploy times ^^
  • 36. Feature “Decider” • Controls to enable and disable computationally or I/O heavy features • Our “Emergency Stop” button, and feature release control. All features go through it. • Changes logged and reported to all teams • Hundreds of knobs we can turn. • Not boolean, expressed as % of userbase
  • 38. request flow Load Balancers Apache Rails (Unicorn) FlockDB Kestrel Memcached MySQL Cassandra Monitoring Daemons Mail Servers
  • 39. Many limiting factors in the request pipeline Apache Rails Worker Model (unicorn) MaxClients 2:1 oversubscribed TCP Listen queue depth to cores Memcached # connections MySQL Varnish (search) # db connections # threads
  • 40. Unicorn Rails Server • Connection push to socket polling model • Deploys without Downtime • Less memory and 30% less CPU • Shift from ProxyPass to Proxy Balancer • mod_proxy_balancer lies about usage • Race condition in counters patched
  • 41. Rails • Front-end (Scala/Java back-end) • Not to blame for our issues. Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Garbage Collection issues (20-25%) • Replication Lag
  • 42. memcached • Network Memory Bus isn’t infinite • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Segmented into pools for better performance • Examine slab allocation and watch for high use/eviction rates on individual slabs using peep. Adjust slab factors and size accordingly.
  • 43. Decomposition • SOA: Service Orientated Architecture • Take application and decompose into services • Admin the services as separate units • Decouple the services from each other
  • 44. Asynchronous Requests • Executing work during the web request is expensive • The request pipeline should not be used to handle 3rd party communications or back-end work. • Move work to queues • Run daemons against queues
  • 45. Thrift • Cross-language services framework • Originally developed at Facebook • Now an Apache project • Seamless operation between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, OCaml (phew!)
  • 46. Kestrel • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala. Open Source.
  • 47. Daemons • Many different types at Twitter. • # of daemons have to match the workload • Early Kestrel would crash if queues filled • “Seppaku” patch • Kill daemons after n requests • Long-running leaky daemons = low memory
  • 48. Flock DB Flock DB • Gizzard sharding framework Gizzard • Billions of edges • MySQL backend Mysql Mysql Mysql • Open Source • http://github.com/twitter/gizzard
  • 49. Disk is the new Tape. • Social Networking application profile has many O(ny) operations. • Page requests have to happen in < 500mS or users start to notice. Goal: 250-300mS • Web 2.0 isn’t possible without lots of RAM • What to do?
  • 50. Caching • We’re “real time”, but still lots of caching opportunity • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter largest contributor to libmemcached
  • 51. Caching • “Cache Everything!” not the best policy, as • Invalidating caches at the right time is difficult. • Cold Cache problem; What happens after power or system failure? • Use cache to augment db, not to replace
  • 52. MySQL • We have many MySQL servers • Increasingly used more and more as key/value store • Many instances spread out through the Gizzard sharding framework
  • 53. MySQL Challenges • Replication Delay • Single threaded replication = pain. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal - we have FlockDB for that • Disk issues • FS Choice, noatime, scheduling algorithm
  • 54. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the write DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
  • 55. Key Points • Use cloud where necessary • Instrument everything. • Use metrics to make decisions, not intuition. • Decompose your app and remove dependencies between services • Process asynchronously when possible
  • 57. Thanks! • We support and use Open Source • http://twitter.com/about/opensource • Work at scale - We’re hiring. • @jointheflock