Billions of Hits:
Scaling Twitter
John Adams
Twitter Operations
#chirpscale
John Adams                            @netik
•   Early Twitter employee (mid-2008)

•   Lead engineer: Outward Facing Services (Apache,
    Unicorn, SMTP), Auth, Security

•   Keynote Speaker: O’Reilly Velocity 2009

•   O’Reilly Web 2.0 Speaker (2008, 2010)

•   Previous companies: Inktomi, Apple, c|net

•   Working on Web Operations book with John Allspaw
    (Flickr, Etsy), out in June
Growth.
752%
                       2008 Growth
source: comscore.com - (based only on www traffic, not API)
1358%
                       2009 Growth
source: comscore.com - (based only on www traffic, not API)
12th
                    most popular
source: alexa.com
55M
                   Tweets per day
                  (640 TPS average, 1000 TPS peak)
source: twitter.com internal
600M
                   Searches/Day
source: twitter.com internal
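As a quick check on the two daily figures above, the per-second averages can be worked out directly. This is plain arithmetic; the function and variable names are illustrative, not from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def average_per_second(events_per_day: float) -> float:
    """Convert a daily event count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY

# Daily totals quoted on the slides.
tweets_per_second = average_per_second(55_000_000)
searches_per_second = average_per_second(600_000_000)

print(f"tweets:   {tweets_per_second:.0f}/s average")
print(f"searches: {searches_per_second:.0f}/s average")
```

The tweet average lands close to the 640 TPS quoted on the slide; peaks are higher because traffic is not evenly spread over the day.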
Traffic split: Web 25%, API 75%
Operations
•   What do we do?

    •   Site Availability

    •   Capacity Planning (metrics-driven)

    •   Configuration Management

    •   Security

    •   Much more than basic Sysadmin
What have we done?
•   Improved response time, reduced latency

•   Fewer errors during deploys (Unicorn!)

•   Faster performance

•   Lower MTTD (Mean time to Detect)

•   Lower MTTR (Mean time to Recovery)
Operations Mantra
 Find Weakest Point                   |  Take Corrective Action  |  Move to Next Weakest Point
 Metrics + Logs + Science = Analysis  |  Process                 |  Repeatability
Make an attack plan.
 Symptom          Bottleneck   Vector          Solution
 Bandwidth        Network      HTTP Latency    Servers++
 Timeline Delay   Database     Update Delay    Better algorithm
 Status Growth    Database     Delays          Flock, Cassandra
 Updates          Algorithm    Latency         Algorithms
Finding Weakness
•   Metrics + Graphs

    •   Individual metrics are irrelevant

    •   We aggregate metrics to find knowledge

•   Logs

•   SCIENCE!
Monitoring
•   Twitter graphs and reports critical metrics in
    as near real time as possible

•   If you build tools against our API, you should
    too.

    •   RRD, other Time-Series DB solutions

    •   Ganglia + custom gmetric scripts

•   dev.twitter.com - API availability
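A custom gmetric script can be as small as a wrapper that builds one gmetric command per sample. The sketch below only constructs and prints the command; the metric name and value are made up for illustration, and actually publishing it assumes a host with Ganglia's gmetric installed.

```python
import shlex
from typing import List

def gmetric_command(name: str, value: float, units: str,
                    metric_type: str = "float") -> List[str]:
    """Build the argument list for a single gmetric invocation."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--units", units,
    ]

# Hypothetical metric: 503 responses per second seen by a log scraper.
cmd = gmetric_command("http_503_per_sec", 0.25, "errors/sec")
print(shlex.join(cmd))

# To publish for real, hand the list to subprocess.run(cmd, check=True).
```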
Analyze
•   Turn data into information

    •   Where is the code base going?

    •   Are things worse than they were?

        •   Understand the impact of the last software
            deploy

        •   Run check scripts during and after deploys

•   Capacity Planning, not Fire Fighting!
Data Analysis
•   Instrumenting the world pays off.

•   “Data analysis, visualization, and other
    techniques for seeing patterns in data are
    going to be an increasingly valuable skill set.
    Employers take notice!”
          “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
Forecasting
•   Curve-fitting for capacity planning
    (R, fityk, Mathematica, CurveFit)

[Chart: status_id growth over time with a fitted curve (r² = 0.99), crossing the signed 32-bit int limit and then the unsigned 32-bit int limit, each marked "Twitpocalypse"]
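The forecasting idea can be sketched in a few lines: fit a curve to the largest status_id seen over time, then solve for the day it reaches each 32-bit limit. The data below is synthetic (it is not Twitter's), and the exponential growth model is an assumption chosen for illustration; the talk only says curve-fitting tools were used.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

def day_reaching(limit, a, b):
    """Day at which the fitted curve exp(a + b*t) reaches `limit`."""
    return (math.log(limit) - a) / b

# Synthetic series: max status_id sampled every 30 days,
# growing 10% per sample from 500 million.
days = [30 * i for i in range(12)]
max_ids = [500_000_000 * 1.10 ** i for i in range(12)]

# Fitting a line to log(id) is equivalent to fitting an exponential to id.
a, b, r2 = fit_line(days, [math.log(v) for v in max_ids])
signed_day = day_reaching(2**31 - 1, a, b)
unsigned_day = day_reaching(2**32 - 1, a, b)

print(f"r^2 = {r2:.3f}")
print(f"signed 32-bit limit reached around day {signed_day:.0f}")
print(f"unsigned 32-bit limit reached around day {unsigned_day:.0f}")
```

With real measurements the fit would be noisier, which is why the r² value is worth reporting next to any forecast.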
Internal Dashboard
External API Dashboard




   http://dev.twitter.com/status
What’s a Robot?
•   Actual error in the Rails stack (HTTP 500)

•   Uncaught Exception

•   Code problem, or failure / nil result

•   Increases our exception count

•   Shows up in Reports
What’s a Whale?
•   HTTP Error 502, 503

•   Twitter has a hard and fast five second timeout

•   We’d rather fail fast than block on requests

•   We also kill long-running queries (mkill)

•   Timeout
Whale Watcher
•   Simple shell script,

    •   MASSIVE WIN by @ronpepsi

•   Whale = HTTP 503 (timeout)

•   Robot = HTTP 500 (error)

•   Examines last 60 seconds of
    aggregated daemon / www logs

•   “Whales per Second” > W_threshold

    •   Thar be whales! Call in ops.
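The original Whale Watcher is a shell script that is not shown in the slides. The logic described above can be sketched in Python as follows, assuming log lines have already been parsed into (timestamp, HTTP status) pairs; the sample data and thresholds are hypothetical.

```python
from collections import Counter

WHALE_STATUS = 503  # timeout
ROBOT_STATUS = 500  # application error

def count_recent(entries, now, window=60):
    """Count whales and robots among (timestamp, status) pairs
    that fall within the last `window` seconds."""
    counts = Counter()
    for timestamp, status in entries:
        if now - window < timestamp <= now:
            if status == WHALE_STATUS:
                counts["whale"] += 1
            elif status == ROBOT_STATUS:
                counts["robot"] += 1
    return counts

def whales_alert(entries, now, threshold, window=60):
    """True when whales per second over the window exceed the threshold."""
    return count_recent(entries, now, window)["whale"] / window > threshold

# Hypothetical parsed log: one request per second, every other one a 503,
# plus one 500 inside the window and one old 503 outside it.
log = [(1000 + i, 503 if i % 2 == 0 else 200) for i in range(60)]
log += [(1030, 500), (900, 503)]

counts = count_recent(log, now=1059)
if whales_alert(log, now=1059, threshold=0.25):
    print("Thar be whales! Call in ops.")
```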
Deploy Watcher
Sample window: 300.0 seconds

First start time:
Mon Apr 5 15:30:00 2010 (Mon Apr   5 08:30:00 PDT 2010)
Second start time:
Tue Apr 6 02:09:40 2010 (Mon Apr   5 19:09:40 PDT 2010)

PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB0049 CANARY APACHE: ALL OK
WEB0049 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY OTHER: ALL OK
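The report above compares two equal-length sample windows, one before and one after a deploy. The slides do not show how the comparison is made, so the sketch below is one plausible version: it flags a group when its error count grows well past the baseline. The tolerance rule, the counts, and the "REGRESSION" wording are assumptions.

```python
def compare_windows(before, after, tolerance=1.5):
    """Return a status string per group, comparing error counts
    from two sample windows of the same length."""
    report = {}
    for group in sorted(before):
        old = before[group]
        new = after.get(group, 0)
        # Allow growth up to `tolerance` times the baseline, plus one
        # error of slack so a 0 -> 1 change is not flagged as a regression.
        if new <= tolerance * old + 1:
            report[group] = "ALL OK"
        else:
            report[group] = f"REGRESSION ({old} -> {new})"
    return report

# Hypothetical error counts per group for the two windows.
first_window = {"PRODUCTION APACHE": 10, "WEB0049 CANARY APACHE": 2}
second_window = {"PRODUCTION APACHE": 12, "WEB0049 CANARY APACHE": 40}

report = compare_windows(first_window, second_window)
for group, status in report.items():
    print(f"{group}: {status}")
```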
Feature “Darkmode”
•   Specific site controls to enable and disable
    computationally or I/O-heavy site functions

•   The “Emergency Stop” button

•   Changes logged and reported to all teams

•   Around 60 switches we can throw

•   Static / Read-only mode
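A minimal sketch of the Darkmode idea: named switches that can be thrown individually, an emergency stop that disables all of them, and a log of every change. The class, the switch names, and the in-memory storage are hypothetical; the slides do not describe the real implementation.

```python
import time

class Darkmode:
    """Named feature switches with an audit trail of every change."""

    def __init__(self, switches):
        self._enabled = dict(switches)
        self.audit_log = []  # (timestamp, who, switch, new_state)

    def enabled(self, name):
        return self._enabled[name]

    def set(self, name, enabled, who):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = enabled
        self.audit_log.append((time.time(), who, name, enabled))

    def emergency_stop(self, who):
        """Disable every switch at once."""
        for name in list(self._enabled):
            self.set(name, False, who)

# Hypothetical switches guarding expensive features.
darkmode = Darkmode({"search": True, "trends": True})
darkmode.set("trends", False, who="ops")
trends_after_toggle = darkmode.enabled("trends")
search_after_toggle = darkmode.enabled("search")
darkmode.emergency_stop(who="ops")
```

In application code each expensive feature would check its switch before doing work and fall back to a cheap or static response when the switch is off.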
request flow
           Load Balancers

         Apache mod_proxy

           Rails (Unicorn)

 Flock      memcached        Kestrel

         MySQL      Cassandra

             Daemons
Servers
•   Co-located, dedicated machines at NTT America

•   No clouds for serving; we use them only for monitoring

    •   Need raw processing power, latency too high
        in existing cloud offerings

•   Frees us to deal with real, intellectual, computer
    science problems.

•   Moving to our own data center soon
unicorn
•   A single socket Rails application Server (Rack)

•   Zero Downtime Deploys (!)

    •   Controlled, shuffled transfer to new code

•   Less memory, 30% less CPU

•   Shift from mod_proxy_balancer to
    mod_proxy (ProxyPass)

    •   HAProxy and Nginx weren’t any better. Really.
Rails
•   Mostly only for front-end.

•   Back end mostly Scala and pure Ruby

•   Not to blame for our issues. Analysis found:

    •   Caching + Cache invalidation problems

    •   Bad queries generated by ActiveRecord, resulting in
        slow queries against the db

    •   Queue Latency

•   Replication Lag
memcached
•   memcached isn’t perfect.

    •   Memcached SEGVs hurt us early on.

•   Evictions make the cache unreliable for
    important configuration data
    (loss of darkmode flags, for example)

•   Network Memory Bus isn’t infinite

•   Segmented into pools for better performance
Loony
•   Central machine database (MySQL)

    •   Python, Django, Paramiko SSH

        •   Paramiko - Twitter OSS (@robey)

    •   Ties into LDAP groups

•   When data center sends us email, machine
    definitions built in real-time
Murder
•   @lg rocks!

•   BitTorrent-based replication for deploys

•   ~30-60 seconds to update >1k machines

•   P2P - Legal, valid, Awesome.
Kestrel
•   @robey

•   Works like memcache (same protocol)

•   SET = enqueue | GET = dequeue

•   No strict ordering of jobs

•   No shared state between servers

•   Written in Scala.
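The queue semantics listed above can be modeled in memory: SET appends to a named queue, GET pops from it, and each server keeps its own queues with no shared state. This is a toy model rather than a Kestrel client; the class names and the random server choice are assumptions made for illustration.

```python
import random
from collections import defaultdict, deque

class QueueServer:
    """In-memory stand-in for one queue server holding named FIFO queues."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, queue_name, item):
        """SET = enqueue."""
        self._queues[queue_name].append(item)

    def get(self, queue_name):
        """GET = dequeue; returns None when the queue is empty."""
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None

class Cluster:
    """Client-side view of several independent servers."""

    def __init__(self, servers, seed=0):
        self._servers = servers
        self._rng = random.Random(seed)

    def enqueue(self, queue_name, item):
        # Any server will do; each one preserves only its own FIFO order,
        # so there is no strict ordering across the cluster.
        self._rng.choice(self._servers).set(queue_name, item)

    def drain(self, queue_name):
        """Dequeue everything, one server at a time."""
        items = []
        for server in self._servers:
            while (item := server.get(queue_name)) is not None:
                items.append(item)
        return items

cluster = Cluster([QueueServer() for _ in range(3)])
for job_id in range(10):
    cluster.enqueue("new_tweets", job_id)
drained = cluster.drain("new_tweets")
```

Every job comes back exactly once, but the overall order depends on which server each job landed on.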
Asynchronous Requests
•   Inbound traffic consumes a unicorn worker

•   Outbound traffic consumes a unicorn worker

•   The request pipeline should not be used to
    handle 3rd party communications or
    back-end work.

•   Reroute traffic to daemons
Daemons
•   Daemons touch every tweet

•   Many different daemon types at Twitter

•   Old way: One daemon per type (Rails)

    •   New way: Fewer Daemons (Pure Ruby)

•   Daemon Slayer - A Multi Daemon that could
    do many different jobs, all at once.
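One way to picture a multi daemon is a single worker that dispatches each job to a handler by type, instead of running a separate daemon per type. The sketch below is a guess at the shape of that idea, not Daemon Slayer itself; the job types and handlers are hypothetical.

```python
class MultiDaemon:
    """One worker that handles many job types via registered handlers."""

    def __init__(self):
        self._handlers = {}

    def register(self, job_type, handler):
        self._handlers[job_type] = handler

    def run(self, jobs):
        """Process jobs in order; unknown types are reported, not dropped silently."""
        results = []
        for job in jobs:
            handler = self._handlers.get(job["type"])
            if handler is None:
                results.append(("skipped", job["type"]))
                continue
            results.append((job["type"], handler(job["payload"])))
        return results

# Hypothetical job types that would otherwise block a request worker.
daemon = MultiDaemon()
daemon.register("send_email", lambda payload: f"mailed {payload}")
daemon.register("fanout", lambda payload: f"delivered tweet {payload}")

results = daemon.run([
    {"type": "fanout", "payload": 42},
    {"type": "send_email", "payload": "user@example.com"},
    {"type": "unknown", "payload": None},
])
```

In a real deployment the job list would come from a queue fed by the web tier, which keeps third-party calls and back-end work out of the request pipeline.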
Disk is the new Tape.
•   Social Networking application profile has
    many O(n^y) operations.

•   Page requests have to happen in < 500 ms or
    users start to notice. Goal: 250-300 ms

•   Web 2.0 isn’t possible without lots of RAM

•   SSDs? What to do?
Caching
•   We’re the real-time web, but lots of caching
    opportunity. You should cache what you get from us.

•   Most caching strategies rely on long TTLs (>60 s)

•   Separate memcache pools for different data types to
    prevent eviction

•   Optimized the Ruby gem: libmemcached + FNV hash
    instead of Ruby + MD5

•   Twitter now largest contributor to libmemcached
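To show why a hash like FNV is cheap compared with MD5, and how separate pools keep one data type from evicting another, here is a small sketch. It uses the 32-bit FNV-1a variant; the slides do not say which FNV variant was used, and the pool names, host names, and simple modulo placement are assumptions.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    return h

# Hypothetical pools: each data type gets its own set of cache servers,
# so heavy timeline traffic cannot evict configuration flags.
POOLS = {
    "timeline": ["tl-cache-1", "tl-cache-2", "tl-cache-3"],
    "config": ["cfg-cache-1", "cfg-cache-2"],
}

def server_for(pool: str, key: str) -> str:
    """Pick a server within a pool by hashing the key."""
    servers = POOLS[pool]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]

chosen = server_for("config", "darkmode:search")
```

Modulo placement remaps most keys whenever a pool changes size; consistent hashing is the usual way around that and is left out here to keep the sketch short.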
MySQL
•   Sharding large volumes of data is hard

•   Replication delay and cache eviction produce
    inconsistent results to the end user.

•   Locks create resource contention for popular
    data
MySQL Challenges
•   Replication Delay

    •   Single threaded. Slow.

•   Social Networking not good for RDBMS

    •   N x N relationships and social graph / tree
        traversal

    •   Disk issues (FS Choice, noatime, scheduling
        algorithm)
Relational Databases
not a Panacea
•   Good for:

    •   Users, Relational Data, Transactions

•   Bad:

    •   Queues. Polling operations. Social Graph.

•   You don’t need ACID for everything.
Database Replication
•   Major issues around users and statuses tables

•   Multiple functional masters (FRP, FWP)

•   Make sure your code reads and writes to the
    right DBs. Reading from master = slow death

    •   Monitor the DB. Find slow / poorly designed
        queries

•   Kill long running queries before they kill you
    (mkill)
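The mkill tool itself is not shown in the slides. The selection step it implies can be sketched against rows shaped like MySQL's SHOW PROCESSLIST output: pick client queries that have run past a limit and emit a KILL statement for each. The sample rows and the five-second limit are hypothetical, and nothing here connects to a database.

```python
def long_running_queries(process_list, max_seconds):
    """Return ids of client queries that have run longer than max_seconds."""
    return [
        row["Id"]
        for row in process_list
        # Only "Query" threads are candidates; idle connections and
        # replication threads are left alone.
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]

def kill_statements(process_list, max_seconds):
    """Build one KILL statement per offending query."""
    return [f"KILL {query_id}"
            for query_id in long_running_queries(process_list, max_seconds)]

# Hypothetical rows shaped like SHOW PROCESSLIST output.
processes = [
    {"Id": 11, "Command": "Query", "Time": 2, "Info": "SELECT ..."},
    {"Id": 12, "Command": "Query", "Time": 45, "Info": "SELECT ..."},
    {"Id": 13, "Command": "Sleep", "Time": 300, "Info": None},
    {"Id": 14, "Command": "Binlog Dump", "Time": 9000, "Info": None},
]

statements = kill_statements(processes, max_seconds=5)
print(statements)
```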
Flock
•   Scalable Social Graph Store

•   Sharding via Gizzard

•   MySQL backend (many)

•   13 billion edges,
    100K reads/second

•   Open Source!

[Diagram: Flock -> Gizzard -> MySQL, MySQL, MySQL]
Cassandra
•   Originally written by Facebook

•   Distributed Data Store

•   @rk’s changes to Cassandra Open Sourced

•   Currently double-writing into it

•   Transitioning to 100% soon.
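Double-writing during a migration can be sketched as a wrapper that writes to both stores, keeps reading from the old one, and can report where the two disagree. Plain dicts stand in for MySQL and Cassandra here; the class and method names are hypothetical and this is not how Twitter's migration code was written.

```python
class DualWriter:
    """Write to the existing store and the new store; read from the existing one."""

    def __init__(self, primary, secondary):
        self.primary = primary        # authoritative store during migration
        self.secondary = secondary    # new store being filled in parallel
        self.secondary_errors = 0

    def write(self, key, value):
        self.primary[key] = value
        try:
            self.secondary[key] = value
        except Exception:
            # A failure in the new store must not break the request.
            self.secondary_errors += 1

    def read(self, key):
        return self.primary[key]

    def mismatches(self):
        """Keys whose values differ between the two stores."""
        return [k for k, v in self.primary.items()
                if self.secondary.get(k) != v]

old_store, new_store = {"status:1": "hello"}, {}
writer = DualWriter(old_store, new_store)
writer.write("status:2", "world")
missing = writer.mismatches()
```

The mismatch list shows what still needs backfilling before reads can move to the new store.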
Lessons Learned
•   Instrument everything. Start graphing early.

•   Cache as much as possible

•   Start working on scaling early.

•   Don’t rely on memcache, and don’t rely on the
    database

•   Don’t use Mongrel. Use Unicorn.
Join Us!
@jointheflock
Q&A
Thanks!
•   @jointheflock

•   http://twitter.com/jobs

•   Download our work

    •   http://twitter.com/about/opensource

More Related Content

What's hot

Securing DevOps through Privileged Access Management
Securing DevOps through Privileged Access ManagementSecuring DevOps through Privileged Access Management
Securing DevOps through Privileged Access ManagementBeyondTrust
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
OReilly SACON London: Potholes in the road from monolithic hell: Microservice...
OReilly SACON London: Potholes in the road from monolithic hell: Microservice...OReilly SACON London: Potholes in the road from monolithic hell: Microservice...
OReilly SACON London: Potholes in the road from monolithic hell: Microservice...Chris Richardson
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Deploying Privileged Access Workstations (PAWs)
Deploying Privileged Access Workstations (PAWs)Deploying Privileged Access Workstations (PAWs)
Deploying Privileged Access Workstations (PAWs)Blue Teamer
 
Cyber Threat hunting workshop
Cyber Threat hunting workshopCyber Threat hunting workshop
Cyber Threat hunting workshopArpan Raval
 
Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...
Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...
Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...DevOps.com
 
Access Control Models: Controlling Resource Authorization
Access Control Models: Controlling Resource AuthorizationAccess Control Models: Controlling Resource Authorization
Access Control Models: Controlling Resource AuthorizationMark Niebergall
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyKishore Gopalakrishna
 
Developing event-driven microservices with event sourcing and CQRS (phillyete)
Developing event-driven microservices with event sourcing and CQRS (phillyete)Developing event-driven microservices with event sourcing and CQRS (phillyete)
Developing event-driven microservices with event sourcing and CQRS (phillyete)Chris Richardson
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Domain Driven Design com Python
Domain Driven Design com PythonDomain Driven Design com Python
Domain Driven Design com PythonFrederico Cabral
 
DNS DDoS mitigation using Amazon Route 53 and AWS Shield
DNS DDoS mitigation using Amazon Route 53 and AWS ShieldDNS DDoS mitigation using Amazon Route 53 and AWS Shield
DNS DDoS mitigation using Amazon Route 53 and AWS ShieldAmazon Web Services
 
"X" Driven-Development Methodologies
"X" Driven-Development Methodologies"X" Driven-Development Methodologies
"X" Driven-Development MethodologiesDamian T. Gordon
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2Dvir Volk
 
Database Performance Tuning
Database Performance Tuning Database Performance Tuning
Database Performance Tuning Arno Huetter
 
MITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITRE
MITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITREMITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITRE
MITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITREMITRE - ATT&CKcon
 

What's hot (20)

Securing DevOps through Privileged Access Management
Securing DevOps through Privileged Access ManagementSecuring DevOps through Privileged Access Management
Securing DevOps through Privileged Access Management
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
OReilly SACON London: Potholes in the road from monolithic hell: Microservice...
OReilly SACON London: Potholes in the road from monolithic hell: Microservice...OReilly SACON London: Potholes in the road from monolithic hell: Microservice...
OReilly SACON London: Potholes in the road from monolithic hell: Microservice...
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
 
Deploying Privileged Access Workstations (PAWs)
Deploying Privileged Access Workstations (PAWs)Deploying Privileged Access Workstations (PAWs)
Deploying Privileged Access Workstations (PAWs)
 
Cyber Threat hunting workshop
Cyber Threat hunting workshopCyber Threat hunting workshop
Cyber Threat hunting workshop
 
Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...
Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...
Centralize and Simplify Secrets Management for Red Hat OpenShift Container En...
 
Access Control Models: Controlling Resource Authorization
Access Control Models: Controlling Resource AuthorizationAccess Control Models: Controlling Resource Authorization
Access Control Models: Controlling Resource Authorization
 
Building real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case studyBuilding real time analytics applications using pinot : A LinkedIn case study
Building real time analytics applications using pinot : A LinkedIn case study
 
Developing event-driven microservices with event sourcing and CQRS (phillyete)
Developing event-driven microservices with event sourcing and CQRS (phillyete)Developing event-driven microservices with event sourcing and CQRS (phillyete)
Developing event-driven microservices with event sourcing and CQRS (phillyete)
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Domain Driven Design com Python
Domain Driven Design com PythonDomain Driven Design com Python
Domain Driven Design com Python
 
DNS DDoS mitigation using Amazon Route 53 and AWS Shield
DNS DDoS mitigation using Amazon Route 53 and AWS ShieldDNS DDoS mitigation using Amazon Route 53 and AWS Shield
DNS DDoS mitigation using Amazon Route 53 and AWS Shield
 
"X" Driven-Development Methodologies
"X" Driven-Development Methodologies"X" Driven-Development Methodologies
"X" Driven-Development Methodologies
 
Introduction to redis - version 2
Introduction to redis - version 2Introduction to redis - version 2
Introduction to redis - version 2
 
Database Performance Tuning
Database Performance Tuning Database Performance Tuning
Database Performance Tuning
 
MITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITRE
MITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITREMITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITRE
MITRE ATT&CKcon 2.0: State of the ATT&CK; Blake Strom, MITRE
 
AWS DynamoDB and Schema Design
AWS DynamoDB and Schema DesignAWS DynamoDB and Schema Design
AWS DynamoDB and Schema Design
 

Viewers also liked

Scaling Instagram
Scaling InstagramScaling Instagram
Scaling Instagramiammutex
 
Embracing Open Source: Practice and Experience from Alibaba
Embracing Open Source: Practice and Experience from AlibabaEmbracing Open Source: Practice and Experience from Alibaba
Embracing Open Source: Practice and Experience from AlibabaWensong Zhang
 
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...Jon Watte
 
Us history shaping a new nation
Us history shaping a new nationUs history shaping a new nation
Us history shaping a new nationMrO97
 
LVS development and experience
LVS development and experienceLVS development and experience
LVS development and experienceWensong Zhang
 
3. shaping a new nation [1782 1788]
3. shaping a new nation [1782 1788]3. shaping a new nation [1782 1788]
3. shaping a new nation [1782 1788]jtoma84
 
Product design: How to create a product
Product design: How to create a productProduct design: How to create a product
Product design: How to create a productPress42
 
Scaling Twitter with Cassandra
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with CassandraRyan King
 
Washington Presidency
Washington PresidencyWashington Presidency
Washington Presidencymrsvogel
 
The presidency of george washingtion ppt for notes
The presidency of george washingtion ppt for notesThe presidency of george washingtion ppt for notes
The presidency of george washingtion ppt for notesMatthew Fulghum
 
Lesson Plan_Us History_the birth of a new nation
Lesson Plan_Us History_the birth of a new nationLesson Plan_Us History_the birth of a new nation
Lesson Plan_Us History_the birth of a new nationscott severance
 
The presidency of john adams
The presidency of john adamsThe presidency of john adams
The presidency of john adamsAllison Barnette
 
The First Five Presidents of the United States
The First Five Presidents of the United StatesThe First Five Presidents of the United States
The First Five Presidents of the United Statesmentzers
 
American History - Chapter 6
American History - Chapter 6American History - Chapter 6
American History - Chapter 6Alison Kurtz
 
Adams To Jefferson
Adams To JeffersonAdams To Jefferson
Adams To JeffersonJames Henry
 

Viewers also liked (20)

Scaling Instagram
Scaling InstagramScaling Instagram
Scaling Instagram
 
Embracing Open Source: Practice and Experience from Alibaba
Embracing Open Source: Practice and Experience from AlibabaEmbracing Open Source: Practice and Experience from Alibaba
Embracing Open Source: Practice and Experience from Alibaba
 
Load balancing at tuenti
Load balancing at tuentiLoad balancing at tuenti
Load balancing at tuenti
 
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
 
Xyz affair
Xyz affairXyz affair
Xyz affair
 
Us history shaping a new nation
Us history shaping a new nationUs history shaping a new nation
Us history shaping a new nation
 
Thomas jefferson
Thomas jeffersonThomas jefferson
Thomas jefferson
 
LVS development and experience
LVS development and experienceLVS development and experience
LVS development and experience
 
3. shaping a new nation [1782 1788]
3. shaping a new nation [1782 1788]3. shaping a new nation [1782 1788]
3. shaping a new nation [1782 1788]
 
Product design: How to create a product
Product design: How to create a productProduct design: How to create a product
Product design: How to create a product
 
Scaling Twitter with Cassandra
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with Cassandra
 
Washington Presidency
Washington PresidencyWashington Presidency
Washington Presidency
 
The presidency of george washingtion ppt for notes
The presidency of george washingtion ppt for notesThe presidency of george washingtion ppt for notes
The presidency of george washingtion ppt for notes
 
Lesson Plan_Us History_the birth of a new nation
Lesson Plan_Us History_the birth of a new nationLesson Plan_Us History_the birth of a new nation
Lesson Plan_Us History_the birth of a new nation
 
The presidency of john adams
The presidency of john adamsThe presidency of john adams
The presidency of john adams
 
The First Five Presidents of the United States
The First Five Presidents of the United StatesThe First Five Presidents of the United States
The First Five Presidents of the United States
 
John adams presidency ppt
John adams presidency pptJohn adams presidency ppt
John adams presidency ppt
 
American History - Chapter 6
American History - Chapter 6American History - Chapter 6
American History - Chapter 6
 
Adams To Jefferson
Adams To JeffersonAdams To Jefferson
Adams To Jefferson
 
Chapter 10 Sections 1 & 2
Chapter 10 Sections 1 & 2Chapter 10 Sections 1 & 2
Chapter 10 Sections 1 & 2
 

Similar to Chirp 2010: Scaling Twitter

John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitterRoger Xia
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009John Adams
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...Javier García Magna
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profitRodrigo Campos
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comDamien Krotkine
 
Hacklu2011 tricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaudstricaud
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Open Analytics
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenChristopher Whitaker
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondStuart (Pid) Williams
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scaleOvais Tariq
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?DATAVERSITY
 
In Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionIn Memory Databases: A Real Time Analytics Solution
In Memory Databases: A Real Time Analytics SolutionAdaryl "Bob" Wakefield, MBA
 
Building FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDBFoundationDB
 

Similar to Chirp 2010: Scaling Twitter (20)

John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 
Fixing Twitter Velocity2009
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
 
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
 
Dibi Conference 2012
Dibi Conference 2012Dibi Conference 2012
Dibi Conference 2012
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Using Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
 
Hacklu2011 tricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaud
 
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
 
Open Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe OlsenOpen Data Summit Presentation by Joe Olsen
Open Data Summit Presentation by Joe Olsen
 
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
 
Be faster then rabbits
Be faster then rabbitsBe faster then rabbits
Be faster then rabbits
 
Asynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per second
 
Monitoring MySQL at scale
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
 
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
 

Chirp 2010: Scaling Twitter

  • 2. Billions of Hits: Scaling Twitter John Adams Twitter Operations
  • 4. John Adams @netik • Early Twitter employee (mid-2008) • Lead engineer: Outward Facing Services (Apache, Unicorn, SMTP), Auth, Security • Keynote Speaker: O’Reilly Velocity 2009 • O’Reilly Web 2.0 Speaker (2008, 2010) • Previous companies: Inktomi, Apple, c|net • Working on Web Operations book with John Allspaw (flickr, etsy), out in June
  • 6. 752% 2008 Growth source: comscore.com - (based only on www traffic, not API)
  • 7. 1358% 2009 Growth source: comscore.com - (based only on www traffic, not API)
  • 8. 12 th most popular source: alexa.com
  • 9. 55M Tweets per day (640 TPS average, 1000 TPS peak) source: twitter.com internal
  • 10. 600M Searches/Day source: twitter.com internal
  • 11. 25% Web API 75%
  • 12. Operations • What do we do? • Site Availability • Capacity Planning (metrics-driven) • Configuration Management • Security • Much more than basic Sysadmin
  • 13. What have we done? • Improved response time, reduced latency • Fewer errors during deploys (Unicorn!) • Faster performance • Lower MTTD (Mean Time to Detect) • Lower MTTR (Mean Time to Recovery)
  • 14. Operations Mantra • Find Weakest Point (Metrics + Logs + Science = Analysis) → Take Corrective Action (Process) → Move to Next Weakest Point (Repeatability)
  • 15. Make an attack plan. Symptom / Bottleneck / Vector / Solution • Bandwidth / Network / HTTP Latency / Servers++ • Timeline Delay / Database / Update Delay / Better algorithm • Status Growth / Database / Delays / Flock, Cassandra • Updates / Algorithm / Latency / Algorithms
  • 16. Finding Weakness • Metrics + Graphs • Individual metrics are irrelevant • We aggregate metrics to find knowledge • Logs • SCIENCE!
  • 17. Monitoring • Twitter graphs and reports critical metrics in as near real time as possible • If you build tools against our API, you should too. • RRD, other Time-Series DB solutions • Ganglia + custom gmetric scripts • dev.twitter.com - API availability
  • 18. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 19. Data Analysis • Instrumenting the world pays off. • “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
  • 20. Forecasting • Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) • Chart: status_id growth curve (r2=0.99) with two “Twitpocalypse” markers, one at the signed 32-bit int limit and one at the unsigned 32-bit int limit
  • 22. External API Dashboard http://dev.twitter.com/status
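
The forecasting idea on slide 20 can be sketched roughly as follows. This is a minimal illustration, not Twitter's actual method: it assumes exponential growth (the slide does not say which curve was fitted), uses plain least squares on the logarithm of the ids instead of the tools named on the slide, and the sample data is entirely made up.

```python
import math

SIGNED_INT32_MAX = 2**31 - 1      # first "Twitpocalypse": signed 32-bit status_id
UNSIGNED_INT32_MAX = 2**32 - 1    # second one: unsigned 32-bit status_id


def fit_exponential(days, ids):
    """Fit ids ~= a * exp(b * day) by least squares on log(ids).

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the straight-line fit in log space.
    """
    n = len(days)
    logs = [math.log(v) for v in ids]
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    b = sxy / sxx
    log_a = mean_y - b * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in logs)
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(days, logs))
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


def day_limit_reached(a, b, limit):
    """Solve a * exp(b * day) = limit for day."""
    return math.log(limit / a) / b


# Hypothetical observations: highest status_id seen on a given day.
sample_days = [0, 30, 60, 90, 120]
sample_ids = [1.0e9 * math.exp(0.004 * d) for d in sample_days]

a, b, r2 = fit_exponential(sample_days, sample_ids)
signed_day = day_limit_reached(a, b, SIGNED_INT32_MAX)
unsigned_day = day_limit_reached(a, b, UNSIGNED_INT32_MAX)
```

With a fit like this, `signed_day` and `unsigned_day` give the projected dates by which the column type has to be widened, which is the "capacity planning, not fire fighting" use of the curve.
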
  • 23. What’s a Robot ? • Actual error in the Rails stack (HTTP 500) • Uncaught Exception • Code problem, or failure / nil result • Increases our exception count • Shows up in Reports
  • 24. What’s a Whale ? • HTTP Error 502, 503 • Twitter has a hard and fast five second timeout • We’d rather fail fast than block on requests • We also kill long-running queries (mkill) • Timeout
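
The fail-fast behaviour on slide 24 could look something like the sketch below. It is only an illustration under stated assumptions: `handle_with_timeout` and `slow_query` are hypothetical names, the 5-second default mirrors the slide, and the mapping of uncaught exceptions to HTTP 500 follows the "Robot" definition on slide 23. Note that Python cannot kill the worker thread, so this only stops waiting; it does not reproduce what mkill does to a database query.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

REQUEST_TIMEOUT_S = 5.0  # "hard and fast five second timeout" from the slide


def handle_with_timeout(work, timeout_s=REQUEST_TIMEOUT_S):
    """Run work() and return (http_status, body), failing fast on timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(work)
    try:
        return 200, future.result(timeout=timeout_s)
    except FutureTimeout:
        return 503, None          # whale: gave up waiting
    except Exception:
        return 500, None          # robot: uncaught exception in the handler
    finally:
        # Do not block on the abandoned worker; it keeps running in the background.
        pool.shutdown(wait=False)


def slow_query():
    """Hypothetical stand-in for a slow back-end call."""
    time.sleep(0.3)
    return "timeline"
```
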
  • 25. Whale Watcher • Simple shell script • MASSIVE WIN by @ronpepsi • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 60 seconds of aggregated daemon / www logs • “Whales per Second” > Wthreshold • Thar be whales! Call in ops.
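
The Whale Watcher logic on slide 25 can be approximated as below. The original is described as a shell script and its source is not shown, so this Python sketch is an assumption throughout: the log format (`<epoch seconds> <HTTP status>` per line), the function names, and the threshold value are all hypothetical.

```python
WINDOW_S = 60            # "last 60 seconds" from the slide
WHALE_THRESHOLD = 1.0    # hypothetical Wthreshold, in whales per second


def count_errors(lines, now, window_s=WINDOW_S):
    """Count whales (503) and robots (500) logged in the last window_s seconds."""
    whales = robots = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            timestamp = float(parts[0])
            status = int(parts[1])
        except ValueError:
            continue  # skip lines that do not match the assumed format
        if not (now - window_s < timestamp <= now):
            continue
        if status == 503:
            whales += 1
        elif status == 500:
            robots += 1
    return whales, robots


def should_page(whales, window_s=WINDOW_S, threshold=WHALE_THRESHOLD):
    """True when whales per second exceeds the threshold: call in ops."""
    return whales / window_s > threshold
```

A cron job or loop would feed it the tail of the aggregated logs, for example `should_page(count_errors(lines, time.time())[0])`.
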
  • 26. Deploy Watcher Sample window: 300.0 seconds First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010) Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010) PRODUCTION APACHE: ALL OK PRODUCTION OTHER: ALL OK WEB0049 CANARY APACHE: ALL OK WEB0049 CANARY BACKEND SERVICES: ALL OK DAEMON0031 CANARY BACKEND SERVICES: ALL OK DAEMON0031 CANARY OTHER: ALL OK
  • 27. Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-Heavy site function • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
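
A darkmode switch board like the one on slide 27 might be sketched as follows. The class, switch names, and in-memory storage are hypothetical; the slides do not say how the real flags are stored, only that memcached evictions once lost them (slide 32), so a durable store is assumed to sit behind something like this in practice.

```python
import time


class Darkmode:
    """Registry of on/off switches for expensive site features."""

    def __init__(self, switches):
        self._enabled = dict(switches)
        self.changelog = []  # every change is recorded so it can be reported

    def is_enabled(self, name):
        # Unknown switches are treated as off (an assumption of this sketch).
        return self._enabled.get(name, False)

    def set(self, name, enabled, who):
        self._enabled[name] = enabled
        self.changelog.append((time.time(), who, name, enabled))


def render_sidebar(darkmode):
    """Hypothetical caller: skip the heavy work when the switch is thrown."""
    if not darkmode.is_enabled("trending_topics"):
        return "sidebar without trends"
    return "sidebar with trends"
```

The point of the pattern is that the check is cheap and sits in the request path, so operations can shed load without a deploy.
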
  • 28. request flow Load Balancers Apache mod_proxy Rails (Unicorn) Flock memcached Kestrel MySQL Cassandra Daemons
  • 29. Servers • Co-located, dedicated machines at NTT America • No clouds; Only for monitoring, not serving • Need raw processing power, latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems. • Moving to our own data center soon
  • 30. unicorn • A single socket Rails application Server (Rack) • Zero Downtime Deploys (!) • Controlled, shuffled transfer to new code • Less memory, 30% less CPU • Shift from mod_proxy_balancer to mod_proxy_pass • HAProxy, Nginx weren’t any better. Really.
  • 31. Rails • Mostly only for front-end. • Back end mostly Scala and pure ruby • Not to blame for our issues. Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Queue Latency • Replication Lag
  • 32. memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Network Memory Bus isn’t infinite • Segmented into pools for better performance
  • 33. Loony • Central machine database (MySQL) • Python, Django, Paramiko SSH • Paramiko - Twitter OSS (@robey) • Ties into LDAP groups • When data center sends us email, machine definitions built in real-time
  • 34. Murder • @lg rocks! • Bittorrent based replication for deploys • ~30-60 seconds to update >1k machines • P2P - Legal, valid, Awesome.
  • 35. Kestrel • @robey • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
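
The queue semantics on slide 35 (SET enqueues, GET dequeues, no shared state, no strict ordering) can be illustrated with an in-memory model. This is not Kestrel's code or its wire protocol; `QueueServer` and `QueueClient` are hypothetical names, and picking a random server per operation is an assumption about how clients spread load.

```python
import random
from collections import defaultdict, deque


class QueueServer:
    """One independent server: FIFO per queue name, no knowledge of peers."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, queue_name, item):      # SET = enqueue
        self._queues[queue_name].append(item)

    def get(self, queue_name):            # GET = dequeue
        queue = self._queues[queue_name]
        return queue.popleft() if queue else None


class QueueClient:
    """Spreads operations over servers; ordering across servers is not kept."""

    def __init__(self, servers, rng=None):
        self._servers = list(servers)
        self._rng = rng or random.Random()

    def enqueue(self, queue_name, item):
        self._rng.choice(self._servers).set(queue_name, item)

    def dequeue(self, queue_name):
        # Try the servers in random order until one has an item.
        for server in self._rng.sample(self._servers, len(self._servers)):
            item = server.get(queue_name)
            if item is not None:
                return item
        return None
```

Each server stays strictly FIFO on its own, but because jobs land on arbitrary servers, consumers see them in a loosely ordered stream, which is the trade-off the slide names.
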
  • 36. Asynchronous Requests • Inbound traffic consumes a unicorn worker • Outbound traffic consumes a unicorn worker • The request pipeline should not be used to handle 3rd party communications or back-end work. • Reroute traffic to daemons
  • 37. Daemons • Daemons touch every tweet • Many different daemon types at Twitter • Old way: One daemon per type (Rails) • New way: Fewer Daemons (Pure Ruby) • Daemon Slayer - A Multi Daemon that could do many different jobs, all at once.
  • 38. Disk is the new Tape. • Social Networking application profile has many O(n^y) operations. • Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms • Web 2.0 isn’t possible without lots of RAM • SSDs? What to do?
  • 39. Caching • We’re the real-time web, but lots of caching opportunity. You should cache what you get from us. • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter now largest contributor to libmemcached
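
Two ideas from slide 39, separate pools per data type and an FNV hash for key distribution, can be sketched together. The slide does not say which FNV variant the gem uses, so 32-bit FNV-1a is an assumption here, as are the pool names, host names, and the simple modulo placement (real clients often use consistent hashing instead).

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


# Hypothetical pools: one per data type, so a flood of timeline entries
# cannot evict small but important values such as configuration flags.
POOLS = {
    "timeline": ["cache-t1:11211", "cache-t2:11211", "cache-t3:11211"],
    "config": ["cache-c1:11211"],
}


def server_for(pool_name, key):
    """Pick the server inside a pool that should hold this key."""
    servers = POOLS[pool_name]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]
```

FNV is attractive for this job because it is a handful of integer operations per byte, whereas MD5 is a cryptographic digest doing far more work than key placement needs.
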
  • 40. MySQL • Sharding large volumes of data is hard • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
  • 41. MySQL Challenges • Replication Delay • Single threaded. Slow. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal • Disk issues (FS Choice, noatime, scheduling algorithm)
  • 42. Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Social Graph. • You don’t need ACID for everything.
  • 43. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the right DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
  • 44. Flock • Scalable Social Graph Store • Sharding via Gizzard • MySQL backend (many.) • 13 billion edges, 100K reads/second • Open Source! • Diagram: Flock → Gizzard → MySQL, MySQL, MySQL
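
The advice on slide 43 about sending reads and writes to the appropriate databases can be sketched as a tiny router. Everything here is hypothetical (`ConnectionRouter`, the string stand-ins for connections, classifying statements by their first keyword); it also ignores replication delay, which slide 40 notes can make replica reads stale.

```python
import itertools


class ConnectionRouter:
    """Send SELECTs to read replicas in turn; send everything else to the master."""

    def __init__(self, master, replicas):
        if not replicas:
            raise ValueError("at least one read replica is required")
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def for_query(self, sql):
        words = sql.split()
        if words and words[0].upper() == "SELECT":
            return next(self._replicas)   # keep read load off the master
        return self.master                # writes (and anything unknown)
```
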
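
The layering on slide 44 (a graph store that shards across many MySQL backends) can be illustrated with a toy range-based shard map. This is an assumption-heavy sketch, not Gizzard's design: `ShardMap` and `EdgeStore` are hypothetical, dictionaries stand in for MySQL backends, and edges are placed purely by source id.

```python
import bisect


class ShardMap:
    """Maps an id to a shard; shard i owns ids from lower_bounds[i] upward."""

    def __init__(self, lower_bounds, shards):
        self._bounds = list(lower_bounds)   # must be sorted ascending
        self._shards = list(shards)

    def shard_for(self, source_id):
        index = bisect.bisect_right(self._bounds, source_id) - 1
        if index < 0:
            raise KeyError(f"no shard covers id {source_id}")
        return self._shards[index]


class EdgeStore:
    """Stores follow edges on the shard that owns the source id."""

    def __init__(self, shard_map):
        self._shard_map = shard_map

    def add_edge(self, source_id, dest_id):
        shard = self._shard_map.shard_for(source_id)
        shard.setdefault(source_id, set()).add(dest_id)

    def following(self, source_id):
        shard = self._shard_map.shard_for(source_id)
        return set(shard.get(source_id, ()))
```

Keeping all of one user's outgoing edges on a single shard means a "who does this user follow" read touches one backend, which is the kind of access pattern a 100K reads/second store has to make cheap.
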
  • 45. Cassandra • Originally written by Facebook • Distributed Data Store • @rk’s changes to Cassandra Open Sourced • Currently double-writing into it • Transitioning to 100% soon.
  • 46. Lessons Learned • Instrument everything. Start graphing early. • Cache as much as possible • Start working on scaling early. • Don’t rely on memcache, and don’t rely on the database • Don’t use mongrel. Use Unicorn.
  • 48. Q&A
  • 49. Thanks! • @jointheflock • http://twitter.com/jobs • Download our work • http://twitter.com/about/opensource