Chirp 2010: Scaling Twitter

John Adams, Operations Engineer
Billions of Hits:
Scaling Twitter
John Adams
Twitter Operations
#chirpscale
John Adams                            @netik
•   Early Twitter employee (mid-2008)

•   Lead engineer: Outward Facing Services (Apache,
    Unicorn, SMTP), Auth, Security

•   Keynote Speaker: O’Reilly Velocity 2009

•   O’Reilly Web 2.0 Speaker (2008, 2010)

•   Previous companies: Inktomi, Apple, c|net

•   Working on Web Operations book with John Allspaw
    (Flickr, Etsy), out in June
Growth.
752%
                       2008 Growth
source: comscore.com - (based only on www traffic, not API)
1358%
                       2009 Growth
source: comscore.com - (based only on www traffic, not API)
12th
                    most popular
source: alexa.com
55M
                   Tweets per day
                  (640 tweets/sec average, 1,000 tweets/sec peak)
source: twitter.com internal
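As a quick check on the figures above, the average rate follows from dividing the daily total by the number of seconds in a day. This is only a sketch that treats 55M as a round daily total; the variable names are illustrative.

```python
# Average tweets per second implied by 55M tweets per day.
TWEETS_PER_DAY = 55_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

average_tps = TWEETS_PER_DAY / SECONDS_PER_DAY
print(f"{average_tps:.0f} tweets/sec on average")  # close to the ~640 quoted on the slide
```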
600M
                   Searches/Day
source: twitter.com internal
Traffic: 25% Web, 75% API
Operations
•   What do we do?

    •   Site Availability

    •   Capacity Planning (metrics-driven)

    •   Configuration Management

    •   Security

    •   Much more than basic Sysadmin
What have we done?
•   Improved response time, reduced latency

•   Fewer errors during deploys (Unicorn!)

•   Faster performance

•   Lower MTTD (Mean time to Detect)

•   Lower MTTR (Mean time to Recovery)
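MTTD and MTTR are simple averages over incidents, so they can be tracked from a plain incident log. The sketch below uses made-up incident records and assumes both intervals are measured from the start of the incident; teams that measure recovery from the moment of detection would adjust `mttr_minutes` accordingly.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, recovered).
INCIDENTS = [
    (datetime(2010, 4, 1, 10, 0), datetime(2010, 4, 1, 10, 4), datetime(2010, 4, 1, 10, 30)),
    (datetime(2010, 4, 3, 22, 15), datetime(2010, 4, 3, 22, 17), datetime(2010, 4, 3, 22, 45)),
]

def mttd_minutes(incidents):
    """Mean minutes from incident start to detection."""
    return mean((detected - started).total_seconds() / 60
                for started, detected, _ in incidents)

def mttr_minutes(incidents):
    """Mean minutes from incident start to recovery."""
    return mean((recovered - started).total_seconds() / 60
                for started, _, recovered in incidents)

print(mttd_minutes(INCIDENTS), mttr_minutes(INCIDENTS))
```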
Operations Mantra

Find Weakest Point -> Take Corrective Action -> Move to Next Weakest Point

Metrics + Logs + Science = Analysis -> Process -> Repeatability
Make an attack plan.
Symptom          Bottleneck   Vector         Solution
Bandwidth        Network      HTTP Latency   Servers++
Timeline Delay   Database     Update Delay   Better algorithm
Status Growth    Database     Delays         Flock, Cassandra
Updates          Algorithm    Latency        Algorithms
Finding Weakness
•   Metrics + Graphs

    •   Individual metrics are irrelevant

    •   We aggregate metrics to find knowledge

•   Logs

•   SCIENCE!
Monitoring
•   Twitter graphs and reports critical metrics in
    as near real time as possible

•   If you build tools against our API, you should
    too.

    •   RRD, other Time-Series DB solutions

    •   Ganglia + custom gmetric scripts

•   dev.twitter.com - API availability
Analyze
•   Turn data into information

    •   Where is the code base going?

    •   Are things worse than they were?

        •   Understand the impact of the last software
            deploy

        •   Run check scripts during and after deploys

•   Capacity Planning, not Fire Fighting!
Data Analysis
•   Instrumenting the world pays off.

•   “Data analysis, visualization, and other
    techniques for seeing patterns in data are
    going to be an increasingly valuable skill set.
    Employers take notice!”
          “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
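Forecasting
Curve-fitting for capacity planning
(R, fityk, Mathematica, CurveFit)

[Chart: status_id over time with a fitted curve (r² = 0.99), projecting when it
crosses the signed 32-bit integer limit and then the unsigned 32-bit integer
limit; each crossing is labeled “Twitpocalypse”]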
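The forecasting idea can be sketched in a few lines: fit a curve to observed status_id values and solve for the day it crosses each 32-bit limit. The slide does not say which model was fitted, so the exponential model here is an assumption, and the observations are made up for illustration.

```python
import math

def fit_exponential(days, ids):
    """Least-squares fit of ids ~ a * exp(b * day), done as a line through log(ids)."""
    n = len(days)
    logs = [math.log(v) for v in ids]
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def day_limit_reached(a, b, limit):
    """Day on which the fitted curve crosses `limit`."""
    return math.log(limit / a) / b

SIGNED_32_MAX = 2**31 - 1
UNSIGNED_32_MAX = 2**32 - 1

# Made-up observations (day number, highest status_id seen), not real Twitter data.
days = [0, 30, 60, 90]
ids = [1.0e8, 2.0e8, 4.0e8, 8.0e8]

a, b = fit_exponential(days, ids)
print("signed limit reached on day", round(day_limit_reached(a, b, SIGNED_32_MAX)))
print("unsigned limit reached on day", round(day_limit_reached(a, b, UNSIGNED_32_MAX)))
```

With data that doubles every 30 days, the unsigned limit is reached one doubling period after the signed one, which is a useful sanity check on the fit.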
Internal Dashboard
External API Dashboard
http://dev.twitter.com/status
What’s a Robot?
•   Actual error in the Rails stack (HTTP 500)

•   Uncaught Exception

•   Code problem, or failure / nil result

•   Increases our exception count

•   Shows up in Reports
What’s a Whale?
•   HTTP Error 502, 503

•   Twitter has a hard and fast five second timeout

•   We’d rather fail fast than block on requests

•   We also kill long-running queries (mkill)

•   Timeout
Whale Watcher
•   Simple shell script

    •   MASSIVE WIN by @ronpepsi

•   Whale = HTTP 503 (timeout)

•   Robot = HTTP 500 (error)

•   Examines last 60 seconds of
    aggregated daemon / www logs

•   “Whales per Second” > W_threshold

    •   Thar be whales! Call in ops.
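The check described above can be sketched as follows. This is not the original shell script: it is a Python illustration that assumes log entries have already been parsed into (timestamp, HTTP status) pairs, and the function names and threshold value are hypothetical.

```python
def count_statuses(entries, now, window=60):
    """Count whales (503) and robots (500) among entries from the last `window` seconds."""
    whales = robots = 0
    for timestamp, status in entries:
        if now - window < timestamp <= now:
            if status == 503:
                whales += 1
            elif status == 500:
                robots += 1
    return whales, robots

def whales_alert(entries, now, threshold, window=60):
    """True when whales per second over the window exceed the threshold."""
    whales, _ = count_statuses(entries, now, window)
    return whales / window > threshold

# Illustrative entries: (unix timestamp, HTTP status).
entries = [(1000, 200), (1010, 503), (1020, 503), (1030, 500), (900, 503)]
if whales_alert(entries, now=1040, threshold=0.01):
    print("Thar be whales! Call in ops.")
```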
Deploy Watcher
Sample window: 300.0 seconds

First start time:
Mon Apr 5 15:30:00 2010 (Mon Apr   5 08:30:00 PDT 2010)
Second start time:
Tue Apr 6 02:09:40 2010 (Mon Apr   5 19:09:40 PDT 2010)

PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB0049 CANARY APACHE: ALL OK
WEB0049 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY BACKEND SERVICES: ALL OK
DAEMON0031 CANARY OTHER: ALL OK
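A report like the one above could come from comparing error rates between two sample windows, one before and one after the deploy. The sketch below is an assumption about how such a check might work, not the actual Deploy Watcher; the `tolerance` parameter and the `DEGRADED` label are hypothetical, and request counts are assumed to be non-zero.

```python
def check_window(name, before, after, tolerance=0.5):
    """Compare error rates from two sample windows.

    `before` and `after` are (errors, requests) pairs. The check fails when the
    error rate after the deploy exceeds the earlier rate by more than
    `tolerance` (a fraction, so 0.5 allows a 50% increase).
    """
    rate_before = before[0] / before[1]
    rate_after = after[0] / after[1]
    ok = rate_after <= rate_before * (1 + tolerance)
    return f"{name}: {'ALL OK' if ok else 'DEGRADED'}"

print(check_window("PRODUCTION APACHE", before=(10, 10000), after=(12, 10000)))
print(check_window("WEB0049 CANARY APACHE", before=(10, 10000), after=(30, 10000)))
```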
Feature “Darkmode”
•   Specific site controls to enable and disable
    computationally or I/O-heavy site functions

•   The “Emergency Stop” button

•   Changes logged and reported to all teams

•   Around 60 switches we can throw

•   Static / Read-only mode
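A minimal sketch of such a switchboard, assuming switches live in memory and every change is appended to a log that other teams can read. The class, method, and switch names are all hypothetical.

```python
import time

class Darkmode:
    """In-memory feature switches with a change log (hypothetical names throughout)."""

    def __init__(self, switches):
        self._enabled = dict(switches)  # switch name -> bool
        self.changes = []               # (timestamp, switch, new_state, who)

    def is_enabled(self, name):
        return self._enabled[name]

    def set(self, name, enabled, who):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = enabled
        self.changes.append((time.time(), name, enabled, who))

    def emergency_stop(self, who):
        """Turn every switch off at once."""
        for name in self._enabled:
            self.set(name, False, who)

dm = Darkmode({"search": True, "trends": True})
dm.set("search", False, who="ops")
dm.emergency_stop(who="ops")
```

Keeping the flags in a store that cannot evict them matters: a cache eviction would silently reset a switch.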
request flow
           Load Balancers

         Apache mod_proxy

           Rails (Unicorn)

 Flock      memcached        Kestrel

         MySQL      Cassandra

             Daemons
Servers
•   Co-located, dedicated machines at NTT America

•   No clouds for serving; we use them only for monitoring

    •   Need raw processing power, latency too high
        in existing cloud offerings

•   Frees us to deal with real, intellectual, computer
    science problems.

•   Moving to our own data center soon
unicorn
•   A single socket Rails application Server (Rack)

•   Zero Downtime Deploys (!)

    •   Controlled, shuffled transfer to new code

•   Less memory, 30% less CPU

•   Shift from mod_proxy_balancer to
    mod_proxy_pass

    •   HAProxy and Nginx weren’t any better. Really.
Rails
•   Mostly only for front-end.

•   Back end mostly Scala and pure Ruby

•   Not to blame for our issues. Analysis found:

    •   Caching + Cache invalidation problems

    •   Bad queries generated by ActiveRecord, resulting in
        slow queries against the db

    •   Queue Latency

    •   Replication Lag
memcached
•   memcached isn’t perfect.

    •   Memcached SEGVs hurt us early on.

•   Evictions make the cache unreliable for
    important configuration data
    (loss of darkmode flags, for example)

•   Network Memory Bus isn’t infinite

•   Segmented into pools for better performance
Loony
•   Central machine database (MySQL)

    •   Python, Django, Paramiko SSH

        •   Paramiko - Twitter OSS (@robey)

    •   Ties into LDAP groups

•   When data center sends us email, machine
    definitions built in real-time
Murder
•   @lg rocks!

•   BitTorrent-based replication for deploys

•   ~30-60 seconds to update >1k machines

•   P2P - Legal, valid, Awesome.
Kestrel
•   @robey

•   Works like memcache (same protocol)

•   SET = enqueue | GET = dequeue

•   No strict ordering of jobs

•   No shared state between servers

•   Written in Scala.
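The semantics above can be illustrated with a toy model: each server keeps its own queues, SET appends and GET pops, and a client that spreads work across servers gets no global ordering. This is not Kestrel's code or its client API; every class and method name below is made up for the sketch.

```python
import random
from collections import deque

class QueueServer:
    """Toy stand-in for one queue server: SET enqueues, GET dequeues."""

    def __init__(self):
        self._queues = {}  # queue name -> deque of items

    def set(self, name, item):
        self._queues.setdefault(name, deque()).append(item)

    def get(self, name):
        items = self._queues.get(name)
        return items.popleft() if items else None

class QueueClient:
    """Spreads work over independent servers, so global order is not preserved."""

    def __init__(self, servers, seed=None):
        self._servers = list(servers)
        self._rng = random.Random(seed)

    def enqueue(self, name, item):
        self._rng.choice(self._servers).set(name, item)

    def dequeue(self, name):
        # Ask the servers in random order and return the first item found.
        for server in self._rng.sample(self._servers, len(self._servers)):
            item = server.get(name)
            if item is not None:
                return item
        return None
```

Each server is strictly first-in, first-out on its own, but because servers share no state, items may come back in a different order than they were enqueued.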
Asynchronous Requests
•   Inbound traffic consumes a unicorn worker

•   Outbound traffic consumes a unicorn worker

•   The request pipeline should not be used to
    handle 3rd party communications or
    back-end work.

•   Reroute traffic to daemons
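One way to picture this: the request handler only records the job and returns, while a background worker does the slow part. The sketch uses Python's standard `queue` and `threading` modules as a stand-in for a real queue server and daemon; the function names and the "202 Accepted" return value are assumptions for illustration.

```python
import queue
import threading

jobs = queue.Queue()
delivered = []

def handle_request(tweet):
    """Request path: record the job and return immediately."""
    jobs.put(tweet)
    return "202 Accepted"

def daemon_loop():
    """Background worker: do the slow work outside the request path."""
    while True:
        tweet = jobs.get()
        if tweet is None:  # sentinel used to stop the worker
            jobs.task_done()
            break
        delivered.append(f"delivered: {tweet}")  # stand-in for third-party delivery
        jobs.task_done()

worker = threading.Thread(target=daemon_loop, daemon=True)
worker.start()

response = handle_request("hello")
jobs.put(None)  # ask the worker to stop
jobs.join()     # wait until every queued job has been processed
```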
Daemons
•   Daemons touch every tweet

•   Many different daemon types at Twitter

•   Old way: One daemon per type (Rails)

    •   New way: Fewer Daemons (Pure Ruby)

•   Daemon Slayer - A Multi Daemon that could
    do many different jobs, all at once.
Disk is the new Tape.
•   Social Networking application profile has
    many O(n^y) operations.

•   Page requests have to happen in < 500 ms or
    users start to notice. Goal: 250-300 ms

•   Web 2.0 isn’t possible without lots of RAM

•   SSDs? What to do?
Caching
•   We’re the real-time web, but lots of caching
    opportunity. You should cache what you get from us.

•   Most caching strategies rely on long TTLs (>60 s)

•   Separate memcache pools for different data types to
    prevent eviction

•   Optimized the Ruby gem: libmemcached + FNV hash
    instead of pure Ruby + MD5

•   Twitter now largest contributor to libmemcached
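To make the hashing and pooling points concrete, here is a sketch of a 32-bit FNV-1a hash used to pick a server within a per-data-type pool. The slide only says "FNV", so the choice of the FNV-1a variant is an assumption, as are the simple modulo distribution, the pool names, and the server names.

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

# Hypothetical pools: one server list per data type, so one type cannot evict another.
POOLS = {
    "timeline": ["tl-cache-1", "tl-cache-2", "tl-cache-3"],
    "user": ["user-cache-1", "user-cache-2"],
}

def server_for(pool, key):
    """Pick the server in `pool` responsible for `key`."""
    servers = POOLS[pool]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]

print(server_for("user", "user:42"))
```

FNV needs only an XOR and a multiply per byte, which is why it is much cheaper than MD5 for choosing a cache server.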
MySQL
•   Sharding large volumes of data is hard

•   Replication delay and cache eviction produce
    inconsistent results to the end user.

•   Locks create resource contention for popular
    data
MySQL Challenges
•   Replication Delay

    •   Single threaded. Slow.

•   Social Networking not good for RDBMS

    •   N x N relationships and social graph / tree
        traversal

    •   Disk issues (FS Choice, noatime, scheduling
        algorithm)
Relational Databases
not a Panacea
•   Good for:

    •   Users, Relational Data, Transactions

•   Bad:

    •   Queues. Polling operations. Social Graph.

•   You don’t need ACID for everything.
Database Replication
•   Major issues around users and statuses tables

•   Multiple functional masters (FRP, FWP)

•   Make sure your code reads from and writes to the
    right DBs. Reading from the master = slow death

    •   Monitor the DB. Find slow / poorly designed
        queries

•   Kill long running queries before they kill you
    (mkill)
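The selection step of a query killer can be sketched as a filter over process-list rows. This is not mkill itself: the rows mimic the Id, Command, Time, and Info columns of MySQL's SHOW PROCESSLIST output, and the five-second default is an assumption borrowed from the request timeout mentioned earlier.

```python
def queries_to_kill(processlist, max_seconds=5):
    """Return ids of queries that have been running longer than `max_seconds`."""
    return [
        row["Id"]
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]

# Illustrative rows only.
rows = [
    {"Id": 11, "Command": "Query", "Time": 42, "Info": "SELECT ..."},
    {"Id": 12, "Command": "Sleep", "Time": 300, "Info": None},
    {"Id": 13, "Command": "Query", "Time": 1, "Info": "SELECT ..."},
]
print(queries_to_kill(rows))
```

Idle connections (`Sleep`) are skipped on purpose, since only actively running queries hold up the database.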
Flock
•   Scalable Social Graph Store

•   Sharding via Gizzard

•   MySQL backend (many.)

•   13 billion edges,
    100K reads/second

•   Open Source!

[Diagram: Flock -> Gizzard -> MySQL, MySQL, MySQL]
Cassandra
•   Originally written by Facebook

•   Distributed Data Store

•   @rk’s changes to Cassandra Open Sourced

•   Currently double-writing into it

•   Transitioning to 100% soon.
Lessons Learned
•   Instrument everything. Start graphing early.

•   Cache as much as possible

•   Start working on scaling early.

•   Don’t rely on memcache, and don’t rely on the
    database

•   Don’t use Mongrel. Use Unicorn.
Join Us!
@jointheflock
Q&A
Thanks!
•   @jointheflock

•   http://twitter.com/jobs

•   Download our work

    •   http://twitter.com/about/opensource
1 of 49

Recommended

NoSQL at Twitter (NoSQL EU 2010) by
NoSQL at Twitter (NoSQL EU 2010)NoSQL at Twitter (NoSQL EU 2010)
NoSQL at Twitter (NoSQL EU 2010)Kevin Weil
94.2K views156 slides
Load balancing at tuenti by
Load balancing at tuentiLoad balancing at tuenti
Load balancing at tuentiRicardo Bartolomé
16.6K views46 slides
중니어의 고뇌: 1인분 개발자, 다음을 찾아서 by
중니어의 고뇌: 1인분 개발자, 다음을 찾아서중니어의 고뇌: 1인분 개발자, 다음을 찾아서
중니어의 고뇌: 1인분 개발자, 다음을 찾아서Yurim Jin
4.2K views15 slides
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅 by
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅
[NDC 2018] 신입 개발자가 알아야 할 윈도우 메모리릭 디버깅DongMin Choi
12.6K views123 slides
Building Event Streaming Architectures on Scylla and Kafka by
Building Event Streaming Architectures on Scylla and KafkaBuilding Event Streaming Architectures on Scylla and Kafka
Building Event Streaming Architectures on Scylla and KafkaScyllaDB
862 views56 slides
Etsy Activity Feeds Architecture by
Etsy Activity Feeds ArchitectureEtsy Activity Feeds Architecture
Etsy Activity Feeds ArchitectureDan McKinley
112K views45 slides

More Related Content

What's hot

코드의 품질 (Code Quality) by
코드의 품질 (Code Quality)코드의 품질 (Code Quality)
코드의 품질 (Code Quality)ChulHui Lee
3.6K views62 slides
Reddit/Quora Software System Design by
Reddit/Quora Software System DesignReddit/Quora Software System Design
Reddit/Quora Software System DesignElia Ahadi
982 views8 slides
GNU Make, Autotools, CMake 簡介 by
GNU Make, Autotools, CMake 簡介GNU Make, Autotools, CMake 簡介
GNU Make, Autotools, CMake 簡介Wen Liao
25.8K views106 slides
Building GUI App with Electron and Lisp by
Building GUI App with Electron and LispBuilding GUI App with Electron and Lisp
Building GUI App with Electron and Lispfukamachi
20.4K views62 slides
Cloud Pub_Sub by
Cloud Pub_SubCloud Pub_Sub
Cloud Pub_SubKnoldus Inc.
179 views11 slides
Apache HBase Performance Tuning by
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
39.6K views54 slides

What's hot(20)

코드의 품질 (Code Quality) by ChulHui Lee
코드의 품질 (Code Quality)코드의 품질 (Code Quality)
코드의 품질 (Code Quality)
ChulHui Lee3.6K views
Reddit/Quora Software System Design by Elia Ahadi
Reddit/Quora Software System DesignReddit/Quora Software System Design
Reddit/Quora Software System Design
Elia Ahadi982 views
GNU Make, Autotools, CMake 簡介 by Wen Liao
GNU Make, Autotools, CMake 簡介GNU Make, Autotools, CMake 簡介
GNU Make, Autotools, CMake 簡介
Wen Liao25.8K views
Building GUI App with Electron and Lisp by fukamachi
Building GUI App with Electron and LispBuilding GUI App with Electron and Lisp
Building GUI App with Electron and Lisp
fukamachi20.4K views
Apache HBase Performance Tuning by Lars Hofhansl
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
Lars Hofhansl39.6K views
RxSwift 활용하기 - Let'Swift 2017 by Wanbok Choi
RxSwift 활용하기 - Let'Swift 2017RxSwift 활용하기 - Let'Swift 2017
RxSwift 활용하기 - Let'Swift 2017
Wanbok Choi1.3K views
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... by Databricks
 Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit... Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Near Real-Time Netflix Recommendations Using Apache Spark Streaming with Nit...
Databricks3.4K views
[야생의 땅: 듀랑고] 서버 아키텍처 - SPOF 없는 분산 MMORPG 서버 by Heungsub Lee
[야생의 땅: 듀랑고] 서버 아키텍처 - SPOF 없는 분산 MMORPG 서버[야생의 땅: 듀랑고] 서버 아키텍처 - SPOF 없는 분산 MMORPG 서버
[야생의 땅: 듀랑고] 서버 아키텍처 - SPOF 없는 분산 MMORPG 서버
Heungsub Lee50.2K views
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva... by Monica Beckwith
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...
Garbage First Garbage Collector (G1 GC) - Migration to, Expectations and Adva...
Monica Beckwith48.6K views
Từ Gà Đến Pro Git và GitHub trong 60 phút by Huy Hoàng Phạm
Từ Gà Đến Pro Git và GitHub trong 60 phútTừ Gà Đến Pro Git và GitHub trong 60 phút
Từ Gà Đến Pro Git và GitHub trong 60 phút
Huy Hoàng Phạm6.3K views
Multiplayer Game Sync Techniques through CAP theorem by Seungmo Koo
Multiplayer Game Sync Techniques through CAP theoremMultiplayer Game Sync Techniques through CAP theorem
Multiplayer Game Sync Techniques through CAP theorem
Seungmo Koo11.8K views
청강대 특강 - 프로젝트 제대로 해보기 by Chris Ohk
청강대 특강 - 프로젝트 제대로 해보기청강대 특강 - 프로젝트 제대로 해보기
청강대 특강 - 프로젝트 제대로 해보기
Chris Ohk3.2K views
Life in a Queue - Using Message Queue with django by Tareque Hossain
Life in a Queue - Using Message Queue with djangoLife in a Queue - Using Message Queue with django
Life in a Queue - Using Message Queue with django
Tareque Hossain15K views
Introduction to Python Celery by Mahendra M
Introduction to Python CeleryIntroduction to Python Celery
Introduction to Python Celery
Mahendra M4.6K views
Keeping Latency Low for User-Defined Functions with WebAssembly by ScyllaDB
Keeping Latency Low for User-Defined Functions with WebAssemblyKeeping Latency Low for User-Defined Functions with WebAssembly
Keeping Latency Low for User-Defined Functions with WebAssembly
ScyllaDB470 views
게임서버프로그래밍 #2 - IOCP Adv by Seungmo Koo
게임서버프로그래밍 #2 - IOCP Adv게임서버프로그래밍 #2 - IOCP Adv
게임서버프로그래밍 #2 - IOCP Adv
Seungmo Koo5.3K views
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화 by Seungmo Koo
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
게임서버프로그래밍 #7 - 패킷핸들링 및 암호화
Seungmo Koo6.7K views
Advanced Flink Training - Design patterns for streaming applications by Aljoscha Krettek
Advanced Flink Training - Design patterns for streaming applicationsAdvanced Flink Training - Design patterns for streaming applications
Advanced Flink Training - Design patterns for streaming applications
Aljoscha Krettek968 views
Behavior driven development (bdd) by Rohit Bisht
Behavior driven development (bdd)Behavior driven development (bdd)
Behavior driven development (bdd)
Rohit Bisht394 views

Viewers also liked

Scaling Instagram by
Scaling InstagramScaling Instagram
Scaling Instagramiammutex
189.4K views185 slides
Embracing Open Source: Practice and Experience from Alibaba by
Embracing Open Source: Practice and Experience from AlibabaEmbracing Open Source: Practice and Experience from Alibaba
Embracing Open Source: Practice and Experience from AlibabaWensong Zhang
17.6K views71 slides
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ... by
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...Jon Watte
8.2K views39 slides
Xyz affair by
Xyz affairXyz affair
Xyz affairlostpuppies
1.5K views4 slides
Us history shaping a new nation by
Us history shaping a new nationUs history shaping a new nation
Us history shaping a new nationMrO97
1.5K views34 slides
Thomas jefferson by
Thomas jeffersonThomas jefferson
Thomas jeffersonkellycrowell
1.8K views7 slides

Viewers also liked(20)

Scaling Instagram by iammutex
Scaling InstagramScaling Instagram
Scaling Instagram
iammutex189.4K views
Embracing Open Source: Practice and Experience from Alibaba by Wensong Zhang
Embracing Open Source: Practice and Experience from AlibabaEmbracing Open Source: Practice and Experience from Alibaba
Embracing Open Source: Practice and Experience from Alibaba
Wensong Zhang17.6K views
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ... by Jon Watte
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Message Queuing on a Large Scale: IMVUs stateful real-time message queue for ...
Jon Watte8.2K views
Us history shaping a new nation by MrO97
Us history shaping a new nationUs history shaping a new nation
Us history shaping a new nation
MrO971.5K views
LVS development and experience by Wensong Zhang
LVS development and experienceLVS development and experience
LVS development and experience
Wensong Zhang2.7K views
3. shaping a new nation [1782 1788] by jtoma84
3. shaping a new nation [1782 1788]3. shaping a new nation [1782 1788]
3. shaping a new nation [1782 1788]
jtoma843.2K views
Product design: How to create a product by Press42
Product design: How to create a productProduct design: How to create a product
Product design: How to create a product
Press425.6K views
Scaling Twitter with Cassandra by Ryan King
Scaling Twitter with CassandraScaling Twitter with Cassandra
Scaling Twitter with Cassandra
Ryan King20.6K views
Washington Presidency by mrsvogel
Washington PresidencyWashington Presidency
Washington Presidency
mrsvogel1.5K views
The presidency of george washingtion ppt for notes by Matthew Fulghum
The presidency of george washingtion ppt for notesThe presidency of george washingtion ppt for notes
The presidency of george washingtion ppt for notes
Matthew Fulghum1.2K views
Lesson Plan_Us History_the birth of a new nation by scott severance
Lesson Plan_Us History_the birth of a new nationLesson Plan_Us History_the birth of a new nation
Lesson Plan_Us History_the birth of a new nation
scott severance363 views
The First Five Presidents of the United States by mentzers
The First Five Presidents of the United StatesThe First Five Presidents of the United States
The First Five Presidents of the United States
mentzers15.9K views
American History - Chapter 6 by Alison Kurtz
American History - Chapter 6American History - Chapter 6
American History - Chapter 6
Alison Kurtz5.9K views
Election Of 1800 Power Point by guest8b3f7
Election Of 1800 Power PointElection Of 1800 Power Point
Election Of 1800 Power Point
guest8b3f77.9K views

Similar to Chirp 2010: Scaling Twitter

John adams talk cloudy by
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudyJohn Adams
3.3K views57 slides
Fixing twitter by
Fixing twitterFixing twitter
Fixing twitterRoger Xia
546 views49 slides
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... by
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...xlight
444 views49 slides
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... by
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...smallerror
1.4K views49 slides
Fixing_Twitter by
Fixing_TwitterFixing_Twitter
Fixing_Twitterliujianrong
562 views49 slides
Fixing Twitter Velocity2009 by
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009John Adams
1.2K views54 slides

Similar to Chirp 2010: Scaling Twitter(20)

John adams talk cloudy by John Adams
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
John Adams3.3K views
Fixing twitter by Roger Xia
Fixing twitterFixing twitter
Fixing twitter
Roger Xia546 views
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... by xlight
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
xlight444 views
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ... by smallerror
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
smallerror1.4K views
Fixing Twitter Velocity2009 by John Adams
Fixing Twitter Velocity2009Fixing Twitter Velocity2009
Fixing Twitter Velocity2009
John Adams1.2K views
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20... by Javier García Magna
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20....Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
.Net Microservices with Event Sourcing, CQRS, Docker and... Windows Server 20...
Capacity Planning for fun & profit by Rodrigo Campos
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
Rodrigo Campos1.1K views
Using Riak for Events storage and analysis at Booking.com by Damien Krotkine
Using Riak for Events storage and analysis at Booking.comUsing Riak for Events storage and analysis at Booking.com
Using Riak for Events storage and analysis at Booking.com
Damien Krotkine3.9K views
Hacklu2011 tricaud by stricaud
Hacklu2011 tricaudHacklu2011 tricaud
Hacklu2011 tricaud
stricaud558 views
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An... by Open Analytics
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An...
Open Analytics3K views
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog by Redis Labs
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Redis Labs2K views
Asynchronous design with Spring and RTI: 1M events per second by Stuart (Pid) Williams
Asynchronous design with Spring and RTI: 1M events per secondAsynchronous design with Spring and RTI: 1M events per second
Asynchronous design with Spring and RTI: 1M events per second
Monitoring MySQL at scale by Ovais Tariq
Monitoring MySQL at scaleMonitoring MySQL at scale
Monitoring MySQL at scale
Ovais Tariq2.1K views
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational? by DATAVERSITY
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
A Case Study of NoSQL Adoption: What Drove Wordnik Non-Relational?
DATAVERSITY1.4K views
Building FoundationDB by FoundationDB
Building FoundationDBBuilding FoundationDB
Building FoundationDB
FoundationDB3.4K views

Recently uploaded

Data-centric AI and the convergence of data and model engineering: opportunit... by
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
34 views40 slides
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica... by
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...NUS-ISS
16 views28 slides
PharoJS - Zürich Smalltalk Group Meetup November 2023 by
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023Noury Bouraqadi
120 views17 slides
How the World's Leading Independent Automotive Distributor is Reinventing Its... by
How the World's Leading Independent Automotive Distributor is Reinventing Its...How the World's Leading Independent Automotive Distributor is Reinventing Its...
How the World's Leading Independent Automotive Distributor is Reinventing Its...NUS-ISS
15 views25 slides
Understanding GenAI/LLM and What is Google Offering - Felix Goh by
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix GohNUS-ISS
41 views33 slides
Transcript: The Details of Description Techniques tips and tangents on altern... by
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...BookNet Canada
130 views15 slides

Recently uploaded(20)

Data-centric AI and the convergence of data and model engineering: opportunit... by Paolo Missier
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
Paolo Missier34 views
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica... by NUS-ISS
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
Emerging & Future Technology - How to Prepare for the Next 10 Years of Radica...
NUS-ISS16 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi120 views
How the World's Leading Independent Automotive Distributor is Reinventing Its... by NUS-ISS
How the World's Leading Independent Automotive Distributor is Reinventing Its...How the World's Leading Independent Automotive Distributor is Reinventing Its...
How the World's Leading Independent Automotive Distributor is Reinventing Its...
NUS-ISS15 views
Understanding GenAI/LLM and What is Google Offering - Felix Goh by NUS-ISS
Understanding GenAI/LLM and What is Google Offering - Felix GohUnderstanding GenAI/LLM and What is Google Offering - Felix Goh
Understanding GenAI/LLM and What is Google Offering - Felix Goh
NUS-ISS41 views
Transcript: The Details of Description Techniques tips and tangents on altern... by BookNet Canada
Transcript: The Details of Description Techniques tips and tangents on altern...Transcript: The Details of Description Techniques tips and tangents on altern...
Transcript: The Details of Description Techniques tips and tangents on altern...
BookNet Canada130 views
The details of description: Techniques, tips, and tangents on alternative tex... by BookNet Canada
The details of description: Techniques, tips, and tangents on alternative tex...The details of description: Techniques, tips, and tangents on alternative tex...
The details of description: Techniques, tips, and tangents on alternative tex...
BookNet Canada121 views
Voice Logger - Telephony Integration Solution at Aegis by Nirmal Sharma
Voice Logger - Telephony Integration Solution at AegisVoice Logger - Telephony Integration Solution at Aegis
Voice Logger - Telephony Integration Solution at Aegis
Nirmal Sharma17 views
Web Dev - 1 PPT.pdf by gdsczhcet
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet55 views
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV by Splunk
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
.conf Go 2023 - How KPN drives Customer Satisfaction on IPTV
Splunk88 views
The Importance of Cybersecurity for Digital Transformation by NUS-ISS
The Importance of Cybersecurity for Digital TransformationThe Importance of Cybersecurity for Digital Transformation
The Importance of Cybersecurity for Digital Transformation
NUS-ISS27 views
Empathic Computing: Delivering the Potential of the Metaverse by Mark Billinghurst
Empathic Computing: Delivering  the Potential of the MetaverseEmpathic Computing: Delivering  the Potential of the Metaverse
Empathic Computing: Delivering the Potential of the Metaverse
Mark Billinghurst470 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb12 views
Future of Learning - Khoong Chan Meng by NUS-ISS
Future of Learning - Khoong Chan MengFuture of Learning - Khoong Chan Meng
Future of Learning - Khoong Chan Meng
NUS-ISS33 views
[2023] Putting the R! in R&D.pdf by Eleanor McHugh
[2023] Putting the R! in R&D.pdf[2023] Putting the R! in R&D.pdf
[2023] Putting the R! in R&D.pdf
Eleanor McHugh38 views
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors by sugiuralab
TouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective SensorsTouchLog: Finger Micro Gesture Recognition  Using Photo-Reflective Sensors
TouchLog: Finger Micro Gesture Recognition Using Photo-Reflective Sensors
sugiuralab15 views

Chirp 2010: Scaling Twitter

  • 2. Billions of Hits: Scaling Twitter John Adams Twitter Operations
  • 4. John Adams @netik • Early Twitter employee (mid-2008) • Lead engineer: Outward Facing Services (Apache, Unicorn, SMTP), Auth, Security • Keynote Speaker: O’Reilly Velocity 2009 • O’Reilly Web 2.0 Speaker (2008, 2010) • Previous companies: Inktomi, Apple, c|net • Working on Web Operations book with John Alspaw (flickr, etsy), out in June
  • 6. 752% 2008 Growth source: comscore.com - (based only on www traffic, not API)
  • 7. 1358% 2009 Growth source: comscore.com - (based only on www traffic, not API)
  • 8. 12 th most popular source: alexa.com
  • 9. 55M Tweets per day (640 TPS/sec, 1000 TPS/sec peak) source: twitter.com internal
  • 10. 600M Searches/Day source: twitter.com internal
  • 11. 25% Web API 75%
  • 12. Operations • What do we do? • Site Availability • Capacity Planning (metrics-driven) • Configuration Management • Security • Much more than basic Sysadmin
  • 13. What have we done? • Improved response time, reduced latency • Less errors during deploys (Unicorn!) • Faster performance • Lower MTTD (Mean time to Detect) • Lower MTTR (Mean time to Recovery)
  • 14. Operations Mantra • Find Weakest Point (Metrics + Logs + Science = Analysis) • Take Corrective Action (Process) • Move to Next Weakest Point (Repeatability)
  • 15. Make an attack plan. Symptom | Bottleneck | Vector | Solution • Bandwidth | Network | HTTP Latency | Servers++ • Timeline Delay | Database | Update Delay | Better algorithm • Status Growth | Database | Delays | Flock, Cassandra • Updates | Algorithm | Latency | Algorithms
  • 16. Finding Weakness • Metrics + Graphs • Individual metrics are irrelevant • We aggregate metrics to find knowledge • Logs • SCIENCE!
  • 17. Monitoring • Twitter graphs and reports critical metrics in as near real time as possible • If you build tools against our API, you should too. • RRD, other Time-Series DB solutions • Ganglia + custom gmetric scripts • dev.twitter.com - API availability
  • 18. Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
  • 19. Data Analysis • Instrumenting the world pays off. • “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
  • 20. Forecasting • Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) • [Figure: curve fit of status_id growth, r²=0.99, with the signed int (32 bit) and unsigned int (32 bit) limits each marked "Twitpocalypse"]
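The forecasting slide fits a curve to status_id growth and reads off when it crosses the 32-bit limits. The sketch below shows one way to do that in plain Python. The slide does not say which model was fitted, so the exponential form, the synthetic data points, and the function names are all assumptions made for illustration.

```python
import math

SIGNED_INT32_MAX = 2**31 - 1      # first "Twitpocalypse" limit
UNSIGNED_INT32_MAX = 2**32 - 1    # second "Twitpocalypse" limit


def fit_exponential(days, ids):
    """Least-squares fit of ids ~= a * exp(b * day), done as a line through log(ids)."""
    n = len(days)
    logs = [math.log(value) for value in ids]
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def day_of_crossing(a, b, limit):
    """Day on which the fitted curve a * exp(b * day) reaches the given limit."""
    return math.log(limit / a) / b


# Synthetic observations (NOT real Twitter data): one sample every 30 days.
days = list(range(0, 300, 30))
ids = [500_000_000 * math.exp(0.004 * day) for day in days]

a, b = fit_exponential(days, ids)
signed_day = day_of_crossing(a, b, SIGNED_INT32_MAX)
unsigned_day = day_of_crossing(a, b, UNSIGNED_INT32_MAX)

print(f"signed 32-bit limit reached around day {signed_day:.0f}")
print(f"unsigned 32-bit limit reached around day {unsigned_day:.0f}")
```

With real measurements the fit would not be exact, so a goodness-of-fit check such as the r² shown on the slide would be worth computing before trusting the forecast.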
  • 22. External API Dashboard http://dev.twitter.com/status
  • 23. What’s a Robot ? • Actual error in the Rails stack (HTTP 500) • Uncaught Exception • Code problem, or failure / nil result • Increases our exception count • Shows up in Reports
  • 24. What’s a Whale ? • HTTP Error 502, 503 • Twitter has a hard and fast five second timeout • We’d rather fail fast than block on requests • We also kill long-running queries (mkill) • Timeout
  • 25. Whale Watcher • Simple shell script • MASSIVE WIN by @ronpepsi • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines last 60 seconds of aggregated daemon / www logs • “Whales per Second” > W_threshold • Thar be whales! Call in ops.
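The Whale Watcher logic can be sketched in a few lines. The original is a shell script whose details are not shown, so this Python version is an illustration only: the log line format (`<epoch_seconds> <status> <path>`), the function names, and the threshold value are all assumptions.

```python
from collections import Counter

WHALE_STATUS = "503"   # timeout, per the slide
ROBOT_STATUS = "500"   # error, per the slide


def count_errors(log_lines, now, window_seconds=60):
    """Count whales and robots logged in the last `window_seconds` before `now`.

    Assumes each line looks like '<epoch_seconds> <status> <path>' (hypothetical format).
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip malformed lines
        try:
            timestamp = float(parts[0])
        except ValueError:
            continue
        if not (now - window_seconds < timestamp <= now):
            continue
        if parts[1] == WHALE_STATUS:
            counts["whale"] += 1
        elif parts[1] == ROBOT_STATUS:
            counts["robot"] += 1
    return counts


def whales_over_threshold(counts, window_seconds=60, threshold_per_second=0.5):
    """True when 'whales per second' exceeds the (hypothetical) threshold."""
    return counts["whale"] / window_seconds > threshold_per_second


# Example with made-up log lines.
lines = [
    "1000 200 /home",
    "1010 503 /timeline",
    "1020 500 /status",
    "930 503 /old",        # outside the 60-second window
    "garbage",
] + ["1030 503 /timeline"] * 40

counts = count_errors(lines, now=1050)
if whales_over_threshold(counts):
    print("Thar be whales! Call in ops.")
```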
  • 26. Deploy Watcher Sample window: 300.0 seconds First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010) Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010) PRODUCTION APACHE: ALL OK PRODUCTION OTHER: ALL OK WEB0049 CANARY APACHE: ALL OK WEB0049 CANARY BACKEND SERVICES: ALL OK DAEMON0031 CANARY BACKEND SERVICES: ALL OK DAEMON0031 CANARY OTHER: ALL OK
  • 27. Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-Heavy site function • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
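A minimal sketch of the switch idea, assuming an in-process store. The class name, feature names, and audit-log shape are hypothetical; the slides do not describe how Darkmode is actually implemented or where the flags are stored.

```python
import time


class DarkmodeSwitches:
    """Named on/off switches for expensive features, with every change recorded."""

    def __init__(self, features):
        self._enabled = {name: True for name in features}
        self.audit_log = []  # (timestamp, operator, feature, enabled) tuples

    def set(self, feature, enabled, operator):
        if feature not in self._enabled:
            raise KeyError(f"unknown feature: {feature}")
        self._enabled[feature] = enabled
        # Recording who flipped what makes it possible to report changes to all teams.
        self.audit_log.append((time.time(), operator, feature, enabled))

    def is_enabled(self, feature):
        # Unknown features are treated as disabled.
        return self._enabled.get(feature, False)


switches = DarkmodeSwitches(["search", "trends"])
switches.set("search", False, operator="ops_on_call")

if switches.is_enabled("search"):
    print("running the expensive search query")
else:
    print("search is in darkmode; serving a static response")
```

Request handlers check `is_enabled` before doing the heavy work, so throwing a switch sheds load without a deploy.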
  • 28. request flow Load Balancers Apache mod_proxy Rails (Unicorn) Flock memcached Kestrel MySQL Cassandra Daemons
  • 29. Servers • Co-located, dedicated machines at NTT America • No clouds; Only for monitoring, not serving • Need raw processing power, latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems. • Moving to our own data center soon
  • 30. unicorn • A single socket Rails application Server (Rack) • Zero Downtime Deploys (!) • Controlled, shuffled transfer to new code • Less memory, 30% less CPU • Shift from mod_proxy_balancer to mod_proxy_pass • HAProxy and Nginx weren’t any better. Really.
  • 31. Rails • Mostly only for front-end. • Back end mostly Scala and pure Ruby • Not to blame for our issues. Analysis found: • Caching + Cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the db • Queue Latency • Replication Lag
  • 32. memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Network Memory Bus isn’t infinite • Segmented into pools for better performance
  • 33. Loony • Central machine database (MySQL) • Python, Django, Paramiko SSH • Paramiko - Twitter OSS (@robey) • Ties into LDAP groups • When data center sends us email, machine definitions built in real-time
  • 34. Murder • @lg rocks! • Bittorrent based replication for deploys • ~30-60 seconds to update >1k machines • P2P - Legal, valid, Awesome.
  • 35. Kestrel • @robey • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
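The queue semantics on the Kestrel slide can be modeled in memory. This is not Kestrel's code or client API: the class names and the random server selection are assumptions, used only to show why SET/GET on independent servers gives no strict global ordering.

```python
import random
from collections import deque


class QueueServer:
    """One server: each named queue is FIFO locally, with no knowledge of other servers."""

    def __init__(self):
        self._queues = {}

    def set(self, queue_name, item):      # SET = enqueue
        self._queues.setdefault(queue_name, deque()).append(item)

    def get(self, queue_name):            # GET = dequeue
        items = self._queues.get(queue_name)
        return items.popleft() if items else None


class QueueCluster:
    """Client-side view: pick a server at random, so ordering across servers is not strict."""

    def __init__(self, servers, rng=None):
        self.servers = servers
        self.rng = rng or random.Random()

    def enqueue(self, queue_name, item):
        self.rng.choice(self.servers).set(queue_name, item)

    def dequeue(self, queue_name):
        # Try the servers in random order until one has an item.
        for server in self.rng.sample(self.servers, len(self.servers)):
            item = server.get(queue_name)
            if item is not None:
                return item
        return None


cluster = QueueCluster([QueueServer(), QueueServer()], rng=random.Random(7))
jobs = [f"tweet-{n}" for n in range(6)]
for job in jobs:
    cluster.enqueue("deliveries", job)

drained = []
while (job := cluster.dequeue("deliveries")) is not None:
    drained.append(job)

print(drained)  # every job comes back once, not necessarily in enqueue order
```

Because servers share no state, adding capacity is a matter of adding servers; the cost is that consumers must tolerate out-of-order jobs.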
  • 36. Asynchronous Requests • Inbound traffic consumes a unicorn worker • Outbound traffic consumes a unicorn worker • The request pipeline should not be used to handle 3rd party communications or back-end work. • Reroute traffic to daemons
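A small sketch of the hand-off described here: the request handler only enqueues the slow work and returns, and a separate worker does it. The standard-library queue and thread stand in for the real queue and daemons, and the handler name, response string, and job contents are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()
delivered = []


def handle_request(tweet_text):
    """Web-tier handler: enqueue the slow work and return immediately."""
    jobs.put(tweet_text)
    return "202 Accepted"


def daemon_worker():
    """Back-end worker: drains the queue outside the request pipeline."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel used here to stop the sketch cleanly
            jobs.task_done()
            break
        delivered.append(f"notified followers: {job}")
        jobs.task_done()


worker = threading.Thread(target=daemon_worker, daemon=True)
worker.start()

response = handle_request("hello chirp")
jobs.put(None)
jobs.join()                      # wait until the worker has processed everything

print(response)
print(delivered)
```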
  • 37. Daemons • Daemons touch every tweet • Many different daemon types at Twitter • Old way: One daemon per type (Rails) • New way: Fewer Daemons (Pure Ruby) • Daemon Slayer - A Multi Daemon that could do many different jobs, all at once.
  • 38. Disk is the new Tape. • Social Networking application profile has many O(n^y) operations. • Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms • Web 2.0 isn’t possible without lots of RAM • SSDs? What to do?
  • 39. Caching • We’re the real-time web, but lots of caching opportunity. You should cache what you get from us. • Most caching strategies rely on long TTLs (>60 s) • Separate memcache pools for different data types to prevent eviction • Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5 • Twitter now largest contributor to libmemcached
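To illustrate two points from the caching slide, the sketch below hashes keys with FNV and keeps separate server pools per data type. The slide does not say which FNV variant is used, so 32-bit FNV-1a is an assumption here; the pool names, host names, and modulo-based server selection are also hypothetical and much simpler than what a real client library does.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(key: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    value = FNV_OFFSET_BASIS_32
    for byte in key:
        value ^= byte
        value = (value * FNV_PRIME_32) & 0xFFFFFFFF
    return value


# Hypothetical pools: keeping data types apart means a flood of timeline
# entries cannot evict small but important configuration values.
POOLS = {
    "timeline": ["tl-cache-1", "tl-cache-2", "tl-cache-3"],
    "config": ["cfg-cache-1", "cfg-cache-2"],
}


def server_for(pool: str, key: str) -> str:
    """Pick a server inside the named pool by hashing the key."""
    servers = POOLS[pool]
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]


print(server_for("timeline", "user:12:home"))
print(server_for("config", "darkmode:search"))
```

FNV needs only an XOR and a multiply per byte, which is far less work than computing an MD5 digest for every cache key.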
  • 40. MySQL • Sharding large volumes of data is hard • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
  • 41. MySQL Challenges • Replication Delay • Single threaded. Slow. • Social Networking not good for RDBMS • N x N relationships and social graph / tree traversal • Disk issues (FS Choice, noatime, scheduling algorithm)
  • 42. Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Social Graph. • You don’t need ACID for everything.
  • 43. Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the right DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long running queries before they kill you (mkill)
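The slides name mkill but do not show how it works. As an illustration of the idea only, the sketch below picks long-running queries out of rows shaped like MySQL's `SHOW PROCESSLIST` output and builds `KILL` statements for them. The function name, the sample rows, and the five-second cutoff are assumptions.

```python
def kill_statements(processlist, max_seconds=5):
    """Return a KILL statement for each query running longer than max_seconds.

    Rows are dicts using SHOW PROCESSLIST column names (Id, Command, Time, Info).
    Idle connections (Command == 'Sleep') are left alone.
    """
    return [
        f"KILL {row['Id']}"
        for row in processlist
        if row["Command"] == "Query" and row["Time"] > max_seconds
    ]


# Made-up process list for the example.
sample = [
    {"Id": 101, "Command": "Query", "Time": 1, "Info": "SELECT ... LIMIT 20"},
    {"Id": 102, "Command": "Query", "Time": 42, "Info": "SELECT ... (no index)"},
    {"Id": 103, "Command": "Sleep", "Time": 300, "Info": None},
]

statements = kill_statements(sample)
for statement in statements:
    print(statement)
```

In practice the rows would come from the database itself and the statements would be executed against it; both steps are left out to keep the sketch self-contained.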
  • 44. Flock • Scalable Social Graph Store • Sharding via Gizzard • MySQL backend (many) • 13 billion edges, 100K reads/second • Open Source! • [Diagram: Flock → Gizzard → MySQL, MySQL, MySQL]
  • 45. Cassandra • Originally written by Facebook • Distributed Data Store • @rk’s changes to Cassandra Open Sourced • Currently double-writing into it • Transitioning to 100% soon.
  • 46. Lessons Learned • Instrument everything. Start graphing early. • Cache as much as possible • Start working on scaling early. • Don’t rely on memcache, and don’t rely on the database • Don’t use mongrel. Use Unicorn.
  • 48. Q&A
  • 49. Thanks! • @jointheflock • http://twitter.com/jobs • Download our work • http://twitter.com/about/opensource