Chirp 2010: Scaling Twitter

Billions of Hits: Scaling Twitter

From my presentation at Twitter's Chirp 2010.

Presentation Transcript

    • Billions of Hits: Scaling Twitter • John Adams, Twitter Operations
    • #chirpscale
    • John Adams @netik • Early Twitter employee (mid-2008) • Lead engineer: Outward Facing Services (Apache, Unicorn, SMTP), Auth, Security • Keynote Speaker: O’Reilly Velocity 2009 • O’Reilly Web 2.0 Speaker (2008, 2010) • Previous companies: Inktomi, Apple, c|net • Working on a Web Operations book with John Allspaw (Flickr, Etsy), out in June
    • Growth.
    • 752% 2008 Growth source: comscore.com - (based only on www traffic, not API)
    • 1358% 2009 Growth source: comscore.com - (based only on www traffic, not API)
    • 12th most popular source: alexa.com
    • 55M Tweets per day (640 TPS average, 1,000 TPS peak) source: twitter.com internal
    • 600M Searches/Day source: twitter.com internal
    • Traffic split: 25% Web, 75% API
    • Operations • What do we do? • Site Availability • Capacity Planning (metrics-driven) • Configuration Management • Security • Much more than basic Sysadmin
    • What have we done? • Improved response time, reduced latency • Fewer errors during deploys (Unicorn!) • Faster performance • Lower MTTD (Mean Time to Detect) • Lower MTTR (Mean Time to Recover)
    • Operations Mantra • Find Weakest Point → Take Corrective Action → Move to Next Weakest Point (repeat) • Metrics + Logs + Analysis = Science • Process = Repeatability
    • Make an attack plan.
      Symptom        | Bottleneck | Vector       | Solution
      HTTP Latency   | Bandwidth  | Network      | Servers++
      Timeline Delay | Database   | Update Delay | Better algorithm
      Status Growth  | Database   | Delays       | Flock, Cassandra
      Updates        | Algorithm  | Latency      | Algorithms
    • Finding Weakness • Metrics + Graphs • Individual metrics are irrelevant • We aggregate metrics to find knowledge • Logs • SCIENCE!
    • Monitoring • Twitter graphs and reports critical metrics in as near real time as possible • If you build tools against our API, you should too. • RRD, other Time-Series DB solutions • Ganglia + custom gmetric scripts • dev.twitter.com - API availability
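
      Those custom gmetric scripts can be tiny: grep a log, emit one number. A minimal sketch in Ruby, assuming an Apache-style access log at an invented path and a made-up metric name (not Twitter's actual scripts):

        #!/usr/bin/env ruby
        # Hypothetical Ganglia feeder: count HTTP 5xx responses in the
        # recent tail of an access log and push the figure via gmetric.
        LOG = "/var/log/apache2/access.log"   # assumed location

        # Count 5xx status codes among the most recent 10,000 lines.
        errors = `tail -n 10000 #{LOG}`.lines.count { |l| l =~ /" 5\d\d / }

        # Hand the value to Ganglia via the stock gmetric CLI.
        system("gmetric", "--name", "http_5xx_recent",
               "--value", errors.to_s,
               "--type", "uint32",
               "--units", "errors")
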
    • Analyze • Turn data into information • Where is the code base going? • Are things worse than they were? • Understand the impact of the last software deploy • Run check scripts during and after deploys • Capacity Planning, not Fire Fighting!
    • Data Analysis • Instrumenting the world pays off. • “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” “Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
    • Forecasting • Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit) • [Chart: status_id growth curve fit, r² = 0.99, with markers at the signed 32-bit and unsigned 32-bit "Twitpocalypse" limits]
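
      The mechanics of the forecast are simple: fit a curve to sampled status_id values over time and solve for when it crosses the 32-bit ceilings. A toy Ruby sketch with invented sample points (real growth was super-linear, so a straight-line fit only works over a short window):

        # Least-squares line through (day, max status_id) samples, then
        # solve for the day the fit crosses the signed 32-bit ceiling.
        # The sample points are made up for illustration.
        samples = { 0 => 1.20e9, 30 => 1.45e9, 60 => 1.73e9, 90 => 2.02e9 }

        n   = samples.size.to_f
        sx  = samples.keys.reduce(:+)
        sy  = samples.values.reduce(:+)
        sxx = samples.keys.map { |x| x * x }.reduce(:+)
        sxy = samples.map { |x, y| x * y }.reduce(:+)

        slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        intercept = (sy - slope * sx) / n

        limit = 2**31                  # signed 32-bit status_id ceiling
        puts "status_id hits 2^31 around day %.1f" % ((limit - intercept) / slope)
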
    • Internal Dashboard
    • External API Dashboard http://dev.twitter.com/status
    • What’s a Robot? • Actual error in the Rails stack (HTTP 500) • Uncaught Exception • Code problem, or failure / nil result • Increases our exception count • Shows up in Reports
    • What’s a Whale? • HTTP Error 502, 503 • Twitter has a hard and fast five second timeout • We’d rather fail fast than block on requests • We also kill long-running queries (mkill) • Timeout
    • Whale Watcher • Simple shell script; a MASSIVE WIN by @ronpepsi • Whale = HTTP 503 (timeout) • Robot = HTTP 500 (error) • Examines the last 60 seconds of aggregated daemon / www logs • If “Whales per Second” > W-threshold: Thar be whales! Call in ops.
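
      The real Whale Watcher was a shell script; below is a hedged Ruby re-imagining of the same idea, with an invented log path, line format, and threshold:

        #!/usr/bin/env ruby
        # Count whales (HTTP 503) and robots (HTTP 500) in the last 60
        # seconds of an aggregated log. Assumes each line starts with
        # "timestamp status ..."; the format here is illustrative.
        require 'time'

        WINDOW    = 60                  # seconds to inspect
        THRESHOLD = 1.0                 # whales/sec before paging ops
        now       = Time.now

        whales = robots = 0
        `tail -n 50000 /var/log/www/aggregate.log`.each_line do |line|
          ts, status = line.split[0, 2]
          next if now - Time.parse(ts) > WINDOW
          whales += 1 if status == "503"
          robots += 1 if status == "500"
        end

        if whales.to_f / WINDOW > THRESHOLD
          puts "Thar be whales! #{whales} in #{WINDOW}s (robots: #{robots})"
          # ... page ops here ...
        end
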
    • Deploy Watcher • Sample output:
      Sample window: 300.0 seconds
      First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010)
      Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010)
      PRODUCTION APACHE: ALL OK
      PRODUCTION OTHER: ALL OK
      WEB0049 CANARY APACHE: ALL OK
      WEB0049 CANARY BACKEND SERVICES: ALL OK
      DAEMON0031 CANARY BACKEND SERVICES: ALL OK
      DAEMON0031 CANARY OTHER: ALL OK
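
      The underlying check is a comparison of two sample windows, one before the deploy and one after. A sketch of that shape (the count_errors stub stands in for grepping the aggregated logs; everything here is illustrative, not the actual tool):

        # Compare error counts in two 300-second windows and flag
        # suspicious jumps per host class.
        WINDOW = 300.0  # seconds, as in the sample output above

        def count_errors(host_class, from)
          # ... grep aggregated logs for 5xx between from and
          # from + WINDOW for this host class ...
          rand(20)      # placeholder so the sketch runs
        end

        def check(host_class, before_t, after_t)
          before = count_errors(host_class, before_t)
          after  = count_errors(host_class, after_t)
          puts "#{host_class}: #{after > before * 1.5 ? 'CHECK DEPLOY' : 'ALL OK'}"
        end

        check("PRODUCTION APACHE",     Time.now - 86_400, Time.now)
        check("WEB0049 CANARY APACHE", Time.now - 86_400, Time.now)
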
    • Feature “Darkmode” • Specific site controls to enable and disable computationally or IO-heavy site functions • The “Emergency Stop” button • Changes logged and reported to all teams • Around 60 switches we can throw • Static / Read-only mode
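
      In code, a darkmode switch is just a named kill switch consulted before each heavy code path, with every flip logged. A minimal sketch of the idea; the Darkmode class and its in-process storage are illustrative, not Twitter's implementation (which, per the memcached slide, had to survive cache evictions):

        require 'logger'

        class Darkmode
          def initialize(logger = Logger.new($stdout))
            @dark   = {}        # feature name => true when disabled
            @logger = logger
          end

          def disable!(feature, who)
            @dark[feature] = true
            @logger.warn("DARKMODE ON: #{feature} (by #{who})")
          end

          def enable!(feature, who)
            @dark.delete(feature)
            @logger.warn("DARKMODE OFF: #{feature} (by #{who})")
          end

          def dark?(feature)
            @dark.fetch(feature, false)
          end
        end

        # Usage: guard an IO-heavy path behind the switch.
        switches = Darkmode.new
        switches.disable!(:search, "ops_oncall")
        unless switches.dark?(:search)
          # ... run the IO-heavy search path ...
        end
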
    • Request flow: Load Balancers → Apache mod_proxy → Rails (Unicorn) → backing services (Flock, memcached, Kestrel, MySQL, Cassandra) and Daemons
    • Servers • Co-located, dedicated machines at NTT America • No clouds: cloud is used only for monitoring, not serving • Need raw processing power; latency too high in existing cloud offerings • Frees us to deal with real, intellectual, computer science problems. • Moving to our own data center soon
    • Unicorn • A single-socket Rails application server (Rack) • Zero-downtime deploys (!) • Controlled, shuffled transfer to new code • Less memory, 30% less CPU • Shift from mod_proxy_balancer to mod_proxy_pass • HAProxy and Nginx weren’t any better, really.
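
      The zero-downtime trick is Unicorn's documented reexec dance: SIGUSR2 starts a new master alongside the old one, and a before_fork hook retires the old master as the new workers come up. A minimal config sketch; paths and worker count are illustrative:

        # config/unicorn.rb
        worker_processes 8
        preload_app true
        listen "/var/run/unicorn.sock", :backlog => 64
        pid "/var/run/unicorn.pid"

        before_fork do |server, worker|
          # After a USR2 reexec, gracefully quit the old master so
          # traffic shifts to the new code without dropped requests.
          old_pid = "#{server.config[:pid]}.oldbin"
          if File.exist?(old_pid) && server.pid != old_pid
            begin
              Process.kill(:QUIT, File.read(old_pid).to_i)
            rescue Errno::ENOENT, Errno::ESRCH
              # old master already gone
            end
          end
        end
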
    • Rails • Mostly front-end only • Back end is mostly Scala and pure Ruby • Not to blame for our issues. Analysis found: • Caching + cache invalidation problems • Bad queries generated by ActiveRecord, resulting in slow queries against the DB • Queue latency • Replication lag
    • memcached • memcached isn’t perfect. • Memcached SEGVs hurt us early on. • Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example) • Network Memory Bus isn’t infinite • Segmented into pools for better performance
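
      Pool segmentation in practice: one client per data type, so an eviction storm in the timeline pool can never push out configuration data. A sketch using the `memcached` gem (the libmemcached binding the Caching slide below refers to); pool names and hosts are invented:

        require 'memcached'

        OPTS = { :hash => :fnv1_32, :distribution => :consistent }

        POOLS = {
          :timelines => Memcached.new(%w[cache1:11211 cache2:11211], OPTS),
          :users     => Memcached.new(%w[cache3:11211 cache4:11211], OPTS),
          :config    => Memcached.new(%w[cache5:11211], OPTS),  # darkmode flags, etc.
        }

        POOLS[:users].set("user:12345", { :name => "jdoe" })
        user = begin
          POOLS[:users].get("user:12345")
        rescue Memcached::NotFound
          nil   # this gem raises on a miss rather than returning nil
        end
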
    • Loony • Central machine database (MySQL) • Python, Django, Paramiko SSH • Paramiko - Twitter OSS (@robey) • Ties into LDAP groups • When the data center sends us email, machine definitions are built in real time
    • Murder • @lg rocks! • BitTorrent-based replication for deploys • ~30-60 seconds to update >1k machines • P2P - Legal, valid, Awesome.
    • Kestrel • @robey • Works like memcache (same protocol) • SET = enqueue | GET = dequeue • No strict ordering of jobs • No shared state between servers • Written in Scala.
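
      Because Kestrel speaks the memcache protocol, any memcache client doubles as a queue client. A sketch using the classic memcache-client gem; the queue name is invented, and 22133 was Kestrel's usual port:

        require 'memcache'
        require 'json'

        queue = MemCache.new("kestrel1:22133")

        # Producer: SET = enqueue (raw string payload).
        queue.set("follow_events", { :user => 42, :follower => 99 }.to_json, 0, true)

        # Consumer: GET = dequeue; nil means the queue is empty right now.
        if (raw = queue.get("follow_events", true))
          job = JSON.parse(raw)
          # ... process the follow event ...
        end
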
    • Asynchronous Requests • Inbound traffic consumes a unicorn worker • Outbound traffic consumes a unicorn worker • The request pipeline should not be used to handle 3rd party communications or back-end work. • Reroute traffic to daemons
    • Daemons • Daemons touch every tweet • Many different daemon types at Twitter • Old way: one daemon per type (Rails) • New way: fewer daemons (pure Ruby) • Daemon Slayer - a multi-daemon that could do many different jobs, all at once.
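
      A hedged sketch of the multi-daemon shape: one pure-Ruby process draining several Kestrel queues and dispatching each job to a handler, instead of one daemon type per job. Queue names and handlers are invented:

        require 'memcache'
        require 'json'

        HANDLERS = {
          "deliver_tweets" => lambda { |job| },  # fan a tweet out to followers
          "send_email"     => lambda { |job| },  # SMTP work
          "update_search"  => lambda { |job| },  # index the tweet
        }

        kestrel = MemCache.new("kestrel1:22133")

        loop do
          worked = false
          HANDLERS.each do |queue, handler|
            if (raw = kestrel.get(queue, true))  # GET = dequeue
              handler.call(JSON.parse(raw))
              worked = true
            end
          end
          sleep 0.1 unless worked                # back off when all queues are empty
        end
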
    • Disk is the new Tape. • Social networking application profiles have many O(n^y) operations • Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms • Web 2.0 isn’t possible without lots of RAM • SSDs? What to do?
    • Caching • We’re the real-time web, but there’s lots of caching opportunity. You should cache what you get from us. • Most caching strategies rely on long TTLs (> 60 s) • Separate memcache pools for different data types to prevent eviction • Optimized the Ruby gem to use libmemcached + FNV hash instead of pure Ruby + MD5 • Twitter is now the largest contributor to libmemcached
    • MySQL • Sharding large volumes of data is hard • Replication delay and cache eviction produce inconsistent results to the end user. • Locks create resource contention for popular data
    • MySQL Challenges • Replication delay (single-threaded, slow) • Social networking is not a good fit for an RDBMS • N x N relationships and social graph / tree traversal • Disk issues (FS choice, noatime, scheduling algorithm)
    • Relational Databases not a Panacea • Good for: • Users, Relational Data, Transactions • Bad: • Queues. Polling operations. Social Graph. • You don’t need ACID for everything.
    • Database Replication • Major issues around users and statuses tables • Multiple functional masters (FRP, FWP) • Make sure your code reads and writes to the right DBs. Reading from master = slow death • Monitor the DB. Find slow / poorly designed queries • Kill long-running queries before they kill you (mkill)
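
      The read/write split in miniature: writes go to the functional master for that table, reads go to replicas, and nothing reads from master by accident. A sketch with invented hosts, assuming the mysql2 client gem:

        require 'mysql2'

        MASTERS = {
          :users    => Mysql2::Client.new(:host => "users-master.internal"),
          :statuses => Mysql2::Client.new(:host => "statuses-master.internal"),
        }
        REPLICAS = {
          :users    => [Mysql2::Client.new(:host => "users-replica1.internal")],
          :statuses => [Mysql2::Client.new(:host => "statuses-replica1.internal")],
        }

        def write_db(table)
          MASTERS.fetch(table)
        end

        def read_db(table)
          REPLICAS.fetch(table).sample  # spread reads across replicas
        end

        # Reading from master = slow death: keep SELECTs on replicas.
        read_db(:statuses).query("SELECT text FROM statuses WHERE user_id = 42 LIMIT 20")
        write_db(:statuses).query("INSERT INTO statuses (user_id, text) VALUES (42, 'hi')")
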
    • Flock • Scalable social graph store • Sharding via Gizzard • MySQL backend (many instances) • 13 billion edges, 100K reads/second • Open Source!
    • Cassandra • Originally written by Facebook • Distributed Data Store • @rk’s changes to Cassandra Open Sourced • Currently double-writing into it • Transitioning to 100% soon.
    • Lessons Learned • Instrument everything. Start graphing early. • Cache as much as possible • Start working on scaling early. • Don’t rely on memcache, and don’t rely on the database • Don’t use mongrel. Use Unicorn.
    • Join Us! @jointheflock
    • Q&A
    • Thanks! • @jointheflock • http://twitter.com/jobs • Download our work • http://twitter.com/about/opensource