High-Availability Infrastructure in the Cloud - Evan Cooke - Web 2.0 Expo NYC 2011

Designing a massively scalable highly available persistence layer has been one of the great challenges we’ve faced building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be solved using traditional sharding or scaling approaches. In this talk we first look to understand the challenges of running high-availability services in the cloud and then describe how we’ve architected “in-flight” and “post-flight” data into separate datastores that can be implemented using a range of technologies.



Presentation Transcript

  • SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD. Oct 11, 2011, Web 2.0 Expo. Twilio Cloud Communications. Evan Cooke, Co-Founder & CTO.
  • High-Availability: Sounds good, we need that! (slide graphic)
  • High-Availability: Sounds good, we need that! Availability = Uptime / (Uptime + Downtime)
  • High-Availability: Sounds good, we need that!

    Availability %             Downtime/yr      Downtime/mo
    99.9%    ("three nines")   8.76 hours       43.2 minutes
    99.99%   ("four nines")    52.56 minutes    4.32 minutes
    99.999%  ("five nines")    5.26 minutes     25.9 seconds
    99.9999% ("six nines")     31.5 seconds     2.59 seconds
  • High-Availability: Sounds good, we need that! (Same availability table as above.) You can't rely on a human to respond within a 5-minute window! Must use automation.
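    The downtime budgets in the table follow directly from the availability formula; a minimal sketch, assuming a 365-day year:

        # Downtime budget implied by an availability target (365-day year assumed).
        SECONDS_PER_YEAR = 365 * 24 * 3600

        def downtime_seconds_per_year(availability_pct):
            """Seconds of allowed downtime per year for a given availability %."""
            return SECONDS_PER_YEAR * (1 - availability_pct / 100.0)

        for label, pct in [("three nines", 99.9), ("four nines", 99.99),
                           ("five nines", 99.999), ("six nines", 99.9999)]:
            print("%s (%s%%): %.1f seconds of downtime per year"
                  % (label, pct, downtime_seconds_per_year(pct)))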
  • Happens to the best:
    - 2.5 Hours Down, September 23, 2010: "...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site."
    - 11 Hours Down, October 4, 2010: "...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours."
    - Hours Down, November 14, 2010: "...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub's production database was destroyed then re-created. Not good."
  • Causes of Downtime:
    - Lack of best practice change control
    - Lack of best practice monitoring of the relevant components
    - Lack of best practice requirements and procurement
    - Lack of best practice operations
    - Lack of best practice avoidance of network failures
    - Lack of best practice avoidance of internal application failures
    - Lack of best practice avoidance of external services that fail
    - Lack of best practice physical environment
    - Lack of best practice network redundancy
    - Lack of best practice technical solution of backup
    - Lack of best practice process solution of backup
    - Lack of best practice physical location
    - Lack of best practice infrastructure redundancy
    - Lack of best practice storage architecture redundancy
    Source: E. Marcus and H. Stern, Blueprints for High Availability, 2nd ed. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
  • The same causes grouped into four buckets, split across Cloud and Non-Cloud responsibility (diagram): Data Persistence, Change Control, Operations, and Datacenter.
  • Happens to the best: the same three outages, annotated by cause (Database failures and, in GitHub's case, Change Control).
  • Of the four buckets (Data Persistence, Change Control, Operations, Datacenter), today: Data Persistence and Change Control, lessons learned @twilio.
  • Twilio provides web service APIs to automate Voice and SMS communications (diagram: developers and end users connect phones, mobile/browser VoIP, and carriers; inbound/outbound calls, SMS to/from phone numbers and short codes, dynamically buy phone numbers).
  • Growth, 2009 to 2011 (chart: 3, 6, 20, 70+).
  • 100x growth in transactions/day over 1 year (chart: 1X to 10X to 100X).
  • Server growth: 2009, 10 servers; 2010, 10's of servers; 2011, 100's of servers.
  • 2011:
    - 100's of prod hosts in continuous operation
    - 80+ service types running in prod
    - 50+ prod database servers
    - Prod deployments several times/day across 7 engineering teams
  • 2011:
    - Frameworks: PHP for frontend components; Python Twisted & gevent for async network services; Java for backend services
    - Storage technology: MySQL for core DB services; Redis for queuing and messaging
  • Data persistence is hard (especially in the cloud).
  • Data persistence is the hardest technical problem most scalable SaaS businesses face.
  • What is data persistence? Stuff that looks like this
  • What is data persistence? Databases, queues, files.
  • Diagram: incoming requests hit a load balancer and flow through three tiers of app servers (A, B, C/D), with data persistence at every layer: queues, SQL, files, and K/V stores.
  • Why is persistence so hard?
    - Difficult to change structure: huge inertia, e.g., large schema migrations
    - Painful to recover from disk/node failures: "just boot a new node" doesn't work
    - Woeful performance/scalability: I/O is a huge bottleneck in modern servers (e.g., EC2)
    - Freakin' complex!!! Atomic transactions/rollback, ACID, blah blah blah
  • Difficult to Change Structure: ALTER TABLE names DROP COLUMN Value on a 500-million-row table (Id, Name, Value becomes Id, Name) takes HOURS. You live with data decisions for a long time.
  • Painful to Recover from Failures (diagram: writes go to the primary DB, reads to primary and secondary): Is the data on the secondary? How much data? What about R/W consistency? Because of this complexity, failover is a human process.
  • Woeful Performance/Scalability: EC2 m1.xlarge, RAID0 over 4x ephemeral disks (iostat output), roughly 10 MB/s write on md0. Poor I/O in the cloud today, 100x slower than real hardware.
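    The iostat figures above come from a live workload; as a rough sanity check on a fresh instance, a minimal sequential-write benchmark like the sketch below (file name and size are arbitrary assumptions) gives a ballpark number to compare against real hardware:

        import os
        import time

        def sequential_write_mb_s(path="io_test.tmp", total_mb=256):
            """Write total_mb MiB in 1 MiB chunks, fsync, and return MB/s."""
            chunk = b"\0" * (1024 * 1024)
            start = time.time()
            with open(path, "wb") as f:
                for _ in range(total_mb):
                    f.write(chunk)
                f.flush()
                os.fsync(f.fileno())  # make sure the data actually hit the disk
            elapsed = time.time() - start
            os.remove(path)
            return total_mb / elapsed

        if __name__ == "__main__":
            print("~%.1f MB/s sequential write" % sequential_write_mb_s())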
  • Woeful Performance/Scalability (diagram: many DB nodes): difficult to horizontally scale in the cloud.
  • @!#$%^&* Complex
    - Incredibly complex: a billion knobs and buttons; whole companies exist just to tune DBs (slide shows a raw InnoDB "BUFFER POOL AND MEMORY" status dump)
    - Lots of consistency/transactional models
    - Multi-region data is unsolved: Facebook and Google struggle
  • Deep breath, step back. Think about each problem (using @twilio examples):
    - Software that runs in the cloud
    - Open source
  • 1. Difficult to Change Structure
    - Don't have structure: key/value databases (SimpleDB, Cassandra), document-oriented databases (CouchDB, MongoDB)
    - Don't store a lot of data...
  • 1. Don't Store Stuff: outsource data as much as possible, but NOT to your customers.
  • 1. Don't Store Stuff: aggressively archive and move data offline to S3/SimpleDB (~500M rows; keep indices in memory), as in the sketch below. Build UX that supports longer/restricted access times to older data.
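    A minimal sketch of the archive-offline idea, assuming boto3 and an illustrative bucket/key layout (the talk's era used the older boto library, and this is not Twilio's actual scheme): cold rows are pushed to S3 as blobs and only a small index stays online.

        import json
        import boto3  # assumed AWS SDK

        s3 = boto3.client("s3")

        def archive_rows(rows, bucket="example-archive-bucket"):
            """Move cold rows out of the primary DB into S3, keeping only a small index.

            rows: iterable of dicts, each with an 'id' field (hypothetical shape).
            Returns a mapping of row id -> S3 key to keep in memory or in a tiny table.
            """
            index = {}
            for row in rows:
                key = "archive/%s.json" % row["id"]
                s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(row))
                index[row["id"]] = key
            return index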
  • 1. Don't Store Stuff: avoid stateful systems/architectures where possible (diagram: web browser holds a session-ID cookie, web tier backed by a session DB).
  • 1. Don't Store Stuff: avoid stateful systems/architectures where possible; store state in the client instead (diagram: the browser cookie holds enc($session) rather than a session ID pointing at a session DB).
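    A minimal sketch of the enc($session) idea in Python, assuming the cryptography library's Fernet primitive (the talk doesn't name a specific cipher): the whole session travels in the cookie, so the web tier needs no session DB.

        import json
        from cryptography.fernet import Fernet  # assumed crypto library

        SECRET_KEY = Fernet.generate_key()  # in practice, a fixed key shared by all web nodes
        fernet = Fernet(SECRET_KEY)

        def encode_session(session):
            """Serialize and encrypt session state so it can live in the cookie."""
            return fernet.encrypt(json.dumps(session).encode())

        def decode_session(cookie_value):
            """Decrypt and deserialize the session on each request."""
            return json.loads(fernet.decrypt(cookie_value))

        cookie = encode_session({"user_id": 42, "cart": ["sku-1"]})
        assert decode_session(cookie)["user_id"] == 42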
  • 2. Painful to Recover from Failures
    - Avoid single points of failure, e.g., master-master (active/active): complex to set up, complex failure modes, sometimes it's the only solution; lots of great docs on the web
    - Minimize the number of stateful nodes; separate stateful & stateless components...
  • 2. Separate Stateful and Stateless Components (diagram: a request flows through App A, App B, App C): on failure of App B, even if we boot a replacement, we lose data.
  • 2. Separate Stateful and Stateless Components (diagram: queues added between App A, App B, and App C): on failure of App B, even if we boot a replacement, we still lose in-flight data.
  • 2. Separate Stateful and Stateless Components: keep the connection open for the whole app path (hint: use an evented framework). Twilio's SMS stack uses this approach. On failure, we don't lose a single request. (See the sketch below.)
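    A minimal sketch of the queue-buffering half of this pattern, assuming Redis (which the talk lists for queuing) and illustrative queue names; it is not Twilio's actual SMS stack. Work sits in a durable queue between stateless workers, so a crashed worker's replacement simply resumes consuming.

        import json
        import redis  # assumed Python client for Redis

        r = redis.Redis()

        def enqueue(queue, payload):
            """Producer: push work onto a Redis list instead of holding it in-process."""
            r.rpush(queue, json.dumps(payload))

        def work(queue):
            """Consumer: block for work; a replacement worker picks up where this one left off."""
            while True:
                _, raw = r.blpop(queue)
                # NOTE: for at-least-once delivery, pop with RPOPLPUSH into a
                # per-worker backup list so a crash mid-handle can't drop the message.
                handle(json.loads(raw))

        def handle(message):
            print("processing", message)

        enqueue("sms-out", {"to": "+15550100", "body": "hello"})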
  • 2. Painful to Recover from Failures
    - Avoid single points of failure, e.g., master-master (active/active): complex to set up, complex failure modes, sometimes it's the only solution; lots of great blog posts and docs on the web
    - Minimize the number of stateful nodes; separate stateful & stateless components
    - Build a data change control process to avoid mistakes and errors...
  • 100's of prod hosts in continuous operation; 80+ service types running in prod; 50+ prod database servers; prod deployments several times/day across 7 engineering teams. Components are deployed at different frequencies: Partially Continuous Deployment.
  • Deployment frequency (risk) falls into 4 buckets (log scale):
    - Website Content (CMS, etc.): 1000x
    - Website Code (PHP/Ruby, etc.): 100x
    - REST API (Python/Java): 10x
    - Big DB Schema (SQL): 1x
  • Deployment processes per bucket:
    - Website Content: one click
    - Website Code: CI tests, then one click
    - REST API: CI tests, human sign-off, then one click
    - Big DB Schema: CI tests, human sign-off, then a human-assisted click
  • 3. Woeful Performance/Scalability
    - If disk I/O is poor, avoid disk: tune, tune, tune; keep your indices in memory; use an in-memory datastore, e.g., Redis, and configure replication such that if you have a master failure you can always promote a slave
    - When disk I/O saturates, shard (see the sketch below): LOTS of sharding info on the web; a method of last resort, the single point of failure becomes multiple single points of failure
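    When one database saturates, a hash-based router spreads keys across shards. A minimal sketch with hypothetical shard names; the slide's caveat applies, since each shard is now its own single point of failure unless it has a replica:

        import hashlib

        SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]  # hypothetical

        def shard_for(key, shards=SHARDS):
            """Deterministically map a key to a shard.

            Plain modulo hashing: adding or removing a shard remaps most keys,
            so real systems usually layer consistent hashing or a lookup table on top.
            """
            digest = hashlib.md5(key.encode()).hexdigest()
            return shards[int(digest, 16) % len(shards)]

        print(shard_for("call-sid-CA123"))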
  • 4. @#$%^&* Complex
    - Bring the simplest tool to the job: use a strictly consistent store only if you need it; if you don't need HA, don't add the complexity
    - There is no magic database (slide joke: "Magic Database, does it all! Consistency, Availability, Partition-tolerance, it's got all three")
    - Decompose requirements, mix-and-match datastores as needed...
  • 4. Twilio Data Lifecycle: CREATE, UPDATE, UPDATE... A record moves through states: {name:foo, status:INIT, ret:0} to {name:foo, status:QUEUED, ret:0} to {name:foo, status:GOING, ret:0} to {name:foo, status:DONE, ret:42}. Twilio examples: Call, SMS, Conference. Other examples: Order, Workflow, $.
  • 4. Twilio Data Lifecycle: the same state sequence, split into In-Flight (INIT, QUEUED, GOING) and Post-Flight (DONE).
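    A minimal sketch of the in-flight/post-flight split, with in-memory dicts standing in for the two datastores and field names taken from the slide (name, status, ret); the store choices are illustrative, not Twilio's implementation.

        from dataclasses import dataclass

        @dataclass
        class Record:
            name: str
            status: str = "INIT"
            ret: int = 0

        IN_FLIGHT = {}    # stand-in for the strictly consistent, low-latency store
        POST_FLIGHT = []  # stand-in for the eventually consistent archive

        def update(record, status, ret=None):
            """Advance a record through its lifecycle; DONE moves it out of the hot path."""
            record.status = status
            if ret is not None:
                record.ret = ret
            if status == "DONE":
                IN_FLIGHT.pop(record.name, None)
                POST_FLIGHT.append(record)   # now available for logs, analytics, billing
            else:
                IN_FLIGHT[record.name] = record

        rec = Record("foo")
        update(rec, "QUEUED")
        update(rec, "GOING")
        update(rec, "DONE", ret=42)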
  • 4. Twilio Data Lifecycle, applications. In-Flight: atomically update part of a workflow. Post-Flight: billing, log access, analytics, reporting.
  • 4. Twilio Data Lifecycle, properties (both need high availability). In-Flight: strict consistency, key/value, ~20ms. Post-Flight: eventual consistency, range queries with filters, ~200ms.
  • 4. Twilio Data Lifecycle: In-Flight and Post-Flight are systems with very different access semantics (Data Store A vs. Data Store B).
  • 4. In-Flight vs. Post-Flight requirements (see the billing sketch below):
    - In-Flight: strict consistency, key/value, ~20ms, 10k-1M records, feeding queues into the post-flight stores
    - Post-Flight Logs (REST API): eventual consistency, range queries, filtered queries, ~200ms, billions of records
    - Post-Flight Reporting: eventual consistency, arbitrary queries, high latency, billions of records
    - Post-Flight Billing: idempotent aggregation, key/value, billions of records
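    The billing path calls for idempotent aggregation because queued events can be delivered more than once. A minimal sketch with in-memory structures standing in for durable tables; event ids and account names here are made up:

        seen_events = set()  # stand-in for a durable dedup table keyed by event id
        balances = {}        # account -> amount billed so far

        def record_charge(event_id, account, amount):
            """Apply a charge exactly once, even if the queue redelivers the event."""
            if event_id in seen_events:
                return  # duplicate delivery: safely ignored
            seen_events.add(event_id)
            balances[account] = balances.get(account, 0.0) + amount

        record_charge("evt-1", "AC123", 0.01)
        record_charge("evt-1", "AC123", 0.01)  # redelivery has no effect
        assert balances["AC123"] == 0.01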
  • 4. Candidate technologies:
    - In-Flight: MySQL, PostgreSQL, Redis, NDB
    - Post-Flight Logs (REST API): sharded SQL, Cassandra/Acunu, MongoDB, Riak, CouchDB
    - Post-Flight Reporting: Hadoop
    - Post-Flight Billing: sharded SQL, Redis
  • Data
  • Why is persistence so hard? (recap, with answers)
    - Difficult to change structure (huge inertia, e.g., large schema migrations): go schema-less, don't store stuff!
    - Painful to recover from disk/node failures ("just boot a new node" doesn't work): separate stateful/stateless, change control processes
    - Woeful performance/scalability (I/O is a huge bottleneck in modern servers, e.g., EC2): memory FTW, shard
    - Freakin' complex!!! (atomic transactions/rollback, ACID, blah blah blah): decompose the data lifecycle, minimize complexity
  • The same three-tier diagram: incoming requests, load balancer, app tiers A/B/C/D, with queues, SQL, files, and K/V stores in between.
  • The same diagram, annotated with Twilio's approach: make the request path idempotent; aggregate into HA queues; run master-master MySQL; move the K/V store to SimpleDB with a local cache (see the sketch below); move the file store to S3.
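    A minimal read-through cache sketch for the "SimpleDB with a local cache" piece; remote_get is a hypothetical stand-in for the SimpleDB fetch, and the cache here is an unbounded dict (a real one would cap size and expire entries):

        local_cache = {}
        REMOTE_KV = {"PN123": {"number": "+15550100"}}  # stand-in for the remote K/V store

        def remote_get(key):
            """Hypothetical remote fetch (SimpleDB in the slide)."""
            return REMOTE_KV[key]

        def get(key):
            """Serve hot keys locally; fall back to the remote store on a miss."""
            if key in local_cache:
                return local_cache[key]
            value = remote_get(key)
            local_cache[key] = value
            return value

        print(get("PN123"))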
  • HA is Hard (the four buckets again: Data Persistence, Change Control, Operations, Datacenter).
  • SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD: focus on data. How you store it, where you store it, when you can delete it, and control changes to it.
  • Open Problems...
    - In-Flight: a simple multi-AZ/multi-region consistent K/V store; a massively scalable HA queue
    - Post-Flight Logs (REST API): massively scalable range queries, filterable, ~200ms
    - Post-Flight Reporting: simple HA Hadoop
    - Post-Flight Billing: a massively scalable aggregator
  • twilio: http://www.twilio.com, @emcooke