Designing a massively scalable, highly available persistence layer has been one of the great challenges of building out Twilio’s cloud communications infrastructure. Robust Voice and SMS APIs have strict consistency, latency, and availability requirements that cannot be met with traditional sharding or scaling approaches. In this talk we first examine the challenges of running high-availability services in the cloud, then describe how we’ve architected “in-flight” and “post-flight” data into separate datastores that can be implemented using a range of technologies.
High-Availability: Sounds good, we need that!

Availability               Downtime/yr      Downtime/mo
99.9%    ("three nines")   8.76 hours       43.2 minutes
99.99%   ("four nines")    52.56 minutes    4.32 minutes
99.999%  ("five nines")    5.26 minutes     25.9 seconds
99.9999% ("six nines")     31.5 seconds     2.59 seconds

Can’t rely on a human to respond in a 5-minute window! Must use automation.
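The downtime budgets in the table above fall straight out of the availability percentage. A short sketch of the arithmetic (the function name is ours, for illustration):

```python
# Downtime budget implied by an availability target.
def downtime_per_year(availability_pct):
    """Seconds of allowed downtime per year for a given availability %."""
    seconds_per_year = 365 * 24 * 3600  # 31,536,000 s
    return seconds_per_year * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% -> {downtime_per_year(pct):,.0f} s/yr")
# 99.9% allows 31,536 s/yr (8.76 hours); 99.999% allows ~315 s/yr (5.26 minutes)
```

At five nines, the entire yearly budget is shorter than a typical pager-to-keyboard response time, which is why automation is mandatory.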
Happens to the best

2.5 Hours Down — September 23, 2010
“...we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”

11 Hours Down — October 4, 2010
“...At 6:30pm EST, we determined the most effective course of action was to re-index the [database] shard, which would address the memory fragmentation and usage issues. The whole process, including extensive testing against data loss and data corruption, took about five hours.”

Hours Down — November 14, 2010
“...Before every run of our test suite we destroy then re-create the database... Due to the configuration error GitHub’s production database was destroyed then re-created. Not good.”
Causes of Downtime
- Lack of best practice: change control
- Lack of best practice: monitoring of the relevant components
- Lack of best practice: requirements and procurement
- Lack of best practice: operations
- Lack of best practice: avoidance of network failures
- Lack of best practice: avoidance of internal application failures
- Lack of best practice: avoidance of external services that fail
- Lack of best practice: physical environment
- Lack of best practice: network redundancy
- Lack of best practice: technical solution of backup
- Lack of best practice: process solution of backup
- Lack of best practice: physical location
- Lack of best practice: infrastructure redundancy
- Lack of best practice: storage architecture redundancy

E. Marcus and H. Stern, Blueprints for High Availability, 2nd ed. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.
Cloud
  Data Persistence
  - storage architecture redundancy
  - technical solution of backup
  - process solution of backup
  Change Control
  - change control

Non-Cloud
  Operations
  - monitoring of the relevant components
  - requirements and procurement
  - operations
  - avoidance of internal app failures
  - avoidance of external services that fail
  Datacenter
  - avoidance of network failures
  - physical environment
  - network redundancy
  - physical location
  - infrastructure redundancy
Happens to the best (revisited)

Bucketing the same three outages by cause: all three were Database failures, and GitHub’s was triggered by a Change Control error (the test-suite configuration that destroyed then re-created the production database).
Today: Data Persistence and Change Control — lessons learned @twilio
(Of the four buckets — Data Persistence, Change Control, Operations, Datacenter — this talk focuses on the two “cloud” ones.)
Twilio provides web service APIs to automate Voice and SMS communications
- Voice: inbound calls, outbound calls, mobile/browser VoIP (to/from carriers)
- SMS: send to/from phone numbers and short codes
- Phone Numbers: dynamically buy phone numbers
(Developer → Twilio → Carriers → End User’s phone)
100x growth in Tx/Day over 1 year (log-scale chart: X → 10X → 100X)
2009: 10 servers → 2010: 10’s of servers → 2011: 100’s of servers
2011• 100’s of prod hosts in continuous operation• 80+ service types running in prod• 50+ prod database servers• Prod deployments several times/day across 7 engineering teams
2011• Frameworks - PHP for frontend components - Python Twisted & gevent for async network services - Java for backend services• Storage technology - MySQL for core DB services - Redis for queuing and messaging
Data persistence is hard(especially in the cloud)
Data persistence is hard
Data persistence is the hardest technical problem most scalable SaaS businesses face
What is data persistence? Stuff that looks like this
What is data persistence? Databases Queues Files
Incoming Requests
       |
      LB
Tier 1: [A] [A]
        (Q) (Q)           ← queues, SQL, files, K/V = Data Persistence!
Tier 2: [B] [B] [B] [B] — SQL
Tier 3: [C] [C] [D] [D] — Files, K/V
Why is persistence so hard?
• Difficult to change structure
  - Huge inertia, e.g., large schema migrations
• Painful to recover from disk/node failures
  - “just boot a new node” doesn’t work
• Woeful performance/scalability
  - I/O is a huge bottleneck in modern servers (e.g., EC2)
• Freak’in complex!!!
  - Atomic transactions/rollback, ACID, blah blah blah
Difficult to Change Structure

ALTER TABLE names DROP COLUMN Value

Before:                 After:
Id  Name   Value        Id  Name
1   Bob    12           1   Bob
2   Jane   78           2   Jane
3   Steve  56           3   Steve
...

500 million rows, HOURS later...
‣ You live with data decisions for a long time
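One common workaround for hour-long ALTERs is the batched shadow-table approach (the idea behind tools like pt-online-schema-change): create a table with the target schema, copy rows over in small batches, then swap names. A toy sketch using SQLite purely for illustration — table and batch-size choices are ours:

```python
import sqlite3

# Toy illustration of a batched "shadow table" schema change: instead of
# one giant ALTER TABLE, copy rows in small batches into a new table with
# the target schema, then swap the names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (Id INTEGER PRIMARY KEY, Name TEXT, Value INTEGER)")
conn.executemany("INSERT INTO names VALUES (?, ?, ?)",
                 [(1, "Bob", 12), (2, "Jane", 78), (3, "Steve", 56)])

# The target schema: same table without the Value column.
conn.execute("CREATE TABLE names_new (Id INTEGER PRIMARY KEY, Name TEXT)")

BATCH = 2  # in production this would be thousands of rows per batch
last_id = 0
while True:
    rows = conn.execute(
        "SELECT Id, Name FROM names WHERE Id > ? ORDER BY Id LIMIT ?",
        (last_id, BATCH)).fetchall()
    if not rows:
        break
    conn.executemany("INSERT INTO names_new VALUES (?, ?)", rows)
    last_id = rows[-1][0]  # resume point for the next batch

# Swap: retire the old table, rename the shadow into place.
conn.execute("DROP TABLE names")
conn.execute("ALTER TABLE names_new RENAME TO names")
print(conn.execute("SELECT * FROM names").fetchall())
# [(1, 'Bob'), (2, 'Jane'), (3, 'Steve')]
```

Real online-migration tools also replay writes that land during the copy (via triggers or the binlog); that bookkeeping is omitted here.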
Painful to Recover from Failures

  W, R → DB (Primary) ⇄ DB (Secondary) ← R

Is the data on the secondary? How much data? What R/W consistency?
‣ Because of this complexity, failover is a human process
Woeful Performance/Scalability
  DB  DB  DB
  DB  DB  DB
‣ Difficult to horizontally scale in the cloud
@!#$%^&* Complex
• Incredibly complex
  - Billion knobs and buttons
  - Whole companies exist just to tune DB’s
• Lots of consistency/transactional models
• Multi-region data is unsolved — Facebook and Google struggle
(Illustration: pages of MySQL InnoDB status output — buffer pool and memory statistics, hash tables, pending reads/writes, hit rates...)
Deep breath, step back Think about each problem (use @twilio examples) • Software that runs in the cloud • Open source
1 Difficult to Change Structure
• Don’t have structure
  - key/value databases (SimpleDB, Cassandra)
  - document-oriented databases (CouchDB, MongoDB)
• Don’t store a lot of data...
1 Don’t Store Stuff • Outsource data as much as possible • But NOT to your customers
1 Don’t Store Stuff
• Aggressively archive and move data offline
  - keep only hot data (~500M rows) in the DB, with indices in memory
  - move older data to S3/SimpleDB
  - build UX that supports longer/restricted access times to older data
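The archiving step above is essentially a recurring job: select rows past a cutoff, ship them to cheap offline storage, delete them from the hot store. A minimal sketch — SQLite and a local JSON-lines file stand in for the production DB and S3, and the table/column names are invented:

```python
import json
import os
import sqlite3
import tempfile
import time

# Sketch: move records older than a cutoff out of the hot database into an
# append-only archive file (a stand-in for an S3 PUT per batch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER PRIMARY KEY, created INTEGER, log TEXT)")
now = int(time.time())
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", [
    (1, now - 90 * 86400, "old call"),    # 90 days old -> archive
    (2, now - 1 * 86400, "recent call"),  # 1 day old   -> keep hot
])

cutoff = now - 30 * 86400  # archive anything older than 30 days
old = conn.execute(
    "SELECT id, created, log FROM calls WHERE created < ?", (cutoff,)).fetchall()

archive_path = os.path.join(tempfile.gettempdir(), "archive.jsonl")
with open(archive_path, "a") as f:  # in production: upload the batch to S3
    for row in old:
        f.write(json.dumps({"id": row[0], "created": row[1], "log": row[2]}) + "\n")

# Only delete from the hot store once the archive write has succeeded.
conn.execute("DELETE FROM calls WHERE created < ?", (cutoff,))
print(conn.execute("SELECT COUNT(*) FROM calls").fetchone()[0])  # 1 hot row left
```

Ordering matters: archive first, delete second, so a crash between the two steps duplicates data rather than losing it.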
1 Don’t Store Stuff
• Avoid stateful systems/architectures where possible
  Web Browser → Web tier → Session DB
  Cookie: SessionID
1 Don’t Store Stuff
• Avoid stateful systems/architectures where possible
  - Store state in the client instead
  Web Browser → Web tier (no Session DB)
  Cookie: enc($session)
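The enc($session) cookie pattern needs the server to detect tampering, since the client now holds the state. A minimal sketch using Python’s standard library (function names and the secret are ours; this signs but does not encrypt, so add encryption if the contents are sensitive):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side secret"  # hypothetical key; load from config in practice

def encode_session(session: dict) -> str:
    """Serialize session state and sign it so the client can't tamper with it."""
    payload = base64.urlsafe_b64encode(json.dumps(session).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def decode_session(cookie: str) -> dict:
    """Verify the signature, then deserialize. Raises on tampering."""
    payload, sig = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered session cookie")
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = encode_session({"user_id": 42})
print(decode_session(cookie))  # {'user_id': 42}
```

Any web node can now validate the cookie with nothing but the shared secret, which is what removes the Session DB from the request path.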
2 Painful to Recover from Failures
• Avoid single points of failure
  - E.g., master-master (active/active)
  - Complex to set up, complex failure modes
  - Sometimes it’s the only solution
  - Lots of great docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components...
2 Separate Stateful and Stateless Components
  Req → App A → App B → App C
  On failure of App B, even if we boot a replacement, we lose data
2 Separate Stateful and Stateless Components
  Req → App A → Queue → App B → Queue → App C → Queue
  On failure of App B, even if we boot a replacement, we lose the in-flight data
2 Separate Stateful and Stateless Components
  Keep the connection open for the whole app path! (hint: use an evented framework)
  Req → App A → App B → App C
  Twilio’s SMS stack uses this approach
  On failure, we don’t lose a single request
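The separation idea can be sketched in a few lines: each stage is a stateless transform, and the only stateful pieces are the queues between them. Here plain in-process `queue.Queue`s stand in for an HA queue (e.g., Redis), and the stage names mirror the diagram rather than any real Twilio service:

```python
import queue
import threading

# Sketch: put a queue between each stateless stage so the stages share no
# state and a crashed worker's replacement can resume from the queue.
q_ab = queue.Queue()     # App A -> App B
q_bc = queue.Queue()     # App B -> App C
results = queue.Queue()  # App C's output

def app_b():
    while True:
        msg = q_ab.get()
        if msg is None:          # shutdown sentinel: pass it downstream
            q_bc.put(None)
            return
        q_bc.put(msg.upper())    # a stateless transform

def app_c():
    while True:
        msg = q_bc.get()
        if msg is None:
            return
        results.put(msg + "!")   # another stateless transform

threads = [threading.Thread(target=f) for f in (app_b, app_c)]
for t in threads:
    t.start()
q_ab.put("sms to +15551234567")  # App A enqueues the request
q_ab.put(None)                   # signal shutdown
for t in threads:
    t.join()
out = results.get()
print(out)  # SMS TO +15551234567!
```

With a durable external queue, the worker would only acknowledge a message after handing its output to the next queue, so a mid-processing crash re-delivers the request instead of losing it.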
2 Painful to Recover from Failures
• Avoid single points of failure
  - E.g., master-master (active/active)
  - Complex to set up, complex failure modes
  - Sometimes it’s the only solution
  - Lots of great blog posts and docs on the web
• Minimize the number of stateful nodes; separate stateful & stateless components
• Build a data change control process to avoid mistakes and errors...
• 100’s of prod hosts in continuous operation
• 80+ service types running in prod
• 50+ prod database servers
• Prod deployments several times/day across 7 engineering teams

Components deployed at different frequencies: Partially Continuous Deployment
Deployment Frequency (Risk) — 4 buckets, log scale

1000x  Website Content  (CMS)
 100x  Website Code     (PHP/Ruby)
  10x  REST API         (Python/Java)
   1x  Big DB Schema    (SQL)
Deployment Processes

Website Content:  One Click
Website Code:     CI Tests → One Click
REST API:         CI Tests → Human Sign-off → One Click
Big DB Schema:    CI Tests → Human Sign-off → Human Assisted Click
3 Woeful Performance/Scalability
• If disk I/O is poor, avoid disk
  - Tune, tune, tune. Keep your indices in memory
  - Use an in-memory datastore, e.g., Redis, and configure replication such that if the master fails, you can always promote a slave
• When disk I/O saturates, shard
  - Lots of sharding info on the web
  - Method of last resort: a single point of failure becomes multiple single points of failure
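The promote-a-slave step can be automated with a small health-check loop. A minimal sketch, assuming the earlier point that a human can’t respond inside the downtime budget: `FakeRedis` below is an invented stand-in so the example is self-contained, but its `slaveof()` mirrors the shape of redis-py’s, where calling it with no arguments issues SLAVEOF NO ONE (i.e., promote):

```python
# Sketch of automated replica promotion on master failure.
class FakeRedis:
    """Invented stand-in for a real Redis client."""
    def __init__(self, name, master=None):
        self.name, self.master, self.alive = name, master, True

    def ping(self):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return True

    def slaveof(self, host=None, port=None):
        # No arguments -> promote this node to master (SLAVEOF NO ONE).
        self.master = (host, port) if host else None

def failover(master, replica):
    """Promote the replica if the master stops answering pings."""
    try:
        master.ping()
        return master        # master healthy; nothing to do
    except ConnectionError:
        replica.slaveof()    # promote
        return replica       # replica is the new master

m = FakeRedis("redis-1")
r = FakeRedis("redis-2", master=("redis-1", 6379))
m.alive = False              # simulate master failure
new_master = failover(m, r)
print(new_master.name, new_master.master)  # redis-2 None
```

A production version also needs fencing (making sure the old master can’t come back and accept writes) and a way to repoint clients, which tools like Redis Sentinel later standardized.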
4 @#$%^&* Complex
• Bring the simplest tool to the job
  - Use a strictly consistent store only if you need it
  - If you don’t need HA, don’t add the complexity
• There is no magic database. Decompose requirements and mix-and-match datastores as needed...
  (“Magic Database: does it all! Consistency, Availability, Partition-tolerance — it’s got all three.”)
Why is persistence so hard?
• Difficult to change structure
  → Go schema-less; don’t store stuff!
• Painful to recover from disk/node failures
  → Separate stateful/stateless components; change control processes
• Woeful performance/scalability
  → Memory FTW; shard
• Freak’in complex!!!
  → Minimize the data lifecycle; decompose complexity
Incoming Requests
       |
      LB
Tier 1: [A] [A]
        (Q) (Q)
Tier 2: [B] [B] [B] [B] — SQL
Tier 3: [C] [C] [D] [D] — Files, K/V
Incoming Requests
       |
      LB                  ← idempotent request path
Tier 1: [A] [A]
        (Q) (Q)           ← aggregate into HA queues
Tier 2: [B] [B] [B] [B] — master-master MySQL
Tier 3: file store moved to S3; K/V moved to SimpleDB with local cache
HA is Hard
(Overlaid on the causes-of-downtime buckets: Data Persistence, Change Control, Operations, Datacenter)
SCALING HIGH-AVAILABILITY INFRASTRUCTURE IN THE CLOUD

Focus on data:
- How you store it
- Where you store it
- When you can delete it
- Control changes to it
Open Problems...

In-Flight:
- Massively scalable HA queue (REST API, ~200ms)
- Simple multi-AZ / multi-region HA queue
- Simple consistent K/V store
- Massively scalable aggregator

Post-Flight:
- Logs: range queries, filterable
- Reporting: Hadoop
- Billing: Hadoop