Deploying Next Gen Systems with Zero Downtime

Hear what we learned from deploying next generation systems at Twilio, with zero downtime.

Published in: Technology
Notes
  • Let’s rewind to 2009 and take a look at what we built.
  • At Twilio, we nerd out on billing every day. It’s odd and quirky, but super powerful, and mission critical to the advancement of Twilio. We decided to invest in building this early on in the company.
  • High availability: always be processing.
  • Realtime scoreboards, usage, metrics.
  • Two piggy banks. Single-process dequeuers, limited by the number of transactions you can process on a single database.
  • Once we understood the problems, we went to the drawing board. We built something new alongside the existing infrastructure to replace it:
    • A system that disconnects our dequeuers from our processors.
    • Processors powered by a REST API that uses status codes for success & error resolution.
    • Databases that we only insert into, as a log server.
    • And to fix our realtime issue, redis as an in-flight datastore to atomically process metrics as we process transactions.
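A minimal sketch of how a dequeuer might resolve transactions by status code once it is decoupled from the processor. The mapping used here (2xx acknowledges, 4xx rejects without retry, anything else is retried up to a limit) is an assumption for illustration, not something the talk specifies, and `dequeue_all`, `fake_processor`, and the transaction fields are hypothetical names. The processor is an in-process stand-in rather than a real REST call.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List

Transaction = Dict[str, Any]


def dequeue_all(
    queue: Deque[Transaction],
    process: Callable[[Transaction], int],
    max_attempts: int = 3,
) -> Dict[str, List[str]]:
    """Hand each transaction to the processor and resolve it by status code."""
    outcome: Dict[str, List[str]] = {"acked": [], "rejected": [], "gave_up": []}
    while queue:
        tx = queue.popleft()
        status = process(tx)
        if 200 <= status < 300:
            # Assumed meaning: the processor accepted the transaction.
            outcome["acked"].append(tx["id"])
        elif 400 <= status < 500:
            # Assumed meaning: the transaction is invalid; retrying will not help.
            outcome["rejected"].append(tx["id"])
        else:
            # Assumed meaning: transient failure; requeue until attempts run out.
            tx["attempts"] = tx.get("attempts", 0) + 1
            if tx["attempts"] < max_attempts:
                queue.append(tx)
            else:
                outcome["gave_up"].append(tx["id"])
    return outcome


def fake_processor(tx: Transaction) -> int:
    """Hypothetical stand-in for the REST call; returns only a status code."""
    if tx["amount"] is None:
        return 400
    if tx["id"] == "tx-2" and tx.get("attempts", 0) == 0:
        return 503  # fail once, then succeed on the retry
    return 201


queue = deque([
    {"id": "tx-1", "amount": "0.0100"},
    {"id": "tx-2", "amount": "0.0075"},
    {"id": "tx-3", "amount": None},
])
result = dequeue_all(queue, fake_processor)
print(result)
```

Because the dequeuer only looks at the status code, the processor behind it can be replaced or scaled out without touching the queue-handling logic.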
  • Now we have two systems side by side, but we need to compare the two. Double bookkeeping lets us compare balances and metrics. If we have a bug, we need to roll back. And we don’t want to do it all at once.
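The double-bookkeeping comparison can be sketched as a diff of two balance ledgers. The balance values below are taken from the comparison table on slide 63; the row keys (`row-1` and so on) and the function name `find_discrepancies` are hypothetical, since the deck elides the real Sids.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple

Ledger = Dict[str, Decimal]
Mismatch = Tuple[str, Optional[Decimal], Optional[Decimal]]


def find_discrepancies(legacy: Ledger, next_gen: Ledger) -> List[Mismatch]:
    """Return every key whose balance differs or is missing on one side."""
    mismatches: List[Mismatch] = []
    for key in sorted(legacy.keys() | next_gen.keys()):
        old = legacy.get(key)
        new = next_gen.get(key)
        if old != new:
            mismatches.append((key, old, new))
    return mismatches


# Hypothetical keys; balances copied from slide 63.
legacy_balances = {
    "row-1": Decimal("29.5500"),
    "row-2": Decimal("74.0250"),
    "row-3": Decimal("29.00"),
}
next_gen_balances = {
    "row-1": Decimal("29.5500"),
    "row-2": Decimal("73.9450"),
    "row-3": Decimal("49.00"),
}

mismatches = find_discrepancies(legacy_balances, next_gen_balances)
safe_to_migrate = not mismatches
for key, old, new in mismatches:
    print(f"{key}: old={old} new={new}")
```

`Decimal` is used instead of `float` so that equal balances compare equal exactly, which matters when a single mismatch blocks a migration.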
  • With so much throughput, we couldn’t just shut down the billing system. We also couldn’t lose a billing event. So we built an abstraction between the two systems that would allow us to atomically control transactions. When the faucet is turned off, it waits until both queues are drained and sends us a report.
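A toy, single-process model of the faucet described above. It assumes the layout on slide 61 (one master queue fanned out to a legacy queue and a next-gen queue); the class, its method names, and the report fields are hypothetical, and the real system presumably coordinates this across processes rather than in memory.

```python
from collections import deque
from typing import Deque, Dict, List


class Faucet:
    """Fans events out from a master queue to two downstream queues."""

    def __init__(self) -> None:
        self.is_open = True
        self.master: Deque[str] = deque()
        self.legacy: Deque[str] = deque()
        self.next_gen: Deque[str] = deque()
        self.processed: Dict[str, List[str]] = {"legacy": [], "next_gen": []}

    def submit(self, event: str) -> None:
        # Events always land on the master queue, so none are lost
        # while the faucet is turned off.
        self.master.append(event)

    def pump(self) -> None:
        """Copy master events to both downstream queues while the faucet is open."""
        while self.is_open and self.master:
            event = self.master.popleft()
            self.legacy.append(event)
            self.next_gen.append(event)

    def turn_off_and_drain(self) -> Dict[str, object]:
        """Stop the fan-out, drain both downstream queues, and report."""
        self.is_open = False
        for name, q in (("legacy", self.legacy), ("next_gen", self.next_gen)):
            while q:
                self.processed[name].append(q.popleft())
        return {
            "legacy_processed": len(self.processed["legacy"]),
            "next_gen_processed": len(self.processed["next_gen"]),
            "held_on_master": len(self.master),
            "in_sync": self.processed["legacy"] == self.processed["next_gen"],
        }


faucet = Faucet()
for event in ("ev-1", "ev-2", "ev-3"):
    faucet.submit(event)
faucet.pump()
report = faucet.turn_off_and_drain()

# An event arriving while the faucet is off waits on the master queue.
faucet.submit("ev-4")
faucet.pump()
print(report, list(faucet.master))
```

Once both downstream queues are empty, the two systems have seen exactly the same events, which is what makes a balance comparison at that moment meaningful.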
  • Check the books. Can we turn any accounts online?
  • Nope, we should not enable any accounts yet.
  • If we had moved accounts, we could migrate them back with a click.
  • Or if the cluster catches fire, we can turn off the entire new system and reroute billing traffic back to the legacy system.
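The two rollback levels in these notes (per-account flags and a cluster-wide switch) can be sketched as a small routing check. `BillingFlags`, its field names, and the account identifiers are hypothetical; the talk does not show how the flags are stored or evaluated.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class BillingFlags:
    # Cluster-wide feature flag: turning it off reroutes all traffic to legacy.
    next_gen_enabled: bool = True
    # Per-account flags: the accounts currently migrated to the new system.
    next_gen_accounts: Set[str] = field(default_factory=set)

    def route(self, account_sid: str) -> str:
        """Pick the billing system that should handle this account."""
        if self.next_gen_enabled and account_sid in self.next_gen_accounts:
            return "next_gen"
        return "legacy"


flags = BillingFlags()
flags.next_gen_accounts.update({"AC-demo-1", "AC-demo-2"})
before = flags.route("AC-demo-1")

# Roll back a single account.
flags.next_gen_accounts.discard("AC-demo-1")
after_account_rollback = flags.route("AC-demo-1")

# Roll back the whole cluster; the per-account flags are left untouched.
flags.next_gen_enabled = False
after_cluster_rollback = flags.route("AC-demo-2")
print(before, after_account_rollback, after_cluster_rollback)
```

Keeping the cluster switch separate from the account set means a cluster-wide rollback does not erase which accounts had been migrated, so they can be re-enabled together later.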
  • TODO: graph build-out each week with the story. Practice is good. We tested our platform thoroughly in a practice mode with no account flags turned on. As we progressed and fixed the edge cases, we migrated 1%, 5%, and so on up to all accounts over a period of time. Planning with your tools lets you build a gradual deployment with ease.
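One common way to grow a rollout from 1% to 5% to all accounts is stable hash bucketing, sketched below. This is an assumption for illustration: the talk does not say how accounts were chosen for each step, and `bucket`, `cohort`, and the account identifiers are hypothetical.

```python
import hashlib
from typing import Iterable, Set


def bucket(account_sid: str) -> int:
    """Map an account to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(account_sid.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def cohort(account_sids: Iterable[str], rollout_percent: int) -> Set[str]:
    """Accounts that fall inside the current rollout percentage."""
    return {sid for sid in account_sids if bucket(sid) < rollout_percent}


accounts = [f"AC-demo-{i}" for i in range(1000)]
at_1 = cohort(accounts, 1)
at_5 = cohort(accounts, 5)
at_100 = cohort(accounts, 100)
print(len(at_1), len(at_5), len(at_100))
```

Because each account always hashes to the same bucket, raising the percentage only adds accounts and never moves one back, and lowering it again rolls back exactly the most recently added cohort.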
  • Just to follow up: better tools equal better deployments. When we had issues with our new in-flight store, we had a way to roll back. When we were seeing discrepancies in balances, we would investigate, fix, deploy, and compare. So you get the idea: to migrate to a new micro-payments platform, we must engineer tools that let us migrate back and forth with ease so that we can spend time on the solutions.
  • Transcript

    • 1. DEPLOYING NEXT-GENERATION SYSTEMS TWILIO ENGINEERING
    • 2. WHY NEXT-GENERATION SYSTEMS? #twiliocon
    • 3. Twilio Transactions Per Month (chart): Dec 2009: 365,782; Dec 2010: 25,679,631; Dec 2011: 67,186,111; Dec 2012: 123,090,344
    • 4. WHY ZERO DOWNTIME?
    • 5. HI. I’M TIM. I’m Director of Engineering at Twilio
    • 6-10. THE CHALLENGE
        • Design, build, deploy replacements of 2 core systems.
            ➡ They must be HA.
            ➡ They must be horizontally-scalable.
          ... Oh, and don’t lose a single billing event or API request in the process.
    • 11. HI. I’M FRANK. I’m API-Team Engineering Lead at Twilio
    • 12-14. OUT WITH THE OLD...
        • Monolithic Codebase
        • Serving Millions of API Requests
    • 15-18. ... AND IN WITH THE NEW
        • Python + Flask
        • Designed to serve billions of API requests
        • Zero Downtime, Zero Regressions
    • 19-22. TEST, TEST, TEST
        • Unit & Functional tests for local development
        • The same tests run in our Staging Cluster
        • The same cluster tests run against Both API Frameworks
    • 23-28. A TALE OF TWO APIS (diagram labels: CLUSTER TESTS, DIFF RESULTS)
    • 29-35. NGINX.CONF
          # Map the HTTP header X-Requested-Api-Stack: <value>
          # to a named location in nginx
          map $http_x_requested_api_stack $requested_stack_default_php {
              default @php;
              python  @python;
          }

          location @python {
              # proxy_pass needs a URL scheme; the slide omits it
              proxy_pass http://127.0.0.1:5555;
          }

          location @php {
              proxy_pass http://127.0.0.1:12345;
          }

          location ~ / {
              # Kwijibo is presumably a path that never exists, so try_files
              # always falls through to the named location chosen by the map
              try_files Kwijibo $requested_stack_default_php;
          }
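Since the nginx config routes on the `X-Requested-Api-Stack` request header, a test client can choose a stack per request. The sketch below only builds the request objects and sends nothing; `BASE_URL` and `build_request` are hypothetical names, and the host is a placeholder.

```python
import urllib.request
from typing import Optional

API_STACK_HEADER = "X-Requested-Api-Stack"
BASE_URL = "http://api.example.test"  # hypothetical host, not a real endpoint


def build_request(path: str, stack: Optional[str] = None) -> urllib.request.Request:
    """Build a request that asks nginx for a specific API stack.

    With no stack given, the header is omitted and the nginx map falls
    back to its default entry (@php in the config above).
    """
    headers = {API_STACK_HEADER: stack} if stack else {}
    return urllib.request.Request(BASE_URL + path, headers=headers)


default_request = build_request("/2010-04-01/Accounts")
python_request = build_request("/2010-04-01/Accounts", stack="python")
print(default_request.header_items())
print(python_request.header_items())
```

Running the same test suite twice, once with each variant of the request, is one way to exercise both stacks behind a single endpoint.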
    • 36. SPOT THE DIFFERENCE
          GET /2010-04-01/Accounts/ACaaaaaaaaaaaaaaaaaaaaaaaaa/Applications/APaaaaaaaaaaaaaaaaaaa
          (the two responses are shown side by side; they are identical except for VoiceCallerIdLookup, which is false on the left and False on the right)
    • 37. SPOT THE DIFFERENCE
          GET /2010-04-01/Accounts/ACaaaaaaaaaaaaaaaaaaaaaaaaa/Applications/APaaaaaaaaaaaaaaaaaaa
            {TwilioResponse:
            [{Application: [
            {Sid: APaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa},
            {DateCreated: Mon, 22 Aug 2011 20:58:45 +0000},
            {DateUpdated: Mon, 22 Aug 2011 20:58:45 +0000},
            {AccountSid: ACaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa},
            {FriendlyName: Application Friendly Name},
            {ApiVersion: 2010-04-01},
            {VoiceUrl: http://www.example.com/voice},
            {VoiceMethod: GET},
            {VoiceFallbackUrl: http://www.example.com/voice-callback},
            {VoiceFallbackMethod: GET},
            {StatusCallback: http://www.example.com/status-callback},
            {StatusCallbackMethod: GET},
          - {VoiceCallerIdLookup: false},
          ?                       ^
          + {VoiceCallerIdLookup: False},
          ?                       ^
            {SmsUrl: http://www.example.com/sms},
            {SmsMethod: GET},
            {SmsFallbackUrl: http://www.example.com/sms-fallback},
            {SmsFallbackMethod: GET},
            {SmsStatusCallback: http://www.example.com/sms-status-callback},
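The `-`, `+`, and `?` markers on slide 37 resemble the output of Python's `difflib.ndiff`, so a diff test along these lines could produce that kind of report. This is a sketch under that assumption, not the team's actual harness; `diff_responses` is a hypothetical helper, and the lines are a shortened excerpt of the slide's response.

```python
import difflib
from typing import List

# Shortened excerpt of the two responses from the slide.
left_lines = [
    "{StatusCallbackMethod: GET},",
    "{VoiceCallerIdLookup: false},",
    "{SmsUrl: http://www.example.com/sms},",
]
right_lines = [
    "{StatusCallbackMethod: GET},",
    "{VoiceCallerIdLookup: False},",
    "{SmsUrl: http://www.example.com/sms},",
]


def diff_responses(left: List[str], right: List[str]) -> List[str]:
    """Return only the ndiff lines that mark a difference."""
    return [
        line.rstrip("\n")
        for line in difflib.ndiff(left, right)
        if not line.startswith("  ")  # two leading spaces mark an unchanged line
    ]


changes = diff_responses(left_lines, right_lines)
for line in changes:
    print(line)
```

An empty result means the two stacks returned identical bodies for that request, which makes it easy to turn the comparison into a pass or fail check.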
    • 38-45. ALL GROWN UP (pipeline diagram labels: API, UNIT TESTS, FUNCTIONAL TESTS, DIFF TESTS, CANARY DEPLOY, API)
    • 46. Flask-RESTful
        • A simple REST resource framework for Python Flask applications
        • Simplifies argument parsing, generating output, & defining resources
        • ORM / library independent. Its only dependency is Flask itself.
          http://www.twilio.com/open-source
    • 47. HI. I’M ADAM. I’m MicroTransactions-Team Engineering Lead at Twilio
    • 48. REWIND TO 2009
    • 49. NERDY TELECOM STARTUP SEEKS RELIABLE HA
    • 50-52. MICRO-TRANSACTIONS @TWILIO
    • 53-55. TRANSACTIONS @TWILIO V1 (diagram labels: TX QUEUE, BALANCE UPDATER DEQUEUER, AGGREGATION DEQUEUER, TX DEQUEUER, AGG DEQUEUER, MYSQL)
    • 56. (diagram) MYSQL: INSERT only log data. REDIS: Automatically increments balance and counters. Remaining labels: TX-, TX-, POST-FLIGHT
    • 57-60. ONLINE SYSTEM UPGRADES
        • Double book-keeping with Shadow Mode
        • Rollback support with Account Flags & Feature Flags
        • Gradual rollout
    • 61-62. 1. SHADOW MODE (diagram labels: TX, TX-MASTER-QUEUE, DEQUEUER with ATOMIC faucet controls, NGB QUEUE, LEGACY QUEUE, NGB-DEQUEUER, LEGACY-DEQUEUER)
    • 63-64. COMPARING FAUCETS: OLD SYSTEM VS. NEW SYSTEM
          Sid     Account Sid   Old System   New System
          SV...   AC...          29.5500      29.5500
          SV...   AC...         144.8200     144.8200
          SV...   AC...          35.6700      35.6700
          SV...   AC...          30.4200      30.4200
          SV...   AC...         106.9100     106.9100
          SV...   AC...         109.7900     109.7900
          SV...   AC...           5.5900       5.5900
          SV...   AC...          10.8400      10.8400
          SV...   AC...          74.0250      73.9450
          SV...   AC...          29.00        49.00
          SV...   AC...          13.9600      13.1200
          SV...   AC...          71.3800      91.2400
          SV...   AC...         671.0650     646.5050
          SV...   AC...          71.2600      71.2600
          SV...   AC...          44.5000      44.5000
    • 65. 2. ROLLBACK SUPPORT: Toggle Account Features
    • 66. 2. ROLLBACK SUPPORT: Toggle Cluster Features
    • 67. 3. GRADUAL ROLLOUT (PRACTICE) (chart: Week 1 through Week 8; series: # of Errors, Accounts on Next Gen Billing)
    • 68. BETTER TOOLS
          Problem                               Solution
          Redis killed by Linux OOM Killer      Rollback account flags
          Inaccurate balances on NGB            Fix bug, deploy, compare
          Credit card purchases failing         Rollback account flags
          Failing to write log entries for NGB  Rollback feature flags
    • 69. THANK YOU. COME ASK US QUESTIONS!
