Integrating multiple CDNs at Etsy

We embarked on a project to use multiple CDNs concurrently at Etsy. This talk goes through how and why.

Published in: Technology

  1. 1. Integrating Multiple CDN Providers Our experiences at Etsy @lozzd • @ickymettle
  2. 2. Marcus Barczak Laurie Denness Staff Operations Engineers
  4. 4. @lozzd • @ickymettle
  5. 5. Beginning of 2010 Today @lozzd • @ickymettle
  6. 6. Background ▪ First started using a single CDN in 2008 ▪ Exponential Growth ▪ Start of 2012 began investigation into running multiple CDNs @lozzd • @ickymettle
  7. 7. Why use a CDN? ▪ Goal: Consistently fast user experience globally ▪ Improve last mile performance by caching content close to the user ▪ Offload content delivery from origin infrastructure to the CDN provider @lozzd • @ickymettle
  8. 8. Why use more than one CDN? @lozzd • @ickymettle
  9. 9. Why use more than one CDN? ▪ Resilience - Eliminate single point of failure @lozzd • @ickymettle
  10. 10. Why use more than one CDN? ▪ Resilience - Eliminate single point of failure ▪ Flexibility - Balance traffic based on business requirements @lozzd • @ickymettle
  11. 11. Why use more than one CDN? ▪ Resilience - Eliminate single point of failure ▪ Flexibility - Balance traffic based on business requirements ▪ Cost - Manage provider costs @lozzd • @ickymettle
  12. 12. The Plan http://www.flickr.com/photos/malloy/195204215
  13. 13. The Plan 1. Establish evaluation criteria 2. Initial configuration and testing 3. Test with production traffic 4. Operationalising @lozzd • @ickymettle
  14. 14. Evaluation Criteria @lozzd • @ickymettle http://www.flickr.com/photos/49212595@N00/5646403386
  15. 15. Evaluation Criteria ▪ Performance ▪ Configuration ▪ Reporting, Metrics and Logging ▪ Culture @lozzd • @ickymettle
  16. 16. Performance @lozzd • @ickymettle
  17. 17. Performance ▪ Baseline Response Times - Should be within ±5% of our existing CDN provider’s response times @lozzd • @ickymettle
  18. 18. Performance ▪ Baseline Response Times - Should be within ±5% of our existing CDN provider’s response times ▪ Hit Ratios and Origin Offload - Provider should achieve equivalent or better origin offload performance and hit ratios @lozzd • @ickymettle
  19. 19. Configuration @lozzd • @ickymettle
  20. 20. Configuration ▪ Complexity - how complex is the provider's configuration system @lozzd • @ickymettle
  21. 21. Configuration ▪ Complexity - how complex is the provider's configuration system ▪ Self service - can you make changes directly or do they require professional services or other intervention @lozzd • @ickymettle
  22. 22. Configuration ▪ Complexity - how complex is the provider's configuration system ▪ Self service - can you make changes directly or do they require professional services or other intervention ▪ Latency for changes - how long do changes take to propagate @lozzd • @ickymettle
  23. 23. Reporting, Metrics and Logging ▪ Resolution ▪ Latency ▪ Delivery ▪ Customisation @lozzd • @ickymettle
  24. 24. Culture ▪ Understand our culture ▪ Postmortems ▪ Access to technical staff ▪ Shared success @lozzd • @ickymettle
  25. 25. Initial Configuration and Testing http://www.flickr.com/photos/7269902@N07/4592239326
  26. 26. Clean the house http://www.flickr.com/photos/mastergeorge/8562623590
  27. 27. Clean the house ▪ Managing caching TTLs from origin - CDNs honour the origin cache-control headers! @lozzd • @ickymettle
  28. 28. Clean the house ▪ Managing caching TTLs from origin - CDNs honour the origin cache-control headers! <LocationMatch "\.(gif|jpg|jpeg|png|css|js)$"> Header set Cache-Control "max-age=94670800" </LocationMatch> @lozzd • @ickymettle
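The `max-age` value in the header above is in seconds, which makes a long TTL hard to read at a glance. A minimal Python sketch for checking what an origin response advertises; the function names are illustrative and not part of any Etsy tooling:

```python
import re

SECONDS_PER_DAY = 86400


def max_age_seconds(cache_control):
    """Return the max-age directive of a Cache-Control header, or None if absent."""
    match = re.search(r"(?:^|,)\s*max-age=(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


def max_age_days(cache_control):
    """Return the max-age directive converted to days, or None if absent."""
    seconds = max_age_seconds(cache_control)
    return None if seconds is None else seconds / SECONDS_PER_DAY
```

For the value on the slide, `max_age_days("max-age=94670800")` comes out at just under 1,096 days, roughly three years.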
  29. 29. Clean the house ▪ Manage gzip compression from origin - Honoured by CDNs - Compression from origin to CDN @lozzd • @ickymettle
  30. 30. Clean the house ▪ Manage gzip compression from origin - Honoured by CDNs - Compression from origin to CDN ## mod_deflate compression - see OPS-1537 ## AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript [..] @lozzd • @ickymettle
  31. 31. Clean the house @lozzd • @ickymettle
  32. 32. Clean the house If you can do it at origin, do it at origin @lozzd • @ickymettle
  33. 33. Mean Time To Curl http://www.flickr.com/photos/wwarby/3297205226
  34. 34. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg
  35. 35. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg HTTP/1.1 200 OK Server: Apache Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT Cache-Control: max-age=94670800 [...] X-Served-By: cache-lo82-LHR X-Cache: MISS X-Cache-Hits: 0
  36. 36. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg
  37. 37. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg HTTP/1.1 200 OK Server: Apache Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT Cache-Control: max-age=94670800 [...] X-Served-By: cache-lo82-LHR X-Cache: HIT X-Cache-Hits: 1
  38. 38. Mean Time To Curl = Done https://www.etsy.com/listing/99871278
  39. 39. Mean Time To Curl ▪ No need to touch existing infrastructure ▪ Smoke test of functionality ▪ 10 minutes from configuration to curl ▪ New providers should be plug and play @lozzd • @ickymettle
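The MISS-then-HIT check from the two curl runs is easy to script once the `curl -i` output has been captured. A minimal Python sketch, assuming the provider reports cache status in an `X-Cache` header as in the responses shown (header names vary by provider); `parse_headers` and `warmed_up` are hypothetical helper names:

```python
def parse_headers(raw_response):
    """Parse the header block of `curl -i` output into a dict with lowercased names."""
    headers = {}
    lines = raw_response.strip().splitlines()
    for line in lines[1:]:  # skip the HTTP status line
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:  # ignore anything that is not a "Name: value" pair
            headers[name.strip().lower()] = value.strip()
    return headers


def warmed_up(first_response, second_response):
    """True if the first request missed the cache and the second one hit it."""
    first = parse_headers(first_response)
    second = parse_headers(second_response)
    return first.get("x-cache") == "MISS" and second.get("x-cache") == "HIT"
```

Feeding it the output of two consecutive requests for the same object gives a one-line pass/fail for a newly configured provider.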
  40. 40. Testing In Production http://www.flickr.com/photos/solarnu/10646426865
  41. 41. Testing with Production Traffic ▪ Images only at first ▪ Good test of caching performance ▪ Easy to test by swapping hostnames ▪ Made even easier with our A/B testing framework @lozzd • @ickymettle
  42. 42. A/B Test Framework ▪ Fine-grained control ▪ Enable test for specific users or groups ▪ Percentage of users ▪ All controlled via configuration in code ▪ Rapid and complete rollback @lozzd • @ickymettle
  43. 43. Configure Mappings to CDNs $server_config["image"] = array( 'akamai' => array( 'img0-ak.etsystatic.com', 'img1-ak.etsystatic.com', ), 'edgecast' => array( 'img0-ec.etsystatic.com', 'img1-ec.etsystatic.com', ), 'fastly' => array( 'img0-f.etsystatic.com', 'img1-f.etsystatic.com', ), ); @lozzd • @ickymettle
  44. 44. Test Controls $server_config['ab']['cdn'] = array( 'enabled' => 'on', 'weights' => array( 'akamai' => 0.0, 'edgecast' => 0.0, 'fastly' => 0.0, 'origin' => 100.0, ), 'override' => 'cdn_diversity', ); @lozzd • @ickymettle
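The slides do not show how the A/B framework assigns a user to a variant. One common approach, sketched here in Python as an assumption rather than Etsy's implementation, is to hash a stable user identifier onto the weight range so the same user always lands on the same provider. `pick_provider` and the user ids are hypothetical; the weights mirror the configuration above:

```python
import hashlib


def pick_provider(user_id, weights):
    """Deterministically map a user to a provider in proportion to the weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Hash the user id to a stable point in the range [0, total).
    digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
    point = (int(digest[:8], 16) / 0x100000000) * total
    cumulative = 0.0
    for provider, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return provider
    # Guard against floating point rounding: use the last provider with weight.
    return [name for name, weight in weights.items() if weight > 0][-1]


# Starting position from the slide: everything still goes to origin.
WEIGHTS = {"akamai": 0.0, "edgecast": 0.0, "fastly": 0.0, "origin": 100.0}
```

With these weights every user resolves to `origin`; shifting a few points of weight onto one provider moves a matching share of users, and setting it back to zero is the rollback.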
  45. 45. Metrics and Monitoring @lozzd • @ickymettle http://www.flickr.com/photos/nicolasfleury/6073151084
  46. 46. Metrics and Monitoring @lozzd • @ickymettle
  47. 47. Metrics and Monitoring Even if it doesn’t move, graph it anyway @lozzd • @ickymettle
  49. 49. Metrics and Monitoring Simplest approach: Provider’s dashboards @lozzd • @ickymettle
  51. 51. Metrics and Monitoring ▪ Get more detail by pulling metrics in house ▪ Write script to pull data from API ▪ Create dashboards with data @lozzd • @ickymettle
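The "pull from the API, build dashboards" step can be small. A minimal Python sketch of the Graphite side, using Graphite's plaintext protocol (one `path value timestamp` line per metric, conventionally on TCP port 2003). Fetching the numbers is provider-specific and omitted here; the metric names and the `stats` dict are placeholders:

```python
import socket


def graphite_lines(prefix, stats, timestamp):
    """Format a dict of stats as Graphite plaintext-protocol lines."""
    return [
        f"{prefix}.{name} {value} {int(timestamp)}"
        for name, value in sorted(stats.items())
    ]


def send_to_graphite(lines, host, port=2003):
    """Send pre-formatted lines to a Graphite (carbon) plaintext listener."""
    payload = "\n".join(lines) + "\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))
```

Run from cron once a minute per provider, a script along these lines puts every CDN's numbers under one metric namespace.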
  53. 53. Metrics and Monitoring @lozzd • @ickymettle
  54. 54. Testing Plan 1. for c in $cdns; do rampup $c; done; 2. Deliberately slow and steady 3. Watch traffic increase 4. Watch origin offload increase 5. Watch performance @lozzd • @ickymettle
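The `rampup` command in step 1 is shorthand on the slide, not a real tool. As a sketch of what "deliberately slow and steady" might look like, the Python below generates the percentage steps for one provider and the matching weight table, leaving the remainder on origin. Function names, step sizes and the pause between steps are all assumptions:

```python
def ramp_schedule(target, step):
    """Return the percentages to step through, ending exactly at target."""
    if step <= 0 or target <= 0:
        raise ValueError("target and step must be positive")
    levels = []
    current = 0.0
    while current + step < target:
        current += step
        levels.append(current)
    levels.append(float(target))
    return levels


def weights_at(cdn, percent, providers):
    """Weights with `percent` on one CDN and everything else on origin."""
    weights = {name: 0.0 for name in providers}
    weights[cdn] = float(percent)
    weights["origin"] = 100.0 - float(percent)
    return weights
```

For example, `ramp_schedule(25, 10)` yields 10, 20 and then 25 percent; between steps the operator watches traffic, origin offload and performance before moving on.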
  55. 55. Downsides of this approach ▪ A/B testing can’t be used for main site ▪ Exposing your test CNAMEs ▪ Especially if hotlinking is a concern @lozzd • @ickymettle
  56. 56. Downsides of this approach ▪ Exposing your test CNAMEs ▪ Especially if hotlinking is a concern @lozzd • @ickymettle
  57. 57. How do you know it’s broke? ▪ Check the graphs! ▪ Check with your community ▪ Keep support in the loop @lozzd • @ickymettle
  58. 58. Operationalising http://www.flickr.com/photos/98047351@N05/9706165200
  59. 59. Content Partitioning @lozzd • @ickymettle
  60. 60. Etsy’s site partitioning Dynamic HTML Content www.etsy.com @lozzd • @ickymettle
  61. 61. Etsy’s site partitioning Static Assets (js, css, fonts) site.etsystatic.com @lozzd • @ickymettle
  62. 62. Etsy’s site partitioning Listing Images, Avatars imgX.etsystatic.com @lozzd • @ickymettle
  63. 63. Etsy’s site partitioning Dynamic HTML Content www.etsy.com Static Assets (js, css, fonts) site.etsystatic.com Listing Images, Avatars imgX.etsystatic.com @lozzd • @ickymettle
  64. 64. Balancing Traffic in Production http://www.flickr.com/photos/wok_design/2499217405
  65. 65. Balancing Traffic Using DNS ▪ Traffic Manager ▪ Extends DNS to dynamically return records based on rules ▪ Weighted round robin @lozzd • @ickymettle
  66. 66. Balancing Traffic Using DNS [2589:~] $ dig +short www.etsy.com www.etsy.com.edgekey.net. e2463.b.akamaiedge.net. 23.74.122.37 [2589:~] $ dig +short www.etsy.com etsy.com. 38.123.123.123 [2589:~] $ dig +short www.etsy.com cs34.adn.edgecastcdn.net. 93.184.219.54 [2589:~] $ dig +short www.etsy.com global-ssl.fastly.net. 185.31.19.184 @lozzd • @ickymettle
  68. 68. Balancing Traffic Using DNS ▪ Rule updates typically made via web UI ▪ Can be slow and error prone ▪ Changes need to be applied to all three domains ▪ API available to make changes programmatically @lozzd • @ickymettle
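The traffic manager's weighted round robin is a feature of the DNS provider, and the talk does not describe its internals. Purely to illustrate the idea, here is a Python sketch of a well-known "smooth" weighted round robin, which spreads picks evenly instead of answering with the same provider several times in a row. The integer weights are made up for the example:

```python
def weighted_round_robin(weights, count):
    """Return `count` picks distributed in proportion to integer weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    current = {name: 0 for name in weights}
    picks = []
    for _ in range(count):
        # Every provider earns its weight each round...
        for name, weight in weights.items():
            current[name] += weight
        # ...and the one with the highest running score answers this query.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks
```

With weights of 2, 1 and 1 for akamai, edgecast and fastly, four consecutive answers are akamai, edgecast, fastly, akamai, and the pattern then repeats. Real DNS answers are also cached by resolvers for the record's TTL, so the split seen by users is only approximately the configured one.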
  69. 69. cdncontrol @lozzd • @ickymettle http://www.flickr.com/photos/foshydog/4441105829
  83. 83. cdncontrol @lozzd • @ickymettle
  84. 84. DNS balancing downsides ▪ Low TTLs for fast convergence @lozzd • @ickymettle
  85. 85. DNS balancing downsides ▪ Low TTLs for fast convergence ▪ Mo QPS == Mo Money @lozzd • @ickymettle
  86. 86. DNS balancing downsides ▪ Low TTLs for fast convergence ▪ Mo QPS == Mo Money ▪ More DNS lookups for users @lozzd • @ickymettle
  87. 87. DNS balancing downsides ▪ Low TTLs for fast convergence ▪ Mo QPS == Mo Money ▪ More DNS lookups for users ▪ Not 100% instant or deterministic @lozzd • @ickymettle
  88. 88. 50% within 1 minute @lozzd • @ickymettle
  89. 89. 50% within 1 minute Long Tail is Loooong @lozzd • @ickymettle
  90. 90. Monitoring in Production @lozzd • @ickymettle http://www.flickr.com/photos/9229426@N05/5160787240
  92. 92. Whoopsie Page ▪ Static HTML delivered for 5xx errors - Branding - Translated error messages - Links to status page @lozzd • @ickymettle
  93. 93. Failure Beacons 1. 1x1 tracking pixel embedded in page [...] <img src="//failure.etsy.com/status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com"> </body> </html> @lozzd • @ickymettle
  94. 94. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line @lozzd • @ickymettle
  95. 95. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line 3. Scrape them out minutely using logster self.reg = re.compile('^\S+(s:)? (?P<remote_addr>[0-9.]+),? [0-9.,\- ]+ \[[^\]]+\] "GET /status/images/beacon.gif\?(beacon_)?source=(?P<source>\S+) HTTP/1.\d" \d+ [\d-]+ "(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" .*$') @lozzd • @ickymettle
  97. 97. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line 3. Scrape them out minutely using logster 4. Logster posts event counts to Graphite @lozzd • @ickymettle
  100. 100. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line 3. Scrape them out minutely using logster 4. Logster posts event counts to Graphite 5. Alert on Graphite graph in Nagios @lozzd • @ickymettle
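Step 5 boils down to turning recent beacon counts into a Nagios status. A minimal Python sketch, assuming the counts arrive as `[value, timestamp]` pairs (the shape Graphite's JSON render output uses for datapoints) and using the standard Nagios plugin exit codes. The thresholds and the function name are hypothetical:

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_beacons(datapoints, warn, crit):
    """Map recent beacon counts to a (Nagios exit code, message) pair."""
    # Graphite reports gaps as null, which arrives here as None.
    values = [value for value, _timestamp in datapoints if value is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints returned"
    peak = max(values)
    if peak >= crit:
        return CRITICAL, f"CRITICAL: {peak} failure beacons per minute"
    if peak >= warn:
        return WARNING, f"WARNING: {peak} failure beacons per minute"
    return OK, f"OK: {peak} failure beacons per minute"
```

A plugin wrapper would print the message and exit with the code, for instance after fetching the last few minutes of the beacon metric from Graphite.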
  102. 102. Failure Beacons ▪ Client IP address can be geolocated @lozzd • @ickymettle
  103. 103. Failure Beacons ▪ Optional extra debugging information [31/Oct/2013:07:06:42 +0000] "GET /status/images/beacon.gif?beacon_source=fastly_origin_failure-etsy.com&provider_error=Connection%20timed%20out&server_identity=cache-ny57-NYC HTTP/1.1" @lozzd • @ickymettle
  104. 104. Failure Beacons ▪ Optional extra debugging information @lozzd • @ickymettle
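To make the log-scraping step concrete, here is a plain-Python sketch that pulls the beacon parameters out of access log lines and counts them per source. It is not a logster parser class and uses a simpler pattern than the one on the slide; `beacon_fields` and `count_by_source` are hypothetical names:

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches only the request part of the log line; other fields are ignored.
REQUEST_RE = re.compile(
    r'"GET (?P<target>/status/images/beacon\.gif\?\S+) HTTP/1\.\d"'
)


def beacon_fields(log_line):
    """Return the beacon's query parameters as a dict, or None for other lines."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return None
    query = urlsplit(match.group("target")).query
    return {key: values[0] for key, values in parse_qs(query).items()}


def count_by_source(log_lines):
    """Count beacon hits per source, accepting `beacon_source` or `source`."""
    counts = Counter()
    for line in log_lines:
        fields = beacon_fields(line)
        if fields:
            source = fields.get("beacon_source") or fields.get("source")
            if source:
                counts[source] += 1
    return counts
```

For the log line shown above, `beacon_fields` returns the source plus `provider_error` decoded to `Connection timed out` and `server_identity` set to `cache-ny57-NYC`, which is the extra context needed when raising an issue with the provider.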
  106. 106. Tracking Requests to Origin GET / HTTP/1.1 User-Agent: curl/7.24.0 Accept: */* X-Forwarded-Host: www.etsy.com [...] X-CDN-Provider: edgecast [...] Host: www.etsy.com @lozzd • @ickymettle
  108. 108. Backend Monitoring ▪ Vendor APIs to bring data in house @lozzd • @ickymettle
  109. 109. Backend Monitoring ▪ Logster on CDN provider header ▪ Vendor APIs to bring data in house @lozzd • @ickymettle
  110. 110. Backend Monitoring ▪ Vendor APIs to bring data in house ▪ Data in-house benefits include - Integration with our anomaly detection systems - Consistent and unified view of all CDN metrics - We control data retention period @lozzd • @ickymettle
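Because each provider stamps its origin requests with an `X-CDN-Provider` header, origin-side traffic can be split per CDN with very little code. A minimal Python sketch, assuming the request headers are available as dicts; the `direct` label for requests without the header is an assumption made for the example:

```python
from collections import Counter


def provider_of(headers):
    """Return the CDN named in X-CDN-Provider, or 'direct' when it is absent."""
    for name, value in headers.items():
        if name.lower() == "x-cdn-provider":
            return value.strip().lower()
    return "direct"


def origin_requests_by_provider(requests):
    """Count origin requests per CDN provider from an iterable of header dicts."""
    return Counter(provider_of(headers) for headers in requests)
```

The resulting counts can be graphed next to the vendor-reported figures, so a provider whose origin traffic suddenly climbs (for example after a cache purge) stands out quickly.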
  111. 111. Awareness ▪ Over 100 engineers ▪ Deploying 60 times a day ▪ Correlating external and internal services @lozzd • @ickymettle
  116. 116. Awareness @lozzd • @ickymettle
  117. 117. Awareness Deploy lines @lozzd • @ickymettle
  118. 118. Frontend Monitoring ▪ Performance is important to us ▪ Monitoring overall site performance ▪ Monitoring performance by CDN provider ▪ Real User Monitoring on key pages to track page performance @lozzd • @ickymettle
  119. 119. Frontend Monitoring ▪ Performance is important to us ▪ Monitoring overall site performance ▪ Monitoring performance by CDN provider ▪ SOASTA mPulse on key pages to track real user page performance @lozzd • @ickymettle
  120. 120. Downsides http://www.flickr.com/photos/39272170@N00/3841286802
  121. 121. Debugging: What broke? @lozzd • @ickymettle
  122. 122. Debugging: What broke? ▪ MTTD/MTTR can be extremely low with this system @lozzd • @ickymettle
  125. 125. Debugging: What broke? ▪ MTTD/MTTR can be extremely low with this system ▪ But not always @lozzd • @ickymettle
  126. 126. Debugging: What broke? ▪ Non-technical member base ▪ Confusing and time-consuming ▪ Amazing support team ▪ Log as much information as possible @lozzd • @ickymettle
  127. 127. http://www.flickr.com/photos/sk8geek/4649776194 Conclusions/Takeaways
  128. 128. Great success ▪ 12 months in, the benefits have far outweighed the few downsides ▪ We’re continuing to evolve the system ▪ We’ll be sure to share our experience with the community along the way @lozzd • @ickymettle
  129. 129. Links/Open Source ▪ cdncontrol http://github.com/etsy/cdncontrol http://github.com/etsy/cdncontrol_ui ▪ logster http://github.com/etsy/logster ▪ CDN API to Graphite scripts http://github.com/lozzd/cdn_scripts @lozzd • @ickymettle
  130. 130. Thanks! Questions? @lozzd • @ickymettle
  131. 131. Integrating Multiple CDN Providers Our experiences at Etsy @lozzd • @ickymettle
