Integrating multiple CDN providers at Etsy - Velocity Europe (London) 2013

Relying on a single content delivery network for your site can limit your flexibility in a number of ways. By diversifying your CDN providers you can put the power back in your hands, allowing you to get the best of all worlds in terms of performance, reliability and cost. In this talk Marcus and Laurie will present Etsy’s recent work integrating multiple CDN providers into its site delivery infrastructure.

This presentation was delivered at Velocity Europe, November 2013

  1. 1. Integrating Multiple CDN Providers Our experiences at Etsy @lozzd • @ickymettle
  2. 2. Marcus Barczak Laurie Denness Staff Operations Engineers
  5. 5. Beginning of 2010 Today @lozzd • @ickymettle
  6. 6. Background ▪ First started using a single CDN in 2008 ▪ Exponential Growth ▪ Start of 2012 began investigation into running multiple CDNs @lozzd • @ickymettle
  7. 7. Why use a CDN? ▪ Goal: Consistently fast user experience globally ▪ Improve last mile performance by caching content close to the user ▪ Offload content delivery from origin infrastructure to the CDN provider @lozzd • @ickymettle
  8. 8. Why use more than one CDN? ▪ Resilience - Eliminate single point of failure ▪ Flexibility - Balance traffic based on business requirements ▪ Cost - Manage provider costs @lozzd • @ickymettle
  9. 9. The Plan http://www.flickr.com/photos/malloy/195204215
  10. 10. The Plan 1. Establish evaluation criteria 2. Initial configuration and testing 3. Test with production traffic 4. Operationalising @lozzd • @ickymettle
  11. 11. Evaluation Criteria @lozzd • @ickymettle http://www.flickr.com/photos/49212595@N00/5646403386
  12. 12. Evaluation Criteria ▪ Performance ▪ Configuration ▪ Reporting, Metrics and Logging ▪ Culture @lozzd • @ickymettle
  13. 13. Performance ▪ Baseline Response Times - Should be within ±5% of our existing CDN provider’s response times ▪ Hit Ratios and Origin Offload - Provider should achieve equivalent or better origin offload performance and hit ratios @lozzd • @ickymettle
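The two performance criteria on this slide can be written down as simple checks. The following is a minimal sketch only: the function names, the millisecond units and the example figures in the comments are assumptions for illustration, not anything taken from Etsy's evaluation tooling.

```python
def within_tolerance(candidate_ms, baseline_ms, tolerance=0.05):
    """Return True when the candidate response time is within
    +/- tolerance (5% by default) of the baseline response time."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return abs(candidate_ms - baseline_ms) <= tolerance * baseline_ms


def meets_offload(candidate_ratio, baseline_ratio):
    """Return True when the candidate's hit ratio (or origin offload)
    is equivalent to or better than the baseline's."""
    return candidate_ratio >= baseline_ratio


# Hypothetical figures: a new provider at 104 ms against an existing
# provider at 100 ms passes, while one at 94 ms falls outside the band.
```

Note that the slide's ±5% band is symmetric, so a provider that is much faster than the baseline also falls outside it; the sketch follows the slide literally.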
  14. 14. Configuration ▪ Complexity - how complex is the provider’s configuration system ▪ Self service - can you make changes directly or do they require professional services or other intervention ▪ Latency for changes - how long do changes take to propagate @lozzd • @ickymettle
  15. 15. Reporting, Metrics and Logging ▪ Resolution ▪ Latency ▪ Delivery ▪ Customisation @lozzd • @ickymettle
  16. 16. Culture ▪ Understand our culture ▪ Postmortems ▪ Access to technical staff ▪ Shared success @lozzd • @ickymettle
  17. 17. Initial Configuration and Testing http://www.flickr.com/photos/7269902@N07/4592239326
  18. 18. Clean the house http://www.flickr.com/photos/mastergeorge/8562623590
  19. 19. Clean the house ▪ Managing caching TTLs from origin - CDNs honour the origin Cache-Control headers! <LocationMatch "\.(gif|jpg|jpeg|png|css|js)$"> Header set Cache-Control "max-age=94670800" </LocationMatch> @lozzd • @ickymettle
  20. 20. Clean the house ▪ Manage gzip compression from origin - Honoured by CDNs - Compression from origin to CDN ## mod_deflate compression - see OPS-1537 ## AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript [..] @lozzd • @ickymettle
  21. 21. Clean the house If you can do it at origin, do it at origin @lozzd • @ickymettle
  22. 22. Mean Time To Curl http://www.flickr.com/photos/wwarby/3297205226
  23. 23. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg HTTP/1.1 200 OK Server: Apache Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT Cache-Control: max-age=94670800 [...] X-Served-By: cache-lo82-LHR X-Cache: MISS X-Cache-Hits: 0
  24. 24. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg HTTP/1.1 200 OK Server: Apache Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT Cache-Control: max-age=94670800 [...] X-Served-By: cache-lo82-LHR X-Cache: HIT X-Cache-Hits: 1
  25. 25. Mean Time To Curl = Done https://www.etsy.com/listing/99871278
  26. 26. Mean Time To Curl ▪ No need to touch existing infrastructure ▪ Smoke test of functionality ▪ 10 minutes from configuration to curl ▪ New providers should be plug and play @lozzd • @ickymettle
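The smoke test in the curl slides comes down to requesting the same object twice and checking that X-Cache moves from MISS to HIT. A minimal sketch of that check follows, assuming the raw text of a `curl -i` response is already available as a string; the function name and the returned keys are hypothetical.

```python
def parse_cache_status(raw_response):
    """Extract the cache-related headers from the text of a `curl -i`
    response. Header names are matched case-insensitively."""
    headers = {}
    lines = raw_response.splitlines()
    for line in lines[1:]:          # skip the status line
        if not line.strip():
            break                   # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:                     # ignore lines that are not headers
            headers[name.strip().lower()] = value.strip()
    return {
        "served_by": headers.get("x-served-by"),
        "cache": headers.get("x-cache"),
        "hits": int(headers.get("x-cache-hits", "0")),
    }


def is_cache_hit(raw_response):
    """True when the edge reported a HIT for this response."""
    return parse_cache_status(raw_response)["cache"] == "HIT"
```

The header names X-Served-By, X-Cache and X-Cache-Hits are the ones shown in the slides; other providers may report cache status differently.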
  27. 27. Testing In Production http://www.flickr.com/photos/solarnu/10646426865
  28. 28. Testing with Production Traffic ▪ Images only at first ▪ Good test of caching performance ▪ Easy to test by swapping hostnames ▪ Made even easier with our A/B testing framework @lozzd • @ickymettle
  29. 29. A/B Test Framework ▪ Fine grained control ▪ Enable test for specific users or groups ▪ Percentage of users ▪ All controlled via configuration in code ▪ Rapid and complete rollback @lozzd • @ickymettle
  30. 30. Configure Mappings to CDNs $server_config["image"] = array( 'akamai' => array( 'img0-ak.etsystatic.com', 'img1-ak.etsystatic.com', ), 'edgecast' => array( 'img0-ec.etsystatic.com', 'img1-ec.etsystatic.com', ), 'fastly' => array( 'img0-f.etsystatic.com', 'img1-f.etsystatic.com', ), ); @lozzd • @ickymettle
  31. 31. Test Controls $server_config['ab']['cdn'] = array( 'enabled' => 'on', 'weights' => array( 'akamai' => 0.0, 'edgecast' => 0.0, 'fastly' => 0.0, 'origin' => 100.0, ), 'override' => 'cdn_diversity', ); @lozzd • @ickymettle
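One way a weights table like the one above could be turned into a per-user choice is deterministic hash bucketing, so that a given user keeps seeing the same provider while the weights are unchanged. This is a sketch under that assumption; it is not Etsy's A/B framework, and `pick_cdn` and the use of SHA-256 are illustrative choices.

```python
import hashlib


def pick_cdn(user_id, weights):
    """Map a user to a provider name in proportion to `weights`.

    The same user_id always lands on the same provider while the
    weights stay unchanged. Providers with zero weight are never chosen.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must not be negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Hash the user id to a stable point in [0, total].
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total

    chosen = None
    cumulative = 0.0
    for name, weight in weights.items():
        if weight <= 0:
            continue
        chosen = name               # remember the last eligible provider
        cumulative += weight
        if point < cumulative:
            break
    return chosen
```

With the weights shown on the slide (100.0 for origin, 0.0 for every CDN) every user is sent to origin, which is the rollback state; ramping up means moving weight from origin to a provider.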
  32. 32. Metrics and Monitoring @lozzd • @ickymettle http://www.flickr.com/photos/nicolasfleury/6073151084
  33. 33. Metrics and Monitoring Even if it doesn’t move, graph it anyway @lozzd • @ickymettle
  34. 34. Metrics and Monitoring Simplest approach: Provider’s dashboards @lozzd • @ickymettle
  35. 35. Metrics and Monitoring ▪ Get more detail by pulling metrics in house ▪ Write script to pull data from API ▪ Create dashboards with data @lozzd • @ickymettle
  37. 37. Metrics and Monitoring @lozzd • @ickymettle
  38. 38. Metrics and Monitoring @lozzd • @ickymettle
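Pulling provider metrics in house usually ends with writing them to a time-series store. The sketch below assumes Graphite's plaintext protocol (one `path value timestamp` line per metric, sent over TCP, port 2003 by default) and assumes the provider statistics have already been fetched into a dict; the metric prefix and the stat names are hypothetical, and no vendor API is modelled here.

```python
import socket
import time


def to_graphite_lines(prefix, stats, timestamp=None):
    """Format a dict of stat name -> number as Graphite plaintext lines."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return ["%s.%s %s %d" % (prefix, name, value, ts)
            for name, value in sorted(stats.items())]


def send_to_graphite(lines, host, port=2003):
    """Send pre-formatted lines to a Graphite (carbon) listener over TCP."""
    payload = "\n".join(lines) + "\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload.encode("ascii"))
```

Keeping formatting separate from sending makes the formatting testable without a running Graphite instance.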
  39. 39. Testing Plan 1. for c in $cdns; do rampup $c; done; 2. Deliberately slow and steady 3. Watch traffic increase 4. Watch origin offload increase 5. Watch performance @lozzd • @ickymettle
  40. 40. Downsides of this approach ▪ AB testing can’t be used for main site ▪ Exposing your test CNAMEs ▪ Especially if hotlinking is a concern @lozzd • @ickymettle
  41. 41. Downsides of this approach ▪ Exposing your test CNAMEs ▪ Especially if hotlinking is a concern @lozzd • @ickymettle
  42. 42. How do you know it’s broke? ▪ Check the graphs! ▪ Check with your community ▪ Keep support in the loop @lozzd • @ickymettle
  43. 43. Operationalising http://www.flickr.com/photos/98047351@N05/9706165200
  44. 44. Content Partitioning @lozzd • @ickymettle
  45. 45. Etsy’s site partitioning Dynamic HTML Content www.etsy.com @lozzd • @ickymettle
  46. 46. Etsy’s site partitioning Static Assets (js, css, fonts) site.etsystatic.com @lozzd • @ickymettle
  47. 47. Etsy’s site partitioning Listing Images, Avatars imgX.etsystatic.com @lozzd • @ickymettle
  48. 48. Etsy’s site partitioning Dynamic HTML Content www.etsy.com Static Assets (js, css, fonts) site.etsystatic.com Listing Images, Avatars imgX.etsystatic.com @lozzd • @ickymettle
  49. 49. Balancing Traffic in Production http://www.flickr.com/photos/wok_design/2499217405
  50. 50. Balancing Traffic Using DNS ▪ Traffic Manager ▪ Extends DNS to dynamically return records based on rules ▪ Weighted round robin @lozzd • @ickymettle
  51. 51. Balancing Traffic Using DNS [2589:~] $ dig +short www.etsy.com www.etsy.com.edgekey.net. e2463.b.akamaiedge.net. 23.74.122.37 [2589:~] $ dig +short www.etsy.com etsy.com. 38.123.123.123 [2589:~] $ dig +short www.etsy.com cs34.adn.edgecastcdn.net. 93.184.219.54 [2589:~] $ dig +short www.etsy.com global-ssl.fastly.net. 185.31.19.184 @lozzd • @ickymettle
  52. 52. Balancing Traffic Using DNS [2589:~] $ dig +short www.etsy.com etsy.com. 38.123.123.123 [2589:~] $ dig +short www.etsy.com www.etsy.com.edgekey.net. e2463.b.akamaiedge.net. 23.74.122.37 [2589:~] $ dig +short www.etsy.com cs34.adn.edgecastcdn.net. 93.184.219.54 [2589:~] $ dig +short www.etsy.com global-ssl.fastly.net. 185.31.19.184 @lozzd • @ickymettle
  53. 53. Balancing Traffic Using DNS ▪ Rule updates typically made via web UI ▪ Can be slow and error prone ▪ Changes need to be applied to all three domains ▪ API available to make changes programmatically @lozzd • @ickymettle
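To make "weighted round robin" concrete, here is a sketch of the smooth weighted round-robin scheme, which spreads picks evenly instead of returning the heaviest target in long runs. It is an illustration of the general idea only: the slides do not say which algorithm the DNS traffic manager uses, and the function name and weights are assumptions.

```python
def weighted_round_robin(weights, count):
    """Return `count` picks from `weights` (name -> weight) using
    smooth weighted round robin. Zero-weight targets are skipped."""
    active = {name: w for name, w in weights.items() if w > 0}
    if not active:
        raise ValueError("at least one weight must be positive")
    total = sum(active.values())
    current = {name: 0 for name in active}

    picks = []
    for _ in range(count):
        # Every target earns its weight on each round...
        for name, weight in active.items():
            current[name] += weight
        # ...and the leader is picked, then pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks
```

With integer weights the sequence repeats every `sum(weights)` picks, and within each cycle every target appears exactly as many times as its weight. A real DNS traffic manager answers independent queries through caching resolvers, so observed traffic only approximates the configured weights.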
  54. 54. cdncontrol @lozzd • @ickymettle http://www.flickr.com/photos/foshydog/4441105829
  55. 55. cdncontrol @lozzd • @ickymettle
  65. 65. DNS balancing downsides ▪ Low TTLs for fast convergence ▪ Mo QPS == Mo Money ▪ More DNS lookups for users ▪ Not 100% instant or deterministic @lozzd • @ickymettle
  66. 66. 50% within 1 minute Long Tail is Loooong @lozzd • @ickymettle
  67. 67. Monitoring in Production @lozzd • @ickymettle http://www.flickr.com/photos/9229426@N05/5160787240
  68. 68. Whoopsie Page ▪ Static HTML delivered for 5xx errors - Branding - Translated error messages - Links to status page @lozzd • @ickymettle
  70. 70. Failure Beacons 1. 1x1 tracking pixel embedded in page [...] <img src="//failure.etsy.com/status/images/beacon.gif? beacon_source=fastly_origin_failure-etsy.com"> </body> </html> @lozzd • @ickymettle
  71. 71. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line @lozzd • @ickymettle
  72. 72. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line 3. Scrape them out minutely using logster self.reg = re.compile('^\S+(\s:)? (?P<remote_addr>[0-9.]+),? [0-9.,\- ]+ \[[^\]]+\] "GET /status/images/beacon.gif\?(beacon_)?source=(?P<source>\S+) HTTP/1.\d" \d+ [\d-]+ "(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" .*$') @lozzd • @ickymettle
  73. 73. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line 3. Scrape them out minutely using logster 4. Logster posts event counts to Graphite @lozzd • @ickymettle
  75. 75. Failure Beacons 1. 1x1 tracking pixel embedded in page 2. Request creates an access log line 3. Scrape them out minutely using logster 4. Logster posts event counts to Graphite 5. Alert on Graphite graph in Nagios @lozzd • @ickymettle
  77. 77. Failure Beacons ▪ Client IP address can be geolocated @lozzd • @ickymettle
  78. 78. Failure Beacons ▪ Optional extra debugging information [31/Oct/2013:07:06:42 +0000] "GET /status/images/ beacon.gif?beacon_source=fastly_origin_failure-etsy.com &provider_error=Connection%20timed%20out &server_identity=cache-ny57-NYC HTTP/1.1" @lozzd • @ickymettle
  79. 79. Failure Beacons ▪ Optional extra debugging information @lozzd • @ickymettle
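The beacon pipeline above (pixel request, access log line, per-minute scrape, counts to Graphite) can be sketched without logster itself. The example below is a simplified stand-in, not the logster parser from the slides: it matches only the request portion of a log line, and the function names are hypothetical. The query parameter names are the ones shown in the slides.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches just the quoted request for the beacon image in a log line.
REQUEST_RE = re.compile(
    r'"GET (?P<target>/status/images/beacon\.gif\?\S*) HTTP/1\.\d"')


def parse_beacon(line):
    """Return the beacon fields found in one access log line, or None."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    query = parse_qs(urlsplit(match.group("target")).query)
    # Accept both "beacon_source" and plain "source" as the slides' regex does.
    source = query.get("beacon_source") or query.get("source")
    if not source:
        return None
    return {
        "source": source[0],
        "provider_error": query.get("provider_error", [None])[0],
        "server_identity": query.get("server_identity", [None])[0],
    }


def count_by_source(lines):
    """Count beacon hits per source, ready to be emitted as metrics."""
    counts = Counter()
    for line in lines:
        beacon = parse_beacon(line)
        if beacon is not None:
            counts[beacon["source"]] += 1
    return counts
```

Run once a minute over the new log lines, the counts per source give one data point per beacon type, which is the shape of data an alert on a graph needs.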
  80. 80. Tracking Requests to Origin GET / HTTP/1.1 User-Agent: curl/7.24.0 Accept: */* X-Forwarded-Host: www.etsy.com [...] X-CDN-Provider: edgecast [...] Host: www.etsy.com @lozzd • @ickymettle
  82. 82. Backend Monitoring ▪ Vendor APIs to bring data in house @lozzd • @ickymettle
  83. 83. Backend Monitoring ▪ Logster on CDN provider header ▪ Vendor APIs to bring data in house @lozzd • @ickymettle
  84. 84. Backend Monitoring ▪ Vendor APIs to bring data in house ▪ Data in-house benefits include - Integration with our anomaly detection systems - Consistent and unified view of all CDN metrics - We control data retention period @lozzd • @ickymettle
  85. 85. Awareness ▪ Over 100 engineers ▪ Deploying 60 times a day ▪ Correlating external and internal services @lozzd • @ickymettle
  86. 86. Awareness @lozzd • @ickymettle
  87. 87. Awareness Deploy lines @lozzd • @ickymettle
  88. 88. Frontend Monitoring ▪ Performance is important to us ▪ Monitoring overall site performance ▪ Monitoring performance by CDN provider ▪ Real User Monitoring on key pages to track page performance @lozzd • @ickymettle
  89. 89. Frontend Monitoring ▪ Performance is important to us ▪ Monitoring overall site performance ▪ Monitoring performance by CDN provider ▪ SOASTA mPulse on key pages to track real user page performance @lozzd • @ickymettle
  90. 90. Downsides http://www.flickr.com/photos/39272170@N00/3841286802
  91. 91. Debugging: What broke? ▪ MTTD/MTTR can be extremely low with this system ▪ But not always @lozzd • @ickymettle
  93. 93. Debugging: What broke? ▪ Non technical member base ▪ Confusing and time consuming ▪ Amazing support team ▪ Log as much information as possible @lozzd • @ickymettle
  94. 94. http://www.flickr.com/photos/sk8geek/4649776194 Conclusions/Takeaways
  95. 95. Great success ▪ 12 months in the benefits have far outweighed the few downsides ▪ We’re continuing to evolve the system ▪ We’ll be sure to share our experience with the community along the way @lozzd • @ickymettle
  96. 96. Links/Open Source ▪ cdncontrol http://github.com/etsy/cdncontrol http://github.com/etsy/cdncontrol_ui ▪ logster http://github.com/etsy/logster ▪ CDN API to Graphite scripts http://github.com/lozzd/cdn_scripts @lozzd • @ickymettle
  97. 97. Thanks! Questions? @lozzd • @ickymettle
