IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service

The advertising industry faces numerous challenges in achieving its goal of targeting a given audience dynamically and accurately in order to deliver a meaningful brand message. Near real-time, low-latency delivery of dynamic content, the sheer volume of information processed, and the sparse geographic distribution of the intended eyeball traffic all drive the complexity of building a successful experience for the end user and the brand. Additionally, the competitiveness of the industry makes it critical to keep operational expenses low while delivering reliably at scale. In attempting to address the above, we have found that a distributed infrastructure that leverages public cloud providers and a private cloud with open infrastructure technologies can deliver dynamic advertising content with low latency while preserving high availability. However, network and physical utility infrastructures cannot be relied on to ensure service dependability. We show that the complexity of the networks, the sparse geographic distribution of eyeballs, the risk of data center failures, and the increase in encrypted transactions call for thoughtful architectures. The introduction of modern practices, failure injection, and self-healing mechanisms allowed us to improve the service's fault tolerance while optimizing for latency, and to significantly improve our service reliability.

Published in: Technology

  1. © 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service. Nicolas Brousse and Oleksii Mykhailov
  2. Adobe Advertising Cloud: Serving All Media Content Across Any Screens in Any Format
  3. BEFORE: RFP, IO, human-based orders. NOW: Programmatic Ad Buying with Real Time Bidding
  4. Latency: <50ms @ 95th percentile. High Traffic: 300 billion requests a day. Huge Datasets: Billions of objects to store
  5. Ad Content Delivered To Eyeball
  6. Traditional Ad Serving Implement GeoDNS GSLB
  7. Inconsistent GeoDNS Routing High Latency From Eyeball To Content Origin Origin Failure Impact User Experience Impact Campaign Performance and Revenue
  8. Relying on GeoDNS to Figure Out Eyeball Location is UNRELIABLE (figure labels: Optimal Route, Actual Route)
  9. TCP and TLS Handshakes Impact Latency
  10. High Risks Of Origin Failures: Datacenter Blackout, Network Outage, Human Errors, Natural Disaster
  11. Service Unavailability
  12. SOLUTION: Eyeball Traffic Accesses Content via Smart Edges
  13. Smart Edges Are Anycast POPs That Manage Failover and Self-Healing
  14. A Few Words About Anycast
      • Shortest Path Routing Means
      • Not Latency Aware
      • Not Congestion Aware / Packet Loss
      • Limited Control for Traffic Steering
      • Difficult Troubleshooting
      • Failover leads to packet RST for Active Sessions
      • Mitigation with a large and well-distributed number of POPs
  15. Anycast POPs Improve Latency, e.g. 3X Faster
  16. Automate Failover and Recovery
  17. Inject Failures In Production To Validate Smart Edges Behavior
  18. Human Failover vs Self-Healing. Fig. 1: Human Failover with Manual Recovery steps. Fig. 2: Automated Failover and Self-Healing Recovery. (Figure labels: > 1h, Failure, < 15min, < 30min, Region Traffic Rerouted, Self Recover.) Note: since the paper's publication, we reduced automated failover time to less than a few seconds. See demo.
  19. Injecting Complete Data Center Failure at the Regional Level LIVE
  20. Recorded Demo
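
The slides describe Smart Edges that automate failover and recovery, and failure injection to validate that behavior, but they do not show how an edge decides that an origin is down or back up. The Python sketch below is one plausible way to express that decision, not the authors' implementation: a per-origin health tracker with hysteresis, plus a selector that picks the lowest-latency healthy origin. The names `OriginHealth` and `pick_origin`, the thresholds, the region names, and the latency figures are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OriginHealth:
    """Tracks one origin's health with hysteresis (hypothetical helper)."""
    name: str
    fail_threshold: int = 3      # assumed: consecutive failed probes before failover
    recover_threshold: int = 2   # assumed: consecutive good probes before recovery
    healthy: bool = True
    # Length of the current run of probes that contradict the current state.
    _streak: int = field(default=0, repr=False)

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self._streak = 0          # probe agrees with the current state
            return self.healthy
        self._streak += 1
        limit = self.recover_threshold if probe_ok else self.fail_threshold
        if self._streak >= limit:
            self.healthy = probe_ok   # flip state: fail over or self-heal
            self._streak = 0
        return self.healthy


def pick_origin(origins, latency_ms):
    """Return the name of the healthy origin with the lowest latency, or None."""
    candidates = [o for o in origins if o.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda o: latency_ms[o.name]).name


if __name__ == "__main__":
    # Hypothetical regions and latencies, for illustration only.
    origins = [OriginHealth("us-east"), OriginHealth("eu-west")]
    latency_ms = {"us-east": 20.0, "eu-west": 85.0}
    print(pick_origin(origins, latency_ms))   # nearest healthy origin

    for _ in range(3):                        # inject a regional failure
        origins[0].record(False)
    print(pick_origin(origins, latency_ms))   # traffic is rerouted

    for _ in range(2):                        # origin answers probes again
        origins[0].record(True)
    print(pick_origin(origins, latency_ms))   # traffic returns on its own
```

Requiring several consecutive contrary probes before changing state keeps a single lost probe from triggering a failover, at the cost of a slightly slower reaction. Feeding failed probes into `record` is a small stand-in for the failure injection of slides 17 and 19: it exercises the failover and recovery paths without a real outage.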
