IEEE ISSRE 2018 - Use of Self-Healing Techniques to Improve the Reliability of a Dynamic and Geo-Distributed Ad Delivery Service

Nicolas Brousse, Cloud Technology Leader | Director, Operations Engineering at Adobe
© 2018 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Use of Self-Healing Techniques to Improve the Reliability of a Dynamic
and Geo-Distributed Ad Delivery Service
Nicolas Brousse and Oleksii Mykhailov
Adobe Advertising Cloud
Serving All Media Content Across Any Screen in Any Format
BEFORE: RFP, IO, human-based orders
NOW: Programmatic Ad Buying with Real-Time Bidding
Latency: <50ms @ 95th percentile
High Traffic: 300 billion requests a day
Huge Datasets: Billions of objects to store
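To put these figures in perspective, the sketch below converts the daily request volume into an average per-second rate and shows one common way to check a 95th-percentile latency target. The nearest-rank percentile method, the function names, and the sample values are assumptions for illustration; the slides do not say how the percentile is measured.

```python
import math

REQUESTS_PER_DAY = 300e9          # "300 billion requests a day" from the slide
SECONDS_PER_DAY = 24 * 60 * 60


def average_rps(requests_per_day: float) -> float:
    """Average requests per second implied by a daily volume."""
    return requests_per_day / SECONDS_PER_DAY


def percentile(samples, pct):
    """Nearest-rank percentile (an assumed method, not the authors')."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms, target_ms=50.0, pct=95.0):
    """True when the pct-th percentile latency is under the target."""
    return percentile(samples_ms, pct) < target_ms


if __name__ == "__main__":
    print(f"average load: {average_rps(REQUESTS_PER_DAY):,.0f} requests/s")
    # Hypothetical samples: 19 fast responses and one slow outlier.
    samples = [10.0] * 19 + [200.0]
    print("target met:", meets_latency_target(samples))
```

The average works out to roughly 3.5 million requests per second, and peak load would be higher than the average.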
Ad Content Delivered To Eyeball
Traditional Ad Serving Implements GeoDNS GSLB
Inconsistent GeoDNS Routing
High Latency From Eyeball To Content Origin
Origin Failures Impact User Experience
Origin Failures Impact Campaign Performance and Revenue
Relying on GeoDNS to Figure Out Eyeball Location is
UNRELIABLE
Optimal Route
Actual Route
TCP and TLS Handshakes Impact Latency
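The handshake cost can be estimated from round trips alone. A TCP handshake takes one round trip before the client can send data, a full TLS 1.2 handshake adds two more, a full TLS 1.3 handshake adds one, and the request itself needs one. The sketch below is a rough model that ignores DNS, server processing, and session resumption; the function name and the RTT values are hypothetical, and the slides do not state which TLS version was in use.

```python
# Round trips needed for a full (non-resumed) TLS handshake.
TLS_HANDSHAKE_RTTS = {"1.2": 2, "1.3": 1}


def time_to_first_byte_ms(rtt_ms: float, tls_version: str = "1.2") -> float:
    """Rough time to first response byte on a brand-new HTTPS connection."""
    tcp_rtts = 1                          # SYN / SYN-ACK before data can flow
    tls_rtts = TLS_HANDSHAKE_RTTS[tls_version]
    request_rtts = 1                      # request out, first byte back
    return (tcp_rtts + tls_rtts + request_rtts) * rtt_ms


if __name__ == "__main__":
    # Hypothetical RTTs: a distant origin versus a nearby edge.
    for label, rtt in (("distant origin", 100.0), ("nearby edge", 20.0)):
        print(f"{label}: {time_to_first_byte_ms(rtt):.0f} ms with TLS 1.2")
```

Because every handshake round trip is multiplied by the RTT, terminating connections close to the user shrinks the cost of each one.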
Datacenter Blackout
Network Outage
Human Errors
Natural Disaster
High Risk of Origin Failures
Service Unavailability
SOLUTION
Eyeball Traffic Accesses Content via Smart Edges
Smart Edges Are
Anycast POPs That
Manage Failover and
Self-Healing
A Few Words About Anycast
§ Shortest Path Routing Means:
§ Not Latency Aware
§ Not Congestion or Packet Loss Aware
§ Limited Control for Traffic Steering
§ Difficult Troubleshooting
§ Failover Leads to TCP RST for Active Sessions
§ Mitigation with a Large and Well-Distributed Number of POPs
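The "not latency aware" point can be illustrated with a toy model. The sketch below reduces route selection to the shortest AS path, which is a deliberate simplification of BGP best-path selection, and compares the result with the lowest-latency choice. The `Route` type, the POP names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    pop: str            # POP announcing the anycast prefix
    as_path_len: int    # number of AS hops toward that POP
    rtt_ms: float       # measured round-trip time to that POP
    loss_pct: float     # measured packet loss on that path


def shortest_path_choice(routes):
    """What shortest-path routing picks: fewest AS hops, nothing else."""
    return min(routes, key=lambda r: r.as_path_len)


def lowest_latency_choice(routes):
    """What a latency-aware selector would pick instead."""
    return min(routes, key=lambda r: r.rtt_ms)


if __name__ == "__main__":
    routes = [
        Route("pop-a", as_path_len=2, rtt_ms=140.0, loss_pct=1.5),
        Route("pop-b", as_path_len=3, rtt_ms=25.0, loss_pct=0.0),
    ]
    print("routing picks:", shortest_path_choice(routes).pop)
    print("fastest is:   ", lowest_latency_choice(routes).pop)
```

In this made-up case the network sends traffic to `pop-a` even though `pop-b` is faster and loss-free, which is why a large, well-distributed set of POPs helps: more nearby POPs make the shortest path more likely to also be a fast one.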
Anycast POPs Improve Latency
e.g. 3X Faster
Automate
Failover and Recovery
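One common way to automate failover and recovery is a health tracker with separate thresholds for marking an origin down and for bringing it back, so a single failed or successful probe does not flip traffic. The sketch below is an illustration under that assumption, not the authors' implementation; the class name, the thresholds, and the origin names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OriginHealth:
    name: str
    fail_threshold: int = 3       # consecutive failures before failover
    recover_threshold: int = 5    # consecutive successes before recovery
    healthy: bool = True
    _fails: int = 0
    _oks: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Record one health probe result and return the current state."""
        if probe_ok:
            self._oks += 1
            self._fails = 0
            if not self.healthy and self._oks >= self.recover_threshold:
                self.healthy = True       # self-healing: origin rejoins
        else:
            self._fails += 1
            self._oks = 0
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False      # failover: stop sending traffic
        return self.healthy


def pick_origin(origins_by_preference):
    """Return the first healthy origin's name, or None if all are down."""
    for origin in origins_by_preference:
        if origin.healthy:
            return origin.name
    return None


if __name__ == "__main__":
    primary, backup = OriginHealth("primary"), OriginHealth("backup")
    for ok in (False, False, False):
        primary.observe(ok)
    print("after failures:", pick_origin([primary, backup]))
    for ok in (True,) * 5:
        primary.observe(ok)
    print("after recovery:", pick_origin([primary, backup]))
```

Requiring more successes to recover than failures to fail over is a design choice that trades a slightly slower return for less flapping.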
Inject Failures In
Production To Validate
Smart Edges Behavior
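A failure-injection experiment of this kind can be sketched in a few lines: take one origin down on purpose, send traffic through the edge, check that every request is still served, then restore the origin. Everything below is a toy in-process model with hypothetical class and function names; it stands in for, and does not describe, the production tooling used by the authors.

```python
class Origin:
    def __init__(self, name: str):
        self.name = name
        self.failed = False       # flipped by the experiment

    def serve(self, request_id: int) -> str:
        if self.failed:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}:{request_id}"


class SmartEdge:
    """Tries origins in preference order and skips the ones that fail."""

    def __init__(self, origins):
        self.origins = list(origins)

    def handle(self, request_id: int) -> str:
        for origin in self.origins:
            try:
                return origin.serve(request_id)
            except ConnectionError:
                continue
        raise RuntimeError("no origin available")


def run_experiment(edge: SmartEdge, victim: Origin, requests: int = 100):
    """Inject a failure into `victim`, send traffic, always restore it."""
    victim.failed = True
    try:
        return [edge.handle(i) for i in range(requests)]
    finally:
        victim.failed = False


if __name__ == "__main__":
    primary, backup = Origin("primary"), Origin("backup")
    edge = SmartEdge([primary, backup])
    served = run_experiment(edge, primary)
    print("served during failure by:", {s.split(":")[0] for s in served})
    print("served after restore by: ", edge.handle(0).split(":")[0])
```

The `finally` clause matters in practice: an experiment that aborts midway should never leave the injected failure in place.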
Human Failover vs Self-Healing
[Figure labels: > 1h Failure; < 15min; < 30min Region; Traffic Rerouted; Self Recover]
Fig. 1 Human Failover with Manual Recovery steps
Fig. 2 Automated Failover and Self-Healing Recovery
Note: since the paper's publication, we have reduced automated failover time to less than a few seconds. See demo.
Injecting Complete
Data Center Failure
at the Regional Level
LIVE
Recorded Demo