Crisis to Calm: Story of Data Validation @ Netflix

Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2AwAlb9.

Lavanya Kanchanapalli shares her experience in maintaining a great Netflix customer experience while enabling fast and safe data propagation. She also covers detecting and preventing bad data, which is essential to high availability; ways to make circuit breakers, data canaries, and staggered rollout effective; and efficient validation through sharing data and isolating change. Filmed at qconsf.com.

Lavanya Kanchanapalli works as a Senior Software Engineer at Netflix.

Published in: Technology

  1. Crisis to Calm: A Story of Data Validation @ Netflix, Lavanya Kanchanapalli
  2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site worldwide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture, and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/data-validation-netflix
  3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation. Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators. Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide. Presented at QCon San Francisco www.qconsf.com
  4. ● Rollback data ● Increase capacity ● Netflix starts to work
  5. ● Duplicate objects (root cause) ● Apps failed ● Cascading failures ● Netflix goes down!
  6. [Diagram labels: Data Change, Change in Behavior, Netflix Microservices, Cloud, App1, App2, Appn]
  7. Metadata Architecture [diagram labels: Source System, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
  8. ● Single publisher ● Multiple consumers ● Hollow (hollow.how) ● Versioned data ● Fast propagation
  9. Bad data happens!
  10. Leaked Content
  11. Disables Features
  12. Deletes data
  13. Data change = Code Push
  14. 1 Detection, 2 Staggering, 3 Rollback
  15. 1 Detection, 2 Staggering, 3 Rollback
  16. 1 Detection, 2 Staggering, 3 Rollback
  17. Many Unknowns
  18. Would it be too slow?
  19. Would there be too many failures?
  20. Would it cost too much?
  21. 1 Detection, 2 Staggering, 3 Rollback
  22. 1 Detection, 2 Staggering, 3 Rollback
  23. Chapter 1: Circuit Breakers
  24. Metadata Architecture [diagram labels: Source System, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
  25. Circuit Breakers ● Integrity checks ● Duplicate detection ● Object counts ● Semantic checks
  26. Know your data change
  27. Knobs are key to sanity ● On/off ● Threshold ● Exclusions
  28. Business value is the key
  29. Efficiency ● Change Isolation ● Sampling
  30. Chapter 2: Canaries
  31. Traditional Canaries [diagram labels: Canary (New Code), Baseline (Old Code), Shadow Traffic, Source System, Video Metadata Service, Amazon S3, Netflix Services, Traffic]
  32. Data Canaries [diagram labels: Source System, Video Metadata Service, Amazon S3, Netflix Data Canary Service, Netflix Services, Traffic]
  33. Netflix Data Canary Service ● Pick key use case(s) ● Pick data to test ● Test with latest data
  34. 1 Detection, 2 Staggering, 3 Rollback
  35. Staggered Rollout [image: AWS Global Infrastructure]
  36. 1 Detection, 2 Staggering, 3 Rollback
  37. Keep calm & rollback ● Pin back ● Root cause ● Unpin
  38. Rollback ● Visibility ● Traversing
  39. Data Diff UI
  40. Circuit Breaker UI
  41. Pinning UI
  42. Data validation is key to high availability ● Data change = Code push ● Circuit breakers & canaries ● Staggering and rollback
  43. Thank You
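
The circuit-breaker slides (25 and 27) name two of the checks, duplicate detection and object counts, and three knobs: on/off, threshold, and exclusions. The slides do not show how Netflix implements them, so the following is only a minimal Python sketch of the idea. The names `Knobs`, `duplicate_check`, `object_count_check`, and `should_publish`, the record shape (`{"id": ...}`), and the 10% default threshold are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Knobs:
    """Hypothetical per-check settings: on/off, threshold, exclusions."""
    enabled: bool = True
    threshold: float = 0.10              # assumed: max allowed fractional change
    exclusions: frozenset = frozenset()  # assumed: ids the check ignores


def duplicate_check(records, knobs):
    """Pass unless a non-excluded id appears more than once."""
    if not knobs.enabled:
        return True, "duplicate check disabled"
    counts = Counter(r["id"] for r in records if r["id"] not in knobs.exclusions)
    dupes = sorted(k for k, n in counts.items() if n > 1)
    return (not dupes), f"duplicate ids: {dupes}"


def object_count_check(previous, current, knobs):
    """Pass unless the object count moved by more than the threshold fraction."""
    if not knobs.enabled:
        return True, "count check disabled"
    if not previous:
        return True, "no previous version to compare"
    change = abs(len(current) - len(previous)) / len(previous)
    return change <= knobs.threshold, f"object count changed by {change:.0%}"


def should_publish(previous, current, dup_knobs=Knobs(), count_knobs=Knobs()):
    """Run every check; the new data version is published only if all pass."""
    results = [
        duplicate_check(current, dup_knobs),
        object_count_check(previous, current, count_knobs),
    ]
    reasons = [message for ok, message in results if not ok]
    return (not reasons), reasons
```

In this sketch a tripped breaker simply blocks publication and reports why. The exclusions and on/off knobs let an operator let a known, intentional change through without removing the check.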

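Staggering and rollback (slides 35 and 37) can be sketched the same way: apply the new data version to one region at a time, and pin every updated region back to the last known good version as soon as one looks unhealthy. This is an illustrative sketch, not the mechanism described in the talk; `staggered_rollout`, the `apply` and `healthy` callbacks, and the region names are hypothetical.

```python
def staggered_rollout(version, regions, apply, healthy, pinned_version):
    """Roll `version` out region by region; pin back on the first failure.

    apply(region, version) and healthy(region) are caller-supplied
    callbacks (assumed for this sketch).
    Returns (True, None) on success or (False, failing_region).
    """
    updated = []
    for region in regions:
        apply(region, version)
        updated.append(region)
        if not healthy(region):
            # Pin back: restore the known-good version everywhere we touched.
            for touched in updated:
                apply(touched, pinned_version)
            return False, region
    return True, None


# Usage with an in-memory stand-in for real deployment state.
state = {}
ok, failed = staggered_rollout(
    "v2",
    ["region-a", "region-b", "region-c"],
    apply=lambda region, version: state.__setitem__(region, version),
    healthy=lambda region: region != "region-b",  # simulate a bad region
    pinned_version="v1",
)
```

Because later regions are never touched after a failure, the bad data reaches only part of the fleet. Root-causing and unpinning then happen while the pinned version keeps serving.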