Improving the power of a picture at Netflix -- the Science and Engineering Behind the Curtain

You have seen that A/B testing enables you to take a data-driven approach to improving the product. Here at Netflix we use A/B testing extensively to improve personalized recommendations on the homepage, playback, the non-member signup flow, etc. One of the newer areas of A/B testing is selecting the optimal image asset for every video on the service, so that each title is best represented at a glance.

This session will explore the incremental steps towards building a sequence of A/B tests from a set of hypotheses about image asset selection: the fastest way to learn what improves the product, challenges with the foundational data used for such tests, scaling challenges, test analyses, etc. Some of the details can be found in this tech blog post: http://techblog.netflix.com/2016/05/selecting-best-artwork-for-videos.html
-----

Video https://www.youtube.com/watch?v=trNPa6cGcIo

------

Image Credits:
Photo credit Richard Foster;
Photo credit https://commons.wikimedia.org/wiki/File:Youth-soccer-indiana.jpg
“Analyze this” movie by Time Warner. https://en.wikipedia.org/wiki/Analyze_This
https://commons.wikimedia.org/wiki/File:Question_Mark_Cloud.jpg

Published in: Engineering

  1. Improving the power of a picture via A/B testing. Gopal Krishnan, Director of Engineering; Dale Elliott, Senior Software Engineer; Kenny Xie, Senior Data Scientist
  2. TV is a lean back experience
  3. 90 seconds
  4. Pop Quiz
  5. A round plane figure whose boundary (the circumference) consists of points equidistant from a fixed point (the center).
  6. A round plane figure whose boundary (the circumference) consists of points equidistant from a fixed point (the center).
  7. Can we do better?
  8. Sensitivity test
  9. The Short Game
  10. Single title A/B test result: 14% better, 6% better
  11. Testable Hypothesis
  12. Displaying better artwork will result in greater engagement and retention by helping members discover stories they will enjoy even faster.
  13. Data Driven
  14. Feedback loop. Components: Netflix API service; Beacon (telemetry collection service); Hive (computes artwork performance metrics for every title/country/locale pair); Netflix Image Library; Device (PS3, website, etc.). Steps: Serve artwork based on A/B logic; Feed with artwork based on perf metric; Collect plays & client impressions
  15. Anatomy of artwork
  16. Stable Image id for ground truth data: source-file-id-1, source-file-id-3, source-file-id-2; Lineage-id-1
  17. Diversity matters
  18. Diversity matters
  19. Pop Quiz 1 2 4 5 6 3
  20. Building the A/B tests vs.
  21. Pairs of Explore and Exploit Tests. Explore Test: Current production explore, New explore, Winner. Exploit Test: Current production exploit, New exploit, Winner. ● No member overlap ● Explore and exploit allocation happens simultaneously
  22. Multi-title explore allocation test
  23. Test Evolution: Single Title to Multiple Titles. Single title, multi-cell test. Columns: Cell 1 to Cell 6. Rows: Title 1, Title 2, ..., Title n. Each row: Cell 1 = Control Image, Cell 2 = Test Image 1, Cell 3 = Test Image 2, Cell 4 = Test Image 3, Cell 5 = Test Image 4, Cell 6 = Test Image 5
  24. Engineering implementation / complexity • Our A/B infrastructure is optimized for comparing test cells to each other • Need to compare data across cells for one title of many • Avoid creating hundreds of tests (one per title)
  25. Engineering implementation / complexity. Solution: • Treat all the members who see a title’s images as a virtual test • Impression tracking, not just test cell allocation, defines the test population per title. Diagram labels: Allocated Members, Title A impressions, Title B impressions
  26. Problems with multi-title, multi-cell test • Cohorts of testers who all saw the same set of images • Same number of images for every title
  27. Single-cell explore allocation test
  28. Test Evolution: Images per title. Multi-cell explore evolves to Single-cell explore. Devolves? Virtual Tests inside one test cell. Title 1: “Cells” 1 to 6 = Control, Image 1, Image 2, Image 3, Image 4, Image 5. Title 2: “Cells” 1 to 4 = Control, Image 1, Image 2, Image 3
  29. Engineering implementation / complexity. Goals • No cohorts • Image stickiness • No persistent storage. We used a deterministic, pseudo-random calculation • new Random(memberID * titleId).nextInt(numImages)
  30. Single-cell explore test. Engineering implementation / complexity: no persistence needed. Components: Netflix API Service; Image Data Feed (Title ID, Image Lists); Netflix Image Lib. Random assignment to all test members. Cells: Title 1: Cell 1 = Ctrl Image, Cell 2 = Random of [Ctrl, Test 1, ... Test X1]. Title 2: Cell 1 = Ctrl Image, Cell 2 = Random of [Ctrl, Test 1, ... Test X2]. ... Title n: Cell 1 = Ctrl Image, Cell 2 = Random of [Ctrl, Test 1, ... Test Xn]
  31. Result ● No more cohorts ● Flexible ● Clear winners for many titles ● Overall win based on key metrics. Can we do better?
  32. Problems • Overexposure of under-performing images • Underexposure of niche titles • Unfair burden on testers
  33. Title-level allocation test
  34. Solution: Title-Level Allocation • Limit allocated members per title • Less exposure of under-performing images • Still get enough data to determine winner • Allocate from a gigantic pool • More exposure for niche titles • Spreads testing burden
  35. Test Evolution: Testers per title. Diagram labels: Title A, Title B, Title C; Title A, Title B. ● Some titles have few testers in the small pool ● Most titles have full testing allocation from larger pool
  36. Engineering implementation / complexity • Goals from previous test • No cohorts • Image stickiness • No persistent storage • New goals • Less exposure for under-performing images • More exposure for niche titles • Faster decision and rollout of winning images • This time, we needed to persist the allocations
  37. Architecture. Components: Netflix API Service; Image Data Feed; Yellow Square (Y2); Netflix Image Library; Title Metadata Service (VMS); Kafka. Flowchart nodes: Member Allocated? Title fully Allocated? Allocate with Random Assignment; Log and store Allocation; Select Assigned Image; Select Control Image; Select Assigned Image. Branch labels: No, No, Yes, Yes
  38. Oops ● Underestimated traffic ● Many titles allocated per member at once ● Write to Y2 for every allocation. Result: Service disruption; we had to turn off the test
  39. Scaling. Components: Netflix API Service; Image Data Feed; Yellow Square (Y2); Netflix Image Library; Kafka; Stream Processor. Steps: Allocate with Random Assignment; Log and store Allocation; 1 write per member every 30 sec. Storing allocations as they occurred overloaded Yellow Square. Now, we log them to a stream and consolidate many writes into one.
  40. Who to Test on? Test on the same population you are planning to roll out the changes to
  41. Two Member Cohorts • New Members are assigned to the experimental condition at the time of sign-up • Existing Members are assigned to the experimental condition any time after their free trial ended
  42. Decision Focuses More on New Members • A “pure” sample which is not tainted by a previous Netflix experience • A more sensitive sample (“on the fence”)
  43. Tiers of Metrics • Primary: Customer retention • Secondary: Streaming hours • Tertiary: all other customer engagement metrics • Play rate • Number of Netflix visits • ...
  44. How to Pick the Winner in Explore? • Take fraction = (number of users who played the title) / (number of users who saw the title) • Correlated with retention • Measurable from day one
  45. What is a Play?
  46. What is a Play?
  47. What is a Play?
  48. Does Impression Location Matter?
  49. Does Impression Location Matter?
  50. Does Impression Location Matter?
  51. Does it Matter How Many Impressions it Takes to Play? “Netflix just recommended an awesome show to me and I am going to watch it!!!”
  52. Does it Matter How Many Impressions it Takes to Play? “I have seen the show on Netflix a few times. Maybe, I should try it...”
  53. Take Fraction is NOT as trivial as its definition implies.
  54. How to Make the Final Decision? Final decision is based on the exploit test • Retention movement • Streaming hours movement • Engagement with titles explored in the test, titles not explored in the test • ...
  55. Our Image Selection Test is a Win! • Improved customer retention • Improved customer engagement
  56. Some Learnings
  57. Emotions are excellent for conveying complex nuances
  58. Great stories travel - but regional nuances can be powerful
  59. Nice Guys Often Finish Last
  60. Contact: Gopal Krishnan, Dale Elliott, Kenny Xie. More details available at the Netflix tech blog. Talk to us outside at the booth.
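
Slide 25 says that impression tracking, not just test cell allocation, defines the test population per title. A minimal sketch of that idea follows. The `Impression` type, its fields, and the in-memory list of events are assumptions made for illustration; the slides do not show how the real telemetry is stored or queried.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class VirtualTests {
    /** Hypothetical impression event: a member was shown one image for one title. */
    static final class Impression {
        final long memberId;
        final long titleId;
        final int imageIndex;

        Impression(long memberId, long titleId, int imageIndex) {
            this.memberId = memberId;
            this.titleId = titleId;
            this.imageIndex = imageIndex;
        }
    }

    /** Groups impressions into title -> image -> distinct members who saw that image. */
    static Map<Long, Map<Integer, Set<Long>>> populations(List<Impression> impressions) {
        Map<Long, Map<Integer, Set<Long>>> byTitle = new TreeMap<>();
        for (Impression imp : impressions) {
            byTitle.computeIfAbsent(imp.titleId, t -> new TreeMap<>())
                   .computeIfAbsent(imp.imageIndex, i -> new TreeSet<>())
                   .add(imp.memberId);
        }
        return byTitle;
    }

    public static void main(String[] args) {
        List<Impression> log = Arrays.asList(
            new Impression(1, 100, 0),
            new Impression(2, 100, 1),
            new Impression(2, 100, 1),   // a repeat impression counts the member once
            new Impression(3, 200, 0));
        // prints {100={0=[1], 1=[2]}, 200={0=[3]}}
        System.out.println(populations(log));
    }
}
```

Each title's entry acts as its own virtual test: only members who actually saw one of the title's images are counted, whichever cell they were allocated to.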
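
Slide 29 gives the calculation `new Random(memberID * titleId).nextInt(numImages)`. The sketch below wraps that expression in a method to show why it meets the stated goals of image stickiness without persistent storage. The class name, method name, and ID values are illustrative assumptions, as is the use of `long` for both IDs.

```java
import java.util.Random;

final class StickyImageChooser {
    /**
     * Returns an image index in [0, numImages) that is stable for a member/title pair.
     * Seeding the generator from the two IDs makes the pick repeatable,
     * so the assignment never has to be stored.
     */
    static int chooseImage(long memberId, long titleId, int numImages) {
        // nextInt(bound) throws IllegalArgumentException if numImages <= 0.
        return new Random(memberId * titleId).nextInt(numImages);
    }

    public static void main(String[] args) {
        long memberId = 12345L;  // hypothetical IDs
        long titleId = 80001L;
        int first = chooseImage(memberId, titleId, 6);
        int second = chooseImage(memberId, titleId, 6);
        System.out.println(first == second);  // prints true: same inputs, same image
    }
}
```

One property of a product seed is that different member/title pairs with the same product share a seed and therefore the same pick. That does not affect stickiness. A hash that mixes the two IDs would be one alternative if it mattered.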
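
Slides 34 to 37 describe title-level allocation: limit the allocated members per title and persist each allocation. The sketch below is one reading of the slide 37 flowchart. The branch order is an assumption, since the transcript lists the nodes without their connections: an already allocated member keeps the assigned image, a fully allocated title falls back to the control image, and otherwise a new random allocation is made and stored. The in-memory maps stand in for the persistent store, and the per-title cap and the convention that index 0 is the control image are also assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

final class TitleLevelAllocator {
    private final int maxMembersPerTitle;                                // hypothetical cap
    private final Map<String, Integer> allocations = new HashMap<>();    // stand-in for the persistent store
    private final Map<Long, Integer> allocatedPerTitle = new HashMap<>();
    private final Random random;

    TitleLevelAllocator(int maxMembersPerTitle, long seed) {
        this.maxMembersPerTitle = maxMembersPerTitle;
        this.random = new Random(seed);
    }

    /** Returns the image index to show; index 0 is treated as the control image. */
    int selectImage(long memberId, long titleId, int numImages) {
        String key = memberId + ":" + titleId;
        Integer assigned = allocations.get(key);
        if (assigned != null) {
            return assigned;                       // member already allocated: sticky image
        }
        int count = allocatedPerTitle.getOrDefault(titleId, 0);
        if (count >= maxMembersPerTitle) {
            return 0;                              // title fully allocated: control image
        }
        int image = random.nextInt(numImages);     // allocate with random assignment
        allocations.put(key, image);               // "log and store allocation"
        allocatedPerTitle.put(titleId, count + 1);
        return image;
    }

    public static void main(String[] args) {
        TitleLevelAllocator allocator = new TitleLevelAllocator(2, 42L);
        int first = allocator.selectImage(1L, 100L, 6);
        allocator.selectImage(2L, 100L, 6);
        System.out.println(allocator.selectImage(1L, 100L, 6) == first);  // prints true
        System.out.println(allocator.selectImage(3L, 100L, 6));           // prints 0 (cap reached)
    }
}
```

Capping allocations per title is what limits exposure of under-performing images, at the cost of the stored state that the earlier deterministic approach avoided.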
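
Slides 38 and 39 explain that writing to Yellow Square for every allocation overloaded it, and that the fix was to log allocations to a stream and consolidate many writes into one per member every 30 seconds. The sketch below shows only the consolidation step. The class, the `Allocation` type, and the write counter are assumptions, and the Kafka stream, the timer, and the real store are deliberately left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AllocationBatcher {
    /** Hypothetical allocation event as it might appear on the stream. */
    static final class Allocation {
        final long memberId;
        final long titleId;
        final int imageIndex;

        Allocation(long memberId, long titleId, int imageIndex) {
            this.memberId = memberId;
            this.titleId = titleId;
            this.imageIndex = imageIndex;
        }
    }

    private final Map<Long, List<Allocation>> pending = new LinkedHashMap<>();
    private int storeWrites = 0;  // counts simulated writes to the store

    /** Buffers an allocation instead of writing it immediately. */
    void log(Allocation a) {
        pending.computeIfAbsent(a.memberId, m -> new ArrayList<>()).add(a);
    }

    /**
     * Meant to be called on a timer (the slide mentions 30 seconds).
     * Returns one batch per member; each batch would become a single write.
     */
    Map<Long, List<Allocation>> flush() {
        Map<Long, List<Allocation>> batch = new LinkedHashMap<>(pending);
        pending.clear();
        storeWrites += batch.size();
        return batch;
    }

    int storeWrites() {
        return storeWrites;
    }

    public static void main(String[] args) {
        AllocationBatcher batcher = new AllocationBatcher();
        batcher.log(new Allocation(1, 100, 2));
        batcher.log(new Allocation(1, 200, 0));
        batcher.log(new Allocation(1, 300, 4));
        batcher.log(new Allocation(2, 100, 1));
        batcher.flush();
        System.out.println(batcher.storeWrites());  // prints 2: two writes for four allocations
    }
}
```

With many titles allocated per member at once, the write count scales with the number of active members per interval rather than with the number of allocations.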
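
Slide 44 defines take fraction as the number of users who played the title divided by the number of users who saw it. The sketch below computes it per image and picks the image with the highest value. Picking the maximum with no significance test is a simplification assumed here, and the counts are made-up inputs. As slides 45 to 53 stress, deciding what counts as a play or an impression is the hard part, and it is not modeled.

```java
final class TakeFraction {
    /** Take fraction = members who played / members who saw; 0.0 when nobody saw the image. */
    static double takeFraction(int membersWhoPlayed, int membersWhoSaw) {
        if (membersWhoSaw == 0) {
            return 0.0;  // convention chosen for this sketch
        }
        return (double) membersWhoPlayed / membersWhoSaw;
    }

    /** Returns the index of the image with the highest take fraction (first one wins ties). */
    static int bestImage(int[] played, int[] saw) {
        if (played.length == 0 || played.length != saw.length) {
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        }
        int best = 0;
        for (int i = 1; i < played.length; i++) {
            if (takeFraction(played[i], saw[i]) > takeFraction(played[best], saw[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] played = {10, 30, 20};    // hypothetical counts per image
        int[] saw = {200, 200, 200};
        System.out.println(takeFraction(played[1], saw[1]));  // prints 0.15
        System.out.println(bestImage(played, saw));           // prints 1
    }
}
```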
