Slides from an introductory talk on machine learning, and why mathematicians should take interest in it.

This is a very basic introduction, for math undergraduates & other curious minds.


- 1. The Netflix Prize: yet another million dollar problem. David Bessis, École Normale Supérieure, 27/01/2010. Outline: The Problem; Strategies; Some Funny New Science.
- 2–20. 7 + 1 Million Dollar Problems
  - Millennium Prize Problems:
    - Funded in 2000 by the Clay Mathematics Institute.
    - Seven classical open problems in Mathematics.
    - Fuzzy rules: solutions must "be published in a refereed mathematics publication of worldwide repute" and "have general acceptance in the mathematics community two years after".
    - The Poincaré conjecture was solved by Perelman in 2003. No award yet.
  - Netflix Prize:
    - Funded in 2006 by the DVD rental company Netflix.
    - A problem in Applied Mathematics? Computer Science? Psychology? (Do we really care?) A problem in Some Funny New Science.
    - Reasonably clear rules.
    - Prize awarded in September 2009.
- 21–28. Context (section: The Problem; subsections: Rules, Competition)
  - Netflix has an "all-you-can-eat" pricing model.
  - They need their users to watch a lot of movies.
  - Beyond a few obvious choices, people don't know what they want to watch.
  - Collaborative filtering: recommending products based on prior evaluations by other users (just like Amazon does).
  - The Netflix Prize is a collaborative filtering competition:
    - Based on a huge dataset of actual ratings by Netflix users.
    - Open to almost everyone.
    - Endowed with a $1,000,000 prize.
- 29–34. The Dataset
  - The user space U consists of 480 189 users (identified by a meaningless non-sequential integer id).
  - The movie space M consists of 17 770 movies (identified by integers 1, ..., 17 770; the associated list of titles and release years is provided, and this data is meaningful and minable).
  - The date space D spans the period Oct. 1998 – Dec. 2005 (extremely meaningful data; no time of day is provided).
  - The rating space R is {1, 2, 3, 4, 5} ("stars").
  - The training dataset T contains 100 480 507 quadruples (u, m, d, r) ∈ U × M × D × R.
  - The qualifying dataset Q contains 2 817 131 triples (u, m, d) ∈ U × M × D.
- 35–41. The Challenge
  - Open to everyone except Netflix employees and their relatives and residents of Cuba, Iran, Syria, North Korea, Myanmar and Sudan.
  - Participants can join efforts in teams.
  - They can upload their predictions up to once a day.
  - Predictions are maps from the qualifying set Q to the interval [1, 5].
  - The metric used to benchmark predictions is the RMSE ("root of mean square error"):

    $$\mathrm{RMSE} = \sqrt{\frac{1}{|Q|} \sum_{q \in Q} \left|\,\text{predicted rating for } q - \text{actual rating for } q\,\right|^2}$$
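The RMSE defined on the slide above can be computed in a few lines. The following is a minimal sketch, not code from the talk: the function name `rmse` and the toy ratings are made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root of mean square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data, invented for this example (not taken from the Netflix dataset).
predicted = [3.5, 4.0, 2.0, 5.0]  # predictions live in the interval [1, 5]
actual = [4, 4, 1, 3]             # true ratings live in {1, 2, 3, 4, 5}
score = rmse(predicted, actual)
```

Because errors are squared before averaging, a single prediction that is off by 2 stars weighs as much as four predictions that are each off by 1 star.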
- 42–47. Typical RMSEs
  - Theoretically, the RMSE cannot be greater than 2 (a constant prediction of 3 is never off by more than 2 stars).
  - Users tend to view and rate movies they like, so they typically give 3, 4 or 5 stars rather than 1 or 2 (the above upper bound is unrealistically pessimistic).
  - A basic prediction consists of mapping a triple (u, m, d) to the average rating obtained by the movie m. It achieves 1.0540.
  - At the beginning of the Challenge, Netflix's in-house prediction system Cinematch achieved 0.9514 (roughly a 10% improvement).
  - Netflix set the following target: obtain a further 10% improvement over Cinematch.
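The "basic prediction" described above (map every triple to the movie's average rating) could be sketched as follows. The function name, the toy training quadruples, and the fallback value of 3.0 for a movie with no training ratings are assumptions made for this example; the slides do not specify them.

```python
from collections import defaultdict


def movie_average_predictor(training, default=3.0):
    """Build predict(u, m, d) returning the mean training rating of movie m.

    `default` is an assumed fallback for movies absent from the training data.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, movie, _date, rating in training:
        totals[movie] += rating
        counts[movie] += 1
    means = {movie: totals[movie] / counts[movie] for movie in counts}

    def predict(_user, movie, _date):
        # The user and the date are ignored by this baseline.
        return means.get(movie, default)

    return predict


# Toy quadruples (u, m, d, r), invented for illustration.
training = [
    (101, 1, "2004-01-05", 4),
    (102, 1, "2004-02-11", 5),
    (101, 2, "2005-03-09", 2),
    (103, 2, "2005-03-10", 3),
    (103, 2, "2005-06-01", 1),
]
predict = movie_average_predictor(training)
```

With this toy data, `predict(999, 1, "2005-12-01")` returns the mean of movie 1's two ratings regardless of which user asks, which is exactly why the baseline leaves so much room for improvement.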
- 48–52. Very Smart Rules 1: a Cryptographic Trick
  - Netflix has secretly partitioned the qualifying set $Q = Q_1 \sqcup Q_2$ into two subsets of (approximately) equal sizes.
  - The RMSE achieved on $Q_1$ is revealed to participants (there is a public leaderboard).
  - The RMSE achieved on $Q_2$ is used to determine the winner.
  - This prevented participants from "learning from the oracle".
  - The goal was to achieve 0.8572.
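The public/private split can be illustrated with a small simulation. This is only a sketch of the idea: the slides do not say how Netflix drew the partition, so the seeded random shuffle, the function names, and the toy ratings are all assumptions.

```python
import math
import random


def split_qualifying(n, seed):
    """Partition indices 0..n-1 into two halves using a secret seed.

    Q1 feeds the public leaderboard; Q2 is kept private to pick the winner.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    half = n // 2
    return sorted(indices[:half]), sorted(indices[half:])


def rmse_on(subset, predicted, actual):
    """RMSE restricted to the positions listed in `subset`."""
    total = sum((predicted[i] - actual[i]) ** 2 for i in subset)
    return math.sqrt(total / len(subset))


# Toy qualifying set, invented for illustration.
actual = [4, 3, 5, 1, 2, 4, 5, 3]
predicted = [3.8, 3.1, 4.2, 2.0, 2.5, 4.0, 4.6, 3.3]

q1, q2 = split_qualifying(len(actual), seed=2006)
public_score = rmse_on(q1, predicted, actual)   # shown to participants
private_score = rmse_on(q2, predicted, actual)  # used to rank submissions
```

A participant who tunes a submission against the daily public score can only overfit to Q1; since membership in Q1 and Q2 is unknown, that tuning gives no direct handle on the score that actually decides the prize.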
- 53–60. Very Smart Rules 2: Crowd Psychology Tricks
  - The Challenge opened on October 2, 2006.
  - Annual $50,000 prizes were offered to current leaders, provided they made their current methodology public.
  - The Challenge was to last for 30 more days after the goal was achieved.
  - The winner would be the team with the best RMSE after this 30-day period (no backstabbing arXiv-style "I posted first" effect).
  - Every detail was carefully anticipated (even the possibility of a tie).
  - These smart rules, together with the $1,000,000 prize, attracted thousands of participants.
- 61–66. Timeline
  - October 2006: Cinematch RMSE = 0.9514.
  - October 2007: team KorBell leads with 0.8712 (8.43% improvement).
  - October 2008: team "BellKor in BigChaos" (two teams merging efforts) leads with 0.8616 (9.44% improvement).
  - June 26, 2009: the goal is achieved.
  - July 26, 2009: Netflix stops gathering solutions.
  - The winner is announced on September 18, 2009.
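The improvement percentages quoted in the timeline are relative reductions of the RMSE with respect to Cinematch's 0.9514. A short sketch of that arithmetic, with a helper name chosen for this example:

```python
def improvement_percent(baseline_rmse, new_rmse):
    """Relative RMSE reduction over a baseline, expressed in percent."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


CINEMATCH = 0.9514  # Cinematch RMSE from the timeline (October 2006)

korbell_2007 = improvement_percent(CINEMATCH, 0.8712)
bellkor_in_bigchaos_2008 = improvement_percent(CINEMATCH, 0.8616)
```

Rounded to two decimals, these reproduce the 8.43% and 9.44% figures on the slide.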
- 67–70. The winning team
  - Three teams combined their results to win the competition:
    - BellKor: Bob Bell (AT&T), Yehuda Koren (Yahoo), Chris Volinsky (AT&T)
    - BigChaos: Michael Jahrer (Commendo research and consulting), Andreas Töscher (Commendo research and consulting)
    - Pragmatic Theory: Martin Chabbert (Pragmatic Theory), Martin Piotte (Pragmatic Theory)
  - Their winning submission achieved an RMSE of 0.8567 (10.06% improvement over Cinematch).
  - Another team, The Ensemble, achieved the same RMSE...
  - ...and lost because their submission was posted 24 minutes later!
- 78. Computer implementation. Memory requirements:
  - Movies can be encoded on 2 bytes ($17770 < 256^2$).
  - Viewers can be encoded on 3 bytes ($480189 < 256^3$).
  - Dates can be encoded on 2 bytes.
  - A triple $(m, v, d)$ can be encoded on 7 bytes.
  - 700 MB suffice to store the dataset.
  - It is possible (necessary) to work in RAM.
  - Commodity hardware is sufficient.
  - (I have some Ruby code to interactively play with the dataset.)
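The byte counts above can be checked with a short Python sketch. The field order, the big-endian layout, the encoding of a date as a day offset, and the helper names `pack_triple` and `unpack_triple` are assumptions made for illustration; this is not the author's Ruby code. The count of 100 000 000 entries is the round figure quoted elsewhere in the talk.

```python
# Illustrative packing of a (movie, viewer, date) triple into 7 bytes.
# Layout and names are assumptions, not the talk's actual implementation.

N_MOVIES = 17770
N_VIEWERS = 480189
N_TRIPLES = 100_000_000  # round figure used in the talk


def pack_triple(movie: int, viewer: int, day: int) -> bytes:
    """Encode movie on 2 bytes, viewer on 3 bytes, day offset on 2 bytes."""
    return (movie.to_bytes(2, "big")
            + viewer.to_bytes(3, "big")
            + day.to_bytes(2, "big"))


def unpack_triple(blob: bytes) -> tuple:
    """Inverse of pack_triple: returns (movie, viewer, day)."""
    return (int.from_bytes(blob[0:2], "big"),
            int.from_bytes(blob[2:5], "big"),
            int.from_bytes(blob[5:7], "big"))


# The identifiers fit in the stated number of bytes.
assert N_MOVIES < 256 ** 2 and N_VIEWERS < 256 ** 3

record = pack_triple(17769, 480188, 2000)
total_megabytes = len(record) * N_TRIPLES / 10 ** 6
print(len(record), total_megabytes)  # 7 700.0
```

A fixed-width record like this also makes random access trivial: entry number `i` starts at byte offset `7 * i`.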
- 89. Remarks:
  - About 200 ratings per user.
  - This is likely caused by Cinematch’s data gathering procedure: users sometimes rate tens of movies on a single day.
  - This causes an insanely huge bias within the dataset (movies are perceived differently when rated individually or within a rating spree), not fully exploited by the winners. Netflix, do you read me?
  - Some movies were rated by hundreds of thousands of viewers, some by just a few (long-tail distribution).
  - Similarly, a user rated all the movies, and many rated just a few.
  - Let $F$ be the set of the final 9 ratings of each individual user.
  - Then $F = Q \sqcup P$, with $P \subset T$ publicly tagged by Netflix.
  - $Q$ is a random draw of 2/3 of $F$.
  - $Q$ resembles $P$ but is very dissimilar from $T$.
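The construction of $F$, $Q$ and $P$ described above can be sketched on toy data. The data layout, the helper name `split_holdout`, the fixed seed, and the use of `random.sample` are illustrative assumptions; the talk only states that $F$ holds each user's final 9 ratings and that $Q$ is a random draw of 2/3 of $F$.

```python
import random


def split_holdout(ratings_by_user, last_k=9, q_fraction=2 / 3, seed=0):
    """Build F from each user's final `last_k` ratings, then split it into Q and P.

    ratings_by_user maps a user id to a list of (date, movie) pairs.
    Returns the three sets (F, Q, P) of (user, movie, date) triples.
    """
    final = []
    for user, events in ratings_by_user.items():
        # Sorting by date puts the most recent ratings last.
        for date, movie in sorted(events)[-last_k:]:
            final.append((user, movie, date))
    rng = random.Random(seed)
    q_size = round(q_fraction * len(final))
    q = set(rng.sample(final, q_size))
    p = set(final) - q
    return set(final), q, p


# Toy data: 3 users with 12 ratings each, one rating per day.
toy = {u: [(day, 100 * u + day) for day in range(12)] for u in range(3)}
F, Q, P = split_holdout(toy)
print(len(F), len(Q), len(P))  # 27 18 9
```

Because $F$ only contains each user's most recent ratings, its distribution over users and dates differs from that of the full training set, which is consistent with the remark that $Q$ resembles $P$ but not $T$.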
- 95. Algorithms. The machine learning toolbox consists of many methods:
  - Clustering methods.
  - Regressions.
  - Latent parameter methods (SVD).
  - Neural networks.
  - SVM.
  - ...
- 101. Beginner’s mistakes:
  - Underestimate the volume effect.
  - Think conceptually and discretely rather than globally and continuously.
  - Put users and movies into categories (clustering introduces unwanted discontinuities).
  - Learn from the probe.
  - Dealing with 100 000 000 data points isn’t a logic puzzle. It resembles thermodynamics.
- 109. Linear regression.
  - Suppose all viewers in $X$ have rated all movies in $Y$: the rating matrix is $(r_{x,y})_{(x,y) \in X \times Y}$.
  - Suppose you want to model the ratings given to a particular movie $y_0$ based on the ratings given to the movies in $Y' = Y - \{y_0\}$. A linear regression is a natural way to do that.
  - Write $(r_{x,y}) = (C_y)_{y \in Y}$, where the $C_y$ are the column vectors.
  - Performing the linear regression consists of approximating $C_{y_0}$ by its orthogonal projection $\hat{C}_{y_0}$ on the linear subspace generated by the $(C_y)_{y \in Y'}$.
  - Clearly, there exists a unique solution. It optimizes RMSE.
  - Write the formula!
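One way to write the formula, under the additional assumption that the columns $C_y$ for $y \neq y_0$ are linearly independent. The notation $A$ and $\hat{\lambda}$ is introduced here for illustration: $A$ is the matrix whose columns are the $C_y$ with $y \in Y - \{y_0\}$, and $\hat{\lambda}$ is the vector of regression coefficients. The normal equations then give:

```latex
\hat{\lambda} = (A^{\top} A)^{-1} A^{\top} C_{y_0},
\qquad
\hat{C}_{y_0} = A \hat{\lambda} = A (A^{\top} A)^{-1} A^{\top} C_{y_0}.
```

Without the independence assumption the projection $\hat{C}_{y_0}$ is still unique, but the coefficients are not. Minimizing $\|C_{y_0} - A\lambda\|^2$ is the same as minimizing the RMSE over the viewers in $X$, since the RMSE equals $\sqrt{\|C_{y_0} - A\lambda\|^2 / |X|}$.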
- 118. Real life problems 1: missing data.
  - Not all viewers have seen all movies. Worse, there are virtually no complete rectangular blocks within the dataset.
  - Regression by viewers or by movies? It is better to do regression by movies.
  - Normalize ratings: replace the rating $r_{v,m}$ by the meaningful signal, i.e., the difference $\tilde{r}_{v,m}$ between $r_{v,m}$ and the average rating for $m$.
  - Then it becomes natural to set $\tilde{r}_{v,m}$ to 0 when $v$ hasn’t rated $m$.
  - Actually, whether or not $v$ has rated $m$ is meaningful information! Add normalized bit columns to account for that.
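A minimal sketch of this normalization on toy data. The dictionary layout, the helper name `normalize`, and the choice to center the "rated or not" bit by the fraction of viewers who rated the movie are assumptions; the talk does not say how the bit columns are normalized.

```python
def normalize(ratings, viewers, movies):
    """Return (signal, bits) for ratings given as {(viewer, movie): rating}.

    signal[(v, m)]: rating minus the movie's average, or 0.0 if v has not rated m.
    bits[(v, m)]:   1 or 0 (rated or not) minus the fraction of viewers who rated m.
    """
    signal, bits = {}, {}
    for m in movies:
        known = [r for (_, movie), r in ratings.items() if movie == m]
        mean = sum(known) / len(known) if known else 0.0
        rated_fraction = len(known) / len(viewers)
        for v in viewers:
            rated = (v, m) in ratings
            signal[(v, m)] = ratings[(v, m)] - mean if rated else 0.0
            bits[(v, m)] = (1.0 if rated else 0.0) - rated_fraction
    return signal, bits


toy_ratings = {("ann", "A"): 5, ("bob", "A"): 3, ("ann", "B"): 2}
signal, bits = normalize(toy_ratings, ["ann", "bob"], ["A", "B"])
print(signal[("ann", "A")], signal[("bob", "B")], bits[("bob", "B")])  # 1.0 0.0 -0.5
```

After centering, a missing rating and an exactly average rating both map to 0 in the signal columns; the bit columns are what lets a regression tell them apart.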
- 124. Real life problems 2: the curse of dimensionality.
  - We all know that Lagrange interpolators are not to be used on noisy data. Rather, one should look at best-fitting polynomials of a given low degree.
  - Similarly, the curse of dimensionality asserts that, with high-dimensional datasets, one will always find stupid predictors that make perfect predictions on the dataset and fail to generalize.
  - By looking at my audience today, what should I be able to infer? That having long hair is a reasonably good gender predictor? That wearing a grey sweater is a reasonably good gender predictor?
  - Dilemma: overlearning vs underlearning.
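The Lagrange remark can be made concrete with a toy computation. The data below (the line $y = x$ plus alternating noise of $\pm 0.5$) and the helper names are assumptions chosen for illustration: the degree-4 interpolant reproduces the five noisy points exactly, yet extrapolates far worse than a best-fitting line.

```python
def lagrange(xs, ys, t):
    """Evaluate at t the unique polynomial of degree < len(xs) through the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (t - xj) / (xi - xj)
        total += term
    return total


def best_line(xs, ys):
    """Least-squares line through the points: returns (slope, intercept)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


xs = [0, 1, 2, 3, 4]
ys = [0.5, 0.5, 2.5, 2.5, 4.5]  # y = x plus noise +0.5, -0.5, +0.5, ...

slope, intercept = best_line(xs, ys)
interp_at_5 = lagrange(xs, ys, 5)   # overlearning: the interpolant follows the noise
line_at_5 = slope * 5 + intercept   # low-degree fit: stays close to the true value 5
print(interp_at_5, line_at_5)
```

On this data the interpolant predicts about 20.5 at $x = 5$, against about 5.1 for the line, although only the interpolant has zero error on the five training points.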
- 128. Ridge regression (aka Tikhonov regularization).
  - Linear regression: given vectors $x, y_1, \dots, y_n \in \mathbb{R}^m$, find $\lambda_1, \dots, \lambda_n$ that minimize $\|x - \sum_i \lambda_i y_i\|^2$.
  - When $n$ is large (with respect to $m$), the linear system is underdetermined. Overfitting occurs.
  - A telltale sign of overfitting is the presence of $\lambda_i$’s with huge absolute values compensating each other.
  - Ridge regression (Tikhonov regularization): find $\lambda_1, \dots, \lambda_n$ that minimize $\|x - \sum_i \lambda_i y_i\|^2 + \varepsilon \sum_i |\lambda_i|^2$, where $\varepsilon$ is a well-adjusted (small) penalty coefficient.
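A minimal sketch of ridge regression through its normal equations $(Y^{\top} Y + \varepsilon I)\lambda = Y^{\top} x$, where $Y$ is the matrix whose columns are the $y_i$. The plain Gaussian elimination solver, the helper names, and the toy vectors are assumptions for illustration; a real implementation would rely on a linear algebra library.

```python
def dot(u, w):
    """Scalar product of two vectors given as lists."""
    return sum(p * q for p, q in zip(u, w))


def solve(a, b):
    """Solve the square system a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    rows = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (rows[i][n] - tail) / rows[i][i]
    return z


def ridge(x, ys, eps):
    """Coefficients minimizing ||x - sum_i l_i y_i||^2 + eps * sum_i |l_i|^2."""
    n = len(ys)
    gram = [[dot(ys[i], ys[j]) + (eps if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    return solve(gram, [dot(y, x) for y in ys])


# Two almost collinear predictors: plain regression (eps = 0) needs huge
# coefficients that nearly cancel; a small penalty removes them.
x = [0.0, 1.0]
ys = [[1.0, 0.0], [1.0, 0.001]]
plain = ridge(x, ys, 0.0)
penalized = ridge(x, ys, 0.01)
print(plain, penalized)
```

With `eps = 0` the coefficients come out close to $-1000$ and $+1000$, the compensating pattern described above; with `eps = 0.01` both stay below 0.1 in absolute value, at the cost of a worse fit on this particular $x$.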
- 140. Assigning attributes to movies. Assume that movies differ by their amount of certain qualities:
  - Violence.
  - Sex.
  - Anything else? Maybe not.
  - Humor.
  - This actor or that actress or some director...
  - Victorian costumes.
  - 3D.
  - Beautiful Japanese landscapes with Mount Fuji.
  - Whatever.
  - ...
- 145. Assigning attributes to movies (2). Maybe we could construct a map $\varphi : M \to \mathbb{R}^N$ from the space of movies to an abstract parameter space. Maybe we could construct a map $\psi : V \to \mathbb{R}^N$ from the space of viewers to the same abstract parameter space, expressing the viewers’ appetite for this and that attribute. Then a good estimate of $r_{v,m}$ should be (some calibrated, normalized variant of) the scalar product $\varphi(m) \cdot \psi(v)$. But how can we construct good $\varphi$ and $\psi$?
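To make the scalar-product estimate concrete, here is a minimal Python sketch. The three attributes, the vectors, the global mean of 3.6 and the clipping to the 1 to 5 star range are all hypothetical choices made for illustration; the slides only say "some calibrated normalized variant" without specifying one.

```python
import numpy as np

def predict_rating(phi_m, psi_v, global_mean=0.0, lo=1.0, hi=5.0):
    """Estimate r_{v,m} as a calibrated scalar product phi(m) . psi(v).

    The calibration used here (add a global mean, clip to [lo, hi]) is an
    assumption for illustration, not the method described in the talk.
    """
    raw = global_mean + float(np.dot(phi_m, psi_v))
    return float(np.clip(raw, lo, hi))

# Hypothetical 3-attribute space: (violence, humor, Victorian costumes).
phi_movie = np.array([0.9, 0.1, 0.0])    # how much of each attribute the movie has
psi_viewer = np.array([1.0, -0.5, 0.2])  # the viewer's appetite for each attribute

estimate = predict_rating(phi_movie, psi_viewer, global_mean=3.6)
```

A viewer who likes violence and dislikes humor gets a prediction above the mean for a violent, humorless movie, which is the intuition behind the dot product.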
- 153. Assigning attributes to movies (3). The Golden Rule of Machine Learning: “Learn everything from thy dataset!” Set $N$. Look for $\varphi$ and $\psi$ minimizing, over the known ratings, $\sum_{(v,m)} |r_{v,m} - \varphi(m) \cdot \psi(v)|^2$ or, rather, minimizing $\sum_{(v,m)} |r_{v,m} - \varphi(m) \cdot \psi(v)|^2 + \varepsilon (\|\varphi\|^2 + \|\psi\|^2)$. Phrase this as an optimization problem (convex in $\varphi$ for fixed $\psi$ and in $\psi$ for fixed $\varphi$, though not jointly convex). Tens of millions of parameters to adjust. Approximate the solution by Stochastic Gradient Descent (an iterative first-order method that updates the parameters one rating at a time). It really works well!
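The regularized minimization above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the BellKor implementation: the function names, the learning rate `lr`, the number of epochs, the random initialization and the eight toy ratings are all hypothetical.

```python
import numpy as np

def factorize_sgd(ratings, n_viewers, n_movies, n_factors=2,
                  lr=0.02, eps=0.02, n_epochs=500, seed=0):
    """Fit phi (movies) and psi (viewers) by stochastic gradient descent.

    ratings: list of (viewer, movie, rating) triples, i.e. the known r_{v,m}.
    Approximately minimizes, over the known ratings,
        (r - phi[m].psi[v])^2 + eps * (|phi[m]|^2 + |psi[v]|^2),
    with the constant factor 2 of the gradient absorbed into lr.
    """
    rng = np.random.default_rng(seed)
    phi = rng.normal(scale=0.1, size=(n_movies, n_factors))
    psi = rng.normal(scale=0.1, size=(n_viewers, n_factors))
    for _ in range(n_epochs):
        # Visit the known ratings in a fresh random order at each epoch.
        for idx in rng.permutation(len(ratings)):
            v, m, r = ratings[idx]
            err = r - phi[m] @ psi[v]
            phi_m_old = phi[m].copy()  # keep the old value for the psi update
            phi[m] += lr * (err * psi[v] - eps * phi[m])
            psi[v] += lr * (err * phi_m_old - eps * psi[v])
    return phi, psi

def rmse(ratings, phi, psi):
    """Root mean squared error of the scalar-product estimates."""
    errs = [r - phi[m] @ psi[v] for v, m, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

# Hypothetical toy dataset: 4 viewers, 3 movies, 8 known ratings.
toy = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0),
       (2, 1, 5.0), (2, 2, 4.0), (3, 0, 1.0), (3, 1, 4.0)]
phi, psi = factorize_sgd(toy, n_viewers=4, n_movies=3)
train_rmse = rmse(toy, phi, psi)
```

Each update touches only the two vectors involved in one rating, which is what makes the approach usable with tens of millions of parameters. On real data the error should be measured on held-out ratings rather than on the training set as done here.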
- 160. Assigning attributes to movies (4). Afterwards, one may try to make sense of the attributes (PCA). Objective basis for categorizing movies. $N$ itself can be “learnt” from the dataset. N = 50 ⇒ RMSE = 0.9046. N = 100 ⇒ RMSE = 0.9025. N = 200 ⇒ RMSE = 0.9009. (Disclaimer: my account is a naive oversimplification of Yehuda Koren’s paper.) A natural way to define a “cognitive dimension of the space of movies”?
- 164. Latent Semantic Analysis. Let $W$ be a set of words, let $D$ be a set of documents. Look at the frequency matrix $M = (m_{w,d})$. Singular Value Decomposition. ⇒ Abstract space of concepts.
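The Latent Semantic Analysis recipe can be illustrated with NumPy's `numpy.linalg.svd`. The five words, the four documents and the counts below are a made-up toy corpus, and keeping `k = 2` concepts is an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical toy corpus: rows are words, columns are documents,
# entries are word frequencies m_{w,d}.
words = ["movie", "film", "rating", "matrix", "vector"]
M = np.array([
    [2, 1, 0, 0],   # movie
    [1, 2, 0, 0],   # film
    [1, 1, 0, 0],   # rating
    [0, 0, 2, 1],   # matrix
    [0, 0, 1, 2],   # vector
], dtype=float)

# Singular Value Decomposition: M = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2  # number of abstract "concepts" kept
word_concepts = U[:, :k] * s[:k]    # each word as a point in concept space
doc_concepts = Vt[:k, :].T * s[:k]  # each document in the same space

def cosine(a, b):
    """Cosine similarity between two vectors of the concept space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_movie_film = cosine(word_concepts[0], word_concepts[1])
sim_movie_matrix = cosine(word_concepts[0], word_concepts[3])
```

In this toy corpus "movie" and "film" end up on the same concept axis while "movie" and "matrix" are orthogonal, even though no label ever said which words belong together.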
- 171. Tuning and Blending. This talk only mentions particular approaches. BellKor have a nice composite model (latent factors + regression + presence or absence of ratings, everything tuned simultaneously). Baseline: global mean + movie offset + user offset (offsets are learnt). BigChaos have filtered out many factors (even the impact of the character length of the movie title, or days of the week). BellKor have subtle ways to filter out time signals. Any two models can be combined through a regression (calibrated on the probe). The winning solution is a sophisticated blend.
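Combining two models through a regression calibrated on the probe can be sketched as an ordinary least-squares fit. The probe ratings and the two prediction vectors below are invented for illustration, and the intercept term is an assumption; the actual blends used by the winning teams were considerably more elaborate.

```python
import numpy as np

def fit_blend(pred_a, pred_b, probe_ratings):
    """Least-squares weights (w_a, w_b, intercept) calibrated on a probe set."""
    X = np.column_stack([pred_a, pred_b, np.ones(len(probe_ratings))])
    weights, *_ = np.linalg.lstsq(X, probe_ratings, rcond=None)
    return weights

def blend(pred_a, pred_b, weights):
    """Apply the fitted weights to two models' predictions."""
    return (weights[0] * np.asarray(pred_a)
            + weights[1] * np.asarray(pred_b)
            + weights[2])

def rmse(pred, truth):
    """Root mean squared error between predictions and true ratings."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2)))

# Hypothetical probe ratings and two imperfect models' predictions for them.
probe = np.array([4.0, 3.0, 5.0, 2.0, 1.0, 4.0])
model_a = np.array([3.5, 3.2, 4.1, 2.9, 2.0, 3.8])
model_b = np.array([4.4, 2.5, 4.9, 1.2, 1.6, 3.1])

w = fit_blend(model_a, model_b, probe)
blended = blend(model_a, model_b, w)
blend_rmse = rmse(blended, probe)
```

On the probe set itself the blend can never be worse than either input model, since each model alone corresponds to one particular choice of weights. The honest test is the RMSE on ratings that were used neither for training nor for fitting the blend.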
- 176. What are statisticians paid to do? Historically, the typical real-world statistical question was: Given a certain hypothesis... ...and a tiny dataset... ...so tiny that one cannot be sure about anything... ...prove that the hypothesis is correct!
- 180. The high-volume effect. With big datasets, the nature of the game is changing. “Big” can be millions, tens of millions (Netflix), billions (Advertisement Campaigns), trillions or even scarier amounts. It is no longer about checking a given hypothesis (forget about $\chi^2$). It is about handling huge dataflows and automatically building millions of models. Our concept-based intuition tends to underestimate the predictive power of simple algorithms on big datasets.
- 187. What is going on? New learning algorithms (e.g., semantic search of images). Hardware is cheap enough. Programming languages are pleasant enough. Parallel computing is easy enough (Hadoop, ...). Google-style problem-solving is no longer reserved for big corporations. This is changing the way science is done. Induction-Deduction-Transduction.
- 197. The Netflix dataset, beyond collaborative filtering. Like any big dataset, the Netflix dataset is a world in miniature. Its interest reaches beyond collaborative filtering. Basic metrics easily cluster movies by genre or director. Clear social, psychological, cultural significance. Play with the data! One example: ratings for certain movies are harder to predict, yet even this is meaningful: Napoleon Dynamite (see the New York Times article). Wes Anderson’s movies (do I even know if I like them?). What’s the minimal RMSE? Does this question make sense?
- 208. Surprises and questions. So far, the mathematics are trivial. The effectiveness of machine learning is very counter-intuitive to me. Beautiful concepts (cognitive dimension of the space of movies, concept of concept...). No one has a clue about the theoretical bounds. No one knows where the added value lies: Software quality? Fine-tuning of models? Intuition about the dataset and the problem? Global architecture of the solutions? Maybe not serious math problems. But serious problems for mathematical minds.
- 213. Beyond Theorems. Should we be satisfied with the fuzziness of the Millennium Prize Problems rules? Wasn’t Axiomatic Set Theory supposed to have solved the problem of objectivity in Mathematics? The Netflix Prize is strikingly objective, strikingly mathematical. Yet I cannot see any real theorem in the winners’ solution. This isn’t depressing, but very exciting!
- 214. Suggested readings. http://www.netflixprize.com/ Yehuda Koren, Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, Proceedings of KDD’08. Clive Thompson, If You Liked This, You’re Sure to Love That, The New York Times, November 21, 2008. Ian Ayres, Super Crunchers. Play with the data!
