What is Kaggle? Platform for predictive modelling and analytics competitions Company provides data and defines the modelling problem Participants build models on part of the data Predictions are evaluated on another part of the data
What is Kaggle? Public competitions Private competitions Kaggle In Class
My experience with Kaggle Public competitions: Deloitte/FIDE Chess Rating Challenge, Don't Overfit!, Observing Dark Worlds Private competition: Allstate Customer Retention Prediction
My experience with Kaggle Currently working on the Heritage Health Prize Predict which patients go to the hospital $3,000,000 grand prize $500,000 consolation prize
What is Bayes? No, that's not Rev. Thomas Bayes
What is Bayes? Simple recipe for reasoning under uncertainty: Quantify what you know before getting data: P(X) (“prior”) Build a model for your data: P(Y|X) (“model”) Apply Bayes' rule: P(X|Y) = P(Y|X)P(X)/P(Y) (“posterior”)
Monty Hall problem • Should you switch? • CONTROVERSY!
Monty Hall problem X is the number of the door with a car Prior P(X): All doors are equally likely to have the car P(door 1 has car) = 1/3 P(door 2 has car) = 1/3 P(door 3 has car) = 1/3
Monty Hall problem X is the number of the door with a car Y is the observation of the goat Model P(Y|X): Host knows which door has the goat Host never opens your chosen door Host always opens a door with a goat P(door 3 is opened | door 1 has car) = ½ P(door 3 is opened | door 2 has car) = 1 P(door 3 is opened | door 3 has car) = 0
Monty Hall problem Posterior P(X|Y): multiply: P(X)*P(Y|X), rescale by 1/P(Y) = 2 Highest is for door 2: (1/3 * 1)*2 = 2/3
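The multiply-and-rescale step can be checked with a few lines of Python. This is a minimal sketch, assuming you picked door 1 and the host opened door 3, as in the numbers on the slides; the function name `posterior` is only illustrative.

```python
from fractions import Fraction

# Prior P(X): the car is equally likely to be behind each door.
prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

# Model P(Y|X): probability that the host opens door 3, given that
# we chose door 1 and the car is behind door X.
likelihood = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

def posterior(prior, likelihood):
    """Apply Bayes' rule: multiply prior by likelihood, then rescale."""
    unnormalized = {x: prior[x] * likelihood[x] for x in prior}
    evidence = sum(unnormalized.values())  # P(Y), here 1/2
    return {x: p / evidence for x, p in unnormalized.items()}

post = posterior(prior, likelihood)
for door, p in post.items():
    print(f"P(door {door} has car | door 3 opened) = {p}")
```

Changing the `likelihood` table (for example, to a host who opens a door at random) changes the answer, which is the point of the next slide.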
Monty Hall problem Switching or not depends on your model! Bayesian Analysis makes this clear
Observing Dark Worlds competition Organized by University of Edinburgh Sponsored by Winton Capital 80% of mass in the universe is dark matter Dark: It does not emit or absorb light We see its effect through gravity Find location of dark matter based on the effects of its gravity
Observing Dark Worlds competition Posterior P(X|Y): Computation a bit more difficult We can get draws from P(X|Y) using MCMC Use samples (points) to approximate P(X|Y)
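As an illustration of what "draws from P(X|Y) using MCMC" can look like, here is a minimal random-walk Metropolis sketch. It is not the sampler used in the competition: the target `log_posterior` is a hypothetical stand-in (a Gaussian centred on (1.0, -2.0)), and the step size, sample count and burn-in length are arbitrary assumptions.

```python
import math
import random

def log_posterior(x, y):
    # Hypothetical stand-in for log P(X|Y) up to an additive constant:
    # an isotropic Gaussian centred on (1.0, -2.0).
    return -0.5 * ((x - 1.0) ** 2 + (y + 2.0) ** 2)

def metropolis(log_p, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 2-D target density."""
    rng = random.Random(seed)
    x, y = start
    current = log_p(x, y)
    samples = []
    for _ in range(n_samples):
        # Propose a Gaussian perturbation of the current point.
        px = x + rng.gauss(0.0, step)
        py = y + rng.gauss(0.0, step)
        proposed = log_p(px, py)
        # Accept with probability min(1, p(proposed) / p(current)).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            x, y, current = px, py, proposed
        samples.append((x, y))
    return samples

samples = metropolis(log_posterior, start=(0.0, 0.0), n_samples=20000)
kept = samples[2000:]  # discard the first draws as burn-in
mean_x = sum(s[0] for s in kept) / len(kept)
mean_y = sum(s[1] for s in kept) / len(kept)
print(f"posterior mean estimate: ({mean_x:.2f}, {mean_y:.2f})")
```

Only the unnormalized density is needed, because P(Y) cancels in the acceptance ratio.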
Observing Dark Worlds competition Minimize the distance between dark matter and our prediction Expected distance = average distance over samples from P(X|Y) Prediction: Choose the point that minimizes the expected distance
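The prediction rule above can be sketched as follows. This is a simplified illustration, not the competition code: the sample points are made up, and the search is restricted to the samples themselves as candidate predictions (an assumption made to keep the example short).

```python
import math

def expected_distance(point, samples):
    """Average Euclidean distance from `point` to the posterior samples."""
    return sum(math.dist(point, s) for s in samples) / len(samples)

def best_prediction(samples, candidates=None):
    """Return the candidate with the smallest expected distance.

    By default the samples themselves serve as candidates, which costs
    O(n^2) distance evaluations.
    """
    candidates = samples if candidates is None else candidates
    return min(candidates, key=lambda c: expected_distance(c, samples))

# Made-up posterior draws: a cluster near the origin plus one outlier.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
prediction = best_prediction(samples)
print(prediction)
```

Note that the result differs from the posterior mean, here (2.4, 2.4), which the outlier pulls away from the cluster. Minimizing expected distance is less sensitive to such draws.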
Observing Dark Worlds competition Sounds pretty smart? Half-way down the leaderboard!
Observing Dark Worlds competition Leaderboard only based on 30 cases Final score determined on 90 other cases
Observing Dark Worlds competition Great modelling competition Bayes dominated: runner-up used very similar method Academic paper summarizing the results is being written
Deloitte/FIDE chess rating challenge 10 years of chess match results 2 years withheld, these should be predicted A beats B, B beats C, what is the probability C will beat A? Sponsored by world chess federation FIDE and Deloitte Australia
Deloitte/FIDE chess rating challenge FIDE currently uses the Elo system Every player is assigned a skill Expected result is a function of the skill difference Points are awarded based on this skill difference
Deloitte/FIDE chess rating challenge FIDE currently uses the Elo system
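For reference, the standard Elo formulas can be sketched in a few lines. The logistic expected-score curve with a 400-point scale is the usual textbook form; the K-factor of 20 used here is an assumption for illustration, and FIDE's actual regulations vary it by player.

```python
def expected_score(rating_a, rating_b):
    """Expected result for player A (1 = win, 0.5 = draw, 0 = loss)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=20.0):
    """Return both ratings after one game; `k` is an assumed K-factor."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated players: the winner gains what the loser drops.
print(update(1500.0, 1500.0, score_a=1.0))
# A 400-point favourite is expected to score about 0.91 per game.
print(round(expected_score(1900.0, 1500.0), 2))
```

Each rating is a single number with no uncertainty attached, and only the current game enters the update.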
Deloitte/FIDE chess rating challenge Problems with the Elo system It's not Bayesian! This means uncertainty is not correctly incorporated It does not look back in time It does not properly discount past results There is also information in the pairings
Deloitte/FIDE chess rating challenge TrueSkill A Bayesian version of Elo Developed by Microsoft Used to rate Halo players
Deloitte/FIDE chess rating challenge My tweaked version of TrueSkill Prior P(X): Skill level distribution has the Gaussian bell shape
Deloitte/FIDE chess rating challenge My tweaked version of TrueSkill Model P(Y|X): - Basics the same as Elo - Discounts past results - Pairings are also part of Y
Deloitte/FIDE chess rating challenge My tweaked version of TrueSkill Posterior P(X|Y): - Bayes automatically makes us look back in time - Uncertainty is properly accounted for - Computation is very difficult!
Deloitte/FIDE chess rating challenge 1 week later Order is restored!
Deloitte/FIDE chess rating challenge 1 day later That didn't last long
Deloitte/FIDE chess rating challenge By this time I had to go to a conference in St. Louis…
Deloitte/FIDE chess rating challenge Last-ditch effort in the early morning before the conference… Back to first place!
Deloitte/FIDE chess rating challenge But of course the public leaderboard is no guarantee… Victory!
Deloitte/FIDE chess rating challenge It turns out I had beaten the inventors of TrueSkill, who invited me for an internship at Microsoft Research, Cambridge
Deloitte/FIDE chess rating challenge Met my rival Jason 'PlanetThanet' from the competition Jason went on to win many competitions, currently ranked no. 2 of all Kagglers Also led the Dark Worlds competition for a long time
Making connections through Kaggle These are just a few examples of the connections I have made through Kaggle Job offers Interesting people Consulting opportunities Invitations to talk to great people like you!
Conclusions Kaggle competitions are great fun Bayesian analysis provides a strong competitive edge Kaggle is a great way to market yourself and to make new connections
Questions? My blog: TimSalimans.com Algoritmica: Algoritmica.nl E-mail: firstname.lastname@example.org