# Rohan's Masters presentation



1. "What do you know?": a latent feature approach to the Kaggle GrockIt challenge. Rohan Anil, advised by Prof. Charles Elkan, in collaboration with Aditya Menon. UC San Diego, March 19, 2012.
2. Outline
   - Introduction: Kaggle.com, GrockIt, and the "What do you know?" challenge
   - Latent Feature Log-Linear (LFL) model
   - Ensemble learning
   - Our results
   - Q&A
3. Kaggle.com
4. "What do you know?" competition. 1st prize: $3,000; 2nd prize: $1,500; 3rd prize: $500.
5. GrockIt.com
6. Dataset
   - Training set: 4,851,476 outcomes of students answering various questions.
   - Outcomes are of four types: i) correct, ii) incorrect, iii) skipped, iv) timed-out.
   - The students are practicing for competitive exams: i) GMAT, ii) ACT, and iii) SAT.
7. Dataset
8. Dataset. Differences between the training set and the test set:
   - Bias: the test set is biased towards users who have answered more questions.
   - #Responses: only one response per student.
   - Temporal: test outcomes are later in time than that student's training and validation responses.
   - Outcomes: the test set distribution differs from the training set; it does not include timed-out or skipped outcomes.
9. Baseline: the Rasch model. A baseline was provided by Kaggle for the dataset.
10. In the Rasch baseline, β_s is the ability of student s and δ_q is the difficulty of question q; the probability of a correct answer is p(correct | s, q) = 1 / (1 + exp(−(β_s − δ_q))).
    - For a given student s (fixed β_s), the probability of answering a question correctly depends only on the difficulty of question q.
    - A consequence is that for every student, the ranking of questions by probability of a correct answer is the same.
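The Rasch baseline can be sketched in a few lines of numpy; the abilities and difficulties below are hypothetical values, not fitted parameters:

```python
import numpy as np

def rasch_prob(beta_s, delta_q):
    """Rasch model: probability that a student with ability beta_s
    answers a question of difficulty delta_q correctly."""
    return 1.0 / (1.0 + np.exp(-(beta_s - delta_q)))

# Two hypothetical students and three questions of increasing difficulty.
abilities = np.array([-0.5, 1.2])           # beta_s
difficulties = np.array([-1.0, 0.0, 2.0])   # delta_q

# Probability matrix: rows = students, columns = questions.
probs = rasch_prob(abilities[:, None], difficulties[None, :])

# For every student, the ranking of questions by p(correct) is identical:
# easier questions always get higher probability, regardless of ability.
for row in probs:
    assert list(np.argsort(-row)) == [0, 1, 2]
```

This makes the slide's point concrete: because the score is β_s − δ_q, changing β_s shifts all probabilities together and never reorders the questions.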
11. Dataset
    - Validation set: Grockit created a validation set containing responses of 80,075 students on different questions.
    - Test set: used for ranking the teams; it contains responses of 93,100 users on different questions.
12. Dyadic prediction. A dyadic prediction task is a learning task that involves predicting a class label for a pair of items (Hofmann et al., 1999).
13. Side-information. Sometimes there is more information in the dataset:
    1. side-information associated with u,
    2. side-information associated with i,
    3. interaction side-information for the pair (u, i).
14. Interpreting the task as a collaborative filtering problem. The dataset contains student responses for various questions: 179,107 students and 6,046 questions.
15. Skipped vs. timed-out outcomes (figure).
16. Nominal outcomes:
    - Correct
    - Incorrect
    - Timed-out
    - Skipped
17. Dyadic prediction. The training set is a collection of (student, question) pairs, each with an observed outcome.
18. Dyadic prediction. A query in the test set is a (student, question) pair whose outcome must be predicted.
19. Side-information in the dataset
    - Associated with a student: not available.
    - Associated with a question: question type, group, track, subtrack, tags.
    - Associated with a (student, question) dyad: game, number of players, started at, answered at, deactivated at, question set.
20. Side-information
    - Question type: Multiple Choice, Free Response.
    - Group: ACT, GMAT, SAT.
    - Subtrack: Critical Reasoning, Data Sufficiency, English, Identifying Sentence Errors, Improving Paragraphs, Improving Sentences, Math, Multiple Choice, Passage-Based Reading, Problem Solving, Reading, Reading Comprehension, Science, Sentence Completion, Sentence Correction, Student-Produced Response.
    - Tags: describe the skill needed to solve the question.
21. Dataset
22. Dataset
23. Dataset
24. The dataset is similar to a typical dyadic dataset, with a couple of key differences:
    - Duplicate dyads: duplicate dyad pairs can exist in the training set with different outcomes, since a student can answer a question many times.
    - Collaborative or competitive answering: in some game types, students can answer questions collaboratively.
25. Motivation for the latent feature approach. Latent feature methods were highly successful in winning the $1M Netflix Prize challenge (Töscher et al., 2009), where the problem was to predict ratings for movies.
26. Metric used to rank the teams: Binomial Capped Deviance (BCD), similar to the negative log-likelihood. Let p be the estimated probability of a correct response, capped to the range [0.01, 0.99], and y the true label of the dyad; the metric averages −[y log p + (1 − y) log(1 − p)] over the test dyads.
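A minimal implementation of the metric as described above; the exact averaging convention and log base used by Kaggle are assumptions here:

```python
import numpy as np

def binomial_capped_deviance(y_true, p_hat, cap=(0.01, 0.99)):
    """Mean negative log-likelihood with predictions capped to [0.01, 0.99],
    matching the Binomial Capped Deviance described on the slide."""
    p = np.clip(np.asarray(p_hat, dtype=float), cap[0], cap[1])
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 1.0, 0.5])   # the overconfident 1.0 is capped to 0.99
bcd = binomial_capped_deviance(y, p)
```

The cap is what keeps a single wildly overconfident wrong prediction from blowing up the score, which is why it matters for model calibration.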
27. Leaderboard
28. Motivations for the Latent Feature Log-Linear (LFL) model (Menon & Elkan, 2010):
    - Well-calibrated probabilities: we need to predict the probability of a correct outcome for the dyadic pairs in the test set.
    - Leverages side-information: most collaborative filtering algorithms have no principled way of including side-information.
    - Scales well: to be used in industry, the method has to scale to large datasets.
29. Multiclass LFL model
30. Multiclass LFL model. Case |Y| = 3: each class y has a latent vector U^y_user for the user and I^y_item for the item, and
    p(y = 3 | (user, item)) = exp(U^3_user · I^3_item) / Z, where
    Z = exp(U^1_user · I^1_item) + exp(U^2_user · I^2_item) + exp(U^3_user · I^3_item).
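The prediction equation on the slide is a softmax over per-class dot products; a small sketch (dimensions and random initialization are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_classes, k = 4, 5, 3, 2

# One latent matrix per class for users (U) and items (I), as on the slide.
U = rng.normal(scale=0.1, size=(n_classes, n_users, k))
I = rng.normal(scale=0.1, size=(n_classes, n_items, k))

def lfl_probs(user, item):
    """p(y | (user, item)) = softmax over the per-class scores U^y_user . I^y_item."""
    scores = np.array([U[y, user] @ I[y, item] for y in range(n_classes)])
    z = np.exp(scores - scores.max())   # subtract max for numerical stability
    return z / z.sum()

p = lfl_probs(user=1, item=3)
```

By construction the output is a proper probability distribution over the |Y| outcome classes, which is exactly the calibration property the competition metric rewards.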
31. Binary LFL on the dataset. The test set contains only two types of outcomes: i) correct and ii) incorrect, so y = 1 (correct response) and y = 0 (incorrect response). The binary LFL model has appeared in the literature before (Schein et al., 2003; Agarwal & Chen, 2009).
32. Training. We minimize the negative log-likelihood plus regularization terms. This objective function can be optimized with the stochastic gradient descent method.
33. Stochastic Gradient Descent
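A minimal sketch of SGD on the binary LFL objective; the learning rate, regularization strength, and synthetic data are illustrative assumptions, not the values used in the competition:

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_questions, k = 30, 20, 4
lam, lr = 0.01, 0.1   # regularization strength and learning rate (assumed)

S = rng.normal(scale=0.1, size=(n_students, k))   # student latent vectors
Q = rng.normal(scale=0.1, size=(n_questions, k))  # question latent vectors

# Synthetic (student, question, outcome) dyads drawn from a planted model.
true_S = rng.normal(size=(n_students, k))
true_Q = rng.normal(size=(n_questions, k))
data = []
for _ in range(2000):
    s, q = rng.integers(n_students), rng.integers(n_questions)
    p_true = 1.0 / (1.0 + np.exp(-(true_S[s] @ true_Q[q])))
    data.append((s, q, int(rng.random() < p_true)))

def nll():
    """Mean negative log-likelihood of the current factors on the data."""
    loss = 0.0
    for s, q, y in data:
        p = 1.0 / (1.0 + np.exp(-(S[s] @ Q[q])))
        loss -= y * np.log(p) + (1 - y) * np.log(1 - p)
    return loss / len(data)

before = nll()
for epoch in range(5):
    rng.shuffle(data)                       # visit dyads in random order
    for s, q, y in data:
        p = 1.0 / (1.0 + np.exp(-(S[s] @ Q[q])))
        g = p - y                           # d(log-loss)/d(score)
        S[s], Q[q] = (S[s] - lr * (g * Q[q] + lam * S[s]),
                      Q[q] - lr * (g * S[s] + lam * Q[q]))
after = nll()
```

Each observed dyad updates only the two latent vectors it touches, which is what makes the method cheap enough for millions of training outcomes.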
34. LFL on GrockIt
35. Stochastic Gradient Descent
36. Grid search over the hyperparameters
37. Parallel SGD training. The same scheme was formulated independently by Gemulla et al. (2011).
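The key idea in that formulation can be illustrated without any actual parallel machinery: partition students and questions into B parts, and in each sub-epoch process B blocks of the outcome matrix that share no rows or columns, so concurrent workers never update the same latent vectors. A toy schedule (B = 3 is an arbitrary choice):

```python
# Sketch of the block schedule behind distributed SGD (Gemulla et al., 2011):
# in sub-epoch `shift`, worker i processes block (i, (i + shift) % B).
B = 3

def strata(B):
    return [[(i, (i + shift) % B) for i in range(B)] for shift in range(B)]

schedule = strata(B)
for stratum in schedule:
    rows = [r for r, _ in stratum]
    cols = [c for _, c in stratum]
    # Blocks in one stratum are disjoint in both rows and columns,
    # so B workers can run SGD on them simultaneously without conflicts.
    assert len(set(rows)) == B and len(set(cols)) == B
```

After B sub-epochs every block has been visited once, so a full pass over the data is completed with no locking.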
38. KDD Cup, Spring 2011. This is us!!! =)
39. Parallelism
40. Side-information. For a question q, let g = group(q). We can add a latent vector for each group, i.e. ACT, GMAT, SAT; the prediction equation after adding side-information includes an extra term for the group's latent vector.
41. Categorical features: Group (G), Track (T), Subtrack (ST), Game Type (GT), Question Type (QT).
42. LFL Models
43. Training set. The training set contains four types of outcomes: i) correct, ii) incorrect, iii) skipped, and iv) timed-out. The test set contains only two: i) correct and ii) incorrect. We create two training sets:
    a) a training set with skipped and timed-out responses excluded;
    b) a training set with skipped and timed-out responses treated as incorrect outcomes.
44. Results from LFL models (a)
45. Results from LFL models (b)
46. Observation: throwing away data helps! Removing skipped and timed-out responses from the training set improved the BCD (binomial capped deviance). This motivates adapting the model to the test-set distribution in order to win the competition.
47. Ensemble learning. No single model works well on every dyad; combining predictions from multiple models can outperform each of the individual models (Takács et al., 2009). The $1M Netflix Prize was won by a blend of multiple models.
48. Intuition for ensemble learning. True labels for four samples: (1, 1, 0, 0). Predictions from four different models:
    - (0, 1, 0, 0) – accuracy 75%
    - (1, 0, 0, 0) – accuracy 75%
    - (1, 1, 1, 0) – accuracy 75%
    - (1, 1, 0, 1) – accuracy 75%
    Averaging the models gives (0.75, 0.75, 0.25, 0.25); thresholding the average at 0.5 gives (1, 1, 0, 0), i.e. accuracy 100%.
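The worked example above, verified in a few lines of numpy:

```python
import numpy as np

y_true = np.array([1, 1, 0, 0])
model_preds = np.array([
    [0, 1, 0, 0],   # each model on its own is 75% accurate
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
])

avg = model_preds.mean(axis=0)            # -> [0.75, 0.75, 0.25, 0.25]
ensemble = (avg > 0.5).astype(int)        # threshold at 0.5 -> [1, 1, 0, 0]

individual_acc = (model_preds == y_true).mean(axis=1)   # all 0.75
ensemble_acc = (ensemble == y_true).mean()              # 1.0
```

The trick works because each model errs on a different sample, so the errors are outvoted; if all models made the same mistakes, averaging would gain nothing.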
49. Using linear regression to combine predictions. For a set with known labels {(s, q) → y(s, q)}, where y is 0 or 1, let p_i = p_i(y = 1 | (s, q)) be the estimated probability of a correct response from the i-th model. Define a matrix P and a column vector Y, where each row of P contains the predictions of the n models, (p_1, …, p_i, …, p_n), and the corresponding entry of Y is the target value y(s, q). Using the predictions for every dyad in the set, we build P and Y and solve Pw = Y in the least-squares sense.
50. To predict the probability of a correct response for an example in the test set, we combine the predictions of the n models using the weight vector w: p_estimated = Σ_j w_j p_j.
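The blending step above maps directly onto numpy's least-squares solver; the three "models" here are just noisy copies of a synthetic truth, standing in for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
n_dyads, n_models = 500, 3

# Hypothetical stand-ins: a synthetic truth and three noisy model predictions
# with increasing noise levels (model 0 is the best, model 2 the worst).
p_true = rng.uniform(size=n_dyads)
Y = (rng.uniform(size=n_dyads) < p_true).astype(float)   # 0/1 labels
noise = rng.normal(scale=[[0.05, 0.15, 0.30]], size=(n_dyads, n_models))
P = np.clip(p_true[:, None] + noise, 0.0, 1.0)

# Solve P w = Y in the least-squares sense, as on the slide.
w, *_ = np.linalg.lstsq(P, Y, rcond=None)

# Blended prediction: p_estimated = sum_j w_j * p_j for every dyad.
p_est = P @ w
mse_blend = np.mean((p_est - Y) ** 2)
mse_each = np.mean((P - Y[:, None]) ** 2, axis=0)
```

On the dyads used for the fit, the blend can never be worse than the best single model, since the least-squares solution could always fall back to that model's weight vector.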
51. Which set to use?
    - Step 1: for each of the n models, train on the training set, predict on the validation set, and save the parameters.
    - Step 2: estimate w by linear regression on the validation-set predictions.
    - Step 3: for each of the n models, train on the training set plus the validation set and predict on the test set.
    - Step 4: combine the test-set predictions using w.
52. Results
53. After combining predictions using linear regression
54. Two weeks later
55. Some weeks later...
56. Gradient Boosted Decision Trees. To leverage side-information in ensemble learning, the Gradient Boosted Decision Trees (GBDT) algorithm (Friedman, 1999) can be used to combine predictions and side-information together. GBDT is a popular, powerful learning algorithm that is widely used (see Li & Xu, 2009, chap. 6). The core of the algorithm is a decision tree learner.
57. Decision tree. Decision trees can handle both i) numeric and ii) categorical variables, and can also handle missing information.
58. Decision tree (figure): each internal node applies a decision function; each leaf predicts the average of the training targets that reach it, e.g. (Y1 + Y3) / 2 or (Y6 + Y7 + Y9) / 3.
59. Gradient boosting. Select the base learner and the loss function: here, a decision tree as the base learner and squared loss as the loss function. Gradient boosting is an iterative procedure: at each iteration, fit a base learner to the gradient of the loss from the previous iteration.
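A minimal sketch of this procedure with squared loss (so the gradient to fit is just the residual) and depth-1 trees (stumps) as the base learner; the 1-D data and shrinkage value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

def fit_stump(x, r):
    """Base learner: a depth-1 tree minimizing squared loss on residual r."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q: np.where(q <= t, lv, rv)

pred = np.zeros_like(y)
nu = 0.3                      # shrinkage parameter (assumed value)
for _ in range(100):
    residual = y - pred       # negative gradient of squared loss
    stump = fit_stump(x, residual)
    pred += nu * stump(x)     # each learner's contribution is scaled by nu

mse_before = np.mean(y ** 2)          # prediction starts at zero
mse_after = np.mean((y - pred) ** 2)
```

Each round fits a weak learner to what the current ensemble still gets wrong, so training error drops steadily; the shrinkage factor nu is the regularization knob mentioned on the next slide.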
60. Gradient boosting. We can add a regularization (shrinkage) parameter that scales each base learner's contribution before it is added to the ensemble.
61. Side-information for GBDT
62. Meta-features
63. Preprocessing tags. Each question has a set of tags associated with it; some are listed below:
    - Statistics (incl. mean, median, mode) – 259
    - Strengthen Hypothesis – 260
    - Student Produced Response – 261
    - System of Linear Equations – 262
    - Systems of Linear Equations – 263
    - Systems of linear equations and inequalities – 264
    We manually merge the tags that we feel are very similar, then cluster the tags into 40 clusters using spectral clustering (Ng et al., 2001), with the normalized co-occurrence of tags as the similarity measure used to generate the affinity matrix A.
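For intuition, here is a tiny spectral-clustering sketch in the style of Ng et al. (2001) on a made-up tag co-occurrence matrix; with only two clusters, the sign of the second eigenvector of the normalized Laplacian gives the split (the real pipeline used 40 clusters and k-means on the leading eigenvectors):

```python
import numpy as np

# Toy co-occurrence counts between 6 hypothetical tags: two obvious groups
# (three math-like tags, three reading-like tags). Real tags and counts differ.
C = np.array([
    [0, 8, 7, 1, 0, 0],
    [8, 0, 9, 0, 1, 0],
    [7, 9, 0, 0, 0, 1],
    [1, 0, 0, 0, 6, 7],
    [0, 1, 0, 6, 0, 8],
    [0, 0, 1, 7, 8, 0],
], dtype=float)

# Symmetric normalized Laplacian: L = I - D^{-1/2} C D^{-1/2}.
d = C.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(6) - D_inv_sqrt @ C @ D_inv_sqrt

# The eigenvector of the second-smallest eigenvalue (the Fiedler vector)
# separates the two weakly connected groups by sign.
vals, vecs = np.linalg.eigh(L)
fiedler = vecs[:, np.argsort(vals)[1]]
clusters = (fiedler > 0).astype(int)
```

Tags that frequently co-occur end up in the same cluster, which is the point of the preprocessing: 40 cluster indicators are a far denser feature than hundreds of sparse raw tags.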
64. Results from GBDT: GBDT improved the BCD only marginally.
65. Including temporal features
66. ...
67. GBDT results after including temporal features
68. Feb 23, the final week of the competition
69. Last day. Combining the predictions from the GBDT models using linear regression improved the score slightly.
70. Last day of competition
71. Final private-set ranks
72. Post-competition analysis. The latent feature approach is a good fit for this dataset; LFL performs really well on it. Code will be available soon at http://code.google.com/p/latent-feature-log-linear/
73. Questions
74. References
    - Agarwal, Deepak and Chen, Bee-Chung. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pp. 19–28, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-495-9.
    - Friedman, Jerome H. Stochastic gradient boosting. Computational Statistics and Data Analysis, 38:367–378, 1999.
    - Gemulla, Rainer, Nijkamp, Erik, Haas, Peter J., and Sismanis, Yannis. Large-scale matrix factorization with distributed stochastic gradient descent. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0813-7.
    - Hofmann, Thomas, Puzicha, Jan, and Jordan, Michael I. Learning from dyadic data. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pp. 466–472, Cambridge, MA, USA, 1999. MIT Press. ISBN 0-262-11245-0.
    - Li, Xiaochun and Xu, Ronghui (eds.). High-Dimensional Data Analysis in Cancer Research. Springer, CA, USA, 2009.
    - Menon, Aditya Krishna and Elkan, Charles. A log-linear model with latent features for dyadic prediction. In ICDM '10, pp. 364–373, 2010.
    - Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856. MIT Press, 2001.
75. References (cont.)
    - Rasch, Georg. Estimation of parameters and control of the model for two response categories, 1960.
    - Schein, Andrew I., Saul, Lawrence K., and Ungar, Lyle H. A generalized linear model for principal component analysis of binary data, 2003.
    - Takács, Gábor, Pilászy, István, Németh, Bottyán, and Tikk, Domonkos. Scalable collaborative filtering approaches for large recommender systems. J. Mach. Learn. Res., 10:623–656, June 2009. ISSN 1532-4435.
    - Töscher, Andreas, Jahrer, Michael, and Bell, Robert M. The BigChaos solution to the Netflix Grand Prize, 2009.