Kddcup2011


  1. The Art of Lemon’s Solution
     KDD Cup 2011 Track 2
     Siwei Lai / Rui Diao / Liang Xiang
  2. 2. Outline Problem Introduction Data Analytics Content Item CF BSVD+ NBSVD+ Algorithms 11.2175% 3.8222% 3.5362% 3.8146%  Main Models  Model Ensemble Model Post Ensemble Process  Post Process 2.5033% 2.4808% Conclusion Future Work
  3. Problem Introduction
     • Two tracks; this deck covers Track 2
     • Classification problem
       • Positive samples: tracks the user voted higher than 80
       • Negative samples: popular tracks the user has not voted on
     • Data set
       • User voting data
       • Taxonomy data
     • Comments
       • Similar to the Top-N recommendation problem
       • Negative samples are used to prevent the Harry Potter problem
  4. 4. Data Analytics User vote data may be ordered by time.  Anchoring effect  Vote on artists and then vote on their tracks This is main reason why we got 2nd position http://justaguyinagarage.blogspot.com/2011/0 6/recommendation-system-competitions.html
  5. 5. Data Analytics If a user have voted on artist/album, she will have large probability to vote the tracks of the artist/album. 45% 58% Artist ⇒ Artist’s tracks 75% 75% Album ⇒ Album’s tracks 45% 56% Item ⇒ Items with the same Artist 51% 52% Item ⇒ Items with the same Album
  6. 6. Data Analytics User vote data may be ordered by time.  Anchoring effect  Vote on artists and then vote on their tracks If a user have voted on artist, she will have large probability to vote the tracks of the artist.
  7. Algorithm: Main Models
     • Content-based Model
     • Item-based Collaborative Filtering Model
     • Binary Latent Factor Model
     • Neighborhood-based Binary SVD Model
  8. Content-based Model
     If a user has voted on an artist/album, she is likely to vote on the tracks of that artist/album.
     • Version 1. The user will vote on a track if she has voted on an item by the same artist before. (Error rate ≈ 17%)
       P(u, i) = 1 if user u has voted on tracks with the same artist/album as track i
     • Version 2. Use the average score of the artist/album. (Error rate ≈ 11%)
       P(u, i) = average score user u assigned to the artist/album of track i, or to tracks with the same artist/album
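The two content-based versions can be sketched in a few lines of Python. This is a minimal illustration, not the team's code: the data layout (`user_votes` as an item-to-score dict, `taxonomy` as a track-to-(artist, album) dict) and the fallback score of 0.0 when nothing related was voted on are assumptions made for the example.

```python
from statistics import mean


def related_scores(user_votes, taxonomy, track):
    """Collect the user's scores on the track's artist/album or on tracks sharing them.

    user_votes: dict item_id -> score (hypothetical layout)
    taxonomy:   dict track_id -> (artist_id, album_id) (hypothetical layout)
    """
    artist, album = taxonomy[track]
    scores = []
    for item, score in user_votes.items():
        if item == track:
            continue  # never use the target track itself
        if item == artist or item == album:
            scores.append(score)  # direct vote on the artist or album
        elif item in taxonomy:
            other_artist, other_album = taxonomy[item]
            same_artist = artist is not None and other_artist == artist
            same_album = album is not None and other_album == album
            if same_artist or same_album:
                scores.append(score)  # vote on a sibling track
    return scores


def predict_v1(user_votes, taxonomy, track):
    """Version 1: 1.0 if any related item was voted on, else 0.0."""
    return 1.0 if related_scores(user_votes, taxonomy, track) else 0.0


def predict_v2(user_votes, taxonomy, track):
    """Version 2: average related score; 0.0 when nothing related exists (assumption)."""
    scores = related_scores(user_votes, taxonomy, track)
    return mean(scores) if scores else 0.0
```

Version 2 keeps the magnitude of the related votes, so a user who disliked an artist is ranked below one who liked it, which Version 1 cannot distinguish.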
  9. Item-based Collaborative Filtering
     • Jaccard Index
     • Error rate ≈ 9%
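As a rough sketch of the Jaccard baseline, the snippet below computes item-to-item similarity from the sets of users who voted on each item. The `knn_score` function is only one plausible way to turn similarities into a ranking score (sum of the k largest similarities); the deck's actual prediction formula is not preserved in this transcript, so treat the function, its name, and the default `k` as assumptions.

```python
from collections import defaultdict


def jaccard(users_a, users_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two user sets."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def item_user_sets(votes):
    """Map each item to the set of users who voted on it.

    votes: iterable of (user_id, item_id) pairs (hypothetical layout)
    """
    item_users = defaultdict(set)
    for user, item in votes:
        item_users[item].add(user)
    return item_users


def knn_score(user_items, candidate, item_users, k=2):
    """Score a candidate item by its k most similar items in the user's history."""
    sims = sorted(
        (jaccard(item_users[candidate], item_users[j])
         for j in user_items if j != candidate),
        reverse=True,
    )
    return sum(sims[:k])
```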
  10. Item-based Collaborative Filtering
      • Our Similarity
  11. Item-based Collaborative Filtering Model
      + Temporal information
      [Figure: example per-user vote lists (headers 141|8573, 862|1455, 2033|5396) with item and score columns, grouped into blocks labelled "items"]
  12. Item-based Collaborative Filtering
      + Vote information
      [Figure: example per-user vote lists (headers 141|8573, 862|1455, 2033|5396) with item and score columns]
  13. Item-based Collaborative Filtering
      • Prediction
  14. Item-based Collaborative Filtering
      + Removing popular bias
  15. Item-based Collaborative Filtering
      Factors                                Error rate (%)
      initial model (Jaccard Index + KNN)    8.9992
      + removing popular bias                5.2953
      + using temporal information           3.9283
      + using vote information               3.8222
      + using taxonomy information           3.6578
  16. Binary Latent Factor Model
      • Prediction
      • Error rate ≈ 6%
      • Sampling
        • Positive samples: items in the train data.
        • Negative samples: sampled in nearly the same way as the test data.
        • Each user has the same number of positive and negative samples.
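The sampling scheme can be illustrated with a short sketch. It assumes that "nearly the same as sampling test data" means drawing unvoted items with probability proportional to their popularity; that reading, the function name, and the data layout are assumptions for the example, not details confirmed by the slides.

```python
import random
from collections import Counter


def sample_negatives(user_items, popularity, rng):
    """Draw as many distinct negative items as the user has positives.

    user_items: set of items the user voted on (the positive samples)
    popularity: dict item -> positive vote count, used as sampling weight (assumption)
    rng:        random.Random instance, so results are reproducible
    """
    candidates = [item for item in popularity if item not in user_items]
    weights = [popularity[item] for item in candidates]
    n = min(len(user_items), len(candidates))
    chosen = []
    for _ in range(n):
        # Weighted draw without replacement: pick one index, then remove it.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return chosen


# Usage: build popularity counts from all users, then sample for one user.
votes = {"u1": {"a", "b"}, "u2": {"a", "c", "d"}, "u3": {"a", "e"}}
popularity = Counter(item for items in votes.values() for item in items)
negatives = sample_negatives(votes["u1"], popularity, random.Random(0))
```

Weighting by popularity keeps the model from learning "popular means positive", since popular items then appear on both sides of the training set.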
  17. Binary Latent Factor Model+
      • Prediction
      • Error rate ≈ 3.5%
  18. Neighborhood-based Binary SVD Model
      • Prediction
  19. Features used
      Features                  Content  Item CF  BSVD+  NBSVD+
      Collaborative filtering   ×        √        √      √
      Neighborhood info         ×        √        ×      √
      Ratings                   √        √        ○      ○
      Time ordering             ×        √        ×      ×
      Artist/album              √        ○        √      √
      Genre structure           ×        ×        ×      ×
  20. Model Ensemble
      • Local test set
      • Linear combination
      • Simulated annealing
      • 8-fold cross validation
      [Figure: the Train Set is split into a Local Train Set and a Local Test Set; the Test Set is kept separate]
      Model     Error rate (%)   Weight
      Content   11.2175          0.002
      Item CF   3.8222           0.438
      BSVD+     3.5362           0.006
      NBSVD+    3.8146           0.025
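A linear combination tuned by simulated annealing might look like the sketch below. It is an illustration under stated assumptions: the error measure (thresholding the blended score at 0.5), the Gaussian proposal step, the cooling schedule, and all function names are hypothetical choices, since the slides only name the techniques.

```python
import math
import random


def blend(weights, model_scores):
    """Linear combination of per-model score lists (one list per model)."""
    n = len(model_scores[0])
    return [sum(w * s[i] for w, s in zip(weights, model_scores)) for i in range(n)]


def error_rate(weights, model_scores, labels, threshold=0.5):
    """Fraction of misclassified samples; the fixed threshold is an assumption."""
    blended = blend(weights, model_scores)
    wrong = sum((b >= threshold) != bool(y) for b, y in zip(blended, labels))
    return wrong / len(labels)


def anneal_weights(model_scores, labels, steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Search non-negative blend weights that minimise error on a local test set."""
    rng = random.Random(seed)
    current = [1.0 / len(model_scores)] * len(model_scores)
    current_err = error_rate(current, model_scores, labels)
    best, best_err = current[:], current_err
    for _ in range(steps):
        candidate = current[:]
        k = rng.randrange(len(candidate))
        candidate[k] = max(0.0, candidate[k] + rng.gauss(0, 0.1))  # perturb one weight
        err = error_rate(candidate, model_scores, labels)
        # Always accept improvements; accept worse moves with a temperature-dependent chance.
        if err <= current_err or rng.random() < math.exp((current_err - err) / temp):
            current, current_err = candidate, err
            if err < best_err:
                best, best_err = candidate[:], err
        temp *= cooling
    return best, best_err
```

With 8-fold cross validation, the same search would be repeated on each local train/test split and the resulting weights compared or averaged.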
  21. Post Process
      • Some special features cannot be modeled well
      • Find special user-item pairs.
        • The most popular items.
        • The user voted high on the track’s album but low on its artist.
        • …
      • Multiply the prediction by a factor
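One way to picture the post-processing step is a rule that rescales the ensemble score for a special user-item pair. The sketch below handles only the album/artist case; the thresholds `high` and `low`, the value and direction of `factor`, and the data layout are all hypothetical, because the slides do not give them.

```python
def post_process(score, user_votes, taxonomy, track, high=80, low=30, factor=0.8):
    """Rescale a score when the user voted high on the album but low on the artist.

    user_votes: dict item_id -> score (hypothetical layout)
    taxonomy:   dict track_id -> (artist_id, album_id) (hypothetical layout)
    high, low, factor: illustrative values only, not taken from the slides
    """
    artist, album = taxonomy[track]
    album_vote = user_votes.get(album)
    artist_vote = user_votes.get(artist)
    if album_vote is None or artist_vote is None:
        return score  # rule does not apply without both votes
    if album_vote >= high and artist_vote <= low:
        return score * factor
    return score
```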
  22. Algorithms
      [Figure: Content (11.2175%, weight 0.002), Item CF (3.8222%, weight 0.483), BSVD+ (3.5362%, weight 0.006), NBSVD+ (3.8146%, weight 0.025), … → Model Ensemble (2.5033%) → Post Process (2.4808%)]
  23. Model Similarities
  24. Conclusion
      • Data analysis is very important
        • User behavior data is ordered by time
        • Artist/album data can improve accuracy a lot
      • The number of team members and the number of models are very important
      • Useful algorithms:
        • Content-based
        • Neighborhood-based
        • Matrix Factorization
  25. Future Work
      • How to add temporal information to the Binary SVD Model?
      • Apply Binary SVD in real production
        • How to explain recommendations
        • How to make real-time online recommendations
  26. Q&A
      Thanks!
      xlvector@gmail.com
