# Kddcup2011

### Kddcup2011

1. The Art of Lemon’s Solution: KDD Cup 2011 Track 2. Siwei Lai, Rui Diao, Liang Xiang
2. Outline
   - Problem Introduction
   - Data Analytics
   - Algorithms
     - Main Models (error rates): Content 11.2175%, Item CF 3.8222%, BSVD+ 3.5362%, NBSVD+ 3.8146%
     - Model Ensemble: 2.5033%
     - Post Process: 2.4808%
   - Conclusion
   - Future Work
3. Problem Introduction
   - Two tracks; Track 2 is a classification problem.
     - Positive samples: tracks the user scored higher than 80
     - Negative samples: popular tracks the user has not voted on
   - Data set
     - User voting data
     - Taxonomy data
   - Comments
     - Similar to the Top-N recommendation problem
     - Negative samples are used to prevent the Harry Potter problem
4. Data Analytics
   - User vote data may be ordered by time.
     - Anchoring effect
     - Users vote on artists and then vote on their tracks.
   - This is the main reason why we got 2nd position: http://justaguyinagarage.blogspot.com/2011/06/recommendation-system-competitions.html
5. Data Analytics: if a user has voted on an artist/album, she is very likely to vote on the tracks of that artist/album.
   - Artist ⇒ Artist’s tracks: 45%, 58%
   - Album ⇒ Album’s tracks: 75%, 75%
   - Item ⇒ Items with the same Artist: 45%, 56%
   - Item ⇒ Items with the same Album: 51%, 52%
6. Data Analytics
   - User vote data may be ordered by time.
     - Anchoring effect
     - Users vote on artists and then vote on their tracks.
   - If a user has voted on an artist, she is very likely to vote on the tracks of that artist.
7. Algorithm: Main Models
   - Content-based Model
   - Item-based Collaborative Filtering Model
   - Binary Latent Factor Model
   - Neighborhood-based Binary SVD Model
8. Content-based Model: if a user has voted on an artist/album, she is very likely to vote on the tracks of that artist/album.
   - Version 1 (error rate ≈ 17%): the user will vote on a track if she has voted on an item of the same artist before. P(u, i) = 1 if user u has voted on tracks with the same artist/album as track i.
   - Version 2 (error rate ≈ 11%): use the average score of the artist/album. P(u, i) = average score user u assigned to the artist/album of track i, or to tracks with the same artist/album.
9. Item-based Collaborative Filtering: Jaccard Index. Error rate ≈ 9%
10. Item-based Collaborative Filtering: Our Similarity
11. Item-based Collaborative Filtering Model + Temporal information. [Figure: excerpts of three users' vote lists, each shown as a column of item ID and score pairs.]
12. Item-based Collaborative Filtering + Vote information. [Figure: the same three users' vote lists, shown as item ID and score pairs.]
13. Item-based Collaborative Filtering: Prediction
14. Item-based Collaborative Filtering + Removing popular bias
15. Item-based Collaborative Filtering

    | Factors | Error Rate (%) |
    | --- | --- |
    | initial model (Jaccard Index + KNN) | 8.9992 |
    | + removing popular bias | 5.2953 |
    | + using temporal information | 3.9283 |
    | + using vote information | 3.8222 |
    | + using taxonomy information | 3.6578 |
16. Binary Latent Factor Model: prediction. Error rate ≈ 6%
    - Sampling
      - Positive samples: items in the train data.
      - Negative samples: drawn in nearly the same way as the test data was sampled.
      - Each user gets the same number of positive and negative samples.
17. Binary Latent Factor Model+: prediction. Error rate ≈ 3.5%
18. Neighborhood-based Binary SVD Model: prediction
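19. Features used

    | Features | Content | Item CF | BSVD+ | NBSVD+ |
    | --- | --- | --- | --- | --- |
    | Collaborative filtering | × | √ | √ | √ |
    | Neighborhood info | × | √ | × | √ |
    | Ratings | √ | √ | ○ | ○ |
    | Time ordering | × | √ | × | × |
    | Artist/album | √ | ○ | √ | √ |
    | Genre structure | × | × | × | × |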
20. Model Ensemble
    - Local test set: the train set is split into a local train set and a local test set
    - Linear combination
    - Simulated annealing
    - 8-fold cross validation

    | Model | Error Rate (%) | Weight |
    | --- | --- | --- |
    | Content | 11.2175 | 0.002 |
    | Item CF | 3.8222 | 0.438 |
    | BSVD+ | 3.5362 | 0.006 |
    | NBSVD+ | 3.8146 | 0.025 |
21. Post Process
    - Some special features cannot be modeled well.
    - Find special user-item pairs.
      - The most popular items.
      - The user votes high on the track’s album but low on its artist.
      - …
    - Multiply the prediction by a factor.
22. Algorithms
    - Models (error rate, ensemble weight): Content (11.2175%, 0.002), Item CF (3.8222%, 0.483), BSVD+ (3.5362%, 0.006), NBSVD+ (3.8146%, 0.025), …
    - Model Ensemble: 2.5033%
    - Post Process: 2.4808%
23. Model Similarities
24. Conclusion
    - Data analysis is very important.
      - User behavior data is ordered by time.
      - Artist/album data can improve accuracy a lot.
    - The number of team members and the number of models are very important.
    - Useful algorithms:
      - Content-based
      - Neighborhood-based
      - Matrix Factorization
25. Future Work
    - How can temporal information be added to the Binary SVD Model?
    - Apply Binary SVD in real production.
      - How to explain recommendations
      - How to make real-time online recommendations
26. Q&A. Thanks! xlvector@gmail.com
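
Version 2 of the content-based model (slide 8) scores a track by the average score the user gave to its artist/album or to tracks sharing them. A minimal sketch of that idea follows. The data layout is an assumption, not something the slides specify: `user_votes` maps item IDs to scores, `taxonomy` maps a track ID to an `(artist, album)` pair, and the fallback of 0.0 for users with no related votes is also hypothetical.

```python
def content_score(user_votes, taxonomy, track):
    """Average score the user gave to the track's artist/album,
    or to other tracks with the same artist/album (0.0 if none)."""
    artist, album = taxonomy[track]
    related = []
    for item, score in user_votes.items():
        if item == track:
            continue  # never use the target track's own vote
        # Artists and albums are not in the taxonomy map, so they get (None, None).
        item_artist, item_album = taxonomy.get(item, (None, None))
        same_artist = artist is not None and artist in (item, item_artist)
        same_album = album is not None and album in (item, item_album)
        if same_artist or same_album:
            related.append(score)
    return sum(related) / len(related) if related else 0.0


# Hypothetical toy data: three tracks, two artists.
taxonomy = {"t1": ("a1", "al1"), "t2": ("a1", "al2"), "t3": ("a2", "al3")}
votes = {"a1": 90, "t2": 70, "t3": 10}
score = content_score(votes, taxonomy, "t1")
```

Here the vote on artist `a1` and the vote on track `t2` (same artist) both count toward `t1`, while `t3` is ignored.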
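
The initial item-based model (slides 9 and 15) combines the Jaccard index with KNN. The prediction formula itself did not survive in this transcript, so the sketch below assumes one common choice: sum the `k` largest Jaccard similarities between the candidate track and the items the user has voted on. The names `item_users`, `user_items`, and `k` are illustrative.

```python
def jaccard(users_a, users_b):
    """Jaccard index of two sets of user IDs."""
    union = users_a | users_b
    return len(users_a & users_b) / len(union) if union else 0.0


def item_cf_score(user_items, item_users, candidate, k=10):
    """Assumed prediction rule: sum of the k highest similarities between
    the candidate and the items the user has already voted on."""
    candidate_users = item_users.get(candidate, set())
    sims = sorted(
        (jaccard(candidate_users, item_users[j])
         for j in user_items
         if j != candidate and j in item_users),
        reverse=True,
    )
    return sum(sims[:k])


# Hypothetical toy data: which users voted on each item.
item_users = {"x": {1, 2, 3}, "y": {2, 3, 4}, "z": {5}}
score = item_cf_score({"y", "z"}, item_users, "x")
```

The later refinements on slide 15 (popularity bias removal, temporal, vote, and taxonomy information) are not reproduced here because their formulas are not in the transcript.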
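
The sampling scheme for the Binary Latent Factor Model (slide 16) gives every user as many negative samples as positive ones, with negatives drawn in roughly the way the test set was built. A sketch under one assumption: since slide 3 describes negatives as popular tracks the user has not voted on, unvoted items are drawn here with probability proportional to a `popularity` count. That weighting, and all names below, are hypothetical.

```python
import random


def sample_training_pairs(user_items, popularity, rng):
    """Return (item, label) pairs: every voted item as a positive (1) and an
    equal number of distinct unvoted items, drawn by popularity, as negatives (0)."""
    positives = sorted(user_items)
    candidates = [i for i in sorted(popularity) if i not in user_items]
    negatives = []
    while candidates and len(negatives) < len(positives):
        weights = [popularity[i] for i in candidates]  # assumed positive counts
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        candidates.remove(pick)  # sample without replacement
        negatives.append(pick)
    return [(i, 1) for i in positives] + [(i, 0) for i in negatives]


# Hypothetical toy data: vote counts per item and one user's voted items.
popularity = {"a": 5, "b": 3, "c": 2, "d": 1, "e": 1}
pairs = sample_training_pairs({"a", "b"}, popularity, random.Random(0))
```

If a user has voted on most of the catalogue, fewer negatives than positives are returned; a real pipeline would need to decide how to handle that case.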
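
The ensemble on slide 20 is a linear combination of the models, listed alongside simulated annealing and a local test set. One reading is that the combination weights are tuned by simulated annealing against the local test set, which is sketched below. The error measure is an assumption, since the slides do not define it: the top-scoring half of the samples is predicted positive. The step size, temperature schedule, and function names are likewise hypothetical, and the 8-fold cross validation is left out.

```python
import math
import random


def blend(weights, preds):
    """Weighted sum of per-model prediction lists."""
    return [sum(w * p[n] for w, p in zip(weights, preds))
            for n in range(len(preds[0]))]


def error_rate(weights, preds, labels):
    """Assumed metric: the top half by blended score is predicted positive."""
    scores = blend(weights, preds)
    order = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
    top = set(order[:len(scores) // 2])
    wrong = sum((n in top) != bool(labels[n]) for n in range(len(labels)))
    return wrong / len(labels)


def anneal_weights(preds, labels, steps=2000, temp=1.0, cooling=0.995, seed=0):
    """Simulated annealing over non-negative blend weights."""
    rng = random.Random(seed)
    current = [1.0 / len(preds)] * len(preds)
    current_err = error_rate(current, preds, labels)
    best, best_err = list(current), current_err
    for _ in range(steps):
        candidate = [max(0.0, w + rng.gauss(0.0, 0.05)) for w in current]
        err = error_rate(candidate, preds, labels)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if err <= current_err or rng.random() < math.exp((current_err - err) / temp):
            current, current_err = candidate, err
            if err < best_err:
                best, best_err = list(candidate), err
        temp = max(temp * cooling, 1e-9)
    return best, best_err


# Hypothetical local test set: two models, four samples.
preds = [[0.9, 0.8, 0.2, 0.1], [0.1, 0.9, 0.8, 0.2]]
labels = [1, 1, 0, 0]
weights, err = anneal_weights(preds, labels)
```

Because the error is piecewise constant in the weights, gradient methods do not apply directly, which is one plausible reason to use a stochastic search such as simulated annealing here.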