Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling

683 views

Published on

This slides describes our solution for the RecSys Challenge 2016. In the challenge, several datasets were provided from a social network for business XING. The goal of the competition was to use these data to predict job postings that a user will interact positively with (click, bookmark or reply). Our solution to this problem includes three different types of models: Factorization Machine, item-based collaborative filtering, and content-based topic model on tags. Thus, we combined collaborative and content-based approaches in our solution.
Our best submission, which was a blend of ten models, achieved 7th place in the challenge's final leaderboard with a score of 1677898.52. The approaches presented in this paper are general and scalable. Therefore they can be applied to another problem of this type.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling

  1. 1. © AvitoBasic elements guidelines. FILES: Avito-LOGO_RGB.eps, Avito-LOGO_CMYK_Pa RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling 7th place solution Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016
  2. 2. Basic elements guidelines. Problem statement Data description ∙ Impressions — details about which items (job postings) were shown to which user by the existing recommender (19 August 2015 — 9 November 2015). ∙ Interactions — interactions that the user performed on the items (clicked, bookmarked, replied or deleted). ∙ Users — users details: job roles, career level, discipline, industry, location, experience, and education. ∙ Items — items details: title, career level, discipline, industry, location, employment type, tags, created time and flag if item was active during the test. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 2 / 23
  3. 3. Basic elements guidelines. Problem statement Data description: impressions and interactions Date interval: 2015-08-19 – 2015-11-09 Impressions ∙ 201M unique user-item-week tuples ∙ 2.7M unique users ∙ 846K unique items Interactions ∙ 8.8M events: clicked – 7.2M, deleted – 1.0M, replied – 422K, bookmarked – 206K ∙ 785K unique users ∙ 1.03M unique items ∙ 2.8M из 6.9M (user-item) pairs are in impressions Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 3 / 23
  4. 4. Basic elements guidelines. Problem statement Data description: target users and items 150K users for making recommendations, from which: ∙ 39.7К (26.5%) have no events ∙ 59.5K (39.6%) have less than 2 events ∙ 70.6K (47.1%) have less than 3 events 327К active items, from which: ∙ 129К (39.5%) have no events ∙ 164K (50.1%) have less than 2 events ∙ 188K (57.6%) have less then 3 events Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 4 / 23
  5. 5. Basic elements guidelines. Problem statement Task of the challenge score(R, ˆR) = ∑︁ u∈U 20(P2(ru, ˆru) + P4(ru, ˆru) + R30(ru, ˆru)+ +S30(ru, ˆru)) + 10(P6(ru, ˆru) + P20(ru, ˆru)), where U = {0, . . . , N − 1} – list of target users, R = {ru}u∈U – lists of relevant items, ˆR = {ˆru}u∈U – the solution, Pk(ru, ˆru) – precision at top k for user u, R30(ru, ˆru) – recall at top 30, S30(ru, ˆru) – user success. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 5 / 23
  6. 6. Basic elements guidelines. Problem statement Models validation ∙ The last week of interactions ∙ 10 000 random users from those who made any interactions during this week ∙ Old items (created more than a month ago) without interactions were removed ∙ Obtained score was highly correlated with the result on the Public Leaderboard Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 6 / 23
  7. 7. Basic elements guidelines. Solution of the team Interesting insights from data ∙ A significant proportion of users and items have a small number of events or have no events. It means that we need to use a hybrid approach that takes into account not only collaborative filtering but the content data of items and users. ∙ Impressions slowly change over time. That is, the presence of a pair of user-item in impressions is a useful feature, and we use it as the separate model. ∙ Geographical features (distance, region, city, geoclusters etc.) are not improve score significantly. ∙ Tokens from user profiles and items are good features. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 7 / 23
  8. 8. Basic elements guidelines. Solution of the team Interesting insights from data User profile and sessions exampleCV experience_n_experiencediscipline_id_user country_useregion_user jobroles 5 or more entr 10‐15 yearsSales & Commerce Germany Bavaria ['962959', '283291', '502342'] SESSIONS created_at impressiondiscipline_id_item country_itemregion_item title tags 09‐01 1:27 1 Production & ManufactGermany Baden‐Württem['620383', '1118975'] ['102823', '1335184', '624061', '73234', '1604815', '2862074'  09‐01 1:27 1 Other Disciplines Germany Hamburg ['4572761', '3543754', '196892['993979', '2426818', '792504', '4425481', '494116', '976257'  09‐01 1:27 1 Other Disciplines Germany Berlin ['18091'] ['4198994', '4182900', '4354582', '1399193', '1377742', '3580  09‐02 0:21 1 Health, Medical & Socianon_dach not specified ['165415', '1986087', '2585795['2426818', '3726822', '792504', '1830721', '184797', '325622  09‐04 20:46 0 IT & Software DevelopmGermany Brandenburg ['655030'] ['1491612', '972718', '2426818', '2383555', '4483314', '43216  09‐08 20:16 1 Other Disciplines Germany Hamburg ['2915824', '4035399', '156769['2110329', '503870', '2426818', '1437930', '2245760', '35922  09‐08 22:40 1 Sales & Commerce Germany not specified ['3418410', '3413328'] ['686709', '2036672', '3794933', '502342', '3413328', '117856  09‐08 22:41 1 IT & Software Developmnon_dach not specified ['3408137'] ['2632767', '1491612', '2245760', '689679', '1565617', '43216  09‐09 22:58 1 Administration Germany Berlin ['4141254', '1118975'] ['4162864', '1491612', '1565617', '689679', '4204056', '15454  09‐09 23:00 0 Production & ManufactGermany Lower Saxony ['4454260', '502342'] ['543177', '4160943', '2501578', '4329775', '3085937', '23421  09‐09 23:01 0 Other Disciplines Germany Lower Saxony ['1567693', '568776'] ['1178568', '1248479', '370640', '2342166', '94890', '3794933  09‐09 23:02 1 Health, Medical & Socianon_dach not specified ['1567693'] ['1178568', '1565617', '1491612', '1601282', '2380081', '4941  09‐09 23:08 0 Finance, Accounting & CAustria not specified ['2865345', '3294368'] ['3391339', '3176219', '4499767', '2426818', '798840', '37295  09‐09 23:08 0 Production & ManufactGermany Lower Saxony ['494116'] ['3176219', '159096', '3391339', '4499767', '494116', '372957  09‐09 23:08 0 Health, Medical & SociaSwitzerland not specified ['128836', '1836819'] ['1798728', '128836', '675557', '2976021']   09‐09 23:09 0 Production & ManufactAustria not specified ['2846960', '76751', '4227194',['2846960', '2632767', '3872048', '3939477', '1469275', '1695  09‐09 23:09 0 Engineering & TechnicaGermany Berlin ['4141254'] ['749243', '362736', '692505', '3669898', '624061', '494116']   09‐10 0:59 1 Production & ManufactAustria not specified ['1118975', '3478136'] ['502342', '4151211', '4439048', '3210328', '624061', '313759  09‐10 0:59 1 Engineering & TechnicaGermany Bavaria ['128836', '76887'] ['816406', '3347566', '502342', '4425481', '4160943', '160481  09‐10 0:59 1 Engineering & TechnicaGermany North Rhine‐W ['1119117', '3705605', '347813['2896178', '1357922', '2031982', '1491612', '1830721', '1335  09‐10 1:00 0 Other Disciplines Austria not specified ['2915824', '4035399', '156769['1565617', '1496767', '82994', '1625244', '1941434', '123188  09‐10 1:00 1 Teaching, R&D Germany North Rhine‐W ['1986087'] ['3144475', '4245173', '3096790', '655817', '2969837', '43216  09‐10 1:00 0 Other Disciplines Germany not specified ['2573697', '4035399', '448435['3457262', '3658040', '2126708', '2110329', '2630003', '4017  09‐10 1:00 0 Engineering & TechnicaGermany Baden‐Württem['2140778', '3241763'] ['1734724', '2000691', '4425481', '2111897', '577140', '94890  09‐10 1:00 1 Management & Corpor Germany Bavaria ['494116', '1119117', '2387379['4245173', '1231885', '272304', '4140111', '4321623', '18307  Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 8 / 23
  9. 9. Basic elements guidelines. Solution of the team Item-based collaborative filtering Similarity metrics: ∙ Jaccard ∙ Cosine ∙ Pearson Event types for training: ∙ All Positive interactions ∙ Only Click interactions ∙ Impressions Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 9 / 23
  10. 10. Basic elements guidelines. Solution of the team Factorization Machines Predicted score for user i on item j is given by: p(i,j) = 𝜇 + wi + wj + aT xi + bT yj + uT i vj , where 𝜇 – a global bias term, wi and wj are weight terms for user i and item j respectively, xi and yj are the user and item side feature vectors, a and b are the weight vectors for those side features, ui and vj – latent factors, which are vectors of fixed length (number of factors is a parameter). Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 10 / 23
  11. 11. Basic elements guidelines. Solution of the team Factorization Machines: main parameters ∙ Number of latent factors (30 – 400) ∙ Number of sampled negative examples (1 – 12) ∙ Maximum number of iterations (25 – 70) ∙ Regularization parameters (1e-9 – 1e-7) ∙ User and item side features Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 11 / 23
  12. 12. Basic elements guidelines. Solution of the team Factorization Machines: side features Users - all features (OneHotEncoder) ∙ jobroles ∙ career_level, discipline_id, industry_id ∙ country, region ∙ experience: n_entries_class, years, years_in_current ∙ edu: degree, field_of_studies Items - all features, except latitude and longitude (OneHotEncoder) ∙ title, tags ∙ career_level, discipline_id, industry_id ∙ country, region ∙ employment_type Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 12 / 23
  13. 13. Basic elements guidelines. Solution of the team Topic model: Latent Semantic Indexing (LSI) ∙ Let document associated with each user be all title and tags tokens of items, which the user interacted with and job roles tokens from user description. ∙ Convert each document into a token occurrences vector. ∙ Transform values in each vector to TF-IDF statistics and combine all vectors into a large token-document matrix. ∙ Then we apply Singular Value Decomposition (SVD) technique on the token-document matrix ∙ The similarity between user and item will be the similarity between corresponding latent vectors. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 13 / 23
  14. 14. Basic elements guidelines. Solution of the team Solution framework Initial dataset Item­based models FM models Topic model Blending Output Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 14 / 23
  15. 15. Basic elements guidelines. Solution of the team Linear Ensemble Base models: FR0 SIM0 PI Local Score 1 0 0 76995 0 1 0 69622 0 0 1 104495 1 1 1 132505 ∙ SIM0 – Item-based Recommender (jaccard similarity) ∙ FR0 – Factorization Machines Recommender (400 factors) ∙ PI – Past Impressions Recommender (very simple model with binary output) ∙ Local Score: 10 000 random users who made interactions with items during last week ∙ The score on the Public Leaderboard ≈ 3.8 × the score on our Local Validation Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 15 / 23
  16. 16. Basic elements guidelines. Solution of the team Linear Ensemble Version «zero»: FR0 SIM0 PI Local Score 1 2 1 134285 The first version: FR0 SIM0 FR8.0 0 * SIM0 PI Local Score 1 13 8 1 138073 The second version: FR1 SIM0 FR8.0 1 * SIM0 PI Local Score 1 13 8 1 140876 FR1 = FR_f 100_i25 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 16 / 23
  17. 17. Basic elements guidelines. Solution of the team Linear Ensemble The third version: FR2 SIM2 FR8.0 2 * SIM2 PI Local Score 1 13 8 1 143653 FR2 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70 SIM2 = 0.5*SIM_jac + 0.5*SIM_click Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 17 / 23
  18. 18. Basic elements guidelines. Solution of the team Linear Ensemble Local Score (145841): 1.0*FR3 + 15.0 * (FR8.0 3 * SIM3) + 13.0 * SIM3 + 1.0 * PI - 0.5 * SIM_pearson - 0.3 * FR_f 400_i50_no_side + 0.5 * (FR_imp2.0 * SIM_imp), where FR3 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70 SIM3 = SIM_click Local Score (146569): 1.0*FR4 + 15.0 * (FR8.0 4 * SIM4) + 13.0 * SIM4 + 1.0 * PI - 0.4 * SIM_pearson - 0.3 * FR_f 400_i50_no_side + 0.5 * (FR_imp2.0 * SIM_imp) + 0.2 * TM, where FR4 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70 SIM4 = 0.4*SIM_jac + 0.6*SIM_t = 1 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 18 / 23
  19. 19. Basic elements guidelines. Solution of the team Final models set SIM_jac Item-based jaccard similarity SIM_click Item-based jaccard similarity on clicks SIM_pearson Item-based pearson similarity SIM_imp Item-based jaccard similarity on impressions FR_f100_i25 Factorization, n_factors=100, iter=25 FR_f400_i70 Factorization, n_factors=400, iter=70 FR_f400_i50_no_side Factorization, no side data FR_imp Factorization on impressions TM LSI topic model PI Past Impressions Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 19 / 23
  20. 20. Basic elements guidelines. Solution of the team Hardware&Software ∙ 1 server: 28 cores, 56 threads, 256Gb RAM ∙ Full training + prediction = 16 hours ∙ All code was written in Python ∙ ML Libraries: graphlab, gensim Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 20 / 23
  21. 21. Basic elements guidelines. Leaderboard Submission history Score Rank Name Date 554655 9 Topic model 100 factors 06/27/16 548366 8 Top 150 candidates from every model 06/25/16 543284 8 8 model set: 4FR + 4SIM + TM 06/24/16 537157 9 Topic model 06/23/16 530599 10 3 models set: FR + 2SIM 06/23/16 497136 15 3 models set: FR + 2SIM 06/22/16 496241 1 FR with side data 03/20/16 397604 1 Past Impressions model 03/11/16 132790 1 Simple item-based recommender 03/10/16 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 21 / 23
  22. 22. Basic elements guidelines. Leaderboard Results Rank Team Leaderboard Score Full Score 1 YunOS-OneSearch 681707.38 2052185.54 2 mim-solutions 675985.03 2035964.16 3 DaveXster 665592.06 2005263.73 4 PumpkinPie 622408.55 1866477.77 5 milk tea 613125.21 1846420.12 6 mdr_rec 605048.58 1823472.31 7 Avito 554654.72 1677898.52 8 recometric 556133.18 1677233.84 9 nodalpoints 555483.39 1671812.08 10 lucky_dog 542213.51 1632828.82 21 XING_TELECOM 461000.32 1397030.74 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 22 / 23
  23. 23. Thank you! Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 23 / 23

×