Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling

© AvitoBasic elements guidelines.
FILES: Avito-LOGO_RGB.eps, Avito-LOGO_CMYK_Pa
RecSys Challenge 2016: Job Recommendation
Based on Factorization Machines and Topic
Modelling
7th place solution
Vasily Leksin, Andrey Ostapets
Avito.ru
15-09-2016
Basic elements guidelines.
Problem statement
Data description
∙ Impressions — details about which items (job postings) were
shown to which user by the existing recommender (19 August
2015 — 9 November 2015).
∙ Interactions — interactions that the user performed on the
items (clicked, bookmarked, replied or deleted).
∙ Users — users details: job roles, career level, discipline,
industry, location, experience, and education.
∙ Items — items details: title, career level, discipline, industry,
location, employment type, tags, created time and flag if item
was active during the test.
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 2 / 23
Basic elements guidelines.
Problem statement
Data description: impressions and interactions
Date interval: 2015-08-19 – 2015-11-09
Impressions
∙ 201M unique user-item-week tuples
∙ 2.7M unique users
∙ 846K unique items
Interactions
∙ 8.8M events: clicked – 7.2M, deleted – 1.0M, replied – 422K,
bookmarked – 206K
∙ 785K unique users
∙ 1.03M unique items
∙ 2.8M из 6.9M (user-item) pairs are in impressions
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 3 / 23
Basic elements guidelines.
Problem statement
Data description: target users and items
150K users for making recommendations, from which:
∙ 39.7К (26.5%) have no events
∙ 59.5K (39.6%) have less than 2 events
∙ 70.6K (47.1%) have less than 3 events
327К active items, from which:
∙ 129К (39.5%) have no events
∙ 164K (50.1%) have less than 2 events
∙ 188K (57.6%) have less then 3 events
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 4 / 23
Basic elements guidelines.
Problem statement
Task of the challenge
score(R, ˆR) =
∑︁
u∈U
20(P2(ru, ˆru) + P4(ru, ˆru) + R30(ru, ˆru)+
+S30(ru, ˆru)) + 10(P6(ru, ˆru) + P20(ru, ˆru)),
where
U = {0, . . . , N − 1} – list of target users,
R = {ru}u∈U – lists of relevant items,
ˆR = {ˆru}u∈U – the solution,
Pk(ru, ˆru) – precision at top k for user u,
R30(ru, ˆru) – recall at top 30,
S30(ru, ˆru) – user success.
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 5 / 23
Basic elements guidelines.
Problem statement
Models validation
∙ The last week of interactions
∙ 10 000 random users from those who made any interactions
during this week
∙ Old items (created more than a month ago) without
interactions were removed
∙ Obtained score was highly correlated with the result on the
Public Leaderboard
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 6 / 23
Basic elements guidelines.
Solution of the team
Interesting insights from data
∙ A significant proportion of users and items have a small
number of events or have no events. It means that we need to
use a hybrid approach that takes into account not only
collaborative filtering but the content data of items and users.
∙ Impressions slowly change over time. That is, the presence of a
pair of user-item in impressions is a useful feature, and we use
it as the separate model.
∙ Geographical features (distance, region, city, geoclusters etc.)
are not improve score significantly.
∙ Tokens from user profiles and items are good features.
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 7 / 23
Basic elements guidelines.
Solution of the team
Interesting insights from data
User profile and sessions exampleCV
experience_n_experiencediscipline_id_user country_useregion_user jobroles
5 or more entr 10‐15 yearsSales & Commerce Germany Bavaria ['962959', '283291', '502342']
SESSIONS
created_at impressiondiscipline_id_item country_itemregion_item title tags
09‐01 1:27 1 Production & ManufactGermany Baden‐Württem['620383', '1118975'] ['102823', '1335184', '624061', '73234', '1604815', '2862074' 
09‐01 1:27 1 Other Disciplines Germany Hamburg ['4572761', '3543754', '196892['993979', '2426818', '792504', '4425481', '494116', '976257' 
09‐01 1:27 1 Other Disciplines Germany Berlin ['18091'] ['4198994', '4182900', '4354582', '1399193', '1377742', '3580 
09‐02 0:21 1 Health, Medical & Socianon_dach not specified ['165415', '1986087', '2585795['2426818', '3726822', '792504', '1830721', '184797', '325622 
09‐04 20:46 0 IT & Software DevelopmGermany Brandenburg ['655030'] ['1491612', '972718', '2426818', '2383555', '4483314', '43216 
09‐08 20:16 1 Other Disciplines Germany Hamburg ['2915824', '4035399', '156769['2110329', '503870', '2426818', '1437930', '2245760', '35922 
09‐08 22:40 1 Sales & Commerce Germany not specified ['3418410', '3413328'] ['686709', '2036672', '3794933', '502342', '3413328', '117856 
09‐08 22:41 1 IT & Software Developmnon_dach not specified ['3408137'] ['2632767', '1491612', '2245760', '689679', '1565617', '43216 
09‐09 22:58 1 Administration Germany Berlin ['4141254', '1118975'] ['4162864', '1491612', '1565617', '689679', '4204056', '15454 
09‐09 23:00 0 Production & ManufactGermany Lower Saxony ['4454260', '502342'] ['543177', '4160943', '2501578', '4329775', '3085937', '23421 
09‐09 23:01 0 Other Disciplines Germany Lower Saxony ['1567693', '568776'] ['1178568', '1248479', '370640', '2342166', '94890', '3794933 
09‐09 23:02 1 Health, Medical & Socianon_dach not specified ['1567693'] ['1178568', '1565617', '1491612', '1601282', '2380081', '4941 
09‐09 23:08 0 Finance, Accounting & CAustria not specified ['2865345', '3294368'] ['3391339', '3176219', '4499767', '2426818', '798840', '37295 
09‐09 23:08 0 Production & ManufactGermany Lower Saxony ['494116'] ['3176219', '159096', '3391339', '4499767', '494116', '372957 
09‐09 23:08 0 Health, Medical & SociaSwitzerland not specified ['128836', '1836819'] ['1798728', '128836', '675557', '2976021']  
09‐09 23:09 0 Production & ManufactAustria not specified ['2846960', '76751', '4227194',['2846960', '2632767', '3872048', '3939477', '1469275', '1695 
09‐09 23:09 0 Engineering & TechnicaGermany Berlin ['4141254'] ['749243', '362736', '692505', '3669898', '624061', '494116']  
09‐10 0:59 1 Production & ManufactAustria not specified ['1118975', '3478136'] ['502342', '4151211', '4439048', '3210328', '624061', '313759 
09‐10 0:59 1 Engineering & TechnicaGermany Bavaria ['128836', '76887'] ['816406', '3347566', '502342', '4425481', '4160943', '160481 
09‐10 0:59 1 Engineering & TechnicaGermany North Rhine‐W ['1119117', '3705605', '347813['2896178', '1357922', '2031982', '1491612', '1830721', '1335 
09‐10 1:00 0 Other Disciplines Austria not specified ['2915824', '4035399', '156769['1565617', '1496767', '82994', '1625244', '1941434', '123188 
09‐10 1:00 1 Teaching, R&D Germany North Rhine‐W ['1986087'] ['3144475', '4245173', '3096790', '655817', '2969837', '43216 
09‐10 1:00 0 Other Disciplines Germany not specified ['2573697', '4035399', '448435['3457262', '3658040', '2126708', '2110329', '2630003', '4017 
09‐10 1:00 0 Engineering & TechnicaGermany Baden‐Württem['2140778', '3241763'] ['1734724', '2000691', '4425481', '2111897', '577140', '94890 
09‐10 1:00 1 Management & Corpor Germany Bavaria ['494116', '1119117', '2387379['4245173', '1231885', '272304', '4140111', '4321623', '18307 
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 8 / 23
Basic elements guidelines.
Solution of the team
Item-based collaborative filtering
Similarity metrics:
∙ Jaccard
∙ Cosine
∙ Pearson
Event types for training:
∙ All Positive interactions
∙ Only Click interactions
∙ Impressions
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 9 / 23
Basic elements guidelines.
Solution of the team
Factorization Machines
Predicted score for user i on item j is given by:
p(i,j) = 𝜇 + wi + wj + aT
xi + bT
yj + uT
i vj ,
where
𝜇 – a global bias term,
wi and wj are weight terms for user i and item j respectively,
xi and yj are the user and item side feature vectors,
a and b are the weight vectors for those side features,
ui and vj – latent factors, which are vectors of fixed length
(number of factors is a parameter).
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 10 / 23
Basic elements guidelines.
Solution of the team
Factorization Machines: main parameters
∙ Number of latent factors (30 – 400)
∙ Number of sampled negative examples (1 – 12)
∙ Maximum number of iterations (25 – 70)
∙ Regularization parameters (1e-9 – 1e-7)
∙ User and item side features
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 11 / 23
Basic elements guidelines.
Solution of the team
Factorization Machines: side features
Users - all features (OneHotEncoder)
∙ jobroles
∙ career_level, discipline_id, industry_id
∙ country, region
∙ experience: n_entries_class, years, years_in_current
∙ edu: degree, field_of_studies
Items - all features, except latitude and longitude
(OneHotEncoder)
∙ title, tags
∙ career_level, discipline_id, industry_id
∙ country, region
∙ employment_type
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 12 / 23
Basic elements guidelines.
Solution of the team
Topic model: Latent Semantic Indexing (LSI)
∙ Let document associated with each user be all title and tags
tokens of items, which the user interacted with and job roles
tokens from user description.
∙ Convert each document into a token occurrences vector.
∙ Transform values in each vector to TF-IDF statistics and
combine all vectors into a large token-document matrix.
∙ Then we apply Singular Value Decomposition (SVD) technique
on the token-document matrix
∙ The similarity between user and item will be the similarity
between corresponding latent vectors.
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 13 / 23
Basic elements guidelines.
Solution of the team
Solution framework
Initial dataset
Item­based
models
FM models Topic model
Blending
Output
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 14 / 23
Basic elements guidelines.
Solution of the team
Linear Ensemble
Base models:
FR0 SIM0 PI Local Score
1 0 0 76995
0 1 0 69622
0 0 1 104495
1 1 1 132505
∙ SIM0 – Item-based Recommender (jaccard similarity)
∙ FR0 – Factorization Machines Recommender (400 factors)
∙ PI – Past Impressions Recommender (very simple model with
binary output)
∙ Local Score: 10 000 random users who made interactions with
items during last week
∙ The score on the Public Leaderboard ≈ 3.8 × the score on our
Local Validation
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 15 / 23
Basic elements guidelines.
Solution of the team
Linear Ensemble
Version «zero»:
FR0 SIM0 PI Local Score
1 2 1 134285
The first version:
FR0 SIM0 FR8.0
0 * SIM0 PI Local Score
1 13 8 1 138073
The second version:
FR1 SIM0 FR8.0
1 * SIM0 PI Local Score
1 13 8 1 140876
FR1 = FR_f 100_i25
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 16 / 23
Basic elements guidelines.
Solution of the team
Linear Ensemble
The third version:
FR2 SIM2 FR8.0
2 * SIM2 PI Local Score
1 13 8 1 143653
FR2 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70
SIM2 = 0.5*SIM_jac + 0.5*SIM_click
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 17 / 23
Basic elements guidelines.
Solution of the team
Linear Ensemble
Local Score (145841):
1.0*FR3 + 15.0 * (FR8.0
3 * SIM3) + 13.0 * SIM3 + 1.0 * PI - 0.5 *
SIM_pearson - 0.3 * FR_f 400_i50_no_side + 0.5 *
(FR_imp2.0
* SIM_imp), where
FR3 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70
SIM3 = SIM_click
Local Score (146569):
1.0*FR4 + 15.0 * (FR8.0
4 * SIM4) + 13.0 * SIM4 + 1.0 * PI - 0.4 *
SIM_pearson - 0.3 * FR_f 400_i50_no_side + 0.5 *
(FR_imp2.0
* SIM_imp) + 0.2 * TM, where
FR4 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70
SIM4 = 0.4*SIM_jac + 0.6*SIM_t = 1
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 18 / 23
Basic elements guidelines.
Solution of the team
Final models set
SIM_jac Item-based jaccard similarity
SIM_click Item-based jaccard similarity on clicks
SIM_pearson Item-based pearson similarity
SIM_imp Item-based jaccard similarity on impressions
FR_f100_i25 Factorization, n_factors=100, iter=25
FR_f400_i70 Factorization, n_factors=400, iter=70
FR_f400_i50_no_side Factorization, no side data
FR_imp Factorization on impressions
TM LSI topic model
PI Past Impressions
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 19 / 23
Basic elements guidelines.
Solution of the team
Hardware&Software
∙ 1 server: 28 cores, 56 threads, 256Gb RAM
∙ Full training + prediction = 16 hours
∙ All code was written in Python
∙ ML Libraries: graphlab, gensim
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 20 / 23
Basic elements guidelines.
Leaderboard
Submission history
Score Rank Name Date
554655 9 Topic model 100 factors 06/27/16
548366 8 Top 150 candidates from every model 06/25/16
543284 8 8 model set: 4FR + 4SIM + TM 06/24/16
537157 9 Topic model 06/23/16
530599 10 3 models set: FR + 2SIM 06/23/16
497136 15 3 models set: FR + 2SIM 06/22/16
496241 1 FR with side data 03/20/16
397604 1 Past Impressions model 03/11/16
132790 1 Simple item-based recommender 03/10/16
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 21 / 23
Basic elements guidelines.
Leaderboard
Results
Rank Team Leaderboard Score Full Score
1 YunOS-OneSearch 681707.38 2052185.54
2 mim-solutions 675985.03 2035964.16
3 DaveXster 665592.06 2005263.73
4 PumpkinPie 622408.55 1866477.77
5 milk tea 613125.21 1846420.12
6 mdr_rec 605048.58 1823472.31
7 Avito 554654.72 1677898.52
8 recometric 556133.18 1677233.84
9 nodalpoints 555483.39 1671812.08
10 lucky_dog 542213.51 1632828.82
21 XING_TELECOM 461000.32 1397030.74
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 22 / 23
Thank you!
Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 23 / 23
1 of 23

More Related Content

What's hot(20)

1030 track 2 barrett_using our laptop1030 track 2 barrett_using our laptop
1030 track 2 barrett_using our laptop
Rising Media, Inc.140 views
Managing machine learningManaging machine learning
Managing machine learning
David Murgatroyd1.6K views
AIRG PresentationAIRG Presentation
AIRG Presentation
nirvdrum385 views
Group Project - Final - Linked inGroup Project - Final - Linked in
Group Project - Final - Linked in
Sanket Butoliya228 views
Weka linked inWeka linked in
Weka linked in
Sanket Butoliya322 views
Promise KeynotePromise Keynote
Promise Keynote
Tim Menzies488 views

Viewers also liked(20)

Jobandtalent at recsys challenge 2016Jobandtalent at recsys challenge 2016
Jobandtalent at recsys challenge 2016
Oscar Huarte1.9K views
Recruit recsys-review-magamboRecruit recsys-review-magambo
Recruit recsys-review-magambo
Elie Magambo Gatete187 views
Canopy kmeansCanopy kmeans
Canopy kmeans
nagwww4.2K views
Thesis_Nazarova_Final(1)Thesis_Nazarova_Final(1)
Thesis_Nazarova_Final(1)
Sardana Nazarova376 views
SocialLda SocialLda
SocialLda
saurabh_IIITH480 views
allegrotech - Data science  meetup #1 Introallegrotech - Data science  meetup #1 Intro
allegrotech - Data science meetup #1 Intro
Bartlomiej Twardowski825 views
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation
Daniele Di Mitri713 views
Recsys 2016Recsys 2016
Recsys 2016
Mindaugas Zickus3.5K views

Similar to Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling(20)

Emotionalise me: Self-reporting and arousal measurements in virtual tourism e...Emotionalise me: Self-reporting and arousal measurements in virtual tourism e...
Emotionalise me: Self-reporting and arousal measurements in virtual tourism e...
International Federation for Information Technologies in Travel and Tourism (IFITT)74 views
Bertazo et al - Application Lifecycle Management and process monitoring throu...Bertazo et al - Application Lifecycle Management and process monitoring throu...
Bertazo et al - Application Lifecycle Management and process monitoring throu...
International Software Benchmarking Standards Group (ISBSG)349 views
Summary of pilot cases. New ways of working. Esa Nykänen, Jari Laarni, Hanna-...Summary of pilot cases. New ways of working. Esa Nykänen, Jari Laarni, Hanna-...
Summary of pilot cases. New ways of working. Esa Nykänen, Jari Laarni, Hanna-...
VTT Technical Research Centre of Finland Ltd764 views
TCI 2016 Business Upper AustriaTCI 2016 Business Upper Austria
TCI 2016 Business Upper Austria
TCI Network690 views
4.1.2nd WS. ETNA & Communication Technology R.Andrich4.1.2nd WS. ETNA & Communication Technology R.Andrich
4.1.2nd WS. ETNA & Communication Technology R.Andrich
ATIS4all, European Thematic Network, led by Technosite.413 views
Day1 Sokwoo RheeDay1 Sokwoo Rhee
Day1 Sokwoo Rhee
US-Ignite566 views

Avito recsys-challenge-2016RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling

  • 1. © AvitoBasic elements guidelines. FILES: Avito-LOGO_RGB.eps, Avito-LOGO_CMYK_Pa RecSys Challenge 2016: Job Recommendation Based on Factorization Machines and Topic Modelling 7th place solution Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016
  • 2. Basic elements guidelines. Problem statement Data description ∙ Impressions — details about which items (job postings) were shown to which user by the existing recommender (19 August 2015 — 9 November 2015). ∙ Interactions — interactions that the user performed on the items (clicked, bookmarked, replied or deleted). ∙ Users — users details: job roles, career level, discipline, industry, location, experience, and education. ∙ Items — items details: title, career level, discipline, industry, location, employment type, tags, created time and flag if item was active during the test. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 2 / 23
  • 3. Basic elements guidelines. Problem statement Data description: impressions and interactions Date interval: 2015-08-19 – 2015-11-09 Impressions ∙ 201M unique user-item-week tuples ∙ 2.7M unique users ∙ 846K unique items Interactions ∙ 8.8M events: clicked – 7.2M, deleted – 1.0M, replied – 422K, bookmarked – 206K ∙ 785K unique users ∙ 1.03M unique items ∙ 2.8M из 6.9M (user-item) pairs are in impressions Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 3 / 23
  • 4. Basic elements guidelines. Problem statement Data description: target users and items 150K users for making recommendations, from which: ∙ 39.7К (26.5%) have no events ∙ 59.5K (39.6%) have less than 2 events ∙ 70.6K (47.1%) have less than 3 events 327К active items, from which: ∙ 129К (39.5%) have no events ∙ 164K (50.1%) have less than 2 events ∙ 188K (57.6%) have less then 3 events Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 4 / 23
  • 5. Basic elements guidelines. Problem statement Task of the challenge score(R, ˆR) = ∑︁ u∈U 20(P2(ru, ˆru) + P4(ru, ˆru) + R30(ru, ˆru)+ +S30(ru, ˆru)) + 10(P6(ru, ˆru) + P20(ru, ˆru)), where U = {0, . . . , N − 1} – list of target users, R = {ru}u∈U – lists of relevant items, ˆR = {ˆru}u∈U – the solution, Pk(ru, ˆru) – precision at top k for user u, R30(ru, ˆru) – recall at top 30, S30(ru, ˆru) – user success. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 5 / 23
  • 6. Basic elements guidelines. Problem statement Models validation ∙ The last week of interactions ∙ 10 000 random users from those who made any interactions during this week ∙ Old items (created more than a month ago) without interactions were removed ∙ Obtained score was highly correlated with the result on the Public Leaderboard Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 6 / 23
  • 7. Basic elements guidelines. Solution of the team Interesting insights from data ∙ A significant proportion of users and items have a small number of events or have no events. It means that we need to use a hybrid approach that takes into account not only collaborative filtering but the content data of items and users. ∙ Impressions slowly change over time. That is, the presence of a pair of user-item in impressions is a useful feature, and we use it as the separate model. ∙ Geographical features (distance, region, city, geoclusters etc.) are not improve score significantly. ∙ Tokens from user profiles and items are good features. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 7 / 23
  • 8. Basic elements guidelines. Solution of the team Interesting insights from data User profile and sessions exampleCV experience_n_experiencediscipline_id_user country_useregion_user jobroles 5 or more entr 10‐15 yearsSales & Commerce Germany Bavaria ['962959', '283291', '502342'] SESSIONS created_at impressiondiscipline_id_item country_itemregion_item title tags 09‐01 1:27 1 Production & ManufactGermany Baden‐Württem['620383', '1118975'] ['102823', '1335184', '624061', '73234', '1604815', '2862074'  09‐01 1:27 1 Other Disciplines Germany Hamburg ['4572761', '3543754', '196892['993979', '2426818', '792504', '4425481', '494116', '976257'  09‐01 1:27 1 Other Disciplines Germany Berlin ['18091'] ['4198994', '4182900', '4354582', '1399193', '1377742', '3580  09‐02 0:21 1 Health, Medical & Socianon_dach not specified ['165415', '1986087', '2585795['2426818', '3726822', '792504', '1830721', '184797', '325622  09‐04 20:46 0 IT & Software DevelopmGermany Brandenburg ['655030'] ['1491612', '972718', '2426818', '2383555', '4483314', '43216  09‐08 20:16 1 Other Disciplines Germany Hamburg ['2915824', '4035399', '156769['2110329', '503870', '2426818', '1437930', '2245760', '35922  09‐08 22:40 1 Sales & Commerce Germany not specified ['3418410', '3413328'] ['686709', '2036672', '3794933', '502342', '3413328', '117856  09‐08 22:41 1 IT & Software Developmnon_dach not specified ['3408137'] ['2632767', '1491612', '2245760', '689679', '1565617', '43216  09‐09 22:58 1 Administration Germany Berlin ['4141254', '1118975'] ['4162864', '1491612', '1565617', '689679', '4204056', '15454  09‐09 23:00 0 Production & ManufactGermany Lower Saxony ['4454260', '502342'] ['543177', '4160943', '2501578', '4329775', '3085937', '23421  09‐09 23:01 0 Other Disciplines Germany Lower Saxony ['1567693', '568776'] ['1178568', '1248479', '370640', '2342166', '94890', '3794933  09‐09 23:02 1 Health, Medical & Socianon_dach not specified ['1567693'] ['1178568', '1565617', '1491612', '1601282', '2380081', '4941  09‐09 23:08 0 Finance, Accounting & CAustria not specified ['2865345', '3294368'] ['3391339', '3176219', '4499767', '2426818', '798840', '37295  09‐09 23:08 0 Production & ManufactGermany Lower Saxony ['494116'] ['3176219', '159096', '3391339', '4499767', '494116', '372957  09‐09 23:08 0 Health, Medical & SociaSwitzerland not specified ['128836', '1836819'] ['1798728', '128836', '675557', '2976021']   09‐09 23:09 0 Production & ManufactAustria not specified ['2846960', '76751', '4227194',['2846960', '2632767', '3872048', '3939477', '1469275', '1695  09‐09 23:09 0 Engineering & TechnicaGermany Berlin ['4141254'] ['749243', '362736', '692505', '3669898', '624061', '494116']   09‐10 0:59 1 Production & ManufactAustria not specified ['1118975', '3478136'] ['502342', '4151211', '4439048', '3210328', '624061', '313759  09‐10 0:59 1 Engineering & TechnicaGermany Bavaria ['128836', '76887'] ['816406', '3347566', '502342', '4425481', '4160943', '160481  09‐10 0:59 1 Engineering & TechnicaGermany North Rhine‐W ['1119117', '3705605', '347813['2896178', '1357922', '2031982', '1491612', '1830721', '1335  09‐10 1:00 0 Other Disciplines Austria not specified ['2915824', '4035399', '156769['1565617', '1496767', '82994', '1625244', '1941434', '123188  09‐10 1:00 1 Teaching, R&D Germany North Rhine‐W ['1986087'] ['3144475', '4245173', '3096790', '655817', '2969837', '43216  09‐10 1:00 0 Other Disciplines Germany not specified ['2573697', '4035399', '448435['3457262', '3658040', '2126708', '2110329', '2630003', '4017  09‐10 1:00 0 Engineering & TechnicaGermany Baden‐Württem['2140778', '3241763'] ['1734724', '2000691', '4425481', '2111897', '577140', '94890  09‐10 1:00 1 Management & Corpor Germany Bavaria ['494116', '1119117', '2387379['4245173', '1231885', '272304', '4140111', '4321623', '18307  Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 8 / 23
  • 9. Basic elements guidelines. Solution of the team Item-based collaborative filtering Similarity metrics: ∙ Jaccard ∙ Cosine ∙ Pearson Event types for training: ∙ All Positive interactions ∙ Only Click interactions ∙ Impressions Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 9 / 23
  • 10. Basic elements guidelines. Solution of the team Factorization Machines Predicted score for user i on item j is given by: p(i,j) = 𝜇 + wi + wj + aT xi + bT yj + uT i vj , where 𝜇 – a global bias term, wi and wj are weight terms for user i and item j respectively, xi and yj are the user and item side feature vectors, a and b are the weight vectors for those side features, ui and vj – latent factors, which are vectors of fixed length (number of factors is a parameter). Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 10 / 23
  • 11. Basic elements guidelines. Solution of the team Factorization Machines: main parameters ∙ Number of latent factors (30 – 400) ∙ Number of sampled negative examples (1 – 12) ∙ Maximum number of iterations (25 – 70) ∙ Regularization parameters (1e-9 – 1e-7) ∙ User and item side features Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 11 / 23
  • 12. Basic elements guidelines. Solution of the team Factorization Machines: side features Users - all features (OneHotEncoder) ∙ jobroles ∙ career_level, discipline_id, industry_id ∙ country, region ∙ experience: n_entries_class, years, years_in_current ∙ edu: degree, field_of_studies Items - all features, except latitude and longitude (OneHotEncoder) ∙ title, tags ∙ career_level, discipline_id, industry_id ∙ country, region ∙ employment_type Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 12 / 23
  • 13. Basic elements guidelines. Solution of the team Topic model: Latent Semantic Indexing (LSI) ∙ Let document associated with each user be all title and tags tokens of items, which the user interacted with and job roles tokens from user description. ∙ Convert each document into a token occurrences vector. ∙ Transform values in each vector to TF-IDF statistics and combine all vectors into a large token-document matrix. ∙ Then we apply Singular Value Decomposition (SVD) technique on the token-document matrix ∙ The similarity between user and item will be the similarity between corresponding latent vectors. Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 13 / 23
  • 14. Basic elements guidelines. Solution of the team Solution framework Initial dataset Item­based models FM models Topic model Blending Output Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 14 / 23
  • 15. Basic elements guidelines. Solution of the team Linear Ensemble Base models: FR0 SIM0 PI Local Score 1 0 0 76995 0 1 0 69622 0 0 1 104495 1 1 1 132505 ∙ SIM0 – Item-based Recommender (jaccard similarity) ∙ FR0 – Factorization Machines Recommender (400 factors) ∙ PI – Past Impressions Recommender (very simple model with binary output) ∙ Local Score: 10 000 random users who made interactions with items during last week ∙ The score on the Public Leaderboard ≈ 3.8 × the score on our Local Validation Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 15 / 23
  • 16. Basic elements guidelines. Solution of the team Linear Ensemble Version «zero»: FR0 SIM0 PI Local Score 1 2 1 134285 The first version: FR0 SIM0 FR8.0 0 * SIM0 PI Local Score 1 13 8 1 138073 The second version: FR1 SIM0 FR8.0 1 * SIM0 PI Local Score 1 13 8 1 140876 FR1 = FR_f 100_i25 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 16 / 23
  • 17. Basic elements guidelines. Solution of the team Linear Ensemble The third version: FR2 SIM2 FR8.0 2 * SIM2 PI Local Score 1 13 8 1 143653 FR2 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70 SIM2 = 0.5*SIM_jac + 0.5*SIM_click Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 17 / 23
  • 18. Basic elements guidelines. Solution of the team Linear Ensemble Local Score (145841): 1.0*FR3 + 15.0 * (FR8.0 3 * SIM3) + 13.0 * SIM3 + 1.0 * PI - 0.5 * SIM_pearson - 0.3 * FR_f 400_i50_no_side + 0.5 * (FR_imp2.0 * SIM_imp), where FR3 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70 SIM3 = SIM_click Local Score (146569): 1.0*FR4 + 15.0 * (FR8.0 4 * SIM4) + 13.0 * SIM4 + 1.0 * PI - 0.4 * SIM_pearson - 0.3 * FR_f 400_i50_no_side + 0.5 * (FR_imp2.0 * SIM_imp) + 0.2 * TM, where FR4 = 0.5*FR_f 100_i25 + 0.5*FR_f 400_i70 SIM4 = 0.4*SIM_jac + 0.6*SIM_t = 1 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 18 / 23
  • 19. Basic elements guidelines. Solution of the team Final models set SIM_jac Item-based jaccard similarity SIM_click Item-based jaccard similarity on clicks SIM_pearson Item-based pearson similarity SIM_imp Item-based jaccard similarity on impressions FR_f100_i25 Factorization, n_factors=100, iter=25 FR_f400_i70 Factorization, n_factors=400, iter=70 FR_f400_i50_no_side Factorization, no side data FR_imp Factorization on impressions TM LSI topic model PI Past Impressions Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 19 / 23
  • 20. Basic elements guidelines. Solution of the team Hardware&Software ∙ 1 server: 28 cores, 56 threads, 256Gb RAM ∙ Full training + prediction = 16 hours ∙ All code was written in Python ∙ ML Libraries: graphlab, gensim Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 20 / 23
  • 21. Basic elements guidelines. Leaderboard Submission history Score Rank Name Date 554655 9 Topic model 100 factors 06/27/16 548366 8 Top 150 candidates from every model 06/25/16 543284 8 8 model set: 4FR + 4SIM + TM 06/24/16 537157 9 Topic model 06/23/16 530599 10 3 models set: FR + 2SIM 06/23/16 497136 15 3 models set: FR + 2SIM 06/22/16 496241 1 FR with side data 03/20/16 397604 1 Past Impressions model 03/11/16 132790 1 Simple item-based recommender 03/10/16 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 21 / 23
  • 22. Basic elements guidelines. Leaderboard Results Rank Team Leaderboard Score Full Score 1 YunOS-OneSearch 681707.38 2052185.54 2 mim-solutions 675985.03 2035964.16 3 DaveXster 665592.06 2005263.73 4 PumpkinPie 622408.55 1866477.77 5 milk tea 613125.21 1846420.12 6 mdr_rec 605048.58 1823472.31 7 Avito 554654.72 1677898.52 8 recometric 556133.18 1677233.84 9 nodalpoints 555483.39 1671812.08 10 lucky_dog 542213.51 1632828.82 21 XING_TELECOM 461000.32 1397030.74 Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 22 / 23
  • 23. Thank you! Vasily Leksin, Andrey Ostapets Avito.ru 15-09-2016 23 / 23