KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

We describe our approach to solving the Author - Paper Identification Challenge: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

Published in: Technology, Business

  1. KDD Cup 2013 Author – Paper Identification Challenge (2nd place team). Dmitry Efimov, Lucas Silva, Benjamin Solecki.
  2. Approach summary. Goal: find incorrectly assigned author-paper pairs. Supervised machine learning problem with a binary response. Deep feature engineering (> 300 features). Gradient Boosting Machine (package gbm in R).
  3. Author – Paper graph
  4. Author features. Count features: count of journals. NLP features: tf-idf measure. Multiple source features: author's duplicates.
  5. Paper features. Count features: count of keywords. NLP features: tf-idf measure. Multiple source features: paper's duplicates. Additional features: reverse feature engineering.
  6. Author – paper features (1 of 4). Count features; Multiple source features; Additional features; Likelihood features.
  7. Author – paper features (2 of 4). Count features: count of coauthors with the same affiliation. Additional features: reverse feature engineering (year ranking feature).
  8. Author – paper features (3 of 4). Multiple source features: how many times did the author-paper pair appear in the Microsoft database?
  9. Author – paper features (4 of 4). Likelihood features, where Lj is the likelihood by journal and Lja is the likelihood by journal and author: use Lj and Lja as features; 1) use (α∙Lj + (1−α)∙Lja) as a feature (shrunken likelihood); 2) use mixed-effects models (package lme4 in R) to find α.
  10. Model. Gradient Boosting Machine (package gbm in R). Grid search over the set of parameters. 83 features in the final model (out of 300 calculated features).
  11. Result and conclusion.
     • Our MAP score is 0.98144 (the winning submission scored 0.98259).
     • Several algorithms that optimize MAP directly (LambdaRank, LambdaMART, RankBoost) achieved a lower MAP score than GBM with a Bernoulli distribution.
     • The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can suggest ideas for new features.
  12. Thank you!
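
The shrunken likelihood on slide 9 can be illustrated with a short Python sketch. The slides do not say how Lj and Lja are estimated, so this sketch assumes each is simply the fraction of confirmed pairs in the training data, grouped by journal and by (journal, author) respectively. The team fitted α with mixed-effects models in lme4; here α is a hand-picked constant, and all function and variable names are hypothetical.

```python
from collections import defaultdict


def likelihoods(rows):
    """Estimate Lj and Lja from (author_id, journal_id, confirmed) rows.

    Assumption: a likelihood is the share of confirmed (1) pairs in its group.
    """
    by_journal = defaultdict(lambda: [0, 0])         # journal -> [confirmed, total]
    by_journal_author = defaultdict(lambda: [0, 0])  # (journal, author) -> [confirmed, total]
    for author, journal, confirmed in rows:
        for counts in (by_journal[journal], by_journal_author[(journal, author)]):
            counts[0] += confirmed
            counts[1] += 1
    l_j = {key: c / n for key, (c, n) in by_journal.items()}
    l_ja = {key: c / n for key, (c, n) in by_journal_author.items()}
    return l_j, l_ja


def shrunken_likelihood(l_j, l_ja, author, journal, alpha):
    """Blend the journal-level and journal-author-level likelihoods."""
    lj = l_j[journal]
    # Fall back to the journal-level value when the pair was never seen.
    lja = l_ja.get((journal, author), lj)
    return alpha * lj + (1 - alpha) * lja


# Toy training rows: (author_id, journal_id, confirmed)
rows = [
    (1, "J1", 1), (1, "J1", 1), (2, "J1", 0), (3, "J1", 1),
    (2, "J2", 0), (2, "J2", 0),
]
l_j, l_ja = likelihoods(rows)
feature = shrunken_likelihood(l_j, l_ja, author=2, journal="J1", alpha=0.5)
```

With a larger α the feature leans on the journal-level estimate, which is the more stable choice when an author has only a few papers in that journal.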

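
The multiple source feature on slide 8 (how many times an author-paper pair appears in the database) amounts to a grouped count. A minimal Python sketch follows, assuming the raw data is a list of (author_id, paper_id) rows in which a pair may repeat; the row layout and names are hypothetical and not taken from the slides.

```python
from collections import Counter


def pair_counts(rows):
    """Count how often each (author_id, paper_id) pair occurs in the raw rows."""
    return Counter((author, paper) for author, paper in rows)


# Toy rows: the pair (1, 10) is listed three times.
paper_author_rows = [(1, 10), (1, 10), (1, 11), (2, 10), (1, 10)]
counts = pair_counts(paper_author_rows)

# Feature value for each candidate pair; unseen pairs get 0.
candidates = [(1, 10), (2, 10), (3, 12)]
features = {pair: counts[pair] for pair in candidates}
```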
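
The MAP score quoted on slide 11 is not defined in the slides. The sketch below uses the standard definition of mean average precision and assumes that papers are ranked per author and that average precision is then averaged over authors. The function names and toy data are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-author average precision values."""
    return sum(average_precision(r) for r in rankings.values()) / len(rankings)


# Labels are listed in the order the model ranked each author's papers.
rankings = {
    "author_a": [1, 1, 0],  # both confirmed papers ranked first
    "author_b": [1, 0, 1],  # one wrong paper ranked above a confirmed one
}
map_score = mean_average_precision(rankings)
```

Because the metric depends only on the order of papers within each author, any monotone score, such as a predicted probability from a binary classifier, can be used to produce the ranking.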