2. Problem Statement
• Determine, from the author dataset, the correct author of a
particular paper.
• Ambiguity in author names can cause a paper to be
assigned to the wrong author, which leads to noisy author
profiles.
• This KDD Cup task challenges participants to determine which
papers in an author profile were truly written by the given author.
3. Type of data
The data provided by the KDD Cup challenge is in CSV format.
Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
Author - (Id, Name, Affiliation)
Paper-Author - (PaperId, AuthorId, Name, Affiliation)
Conference - (Id, ShortName, FullName, HomePage)
Journal - (Id, ShortName, FullName, HomePage)
Train - (AuthorId, ConfirmedPaperIds, DeletedPaperIds)
Test - (AuthorId, PaperIds)
Validation - (AuthorId, PaperIds, Usage)
4. Data Points
The data points include all papers attributed to an author and
the author's affiliation (university, technical society, group).
Paper-Author - (PaperId, AuthorId, Name, Affiliation)
The metadata covers the journals and conferences in which
an author's papers appeared.
Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
Author - (Id, Name, Affiliation)
Conference - (Id, ShortName, FullName, HomePage)
Journal - (Id, ShortName, FullName, HomePage)
5. Issues with data
The CSV files needed cleaning:
A few records had attributes spilled over 3 rows.
Some rows had more attributes than the
required number of attributes.
Special characters caused parsing issues.
We wrote a Perl script to clean and format the data.
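The Perl script itself is not shown here. As a rough illustration of the kind of repair described above, the following Python sketch rejoins records that spilled over several lines, folds surplus fields into the last column, and blanks out control characters. The function name `repair_rows`, the expected field count, and the fold-into-last-column policy are assumptions made for this example, not the behavior of the original script.

```python
import csv
import io
import re

# Assumed policy: replace ASCII control characters (except tab) with a space.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def parse_fields(text):
    """Parse one logical CSV record into a list of fields."""
    return next(csv.reader(io.StringIO(text)), [])


def repair_rows(lines, n_fields):
    """Yield cleaned records with exactly n_fields columns.

    Physical lines are accumulated until at least n_fields fields are
    present. Surplus fields are merged into the last column.
    """
    pending = ""
    for line in lines:
        line = _CONTROL.sub(" ", line.rstrip("\r\n"))
        candidate = (pending + " " + line) if pending else line
        fields = parse_fields(candidate)
        if len(fields) < n_fields:
            # Record is still incomplete: keep it and read the next line.
            pending = candidate
            continue
        if len(fields) > n_fields:
            # Too many fields: fold the extras into the last column.
            fields = fields[:n_fields - 1] + [" ".join(fields[n_fields - 1:])]
        yield [f.strip() for f in fields]
        pending = ""
```

One limitation of this sketch: a truncated record followed directly by a complete one would be merged into a single row, so rows repaired this way would still deserve a manual check.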
8. Predictions & Intuitions
Prediction:
Given a paper and an author, we should be able to
identify whether the paper was written by that author.
Intuition:
We initially framed this as a clustering problem.
We chose clustering because the papers written by one
author can be grouped together; then, for a given
paper and author, we can check whether the paper falls in
that author's cluster.
The features PaperId, AuthorId, PaperTitle, and AuthorName play
a significant role in the prediction.
9. Feature selection
We used the following features from the Train dataset
while building the model:
ConfirmedPaperIds
DeletedPaperIds
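To make these two columns usable by a learner, each Train row can be expanded into labeled (author, paper) pairs: confirmed papers as positives and deleted papers as negatives. The sketch below assumes that each column holds paper IDs separated by whitespace; the function name `label_pairs` is hypothetical.

```python
def label_pairs(author_id, confirmed_ids, deleted_ids):
    """Expand one Train row into (author_id, paper_id, label) triples.

    label is 1 for a confirmed paper and 0 for a deleted one.
    Assumes the ID columns are whitespace-separated strings.
    """
    pairs = [(author_id, int(p), 1) for p in confirmed_ids.split()]
    pairs += [(author_id, int(p), 0) for p in deleted_ids.split()]
    return pairs


# Example with made-up IDs.
example = label_pairs(7, "10 11", "12")
```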
10. Tools Used & Model Trained
Tools Used:
Weka
R
Apache Mahout
Models Trained:
Simple K-Means
J48
ZeroR
14. Error in R for Clustering
> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')
> y[1:10,]
> km3 <- kmeans(x,3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(x, 3) : NAs introduced by coercion
15. Conclusion
Why does clustering not work for this problem?
Handling a mixed set of attributes is an issue in R.
Simple K-Means works by computing distances from
centroids, so it needs numeric attributes and a numeric
distance measure. Hence clustering is not the best approach
for our problem.
To work around this, we are trying to convert the
data into integer values so that numeric
distance measures can be applied.
However, this problem looks more like a classification
problem: classifying whether a paper was written by a
given author.
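The conversion to integer values mentioned above can be sketched in Python as follows. Each distinct text value is mapped to an integer code, after which a Euclidean distance can be computed. The function names and the sample rows are hypothetical. Note that integer codes impose an arbitrary ordering on categories, so the resulting distances carry little meaning, which is one more reason to prefer classification here.

```python
import math


def encode_column(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Made-up rows of (Year, Keywords) for illustration only.
years = [2005, 2001, 2005]
keywords = ["mining", "graphs", "mining"]
points = list(zip(years, encode_column(keywords)))
distance = euclidean(points[0], points[2])  # identical rows
```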
16. Moving on to Classification Algorithms
ZeroR
J48 (decision tree)
Naïve Bayes
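ZeroR is the simplest of these: it ignores all features and always predicts the most frequent class in the training data, which makes it a useful baseline for judging J48 and Naïve Bayes. A minimal Python sketch of that idea follows; the function names are hypothetical and this is not Weka's implementation.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted_label, labels):
    """Fraction of labels matched when always predicting predicted_label."""
    return sum(1 for y in labels if y == predicted_label) / len(labels)


# Made-up labels: 1 = confirmed paper, 0 = deleted paper.
train_labels = [1, 1, 0, 1, 0]
majority = zero_r_fit(train_labels)
baseline = accuracy(majority, train_labels)
```

Any classifier trained on the real features should beat this baseline accuracy to be worth keeping.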
20. Next Steps
We are working on feature engineering, specifically feature transformation: taking the Author
name attribute and transforming it into a common format for all author names.
Once the feature engineering is done, we will work principally on Naïve Bayes
and other classification algorithms that we think will suit our problem,
and then fine-tune the model.
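One possible shape for the author-name transformation is sketched below in Python. It assumes that a "common format" means lowercase ASCII-like text with accents removed, punctuation dropped, and whitespace collapsed; the rules actually adopted may differ, and `normalize_name` is a hypothetical name.

```python
import re
import unicodedata


def normalize_name(name):
    """Reduce an author name to a lowercase, accent-free, single-spaced form."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces and lowercase everything.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Collapse runs of whitespace.
    return " ".join(text.split())
```

With this sketch, variants such as "José M. García" and "JOSE M GARCIA" map to the same string, which lets name-based features compare them as equal.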