Author paper midterm

Author-Paper Identification Problem
Team:
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
Guided by
Prof. Duc Tran

Problem Statement
• Determine the correct author of a particular paper from the
author dataset.
• Ambiguity in author names can cause a paper to be
assigned to the wrong author, which leads to noisy author
profiles.
• This KDD Cup task challenges participants to determine which
papers in an author profile were truly written by the given author.

Type of data
The data provided by the KDD Cup challenge is in CSV format.
• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author (Id, Name, Affiliation)
• Paper-Author (PaperId, AuthorId, Name, Affiliation)
• Conference (Id, ShortName, FullName, HomePage)
• Journal (Id, ShortName, FullName, HomePage)
• Train (AuthorId, ConfirmedPaperIds, DeletedPaperIds)
• Test (AuthorId, PaperIds)
• Validation (AuthorId, PaperIds, Usage)

Data Points
The data points include all papers written by an author and
the author's affiliation (university, technical society, group).
• Paper-Author (PaperId, AuthorId, Name, Affiliation)
The metadata includes the journals the author published in and
the conferences the author attended.
• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author (Id, Name, Affiliation)
• Conference (Id, ShortName, FullName, HomePage)
• Journal (Id, ShortName, FullName, HomePage)

Issues with data
The CSV files needed cleaning.
A few records had attributes spilled over 3 rows.
Some rows had more attributes than the
required number of attributes.
Special characters caused issues.
We wrote a Perl script to clean and format the data.
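
The cleaning steps listed above could look roughly like the following. This is a sketch in Python, not the team's Perl script, and it rests on assumptions: the function name clean_rows, the choice of a single free-text column (text_col) to absorb stray commas, and the treatment of control characters as debris are all hypothetical.

```python
import csv
import io
import re

def clean_rows(raw_text, n_fields, text_col):
    """Repair records so that each one has exactly n_fields columns.

    Assumptions (not from the original script): a record that is too
    short continues on the next physical line, and surplus fields come
    from unquoted commas inside the free-text column at index text_col.
    """
    # Replace control characters (except newline) with spaces.
    raw_text = re.sub(r"[\x00-\x09\x0b-\x1f\x7f]", " ", raw_text)
    cleaned, pending = [], []
    for row in csv.reader(io.StringIO(raw_text)):
        if not row:
            continue  # skip blank lines
        if pending:
            # Continuation line: glue its first field onto the last pending field.
            pending[-1] = (pending[-1] + " " + row[0]).strip()
            pending.extend(row[1:])
        else:
            pending = list(row)
        if len(pending) < n_fields:
            continue  # record still incomplete; keep reading
        if len(pending) > n_fields:
            # Too many fields: fold the surplus back into the free-text column.
            extra = len(pending) - n_fields
            merged = ",".join(pending[text_col:text_col + extra + 1])
            pending = pending[:text_col] + [merged] + pending[text_col + extra + 1:]
        cleaned.append(pending)
        pending = []
    return cleaned
```

For the Paper file this would be called as clean_rows(text, 6, 1), treating Title as the free-text column. The sketch cannot tell a spilled record from a genuinely short one, so its output should be spot-checked against the raw file.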

Predictions & Intuitions
Prediction:
• Given a paper and an author, one should be able to
identify whether the paper was written by that author.
Intuition:
• We initially identified this as a clustering problem.
We chose clustering because the set of papers written by one
author can be grouped together, and then for a given
paper and author we can check whether the paper falls in the
author's cluster.
• The features PaperId, AuthorId, PaperTitle, and AuthorName play
a significant role in the prediction.

Feature selection
We used the following features from the Train dataset
while building the model:
ConfirmedPaperIds
DeletedPaperIds
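
One way to turn these two columns into training examples is to expand each Train row into (author, paper, label) triples, with label 1 for confirmed papers and 0 for deleted ones. The sketch below assumes that the paper ids inside each cell are separated by whitespace; the function name labeled_pairs is hypothetical.

```python
import csv
import io

def labeled_pairs(train_csv_text):
    """Expand Train rows into (author_id, paper_id, label) triples.

    Assumption: ConfirmedPaperIds and DeletedPaperIds each hold a
    whitespace-separated list of paper ids.
    """
    pairs = []
    for row in csv.DictReader(io.StringIO(train_csv_text)):
        author = int(row["AuthorId"])
        for pid in row["ConfirmedPaperIds"].split():
            pairs.append((author, int(pid), 1))  # confirmed: written by the author
        for pid in row["DeletedPaperIds"].split():
            pairs.append((author, int(pid), 0))  # deleted: not written by the author
    return pairs
```

The resulting triples can be joined with the Paper and Paper-Author tables to attach further attributes to each example.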

Tools Used & Models Trained
Tools Used:
• Weka
• R
• Apache Mahout
Models Trained:
• Simple K-Means
• J48
• ZeroR

K-means clustering using Weka
Training the data

Visualization of k-means clustering result

Simple K-means clustering using R

Error in R for Clustering
> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')
> y[1:10,]
> km3 <- kmeans(x,3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(x, 3) : NAs introduced by coercion

Conclusion
Why does clustering not work for this problem?
Handling a mixed set of attributes is an issue in R.
Simple k-means clustering works by calculating the distance
of each point from the centroids and thus needs numeric
attributes. Hence clustering is not the best approach for our
problem.
To work around this, we are trying to convert the
data into integer values so that numeric
distance measures can be applied.
However, this problem looks more like a classification
problem: classifying whether a paper was written by a given
author.
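
The conversion to integer values and the distance-to-centroid step described above can be sketched as follows. This is an illustration in Python rather than R, and the helper names encode_column, euclidean and nearest_centroid are hypothetical.

```python
import math

def encode_column(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_centroid(point, centroids):
    """Index of the closest centroid (the assignment step of k-means)."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))
```

Note that integer codes impose an arbitrary order on categories: the distance between code 0 and code 2 has no meaning for conference names, for example. This is one more reason to treat the task as classification.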

Moving on to classification
algorithms
• ZeroR
• Tree J48
• Naïve Bayes
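
ZeroR ignores all attributes and always predicts the most frequent class, which makes it a useful baseline for the other classifiers. A minimal sketch of that idea in Python follows; it is not Weka's implementation, and the function names are hypothetical.

```python
from collections import Counter

def zero_r_fit(labels):
    """Return the majority class of the training labels."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the true label."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)
```

Any classifier worth keeping, such as J48 or Naïve Bayes, should beat the accuracy obtained by predicting zero_r_fit(train_labels) for every test example.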

Results using Tree-J48 algorithm

Visualization of ZeroR results for Precision

Next Steps
• We are working on feature engineering (feature transformation): taking the author
name attribute and transforming it into a common format for all author names.
• Once the feature engineering is done, we will work principally on Naïve Bayes
and other classification algorithms that we think will suit our problem.
• Then we will fine-tune the model.
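
As an illustration of what a common format for author names might look like, the sketch below lowercases names, strips accents and punctuation, and reorders "Last, First" into "First Last". These particular rules and the function name normalize_name are assumptions; the deck does not specify the target format.

```python
import re
import unicodedata

def normalize_name(name):
    """Bring an author name into a hypothetical common format."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Reorder "Last, First" into "First Last".
    if "," in plain:
        last, _, first = plain.partition(",")
        plain = first + " " + last
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    lowered = re.sub(r"[^a-z0-9\s]", " ", plain.lower())
    return " ".join(lowered.split())
```

With these rules, "Smith, John A." and "John A. Smith" both become "john a smith", so the Name columns of the Author and Paper-Author tables can be compared directly.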
