2. Problem Statement
Determine the correct author of a given paper from the author
dataset.
Ambiguity in author names can cause a paper to be
assigned to the wrong author, which leads to noisy
author profiles.
The challenge is to determine which papers in an author
profile were truly written by that author.
3. Data Points
The data points include all papers written by an author and the author's
affiliation (university, technical society, group).
Paper-Author - (PaperId, AuthorId, Name, Affiliation)
The metadata includes the journals an author published in and the
conferences the author attended.
Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
Author - (Id, Name, Affiliation)
Conference - (Id, ShortName, FullName, HomePage)
Journal - (Id, ShortName, FullName, HomePage)
5. Steps Taken to Solve Problem
Data preprocessing and cleaning
Feature engineering
Choose a model - Random Forest or Gradient Boosting
Building the model
Evaluating the results
Extracting feature values
Building the model using the modified train file
Evaluating the results
Tuning the model
6. Data preprocessing and cleaning
Issues with the data
The CSV files needed cleaning
A few records had attributes spilled over 3 rows
Some rows had more attributes than the required
number of attributes
Special characters caused issues
Wrote a Perl script to clean and format the data
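The Perl script itself is not shown in these slides. As a rough illustration only, the Python sketch below handles the three issues listed above. It assumes that a record with too few fields continues on the following line(s), and that surplus fields belong to the last column. The names clean_rows and n_fields are hypothetical.

```python
import csv
import re

# Hypothetical helper: a sketch of the cleanup idea, not the original Perl script.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def clean_rows(lines, n_fields):
    """Yield rows that have exactly n_fields columns.

    Assumptions: a record with too few fields continues on the next
    line(s); surplus fields are folded into the last column.
    """
    pending = []
    for line in lines:
        # Drop control characters and the line terminator.
        line = _CONTROL_CHARS.sub("", line.rstrip("\r\n"))
        if not line.strip():
            continue
        fields = next(csv.reader([line]))
        if pending:
            # Continuation line: glue its first fragment onto the last pending field.
            pending[-1] = (pending[-1] + " " + fields[0]).strip()
            pending.extend(fields[1:])
        else:
            pending = fields
        if len(pending) >= n_fields:
            head, tail = pending[:n_fields - 1], pending[n_fields - 1:]
            yield head + [" ".join(t.strip() for t in tail if t.strip())]
            pending = []
    # A trailing record that never reaches n_fields is dropped.
```

Called as list(clean_rows(open_file, 6)) on a Paper file, this would return one six-column row per record; whether folding extras into the last column is the right repair depends on which column actually contained the stray separators.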
9. Feature Engineering Steps
• Aggregation: combining multiple features into one.
How we used it: extended the train file (AuthorID, PaperID and
Confirmation) with the name and affiliation from the Author file.
• Discretization: converting continuous features or variables to
discretized or nominal features.
How we used it: the year the paper was published, and the max and min
years in which the author was actively publishing papers.
• Construction: creating new features out of original ones.
How we used it: keywords in Paper.
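The aggregation and discretization steps above could be sketched as follows. This is a minimal illustration, not the code used for the project: the function names, the 5-year bucket width, and the treatment of a year of 0 as "missing" are all assumptions.

```python
def aggregate_train(train_rows, authors):
    """Attach name and affiliation from the Author file to each train row.

    train_rows: iterable of (author_id, paper_id, confirmed)
    authors:    dict author_id -> (name, affiliation)
    """
    out = []
    for author_id, paper_id, confirmed in train_rows:
        name, affiliation = authors.get(author_id, ("", ""))
        out.append({
            "AuthorId": author_id,
            "PaperId": paper_id,
            "Confirmed": confirmed,
            "Name": name,
            "Affiliation": affiliation,
        })
    return out

def year_bucket(year, width=5):
    """Map a year to the first year of its bucket (width is an assumed value)."""
    if not year or year <= 0:
        return None  # assumption: 0 or None means the year is unknown
    return year - year % width

def active_years(paper_years):
    """Return (min_year, max_year) over an author's papers, ignoring unknown years."""
    valid = [y for y in paper_years if y and y > 0]
    return (min(valid), max(valid)) if valid else (None, None)
```

For example, active_years would be called once per author with the years of all papers in that author's profile, and year_bucket once per paper.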
10. Author features
• distance between the author names in the paper-author and author files
• matched substring ratio between the author names in the paper-author and author files
• keywords used by a particular author (less weight)
• count of keywords for the author
• count of co-authors for a given author
• weighted TF-IDF measure of all author keywords inside the author's papers
• count of distinct papers by the author
• years during which the author wrote many papers
• number of times an author is repeated (sum over distinct ids)
• list of distinct ids assigned to the same author
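The slides do not say which string distance was used for the first two features. As one possible reading, the sketch below uses Levenshtein edit distance for the "distance" and difflib's SequenceMatcher ratio for the "matched substring ratio". The helper names and the normalization step are assumptions.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, treat dots as spaces, and collapse repeated whitespace."""
    return " ".join(name.lower().replace(".", " ").split())

def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def matched_ratio(a, b):
    """Share of matching characters, in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()
```

Both functions would typically be applied to normalize(name) for the name in the paper-author file and the name in the author file, giving two numeric features per author-paper pair.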
11. Paper features
• year of paper
• count of authors of the paper
• count of duplicated papers for the paper
• count of duplicated authors for the paper
• count of keywords in the paper
• how many times the exact same set of authors is repeated in different papers (without duplicates)
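A few of the paper features above can be derived from the paper-author rows alone. The sketch below is one interpretation: "duplicated authors" is taken to mean repeated author rows for the same paper, and the author-set count is taken over distinct papers. The function and key names are hypothetical.

```python
from collections import Counter, defaultdict

def paper_features(paper_author_rows):
    """Compute per-paper counts from (paper_id, author_id) rows.

    The rows may contain duplicates, as in the raw paper-author file.
    """
    raw = defaultdict(list)
    for paper_id, author_id in paper_author_rows:
        raw[paper_id].append(author_id)

    # Distinct author set per paper, and how many papers share each set.
    author_sets = {p: frozenset(a) for p, a in raw.items()}
    set_counts = Counter(author_sets.values())

    return {
        p: {
            "n_authors": len(author_sets[p]),
            "n_duplicated_authors": len(raw[p]) - len(author_sets[p]),
            "same_author_set_papers": set_counts[author_sets[p]],
        }
        for p in raw
    }
```

Using frozenset makes the author set hashable, so papers with the same authors in a different order are counted together.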
12. Paper-Author features
• correct affiliation from the PaperAuthor table: binary feature
• the year of the author's publishing activity in which the paper appeared (this feature
equals 1 for papers from the author's first year, 2 for papers from the second year, and so on)
• count sources: number of times the author-paper pair appears in the
PaperAuthor table
• whether the names in the PaperAuthor table and the Author table are the same
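Two of these features are simple to sketch. The version below assumes that the publishing-year feature is a rank over the distinct years in which the author published (so a gap of several years still advances the rank by one); the slides leave this open, and a plain offset from the first year would be an equally valid reading. The function names are hypothetical.

```python
from collections import Counter

def count_sources(paper_author_rows):
    """Count how often each (paper_id, author_id) pair occurs in PaperAuthor."""
    return Counter((paper_id, author_id) for paper_id, author_id in paper_author_rows)

def publishing_year_rank(author_paper_years):
    """Rank each paper by the author's publishing year (1 = first year).

    author_paper_years: dict author_id -> dict paper_id -> year
    Returns a dict (author_id, paper_id) -> rank, or None if the year is unknown.
    """
    ranks = {}
    for author_id, papers in author_paper_years.items():
        # Assumption: a falsy year (0 or None) means unknown and gets no rank.
        years = sorted({y for y in papers.values() if y})
        index = {y: i for i, y in enumerate(years, 1)}
        for paper_id, year in papers.items():
            ranks[(author_id, paper_id)] = index.get(year)
    return ranks
```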
13. Machine Learning Models Used
Models used earlier
• K-means clustering
• ZeroR
• J48 decision tree
• Naïve Bayes
15. Feature values extraction
Count of papers for each author
Maximum active year for given author
Minimum active year for given author
Jaccard distance between the author name in the author file and
the paper-author file
Jaccard distance between the affiliation in the author file and the
paper-author file
The year the paper was published
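The Jaccard distance between two sets A and B is 1 - |A ∩ B| / |A ∪ B|. The slides do not say whether the sets were built from words or from characters; the minimal sketch below assumes lowercase word tokens, and returning 0.0 when both strings are empty is also an assumption.

```python
def jaccard_distance(a, b):
    """Jaccard distance between the word sets of two strings, in [0, 1]."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0  # assumption: two empty strings count as identical
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

The same function would serve for both the name and the affiliation comparison, since each is a pair of short strings from the author file and the paper-author file.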
20. Lessons Learned!
When you know exactly what you want to find in the provided data, use a
supervised learning model such as classification rather than an
unsupervised learning model such as clustering.
When you want to find patterns or structures in the provided data, use
unsupervised learning models such as clustering.
Try building the model with the provided train file first, but it might not always give
the best results. You can extend the train file using the existing data, while
making sure you do not change the original data.
Choosing the features is the most important step: feature values can be extracted from
the given data and used to build the model.
Choosing features from different data points gives better results than
choosing them from only one.