Author paper identification problem


Author Paper Identification Problem Final Presentation Structure

Published in: Software, Technology, Business

  1. 1. Author-Paper Identification Problem. Guided by Prof. Duc Tran. Team: Karthik Reddy Vakati, Nachammai C, Pooja Mishra
  2. 2. Problem Statement: To determine the correct author from the author dataset for a particular paper. Ambiguity in author names might cause a paper to be assigned to the wrong author, which leads to noisy author profiles. The challenge is to determine which papers in an author profile were truly written by the given author.
  3. 3. Data Points: The data points include all papers written by an author and the author's affiliation (university, technical society, groups). • Paper-Author (PaperId, AuthorId, Name, Affiliation). The metadata includes the journals an author has published in and the conferences the author has attended. • Paper (Id, Title, Year, ConferenceId, JournalId, Keywords) • Author (Id, Name, Affiliation) • Conference (Id, ShortName, FullName, HomePage) • Journal (Id, ShortName, FullName, HomePage)
  4. 4. Machine Learning Task • Feature Engineering • Algorithms • Model Tuning • Results • Evaluation
  5. 5. Steps Taken to Solve the Problem • Data preprocessing and cleaning • Feature engineering • Choosing a model: Random Forest / Gradient Boost • Building the model • Evaluating the results • Extracting feature values • Building the model using the modified train file • Evaluating the results • Tuning the model
  6. 6. Data Preprocessing and Cleaning. Issues with the data: • The CSV files needed cleaning • A few records had attributes spilled over 3 rows • Some rows had more attributes than the required number • Special characters caused issues. We wrote a Perl script to clean and format the data.
  7. 7. Issues with data-I
  8. 8. Issues with data-II
  9. 9. Feature Engineering Steps • Aggregation: combining multiple features into one. How we used it: the train file was elaborated with AuthorID, PaperID and Confirmation, combined with the name and affiliation from the Author file. • Discretization: converting continuous features or variables to discretized or nominal features. How we used it: the year the paper was published; the max and min years the author was actively publishing papers. • Construction: creating new features out of original ones. How we used it: keywords in Paper.
  10. 10. Author features • distance between the author names in the paper-author and author files • matched substring ratio between the author names in the paper-author and author files • keywords used by a particular author (less weight) • count of keywords for the author • count of co-authors for a given author • weighted TF-IDF measure of all author keywords inside the author's papers • count of distinct papers of the author • years during which the author wrote many papers • number of times an author is repeated (sum over distinct ids) • list of distinct ids assigned to the same author
  11. 11. Paper features • year of paper • count of authors of the paper • count of duplicated papers for the paper • count of duplicated authors for the paper • count of keywords in the paper • how many times the exact same set of authors is repeated in different papers (without duplicates)
  12. 12. Paper-Author features • correct affiliation from the PaperAuthor table: binary feature • which year of the author's publishing activity the paper falls in (1 for papers from the author's first year, 2 for the second year, and so on) • count sources: number of times the author-paper pair appears in the PaperAuthor table • whether the names in the PaperAuthor table and the Author table are the same
  13. 13. Machine Learning Models Used Models used earlier • K-means clustering • ZeroR • Tree J-48 • Naïve Bayes
  14. 14. Machine Learning Models Used • RandomForest • Gradient Boost
  15. 15. Feature values extraction • Count of papers for each author • Minimum active year for a given author • Maximum active year for a given author • Jaccard distance between the author name in the author file and the paper-author file • Jaccard distance between the affiliation in the author file and the paper-author file • The year the paper was published
  16. 16. Build Random Forest using Weka with the elaborated train file
  17. 17. Build Random Forest using Mahout
  18. 18. Build Gradient Boost using H2O
  19. 19. Evaluation Metrics • Accuracy • Error percentage
  20. 20. Lessons Learned! • When you know exactly what you need to find in the provided data, use a supervised learning model such as classification rather than an unsupervised one such as clustering. • When you want to find patterns or structures in the provided data, use unsupervised learning models such as clustering. • Try building the model using the provided train file, but it might not always give good results. You can try to enrich it with the existing data, as long as you do not alter its original contents. • Choosing the features is the most important step; feature values can be extracted from the given data and used to build the model. • Choosing features from several data points will give better results than choosing them from only one.
  21. 21. Thank you!!
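
The cleaning issues listed on slide 6 (records spilled over several rows, rows with too many attributes, special characters) can be sketched roughly as follows. The team used a Perl script that is not shown in the slides; this Python version is only an illustration. The function name `clean_rows` and its rules are assumptions: a line is treated as the start of a new record only if its first field is a numeric id, surplus fields are folded into the last column, and control characters are dropped.

```python
import csv
import re


def clean_rows(lines, expected_fields):
    """Rebuild spilled records and trim surplus fields (illustrative rules only)."""
    records = []
    for line in lines:
        # Drop control characters, which stand in here for the "special characters".
        line = re.sub(r"[\x00-\x1f\x7f]", "", line).strip()
        if not line:
            continue
        fields = next(csv.reader([line]))
        if fields[0].strip().isdigit() or not records:
            # Assumption: every real record starts with a numeric id.
            records.append(fields)
        else:
            # Continuation line: glue its first fragment onto the previous field.
            records[-1][-1] = (records[-1][-1] + " " + fields[0]).strip()
            records[-1].extend(fields[1:])

    cleaned = []
    for rec in records:
        if len(rec) > expected_fields:
            # Assumption: extra commas belong to the last column (e.g. affiliation).
            rec = rec[: expected_fields - 1] + [",".join(rec[expected_fields - 1:])]
        cleaned.append([field.strip() for field in rec])
    return cleaned


if __name__ == "__main__":
    sample = ["1,Jane Doe,University of", "Somewhere", "2,John Roe,Dept, Univ X"]
    for row in clean_rows(sample, expected_fields=3):
        print(row)
```

A continuation line that happens to begin with a bare number would be misread as a new record under these rules, so real data would likely need a stricter check.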
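
Slides 10 and 15 name two string features: the Jaccard distance and the matched substring ratio between the name in the author file and the name in the paper-author file. The slides do not define either precisely, so the sketch below makes assumptions: Jaccard distance is computed over lowercase word tokens, and the matched substring ratio is taken as the longest common substring length divided by the longer name's length. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def _tokens(text):
    # Lowercase word tokens; dots and commas are treated as separators.
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over word tokens (0.0 means identical token sets)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def matched_substring_ratio(a, b):
    """Longest common substring length divided by the longer string's length."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / max(len(a), len(b))


if __name__ == "__main__":
    print(jaccard_distance("J. Doe", "Jane Doe"))
    print(matched_substring_ratio("J. Doe", "Jane Doe"))
```

The same two functions could be applied to the affiliation strings, which slide 15 also lists as a Jaccard feature.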
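
The per-author values mentioned on slides 9 and 15 (paper count and the span of active publishing years) could be extracted along these lines. This is a minimal sketch, not the team's code: `author_activity` is a hypothetical name, the inputs are assumed to be (PaperId, AuthorId) pairs plus a PaperId-to-Year mapping, and a year of 0 or a missing paper is assumed to mean "unknown".

```python
from collections import defaultdict


def author_activity(paper_author_pairs, paper_year):
    """Return paper count and min/max known publication year per author."""
    papers_by_author = defaultdict(set)
    for paper_id, author_id in paper_author_pairs:
        # A set removes duplicate (paper, author) rows.
        papers_by_author[author_id].add(paper_id)

    features = {}
    for author_id, paper_ids in papers_by_author.items():
        # Assumption: year 0 (or a paper missing from the mapping) is unknown.
        years = [paper_year[p] for p in paper_ids if paper_year.get(p, 0) > 0]
        features[author_id] = {
            "paper_count": len(paper_ids),
            "min_year": min(years) if years else None,
            "max_year": max(years) if years else None,
        }
    return features


if __name__ == "__main__":
    pairs = [(10, 1), (11, 1), (10, 1), (12, 2)]
    years = {10: 2001, 11: 2005, 12: 0}
    print(author_activity(pairs, years))
```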
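
Two of the Paper-Author features on slide 12 lend themselves to a short illustration: "count sources" (how often an author-paper pair appears in the PaperAuthor table) and the publishing-year index. The sketch below assumes that "first year, second year" refers to the distinct years in which the author has papers, ranked in ascending order; the slides do not say whether calendar gaps count. Both function names are hypothetical.

```python
from collections import Counter


def pair_source_counts(paper_author_rows):
    """Count how many times each (paper_id, author_id) pair occurs."""
    return Counter((paper_id, author_id) for paper_id, author_id in paper_author_rows)


def publishing_year_rank(author_years):
    """Map each distinct publication year of one author to 1, 2, 3, ..."""
    ordered = sorted(set(author_years))
    return {year: rank for rank, year in enumerate(ordered, start=1)}


if __name__ == "__main__":
    print(pair_source_counts([(10, 1), (10, 1), (11, 1)]))
    print(publishing_year_rank([2005, 2001, 2005, 2003]))
```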
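
The two evaluation metrics on slide 19 are simple to state in code. The sketch below assumes that "error percentage" means the share of misclassified author-paper pairs expressed as a percentage, which is the complement of accuracy; the slides do not give a formula.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def error_percentage(y_true, y_pred):
    """Misclassified share in percent, assumed to be 100 * (1 - accuracy)."""
    return 100.0 * (1.0 - accuracy(y_true, y_pred))


if __name__ == "__main__":
    confirmed = [1, 0, 1, 1]   # hypothetical true labels (1 = truly the author)
    predicted = [1, 0, 0, 1]   # hypothetical model output
    print(accuracy(confirmed, predicted), error_percentage(confirmed, predicted))
```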