Author-Paper Identification Problem: Final Presentation
1. Author-Paper Identification Problem
   Guided by: Prof. Duc Tran
   Team: Karthik Reddy Vakati, Nachammai C, Pooja Mishra
2. Problem Statement
   To determine the correct author from the author dataset for a particular paper.
   Ambiguity in author names might cause a paper to be assigned to the wrong author, which leads to noisy author profiles.
   The challenge is to determine which papers in an author profile were truly written by the given author.
3. Data Points
   The data points include all papers written by an author and the author's affiliation (university, technical society, group).
   • Paper-Author - (PaperId, AuthorId, Name, Affiliation)
   The metadata includes the journals an author has published in and the conferences they have attended.
   • Paper - (Id, Title, Year, ConferenceId, JournalId, Keywords)
   • Author - (Id, Name, Affiliation)
   • Conference - (Id, ShortName, FullName, HomePage)
   • Journal - (Id, ShortName, FullName, HomePage)
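These five tables map naturally onto dataframes. A minimal loading sketch in Python with pandas, assuming each table is available as a CSV file with the columns listed above (the file names here are illustrative, not taken from the project):

import pandas as pd

# Illustrative file names; adjust to wherever the dataset tables are stored.
author = pd.read_csv("Author.csv")            # Id, Name, Affiliation
paper = pd.read_csv("Paper.csv")              # Id, Title, Year, ConferenceId, JournalId, Keywords
paper_author = pd.read_csv("PaperAuthor.csv") # PaperId, AuthorId, Name, Affiliation
conference = pd.read_csv("Conference.csv")    # Id, ShortName, FullName, HomePage
journal = pd.read_csv("Journal.csv")          # Id, ShortName, FullName, HomePage

print(paper_author.head())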
4. Machine Learning Task
   • Building the model using the train dataset and testing it
   • Feature engineering
   • Algorithms
   • Model tuning
   • Results
   • Evaluation
5. Steps Taken to Solve the Problem
   • Data preprocessing and cleaning
   • Feature engineering
   • Extracting feature values
   • Creating the final input file
   • Choosing an ML algorithm - Random Forest / Gradient Boost
   • Building the model
   • Tuning the model
   • Testing the model on the test data
   • Evaluating the results
6. Data Preprocessing and Cleaning
   Issues with the data:
   • A few records had attributes spilled over three rows
   • Some rows had more attributes than the required number
   What we did:
   • Wrote a Perl script to clean and format the data
   • Removed stop words using the NLTK package in Python
   • Converted all text to lower case
   • Removed special characters
   • Removed noise from the years field
   • Assigned an ID to each keyword and normalized it
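A rough sketch of the text normalization described above, in Python with NLTK. The column and file names are illustrative, and the Perl formatting step is assumed to have already run:

import re
import pandas as pd
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def clean_text(text) -> str:
    """Lower-case, strip special characters, and drop English stop words."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # remove special characters
    return " ".join(t for t in text.split() if t not in STOP_WORDS)

def clean_year(year) -> int:
    """Keep only plausible publication years; treat everything else as missing (0)."""
    try:
        y = int(year)
    except (TypeError, ValueError):
        return 0
    return y if 1800 <= y <= 2025 else 0

# Example usage on an illustrative Paper table.
paper = pd.read_csv("Paper.csv")
paper["Title"] = paper["Title"].map(clean_text)
paper["Keywords"] = paper["Keywords"].fillna("").map(clean_text)
paper["Year"] = paper["Year"].map(clean_year)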
7. Issues with Data - I
8. Feature Engineering Steps
   • Aggregation: combining multiple features into one.
     How we used it: elaborated the train file with AuthorID, PaperID, and Confirmation, combined with name and affiliation from the Author and PaperAuthor files.
   • Construction: creating new features out of original ones.
     How we used it: turned minyear and maxyear into active years.
   • Discretization: converting continuous features or variables into discretized or nominal features.
     How we used it: the year the paper was published; the max and min years the author was actively publishing papers.
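A sketch of the aggregation and construction steps in pandas. It assumes the train pairs have already been flattened to one (AuthorId, PaperId, Confirmed) row each; file and column names are illustrative rather than the project's exact ones:

import pandas as pd

train = pd.read_csv("Train.csv")                 # AuthorId, PaperId, Confirmed (illustrative)
author = pd.read_csv("Author.csv")               # Id, Name, Affiliation
paper = pd.read_csv("Paper.csv")                 # Id, Title, Year, ...
paper_author = pd.read_csv("PaperAuthor.csv")    # PaperId, AuthorId, Name, Affiliation

# Aggregation: elaborate each train pair with name/affiliation from Author and PaperAuthor.
elaborated = (
    train
    .merge(author.rename(columns={"Id": "AuthorId"}), on="AuthorId", how="left")
    .merge(paper_author, on=["AuthorId", "PaperId"], how="left",
           suffixes=("_author", "_paper_author"))
)

# Construction: collapse each author's first and last publication year into "active years".
years = (
    paper_author
    .merge(paper.rename(columns={"Id": "PaperId"})[["PaperId", "Year"]], on="PaperId")
    .groupby("AuthorId")["Year"]
    .agg(minyear="min", maxyear="max")
    .reset_index()
)
years["active_years"] = years["maxyear"] - years["minyear"] + 1
elaborated = elaborated.merge(years, on="AuthorId", how="left")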
9. Author features
   • distance between the author names in the PaperAuthor and Author files
   • matched substring ratio between the author names in the PaperAuthor and Author files
   • keywords used by a particular author (lower weight)
   • count of keywords for the author
   • count of co-authors for a given author
   • weighted TF-IDF measure of all author keywords inside the author's papers
   • count of distinct papers by the author
   • years during which the author wrote many papers
   • number of times an author is repeated (sum over distinct ids)
   • list of distinct ids assigned to the same author
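The two name-similarity features at the top of the list can be approximated with Python's standard library; this is a sketch using difflib's matching-block ratio, not necessarily the exact distance measure the team used:

from difflib import SequenceMatcher

def matched_substring_ratio(name_a: str, name_b: str) -> float:
    """Ratio of matching characters between two author name strings (0..1)."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def name_distance(name_a: str, name_b: str) -> float:
    """A simple distance derived from the similarity ratio."""
    return 1.0 - matched_substring_ratio(name_a, name_b)

# Example: name as recorded in PaperAuthor vs. the canonical name in Author.
print(matched_substring_ratio("J. Smith", "John Smith"))   # high -> likely the same person
print(name_distance("J. Smith", "Jane Doe"))               # large -> likely a mismatch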
10. Paper features
    • year of the paper
    • count of authors of the paper
    • count of duplicated papers for the paper
    • count of duplicated authors for the paper
    • count of keywords in the paper
    • how many times the exact same set of authors is repeated in different papers (without duplicates)
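Most of the paper features are simple group-by counts over the PaperAuthor table; a sketch with the same illustrative table names as above:

import pandas as pd

paper_author = pd.read_csv("PaperAuthor.csv")   # PaperId, AuthorId, Name, Affiliation

# count of authors of the paper
authors_per_paper = paper_author.groupby("PaperId")["AuthorId"].nunique().rename("n_authors")

# count of duplicated authors for the paper (authors listed more than once on the same paper)
dup_authors_per_paper = (
    paper_author.groupby(["PaperId", "AuthorId"]).size()
    .groupby("PaperId").apply(lambda s: int((s > 1).sum()))
    .rename("n_duplicated_authors")
)

paper_features = pd.concat([authors_per_paper, dup_authors_per_paper], axis=1).fillna(0)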
11. Paper-Author features
    • correct affiliation from the PaperAuthor table: binary feature
    • which year of the author's career the paper was published in (for papers from the author's first year this feature equals 1, for the second year 2, and so on)
    • count of sources: number of times the author-paper pair appears in the PaperAuthor table and matches the Author table
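The binary affiliation feature can be sketched as a normalized string match between the affiliation on the PaperAuthor row and the canonical affiliation in Author (again with illustrative names):

import pandas as pd

author = pd.read_csv("Author.csv")               # Id, Name, Affiliation
paper_author = pd.read_csv("PaperAuthor.csv")    # PaperId, AuthorId, Name, Affiliation

pa = paper_author.merge(
    author.rename(columns={"Id": "AuthorId", "Affiliation": "CanonicalAffiliation"})[
        ["AuthorId", "CanonicalAffiliation"]
    ],
    on="AuthorId", how="left",
)

def norm(s) -> str:
    return str(s).strip().lower()

# 1 if the affiliation on the paper-author row matches the author's canonical affiliation, else 0.
pa["affiliation_match"] = pa.apply(
    lambda r: int(norm(r["Affiliation"]) == norm(r["CanonicalAffiliation"])), axis=1
)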
12. Machine Learning Models Used
    Models used earlier:
    • K-means clustering
    • ZeroR
    • J48 decision tree
    • Naïve Bayes
13. Machine Learning Models Used
    • Random Forest
      - using Weka, Mahout, and H2O
    • Gradient Boost
      - using H2O
14. Build Random Forest using Weka
    With the elaborated train file
15. Build Random Forest using H2O
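A rough equivalent of the H2O random forest build in H2O's Python API; the file name, target column, and parameters here are illustrative, not the team's exact settings:

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Illustrative: the elaborated train file with the engineered features and a Confirmed target column.
train = h2o.import_file("elaborated_train.csv")
train["Confirmed"] = train["Confirmed"].asfactor()   # treat the target as a class label

features = [c for c in train.columns if c != "Confirmed"]
rf = H2ORandomForestEstimator(ntrees=100, max_depth=20, seed=42)
rf.train(x=features, y="Confirmed", training_frame=train)

print(rf.varimp(use_pandas=True))                    # feature importances, as on the next slide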
16. Feature Values Extraction & Importance
    • Count of papers for each author
    • Maximum active year for a given author
    • Minimum active year for a given author
    • Jaccard distance between the author name in the Author file and the PaperAuthor file
    • Jaccard distance between the affiliation in the Author file and the PaperAuthor file
    • The year the paper was published
    • Normalized keyword ids
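The Jaccard distance features compare the word-token sets of the two name (or affiliation) strings; a small self-contained sketch:

def jaccard_distance(text_a: str, text_b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over word tokens; 0 means identical token sets."""
    a, b = set(str(text_a).lower().split()), set(str(text_b).lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Author name as stored in the Author table vs. the name on the PaperAuthor row.
print(jaccard_distance("john smith", "smith john"))   # 0.0  - same tokens
print(jaccard_distance("john smith", "j smith"))      # 0.67 - partial overlap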
17. Random Forest using Mahout - Steps
    1. Put the data in HDFS.
       $HADOOP_HOME/bin/hadoop fs -mkdir testdata
       $HADOOP_HOME/bin/hadoop fs -put <local-data-path> testdata
    2. Build the job files. From $MAHOUT_HOME/, run:
       mvn clean install -DskipTests
    3. Generate a file descriptor for the dataset.
       $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/core/target/mahout-core--job.jar \
         org.apache.mahout.classifier.df.tools.Describe \
         -p testdata/KDDTrain+.arff -f testdata/KDDTrain+.info \
         -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
    4. Run the model.
       $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples--job.jar \
         org.apache.mahout.classifier.df.mapreduce.BuildForest \
         -Dmapred.max.split.size=1874231 \
         -d testdata/KDDTrain+.arff -ds testdata/KDDTrain+.info \
         -sl 5 -p -t 100 -o nsl-forest
    5. Use the decision forest to classify new data.
       $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples--job.jar \
         org.apache.mahout.classifier.df.mapreduce.TestForest \
         -i nsl-kdd/KDDTest+.arff -ds nsl-kdd/KDDTrain+.info \
         -m nsl-forest -a -mr -o predictions
18. Evaluation Metrics
    • Mean Average Precision
      The mean average precision for N users at position n is the average of the average precision of each user, i.e.
      MAP@n = (1/N) * Σ_{i=1}^{N} ap@n_i
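A small sketch of MAP@n for this task, where each "user" is an author and the ranked list is the papers predicted to belong to them; this is a generic implementation, not the competition's official scoring script:

def average_precision_at_n(ranked_paper_ids, true_paper_ids, n):
    """AP@n for one author: mean precision at each rank where a true paper appears."""
    true_set = set(true_paper_ids)
    hits, precisions = 0, []
    for rank, paper_id in enumerate(ranked_paper_ids[:n], start=1):
        if paper_id in true_set:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(true_set), n) if true_set else 0.0

def mean_average_precision_at_n(predictions, truths, n):
    """MAP@n over all authors: average of the per-author AP@n values."""
    ap_values = [average_precision_at_n(predictions[a], truths[a], n) for a in truths]
    return sum(ap_values) / len(ap_values)

# Tiny example: two authors, ranked candidate papers vs. confirmed papers.
preds = {"a1": [10, 11, 12], "a2": [20, 21, 22]}
truth = {"a1": [10, 12], "a2": [21]}
print(mean_average_precision_at_n(preds, truth, n=3))   # 0.667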
19. Lessons Learned!
    • Since we were given train and test data, supervised learning is the best fit; hence classification algorithms work better for the author-paper identification problem than clustering. Use unsupervised learning models such as clustering when you want to find patterns or structures in the data.
    • Choosing the features is the most important part: we can extract feature values from the given data and use them to build the model. Construct an initial set of features, build the model, and test its accuracy; this will not always give better results, so also construct features from features.
    • Choosing features from different data points gives better results than choosing them from only one.
    • Choose an initial set of weights for each feature based on its importance; this helps in model tuning.
20. Thank you!!
