Author-Paper Identification Problem
Team :
Karthik Reddy Vakati
Nachammai C
Pooja Mishra
Guided By
Prof Duc Tran
Problem Statement
• Determine the correct author, from the author dataset, for a particular paper.
• Ambiguity in author names can cause a paper to be assigned to the wrong author, which leads to noisy author profiles.
• This KDD Cup task challenges participants to determine which papers in an author profile were truly written by a given author.
Type of data
The data provided by the KDD Cup challenge is in CSV format.
• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author (Id, Name, Affiliation)
• Paper-Author (PaperId, AuthorId, Name, Affiliation)
• Conference (Id, ShortName, FullName, HomePage)
• Journal (Id, ShortName, FullName, HomePage)
• Train (AuthorId, ConfirmedPaperIds, DeletedPaperIds)
• Test (AuthorId, PaperIds)
• Validation (AuthorId, PaperIds, Usage)
Data Points
The data points include all papers attributed to an author and the author's affiliation (university, technical society, or group).
• Paper-Author (PaperId, AuthorId, Name, Affiliation)
The metadata includes the journals and conferences in which the author's papers appeared.
• Paper (Id, Title, Year, ConferenceId, JournalId, Keywords)
• Author (Id, Name, Affiliation)
• Conference (Id, ShortName, FullName, HomePage)
• Journal (Id, ShortName, FullName, HomePage)
Issues with data
The CSV files needed cleaning:
• A few records had attributes spilled over three rows.
• Some rows had more attributes than the required number.
• Special characters caused issues.
We wrote a Perl script to clean and format the data.
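The Perl script itself is not shown in the slides. Purely as an illustration, the Python sketch below shows one way the two structural problems listed above could be repaired: records spilled over several lines, and rows with too many attributes. The helper name `repair_rows`, the merging rules, and the sample data are assumptions, not the team's actual script, and special-character handling is left out.

```python
import csv
import io

def repair_rows(text, n_fields):
    """Yield rows with exactly n_fields columns (hypothetical helper).

    Assumptions: a short row is a record that spilled onto the following
    line(s); a long row has stray commas inside its last column.
    """
    pending = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if pending:
            # Glue the continuation onto the last field of the pending record.
            pending[-1] = (pending[-1] + " " + row[0]).strip()
            pending.extend(row[1:])
        else:
            pending = list(row)
        if len(pending) < n_fields:
            continue  # record still incomplete, keep reading
        if len(pending) > n_fields:
            # Fold the surplus fields back into the last expected column.
            head = pending[:n_fields - 1]
            tail = ",".join(pending[n_fields - 1:])
            pending = head + [tail]
        yield pending
        pending = []

# Made-up rows in the Author (Id, Name, Affiliation) layout.
raw = "1,Jane Doe,Univ A\n2,John\nSmith,Univ B\n3,Ann Lee,Dept X, Univ C\n"
for fixed in repair_rows(raw, 3):
    print(fixed)
```

The merging rule is a heuristic: it cannot tell a spilled record from a genuinely truncated one, so repaired rows would still need spot checks.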
Issues with data-I
Issues with data-II
Predictions & Intuitions
Prediction:
• Given a paper and an author, one should be able to identify whether the paper was written by that author.
Intuition:
• We initially framed this as a clustering problem: the papers written by one author can be grouped together, and for a given paper and author we can then check whether the paper falls in that author's cluster.
• We expected the features PaperId, AuthorId, PaperTitle, and AuthorName to play a significant role in the prediction.
Feature selection
We used the following features from the Train dataset while building the model:
• ConfirmedPaperIds
• DeletedPaperIds
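To make the role of these two fields concrete: each Train row can be expanded into (author, paper, label) examples, with confirmed papers as positives and deleted papers as negatives. The sketch below assumes the IDs inside each CSV cell are separated by spaces, which the slides do not state; `expand_train_rows` and the sample IDs are hypothetical.

```python
import csv
import io

def expand_train_rows(csv_text):
    """Turn Train rows into (author_id, paper_id, label) tuples.

    label is 1 for ConfirmedPaperIds and 0 for DeletedPaperIds.
    Assumes IDs inside a cell are separated by whitespace.
    """
    examples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        author = int(row["AuthorId"])
        for paper in row["ConfirmedPaperIds"].split():
            examples.append((author, int(paper), 1))
        for paper in row["DeletedPaperIds"].split():
            examples.append((author, int(paper), 0))
    return examples

# Made-up sample in the Train layout (AuthorId, ConfirmedPaperIds, DeletedPaperIds).
sample = "AuthorId,ConfirmedPaperIds,DeletedPaperIds\n7,101 102,103\n8,201,\n"
print(expand_train_rows(sample))
```

Framed this way, the task becomes a binary decision per (author, paper) pair, which is the form a classifier expects.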
Tools Used & Models Trained
Tools Used:
• Weka
• R
• Apache Mahout
Models Trained:
• Simple K-Means
• J48
• ZeroR
K-means clustering using Weka
Training the data
Visualization of k-means clustering result
Simple K-means clustering using R
Error in R for Clustering
> y=read.table("Paper_fixed.csv",header=TRUE,sep=',')
> y[1:10,]
> km3 <- kmeans(x,3)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(x, 3) : NAs introduced by coercion
Conclusion
Why does clustering not work for this problem?
• Handling a mixed set of attributes (numeric and text) is an issue in R.
• Simple k-means works by computing distances from centroids, so it needs numeric attributes. Hence clustering is not the best approach for our problem.
• To work around this, we are trying to convert the data into numeric integer values so that numeric distance measures can be applied.
• However, this problem looks more like a classification problem: classifying whether a paper was written by a given author.
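The point about numeric attributes can be illustrated with a small sketch. The Euclidean distance that k-means relies on is undefined for raw strings, so text columns must first be mapped to numbers. The one-hot encoding below is one common choice and is an assumption here; the slides mention integer codes but do not show the conversion the team used. The names `one_hot` and `euclidean` and the sample values are hypothetical.

```python
import math

def one_hot(values):
    """Map each categorical value to a 0/1 vector, one slot per distinct value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0.0] * len(categories)
        vec[index[v]] = 1.0
        vectors.append(vec)
    return vectors

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up affiliations; the year column is already numeric.
affiliations = ["Univ A", "Univ B", "Univ A"]
years = [2001, 2005, 2003]
rows = [enc + [float(y)] for enc, y in zip(one_hot(affiliations), years)]

# Same affiliation, years two apart: only the year column contributes.
print(euclidean(rows[0], rows[2]))
```

Plain integer codes would also make the data numeric, but they impose an arbitrary ordering on categories, so distances between codes carry no meaning. Columns on very different scales, such as the year here, would also need rescaling before clustering.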
Moving on to classification algorithms
• ZeroR
• J48 (decision tree)
• Naïve Bayes
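ZeroR is the simplest of these: it ignores every feature and always predicts the most frequent class in the training data, which makes it a useful baseline for judging the other classifiers. A minimal sketch of that idea follows. The class name `ZeroR` mirrors Weka's classifier name, but this is not Weka's implementation, and the labels are made up.

```python
from collections import Counter

class ZeroR:
    """Baseline classifier: always predicts the majority training label."""

    def fit(self, labels):
        # Remember the most common label; features are never consulted.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        # Return the same label for each of the n test instances.
        return [self.majority_] * n

# Made-up labels: 1 = paper confirmed for the author, 0 = paper deleted.
train_labels = [1, 1, 0, 1, 0]
test_labels = [1, 0, 1, 1]

model = ZeroR().fit(train_labels)
preds = model.predict(len(test_labels))
accuracy = sum(p == t for p, t in zip(preds, test_labels)) / len(test_labels)
print(preds, accuracy)
```

Any classifier that uses real features should beat this accuracy; if it does not, the features are adding nothing.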
Results using Tree-J48 algorithm
Results using ZeroR algorithm
Visualization of ZeroR results for Precision
Next Steps
• We are working on feature engineering, specifically feature transformation: converting the author name attribute into a common format for all author names.
• Once the feature engineering is done, we will work principally on Naïve Bayes and other classification algorithms that we think will suit our problem.
• Then we will fine-tune the model.
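As an illustration of what transforming author names into a common format could look like, the sketch below lowercases names, strips accents and punctuation, and reorders "Last, First" into "first last". These rules and the function name `normalize_author_name` are assumptions; the slides do not specify the target format.

```python
import re
import unicodedata

def normalize_author_name(name):
    """Return a lowercase, accent-free, 'first last' form of a name.

    Hypothetical rules for illustration; only ASCII letters survive.
    """
    # Reorder "Last, First" to "First Last".
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    # Strip accents: decompose characters, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn anything that is not a letter or space into a space.
    name = re.sub(r"[^a-z\s]", " ", name.lower())
    # Collapse runs of whitespace.
    return " ".join(name.split())

# Made-up names showing variants that map to one form.
for raw in ["Doe, Jane A.", "Jane A. Doe", "José García-López"]:
    print(raw, "->", normalize_author_name(raw))
```

With a shared format, the Name fields in Author and Paper-Author can be compared directly, for example as an exact-match feature for the classifier.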
Thank you!!
