SlideShare a Scribd company logo
1 of 25
Download to read offline
A Supervised Machine
Learning Algorithm for
Research Articles
Leonidas Akritidis, Panayiotis Bozanis
Dept. of Computer & Communication Engineering,
University of Thessaly, Greece
28th ACM Symposium on Applied Computing
Coimbra, Portugal, 18-22 March 2013
ACM SAC 2013, Coimbra, Portugal 2University of Thessaly
Research Article Classification
 We study the problem of categorizing
research articles (papers).
 In contrast to regular documents, the
research articles have their own features
(authors, co-authors, references,
publishing journal/conference, etc)
 Therefore, the classification of research
articles poses some unique challenges.
ACM SAC 2013, Coimbra, Portugal 3University of Thessaly
Problem Importance
 Allows users search only a specific portion
of the document collection.
 Digital libraries organize their content
 Helps users find similar items
 Facilitates the creation of robust search
tools
Recommendation
Query expansion, etc
ACM SAC 2013, Coimbra, Portugal 4University of Thessaly
Existing Approaches (1)
 Keyword extraction algorithms
The identify repeated textual patterns and
extract the most important terms
In the sequel, they employ traditional
classification approaches such as kNN.
 Machine learning (ML) approaches
Support Vector Machines (SVM)
AdaBoost.MH
ACM SAC 2013, Coimbra, Portugal 5University of Thessaly
Existing Approaches (2)
 Citation Analysis Algorithms
They exploit linkage information (e.g. multiple
articles cited together by a group of papers)
ACM SAC 2013, Coimbra, Portugal 6University of Thessaly
Our Approach
 Employs a set of labels C and a set of pre-
classified research papers - training set T.
 It trains a model based on C and T.
 The training phase takes into account the
particular features of the problem including
The authors history
Co-authorship information
Keywords selection
The previous publications
ACM SAC 2013, Coimbra, Portugal 7University of Thessaly
Preliminaries
 Our analysis involves four sets:
K: The set of all keywords (k  K)
A: The set of all authors (a  A)
J: The set of all journals (j  J)
C: The set of all labels (c  C)
 Multiple subsets, i.e.:
Kp
: The subset of keywords of an article p.
 And multiple frequency values i.e.:
|Pk
|: The number of articles containing k.
ACM SAC 2013, Coimbra, Portugal 8University of Thessaly
Training Phase
 During the training process we build three
relevance description vectors (RDV):
K : Denotes how frequently each keyword k
has been correlated with each field c.
A : Denotes how frequently each author a
has been correlated with each field c.
J : Denotes how frequently each journal j has
been correlated with each field c.
 Zero freqs are pruned from the vectors.
ACM SAC 2013, Coimbra, Portugal 9University of Thessaly
Model Training (Phase 1)
 In this phase we
populate the K RDV.
 Initially we extract all
keywords and labels
of the paper (3-4).
 for each pair (k,c) we search within K
If the (keyword, label) is not found, we set the
corresponding frequency equal to 1 (10-11).
Otherwise, we increase the frequency by 1 (13).
ACM SAC 2013, Coimbra, Portugal 10University of Thessaly
Model Training (Phase 2)
 In this phase we
populate the A RDV.
 We work similarly to
the previous case, but
we take into account
co-authorship data.
 Apart from (author, label) pairs, we also
store (author, co-author, label) triples.
ACM SAC 2013, Coimbra, Portugal 11University of Thessaly
Vector A
 The authors usually publish articles in
more research fields.
 The produced RDV A stores information
which demonstrates:
How frequently each author a has been
correlated with each label c.
How frequently each author a has been
correlated with each label c when co-authored
articles with a'.
ACM SAC 2013, Coimbra, Portugal 12University of Thessaly
Co-authorship in Vector A
 For instance, when an arbitrary author A
co-operates with B, s/he publishes articles
related to IR, whereas when co-operates
with C, publishes articles related to DM.
 Consequently, when we classify an
unlabeled article authored by A and B, we
know that probably the article discusses
an IR topic.
ACM SAC 2013, Coimbra, Portugal 13University of Thessaly
Model Training (Phase 3)
 In this phase we
populate the J RDV.
 There is only one
journal j, hence, the
process is simpler.
 for each pair (j,c) we search within J
If the (journal, label) is not found, we set the
corresponding frequency equal to 1 (36-37).
Otherwise, we increase the frequency by 1 (39).
ACM SAC 2013, Coimbra, Portugal 14University of Thessaly
Classification Process
 The three relevance description vectors
K, A, and J are now used to label to the
unclassified articles.
 For each article, each label is assigned
three partial scores according to the
article’s keywords, authors, journals, and
the contents of K, A, and J.
 The final score is a linear combination of
the three partial scores.
ACM SAC 2013, Coimbra, Portugal 15University of Thessaly
Articles Classification (Phase 1)
 In the first phase we
extract the keywords
of the unlabeled item
 For each keyword k we do a search in K.
 If k K, we retrieve the list of the correlated
labels and for each label c we compute its
partial score Sk
c
by using:
The frequencies |Pk
|, |Pk,c
|.
A scoring function Fk (i.e. IDF, log|Pk,c
|/ |Pk
|).
ACM SAC 2013, Coimbra, Portugal 16University of Thessaly
Articles Classification (Ph. 2-a)
 In this phase we use
the authors vector A.
 Initially, we check if
each paper author a
has co-operated with
the other authors.
 In case a pair (a,a') A we compute the
partial score Sa
c
of each label c by retrieving
the corresponding co-authorship
frequencies
ACM SAC 2013, Coimbra, Portugal 17University of Thessaly
Articles Classification (Ph. 2-b)
 Co-Authorship Frequencies
|Paa'
|: The articles coauthored by a and a'.
|Paa'c
|: » » » which are labeled as c.
 In the opposite case where a has not co-
operated with any of the other authors, we
compute the label score Sa
c
by using the
plain author frequencies:
|Pa
|: The number of articles authored by a.
|Pa,c
|: » » » which are labeled as c.
ACM SAC 2013, Coimbra, Portugal 18University of Thessaly
Articles Classification (Phase 3)
 Finally, we exploit the
history of the article’s
publishing journal.
 The score is computed by using the
frequencies stored in the J RDV:
|Pj
|: The number of articles published by j.
|Pj,c
|: » » » which are labeled as c.
ACM SAC 2013, Coimbra, Portugal 19University of Thessaly
Label scores
 The label score is computed as a linear
combination of the three partial scores:
 wk, wa, wj are constants used to tune the
contribution of keywords, authors, journals
ACM SAC 2013, Coimbra, Portugal 20University of Thessaly
Experimental Setup
 We used CiteSeerX dataset, a collection
comprised of 1.8 million research papers.
 We used three sets of labels, all based on
the IEEE/ACM taxonomy:
C11: A set of the 11 top-level categories.
C81: A set of 81 mid-level categories.
C276: A set of 276 third-level categories.
 IEEE/ACM labeled articles: 1.1 million
ACM SAC 2013, Coimbra, Portugal 21University of Thessaly
Experiments
 We compared our algorithm against the
state-of-the-art ML methods: SVM and
AdaBoost.MH.
 We created three training sets of 10,000
100,000 and 1.1 million articles.
 Statistics for the “large” training set:
ACM SAC 2013, Coimbra, Portugal 22University of Thessaly
Evaluation
 We split the training set in three equally
sized parts.
 We used the first two thirds to build K, A,
and J.
 The last third was used for evaluation.
 We checked all the possible combinations
for wk, wa, wj .
ACM SAC 2013, Coimbra, Portugal 23University of Thessaly
Accuracy
 Our approach outperformed the adversary
generic ML algorithms.
 Our method: 81%-
96%
 SVM: 78%-94%.
 ADA: 80%-88%
Some experiments
did not finish!
ACM SAC 2013, Coimbra, Portugal 24University of Thessaly
Conclusions
 We presented a supervised machine
learning approach for classifying research
articles.
 Our algorithm takes into consideration the
specific aspects of this particular problem.
 It outperforms the adversary approaches:
Better performance by about 6%, whereas
AdaBoost fails to complete in large datasets.
ACM SAC 2013, Coimbra, Portugal 25University of Thessaly
Thank you for watching
Any questions?

More Related Content

Similar to A Supervised Machine Learning Algorithm for Research Articles

Visualizing UML’s Sequence and Class Diagrams Using Graph-Based Clusters
Visualizing UML’s Sequence and   Class Diagrams Using Graph-Based Clusters  Visualizing UML’s Sequence and   Class Diagrams Using Graph-Based Clusters
Visualizing UML’s Sequence and Class Diagrams Using Graph-Based Clusters Nakul Sharma
 
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...Athens Big Data
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEnrico Daga
 
Data.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and predictionData.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and predictionMargaret Wang
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016ijcsbi
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.pptIndra Hermawan
 
Aad introduction
Aad introductionAad introduction
Aad introductionMr SMAK
 
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...IRJET Journal
 
HOP-Rec_RecSys18
HOP-Rec_RecSys18HOP-Rec_RecSys18
HOP-Rec_RecSys18Matt Yang
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...IRJET Journal
 
A comparative study of three validities computation methods for multimodel ap...
A comparative study of three validities computation methods for multimodel ap...A comparative study of three validities computation methods for multimodel ap...
A comparative study of three validities computation methods for multimodel ap...IJECEIAES
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Editor IJCATR
 
se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_
se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_
se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_WilheminaRossi174
 
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinBag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinKarlos Svoboda
 
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...idescitation
 

Similar to A Supervised Machine Learning Algorithm for Research Articles (20)

Visualizing UML’s Sequence and Class Diagrams Using Graph-Based Clusters
Visualizing UML’s Sequence and   Class Diagrams Using Graph-Based Clusters  Visualizing UML’s Sequence and   Class Diagrams Using Graph-Based Clusters
Visualizing UML’s Sequence and Class Diagrams Using Graph-Based Clusters
 
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
13th Athens Big Data Meetup - 2nd Talk - Training Neural Networks With Enterp...
 
Presentation
PresentationPresentation
Presentation
 
Early Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data CubesEarly Analysis and Debuggin of Linked Open Data Cubes
Early Analysis and Debuggin of Linked Open Data Cubes
 
Data.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and predictionData.Mining.C.6(II).classification and prediction
Data.Mining.C.6(II).classification and prediction
 
FinalPres
FinalPresFinalPres
FinalPres
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.ppt
 
Aad introduction
Aad introductionAad introduction
Aad introduction
 
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
A Hybrid Data Clustering Approach using K-Means and Simplex Method-based Bact...
 
HOP-Rec_RecSys18
HOP-Rec_RecSys18HOP-Rec_RecSys18
HOP-Rec_RecSys18
 
Av33274282
Av33274282Av33274282
Av33274282
 
Av33274282
Av33274282Av33274282
Av33274282
 
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
An Optimized Parallel Algorithm for Longest Common Subsequence Using Openmp –...
 
Extracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept AnalysisExtracting biclusters of similar values with Triadic Concept Analysis
Extracting biclusters of similar values with Triadic Concept Analysis
 
A comparative study of three validities computation methods for multimodel ap...
A comparative study of three validities computation methods for multimodel ap...A comparative study of three validities computation methods for multimodel ap...
A comparative study of three validities computation methods for multimodel ap...
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
 
se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_
se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_
se_lectures.DS_Store__MACOSXse_lectures._.DS_Storese_
 
Bag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse CodinBag of Pursuits and Neural Gas for Improved Sparse Codin
Bag of Pursuits and Neural Gas for Improved Sparse Codin
 
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...
An Automatic Medical Image Segmentation using Teaching Learning Based Optimiz...
 

More from Leonidas Akritidis

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationLeonidas Akritidis
 
A Self-Pruning Classification Model for News
A Self-Pruning Classification Model for NewsA Self-Pruning Classification Model for News
A Self-Pruning Classification Model for NewsLeonidas Akritidis
 
Effective Products Categorization with Importance Scores and Morphological An...
Effective Products Categorization with ImportanceScores and Morphological An...Effective Products Categorization with ImportanceScores and Morphological An...
Effective Products Categorization with Importance Scores and Morphological An...Leonidas Akritidis
 
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...Leonidas Akritidis
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesLeonidas Akritidis
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceLeonidas Akritidis
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesLeonidas Akritidis
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterLeonidas Akritidis
 

More from Leonidas Akritidis (8)

An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank AggregationAn Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
An Iterative Distance-Based Model for Unsupervised Weighted Rank Aggregation
 
A Self-Pruning Classification Model for News
A Self-Pruning Classification Model for NewsA Self-Pruning Classification Model for News
A Self-Pruning Classification Model for News
 
Effective Products Categorization with Importance Scores and Morphological An...
Effective Products Categorization with ImportanceScores and Morphological An...Effective Products Categorization with ImportanceScores and Morphological An...
Effective Products Categorization with Importance Scores and Morphological An...
 
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
Supervised Papers Classification on Large-Scale High-Dimensional Data with Ap...
 
Effective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product TitlesEffective Unsupervised Matching of Product Titles
Effective Unsupervised Matching of Product Titles
 
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduceComputing Scientometrics in Large-Scale Academic Search Engines with MapReduce
Computing Scientometrics in Large-Scale Academic Search Engines with MapReduce
 
Positional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted IndexesPositional Data Organization and Compression in Web Inverted Indexes
Positional Data Organization and Compression in Web Inverted Indexes
 
Identifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does MatterIdentifying Influential Bloggers: Time Does Matter
Identifying Influential Bloggers: Time Does Matter
 

Recently uploaded

Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..MaherOthman7
 
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024EMMANUELLEFRANCEHELI
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisDr.Costas Sachpazis
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Studentskannan348865
 
Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligencemahaffeycheryld
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.benjamincojr
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashidFaiyazSheikh
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1T.D. Shashikala
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New HorizonMorshed Ahmed Rahath
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsVIEW
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualBalamuruganV28
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfJNTUA
 
Intro to Design (for Engineers) at Sydney Uni
Intro to Design (for Engineers) at Sydney UniIntro to Design (for Engineers) at Sydney Uni
Intro to Design (for Engineers) at Sydney UniR. Sosa
 
Passive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptPassive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptamrabdallah9
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxkalpana413121
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdfKamal Acharya
 
Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...IJECEIAES
 
Autodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxAutodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxMustafa Ahmed
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfJNTUA
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailingAshishSingh1301
 

Recently uploaded (20)

Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..
 
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
NEWLETTER FRANCE HELICES/ SDS SURFACE DRIVES - MAY 2024
 
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas SachpazisSeismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
Seismic Hazard Assessment Software in Python by Prof. Dr. Costas Sachpazis
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Students
 
Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligence
 
electrical installation and maintenance.
electrical installation and maintenance.electrical installation and maintenance.
electrical installation and maintenance.
 
Raashid final report on Embedded Systems
Raashid final report on Embedded SystemsRaashid final report on Embedded Systems
Raashid final report on Embedded Systems
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, Functions
 
Final DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manualFinal DBMS Manual (2).pdf final lab manual
Final DBMS Manual (2).pdf final lab manual
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Intro to Design (for Engineers) at Sydney Uni
Intro to Design (for Engineers) at Sydney UniIntro to Design (for Engineers) at Sydney Uni
Intro to Design (for Engineers) at Sydney Uni
 
Passive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptPassive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.ppt
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Insurance management system project report.pdf
Insurance management system project report.pdfInsurance management system project report.pdf
Insurance management system project report.pdf
 
Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...
 
Autodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxAutodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptx
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailing
 

A Supervised Machine Learning Algorithm for Research Articles

  • 1. A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University of Thessaly, Greece 28th ACM Symposium on Applied Computing Coimbra, Portugal, 18-22 March 2013
  • 2. ACM SAC 2013, Coimbra, Portugal 2University of Thessaly Research Article Classification  We study the problem of categorizing research articles (papers).  In contrast to regular documents, the research articles have their own features (authors, co-authors, references, publishing journal/conference, etc)  Therefore, the classification of research articles poses some unique challenges.
  • 3. ACM SAC 2013, Coimbra, Portugal 3University of Thessaly Problem Importance  Allows users search only a specific portion of the document collection.  Digital libraries organize their content  Helps users find similar items  Facilitates the creation of robust search tools Recommendation Query expansion, etc
  • 4. ACM SAC 2013, Coimbra, Portugal 4University of Thessaly Existing Approaches (1)  Keyword extraction algorithms The identify repeated textual patterns and extract the most important terms In the sequel, they employ traditional classification approaches such as kNN.  Machine learning (ML) approaches Support Vector Machines (SVM) AdaBoost.MH
  • 5. ACM SAC 2013, Coimbra, Portugal 5University of Thessaly Existing Approaches (2)  Citation Analysis Algorithms They exploit linkage information (e.g. multiple articles cited together by a group of papers)
  • 6. ACM SAC 2013, Coimbra, Portugal 6University of Thessaly Our Approach  Employs a set of labels C and a set of pre- classified research papers - training set T.  It trains a model based on C and T.  The training phase takes into account the particular features of the problem including The authors history Co-authorship information Keywords selection The previous publications
  • 7. ACM SAC 2013, Coimbra, Portugal 7University of Thessaly Preliminaries  Our analysis involves four sets: K: The set of all keywords (k  K) A: The set of all authors (a  A) J: The set of all journals (j  J) C: The set of all labels (c  C)  Multiple subsets, i.e.: Kp : The subset of keywords of an article p.  And multiple frequency values i.e.: |Pk |: The number of articles containing k.
  • 8. ACM SAC 2013, Coimbra, Portugal 8University of Thessaly Training Phase  During the training process we build three relevance description vectors (RDV): K : Denotes how frequently each keyword k has been correlated with each field c. A : Denotes how frequently each author a has been correlated with each field c. J : Denotes how frequently each journal j has been correlated with each field c.  Zero freqs are pruned from the vectors.
  • 9. ACM SAC 2013, Coimbra, Portugal 9University of Thessaly Model Training (Phase 1)  In this phase we populate the K RDV.  Initially we extract all keywords and labels of the paper (3-4).  for each pair (k,c) we search within K If the (keyword, label) is not found, we set the corresponding frequency equal to 1 (10-11). Otherwise, we increase the frequency by 1 (13).
  • 10. ACM SAC 2013, Coimbra, Portugal 10University of Thessaly Model Training (Phase 2)  In this phase we populate the A RDV.  We work similarly to the previous case, but we take into account co-authorship data.  Apart from (author, label) pairs, we also store (author, co-author, label) triples.
  • 11. ACM SAC 2013, Coimbra, Portugal 11University of Thessaly Vector A  The authors usually publish articles in more research fields.  The produced RDV A stores information which demonstrates: How frequently each author a has been correlated with each label c. How frequently each author a has been correlated with each label c when co-authored articles with a'.
  • 12. ACM SAC 2013, Coimbra, Portugal 12University of Thessaly Co-authorship in Vector A  For instance, when an arbitrary author A co-operates with B, s/he publishes articles related to IR, whereas when co-operates with C, publishes articles related to DM.  Consequently, when we classify an unlabeled article authored by A and B, we know that probably the article discusses an IR topic.
  • 13. ACM SAC 2013, Coimbra, Portugal 13University of Thessaly Model Training (Phase 3)  In this phase we populate the J RDV.  There is only one journal j, hence, the process is simpler.  for each pair (j,c) we search within J If the (journal, label) is not found, we set the corresponding frequency equal to 1 (36-37). Otherwise, we increase the frequency by 1 (39).
  • 14. ACM SAC 2013, Coimbra, Portugal 14University of Thessaly Classification Process  The three relevance description vectors K, A, and J are now used to label to the unclassified articles.  For each article, each label is assigned three partial scores according to the article’s keywords, authors, journals, and the contents of K, A, and J.  The final score is a linear combination of the three partial scores.
  • 15. ACM SAC 2013, Coimbra, Portugal 15University of Thessaly Articles Classification (Phase 1)  In the first phase we extract the keywords of the unlabeled item  For each keyword k we do a search in K.  If k K, we retrieve the list of the correlated labels and for each label c we compute its partial score Sk c by using: The frequencies |Pk |, |Pk,c |. A scoring function Fk (i.e. IDF, log|Pk,c |/ |Pk |).
  • 16. ACM SAC 2013, Coimbra, Portugal 16University of Thessaly Articles Classification (Ph. 2-a)  In this phase we use the authors vector A.  Initially, we check if each paper author a has co-operated with the other authors.  In case a pair (a,a') A we compute the partial score Sa c of each label c by retrieving the corresponding co-authorship frequencies
  • 17. ACM SAC 2013, Coimbra, Portugal 17University of Thessaly Articles Classification (Ph. 2-b)  Co-Authorship Frequencies |Paa' |: The articles coauthored by a and a'. |Paa'c |: » » » which are labeled as c.  In the opposite case where a has not co- operated with any of the other authors, we compute the label score Sa c by using the plain author frequencies: |Pa |: The number of articles authored by a. |Pa,c |: » » » which are labeled as c.
  • 18. ACM SAC 2013, Coimbra, Portugal 18University of Thessaly Articles Classification (Phase 3)  Finally, we exploit the history of the article’s publishing journal.  The score is computed by using the frequencies stored in the J RDV: |Pj |: The number of articles published by j. |Pj,c |: » » » which are labeled as c.
  • 19. ACM SAC 2013, Coimbra, Portugal 19University of Thessaly Label scores  The label score is computed as a linear combination of the three partial scores:  wk, wa, wj are constants used to tune the contribution of keywords, authors, journals
  • 20. ACM SAC 2013, Coimbra, Portugal 20University of Thessaly Experimental Setup  We used CiteSeerX dataset, a collection comprised of 1.8 million research papers.  We used three sets of labels, all based on the IEEE/ACM taxonomy: C11: A set of the 11 top-level categories. C81: A set of 81 mid-level categories. C276: A set of 276 third-level categories.  IEEE/ACM labeled articles: 1.1 million
  • 21. ACM SAC 2013, Coimbra, Portugal 21University of Thessaly Experiments  We compared our algorithm against the state-of-the-art ML methods: SVM and AdaBoost.MH.  We created three training sets of 10,000 100,000 and 1.1 million articles.  Statistics for the “large” training set:
  • 22. ACM SAC 2013, Coimbra, Portugal 22University of Thessaly Evaluation  We split the training set in three equally sized parts.  We used the first two thirds to build K, A, and J.  The last third was used for evaluation.  We checked all the possible combinations for wk, wa, wj .
  • 23. ACM SAC 2013, Coimbra, Portugal 23University of Thessaly Accuracy  Our approach outperformed the adversary generic ML algorithms.  Our method: 81%- 96%  SVM: 78%-94%.  ADA: 80%-88% Some experiments did not finish!
  • 24. ACM SAC 2013, Coimbra, Portugal 24University of Thessaly Conclusions  We presented a supervised machine learning approach for classifying research articles.  Our algorithm takes into consideration the specific aspects of this particular problem.  It outperforms the adversary approaches: Better performance by about 6%, whereas AdaBoost fails to complete in large datasets.
  • 25. ACM SAC 2013, Coimbra, Portugal 25University of Thessaly Thank you for watching Any questions?