A Self-Pruning Classification Model for News
L. Akritidis (1), A. Fevgas (2), P. Bozanis (3), M. Alamaniotis (4)
(1,2,3) DaSE Lab – Dpt. of Electrical & Computer Engineering – Univ. of Thessaly
(4) Dpt. of Electrical and Computer Engineering – Univ. of Texas at San Antonio
10th International Conference on Information,
Intelligence, Systems and Applications (IISA 2019)
July 15-17, 2019, Patras, Greece
News Aggregators
• Collect news feeds from various sources (news sites,
portals, etc.).
• They offer two main services:
– They group similar articles into the same cluster.
– They classify the articles into a predefined set of
categories/classes (politics, economics, athletics, etc.).
• We address the second problem, i.e., news articles
classification.
L. Akritidis, A. Fevgas, P. Bozanis, M. Alamaniotis – IISA 2019, July 15-17, 2019, Patras, Greece
News Articles Categorization
• One of the most important problems in the area.
• Given a news article and a set of categories, we must
determine the category to which the article belongs.
• It leads to numerous novel applications:
– Query expansion and rewriting.
– Relevant/Similar articles retrieval.
– Personalized recommendations, etc.
Theoretical Background
• Let Y be the set of all categories.
• An article x can only be assigned one category y.
• Each article x is described by its title τ.
• The title consists of a sequence of words
(w1, w2, …, wlx), where lx is the number of words in the title.
N-grams vs. words
• The proposed method performs morphological
analysis of the titles and extracts n-grams of
variable sizes.
• We employ n-grams because an n-gram is less
ambiguous than a single word.
• For instance, the word company may be correlated
with multiple diverse categories (Economics,
Politics, or even Society).
• However, the bigram company shares is much more
specific.
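The extraction of variable-size n-grams from a title can be sketched in a few lines. The sketch assumes a simple lowercased whitespace tokenization, which the slides do not specify:

```python
def extract_tokens(title, max_n=3):
    """Extract all n-grams of sizes 1..max_n from a title."""
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

print(extract_tokens("Company shares rise", max_n=2))
# → ['company', 'shares', 'rise', 'company shares', 'shares rise']
```

Note how the bigram "company shares" is produced alongside the ambiguous unigram "company".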
Tokens and Ambiguity
• We collectively refer to all n-grams and words as
tokens.
• Each token has its own level of ambiguity.
– Ambiguous tokens are not tightly correlated with a
single category; they can be connected with multiple
categories.
• Equivalently, each token carries a different level of importance.
• As the previous example suggests, longer tokens
are less ambiguous and, therefore, more important.
Importance Scores
• Furthermore, a token is more important if it has
been correlated with only a few categories.
• Based on these observations, we introduce the
following importance score It for a token t, where:
– |Y|: the total number of categories,
– ft: the frequency of t, i.e., the number of categories
that have been correlated with t, and
– lt: the length (in words) of t.
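The importance score itself appears in the slides only as an image. As an illustration, here is one hypothetical formula with the stated properties: it grows with the token length lt, shrinks with the category frequency ft, and stays in (0, 1]. The paper's exact equation may differ:

```python
def importance(f_t, l_t, num_categories, max_len=3):
    # Hypothetical stand-in for the slide's image-only formula: the score
    # grows with the token length l_t and shrinks as t spreads over more
    # of the |Y| categories, staying in (0, 1] for thresholding.
    return (l_t / max_len) * (1.0 - (f_t - 1) / num_categories)

# A 3-word token tied to a single category gets the maximum score:
print(importance(f_t=1, l_t=3, num_categories=4))  # → 1.0
```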
Categorization Model: Training Phase
• The training phase builds a lexicon L for the tokens.
Each entry in the lexicon stores:
• The token t,
• Its frequency ft,
• Its importance score It,
• A list Λt which, for each token-category relationship,
includes a pair of the form (y, ft,y), where y is a
category and ft,y is a frequency value that reflects
how many times t has been correlated with the
category y.
Model Training Algorithm
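The slides show the training algorithm only as a figure. A minimal sketch under the definitions above follows; the tokenizer and the importance formula used here are hypothetical stand-ins, not the paper's exact choices:

```python
from collections import defaultdict

def extract_tokens(title, max_n=3):
    # Variable-size n-grams over a lowercased whitespace tokenization.
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def train(articles, num_categories, max_n=3):
    """Build the lexicon L: token -> {ft, It, Lambda}, where Lambda maps
    each category y to ft,y (how many times t was seen with y)."""
    lexicon = {}
    for title, category in articles:
        for t in extract_tokens(title, max_n):
            entry = lexicon.setdefault(t, {"ft": 0, "It": 0.0,
                                           "Lambda": defaultdict(int)})
            entry["Lambda"][category] += 1
    for t, entry in lexicon.items():
        entry["ft"] = len(entry["Lambda"])  # distinct categories of t
        l_t = len(t.split())
        # Hypothetical importance formula (the real one is image-only).
        entry["It"] = (l_t / max_n) * (1 - (entry["ft"] - 1) / num_categories)
    return lexicon

lex = train([("Company shares surge", "economics"),
             ("Company announces new policy", "politics")],
            num_categories=4)
print(lex["company"]["ft"], round(lex["company"]["It"], 2))  # → 2 0.25
```

The ambiguous unigram "company" (seen in two categories) ends up with a lower score than the bigram "company shares", which is tied to a single category.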
Categorization Model: Testing Phase
• The testing phase is based on the lexicon L of the
previous phase.
• Initially, an empty candidates list Y' is created.
• Then, for each token t of each news
article x, we perform a search in L.
• If the search is successful, we retrieve Λt, ft, and It of t.
We then traverse Λt and, for each entry y, we
update the candidates list Y'.
Candidates Scoring
• The score St,y of each candidate category is
expressed as a linear combination of the
importance score It of the token and a quantity Qt,y:
• Finally, the candidates are sorted in decreasing St,y
order and the top candidate is selected as the
category of the article.
Categorization Algorithm
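The categorization algorithm is also shown only as a figure. A minimal sketch of the testing phase follows; since the exact Qt,y and the weights of the linear combination appear only in the slide image, the choices Qt,y = ft,y / Σ ft,y and an equal weighting alpha = 0.5 below are assumptions made for illustration:

```python
def extract_tokens(title, max_n=3):
    # Same variable-size n-gram extraction as in the training sketch.
    words = title.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]

def classify(title, lexicon, alpha=0.5, max_n=3):
    """Testing-phase sketch: accumulate a score St,y per candidate
    category, then return the top candidate. Qt,y and alpha are
    illustrative assumptions, not the paper's exact definitions."""
    scores = {}  # the candidates list Y'
    for t in extract_tokens(title, max_n):
        entry = lexicon.get(t)  # search in L
        if entry is None:
            continue
        total = sum(entry["Lambda"].values())
        for y, f_ty in entry["Lambda"].items():
            q_ty = f_ty / total  # assumed Qt,y: share of t's hits on y
            scores[y] = scores.get(y, 0.0) + alpha * entry["It"] \
                        + (1 - alpha) * q_ty
    # Sort candidates in decreasing St,y order; the top one wins.
    return max(scores, key=scores.get) if scores else None

toy_lexicon = {
    "company": {"It": 0.10, "Lambda": {"economics": 1, "politics": 1}},
    "company shares": {"It": 0.60, "Lambda": {"economics": 3}},
}
print(classify("Company shares rise", toy_lexicon))  # → economics
```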
Model Pruning
• We may decrease the size of the lexicon L by
preserving only the tokens whose importance
score exceeds a threshold T:
• We demonstrate experimentally that this
choice combines a significant reduction in the size
of L with a negligible loss in the accuracy of
the algorithm.
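The pruning step reduces to a single filter over the lexicon. This sketch assumes the dictionary-of-dicts lexicon layout with an "It" field used in the earlier sketches, which is an illustrative choice rather than the paper's data structure:

```python
def prune(lexicon, T):
    """Self-pruning sketch: keep only tokens whose importance
    score It exceeds the threshold T."""
    return {t: e for t, e in lexicon.items() if e["It"] > T}

lex = {"company": {"It": 0.08}, "company shares": {"It": 0.60}}
print(sorted(prune(lex, 0.1)))  # → ['company shares']
```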
Experimental Setup
• Dataset: UCI News Aggregator
• 400,000 news stories & 4 categories.
• Training/Test Set sizes: 70%/30%.
• Two scenarios for the extracted n-grams:
– N=1: Extract unigrams (single words) only.
– N=3: Extract unigrams, bigrams, & trigrams.
• Pruning threshold T: 0.0 ≤ T < 1.0, in steps of 0.1.
Examined Methods
• SNC – Supervised News Classifier – the proposed
algorithm.
• LogReg – Logistic Regression.
Accuracy Evaluation
• Highest accuracy: 94.1% for N=3, T=0.
• Interesting measurement: accuracy of 90% for N=3, T=0.1.
Pruned Model Evaluation
• Recall: for N=3, T=0.1, the accuracy is 90% with 16x
fewer tokens and 6x fewer Λt entries.
Conclusions
• We presented a supervised learning
algorithm for news article categorization.
• It trains a classification model based on:
– The morphological analysis of the titles,
– The extraction of n-grams of variable sizes,
– The assignment of importance scores to each token.
• The method achieves approximately 94% classification
accuracy.
• It also embodies a self-pruning strategy.
Thank You
Any Questions?
