Autonomous News Clustering and Classification for an Intelligent Web Portal

Presentation at ISMIS 2008.

  1. Autonomous News Clustering and Classification for an Intelligent Web Portal
     Traian Rebedea, Stefan Trausan-Matu
     "Politehnica" University of Bucharest, Department of Computer Science and Engineering
     {trebedea, trausan}@cs.pub.ro
  2. Overview
     - Introduction
       - Motivation
       - Intelligent News Processing
     - Theoretical Background
       - Text Clustering
       - Text Classification
     - Intelligent News Classification in Romanian
       - Functionality & Architecture
       - News Clustering
       - News Classification
     - Conclusions
  3. Introduction
     - WWW – increased number of users and web sites
     - Great volume of online information
     - Information redundancy on the Web
       - Different sources – slight variations
     - News available as web syndication
       - XML-based formats – RSS, Atom
       - Applications: aggregators – various flavours
       - Aggregators do not (usually) exploit the large volume of information and redundancy
  4. Motivation
     - Obtain an autonomous news portal using:
       - Web syndication
       - Advanced text processing methods
         - Clustering
         - Classification
     - Large volumes of data
       - Find a method to determine the importance of the processed data (pieces of news)
     - News headlines
       - Different sources
       - Many stories / headlines acquired from feeds – high volume & redundancy
     - Intelligent News Processing
  5. Intelligent News Processing
     - Objective: find the most important piece of news
     - Alternatives:
       - Manually assign an importance to each piece of news – difficult and time-consuming
       - Number of readers for each news headline
         - Does not (1) reduce the number of news headlines, (2) solve news redundancy, or (3) offer an automatic method for computing the importance of a particular piece of news
         - Each piece of news is attached to its source – no alternative sources on the same subject
       - Intelligent processing of news
         - Automatically determines the main headlines
         - Offers a classification of news subjects by the number of different pieces of news that compose each subject – an objective measure provided by news specialists (news agencies, newspapers, TV stations, etc.)
  6. Intelligent News Processing (2)
     - NLP techniques – machine learning
     - News fetched from various sources using web syndication
     - News clustering – used to determine the most important subjects
     - News classification – assign each piece of news to a category
     - Different approaches:
       - Google News, Topix, NewsJunkie (Microsoft), European Media Monitor (EMM)
       - Some of them also consider: assigning labels to news (persons, companies, events), information novelty
  8. Theoretical Background
     - Clustering and classification – widely used in NLP
     - Vector space model – Boolean, frequency, TF-IDF vectors
     - High dimensionality – number of distinct terms in all the analyzed pieces of news
       - Curse of dimensionality
       - Similarity measures: based on cosine
       - Similarity measures defined as the inverse of a distance metric do not offer good results
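The three vector types and the cosine measure named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the sparse dict representation, and the `tf * log(N / df)` weighting are assumptions chosen for brevity.

```python
import math
from collections import Counter

def boolean_vector(tokens):
    # 1.0 for every term that occurs at least once
    return {t: 1.0 for t in set(tokens)}

def frequency_vector(tokens):
    # raw term counts
    return {t: float(c) for t, c in Counter(tokens).items()}

def tfidf_vectors(docs):
    # docs: list of token lists; weight = tf * log(N / df) (one common variant, assumed here)
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]

def cosine(u, v):
    # cosine of the angle between two sparse vectors stored as dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Storing vectors as dicts keeps only the non-zero terms, which matters when the vocabulary (and so the dimensionality) is large.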
  9. Text Clustering
     - Partition the data into subsets – clusters
       - Data in the same group has common characteristics
       - Grouping process is applied based on the proximity of the elements that need to be clustered – similarity measure
       - Large volumes – exploits the redundancy
     - Different techniques:
       - Bottom-up (agglomerative) / top-down (divisive)
       - Hierarchical / flat – relationships between groups
       - Assignment: hard / soft
  10. Hierarchical Clustering
      - Usually uses:
        - Hard assignment
        - Greedy technique
      - Computing similarity between clusters:
        - Most similar elements – single link (fast)
        - Least similar elements – complete link (good results)
        - Average similarity of all the elements in a group – average link (fast & good results)
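The three ways of comparing clusters listed above can be written compactly. The sketch below is illustrative only: the function names are hypothetical, and average link is taken here as the mean over all cross-cluster pairs, which is one common reading of the slide's description.

```python
from itertools import product

def single_link(a, b, sim):
    # similarity of the two most similar elements, one from each cluster
    return max(sim(x, y) for x, y in product(a, b))

def complete_link(a, b, sim):
    # similarity of the two least similar elements, one from each cluster
    return min(sim(x, y) for x, y in product(a, b))

def average_link(a, b, sim):
    # mean similarity over all cross-cluster pairs (assumed definition)
    values = [sim(x, y) for x, y in product(a, b)]
    return sum(values) / len(values)
```

Any pairwise similarity can be passed as `sim`, for example a cosine function over term vectors.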
  11. Text Classification
      - Assign predefined labels (categories) to textual items
      - Supervised learning – 2 stages
        - Training the classifier – training set of items
        - Using it for assigning labels to new items
      - Training => data model => classify items
      - Text classification:
        - News and e-mail categorization
        - Automatic classification of large text documents
        - Text is unstructured and the number of features is very high (> 1000), unlike database records (usually < 100)
  12. Text Classification (2)
      - Different methods:
        - Separation of the space: NN
        - Probability distribution: decision trees, Bayes, SVMs
      - Nearest neighbour (NN) – easy to train and use
      - Training phase – simple and fast – indexing the training data for each category
      - Classifying a new item – the k most similar indexed documents are determined, and the item is assigned to the class with the most documents among them
      - Improvements:
        - score for each class
        - offsets for each class added to the score
      - Classifier can be trained to find the best values for k and the offsets
      - Disadvantages: higher classification time and memory use than probabilistic classifiers
      - Use greedy feature selection
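A minimal sketch of the k-NN procedure described above, including the two improvements (per-class scores and per-class offsets). The names `knn_classify`, `use_scores`, and `offsets` are hypothetical, and using the summed similarity as the class score is an assumption; the slides do not specify the exact scoring formula.

```python
from collections import defaultdict

def knn_classify(item, training, sim, k=3, use_scores=True, offsets=None):
    """training: list of (vector, label) pairs; sim: pairwise similarity function."""
    # keep the k training items most similar to the new item
    neighbours = sorted(training, key=lambda pair: sim(item, pair[0]), reverse=True)[:k]
    scores = defaultdict(float)
    for vector, label in neighbours:
        # simple k-NN counts one vote per neighbour; the scored variant sums similarities
        scores[label] += sim(item, vector) if use_scores else 1.0
    # optional per-class offsets added to the class scores (assumed additive)
    for label, offset in (offsets or {}).items():
        scores[label] += offset
    return max(scores, key=scores.get)
```

With `use_scores=False` and no offsets this reduces to plain majority voting among the k neighbours.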
  14. Intelligent News Classification
      - Purpose: develop an online news portal able to function with a minimum of human intervention
      - Makes use of:
        - Web syndication
        - NLP techniques
      - Advantages:
        - autonomy from an administrator
        - news is presented according to the importance of the headlines over a period of time
  15. Functionality & Architecture
      - Automated collecting of web syndications (periodically);
      - Save fresh pieces of news in the database;
      - Process the textual information of each fresh piece of news in order to determine the feature vector associated with the news;
      - Group the news using a text clustering algorithm;
      - Classify each group of news within a predefined category, using a regularly retrained classifier;
      - Generate web pages corresponding to the most important subjects / headlines, grouped in various ways, including by news category.
  16. Functionality & Architecture (2)
      - These actions may be run in a single stage / sequentially, as well as individually, at different moments in time
        - Operation is determined by the quantity of processed data
        - Functionality can be parallelized
      - Functionality of the portal may be broken into two different modules that are relatively independent:
        - Agent module, that processes the news items and generates the web pages
        - Web module that displays the information and implements search and personalization capabilities
        - The two modules communicate using a database
  17. Functionality & Architecture (3)
  18. News Clustering
      - Preprocessing phase:
        - Remove diacritical marks and other special characters that are not used by all the news sources
        - Remove HTML tags and entities
        - Eliminate stop words
        - Tokenization
        - Stemming – special Romanian stemmer
          - Inflection rules are numerous and very complicated
          - They affect the inner structure of the words, not only the trailing part
          - Use a small set of solid rules (reduces the number of terms by 20-25%)
      - Vector space model – Boolean, frequency
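The preprocessing steps above, except stemming, can be sketched with the Python standard library. This is an illustration under stated assumptions: `STOP_WORDS` is a tiny made-up sample rather than the portal's list, the function names are hypothetical, and the Romanian stemmer is omitted because its rules are not given in the slides. Stop words are filtered after tokenization here purely for convenience.

```python
import html
import re
import unicodedata

# Tiny illustrative stop-word sample (assumption, not the authors' list)
STOP_WORDS = {"si", "de", "la", "in", "cu", "pe"}

def strip_diacritics(text):
    # Decompose characters, then drop the combining marks (e.g. "ș" -> "s")
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def preprocess(raw):
    text = re.sub(r"<[^>]+>", " ", raw)        # remove HTML tags
    text = html.unescape(text)                 # decode HTML entities
    text = strip_diacritics(text).lower()      # normalise diacritics and case
    tokens = re.findall(r"[a-z0-9]+", text)    # tokenization
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("<p>Știri &amp; știință în România</p>")` yields `["stiri", "stiinta", "romania"]`.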
  19. News Clustering – Algorithm
      - Hierarchical algorithm
        - Agglomerative
        - Hard assignment
        - Average link
      - Used two thresholds:
        - Higher value – to merge very similar items
          - Used in order to create very cohesive clusters
        - Lower value – continue the process
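A rough sketch of agglomerative, hard-assignment, average-link clustering with a similarity cut-off. The slides do not detail how the two thresholds interact, so this example uses a single hypothetical `threshold` parameter as the stopping criterion; the function names and the sparse dict vectors are likewise assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_link(a, b, vectors):
    # mean cosine similarity over all cross-cluster pairs of item indices
    total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
    return total / (len(a) * len(b))

def agglomerate(vectors, threshold):
    # start with one cluster per item; clusters hold item indices
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # greedy step: find the most similar pair of clusters
        best_pair, best_sim = max(
            (((i, j), average_link(clusters[i], clusters[j], vectors))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda entry: entry[1],
        )
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters
```

This naive version recomputes all pairwise cluster similarities at every step, which is fine for a demonstration but slow for large batches of news.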
  20. News Clustering - Similarity
      - Used different measures:
        - Frequency:
          - Inverse of a distance
          - Cosine similarity
        - Boolean:
          - Jaccard similarity
          - Dice's coefficient
          - Other:
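For reference, the Boolean measures and an inverse-distance measure could look like the sketch below. Jaccard and Dice follow their standard set definitions; the `1 / (1 + d)` form with Euclidean distance is only one possible reading of "inverse of a distance", since the slides do not give the exact formula.

```python
import math

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over the sets of terms present in each item
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    # 2 |A ∩ B| / (|A| + |B|)
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def inverse_distance(u, v):
    # Assumed form: 1 / (1 + Euclidean distance) between frequency vectors (dicts)
    terms = set(u) | set(v)
    d = math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))
    return 1.0 / (1.0 + d)
```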
  21. News Clustering - Results
      - Used > 30 news sources
      - Frequency-based implementation worked better
      - News presented to the users:
        - The most important headlines – number of items in a group
        - The importance of a piece of news – similarity with the feature vector of the headline
  22. News Classification
      - Categories: Romania, Politics, Economy, Culture, International, Sports, High-Tech and High Life
      - Classify the news clusters, not each piece of news individually
        - Advantage: a cluster holds more feature information – more likely to be classified correctly
      - Training data – specialized RSS channels
  23. Training Data
      - Training data: 3279 news items
        - Cross validation
        - 2/3 – used for training
        - 1/3 – used for evaluation
        - unequal distribution
  24. Classifiers
      - k-NN classifiers:
        - Simple k-NN (most similar item)
        - k-NN with scores
        - Center-based NN
          - Training phase – slower
          - Classification phase – faster
      - Various values for k = 1, 3, 5, …
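A center-based (nearest center) classifier can be sketched as below: training averages the vectors of each category into one center, and classification compares a new item with those few centers only, which is why the training phase is slower and the classification phase faster. The function names and the use of cosine similarity with a plain mean are assumptions for illustration.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_centers(training):
    """training: list of (vector, label) pairs; returns one mean vector per label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vector, label in training:
        counts[label] += 1
        for term, weight in vector.items():
            sums[label][term] += weight
    return {
        label: {term: total / counts[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }

def classify(item, centers):
    # assign the item to the category whose center is most similar
    return max(centers, key=lambda label: cosine(item, centers[label]))
```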
  25. Classification Results
      - Nearest Center (center-based NN) had the best accuracy
      - Frequency-based vector space with cosine similarity produced slightly better results than the Boolean-based vector space
      - k-NN with scores (denoted "sum" in the table above) produced slightly better results than simple k-NN
  26. Classification Results (2)
      - Confusion matrix for the NC classifier
      - Average recall = 0.59, Average precision = 0.62, Accuracy = 0.64, F1 = 0.61
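The metrics reported above are typically derived from a confusion matrix as in the following sketch. The per-class (macro) averaging and the F1 computed from the averaged precision and recall are assumptions, since the slides do not state how the averages were taken; any matrix passed in is the caller's own data, and the slide's actual confusion matrix is not reproduced here.

```python
def evaluate(matrix):
    """matrix[i][j] = number of items of true class i that were predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    recalls, precisions = [], []
    for i in range(n):
        row_sum = sum(matrix[i])                        # items that truly belong to class i
        col_sum = sum(matrix[r][i] for r in range(n))   # items predicted as class i
        recalls.append(matrix[i][i] / row_sum if row_sum else 0.0)
        precisions.append(matrix[i][i] / col_sum if col_sum else 0.0)
    avg_recall = sum(recalls) / n
    avg_precision = sum(precisions) / n
    denom = avg_precision + avg_recall
    f1 = 2 * avg_precision * avg_recall / denom if denom else 0.0
    return {"recall": avg_recall, "precision": avg_precision,
            "accuracy": accuracy, "f1": f1}
```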
  27. Conclusions
      - Alternative to classical news portals
        - Solves the problems of large amounts of news and of information redundancy, by using the latter as an advantage
      - Web syndication and natural language processing techniques are used in order to achieve human-independent operation
      - Clustering is used to exploit similar news and group them into a single topic – presented to the user
      - Automatic classification of the news topics – advantage over a single piece of news
      - Further development:
        - Improve the clustering and classification techniques
        - Language independent or multilingual
  28. Thank You!
