Autonomous News Clustering and Classification for an Intelligent Web Portal
 

Presentation at ISMIS 2008.

    Presentation Transcript

    • Autonomous News Clustering and Classification for an Intelligent Web Portal
      • Traian Rebedea, Stefan Trausan-Matu
      • “Politehnica” University of Bucharest, Department of Computer Science and Engineering
      • {trebedea, trausan}@cs.pub.ro
    • Overview
      • Introduction
        • Motivation
        • Intelligent News Processing
      • Theoretical Background
        • Text Clustering
        • Text Classification
      • Intelligent News Classification in Romanian
        • Functionality & Architecture
        • News Clustering
        • News Classification
      • Conclusions
    • Introduction
      • WWW – increased number of users and web sites
      • Great volume of online information
      • Information redundancy on the Web
        • Different sources – slight variations
      • News available as web syndication
        • XML-based formats – RSS, ATOM
        • Applications: aggregators – various flavours
        • Aggregators do not (usually) exploit the large volume of information and redundancy
    • Motivation
      • Obtain an autonomous news portal using:
        • Web syndication
        • Advanced text processing methods
          • Clustering
          • Classification
      • Large volumes of data
        • Find a method to determine the importance of the processed data (pieces of news)
      • News headlines
        • Different sources
        • Many stories / headlines acquired from feeds – high volume & redundancy
      • Intelligent News Processing
    • Intelligent News Processing
      • Objective: find the most important piece of news
      • Alternatives:
        • Manually assign an importance to each piece of news – difficult, time consuming
        • Number of readers for each news headline
          • Does not (1) reduce the number of news headlines, (2) solve news redundancy, (3) offer an automatic method for computing the importance of a particular piece of news
          • Each piece of news is attached to the source – no alternative sources on the same subject
        • Intelligent processing of news
          • Automatically determines the main headlines
          • Offers a ranking of news subjects by the number of different pieces of news that compose each one – an objective measure, since it is provided by news specialists (news agencies, newspapers, TV stations, etc.)
    • Intelligent News Processing (2)
      • NLP techniques – machine learning
      • News fetched from various sources using web syndication
      • News clustering – used to determine the most important subjects
      • News classification – assign each piece of news to a category
      • Different approaches:
        • Google News, Topix, NewsJunkie (Microsoft), European Media Monitor (EMM)
        • Some of them also consider: assigning labels to news (persons, companies, events), information novelty
    • Theoretical Background
      • Clustering and classification – widely used in NLP
      • Vector space model – Boolean, frequency, TF-IDF vectors
      • High dimensionality – number of distinct terms in all the analyzed pieces of news
        • Curse of dimensionality
        • Similarity measures: based on cosine
        • Similarity measures defined as the inverse of a distance metric do not offer good results
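The three vector types and the cosine measure named on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy Romanian tokens and the plain `log(N / df)` form of IDF are all assumptions made for the example.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of the distinct terms across all tokenized documents."""
    return sorted({term for doc in docs for term in doc})

def boolean_vector(doc, vocab):
    """1.0 if the term occurs in the document, 0.0 otherwise."""
    present = set(doc)
    return [1.0 if term in present else 0.0 for term in vocab]

def frequency_vector(doc, vocab):
    """Raw term counts."""
    counts = Counter(doc)
    return [float(counts[term]) for term in vocab]

def tfidf_vectors(docs, vocab):
    """Term frequency weighted by log(N / document frequency) (assumed IDF form)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in vocab}
    return [
        [tf * idf[term] for tf, term in zip(frequency_vector(doc, vocab), vocab)]
        for doc in docs
    ]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy, already tokenized documents (hypothetical data).
docs = [
    ["guvern", "buget", "vot"],
    ["guvern", "buget", "buget", "criza"],
    ["fotbal", "meci", "gol"],
]
vocab = build_vocabulary(docs)
freq = [frequency_vector(d, vocab) for d in docs]
tfidf = tfidf_vectors(docs, vocab)
```

The vector length equals the vocabulary size, which is why dimensionality grows with the number of distinct terms in the analyzed news.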
    • Text Clustering
      • Partition the data into subsets – clusters
        • Data in the same group has common characteristics
        • Grouping is based on the proximity of the elements to be clustered, as given by a similarity measure
        • Large volumes – exploits the redundancy
      • Different techniques:
        • Bottom-up (agglomerative) / top-down (divisive)
        • Hierarchical / flat – relationships between groups
        • Assignment: hard / soft
    • Hierarchical Clustering
      • Usually uses:
        • Hard assignment
        • Greedy technique
      • Computing similarity between clusters:
        • Most similar elements – single link (fast)
        • Least similar elements – complete link (good results)
        • Average similarity of all the elements in a group – average link (fast & good results)
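The three ways of computing cluster similarity listed above can be sketched as below. The helper names are hypothetical, clusters are plain lists of vectors, and average link is taken here as the mean over all cross-cluster pairs, which is one common reading of the slide rather than a confirmed detail.

```python
import math
from itertools import product

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pairwise(cluster_a, cluster_b):
    """Similarities of every cross-cluster pair of elements."""
    return [cosine(a, b) for a, b in product(cluster_a, cluster_b)]

def single_link(cluster_a, cluster_b):
    """Similarity of the most similar pair."""
    return max(pairwise(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    """Similarity of the least similar pair."""
    return min(pairwise(cluster_a, cluster_b))

def average_link(cluster_a, cluster_b):
    """Mean similarity over all cross-cluster pairs (assumed definition)."""
    sims = pairwise(cluster_a, cluster_b)
    return sum(sims) / len(sims)

# Two toy clusters of 2-dimensional vectors (hypothetical data).
a = [[1.0, 0.0], [1.0, 1.0]]
b = [[0.0, 1.0], [1.0, 1.0]]
single, complete, average = single_link(a, b), complete_link(a, b), average_link(a, b)
```

For any pair of clusters the three values are ordered as complete link ≤ average link ≤ single link, since they are the minimum, mean and maximum of the same set of similarities.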
    • Text Classification
      • Assign predefined labels (categories) to textual items
      • Supervised learning – 2 stages
        • Training the classifier – training set of items
        • Using it for assigning labels to new items
      • Training => data model => classify items
      • Text classification:
        • News and e-mail categorization
        • Automatic classification of large text documents
        • Text is unstructured and the number of features is very high (> 1000) – unlike database records (usually < 100)
    • Text Classification (2)
      • Different methods:
        • Separation of the space: NN
        • Probability distribution: decision trees, Bayes, SVMs
      • Nearest neighbour (NN) – easy to train and use
      • Training phase – simple and fast – indexing the training data for each category
      • Classifying a new item – the k most similar indexed documents are determined, and the item is assigned to the class that has the most documents among them
      • Improvements:
        • score for each class
        • offsets for each class added to the score
      • Classifier can be trained to find the best values for k and the offsets
      • Disadvantages: classification needs more time and memory than with probability-based classifiers
      • Use greedy feature selection
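A minimal sketch of the nearest-neighbour variants described on this slide: a plain majority vote among the k neighbours, a per-class score, and per-class offsets. The slide does not say how the score is computed, so summing the cosine similarities of the neighbours is an assumption, as are the function name `knn_classify`, its parameters and the toy data.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(item, training, k=3, offsets=None, use_scores=True):
    """training is a list of (vector, label) pairs kept as the 'index'.

    use_scores=False: each of the k neighbours casts one vote.
    use_scores=True:  each neighbour adds its similarity to its class score
                      (assumed scoring rule).
    offsets:          optional per-class constants added to the scores.
    """
    neighbours = sorted(
        training, key=lambda pair: cosine(item, pair[0]), reverse=True
    )[:k]
    scores = defaultdict(float)
    for vector, label in neighbours:
        scores[label] += cosine(item, vector) if use_scores else 1.0
    for label, offset in (offsets or {}).items():
        scores[label] += offset
    return max(scores, key=scores.get)

# Hypothetical training index and query vector.
TRAINING = [
    ([1.0, 0.0, 0.0], "Sports"),
    ([0.9, 0.1, 0.0], "Sports"),
    ([0.0, 1.0, 0.0], "Politics"),
    ([0.0, 0.9, 0.1], "Politics"),
    ([0.6, 0.8, 0.0], "Politics"),
]
query = [1.0, 0.2, 0.0]
by_vote = knn_classify(query, TRAINING, k=3, use_scores=False)
by_score = knn_classify(query, TRAINING, k=3)
with_offset = knn_classify(query, TRAINING, k=3, offsets={"Politics": 2.0})
```

Every classification sorts the whole index by similarity, which illustrates why this family of classifiers costs more time and memory at classification time than at training time.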
    • Intelligent News Classification
      • Purpose: develop an online news portal able to function with a minimum of human intervention
      • Makes use of:
        • Web syndication
        • NLP techniques
      • Advantages:
        • autonomy from an administrator
        • news is presented according to the importance of the headlines over a period of time
    • Functionality & Architecture
      • Automated collecting of web syndications (periodically);
      • Save fresh pieces of news in the database;
      • Process the textual information of each fresh piece of news in order to determine the feature vector associated with it;
      • Group the news using a text clustering algorithm;
      • Classify each group of news within a predefined category, using a regularly retrained classifier;
      • Generate web pages corresponding to the most important subjects / headlines, grouped in various ways, including in each category of news.
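The first two steps of this list (collecting feeds and saving only fresh items) might look like the sketch below. It parses an inline RSS 2.0 sample instead of fetching a URL so that it stays self-contained; the table layout, the function names and the choice of the item link as the uniqueness key are assumptions, not details from the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Inline sample standing in for a downloaded feed (hypothetical content).
SAMPLE_RSS = (
    '<rss version="2.0"><channel><title>Example feed</title>'
    "<item><title>First story</title><link>http://example.com/1</link>"
    "<description>Body one</description></item>"
    "<item><title>Second story</title><link>http://example.com/2</link>"
    "<description>Body two</description></item>"
    "</channel></rss>"
)

def parse_rss(xml_text):
    """Yield one dict per <item> element of an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        }

def save_fresh(conn, items):
    """Insert items not seen before (keyed by link); return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news "
        "(link TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
    for item in items:
        conn.execute(
            "INSERT OR IGNORE INTO news (link, title, description) VALUES (?, ?, ?)",
            (item["link"], item["title"], item["description"]),
        )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM news").fetchone()[0]
    return after - before

conn = sqlite3.connect(":memory:")
first_run = save_fresh(conn, parse_rss(SAMPLE_RSS))   # both items are new
second_run = save_fresh(conn, parse_rss(SAMPLE_RSS))  # same feed again
```

Running the collector periodically then only adds items whose link has not been stored yet, so the later processing steps work on fresh news only.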
    • Functionality & Architecture (2)
      • These actions may be run in a single stage / sequentially, as well as individually, at different moments in time
        • Operation is determined by the quantity of processed data
        • Functionality can be parallelized
      • Functionality of the portal may be broken into two different modules that are relatively independent:
        • Agent module, that processes the news items and generates the web pages
        • Web module that displays the information and implements search and personalization capabilities
        • The two modules communicate using a database
    • Functionality & Architecture (3)
    • News Clustering
      • Preprocessing phase:
        • Remove diacritical marks and other special characters that are not used by all the news sources
        • Remove HTML tags and entities
        • Eliminate stop words
        • Tokenization
        • Stemming – special Romanian stemmer
          • Inflexion rules are numerous and very complicated
          • They affect the inner structure of the words, not only the trailing part
          • Use a small set of solid rules (reduces the number of terms by 20-25%)
      • Vector space model – Boolean, frequency
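The preprocessing phase can be sketched roughly as below. The stop-word list and the suffix rules are tiny placeholders invented for the example; they are not the authors' Romanian stemmer, which the slides describe only as a small set of solid rules. HTML is stripped before diacritics here so that entities such as `&icirc;` are decoded first; that ordering is a choice made for this sketch.

```python
import html
import re
import unicodedata

# Placeholder resources (assumptions, far smaller than a real list).
STOP_WORDS = {"si", "de", "la", "in", "a", "cu", "pe", "un", "o"}
SUFFIXES = ("ului", "ilor", "elor", "ul", "le", "ii", "a", "e", "i")

def strip_html(text):
    """Drop tags, then decode entities such as &icirc; or &amp;."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def strip_diacritics(text):
    """Decompose characters and drop the combining marks (ș -> s, ă -> a)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_stem(token):
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(raw):
    """HTML removal, diacritics removal, tokenization, stop words, stemming."""
    text = strip_diacritics(strip_html(raw)).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

sample = "<p>Guvernul a aprobat bugetul &icirc;n ședință</p>"
terms = preprocess(sample)
# terms == ["guvern", "aprobat", "buget", "sedint"]
```

A suffix-only stemmer like this one cannot handle inflexions that change the inner structure of a word, which is the difficulty the slide points out for Romanian.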
    • News Clustering – Algorithm
      • Hierarchical algorithm
        • Agglomerative
        • Hard assignment
        • Average link
      • Used two thresholds:
        • Higher value – to merge very similar items
          • Used in order to create very cohesive clusters
        • Lower value – continue the process
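A minimal sketch of a hierarchical, agglomerative, hard-assignment, average-link clustering loop. The slides do not spell out how the two thresholds interact, so this example uses a single stopping threshold only; `agglomerate` and its parameters are hypothetical names, and the vectors are toy data.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_link(a, b, vectors):
    """Mean similarity over all cross-cluster pairs of item indices."""
    sims = [cosine(vectors[i], vectors[j]) for i in a for j in b]
    return sum(sims) / len(sims)

def agglomerate(vectors, threshold):
    """Greedily merge the two most similar clusters until the best
    average-link similarity drops below the threshold."""
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > 1:
        (x, y), best = max(
            (
                ((x, y), average_link(clusters[x], clusters[y], vectors))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # hard assignment: y joins x
        del clusters[y]                          # safe because x < y
    return clusters

# Toy frequency vectors (hypothetical data).
vectors = [
    [2, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
    [1, 0, 0, 1],
]
cohesive = agglomerate(vectors, threshold=0.6)
looser = agglomerate(vectors, threshold=0.5)
```

A higher threshold leaves item 4 on its own, while a lower one lets it join the first cluster, which shows how the threshold trades cluster cohesion against the number of clusters.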
    • News Clustering - Similarity
      • Used different measures:
        • Frequency:
          • Inverse of a distance
          • Cosine similarity
        • Boolean:
          • Jaccard similarity
          • Dice’s coefficient
          • Other:
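The Boolean measures named above have standard set-based definitions, sketched here together with one possible inverse-of-distance similarity. The `1 / (1 + d)` form with Euclidean distance is an assumption, since the slide does not say which distance or which inversion was used.

```python
import math

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over the sets of terms present in each item."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

def inverse_distance(u, v):
    """1 / (1 + Euclidean distance) on frequency vectors (assumed form)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
    return 1.0 / (1.0 + distance)

# Toy term sets (hypothetical data).
first = {"guvern", "buget", "vot"}
second = {"guvern", "buget", "criza"}
j = jaccard(first, second)   # 2 shared terms out of 4 distinct ones
d = dice(first, second)      # 2 * 2 shared terms over 3 + 3
```

The two Boolean measures are monotonically related (Dice equals 2J / (1 + J)), so they rank pairs of items in the same order and differ only in where a threshold falls.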
    • News Clustering - Results
        • Used > 30 news sources
        • Frequency-based implementation worked better
        • News presented to the users:
          • The most important headlines – number of items in a group
          • The importance of a piece of news – similarity with the feature vector of the headline
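The two orderings described above can be sketched as follows: headlines ranked by group size, and items within a group ranked by similarity to the headline's feature vector. Taking that vector to be the centroid of the group is an assumption, and all names and data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean, used here as the headline's feature vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rank_headlines(clusters):
    """Headline ids ordered by the number of items in their group."""
    return sorted(clusters, key=lambda cid: len(clusters[cid]), reverse=True)

def rank_items(vectors):
    """Item indices ordered by similarity to the group centroid."""
    centre = centroid(vectors)
    return sorted(
        range(len(vectors)),
        key=lambda i: cosine(vectors[i], centre),
        reverse=True,
    )

# Toy groups of frequency vectors (hypothetical data).
clusters = {
    "match": [[0, 0, 3]],
    "budget": [[2, 1, 0], [1, 1, 0], [1, 0, 1]],
}
headline_order = rank_headlines(clusters)
item_order = rank_items(clusters["budget"])
```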
    • News Classification
      • Categories: Romania, Politics, Economy, Culture, International, Sports, High-Tech and High Life
      • Classify the news clusters, not each piece of news individually
        • Advantage: a cluster holds more feature information, so it is more likely to be classified correctly
      • Training data – specialized RSS channels
    • Training Data
      • Training data: 3279 news items
        • Cross validation
        • 2/3 – used for training
        • 1/3 – used for evaluation
        • unequal distribution
    • Classifiers
      • k-NN classifiers:
        • Simple k-NN (most similar item)
        • k-NN with scores
        • Center-based NN
          • Training phase – slower
          • Classification phase – faster
      • Various values for k = 1, 3, 5, …
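A minimal sketch of the center-based variant: training computes one center per category, and classification compares the item with the centers only. Using the arithmetic mean as the center and cosine as the similarity are assumptions; the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_centers(training):
    """Slower training phase: average all vectors of each category."""
    grouped = defaultdict(list)
    for vector, label in training:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def nearest_center(item, centers):
    """Faster classification phase: one comparison per category."""
    return max(centers, key=lambda label: cosine(item, centers[label]))

# Hypothetical training data.
TRAINING = [
    ([1.0, 0.0, 0.0], "Sports"),
    ([0.9, 0.1, 0.0], "Sports"),
    ([0.0, 1.0, 0.0], "Politics"),
    ([0.0, 0.9, 0.1], "Politics"),
    ([0.6, 0.8, 0.0], "Politics"),
]
centers = train_centers(TRAINING)
sports_like = nearest_center([1.0, 0.2, 0.0], centers)
politics_like = nearest_center([0.1, 1.0, 0.0], centers)
```

Classification cost depends on the number of categories rather than on the size of the training set, which matches the slower-training, faster-classification trade-off listed above.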
    • Classification Results
      • Nearest Center (center-based NN) had the best accuracy
      • Frequency-based vector space with cosine similarity produced slightly better results than the Boolean-based vector space
      • k-NN with scores (labelled “sum” in the results table) produced slightly better results than simple k-NN
    • Classification Results (2)
      • Confusion matrix for the NC classifier
      • Average recall = 0.59, Average precision = 0.62, Accuracy = 0.64, F1 = 0.61
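The averaged measures on this slide can be derived from a confusion matrix as sketched below. The 2×2 matrix is invented for the example (the real matrix is not in the transcript), and the averaging scheme is an assumption: per-class recall and precision are macro-averaged, and F1 is the harmonic mean of the two averages. With the rounded slide values this scheme gives roughly 0.60 rather than 0.61, so the authors may have averaged per-class F1 or used unrounded figures.

```python
def macro_metrics(matrix):
    """matrix[i][j] = number of items of true class i predicted as class j."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    recalls, precisions = [], []
    for c in range(n):
        true_positive = matrix[c][c]
        actual = sum(matrix[c])                    # row sum: items of class c
        predicted = sum(row[c] for row in matrix)  # column sum: predicted as c
        recalls.append(true_positive / actual if actual else 0.0)
        precisions.append(true_positive / predicted if predicted else 0.0)
    recall = sum(recalls) / n
    precision = sum(precisions) / n
    accuracy = sum(matrix[c][c] for c in range(n)) / total
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}

# Invented two-class confusion matrix, for illustration only.
toy_matrix = [
    [8, 2],
    [1, 9],
]
metrics = macro_metrics(toy_matrix)
```

With an unequal class distribution, accuracy weights large categories more heavily than the macro averages do, which is one reason the reported accuracy can differ from average recall.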
    • Conclusions
      • Alternative to classical news portals
        • Solve the problems of large amounts of news and of information redundancy, by using the latter as an advantage
      • Web syndication and natural language processing techniques are used to make the portal function independently of human intervention
      • Clustering is used to exploit similar news and group them into a single topic – presented to the user
      • Automatic classification of the news topics – advantage over a single piece of news
      • Further development:
        • Improve the clustering and classification techniques
        • Language independent or multilingual
    • Thank You!