Textmining Retrieval And Clustering
Textmining Retrieval And Clustering Presentation Transcript

  • 1. Retrieval and clustering of documents
  • 2. Measuring similarity for retrieval
    • Given a set of documents and a query, a similarity measure determines which documents are relevant to the query (or category) and should be retrieved.
  • 3. Cosine similarity for retrieval
    • Cosine similarity is a measure of similarity between two vectors of n dimensions, obtained by taking the cosine of the angle between them; it is often used to compare documents in text mining. Given two attribute vectors, A and B, the cosine similarity is defined via the dot product and magnitudes as
    • similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖)
  • 4. Cosine similarity for retrieval
    • For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.
    • The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality (the vectors share no terms), and in-between values indicating intermediate similarity or dissimilarity.
  • 5. Cosine similarity for retrieval
    • In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.
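The cosine measure described above can be sketched in a few lines of plain Python. The function name `cosine_similarity` and the whitespace tokenization are illustrative choices, not part of the slides; a real system would use tf-idf weights rather than raw term frequencies.

```python
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two documents."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    # Dot product over the shared vocabulary only (all other products are zero)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    # Vector magnitudes
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because term frequencies are non-negative, the result always falls in [0, 1], as the slide notes.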
  • 6. Web-based document search and link analysis
    • Link analysis has been used successfully for deciding which web pages to add to the collection of documents, and
    • for ordering the documents matching a user query (i.e., how to rank pages).
    • It has also been used to categorize web pages, to find pages related to given pages, to find duplicated web sites, and for various other problems in web information retrieval.
  • 7. Link Analysis
    • A link from page A to page B is a recommendation of page B by the author of page A.
    • If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected.
  • 8. Application
    • Ranking query results (PageRank)
    • Crawling
    • Finding related pages
    • Computing web page reputations
    • Predicting geographic scope
    • Categorizing web pages
    • Computing statistics of web pages and of search engines
  • 9. Document matching
    • Document matching is defined as the matching of some stated user query against a set of free-text records.
    • These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual.
    • User queries can range from multi-sentence full descriptions of an information need to a few words.
  • 10. Steps involved in document matching
    • A document matching system has two main tasks:
    • Find documents relevant to the user's query
    • Evaluate the matching results and sort them by relevance, using algorithms such as PageRank.
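The two tasks above can be sketched as one self-contained ranking function. `rank_documents` and its scoring scheme (cosine over raw term counts, whitespace tokenization) are assumptions for illustration; the slides do not prescribe a particular implementation.

```python
import math
from collections import Counter

def rank_documents(query: str, docs: list) -> list:
    """Return (doc_index, score) pairs sorted by descending cosine similarity to the query."""
    def vec(text):
        return Counter(text.lower().split())
    q = vec(query)
    q_norm = math.sqrt(sum(v * v for v in q.values()))
    scored = []
    for i, doc in enumerate(docs):
        d = vec(doc)
        # Dot product restricted to terms the query and document share
        dot = sum(q[t] * d[t] for t in q.keys() & d.keys())
        d_norm = math.sqrt(sum(v * v for v in d.values()))
        score = dot / (q_norm * d_norm) if dot else 0.0
        scored.append((i, score))
    # Task 2: sort the matches by relevance score
    return sorted(scored, key=lambda p: p[1], reverse=True)
```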
  • 11. k-means clustering
    • Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares.
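In symbols, the within-cluster sum-of-squares objective just described can be written as:

```latex
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x_j \in S_i} \left\| x_j - \mu_i \right\|^2
```

where μi denotes the mean (centroid) of the points in Si.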
  • 12. K-Means algorithm
    • 0. Input: D := {d1, d2, …, dn}; k := the number of clusters
    • 1. Select k document vectors as the initial centroids of the k clusters
    • 2. Repeat
    • 3. Select one vector d among the remaining documents
    • 4. Compute the similarities between d and the k centroids
    • 5. Put d in the closest cluster and recompute that cluster's centroid
    • 6. Until the centroids no longer change
    • 7. Output: k clusters of documents
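The algorithm above can be sketched as follows; this is a minimal Lloyd-style implementation on dense vectors, using squared Euclidean distance in place of the slide's generic similarity, with the function name and random initialization chosen for illustration.

```python
import random

def kmeans(vectors, k, iters=100, seed=0):
    """Minimal k-means on dense vectors (lists of floats)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)               # step 1: initial centroids
    for _ in range(iters):                           # steps 2-6: repeat until stable
        clusters = [[] for _ in range(k)]
        for v in vectors:                            # steps 3-5: assign each vector
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Recompute each centroid as the mean of its cluster
        new_centroids = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:               # step 6: centroids unchanged
            break
        centroids = new_centroids
    return clusters, centroids
```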
  • 13. Pros and Cons
    • Advantages:
        • linear time complexity
        • works relatively well in low-dimensional spaces
    • Drawbacks:
        • distance computation is expensive in high-dimensional spaces
        • the centroid vector may not summarize the cluster's documents well
        • the choice of the initial k clusters affects the quality of the final clusters
  • 14. Hierarchical clustering
    • Input: D := {d1, d2, …, dn}
    • 1. Calculate the similarity matrix SIM[i, j]
    • 2. Repeat
    • 3. Merge the two most similar clusters, K and L, to form a new cluster KL
    • 4. Compute the similarities between KL and each of the remaining clusters and update SIM[i, j]
    • 5. Until a single cluster (or a specified number of clusters) remains
    • 6. Output: dendrogram of clusters
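The merge loop above can be sketched over a precomputed similarity matrix. Single linkage (most similar pair across two clusters) is an assumed choice here, since the slide does not name a linkage criterion; the merge history stands in for the dendrogram.

```python
def agglomerative(sim, target=1):
    """Agglomerative clustering over a precomputed similarity matrix sim[i][j].
    Merges the two most similar clusters until `target` clusters remain and
    returns (clusters, merge_history)."""
    clusters = [[i] for i in range(len(sim))]
    history = []

    def cluster_sim(a, b):
        # Single linkage: similarity of the most similar pair across the two clusters
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > target:
        # Step 3: find the most similar pair of clusters
        i, j = max(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]]),
        )
        history.append((clusters[i], clusters[j]))   # record the merge (dendrogram)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history
```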
  • 15. Pros and cons
    • Advantages:
        • produces better-quality clusters
        • works relatively well in low-dimensional spaces
    • Drawbacks:
        • distance computation is expensive in high-dimensional spaces
        • quadratic time complexity
  • 16. The EM algorithm for clustering
    • Let the analyzed object be described by two random variables, an observed variable X and a hidden variable Y, which are assumed to have a joint probability distribution function p(x, y | θ).
  • 17. The EM algorithm for clustering
    • The distribution is known up to its parameter(s) θ. It is assumed that we are given a set of samples x1, …, xn independently drawn from the distribution p(x | θ).
  • 18. The EM algorithm for clustering
    • The Expectation-Maximization (EM) algorithm is an optimization procedure which computes the Maximum-Likelihood (ML) estimate of the unknown parameter θ when only incomplete data are presented (the hidden variable Y is unobserved). In other words, the EM algorithm maximizes the likelihood function L(θ).
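A concrete instance of this procedure is EM for a two-component 1-D Gaussian mixture, where θ consists of the mixture weights, means, and variances and the hidden variable is the component that generated each point. The function name, the deterministic min/max initialization, and the variance floor are all assumptions of this sketch.

```python
import math

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture: estimates the weights, means,
    and variances (the parameters theta) when the generating component of each
    point (the hidden variable) is unobserved."""
    k = 2
    means = [min(data), max(data)]        # simple deterministic initialization
    variances = [1.0] * k
    weights = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point
        resp = []
        for x in data:
            dens = [
                weights[j] / math.sqrt(2 * math.pi * variances[j])
                * math.exp(-((x - means[j]) ** 2) / (2 * variances[j]))
                for j in range(k)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6,  # floor to avoid a variance collapsing to zero
            )
    return weights, means, variances
```

Each iteration provably does not decrease the likelihood, which is what "the EM algorithm maximizes the likelihood function" refers to.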
  • 19. Evaluation of clustering
    • What is a good clustering?
    • Internal criterion: a good clustering produces high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used.
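The internal criterion can be made concrete by averaging pairwise cosine similarities within and across clusters; the function below is an illustrative sketch (names and cosine choice assumed), where a good clustering should yield a high intra value and a low inter value.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def avg_intra_inter(clusters):
    """Average intra-cluster and inter-cluster cosine similarity for a clustering,
    given as a list of clusters, each a list of document vectors."""
    intra, inter = [], []
    for i, ci in enumerate(clusters):
        # Pairs within the same cluster contribute to intra-cluster similarity
        for a in range(len(ci)):
            for b in range(a + 1, len(ci)):
                intra.append(cosine(ci[a], ci[b]))
        # Pairs across different clusters contribute to inter-cluster similarity
        for j in range(i + 1, len(clusters)):
            for u in ci:
                for v in clusters[j]:
                    inter.append(cosine(u, v))
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra), mean(inter)
```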
  • 20. Conclusion
    • In this presentation we learned about
    • Measuring similarity for retrieval
    • Web-based document search and link analysis
    • Document matching
    • Clustering by similarity
    • Hierarchical clustering
    • The EM algorithm for clustering
    • Evaluation of clustering
  • 21. Visit more self help tutorials
    • Pick a tutorial of your choice and browse through it at your own pace.
    • The tutorials section is free, self-guiding and will not involve any additional support.
    • Visit us at www.dataminingtools.net