
Textmining Retrieval And Clustering



Published in: Technology


  1. Retrieval and clustering of documents
  2. Measuring similarity for retrieval <ul><li>Given a set of documents and a query, a similarity measure determines how relevant each document is to that query, so the retrieval system can decide which documents to return. </li></ul>
  3. Cosine similarity for retrieval <ul><li>Cosine similarity is a measure of similarity between two vectors of n dimensions found by taking the cosine of the angle between them; it is often used to compare documents in text mining. Given two attribute vectors A and B, the cosine similarity is expressed with a dot product and magnitudes as </li></ul><ul><li>Similarity = cos(θ) = A · B / (||A|| ||B||) </li></ul>
  4. Cosine similarity for retrieval <ul><li>For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. </li></ul><ul><li>The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating independence, and in-between values indicating intermediate similarity or dissimilarity. </li></ul>
  5. Cosine similarity for retrieval <ul><li>In the case of information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (and tf-idf weights) cannot be negative; the angle between two term frequency vectors therefore cannot be greater than 90°. </li></ul>
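A minimal sketch of the cosine measure above, in Python; the two term-frequency vectors are invented for illustration:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = A.B / (||A|| ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares no terms with anything
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors over a shared 4-term vocabulary.
doc_a = [3, 2, 0, 5]
doc_b = [1, 0, 0, 2]
print(cosine_similarity(doc_a, doc_b))  # high: the documents overlap heavily
```

Because term frequencies are non-negative, the result always lands in [0, 1] here, as the slide notes.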
  6. Web-based document search and link analysis <ul><li>Link analysis has been used successfully for deciding which web pages to add to the collection of documents, and for deciding how to order the documents matching a user query (i.e., how to rank pages). </li></ul><ul><li>It has also been used to categorize web pages, to find pages related to a given page, to find duplicated web sites, and for various other problems in web information retrieval. </li></ul>
  7. Link Analysis <ul><li>A link from page A to page B is a recommendation of page B by the author of page A. </li></ul><ul><li>If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected. </li></ul>
  8. Application <ul><li>Ranking query results (PageRank) </li></ul><ul><li>Crawling </li></ul><ul><li>Finding related pages </li></ul><ul><li>Computing web page reputations </li></ul><ul><li>Predicting geographic scope </li></ul><ul><li>Categorizing web pages </li></ul><ul><li>Computing statistics of web pages and of search engines </li></ul>
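PageRank, listed above, can be sketched as a simple power iteration over the link graph. The toy graph, damping factor, and iteration count below are illustrative assumptions, not part of the slides:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Tiny hypothetical web graph: A and C both link to B, so B should rank highest.
graph = {"A": ["B"], "B": ["C"], "C": ["B"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))
```

The ranks always sum to 1, so they can be read as the stationary distribution of a random surfer who follows links with probability 0.85 and jumps to a random page otherwise.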
  9. Document matching <ul><li>Document matching is defined as the matching of some stated user query against a set of free-text records. </li></ul><ul><li>These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. </li></ul><ul><li>User queries can range from multi-sentence full descriptions of an information need to a few words. </li></ul>
  10. Steps involved in document matching <ul><li>A document matching system has two main tasks: </li></ul><ul><li>Find documents relevant to the user query </li></ul><ul><li>Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank </li></ul>
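The two tasks above can be sketched with plain term-frequency vectors and the cosine measure from the earlier slides; the document texts are invented for illustration:

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a text as a bag of lowercased words."""
    return Counter(text.lower().split())

def cosine(c1, c2):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(c1[t] * c2[t] for t in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def match(query, docs):
    """Task 1 and 2 together: score each document, then sort by relevance."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in docs]
    return [d for score, d in sorted(scored, reverse=True)]

docs = ["real estate records", "newspaper article on clustering",
        "manual paragraph about estate law"]
print(match("estate records", docs)[0])  # the real-estate record scores highest
```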
  11. k-means clustering <ul><li>Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares. </li></ul>
  12. K-Means algorithm <ul><li>0. Input: D := {d1, d2, …, dn}; k := the cluster number </li></ul><ul><li>1. Select k document vectors as the initial centroids of k clusters </li></ul><ul><li>2. Repeat </li></ul><ul><li>3. Select one vector d from the remaining documents </li></ul><ul><li>4. Compute the similarities between d and the k centroids </li></ul><ul><li>5. Put d in the closest cluster and recompute that cluster's centroid </li></ul><ul><li>6. Until the centroids do not change </li></ul><ul><li>7. Output: k clusters of documents </li></ul>
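The slide describes an incremental variant; the sketch below uses the common batch (Lloyd's) formulation with Euclidean distance, and the 2-D points are invented stand-ins for document vectors:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Batch (Lloyd's) k-means on tuples of numbers; a minimal sketch."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick k points as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared Euclidean).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # Recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # step 6: stop when centroids settle
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, groups = kmeans(pts, 2)
```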
  13. Pros and Cons <ul><li>Advantages: </li></ul><ul><ul><ul><li>linear time complexity </li></ul></ul></ul><ul><ul><ul><li>works relatively well in low-dimensional spaces </li></ul></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><ul><li>distance computation is expensive in high-dimensional spaces </li></ul></ul></ul><ul><ul><ul><li>the centroid vector may not summarize the cluster's documents well </li></ul></ul></ul><ul><ul><ul><li>the choice of initial k clusters affects the quality of the final clusters </li></ul></ul></ul>
  14. Hierarchical clustering <ul><li>Input: D := {d1, d2, …, dn} </li></ul><ul><li>1. Calculate the similarity matrix SIM[i,j] </li></ul><ul><li>2. Repeat </li></ul><ul><li>3. Merge the most similar two clusters, K and L, to form a new cluster KL </li></ul><ul><li>4. Compute the similarities between KL and each of the remaining clusters and update SIM[i,j] </li></ul><ul><li>5. Until there is a single cluster (or a specified number of clusters) </li></ul><ul><li>6. Output: dendrogram of clusters </li></ul>
  15. Pros and cons <ul><li>Advantages: </li></ul><ul><ul><ul><li>produces better-quality clusters </li></ul></ul></ul><ul><ul><ul><li>works relatively well in low-dimensional spaces </li></ul></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><ul><li>distance computation is expensive in high-dimensional spaces </li></ul></ul></ul><ul><ul><ul><li>quadratic time complexity </li></ul></ul></ul>
  16. The EM algorithm for clustering <ul><li>Let the analyzed object be described by two random variables X (observed) and Y (hidden), which are assumed to have a joint probability distribution p(x, y | Θ). </li></ul>
  17. The EM algorithm for clustering <ul><li>The distribution is known up to its parameter(s) Θ. It is assumed that we are given a set of samples x1, …, xn independently drawn from the distribution p(x | Θ). </li></ul>
  18. The EM algorithm for clustering <ul><li>The Expectation-Maximization (EM) algorithm is an optimization procedure that computes the Maximum-Likelihood (ML) estimate of the unknown parameter Θ when only incomplete data are presented (the hidden variable Y is unobserved). In other words, the EM algorithm maximizes the likelihood function L(Θ) = p(x1, …, xn | Θ). </li></ul>
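As a concrete sketch of EM for clustering, here is the algorithm for a two-component one-dimensional Gaussian mixture, where the hidden variable is which component generated each point. The data and the min/max initialisation are invented for illustration:

```python
import math

def em_gmm_1d(data, iterations=50):
    """EM for a two-component 1-D Gaussian mixture (a minimal sketch)."""
    mu = [min(data), max(data)]  # crude but effective initialisation
    sigma = [1.0, 1.0]
    pi = [0.5, 0.5]              # mixing weights
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            w = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-((x - mu[k]) ** 2) / (2 * sigma[k] ** 2))
                 for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate parameters to maximize expected likelihood.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            sigma[k] = math.sqrt(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            ) or 1e-6  # guard against a collapsed component
            pi[k] = nk / len(data)
    return mu, sigma, pi

# Two well-separated groups of points.
data = [0.0, 0.2, 0.4, 9.8, 10.0, 10.2]
mu, sigma, pi = em_gmm_1d(data)
```

After convergence the two means sit near the centers of the two groups, and each point's responsibilities give a soft cluster assignment.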
  19. Evaluation of clustering <ul><li>What is a good clustering? </li></ul><ul><li>Internal criterion: a good clustering produces high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used. </li></ul>
  20. Conclusion <ul><li>In this presentation we learned about: </li></ul><ul><li>Measuring similarity for retrieval </li></ul><ul><li>Web-based document search and link analysis </li></ul><ul><li>Document matching </li></ul><ul><li>Clustering by similarity </li></ul><ul><li>Hierarchical clustering </li></ul><ul><li>The EM algorithm for clustering </li></ul><ul><li>Evaluation of clustering </li></ul>