Textmining Retrieval And Clustering
1. Retrieval and clustering of documents
2. Measuring similarity for retrieval <ul><li>Given a set of documents, a similarity measure determines, for retrieval purposes, how relevant each document is to a particular query or category. </li></ul>
3. Cosine similarity for retrieval <ul><li>Cosine similarity is a measure of similarity between two vectors of n dimensions, found by taking the cosine of the angle between them; it is often used to compare documents in text mining. Given two attribute vectors, A and B, the cosine similarity θ is expressed using a dot product and magnitudes as </li></ul><ul><li>similarity = cos(θ) = (A · B) / (||A|| ||B||) </li></ul>
4. Cosine similarity for retrieval <ul><li>For text matching, the attribute vectors A and B are usually the term-frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing for document length during comparison. </li></ul><ul><li>The resulting similarity ranges from −1, meaning exactly opposite, to 1, meaning exactly the same, with 0 indicating orthogonality (independence) and in-between values indicating intermediate similarity or dissimilarity. </li></ul>
5. Cosine similarity for retrieval <ul><li>In information retrieval, the cosine similarity of two documents ranges from 0 to 1, since term frequencies (and tf-idf weights) cannot be negative; the angle between two term-frequency vectors therefore cannot be greater than 90°. </li></ul>
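The formula above can be sketched directly in code. Everything here (the `cosine_similarity` helper and the toy term-frequency vectors) is illustrative, not taken from the slides:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors A and B:
    (A . B) / (||A|| ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # An empty document shares no terms with anything
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors over a shared 4-word vocabulary
doc_a = [3, 0, 1, 2]
doc_b = [1, 1, 0, 2]
sim = cosine_similarity(doc_a, doc_b)
```

Because the counts are non-negative, `sim` lands in [0, 1], as the slide notes.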
6. Web-based document search and link analysis <ul><li>Link analysis has been used successfully for deciding which web pages to add to a document collection and how to order the documents matching a user query (i.e., how to rank pages). </li></ul><ul><li>It has also been used to categorize web pages, to find pages related to given pages, to find duplicated web sites, and to solve various other problems in web information retrieval. </li></ul>
7. Link analysis <ul><li>A link from page A to page B is a recommendation of page B by the author of page A. </li></ul><ul><li>If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected. </li></ul>
8. Applications <ul><li>Ranking query results (PageRank) </li></ul><ul><li>Crawling </li></ul><ul><li>Finding related pages </li></ul><ul><li>Computing web page reputations </li></ul><ul><li>Determining geographic scope </li></ul><ul><li>Prediction </li></ul><ul><li>Categorizing web pages </li></ul><ul><li>Computing statistics of web pages and of search engines </li></ul>
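The ranking application above can be illustrated with a minimal power-iteration sketch of PageRank. The toy graph, damping factor, and iteration count are illustrative assumptions, not part of the slides:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if outgoing:
                # Each page passes its rank evenly along its outgoing links
                share = rank[p] / len(outgoing)
                for q in outgoing:
                    new_rank[q] += damping * share
            else:
                # Dangling page: spread its rank evenly over all pages
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Toy web graph of three hypothetical pages
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Here page C, which is linked to by both A and B, ends up with the highest rank, matching the "link as recommendation" intuition of the previous slide.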
9. Document matching <ul><li>Document matching is defined as the matching of a stated user query against a set of free-text records. </li></ul><ul><li>These records can be any type of largely unstructured text, such as newspaper articles, real-estate records, or paragraphs in a manual. </li></ul><ul><li>User queries can range from a few words to multi-sentence full descriptions of an information need. </li></ul>
10. Steps involved in document matching <ul><li>A document matching system has two main tasks: </li></ul><ul><li>Find the documents relevant to a user query </li></ul><ul><li>Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank </li></ul>
11. k-means clustering <ul><li>Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares: </li></ul><ul><li>arg min over S of Σ_{i=1..k} Σ_{x ∈ Si} ||x − μi||², where μi is the mean of the points in Si </li></ul>
12. K-means algorithm <ul><li>0. Input: D := {d1, d2, …, dn}; k := the number of clusters </li></ul><ul><li>1. Select k document vectors as the initial centroids of the k clusters </li></ul><ul><li>2. Repeat </li></ul><ul><li>3. Select one vector d among the remaining documents </li></ul><ul><li>4. Compute the similarities between d and the k centroids </li></ul><ul><li>5. Put d in the closest cluster and recompute that cluster's centroid </li></ul><ul><li>6. Until the centroids no longer change </li></ul><ul><li>7. Output: k clusters of documents </li></ul>
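The algorithm above updates a centroid after each assignment (an online variant). As a hedged sketch, the code below uses the more common batch variant (Lloyd's algorithm), which reassigns all documents and then recomputes every centroid; the function name, seed, and toy data are illustrative assumptions:

```python
import random

def kmeans(docs, k, iterations=100, seed=0):
    """Batch k-means (Lloyd's algorithm) on lists of numeric vectors,
    using squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(docs, k)]  # step 1: initial centroids
    clusters = []
    for _ in range(iterations):
        # Assign every document to its closest centroid (steps 3-5)
        clusters = [[] for _ in range(k)]
        for d in docs:
            idx = min(range(k),
                      key=lambda i: sum((x - c) ** 2 for x, c in zip(d, centroids[i])))
            clusters[idx].append(d)
        # Recompute each centroid as the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                dim = len(cluster[0])
                new_centroids.append([sum(d[j] for d in cluster) / len(cluster)
                                      for j in range(dim)])
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Step 6: stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Toy 2-D "documents" forming two obvious groups
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
clusters, centroids = kmeans(points, k=2)
```

For document vectors one would typically substitute cosine similarity for Euclidean distance, as the slides suggest.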
13. Pros and cons <ul><li>Advantages: </li></ul><ul><ul><ul><li>Linear time complexity </li></ul></ul></ul><ul><ul><ul><li>Works relatively well in low-dimensional spaces </li></ul></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><ul><li>Distance computation is expensive in high-dimensional spaces </li></ul></ul></ul><ul><ul><ul><li>The centroid vector may not summarize the cluster's documents well </li></ul></ul></ul><ul><ul><ul><li>The choice of the initial k clusters affects the quality of the result </li></ul></ul></ul>
14. Hierarchical clustering <ul><li>Input: D := {d1, d2, …, dn} </li></ul><ul><li>1. Calculate the similarity matrix SIM[i, j] </li></ul><ul><li>2. Repeat </li></ul><ul><li>3. Merge the two most similar clusters, K and L, to form a new cluster KL </li></ul><ul><li>4. Compute the similarities between KL and each of the remaining clusters and update SIM[i, j] </li></ul><ul><li>5. Until there is a single cluster (or a specified number of clusters) </li></ul><ul><li>6. Output: dendrogram of clusters </li></ul>
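The bottom-up procedure above can be sketched as follows. The slides do not say how cluster-to-cluster similarity is computed, so this sketch assumes single-link (the maximum pairwise document similarity); the function name and toy data are illustrative:

```python
def agglomerative(docs, similarity, target_clusters=1):
    """Bottom-up (agglomerative) clustering: start with one cluster per
    document and repeatedly merge the two most similar clusters.
    Single-link assumption: cluster similarity = max pairwise similarity."""
    clusters = [[d] for d in docs]
    while len(clusters) > target_clusters:
        best = None  # (similarity, i, j) of the best pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge K and L into KL
        del clusters[j]
    return clusters

# Toy 1-D "documents"; similarity = negative distance
groups = agglomerative([1, 2, 10, 11], lambda a, b: -abs(a - b),
                       target_clusters=2)
```

Recomputing all pairwise similarities each round is what gives the quadratic (here, worse) cost mentioned on the next slide; real implementations cache the similarity matrix.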
15. Pros and cons <ul><li>Advantages: </li></ul><ul><ul><ul><li>Produces better-quality clusters </li></ul></ul></ul><ul><ul><ul><li>Works relatively well in low-dimensional spaces </li></ul></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><ul><li>Distance computation is expensive in high-dimensional spaces </li></ul></ul></ul><ul><ul><ul><li>Quadratic time complexity </li></ul></ul></ul>
16. The EM algorithm for clustering <ul><li>Let the analyzed object be described by two random variables, X (observed) and Y (hidden), which are assumed to have a joint probability distribution p(X, Y | θ). </li></ul>
17. The EM algorithm for clustering <ul><li>The distribution is known up to its parameter(s) θ. It is assumed that we are given a set of samples drawn independently from the distribution. </li></ul>
18. The EM algorithm for clustering <ul><li>The Expectation-Maximization (EM) algorithm is an optimization procedure that computes the maximum-likelihood (ML) estimate of the unknown parameter θ when only incomplete data are presented (Y is unknown). In other words, the EM algorithm maximizes the likelihood function L(θ) = p(X | θ). </li></ul>
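As a concrete instance of the procedure above, here is a toy EM sketch for a two-component, one-dimensional Gaussian mixture: the E-step computes each point's responsibility (the posterior probability of the hidden component Y), and the M-step re-estimates the parameters θ = (means, variances, weights). The initialization and data are illustrative assumptions:

```python
import math

def em_gmm_1d(data, iterations=50):
    """EM for a two-component 1-D Gaussian mixture (toy sketch)."""
    # Crude initialization of theta from the data range
    mu = [min(data), max(data)]   # component means
    var = [1.0, 1.0]              # component variances
    pi = [0.5, 0.5]               # mixture weights
    for _ in range(iterations):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [pi[k]
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k])
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate theta to maximize the expected log-likelihood
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
            pi[k] = nk / len(data)
    return mu, var, pi

# Two well-separated toy groups of observations
mu, var, pi = em_gmm_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])
```

For document clustering the same scheme is applied with a mixture over documents, each component playing the role of a cluster and the responsibilities giving a soft assignment.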
19. Evaluation of clustering <ul><li>What is a good clustering? </li></ul><ul><li>Internal criterion: a good clustering produces high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. The measured quality of a clustering depends on both the document representation and the similarity measure used. </li></ul>
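The internal criterion above can be sketched as a direct comparison of average within-cluster and between-cluster similarity; the helper name and toy data are illustrative assumptions:

```python
def intra_inter_similarity(clusters, similarity):
    """Average within-cluster vs. between-cluster pairwise similarity.
    A good clustering should give a clearly higher intra value."""
    intra, inter = [], []
    for i, ci in enumerate(clusters):
        # All pairs inside cluster i
        for a_idx in range(len(ci)):
            for b_idx in range(a_idx + 1, len(ci)):
                intra.append(similarity(ci[a_idx], ci[b_idx]))
        # All pairs between cluster i and every later cluster
        for cj in clusters[i + 1:]:
            for a in ci:
                for b in cj:
                    inter.append(similarity(a, b))
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(intra), avg(inter)

# Toy 1-D clusters; similarity = negative distance
intra, inter = intra_inter_similarity([[1, 2], [10, 11]],
                                      lambda a, b: -abs(a - b))
```

With document vectors, the cosine similarity from the earlier slides would be the natural `similarity` argument.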
20. Conclusion <ul><li>In this presentation we learned about: </li></ul><ul><li>Measuring similarity for retrieval </li></ul><ul><li>Web-based document search and link analysis </li></ul><ul><li>Document matching </li></ul><ul><li>Clustering by similarity </li></ul><ul><li>Hierarchical clustering </li></ul><ul><li>The EM algorithm for clustering </li></ul><ul><li>Evaluation of clustering </li></ul>
21. Visit more self-help tutorials <ul><li>Pick a tutorial of your choice and browse through it at your own pace. </li></ul><ul><li>The tutorials section is free and self-guiding and does not include additional support. </li></ul><ul><li>Visit us at www.dataminingtools.net </li></ul>