
# Textmining Retrieval And Clustering

Published in: Technology

### Textmining Retrieval And Clustering

1. 1. Retrieval and clustering of documents
2. 2. Measuring similarity for retrieval <ul><li>Given a set of documents, a similarity measure for retrieval determines how many of the documents are relevant to a particular category. </li></ul>
3. 3. Cosine similarity for retrieval <ul><li>Cosine similarity is a measure of similarity between two vectors of n dimensions, obtained by finding the cosine of the angle between them; it is often used to compare documents in text mining. Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is expressed using a dot product and magnitudes as </li></ul><ul><li>Similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖) </li></ul>
4. 4. Cosine similarity for retrieval <ul><li>For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison. </li></ul><ul><li>The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating independence, and in-between values indicating intermediate similarity or dissimilarity. </li></ul>
5. 5. Cosine similarity for retrieval <ul><li>In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°. </li></ul>
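
The cosine formula from the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the presentation's own code: the helper names `term_frequencies` and `cosine_similarity`, the whitespace tokenizer, and the sample sentences are all assumptions made for the example.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

sentence = "text mining finds patterns in text"
doc_short = term_frequencies(sentence)
doc_long = term_frequencies(sentence + " " + sentence)  # same content, twice as long
doc_other = term_frequencies("link analysis ranks web pages")

same_topic = cosine_similarity(doc_short, doc_long)
different_topic = cosine_similarity(doc_short, doc_other)
```

Because `doc_long` is just `doc_short` repeated, the two vectors point in the same direction and the similarity is 1 up to rounding, which illustrates the length normalization mentioned above. The third document shares no terms with the first, so its similarity is 0.
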
6. 6. Web-based document search and link analysis <ul><li>Link analysis has been used successfully for deciding which web pages to add to the collection of documents and how to order the documents matching a user query (i.e., how to rank pages). </li></ul><ul><li>It has also been used to categorize web pages, to find pages that are related to given pages, to find duplicated web sites, and to address various other problems related to web information retrieval. </li></ul>
7. 7. Link Analysis <ul><li>A link from page A to page B is a recommendation of page B by the author of page A. </li></ul><ul><li>If page A and page B are connected by a link, the probability that they are on the same topic is higher than if they are not connected. </li></ul>
8. 8. Application <ul><li>Ranking query results (PageRank) </li></ul><ul><li>Crawling </li></ul><ul><li>Finding related pages </li></ul><ul><li>Computing web page reputations </li></ul><ul><li>Geographic scope, prediction </li></ul><ul><li>Categorizing web pages </li></ul><ul><li>Computing statistics of web pages and of search engines </li></ul>
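
The slides name PageRank without describing it. As a rough illustration of link-based ranking, the sketch below runs a simplified power iteration over a toy link graph. The function name `pagerank`, the damping factor 0.85, the iteration count, the handling of pages without outgoing links, and the four-page graph are all assumptions for this example, not details taken from the presentation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified power iteration; `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed convention: a page with no links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy graph, page C is linked to by three other pages and ends up with the highest score, while page D, which nobody links to, ends up with the lowest. This matches the idea that a link acts as a recommendation.
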
9. 9. Document matching <ul><li>Document matching is defined as the matching of some stated user query against a set of free-text records. </li></ul><ul><li>These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. </li></ul><ul><li>User queries can range from multi-sentence full descriptions of an information need to a few words. </li></ul>
10. 10. Steps involved in document matching <ul><li>A document matching system has two main tasks: </li></ul><ul><li>Find relevant documents to user queries </li></ul><ul><li>Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank. </li></ul>
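
The two tasks above (find relevant documents, then sort them by relevance) can be sketched with a small tf-idf ranker. This is one possible approach under stated assumptions, not the presentation's method: the function `rank_documents`, the whitespace tokenizer, the smoothed idf formula, and the sample records are all made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def rank_documents(query, documents):
    """Return (index, score) pairs sorted by descending tf-idf cosine score."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in doc_tokens for t in set(tokens))

    def weigh(tokens):
        # Assumed smoothed idf: log((1 + n) / (1 + df)) + 1.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, c in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = weigh(tokenize(query))
    scores = [(i, cosine(q, weigh(tokens))) for i, tokens in enumerate(doc_tokens)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

records = [
    "three bedroom house for sale near the park",
    "city council approves new park budget",
    "how to replace the printer toner cartridge",
]
ranking = rank_documents("house for sale", records)
```

Here the query is only a few words long, and the first record is the only one that shares terms with it, so it is ranked first and the other two records receive a score of 0.
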
11. 11. k-means clustering <ul><li>Given a set of observations (x₁, x₂, …, xₙ), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k < n), S = {S₁, S₂, …, Sₖ}, so as to minimize the within-cluster sum of squares. </li></ul>
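
The slide does not show the objective itself. In its usual form, the within-cluster sum of squares is written as follows, where the symbol μ_i (an assumed notation, not taken from the slides) denotes the mean of the observations in S_i:

$$
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2}
$$
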
12. 12. K-Means algorithm <ul><li>0. Input: D ::= {d₁, d₂, …, dₙ}; k ::= the number of clusters </li></ul><ul><li>1. Select k document vectors as the initial centroids of the k clusters </li></ul><ul><li>2. Repeat </li></ul><ul><li>3. Select one vector d from the remaining documents </li></ul><ul><li>4. Compute the similarities between d and the k centroids </li></ul><ul><li>5. Put d in the closest cluster and recompute that cluster's centroid </li></ul><ul><li>6. Until the centroids don't change </li></ul><ul><li>7. Output: k clusters of documents </li></ul>
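
The steps above can be sketched in Python. Note that this sketch uses the common batch variant (assign every vector, then recompute all centroids) rather than the one-vector-at-a-time update in the slide, and it uses squared Euclidean distance as the closeness measure. The function names, the choice of the first k vectors as seeds, and the toy 2-D data are assumptions for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iterations=100):
    """Batch k-means; returns (cluster index per vector, list of centroids)."""
    # Step 1 (assumed seeding): take the first k vectors as initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Steps 3-5: put every vector in the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Step 6: stop once nothing moves, so the centroids no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
assignment, centroids = kmeans(points, k=2)
```

With these points the two groups are well separated, so the algorithm settles after one update. With less obvious data, the result depends on which vectors are chosen as seeds.
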
13. 13. Pros and Cons <ul><li>Advantages: </li></ul><ul><ul><ul><li>linear time complexity </li></ul></ul></ul><ul><ul><ul><li>works relatively well in low-dimensional space </li></ul></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><ul><li>distance computation in high-dimensional space </li></ul></ul></ul><ul><ul><ul><li>the centroid vector may not summarize the cluster's documents well </li></ul></ul></ul><ul><ul><ul><li>the choice of the initial k clusters affects the quality of the final clusters </li></ul></ul></ul>
14. 14. Hierarchical clustering <ul><li>Input: D ::= {d₁, d₂, …, dₙ} </li></ul><ul><li>1. Calculate the similarity matrix SIM[i,j] </li></ul><ul><li>2. Repeat </li></ul><ul><li>3. Merge the two most similar clusters, K and L, to form a new cluster KL </li></ul><ul><li>4. Compute the similarities between KL and each of the remaining clusters and update SIM[i,j] </li></ul><ul><li>5. Until there is a single cluster (or a specified number of clusters) </li></ul><ul><li>6. Output: dendrogram of clusters </li></ul>
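
A minimal sketch of the merge loop above, under several assumptions: "most similar" is taken to mean the smallest average Euclidean distance between members (average linkage), distances are recomputed on each pass instead of maintaining the SIM matrix, and the function names and toy points are invented for the example. The list of merges it returns is the information a dendrogram would display.

```python
import math
from itertools import combinations

def average_link_distance(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all cross-cluster pairs of points."""
    pairs = [(i, j) for i in cluster_a for j in cluster_b]
    return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)

def agglomerative(points, target_clusters=1):
    """Merge the closest pair of clusters until `target_clusters` remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []                                   # merge history (dendrogram data)
    while len(clusters) > target_clusters:
        # Step 3: find the two most similar (here: closest) clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_link_distance(
                clusters[pair[0]], clusters[pair[1]], points),
        )
        merges.append((clusters[a], clusters[b]))
        merged = sorted(clusters[a] + clusters[b])
        # Step 4: replace the two merged clusters with the new one.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
clusters, merges = agglomerative(points, target_clusters=3)
```

Every pass compares all pairs of clusters, which is where the higher cost of hierarchical clustering relative to k-means comes from.
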
15. 15. Pros and cons <ul><li>Advantages: </li></ul><ul><ul><ul><li>produces better-quality clusters </li></ul></ul></ul><ul><ul><ul><li>works relatively well in low-dimensional space </li></ul></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><ul><li>distance computation in high-dimensional space </li></ul></ul></ul><ul><ul><ul><li>quadratic time complexity </li></ul></ul></ul>
16. 16. The EM algorithm for clustering <ul><li>Let the analyzed object be described by two random variables, x and z, which are assumed to have a probability distribution function p(x, z | θ). </li></ul>
17. 17. The EM algorithm for clustering <ul><li>The distribution is known up to its parameter(s) θ. It is assumed that we are given a set of samples {x₁, …, xₙ} independently drawn from the distribution p(x | θ). </li></ul>
18. 18. The EM algorithm for clustering <ul><li>The Expectation-Maximization (EM) algorithm is an optimization procedure which computes the Maximum-Likelihood (ML) estimate of the unknown parameter θ when only incomplete data (z is unknown) are presented. In other words, the EM algorithm maximizes the likelihood of θ with respect to the observed samples. </li></ul>
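
To make the idea concrete, here is a minimal EM sketch for one specific model: a mixture of two one-dimensional Gaussians. The slides do not fix a model, so the Gaussian mixture, the notation (x for an observed sample, z for its unknown cluster label, θ for the weights, means and variances), the initialization, and the toy samples are all assumptions. The quantity tracked in `log_likelihoods` is the log-likelihood Σ_i log Σ_z p(x_i, z | θ).

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_gaussians(samples, iterations=50):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    n = len(samples)
    overall_mean = sum(samples) / n
    spread = sum((x - overall_mean) ** 2 for x in samples) / n
    # Assumed initialization: means at the extremes, shared variance, equal weights.
    means = [min(samples), max(samples)]
    variances = [spread, spread]
    weights = [0.5, 0.5]
    log_likelihoods = []
    for _ in range(iterations):
        # E-step: posterior probability of each hidden label z given x and current θ.
        responsibilities = []
        log_likelihood = 0.0
        for x in samples:
            joint = [weights[z] * gaussian_pdf(x, means[z], variances[z]) for z in range(2)]
            total = sum(joint)
            log_likelihood += math.log(total)
            responsibilities.append([j / total for j in joint])
        log_likelihoods.append(log_likelihood)
        # M-step: re-estimate θ from the responsibility-weighted samples.
        for z in range(2):
            nz = sum(r[z] for r in responsibilities)
            weights[z] = nz / n
            means[z] = sum(r[z] * x for r, x in zip(responsibilities, samples)) / nz
            weighted_sq = sum(r[z] * (x - means[z]) ** 2
                              for r, x in zip(responsibilities, samples))
            variances[z] = max(weighted_sq / nz, 1e-6)  # floor avoids a zero variance
    return weights, means, variances, log_likelihoods

samples = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, log_likelihoods = em_two_gaussians(samples)
```

Each sample can then be assigned to the component with the larger responsibility, which turns the fitted mixture into a clustering. The recorded log-likelihood should not decrease from one iteration to the next, apart from rounding.
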
19. 19. Evaluation of clustering <ul><li>What is a good clustering? </li></ul><ul><li>Internal criterion: a good clustering will produce high-quality clusters in which the intra-class (that is, intra-cluster) similarity is high and the inter-class similarity is low. </li></ul><ul><li>The measured quality of a clustering depends on both the document representation and the similarity measure used. </li></ul>
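
One simple way to check the internal criterion is to compare the average similarity of document pairs inside the same cluster with that of pairs from different clusters. The sketch below does this with cosine similarity. The function names, the toy vectors, and the two labelings are assumptions for illustration, and the code assumes non-zero vectors and at least one pair of each kind.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two dense, non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def intra_inter_similarity(vectors, labels):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity)."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        # Pairs with the same label count as intra-cluster, all others as inter-cluster.
        (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)

vectors = [(3, 0, 1), (2, 0, 1), (0, 4, 1), (0, 3, 1)]
good_intra, good_inter = intra_inter_similarity(vectors, [0, 0, 1, 1])
bad_intra, bad_inter = intra_inter_similarity(vectors, [0, 1, 0, 1])
```

The first labeling groups vectors that point in similar directions, so its intra-cluster average is high and its inter-cluster average is low. The second labeling mixes them, and the relationship reverses.
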
20. 20. conclusion <ul><li>In this presentation we learned about </li></ul><ul><li>Measuring similarity for retrieval </li></ul><ul><li>Web-based document search and link analysis </li></ul><ul><li>Document matching </li></ul><ul><li>Clustering by similarity </li></ul><ul><li>Hierarchical clustering </li></ul><ul><li>The EM algorithm for clustering </li></ul><ul><li>Evaluation of clustering </li></ul>
21. 21. Visit more self help tutorials <ul><li>Pick a tutorial of your choice and browse through it at your own pace. </li></ul><ul><li>The tutorials section is free, self-guiding and will not involve any additional support. </li></ul><ul><li>Visit us at www.dataminingtools.net </li></ul>