- 1. SIMILARITY OF DOCUMENTS BASED ON VECTOR SPACE MODEL
- 2. Introduction
  This presentation gives an overview of the problem of finding similar documents and of how a vector space can be used to solve it.
  A vector space is a mathematical structure formed by a collection of elements called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars in this context.
  A document is treated as a bag of words, that is, a collection of words or terms. The problem arises naturally in web search and classification, where the aim is to find documents that are similar in context or content.
- 3. Introduction
  A vector v can be expressed as a sum of elements such as
  $v = a_1 v_{i_1} + a_2 v_{i_2} + \dots + a_n v_{i_n}$
  where the $a_k$ are called scalars or weights and the $v_{i_k}$ are called the components or elements.
- 4. Vectors
  Now we explore how a set of documents can be represented as vectors in a common vector space. V(d) denotes the vector derived from document d, with one component for each dictionary term.
  [Figure: document vectors V(d1) and V(d2) and query vector V(Q) drawn against the term axes t1 and t2, with the angle θ marked.]
  The documents in a collection can be viewed as a set of vectors in a vector space in which there is one axis for every term.
- 5. Vectors
  In the previous slide, the diagram shows a simple representation of two document vectors, d1 and d2, and a query vector Q.
  The space contains the terms {t1, t2, t3, …, tN}. There is an axis for each term, but for simplicity only two are drawn.
  The document d1 has components {t1, t3, …} and d2 has components {t2, …}, so V(d1) is drawn closer to axis t1 and V(d2) closer to axis t2.
  The angle θ between a document vector and the query vector indicates how close they are; the closeness is measured by the cosine of θ.
- 6. Vectors: Weights
  The weight of each component of a document vector can be given by the Term Frequency alone, or by a combination of Term Frequency and Inverse Document Frequency.
  Term Frequency, denoted tf, is the number of occurrences of a term t in a document d.
  Document Frequency, denoted df, is the number of documents in which a particular term t occurs.
  Inverse Document Frequency of a term t, denoted idf, is log(N/df), where N is the total number of documents in the space. It reduces the weight of a term that occurs in many documents; in other words, a term that occurs in few documents gets more weight.
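The three quantities defined on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenizer, the toy documents, and the choice of a base-10 logarithm are all assumptions made for the example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """tf: count occurrences of each lowercased, whitespace-separated term."""
    return Counter(text.lower().split())


def document_frequencies(docs):
    """df: for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        # set() so a term is counted once per document, however often it repeats
        df.update(set(text.lower().split()))
    return df


def inverse_document_frequency(df_t, n_docs):
    """idf = log(N / df). The base-10 logarithm is an assumption."""
    return math.log10(n_docs / df_t)


# Hypothetical mini-collection, for illustration only.
docs = ["red apple red", "green apple", "green pear"]

tf_doc0 = term_frequencies(docs[0])
df = document_frequencies(docs)
idf = {term: inverse_document_frequency(n, len(docs)) for term, n in df.items()}

print(tf_doc0["red"], df["apple"], round(idf["pear"], 4))
```

Here "red" occurs twice in the first document (tf = 2), "apple" appears in two of the three documents (df = 2), and "pear", which appears in only one document, receives the largest idf.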
- 7. Vectors: tf-idf weight
  The combination of tf and idf is the most popular weight used in document similarity exercises:
  $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$
  The weight is highest when t occurs many times within a small number of documents.
  The weight is lowest when the term occurs only a few times in a document or occurs in many documents.
  The example later in the presentation shows how tf-idf weights are used in the similarity calculation.
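To make the "highest" and "lowest" cases concrete, here is a small sketch of the tf-idf product. The function name `tf_idf`, the collection size of 1000 documents, and the base-10 logarithm are assumptions chosen for illustration.

```python
import math


def tf_idf(tf, df, n_docs):
    """Weight of a term in a document: tf * log10(N / df)."""
    if df == 0:
        # Term absent from the collection; treating its weight as 0 is an assumption.
        return 0.0
    return tf * math.log10(n_docs / df)


n_docs = 1000  # hypothetical collection size

# Frequent in the document, rare in the collection: high weight.
rare_term = tf_idf(tf=5, df=10, n_docs=n_docs)

# Same tf, but the term occurs in every document: idf is log10(1) = 0.
common_term = tf_idf(tf=5, df=1000, n_docs=n_docs)

# Rare in the collection but occurring only once in the document: lower weight.
single_use = tf_idf(tf=1, df=10, n_docs=n_docs)

print(rare_term, common_term, single_use)
```

A term that appears in every document gets weight zero regardless of how often it occurs, which is why very common words contribute nothing to the similarity score.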
- 8. Similarity: Cosine Similarity
  The similarity between two documents can be found by computing the cosine similarity between their vector representations:
  $$\mathrm{sim}(d_1, d_2) = \frac{V(d_1) \cdot V(d_2)}{|V(d_1)|\,|V(d_2)|}$$
  The numerator is the dot product of the two vectors, $\sum_{i=1}^{M} x_i y_i$, and the denominator is the product of the Euclidean lengths of the vectors, where
  $$|V(d_1)| = \sqrt{\sum_{i=1}^{M} x_i^2}$$
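The formula translates almost directly into code. The sketch below assumes both vectors are plain Python lists over the same M terms; the helper names and the decision to return 0.0 for a zero-length vector are illustrative choices, not part of the original slides.

```python
import math


def dot(x, y):
    """Dot product: sum of x_i * y_i over all M components."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    return sum(a * b for a, b in zip(x, y))


def euclidean_length(x):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(a * a for a in x))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    denominator = euclidean_length(x) * euclidean_length(y)
    if denominator == 0:
        # An all-zero vector has no direction; returning 0.0 is an assumption.
        return 0.0
    return dot(x, y) / denominator


print(cosine_similarity([3, 4], [4, 3]))   # vectors at a small angle
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal vectors
```

Because the score depends only on the angle, scaling a vector (for example, doubling every weight of a document) leaves its cosine similarity to any other vector unchanged.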
- 9. Similarity
  For example, if the vector d1 has component weights {w1, w2, w3} and the vector d2 has component weights {u1, u2}, then the dot product is
  $d_1 \cdot d_2 = w_1 u_1 + w_2 u_2$
  Since d2 has no third component, it is treated as 0, so the third term is $w_3 \cdot 0 = 0$.
  The Euclidean length of d1 is $\sqrt{w_1^2 + w_2^2 + w_3^2}$.
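One way to handle vectors with different sets of components is to store each vector as a mapping from term to weight and treat any missing term as 0. The sketch below assumes that representation; the names `sparse_dot` and `sparse_length` and the sample weights are hypothetical.

```python
import math


def sparse_dot(u, v):
    """Dot product of two term->weight dicts; a missing term counts as 0."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


def sparse_length(u):
    """Euclidean length over the components that are present."""
    return math.sqrt(sum(weight * weight for weight in u.values()))


# Hypothetical weights: d1 has three components, d2 only two.
d1 = {"t1": 1.0, "t2": 2.0, "t3": 2.0}
d2 = {"t1": 3.0, "t2": 4.0}

print(sparse_dot(d1, d2))   # 1*3 + 2*4 + 2*0
print(sparse_length(d1))    # sqrt(1 + 4 + 4)
```

Storing only the non-zero weights matters in practice because a real vocabulary has far more terms than any single document contains.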
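- 10. Example
  This is a famous example given by Dr. David Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
  There are 3 documents and one query:
  D1 = "Shipment of gold damaged in a fire"
  D2 = "Delivery of silver arrived in a silver truck"
  D3 = "Shipment of gold arrived in a truck"
  Q = "gold silver truck"
  Number of documents, D = 3. Inverse document frequency, IDF_i = log(D/df_i); the values below correspond to a base-10 logarithm. Weights are tf_i × IDF_i.

  | Term     | tf Q | tf D1 | tf D2 | tf D3 | df_i | D/df_i | IDF_i  | w Q    | w D1   | w D2   | w D3   |
  |----------|------|-------|-------|-------|------|--------|--------|--------|--------|--------|--------|
  | a        | 0    | 1     | 1     | 1     | 3    | 1      | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
  | arrived  | 0    | 0     | 1     | 1     | 2    | 1.5    | 0.1761 | 0.0000 | 0.0000 | 0.1761 | 0.1761 |
  | damaged  | 0    | 1     | 0     | 0     | 1    | 3      | 0.4771 | 0.0000 | 0.4771 | 0.0000 | 0.0000 |
  | delivery | 0    | 0     | 1     | 0     | 1    | 3      | 0.4771 | 0.0000 | 0.0000 | 0.4771 | 0.0000 |
  | gold     | 1    | 1     | 0     | 1     | 2    | 1.5    | 0.1761 | 0.1761 | 0.1761 | 0.0000 | 0.1761 |
  | fire     | 0    | 1     | 0     | 0     | 1    | 3      | 0.4771 | 0.0000 | 0.4771 | 0.0000 | 0.0000 |
  | in       | 0    | 1     | 1     | 1     | 3    | 1      | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
  | of       | 0    | 1     | 1     | 1     | 3    | 1      | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
  | shipment | 0    | 1     | 0     | 1     | 2    | 1.5    | 0.1761 | 0.0000 | 0.1761 | 0.0000 | 0.1761 |
  | silver   | 1    | 0     | 2     | 0     | 1    | 3      | 0.4771 | 0.4771 | 0.0000 | 0.9542 | 0.0000 |
  | truck    | 1    | 0     | 1     | 1     | 2    | 1.5    | 0.1761 | 0.1761 | 0.0000 | 0.1761 | 0.1761 |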
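The weight table can be recomputed from the raw text as a check. The following sketch assumes lowercasing, whitespace tokenization, and a base-10 logarithm (which matches the IDF values in the table); the variable names are illustrative. As in the table, the query is weighted with the collection's IDF values but is not counted as a document.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


# Term frequencies per document, and the shared vocabulary (one axis per term).
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}
vocab = sorted(set().union(*tf.values()))
n_docs = len(docs)

# Document frequency and inverse document frequency per term.
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocab}
idf = {t: math.log10(n_docs / df[t]) for t in vocab}

# tf-idf weights for each document and for the query.
weights = {name: {t: counts[t] * idf[t] for t in vocab} for name, counts in tf.items()}
q_tf = Counter(tokenize(query))
weights["Q"] = {t: q_tf[t] * idf[t] for t in vocab}

for t in vocab:
    row = "  ".join(f"{weights[name][t]:.4f}" for name in ("Q", "D1", "D2", "D3"))
    print(f"{t:<9} df={df[t]}  idf={idf[t]:.4f}  {row}")
```

The rows print in alphabetical order, so "fire" appears before "gold", unlike in the slide's table; the values themselves are the same.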
- 11. Example … continued
  Similarity Analysis
  We first calculate the vector lengths, using the Euclidean length of each vector:
  $|D_i| = \sqrt{\sum_j w_{i,j}^2}$
  $|D_1| = \sqrt{0.4771^2 + 0.1761^2 + 0.4771^2 + 0.1761^2} = \sqrt{0.5173} = 0.7192$
  $|D_2| = \sqrt{0.1761^2 + 0.4771^2 + 0.9542^2 + 0.1761^2} = \sqrt{1.2001} = 1.0955$
  $|D_3| = \sqrt{0.1761^2 + 0.1761^2 + 0.1761^2 + 0.1761^2} = \sqrt{0.1240} = 0.3522$
  $|Q| = \sqrt{0.1761^2 + 0.4771^2 + 0.1761^2} = \sqrt{0.2896} = 0.5382$
  Next, we calculate the dot product of the query vector with each document vector:
  $Q \cdot D_i = \sum_j w_{Q,j}\, w_{i,j}$
  $Q \cdot D_1 = 0.1761 \times 0.1761 = 0.0310$
  $Q \cdot D_2 = 0.4771 \times 0.9542 + 0.1761 \times 0.1761 = 0.4862$
  $Q \cdot D_3 = 0.1761 \times 0.1761 + 0.1761 \times 0.1761 = 0.0620$
- 12. Example … continued
  Now we calculate the cosine values:
  $\cos\theta_{D_1} = \dfrac{Q \cdot D_1}{|Q|\,|D_1|} = \dfrac{0.0310}{0.5382 \times 0.7192} = 0.0801$
  $\cos\theta_{D_2} = \dfrac{Q \cdot D_2}{|Q|\,|D_2|} = \dfrac{0.4862}{0.5382 \times 1.0955} = 0.8246$
  $\cos\theta_{D_3} = \dfrac{Q \cdot D_3}{|Q|\,|D_3|} = \dfrac{0.0620}{0.5382 \times 0.3522} = 0.3271$
  So document D2 is the most similar to the query, followed by D3 and then D1.
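The lengths, dot products, and cosine values of the worked example can be checked with a short script. This sketch takes the non-zero tf-idf weights from the example's table as given, stored as term-to-weight dictionaries; the helper names are assumptions. Because the script does not round intermediate results, the last decimal place of a score may differ slightly from the hand calculation.

```python
import math

# Non-zero tf-idf weights copied from the example's weight table.
Q = {"gold": 0.1761, "silver": 0.4771, "truck": 0.1761}
documents = {
    "D1": {"damaged": 0.4771, "gold": 0.1761, "fire": 0.4771, "shipment": 0.1761},
    "D2": {"arrived": 0.1761, "delivery": 0.4771, "silver": 0.9542, "truck": 0.1761},
    "D3": {"arrived": 0.1761, "gold": 0.1761, "shipment": 0.1761, "truck": 0.1761},
}


def dot(u, v):
    """Dot product of two sparse vectors; terms missing from v count as 0."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def length(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine of the angle between two non-zero sparse vectors."""
    return dot(u, v) / (length(u) * length(v))


scores = {name: cosine(Q, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: length={length(documents[name]):.4f}  "
          f"dot={dot(Q, documents[name]):.4f}  cosine={scores[name]:.4f}")
```

Sorting by cosine score gives the retrieval order for the query, with D2 first.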
- 13. Conclusion
  Pros
  • Documents that only partially match the query can also be identified.
  • The cosine formula gives a score that can be used to rank documents.
  Disadvantages
  • Documents are treated as bags of words, so positional information about the terms is lost.
  Usage
  Apache Lucene, the text search API, uses this concept when searching for documents that match a query.
- 14. Acknowledgements
  • An Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.
  • Term Vector Theory and Keyword Weights by Dr. E. Garcia.
  • Information Retrieval: Algorithms and Heuristics by Dr. David Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
  • Wikipedia - http://en.wikipedia.org/wiki/Vector_space_model
