Document similarity with vector space model

  1. SIMILARITY OF DOCUMENTS BASED ON VECTOR SPACE MODEL
  2. Introduction
     This presentation gives an overview of the problem of finding similar documents and of how a vector space can be used to solve it.
     A vector space is a mathematical structure formed by a collection of elements called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars in this context.
     A document is a bag of words, i.e., a collection of words or terms. The problem is easily experienced in the domain of web search or classification, where the aim is to find documents that are similar in context or content.
  3. Introduction
     A vector v can be expressed as a sum of elements:
     v = a1*v1 + a2*v2 + … + an*vn
     where the ak are called scalars or weights and the vk are the components or elements.
  4. Vectors
     Now we explore how a set of documents can be represented as vectors in a common vector space. V(d) denotes the vector derived from document d, with one component for each dictionary term.
     [Diagram: document vectors V(d1) and V(d2) and a query vector V(Q) plotted against term axes t1 and t2, with angle θ between V(d1) and V(Q).]
     The documents in a collection can be viewed as a set of vectors in a vector space, in which there is one axis for every term.
  5. Vectors
     The diagram on the previous slide shows a simple representation of two document vectors, d1 and d2, and a query vector Q.
     The space contains the terms {t1, t2, t3, …, tN}, but for simplicity only two of them are drawn, since there is an axis for each term.
     Document d1 has components {t1, t3, …} and d2 has components {t2, …}, so V(d1) is represented closer to axis t1 and V(d2) closer to t2.
     The angle θ represents the closeness of a document vector to the query vector, and closeness is measured by the cosine of θ.
  6. Vectors
     Weights
     The weight of each component of a document vector can be the Term Frequency alone or a combination of Term Frequency and Inverse Document Frequency.
     Term Frequency, denoted tf, is the number of occurrences of a term t in the document d.
     Document Frequency, denoted df, is the number of documents in which a particular term t occurs.
     Inverse Document Frequency of a term t, denoted idf, is log(N/df), where N is the total number of documents in the space. It reduces the weight of a term that occurs in many documents; in other words, a word with rare occurrences has more weight. A sketch of these definitions follows below.
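A minimal Python sketch of tf, df, and idf, assuming naive whitespace tokenization and the base-10 logarithm used in the worked example later; the corpus and function names are only illustrative:

```python
import math

# Toy corpus: the three documents from the worked example later in the deck.
docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
tokenized = [d.split() for d in docs]   # naive whitespace tokenization
N = len(tokenized)                      # total number of documents

def tf(term, doc_tokens):
    """Term frequency: raw count of `term` in one document."""
    return doc_tokens.count(term)

def df(term):
    """Document frequency: number of documents containing `term`."""
    return sum(1 for toks in tokenized if term in toks)

def idf(term):
    """Inverse document frequency: log10(N / df), as in the example."""
    return math.log10(N / df(term))

print(tf("silver", tokenized[1]))   # 2
print(df("gold"))                   # 2
print(round(idf("gold"), 4))        # 0.1761
```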
  7. Vectors
     tf-idf weight
     The combination of tf and idf is the most popular weight used in document similarity exercises:
     tf-idf(t,d) = tf(t,d) * idf(t)
     So the weight is highest when t occurs many times within a small number of documents.
     And the weight is lowest when the term occurs few times in a document or occurs in many documents.
     Later, in the example, you will see how tf-idf weights are used in the similarity calculation.
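Continuing the sketch above (it reuses `tf`, `idf`, and `tokenized` from the previous block), the tf-idf weight is simply the product of the two functions:

```python
# Reuses tf, idf, and tokenized from the previous sketch.
def tfidf(term, doc_tokens):
    """tf-idf(t, d) = tf(t, d) * idf(t)"""
    return tf(term, doc_tokens) * idf(term)

# "silver" occurs twice in D2 and in only one document overall,
# so it gets the highest weight in the collection.
print(round(tfidf("silver", tokenized[1]), 4))  # 0.9542: tf=2, idf=log10(3)
```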
  8. Similarity
     Cosine Similarity
     The similarity between two documents can be found by computing the Cosine Similarity between their vector representations:
     sim(d1,d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)
     The numerator is the dot product of the two vectors, ∑ i=1..M (xi * yi), and the denominator is the product of the Euclidean lengths of the vectors, where
     |V(d1)| = √(∑ i=1..M (xi)²)
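A direct Python sketch of this formula, assuming the two documents are already expressed as equal-length lists of component weights (the guard against zero-length vectors is an added safety check, not part of the slides):

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    len_x = math.sqrt(sum(xi * xi for xi in x))
    len_y = math.sqrt(sum(yi * yi for yi in y))
    if len_x == 0 or len_y == 0:
        return 0.0          # an empty document is similar to nothing
    return dot / (len_x * len_y)

print(round(cosine_similarity([1, 0, 1], [1, 1, 0]), 4))  # 0.5
```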
  9. Similarity
     For example, if the vector d1 has component weights {w1, w2, w3} and the vector d2 has component weights {u1, u2}, then the dot product = w1*u1 + w2*u2. Since d2 has no third component, the third term contributes w3*0 = 0.
     Euclidean length of d1 = √((w1)² + (w2)² + (w3)²)
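Because real documents contain only a small fraction of the vocabulary, the vectors are naturally sparse. A small sketch with made-up weights, where a term missing from either vector contributes 0, just like w3 above:

```python
import math

def dot(u, v):
    """Dot product of sparse vectors stored as {term: weight} dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def length(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in u.values()))

d1 = {"t1": 0.3, "t2": 0.5, "t3": 0.2}   # weights {w1, w2, w3}
d2 = {"t1": 0.4, "t2": 0.1}              # weights {u1, u2}; no t3
print(round(dot(d1, d2), 4))     # 0.3*0.4 + 0.5*0.1 = 0.17
print(round(length(d1), 4))      # √(0.09 + 0.25 + 0.04) = 0.6164
```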
 10. Example
     This is a famous example given by Dr. David Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
     There are 3 documents and a query:
     D1 = "Shipment of gold damaged in a fire"
     D2 = "Delivery of silver arrived in a silver truck"
     D3 = "Shipment of gold arrived in a truck"
     Q = "gold silver truck"
     No. of docs, D = 3; Inverse document frequency, IDF = log(D/dfi); Weights = tfi * IDFi.

     Term      | tf(Q) tf(D1) tf(D2) tf(D3) | dfi  D/dfi  IDFi   | w(Q)   w(D1)  w(D2)  w(D3)
     a         |   0     1      1      1    |  3    1     0.0000 | 0.0000 0.0000 0.0000 0.0000
     arrived   |   0     0      1      1    |  2    1.5   0.1761 | 0.0000 0.0000 0.1761 0.1761
     damaged   |   0     1      0      0    |  1    3     0.4771 | 0.0000 0.4771 0.0000 0.0000
     delivery  |   0     0      1      0    |  1    3     0.4771 | 0.0000 0.0000 0.4771 0.0000
     gold      |   1     1      0      1    |  2    1.5   0.1761 | 0.1761 0.1761 0.0000 0.1761
     fire      |   0     1      0      0    |  1    3     0.4771 | 0.0000 0.4771 0.0000 0.0000
     in        |   0     1      1      1    |  3    1     0.0000 | 0.0000 0.0000 0.0000 0.0000
     of        |   0     1      1      1    |  3    1     0.0000 | 0.0000 0.0000 0.0000 0.0000
     shipment  |   0     1      0      1    |  2    1.5   0.1761 | 0.0000 0.1761 0.0000 0.1761
     silver    |   1     0      2      0    |  1    3     0.4771 | 0.4771 0.0000 0.9542 0.0000
     truck     |   1     0      1      1    |  2    1.5   0.1761 | 0.1761 0.0000 0.1761 0.1761
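The weight table can be reproduced in a few lines of Python. The documents, tokenization, and base-10 log follow the slide; the variable names are only illustrative:

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
    "Q":  "gold silver truck",
}
tokens = {name: text.lower().split() for name, text in docs.items()}
collection = ["D1", "D2", "D3"]     # df counts documents only, not the query
N = len(collection)

vocab = sorted({t for d in collection for t in tokens[d]})
df  = {t: sum(t in tokens[d] for d in collection) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}

# One weight vector per document (and the query): weight = tf * idf.
weights = {
    name: {t: tokens[name].count(t) * idf[t] for t in vocab}
    for name in docs
}
print(round(weights["D2"]["silver"], 4))  # 0.9542, as in the table
```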
 11. Example … continued
     Similarity Analysis
     We calculate the vector lengths, |D| = √(∑j (wi,j)²), which is the Euclidean length of the vector:
     |D1| = √((0.4771)² + (0.1761)² + (0.4771)² + (0.1761)²) = √0.5173 = 0.7192
     |D2| = √((0.1761)² + (0.4771)² + (0.9542)² + (0.1761)²) = √1.2001 = 1.0955
     |D3| = √((0.1761)² + (0.1761)² + (0.1761)² + (0.1761)²) = √0.1240 = 0.3522
     |Q| = √((0.1761)² + (0.4771)² + (0.1761)²) = √0.2896 = 0.5382
     Next, we calculate the dot products of the query vector with each document vector, Q • Di = ∑j (wQ,j * wi,j):
     Q • D1 = 0.1761 * 0.1761 = 0.0310
     Q • D2 = 0.4771*0.9542 + 0.1761*0.1761 = 0.4862
     Q • D3 = 0.1761*0.1761 + 0.1761*0.1761 = 0.0620
 12. Example … continued
     Now we calculate the cosine values:
     cos θ(d1) = Q • D1 / (|Q| * |D1|) = 0.0310 / (0.5382 * 0.7192) = 0.0801
     cos θ(d2) = Q • D2 / (|Q| * |D2|) = 0.4862 / (0.5382 * 1.0955) = 0.8246
     cos θ(d3) = Q • D3 / (|Q| * |D3|) = 0.0620 / (0.5382 * 0.3522) = 0.3271
     So we see that document D2 is the most similar to the query.
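Reusing the `weights` and `collection` variables from the sketch after the table, the full ranking takes only a few more lines. Full-precision arithmetic differs from the slides in the last decimal place, since the slides round intermediate values:

```python
import math

# Reuses `weights` and `collection` from the previous sketch.
def cosine(q, d):
    """Cosine similarity of two weight vectors over the same vocabulary."""
    dot = sum(q[t] * d[t] for t in q)
    len_q = math.sqrt(sum(w * w for w in q.values()))
    len_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (len_q * len_d)

scores = {name: cosine(weights["Q"], weights[name]) for name in collection}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 4))
# D2 0.8247   (slide: 0.8246)
# D3 0.3272   (slide: 0.3271)
# D1 0.0801
```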
 13. Conclusion
     Pros
     • Allows documents that only partially match the query to be identified.
     • The cosine formula gives a score that can be used to rank documents.
     Cons
     • Documents are treated as bags of words, so the positional information about the terms is lost.
     Usage
     Apache Lucene, the text search library, uses this concept when searching for documents matching a query.
 14. Acknowledgements
     • An Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.
     • Term Vector Theory and Keyword Weights by Dr. E. Garcia.
     • Information Retrieval: Algorithms and Heuristics by Dr. David Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
     • Wikipedia - http://en.wikipedia.org/wiki/Vector_space_model
