Document similarity with vector space model
Presentation Transcript

  • SIMILARITY OF DOCUMENTS BASED ON VECTOR SPACE MODEL
  • Introduction
    This presentation gives an overview of the problem of finding documents that are similar, and of how a vector space can be used to solve it.
    A vector space is a mathematical structure formed by a collection of elements called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars in this context.
    A document is a bag of words, i.e. a collection of words or terms. The problem arises naturally in web search and classification, where the aim is to find documents that are similar in context or content.
  • Introduction
    A vector v can be expressed as a sum of elements, such as
    v = a1*vi1 + a2*vi2 + … + an*vin
    where the ak are called scalars or weights, and the vik are the components or elements.
  • Vectors
    Now we explore how a set of documents can be represented as vectors in a common vector space. V(d) denotes the vector derived from document d, with one component for each dictionary term.
    [Diagram: document vectors V(d1) and V(d2) and a query vector V(Q) plotted on axes t1 and t2, with the angle θ between V(d1) and V(Q).]
    The documents in a collection can be viewed as a set of vectors in a vector space, in which there is one axis for every term.
  • Vectors
    In the previous slide, the diagram shows a simple representation of two document vectors, d1 and d2, and a query vector Q.
    The space contains the terms {t1, t2, t3, …, tN}, but for simplicity only two terms are drawn, since there is an axis for each term.
    Document d1 has components {t1, t3, …} and d2 has components {t2, …}. So V(d1) is represented closer to axis t1 and V(d2) closer to t2.
    The angle θ represents the closeness of a document vector to the query vector, and the closeness is measured by the cosine of θ.
  • Vectors
    Weights
    The weight of a component of a document vector can be given by the term frequency alone, or by a combination of term frequency and inverse document frequency.
    Term frequency, denoted tf, is the number of occurrences of a term t in the document d.
    Document frequency, denoted df, is the number of documents in which a particular term t occurs.
    Inverse document frequency of a term t, denoted idf, is log(N/df), where N is the total number of documents in the space. It reduces the weight of a term that occurs in many documents; in other words, a rarely occurring term gets more weight.
  • Vectors
    tf-idf weight
    The combination of tf and idf is the most popular weight used in document similarity exercises:
    tf-idf(t,d) = tf(t,d) * idf(t)
    So the weight is highest when t occurs many times within a small number of documents.
    And the weight is lowest when the term occurs few times in a document, or occurs in many documents.
    Later, in the example, you will see how tf-idf weights are used in the similarity calculation.
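The tf-idf weighting described above can be sketched in a few lines of Python. This is an illustrative helper, not part of any library; it uses log base 10, matching the worked example later in the deck:

```python
import math

def tfidf_vectors(docs):
    """Build tf-idf vectors (term -> weight dicts) for a list of
    tokenized documents, using idf = log10(N / df) as in the slides."""
    n = len(docs)
    # document frequency: the number of documents containing each term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc:
            vec[term] = vec.get(term, 0) + 1       # raw term frequency tf
        for term in vec:
            vec[term] *= math.log10(n / df[term])  # tf * idf
        vectors.append(vec)
    return vectors
```

Note that a term appearing in every document gets idf = log(N/N) = 0, so words like "of" and "in" are weighted out automatically.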
  • Similarity
    Cosine similarity
    The similarity between two documents can be found by computing the cosine similarity between their vector representations:
    sim(d1,d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)
    The numerator is the dot product of the two vectors, ∑ i=1 to M (xi * yi), and the denominator is the product of the Euclidean lengths of the vectors, where
    |V(d1)| = √( ∑ i=1 to M (xi)² )
  • Similarity
    For example, if vector d1 has component weights {w1, w2, w3} and vector d2 has component weights {u1, u2}, then the dot product = w1*u1 + w2*u2. Since d2 has no third component, the third term is w3*0 = 0.
    Euclidean length of d1 = √( (w1)² + (w2)² + (w3)² )
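The cosine formula above can be sketched directly for sparse vectors stored as term-to-weight dicts (a hypothetical helper for illustration); a term missing from one vector contributes 0 to the dot product, exactly as in the w3*0 case:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two sparse vectors given as term -> weight dicts.
    Terms absent from one vector contribute 0 to the dot product."""
    dot = sum(w * v2.get(term, 0.0) for term, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty (all-zero) vector is similar to nothing
    return dot / (norm1 * norm2)
```

A vector compared with itself gives 1.0, and two vectors with no terms in common give 0.0, so the score always lies between 0 and 1 for non-negative weights.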
  • Example
    This is a famous example given by Dr. David Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
    There are 3 documents and a query:
    D1 = "Shipment of gold damaged in a fire"
    D2 = "Delivery of silver arrived in a silver truck"
    D3 = "Shipment of gold arrived in a truck"
    Q  = "gold silver truck"
    Number of documents D = 3; inverse document frequency IDFi = log(D/dfi).

                      tf                                 weights = tfi * IDFi
    Terms       Q  D1  D2  D3   dfi  D/dfi  IDFi      Q       D1      D2      D3
    a           0   1   1   1    3    1    0.0000  0.0000  0.0000  0.0000  0.0000
    arrived     0   0   1   1    2    1.5  0.1761  0.0000  0.0000  0.1761  0.1761
    damaged     0   1   0   0    1    3    0.4771  0.0000  0.4771  0.0000  0.0000
    delivery    0   0   1   0    1    3    0.4771  0.0000  0.0000  0.4771  0.0000
    gold        1   1   0   1    2    1.5  0.1761  0.1761  0.1761  0.0000  0.1761
    fire        0   1   0   0    1    3    0.4771  0.0000  0.4771  0.0000  0.0000
    in          0   1   1   1    3    1    0.0000  0.0000  0.0000  0.0000  0.0000
    of          0   1   1   1    3    1    0.0000  0.0000  0.0000  0.0000  0.0000
    shipment    0   1   0   1    2    1.5  0.1761  0.0000  0.1761  0.0000  0.1761
    silver      1   0   2   0    1    3    0.4771  0.4771  0.0000  0.9542  0.0000
    truck       1   0   1   1    2    1.5  0.1761  0.1761  0.0000  0.1761  0.1761
  • Example … continued
    Similarity analysis.
    First we calculate the vector lengths, |D| = √( ∑j (wi,j)² ), the Euclidean length of each vector:
    |D1| = √( (0.4771)² + (0.1761)² + (0.4771)² + (0.1761)² ) = √0.5173 = 0.7192
    |D2| = √( (0.1761)² + (0.4771)² + (0.9542)² + (0.1761)² ) = √1.2001 = 1.0955
    |D3| = √( (0.1761)² + (0.1761)² + (0.1761)² + (0.1761)² ) = √0.1240 = 0.3522
    |Q|  = √( (0.1761)² + (0.4771)² + (0.1761)² ) = √0.2896 = 0.5382
    Next we calculate the dot product of the query vector with each document vector, Q · Di = ∑j (wQ,j * wi,j):
    Q · D1 = 0.1761 * 0.1761 = 0.0310
    Q · D2 = 0.4771 * 0.9542 + 0.1761 * 0.1761 = 0.4862
    Q · D3 = 0.1761 * 0.1761 + 0.1761 * 0.1761 = 0.0620
  • Example … continued
    Now we calculate the cosine values:
    cos θ(d1) = Q · D1 / (|Q| * |D1|) = 0.0310 / (0.5382 * 0.7192) = 0.0801
    cos θ(d2) = Q · D2 / (|Q| * |D2|) = 0.4862 / (0.5382 * 1.0955) = 0.8246
    cos θ(d3) = Q · D3 / (|Q| * |D3|) = 0.0620 / (0.5382 * 0.3522) = 0.3271
    So we see that document D2 is the most similar to the query.
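The whole worked example can be reproduced end to end in a short script (a sketch, not production code; the last digit may differ slightly from the slide values, which round the intermediate weights to four decimals):

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()

# idf = log10(D / df), with df taken over the document collection only
n = len(docs)
df = {}
for words in docs.values():
    for term in set(words):
        df[term] = df.get(term, 0) + 1
idf = {term: math.log10(n / d) for term, d in df.items()}

def weight_vector(words):
    """tf-idf weights for a token list; unseen terms get weight 0."""
    tf = {}
    for term in words:
        tf[term] = tf.get(term, 0) + 1
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

q = weight_vector(query)
for name, words in docs.items():
    print(name, round(cosine(q, weight_vector(words)), 4))
```

Running this ranks D2 first, matching the slide's conclusion that D2 is the most similar document to the query.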
  • Conclusion
    Pros
    • Documents with only a partial match to the query can still be identified.
    • The cosine formula gives a score that can be used to rank documents.
    Cons
    • Documents are treated as bags of words, so the positional information about the terms is lost.
    Usage
    Apache Lucene, the text-search library, uses this concept when searching for documents matching a query.
  • Acknowledgements
    • Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze.
    • Term Vector Theory and Keyword Weights by Dr. E. Garcia.
    • Information Retrieval: Algorithms and Heuristics by Dr. David Grossman and Dr. Ophir Frieder of the Illinois Institute of Technology.
    • Wikipedia - http://en.wikipedia.org/wiki/Vector_space_model