Text Similarities - PG Pushpin
Notes
  • Data retrieval – in layman's terms, data retrieval means that the words or terms within a document or web page are translated into some mathematical structure.
  • This implies that, given a document, each distinct word or term within it is translated into a particular mathematical structure, e.g. a vector or a frequency matrix.
  • TF – in its simplest form, the term frequency is also called the Term Count, which is simply the number of occurrences of the term in the document. IDF – obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Note that if a term has a high term frequency in the given document and a low document frequency across the considered collection of documents (implying a high inverse document frequency), then a high tf-idf is achieved.
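Putting the two notes together, the weighting used throughout the later examples is, in the standard formulation (the slides use log base 10, as the IDF values 0.1761 and 0.4771 imply):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log_{10} \frac{|D|}{|\{d' \in D : t \in d'\}|}
```

where tf(t, d) is the count of term t in document d and |D| is the total number of documents.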
  • Too simple to be used on its own; not realistic.
  • http://www.miislita.com/term-vector/term-vector-3.html – The vector value (or term weight) for each term present in a document is non-zero and is calculated using some weighting scheme; one well-known scheme is TF/IDF. D1: "Shipment of gold damaged in a fire"; D2: "Delivery of silver arrived in a silver truck"; D3: "Shipment of gold arrived in a truck"; Q: "Gold Silver Truck"
  • http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?start=1
  • k should be low enough to discard the unwanted, most common words (e.g. "a", "the") as noise, yet high enough to keep the important words within the context.

Transcript

  • 1. PUSHPIN TEXT SIMILARITIES Junaid Surve 6644418
  • 2. AGENDA Introduction Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype Summary 2
  • 3. AGENDA Introduction Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype Summary 3
  • 4. INTRODUCTION WWW – a huge tangled web of information. Issues faced – duplication, plagiarism, copyright violation etc. Aim: to detect and report duplicates. Method: compare documents and output their level of similarity, which is "TEXT SIMILARITY". 4
  • 5.  Text Similarity has 2 aspects :  Content Similarity : Words are compared. e.g. “I have a car” and “I have a vehicle” are 75% similar.  Expression Similarity : Meaning of the information is considered. e.g. “I have a car” and “I have a vehicle” can be considered 100% similar. Scope – Content Similarity 5
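One naive way to arrive at the 75% figure for content similarity is plain word overlap; a minimal sketch (assuming position-wise comparison, which is only one of several possible conventions):

```python
def word_overlap(a: str, b: str) -> float:
    """Fraction of word positions at which two sentences agree."""
    wa, wb = a.lower().split(), b.lower().split()
    shared = sum(1 for x, y in zip(wa, wb) if x == y)
    return shared / max(len(wa), len(wb))

print(word_overlap("I have a car", "I have a vehicle"))  # 0.75
```

Expression similarity, by contrast, would require recognizing that "car" and "vehicle" refer to the same thing, which simple word comparison cannot do.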
  • 6.  2-step process:  STEP I : Data Retrieval – "The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web" [1]  STEP II : Similarity Measurements – to correlate the words or terms of two or more documents or web pages. 6
  • 7. AGENDA Introduction Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype Summary 7
  • 8. DATA RETRIEVAL Translation of literature to mathematics. A variety of such concrete techniques exist –  TF/IDF  Document-Term Matrix  VSM  LSA The corresponding mathematical structure is derived based on the relevant concrete data retrieval methodology used. 8
  • 9. TF/IDF Term Frequency / Inverse Document Frequency Idea: the more common the term, the less importance it has, and hence it should be considered at the least important end of the query spectrum. Two linear, independent aspects:  Term Frequency – frequency of occurrence of a term in a given document.  Inverse Document Frequency – a measure of the general importance of the term. 9
  • 10. TF IDF Example [7] Three Documents –  D1: “Shipment of gold damaged in a fire”  D2: “Delivery of silver arrived in a silver truck”  D3: “Shipment of gold arrived in a truck” Two steps  Calculate the Term Frequency  Calculate the Inverse Document Frequency 10
  • 11. TF IDF Example

        Terms      D1  D2  D3  df_i  D/df_i     IDF = log(D/df_i)
        a          1   1   1   3     3/3 = 1    0
        arrived    -   1   1   2     3/2 = 1.5  0.1761
        damaged    1   -   -   1     3/1 = 3    0.4771
        delivery   -   1   -   1     3/1 = 3    0.4771
        fire       1   -   -   1     3/1 = 3    0.4771
        gold       1   -   1   2     3/2 = 1.5  0.1761
        in         1   1   1   3     3/3 = 1    0
        of         1   1   1   3     3/3 = 1    0
        silver     -   2   -   1     3/1 = 3    0.4771
        shipment   1   -   1   2     3/2 = 1.5  0.1761
        truck      -   1   1   2     3/2 = 1.5  0.1761
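A minimal Python sketch that reproduces the IDF column above (log base 10, as the 0.1761 and 0.4771 values imply):

```python
import math

# The three example documents, tokenized naively on whitespace.
docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}

D = len(docs)  # total number of documents
terms = sorted({t for words in docs.values() for t in words})

for term in terms:
    df = sum(1 for words in docs.values() if term in words)  # document frequency
    idf = math.log10(D / df)
    print(f"{term:10s} df={df}  idf={idf:.4f}")
```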
  • 12. Document-Term Matrix “A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.” [2] Rows – Documents Columns – Terms Only depicts which document contains which term and the number of occurrences of that term in the document. 12
  • 13. Document-Term Matrix Example D1 = "I like databases" D2 = "I hate hate databases"

             I   like  databases  hate
        D1   1   1     1          0
        D2   1   0     1          2
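The same matrix can be built with a few lines of Python (a sketch assuming whitespace tokenization and a fixed vocabulary order):

```python
from collections import Counter

docs = {
    "D1": "I like databases".split(),
    "D2": "I hate hate databases".split(),
}

vocabulary = ["I", "like", "databases", "hate"]
counts = {name: Counter(words) for name, words in docs.items()}

# Each row is a document, each column a term; cells hold raw term counts.
for name in docs:
    print(name, [counts[name][term] for term in vocabulary])
# D1 [1, 1, 1, 0]
# D2 [1, 0, 1, 2]
```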
  • 14. VSM "Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as index terms." [3] Each document and query is represented as a vector:  document : dj = (w1,j , w2,j , .... , wn,j)  query : q = (w1,q , w2,q , .... , wn,q) Terms can be individual words, keywords, or phrases, depending on the type of application. 14
  • 15. VSM Example [7] Three Documents –  D1: “Shipment of gold damaged in a fire”  D2: “Delivery of silver arrived in a silver truck”  D3: “Shipment of gold arrived in a truck” Query –  Gold Silver Truck 15
  • 16. VSM Example continued...  Calculating the TF-IDF weights (a sketch computing these vectors follows below):

        Terms      Q  D1  D2  D3  IDF_i   Q×IDF_i  D1×IDF_i  D2×IDF_i  D3×IDF_i
        a          -  1   1   1   0       -        0         0         0
        arrived    -  -   1   1   0.1761  -        -         0.1761    0.1761
        damaged    -  1   -   -   0.4771  -        0.4771    -         -
        delivery   -  -   1   -   0.4771  -        -         0.4771    -
        fire       -  1   -   -   0.4771  -        0.4771    -         -
        gold       1  1   -   1   0.1761  0.1761   0.1761    -         0.1761
        in         -  1   1   1   0       -        0         0         0
        of         -  1   1   1   0       -        0         0         0
        silver     1  -   2   -   0.4771  0.4771   -         0.9542    -
        shipment   -  1   -   1   0.1761  -        0.1761    -         0.1761
        truck      1  -   1   1   0.1761  0.1761   -         0.1761    0.1761
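A short sketch computing these weight vectors (raw term counts times IDF, matching the table):

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()

terms = sorted({t for words in docs.values() for t in words})
D = len(docs)
idf = {t: math.log10(D / sum(1 for ws in docs.values() if t in ws)) for t in terms}

def tfidf_vector(words):
    # Weight per term: raw count of the term in `words` times the term's IDF.
    return [words.count(t) * idf[t] for t in terms]

vectors = {name: tfidf_vector(ws) for name, ws in docs.items()}
vectors["Q"] = tfidf_vector(query)
```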
  • 17. LSA "Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words." [4] Built on the assumption that similar terms tend to appear in close proximity, so correlation patterns between documents or terms become easier to identify. 2-step process:  Construction of Document-Term Matrix  Singular Value Decomposition 17
  • 18. LSA Example Three Documents –  D1: “Shipment of gold damaged in a fire”  D2: “Delivery of silver arrived in a silver truck”  D3: “Shipment of gold arrived in a truck” Query –  Gold Silver Truck 18
  • 19. LSA Example contd... STEP 1 : Constructing the Term-Document Matrix & Query Matrix 19
  • 20. LSA Example contd... STEP 2 : Evaluating the Singular Value Decomposition 20
  • 21. LSA Example contd... STEP 3 : Reducing Dimensionality w.r.t. k 21
  • 22.  A similar SVD evaluation and reduction is done for the query vector Q. At the end we have:  Reduced SVD Matrix V (for the documents)  Reduced SVD Matrix Q (for the query) (the V and Q matrices are shown on the slide) These can then be supplied to a similarity measurement technique. 22
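A compact numpy sketch of the whole LSA pipeline on this example, assuming the standard rank-k truncation and the usual query fold-in q_k = qᵀ·U_k·S_k⁻¹ (the slides' matrix images are not reproduced here, so the values below are recomputed, not copied):

```python
import numpy as np

# Term-document matrix (rows = terms, columns = D1, D2, D3), raw counts.
terms = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]
A = np.array([
    [1, 1, 1],  # a
    [0, 1, 1],  # arrived
    [1, 0, 0],  # damaged
    [0, 1, 0],  # delivery
    [1, 0, 0],  # fire
    [1, 0, 1],  # gold
    [1, 1, 1],  # in
    [1, 1, 1],  # of
    [1, 0, 1],  # shipment
    [0, 2, 0],  # silver
    [0, 1, 1],  # truck
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # keep only the two largest singular values
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Fold the query "gold silver truck" into the reduced space.
q = np.array([1 if t in ("gold", "silver", "truck") else 0 for t in terms])
qk = q @ Uk @ np.linalg.inv(Sk)

print(Vk)  # reduced document coordinates (one row per document)
print(qk)  # reduced query coordinates
```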
  • 23. AGENDA Introduction Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype Summary 23
  • 24. SIMILARITY MEASUREMENTS Major focus of “Text Similarities” methodology. Uses the Mathematical Structures generated by the Data Retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages. Two major techniques in focus here:  Cosine Similarity  SOC-PMI 24
  • 25. COSINE SIMILARITY Evaluates similarity between 2 vectors by measuring the cosine of the angle between them. The cosine of the angle determines whether the vectors are roughly pointing in the same direction. In our scope: similarity will range between 0 and 1, since term weights are always positive, i.e. the angle between two considered vectors will never exceed 90°. 25
  • 26. COSINE Example [7] Example continued from VSM.  Three Documents –  D1: "Shipment of gold damaged in a fire"  D2: "Delivery of silver arrived in a silver truck"  D3: "Shipment of gold arrived in a truck"  Query – Gold Silver Truck We have calculated the weights using the TF-IDF scheme. Next Step – calculate the Cosine Similarity:  CosineΘDi = (Q · Di) / (|Q| × |Di|)  i.e. first calculate the dot product Q · Di  then calculate the product of the magnitudes |Q| × |Di| 26
  • 27. COSINE Example continued... Dot products: Q · Di = Σj wQ,j wi,j  Q · D1 = 0.0310, Q · D2 = 0.4862, Q · D3 = 0.0620 Magnitude products: |Q| × |Di| = sqrt(Σj w²Q,j) × sqrt(Σj w²i,j)  |Q| × |D1| = 0.3871, |Q| × |D2| = 0.5896, |Q| × |D3| = 0.1896 Cosine Similarity:  CosineΘD1 = 0.0801  CosineΘD2 = 0.8246  CosineΘD3 = 0.3271 27
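A self-contained check of these numbers (the weight vectors are the slide-16 values, hard-coded in the term order a, arrived, damaged, delivery, fire, gold, in, of, shipment, silver, truck):

```python
import math

def cosine(q, d):
    """Cosine of the angle between two weight vectors."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    return dot / (math.sqrt(sum(w * w for w in q)) * math.sqrt(sum(w * w for w in d)))

Q  = [0, 0,      0,      0,      0,      0.1761, 0, 0, 0,      0.4771, 0.1761]
D1 = [0, 0,      0.4771, 0,      0.4771, 0.1761, 0, 0, 0.1761, 0,      0     ]
D2 = [0, 0.1761, 0,      0.4771, 0,      0,      0, 0, 0,      0.9542, 0.1761]
D3 = [0, 0.1761, 0,      0,      0,      0.1761, 0, 0, 0.1761, 0,      0.1761]

for name, d in (("D1", D1), ("D2", D2), ("D3", D3)):
    print(name, round(cosine(Q, d), 4))  # 0.0801, 0.8246, 0.327
```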
  • 28. SOC-PMI "Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus." [5] Considerable mathematics is involved in deriving the formula. The resulting similarity measure is also normalized so that it ranges between 0 and 1. 28
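The building block is plain pointwise mutual information; a minimal sketch of PMI over a tokenized corpus (not the full SOC-PMI aggregation, and the windowing and normalization conventions vary by author):

```python
import math
from collections import Counter

def pmi(w1, w2, tokens, window=5):
    """PMI of w1 and w2, counting co-occurrences within +/- `window` tokens."""
    n = len(tokens)
    freq = Counter(tokens)
    cooc = 0
    for i, t in enumerate(tokens):
        if t == w1:
            cooc += tokens[max(0, i - window): i + window + 1].count(w2)
    if cooc == 0:
        return float("-inf")  # the words never co-occur in the corpus
    # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    return math.log2((cooc / n) / ((freq[w1] / n) * (freq[w2] / n)))
```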
  • 29. SOC-PMI with an example A complicated method involving many mathematical formulae. Example [6] :  W1 = car  W2 = automobile  m = 70, n = 43 Assumptions:  γ = 3, δ = 0.7  a window of 11 words  β1 = β2 = 24.88 (The corpus is shown on the slide.) 29
  • 30. SOC-PMI example contd... Bigram frequencies and the set X; types & frequencies and the set Y of words with their PMI values (shown as tables on the slide). 30
  • 31. SOC-PMI example contd... (calculation shown on the slide) 31
  • 32. AGENDA Introduction Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype Summary 32
  • 33. APPLICATIONS Plagiarism Detection – text similarity plays an important role in the field of plagiarism detection. Copyright Violation – copies of restricted software/data can be detected using text similarities. Recommender Services 33
  • 34. PROTOTYPE AIM : Finding the degree of similarity between files. 2 steps  Data Retrieval  TF-IDF  Similarity Measurement  Cosine  Pearson Correlation  Distribution Matrix  Co-occurrence 34
  • 35. Prototype – Data Retrieval Steps followed to retrieve data using the TF-IDF scheme  SequenceFilesFromDirectory  Converts files into sequence files. <Text, Text>  DocumentProcessor  Converts the sequence file into <Text, StringTuple>  DictionaryVectorizer  Creates TF Vectors <Text, VectorWritable>  Creates dfcount <IntWritable, LongWritable>  Creates wordcount <Text, LongWritable>  TFIDFConverter  Creates TF-IDF vectors <Text, VectorWritable> 35
  • 36. Prototype – Similarity Measurement Intermediate steps  Convert the TF-IDF vectors into a Matrix <IntWritable, VectorWritable> Similarity Measurement  Distribution Multiplication  Matrix × Matrixᵀ  Cosine, Pearson Correlation and Co-occurrence  RowSimilarityJob (Similarity Classname)  SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE  SIMILARITY_PEARSON_CORRELATION  SIMILARITY_COOCCURRENCE 36
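The "Matrix × Matrixᵀ" step can be pictured with a small sketch (plain numpy rather than Mahout's distributed RowSimilarityJob; the values are toy numbers, not the prototype's output): normalizing each row to unit length first makes the product A·Aᵀ a matrix of pairwise cosine similarities.

```python
import numpy as np

# Rows are documents as TF-IDF vectors (toy values for illustration).
A = np.array([
    [0.4, 0.0, 0.6],
    [0.1, 0.9, 0.0],
    [0.5, 0.1, 0.5],
])

A_hat = A / np.linalg.norm(A, axis=1, keepdims=True)  # unit-length rows
S = A_hat @ A_hat.T  # S[i, j] = cosine similarity of documents i and j

print(np.round(S, 3))  # diagonal is 1.0; off-diagonals are pairwise similarities
```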
  • 37. Prototype – Similarity Measurement Results for Cosine, Pearson Correlation, Distribution Matrix and Co-occurrence (shown as screenshots on the slide). 37
  • 38. AGENDA Introduction Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype Summary 38
  • 39. SUMMARY What Text Similarity is. Scope – Content Similarity Steps involved in the process:  Data Retrieval  TF/IDF  Document-Term Matrix  VSM  LSA  Similarity Measurements  Cosine Similarity  SOC-PMI Applications & Prototype 39
  • 41. References
    [1] Wikipedia: Information retrieval - Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
    [2] Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
    [3] Wikipedia: Vector space model - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
    [4] Wikipedia: Latent semantic indexing - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
    [5] Wikipedia: Second-order co-occurrence pointwise mutual information - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
    [6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
    [7] Dr. E. Garcia. Miislita.com - http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html