PUSHPIN: TEXT SIMILARITIES, Junaid Surve, 6644418
Text Similarities - PG Pushpin

Published in: Education, Technology
  • Data retrieval - In layman's terms, data retrieval means that the words or terms within a document or web page are translated into some mathematical structure.
  • That is, given a document, each distinct word or term within it is translated into a particular mathematical structure, e.g. a vector or a frequency matrix.
  • TF - In its simplest form, the term frequency is the term count, i.e. the number of occurrences of the term in the document. IDF - obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. Note that a term with a high term frequency in the given document and a low document frequency across the collection (implying a high inverse document frequency) receives a high tf-idf weight.
  • Too simple to be used. Not realistic.
  • http://www.miislita.com/term-vector/term-vector-3.html The vector value (or term weight) for each existing term in a document is non-zero and is calculated using some scheme. One such well-known scheme is TF/IDF. D1: “Shipment of gold damaged in a fire” D2: “Delivery of silver arrived in a silver truck” D3: “Shipment of gold arrived in a truck” Q: “Gold Silver Truck”
  • http://www.puffinwarellc.com/index.php/news-and-articles/articles/33-latent-semantic-analysis-tutorial.html?start=1
  • k should be high enough to remove unwanted and most common words (e.g. a, the) and low enough to keep the important words within the context.
  • Transcript of "Text Similarities - PG Pushpin"

    1. PUSHPIN: TEXT SIMILARITIES, Junaid Surve, 6644418
    2. AGENDA
       • Introduction
       • Data Retrieval: TF/IDF, Document-Term Matrix, VSM, LSA
       • Similarity Measurements: Cosine Similarity, SOC-PMI
       • Applications & Prototype
       • Summary
    4. INTRODUCTION
       • The WWW is a huge, tangled web of information.
       • Issues faced: duplication, plagiarism, copyright violation, etc.
       • Aim: to detect and report duplicates.
       • Method: compare documents and output their level of similarity, which is their “TEXT SIMILARITY”.
    5. Text similarity has 2 aspects:
       • Content Similarity: words are compared. E.g. “I have a car” and “I have a vehicle” are 75% similar (3 of the 4 words match).
       • Expression Similarity: the meaning of the information is considered. E.g. “I have a car” and “I have a vehicle” can be considered 100% similar.
       • Scope: Content Similarity
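The slide does not say how the 75% figure is computed. A minimal sketch of one reading that reproduces it, assuming a position-by-position word comparison (the function name `word_overlap` is hypothetical):

```python
def word_overlap(a: str, b: str) -> float:
    """Fraction of word positions at which two sentences agree (case-insensitive)."""
    words_a, words_b = a.lower().split(), b.lower().split()
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    matches = sum(x == y for x, y in zip(words_a, words_b))
    return matches / max(len(words_a), len(words_b))


score = word_overlap("I have a car", "I have a vehicle")
print(score)  # 3 of 4 positions match
```

A set-based measure such as Jaccard overlap would give a different number for the same pair, so the figure depends on the comparison chosen.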
    6. 2-step process:
       • STEP 1: Data Retrieval. “The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web” [1]
       • STEP 2: Similarity Measurements. To correlate the words or terms of two or more documents or web pages.
    8. DATA RETRIEVAL
       • Translation of literature to mathematics.
       • A variety of concrete techniques exist: TF/IDF, Document-Term Matrix, VSM, LSA.
       • The corresponding mathematical structure is derived based on the data retrieval technique used.
    9. TF/IDF
       • Term Frequency / Inverse Document Frequency
       • Idea: the more common a term is, the less important it is, so it should sit at the low end of the query spectrum.
       • Two independent aspects:
         • Term Frequency: frequency of occurrence of a term in a given document.
         • Inverse Document Frequency: measure of the general importance of the term.
    10. TF IDF Example [7]
       • Three documents:
         • D1: “Shipment of gold damaged in a fire”
         • D2: “Delivery of silver arrived in a silver truck”
         • D3: “Shipment of gold arrived in a truck”
       • Two steps:
         • Calculate the Term Frequency
         • Calculate the Inverse Document Frequency
    11. TF IDF Example (D = 3 documents; df_i = number of documents containing term i)
       Term      D1  D2  D3  df_i  D/df_i      IDF = log(D/df_i)
       a          1   1   1     3  3/3 = 1     0
       arrived    0   1   1     2  3/2 = 1.5   0.1761
       damaged    1   0   0     1  3/1 = 3     0.4771
       delivery   0   1   0     1  3/1 = 3     0.4771
       fire       1   0   0     1  3/1 = 3     0.4771
       gold       1   0   1     2  3/2 = 1.5   0.1761
       in         1   1   1     3  3/3 = 1     0
       of         1   1   1     3  3/3 = 1     0
       silver     0   2   0     1  3/1 = 3     0.4771
       shipment   1   0   1     2  3/2 = 1.5   0.1761
       truck      0   1   1     2  3/2 = 1.5   0.1761
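The table above can be reproduced with a short sketch. It assumes lowercase whitespace tokenization and a base-10 logarithm, which is what the slide's values (log(3) = 0.4771) imply; the variable names are illustrative only.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Term frequency: raw count of each term per document
tf = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Document frequency: number of documents that contain each term
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

# Inverse document frequency: log10(D / df_i)
n_docs = len(docs)
idf = {term: math.log10(n_docs / df[term]) for term in df}

for term in sorted(idf):
    print(f"{term:10s} df={df[term]}  idf={idf[term]:.4f}")
```

Terms that occur in every document (a, in, of) get an IDF of 0 and therefore contribute nothing to later similarity scores.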
    12. Document-Term Matrix
       • “A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.” [2]
       • Rows: documents
       • Columns: terms
       • Only depicts which document contains which term and the number of occurrences of that term in the document.
    13. Document-Term Matrix Example
       • D1 = “I like databases”
       • D2 = “I hate hate databases”

             I   like   databases   hate
       D1    1   1      1           0
       D2    1   0      1           2
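A minimal sketch of how such a matrix could be built, assuming whitespace tokenization and a vocabulary ordered by first appearance (which happens to match the column order on the slide):

```python
docs = {"D1": "I like databases", "D2": "I hate hate databases"}

# Vocabulary in order of first appearance across the documents
vocab = []
for text in docs.values():
    for word in text.split():
        if word not in vocab:
            vocab.append(word)

# One row per document, one column per vocabulary term, cells are raw counts
matrix = {
    name: [text.split().count(term) for term in vocab]
    for name, text in docs.items()
}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```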
    14. VSM
       • “Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for e.g. index terms.” [3]
       • Each document and query is represented as a vector:
         • document: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
         • query: $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$
       • Terms can be individual words, keywords, or phrases, depending on the type of application.
    15. VSM Example [7]
       • Three documents:
         • D1: “Shipment of gold damaged in a fire”
         • D2: “Delivery of silver arrived in a silver truck”
         • D3: “Shipment of gold arrived in a truck”
       • Query: Gold Silver Truck
    16. VSM Example continued...
       • Calculating TF-IDF
       Term      Q   D1  D2  D3  IDF_i   Q×IDF_i  D1×IDF_i  D2×IDF_i  D3×IDF_i
       a         0   1   1   1   0       0        0         0         0
       arrived   0   0   1   1   0.1761  0        0         0.1761    0.1761
       damaged   0   1   0   0   0.4771  0        0.4771    0         0
       delivery  0   0   1   0   0.4771  0        0         0.4771    0
       fire      0   1   0   0   0.4771  0        0.4771    0         0
       gold      1   1   0   1   0.1761  0.1761   0.1761    0         0.1761
       in        0   1   1   1   0       0        0         0         0
       of        0   1   1   1   0       0        0         0         0
       silver    1   0   2   0   0.4771  0.4771   0         0.9542    0
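The weights in this table are each term count multiplied by the term's IDF. A sketch under the same assumptions as the TF-IDF example (lowercase whitespace tokens, base-10 logarithm); the helper names `counts` and `weights` are hypothetical:

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "Gold Silver Truck"


def counts(text):
    """Raw term counts for one text."""
    return Counter(text.lower().split())


doc_tf = {name: counts(text) for name, text in docs.items()}
df = Counter(term for tf in doc_tf.values() for term in tf)
idf = {term: math.log10(len(docs) / df[term]) for term in df}
terms = sorted(idf)  # fixed term order shared by all vectors


def weights(tf):
    """TF-IDF vector over the shared term order (a Counter yields 0 for absent terms)."""
    return [tf[term] * idf[term] for term in terms]


q_vec = weights(counts(query))
d_vecs = {name: weights(tf) for name, tf in doc_tf.items()}

for term, qw, w1, w2, w3 in zip(terms, q_vec, d_vecs["D1"], d_vecs["D2"], d_vecs["D3"]):
    print(f"{term:10s} {qw:.4f} {w1:.4f} {w2:.4f} {w3:.4f}")
```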
    17. LSA
       • “Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words.” [4]
       • Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms.
       • 2-step process:
         • Construction of the Document-Term Matrix
         • Singular Value Decomposition
    18. LSA Example
       • Three documents:
         • D1: “Shipment of gold damaged in a fire”
         • D2: “Delivery of silver arrived in a silver truck”
         • D3: “Shipment of gold arrived in a truck”
       • Query: Gold Silver Truck
    19. LSA Example contd... STEP 1: Constructing the Term-Document Matrix & Query Matrix
    20. LSA Example contd... STEP 2: Evaluating the Singular Value Decomposition
    21. LSA Example contd... STEP 3: Reducing dimensionality w.r.t. k
    22. LSA Example contd...
       • A similar SVD evaluation and reduction is done for the query vector Q.
       • At the end we have:
         • Reduced SVD matrix V (for the documents)
         • Reduced SVD matrix Q (for the query)
       • [Matrices V and Q shown on the slide]
       • These can then be supplied to a similarity measurement technique.
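The three LSA steps can be summarized in standard notation. This is the usual textbook formulation, not a transcription of the slide images; the symbols $A$ (term-document matrix), $U$, $\Sigma$, $V$, $q$ and $k$ are assumptions introduced here.

```latex
% Step 2: singular value decomposition of the term-document matrix
A = U \, \Sigma \, V^{T}

% Step 3: keep only the k largest singular values (rank-k approximation)
A_k = U_k \, \Sigma_k \, V_k^{T}

% Documents are represented by the rows of V_k;
% the query is folded into the same reduced space:
q_k = q^{T} \, U_k \, \Sigma_k^{-1}
```

The reduced document vectors (rows of $V_k$) and the reduced query vector $q_k$ have the same dimension $k$, so they can be compared directly with a measure such as cosine similarity.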
    24. SIMILARITY MEASUREMENTS
       • Major focus of the “Text Similarities” methodology.
       • Uses the mathematical structures generated by the data retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages.
       • Two major techniques in focus here:
         • Cosine Similarity
         • SOC-PMI
    25. COSINE SIMILARITY
       • Evaluates the similarity between 2 vectors by measuring the cosine of the angle between them.
       • The cosine of the angle determines whether the vectors are pointing in roughly the same direction.
       • In our scope: similarity ranges between 0 and 1, since term weights are never negative, i.e. the angle between the two vectors never exceeds 90°.
    26. COSINE Example [7]
       • Example continued from VSM.
         • Three documents:
           • D1: “Shipment of gold damaged in a fire”
           • D2: “Delivery of silver arrived in a silver truck”
           • D3: “Shipment of gold arrived in a truck”
         • Query: Gold Silver Truck
       • We have calculated the weights using the TF-IDF scheme.
       • Next step: calculate the cosine similarity:
         • $\cos\theta_{D_i} = (Q \cdot D_i) / (|Q| \times |D_i|)$
         • i.e. first calculate the dot product $Q \cdot D_i$,
         • then calculate the product of the vector magnitudes $|Q| \times |D_i|$.
    27. COSINE Example continued...
       • Dot products: $Q \cdot D_i = \sum_j w_{Q,j} \, w_{i,j}$
         • $Q \cdot D_1 = 0.0310$, $Q \cdot D_2 = 0.4862$, $Q \cdot D_3 = 0.0620$
       • Products of magnitudes: $|Q| \times |D_i| = \sqrt{\sum_j w_{Q,j}^2} \, \sqrt{\sum_j w_{i,j}^2}$
         • $|Q| \times |D_1| = 0.3871$, $|Q| \times |D_2| = 0.5896$, $|Q| \times |D_3| = 0.1896$
       • Cosine similarity:
         • $\cos\theta_{D_1} = 0.0801$
         • $\cos\theta_{D_2} = 0.8246$
         • $\cos\theta_{D_3} = 0.3271$
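The calculation can be checked with a short sketch. The weights below are the four-decimal TF-IDF values from the example (term count times IDF), stored as sparse dictionaries; terms with zero weight are omitted. The helper names are illustrative.

```python
import math

# Non-zero TF-IDF weights per vector (term count x IDF, rounded to 4 decimals)
q = {"gold": 0.1761, "silver": 0.4771, "truck": 0.1761}
d1 = {"shipment": 0.1761, "gold": 0.1761, "damaged": 0.4771, "fire": 0.4771}
d2 = {"delivery": 0.4771, "silver": 0.9542, "arrived": 0.1761, "truck": 0.1761}
d3 = {"shipment": 0.1761, "gold": 0.1761, "arrived": 0.1761, "truck": 0.1761}


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight}."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def norm(u):
    """Euclidean length of a sparse vector."""
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0


scores = {name: cosine(q, d) for name, d in (("D1", d1), ("D2", d2), ("D3", d3))}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 3))
```

The results agree with the slide to about three decimal places; small differences in the last digit come from the slide rounding the intermediate dot products and magnitudes. The ranking is D2, then D3, then D1.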
    28. SOC-PMI
       • “Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus.” [5]
       • A lot of mathematics is involved in deriving the formula.
       • The similarity measure is normalized at the end so that its range is limited to between 0 and 1.
    29. SOC-PMI with an example
       • A complicated method with a lot of mathematical formulae.
       • Example [6]:
         • W1 = car
         • W2 = automobile
         • m = 70, n = 43
       • Assumptions:
         • γ = 3, δ = 0.7
         • window of 11 words
       • β1 = β2 = 24.88
       • [Figure: the corpus]
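Two of the formulae behind these numbers can be sketched as follows, using the definitions given in [5] and [6]: the pointwise mutual information of a neighbour word t with a target word w, and the cut-off β that limits how many neighbours are considered. The word frequency of 6 used below is not stated on the slide; it is an assumption chosen because it yields the slide's β of about 24.88 with n = 43 and δ = 0.7.

```python
import math


def pmi(bigram_freq, freq_t, freq_w, m):
    """PMI of neighbour t with target w: log2(f_b(t, w) * m / (f_t(t) * f_t(w))).

    m is the total number of tokens in the corpus.
    """
    return math.log2(bigram_freq * m / (freq_t * freq_w))


def beta(freq_w, n, delta):
    """Neighbour cut-off: (ln f_t(w))^2 * log2(n) / delta, with n word types."""
    return (math.log(freq_w) ** 2) * math.log2(n) / delta


# Assumed target-word frequency of 6 (hypothetical; not given on the slide)
b = beta(6, n=43, delta=0.7)
print(round(b, 2))
```

The exponent γ is then applied to the PMI values of the neighbours that both target words share before they are summed.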
    30. SOC-PMI example contd... [Tables: bigram frequencies; types and frequencies; the sets X and Y of words with their PMI values]
    31. SOC-PMI example contd...
    33. APPLICATIONS
       • Plagiarism Detection: text similarity plays an important role in the field of plagiarism detection.
       • Copyright Violation: copies of restricted software or data can be detected using text similarities.
       • Recommender Services
    34. PROTOTYPE
       • AIM: finding the degree of similarity between files.
       • 2 steps:
         • Data Retrieval: TF-IDF
         • Similarity Measurement: Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence
    35. Prototype – Data Retrieval
       • Steps followed to retrieve data using the TF-IDF scheme:
         • SequenceFilesFromDirectory
           • Converts files into sequence files <Text, Text>
         • DocumentProcessor
           • Converts the sequence file into <Text, StringTuple>
         • DictionaryVectorizer
           • Creates TF vectors <Text, VectorWritable>
           • Creates dfcount <IntWritable, LongWritable>
           • Creates wordcount <Text, LongWritable>
         • TFIDFConverter
           • Creates TF-IDF vectors <Text, VectorWritable>
    36. Prototype – Similarity Measurement
       • Intermediate step:
         • Convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>
       • Similarity measurement:
         • Distribution Multiplication
           • Matrix * Matrix´
         • Cosine, Pearson Correlation and Co-occurrence
           • RowSimilarityJob (similarity classname)
             • SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
             • SIMILARITY_PEARSON_CORRELATION
             • SIMILARITY_COOCCURRENCE
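To illustrate what the three row-similarity measures compute, here is a plain-Python sketch using textbook definitions on two dense rows. It is not the prototype's code and may differ in detail from the job's own implementations; the sample rows reuse the document-term matrix example from earlier in the talk.

```python
import math


def cosine(u, v):
    """Uncentered cosine: dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    return dot / (len_u * len_v) if len_u and len_v else 0.0


def pearson(u, v):
    """Pearson correlation: cosine of the two vectors after subtracting their means."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])


def cooccurrence(u, v):
    """Number of columns in which both rows are non-zero."""
    return sum(1 for a, b in zip(u, v) if a and b)


d1 = [1, 1, 1, 0]  # "I like databases"
d2 = [1, 0, 1, 2]  # "I hate hate databases"

print(round(cosine(d1, d2), 4))
print(round(pearson(d1, d2), 4))
print(cooccurrence(d1, d2))
```

The three measures can disagree: these two rows share two terms and have a positive cosine, yet their Pearson correlation is negative because the counts deviate from their means in opposite directions.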
    37. Prototype – Similarity Measurement
       • Cosine
       • Pearson Correlation
       • Distribution Matrix
       • Co-occurrence
    39. SUMMARY
       • What is Text Similarity.
       • Scope: Content Similarity
       • Steps involved in the process:
         • Data Retrieval: TF/IDF, Document-Term Matrix, VSM, LSA
         • Similarity Measurements: Cosine Similarity, SOC-PMI
       • Applications & Prototype
    41. References
       [1] Wikipedia: Information retrieval - Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
       [2] Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
       [3] Wikipedia: Vector space model - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
       [4] Wikipedia: Latent semantic indexing - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
       [5] Wikipedia: Second-order co-occurrence pointwise mutual information - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
       [6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
       [7] Dr. E. Garcia. Mi Islita.com - http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html