- 1. PUSHPIN TEXT SIMILARITIES Junaid Surve 6644418
- 2. AGENDA. Introduction; Data Retrieval (TF/IDF, Document-Term Matrix, VSM, LSA); Similarity Measurements (Cosine Similarity, SOC-PMI); Applications & Prototype; Summary.
- 4. INTRODUCTION. The WWW is a huge, tangled web of information. Issues faced: duplication, plagiarism, copyright violation, etc. Aim: to detect and report duplicates. Method: compare documents and output their level of similarity, which is called “text similarity”.
- 5. Text similarity has two aspects. Content similarity: the words are compared, e.g. “I have a car” and “I have a vehicle” are 75% similar, because three of the four words match. Expression similarity: the meaning of the information is considered, e.g. “I have a car” and “I have a vehicle” can be considered 100% similar. Scope: content similarity.
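The 75% figure can be reproduced with a very small word-matching function. This is only a sketch: the slides do not say how the percentage was computed, so the rule used here (matching word positions divided by the length of the longer sentence) and the name `content_similarity` are assumptions.

```python
def content_similarity(a: str, b: str) -> float:
    """Fraction of word positions at which two sentences agree (assumed rule)."""
    words_a = a.lower().split()
    words_b = b.lower().split()
    longest = max(len(words_a), len(words_b))
    if longest == 0:
        return 1.0  # two empty sentences are treated as identical
    # Count positions where both sentences have the same word.
    matches = sum(1 for x, y in zip(words_a, words_b) if x == y)
    return matches / longest


score = content_similarity("I have a car", "I have a vehicle")
print(f"{score:.0%}")
```

A measure like this only sees surface words; "car" and "vehicle" count as a mismatch, which is exactly the gap that expression similarity is meant to close.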
- 6. Two-step process. Step 1, Data Retrieval: “The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web” [1]. Step 2, Similarity Measurements: correlate the words or terms of two or more documents or web pages.
- 8. DATA RETRIEVAL. Translation of literature into mathematics. A variety of concrete techniques exist: TF/IDF, Document-Term Matrix, VSM, LSA. The resulting mathematical structure depends on the data retrieval technique used.
- 9. TF/IDF: Term Frequency / Inverse Document Frequency. Idea: the more common a term is, the less important it is, so it should carry the least weight in a query. Two independent aspects: Term Frequency, the frequency of occurrence of a term in a given document; Inverse Document Frequency, a measure of the general importance of the term.
- 10. TF-IDF Example [7]. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Two steps: calculate the Term Frequency, then calculate the Inverse Document Frequency.
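- 11. TF-IDF Example. D = 3 is the number of documents, df_i the number of documents containing term i, and the logarithm is base 10.

  | Term | D1 | D2 | D3 | df_i | D/df_i | IDF_i = log(D/df_i) |
  |---|---|---|---|---|---|---|
  | a | 1 | 1 | 1 | 3 | 3/3 = 1 | 0 |
  | arrived | 0 | 1 | 1 | 2 | 3/2 = 1.5 | 0.1761 |
  | damaged | 1 | 0 | 0 | 1 | 3/1 = 3 | 0.4771 |
  | delivery | 0 | 1 | 0 | 1 | 3/1 = 3 | 0.4771 |
  | fire | 1 | 0 | 0 | 1 | 3/1 = 3 | 0.4771 |
  | gold | 1 | 0 | 1 | 2 | 3/2 = 1.5 | 0.1761 |
  | in | 1 | 1 | 1 | 3 | 3/3 = 1 | 0 |
  | of | 1 | 1 | 1 | 3 | 3/3 = 1 | 0 |
  | silver | 0 | 2 | 0 | 1 | 3/1 = 3 | 0.4771 |
  | shipment | 1 | 0 | 1 | 2 | 3/2 = 1.5 | 0.1761 |
  | truck | 0 | 1 | 1 | 2 | 3/2 = 1.5 | 0.1761 |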
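The term counts, document frequencies and IDF values of this example can be recomputed in a few lines. The sketch below assumes whitespace tokenisation, lower-casing and a base-10 logarithm (which is what yields 0.1761 and 0.4771); the variable names are illustrative only.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Term frequency: raw count of each term in each document.
tf = {name: Counter(text.lower().split()) for name, text in docs.items()}

# Document frequency: number of documents that contain the term at least once.
df = Counter(term for counts in tf.values() for term in counts)

# Inverse document frequency, IDF = log10(D / df).
idf = {term: math.log10(len(docs) / df[term]) for term in df}

for term in sorted(idf):
    print(f"{term:10s} df={df[term]}  idf={idf[term]:.4f}")
```

Terms that occur in every document ("a", "in", "of") get an IDF of 0 and therefore drop out of any later weighting.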
- 12. Document-Term Matrix. “A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.” [2] Rows: documents. Columns: terms. It only records which document contains which term and how many times that term occurs in the document.
- 13. Document-Term Matrix Example. D1 = “I like databases”, D2 = “I hate hate databases”.

  | | I | like | databases | hate |
  |---|---|---|---|---|
  | D1 | 1 | 1 | 1 | 0 |
  | D2 | 1 | 0 | 1 | 2 |
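A minimal sketch of how such a matrix could be built, assuming whitespace tokenisation and columns ordered by first appearance of each term (both are assumptions made to match the table):

```python
docs = {"D1": "I like databases", "D2": "I hate hate databases"}

# Vocabulary in order of first appearance, which gives the column order.
vocab = []
for text in docs.values():
    for word in text.split():
        if word not in vocab:
            vocab.append(word)

# One row per document, one column per term, each cell a raw count.
matrix = {
    name: [text.split().count(term) for term in vocab]
    for name, text in docs.items()
}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```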
- 14. VSM. “Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms.” [3] Each document and query is represented as a vector: document $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$, query $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$. Terms can be individual words, keywords, or phrases, depending on the type of application.
- 15. VSM Example [7]. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Query: “gold silver truck”.
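- 16. VSM Example continued. Calculating TF-IDF weights: each term count is multiplied by the term's IDF.

  | Term | Q | D1 | D2 | D3 | IDF_i | Q × IDF_i | D1 × IDF_i | D2 × IDF_i | D3 × IDF_i |
  |---|---|---|---|---|---|---|---|---|---|
  | a | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
  | arrived | 0 | 0 | 1 | 1 | 0.1761 | 0 | 0 | 0.1761 | 0.1761 |
  | damaged | 0 | 1 | 0 | 0 | 0.4771 | 0 | 0.4771 | 0 | 0 |
  | delivery | 0 | 0 | 1 | 0 | 0.4771 | 0 | 0 | 0.4771 | 0 |
  | fire | 0 | 1 | 0 | 0 | 0.4771 | 0 | 0.4771 | 0 | 0 |
  | gold | 1 | 1 | 0 | 1 | 0.1761 | 0.1761 | 0.1761 | 0 | 0.1761 |
  | in | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
  | of | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
  | silver | 1 | 0 | 2 | 0 | 0.4771 | 0.4771 | 0 | 0.9542 | 0 |
  | shipment | 0 | 1 | 0 | 1 | 0.1761 | 0 | 0.1761 | 0 | 0.1761 |
  | truck | 1 | 0 | 1 | 1 | 0.1761 | 0.1761 | 0 | 0.1761 | 0.1761 |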
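The weight vectors of the query and the three documents can be generated as follows. This is a sketch under the same assumptions as the example (raw term counts, base-10 IDF computed over the three documents only, alphabetical term order); `weight_vector` is a hypothetical helper name.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"

doc_tf = {name: Counter(text.lower().split()) for name, text in docs.items()}
df = Counter(term for counts in doc_tf.values() for term in counts)
idf = {term: math.log10(len(docs) / n) for term, n in df.items()}
vocab = sorted(idf)  # fixed term order shared by all vectors


def weight_vector(tf):
    """TF-IDF weights in vocabulary order; absent terms get weight 0."""
    return [tf[term] * idf[term] for term in vocab]


query_vector = weight_vector(Counter(query.lower().split()))
doc_vectors = {name: weight_vector(tf) for name, tf in doc_tf.items()}

for name, vec in doc_vectors.items():
    print(name, [round(w, 4) for w in vec])
```

Because the query is weighted with the same IDF values as the documents, all four vectors live in the same term space and can be compared directly.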
- 17. LSA. “Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words.” [4] Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms. Two-step process: construction of the Document-Term Matrix, then Singular Value Decomposition.
- 18. LSA Example. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Query: “gold silver truck”.
- 19. LSA Example contd. Step 1: constructing the Term-Document Matrix and the query matrix.
- 20. LSA Example contd. Step 2: evaluating the Singular Value Decomposition.
- 21. LSA Example contd. Step 3: reducing dimensionality with respect to k.
- 22. A similar SVD evaluation and reduction is done for the query vector Q. At the end we have the reduced SVD matrix V (for the documents) and the reduced SVD matrix Q (for the query). These can then be supplied to a similarity measurement technique.
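The LSA steps can be summarised with the standard truncated-SVD formulation. The symbols below are assumptions introduced for illustration, since the matrices shown on the slides did not survive: $A$ is the term-document matrix (terms as rows), $k$ is the number of retained dimensions, and $q$ is the query's term-count vector.

$$
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \qquad
q_k = q^{T} U_k \Sigma_k^{-1}
$$

Here $U_k$ and $V_k$ keep only the first $k$ columns of $U$ and $V$, and $\Sigma_k$ keeps the $k$ largest singular values. Each row of $V_k$ gives the coordinates of one document in the reduced space, and $q_k$ places the query in that same space, so the two can be compared with a measure such as cosine similarity.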
- 24. SIMILARITY MEASUREMENTS. The major focus of the “Text Similarities” methodology. Uses the mathematical structures generated by the data retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages. Two techniques are in focus here: Cosine Similarity and SOC-PMI.
- 25. COSINE SIMILARITY. Evaluates the similarity between two vectors by measuring the cosine of the angle between them. The cosine of the angle determines whether the vectors are pointing in roughly the same direction. In our scope the similarity ranges between 0 and 1, since term weights are never negative, i.e. the angle between two vectors never exceeds 90°.
- 26. COSINE Example [7]. Example continued from VSM. Three documents: D1: “Shipment of gold damaged in a fire”; D2: “Delivery of silver arrived in a silver truck”; D3: “Shipment of gold arrived in a truck”. Query: “gold silver truck”. The weights have been calculated using the TF-IDF scheme. Next step, calculate the cosine similarity: $\cos\theta_{D_i} = \dfrac{Q \cdot D_i}{|Q| \, |D_i|}$, i.e. first calculate the dot product $Q \cdot D_i$, then the product of the vector lengths $|Q| \, |D_i|$.
- 27. COSINE Example continued.
  - Dot products, $Q \cdot D_i = \sum_j w_{Q,j} \, w_{i,j}$: Q·D1 = 0.0310, Q·D2 = 0.4862, Q·D3 = 0.0620.
  - Products of vector lengths, $|Q| \, |D_i| = \sqrt{\sum_j w_{Q,j}^2} \, \sqrt{\sum_j w_{i,j}^2}$: |Q| × |D1| = 0.3871, |Q| × |D2| = 0.5896, |Q| × |D3| = 0.1896.
  - Cosine similarity: cos θ_D1 = 0.0801, cos θ_D2 = 0.8246, cos θ_D3 = 0.3271.
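The calculation can be checked with a short script. This is a sketch, not the author's implementation: the sparse dictionaries hold the four-decimal TF-IDF weights of the VSM example (term count times IDF), and `cosine` is an illustrative helper. Because the slide rounds intermediate results, the script agrees with the slide values to about three decimal places rather than exactly.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in set(u) | set(v))
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


# Non-zero TF-IDF weights only; terms with IDF = 0 are omitted.
query = {"gold": 0.1761, "silver": 0.4771, "truck": 0.1761}
documents = {
    "D1": {"damaged": 0.4771, "fire": 0.4771, "gold": 0.1761, "shipment": 0.1761},
    "D2": {"arrived": 0.1761, "delivery": 0.4771, "silver": 0.9542, "truck": 0.1761},
    "D3": {"arrived": 0.1761, "gold": 0.1761, "shipment": 0.1761, "truck": 0.1761},
}

scores = {name: cosine(query, vec) for name, vec in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.4f}")
```

The ranking puts D2 first, then D3, then D1, which matches the order implied by the slide's cosine values.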
- 28. SOC-PMI. “Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus.” [5] Deriving the formula involves a good deal of mathematics. The resulting similarity measure is normalized so that it lies between 0 and 1.
- 29. SOC-PMI with an example. A complicated method with many mathematical formulae. Example [6]: W1 = car, W2 = automobile; m = 70, n = 43. Assumptions: γ = 3, δ = 0.7, a window of 11 words; β1 = β2 = 24.88. (Figure: the example corpus.)
- 30. SOC-PMI example contd. Bigram frequencies and the set X; types and frequencies, and the set Y of words with their PMI values.
- 31. SOC-PMI example contd.
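Two building blocks of SOC-PMI can be sketched in code, assuming the definitions in [6]: the PMI of a neighbour word t and a target word w, and the cut-off β that decides how many top-PMI neighbours are kept. Here m is taken to be the number of tokens in the corpus and n the number of distinct word types, which is an assumption about the slide's notation. The type frequency of 6 used below is hypothetical: the slides do not state it, and it was chosen only because it gives a β close to the slide's 24.88.

```python
import math


def pmi(bigram_freq: int, freq_t: int, freq_w: int, m: int) -> float:
    """PMI (log base 2) of neighbour t and target w.

    bigram_freq: how often t and w co-occur inside the window (must be > 0)
    freq_t, freq_w: how often each word occurs in the corpus
    m: total number of tokens in the corpus
    """
    return math.log2(bigram_freq * m / (freq_t * freq_w))


def beta(type_freq: int, n: int, delta: float) -> float:
    """Number of top-PMI neighbours to keep for a target word.

    beta = (ln f)^2 * log2(n) / delta, with f the target word's frequency.
    """
    return (math.log(type_freq) ** 2) * math.log2(n) / delta


# n = 43 and delta = 0.7 come from the slide; type_freq = 6 is an assumption.
example_beta = beta(type_freq=6, n=43, delta=0.7)
print(round(example_beta, 2))
```

The full measure then sums, for each target word, the PMI values of the neighbours it shares with the other word, and normalises the two sums by β1 and β2; that part is not shown here.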
- 33. APPLICATIONS. Plagiarism detection: term similarity plays an important role in the field of plagiarism detection. Copyright violation: copies of restricted software or data can be detected using text similarities. Recommender services.
- 34. PROTOTYPE. Aim: finding the degree of similarity between files. Two steps: Data Retrieval (TF-IDF); Similarity Measurement (Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence).
- 35. Prototype – Data Retrieval. Steps followed to retrieve data using the TF-IDF scheme:
  - SequenceFilesFromDirectory: converts files into sequence files, <Text, Text>.
  - DocumentProcessor: converts the sequence file into <Text, StringTuple>.
  - DictionaryVectorizer: creates TF vectors <Text, VectorWritable>, dfcount <IntWritable, LongWritable>, and wordcount <Text, LongWritable>.
  - TFIDFConverter: creates TF-IDF vectors <Text, VectorWritable>.
- 36. Prototype – Similarity Measurement.
  - Intermediate step: convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>.
  - Distribution: matrix multiplication, Matrix * Matrix´.
  - Cosine, Pearson Correlation and Co-occurrence: RowSimilarityJob with the similarity class name SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE, SIMILARITY_PEARSON_CORRELATION or SIMILARITY_COOCCURRENCE.
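To make the three row-similarity measures concrete, here is a plain-Python sketch using their textbook definitions. It does not call the prototype's RowSimilarityJob and may differ from that job's exact behaviour; in particular, reading co-occurrence as "number of columns where both rows are non-zero" is an assumption. The two rows are the term counts of “I like databases” and “I hate hate databases”.

```python
import math


def cosine(u, v):
    """Uncentered cosine; missing entries are simply zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def pearson(u, v):
    """Pearson correlation: cosine of the mean-centered rows."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([x - mean_u for x in u], [y - mean_v for y in v])


def cooccurrence(u, v):
    """Number of columns (terms) in which both rows are non-zero."""
    return sum(1 for x, y in zip(u, v) if x != 0 and y != 0)


# Columns: I, like, databases, hate
rows = {"D1": [1, 1, 1, 0], "D2": [1, 0, 1, 2]}

print("cosine      ", round(cosine(rows["D1"], rows["D2"]), 4))
print("pearson     ", round(pearson(rows["D1"], rows["D2"]), 4))
print("cooccurrence", cooccurrence(rows["D1"], rows["D2"]))
```

The same pair of rows can look fairly similar under cosine and negatively correlated under Pearson, which is why the prototype compares several measures side by side.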
- 37. Prototype – Similarity Measurement. Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence.
- 39. SUMMARY. What text similarity is. Scope: content similarity. Steps involved in the process: Data Retrieval (TF/IDF, Document-Term Matrix, VSM, LSA); Similarity Measurements (Cosine Similarity, SOC-PMI). Applications & Prototype.
- 41. References
  - [1] Wikipedia: Information retrieval - Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
  - [2] Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
  - [3] Wikipedia: Vector space model - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
  - [4] Wikipedia: Latent semantic indexing - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
  - [5] Wikipedia: Second-order co-occurrence pointwise mutual information - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
  - [6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
  - [7] Dr. E. Garcia. Mi Islita.com, http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html
