Data retrieval - In layman's terms, data retrieval means that each distinct word or term within a document or web page is translated to some mathematical structure, for example a vector or a frequency matrix.
TF - In its simplest form, the term frequency is also called the term count: the number of occurrences of the term in the document.
IDF - Obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.
Note that if a term has a high term frequency in the given document and a low document frequency across the collection of documents (implying a high inverse document frequency), then a high tf-idf value results.
Too simple to be used. Not realistic.
http://www.miislita.com/term-vector/term-vector-3.html
The vector value (or term weight) of each term that occurs in a document is non-zero and is calculated using some weighting scheme. One well-known scheme is TF/IDF.
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"
Q: "Gold Silver Truck"
INTRODUCTION
WWW: a huge tangled web of information.
Issues faced: duplication, plagiarism, copyright violation, etc.
Aim: to detect and report duplicates.
Method: compare documents and output their level of similarity, i.e. “TEXT SIMILARITY”.
Text Similarity has 2 aspects:
Content Similarity: words are compared. e.g. “I have a car” and “I have a vehicle” are 75% similar (three of the four words match).
Expression Similarity: the meaning of the information is considered. e.g. “I have a car” and “I have a vehicle” can be considered 100% similar.
Scope: Content Similarity
2 step process:
STEP 1: Data Retrieval
“The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web.”
STEP 2: Similarity Measurements
To correlate the words or terms of two or more documents or web pages.
DATA RETRIEVAL
Translation of literature to mathematics.
A variety of concrete techniques exist: TF/IDF, Document-Term Matrix, VSM, LSA.
The corresponding mathematical structure is derived based on the data retrieval technique used.
TF/IDF
Term Frequency / Inverse Document Frequency
Idea: the more common a term is across the collection, the less important it is, so it should carry the least weight when matching a query.
Two linear, independent aspects:
Term Frequency - frequency of occurrence of a term in a given document.
Inverse Document Frequency - measure of the general importance of the term.
TF IDF Example
Three Documents:
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Two steps:
1. Calculate the Term Frequency
2. Calculate the Inverse Document Frequency
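The two steps can be sketched in a few lines of Python for the three example documents. This is only an illustration: the whitespace tokenizer, the lowercasing and the base-10 logarithm are assumptions (the slides only say "the logarithm"), and the variable names are made up for this sketch.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

def tokenize(text):
    # Assumption: lowercase and split on whitespace; no stop-word removal.
    return text.lower().split()

# Step 1: term frequency = raw count of each term in each document.
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Step 2: inverse document frequency = log(N / df), where df is the
# number of documents containing the term (base 10 is an assumption).
n_docs = len(docs)
df = Counter(term for counts in tf.values() for term in counts)
idf = {term: math.log10(n_docs / count) for term, count in df.items()}

# TF-IDF weight = term frequency * inverse document frequency.
tfidf = {
    name: {term: count * idf[term] for term, count in counts.items()}
    for name, counts in tf.items()
}

for name, weights in tfidf.items():
    print(name, {t: round(w, 4) for t, w in weights.items()})
```

Terms that occur in every document ("of", "in", "a") get an IDF of zero under this scheme, so they contribute nothing to the weights, while "silver" (twice in D2, absent elsewhere) gets the largest weight.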
Document-Term Matrix
“A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.”
Rows: Documents
Columns: Terms
It only records which document contains which term and the number of occurrences of that term in the document.
Document-Term Matrix Example
D1 = “I like databases”
D2 = “I hate hate databases”

      I   like   databases   hate
D1    1   1      1           0
D2    1   0      1           2
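A minimal sketch of how such a matrix could be built for the two example documents. Ordering the columns by first appearance of each term is an assumption made here so that the result lines up with the example; real toolkits usually sort or hash the vocabulary.

```python
from collections import Counter

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}

# Columns: distinct terms, in order of first appearance across the documents.
terms = []
for text in docs.values():
    for token in text.split():
        if token not in terms:
            terms.append(token)

# Rows: one count vector per document (Counter returns 0 for missing terms).
matrix = {}
for name, text in docs.items():
    counts = Counter(text.split())
    matrix[name] = [counts[term] for term in terms]

print(terms)
for name, row in matrix.items():
    print(name, row)
```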
VSM
“Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as, for example, index terms.”
Each document and query is represented as a vector:
document: dj = (w1,j , w2,j , ... , wn,j)
query: q = (w1,q , w2,q , ... , wn,q)
Terms can be individual words, keywords, or phrases, depending on the type of application.
VSM Example
Three Documents:
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query: Gold Silver Truck
LSA
“Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words.”
Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms.
2 step process:
1. Construction of the Document-Term Matrix
2. Singular Value Decomposition
LSA Example
Three Documents:
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query: Gold Silver Truck
LSA Example contd... STEP 1: Constructing the Term-Document Matrix & Query Matrix
LSA Example contd... STEP 2: Evaluating the Singular Value Decomposition
A similar SVD evaluation and reduction is done for the query vector Q. At the end we have:
Reduced SVD Matrix V (for the documents)
Reduced SVD Matrix Q (for the query)
These can then be supplied to a similarity measurement technique.
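The two LSA steps might look roughly like this with NumPy. Several things here are assumptions rather than part of the slides: raw term counts as matrix entries, an alphabetically sorted vocabulary, a reduction to k = 2 dimensions, and the common query folding q_k = qᵀ U_k S_k⁻¹. The signs of the SVD factors are not unique, so the printed coordinates may differ in sign from other implementations.

```python
import numpy as np

docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
query = "gold silver truck"

# Step 1: term-document matrix A (rows = terms, columns = documents)
# and the query vector q over the same vocabulary.
terms = sorted({t for d in docs for t in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)
q = np.array([query.split().count(t) for t in terms], dtype=float)

# Step 2: singular value decomposition, A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k reduction (k = 2 is an assumption for this sketch).
k = 2
U_k = U[:, :k]
s_k = s[:k]
V_k = Vt[:k, :].T        # each row: one document in the reduced space

# Fold the query into the same space: q_k = q^T U_k S_k^{-1}.
q_k = q @ U_k / s_k

print("documents (rows of V_k):")
print(V_k)
print("query:", q_k)
```

The rows of V_k and the vector q_k live in the same two-dimensional space, so they can be compared directly with a similarity measure.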
SIMILARITY MEASUREMENTS
The major focus of the “Text Similarities” methodology.
Uses the mathematical structures generated by the data retrieval techniques to evaluate the percentage of likeness between two or more documents or web pages.
Two major techniques in focus here:
Cosine Similarity
SOC-PMI
COSINE SIMILARITY
Evaluates the similarity between 2 vectors by measuring the cosine of the angle between them.
The cosine of the angle determines whether the vectors are roughly pointing in the same direction.
In our scope: similarity will range between 0 and 1, since term weights are never negative, i.e. the angle between the two vectors will never exceed 90°.
COSINE Example
Example continued from VSM.
Three Documents:
D1: “Shipment of gold damaged in a fire”
D2: “Delivery of silver arrived in a silver truck”
D3: “Shipment of gold arrived in a truck”
Query: Gold Silver Truck
We have calculated the weights using the TF-IDF scheme.
Next step: calculate the cosine similarity:
cos θ_Di = (Q . Di) / (|Q| x |Di|)
i.e. first calculate the dot product: Q . Di
Then calculate the product of the vector magnitudes: |Q| x |Di|
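As a sketch of this calculation, the snippet below weights the query and the three documents with TF-IDF and then applies the cosine formula. The tokenizer, the base-10 logarithm and the choice to weight the query with the same IDF values as the documents are assumptions of this sketch; the helper names are hypothetical.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "Gold Silver Truck"

def tokens(text):
    return text.lower().split()

# IDF from the document collection (base-10 log is an assumption).
n_docs = len(docs)
df = Counter(t for text in docs.values() for t in set(tokens(text)))
idf = {t: math.log10(n_docs / c) for t, c in df.items()}

def weights(text):
    # Sparse TF-IDF vector: term -> weight; unseen terms get weight 0.
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens(text)).items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())   # Q . Di
    norm_u = math.sqrt(sum(w * w for w in u.values()))   # |Q|
    norm_v = math.sqrt(sum(w * w for w in v.values()))   # |Di|
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

q = weights(query)
scores = {name: cosine(q, weights(text)) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 4))
```

Under these assumptions the documents rank D2, D3, D1 for the query: D2 is the only document containing "silver", and it also contains "truck".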
SOC-PMI
“Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus.”
The formula is derived through a fair amount of mathematics.
The similarity measure is normalized at the end so that its value lies between 0 and 1.
SOC-PMI with an example
A complicated method with a lot of mathematical formulae.
Example:
W1 = car
W2 = automobile
m = 70, n = 43
Assumptions: γ = 3, δ = 0.7, window of 11 words
β1 = β2 = 24.88
CORPUS
SOC-PMI example contd...
Bigram frequencies and the set X
Types & Frequencies and the set Y of words with their PMI values
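Two building blocks of the method can be sketched as follows, using the definitions in Islam and Inkpen (2006) as understood here: the pointwise mutual information of a word pair, and the β value that limits how many neighbour words are kept. Reading m as the number of tokens and n as the number of types in the corpus is an interpretation, not something the slides state. The type frequency of 6 for each target word and the bigram count of 1 are hypothetical values chosen for illustration; the former happens to give a β close to the 24.88 quoted in the example.

```python
import math

def pmi(bigram_freq, freq_a, freq_b, m_tokens):
    # PMI of two words: log2( f_b(a, b) * m / (f_t(a) * f_t(b)) ),
    # where f_b is the co-occurrence count within the window and
    # f_t is the frequency of each word in the corpus.
    return math.log2(bigram_freq * m_tokens / (freq_a * freq_b))

def beta(type_freq, n_types, delta):
    # Number of top-PMI neighbour words to keep for a target word:
    # beta = (ln f_t(w))^2 * log2(n) / delta
    return (math.log(type_freq) ** 2) * math.log2(n_types) / delta

m, n = 70, 43      # corpus size values from the example
delta = 0.7        # the 0.7 constant from the example

# Hypothetical frequency: the slides do not give f_t(car) or f_t(automobile).
assumed_type_freq = 6
beta_value = beta(assumed_type_freq, n, delta)
print(round(beta_value, 2))

# Hypothetical co-occurrence count for one neighbour word.
example_pmi = pmi(bigram_freq=1, freq_a=6, freq_b=6, m_tokens=m)
print(round(example_pmi, 4))
```

The full measure then sums the (PMI raised to the power γ) values of the neighbour words shared by both targets and normalizes by β; that part is omitted here.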
APPLICATIONS
Plagiarism Detection: Term similarity plays an important role in the field of plagiarism detection.
Copyright Violation: Copies of restricted software/data can be detected using text similarities.
Recommender Services
PROTOTYPE
AIM: Finding the degree of similarity between files.
2 steps:
Data Retrieval: TF-IDF
Similarity Measurement: Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence
Prototype - Data Retrieval
Steps followed to retrieve data using the TF-IDF scheme:
SequenceFilesFromDirectory
  Converts files into sequence files <Text, Text>
DocumentProcessor
  Converts the sequence file into <Text, StringTuple>
DictionaryVectorizer
  Creates TF vectors <Text, VectorWritable>
  Creates dfcount <IntWritable, LongWritable>
  Creates wordcount <Text, LongWritable>
TFIDFConverter
  Creates TF-IDF vectors <Text, VectorWritable>
Prototype - Similarity Measurement
Intermediate steps:
  Convert the TF-IDF vectors into a Matrix <IntWritable, VectorWritable>
Similarity Measurement:
  Distribution Multiplication: Matrix * Matrix´
  Cosine, Pearson Correlation and Co-occurrence: RowSimilarityJob (Similarity Classname)
    SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
    SIMILARITY_PEARSON_CORRELATION
    SIMILARITY_COOCCURRENCE
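The idea behind multiplying the matrix by its transpose can be illustrated on a single machine with NumPy. This is not the prototype's code and does not call RowSimilarityJob; the small TF-IDF matrix is made up, and the Pearson and co-occurrence definitions used here are the textbook ones, which may differ in detail from what the three similarity classes compute.

```python
import numpy as np

# Hypothetical TF-IDF matrix: one row per file, one column per term.
M = np.array([
    [0.0, 0.2, 0.5, 0.0],
    [0.0, 0.4, 1.0, 0.0],
    [0.3, 0.0, 0.0, 0.7],
])

# Cosine: scale every row to unit length, then Matrix * Matrix'.
# Entry (i, j) is the cosine similarity between file i and file j.
unit = M / np.linalg.norm(M, axis=1, keepdims=True)
cosine = unit @ unit.T

# Pearson correlation between rows (textbook definition).
pearson = np.corrcoef(M)

# Co-occurrence: number of terms that both files contain.
present = (M > 0).astype(int)
cooccurrence = present @ present.T

print(np.round(cosine, 3))
print(np.round(pearson, 3))
print(cooccurrence)
```

In this toy matrix the second file is an exact multiple of the first, so their cosine similarity is 1, while the third file shares no terms with either and scores 0 against both.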
References
Wikipedia: Information retrieval - Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/wiki/Information_retrieval
Wikipedia: Document-term matrix - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Document-term_matrix
Wikipedia: Vector space model - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Vector_space_model
Wikipedia: Latent semantic indexing - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
Wikipedia: Second-order co-occurrence pointwise mutual information - Wikipedia, the free encyclopedia (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words, in Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033–1038.
Dr. E. Garcia. Mi Islita.com - http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html