The document discusses several information retrieval models including the Boolean, vector space, and probabilistic models. It provides details on how each model represents documents and queries, defines relevance, and ranks documents in response to queries. Specifically, it describes:
1) The Boolean model uses exact matching to retrieve only documents that satisfy a Boolean query, but does not rank results.
2) The vector space model represents documents and queries as vectors of term weights and ranks documents based on their similarity to the query vector using measures like cosine similarity.
3) Term frequency-inverse document frequency (TF-IDF) is discussed as a method to weight terms according to their importance (a small weighting and ranking sketch follows this list).
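As a rough illustration of how these pieces fit together, here is a minimal Python sketch, not taken from the document itself: it builds TF-IDF weighted vectors for a tiny made-up corpus and ranks the documents by cosine similarity to a query. The idf formula log(N/df) is one common variant among several.

```python
import math
from collections import Counter

def build_tfidf(docs):
    """TF-IDF vectors (as term->weight dicts) and the idf table for tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # document frequency
    idf = {t: math.log(n / df[t]) for t in df}                # idf = log(N / df_t)
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "gold silver truck".split(),
    "shipment of gold damaged in a fire".split(),
    "delivery of silver arrived in a silver truck".split(),
]
doc_vecs, idf = build_tfidf(docs)
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter("gold silver truck".split()).items()}
# Rank documents by decreasing cosine similarity to the query
ranking = sorted(enumerate(cosine(query_vec, d) for d in doc_vecs), key=lambda x: -x[1])
print(ranking)
```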
- Documents are represented as vectors in a vector space, with one dimension per term. A training set consists of labelled documents that correspond to labelled points in this vector space.
- Classification methods include Rocchio classification, which divides the space into regions centered on class centroids, and k-nearest neighbors (kNN) classification, which assigns classes based on the labels of the k closest training examples without computing explicit decision surfaces.
- Common text classification approaches include prototype-based classification, which represents each class as the centroid of its training examples and assigns new documents to the class of the closest centroid (both are sketched below).
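The following Python sketch illustrates both classifiers on tiny hand-made 2-D "document vectors"; the data, labels, and the choice of Euclidean distance are assumptions for illustration only.

```python
import numpy as np
from collections import Counter

def rocchio_train(X, y):
    """Rocchio/prototype classifier: one centroid per class."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def rocchio_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

def knn_predict(X, y, x, k=3):
    """kNN: majority label among the k nearest training vectors."""
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

# Toy 2-D "document vectors" with two classes (purely illustrative)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array(["sports", "sports", "politics", "politics"])
x_new = np.array([0.85, 0.15])
print(rocchio_predict(rocchio_train(X, y), x_new))   # -> sports
print(knn_predict(X, y, x_new, k=3))                 # -> sports
```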
This document compares web search and information retrieval (IR) across 10 differentiators:
1. Languages - Web search indexes documents in many languages using full text, while IR databases usually cover one language.
2. File types - Web search indexes several file types including some without text, while IR indexes consistent formats like PDF.
3. Document length - Web documents vary widely in length from short to long, while IR documents vary less.
4. Document structure - Web documents are semi-structured HTML, while IR allows searching structured document fields.
The document discusses the World Wide Web and information retrieval on the web. It provides background on how the web was developed by Tim Berners-Lee in 1990 using HTML, HTTP, and URLs. It then discusses some key differences in information retrieval on the web compared to traditional library systems, including the presence of hyperlinks, heterogeneous content, duplication of content, exponential growth in the number of documents, and lack of stability. It also summarizes some challenges in web search including the expanding nature of the web, dynamically generated content, influence of monetary contributions on search results, and search engine spamming.
This document discusses spatial data mining and its applications. Spatial data mining involves extracting knowledge and relationships from large spatial databases. It can be used for applications like GIS, remote sensing, medical imaging, and more. Some challenges include the complexity of spatial data types and large data volumes. The document also covers topics like spatial data warehouses, dimensions and measures in spatial analysis, spatial association rule mining, and applications in fields such as earth science, crime mapping, and commerce.
This document provides an overview of an information retrieval system. It defines an information retrieval system as a system capable of storing, retrieving, and maintaining information such as text, images, audio, and video. The primary objective of an information retrieval system is to minimize the overhead a user incurs in locating needed information. The document discusses functions like search, browse, indexing, cataloging, and various capabilities to facilitate querying and retrieving relevant information from the system.
Boolean, vector space retrieval Models (Primya Tamil)
The document discusses various information retrieval models including Boolean, vector space, and probabilistic models. It provides details on how documents and queries are represented and compared in the vector space model. Specifically, it explains that in this model, documents and queries are represented as vectors of term weights in a multi-dimensional space. The similarity between a document and query vector is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
This document introduces databases and database management systems. It discusses the disadvantages of file-based systems, including data duplication, incompatible formats, and fixed queries. A database was created to address these issues by centralizing data storage and control. A database management system provides tools to define, create, maintain and control access to a database. Common examples of databases include those for supermarkets, credit cards, travel agencies, libraries, insurance, and universities.
The document discusses text categorization, which involves assigning categories or topics to documents. It covers key aspects of text categorization including definitions, applications, document representation, feature selection, dimensionality reduction, knowledge engineering and machine learning approaches. Specific classification algorithms discussed include naïve Bayes, Bayesian logistic regression, decision trees, decision rules, and more. The document provides details on how these algorithms work and their advantages/disadvantages for text categorization tasks.
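Of the algorithms listed, naive Bayes is the simplest to show end to end. Below is a hedged Python sketch of a multinomial naive Bayes text classifier with add-one smoothing; the toy documents and spam/ham labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    vocab = {t for doc in docs for t in doc}
    term_counts, totals, label_counts = defaultdict(Counter), Counter(), Counter(labels)
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)   # term frequencies per class
        totals[c] += len(doc)        # total tokens per class
    prior = {c: label_counts[c] / len(docs) for c in label_counts}
    return prior, term_counts, totals, vocab

def classify_nb(model, doc):
    prior, term_counts, totals, vocab = model
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in doc:
            # P(t | c) with add-one smoothing over the vocabulary
            score += math.log((term_counts[c][t] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["cheap", "viagra", "offer"], ["meeting", "agenda", "minutes"],
        ["cheap", "offer", "now"], ["project", "meeting", "schedule"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(classify_nb(model, ["cheap", "meeting", "offer"]))   # -> spam on this toy data
```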
The document discusses various aspects of term weighting which is important for text retrieval systems, including term frequency, document frequency, inverse document frequency, and how they are used to calculate TF-IDF weights for terms. It also covers stoplists, stemming, and the bag-of-words model which represents text as vectors of word occurrences without considering word order. Term weighting schemes play a major role in the similarity measures used by information retrieval systems to determine document relevance.
The document discusses distributed database systems, including homogeneous and heterogeneous distributed databases, distributed data storage using replication and fragmentation, distributed transactions, commit protocols like two-phase commit, and handling failures in distributed systems. Key topics covered are replication allowing high availability but increasing complexity, fragmentation allowing parallelism but requiring joins, and two-phase commit coordinating atomic commits across multiple sites through a prepare and commit phase.
Time-space tradeoffs allow solving problems in less time by using more memory or solving problems using very little space by spending more time. Common tradeoffs include storing compressed vs uncompressed data, re-rendering images vs storing pre-rendered images, using smaller code with loops vs larger code without loops, and storing lookup tables vs recalculating values. Examples demonstrate algorithms that use more time and less space vs more space and less time.
Queuing theory is the study of congestion and waiting in line systems. It examines queues that form when demand for a service exceeds the available resources, such as people waiting at a supermarket checkout, letters waiting to be processed at a post office, or cars waiting at a traffic signal. Queuing systems can be modeled and analyzed using notations like Kendall's notation to understand characteristics like expected wait times, number of servers needed, and how to manage peak traffic periods. The origins of queuing theory began with the research of A.K. Erlang in the early 1900s on modeling telephone traffic and wait times.
This document provides an introduction to text analytics using IBM SPSS Modeler. It defines key terms related to text analytics and outlines the main steps in the text analytics process: extraction, categorization, and visualization. It then provides a tutorial on using IBM SPSS Modeler to perform text analytics, including sourcing text, extracting concepts and relationships, categorizing records, and visualizing results. Templates and resources are described that can be used to start an interactive workbench session in Modeler for exploring text analytics.
This document discusses data generalization and summarization techniques. It describes how attribute-oriented induction generalizes data from low to high conceptual levels by examining attribute values. The number of distinct values for each attribute is considered, and attributes may be removed, generalized up concept hierarchies, or retained in the generalized relation. An algorithm for attribute-oriented induction takes a relational database and data mining query as input and outputs a generalized relation. Generalized data can be presented as crosstabs, bar charts, or pie charts.
Information retrieval 14: fuzzy set models of IR (Vaibhav Khanna)
The fuzzy model is a set-theoretic model of document retrieval based on fuzzy set theory. It contrasts with exact-match mechanisms, in which only objects satisfying well-specified criteria on their attributes are returned to the user as the query answer.
An inverted file indexes a text collection to speed up searching. It contains a vocabulary of distinct words and occurrences lists with information on where each word appears. For each term in the vocabulary, it stores a list of pointers to occurrences called an inverted list. Coarser granularity indexes use less storage but require more processing, while word-level indexes enable proximity searches but use more space. The document describes how inverted files are structured and constructed from text and discusses techniques like block addressing that reduce their space requirements.
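A minimal word-level inverted file can be sketched in a few lines of Python; the sample documents are made up, and real systems add compression and block addressing on top of this structure.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Word-level inverted file: term -> list of (doc_id, position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return index

docs = ["The silver truck arrived", "Gold shipment damaged", "Silver delivered by truck"]
index = build_inverted_index(docs)
print(index["silver"])   # [(0, 1), (2, 0)]
print(index["truck"])    # [(0, 2), (2, 3)]
```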
The Boolean model is a classical information retrieval model based on set theory and Boolean logic. Queries are specified as Boolean expressions to retrieve documents that either contain or do not contain the query terms. All term frequencies are binary and documents are retrieved based on an exact match to the query terms. However, this model has limitations as it does not rank documents and queries are difficult for users to translate into Boolean expressions, often returning too few or too many results.
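Under the Boolean model, retrieval reduces to set operations over the postings of the query terms. A small Python sketch (with an invented three-document collection) makes the exact-match behaviour concrete:

```python
def boolean_index(docs):
    """Map each term to the set of document ids containing it."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            postings.setdefault(term, set()).add(doc_id)
    return postings

docs = ["information retrieval systems", "database systems", "retrieval of information"]
postings = boolean_index(docs)
all_docs = set(range(len(docs)))

print(postings["information"] & postings["retrieval"])   # information AND retrieval -> {0, 2}
print(postings["systems"] - postings["database"])         # systems AND NOT database  -> {0}
print(postings["database"] | postings["retrieval"])       # database OR retrieval     -> {0, 1, 2}
print(all_docs - postings["systems"])                      # NOT systems               -> {2}
```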
Probabilistic information retrieval models & systems (Selman Bozkır)
The document discusses probabilistic information retrieval and Bayesian approaches. It introduces concepts like conditional probability, Bayes' theorem, and the probability ranking principle. It explains how probabilistic models estimate the probability of relevance between a document and query by representing them as term sets and making probabilistic assumptions. The goal is to rank documents by the probability of relevance to present the most likely relevant documents first.
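A hedged LaTeX rendering of the standard relations the summary refers to (the notation is chosen here, not taken from the document):

```latex
% Bayes' theorem applied to relevance estimation (R = "relevant", d = document, q = query):
\[
  P(R \mid d, q) = \frac{P(d \mid R, q)\, P(R \mid q)}{P(d \mid q)}
\]
% Probability ranking principle: present documents in decreasing order of P(R | d, q).
% For ranking, document-independent factors cancel, so documents can equivalently be
% ordered by the likelihood ratio:
\[
  \frac{P(d \mid R, q)}{P(d \mid \bar{R}, q)}
\]
```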
This document provides an overview of information retrieval models. It begins with definitions of information retrieval and how it differs from data retrieval. It then discusses the retrieval process and logical representations of documents. A taxonomy of IR models is presented including classic, structured, and browsing models. Boolean, vector, and probabilistic models are explained as examples of classic models. The document concludes with descriptions of ad-hoc retrieval and filtering tasks and formal characteristics of IR models.
1) The document discusses concept learning, which involves inferring a Boolean function from training examples. It focuses on a concept learning task where hypotheses are represented as vectors of constraints on attribute values.
2) It describes the FIND-S algorithm, which finds the most specific hypothesis consistent with the positive examples by generalizing constraints. However, FIND-S has limitations, such as ignoring negative examples (a sketch of the algorithm follows this list).
3) The Candidate-Elimination algorithm represents the version space of all hypotheses consistent with examples to address FIND-S limitations. It outputs the version space rather than a single hypothesis.
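A minimal Python sketch of FIND-S, assuming the usual representation of a hypothesis as a tuple of attribute constraints ('?' meaning any value); the EnjoySport-style examples are illustrative, not from the document:

```python
def find_s(examples, n_attrs):
    """FIND-S: most specific hypothesis consistent with the positive examples."""
    h = [None] * n_attrs                # None means "no value allowed yet"
    for x, label in examples:
        if label != "yes":              # FIND-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:
                h[i] = value            # first positive example: copy its values
            elif h[i] != value:
                h[i] = "?"              # conflicting values: generalize to "any"
    return h

examples = [
    (("sunny", "warm", "normal", "strong"), "yes"),
    (("sunny", "warm", "high",   "strong"), "yes"),
    (("rainy", "cold", "high",   "strong"), "no"),
]
print(find_s(examples, 4))   # ['sunny', 'warm', '?', 'strong']
```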
There are three common data warehouse architectures: basic, with a staging area, and with a staging area and data marts. The basic architecture extracts data directly from source systems into the data warehouse for users. The staging area architecture uses a staging area to clean and process data before loading it into the warehouse. The third architecture adds data marts, which are subsets of the warehouse organized for specific business units like sales or purchasing.
Data mining, Knowledge Discovery Process, Classification (Dr. Abdul Ahad Abro)
The document provides an overview of data mining techniques and processes. It discusses data mining as the process of extracting knowledge from large amounts of data. It describes common data mining tasks like classification, regression, clustering, and association rule learning. It also outlines popular data mining processes like CRISP-DM and SEMMA that involve steps of business understanding, data preparation, modeling, evaluation and deployment. Decision trees are presented as a popular classification technique that uses a tree structure to split data into nodes and leaves to classify examples.
The document discusses data warehouses and their advantages. It describes the different views of a data warehouse including the top-down view, data source view, data warehouse view, and business query view. It also discusses approaches to building a data warehouse, including top-down and bottom-up, and steps involved including planning, requirements, design, integration, and deployment. Finally, it discusses technologies used to populate and refresh data warehouses like extraction, cleaning, transformation, load, and refresh tools.
This document summarizes text classification in PHP. It discusses what text classification is, common natural language processing terminology like tokenization and stemming, Bayes' theorem and how it relates to naive Bayes classification. It provides examples of tokenizing, stemming, stopping words, and building a naive Bayes classifier in PHP using the NlpTools library. Key steps like training and testing a classifier on sample text data are demonstrated.
Information extraction involves extracting structured information from unstructured text. The goal is to identify named entities, relations between entities, and populate a database. This may also include event extraction, resolving temporal expressions, and wrapper induction. Common tasks include named entity recognition, co-reference resolution, relation extraction, and event extraction. Statistical methods like conditional random fields are often used. Evaluation involves measuring precision and recall.
This document provides lecture notes on information retrieval systems. It covers key concepts like precision and recall, different retrieval strategies including vector space model and probabilistic models, and retrieval utilities. The vector space model represents documents and queries as vectors in a shared space and calculates similarity using cosine similarity. Probabilistic models assign probabilities to terms and documents and estimate relevance probabilities. The notes discuss term weighting schemes, inverted indexes to improve efficiency, and integrating structured data with text retrieval. The overall objective is for students to learn fundamental models and techniques for information storage and retrieval.
The document discusses the vector space model used in information retrieval. It explains that documents and queries are represented as weighted vectors in a multidimensional space. Similar vectors are close to each other. The weights used are usually tf-idf, which considers both the frequency of a term within a document and its rarity across documents. Documents are ranked based on the similarity between their vector representation and the query vector.
The document discusses the vector space model used in information retrieval. It explains that documents and queries are represented as weighted vectors in a high dimensional vector space. Similarities between queries and documents are calculated to rank documents by relevance. Weights are often calculated using TF-IDF, which considers the frequency of terms within documents and across collections. Documents with vector representations closer to the query vector are considered more relevant.
This document discusses different information retrieval models including the Boolean model, vector space model, and probabilistic model. It focuses on describing the Boolean model and its drawbacks. Term frequency-inverse document frequency (TF-IDF) weighting is explained as a way to assign weights to terms based on frequency and document distribution. Cosine similarity is presented as a common way to measure similarity between a document vector and query vector in the vector space model.
The document discusses different techniques for topic modeling of documents, including TF-IDF weighting and cosine similarity. It proposes a semi-supervised approach that uses predefined topics from Prismatic to train an LDA model on Wikipedia articles. This model classifies news articles into topics. The accuracy is improved by redistributing term weights based on their relevance within topic clusters rather than just document frequency. An experiment on over 5000 news articles found that the combined weighting approach outperformed TF-IDF alone on articles with multiple topics or limited content.
IRJET - Document Comparison based on TF-IDF Metric (IRJET Journal)
This document discusses comparing documents based on the TF-IDF metric and cosine similarity. It begins by representing documents as vectors of terms weighted by TF-IDF. Cosine similarity is then used to measure the similarity between document vectors, with values ranging from 0 (completely dissimilar) to 1 (identical). The document demonstrates this approach on 5 sample documents from different domains, showing their pairwise cosine similarities. Comparing documents based on TF-IDF and cosine similarity allows analyzing relationships between documents in large corpora.
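A short sketch of the same pipeline using scikit-learn, assuming that library is available; the five sample sentences are invented and merely stand in for the paper's five domain documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The stock market fell sharply on Monday.",
    "Investors reacted to the sharp fall in the stock market.",
    "The football team won the championship final.",
    "A new vaccine was approved for general use.",
    "The championship final drew a record football crowd.",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)        # documents as TF-IDF weighted vectors
sims = cosine_similarity(X)          # pairwise similarity matrix with values in [0, 1]
print(sims.round(2))
```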
This document presents an algorithm for semantic-based similarity measure (SBSM) to improve text clustering. The algorithm assigns semantic weights to documents terms and phrases based on their use as arguments in proposition bank notation. It calculates similarity between a document and query based on matching weighted terms and phrases. Experimental results on a dataset show the SBSM using proposition bank notation achieves better performance than traditional similarity measures for text clustering.
The document discusses two main types of retrieval models: Boolean models which use set theory and vector space models which use statistical and algebraic approaches. Vector space models represent documents and queries as vectors of keywords weighted by factors like term frequency and inverse document frequency. Similarity between document and query vectors is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
The vector space model (VSM) represents documents as vectors of identifiers such as words, where each unique word corresponds to a dimension. Documents are broken down and represented as vectors based on word frequency. Queries are also represented as vectors, and similarity measures such as cosine similarity are used to compare document and query vectors and retrieve the most relevant documents. Variations of the basic VSM include removing common words, weighting terms based on frequency and document distribution, and using tf-idf to emphasize important words.
The document discusses different techniques for weighting terms in the vector space model for information retrieval, including:
- Sublinear tf scaling using the logarithm of term frequency
- Tf-idf weighting
- Maximum tf normalization, which prevents longer documents from receiving inflated weights simply because terms repeat more often (a sketch of these variants follows this list)
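A small Python sketch of these three weighting variants, under the usual textbook formulas (the constants and example numbers are assumptions for illustration):

```python
import math

def sublinear_tf(tf):
    """Sublinear scaling: wf = 1 + log(tf) for tf > 0, else 0 (dampens frequent terms)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def tf_idf(tf, df, n_docs):
    """Classic tf-idf weight: tf * log(N / df)."""
    return tf * math.log(n_docs / df)

def max_tf_normalized(tf, max_tf_in_doc, a=0.4):
    """Maximum tf normalization: ntf = a + (1 - a) * tf / tf_max,
    so long documents do not dominate simply by repeating terms."""
    return a + (1 - a) * tf / max_tf_in_doc

print(sublinear_tf(10), tf_idf(10, 3, 1000), max_tf_normalized(10, 25))
```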
It also discusses evaluating information retrieval systems using test collections with queries, relevant documents, and metrics like precision and recall. Standard test collections include Cranfield, TREC, and CLEF.
The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.
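As a concrete example of the probabilistic family, here is a hedged Python sketch of Okapi BM25 scoring; the k1 and b defaults and the smoothed idf variant are common choices, not values taken from the document.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed idf, kept positive
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score

docs = [d.split() for d in ["gold silver truck", "shipment of gold damaged in a fire",
                            "delivery of silver arrived in a silver truck"]]
query = "silver truck".split()
ranked = sorted(range(len(docs)), key=lambda i: -bm25_score(query, docs[i], docs))
print(ranked)   # document ids ordered by decreasing BM25 score
```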
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING (IJDKP)
Fuzzy logic deals with degrees of truth. In this paper, we have shown how to apply fuzzy logic in text mining in order to perform document clustering. We took an example of document clustering where the documents had to be clustered into two categories. The method involved cleaning up the text and stemming of words. Then, we chose 'm' features which differ significantly in their word frequencies (WF), normalized by document length, between documents belonging to these two clusters. The documents to be clustered were represented as a collection of 'm' normalized WF values. The fuzzy c-means (FCM) algorithm was used to cluster these documents into two clusters. After the FCM execution finished, the documents in the two clusters were analysed for the values of their respective 'm' features. It was known that documents belonging to a document type 'X' tend to have higher WF values for some particular features. If the documents belonging to a cluster had higher WF values for those same features, then that cluster was said to represent 'X'. By fuzzy logic, we not only get the cluster name, but also the degree to which a document belongs to a cluster.
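A compact Python/numpy sketch of fuzzy c-means as described above, with invented normalized word-frequency features standing in for the 'm' features; the initialization and iteration count are arbitrary choices:

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means: returns cluster centres and a membership matrix U (n_samples x c),
    where U[j, i] is the degree to which sample j belongs to cluster i."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                 # memberships sum to 1 per sample
    for _ in range(n_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
        # membership update: u_ji = 1 / sum_k (d_ji / d_jk)^(2/(m-1))
        U = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return centres, U

# Toy normalized word-frequency features for 6 "documents" (values are illustrative)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.85, 0.15],
              [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]])
centres, U = fuzzy_c_means(X, c=2)
print(np.round(U, 2))   # soft membership of each document in each cluster
```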
The document discusses several methods for calculating the similarity between text documents, including document vectors, word embeddings, TF-IDF, cosine similarity, and Jaccard similarity. It explains that document vectors transform documents into real-valued vectors to measure similarity as distance. Word embeddings represent words as vectors to capture semantic similarity. TF-IDF measures word importance, and cosine similarity measures the angle between document vectors to indicate similarity. Jaccard similarity calculates the overlap between word sets in two documents.
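Jaccard similarity is the simplest of these to show in full; here is a one-function Python sketch with made-up sentences:

```python
def jaccard(a, b):
    """Jaccard similarity between two documents viewed as sets of words."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard("the cat sat on the mat", "the cat lay on the rug"))   # 3/7, about 0.43
```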
Testing Different Log Bases for Vector Model Weighting Technique (kevig)
Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are initially indexed, and the words in the documents are assigned weights using a weighting technique called TF-IDF, which is the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF represents the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents; it is computed by dividing the total number of documents in the system by the number of documents containing the term and then taking the logarithm of the quotient. By default, base 10 is used to calculate the logarithm. In this paper, the weighting technique is tested using a range of log bases from 0.1 to 100.0 to calculate the IDF, in order to highlight how system performance varies at different weighting values. The experiments use documents from the MED, CRAN, NPL, LISA, and CISI test collections, which were assembled explicitly for information retrieval experiments.
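A small Python sketch of the IDF computation with a configurable logarithm base, mirroring the range the paper says it tests; the document counts are invented, and note that bases below 1 flip the sign of the weight:

```python
import math

def idf(n_docs, df, base=10):
    """IDF computed with an arbitrary logarithm base: log_base(N / df)."""
    return math.log(n_docs / df, base)

n_docs, df = 10_000, 25
for base in (0.1, 2, 10, 100):
    print(base, round(idf(n_docs, df, base), 3))
```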
This document discusses text mining and summarizes some key differences between text mining and data mining. Text mining, also known as text data mining or knowledge discovery in textual databases, is the process of analyzing text to identify novel information from a collection of documents. Unlike data mining which directly analyzes structured numeric data, text mining applies natural language processing techniques to discover new information from unstructured text data. The document then provides an overview of common text retrieval methods like the Boolean model and document ranking, and discusses measures used to evaluate text retrieval systems like precision and recall.
A Document Similarity Measurement without Dictionaries (鍾誠 陳鍾誠)
The document proposes a measure of document similarity called Common Keyword Similarity (CKS) that does not rely on dictionaries. CKS is based on finding common substrings between documents using a PAT-tree data structure. The importance of each substring is determined by its discriminating effect (KDE), which reflects how well it fits a given classification system. CKS is computed as the sum of the weights of the common keywords between two documents. Experimental results on news articles show that CKS without a dictionary has better recall and precision than a method using cosine coefficient that relies on a dictionary, since many terms cannot be found in dictionaries. The classification system used to determine keyword weights also significantly impacts performance.
Recommender systems provide suggestions for items to users based on their preferences. They analyze data on items, users, and transactions between users and items. Common data sources include item metadata, user profiles and ratings, and records of users' purchases or ratings of items. Recommender systems aim to provide personalized recommendations to increase sales, suggest diverse items, improve user satisfaction and loyalty, and help users find relevant items. Collaborative filtering analyzes similarities between users to provide recommendations for items liked by similar users.
The document discusses the history and impact of the World Wide Web on information retrieval and search engines. It covers:
1) How Berners-Lee invented the World Wide Web in 1990 by creating HTTP, HTML, and the first browser and server. The Web has since grown enormously.
2) How the Web changed search by requiring crawling to collect documents in a central repository before indexing, and by increasing the scale, size, and difficulty of relevance prediction due to the large collection.
3) The basic architecture of centralized crawling and indexing used by most early search engines, and how distributed and cluster-based architectures were developed to handle the Web's massive growth.
This document provides an overview of an Information Retrieval Techniques course. It discusses the objectives of understanding IR basics, text classification, search engines, and recommender systems. The syllabus covers what information is, types of information, retrieval, how IR differs from data retrieval, components of an IR system including document, user and search subsystems, and early developments in the field of IR. It also discusses the software architecture of a traditional IR system including processes like document gathering, indexing, searching, and document management.
UNIT III
MODELING AND RETRIEVAL EVALUATION
Basic Retrieval Models
An IR model governs how a document and a query are represented and how the relevance
of a document to a user query is defined.
There are three main IR models:
Boolean model
Vector space model
Probabilistic model
Although these models represent documents and queries differently, they use the same
framework. They all treat each document or query as a “bag” of words or terms.
Term sequence and position in a sentence or a document are ignored. That is, a document
is described by a set of distinctive terms.
Each term is associated with a weight. Given a collection of documents D, let
V = {t1, t2, ..., t|V|} be the set of distinctive terms in the collection, where ti is a term.
The set V is usually called the vocabulary of the collection, and |V| is its size,
i.e., the number of terms in V.
A weight wij > 0 is associated with each term ti of a document dj ∈ D. For a term that does not appear in
document dj, wij = 0.
Each document dj is thus represented with a term vector, dj = (w1j, w2j, ..., w|V|j), where
each weight wij corresponds to the term ti ∈ V, and quantifies the level of importance of ti
in document dj.
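As a small illustration of this representation, the Python sketch below (the three documents are invented for illustration) builds the vocabulary V and one raw-count term vector per document under the bag-of-words view:

```python
from collections import Counter

# A tiny illustrative collection (not from the text).
docs = {
    "d1": "web information retrieval and web search",
    "d2": "boolean retrieval model",
    "d3": "vector space model for information retrieval",
}

# Bag of words: term order and position are ignored, only counts matter.
bags = {dj: Counter(text.split()) for dj, text in docs.items()}

# The vocabulary V of the collection and its size |V|.
V = sorted({t for bag in bags.values() for t in bag})
print("Vocabulary V:", V, "|V| =", len(V))

# Each document dj is represented by a term vector (w1j, ..., w|V|j),
# where wij = 0 for terms that do not appear in dj.
for dj, bag in bags.items():
    vector = [bag.get(t, 0) for t in V]
    print(dj, vector)
```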
An IR model is a quadruple [D, Q, F, R(qi, dj)] where
1. D is a set of logical views for the documents in the collection
2. Q is a set of logical views for the user queries
3. F is a framework for modeling documents and queries
4. R(qi, dj) is a ranking function
Fig 2.1 Taxonomy of IR Models
2.1 Boolean Model
The Boolean model is one of the earliest and simplest information retrieval
models.
It uses the notion of exact matching to match documents to the user query.
Both the query and the retrieval are based on Boolean algebra.
Document Representation:
In the Boolean model, documents and queries are represented as sets of terms.
That is, each term is only considered present or absent in a document.
Using the vector representation of the document above, the weight wij ∈ {0, 1}
of term ti in document dj is 1 if ti appears in document dj, and 0 otherwise, i.e.,
wij = 1 if ti appears in dj, and 0 otherwise.
Boolean Queries:
Query terms are combined logically using the Boolean operators AND, OR, and
NOT, which have their usual semantics in logic.
Thus, a Boolean query has a precise semantics.
For instance, the query, ((x AND y) AND (NOT z)) says that a retrieved document
must contain both the terms x and y but not z.
As another example, the query expression (x OR y) means that at least one of
these terms must be in each retrieved document.
Here, we assume that x, y and z are terms. In general, they can be Boolean
expressions themselves.
Document Retrieval:
Given a Boolean query, the system retrieves every document that makes the query
logically true.
Thus, the retrieval is based on the binary decision criterion, i.e., a document is
either relevant or irrelevant. Intuitively, this is called exact match.
Most search engines support some limited forms of Boolean retrieval using
explicit inclusion and exclusion operators.
For example, the following query can be issued to Google, ‘mining –data
+“equipment price”’, where + (inclusion) and – (exclusion) are similar to Boolean
operators AND and NOT respectively.
The operator OR may be supported as well.
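A minimal sketch of exact-match Boolean retrieval over a toy collection (document contents are invented): each term maps to the set of documents containing it, and AND, OR and NOT become set intersection, union and difference over an inverted index.

```python
# Toy collection; contents are illustrative only.
docs = {
    1: "data mining and data warehousing",
    2: "text mining for equipment price analysis",
    3: "equipment price list",
}

# Build a simple inverted index: term -> set of document ids.
index = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(doc_id)

all_ids = set(docs)

def postings(term):
    return index.get(term, set())

# Evaluate ((mining AND price) AND (NOT data)), following the
# ((x AND y) AND (NOT z)) pattern discussed above.
result = (postings("mining") & postings("price")) & (all_ids - postings("data"))
print(sorted(result))  # -> [2]
```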
Drawbacks of the Boolean Model
No ranking of the documents is provided (absence of a grading scale)
Information need has to be translated into a Boolean expression, which most users
find awkward
The Boolean queries formulated by the users are most often too simplistic.
2.2 TF-IDF (Term Frequency/Inverse Document Frequency) Weighting
Term frequency and weighting
We assign to each term in a document a weight for that term that depends on the
number of occurrences of the term in the document.
We would like to compute a score between a query term t and a document d,
based on the weight of t in d. The simplest approach is to assign the weight to be equal to
the number of occurrences of term t in document d.
This weighting scheme is referred to as term frequency and is denoted tft,d, with
the subscripts denoting the term and the document in order.
For a document d, the set of weights determined by the tf weights above (or
indeed any weighting function that maps the number of occurrences of t in d to a positive
real value) may be viewed as a quantitative digest of that document.
In this view of a document, known in the literature as the bag of words model, the
exact ordering of the terms in a document is ignored but the number of occurrences of
each term is material (in contrast to Boolean retrieval).
Inverse document frequency
Raw term frequency as above suffers from a critical problem: all terms are
considered equally important when it comes to assessing relevance to a query.
For instance, a collection of documents on the auto industry is likely to have the
term auto in almost every document. To this end, we introduce a mechanism for
attenuating the effect of terms that occur too often in the collection to be
meaningful for relevance determination.
An immediate idea is to scale down the term weights of terms with high collection
frequency, defined to be the total number of occurrences of a term in the
collection.
The idea would be to reduce the tf weight of a term by a factor that grows with its
collection frequency. Instead, it is more commonplace to use for this purpose the
document frequency dft, defined to be the number of documents in the collection
that contain a term t.
This is because in trying to discriminate between documents for the purpose of
scoring it is better to use a document-level statistic (such as the number of
documents containing a term) than to use a collection-wide statistic for the term.
The reason to prefer df to cf is illustrated in Figure 2.2, where a simple example
shows that collection frequency (cf) and document frequency (df) can behave
rather differently. In particular, the cf values for both try and insurance are roughly
equal, but their df values differ significantly.
Intuitively, we want the few documents that contain insurance to get a higher
boost for a query on insurance than the many documents containing try get from a
query on try.
Word cf df
try 10422 8760
insurance 10440 3997
Figure 2.2 Collection frequency (cf) and document frequency (df) behave differently,
as in this example from the Reuters collection.
How is the document frequency df of a term used to scale its weight? Denoting as usual
the total number of documents in a collection by N, we define the inverse document
frequency (idf) of a term t as follows:
idft = log(N / dft)
Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
Figure 2.3 gives an example of idf’s in the Reuters collection of 806,791 documents; in
this example logarithms are to the base 10.
Term dft idft
car 18,165 1.65
auto 6723 2.08
insurance 19,241 1.62
best 25,235 1.5
Figure 2.3 Example of idf values. Here we give the idf’s of terms with various
frequencies in the Reuters collection of 806,791 documents.
Tf-idf weighting
We now combine the definitions of term frequency and inverse document frequency,
to produce a composite weight for each term in each document.
The tf-idf weighting scheme assigns to term t a weight in document d given by
tf-idft,d = tft,d × idft.
In other words, tf-idft,d assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus
lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many
documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
At this point, we may view each document as a vector with one component
corresponding to each term in the dictionary, together with a weight for each
component that is given by equation above. For dictionary terms that do not occur in
a document, this weight is zero.
A simple scoring scheme assigns to document d the sum, over all query terms, of the
number of times each of the query terms occurs in d.
We can refine this idea so that we add up not the number of occurrences of each
query term t in d, but instead the tf-idf weight of each term in d:
Score(q, d) = ∑_{t ∈ q} tf-idft,d
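As a small illustration (the collection below is invented, and base-10 logarithms are used as in Figure 2.3), the following Python sketch computes tf, df, idf and the overlap score Score(q, d) = ∑_{t∈q} tf-idft,d:

```python
import math
from collections import Counter

# Illustrative collection.
docs = {
    "d1": "car insurance auto insurance",
    "d2": "best car and auto deals",
    "d3": "try the best insurance",
}
N = len(docs)

# Term frequency tf_{t,d}.
tf = {d: Counter(text.split()) for d, text in docs.items()}

# Document frequency df_t: number of documents containing term t.
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

# Inverse document frequency: idf_t = log10(N / df_t).
idf = {t: math.log10(N / df_t) for t, df_t in df.items()}

def score(query, d):
    """Score(q, d) = sum over query terms t of tf-idf_{t,d}."""
    return sum(tf[d].get(t, 0) * idf.get(t, 0.0) for t in query.split())

for d in docs:
    print(d, round(score("car insurance", d), 3))
```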
Cosine similarity
Documents could be ranked by computing the distance between the points
representing the documents and the query.
More commonly, a similarity measure is used (rather than a distance or dissimilarity
measure), so that the documents with the highest scores are the most similar to the
query.
A number of similarity measures have been proposed and tested for this purpose.
The most successful of these is the cosine correlation similarity measure.
The cosine correlation measures the cosine of the angle between the query and the
document vectors.
When the vectors are normalized so that all documents and queries are represented by
vectors of equal length, the cosine of the angle between two identical vectors will be
1 (the angle is zero), and for two vectors that do not share any non-zero terms, the
cosine will be 0.
The cosine measure is defined as:
Cosine(Di, Q) = ( ∑_{j=1}^{t} dij · qj ) / √( ∑_{j=1}^{t} dij² · ∑_{j=1}^{t} qj² )
The numerator of this measure is the sum of the products of the term weights for the
matching query and document terms (known as the dot product or inner product).
The denominator normalizes this score by dividing by the product of the lengths of
the two vectors. There is no theoretical reason why the cosine correlation should be
preferred to other similarity measures, but it does perform somewhat better in
evaluations of search quality.
As an example, consider two documents D1 = (0.5, 0.8, 0.3) and D2 = (0.9, 0.4, 0.2)
indexed by three terms, where the numbers represent term weights.
Given the query Q = (1.5, 1.0, 0) indexed by the same terms, the cosine measures for
the two documents are:
Cosine(D1, Q) = [(0.5 × 1.5) + (0.8 × 1.0)] / √[(0.5² + 0.8² + 0.3²)(1.5² + 1.0²)] = 0.87
Cosine(D2, Q) = [(0.9 × 1.5) + (0.4 × 1.0)] / √[(0.9² + 0.4² + 0.2²)(1.5² + 1.0²)] = 0.97
The second document has a higher score because it has a high weight for the first
term, which also has a high weight in the query.
Even this simple example shows that ranking based on the vector space model is
able to reflect term importance and the number of matching terms, which is not
possible in Boolean retrieval.
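These two cosine values can be checked with a few lines of Python using the vectors given above:

```python
import math

def cosine(d, q):
    """Cosine of the angle between the document and query vectors."""
    dot = sum(di * qi for di, qi in zip(d, q))  # inner product of matching term weights
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm

D1 = (0.5, 0.8, 0.3)
D2 = (0.9, 0.4, 0.2)
Q  = (1.5, 1.0, 0.0)

print(round(cosine(D1, Q), 2))  # 0.87
print(round(cosine(D2, Q), 2))  # 0.97
```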
2.3 Vector-Space Model
This model is perhaps the best known and most widely used IR model.
It has the advantage of being a simple and intuitively appealing framework for
implementing term weighting, ranking, and relevance feedback.
The vector model proposes a framework in which partial matching is possible.
This is accomplished by assigning non-binary weights to index terms in queries
and in documents
Term weights are used to compute a degree of similarity between a query and
each document.
The documents are ranked in decreasing order of their degree of similarity.
In this model, documents and queries are assumed to be part of a t-dimensional
vector space, where t is the number of index terms (words, stems, phrases, etc.).
A document Di is represented by a vector of index terms:
Di = (di1, di2, . . . , dit),
Where dij represents the weight of the jth term.
A document collection containing n documents can be represented as a matrix of
term weights, where each row represents a document and each column describes
weights that were assigned to a term for a particular document:
Term1 Term2 . . . Termt
Doc1 d11 d12 . . . d1t
Doc2 d21 d22 . . . d2t
...
...
Docn dn1 dn2 . . . dnt
Figure 2.4 gives a simple example of the vector representation for four documents.
The term-document matrix has been rotated so that now the terms are the rows
and the documents are the columns.
The term weights are simply the count of the terms in the document.
Stopwords are not indexed in this example, and the words have been stemmed.
D1 Tropical Freshwater Aquarium Fish.
D2 Tropical Fish, Aquarium Care, Tank Setup.
D3 Keeping Tropical Fish and Goldfish in Aquariums, and Fish Bowls.
D4 The Tropical Tank Homepage - Tropical Fish and Aquariums.
Terms Documents
D1 D2 D3 D4
Aquarium 1 1 1 1
bowl 0 0 1 0
care 0 1 0 0
fish 1 1 2 1
freshwater 1 0 0 0
goldfish 0 0 1 0
homepage 0 0 0 1
keep 0 0 1 0
setup 0 1 0 0
tank 0 1 0 1
tropical 1 1 1 2
Figure.2.5. Term-document matrix for a collection of four documents
Document D3, for example, is represented by the vector (1, 1, 0, 2, 0, 1, 0, 1, 0, 0, 1).
Queries are represented the same way as documents.
That is, a query Q is represented by a vector of t weights:
Q = (q1, q2, . . . , qt),
where qj is the weight of the jth term in the query.
If, for example the query was “tropical fish”, then using the vector representation in Figure
2.5, the query would be (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1).
Example:
Here is a simplified example of the vector space retrieval model.
Consider a very small collection C that consists in the following three documents:
d1: “new york times”
d2: “new york post”
d3: “los angeles times”
Some terms appear in two documents, some appear only in one document.
The total number of documents is N=3.
Therefore, the idf values for the terms are:
angeles log2(3/1)=1.584
los log2(3/1)=1.584
new log2(3/2)=0.584
post log2(3/1)=1.584
times log2(3/2)=0.584
york log2(3/2)=0.584
For all the documents, we calculate the tf scores for all the terms in C.
We assume the words in the vectors are ordered alphabetically.
angeles los new post times york
d1 0 0 1 0 1 1
d2 0 0 1 1 0 1
d3 1 1 0 0 1 0
Now we multiply the tf scores by the idf values of each term, obtaining the following matrix
of documents-by-terms:
(All the terms appeared only once in each document in our small collection, so the
maximum value for normalization is 1.)
angeles los new post times york
d1 0 0 0.584 0 0.584 0.584
d2 0 0 0.584 1.584 0 0.584
d3 1.584 1.584 0 0 0.584 0
Given the following query: “new new times”,
we calculate the tf-idf vector for the query, and compute the score of each document in C
relative to this query, using the cosine similarity measure. When computing the tf-idf values
for the query terms we divide the frequency by the maximum frequency (2) and multiply
with the idf values
q 0 0 (2/2)*0.584=0.584 0 (1/2)*0.584=0.292 0
We calculate the length of each document and of the query:
Length of d1 = sqrt(0.584^2+0.584^2+0.584^2)=1.011
Length of d2 = sqrt(0.584^2+1.584^2+0.584^2)=1.786
Length of d3 = sqrt(1.584^2+1.584^2+0.584^2)=2.316
Length of q = sqrt(0.584^2+0.292^2)=0.652
Then the similarity values are:
cosSim(d1,q) = (0*0+0*0+0.584*0.584+0*0+0.584*0.292+0.584*0) / (1.011*0.652) =
0.776
cosSim(d2,q) = (0*0+0*0+0.584*0.584+1.584*0+0*0.292+0.584*0) / (1.786*0.652) =
0.292
cosSim(d3,q) = (1.584*0+1.584*0+0*0.584+0*0+0.584*0.292+0*0) / (2.316*0.652) =
0.112
According to the similarity values, the final order in which the documents are presented as
result to the query will be: d1, d2, d3.
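The whole worked example can be reproduced with the short Python sketch below, following the same conventions used above (idf with log base 2, raw tf for documents, query tf divided by the maximum query-term frequency). Small differences in the last digit are due to rounding in the hand calculation.

```python
import math
from collections import Counter

docs = {
    "d1": "new york times",
    "d2": "new york post",
    "d3": "los angeles times",
}
N = len(docs)
terms = sorted({t for text in docs.values() for t in text.split()})

# idf with log base 2, as in the example.
df = Counter(t for text in docs.values() for t in set(text.split()))
idf = {t: math.log2(N / df[t]) for t in terms}

def doc_vector(text):
    tf = Counter(text.split())
    return [tf.get(t, 0) * idf[t] for t in terms]

def query_vector(text):
    tf = Counter(text.split())
    max_tf = max(tf.values())
    return [(tf.get(t, 0) / max_tf) * idf[t] for t in terms]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

q = query_vector("new new times")
for d, text in docs.items():
    # Prints approximately d1: 0.775, d2: 0.293, d3: 0.113
    # (the hand calculation, using rounded idf values, gives 0.776, 0.292, 0.112).
    print(d, round(cosine(doc_vector(text), q), 3))
```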
2.4 Probabilistic Model
Given a user information need (represented as a query) and a collection of
documents (transformed into document representations), a system must
determine how well the documents satisfy the query.
Boolean or vector space models of IR: query-document matching done in a
formally defined but semantically imprecise calculus of index terms
An IR system has an uncertain understanding of the user query, and makes an
uncertain guess of whether a document satisfies the query.
Probability theory provides a principled foundation for such reasoning under
uncertainty.
Probabilistic models exploit this foundation to estimate how likely it is that a
document is relevant to a query.
Review of basic probability theory
For events A and B
o Joint probability P(A, B) of both events occurring
o Conditional probability P(A|B) of event A occurring given that event B has
occurred
The chain rule gives the fundamental relationship between joint and conditional
probabilities:
P(A, B) = P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
Similarly for the complement Ā of an event:
P(Ā, B) = P(B|Ā) P(Ā)
Partition rule: if B can be divided into an exhaustive set of disjoint sub-cases, then
P(B) is the sum of the probabilities of the sub-cases.
A special case of this rule gives:
P(B) = P(A, B) + P(Ā, B)
Bayes’ Rule for inverting conditional probabilities:
P(A|B) = P(B|A) P(A) / P(B)
Can be thought of as a way of updating probabilities:
o Start off with prior probability P(A) (initial estimate of how likely event A is
in the absence of any other information)
o Derive a posterior probability P(A|B) after having seen the evidence B, based
on the likelihood of B occurring in the two cases that A does or does not hold
Odds of an event provide a kind of multiplier for how probabilities change:
Odds: O(A) = P(A) / P(Ā) = P(A) / (1 − P(A))
The Probability Ranking Principle
The 1/0 loss case
o For a query q and a document d in the collection, let Rd,q be an indicator random
variable that says whether d is relevant with respect to a given query q. That is, it
takes on a value of 1 when the document is relevant and 0 otherwise.
o In context we will often write just R for Rd,q. Using a probabilistic model, the
obvious order in which to present documents to the user is to rank documents by
their estimated probability of relevance with respect to the information need: P(R
= 1|d, q).
o This is the basis of the Probability
Ranking Principle (PRP)
“If a reference retrieval system’s response to each request is a ranking of the
documents in the collection in order of decreasing probability of relevance to the user
who submitted the request, where the probabilities are estimated as accurately as
possible on the basis of whatever data have been made available to the system for this
purpose, the overall effectiveness of the system to its user will be the best that is
obtainable on the basis of those data.”
o In the simplest case of the PRP, there are no retrieval costs or other utility
concerns that would differentially weight actions or errors.
o You lose a point for either returning a non relevant document or failing to return a
relevant document (such a binary situation where you are evaluated on your
accuracy is called 1/0 loss).
o The goal is to return the best possible results as the top k documents, for any
value of k the user chooses to examine.
o The PRP then says to simply rank all documents in decreasing order of P(R = 1|d,
q).
o If a set of retrieval results is to be returned, rather than an ordering, the Bayes
Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply
return documents that are more likely relevant than non relevant:
d is relevant iff P(R = 1|d, q) > P(R = 0|d, q)
The PRP with retrieval costs
o Suppose, instead, that we assume a model of retrieval costs.
o Let C1 be the cost of not retrieving a relevant document and C0 the cost of
retrieval of a non relevant document.
o Then the Probability Ranking Principle says that if for a specific document d
and for all documents d′ not yet retrieved
C0 · P(R = 0|d) − C1 · P(R = 1|d) ≤ C0 · P(R = 0|d′) − C1 · P(R = 1|d′)
then d is the next document to be retrieved.
Such a model gives a formal framework where we can model differential costs of
false positives and false negatives and even system performance issues at the
modeling stage
The Binary Independence Model
o Traditionally used with the PRP
Assumptions:
o ‘Binary’ (equivalent to Boolean): documents and queries represented as binary
term incidence vectors
E.g., document d is represented by the vector x⃗ = (x1, . . . , xM), where xt = 1 if term t
occurs in d and xt = 0 otherwise.
o Different documents may have the same vector representation.
o ‘Independence’: no association between terms (not true, but works well in practice - the
‘naive’ assumption of Naive Bayes models)
o To make a probabilistic retrieval strategy precise, need to estimate how terms in
documents contribute to relevance
Find measurable statistics (term frequency, document frequency,
document length) that affect judgments about document relevance
Combine these statistics to estimate the probability of document relevance
Order documents by decreasing estimated probability of relevance P(R|d,
q)
Assume that the relevance of each document is independent of the
relevance of other documents (not true, in practice allows duplicate results)
P(R|d, q) is modelled using term incidence vectors as P(R|x⃗, q⃗).
P(x⃗|R = 1, q⃗) and P(x⃗|R = 0, q⃗): the probability that if a relevant (respectively,
non relevant) document is retrieved, then that document's representation is x⃗.
Statistics about the actual document collection are used to estimate these
probabilities.
Since a document is either relevant or non relevant to a query, we must have that:
P(R = 1|x⃗, q⃗) + P(R = 0|x⃗, q⃗) = 1
Probability Estimates in Practice
Assuming that relevant documents are a very small percentage of the collection,
approximate statistics for non relevant documents by statistics from the whole
collection
Hence, ut (the probability of term occurrence in non relevant documents for a
query) is dft/N and log[(1 − ut)/ut] = log[(N − dft)/dft] ≈ log(N/dft)
The above approximation cannot easily be extended to relevant documents
Statistics of relevant documents (pt ) can be estimated in various ways:
1. Use the frequency of term occurrence in known relevant documents (if
known). This is the basis of probabilistic approaches to relevance
feedback weighting in a feedback loop
2. Set as constant. E.g., assume that pt is constant over all terms xt in the
query and that pt = 0.5
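A rough sketch of these practical estimates (collection statistics are invented, and the base of the logarithm is not essential here): in the usual Binary Independence Model term weight, setting pt = 0.5 makes the pt-dependent part vanish, so documents can be ranked by summing log[(1 − ut)/ut] ≈ log(N/dft) over the query terms they contain.

```python
import math

# Illustrative collection statistics (invented numbers): df_t per term, N documents.
N = 1000
df = {"insurance": 50, "car": 300, "best": 600}

def bim_term_weight(term, p_t=0.5):
    """Term weight under the practical estimates described above:
    u_t = df_t / N (non relevant docs approximated by the whole collection),
    p_t assumed constant (0.5), so the weight reduces to
    log[(1 - u_t)/u_t] = log[(N - df_t)/df_t], roughly log(N/df_t)."""
    u_t = df[term] / N
    return math.log((1 - u_t) / u_t) + math.log(p_t / (1 - p_t))  # second term is 0 for p_t = 0.5

def rsv(query_terms, doc_terms):
    """Rank documents by the sum of weights of the query terms they contain."""
    return sum(bim_term_weight(t) for t in query_terms if t in doc_terms and t in df)

print(round(rsv({"car", "insurance"}, {"car", "insurance", "best"}), 3))
print(round(rsv({"car", "insurance"}, {"best", "car"}), 3))
```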
2.5 Latent Semantic Indexing Model
The retrieval models discussed so far are based on keyword or term
matching, i.e., matching terms in the user query with those in the documents.
If a user query uses different words from the words used in a document, the
document will not be retrieved although it may be relevant because the
document uses some synonyms of the words in the user query.
This causes low recall. For example, “picture”, “image” and “photo” are
synonyms in the context of digital cameras. If the user query only has the
word “picture”, relevant documents that contain “image” or “photo” but not
“picture” will not be retrieved.
Latent semantic indexing (LSI), aims to deal with this problem through the
identification of statistical associations of terms.
It is assumed that there is some underlying latent semantic structure in the
data that is partially obscured by the randomness of word choice.
It then uses a statistical technique, called singular value decomposition
(SVD), to estimate this latent structure, and to remove the “noise”.
The results of this decomposition are descriptions of terms and documents
based on the latent semantic structure derived from SVD. This structure is
also called the hidden “concept” space, which associates syntactically
different but semantically similar terms and documents.
These transformed terms and documents in the “concept” space are then used
in retrieval, not the original terms or documents.
Let D be the text collection, the number of distinctive words in D be m and
the number of documents in D be n.
LSI starts with an m×n term document matrix A. Each row of A represents a
term and each column represents a document.
The matrix may be computed in various ways, e.g., using term frequency or
TF-IDF values.
We use term frequency as an example in this section. Thus, each entry or cell
of the matrix A, denoted by Aij, is the number of times that term i occurs in
document j.
Singular Value Decomposition
o What SVD does is to factor the matrix A (an m×n matrix) into the product of three
matrices, i.e.,
A = U∑V^T
Where,
U is an m×r matrix and its columns, called left singular vectors, are
eigenvectors associated with the r non-zero eigenvalues of AA^T.
Furthermore, the columns of U are unit orthogonal vectors,
i.e., U^T U = I (the identity matrix).
V is an n×r matrix and its columns, called right singular vectors, are
eigenvectors associated with the r non-zero eigenvalues of A^T A. The
columns of V are also unit orthogonal vectors, i.e., V^T V = I.
∑ is an r×r diagonal matrix, ∑ = diag(σ1, σ2, ..., σr), σi > 0. σ1, σ2, ..., σr, called singular
values, are the non-negative square roots of the r (non-zero) eigenvalues
of AA^T. They are arranged in decreasing order, i.e., σ1 ≥ σ2 ≥ ... ≥ σr ≥ 0.
o We note that initially U is in fact an m×m matrix, V an n×n matrix, and ∑
an m×n diagonal matrix.
o ∑'s diagonal consists of the non-negative square roots of the eigenvalues of AA^T (equivalently, of A^T A).
o However, due to zero eigenvalues, ∑ has zero-valued rows and columns.
Matrix multiplication tells us that those zero-valued rows and columns of
∑ can be dropped.
o Then, the last m−r columns in U and the last n−r columns in V can also be
dropped.
m is the number of rows in A, i.e., the number of terms.
n is the number of columns in A, i.e., the number of documents.
r is the rank of A, r ≤ min(m, n).
The singular value decomposition of A always exists and is unique up to
1. allowable permutations of columns of U and V and elements of ∑ leaving it still
diagonal; that is, columns i and j of ∑ may be interchanged iff rows i and j of ∑ are
interchanged, and columns i and j of U and V are interchanged.
2. sign (+/-) flip in U and V.
o An important feature of SVD is that we can delete some insignificant
dimensions in the transformed (or “concept”) space to optimally (in the least
square sense) approximate matrix A.
o The significance of the dimensions is indicated by the magnitudes of the
singular values in ∑, which are already sorted. In the context of information
retrieval, the insignificant dimensions may represent “noisy” in the data, and
should be removed.
o Let us use only the k largest singular values in ∑ and set the remaining small
ones to zero. The approximated matrix of A is denoted by Ak.
o We can also reduce the size of the matrices ∑, U and V by deleting the last r-k
rows and columns from ∑, the last r-k columns in U and the last r-k columns
in V.
We then obtain
Ak = Uk ∑k Vk^T
o which means that we use the k largest singular triplets to approximate the
original (and somewhat “noisy”) term-document matrix A.
o The new space is called the k-concept space.
o Figure 2.6 shows the original matrices and the reduced matrices
schematically.
Fig. 2.6. The schematic representation of A and Ak
o It is critical that the LSI method does not re-construct the original term
document matrix A perfectly.
o The truncated SVD captures most of the important underlying structures in
the association of terms and documents, yet at the same time removes the
noise or variability in word usage that plagues keyword matching retrieval
methods.
Query and Retrieval
o Given a user query q (represented by a column vector as those in A), it is first
converted into a document in the k-concept space, denoted by qk. This
transformation is necessary because SVD has transformed the original
documents into the k-concept space and stored them in Vk.
o The idea is that q is treated as a new document in the original space, represented as a
column in A, and then mapped to qk as an additional document (or column) in Vk^T.
q = Uk ∑k qk^T
o Since the columns in Uk are unit orthogonal vectors, Uk^T Uk = I. Thus,
Uk^T q = ∑k qk^T
o As the inverse of a diagonal matrix is still a diagonal matrix, and each entry
on the diagonal is 1/σi (1 ≤ i ≤ k), if it is multiplied on both sides of the above
equation, we obtain:
∑k^{-1} Uk^T q = qk^T
o Finally, we get the following (notice that the transpose of a diagonal matrix is
itself):
qk = q^T Uk ∑k^{-1}
o For retrieval, we simply compare qk with each document (row) in Vk using a
similarity measure, e.g., the cosine similarity.
o Recall that each row of Vk (or each column of Vk^T) corresponds to a
document (column) in A.
o This method has been used traditionally.
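A compact sketch of this pipeline using NumPy (the term-document matrix and the value of k are invented for illustration): truncate the SVD to k singular triplets, fold the query into the k-concept space via qk = q^T Uk ∑k^{-1}, and rank the document rows of Vk by cosine similarity.

```python
import numpy as np

# Invented m x n term-document matrix A (rows = terms, columns = documents).
A = np.array([
    [1, 1, 0, 0],   # picture
    [0, 1, 1, 0],   # image
    [0, 0, 1, 1],   # photo
    [1, 0, 0, 1],   # camera
], dtype=float)

k = 2  # number of concepts to keep

# Full SVD, then keep only the k largest singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # Vk: one row per document

# Fold a query (a column vector over terms) into the k-concept space:
# qk = q^T Uk Sk^{-1}
q = np.array([1, 0, 0, 0], dtype=float)              # query containing only "picture"
qk = q @ Uk @ np.linalg.inv(Sk)

# Rank documents by cosine similarity between qk and each document row of Vk.
def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(qk, Vk[j]) for j in range(Vk.shape[0])]
print([round(x, 3) for x in scores])
```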
2.6 Neural Network Model
The human brain is composed of billions of neurons. Each neuron can be
viewed as a small processing unit.
A neuron is stimulated by input signals and emits output signals in reaction.
A chain reaction of propagating signals is called a spread activation process.
As a result of spread activation, the brain might command the body to take
physical reactions
A neural network is an oversimplified representation of the neuron
interconnections in the human brain: nodes are processing units; edges are
synaptic connections; the strength of a propagating signal is modeled by a
weight assigned to each edge; the state of a node is defined by its activation
level; and, depending on its activation level, a node might issue an output signal.
A neural network model for information retrieval can be defined as illustrated
in figure 2.7
Figure 2.7 A neural network model for information retrieval
Figure. 2.7 is composed of three layers: one for the query terms, one for the
document terms, and a third one for the documents themselves.
Here, however, the query term nodes are the ones which initiate the inference
process by sending signals to the document term nodes.
Following that, the document term nodes might themselves generate signals to the
document nodes.
This completes a first phase in which a signal travels from the query term nodes
to the document nodes (i.e., from the left to the right in Fig. 2.7 )
The neural network, however, does not stop after the first phase of signal
propagation. In fact, the document nodes in their turn might generate new signals
which are directed back to the document term nodes.
Upon receiving the stimulus, the document term nodes might again fire new
signals directed to the document nodes, repeating the process.
The signals become weaker at each iteration and the spread activation process
eventually halts.
To improve the retrieval performance, the network continues with the spreading
activation process after the first round of propagation.
This modifies the initial vector ranking in a process analogous to a user relevance
feedback cycle.
To make the process more effective, a minimum activation threshold might be
defined such that document nodes below this threshold send no signals out.
There is no conclusive evidence that a neural network provides superior
performance with general collections. In fact, the model has not been tested
extensively with large document collections.
2.7 Retrieval Evaluation
To evaluate an IR system is to measure how well the system meets the information needs
of the users.
o This is troublesome, given that the same result set might be interpreted differently
by distinct users.
o To deal with this problem, some metrics have been defined that, on average, have
a correlation with the preferences of a group of users.
Without proper retrieval evaluation, one cannot
o determine how well the IR system is performing
o compare the performance of the IR system with that of other systems, objectively
Retrieval evaluation is a critical and integral component of any modern IR system
Systematic evaluation of the IR system allows answering questions such as:
o a modification to the ranking function is proposed, should we go ahead and
launch it?
o a new probabilistic ranking function has just been devised, is it superior to the
vector model and BM25 rankings?
o for which types of queries, such as business, product, and geographic queries, does a
given ranking modification work best?
Lack of evaluation prevents answering these questions and precludes fine tuning of the
ranking function.
Retrieval performance evaluation consists of associating a quantitative metric to the
results produced by an IR system.
This metric should be directly associated with the relevance of the results to the user
Usually, its computation requires comparing the results produced by the system with
results suggested by humans for the same set of queries.
2.8 Retrieval Metrics
The Cranfield Paradigm
Evaluation of IR systems is the result of early experimentation initiated in the 50’s by
Cyril Cleverdon.
The insights derived from these experiments provide a foundation for the evaluation of
IR systems.
Cleverdon obtained a grant from the National Science Foundation to compare distinct
indexing systems.
These experiments provided interesting insights, that culminated in the modern metrics
of precision and recall
o Recall ratio: the fraction of relevant documents retrieved
o Precision ratio: the fraction of documents retrieved that are relevant
For instance, it became clear that, in practical situations, the majority of searches does not
require high recall.
Instead, the vast majority of the users require just a few relevant answers.
The next step was to devise a set of experiments that would allow evaluating each
indexing system in isolation more thoroughly.
The result was a test reference collection composed of documents, queries, and relevance
judgements.
It became known as the Cranfield-2 collection.
The reference collection allows using the same set of documents and queries to evaluate
different ranking systems.
The uniformity of this setup allows quick evaluation of new ranking functions
2.9 Precision and Recall
Consider,
I: an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR system
R ∩ A: the intersection of the sets R and A
The recall and precision measures are defined as follows
o Recall is the fraction of the relevant documents (the set R ) which has been
retrieved
i.e., Recall = |R ∩ A| / |R|
o Precision is the fraction of the retrieved documents (the set A) which is relevant,
i.e., Precision = |R ∩ A| / |A|
The definition of precision and recall assumes that all docs in the set A have been
examined. However, the user is not usually presented with all docs in the answer set
A at once.
Consider a reference collection and a set of test queries
Let Rq1 be the set of relevant docs for a query q1:
Rq1 = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
o Consider a new IR algorithm that yields the following answer to q1 (relevant
docs are marked with a bullet):
01. d123 • 06. d9 • 11. d38
02. d84 07. d511 12. d48
03. d56 • 08. d129 13. d250
04. d6 09. d187 14. d113
05. d8 10. d25 • 15. d3 •
If we examine this ranking, we observe that the document d123, ranked as number
1, is relevant.
This document corresponds to 10% of all relevant documents.
Thus, we say that we have a precision of 100% at 10% recall.
The document d56, ranked as number 3, is the next relevant.
At this point, two documents out of three are relevant, and two of the ten relevant
documents have been seen.
Thus, we say that we have a precision of 66.6% at 20% recall.
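The precision and recall figures quoted above can be reproduced by walking down the ranking and reporting precision each time a relevant document is found:

```python
# Relevant documents for q1 and the ranking returned by the algorithm (from the example).
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
           "d25", "d38", "d48", "d250", "d113", "d3"]

seen_relevant = 0
for rank, doc in enumerate(ranking, start=1):
    if doc in relevant:
        seen_relevant += 1
        precision = seen_relevant / rank          # fraction of retrieved docs that are relevant
        recall = seen_relevant / len(relevant)    # fraction of relevant docs retrieved so far
        # e.g. "d123: precision 100.0% at 10% recall", "d56: precision 66.7% at 20% recall"
        print(f"{doc}: precision {100 * precision:.1f}% at {100 * recall:.0f}% recall")
```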
2.10 Reference Collection
Reference collections, which are based on the foundations established by the Cranfield
experiments, constitute the most used evaluation method in IR
A reference collection is composed of:
o A set D of pre-selected documents
o A set I of information need descriptions used for testing
o A set of relevance judgements associated with each pair [im, dj], im ∈ I and dj ∈ D.
The relevance judgement has a value of 0 if document dj is non-relevant to im, and 1
otherwise.
These judgements are produced by human specialists.
With small collections one can apply the Cranfield evaluation paradigm to provide
relevance assessments.
With large collections, however, not all documents can be evaluated relative to a given
information need.
The alternative is to consider only the top k documents produced by various ranking
algorithms for a given information need.
This is called the pooling method.
The method works for reference collections of a few million documents, such as the
TREC collections.
2.11 User-based Evaluation
Recall and precision assume that the set of relevant docs for a query is independent of the
users.
However, different users might have different relevance interpretations.
To cope with this problem, user-oriented measures have been proposed.
As before,
o consider a reference collection, an information request I, and a retrieval algorithm
to be evaluated
o With regard to I, let R be the set of relevant documents and A be the set of
answers retrieved.
Fig 2.8. Coverage and novelty ratios for a given example information request.
K: set of documents known to the user
K ∩ R ∩ A: set of relevant docs that have been retrieved and are known to the user
( R ∩ A ) − K: set of relevant docs that have been retrieved but are not known to the user
Figure 2.8 illustrates the situation.
The coverage ratio is the fraction of the documents known and relevant that are in the
answer set, that is
Coverage = |K ∩ R ∩ A| / |K ∩ R|
The novelty ratio is the fraction of the relevant docs in the answer set that are not known
to the user
Novelty = |(R ∩ A) − K| / |R ∩ A|
A high coverage indicates that the system has found most of the relevant docs the user
expected to see.
A high novelty indicates that the system is revealing many new relevant docs which
were unknown.
Additionally, two other measures can be defined
o relative recall: ratio between the number of relevant docs found and the number of
relevant docs the user expected to find
o recall effort: ratio between the number of relevant docs the user expected to find
and the number of documents examined in an attempt to find the expected relevant
documents
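Since coverage and novelty are plain set ratios, they translate directly into code (the sets below are invented for illustration):

```python
R = {"d1", "d2", "d3", "d4", "d5"}       # relevant documents for the request
A = {"d1", "d2", "d6", "d7", "d4"}       # answer set returned by the system
K = {"d1", "d2", "d3", "d8"}             # documents known to the user

coverage = len(K & R & A) / len(K & R)   # known relevant docs that were retrieved
novelty = len((R & A) - K) / len(R & A)  # retrieved relevant docs unknown to the user

print(round(coverage, 2), round(novelty, 2))
```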
2.12 Relevance feedback and query expansion
In most collections, the same concept may be referred to using different
words. This issue, known as synonymy, has an impact on the recall of most
information retrieval systems.
For example, you would want a search for aircraft to match plane (but only for
references to an airplane, not a woodworking plane), and for a search on
thermodynamics to match references to heat in appropriate discussions.
Users often attempt to address this problem themselves by manually refining a
query, as was discussed in this Section.
The methods for tackling this problem split into two major classes as shown in
Fig.2.9:
Global methods
Local methods.
Global methods are techniques for expanding or reformulating query terms
independent of the query and results returned from it, so that changes in the
query wording will cause the new query to match other semantically similar
terms. Global methods include:
Query expansion/reformulation with a thesaurus or WordNet
Query expansion via automatic thesaurus generation
Techniques like spelling correction
Local methods adjust a query relative to the documents that initially appear to
match the query. Local methods include:
Relevance feedback
Pseudo relevance feedback, also known as Blind relevance
feedback
(Global) indirect relevance feedback
Implicit Feedback
Fig. 2.9 (a) Local analysis (b) Global analysis
Relevance feedback and pseudo relevance feedback
Relevance feedback: user feedback on relevance of docs in initial set of results
Basic Procedure:
User issues a (short, simple) query
The system returns an initial set of retrieval results.
The user marks some results as relevant or non-relevant
The system computes a better query representation of the information need based on
feedback
The system displays a revised set of retrieval results.
Relevance feedback can go through one or more iterations of this sort.
The process exploits the idea that it may be difficult to formulate a good query
when you don’t know the collection well, but it is easy to judge particular
documents, and so it makes sense to engage in iterative query refinement of this
sort.
In such a scenario, relevance feedback can also be effective in tracking a user’s
evolving information need: seeing some documents may lead users to refine their
understanding of the information they are seeking.
Image search provides a good example of relevance feedback.
After the user enters an initial query for bike on the demonstration system at:
http://nayana.ece.ucsb.edu/imsearch/imsearch.html
the initial results (in this case, images) are returned.
In Figure 2.10 (a), the user has selected some of them as relevant. These
will be used to refine the query, while other displayed results have no
effect on the reformulation.
Figure 2.10 (b) then shows the new top-ranked results calculated after this
round of relevance feedback.
Figure 2.10 RF searching over images. (a) The user views the initial query results for
a query of bike, selects the first, third and fourth result in the top row and the fourth
result in the bottom row as relevant, and submits this feedback. (b) The user sees the
revised result set. Precision is greatly improved.
The Rocchio algorithm for relevance feedback
It is the classic algorithm for implementing RF.
The Rocchio algorithm uses the vector space model to pick a relevance feedback
query
Rocchio seeks the query qopt that maximizes
qopt = arg max_q [sim(q, Cr) − sim(q, Cnr)]
i.e., the query vector that maximizes similarity with the relevant documents while
minimizing similarity with the non relevant documents, where
Cr = the set of relevant documents
Cnr = the set of non relevant documents
Under cosine similarity, the optimal query vector 𝑞𝑜𝑝𝑡 for separating the
relevant and non relevant documents is:
qopt = (1/|Cr|) ∑_{dj ∈ Cr} dj − (1/|Cnr|) ∑_{dj ∈ Cnr} dj
That is, the optimal query is the vector difference between the centroids of
the relevant and non relevant documents as shown in Figure 2.11
Figure 2.11 The Rocchio optimal query for separating relevant and non relevant documents
However, this observation is not terribly useful, precisely because the full set of relevant
documents is not known
The Rocchio (1971) algorithm.
This was the relevance feedback mechanism introduced in and popularized by Salton’s
SMART system around 1970
The algorithm proposes using the modified query 𝑞m:
qm = α·q0 + β·(1/|Dr|) ∑_{dj ∈ Dr} dj − γ·(1/|Dnr|) ∑_{dj ∈ Dnr} dj
Where,
q0 = original query vector
qm = modified query vector
Dr = set of known relevant doc vectors
Dnr = set of known non relevant doc vectors
These are different from Cr and Cnr
α,β,γ: weights (hand-chosen or set empirically)
The new query moves toward relevant documents and away from irrelevant documents.
Figure 2.12 An application of Rocchio’s algorithm. Some documents have been labeled
as relevant and non relevant and the initial query vector is moved in response to this
feedback.
Tradeoff α vs. β and γ: If we have a lot of judged documents, we want a higher β and γ
Some weights in query vector can go negative: Negative term weights are ignored (set
to 0)
Positive feedback is more valuable than negative feedback (so, set γ < β; e.g. γ = 0.25, β
= 0.75); many systems only allow positive feedback (γ = 0).
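A minimal sketch of the Rocchio update with the weights suggested above (α = 1.0, β = 0.75, γ = 0.25; the vectors are invented toy examples), including the convention of setting negative weights to 0:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """qm = alpha*q0 + beta*centroid(Dr) - gamma*centroid(Dnr), negatives clipped to 0."""
    qm = alpha * q0
    if len(relevant):
        qm = qm + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        qm = qm - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(qm, 0.0)   # negative term weights are ignored (set to 0)

q0 = np.array([1.0, 0.0, 0.0])                       # original query vector
Dr = np.array([[0.8, 0.4, 0.0], [0.6, 0.6, 0.0]])    # judged relevant doc vectors
Dnr = np.array([[0.0, 0.1, 0.9]])                    # judged non relevant doc vectors

print(rocchio(q0, Dr, Dnr))   # moves toward relevant docs, away from non relevant ones
```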
Relevance feedback can improve recall and precision
When does relevance feedback work?
The success of relevance feedback depends on certain assumptions.
• User has sufficient knowledge for initial query
• Relevance prototypes are “well-behaved”
Term distribution in relevant documents will be similar
Term distribution in non-relevant documents will be different from those in relevant
documents
Relevance feedback can also have practical problems.
The long queries that are generated by straightforward application of relevance feedback
techniques are inefficient for a typical IR system.
This results in a high computing cost for the retrieval and potentially long response times
for the user.
A partial solution to this is to only reweight certain prominent terms in the relevant
documents, such as perhaps the top 20 terms by term frequency.
Probabilistic relevance feedback
Rather than reweighting the query in a vector space, if a user has told us some relevant
and non relevant documents, then we can proceed to build a classifier.
One way of doing this is with a Naive Bayes probabilistic model.
If R is a Boolean indicator variable expressing the relevance of a document, then we can
estimate P (xt = 1|R), the probability of a term t appearing in a document, depending on
whether it is relevant or not.
Relevance feedback on the web
Some web search engines offer a similar/related pages feature: the user indicates a
document in the results set as exemplary from the standpoint of meeting his information
need and requests more documents like it.
This can be viewed as a particular simple form of relevance feedback.
However, in general relevance feedback has been little used in web search. One
exception was the Excite web search engine, which initially provided full relevance
feedback. However, the feature was in time dropped, due to lack of use.
On the web, few people use advanced search interfaces and most would like to complete
their search in a single interaction.
But the lack of uptake also probably reflects two other factors: relevance feedback is hard
to explain to the average user, and relevance feedback is mainly a recall enhancing
strategy, and web search users are only rarely concerned with getting sufficient recall.
Evaluation of relevance feedback strategies
Interactive relevance feedback can give very substantial gains in retrieval performance.
Empirically, one round of relevance feedback is often very useful.
Two rounds is sometimes marginally more useful.
Successful use of relevance feedback requires enough judged documents; otherwise the
process is unstable in that it may drift away from the user’s information need.
Accordingly, having at least five judged documents is recommended.
There is some subtlety to evaluating the effectiveness of relevance feedback in a sound
and enlightening way.
The obvious first strategy is to start with an initial query q0 and to compute a precision-
recall graph.
Following one round of feedback from the user, we compute the modified query qm
and again compute a precision-recall graph.
Here, in both rounds we assess performance over all documents in the collection, which
makes comparisons straightforward. If we do this, we find spectacular gains from
relevance feedback: gains on the order of 50% in mean average precision. But
unfortunately it is cheating.
The gains are partly due to the fact that known relevant documents (judged by the user)
are now ranked higher. Fairness demands that we should only evaluate with respect to
documents not seen by the user.
A second idea is to use documents in the residual collection (the set of documents
minus those assessed relevant) for the second round of evaluation.
This seems like a more realistic evaluation. Unfortunately, the measured performance
can then often be lower than for the original query.
This is particularly the case if there are few relevant documents, and so a fair proportion
of them have been judged by the user in the first round. The relative performance of
variant relevance feedback methods can be validly compared, but it is difficult to validly
compare performance with and without relevance feedback because the collection size
and the number of relevant documents changes from before the feedback to after it. Thus
neither of these methods is fully satisfactory.
A third method is to have two collections, one which is used for the initial query and
relevance judgments, and the second that is then used for comparative evaluation. The
performance of both q0 and qm can be validly compared on the second collection.
Perhaps the best evaluation of the utility of relevance feedback is to do user studies of its
effectiveness, in particular by doing a time-based comparison: how fast does a user find
relevant documents with relevance feedback vs. another strategy (such as query
reformulation), or alternatively, how many relevant documents does a user find in a
certain amount of time.
Pseudo relevance feedback
Also known as blind relevance feedback, it provides a method for automatic local
analysis.
It automates the manual part of relevance feedback, so that the user gets improved
retrieval performance without an extended interaction.
The method is to do normal retrieval to find an initial set of most relevant documents, to
then assume that the top k ranked documents are relevant, and finally to do relevance
feedback as before under this assumption.
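The control flow can be sketched as a wrapper around any ranking function; the search and rocchio functions below are assumed to exist (for instance, the sketches given earlier), so only the blind-feedback loop itself is shown:

```python
def pseudo_relevance_feedback(q0, collection, search, rocchio, k=5):
    """Blind feedback: assume the top-k documents of the initial ranking are relevant."""
    initial_ranking = search(q0, collection)            # list of (doc_id, doc_vector), best first
    assumed_relevant = [vec for _, vec in initial_ranking[:k]]
    qm = rocchio(q0, assumed_relevant, nonrelevant=[])   # positive feedback only
    return search(qm, collection)                        # re-rank with the modified query
```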
Indirect relevance feedback
We can also use indirect sources of evidence rather than explicit feedback on relevance
as the basis for relevance feedback.
This is often called implicit (relevance) feedback.
Implicit feedback is less reliable than explicit feedback, but is more useful than pseudo
relevance feedback, which contains no evidence of user judgments.
Moreover, while users are often reluctant to provide explicit feedback, it is easy to collect
implicit feedback in large quantities for a high volume system, such as a web search engine.
Query Expansion based on a Similarity Thesaurus
Similarity Thesaurus
We now discuss a query expansion model based on a global similarity thesaurus
constructed automatically
The similarity thesaurus is based on term to term relationships rather than on a matrix of
co-occurrence
Special attention is paid to the selection of terms for expansion and to the reweighting of
these terms
Terms for expansion are selected based on their similarity to the whole query
A similarity thesaurus is built using term to term relationships
These relationships are derived by considering that the terms are concepts in a concept
space
In this concept space, each term is indexed by the documents in which it appears
Thus, terms assume the original role of documents while documents are interpreted as
indexing elements
Let,
t: number of terms in the collection
N: number of documents in the collection
fi,j : frequency of term ki in document dj
tj : number of distinct index terms in document dj
Then,
itfj = log(t / tj)
is the inverse term frequency for document dj (analogous to inverse document frequency)
Within this framework, with each term ki is associated a vector ki given by
ki = (wi,1,wi,2, . . . ,wi,N)
The relationship between two terms ku and kv is computed as a correlation factor cu,v
given by
cu,v = ku · kv = ∑_{dj} wu,j × wv,j
Given the global similarity thesaurus, query expansion is done in three steps as follows
o First, represent the query in the same vector space used for representing the index
terms
o Second, compute a similarity sim(q, kv) between each term kv correlated to the
query terms and the whole query q
o Third, expand the query with the top r ranked terms according to sim(q, kv)
2.13 Explicit Relevance Feedback
Relevance feedback is a feature of some information retrieval systems. The idea
behind relevance feedback is to take the results that are initially returned from a
given query, to gather user feedback, and to use information about whether or not
those results are relevant to perform a new query.
We can usefully distinguish between three types of feedback:
o Explicit feedback
o Implicit feedback, and
o Blind or "pseudo" feedback.
Explicit feedback is obtained from assessors of relevance indicating the
relevance of a document retrieved for a query. This type of feedback is defined
as explicit only when the assessors (or other users of a system) know that the
feedback provided is interpreted as relevance judgments.
Users may indicate relevance explicitly using a binary or graded relevance
system. Binary relevance feedback indicates that a document is either relevant or
irrelevant for a given query. Graded relevance feedback indicates the relevance
of a document to a query on a scale using numbers, letters, or descriptions (such
as "not relevant", "somewhat relevant", "relevant", or "very relevant").
Graded relevance may also take the form of a cardinal ordering of documents
created by an assessor; that is, the assessor places documents of a result set in
order of (usually descending) relevance. An example of this would be
the SearchWiki feature implemented by Google on their search website.
The relevance feedback information needs to be combined with the original
query to improve retrieval performance, for example by using the well-known Rocchio
algorithm.
A performance metric which became popular around 2005 to measure the
usefulness of a ranking algorithm based on explicit relevance feedback
is NDCG (normalized discounted cumulative gain). Other measures include precision at k and mean average precision.
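As a rough illustration of how graded relevance judgments feed such a metric, here is a small NDCG@k sketch (one common gain and discount formulation; the judgments are invented):

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain: sum of gain_i / log2(i + 1) over the top k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded judgments (e.g., 0 = not relevant ... 3 = very relevant) in ranked order.
print(round(ndcg_at_k([3, 2, 0, 1], k=4), 3))
```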