The document discusses information retrieval (IR) models, including the Boolean, vector space, and probabilistic models. The Boolean model represents documents and queries as sets of index terms and determines relevance through binary term presence, while the vector space model represents documents and queries as weighted vectors in a multidimensional space and ranks documents by calculating similarity between document and query vectors. The probabilistic model determines relevance probabilities based on the likelihood of terms appearing in relevant vs. non-relevant documents.
This document provides an overview of a course on Information Storage and Retrieval. The course objectives are to familiarize students with theories of information storage and retrieval, introduce modern concepts of information retrieval systems, and acquaint students with indexing, matching, organizing and evaluating strategies for IR systems. The course content covers topics like text operations, term weighting, indexing structures, IR models, retrieval evaluation, and query languages over 14 weeks. Students will be evaluated through assessments, a project, and a final exam.
Chapter 1 Introduction to Information Storage and Retrieval.pdf
This course outline provides information about an Information Storage and Retrieval course for third year Information Technology students. The course will cover introductory concepts of information storage and retrieval over 5 ECTS credits across one semester. Topics will include automatic text operations, indexing structures, retrieval models, evaluation, query languages, and current issues. Assessment will include assignments, tests, a project, midterm, and final exam.
This document discusses stacks and queues as linear data structures. It defines stacks as last-in, first-out (LIFO) collections where the last item added is the first removed. Queues are first-in, first-out (FIFO) collections where the first item added is the first removed. Common stack and queue operations like push, pop, insert, and remove are presented along with algorithms and examples. Applications of stacks and queues in areas like expression evaluation, string reversal, and scheduling are also covered.
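The LIFO and FIFO behaviors described above can be sketched in a few lines. This is a minimal illustration, not the document's own code; the class and method names are chosen here for clarity.

```python
from collections import deque

class Stack:
    """Last-in, first-out (LIFO): the last item pushed is the first popped."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)       # add to the top
    def pop(self):
        return self._items.pop()       # remove from the top

class Queue:
    """First-in, first-out (FIFO): the first item inserted is the first removed."""
    def __init__(self):
        self._items = deque()
    def insert(self, item):
        self._items.append(item)       # add at the rear
    def remove(self):
        return self._items.popleft()   # remove from the front

s = Stack()
for x in (1, 2, 3):
    s.push(x)
print(s.pop())     # 3 -- last in, first out

q = Queue()
for x in (1, 2, 3):
    q.insert(x)
print(q.remove())  # 1 -- first in, first out
```

A `deque` is used for the queue so that removal from the front is constant time, which a plain list does not provide.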
This document provides an example of student records in an unnormalized form, containing repeating groups. It then demonstrates normalizing the data by removing the repeating groups into multiple tables in first normal form. Further normalization results in separating attributes with partial dependencies and non-key dependencies into their own tables, achieving second and third normal form respectively. The document explains the different normal forms and how normalization helps reduce data anomalies on insert, update and delete operations.
This chapter introduces the notion of Information Retrieval (IR). After a survey of the classification of various IR systems and the major components of an IR system, the Boolean retrieval model, the inverted index, and the extended Boolean model are presented.
This document provides an overview of database system concepts and architecture. It discusses different data models including conceptual, physical and implementation models. It also covers database languages, interfaces, utilities and centralized versus distributed (client-server) architectures. Specifically, it describes hierarchical and network data models, the three schema architecture, data independence, DBMS languages like DDL and DML, and different DBMS classifications including relational, object-oriented and distributed systems.
Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevancy rankings. Its first use was in the SMART Information Retrieval System.
This document provides an overview of database concepts and terminology. It discusses different types of databases based on number of users (single, multi, workgroup, enterprise), number of computers used (centralized, distributed), and how up-to-date the data is (production, data warehouse). It also covers database categorizations, the relational model, entity types and occurrences, relationship types and occurrences, attributes, keys, and E.F. Codd's 12 rules for relational databases.
Neural networks can be used for machine learning tasks like classification. They consist of interconnected nodes that update their weight values during a training process using examples. Neural networks have been applied successfully to tasks like handwritten character recognition, autonomous vehicle control by observing human drivers, and text-to-speech pronunciation generation. Their architecture is inspired by the human brain but neural networks are trained using computational methods while the brain uses biological processes.
The document discusses R-trees, a data structure used to index multi-dimensional spatial data. R-trees allow for efficient searching of spatial data by grouping data into minimum bounding rectangles (MBRs) and storing them in a tree structure based on these envelopes. The tree structure resembles a B+-tree, with internal nodes containing pointers to child nodes or data records. R-trees provide efficient search, insertion, and deletion of spatial data objects through operations on the tree structure and splitting or merging of nodes as needed.
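At the heart of R-tree search is the MBR overlap test: a subtree is descended only if its bounding rectangle intersects the query window. The following is a minimal sketch under assumed conventions (rectangles as `(xmin, ymin, xmax, ymax)` tuples, nodes as plain dicts); it is not a full R-tree implementation.

```python
def mbr_intersects(a, b):
    """True if two minimum bounding rectangles overlap."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def search(node, query_mbr, results):
    """Descend only into entries whose MBR overlaps the query window."""
    for mbr, child in node["entries"]:
        if mbr_intersects(mbr, query_mbr):
            if node["leaf"]:
                results.append(child)              # child is a data record
            else:
                search(child, query_mbr, results)  # child is a subtree
    return results

# Toy two-level tree: one internal node whose envelope covers one leaf.
leaf = {"leaf": True, "entries": [((0, 0, 2, 2), "A"), ((5, 5, 7, 7), "B")]}
root = {"leaf": False, "entries": [((0, 0, 7, 7), leaf)]}
print(search(root, (1, 1, 3, 3), []))  # ['A']
```

Insertion and node splitting, which keep the tree balanced like a B+-tree, are omitted here.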
This document outlines algorithms for query processing and optimization in database systems. It discusses translating SQL queries to relational algebra, algorithms for sorting and joining large datasets that exceed available memory, including nested loop joins, sort-merge joins, and hash joins. It also describes query optimization techniques and factors that influence query performance.
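Of the join algorithms mentioned, a hash join is easy to sketch: build a hash table on one relation, then probe it with the other. This toy in-memory version (illustrative names, rows as dicts) ignores the partitioning needed when data exceeds memory.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Equi-join two lists of row dicts on a shared key column."""
    table = defaultdict(list)
    for row in left:                       # build phase: hash the left relation
        table[row[key]].append(row)
    joined = []
    for row in right:                      # probe phase: look up each right row
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined

students = [{"sid": 1, "name": "Abebe"}, {"sid": 2, "name": "Sara"}]
grades = [{"sid": 1, "grade": "A"}, {"sid": 2, "grade": "B"}]
print(hash_join(students, grades, "sid"))
```

A real DBMS would hash the smaller relation and spill partitions to disk when the build side does not fit in memory.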
This document presents an overview of text mining. It discusses how text mining differs from data mining in that it involves natural language processing of unstructured or semi-structured text data rather than structured numeric data. The key steps of text mining include pre-processing text, applying techniques like summarization, classification, clustering and information extraction, and analyzing the results. Some common applications of text mining are market trend analysis and filtering of spam emails. While text mining allows extraction of information from diverse sources, it requires initial learning systems and suitable programs for knowledge discovery.
The document discusses different types of trees and graphs as data structures. It defines trees as hierarchical data structures that can represent information in a flexible manner. Binary search trees allow rapid retrieval of data based on keys. Different types of trees are discussed including binary trees, ordered trees, rooted trees, and complete trees. Graphs are also covered as structures that can represent relationships between data items and support applications like social networks. Common graph terms like nodes, edges, directed/undirected graphs, and connectivity are defined.
- An object-relational database (ORD) or object-relational database management system (ORDBMS) supports objects, classes, and inheritance directly in the database schema and query language, while also retaining the relational model.
- An ORDBMS supports an extended form of SQL called SQL3 for handling abstract data types. It allows storage of complex data types like images and location data.
- Key advantages of ORDBMS include reuse and sharing of code through inheritance, increased productivity for developers and users, and more powerful query capabilities. Key challenges include complexity, immaturity of the technology, and increased costs.
This document discusses stacks as a linear data structure. It defines a stack as a last-in, first-out (LIFO) collection where the last item added is the first removed. The core stack operations of push and pop are introduced, along with algorithms to insert, delete, and display items in a stack. Examples of stack applications include reversing strings, checking validity of expressions with nested parentheses, and converting infix notation to postfix.
Entities represent people, objects, or abstract concepts and have attributes that describe examples of the entity. Notation defines the entity name in capital letters and attributes in brackets. Entities and attributes are part of data modeling, where entities become database tables and attributes become fields during implementation. Examples provided include a PUPIL entity with attributes like name and DOB, a CAR entity with attributes like make and model, and a DOCTOR APPOINTMENT entity with date and time attributes.
B+ tree is a data structure used to store (key, value) pairs in a tree-like structure for fast retrieval. It has a root node, intermediary nodes that point to leaf nodes but do not store data, and leaf nodes that store the actual records. Leaf nodes must hold between a minimum and a maximum number of values, and non-leaf nodes between a minimum and a maximum number of child nodes. B+ trees allow fast traversal and search via their balanced structure and sorted nodes. Records are inserted and deleted by searching the tree to the appropriate leaf node and updating pointers and values.
Binary trees are a non-linear data structure where each node has at most two children, used to represent hierarchical relationships, with nodes connected through parent-child links and traversed through preorder, inorder, and postorder methods; they can be represented through arrays or linked lists and support common operations like search, insert, and delete through comparing node values and restructuring child pointers.
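The three traversal orders named above differ only in where the node's own value is visited relative to its subtrees. A minimal sketch (node class and names are illustrative):

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def preorder(node):   # node, then left subtree, then right subtree
    return [] if node is None else [node.value] + preorder(node.left) + preorder(node.right)

def inorder(node):    # left subtree, then node, then right subtree
    return [] if node is None else inorder(node.left) + [node.value] + inorder(node.right)

def postorder(node):  # left subtree, then right subtree, then node
    return [] if node is None else postorder(node.left) + postorder(node.right) + [node.value]

#       2
#      / \
#     1   3
root = Node(2, Node(1), Node(3))
print(preorder(root))   # [2, 1, 3]
print(inorder(root))    # [1, 2, 3]
print(postorder(root))  # [1, 3, 2]
```

Note that an inorder traversal of a binary search tree yields the keys in sorted order.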
Linear search, also called sequential search, is a method for finding a target value within a list by sequentially checking each element until a match is found or all elements are searched. It works by iterating through each element of a data structure (such as an array or list), comparing it to the target value, and returning the index or position of the target if found. The key aspects covered include definitions of data, structure, data structure, and search. Pseudocode and examples of linear search on a phone directory are provided. Advantages are that it is simple and works for small data sets, while disadvantages are that search time increases linearly with the size of the data set.
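The phone-directory example described above can be sketched as follows (the directory contents are made up for illustration):

```python
def linear_search(items, target):
    """Check each element in turn; return its index, or -1 if absent."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

directory = ["Alemu", "Bekele", "Chaltu", "Dawit"]
print(linear_search(directory, "Chaltu"))  # 2
print(linear_search(directory, "Zewdu"))   # -1
```

In the worst case every element is examined, which is the linear growth in search time the summary notes.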
The document discusses normalization in database design. Normalization is the process of organizing data to avoid redundancy and dependency. It involves splitting tables and restructuring relationships between tables. The document outlines various normal forms including 1NF, 2NF, 3NF, BCNF, 4NF and 5NF and provides examples to illustrate how to normalize tables to conform to each form.
Three main types of machine learning are supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a model using labeled input/output data where the desired outputs are provided, allowing the model to map inputs to outputs. Unsupervised learning involves discovering hidden patterns in unlabeled data and grouping similar data points together. Reinforcement learning involves an agent learning through trial-and-error interactions with a dynamic environment by receiving rewards or punishments for actions.
This document provides an overview of linear search and binary search algorithms.
It explains that linear search sequentially searches through an array one element at a time to find a target value. It is simple to implement but has poor efficiency as the time scales linearly with the size of the input.
Binary search is more efficient by cutting the search space in half at each step. It works on a sorted array by comparing the target to the middle element and determining which half to search next. The time complexity of binary search is logarithmic rather than linear.
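The halving step described above can be sketched as a short iterative function (a standard formulation, not code from the original document):

```python
def binary_search(sorted_items, target):
    """Halve the search space each step; O(log n) on a sorted array."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1

print(binary_search([2, 5, 8, 12, 16, 23], 12))  # 3
```

The precondition is that the array is already sorted; on unsorted data the result is meaningless.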
This document provides an overview of entity-relationship modeling as a first step for designing a relational database. It describes how to model entities, attributes, relationships, and participation constraints. Key aspects covered include using boxes to represent entity types, diamonds for relationship types, and labeling relationships with degrees. The document also discusses handling multi-valued attributes and deciding whether to model concepts as attributes or entity types.
The document discusses the vector space model for representing text documents and queries in information retrieval systems. It describes how documents and queries are represented as vectors of term weights, with each term being assigned a weight based on its frequency in the document or query. The vector space model allows documents and queries to be compared by calculating the similarity between their vector representations. Terms that are more frequent in a document and less frequent overall are given higher weights through techniques like TF-IDF weighting. This vector representation enables efficient retrieval of documents ranked by similarity to the query.
This document discusses Shellsort, an algorithm developed by Donald Shell in 1959 that improves on insertion sort. Shellsort works by comparing elements that are farther apart within an array rather than adjacent elements. It makes multiple passes through a list, sorting subsets of elements using an increment sequence that decreases until the final pass sorts adjacent elements using insertion sort. Shellsort breaks the quadratic time barrier of insertion sort and is faster for medium sized lists but is outperformed by other algorithms like merge, heap, and quick sort for large lists. Examples are provided to illustrate how Shellsort works by sorting a sample list.
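The gapped passes described above can be sketched as follows, using the simple halving gap sequence (Shell's original choice; the sample list is illustrative):

```python
def shellsort(a):
    """Sort in place using gapped insertion sort with halving gaps."""
    gap = len(a) // 2
    while gap > 0:
        # Insertion sort over elements that are `gap` apart.
        for i in range(gap, len(a)):
            tmp, j = a[i], i
            while j >= gap and a[j - gap] > tmp:
                a[j] = a[j - gap]   # shift larger element toward the end
                j -= gap
            a[j] = tmp
        gap //= 2                   # final pass (gap == 1) is plain insertion sort
    return a

print(shellsort([54, 26, 93, 17, 77, 31]))  # [17, 26, 31, 54, 77, 93]
```

Because earlier passes leave the list nearly sorted, the final gap-1 insertion sort does little work, which is how Shellsort beats plain insertion sort.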
The document discusses two main types of retrieval models: Boolean models which use set theory and vector space models which use statistical and algebraic approaches. Vector space models represent documents and queries as vectors of keywords weighted by factors like term frequency and inverse document frequency. Similarity between document and query vectors is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
This document discusses vector space retrieval models. It describes how documents and queries are represented as vectors in a common vector space based on terms. Terms are weighted using metrics like term frequency (TF) and inverse document frequency (IDF) to determine importance. The cosine similarity measure is used to calculate similarity between document and query vectors and rank results by relevance. While simple and effective in practice, vector space models have limitations like missing semantic and syntactic information.
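The TF-IDF weighting and cosine ranking described in these summaries can be sketched end to end. This is a minimal illustration with a made-up three-document collection; it assumes every query term occurs somewhere in the collection (otherwise the IDF denominator would be zero).

```python
import math
from collections import Counter

docs = ["information retrieval system",
        "retrieval of stored information",
        "database query processing"]

def tfidf_vector(text, collection):
    """Weight = term frequency * log(N / document frequency)."""
    tf = Counter(text.split())
    n = len(collection)
    return {t: f * math.log(n / sum(t in d.split() for d in collection))
            for t, f in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query = "information retrieval"
qv = tfidf_vector(query, docs)
ranked = sorted(docs, key=lambda d: cosine(qv, tfidf_vector(d, docs)),
                reverse=True)
print(ranked[0])  # the document closest to the query in the vector space
```

Documents sharing no terms with the query score zero, and among matching documents the shorter, more focused one ranks higher because of length normalization in the cosine.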
This document discusses different information retrieval models including the Boolean model, vector space model, and probabilistic model. It focuses on describing the Boolean model and its drawbacks. Term frequency-inverse document frequency (TF-IDF) weighting is explained as a way to assign weights to terms based on frequency and document distribution. Cosine similarity is presented as a common way to measure similarity between a document vector and query vector in the vector space model.
The document discusses the vector space model used in information retrieval. It explains that documents and queries are represented as weighted vectors in a multidimensional space. Similar vectors are close to each other. The weights used are usually tf-idf, which considers both the frequency of a term within a document and its rarity across documents. Documents are ranked based on the similarity between their vector representation and the query vector.
The document discusses the vector space model used in information retrieval. It explains that documents and queries are represented as weighted vectors in a high dimensional vector space. Similarities between queries and documents are calculated to rank documents by relevance. Weights are often calculated using TF-IDF, which considers the frequency of terms within documents and across collections. Documents with vector representations closer to the query vector are considered more relevant.
The document is a presentation on information retrieval by Richard Chbeir. It discusses key concepts in information retrieval including definitions of information retrieval, the information retrieval process, query and document processing techniques like stop word removal and stemming, representation models like the Boolean and vector space models, and inverted indexes. Specific topics covered include query representation, document indexing and processing, weighting schemes for terms, and measuring similarity between queries and documents.
The document discusses different techniques for topic modeling of documents, including TF-IDF weighting and cosine similarity. It proposes a semi-supervised approach that uses predefined topics from Prismatic to train an LDA model on Wikipedia articles. This model classifies news articles into topics. The accuracy is improved by redistributing term weights based on their relevance within topic clusters rather than just document frequency. An experiment on over 5000 news articles found that the combined weighting approach outperformed TF-IDF alone on articles with multiple topics or limited content.
The document discusses the basics of information retrieval systems. It covers two main stages - indexing and retrieval. In the indexing stage, documents are preprocessed and stored in an index. In retrieval, queries are issued and the index is accessed to find relevant documents. The document then discusses several models for defining relevance between documents and queries, including the Boolean model and vector space model. It also covers techniques for representing documents and queries as vectors and calculating similarity between them.
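The two stages described above can be sketched together: an indexing pass that builds an inverted index, and a retrieval pass that intersects posting sets. The documents here are made up for illustration.

```python
from collections import defaultdict

docs = {1: "the cat sat on the mat",
        2: "the dog sat on the log",
        3: "cats and dogs"}

# Indexing stage: map each term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def retrieve(query):
    """Retrieval stage: intersect the posting sets of all query terms."""
    postings = [index.get(t, set()) for t in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

print(retrieve("sat on"))   # [1, 2]
print(retrieve("cat mat"))  # [1]
```

Real systems also apply the preprocessing the summary mentions (stop-word removal, stemming) before terms enter the index, and store term weights in the postings to support ranking.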
Simple semantics in topic detection and trackingGeorge Ang
This document describes a topic detection and tracking (TDT) system that uses semantic classes to represent documents. It splits documents into categories like names, locations, terms and temporals. Documents are represented as event vectors of these classes. The system compares event vectors class by class using metrics like cosine similarity. Experiments on a news corpus show the system achieves reasonable performance on tasks like topic tracking and first story detection, though performance degrades without vagueness factors. Semantic augmentation did not improve results as expected.
Term weighting assigns a weight to terms in documents to quantify their importance in describing the document's contents. Weights are higher for terms that occur frequently in a document but rarely in other documents. Term frequency in a document and inverse document frequency are used to calculate TF-IDF weights. Term occurrences may be correlated, so term weights should reflect their correlation. For example, terms like "computer" and "network" often appear together in documents about computer networks.
The document discusses several information retrieval models including the Boolean, vector space, and probabilistic models. It provides details on how each model represents documents and queries, defines relevance, and ranks documents in response to queries. Specifically, it describes:
1) The Boolean model uses exact matching to retrieve only documents that satisfy a Boolean query, but does not rank results.
2) The vector space model represents documents and queries as vectors of term weights and ranks documents based on their similarity to the query vector using measures like cosine similarity.
3) Term frequency-inverse document frequency (TF-IDF) is discussed as a method to weight terms based on their importance.
This document provides lecture notes on information retrieval systems. It covers key concepts like precision and recall, different retrieval strategies including vector space model and probabilistic models, and retrieval utilities. The vector space model represents documents and queries as vectors in a shared space and calculates similarity using cosine similarity. Probabilistic models assign probabilities to terms and documents and estimate relevance probabilities. The notes discuss term weighting schemes, inverted indexes to improve efficiency, and integrating structured data with text retrieval. The overall objective is for students to learn fundamental models and techniques for information storage and retrieval.
The document discusses different techniques for weighting terms in the vector space model for information retrieval, including:
- Sublinear tf scaling using the logarithm of term frequency
- Tf-idf weighting
- Maximum tf normalization to mitigate higher weights for longer documents
It also discusses evaluating information retrieval systems using test collections with queries, relevant documents, and metrics like precision and recall. Standard test collections include Cranfield, TREC, and CLEF.
This document provides an overview of the Introduction to Algorithms course, including the course modules and motivating problems. It introduces the Document Distance problem, which aims to define metrics to measure the similarity between documents based on word frequencies. It discusses an initial Python program ("docdist1.py") to calculate document distance that runs inefficiently due to quadratic time list concatenation. Profiling identifies this as the bottleneck. The solution is to use list extension, resulting in "docdist3.py". Further optimizations include using a dictionary to count word frequencies in constant time, creating "docdist4.py". The document outlines remaining opportunities like improving the word extraction and sorting algorithms.
The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.
This document discusses various techniques used in web search engines for indexing and ranking documents. It covers topics like inverted indices, stopword removal, stemming, relevance feedback, vector space models, and Bayesian inference networks. Web search engines prepare an index of keywords for documents and return ranked lists in response to queries by measuring similarities between query and document vectors based on term frequencies and inverse document frequencies.
Boolean,vector space retrieval Models Primya Tamil
The document discusses various information retrieval models including Boolean, vector space, and probabilistic models. It provides details on how documents and queries are represented and compared in the vector space model. Specifically, it explains that in this model, documents and queries are represented as vectors of term weights in a multi-dimensional space. The similarity between a document and query vector is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.
This document provides an introduction to information retrieval systems and their main components. It discusses how IR systems aim to find relevant documents from a large collection in response to a user's information need. The key processes involved are document indexing to represent contents, query formulation, retrieval of relevant documents, and system evaluation. Indexing involves selecting important keywords from documents and assigning them weights. Various retrieval models are described for comparing document and query representations, such as vector space and probabilistic models. The document also discusses challenges in document representation, query processing, and evaluating system effectiveness.
Knowledge Discovery Query Language (KDQL)Zakaria Zubi
The document discusses Knowledge Discovery Query Language (KDQL), a proposed query language for interacting with i-extended databases in the knowledge discovery process. KDQL is designed to handle data mining rules and retrieve association rules from i-extended databases. The key points are:
1) KDQL is based on SQL and is intended to support tasks like association rule mining within the ODBC_KDD(2) model for knowledge discovery.
2) It can be used to query i-extended databases, which contain both data and discovered patterns.
3) The KDQL RULES operator allows users to specify data mining tasks like finding association rules that satisfy certain frequency and confidence thresholds.
The document discusses different information retrieval models including boolean, vector, and probabilistic models. The boolean model uses set theory and boolean algebra to represent documents and queries. The vector model assigns weights to terms and ranks documents based on similarity to the query. The probabilistic model calculates the probability of a document being relevant given a query. It also covers structured models for different tasks like filtering and browsing.
This document provides an outline for a course on Information Storage and Retrieval. It includes information on the course code, credits, target group, instructor contact details, course description and objectives. The course syllabus outlines 8 chapters covering topics like introduction to information retrieval systems, text operations and indexing, retrieval models and evaluation, query languages and operations, and current issues in IR. Student assessment will include assignments, tests, exams and a project. Reference books for the course are also listed.
The document discusses techniques for improving query formulations in information retrieval systems, including relevance feedback and query expansion. It describes how relevance feedback works by reformulating the query based on terms from documents users mark as relevant or irrelevant. This can be done automatically through pseudo relevance feedback by assuming the top retrieved documents are relevant. The document also discusses problems that can arise with relevance feedback and potential solutions, such as presenting modified queries to users for review.
This document discusses various text operations and techniques for automatic indexing in information retrieval systems. It covers topics like tokenization, stop word removal, stemming, term weighting, Zipf's law, Luhn's model of word frequency, and Heap's law on vocabulary growth. The goal of these text operations is to select meaningful index terms from documents to represent their contents and reduce noise for more effective retrieval.
This document summarizes key aspects of evaluating information retrieval systems, including:
- Precision and recall are common performance measures, where precision measures the percentage of retrieved documents that are relevant and recall measures the percentage of relevant documents retrieved.
- Other measures include mean average precision (MAP), which averages precision scores across queries, and R-precision, which measures precision after R relevant documents are retrieved, where R is the total number of relevant documents.
- Precision and recall can be plotted on a graph to show their tradeoff, with interpolation used to calculate precision at standard recall levels for better comparison of systems.
- Relevance judgments can be subjective, situational, and dynamic, making evaluation of IR systems challenging.
The document discusses various indexing structures used to improve efficiency of information retrieval from document collections. It describes sequential files which arrange records sequentially but sorted on a key. Inverted files list terms alphabetically with pointers to documents containing the term. Suffix trees and arrays store suffixes to enable fast lookups. The construction of inverted files involves extracting terms, building vocabulary and postings lists. Searching inverted files uses binary search on vocabulary and accesses postings lists. Suffix trees compactly store all suffixes to enable prefix matching.
The document discusses techniques for improving information retrieval through query reformulation, including relevance feedback and query expansion. It describes:
1) Relevance feedback involves revising the original query based on a user's judgements about retrieved documents, such as increasing weights of terms in relevant documents and decreasing weights in irrelevant documents.
2) Pseudo relevance feedback automatically assumes the top retrieved documents are relevant and uses them to expand the query, without explicit user feedback.
3) Query expansion techniques aim to retrieve additional relevant documents by adding new related terms to the original query, drawing terms either from the local set of initially retrieved documents, or globally from the entire document collection. Local analysis tends to give better results by avoiding ambiguity.
This document discusses different types of query languages used for information retrieval systems. It describes keyword queries where documents are retrieved based on the presence of query words. Phrase queries search for an exact sequence of words. Boolean queries use logical operators like AND, OR and NOT to combine search terms. Natural language queries allow users to enter searches in a free-form manner but require translation to a formal query language. The document provides examples and explanations of each query language type over its 12 sections.
Level 3 NCEA - NZ: A Nation In the Making 1872 - 1900 SML.pptHenry Hollis
The History of NZ 1870-1900.
Making of a Nation.
From the NZ Wars to Liberals,
Richard Seddon, George Grey,
Social Laboratory, New Zealand,
Confiscations, Kotahitanga, Kingitanga, Parliament, Suffrage, Repudiation, Economic Change, Agriculture, Gold Mining, Timber, Flax, Sheep, Dairying,
Leveraging Generative AI to Drive Nonprofit InnovationTechSoup
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Service experts provided a customer specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Service (AWS.)
The chapter Lifelines of National Economy in Class 10 Geography focuses on the various modes of transportation and communication that play a vital role in the economic development of a country. These lifelines are crucial for the movement of goods, services, and people, thereby connecting different regions and promoting economic activities.
LAND USE LAND COVER AND NDVI OF MIRZAPUR DISTRICT, UPRAHUL
This Dissertation explores the particular circumstances of Mirzapur, a region located in the
core of India. Mirzapur, with its varied terrains and abundant biodiversity, offers an optimal
environment for investigating the changes in vegetation cover dynamics. Our study utilizes
advanced technologies such as GIS (Geographic Information Systems) and Remote sensing to
analyze the transformations that have taken place over the course of a decade.
The complex relationship between human activities and the environment has been the focus
of extensive research and worry. As the global community grapples with swift urbanization,
population expansion, and economic progress, the effects on natural ecosystems are becoming
more evident. A crucial element of this impact is the alteration of vegetation cover, which plays a
significant role in maintaining the ecological equilibrium of our planet.Land serves as the foundation for all human activities and provides the necessary materials for
these activities. As the most crucial natural resource, its utilization by humans results in different
'Land uses,' which are determined by both human activities and the physical characteristics of the
land.
The utilization of land is impacted by human needs and environmental factors. In countries
like India, rapid population growth and the emphasis on extensive resource exploitation can lead
to significant land degradation, adversely affecting the region's land cover.
Therefore, human intervention has significantly influenced land use patterns over many
centuries, evolving its structure over time and space. In the present era, these changes have
accelerated due to factors such as agriculture and urbanization. Information regarding land use and
cover is essential for various planning and management tasks related to the Earth's surface,
providing crucial environmental data for scientific, resource management, policy purposes, and
diverse human activities.
Accurate understanding of land use and cover is imperative for the development planning
of any area. Consequently, a wide range of professionals, including earth system scientists, land
and water managers, and urban planners, are interested in obtaining data on land use and cover
changes, conversion trends, and other related patterns. The spatial dimensions of land use and
cover support policymakers and scientists in making well-informed decisions, as alterations in
these patterns indicate shifts in economic and social conditions. Monitoring such changes with the
help of Advanced technologies like Remote Sensing and Geographic Information Systems is
crucial for coordinated efforts across different administrative levels. Advanced technologies like
Remote Sensing and Geographic Information Systems
9
Changes in vegetation cover refer to variations in the distribution, composition, and overall
structure of plant communities across different temporal and spatial scales. These changes can
occur natural.
This presentation was provided by Rebecca Benner, Ph.D., of the American Society of Anesthesiologists, for the second session of NISO's 2024 Training Series "DEIA in the Scholarly Landscape." Session Two: 'Expanding Pathways to Publishing Careers,' was held June 13, 2024.
How to Make a Field Mandatory in Odoo 17Celine George
In Odoo, making a field required can be done through both Python code and XML views. When you set the required attribute to True in Python code, it makes the field required across all views where it's used. Conversely, when you set the required attribute in XML views, it makes the field required only in the context of that particular view.
Gender and Mental Health - Counselling and Family Therapy Applications and In...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Chapter 4
IR Models
Outline
Introduction of IR Models
Boolean model
Vector space model
Probabilistic model
Word evidence: Bag of words
IR systems usually adopt index terms to index and
retrieve documents
Each document is represented by a set of
representative keywords or index terms (called Bag
of Words)
An index term is a word from a document useful for
capturing the document's main themes
Not all terms are equally useful for representing the document
contents:
less frequent terms allow identifying a narrower set
of documents
But no ordering information is attached to the Bag of Words
identified from the document collection.
IR Models - Basic Concepts
• One central problem regarding IR systems is the
issue of predicting which documents are relevant
and which are not
• Such a decision is usually dependent on a ranking
algorithm which attempts to establish a simple
ordering of the documents retrieved
• Documents appearing at the top of this
ordering are considered to be more likely to be
relevant
• Thus ranking algorithms are at the core of IR
systems
• The IR models determine the predictions of
what is relevant and what is not, based on the
notion of relevance implemented by the system
IR Models - Basic Concepts
• After preprocessing, N distinct terms remain which
are Unique terms that form the VOCABULARY
• Let
– ki be an index term i & dj be a document j
– K = (k1, k2, …, kN) is the set of all index terms
• Each term, i, in a document or query j, is given a
real-valued weight, wij.
– wij is a weight associated with (ki,dj). If wij = 0 ,
it indicates that term does not belong to
document dj
• The weight wij quantifies the importance of the
index term for describing the document contents
• vec(dj) = (w1j, w2j, …, wNj) is a weighted vector
associated with the document dj
Mapping Documents & Queries
Represent both documents and queries as N-
dimensional vectors in a term-document matrix, which
shows occurrence of terms in the document collection
or query
An entry in the matrix corresponds to the “weight” of a
term in the document:

dj = (t1,j, t2,j, …, tN,j);   qk = (t1,k, t2,k, …, tN,k)
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
–Document collection is mapped to
term-by-document matrix
–The documents are viewed as
vectors in multidimensional space
• “Nearby” vectors are related
–Normalize the weights by vector length, as usual,
to avoid the effect of document length
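As a minimal sketch of this mapping (assuming simple whitespace tokenization and raw term counts as weights; real systems add stop word removal, stemming, and tf*idf), a term-by-document matrix can be built as follows, using the three example documents from later slides:

```python
# Build a term-by-document matrix from raw text, with raw term counts
# as weights (a sketch only; no stop word removal or stemming).
docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Vocabulary: the N distinct terms across the collection
vocab = sorted({t for text in docs.values() for t in text.lower().split()})

# Each document becomes an N-dimensional vector of term counts
matrix = {
    name: [text.lower().split().count(term) for term in vocab]
    for name, text in docs.items()
}
```

Each row is a document vector in the N-dimensional term space; "nearby" vectors then correspond to documents with similar term usage.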
Weighting Terms in Vector Space
• The importance of the index terms is represented by
weights associated to them
• Problem: to show the importance of the index term for
describing the document/query contents, what weight can
we assign?
• Solution 1: Binary weights: wij = 1 if the term is present, 0 otherwise
– Similarity: number of terms in common
• Problem: Not all terms equally interesting
– E.g. the vs. dog vs. cat
• Solution: Replace binary weights with non-binary weights
dj = (w1,j, w2,j, …, wN,j);   qk = (w1,k, w2,k, …, wN,k)
The Boolean Model
• Boolean model is a simple model based on set theory
• The Boolean model imposes a binary criterion for
deciding relevance
• Terms are either present or absent. Thus,
wij ∈ {0,1}
• sim(q,dj) = 1, if document satisfies the boolean query
0 otherwise
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
- Note that, no weights
assigned in-between 0 and 1,
only values 0 or 1 can be
assigned
[Venn diagram: documents d1–d7 distributed over the index terms k1, k2, k3]
• Generate the relevant documents retrieved by
the Boolean model for the query:
q = k1 ∧ (k2 ∨ ¬k3)
The Boolean Model: Example
• Given the following determine documents retrieved by the
Boolean model based IR system
• Index Terms: K1, …,K8.
• Documents:
1. D1 = {K1, K2, K3, K4, K5}
2. D2 = {K1, K2, K3, K4}
3. D3 = {K2, K4, K6, K8}
4. D4 = {K1, K3, K5, K7}
5. D5 = {K4, K5, K6, K7, K8}
6. D6 = {K1, K2, K3, K4}
• Query: K1 ∧ (K2 ∨ ¬K3)
• Answer: {D1, D2, D4, D6} ∩ ({D1, D2, D3, D6} ∪ {D3, D5})
= {D1, D2, D6}
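The set algebra in this answer can be sketched directly with Python sets (document contents taken from the example; the query operators are read as K1 AND (K2 OR NOT K3)):

```python
# Boolean retrieval via set operations for the worked example.
docs = {
    "D1": {"K1", "K2", "K3", "K4", "K5"},
    "D2": {"K1", "K2", "K3", "K4"},
    "D3": {"K2", "K4", "K6", "K8"},
    "D4": {"K1", "K3", "K5", "K7"},
    "D5": {"K4", "K5", "K6", "K7", "K8"},
    "D6": {"K1", "K2", "K3", "K4"},
}
all_docs = set(docs)

def having(term):
    """Set of documents that contain the given index term."""
    return {d for d, terms in docs.items() if term in terms}

# Query: K1 AND (K2 OR NOT K3)
result = having("K1") & (having("K2") | (all_docs - having("K3")))
# result == {"D1", "D2", "D6"}
```

Each index term maps to the set of documents containing it; AND, OR, and NOT become set intersection, union, and complement.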
The Boolean Model: Example
Given the following three documents, Construct Term – document
matrix and find the relevant documents retrieved by the
Boolean model for given query
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
• Query: gold ∧ (silver ∨ truck)
Table below shows document –term (ti) matrix
The Boolean Model: Further Example
arrive damage deliver fire gold silver ship truck
D1
D2
D3
query
Also find the relevant
documents for the queries:
(a) gold ∧ delivery;
(b) ship ∧ ¬gold;
(c) silver ∧ truck
Exercise
Given the following three documents with
the following contents:
– D1 = “computer information retrieval”
– D2 = “computer retrieval”
– D3 = “information”
– D4 = “computer information”
• What are the relevant documents
retrieved for the queries:
– Q1 = information ∧ retrieval
– Q2 = information ∧ ¬computer
Doc No Term 1 Term 2 Term 3 Term 4
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
Exercise: What are the relevant documents retrieved for the query:
((chaucer OR milton) AND (swift OR shakespeare))
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no
notion of partial matching
• No ranking of the documents is provided (absence of
a grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are
most often too simplistic
• As a consequence, the Boolean model frequently
returns either too few or too many documents in
response to a user query
• Just changing a boolean operator from “AND” to “OR”
changes the result from intersection to union
Vector-Space Model (VSM)
• This is the most commonly used strategy for measuring
relevance of documents for a given query. This is
because,
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial
matches
• These term weights are used to compute a degree of
similarity between a query and each document
• Ranked set of documents provides for better matching
• The idea behind VSM is that
• the meaning of a document is conveyed by the words
used in that document
Vector-Space Model
To find relevant documents for a given query,
• First, Documents and queries are mapped into term vector
space.
• Note that queries are considered as short documents
• Second, in the vector space, queries and documents are
represented as weighted vectors
• There are different weighting techniques; the most widely
used one computes tf*idf for each term
• Third, a similarity measure is used to rank documents by
the closeness of their vectors to the query vector;
closeness is determined by a similarity score calculation
Term-document matrix.
• A collection of n documents and query can be represented
in the vector space model by a term-document matrix.
–An entry in the matrix corresponds to the “weight” of a
term in the document;
–zero means the term has no significance in the document or
it simply doesn’t exist in the document. Otherwise, wij >
0 whenever ki ∈ dj
T1 T2 …. TN
D1 w11 w12 … w1N
D2 w21 w22 … w2N
: : : :
: : : :
DM wM1 wM2 … wMN
Computing weights
• How do we compute weights for term i in document j and
query q; wij and wiq ?
• A good weight must take into account two effects:
–Quantification of intra-document contents (similarity)
• The tf factor, the term frequency within a document
–Quantification of inter-documents separation
(dissimilarity)
• The idf factor, the inverse document frequency
–As a result of which most IR systems are using tf*idf
weighting technique:
wij = tf(i,j) * idf(i)
Let,
N be the total number of documents in the collection
ni be the number of documents which contain ki
freq(i,j) raw frequency of ki within dj
A normalized tf factor is given by
f(i,j) = freq(i,j) / max(freq(l,j))
where the maximum is computed over all terms which
occur within the document dj
The idf factor is computed as
idf(i) = log (N/ni)
the log is used to make the values of tf and idf
comparable. It can also be interpreted as the amount of
information associated with the term ki.
Computing Weights
The best term-weighting schemes use tf*idf
weights which are given by
wij = f(i,j) * log(N/ni)
For the query term weights, a suggestion is
wiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q))]) *
log(N/ni)
The vector space model with tf*idf weights is
a good ranking strategy with general
collections
The vector space model is usually as good as the
known ranking alternatives.
It is also simple and fast to compute.
Computing weights
A collection includes 10,000 documents
The term A appears 20 times in a particular
document
The maximum appearance of any term in this
document is 50
The term A appears in 2,000 of the collection
documents.
Compute TF*IDF weight?
f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4
idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
wij = f(i,j) * log(N/ni) = 0.4 * 2.32 = 0.928
Example: Computing weights
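The same computation can be sketched in a few lines (assuming a base-2 logarithm, which matches the 2.32 in the example):

```python
import math

def tf_idf(freq_ij, max_freq_j, N, n_i):
    """tf*idf weight: normalized term frequency times inverse document frequency."""
    f_ij = freq_ij / max_freq_j   # normalized tf: freq(i,j) / max(freq(l,j))
    idf_i = math.log2(N / n_i)    # idf: log(N / ni), base 2 assumed here
    return f_ij * idf_i

# Term A: 20 occurrences, most frequent term occurs 50 times,
# 10,000 documents in the collection, A appears in 2,000 of them.
w = tf_idf(freq_ij=20, max_freq_j=50, N=10_000, n_i=2_000)
# w ≈ 0.4 * 2.32 ≈ 0.93
```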
Similarity Measure
• Sim(q,dj) = cos(θ), where θ is the angle between the
document vector dj and the query vector q
• Since wij > 0 and wiq > 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query
terms only partially

sim(dj, q) = (dj • q) / (|dj| |q|)
           = Σi (wi,j * wi,q) / (sqrt(Σi wi,j²) * sqrt(Σi wi,q²))
Suppose we issue the query Q: “gold silver
truck”. The database collection consists of
the following three documents.
• D1: “Shipment of gold damaged in a fire”
• D2: “Delivery of silver arrived in a silver truck”
• D3: “Shipment of gold arrived in a truck”
Assume that all terms are used, including
common terms, stop words, and also no terms
are reduced to root terms.
Show retrieval results in ranked order?
Vector-Space Model: Example
Terms      Q   Counts        DF   IDF     Wi = TF*IDF
               D1  D2  D3                 Q      D1     D2     D3
a          0   1   1   1     3    0       0      0      0      0
arrived    0   0   1   1     2    0.176   0      0      0.176  0.176
damaged    0   1   0   0     1    0.477   0      0.477  0      0
delivery   0   0   1   0     1    0.477   0      0      0.477  0
fire       0   1   0   0     1    0.477   0      0.477  0      0
gold       1   1   0   1     2    0.176   0.176  0.176  0      0.176
in         0   1   1   1     3    0       0      0      0      0
of         0   1   1   1     3    0       0      0      0      0
silver     1   0   2   0     1    0.477   0.477  0      0.954  0
shipment   0   1   0   1     2    0.176   0      0.176  0      0.176
truck      1   0   1   1     2    0.176   0.176  0      0.176  0.176
Vector-Space Model
Terms Q D1 D2 D3
a 0 0 0 0
arrived 0 0 0.176 0.176
damaged 0 0.477 0 0
delivery 0 0 0.477 0
fire 0 0.477 0 0
gold 0.176 0.176 0 0.176
in 0 0 0 0
of 0 0 0 0
silver 0.477 0 0.954 0
shipment 0 0.176 0 0.176
truck 0.176 0 0.176 0.176
Vector-Space Model: Example
• Compute similarity using cosine Sim(q,di)
• First, for each document and the query, compute all vector
lengths (zero terms ignored)
|d1| = sqrt(0.477² + 0.477² + 0.176² + 0.176²) = sqrt(0.5173) = 0.719
|d2| = sqrt(0.176² + 0.477² + 0.954² + 0.176²) = sqrt(1.2001) = 1.095
|d3| = sqrt(0.176² + 0.176² + 0.176² + 0.176²) = sqrt(0.1239) = 0.352
|q|  = sqrt(0.176² + 0.477² + 0.176²) = sqrt(0.2896) = 0.538
• Next, compute dot products (zero products ignored)
q•d1 = 0.176*0.176 = 0.0310
q•d2 = 0.954*0.477 + 0.176*0.176 = 0.4862
q•d3 = 0.176*0.176 + 0.176*0.176 = 0.0620
Vector-Space Model: Example
Now, compute similarity scores
Sim(q,d1) = (0.0310) / (0.538*0.719) = 0.0801
Sim(q,d2) = (0.4862) / (0.538*1.095) = 0.8246
Sim(q,d3) = (0.0620) / (0.538*0.352) = 0.3271
Finally, we sort and rank documents in descending
order according to the similarity scores
Rank 1: Doc 2 = 0.8246
Rank 2: Doc 3 = 0.3271
Rank 3: Doc 1 = 0.0801
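The whole ranking above can be sketched with the tf*idf weights from the table (zero-weight terms omitted from each vector):

```python
import math

# tf*idf weights from the worked example (nonzero entries only)
weights = {
    "Q":  {"gold": 0.176, "silver": 0.477, "truck": 0.176},
    "D1": {"damaged": 0.477, "fire": 0.477, "gold": 0.176, "shipment": 0.176},
    "D2": {"arrived": 0.176, "delivery": 0.477, "silver": 0.954, "truck": 0.176},
    "D3": {"arrived": 0.176, "gold": 0.176, "shipment": 0.176, "truck": 0.176},
}

def norm(vec):
    """Vector length: sqrt of the sum of squared weights."""
    return math.sqrt(sum(w * w for w in vec.values()))

def cosine(a, b):
    """Cosine similarity: dot product over the product of lengths."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    return dot / (norm(a) * norm(b))

scores = {d: cosine(weights["Q"], weights[d]) for d in ("D1", "D2", "D3")}
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking == ["D2", "D3", "D1"]
```

Sparse dictionaries stand in for the full vectors: terms with weight zero contribute nothing to either the dot product or the vector length, so they can be skipped, which is exactly what an inverted index exploits at scale.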
• Exercise: using normalized TF, rank documents using
cosine similarity measure? Hint: Normalize TF of term i
in doc j using max frequency of a term k in document j.
Vector-Space Model
• Advantages:
• term-weighting improves quality of the answer
set since it displays in ranked order
• partial matching allows retrieval of documents
that approximate the query conditions
• cosine ranking formula sorts documents
according to degree of similarity to the query
• Disadvantages:
• assumes independence of index terms, which is
questionable in practice (correlated terms such as
“computer” and “network” often co-occur)
Technical Memo Example
Suppose the database collection consists of the following
documents.
c1: Human machine interface for Lab ABC computer
applications
c2: A survey of user opinion of computer system response
time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error
measure
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
Query:
Find documents relevant to "human computer interaction"
Exercises
• Given the following documents, rank documents
according to their relevance to the query using cosine
similarity, Euclidean distance and inner product
measures?
docID words in document
1 Taipei Taiwan
2 Macao Taiwan Shanghai
3 Japan Sapporo
4 Sapporo Osaka Taiwan
Query: Taiwan Sapporo ?
Thank you