INFORMATION RETRIEVAL
Presentation
Team Members
Ha (2113269)
Phuong (2112070)
Nhan (2114277)
Content
• Basic Concepts
• Text Preprocessing
• Indexing
• Traditional IR Models & Modern Approaches
• Evaluation & Applications
Basic Concepts
What is Information Retrieval?
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
Our goal is to develop a system to address the ad hoc retrieval task, which is
the most standard IR task. In it, a system aims to provide documents from
within the collection that are relevant to an arbitrary user information need,
communicated to the system by means of a one-off, user-initiated query.
Terminology
• Documents and Collection
• Terms and Vocabulary
• Query Terms
• Notations
Text Preprocessing
Tokenization
Tokenization is the process of breaking
down text into individual units, or tokens.
This step is foundational as it converts
unstructured text into a structured format
that can be easily analyzed.
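As a toy illustration (not the presenters' code), a simple regex tokenizer in Python might look like this; production systems typically rely on library tokenizers such as those in NLTK or spaCy:

```python
import re

def tokenize(text: str) -> list[str]:
    # A simple regex tokenizer: every maximal run of word characters is a token.
    return re.findall(r"\w+", text)

print(tokenize("Information retrieval (IR) is finding material of an unstructured nature."))
# ['Information', 'retrieval', 'IR', 'is', 'finding', 'material', 'of', 'an', 'unstructured', 'nature']
```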
Removing Stopwords
Stopwords are common, low-information words that appear frequently in text but carry little meaning on their own. Removing stopwords reduces the amount of data to process and focuses the analysis on more meaningful terms.
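A minimal sketch of stopword removal over the tokens from the previous step, assuming a small illustrative stopword list (real systems use much larger lists, e.g., NLTK's stopwords corpus):

```python
STOPWORDS = {"a", "an", "the", "is", "of", "in", "and", "that", "from"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    # Keep only tokens that are not in the stopword list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["Information", "retrieval", "is", "finding", "material"]))
# ['Information', 'retrieval', 'finding', 'material']
```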
Normalization
Text normalization is the process of converting text to a consistent format,
which helps to avoid variations that may lead to mismatches in analysis. This
includes:
• Lowercasing: “Text”, “TEXT”, “text” -> “text”
• Removing punctuation and special characters (.,;:)
• Replacing contractions: “can’t” -> “cannot”
• Standardizing accents: “café” -> “cafe”
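The steps above can be sketched as follows (a simplified example, not a full normalizer; contraction handling here is a single hard-coded rule):

```python
import re
import unicodedata

def normalize(text: str) -> str:
    text = text.lower()                         # "Text", "TEXT" -> "text"
    text = text.replace("can't", "cannot")      # toy contraction replacement
    text = unicodedata.normalize("NFKD", text)  # decompose accented characters: "café" -> "cafe"
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", "", text)         # drop punctuation and special characters
    return text

print(normalize("He CAN'T visit the café."))  # "he cannot visit the cafe"
```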
Stemming and Lemmatization
• Stemming is the process of
reducing words to their root or base
form, often by removing common
suffixes.
• Lemmatization reduces words to
their lemma, or dictionary form,
which maintains semantic meaning
more accurately than stemming.
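A short comparison using NLTK, assuming NLTK is installed and the WordNet data has been downloaded, just to show how the two outputs differ:

```python
# pip install nltk, then: import nltk; nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["studies", "running", "better"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# studies -> studi / study
# running -> run / run
# better -> better / better   (the lemma depends on the part-of-speech tag supplied)
```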
Indexing
Inverted Index
We keep a dictionary of terms. Then
for each term, we have a list that
records which documents the term
occurs in. Each item in the list is
conventionally called a posting. The
list is then called a postings list.
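In Python, a toy inverted index can be built with a dictionary that maps each term to its postings list (the three-document collection below is made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted postings list of the documents it occurs in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(postings) for term, postings in index.items()}

docs = {1: "new home sales", 2: "home prices rise", 3: "new car sales"}
index = build_inverted_index(docs)
print(index["sales"])  # [1, 3]
print(index["home"])   # [1, 2]
```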
Index Construction
Index construction algorithms and
techniques
• Blocked Sort-Based Indexing
• Single-Pass In-Memory Indexing
• Distributed Indexing
• Dynamic Indexing
Blocked Sort-Based Indexing
Blocked Sort-Based Indexing (BSBI) is an indexing method that builds an
inverted index by processing large datasets in manageable chunks or
"blocks," which are then merged to create the final index.
This approach is particularly useful when indexing large datasets that do not
fit in memory.
Blocked Sort-Based Indexing
Indexing process:
• Documents are divided into fixed-size blocks (e.g., 100,000 documents
per block).
• Indexing Each Block:
⚬ Each block is read into memory, and a term-document mapping is
created for that block.
⚬ The terms are sorted within each block to construct a local inverted
index.
• The individual sorted blocks are written to disk. After all blocks have been
processed, a multi-way merge algorithm is used to combine them into
a single, comprehensive inverted index.
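A heavily simplified, in-memory simulation of the idea (real BSBI writes each sorted block to disk and merges the block files; the block size and documents here are made up):

```python
import heapq
from itertools import groupby

def bsbi_index(docs: dict[int, str], block_size: int = 2) -> dict[str, list[int]]:
    blocks, items = [], list(docs.items())
    for i in range(0, len(items), block_size):
        # Collect (term, doc_id) pairs for this block and sort them -> one sorted run.
        pairs = {(term, doc_id)
                 for doc_id, text in items[i:i + block_size]
                 for term in text.lower().split()}
        blocks.append(sorted(pairs))
    merged = heapq.merge(*blocks)  # multi-way merge of the sorted runs
    return {term: sorted({doc_id for _, doc_id in group})
            for term, group in groupby(merged, key=lambda pair: pair[0])}

docs = {1: "new home sales", 2: "home prices rise", 3: "new car sales", 4: "car prices"}
print(bsbi_index(docs)["sales"])  # [1, 3]
```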
Blocked Sort-Based Indexing
Merge process
Blocked Sort-Based Indexing
BSBI Algorithm
Single-Pass In-Memory Indexing
Single-Pass In-Memory Indexing (SPIMI) is an alternative to BSBI designed
for efficiency by avoiding repeated sorting operations. Instead, SPIMI builds
each block of the index in a single pass, reducing the memory needed for
term storage.
Single-Pass In-Memory Indexing
SPIMI indexing process
• Unlike BSBI, SPIMI processes each term immediately: terms are not stored for later sorting; they are indexed as they appear in documents.
• Building posting lists:
⚬ As terms are encountered, SPIMI creates a posting list for each unique
term.
⚬ If the term already exists, SPIMI simply updates the list with the new
document information.
• Once the memory is full, the block (containing all posting lists for terms
processed so far) is written to disk, and memory is cleared for the next block.
• Similar to BSBI, the individual blocks are merged to create the final inverted
index.
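A toy single-block version of the idea (the disk-writing and block-merging steps are only indicated in a comment):

```python
def spimi_invert(token_stream) -> dict[str, list[int]]:
    dictionary: dict[str, list[int]] = {}
    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])  # new term: create its postings list
        if not postings or postings[-1] != doc_id:  # existing term: append the new document
            postings.append(doc_id)
        # When memory is "full", a real SPIMI implementation would sort the terms,
        # write this block to disk, and start a fresh dictionary.
    return dictionary

stream = [("new", 1), ("home", 1), ("home", 2), ("prices", 2), ("new", 3)]
print(spimi_invert(stream))  # {'new': [1, 3], 'home': [1, 2], 'prices': [2]}
```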
Single-Pass In-Memory Indexing
SPIMI
Algorithm
Distributed Indexing
As data sizes increase, the indexing capabilities of a single machine become insufficient.
Distributed Indexing addresses this by distributing
the workload across multiple machines or nodes,
allowing for faster processing and larger index sizes.
• Map Phase: The dataset is split into
chunks and distributed to multiple nodes.
Each node extracts terms and creates
partial inverted indexes (or posting lists)
for the documents it processes.
• Reduce Phase: The partial indexes from
each node are combined into a single
comprehensive index. Terms are merged
across nodes to produce final posting lists
for each term in the vocabulary.
Distributed Indexing: MapReduce
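The two phases can be mimicked in a few lines of plain Python (the "nodes" here are just function calls on hypothetical chunks; a real deployment would run the mappers and reducers on a MapReduce framework):

```python
from collections import defaultdict

def map_phase(doc_chunk: dict[int, str]) -> list[tuple[str, int]]:
    """Mapper: emit (term, doc_id) pairs for one chunk of the collection."""
    return [(term, doc_id)
            for doc_id, text in doc_chunk.items()
            for term in set(text.lower().split())]

def reduce_phase(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Reducer: merge the pairs from every mapper into final postings lists."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        index[term].append(doc_id)
    return dict(index)

chunk_a = {1: "new home sales", 2: "home prices rise"}   # processed on node A
chunk_b = {3: "new car sales"}                           # processed on node B
print(reduce_phase(map_phase(chunk_a) + map_phase(chunk_b))["new"])  # [1, 3]
```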
Distributed Indexing: Architecture
Sharding Replication Coordination
Dynamic Indexing
In many real-world scenarios, new documents are added, and old documents may be
removed or modified frequently. Dynamic Indexing is designed to handle such changes
without needing to rebuild the entire index.
Approach to Updates:
• For new documents, a separate in-memory index is created, which contains the
latest updates.
• Periodically, the in-memory index is merged with the main index to maintain a
single, consistent index.
• When a document is removed, a "deletion marker" is added to the in-memory index
to signal that the document should no longer appear in search results. During
periodic merges, these markers are used to purge entries from the main index.
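A compact sketch of this auxiliary-index-plus-deletion-markers scheme (the class and method names are invented for illustration):

```python
class DynamicIndex:
    def __init__(self, main_index: dict[str, list[int]]):
        self.main = main_index   # term -> postings list of the main index
        self.aux = {}            # small in-memory index for newly added documents
        self.deleted = set()     # deletion markers

    def add(self, doc_id: int, text: str) -> None:
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id: int) -> None:
        self.deleted.add(doc_id)  # mark now; actual removal happens at merge time

    def search(self, term: str) -> list[int]:
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.deleted]

    def merge(self) -> None:
        for term, postings in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, []) + postings))
        self.main = {t: [d for d in p if d not in self.deleted] for t, p in self.main.items()}
        self.aux, self.deleted = {}, set()

idx = DynamicIndex({"sales": [1, 3]})
idx.add(7, "holiday sales"); idx.delete(3)
print(idx.search("sales"))  # [1, 7]
```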
Index Compression: Why?
Index compression is essential in Information Retrieval systems to:
• Reduce the storage size of indexes
• Optimize memory usage
• Speed up retrieval operations
By compressing both the dictionary and the postings files, IR systems can
handle large-scale datasets more efficiently.
Dictionary as an Array
The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
Problem: Using fixed-width entries for terms is clearly wasteful, as the average length of a term in English is only about eight characters.
Dictionary as a String
We can overcome the shortcomings of fixed-width entries by storing the dictionary terms as one long string of characters.
Blocking
We can further compress the dictionary by grouping terms in the string into blocks
of size k and keeping a term pointer only for the first term of each block.
Blocking: Tradeoff
By increasing the block size k, we get
better compression.
However, there is a tradeoff between
compression and the speed of term
lookup.
Front Coding
A common prefix is identified for a subsequence of the term list and then referred
to with a special character.
Minimal Perfect Hashing
Other schemes with even greater compression rely on minimal perfect hashing,
that is, a hash function that maps M terms onto [1, . . . , M] without collisions.
However, we cannot adapt perfect hashes incrementally because each new term
causes a collision and therefore requires the creation of a new perfect hash
function. Therefore, they cannot be used in a dynamic environment.
Traditional IR Models
Boolean Retrieval Model
The Boolean retrieval model is a model for information retrieval in which we can
pose any query in the form of a Boolean expression of terms.
• AND: All terms connected by AND must be present in the document for it to be
retrieved.
• OR: At least one of the terms connected by OR must be present in the
document.
• NOT: The term following NOT must not be present in the document.
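For AND queries over sorted postings lists, the classic linear-time intersection looks like this (the two postings lists below are made up):

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two sorted postings lists for an AND query."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# "Brutus AND Calpurnia" with hypothetical postings lists:
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```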
Vector Space Model
In the case of large document collections, the resulting number of matching
documents can far exceed the number a human user could possibly sift through.
Accordingly, it is essential for a search engine to rank-order the documents
matching a query. To do this, the search engine computes, for each matching
document, a score with respect to the query at hand.
The vector space model (VSM) achieves this by viewing each document as a vector of term weights; with this representation we can compute a score between a query and each document. This view is known as vector space scoring.
Term Weighting: tf-idf
Instead of using raw word counts, we compute a term weight for each document word. One of the most common term weighting schemes is tf-idf weighting, which is the product of two terms:
• Term frequency (tf): the number of occurrences of term t in document d
• Inverse document frequency (idf): the inverse of the document frequency, where document frequency df_t is the number of documents in the collection that contain term t
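One common variant of the weight can be computed as below (the log base and smoothing details vary between systems; the toy collection is invented):

```python
import math
from collections import Counter

def tfidf_weights(docs: dict[int, list[str]]) -> dict[int, dict[str, float]]:
    """tf-idf(t, d) = tf(t, d) * log10(N / df(t)) -- one common weighting variant."""
    N = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    return {doc_id: {t: tf * math.log10(N / df[t]) for t, tf in Counter(tokens).items()}
            for doc_id, tokens in docs.items()}

docs = {1: ["new", "home", "sales"], 2: ["home", "prices"], 3: ["new", "car", "sales"]}
print(round(tfidf_weights(docs)[1]["home"], 3))  # 1 * log10(3/2) ≈ 0.176
```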
Document Scoring
We may view each document as a vector
with one component corresponding to
each term in the dictionary, together with
a weight for each component. We can also
view a query as a vector.
We score document d by the cosine similarity of its vector with the query vector q.
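A small sketch of cosine scoring over sparse term-weight vectors (the weights below are made-up numbers):

```python
import math

def cosine(vec_q: dict[str, float], vec_d: dict[str, float]) -> float:
    dot = sum(w * vec_d.get(t, 0.0) for t, w in vec_q.items())
    norm_q = math.sqrt(sum(w * w for w in vec_q.values()))
    norm_d = math.sqrt(sum(w * w for w in vec_d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

query = {"home": 1.0, "sales": 1.0}
doc = {"new": 0.18, "home": 0.18, "sales": 0.18}
print(round(cosine(query, doc), 3))  # ≈ 0.816
```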
Probabilistic Models
In the Boolean or vector space models of IR, matching is done in a formally
defined but semantically imprecise calculus of index terms.
• Given only a query, an IR system has an uncertain understanding of
the information need.
• Given the query and document representations, a system has an
uncertain guess of whether a document has content relevant to the
information need.
Probability theory provides a principled foundation for such reasoning
under uncertainty. We can use probabilistic models to estimate how likely
it is that a document is relevant to an information need.
Probability Ranking Principle
The obvious order in which to present documents to the user is to rank
documents by their estimated probability of relevance with respect to the
information need:
This is the basis of the Probability Ranking Principle (PRP), shortly stated as:
An IR system should rank documents based on the probability that a document is
relevant to a query.
Binary Independence Model
The Binary Independence Model (BIM) is the model that has traditionally been
used with the PRP. It introduces some simple assumptions:
• Binary: Documents and queries are both represented as binary term incidence vectors, where each entry has the value 1 if the corresponding term is present in the document/query and 0 otherwise.
• Independence: Terms are modeled as occurring in documents independently.
The model recognizes no association between terms.
BIM: Relevance Probability
Under the BIM, we model the probability that a document is relevant in terms of term incidence vectors.
Using Bayes' rule, we have:
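Here \vec{x} denotes the term incidence vector of document d and \vec{q} the query vector; in the standard presentation this is:

```latex
P(R = 1 \mid \vec{x}, \vec{q}) =
  \frac{P(\vec{x} \mid R = 1, \vec{q})\, P(R = 1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
```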
BIM: Ranking Function
Given these relevance probabilities, we can rank documents by their odds of relevance:
We can further simplify this formula using the assumptions above; eventually we arrive at the Retrieval Status Value (RSV):
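In its usual textbook form, with p_t = P(x_t = 1 | R = 1, \vec{q}) and u_t = P(x_t = 1 | R = 0, \vec{q}), this is:

```latex
RSV_d = \sum_{t\,:\, x_t = q_t = 1} \log \frac{p_t\,(1 - u_t)}{u_t\,(1 - p_t)}
```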
Okapi BM25
The BIM was originally designed for short catalog records and abstracts of fairly
consistent length, and it works reasonably in these contexts, but for modern full-
text search collections, it seems clear that a model should pay attention to term
frequency and document length, as in VSM.
The BM25 weighting scheme, often called Okapi weighting, after the system in
which it was first implemented, was developed as a way of building a probabilistic
model sensitive to these quantities while not introducing too many additional
parameters into the model.
Okapi BM25: Ranking Function
The retrieval status value formula used in the BM25 weighting scheme is given below, after the list of tuning parameters:
• k1 for document term frequency scaling (k1 > 0)
• b for scaling by document length (0 ≤ b ≤ 1)
• k3 for term frequency scaling of the query (k3 > 0)
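With these parameters, a standard statement of the BM25 retrieval status value (where N is the collection size, df_t the document frequency of t, tf_{t,d} and tf_{t,q} the term frequencies in the document and the query, and L_d, L_ave the document length and average document length) is:

```latex
RSV_d = \sum_{t \in q} \log\!\frac{N}{\mathrm{df}_t}
        \cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
                   {k_1\bigl((1 - b) + b \cdot L_d / L_{\mathrm{ave}}\bigr) + \mathrm{tf}_{t,d}}
        \cdot \frac{(k_3 + 1)\,\mathrm{tf}_{t,q}}{k_3 + \mathrm{tf}_{t,q}}
```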
Modern Approaches in IR
Neural Ranking Models
• Traditional retrieval models obtain ranking
scores by searching for exact matches of words
in both the query and the document.
• However, this limits the availability of positional
and semantic information and may lead to
vocabulary mismatch.
• By contrast, neural ranking models construct
query-to-document relevance structures to learn
feature representations automatically.
Neural Ranking Models
• Representation-focused models
• Interaction-focused models
Representation-focused models
• Encode the query and the document separately.
• No interaction occurs between the query and the document during the encoding procedure.
• DSSM can predict the semantic similarity between sentences and obtain a lower-dimensional semantic vector representation of a sentence.
DSSM
DSSM Architecture
DSSM
• Word Hashing: Reduces the dimensionality of bag-of-words word vectors while retaining important features of words.
• DNN: Helps DSSM capture complex semantic relationships and improves accuracy in search and ranking.
• Semantic Transformation: Transforms both queries and documents into semantic vectors within a hidden space. The goal is for queries and documents with similar meanings to be close to each other as vectors.
DSSM
• Similarity Function: Measures the similarity between the semantic vectors of the query and the document. The closer the two vectors, the higher the relevance between the query and the document.
• Loss Function: Pushes unrelated query-document pairs farther apart and pulls related pairs closer together.
Interaction-focused models
• Evaluate the relationship between a query and a document based on their interaction.
• Capture more contextual information.
• K-NRM (kernel-based neural ranking model) uses a kernel pooling layer to calculate the relevance between the query and the words in the document.
K-NRM
K-NRM architecture
K-NRM
K-NRM functions
Pretrained Language Models
• Although all the methods
mentioned above improve
performance across various
information retrieval (IR)
applications, they do not perform
consistently well when trained on
small-scale data.
• Developing a better language
model is crucial for information
processing.
• Some pretrained language models are built on deep transformer architectures, such as GPT, BERT, etc.
BERT
BERT inputs are the sum of token embeddings, segment embeddings, and position embeddings.
BERT
• BERT addresses the limitations of ELMo and OpenAI GPT (both are either unidirectional or sequential and do not exploit contextual relationships across the entire input) by using the pretraining objectives MLM (Masked Language Model) and NSP (Next Sentence Prediction).
• 15% of WordPiece tokens are randomly selected for prediction.
• Each selected token has an 80% chance of being replaced with the [MASK]
token.
• There is a 10% chance that the token is replaced by a random token, and a 10%
chance it is kept as is.
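The selection-and-corruption step can be mimicked with a few lines of Python (a toy version; real implementations work on WordPiece token IDs and a full vocabulary):

```python
import random

MASK, VOCAB = "[MASK]", ["cat", "dog", "house", "tree", "car"]  # toy vocabulary

def mask_tokens(tokens: list[str], mask_prob: float = 0.15, seed: int = 0):
    """Pick ~15% of tokens; of those, 80% -> [MASK], 10% -> random token, 10% unchanged."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                      # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)  # random replacement
            # else: keep the token unchanged
    return corrupted, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))
```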
ColBERT
• ColBERT (Contextualized Late Interaction over BERT) is a model designed to
improve BERT's performance.
• ColBERT addresses several key challenges faced by traditional BERT models:
⚬ High computational cost.
⚬ Interaction latency.
⚬ Vector similarity search.
⚬ Scalability.
⚬ High performance.
ColBERT
ColBERT architecture
ColBERT
• ColBERT consists of three main components:
⚬ Query encoder.
⚬ Document encoder.
⚬ Late interaction mechanism.
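The late interaction step scores a document by summing, over the query token embeddings, the maximum similarity to any document token embedding (MaxSim). A sketch with random, L2-normalized vectors standing in for the BERT encoder outputs:

```python
import numpy as np

def late_interaction_score(q_emb: np.ndarray, d_emb: np.ndarray) -> float:
    """ColBERT-style MaxSim: sum over query tokens of the max similarity to document tokens."""
    sim = q_emb @ d_emb.T  # |q| x |d| matrix of dot products (cosine if vectors are normalized)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```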
Retrieval-Augmented Generation (RAG)
RAG is an approach that combines information retrieval
capabilities with the language generation abilities of large
language models (LLMs). It helps generate accurate, up-to-date
answers that closely align with factual information. By integrating
real-world data with the language skills of the model, RAG
minimizes "hallucination" errors, resulting in more reliable
responses.
Retrieval-Augmented Generation (RAG)
• Retrieve: Gather relevant information from external sources.
• Generate: Integrate the retrieved information into the LLM to produce accurate, context-rich responses.
Retrieval-Augmented Generation (RAG)
• Create the vector database.
• Receive the query.
• Retrieve information.
• Rank and filter information.
• Combine information.
• Generate text.
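A toy end-to-end sketch of the retrieve-then-generate flow (the bag-of-words "embeddings" and the prompt assembly are stand-ins; a real system would use a dense encoder, a vector database, and an actual LLM call):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # stand-in for a dense embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_answer(query: str, documents: list[str], top_k: int = 2) -> str:
    # Retrieve: rank the documents against the query and keep the top-k as context.
    ranked = sorted(documents, key=lambda d: cosine(embed(query), embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    # Generate: combine context and query into a prompt for the LLM.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return prompt  # a real system would return llm.generate(prompt)

docs = ["BSBI merges sorted blocks into one index.",
        "SPIMI builds postings lists in a single pass.",
        "BM25 scores documents using tf and length normalization."]
print(rag_answer("How does SPIMI build the index?", docs))
```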
Retrieval-Augmented Generation (RAG)
• Updated Information: Overcomes the
knowledge limitations of static language
models.
• Higher Accuracy: Reduces
misinformation.
• Cost Efficiency: Lowers the number of tokens processed by the LLM.
IR System Evaluation
Evaluation Metrics
• Online Metrics: These measure user interactions
when the system is live, such as the click-through
rate (CTR).
• Offline Metrics: These are assessed before
deployment, focusing on the relevance of
returned results. They are often split into:
⚬ Order-aware: Metrics where the order of
results affects the score, such as NDCG@K.
⚬ Order-unaware: Metrics where order doesn’t
impact the score, such as Recall@K.
Evaluation Metrics
• Recall@K: The proportion of relevant results
within the top K results.
• Mean Reciprocal Rank (MRR): The average, over queries, of the reciprocal rank of the first relevant result.
• Mean Average Precision@K (MAP@K): The mean of Average Precision@K across multiple queries.
• NDCG@K: Evaluates the quality of the ranked result list, taking the positions (and graded relevance) of results into account.
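Two of the simpler metrics can be computed directly (MRR would average the reciprocal rank over a set of queries; the result lists below are invented):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved, relevant = ["d3", "d1", "d7", "d2"], {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only d1 appears in the top 3)
print(reciprocal_rank(retrieved, relevant))   # 0.5 (first relevant result at rank 2)
```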
Applications
General Applications
• Digital Library: Manages and provides access to
digital materials (text, images, audio, video).
• Information Filtering System: Removes unwanted or redundant information; examples include spam filtering and recommendation systems.
• Media Search: Finds images, music, and videos
based on keywords and features.
• Search Engine: Searches across the web,
desktop, mobile, and social media to meet users'
information needs.
Domain applications of IR techniques
Geography, Chemistry, Legal, and
Software Engineering: Domain-specific
search to support specialized information
retrieval within professional fields.
Demo:
Simple Search Engine
using Vector Space Model

Editor's Notes

  • #15 We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon). Then for each term, we have a list that records which documents the term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting. The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.
  • #26 Sharding: The document collection is divided into “shards,” each of which is handled by a different node. Each shard contains a subset of documents and has its own index. Replication: To ensure fault tolerance and load balancing, shards are often replicated across multiple nodes. This way, if one node fails, another can take over its workload. Coordination: A master node (or coordinator) manages shard assignment and oversees the indexing and retrieval processes, ensuring smooth distribution and merging.