What is Information Retrieval?
Information retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).
Our goal is to develop a system to address the ad hoc retrieval task, which is
the most standard IR task. In it, a system aims to provide documents from
within the collection that are relevant to an arbitrary user information need,
communicated to the system by means of a one-off, user-initiated query.
Tokenization
Tokenization is the process of breaking down text into individual units, or tokens. This step is foundational, as it converts unstructured text into a structured format that can be easily analyzed.
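A minimal sketch of a regex-based tokenizer; the splitting pattern is an illustrative choice, not a standard:

import re

def tokenize(text: str) -> list[str]:
    # Split on runs of non-alphanumeric characters and drop empty pieces.
    return [tok for tok in re.split(r"[^\w]+", text) if tok]

print(tokenize("Information retrieval, or IR, finds relevant documents."))
# ['Information', 'retrieval', 'or', 'IR', 'finds', 'relevant', 'documents']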
Removing Stopwords
Stopwords are common, low-information words that frequently appear in text but often do not carry significant meaning. Removing stopwords reduces the amount of data for processing, focusing on more meaningful terms.
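A small sketch of stopword filtering; the stopword list here is a tiny illustrative sample, whereas real systems use larger curated lists:

STOPWORDS = {"a", "an", "the", "is", "of", "to", "and", "in", "that"}  # illustrative only

def remove_stopwords(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["The", "index", "is", "stored", "in", "memory"]))
# ['index', 'stored', 'memory']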
Normalization
Text normalization is the process of converting text to a consistent format, which helps to avoid variations that may lead to mismatches in analysis. This includes:
• Lowercasing: “Text”, “TEXT”, “text” -> “text”
• Removing punctuation and special characters (.,;:)
• Replacing contractions: “can’t” -> “cannot”
• Standardizing accents: “café” -> “cafe”
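A sketch covering the steps above with the standard library only; the contraction handling is a deliberately naive single substitution:

import re
import unicodedata

def normalize(text: str) -> str:
    text = text.lower()                                    # lowercasing
    text = text.replace("can't", "cannot")                 # naive contraction replacement
    text = unicodedata.normalize("NFKD", text)             # split accented characters
    text = "".join(c for c in text if not unicodedata.combining(c))  # "café" -> "cafe"
    return re.sub(r"[^\w\s]", "", text)                    # drop punctuation/special characters

print(normalize("Can't visit the Café, TODAY!"))  # cannot visit the cafe today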
Stemming and Lemmatization
• Stemming is the process of reducing words to their root or base form, often by removing common suffixes.
• Lemmatization reduces words to their lemma, or dictionary form, which maintains semantic meaning more accurately than stemming.
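A short comparison using NLTK's Porter stemmer and WordNet lemmatizer (assumes NLTK and its WordNet data are installed):

from nltk.stem import PorterStemmer, WordNetLemmatizer   # requires nltk.download("wordnet")

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["running", "studies", "better"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# Stemming may produce non-words ("studi"), while lemmatization returns dictionary forms.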
Inverted Index
We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon). Then for each term, we have a list that records which documents the term occurs in. Each item in the list, which records that a term appeared in a document (and, often, the positions in the document), is conventionally called a posting. The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.
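A minimal in-memory sketch of such an index over whitespace-tokenized documents:

from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    # Map each term to a sorted postings list of the doc IDs it occurs in.
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "new home sales", 2: "home prices rise", 3: "new sales forecast"}
print(build_inverted_index(docs)["new"])   # [1, 3] -- the postings list for "new"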
Index Construction
Index construction algorithms and techniques:
• Blocked Sort-Based Indexing
• Single-Pass In-Memory Indexing
• Distributed Indexing
• Dynamic Indexing
Blocked Sort-Based Indexing
Blocked Sort-Based Indexing (BSBI) is an indexing method that builds an inverted index by processing large datasets in manageable chunks or "blocks," which are then merged to create the final index.
This approach is particularly useful when indexing large datasets that do not
fit in memory.
Blocked Sort-Based Indexing
Indexing process:
• Documents are divided into fixed-size blocks (e.g., 100,000 documents
per block).
• Indexing Each Block:
⚬ Each block is read into memory, and a term-document mapping is
created for that block.
⚬ The terms are sorted within each block to construct a local inverted
index.
• The individual sorted blocks are written to disk. After all blocks have been
processed, a multi-way merge algorithm is used to combine them into
a single, comprehensive inverted index.
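A single-machine sketch of this flow, with blocks of a couple of documents, pickle files standing in for on-disk block storage, and heapq.merge as the multi-way merge:

import heapq, os, pickle, tempfile

def write_block(pairs):
    fd, path = tempfile.mkstemp(suffix=".blk")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(sorted(pairs), f)               # sort within the block before spilling
    return path

def bsbi_index(docs, docs_per_block=2):
    paths, block = [], []
    for i, (doc_id, text) in enumerate(docs, 1):
        block.extend((term, doc_id) for term in text.lower().split())
        if i % docs_per_block == 0:                 # block "full": write it to disk
            paths.append(write_block(block)); block = []
    if block:
        paths.append(write_block(block))
    blocks = []
    for p in paths:                                 # a real system would stream from disk
        with open(p, "rb") as f:
            blocks.append(pickle.load(f))
        os.remove(p)
    index = {}
    for term, doc_id in heapq.merge(*blocks):       # multi-way merge of sorted blocks
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index

docs = [(1, "new home sales"), (2, "home prices rise"), (3, "new sales forecast")]
print(bsbi_index(docs)["sales"])                    # [1, 3]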
Single-Pass In-Memory Indexing
Single-Pass In-Memory Indexing (SPIMI) is an alternative to BSBI designed for efficiency by avoiding repeated sorting operations. Instead, SPIMI builds each block of the index in a single pass, reducing the memory needed for term storage.
Single-Pass In-Memory Indexing
SPIMI indexing process:
• Unlike BSBI, SPIMI processes each term immediately. Terms are not stored for later sorting; they are indexed as they appear in documents.
• Building posting lists:
⚬ As terms are encountered, SPIMI creates a posting list for each unique
term.
⚬ If the term already exists, SPIMI simply updates the list with the new
document information.
• Once the memory is full, the block (containing all posting lists for terms
processed so far) is written to disk, and memory is cleared for the next block.
• Similar to BSBI, the individual blocks are merged to create the final inverted
index.
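A sketch of the SPIMI flow, with "memory full" simulated by a small cap on the number of postings per block:

from collections import defaultdict

def spimi_blocks(docs, max_postings_per_block=4):
    block, count = defaultdict(list), 0
    for doc_id, text in docs:
        for term in text.lower().split():
            postings = block[term]                  # posting list created on first sight
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)             # update an existing term's list
                count += 1
            if count >= max_postings_per_block:     # "memory full": flush the block
                yield dict(sorted(block.items()))
                block, count = defaultdict(list), 0
    if block:
        yield dict(sorted(block.items()))

def merge_blocks(blocks):
    index = defaultdict(list)
    for blk in blocks:                              # merge per-term postings across blocks
        for term, postings in blk.items():
            index[term].extend(p for p in postings if p not in index[term])
    return dict(index)

docs = [(1, "new home sales"), (2, "home prices rise"), (3, "new sales forecast")]
print(merge_blocks(spimi_blocks(docs))["home"])     # [1, 2]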
Distributed Indexing
As data sizes increase, single-machine indexing becomes insufficient. Distributed indexing addresses this by distributing the workload across multiple machines or nodes, allowing for faster processing and larger index sizes.
• Sharding: The document collection is divided into "shards," each of which is handled by a different node. Each shard contains a subset of documents and has its own index.
• Replication: To ensure fault tolerance and load balancing, shards are often replicated across multiple nodes. This way, if one node fails, another can take over its workload.
• Coordination: A master node (or coordinator) manages shard assignment and oversees the indexing and retrieval processes, ensuring smooth distribution and merging.
Distributed Indexing: MapReduce
• Map Phase: The dataset is split into chunks and distributed to multiple nodes. Each node extracts terms and creates partial inverted indexes (posting lists) for the documents it processes.
• Reduce Phase: The partial indexes from each node are combined into a single comprehensive index. Terms are merged across nodes to produce the final postings list for each term in the vocabulary.
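A single-process sketch of the map/reduce idea; in a real deployment each call would run on a separate node under a framework such as MapReduce or Spark:

from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each "node" emits (term, doc_id) pairs for its chunk of documents.
    return [(term, doc_id) for doc_id, text in chunk for term in text.lower().split()]

def reduce_phase(emitted_pairs):
    # Merge all partial results into a final postings list per term.
    index = defaultdict(set)
    for term, doc_id in emitted_pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

chunks = [[(1, "new home sales")], [(2, "home prices rise"), (3, "new sales forecast")]]
partials = [map_phase(c) for c in chunks]                     # map: one result per chunk
print(reduce_phase(chain.from_iterable(partials))["new"])     # [1, 3]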
Dynamic Indexing
In many real-world scenarios, new documents are added, and old documents may be removed or modified frequently. Dynamic Indexing is designed to handle such changes without needing to rebuild the entire index.
Approach to Updates:
• For new documents, a separate in-memory index is created, which contains the
latest updates.
• Periodically, the in-memory index is merged with the main index to maintain a
single, consistent index.
• When a document is removed, a "deletion marker" is added to the in-memory index
to signal that the document should no longer appear in search results. During
periodic merges, these markers are used to purge entries from the main index.
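A sketch of this update scheme with an in-memory auxiliary index and deletion markers (the class and method names are illustrative):

class DynamicIndex:
    def __init__(self):
        self.main = {}          # term -> postings list (on disk in a real system)
        self.aux = {}           # in-memory index holding the latest additions
        self.deleted = set()    # doc IDs marked with a deletion marker

    def add_document(self, doc_id, text):
        for term in text.lower().split():
            self.aux.setdefault(term, []).append(doc_id)

    def delete_document(self, doc_id):
        self.deleted.add(doc_id)                    # purged from the main index at merge time

    def search(self, term):
        postings = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in postings if d not in self.deleted]

    def merge(self):
        for term, postings in self.aux.items():
            self.main.setdefault(term, []).extend(postings)
        self.main = {t: sorted(set(d for d in p if d not in self.deleted))
                     for t, p in self.main.items()}
        self.aux, self.deleted = {}, set()

idx = DynamicIndex()
idx.add_document(1, "new home sales"); idx.add_document(2, "home prices rise")
idx.delete_document(1)
print(idx.search("home"))   # [2] -- doc 1 is filtered out even before the periodic merge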
Index Compression: Why?
Index compression is essential in Information Retrieval systems to:
• Reduce the storage size of indexes
• Optimize memory usage
• Speed up retrieval operations
By compressing both the dictionary and the postings files, IR systems can
handle large-scale datasets more efficiently.
Dictionary as an Array
The simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries.
Problem: Using fixed-width entries for terms is clearly wasteful, as the average length of a term in English is just about eight characters.
Dictionary as a String
We can overcome the shortcomings of fixed-width entries by storing the dictionary terms as one long string of characters.
Blocking
We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block.
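A sketch of the string layout with blocking (block size k = 3): each term is stored length-prefixed, and a pointer is kept only to the first term of each block. A real implementation would use fixed-width byte fields for the lengths; the decimal prefixes here are only for readability:

def build_blocked_dictionary(sorted_terms, k=3):
    big_string, block_pointers = "", []
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            block_pointers.append(len(big_string))  # term pointer for the block's first term
        big_string += f"{len(term)}{term}"          # length-prefixed term
    return big_string, block_pointers

terms = ["auto", "automata", "automatic", "automation", "bar", "barn"]
s, ptrs = build_blocked_dictionary(terms)
print(s)      # 4auto8automata9automatic10automation3bar4barn
print(ptrs)   # [0, 24]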
Blocking: Tradeoff
By increasing the block size k, we get better compression.
However, there is a tradeoff between compression and the speed of term lookup.
Front Coding
A common prefix is identified for a subsequence of the term list and then referred to with a special character.
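A sketch of front coding for one block of terms, with "*" marking the end of the shared prefix and "|" introducing each remaining suffix (the exact marker characters vary between presentations):

import os

def front_code(block_terms):
    prefix = os.path.commonprefix(block_terms)
    encoded = f"{len(block_terms[0])}{prefix}*{block_terms[0][len(prefix):]}"
    for term in block_terms[1:]:
        encoded += f"{len(term) - len(prefix)}|{term[len(prefix):]}"
    return encoded

print(front_code(["automata", "automate", "automatic"]))   # 8automat*a1|e2|ic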
Minimal Perfect Hashing
Other schemes with even greater compression rely on minimal perfect hashing, that is, a hash function that maps M terms onto [1, ..., M] without collisions. However, we cannot adapt perfect hashes incrementally, because each new term causes a collision and therefore requires the creation of a new perfect hash function. Therefore, they cannot be used in a dynamic environment.
Boolean Retrieval Model
The Boolean retrieval model is a model for information retrieval in which we can pose any query in the form of a Boolean expression of terms.
• AND: All terms connected by AND must be present in the document for it to be
retrieved.
• OR: At least one of the terms connected by OR must be present in the
document.
• NOT: The term following NOT must not be present in the document.
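A sketch of evaluating such queries with set operations over postings lists (the boolean_query helper and its arguments are illustrative, not a standard API):

def boolean_query(index, all_doc_ids, must=(), should=(), must_not=()):
    results = set(all_doc_ids)
    for term in must:                               # AND terms: intersect postings
        results &= set(index.get(term, []))
    if should:                                      # OR terms: union of postings
        results &= set().union(*(set(index.get(t, [])) for t in should))
    for term in must_not:                           # NOT terms: remove postings
        results -= set(index.get(term, []))
    return sorted(results)

index = {"brutus": [1, 2, 4], "caesar": [1, 2, 3, 5], "calpurnia": [2]}
print(boolean_query(index, range(1, 6), must=["brutus", "caesar"], must_not=["calpurnia"]))
# [1] -- i.e. "brutus AND caesar AND NOT calpurnia"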
Vector Space Model
In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through. Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand.
The vector space model (VSM) achieves this by viewing each document as a vector of term weights; a score between the query and each document can then be computed from these vectors. This view is known as vector space scoring.
Term Weighting: tf-idf
Instead of using raw word counts, we compute a term weight for each document word. One of the most common term weighting schemes is tf-idf weighting, which is the product of two terms:
• Term frequency (tf): the number of occurrences of term t in document d.
• Inverse document frequency (idf): the inverse of the document frequency, which is the number of documents in the collection that contain the term t.
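In the usual notation, with N the number of documents in the collection and df_t the document frequency of term t, the weight is:

\[
\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t},
\qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t
\]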
Document Scoring
We may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component. We can also view a query as a vector.
We score document d by the cosine similarity of its vector d with the query vector q.
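In the standard formulation, with \vec{V}(q) and \vec{V}(d) the weighted term vectors of the query and the document:

\[
\mathrm{score}(q, d) = \cos\big(\vec{V}(q), \vec{V}(d)\big)
= \frac{\vec{V}(q) \cdot \vec{V}(d)}{\lvert \vec{V}(q) \rvert \, \lvert \vec{V}(d) \rvert}
\]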
Probabilistic Models
In the Boolean or vector space models of IR, matching is done in a formally defined but semantically imprecise calculus of index terms.
• Given only a query, an IR system has an uncertain understanding of
the information need.
• Given the query and document representations, a system has an
uncertain guess of whether a document has content relevant to the
information need.
Probability theory provides a principled foundation for such reasoning
under uncertainty. We can use probabilistic models to estimate how likely
it is that a document is relevant to an information need.
Probability Ranking Principle
The obvious order in which to present documents to the user is to rank documents by their estimated probability of relevance with respect to the information need, P(R = 1 | d, q).
This is the basis of the Probability Ranking Principle (PRP), stated briefly as:
An IR system should rank documents based on the probability that a document is relevant to a query.
Binary Independence Model
The Binary Independence Model (BIM) is the model that has traditionally been used with the PRP. It introduces some simple assumptions:
• Binary: Documents and queries are both represented as binary term incidence vectors, where each entry is 1 if the corresponding term is present in the document/query and 0 otherwise.
• Independence: Terms are modeled as occurring in documents independently. The model recognizes no association between terms.
BIM: Relevance Probability
Under the BIM, we model the probability P(R | d, q) that a document is relevant via the probability in terms of term incidence vectors, P(R | x, q). Applying Bayes' rule, we have:
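Written out in the usual notation, with \vec{x} the document's term incidence vector and \vec{q} the query's:

\[
P(R = 1 \mid \vec{x}, \vec{q}) = \frac{P(\vec{x} \mid R = 1, \vec{q}) \, P(R = 1 \mid \vec{q})}{P(\vec{x} \mid \vec{q})}
\]

and analogously for P(R = 0 | \vec{x}, \vec{q}).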
BIM: Ranking Function
Given these relevance probabilities, we can rank documents by their odds of relevance. We can further simplify this formula using the established assumptions; eventually, we arrive at the Retrieval Status Value (RSV):
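In the standard presentation, the odds and the resulting RSV (with p_t = P(x_t = 1 | R = 1, \vec{q}) and u_t = P(x_t = 1 | R = 0, \vec{q})) are:

\[
O(R \mid \vec{x}, \vec{q}) = \frac{P(R = 1 \mid \vec{x}, \vec{q})}{P(R = 0 \mid \vec{x}, \vec{q})},
\qquad
\mathrm{RSV}_d = \sum_{t : \, x_t = q_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)}
\]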
Okapi BM25
The BIM was originally designed for short catalog records and abstracts of fairly consistent length, and it works reasonably well in these contexts. For modern full-text search collections, however, it seems clear that a model should pay attention to term frequency and document length, as in the VSM.
The BM25 weighting scheme, often called Okapi weighting, after the system in
which it was first implemented, was developed as a way of building a probabilistic
model sensitive to these quantities while not introducing too many additional
parameters into the model.
Okapi BM25: Ranking Function
The retrieval status value formula used in the BM25 weighting scheme is:
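In a common statement of the scheme (the long-query variant that includes k3), with L_d the document length and L_ave the average document length in the collection:

\[
\mathrm{RSV}_d = \sum_{t \in q} \log\!\left[\frac{N}{\mathrm{df}_t}\right]
\cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}{k_1\big((1 - b) + b \cdot (L_d / L_{\mathrm{ave}})\big) + \mathrm{tf}_{t,d}}
\cdot \frac{(k_3 + 1)\,\mathrm{tf}_{t,q}}{k_3 + \mathrm{tf}_{t,q}}
\]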
Tuning parameters:
• k1 for document term frequency scaling (k1 > 0)
• b for scaling by document length (0 ≤ b ≤ 1)
• k3 for term frequency scaling of the query (k3 > 0)
Neural Ranking Models
• Traditional retrieval models obtain ranking scores by searching for exact matches of words in both the query and the document.
• However, this limits the availability of positional and semantic information and may lead to vocabulary mismatch.
• By contrast, neural ranking models construct query-to-document relevance structures to learn feature representations automatically.
Representation-focused models
• Encode the query and the document separately; no interaction occurs between the query and the document during the encoding procedure.
• DSSM can predict the semantic similarity between sentences and obtain a lower-dimensional semantic vector representation of a sentence.
DSSM
• Word Hashing: Reduces the dimensionality of word vectors from bag-of-words while retaining important features of words.
• DNN: Helps DSSM capture complex semantic relationships and improves accuracy in search and ranking.
• Semantic Transformation: Transforms both queries and documents into semantic vectors within a hidden space. The goal is for queries and documents to be closer as vectors when they have similar meanings.
DSSM
• Similarity Function: Measures the similarity between the semantic vectors of the query and the document. The closer the two vectors, the higher the relevance between the query and the document.
• Loss Function: Pushes unrelated query-document pairs farther apart and pulls related pairs closer together.
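A toy sketch of this setup: cosine similarity between semantic vectors and a softmax-style loss over one relevant and several non-relevant documents (gamma is a smoothing factor; the vectors stand in for encoder outputs):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dssm_loss(query_vec, pos_doc_vec, neg_doc_vecs, gamma=10.0):
    sims = [cosine(query_vec, pos_doc_vec)] + [cosine(query_vec, d) for d in neg_doc_vecs]
    exp_sims = [math.exp(gamma * s) for s in sims]
    return -math.log(exp_sims[0] / sum(exp_sims))   # negative log-likelihood of the relevant doc

q, d_pos, d_neg = [0.9, 0.1], [0.8, 0.2], [[0.1, 0.95]]
print(round(dssm_loss(q, d_pos, d_neg), 3))         # small loss: the related pair is already close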
Interaction-focused models
• Evaluate the relationship between a query and a document based on their interaction.
• Capture more contextual information.
• K-NRM (Kernel-based Neural Ranking Model) uses a kernel pooling layer to calculate the relevance between the query and the words in the document.
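A sketch of kernel pooling as used in K-NRM: RBF kernels turn each query term's row of the query-document similarity matrix into soft match counts, which are then log-summed into ranking features (the kernel means and width here are illustrative):

import math

def kernel_pooling(sim_matrix, mus=(-0.5, 0.0, 0.5, 1.0), sigma=0.1):
    features = [0.0] * len(mus)
    for row in sim_matrix:                          # one row per query term
        for k, mu in enumerate(mus):
            soft_count = sum(math.exp(-(s - mu) ** 2 / (2 * sigma ** 2)) for s in row)
            features[k] += math.log(max(soft_count, 1e-10))
    return features                                 # fed into a final scoring layer

sim_matrix = [[0.9, 0.1, -0.2], [0.4, 0.5, 0.0]]    # 2 query terms x 3 document terms
print([round(f, 2) for f in kernel_pooling(sim_matrix)])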
Pretrained Language Models
• Although all the methods mentioned above improve performance across various information retrieval (IR) applications, they do not perform consistently well when trained on small-scale data.
• Developing a better language model is crucial for information processing.
• Some pretrained language models focus on deep transformer architectures, such as GPT, BERT, etc.
BERT
• BERT addresses the limitations of ELMo and OpenAI GPT (both are either unidirectional or sequential, not exploiting contextual relationships across all text inputs) by using pretraining objectives such as MLM (Masked Language Model) and NSP (Next Sentence Prediction).
• 15% of WordPiece tokens are randomly selected for prediction.
• Each selected token has an 80% chance of being replaced with the [MASK] token.
• There is a 10% chance that the token is replaced by a random token, and a 10% chance it is kept as is.
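A sketch of the token-selection step described above (the vocabulary and selection probability are shown explicitly; a real implementation operates on WordPiece IDs):

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:           # ~15% of tokens are selected
            targets[i] = tok                        # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"                # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)    # 10%: replace with a random token
            # else: 10%: keep the token unchanged
    return masked, targets

print(mask_tokens(["information", "retrieval", "with", "bert"], vocab=["apple", "query", "rank"]))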
ColBERT
• ColBERT (Contextualized Late Interaction over BERT) is a ranking model designed to make BERT-based retrieval efficient and scalable.
• ColBERT addresses several key challenges faced by traditional BERT rankers:
⚬ High computational cost.
⚬ Interaction latency.
⚬ Vector similarity search.
⚬ Scalability.
⚬ Maintaining high performance.
ColBERT
• ColBERT consists of three main components:
⚬ Query encoder.
⚬ Document encoder.
⚬ Late interaction mechanism.
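A sketch of the late interaction step, assuming the two encoders have already produced per-token embeddings: each query token is matched to its most similar document token, and those maxima are summed (often called MaxSim scoring):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def late_interaction_score(query_embs, doc_embs):
    return sum(max(cosine(q, d) for d in doc_embs) for q in query_embs)

query_embs = [[1.0, 0.0], [0.6, 0.8]]               # per-token query embeddings (toy values)
doc_embs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]     # per-token document embeddings
print(round(late_interaction_score(query_embs, doc_embs), 3))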
Retrieval-Augmented Generation (RAG)
RAG is an approach that combines information retrieval capabilities with the language generation abilities of large language models (LLMs). It helps generate accurate, up-to-date answers that closely align with factual information. By integrating real-world data with the language skills of the model, RAG minimizes "hallucination" errors, resulting in more reliable responses.
Retrieval-Augmented Generation (RAG)
• Create the vector database.
• Receive the query.
• Retrieve information.
• Rank and filter information.
• Combine information.
• Generate text.
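A toy end-to-end sketch of these steps; embed(), the vector database, and generate() are placeholders standing in for a real embedding model, vector store, and LLM:

def embed(text):                          # placeholder embedding: bag-of-words set
    return set(text.lower().split())

def retrieve(query, vector_db, k=2):      # "vector database" = list of (doc, embedding)
    scored = [(len(embed(query) & emb), doc) for doc, emb in vector_db]
    return [doc for score, doc in sorted(scored, reverse=True)[:k] if score > 0]

def generate(prompt):                     # placeholder for an LLM call
    return f"LLM answer grounded in: {prompt}"

docs = ["BM25 is a probabilistic ranking function.", "ColBERT uses late interaction."]
vector_db = [(d, embed(d)) for d in docs]                      # 1. create the vector database
query = "What kind of ranking function is BM25?"               # 2. receive the query
context = retrieve(query, vector_db)                           # 3-4. retrieve, rank and filter
prompt = f"Context: {' '.join(context)}\nQuestion: {query}"    # 5. combine information
print(generate(prompt))                                        # 6. generate text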
Retrieval-Augmented Generation (RAG)
• Updated Information: Overcomes the knowledge limitations of static language models.
• Higher Accuracy: Reduces misinformation.
• Cost Efficiency: Lowers the number of tokens processed by the LLM.
Evaluation Metrics
• Online Metrics: These measure user interactions when the system is live, such as the click-through rate (CTR).
• Offline Metrics: These are assessed before deployment, focusing on the relevance of returned results. They are often split into:
⚬ Order-aware: Metrics where the order of results affects the score, such as NDCG@K.
⚬ Order-unaware: Metrics where order doesn't impact the score, such as Recall@K.
Evaluation Metrics
• Recall@K: The proportion of relevant results within the top K results.
• Mean Reciprocal Rank (MRR): Measures the rank of the first relevant result.
• Mean Average Precision@K (MAP@K): Average precision across multiple queries.
• NDCG@K: Evaluates the quality of the ranked result list, rewarding relevant results that appear higher in the ranking.
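Two of these metrics are small enough to show directly; Recall@K is order-unaware, while MRR depends on the rank of the first relevant result:

def recall_at_k(ranked, relevant, k):
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        total += next((1.0 / (i + 1) for i, doc in enumerate(ranked) if doc in relevant), 0.0)
    return total / len(ranked_lists)

print(recall_at_k([3, 7, 1, 9], relevant={1, 2}, k=3))          # 0.5
print(mean_reciprocal_rank([[3, 7, 1], [5, 2]], [{1}, {2}]))    # (1/3 + 1/2) / 2 ≈ 0.417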
General Application
• Digital Library: Manages and provides access to digital materials (text, images, audio, video).
• Information Filtering System: Removes redundant or unwanted information; examples include spam filtering and recommendation systems.
• Media Search: Finds images, music, and videos based on keywords and features.
• Search Engine: Searches across the web, desktop, mobile, and social media to meet users' information needs.
Domain applications of IR techniques
Geography, Chemistry, Legal, and Software Engineering: domain-specific search to support specialized information retrieval within professional fields.