Chapter 6 : Query Languages
Adama Science and Technology University
School of Electrical Engineering and
Computing
Department of CSE
Kibrom T 2023
Keyword-based Querying
 In the field of information retrieval (IR), query languages play a
crucial role in facilitating effective and efficient searching of
information.
 A query language allows users to express their information needs in an
organized manner, enabling the retrieval system to understand and
process those queries to provide relevant results.
 In the context of information retrieval, a query refers to a request or a
question posed to a system in order to retrieve specific information
that matches the specified criteria. It is a way for users to express their
information needs and retrieve relevant results from a search engine or
IR system.
Keyword-based Querying
 Queries are combinations of words.
 The document collection is searched for documents that contain
these words.
 Word queries are intuitive, easy to express and provide fast
ranking.
 The concept of word must be defined:
 A word is a sequence of letters terminated by a separator (period,
comma, space, etc).
 Definition of letter and separator is flexible; e.g., hyphen could be
defined as a letter or as a separator.
 Usually, common words (such as “a”, “the”, “of”, …) are ignored.
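The word definition above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tokenizer: the `tokenize` function, the `hyphen_is_letter` flag, and the short `STOP_WORDS` list are all assumptions made for the example.

```python
import re

# Hypothetical stop list; real systems use longer, language-specific lists.
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in"}

def tokenize(text, hyphen_is_letter=True):
    """Split text into lowercase words and drop common (stop) words."""
    if hyphen_is_letter:
        # The hyphen counts as a letter, so "state-of-the-art" is one word.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    else:
        # The hyphen counts as a separator.
        pattern = r"[A-Za-z0-9]+"
    words = [w.lower() for w in re.findall(pattern, text)]
    return [w for w in words if w not in STOP_WORDS]

sentence = "The state-of-the-art in retrieval, of course."
print(tokenize(sentence))
print(tokenize(sentence, hyphen_is_letter=False))
```

Switching the flag shows how the choice of separators changes which words end up in the index and, therefore, which queries can match.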
Single-word Queries
 A query is a single word.
 Single-word queries refer to queries that consist of a single
keyword or term.
 Usually used for searching in document images.
 Simplest form of query.
 What are the possible documents retrieved as relevant?
 All documents that include this word are retrieved.
 Examples of single-word queries: Football, apple ….
 On what basis are documents ranked?
 Documents may be ranked by the frequency of the query word in
the document.
 Documents with more occurrences of the query word are given
higher priority.
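Frequency-based ranking for a single-word query can be sketched as follows. The function name `rank_single_word` and the tiny document collection are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def rank_single_word(query, docs):
    """Return (doc_id, frequency) pairs for documents containing the word,
    sorted by decreasing frequency of the word."""
    word = query.lower()
    hits = []
    for doc_id, text in docs.items():
        freq = text.lower().split().count(word)
        if freq > 0:
            hits.append((doc_id, freq))
    # Higher frequency first; ties are broken by document id.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

docs = {
    "d1": "apple pie with apple sauce",
    "d2": "football season opens",
    "d3": "an apple a day",
}
print(rank_single_word("apple", docs))
```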
Single-word Queries
 What is the difference between keyword-based querying and
single-word queries? Explain with an example.
 Keyword-based querying and single-word queries are related
concepts but differ in terms of the level of specificity and
complexity.
 Keyword-based querying involves searching for documents or
information based on specific keywords or terms.
 On the other hand, single-word queries are queries that consist
of a single keyword or term.
 Keyword-based Query: "Healthy recipe for vegetarian lasagna"
 Single-word Query: "Lasagna"
Phrase Queries
 A query is a sequence of words treated as a single unit. Also
called “literal string” or “exact phrase” query.
 Phrase is usually surrounded by quotation marks.
 All documents that include this phrase are retrieved.
 Usually, separators (commas, colons, ...) & common words (“a”,
“the”, “of”, “for”…) in the phrase are ignored.
 In effect, this query is for a set of words that must appear in
sequence.
 Allows users to specify a context and thus gain precision.
 Ex.: “Information Processing for Document Retrieval”.
 What are the possible documents retrieved as relevant?
 All documents that include the phrase are retrieved.
 On what basis are documents ranked?
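Exact-phrase matching as described on this slide can be sketched as a search for a contiguous run of words. The helper names `contains_phrase` and `phrase_search` and the sample documents are assumptions made for illustration; a production system would typically use a positional index rather than scanning each document.

```python
def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))

def phrase_search(phrase, docs):
    """Return ids of documents that contain the words of the phrase in order."""
    phrase_tokens = phrase.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if contains_phrase(text.lower().split(), phrase_tokens)]

docs = {
    "d1": "advances in artificial intelligence research",
    "d2": "intelligence tests and artificial sweeteners",
    "d3": "artificial general intelligence",
}
print(phrase_search("artificial intelligence", docs))
```

Documents d2 and d3 contain both words but not next to each other in the required order, so only d1 is returned.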
Phrase Queries
• Unlike single-word queries, which retrieve documents containing an
individual keyword, phrase queries find exact occurrences of a
particular sequence of words.
• The search system looks for documents that include the exact
phrase in the specified order. For example:
• "artificial intelligence"
• In this example, the phrase query "artificial intelligence" is specified
within quotation marks. The search system will retrieve documents or
information where the exact phrase "artificial intelligence" appears,
rather than documents containing the individual words "artificial" and
"intelligence" separately.
Multiple-word Queries
 A query is a set of words (or phrases).
 Ex.: What is the result for the query “Data Mining and Intelligent
Database Design”?
 What are the possible documents retrieved as relevant?
 Two options: A document is retrieved if it includes:
 Any of the query words, or
 each of the query words.
 What is the difference between multiple-word and phrase
queries?
Multiple-word Queries
 On what basis are documents ranked according to the best-matching
principle?
 Documents are ranked by the number of query words they contain.
 A document containing n query words is ranked higher than a document
containing m < n query words.
 This implies that, all else being equal, documents that contain a larger
number of query words are considered more relevant and receive a
higher ranking.
 Documents are ranked in decreasing order:
 Those containing all the query words are ranked at the top; those
containing only one query word are ranked at the bottom.
 Frequency counts may be used to break ties among documents that
contain the same query words.
 What is the difference between multiple-word and phrase
queries?
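The ranking rule for multiple-word queries on this slide can be sketched as below. The function `rank_multi_word`, its `require_all` switch (the "any" versus "each" options), and the sample collection are hypothetical.

```python
def rank_multi_word(query_words, docs, require_all=False):
    """Rank documents by the number of distinct query words they contain,
    breaking ties by the total frequency of the matched words."""
    query = [w.lower() for w in query_words]
    ranked = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        matched = [w for w in query if w in tokens]
        if not matched or (require_all and len(matched) < len(query)):
            continue
        total_freq = sum(tokens.count(w) for w in matched)
        ranked.append((doc_id, len(matched), total_freq))
    # More matched words first, then higher frequency, then document id.
    return sorted(ranked, key=lambda r: (-r[1], -r[2], r[0]))

docs = {
    "d1": "data mining and database design",
    "d2": "mining data from data streams",
    "d3": "intelligent database systems",
}
query = ["data", "mining", "database"]
print(rank_multi_word(query, docs))                    # any of the words
print(rank_multi_word(query, docs, require_all=True))  # each of the words
```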
Multiple-word Queries
 Phrase queries and multiple-word queries are similar in that they both
involve searching for specific combinations of words or terms.
However, there are some key differences between the two:
 Matching Criteria:
 In phrase queries, the search system looks for the exact occurrence of a
specific phrase. The words in the phrase must appear in the specified
order for a match to be found.
 In multiple-word queries, the search system looks for documents that
contain any combination of the specified words or terms. The order of the
words is not strictly enforced, and they can appear in any order within the
document.
 .
10
11
Boolean Queries
 Queries are formulated based on concepts from logic: AND, OR,
NOT.
 It describes the information needed by relating multiple words with
Boolean operators.
 Semantics: For each query word w a corresponding set Dw is
constructed that includes the documents that contain w.
 The Boolean expression is then interpreted as an expression on
the corresponding document sets with corresponding set
operators:
 AND: Finds only documents containing all of the specified words
or phrases.
 OR: Finds documents containing at least one of the specified words
or phrases.
 NOT: Excludes documents containing the specified word or
phrase.
Examples: Boolean Queries
 1. Computer OR server
 Finds documents containing either computer, server or both.
 2. (computer OR server) NOT mainframe
 Select all documents that discuss computers or servers, but do not
select any documents that discuss mainframes.
 3. Computer NOT (server OR mainframe)
 Select all documents that discuss computers, and do not discuss
either servers or mainframes.
 4. Computer OR server NOT mainframe
 Select all documents that discuss computers, or documents that
discuss servers but do not discuss mainframes.
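The set semantics of the four example queries can be checked with Python sets. The function `D` mirrors the document set Dw from the earlier slide; the six-document collection is invented for illustration, and query 4 is evaluated under the reading given above, where NOT applies only to "server".

```python
docs = {
    1: "computer networks",
    2: "server and mainframe maintenance",
    3: "computer and server hardware",
    4: "mainframe history",
    5: "server administration",
    6: "computer mainframe comparison",
}

def D(word):
    """Set of ids of the documents that contain the word."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}

# 1. computer OR server
q1 = D("computer") | D("server")
# 2. (computer OR server) NOT mainframe
q2 = (D("computer") | D("server")) - D("mainframe")
# 3. computer NOT (server OR mainframe)
q3 = D("computer") - (D("server") | D("mainframe"))
# 4. computer OR server NOT mainframe, with NOT binding tighter than OR
q4 = D("computer") | (D("server") - D("mainframe"))

for name, result in [("q1", q1), ("q2", q2), ("q3", q3), ("q4", q4)]:
    print(name, sorted(result))
```

Document 6 shows why the parentheses matter: it is excluded by query 2 but retrieved by query 4.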
Weighted Queries
 Weighted queries, also known as term-weighted queries, are a type of
query where each term or keyword is assigned a weight or
importance. These weights indicate the relative significance or
relevance of each term in the query.
 The search system uses these weights to rank the search results and
retrieve documents that align more closely with the user's information
needs.
 Here's an example to illustrate the concept:
Weighted Query: apple^3 OR banana^2 OR orange. The query consists
of three terms: "apple," "banana," and "orange." A weight is written after
the caret (^); "orange" carries no explicit weight, which in this notation is
usually read as a default weight of 1. The weight reflects the relative
importance or relevance of each term.
 The ranking of a document is the sum of the weights for the query words
that it satisfies.
Weighted Queries
 Each of the words is assigned a different weight, expressing the
relative importance of the word within the query.
 A query is then a set of word-weight pairs:
(q1, w1), (q2, w2), …, (qn, wn).
 The ranking of a document is the sum of the weights for the
query words that it satisfies.
 Example: given the query (A, 0.8), (B, 0.9), (C, 0.3) and
 Document 1: (A, B, D) and Document 2: (A, C, D), which
document is ranked first?
 Score of Document 1: 0.8 (for term A) + 0.9 (for B) + 0 (for C) = 1.7
 Score of Document 2: 0.8 (for term A) + 0 (for B) + 0.3 (for C) = 1.1
Each document includes two words from the query, but Document 1 is
ranked higher because it includes more important words.
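The scoring rule in this example is a short sum over word-weight pairs. The sketch below reuses the numbers from the slide; the function name `weighted_score` is an assumption.

```python
def weighted_score(query, doc_terms):
    """Sum the weights of the query words that occur in the document."""
    return sum(weight for term, weight in query if term in doc_terms)

query = [("A", 0.8), ("B", 0.9), ("C", 0.3)]
doc1 = {"A", "B", "D"}
doc2 = {"A", "C", "D"}

print(round(weighted_score(query, doc1), 2))  # 1.7
print(round(weighted_score(query, doc2), 2))  # 1.1
```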
Pattern Queries
 What is Pattern?
 An expression that defines a set of objects. Pattern shows the
internal representation of an object.
 What is the pattern of a word?
 Pattern matching: A word matches a pattern if it is equal to one
of the words defined by the pattern.
 In other words,
The semantics are those of disjunction: a pattern P that defines the set of
words {c1, c2, …, cn} is interpreted as c1 ∨ c2 ∨ … ∨ cn.
Pattern Queries
 Pattern queries, also known as wildcard queries or pattern-based
queries, are a type of information retrieval query that involves
searching for documents or information using patterns or wildcards to
match variations of terms.
 Instead of specifying the exact terms, pattern queries allow users to
define a pattern with placeholders or wildcards to match multiple
variations of a term or a group of terms.
 Here's an explanation of pattern queries with an example:
 Pattern queries use special characters, known as wildcards, to
represent unknown or variable portions of terms. The most
commonly used wildcards are:
 Asterisk (*) wildcard: The asterisk represents any number of characters,
including none or multiple characters. It can be used to match different
variations of a term or to capture unknown parts of a word.
 Question mark (?) wildcard: The question mark represents a single character. It
can be used to match variations in spelling or to capture one character in a
specific position within a term.
Pattern Queries
 Pattern queries can be useful in various scenarios, such as:
 Expanding searches: Pattern queries help retrieve documents that
contain various word forms or spellings.
 For example, a pattern query like "organi?ation" can match "organization" and
"organisation" simultaneously.
 Handling misspellings: Pattern queries with question mark wildcards
can accommodate minor spelling variations.
 For instance, a pattern query like "c?t" can match "cat," "cot," or "cut."
 Searching for specific patterns: Pattern queries can capture specific
patterns within terms.
 For example, a pattern query like "b?g" can match "big," "beg," or "bug."
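Wildcard matching over a vocabulary can be tried with `fnmatchcase` from Python's standard `fnmatch` module, whose `*` and `?` have the same meaning as on these slides. The `match_pattern` helper and the vocabulary are hypothetical. Note that `fnmatch` also gives special meaning to square brackets, which the slides do not cover.

```python
from fnmatch import fnmatchcase

def match_pattern(pattern, vocabulary):
    """Return the vocabulary words matched by a wildcard pattern."""
    return [w for w in vocabulary if fnmatchcase(w, pattern)]

vocabulary = ["cat", "cot", "cut", "coat", "big", "bag",
              "organization", "organisation"]

print(match_pattern("c?t", vocabulary))           # exactly one character between c and t
print(match_pattern("c*t", vocabulary))           # any number of characters between c and t
print(match_pattern("organi?ation", vocabulary))  # both spellings
```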
Pattern Queries
 Similarity pattern. Specifies a string and a radius
 Defines all the words whose distance from the string is within the
radius.
 Assume the distance between two strings is measured by the
number of one-character changes (insertions, deletions,
replacements) required to transform one string into the other.
 The similarity pattern (king, 2) defines kin, kong, knig, kings, cling,
…
 Useful to compensate for typing or scanning (OCR) errors.
 One of the techniques used for pattern matching is string editing.
Pattern Queries
 The similarity pattern takes into account the concept of "edit distance"
or "string distance," which measures the number of operations required
to transform one string into another.
 When specifying a similarity pattern, two main components are
typically involved: the string itself and a radius or threshold value.
 The string represents the target pattern or sequence that you want to
find similar matches for, and the radius determines the maximum
allowed difference or deviation from the target pattern.
 For example, consider the similarity pattern ("cat", 1). This
means we are looking for strings whose edit distance from "cat" is at
most 1. Matching strings include:
 "cat" (exact match)
 "bat" (substitution of 'c' with 'b')
 "car" (substitution of 't' with 'r')
 "catz" (insertion of 'z')
 "at" (deletion of 'c')
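A similarity pattern can be evaluated by computing the unit-cost edit distance from the pattern string to each vocabulary word and keeping the words within the radius. The sketch below assumes each insertion, deletion, and replacement costs 1; the names `edit_distance` and `similarity_match` and the vocabulary are illustrative.

```python
def edit_distance(s, t):
    """Unit-cost edit distance (insert, delete, replace), keeping two rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # replace, or match at no cost
        prev = curr
    return prev[-1]

def similarity_match(string, radius, vocabulary):
    """Words whose edit distance from string is within the radius."""
    return [w for w in vocabulary if edit_distance(string, w) <= radius]

vocabulary = ["cat", "bat", "car", "catz", "at", "dog", "cart", "bar"]
print(similarity_match("cat", 1, vocabulary))
```

"bar" needs two replacements and "dog" three, so both fall outside a radius of 1.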
String Editing
 The problem: given two sequences of symbols, X = x1 x2 … xn
and Y = y1 y2 … ym, transform X into Y using a sequence of
three operations (Delete, Insert and Replace), where every
operation incurs a cost.
 The objective of string editing is to identify a minimum cost
sequence of edit operation that will transform X into Y.
 Example: consider the sequences:
 X = {a a b a b} and Y = {b a b b}
 Identify a minimum cost sequence of edit operation that transform
X into Y.
 Assume change costs 2 units, delete 1 unit and insert 1 unit.
Dynamic programming
 The minimum cost of any edit sequence that transforms x1 x2 …
xi into y1 y2 … yj (for i>0 and j>0) is the minimum of the three
costs: delete, replace, or insert operations.
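 The following recurrence equation is used for COST(i,j):
$$\mathrm{COST}(i,j) = \begin{cases} 0 & i = 0,\ j = 0 \\ \mathrm{COST}(i-1,0) + D(x_i) & i > 0,\ j = 0 \\ \mathrm{COST}(0,j-1) + I(y_j) & i = 0,\ j > 0 \\ \mathrm{COST}'(i,j) & i > 0,\ j > 0 \end{cases}$$
where
$$\mathrm{COST}'(i,j) = \min\{\ \mathrm{COST}(i-1,j) + D(x_i),\ \ \mathrm{COST}(i-1,j-1) + C(x_i,y_j),\ \ \mathrm{COST}(i,j-1) + I(y_j)\ \}$$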
Example
 Transform the sequence
 X = {a a b a b} into Y = {b a b b}
 with a minimum cost sequence of edit operations, using the dynamic
programming approach. Assume that a change costs 2 units, and a delete
or an insert costs 1 unit.
COST(i,j) table (rows: i = 0..5 over X; columns: j = 0..4 over Y):
       j=0  j=1  j=2  j=3  j=4
i=0     0    1    2    3    4
i=1     1    2    1    2    3
i=2     2    3    2    3    4
i=3     3    2    3    2    3
i=4     4    3    2    3    4
i=5     5    4    3    2    3
 The value 3 at (5,4) is the cost of the optimal solution.
 By tracing back, one can determine which operations lead to the
optimal solution:
 Delete x1, delete x2 and insert y4; or
 Change x1 to y1 and delete x4.
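The table for this example can be reproduced with a direct implementation of the COST(i,j) recurrence. The sketch assumes, as the example implies, that replacing a symbol by an identical symbol costs 0; the function name `edit_cost_table` and its parameter names are illustrative.

```python
def edit_cost_table(x, y, delete=1, insert=1, change=2):
    """Fill COST(i, j), the minimum cost of transforming x[:i] into y[:j]."""
    n, m = len(x), len(y)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: delete every x_i
        cost[i][0] = cost[i - 1][0] + delete
    for j in range(1, m + 1):          # first row: insert every y_j
        cost[0][j] = cost[0][j - 1] + insert
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumption: changing a symbol into itself is free.
            replace = 0 if x[i - 1] == y[j - 1] else change
            cost[i][j] = min(cost[i - 1][j] + delete,
                             cost[i - 1][j - 1] + replace,
                             cost[i][j - 1] + insert)
    return cost

table = edit_cost_table("aabab", "babb")
for row in table:
    print(row)
print("minimum cost:", table[5][4])
```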
Natural language
 Using natural language for querying is very attractive.
 Example: Find all the documents that discuss
 “campaign finance reforms, including documents that discuss
violations of campaign financing regulations.
 Do not include documents that discuss campaign contributions
by the gun and the tobacco industries”.
 Natural language queries are converted to a formal language for
processing against a set of documents.
 Such translation requires intelligence and is still a challenge.
Natural language
 Pseudo NL processing: System scans the text and extracts
recognized terms and Boolean connectors.
 The grammaticality of the text is not important.
 Often used by search engines.
 Problem: Recognizing the negation in the search statement
(“Do not include...”).
 Compromise: Users enter natural language clauses connected
with Boolean operators.
 In the above example: “campaign finance reforms” or
“violations of campaign financing regulations” and not
“campaign contributions by the gun and the tobacco industries”.
Question & Answer
9/25/2023 25
Thank You !!!
Chapter 7 : Query Operations
Adama Science and Technology University
School of Electrical Engineering and
Computing
Department of CSE
Kibrom T 2023
Introduction
 Users usually have no detailed knowledge of the collection and the
searching environment.
It is difficult to formulate queries that are well designed for searching.
Many formulations of a query may be needed for effective searching.
 First formulation: often a naïve attempt to retrieve relevant
information.
 Documents initially retrieved:
Can be examined for relevance information (by the user or
automatically by the system) to provide relevance feedback.
 Relevance feedback improves query formulations for retrieving
additional relevant documents (using query reformulation techniques).
Query Reformulation
 Identify terms related to query terms.
 Revise query to account for feedback using two basic techniques:
Query Expansion: Add new terms related to query terms from
relevant documents.
Term Reweighting: modify term weights based on document
relevance for the user's query.
Increase the weight of terms in relevant documents and decrease the
weight of terms in irrelevant documents.
 Several algorithms for query reformulation.
Term Reweighting for Query Reformulation
 Term weight vectors of documents assessed relevant:
Show similarities among themselves.
 Term weight vectors of documents assessed non-relevant:
Are dissimilar from those of relevant documents.
 Reformulated query:
Moves closer to the term weight vectors of relevant documents.
Term Reweighting for Query Reformulation: Rocchio Formula
For query q:
 Dr: set of relevant documents among retrieved documents.
 Dn: set of non-relevant documents among retrieved documents.
 Cr: set of relevant documents among all documents in collection.
 α, β, γ: tuning constants.
 Initial formulation: α = 1.
 Usually information in relevant documents is more important than
in non-relevant documents (γ << β).
$$\vec{q}_{i+1} = \alpha\,\vec{q}_i + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$$
Term Reweighting for Query Reformulation: Ide Formula
$$\vec{q}_{i+1} = \alpha\,\vec{q}_i + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\sum_{\vec{d}_j \in D_n}\vec{d}_j$$
 Initial formulation: α = β = γ = 1.
 Same comments as for the Rocchio formula.
 Both Ide and Rocchio: no optimality criterion.
 E.g. you are given a query vector qi = (2,3,1,2,5) and documents
d1 = (3,3,2,0,9); d2 = (2,2,1,0,12); d3 = (3,2,1,0,9); d4 = (2,2,1,3,1);
d5 = (1,2,1,3,3).
 3 documents are identified as relevant by a user (i.e. d1-d3) and 2
documents as irrelevant (i.e. d4-d5).
 Compute the modified query using the standard Rocchio equation with
α = β = γ = 1.
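The exercise on this slide can be worked through in code. The sketch below assumes the standard Rocchio form, in which each sum is divided by the size of its document set, and the Ide form, which uses the raw sums; the function name `rocchio` and the `normalize` switch are illustrative.

```python
def rocchio(q, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0,
            normalize=True):
    """Return the reformulated query vector.

    normalize=True divides each sum by the set size (standard Rocchio);
    normalize=False uses the raw sums (Ide).
    """
    def summed(docs):
        total = [sum(col) for col in zip(*docs)] if docs else [0.0] * len(q)
        scale = len(docs) if (normalize and docs) else 1
        return [v / scale for v in total]

    pos, neg = summed(relevant), summed(non_relevant)
    return [alpha * qi + beta * p - gamma * n
            for qi, p, n in zip(q, pos, neg)]

q = [2, 3, 1, 2, 5]
relevant = [[3, 3, 2, 0, 9], [2, 2, 1, 0, 12], [3, 2, 1, 0, 9]]   # d1-d3
non_relevant = [[2, 2, 1, 3, 1], [1, 2, 1, 3, 3]]                 # d4-d5

print([round(v, 2) for v in rocchio(q, relevant, non_relevant)])
print(rocchio(q, relevant, non_relevant, normalize=False))
```

With α = β = γ = 1 the Rocchio result is approximately (3.17, 3.33, 1.33, -1, 13) and the Ide result is (7, 6, 3, -4, 31). The fourth weight becomes negative because that term occurs only in non-relevant documents; implementations often reset negative weights to zero.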
Approaches for Query Operations
 User relevance feedback:
Approaches based on feedback from users about relevance of
documents retrieved.
 Pseudo-relevance feedback:
Approaches based on information derived from set of initially
retrieved documents (local set of documents), which is called Local
Analysis.
Approaches based on global information derived from document
collection, which is called Global Analysis.
User Relevance Feedback
 Most popular query reformulation strategy.
 Cycle:
 User presented with list of retrieved documents.
 User marks those which are relevant.
 In practice: top 10-20 ranked documents are examined.
 Select important terms from documents assessed relevant by users.
 Enhance importance of these terms in a new query.
 Expected:
 New query moves towards relevant documents and away from
non-relevant documents.
Relevance Feedback Architecture
[Figure: relevance feedback architecture. A query string goes to the IR system, which searches the document corpus and returns ranked documents (Doc1, Doc2, Doc3, ...). The user's feedback on these documents drives query reformulation, and the revised query is run again to produce re-ranked documents (Doc2, Doc4, Doc5, ...).]
Relevance Feedback
 After initial searching results are presented, allow the user to
provide feedback on the relevance of one or more of the
retrieved documents.
 Use this feedback information to reformulate the query.
 Produce new results based on reformulated query.
 Allows more interactive, multi-pass process.
Pseudo Relevance Feedback
 Use relevance feedback methods without explicit user input.
 Obtain relevance feedback automatically;
 Identify terms related to query terms (e.g. synonyms, stemming
variations, terms close to query terms in text)
 Just assume the top m retrieved documents are relevant, and use
them to reformulate the query.
 Allows for query expansion that includes terms that are
correlated with the query terms.
 Two strategies:
 Local strategies
 Global strategies
Local Analysis
 Examine only documents retrieved automatically for query to
determine query expansion.
 At query time, dynamically determine similar terms based on
analysis of top-ranked retrieved documents.
 Base correlation analysis on only the “local” set of retrieved
documents for a specific query.
 Avoids ambiguity by determining similar (correlated) terms only
within relevant documents.
 “Apple computer” → “Apple computer Power book laptop”
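A very simple form of local analysis treats the top m retrieved documents as relevant and adds the terms that occur most often in them. The sketch below uses raw document frequency within the local set as a stand-in for a proper term correlation measure; the function `expand_query_locally`, its parameters, and the ranked list are all hypothetical.

```python
from collections import Counter

def expand_query_locally(query_terms, ranked_docs, top_m=2, n_new_terms=2):
    """Add the terms that occur most often in the top_m retrieved documents."""
    counts = Counter()
    for text in ranked_docs[:top_m]:
        for term in text.lower().split():
            if term not in query_terms:
                counts[term] += 1
    # Most frequent first; ties are broken alphabetically.
    by_count = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in by_count[:n_new_terms]]

# Hypothetical ranking returned for the initial query.
ranked_docs = [
    "apple computer laptop powerbook review",
    "apple powerbook laptop battery",
    "apple fruit pie recipe",
]
print(expand_query_locally(["apple", "computer"], ranked_docs))
```

Because only the top-ranked documents are analysed, "fruit" from the third document never enters the expanded query.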
Global Analysis
 Expand query using information from whole set of documents in
collection.
 Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
 Thesaurus-like structure using all documents:
 Approach to automatically build a thesaurus.
 (e.g. similarity thesaurus based on co-occurrence frequency)
 Approach to select terms for query expansion.
 A thesaurus provides information on synonyms and semantically
related words and phrases.
 Example: physician
similar/synonymous: doctor, medical, MD
related: general practitioner, surgeon
Thesaurus-based Query Expansion
 For each term, t, in a query, expand the query with synonyms
and related words of t from the thesaurus.
 May weight added terms less than original query terms.
 Generally increases recall.
 May significantly decrease precision, particularly with
ambiguous terms.
 “interest rate” → “interest rate fascinate evaluate”
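Thesaurus-based expansion with down-weighted added terms can be sketched as a dictionary lookup. The `THESAURUS` entries, the `expand_with_thesaurus` function, and the 0.5 weight for added terms are assumptions chosen for illustration.

```python
# Hypothetical hand-built thesaurus; the entries are illustrative only.
THESAURUS = {
    "physician": ["doctor", "medical", "md"],
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

def expand_with_thesaurus(query_terms, thesaurus, added_weight=0.5):
    """Return {term: weight}: original terms get 1.0, added terms get less."""
    weights = {t: 1.0 for t in query_terms}
    for t in query_terms:
        for related in thesaurus.get(t, []):
            # setdefault keeps the full weight of terms already in the query.
            weights.setdefault(related, added_weight)
    return weights

print(expand_with_thesaurus(["interest", "rate"], THESAURUS))
```

The output reproduces the precision problem from the slide: each word is expanded independently, so senses of "interest" and "rate" unrelated to finance are added.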
Global vs. Local Analysis
 Global analysis requires intensive term correlation computation
only once at system development time.
 Local analysis requires intensive term correlation computation for
every query at run time (although number of terms and documents
is less than in global analysis).
 But local analysis gives better results.
 Term ambiguity may introduce irrelevant statistically correlated
terms during global analysis.
 “Apple computer” → “Apple red fruit computer”
Global Analysis Refinements
 Only expand query with terms that are similar to all terms in
the query.
 “fruit” not added to “Apple computer” since it is far from
“computer.”
 “fruit” added to “apple pie” since “fruit” close to both “apple”
and “pie.”
 Use more sophisticated term weights (instead of just
frequency) when computing term correlations.
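 Similarity of a candidate term k_i to the whole query Q, where c_ij is
the correlation between terms k_i and k_j:
$$sim(k_i, Q) = \sum_{k_j \in Q} c_{ij}$$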
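Selecting expansion terms by their similarity to the whole query can be sketched as a sum of term-term correlations over the query terms. The correlation values in `c` are invented for illustration, and `sim` and `best_expansion` are hypothetical names.

```python
# Hypothetical term-term correlations c[candidate][query_term].
c = {
    "fruit":  {"apple": 0.8, "computer": 0.0, "pie": 0.6},
    "laptop": {"apple": 0.5, "computer": 0.9, "pie": 0.0},
}

def sim(term, query_terms):
    """Sum of the correlations between a candidate term and every query term."""
    return sum(c[term].get(q, 0.0) for q in query_terms)

def best_expansion(query_terms):
    """Candidate term with the highest similarity to the whole query."""
    return max(c, key=lambda term: sim(term, query_terms))

print(best_expansion(["apple", "computer"]))  # laptop
print(best_expansion(["apple", "pie"]))       # fruit
```

"fruit" correlates strongly with "apple" alone, but it is chosen only when the other query term supports it as well.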
Query Expansion Conclusions
 Expansion of queries with related terms can improve
performance, particularly recall.
 However, must select similar terms very carefully to avoid
problems, such as loss of precision.
Question & Answer
Thank You !!!

H.Insulation-PPT.ppt
 
COLD AND SUNNNY ZONE.pptx
COLD AND SUNNNY ZONE.pptxCOLD AND SUNNNY ZONE.pptx
COLD AND SUNNNY ZONE.pptx
 
Product design (datan).pptx
Product design (datan).pptxProduct design (datan).pptx
Product design (datan).pptx
 

Recently uploaded

Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxAvyJaneVismanos
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsanshu789521
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 

Recently uploaded (20)

Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Bikash Puri  Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Bikash Puri Delhi reach out to us at 🔝9953056974🔝
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Final demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptxFinal demo Grade 9 for demo Plan dessert.pptx
Final demo Grade 9 for demo Plan dessert.pptx
 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Presiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha electionsPresiding Officer Training module 2024 lok sabha elections
Presiding Officer Training module 2024 lok sabha elections
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 

6&7-Query Languages & Operations.ppt

  • 1. Chapter 6 : Query Languages Adama Science and Technology University School of Electrical Engineering and Computing Department of CSE Kibrom T 2023
  • 2. Keyword-based Querying
      - In information retrieval (IR), query languages play a crucial role in making searches effective and efficient.
      - A query language lets users express their information needs in an organized manner, so that the retrieval system can understand and process the queries and return relevant results.
      - A query is a request or question posed to a search engine or IR system in order to retrieve information that matches the specified criteria.
  • 3. Keyword-based Querying
      - Queries are combinations of words.
      - The document collection is searched for documents that contain these words.
      - Word queries are intuitive, easy to express, and provide fast ranking.
      - The concept of a word must be defined:
          - A word is a sequence of letters terminated by a separator (period, comma, space, etc.).
          - The definition of letter and separator is flexible; e.g., a hyphen could be defined as a letter or as a separator.
          - Usually, common words (such as “a”, “the”, “of”, …) are ignored.
  • 4. Single-word Queries
      - A query is a single word, that is, a single keyword or term.
      - Usually used for searching in document images.
      - Simplest form of query.
      - Which documents are retrieved as relevant? All documents that include this word.
      - Examples of single-word queries: football, apple, ...
      - On what basis are documents ranked? Documents may be ranked by the frequency of the query word in the document; documents containing more occurrences of the query word are given the highest priority.
  • 5. Single-word Queries
      - What is the difference between keyword-based querying and single-word queries? Explain with an example.
      - The two are related concepts but differ in specificity and complexity.
      - Keyword-based querying searches for documents based on one or more specific keywords or terms.
      - A single-word query consists of exactly one keyword or term.
      - Keyword-based query: "Healthy recipe for vegetarian lasagna"
      - Single-word query: "Lasagna"
  • 6. Phrase Queries
      - A query is a sequence of words treated as a single unit. Also called a “literal string” or “exact phrase” query.
      - The phrase is usually surrounded by quotation marks.
      - Which documents are retrieved as relevant? All documents that include this phrase.
      - Usually, separators (commas, colons, ...) and common words (“a”, “the”, “of”, “for”, …) in the phrase are ignored.
      - In effect, this query is for a set of words that must appear in sequence.
      - Allows users to specify a context and thus gain precision.
      - Ex.: “Information Processing for Document Retrieval”.
      - On what basis are documents ranked?
  • 7. Phrase Queries
      - Unlike single-word queries, which retrieve documents containing individual keywords, phrase queries look for the exact occurrence of a particular sequence of words.
      - The search system looks for documents that include the exact phrase in the specified order.
      - Example: "artificial intelligence"
      - Because the phrase is enclosed in quotation marks, the system retrieves documents in which the exact phrase "artificial intelligence" appears, rather than documents that contain the words "artificial" and "intelligence" separately.
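A minimal sketch of exact-phrase matching, assuming whitespace tokenization and case-insensitive comparison. The function name `contains_phrase` and the three sample documents are hypothetical, and the sketch does not drop separators or common words as a real system might.

```python
def contains_phrase(text, phrase):
    """Return True if the words of `phrase` occur consecutively, in order, in `text`."""
    words = text.lower().split()
    target = phrase.lower().split()
    if not target:
        return False
    span = len(target)
    # Slide a window of the phrase length over the document words.
    return any(words[i:i + span] == target for i in range(len(words) - span + 1))


# Hypothetical sample collection.
documents = {
    "d1": "Artificial intelligence is a field of computer science",
    "d2": "intelligence that is artificial",
    "d3": "artificial general intelligence",
}

phrase_hits = [doc_id for doc_id, text in documents.items()
               if contains_phrase(text, "artificial intelligence")]
print(phrase_hits)  # ['d1']
```

Documents d2 and d3 contain both words but not as an adjacent, ordered sequence, so only d1 is retrieved.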
  • 8. Multiple-word Queries
      - A query is a set of words (or phrases).
      - Ex.: What is the result for the query “Data Mining and Intelligent Database Design”?
      - Which documents are retrieved as relevant? Two options: a document is retrieved if it includes
          - any of the query words, or
          - each of the query words.
      - What is the difference between multiple-word and phrase queries?
  • 9. Multiple-word Queries
      - On what basis are documents ranked according to the best-matching principle?
      - Documents are ranked by the number of query words they contain: a document containing n query words is ranked higher than a document containing m < n query words.
      - All else being equal, documents that contain more of the query words are considered more relevant.
      - Documents are ranked in decreasing order: those containing all the query words at the top, those containing only one query word at the bottom.
      - Frequency counts may be used to break ties among documents that contain the same query words.
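The ranking rule for multiple-word queries can be sketched as follows. This is an illustrative sketch only: the function `rank_multiword`, the whitespace tokenization, and the four sample documents are assumptions, and it implements the "any of the query words" retrieval option.

```python
from collections import Counter


def rank_multiword(query_words, documents):
    """Rank by number of distinct query words matched; break ties by total frequency."""
    query = {w.lower() for w in query_words}
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        matched = sum(1 for w in query if counts[w] > 0)
        frequency = sum(counts[w] for w in query)
        if matched > 0:  # retrieve a document if it includes any query word
            scored.append((doc_id, matched, frequency))
    # More matched words first, then higher frequency, then document id for stability.
    scored.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return scored


# Hypothetical sample collection.
docs = {
    "d1": "data mining and database design",
    "d2": "intelligent database systems",
    "d3": "data data mining",
    "d4": "cooking recipes",
}

ranking = rank_multiword(["data", "mining", "database", "design"], docs)
print(ranking)  # [('d1', 4, 4), ('d3', 2, 3), ('d2', 1, 1)]
```

Requiring `matched == len(query)` instead of `matched > 0` would give the stricter "each of the query words" option.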
  • 10. Multiple-word Queries
      - Phrase queries and multiple-word queries both search for specific combinations of words or terms, but they differ in their matching criteria:
          - In phrase queries, the search system looks for the exact occurrence of a specific phrase. The words must appear in the specified order for a match to be found.
          - In multiple-word queries, the search system looks for documents that contain any combination of the specified words or terms. The words can appear in any order within the document.
  • 11. Boolean Queries
      - Queries are formulated using concepts from logic: AND, OR, NOT.
      - A Boolean query describes the information need by relating multiple words with Boolean operators.
      - Semantics: for each query word w, a corresponding set D_w is constructed that includes the documents that contain w.
      - The Boolean expression is then interpreted as an expression on the corresponding document sets with the corresponding set operators:
          - AND (intersection): finds only documents containing all of the specified words or phrases.
          - OR (union): finds documents containing at least one of the specified words or phrases.
          - NOT (set difference): excludes documents containing the specified word or phrase.
  • 12. Examples: Boolean Queries
      - 1. computer OR server: finds documents containing either computer, server, or both.
      - 2. (computer OR server) NOT mainframe: selects all documents that discuss computers or servers, excluding any that discuss mainframes.
      - 3. computer NOT (server OR mainframe): selects all documents that discuss computers and do not discuss either servers or mainframes.
      - 4. computer OR server NOT mainframe: selects all documents that discuss computers, or documents that discuss servers but do not discuss mainframes (NOT is applied before OR).
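The four example queries can be evaluated with Python sets, following the set semantics of Boolean queries. This is a sketch under stated assumptions: the four-document collection and the helpers `build_index` and `docs_with` are made up for illustration, and NOT is modelled as set difference.

```python
def build_index(documents):
    """Map each word w to the set D_w of ids of the documents that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


# Hypothetical sample collection.
docs = {
    1: "computer hardware",
    2: "server rack",
    3: "computer server mainframe",
    4: "mainframe history",
}
index = build_index(docs)


def docs_with(word):
    """Return D_w, or the empty set if the word is not indexed."""
    return index.get(word, set())


q1 = docs_with("computer") | docs_with("server")                             # computer OR server
q2 = (docs_with("computer") | docs_with("server")) - docs_with("mainframe")  # (computer OR server) NOT mainframe
q3 = docs_with("computer") - (docs_with("server") | docs_with("mainframe"))  # computer NOT (server OR mainframe)
q4 = docs_with("computer") | (docs_with("server") - docs_with("mainframe"))  # computer OR server NOT mainframe

print(q1, q2, q3, q4)  # {1, 2, 3} {1, 2} {1} {1, 2, 3}
```

Queries 2 and 4 differ only in grouping, yet document 3 is excluded by the first and retrieved by the second, which is why parentheses matter.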
  • 13. Weighted Queries
      - Weighted queries, also known as term-weighted queries, assign each term or keyword a weight that indicates its relative importance in the query.
      - The search system uses these weights to rank the results and retrieve documents that align more closely with the user's information need.
      - Example: apple^3 OR banana^2 OR orange. The query consists of three terms: "apple," "banana," and "orange." The number after the caret (^) is the weight of the term; "orange" carries no explicit weight.
      - The ranking score of a document is the sum of the weights of the query words that it contains.
  • 14. Weighted Queries
      - Each word is assigned a weight expressing its relative importance within the query.
      - A query is then a set of word-weight pairs: (q1, w1), (q2, w2), …, (qn, wn).
      - The ranking score of a document is the sum of the weights of the query words that it contains.
      - Example: given the query (A, 0.8), (B, 0.9), (C, 0.3), Document 1: (A, B, D), and Document 2: (A, C, D), which document is ranked first?
          - Score of Document 1: 0.8 (for A) + 0.9 (for B) + 0 (for C) = 1.7
          - Score of Document 2: 0.8 (for A) + 0 (for B) + 0.3 (for C) = 1.1
      - Each document includes two words from the query, but Document 1 is ranked higher because it includes the more important words.
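The worked example can be reproduced with a few lines of Python. This is only a sketch: the function name `weighted_score` and the representation of a document as a set of terms are assumptions made for illustration.

```python
def weighted_score(query, document_terms):
    """Sum the weights of the query terms that occur in the document."""
    return sum(weight for term, weight in query.items() if term in document_terms)


query = {"A": 0.8, "B": 0.9, "C": 0.3}
document1 = {"A", "B", "D"}
document2 = {"A", "C", "D"}

score1 = weighted_score(query, document1)  # 0.8 + 0.9, about 1.7
score2 = weighted_score(query, document2)  # 0.8 + 0.3, about 1.1

# Rank documents by decreasing score.
weighted_ranking = sorted(
    [("Document 1", score1), ("Document 2", score2)],
    key=lambda pair: -pair[1],
)
print([name for name, _ in weighted_ranking])  # ['Document 1', 'Document 2']
```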
  • 15. Pattern Queries
      - What is a pattern? An expression that defines a set of objects. A pattern shows the internal representation of an object.
      - What is the pattern of a word?
      - Pattern matching: a word matches a pattern if it is equal to one of the words defined by the pattern.
      - In other words, the semantics is that of disjunction: a pattern P that defines the words c1, c2, …, cn is interpreted as c1 ∨ c2 ∨ … ∨ cn.
  • 16. Pattern Queries
      - Pattern queries, also known as wildcard queries or pattern-based queries, search for documents using patterns or wildcards that match variations of terms.
      - Instead of specifying the exact terms, users define a pattern with placeholders that matches multiple variations of a term or a group of terms.
      - Wildcards are special characters that represent unknown or variable portions of terms. The most commonly used wildcards are:
          - Asterisk (*): represents any number of characters, including none. It can match different variations of a term or capture unknown parts of a word.
          - Question mark (?): represents exactly one character. It can match variations in spelling or capture one character in a specific position within a term.
  • 17. Pattern Queries
      - Pattern queries can be useful in various scenarios, such as:
          - Expanding searches: pattern queries help retrieve documents that contain various word forms or spellings. For example, the pattern "organi?ation" matches both "organization" and "organisation".
          - Handling misspellings: pattern queries with question mark wildcards can accommodate minor spelling variations. For instance, the pattern "c?t" matches "cat," "cot," or "cut."
          - Searching for specific patterns: pattern queries can capture specific patterns within terms. For example, the pattern "b?g" matches "big," "beg," or "bug."
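Wildcard matching of this kind can be tried with Python's standard `fnmatch` module, whose `*` and `?` have the meanings described for the asterisk and question mark wildcards. The vocabulary and the helper `match_pattern` are assumptions for illustration; a real IR system would match against its term index rather than a list.

```python
from fnmatch import fnmatchcase


def match_pattern(pattern, vocabulary):
    """Return the vocabulary words matched by a wildcard pattern, sorted."""
    return sorted(word for word in vocabulary if fnmatchcase(word, pattern))


# Hypothetical vocabulary of indexed terms.
vocabulary = ["cat", "cot", "cut", "cart", "ct", "big", "beg", "bug",
              "organization", "organisation", "organic"]

single_char = match_pattern("c?t", vocabulary)              # ['cat', 'cot', 'cut']
both_spellings = match_pattern("organi?ation", vocabulary)  # ['organisation', 'organization']
z_only = match_pattern("organiz*ation", vocabulary)         # ['organization']
prefix = match_pattern("organ*", vocabulary)                # ['organic', 'organisation', 'organization']
```

Note that "organiz*ation" fixes the letter z, so it matches only the z spelling; the single-character wildcard in "organi?ation" is what covers both "organization" and "organisation".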
  • 18. Pattern Queries
      - Similarity pattern: specifies a string and a radius, and defines all the words whose distance from the string is within the radius.
      - Assume the distance between two strings is measured by the number of one-character changes (insertions, deletions, replacements) required to transform one string into the other.
      - The similarity pattern (king, 2) defines kin, kong, knig, kings, cling, …
      - Useful to compensate for typing or scanning (OCR) errors.
      - One of the techniques used for pattern matching is string editing.
  • 19. Pattern Queries
      - The similarity pattern relies on the concept of "edit distance" or "string distance," which measures the number of operations required to transform one string into another.
      - A similarity pattern has two components: the string itself and a radius (threshold).
      - The string is the target pattern for which similar matches are sought, and the radius is the maximum allowed edit distance from the target.
      - For example, the similarity pattern "cat" with a radius of 1 matches strings whose edit distance from "cat" is at most 1:
          - "cat" (exact match)
          - "bat" (substitution of 'c' with 'b')
          - "car" (substitution of 't' with 'r')
          - "catz" (insertion of 'z')
          - "at" (deletion of 'c')
  • 20. String Editing
      - Problem: given two sequences of symbols, X = x1 x2 … xn and Y = y1 y2 … ym, transform X into Y using a sequence of three operations (delete, insert, and replace), where every operation incurs a cost.
      - The objective of string editing is to identify a minimum-cost sequence of edit operations that transforms X into Y.
      - Example: consider the sequences X = a a b a b and Y = b a b b.
      - Identify a minimum-cost sequence of edit operations that transforms X into Y.
      - Assume a change costs 2 units, a delete 1 unit, and an insert 1 unit.
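  • 21. Dynamic programming
      - The minimum cost of any edit sequence that transforms x1 x2 … xi into y1 y2 … yj (for i > 0 and j > 0) is the minimum of the three costs: delete, replace, or insert.
      - The following recurrence is used for COST(i, j), where D(x_i) is the cost of deleting x_i, I(y_j) the cost of inserting y_j, and C(x_i, y_j) the cost of replacing x_i with y_j (zero when x_i = y_j):

      $$\mathrm{COST}(i,j) = \begin{cases} 0 & i = 0,\ j = 0 \\ \mathrm{COST}(i-1,0) + D(x_i) & i > 0,\ j = 0 \\ \mathrm{COST}(0,j-1) + I(y_j) & i = 0,\ j > 0 \\ \mathrm{COST}'(i,j) & i > 0,\ j > 0 \end{cases}$$

      where

      $$\mathrm{COST}'(i,j) = \min\{\, \mathrm{COST}(i-1,j) + D(x_i),\ \mathrm{COST}(i-1,j-1) + C(x_i,y_j),\ \mathrm{COST}(i,j-1) + I(y_j) \,\}$$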
  • 22. Example
      - Transform the sequence X = a a b a b into Y = b a b b with a minimum-cost sequence of edit operations, using the dynamic programming approach. Assume a change costs 2 units and a delete or an insert costs 1 unit.
      - COST(i, j) table (rows i = 0 to 5, columns j = 0 to 4):

               j=0  j=1  j=2  j=3  j=4
        i=0     0    1    2    3    4
        i=1     1    2    1    2    3
        i=2     2    3    2    3    4
        i=3     3    2    3    2    3
        i=4     4    3    2    3    4
        i=5     5    4    3    2    3

      - The value 3 at (5, 4) is the cost of the optimal solution.
      - Tracing back through the table shows which operations lead to the optimal solution:
          - delete x1, delete x2, and insert y4; or
          - change x1 to y1 and delete x4.
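The table for this example can be recomputed with a short dynamic programming routine. This is a sketch assuming the costs stated in the example (delete 1, insert 1, change 2, and no cost when the symbols are equal); the names `edit_cost_table` and `within_radius` are hypothetical. The second helper applies the same routine with unit costs to test a similarity pattern such as (king, 2).

```python
def edit_cost_table(x, y, delete_cost=1, insert_cost=1, change_cost=2):
    """Fill the COST(i, j) table for transforming x[:i] into y[:j]."""
    n, m = len(x), len(y)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: delete all of x[:i]
        cost[i][0] = cost[i - 1][0] + delete_cost
    for j in range(1, m + 1):          # first row: insert all of y[:j]
        cost[0][j] = cost[0][j - 1] + insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace = 0 if x[i - 1] == y[j - 1] else change_cost
            cost[i][j] = min(
                cost[i - 1][j] + delete_cost,   # delete x_i
                cost[i - 1][j - 1] + replace,   # change x_i to y_j (free if equal)
                cost[i][j - 1] + insert_cost,   # insert y_j
            )
    return cost


def within_radius(word, target, radius):
    """Similarity pattern test using unit costs for all three operations."""
    return edit_cost_table(word, target, 1, 1, 1)[-1][-1] <= radius


table = edit_cost_table("aabab", "babb")
for row in table:
    print(row)
# Last row printed: [5, 4, 3, 2, 3], so the minimum cost is 3.
```

With unit costs, `within_radius("knig", "king", 2)` holds because two replacements suffice.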
  • 23. Natural language
      - Using natural language for querying is very attractive.
      - Example: “Find all the documents that discuss campaign finance reforms, including documents that discuss violations of campaign financing regulations. Do not include documents that discuss campaign contributions by the gun and the tobacco industries.”
      - Natural language queries are converted to a formal language for processing against a set of documents.
      - Such translation requires intelligence and is still a challenge.
  • 24. Natural language
      - Pseudo NL processing: the system scans the text and extracts recognized terms and Boolean connectors.
      - The grammaticality of the text is not important.
      - Often used by search engines.
      - Problem: recognizing the negation in the search statement (“Do not include...”).
      - Compromise: users enter natural language clauses connected with Boolean operators.
      - In the above example: “campaign finance reforms” OR “violations of campaign financing regulations” AND NOT “campaign contributions by the gun and the tobacco industries”.
  • 27. Chapter 7 : Query Operations Adama Science and Technology University School of Electrical Engineering and Computing Department of CSE Kibrom T 2023
  • 28. Introduction
      - Users typically have no detailed knowledge of the collection and the searching environment.
          - It is difficult to formulate queries that are well designed for searching.
          - Many formulations of a query may be needed for effective searching.
      - First formulation: often a naïve attempt to retrieve relevant information.
      - Documents initially retrieved can be examined for relevance information (by the user or automatically by the system) to provide relevance feedback.
      - This feedback is used to improve the query formulation and retrieve additional relevant documents (using query reformulation techniques).
  • 29. Query Reformulation
      - Identify terms related to the query terms.
      - Revise the query to account for feedback using two basic techniques:
          - Query expansion: add new terms, related to the query terms, from relevant documents.
          - Term reweighting: modify term weights based on document relevance to the user's query; increase the weight of terms in relevant documents and decrease the weight of terms in irrelevant documents.
      - Several algorithms exist for query reformulation.
  • 30. Term Reweighting for Query Reformulation
      - Term weight vectors of documents assessed relevant are similar to one another.
      - Term weight vectors of documents assessed non-relevant are dissimilar from those of the relevant documents.
      - Reformulated query: moved closer to the term weight vectors of the relevant documents.
  • 31. Term Reweighting for Query Reformulation: Rocchio Formula
      - For query q:
          - Dr: set of relevant documents among the retrieved documents.
          - Dn: set of non-relevant documents among the retrieved documents.
          - Cr: set of relevant documents among all documents in the collection.
          - α, β, γ: tuning constants.
      - Initial formulation: α = 1.
      - Usually, information in relevant documents is more important than in non-relevant documents (γ << β).

      $$\vec{q}_{i+1} = \alpha\,\vec{q}_i + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$
  • 32. Term Reweighting for Query Reformulation: Ide Formula

      $$\vec{q}_{i+1} = \alpha\,\vec{q}_i + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

      - Initial formulation: α = β = γ = 1.
      - Same comments as for the Rocchio formula.
      - Both Ide and Rocchio: no optimality criterion.
      - Exercise: you are given a query vector q_i = (2, 3, 1, 2, 5) and documents d1 = (3, 3, 2, 0, 9), d2 = (2, 2, 1, 0, 12), d3 = (3, 2, 1, 0, 9), d4 = (2, 2, 1, 3, 1), d5 = (1, 2, 1, 3, 3).
      - Three documents are identified as relevant by a user (d1 to d3) and two as irrelevant (d4 and d5).
      - Compute the modified query using the standard Rocchio equation with α = β = γ = 1.
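The exercise can be checked numerically with the sketch below. It assumes the standard forms of the two formulas: Rocchio divides each sum by the size of its document set, while Ide uses the plain sums. The function names `rocchio` and `ide` are illustrative, and both assume non-empty relevant and non-relevant sets.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q' = alpha*q + (beta/|Dr|) * sum(Dr) - (gamma/|Dn|) * sum(Dn)."""
    dims = range(len(query))
    rel_sum = [sum(d[k] for d in relevant) for k in dims]
    non_sum = [sum(d[k] for d in non_relevant) for k in dims]
    return [alpha * query[k]
            + beta * rel_sum[k] / len(relevant)
            - gamma * non_sum[k] / len(non_relevant)
            for k in dims]


def ide(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q' = alpha*q + beta * sum(Dr) - gamma * sum(Dn), without normalisation."""
    dims = range(len(query))
    rel_sum = [sum(d[k] for d in relevant) for k in dims]
    non_sum = [sum(d[k] for d in non_relevant) for k in dims]
    return [alpha * query[k] + beta * rel_sum[k] - gamma * non_sum[k] for k in dims]


q = [2, 3, 1, 2, 5]
relevant_docs = [[3, 3, 2, 0, 9], [2, 2, 1, 0, 12], [3, 2, 1, 0, 9]]  # d1, d2, d3
non_relevant_docs = [[2, 2, 1, 3, 1], [1, 2, 1, 3, 3]]                # d4, d5

q_rocchio = [round(v, 4) for v in rocchio(q, relevant_docs, non_relevant_docs)]
q_ide = ide(q, relevant_docs, non_relevant_docs)
print(q_rocchio)  # [3.1667, 3.3333, 1.3333, -1.0, 13.0]
print(q_ide)      # [7.0, 6.0, 3.0, -4.0, 31.0]
```

The fourth component becomes negative in both results; in practice negative term weights are commonly reset to zero, although the formulas themselves do not require it.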
  • 33. Approaches for Query Operations
      - User relevance feedback: approaches based on feedback from users about the relevance of the retrieved documents.
      - Pseudo-relevance feedback:
          - Local analysis: approaches based on information derived from the set of initially retrieved documents (the local set of documents).
          - Global analysis: approaches based on global information derived from the whole document collection.
  • 34. User Relevance Feedback
      - Most popular query reformulation strategy.
      - Cycle:
          - The user is presented with a list of retrieved documents.
          - The user marks those that are relevant.
          - In practice, the top 10-20 ranked documents are examined.
          - Important terms are selected from the documents assessed relevant by the user.
          - The importance of these terms is enhanced in a new query.
      - Expected result: the new query moves towards the relevant documents and away from the non-relevant ones.
  • 35. Relevance Feedback Architecture
      [Figure: relevance feedback loop. The query string is submitted to the IR system, which ranks the document corpus and returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user's feedback on these documents drives query reformulation, which produces a revised query; the IR system then returns re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]
  • 36. Relevance Feedback
      - After the initial search results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
      - Use this feedback to reformulate the query.
      - Produce new results based on the reformulated query.
      - Allows a more interactive, multi-pass process.
  • 37. Pseudo Relevance Feedback
      - Use relevance feedback methods without explicit user input.
      - Obtain relevance feedback automatically:
          - Identify terms related to the query terms (e.g., synonyms, stemming variations, terms close to the query terms in the text).
          - Simply assume the top m retrieved documents are relevant, and use them to reformulate the query.
      - Allows for query expansion that includes terms correlated with the query terms.
      - Two strategies: local strategies and global strategies.
  • 38. Local Analysis
      - Examine only the documents retrieved automatically for the query to determine the query expansion.
      - At query time, dynamically determine similar terms based on an analysis of the top-ranked retrieved documents.
      - Base the correlation analysis on only the “local” set of documents retrieved for a specific query.
      - Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
      - Example: “Apple computer” → “Apple computer PowerBook laptop”
  • 39. Global Analysis
      - Expand the query using information from the whole set of documents in the collection.
      - Determine term similarity through a pre-computed statistical analysis of the complete corpus.
      - Thesaurus-like structure built from all documents:
          - Approach to automatically build a thesaurus (e.g., a similarity thesaurus based on co-occurrence frequency).
          - Approach to select terms for query expansion.
      - A thesaurus provides information on synonyms and semantically related words and phrases.
      - Example: physician
          - similar/synonymous: doctor, medical, MD
          - related: general practitioner, surgeon
  • 40. Thesaurus-based Query Expansion
      - For each term t in a query, expand the query with synonyms and related words of t from the thesaurus.
      - Added terms may be weighted less than the original query terms.
      - Generally increases recall.
      - May significantly decrease precision, particularly with ambiguous terms.
      - Example: “interest rate” → “interest rate fascinate evaluate”
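A minimal sketch of thesaurus-based expansion, assuming a hand-made toy thesaurus and a weight of 0.5 for added terms. Both are illustrative choices, as is the function name `expand_query`. It reproduces the "interest rate" example, including the precision problem caused by ambiguous terms.

```python
def expand_query(query_terms, thesaurus, added_weight=0.5):
    """Give original terms weight 1.0 and add related terms with a lower weight."""
    weights = {term: 1.0 for term in query_terms}
    for term in query_terms:
        for related in thesaurus.get(term, []):
            # Never lower the weight of a term that is already in the query.
            weights.setdefault(related, added_weight)
    return weights


# Toy thesaurus (hypothetical): these senses are wrong for the financial query.
toy_thesaurus = {
    "interest": ["fascinate"],
    "rate": ["evaluate"],
}

expanded = expand_query(["interest", "rate"], toy_thesaurus)
print(expanded)
# {'interest': 1.0, 'rate': 1.0, 'fascinate': 0.5, 'evaluate': 0.5}
```

Weighting the added terms below 1.0 limits, but does not remove, the damage done by expansions such as "fascinate" that belong to the wrong sense of the query term.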
  • 41. Global vs. Local Analysis
      - Global analysis requires intensive term correlation computation only once, at system development time.
      - Local analysis requires intensive term correlation computation for every query at run time (although the number of terms and documents is smaller than in global analysis).
      - But local analysis gives better results.
      - Term ambiguity may introduce irrelevant, statistically correlated terms during global analysis.
      - Example: “Apple computer” → “Apple red fruit computer”
  • 42. Global Analysis Refinements
      - Only expand the query with terms that are similar to all terms in the query, where c_ij is the correlation between terms k_i and k_j:

      $$\mathrm{sim}(k_i, Q) = \sum_{k_j \in Q} c_{ij}$$

          - “fruit” is not added to “Apple computer” since it is far from “computer.”
          - “fruit” is added to “apple pie” since “fruit” is close to both “apple” and “pie.”
      - Use more sophisticated term weights (instead of just frequency) when computing term correlations.
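The refinement can be illustrated with a small co-occurrence computation. This is a sketch under stated assumptions: c_ij is taken to be the number of documents in which terms k_i and k_j both occur, which is only one simple choice of correlation, and the six-document corpus and the function names are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(documents):
    """c[(a, b)] = number of documents that contain both terms a and b."""
    c = defaultdict(int)
    for text in documents:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            c[(a, b)] += 1
            c[(b, a)] += 1
    return c


def expansion_candidates(query_terms, c, vocabulary, top_n=2):
    """Rank non-query terms by sim(k_i, Q) = sum of c_ij over the query terms k_j."""
    scores = {
        term: sum(c.get((term, q), 0) for q in query_terms)
        for term in vocabulary if term not in query_terms
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


# Hypothetical corpus in which "apple" is ambiguous.
corpus = [
    "apple computer laptop",
    "apple computer software",
    "apple fruit pie",
    "apple pie fruit recipe",
    "fruit pie",
    "computer laptop software",
]
c = cooccurrence(corpus)
vocabulary = {word for text in corpus for word in text.split()}

for_computer = expansion_candidates(["apple", "computer"], c, vocabulary)
for_pie = expansion_candidates(["apple", "pie"], c, vocabulary, top_n=1)
print(for_computer)  # ['laptop', 'software']
print(for_pie)       # ['fruit']
```

Because "fruit" never co-occurs with "computer" in this corpus, its summed score for "apple computer" stays below that of "laptop" and "software", while for "apple pie" it is the top candidate.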
  • 43. Query Expansion Conclusions
      - Expansion of queries with related terms can improve performance, particularly recall.
      - However, similar terms must be selected very carefully to avoid problems such as loss of precision.