SlideShare a Scribd company logo
1 of 239
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the
IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
SEMESTER – VIII : PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT – I : INTRODUCTION
INTRODUCTION TO INFORMATION RETRIEVAL
What is IR?
• Information retrieval (IR) is finding material . . .
• of an unstructured nature . . .
• that satisfies an information need from within large collections .
. . . (usually stored on computers).
• Unstructured data means that
• a formal, semantically overt, easy-for-computer structure is missing.
• In contrast to the rigidly structured data used in DB style searching
(e.g. product inventories, personnel records)
What is IR?
• The process of actively seeking out information relevant
to a topic of interest (van Rijsbergen)
• Typically it refers to the automatic (rather thanmanual)
retrieval of documents
• Information Retrieval System (IRS)
• “Document” is the generic term for an information
• holder (book, chapter, article, webpage, etc)
What is IR?
• Information retrieval is
the science of searching for information
a)
b)
c)
d)
in a document,
searching for documents themselves, and
also searching for the metadata that describes data, and
for databases of texts, images or sounds.
What is IR?
• Information Retrieval (IR) can be defined as
• a software program that deals with the organization, storage,
retrieval, and evaluation of information from document
repositories, particularly textual information.
• Information Retrieval is the activity of obtaining material
that can usually be documented on an unstructured nature
• i.e. usually text which satisfies an information need from
within large collections which is stored on computers.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is IR?
• IR helps users
• find information that matches their information needs expressed as queries.
• Historically, IR is about document retrieval, emphasizing
document as the basic unit. – Finding documents relevant to
user queries
• Technically, IR studies the acquisition, organization, storage,
retrieval, and distribution of information.
• For example, Information Retrieval can be when a user
enters a query into the system.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is IR?
• Information retrieval (IR) is a broad area of
Computer Science focused
• primarily on providing the users with easy access
to information of their interest, as follows.
• Information retrieval deals with
• the representation,
• storage,
• organization of, and
• access to information items such as documents,
Web pages, online catalogs, structured and semi-
structured records, multimedia objects.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is IR?
•Information retrieval (IR)
in computing and information science is the process of
obtaining information system resources that are
relevant to an information need from a collection of
those resources. Searches can be based on full-text or
other content-based indexing.
: PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT – I : INTRODUCTION
What is IR?
•Information retrieval (IR) Quality :
• Are the retrieved documents about the target subject up-
to-date?
• from a trusted source?
• satisfying the user’s needs?
• How should we rank documents in terms of these factors?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What are IR Types?
• Information retrieval (IR) are of two types:
1. Precision and
2. recall
Above are the two parameters of retrieval effectiveness.
1.Precision refers to how many of the retrieved documents are
relevant to the user.
2.Recall refers to what fraction of relevant documents in the collection
are retrieved.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What are the types of information retrieval?
• Methods/Techniques in which information retrieval
techniques are employed include:
• Adversarial information retrieval.
• Automatic summarization. Multi-document summarization.
• Compound term processing.
• Cross-lingual retrieval.
• Document classification.
• Spam filtering.
• Question answering.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Why we do IR?
•Information retrieval can provide organizations with
immediate value
• --while it's important to try to figure out ways to capture
tacit knowledge,
• information retrieval provides a means to get at
information that already exists in electronic formats
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Why we do IR?
• Information retrieval can provide organizations with immediate value--while it's important to try to
figure out ways to capture tacit knowledge, information retrieval provides a means to get at
information that already exists in electronic formats.
• The representation and organization of the information items should be such as to
• provide the users with easy access to information of their interest.
• Nowadays, research in IR includes
• modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering,
languages.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is the need for information retrieval?
•Information retrieval can provide :
•organizations with immediate value
--while it's important to try to figure out ways to capture
tacit knowledge, information retrieval provides
a means to get at information that already
exists in electronic formats.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is the need for information retrieval in WEB?
• Web: – A huge, widely-distributed, highly heterogeneous,
semistructured,, interconnected, evolving,
hypertext/hypermedia information repository
• Main issues – Abundance of information
• The 99% of all the information are not interesting for the 99% of
all users – The static Web is a very small part of all the Web.
• Dynamic Website – To access the Web user need to exploit Search
Engines (SE)
• SE must be improved
• To help people to better formulate their information needs
• More personalization is needed
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB IR : What is Web IR?
• Web IR can be defined as
• the application of theories and methodologies from IR to the World Wide Web.
Web Information Retrieval models are ways of integrating many sources of
evidence about documents,
1. the links,
2. the structure of the document,
3. the actual content of the document,
4. the quality of the document, etc.
so that an effective Web search engine can be achieved.
: PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT – I : INTRODUCTION
Issues in IR
•The main issues of the
Information Retrieval
(IR) are :
1. Document and Query
Indexing
2. Query Evaluation
3. System Evaluation.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Issues in IR : Document and Query Indexing
• Document and Query Indexing
Main goal of Document and
Query Indexing is to find
important meanings and
creating an internal
representation.
• The factors to be considered
are accuracy to represent
semantics, exhaustiveness, and
facility for a computer to
manipulate.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Issues in IR : Query Evaluation
• Query Evaluation –
In the retrieval model how can a document be
represented with the selected keywords and
how are documents and query representations
compared to calculate a score.
• Information Retrieval (IR) deals with issues like
uncertainty and vagueness in information
systems.
• Uncertainty :
The available representation does not typically reflect
true semantics of objects such as images, videos etc.
• Vagueness :
The information that the user requires lacks clarity, is
only vaguely expressed in a query, feedback or user
action.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Issues in IR : System Evaluation
• System Evaluation –
System Evaluation tells about the
importance of determining the impact of
information given on user achievement.
Here, we see if the efficiency of the
particular system related to time and
space.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR in practice
• Not only librarians, professional searchers, etc engage
themselves in the activity of information retrieval
• but nowadays hundreds of millions of people engage in IR every day when
they use web search engines.
• Information Retrieval is believed to be the dominant form of
Information access
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR in practice
• Information Retrieval is a research-driven theoretical and
experimental discipline
• The focus is on different aspects of the information–
seeking process, depending on the researcher’s
background or interest:
• Computer scientist – fast and accurate search engine
• Librarian – organization and indexing of information
• Cognitive scientist – the process in the searcher’s mind
• Philosopher – Is this really relevant ?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR in practice
•Progress influenced by advances in Computational
Linguistics, Information Visualization, Cognitive
Psychology, HCI, …
• Experimental vs. operational systems
• Analogy to wcawrmwa.nsutfuadcteurnintgsf
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR Process
• An information retrieval process begins when a user enters a
query into the system.
• Queries are formal statements of information needs, for
example search strings in web search engines. In information
retrieval a query does not uniquely identify a single object in
the collection.
• Instead, several objects may match the query, perhaps with
different degrees of relevancy.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Fundamental concepts in IR
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Fundamental concepts in IR
•What is information ?
•Meaning vs. form
• Data vs. Information Retrieval
•Relevance
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
The stages of IR
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
FORMULATION OF IR PROCESS
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR system
• The IR system assists the users in finding the information they
require
• but it does not explicitly return the answers to the question.
• It notifies regarding the existence and location of documents
that might consist of the required information.
• Information retrieval also extends support to users in browsing
or filtering document collection or processing a set of retrieved
documents.
• The system searches over billions of documents stored on
millions of computers.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR system
• An IR system has the ability to represent, store, organize, and access
information items.
• A set of keywords are required to search. Keywords are what people
are searching for in search engines.
• These keywords summarize the description of the information.
• A spam filter, manual or automatic means are provided by Email
program for classifying the mails so that it can be placed directly into
particular folders.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
TOPIC – 2: EARLY DEVELOPMENTS
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the
IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• For more than 5, 000 years, man has organized information f or later
retrieval and searching.
• In its most usual form, this has been done by
• compiling,
• storing,
• organizing, and
• indexing clay tablets,
• hieroglyphics,
• papyrus rolls, and
• books.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• IR in the 17th
century: Samuel
Pepys, the famous
English diarist,
subject-indexed his
treasured 1000+
books library with
key words.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• IR in the 17th
century: Samuel
Pepys, the famous
English diarist,
subject-indexed his
treasured 1000+
books library with
key words.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• Document Collection: text units we have built an IR system
over.
• Usually documents
• But could be
• memos
• book chapters paragraphs scenes of a movie
• turns in a conversation...
• Lots of them
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• For holding the various items,
• special purpose buildings called libraries,
• from the Latin word liber for book, or bibliothekes,
• from the Greek word biblion for papyrus roll, are used.
• The oldest known library was created in Elba,
• in the “Fertile Crescent”,
• currently northern Syria,
• some time between 3,000 and 2,500 BC.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• In the seventh century BC,
• Assyrian king Ashurbanipal created the library of Nineveh, on the
Tigris River (today, north of Iraq),
• which contained more than 30,000 clay tablets at the time of its destruction in
612 BC.
• By 300 BC, Ptolemy Soter, a Macedonian general, created the Great
Library in Alexandria – the Egyptian city at the mouth of the Nile
named after
• the Macedonian king Alexander the Great (356-323 BC).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• For seven centuries the Great Library, jointly with other major libraries
in the city,
• made Alexandria the intellectual capital of the Western world.
• Since then, libraries have expanded and flourished. Nowadays, they
are everywhere.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• . They constitute the collective memory of the human race and their
popularity is in the rising.
• In 2008 alone, people in the US visited their libraries some 1.3 billion
times and checked out more than 2 billion items
• an increase in both yearly figures of more than 10 percent.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• Since the volume of information in libraries is always growing,
• it is necessary to build specialized data structures for fast search – the indexes.
In one form or another,
• indexes are at the core of every modern information retrieval system.
• They provide fast access to the data and allow speeding up query
processing.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• For centuries indexes have been created manually as sets of categories.
• Each category in the index is typically composed of labels that identify its
associated topics and of pointers to the documents that discuss those
topics.
• While these indexes are usually designed by library and information science
researchers,
• the advent of modern computers has allowed the construction of large indexes
automatically,
• which has accelerated the development of the area of Information Retrieval (IR).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• Early developments in IR date back to research efforts conducted in the 50’s
by pioneers such as
• Hans Peter Luhn,
• Eugene Garfield,
• Philip Bagley, and
• Cal vi n Moores,
• this last one having allegedly coined the term information retrieval.
• In 1955, Allen Kent and colleagues published a paper describing the
precision and recall metrics,
• which was followed by the publication in 1962 of the Cranfield studies by Cyril
Cleverdon.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• In 1963, Joseph Becker and Robert Hayes published the first book on
information retrieval [164].
• Throughout the 60’s, Gerard Salton and Karen Sparck Jones, among
others, shaped the field
• by developing the fundamental concepts that led to the modern technologies
of ranking in IR.
• In1968,thefirstIR book authored by Salton was published.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• . In 1971, N.Jardine and C.J.VanRijsbergen articulated the “cluster
hypothesis”.
• In 1978, the first ACM Conference on IR (ACM SIGIR) was held in
Rochester, New York.
• In 1979, C.J. Van Rijsbergen published Information Retrieval, which
focused on probabilistic models.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• In 1983, Salton and McGill published Introduction to Modern
Information Retrieval, a classic book on IR focused on vector models.
• Since then,
• the IR community has grown to include
• thousands of professors,
• researchers,
• students,
• engineers, and
• practitioners
• throughout the world.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
EARLY DEVELOPMENTS
• The main conference in the area,
• the ACM International Conference on Information Retrieval (ACM SIGIR),
• now attracts hundreds of attendees and receives hundreds of submitted
papers on an yearly basis.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
MODERN IR DEVELOPMENTS
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
MODERN IR DEVELOPMENTS
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
MODERN IR DEVELOPMENTS
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 3: THE IR PROBELM
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the
IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO INFORMATION RETRIEVAL (IR) PROBELM
What is IR PROBLEM?
Users of modern IR systems, such as search engine users, have
information needs of varying complexity.
In the simplest case, they are looking for the link to the homepage of a
company, government, or institution.
In the more sophisticated cases, they are looking for information
required to execute tasks associated with their jobs or immediate
needs.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM EXAMPLE
An example of a more complex information need is as
follows:
• Find all documents that address the role of the Federal
Government in financing the operation of the National Railroad
Transportation Corporation (AMTRAK).
• This full description of the user need does not necessarily provide
• the best formulation for querying the IR system.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM EXAMPLE
•Instead, the user might want to first translate this
information need into
• a query, or sequence of queries, to be posed to the
system.
• In its most common form, this translation yields
• a set of keywords, or index terms, which summarize the
user information need.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM EXAMPLE
• Given the user query,
• the key goal of the IR system is to retrieve information that is useful or relevant to the user.
• The emphasis is on the retrieval of information as opposed to the
retrieval of data.
• To be effective in its attempt to satisfy the user information need,
• the IR system must somehow ‘interpret’ the contents of the information items.
• That is, the documents in a collection, and rank them according to a
degree of relevance to the user query.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM
• Given the user query,
• the key goal of the IR system is to retrieve information that is useful or relevant to the user.
• The emphasis is on the retrieval of information as opposed to the
retrieval of data.
• To be effective in its attempt to satisfy the user information need,
• the IR system must somehow ‘interpret’ the contents of the information items.
• That is, the documents in a collection, and rank them according to a
degree of relevance to the user query.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM
• This ‘interpretation’ of a document content involves
extracting syntactic and semantic information from the
document text and using this information to match the user
information need.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM
• The IR Problem: the primary goal of an IR system is to
• retrieve all the documents that are relevant to a user query while retrieving as
few non relevant documents as possible.
• The difficulty is knowing not only how to extract information
from the documents
• but also knowing how to use it to decide relevance.
• That is, the notion of relevance is of central importance in IR.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IR PROBLEM
• One main issue is that relevance is a personal assessment that depends on the
task being solved and its context.
• For example:
• Relevance can change
• with time (e.g., new information becomes available),
• with location (e.g., the most relevant answer is the closest one), or
• even with the device (e.g., the best answer is a short document that is easier to download
and visualize).
• In this sense, no IR system can provide perfect answers to all users all the time.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 4:THE USERS TASK
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the
IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
AN OVERVIEW OF USER TASK : What is USER TASK IN IRS?
• The user of a retrieval system has to :
• translate his information need into a query in the language provided by the system.
• With an information retrieval system, this normally implies
• specifying a set of words which convey the semantics of the information need.
• With a data retrieval system,
• a query expression (such as, for instance, a regular expression) is used to
• convey the constraints that must be satisfied by objects in the answer set.
• In both cases, we say that the user searches for useful information executing
a retrieval task.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK : What is USER TASK IN IRS?
• Users of modern IR systems,
• such as search engine users, have information needs of varying complexity.
• The user of a retrieval system has to translate their information need into a query in
the language provided by the system.
• With an IR system, such as a search engine,
• this usually implies specifying a set of words that convey the semantics of the
information need.
• We say that the user is searching or querying for information of their interest.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK EXAMPLE
To illustrate,
• The user might be interested in documents about
• car racing in general and might decide to glance related documents about Formula 1 racing,
• Formula Indy, and the ‘24 Hours of Le Mans.
• We say that
• the user is browsing or navigating the documents in the collection, not searching.
• It is still a process of retrieving information,
• but one whose main objectives are less clearly defined in the beginning.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK
• While searching for information of interest is the main
retrieval task on the Web,
• search can also be used for satisfying other user needs distinct from
information access,
• such as the buying of goods and the placing of reservations.
• Consider now a user
• who has an interest that is either poorly defined or inherently
broad,
• such that the query to specify is unclear.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK
• The task in this case is more related to
• exploratory search and resembles a process of quasi-sequential search for
information of interest.
• Here we, make a clear distinction between the different tasks
the user of the retrieval system might be engaged in.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK
• The task might be then of two distinct types: searching and
browsing, as illustrated in Figure:
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK
• In a process of retrieving
information, one whose
main objectives are not
clearly defined in the
beginning and whose
purpose might change
during the interaction with
the system.
• Then, user task may go
with Browsing only.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK
• User Choice of
Information
Retrieval:
• Push
• Pull
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
IRS – USER TASK
• Both retrieval and browsing are, in the language of the World Wide
Web, `pulling' actions.
• That is, the user requests the information in an interactive manner.
• An alternative is to do retrieval in an automatic and permanent
fashion using software agents which push the information towards
the user.
• For instance, information useful to a user could be extracted
periodically from a news service.
• In this case, we say that the IR system is executing a particular retrieval task which
consists of filtering relevant information for later inspection by the user.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 5: THE INFORMATION VERSUS
RETRIEVAL
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus
Data Retrieval
6. The IR System
7. The Software Architecture of the IR System
8. The Retrieval and Ranking Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO INFORMATION
What is Information?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO RETRIEVAL
What is Retrieval?
• This does not mean that there is
no structure in the data
Document structure (headings,
paragraphs, lists. . . )
• Explicit markup formatting (e.g.
in HTML, XML. . . ) Linguistic
structure (latent, hidden)
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
SELECT *
FROM business
catalogue WHERE
category =
’florist’ AND city
zip = ’cb1’
INTRODUCTION TO RETRIEVAL
What is Information retrieval (IR) ?
Information retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text)
that satisfies an information need from within large
collections (usually stored on computers).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
INTRODUCTION TO RETRIEVAL
• An information need is
• the topic about which the user desires to know
more about.
• A query is
• what the user conveys to the computer in an
attempt to communicate the information need.
• A document is relevant
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Known-item search.
Precise information seeking
search Open-ended search
• if the user perceives that it contains information (“topical search”)
of value with respect to their personal
information need.
THE INFORMATION VERSUS RETRIEVAL
• Data retrieval, in the context of an IR system, consists
• mainly of determining which documents of a collection contain
• the keywords in the user query which, most frequently, is not enough to satisfy
the user information need.
• In fact, the user of an IR system is concerned more with
• retrieving information about a subject than with retrieving data that satisfies a
given query.
• For instance, a user of an IR system is willing to accept
• documents that contain synonyms of the query terms in the result set,
• even when those documents do not contain any query terms.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
INFORMATION RETRIEVAL
• “Ad hoc” retrieval web retrieval
• Support for browsing and filtering document collections:
• Clustering
• Classification; using fixed labels (common information needs, age groups,
topics; )
• Further processing a set of retrieved documents,
• e.g., by using natural language processing
• Information extraction Summarization Question answering.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
RETRIEVAL TYPES
1) Web search ( )
• Search ground are billions of documents on millions of computers
• issues: spidering; efficient indexing and search; malicious manipulation to boost
search engine rankings
2) Enterprise and institutional search ( )
• e.g company’s documentation, patents, research articles often domain-specific
• Centralised storage; dedicated machines for search.
• Most prevalent IR evaluation scenario: US intelligence analyst’s searches
3) Personal information retrieval (email, pers. documents; )
• e.g., Mac OS X Spotlight; Windows’ Instant Search
• Issues: different file types; maintenance-free, lightweight to run in background
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE INFORMATION VERSUS RETRIEVAL
• In an IR system the retrieved objects might be
• inaccurate and small errors are likely to go unnoticed.
• In a data retrieval system, on the contrary,
• a single erroneous object among a retrieval system,
• such as defined structure and semantics thousand retrieved
objects means total failure.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE INFORMATION VERSUS RETRIEVAL
• While A data a relational database, deals with data that has a well
defined structure and semantics.
• while an IR system deals with natural language text which is not
well structured.
• Data retrieval, while providing a solution to the user of a database
system, does not solve
• the problem of retrieving information about a subject or topic.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE INFORMATION VERSUS RETRIEVAL
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Data Information
Unorganized raw facts that need processing without
which it is seemingly random and useless to humans
Information is a processed, organized data presented in a given
context and is useful to humans.
Data is an individual unit that contains raw material
which does not carry any specific meaning.
Information is a group of data thatcollectively carry a logical
meaning
Data Doesn't depended on information. Information depends on data.
It is measured in bits and bytes. Information is measured in meaningfulunits like time, quantity, etc.
Data is never suited to thespecific needs of a designer. Information is specific to the expectations and requirements
because all the irrelevant facts and figures are removed, during the
transformation process.
An example of Data is a
Student’score.
The average score of a class is the information derived from the
given data.
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 6: THE IR SYSTEM
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the
IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
The role of an IR system
A modern view:
• Support the user in
– exploring a problem domain, understanding its
terminology, concepts and structure
– clarifying, refining and formulating an information need
– finding documents that match the info need description
• As many relevant docs as possible
• As few non-relevant documents as possible
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
How does it do this ?
• User interfaces and visualization tools for
– exploring a collection of documents
– exploring search results
• Query expansion based on
– Thesauri
– Lexical/statistic analysis of text / context and concept formation
– Relevance feedback
• Indexing and matching model
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
How well does it do this?
• Evaluation
– Of the components
• Indexing / matching algorithms
– Of the exploratory process overall
• Usability issues
• Usefulness to task
• User satisfaction
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What do we want from an IRS ?
• Systemic approach
– Goal (for a known information need):
• Return as many relevant documents as possible and as few
non-relevant documents as possible
• Cognitive approach
–Goal (in an interactive information-seeking environment,
with a given IRS):
• Support the user’s exploration of the problem domain and the task completion.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
OIRS is the techniques of storing and recovering and
often disseminating recorded data especially through
the use of a computerized system”
(Merriam Webster Dictionary)
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
QUALITIES OF IRS
• The effectiveness of an IR system (i.e., the quality of its search results) is
determined by two key statistics about the system’s returned results for a query:
1. Precision: What fraction of the returned results are relevant to the
information need?
2. Recall: What fraction of the relevant documents in the collection were
returned by the system?
• Queries to be addressed:
• What is the best balance between the two?
• Easy to get perfect recall: just retrieve everything
• Easy to get good precision: retrieve only the most relevant
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM
• To describe the IR system, we use a simple and generic software architecture as shown in Figure
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
•The first step in setting up an IR system is to
assemble the document collection,
• which can be private or be crawled from the Web. In the
second case a crawler module is responsible for
collecting the documents.
•The document collection is stored in disk storage
usually referred to as the central repository.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
•The documents in the central repository need to be
indexed for fast retrieval and ranking.
•The most used index structure is an inverted index
composed of all the distinct words of the collection
and, for each word,
• a list of the documents that contain it.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
•Given that the document collection is indexed, the
retrieval process can be initiated.
•It consists of retrieving documents that satisfy either
a user query or a click in a hyper link.
• In the first case, we say that the user is searching for
information of interest;
• in the second case, we say that the user is browsing for
information of interest.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Use retrieval as PROCESS it applies to the searching process
requires browsing and how it compares to searching.
• To search, the user first specifies a query that reflects their
information need.
• Next, the user query is parsed and expanded with, for
instance, spelling variants of a query word.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The expanded query,
• which we refer to as the system query, is then processed against the index to
retrieve a subset of all documents.
• Following, the retrieved documents are ranked and the top
documents are returned to the user.
• The purpose of ranking is to identify the documents that are
most likely to be considered relevant by the user, and
constitutes the most critical part of the IR system.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Given the inherent subjectivity in deciding relevance,
• evaluating the quality of the answer set is a key step for improving the IR
system.
• A systematic evaluation process allows
• fine tuning the ranking algorithm and improving the quality of the results.
• The most common evaluation procedure consists of
comparing
• the set of results produced by the IR system with results suggested by
human specialists.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 7: THE SOFTWARE ARCHITECTURE OF
THE IR SYSTEM
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of
the IR System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
SYSTEM ARCHITECTURE OF IRS
A high level view of the software architecture of an IR
system will provide:
1) Components
2) Tools
3) Environment
4) Data source(s)
Also additional elements needed for through
understanding of Data flow.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM
• To describe the IR system, we use a simple and generic software architecture as shown in Figure
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM
• To describe the IR system, we use a simple and generic software architecture as shown in Figure
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The first step in setting up an IR
system is to assemble the
document collection,
• which can be private or be crawled
from the Web. In the second case a
crawler module is responsible for
collecting the documents.
• The document collection is stored
in disk storage usually referred to
as the central repository.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The documents in the central
repository need to be indexed
for fast retrieval and ranking.
• The most used index structure
is an inverted index composed
of all the distinct words of the
collection and, for each word,
• a list of the documents that
contain it.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
•Given that the document collection is indexed, the
retrieval process can be initiated.
•It consists of retrieving documents that satisfy either
a user query or a click in a hyper link.
• In the first case, we say that the user is searching for
information of interest;
• in the second case, we say that the user is browsing for
information of interest.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Use retrieval as PROCESS it applies to the searching process
requires browsing and how it compares to searching.
• To search, the user first specifies a query that reflects their
information need.
• Next, the user query is parsed and expanded with, for
instance, spelling variants of a query word.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The expanded query,
• which we refer to as the system query, is then processed against the index to
retrieve a subset of all documents.
• Following, the retrieved documents are ranked and the top
documents are returned to the user.
• The purpose of ranking is to identify the documents that are
most likely to be considered relevant by the user, and
constitutes the most critical part of the IR system.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Given the inherent subjectivity in deciding relevance,
• evaluating the quality of the answer set is a key step for improving the IR
system.
• A systematic evaluation process allows
• fine tuning the ranking algorithm and improving the quality of the results.
• The most common evaluation procedure consists of
comparing
• the set of results produced by the IR system with results suggested by
human specialists.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• To improve the ranking,
• we might collect feedback from the users and use this
information to change the results.
• In the Web,
• the most abundant form of user feedback are the clicks on the
documents in the results set.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• Another important source of information for Web ranking are the
hyperlinks among pages,
• which allow identifying sites of high authority.
• There are many other concepts and technologies that bear impact
on
• the design of a full fledged IR system, such as a modern search engine.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 8: THE RETRIEVAL AND RANKING
PROCESSES
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Ranked Retrieval
• Ranked retrieval
• The documents are ranked based on their score
• Advantages
– Query easy to specify
–The output is ranked based on the estimated relevance of the
documents to the query
– A wide variety of theoretical models exist
• Disadvantages
– Query less precise (although weighting can be used)
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• To describe the retrieval and ranking processes, we further
elaborate on our description of the modules shown in Figure
1.2, as illustrated in Figure 1.3.
• Given the documents of the collection,
1. we first apply text operations to them such as eliminating stop
words, stemming, and
2. selecting a subset of all terms for use as indexing terms.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• To describe the IR system Retrieval and Ranking Processes, we illustrate it through the figure
as given below:
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• To describe the IR system Retrieval and Ranking Processes, we illustrate it through the figure
as given below:
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• The indexing terms are then used to compose document
representations,
• which might be smaller than the documents themselves (depending on the
subset of index terms selected).
• Given the document representations, it is necessary to build
an index of the text.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• Different index structures might be used,
• but the most popular one is an inverted index.
• The steps required to generate the index compose the indexing process and
must be executed offline,
• before the system is ready to process any queries.
• The resources (time and storage space) spent on the indexing process are amortized by
querying the retrieval system many times.
• Given that the document collection is indexed, the retrieval process can be initiated.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• The user first specifies a query that reflects their information
need.
• This query is then parsed and modified by operations that
resemble those applied to the documents.
• Typical operations at this point consist of spelling corrections
and elimination of terms such as stop words,
• whenever appropriated.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• Next, the transformed query is expanded and modified.
• For instance, the query might be modified using query suggestions made by
the system and confirmed by the user.
• The expanded and modified query is then processed to obtain the
set of retrieved documents,
• which is composed of documents that contain the query terms.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• Fast query processing is made possible by the index structure
previously built.
• The steps required to produce the set of retrieved documents
constitute the retrieval process.
• Next, the retrieved documents are ranked according to a likelihood
of relevance to the user.
• This is a most critical step because the quality of the results,
• as perceived by the users, is fundamentally dependent on the ranking.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE RETRIEVAL AND RANKING PROCESSES
• The top ranked documents are then formatted for presentation to
the user.
• The formatting consists of retrieving the title of the documents and
generating snippets for them,
• i.e., text excerpts that contain the query terms,
• which are then displayed to the user.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS)
• The documents in the central
repository need to be indexed
for fast retrieval and ranking.
• The most used index structure
is an inverted index composed
of all the distinct words of the
collection and, for each word,
• a list of the documents that
contain it.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 9: THE WEB
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB – What is Web?
• The Web, or World Wide Web (W3), is basically a
system of Internet servers that support specially
formatted documents.
• The documents are formatted in a markup
language called HTML (HyperText Markup
Language) that supports links to other
documents, as well as graphics, audio, and video
files.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB – What Does Web Mean?
• The Web is the common name for the World Wide
Web, a subset of the Internet consisting of the pages
that can be accessed by a Web browser.
• Many people assume that the Web is the same as the
Internet, and use these terms interchangeably.
• However, the term Internet actually refers to the global network of
servers that makes the information sharing that happens over the
Web possible.
• So, although the Web does make up a large portion of the Internet,
but they are not one and same.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB – KEY TOPICS
• WEB SITE OR WEB PAGES
• WEB SERVER
• WEB CLIENT
• WEB APPLICATIONS OR WEB APPs
• WEB SOFTWARES – WEB 1.0, WRB 2.0, WEB 3.0
• WEB SERVICES
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB - A Brief History
• “As We May Think” influenced people like Douglas
Engelbart who,
• at the Fall Joint Computer Conference in San Francisco in December
of 1968,
• ran a demonstration in which he introduced the first ever computer
mouse, video conferencing, teleconferencing, and hypertext.
• It was so incredible that it became known as “the
mother of all demos” [1690].
• Of the innovations displayed, the one that interests us
the most here is hypertext.
• The term was coined by Ted Nelson in his Project
Xanadu.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB - A Brief History
• Hypertext allows the reader to jump from one electronic document
to another,
• which was one important property regarding the problem Tim Berners-Lee
faced in 1989.
• At the time, Berners-Lee worked in Geneva at the CERN – Conseil
Europeen pourla Recherche Nucleaire.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB - A Brief History
• Researchers who wanted to share documentation with others had to
• reformat their documents to make them compatible with an internal
publishing system.
• It was annoying and generated many questions,
• many of which ended up been directed towards Berners-Lee.
• He understood that a better solution was required.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB - A Brief History
• It just so happened that CERN was the largest Internet node in Europe.
• Berners Lee reasoned that
• it would be nice if the solution to the problem of sharing documents were
decentralized, such that the researchers could share their contributions freely.
• He saw that
• a networked hypertext, through the Internet, would be a good solution
and started working on its implementation
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WEB - A Brief History
• In 1990,
• he wrote the HTTP protocol, defined the HTML language,
• wrote the first browser, which he called “World Wide Web”, and the first
Web server.
• In 1991,
• he made his browser and server software available in the Internet. The Web
was born.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Web Information Retrieval : What is web in information retrieval?
•Web Information retrieval is
• the process of searching within
a huge World Wide Web
document collection for a
particular information
need (called a query).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Web Information Retrieval : What is web in information retrieval?
• The Web can be considered as
• a large-scale document collection, for which classical text retrieval techniques can be
applied.
• However, its unique features and structure offer new sources of evidence that can be
used to enhance the effectiveness of Information Retrieval (IR) systems.
• Generally, Web IR examines
• the combination of evidence from both the textual content of documents and
• the structure of the Web,
• as well as the search behavior of users and issues related to the evaluation of retrieval
effectiveness in the Web setting.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Web Information Retrieval
• Web documents / data
• No traditional collection
– Huge
• Time and space to crawl index
• IRSs cannot store copies of documents
– Dynamic, volatile, anarchic, un-
controlled
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Web Information Retrieval
– Homogeneous sub-collections
• Structure
– In documents (un-/semi-/fully-structured)
– Between docs: network of inter-connected nodes
– Hyper-links - conceptual vs. physical documents
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Web Information Retrieval
• Web Information Retrieval models are
• ways of integrating many sources of evidence about documents, such
as
1. the links,
2. the structure of the document, the actual content of the document,
3. the quality of the document, etc.
4. so that an effective Web search engine can be achieved.
• In contrast with the traditional library-type settings of IR systems,
• the Web is a hostile environment, where Web search engines have to deal with
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 10: THE E-PUBLISHING ERA
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is E-Publishing?
• Electronic publishing (also referred to as publishing, digital publishing,
or online publishing) includes
• the digital publication of
• e-books,
• digital magazines, and
• the development of digital libraries and catalogues.
• It also includes
• the editing of
• books,
• journals and
• Magazines
to be posted on a screen (computer, e-reader, tablet, or Smartphone).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Need of E-Publishing
• Since its inception,
• the Web became a huge success.
• The number of Web pages now far exceeds
• 20 billion and the number of Web users in the world exceeds 1.7 billion.
• It is known that there are
• more than one trillion distinct URLs on the Web,
• even if many of them are pointers to dynamic pages, not static HTML pages.
• A viable model of economic sustainability based on online advertising was
developed.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE E-PUBLISHING ERA
• Electronic publishing has become common in scientific publishing
• where it has been argued that peer-reviewed scientific journals are in the process of
being replaced by electronic publishing.
• It is also becoming common to distribute
• books, magazines, and newspapers to consumers through tablet reading devices,
• a market that is growing by millions each year,generated by online vendors .
• Apple's iTunes bookstore, Amazon's bookstore for Kindle, and books in the Google Play
Bookstore.
• Market research suggested that half of all magazine and newspaper
circulation is based on E-Publishing.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE E-PUBLISHING ERA
• The advent of the Web changed the world in a way that few people
could have anticipated.
• Yet, one has to wonder on the characteristics of the Web that have made it
so successful.
• Is there a single characteristic of the Web that was most decisive
for its success?
• The simple HTML markup language, the low access costs, the wide spread
reach of the Internet, the interactive browser interface, the search engines.
• While providing the fundamental infrastructure for the Web,
• these technologies were not the root cause of its popularity.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE E-PUBLISHING ERA – It ‘s Time of Birth
• What was it then?
• The fundamental shift in human relationships, introduced by the Web, was freedom to publish.
• Example:
• Jane Austen is one of the most famous writers in English literature. Her books are read
by people all over the world and have been made into countless TV, film, theatre and
radio adaptations.
• This is all the more impressive because she only wrote six full-length novels.
• Jane Austen did not have that freedom,
• so she had to either convince a publisher of the quality of her work or pay for the publication of
an edition of it herself.
• Since she could not pay for it,
• she had to be patient and wait for the publisher to become convinced.
• It took 15 years.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
THE E-PUBLISHING ERA – It ‘s Time of Birth
• In the world of the Web, this is no longer the case.
• People can now publish their ideas on the Web and reach millions of
people over night,
• without paying anything for it and without having to convince the editorial board of
a large publishing company.
• Restrictions imposed by mass communication media companies and
• by natural geographical barriers were almost entirely removed by the invention of
the Web,
• which has led to a freedom to publish that marks the birth of a new era.
• One which we refer to as The e-Publishing Era.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 11: HOW THE WEB CHANGED SEARCH
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
• Web search is
• today ‘s the most prominent application of IR and its techniques.
• Indeed, the ranking and indexing components of
any search engine are
• fundamentally IR pieces of technology.
•
• An immediate consequence is that
the Web has had a major impact in the development of IR
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
1) The first major impact of the Web on search is
• relatedto the characteristicsof the document collectionitself.
• The Web collection is
• composed of documents(or pages) distributed over millions of sites and connected through hyperlinks,
• i.e., links that associate a piece of text of a page with other Web pages.
• The inherent distributed nature of the Web collection requires
• collectingall documents and storing copies of them in a central repository,prior to indexing.
• This new phase in the IR process, introduced by the Web, is called Web Search.
• The system that implements the IR process(es) is called Search Engine or IR System.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
2) The second major impact of the Web on search is related to
• the size of the collection and the volume of user queries submitted on a daily basis.
•
•
• Given that the Web grew larger and faster than any previous known text collection,
the search engines have now to handle a volume of text that far exceeds 20 billion pages, i.e.,
a volume of text much larger than any previous text collection
The volume of user queries is also much larger than ever before, even if estimates vary
widely.
•
• The combination of
a very large text collection with a very high query traffic has pushed the performance
and scalability of search engines to limits that largely exceed those of any previous IR
system.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
• That is,
• performance and scalability have become critical characteristics of the IR system, much
more than they used to be prior to the Web.
• While we do not discuss performance and scalability of search engines.
3) The third major impact of the Web on search is also related to the vast
size of the document collection.
• In a very large collection, predicting relevance is much harder than before.
• Basically, any query retrieves a large number of documents that match its terms,
which means that there are many noisy documents in the set of retrieved documents.
•
• That is, documents that seem related to the query
but are actually not relevant to it according to the judgment of a large fraction of the
users are retrieved.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
• This problem first showed up in the early Web search engines and became more severe as the Web grew.
• Fortunately, the Web also includes
• new sources of evidence not present in standard document collections that can be used to alleviate the problem,
such as hyperlinksand user clicks in documents in the answer set.
• Two other major impacts of the Web on search derive from the fact that the Web is not just a repository of
documents and data, but also a medium to do business.
• One immediate implication is that the search problem has been extended beyond the seeking of text
information to also encompass other user needs such as the price of a book, the phone number of a
hotel, the link for downloading a software.
• Providing effective answers to
• these types of information needs frequentlyrequiresidentifying structured data associatedwith the object of
interestsuch as price, location, or descriptionsof some of its key characteristics.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
The fifth and final impact of the Web on search derives from Web
advertising and other economic incentives.
The continued success of the Web as an interactive media for the
masses created incentives for its economic exploration in the
form of, for instance, advertising and electronic commerce.
These incentives led also to the abusive availability of commercial
information disguised in the form of purely informational
content, which is usually referred to as Web spam.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
HOW THE WEB CHANGED SEARCH
The increasingly pervasive presence of spam on the Web has made the quest for
relevance even more difficult than before,
i.e., spam content is sometimes so compelling that it is confused with truly
relevant content.
Because of that, it is not unreasonable to think that spam makes relevance
negative,
i.e., the presence of spam makes the current ranking algorithms produce
answers sets that are worst than they would be if the Web were spam
free.
This difficulty is so large that today we talk of Adversarial Web Retrieval.
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATIONRETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 12: PRACTICAL ISSUES ON THE WEB
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the
Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
PRACTICAL ISSUES ON THE WEB
• Electronic commerce is a major trend on the
Web nowadays and one which has benefited
millions of people.
• In an electronic transaction, the buyer usually
submits to the vendor credit information to be
used for charging purposes.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
PRACTICAL ISSUES ON THE WEB
• In its most common form, such information consists of
a credit card number.
• For security reasons, this information is usually
encrypted, as done by institutions and companies that
deploy automatic authentication processes.
• Besides security, another issue of major interest is
privacy. Frequently, people are willing to exchange
information as long as it does not become public.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
PRACTICAL ISSUES ON THE WEB
• The reasons are many,
• but the most common one is to protect oneself against misuse of private information by third
parties.
• Privacy is another issue which affects the deployment of the Web and which has not
been properly addressed yet.
• Two other important issues are
1. copyright and
2. patent rights.
• It is far from clear how the wide spread of data on the Web affects copyright and
patent laws in the various countries.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
PRACTICAL ISSUES ON THE WEB
•
•
•
• This is important because it affects the business of building up and
deploying large digital libraries.
For instance,
is a site which supervises all the information it posts acting as a publisher?
And if so, is it responsible for misuse of the information it posts (even if it
is not the source)?
•
• Additionally, other practical issues of interest include
scanning, optical character recognition (OCR), and cross-language retrieval
• (in which the query is in one language but the documents retrieved are in
another language).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 13: HOW PEOPLE SEARCH
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
HOW PEOPLE SEARCH?
• Search tasks range from the relatively simple
• e.g., looking up disputed facts or finding weather
information
• to the rich and complex
• e.g., job seeking and planning vacations.
• Search interfaces should support a range of tasks,
• while taking into account how people think about searching
for information.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Information Lookup versus Exploratory Search
• User interaction with search interfaces differs depending on
a) the type of task,
b) the amount of time and
c) effort available to invest in the process, and the domain expertise
of the information seeker.
• The simple interaction dialogue used in Web search engines is
most appropriate for :
1. finding answers to questions or
2. to finding Web sites or other resources that act as search starting
points.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
HOW PEOPLE SEARCH?
•But, as Marchionini notes,
•the “turn-taking” interface of Web search engines is
• inherently limited and in many cases is being supplanted by
specialty search engines
• such as
• for travel and health information – that offer richer
interaction models.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Classic versus Dynamic Model of Information Seeking
• Researchers have developed numerous theoretical models of
• how people go about doing search tasks?
• The classic notion of the information seeking process model as
described by
• Sutcliffe and Ennis is formulated as a cycle consisting of four main
activities:
1. problem identification,
2. articulation of information need(s),
3. query formulation, and
4. results evaluation.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Classic versus Dynamic Model of Information Seeking
• The standard model of the information seeking process contains
• an underlying assumption that the user’s information need is static and the
information seeking process is one of successively refining a query until
• all and only those documents relevant to the original information need have been
retrieved.
•
• More recent models emphasize the dynamic nature of the search process,
noting that users learn as they search, and their information needs adjust as they see
retrieval results and other document surrogates.
• This dynamic process is sometimes referred to as the berry picking model
of search.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
UNIT – I: INTRODUCTION
Topic 14: SEARCH INTERFACES TODAY
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
SEARCH INTERFACES TODAY
a)
b)
c)
•
•
•
•
• At the heart of the typical search session is
a cycle of query specification,
inspection of retrieval results, and
query reformulation.
• As the process proceeds, the searcher learns about
their topic, as well as about the available information sources.
• There are several user interface components that have become
standard in search interfaces and which exhibit high usability.
• As these components are described,
the design characteristics that they support will be underscored.
• Ideally, these components are integrated together to
support the different parts of the process, but it is useful to discuss each separately.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification
• Once a search starting point has been selected, the primary methods for
a searcher to express their information need are
• either entering words into a search entry form or selecting links from a
directory or other information organization display.
• For Web search engines,
• the query is specified in textual form.
• Today this is usually done by typing text on a keyboard,
• but in future,
• query specification via spoken commands will most likely become increasingly
common, using mobile devices as the input medium.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification
• Typically in Web queries today,
• the text is very short, consisting of one to three words.
• Multiword queries are often meant to be construed as a phrase,
• but can also consist of multiple topics.
• Short queries reflect the standard usage scenario in which the user “tests the
waters” to
• see what the search engine returns in response to their short query.
• If the results do not look relevant, then
• the user reformulates their query.
• If the results are promising, then
• the user navigates to the most relevant- looking Web site and pursues more fine-tuned queries
on that site.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification
• This search behavior, in which a general query is used to find a
promising part of the information space, and then
• follow hyperlinks within relevant Web sites,
• is a demonstration of the orienteering strategy of Web search.
• There is evidence that in many cases searchers would prefer to
state their information need in more detail,
• but past experience with search engines taught them that this
method does not work well,
• and that keyword querying combined with orienteering works better.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• The standard interface for a textual query is a search box entry form,
• in which the user types a query, activated by hitting the return key on the keyboard or selecting a button
associated with the form.
• Studies suggest a relationship between query length and the width of the entry form;
• results find that either small forms discourage long queries or wide forms encourage longer
queries.
• Some entry forms are divided into multiple components
• , allowing for a more general free text query followed by a form that filters the query in some way.
• For instance, at yelp.com, the user enters
• a general query into the first entry form and refines the search by location in the second form (see Figure
2.1).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• Forms allow for selecting information that has been used in the past;
sometimes this information is structured and allows for setting parameters to
be used in future.
•
For instance, the yelp.com
• form shows the user’s home location (if it has been indicated in the past) along
with recently specified locations and the option to add additional locations.
•
• An increasingly common strategy within the search form is to show hints about
what kind of information should be entered into each form via greyed-out text.
For instance, in zvents.com search (see Figure 2.2), the first box is labeled
• “what are you looking for?”
• while the second box is labeled “when (tonight, this weekend, ...)”..
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• When the user places the cursor into the entry form, the grey text
disappears, and the user can type in their query terms. This example also
illustrates
•
specialized input types that some search engines are supporting today.
• For instance, the zvents.com site recognizes that
• words like “tomorrow” are time-sensitive, and interprets them in the expected
manner.
• It also allows flexibility in the syntax of more formal specification of
dates.
• So searching for “comedy” on “wed” automatically computes the date for the
nearest future Wednesday.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• This is an example of designing the interface to reflect
how people think,
• rather than making how the user thinks conform to the brittle, literal
expectations of typical programs.
• (This approach to “loose” query specification works
better for “casual” interfaces in which getting the date
right is not critical to the use of the system;
• casual date specification when filling out a tax form is not acceptable, as
the cost of error is too high.)
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• This is an example of designing the interface to reflect
how people think,
• rather than making how the user thinks conform to the brittle, literal
expectations of typical programs.
• (This approach to “loose” query specification works
better for “casual” interfaces in which getting the date
right is not critical to the use of the system;
• casual date specification when filling out a tax form is not acceptable, as
the cost of error is too high.)
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• An innovation that has greatly improved query specification is the inclusion of a dynamically
generated list of query suggestions, shown in real time as the user types the query.
• This is referred to variously as auto-complete, auto-suggest, and dynamic query suggestions.
• A large log study found that users clicked on dynamic suggestions in the Yahoo Search Assist tool
about one third of the time they were presented .
• Often the suggestions shown are those whose prefix matches the characters typed so far,
• but in some cases, suggestions are shown that only have interior letters matching.
• If the user types a multiple word query, suggestions may be shown that are synonyms of what has
been typed so far, but which do not contain lexical matches.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• To exemplify, Netflix.com both describes what is wanted in gray and then shows hits via a
dropdown box.
• In dynamic query suggestion interfaces, the display of the matches varies.
• Some interfaces color the suggestions according to category information.
• In most cases, the user must move the mouse down to the desired suggestion in order to
select it,
• at which point the suggestion is used to fill the query box.
• In some cases, the query is then run immediately; in others, the user must hit the Return
key or click the Search button.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• The suggestions can be derived from several sources. In some
cases, the list is taken from the user’s own query history, in other
cases,
• it is based on popular queries issued by other users.
• The list can be derived from a set of metadata that a Web site’s
designer considers important,
• such as a list of known diseases or gene names for a search over
pharmacological literature (see Figure 2.3),
• a list of product names when searching within an e-commerce site, or a
list of known film names when searching a movie site.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Specification Interfaces
• The suggestions can also be derived from all of the text contained within a
Web site.
• Another form of query specification consists of choosing from a display of
information,
• typically in the form of hyperlinks or saved bookmarks.
• In some cases,
• the action of selecting a link produces more links for further navigation, in addition to
results listings.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Query Reformulation
• After a query is specified and results have been produced, a number of tools
exist to help the user reformulate their query,
• or take their information seeking process in a new direction.
• Analysis of search engine logs shows that reformulation is a common
activity;
• one study found that more than 50% of searchers modified at least one query during a
session, with nearly a third of these involving three or more queries.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Organizing Search Results
• Searchers often express
• a desire for a user interface that organizes search results into meaningful groups to
help understand the results and decide what to do next.
• A longitudinal study in which users were provided with
• the ability to group search results found that users changed their search habits in
response to having the grouping mechanism available.
• Currently two methods for grouping search results are popular: category
systems, especially faceted categories, and clustering.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Organizing Search Results
• A category system is a set of meaningful labels organized in such a way as to
reflect the concepts relevant to a domain.
• They are usually created manually, although assignment of documents to
categories can be automated to a certain degree of accuracy.
• Good category systems have the characteristics of being coherent and
(relatively) complete,
• their structure is predictable and consistent across search results for an information
collection.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Any Questions?
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
P1WU
 UNIT – I: INTRODUCTION Topic 15: VISUALIZATION SEARCH
 INTERFACES
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
UNIT I INTRODUCTION
1. Information Retrieval
2. Early Developments
3. The IR Problem
4. The Users Task
5. Information versus Data Retrieval
6. The IR System
7. The Software Architecture of the IR
System
8. The Retrieval and Ranking
Processes
9. The Web
10. The e-Publishing Era
11. How the web changed Search
12. Practical Issues on the Web
13. How People Search
14. Search Interfaces Today
15. Visualization in Search
Interfaces
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
WHAT IS VISUALIZATION ?
•Visualization or visualisation is
•any technique for creating images,
diagrams, or animations to
communicate a message.
•Visualization through visual imagery
has been an effective way to
communicate both abstract and
concrete ideas since the dawn of
humanity.
• Wikipedia
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is an example of visualizing?
•Visualize is to imagine, or to paint a
picture of something in your mind,
or to make something visible.
•When you close your eyes and
imagine yourself winning first
prize,
• this is an example of when you
visualize winning first prize.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What are the stages of visualization?
• The five phases of visualization process:
1. data gathering,
2. processing,
3. preparation,
4. reduction and visual layout design.
• In recent years, a comparably fresh research field —
• information visualization has become commonly available for
the researchers of all specialties.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is visualization research?
• Visualization research develops tools to
scaffold the exploratory analysis
process and deepens our
understanding of how visualizations
support meaning-making about data.
• A recent trend in the community
involves
• understanding how visualizations can facilitate
interpretability of deep learning models.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
VISUALIZATION STEPS
•The following concrete
steps are required to be
a competent
Visualization expert :
1) Ability to Visualize
2) Finding opportunities
3) Learning ability
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
VISUALIZATION SEARCH INTERFACES
•Text as a representation is highly effective for conveying
abstract information,
•but reading and even scanning text is a cognitively taxing activity, and
must be done in a linear fashion.
•By contrast, images can be scanned quickly and the visual
system perceives information in parallel.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
VISUALIZATION SEARCH INTERFACES
•People are highly attuned to
•images and visual information, and pictures and graphics can be captivating and
appealing.
•A visual representation can communicate
•some kinds of information much more rapidly and effectively than any other
method.
•Consider :
•the deference between a written description of a person’s face and a
photograph of it, or
• the deference between a table of numbers containing a correlation and a
scatter plot showing the same information.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
VISUALIZATION SEARCH INTERFACES
• Experimentation with visualization
for search has been primarily
applied in the following ways:
1. Visualizing Boolean Syntax
2. Visualizing Query Terms within
Retrieval Results
3. Visualizing Relationships among
Words and Documents
4. Visualization for Text Mining.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
1. Visualizing Boolean Syntax
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
1. Visualizing Boolean Syntax
• A Boolean expression is formed with OR ,
AND , NOT keyword.
• Example of Boolean Expression Formation
and its visualization is as follows:
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Boolean Syntax
• Boolean query syntax is
• difficult for most users and is rarely used in Web search.
• For many years, researchers have experimented with
• how to visualize Boolean query specification, in order to make it more
understandable.
• A common approach is to show Venn diagrams visually;
• Hertzum and Frokjaer [755] found that
• a simple Venn diagram representation produced more accurate results than
Boolean syntax.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Boolean Syntax
• A more flexible version of this idea was seen in the VQuery
system (see Figure 2.15).
• Each query term is represented by
• a circle or oval, and the intersection among circles indicates ANDing
(conjoining) of terms.
• VQuery represented disjunction by
• sets of circles within an active area of the canvas, and negation by
deselecting a circle with in the active area.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Boolean Syntax
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Boolean Syntax
• One problem with Boolean queries is that they can easily end up with empty
results or too many results.
• To remedy this, the filter-flow visualization allows users to
• lay out the deferent components of the query, and show via a graphical flow how many hits
would result after each operator is applied .
• Other visual representations of Boolean queries include
• lining up blocks vertically and horizontally and representing components of queries as
overlapping “magic” lenses.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
2. Visualizing Relationships among Words and Documents
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Query Terms
To make more effective of your
Query to be submitted to SQL
Tool:
It is recommended to use the Query
Deign Tool and use mouse to form
and / or restructure the
relationship .
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Query Terms within Retrieval Results
• As discussed above, understanding the role played by the query terms within
the retrieved documents can help with the assessment of relevance.
• In standard search results listings, summary sentences are often selected that
contain query terms, and the occurrence of these terms are highlighted or
boldfaced where they appear in the title, summary, and URL.
• Highlighting of this kind has been shown to be effective from a usability
perspective.
• Experimental visualizations have been designed that make this relationship
more explicit.
• One of the best known is the TileBars interface, in which documents are shown
as horizontal glyphs with the locations of the query term hits marked along the
glyph (see Figure 2.16).
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Query Terms within Retrieval Results
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Query Terms within Retrieval Results
• The user is encouraged to break the query into its different facets, with one
concept per line, and then
• the horizontal rows within each document’s representation show the frequency of
occurrence of query terms within each topic.
• Longer documents are divided into
• subtopic segments,
• either using paragraph or section breaks, or an automated discourse segmentation
technique called TextTiling.
• Grayscale implies the frequency of the query term occurrences.
• The visualization shows where the discussion of the different query topics
overlaps within the document.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
3. Visualizing Relationships Among Words and Documents
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is Word Visualization?
• A word cloud (or tag cloud) is a word visualization
that displays
• the most used words in a text from small to large,
according to how often each appears.
• They give a glance into
• the most important keywords in news articles, social
media posts, and customer reviews, among other text.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Relationships Among Words and Documents
• Numerous visualization developers have proposed variations on the idea of
placing words and documents on a two-dimensional canvas,
• where proximity of glyphs represents semantic relationships among the terms or
documents.
• An early version of this idea is seen in the VIBE interface, where
• queries are laidout on a plane, and documents that contain combinations of the
queries are placed midway between the icons representing those terms (see Figure
2.19).
• A more modern version of this idea is seen
• in the Aduna Autofocus product, and the Lyberworld project presented a 3D version of
the ideas behind VIBE.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Relationships Among Words and Documents
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualizing Relationships Among Words and Documents
• Another variation of this idea is
• to map documents or words from a very high dimensional term space
down into
• a two-dimensional plane, and show where the documents or words fall within that plane,
using 2D or 3D.
• This variation on clustering can be done to documents retrieved
as
• a result of a query, or documents that match a query can be highlighted
within a preprocessed set of documents.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
4. Visualization for Text Mining
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
What is Text visualization ?
• Text visualization is
• the technique of using graphs, charts, or word clouds to showcase written
data in a visual manner.
• This provides quick insight into
• the most relevant keywords in a text, summarizes content, and reveals trends and
patterns across documents.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Visualization for Text Mining
• The subsections above show that usability results for visualization in search
are not particularly strong.
• It seems in fact that visualization is better used for purposes of analysis and
exploration of textual data.
• Most users of search systems are not interested in seeing
• how words are distributed
• across documents ? or
• in viewing the most common words within a collection?
• but these are interesting activities for computational linguists, analysts, and curious
word enthusiasts.
PROFESSIONAL ELECTIVE – IV
INFORMATION RETRIEVAL TECHNIQUES
Information RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptx
Information RetrievalsT_I_materials.pptx

More Related Content

Similar to Information RetrievalsT_I_materials.pptx

Introduction to Enterprise Search
Introduction to Enterprise SearchIntroduction to Enterprise Search
Introduction to Enterprise SearchFindwise
 
IT6701 Information Management Unit - IV
IT6701 Information Management Unit - IVIT6701 Information Management Unit - IV
IT6701 Information Management Unit - IVpkaviya
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Community Capability Model Framework Checklist Tool - Demo & Review
Community Capability Model Framework Checklist Tool - Demo & ReviewCommunity Capability Model Framework Checklist Tool - Demo & Review
Community Capability Model Framework Checklist Tool - Demo & ReviewManjulaPatel
 
INFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATIONINFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATIONLibcorpio
 
Bab 11 pembelajaran manajemen Komputer.pptx
Bab 11 pembelajaran manajemen Komputer.pptxBab 11 pembelajaran manajemen Komputer.pptx
Bab 11 pembelajaran manajemen Komputer.pptxHermanTusiadi
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalVikas Bhushan
 
Delivering Linked Data Training to Data Science Practitioners
Delivering Linked Data Training to Data Science PractitionersDelivering Linked Data Training to Data Science Practitioners
Delivering Linked Data Training to Data Science PractitionersMarin Dimitrov
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Course description
Course descriptionCourse description
Course descriptionkanagesk
 
Linked Data Applications - WWW2010
Linked Data Applications - WWW2010Linked Data Applications - WWW2010
Linked Data Applications - WWW2010Juan Sequeda
 
Introduction to Competitive Intelligence Portals
Introduction to Competitive Intelligence PortalsIntroduction to Competitive Intelligence Portals
Introduction to Competitive Intelligence PortalsComintelli
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎Libcorpio
 
Use of ICT in educational research
Use of ICT in educational researchUse of ICT in educational research
Use of ICT in educational researchRamakanta Mohalik
 

Similar to Information RetrievalsT_I_materials.pptx (20)

CHAPTER -12 it.pptx
CHAPTER -12 it.pptxCHAPTER -12 it.pptx
CHAPTER -12 it.pptx
 
Ir 01
Ir   01Ir   01
Ir 01
 
Introduction to Enterprise Search
Introduction to Enterprise SearchIntroduction to Enterprise Search
Introduction to Enterprise Search
 
IT6701 Information Management Unit - IV
IT6701 Information Management Unit - IVIT6701 Information Management Unit - IV
IT6701 Information Management Unit - IV
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Community Capability Model Framework Checklist Tool - Demo & Review
Community Capability Model Framework Checklist Tool - Demo & ReviewCommunity Capability Model Framework Checklist Tool - Demo & Review
Community Capability Model Framework Checklist Tool - Demo & Review
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
INFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATIONINFORMATION RETRIEVAL ‎AND DISSEMINATION
INFORMATION RETRIEVAL ‎AND DISSEMINATION
 
Bab 11 pembelajaran manajemen Komputer.pptx
Bab 11 pembelajaran manajemen Komputer.pptxBab 11 pembelajaran manajemen Komputer.pptx
Bab 11 pembelajaran manajemen Komputer.pptx
 
Web mining
Web miningWeb mining
Web mining
 
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information RetrievalIndexing Techniques: Their Usage in Search Engines for Information Retrieval
Indexing Techniques: Their Usage in Search Engines for Information Retrieval
 
Delivering Linked Data Training to Data Science Practitioners
Delivering Linked Data Training to Data Science PractitionersDelivering Linked Data Training to Data Science Practitioners
Delivering Linked Data Training to Data Science Practitioners
 
00 intro
00 intro00 intro
00 intro
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Course description
Course descriptionCourse description
Course description
 
Linked Data Applications - WWW2010
Linked Data Applications - WWW2010Linked Data Applications - WWW2010
Linked Data Applications - WWW2010
 
Introduction to Competitive Intelligence Portals
Introduction to Competitive Intelligence PortalsIntroduction to Competitive Intelligence Portals
Introduction to Competitive Intelligence Portals
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
 
Use of ICT in educational research
Use of ICT in educational researchUse of ICT in educational research
Use of ICT in educational research
 

More from lekhacce

Introduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptxIntroduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptxlekhacce
 
HTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptxHTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptxlekhacce
 
javascript client side scripting la.pptx
javascript client side scripting la.pptxjavascript client side scripting la.pptx
javascript client side scripting la.pptxlekhacce
 
matlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsxmatlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsxlekhacce
 
webdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptxwebdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptxlekhacce
 
1_chapter one Java content materials.ppt
1_chapter one Java content materials.ppt1_chapter one Java content materials.ppt
1_chapter one Java content materials.pptlekhacce
 
Information_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdfInformation_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdflekhacce
 
slides_chap02.pdf
slides_chap02.pdfslides_chap02.pdf
slides_chap02.pdflekhacce
 
slides_chap01.pdf
slides_chap01.pdfslides_chap01.pdf
slides_chap01.pdflekhacce
 

More from lekhacce (10)

Introduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptxIntroduction to HTML language Web design.pptx
Introduction to HTML language Web design.pptx
 
HTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptxHTML_TABLES,FORMS,FRAME markup lang.pptx
HTML_TABLES,FORMS,FRAME markup lang.pptx
 
javascript client side scripting la.pptx
javascript client side scripting la.pptxjavascript client side scripting la.pptx
javascript client side scripting la.pptx
 
matlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsxmatlab-130408153714-phpapp02_lab123.ppsx
matlab-130408153714-phpapp02_lab123.ppsx
 
webdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptxwebdevelopment_6132030-lva1-app6891.pptx
webdevelopment_6132030-lva1-app6891.pptx
 
1_chapter one Java content materials.ppt
1_chapter one Java content materials.ppt1_chapter one Java content materials.ppt
1_chapter one Java content materials.ppt
 
Information_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdfInformation_Retrievals Unit_3_chap09.pdf
Information_Retrievals Unit_3_chap09.pdf
 
slides_chap02.pdf
slides_chap02.pdfslides_chap02.pdf
slides_chap02.pdf
 
AES.pptx
AES.pptxAES.pptx
AES.pptx
 
slides_chap01.pdf
slides_chap01.pdfslides_chap01.pdf
slides_chap01.pdf
 

Recently uploaded

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLDeelipZope
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...Call Girls in Nagpur High Profile
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAbhinavSharma374939
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Current Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCLCurrent Transformer Drawing and GTP for MSETCL
Current Transformer Drawing and GTP for MSETCL
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
High Profile Call Girls Nashik Megha 7001305949 Independent Escort Service Na...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Analog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog ConverterAnalog to Digital and Digital to Analog Converter
Analog to Digital and Digital to Analog Converter
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Information RetrievalsT_I_materials.pptx

  • 1. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. SEMESTER – VIII : PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES UNIT – I : INTRODUCTION
  • 3. What is IR? • Information retrieval (IR) is finding material . . . • of an unstructured nature . . . • that satisfies an information need from within large collections . . . . (usually stored on computers). • Unstructured data means that • a formal, semantically overt, easy-for-computer structure is missing. • In contrast to the rigidly structured data used in DB style searching (e.g. product inventories, personnel records)
  • 4. What is IR? • The process of actively seeking out information relevant to a topic of interest (van Rijsbergen) • Typically it refers to the automatic (rather thanmanual) retrieval of documents • Information Retrieval System (IRS) • “Document” is the generic term for an information • holder (book, chapter, article, webpage, etc)
  • 5. What is IR? • Information retrieval is the science of searching for information a) b) c) d) in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.
  • 6. What is IR? • Information Retrieval (IR) can be defined as • a software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. • Information Retrieval is the activity of obtaining material that can usually be documented on an unstructured nature • i.e. usually text which satisfies an information need from within large collections which is stored on computers. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 7. What is IR? • IR helps users • find information that matches their information needs expressed as queries. • Historically, IR is about document retrieval, emphasizing document as the basic unit. – Finding documents relevant to user queries • Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information. • For example, Information Retrieval can be when a user enters a query into the system. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 8. What is IR? • Information retrieval (IR) is a broad area of Computer Science focused • primarily on providing the users with easy access to information of their interest, as follows. • Information retrieval deals with • the representation, • storage, • organization of, and • access to information items such as documents, Web pages, online catalogs, structured and semi- structured records, multimedia objects. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 9. What is IR? •Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. : PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES UNIT – I : INTRODUCTION
  • 10. What is IR? •Information retrieval (IR) Quality : • Are the retrieved documents about the target subject up- to-date? • from a trusted source? • satisfying the user’s needs? • How should we rank documents in terms of these factors? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 11. What are IR Types? • Information retrieval (IR) are of two types: 1. Precision and 2. recall Above are the two parameters of retrieval effectiveness. 1.Precision refers to how many of the retrieved documents are relevant to the user. 2.Recall refers to what fraction of relevant documents in the collection are retrieved. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 12. What are the types of information retrieval? • Methods/Techniques in which information retrieval techniques are employed include: • Adversarial information retrieval. • Automatic summarization. Multi-document summarization. • Compound term processing. • Cross-lingual retrieval. • Document classification. • Spam filtering. • Question answering. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 13. Why we do IR? •Information retrieval can provide organizations with immediate value • --while it's important to try to figure out ways to capture tacit knowledge, • information retrieval provides a means to get at information that already exists in electronic formats PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 14. Why we do IR? • Information retrieval can provide organizations with immediate value--while it's important to try to figure out ways to capture tacit knowledge, information retrieval provides a means to get at information that already exists in electronic formats. • The representation and organization of the information items should be such as to • provide the users with easy access to information of their interest. • Nowadays, research in IR includes • modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering, languages. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 15. What is the need for information retrieval? •Information retrieval can provide : •organizations with immediate value --while it's important to try to figure out ways to capture tacit knowledge, information retrieval provides a means to get at information that already exists in electronic formats. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 16. What is the need for information retrieval in WEB? • Web: – A huge, widely-distributed, highly heterogeneous, semistructured,, interconnected, evolving, hypertext/hypermedia information repository • Main issues – Abundance of information • The 99% of all the information are not interesting for the 99% of all users – The static Web is a very small part of all the Web. • Dynamic Website – To access the Web user need to exploit Search Engines (SE) • SE must be improved • To help people to better formulate their information needs • More personalization is needed PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 17. WEB IR : What is Web IR? • Web IR can be defined as • the application of theories and methodologies from IR to the World Wide Web. Web Information Retrieval models are ways of integrating many sources of evidence about documents, 1. the links, 2. the structure of the document, 3. the actual content of the document, 4. the quality of the document, etc. so that an effective Web search engine can be achieved. : PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES UNIT – I : INTRODUCTION
  • 18. Issues in IR •The main issues of the Information Retrieval (IR) are : 1. Document and Query Indexing 2. Query Evaluation 3. System Evaluation. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 19. Issues in IR : Document and Query Indexing • Document and Query Indexing Main goal of Document and Query Indexing is to find important meanings and creating an internal representation. • The factors to be considered are accuracy to represent semantics, exhaustiveness, and facility for a computer to manipulate. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 20. Issues in IR : Query Evaluation • Query Evaluation – In the retrieval model how can a document be represented with the selected keywords and how are documents and query representations compared to calculate a score. • Information Retrieval (IR) deals with issues like uncertainty and vagueness in information systems. • Uncertainty : The available representation does not typically reflect true semantics of objects such as images, videos etc. • Vagueness : The information that the user requires lacks clarity, is only vaguely expressed in a query, feedback or user action. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 21. Issues in IR : System Evaluation • System Evaluation – System Evaluation tells about the importance of determining the impact of information given on user achievement. Here, we see if the efficiency of the particular system related to time and space. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 22. IR in practice • Not only librarians, professional searchers, etc engage themselves in the activity of information retrieval • but nowadays hundreds of millions of people engage in IR every day when they use web search engines. • Information Retrieval is believed to be the dominant form of Information access PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 23. IR in practice • Information Retrieval is a research-driven theoretical and experimental discipline • The focus is on different aspects of the information– seeking process, depending on the researcher’s background or interest: • Computer scientist – fast and accurate search engine • Librarian – organization and indexing of information • Cognitive scientist – the process in the searcher’s mind • Philosopher – Is this really relevant ? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 24. IR in practice •Progress influenced by advances in Computational Linguistics, Information Visualization, Cognitive Psychology, HCI, … • Experimental vs. operational systems • Analogy to wcawrmwa.nsutfuadcteurnintgsf PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 25. IR Process • An information retrieval process begins when a user enters a query into the system. • Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single object in the collection. • Instead, several objects may match the query, perhaps with different degrees of relevancy. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 26. Fundamental concepts in IR PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 27. Fundamental concepts in IR •What is information ? •Meaning vs. form • Data vs. Information Retrieval •Relevance PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 28. The stages of IR PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 29. FORMULATION OF IR PROCESS PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 30. IR system • The IR system assists the users in finding the information they require • but it does not explicitly return the answers to the question. • It notifies regarding the existence and location of documents that might consist of the required information. • Information retrieval also extends support to users in browsing or filtering document collection or processing a set of retrieved documents. • The system searches over billions of documents stored on millions of computers. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 31. IR system • An IR system has the ability to represent, store, organize, and access information items. • A set of keywords are required to search. Keywords are what people are searching for in search engines. • These keywords summarize the description of the information. • A spam filter, manual or automatic means are provided by Email program for classifying the mails so that it can be placed directly into particular folders. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 32. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 33. P1WU UNIT – I: INTRODUCTION TOPIC – 2: EARLY DEVELOPMENTS PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 34. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 35. EARLY DEVELOPMENTS PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 36. EARLY DEVELOPMENTS • For more than 5, 000 years, man has organized information f or later retrieval and searching. • In its most usual form, this has been done by • compiling, • storing, • organizing, and • indexing clay tablets, • hieroglyphics, • papyrus rolls, and • books. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 37. EARLY DEVELOPMENTS • IR in the 17th century: Samuel Pepys, the famous English diarist, subject-indexed his treasured 1000+ books library with key words. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 38. EARLY DEVELOPMENTS • IR in the 17th century: Samuel Pepys, the famous English diarist, subject-indexed his treasured 1000+ books library with key words. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 39. EARLY DEVELOPMENTS • Document Collection: text units we have built an IR system over. • Usually documents • But could be • memos • book chapters paragraphs scenes of a movie • turns in a conversation... • Lots of them PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 40. EARLY DEVELOPMENTS • For holding the various items, • special purpose buildings called libraries, • from the Latin word liber for book, or bibliothekes, • from the Greek word biblion for papyrus roll, are used. • The oldest known library was created in Elba, • in the “Fertile Crescent”, • currently northern Syria, • some time between 3,000 and 2,500 BC. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 41. EARLY DEVELOPMENTS • In the seventh century BC, • Assyrian king Ashurbanipal created the library of Nineveh, on the Tigris River (today, north of Iraq), • which contained more than 30,000 clay tablets at the time of its destruction in 612 BC. • By 300 BC, Ptolemy Soter, a Macedonian general, created the Great Library in Alexandria – the Egyptian city at the mouth of the Nile named after • the Macedonian king Alexander the Great (356-323 BC). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 42. EARLY DEVELOPMENTS • For seven centuries the Great Library, jointly with other major libraries in the city, • made Alexandria the intellectual capital of the Western world. • Since then, libraries have expanded and flourished. Nowadays, they are everywhere. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 43. EARLY DEVELOPMENTS • . They constitute the collective memory of the human race and their popularity is in the rising. • In 2008 alone, people in the US visited their libraries some 1.3 billion times and checked out more than 2 billion items • an increase in both yearly figures of more than 10 percent. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 44. EARLY DEVELOPMENTS • Since the volume of information in libraries is always growing, • it is necessary to build specialized data structures for fast search – the indexes. In one form or another, • indexes are at the core of every modern information retrieval system. • They provide fast access to the data and allow speeding up query processing. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 45. EARLY DEVELOPMENTS • For centuries indexes have been created manually as sets of categories. • Each category in the index is typically composed of labels that identify its associated topics and of pointers to the documents that discuss those topics. • While these indexes are usually designed by library and information science researchers, • the advent of modern computers has allowed the construction of large indexes automatically, • which has accelerated the development of the area of Information Retrieval (IR). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 46. EARLY DEVELOPMENTS • Early developments in IR date back to research efforts conducted in the 50’s by pioneers such as • Hans Peter Luhn, • Eugene Garfield, • Philip Bagley, and • Cal vi n Moores, • this last one having allegedly coined the term information retrieval. • In 1955, Allen Kent and colleagues published a paper describing the precision and recall metrics, • which was followed by the publication in 1962 of the Cranfield studies by Cyril Cleverdon. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 47. EARLY DEVELOPMENTS • In 1963, Joseph Becker and Robert Hayes published the first book on information retrieval [164]. • Throughout the 60’s, Gerard Salton and Karen Sparck Jones, among others, shaped the field • by developing the fundamental concepts that led to the modern technologies of ranking in IR. • In1968,thefirstIR book authored by Salton was published. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 48. EARLY DEVELOPMENTS • . In 1971, N.Jardine and C.J.VanRijsbergen articulated the “cluster hypothesis”. • In 1978, the first ACM Conference on IR (ACM SIGIR) was held in Rochester, New York. • In 1979, C.J. Van Rijsbergen published Information Retrieval, which focused on probabilistic models. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 49. EARLY DEVELOPMENTS • In 1983, Salton and McGill published Introduction to Modern Information Retrieval, a classic book on IR focused on vector models. • Since then, • the IR community has grown to include • thousands of professors, • researchers, • students, • engineers, and • practitioners • throughout the world. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 50. EARLY DEVELOPMENTS • The main conference in the area, • the ACM International Conference on Information Retrieval (ACM SIGIR), • now attracts hundreds of attendees and receives hundreds of submitted papers on an yearly basis. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 51. MODERN IR DEVELOPMENTS PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 52. MODERN IR DEVELOPMENTS PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 53. MODERN IR DEVELOPMENTS PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 54. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 55. P1WU UNIT – I: INTRODUCTION Topic 3: THE IR PROBELM PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 56. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 57. INTRODUCTION TO INFORMATION RETRIEVAL (IR) PROBELM What is IR PROBLEM? Users of modern IR systems, such as search engine users, have information needs of varying complexity. In the simplest case, they are looking for the link to the homepage of a company, government, or institution. In the more sophisticated cases, they are looking for information required to execute tasks associated with their jobs or immediate needs. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 58. IR PROBLEM EXAMPLE An example of a more complex information need is as follows: • Find all documents that address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK). • This full description of the user need does not necessarily provide • the best formulation for querying the IR system. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 59. IR PROBLEM EXAMPLE •Instead, the user might want to first translate this information need into • a query, or sequence of queries, to be posed to the system. • In its most common form, this translation yields • a set of keywords, or index terms, which summarize the user information need. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 60. IR PROBLEM EXAMPLE • Given the user query, • the key goal of the IR system is to retrieve information that is useful or relevant to the user. • The emphasis is on the retrieval of information as opposed to the retrieval of data. • To be effective in its attempt to satisfy the user information need, • the IR system must somehow ‘interpret’ the contents of the information items. • That is, the documents in a collection, and rank them according to a degree of relevance to the user query. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 61. IR PROBLEM • Given the user query, • the key goal of the IR system is to retrieve information that is useful or relevant to the user. • The emphasis is on the retrieval of information as opposed to the retrieval of data. • To be effective in its attempt to satisfy the user information need, • the IR system must somehow ‘interpret’ the contents of the information items. • That is, the documents in a collection, and rank them according to a degree of relevance to the user query. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 62. IR PROBLEM • This ‘interpretation’ of a document content involves extracting syntactic and semantic information from the document text and using this information to match the user information need. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 63. IR PROBLEM • The IR Problem: the primary goal of an IR system is to • retrieve all the documents that are relevant to a user query while retrieving as few non relevant documents as possible. • The difficulty is knowing not only how to extract information from the documents • but also knowing how to use it to decide relevance. • That is, the notion of relevance is of central importance in IR. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 64. IR PROBLEM • One main issue is that relevance is a personal assessment that depends on the task being solved and its context. • For example: • Relevance can change • with time (e.g., new information becomes available), • with location (e.g., the most relevant answer is the closest one), or • even with the device (e.g., the best answer is a short document that is easier to download and visualize). • In this sense, no IR system can provide perfect answers to all users all the time. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 65. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 66. P1WU UNIT – I: INTRODUCTION Topic 4:THE USERS TASK PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 67. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 68. AN OVERVIEW OF USER TASK : What is USER TASK IN IRS? • The user of a retrieval system has to : • translate his information need into a query in the language provided by the system. • With an information retrieval system, this normally implies • specifying a set of words which convey the semantics of the information need. • With a data retrieval system, • a query expression (such as, for instance, a regular expression) is used to • convey the constraints that must be satisfied by objects in the answer set. • In both cases, we say that the user searches for useful information executing a retrieval task. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 69. IRS – USER TASK : What is USER TASK IN IRS? • Users of modern IR systems, • such as search engine users, have information needs of varying complexity. • The user of a retrieval system has to translate their information need into a query in the language provided by the system. • With an IR system, such as a search engine, • this usually implies specifying a set of words that convey the semantics of the information need. • We say that the user is searching or querying for information of their interest. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 70. IRS – USER TASK EXAMPLE To illustrate, • The user might be interested in documents about • car racing in general and might decide to glance related documents about Formula 1 racing, • Formula Indy, and the ‘24 Hours of Le Mans. • We say that • the user is browsing or navigating the documents in the collection, not searching. • It is still a process of retrieving information, • but one whose main objectives are less clearly defined in the beginning. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 71. IRS – USER TASK • While searching for information of interest is the main retrieval task on the Web, • search can also be used for satisfying other user needs distinct from information access, • such as the buying of goods and the placing of reservations. • Consider now a user • who has an interest that is either poorly defined or inherently broad, • such that the query to specify is unclear. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 72. IRS – USER TASK • The task in this case is more related to • exploratory search and resembles a process of quasi-sequential search for information of interest. • Here we, make a clear distinction between the different tasks the user of the retrieval system might be engaged in. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 73. IRS – USER TASK • The task might be then of two distinct types: searching and browsing, as illustrated in Figure: PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 74. IRS – USER TASK • In a process of retrieving information, one whose main objectives are not clearly defined in the beginning and whose purpose might change during the interaction with the system. • Then, user task may go with Browsing only. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 75. IRS – USER TASK • User Choice of Information Retrieval: • Push • Pull PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 76. IRS – USER TASK • Both retrieval and browsing are, in the language of the World Wide Web, `pulling' actions. • That is, the user requests the information in an interactive manner. • An alternative is to do retrieval in an automatic and permanent fashion using software agents which push the information towards the user. • For instance, information useful to a user could be extracted periodically from a news service. • In this case, we say that the IR system is executing a particular retrieval task which consists of filtering relevant information for later inspection by the user. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 77. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 78. P1WU UNIT – I: INTRODUCTION Topic 5: THE INFORMATION VERSUS RETRIEVAL PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 79. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 80. INTRODUCTION TO INFORMATION What is Information? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 81. INTRODUCTION TO RETRIEVAL What is Retrieval? • This does not mean that there is no structure in the data Document structure (headings, paragraphs, lists. . . ) • Explicit markup formatting (e.g. in HTML, XML. . . ) Linguistic structure (latent, hidden) PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES SELECT * FROM business catalogue WHERE category = ’florist’ AND city zip = ’cb1’
  • 82. INTRODUCTION TO RETRIEVAL What is Information retrieval (IR) ? Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 83. INTRODUCTION TO RETRIEVAL • An information need is • the topic about which the user desires to know more about. • A query is • what the user conveys to the computer in an attempt to communicate the information need. • A document is relevant PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES Known-item search. Precise information seeking search Open-ended search • if the user perceives that it contains information (“topical search”) of value with respect to their personal information need.
  • 84. THE INFORMATION VERSUS RETRIEVAL • Data retrieval, in the context of an IR system, consists • mainly of determining which documents of a collection contain • the keywords in the user query which, most frequently, is not enough to satisfy the user information need. • In fact, the user of an IR system is concerned more with • retrieving information about a subject than with retrieving data that satisfies a given query. • For instance, a user of an IR system is willing to accept • documents that contain synonyms of the query terms in the result set, • even when those documents do not contain any query terms. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 85. INFORMATION RETRIEVAL • “Ad hoc” retrieval web retrieval • Support for browsing and filtering document collections: • Clustering • Classification; using fixed labels (common information needs, age groups, topics; ) • Further processing a set of retrieved documents, • e.g., by using natural language processing • Information extraction Summarization Question answering. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 86. RETRIEVAL TYPES 1) Web search ( ) • Search ground are billions of documents on millions of computers • issues: spidering; efficient indexing and search; malicious manipulation to boost search engine rankings 2) Enterprise and institutional search ( ) • e.g company’s documentation, patents, research articles often domain-specific • Centralised storage; dedicated machines for search. • Most prevalent IR evaluation scenario: US intelligence analyst’s searches 3) Personal information retrieval (email, pers. documents; ) • e.g., Mac OS X Spotlight; Windows’ Instant Search • Issues: different file types; maintenance-free, lightweight to run in background PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 87. THE INFORMATION VERSUS RETRIEVAL • In an IR system the retrieved objects might be • inaccurate and small errors are likely to go unnoticed. • In a data retrieval system, on the contrary, • a single erroneous object among a retrieval system, • such as defined structure and semantics thousand retrieved objects means total failure. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 88. THE INFORMATION VERSUS RETRIEVAL • While A data a relational database, deals with data that has a well defined structure and semantics. • while an IR system deals with natural language text which is not well structured. • Data retrieval, while providing a solution to the user of a database system, does not solve • the problem of retrieving information about a subject or topic. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 89. THE INFORMATION VERSUS RETRIEVAL PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES Data Information Unorganized raw facts that need processing without which it is seemingly random and useless to humans Information is a processed, organized data presented in a given context and is useful to humans. Data is an individual unit that contains raw material which does not carry any specific meaning. Information is a group of data thatcollectively carry a logical meaning Data Doesn't depended on information. Information depends on data. It is measured in bits and bytes. Information is measured in meaningfulunits like time, quantity, etc. Data is never suited to thespecific needs of a designer. Information is specific to the expectations and requirements because all the irrelevant facts and figures are removed, during the transformation process. An example of Data is a Student’score. The average score of a class is the information derived from the given data.
  • 90. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 91. P1WU UNIT – I: INTRODUCTION Topic 6: THE IR SYSTEM PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 92. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 93. The role of an IR system A modern view: • Support the user in – exploring a problem domain, understanding its terminology, concepts and structure – clarifying, refining and formulating an information need – finding documents that match the info need description • As many relevant docs as possible • As few non-relevant documents as possible PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 94. How does it do this ? • User interfaces and visualization tools for – exploring a collection of documents – exploring search results • Query expansion based on – Thesauri – Lexical/statistic analysis of text / context and concept formation – Relevance feedback • Indexing and matching model PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 95. How well does it do this? • Evaluation – Of the components • Indexing / matching algorithms – Of the exploratory process overall • Usability issues • Usefulness to task • User satisfaction PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 96. What do we want from an IRS ? • Systemic approach – Goal (for a known information need): • Return as many relevant documents as possible and as few non-relevant documents as possible • Cognitive approach –Goal (in an interactive information-seeking environment, with a given IRS): • Support the user’s exploration of the problem domain and the task completion. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 97. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) OIRS is the techniques of storing and recovering and often disseminating recorded data especially through the use of a computerized system” (Merriam Webster Dictionary) PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 98. QUALITIES OF IRS • The effectiveness of an IR system (i.e., the quality of its search results) is determined by two key statistics about the system’s returned results for a query: 1. Precision: What fraction of the returned results are relevant to the information need? 2. Recall: What fraction of the relevant documents in the collection were returned by the system? • Queries to be addressed: • What is the best balance between the two? • Easy to get perfect recall: just retrieve everything • Easy to get good precision: retrieve only the most relevant PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 99. THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM • To describe the IR system, we use a simple and generic software architecture as shown in Figure PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 100. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) •The first step in setting up an IR system is to assemble the document collection, • which can be private or be crawled from the Web. In the second case a crawler module is responsible for collecting the documents. •The document collection is stored in disk storage usually referred to as the central repository. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 101. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) •The documents in the central repository need to be indexed for fast retrieval and ranking. •The most used index structure is an inverted index composed of all the distinct words of the collection and, for each word, • a list of the documents that contain it. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 102. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) •Given that the document collection is indexed, the retrieval process can be initiated. •It consists of retrieving documents that satisfy either a user query or a click in a hyper link. • In the first case, we say that the user is searching for information of interest; • in the second case, we say that the user is browsing for information of interest. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 103. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • Use retrieval as PROCESS it applies to the searching process requires browsing and how it compares to searching. • To search, the user first specifies a query that reflects their information need. • Next, the user query is parsed and expanded with, for instance, spelling variants of a query word. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 104. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • The expanded query, • which we refer to as the system query, is then processed against the index to retrieve a subset of all documents. • Following, the retrieved documents are ranked and the top documents are returned to the user. • The purpose of ranking is to identify the documents that are most likely to be considered relevant by the user, and constitutes the most critical part of the IR system. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 105. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • Given the inherent subjectivity in deciding relevance, • evaluating the quality of the answer set is a key step for improving the IR system. • A systematic evaluation process allows • fine tuning the ranking algorithm and improving the quality of the results. • The most common evaluation procedure consists of comparing • the set of results produced by the IR system with results suggested by human specialists. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 106. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 107. P1WU UNIT – I: INTRODUCTION Topic 7: THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 108. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 109. SYSTEM ARCHITECTURE OF IRS A high level view of the software architecture of an IR system will provide: 1) Components 2) Tools 3) Environment 4) Data source(s) Also additional elements needed for through understanding of Data flow. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 110. THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM • To describe the IR system, we use a simple and generic software architecture as shown in Figure PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 111. THE SOFTWARE ARCHITECTURE OF THE IR SYSTEM • To describe the IR system, we use a simple and generic software architecture as shown in Figure PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 112. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • The first step in setting up an IR system is to assemble the document collection, • which can be private or be crawled from the Web. In the second case a crawler module is responsible for collecting the documents. • The document collection is stored in disk storage usually referred to as the central repository. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 113. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • The documents in the central repository need to be indexed for fast retrieval and ranking. • The most used index structure is an inverted index composed of all the distinct words of the collection and, for each word, • a list of the documents that contain it. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 114. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) •Given that the document collection is indexed, the retrieval process can be initiated. •It consists of retrieving documents that satisfy either a user query or a click in a hyper link. • In the first case, we say that the user is searching for information of interest; • in the second case, we say that the user is browsing for information of interest. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 115. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • Use retrieval as PROCESS it applies to the searching process requires browsing and how it compares to searching. • To search, the user first specifies a query that reflects their information need. • Next, the user query is parsed and expanded with, for instance, spelling variants of a query word. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 116. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • The expanded query, • which we refer to as the system query, is then processed against the index to retrieve a subset of all documents. • Following, the retrieved documents are ranked and the top documents are returned to the user. • The purpose of ranking is to identify the documents that are most likely to be considered relevant by the user, and constitutes the most critical part of the IR system. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 117. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • Given the inherent subjectivity in deciding relevance, • evaluating the quality of the answer set is a key step for improving the IR system. • A systematic evaluation process allows • fine tuning the ranking algorithm and improving the quality of the results. • The most common evaluation procedure consists of comparing • the set of results produced by the IR system with results suggested by human specialists. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 118. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • To improve the ranking, • we might collect feedback from the users and use this information to change the results. • In the Web, • the most abundant form of user feedback are the clicks on the documents in the results set. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 119. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • Another important source of information for Web ranking are the hyperlinks among pages, • which allow identifying sites of high authority. • There are many other concepts and technologies that bear impact on • the design of a full fledged IR system, such as a modern search engine. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 120. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 121. P1WU UNIT – I: INTRODUCTION Topic 8: THE RETRIEVAL AND RANKING PROCESSES PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 122. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 123. Ranked Retrieval • Ranked retrieval • The documents are ranked based on their score • Advantages – Query easy to specify –The output is ranked based on the estimated relevance of the documents to the query – A wide variety of theoretical models exist • Disadvantages – Query less precise (although weighting can be used) PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 124. THE RETRIEVAL AND RANKING PROCESSES • To describe the retrieval and ranking processes, we further elaborate on our description of the modules shown in Figure 1.2, as illustrated in Figure 1.3. • Given the documents of the collection, 1. we first apply text operations to them such as eliminating stop words, stemming, and 2. selecting a subset of all terms for use as indexing terms. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 125. THE RETRIEVAL AND RANKING PROCESSES • To describe the IR system Retrieval and Ranking Processes, we illustrate it through the figure as given below: PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 126. THE RETRIEVAL AND RANKING PROCESSES • To describe the IR system Retrieval and Ranking Processes, we illustrate it through the figure as given below: PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 127. THE RETRIEVAL AND RANKING PROCESSES • The indexing terms are then used to compose document representations, • which might be smaller than the documents themselves (depending on the subset of index terms selected). • Given the document representations, it is necessary to build an index of the text. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 128. THE RETRIEVAL AND RANKING PROCESSES • Different index structures might be used, • but the most popular one is an inverted index. • The steps required to generate the index compose the indexing process and must be executed offline, • before the system is ready to process any queries. • The resources (time and storage space) spent on the indexing process are amortized by querying the retrieval system many times. • Given that the document collection is indexed, the retrieval process can be initiated. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 129. THE RETRIEVAL AND RANKING PROCESSES • The user first specifies a query that reflects their information need. • This query is then parsed and modified by operations that resemble those applied to the documents. • Typical operations at this point consist of spelling corrections and elimination of terms such as stop words, • whenever appropriated. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 130. THE RETRIEVAL AND RANKING PROCESSES • Next, the transformed query is expanded and modified. • For instance, the query might be modified using query suggestions made by the system and confirmed by the user. • The expanded and modified query is then processed to obtain the set of retrieved documents, • which is composed of documents that contain the query terms. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 131. THE RETRIEVAL AND RANKING PROCESSES • Fast query processing is made possible by the index structure previously built. • The steps required to produce the set of retrieved documents constitute the retrieval process. • Next, the retrieved documents are ranked according to a likelihood of relevance to the user. • This is a most critical step because the quality of the results, • as perceived by the users, is fundamentally dependent on the ranking. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 132. THE RETRIEVAL AND RANKING PROCESSES • The top ranked documents are then formatted for presentation to the user. • The formatting consists of retrieving the title of the documents and generating snippets for them, • i.e., text excerpts that contain the query terms, • which are then displayed to the user. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 133. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 134. ONLINE INFORMATION RETRIEVAL SYSTEM(OIRS) • The documents in the central repository need to be indexed for fast retrieval and ranking. • The most used index structure is an inverted index composed of all the distinct words of the collection and, for each word, • a list of the documents that contain it. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 135. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 136. P1WU UNIT – I: INTRODUCTION Topic 9: THE WEB PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 137. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 138. WEB – What is Web? • The Web, or World Wide Web (W3), is basically a system of Internet servers that support specially formatted documents. • The documents are formatted in a markup language called HTML (HyperText Markup Language) that supports links to other documents, as well as graphics, audio, and video files. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 139. WEB – What Does Web Mean? • The Web is the common name for the World Wide Web, a subset of the Internet consisting of the pages that can be accessed by a Web browser. • Many people assume that the Web is the same as the Internet, and use these terms interchangeably. • However, the term Internet actually refers to the global network of servers that makes the information sharing that happens over the Web possible. • So, although the Web does make up a large portion of the Internet, but they are not one and same. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 140. WEB – KEY TOPICS • WEB SITE OR WEB PAGES • WEB SERVER • WEB CLIENT • WEB APPLICATIONS OR WEB APPs • WEB SOFTWARES – WEB 1.0, WRB 2.0, WEB 3.0 • WEB SERVICES PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 141. WEB - A Brief History • “As We May Think” influenced people like Douglas Engelbart who, • at the Fall Joint Computer Conference in San Francisco in December of 1968, • ran a demonstration in which he introduced the first ever computer mouse, video conferencing, teleconferencing, and hypertext. • It was so incredible that it became known as “the mother of all demos” [1690]. • Of the innovations displayed, the one that interests us the most here is hypertext. • The term was coined by Ted Nelson in his Project Xanadu. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 142. WEB - A Brief History • Hypertext allows the reader to jump from one electronic document to another, • which was one important property regarding the problem Tim Berners-Lee faced in 1989. • At the time, Berners-Lee worked in Geneva at the CERN – Conseil Europeen pourla Recherche Nucleaire. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 143. WEB - A Brief History • Researchers who wanted to share documentation with others had to • reformat their documents to make them compatible with an internal publishing system. • It was annoying and generated many questions, • many of which ended up been directed towards Berners-Lee. • He understood that a better solution was required. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 144. WEB - A Brief History • It just so happened that CERN was the largest Internet node in Europe. • Berners Lee reasoned that • it would be nice if the solution to the problem of sharing documents were decentralized, such that the researchers could share their contributions freely. • He saw that • a networked hypertext, through the Internet, would be a good solution and started working on its implementation PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 145. WEB - A Brief History • In 1990, • he wrote the HTTP protocol, defined the HTML language, • wrote the first browser, which he called “World Wide Web”, and the first Web server. • In 1991, • he made his browser and server software available in the Internet. The Web was born. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 146. Web Information Retrieval : What is web in information retrieval? •Web Information retrieval is • the process of searching within a huge World Wide Web document collection for a particular information need (called a query). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 147. Web Information Retrieval : What is web in information retrieval? • The Web can be considered as • a large-scale document collection, for which classical text retrieval techniques can be applied. • However, its unique features and structure offer new sources of evidence that can be used to enhance the effectiveness of Information Retrieval (IR) systems. • Generally, Web IR examines • the combination of evidence from both the textual content of documents and • the structure of the Web, • as well as the search behavior of users and issues related to the evaluation of retrieval effectiveness in the Web setting. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 148. Web Information Retrieval • Web documents / data • No traditional collection – Huge • Time and space to crawl index • IRSs cannot store copies of documents – Dynamic, volatile, anarchic, un- controlled PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 149. Web Information Retrieval – Homogeneous sub-collections • Structure – In documents (un-/semi-/fully-structured) – Between docs: network of inter-connected nodes – Hyper-links - conceptual vs. physical documents PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 150. Web Information Retrieval • Web Information Retrieval models are • ways of integrating many sources of evidence about documents, such as 1. the links, 2. the structure of the document, the actual content of the document, 3. the quality of the document, etc. 4. so that an effective Web search engine can be achieved. • In contrast with the traditional library-type settings of IR systems, • the Web is a hostile environment, where Web search engines have to deal with PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 151. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 152. P1WU UNIT – I: INTRODUCTION Topic 10: THE E-PUBLISHING ERA PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 153. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 154. What is E-Publishing? • Electronic publishing (also referred to as publishing, digital publishing, or online publishing) includes • the digital publication of • e-books, • digital magazines, and • the development of digital libraries and catalogues. • It also includes • the editing of • books, • journals and • Magazines to be posted on a screen (computer, e-reader, tablet, or Smartphone). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 155. Need of E-Publishing • Since its inception, • the Web became a huge success. • The number of Web pages now far exceeds • 20 billion and the number of Web users in the world exceeds 1.7 billion. • It is known that there are • more than one trillion distinct URLs on the Web, • even if many of them are pointers to dynamic pages, not static HTML pages. • A viable model of economic sustainability based on online advertising was developed. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 156. THE E-PUBLISHING ERA • Electronic publishing has become common in scientific publishing • where it has been argued that peer-reviewed scientific journals are in the process of being replaced by electronic publishing. • It is also becoming common to distribute • books, magazines, and newspapers to consumers through tablet reading devices, • a market that is growing by millions each year,generated by online vendors . • Apple's iTunes bookstore, Amazon's bookstore for Kindle, and books in the Google Play Bookstore. • Market research suggested that half of all magazine and newspaper circulation is based on E-Publishing. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 157. THE E-PUBLISHING ERA • The advent of the Web changed the world in a way that few people could have anticipated. • Yet, one has to wonder on the characteristics of the Web that have made it so successful. • Is there a single characteristic of the Web that was most decisive for its success? • The simple HTML markup language, the low access costs, the wide spread reach of the Internet, the interactive browser interface, the search engines. • While providing the fundamental infrastructure for the Web, • these technologies were not the root cause of its popularity. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 158. THE E-PUBLISHING ERA – It ‘s Time of Birth • What was it then? • The fundamental shift in human relationships, introduced by the Web, was freedom to publish. • Example: • Jane Austen is one of the most famous writers in English literature. Her books are read by people all over the world and have been made into countless TV, film, theatre and radio adaptations. • This is all the more impressive because she only wrote six full-length novels. • Jane Austen did not have that freedom, • so she had to either convince a publisher of the quality of her work or pay for the publication of an edition of it herself. • Since she could not pay for it, • she had to be patient and wait for the publisher to become convinced. • It took 15 years. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 159. THE E-PUBLISHING ERA – It ‘s Time of Birth • In the world of the Web, this is no longer the case. • People can now publish their ideas on the Web and reach millions of people over night, • without paying anything for it and without having to convince the editorial board of a large publishing company. • Restrictions imposed by mass communication media companies and • by natural geographical barriers were almost entirely removed by the invention of the Web, • which has led to a freedom to publish that marks the birth of a new era. • One which we refer to as The e-Publishing Era. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 160. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 161. P1WU UNIT – I: INTRODUCTION Topic 11: HOW THE WEB CHANGED SEARCH PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 162. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 163. HOW THE WEB CHANGED SEARCH • Web search is • today ‘s the most prominent application of IR and its techniques. • Indeed, the ranking and indexing components of any search engine are • fundamentally IR pieces of technology. • • An immediate consequence is that the Web has had a major impact in the development of IR PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 164. HOW THE WEB CHANGED SEARCH 1) The first major impact of the Web on search is • relatedto the characteristicsof the document collectionitself. • The Web collection is • composed of documents(or pages) distributed over millions of sites and connected through hyperlinks, • i.e., links that associate a piece of text of a page with other Web pages. • The inherent distributed nature of the Web collection requires • collectingall documents and storing copies of them in a central repository,prior to indexing. • This new phase in the IR process, introduced by the Web, is called Web Search. • The system that implements the IR process(es) is called Search Engine or IR System. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 165. HOW THE WEB CHANGED SEARCH 2) The second major impact of the Web on search is related to • the size of the collection and the volume of user queries submitted on a daily basis. • • • Given that the Web grew larger and faster than any previous known text collection, the search engines have now to handle a volume of text that far exceeds 20 billion pages, i.e., a volume of text much larger than any previous text collection The volume of user queries is also much larger than ever before, even if estimates vary widely. • • The combination of a very large text collection with a very high query traffic has pushed the performance and scalability of search engines to limits that largely exceed those of any previous IR system. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 166. HOW THE WEB CHANGED SEARCH • That is, • performance and scalability have become critical characteristics of the IR system, much more than they used to be prior to the Web. • While we do not discuss performance and scalability of search engines. 3) The third major impact of the Web on search is also related to the vast size of the document collection. • In a very large collection, predicting relevance is much harder than before. • Basically, any query retrieves a large number of documents that match its terms, which means that there are many noisy documents in the set of retrieved documents. • • That is, documents that seem related to the query but are actually not relevant to it according to the judgment of a large fraction of the users are retrieved. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 167. HOW THE WEB CHANGED SEARCH • This problem first showed up in the early Web search engines and became more severe as the Web grew. • Fortunately, the Web also includes • new sources of evidence not present in standard document collections that can be used to alleviate the problem, such as hyperlinksand user clicks in documents in the answer set. • Two other major impacts of the Web on search derive from the fact that the Web is not just a repository of documents and data, but also a medium to do business. • One immediate implication is that the search problem has been extended beyond the seeking of text information to also encompass other user needs such as the price of a book, the phone number of a hotel, the link for downloading a software. • Providing effective answers to • these types of information needs frequentlyrequiresidentifying structured data associatedwith the object of interestsuch as price, location, or descriptionsof some of its key characteristics. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 168. HOW THE WEB CHANGED SEARCH The fifth and final impact of the Web on search derives from Web advertising and other economic incentives. The continued success of the Web as an interactive media for the masses created incentives for its economic exploration in the form of, for instance, advertising and electronic commerce. These incentives led also to the abusive availability of commercial information disguised in the form of purely informational content, which is usually referred to as Web spam. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 169. HOW THE WEB CHANGED SEARCH The increasingly pervasive presence of spam on the Web has made the quest for relevance even more difficult than before, i.e., spam content is sometimes so compelling that it is confused with truly relevant content. Because of that, it is not unreasonable to think that spam makes relevance negative, i.e., the presence of spam makes the current ranking algorithms produce answers sets that are worst than they would be if the Web were spam free. This difficulty is so large that today we talk of Adversarial Web Retrieval. PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 170. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATIONRETRIEVAL TECHNIQUES
  • 171. P1WU UNIT – I: INTRODUCTION Topic 12: PRACTICAL ISSUES ON THE WEB PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 172. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 173. PRACTICAL ISSUES ON THE WEB • Electronic commerce is a major trend on the Web nowadays and one which has benefited millions of people. • In an electronic transaction, the buyer usually submits to the vendor credit information to be used for charging purposes. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 174. PRACTICAL ISSUES ON THE WEB • In its most common form, such information consists of a credit card number. • For security reasons, this information is usually encrypted, as done by institutions and companies that deploy automatic authentication processes. • Besides security, another issue of major interest is privacy. Frequently, people are willing to exchange information as long as it does not become public. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 175. PRACTICAL ISSUES ON THE WEB • The reasons are many, • but the most common one is to protect oneself against misuse of private information by third parties. • Privacy is another issue which affects the deployment of the Web and which has not been properly addressed yet. • Two other important issues are 1. copyright and 2. patent rights. • It is far from clear how the wide spread of data on the Web affects copyright and patent laws in the various countries. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 176. PRACTICAL ISSUES ON THE WEB • • • • This is important because it affects the business of building up and deploying large digital libraries. For instance, is a site which supervises all the information it posts acting as a publisher? And if so, is it responsible for misuse of the information it posts (even if it is not the source)? • • Additionally, other practical issues of interest include scanning, optical character recognition (OCR), and cross-language retrieval • (in which the query is in one language but the documents retrieved are in another language). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 177. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 178. P1WU UNIT – I: INTRODUCTION Topic 13: HOW PEOPLE SEARCH PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 179. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 180. HOW PEOPLE SEARCH? • Search tasks range from the relatively simple • e.g., looking up disputed facts or finding weather information • to the rich and complex • e.g., job seeking and planning vacations. • Search interfaces should support a range of tasks, • while taking into account how people think about searching for information. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 181. Information Lookup versus Exploratory Search • User interaction with search interfaces differs depending on a) the type of task, b) the amount of time and c) effort available to invest in the process, and the domain expertise of the information seeker. • The simple interaction dialogue used in Web search engines is most appropriate for : 1. finding answers to questions or 2. to finding Web sites or other resources that act as search starting points. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 182. HOW PEOPLE SEARCH? •But, as Marchionini notes, •the “turn-taking” interface of Web search engines is • inherently limited and in many cases is being supplanted by specialty search engines • such as • for travel and health information – that offer richer interaction models. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 183. Classic versus Dynamic Model of Information Seeking • Researchers have developed numerous theoretical models of • how people go about doing search tasks? • The classic notion of the information seeking process model as described by • Sutcliffe and Ennis is formulated as a cycle consisting of four main activities: 1. problem identification, 2. articulation of information need(s), 3. query formulation, and 4. results evaluation. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 184. Classic versus Dynamic Model of Information Seeking • The standard model of the information seeking process contains • an underlying assumption that the user’s information need is static and the information seeking process is one of successively refining a query until • all and only those documents relevant to the original information need have been retrieved. • • More recent models emphasize the dynamic nature of the search process, noting that users learn as they search, and their information needs adjust as they see retrieval results and other document surrogates. • This dynamic process is sometimes referred to as the berry picking model of search. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 185. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 186. P1WU UNIT – I: INTRODUCTION Topic 14: SEARCH INTERFACES TODAY PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 187. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 188. SEARCH INTERFACES TODAY a) b) c) • • • • • At the heart of the typical search session is a cycle of query specification, inspection of retrieval results, and query reformulation. • As the process proceeds, the searcher learns about their topic, as well as about the available information sources. • There are several user interface components that have become standard in search interfaces and which exhibit high usability. • As these components are described, the design characteristics that they support will be underscored. • Ideally, these components are integrated together to support the different parts of the process, but it is useful to discuss each separately. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 189. Query Specification • Once a search starting point has been selected, the primary methods for a searcher to express their information need are • either entering words into a search entry form or selecting links from a directory or other information organization display. • For Web search engines, • the query is specified in textual form. • Today this is usually done by typing text on a keyboard, • but in future, • query specification via spoken commands will most likely become increasingly common, using mobile devices as the input medium. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 190. Query Specification • Typically in Web queries today, • the text is very short, consisting of one to three words. • Multiword queries are often meant to be construed as a phrase, • but can also consist of multiple topics. • Short queries reflect the standard usage scenario in which the user “tests the waters” to • see what the search engine returns in response to their short query. • If the results do not look relevant, then • the user reformulates their query. • If the results are promising, then • the user navigates to the most relevant- looking Web site and pursues more fine-tuned queries on that site. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 191. Query Specification • This search behavior, in which a general query is used to find a promising part of the information space, and then • follow hyperlinks within relevant Web sites, • is a demonstration of the orienteering strategy of Web search. • There is evidence that in many cases searchers would prefer to state their information need in more detail, • but past experience with search engines taught them that this method does not work well, • and that keyword querying combined with orienteering works better. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 192. Query Specification Interfaces • The standard interface for a textual query is a search box entry form, • in which the user types a query, activated by hitting the return key on the keyboard or selecting a button associated with the form. • Studies suggest a relationship between query length and the width of the entry form; • results find that either small forms discourage long queries or wide forms encourage longer queries. • Some entry forms are divided into multiple components • , allowing for a more general free text query followed by a form that filters the query in some way. • For instance, at yelp.com, the user enters • a general query into the first entry form and refines the search by location in the second form (see Figure 2.1). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 193. Query Specification Interfaces PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 194. Query Specification Interfaces • Forms allow for selecting information that has been used in the past; sometimes this information is structured and allows for setting parameters to be used in future. • For instance, the yelp.com • form shows the user’s home location (if it has been indicated in the past) along with recently specified locations and the option to add additional locations. • • An increasingly common strategy within the search form is to show hints about what kind of information should be entered into each form via greyed-out text. For instance, in zvents.com search (see Figure 2.2), the first box is labeled • “what are you looking for?” • while the second box is labeled “when (tonight, this weekend, ...)”.. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 195. Query Specification Interfaces • When the user places the cursor into the entry form, the grey text disappears, and the user can type in their query terms. This example also illustrates • specialized input types that some search engines are supporting today. • For instance, the zvents.com site recognizes that • words like “tomorrow” are time-sensitive, and interprets them in the expected manner. • It also allows flexibility in the syntax of more formal specification of dates. • So searching for “comedy” on “wed” automatically computes the date for the nearest future Wednesday. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 196. Query Specification Interfaces • This is an example of designing the interface to reflect how people think, • rather than making how the user thinks conform to the brittle, literal expectations of typical programs. • (This approach to “loose” query specification works better for “casual” interfaces in which getting the date right is not critical to the use of the system; • casual date specification when filling out a tax form is not acceptable, as the cost of error is too high.) PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 197. Query Specification Interfaces • This is an example of designing the interface to reflect how people think, • rather than making how the user thinks conform to the brittle, literal expectations of typical programs. • (This approach to “loose” query specification works better for “casual” interfaces in which getting the date right is not critical to the use of the system; • casual date specification when filling out a tax form is not acceptable, as the cost of error is too high.) PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 198. Query Specification Interfaces • An innovation that has greatly improved query specification is the inclusion of a dynamically generated list of query suggestions, shown in real time as the user types the query. • This is referred to variously as auto-complete, auto-suggest, and dynamic query suggestions. • A large log study found that users clicked on dynamic suggestions in the Yahoo Search Assist tool about one third of the time they were presented . • Often the suggestions shown are those whose prefix matches the characters typed so far, • but in some cases, suggestions are shown that only have interior letters matching. • If the user types a multiple word query, suggestions may be shown that are synonyms of what has been typed so far, but which do not contain lexical matches. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 199. Query Specification Interfaces • To exemplify, Netflix.com both describes what is wanted in gray and then shows hits via a dropdown box. • In dynamic query suggestion interfaces, the display of the matches varies. • Some interfaces color the suggestions according to category information. • In most cases, the user must move the mouse down to the desired suggestion in order to select it, • at which point the suggestion is used to fill the query box. • In some cases, the query is then run immediately; in others, the user must hit the Return key or click the Search button. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 200. Query Specification Interfaces • The suggestions can be derived from several sources. In some cases, the list is taken from the user’s own query history, in other cases, • it is based on popular queries issued by other users. • The list can be derived from a set of metadata that a Web site’s designer considers important, • such as a list of known diseases or gene names for a search over pharmacological literature (see Figure 2.3), • a list of product names when searching within an e-commerce site, or a list of known film names when searching a movie site. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 201. Query Specification Interfaces PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 202. Query Specification Interfaces • The suggestions can also be derived from all of the text contained within a Web site. • Another form of query specification consists of choosing from a display of information, • typically in the form of hyperlinks or saved bookmarks. • In some cases, • the action of selecting a link produces more links for further navigation, in addition to results listings. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 203. Query Reformulation • After a query is specified and results have been produced, a number of tools exist to help the user reformulate their query, • or take their information seeking process in a new direction. • Analysis of search engine logs shows that reformulation is a common activity; • one study found that more than 50% of searchers modified at least one query during a session, with nearly a third of these involving three or more queries. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 204. Organizing Search Results • Searchers often express • a desire for a user interface that organizes search results into meaningful groups to help understand the results and decide what to do next. • A longitudinal study in which users were provided with • the ability to group search results found that users changed their search habits in response to having the grouping mechanism available. • Currently two methods for grouping search results are popular: category systems, especially faceted categories, and clustering. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 205. Organizing Search Results • A category system is a set of meaningful labels organized in such a way as to reflect the concepts relevant to a domain. • They are usually created manually, although assignment of documents to categories can be automated to a certain degree of accuracy. • Good category systems have the characteristics of being coherent and (relatively) complete, • their structure is predictable and consistent across search results for an information collection. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 206. Any Questions? PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 207. P1WU  UNIT – I: INTRODUCTION Topic 15: VISUALIZATION SEARCH  INTERFACES PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 208. UNIT I INTRODUCTION 1. Information Retrieval 2. Early Developments 3. The IR Problem 4. The Users Task 5. Information versus Data Retrieval 6. The IR System 7. The Software Architecture of the IR System 8. The Retrieval and Ranking Processes 9. The Web 10. The e-Publishing Era 11. How the web changed Search 12. Practical Issues on the Web 13. How People Search 14. Search Interfaces Today 15. Visualization in Search Interfaces PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 209. WHAT IS VISUALIZATION ? •Visualization or visualisation is •any technique for creating images, diagrams, or animations to communicate a message. •Visualization through visual imagery has been an effective way to communicate both abstract and concrete ideas since the dawn of humanity. • Wikipedia PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 210. What is an example of visualizing? •Visualize is to imagine, or to paint a picture of something in your mind, or to make something visible. •When you close your eyes and imagine yourself winning first prize, • this is an example of when you visualize winning first prize. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 211. What are the stages of visualization? • The five phases of visualization process: 1. data gathering, 2. processing, 3. preparation, 4. reduction and visual layout design. • In recent years, a comparably fresh research field — • information visualization has become commonly available for the researchers of all specialties. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 212. What is visualization research? • Visualization research develops tools to scaffold the exploratory analysis process and deepens our understanding of how visualizations support meaning-making about data. • A recent trend in the community involves • understanding how visualizations can facilitate interpretability of deep learning models. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 213. VISUALIZATION STEPS •The following concrete steps are required to be a competent Visualization expert : 1) Ability to Visualize 2) Finding opportunities 3) Learning ability PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 214. VISUALIZATION SEARCH INTERFACES •Text as a representation is highly effective for conveying abstract information, •but reading and even scanning text is a cognitively taxing activity, and must be done in a linear fashion. •By contrast, images can be scanned quickly and the visual system perceives information in parallel. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 215. VISUALIZATION SEARCH INTERFACES •People are highly attuned to •images and visual information, and pictures and graphics can be captivating and appealing. •A visual representation can communicate •some kinds of information much more rapidly and effectively than any other method. •Consider : •the deference between a written description of a person’s face and a photograph of it, or • the deference between a table of numbers containing a correlation and a scatter plot showing the same information. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 216. VISUALIZATION SEARCH INTERFACES • Experimentation with visualization for search has been primarily applied in the following ways: 1. Visualizing Boolean Syntax 2. Visualizing Query Terms within Retrieval Results 3. Visualizing Relationships among Words and Documents 4. Visualization for Text Mining. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 217. 1. Visualizing Boolean Syntax PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 218. 1. Visualizing Boolean Syntax • A Boolean expression is formed with OR , AND , NOT keyword. • Example of Boolean Expression Formation and its visualization is as follows: PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 219. Visualizing Boolean Syntax • Boolean query syntax is • difficult for most users and is rarely used in Web search. • For many years, researchers have experimented with • how to visualize Boolean query specification, in order to make it more understandable. • A common approach is to show Venn diagrams visually; • Hertzum and Frokjaer [755] found that • a simple Venn diagram representation produced more accurate results than Boolean syntax. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 220. Visualizing Boolean Syntax • A more flexible version of this idea was seen in the VQuery system (see Figure 2.15). • Each query term is represented by • a circle or oval, and the intersection among circles indicates ANDing (conjoining) of terms. • VQuery represented disjunction by • sets of circles within an active area of the canvas, and negation by deselecting a circle with in the active area. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 221. Visualizing Boolean Syntax PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 222. Visualizing Boolean Syntax • One problem with Boolean queries is that they can easily end up with empty results or too many results. • To remedy this, the filter-flow visualization allows users to • lay out the deferent components of the query, and show via a graphical flow how many hits would result after each operator is applied . • Other visual representations of Boolean queries include • lining up blocks vertically and horizontally and representing components of queries as overlapping “magic” lenses. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 223. 2. Visualizing Relationships among Words and Documents PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 224. Visualizing Query Terms To make more effective of your Query to be submitted to SQL Tool: It is recommended to use the Query Deign Tool and use mouse to form and / or restructure the relationship . PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 225. Visualizing Query Terms within Retrieval Results • As discussed above, understanding the role played by the query terms within the retrieved documents can help with the assessment of relevance. • In standard search results listings, summary sentences are often selected that contain query terms, and the occurrence of these terms are highlighted or boldfaced where they appear in the title, summary, and URL. • Highlighting of this kind has been shown to be effective from a usability perspective. • Experimental visualizations have been designed that make this relationship more explicit. • One of the best known is the TileBars interface, in which documents are shown as horizontal glyphs with the locations of the query term hits marked along the glyph (see Figure 2.16). PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 226. Visualizing Query Terms within Retrieval Results PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 227. Visualizing Query Terms within Retrieval Results • The user is encouraged to break the query into its different facets, with one concept per line, and then • the horizontal rows within each document’s representation show the frequency of occurrence of query terms within each topic. • Longer documents are divided into • subtopic segments, • either using paragraph or section breaks, or an automated discourse segmentation technique called TextTiling. • Grayscale implies the frequency of the query term occurrences. • The visualization shows where the discussion of the different query topics overlaps within the document. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 228. 3. Visualizing Relationships Among Words and Documents PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 229. What is Word Visualization? • A word cloud (or tag cloud) is a word visualization that displays • the most used words in a text from small to large, according to how often each appears. • They give a glance into • the most important keywords in news articles, social media posts, and customer reviews, among other text. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 230. Visualizing Relationships Among Words and Documents • Numerous visualization developers have proposed variations on the idea of placing words and documents on a two-dimensional canvas, • where proximity of glyphs represents semantic relationships among the terms or documents. • An early version of this idea is seen in the VIBE interface, where • queries are laidout on a plane, and documents that contain combinations of the queries are placed midway between the icons representing those terms (see Figure 2.19). • A more modern version of this idea is seen • in the Aduna Autofocus product, and the Lyberworld project presented a 3D version of the ideas behind VIBE. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 231. Visualizing Relationships Among Words and Documents PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 232. Visualizing Relationships Among Words and Documents • Another variation of this idea is • to map documents or words from a very high dimensional term space down into • a two-dimensional plane, and show where the documents or words fall within that plane, using 2D or 3D. • This variation on clustering can be done to documents retrieved as • a result of a query, or documents that match a query can be highlighted within a preprocessed set of documents. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 233. 4. Visualization for Text Mining PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 234. What is Text visualization ? • Text visualization is • the technique of using graphs, charts, or word clouds to showcase written data in a visual manner. • This provides quick insight into • the most relevant keywords in a text, summarizes content, and reveals trends and patterns across documents. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES
  • 235. Visualization for Text Mining • The subsections above show that usability results for visualization in search are not particularly strong. • It seems in fact that visualization is better used for purposes of analysis and exploration of textual data. • Most users of search systems are not interested in seeing • how words are distributed • across documents ? or • in viewing the most common words within a collection? • but these are interesting activities for computational linguists, analysts, and curious word enthusiasts. PROFESSIONAL ELECTIVE – IV INFORMATION RETRIEVAL TECHNIQUES