Information Retrieval
Techniques
Chapter 1
Introduction
Information Retrieval
The IR Problem
The IR System
The Web
Information Retrieval (IR)
IR deals with the representation, storage, organization
of, and access to information items
Types of information items: documents, Web pages, online
catalogs, structured records, multimedia objects
Early goals of the IR area: indexing text and searching
for useful documents in a collection
Nowadays, research in IR includes:
Modeling, Web search, text classification, systems architecture,
user interfaces, data visualization, filtering and languages
Early Developments
For more than 5,000 years, man has organized
information for later retrieval and searching
This has been done by compiling, storing, organizing, and
indexing papyrus, hieroglyphics, and books
For holding the various items, special purpose buildings
called libraries, or bibliothekes, are used
The oldest known library was created in Ebla, in the Fertile
Crescent, between 3,000 and 2,500 BC
By 300 BC, Ptolemy Soter, a Macedonian general, created the
Great Library at Alexandria
Nowadays, libraries are everywhere
In 2008, more than 2 billion items were checked out from
libraries in the US—an increase of 10% over the previous year
Early Developments
Since the volume of information in libraries is always
growing, it is necessary to build specialized data
structures for fast search — the indexes
For centuries indexes have been created manually as
sets of categories, with labels associated with each
category
The advent of modern computers has allowed the
construction of large indexes automatically
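As a rough illustration of what an automatically built index might look like, the sketch below constructs a small inverted index in Python. The sample documents, the function name build_index, and the whitespace tokenization are assumptions made for this example and do not come from the slides.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical three-document collection.
docs = {
    1: "papyrus scrolls in the library",
    2: "the library at Alexandria",
    3: "indexing books by category",
}
index = build_index(docs)
print(index["library"])  # ids of the documents that mention "library"
```

Looking up a word is then a single dictionary access instead of a scan over every document, which is the point of building the index ahead of time.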
Early Developments in IR
During the 50’s, research efforts in IR were initiated by
pioneers such as Hans Peter Luhn, Eugene Garfield,
Philip Bagley, and Calvin Mooers, who allegedly coined
the term Information Retrieval
In 1962, Cyril Cleverdon published the Cranfield studies
on retrieval evaluation
In 1963, Joseph Becker and Robert Hayes published
the first book on IR
In the late 60’s, key research conducted by Karen
Sparck Jones and Gerard Salton, among others, led to
the definition of the TF-IDF term weighting scheme
Early Developments in IR
In 1971, Jardine and van Rijsbergen articulated the
cluster hypothesis
In 1978, the first ACM SIGIR International Conference on
Information Retrieval was held in Rochester
In 1979, van Rijsbergen published a classic book
entitled Information Retrieval, which focused on the
Probabilistic Model
In 1983, Salton and McGill published a classic book
entitled Introduction to Modern Information Retrieval,
which focused on the Vector Model
Libraries and Digital Libraries
Libraries were among the first institutions to adopt IR
systems for retrieving information
Initially, such systems consisted of an automation of
existing processes such as card catalog searching
Increased search functionality was then added
Ex: subject headings, keywords, query operators
Nowadays, the focus has been on improved graphical
interfaces, electronic forms, hypertext features
IR at the Center of the Stage
Until recently, IR was an area of interest restricted
mainly to librarians and information experts
A single fact changed these perceptions—the
introduction of the Web, which has become the largest
repository of knowledge in human history
Due to its enormous size, finding useful information on
the Web usually requires running a search
And searching on the Web is all about IR and its
technologies
Thus, almost overnight, IR has gained a
place with other technologies at the
center of the stage
The IR Problem
The IR Problem
Users of modern IR systems, such as search engine
users, have information needs of varying complexity
An example of a complex information need is as
follows:
Find all documents that address the role
of the Federal Government in financing
the operation of the National Railroad
Transportation Corporation (AMTRAK)
The IR Problem
This full description of the user information need is not
necessarily a good query to be submitted to the IR
system
Instead, the user might want to first translate this
information need into a query
This translation process yields a set of keywords, or
index terms, which summarize the user information
need
Given the user query, the key goal of the IR system is to
retrieve information that is useful or relevant to the user
The IR Problem
That is, the IR system must rank the information items
according to a degree of relevance to the user query
The IR Problem
The key goal of an IR system is to retrieve all the
items that are relevant to a user query, while
retrieving as few nonrelevant items as possible
The notion of relevance is of central importance in
IR
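The goal stated above is commonly quantified with two standard measures, recall (the share of relevant items that were retrieved) and precision (the share of retrieved items that are relevant). These names are not used on this slide; the sketch below is only an illustration, and the document ids in it are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical answer set and relevance judgments.
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
print(precision, recall)
```

Retrieving everything would maximize recall but ruin precision, which is why the goal is phrased in terms of both.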
The User’s Task
Consider a user who seeks information on a topic of
their interest
This user first translates their information need into a query,
which requires specifying the words that compose the query
In this case, we say that the user is searching or querying for
information of their interest
Consider now a user who has an interest that is either
poorly defined or inherently broad
For instance, the user has an interest in car racing and wants to
browse documents on Formula 1 and Formula Indy
In this case, we say that the user is browsing or navigating the
documents of the collection
The User’s Task
Information × Data Retrieval
Data retrieval: the task of determining which documents
of a collection contain the keywords in the user query
Data retrieval system
Ex: relational databases
Deals with data that has a well defined structure and semantics
A single erroneous object among a thousand retrieved objects
means total failure
Data retrieval does not solve the problem of retrieving
information about a subject or topic
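One way to picture the contrast is to compare an all-or-nothing keyword test with a ranking by partial overlap. The sketch below is a toy illustration under that assumption; the function names, the sample documents, and the overlap score are hypothetical and far simpler than a real IR ranking function.

```python
def tokens(text):
    """Lowercase whitespace tokenization (an assumption of this sketch)."""
    return set(text.lower().split())


def exact_match(docs, keywords):
    """Data-retrieval style: keep only documents containing every keyword."""
    wanted = {k.lower() for k in keywords}
    return [doc_id for doc_id, text in docs.items() if wanted <= tokens(text)]


def ranked_match(docs, keywords):
    """IR style: order documents by how many keywords they share with the query."""
    wanted = {k.lower() for k in keywords}
    scored = [(len(wanted & tokens(text)), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


docs = {
    1: "amtrak federal funding report",
    2: "railroad history",
    3: "federal budget for amtrak",
}
query = ["amtrak", "federal", "railroad"]
print(exact_match(docs, query))   # no document contains all three keywords
print(ranked_match(docs, query))  # partial matches are still returned, best first
```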
The IR System
Architecture of the IR System
High level software architecture of an IR system
Retrieval and Ranking Processes
The processes of indexing, retrieval, and ranking
The Web
A Brief History
At the end of World War II, Vannevar Bush looked for
applications of new technologies to peacetime
Bush first produced a report entitled Science, The
Endless Frontier
This report directly influenced the creation of the National
Science Foundation
Afterwards, he wrote As We May Think, a remarkable
paper which discussed new hardware and software
gadgets
In Bush’s words:
Whole new forms of encyclopedias will appear, ready-made with a
mesh of associative trails running through them, ready to be dropped
into the memex and there amplified
A Brief History
As We May Think influenced people like Douglas
Engelbart, who invented the computer mouse and
introduced the concept of hyperlinked texts
Ted Nelson, working in his Project Xanadu, pushed the
concept further and coined the term hypertext
A hypertext allows the reader to jump from one
electronic document to another, which was one
important property regarding the problem that Tim
Berners-Lee faced in 1989
A Brief History
At the time, Berners-Lee worked in Geneva at the
CERN—Conseil Européen pour la Recherche Nucléaire
There, researchers who wanted to share documentation
with others had to reformat their documents to make
them compatible with an internal publishing system
Berners-Lee reasoned that it would be nice if the
solution of sharing documents were decentralized
He saw that a networked hypertext would be a good
solution and started working on its implementation
A Brief History
In 1990, Berners-Lee
Wrote the HTTP protocol
Defined the HTML language
Wrote the first browser, which he called World Wide Web
Wrote the first Web server
In 1991, he made his browser and server software
available on the Internet
The Web was born!
The e-Publishing Era
Since its inception, the Web has been a huge success
Well over 20 billion pages are now available and accessible on the
Web
More than one fourth of humanity now accesses the Web on a
regular basis
Why is the Web such a success? What is the single
most important characteristic of the Web that makes it
so revolutionary?
In search of an answer, let us delve into the life of a
writer who lived at the end of the 18th Century
The e-Publishing Era
She finished the first draft of her novel in 1796
The first attempt at publication was refused without a reading
The novel was only published 15 years later!
She got a flat fee of £110, which meant that she was not paid
anything for the many subsequent editions
Further, her authorship was anonymized under the reference “By
a Lady”
We are talking of ...
The e-Publishing Era
Pride and Prejudice is the second or third best loved
novel in the UK ever, after The Lord of the Rings and
Harry Potter
It has been the subject of six TV series and five film
versions
The last of these, starring Keira Knightley and Matthew
Macfadyen, grossed over 100 million dollars
Jane Austen published anonymously throughout her entire
life
Throughout the 20th century, her novels were never
out of print
The e-Publishing Era
Jane Austen was discriminated against because there was no
freedom to publish at the beginning of the 19th century
The Web, unleashed by the inventiveness of Tim
Berners-Lee, changed this once and for all
It did so by universalizing freedom to publish
The Web moved mankind into a new era,
into a new time, into The e-Publishing
Era
How the Web Changed Search
Web search is today the most prominent application of
IR and its techniques—the ranking and indexing
components of any search engine are fundamentally IR
pieces of technology
The first major impact of the Web on search is related
to the characteristics of the document collection itself
The Web is composed of pages distributed over millions of sites
and connected through hyperlinks
This requires collecting all documents and storing copies of them
in a central repository, prior to indexing
This new phase in the IR process, introduced by the Web, is
called crawling
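A minimal sketch of the crawling phase is given below, assuming a breadth-first traversal over a toy in-memory "Web" rather than real network access. The URLs, the TOY_WEB dictionary, and the fetch callback are hypothetical; a production crawler would also need politeness rules, URL normalization, and error handling, none of which are shown.

```python
from collections import deque


def crawl(seed, fetch):
    """Breadth-first crawl: fetch each reachable page once and store a copy of it."""
    repository = {}            # url -> stored copy of the page text
    frontier = deque([seed])   # urls discovered but not yet fetched
    while frontier:
        url = frontier.popleft()
        if url in repository:
            continue           # already collected
        page = fetch(url)
        if page is None:
            continue           # unreachable page, skip it
        text, links = page
        repository[url] = text
        frontier.extend(link for link in links if link not in repository)
    return repository


# Toy web: url -> (page text, outgoing hyperlinks). Purely illustrative.
TOY_WEB = {
    "a.example/": ("home", ["a.example/x", "b.example/"]),
    "a.example/x": ("page x", ["a.example/"]),
    "b.example/": ("other site", ["c.example/missing"]),
}
repo = crawl("a.example/", TOY_WEB.get)
print(sorted(repo))
```

The resulting repository plays the role of the central store that is indexed afterwards.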
How the Web Changed Search
The second major impact of the Web on search is
related to:
The size of the collection
The volume of user queries submitted on a daily basis
As a consequence, performance and scalability have become
critical characteristics of the IR system
The third major impact: in a very large collection,
predicting relevance is much harder than before
Fortunately, the Web also includes new sources of evidence
Ex: hyperlinks and user clicks in documents in the answer set
How the Web Changed Search
The fourth major impact derives from the fact that the
Web is also a medium to do business
The search problem has been extended beyond the seeking of text
information to also encompass other user needs
Ex: the price of a book, the phone number of a hotel, the link for
downloading a piece of software
The fifth major impact of the Web on search is Web
spam
Web spam: abusive availability of commercial information
disguised in the form of informational content
This difficulty is so large that today we talk of Adversarial Web
Retrieval
Practical Issues in the Web
Security
Commercial transactions over the Internet are not yet a completely
safe procedure
Privacy
Frequently, people are willing to exchange information as long as
it does not become public
Copyright and patent rights
It is far from clear how the wide spread of data on the Web affects
copyright and patent laws in the various countries
Scanning, optical character recognition (OCR), and
cross-language retrieval
Organization of the Book
Focus of the Book
The book presents an overall view of research in IR
from a computer scientist’s perspective
This means that the main focus of the book is on computer
algorithms and techniques used in IR systems
A rather distinct viewpoint is taken by librarians and
information science researchers
In this viewpoint, the focus is on trying to understand how people
interpret and use information
This human-centered viewpoint is discussed in the user
interfaces chapter and in the last two chapters of the
book
Book Contents
Modern Information Retrieval
Chapter 2
User Interfaces for Search
How People Search
Search Interfaces Today
Visualization in Search
Interfaces
Design and Evaluation
of Search Interfaces
Introduction
This chapter focuses on
the human users of search systems
the search user interface, i.e., the window through which search
systems are seen
The role of the user interface is to aid in the searchers’
understanding and expression of their information need
Further, the interface should help users
formulate their queries
select among available information sources
understand search results
keep track of the progress of their search
How People Search
How People Search
User interaction with search interfaces differs
depending on
the type of task
the domain expertise of the information seeker
the amount of time and effort available to invest in
the process
Marchionini makes a distinction between information
lookup and exploratory search
Information lookup tasks
are akin to fact retrieval or question answering
can be satisfied by discrete pieces of information: numbers,
dates, names, or Web sites
can work well for standard Web search interactions
How People Search
Exploratory search is divided into learning and
investigating tasks
Learning search
requires more than single query-response pairs
requires the searcher to spend time
scanning and reading multiple information items
synthesizing content to form new understanding
How People Search
Investigating refers to a longer-term process which
involves multiple iterations that take place over perhaps very long
periods of time
may return results that are critically assessed before being
integrated into personal and professional knowledge bases
may be concerned with finding a large proportion of the relevant
information available
How People Search
Information seeking can be seen as being part of a
larger process referred to as sensemaking
Sensemaking is an iterative process of formulating a
conceptual representation from a large collection
Russell et al. observe that most of the effort in
sensemaking goes towards the synthesis of a good
representation
Some sensemaking activities interweave search
throughout, while others consist of doing a batch of
search followed by a batch of analysis and synthesis
How People Search
Examples of deep analysis tasks that require
sensemaking (in addition to search)
the legal discovery process
epidemiology (disease tracking)
studying customer complaints to improve service
obtaining business intelligence
Classic × Dynamic Model
Classic notion of the information seeking process:
1. problem identification
2. articulation of information need(s)
3. query formulation
4. results evaluation
More recent models emphasize the dynamic nature of
the search process
The users learn as they search
Their information needs adjust as they see retrieval results and
other document surrogates
This dynamic process is sometimes referred to as the
berry picking model of search
Classic × Dynamic Model
The rapid response times of today’s Web search
engines allow searchers:
to look at the results that come back
to reformulate their query based on these results
This kind of behavior is a commonly-observed strategy
within the berry-picking approach
Sometimes it is referred to as orienteering
Jansen et al made an analysis of search logs and found
that the proportion of users who modified queries is
52%
Classic × Dynamic Model
Some seeking models cast the process in terms of
strategies and how choices for next steps are made
In some cases, these models are meant to reflect conscious
planning behavior by expert searchers
In others, the models are meant to capture the less planned,
potentially more reactive behavior of a typical information seeker
Navigation × Search
Navigation: the searcher looks at an information
structure and browses among the available information
This browsing strategy is preferable when the
information structure is well-matched to the user’s
information need
it is mentally less taxing to recognize a piece of information than it
is to recall it
it works well only so long as appropriate links are available
If the links are not available, then the browsing
experience might be frustrating
Navigation × Search
Spool discusses an example of a user looking for a
software driver for a particular laser printer
Say the user first clicks on printers, then laser printers,
then the following sequence of links:
HP laser printers
HP laser printers model 9750
software for HP laser printers model 9750
software drivers for HP laser printers model 9750
software drivers for HP laser printers model 9750 for the Win98 operating system
This kind of interaction is acceptable when each
refinement makes sense for the task at hand
Search Process
Numerous studies have been made of people engaged
in the search process
The results of these studies can help guide the design
of search interfaces
One common observation is that users often
reformulate their queries with slight modifications
Another is that searchers often search for information
that they have previously accessed
The users’ search strategies differ when searching over
previously seen materials
Researchers have developed search interfaces that support
both query history and revisitation
Search Process
Studies also show that it is difficult for people to
determine whether or not a document is relevant to a
topic
The less users know about a topic, the poorer judges they are of
whether a search result is relevant to that topic
Other studies found that searchers tend to look at only
the top-ranked retrieved results
Further, they are biased towards thinking the top one or
two results are better than those beneath them
Search Process
Studies also show that people are poor at estimating
how much of the relevant material they have found
Other studies have assessed the effects of knowledge
of the search process itself
These studies have observed that experts use different
strategies than novice searchers
For instance, Tabatabai et al found that
expert searchers were more patient than novices
this positive attitude led to better search outcomes
Search Interfaces Today
Getting Started
How does an information seeking session begin in
online information systems?
The most common way is to use a Web search engine
Another method is to select a Web site from a personal
collection of already-visited sites
which are typically stored in the browser’s bookmarks
Online bookmark systems are popular among a smaller segment
of users
Ex: Delicious.com
Web directories are also used as a common starting point, but
have been largely replaced by search engines
Query Specification
The primary methods for a searcher to express their
information need are either
entering words into a search entry form
selecting links from a directory or other information organization
display
For Web search engines, the query is specified in
textual form
Typically, Web queries today are very short, consisting
of one to three words
Query Specification
Short queries reflect the standard usage scenario in
which the user tests the waters:
If the results do not look relevant, then the user reformulates their
query
If the results are promising, then the user navigates to the most
relevant-looking Web site
This search behavior is a demonstration of the
orienteering strategy of Web search
Query Specification
Before the Web, search systems regularly supported
Boolean operators and command-based syntax
However, these are often difficult for most users to understand
Jansen et al conducted a study over a Web log with
1.5M queries, and found that
2.1% of the queries contained Boolean operators
7.6% contained other query syntax, primarily double-quotation
marks for phrases
White et al examined interaction logs of nearly 600,000
users, and found that
1.1% of the queries contained one or more operators
8.7% of the users used an operator at any time
Query Specification
Web ranking has gone through three major phases
In the first phase, from approximately 1994–2000:
Since the Web was much smaller then, complex queries were
less likely to yield relevant information
Further, retrieved pages did not necessarily contain all query
words
Around 1997, Google moved to conjunctive
queries only
The other Web search engines followed, and conjunctive ranking
became the norm
Google also added term proximity information and page
importance scoring (PageRank)
As the Web grew, longer queries posed as phrases started to
produce highly relevant results
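To make the difference between conjunctive and non-conjunctive matching concrete, the sketch below evaluates both over a tiny inverted index. It is an illustrative assumption, not a description of any search engine's actual implementation; the index contents and function names are made up, and term proximity and page importance scoring are not modeled.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted posting lists of document ids."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def conjunctive(index, terms):
    """Documents containing ALL query terms (shortest posting list first)."""
    postings = sorted((index.get(t, []) for t in terms), key=len)
    if not postings:
        return []
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result


def disjunctive(index, terms):
    """Documents containing AT LEAST ONE query term."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))


# Hypothetical index: term -> sorted ids of documents containing it.
index = {"jane": [1, 2, 5], "austen": [2, 5, 7], "novel": [2, 3, 5, 9]}
query = ["jane", "austen", "novel"]
print(conjunctive(index, query))
print(disjunctive(index, query))
```

With conjunctive matching, each added query word can only shrink the candidate set, which is one reason longer queries behave differently under the two schemes.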
Query Specification Interfaces
The standard interface for a textual query is a search
box entry form
Studies suggest a relationship between query length
and the width of the entry form
Results found that either small forms discourage long queries or
wide forms encourage longer queries
Query Specification Interfaces
Some entry forms are followed by a form that filters the
query in some way
For instance, at yelp.com, the user can refine the
search by location using a second form
Notice that the yelp.com form also shows the user’s
home location, if it has been specified previously
Query Specification Interfaces
Some search forms show hints on what kind of
information should be entered into each form
For instance, in zvents.com search, the first box is
labeled “what are you looking for?”
Query Specification Interfaces
The previous example also illustrates specialized input
types that some search engines are supporting today
The zvents.com site recognizes that words like “tomorrow” are
time-sensitive
It also allows flexibility in the syntax of dates
To illustrate, searching for “comedy on wed”
automatically computes the date for the nearest future
Wednesday
This is an example of how the interface can be designed to reflect
how people think
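How a site might turn a token such as "wed" or "tomorrow" into a concrete date can be sketched as follows. This is only a guess at the general idea, not the zvents.com implementation; the function name resolve_day, the three-letter minimum for abbreviations, and the rule that a weekday always means the next occurrence strictly after today are all assumptions.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_day(token, today):
    """Resolve 'today', 'tomorrow', or a weekday name/abbreviation to a date."""
    token = token.strip().lower()
    if token == "today":
        return today
    if token == "tomorrow":
        return today + timedelta(days=1)
    for number, name in enumerate(WEEKDAYS):
        # Accept abbreviations of at least three letters, e.g. "wed".
        if len(token) >= 3 and name.startswith(token):
            # Days until the next such weekday strictly after today (1..7).
            delta = (number - today.weekday() - 1) % 7 + 1
            return today + timedelta(days=delta)
    return None  # not a recognized time expression


monday = date(2024, 1, 1)  # 1 January 2024 was a Monday
print(resolve_day("wed", monday))
```

Words that are not time expressions, such as "comedy", resolve to None and would be left in the keyword part of the query.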
Query Specification Interfaces
Some interfaces show a list of query suggestions as the
user types the query
This is referred to as auto-complete, auto-suggest, or dynamic
query suggestions
Anick et al found that users clicked on dynamic Yahoo
suggestions one third of the time
Often the suggestions shown are those whose prefix
matches the characters typed so far
However, in some cases, suggestions are shown that only have
interior letters matching
Further, suggestions may be shown that are synonyms
of the words typed so far
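A simple way to picture prefix and interior matching is sketched below: prefix matches are listed first, followed by suggestions that contain the typed characters elsewhere. The candidate list, the ordering rule, and the limit parameter are assumptions for illustration; synonym-based suggestions are not covered.

```python
def suggest(typed, candidates, limit=5):
    """Return suggestions: prefix matches first, then interior (substring) matches."""
    typed = typed.lower()
    prefix = [c for c in candidates if c.lower().startswith(typed)]
    interior = [c for c in candidates
                if typed in c.lower() and not c.lower().startswith(typed)]
    return (prefix + interior)[:limit]


# Hypothetical suggestion pool, e.g. drawn from a site's metadata.
candidates = ["star wars", "stargate", "lone star", "dark star", "casablanca"]
print(suggest("star", candidates))
```

In a real interface this function would be called on every keystroke, so the candidate pool is usually stored in a structure built for fast prefix lookup rather than scanned linearly as here.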
Query Specification Interfaces
Dynamic query suggestions, from Netflix.com
Query Specification Interfaces
The dynamic query suggestions can be derived from
several sources, including:
The user’s own query history
A set of metadata that a Web site’s designer considers important
All of the text contained within a Web site
Query Specification Interfaces
Dynamic query suggestions, grouped by type, from
NextBio.com:
Retrieval Results Display
When displaying search results, either
the documents must be shown in full, or else
the searcher must be presented with some kind of representation
of the content of those documents
The document surrogate refers to the information that
summarizes the document
This information is a key part of the success of the search
interface
The design of document surrogates is an active area of research
and experimentation
The quality of the surrogate can greatly affect the perceived
relevance of the search results listing
Retrieval Results Display
In Web search, the page title is usually shown
prominently, along with the URL and other metadata
In search over information collections, metadata such
as date published and author are often displayed
A text summary (or snippet) containing text extracted
from the document is also critical
Currently, the standard results display is a vertical list of
textual summaries
This list is sometimes referred to as the SERP (Search
Engine Results Page)
Retrieval Results Display
In some cases the summaries are excerpts drawn from
the full text that contain the query terms
In other cases, specialized kinds of metadata are
shown in addition to standard textual results
This technique is known as blended results or
universal search
Retrieval Results Display
For example, a query on a term like “rainbow” may
return sample images as one entry in the results listing
Retrieval Results Display
A query on the name of a sports team might retrieve the
latest game scores and a link to buy tickets
Retrieval Results Display
Nielsen notes that in some cases the information need
is satisfied directly in the search results listing
This makes the search engine an “answer engine”
Displaying the query terms in the context in which they
appear in the document:
Improves the user’s ability to gauge the relevance of the results
It is sometimes referred to as KWIC - keywords in context
It is also known as query-biased summaries, query-oriented
summaries, or user-directed summaries
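A keywords-in-context display can be sketched in a few lines: find each query term in the text and cut out a window of surrounding characters. The function name kwic, the fixed character window, and the square-bracket highlighting are assumptions made for this example; real snippet generators typically work on sentence boundaries and score candidate passages.

```python
import re


def kwic(text, query_terms, width=30):
    """Return one snippet per query-term hit, with `width` characters of context."""
    if not query_terms:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, query_terms)) + r")\b",
        re.IGNORECASE,
    )
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        # Mark the hit with brackets as a stand-in for visual highlighting.
        snippets.append(
            text[start:match.start()] + "[" + match.group(0) + "]" + text[match.end():end]
        )
    return snippets


text = "The Web changed search. Web search engines rank pages."
for snippet in kwic(text, ["web"], width=12):
    print(snippet)
```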
Retrieval Results Display
The visual effect of query term highlighting can also
improve usability of search results listings
Highlighting can be shown both in document surrogates in the
retrieval results and in the retrieved documents
Determining which text to place in the summary, and
how much text to show, is a challenging problem
Often the summaries contain all the query terms in
close proximity to one another
However, there is a trade-off between
Showing contiguous sentences, to aid in coherence in the result
Showing sentences that contain the query terms
Retrieval Results Display
Some results suggest that it is better to show full
sentences rather than cut them off
On the other hand, very long sentences are usually not desirable
in the results listing
Further, the kind of information to display should vary
according to the intent of the query
Longer results are deemed better than shorter ones for certain
types of information need
On the other hand, an abbreviated listing is preferable for
navigational queries
Similarly, requests for factual information can be satisfied with a
concise results display
Retrieval Results Display
Other kinds of document information can be usefully
shown in the search results page
Retrieval Results Display
The page results below show figures extracted from
journal articles alongside the search results
Query Reformulation
There are tools to help users reformulate their query
One technique consists of showing terms related to the query or
to the documents retrieved in response to the query
A special case of this is spelling corrections or
suggestions
Usually only one suggested alternative is shown: clicking on that
alternative re-executes the query
In years past, the search results were shown using the
purportedly incorrect spelling
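As a rough sketch of how a single spelling suggestion might be produced, the example below compares each query word against a known vocabulary using Python's standard difflib module. The vocabulary, the 0.8 similarity cutoff, and the function name are assumptions; Web search engines generally derive suggestions from query logs rather than from a fixed word list.

```python
import difflib


def spelling_suggestion(query, vocabulary):
    """Return one corrected query, or None if no word needs correcting."""
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Closest known word, if any is similar enough (cutoff is an assumption).
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=0.8)
        if close:
            corrected.append(close[0])
            changed = True
        else:
            corrected.append(word)
    return " ".join(corrected) if changed else None


vocabulary = ["information", "retrieval", "search", "engine"]
print(spelling_suggestion("informaton retrieval", vocabulary))
```

Returning None when nothing changed mirrors the interface behavior described above, where a suggestion is shown only when an alternative exists.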
Query Reformulation
Microsoft Live’s search results page for the query “IMF”
Query Reformulation
Term expansion: search interfaces are increasingly
employing related term suggestions
Log studies suggest that term suggestions are a
somewhat heavily-used feature in Web search
Jansen et al made a log study and found that 8% of
queries were generated from term suggestions
Anick et al found that 6% of users who were exposed to
term suggestions chose to click on them
Query Reformulation
Some query term suggestions are based on the entire
search session of the particular user
Others are based on behavior of other users who have
issued the same or similar queries in the past
One strategy is to show similar queries by other users
Another is to extract terms from documents that have been
clicked on in the past by searchers who issued the same
query
Query Reformulation
Relevance feedback is another method whose goal is
to aid in query reformulation
The main idea is to have the user indicate which
documents are relevant to their query
In some variations, users also indicate which terms extracted
from those documents are relevant
The system then computes a new query from this
information and shows a new retrieval set
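The slide does not say how the new query is computed. One classic formulation is Rocchio's method, shown below in a simplified positive-feedback form: the new query vector is alpha times the original query plus beta times the average of the documents marked relevant. The weights alpha and beta, the raw term counts, and the sample texts are assumptions chosen for illustration.

```python
from collections import Counter


def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Simplified Rocchio: q' = alpha * q + beta * centroid(relevant documents)."""
    new_query = Counter()
    for term, count in Counter(query.lower().split()).items():
        new_query[term] += alpha * count
    for doc in relevant_docs:
        for term, count in Counter(doc.lower().split()).items():
            # Each relevant document contributes its share of the centroid.
            new_query[term] += beta * count / len(relevant_docs)
    return dict(new_query)


# Hypothetical query and two documents the user marked as relevant.
weights = rocchio("jaguar", ["jaguar car dealer", "jaguar car engine"])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(term, weight)
```

Terms that recur across the marked documents, such as "car" here, gain weight and pull the next retrieval set toward the intended sense of the query.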
Query Reformulation
Nonetheless, this method has not been found to be
successful from a usability perspective
Because of that, it does not appear in standard interfaces
today
This stems from several factors:
People are not particularly good at judging document relevance,
especially for topics with which they are unfamiliar
The beneficial behavior of relevance feedback is inconsistent
Organizing Search Results
Organizing results into meaningful groups can help
users understand the results and decide what to do
next
Popular methods for grouping search results: category
systems and clustering
Category system: meaningful labels organized in such
a way as to reflect the concepts relevant to a domain
Good category systems have the characteristics of being
coherent and relatively complete
Their structure is predictable and consistent across search
results for an information collection
Organizing Search Results
The most commonly used category structures are flat,
hierarchical, and faceted categories
Flat categories are simply lists of topics or subjects
They can be used for grouping, filtering (narrowing), and sorting
sets of documents in search interfaces
Most Web sites organize their information into general
categories
Selecting that category narrows the set of information shown
accordingly
Organizing Search Results
Some experimental Web search engines automatically
organize results into flat categories
Studies using this kind of design have received positive user
responses (Dumais et al, Kules et al)
However, it can be difficult to find the right subset of
categories to use for the vast content of the Web
Rather, category systems seem to work better for more
focused information collections
Organizing Search Results
In the early days of the Web, hierarchical directory
systems such as Yahoo’s were popular
Hierarchy can also be effective in the presentation of
search results over a book or other small collection
The Superbook system was an early search interface
based on this idea
In the Superbook system, the search results were
shown in the context of the table-of-contents hierarchy
Organizing Search Results
The SuperBook interface for showing retrieval results in
context
Organizing Search Results
An alternative representation is the faceted metadata
Unlike flat categories, faceted metadata allow the
assignment of multiple categories to a single item
Each category corresponds to a different facet
(dimension or feature type) of the collection of items
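The idea that one item can carry categories from several facets at once can be sketched with a small data structure. The recipe items, facet names, and helper functions below are hypothetical and only meant to show how selecting values in different facets narrows the result set.

```python
from collections import Counter


def facet_filter(items, **selected):
    """Keep items whose metadata matches every selected facet value."""
    return [
        item for item in items
        if all(value in item["facets"].get(facet, ())
               for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count how many items carry each value of one facet."""
    counts = Counter()
    for item in items:
        counts.update(item["facets"].get(facet, ()))
    return dict(counts)


# Hypothetical collection: each item has values in two facets.
items = [
    {"title": "Lemon tart", "facets": {"course": ["dessert"], "ingredient": ["lemon", "butter"]}},
    {"title": "Lemon chicken", "facets": {"course": ["main"], "ingredient": ["lemon", "chicken"]}},
    {"title": "Butter cookies", "facets": {"course": ["dessert"], "ingredient": ["butter"]}},
]
desserts_with_butter = facet_filter(items, course="dessert", ingredient="butter")
print([item["title"] for item in desserts_with_butter])
print(facet_counts(items, "course"))
```

The per-value counts are what faceted interfaces usually display next to each category label as a preview of how many results a click would leave.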
Organizing Search Results
The figure below shows an example of faceted navigation
Organizing Search Results
Clustering refers to the grouping of items according to
some measure of similarity
It groups together documents that are similar to one
another but different from the rest of the collection
Such as all the documents written in Japanese that appear in a
collection of primarily English articles
The greatest advantage of clustering is that it is fully
automatable
The disadvantages of clustering include
an unpredictability in the form and quality of results
the difficulty of labeling the groups
the counter-intuitiveness of cluster sub-hierarchies
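To show what grouping by a similarity measure can look like, the sketch below uses a very simple single-pass ("leader") clustering with Jaccard similarity over word sets. This particular algorithm, the 0.2 threshold, and the sample documents are assumptions chosen for brevity; the systems discussed in this chapter may use quite different methods.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def single_pass_cluster(docs, threshold=0.2):
    """Put each document in the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader word set, member document ids)
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        for leader, members in clusters:
            if jaccard(words, leader) >= threshold:
                members.append(doc_id)
                break
        else:
            # No existing cluster fits: this document starts a new one.
            clusters.append((words, [doc_id]))
    return [members for _, members in clusters]


# Hypothetical results for the query "senate".
docs = {
    1: "senate votes on budget bill",
    2: "budget bill passes senate",
    3: "roman senate history",
    4: "history of the roman republic",
}
print(single_pass_cluster(docs))
```

The output depends on document order and on the threshold, which illustrates the unpredictability listed among the disadvantages above; the sketch also produces no labels for the groups.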
Organizing Search Results
Output produced using Findex clustering
Organizing Search Results
Cluster output on the query “senate”, from Clusty.com
Visualization in Search Interfaces
Experimentation with visualization for search has been
primarily applied in the following ways:
Visualizing Boolean syntax
Visualizing query terms within retrieval results
Visualizing relationships among words and documents
Visualization for text mining
Visualizing Boolean Syntax
Boolean query syntax is difficult for most users and is
rarely used in Web search
For many years, researchers have experimented with
how to visualize Boolean query specification
A common approach is to show Venn diagrams
A more flexible version of this idea was seen in the
VQuery system, proposed by Steve Jones
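The Venn diagram view works because Boolean operators correspond directly to set operations on the documents that contain each term. The sketch below illustrates this with a hypothetical inverted index; the terms and document ids are made up.

```python
# Hypothetical inverted index: term -> set of ids of documents containing it
INDEX = {
    "jaguar": {1, 2, 3, 5},
    "car": {2, 3, 4},
    "cat": {1, 5, 6},
}

# jaguar AND car: the overlap of the two circles (intersection)
and_result = INDEX["jaguar"] & INDEX["car"]

# jaguar OR cat: everything inside either circle (union)
or_result = INDEX["jaguar"] | INDEX["cat"]

# jaguar AND NOT car: the part of the first circle outside the second (difference)
not_result = INDEX["jaguar"] - INDEX["car"]
```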
p. 58
Visualizing Boolean Syntax
The VQuery interface for Boolean query specification
p. 59
Visualizing Query Terms
Understanding the role of the query terms within the
retrieved docs can help relevance assessment
Experimental visualizations have been designed that
make this role more explicit
In the TileBars interface, for instance, documents are
shown as horizontal glyphs
The locations of the query term hits are marked along the
glyph
The user is encouraged to break the query into its
different facets, with one concept per line
Then, the lines show the frequency of occurrence of
query terms within each topic
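A rough sketch of the data behind a TileBars-style glyph: one row per query facet, one cell per document segment, with each cell holding the number of hits. For simplicity the document is cut into equal-length segments here, which is an assumption of this sketch rather than how TileBars segments text; the function names and sample text are also hypothetical.

```python
def tilebar(text, facets, n_segments=4):
    """Return one row per query facet; each cell counts facet-term hits in a segment."""
    tokens = text.lower().split()
    seg_len = max(1, -(-len(tokens) // n_segments))  # ceiling division
    rows = []
    for terms in facets:
        wanted = {t.lower() for t in terms}
        row = [0] * n_segments
        for pos, token in enumerate(tokens):
            if token in wanted:
                row[min(pos // seg_len, n_segments - 1)] += 1
        rows.append(row)
    return rows

def render(rows, shades=" .:#"):
    """Map counts to characters, darker meaning more hits (capped at the last shade)."""
    return ["".join(shades[min(c, len(shades) - 1)] for c in row) for row in rows]

DOC = ("osteoporosis risk rises with age diet and exercise can lower "
       "osteoporosis risk prevention research continues")
rows = tilebar(DOC, [["osteoporosis"], ["prevention", "research"]], n_segments=3)
glyph = render(rows)
```

Segments where several rows are dark at once are the passages in which the query facets co-occur.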
p. 60
Visualizing Query Terms
The TileBars interface
p. 61
Visualizing Query Terms
Other approaches include placing the query terms in
bar charts, scatter plots, and tables
A usability study by Reiterer et al compared five
views:
a standard Web search engine-style results listing
a list view showing titles, document metadata, and a graphic
showing locations of query terms
a color TileBars-like view
a color bar chart view like that of Veerasamy & Belkin
a scatter plot view plotting relevance scores against date of
publication
p. 62
Visualizing Query Terms
Field-sortable search results view
p. 63
Visualizing Query Terms
Colored TileBars view
p. 64
Visualizing Query Terms
When asked for subjective responses, the 40
participants of the study preferred, on average, in this
order:
Field-sortable view first
TileBars
Web-style listing
The bar chart and scatter plot received negative
responses
p. 65
Visualizing Query Terms
Another variation on the idea of showing query term hits
within documents is to show thumbnails
Thumbnails are miniaturized rendered versions of the visual
appearance of the document
However, Czerwinski et al found that thumbnails are no
better than blank squares for improving search results
The negative study results may stem from a problem
with the size of the thumbnails
Woodruff et al show that making the query terms more visible
via highlighting within the thumbnail improves its usability
p. 66
Visualizing Query Terms
Textually enhanced thumbnails
p. 67
Words and Docs Relationships
Numerous works proposed variations on the idea of
placing words and docs on a two-dimensional canvas
In these works, proximity of glyphs represents semantic
relationships among the terms or documents
An early version of this idea is the VIBE interface
Documents containing combinations of the query terms are
placed midway between the icons representing those terms
The Aduna Autofocus and the Lyberworld projects
presented a 3D version of the ideas behind VIBE
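The placement rule described for VIBE can be sketched as a weighted centroid: each query term has a fixed icon position, and a document is drawn at the average of the positions of the terms it contains. The coordinates, the weights, and the function name are assumptions of this sketch; with equal weights the document lands exactly midway between the icons.

```python
# Hypothetical screen positions for the query-term icons
ANCHORS = {
    "search": (0.0, 0.0),
    "interface": (10.0, 0.0),
    "visualization": (5.0, 8.0),
}

def place(term_weights):
    """Place a document at the weighted centroid of the anchors of the terms it contains."""
    total = sum(term_weights.values())
    if total == 0:
        return None  # the document contains none of the query terms
    x = sum(ANCHORS[t][0] * w for t, w in term_weights.items()) / total
    y = sum(ANCHORS[t][1] * w for t, w in term_weights.items()) / total
    return (x, y)

# Equal weights: the document sits midway between the two icons.
midway = place({"search": 1, "interface": 1})

# A heavier weight pulls the document toward that term's icon.
pulled = place({"search": 3, "interface": 1})
```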
p. 68
Words and Docs Relationships
The VIBE display
p. 69
Words and Docs Relationships
Another idea is to map docs or words from a very high-
dimensional term space down into a 2D plane
The display then shows where the docs or words fall within that plane,
using 2D or 3D graphics
This variation on clustering can be applied in two ways:
to the documents retrieved as a result of a query
to a pre-processed set of documents, within which the documents
that match a query are highlighted
InfoSky and xFIND’s VisIslands are two variations on
these starfield displays
p. 70
Words and Docs Relationships
InfoSky, from Jonker et al
p. 71
Words and Docs Relationships
xFIND’s VisIslands, from Andrews et al
p. 72
Words and Docs Relationships
These views are relatively easy to compute and can be
visually striking
However, evaluations that have been conducted so far
provide negative evidence as to their usefulness
The main problem is that the contents of the documents are
not visible in such views
A more promising application of this kind of idea is in
the layout of thesaurus terms, in a small network graph
Ex: Visual Wordnet
p. 73
Words and Docs Relationships
The Visual Wordnet view of the WordNet lexical
thesaurus
p. 74
Visualization for Text Mining
Visualization is also used for purposes of analysis and
exploration of textual data
Visualizations such as the Word Tree show a piece of a
text concordance
It allows the user to view which words and phrases commonly
precede or follow a given word
Another example is the NameVoyager, which shows
frequencies of names for U.S. children across time
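The data behind a Word Tree is a concordance: for a chosen word, count the phrases that follow it. The sketch below is a minimal version that assumes whitespace tokenization; the sample text is a made-up snippet, not a quotation of any real speech, and the function name is hypothetical.

```python
from collections import Counter

def following_phrases(text, word, width=2):
    """Count the phrases of up to `width` words that follow each occurrence of `word`."""
    tokens = text.lower().split()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == word:
            phrase = " ".join(tokens[i + 1 : i + 1 + width])
            if phrase:
                counts[phrase] += 1
    return counts

TEXT = "i have a dream that one day i have a dream today i have seen it"
branches = following_phrases(TEXT, "have")
```

A Word Tree would draw each distinct phrase as a branch from the root word, sized by its count. Counting the words that precede the root works the same way, with the slice taken before position `i`.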
p. 75
Visualization for Text Mining
The Word Tree visualization, on Martin Luther King’s
I have a dream speech, from Wattenberg et al
p. 76
Visualization for Text Mining
The popularity of baby names over time (names
beginning with JA), from babynamewizard.com
p. 77
Visualization for Text Mining
Visualization is also used in search interfaces intended
for analysts
An example is the TRIST information triage system,
from Proulx et al
In this system, search results are represented as
document icons
Thousands of documents can be viewed in one display
It supports multiple linked dimensions that allow for
finding characteristics and correlations among the docs
Its designers won the IEEE Visual Analytics Science
and Technology (VAST) contest for two years running
p. 78
Visualization for Text Mining
The TRIST interface with results for queries related to
Avian Flu
p. 79
Design and Evaluation
User interface design: a field of Human-Computer
Interaction (HCI)
This field studies how people think about, respond to,
and use technology
User-centered design: a set of practices developed to
facilitate the design of interfaces
The design process begins by determining what the
intended users’ goals are
Then, the interface is devised to help people achieve
those goals by completing a series of tasks
p. 80
Design and Evaluation
Goals in the domain of information access can range
quite widely
From finding a plumber to keeping informed about a business
competitor
From writing a publishable scholarly article to investigating an
allegation of fraud
The design of interfaces is an iterative process, in which
the goals and tasks are elucidated via user research
p. 81
Design and Evaluation
Evaluating a user interface is often different from
evaluating a ranking algorithm or a crawling technique
A crawler can be assessed by crisp quantitative metrics such as
coverage and freshness
A ranking algorithm can be evaluated by precision, recall, and
speed
The quality of a user interface is determined by how
people respond to it
Subjective responses are as important as quantitative
measures, if not more so
If a person has a choice between two systems, they will
use the one they prefer
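For contrast with subjective measures, the crisp metrics mentioned above for a ranking algorithm are easy to compute once relevance judgments exist. The sketch below computes set-based precision and recall; the document ids and judgments are made up.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, five judged relevant, two in common
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
```

No comparable formula exists for whether people prefer one interface over another, which is why user studies are needed.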
p. 82
Design and Evaluation
The reasons for preference may be determined by a
host of factors:
Speed, familiarity, aesthetics, preferred features, or perceived
ranking accuracy
Often the preferred choice is the familiar one
p. 83
Design and Evaluation
How best to evaluate a user interface depends on the
current stage in the development cycle
When starting with a new design or idea, discount
usability methods are typically used
Example: showing a few users different designs and asking them to
indicate which parts are promising and which are not
Another commonly used discount evaluation method is
heuristic evaluation
Usability experts “walk through” a design and evaluate the
functionality in accordance with a set of design guidelines
p. 84
Design and Evaluation
A formal experiment must be carefully designed to take
into account potentially confounding factors
For instance, it is important for participants to be motivated to do
well on the task
This kind of study can uncover important subjective
results
Such as whether a new design is strongly preferred over a
baseline
However, it is difficult to find accurate quantitative
differences with a small number of participants
p. 85
Design and Evaluation
Another problem: the timing variable is not the right
measure for evaluating an interactive search session
A tool that allows the searcher to learn about their subject matter
as they search may be more beneficial, but take more time
Two approaches to evaluating search interfaces have
gained in popularity in recent years
One is to conduct a longitudinal study
Participants use a new interface for an extended period of time,
and their usage is monitored and logged
Evaluation is based both on log analysis and questionnaires and
interviews with the participants
p. 86
Design and Evaluation
Another evaluation technique is to perform experiments
on already heavily-used Web sites
Consider a search engine that receives millions of
queries a day
A randomly selected subset of the users is shown a new
design
Their actions are logged and compared to those of another randomly
selected control group that continues to use the existing interface
This approach is often referred to as bucket testing or A/B testing
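A minimal sketch of how such an experiment might be wired up, under stated assumptions: users are assigned to buckets by hashing their id (a common way to get a stable pseudo-random split, not something the slides prescribe), and the buckets are compared on click-through rate. The experiment name, the 10% treatment share, and the log format are all hypothetical.

```python
import hashlib

def assign_bucket(user_id, experiment="new-design", treatment_share=0.1):
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000  # roughly uniform value in [0, 1)
    return "treatment" if fraction < treatment_share else "control"

def click_through_rate(log):
    """log is a list of (bucket, clicked) pairs; return the click-through rate per bucket."""
    shown, clicks = {}, {}
    for bucket, clicked in log:
        shown[bucket] = shown.get(bucket, 0) + 1
        clicks[bucket] = clicks.get(bucket, 0) + int(clicked)
    return {bucket: clicks[bucket] / shown[bucket] for bucket in shown}

# Hypothetical interaction log: (bucket, whether the user clicked a result)
LOG = [
    ("treatment", True), ("treatment", False),
    ("control", False), ("control", False), ("control", True), ("control", False),
]
rates = click_through_rate(LOG)
```

Hashing keeps each user in the same bucket across visits without storing the assignment. A real study would also test whether the difference between the rates is statistically significant before drawing conclusions.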
p. 87

JM Information Retrieval Techniques Unit I

  • 1.
    Information Retrieval Techniques p. 1 Chapter1 Introduction Information Retrieval The IR Problem The IR System The Web
  • 2.
    Information Retrieval (IR) IRdeals with the representation, storage, organization of, and access to information items Types of information items: documents, Web pages, online catalogs, structured records, multimedia objects Early goals of the IR area: indexing text and searching for useful documents in a collection Nowadays, research in IR includes: Modeling, Web search, text classification, systems architecture, user interfaces, data visualization, filtering and languages – p. 2
  • 3.
    Early Developments For morethan 5,000 years, man has organized information for later retrieval and searching This has been done by compiling, storing, organizing, and indexing papyrus, hieroglyphics, and books For holding the various items, special purpose buildings called libraries, or bibliothekes, are used The oldest known library was created in Elba, in the Fertile Crescent, between 3,000 and 2,500 BC By 300 BC, Ptolemy Soter, a Macedonian general, created the Great Library at Alexandria Nowadays, libraries are everywhere In 2008, more than 2 billion items were checked out from libraries in the US—an increase of 10% over the previous year – p. 3
  • 4.
    Early Developments Since thevolume of information in libraries is always growing, it is necessary to build specialized data structures for fast search — the indexes For centuries indexes have been created manually as sets of categories, with labels associated with each category The advent of modern computers has allowed the construction of large indexes automatically – p. 4
  • 5.
    Early Developments inIR During the 50’s, research efforts in IR were initiated by pioneers such as Hans Peter Luhn, Eugene Garfield, Philip Bagley, and Calvin Moores, who allegedly coined the term Information Retrieval In 1962, Cyril Cleverdon published the Cranfield studies on retrieval evaluation In 1963, Joseph Becker and Robert Hayes published the first book on IR In the late 60’s, key research conducted by Karen Sparck Jones and Gerard Salton, among others, led to the definition of the TF-IDF term weighting scheme – p. 5
  • 6.
    Early Developments inIR In 1971, Jardine and van Rijsbergen articulated the cluster hypothesis In 1978, the first ACM SIGIR Internation Conference on Information Retrieval was held in Rochester In 1979, van Rijsbergen published a classic book entitled Information Retrieval, which focused on the Probabilistic Model In 1983, Salton and McGill published a classic book entitled Introduction to Modern Information Retrieval, which focused on the Vector Model – p. 6
  • 7.
    Libraries and DigitalLibraries Libraries were among the first institutions to adopt IR systems for retrieving information Initially, such systems consisted of an automation of existing processes such as card catalogs searching Increased search functionality was then added Ex: subject headings, keywords, query operators Nowadays, the focus has been on improved graphical interfaces, electronic forms, hypertext features – p. 7
  • 8.
    IR at theCenter of the Stage Until recently, IR was an area of interest restricted mainly to librarians and information experts A single fact changed these perceptions—the introduction of the Web, which has become the largest repository of knowledge in human history Due to its enormous size, finding useful information on the Web usually requires running a search And searching on the Web is all about IR and its technologies Thus, almost overnight, IR has gained a place with other technologies at the center of the stage – p. 8
  • 9.
  • 10.
    The IR Problem Usersof modern IR systems, such as search engine users, have information needs of varying complexity An example of complex information need is as follows: Find all documents that address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK) – p. 10
  • 11.
    The IR Problem Thisfull description of the user information need is not necessarily a good query to be submitted to the IR system Instead, the user might want to first translate this information need into a query This translation process yields a set of keywords, or index terms, which summarize the user information need Given the user query, the key goal of the IR system is to retrieve information that is useful or relevant to the user – p. 11
  • 12.
    The IR Problem Thatis, the IR system must rank the information items according to a degree of relevance to the user query The IR Problem The key goal of an IR system is to retrieve all the items that are relevant to a user query, while retrieving as few nonrelevant items as possible The notion of relevance is of central importance in IR – p. 12
  • 13.
    The User’s Task Considera user who seeks information on a topic of their interest This user first translates their information need into a query, which requires specifying the words that compose the query In this case, we say that the user is searching or querying for information of their interest Consider now a user who has an interest that is either poorly defined or inherently broad For instance, the user has an interest in car racing and wants to browse documents on Formula 1 and Formula Indy In this case, we say that the user is browsing or navigating the documents of the collection – p. 13
  • 14.
  • 15.
    Information × DataRetrieval Data retrieval: the task of determining which documents of a collection contain the keywords in the user query Data retrieval system Ex: relational databases Deals with data that has a well defined structure and semantics A single erroneous object among a thousand retrieved objects means total failure Data retrieval does not solve the problem of retrieving information about a subject or topic – p. 15
  • 16.
  • 17.
    Architecture of theIR System High level software architecture of an IR system – p. 17
  • 18.
    Retrieval and RankingProcesses The processes of indexing, retrieval, and ranking – p. 18
  • 19.
  • 20.
    A Brief History Atthe end of World War II, Vannevar Bush looked for applications of new technologies to peace times Bush first produced a report entitled Science, The Endless Frontier This report directly influenced the creation of the National Science Foundation Following, he wrote As We May Think, a remarkable paper which discussed new hardware and software gadgets In Bush’s words: Whole new forms of encyclopedias will appear, ready-made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified – p. 20
  • 21.
    A Brief History AsWe May Think influenced people like Douglas Engelbart, who invented the computer mouse and introduced the concept of hyperlinked texts Ted Nelson, working in his Project Xanadu, pushed the concept further and coined the term hypertext A hypertext allows the reader to jump from one electronic document to another, which was one important property regarding the problem that Tim Berners-Lee faced in 1989 – p. 21
  • 22.
    A Brief History Atthe time, Berners-Lee worked in Geneva at the CERN—Conseil Européen pour la Recherche Nucléaire There, researchers who wanted to share documentation with others had to reformat their documents to make them compatible with an internal publishing system Berners-Lee reasoned that it would be nice if the solution of sharing documents were decentralized He saw that a networked hypertext would be a good solution and started working on its implementation – p. 22
  • 23.
    A Brief History In1990, Berners-Lee Wrote the HTTP protocol Defined the HTML language Wrote the first browser, which he called World Wide Web Wrote the first Web server In 1991, he made his browser and server software available in the Internet The Web was born! – p. 23
  • 24.
    The e-Publishing Era Sinceits inception, the Web became a huge success Well over 20 billion pages are now available and accessible in the Web More than one fourth of humanity now access the Web on a regular basis Why is the Web such a success? What is the single most important characteristic of the Web that makes it so revolutionary? In search for an answer, let us dwell into the life of a writer who lived at the end of the 18th Century – p. 24
  • 25.
    The e-Publishing Era Shefinished the first draft of her novel in 1796 The first attempt of publication was refused without a reading The novel was only published 15 years later! She got a flat fee of $110, which meant that she was not paid anything for the many subsequent editions Further, her authorship was anonymized under the reference “By a Lady” We are talking of ... – p. 25
  • 26.
    The e-Publishing Era Prideand Prejudice is the second or third best loved novel in the UK ever, after The Lord of the Rings and Harry Potter It has been the subject of six TV series and five film versions The last of these, starring Keira Knightley and Matthew Macfadyen, grossed over 100 million dollars Jane Austen published anonymously her entire life Throughout the 20th century, her novels have never been out of print – p. 26
  • 27.
    The e-Publishing Era JaneAusten was discriminated because there was no freedom to publish in the beginning of the 19th century The Web, unleashed by the inventiveness of Tim Berners-Lee, changed this once and for all It did so by universalizing freedom to publish The Web moved mankind into a new era, into a new time, into The e-Publishing Era – p. 27
  • 28.
    How the WebChanged Search Web search is today the most prominent application of IR and its techniques—the ranking and indexing components of any search engine are fundamentally IR pieces of technology The first major impact of the Web on search is related to the characteristics of the document collection itself The Web is composed of pages distributed over millions of sites and connected through hyperlinks This requires collecting all documents and storing copies of them in a central repository, prior to indexing This new phase in the IR process, introduced by the Web, is called crawling – p. 28
  • 29.
    How the WebChanged Search The second major impact of the Web on search is related to: The size of the collection The volume of user queries submitted on a daily basis As a consequence, performance and scalability have become critical characteristics of the IR system The third major impact : in a very large collection, predicting relevance is much harder than before Fortunately, the Web also includes new sources of evidence Ex: hyperlinks and user clicks in documents in the answer set – p. 29
  • 30.
    How the WebChanged Search The fourth major impact derives from the fact that the Web is also a medium to do business Search problem has been extended beyond the seeking of text information to also encompass other user needs Ex: the price of a book, the phone number of a hotel, the link for downloading a software The fifth major impact of the Web on search is Web spam Web spam: abusive availability of commercial information disguised in the form of informational content This difficulty is so large that today we talk of Adversarial Web Retrieval – p. 30
  • 31.
    Practical Issues inthe Web Security Commercial transations over the Internet are not yet a completely safe procedure Privacy Frequently, people are willing to exchange information as long as it does not become public Copyright and patent rights It is far from clear how the wide spread of data on the Web affects copyright and patent laws in the various countries Scanning, optical character recognition (OCR), and cross-language retrieval – p. 31
  • 32.
    Organization of theBook – p. 32
  • 33.
    Focus of theBook The book presents an overall view of research in IR from a computer scientist’s perspective This means that the main focus of the book is on computer algorithms and techniques used in IR systems A rather distinct viewpoint is taken by librarians and information science researchers In this viewpoint, the focus is on trying to understand how people interpret and use information This human-centered viewpoint is discussed in the user interfaces chapter and in the last two chapters of the book – p. 33
  • 34.
  • 35.
    Modern Information Retrieval p.1 Chapter 2 User Interfaces for Search How People Search Search Interfaces Today Visualization in Search Interfaces Design and Evaluation of Search Interfaces
  • 36.
    Introduction This chapter focuseson the human users of search systems the search user interface, i.e., the window through which search systems are seen The user interface role is to aid in the searchers’ understanding and expression of their information need Further, the interface should help users formulate their queries select among available information sources understand search results keep track of the progress of their search p. 2
  • 37.
  • 38.
    How People Search Userinteraction with search interfaces differs depending on the type of task the domain expertise of the information seeker the amount of time and effort available to invest in the process Marchionini makes a distinction between information lookup and exploratory search Information lookup tasks are akin to fact retrieval or question answering can be satisfied by discrete pieces of information: numbers, dates, names, or Web sites can work well for standard Web search interactions p. 4
  • 39.
    How People Search Exploratorysearch is divided into learning and investigating tasks Learning search requires more than single query-response pairs requires the searcher to spend time scanning and reading multiple information items synthesizing content to form new understanding p. 5
  • 40.
    How People Search Investigatingrefers to a longer-term process which involves multiple iterations that take place over perhaps very long periods of time may return results that are critically assessed before being integrated into personal and professional knowledge bases may be concerned with finding a large proportion of the relevant information available p. 6
  • 41.
    How People Search Informationseeking can be seen as being part of a larger process referred to as sensemaking Sensemaking is an iterative process of formulating a conceptual representation from a large collection Russell et al. observe that most of the effort in sensemaking goes towards the synthesis of a good representation Some sensemaking activities interweave search throughout, while others consist of doing a batch of search followed by a batch of analysis and synthesis p. 7
  • 42.
    How People Search Examplesof deep analysis tasks that require sensemaking (in addition to search) the legal discovery process epidemiology (disease tracking) studying customer complaints to improve service obtaining business intelligence. p. 8
  • 43.
    Classic × DynamicModel Classic notion of the information seeking process: 1. problem identification 2. articulation of information need(s) 3. query formulation 4. results evaluation More recent models emphasize the dynamic nature of the search process The users learn as they search Their information needs adjust as they see retrieval results and other document surrogates This dynamic process is sometimes referred to as the berry picking model of search p. 9
  • 44.
    Classic × DynamicModel The rapid response times of today’s Web search engines allow searchers: to look at the results that come back to reformulate their query based on these results This kind of behavior is a commonly-observed strategy within the berry-picking approach Sometimes it is referred to as orienteering Jansen et al made a analysis of search logs and found that the proportion of users who modified queries is 52% p. 10
  • 45.
    Classic × DynamicModel Some seeking models cast the process in terms of strategies and how choices for next steps are made In some cases, these models are meant to reflect conscious planning behavior by expert searchers In others, the models are meant to capture the less planned, potentially more reactive behavior of a typical information seeker p. 11
  • 46.
    Navigation × Search Navigation:the searcher looks at an information structure and browses among the available information This browsing strategy is preferrable when the information structure is well-matched to the user’s information need it is mentally less taxing to recognize a piece of information than it is to recall it it works well only so long as appropriate links are available If the links are not avaliable, then the browsing experience might be frustrating p. 12
  • 47.
    Navigation × Search Spooldiscusses an example of a user looking for a software driver for a particular laser printer Say the user first clicks on printers, then laser printers, then the following sequence of links: HP laser printers HP laser printers model 9750 software for HP laser printers model 9750 software drivers for HP laser printers model 9750 software drivers for HP laser printers model 9750 for the Win98 operating system This kind of interaction is acceptable when each refinement makes sense for the task at hand p. 13
  • 48.
    Search Process Numerous studieshave been made of people engaged in the search process The results of these studies can help guide the design of search interfaces One common observation is that users often reformulate their queries with slight modifications Another is that searchers often search for information that they have previously accessed The users’ search strategies differ when searching over previously seen materials Researchers have developed search interfaces support both query history and revisitation p. 14
  • 49.
    Search Process Studies alsoshow that it is difficult for people to determine whether or not a document is relevant to a topic The less users know about a topic, the poorer judges they are of whether a search result is relevant to that topic Other studies found that searchers tend to look at only the top-ranked retrieved results Further, they are biased towards thinking the top one or two results are better than those beneath them p. 15
  • 50.
    Search Process Studies alsoshow that people are poor at estimating how much of the relevant material they have found Other studies have assessed the effects of knowledge of the search process itself These studies have observed that experts use different strategies than novices searchers For instance, Tabatabai et al found that expert searchers were more patient than novices this positive attitude led to better search outcomes p. 16
  • 51.
  • 52.
    Getting Started How doesan information seeking session begin in online information systems? The most common way is to use a Web search engine Another method is to select a Web site from a personal collection of already-visited sites which are typically stored in a browser’s bookmark Online bookmark systems are popular among a smaller segment of users Ex: Delicious.com Web directories are also used as a common starting point, but have been largely replaced by search engines p. 18
  • 53.
    Query Specification The primarymethods for a searcher to express their information need are either entering words into a search entry form selecting links from a directory or other information organization display For Web search engines, the query is specified in textual form Typically, Web queries today are very short consisting of one to three words p. 19
  • 54.
    Query Specification Short queriesreflect the standard usage scenario in which the user tests the waters: If the results do not look relevant, then the user reformulates their query If the results are promising, then the user navigates to the most relevant-looking Web site This search behavior is a demonstration of the orienteering strategy of Web search p. 20
  • 55.
    Query Specification Before theWeb, search systems regularly supported Boolean operators and command-based syntax However, these are often difficult for most users to understand Jansen et al conducted a study over a Web log with 1.5M queries, and found that 2.1% of the queries contained Boolean operators 7.6% contained other query syntax, primarily double-quotation marks for phrases White et al examined interaction logs of nearly 600,000 users, and found that 1.1% of the queries contained one or more operators 8.7% of the users used an operator at any time p. 21
  • 56.
    Query Specification Web rankinghas gone through three major phases In the first phase, from approximately 1994–2000: Since the Web was much smaller then, complex queries were less likely to yield relevant information Further, pages retrieved not necessarily contained all query words Around 1997, Google moved to conjunctive queries only The other Web search engines followed, and conjunctive ranking became the norm Google also added term proximity information and page importance scoring (PageRank) As the Web grew, longer queries posed as phrases started to produce highly relevant results p. 22
  • 57.
    Query Specification Interfaces Thestandard interface for a textual query is a search box entry form Studies suggest a relationship between query length and the width of the entry form Results found that either small forms discourage long queries or wide forms encourage longer queries p. 23
  • 58.
    Query Specification Interfaces Someentry forms are followed by a form that filters the query in some way For instance, at yelp.com, the user can refine the search by location using a second form Notice that the yelp.com form also shows the user’s home location, if it has been specified previously p. 24
  • 59.
    Query Specification Interfaces Somesearch forms show hints on what kind of information should be entered into each form For instance, in zvents.com search, the first box is labeled “what are you looking for”? p. 25
  • 60.
    Query Specification Interfaces Theprevious example also illustrates specialized input types that some search engines are supporting today The zvents.com site recognizes that words like “tomorrow” are time-sensitive It also allows flexibility in the syntax of dates To illustrate, searching for “comedy on wed ” automatically computes the date for the nearest future Wednesday This is an example of how the interface can be designed to reflect how people think p. 26
  • 61.
    Query Specification Interfaces Someinterfaces show a list of query suggestions as the user types the query This is referred to as auto-complete, auto-suggest, or dynamic query suggestions Anick et al found that users clicked on dynamic Yahoo suggestions one third of the time Often the suggestions shown are those whose prefix matches the characters typed so far However, in some cases, suggestions are shown that only have interior letters matching Further, suggestions may be shown that are synonyms of the words typed so far p. 27
  • 62.
    Query Specification Interfaces Dynamicquery suggestions, from Netflix.com p. 28
  • 63.
    Query Specification Interfaces Thedynamic query suggestions can be derived from several sources, including: The user’s own query history A set of metadata that a Web site’s designer considers important All of the text contained within a Web site p. 29
  • 64.
    Query Specification Interfaces Dynamicquery suggestions, grouped by type, from NextBio.com: p. 30
  • 65.
    Retrieval Results Display
    When displaying search results, either the documents must be shown in full, or else the searcher must be presented with some kind of representation of the content of those documents.
    The document surrogate refers to the information that summarizes the document.
    This information is a key part of the success of the search interface.
    The design of document surrogates is an active area of research and experimentation.
    The quality of the surrogate can greatly affect the perceived relevance of the search results listing.
  • 66.
    Retrieval Results Display
    In Web search, the page title is usually shown prominently, along with the URL and other metadata.
    In search over information collections, metadata such as date published and author are often displayed.
    A text summary (or snippet) containing text extracted from the document is also critical.
    Currently, the standard results display is a vertical list of textual summaries.
    This list is sometimes referred to as the SERP (Search Engine Results Page).
  • 67.
    Retrieval Results Display
    In some cases the summaries are excerpts drawn from the full text that contain the query terms.
    In other cases, specialized kinds of metadata are shown in addition to standard textual results.
    This technique is known as blended results or universal search.
  • 68.
    Retrieval Results Display
    For example, a query on a term like “rainbow” may return sample images as one entry in the results listing.
  • 69.
    Retrieval Results Display
    A query on the name of a sports team might retrieve the latest game scores and a link to buy tickets.
  • 70.
    Retrieval Results Display
    Nielsen notes that in some cases the information need is satisfied directly in the search results listing.
    This makes the search engine an “answer engine”.
    Displaying the query terms in the context in which they appear in the document improves the user’s ability to gauge the relevance of the results.
    This kind of display is sometimes referred to as KWIC (keywords in context).
    It is also known as query-biased summaries, query-oriented summaries, or user-directed summaries.
  • 71.
    Retrieval Results Display
    The visual effect of query term highlighting can also improve usability of search results listings.
    Highlighting can be shown both in document surrogates in the retrieval results and in the retrieved documents.
    Determining which text to place in the summary, and how much text to show, is a challenging problem.
    Often the summaries contain all the query terms in close proximity to one another.
    However, there is a trade-off between:
    showing contiguous sentences, to aid in coherence in the result;
    showing sentences that contain the query terms.
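A keywords-in-context summary with query term highlighting can be sketched roughly as below. This is a simplified illustration under stated assumptions: the function name `kwic_snippet`, the fixed character window around the first hit, and the `**` highlight marker are all hypothetical choices. It does not address the trade-off between sentence coherence and term coverage discussed above.

```python
import re


def kwic_snippet(text, query_terms, width=30, mark="**"):
    """Return a window of text around the first query-term hit, with hits marked.

    If no query term occurs, fall back to the beginning of the text.
    """
    if not query_terms:
        return text[: 2 * width]
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    hit = pattern.search(text)
    if hit is None:
        return text[: 2 * width]
    # Take `width` characters of context on each side of the first hit.
    start = max(0, hit.start() - width)
    end = min(len(text), hit.end() + width)
    window = text[start:end]
    # Mark every query-term occurrence that falls inside the window.
    return pattern.sub(lambda h: mark + h.group(0) + mark, window)


print(kwic_snippet("The quick brown fox jumps over the lazy dog", ["fox"], width=6))
# prints: brown **fox** jumps
```

Cutting the window at sentence boundaries instead of a fixed character count would favor coherence at the cost of longer snippets.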
  • 72.
    Retrieval Results Display
    Some results suggest that it is better to show full sentences rather than cut them off.
    On the other hand, very long sentences are usually not desirable in the results listing.
    Further, the kind of information to display should vary according to the intent of the query.
    Longer results are deemed better than shorter ones for certain types of information need.
    On the other hand, an abbreviated listing is preferable for navigational queries.
    Similarly, requests for factual information can be satisfied with a concise results display.
  • 73.
    Retrieval Results Display
    Other kinds of document information can be usefully shown in the search results page.
  • 74.
    Retrieval Results Display
    The results page below shows figures extracted from journal articles alongside the search results.
  • 75.
    Query Reformulation
    There are tools to help users reformulate their query.
    One technique consists of showing terms related to the query or to the documents retrieved in response to the query.
    A special case of this is spelling corrections or suggestions.
    Usually only one suggested alternative is shown: clicking on that alternative re-executes the query.
    In earlier years, the search results were shown using the purportedly incorrect spelling.
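A single suggested spelling alternative can be sketched with the standard library's `difflib.get_close_matches`. This is only an illustration: Web search engines derive corrections from query logs and much larger vocabularies, and the function name `suggest_spelling`, the tiny vocabulary, and the similarity cutoff of 0.8 are assumptions.

```python
import difflib


def suggest_spelling(query, vocabulary, cutoff=0.8):
    """Return one corrected version of the query, or None if nothing changes.

    Each query word that is not in the vocabulary is replaced by its
    closest vocabulary word, provided the similarity reaches `cutoff`.
    """
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            corrected.append(word)
    return " ".join(corrected) if changed else None


vocabulary = ["information", "retrieval", "interface"]
print(suggest_spelling("informaton retrieval", vocabulary))
# prints: information retrieval
```

Returning `None` when nothing changes mirrors the interface behavior of showing a suggestion only when an alternative exists.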
  • 76.
    Query Reformulation
    Microsoft Live’s search results page for the query “IMF”
  • 77.
    Query Reformulation
    Term expansion: search interfaces are increasingly employing related term suggestions.
    Log studies suggest that term suggestions are a somewhat heavily used feature in Web search.
    In a log study, Jansen et al. found that 8% of queries were generated from term suggestions.
    Anick et al. found that 6% of users who were exposed to term suggestions chose to click on them.
  • 78.
    Query Reformulation
    Some query term suggestions are based on the entire search session of the particular user.
    Others are based on the behavior of other users who have issued the same or similar queries in the past.
    One strategy is to show similar queries by other users.
    Another is to extract terms from documents that have been clicked on in the past by searchers who issued the same query.
  • 79.
    Query Reformulation
    Relevance feedback is another method whose goal is to aid in query reformulation.
    The main idea is to have the user indicate which documents are relevant to their query.
    In some variations, users also indicate which terms extracted from those documents are relevant.
    The system then computes a new query from this information and shows a new retrieval set.
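The slide does not say how the new query is computed. One simple possibility, shown here purely as an assumption, is to add the most frequent terms from the documents the user marked as relevant. The function name `expand_query`, the stopword list, and the sample documents are hypothetical; weighted schemes over term vectors are common alternatives.

```python
from collections import Counter

# Hypothetical, very small stopword list for the illustration.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "in", "to", "is"})


def expand_query(query_terms, relevant_docs, n_new=2):
    """Append the n_new most frequent new terms from the relevant documents."""
    known = {t.lower() for t in query_terms}
    counts = Counter()
    for doc in relevant_docs:
        for token in doc.lower().split():
            if token not in STOPWORDS and token not in known:
                counts[token] += 1
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked[:n_new]


marked_relevant = [
    "jaguar car engine",
    "jaguar car dealer price",
    "car engine repair",
]
print(expand_query(["jaguar"], marked_relevant))
# prints: ['jaguar', 'car', 'engine']
```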
  • 80.
    Query Reformulation
    Nonetheless, relevance feedback has not been found to be successful from a usability perspective.
    Because of that, it does not appear in standard interfaces today.
    This stems from several factors:
    people are not particularly good at judging document relevance, especially for topics with which they are unfamiliar;
    the beneficial behavior of relevance feedback is inconsistent.
  • 81.
    Organizing Search Results
    Organizing results into meaningful groups can help users understand the results and decide what to do next.
    Popular methods for grouping search results: category systems and clustering.
    Category system: meaningful labels organized in such a way as to reflect the concepts relevant to a domain.
    Good category systems have the characteristics of being coherent and relatively complete.
    Their structure is predictable and consistent across search results for an information collection.
  • 82.
    Organizing Search Results
    The most commonly used category structures are flat, hierarchical, and faceted categories.
    Flat categories are simply lists of topics or subjects.
    They can be used for grouping, filtering (narrowing), and sorting sets of documents in search interfaces.
    Most Web sites organize their information into general categories.
    Selecting a category narrows the set of information shown accordingly.
  • 83.
    Organizing Search Results
    Some experimental Web search engines automatically organize results into flat categories.
    Studies using this kind of design have received positive user responses (Dumais et al., Kules et al.).
    However, it can be difficult to find the right subset of categories to use for the vast content of the Web.
    Rather, category systems seem to work better for more focused information collections.
  • 84.
    Organizing Search Results
    In the early days of the Web, hierarchical directory systems such as Yahoo’s were popular.
    Hierarchy can also be effective in the presentation of search results over a book or other small collection.
    The SuperBook system was an early search interface based on this idea.
    In the SuperBook system, the search results were shown in the context of the table-of-contents hierarchy.
  • 85.
    Organizing Search Results
    The SuperBook interface for showing retrieval results in context
  • 86.
    Organizing Search Results
    An alternative representation is faceted metadata.
    Unlike flat categories, faceted metadata allow the assignment of multiple categories to a single item.
    Each category corresponds to a different facet (dimension or feature type) of the collection of items.
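To make the idea concrete, the sketch below assigns each item one value per facet, narrows the collection by the selected facet values, and counts the remaining items per value of another facet. The recipe data, the names `filter_items` and `facet_counts`, and the one-value-per-facet simplification are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical collection: every item carries a value for several facets.
RECIPES = [
    {"title": "Green curry", "facets": {"cuisine": "thai", "course": "main"}},
    {"title": "Mango sticky rice", "facets": {"cuisine": "thai", "course": "dessert"}},
    {"title": "Tiramisu", "facets": {"cuisine": "italian", "course": "dessert"}},
]


def filter_items(items, selected):
    """Keep the items that match every selected facet value."""
    return [
        item for item in items
        if all(item["facets"].get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of the given facet."""
    return Counter(item["facets"][facet] for item in items if facet in item["facets"])


desserts = filter_items(RECIPES, {"course": "dessert"})
print([item["title"] for item in desserts])
print(dict(facet_counts(desserts, "cuisine")))
```

The per-value counts are what a faceted interface typically shows next to each link, so the user can see how many items remain before narrowing further.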
  • 87.
    Organizing Search Results
    The figure below shows an example of faceted navigation.
  • 88.
    Organizing Search Results
    Clustering refers to the grouping of items according to some measure of similarity.
    It groups together documents that are similar to one another but different from the rest of the collection.
    An example is all the documents written in Japanese that appear in a collection of primarily English articles.
    The greatest advantage of clustering is that it is fully automatable.
    The disadvantages of clustering include:
    an unpredictability in the form and quality of results;
    the difficulty of labeling the groups;
    the counter-intuitiveness of cluster sub-hierarchies.
  • 89.
    Organizing Search Results
    Output produced using Findex clustering
  • 90.
    Organizing Search Results
    Cluster output on the query “senate”, from Clusty.com
  • 91.
    Visualization in Search Interfaces
    Experimentation with visualization for search has been primarily applied in the following ways:
    visualizing Boolean syntax;
    visualizing query terms within retrieval results;
    visualizing relationships among words and documents;
    visualization for text mining.
  • 92.
    Visualizing Boolean Syntax
    Boolean query syntax is difficult for most users and is rarely used in Web search.
    For many years, researchers have experimented with how to visualize Boolean query specification.
    A common approach is to show Venn diagrams.
    A more flexible version of this idea was seen in the VQuery system, proposed by Steve Jones.
  • 93.
    Visualizing Boolean Syntax
    The VQuery interface for Boolean query specification
  • 94.
    Visualizing Query Terms
    Understanding the role of the query terms within the retrieved docs can help relevance assessment.
    Experimental visualizations have been designed that make this role more explicit.
    In the TileBars interface, for instance, documents are shown as horizontal glyphs.
    The locations of the query term hits are marked along the glyph.
    The user is encouraged to break the query into its different facets, with one concept per line.
    Then, the lines show the frequency of occurrence of query terms within each topic.
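A rough text-only approximation of such a glyph is sketched below. It is not the TileBars implementation: the function name `tilebar`, the fixed-length word segments standing in for topical passages, and the character shades are assumptions for illustration only.

```python
def tilebar(doc, facets, seg_len=5):
    """Return one string per query facet; each character is one document segment.

    The character gets darker as the number of facet-term hits in the
    segment grows: ' ' for 0 hits, '.' for 1, ':' for 2, '#' for 3 or more.
    """
    shades = " .:#"
    words = doc.lower().split()
    segments = [words[i:i + seg_len] for i in range(0, len(words), seg_len)]
    rows = []
    for terms in facets:
        row = ""
        for segment in segments:
            hits = sum(segment.count(term) for term in terms)
            row += shades[min(hits, len(shades) - 1)]
        rows.append(row)
    return rows


doc = "cats chase mice dogs chase cats mice hide birds sing"
for row in tilebar(doc, [["cats"], ["mice", "birds"]]):
    print("[" + row + "]")
```

Each row corresponds to one line of the query, so a column where every row is dark marks a passage in which all facets of the query co-occur.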
  • 95.
    Visualizing Query Terms
    The TileBars interface
  • 96.
    Visualizing Query Terms
    Other approaches include placing the query terms in bar charts, scatter plots, and tables.
    A usability study by Reiterer et al. compared five views:
    a standard Web search engine-style results listing;
    a list view showing titles, document metadata, and a graphic showing locations of query terms;
    a color TileBars-like view;
    a color bar chart view like that of Veerasamy & Belkin;
    a scatter plot view plotting relevance scores against date of publication.
  • 97.
    Visualizing Query Terms
    Field-sortable search results view
  • 98.
  • 99.
    Visualizing Query Terms
    When asked for subjective responses, the 40 participants of the study preferred, on average, the views in this order:
    the field-sortable view first;
    TileBars second;
    the Web-style listing third.
    The bar chart and scatter plot received negative responses.
  • 100.
    Visualizing Query Terms
    Another variation on the idea of showing query term hits within documents is to show thumbnails.
    Thumbnails are miniaturized rendered versions of the visual appearance of the document.
    However, Czerwinski et al. found that thumbnails are no better than blank squares for improving search results.
    The negative study results may stem from a problem with the size of the thumbnails.
    Woodruff et al. show that making the query terms more visible via highlighting within the thumbnail improves its usability.
  • 101.
    Visualizing Query Terms
    Textually enhanced thumbnails
  • 102.
    Words and Docs Relationships
    Numerous works proposed variations on the idea of placing words and docs on a two-dimensional canvas.
    In these works, proximity of glyphs represents semantic relationships among the terms or documents.
    An early version of this idea is the VIBE interface.
    Documents containing combinations of the query terms are placed midway between the icons representing those terms.
    The Aduna Autofocus and the Lyberworld projects presented a 3D version of the ideas behind VIBE.
  • 103.
    Words and Docs Relationships
    The VIBE display
  • 104.
    Words and Docs Relationships
    Another idea is to map docs or words from a very high-dimensional term space down into a 2D plane.
    The docs or words fall within that plane, using 2D or 3D.
    This variation on clustering can be applied in two ways:
    to documents retrieved as a result of a query;
    to a pre-processed set of documents, within which documents that match a query can be highlighted.
    InfoSky and xFIND’s VisIslands are two variations on these starfield displays.
  • 105.
    Words and Docs Relationships
    InfoSky, from Jonker et al.
  • 106.
    Words and Docs Relationships
    xFIND’s VisIslands, from Andrews et al.
  • 107.
    Words and Docs Relationships
    These views are relatively easy to compute and can be visually striking.
    However, evaluations that have been conducted so far provide negative evidence as to their usefulness.
    The main problem is that the contents of the documents are not visible in such views.
    A more promising application of this kind of idea is in the layout of thesaurus terms, in a small network graph.
    Example: Visual Wordnet.
  • 108.
    Words and Docs Relationships
    The Visual Wordnet view of the WordNet lexical thesaurus
  • 109.
    Visualization for Text Mining
    Visualization is also used for purposes of analysis and exploration of textual data.
    Visualizations such as the Word Tree show a piece of a text concordance.
    The Word Tree allows the user to view which words and phrases commonly precede or follow a given word.
    Another example is the NameVoyager, which shows frequencies of names for U.S. children across time.
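The data behind such a concordance view can be sketched by counting which words directly follow a given word. This covers only the counting step, not the Word Tree layout; the function name `following_words`, the simple tokenizer, and the sample text are assumptions.

```python
import re
from collections import Counter


def following_words(text, word, n=3):
    """Return up to n (next_word, count) pairs for words that directly follow `word`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = word.lower()
    counts = Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == target
    )
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


sample = "I have a dream today. I have a plan. I have hope."
print(following_words(sample, "have"))
# prints: [('a', 2), ('hope', 1)]
```

Applying the same function recursively to each following word would yield the branching structure that a tree view displays.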
  • 110.
    Visualization for Text Mining
    The Word Tree visualization, on Martin Luther King’s “I Have a Dream” speech, from Wattenberg et al.
  • 111.
    Visualization for Text Mining
    The popularity of baby names over time (names beginning with JA), from babynamewizard.com
  • 112.
    Visualization for Text Mining
    Visualization is also used in search interfaces intended for analysts.
    An example is the TRIST information triage system, from Proulx et al.
    In this system, search results are represented as document icons.
    Thousands of documents can be viewed in one display.
    It supports multiple linked dimensions that allow for finding characteristics and correlations among the docs.
    Its designers won the IEEE Visual Analytics Science and Technology (VAST) contest for two years running.
  • 113.
    Visualization for Text Mining
    The TRIST interface with results for queries related to Avian Flu
  • 114.
    Design and Evaluation
    User interface design: a field of Human-Computer Interaction (HCI).
    This field studies how people think about, respond to, and use technology.
    User-centered design: a set of practices developed to facilitate the design of interfaces.
    The design process begins by determining what the intended users’ goals are.
    Then, the interface is devised to help people achieve those goals by completing a series of tasks.
  • 115.
    Design and Evaluation
    Goals in the domain of information access can range quite widely:
    from finding a plumber to keeping informed about a business competitor;
    from writing a publishable scholarly article to investigating an allegation of fraud.
    The design of interfaces is an iterative process, in which the goals and tasks are elucidated via user research.
  • 116.
    Design and Evaluation
    Evaluating a user interface is often different from evaluating a ranking algorithm or a crawling technique.
    A crawler can be assessed by crisp quantitative metrics such as coverage and freshness.
    A ranking algorithm can be evaluated by precision, recall, and speed.
    The quality of a user interface is determined by how people respond to it.
    Subjective responses are as important as, if not more important than, quantitative measures.
    If a person has a choice between two systems, they will use the one they prefer.
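For contrast with the subjective measures above, precision and recall are straightforward to compute once relevance judgments are available. The sketch below uses the standard set-based definitions; the function name `precision_recall` and the sample document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    Both are reported as 0.0 when their denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three judged relevant, two in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(precision, recall)
```

Here precision is 2/4 and recall is 2/3. No comparably crisp formula exists for how much a person prefers one interface over another, which is the point the slide makes.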
  • 117.
    Design and Evaluation
    The reasons for preference may be determined by a host of factors:
    speed, familiarity, aesthetics, preferred features, or perceived ranking accuracy.
    Often the preferred choice is the familiar one.
  • 118.
    Design and Evaluation
    How best to evaluate a user interface depends on the current stage in the development cycle.
    When starting with a new design or idea, discount usability methods are typically used.
    Example: showing a few users different designs and asking them to indicate which parts are promising and which are not.
    Another commonly used discount evaluation method is heuristic evaluation.
    Usability experts “walk through” a design and evaluate the functionality in accordance with a set of design guidelines.
  • 119.
    Design and Evaluation
    A formal experiment must be carefully designed to take into account potentially confounding factors.
    For instance, it is important for participants to be motivated to do well on the task.
    This kind of study can uncover important subjective results, such as whether a new design is strongly preferred over a baseline.
    However, it is difficult to find accurate quantitative differences with a small number of participants.
  • 120.
    Design and Evaluation
    Another problem: the timing variable is not the right measure for evaluating an interactive search session.
    A tool that allows the searcher to learn about their subject matter as they search may be more beneficial, but take more time.
    Two approaches to evaluating search interfaces have gained in popularity in recent years.
    One is to conduct a longitudinal study.
    Participants use a new interface for an extended period of time, and their usage is monitored and logged.
    Evaluation is based both on log analysis and on questionnaires and interviews with the participants.
  • 121.
    Design and Evaluation
    Another evaluation technique is to perform experiments on already heavily used Web sites.
    Consider a search engine that receives millions of queries a day:
    a randomly selected subset of the users is shown a new design;
    their actions are logged and compared to another randomly selected control group that continues to use the existing interface.
    This approach is often referred to as bucket testing or A/B testing.
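One common way to pick the subset of users is to hash a user identifier, so that the same user always sees the same design. The sketch below is an assumption about how the assignment could be done, not a description of any particular search engine; the function name `assign_bucket`, the experiment label, and the 10% treatment share are hypothetical.

```python
import hashlib


def assign_bucket(user_id, experiment="new-results-page", treatment_share=0.1):
    """Deterministically assign a user to the 'treatment' or 'control' group.

    The user id is hashed together with the experiment name, so that
    assignments in different experiments are independent of one another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 0x100000000
    return "treatment" if fraction < treatment_share else "control"


groups = [assign_bucket(user_id) for user_id in range(10000)]
print(groups.count("treatment") / len(groups))
```

The printed share should be close to 0.1. Logged actions of the two groups can then be compared, for example by click-through rate on the results page.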