1. UNIT-IV
IV Year / VIII Semester
By
K.Karthick AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques
3. The Web
• A Brief History
• In 1990, Berners-Lee wrote the HTTP protocol
• Defined the HTML language
• Wrote the first browser, which he called World Wide Web
• Wrote the first Web server
• In 1991, he made his browser and server software available
on the Internet. The Web was born!
• Since its inception, the Web has been a huge success
• Well over 20 billion pages are now available and accessible
on the Web
• More than one fourth of humanity now accesses the Web on a
regular basis
4. How the Web Changed Search
• Web search is today the most prominent application of
IR and its techniques—the ranking and indexing
components of any search engine are fundamentally IR
pieces of technology
• The first major impact of the Web on search is related
to the characteristics of the document collection itself
• The Web is composed of pages distributed over
millions of sites and connected through hyperlinks
• This requires collecting all documents and storing
copies of them in a central repository, prior to indexing
• This new phase in the IR process, introduced by the
Web, is called crawling
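The crawling phase can be sketched as a breadth-first traversal of the link graph. Below is a minimal illustration over a hypothetical in-memory link table (the `WEB` dictionary, `crawl` function, and URLs are invented for the example; a real crawler fetches pages over HTTP and must handle politeness, duplicates, and failures):

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of out-links. A real crawler
# would fetch pages over HTTP and respect robots.txt / politeness rules.
WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": [],
}

def crawl(seed):
    """Breadth-first traversal of the link graph, storing each page once."""
    repository = []            # central repository of crawled pages
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        repository.append(url)           # store a copy prior to indexing
        for link in WEB.get(url, []):    # follow hyperlinks to new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```

The repository produced this way is what the indexing phase consumes next.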
5. • The second major impact of the Web on search is
related to:
• The size of the collection
• The volume of user queries submitted on a daily basis
• As a consequence, performance and scalability have
become critical characteristics of the IR system
• The third major impact: in a very large collection,
predicting relevance is much harder than before
• Fortunately, the Web also includes new sources of
evidence
• Ex: hyperlinks and user clicks on documents in the
answer set
6. • The fourth major impact derives from the fact
that the Web is also a medium to do business
• Search problem has been extended beyond the
seeking of text information to also encompass
other user needs
• Ex: the price of a book, the phone number of a
hotel, the link for downloading software
• The fifth major impact of the Web on search is
Web spam: abusive availability of commercial
information disguised in the form of
informational content
• This difficulty is so large that today we talk of
Adversarial Web Retrieval
7. Web structure
• The web graph
• We can view the static Web consisting of static HTML pages
together with the hyperlinks between them as a directed graph in
which each web page is a node and each hyperlink a directed edge.
• We refer to the hyperlinks into a page as in-links and those out of a
page as out-links.
• The number of in-links to a page (also known as its in-degree) has
averaged from roughly 8 to 15, in a range of studies. We similarly
define the out-degree of a web page to be the number of links out
of it. These notions are represented in Figure 4.1.
• The web can be represented as a graph:
• Webpage ≡ node
• hyperlink ≡ directed edge
• # in-links ≡ in-degree of a node
• # out-links ≡ out-degree of a node
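These definitions can be made concrete with a small adjacency-list representation (the toy graph and helper functions below are invented for illustration and are not from the text):

```python
# Toy directed web graph: page -> list of out-links (hypothetical pages).
graph = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": ["p1"],
}

def out_degree(g, node):
    """Number of hyperlinks out of a page (its out-degree)."""
    return len(g.get(node, []))

def in_degree(g, node):
    """Number of hyperlinks into a page (its in-degree)."""
    return sum(node in targets for targets in g.values())
```

Here `p3` has in-degree 2 (links from `p1` and `p2`) and out-degree 1, matching the definitions above.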
9. Search Engine Architectures
• A software architecture consists of software
components, the interfaces provided by those
components, and the relationships between
them
– describes a system at a particular level of
abstraction
• Architecture of a search engine determined by
2 requirements
– effectiveness (quality of results) and efficiency
(response time and throughput)
10. • Each Search Engine architecture has its own
components. The architecture of search
engines falls into the following categories:
• Basic architecture- Centralized crawler
• Cluster based architecture
• Distributed Architectures
• Multi-site Architecture
11. Basic architecture- Centralized
crawler
– used by most engines
– crawlers are software agents that traverse the Web copying pages
– pages crawled are stored in a central repository and then indexed
– index is used in a centralized fashion to answer queries
– most search engines use indices based on the inverted index
– only a logical, rather than a literal, view of the text needs to be
indexed
• Normalization operations
– removal of punctuation
– substitution of multiple consecutive spaces by a single space
– conversion of uppercase to lowercase letters
• Some engines eliminate stopwords to reduce index size
• Index is complemented with metadata associated with pages
– Creation date, size, title, etc.
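A minimal sketch of the normalization operations listed above (the `normalize` and `tokens` helpers and the tiny stopword list are invented for illustration; real engines use far richer pipelines):

```python
import re

def normalize(text):
    """Build the logical view of the text: lowercase, strip punctuation,
    collapse runs of whitespace into a single space."""
    text = text.lower()                       # uppercase -> lowercase
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # multiple spaces -> one
    return text

# Tiny illustrative stopword list; production lists are much longer.
STOPWORDS = {"the", "of", "a"}

def tokens(text):
    """Normalized tokens with stopwords removed, ready for indexing."""
    return [t for t in normalize(text).split() if t not in STOPWORDS]
```

Only this logical view, plus metadata such as creation date and title, needs to go into the index.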
12. • Given a query
– 10 results shown are subset of complete result set
– if user requests more results, search engine can
• recompute the query to generate the next 10 results
• obtain them from a partial result set maintained in
main memory
– In any case, a search engine never computes the
full answer set for the whole Web
14. • Main problem faced by this architecture
– gathering of the data, and
– sheer volume
• Crawler-indexer architecture could not cope
with Web growth (end of 1990s)
– solution: distribute and parallelize computation
15. Cluster-based Architecture
• Current engines adopt a massively parallel cluster-
based architecture
– document partitioning is used
– replicated to handle the overall query load
– cluster replicas maintained in various geographical
locations to decrease latency time (for nearby users)
• Many crucial details need to be addressed
– good balance between the internal and external activities
– good load balancing among different clusters
– fault tolerance at software level to protect against
hardware failures
16. • Caching
– Search engines need to be fast
• whenever possible, execute tasks in main memory
• caching is highly recommended and extensively used
– provides for shorter average response time
– significantly reduces workload on back-end servers
– decreases the overall amount of bandwidth utilized
– In the Web, caching can be done both at the client
and at the server side
17. – Caching of answers
• the most effective caching technique in search engines
• query distribution follows a power law
– small cache can answer a large percentage of queries
• with a 30% hit-rate, capacity of search engine increases
by almost 43%
– Still, in any time window a large fraction of queries
will be unique
• hence, those queries will not be in the cache
• around 50% in Baeza-Yates et al; this can be improved by also
caching inverted lists at the search cluster level
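Answer caching can be sketched with a least-recently-used eviction policy (the `AnswerCache` class below is invented for illustration; production engines use far more elaborate admission and eviction policies):

```python
from collections import OrderedDict

class AnswerCache:
    """Minimal LRU cache for query answers -- a sketch, not a production design."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, query):
        """Return cached answers for a query, or None on a miss."""
        if query in self.store:
            self.hits += 1
            self.store.move_to_end(query)     # mark as recently used
            return self.store[query]
        self.misses += 1
        return None

    def put(self, query, answers):
        """Cache the answers, evicting the least recently used entry if full."""
        self.store[query] = answers
        self.store.move_to_end(query)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict least recently used
```

Because the query distribution follows a power law, even a small cache like this can answer a large share of the query stream from memory.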
19. Distributed Architectures
– There exist several variants of the crawler-indexer
architecture
– We describe here the most important ones
• most significant early example is Harvest
• among newest proposals, we distinguish the multi-site
architecture proposed by Baeza-Yates et al
– Harvest
• Harvest uses a distributed architecture to gather and distribute
data
• interestingly, it does not suffer from some of the common
problems of crawler-indexer architectures, such as
– increased server load caused by simultaneous requests
from different crawlers
– increased Web traffic, due to crawlers retrieving entire objects
even though most of the content is eventually discarded
– lack of coordination between engines, as information is
gathered independently by each crawler
20. • To avoid these issues, two components are
introduced: gatherers and brokers
• gatherer: collects and extracts indexing
information from one or more Web servers
• broker: provides indexing mechanism and
query interface to the data gathered
22. Multi-site Architecture
– As the document collection grows
• capacity of query processors has to grow as well
• unlikely that growth in size of single processors can
match growth of very large collections
– even if a large number of servers is used
– main reasons are physical constraints such as size of single
data center and power and cooling requirements
23. – Distributed resolution of queries using different
query processors is a viable approach
• enables a more scalable solution
• but also imposes new challenges
• one such challenge is the routing of queries to
appropriate query processors
– to utilize more efficiently available resources and provide
more precise results
– factors affecting query routing include geographical proximity,
query topic, or language of the query
24. – Geographical proximity: reduce network latency
by using resources close to the user posing the
query
– Possible implementation is DNS redirection
• according to IP address of client, the DNS service routes
query to appropriate Web server
– usually the closest in terms of network distance
– As another example, DNS service can use the
geographical location to determine where to route
queries to
– There is a fluctuation in submitted queries from a
particular geographic region during a day
• possible to offload a server from a busy region by
rerouting some queries to query servers in a less busy
region
25. Search Engine Ranking
• Ranking is the hardest and most important function of
a search engine
• A key first challenge
• devise an adequate process of evaluating the ranking,
in terms of relevance of results to the user
• without such evaluation, it is close to impossible to fine
tune the ranking function
• without fine tuning the ranking, there is no state-of-
the-art engine—this is an empirical field of science
• A second critical challenge
• identification of quality content in the Web
26. • Evidence of quality can be indicated by several
signals such as:
• domain names
• text content
• links (like Page Rank)
• Web page access patterns as monitored by the
search engine
• Additional useful signals are provided by the
layout of the Web page, its title, metadata,
font sizes, etc.
27. • A third critical challenge
• avoiding, preventing, and managing Web spam
• spammers are malicious users who try to trick
search engines by
• artificially inflating signals used for ranking
• a consequence of the economic incentives of the
current advertising model adopted by search
engines
• A fourth major challenge
• defining the ranking function and computing it
28. Link-based Ranking
• Number of hyperlinks that point to a page
provides a measure of its popularity and
quality
• Many links in common among pages are
indicative of page relations with potential
value for ranking purposes
• Examples of ranking techniques that exploit
links are discussed next
29. Early Algorithms
• Use incoming links for ranking Web pages
• But, just counting links was not a very reliable measure of
authoritativeness
• Yuwono and Lee proposed three early ranking algorithms (in
addition to the classic TF-IDF vector ranking)
• Boolean spread
• classic Boolean ranking in which, given a page p in the result set,
the set is extended with pages that point to or are pointed to by p
• Vector spread
• classic vector model ranking in which the result set is extended
in the same fashion
• most-cited
• a page p is assigned a score given by the sum of number of query
words contained in other pages that point to page p
• WebQuery is an early algorithm that allows visual browsing of Web
pages
30. Page Rank
• The basic idea is that good pages point to good pages
• PageRank is a link analysis method that uses the link
structure of the Web to rank pages in search engines
• PageRank assigns each node of the web graph a numerical
score between 0 and 1, known as its PageRank. The PageRank
of a node depends on the link structure of the web graph
• Page Rank: scoring measure based only on the link
structure of web pages
– Every node in the web graph is given a score between 0
and 1, depending on its in- and out-links
– Given a query, a search engine combines the Page Rank
score with other values to compute the ranked retrieval
(e.g. cosine similarity, relevance feedback, etc.)
31. The PageRank Citation Ranking: Bringing
Order to the Web
• Why is Page Importance Rating important?
– New challenges for information retrieval on the World Wide
Web.
• Huge number of web pages: 150 million by 1998,
1 trillion by 2008
• Diversity of web pages: different topics, different quality, etc.
• What is PageRank?
• A method for rating the importance of web pages
objectively and mechanically using the link structure of the
web.
32. The History of PageRank
• PageRank was developed by Larry Page (hence
the name Page-Rank) and Sergey Brin.
• It began as part of a research project about a
new kind of search engine. That project
started in 1995 and led to a functional
prototype in 1998.
• Shortly after, Page and Brin founded Google.
• 16 billion…
33. • There have been reports that PageRank may
be dropped by Google.
• There are large numbers of search engine
optimization (SEO) practitioners.
• They use various tricks to make a web page
appear more important under the rating of
PageRank.
34. • 150 million web pages, 1.7 billion links
• Backlinks and Forward links:
A and B are C’s backlinks
C is A and B’s forward link
46. • Underlying idea: computing a score reflecting
the “visitability” of a page
– When surfing the web, a user may visit some
pages more often than others (e.g. more in-links)
– Pages that are often visited are more likely to
contain relevant information
– When a dead-end page is reached, the user may
teleport (e.g. type an address in the browser)
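The random-surfer idea above can be sketched as power iteration with a teleport (damping) term. This is a textbook formulation, not Google's production algorithm; the function name, the damping factor d = 0.85, and the toy graph are choices made for the example:

```python
def pagerank(graph, d=0.85, iterations=50):
    """Power-iteration PageRank with a teleport term (damping factor d).

    graph: dict mapping each page to the list of pages it links to.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}            # uniform starting scores
    for _ in range(iterations):
        new = {p: (1 - d) / n for p in nodes}     # teleport: random jump
        for p in nodes:
            out = graph[p]
            if out:
                share = d * rank[p] / len(out)    # follow a random out-link
                for q in out:
                    new[q] += share
            else:
                for q in nodes:                   # dead end: jump anywhere
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Toy graph: "b" is pointed to by both "a" and "c" and ends up ranked highest.
g = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
scores = pagerank(g)
```

The scores form a probability distribution over pages (they sum to 1), which is the "score between 0 and 1" mentioned earlier.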
47. • Page Rank: Strengths & Weaknesses
• Strengths
– Elegant theoretical foundation with nice
interpretations (e.g., stochastic matrices, principal
Eigenvectors, stationary state probabilities in random
walk model/ Markov-chain process)
– Can be extended to topic-based and personalized
models
– Static measure (i.e., can be precomputed)
– Relatively straightforward to scale
– PageRank scores are relatively stable to graph
perturbations
• Weaknesses
– The induced ranking (the ordering of pages) is not
stable to perturbations, even when the scores are
48. HITS
• Hyperlink-Induced Topic Search (HITS) is a link
analysis method developed by Jon Kleinberg in 1999
using hub and authority scores.
• considers set of pages S that point to or are pointed by
pages in answer set
• pages that have many links pointing to it are called
authorities
• pages that have many outgoing links are called hubs
• positive two-way feedback
• better authority pages come from incoming edges from
good hubs
• better hub pages come from outgoing edges to good
authorities
49. – Start with a “root set” of pages relevant to a topic
(or information need)
– Create a “base subgraph” by adding to the root set:
– all pages that have incoming links from pages in the root set
– for each page in the root set: up to k pages pointing to it
– Compute authority and hub scores for all
pages in the base subgraph
• Rank pages by decreasing authority scores
51. • Let
• H(p): hub value of page p
• A(p): authority value of page p
• H(p) and A(p) are defined such that
• H(p) = Σ A(q), summed over all pages q that p points to
• A(p) = Σ H(q), summed over all pages q that point to p
• scores are normalized after each update
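The mutual recursion between hub and authority scores can be sketched as an iterative computation with normalization after each update (an illustrative implementation, assuming the base subgraph fits in memory as a dict of out-links):

```python
import math

def hits(graph, iterations=50):
    """Iterative hub/authority computation with L2 normalization -- a sketch.

    graph: dict mapping every page in the base subgraph to its out-links.
    """
    nodes = list(graph)
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # A(p): sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in nodes if p in graph[q]) for p in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {p: a / norm for p, a in auth.items()}
        # H(p): sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in graph[p]) for p in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth
```

On a toy subgraph where `h1` links to two pages and `h2` to one, `h1` emerges as the better hub and the doubly-linked page as the better authority, matching the two-way feedback described above.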
52. Simple Ranking Functions
• use a global ranking function such as PageRank
• In that case, quality of a Web page in the result set is
independent of the query
• the query only selects pages to be ranked
• use a linear combination of different ranking signals
• for instance, combine BM25 (text-based ranking) with
PageRank (link-based ranking)
• To illustrate, consider the pages p that satisfy query Q
• Rank score R(p,Q) of page p with regard to query Q can
be computed as
• R(p,Q) = α · BM25(p,Q) + (1 − α) · PR(p)
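A minimal sketch of such a linear combination (the value of α and the per-page scores below are hypothetical and assumed to be pre-normalized to comparable ranges; real engines combine many more signals):

```python
def rank_score(bm25, pr, alpha=0.7):
    """R(p, Q) = alpha * BM25(p, Q) + (1 - alpha) * PR(p)."""
    return alpha * bm25 + (1 - alpha) * pr

# Hypothetical, pre-normalized (BM25, PageRank) scores for three pages
# that satisfy some query Q.
pages = {"p1": (0.9, 0.2), "p2": (0.5, 0.9), "p3": (0.4, 0.3)}
ranked = sorted(pages, key=lambda p: rank_score(*pages[p]), reverse=True)
```

With α = 0.7 the text-based BM25 score dominates, so `p1` outranks `p2` despite the latter's higher PageRank; tuning α trades query-dependent relevance against query-independent quality.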