UNIT-IV
IV Year / VIII Semester
By
K.Karthick AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques
Syllabus
The Web
• A Brief History
• In 1990, Berners-Lee wrote the HTTP protocol
• Defined the HTML language
• Wrote the first browser, which he called World Wide Web
• Wrote the first Web server
• In 1991, he made his browser and server software available
in the Internet. The Web was born!
• Since its inception, the Web became a huge success
• Well over 20 billion pages are now available and accessible
in the Web
• More than one fourth of humanity now accesses the Web on a regular basis
How the Web Changed Search
• Web search is today the most prominent application of
IR and its techniques—the ranking and indexing
components of any search engine are fundamentally IR
pieces of technology
• The first major impact of the Web on search is related
to the characteristics of the document collection itself
• The Web is composed of pages distributed over
millions of sites and connected through hyperlinks
• This requires collecting all documents and storing
copies of them in a central repository, prior to indexing
• This new phase in the IR process, introduced by the
Web, is called crawling
• The second major impact of the Web on search is
related to:
• The size of the collection
• The volume of user queries submitted on a daily basis
• As a consequence, performance and scalability have
become critical characteristics of the IR system
• The third major impact: in a very large collection, predicting relevance is much harder than before
• Fortunately, the Web also includes new sources of evidence
• Ex: hyperlinks and user clicks on documents in the answer set
• The fourth major impact derives from the fact
that the Web is also a medium to do business
• Search problem has been extended beyond the
seeking of text information to also encompass
other user needs
• Ex: the price of a book, the phone number of a hotel, the link for downloading a piece of software
• The fifth major impact of the Web on search is Web spam: the abusive availability of commercial information disguised as informational content
• This difficulty is so large that today we talk of
Adversarial Web Retrieval
Web structure
• The web graph
• We can view the static Web consisting of static HTML pages
together with the hyperlinks between them as a directed graph in
which each web page is a node and each hyperlink a directed edge.
• We refer to the hyperlinks into a page as in-links and those out of a
page as out-links.
• The number of in-links to a page (also known as its in-degree) has
averaged from roughly 8 to 15, in a range of studies. We similarly
define the out-degree of a web page to be the number of links out
of it. These notions are represented in Figure 4.1.
• The web can be represented as a graph (a toy sketch follows this list):
– webpage ≡ node
– hyperlink ≡ directed edge
– # in-links ≡ in-degree of a node
– # out-links ≡ out-degree of a node
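As a hedged illustration of this graph view (the page names and links below are invented for the sketch, not taken from the text), the web graph with its in-degrees and out-degrees can be modeled as a plain adjacency list:

```python
from collections import defaultdict

# Toy web graph: page -> set of pages it links to (out-links).
# Pages and links are made up purely for illustration.
out_links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},
}

# Derive in-links by reversing every directed edge.
in_links = defaultdict(set)
for page, targets in out_links.items():
    for target in targets:
        in_links[target].add(page)

for page in out_links:
    print(page,
          "out-degree:", len(out_links[page]),
          "in-degree:", len(in_links[page]))
```

The same adjacency-list convention is reused in the later PageRank and HITS sketches.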
Search Engine Architectures
• A software architecture consists of software
components, the interfaces provided by those
components, and the relationships between
them
– describes a system at a particular level of
abstraction
• Architecture of a search engine determined by
2 requirements
– effectiveness (quality of results) and efficiency
(response time and throughput)
• Each Search Engine architecture has its own
components. The architecture of search
engines falls into the following categories:
• Basic architecture – centralized crawler
• Cluster-based architecture
• Distributed architectures
• Multi-site architecture
Basic architecture- Centralized
crawler
– used by most engines
– crawlers are software agents that traverse the Web copying pages
– pages crawled are stored in a central repository and then indexed
– index is used in a centralized fashion to answer queries
– most search engines use indices based on the inverted index
– only a logical, rather than a literal, view of the text needs to be
indexed
• Normalization operations
– removal of punctuation
– substitution of multiple consecutive spaces by a single space
– conversion of uppercase to lowercase letters
• Some engines eliminate stopwords to reduce index size
• Index is complemented with metadata associated with pages
– Creation date, size, title, etc.
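A minimal sketch of the logical view described above: the normalization operations just listed (punctuation removal, whitespace collapsing, lowercasing) and optional stopword removal feed a simple in-memory inverted index. The documents and the stopword list are hypothetical, chosen only to make the sketch self-contained:

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and"}  # illustrative stopword list

def normalize(text):
    """Apply the normalization operations listed above."""
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    text = re.sub(r"\s+", " ", text)       # collapse multiple spaces
    return text.lower().strip()            # uppercase -> lowercase

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text).split():
            if term not in STOPWORDS:
                index[term].add(doc_id)
    return index

docs = {1: "The Web, a huge success!", 2: "Crawling the Web..."}
index = build_inverted_index(docs)
print(index["web"])   # -> {1, 2}
```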
• Given a query
– 10 results shown are subset of complete result set
– if user requests more results, search engine can
• recompute the query to generate the next 10 results
• obtain them from a partial result set maintained in
main memory
– In any case, a search engine never computes the
full answer set for the whole Web
• Main problem faced by this architecture
– gathering of the data, and
– sheer volume
• Crawler-indexer architecture could not cope
with Web growth (end of 1990s)
– solution: distribute and parallelize computation
Cluster-based Architecture
• Current engines adopt a massively parallel cluster-
based architecture
– document partitioning is used
– replicated to handle the overall query load
– cluster replicas maintained in various geographical
locations to decrease latency time (for nearby users)
• Many crucial details need to be addressed
– good balance between the internal and external activities
– good load balancing among different clusters
– fault tolerance at software level to protect against
hardware failures
• Caching
– Search engines need to be fast
• whenever possible, execute tasks in main memory
• caching is highly recommended and extensively used
– provides for shorter average response time
– significantly reduces workload on back-end servers
– decreases the overall amount of bandwidth utilized
– In the Web, caching can be done at both the client and the server side
– Caching of answers
• the most effective caching technique in search engines
• query distribution follows a power law
– small cache can answer a large percentage of queries
• with a 30% hit-rate, capacity of search engine increases
by almost 43%
– Still, in any time window a large fraction of queries
will be unique
• hence, those queries will not be in the cache
• around 50% according to Baeza-Yates et al; this can be improved by also caching inverted lists at the search cluster level
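A rough sketch of answer caching at the front end (cache capacity, the queries, and the back-end search function are hypothetical), assuming a simple LRU policy over query result pages:

```python
from collections import OrderedDict

class AnswerCache:
    """Tiny LRU cache mapping a query string to its top-k results."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as recently used
            return self.entries[query]
        return None

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

def answer(query, cache, backend_search):
    cached = cache.get(query)
    if cached is not None:        # cache hit: no back-end work needed
        return cached
    results = backend_search(query)
    cache.put(query, results)
    return results
```

With a 30% hit rate only 70% of queries reach the back-end clusters, so the same hardware sustains roughly 1/0.7 ≈ 1.43 times the load, which matches the ~43% capacity gain mentioned above.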
Distributed Architectures
– There exist several variants of the crawler-indexer
architecture
– We describe here the most important ones
• most significant early example is Harvest
• among newest proposals, we distinguish the multi-site
architecture proposed by Baeza-Yates et al
– Harvest
• Harvest uses a distributed architecture to gather and distribute
data
• interestingly, it does not suffer from some of the common problems of crawler-indexer architectures, such as
– increased server load caused by the reception of simultaneous requests from different crawlers
– increased Web traffic, due to crawlers retrieving entire objects even though most of the content is eventually discarded
– lack of coordination between engines, as information is gathered independently by each crawler
• To avoid these issues, two components are
introduced: gatherers and brokers
• gatherer: collects and extracts indexing
information from one or more Web servers
• broker: provides indexing mechanism and
query interface to the data gathered
Multi-site Architecture
– As the document collection grows
• capacity of query processors has to grow as well
• unlikely that growth in size of single processors can
match growth of very large collections
– even if a large number of servers is used
– main reasons are physical constraints such as size of single
data center and power and cooling requirements
– Distributed resolution of queries using different
query processors is a viable approach
• enables a more scalable solution
• but also imposes new challenges
• one such challenge is the routing of queries to
appropriate query processors
– to utilize more efficiently available resources and provide
more precise results
– factors affecting query routing include geographical proximity,
query topic, or language of the query
– Geographical proximity: reduce network latency
by using resources close to the user posing the
query
– Possible implementation is DNS redirection
• according to IP address of client, the DNS service routes
query to appropriate Web server
– usually the closest in terms of network distance
– As another example, the DNS service can use the geographical location of the client to determine where to route queries
– There is a fluctuation in submitted queries from a
particular geographic region during a day
• possible to offload a server from a busy region by
rerouting some queries to query servers in a less busy
region
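A hedged sketch of routing by geographical proximity with overflow to a less busy region. The site names, coordinates, load figures, and the crude distance heuristic are all invented for illustration; real deployments typically rely on DNS redirection and network distance rather than raw coordinates:

```python
import math

# Hypothetical search sites: name -> (latitude, longitude)
SITES = {
    "europe-1": (50.1, 8.7),
    "us-east-1": (39.0, -77.5),
    "asia-1": (1.3, 103.8),
}

def distance(a, b):
    """Crude planar distance; good enough for picking a nearby site."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_query(client_location, load, max_load=0.8):
    """Pick the closest site whose current load is below max_load."""
    ranked = sorted(SITES, key=lambda s: distance(client_location, SITES[s]))
    for site in ranked:
        if load.get(site, 0.0) < max_load:
            return site
    return ranked[0]   # every site is busy: fall back to the closest one

print(route_query((48.8, 2.3), load={"europe-1": 0.95, "us-east-1": 0.4}))
# -> "us-east-1": the nearest site is overloaded, so the query is rerouted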
Search Engine Ranking
• Ranking is the hardest and most important function of
a search engine
• A key first challenge
• devise an adequate process of evaluating the ranking,
in terms of relevance of results to the user
• without such evaluation, it is close to impossible to fine
tune the ranking function
• without fine tuning the ranking, there is no state-of-
the-art engine—this is an empirical field of science
• A second critical challenge
• identification of quality content in the Web
• Evidence of quality can be indicated by several
signals such as:
• domain names
• text content
• links (like Page Rank)
• Web page access patterns as monitored by the
search engine
• Additional useful signals are provided by the
layout of the Web page, its title, metadata,
font sizes, etc.
Third critical challenge
• avoiding, preventing, managing Web spam
• spammers are malicious users who try to trick
search engines by
• artificially inflating signals used for ranking
• a consequence of the economic incentives of the
current advertising model adopted by search
engines
• A fourth major challenge
• defining the ranking function and computing it
Link-based Ranking
• Number of hyperlinks that point to a page
provides a measure of its popularity and
quality
• Many links in common among pages are
indicative of page relations with potential
value for ranking purposes
• Examples of ranking techniques that exploit
links are discussed next
Early Algorithms
• Use incoming links for ranking Web pages
• But, just counting links was not a very reliable measure of
authoritativeness
• Yuwono and Lee proposed three early ranking algorithms (in
addition to the classic TF-IDF vector ranking)
• Boolean spread
• given page p in the result set of the Boolean model
• extend the result set with pages that point to, and are pointed to by, page p
• Vector spread
• the same extension applied to the result set of the vector (TF-IDF) model
• most-cited
• a page p is assigned a score given by the sum of the number of query words contained in other pages that point to page p
• WebQuery is an early algorithm that allows visual browsing of Web
pages
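As a hedged sketch of the "spread" idea above (the graph and the seed result set are hypothetical, reusing the adjacency-list convention of the earlier web-graph sketch), extending a result set with the link neighbours of its pages might look like:

```python
def spread(result_set, out_links, in_links):
    """Extend a result set with pages that point to, or are pointed to by, its pages."""
    extended = set(result_set)
    for p in result_set:
        extended |= out_links.get(p, set())   # pages p points to
        extended |= in_links.get(p, set())    # pages pointing to p
    return extended

# Toy example: {"B"} is the original result set
result = spread({"B"}, out_links={"B": {"C"}}, in_links={"B": {"A"}})
print(result)   # -> {"A", "B", "C"}
```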
Page Rank
• The basic idea is that good pages point to good pages
• Let p and r denote pages, and let L(r) be the set of out-links of page r (a standard form of the PageRank formula is sketched after this list)
• PageRank is the link analysis method that uses link structure to rank web pages in search engines
• PageRank assigns each node of the web graph a numerical score between 0 and 1, known as its PageRank; the PageRank of a node depends on the link structure of the web graph
• Page Rank: scoring measure based only on the link structure of web pages
– Every node in the web graph is given a score between 0 and 1, depending on its in-links and out-links
– Given a query, a search engine combines the Page Rank
score with other values to compute the ranked retrieval
(e.g. cosine similarity, relevance feedback, etc.)
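The formula the list above prepares for (using p, r, and L) appears only as an image in the original deck. A commonly cited formulation, stated here as an assumption rather than taken verbatim from the slides, uses the total number of pages T and a random-jump probability q (often set around 0.15):

$$ PR(p) = \frac{q}{T} + (1 - q) \sum_{r \,:\, r \rightarrow p} \frac{PR(r)}{|L(r)|} $$

where the sum runs over all pages r that link to p and |L(r)| is the number of out-links of page r.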
The PageRank Citation Ranking: Bringing Order to the Web
• Why is Page Importance Rating important?
– New challenges for information retrieval on the World Wide
Web.
• Huge number of web pages: 150 million by 1998, about 1 trillion (1000 billion) by 2008
• Diversity of web pages: different topics, different quality, etc.
• What is PageRank?
• A method for rating the importance of web pages
objectively and mechanically using the link structure of the
web.
The History of PageRank
• PageRank was developed by Larry Page (hence
the name Page-Rank) and Sergey Brin.
• It was first developed as part of a research project on a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998.
• Shortly after, Page and Brin founded Google.
• 16 billion…
• There have been reports that PageRank would be discontinued by Google.
• There is a large search engine optimization (SEO) industry.
• SEO practitioners use various tricks to make a web page appear more important under the PageRank rating.
• 150 million web pages → 1.7 billion links
• Backlinks and Forward links:
A and B are C’s backlinks
C is A and B’s forward link
• Underlying idea: computing a score reflecting
the “visitability” of a page
– When surfing the web, a user may visit some
pages more often than others (e.g. more in-links)
– Pages that are often visited are more likely to
contain relevant information
– When a dead-end page is reached, the user may
teleport (e.g. type an address in the browser)
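A minimal power-iteration sketch of this random-surfer process with teleportation. The damping value q = 0.15, the iteration count, and the toy graph are illustrative assumptions; it also assumes every page appears as a key of out_links (the adjacency-list convention of the earlier sketches):

```python
def pagerank(out_links, q=0.15, iterations=50):
    """Iterate PR(p) = q/T + (1-q) * sum over r->p of PR(r)/|L(r)|."""
    pages = list(out_links)                    # assumes every page is a key
    T = len(pages)
    pr = {p: 1.0 / T for p in pages}           # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {p: q / T for p in pages}     # teleportation component
        for r, targets in out_links.items():
            if not targets:                    # dead end: spread its rank uniformly
                for p in pages:
                    new_pr[p] += (1 - q) * pr[r] / T
            else:
                share = (1 - q) * pr[r] / len(targets)
                for p in targets:
                    new_pr[p] += share
        pr = new_pr
    return pr

toy = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}
print(pagerank(toy))   # "C" ends up with the highest score in this toy graph
```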
• Page Rank: Strengths & Weaknesses
• Strengths
– Elegant theoretical foundation with nice
interpretations (e.g., stochastic matrices, principal
Eigenvectors, stationary state probabilities in random
walk model/ Markov-chain process)
– Can be extended to topic-based and personalized
models
– Static measure (i.e., can be precomputed)
– Relatively straight-forward to scale
– PageRank scores are relatively stable to graph perturbations
• Weaknesses
– The rank ordering induced by the scores is not as stable to perturbations
HITS
• Hypertext Induced Topic Selection (HITS) is a link analysis method developed by Jon Kleinberg in 1999, based on hub and authority scores.
• considers set of pages S that point to or are pointed by
pages in answer set
• pages that have many links pointing to them are called authorities
• pages that have many outgoing links are called hubs
• positive two-way feedback
• better authority pages come from incoming edges from
good hubs
• better hub pages come from outgoing edges to good
authorities
– Start with a “root set” of pages relevant to a topic (or information need)
– Create a “base subgraph” by adding to the root set:
– all pages that have incoming links from pages in the root set
– for each page in the root set, up to k pages pointing to it
– Compute authority and hubness scores for all pages in the base subgraph
• Rank pages by decreasing authority scores
• Let
• H(p): hub value of page p
• A(p): authority value of page p
• H(p) and A(p) are defined by a mutual recursion: A(p) is the sum of H(q) over all pages q that point to p, and H(p) is the sum of A(q) over all pages q that p points to; the scores are typically normalized after each update
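A hedged sketch of this mutual recursion over a base subgraph, reusing the adjacency-list convention of the earlier sketches (the toy graph and iteration count are illustrative):

```python
import math

def hits(out_links, iterations=50):
    """Hub/authority iteration: A(p) = sum H(q) for q->p, H(p) = sum A(q) for p->q."""
    pages = set(out_links) | {p for ts in out_links.values() for p in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: incoming edges from good hubs
        auth = {p: sum(hub[q] for q in pages if p in out_links.get(q, set()))
                for p in pages}
        # Hub update: outgoing edges to good authorities
        hub = {p: sum(auth[q] for q in out_links.get(p, set())) for p in pages}
        # Normalize so the scores stay bounded
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

hub, auth = hits({"A": {"C"}, "B": {"C"}, "C": set()})
# "C" is the strongest authority; "A" and "B" act as hubs
```

Pages would then be ranked by decreasing authority score, as the slide above states.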
Simple Ranking Functions
• use a global ranking function such as PageRank
• In that case, quality of a Web page in the result set is
independent of the query
• the query only selects pages to be ranked
• use a linear combination of different ranking signals
• for instance, combine BM25 (text-based ranking) with
PageRank (link-based ranking)
• To illustrate, consider the pages p that satisfy query Q
• Rank score R(p,Q) of page p with regard to query Q can
be computed as
• R(p,Q) = α · BM25(p,Q) + (1 − α) · PR(p), where 0 ≤ α ≤ 1 weights the text-based component against the link-based one
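A hedged one-function sketch of this linear combination (the BM25 and PageRank inputs are assumed to be precomputed elsewhere, and the α value is an illustrative choice):

```python
def rank_score(bm25_score, pagerank_score, alpha=0.7):
    """R(p, Q) = alpha * BM25(p, Q) + (1 - alpha) * PR(p)."""
    return alpha * bm25_score + (1 - alpha) * pagerank_score

# Example: a page with a strong text match but a modest PageRank
print(rank_score(bm25_score=0.82, pagerank_score=0.10))   # -> 0.604
```

In practice the two signals are normalized to comparable ranges before combining, and α is tuned empirically.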