UNIT-IV
IV Year / VIII Semester
By
K.Karthick AP/CSE
KNCET.
KONGUNADU COLLEGE OF ENGINEERING AND
TECHNOLOGY
(Autonomous)
NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CS8080 – Information Retrieval Techniques
Syllabus
The Web
• A Brief History
• In 1990, Berners-Lee wrote the HTTP protocol
• Defined the HTML language
• Wrote the first browser, which he called World Wide Web
• Wrote the first Web server
• In 1991, he made his browser and server software available
on the Internet. The Web was born!
• Since its inception, the Web has been a huge success
• Well over 20 billion pages are now available and accessible
on the Web
• More than one fourth of humanity now accesses the Web on a
regular basis
How the Web Changed Search
• Web search is today the most prominent application of
IR and its techniques—the ranking and indexing
components of any search engine are fundamentally IR
pieces of technology
• The first major impact of the Web on search is related
to the characteristics of the document collection itself
• The Web is composed of pages distributed over
millions of sites and connected through hyperlinks
• This requires collecting all documents and storing
copies of them in a central repository, prior to indexing
• This new phase in the IR process, introduced by the
Web, is called crawling
• The second major impact of the Web on search is
related to:
• The size of the collection
• The volume of user queries submitted on a daily basis
• As a consequence, performance and scalability have
become critical characteristics of the IR system
• The third major impact: in a very large collection,
predicting relevance is much harder than before
• Fortunately, the Web also includes new sources of
evidence
• Ex: hyperlinks and user clicks in documents in the
answer set
• The fourth major impact derives from the fact
that the Web is also a medium to do business
• Search problem has been extended beyond the
seeking of text information to also encompass
other user needs
• Ex: the price of a book, the phone number of a
hotel, the link for downloading software
• The fifth major impact of the Web on search is
Web spam
• Web spam: abusive availability of
commercial information disguised in the form of
informational content
• This difficulty is so large that today we talk of
Adversarial Web Retrieval
Web structure
• The web graph
• We can view the static Web consisting of static HTML pages
together with the hyperlinks between them as a directed graph in
which each web page is a node and each hyperlink a directed edge.
• We refer to the hyperlinks into a page as in-links and those out of a
page as out-links.
• The number of in-links to a page (also known as its in-degree) has
averaged from roughly 8 to 15, in a range of studies. We similarly
define the out-degree of a web page to be the number of links out
of it. These notions are represented in Figure 4.1.
• The web can be represented as a graph:
• Webpage ≡ node
• Hyperlink ≡ directed edge
• # in-links ≡ in-degree of a node
• # out-links ≡ out-degree of a node
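The graph view above can be made concrete with a small sketch. The edge list and the function name `degrees` are hypothetical, chosen only for illustration; the code simply counts in-links and out-links per page.

```python
from collections import defaultdict

# Hypothetical toy web graph: each (source, target) pair is one hyperlink.
links = [("A", "C"), ("B", "C"), ("C", "A"), ("C", "D")]

def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a directed edge list."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1   # one more out-link leaving the source page
        in_degree[target] += 1    # one more in-link arriving at the target page
    return dict(in_degree), dict(out_degree)

in_degree, out_degree = degrees(links)
print("in-degree:", in_degree)
print("out-degree:", out_degree)
```

Pages that never appear as a target (here B) simply have no entry in `in_degree`, which corresponds to an in-degree of zero.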
Search Engine Architectures
• A software architecture consists of software
components, the interfaces provided by those
components, and the relationships between
them
– describes a system at a particular level of
abstraction
• Architecture of a search engine is determined by
two requirements
– effectiveness (quality of results) and efficiency
(response time and throughput)
• Each Search Engine architecture has its own
components. The architecture of search
engines falls into the following categories:
• Basic architecture (centralized crawler)
• Cluster-based architecture
• Distributed architectures
• Multi-site architecture
Basic Architecture: Centralized Crawler
– used by most engines
– crawlers are software agents that traverse the Web copying pages
– pages crawled are stored in a central repository and then indexed
– index is used in a centralized fashion to answer queries
– most search engines use indices based on the inverted index
– only a logical, rather than a literal, view of the text needs to be
indexed
• Normalization operations
– removal of punctuation
– substitution of multiple consecutive spaces by a single space
– conversion of uppercase to lowercase letters
• Some engines eliminate stopwords to reduce index size
• Index is complemented with metadata associated with pages
– Creation date, size, title, etc.
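As a rough illustration of the normalization steps and the inverted index mentioned above, the following sketch builds a tiny index. The stopword list, the sample documents and the function names are assumptions made for this example; production engines use far more elaborate tokenization.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; real engines use larger, language-specific lists.
STOPWORDS = {"the", "a", "of", "is", "and"}

def normalize(text):
    """Lowercase, replace punctuation, collapse spaces, drop stopwords."""
    text = text.lower()                        # uppercase -> lowercase
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation -> space
    text = re.sub(r"\s+", " ", text).strip()   # multiple spaces -> single space
    return [t for t in text.split(" ") if t and t not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in normalize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "The Web is a   graph.", 2: "Crawling the Web, and indexing!"}
index = build_inverted_index(docs)
print(index)
```

Only the logical view of each text (its normalized terms) reaches the index; the original punctuation, casing and spacing are discarded.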
• Given a query
– the 10 results shown are a subset of the complete result set
– if user requests more results, search engine can
• recompute the query to generate the next 10 results
• obtain them from a partial result set maintained in
main memory
– In any case, a search engine never computes the
full answer set for the whole Web
• Main problem faced by this architecture
– gathering of the data, and
– sheer volume
• Crawler-indexer architecture could not cope
with Web growth (end of 1990s)
– solution: distribute and parallelize computation
Cluster-based Architecture
• Current engines adopt a massively parallel cluster-
based architecture
– document partitioning is used
– the cluster is replicated to handle the overall query load
– cluster replicas maintained in various geographical
locations to decrease latency time (for nearby users)
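Document partitioning can be sketched as a scatter-gather step: each partition ranks only its own documents, and a broker merges the partial results. The shards, the toy scoring function and all names below are hypothetical; this is an illustration of the idea, not how any particular engine implements it.

```python
import heapq

# Hypothetical shards: each holds a disjoint part of the collection (doc_id -> text).
shards = [
    {1: "web search ranking", 2: "cooking recipes"},
    {3: "web graph and links", 4: "search engine ranking of web pages"},
]

def score(text, query_terms):
    """Toy score: number of query-term occurrences in the document."""
    tokens = text.split()
    return sum(tokens.count(t) for t in query_terms)

def search_shard(shard, query_terms, k):
    """A shard ranks only its own documents and returns its local top k."""
    scored = ((score(text, query_terms), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))

def search(query, k=2):
    """Broker: send the query to every shard, then merge the partial results."""
    terms = query.split()
    partial = [hit for shard in shards for hit in search_shard(shard, terms, k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]

print(search("web ranking"))
```

In a real deployment the per-shard calls would run in parallel on different machines; here they run sequentially to keep the sketch self-contained.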
• Many crucial details need to be addressed
– good balance between the internal and external activities
– good load balancing among different clusters
– fault tolerance at software level to protect against
hardware failures
• Caching
– Search engines need to be fast
• whenever possible, execute tasks in main memory
• caching is highly recommended and extensively used
– provides for shorter average response time
– significantly reduces workload on back-end servers
– decreases the overall amount of bandwidth utilized
– In the Web, caching can be done both at the client
or the server side
– Caching of answers
• the most effective caching technique in search engines
• query distribution follows a power law
– small cache can answer a large percentage of queries
• with a 30% hit-rate, capacity of search engine increases
by almost 43%
– Still, in any time window a large fraction of queries
will be unique
• hence, those queries will not be in the cache
• around 50% in Baeza-Yates et al.
• this can be improved by also
caching inverted lists at the search cluster level
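The 43% figure follows from simple arithmetic if one assumes that a cache hit costs the back end nothing: with hit rate h, only a fraction 1 - h of queries reaches the back-end servers, so the same hardware can absorb 1 / (1 - h) times the load. The sketch below checks that number and adds a minimal LRU answer cache; the class name, the eviction policy and the sample query stream are assumptions for illustration.

```python
from collections import OrderedDict

def capacity_gain(hit_rate):
    """Relative capacity increase, assuming cache hits are free for the back end."""
    return 1.0 / (1.0 - hit_rate) - 1.0

class AnswerCache:
    """Minimal LRU cache mapping a query string to its result list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)    # mark as most recently used
            return self.entries[query]
        self.misses += 1
        results = compute(query)               # fall through to the back end
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        return results

print(f"capacity gain at 30% hit rate: {capacity_gain(0.3):.1%}")

cache = AnswerCache(capacity=2)
for q in ["a", "b", "a", "c", "b", "a"]:
    cache.get(q, compute=lambda query: [query.upper()])
print(cache.hits, cache.misses)
```

Because query frequencies follow a power law, a small cache of this kind can still serve a large share of the traffic, while the unique queries always miss.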
Distributed Architectures
– There exist several variants of the crawler-indexer
architecture
– We describe here the most important ones
• most significant early example is Harvest
• among newest proposals, we distinguish the multi-site
architecture proposed by Baeza-Yates et al
– Harvest
• Harvest uses a distributed architecture to gather and distribute
data
• interestingly, it does not suffer from some of the common problems of
the crawler-indexer architectures, such as
– increased server load caused by reception of simultaneous requests
from different crawlers
– increased Web traffic, due to crawlers retrieving entire objects,
while most content is not retained eventually
– lack of coordination between engines, as information is
gathered independently by each crawler
• To avoid these issues, two components are
introduced: gatherers and brokers
• gatherer: collects and extracts indexing
information from one or more Web servers
• broker: provides indexing mechanism and
query interface to the data gathered
Multi-site Architecture
– As the document collection grows
• capacity of query processors has to grow as well
• unlikely that growth in size of single processors can
match growth of very large collections
– even if a large number of servers is used
– main reasons are physical constraints such as size of single
data center and power and cooling requirements
– Distributed resolution of queries using different
query processors is a viable approach
• enables a more scalable solution
• but also imposes new challenges
• one such challenge is the routing of queries to
appropriate query processors
– to utilize more efficiently available resources and provide
more precise results
– factors affecting query routing include geographical proximity,
query topic, or language of the query
– Geographical proximity: reduce network latency
by using resources close to the user posing the
query
– Possible implementation is DNS redirection
• according to IP address of client, the DNS service routes
query to appropriate Web server
– usually the closest in terms of network distance
– As another example, DNS service can use the
geographical location to determine where to route
queries to
– There is a fluctuation in submitted queries from a
particular geographic region during a day
• possible to offload a server from a busy region by
rerouting some queries to query servers in a less busy
region
Search Engine Ranking
• Ranking is the hardest and most important function of
a search engine
• A key first challenge
• devise an adequate process of evaluating the ranking,
in terms of relevance of results to the user
• without such evaluation, it is close to impossible to fine
tune the ranking function
• without fine tuning the ranking, there is no state-of-
the-art engine—this is an empirical field of science
• A second critical challenge
• identification of quality content in the Web
• Evidence of quality can be indicated by several
signals such as:
• domain names
• text content
• links (like Page Rank)
• Web page access patterns as monitored by the
search engine
• Additional useful signals are provided by the
layout of the Web page, its title, metadata,
font sizes, etc.
• A third critical challenge
• avoiding, preventing, managing Web spam
• spammers are malicious users who try to trick
search engines by
• artificially inflating signals used for ranking
• a consequence of the economic incentives of the
current advertising model adopted by search
engines
• A fourth major challenge
• defining the ranking function and computing it
Link-based Ranking
• Number of hyperlinks that point to a page
provides a measure of its popularity and
quality
• Many links in common among pages are
indicative of page relations with potential
value for ranking purposes
• Examples of ranking techniques that exploit
links are discussed next
Early Algorithms
• Use incoming links for ranking Web pages
• But, just counting links was not a very reliable measure of
authoritativeness
• Yuwono and Lee proposed three early ranking algorithms (in
addition to the classic TF-IDF vector ranking)
• Boolean spread (extends Boolean-model ranking)
• given page p in result set
• extend result set with pages that point to and are pointed to by page p
• Vector spread (extends vector-model ranking)
• given page p in result set
• extend result set with pages that point to and are pointed to by page p
• Most-cited
• a page p is assigned a score given by the sum of the number of query
words contained in other pages that point to page p
• WebQuery is an early algorithm that allows visual browsing of Web
pages
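The most-cited scheme described above can be sketched in a few lines. The data layout (`pages` as page to text, `in_links` as page to the list of pages pointing to it) and the decision to count each distinct query word once per linking page are assumptions of this example, not details given in the slides.

```python
def most_cited(query, pages, in_links):
    """Score each page by the query words found in the pages that link to it.

    pages:    {page_id: text}
    in_links: {page_id: [page_ids that point to it]}
    """
    terms = set(query.lower().split())
    scores = {}
    for page in pages:
        scores[page] = sum(
            len(terms & set(pages[src].lower().split()))  # query words in a linking page
            for src in in_links.get(page, [])
            if src != page                                 # only *other* pages count
        )
    return scores

# Hypothetical toy collection.
pages = {"p": "home page", "q": "web search engine", "r": "search tips"}
in_links = {"p": ["q", "r"]}
print(most_cited("web search", pages, in_links))
```

Note that the score of p depends only on the text of the pages linking to it, never on the text of p itself.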
Page Rank
• The basic idea is that good pages point to good pages
• Let p, r be two variables for pages and L a set of links
• PageRank is a link analysis method that search engines use to
rank web pages by their link structure
• PageRank assigns every node in the web graph a numerical score between
0 and 1. The PageRank of a
node depends on the link structure of the web
graph
• PageRank: scoring measure based only on the link
structure of web pages
– Every node in the web graph is given a score between 0
and 1, depending on its in- and out-links
– Given a query, a search engine combines the Page Rank
score with other values to compute the ranked retrieval
(e.g. cosine similarity, relevance feedback, etc.)
The PageRank Citation Ranking: Bringing
Order to the Web
• Why is Page Importance Rating important?
– New challenges for information retrieval on the World Wide
Web.
• Huge number of web pages: 150 million by 1998,
1000 billion by 2008
• Diversity of web pages: different topics, different quality, etc.
• What is PageRank?
• A method for rating the importance of web pages
objectively and mechanically using the link structure of the
web.
The History of PageRank
• PageRank was developed by Larry Page (hence
the name Page-Rank) and Sergey Brin.
• It was first developed as part of a research project about a
new kind of search engine. That project
started in 1995 and led to a functional
prototype in 1998.
• Shortly after, Page and Brin founded Google.
• 16 billion…
• There have been reports that PageRank
will be discontinued by Google.
• There are large numbers of Search Engine
Optimization (SEO) practitioners.
• They use various tricks to make a
web page appear more important under
PageRank.
• 150 million web pages → 1.7 billion links
• Backlinks and Forward links:
A and B are C’s backlinks
C is A and B’s forward link
• Underlying idea: computing a score reflecting
the “visitability” of a page
– When surfing the web, a user may visit some
pages more often than others (e.g. more in-links)
– Pages that are often visited are more likely to
contain relevant information
– When a dead-end page is reached, the user may
teleport (e.g. type an address in the browser)
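The random-surfer idea above can be sketched as a power iteration. This is a minimal illustration rather than the original formulation: the damping value 0.85, the fixed iteration count, the uniform teleport target and the toy graph are all assumptions, and every link target is assumed to appear as a key of the input dictionary.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration on a dict {page: [pages it links to]}.

    With probability `damping` the surfer follows a random out-link; otherwise
    the surfer teleports to a page chosen uniformly at random. The value 0.85
    is a conventional choice, not taken from the slides.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on dead-end pages is spread evenly: the surfer teleports.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in out_links[p]:
                new_rank[q] += damping * rank[p] / len(out_links[p])
        rank = new_rank
    return rank

# Hypothetical graph: A and B link to C, C links back to A, D is a dead end.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(graph)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

The scores sum to 1 and can be read as the long-run fraction of time the surfer spends on each page, which is why pages with more (and better) in-links score higher.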
• Page Rank: Strengths & Weaknesses
• Strengths
– Elegant theoretical foundation with nice
interpretations (e.g., stochastic matrices, principal
Eigenvectors, stationary state probabilities in random
walk model/ Markov-chain process)
– Can be extended to topic-based and personalized
models
– Static measure (i.e., can be precomputed)
– Relatively straight-forward to scale
– PageRank scores are relatively stable to graph
perturbations
• Weaknesses
– Rank order (unlike the scores themselves) is not stable to perturbations
HITS
• Hypertext Induced Topic Selection (HITS) is a link
analysis method developed by Jon Kleinberg in 1999
using Hub and Authority scores.
• considers set of pages S that point to or are pointed to by
pages in answer set
• pages that have many links pointing to them are called
authorities
• pages that have many outgoing links are called hubs
• positive two-way feedback
• better authority pages come from incoming edges from
good hubs
• better hub pages come from outgoing edges to good
authorities
– Start with a “root set” of pages relevant to a topic
(or information need)
– Create a “base subgraph” by adding to the root
set
• all pages that have incoming links from pages in the root set
• for each page in the root set: up to k pages pointing to it
– Compute authority and hubness scores for all
pages in the base subgraph
• Rank pages by decreasing authority scores
• Let
• H(p): hub value of page p
• A(p): authority value of page p
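• H(p) and A(p) are defined such that
• H(p) = Σ A(u), summed over the pages u in S that p points to
• A(p) = Σ H(v), summed over the pages v in S that point to p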
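The mutual reinforcement between hubs and authorities can be sketched as an iteration in which each authority score is the sum of the hub scores of the pages pointing to it, and each hub score is the sum of the authority scores of the pages it points to. The normalization step, the iteration count and the toy base subgraph are assumptions made for this example.

```python
import math

def hits(out_links, iterations=20):
    """Iterate hub and authority scores on a dict {page: [pages it links to]}."""
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for p in pages:
            for q in out_links[p]:
                auth[q] += hub[p]
        # Hub score of a page: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize both vectors to unit length so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical base subgraph: h1 and h2 act as hubs, a1 and a2 as authorities.
base = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(base)
print(sorted(auth, key=auth.get, reverse=True))
```

Ranking the pages by decreasing authority score, as in the last line, gives the final ordering of the answer.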
Simple Ranking Functions
• use a global ranking function such as PageRank
• In that case, quality of a Web page in the result set is
independent of the query
• the query only selects pages to be ranked
• use a linear combination of different ranking signals
• for instance, combine BM25 (text-based ranking) with
PageRank (link-based ranking)
• To illustrate, consider the pages p that satisfy query Q
• Rank score R(p,Q) of page p with regard to query Q can
be computed as
• R(p,Q) = α BM25(p,Q) + (1 − α) PR(p)
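A small sketch of this linear combination follows. The candidate scores are invented and assumed to be already scaled to a comparable range (raw BM25 values are not bounded, so some normalization would be needed in practice); `alpha` plays the role of the weight in the formula above.

```python
def combined_score(bm25, pagerank, alpha):
    """R(p, Q) = alpha * BM25(p, Q) + (1 - alpha) * PR(p)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * bm25 + (1.0 - alpha) * pagerank

def rank(candidates, alpha):
    """Order the pages that satisfy the query by decreasing combined score."""
    return sorted(
        candidates,
        key=lambda page: combined_score(*candidates[page], alpha),
        reverse=True,
    )

# Hypothetical (BM25, PageRank) pairs, both assumed to be scaled to [0, 1].
candidates = {"p1": (0.9, 0.1), "p2": (0.5, 0.8), "p3": (0.7, 0.4)}

print(rank(candidates, alpha=1.0))   # text evidence only
print(rank(candidates, alpha=0.5))   # equal weight on text and links
```

Setting `alpha` to 1 ranks purely by text match, while lowering it lets the query-independent link score reorder the same candidate set.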
IRT Unit_4.pptx

More Related Content

Similar to IRT Unit_4.pptx

introduction to web engineering.pptx
introduction to web engineering.pptxintroduction to web engineering.pptx
introduction to web engineering.pptxNaglaaFathy42
 
Ch2_Ed7_Network_Applications.ppt
Ch2_Ed7_Network_Applications.pptCh2_Ed7_Network_Applications.ppt
Ch2_Ed7_Network_Applications.pptFernandoLipardoJr
 
Introduction to the Internet and Web.pptx
Introduction to the Internet and Web.pptxIntroduction to the Internet and Web.pptx
Introduction to the Internet and Web.pptxhishamousl
 
Web-Oriented Architecture (WOA)
Web-Oriented Architecture (WOA)Web-Oriented Architecture (WOA)
Web-Oriented Architecture (WOA)thetechnicalweb
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...Ram G Athreya
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 
most common Web Testing interview questions and answers.pptx
most common Web Testing interview questions and answers.pptxmost common Web Testing interview questions and answers.pptx
most common Web Testing interview questions and answers.pptxMetSylvaMetuge
 
Accessing network needs
Accessing network needsAccessing network needs
Accessing network needsrahuldaredia21
 
Addressing scalability challenges in peer-to-peer search
Addressing scalability challenges in peer-to-peer searchAddressing scalability challenges in peer-to-peer search
Addressing scalability challenges in peer-to-peer searchHarisankar H
 
1. web technology basics
1. web technology basics1. web technology basics
1. web technology basicsJyoti Yadav
 
Rzepnicki_thesis_presentation_2003(2) (1)
Rzepnicki_thesis_presentation_2003(2) (1)Rzepnicki_thesis_presentation_2003(2) (1)
Rzepnicki_thesis_presentation_2003(2) (1)Witold Rzepnicki
 

Similar to IRT Unit_4.pptx (20)

introduction to web engineering.pptx
introduction to web engineering.pptxintroduction to web engineering.pptx
introduction to web engineering.pptx
 
Ch2_Ed7_Network_Applications.ppt
Ch2_Ed7_Network_Applications.pptCh2_Ed7_Network_Applications.ppt
Ch2_Ed7_Network_Applications.ppt
 
Introduction to the Internet and Web.pptx
Introduction to the Internet and Web.pptxIntroduction to the Internet and Web.pptx
Introduction to the Internet and Web.pptx
 
world wide web
world wide webworld wide web
world wide web
 
Web-Oriented Architecture (WOA)
Web-Oriented Architecture (WOA)Web-Oriented Architecture (WOA)
Web-Oriented Architecture (WOA)
 
E commerce
E commerceE commerce
E commerce
 
WELecture01.pptx
WELecture01.pptxWELecture01.pptx
WELecture01.pptx
 
Gaurav web mining
Gaurav web miningGaurav web mining
Gaurav web mining
 
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
A Public Cloud Based SOA Workflow for Machine Learning Based Recommendation A...
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 
most common Web Testing interview questions and answers.pptx
most common Web Testing interview questions and answers.pptxmost common Web Testing interview questions and answers.pptx
most common Web Testing interview questions and answers.pptx
 
SWS-Lecture5.ppt
SWS-Lecture5.pptSWS-Lecture5.ppt
SWS-Lecture5.ppt
 
Accessing network needs
Accessing network needsAccessing network needs
Accessing network needs
 
Html
HtmlHtml
Html
 
Addressing scalability challenges in peer-to-peer search
Addressing scalability challenges in peer-to-peer searchAddressing scalability challenges in peer-to-peer search
Addressing scalability challenges in peer-to-peer search
 
E3602042044
E3602042044E3602042044
E3602042044
 
1. web technology basics
1. web technology basics1. web technology basics
1. web technology basics
 
Rzepnicki_thesis_presentation_2003(2) (1)
Rzepnicki_thesis_presentation_2003(2) (1)Rzepnicki_thesis_presentation_2003(2) (1)
Rzepnicki_thesis_presentation_2003(2) (1)
 
ch1.pptx
ch1.pptxch1.pptx
ch1.pptx
 
Internet.ppt
Internet.pptInternet.ppt
Internet.ppt
 

More from thenmozhip8

IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptxthenmozhip8
 
packages unit 5 .ppt
packages  unit 5 .pptpackages  unit 5 .ppt
packages unit 5 .pptthenmozhip8
 
Definning class.pptx unit 3
Definning class.pptx unit 3Definning class.pptx unit 3
Definning class.pptx unit 3thenmozhip8
 
exception-handling-in-java.ppt unit 2
exception-handling-in-java.ppt unit 2exception-handling-in-java.ppt unit 2
exception-handling-in-java.ppt unit 2thenmozhip8
 
unit 1 full ppt.pptx
unit 1 full ppt.pptxunit 1 full ppt.pptx
unit 1 full ppt.pptxthenmozhip8
 

More from thenmozhip8 (14)

U5 SPC.pptx
U5 SPC.pptxU5 SPC.pptx
U5 SPC.pptx
 
Unit 4.pdf
Unit 4.pdfUnit 4.pdf
Unit 4.pdf
 
unit 3 ppt.pptx
unit 3 ppt.pptxunit 3 ppt.pptx
unit 3 ppt.pptx
 
U2.ppt
U2.pptU2.ppt
U2.ppt
 
Unit 1 .ppt
Unit 1 .pptUnit 1 .ppt
Unit 1 .ppt
 
IR UNIT V.docx
IR UNIT  V.docxIR UNIT  V.docx
IR UNIT V.docx
 
UNIT 3 IRT.docx
UNIT 3 IRT.docxUNIT 3 IRT.docx
UNIT 3 IRT.docx
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
 
IRT Unit_I.pptx
IRT Unit_I.pptxIRT Unit_I.pptx
IRT Unit_I.pptx
 
packages unit 5 .ppt
packages  unit 5 .pptpackages  unit 5 .ppt
packages unit 5 .ppt
 
unit 4 .ppt
unit 4 .pptunit 4 .ppt
unit 4 .ppt
 
Definning class.pptx unit 3
Definning class.pptx unit 3Definning class.pptx unit 3
Definning class.pptx unit 3
 
exception-handling-in-java.ppt unit 2
exception-handling-in-java.ppt unit 2exception-handling-in-java.ppt unit 2
exception-handling-in-java.ppt unit 2
 
unit 1 full ppt.pptx
unit 1 full ppt.pptxunit 1 full ppt.pptx
unit 1 full ppt.pptx
 

Recently uploaded

MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 

Recently uploaded (20)

DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 

IRT Unit_4.pptx

  • 1. UNIT-IV IV Year / VIII Semester By K.Karthick AP/CSE KNCET. KONGUNADU COLLEGE OF ENGINEERING AND TECHNOLOGY (Autonomous) NAMAKKAL- TRICHY MAIN ROAD, THOTTIAM DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING CS8080 – Information Retrieval Techniques
  • 3. The Web • A Brief History • In 1990, Berners-Lee Wrote the HTTP protocol • Defined the HTML language • Wrote the first browser, which he called World Wide Web • Wrote the first Web server • In 1991, he made his browser and server software available in the Internet. The Web was born! • Since its inception, the Web became a huge success • Well over 20 billion pages are now available and accessible in the Web • More than one fourth of humanity now access the Web on a regular basis
  • 4. How the Web Changed Search • Web search is today the most prominent application of IR and its techniques—the ranking and indexing components of any search engine are fundamentally IR pieces of technology • The first major impact of the Web on search is related to the characteristics of the document collection itself • The Web is composed of pages distributed over millions of sites and connected through hyperlinks • This requires collecting all documents and storing copies of them in a central repository, prior to indexing • This new phase in the IR process, introduced by the Web, is called crawling
  • 5. • The second major impact of the Web on search is related to: • The size of the collection • The volume of user queries submitted on a daily basis • As a consequence, performance and scalability have become critical characteristics of the IR system • The third major impact: in a very large collection, predicting relevance is much harder than before • Fortunately, the Web also includes new sources of evidence • Ex: hyperlinks and user clicks in documents in the answer set
  • 6. • The fourth major impact derives from the fact that the Web is also a medium to do business • The search problem has been extended beyond the seeking of text information to also encompass other user needs • Ex: the price of a book, the phone number of a hotel, the link for downloading software • The fifth major impact of the Web on search is Web spam • Web spam: abusive availability of commercial information disguised in the form of informational content • This difficulty is so large that today we talk of Adversarial Web Retrieval
  • 7. Web structure • The web graph • We can view the static Web consisting of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge. • We refer to the hyperlinks into a page as in-links and those out of a page as out-links. • The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15, in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it. These notions are represented in Figure 4.1. • The web can be represented as a graph: • • Webpage ≡ node • • hyperlink ≡ directed edge • • # in-links ≡ in-degree of a node • • # out-links ≡ out-degree of a node
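The graph view above can be sketched in a few lines. This is only an illustration: the edge list `links` and the helper `degrees` are hypothetical names, and the four-page graph is invented for the example.

```python
from collections import defaultdict

# Hypothetical toy web graph: each pair (source, target) is one hyperlink.
links = [("A", "C"), ("B", "C"), ("C", "A"), ("C", "D")]

def degrees(edges):
    """Return (in_degree, out_degree) dictionaries for a list of directed edges."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1   # one more out-link leaving `source`
        in_degree[target] += 1    # one more in-link arriving at `target`
    return dict(in_degree), dict(out_degree)

in_deg, out_deg = degrees(links)
```

Here page C has two in-links (from A and B) and two out-links (to A and D); a page that never appears as a target, such as B, simply has no entry in `in_deg`.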
  • 8.
  • 9. Search Engine Architectures • A software architecture consists of software components, the interfaces provided by those components, and the relationships between them – describes a system at a particular level of abstraction • Architecture of a search engine determined by 2 requirements – effectiveness (quality of results) and efficiency (response time and throughput)
  • 10. • Each Search Engine architecture has its own components. The architecture of search engines falls into the following categories: • Basic architecture- Centralized crawler • Cluster based architecture • Distributed Architectures • Multi site Architecture
  • 11. Basic architecture- Centralized crawler – used by most engines – crawlers are software agents that traverse the Web copying pages – pages crawled are stored in a central repository and then indexed – index is used in a centralized fashion to answer queries – most search engines use indices based on the inverted index – only a logical, rather than a literal, view of the text needs to be indexed • Normalization operations – removal of punctuation – substitution of multiple consecutive spaces by a single space – conversion of uppercase to lowercase letters • Some engines eliminate stopwords to reduce index size • Index is complemented with metadata associated with pages – Creation date, size, title, etc.
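The normalization operations and the inverted index described above might look roughly like the following sketch. The stopword list, the function names, and the two sample pages are assumptions made for illustration; they are not taken from any particular engine.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; real engines use larger, language-specific lists.
STOPWORDS = {"the", "a", "of", "and", "is"}

def normalize(text):
    """Apply the normalization operations listed above and return the tokens."""
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with a space
    text = re.sub(r"\s+", " ", text)       # collapse consecutive whitespace
    text = text.strip().lower()            # uppercase -> lowercase
    return [t for t in text.split(" ") if t and t not in STOPWORDS]

def build_inverted_index(pages):
    """Map each term to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in normalize(text):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Invented sample pages, keyed by page id.
pages = {1: "The Web,  a HUGE graph!", 2: "Graph of the web pages."}
index = build_inverted_index(pages)
```

With these two pages, the terms "web" and "graph" each point to both page ids, while "huge" and "pages" point to one page each.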
  • 12. • Given a query – 10 results shown are subset of complete result set – if user requests more results, search engine can • recompute the query to generate the next 10 results • obtain them from a partial result set maintained in main memory – In any case, a search engine never computes the full answer set for the whole Web
  • 13.
  • 14. • Main problem faced by this architecture – gathering of the data, and – sheer volume • Crawler-indexer architecture could not cope with Web growth (end of 1990s) – solution: distribute and parallelize computation
  • 15. Cluster-based Architecture • Current engines adopt a massively parallel cluster-based architecture – document partitioning is used – replicated to handle the overall query load – cluster replicas maintained in various geographical locations to decrease latency time (for nearby users) • Many crucial details need to be addressed – good balance between the internal and external activities – good load balancing among different clusters – fault tolerance at software level to protect against hardware failures
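Document partitioning can be illustrated with a small sketch: each partition holds a slice of the collection, every partition answers the query locally, and the local top-k lists are merged. The modulo assignment, the function names, and the precomputed `scores` dictionary (standing in for a real per-partition ranking step) are all assumptions for this example.

```python
import heapq

def partition_documents(doc_ids, num_partitions):
    """Assign each document id to a partition (here simply by id modulo)."""
    partitions = [[] for _ in range(num_partitions)]
    for doc_id in doc_ids:
        partitions[doc_id % num_partitions].append(doc_id)
    return partitions

def search_partition(partition, scores, k):
    """Return the local top-k hits of one partition as (score, doc_id) pairs."""
    return heapq.nlargest(k, ((scores[d], d) for d in partition))

def search_cluster(partitions, scores, k):
    """Query every partition, then merge the local top-k lists into a global top-k."""
    local_hits = [hit for part in partitions
                  for hit in search_partition(part, scores, k)]
    return heapq.nlargest(k, local_hits)

# Invented query scores per document, standing in for a real ranking function.
scores = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7, 4: 0.1, 5: 0.8}
partitions = partition_documents(scores.keys(), num_partitions=2)
top3 = search_cluster(partitions, scores, k=3)
```

Because each partition returns its own k best hits, the merged list is guaranteed to contain the global top k; replicas of the whole cluster would then be added to absorb query load.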
  • 16. • Caching – Search engines need to be fast • whenever possible, execute tasks in main memory • caching is highly recommended and extensively used – provides for shorter average response time – significantly reduces workload on back-end servers – decreases the overall amount of bandwidth utilized – In the Web, caching can be done both at the client or the server side
  • 17. – Caching of answers • the most effective caching technique in search engines • query distribution follows a power law – small cache can answer a large percentage of queries • with a 30% hit-rate, capacity of search engine increases by almost 43% – Still, in any time window a large fraction of queries will be unique • hence, those queries will not be in the cache • 50% in Baeza-Yates et al. • this can be improved by also caching inverted lists at the search cluster level
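The 43% figure can be checked with a short calculation, under the assumption that a cache hit costs the back end nothing: if a fraction h of queries is answered from the cache, the back end sees only 1 − h of the load, so capacity grows by a factor 1/(1 − h), and 1/0.7 ≈ 1.43. The sketch below pairs that arithmetic with a tiny least-recently-used answer cache; `capacity_gain` and `AnswerCache` are hypothetical names, and LRU eviction is one possible policy rather than the one any particular engine uses.

```python
from collections import OrderedDict

def capacity_gain(hit_rate):
    """Relative capacity increase, assuming cache hits cost the back end nothing."""
    return 1.0 / (1.0 - hit_rate) - 1.0

class AnswerCache:
    """Tiny LRU cache of query -> result list (illustrative only)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query, compute):
        if query in self.entries:
            self.hits += 1
            self.entries.move_to_end(query)      # mark as most recently used
            return self.entries[query]
        self.misses += 1
        results = compute(query)                 # fall through to the back end
        self.entries[query] = results
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)     # evict least recently used
        return results

cache = AnswerCache(max_entries=2)
for q in ["web", "ir", "web", "rank", "ir"]:
    cache.get(q, lambda query: [query.upper()])  # stand-in for real query processing
```

In this invented query stream only the second "web" is a hit; the repeated "ir" misses because it was evicted when "rank" arrived, which shows why a large share of unique queries limits what answer caching alone can achieve.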
  • 18.
  • 19. Distributed Architectures – There exist several variants of the crawler-indexer architecture – We describe here the most important ones • most significant early example is Harvest • among the newest proposals, we distinguish the multi-site architecture proposed by Baeza-Yates et al. – Harvest • Harvest uses a distributed architecture to gather and distribute data • interestingly, it does not suffer from some of the common problems of crawler-indexer architectures, such as – increased server load caused by the reception of simultaneous requests from different crawlers – increased Web traffic, due to crawlers retrieving entire objects while most of the content is not eventually retained – lack of coordination between engines, as information is gathered independently by each crawler
  • 20. • To avoid these issues, two components are introduced: gatherers and brokers • gatherer: collects and extracts indexing information from one or more Web servers • broker: provides indexing mechanism and query interface to the data gathered
  • 21.
  • 22. Multi-site Architecture – As the document collection grows • capacity of query processors has to grow as well • unlikely that growth in size of single processors can match growth of very large collections – even if a large number of servers is used – main reasons are physical constraints such as size of single data center and power and cooling requirements
  • 23. – Distributed resolution of queries using different query processors is a viable approach • enables a more scalable solution • but also imposes new challenges • one such challenge is the routing of queries to appropriate query processors – to utilize more efficiently available resources and provide more precise results – factors affecting query routing include geographical proximity, query topic, or language of the query
  • 24. – Geographical proximity: reduce network latency by using resources close to the user posing the query – Possible implementation is DNS redirection • according to IP address of client, the DNS service routes query to appropriate Web server – usually the closest in terms of network distance – As another example, DNS service can use the geographical location to determine where to route queries to – There is a fluctuation in submitted queries from a particular geographic region during a day • possible to offload a server from a busy region by rerouting some queries to query servers in a less busy region
  • 25. Search Engine Ranking • Ranking is the hardest and most important function of a search engine • A key first challenge • devise an adequate process of evaluating the ranking, in terms of relevance of results to the user • without such evaluation, it is close to impossible to fine tune the ranking function • without fine tuning the ranking, there is no state-of-the-art engine; this is an empirical field of science • A second critical challenge • identification of quality content in the Web
  • 26. • Evidence of quality can be indicated by several signals such as: • domain names • text content • links (like Page Rank) • Web page access patterns as monitored by the search engine • Additional useful signals are provided by the layout of the Web page, its title, metadata, font sizes, etc.
  • 27. • A third critical challenge • avoiding, preventing, managing Web spam • spammers are malicious users who try to trick search engines by artificially inflating signals used for ranking • a consequence of the economic incentives of the current advertising model adopted by search engines • A fourth major challenge • defining the ranking function and computing it
  • 28. Link-based Ranking • Number of hyperlinks that point to a page provides a measure of its popularity and quality • Many links in common among pages are indicative of page relations with potential value for ranking purposes • Examples of ranking techniques that exploit links are discussed next
  • 29. Early Algorithms • Use incoming links for ranking Web pages • But just counting links was not a very reliable measure of authoritativeness • Yuwono and Lee proposed three early ranking algorithms (in addition to the classic TF-IDF vector ranking) • Boolean spread • given page p in the result set of the Boolean model • extend the result set with pages that point to and are pointed to by page p • Vector spread • given page p in the result set of the vector model • extend the result set with pages that point to and are pointed to by page p • Most-cited • a page p is assigned a score given by the sum of the number of query words contained in other pages that point to page p • WebQuery is an early algorithm that allows visual browsing of Web pages
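The most-cited score can be sketched as follows. The sketch assumes that each distinct query word found in a linking page counts once and that self-links are ignored; the function name, the term sets, and the three-page graph are invented for the example.

```python
def most_cited_scores(query_terms, page_terms, links):
    """Score each page by the query words found in the other pages that link to it."""
    query = set(query_terms)
    scores = {page: 0 for page in page_terms}
    for source, target in links:
        if source == target or target not in scores:
            continue  # only *other* pages count; skip targets we know nothing about
        # Assumption: each distinct query word in the source page counts once.
        scores[target] += len(query & page_terms.get(source, set()))
    return scores

# Invented pages (as sets of terms) and hyperlinks (source, target).
page_terms = {"A": {"web", "search"}, "B": {"web"}, "C": {"graph"}}
links = [("A", "C"), ("B", "C"), ("C", "A")]
scores = most_cited_scores(["web", "search"], page_terms, links)
```

Page C scores 3 here (two query words from A plus one from B) even though C itself contains no query word, which is the point of the method: the evidence comes from the citing pages.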
  • 30. Page Rank • The basic idea is that good pages point to good pages • Let p, r be two variables for pages and L a set of links • PageRank is a link analysis method that search engines use to rank web pages based on link structure • PageRank assigns every node in the web graph a numerical score between 0 and 1, known as its Page Rank. The Page Rank of a node depends on the link structure of the web graph • Page Rank: scoring measure based only on the link structure of web pages – Every node in the web graph is given a score between 0 and 1, depending on its in-links and out-links – Given a query, a search engine combines the Page Rank score with other values (e.g. cosine similarity, relevance feedback, etc.) to compute the ranked retrieval
  • 31. The PageRank Citation Ranking: Bringing Order to the Web • Why is Page Importance Rating important? – New challenges for information retrieval on the World Wide Web. • Huge number of web pages: 150 million by 1998, 1000 billion by 2008 • Diversity of web pages: different topics, different quality, etc. • What is PageRank? • A method for rating the importance of web pages objectively and mechanically using the link structure of the web.
  • 32. The History of PageRank • PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin. • It was first developed as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998. • Shortly after, Page and Brin founded Google. • 16 billion…
  • 33. • There have been reports that Google may discontinue PageRank. • Search engine optimization (SEO) is widely practiced. • SEO practitioners use various tricks to make a web page appear more important under the PageRank rating.
  • 34. • 150 million web pages → 1.7 billion links • Backlinks and forward links: A and B are C’s backlinks; C is A’s and B’s forward link
  • 46. • Underlying idea: computing a score reflecting the “visitability” of a page – When surfing the web, a user may visit some pages more often than others (e.g. more in-links) – Pages that are often visited are more likely to contain relevant information – When a dead-end page is reached, the user may teleport (e.g. type an address in the browser)
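The random-surfer idea above can be sketched as a power iteration. This is an illustrative version under stated assumptions, not the exact formulation of any particular engine: the damping factor 0.85 is a commonly used value that does not come from these slides, dead-end pages are assumed to teleport uniformly to any page, and the names `pagerank` and `out_links` and the four-page graph are invented.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration for a random surfer who follows links or teleports.

    `out_links` maps every page to the list of pages it links to; each target
    must also appear as a key. `damping` is an assumed, commonly used value.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}          # start from a uniform distribution
    for _ in range(iterations):
        # With probability (1 - damping) the surfer teleports to a random page.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = out_links[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share         # follow one of the out-links
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share         # dead end: teleport anywhere
        rank = new_rank
    return rank

# Invented graph: D is a dead-end page with no out-links.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(graph)
```

The scores stay between 0 and 1 and sum to 1, so they can be read as the long-run probability of finding the surfer on each page; C, with two in-links, ends up ahead of B and D, which receive only teleportation mass.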
  • 47. • Page Rank: Strengths & Weaknesses • Strengths – Elegant theoretical foundation with nice interpretations (e.g., stochastic matrices, principal eigenvectors, stationary state probabilities in random walk model / Markov-chain process) – Can be extended to topic-based and personalized models – Static measure (i.e., can be precomputed) – Relatively straightforward to scale – PageRank scores are relatively stable to graph perturbations • Weaknesses – The rank ordering induced by the scores is not stable to perturbations
  • 48. HITS • Hypertext Induced Topic Selection (HITS) is a link analysis method developed by Jon Kleinberg in 1999 using Hub and Authority scores. • considers the set of pages S that point to or are pointed to by pages in the answer set • pages that have many links pointing to them are called authorities • pages that have many outgoing links are called hubs • positive two-way feedback • better authority pages come from incoming edges from good hubs • better hub pages come from outgoing edges to good authorities
  • 49. – Start with a “root set” of pages relevant to a topic (or information need) – Create a “base subgraph” by adding to the root set: • all pages that have incoming links from pages in the root set • for each page in the root set, up to k pages pointing to it – Compute authority and hubness scores for all pages in the base subgraph – Rank pages by decreasing authority scores
  • 50.
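  • 51. • Let • H(p): hub value of page p • A(p): authority value of page p • H(p) and A(p) are defined such that • H(p) = Σ A(u), summed over the pages u in S that p points to • A(p) = Σ H(v), summed over the pages v in S that point to p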
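The mutual reinforcement between hubs and authorities can be sketched as an iteration: the authority value of a page is recomputed as the sum of the hub values of the pages pointing to it, the hub value as the sum of the authority values of the pages it points to, and both vectors are rescaled to unit length after each pass. The function name `hits`, the fixed iteration count, and the small example graph are assumptions made for illustration.

```python
import math

def hits(links, iterations=50):
    """Iterate hub and authority updates over a list of (source, target) links."""
    nodes = sorted({page for edge in links for page in edge})
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # Authority of p: sum of the hub values of the pages that point to p.
        auth = {p: 0.0 for p in nodes}
        for source, target in links:
            auth[target] += hub[source]
        # Hub of p: sum of the authority values of the pages that p points to.
        new_hub = {p: 0.0 for p in nodes}
        for source, target in links:
            new_hub[source] += auth[target]
        # Rescale both vectors to unit Euclidean length so the values stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in new_hub.items()}
    return hub, auth

# Invented base subgraph: three hub-like pages and two authority-like pages.
links = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
```

In this graph a1 is the strongest authority because all three hubs point to it, and h1 is the strongest hub because it points to both authorities.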
  • 52. Simple Ranking Functions • use a global ranking function such as PageRank • In that case, quality of a Web page in the result set is independent of the query • the query only selects pages to be ranked • use a linear combination of different ranking signals • for instance, combine BM25 (text-based ranking) with PageRank (link-based ranking) • To illustrate, consider the pages p that satisfy query Q • Rank score R(p,Q) of page p with regard to query Q can be computed as • R(p,Q) = α BM25(p,Q) + (1 − α) PR(p), where the weight α lies between 0 and 1
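A minimal sketch of the linear combination follows, assuming both signals have already been scaled to comparable ranges (raw BM25 and PageRank values usually are not). The default weight 0.7, the function names, and the candidate scores are invented for illustration.

```python
def combined_score(bm25_score, pagerank_score, alpha=0.7):
    """Weighted sum of a text-based and a link-based signal, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * bm25_score + (1.0 - alpha) * pagerank_score

def rank_pages(candidates, alpha=0.7):
    """Order pages by decreasing combined score.

    `candidates` maps each page that satisfies the query to a pair
    (bm25_score, pagerank_score); both are assumed to be pre-normalized.
    """
    return sorted(
        candidates,
        key=lambda page: combined_score(*candidates[page], alpha=alpha),
        reverse=True,
    )

# Invented, already-normalized scores for three pages matching a query.
candidates = {"p1": (0.9, 0.1), "p2": (0.5, 0.8), "p3": (0.2, 0.3)}
text_heavy = rank_pages(candidates, alpha=0.7)
link_heavy = rank_pages(candidates, alpha=0.2)
```

With the weight at 0.7 the text match dominates and p1 comes first; lowering it to 0.2 lets the link-based signal dominate and moves p2 to the top, which is why the weight normally has to be tuned against relevance judgments.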