Intelligent IR on the World Wide Web
CSC 575
Intelligent Information Retrieval
Intelligent Information Retrieval 2
Intelligent IR on the World Wide Web
 Web IR versus Classic IR
 Web Spiders and Crawlers
 Citation/hyperlink Indexing and Analysis
 Intelligent Agents for the Web
Intelligent Information Retrieval 3
IR on the Web vs. Classic IR
 Input: publicly accessible Web
 Goal: retrieve high quality pages that are relevant to
user’s need
 static (text, audio, images, etc.)
 dynamically generated (mostly database access)
 What’s different about the Web:
 large volume
 distributed data
 heterogeneity of the data
 lack of stability
 high duplication
 high linkage
 lack of quality standard
Intelligent Information Retrieval 4
Search Engine Early History
 In 1990, Alan Emtage of McGill Univ. developed Archie (short for
“archives”)
 Assembled lists of files available on many FTP servers.
 Allowed regex search of these file names.
 In 1993, Veronica and Jughead were developed to search names of
text files available through Gopher servers.
 In 1993, early Web robots (spiders) were built to collect URL’s:
 Wanderer
 ALIWEB (Archie-Like Index of the WEB)
 WWW Worm (indexed URL’s and titles for regex search)
 In 1994, Stanford grad students David Filo and Jerry Yang started
manually collecting popular web sites into a topical hierarchy called
Yahoo.
Intelligent Information Retrieval 5
Search Engine Early History
 In early 1994, Brian Pinkerton developed WebCrawler as a class
project at U Wash.
 Eventually became part of Excite and AOL
 A few months later, Michael “Fuzzy” Mauldin, a grad student at CMU,
developed Lycos
 First to use a standard IR system
 First to index a large set of pages
 In late 1995, DEC developed Altavista
 Used a large farm of Alpha machines to quickly process large numbers of
queries
 Supported Boolean operators, phrases in queries.
 In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford,
started Google
 Main advance was use of link analysis to rank results partially based on
authority.
Intelligent Information Retrieval 6
Web Search
(Figure: a query string is submitted to the IR system, which returns ranked
documents (1. Page1, 2. Page2, 3. Page3, …) drawn from a document corpus
built by a web spider.)
Intelligent Information Retrieval 7
Spiders (Robots/Bots/Crawlers)
 Start with a comprehensive set of root URL’s as seeds for the search.
 Follow all links on these pages recursively to find
additional pages.
 Index all newly found pages in an inverted index as they
are encountered.
 May allow users to directly submit pages to be indexed
(and crawled from).
Intelligent Information Retrieval 8
Search Strategy Trade-Off’s
 Breadth-first search strategy explores uniformly
outward from the root page but requires memory of all
nodes on the previous level (exponential in depth).
Standard spidering method.
 Depth-first search requires memory of only depth
times branching-factor (linear in depth) but gets “lost”
pursuing a single thread.
 Both strategies implementable using a queue of links
(URL’s).
Intelligent Information Retrieval 9
Avoiding Page Duplication
 Must detect when revisiting a page that has already
been spidered (web is a graph not a tree).
 Must efficiently index visited pages to allow rapid
recognition test.
 Tree indexing (e.g. trie)
 Hashtable
 Index page using URL as a key.
 Must canonicalize URL’s (e.g. delete ending “/”)
 Not detect duplicated or mirrored pages.
 Index page using textual content as a key.
 Requires first downloading page.
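To make the recognition test concrete, here is a minimal Python sketch of URL canonicalization backed by a hashtable (a set); the specific normalization rules shown are illustrative assumptions, not a complete standard.

from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    # Normalize a URL so trivial variants map to the same hash key.
    parts = urlparse(url)
    path = parts.path.rstrip('/') or '/'        # delete ending "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                       path, '', parts.query, ''))

visited = set()                                 # hashtable keyed by canonical URL

def already_seen(url):
    key = canonicalize(url)
    if key in visited:
        return True
    visited.add(key)
    return False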
Intelligent Information Retrieval 10
Spidering Algorithm
Initialize queue (Q) with initial set of known URL’s.
Until Q empty or page or time limit exhausted:
Pop URL, L, from front of Q.
If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt…)
continue loop.
If already visited L, continue loop.
Download page, P, for L.
If cannot download P (e.g. 404 error, robot excluded)
continue loop.
Index P (e.g. add to inverted index or store cached
copy).
Parse P to obtain list of new links N.
Append N to the end of Q.
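A direct Python rendering of the spidering algorithm above is sketched below; fetch_page, extract_links, and index are assumed helper functions passed in by the caller, and the loop is a teaching sketch rather than a production crawler.

from collections import deque
from urllib.parse import urljoin

NON_HTML = ('.gif', '.jpeg', '.ps', '.pdf', '.ppt')

def spider(seed_urls, fetch_page, extract_links, index, page_limit=1000):
    queue = deque(seed_urls)               # Q, initialized with known URLs
    visited = set()
    while queue and page_limit > 0:
        url = queue.popleft()              # pop URL L from front of Q
        if url.lower().endswith(NON_HTML):
            continue                       # not an HTML page
        if url in visited:
            continue                       # already visited
        visited.add(url)
        page = fetch_page(url)             # returns None on 404 / robot exclusion
        if page is None:
            continue
        index(url, page)                   # e.g. add to inverted index or cache
        page_limit -= 1
        for link in extract_links(page):
            queue.append(urljoin(url, link))   # append new links N to end of Q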
Intelligent Information Retrieval 11
Queueing Strategy
 How new links are added to the queue determines the search
strategy (sketched below).
 FIFO (append to end of Q)
 gives breadth-first search.
 LIFO (add to front of Q)
 gives depth-first search.
 Heuristically ordering the Q gives a “focused crawler”
that directs its search towards “interesting” pages.
 May be able to use standard AI search algorithms such as Best-
first search, A*, etc.
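The queue discipline is the only thing that changes between these strategies, as this sketch shows; score() is an assumed stand-in for any heuristic “interestingness” measure.

import heapq
from collections import deque

def add_link(queue, link, strategy="bfs", score=None):
    if strategy == "bfs":
        queue.append(link)         # FIFO (append to end of Q): breadth-first
    elif strategy == "dfs":
        queue.appendleft(link)     # LIFO (add to front of Q): depth-first
    else:
        # Focused crawl: queue is a plain list used as a heap; heapq pops
        # the smallest item, so the score is negated to explore the most
        # "interesting" link first.
        heapq.heappush(queue, (-score(link), link))

For "bfs"/"dfs" the queue is a deque; for the focused case it is a list managed by heapq.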
Intelligent Information Retrieval 12
Restricting Spidering
 Restrict spider to a particular site.
 Remove links to other sites from Q.
 Restrict spider to a particular directory.
 Remove links not in the specified directory.
 Obey page-owner restrictions
 robot exclusion protocol
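Python’s standard library implements the robot exclusion protocol check directly; a short sketch (the user-agent string and URLs are placeholders):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()                          # fetch and parse the site's robots.txt

if rp.can_fetch("MySpider", "http://www.example.com/private/page.html"):
    pass                           # this spider is allowed to crawl the URL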
Intelligent Information Retrieval 13
Anchor Text Indexing
 Extract anchor text (between <a> and </a>) of each link:
 Anchor text is usually descriptive of the document to which it points.
 Add anchor text to the content of the destination page to provide additional
relevant keyword indices.
 Used by Google:
 <a href=“http://www.microsoft.com”>Evil Empire</a>
 <a href=“http://www.ibm.com”>IBM</a>
 Helps when descriptive text in destination page is
embedded in image logos rather than in accessible text.
 Many times anchor text is not useful:
 “click here”
 Increases content more for popular pages with many in-
coming links, increasing recall of these pages.
 May even give higher weights to tokens from anchor text.
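A minimal sketch of anchor-text extraction with the standard-library HTMLParser; an indexer would then credit each extracted string to the link’s destination page as extra keyword terms.

from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    # Collects (href, anchor text) pairs from <a> ... </a> elements.
    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.anchors.append((self._href, ''.join(self._text).strip()))
            self._href = None

parser = AnchorExtractor()
parser.feed('<a href="http://www.ibm.com">IBM</a>')
print(parser.anchors)              # [('http://www.ibm.com', 'IBM')]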
Intelligent Information Retrieval 14
Multi-Threaded Spidering
 Bottleneck is network delay in downloading individual
pages.
 Best to have multiple threads running in parallel each
requesting a page from a different host.
 Distribute URL’s to threads to guarantee an equitable
distribution of requests across different hosts, to
maximize throughput and avoid overloading any
single server.
 Early Google spider had multiple coordinated crawlers
with about 300 threads each, together able to download
over 100 pages per second.
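A sketch of the idea using a thread pool: URLs are first interleaved across hosts so consecutive requests hit different servers, then downloaded in parallel. fetch_page is an assumed download helper, and the thread count echoes the early-Google figure above.

from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def spread_by_host(urls):
    # Round-robin across hosts so no single server sees back-to-back requests.
    by_host = {}
    for u in urls:
        by_host.setdefault(urlparse(u).netloc, []).append(u)
    interleaved = []
    while any(by_host.values()):
        for pending in by_host.values():
            if pending:
                interleaved.append(pending.pop(0))
    return interleaved

def crawl_parallel(urls, fetch_page, n_threads=300):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(fetch_page, spread_by_host(urls)))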
Intelligent Information Retrieval 15
Directed/Focused Spidering
 Sort queue to explore more “interesting” pages first.
 Two styles of focus:
 Topic-Directed
 Link-Directed
Intelligent Information Retrieval 16
Topic-Directed Spidering
 Assume desired topic description or sample pages of
interest are given.
 Sort queue of links by the similarity (e.g. cosine metric)
of their source pages and/or anchor text to this topic
description.
 Preferentially explores pages related to a specific topic.
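A sketch of topic-directed queue ordering; vectorize is an assumed helper that turns text into a sparse term-weight vector (e.g. TF-IDF).

import heapq
import math

def cosine(v, w):
    # Cosine similarity of two sparse term vectors (dicts: term -> weight).
    dot = sum(x * w.get(t, 0.0) for t, x in v.items())
    nv = math.sqrt(sum(x * x for x in v.values()))
    nw = math.sqrt(sum(x * x for x in w.values()))
    return dot / (nv * nw) if nv and nw else 0.0

def enqueue_by_topic(frontier, links, topic_vec, vectorize):
    # links: (url, context text) pairs; context = source page and/or anchor text.
    for link, context in links:
        sim = cosine(vectorize(context), topic_vec)
        heapq.heappush(frontier, (-sim, link))   # most similar explored first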
Intelligent Information Retrieval 17
Link-Directed Spidering
 Monitor links and keep track of in-degree and out-
degree of each page encountered.
 Sort queue to prefer popular pages with many in-
coming links (authorities).
 Sort queue to prefer summary pages with many out-
going links (hubs).
Intelligent Information Retrieval 18
Keeping Spidered Pages Up to Date
 Web is very dynamic: many new pages, updated pages,
deleted pages, etc.
 Periodically check spidered pages for updates and
deletions:
 Just look at header info (e.g. META tags on last update) to
determine if the page has changed, and reload the entire page only if needed.
 Track how often each page is updated and
preferentially return to pages which are historically
more dynamic.
 Preferentially update pages that are accessed more
often to optimize freshness of more popular pages.
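One cheap way to implement the check is an HTTP conditional GET, sketched below: the server answers 304 Not Modified when the page is unchanged, so the full body is downloaded only when needed. (Using the HTTP Last-Modified mechanism here is an assumption; the slide’s META-tag check applies the same idea inside the page itself.)

import urllib.request
import urllib.error

def refresh_if_changed(url, last_fetched):
    # last_fetched: HTTP-date string, e.g. "Wed, 01 Jan 2020 00:00:00 GMT"
    req = urllib.request.Request(url, headers={"If-Modified-Since": last_fetched})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()     # page changed: re-index this content
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None            # unchanged: keep the cached copy
        raise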
Intelligent Information Retrieval 19
Quality and the WWW
The Case for Connectivity Analysis
 Basic Idea: mine hyperlink information on the Web
 Assumptions:
 links often connect related pages
 a link between pages is a “recommendation”
 Approaches
 classic IR: co-citation analysis (a.k.a. “bibliometrics”)
 connectivity-based ranking (e.g., GOOGLE)
 HITS - hypertext induced topic search
Intelligent Information Retrieval 20
Co-Citation Analysis
 Has been around since the 50’s (Small, Garfield, White & McCain)
 Used to identify core sets of
 authors, journals, articles for particular fields of study
 Main Idea:
 Find pairs of papers that are cited together by third papers
 Look for commonalities
 http://www.garfield.library.upenn.edu/papers/mapsciworld.html
Intelligent Information Retrieval 21
Co-citation analysis (From Garfield 98)
The Global Map of Science, based on co-citation clustering:
Size of the circle represents the number of papers published in the area;
Distance between circles represents the level of co-citation between the fields;
By zooming in, deeper levels in the hierarchy can be exposed.
Intelligent Information Retrieval 22
Co-citation analysis (From Garfield 98)
Zooming in on biomedicine, specialties including cardiology, immunology, etc., can be viewed.
Intelligent Information Retrieval 23
Co-citation analysis (From Garfield 98)
Intelligent Information Retrieval 24
CiteSeer: A Web Agent for Citation Analysis
(Bollacker, Lawrence, Giles - 1998)
The CiteSeer agent consists of
three main components:
(i) sub-agent to automatically
locate and acquire publications,
(ii) document parser and
database creator,
(iii) browser interface which
supports search by keyword
and browsing by citation links.
http://citeseer.ist.psu.edu/
Intelligent Information Retrieval 27
Citations vs. Links
 Web links are a bit different from citations:
 Many links are navigational.
 Many pages with high in-degree are portals not content providers.
 Not all links are endorsements.
 Company websites don’t point to their competitors.
 Citation of relevant literature is enforced by peer review.
 Authorities
 pages that are recognized as providing significant, trustworthy, and useful
information on a topic.
 In-degree (number of pointers to a page) is one simple measure of authority.
 However in-degree treats all links as equal. Should links from pages that are
themselves authoritative count more?
 Hubs
 index pages that provide lots of useful links to relevant content pages (topic
authorities).
Intelligent Information Retrieval 28
Hypertext Induced Topic Search
 Basic Idea: look for “authority” and “hub” web pages (Kleinberg
98)
 authority: definitive content for a topic
 hub: index links to good content
 The two roles tend to blend
 Procedure:
 Issue a query on a term, e.g. “java”
 Get back a set of documents
 Look at the inlink and outlink patterns for the set of retrieved documents
 Perform statistical analysis to see which patterns are the most dominant ones
 Technique was initially used in IBM’s CLEVER system
 can find some good starting points for some topics
 doesn’t solve the whole search problem!
 doesn’t make explicit use of content (so may result in “topic drift” from
original query)
Intelligent Information Retrieval 29
Hypertext Induced Topic Search
 Intuition behind the HITS algorithm
 Authority comes from in-edges
 Being a good hub comes from out-edges
 Mutually reinforcing relationship
 Better authority comes from in-edges of good hubs
 Being a better hub comes from out-edges to good authorities
(Figure: hubs on one side, authorities on the other.)
A good authority is a page that is pointed to by many good hubs.
A good hub is a page that points to many good authorities.
Together they tend to form a bipartite graph.
Intelligent Information Retrieval 30
HITS Algorithm
 Computes hubs and authorities for a particular topic
specified by a normal query.
 1. First determine a set of relevant pages for the query called the base
set (base subgraph) S.
 For a specific query Q, let the set of documents returned by a standard search engine
be called the root set R. Initialize S to R.
 Add to S all pages pointed to by any page in R.
 Add to S all pages that point to any page in R.
 Analyze the link structure of the web subgraph defined by S to find
authority and hub pages in this set.
(Figure: root set R contained within the expanded base set S.)
Intelligent Information Retrieval 31
HITS – Some Considerations
 Base Limitations
 To limit computational expense:
 Limit number of root pages to the top 200 pages retrieved for the query.
 Limit number of “back-pointer” pages to a random set of at most 50 pages
returned by a “reverse link” query.
 To eliminate purely navigational links:
 Eliminate links between two pages on the same host.
 To eliminate “non-authority-conveying” links:
 Allow only m (m ≅ 4−8) pages from a given host as pointers to any individual
page.
 Authorities and In-Degree
 Even within the base set S for a given query, the nodes with highest in-
degree are not necessarily authorities (may just be generally popular
pages like Yahoo or Amazon).
 True authority pages are pointed to by a number of hubs (i.e. pages that
point to lots of authorities).
Intelligent Information Retrieval 32
HITS: Iterative Algorithm
 Use an iterative algorithm to slowly converge on a mutually
reinforcing set of hubs and authorities.
 Maintain for each page p ∈ S:
 Authority score: ap (vector a)
 Hub score: hp (vector h)
 Initialize all ap = hp = 1
 Maintain normalized scores: Σ_{p∈S} (a_p)² = 1 and Σ_{p∈S} (h_p)² = 1
 Authorities are pointed to by lots of good hubs: a_p = Σ_{q: q→p} h_q
 Hubs point to lots of good authorities: h_p = Σ_{q: p→q} a_q
Intelligent Information Retrieval 33
Illustrated Update Rules
If pages 1, 2, and 3 each link to page 4: a_4 = h_1 + h_2 + h_3
If page 4 links to pages 5, 6, and 7: h_4 = a_5 + a_6 + a_7
Intelligent Information Retrieval 34
HITS Iterative Algorithm
Initialize for all p ∈ S: a_p = h_p = 1
For i = 1 to k:
  For all p ∈ S: a_p = Σ_{q: q→p} h_q (update auth. scores)
  For all p ∈ S: h_p = Σ_{q: p→q} a_q (update hub scores)
  For all p ∈ S: a_p = a_p / c, with c chosen so that Σ_{p∈S} (a_p / c)² = 1 (normalize a)
  For all p ∈ S: h_p = h_p / c, with c chosen so that Σ_{p∈S} (h_p / c)² = 1 (normalize h)
Intelligent Information Retrieval 35
HITS Example
(Figure: example graph over pages D, A, B, C, E.)
First iteration (vector order: D, A, C, B, E):
A: [0.0, 0.0, 2.0, 2.0, 1.0]
H: [4.0, 5.0, 0.0, 0.0, 0.0]
Norm A: [0.0, 0.0, 0.67, 0.67, 0.33]
Norm H: [0.62, 0.78, 0.0, 0.0, 0.0]
Normalize: divide each vector by its norm (the square root of the sum of the squares).
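A Python sketch of the iterative algorithm, run on an assumed edge set (D→B, D→C; A→B, A→C, A→E) that is consistent with the scores above; the original slide gives the graph only as a figure.

import math

def hits(graph, k=20):
    # graph: dict mapping each page to the set of pages it links to
    pages = set(graph) | {q for out in graph.values() for q in out}
    a = {p: 1.0 for p in pages}    # authority scores
    h = {p: 1.0 for p in pages}    # hub scores
    for _ in range(k):
        # a_p = sum of h_q over q -> p (authorities pointed to by good hubs)
        a = {p: sum(h[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # h_p = sum of a_q over p -> q (hubs point to good authorities)
        h = {p: sum(a[q] for q in graph.get(p, ())) for p in pages}
        na = math.sqrt(sum(x * x for x in a.values())) or 1.0
        nh = math.sqrt(sum(x * x for x in h.values())) or 1.0
        a = {p: x / na for p, x in a.items()}   # normalize: sum of squares = 1
        h = {p: x / nh for p, x in h.items()}
    return a, h

a, h = hits({'D': {'B', 'C'}, 'A': {'B', 'C', 'E'}}, k=1)
print(a)   # B, C ≈ 0.67; E ≈ 0.33 (matches Norm A above)
print(h)   # D ≈ 0.62; A ≈ 0.78 (matches Norm H above)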
Intelligent Information Retrieval 36
HITS Algorithm
 Let HUB[v] and AUTH[v] represent the hub and
authority values associated with a vertex v
 Repeat until HUB and AUTH vectors converge
 Normalize HUB and AUTH
 HUB[v] := Σ AUTH[ui] for all ui with Edge(v, ui)
 AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v)
(Figure: vertex v with out-edges to u1 … uk and in-edges from w1 … wk.)
Intelligent Information Retrieval 37
Convergence
 Algorithm converges to a fixed point if iterated
indefinitely.
 Define A to be the adjacency matrix for the subgraph
defined by S.
 Aij = 1 for i ∈ S, j ∈ S iff i→j
 Authority vector, a, converges to the principal eigenvector of AᵀA
 Hub vector, h, converges to the principal eigenvector of AAᵀ
 In practice, 20 iterations produce fairly stable results.
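The convergence claim can be checked numerically with numpy on the same assumed example graph: the authority scores the iteration settles on match the principal eigenvector of AᵀA.

import numpy as np

# Adjacency matrix over pages [D, A, B, C, E] (same assumed graph as above).
A = np.array([[0, 0, 1, 1, 0],     # D -> B, C
              [0, 0, 1, 1, 1],     # A -> B, C, E
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]], dtype=float)

w, V = np.linalg.eigh(A.T @ A)     # A^T A is symmetric; eigenvalues ascending
principal = np.abs(V[:, -1])       # eigenvector of the largest eigenvalue
print(principal)                   # ~ the converged authority vector a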
Intelligent Information Retrieval 38
HITS Results
 Authorities for query: “Java”
 java.sun.com
 comp.lang.java FAQ
 Authorities for query “search engine”
 Yahoo.com
 Excite.com
 Lycos.com
 Altavista.com
 Authorities for query “Gates”
 Microsoft.com
 roadahead.com
In most cases, the final authorities were not in the initial root set generated using
Altavista. Authorities were brought in from linked and reverse-linked pages and
then HITS computed their high authority score.
Intelligent Information Retrieval 39
HITS: Other Applications
 Finding Similar Pages Using
Link Structure
 Given a page, P, let R (the root set)
be t (e.g. 200) pages that point to P.
 Grow a base set S from R.
 Run HITS on S.
 Return the best authorities in S as the
best similar-pages for P.
 Finds authorities in the “link neighborhood” of P.
Similar Pages to “honda.com”:
- toyota.com
- ford.com
- bmwusa.com
- saturncars.com
- nissanmotors.com
- audi.com
- volvocars.com
Intelligent Information Retrieval 40
HITS: Other Applications
 HITS for Clustering
 An ambiguous query can result in the principal eigenvector only
covering one of the possible meanings.
 Non-principal eigenvectors may contain hubs & authorities for
other meanings.
 Example: “jaguar”:
 Atari video game (principal eigenvector)
 NFL Football team (2nd non-principal eigenvector)
 Automobile (3rd non-principal eigenvector)
 An application of Principal Component Analysis (PCA)
Intelligent Information Retrieval 41
HITS: Problems and Solutions
 Some edges are wrong (not “recommendations”)
 multiple edges from the same author
 automatically generated
 spam
Solution: weight edges to limit influence
 Topic Drift
 Query: jaguar AND cars
 Result: pages about cars in general
Solution: analyze content and assign topic scores to
nodes
Intelligent Information Retrieval 42
Modified HITS Algorithm
 Let HUB[v] and AUTH[v] represent the hub and
authority values associated with a vertex v
 Repeat until HUB and AUTH vectors converge
 Normalize HUB and AUTH
 HUB[v] := Σ AUTH[ui] · TopicScore[ui] · Weight(v, ui)
for all ui with Edge(v, ui)
 AUTH[v] := Σ HUB[wi] · TopicScore[wi] · Weight(wi, v)
for all wi with Edge(wi, v)
 Topic score is determined by a similarity measure
between the query and the documents
Intelligent Information Retrieval 43
PageRank
 Alternative link-analysis method used by Google (Brin
& Page, 1998).
 Does not attempt to capture the distinction between
hubs and authorities.
 Ranks pages just by authority.
 Applied to the entire Web rather than a local
neighborhood of pages surrounding the results of a
query.
Intelligent Information Retrieval 44
Initial PageRank Idea
 Just measuring in-degree (citation count) doesn’t account
for the authority of the source of a link.
 Initial page rank equation for page p:
R(p) = c · Σ_{q: q→p} R(q) / N_q
 N_q is the total number of out-links from page q.
 A page, q, “gives” an equal fraction of its authority to all the pages it
points to (e.g. p).
 c is a normalizing constant set so that the rank of all pages always
sums to 1.
Intelligent Information Retrieval 45
Initial PageRank Idea
 Can view it as a process of PageRank “flowing” from
pages to the pages they cite.
(Figure: example rank values, e.g. 0.1, 0.09, 0.05, flowing along links from pages to the pages they cite.)
Intelligent Information Retrieval 46
Initial PageRank Algorithm
 Iterate rank-flowing process until convergence:
Let S be the total set of pages.
Initialize ∀p∈S: R(p) = 1/|S|
Until ranks do not change (much) (convergence):
  For each p∈S: R′(p) = Σ_{q: q→p} R(q) / N_q
  c = 1 / Σ_{p∈S} R′(p)
  For each p∈S: R(p) = cR′(p) (normalize)
Intelligent Information Retrieval 47
Sample Stable Fixpoint
(Figure: a small web graph whose ranks, e.g. 0.4, 0.4, 0.2, 0.2, are unchanged by further iterations.)
Intelligent Information Retrieval 48
Linear Algebra Version
 Treat R as a vector over web pages.
 Let A be a 2-d matrix over pages where
 A_vu = 1/N_u if u→v, else A_vu = 0
 Then R = cAR
 R converges to the principal eigenvector of A.
Intelligent Information Retrieval 49
Problem with Initial Idea
 A group of pages that only point to themselves but are
pointed to by other pages acts as a “rank sink” and
absorbs all the rank in the system.
 Solution: Rank Source
 Introduce a “rank source” E that continually replenishes the rank
of each page, p, by a fixed amount E(p):
R(p) = c · ( Σ_{q: q→p} R(q) / N_q + E(p) )
Intelligent Information Retrieval 50
PageRank Algorithm
Let S be the total set of pages.
Let ∀p∈S: E(p) = α/|S| (for some 0<α<1, e.g. 0.15)
Initialize ∀p∈S: R(p) = 1/|S|
Until ranks do not change (much) (convergence):
  For each p∈S: R′(p) = Σ_{q: q→p} R(q) / N_q + E(p)
  c = 1 / Σ_{p∈S} R′(p)
  For each p∈S: R(p) = cR′(p) (normalize)
Intelligent Information Retrieval
PageRank Example
(Figure: three pages with links A→B, A→C, and B→C.)
α = 0.3
Initial R (order A, C, B): [0.33, 0.33, 0.33]
First iteration, before normalization:
R′(A) = 0.3/3 = 0.1
R′(B) = R(A)/2 + 0.3/3 ≈ 0.27
R′(C) = R(A)/2 + R(B)/1 + 0.3/3 ≈ 0.595
R′ (order A, C, B): [0.1, 0.595, 0.27]
After normalization, with normalization factor 1/[R′(A)+R′(B)+R′(C)] = 1/0.965:
R (order A, C, B): [0.104, 0.617, 0.28]
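A Python sketch of the PageRank algorithm above, run on the example graph (assumed edges A→B, A→C, B→C, consistent with the computations shown); the first iteration reproduces the slide’s numbers up to its rounding of intermediates.

def pagerank(graph, alpha=0.3, iters=50):
    # graph: dict mapping each page to the set of pages it links to
    pages = set(graph) | {q for out in graph.values() for q in out}
    n = len(pages)
    E = {p: alpha / n for p in pages}          # rank source
    R = {p: 1.0 / n for p in pages}            # initial ranks, 1/|S|
    for _ in range(iters):
        Rp = {p: E[p] + sum(R[q] / len(out)    # each q gives R(q)/N_q to p
                            for q, out in graph.items() if p in out)
              for p in pages}
        c = 1.0 / sum(Rp.values())             # normalize so ranks sum to 1
        R = {p: c * r for p, r in Rp.items()}
    return R

print(pagerank({'A': {'B', 'C'}, 'B': {'C'}}, alpha=0.3, iters=1))
# ≈ {'A': 0.103, 'B': 0.276, 'C': 0.621}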
Intelligent Information Retrieval 52
Random Surfer Model
 PageRank can be seen as modeling a “random surfer”
that starts on a random page and then at each point:
 With probability E(p) randomly jumps to page p.
 Otherwise, randomly follows a link on the current page.
 R(p) models the probability that this random surfer will
be on page p at any given time.
 “E jumps” are needed to prevent the random surfer
from getting “trapped” in web sinks with no outgoing
links.
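The interpretation can be tested empirically with a short Monte Carlo sketch (same assumed example graph): visit frequencies of the simulated surfer approximate R(p).

import random
from collections import Counter

def random_surf(graph, pages, alpha=0.3, steps=100_000):
    visits = Counter()
    p = random.choice(pages)
    for _ in range(steps):
        visits[p] += 1
        out = graph.get(p)
        if not out or random.random() < alpha:
            p = random.choice(pages)           # "E jump" (also escapes sinks)
        else:
            p = random.choice(sorted(out))     # follow a random out-link
    return {q: visits[q] / steps for q in pages}

print(random_surf({'A': {'B', 'C'}, 'B': {'C'}}, ['A', 'B', 'C']))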
Intelligent Information Retrieval 53
Speed of Convergence
 Early experiments on Google used 322 million links.
 PageRank algorithm converged (within small
tolerance) in about 52 iterations.
 Number of iterations required for convergence is
empirically O(log n) (where n is the number of links).
 Therefore calculation is quite efficient.
Intelligent Information Retrieval 54
Google Ranking
 The complete Google ranking (as described in university
publications prior to commercialization) includes:
 Vector-space similarity component.
 Keyword proximity component.
 HTML-tag weight component (e.g. title preference).
 PageRank component.
 Details of current commercial ranking functions are trade
secrets.
Intelligent Information Retrieval 55
Personalized PageRank
 PageRank can be biased (personalized) by changing E
to a non-uniform distribution.
 Restrict “random jumps” to a set of specified relevant
pages.
 For example, let E(p) = 0 except for one’s own home
page, for which E(p) = α
 This results in a bias towards pages that are closer in
the web graph to your own homepage.
 Similar personalization can be achieved by setting E(p)
for only pages p that are part of the user’s profile.
Intelligent Information Retrieval 56
PageRank-Biased Spidering
 Use PageRank to direct (focus) a spider on
“important” pages.
 Compute page-rank using the current set of crawled
pages.
 Order the spider’s search queue based on current
estimated PageRank.
Intelligent Information Retrieval 57
Link Analysis Conclusions
 Link analysis uses information about the
structure of the web graph to aid search.
 It is one of the major innovations in web search.
 It is the primary reason for Google’s success.
Intelligent Information Retrieval 58
Behavior-Based Ranking
 The emergence of large-scale search engines allows
mining of aggregate user behavior to improve
ranking.
 Basic Idea:
 For each query Q, keep track of which docs in the results are
clicked on
 On subsequent requests for Q, re-order docs in results based on
click-throughs.
 Relevance assessment based on
 Behavior/usage
 vs. content
Intelligent Information Retrieval 59
Query-doc popularity matrix B
(Figure: matrix B with queries as rows and docs as columns.)
B_qj = number of times doc j was clicked through on query q.
When query q is issued again, order docs by their B_qj values.
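A minimal sketch of the matrix B as a nested dict of click-through counts, with the re-ranking step it supports:

from collections import defaultdict

B = defaultdict(lambda: defaultdict(int))   # B[q][j]: clicks on doc j for query q

def record_click(query, doc):
    B[query][doc] += 1

def rerank(query, results):
    # On a repeat of query q, order docs by their B_qj values.
    return sorted(results, key=lambda doc: B[query][doc], reverse=True)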
Intelligent Information Retrieval 60
Vector space implementation
 Maintain a term-doc popularity matrix C
 as opposed to query-doc popularity
 initialized to all zeros
 Each column represents a doc j
 If doc j is clicked on for query q, update C_j ← C_j + ε·q (here q is
viewed as a vector).
 On a new query q′, compute its cosine proximity to C_j for all j.
 Combine this with the regular text score.
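A numpy sketch of the term-doc popularity matrix C; the vocabulary size, document count, and ε value are assumptions, and queries are represented as dense term vectors for simplicity.

import numpy as np

n_terms, n_docs = 1000, 500
C = np.zeros((n_terms, n_docs))        # term-doc popularity matrix, all zeros
eps = 0.1

def on_click(q_vec, j):
    C[:, j] += eps * q_vec             # C_j <- C_j + eps * q

def popularity_scores(q_vec):
    # Cosine proximity of the query to every column C_j (0 for empty columns).
    norms = np.linalg.norm(C, axis=0) * np.linalg.norm(q_vec)
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(norms > 0, (q_vec @ C) / norms, 0.0)

q = np.zeros(n_terms); q[3] = q[7] = 1.0   # toy query over two terms
on_click(q, j=42)
print(popularity_scores(q)[42])            # 1.0: doc 42 now matches this query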
Intelligent Information Retrieval 61
Issues
 Normalization of Cj after updating
 Assumption of query compositionality
 “white house” document popularity derived from “white” and
“house”
 Updating - live or batch?
 Basic assumption:
 Relevance can be directly measured by the number of click-throughs
 Valid?

More Related Content

What's hot

Cours1 maîtriser un_système _dinformation_documentaire
Cours1 maîtriser un_système _dinformation_documentaireCours1 maîtriser un_système _dinformation_documentaire
Cours1 maîtriser un_système _dinformation_documentaireEvelyne Brémond
 
محركات البحث
محركات البحثمحركات البحث
محركات البحثEyas Shrif
 
Cours4.3 recherche documentaire-notions-info-et-doc
Cours4.3 recherche documentaire-notions-info-et-docCours4.3 recherche documentaire-notions-info-et-doc
Cours4.3 recherche documentaire-notions-info-et-docEvelyne Brémond
 
Biographical Reference Sources
Biographical Reference SourcesBiographical Reference Sources
Biographical Reference Sourcesmkwalsh55
 
Veille 2.0
Veille 2.0Veille 2.0
Veille 2.0Zebrure
 
Comment faire la veille sur internet outils et techniques
Comment faire la veille sur internet  outils et techniquesComment faire la veille sur internet  outils et techniques
Comment faire la veille sur internet outils et techniquesArnaud Dapa
 
أحدث 10 تقنيات فى مجال المكتبات
أحدث 10 تقنيات فى مجال المكتباتأحدث 10 تقنيات فى مجال المكتبات
أحدث 10 تقنيات فى مجال المكتباتAhmed Al-ajamy
 
Academic social networking sites
Academic social networking sitesAcademic social networking sites
Academic social networking sitesKaty Jordan
 

What's hot (12)

Cours1 maîtriser un_système _dinformation_documentaire
Cours1 maîtriser un_système _dinformation_documentaireCours1 maîtriser un_système _dinformation_documentaire
Cours1 maîtriser un_système _dinformation_documentaire
 
محركات البحث
محركات البحثمحركات البحث
محركات البحث
 
Chapter8
Chapter8Chapter8
Chapter8
 
Cours4.3 recherche documentaire-notions-info-et-doc
Cours4.3 recherche documentaire-notions-info-et-docCours4.3 recherche documentaire-notions-info-et-doc
Cours4.3 recherche documentaire-notions-info-et-doc
 
تحليل البيانات وتفسير المعطيات
تحليل البيانات وتفسير المعطياتتحليل البيانات وتفسير المعطيات
تحليل البيانات وتفسير المعطيات
 
Biographical Reference Sources
Biographical Reference SourcesBiographical Reference Sources
Biographical Reference Sources
 
Data Mining
Data MiningData Mining
Data Mining
 
Veille 2.0
Veille 2.0Veille 2.0
Veille 2.0
 
Scientometrics class
Scientometrics classScientometrics class
Scientometrics class
 
Comment faire la veille sur internet outils et techniques
Comment faire la veille sur internet  outils et techniquesComment faire la veille sur internet  outils et techniques
Comment faire la veille sur internet outils et techniques
 
أحدث 10 تقنيات فى مجال المكتبات
أحدث 10 تقنيات فى مجال المكتباتأحدث 10 تقنيات فى مجال المكتبات
أحدث 10 تقنيات فى مجال المكتبات
 
Academic social networking sites
Academic social networking sitesAcademic social networking sites
Academic social networking sites
 

Similar to Information retrieval

Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvestingpaperpublications3
 
Ref22: Searchers Academy 2.0 Redux
Ref22: Searchers Academy 2.0 ReduxRef22: Searchers Academy 2.0 Redux
Ref22: Searchers Academy 2.0 ReduxAhniwa Ferrari
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniquesTola Odugbesan
 
Search Engine working, Crawlers working, Search Engine mechanism
Search Engine working, Crawlers working, Search Engine mechanismSearch Engine working, Crawlers working, Search Engine mechanism
Search Engine working, Crawlers working, Search Engine mechanismUmang MIshra
 
Understanding Seo At A Glance
Understanding Seo At A GlanceUnderstanding Seo At A Glance
Understanding Seo At A Glancepoojagupta267
 
การค้นหาสารสนเทศจาก WWW (ต่อ)
การค้นหาสารสนเทศจาก WWW (ต่อ)การค้นหาสารสนเทศจาก WWW (ต่อ)
การค้นหาสารสนเทศจาก WWW (ต่อ)Srion Janeprapapong
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineArjen de Vries
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...butest
 
Tolmachev Alexander Web Search Engines
Tolmachev Alexander Web Search EnginesTolmachev Alexander Web Search Engines
Tolmachev Alexander Web Search EnginesAlexanderTolmachev
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawlerRishikesh Pathak
 
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Rana Jayant
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the webVan-Duyet Le
 

Similar to Information retrieval (20)

Smart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web HarvestingSmart Crawler for Efficient Deep-Web Harvesting
Smart Crawler for Efficient Deep-Web Harvesting
 
Web mining
Web miningWeb mining
Web mining
 
Ref22: Searchers Academy 2.0 Redux
Ref22: Searchers Academy 2.0 ReduxRef22: Searchers Academy 2.0 Redux
Ref22: Searchers Academy 2.0 Redux
 
Internet browsing techniques
Internet browsing techniquesInternet browsing techniques
Internet browsing techniques
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Search Engine working, Crawlers working, Search Engine mechanism
Search Engine working, Crawlers working, Search Engine mechanismSearch Engine working, Crawlers working, Search Engine mechanism
Search Engine working, Crawlers working, Search Engine mechanism
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Understanding Seo At A Glance
Understanding Seo At A GlanceUnderstanding Seo At A Glance
Understanding Seo At A Glance
 
Search Engine
Search Engine Search Engine
Search Engine
 
การค้นหาสารสนเทศจาก WWW (ต่อ)
การค้นหาสารสนเทศจาก WWW (ต่อ)การค้นหาสารสนเทศจาก WWW (ต่อ)
การค้นหาสารสนเทศจาก WWW (ต่อ)
 
Web Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search EngineWeb Archives and the dream of the Personal Search Engine
Web Archives and the dream of the Personal Search Engine
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...A machine learning approach to web page filtering using ...
A machine learning approach to web page filtering using ...
 
Tolmachev Alexander Web Search Engines
Tolmachev Alexander Web Search EnginesTolmachev Alexander Web Search Engines
Tolmachev Alexander Web Search Engines
 
Search engines
Search enginesSearch engines
Search engines
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Smart crawler a two stage crawler
Smart crawler a two stage crawlerSmart crawler a two stage crawler
Smart crawler a two stage crawler
 
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
Smart Crawler Base Paper A two stage crawler for efficiently harvesting deep-...
 
[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web[LvDuit//Lab] Crawling the web
[LvDuit//Lab] Crawling the web
 

More from Luis Goldster

Ruby on rails evaluation
Ruby on rails evaluationRuby on rails evaluation
Ruby on rails evaluationLuis Goldster
 
Ado.net &amp; data persistence frameworks
Ado.net &amp; data persistence frameworksAdo.net &amp; data persistence frameworks
Ado.net &amp; data persistence frameworksLuis Goldster
 
Multithreading models.ppt
Multithreading models.pptMultithreading models.ppt
Multithreading models.pptLuis Goldster
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data miningLuis Goldster
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data miningLuis Goldster
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discoveryLuis Goldster
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherenceLuis Goldster
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cacheLuis Goldster
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching worksLuis Goldster
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsLuis Goldster
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysisLuis Goldster
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with javaLuis Goldster
 

More from Luis Goldster (20)

Ruby on rails evaluation
Ruby on rails evaluationRuby on rails evaluation
Ruby on rails evaluation
 
Design patterns
Design patternsDesign patterns
Design patterns
 
Lisp and scheme i
Lisp and scheme iLisp and scheme i
Lisp and scheme i
 
Ado.net &amp; data persistence frameworks
Ado.net &amp; data persistence frameworksAdo.net &amp; data persistence frameworks
Ado.net &amp; data persistence frameworks
 
Multithreading models.ppt
Multithreading models.pptMultithreading models.ppt
Multithreading models.ppt
 
Business analytics and data mining
Business analytics and data miningBusiness analytics and data mining
Business analytics and data mining
 
Big picture of data mining
Big picture of data miningBig picture of data mining
Big picture of data mining
 
Data mining and knowledge discovery
Data mining and knowledge discoveryData mining and knowledge discovery
Data mining and knowledge discovery
 
Cache recap
Cache recapCache recap
Cache recap
 
Directory based cache coherence
Directory based cache coherenceDirectory based cache coherence
Directory based cache coherence
 
Hardware managed cache
Hardware managed cacheHardware managed cache
Hardware managed cache
 
How analysis services caching works
How analysis services caching worksHow analysis services caching works
How analysis services caching works
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Optimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessorsOptimizing shared caches in chip multiprocessors
Optimizing shared caches in chip multiprocessors
 
Api crash
Api crashApi crash
Api crash
 
Object model
Object modelObject model
Object model
 
Abstraction file
Abstraction fileAbstraction file
Abstraction file
 
Object oriented analysis
Object oriented analysisObject oriented analysis
Object oriented analysis
 
Abstract class
Abstract classAbstract class
Abstract class
 
Concurrency with java
Concurrency with javaConcurrency with java
Concurrency with java
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

Information retrieval

  • 1. Intelligent IR on the World Wide Web CSC 575 Intelligent Information Retrieval
  • 2. Intelligent Information Retrieval 2 Intelligent IR on the World Wide Web  Web IR versus Classic IR  Web Spiders and Crawlers  Citation/hyperlink Indexing and Analysis  Intelligent Agents for the Web
  • 3. Intelligent Information Retrieval 3 IR on the Web vs. Classsic IR  Input: publicly accessible Web  Goal: retrieve high quality pages that are relevant to user’s need  static (text, audio, images, etc.)  dynamically generated (mostly database access)  What’s different about the Web:  large volume  distributed data  Heterogeneity of the data  lack of stability  high duplication  high linkage  lack of quality standard
  • 4. Intelligent Information Retrieval 4 Search Engine Early History  In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”)  Assembled lists of files available on many FTP servers.  Allowed regex search of these file names.  In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.  In 1993, early Web robots (spiders) were built to collect URL’s:  Wanderer  ALIWEB (Archie-Like Index of the WEB)  WWW Worm (indexed URL’s and titles for regex search)  In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.
  • 5. Intelligent Information Retrieval 5 Search Engine Early History  In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash.  Eventually became part of Excite and AOL  A few months later, Fuzzy Maudlin, a grad student at CMU developed Lycos  First to use a standard IR system  First to index a large set of pages  In late 1995, DEC developed Altavista  Used a large farm of Alpha machines to quickly process large numbers of queries  Supported Boolean operators, phrases in queries.  In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google  Main advance was use of link analysis to rank results partially based on authority.
  • 6. Intelligent Information Retrieval 6 Web Search Query String IR System Ranked Documents 1. Page1 2. Page2 3. Page3 . . Document corpus Web Spider
  • 7. Intelligent Information Retrieval 7 Spiders (Robots/Bots/Crawlers)  Start with a comprehensive set of root URL’s from which to start the search.  Follow all links on these pages recursively to find additional pages.  Index all novel found pages in an inverted index as they are encountered.  May allow users to directly submit pages to be indexed (and crawled from).
  • 8. Intelligent Information Retrieval 8 Search Strategy Trade-Off’s  Breadth-first search strategy explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). Standard spidering method.  Depth-first search requires memory of only depth times branching-factor (linear in depth) but gets “lost” pursuing a single thread.  Both strategies implementable using a queue of links (URL’s).
  • 9. Intelligent Information Retrieval 9 Avoiding Page Duplication  Must detect when revisiting a page that has already been spidered (web is a graph not a tree).  Must efficiently index visited pages to allow rapid recognition test.  Tree indexing (e.g. trie)  Hashtable  Index page using URL as a key.  Must canonicalize URL’s (e.g. delete ending “/”)  Not detect duplicated or mirrored pages.  Index page using textual content as a key.  Requires first downloading page.
  • 10. Intelligent Information Retrieval 10 Spidering Algorithm Initialize queue (Q) with initial set of known URL’s. Until Q empty or page or time limit exhausted: Pop URL, L, from front of Q. If L is not an HTML page (.gif, .jpeg, .ps, .pdf, .ppt…) continue loop. If already visited L, continue loop. Download page, P, for L. If cannot download P (e.g. 404 error, robot excluded) continue loop. Index P (e.g. add to inverted index or store cached copy). Parse P to obtain list of new links N. Append N to the end of Q.
  • 11. Intelligent Information Retrieval 11 Queueing Strategy  How new links added to the queue determines search strategy.  FIFO (append to end of Q)  gives breadth-first search.  LIFO (add to front of Q)  gives depth-first search.  Heuristically ordering the Q gives a “focused crawler” that directs its search towards “interesting” pages.  May be able to use standard AI search algorithms such as Best- first search, A*, etc.
  • 12. Intelligent Information Retrieval 12 Restricting Spidering  Restrict spider to a particular site.  Remove links to other sites from Q.  Restrict spider to a particular directory.  Remove links not in the specified directory.  Obey page-owner restrictions  robot exclusion protocol
  • 13. Intelligent Information Retrieval 13 Anchor Text Indexing  Extract anchor text (between <a> and </a>) of each link:  Anchor text is usually descriptive of the document to which it points.  Add anchor text to the content of the destination page to provide additional relevant keyword indices.  Used by Google:  <a href=“http://www.microsoft.com”>Evil Empire</a>  <a href=“http://www.ibm.com”>IBM</a>  Helps when descriptive text in destination page is embedded in image logos rather than in accessible text.  Many times anchor text is not useful:  “click here”  Increases content more for popular pages with many in- coming links, increasing recall of these pages.  May even give higher weights to tokens from anchor text.
  • 14. Intelligent Information Retrieval 14 Multi-Threaded Spidering  Bottleneck is network delay in downloading individual pages.  Best to have multiple threads running in parallel each requesting a page from a different host.  Distribute URL’s to threads to guarantee equitable distribution of requests across different hosts to maximize through-put and avoid overloading any single server.  Early Google spider had multiple coordinated crawlers with about 300 threads each, together able to download over 100 pages per second.
  • 15. Intelligent Information Retrieval 15 Directed/Focused Spidering  Sort queue to explore more “interesting” pages first.  Two styles of focus:  Topic-Directed  Link-Directed
  • 16. Intelligent Information Retrieval 16 Topic-Directed Spidering  Assume desired topic description or sample pages of interest are given.  Sort queue of links by the similarity (e.g. cosine metric) of their source pages and/or anchor text to this topic description.  Preferentially explores pages related to a specific topic.
  • 17. Intelligent Information Retrieval 17 Link-Directed Spidering  Monitor links and keep track of in-degree and out- degree of each page encountered.  Sort queue to prefer popular pages with many in- coming links (authorities).  Sort queue to prefer summary pages with many out- going links (hubs).
  • 18. Intelligent Information Retrieval 18 Keeping Spidered Pages Up to Date  Web is very dynamic: many new pages, updated pages, deleted pages, etc.  Periodically check spidered pages for updates and deletions:  Just look at header info (e.g. META tags on last update) to determine if page has changed, only reload entire page if needed.  Track how often each page is updated and preferentially return to pages which are historically more dynamic.  Preferentially update pages that are accessed more often to optimize freshness of more popular pages.
  • 19. Intelligent Information Retrieval 19 Quality and the WWW The Case for Connectivity Analysis  Basic Idea: mine hyperlink information on the Web  Assumptions:  links often connect related pages  a link between pages is a “recommendation”  Approaches  classic IR: co-citation analysis (a.k.a. “bibliometrics”)  connectivity-based ranking (e.g., GOOGLE)  HITS - hypertext induced topic search
  • 20. Intelligent Information Retrieval 20 Co-Citation Analysis  Has been around since the 50’s (Small, Garfield, White & McCain)  Used to identify core sets of  authors, journals, articles for particular fields of study  Main Idea:  Find pairs of papers that cite third papers  Look for commonalities  http://www.garfield.library.upenn.edu/papers/mapsciworld.html
  • 21. Intelligent Information Retrieval 21 Co-citation analysis (From Garfield 98) The Global Map of Science, based on co- citation clustering: Size of the circle represents number of papers published in the area; Distance between circles represents the level of co-citation between the fields; By zooming in, deeper levels in the hierarchy can be exposed. The Global Map of Science, based on co- citation clustering: Size of the circle represents number of papers published in the area; Distance between circles represents the level of co-citation between the fields; By zooming in, deeper levels in the hierarchy can be exposed.
  • 22. Intelligent Information Retrieval 22 Co-citation analysis (From Garfield 98) Zooming in on biomedicine, specialties including cardiology, immunology, etc., can be viewed. Zooming in on biomedicine, specialties including cardiology, immunology, etc., can be viewed.
  • 23. Intelligent Information Retrieval 23 Co-citation analysis (From Garfield 98)
  • 24. Intelligent Information Retrieval 24 CiteSeer: A Web Agent for Citation Analysis (Bollacker, Lawrence, Giles - 1998) The CiteSeer agent consists of three main components: (i) sub-agent to automatically locate and acquire publications, (ii) document parser and database creator, (iii) browser interface which supports search by keyword and browsing by citation links. http://citeseer.ist.psu.edu/
  • 25. CiteSeer: A Web Agent for Citation Analysis
  • 26. CiteSeer: A Web Agent for Citation Analysis
  • 27. Intelligent Information Retrieval 27 Citations vs. Links  Web links are a bit different than citations:  Many links are navigational.  Many pages with high in-degree are portals not content providers.  Not all links are endorsements.  Company websites don’t point to their competitors.  Citations to relevant literature is enforced by peer-review.  Authorities  pages that are recognized as providing significant, trustworthy, and useful information on a topic.  In-degree (number of pointers to a page) is one simple measure of authority.  However in-degree treats all links as equal. Should links from pages that are themselves authoritative count more?  Hubs  index pages that provide lots of useful links to relevant content pages (topic authorities).
  • 28. Intelligent Information Retrieval 28 Hypertext Induced Topic Search  Basic Idea: look for “authority” and “hub” web pages (Kleinberg 98)  authority: definitive content for a topic  hub: index links to good content  The two distinctions tend to blend  Procedure:  Issue a query on a term, e.g. “java”  Get back a set of documents  Look at the inlink and outlink patterns for the set of retrieved documents  Perform statistical analysis to see which patterns are the most dominant ones  Technique was initially used in IBM’s CLEVER system  can find some good starting points for some topics  doesn’t solve the whole search problem!  doesn’t make explicit use of content (so may result in “topic drift” from original query)
  • 29. Intelligent Information Retrieval 29 Hypertext Induced Topic Search  Intuition behind the HITS algorithm:  Authority comes from in-edges; being a good hub comes from out-edges  Mutually reinforcing relationship:  Better authority comes from in-edges of good hubs  Being a better hub comes from out-edges to good authorities  A good authority is a page that is pointed to by many good hubs. A good hub is a page that points to many good authorities. Together they tend to form a bipartite graph.
  • 30. Intelligent Information Retrieval 30 HITS Algorithm  Computes hubs and authorities for a particular topic specified by a normal query.  1. First determine a set of relevant pages for the query, called the base set (base subgraph) S:  For a specific query Q, let the set of documents returned by a standard search engine be called the root set R. Initialize S to R.  Add to S all pages pointed to by any page in R.  Add to S all pages that point to any page in R.  2. Analyze the link structure of the web subgraph defined by S to find authority and hub pages in this set (R ⊂ S; a construction sketch follows the next slide).
  • 31. Intelligent Information Retrieval 31 HITS – Some Considerations  Base Set Limitations  To limit computational expense:  Limit the number of root pages to the top 200 pages retrieved for the query.  Limit the number of “back-pointer” pages to a random set of at most 50 pages returned by a “reverse link” query.  To eliminate purely navigational links:  Eliminate links between two pages on the same host.  To eliminate “non-authority-conveying” links:  Allow only m (m ≅ 4−8) pages from a given host as pointers to any individual page.  Authorities and In-Degree  Even within the base set S for a given query, the nodes with highest in-degree are not necessarily authorities (they may just be generally popular pages like Yahoo or Amazon).  True authority pages are pointed to by a number of hubs (i.e. pages that point to lots of authorities).
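The construction and the limits above can be combined in a short sketch. Here search_engine(), outlinks(), and inlinks() are hypothetical stand-ins for a real search API and a link index, not functions from any actual library:

```python
# A minimal sketch of HITS base-set construction, assuming hypothetical
# helpers: search_engine(query) -> ranked URLs, and outlinks(p) / inlinks(p)
# -> lists of URLs linked from / linking to page p.

def build_base_set(query, search_engine, outlinks, inlinks,
                   max_root=200, max_back=50):
    """Grow the base set S from the root set R for a query."""
    root = list(search_engine(query))[:max_root]   # root set R (top pages)
    base = set(root)                               # initialize S to R
    for page in root:
        base.update(outlinks(page))                # pages pointed to by R
        # the slide samples back-pointers at random; simple truncation here
        base.update(inlinks(page)[:max_back])
    return base
```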
  • 32. Intelligent Information Retrieval 32 HITS: Iterative Algorithm  Use an iterative algorithm to slowly converge on a mutually reinforcing set of hubs and authorities.  Maintain for each page p ∈ S:  Authority score: $a_p$ (vector a)  Hub score: $h_p$ (vector h)  Initialize all $a_p = h_p = 1$  Maintain normalized scores: $\sum_{p \in S} a_p^2 = 1$ and $\sum_{p \in S} h_p^2 = 1$  Authorities are pointed to by lots of good hubs: $a_p = \sum_{q:\, q \to p} h_q$  Hubs point to lots of good authorities: $h_p = \sum_{q:\, p \to q} a_q$
  • 33. Intelligent Information Retrieval 33 Illustrated Update Rules (figure: pages 1, 2, 3 point to page 4, which points to pages 5, 6, 7)  $a_4 = h_1 + h_2 + h_3$  $h_4 = a_5 + a_6 + a_7$
  • 34. Intelligent Information Retrieval 34 HITS Iterative Algorithm  Initialize for all p ∈ S: $a_p = h_p = 1$  For i = 1 to k:  For all p ∈ S: $a_p = \sum_{q:\, q \to p} h_q$ (update auth. scores)  For all p ∈ S: $h_p = \sum_{q:\, p \to q} a_q$ (update hub scores)  For all p ∈ S: $a_p = a_p / c$, with c such that $\sum_{p \in S} (a_p / c)^2 = 1$ (normalize a)  For all p ∈ S: $h_p = h_p / c$, with c such that $\sum_{p \in S} (h_p / c)^2 = 1$ (normalize h)
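As a concrete (if simplified) rendering of this loop, here is a dictionary-based sketch; the adjacency structure and names are illustrative, not from any particular system:

```python
import math
from collections import defaultdict

def hits(pages, out_edges, k=20):
    """Sketch of the HITS iteration on a base set S.

    pages: page ids in S; out_edges: page -> set of pages it links to in S.
    """
    pages = list(pages)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    in_edges = defaultdict(set)                  # reverse the link structure
    for p, targets in out_edges.items():
        for q in targets:
            in_edges[q].add(p)
    for _ in range(k):
        # a_p = sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in in_edges[p]) for p in pages}
        # h_p = sum of (new) authority scores of pages p points to
        hub = {p: sum(auth[q] for q in out_edges.get(p, ())) for p in pages}
        for vec in (auth, hub):                  # normalize: sum of squares = 1
            norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
            for p in vec:
                vec[p] /= norm
    return auth, hub
```

Updating hubs from the freshly computed authority scores follows the usual presentation of HITS; using the previous iteration's authority values instead would also converge to the same fixed point.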
  • 35. Intelligent Information Retrieval 35 HITS Example (figure: a five-page graph; scores listed in order D, A, C, B, E) First iteration:  A: [0.0, 0.0, 2.0, 2.0, 1.0]  H: [4.0, 5.0, 0.0, 0.0, 0.0]  Norm A: [0.0, 0.0, 0.67, 0.67, 0.33]  Norm H: [0.62, 0.78, 0.0, 0.0, 0.0]  Normalize: divide each vector by its norm (square root of the sum of the squares)
  • 36. Intelligent Information Retrieval 36 HITS Algorithm  Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v  Repeat until the HUB and AUTH vectors converge:  Normalize HUB and AUTH  HUB[v] := Σ AUTH[ui] for all ui with Edge(v, ui)  AUTH[v] := Σ HUB[wi] for all wi with Edge(wi, v)  (figure: v points to pages u1…uk; pages w1…wk point to v)
  • 37. Intelligent Information Retrieval 37 Convergence  The algorithm converges to a fixed point if iterated indefinitely.  Define A to be the adjacency matrix for the subgraph defined by S:  $A_{ij} = 1$ for i, j ∈ S iff i→j  The authority vector, a, converges to the principal eigenvector of $A^T A$  The hub vector, h, converges to the principal eigenvector of $A A^T$  In practice, 20 iterations produces fairly stable results.
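This convergence claim is easy to check numerically. In the sketch below, the adjacency matrix is invented purely for illustration:

```python
import numpy as np

# A[i, j] = 1 iff page i links to page j (a made-up 4-page graph)
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

# Power iteration mirroring the HITS authority update: a <- A^T A a
a = np.ones(A.shape[0])
for _ in range(100):
    a = A.T @ A @ a
    a /= np.linalg.norm(a)

# Compare with the principal eigenvector of A^T A computed directly
vals, vecs = np.linalg.eigh(A.T @ A)             # A^T A is symmetric
principal = np.abs(vecs[:, np.argmax(vals)])
print(np.allclose(a, principal, atol=1e-8))      # expected: True
```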
  • 38. Intelligent Information Retrieval 38 HITS Results  Authorities for query: “Java”  java.sun.com  comp.lang.java FAQ  Authorities for query “search engine”  Yahoo.com  Excite.com  Lycos.com  Altavista.com  Authorities for query “Gates”  Microsoft.com  roadahead.com In most cases, the final authorities were not in the initial root set generated using Altavista. Authorities were brought in from linked and reverse-linked pages and then HITS computed their high authority score.
  • 39. Intelligent Information Retrieval 39 HITS: Other Applications  Finding Similar Pages Using Link Structure  Given a page, P, let R (the root set) be t (e.g. 200) pages that point to P.  Grow a base set S from R.  Run HITS on S.  Return the best authorities in S as the best similar pages for P.  Finds authorities in the “link neighborhood” of P. Similar pages to “honda.com”: toyota.com, ford.com, bmwusa.com, saturncars.com, nissanmotors.com, audi.com, volvocars.com
  • 40. Intelligent Information Retrieval 40 HITS: Other Applications  HITS for Clustering  An ambiguous query can result in the principal eigenvector only covering one of the possible meanings.  Non-principal eigenvectors may contain hubs & authorities for other meanings.  Example: “jaguar”:  Atari video game (principal eigenvector)  NFL football team (2nd non-princ. eigenvector)  Automobile (3rd non-princ. eigenvector)  An application of Principal Component Analysis (PCA)
  • 41. Intelligent Information Retrieval 41 HITS: Problems and Solutions  Some edges are wrong (not “recommendations”)  multiple edges from the same author  automatically generated  spam Solution: weight edges to limit influence  Topic Drift  Query: jaguar AND cars  Result: pages about cars in general Solution: analyze content and assign topic scores to nodes
  • 42. Intelligent Information Retrieval 42 Modified HITS Algorithm  Let HUB[v] and AUTH[v] represent the hub and authority values associated with a vertex v  Repeat until the HUB and AUTH vectors converge:  Normalize HUB and AUTH  HUB[v] := Σ AUTH[ui] · TopicScore[ui] · Weight(v, ui) for all ui with Edge(v, ui)  AUTH[v] := Σ HUB[wi] · TopicScore[wi] · Weight(wi, v) for all wi with Edge(wi, v)  Topic score is determined based on a similarity measure between the query and the documents
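One possible rendering of a single modified update step; topic_score (e.g., a query–document cosine similarity) and the edge-weight table are assumed to be precomputed, and all names here are illustrative:

```python
def modified_update(pages, in_edges, out_edges, auth, hub,
                    topic_score, weight):
    """One topic- and weight-sensitive HITS update (normalize afterwards)."""
    new_auth = {p: sum(hub[w] * topic_score[w] * weight[(w, p)]
                       for w in in_edges.get(p, ()))
                for p in pages}
    new_hub = {p: sum(new_auth[u] * topic_score[u] * weight[(p, u)]
                      for u in out_edges.get(p, ()))
               for p in pages}
    return new_auth, new_hub
```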
  • 43. Intelligent Information Retrieval 43 PageRank  Alternative link-analysis method used by Google (Brin & Page, 1998).  Does not attempt to capture the distinction between hubs and authorities.  Ranks pages just by authority.  Applied to the entire Web rather than a local neighborhood of pages surrounding the results of a query.
  • 44. Intelligent Information Retrieval 44 Initial PageRank Idea  Just measuring in-degree (citation count) doesn’t account for the authority of the source of a link.  Initial PageRank equation for page p: $R(p) = c \sum_{q:\, q \to p} \frac{R(q)}{N_q}$  $N_q$ is the total number of out-links from page q.  A page, q, “gives” an equal fraction of its authority to all the pages it points to (e.g. p).  c is a normalizing constant set so that the rank of all pages always sums to 1.
  • 45. Intelligent Information Retrieval 45 Initial PageRank Idea  Can view it as a process of PageRank “flowing” from pages to the pages they cite. (figure: example graph with rank values such as 0.1, 0.09, 0.08, 0.05, and 0.03 flowing along links)
  • 46. Intelligent Information Retrieval 46 Initial PageRank Algorithm  Iterate the rank-flowing process until convergence:  Let S be the total set of pages.  Initialize ∀p∈S: R(p) = 1/|S|  Until ranks do not change (much) (convergence):  For each p∈S: $R'(p) = \sum_{q:\, q \to p} \frac{R(q)}{N_q}$  Let $c = 1 / \sum_{p \in S} R'(p)$  For each p∈S: R(p) = cR′(p) (normalize)
  • 47. Intelligent Information Retrieval 47 Sample Stable Fixpoint (figure: a small graph whose rank values — 0.4, 0.4, 0.2, 0.2, 0.2, 0.2, 0.4 — are unchanged by further iterations)
  • 48. Intelligent Information Retrieval 48 Linear Algebra Version  Treat R as a vector over web pages.  Let A be a 2-d matrix over pages where $A_{vu} = 1/N_u$ if u→v, else $A_{vu} = 0$.  Then R = cAR, so R converges to the principal eigenvector of A.
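A small numpy sketch of this matrix view, with an invented, strongly connected three-page graph (so no rank source is needed yet); the iteration settles at R = [0.4, 0.2, 0.4]:

```python
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0]}       # u -> pages that u points to
n = len(links)
A = np.zeros((n, n))
for u, targets in links.items():
    for v in targets:
        A[v, u] = 1.0 / len(targets)      # A_vu = 1/N_u if u -> v

R = np.full(n, 1.0 / n)
for _ in range(50):
    R = A @ R
    R /= R.sum()                          # c keeps the ranks summing to 1
print(R)                                  # ~[0.4, 0.2, 0.4], the principal eigenvector
```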
  • 49. Intelligent Information Retrieval 49 Problem with Initial Idea  A group of pages that only point to themselves but are pointed to by other pages acts as a “rank sink” and absorbs all the rank in the system.  Solution: Rank Source  Introduce a “rank source” E that continually replenishes the rank of each page, p, by a fixed amount E(p): $R(p) = c\left(\sum_{q:\, q \to p} \frac{R(q)}{N_q} + E(p)\right)$
  • 50. Intelligent Information Retrieval 50 PageRank Algorithm  Let S be the total set of pages.  Let ∀p∈S: E(p) = α/|S| (for some 0<α<1, e.g. 0.15)  Initialize ∀p∈S: R(p) = 1/|S|  Until ranks do not change (much) (convergence):  For each p∈S: $R'(p) = \sum_{q:\, q \to p} \frac{R(q)}{N_q} + E(p)$  Let $c = 1 / \sum_{p \in S} R'(p)$  For each p∈S: R(p) = cR′(p) (normalize)
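A direct, dictionary-based sketch of the algorithm above (the quadratic scan over pages is for clarity, not efficiency; names are illustrative):

```python
def pagerank(pages, out_links, alpha=0.15, iters=50):
    """PageRank with a uniform rank source E(p) = alpha / |S|.

    out_links maps every page to the (possibly empty) set of pages it links to.
    """
    n = len(pages)
    E = {p: alpha / n for p in pages}
    R = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new_R = {p: E[p] + sum(R[q] / len(out_links[q])
                               for q in pages if p in out_links[q])
                 for p in pages}
        c = 1.0 / sum(new_R.values())     # renormalize so ranks sum to 1
        R = {p: c * r for p, r in new_R.items()}
    return R
```

On the three-page example of the next slide (A→B, A→C, B→C with α = 0.3), a single iteration of this sketch reproduces the slide's numbers up to rounding.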
  • 51. Intelligent Information Retrieval PageRank Example (figure: three pages with links A→B, A→C, B→C; α = 0.3, so E(p) = 0.3/3 = 0.1)  First iteration only. Initial R (order A, C, B): [0.33, 0.33, 0.33]  Before normalization:  R′(A) = 0.3/3 = 0.1  R′(C) = R(A)/2 + R(B)/1 + 0.3/3 = 0.595  R′(B) = R(A)/2 + 0.3/3 ≈ 0.27  R′: [0.1, 0.595, 0.27]  Normalization factor: c = 1/[R′(A)+R′(B)+R′(C)] = 1/0.965  After normalization: R: [0.104, 0.617, 0.28]
  • 52. Intelligent Information Retrieval 52 Random Surfer Model  PageRank can be seen as modeling a “random surfer” that starts on a random page and then at each point:  With probability E(p) randomly jumps to page p.  Otherwise, randomly follows a link on the current page.  R(p) models the probability that this random surfer will be on page p at any given time.  “E jumps” are needed to prevent the random surfer from getting “trapped” in web sinks with no outgoing links.
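A toy Monte Carlo check of this interpretation, using a made-up three-page graph: the surfer's long-run visit frequencies approximate R(p).

```python
import random

out_links = {'A': ['B', 'C'], 'B': ['C'], 'C': []}   # C is a sink
pages = list(out_links)
alpha, steps = 0.15, 100_000
visits = {p: 0 for p in pages}
page = random.choice(pages)
for _ in range(steps):
    visits[page] += 1
    if random.random() < alpha or not out_links[page]:
        page = random.choice(pages)            # "E jump" (also escapes the sink C)
    else:
        page = random.choice(out_links[page])  # follow a random out-link
print({p: round(v / steps, 3) for p, v in visits.items()})
```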
  • 53. Intelligent Information Retrieval 53 Speed of Convergence  Early experiments on Google used 322 million links.  PageRank algorithm converged (within small tolerance) in about 52 iterations.  Number of iterations required for convergence is empirically O(log n) (where n is the number of links).  Therefore calculation is quite efficient.
  • 54. Intelligent Information Retrieval 54 Google Ranking  Complete Google ranking includes (based on university publications prior to commercialization).  Vector-space similarity component.  Keyword proximity component.  HTML-tag weight component (e.g. title preference).  PageRank component.  Details of current commercial ranking functions are trade secrets.
  • 55. Intelligent Information Retrieval 55 Personalized PageRank  PageRank can be biased (personalized) by changing E to a non-uniform distribution.  Restrict “random jumps” to a set of specified relevant pages.  For example, let E(p) = 0 except for one’s own home page, for which E(p) = α  This results in a bias towards pages that are closer in the web graph to your own homepage.  Similar personalization can be achieved by setting E(p) for only pages p that are part of the user’s profile.
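Sketching this bias: the same iteration as the earlier pagerank() sketch, but with E concentrated on a single page (the `home` argument is a hypothetical stand-in for a user's home page or profile pages):

```python
def personalized_pagerank(pages, out_links, home, alpha=0.15, iters=50):
    """All random-jump mass goes to `home`; everything else is unchanged."""
    E = {p: (alpha if p == home else 0.0) for p in pages}
    R = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new_R = {p: E[p] + sum(R[q] / len(out_links[q])
                               for q in pages if p in out_links[q])
                 for p in pages}
        c = 1.0 / sum(new_R.values())
        R = {p: c * r for p, r in new_R.items()}
    return R
```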
  • 56. Intelligent Information Retrieval 56 PageRank-Biased Spidering  Use PageRank to direct (focus) a spider on “important” pages.  Compute page-rank using the current set of crawled pages.  Order the spider’s search queue based on current estimated PageRank.
  • 57. Intelligent Information Retrieval 57 Link Analysis Conclusions  Link analysis uses information about the structure of the web graph to aid search.  It is one of the major innovations in web search.  It is the primary reason for Google’s success.
  • 58. Intelligent Information Retrieval 58 Behavior-Based Ranking  The emergence of large-scale search engines allows aggregate user behavior to be mined to improve ranking.  Basic Idea:  For each query Q, keep track of which docs in the results are clicked on  On subsequent requests for Q, re-order docs in results based on click-throughs.  Relevance assessment based on behavior/usage vs. content
  • 59. Intelligent Information Retrieval 59 Query-doc popularity matrix B (rows = queries q, columns = docs j): $B_{qj}$ = number of times doc j is clicked through on query q. When query q is issued again, order docs by their $B_{qj}$ values.
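In code, B need not be a dense matrix; a nested dictionary of click counts suffices. record_click and rerank below are illustrative names, not a real API:

```python
from collections import defaultdict

B = defaultdict(lambda: defaultdict(int))    # B[q][j]: clicks on doc j for query q

def record_click(query, doc):
    B[query][doc] += 1

def rerank(query, results):
    # Python's sort is stable, so docs never clicked for this query
    # keep the engine's original relative order.
    return sorted(results, key=lambda doc: B[query][doc], reverse=True)
```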
  • 60. Intelligent Information Retrieval 60 Vector space implementation  Maintain a term-doc popularity matrix C  as opposed to query-doc popularity  initialized to all zeros  Each column represents a doc j  If doc j is clicked on for query q, update $C_j \leftarrow C_j + \epsilon q$ (here q is viewed as a term vector).  On a query q′, compute its cosine proximity to $C_j$ for all j.  Combine this with the regular text score.
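A numpy sketch of this variant, assuming queries and documents share a fixed vocabulary; epsilon and the final combination weight are made-up tuning knobs:

```python
import numpy as np

def update(C, j, q_vec, epsilon=0.1):
    """Fold a clicked query vector into doc j's column: C_j <- C_j + eps * q."""
    C[:, j] += epsilon * q_vec

def behavior_scores(C, q_vec):
    """Cosine proximity of a new query to every doc column of C."""
    norms = np.linalg.norm(C, axis=0) * np.linalg.norm(q_vec)
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(norms > 0, (C.T @ q_vec) / norms, 0.0)

# e.g. final = text_scores + lam * behavior_scores(C, q_vec), lam a tuning knob
```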
  • 61. Intelligent Information Retrieval 61 Issues  Normalization of $C_j$ after updating  Assumption of query compositionality  “white house” document popularity derived from “white” and “house”  Updating – live or batch?  Basic assumption:  Relevance can be directly measured by the number of click-throughs. Valid?
