Relating Web Characteristics with Link-Based Ranking - Presentation Transcript
Relating Web Characteristics
Ricardo Baeza-Yates
Carlos Castillo
Universidad de Chile
Agenda
• Introduction
• Link-based ranking
• Web structure
• Web characteristics
• Web usage
• Web dynamics
• Conclusions
Introduction: Sample
• Web sample: .CL domain in the year 2000
• 670,000 pages in 7,500 domains
• 15 KB average page size
• Collection from the TodoCL Web search engine
Introduction: Emphasis
• Broder et al.: Graph Structure in the Web (2000)
– Page-based structure based on strongly
connected components
– The Web graph is not a random graph
– Process: cut & paste model
• Ours is mostly a site-based analysis
– Trying to make Web structure meaningful
Introduction: The Empire
Introduction: One Map
Link ranking: Pagerank
$$\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)$$
• r_1, ..., r_k: pages that point to page p
• q/N: probability of a random jump (q) over the number of pages (N)
• Brin & Page, 1998
• Currently used by Google
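The Pagerank formula on this slide can be tried out on a toy graph. The sketch below is a minimal, illustrative power iteration, not the authors' implementation. It assumes the standard Brin & Page normalization, in which each term Pagerank(r_i) is divided by the number of out-links of r_i (the formula as transcribed does not show this factor), and it assumes that the rank held by pages without out-links is spread evenly over all pages. The names `pagerank`, `out_links` and `toy_web`, the value q = 0.15 and the iteration count are illustrative choices, not taken from the slides.

```python
def pagerank(out_links, q=0.15, iterations=50):
    """Power-iteration sketch of Pagerank.

    out_links maps each page to the list of pages it links to.
    q is the probability of a random jump (assumed value, not from the slides).
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumption: rank of pages with no out-links is shared by everyone.
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: q / n + (1 - q) * dangling / n for p in pages}
        for source, targets in out_links.items():
            if targets:
                # Assumption: each page splits its rank over its out-links.
                share = (1 - q) * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web: "c" receives the most links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
best_page = max(scores, key=scores.get)
```

In this toy graph the scores sum to 1, and page "d", which nobody links to, keeps only the random-jump share q/N.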
Link ranking: Hubs & Authorities
• HITS algorithm (Kleinberg, 1998)
• A good authority is a page pointed to by
good hubs, so we assume that it has
good content
• A good hub is a page that points to
good authorities, so we assume it is a
good set of links
• Linear system calculated by numerical
iteration
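The mutual definition of hubs and authorities above can be illustrated with a short numerical iteration. This is a minimal sketch under stated assumptions, not the authors' code: it runs on a whole toy graph rather than on a query-specific subgraph, normalizes both score vectors to unit Euclidean length after each round, and uses a fixed number of iterations. The names `hits`, `out_links` and `toy_web` are hypothetical.

```python
import math


def hits(out_links, iterations=50):
    """Iterative sketch of hub and authority scores.

    out_links maps each page to the list of pages it links to.
    Returns (hub, authority) dictionaries.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p: sum of the hub scores of the pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for source, targets in out_links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub of p: sum of the authority scores of the pages p points to.
        hub = {p: sum(auth[t] for t in out_links.get(p, [])) for p in pages}
        # Normalize both vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical graph: two link pages pointing to two content pages.
toy_web = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, auth_scores = hits(toy_web)
```

Here "a1" ends up as the better authority because both hubs point to it, and "h1" as the better hub because it points to both authorities.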
Link ranking: Distribution
• <2% with relevant Pagerank
• 9% with relevant hub score
• 2-3% with relevant authority score
Link ranking: Correlation
• Hub score, authority score and Pagerank do not seem to be correlated
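The slides do not say how the correlation between the three scores was measured. One common choice, assumed here purely for illustration, is Pearson's correlation coefficient computed over the per-page scores. The function name `pearson` and the sample vectors are hypothetical and do not come from the .CL collection.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Made-up score vectors, one entry per page, for illustration only.
perfectly_related = pearson([1, 2, 3], [2, 4, 6])
inversely_related = pearson([1, 2, 3], [3, 2, 1])
unrelated = pearson([1, 2, 3, 4], [1, -1, -1, 1])
```

A value near 0, as in the last call, is what "do not seem to be correlated" would look like under this measure.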
Link ranking: Sites
• Which measure to use for sites?
• Average score
– But good sites can have lots of bad pages
• Maximum score
– But one good page is not enough to
make a good site
• Sum of the scores of all pages
– Natural for Pagerank
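The three candidate site measures listed above can be compared side by side. The sketch below is illustrative only: it assumes a site is identified by the host name of a page URL, and the function name `site_scores`, the URLs and the score values are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def site_scores(page_scores):
    """Aggregate per-page scores into average, maximum and sum per site.

    page_scores maps a page URL to its score (for example its Pagerank).
    Assumption: the host name of the URL identifies the site.
    """
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        by_site[urlparse(url).netloc].append(score)
    return {
        site: {
            "average": sum(scores) / len(scores),
            "maximum": max(scores),
            "sum": sum(scores),
        }
        for site, scores in by_site.items()
    }


# Hypothetical scores: one strong page plus two weak pages on site-a.
toy_scores = {
    "http://site-a.example/": 0.40,
    "http://site-a.example/x": 0.05,
    "http://site-a.example/y": 0.05,
    "http://site-b.example/": 0.30,
}
result = site_scores(toy_scores)
```

With these made-up numbers, site-a wins on maximum and sum but loses on average, which is the effect of many weak pages that the average-score bullet warns about.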
Link ranking: Sites Graph
• 90% relevant site-Pagerank
• It’s harder to have a good hub than a good authority (site)
Web Structure: Basis
• The Web graph has structure:
– MAIN
– IN
– OUT
– ISLANDS
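The four components named on this slide can be computed from a link graph in the style of the strongly-connected-component analysis of Broder et al. The sketch below is a simplified illustration, not the authors' procedure. It assumes that MAIN is the largest strongly connected component, IN is everything that can reach MAIN, OUT is everything reachable from MAIN, and every remaining node is lumped under ISLANDS. The names `reachable`, `bow_tie` and `toy_sites` are hypothetical, and the component search is quadratic in the worst case, so it only suits small graphs.

```python
from collections import defaultdict


def reachable(graph, start_nodes):
    """Return every node reachable from start_nodes (start nodes included)."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def bow_tie(out_links):
    """Label each node as MAIN, IN, OUT or ISLANDS."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # Reverse graph, used to walk links backwards.
    reverse = defaultdict(list)
    for source, targets in out_links.items():
        for target in targets:
            reverse[target].append(source)

    # The component of a node is what it can reach AND what can reach it.
    main = set()
    unassigned = set(nodes)
    while unassigned:
        node = next(iter(unassigned))
        component = reachable(out_links, [node]) & reachable(reverse, [node])
        unassigned -= component
        if len(component) > len(main):
            main = component

    upstream = reachable(reverse, main)      # MAIN plus IN
    downstream = reachable(out_links, main)  # MAIN plus OUT
    labels = {}
    for node in nodes:
        if node in main:
            labels[node] = "MAIN"
        elif node in upstream:
            labels[node] = "IN"
        elif node in downstream:
            labels[node] = "OUT"
        else:
            labels[node] = "ISLANDS"  # simplification: everything else
    return labels


# Hypothetical site graph with one node pair in MAIN.
toy_sites = {
    "in1": ["m1"],
    "m1": ["m2"],
    "m2": ["m1", "out1"],
    "out1": [],
    "island1": ["island2"],
    "island2": [],
}
labels = bow_tie(toy_sites)
```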
Web Structure: Basis (cont.)
• The MAIN component has structure:
– MAIN MAIN
– MAIN IN
– MAIN OUT
– MAIN NORM
[Figure: MAIN subcomponents shown between the IN and OUT components]
Web Structure: Sketch
Web Structure: Degree
Web Structure: Sizes
Web Structure: Preferences
Web Structure: Preferences
[Figure: MAIN and OUT components in three panels: Real, ODP, TodoCL]
Web Structure: Various
Web Structure: Link Scores
Web Dynamics: Ages
• The kernel of the Web comes from the
past
Web Dynamics: By Component
Web Dynamics: Pagerank
Pagerank is biased
against newer pages
Web Dynamics: Hubs & Authorities
[Figure: authority score and hub score versus age (months)]
Conclusions
• Pagerank/HITS do not seem to be
correlated
– And Pagerank is biased toward older pages
• Site ranking can help to make good
human-selected directories
• Finding good pages is not so simple
• Characterizing Web structure gives
valuable insight
– Web Graph Mining is just starting