Relating Web Characteristics
Ricardo Baeza-Yates
Carlos Castillo
Universidad de Chile
Relating Web Characteristics with Link-Based Ranking

  1. Relating Web Characteristics
     Ricardo Baeza-Yates, Carlos Castillo
     Universidad de Chile
  2. Agenda
     • Introduction
     • Link-based ranking
     • Web structure
     • Web characteristics
     • Web usage
     • Web dynamics
     • Conclusions
  3. Introduction: Sample
     • Web sample: the .CL domain in the year 2000
     • 670,000 pages in 7,500 domains
     • 15 KB average page size
     • Collection from the TodoCL Web search engine
  4. Introduction: Emphasis
     • Broder et al.: Graph Structure in the Web (2000)
       – Page-based structure based on strongly connected components
       – The Web graph is not a random graph
       – Process: cut & paste model
     • Ours is mostly a site-based analysis
       – Trying to make Web structure meaningful
  5. Introduction: The Empire
  6. Introduction: One Map
  7. Link ranking: Pagerank
     • $\mathrm{Pagerank}(p) = \frac{q}{N} + (1 - q) \sum_{i=1}^{k} \mathrm{Pagerank}(r_i)$
     • $r_1, \dots, r_k$: pages that point to page $p$
     • $q / N$: probability of a random jump over the number of pages
     • Brin & Page, 1998; currently used by Google
  8. Link ranking: Hubs & Authorities
     • HITS algorithm (Kleinberg, 1998)
     • A good authority is a page pointed to by good hubs, so we assume that it has good content
     • A good hub is a page that points to good authorities, so we assume that it is a good set of links
     • Linear system calculated by numerical iteration
  9. Link ranking: Distribution
     • <2% with relevant Pagerank
     • 9% with relevant hub score
     • 2-3% with relevant authority score
  10. Link ranking: Correlation
      • Hub score, authority score and Pagerank do not seem to be correlated
  11. Link ranking: Sites
      • Which measure to use for sites?
      • Average score
        – But good sites can have lots of bad pages
      • Maximum score
        – But one good page is not enough to make a good site
      • Sum of the scores of all pages
        – Natural for Pagerank
  12. Link ranking: Sites Graph
      • 90% relevant site-Pagerank
      • It's harder to have a good hub than a good authority (site)
  13. Web Structure: Basis
      • The Web graph has structure: MAIN, IN, OUT, ISLANDS
  14. Web Structure: Basis (cont.)
      • The MAIN component has structure: MAIN IN, MAIN OUT, MAIN MAIN, MAIN NORM (diagram also labels IN and OUT)
  15. Web Structure: Sketch
  16. Web Structure: Degree
  17. Web Structure: Sizes
  18. Web Structure: Preferences
  19. Web Structure: Preferences
      • Figure panels: Real, ODP, TodoCL (labels include OUT and MAIN components)
  20. Web Structure: Various
  21. Web Structure: Link Scores
  22. Web Dynamics: Ages
      • The kernel of the Web comes from the past
  23. Web Dynamics: By Component
  24. Web Dynamics: Pagerank
      • Pagerank is biased against newer pages
  25. Web Dynamics: Hubs & Authorities
      • Plot: authority score and hub score against age (months)
  26. Conclusions
      • Pagerank and HITS scores do not seem to be correlated
        – And Pagerank is biased towards older pages
      • Site ranking can help to make good human-selected directories
      • Finding good pages is not so simple
      • Characterizing Web structure gives valuable insight
        – Web Graph Mining is just starting
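
As a rough illustration of the formula on the "Link ranking: Pagerank" slide, the sketch below iterates it on a tiny hypothetical graph. It assumes, following the usual Brin & Page formulation, that each page splits its score evenly among its out-links (the slide's formula does not show this division) and that pages without out-links spread their score over all pages. The graph, the value q = 0.15 and the function name are illustrative choices, not taken from the slides.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate Pagerank(p) = q/N + (1 - q) * sum of Pagerank(r)/outdeg(r).

    links maps each page to the list of pages it points to.
    Assumption: the division by outdeg(r) and the handling of dangling
    pages follow the common formulation, not the slide itself.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the random-jump term q/N.
        new_rank = {p: q / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = (1 - q) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its score uniformly (assumption).
                share = (1 - q) * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph used only for illustration.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

In this toy graph page "d" has no in-links, so its score stays at the random-jump term q/N, while "c", which every other page points to, ends up with the largest score.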
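
The "numerical iteration" mentioned on the "Link ranking: Hubs & Authorities" slide can be sketched as below. This is a minimal version of the HITS update, assuming Euclidean normalization after each round and a fixed number of iterations; the toy graph and all names are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a graph given as page -> out-links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two link pages pointing to three content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"]}
hub, auth = hits(toy_links)
```

Here "h2" gets the higher hub score because it points to every authority, and "a3" gets a lower authority score than "a1" and "a2" because only one hub points to it.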
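
The three candidate site measures on the "Link ranking: Sites" slide (average, maximum, sum) can be compared with a small sketch. It assumes that a "site" is identified by the hostname of a page URL, which the slides do not specify; the URLs and scores are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_scores(page_scores):
    """Aggregate per-page scores into per-site average, maximum and sum."""
    by_site = defaultdict(list)
    for url, score in page_scores.items():
        # Assumption: the hostname stands in for the site.
        by_site[urlsplit(url).hostname].append(score)
    return {
        site: {
            "average": sum(values) / len(values),
            "maximum": max(values),
            "sum": sum(values),
        }
        for site, values in by_site.items()
    }


# Hypothetical scores: one site with a single strong page, one uniform site.
toy_scores = {
    "http://a.example/": 0.40,
    "http://a.example/x": 0.02,
    "http://a.example/y": 0.01,
    "http://a.example/z": 0.01,
    "http://b.example/": 0.15,
    "http://b.example/w": 0.15,
}
sites = site_scores(toy_scores)
```

With these numbers the average favors b.example, while the maximum and the sum favor a.example, which is the kind of disagreement the slide points at.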
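
The MAIN / IN / OUT / ISLANDS split on the "Web Structure: Basis" slide can be sketched as a reachability computation. The slides do not define the components, so the definitions here are assumptions in the spirit of Broder et al.: MAIN is the largest strongly connected component, IN can reach MAIN, OUT is reachable from MAIN, and ISLANDS is everything else. The brute-force search is only suitable for small graphs, and the toy graph is hypothetical.

```python
def reachable(start, adjacency):
    """Return every node reachable from the nodes in start, start included."""
    seen = set(start)
    stack = list(start)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(links):
    """Classify nodes as MAIN, IN, OUT or ISLANDS (assumed definitions)."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    reverse = {}
    for src, targets in links.items():
        for t in targets:
            reverse.setdefault(t, []).append(src)
    # Largest strongly connected component, found by brute force:
    # the component of a node is forward-reachable AND backward-reachable.
    main, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = reachable([node], links) & reachable([node], reverse)
        assigned |= component
        if len(component) > len(main):
            main = component
    in_part = reachable(main, reverse) - main
    out_part = reachable(main, links) - main
    islands = nodes - main - in_part - out_part
    return {"MAIN": main, "IN": in_part, "OUT": out_part, "ISLANDS": islands}


# Hypothetical site graph: a three-node cycle, one entry, one exit, one island pair.
toy_links = {"i": ["m1"], "m1": ["m2"], "m2": ["m3"], "m3": ["m1", "o"], "x": ["y"]}
parts = bow_tie(toy_links)
```

For a crawl of the size described in the deck, a linear-time strongly connected components algorithm would be the more realistic choice; the brute-force loop is kept here only for readability.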
