Science of the Interwebs

15-396 - Science of the interwebs presentation.

Published in: Internet

  1. 1. 15-396 Science of teh Interwebs
  2. 2. Web Search II Lecture 13 (October 14, 2008)
  3. 3. What Does the Web Look Like?
  4. 4. Can Think of the Web as a Directed Graph
  5. 5. What is a Node? There is an “infinite” number of pages in Google alone
  6. 6. Spider Traps http://foo.com/bar/foo/bar/foo/bar/foo/bar/.....
  7. 7. Modern Search Engines Focus on Relatively Stable Pages
  8. 8. What Does the Web Look Like?
  9. 9. A strongly connected component (SCC) in a directed graph is a subset of the nodes such that every node in the subset has a path to every other node in the subset
  10. 10. [Figure labels: 56 million, 44 million, 44 million, 44 million (data from 1999)]
  11. 11. How Should we Use Rankings in Search?
  12. 12. Naïve Approach: 1. Collect all pages that are relevant through text-only techniques (the query occurs in the title of the page, the query occurs in the page itself, etc.). 2. Sort the results by, e.g., global PageRank. Problem: if Yahoo! contains the text “flower”, it will be one of the first few results for the query.
  13. 13. Forget about PageRank for a Second…
  14. 14. 1. Collect all pages that are relevant through text-only techniques (the query occurs in the title of the page, the query occurs in the page itself, etc.). 2. Let pages in this sample “vote” through links. Problem: super-popular pages like Yahoo! still pose problems.
  15. 15. Lists: Some pages are “lists” of things. A page’s value as a list = sum of votes received by all pages that it voted for.
  16. 16. Hubs and Authorities: A Precursor of PageRank. Hubs = high-value lists for the query. Authorities = highly endorsed answers to the query. For each page p, we assign two values: hub(p) and auth(p).
  17. 17. Start: for all p, hub(p) = 1 and auth(p) = 1. Authority Update Rule: for each page p, update auth(p) to be the sum of the hub scores of all pages that point to it. Hub Update Rule: for each page p, update hub(p) to be the sum of the authority scores of all pages that it points to. Repeat k times: apply the Authority Update Rule, then apply the Hub Update Rule.
  18. 18. To keep the numbers from growing without bound, always normalize. This process converges!
  19. 19. Combining Anchor Text. Two example links: “A great newspaper” and “Check out this picture”. Which link is better for the query “newspaper”? How do we incorporate this information into PageRank or “Hubs and Authorities”? We can multiply link contributions by a factor that indicates the quality.
  20. 20. Impact Factor of Scientific Journals: Nature, Science, New England Journal of Medicine, Cell, PNAS, Journal of Biological Chemistry, JAMA, The Lancet, NAT GENET, Nature Medicine
  21. 21. Supreme Court Cases
  22. 22. g2g ttyl
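
The hub and authority update rules from slides 17 and 18 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lecture's own code: the function names `hits` and `normalize`, the dictionary representation of links, and the choice to normalize by dividing each score by the sum of all scores are all hypothetical (the slides say only "always normalize").

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed normalization; the slides do not say which one)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {page: value / total for page, value in scores.items()}


def hits(links, k=20):
    """Run k rounds of the hub/authority updates.

    links: dict mapping each page to the list of pages it points to
           (hypothetical input format chosen for this sketch).
    Returns (hub, auth), two dicts of normalized scores.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Start: for all p, hub(p) = 1 and auth(p) = 1.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(k):
        # Authority Update Rule: auth(p) = sum of hub scores of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        auth = normalize(new_auth)

        # Hub Update Rule: hub(p) = sum of authority scores of pages p points to.
        new_hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        hub = normalize(new_hub)

    return hub, auth


if __name__ == "__main__":
    # Tiny made-up example: two "list" pages voting for two answer pages.
    example = {"list1": ["a", "b"], "list2": ["a"], "a": [], "b": []}
    hub, auth = hits(example, k=10)
    for page in sorted(example):
        print(page, round(hub[page], 3), round(auth[page], 3))
```

In this toy graph, page "a" is endorsed by both lists and so ends up with a higher authority score than "b", while "list1" points to both answers and so ends up with a higher hub score than "list2".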
