# Tutorial 8 (web graph models)

Part of the Search Engine course given in the Technion (2011)

Published in: Technology, Design

1. Evolutionary Models of the Web Graph
   - Kira Radinsky
   - The Web size estimation models are based on the Stanford slides by Christopher Manning and Prabhakar Raghavan
2. Stochastic Models for the Web's Graph (7 December 2010)
   - So what can explain the observed power-law in/out-degree distributions of Web pages?
   - Standard G(n,p) Erdős-Rényi random graphs:
     - A graph contains n nodes, and every two nodes are connected with probability p
     - Degrees are distributed B(n-1,p), and since on the Web np << n, they can be viewed as distributed Poisson(np-p)
     - Such distributions have light, exponentially decreasing tails: nodes with very large in-degrees are practically impossible, yet they abound on the Web
   - Conclusion: Erdős-Rényi random graphs do not model the Web graph
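To make the light-tail argument concrete, the sketch below compares the ratio P(2x)/P(x) for a Poisson distribution and for a power law. The mean `lam = 10` and the exponent `gamma = 2.1` are illustrative assumptions, not values taken from the slides.

```python
import math

def poisson_pmf(lam: float, k: int) -> float:
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def power_law(x: float, gamma: float) -> float:
    """Unnormalized power-law weight x^(-gamma)."""
    return x ** (-gamma)

def doubling_ratios(f, xs):
    """Return f(2x) / f(x) for each x in xs."""
    return [f(2 * x) / f(x) for x in xs]

lam, gamma = 10.0, 2.1  # assumed values, for illustration only
xs = [10, 20, 40]
poisson_ratios = doubling_ratios(lambda x: poisson_pmf(lam, x), xs)
power_ratios = doubling_ratios(lambda x: power_law(x, gamma), xs)

if __name__ == "__main__":
    print("Poisson   P(2x)/P(x):", poisson_ratios)
    print("Power law P(2x)/P(x):", power_ratios)
```

For the power law the ratio stays at 2^(-gamma) regardless of x, while for the Poisson it collapses toward zero as x grows, which is why very high degrees are practically impossible under G(n,p).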
3. Evolutionary Models: First Attempt
   - The Web wasn't built in a day; in fact, it is constantly growing and evolving
   - Models should (somewhat) reflect the authoring process of Web pages
   - Observation: older, well-established nodes should be better connected, as they've been around longer and are better known
   - A corresponding model:
     - Start at time 0 with a single node.
     - At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes, chosen uniformly at random
     - The expected in-degree at time T of the node added at time t: $\sum_{j=t+1}^{T} \frac{1}{j} \approx \log T - \log t$
     - Doesn't result in a power law: P(2x)/P(x) is not a constant
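The uniform-attachment process and its expected in-degree can be checked with a small simulation. This is a minimal sketch; the function names, the number of runs and the seed are illustrative choices, not part of the slides.

```python
import math
import random

def expected_indegree(t: int, T: int) -> float:
    """Exact expectation: the sum of 1/j for j = t+1 .. T."""
    return sum(1.0 / j for j in range(t + 1, T + 1))

def simulate_indegrees(T: int, rng: random.Random) -> list:
    """Grow the graph for T steps; return the in-degrees of nodes 0..T."""
    indeg = [0]                       # time 0: a single node
    for step in range(1, T + 1):
        target = rng.randrange(step)  # one of the `step` pre-existing nodes
        indeg[target] += 1
        indeg.append(0)               # the new node starts with in-degree 0
    return indeg

def average_indegree(t: int, T: int, runs: int, seed: int = 0) -> float:
    """Average in-degree at time T of the node added at time t."""
    rng = random.Random(seed)
    return sum(simulate_indegrees(T, rng)[t] for _ in range(runs)) / runs

if __name__ == "__main__":
    t, T = 10, 1000
    print("exact sum      :", expected_indegree(t, T))
    print("log T - log t  :", math.log(T) - math.log(t))
    print("simulated mean :", average_indegree(t, T, runs=200))
```

In-degrees grow only logarithmically with age here, so the model favors old nodes but cannot produce the heavy tail seen on the Web.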
4. Preferential Attachment (236620 Search Engine Technology)
   - Observation: while older, well-established nodes are better known, it is not strictly because of their age but rather because they have more in-links
   - The preferential attachment model:
     - Start at time 0 with a single node.
     - At step t, add a new node with a single new edge that connects to one of the t pre-existing nodes
       - The probability of linking to node v: (1 + in-degree(v)) / (2t-1)
   - A variant involves a parameter α:
     - Start at time 0 with a single node.
     - At step t, add a new node with a single new edge that connects to node v with probability α/t + (1-α) · in-degree(v)/(t-1)
   - Both variants indeed result in a power-law distribution of in-degrees (with different exponents)
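The first variant, with linking probability (1 + in-degree(v)) / (2t-1), can be simulated with an "urn" in which each node appears once per unit of weight. This is a sketch under that reading of the slide; the urn trick and all names are illustrative choices.

```python
import random
from collections import Counter

def preferential_attachment(T: int, seed: int = 0) -> list:
    """Return the in-degrees of nodes 0..T after T steps."""
    rng = random.Random(seed)
    indeg = [0]  # time 0: a single node
    urn = [0]    # node v appears (1 + in-degree(v)) times
    for t in range(1, T + 1):
        # len(urn) == 2t - 1 here, so a uniform draw from the urn picks
        # v with probability (1 + in-degree(v)) / (2t - 1).
        target = urn[rng.randrange(len(urn))]
        indeg[target] += 1
        indeg.append(0)
        urn.append(target)  # the target gained one in-link
        urn.append(t)       # the "+1" of the newly added node
    return indeg

def degree_histogram(indeg: list) -> Counter:
    """Map each in-degree value to the number of nodes that have it."""
    return Counter(indeg)

if __name__ == "__main__":
    hist = degree_histogram(preferential_attachment(20000))
    for k in (1, 2, 4, 8, 16):
        print(k, hist.get(k, 0))
```

Plotting the histogram on log-log axes should give a roughly straight line, in contrast to the logarithmic growth of the uniform model.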
5. Preferential Attachment (cont.)
   - Another observation: if search engine rankings are influenced by PageRank, then new pages will link to high-PageRank pages more than to low-PageRank pages
   - The model uses two positive parameters d, p such that d+p < 1
   - The evolution:
     - Start at time 0 with a single node.
     - At step t, add a new node with a single new edge as follows:
       - With probability d, connect the edge to one of the existing nodes in proportion to the in-degree (or 1 + in-degree) of that node
       - With probability p, connect the edge to a node chosen at random according to the PageRank distribution at time t
       - With probability 1-p-d, connect the edge to an existing node chosen uniformly at random
   - With properly chosen parameters, this model can fit both the in-degree and PageRank power-law distributions
   - Raghavan et al., "Using PageRank to characterize Web Structure", 2002
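A small-scale sketch of the three-way mixture follows. The slides do not specify how PageRank is computed, so the damping factor 0.85, the uniform teleport, the uniform redistribution of dangling mass and the fixed 30 power iterations are all assumptions; recomputing PageRank from scratch is only practical for small T.

```python
import random

def pagerank(out_link: list, damping: float = 0.85, iters: int = 30) -> list:
    """Power iteration for a graph where node u has one out-link or None."""
    n = len(out_link)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - damping) / n] * n
        dangling = 0.0
        for u, v in enumerate(out_link):
            if v is None:
                dangling += pr[u]       # node without out-links
            else:
                nxt[v] += damping * pr[u]
        share = damping * dangling / n  # spread dangling mass uniformly
        pr = [x + share for x in nxt]
    return pr

def grow(T: int, d: float, p: float, seed: int = 0):
    """Return (out_link, indeg) after T steps of the mixed model."""
    assert d > 0 and p > 0 and d + p < 1
    rng = random.Random(seed)
    out_link = [None]  # time 0: a single node with no out-link
    indeg = [0]
    for t in range(1, T + 1):
        r = rng.random()
        nodes = list(range(t))
        if r < d:        # proportional to 1 + in-degree
            weights = [1 + indeg[v] for v in nodes]
            target = rng.choices(nodes, weights=weights)[0]
        elif r < d + p:  # according to the current PageRank distribution
            target = rng.choices(nodes, weights=pagerank(out_link))[0]
        else:            # uniformly at random
            target = rng.randrange(t)
        out_link.append(target)
        indeg.append(0)
        indeg[target] += 1
    return out_link, indeg
```

For example, `grow(200, d=0.4, p=0.4)` returns the out-link of each node and the resulting in-degrees; the parameter values are arbitrary.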
6. The Copy Model
   - The "Copy Model" assumes the following authoring model:
     - Each page is on a topic of interest to its author.
       - Some of its links will be copied from a previous page on the same topic that the author found useful
       - Some links will be "original", i.e. chosen independently by the author of the page
   - The stochastic process creates nodes with an out-degree of d (parallel edges are allowed)
     - Start at time 0 with a single node and d self-loops
     - At step t, add a new node with d out-links as follows:
       - Choose an intermediate node v u.a.r. (uniformly at random) from the t existing nodes
       - For j=1,…,d:
         - With probability α, connect link j to a node chosen u.a.r. from the t existing nodes
         - With probability 1-α, copy the j'th link of v
   - The copy model results in power-law in-degree distributions
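The copy process translates almost line by line into code. The sketch below follows the steps as stated on the slide; the function names and the parameter values in the usage note are illustrative.

```python
import random

def copy_model(T: int, d: int, alpha: float, seed: int = 0) -> list:
    """Return out_links, where out_links[u] lists the d link targets of u."""
    rng = random.Random(seed)
    out_links = [[0] * d]  # time 0: a single node with d self-loops
    for t in range(1, T + 1):
        prototype = out_links[rng.randrange(t)]  # intermediate node v
        links = []
        for j in range(d):
            if rng.random() < alpha:
                links.append(rng.randrange(t))   # "original" link
            else:
                links.append(prototype[j])       # copy the j'th link of v
        out_links.append(links)
    return out_links

def in_degrees(out_links: list) -> list:
    """Count incoming links per node (parallel edges counted separately)."""
    indeg = [0] * len(out_links)
    for links in out_links:
        for v in links:
            indeg[v] += 1
    return indeg
```

Calling `in_degrees(copy_model(20000, d=3, alpha=0.3))` gives an in-degree list whose histogram can be inspected for a heavy tail; popular targets are copied more often, which is the copy model's form of "rich get richer".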
7. Evolutionary Models: Summary
   - Overall, models exist that can simultaneously fit the observed power-law distributions of in-degrees, out-degrees and PageRank
     - Many other properties of the graph are still unexplained by theoretical evolutionary models
   - The accepted models mix-and-match the principles of preferential attachment (degrees/PageRank), copying, and random connectivity
   - These models have the "rich get richer" property, and favor seniority (i.e. nodes from earlier rounds tend to have higher degrees)
     - One can add some random "fitness" to nodes, with preferential attachment considering fitness as well, to give new nodes better chances of competing with existing nodes
   - Note that there's a difference between "rich get richer" and "winner takes all": the Web's graph doesn't exhibit the dominance of a single winner
8. Related Research Area: The Science of Networks
   - Power-law and scale-free networks
   - "Small World" networks and the importance of weak ties
     - Kleinberg's small-world grid
   - Social/collaboration networks
     - Milgram's "six degrees of separation"
     - The six degrees of Kevin Bacon
     - Erdős numbers
   - Song lyric (translated from Hebrew): "My sister's son's wife told me, and my neighbor's uncle got the deputy battalion commander" (Shlomi Bracha, Mashina)
9. What is the size of the web?
   - Issues
     - The web is really infinite
       - Dynamic content, e.g., calendar
       - Soft 404: `www.yahoo.com/<anything>` is a valid page
     - Static web contains syntactic duplication, mostly due to mirroring (~30%)
     - Some servers are seldom connected
   - Who cares?
     - Media, and consequently the user
     - Engine design
     - Engine crawl policy. Impact on recall.
10. What can we attempt to measure?
    - (IQ is whatever the IQ tests measure.)
    - The statically indexable web is whatever search engines index.
      - Different engines have different preferences: max URL depth, max count/host, anti-spam rules, priority rules, etc.
      - Different engines index different things under the same URL: frames, meta-keywords, document restrictions, document extensions, ...
11. 11. A B = (1/2) * Size A A B = (1/6) * Size B (1/2)*Size A = (1/6)*Size B Size A / Size B = (1/6)/(1/2) = 1/3 Sample URLs randomly from A Check if contained in B and vice versa A  B Each test involves: (i) Sampling (ii) Checking Relative Size from Overlap Given two engines A and B
12. Sampling URLs
    - Ideal strategy: generate a random URL and check for containment in each index.
    - Problem: random URLs are hard to find! It is enough to generate a random URL contained in a given engine.
    - Approach 1: generate a random URL contained in a given engine
      - Random queries
      - Random searches
    - Approach 2: methods that give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)
      - Random IP addresses
      - Random walks
13. Random URLs from random queries
    - Generate random query: how?
      - Lexicon: 400,000+ words from a web crawl (not an English dictionary)
      - Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
    - Get 100 result URLs from engine A
    - Choose a random URL as the candidate to check for presence in engine B
    - This distribution induces a probability weight W(p) for each page.
    - Conjecture: $W(SE_A) / W(SE_B) \sim |SE_A| / |SE_B|$
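The query-based sampling loop can be sketched against in-memory stand-ins for the engines. Everything here is hypothetical: `search` is a toy conjunctive matcher, and the indexes and lexicon are made-up data, not a real engine API.

```python
import random

def search(index: dict, w1: str, w2: str, limit: int = 100) -> list:
    """Toy conjunctive search: URLs whose word set has both w1 and w2."""
    hits = sorted(url for url, words in index.items()
                  if w1 in words and w2 in words)
    return hits[:limit]

def sample_url(index: dict, lexicon: list, rng: random.Random,
               max_tries: int = 1000):
    """Issue random two-word queries until one returns results."""
    for _ in range(max_tries):
        w1, w2 = rng.sample(lexicon, 2)
        results = search(index, w1, w2)
        if results:
            return rng.choice(results)  # candidate URL from engine A
    return None

# Hypothetical data: each URL maps to the set of words on that page.
engine_a = {
    "u1": {"web", "graph", "model"},
    "u2": {"web", "size"},
    "u3": {"graph", "size", "model"},
}
engine_b = {"u1": engine_a["u1"], "u3": engine_a["u3"]}
lexicon = ["web", "graph", "model", "size"]

if __name__ == "__main__":
    rng = random.Random(0)
    candidate = sample_url(engine_a, lexicon, rng)
    print(candidate, "in B:", candidate in engine_b)
```

Pages matching many word pairs are returned more often, which is the non-uniform weight W(p) the slide refers to.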
14. Random searches
    - Choose random searches extracted from a local log [Lawrence & Giles 97] or build "random searches" [Notess]
      - Use only queries with small result sets.
      - Count normalized URLs in result sets.
      - Use ratio statistics
15. Random IP addresses
    - Generate random IP addresses
    - Find a web server at the given address
      - If there's one
    - Collect all pages from server
      - From this, choose a page at random
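Only the first step, generating random addresses, is easy to show without network access. The sketch below draws random IPv4 addresses and skips reserved and private ranges; restricting to IPv4 and to globally routable addresses is an assumption, and probing for a web server and crawling it are left out.

```python
import ipaddress
import random

def random_public_ipv4(rng: random.Random) -> ipaddress.IPv4Address:
    """Draw uniformly random 32-bit addresses until one is globally routable."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:  # skips private, loopback, reserved ranges, etc.
            return addr

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(random_public_ipv4(rng))
```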
16. Random walks
    - View the Web as a directed graph
    - Build a random walk on this graph
      - Includes various "jump" rules back to visited sites
        - Does not get stuck in spider traps!
        - Can follow all links!
      - Converges to a stationary distribution
        - Must assume graph is finite and independent of the walk.
        - Conditions are not satisfied (cookie crumbs, flooding)
        - Time to convergence not really known
      - Sample from stationary distribution of walk
      - Use the "strong query" method to check coverage by SE
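A toy version of such a walk is sketched below. The slides mention "various jump rules" without specifying them, so the rule used here (with probability `jump_prob`, or at a dead end, jump to a uniformly chosen previously visited node) is an assumption, as are the graph and the parameter values.

```python
import random
from collections import Counter

def random_walk(graph: dict, start: str, steps: int,
                jump_prob: float = 0.15, seed: int = 0) -> Counter:
    """Walk for `steps` moves and return the visit count of each node."""
    rng = random.Random(seed)
    visited = [start]  # distinct nodes seen so far, in discovery order
    seen = {start}
    counts = Counter([start])
    node = start
    for _ in range(steps):
        links = graph.get(node, [])
        if not links or rng.random() < jump_prob:
            node = rng.choice(visited)  # jump back to a visited node
        else:
            node = rng.choice(links)    # follow a random out-link
        if node not in seen:
            seen.add(node)
            visited.append(node)
        counts[node] += 1
    return counts

# Hypothetical graph with a spider trap: "trap" links only to itself.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "trap"],
    "trap": ["trap"],
}

if __name__ == "__main__":
    counts = random_walk(toy_graph, "a", steps=5000)
    total = sum(counts.values())
    for node, c in sorted(counts.items()):
        print(node, round(c / total, 3))
```

The normalized visit counts approximate the walk's stationary distribution on this finite, static graph; on the real Web the finiteness and independence assumptions do not hold, as the slide notes.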