Tutorial 6 (web graph attributes)


Part of the Search Engine course given in the Technion (2011)


  1. Web Graph Characteristics (Kira Radinsky)
  2. The Web as a Graph
     Pages as graph nodes, hyperlinks as edges.
     – Sometimes sites are taken as the nodes
     Some natural questions:
     1. Distribution of the number of in-links to a page.
     2. Distribution of the number of out-links from a page.
     3. Distribution of the number of pages in a site.
     4. Connectivity: is it possible to reach most pages from most pages?
     5. Is there a theoretical model that fits the graph?
  3. Mathematical Background: Power-Law Distributions
     • A non-negative random variable X is said to have a Power-Law distribution if, for some constants $c > 0$ and $\alpha > 0$: $\Pr[X > x] \sim c\,x^{-\alpha}$, or equivalently its density satisfies $f(x) \sim x^{-(\alpha+1)}$
     • Taking logs of both sides, we have: $\log \Pr[X > x] \approx -\alpha \log x + \log c$, i.e. a straight line with slope $-\alpha$ on a log-log plot
     • Power-Law distributions have “heavy/long tails”, i.e. the probability mass of events whose value is far from the expectation or median of the distribution is significant
       – Unlike Normal or Geometric/Exponential distributions, where the probability mass of the tail decreases at least exponentially, in Power-Law distributions the mass of the tail decreases only polynomially, as $x^{-\alpha}$
       – Another point of view: in an Exponential distribution, $f(x)/f(x+k)$ does not depend on x, whereas in a Power-Law distribution, $f(x)/f(kx)$ does not depend on x
       – The “average” quantity in a Power-Law distribution is not “typical”
     • Examples of Power-Law distributions are the Pareto and Zipf distributions (see next slides)
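The contrast between the two ratios, f(x)/f(x+k) for an Exponential density and f(x)/f(kx) for a Power-Law density, can be checked numerically. The sketch below is only an illustration: the parameter values (`alpha`, `lam`, `k`) and the sample points `xs` are arbitrary assumptions, and the Power-Law density is left unnormalized.

```python
import math

def power_law_density(x, alpha):
    """Unnormalized Power-Law density, f(x) ~ x^-(alpha+1)."""
    return x ** -(alpha + 1)

def exponential_density(x, lam):
    """Exponential density, f(x) = lam * exp(-lam * x)."""
    return lam * math.exp(-lam * x)

# Arbitrary illustrative parameters (assumptions, not taken from the slides).
alpha, lam, k = 1.1, 0.5, 2.0
xs = [1.0, 10.0, 100.0]

# Scaling x by k changes a Power-Law density by the same factor, k^(alpha+1), at every x.
power_ratios = [power_law_density(x, alpha) / power_law_density(k * x, alpha) for x in xs]

# Shifting x by k changes an Exponential density by the same factor, exp(lam*k), at every x.
exp_ratios = [exponential_density(x, lam) / exponential_density(x + k, lam) for x in xs]

print(power_ratios)
print(exp_ratios)
```

Each list should contain three (nearly) identical numbers, which is the scale-invariance property the slide describes.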
  4. Mathematical Background: The Pareto Distribution
     • A continuous, positive random variable X in the range $[L, \infty)$ is said to be distributed Pareto(L,k) if its probability density function is: $f(x; k, L) = k L^k / x^{k+1}$
     • This implies that $\Pr(X > x) = (L/x)^k$ for $x \ge L$
       – Has finite expectation, $Lk/(k-1)$, only for $k > 1$
       – Has finite variance only for $k > 2$
     • Named after the Italian economist Vilfredo Pareto (1848-1923), who used it to model the distribution of wealth in society
       – Most people have little income; 20% of society holds 80% of the wealth
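The Pareto formulas can be turned into a few small helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the `top_share` formula p^(1-1/k) does not appear on the slide; it follows from integrating x·f(x) over the richest fraction p and dividing by the mean, which requires k > 1.

```python
import math

def pareto_ccdf(x, L, k):
    """Prob(X > x) = (L/x)^k for x >= L, and 1 below the minimum value L."""
    return (L / x) ** k if x >= L else 1.0

def pareto_mean(L, k):
    """Expectation L*k/(k-1); infinite when k <= 1."""
    return L * k / (k - 1) if k > 1 else math.inf

def top_share(p, k):
    """Fraction of total wealth held by the richest fraction p (valid for k > 1).

    Derived from the density above, not stated on the slide.
    """
    return p ** (1 - 1 / k)

# The shape parameter for which the richest 20% hold 80% of the wealth.
k_80_20 = math.log(5) / math.log(4)

print(pareto_ccdf(2.0, 1.0, 2.0))   # (1/2)^2
print(pareto_mean(1.0, 2.0))        # 1*2/(2-1)
print(k_80_20, top_share(0.2, k_80_20))
```

Under this derivation the 80/20 rule corresponds to k = log(5)/log(4), roughly 1.16, a value for which the expectation is finite but the variance is not.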
  5. Mathematical Background: Zipf’s Law
     • A random variable X follows Zipf’s Law (is “Zipfian”) with parameter $\alpha$ when the j’th most popular value of X occurs with probability proportional to $j^{-\alpha}$
       – Essentially the distribution is over the discrete ranks
     • Whenever $\alpha > 1$, X may take an infinite number of values (i.e. have infinitely many different value popularities), since $\sum_j j^{-\alpha}$ then converges
     • Named after the American linguist George Kingsley Zipf (1902-1950), who observed it in the frequencies of words in the English language
       – On a large corpus of English text, the 135 most frequently occurring words accounted for half of the text
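A finite Zipfian distribution is easy to build by normalizing the rank weights. The sketch below is illustrative only: the vocabulary size `n = 1000` and exponent `alpha = 1.0` are arbitrary assumptions (not the corpus behind the 135-word observation), and `ranks_covering` is a hypothetical helper name.

```python
def zipf_probabilities(n, alpha):
    """Probabilities of ranks 1..n, each proportional to j^-alpha."""
    weights = [j ** -alpha for j in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def ranks_covering(probs, fraction):
    """Smallest number of top ranks whose total probability reaches `fraction`."""
    cumulative = 0.0
    for count, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= fraction:
            return count
    return len(probs)

# Assumed toy setting: 1000 distinct values, exponent 1.
probs = zipf_probabilities(1000, 1.0)
half_ranks = ranks_covering(probs, 0.5)

print(probs[0] / probs[1])  # rank 1 is 2^alpha times as likely as rank 2
print(half_ranks)           # how few top ranks already cover half the mass
```

The second printed value shows the same qualitative effect as Zipf's word-frequency observation: a small number of top ranks accounts for half of all occurrences.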
  6. Mathematical Background: An Observed Zipfian Sample Implies a Power-Law
     The following analysis is due to Lada Adamic:
     • Assume that N units of wealth (coins) are distributed to M individuals
       – There are N observations of a random variable Y that can take on the discrete values 1,2,…,M
         • $Y_k = j$ (k=1,…,N; j=1,…,M) means that person j got coin k
       – Denote by $X_r$ the number of coins held by the r’th richest individual at the end of the process, so $X_1$ [$X_M$] is the number of coins of the richest [poorest] individual
         • For simplicity, assume that N>>M and the $X_r$’s are all distinct
     • Assume that a perfect Zipfian behavior is observed, i.e. $X_r/N \sim r^{-b}$ for all r=1,…,M
       – This trivially implies $X_r \sim r^{-b}$
  7. Mathematical Background: An Observed Zipfian Sample Implies a Power-Law (cont.)
     • Recap: we distributed N coins to M individuals, and denoted by $X_1$ [$X_M$] the number of coins of the richest [poorest] individual at the end of the process
     • By assuming Zipfian wealth: $X_r \sim r^{-b}$, or $X_r = c\,r^{-b}$
     • Let Z be the random variable of a person’s wealth, i.e. the number of coins a person gets by this process
     • Observation: if the r’th richest person got $X_r$ coins, then exactly r people out of M got $X_r$ coins or more
     • $\Pr[Z \ge X_r] = \Pr[Z \ge c\,r^{-b}] = r/M$
     • Define $y = c\,r^{-b}$, and so $r = (y/c)^{-1/b}$, and so $\Pr[Z \ge y] = y^{-1/b}\,c^{1/b}/M$
     • Hence $\Pr[Z \ge y] \sim y^{-1/b}$, and Z obeys a Power-Law
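The derivation can be checked numerically: start from perfectly Zipfian wealth values, compute the empirical tail probability Pr[Z ≥ y] at each observed wealth level, and fit a line on log-log axes. The slope should come out as -1/b. This is a sketch with assumed toy values for M, c and b; it also ignores the fact that real coin counts are integers.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    return sxy / sxx

# Assumed toy parameters: 500 people, X_r = c * r^-b with b = 0.5.
M, c, b = 500, 1000.0, 0.5
wealth = [c * r ** -b for r in range(1, M + 1)]  # wealth[r-1] is X_r

def tail_probability(y):
    """Empirical Pr[Z >= y]: the fraction of people holding at least y."""
    return sum(1 for x in wealth if x >= y) / M

# At y = X_r exactly r people hold X_r or more, so each tail equals r/M.
tails = [tail_probability(y) for y in wealth]
slope = loglog_slope(wealth, tails)

print(slope, -1 / b)
```

With b = 0.5 the fitted slope should be close to -2, matching the exponent -1/b obtained in the last two bullets of the slide.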
  8. Distribution of Inlinks
     * Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.
     A plot of the number of nodes having each value of in-degree. Both axes are in log-scale.
     Denoting the size of the sample crawl by N (over 200M here), we have:
       $\log(N \cdot \Pr[\text{node has in-degree } x]) \approx -a \log x + c$
       $\log(\Pr[\text{node has in-degree } x]) \approx -a \log x + c'$, where $c' = c - \log N$
     which indicates the Power-Law $\Pr[\text{node has in-degree } x] \sim x^{-a}$
     Note that the number of nodes with small in-degree is over-estimated while the number of nodes with very high in-degree is under-estimated
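The step from node counts to probabilities only shifts the intercept of the log-log line by log N and leaves the slope -a unchanged. The sketch below shows this on synthetic counts that follow an exact power law; the exponent 2.1, the scale factor and the degree range are assumptions chosen for illustration, not data from Broder et al.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit ys ~ slope * xs + intercept; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic, noise-free histogram (assumed): count(x) = scale * x^-a for in-degree x.
a_true, scale = 2.1, 1e6
degrees = list(range(1, 101))
counts = [scale * x ** -a_true for x in degrees]
N = sum(counts)  # plays the role of the crawl size

log_x = [math.log(x) for x in degrees]
slope_counts, c = fit_line(log_x, [math.log(n) for n in counts])
slope_probs, c_prime = fit_line(log_x, [math.log(n / N) for n in counts])

print(slope_counts, slope_probs)  # both fits recover the same slope, -a
print(c - c_prime, math.log(N))   # the intercepts differ by log N
```

On real crawl data the counts are noisy, especially in the sparse high-degree tail, so a plain least-squares fit like this one is only a rough first estimate of the exponent.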
  9. More Power-Laws on the Web
     We’ve seen that the in-degree of pages exhibits a Power-Law. Furthermore:
     • Out-degree (somewhat surprising)
     • Degrees of the inter-host graph
     • Number of pages in Web sites
     • Number of visits to Web sites/pages
     • PageRank scores
       – With an exponent very close to that of the in-degree distribution
       – Curiously, degrees in the telephone call graph have the same 2.1 exponent
     • Frequencies of words (as observed by Zipf)
     • Popularities of queries submitted to search engines (will be discussed later in the course)
  10. The Web as a Graph
     Connectivity: is it possible to reach most pages from most pages?
     The Web is a bow-tie!
     The Web graph is also scale-free, fractal: many slices and subgraphs exhibit similar properties.
     Image taken from “Graph Structure in the Web”, Broder et al., WWW’2000.
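A bow-tie decomposition can be sketched on a toy directed graph. The component names SCC, IN and OUT follow Broder et al.; everything else here is an assumption made for illustration: the function names, the `OTHER` bucket that lumps together tendrils and disconnected pieces, and the quadratic way of finding the largest strongly connected component, which is fine for a toy graph but not for a real crawl.

```python
from collections import defaultdict, deque

def reachable(adj, start):
    """All nodes reachable from `start` by following edges in `adj` (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Split the nodes of a non-empty edge list into SCC / IN / OUT / OTHER."""
    fwd, bwd = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
        nodes.update((u, v))

    # A node's SCC is the intersection of its forward and backward reachable sets.
    core, assigned = set(), set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        scc = reachable(fwd, node) & reachable(bwd, node)
        assigned |= scc
        if len(scc) > len(core):
            core = scc

    seed = next(iter(core))
    out_part = reachable(fwd, seed) - core   # reachable from the core
    in_part = reachable(bwd, seed) - core    # can reach the core
    other = nodes - core - in_part - out_part
    return {"SCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}

# Toy graph: a 3-cycle core, one IN page, one OUT page, a tendril and an island.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("i", "a"), ("c", "o"), ("i", "t"), ("x", "y")]
parts = bow_tie(edges)
print(parts)
```

In the toy graph, `i` links into the core, `o` is linked from it, `t` hangs off `i` without reaching the core, and `x`, `y` form a disconnected piece.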
  11. Self-Similarity on the Web
     Dill et al., ACM TOIT 2002
     • Created large Thematically Unified Clusters (TUCs)
       – Pages containing a certain keyword
       – Pages of large Web sites/Intranets
       – Pages containing a geographical reference in the Western US
       – The host graph
     • In general, the TUCs display very similar graph properties, e.g.
       – In/out degree distributions
       – Bow-tie structure (relative sizes of the components)
     • Also discovered that the SCCs of the different TUCs are strongly connected to one another, i.e. it is possible to browse between the TUCs