1 chayes


Workshop on Random Graphs, 24-26 Oct
Talk by Jennifer Chayes

Published in: Technology

  2. 2. Online Networks • Online networks are often massive • WWW has trillions of (static) sites [figure: 3D representation of the WWW by Opte] • Facebook has over a billion users [figure: small piece of FB mapped by Nexus]
  3. 3. Algorithmic Network Questions • Ranking of the sites (e.g., PageRank) • Finding the most influential site (or k most influential sites) under various definitions of influence – Most highly connected – Most influential under a certain model, e.g., KKT independent cascade model • Covering the graph via local moves (the recruiter problem)
  4. 4. Constraints • Limitations on network visibility: e.g., Facebook, LinkedIn, etc. only let you see one or two hops away on the graph [figure: Facebook, LinkedIn]. Need local (approximation) algorithms! • Limitations on compute time, especially relevant for online computation on massive graphs. Want local (approximation) algorithms to be efficient, at the expense of the approximation factor if necessary.
  5. 5. Outline of the Talk I. Network algorithms with local access constraints (Borgs, Brautbar, C, Lucier, Khanna: WINE ’12) – Context: local information algorithms – Algorithms on preferential attachment networks – Algorithms on general networks II. Using locality to get sublinear algorithms without a priori access constraints – PageRank problem (Borgs, Brautbar, C, Teng: WAW ’12 & Internet Mathematics) – Finding the most influential nodes (viral marketing in the independent cascade model) (Borgs, Brautbar, C, Lucier: SODA ’14)
  6. 6. A Networking Problem with Local Access Constraints Goal: Meet the most influential people.
  7. 7. A Networking Problem Goal: Meet the most influential people.
  8. 8. A Networking Problem Goal: Meet the most influential people. Find the highest-degree vertex.
  9. 9. A Networking Problem with Local Access Constraints
  14. 14. Motivating Question How well can a graph algorithm perform when it has only local visibility of the network structure? … on “natural” networks? … as a function of the “level” of visibility?
  15. 15. Online Social Networks Social network applications differ in what is visible: Facebook LinkedIn Orkut, Google+ … Question: what is the impact of this design choice?
  16. 16. Local Algorithms More generally: Search Problems: find the highest-degree node, the most central, … Coverage Problems: minimum dominating set, maximum k-coverage, … Connectivity Problems: shortest path, multicast, … “Local”: Graph topology is revealed locally as the algorithm builds its output set.
  17. 17. Outline of Part I: Algorithms with Local Access Constraints 1. A model of local information algorithms 2. Algorithms for preferential attachment networks 3. Minimum dominating set problem on general networks
  18. 18. Local Information Algorithms • Input: graph G = (V,E), initially unknown. • Output: subset S of the vertices (e.g., find a feasible S minimizing |S|). • Two operations: 1. Add a random node to S. 2. Add any visible node to S. • r-local algorithm: the visible region is all nodes at distance ≤ r from S, plus the induced subgraph, plus the degrees of the outermost nodes. • Note: To map this onto Facebook and LinkedIn, think of r as the distance out from your current set of connections, i.e., your set of friends.
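The visible region of an r-local algorithm can be sketched as a multi-source breadth-first search from S. This is only an illustration: the adjacency-dict representation and the function name `visible_region` are assumptions, not part of the talk, and the sketch returns just the visible node set (not the induced subgraph or the degrees of the outermost nodes).

```python
from collections import deque


def visible_region(adj, S, r):
    """Return all nodes at distance <= r from the set S.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    """
    dist = {s: 0 for s in S}      # BFS distance from the nearest node of S
    queue = deque(S)
    while queue:
        u = queue.popleft()
        if dist[u] == r:          # do not look beyond radius r
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)
```

On a path 0-1-2-3-4, a 1-local algorithm holding S = {0} would see only nodes 0 and 1.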
  19. 19. 1-Local Algorithm
  23. 23. 2-Local Algorithm
  24. 24. 2-Local Algorithm This talk: focus mainly on 1-local algorithms.
  25. 25. Preferential Attachment Networks
  26. 26. Preferential Attachment • Random network growth model [BA’99, BR’00, …] 1. Begin with a small fixed graph (e.g., a clique). 2. Each new node v connects to m ≥ 2 previous nodes at random, with probability proportional to their degrees: Pr[i connects to j] ∝ deg(j). [figure: example network on nodes 1 to 7]
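A minimal sketch of this growth process, under stated assumptions: the seed graph is a clique on nodes 1..m+1, each new node picks m distinct targets (no multi-edges), and n ≥ m + 1. The function name and the endpoint-list trick are illustrative choices, not taken from the talk.

```python
import random


def preferential_attachment(n, m=2, seed=None):
    """Grow an undirected graph on nodes 1..n (requires n >= m + 1)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(1, n + 1)}
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list picks a node with probability proportional to degree.
    endpoints = []
    for i in range(1, m + 2):                 # seed clique on nodes 1..m+1
        for j in range(i + 1, m + 2):
            adj[i].add(j)
            adj[j].add(i)
            endpoints += [i, j]
    for v in range(m + 2, n + 1):
        targets = set()
        while len(targets) < m:               # assumption: m distinct targets
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[v].add(t)
            adj[t].add(v)
            endpoints += [v, t]
    return adj
```

Node labels record arrival order, so node 1 plays the role of the root in the slides that follow.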
  27. 27. Preferential Attachment Properties: • Connected (with high probability) • Small diameter: $O(\log n / \log\log n)$ • Power-law degree sequence: $P(k) \sim k^{-3}$ • Older nodes tend to have higher degree: $E[\deg(i)] = (n/i)^{1/2}$
  28. 28. Finding the Root • Problem: Return a set S containing node 1. • Opportunistic algorithm (1-local): Initialize S to an arbitrary node. While S does not contain node 1: add the node v ∈ N(S) with the largest degree. • Note: it is possible to remove the assumption that the algorithm can detect node 1.
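The opportunistic rule can be sketched as follows. Assumptions in this illustration: the graph is connected and given as an adjacency dict, the algorithm can recognise the root, and ties in degree are broken toward the smaller label purely to make the run deterministic. The names `find_root` and `queries` are hypothetical.

```python
def find_root(adj, start, root=1):
    """Repeatedly add the highest-degree neighbour of S until root is in S."""
    S = {start}
    frontier = set(adj[start])        # N(S): neighbours of S that are not in S
    queries = 0
    while root not in S:
        # Degrees of frontier nodes are visible to a 1-local algorithm.
        v = max(frontier, key=lambda x: (len(adj[x]), -x))
        S.add(v)
        queries += 1
        frontier.discard(v)
        frontier.update(w for w in adj[v] if w not in S)
    return S, queries
```

For example, on the graph with edges 1-2, 1-3, 1-4, 2-3, 3-5 and start node 5, the walk goes 5, 3, 1 and stops after two additions.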
  29. 29. Finding the Root • Theorem: The opportunistic algorithm finds node 1 in $O(\log^4 n)$ queries, with high probability over the random graph process. • Note: a random walk requires $\Omega(n^{1/2})$ queries. [figure: LinkedIn, $O(\log^4 n)$; Facebook, random walk, $\Omega(n^{1/2})$]
  30. 30. Applications • s-t connectivity: $O(\log^4 n)$ – connect s and t to node 1 – can connect k terminals in $O(k \log^4 n)$ • Find the k nodes of largest degree: $O(\log^4 n + k)$ – find node 1, but don’t stop the algorithm
  31. 31. Proof Sketch • Theorem: The greedy algorithm finds node 1 in time $O(\log^4 n)$. • The Hope: the algorithm reaches a node of degree $2^k$ after $k \cdot \mathrm{polylog}\, n$ iterations. • The Problem: reaching a node of high degree does not necessarily imply progress. [figure: bottleneck on the way to node 1]
  32. 32. Proof Sketch • Observation: if there is a path connecting S to node 1, with all nodes of degree ≥ d, then the algorithm never queries a node of degree < d. • Q: How common are these “good” paths? • A: For m ≥ 2, most nodes lie on good paths with constant probability. (Proof: detailed probabilistic analysis) [figure: good path from S to node 1]
  33. 33. General Graphs
  34. 34. Minimum Dominating Set Problem: find smallest set S s.t. N(S) ∪ S = V.
  35. 35. Minimum Dominating Set • Problem: find the smallest set S s.t. N(S) ∪ S = V. • Lower bound: Ω(log Δ) (set cover), where Δ is the max degree • Upper bound: O(log Δ), 3-local [GK’98] • How well can a 1-local algorithm perform?
  36. 36. A local algorithm Greedy Algorithm: Initialize S to a random node. While |D(S)| < n: Add node v ∈ N(S) that maximizes |D(S∪{v})|
  37. 37. A local algorithm Greedy Algorithm: Initialize S to a random node. While |D(S)| < n: Add node v ∈ N(S) that maximizes |D(S∪{v})| [figure: example graph on nodes 1 to 9] Optimal: O(1) Greedy alg: Ω(n)
  38. 38. A local algorithm Greedy-Random Algorithm: Initialize S to a random node. While |D(S)| < n: Add node v ∈ N(S) that maximizes |D(S∪{v})|, then add a random node from N(v) \ D(S). Theorem: The greedy-random algorithm obtains a (1 + 2 log Δ)-approximation (in expectation and whp).
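A rough sketch of the greedy-random loop, under several assumptions that are not stated on the slides: D(S) is taken to be S together with its neighbours, the random step draws from the nodes of N(v) that were not yet dominated before v was added, and the graph is connected (otherwise the loop would not terminate). Function names are hypothetical.

```python
import random


def dominated(adj, S):
    """D(S): the nodes of S together with all their neighbours (assumed meaning)."""
    D = set(S)
    for u in S:
        D |= adj[u]
    return D


def greedy_random_dominating_set(adj, seed=None):
    """Alternate a greedy 1-local step with a random step (connected graph assumed)."""
    rng = random.Random(seed)
    S = {rng.choice(sorted(adj))}
    D = dominated(adj, S)
    while len(D) < len(adj):
        frontier = sorted(D - S)                      # N(S), all visible
        # Greedy step: the frontier node that dominates the most new nodes.
        v = max(frontier, key=lambda x: len(adj[x] - D))
        newly = adj[v] - D                            # N(v) \ D(S), before adding v
        S.add(v)
        D |= adj[v]
        if newly:                                     # random step
            y = rng.choice(sorted(newly))
            S.add(y)
            D |= adj[y]
    return S
```

The random step is what lets the algorithm stumble onto a good but not yet visible node, which is the situation treated as Case 2 in the analysis.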
  39. 39. Analysis Pick v in OPT. We wish to show that our algorithm does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm, and which covers some vertex in N(v). [figure: x and v]
  40. 40. Analysis Pick v in OPT. We wish to show that our algorithm does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm, and which covers some vertex in N(v). Case 1: v was visible. Then x must cover many nodes, due to greediness. [figure: x and v]
  41. 41. Analysis Pick v in OPT. We wish to show that our algorithm does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm, and which covers some vertex in N(v). Case 2: v was not visible. If x covers few nodes, there is a good chance to reveal v on the “random” step. [figure: x and v]
  42. 42. Conclusions from Part I: • Local information algorithms: sequential decisions with limited network visibility. • Many problems can be solved locally and efficiently in preferential attachment and general networks. • The level of visibility can have a strong impact on the approximability of a problem.
  43. 43. Part II: Using Locality to Get Sublinear Algorithms • Sublinear algorithms to find high pagerank nodes • Sublinear algorithms for influence maximization
  44. 44. Sublinear Algorithms for PageRank: Definitions • View the WWW as a directed graph G = (V,E), with V being the webpages and E being the (directed) hyperlinks • PageRank: Do a random walk on the webgraph, restarting at a random site roughly every 1/a steps. The relative weight of a page in the stationary distribution is the PageRank of that page.
  45. 45. PageRank: Definitions • Random walk matrix $M$: $M_{uv} = \frac{1}{d_{out}(u)} A_{uv}$, where $d_{out}(u)$ is the out-degree of vertex $u$ and $A_{uv}$ is the adjacency matrix of the directed graph • RW with restart at $u$: in each step, do one RW step with prob. $1-a$ and jump to $u$ with prob. $a$: $p(u) = a\,\delta_{u,\cdot} + (1-a)\,p(u)\,M$ • Stationary distribution $p$: $p = pM$ • PageRank matrix $P_{uv}$: $P = a\,\mathbf{1} + (1-a)\,P M$ • PageRank: $PR_v = \sum_u P_{uv}$ • Personalized PageRank vector $p(u)$ (always restart at $u$): $p(u) = e_u P = P_{u\cdot}$ • Contribution vector $c(v)$ (all contributions to $v$): $c(v) = e_v P^T = P_{\cdot v}$
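As a worked step, the defining equation of the PageRank matrix can be solved in closed form. The derivation below is a sketch under two assumptions: the constant term is read as a times the identity matrix $I$, and $M$ is row-stochastic (every page has at least one outlink), so that $I - (1-a)M$ is invertible for $0 < a \le 1$.

```latex
\begin{align*}
P = aI + (1-a)\,PM
  &\;\Longrightarrow\; P\,\bigl(I - (1-a)M\bigr) = aI \\
  &\;\Longrightarrow\; P = a\,\bigl(I - (1-a)M\bigr)^{-1}
     = \sum_{k \ge 0} a\,(1-a)^k M^k .
\end{align*}
```

Read entrywise, $P_{uv} = \sum_{k \ge 0} a(1-a)^k (M^k)_{uv}$ is the probability that a walk started at $u$, which stops with probability $a$ before each move, stops at $v$. Under the same assumptions each row of $P$ sums to 1, so the PageRank values $PR_v$ sum to $n$.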
  46. 46. Computing PageRank • Take G = (V,E) with |V| = n and |E| = m. • Significant PageRank Problem (SPP): Find all nodes with $PR > \Delta$, and no nodes with $PR < \Delta/2$. • Previous results on running time: – Power iteration method (Bianchi et al. ’03): of order $m$ – Linear algebra improvement (Langville & Meyer ’04): of order $n$ • Lower bound on running time: $\Omega(n/\Delta)$ (Roughly from $n/\Delta$ sites with $PR = \Delta$, and all other sites with $PR = 0$.) • Approximate SPP: Can we use locality to get an (additive) $\epsilon$-approximation which essentially matches the lower bound, i.e., time $O^*(n/\Delta)$?
  47. 47. Roadmap for Approximate SPP • Steps of Calculation i. Calculate each $P_{uv}$ ii. Each $PR_v$ is the sum of the n terms in the contribution column vector: $PR_v = \sum_u P_{uv}$ iii. Do this for all n points, i.e., all contribution vectors • A priori each step should take time of order $n$
  48. 48. i. Local calculation of $\epsilon$-approximate $P_{uv}$ • Previous results – Deterministic (bad if in-degree unbounded) • Jeh-Widom ’03: of order $(\log n)\,\epsilon^{-1} \cdot$ max-in-degree • Andersen et al. ’06: of order $\epsilon^{-1} \cdot$ max-in-degree – Random • Fogaras et al. ’05: Monte Carlo based approach which removed the dependence on max-in-degree but gave multiplicative rather than additive error, and where the approximation depended on $P_{uv}$ • Our approach – Modification of Fogaras et al. – Handled concentration better to remove the dependence on $P_{uv}$
  49. 49. i. Local calc. of $(\epsilon,\delta)$-approximate $P_{uv}$ • Local method: uses – Terminating Random Walk: a RW which terminates with prob. $a$ and, with prob. $1-a$, moves to a uniformly random outlink of the current node • Note: The probability that a terminating RW starting at $u$ happens to end at $v$ is $P_{uv}$ • Algorithm: – Repeat of order $\epsilon^{-1}\delta^{-2}\log n$ times: – Run a new terminating RW starting at $u$, up to a maximum (capping) length of $\log_{1/(1-a)}(1/\epsilon)$ – If the walk terminates before reaching the capping length, add one to the counter of the node the walk last visited before terminating – Output the average count accumulated at each node • Running time: of order $\epsilon^{-1}\delta^{-2}\log n \,\log \epsilon^{-1}$, i.e., $O^*(\epsilon^{-1})$
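The terminating-random-walk estimator can be sketched as below. This is an illustration only: the number of walks and the capping length are left as plain parameters (`num_walks`, `max_len`) rather than tied to the accuracy parameters on the slide, the graph is assumed to be a dict of non-empty outlink lists, and the function name is hypothetical.

```python
import random


def estimate_ppr_row(out_adj, u, a, num_walks, max_len, seed=None):
    """Monte Carlo estimate of the row P_{u.} via terminating random walks.

    out_adj: dict mapping each node to a non-empty list of outlinks (assumed).
    a: termination (restart) probability, 0 < a <= 1.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(num_walks):
        node = u
        for _ in range(max_len):
            if rng.random() < a:                  # walk terminates here
                counts[node] = counts.get(node, 0) + 1
                break
            node = rng.choice(out_adj[node])      # move to a uniform random outlink
        # A walk that reaches the capping length without terminating is not counted.
    return {v: c / num_walks for v, c in counts.items()}
```

On the two-page cycle 0 → 1 → 0 with a = 0.5, the exact values are $P_{00} = 2/3$ and $P_{01} = 1/3$, which a few thousand walks should approximate closely.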
  50. 50. ii. From Puv to approx. of PRv = Su Puv • Obviously can’t just sum (takes time n) • Alternative: Naïve sampling – Pick L random ui R {1, … , n}. – Check if sum of these L terms, Sui Puiv , is large or small wrt D L/n. • Problem with naïve sampling – To make sure error does not drown out expectation, need  = O(n/D) – To get concentration (Chernoff), need L = O*(n/D) –  Runtime = L O*(1/) = O*(n2/D2) rather than O*(n/D)
  51. 51. ii. Multiscale Sampling • Choose many scales $\epsilon_t = 2^{-t}$ • Estimate how many entries $P_{\cdot v}$ of the contribution vector $c(v)$ lie in the interval $(\epsilon_t, 2\epsilon_t)$ • End up spending most of our time on the estimates of the larger $P_{\cdot v}$ ⇒ lots of work • Estimate whether $PR_v > \Delta$ in running time $O^*(n/\Delta)$.
  52. 52. iii. From the question of $PR_v > \Delta$ for one v to all v • Key: Use sparse matrix methods to do all n columns in parallel. • Maintain running time $O^*(n/\Delta)$.
  53. 53. Conclusion for PageRank • Locality + multiscale analysis + sparse matrix methods ⇒ running time $O^*(n/\Delta)$ to find an approximation of all nodes with significant PageRank $PR_v$ • Running time is sublinear in n for $\Delta = \Theta(n^p)$, $0 < p < 1$.
  54. 54. Final Topic: Sublinear Algorithms for Models of Influence Maximization
  55. 55. Influence Maximization: Definitions • Independent Cascade Model (Kempe, Kleinberg, Tardos ’03), introduced as a model of viral marketing • G = (V,E) directed graph with |V| = n, |E| = m, and edge weights $\{p_e \mid e \in E\}$ • $p_e$ = probability that the infection spreads along e • I(S) = (random) size of the set that is eventually infected starting from seed set $S \subseteq V$ • Problem: For fixed k = |S|, find the seed set S which maximizes the expected influence E[I(S)]
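The quantity E[I(S)] can be estimated by direct simulation of the cascade. The sketch below assumes a dict of weighted out-edges and uses hypothetical function names; it is the naive estimator, not the sublinear method of the talk.

```python
import random


def simulate_cascade(out_edges, seeds, rng):
    """One run of the independent cascade; returns the number of infected nodes.

    out_edges: dict mapping u to a list of (v, p_uv) pairs (assumed format).
    """
    infected = set(seeds)
    frontier = list(infected)
    while frontier:
        newly = []
        for u in frontier:
            # Each newly infected u gets one chance to infect each out-neighbour.
            for v, p in out_edges.get(u, []):
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    newly.append(v)
        frontier = newly
    return len(infected)


def expected_influence(out_edges, seeds, trials=1000, seed=None):
    """Monte Carlo estimate of E[I(S)] as an average over independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_cascade(out_edges, seeds, rng) for _ in range(trials))
    return total / trials
```

Calling an estimator like this inside the greedy algorithm is what makes the oracle-based approach expensive: every marginal-gain evaluation costs many full cascades.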
  56. 56. KKT Model: Previous Results • KKT: E[I(S)] is submodular ⇒ maximizing E[I(S)] can be approximated to within (1 – 1/e) via a greedy algorithm • With oracle access to E[I(S)], the greedy algorithm has runtime O(kn) • Oracle access can be simulated ⇒ total runtime $\Omega(mnk \cdot \mathrm{poly}(\epsilon^{-1}))$, i.e., even on a sparse graph, at least quadratic in n.
  57. 57. Influence Maximization: Our Results • Nearly linear time algorithm: We can find an approximately optimal seed set with an approximation factor of $(1 - 1/e - \epsilon)$ in time $O^*((m + n)\,\epsilon^{-3})$. – Note: There is a lower bound of $\Omega(m + n)$, so this is essentially optimal. • Sublinear time algorithm: We can find an approximately optimal seed set with an approximation factor of $(1/\beta)$ in time $O^*(n\,a(G)/\beta)$, where a(G) is the arboricity* of the graph G. – Taking $\beta = \Theta(n^p)$, 0 < p < 1, we get a time sublinear in n. *The arboricity of G is the minimum number of spanning forests necessary to cover all edges of G. Roughly speaking, arboricity corresponds to the density of the graph.
  58. 58. Key Elements of the Proof • Key Idea: Preprocess G with random sampling ⇒ sparse hypergraph representation which retains the influence characteristics of high-influence nodes – Each hypergraph edge represents a set of nodes influenced by a random node in the transpose graph – Degree of set S in the hypergraph is approximately proportional to the influence of S in the original graph – Allows us to efficiently estimate marginal influence in the original diffusion process with very few samples • Local and applicable in many access models: the only operations are accessing a random vertex and traversing edges incident to a previously accessed vertex
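The hypergraph construction can be sketched as follows: each hyperedge is the set of nodes reached by one cascade run from a random node in the transpose graph, and seeds are then chosen greedily by hypergraph degree. This is a simplified illustration under assumptions (dict of weighted in-edges, a fixed number of samples chosen by the caller, hypothetical function names); it omits the stopping rule and runtime bookkeeping that the actual analysis would need.

```python
import random


def sample_hyperedge(in_edges, nodes, rng):
    """Nodes reached from a uniformly random node by a cascade on reversed edges.

    in_edges: dict mapping v to a list of (u, p_uv) pairs for edges u -> v.
    """
    root = rng.choice(nodes)
    reached = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, []):
            if u not in reached and rng.random() < p:
                reached.add(u)
                stack.append(u)
    return reached


def greedy_seeds(hyperedges, k):
    """Greedy max coverage: repeatedly pick the node in the most uncovered hyperedges."""
    uncovered = list(hyperedges)
    seeds = []
    for _ in range(k):
        counts = {}
        for h in uncovered:
            for u in h:
                counts[u] = counts.get(u, 0) + 1
        if not counts:
            break
        best = max(sorted(counts), key=counts.get)   # ties: smallest label
        seeds.append(best)
        uncovered = [h for h in uncovered if best not in h]
    return seeds
```

A node that appears in many sampled hyperedges is one that many random nodes can be reached from, which is why hypergraph degree tracks influence in the original graph.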
  59. 59. Key Elements of the Proof • Sublinear variant: Construct two possible seed sets: one using a greedy algorithm according to the constructed hypergraph, and the other is a singleton selected at random according to the hypergraph degree distribution
  60. 60. Conclusions • Local network algorithms may either be required due to local information access constraints, or just desirable due to increased runtime efficiency • Recurring elements in the sublinear network algorithms: – Sampling (sometimes at multiple scales) rather than probing all elements – Interspersing greedy steps with random steps to see more of the space – Maintaining locality by using backwards random walks, transposes of matrices, etc. to find large contributors
  61. 61. Conclusions • With local network methods, it is possible to get sublinear time algorithms with reasonable approximation ratios for questions of interest in massive networks: – Finding the most highly connected node or nodes – Finding connections between nodes – Covering the network (“recruiter problem”) – Ranking of sites on the network (significant PageRank problem) – Finding sets of maximum influence in the independent cascade model
  62. 62. Thanks for your attention!