
1 chayes


Workshop on random graphs, 24-26 Oct
Talk by Jennifer Chayes

Published in: Technology



  • 2. Online Networks Online networks are often massive. The WWW has trillions of (static) sites (image: 3D representation of the WWW by Opte). Facebook has over a billion users (image: small piece of FB mapped by Nexus).
  • 3. Algorithmic Network Questions • Ranking of the sites (e.g., PageRank) • Finding the most influential site (or k most influential sites) under various definitions of influence – Most highly connected – Most influential under a certain model, e.g., KKT independent cascade model • Covering the graph via local moves (the recruiter problem)
  • 4. Constraints • Limitations on network visibility: e.g., Facebook, LinkedIn, etc. only let you see one or two hops away on the graph. Need local (approximation) algorithms! • Limitations on compute time, especially relevant for online computation on massive graphs. Want local (approximation) algorithms to be efficient, at the expense of the approximation factor if necessary.
  • 5. Outline of the Talk I. Network algorithms with local access constraints (Borgs, Brautbar, C, Lucier, Khanna: WINE ‘12) – Context: local information algorithms – Algorithms on preferential attachment networks – Algorithms on general networks II. Using locality to get sublinear algorithms without a priori access constraints – PageRank problem (Borgs, Brautbar, C, Teng: WAW ’12 & Internet Mathematics) – Finding the most influential nodes (viral marketing in ind. cascade model) (Borgs, Brautbar, C, Lucier: SODA ‘14)
  • 6. A Networking Problem with Local Access Constraints Goal: Meet the most influential people.
  • 7. A Networking Problem Goal: Meet the most influential people.
  • 8. A Networking Problem Goal: Meet the most influential people. Find the highest-degree vertex.
  • 9. A Networking Problem with Local Access Constraints
  • 14. Motivating Question How well can a graph algorithm perform when it has only local visibility of the network structure? … on “natural” networks? … as a function of the “level” of visibility?
  • 15. Online Social Networks Social network applications differ in what is visible: Facebook, LinkedIn, Orkut, Google+, … Question: what is the impact of this design choice?
  • 16. Local Algorithms More generally: Search problems: find the highest-degree node, the most central node, … Coverage problems: minimum dominating set, maximum k-coverage, … Connectivity problems: shortest path, multicast, … “Local”: the graph topology is revealed locally as the algorithm builds its output set.
  • 17. Outline of Part I: Algorithms with Local Access Constraints 1. A model of local information algorithms 2. Algorithms for preferential attachment networks 3. Minimum dominating set problem on general networks
  • 18. Local Information Algorithms Input: graph G = (V,E), initially unknown. Output: a subset S of the vertices (e.g., find a feasible S minimizing |S|). Two operations: 1. Add a random node to S. 2. Add any visible node to S. r-local algorithm: the visible region consists of all nodes at distance ≤ r from S, plus the induced subgraph, plus the degrees of the outermost nodes. Note: To map this onto questions about Facebook and LinkedIn, think of r as the distance out from your current set of connections, i.e., your set of friends.
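The visible region of an r-local algorithm can be sketched with a breadth-first search from S. This is only an illustration of the definition on the slide: the function name `visible_region` and the adjacency-dict representation `adj` are assumptions made for the example, not part of the talk.

```python
from collections import deque

def visible_region(adj, S, r):
    """Illustrative sketch: what an r-local algorithm sees around the set S.

    adj: dict mapping each node to a list of its neighbours (undirected graph).
    Returns (nodes, edges, boundary_degrees).
    """
    dist = {v: 0 for v in S}          # multi-source BFS: every node of S is at distance 0
    queue = deque(S)
    while queue:
        u = queue.popleft()
        if dist[u] == r:              # do not look beyond distance r
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    nodes = set(dist)
    # induced subgraph on the visible nodes
    edges = {frozenset((u, w)) for u in nodes for w in adj[u] if w in nodes}
    # degrees of the outermost visible nodes (distance exactly r)
    boundary_degrees = {v: len(adj[v]) for v in nodes if dist[v] == r}
    return nodes, edges, boundary_degrees
```

On a path 0-1-2-3-4 with S = {0} and r = 1, the algorithm sees nodes 0 and 1, the edge between them, and the fact that node 1 has degree 2.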
  • 19. 1-Local Algorithm
  • 23. 2-Local Algorithm
  • 24. 2-Local Algorithm This talk: focus mainly on 1-local algorithms.
  • 25. Preferential Attachment Networks
  • 26. Preferential Attachment Random network growth model [BA’99, BR’00, …]: 1. Begin with a small fixed graph (e.g., a clique). 2. Each new node v connects to m ≥ 2 previous nodes at random, with probability proportional to their degrees: Pr[i connects to j] ∝ deg(j).
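A minimal generator for this growth model might look as follows. It is a sketch under stated assumptions: the seed graph is a clique on m + 1 nodes, each new node picks m distinct targets (no multi-edges), and the name `preferential_attachment` and its parameters are made up for the example. The models cited on the slide differ in such details.

```python
import random

def preferential_attachment(n, m=2, seed=None):
    """Illustrative sketch: grow a graph on nodes 1..n (requires n >= m + 1)."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(1, n + 1)}
    # 'endpoints' lists every node once per incident edge, so a uniform draw
    # from it picks a node with probability proportional to its degree.
    endpoints = []
    for i in range(1, m + 2):                 # seed graph: clique on nodes 1..m+1
        for j in range(i + 1, m + 2):
            adj[i].add(j)
            adj[j].add(i)
            endpoints += [i, j]
    for v in range(m + 2, n + 1):
        targets = set()
        while len(targets) < m:               # m distinct degree-proportional targets
            targets.add(rng.choice(endpoints))
        for t in targets:
            adj[v].add(t)
            adj[t].add(v)
            endpoints += [v, t]
    return adj
```

Because older nodes accumulate entries in `endpoints` for longer, they tend to end up with the highest degrees, which is the property the root-finding algorithm exploits.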
  • 27. Preferential Attachment Properties: connected (with high probability); small diameter: O(log n / log log n); power-law degree sequence: P(k) ~ k^{-3}; older nodes tend to have higher degree: E[deg(i)] ~ (n/i)^{1/2}.
  • 28. Finding the Root Problem: return a set S containing node 1. Opportunistic algorithm (1-local): Initialize S to an arbitrary node. While S does not contain node 1: add the node v ∈ N(S) with largest degree. Note: it is possible to remove the assumption that the algorithm can detect node 1.
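The opportunistic algorithm can be sketched in a few lines. The names `find_root`, `adj`, `start` and `root` are hypothetical, and the loop uses the simplifying assumption from the slide that the algorithm can recognise node 1 when it has it.

```python
def find_root(adj, start, root=1):
    """Illustrative sketch of the opportunistic 1-local search.

    adj: dict mapping each node to a list of neighbours.
    Returns (S, number of nodes added after the start node).
    """
    S = {start}
    frontier = set(adj[start])            # N(S): visible neighbours of S
    queries = 0
    while root not in S:
        if not frontier:
            raise ValueError("root is not reachable from the start node")
        # degrees of nodes in N(S) are visible to a 1-local algorithm
        v = max(frontier, key=lambda u: len(adj[u]))
        S.add(v)
        queries += 1
        frontier.discard(v)
        frontier |= set(adj[v]) - S       # newly revealed neighbours
    return S, queries
```

Continuing the loop after the root is found, instead of stopping, gives the "find k nodes of largest degree" variant mentioned among the applications.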
  • 29. Finding the Root Theorem: The opportunistic algorithm finds node 1 in O(log^4 n) queries, with high probability over the random graph process. Note: a random walk requires Ω(n^{1/2}) queries. LinkedIn: O(log^4 n). Facebook: random walk, Ω(n^{1/2}).
  • 30. Applications s-t connectivity: O(log^4 n) - connect s and t to node 1 - can connect k terminals in O(k log^4 n). Find k nodes of largest degree: O(log^4 n + k) - find node 1, but don’t stop the algorithm.
  • 31. Proof Sketch Theorem: The greedy algorithm finds node 1 in time O(log^4 n). The Hope: the algorithm reaches a node of degree 2^k after k · polylog n iterations. The Problem: reaching a node of high degree does not necessarily imply progress. (Figure labels: 1, Bottleneck.)
  • 32. Proof Sketch Observation: if there is a path connecting S to node 1, with all nodes of degree ≥ d, then the algorithm never queries a node of degree < d. Q: How common are these “good” paths? A: For m ≥ 2, most nodes lie on good paths with constant probability. (Proof: detailed prob. analysis.) (Figure labels: 1, S.)
  • 33. General Graphs
  • 34. Minimum Dominating Set Problem: find smallest set S s.t. N(S) ∪ S = V.
  • 35. Minimum Dominating Set Problem: find smallest set S s.t. N(S) ∪ S = V. Let Δ denote the max degree. Lower bound: Ω(log Δ) (set cover). Upper bound: O(log Δ), 3-local [GK’98]. How well can a 1-local algorithm perform?
  • 36. A local algorithm Greedy Algorithm: Initialize S to a random node. While |D(S)| < n: add the node v ∈ N(S) that maximizes |D(S ∪ {v})|.
  • 37. A local algorithm Greedy Algorithm: Initialize S to a random node. While |D(S)| < n: add the node v ∈ N(S) that maximizes |D(S ∪ {v})|. Example (figure): Optimal: O(1); Greedy alg: Ω(n).
  • 38. A local algorithm Greedy-Random Algorithm: Initialize S to a random node. While |D(S)| < n: add the node v ∈ N(S) that maximizes |D(S ∪ {v})|; then add a random node from N(v) \ D(S). Theorem: The greedy-random algorithm obtains a (1 + 2 log Δ) approximation (in expectation and whp).
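A rough sketch of the greedy-random loop is given below. It rests on several assumptions that are not spelled out on the slide: D(S) is taken to mean S together with its neighbours, the random step draws from the neighbours of v that were not yet dominated before v was added, and the graph is connected. The function name `greedy_random_dominating_set` is hypothetical.

```python
import random

def greedy_random_dominating_set(adj, seed=None):
    """Illustrative sketch: alternate a greedy step with a random step.

    adj: dict mapping each node to a list of neighbours (connected graph,
    comparable node labels such as ints).
    """
    rng = random.Random(seed)
    first = rng.choice(sorted(adj))
    S = {first}
    covered = {first} | set(adj[first])            # D(S), assumed to be S plus N(S)
    while len(covered) < len(adj):
        candidates = covered - S                   # N(S): the visible nodes
        # greedy step: the visible node that newly dominates the most nodes
        v = max(candidates, key=lambda u: len(set(adj[u]) - covered))
        newly = set(adj[v]) - covered              # N(v) \ D(S), before adding v
        S.add(v)
        covered |= newly
        if newly:
            # random step: also add one of the nodes v has just dominated
            w = rng.choice(sorted(newly))
            S.add(w)
            covered |= set(adj[w])
    return S
```

The random step is what lets the algorithm stumble onto a good node that pure greed cannot see yet, which is the situation handled by the second case of the analysis.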
  • 39. Analysis Pick v in OPT. We wish to show that our alg. does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm, and which covers some vertex in N(v).
  • 40. Analysis Pick v in OPT. We wish to show that our alg. does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm, and which covers some vertex in N(v). Case 1: v was visible. Then x must cover many nodes due to greediness.
  • 41. Analysis Pick v in OPT. We wish to show that our alg. does not waste too many steps covering N(v). Consider some x chosen greedily by the algorithm, and which covers some vertex in N(v). Case 2: v was not visible. If x covers few nodes, there is a good chance to reveal v on the “random” step.
  • 42. Conclusions from Part I: • Local information algorithms: sequential decisions with limited network visibility. • Many problems can be solved locally and efficiently in preferential attachment and general networks. • The level of visibility can have a strong impact on the approximability of a problem.
  • 43. Part II: Using Locality to Get Sublinear Algorithms • Sublinear algorithms to find high-PageRank nodes • Sublinear algorithms for influence maximization
  • 44. Sublinear Algorithms for PageRank: Definitions • View the WWW as a directed graph G = (V,E), with V being the webpages and E being the (directed) hyperlinks. • PageRank: Do a random walk on the webgraph, restarting at a random site every 1/a steps on average. The relative weight of a page in the stationary distribution is the PageRank of that page.
  • 45. PageRank: Definitions • Random walk matrix M: M_{uv} = (1/d_out(u)) A_{uv}, where d_out(u) is the out-degree of vertex u and A_{uv} is the adjacency matrix of the directed graph. • RW with restart at u: in each step do one RW step with prob. 1 − a, and jump to u with prob. a: p(u) = a δ_{u,·} + (1 − a) p(u) M. • Stationary distribution p: p = pM. • PageRank matrix P_{uv}: P = a·1 + (1 − a) P M (1 = identity matrix). PageRank: PR_v = Σ_u P_{uv}. • Personalized PageRank vector p(u) (always restart at u): p(u) = e_u P = P_{u·}. • Contribution vector c(v) (all contributions to v): c(v) = e_v P^T = P_{·v}.
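For a small graph the PageRank matrix can be computed directly by iterating its defining equation. The sketch below assumes every node has at least one out-link (no dangling nodes) and uses plain Python lists; `pagerank_matrix` and its parameters are names invented for the illustration, and this dense computation takes far more than the sublinear time the talk is after.

```python
def pagerank_matrix(out_links, a=0.15, iters=200):
    """Illustrative sketch: iterate P <- a*I + (1 - a) * P * M.

    out_links: dict mapping each node to a non-empty list of out-neighbours.
    Returns (nodes, P) where P[i][j] approximates P_uv for u = nodes[i], v = nodes[j].
    """
    nodes = sorted(out_links)
    idx = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    # random walk matrix: M_uv = A_uv / d_out(u)
    M = [[0.0] * n for _ in range(n)]
    for u, outs in out_links.items():
        for v in outs:
            M[idx[u]][idx[v]] += 1.0 / len(outs)
    P = [[a if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):                       # error shrinks like (1 - a)^iters
        PM = [[sum(P[i][k] * M[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        P = [[(a if i == j else 0.0) + (1 - a) * PM[i][j] for j in range(n)]
             for i in range(n)]
    return nodes, P
```

Row u of the result is the personalized PageRank vector p(u), column v is the contribution vector c(v), and the column sums are the PageRank values PR_v, which add up to n.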
  • 46. Computing PageRank • Take G = (V,E) with |V| = n and |E| = m. • Significant PageRank Problem (SPP): Find all nodes with PR > Δ, and no nodes with PR < Δ/2. • Previous results on running time: – Power iteration method (Bianchi et al. ‘03): on the order of m – Linear algebra improvement (Langville & Meyer ‘04): on the order of n • Lower bound on running time: Ω(n/Δ) (roughly from n/Δ sites with PR = Δ, and all other sites with PR = 0). • Approximate SPP: Can we use locality to get an (additive) ε-approximation which essentially matches the lower bound, i.e., time O*(n/Δ)?
  • 47. Roadmap for Approximate SPP • Steps of calculation: i. Calculate each P_{uv}. ii. Each PR_v is the sum of the n terms in the contribution column vector: PR_v = Σ_u P_{uv}. iii. Do this for all n points, i.e., all contribution vectors. • A priori each step should take on the order of n time.
  • 48. i. Local calculation of ε-approximate P_{uv} • Previous results – Deterministic (bad if the in-degree is unbounded): • Jeh-Widom ‘03: on the order of (log n) ε^{-1} max-in-degree • Andersen et al. ‘06: on the order of ε^{-1} max-in-degree – Random: • Fogaras et al. ’05: Monte Carlo based approach which removed the dependence on max-in-degree but gave multiplicative rather than additive error, and where the approximation depended on P_{uv} • Our approach – Modification of Fogaras et al. – Handled concentration better to remove the dependence on P_{uv}
  • 49. i. Local calc. of approximate P_{uv} (additive accuracy ε, plus a second accuracy parameter) • Local method: uses a terminating random walk, i.e., a RW which terminates with prob. a and, with prob. 1 − a, moves to a uniformly random outlink of the current node. Note: the probability that a terminating RW starting at u happens to end at v is P_{uv}. • Algorithm: – Repeat a number of times proportional to ε^{-1} log n and to the inverse square of the second accuracy parameter: – Run a new terminating RW starting at u, up to a max (capping) length of log_{1/(1−a)}(1/ε). – If the walk terminates before reaching the capping length, add one to the counter of the node the walk last visited before terminating. – Output the avg. count accumulated at each node. • Running time: proportional to ε^{-1} log n log ε^{-1} and to the inverse square of the second accuracy parameter, i.e., ~ O*(ε^{-1}).
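The terminating-random-walk estimator can be sketched as follows. This is an illustration rather than the authors' implementation: the number of walks `num_walks` is passed in directly instead of being derived from the accuracy parameters, `eps` is used only to set the capping length, every node is assumed to have at least one out-link, and the name `estimate_ppr` is hypothetical.

```python
import math
import random

def estimate_ppr(out_links, u, a, num_walks, eps, seed=None):
    """Illustrative sketch: Monte Carlo estimate of the row P_u. of the PageRank matrix.

    out_links: dict mapping each node to a non-empty list of out-neighbours.
    a: termination (restart) probability, 0 < a < 1.
    """
    rng = random.Random(seed)
    # capping length log_{1/(1-a)}(1/eps): a walk survives this long w.p. at most eps
    cap = math.ceil(math.log(1 / eps) / math.log(1 / (1 - a)))
    counts = {}
    for _ in range(num_walks):
        node = u
        for _ in range(cap):
            if rng.random() < a:                   # walk terminates at the current node
                counts[node] = counts.get(node, 0) + 1
                break
            node = rng.choice(out_links[node])     # otherwise follow a random out-link
        # walks that hit the cap without terminating are simply discarded
    return {v: c / num_walks for v, c in counts.items()}
```

Each walk only touches the out-links of the nodes it visits, so the cost per walk does not depend on n or on the maximum in-degree.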
  • 50. ii. From Puv to approx. of PRv = Su Puv • Obviously can’t just sum (takes time n) • Alternative: Naïve sampling – Pick L random ui R {1, … , n}. – Check if sum of these L terms, Sui Puiv , is large or small wrt D L/n. • Problem with naïve sampling – To make sure error does not drown out expectation, need  = O(n/D) – To get concentration (Chernoff), need L = O*(n/D) –  Runtime = L O*(1/) = O*(n2/D2) rather than O*(n/D)
  • 51. ii. Multiscale Sampling • Choose many scales 2^{-t} • Estimate how many entries P_{·v} of the contribution vector c(v) lie in the interval (2^{-t}, 2·2^{-t}) • End up spending most of our time on the estimates of the larger P_{·v} ⇒ lots of work • Estimate whether PR_v > Δ in running time O*(n/Δ).
  • 52. iii. From the question of PR_v > Δ for one v to all v • Key: Use sparse matrix methods to do all n columns in parallel. • Maintain running time O*(n/Δ).
  • 53. Conclusion for PageRank Locality + multiscale analysis + sparse matrix methods ⇒ running time O*(n/Δ) to find an approximation of all nodes with significant PageRank PR_v. Running time is sublinear in n for Δ on the order of n^p, 0 < p < 1.
  • 54. Final Topic: Sublinear Algorithms for Models of Influence Maximization
  • 55. Influence Maximization: Definitions Independent Cascade Model (Kempe, Kleinberg, Tardos ‘03), introduced as a model of viral marketing • G = (V,E) directed graph with |V| = n, |E| = m, and edge weights {p_e | e ∈ E} • p_e = probability that the infection spreads along e • I(S) = (random) size of the set that is eventually infected starting from seed set S ⊆ V. Problem: For fixed k = |S|, find the seed set S which maximizes the expected influence E[I(S)].
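One run of the independent cascade process, and a Monte Carlo estimate of E[I(S)], can be sketched as below. The names `simulate_cascade` and `expected_influence` and the edge representation (a dict from node to a list of `(neighbour, probability)` pairs) are assumptions made for the example.

```python
import random

def simulate_cascade(edges, seeds, rng):
    """Illustrative sketch: one run of the independent cascade, returns I(S).

    edges: dict mapping node u to a list of (v, p_e) pairs for edges u -> v.
    """
    infected = set(seeds)
    frontier = list(infected)
    while frontier:
        u = frontier.pop()
        # each newly infected node gets exactly one chance per outgoing edge
        for v, p in edges.get(u, ()):
            if v not in infected and rng.random() < p:
                infected.add(v)
                frontier.append(v)
    return len(infected)

def expected_influence(edges, seeds, trials=1000, seed=None):
    """Average I(S) over independent runs as an estimate of E[I(S)]."""
    rng = random.Random(seed)
    return sum(simulate_cascade(edges, seeds, rng) for _ in range(trials)) / trials
```

Estimating E[I(S)] this way for every candidate seed set is what makes the straightforward greedy approach expensive on large graphs.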
  • 56. KKT Model: Previous Results • KKT: E[I(S)] is submodular ⇒ maximizing E[I(S)] can be approximated to within (1 – 1/e) via a greedy algorithm • With oracle access to E[I(S)], the greedy alg. has runtime O(kn) • Oracle access can be simulated ⇒ total runtime on the order of mnk poly(ε^{-1}), i.e., even on a sparse graph, at least quadratic in n.
  • 57. Influence Maximization: Our Results • Nearly linear time algorithm: We can find an approximately optimal seed set with an approximation factor of (1 – 1/e – ε) in time O*((m + n) ε^{-3}). – Note: There is a lower bound of Ω(m + n), so this is essentially optimal. • Sublinear time algorithm: We can find an approximately optimal seed set whose approximation factor is inversely proportional to a trade-off parameter, in time O*(n a(G)) divided by that parameter, where a(G) is the arboricity* of the graph G. – Taking the parameter on the order of n^p, 0 < p < 1, we get a time sublinear in n. *The arboricity of G is the minimum number of spanning forests necessary to cover all edges of G. Roughly speaking, arboricity corresponds to the density of the graph.
  • 58. Key Elements of the Proof • Key idea: Preprocess G with random sampling ⇒ sparse hypergraph representation which retains the influence characteristics of high-influence nodes – Each hypergraph edge represents a set of nodes influenced by a random node in the transpose graph – The degree of a set S in the hypergraph is approximately proportional to the influence of S in the original graph – Allows us to efficiently estimate marginal influence in the original diffusion process with very few samples • Local and applicable in many access models: the only operations are accessing a random vertex and traversing edges incident to a previously accessed vertex
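The hypergraph idea can be illustrated with a small sketch: sample hyperedges by running the cascade backwards from random nodes, then pick seeds greedily by hypergraph degree. This is a simplification under stated assumptions: the number of samples is a fixed parameter `num_samples` rather than being chosen adaptively, and the names `random_hyperedge` and `greedy_seeds` are hypothetical.

```python
import random

def random_hyperedge(in_edges, nodes, rng):
    """Illustrative sketch: nodes reached by a reverse cascade from a random node.

    in_edges: dict mapping node v to a list of (u, p_e) pairs for edges u -> v.
    """
    root = rng.choice(nodes)
    reached = {root}
    stack = [root]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, ()):           # traverse edges backwards
            if u not in reached and rng.random() < p:
                reached.add(u)
                stack.append(u)
    return reached

def greedy_seeds(in_edges, nodes, k, num_samples=2000, seed=None):
    """Pick k seeds by greedy maximum coverage over the sampled hyperedges."""
    rng = random.Random(seed)
    hyperedges = [random_hyperedge(in_edges, nodes, rng) for _ in range(num_samples)]
    seeds = []
    for _ in range(k):
        degree = {}
        for h in hyperedges:
            for u in h:
                degree[u] = degree.get(u, 0) + 1
        if not degree:
            break
        best = max(sorted(degree), key=degree.get)  # highest hypergraph degree
        seeds.append(best)
        # hyperedges already hit by a seed no longer count towards marginal gain
        hyperedges = [h for h in hyperedges if best not in h]
    return seeds
```

A node that appears in many hyperedges is one that would have infected many randomly chosen nodes, which is why hypergraph degree tracks influence.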
  • 59. Key Elements of the Proof • Sublinear variant: Construct two possible seed sets: one using a greedy algorithm according to the constructed hypergraph, and the other is a singleton selected at random according to the hypergraph degree distribution
  • 60. Conclusions • Local network algorithms may either be required due to local information access constraints, or just desirable due to increased runtime efficiency • Recurring elements in the sublinear network algorithms: – Sampling (sometimes at multiple scales) rather than probing all elements – Interspersing greedy steps with random steps to see more of the space – Maintaining locality by using backwards random walks, transposes of matrices, etc. to find large contributors
  • 61. Conclusions • With local network methods, it is possible to get sublinear time algorithms with reasonable approximation ratios for questions of interest in massive networks: – Finding the most highly connected node or nodes – Finding connections between nodes – Covering the network (“recruiter problem”) – Ranking of sites on the network (significant PageRank problem) – Finding sets of maximum influence in the independent cascade model
  • 62. Thanks for your attention!