
- 1. DrunkardMob: Billions of Random Walks on Just a PC
  - Aapo Kyrola, Carnegie Mellon University. Twitter: @kyrpov
  - Big Data – small machine
  - Thanks: A major part of this work was done during a visit to Twitter’s Personalization and Recommendations team (Fall 2012).
  - DrunkardMob - RecSys '13
- 2. This work in a nutshell
  1. Background: Random walk-based methods are popular in recommender systems.
  2. Research problem: How do you simulate random walks if your graph does not fit in memory?
  3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.
- 3. Contents
  - Introduction to random walks
  - Disk-based graph systems: GraphChi
  - DrunkardMob algorithm
  - Experiments
  - All code available on GitHub: http://github.com/graphchi/graphchi-java
- 4. Introduction: Random Walks
  - Graph: G(V, E), where V = vertices / nodes and E = edges / links.
  - A walk is a random sequence of t visits to vertices: w := source(0) → v(1) → v(2) → v(3) → … → v(t)
  - Walks follow edges by default, but can also reset or teleport with a certain probability.
    - Transition probability: P(v(k+1) | v(k))
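
The walk definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the adjacency-dict representation, the function name `random_walk`, the `reset_prob` parameter, and the choice to reset at dead ends are all assumptions made for the example.

```python
import random

def random_walk(graph, source, num_hops, reset_prob, rng):
    """Return the vertices visited by one walk of `num_hops` hops from `source`.

    `graph` maps each vertex to a list of out-neighbors (assumed layout).
    """
    visits = [source]
    current = source
    for _ in range(num_hops):
        neighbors = graph.get(current, [])
        # Assumption: a dead end is handled like a reset.
        if not neighbors or rng.random() < reset_prob:
            current = source                 # reset to the source vertex
        else:
            current = rng.choice(neighbors)  # follow a uniformly random out-edge
        visits.append(current)
    return visits

graph = {0: [1, 2], 1: [2], 2: [0], 3: []}
walk = random_walk(graph, source=0, num_hops=10, reset_prob=0.15,
                   rng=random.Random(42))
```

Each entry of `walk` after the first is drawn from the transition probability P(v(k+1) | v(k)): with probability `reset_prob` the walk returns to the source, otherwise it moves to a uniformly chosen neighbor.
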
- 5. Introduction (cont.)
  - Usually we are interested in the distribution of the visits.
    - Either the global distribution or the distribution for each source separately.
    - Many applications (PageRank, FolkRank, SALSA, ...).
  - Can be used to generate candidates:
    - Choose the top K visited vertices as candidates to recommend.
- 6. Example: Global PageRank
  - Model: a random surfer starts from a random web page and clicks each link on the page with uniform probability.
    - With probability d, the surfer teleports to a random vertex, so the walk is infinite.
  - [Figure: from a vertex with three out-links, the walk teleports to “any vertex” with P = d and follows each link with P = (1-d) / 3.]
  - PageRank(web page) ~ authority of the web page.
  - Can be computed very efficiently using “power iteration” (in seconds or minutes, even for graphs with billions of vertices). Not interesting.
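
For reference, power iteration for the random-surfer model can be sketched as below. This is a generic textbook formulation, not code from the talk. Following the slide, `d` is the teleport probability (many texts use `d` for the opposite quantity, the probability of following a link). The adjacency-dict layout, the uniform handling of vertices without out-links, and the fixed iteration count are assumptions.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Power iteration; `graph` maps every vertex to its list of out-neighbors."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Teleport mass: with probability d the surfer jumps to any vertex.
        nxt = {v: d / n for v in graph}
        for v, neighbors in graph.items():
            if neighbors:
                share = (1 - d) * rank[v] / len(neighbors)
                for u in neighbors:
                    nxt[u] += share
            else:
                # Assumption: a vertex with no out-links spreads its mass uniformly.
                for u in graph:
                    nxt[u] += (1 - d) * rank[v] / n
        rank = nxt
    return rank

graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = pagerank(graph)
```

Every iteration touches each edge once, which is why the global computation is cheap compared with computing one such vector per source.
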
- 7. Personalized PageRank
  - PageRank conditioned on home (source) nodes:
    - Compute a PageRank vector for each node separately: the walk resets only to the home node(s).
    - Restrict home nodes to some category / topic / pages visited by a user.
  - Used e.g. for social network recommendations.
  - [Figure: from a vertex with three out-links, the walk resets to the home vertex with P = d and follows each link with P = (1-d) / 3.]
- 8. Personalized PageRank (cont.)
  - Naïve computation of Personalized PageRank (PPR):
    - Compute a PageRank vector for each source separately using power iteration: O(n^2).
  - Approximate by sampling:
    - Simulate actual walks on the graph.
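
The sampling approach can be illustrated with a small Monte Carlo estimator: run many short walks with resets from one source, count visits, and normalize. This is a sketch under assumptions, not the talk's implementation; the names `estimate_ppr` and `top_k`, the adjacency-dict layout, and the parameter values are made up for the example.

```python
import random
from collections import Counter

def estimate_ppr(graph, source, num_walks, hops_per_walk, reset_prob, seed=0):
    """Estimate the PPR vector of `source` from visit counts of short walks."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):
        current = source
        for _ in range(hops_per_walk):
            neighbors = graph[current]
            # Assumption: dead ends are treated like resets.
            if not neighbors or rng.random() < reset_prob:
                current = source
            else:
                current = rng.choice(neighbors)
            visits[current] += 1
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}

def top_k(ppr, source, k):
    """Most-visited vertices other than the source, as recommendation candidates."""
    ranked = sorted((v for v in ppr if v != source), key=lambda v: -ppr[v])
    return ranked[:k]

graph = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
ppr = estimate_ppr(graph, source=0, num_walks=1000, hops_per_walk=10,
                   reset_prob=0.15)
candidates = top_k(ppr, source=0, k=2)
```

Vertex 3 has no incoming path from vertex 0, so it never appears in the estimate for that source.
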
- 9. Random walk in an in-memory graph
  - Compute one walk at a time (multiple in parallel, of course):

        parfor walk in walks:
            for i=1 to …:
                vertex = walk.atVertex()
                walk.takeStep(vertex.randomNeighbor())
- 10. Problem: What if the graph does not fit in memory?
  - Distributed graph systems:
    - Each hop across a partition boundary is costly.
  - Disk-based “single-machine” graph systems (this talk):
    - “Paging” from disk is costly.
  - [Figure: Twitter network visualization, by Akshay Java, 2009]
- 11. DISK-BASED GRAPH SYSTEMS
- 12. Disk-based Graph Systems
  - Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
    - GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
    - TurboGraph (KDD’13)
    - [X-Stream (SOSP’13): model not suitable]
  - We assume a vertex-centric model:
    - Computation is done one vertex at a time.
- 13. GraphChi execution model
  - [Figure: vertices 1 … n are split into interval(1), interval(2), …, interval(P), with a matching shard(1), shard(2), …, shard(P).]
  - Execution loop:

        For T iterations:
            For p=1 to P:
                For vertex in interval(p):
                    updateFunction(vertex)
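
The interval loop on this slide can be mimicked in memory as follows. This is only an illustration of the control flow, not GraphChi's API: `make_intervals` and `run` are hypothetical names, intervals are cut to equal vertex counts for simplicity, and the shard loading that a disk-based system performs is reduced to a comment.

```python
def make_intervals(num_vertices, num_intervals):
    """Split vertex ids 0..num_vertices-1 into at most `num_intervals` contiguous ranges."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    return [range(start, min(start + size, num_vertices))
            for start in range(0, num_vertices, size)]

def run(num_vertices, num_intervals, num_iterations, update_function):
    """Call `update_function(vertex, iteration)` interval by interval."""
    intervals = make_intervals(num_vertices, num_intervals)
    for iteration in range(num_iterations):
        for interval in intervals:
            # A disk-based system would load the shard for this interval here.
            for vertex in interval:
                update_function(vertex, iteration)

order = []
run(10, 3, 2, lambda vertex, iteration: order.append((iteration, vertex)))
intervals = make_intervals(10, 3)
```

With 10 vertices and 3 intervals, the ranges are 0..3, 4..7 and 8..9, and every vertex is updated once per iteration in id order.
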
- 14. DRUNKARDMOB ALGORITHM
  - A random walk is often called a “Drunkard’s Walk”.
- 15. DrunkardMob: Basic Idea
  - By example:
    - Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
      - I.e., 1M different home/source nodes.
    - For each user, launch 1000 random walks (with resets), in parallel.
      - Each walk takes 10 hops ~ equivalent to one 10,000-hop walk (with resets) per user.
    - For each user, keep track of the visits done by its 1000 short walks → PPR for each user.
    - Store the state of each walk in RAM, process the graph from disk.
  - = 1B random walks in parallel, ~5 GB of RAM.
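
The numbers on this slide can be checked with a quick calculation. The 4 bytes per walk figure is taken from the later "Encoding walks" slide; attributing the gap between the resulting 4 GB and the quoted ~5 GB to bookkeeping overhead is an assumption, not something the slides state.

```python
num_users = 1_000_000
walks_per_user = 1_000
hops_per_walk = 10
bytes_per_walk = 4  # from the "Encoding walks" slide

total_walks = num_users * walks_per_user             # walks simulated in parallel
hops_per_user = walks_per_user * hops_per_walk       # total hops spent per user
walk_state_gb = total_walks * bytes_per_walk / 1e9   # raw walk state in GB

print(total_walks, hops_per_user, walk_state_gb)
```
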
- 16. Random walks in GraphChi
  - DrunkardMob algorithm: reverse thinking

        ForEach interval p:
            walkSnapshot = getWalksForInterval(p)
            ForEach vertex in interval(p):
                mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
                ForEach walk in mywalks:
                    walkManager.addHop(walk, vertex.randomNeighbor())

  - Note: Need to store only the current position of each walk!
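
The "reverse thinking" can be sketched in memory: instead of following one walk across the graph, visit each vertex once per iteration and advance every walk currently sitting there. This is a simplified illustration, not the graphchi-java code. The function name `drunkard_mob`, the list-of-lists graph, the reset handling, and the double-buffered walk table (which guarantees exactly one hop per walk per iteration) are assumptions.

```python
import random
from collections import Counter, defaultdict

def drunkard_mob(graph, sources, walks_per_source, num_iterations,
                 num_intervals, reset_prob=0.15, seed=0):
    """Advance all walks one hop per iteration, vertex by vertex.

    `graph[v]` is the out-neighbor list of vertex v (ids 0..n-1).
    Returns per-source visit counts.
    """
    rng = random.Random(seed)
    num_vertices = len(graph)
    size = -(-num_vertices // num_intervals)  # vertices per interval
    # walks[v] holds one source id per walk currently at vertex v;
    # only the current position of each walk is stored.
    walks = defaultdict(list)
    for s in sources:
        walks[s].extend([s] * walks_per_source)
    visits = {s: Counter() for s in sources}

    for _ in range(num_iterations):
        next_walks = defaultdict(list)
        for start in range(0, num_vertices, size):
            interval = range(start, min(start + size, num_vertices))
            for vertex in interval:
                neighbors = graph[vertex]  # stands in for edges streamed from disk
                for source in walks.get(vertex, ()):
                    if not neighbors or rng.random() < reset_prob:
                        nxt = source                 # reset to the walk's source
                    else:
                        nxt = rng.choice(neighbors)  # hop to a random neighbor
                    next_walks[nxt].append(source)
                    visits[source][nxt] += 1
        walks = next_walks
    return visits

graph = [[1, 2], [2], [0], [0]]
visits = drunkard_mob(graph, sources=[0, 3], walks_per_source=100,
                      num_iterations=5, num_intervals=2)
```

The graph is read sequentially once per iteration regardless of how many walks exist, which is the access pattern a disk-based system needs.
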
- 17. WalkManager
  - Store walks in buckets.
    - An array for each vertex would cost too much.
- 18. Encoding walks
  - Only 4 bytes / walk.
  - Keeps track of each path knowledge base applications.
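
One way a walk could fit in 4 bytes is to pack its source index, its position inside a bucket of vertices, and a one-bit flag into a single 32-bit word. The exact layout is not given on the slide, so the bit widths below (24 + 7 + 1) and the names `encode_walk` and `decode_walk` are purely hypothetical.

```python
import struct

SOURCE_BITS = 24   # hypothetical: index of the walk's source
OFFSET_BITS = 7    # hypothetical: vertex offset within a 128-vertex bucket

def encode_walk(source_idx, offset, hop):
    """Pack (source index, in-bucket offset, hop flag) into one 32-bit integer."""
    if not 0 <= source_idx < (1 << SOURCE_BITS):
        raise ValueError("source index out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (source_idx << (OFFSET_BITS + 1)) | (offset << 1) | int(hop)

def decode_walk(word):
    """Inverse of encode_walk."""
    source_idx = word >> (OFFSET_BITS + 1)
    offset = (word >> 1) & ((1 << OFFSET_BITS) - 1)
    return source_idx, offset, bool(word & 1)

word = encode_walk(source_idx=123456, offset=42, hop=True)
packed = struct.pack("<I", word)   # unsigned 32-bit little-endian
decoded = decode_walk(word)
```

Because the bucket already identifies a range of vertices, only the small in-bucket offset needs to be stored per walk, which is what keeps the word this compact in the sketch.
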
- 19. Keeping track of walks
  - [Figure labels: GraphChi; execution interval; vertex walks table (WalkManager); Walk Distribution Tracker (DrunkardCompanion); Source A top-N visits; Source B top-N visits]
- 21. Keeping track of walks
  - If we don’t have enough RAM to store the distributions:
    - Cut long tails: a similar problem to estimating the top-K frequent items in data streams with limited memory.
  - Can also write hops to disk (bucket-by-bucket) and analyze later.
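
The slide does not say which frequent-items method is used. As one well-known option for the limited-memory top-K problem, the Misra-Gries summary keeps at most `capacity` counters and guarantees that any item occurring more than n / (capacity + 1) times in a stream of length n survives. The sketch below is that generic algorithm, swapped in for illustration only.

```python
def misra_gries(stream, capacity):
    """Approximate frequent-item counts using at most `capacity` counters.

    Returned counts never exceed the true counts and undercount by at most
    len(stream) / (capacity + 1).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Visits from one source: vertex 1 seen 50 times, vertex 2 seen 30 times,
# followed by a long tail of 20 vertices seen once each.
visited = [1] * 50 + [2] * 30 + list(range(100, 120))
summary = misra_gries(visited, capacity=2)
print(summary)  # {1: 30, 2: 10}
```

The long tail is discarded while the heavily visited vertices remain, with counts that are lower bounds on the true visit counts.
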
- 22. Validity
  - We assume that simulating 2000 x 5-hop walks with resets ~ one 10000-hop walk with resets.
    - Not exactly the same distribution: some longer streaks are not covered.
      - But those would not be relevant for recommendations anyway!
    - See Fogaras (2005) for analysis.
- 23. Related Work
  - Fogaras, Racz, Csalogany, Sarlos: “Towards scaling fully personalized pagerank: Algorithms, lower bounds, experiments” (2005)
    - Similar idea, with a full external-memory implementation.
      - We keep walks in memory.
  - Plenty of research on approximating PPR.
- 24. EXPERIMENTS
  - See paper for more experiments!
- 25. Case Study: Twitter WTF
  - Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper).
    - Based on the WWW’13 paper by Gupta et al.
    - Use DrunkardMob to generate a set of candidates to recommend for each user.
- 26. PPR: Full Twitter Graph
  - With a large server with SSD and 144 GB of memory: [results figure not preserved in this transcript]
  - On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.
- 27. Runtime / Graph size
  - Running time ~ linear with graph size.
- 28. Comparison to in-memory walks
  - Competitive with in-memory walks.
  - However, if you can fit your graph in memory, there is no need for DrunkardMob.
- 29. Summary
  - DrunkardMob allows simulating random walks efficiently on extremely large graphs.
    - Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
    - Graph size is not limited by RAM.
    - Implement Twitter Who-To-Follow on your laptop!
  - Future work: Adapt to distributed graph systems.
    - Even Hadoop, if you really, really want.
- 30. Thank You!
  - Code: http://github.com/graphchi/graphchi-java
  - Aapo Kyrölä, Ph.D. candidate @ CMU, http://www.cs.cmu.edu/~akyrola, Twitter: @kyrpov
  - Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.
