# DrunkardMob: Billions of Random Walks on Just a PC

Research paper presentation at RecSys explaining how to simulate random walks if your graph does not fit in memory.

Published in: Technology

• ----- Meeting Notes (10/15/13 17:19) -----
• So how would we do this if we could fit the graph in memory?
### DrunkardMob: Billions of Random Walks on Just a PC

1. DrunkardMob: Billions of Random Walks on Just a PC. Aapo Kyrola, Carnegie Mellon University. Twitter: @kyrpov. Big Data – small machine. Thanks: a major part of this work was done during a visit to Twitter’s Personalization and Recommendations team (Fall 2012). DrunkardMob - RecSys '13
2. This work in a nutshell
    1. Background: Random walk-based methods are popular in recommender systems.
    2. Research problem: How to simulate random walks if your graph does not fit in memory?
    3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.
3. Contents
    • Introduction to random walks
    • Disk-based graph systems: GraphChi
    • DrunkardMob algorithm
    • Experiments
    All code available on GitHub: http://github.com/graphchi/graphchi-java
4. Introduction: Random Walks
    • Graph: G(V, E)
        – V = vertices / nodes, E = edges / links.
    • A walk is a random sequence of t visits to vertices: w := source(0) → v(1) → v(2) → v(3) → … → v(t)
    • Walks follow edges by default, but can also reset or teleport with a certain probability.
        – Transition probability: P(v(k+1) | v(k))
5. Introduction (cont.)
    • Usually we are interested in the distribution of the visits.
        – Either the global distribution or one for each source separately.
        – Many applications (PageRank, FolkRank, SALSA, ...)
    • Can be used to generate candidates:
        – Choose the top K visited vertices as candidates to recommend.
6. Example: Global PageRank
    • Model: a random surfer who starts from a random web page and clicks each link on the page with uniform probability:
        – With probability d, teleports to a random vertex → infinite walk.
    • [Figure: from a page with three out-links, the surfer follows each link with P = (1-d) / 3 and teleports to “any vertex” with P = d.]
    • PageRank(web page) ~ authority of the web page.
    • Can be computed using “power iteration” very efficiently (in seconds or minutes, even for graphs with billions of vertices) → not interesting.
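
The power iteration mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `pagerank_power_iteration`, the toy graph, and the default values are assumptions, and the sketch assumes every vertex has at least one out-link. As on the slide, `d` is the teleport probability (many texts use `d` for the opposite quantity, the probability of following a link).

```python
def pagerank_power_iteration(out_neighbors, d=0.15, iterations=50):
    """Global PageRank by power iteration (illustrative sketch).

    out_neighbors: dict mapping each vertex to a non-empty list of out-neighbors.
    d: teleport probability, matching the slide's notation.
    """
    vertices = list(out_neighbors)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}  # start from the uniform distribution
    for _ in range(iterations):
        # Teleport mass: with probability d the surfer jumps to a uniform vertex.
        new_rank = {v: d / n for v in vertices}
        for v in vertices:
            # Remaining mass is split evenly over v's out-links.
            share = (1.0 - d) * rank[v] / len(out_neighbors[v])
            for u in out_neighbors[v]:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: a links to b and c, b links to c, c links to a.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(toy_graph)
```

Each iteration touches every edge once, which is why the global computation is cheap even for very large graphs.
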
7. Personalized PageRank
    • PageRank | home (source) nodes:
        – Compute the PageRank vector for each node separately → resets only to the home node(s).
        – Restrict home nodes to some category / topic / pages visited by a user.
    • Used e.g. for social network recommendations.
    • [Figure: a surfer follows each of three out-links with P = (1-d) / 3 and resets to the home vertex with P = d.]
8. Personalized PageRank (cont.)
    • Naïve computation of Personalized PageRank (PPR):
        – Compute the PageRank vector for each source separately using power iteration: O(n^2)
    • Approximate by sampling:
        – Simulate actual walks on the graph.
9. Random walk in an in-memory graph
    • Compute one walk at a time (multiple in parallel, of course):

        parfor walk in walks:
            for i = 1 to (number of steps):
                vertex = walk.atVertex()
                walk.takeStep(vertex.randomNeighbor())
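
A runnable version of the in-memory loop above might look like this. It is a sketch under stated assumptions: the function name `simulate_walks`, its parameters, and the toy graph are hypothetical, the walks run sequentially rather than in a `parfor`, and a walk resets to its source with probability `reset_prob` (or when it reaches a vertex without out-links).

```python
import random


def simulate_walks(out_neighbors, sources, walks_per_source, num_steps,
                   reset_prob=0.15, seed=0):
    """Simulate short random walks with resets, one walk at a time.

    Returns visits[source][vertex] = number of times walks started at
    `source` landed on `vertex`.
    """
    rng = random.Random(seed)
    visits = {s: {} for s in sources}
    for source in sources:
        counts = visits[source]
        for _ in range(walks_per_source):
            vertex = source
            for _ in range(num_steps):
                neighbors = out_neighbors[vertex]
                if not neighbors or rng.random() < reset_prob:
                    vertex = source          # reset to the home vertex
                else:
                    vertex = rng.choice(neighbors)  # follow a random out-edge
                counts[vertex] = counts.get(vertex, 0) + 1
    return visits


# Hypothetical toy graph as adjacency lists.
toy_graph = {0: [1, 2], 1: [2, 3], 2: [3, 0], 3: [0]}
visits = simulate_walks(toy_graph, sources=[0, 1], walks_per_source=100, num_steps=10)
top_for_0 = sorted(visits[0], key=visits[0].get, reverse=True)[:2]
```

The random access `out_neighbors[vertex]` on every step is cheap in RAM, and it is exactly the operation that becomes expensive once the graph lives on disk.
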
10. Problem: What if the graph does not fit in memory?
    • Disk-based “single-machine” graph systems (this talk):
        – “Paging” from disk is costly.
    • Distributed graph systems:
        – Each hop across a partition boundary is costly.
    • [Image: Twitter network visualization, by Akshay Java, 2009]
11. DISK-BASED GRAPH SYSTEMS
12. Disk-based Graph Systems
    • Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
        – GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
        – TurboGraph (KDD’13)
        – [X-Stream (SOSP’13) – model not suitable]
    • We assume a vertex-centric model:
        – Computation is done one vertex at a time.
13. GraphChi execution model
    • [Figure: vertices 1 … n are split into interval(1), interval(2), …, interval(P), with corresponding shard(1), shard(2), …, shard(P).]

        For T iterations:
            For p = 1 to P:
                For vertex in interval(p):
                    updateFunction(vertex)
14. DRUNKARDMOB ALGORITHM (a random walk is often called a “Drunkard’s Walk”)
15. DrunkardMob: Basic Idea
    • By example:
        – Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
            • I.e. 1M different home/source nodes.
        – For each user, launch 1000 random walks (with resets), in parallel.
            • Each walk takes 10 hops ~ equivalent to one 10,000-hop walk (with resets) per user.
        – For each user, keep track of the visits made by its 1000 short walks → PPR for each user.
        – Store the state of each walk in RAM, process the graph from disk.
    • = 1B random walks in parallel → ~5 GB of RAM.
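
The figures on this slide can be checked with a little arithmetic. The walk and hop counts come from the slide, and the 4 bytes per walk is the figure quoted on the later “Encoding walks” slide. Reading the gap between the resulting 4 GB and the quoted ~5 GB as bookkeeping overhead is an assumption; the slides do not break it down.

$$
10^{6}\ \text{users} \times 10^{3}\ \text{walks/user} = 10^{9}\ \text{walks},
\qquad
10^{3}\ \text{walks/user} \times 10\ \text{hops/walk} = 10^{4}\ \text{hops/user}
$$

$$
10^{9}\ \text{walks} \times 4\ \text{bytes/walk} = 4 \times 10^{9}\ \text{bytes} = 4\ \text{GB}
$$
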
16. Random walks in GraphChi
    • DrunkardMob algorithm
        – Reverse thinking:

        ForEach interval p:
            walkSnapshot = getWalksForInterval(p)
            ForEach vertex in interval(p):
                mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
                ForEach walk in mywalks:
                    walkManager.addHop(walk, vertex.randomNeighbor())

    • Note: Need to store only the current position of each walk!
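
The “reverse thinking” loop above can be mimicked in plain Python. This is a toy sketch, not the GraphChi implementation: the function name `drunkardmob_sketch` and all parameters are hypothetical, the whole graph sits in a list rather than being streamed from disk, and moved walks are collected in a separate buffer so that each walk advances exactly one hop per pass (a simplification chosen for this sketch).

```python
import random
from collections import defaultdict


def drunkardmob_sketch(out_neighbors, interval_size, sources, walks_per_source,
                       num_hops, reset_prob=0.15, seed=0):
    """Advance all walks together, one vertex interval at a time.

    out_neighbors: list indexed by vertex id (0..n-1) of out-neighbor lists.
    Returns visits[source][vertex] = visit count.
    """
    rng = random.Random(seed)
    n = len(out_neighbors)
    num_intervals = (n + interval_size - 1) // interval_size

    # Walk state kept in RAM: per interval, a list of (source, current_vertex).
    buckets = [[] for _ in range(num_intervals)]
    for s in sources:
        buckets[s // interval_size].extend([(s, s)] * walks_per_source)

    visits = defaultdict(lambda: defaultdict(int))

    for _ in range(num_hops):
        next_buckets = [[] for _ in range(num_intervals)]
        for p in range(num_intervals):
            # A disk-based system would load only interval p's edges here.
            walks_at = defaultdict(list)
            for source, vertex in buckets[p]:
                walks_at[vertex].append(source)
            first = p * interval_size
            last = min(first + interval_size, n)
            for vertex in range(first, last):
                neighbors = out_neighbors[vertex]
                for source in walks_at.get(vertex, ()):
                    if not neighbors or rng.random() < reset_prob:
                        dst = source                 # reset to the home vertex
                    else:
                        dst = rng.choice(neighbors)  # hop along a random edge
                    visits[source][dst] += 1
                    next_buckets[dst // interval_size].append((source, dst))
        buckets = next_buckets
    return visits


# Hypothetical toy graph with 4 vertices, split into 2 intervals of 2 vertices.
toy_graph = [[1, 2], [2, 3], [3, 0], [0]]
mob_visits = drunkardmob_sketch(toy_graph, interval_size=2, sources=[0, 1],
                                walks_per_source=100, num_hops=5)
```

The point of the inversion is that edges are read interval by interval in a fixed order, while the per-walk state that needs random access is small enough to stay in memory.
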
17. WalkManager
    • Store walks in buckets
        – An array for each vertex would cost too much.
18. Encoding walks: only 4 bytes / walk. Keeps track of each path → knowledge base applications.
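
One way a walk could fit in 4 bytes when walks are stored in buckets is sketched below. The bit layout here (24 bits for a source index, 8 bits for the vertex offset inside a 256-vertex bucket) is an assumption made for illustration; the slide's own diagram is not preserved in this transcript, and the names `encode_walk`, `decode_walk`, and `add_walk` are hypothetical.

```python
import array

SOURCE_BITS = 24                 # assumed: up to ~16.7M distinct sources
OFFSET_BITS = 8                  # assumed: buckets of 256 consecutive vertices
BUCKET_SIZE = 1 << OFFSET_BITS


def encode_walk(source_index, vertex):
    """Return (bucket, 32-bit word). The bucket id is implied by where the
    word is stored, so only the offset within the bucket goes in the word."""
    if not 0 <= source_index < (1 << SOURCE_BITS):
        raise ValueError("source index does not fit in 24 bits")
    bucket = vertex >> OFFSET_BITS
    offset = vertex & (BUCKET_SIZE - 1)
    return bucket, (source_index << OFFSET_BITS) | offset


def decode_walk(bucket, word):
    """Recover (source_index, vertex) from a bucket id and a stored word."""
    source_index = word >> OFFSET_BITS
    vertex = (bucket << OFFSET_BITS) | (word & (BUCKET_SIZE - 1))
    return source_index, vertex


# Bucketed walk table: bucket id -> compact array of unsigned ints
# (type code "I" is 4 bytes per item on common platforms).
walk_buckets = {}


def add_walk(source_index, vertex):
    bucket, word = encode_walk(source_index, vertex)
    walk_buckets.setdefault(bucket, array.array("I")).append(word)


add_walk(123456, 1000003)
roundtrip = decode_walk(*encode_walk(123456, 1000003))
```

Storing only the offset works because the bucket a word lives in already identifies the high bits of the vertex id.
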
19. Keeping track of walks. [Figure labels: GraphChi; execution interval; vertex walks table (WalkManager); Walk Distribution Tracker (DrunkardCompanion); top-N visits for Source A and Source B.]
20. Keeping track of walks (cont.). [Figure labels: GraphChi; execution interval; vertex walks table (WalkManager); Walk Distribution Tracker (DrunkardCompanion); top-N visits for Source A and Source B.]
21. Keeping track of walks
    • If we don’t have enough RAM to store the distributions:
        – Cut long tails: a similar problem to estimating the top-K frequent items in data streams with limited memory.
    • Can also write hops to disk (bucket-by-bucket) and analyze later.
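
The frequent-items problem this slide refers to has standard small-memory solutions. The sketch below uses the Misra-Gries summary, chosen here purely as an illustration; the slides do not say which method DrunkardMob uses, and the function name and sample stream are hypothetical.

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-items summary using at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive in the result; stored counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical visit stream for one source: two popular vertices plus a long tail.
visit_stream = [7] * 50 + [3] * 30 + list(range(100, 120))
summary = misra_gries(visit_stream, k=4)
```

Applied per source, a summary like this keeps the heavily visited vertices (the recommendation candidates) while the long tail of rarely visited vertices is discarded.
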
22. Validity
    • We assume that simulating 2000 x 5-hop walks with resets ~ one 10,000-hop walk with resets.
        – Not exactly the same distribution: some longer streaks are not covered.
            • But those would not be relevant for recommendations anyway!
        – See Fogaras (2005) for analysis.
23. Related Work
    • Fogaras, Racz, Csalogany, Sarlos: “Towards scaling fully personalized pagerank: Algorithms, lower bounds, experiments” (2005)
        – Similar idea with a full external-memory implementation.
            • We keep walks in memory.
    • Plenty of research on approximating PPR.
24. EXPERIMENTS (see paper for more experiments!)
25. Case Study: Twitter WTF
    • Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper)
        – Based on the WWW’13 paper by Gupta et al.
        – Use DrunkardMob to generate a set of candidates to recommend for each user.
26. PPR: Full Twitter Graph
    • With a large server with SSD and 144 GB of memory: [results chart not preserved in this transcript]
    • On a Mac laptop, could estimate 500K–1M PPRs (= 0.5–1B walks) in roughly the same time.
27. Runtime / Graph size: running time is roughly linear in graph size.
28. Comparison to in-memory walks: competitive with in-memory walks. However, if you can fit your graph in memory, there is no need for DrunkardMob.
29. Summary
    • DrunkardMob allows simulating random walks efficiently on extremely large graphs
        – Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
        – Graph size is not limited by RAM.
        – Implement Twitter Who-To-Follow on your laptop!
    • Future work: Adapt to distributed graph systems.
        – Even Hadoop, if you really, really want.
30. Thank You!
    • Code: http://github.com/graphchi/graphchi-java
    • Aapo Kyrölä, Ph.D. candidate @ CMU, http://www.cs.cmu.edu/~akyrola, Twitter: @kyrpov
    • Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.