DrunkardMob: Billions of Random Walks on Just a PC

Research paper presentation at RecSys explaining how to simulate random walks if your graph does not fit in memory.

  • ----- Meeting Notes (10/15/13 17:19) -----
  • So how would we do this if we could fit the graph in memory?
  • DrunkardMob: Billions of Random Walks on Just a PC

    1. DrunkardMob: Billions of Random Walks on Just a PC
       Aapo Kyrola, Carnegie Mellon University (Twitter: @kyrpov)
       Big Data – small machine
       Thanks: A major part of this work was done during a visit to Twitter’s Personalization and Recommendations team (Fall 2012).
       DrunkardMob - RecSys '13
    2. This work in a Nutshell
       1. Background: Random-walk-based methods are popular in recommender systems.
       2. Research problem: How do you simulate random walks if your graph does not fit in memory?
       3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain the walk states in RAM.
    3. Contents
       • Introduction to random walks
       • Disk-based graph systems: GraphChi
       • DrunkardMob algorithm
       • Experiments
       All code is available on GitHub: http://github.com/graphchi/graphchi-java
    4. Introduction: Random Walks
       • Graph: G(V, E)
         – V = vertices / nodes, E = edges / links.
       • A walk is a random sequence of t visits to vertices: w := source(0) → v(1) → v(2) → v(3) … → v(t)
       • Walks follow edges by default, but can also reset or teleport with a certain probability.
         – Transition probability: P(v(k+1) | v(k))
    5. Introduction (cont.)
       • Usually we are interested in the distribution of the visits.
         – Either the global distribution or the distribution for each source separately.
         – Many applications (PageRank, FolkRank, SALSA, ...)
       • Can be used to generate candidates:
         – Choose the top K visited vertices as candidates to recommend.
    6. Example: Global PageRank
       • Model: a random surfer who starts from a random webpage and clicks each link on the page with uniform probability:
         – With probability d, teleports to a random vertex → infinite walk.
       [Figure: from the current page, the surfer jumps to “any vertex” with P = d, or follows one of three out-links with P = (1-d) / 3 each.]
       • PageRank(web page) ~ authority of the web page.
       • Can be computed using “power iteration” very efficiently (in seconds / minutes even for graphs with billions of vertices) → Not interesting.
    7. Personalized PageRank
       • PageRank conditioned on a set of home (source) nodes:
         – Compute a PageRank vector for each node separately → the walk resets only to the home node(s).
         – Restrict the home nodes to some category / topic / pages visited by a user.
       • Used e.g. for social network recommendations.
       [Figure: the walk resets to the home vertex with P = d, or follows one of three out-links with P = (1-d) / 3 each.]
    8. Personalized PageRank (cont.)
       • Naïve computation of Personalized PageRank (PPR):
         – Compute the PageRank vector for each source separately using power iteration: O(n^2)
       • Approximate by sampling:
         – Simulate actual walks on the graph.
    9. Random walk in an in-memory graph
       • Compute one walk at a time (multiple in parallel, of course):
           parfor walk in walks:
             for i = 1 to (number of hops):
               vertex = walk.atVertex()
               walk.takeStep(vertex.randomNeighbor())
    10. Problem: What if the graph does not fit in memory?
        [Figure: Twitter network visualization, by Akshay Java, 2009]
        • Disk-based “single-machine” graph systems (this talk):
          – “Paging” from disk is costly.
        • Distributed graph systems:
          – Each hop across a partition boundary is costly.
    11. DISK-BASED GRAPH SYSTEMS
    12. Disk-based Graph Systems
        • Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
          – GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
          – TurboGraph (KDD’13)
          – [X-Stream (SOSP’13): model not suitable]
        • We assume a vertex-centric model:
          – Computation is done one vertex at a time.
    13. GraphChi execution model
        [Figure: vertices 1 ... n are split into interval(1), interval(2), ..., interval(P); each interval has a corresponding shard(1), shard(2), ..., shard(P).]
          For T iterations:
            For p = 1 to P:
              For vertex in interval(p):
                updateFunction(vertex)
    14. DRUNKARDMOB ALGORITHM
        A random walk is often called a “Drunkard’s Walk”.
    15. DrunkardMob: Basic Idea
        • By example:
          – Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
            • I.e., 1M different home/source nodes.
          – For each user, launch 1000 random walks (with resets) in parallel.
            • Each walk takes 10 hops ~ equivalent to one 10,000-hop walk (with resets) per user.
          – For each user, keep track of the visits made by its 1000 short walks → PPR for each user.
          – Store the state of each walk in RAM, process the graph from disk.
        = 1B random walks in parallel → ~5 GB of RAM.
    16. Random walks in GraphChi
        • DrunkardMob algorithm
          – Reverse thinking:
              ForEach interval p:
                walkSnapshot = getWalksForInterval(p)
                ForEach vertex in interval(p):
                  mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
                  ForEach walk in mywalks:
                    walkManager.addHop(walk, vertex.randomNeighbor())
        Note: Only the current position of each walk needs to be stored!
    17. WalkManager
        • Store walks in buckets.
          – An array for each vertex would cost too much.
    18. Encoding walks
        • Only 4 bytes / walk.
        • Keeps track of each path → knowledge base applications.
    19. Keeping track of walks
        [Figure: components shown are GraphChi with its execution interval, the vertex walks table (WalkManager), and the Walk Distribution Tracker (DrunkardCompanion) holding top-N visits for Source A, Source B, ...]
    20. Keeping track of walks
        [Figure: components shown are GraphChi with its execution interval, the vertex walks table (WalkManager), and the Walk Distribution Tracker (DrunkardCompanion) holding top-N visits for Source A, Source B, ...]
    21. Keeping track of walks
        • If we don’t have enough RAM to store the distributions:
          – Cut long tails: a similar problem to estimating the top-K frequent items in data streams with limited memory.
        • Can also write hops to disk (bucket by bucket) and analyze them later.
    22. Validity
        • We assume that simulating 2000 x 5-hop walks with resets ~ one 10,000-hop walk with resets.
          – Not exactly the same distribution: some longer streaks are not covered.
            • But those would not be relevant for recommendations anyway!
          – See Fogaras (2005) for analysis.
    23. Related Work
        • Fogaras, Racz, Csalogany, Sarlos: “Towards scaling fully personalized pagerank: Algorithms, lower bounds, experiments” (2005)
          – Similar idea with a full external-memory implementation.
            • We keep walks in memory.
        • Plenty of research in approximating PPR.
    24. EXPERIMENTS
        See the paper for more experiments!
    25. Case Study: Twitter WTF
        • Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper).
          – Based on the WWW’13 paper by Gupta et al.
          – Uses DrunkardMob to generate a set of candidates to recommend for each user.
    26. PPR: Full Twitter Graph
        With a large server with SSD and 144 GB of memory:
        [Results figure not included in the transcript.]
        On a Mac laptop, we could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.
    27. Runtime / Graph size
        [Figure: running time plotted against graph size.]
        Running time is roughly linear in graph size.
    28. Comparison to in-memory walks
        Competitive with in-memory walks. However, if you can fit your graph in memory, there is no need for DrunkardMob.
    29. Summary
        • DrunkardMob allows simulating random walks efficiently on extremely large graphs.
          – Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
          – Graph size is not limited by RAM.
          – Implement Twitter Who-To-Follow on your laptop!
        • Future work: Adapt to distributed graph systems.
          – Even Hadoop, if you really, really want.
    30. Thank You!
        • Code: http://github.com/graphchi/graphchi-java
        Aapo Kyrölä, Ph.D. candidate @ CMU
        http://www.cs.cmu.edu/~akyrola
        Twitter: @kyrpov
        Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.
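
As a companion to slides 8 and 9, here is a minimal Python sketch of the in-memory baseline: short walks with resets from one source, with visit counts used to pick top-K candidates. The function names, the adjacency-dict graph format, and the choice to treat dead ends as resets are all assumptions made for illustration; none of this is the GraphChi API.

```python
import random
from collections import Counter


def simulate_walks(adj, source, num_walks, hops, reset_prob, rng):
    """Count the vertices visited by short random walks that start at `source`.

    adj maps a vertex id to a list of out-neighbors (assumed format).
    With probability reset_prob, or at a dead end, a walk jumps back to `source`.
    """
    visits = Counter()
    for _ in range(num_walks):
        vertex = source
        for _ in range(hops):
            neighbors = adj.get(vertex, [])
            if not neighbors or rng.random() < reset_prob:
                vertex = source  # reset to the home vertex
            else:
                vertex = rng.choice(neighbors)  # follow a uniformly random out-edge
            visits[vertex] += 1
    return visits


def top_k(visits, k):
    """Return the k most visited vertices as recommendation candidates."""
    return [vertex for vertex, _ in visits.most_common(k)]
```

Normalizing the counts by `num_walks * hops` would give the sampled estimate of the source's visit distribution that slide 8 refers to.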
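
The "reverse thinking" loop on slide 16 can be sketched in Python as follows. This is a toy, in-memory illustration under stated assumptions: `adj` stands in for the edges that a disk-based system would load interval by interval, walks are stored only as (current vertex, source) pairs, and moved walks go into a separate `next_walks` table so that every walk takes exactly one hop per pass. That last choice is a simplification of this sketch, not something the slides specify.

```python
import random
from collections import Counter, defaultdict


def drunkardmob_sketch(adj, num_vertices, interval_size, sources,
                       walks_per_source, hops, reset_prob, seed=0):
    """Advance all walks together, one vertex interval at a time."""
    rng = random.Random(seed)
    num_intervals = (num_vertices + interval_size - 1) // interval_size

    # walks[p][v] lists the source ids of the walks currently sitting at vertex v,
    # where v belongs to interval p. Only the current position is kept per walk.
    walks = [defaultdict(list) for _ in range(num_intervals)]
    for s in sources:
        walks[s // interval_size][s].extend([s] * walks_per_source)

    visits = {s: Counter() for s in sources}
    for _ in range(hops):
        next_walks = [defaultdict(list) for _ in range(num_intervals)]
        for p in range(num_intervals):          # "ForEach interval p"
            snapshot = walks[p]                 # walks inside interval p
            for vertex, walk_sources in snapshot.items():
                neighbors = adj.get(vertex, [])  # edges of this vertex
                for s in walk_sources:
                    if not neighbors or rng.random() < reset_prob:
                        dst = s                  # reset to the walk's source
                    else:
                        dst = rng.choice(neighbors)
                    visits[s][dst] += 1
                    next_walks[dst // interval_size][dst].append(s)
        walks = next_walks
    return visits
```

The point of the structure is that the graph is touched strictly in interval order, while the walks, which are much smaller than the graph, are looked up by vertex in RAM.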
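
Slide 18 states that a walk takes only 4 bytes, but the bit layout is not preserved in this transcript. The Python sketch below shows one way such an encoding could work; the layout (24 bits of source index, 7 bits of offset inside a 128-vertex bucket, 1 flag bit) and all names are assumptions for illustration. At 4 bytes per walk, the 1B walks of slide 15 need about 4 GB for the walk words alone, which is in line with the ~5 GB quoted there.

```python
import struct

BUCKET_SIZE = 128  # assumed: vertices per bucket, so an offset fits in 7 bits


def encode_walk(source_idx, offset, flag):
    """Pack a walk into one 32-bit word (hypothetical layout)."""
    if not 0 <= source_idx < (1 << 24):
        raise ValueError("source index must fit in 24 bits")
    if not 0 <= offset < BUCKET_SIZE:
        raise ValueError("offset must fit in 7 bits")
    return (source_idx << 8) | (offset << 1) | (1 if flag else 0)


def decode_walk(word):
    """Return (source_idx, offset, flag) from a packed walk word."""
    return (word >> 8) & 0xFFFFFF, (word >> 1) & 0x7F, bool(word & 1)


def pack_walk(word):
    """Serialize the word as 4 little-endian bytes."""
    return struct.pack("<I", word)
```

In this scheme the absolute vertex id is not stored in the word: it would be recovered as bucket index times `BUCKET_SIZE` plus the offset, with the bucket index implied by where the walk is stored.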
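
Slide 21 compares cutting the long tail of the visit distribution to finding the top-K frequent items in a data stream with limited memory. The slides do not say which method is used, so the Python sketch below substitutes a standard one, the Misra-Gries summary, purely as an illustration; the function name and parameters are assumptions.

```python
def misra_gries(stream, capacity):
    """Approximate heavy hitters using at most `capacity` counters.

    Any item occurring more than len(stream) / (capacity + 1) times is
    guaranteed to survive; stored counts never exceed the true counts.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < capacity:
            counters[item] = 1
        else:
            # No free counter: decrement all counters, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For walk tracking, the stream would be the sequence of vertices visited by one source's walks, and `capacity` would bound the memory spent per source.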