DrunkardMob: Billions of Random Walks on Just a PC
1.
Thanks: A major part of this work was done during a visit to
Twitter’s Personalization and Recommendations team (Fall 2012).
Aapo Kyrola
Carnegie Mellon University
Twitter: @kyrpov
Big Data – small machine
DrunkardMob - RecSys '13
2.
This work in a Nutshell
1. Background: Random walk-based methods are popular in recommender systems.
2. Research problem: How do you simulate random walks if your graph does not fit in memory?
3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain walk states in RAM.
3.
Contents
• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments
All code is available on GitHub:
http://github.com/graphchi/graphchi-java
4.
Introduction: Random Walks
• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.
• A walk is a sequence of t random visits to vertices:
w := source(0) → v(1) → v(2) → v(3) → … → v(t)
• Walks follow edges by default, but can also reset or teleport with a certain probability.
– Transition probability: P(v(k+1) | v(k))
5.
Introduction (cont.)
• Usually we are interested in the distribution of the visits.
– Either the global distribution or the distribution for each source separately.
– Many applications (PageRank, FolkRank, SALSA, ...)
• Can be used to generate candidates:
– Choose the top K visited vertices as candidates to recommend.
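Candidate generation from a visit distribution can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function name `top_k_candidates`, the `exclude` parameter, and the sample visit log are all hypothetical.

```python
from collections import Counter

def top_k_candidates(visits, k, exclude=()):
    """Return the k most-visited vertices, skipping any in `exclude`."""
    counts = Counter(v for v in visits if v not in exclude)
    return [vertex for vertex, _ in counts.most_common(k)]

# Hypothetical log of vertices visited by walks that started at vertex 0.
visits = [3, 5, 3, 0, 7, 5, 3, 9, 5, 3]
print(top_k_candidates(visits, k=2, exclude={0}))  # [3, 5]
```

Excluding the source itself (and, in a real recommender, vertices the user already follows) is a design choice made here for the example.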
6.
Example: Global PageRank
• Model: a random surfer starts from a random web page and clicks each link on the page with uniform probability.
– With probability d, the surfer teleports to a random vertex → infinite walk.
• PageRank(web page) ~ authority of the web page.
• Can be computed very efficiently using “power iteration” (in seconds to minutes, even for graphs with billions of vertices) → not interesting.
[Figure: from the current vertex, the surfer jumps to “any vertex” with P = d, or follows one of three out-edges with P = (1-d) / 3 each.]
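For reference, power iteration for global PageRank can be sketched as follows. This is a minimal in-memory version under stated assumptions: `d` is the teleport probability as on the slide, the graph is a list of out-neighbor lists, and dangling vertices spread their mass uniformly (the slides do not say how they are handled).

```python
def pagerank(out_edges, d=0.15, iterations=50):
    """Power iteration; d is the teleport probability, as on the slide."""
    n = len(out_edges)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [d / n] * n  # mass arriving by teleport
        for v, neighbors in enumerate(out_edges):
            if neighbors:
                share = (1.0 - d) * rank[v] / len(neighbors)
                for u in neighbors:
                    nxt[u] += share
            else:
                # Dangling vertex: spread its mass uniformly (an assumption).
                for u in range(n):
                    nxt[u] += (1.0 - d) * rank[v] / n
        rank = nxt
    return rank

# Hypothetical 3-vertex graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
ranks = pagerank([[1, 2], [2], [0]])
```

Each iteration is one pass over the edges, which is why this global computation is cheap compared with computing a separate vector per source.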
7.
Personalized PageRank
• PageRank given home (source) nodes:
– Compute a PageRank vector for each node separately; the walk resets only to the home node(s).
– Restrict home nodes to some category / topic / pages visited by a user.
• Used e.g. for social network recommendations.
[Figure: from the current vertex, the walk resets to the home vertex with P = d, or follows one of three out-edges with P = (1-d) / 3 each.]
8.
Personalized PageRank (cont.)
• Naïve computation of Personalized PageRank (PPR):
– Compute a PageRank vector for each source separately using power iteration: O(n^2)
• Approximate by sampling:
– Simulate actual walks on the graph.
9.
Random walk in an in-memory graph
• Compute one walk at a time (multiple in parallel, of course):
    parfor walk in walks:
        for i = 1 to t:
            vertex = walk.atVertex()
            walk.takeStep(vertex.randomNeighbor())
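A runnable version of the in-memory loop, extended to estimate personalized PageRank by sampling, might look like the sketch below. The function name, its parameters, and the choice to reset at dead ends are assumptions made for illustration; the loop is sequential where the slide uses `parfor`.

```python
import random
from collections import Counter

def simulate_walks(out_edges, source, num_walks, num_hops, reset_prob, rng):
    """Run short walks with resets from `source`; return visit counts."""
    visits = Counter()
    for _ in range(num_walks):
        vertex = source
        for _ in range(num_hops):
            neighbors = out_edges[vertex]
            # Reset to the source with probability reset_prob, or when stuck.
            if not neighbors or rng.random() < reset_prob:
                vertex = source
            else:
                vertex = rng.choice(neighbors)
            visits[vertex] += 1
    return visits

# Hypothetical graph; vertex 3 cannot be reached from vertex 0.
graph = [[1], [2], [0], [0]]
visits = simulate_walks(graph, source=0, num_walks=100, num_hops=10,
                        reset_prob=0.15, rng=random.Random(42))
total = sum(visits.values())
ppr_estimate = {v: c / total for v, c in visits.items()}
```

Normalizing the visit counts gives the sampled approximation of the PPR vector for that source.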
10.
Problem: What if the graph does not fit in memory?
[Figure: Twitter network visualization, by Akshay Java, 2009]
• Distributed graph systems:
– Each hop across a partition boundary is costly.
• Disk-based “single-machine” graph systems (this talk):
– “Paging” from disk is costly.
11.
DISK-BASED GRAPH
SYSTEMS
12.
Disk-based Graph Systems
• Frameworks that can handle graphs with billions of edges on a single machine, using disk, have recently been proposed:
– GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13): model not suitable]
• We assume a vertex-centric model:
– Computation is done one vertex at a time.
13.
GraphChi execution model
[Figure: vertex ids 1..n are split at boundaries v1, v2, ... into interval(1), interval(2), ..., interval(P), with one shard per interval: shard(1), shard(2), ..., shard(P).]
For T iterations:
    For p = 1 to P:
        For vertex in interval(p):
            updateFunction(vertex)
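The loop structure above can be mimicked in memory to show the order in which vertices are visited. This is only a toy sketch: `make_intervals` and `run` are hypothetical names, intervals here are equal-sized ranges of vertex ids (a simplification), and no shard is actually read from disk.

```python
def make_intervals(num_vertices, num_intervals):
    """Split vertex ids 0..num_vertices-1 into at most num_intervals ranges."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    return [range(lo, min(lo + size, num_vertices))
            for lo in range(0, num_vertices, size)]

def run(num_vertices, num_intervals, iterations, update_function):
    """Call update_function on every vertex, interval by interval."""
    for _ in range(iterations):
        for interval in make_intervals(num_vertices, num_intervals):
            # A disk-based system would load this interval's shard here.
            for vertex in interval:
                update_function(vertex)

order = []
run(num_vertices=10, num_intervals=4, iterations=1,
    update_function=order.append)
print(order)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The point of the structure is that only the data for one interval needs to be in memory at a time.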
14.
DRUNKARDMOB ALGORITHM
A random walk is often called a “drunkard’s walk”.
15.
DrunkardMob: Basic Idea
• By example:
– Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
• I.e., 1M different home/source nodes.
– For each user, launch 1000 random walks (with resets), in parallel.
• Each walk takes 10 hops.
• Roughly equivalent to one 10,000-hop walk (with resets) per user.
– For each user, keep track of the visits made by its 1000 short walks → PPR for each user.
– Store the state of each walk in RAM, process the graph from disk.
= 1B random walks in parallel, in ~5 GB of RAM.
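The sizing in this example can be checked with a few lines of arithmetic. The 4 bytes per walk figure is taken from the “Encoding walks” slide; reading the gap between the raw 4 GB and the quoted ~5 GB as bookkeeping overhead is an assumption, not something the slides state.

```python
users = 1_000_000
walks_per_user = 1_000
hops_per_walk = 10
bytes_per_walk = 4  # per the "Encoding walks" slide

total_walks = users * walks_per_user            # walks alive at once
hops_per_user = walks_per_user * hops_per_walk  # samples per PPR vector
raw_gb = total_walks * bytes_per_walk / 1e9     # walk states only

print(total_walks, hops_per_user, raw_gb)  # 1000000000 10000 4.0
```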
16.
Random walks in GraphChi
• DrunkardMob algorithm
– Reverse thinking: iterate over vertices and move the walks currently at each vertex, instead of following one walk at a time.
    ForEach interval p:
        walkSnapshot = getWalksForInterval(p)
        ForEach vertex in interval(p):
            mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
            ForEach walk in mywalks:
                walkManager.addHop(walk, vertex.randomNeighbor())
Note: Need to store only the current position of each walk!
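The interval-by-interval loop can be illustrated with an in-memory sketch. Everything here is a simplification chosen for the example rather than the GraphChi implementation: walks are stored as `(source, current_vertex)` tuples, intervals are equal-sized vertex ranges, a second buffer ensures each walk takes exactly one hop per iteration, and the graph is a plain list instead of being streamed from disk.

```python
import random
from collections import Counter, defaultdict

def drunkard_mob(out_edges, sources, walks_per_source, iterations,
                 interval_size, reset_prob, rng):
    """Advance all walks together, one vertex interval at a time."""
    n = len(out_edges)
    num_intervals = -(-n // interval_size)  # ceiling division
    # walks[p] holds (source, current_vertex) pairs located in interval p.
    walks = [[] for _ in range(num_intervals)]
    for s in sources:
        walks[s // interval_size].extend([(s, s)] * walks_per_source)
    visits = defaultdict(Counter)  # source -> visit counts

    for _ in range(iterations):
        next_walks = [[] for _ in range(num_intervals)]
        for p in range(num_intervals):
            # A disk-based system would load the edges of interval p here.
            for source, vertex in walks[p]:
                neighbors = out_edges[vertex]
                if not neighbors or rng.random() < reset_prob:
                    nxt = source
                else:
                    nxt = rng.choice(neighbors)
                visits[source][nxt] += 1
                next_walks[nxt // interval_size].append((source, nxt))
        walks = next_walks
    return visits

# Hypothetical graph; vertex 3 is unreachable from sources 0 and 1.
graph = [[1], [2], [0], [0]]
result = drunkard_mob(graph, sources=[0, 1], walks_per_source=50,
                      iterations=10, interval_size=2, reset_prob=0.15,
                      rng=random.Random(1))
```

Only the current position and source of each walk are kept, which matches the note on the slide; the visit counters play the role of the per-source distribution tracking.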
17.
WalkManager
• Store walks in buckets
– An array for each vertex would cost too much.
18.
Encoding walks
• Only 4 bytes / walk.
• Keeps track of each path → knowledge base applications.
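One way a walk could fit in 4 bytes, combined with bucketed storage, is to pack a source index and the walk's offset inside its bucket into a single 32-bit integer. The bit split below (24 bits for the source index, 8 bits for the offset) and the helper names are assumptions for illustration; the slides do not give the actual layout.

```python
SOURCE_BITS = 24  # assumed: index into a table of source vertices
OFFSET_BITS = 8   # assumed: position of the current vertex within its bucket
BUCKET_SIZE = 1 << OFFSET_BITS

def encode(source_idx, offset):
    """Pack a walk into one 32-bit value."""
    assert 0 <= source_idx < (1 << SOURCE_BITS)
    assert 0 <= offset < BUCKET_SIZE
    return (source_idx << OFFSET_BITS) | offset

def decode(walk):
    """Unpack a walk into (source_idx, offset)."""
    return walk >> OFFSET_BITS, walk & (BUCKET_SIZE - 1)

def locate(vertex):
    """Split a vertex id into (bucket number, offset within bucket)."""
    return divmod(vertex, BUCKET_SIZE)

bucket, offset = locate(1000)
walk = encode(source_idx=5, offset=offset)
print(bucket, offset, walk, decode(walk))  # 3 232 1512 (5, 232)
```

Under this scheme the bucket number is implied by where the walk is stored, so only the offset needs to live in the encoded value.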
19.
Keeping track of walks
[Figure: GraphChi with the vertex walks table (WalkManager) for the current execution interval, and the Walk Distribution Tracker (DrunkardCompanion) keeping the top-N visits for each source (Source A, Source B).]
21.
Keeping track of walks
• If we don’t have enough RAM to store the distributions:
– Cut long tails: a similar problem to estimating the top-K frequent items in data streams with limited memory.
• Can also write hops to disk (bucket-by-bucket) and analyze them later.
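As an illustration of the limited-memory frequent-items problem mentioned above, here is the classic Misra-Gries summary. It is one well-known technique for that problem, shown as an example only; the slides do not say which method the walk distribution tracker uses.

```python
def misra_gries(stream, k):
    """Keep at most k-1 counters; any item with frequency > n/k survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical stream of visited vertices: vertex 1 dominates.
stream = [1] * 6 + [2] * 3 + [3, 4, 5]
print(misra_gries(stream, k=3))  # {1: 3}
```

The surviving counts are underestimates, so a second pass (or an accepted error bound) is needed if exact frequencies matter.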
22.
Validity
• We assume that simulating 2000 x 5-hop walks with resets ~ one 10,000-hop walk with resets.
– Not exactly the same distribution: some longer streaks are not covered.
• But those would not be relevant for recommendations anyway!
– See Fogaras (2005) for analysis.
23.
Related Work
• Fogaras, Racz, Csalogany, Sarlos:
“Towards scaling fully personalized
pagerank: Algorithms, lower bounds,
experiments” (2005)
– Similar idea with full external memory
implementation.
• We keep walks in memory.
• Plenty of research in approximating PPR.
24.
EXPERIMENTS
See the paper for more experiments!
25.
Case Study: Twitter WTF
• Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper)
– Based on the WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate a set of candidates to recommend for each user.
26.
PPR: Full Twitter Graph
On a large server with SSD and 144 GB of memory:
[Figure: results chart]
On a Mac laptop, we could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.
27.
Runtime / Graph size
Running time is roughly linear in graph size.
28.
Comparison to in-memory walks
Competitive with in-memory walks. However, if you can fit
your graph in memory, there is no need for DrunkardMob.
29.
Summary
• DrunkardMob allows simulating random walks efficiently on extremely large graphs.
– Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
– Graph size is not limited by RAM.
– Implement Twitter Who-To-Follow on your laptop!
• Future work: Adapt to distributed graph systems.
– Even Hadoop, if you really, really want.