Presentation CC-4007, Large-Scale Machine Learning on Graphs, by Yucheng Low, Joseph Gonzalez and Carlos Guestrin at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
6. The Big Question of Big Learning
How will we design and implement parallel learning systems?
7. MapReduce
for
Data-‐Parallel
ML
Excellent
for
large
data-‐parallel
tasks!
Data-Parallel
MapReduce
Feature
ExtracFon
Cross
ValidaFon
CompuFng
Sufficient
StaFsFcs
Graph-Parallel
Is
there
more
to
Machine
Learning
Graphical
Models
Gibbs
Sampling
Belief
PropagaFon
VariaFonal
Opt.
CollaboraLve
Filtering
Semi-‐Supervised
Learning
?
Tensor
FactorizaFon
Label
PropagaFon
CoEM
Graph
Analysis
PageRank
Triangle
CounFng
8. Estimate Political Bias
Semi-Supervised & Transductive Learning
(Figure: a social graph of posts; a few vertices are labeled Liberal or Conservative, and the labels of the many "?" vertices must be inferred from the graph structure.)
9. Flashback to 1998
First Google advantage: a Graph Algorithm & a System to Support it!
14. Collaborative Filtering: Exploiting Dependencies
Latent Factor Models; Matrix Completion/Factorization Models
(Example movies: Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita)
15. Topic Modeling
Latent Dirichlet Allocation, etc.
(Example words: Cat, Apple, Growth, Hat, Plant)
17. Machine Learning Pipeline
Data → Extract Features → Graph Formation → Structured Machine Learning Algorithm → Value from Data
(Examples: images, docs, movie ratings, social activity; face labels, doc topics; movie recommendations, sentiment analysis)
20. PageRank
What's the rank of this user? It depends on the rank of who follows her, which depends on the rank of who follows them…
Loops in graph ⇒ Must iterate!
21. PageRank Iteration
Iterate until convergence: "My rank is the weighted average of my friends' ranks"

R[i] = α + (1 − α) Σ_{(j,i)∈E} w_ji · R[j]

α is the random reset probability
w_ji is the prob. of transitioning (similarity) from j to i
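The iteration above can be sketched in a few lines of Python (a minimal synchronous sketch; the dict-based graph encoding and the convergence test are my own assumptions, not code from the talk):

```python
def pagerank(in_edges, alpha=0.15, tol=1e-8, max_iter=100):
    """in_edges maps each vertex i to a list of (j, w_ji) pairs for its in-edges."""
    R = {i: 1.0 for i in in_edges}
    for _ in range(max_iter):
        # R[i] = alpha + (1 - alpha) * sum over (j,i) in E of w_ji * R[j]
        new_R = {
            i: alpha + (1 - alpha) * sum(w * R[j] for j, w in nbrs)
            for i, nbrs in in_edges.items()
        }
        # Iterate until convergence
        if max(abs(new_R[i] - R[i]) for i in R) < tol:
            return new_R
        R = new_R
    return R
```

On a 3-cycle with unit weights, every rank converges to 1.0, the fixed point of the equation.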
22. Properties of Graph-Parallel Algorithms
Dependency Graph, Local Updates ("my rank" depends on "friends' ranks"), Iterative Computation
23. The Need for a New Abstraction
Need: Asynchronous, Dynamic Parallel Computations
Data-Parallel (MapReduce): Feature Extraction, Cross Validation, Computing Sufficient Statistics
Graph-Parallel: Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Collaborative Filtering (Tensor Factorization), Semi-Supervised Learning (Label Propagation, CoEM), Data-Mining (PageRank, Triangle Counting)
24. The GraphLab Goals
Know how to solve an ML problem on 1 machine → Efficient parallel predictions
26. Data Graph
Data associated with vertices and edges
Graph:
• Social Network
Vertex Data:
• User profile text
• Current interests estimates
Edge Data:
• Similarity weights
27. How do we program graph computation?
"Think like a Vertex." (Malewicz et al. [SIGMOD'10])
28. Update Functions
User-defined program: applied to a vertex, transforms data in the scope of the vertex

pagerank(i, scope){
  // Get neighborhood data
  (R[i], w_ji, R[j]) ← scope;
  // Update the vertex data
  R[i] ← α + (1 − α) Σ_{j∈N[i]} w_ji × R[j];
  // Reschedule neighbors if needed
  if R[i] changes then
    reschedule_neighbors_of(i);
}

Update function applied (asynchronously) in parallel until convergence
Many schedulers available to prioritize computation ⇒ dynamic computation
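The reschedule-if-changed pattern above is what makes the computation dynamic: work flows only where ranks are still moving. A serial sketch with a simple FIFO scheduler (hypothetical helper names; real GraphLab runs updates in parallel under a consistency model):

```python
from collections import deque

def dynamic_pagerank(in_edges, out_nbrs, alpha=0.15, tol=1e-6):
    """in_edges[i]: list of (j, w_ji) pairs; out_nbrs[i]: out-neighbor ids."""
    R = {i: 0.0 for i in in_edges}       # start away from the fixed point
    queue = deque(in_edges)              # scheduler: all vertices start scheduled
    scheduled = set(in_edges)
    while queue:
        i = queue.popleft()
        scheduled.discard(i)
        new_r = alpha + (1 - alpha) * sum(w * R[j] for j, w in in_edges[i])
        # "if R[i] changes then reschedule neighbors"
        if abs(new_r - R[i]) > tol:
            R[i] = new_r
            for k in out_nbrs[i]:
                if k not in scheduled:
                    scheduled.add(k)
                    queue.append(k)
    return R
</antml```

The queue drains once no vertex changes by more than `tol`, so convergence is reached without ever sweeping the whole graph again.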
29. The GraphLab Framework
Graph-Based Data Representation
Update Functions (User Computation)
Scheduler
Consistency Model
36. Achilles Heel: Idealized Graph Assumption
Assumed: small degree ⇒ easy to partition
But natural graphs have many high-degree vertices (power-law degree distribution) ⇒ very hard to partition
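Not from the talk, but a quick way to see the power-law claim: sample degrees from P(d) ∝ d^(−a) on [1, 10^6] by inverse-CDF sampling (my own illustration; exponent and cutoff are assumptions) and compare the largest degree to the mean. A handful of extremely high-degree vertices dominate, which is exactly what defeats edge-based partitioning.

```python
import random

def sample_power_law_degrees(n, a=2.1, d_max=10**6, seed=0):
    """Inverse-CDF sampling of a truncated continuous power law, floored to ints."""
    rng = random.Random(seed)
    c = 1 - d_max ** (1 - a)
    # F(d) = (1 - d^(1-a)) / c  =>  d = (1 - u*c)^(1/(1-a))
    return [int((1 - rng.random() * c) ** (1 / (1 - a))) for _ in range(n)]
```

Drawing 100,000 degrees this way yields a maximum orders of magnitude above the mean degree.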
38. High-Degree Vertices are Common
Popular Movies (Netflix: Users, Movies) · "Social" People · Hyperparameters · Common Words (e.g., "Obama")
(Figure: LDA graphical model over Docs and Words, with hyperparameters α, β and variables θ, Z, w.)
40. Problem: High-Degree Vertices ⇒ High Communication for Distributed Updates
Data transmitted across network: O(# cut edges)
Natural graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
Popular partitioning tools (Metis, Chaco, …) perform poorly [Abou-Rjeili et al. 06]: extremely slow and require substantial memory
(Figure: a graph split across Machine 1 and Machine 2, with cut edges crossing the network.)
41. Random Placement Cuts Most of the Edges
GraphLab 1, Pregel, Twitter, Facebook, … rely on random (hashed) partitioning for natural graphs.
Theorem 5.1: If vertices are randomly assigned to p machines, then the expected fraction of edges cut is

E[ |Edges Cut| / |E| ] = 1 − 1/p

10 machines ⇒ 90% of edges cut; 100 machines ⇒ 99% of edges cut!
Even if just two machines are used, half the edges will be cut, requiring order |E|/2 communication.
All data is communicated… little advantage over MapReduce.
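The 1 − 1/p claim is easy to check empirically (synthetic edge list, not data from the talk): hash each vertex to one of p machines uniformly at random; an edge is cut whenever its endpoints land on different machines, which happens with probability 1 − 1/p.

```python
import random

def random_cut_fraction(num_vertices, edges, p, seed=0):
    """Fraction of edges cut under uniform random vertex placement on p machines."""
    rng = random.Random(seed)
    machine = [rng.randrange(p) for _ in range(num_vertices)]
    cut = sum(1 for u, v in edges if machine[u] != machine[v])
    return cut / len(edges)
```

On 20,000 random edges with p = 10, the measured fraction comes out very close to 0.9, matching the theorem.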
42. In Summary
GraphLab 1 and Pregel are not well suited for natural graphs:
- Poor performance on high-degree vertices
- Low-quality partitioning
44. Common Pattern for Update Fns.

GraphLab_PageRank(i)
  // Compute sum over neighbors (Gather information about neighborhood)
  total = 0
  foreach( j in in_neighbors(i) ):
    total = total + R[j] * w_ji
  // Update the PageRank (Apply update to vertex)
  R[i] = 0.1 + total
  // Trigger neighbors to run again if R[i] not converged
  // (Scatter: modify edge data & signal vertex-program on neighbors)
  if R[i] not converged then
    foreach( j in out_neighbors(i) )
      signal vertex-program on j
45. GAS Decomposition
Gather (Reduce): accumulate information about the neighborhood; a parallel "sum" Σ = + + … +
Apply: apply the accumulated value Σ to the center vertex Y
Scatter: update adjacent edges and vertices (Y → Y′)
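The three phases above can be sketched for PageRank as follows (illustrative names, not GraphLab's real API). Gather must be a commutative, associative combine so the engine can run it as a parallel reduction; Apply writes the center vertex; Scatter, omitted here, would update edge data and signal out-neighbors:

```python
def gather(acc, nbr_rank, w):
    # Gather: reduce over in-edges with a commutative, associative "+"
    return acc + w * nbr_rank

def apply_vertex(acc):
    # Apply: combine the accumulated value at the center vertex
    return 0.15 + 0.85 * acc

def gas_sweep(in_edges, R):
    """One synchronous GAS sweep; in_edges[i] lists (j, w_ji) pairs."""
    new_R = {}
    for i, nbrs in in_edges.items():
        acc = 0.0
        for j, w in nbrs:
            acc = gather(acc, R[j], w)
        new_R[i] = apply_vertex(acc)
    return new_R
```

Because gather is a plain reduction, the engine is free to partition the in-edge list across machines and combine partial accumulators, which is what makes the model work on high-degree vertices.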
46. Many ML Algorithms Fit into the GAS Model
Graph analytics, inference in graphical models, matrix factorization, collaborative filtering, clustering, LDA, …
47. Minimizing Communication in GL2 PowerGraph: Vertex Cuts
Communication is linear in the # of machines each vertex spans; a vertex-cut minimizes # machines per vertex.
Percolation theory suggests power-law graphs can be split by removing only a small set of vertices [Albert et al. 2000] ⇒ small vertex cuts possible!
GL2 PowerGraph includes novel vertex-cut algorithms, providing order-of-magnitude gains in performance.
48. From the Abstraction to a System

49. Triangle Counting on Twitter Graph
34.8 Billion Triangles
Hadoop [WWW'11]: 1636 machines, 423 minutes
GL2 PowerGraph: 64 machines, 15 seconds
Why? Wrong abstraction: broadcast O(degree²) messages per vertex
S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11
50. Topic Modeling (LDA)
English language Wikipedia: 2.6M documents, 8.3M words, 500M tokens; a computationally intensive algorithm.
(Chart: million tokens per second)
Smola et al. (specifically engineered for this task): 100 Yahoo! machines
GL2 PowerGraph: 64 cc2.8xlarge EC2 nodes, 200 lines of code & 4 human hours
51. How well does GraphLab scale?
Yahoo Altavista Web Graph (2002): one of the largest publicly available webgraphs, with 1.4B webpages and 6.7 billion links.
7 seconds per iteration on 64 HPC nodes; 1B links processed per second; 30 lines of user code.
52. GraphChi: Going Small with GraphLab
Solve huge problems on small or embedded devices?
Key: exploit non-volatile memory (starting with SSDs and HDs)
53. GraphChi: Disk-Based GraphLab
Challenge: random accesses
Novel GraphChi solution: parallel sliding windows method ⇒ minimizes number of random accesses
55. GraphLab 1: ML algorithms as vertex programs; asynchronous execution and consistency models.
GL2 PowerGraph: natural graphs change the nature of computation; vertex cuts and the gather/apply/scatter model.
56. GL2 PowerGraph focused on Scalability at the loss of Usability
57. GraphLab 1

PageRank(i, scope){
  acc = 0
  for (j in InNeighbors) {
    acc += pr[j] * edge[j].weight
  }
  pr[i] = 0.15 + 0.85 * acc
}

Explicitly described operations; code is intuitive
59. GL2 PowerGraph: scalability, but a very rigid abstraction.
GraphLab 1: great flexibility, but hit a scalability wall (many contortions needed to implement SVD++, Restricted Boltzmann Machines).
What now?
61. GL3 WarpGraph Goals
Program like GraphLab 1, run like GraphLab 2
62. Fine-Grained Primitives
Expose neighborhood operations through parallel iterators

R[i] = 0.15 + 0.85 Σ_{(j,i)∈E} w[j,i] · R[j]

PageRankUpdateFunction(Y) {
  Y.pagerank = 0.15 + 0.85 *
    MapReduceNeighbors(
      lambda nbr: nbr.pagerank * nbr.weight,
      lambda (a,b): a + b,      // aggregate sum over neighbors
      neighbors )
}
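The MapReduceNeighbors primitive above is easy to mimic in plain Python (hypothetical names and dict-based vertices; the real GL3 iterator runs the map in parallel and hides remote-access latency):

```python
from functools import reduce

def map_reduce_neighbors(map_fn, reduce_fn, neighbors):
    """Map over each neighbor, then fold the results with reduce_fn."""
    return reduce(reduce_fn, (map_fn(n) for n in neighbors))

def pagerank_update(vertex, in_neighbors):
    vertex["pagerank"] = 0.15 + 0.85 * map_reduce_neighbors(
        lambda nbr: nbr["pagerank"] * nbr["weight"],  # map each in-neighbor
        lambda a, b: a + b,                           # aggregate sum
        in_neighbors)
    return vertex["pagerank"]
```

Because the update function only touches the neighborhood through this iterator, the engine stays free to distribute and reorder the per-neighbor work.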
63. Expressive, Extensible Neighborhood API
Parallel transform of adjacent edges: modify adjacent edges
Broadcast: schedule a selected subset of adjacent vertices
MapReduce over neighbors: parallel sum
64. Can express every GL2 PowerGraph program (more easily) in GL3 WarpGraph
But GL3 is more expressive: multiple gathers, scatter before gather, conditional execution

UpdateFunction(v) {
  if (v.data == 1)
    accum = MapReduceNeighs(g, m)
  else
    ...
}
66. GraphLab 1: ML algorithms as vertex programs; asynchronous execution and consistency models.
GL2 PowerGraph: natural graphs change the nature of computation; vertex cuts and the gather/apply/scatter model.
GL3 WarpGraph: usability is key; access the neighborhood through parallelizable iterators and latency hiding.
69. Exciting Time to Work in ML
("With Big Data, I'll take over the world!!!" · "We met because of Big Data" · "Why won't Big Data read my mind???")
Unique opportunities to change the world!! ☺
But every deployed system is a one-off solution and requires PhDs to make it work…
70. But… Even the basics of scalable ML can be challenging
ML is key to any new service we want to build
6 months from R/Matlab to production, at best
State-of-the-art ML algorithms trapped in research papers
Goal of GraphLab 3: make huge-scale machine learning accessible to all!
78. Now with GraphLab: Learn/Prototype/Deploy
Even the basics of scalable ML can be challenging: 6 months from R/Matlab to production, at best; state-of-the-art ML algorithms trapped in research papers.
Learn ML with the GraphLab Notebook; pip install graphlab, then deploy on EC2/Cluster; fully integrated via GraphLab Toolkits.
79. We're selecting strategic partners
Help define our strategy & priorities, and get the value of GraphLab in your company: partners@graphlab.com
80. C++ GraphLab 2.2 available now: graphlab.com
Beta Program: beta.graphlab.com
Follow us on Twitter: @graphlabteam
Define our future: partners@graphlab.com
Needless to say: jobs@graphlab.com