Large-scale Recommender
Systems on Just a PC
LSRS 2013 keynote
(RecSys ’13 Hong Kong)
Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov
Big Data – small machine
My Background
• Academic: 5th year Ph.D. @ Carnegie Mellon.
Advisors: Guy Blelloch, Carlos Guestrin (UW)
+ Shotgun : Parallel L1-regularized regression solver (ICML 2011).
+ Internships at MSR Asia (2011) and Twitter (2012)
• Startup Entrepreneur
Habbo : founded 2000
Outline of this talk
1. Why single-computer computing?
2. Introduction to graph computation and
GraphChi
3. Recommender systems with GraphChi
4. Future directions & Conclusion
Why use a cluster?
Two reasons:
1. One computer cannot handle my problem in a
reasonable time.
2. I need to solve the problem very fast.
Why use a cluster?
Two reasons:
1. One computer cannot handle my problem in a
reasonable time.
Our work expands the space of feasible (graph) problems on
one machine:
- Our experiments use the same graphs as, or bigger than, previous
papers on distributed graph computation. (+ we can process the
Twitter graph on a laptop)
- Most data is not that "big".
2. I need to solve the problem very fast.
Our work raises the bar on the performance a
"complicated" (distributed) system must deliver to be worth it.
Benefits of single machine systems
Assuming it can handle your big problems…
1. Programmer productivity
– Global state
– Can use “real data” for development
2. Inexpensive to install and administer; lower power consumption.
3. Scalability.
Efficient Scaling
[Figure: throughput over a time window T. A distributed graph system on 6 machines completes about 7 tasks, but doubling to 12 machines gives (significantly) less than 2x throughput. A single-computer system capable of big tasks runs one task per machine: 6 tasks on 6 machines, 12 tasks on 12 machines, i.e. exactly 2x throughput with 2x machines.]
Why graphs for recommender systems?
• Graph = matrix: edge(u,v) = M[u,v]
– Note: always sparse graphs
• Intuitive, human-understandable
representation
– Easy to visualize and explain.
• Unifies collaborative filtering (typically matrix
based) with recommendation in social
networks.
– Random walk algorithms.
• Local view → vertex-centric computation
Vertex-Centric Computational Model
• Graph G = (V, E)
– directed edges: e = (source,
destination)
– each edge and vertex
associated with a value
(user-defined type)
– vertex and edge values can
be modified
• (structure modification also
supported)
[Figure: example graph with vertices A and B; a data value is attached to each vertex and each edge.]
Vertex-centric Programming
• “Think like a vertex”
• Popularized by the Pregel and GraphLab
projects
MyFunc(vertex) { // modify neighborhood }
[Figure: a vertex with its incident edges; each vertex and edge carries a data value that the update function can read and modify.]
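To make the "think like a vertex" pattern concrete, here is a minimal sketch in plain Java. The VertexView interface and all names are hypothetical, invented for illustration; this is not the actual GraphChi API.

// Minimal sketch of the vertex-centric model (hypothetical API, not GraphChi's).
// An update function sees one vertex, its value, and the values on its edges.
interface VertexView {
    int numInEdges();
    int numOutEdges();
    float getVertexValue();
    void setVertexValue(float v);
    float getInEdgeValue(int i);           // value on the i-th in-edge
    void setOutEdgeValue(int i, float v);  // value on the i-th out-edge
}

class AverageNeighborUpdate {
    // "Think like a vertex": average the incoming edge values,
    // store the result on the vertex, and broadcast it on the out-edges.
    public void update(VertexView vertex) {
        float sum = 0f;
        for (int i = 0; i < vertex.numInEdges(); i++) {
            sum += vertex.getInEdgeValue(i);
        }
        float newValue = vertex.numInEdges() > 0 ? sum / vertex.numInEdges() : 0f;
        vertex.setVertexValue(newValue);
        for (int i = 0; i < vertex.numOutEdges(); i++) {
            vertex.setOutEdgeValue(i, newValue);
        }
    }
}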
The Main Challenge of Disk-based
Graph Computation:
Random Access
To achieve "reasonable performance", need roughly 5-10 M random edge accesses / sec.
Hard disk: on the order of 100s of random reads/writes per sec.
SSD: ~100K random reads / sec (commodity), ~1M reads / sec (high-end arrays).
Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
Parallel Sliding Windows
Only P large sequential reads for each interval (sub-graph).
P² reads on one full pass over the graph.
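As a rough worked example (the shard count here is made up for illustration): each interval loads its own shard fully plus one sliding window from each of the other P−1 shards, so one full pass issues about P × P = P² large reads. With P = 16 shards that is only ~256 big sequential reads, instead of billions of small random reads.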
GraphChi Program Execution
For T iterations:
  For p = 1 to P:
    For v in interval(p):
      updateFunction(v)

(From the programmer's point of view, equivalent to:)
For T iterations:
  For v = 1 to V:
    updateFunction(v)

"Asynchronous": updates are immediately visible (vs. bulk-synchronous).
Performance
GraphChi can compute on the
full Twitter follow-graph with
just a standard laptop.
~ as fast as a very large Hadoop cluster!
(size of the graph Fall 2013, > 20B edges [Gupta et al 2013])
GraphChi is Open Source
• C++ and Java versions on GitHub:
http://github.com/graphchi
– Java-version has a Hadoop/Pig wrapper.
• If you really really want to use Hadoop.
Overview of Recommender Systems for
GraphChi
• Collaborative Filtering toolkit (next slide)
• Link prediction in large networks
– Random-walk based approaches (Twitter)
– Talk on Wednesday.
GraphChi’s Collaborative Filtering Toolkit
• Developed by Danny Bickson
(CMU / GraphLab Inc)
• Includes:
– Alternating Least Squares (ALS)
– Sparse-ALS
– SVD++
– LibFM (factorization machines)
– GenSGD
– Item-similarity based methods
– PMF
– CliMF (contributed by Mark Levy)
– …
See Danny's blog for more information:
http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
Note: in the C++ version; a Java version is in development by a CMU team.
Example: Alternating Least Squares
Matrix Factorization (ALS)
• Task: Predict ratings for items (movies) by
users.
• Model:
– Latent factor model (see next slide)
Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale
Parallel Collaborative Filtering for the Netflix Prize” (2008)
ALS: User – Item bipartite graph
[Figure: bipartite graph of users and movies (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita). Each vertex stores a latent factor vector; each edge stores the observed rating.]
A user's rating of a movie is modeled as a dot-product: <factor(user), factor(movie)>.
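For reference, the standard ALS update from the Zhou et al. paper solves a small regularized least-squares problem per user (and symmetrically per movie). In the notation of the slide, with y_i the factors of the movies the user rated, r_ui the ratings, n_u the number of ratings, and λ the regularization weight:

factor(user) = ( Σ_i y_i·y_iᵀ + λ·n_u·I )⁻¹ · Σ_i r_ui·y_i

(Zhou et al. scale λ by n_u, their "weighted-λ-regularization"; a plain λ·I also works.)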
ALS: GraphChi implementation
• Update function handles one vertex at a time
(user or movie)
• For each user:
– Estimate latent(user): minimize least squares of
dot-product predicted ratings
• GraphChi executes the update function for
each vertex (in parallel), and loads edges
(ratings) from disk
– Latent factors in memory: need O(V) memory.
– If factors don't fit in memory, they can be replicated to the edges and thus stored on disk.
Scales to very large problems!
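As a minimal sketch (hypothetical standalone Java, not the actual toolkit code), the per-user step gathers the factors of the movies the user rated, forms the normal equations, and solves the small D×D system:

import java.util.List;

// Sketch of one ALS user update (illustrative only; names and types are hypothetical).
class AlsUserUpdate {
    static final int D = 20;           // number of latent factors
    static final double LAMBDA = 0.065;

    // ratings.get(k) is the rating given to the movie whose factor is movieFactors.get(k)
    static double[] updateUser(List<double[]> movieFactors, List<Double> ratings) {
        double[][] A = new double[D][D];   // A = sum_i y_i y_i^T + lambda * n * I
        double[] b = new double[D];        // b = sum_i r_ui * y_i
        int n = ratings.size();
        for (int k = 0; k < n; k++) {
            double[] y = movieFactors.get(k);
            double r = ratings.get(k);
            for (int i = 0; i < D; i++) {
                b[i] += r * y[i];
                for (int j = 0; j < D; j++) A[i][j] += y[i] * y[j];
            }
        }
        for (int i = 0; i < D; i++) A[i][i] += LAMBDA * n;
        return solve(A, b);                // new factor vector for this user
    }

    // Naive Gaussian elimination with partial pivoting; fine for a 20x20 system.
    static double[] solve(double[][] A, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int row = col + 1; row < n; row++)
                if (Math.abs(A[row][col]) > Math.abs(A[pivot][col])) pivot = row;
            double[] tmpRow = A[col]; A[col] = A[pivot]; A[pivot] = tmpRow;
            double tmp = b[col]; b[col] = b[pivot]; b[pivot] = tmp;
            for (int row = col + 1; row < n; row++) {
                double f = A[row][col] / A[col][col];
                for (int j = col; j < n; j++) A[row][j] -= f * A[col][j];
                b[row] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int row = n - 1; row >= 0; row--) {
            double s = b[row];
            for (int j = row + 1; j < n; j++) s -= A[row][j] * x[j];
            x[row] = s / A[row][row];
        }
        return x;
    }
}

In GraphChi, the rated movies' factors and the ratings would be read from the vertex's edges as the shards stream by.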
ALS: Performance
Matrix Factorization (Alternating Least Squares)
Netflix (99M edges), D=20
[Figure: bar chart of runtime in minutes (scale 0-12) for GraphChi on a Mac Mini vs. GraphLab v1 on 8 cores.]
Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so it should be sub-linear in #ratings).
Example: Item-Based CF
• Task: compute a similarity score (e.g. Jaccard) for each movie pair that has at least one viewer in common.
– Similarity(X, Y) ~ # common viewers
– Output top K similar items for each item to a file.
– … or: create edge between X, Y containing the
similarity.
• Problem: enumerating all pairs takes too
much time.
[Figure: the same user-movie bipartite graph as before.]
Solution: enumerate all triangles of the graph.
New problem: how to enumerate triangles if the graph does not fit in RAM?
Algorithm:
• Let the pivots be a subset of the vertices.
• Load all neighbor-lists (adjacency lists) of the pivots into RAM.
• Now use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots' adjacency lists (similar to a merge).
• Repeat with a new subset of pivots.
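A minimal sketch of the pivot idea in plain Java (hypothetical structures, not the toolkit code): with the pivots' sorted viewer lists in RAM, every streamed movie vertex is intersected against them with a merge-style scan, giving co-viewer counts and, from those, Jaccard scores.

import java.util.Map;

// Illustrative sketch of pivot-based neighborhood intersection (not the actual toolkit code).
class PivotIntersect {
    // Count common neighbors between two sorted adjacency lists (merge-style intersection).
    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { common++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return common;
    }

    // pivots: movieId -> its sorted viewer list, held in RAM.
    // Called for each movie vertex streamed from disk by the outer loop.
    static void compareToPivots(int movieId, int[] viewers, Map<Integer, int[]> pivots) {
        for (Map.Entry<Integer, int[]> pivot : pivots.entrySet()) {
            if (pivot.getKey() == movieId) continue;
            int common = intersectionSize(viewers, pivot.getValue());
            if (common > 0) {
                // Jaccard similarity: |A ∩ B| / |A ∪ B|
                double jaccard = (double) common /
                        (viewers.length + pivot.getValue().length - common);
                // ... emit (movieId, pivotId, jaccard) or keep a top-K heap per movie
            }
        }
    }
}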
Single-Machine Computing in
Production?
• GraphChi supports incremental
computation with dynamic graphs:
– Can keep running indefinitely, adding new edges to the graph → constantly fresh model.
– However, requires engineering – not included
in the toolkit.
• Compare to a cluster-based system (such
as Hadoop) that needs to compute from
scratch.
Unified Recsys Platform for GraphChi?
• Working with master's students at CMU.
• Goal: ability to easily compare different
algorithms, parameters
– Unified input, output.
– General programmable API (not just file-based)
– Evaluation process: Several evaluation metrics;
Cross-validation, held-out data…
– Run many algorithm instances in parallel, on
same graph.
– Java.
• Scalable from the get-go.
[Architecture diagram: a DataDescriptor gives the data definition of the input data (e.g. column1: categorical, column2: real, column3: key, column4: categorical). An algorithm-specific input descriptor (Algorithm X: Input, map(input: DataDescriptor)) feeds the GraphChi preprocessor, which writes the GraphChi input plus auxiliary data to disk. Multiple training programs (Algorithm X, Y, Z) run on the same GraphChi input and produce training metrics; the Algorithm X predictor is evaluated against held-out (test) data to produce test-quality metrics.]
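Purely as an illustration of what such a data definition could look like in Java, here is a hypothetical sketch; the class and field names are invented, since the platform is still being built.

import java.util.ArrayList;
import java.util.List;

// Hypothetical data-definition classes for the unified recsys platform (illustrative only).
enum ColumnType { KEY, CATEGORICAL, REAL }

class ColumnDescriptor {
    final String name;
    final ColumnType type;
    ColumnDescriptor(String name, ColumnType type) { this.name = name; this.type = type; }
}

class DataDescriptor {
    final List<ColumnDescriptor> columns = new ArrayList<>();
    DataDescriptor addColumn(String name, ColumnType type) {
        columns.add(new ColumnDescriptor(name, type));
        return this;
    }
}

class Example {
    public static void main(String[] args) {
        // Mirrors the column layout from the diagram above (names are made up).
        DataDescriptor ratings = new DataDescriptor()
                .addColumn("user", ColumnType.CATEGORICAL)
                .addColumn("rating", ColumnType.REAL)
                .addColumn("item", ColumnType.KEY)
                .addColumn("genre", ColumnType.CATEGORICAL);
        System.out.println("Columns: " + ratings.columns.size());
    }
}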
Recent developments: Disk-based Graph
Computation
• Recently two disk-based graph computation
systems published:
– TurboGraph (KDD’13)
– X-Stream (SOSP’13 in October)
• Significantly better performance than GraphChi on many problems
– They avoid preprocessing ("sharding")
– But GraphChi can do some computations that X-Stream cannot (triangle counting and related); TurboGraph requires an SSD
– Hot research area!
Do you need GraphChi – or any system?
• Heck, for many algorithms, you can just mmap() over your (binary) adjacency list / sparse matrix and write a for-loop (see the sketch after this slide).
– See Lin, Chau, Kang: "Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC" (Big Data '13)
• Obviously good to have a common API
– And some algorithms need more advanced solutions (like GraphChi, X-Stream, TurboGraph)
Beware of the hype!
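A minimal sketch of the mmap approach in Java using java.nio memory mapping. The binary layout assumed here (per vertex: an int degree followed by that many int neighbor ids) is made up for illustration, and files over 2 GB would have to be mapped in chunks.

import java.io.RandomAccessFile;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;

// Sketch: mmap a binary adjacency list and run a simple for-loop over it.
// Assumed (made-up) format: repeated records of [degree:int][neighbor ids:int * degree].
public class MmapDegreeSum {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile(args[0], "r");
             FileChannel channel = file.getChannel()) {
            IntBuffer ints = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
                                    .asIntBuffer();
            long edges = 0;
            int vertices = 0;
            while (ints.hasRemaining()) {
                int degree = ints.get();
                edges += degree;
                vertices++;
                for (int i = 0; i < degree; i++) {
                    int neighbor = ints.get();   // your per-edge computation goes here
                }
            }
            System.out.println(vertices + " vertices, " + edges + " edges");
        }
    }
}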
Conclusion
• Very large recommender algorithms can now
be run on just your PC or laptop.
– Additional performance from multi-core
parallelism.
– Great for productivity – scale by replicating.
• In general, good single-machine scalability requires care with data structures and memory management: natural with C/C++, but with Java (etc.) you need low-level byte massaging.
– Frameworks like GraphChi hide the low-level.
• More work is needed to "productize" the current work.
This talk has two main goals: 1) to challenge, a little, how we think about scalability: in this case, to show how just a single machine, a Mac Mini, can solve very big problems that people often use something like Hadoop for; 2) to talk about GraphChi, which is my research system, and show how to implement recommender systems on it.
HOW MANY KNOW GRAPHLAB? Because of my industry experience working with very large systems, I always focus on very practical solutions. And it is because of this experience with distributed systems that I really understand the benefits of avoiding them!
Let me ask it the other way around: why would you want to use a cluster? Most people do not have multi-terabyte or petabyte datasets.
Let me ask it the other way around: why would you want to use a cluster?
This is a made-up example to illustrate a point. Relate to Netflix off-line. Here we have chosen T to be the time in which the single-machine system, such as GraphChi, solves one task. Let's assume the cluster system needs 6 machines to solve the problem, and does it about 7 times faster than GraphChi. Then in time T it solves 7 tasks, while GraphChi solves 6 tasks using the same six machines (one task per machine). Now if we double the size of the cluster to twelve machines: cluster systems never have linear speedup, so let's assume the performance increases by, say, 50%. Of course these are just made-up numbers, but similar behavior happens at some cut-off point anyway. Now GraphChi will solve exactly twice the number of tasks in time T.
We are not the only ones thinking this way… (Add MSR paper?)
Let's now discuss the computational setting of this work, starting with the basic computational model.
Note about edge-centric?
So as a recap, GraphChi is a disk-based GraphLab. While GraphLab2 is incredibly powerful on big clusters or in the cloud, you can use GraphChi to solve equally big problems on just a Mac Mini. Of course, GraphLab can solve the problems way faster, but I believe GraphChi provides performance that is more than enough for many. (Spin-off of the GraphLab project; disk-based GraphLab; OSDI '12.)
I will now briefly demonstrate why disk-based graph computation was not a trivial problem. Perhaps we can assume it wasn't, because no such system as stated in the goals clearly existed. But it makes sense to analyze why solving the problem required a small innovation, worthy of an OSDI publication. The main problem is stated on the slide: random access, i.e. when you need to read many times from many different locations on disk, is slow. This is especially true with hard drives: seek times are several milliseconds. On SSD, random access is much faster, but still a far cry from the performance of RAM. Let's now study this a bit.
So how does GraphChi work? I don't have time to go into details now. It is based on an algorithm we invented called Parallel Sliding Windows. In this model you split the graph into P shards, and the graph is processed in P parts. For each part you load one shard completely into memory, and load contiguous chunks of data from the other shards. All in all, you need a very small number of random accesses, which are the bottleneck of disk-based computing. GraphChi is good on both SSD and hard drive!
Another, perhaps a bit surprising, motivation comes from thinking about scalability at large scale. The industry wants to compute many tasks on the same graph. For example, to compute personalized recommendations, the same task is computed for people in different countries, different interest groups, etc. Currently you need a cluster just to compute one single task, and to compute tasks faster, you grow the cluster. But this work allows a different way: since one machine can handle one big task, you can dedicate one task per machine. Why does this make sense? Clusters are complex and expensive to scale, while in this new model it is very simple, as nodes do not talk to each other, and you can double the throughput by doubling the machines. There are other motivations as well, such as reducing costs and energy. But let's move on.
Single-machine systems are easy to program, but they currently need specialized solutions, whereas if you use Hadoop etc., you can use the same framework for a wide variety of problems.