
Large-scale Recommendation Systems on Just a PC

My keynote at the Large-Scale Recommender Systems workshop (LSRS) at RecSys 2013.

  1. Large-scale Recommender Systems on Just a PC. LSRS 2013 keynote (RecSys ’13, Hong Kong). Aapo Kyrölä, Ph.D. candidate @ CMU. http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov. Big Data – small machine.
  2. My Background • Academic: 5th-year Ph.D. @ Carnegie Mellon. Advisors: Guy Blelloch, Carlos Guestrin (UW). 2009 → 2012 → + Shotgun: parallel L1-regularized regression solver (ICML 2011). + Internships at MSR Asia (2011) and Twitter (2012). • Startup entrepreneur: Habbo, founded 2000.
  3. Outline of this talk: 1. Why single-computer computing? 2. Introduction to graph computation and GraphChi 3. Recommender systems with GraphChi 4. Future directions & conclusion
  4. Large-Scale Recommender Systems on Just a PC: why on a single machine? Can’t we just use the Cloud?
  5. Why use a cluster? Two reasons: 1. One computer cannot handle my problem in a reasonable time. 2. I need to solve the problem very fast.
  6. Why use a cluster? Two reasons: 1. One computer cannot handle my problem in a reasonable time. Our work expands the space of feasible (graph) problems on one machine: – our experiments use the same graphs as, or bigger than, previous papers on distributed graph computation (+ we can do the Twitter graph on a laptop); – most data is not that “big”. 2. I need to solve the problem very fast. Our work raises the bar on the performance required to justify a “complicated” distributed system.
  7. Benefits of single-machine systems – assuming one can handle your big problems… 1. Programmer productivity – global state, can use “real data” for development. 2. Inexpensive to install and administer, uses less power. 3. Scalability.
  8. Efficient scaling. [Figure: task timelines. A distributed graph system achieves (significantly) less than 2x throughput when going from 6 to 12 machines, whereas replicating a single-computer system (capable of big tasks) gives exactly 2x throughput with 2x machines.]
  9. GRAPH COMPUTATION AND GRAPHCHI
  10. Why graphs for recommender systems? • Graph = matrix: edge(u,v) = M[u,v] – note: always sparse graphs. • Intuitive, human-understandable representation – easy to visualize and explain. • Unifies collaborative filtering (typically matrix-based) with recommendation in social networks – random-walk algorithms. • Local view → vertex-centric computation.
  11. Vertex-Centric Computational Model • Graph G = (V, E) – directed edges: e = (source, destination) – each edge and vertex is associated with a value (user-defined type) – vertex and edge values can be modified • (structure modification also supported). [Figure: two vertices A and B with data values attached to the vertices and to their edges.]
  12. Vertex-centric Programming • “Think like a vertex” • Popularized by the Pregel and GraphLab projects • The programmer writes an update function that is applied to one vertex and its neighborhood: MyFunc(vertex) { // modify neighborhood }
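
To make the vertex-centric model concrete, below is a minimal in-memory sketch in C++. The Vertex and Edge structs and the run_iterations driver are invented for illustration and are not GraphChi’s actual API; in GraphChi the same pattern is expressed through the library’s own vertex, edge and context types, and neighbor values are communicated through edge data rather than read directly.

    #include <cstddef>
    #include <vector>

    // Hypothetical, simplified types for illustration only (not GraphChi's API).
    struct Edge {
        int neighbor;   // id of the vertex at the other end
        float weight;   // user-defined edge value
    };

    struct Vertex {
        float value = 1.0f;        // user-defined vertex value
        std::vector<Edge> edges;   // in- and out-edges, flattened
    };

    // "Think like a vertex": read the neighborhood, write a new vertex value.
    void my_update(std::vector<Vertex>& graph, int vid) {
        Vertex& v = graph[vid];
        float sum = 0.0f;
        for (const Edge& e : v.edges) {
            sum += graph[e.neighbor].value * e.weight;   // gather from neighbors
        }
        v.value = 0.15f + 0.85f * sum;                   // e.g. a PageRank-style update
    }

    // Toy driver: apply the update function to every vertex, T times.
    void run_iterations(std::vector<Vertex>& graph, int T) {
        for (int iter = 0; iter < T; ++iter) {
            for (std::size_t vid = 0; vid < graph.size(); ++vid) {
                my_update(graph, static_cast<int>(vid));
            }
        }
    }
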
  13. What is GraphChi? Both in OSDI ’12!
  14. The main challenge of disk-based graph computation: random access. “Reasonable performance” needs on the order of 5–10 M random edge accesses per second, but storage delivers far less (<<): hundreds of reads/writes per second (rotational disk), ~100 K reads/sec (commodity SSD), ~1 M reads/sec (high-end arrays).
  15. Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012). Parallel Sliding Windows: only P large sequential reads to load each interval (sub-graph), and P² reads in one full pass over the graph.
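
To make the I/O count concrete (the worked numbers below are illustrative, not from the talk): the graph is split into P intervals/shards, and loading the sub-graph of one interval reads that interval’s own shard plus a sliding window from each of the other shards, i.e. P large sequential reads. A full pass over all P intervals therefore issues

    \[ P \times P \;=\; P^{2} \]

large sequential reads – for example, P = 16 shards gives 256 block reads per pass, instead of one random read per edge.
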
  16. GraphChi Program Execution. For T iterations: for p = 1 to P: for v in interval(p): updateFunction(v) – logically equivalent to: for T iterations: for v = 1 to V: updateFunction(v). “Asynchronous”: updates are immediately visible to later updates in the same iteration (vs. bulk-synchronous execution).
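
A schematic of that execution loop in C++ is shown below. The Interval struct and the load/store/update stubs are placeholders of my own, not GraphChi’s real interface; the point is only to show where the asynchronous visibility of updates comes from.

    #include <cstddef>
    #include <vector>

    // Schematic of GraphChi-style execution: placeholders only, not the real API.
    struct Interval { int first_vertex; int last_vertex; };

    struct Engine {
        std::vector<Interval> intervals;                // P vertex intervals (shards)

        void load_subgraph(int /*p*/) {}                // stub: P large sequential reads (PSW)
        void store_subgraph(int /*p*/) {}               // stub: write modified values back to disk
        void update_function(int /*vertex_id*/) {}      // stub: user-defined vertex update

        void run(int T) {
            for (int iter = 0; iter < T; ++iter) {
                for (std::size_t p = 0; p < intervals.size(); ++p) {
                    load_subgraph(static_cast<int>(p));
                    for (int v = intervals[p].first_vertex;
                         v <= intervals[p].last_vertex; ++v) {
                        // "Asynchronous": this call can already observe values
                        // written by earlier updates in the same iteration.
                        update_function(v);
                    }
                    store_subgraph(static_cast<int>(p));
                }
            }
        }
    };
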
  17. Performance: GraphChi can compute on the full Twitter follow-graph with just a standard laptop – roughly as fast as a very large Hadoop cluster! (Size of the graph as of Fall 2013: > 20 B edges [Gupta et al. 2013].)
  18. GraphChi is open source • C++ and Java versions on GitHub: http://github.com/graphchi – the Java version has a Hadoop/Pig wrapper, if you really, really want to use Hadoop.
  19. RECSYS MODEL TRAINING WITH GRAPHCHI
  20. Overview of recommender systems for GraphChi • Collaborative Filtering toolkit (next slide) • Link prediction in large networks – random-walk based approaches (Twitter) – talk on Wednesday.
  21. GraphChi’s Collaborative Filtering Toolkit • Developed by Danny Bickson (CMU / GraphLab Inc). • Includes: Alternating Least Squares (ALS), Sparse-ALS, SVD++, LibFM (factorization machines), GenSGD, item-similarity based methods, PMF, CliMF (contributed by Mark Levy), … • See Danny’s blog for more information: http://bickson.blogspot.com/2012/12/collaborativefiltering-with-graphchi.html • Note: in the C++ version; a Java version is in development by a CMU team.
  22. TWO EXAMPLES: ALS AND ITEM-BASED CF
  23. Example: Alternating Least Squares matrix factorization (ALS) • Task: predict ratings for items (movies) by users. • Model: latent factor model (see next slide). Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale Parallel Collaborative Filtering for the Netflix Prize” (2008).
  24. ALS: user–item bipartite graph. [Figure: users on one side, movies (Women on the Verge of a Nervous Breakdown, The Celebration, City of God, Wild Strawberries, La Dolce Vita) on the other; edges carry ratings (4, 3, 2, 5, …) and every user and movie carries a latent factor vector.] A user’s rating of a movie is modeled as a dot-product: <factor(user), factor(movie)>.
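
Written out, the latent factor model behind this slide is the standard ALS formulation (the notation below is mine; the regularized objective matches the usual one in Zhou et al. 2008):

    \[ \hat{r}_{um} \;=\; \langle p_u, q_m \rangle \;=\; p_u^{\top} q_m, \qquad p_u, q_m \in \mathbb{R}^{D} \]

    \[ \min_{P,\,Q} \sum_{(u,m)\,\in\,\text{ratings}} \bigl( r_{um} - p_u^{\top} q_m \bigr)^{2}
       \;+\; \lambda \Bigl( \sum_u \lVert p_u \rVert^{2} + \sum_m \lVert q_m \rVert^{2} \Bigr) \]

ALS alternates between the two factor sets: with all movie factors q_m fixed, each user factor p_u has a closed-form least-squares solution, and vice versa.
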
  25. ALS: GraphChi implementation • The update function handles one vertex at a time (user or movie). • For each user: estimate latent(user) by minimizing the least-squares error between the dot-product predictions and the observed ratings (see the sketch below). • GraphChi executes the update function for each vertex (in parallel) and loads the edges (ratings) from disk. – Latent factors in memory: needs O(V) memory. – If the factors don’t fit in memory, they can be replicated onto the edges and thus stored on disk. Scales to very large problems!
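
A minimal sketch of that per-user step, assuming the factor vectors of the user’s rated movies have already been gathered (in GraphChi they would come from the in-memory factor array or from edge data). The function name, types and the tiny Gauss–Jordan solver are mine, not the toolkit’s; a real implementation would use a proper linear-algebra library.

    #include <cstddef>
    #include <vector>

    // Hypothetical per-user ALS update (illustrative; not the toolkit's code).
    // Given the factor vectors of the movies this user rated and the ratings,
    // solve the regularized least-squares problem for the user's factor:
    //     p_u = (Q^T Q + lambda * I)^(-1) Q^T r
    using Vec = std::vector<double>;

    Vec als_user_update(const std::vector<Vec>& movie_factors,  // one D-vector per rated movie
                        const Vec& ratings,                      // corresponding ratings
                        double lambda) {
        const std::size_t D = movie_factors.empty() ? 0 : movie_factors[0].size();
        std::vector<Vec> A(D, Vec(D, 0.0));   // D x D normal-equation matrix
        Vec b(D, 0.0);

        // Accumulate A = sum_m q_m q_m^T + lambda*I  and  b = sum_m r_um * q_m.
        for (std::size_t m = 0; m < movie_factors.size(); ++m) {
            const Vec& q = movie_factors[m];
            for (std::size_t i = 0; i < D; ++i) {
                b[i] += ratings[m] * q[i];
                for (std::size_t j = 0; j < D; ++j) A[i][j] += q[i] * q[j];
            }
        }
        for (std::size_t i = 0; i < D; ++i) A[i][i] += lambda;

        // Solve A p = b with Gauss–Jordan elimination (fine for small D, e.g. 20;
        // A is positive definite, so the pivots are nonzero).
        Vec p = b;
        for (std::size_t i = 0; i < D; ++i) {
            double pivot = A[i][i];
            for (std::size_t j = i; j < D; ++j) A[i][j] /= pivot;
            p[i] /= pivot;
            for (std::size_t k = 0; k < D; ++k) {
                if (k == i) continue;
                double f = A[k][i];
                for (std::size_t j = i; j < D; ++j) A[k][j] -= f * A[i][j];
                p[k] -= f * p[i];
            }
        }
        return p;  // new latent factor vector for this user
    }
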
  26. ALS performance: matrix factorization (Alternating Least Squares) on Netflix (99 M edges), D = 20. [Chart: runtime in minutes (scale 0–12) for GraphChi on a Mac Mini vs. GraphLab v1 on 8 cores.] Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so it should be sub-linear in #ratings).
  27. Example: Item-Based CF • Task: compute a similarity score (e.g. Jaccard) for each movie pair that has at least one viewer in common. – Similarity(X, Y) ~ # of common viewers. – Output the top-K similar items for each item to a file… or create an edge between X and Y containing the similarity. • Problem: enumerating all pairs takes too much time.
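
For reference, the Jaccard score mentioned above, written over the viewer sets (standard definition, notation mine):

    \[ \mathrm{Jaccard}(X, Y) \;=\; \frac{\lvert \mathrm{viewers}(X) \cap \mathrm{viewers}(Y) \rvert}
                                          {\lvert \mathrm{viewers}(X) \cup \mathrm{viewers}(Y) \rvert} \]

Only pairs whose viewer sets intersect get a nonzero score, which is why it is enough to consider pairs with at least one common viewer.
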
  28. [Figure: the same user–movie rating graph.] Solution: enumerate all triangles of the graph. New problem: how to enumerate triangles if the graph does not fit in RAM?
  29. Enumerating triangles (Item-CF) • Triangles with edge (u, v) = intersection(neighbors(u), neighbors(v)) • Iterative, memory-efficient solution (next slide).
  30. Algorithm: • Let the pivots be a subset of the vertices. • Load all neighbor lists (adjacency lists) of the pivots into RAM. • Now use GraphChi to load all vertices from disk, one by one, and compare their adjacency lists to the pivots’ adjacency lists (similar to a merge; see the sketch below). • Repeat with a new subset of pivots.
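
A minimal sketch of the merge-style comparison at the heart of this pivot scheme, assuming adjacency lists are kept sorted. The function and type names are mine, not GraphChi’s, and the streaming loop is an in-memory stand-in for GraphChi’s pass over the graph on disk.

    #include <cstddef>
    #include <vector>

    // Count common neighbors of two vertices by merging their sorted adjacency
    // lists: this is the "similar to merge" comparison in the pivot algorithm.
    std::size_t count_common_neighbors(const std::vector<int>& adj_pivot,
                                       const std::vector<int>& adj_vertex) {
        std::size_t i = 0, j = 0, common = 0;
        while (i < adj_pivot.size() && j < adj_vertex.size()) {
            if (adj_pivot[i] == adj_vertex[j]) { ++common; ++i; ++j; }
            else if (adj_pivot[i] < adj_vertex[j]) { ++i; }
            else { ++j; }
        }
        return common;
    }

    // Outer loop for one batch of pivots held in RAM: stream every other vertex
    // (here just an in-memory vector) and intersect it with each pivot's list.
    void process_pivot_batch(const std::vector<std::vector<int>>& pivot_adj,
                             const std::vector<std::vector<int>>& stream,
                             std::vector<std::size_t>& counts /* per pivot */) {
        counts.assign(pivot_adj.size(), 0);
        for (const std::vector<int>& adj_v : stream) {
            for (std::size_t p = 0; p < pivot_adj.size(); ++p) {
                counts[p] += count_common_neighbors(pivot_adj[p], adj_v);
            }
        }
    }

For item-CF, the common-neighbor count between two movie vertices is the number of common viewers, which plugs directly into the Jaccard score above.
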
  31. Triangle counting performance: twitter-2010 (1.5 B edges). [Chart: runtime in minutes (scale 0–450) for GraphChi on a Mac Mini vs. Hadoop on 1636 machines.]
  32. FUTURE DIRECTIONS & FINAL REMARKS
  33. Single-machine computing in production? • GraphChi supports incremental computation with dynamic graphs: – it can keep running indefinitely, adding new edges to the graph → a constantly fresh model; – however, this requires engineering and is not included in the toolkit. • Compare to a cluster-based system (such as Hadoop) that needs to recompute from scratch.
  34. Unified recsys platform for GraphChi? • Working with masters students at CMU. • Goal: the ability to easily compare different algorithms and parameters – unified input and output – a general programmable API (not just file-based) – an evaluation process: several evaluation metrics; cross-validation, held-out data… – running many algorithm instances in parallel on the same graph – Java. • Scalable from the get-go.
  35. [Diagram: proposed input pipeline. A DataDescriptor defines the input data (e.g. column1: categorical, column2: real, column3: key, column4: categorical); each algorithm supplies an Algorithm Input Descriptor, map(input: DataDescriptor); a GraphChi preprocessor converts the raw input data into GraphChi input plus auxiliary data.]
  36. [Diagram: proposed training and evaluation pipeline. The GraphChi input and auxiliary data on disk feed the training programs of algorithms X, Y and Z; training produces training metrics and a predictor, which is evaluated on held-out (test) data to produce test quality metrics.]
  37. Recent developments in disk-based graph computation • Two disk-based graph computation systems were published recently: – TurboGraph (KDD ’13) – X-Stream (SOSP ’13, in October). • Significantly better performance than GraphChi on many problems – they avoid preprocessing (“sharding”) – but GraphChi can do some computation that X-Stream cannot (triangle counting and related), and TurboGraph requires an SSD. – Hot research area!
  38. Do you need GraphChi – or any system? • Heck, for many algorithms you can just mmap() over your (binary) adjacency list / sparse matrix and write a for-loop (see the sketch below). – See Lin, Chau, Kang: “Leveraging Memory Mapping for Fast and Scalable Graph Computation on a PC” (Big Data ’13). • Obviously it is good to have a common API – and some algorithms need more advanced solutions (like GraphChi, X-Stream, TurboGraph). Beware of the hype!
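
In the spirit of that remark, here is a minimal POSIX sketch of the mmap()-and-for-loop approach, assuming a flat binary edge list of (src, dst) uint32 pairs. The file name and layout are assumptions of mine, not from the talk or from the Lin et al. paper.

    #include <cstdint>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <vector>

    // Assumed file layout: a flat binary array of (src, dst) 32-bit pairs, one per edge.
    struct EdgePair { uint32_t src; uint32_t dst; };

    int main() {
        const char* path = "edges.bin";           // assumed binary edge list
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
        size_t bytes = static_cast<size_t>(st.st_size);
        size_t n_edges = bytes / sizeof(EdgePair);

        // Map the whole file; the OS pages edges in and out as the loop touches them.
        void* base = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        const EdgePair* edges = static_cast<const EdgePair*>(base);

        // "Just write a for-loop": e.g. compute out-degrees in one sequential pass.
        std::vector<uint32_t> out_degree;
        for (size_t i = 0; i < n_edges; ++i) {
            if (edges[i].src >= out_degree.size()) out_degree.resize(edges[i].src + 1, 0);
            ++out_degree[edges[i].src];
        }
        printf("edges: %zu, vertices seen: %zu\n", n_edges, out_degree.size());

        munmap(base, bytes);
        close(fd);
        return 0;
    }
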
  39. Conclusion • Very large recommender algorithms can now be run on just your PC or laptop. – Additional performance from multi-core parallelism. – Great for productivity – scale by replicating. • In general, good single-machine scalability requires care with data structures and memory management → natural in C/C++; with Java (etc.) you need low-level byte massaging. – Frameworks like GraphChi hide the low-level details. • More work is needed to “productize” the current work.
  40. Thank you! Aapo Kyrölä, Ph.D. candidate @ CMU – soon to graduate! (Currently visiting U.W.) http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov
