Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices


Published in: Technology, Education
  • MapReduce excels in massively parallel processing, scalability, and fault tolerance. In terms of analytics, however, such systems have been limited primarily to aggregation processing, i.e., computation of simple aggregates such as SUM, COUNT, and AVERAGE, after using filtering, joining, and grouping operations to prepare the data for the aggregation step. Although most data management systems (DMSs) provide hooks for user-defined functions and procedures, they do not deliver the rich analytic functionality found in statistical packages.
  • Virtually all prior work attempts to get along with only one type of system, either adding large-scale data management capability to statistical packages or adding statistical functionality to DMSs. This approach leads to solutions that are often cumbersome, unfriendly to analysts, or wasteful in that a great deal of well-established technology is needlessly re-invented or re-implemented.
  • Convert matrix operations to MapReduce functions.
  • R sends aggregation-processing queries to Hadoop (written in the high-level Jaql query language), and Hadoop sends aggregated data back to R for advanced statistical processing or visualization.
  • R has serious limitations when applied to very large datasets: limited support for distributed processing, no strategy for load balancing, no fault tolerance, and a hard limit set by a single server’s DRAM capacity.
  • Large-scale machine learning and graph processing on sparse matrices
  • Distributed array (darray) provides a shared, in-memory view of multi-dimensional data stored across multiple servers.
  • Repartitioning can be used to subdivide an array into a specified number of parts. Repartitioning is an optional performance optimization which helps when there is load imbalance in the system.
  • Note that for programs with general data structures (e.g., trees) writing invariants is difficult. However, for matrix computation, arrays are the only data structure and the relevant invariant is the compatibility in array sizes.
  • Qian’s comment: same concept as snapshot isolation.
  • Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices
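
The notes above on repartitioning and size invariants can be illustrated with a small sketch. It is not Presto's implementation: the function names `split_rows_by_nnz` and `row_invariant_holds` are hypothetical, and the greedy balancing rule is an assumption chosen for brevity. The idea shown is that row blocks of a sparse matrix are cut so each block holds a similar number of non-zeros, and that a vector multiplied against the matrix must be cut at the same row boundaries.

```python
def split_rows_by_nnz(nnz_per_row, parts):
    """Cut rows into at most `parts` contiguous blocks with roughly equal
    non-zero counts. Returns (start, end) row ranges, end exclusive.
    Hypothetical helper: a greedy rule, not Presto's actual policy."""
    target = sum(nnz_per_row) / parts
    bounds, start, acc = [], 0, 0
    for row, nnz in enumerate(nnz_per_row):
        acc += nnz
        # Close the current block once it holds its share of non-zeros,
        # keeping the final block open for whatever rows remain.
        if acc >= target and len(bounds) < parts - 1:
            bounds.append((start, row + 1))
            start, acc = row + 1, 0
    if start < len(nnz_per_row):
        bounds.append((start, len(nnz_per_row)))
    return bounds


def row_invariant_holds(matrix_bounds, vector_bounds):
    """Row-wise size invariant (assumed meaning): the matrix and the vector
    must be partitioned at identical row boundaries."""
    return matrix_bounds == vector_bounds


# One dense row followed by sparse rows: equal-sized blocks would be imbalanced.
nnz = [8, 1, 1, 1, 1, 1, 1, 2]
matrix_blocks = split_rows_by_nnz(nnz, parts=2)
vector_blocks = list(matrix_blocks)  # repartition the vector the same way
```

In this example the first block contains only the dense row and the second block contains the remaining seven rows, so each block carries eight non-zeros. Reusing the same boundaries for the vector is what keeps the array sizes compatible after a repartition.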

    1. Distributed Machine Learning and Graph Processing with Sparse Matrices. Speaker: LIN Qian
    2. Big Data, Complex Algorithms: PageRank (dominant eigenvector); Recommendations (matrix factorization); Anomaly detection (top-K eigenvalues); User importance (vertex centrality). Machine learning + graph algorithms
    3. Large-Scale Processing Frameworks. Data-parallel frameworks, MapReduce/Dryad (2004): process each record in parallel; use case: computing sufficient statistics, analytics queries. Graph-centric frameworks, Pregel/GraphLab (2010): process each vertex in parallel; use case: graphical models. Array-based frameworks, MadLINQ (2012): process blocks of array in parallel; use case: linear algebra operations
    4. PageRank using Matrices. Power method: dominant eigenvector. M = web graph matrix; p = PageRank vector. Simplified algorithm: repeat { p = M*p }. Linear algebra operations on sparse matrices
    5. Statistical software: moderately-sized datasets; single server, entirely in memory
    6. Work-around for massive dataset: vertical scalability; sampling
    7. MapReduce: limited to aggregation processing
    8. Data analytics: deep vs. scalable. Statistical software (R, MATLAB, SPSS, SAS); MapReduce
    9. Improvement ways: 1. Statistical software += large-scale data management; 2. MapReduce += statistical functionality; 3. Combining both existing technologies
    10. Parallel MATLAB, pR
    11. HAMA, SciHadoop
    12. MadLINQ [EuroSys’12]: linear algebra platform on Dryad; not efficient for sparse matrix computation
    13. Ricardo [SIGMOD’10]: R sends aggregation-processing queries to Hadoop; Hadoop sends aggregated data to R. But ends up inheriting the inefficiencies of the MapReduce interface
    14. Array-based; single-threaded; limited support for scaling
    15. Challenge 1: Sparse Matrices
    16. Challenge 1: Sparse Matrices. [Figure: normalized block density (log scale, 1 to 10000) by block ID for LiveJournal, Netflix, and ClueWeb-1B.] 1000x more data → computation imbalance
    17. Challenge 2: Data Sharing. Sharing data through pipes/network: time-inefficient (sending copies); space-inefficient (extra copies). [Figure: processes on Server 1 and Server 2, each working on a local copy or a network copy of the data.] Sparse matrices → communication overhead
    18. Extend R: make it scalable, distributed. Large-scale machine learning and graph processing on sparse matrices
    19. Presto architecture
    20. Presto architecture. [Figure: one master and two workers; each worker hosts several R instances and DRAM.]
    21. Distributed array (darray): partitioned; shared; dynamic
    22. foreach: parallel execution of the loop body f(x); barrier; call Update to publish changes
    23. PageRank Using Presto: M <- darray(dim=c(N,N), blocks=(s,N)); P <- darray(dim=c(N,1), blocks=(s,1)); while(..) { foreach(i, 1:len, calculate(m=splits(M,i), x=splits(P), p=splits(P,i)) { p <- m*x }) }. Create distributed array. [Figure: M and p, with p split into partitions P1, P2, ..., PN/s.]
    24. PageRank Using Presto: M <- darray(dim=c(N,N), blocks=(s,N)); P <- darray(dim=c(N,1), blocks=(s,1)); while(..) { foreach(i, 1:len, calculate(m=splits(M,i), x=splits(P), p=splits(P,i)) { p <- m*x }) }. Execute function in a cluster; pass array partitions. [Figure: M and p, with p split into partitions P1, P2, ..., PN/s.]
    25. Dynamic repartitioning: to address load imbalance; correctness
    26. Repartitioning matrices: profile execution; repartition
    27. Invariants: compatibility in array sizes
    28. Maintaining size invariants: invariant(mat, vec, type=ROW)
    29. Data sharing for multi-core: zero-copy sharing across cores
    30. Data sharing challenges: 1. Garbage collection; 2. Header conflict. [Figure: an R object's header and data part, referenced by two R instances.]
    31. Overriding R’s allocator: allocate process-local headers; map data in shared memory. [Figure: shared R object data part laid out between page boundaries, with a local R object header.]
    32. Immutable partitions → safe sharing. Only share read-only data
    33. Versioning arrays: to ensure correctness when arrays are shared across machines
    34. Fault tolerance. Master: primary-backup replication. Worker: heartbeat-based failure detection
    35. Presto applications: Presto doubles LOC w.r.t. purely programming in R.
    36. Evaluation: faster than Spark and Hadoop using in-memory data
    37. Multi-core support benefits
    38. Data sharing benefits. [Figure: transfer time by number of cores (10, 20, 40), no sharing vs. sharing.]
    39. Repartitioning benefits. [Figure: transfer and compute time across workers, no repartition vs. repartition.]
    40. Repartitioning benefits. [Figure: time to convergence (s) and cumulative partitioning time (s) by number of repartitions (0 to 20).]
    41. Limitations: 1. In-memory computation; 2. One writer per partition; 3. Array-based programming
    42. Conclusion. Presto: large-scale array-based framework that extends R. Challenges with sparse matrices. Repartitioning, sharing versioned arrays
    43. IMDb Rating: 8.5. Release Date: 27 June 2008. Director: Doug Sweetland. Studio: Pixar. Runtime: 5 min. Brief: A stage magician’s rabbit gets into a magical onstage brawl against his neglectful guardian with two magic hats.
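
The simplified PageRank loop from the slides (repeat p = M*p, computed one row block at a time) can be sketched in plain Python. This is only an illustration under stated assumptions: `pagerank_by_blocks` is a hypothetical name, the sparse matrix is stored as one dict of non-zeros per row, and the blocks are processed sequentially here, whereas Presto's foreach would run them in parallel and publish the results at a barrier.

```python
def pagerank_by_blocks(rows, n, block_size, iterations=100):
    """Power iteration p = M*p over row blocks of a sparse matrix.

    rows[i] maps column j to the non-zero M[i][j]; M is assumed to be
    column-stochastic (each column sums to 1). Hypothetical sketch,
    not Presto's API.
    """
    p = [1.0 / n] * n
    blocks = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    for _ in range(iterations):
        new_p = [0.0] * n
        for start, end in blocks:
            # Each block reads the whole previous vector p and writes only
            # its own slice, so blocks are independent of one another.
            for i in range(start, end):
                new_p[i] = sum(value * p[j] for j, value in rows[i].items())
        # Swap in the new vector only after every block has finished,
        # which plays the role of the barrier in the slides.
        p = new_p
    return p


# Tiny web graph with edges 0->1, 0->2, 1->2, 2->0.
# M[i][j] = 1 / outdegree(j) when page j links to page i.
web_rows = [
    {2: 1.0},          # page 0 is linked from page 2
    {0: 0.5},          # page 1 is linked from page 0
    {0: 0.5, 1: 1.0},  # page 2 is linked from pages 0 and 1
]
ranks = pagerank_by_blocks(web_rows, n=3, block_size=2)
```

For this graph the iteration settles near 0.4, 0.2, and 0.4 for pages 0, 1, and 2. Writing each block's result into a fresh vector, rather than updating p in place, mirrors the one-writer-per-partition and versioning ideas from the deck: readers always see a consistent previous version.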