
- 1. Distributed Machine Learning and Graph Processing with Sparse Matrices. Speaker: LIN Qian, http://www.comp.nus.edu.sg/~linqian/
- 2. Big Data, Complex Algorithms: PageRank (dominant eigenvector); recommendations (matrix factorization); anomaly detection (top-K eigenvalues); user importance (vertex centrality). Machine learning + graph algorithms.
- 3. Large-Scale Processing Frameworks. Data-parallel frameworks, MapReduce/Dryad (2004): process each record in parallel; use case: computing sufficient statistics, analytics queries. Graph-centric frameworks, Pregel/GraphLab (2010): process each vertex in parallel; use case: graphical models. Array-based frameworks, MadLINQ (2012): process blocks of array in parallel; use case: linear algebra operations.
- 4. PageRank using Matrices. Power method: dominant eigenvector. M = web graph matrix, p = PageRank vector. Simplified algorithm: repeat { p = M*p }. Linear algebra operations on sparse matrices.
- 5. Statistical software: moderately-sized datasets; single server, entirely in memory.
- 6. Work-arounds for massive datasets: vertical scalability; sampling.
- 7. MapReduce: limited to aggregation processing.
- 8. Data analytics, deep vs. scalable: statistical software (R, MATLAB, SPSS, SAS); MapReduce.
- 9. Ways to improve: 1. Statistical software += large-scale data management. 2. MapReduce += statistical functionality. 3. Combining both existing technologies.
- 10. Parallel MATLAB, pR
- 11. HAMA, SciHadoop
- 12. MadLINQ [EuroSys’12]: linear algebra platform on Dryad; not efficient for sparse matrix computation.
- 13. Ricardo [SIGMOD’10]: R issues aggregation-processing queries to Hadoop and receives aggregated data, but ends up inheriting the inefficiencies of the MapReduce interface.
- 14. Array-based; single-threaded; limited support for scaling.
- 15. Challenge 1: Sparse Matrices
- 16. Challenge 1 – Sparse Matrices. [Figure: normalized block density by block ID for LiveJournal, Netflix, and ClueWeb-1B.] 1000x more data; computation imbalance.
- 17. Challenge 2 – Data Sharing. Sharing data through pipes/network is time-inefficient (sending copies) and space-inefficient (extra copies). [Figure: processes on Server 1 and Server 2, each holding a local or network copy of the data.] Sparse matrices; communication overhead.
- 18. Extend R: make it scalable, distributed. Large-scale machine learning and graph processing on sparse matrices.
- 19. Presto architecture
- 20. Presto architecture. [Figure: one master and two workers; each worker hosts several R instances and DRAM.]
- 21. Distributed array (darray): partitioned; shared; dynamic.
- 22. foreach: parallel execution of the loop body f(x); barrier; call Update to publish changes.
- 23. PageRank Using Presto: M <- darray(dim=c(N,N),blocks=(s,N)); P <- darray(dim=c(N,1),blocks=(s,1)); while(..){ foreach(i,1:len, calculate(m=splits(M,i), x=splits(P), p=splits(P,i)) { p <- m*x }) }. Create distributed arrays M and P (partitions P1, P2, ..., PN/s).
- 24. PageRank Using Presto: M <- darray(dim=c(N,N),blocks=(s,N)); P <- darray(dim=c(N,1),blocks=(s,1)); while(..){ foreach(i,1:len, calculate(m=splits(M,i), x=splits(P), p=splits(P,i)) { p <- m*x }) }. Execute function in a cluster; pass array partitions.
- 25. Dynamic repartitioning: to address load imbalance; correctness.
- 26. Repartitioning Matrices: profile execution; repartition.
- 27. Invariants: compatibility in array sizes.
- 28. Maintaining Size Invariants: invariant(mat, vec, type=ROW)
- 29. Data sharing for multi-core: zero-copy sharing across cores.
- 30. Data sharing challenges: 1. Garbage collection. 2. Header conflict. [Figure: two R instances and an R object made up of an object header and a data part.]
- 31. Overriding R’s allocator: allocate process-local headers; map data in shared memory. [Figure: local R object header and shared R object data part, with page boundaries marked.]
- 32. Immutable partitions, safe sharing: only share read-only data.
- 33. Versioning arrays: to ensure correctness when arrays are shared across machines.
- 34. Fault tolerance. Master: primary-backup replication. Worker: heartbeat-based failure detection.
- 35. Presto applications: Presto doubles the lines of code (LOC) compared with programming purely in R.
- 36. Evaluation: faster than Spark and Hadoop using in-memory data.
- 37. Multi-core support benefits
- 38. Data sharing benefits. [Figure: compute and transfer at 10, 20, and 40 cores, for no sharing and sharing.]
- 39. Repartitioning benefits. [Figure: per-worker transfer and compute, for no repartition and repartition.]
- 40. Repartitioning benefits. [Figure: time to convergence (s) and cumulative partitioning time (s) against number of repartitions.]
- 41. Limitations: 1. In-memory computation. 2. One writer per partition. 3. Array-based programming.
- 42. Conclusion: Presto is a large-scale array-based framework that extends R; challenges with sparse matrices; repartitioning, sharing versioned arrays.
- 43. IMDb Rating: 8.5. Release Date: 27 June 2008. Director: Doug Sweetland. Studio: Pixar. Runtime: 5 min. Brief: A stage magician’s rabbit gets into a magical onstage brawl against his neglectful guardian with two magic hats.
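
The simplified PageRank loop from slides 4 and 23 (repeat p = M*p, with M split into row blocks) can be sketched in plain Python. This is only an illustration under stated assumptions, not Presto's implementation: the helper names (`column_stochastic`, `make_blocks`, `multiply_block`, `pagerank`) are hypothetical, the sparse matrix is a dictionary of rows, and the per-block loop runs sequentially where Presto would run the loop bodies in parallel across workers.

```python
# Sketch: power-method PageRank over row-partitioned sparse blocks.
# All function names here are hypothetical; none of this is Presto's API.

def column_stochastic(edges, n):
    """Build sparse M as {row: {col: weight}}, with M[dst][src] = 1/outdeg(src)."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    rows = {r: {} for r in range(n)}
    for src, dst in edges:
        rows[dst][src] = rows[dst].get(src, 0.0) + 1.0 / outdeg[src]
    return rows

def make_blocks(rows, n, s):
    """Split M into blocks of s consecutive rows (analogous to blocks=(s,N))."""
    return [{r: rows[r] for r in range(start, min(start + s, n))}
            for start in range(0, n, s)]

def multiply_block(block, p):
    """Loop body for one block: its slice of M*p, computed from the full vector p."""
    return {r: sum(w * p[c] for c, w in cols.items())
            for r, cols in block.items()}

def pagerank(edges, n, s, iterations=100):
    """Repeat p = M*p for a fixed number of iterations (no damping factor)."""
    blocks = make_blocks(column_stochastic(edges, n), n, s)
    p = [1.0 / n] * n
    for _ in range(iterations):
        new_p = [0.0] * n
        for block in blocks:          # sequential stand-in for a parallel foreach
            for r, value in multiply_block(block, p).items():
                new_p[r] = value      # each block writes only its own rows
        p = new_p                     # stand-in for the barrier before the next round
    return p

if __name__ == "__main__":
    # Small strongly connected, aperiodic graph: 0->1, 0->2, 1->2, 2->0.
    print(pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3, s=2))
```

On this four-edge example graph the iteration settles near (0.4, 0.2, 0.4). Because each block writes only its own rows of the result, the sketch also mirrors the "one writer per partition" limitation listed on slide 41.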
