
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL (Flink Forward 2015)


  1. Distributed Machine Learning with the Samsara DSL (Sebastian Schelter, Flink Forward 2015)
  2. About me
     • about to finish my PhD on "Scaling Data Mining in Massively Parallel Dataflow Systems"
     • currently:
       – Machine Learning Scientist / Post-Doctoral Researcher at Amazon's Berlin-based ML group
       – senior researcher at the Database Group of TU Berlin
     • member of the Apache Software Foundation (Mahout / Giraph / Flink)
  3. Samsara
     • Samsara is an easy-to-use domain-specific language (DSL) for distributed large-scale machine learning on systems like Apache Spark and Apache Flink
     • part of the Apache Mahout project
     • uses Scala as programming/scripting environment
     • system-agnostic, R-like DSL, e.g.
         val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
       which computes G = B Bᵀ - C - Cᵀ + (ξᵀξ) s_q s_qᵀ
     • algebraic expression optimizer for distributed linear algebra
       – provides a translation layer to distributed engines
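     To make the one-liner above concrete, here is a minimal in-core sketch; the toy operands are made up for illustration, and the import incantation follows the Mahout in-core reference linked at the end of the deck:

         import org.apache.mahout.math._
         import scalabindings._
         import RLikeOps._

         // small in-core stand-ins for the operands of the expression
         val B = dense((1.0, 2.0), (3.0, 4.0))
         val C = dense((1.0, 0.0), (0.0, 1.0))
         val ksi = dvec(1, 2)
         val s_q = dvec(3, 4)

         // the same expression as on the slide, evaluated in memory
         val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)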
  4. Data Types
     • scalar real values:
         val x = 2.367
     • in-memory vectors (dense, 2 types of sparse):
         val v = dvec(1, 0, 5)
         val w = svec((0 -> 1) :: (2 -> 5) :: Nil)
     • in-memory matrices (sparse and dense, a number of specialized matrices):
         val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1))
     • Distributed Row Matrices (DRM)
       – huge matrix, partitioned by rows
       – lives in the main memory of the cluster
       – provides a small set of parallelized operations
       – lazily evaluated operation execution
         val drmA = drmFromHDFS(...)
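     A quick sketch of how the in-core types combine (same bindings as above; the numbers are toy values):

         val v = dvec(1, 0, 5)                          // dense vector
         val w = svec((0 -> 1) :: (2 -> 5) :: Nil)      // sparse vector with w(0) = 1, w(2) = 5
         val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1)) // dense 3 x 3 matrix

         val d  = v dot w   // scalar: 1*1 + 0*0 + 5*5 = 26
         val Av = A %*% v   // matrix-vector product, an in-core vector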
  5. Features (1)
     • matrix, vector, scalar operators: in-memory, distributed
         drmA %*% drmB; A %*% x; A.t %*% drmB; A * B
     • slicing operators
         A(5 until 20, 3 until 40); A(5, ::); A(5, 5); x(a to b)
     • assignments (in-memory only)
         A(5, ::) := x; A *= B; A -=: B; 1 /:= x
     • vector-specific
         x dot y; x cross y
     • summaries
         A.nrow; x.length; A.colSums; B.rowMeans; x.sum; A.norm
  6. Features (2)
     • solving linear systems
         val x = solve(A, b)
     • in-memory decompositions
         val (inMemQ, inMemR) = qr(inMemM)
         val ch = chol(inMemM)
         val (inMemV, d) = eigen(inMemM)
         val (inMemU, inMemV, s) = svd(inMemM)
     • distributed decompositions
         val (drmQ, inMemR) = thinQR(drmA)
         val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1)
     • caching of DRMs
         val drmA_cached = drmA.checkpoint()
         drmA_cached.uncache()
  7. Example
  8. Cereals
     Name                      protein  fat  carbo  sugars  rating
     Apple Cinnamon Cheerios   2        2    10.5   10      29.509541
     Cap'n'Crunch              1        2    12     12      18.042851
     Cocoa Puffs               1        1    12     13      22.736446
     Froot Loops               2        1    11     13      32.207582
     Honey Graham Ohs          1        2    12     11      21.871292
     Wheaties Honey Gold       2        1    16     8       36.187559
     Cheerios                  6        2    17     1       50.764999
     Clusters                  3        2    13     7       40.400208
     Great Grains Pecan        3        3    13     4       45.811716
     http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
  9. Linear Regression
     • assumption: target variable y is generated by a linear combination of the feature matrix X with the parameter vector β, plus noise ε:
         y = Xβ + ε
     • goal: find an estimate of the parameter vector β that explains the data well
     • cereals example: X = weights of ingredients, y = customer rating
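     For reference, the least-squares estimate used on the next slides follows from minimizing the residual sum of squares; a short derivation in LaTeX form:

         \min_\beta \; \|y - X\beta\|_2^2 = (y - X\beta)^T (y - X\beta)

         % setting the gradient with respect to beta to zero:
         -2 X^T (y - X\beta) = 0
         \;\Rightarrow\; X^T X \, \hat{\beta} = X^T y
         \;\Rightarrow\; \hat{\beta} = (X^T X)^{-1} X^T y \quad \text{(when } X^T X \text{ is invertible)}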
  10. Data Ingestion
      • usually: load a dataset as DRM from a distributed filesystem:
          val drmData = drmFromHDFS(...)
      • 'mimic' a large dataset for our example:
          val drmData = drmParallelize(dense(
            (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
            (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
            (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
            (2, 1, 11,   13, 32.207582),  // Froot Loops
            (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
            (2, 1, 16,    8, 36.187559),  // Wheaties Honey Gold
            (6, 2, 17,    1, 50.764999),  // Cheerios
            (3, 2, 13,    7, 40.400208),  // Clusters
            (3, 3, 13,    4, 45.811716)), // Great Grains Pecan
            numPartitions = 2)
  11. Data Preparation
      • cereals example: target variable y is the customer rating, the ingredient weights are the features X
      • extract X as a DRM by slicing, fetch y as an in-core vector:
          val drmX = drmData(::, 0 until 4)
          val y = drmData.collect(::, 4)
      (the slide shows the 9 x 5 data matrix split into the 9 x 4 feature matrix drmX and the rating vector y)
  12. Estimating β
      • Ordinary Least Squares: minimizes the sum of squared residuals between the true target variable and the prediction of the target variable
      • closed-form expression for the estimate of β:
          β̂ = (XᵀX)⁻¹ Xᵀy
      • computing XᵀX and Xᵀy is as simple as typing the formulas:
          val drmXtX = drmX.t %*% drmX
          val drmXty = drmX.t %*% y
  13. Estimating β
      • solve the following linear system to get the least-squares estimate of β:
          XᵀX β̂ = Xᵀy
      • fetch XᵀX and Xᵀy onto the driver and use an in-core solver
        – assumes XᵀX fits into memory
        – uses an analog of R's solve() function
          val XtX = drmXtX.collect
          val Xty = drmXty.collect(::, 0)
          val betaHat = solve(XtX, Xty)
      → We have implemented distributed linear regression!
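      Putting the pieces from slides 10-13 together, the whole distributed regression fits in a few lines (a recap sketch, assuming the bindings and a distributed context as before):

          val drmX = drmData(::, 0 until 4)        // features: ingredient weights
          val y    = drmData.collect(::, 4)        // target: customer ratings

          val XtX = (drmX.t %*% drmX).collect      // 4 x 4, fits in memory
          val Xty = (drmX.t %*% y).collect(::, 0)  // Xᵀy as an in-core vector

          val betaHat = solve(XtX, Xty)            // in-core solver on the driver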
  14. Goodness of fit
      • prediction of the target variable is a simple matrix-vector multiplication:
          ŷ = Xβ̂
      • check the L2 norm of the difference between the true target variable and our prediction:
          val yHat = (drmX %*% betaHat).collect(::, 0)
          (y - yHat).norm(2)
  15. Adding a bias term
      • bias term left out so far
        – a constant factor added to the model, "shifts the line vertically"
      • common trick is to add a column of ones to the feature matrix
        – the bias term will then be learned automatically
      (the slide shows the 9 x 4 feature matrix next to the same matrix extended by a column of ones)
  16. Adding a bias term
      • how do we add a new column to a DRM?
        → mapBlock() allows for custom modifications of the matrix
          val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) {
            case (keys, block) =>
              // create a new block with an additional column
              val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
              // copy data from the current block into the new block
              blockWithBiasColumn(::, 0 until block.ncol) := block
              // the last column consists of ones
              blockWithBiasColumn(::, block.ncol) := 1
              keys -> blockWithBiasColumn
          }
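      With the extended matrix, the fit from slide 13 repeats unchanged; a sketch (the variable names here are ours, not from the deck):

          val drmXb = drmXwithBiasColumn

          val XtXb = (drmXb.t %*% drmXb).collect
          val Xtyb = (drmXb.t %*% y).collect(::, 0)

          // betaHatWithBias has 5 entries; the last one is the learned bias
          val betaHatWithBias = solve(XtXb, Xtyb)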
  17. Under the covers
  18. Underlying systems
      • prototype on Apache Spark
      • prototype on H2O
      • coming up: support for Apache Flink
  19. Runtime & Optimization
      • execution is deferred: the user composes logical operators
      • computational actions implicitly trigger optimization (= selection of a physical plan) and execution
      • optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths
      • e.g., matrix multiplication:
        – 5 physical operators for drmA %*% drmB
        – 2 operators for drmA %*% inMemA
        – 1 operator for drmA %*% x
        – 1 operator for x %*% drmA
          val C = A.t %*% A                 // logical operators only
          C.writeDrm(path)                  // action
          val inMemV = (U %*% M).collect    // action
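      To illustrate the deferred model, a small sketch under the same assumptions as before; the comments describe the intended semantics:

          // composing logical operators: no cluster job runs here
          val drmAtA = drmA.t %*% drmA

          // the action triggers optimization and execution; the optimizer
          // rewrites A.t %*% A into a single Transpose-Times-Self operator
          // (see the following slides)
          val AtA = drmAtA.collect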
  20. Optimization Example
      • computation of AᵀA in the example:
          val C = A.t %*% A
      • naïve execution:
        – 1st pass: transpose A (requires repartitioning of A)
        – 2nd pass: multiply the result with A (expensive, potentially requires repartitioning again)
        (naïve plan: A → Transpose → MatrixMult with A → C)
      • logical optimization: the optimizer rewrites the plan to use a specialized logical operator for Transpose-Times-Self matrix multiplication
        (optimized plan: A → Transpose-Times-Self → C)
  21. Transpose-Times-Self
      • Samsara computes AᵀA via a row-outer-product formulation
        – executes in a single pass over the row-partitioned A
          AᵀA = Σ_i a_i a_iᵀ   (sum over the rows a_i of A)
      (the slides animate the build-up: each row contributes the rank-1 outer product a_i a_iᵀ, and summing these yields AᵀA)
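      The formulation is easy to verify in-core with the DSL's own operators (a toy check, not the distributed operator itself; the small 0/1 matrix is made up):

          val A = dense((1, 1, 1, 0), (1, 0, 1, 0), (0, 0, 1, 1))

          // accumulate the outer products of the rows ...
          var AtA = A(0, ::) cross A(0, ::)
          for (i <- 1 until A.nrow)
            AtA = AtA + (A(i, ::) cross A(i, ::))

          // ... their sum equals the direct product
          val check = A.t %*% A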
  22. Physical operators for the distributed computation of AᵀA
  23. Physical operators for Transpose-Times-Self
      • two physical operators (concrete implementations) available for the Transpose-Times-Self operation:
        – standard operator AtA
        – operator AtA_slim, a specialized implementation for tall & skinny matrices
      • the optimizer must choose between them
        – currently: depends on a user-defined threshold for the number of columns
        – ideally: a cost-based decision, dependent on estimates of intermediate result sizes
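      Schematically, the current rule boils down to a column-count check (the function and threshold value below are illustrative, not Mahout's actual code):

          // schematic chooser for the physical Transpose-Times-Self operator
          def chooseAtAOperator(ncol: Int, slimThreshold: Int = 200): String =
            if (ncol <= slimThreshold) "AtA_slim"  // tall & skinny: local AᵀA per worker
            else "AtA"                             // general row-outer-product operator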
  24. Physical operator AtA
      • A is row-partitioned across workers (in the example, a 3 x 4 matrix with partition A1 on worker 1 and partition A2 on worker 2)
      • each worker computes, for every row partition of the result, the product of the transposed corresponding column slice of its local rows with all of its local rows
      • the partial products for each result partition are shipped to a dedicated worker (workers 3 and 4 in the example) and summed up, yielding the row partitions of C = AᵀA
      (the slides walk through this build-up step by step with a small 0/1 example matrix)
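      A toy in-core simulation of that data flow (partitioning and the final sums done by hand; the real operator uses the engine's shuffle for the per-partition sums):

          val A1 = dense((1, 1, 1, 0), (1, 0, 1, 0))  // rows on worker 1
          val A2 = dense((0, 0, 1, 1))                // row on worker 2

          // each worker multiplies the transposed column slice of its rows
          // with all of its rows, once per result partition
          val part1FromW1 = A1(::, 0 until 2).t %*% A1  // worker 1, result rows 0..1
          val part1FromW2 = A2(::, 0 until 2).t %*% A2  // worker 2, result rows 0..1
          val part2FromW1 = A1(::, 2 until 4).t %*% A1  // worker 1, result rows 2..3
          val part2FromW2 = A2(::, 2 until 4).t %*% A2  // worker 2, result rows 2..3

          // summing per result partition yields the row partitions of AᵀA
          val resultRows01 = part1FromW1 + part1FromW2
          val resultRows23 = part2FromW1 + part2FromW2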
  25. Physical operator AtA_slim
      • again, A is row-partitioned across workers (A1 on worker 1, A2 on worker 2)
      • each worker computes its local Gram matrix A1ᵀA1 resp. A2ᵀA2 in memory
      • the driver sums up the partial results: C = AᵀA = A1ᵀA1 + A2ᵀA2
      (the slides illustrate this with the same small 0/1 example matrix)
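      The same toy matrices make this scheme concrete (an in-core sketch; the real operator computes the partial Gram matrices on the workers):

          val A1 = dense((1, 1, 1, 0), (1, 0, 1, 0))  // partition on worker 1
          val A2 = dense((0, 0, 1, 1))                // partition on worker 2

          // each worker computes its local Gram matrix ...
          val gram1 = A1.t %*% A1
          val gram2 = A2.t %*% A2

          // ... and the driver sums the partials
          val C = gram1 + gram2   // equals A.t %*% A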
  26. Pointers
      • contribution to Apache Mahout in progress: https://issues.apache.org/jira/browse/MAHOUT-1570
      • Apache Mahout has extensive documentation on Samsara:
        – http://mahout.apache.org/users/environment/in-core-reference.html
        – https://mahout.apache.org/users/environment/out-of-core-reference.html
  27. Thank you. Questions?
