Bringing Algebraic Semantics to Mahout

  1. 1. Bringing Algebraic Semantics to Apache Mahout Sebastian Schelter Infolunch @ Hasso Plattner Institut, Potsdam 05/07/2014
  2. 2. Mahout: Past & Future
  3. 3. Apache Mahout: History • library for scalable machine learning (ML) • started six years ago as ML on MapReduce • focus on popular ML problems and algorithms – Collaborative Filtering • (Item-based, User-based, Matrix Factorization) – Classification • (Naive Bayes, Logistic Regression, Random Forest) – Clustering • (k-Means, Streaming k-Means) – Dimensionality Reduction • (Lanczos, Stochastic SVD) + in-memory math library (fork of Colt) • large user base (e.g. Adobe, AOL, Accenture, Foursquare, Mendeley, ResearchGate, Twitter)
  4. 4. Apache Mahout: Problems • Organisational – Onerous to provide support and documentation for scalable ML (need to convey technical and theoretical background at the same time) – partly failed to compensate for developers who left the project – barrier for contributions too low • Technical – MapReduce not well suited for ML • slow execution, especially for iterations • constrained programming model makes code hard to write, read and adjust – uneven quality of implementations – lack of unified preprocessing machinery
  5. 5. Apache Mahout: Future • Narrowing of focus – removed a large number of rarely used or barely maintained algorithm implementations • Abandonment of MapReduce – will reject new MapReduce implementations – widely used „legacy“ implementations will be maintained • „Reboot“ – usage of modern parallel processing systems which offer a richer programming model & more efficient execution → Apache Spark, possibly Stratosphere in the future – DSL for linear algebraic operations as the foundation for new implementations – automatic optimization and parallelization of programs written in this DSL
  6. 6. Scala & Spark-Bindings
  7. 7. Requirements for an ideal ML environment 1. R/Matlab-like semantics – type system that covers (a) linear algebra, (b) statistics and (c) data frames 2. Modern programming language qualities – functional programming – object-oriented programming – sensible byte code performance – scriptable and interactive 3. Scalability – automatic distribution and parallelization with sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization
  8. 8. Requirements for an ideal ML environment 1. R/Matlab-like semantics – type system that covers (a) linear algebra, (b) statistics and (c) data frames 2. Modern programming language qualities – functional programming – object-oriented programming – sensible byte code performance – scriptable and interactive 3. Scalability – automatic distribution and parallelization with sensible performance 4. Collection of off-the-shelf building blocks and algorithms 5. Visualization → Mahout's new Scala & Spark bindings address requirements 1(a), 2, 3 and 4
  9. 9. Scala & Spark Bindings • Scala as programming/scripting environment • R-like DSL: val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q) which corresponds to $G = BB^T - C - C^T + (\xi^T \xi)\, s_q s_q^T$ • Algebraic expression optimizer for distributed linear algebra – provides a translation layer to distributed engines – currently supports Apache Spark only – might support Stratosphere in the future
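For illustration, a minimal in-memory sketch of that DSL expression, assuming Mahout's math-scala bindings are on the classpath; the operands B, C, ksi and s_q are made-up toy values:

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    // toy operands, purely for illustration
    val B   = dense((1, 2), (3, 4))
    val C   = dense((0, 1), (1, 0))
    val ksi = dvec(1, 2)
    val s_q = dvec(3, 4)

    // the slide's expression, evaluated eagerly on in-memory matrices;
    // with DRM operands the same expression would build a lazy plan instead
    val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)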
  10. 10. Data Types • Scalar real values • In-memory vectors – dense – 2 types of sparse • In-memory matrices – sparse and dense – a number of specialized matrices • Distributed Row Matrices (DRM) – huge matrix, partitioned by rows – lives in the main memory of the cluster – provides small set of parallelized operations – lazily evaluated operation execution val x = 2.367 val v = dvec(1, 0, 5) val w = svec((0 -> 1)::(2 -> 5)::Nil) val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1)) val drmA = drmFromHDFS(...)
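A runnable sketch of the in-memory types above, with the imports the slide leaves implicit (the drm* calls additionally require a distributed context such as a Spark session):

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    val x = 2.367                                  // scalar
    val v = dvec(1, 0, 5)                          // dense vector
    val w = svec((0 -> 1.0) :: (2 -> 5.0) :: Nil)  // sparse vector
    val A = dense((1, 0, 5), (2, 1, 4), (4, 3, 1)) // dense 3x3 matrix
    println(A %*% v)                               // in-memory matrix-vector product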
  11. 11. Features (1) • Matrix, vector, scalar operators (in-memory, out-of-core): drmA %*% drmB; A %*% x; A.t %*% drmB; A * B • Slicing operators: A(5 until 20, 3 until 40); A(5, ::); A(5, 5); x(a to b) • Assignments (in-memory only): A(5, ::) := x; A *= B; A -=: B; 1 /:= x • Vector-specific: x dot y; x cross y • Summaries: A.nrow; x.length; A.colSums; B.rowMeans; x.sum; A.norm
  12. 12. Features (2) • solving linear systems: val x = solve(A, b) • in-memory decompositions: val (inMemQ, inMemR) = qr(inMemM); val ch = chol(inMemM); val (inMemV, d) = eigen(inMemM); val (inMemU, inMemV, s) = svd(inMemM) • out-of-core decompositions: val (drmQ, inMemR) = thinQR(drmA); val (drmU, drmV, s) = dssvd(drmA, k = 50, q = 1) • caching of DRMs: val drmA_cached = drmA.checkpoint(); drmA_cached.uncache()
  13. 13. Example
  14. 14. Cereals

    Name                     protein  fat  carbo  sugars  rating
    Apple Cinnamon Cheerios     2      2   10.5     10    29.509541
    Cap'n'Crunch                1      2   12       12    18.042851
    Cocoa Puffs                 1      1   12       13    22.736446
    Froot Loops                 2      1   11       13    32.207582
    Honey Graham Ohs            1      2   12       11    21.871292
    Wheaties Honey Gold         2      1   16        8    36.187559
    Cheerios                    6      2   17        1    50.764999
    Clusters                    3      2   13        7    40.400208
    Great Grains Pecan          3      3   13        4    45.811716

    Source: http://lib.stat.cmu.edu/DASL/Datafiles/Cereals.html
  15. 15. Linear Regression • Assumption: target variable y generated by a linear combination of feature matrix X with parameter vector β, plus noise ε: $y = X\beta + \varepsilon$ • Goal: find an estimate of the parameter vector β that explains the data well • Cereals example: X = weights of ingredients, y = customer rating
  16. 16. Data Ingestion • Usually: load the dataset as a DRM from a distributed filesystem: val drmData = drmFromHdfs(...) • 'Mimic' a large dataset for our example:

    val drmData = drmParallelize(dense(
      (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
      (1, 2, 12, 12, 18.042851),    // Cap'n'Crunch
      (1, 1, 12, 13, 22.736446),    // Cocoa Puffs
      (2, 1, 11, 13, 32.207582),    // Froot Loops
      (1, 2, 12, 11, 21.871292),    // Honey Graham Ohs
      (2, 1, 16, 8, 36.187559),     // Wheaties Honey Gold
      (6, 2, 17, 1, 50.764999),     // Cheerios
      (3, 2, 13, 7, 40.400208),     // Clusters
      (3, 3, 13, 4, 45.811716)),    // Great Grains Pecan
      numPartitions = 2)
  17. 17. Data Preparation • Cereals example: target variable y is the customer rating, the weights of ingredients are the features X • extract X as a DRM by slicing, fetch y as an in-core vector: val drmX = drmData(::, 0 until 4) val y = drmData.collect(::, 4) (slide depicts drmX as the 9×4 matrix of ingredient weights and y as the 9-vector of ratings)
  18. 18. Estimating β • Ordinary Least Squares: minimizes the sum of squared residuals between the true target variable and the prediction of the target variable • Closed-form expression for the estimate of β: $\hat{\beta} = (X^T X)^{-1} X^T y$ • Computing XᵀX and Xᵀy is as simple as typing the formulas: val drmXtX = drmX.t %*% drmX val drmXty = drmX.t %*% y
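For completeness, the closed form is just the solution of the normal equations, obtained by setting the gradient of the residual sum of squares to zero:

    $\min_{\beta} \|y - X\beta\|_2^2
      \;\Rightarrow\; \nabla_{\beta} = -2 X^T (y - X\beta) = 0
      \;\Rightarrow\; X^T X \, \hat{\beta} = X^T y$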
  19. 19. Estimating β • Solve the following linear system to get the least-squares estimate of β: $X^T X \, \hat{\beta} = X^T y$ • Fetch XᵀX and Xᵀy onto the driver and use an in-core solver – assumes XᵀX fits into memory – uses an analogue of R's solve() function: val XtX = drmXtX.collect val Xty = drmXty.collect(::, 0) val betaHat = solve(XtX, Xty)
  20. 20. Estimating β • Solve the following linear system to get the least-squares estimate of β: $X^T X \, \hat{\beta} = X^T y$ • Fetch XᵀX and Xᵀy onto the driver and use an in-memory solver – assumes XᵀX fits into memory – uses an analogue of R's solve() function: val XtX = drmXtX.collect val Xty = drmXty.collect(::, 0) val betaHat = solve(XtX, Xty) → We have implemented distributed linear regression!
  21. 21. Goodness of fit • Prediction of the target variable is a simple matrix-vector multiplication: $\hat{y} = X\hat{\beta}$ • Check the L2 norm of the difference between the true target variable and our prediction: val yHat = (drmX %*% betaHat).collect(::, 0) (y - yHat).norm(2)
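The quantity computed by norm(2) here is simply the root of the summed squared residuals, $\|y - \hat{y}\|_2 = \sqrt{\sum_i (y_i - \hat{y}_i)^2}$ — the smaller, the better the fit on the training data.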
  22. 22. Adding a bias term • Bias term left out so far – a constant factor added to the model, “shifts the line vertically” • Common trick is to add a column of ones to the feature matrix – the bias term will then be learned automatically (slide depicts the 9×4 feature matrix next to the same matrix with a fifth column of ones appended)
  23. 23. Adding a bias term • How do we add a new column to a DRM? → mapBlock() allows custom modifications to the matrix:

    val drmXwithBiasColumn = drmX.mapBlock(ncol = drmX.ncol + 1) {
      case (keys, block) =>
        // create a new block with an additional column
        val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
        // copy data from the current block into the new block
        blockWithBiasColumn(::, 0 until block.ncol) := block
        // the last column consists of ones
        blockWithBiasColumn(::, block.ncol) := 1
        keys -> blockWithBiasColumn
    }
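With the bias column in place, the regression from the previous slides can be rerun unchanged — a sketch, reusing drmXwithBiasColumn and y from above:

    // same normal-equation solve as before, now over the 5-column matrix;
    // the last entry of betaHatWithBias is the learned bias term
    val drmXtXb = drmXwithBiasColumn.t %*% drmXwithBiasColumn
    val drmXtyb = drmXwithBiasColumn.t %*% y
    val betaHatWithBias = solve(drmXtXb.collect, drmXtyb.collect(::, 0))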
  24. 24. Under the covers
  25. 25. Underlying systems • currently: prototype on Apache Spark – fast and expressive cluster computing system – general computation graphs, in-memory primitives, rich API, interactive shell • future: add Stratosphere – research system developed by TU Berlin, HU Berlin, HPI – accepted into Apache Incubator recently – functionality similar to Apache Spark, adds data flow optimization and efficient out-of-core execution
  26. 26. Optimization • Execution is deferred, the user composes logical operators • Computational actions implicitly trigger optimization (= selection of a physical plan) and execution: val drmA = drmB.t %*% drmC drmA.writeDrm(path) val inMemA = (drmB.t %*% drmB).collect • Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths • e.g. matrix multiplication: – 5 physical operators for drmA %*% drmB – 2 operators for drmA %*% inMemA – 1 operator for drmA %*% x – 1 operator for x %*% drmA
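A small sketch of how a shared computational path can be reused explicitly — checkpoint() (slide 12) pins the optimized intermediate result so that both subsequent actions run against it; the output path is hypothetical:

    // composing is free: nothing runs until an action is triggered
    val drmXtX = (drmX.t %*% drmX).checkpoint()
    val inMem = drmXtX.collect     // action 1: optimize, execute, fetch
    drmXtX.writeDrm("/tmp/XtX")    // action 2: reuses the checkpointed result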
  27. 27. Optimization Example (1) • Computation of XᵀX in the linear regression example: val drmXtX = drmX.t %*% drmX val XtX = drmXtX.collect • Logical optimization: logical plan gets rewritten to use a special logical operator for Transpose-Times-Self multiplication via a rule in the optimizer (diagram: the plan OpAB(OpAt(drmX), drmX) for drmXtX is rewritten to OpAtA(drmX))
  28. 28. Optimization Example (2) • Mahout computes XᵀX via a row-outer-product formulation of matrix multiplication that can be efficiently executed in a single pass over the row-partitioned X: $X^T X = \sum_{i=0}^{m} x_i x_i^T$ • optimizer chooses between two physical operators for OpAtA – standard implementation: • Map operator builds partial outer products from individual blocks based on an estimate of a good partitioning • a Reduce operator adds them up – AtA_slim for tall but skinny matrices, when an intermediate symmetric matrix requiring (X.ncol * X.ncol) / 2 entries fits into memory
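A conceptual in-memory sketch of that row-outer-product formulation — this is not Mahout's actual operator code; the distributed OpAtA applies the same idea per row partition and sums the partial results:

    import org.apache.mahout.math.scalabindings._
    import RLikeOps._

    // X^T X as the sum of the outer products x_i x_i^T of the rows of X
    val X = dense((2, 2, 10.5, 10), (1, 2, 12, 12), (1, 1, 12, 13))
    val XtX = (0 until X.nrow).map(i => X(i, ::) cross X(i, ::))
                              .reduce(_ + _)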
  29. 29. Thank you. Questions? Credits: Design & implementation of the Scala & Spark Bindings and parts of this slideset by Dmitriy Lyubimov dlyubimov@apache.org
