Parallel External Memory Algorithms Applied to Generalized Linear Models

Presentation by Lee Edlefsen, Revolution Analytics, at JSM 2012, San Diego, CA, July 30, 2012

For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives. To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMA's) provide the foundation for such software. External memory algorithms (EMA's) are those that do not require all data to be in RAM, and are widely available. Parallel implementations of EMA's allow them to run on multiple cores and computers, and to process unlimited rows of data. This paper describes a general approach to efficiently parallelizing EMA's, using an R and C++ implementation of GLM as a detailed example. It examines the requirements for efficient PEMA's; the arrangement of code for automatic parallelization; efficient threading; and efficient inter-process communication. It includes billion-row benchmarks showing linear scaling with rows and nodes, and demonstrating that extremely high performance is achievable.


Transcript

  • 1. Parallel External Memory Algorithms Applied to Generalized Linear Models. Lee E. Edlefsen, Ph.D., Chief Scientist, Revolution Analytics. JSM 2012
  • 2. Introduction and overview
    For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives.
    To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMA's) provide a foundation for such software.
  • 3. Introduction and overview – (2)
    External memory algorithms (EMA's) are those that do not require all data to be in RAM, and are widely available.
    Parallel implementations of EMA's allow them to run on multiple cores and computers, and to process unlimited rows of data.
    This paper describes a general approach to efficiently parallelizing EMA's, using an R and C++ implementation of generalized linear models (GLM) as a detailed example.
  • 4. Introduction and overview – (3)
    This paper discusses:
    - the arrangement of code for "automatic" parallelization
    - the efficient use of cores
    - the efficient use of multiple computers (nodes)
    The approach presented is independent of the distributed computing platform (MPI, Hadoop, MPP database appliances).
    The paper includes billion-row benchmarks showing linear scaling with rows and nodes, and demonstrating that extremely high performance is achievable.
  • 5. High Performance Computing vs High Performance Analytics
    HPA is HPC + Data.
    High Performance Computing is CPU centric:
    - lots of processing on small amounts of data
    - focus is on cores
    High Performance Analytics is data centric:
    - less processing per amount of data
    - focus is on feeding data to the cores: on disk I/O and data locality, and on efficient threading and data management in RAM
  • 6. High Performance Analytics in RevoScaleR
    Extremely high performance data management and data analysis. Scales from small local data to huge distributed data, and from laptop to cluster to cloud.
    Based on a platform that "automatically" and efficiently parallelizes and distributes a broad class of predictive analytic algorithms. This platform implements the approach to parallel external memory algorithms I will describe.
  • 7. External memory algorithms
    External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time.
    Such algorithms process data a "chunk" at a time, storing intermediate results from each chunk and combining them at the end. Each chunk must produce an intermediate result that can be combined with other intermediate results to give the final result.
    Such algorithms are widely available for data management and predictive analytics.
  • 8. Parallel external memory algorithms (PEMA's)
    PEMA's are external memory algorithms that have been parallelized. Such algorithms process data a chunk at a time in parallel, storing intermediate results from each chunk and combining them at the end.
    External memory algorithms that are not "inherently sequential" can be parallelized:
    - results for one chunk of data cannot depend upon prior results
    - data dependence (lags, leads) is OK
  • 9. Generalized Linear Models (GLM)
    The generalized linear model can be thought of as a generalization of linear regression. It extends linear regression to handle dependent variables generated from exponential-family distributions, including Gaussian, Poisson, logistic, gamma, binomial, multinomial, and Tweedie.
    Generalized linear models are widely used in a variety of fields and industries.
  • 10. GLM overview
    The dependent variable Y is generated from a distribution in the exponential family.
    The expected value of Y is related to a linear predictor of the data X and parameters β through the inverse of a "link" function g():
    E(Y) = mu = g^-1(Xβ)
    The variance of Y is typically a function V() of the mean mu:
    Var(Y) = varmu = V(mu)
  • 11. GLM Estimation
    The parameters of GLM models can be estimated using maximum likelihood. Iteratively reweighted least squares (IRLS) is commonly used to obtain the maximum likelihood estimates.
    Each iteration of IRLS requires at least one pass through the data, generating a vector of weights and a "new" dependent variable and then doing a weighted least squares regression.
  • 12. IRLS for GLM
    Given an estimate of the parameters β and the data X, IRLS requires the computation of a "weight" variable W and a "new" dependent variable Z:
    eta = Xβ
    mu = linkinv(eta)
    Z = (y - mu)/mu_eta, where mu_eta is the partial derivative of mu with respect to eta
    W = sqrt(mu_eta*mu_eta)/varmu
    The next β is then computed by regressing Z on X, weighted by W. If the estimation has not converged, the steps are repeated.
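To make the recipe concrete, here is a minimal in-memory sketch of one IRLS step in R, following the slide's formulas. It is not the talk's implementation: the name irls_step is invented, fam is assumed to be a standard R family object (e.g. binomial()) supplying linkinv(), mu.eta(), and variance(), and adding the solved coefficients to β as an increment is one reading of "regressing Z on X", since Z here omits the eta offset.

```r
## One IRLS step following the slide's formulas (illustrative sketch).
irls_step <- function(beta, X, y, fam) {
  eta    <- drop(X %*% beta)               # linear predictor
  mu     <- fam$linkinv(eta)               # mean via inverse link
  mu_eta <- fam$mu.eta(eta)                # d mu / d eta
  varmu  <- fam$variance(mu)               # variance function V(mu)
  Z      <- (y - mu) / mu_eta              # "new" dependent variable
  W      <- sqrt(mu_eta * mu_eta) / varmu  # weight, as on the slide
  ## Weighted least squares of Z on X: scale the rows of X and Z by W,
  ## then solve the normal equations for the increment to beta.
  Xw <- X * W
  beta + drop(solve(crossprod(Xw), crossprod(Xw, Z * W)))
}
```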
  • 13. In-memory implementations
    The glm() function in R provides a beautiful and efficient in-memory implementation. However, nearly every computational line of code involves processing all rows of data.
    There is no easy way to directly convert an implementation like this into one that can handle data too big to fit into memory and that can use multiple cores and multiple computers. However, it can be accomplished by arranging the same computations into separate functions that accomplish separate tasks.
  • 14. Example external memory algorithm for the mean of a variable
    Initialization function: total = 0, count = 0
    ProcessData function: for each block of x, total = sum(x), count = length(x)
    UpdateResults function: total12 = total1 + total2, count12 = count1 + count2
    ProcessResults function: mean = combined total / combined count
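The four functions translate directly into runnable R. The sketch below is an illustration (the names and the serial driver are not from the talk); in a PEMA, the lapply() over chunks is what would run in parallel.

```r
## External memory algorithm for the mean, as the four functions above.
Initialize     <- function() list(total = 0, count = 0)
ProcessData    <- function(x) list(total = sum(x), count = length(x))
UpdateResults  <- function(ir1, ir2) list(total = ir1$total + ir2$total,
                                          count = ir1$count + ir2$count)
ProcessResults <- function(ir) ir$total / ir$count

## Serial driver: process each chunk, fold the intermediate results
## together, then produce the final result.
chunks <- split(rnorm(1e6), rep(1:100, each = 1e4))
ir <- Reduce(UpdateResults, lapply(chunks, ProcessData), Initialize())
ProcessResults(ir)   # matches mean(unlist(chunks))
```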
  • 15. A formalization of PEMA's
    Arrange the code into 4 functions:
    1. Initialize(): does any necessary initialization
    2. ProcessData(): takes a chunk of data and produces an intermediate result (IR); this is the only function run in parallel; it must assume it does not have all the data; it must produce no side effects
    3. UpdateResults(): takes two IR's and produces another IR that is equivalent to the IR that would have been produced by combining the two corresponding chunks of data and calling ProcessData()
    4. ProcessResults(): takes any given IR and converts it into a "final results" (FR) form
  • 16. An external memory algorithm for GLM
    Initialization function: set intermediate values to 0.
    ProcessData function: for given β and chunk of data X, compute Z, W, and M, the weighted cross-products matrix of X and Z for this chunk:
    eta = Xβ, mu = linkinv(eta)
    Z = (y - mu)/mu_eta, W = sqrt(mu_eta*mu_eta)/varmu
    M = [X*W Z*W]'[X*W Z*W]
    UpdateResults function: M12 = M1 + M2
    ProcessResults function: β = Solve(M) (solves a set of linear equations)
    Check for convergence and repeat if necessary.
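In R, the chunk step is only a few lines. This sketch is illustrative rather than the talk's C++ code: fam is again a standard R family object, and ProcessResults() reads the answer out of the blocks of M, which has Z*W as its last column. Because each chunk's M is a sum of cross products, UpdateResults() is plain matrix addition, which is what makes the combine step cheap across threads and nodes.

```r
## Chunk-wise GLM pieces following this slide (illustrative sketch).
ProcessData <- function(beta, X, y, fam) {
  eta    <- drop(X %*% beta)
  mu     <- fam$linkinv(eta)
  mu_eta <- fam$mu.eta(eta)
  varmu  <- fam$variance(mu)
  Z      <- (y - mu) / mu_eta
  W      <- sqrt(mu_eta * mu_eta) / varmu
  crossprod(cbind(X, Z) * W)          # M = [X*W Z*W]'[X*W Z*W]
}

UpdateResults <- function(M1, M2) M1 + M2   # cross products just add

ProcessResults <- function(M) {
  p <- nrow(M) - 1                    # last row/column belongs to Z
  solve(M[1:p, 1:p, drop = FALSE], M[1:p, p + 1])
}
```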
  • 17. A C++ and R implementation of GLM
    C++ "analysis" objects:
    - have 4 virtual PEMA methods, among others
    - have member variables for intermediate results and for maintaining local state
    - know how to copy themselves (including the ability to not copy some members, for efficiency)
    - have the ability to call into R during ProcessData()
    R "family" objects for glm:
    - contain methods for computing Z and W (eta, mu, etc.)
  • 18. GLM in C++ and R: Multiple Cores
    On each computer, a master analysis object makes a copy of itself for all usable threads (cores) except one. The remaining thread is assigned to handle all I/O.
    In a master loop over the data, the I/O object reads a chunk of data. In parallel (after the first read), portions of the previously read chunk are (virtually) passed to the ProcessData() methods of the other objects.
  • 19. GLM in C++ and R: Multiple Cores – (2)
    For each chunk of data, Z and W are computed (in R or C++; if in R, only one thread at a time is allowed); Xβ and M are computed in C++.
    After all data has been consumed, the master analysis object loops over all of the thread-specific objects and updates itself (using UpdateResults()), resulting in the intermediate results object that corresponds to all of the data processed on this computer.
    If other computers are being used, this computer sends its intermediate results to the "master" node.
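The threading itself is in C++, but the shape of the per-computer phase can be mimicked in R with the mean-example functions defined earlier: process chunks in parallel, then fold the per-worker intermediate results into one node-level result. This is only an analogy; parallel::mclapply() forks processes rather than using threads (and requires mc.cores = 1 on Windows).

```r
library(parallel)

## Per-computer phase, schematically: ProcessData() on chunks in
## parallel, then the master folds the per-worker IRs into one IR.
thread_irs <- mclapply(chunks, ProcessData, mc.cores = 4)
node_ir    <- Reduce(UpdateResults, thread_irs, Initialize())
```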
  • 20. GLM in C++ and R: Multiple MPI Nodes
    A "master node" sends a copy of the analysis object, or instructions on how to create one, to each computer (node) on a cluster/grid, and the steps described above are carried out.
    Each node reads and processes its portion of the data (the more local the data the better).
    Worker nodes do not communicate with each other, and do not communicate with the master node except for sending their results.
  • 21. GLM in C++ and R: Multiple MPI Nodes – (2)
    When each node has its final IR object, it sends it to the master node. The master node gathers and combines all intermediate results using UpdateResults().
    When it has the final intermediate results, it calls ProcessResults() to get the next estimate of β. The master node checks for convergence, and repeats all of the steps if necessary.
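Across nodes, slides 20 and 21 describe a gather/combine/solve loop on the master. The sketch below is schematic, not runnable as-is: receive_ir_from(), broadcast_beta(), and converged() are hypothetical placeholders for the platform's communication layer (MPI, RPC), not real RevoScaleR functions.

```r
## Schematic master-node iteration loop (hypothetical helpers).
repeat {
  node_irs <- lapply(nodes, receive_ir_from)   # gather worker IRs
  final_ir <- Reduce(UpdateResults, node_irs)  # combine across nodes
  beta_new <- ProcessResults(final_ir)         # next estimate of beta
  if (converged(beta, beta_new)) break
  beta <- beta_new
  broadcast_beta(nodes, beta)                  # trigger the next pass
}
```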
  • 22. Implementation in RevoScaleR
    The package RevoScaleR, which is part of Revolution R Enterprise, contains an implementation of GLM and other algorithms based on this approach.
    The algorithms are internally threaded. They can currently use MPI or RPC for inter-process communication, and support the Platform LSF and HPC Server job schedulers. We are currently working on supporting Hadoop.
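As a usage sketch: rxGlm() in RevoScaleR takes a formula, a data source, and a family, much like glm(); the .xdf file and variable names below are invented for illustration.

```r
library(RevoScaleR)

## Logistic regression on an on-disk .xdf data set (file and variable
## names are made up for this example).
fit <- rxGlm(default ~ creditScore + income,
             data = "loans.xdf", family = binomial())
summary(fit)
```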
  • 23. Some features of this implementation
    - Handles an arbitrarily large number of rows in a fixed amount of memory
    - Scales linearly with the number of rows
    - Scales (approximately) linearly with the number of nodes
    - Scales well with the number of cores per node
    - Scales well with the number of parameters
    - Works on commodity hardware
    - Extremely high performance
  • 24. Scalability of linear regression with rows: 1 million to 1 billion rows, 443 betas, 4 cores. [Chart: elapsed time (secs) vs. millions of rows; scaling is linear, at roughly 1.1 million rows per second.]
  • 25. Scalability of glm (logit) with rows: 1 million to 1 billion rows, 443 betas, 4 cores. [Chart: elapsed time (secs) vs. millions of rows; scaling is linear.]
  • 26. Scalability with nodes: glm (logit). Big (1B rows) and small (124M rows) data; big (443 params) and small (7 params) models; 4 cores per node; 5 iterations per model. [Chart: elapsed time vs. number of nodes for Big Data/Big Model (super scaling), Big Data/Small Model, Small Data/Big Model, and Small Data/Small Model, with a linear-scaling reference line.]
  • 27. Timing comparisons
    - glm() in CRAN R vs rxGlm in RevoScaleR
    - SAS's new HPA functionality vs rxGlm
  • 28. [Chart: no text recoverable.]
  • 29. HPA Benchmarking comparison* – Logistic Regression

                        SAS HPA*        RevoScaleR rxGlm
    Rows of data        1 billion       1 billion
    Parameters          "just a few"    7
    Time                80 seconds      44 seconds
    Data location       In memory       On disk
    Nodes               32              5
    Cores               384             20
    RAM                 1,536 GB        80 GB

    Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
    *As published by SAS in HPC Wire, April 21, 2011
  • 30. Conclusion
    PEMA's provide a systematic approach to scalable analytic algorithms.
    Algorithms implemented in this way can handle unlimited numbers of rows on a single core in a fixed amount of RAM. Such algorithms scale well with rows and nodes, and scale well with cores up to a point. They work on commodity hardware and on different distributed computing platforms.
    Extremely high performance is possible.
  • 31. Thank you!
    R-Core Team, R package developers, the R community, Revolution R Enterprise customers and beta testers, and colleagues at Revolution Analytics.
    Contact: lee@revolutionanalytics.com