Parallel External Memory Algorithms Applied to Generalized Linear Models

Presentation by Lee Edlefsen, Revolution Analytics, at JSM 2012, San Diego, CA, July 30, 2012

For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives. To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMA's) provide the foundation for such software. External memory algorithms (EMA's) are those that do not require all data to be in RAM, and are widely available. Parallel implementations of EMA's allow them to run on multiple cores and computers, and to process unlimited rows of data. This paper describes a general approach to efficiently parallelizing EMA's, using an R and C++ implementation of GLM as a detailed example. It examines the requirements for efficient PEMA's; the arrangement of code for automatic parallelization; efficient threading; and efficient inter-process communication. It includes billion-row benchmarks showing linear scaling with rows and nodes, and demonstrating that extremely high performance is achievable.


Transcript

  • 1. Parallel External Memory Algorithms Applied to Generalized Linear Models. Lee E. Edlefsen, Ph.D., Chief Scientist, Revolution Analytics. JSM 2012
  • 2. Introduction and overview
    For the past several decades the rising tide of technology has allowed the same data analysis code to handle the increase in sizes of typical data sets. That era is ending. The size of data sets is increasing much more rapidly than the speed of single cores, of RAM, and of hard drives.
    To deal with this, statistical software must be able to use multiple cores and computers. Parallel external memory algorithms (PEMA's) provide a foundation for such software.
  • 3. Introduction and overview – (2)
    External memory algorithms (EMA's) are those that do not require all data to be in RAM, and are widely available.
    Parallel implementations of EMA's allow them to run on multiple cores and computers, and to process unlimited rows of data.
    This paper describes a general approach to efficiently parallelizing EMA's, using an R and C++ implementation of generalized linear models (GLM) as a detailed example.
  • 4. Introduction and overview – (3)
    This paper discusses:
    - the arrangement of code for "automatic" parallelization
    - the efficient use of cores
    - the efficient use of multiple computers (nodes)
    The approach presented is independent of the distributed computing platform (MPI, Hadoop, MPP database appliances).
    The paper includes billion-row benchmarks showing linear scaling with rows and nodes, and demonstrating that extremely high performance is achievable.
  • 5. High Performance Computing vs High Performance Analytics
    HPA is HPC + Data.
    High Performance Computing is CPU centric:
    - lots of processing on small amounts of data
    - focus is on cores
    High Performance Analytics is data centric:
    - less processing per amount of data
    - focus is on feeding data to the cores: on disk I/O and data locality, and on efficient threading and data management in RAM
  • 6. High Performance Analytics in RevoScaleR
    Extremely high performance data management and data analysis. Scales from small local data to huge distributed data, and from laptop to cluster to cloud.
    Based on a platform that "automatically" and efficiently parallelizes and distributes a broad class of predictive analytic algorithms. This platform implements the approach to parallel external memory algorithms I will describe.
  • 7. External memory algorithms
    External memory algorithms are those that allow computations to be split into pieces so that not all data has to be in memory at one time.
    Such algorithms process data a "chunk" at a time, storing intermediate results from each chunk and combining them at the end. Each chunk must produce an intermediate result that can be combined with other intermediate results to give the final result.
    Such algorithms are widely available for data management and predictive analytics.
  • 8. Parallel external memory algorithms (PEMA's)
    PEMA's are external memory algorithms that have been parallelized. Such algorithms process data a chunk at a time in parallel, storing intermediate results from each chunk and combining them at the end.
    External memory algorithms that are not "inherently sequential" can be parallelized:
    - results for one chunk of data cannot depend upon prior results
    - data dependence (lags, leads) is OK
  • 9. Generalized Linear Models (GLM)
    The generalized linear model can be thought of as a generalization of linear regression. It extends linear regression to handle dependent variables generated from exponential-family distributions, including Gaussian, Poisson, logistic, gamma, binomial, multinomial, and Tweedie.
    Generalized linear models are widely used in a variety of fields and industries.
  • 10. GLM overview
    The dependent variable Y is generated from a distribution in the exponential family.
    The expected value of Y is related to a linear predictor of the data X and parameters β through the inverse of a "link" function g():
    E(Y) = mu = g^-1(Xβ)
    The variance of Y is typically a function V() of the mean mu:
    Var(Y) = varmu = V(mu)
  • 11. GLM Estimation
    The parameters of GLM models can be estimated using maximum likelihood. Iteratively reweighted least squares (IRLS) is commonly used to obtain the maximum likelihood estimates.
    Each iteration of IRLS requires at least one pass through the data, generating a vector of weights and a "new" dependent variable and then doing a weighted least squares regression.
  • 12. IRLS for GLM
    Given an estimate of the parameters β and the data X, IRLS requires the computation of a "weight" variable W and a "new" dependent variable Z:
    eta = Xβ
    mu = linkinv(eta)
    Z = (y - mu)/mu_eta, where mu_eta is the partial derivative of mu with respect to eta
    W = sqrt(mu_eta*mu_eta)/varmu
    The next β is then computed by regressing Z on X, weighted by W. If the estimation has not converged, the steps are repeated.
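To make the recipe concrete, here is a minimal in-memory sketch of one IRLS step in R, following the slide's formulas. It is not the talk's implementation: the name irls_step is invented, fam is assumed to be a standard R family object (e.g. binomial()) supplying linkinv(), mu.eta(), and variance(), and adding the solved coefficients to β as an increment is one reading of "regressing Z on X", since Z here omits the eta offset.

```r
## One IRLS step following the slide's formulas (illustrative sketch).
irls_step <- function(beta, X, y, fam) {
  eta    <- drop(X %*% beta)               # linear predictor
  mu     <- fam$linkinv(eta)               # mean via inverse link
  mu_eta <- fam$mu.eta(eta)                # d mu / d eta
  varmu  <- fam$variance(mu)               # variance function V(mu)
  Z      <- (y - mu) / mu_eta              # "new" dependent variable
  W      <- sqrt(mu_eta * mu_eta) / varmu  # weight, as on the slide
  ## Weighted least squares of Z on X: scale the rows of X and Z by W,
  ## then solve the normal equations for the increment to beta.
  Xw <- X * W
  beta + drop(solve(crossprod(Xw), crossprod(Xw, Z * W)))
}
```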
  • 13. In-memory implementations
    The glm() function in R provides a beautiful and efficient in-memory implementation. However, nearly every computational line of code involves processing all rows of data.
    There is no easy way to directly convert an implementation like this into one that can handle data too big to fit into memory and that can use multiple cores and multiple computers. However, it can be accomplished by arranging the same computations into separate functions that accomplish separate tasks.
  • 14. Example external memory algorithm for the mean of a variable
    Initialization function: total = 0, count = 0
    ProcessData function: for each block of x, total = sum(x), count = length(x)
    UpdateResults function: total12 = total1 + total2, count12 = count1 + count2
    ProcessResults function: mean = combined total / combined count
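The four functions translate directly into runnable R. The sketch below is an illustration (the names and the serial driver are not from the talk); in a PEMA, the lapply() over chunks is what would run in parallel.

```r
## External memory algorithm for the mean, as the four functions above.
Initialize     <- function() list(total = 0, count = 0)
ProcessData    <- function(x) list(total = sum(x), count = length(x))
UpdateResults  <- function(ir1, ir2) list(total = ir1$total + ir2$total,
                                          count = ir1$count + ir2$count)
ProcessResults <- function(ir) ir$total / ir$count

## Serial driver: process each chunk, fold the intermediate results
## together, then produce the final result.
chunks <- split(rnorm(1e6), rep(1:100, each = 1e4))
ir <- Reduce(UpdateResults, lapply(chunks, ProcessData), Initialize())
ProcessResults(ir)   # matches mean(unlist(chunks))
```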
  • 15. A formalization of PEMA's
    Arrange the code into 4 functions:
    1. Initialize(): does any necessary initialization
    2. ProcessData(): takes a chunk of data and produces an intermediate result (IR); this is the only function run in parallel; it must assume it does not have all the data; it must produce no side effects
    3. UpdateResults(): takes two IR's and produces another IR that is equivalent to the IR that would have been produced by combining the two corresponding chunks of data and calling ProcessData()
    4. ProcessResults(): takes any given IR and converts it into a "final results" (FR) form
  • 16. An external memory algorithm for GLM
    Initialization function: set intermediate values to 0.
    ProcessData function: for given β and chunk of data X, compute Z, W, and M, the weighted cross-products matrix of X and Z for this chunk:
    eta = Xβ, mu = linkinv(eta)
    Z = (y - mu)/mu_eta, W = sqrt(mu_eta*mu_eta)/varmu
    M = [X*W Z*W]'[X*W Z*W]
    UpdateResults function: M12 = M1 + M2
    ProcessResults function: β = Solve(M) (solves a set of linear equations)
    Check for convergence and repeat if necessary.
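In R, the chunk step is only a few lines. This sketch is illustrative rather than the talk's C++ code: fam is again a standard R family object, and ProcessResults() reads the answer out of the blocks of M, which has Z*W as its last column. Because each chunk's M is a sum of cross products, UpdateResults() is plain matrix addition, which is what makes the combine step cheap across threads and nodes.

```r
## Chunk-wise GLM pieces following this slide (illustrative sketch).
ProcessData <- function(beta, X, y, fam) {
  eta    <- drop(X %*% beta)
  mu     <- fam$linkinv(eta)
  mu_eta <- fam$mu.eta(eta)
  varmu  <- fam$variance(mu)
  Z      <- (y - mu) / mu_eta
  W      <- sqrt(mu_eta * mu_eta) / varmu
  crossprod(cbind(X, Z) * W)          # M = [X*W Z*W]'[X*W Z*W]
}

UpdateResults <- function(M1, M2) M1 + M2   # cross products just add

ProcessResults <- function(M) {
  p <- nrow(M) - 1                    # last row/column belongs to Z
  solve(M[1:p, 1:p, drop = FALSE], M[1:p, p + 1])
}
```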
  • 17. A C++ and R implementation of GLM
    C++ "analysis" objects:
    - have 4 virtual PEMA methods, among others
    - have member variables for intermediate results and for maintaining local state
    - know how to copy themselves (including the ability to not copy some members, for efficiency)
    - have the ability to call into R during ProcessData()
    R "family" objects for glm:
    - contain methods for computing Z and W (eta, mu, etc.)
  • 18. GLM in C++ and R: Multiple Cores
    On each computer, a master analysis object makes a copy of itself for all usable threads (cores) except one. The remaining thread is assigned to handle all I/O.
    In a master loop over the data, the I/O object reads a chunk of data. In parallel (after the first read), portions of the previously read chunk are (virtually) passed to the ProcessData() methods of the other objects.
  • 19. GLM in C++ and R: Multiple Cores – (2)
    For each chunk of data, Z and W are computed (in R or C++; if in R, only one thread at a time is allowed); Xβ and M are computed in C++.
    After all data has been consumed, the master analysis object loops over all of the thread-specific objects and updates itself (using UpdateResults()), resulting in the intermediate results object that corresponds to all of the data processed on this computer.
    If other computers are being used, this computer sends its intermediate results to the "master" node.
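The threading itself is in C++, but the shape of the per-computer phase can be mimicked in R with the mean-example functions defined earlier: process chunks in parallel, then fold the per-worker intermediate results into one node-level result. This is only an analogy; parallel::mclapply() forks processes rather than using threads (and requires mc.cores = 1 on Windows).

```r
library(parallel)

## Per-computer phase, schematically: ProcessData() on chunks in
## parallel, then the master folds the per-worker IRs into one IR.
thread_irs <- mclapply(chunks, ProcessData, mc.cores = 4)
node_ir    <- Reduce(UpdateResults, thread_irs, Initialize())
```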
  • 20. GLM in C++ and R: Multiple MPI Nodes
    A "master node" sends a copy of the analysis object, or instructions on how to create one, to each computer (node) on a cluster/grid, and the steps described above are carried out.
    Each node reads and processes its portion of the data (the more local the data the better).
    Worker nodes do not communicate with each other, and do not communicate with the master node except for sending their results.
  • 21. GLM in C++ and R: Multiple MPI Nodes – (2)
    When each node has its final IR object, it sends it to the master node. The master node gathers and combines all intermediate results using UpdateResults().
    When it has the final intermediate results, it calls ProcessResults() to get the next estimate of β. The master node checks for convergence, and repeats all of the steps if necessary.
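Across nodes, slides 20 and 21 describe a gather/combine/solve loop on the master. The sketch below is schematic, not runnable as-is: receive_ir_from(), broadcast_beta(), and converged() are hypothetical placeholders for the platform's communication layer (MPI, RPC), not real RevoScaleR functions.

```r
## Schematic master-node iteration loop (hypothetical helpers).
repeat {
  node_irs <- lapply(nodes, receive_ir_from)   # gather worker IRs
  final_ir <- Reduce(UpdateResults, node_irs)  # combine across nodes
  beta_new <- ProcessResults(final_ir)         # next estimate of beta
  if (converged(beta, beta_new)) break
  beta <- beta_new
  broadcast_beta(nodes, beta)                  # trigger the next pass
}
```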
  • 22. Implementation in RevoScaleR
    The package RevoScaleR, which is part of Revolution R Enterprise, contains an implementation of GLM and other algorithms based on this approach.
    The algorithms are internally threaded. They can currently use MPI or RPC for inter-process communication, and support the Platform LSF and HPC Server job schedulers. We are currently working on supporting Hadoop.
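As a usage sketch: rxGlm() in RevoScaleR takes a formula, a data source, and a family, much like glm(); the .xdf file and variable names below are invented for illustration.

```r
library(RevoScaleR)

## Logistic regression on an on-disk .xdf data set (file and variable
## names are made up for this example).
fit <- rxGlm(default ~ creditScore + income,
             data = "loans.xdf", family = binomial())
summary(fit)
```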
  • 23. Some features of this implementation
    - Handles an arbitrarily large number of rows in a fixed amount of memory
    - Scales linearly with the number of rows
    - Scales (approximately) linearly with the number of nodes
    - Scales well with the number of cores per node
    - Scales well with the number of parameters
    - Works on commodity hardware
    - Extremely high performance
  • 24. Scalability of linear regression with rows: 1 million to 1 billion rows, 443 betas, 4 cores. [Chart: elapsed time (secs) vs. millions of rows; scaling is linear, at roughly 1.1 million rows per second.]
  • 25. Scalability of glm (logit) with rows: 1 million to 1 billion rows, 443 betas, 4 cores. [Chart: elapsed time (secs) vs. millions of rows; scaling is linear.]
  • 26. Scalability with nodes: glm (logit). Big (1B rows) and small (124M rows) data; big (443 params) and small (7 params) models; 4 cores per node; 5 iterations per model. [Chart: elapsed time vs. number of nodes for Big Data/Big Model (super scaling), Big Data/Small Model, Small Data/Big Model, and Small Data/Small Model, with a linear-scaling reference line.]
  • 27. Timing comparisons
    - glm() in CRAN R vs rxGlm in RevoScaleR
    - SAS's new HPA functionality vs rxGlm
  • 28. [Chart: no text recoverable.]
  • 29. HPA Benchmarking comparison* – Logistic Regression

                        SAS HPA*        RevoScaleR rxGlm
    Rows of data        1 billion       1 billion
    Parameters          "just a few"    7
    Time                80 seconds      44 seconds
    Data location       In memory       On disk
    Nodes               32              5
    Cores               384             20
    RAM                 1,536 GB        80 GB

    Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.
    *As published by SAS in HPC Wire, April 21, 2011
  • 30. Conclusion
    PEMA's provide a systematic approach to scalable analytic algorithms.
    Algorithms implemented in this way can handle unlimited numbers of rows on a single core in a fixed amount of RAM. Such algorithms scale well with rows and nodes, and scale well with cores up to a point. They work on commodity hardware and on different distributed computing platforms.
    Extremely high performance is possible.
  • 31. Thank you!
    R-Core Team, R package developers, the R community, Revolution R Enterprise customers and beta testers, and colleagues at Revolution Analytics.
    Contact: lee@revolutionanalytics.com