1. A Distributed Parallel Logistic Regression & GLM
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com
http://cliffc.org/blog
2. H2O – A Platform for Big Math
● In-memory distributed & parallel vector math
● Pure Java; runs in cloud, server, or laptop
● Open source: http://0xdata.github.com/h2o
● java -jar h2o.jar -name meetup
  ● Will auto-cluster in this room
● Best with the default GC and the largest heap
● Inner loops: near-FORTRAN speeds & Java ease
  ● for( int i=0; i<N; i++ ) ...do_something... // auto-distributed & parallel
3–4. GLM & Logistic Regression
● Vector math (for non-math majors):
  ● At the core, we compute a Gram matrix
  ● i.e., we touch all the data
● Logistic regression – solved with Iteratively Reweighted Least Squares (IRLS)
  ● Iterative: multiple passes, multiple Grams

    η_k = X·β_k
    μ_k = link⁻¹(η_k)
    z   = η_k + (y − μ_k)·link′(μ_k)
    β_{k+1} = (Xᵀ·W·X)⁻¹·(Xᵀ·W·z)

● The inverse is solved with a Cholesky decomposition
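To make the update rule concrete, here is a minimal single-JVM sketch of one IRLS step for the logit link, with a small Cholesky solver inlined. The names irlsStep and choleskySolve are illustrative, not H2O's API; the weighted-Gram accumulation in the middle is exactly the part the following slides distribute.

  // One IRLS step for logistic regression (logit link): accumulate the
  // weighted Gram X'WX and X'Wz over all rows, then solve by Cholesky.
  // A sketch only; in H2O the accumulation loop is the distributed part.
  static double[] irlsStep( double[][] X, double[] y, double[] beta ) {
    int n = X.length, p = X[0].length;
    double[][] xwx = new double[p][p];   // X'·W·X (lower triangle)
    double[]   xwz = new double[p];      // X'·W·z
    for( int r=0; r<n; r++ ) {
      double[] x = X[r];
      double eta = 0;                    // eta = x · beta
      for( int i=0; i<p; i++ ) eta += x[i]*beta[i];
      double mu = 1.0/(1.0+Math.exp(-eta));      // inverse logit link
      double w  = Math.max(mu*(1-mu), 1e-10);    // IRLS weight
      double z  = eta + (y[r]-mu)/w;             // working response
      for( int i=0; i<p; i++ ) {
        for( int j=0; j<=i; j++ )
          xwx[i][j] += w*x[i]*x[j];
        xwz[i] += w*x[i]*z;
      }
    }
    return choleskySolve(xwx, xwz);      // beta = (X'WX)^-1 · X'Wz
  }

  // Solve A·b = v for symmetric positive-definite A (lower triangle filled).
  static double[] choleskySolve( double[][] A, double[] v ) {
    int p = v.length;
    double[][] L = new double[p][p];
    for( int i=0; i<p; i++ )
      for( int j=0; j<=i; j++ ) {
        double s = A[i][j];
        for( int k=0; k<j; k++ ) s -= L[i][k]*L[j][k];
        L[i][j] = (i==j) ? Math.sqrt(s) : s/L[j][j];
      }
    double[] b = new double[p];
    for( int i=0; i<p; i++ ) {           // forward solve  L·u = v
      double s = v[i];
      for( int k=0; k<i; k++ ) s -= L[i][k]*b[k];
      b[i] = s/L[i][i];
    }
    for( int i=p-1; i>=0; i-- ) {        // back solve  L'·b = u
      double s = b[i];
      for( int k=i+1; k<p; k++ ) s -= L[k][i]*b[k];
      b[i] = s/L[i][i];
    }
    return b;
  }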
5. GLM Running Time
● n – number of rows or observations
● p – number of features
● Gram matrix: O(np²) / #cpus
  ● n can be billions; the constant is really small
  ● Data is distributed across machines
● Cholesky decomposition: O(p³)
  ● Real limit: memory is O(p²), on a single node (worked numbers below)
● Times a small number of iterations (5–50)
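Some illustrative back-of-envelope numbers (not a benchmark from this deck): with p = 1,000 features the Gram matrix is 10⁶ doubles, about 8 MB, which fits easily on one node. The O(np²) pass over n = 10⁹ rows is roughly 10¹⁵ multiply-adds spread over every CPU in the cluster, while the O(p³) Cholesky is only about 10⁹ operations on one node. The row pass dominates, which is why it is the part worth distributing.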
6. Gram Matrix
● Requires computing Xᵀ·X
● A single observation:

  double x[], y;
  for( int i=0; i<P; i++ ) {
    for( int j=0; j<=i; j++ )
      _xx[i][j] += x[i]*x[j];
    _xy[i] += y*x[i];
  }
  _yy += y*y;

● Computed per-row
● Millions to billions of rows
● Parallelize / distribute per-row
7. Distributed Vector Coding
● Map-Reduce style
● Start with a Plain Olde Java Object
● Private clone per-Map
  ● Shallow-copy within a JVM; deep-copy across JVMs
● Map a "chunk" of data into the private clone
  ● "chunk" == all the rows that fit in 4 MB
● Reduce: combine pairs of cloned objects
8. Plain Old Java Object
● Using the POJO:

  Gram G = new Gram();
  G.invoke(A);     // Compute the Gram of A
  ...G._xx[][]...  // Use the Gram for more math

● Defining the POJO:

  class Gram extends MRTask {
    Key _data;     // Input variable(s)
    // Output variables
    double _xx[][], _xy[], _yy;
    void map( Key chunk ) { ... }
    void reduce( Gram other ) { ... }
  }
9. Gram.map
● Define the map:

  void map( Key chunk ) {
    // Pull in 4MB chunk of data
    ...boiler plate...
    for( int r=0; r<rows; r++ ) {
      double y, x[] = decompress(r);
      for( int i=0; i<P; i++ ) {
        for( int j=0; j<=i; j++ )
          _xx[i][j] += x[i]*x[j];
        _xy[i] += y*x[i];
      }
      _yy += y*y;
    }
  }
10. Gram.reduce
● Define the reduce:

  // Fold other into this
  void reduce( Gram other ) {
    for( int i=0; i<P; i++ ) {
      for( int j=0; j<=i; j++ )
        _xx[i][j] += other._xx[i][j];
      _xy[i] += other._xy[i];
    }
    _yy += other._yy;
  }
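Slides 8–10 elide the boilerplate, so below is a self-contained, single-JVM sketch that stitches the same map/reduce shape together with Java's fork/join framework, runnable as-is over an in-memory double[][]. GramDemo and its row-count chunking threshold are illustrative stand-ins for H2O's MRTask, 4 MB chunks, and auto-serialization, not the real API.

  import java.util.Random;
  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveTask;

  // Single-JVM stand-in for the distributed Gram: "map" over row ranges
  // in parallel, then "reduce" pairs of partial Grams.
  class GramDemo extends RecursiveTask<GramDemo.Gram> {
    static class Gram {
      final double[][] _xx; final double[] _xy; double _yy;
      Gram( int p ) { _xx = new double[p][p]; _xy = new double[p]; }
      void map( double[] x, double y ) {   // one observation, as on slide 6
        for( int i=0; i<_xy.length; i++ ) {
          for( int j=0; j<=i; j++ ) _xx[i][j] += x[i]*x[j];
          _xy[i] += y*x[i];
        }
        _yy += y*y;
      }
      void reduce( Gram other ) {          // fold other into this, as on slide 10
        for( int i=0; i<_xy.length; i++ ) {
          for( int j=0; j<=i; j++ ) _xx[i][j] += other._xx[i][j];
          _xy[i] += other._xy[i];
        }
        _yy += other._yy;
      }
    }

    final double[][] _X; final double[] _y; final int _lo, _hi;
    GramDemo( double[][] X, double[] y, int lo, int hi ) {
      _X = X; _y = y; _lo = lo; _hi = hi;
    }

    @Override protected Gram compute() {
      if( _hi - _lo <= 1024 ) {            // a small-enough "chunk": map it
        Gram g = new Gram(_X[0].length);
        for( int r=_lo; r<_hi; r++ ) g.map(_X[r], _y[r]);
        return g;
      }
      int mid = (_lo + _hi) >>> 1;         // otherwise split the rows
      GramDemo left = new GramDemo(_X, _y, _lo, mid);
      left.fork();
      Gram right = new GramDemo(_X, _y, mid, _hi).compute();
      right.reduce(left.join());           // combine pairs of partial Grams
      return right;
    }

    public static void main( String[] args ) {
      int n = 100_000, p = 8;
      double[][] X = new double[n][p]; double[] y = new double[n];
      Random rnd = new Random(42);
      for( int r=0; r<n; r++ ) {
        for( int c=0; c<p; c++ ) X[r][c] = rnd.nextGaussian();
        y[r] = rnd.nextGaussian();
      }
      Gram g = ForkJoinPool.commonPool().invoke(new GramDemo(X, y, 0, n));
      System.out.println("xx[0][0]=" + g._xx[0][0] + "  yy=" + g._yy);
    }
  }

The shape is the point: map touches each row once, reduce folds pairs of partial results, and nothing else needs to know how the rows were split.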
11. Distributed Vector Coding 2
● Gram matrix computed in parallel & distributed
  ● Excellent CPU & load-balancing
  ● About 1 sec per gig on 32 medium EC2 instances
  ● The whole logistic regression: about 10 sec/gig
    – Varies by #features (e.g. a billion rows, 1000 features)
● Distribution & parallelization handled by H2O
  ● Data is pre-split by rows during parse/ingest
  ● map(chunk) is run where the chunk is local
  ● reduce runs both local & distributed
    – Gram object auto-serialized, auto-cloned
12. Other Inner-Loop Considerations
● The real inner loop has more cruft:
  ● Some columns excluded by the user
  ● Some rows excluded by sampling, or missing data
  ● Data is normalized & centered
  ● Categorical column expansion
    – The math is straightforward, but needs another indirection
  ● Iteratively Reweighted Least Squares
    – Adds a weight to each row
● (a sketch of the weighted, filtered loop follows below)
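As a sketch of that cruft, the per-row update from slide 6 grows roughly as below. The _mean/_sigma/_xx/_xy/_yy fields, the skip flag, and the weight parameter are hypothetical names, not H2O's actual inner loop:

  // Per-row Gram update with the extra cruft: a row filter, an IRLS
  // row weight, and normalize-&-center. Hypothetical field names.
  void mapRow( double x[], double y, double w, boolean skip ) {
    if( skip ) return;                    // sampled out, or missing data
    double xs[] = new double[P];          // normalized & centered copy
    for( int i=0; i<P; i++ )
      xs[i] = (x[i] - _mean[i]) / _sigma[i];
    for( int i=0; i<P; i++ ) {
      for( int j=0; j<=i; j++ )
        _xx[i][j] += w * xs[i] * xs[j];   // weight w from IRLS
      _xy[i] += w * y * xs[i];
    }
    _yy += w * y * y;
  }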
13. GLM + GLMGrid
● Gram matrix is computed in parallel & distributed
● Rest of GLM is all single-threaded pure Java
  ● Includes JAMA for the Cholesky decomposition
● Default 10-fold cross-validation runs in parallel
● Warm-start all models for faster solving
● GLMGrid: parameter search for GLM
  ● In parallel, try all combos of λ & α (sketched below)
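A minimal sketch of what the grid search amounts to, assuming hypothetical glmSolve, GLMModel, and Data names (H2O's real GLMGrid also warm-starts and cross-validates each model):

  import java.util.*;
  import java.util.concurrent.*;

  // Try every lambda/alpha combo in parallel; each grid point is an
  // independent GLM solve. Illustrative only -- not H2O's API.
  static List<GLMModel> glmGrid( Data data, double[] lambdas, double[] alphas )
      throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
    List<Future<GLMModel>> grid = new ArrayList<>();
    for( double lambda : lambdas )
      for( double alpha : alphas )
        grid.add(pool.submit(() -> glmSolve(data, lambda, alpha)));
    List<GLMModel> models = new ArrayList<>();
    for( Future<GLMModel> f : grid )
      models.add(f.get());   // block until each solve finishes
    pool.shutdown();
    return models;           // caller ranks them, e.g. by AUC
  }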
14. Meta Considerations: Math @ Scale
● An easy coding style is key:
  ● 1st cut of GLM was ready in 2 weeks, but
  ● the code kept changing for months
  ● Incremental evolution of a number of features
  ● Distributed/parallel borders kept clean & simple
● Java
  ● Runs fine in a single JVM in the debugger + Eclipse
  ● Well-understood programming model
15. H2O: Memory Considerations
● Runs best with the default GC and the largest -Xmx
● Data is cached in the Java heap
  ● Cache size vs heap is monitored; spill-to-disk
  ● Full GC typically <1 sec, even for a >30 GB heap
● If the data fits, the math runs at memory speeds
  ● Else disk-bound
● Ingest: typically need 4x to 6x more memory
  ● Depends on GZIP ratios & column-compression ratios
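Concretely, the launch line from slide 2 just gains a heap flag; the 30 GB here is illustrative (pick the largest the node allows):

  java -Xmx30g -jar h2o.jar -name meetup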
16. H2O: Reliable Network I/O
● Uses both UDP & TCP
  ● UDP for fast point-to-point control logic
  ● Reliable UDP via timeout & retry
● TCP, under load, reliably fails silently
  – No data at the receiver, no errors at the sender
  – 100% failure within 5 minutes, in our labs or on EC2
    ● (so not a fault of virtualization)
● TCP uses the same reliable comm layer as UDP
  – TCP is only used for congestion control of large transfers
17. H2O: S3 Ingest
● H2O can inhale from S3 (and many others)
● S3, under load, reliably fails
  ● Unlike TCP, it appears to throw an exception every time
  ● Again: wrap it in a reliability retry layer
● HDFS backed by S3 (jets3)
  ● New failure mode: reports premature EOF
  ● Again: wrap it in a reliability retry layer (sketched below)
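The "reliability retry layer" on the last two slides is, at heart, the same small pattern whether it wraps a UDP round-trip or an S3 read. A generic sketch, with hypothetical withRetries and readChunk names (H2O's real layer also tunes timeouts and backoff per call site):

  import java.util.concurrent.Callable;

  // Generic retry wrapper: run the action; on failure (an exception,
  // or a timeout surfaced as one), back off briefly and try again.
  // Illustrative only -- not H2O's actual comm layer.
  static <T> T withRetries( Callable<T> action, int maxTries ) throws Exception {
    Exception last = null;
    for( int tries = 0; tries < maxTries; tries++ ) {
      try { return action.call(); }
      catch( Exception e ) {
        last = e;                                  // e.g. S3 throw, premature EOF
        Thread.sleep(100L << Math.min(tries, 6));  // exponential backoff
      }
    }
    throw last;              // still failing after maxTries
  }

  // Usage, e.g. re-reading an S3 range that failed mid-stream:
  // byte[] chunk = withRetries(() -> readChunk(url, offset, len), 10);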
