# 2013 05 ny


1. A Distributed Parallel Logistic Regression & GLM
   - Cliff Click, CTO, 0xdata
   - cliffc@0xdata.com
   - http://0xdata.com
   - http://cliffc.org/blog
2. H2O – A Platform for Big Math
   - In-memory distributed & parallel vector math
   - Pure Java, runs in cloud, server, laptop
   - Open source: http://0xdata.github.com/h2o
   - `java -jar h2o.jar -name meetup`
   - Will auto-cluster in this room
   - Best with default GC, largest heap
   - Inner loops: near FORTRAN speeds & Java ease
   - `for( int i=0; i<N; i++ ) ...do_something... // auto-distribute & par`
3. GLM & Logistic Regression
   - Vector Math (for non math majors):
   - At the core, we compute a Gram Matrix
   - i.e., we touch all the data
   - Logistic Regression – solve with Iterative RLS
   - Iterative: multiple passes, multiple Grams

   $$
   \begin{aligned}
   \eta_k &= X\beta_k \\
   \mu_k &= \operatorname{link}^{-1}(\eta_k) \\
   z &= \eta_k + (y - \mu_k)\cdot\operatorname{link}(\mu_k) \\
   \beta_{k+1} &= (X^T \cdot w \cdot X)^{-1}\cdot(X^T \cdot z)
   \end{aligned}
   $$
4. GLM & Logistic Regression
   - Vector Math (for non math majors):
   - At the core, we compute a Gram Matrix
   - i.e., we touch all the data
   - Logistic Regression – solve with Iterative RLS
   - Iterative: multiple passes, multiple Grams

   $$
   \begin{aligned}
   \eta_k &= X\beta_k \\
   \mu_k &= \operatorname{link}^{-1}(\eta_k) \\
   z &= \eta_k + (y - \mu_k)\cdot\operatorname{link}(\mu_k) \\
   \beta_{k+1} &= (X^T \cdot w \cdot X)^{-1}\cdot(X^T \cdot z)
   \end{aligned}
   $$

   Inverse solved with Cholesky Decomposition
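
For the logistic case, the generic steps can be written out concretely. The following is the textbook form of iteratively reweighted least squares with a logit link, given here as an assumption about what the compact slide notation stands for rather than as a statement of H2O's exact implementation. In this form the working response uses the derivative of the link, and the weights $W$ appear on both sides of the solve.

$$
\begin{aligned}
\operatorname{link}(\mu) &= \log\frac{\mu}{1-\mu}, \qquad
\operatorname{link}^{-1}(\eta) = \frac{1}{1+e^{-\eta}}, \qquad
\operatorname{link}'(\mu) = \frac{1}{\mu(1-\mu)} \\
\eta_k &= X\beta_k, \qquad \mu_k = \frac{1}{1+e^{-\eta_k}} \\
z &= \eta_k + \frac{y-\mu_k}{\mu_k(1-\mu_k)}, \qquad
W = \operatorname{diag}\bigl(\mu_k(1-\mu_k)\bigr) \\
\beta_{k+1} &= (X^T W X)^{-1}\, X^T W z
\end{aligned}
$$

Each iteration therefore needs one pass over all rows to build $X^T W X$ and $X^T W z$, followed by a small $p \times p$ solve.
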
5. GLM Running Time
   - n – number of rows or observations
   - p – number of features
   - Gram Matrix: $O(np^2)$ / #cpus
   - n can be billions; constant is really small
   - Data is distributed across machines
   - Cholesky Decomp: $O(p^3)$
   - Real limit: memory is $O(p^2)$, on a single node
   - Times a small number of iterations (5-50)
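
To make the $O(p^3)$ time and $O(p^2)$ memory figures concrete, here is a minimal plain-Java sketch of a Cholesky solve. The class `CholeskySolve` and its `solve` method are hypothetical names for illustration; the deck says H2O uses JAMA for this step, and this sketch is not that code. It reads only the lower triangle of the input, which matches the lower-triangular Gram accumulation used in the deck.

```java
// Hypothetical helper for illustration; not H2O's or JAMA's implementation.
final class CholeskySolve {
    /**
     * Solves A x = b for a symmetric positive-definite A using A = L L^T.
     * Only the lower triangle of a (j <= i) is read.
     * Time is O(p^3) for the factorization; memory is O(p^2) for L.
     */
    static double[] solve(double[][] a, double[] b) {
        int p = b.length;
        double[][] l = new double[p][p];
        // Factorize: compute L row by row.
        for (int i = 0; i < p; i++) {
            for (int j = 0; j <= i; j++) {
                double sum = a[i][j];
                for (int k = 0; k < j; k++) sum -= l[i][k] * l[j][k];
                if (i == j) {
                    if (sum <= 0) throw new ArithmeticException("matrix is not positive definite");
                    l[i][i] = Math.sqrt(sum);
                } else {
                    l[i][j] = sum / l[j][j];
                }
            }
        }
        // Forward substitution: L y = b.
        double[] y = new double[p];
        for (int i = 0; i < p; i++) {
            double sum = b[i];
            for (int k = 0; k < i; k++) sum -= l[i][k] * y[k];
            y[i] = sum / l[i][i];
        }
        // Back substitution: L^T x = y.
        double[] x = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            double sum = y[i];
            for (int k = i + 1; k < p; k++) sum -= l[k][i] * x[k];
            x[i] = sum / l[i][i];
        }
        return x;
    }
}
```

Because the cost depends only on p, this step stays cheap even when n is in the billions; the row-dependent work is all in building the Gram matrix.
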
6. Gram Matrix
   - Requires computing $X^T \cdot X$
   - A single observation:

         double x[], y;
         for( int i=0; i<P; i++ ) {
           for( int j=0; j<=i; j++ )
             _xx[i][j] += x[i]*x[j];
           _xy[i] += y*x[i];
         }
         _yy += y*y;

   - Computed per-row
   - Millions to billions of rows
   - Parallelize / distribute per-row
7. Distributed Vector Coding
   - Map-Reduce Style
   - Start with a Plain Olde Java Object
   - Private clone per-Map
   - Shallow-copy within JVM; deep-copy across JVMs
   - Map a "chunk" of data into private clone
   - "chunk" == all the rows that fit in 4 Meg
   - Reduce: combine pairs of cloned objects
8. Plain Old Java Object
   - Using the POJO:

         Gram G = new Gram();
         G.invoke(A);       // Compute the Gram of A
         ...G._xx[][]...    // Use the Gram for more math

   - Defining the POJO:

         class Gram extends MRTask {
           Key _data;                    // Input variable(s)
           // Output variables
           double _xx[][], _xy[], _yy;
           void map( Key chunk ) { … }
           void reduce( Gram other ) { … }
         }
9. Gram.map
   - Define the map:

         void map( Key chunk ) {
           // Pull in 4M chunk of data
           ...boiler plate...
           for( int r=0; r<rows; r++ ) {
             double y,x[] = decompress(r);
             for( int i=0; i<P; i++ ) {
               for( int j=0; j<=i; j++ )
                 _xx[i][j] += x[i]*x[j];
               _xy[i] += y*x[i];
             }
             _yy += y*y;
           }
         }
10. Gram.reduce
    - Define the reduce:

          // Fold other into this
          void reduce( Gram other ) {
            for( int i=0; i<P; i++ ) {
              for( int j=0; j<=i; j++ )
                _xx[i][j] += other._xx[i][j];
              _xy[i] += other._xy[i];
            }
            _yy += other._yy;
          }
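
The map and reduce fragments above can be tried outside H2O with a small self-contained sketch. `GramSketch` and its `compute` driver are hypothetical names; the sketch does not use `MRTask`, `Key`, or chunk decompression, and it runs the chunks sequentially in one JVM, whereas the deck describes H2O running `map` in parallel where each chunk lives. It only illustrates that one private accumulator per chunk, folded pairwise, gives the same sums as a single pass over all rows.

```java
// Hypothetical stand-in for the deck's Gram task; does not use H2O's MRTask API.
final class GramSketch {
    final int p;          // number of features
    final double[][] xx;  // lower triangle of X^T X
    final double[] xy;    // X^T y
    double yy;            // y^T y

    GramSketch(int p) {
        this.p = p;
        this.xx = new double[p][p];
        this.xy = new double[p];
    }

    /** "map": fold one chunk of rows into this private accumulator. */
    void map(double[][] xs, double[] ys) {
        for (int r = 0; r < xs.length; r++) {
            double[] x = xs[r];
            double y = ys[r];
            for (int i = 0; i < p; i++) {
                for (int j = 0; j <= i; j++) xx[i][j] += x[i] * x[j];
                xy[i] += y * x[i];
            }
            yy += y * y;
        }
    }

    /** "reduce": fold another partial result into this one. */
    void reduce(GramSketch other) {
        for (int i = 0; i < p; i++) {
            for (int j = 0; j <= i; j++) xx[i][j] += other.xx[i][j];
            xy[i] += other.xy[i];
        }
        yy += other.yy;
    }

    /** Sequential driver: one fresh accumulator per chunk, then fold them together. */
    static GramSketch compute(int p, double[][][] chunkXs, double[][] chunkYs) {
        GramSketch total = new GramSketch(p);
        for (int c = 0; c < chunkXs.length; c++) {
            GramSketch part = new GramSketch(p);
            part.map(chunkXs[c], chunkYs[c]);
            total.reduce(part);
        }
        return total;
    }
}
```

Since `reduce` is plain element-wise addition, the order in which partial results are combined does not change the exact sums in exact arithmetic, which is what lets the combining step run both locally and across machines.
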
11. Distributed Vector Coding 2
    - Gram Matrix computed in parallel & distributed
    - Excellent CPU & load-balancing
    - About 1sec per Gig for 32 medium EC2 instances
    - The whole Logistic Regression, about 10sec/Gig
      - Varies by #features, (i.e. billion rows, 1000 features)
    - Distribution & Parallelization handled by H2O
    - Data is pre-split by rows during parse/ingest
    - map(chunk) is run where chunk is local
    - reduce runs both local & distributed
      - Gram object auto-serialized, auto-cloned
12. Other Inner-Loop Considerations
    - Real inner loop has more cruft
    - Some columns excluded by user
    - Some rows excluded by sampling, or missing data
    - Data is normalized & centered
    - Categorical column expansion
      - Math is straightforward, but needs another indirection
    - Iterative Reweighted Least Squares
      - Adds weight to each row
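
Two of these items, skipping rows with missing data and adding a per-row weight, change the inner loop only slightly. The sketch below shows one way that could look; `WeightedGramRow.accumulate` is a hypothetical helper, and representing missing values as `NaN` is an assumption made for this example, not something the deck states.

```java
// Hypothetical helper for illustration; missing values are assumed to be NaN.
final class WeightedGramRow {
    /**
     * Adds one weighted observation into the lower-triangular accumulators
     * xx (X^T W X) and xy (X^T W y). Returns false, leaving the accumulators
     * unchanged, if the row contains a missing value.
     */
    static boolean accumulate(double[] x, double y, double w, double[][] xx, double[] xy) {
        if (Double.isNaN(y)) return false;
        for (double v : x) {
            if (Double.isNaN(v)) return false;
        }
        for (int i = 0; i < x.length; i++) {
            double wxi = w * x[i];
            for (int j = 0; j <= i; j++) xx[i][j] += wxi * x[j];
            xy[i] += wxi * y;
        }
        return true;
    }
}
```

The missing-value check runs before any accumulator is touched, so a skipped row cannot leave a partial update behind.
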
13. GLM + GLMGrid
    - Gram matrix is computed in parallel & distributed
    - Rest of GLM is all single-threaded pure Java
    - Includes JAMA for Cholesky Decomposition
    - Default 10-fold x-val runs in parallel
    - Warm-start all models for faster solving
    - GLMGrid: Parameter search for GLM
    - In parallel try all combos of λ & α
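
The grid-search idea, trying every (λ, α) pair in parallel and keeping the best, can be sketched with standard Java streams. `GridSearchSketch.best` is a hypothetical name, and the `score` callback is a placeholder for fitting and evaluating a model at the given parameters; nothing here reflects GLMGrid's actual API, and warm-starting is not modeled.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.function.DoubleBinaryOperator;

// Hypothetical sketch; score(lambda, alpha) stands in for "fit a GLM and return its error".
final class GridSearchSketch {
    /**
     * Evaluates score for every (lambda, alpha) combination on a parallel stream
     * and returns {lambda, alpha, score} for the lowest score.
     */
    static double[] best(double[] lambdas, double[] alphas, DoubleBinaryOperator score) {
        return Arrays.stream(lambdas).boxed().parallel()
            .flatMap(l -> Arrays.stream(alphas)
                .mapToObj(a -> new double[]{l, a, score.applyAsDouble(l, a)}))
            .min(Comparator.comparingDouble((double[] r) -> r[2]))
            .orElseThrow(() -> new IllegalArgumentException("empty grid"));
    }
}
```

The score function must be safe to call from several threads at once, since the combinations are evaluated concurrently.
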
14. Meta Considerations: Math @ Scale
    - Easy coding style is key:
    - 1st cut GLM ready in 2 weeks, but
    - Code was changing for months
    - Incremental evolution of a number of features
    - Distributed/parallel borders kept clean & simple
    - Java
    - Runs fine in a single-JVM in debugger + Eclipse
    - Well understood programming model
15. H2O: Memory Considerations
    - Runs best with default GC, largest -Xmx
    - Data cached in Java heap
    - Cache size vs heap monitored, spill-to-disk
    - FullGC typically <1sec even for >30G heap
    - If data fits – math runs at memory speeds
    - Else disk-bound
    - Ingest: Typically need 4x to 6x more memory
    - Depends on GZIP ratios & column-compress ratios
16. H2O: Reliable Network I/O
    - Uses both UDP & TCP
    - UDP for fast point-to-point control logic
    - Reliable UDP via timeout & retry
    - TCP, under load, reliably fails silently
      - No data at receiver, no errors at sender
      - 100% fail, <5mins in our labs or EC2
    - (so not a fault of virtualization)
    - TCP uses the same reliable comm layer as UDP
      - Only use TCP for congestion control of large xfers
17. H2O: S3 Ingest
    - H2O can inhale from S3 (and many others)
    - S3, under load, reliably fails
    - Unlike TCP, appears to throw exception every time
    - Again, wrap in a reliability retry layer
    - HDFS backed by S3 (jets3)
    - New failure mode: reports premature EOF
    - Again, wrap in a reliability retry layer
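
A "reliability retry layer" of the kind mentioned for S3 and HDFS reads might look roughly like the following. `Retry.withRetry` is a hypothetical helper written for this example, not H2O's code; the linear backoff and the choice to retry only on `IOException` (which includes `EOFException`, covering the premature-EOF case) are assumptions.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical retry wrapper for illustration; not H2O's implementation.
final class Retry {
    /**
     * Calls op until it succeeds or maxAttempts is reached. Only IOException
     * (including EOFException) triggers a retry; other exceptions propagate at once.
     * Sleeps backoffMillis * attempt between attempts.
     */
    static <T> T withRetry(int maxAttempts, long backoffMillis, Callable<T> op) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;  // remember the failure and try again
                if (attempt < maxAttempts) Thread.sleep(backoffMillis * attempt);
            }
        }
        throw last;  // every attempt failed; surface the most recent error
    }
}
```

A caller would wrap a single chunk read, for example `Retry.withRetry(5, 200L, () -> readChunk(key))`, where `readChunk` and `key` are placeholders for whatever performs the read. Retrying is only safe when the wrapped operation can be repeated without side effects, which holds for re-reading the same byte range.
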