GBM in H2O with Cliff Click: H2O API


- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata


  1. GBM: Distributed Tree Algorithms on H2O
     Cliff Click, CTO, 0xdata
     cliffc@0xdata.com | http://0xdata.com | http://cliffc.org/blog
  2. H2O is...
     - Pure Java, Open Source: 0xdata.com
     - https://github.com/0xdata/h2o/
     - A Platform for doing Math
     - Parallel, Distributed Math
     - In-memory analytics: GLM, GBM, RF, Logistic Regression
     - Accessible via REST & JSON
     - A K/V Store: ~150ns per get or put
     - Distributed Fork/Join + Map/Reduce + K/V
  3. Agenda
     - Building Blocks For Big Data: Vecs & Frames & Chunks
     - Distributed Tree Algorithms
     - Access Patterns & Execution
     - GBM on H2O
     - Performance
  4. A Collection of Distributed Vectors

         // A Distributed Vector
         // much more than 2 billion elements
         class Vec {
           long length();                // more than an int's worth
           // fast random access
           double at(long idx);          // Get the idx'th elem
           boolean isNA(long idx);
           void set(long idx, double d); // writable
           void append(double d);        // variable sized
         }
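     For example, a column mean can be computed against just this interface; a minimal sketch (the mean helper itself is illustrative, not part of the API shown):

         // Sketch: mean of a Vec, skipping missing values.
         // Uses only length(), at() and isNA() from the Vec class above.
         static double mean( Vec v ) {
           double sum = 0;  long rows = 0;
           for( long i = 0; i < v.length(); i++ )
             if( !v.isNA(i) ) { sum += v.at(i); rows++; }
           return rows == 0 ? Double.NaN : sum / rows;
         }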
  5. Frames
     [Diagram: a Frame (Vec[] of age, sex, zip, ID, car columns) with its Vecs spread across four JVM heaps]
     - Vecs aligned in heaps
     - Optimized for concurrent access
     - Random access any row, any JVM
     - But faster if local... more on that later
  6. Distributed Data Taxonomy
     [Diagram: Vecs broken into Chunks across four JVM heaps; a Chunk is the unit of parallel access]
     - Typically 1e3 to 1e6 elements per Chunk
     - Stored compressed, in byte arrays
     - Get/put is a few clock cycles, including compression
  7. Distributed Parallel Execution
     [Diagram: Chunks of the same Vecs processed concurrently across four JVM heaps]
     - All CPUs grab Chunks in parallel
     - Fork/Join load balances
     - Code moves to Data
     - Map/Reduce & F/J handle all synchronization
     - H2O handles all communication and data management
  8. Distributed Data Taxonomy
     - Frame: a collection of Vecs
     - Vec: a collection of Chunks
     - Chunk: a collection of 1e3 to 1e6 elems
     - elem: a Java double
     - Row i: the i'th elements of all the Vecs in a Frame
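     Reading a Row, then, means indexing every Vec of a Frame at the same position; a sketch, assuming a Frame exposes its Vecs through a vecs() accessor (a hypothetical name):

         // Sketch: fetch row i across all columns of a Frame.
         // fr.vecs() is a hypothetical accessor; at(long) is the Vec API above.
         static double[] row( Frame fr, long i ) {
           Vec[] vecs = fr.vecs();
           double[] r = new double[vecs.length];
           for( int c = 0; c < vecs.length; c++ )
             r[c] = vecs[c].at(i);   // random access: any row, any JVM
           return r;
         }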
  9-10. Distributed Coding Taxonomy
     - No Distribution Coding: whole Algorithms, whole Vector-Math; REST + JSON, e.g. load data, GLM, get results (read the docs!)
     - Simple Data-Parallel Coding: per-Row (or neighbor-row) math, Map/Reduce-style, e.g. any dense linear algebra (this talk!)
     - Complex Data-Parallel Coding: K/V Store, graph algorithms, e.g. PageRank (join our GIT!)
  11. Simple Data-Parallel Coding
     - Map/Reduce Per-Row: Stateless
     - Example from Linear Regression: Σ y²
     - Auto-parallel, auto-distributed
     - Near Fortran speed, Java ease

         double sumY2 = new MRTask() {
           double map( double d ) { return d*d; }
           double reduce( double d1, double d2 ) { return d1+d2; }
         }.doAll( vecY );
  12. Simple Data-Parallel Coding
     - Map/Reduce Per-Row: State-full
     - Linear Regression Pass 1: Σ x, Σ y, Σ y² (usage sketched below)

         class LRPass1 extends MRTask {
           double sumX, sumY, sumY2; // I Can Haz State?
           void map( double X, double Y ) {
             sumX += X;  sumY += Y;  sumY2 += Y*Y;
           }
           void reduce( LRPass1 that ) {
             sumX  += that.sumX;
             sumY  += that.sumY;
             sumY2 += that.sumY2;
           }
         }
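     Driving the pass and consuming its state might look like this (a sketch; the doAll() call follows the slides' usage, and the moment arithmetic is standard):

         // Sketch: run the distributed pass, then turn sums into moments.
         LRPass1 lr = new LRPass1();
         lr.doAll( vecX, vecY );          // map over all rows, reduce the partials
         long   n     = vecY.length();
         double meanY = lr.sumY / n;
         double varY  = lr.sumY2 / n - meanY * meanY;  // E[Y^2] - E[Y]^2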
  13. Simple Data-Parallel Coding
     - Map/Reduce Per-Row: Batch State-full

         class LRPass1 extends MRTask {
           double sumX, sumY, sumY2;
           void map( Chunk CX, Chunk CY ) {    // Whole Chunks
             for( int i=0; i<CX.len; i++ ) {   // Batch!
               double X = CX.at(i), Y = CY.at(i);
               sumX += X;  sumY += Y;  sumY2 += Y*Y;
             }
           }
           void reduce( LRPass1 that ) {
             sumX  += that.sumX;
             sumY  += that.sumY;
             sumY2 += that.sumY2;
           }
         }
  14. Distributed Trees
     - Overlay a Tree over the data; really: assign a Tree Node to each Row
     - Number the Nodes; store a "Node_ID" per row in a temp Vec
     - Make a pass over all Rows
     - Nodes are not visited in order... but all rows and all Nodes are efficiently visited
     - Do work (e.g. histogram) per Row/Node (a pass is sketched below)

         Vec nids = v.makeZero();
         ... nids.set(row,nid) ...
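     A sketch of that re-assignment pass, in the Chunk-batch style of slide 13 (decideChild() is a stand-in for the algorithm's split test, and Chunk is assumed to support set() the way Vec does):

         // Sketch: one pass that moves every row from its current node
         // to the child chosen by the prior level's decision.
         class Reassign extends MRTask {
           void map( Chunk nids, Chunk ys ) {
             for( int r=0; r<nids.len; r++ ) {
               int nid   = (int)nids.at(r);              // node this row sits at
               int child = decideChild( nid, ys.at(r) ); // hypothetical split test
               nids.set(r, child);                       // overlay the next level
             }
           }
         }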
  15. Distributed Trees
     - An initial Tree: all rows at nid==0
     - MRTask: compute stats (here, MRTask.sum = 3.4)
     - Use the stats to make a decision... (varies by algorithm)!

         Tree: nid=0

         X   Y    nids
         A   1.2  0
         B   3.1  0
         C   -2.  0
         D   1.1  0
  16. Distributed Trees
     - Next layer in the Tree (another MRTask across rows)
     - Each row decides: if 1 < Y < 1.5, go left (nid=1), else right (nid=2)
     - Compute stats per new leaf: nid=1 sum=2.3, nid=2 sum=1.1
     - Each pass across all rows builds an entire layer

         X   Y    nids
         A   1.2  1
         B   3.1  2
         C   -2.  2
         D   1.1  1
  17. Distributed Trees
     - Another MRTask, another layer...
     - i.e., a 5-deep tree takes 5 passes

         Tree: root splits on 1 < Y < 1.5; nid=2 stays a leaf;
         nid=1 splits on Y==1.1 into nid=3 and nid=4,
         with per-node stats recomputed for the new leaves

         X   Y    nids
         A   1.2  3
         B   3.1  2
         C   -2.  2
         D   1.1  4
  18. Distributed Trees
     - Each pass is over one layer in the tree
     - Builds a per-node histogram in the map+reduce calls

         class Pass extends MRTask2<Pass> {
           void map( Chunk chks[] ) {
             Chunk nids = chks[...];             // Node-IDs per row
             for( int r=0; r<nids.len; r++ ) {   // All rows
               int nid = nids.at80(r);           // Node-ID of THIS row
               // Lazy: not all Chunks see all Nodes
               if( dHisto[nid]==null ) dHisto[nid] = ...
               // Accumulate histogram stats per node
               dHisto[nid].accum(chks,r);
             }
           }
         }
         new Pass().doAll(myDataFrame,nids);
  19. Distributed Trees
     - Each pass analyzes one Tree level
     - Then decide how to build the next level
     - Reassign Rows to new levels in another pass (actually the two passes are merged)
     - Builds a Histogram-per-Node, which requires a reduce() call to roll up (sketched below)
     - All Histograms for one level are done in parallel
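     The roll-up itself is a reduce over the per-node histograms; a sketch (DHisto and its add() method are stand-in names, not H2O's actual histogram class):

         // Sketch: merge another task's histogram array into this one.
         // The lazy-null handling mirrors the lazy creation in map() above.
         void reduce( Pass that ) {
           for( int nid = 0; nid < dHisto.length; nid++ ) {
             if( that.dHisto[nid] == null ) continue;  // no rows seen there
             if( dHisto[nid] == null ) dHisto[nid] = that.dHisto[nid];
             else                      dHisto[nid].add( that.dHisto[nid] );
           }
         }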
  20. Distributed Trees: utilities
     - "score+build" in one pass: test each row against the decision from the prior pass, assign it to a new leaf, and build the histogram on that leaf
     - "score": just walk the tree and get results (a walk is sketched below)
     - "compress": Tree from POJO to byte[]; easily 10x smaller, can still walk, score, print
     - Plus utilities to walk, print, display
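     The "score" utility is just a tree walk per row; a sketch over a POJO tree (the Node field names here are assumptions for illustration, not H2O's actual tree layout):

         // Sketch: walk one row down a decision tree to its leaf value.
         // col, split, left, right, isLeaf() and pred are hypothetical names.
         static double score( Node n, double[] row ) {
           while( !n.isLeaf() )
             n = row[n.col] < n.split ? n.left : n.right;
           return n.pred;
         }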
  21. GBM on Distributed Trees
     - GBM builds 1 Tree, 1 level at a time, but...
     - We run the entire level in parallel & distributed
     - Built breadth-first because it's "free": more data is offset by more CPUs
     - Classic GBM otherwise: build residuals tree-by-tree (driver loop sketched below)
     - Tuning knobs: trees, depth, shrinkage, min_rows
     - Pure Java
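     The outer loop, then, is classic boosting with only the level-builder distributed; a sketch with placeholder names (buildTreeBreadthFirst(), predict(), row() and initToMeanOfY() are illustrative, not H2O entry points):

         // Sketch: classic GBM driver. Each tree fits the current residuals,
         // then predictions take a shrinkage-sized step toward them.
         double[] preds = initToMeanOfY();   // hypothetical initialization
         for( int t = 0; t < ntrees; t++ ) {
           double[] resid = new double[nrows];
           for( int r = 0; r < nrows; r++ )
             resid[r] = y[r] - preds[r];     // residuals = current errors
           Tree tree = buildTreeBreadthFirst( resid, maxDepth, minRows );
           for( int r = 0; r < nrows; r++ )
             preds[r] += shrinkage * tree.predict( row(r) );
         }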
  22. GBM on Distributed Trees
     - Limiting factor: latency in turning over a level
     - About 4x faster than single-node R on covtype
     - Does the per-level compute in parallel
     - Requires sending histograms over the network, which can get big for very deep trees
  23. Summary: Write (parallel) Java
     - Most simple Java "just works"
     - Fast: parallel distributed reads, writes, appends
     - Reads: same speed as plain Java array loads
     - Writes, appends: slightly slower (compression)
     - Typically memory-bandwidth limited (may be CPU limited in a few cases)
     - Slower: conflicting writes (but follows the strict Java Memory Model)
     - Also supports transactional updates
  24. Summary: Writing Analytics
     - We're writing Big Data Analytics
     - Generalized Linear Modeling (ADMM, GLMNET): Logistic Regression, Poisson, Gamma
     - Random Forest, GBM, KMeans++, KNN
     - State-of-the-art Algorithms, running Distributed
     - Solidly working on 100G datasets
     - Heading for Tera Scale
     - Paying customers (in production!)
     - Come write your own (distributed) algorithm!!!
  25. Cool Systems Stuff...
     - ...that I ran out of space for
     - Reliable UDP, integrated w/RPC: TCP is reliably UNreliable; we already have a reliable UDP framework, so no problem
     - Fork/Join goodies: priority queues, distributed F/J, surviving fork bombs & lost threads
     - K/V does the JMM via a hardware-like MESI protocol
  26. H2O is...
     - Pure Java, Open Source: 0xdata.com
     - https://github.com/0xdata/h2o/
     - A Platform for doing Math
     - Parallel, Distributed Math
     - In-memory analytics: GLM, GBM, RF, Logistic Regression
     - Accessible via REST & JSON
     - A K/V Store: ~150ns per get or put
     - Distributed Fork/Join + Map/Reduce + K/V
  27. The Platform
     [Diagram: two JVMs, each running the same software stack, connected by K/V get/put over UDP / TCP. Per JVM, bottom to top: NFS / HDFS storage; byte[]; serialization via "extends Iced"; RPC via "extends DTask" + AutoBuffer; distributed F/J via "extends DRemoteTask"; Map/Reduce via "extends MRTask"; user code on top]
  28. Other Simple Examples
     - Filter & Count (underage males)
     - (can pass in any number of Vecs or a Frame)

         long count = new MRTask() {
           long map( long age, long sex ) {
             return (age<=17 && sex==MALE) ? 1 : 0;
           }
           long reduce( long d1, long d2 ) { return d1+d2; }
         }.doAll( vecAge, vecSex );
  29-30. Other Simple Examples
     - Filter into a new set (underage males)
     - Can write or append a subset of rows (append order is preserved)

         class Filter extends MRTask {
           void map(Chunk CRisk, Chunk CAge, Chunk CSex) {
             for( int i=0; i<CAge.len; i++ )
               if( CAge.at(i)<=17 && CSex.at(i)==MALE )
                 CRisk.append(CAge.at(i)); // build a set
           }
         };
         Vec risk = new AppendableVec();
         new Filter().doAll( risk, vecAge, vecSex );
         ...risk... // all the underage males
  31-32. Other Simple Examples
     - Group-by: count of car-types by age

         class AgeHisto extends MRTask {
           long carAges[][]; // count of cars by age
           void map( Chunk CAge, Chunk CCar ) {
             carAges = new long[numAges][numCars];
             for( int i=0; i<CAge.len; i++ )
               carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
           }
           void reduce( AgeHisto that ) {
             for( int i=0; i<carAges.length; i++ )
               for( int j=0; j<carAges[i].length; j++ )
                 carAges[i][j] += that.carAges[i][j];
           }
         }

     Setting carAges in map() makes it an output field: private per-map call, single-threaded write access, and it must be rolled up in the reduce call.
  33-34. Other Simple Examples
     - Uniques: uses a distributed hash set

         class Uniques extends MRTask {
           DNonBlockingHashSet<Long> dnbhs = new ...;
           void map( long id ) { dnbhs.add(id); }
           void reduce( Uniques that ) { dnbhs.putAll(that.dnbhs); }
         };
         long uniques = new Uniques().doAll( vecVisitors ).dnbhs.size();

     Setting dnbhs in <init> makes it an input field: shared across all map() calls, often read-only. This one is written, so it needs a reduce.
