Gradient Boosting Machine:
Distributed Regression Trees
on H2O
Cliff Click, CTO 0xdata
cliffc@0xdata.com
http://0xdata.com...
Cliff Click Explains GBM at Netflix October 10 2013

Cliff Click Explains GBM at Netflix October 10 2013:
Presentation on how GBM is distributed and implemented for large data.


  1. Gradient Boosting Machine: Distributed Regression Trees on H2O
     Cliff Click, CTO 0xdata
     cliffc@0xdata.com
     http://0xdata.com
     http://cliffc.org/blog
  2. H2O is...
     ● Pure Java, Open Source: 0xdata.com
       – https://github.com/0xdata/h2o/
     ● A Platform for doing Math
     ● Parallel Distributed Math
     ● In-memory analytics: GLM, GBM, RF, Logistic Reg
     ● Accessible via REST & JSON
     ● A K/V Store: ~150ns per get or put
     ● Distributed Fork/Join + Map/Reduce + K/V
  3. Agenda
     ● Building Blocks For Big Data:
       – Vecs & Frames & Chunks
     ● Distributed Tree Algorithms
     ● Access Patterns & Execution
     ● GBM on H2O
     ● Performance
  4. A Collection of Distributed Vectors
       // A Distributed Vector
       // much more than 2 billion elements
       class Vec {
         long length();                  // more than an int's worth
         // fast random access
         double at(long idx);            // Get the idx'th elem
         boolean isNA(long idx);
         void set(long idx, double d);   // writable
         void append(double d);          // variable sized
       }
  5. Frames
     A Frame: Vec[]
     [Figure: a Frame with columns age, sex, zip, ID, car; each Vec is spread across the heaps of JVM 1 to JVM 4]
     ● Vecs aligned in heaps
     ● Optimized for concurrent access
     ● Random access any row, any JVM
     ● But faster if local... more on that later
  6. Distributed Data Taxonomy
     A Chunk, Unit of Parallel Access
     [Figure: five Vecs split into Chunks across the heaps of JVM 1 to JVM 4]
     ● Typically 1e3 to 1e6 elements
     ● Stored compressed
     ● In byte arrays
     ● Get/put is a few clock cycles including compression
  7. Distributed Parallel Execution
     [Figure: five Vecs split into Chunks across the heaps of JVM 1 to JVM 4]
     ● All CPUs grab Chunks in parallel
     ● F/J load balances
     ● Code moves to Data
     ● Map/Reduce & F/J handle all sync
     ● H2O handles all comm, data management
  8. Distributed Data Taxonomy
     Frame – a collection of Vecs
     Vec – a collection of Chunks
     Chunk – a collection of 1e3 to 1e6 elems
     elem – a java double
     Row i – i'th elements of all the Vecs in a Frame
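The Vec/Chunk layering can be illustrated on a single JVM with plain arrays. This is only a sketch under stated assumptions: `ChunkedVec`, `chunkLen`, and `numChunks` are hypothetical names, the chunks are fixed-size and uncompressed, and NaN is assumed to mark a missing value. It is not H2O's implementation, which compresses chunks and spreads them across JVMs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-JVM stand-in for the Vec/Chunk layering; not H2O's classes.
final class ChunkedVec {
    private final int chunkLen;                          // elements per chunk (assumed fixed)
    private final List<double[]> chunks = new ArrayList<>();
    private long length;                                 // a long, as in the slide's Vec

    ChunkedVec(int chunkLen) {
        if (chunkLen <= 0) throw new IllegalArgumentException("chunkLen must be positive");
        this.chunkLen = chunkLen;
    }

    long length() { return length; }
    int numChunks() { return chunks.size(); }

    // Append grows the vector; a new chunk is allocated when the last one is full.
    void append(double d) {
        int offset = (int) (length % chunkLen);
        if (offset == 0) chunks.add(new double[chunkLen]);
        chunks.get(chunks.size() - 1)[offset] = d;
        length++;
    }

    // A row index maps to (chunk index, offset inside that chunk).
    double at(long idx) { return chunkOf(idx)[(int) (idx % chunkLen)]; }
    void set(long idx, double d) { chunkOf(idx)[(int) (idx % chunkLen)] = d; }
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }  // assumption: NaN marks a missing value

    private double[] chunkOf(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
        return chunks.get((int) (idx / chunkLen));
    }
}
```

Because every chunk is an independent array, a worker can process one chunk without touching any other, which is what makes the chunk a natural unit of parallel access.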
  9. Distributed Coding Taxonomy
     ● No Distribution Coding:
       – Whole Algorithms, Whole Vector-Math
       – REST + JSON: e.g. load data, GLM, get results
     ● Simple Data-Parallel Coding:
       – Per-Row (or neighbor row) Math
       – Map/Reduce-style: e.g. Any dense linear algebra
     ● Complex Data-Parallel Coding
       – K/V Store, Graph Algo's, e.g. PageRank
  10. Distributed Coding Taxonomy
      ● No Distribution Coding: Read the docs!
        – Whole Algorithms, Whole Vector-Math
        – REST + JSON: e.g. load data, GLM, get results
      ● Simple Data-Parallel Coding: This talk!
        – Per-Row (or neighbor row) Math
        – Map/Reduce-style: e.g. Any dense linear algebra
      ● Complex Data-Parallel Coding: Join our GIT!
        – K/V Store, Graph Algo's, e.g. PageRank
  11. Simple Data-Parallel Coding
      ● Map/Reduce Per-Row: Stateless
      ● Example from Linear Regression, Σ y²
          double sumY2 = new MRTask() {
            double map( double d ) { return d*d; }
            double reduce( double d1, double d2 ) { return d1+d2; }
          }.doAll( vecY );
      ● Auto-parallel, auto-distributed
      ● Near Fortran speed, Java Ease
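For readers without an H2O cluster at hand, the same stateless map/reduce shape can be tried on one machine with the standard Java streams API. This is a rough single-JVM analogue, not the MRTask API: `SumOfSquares` and `sumY2` are names chosen for this sketch, and the input is a plain `double[]` rather than a distributed Vec.

```java
import java.util.Arrays;

// Single-JVM analogue of the stateless map/reduce above, using parallel streams.
final class SumOfSquares {
    static double sumY2(double[] y) {
        return Arrays.stream(y)
                .parallel()                        // split the array across CPU cores
                .map(d -> d * d)                   // "map": square each element
                .reduce(0.0, (a, b) -> a + b);     // "reduce": add partial results
    }
}
```

The reduce function has to be associative, since partial sums are combined in whatever order the workers finish.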
  12. Simple Data-Parallel Coding
      ● Map/Reduce Per-Row: Stateful
      ● Linear Regression Pass1: Σ x, Σ y, Σ y²
          class LRPass1 extends MRTask {
            double sumX, sumY, sumY2; // I Can Haz State?
            void map( double X, double Y ) {
              sumX += X; sumY += Y; sumY2 += Y*Y;
            }
            void reduce( LRPass1 that ) {
              sumX  += that.sumX ;
              sumY  += that.sumY ;
              sumY2 += that.sumY2;
            }
          }
  13. Simple Data-Parallel Coding
      ● Map/Reduce Per-Row: Batch Stateful
          class LRPass1 extends MRTask {
            double sumX, sumY, sumY2;
            void map( Chunk CX, Chunk CY ) {      // Whole Chunks
              for( int i=0; i<CX.len; i++ ) {     // Batch!
                double X = CX.at(i), Y = CY.at(i);
                sumX += X; sumY += Y; sumY2 += Y*Y;
              }
            }
            void reduce( LRPass1 that ) {
              sumX  += that.sumX ;
              sumY  += that.sumY ;
              sumY2 += that.sumY2;
            }
          }
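The batch stateful pattern (one private accumulator per chunk, merged pairwise by reduce) can be sketched on one machine with the JDK's Fork/Join framework. Everything here is an assumption made for illustration: `LRSums`, `Pass`, `doAll`, and `chunkLen` are hypothetical names, and index ranges of plain arrays stand in for Chunks. It shows the pattern only and is not how MRTask is implemented.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical single-JVM stand-in for LRPass1 on plain arrays.
final class LRSums {
    double sumX, sumY, sumY2;

    // "map": accumulate one chunk of rows [lo, hi) into this task's private state.
    void map(double[] x, double[] y, int lo, int hi) {
        for (int i = lo; i < hi; i++) {
            sumX += x[i];
            sumY += y[i];
            sumY2 += y[i] * y[i];
        }
    }

    // "reduce": fold another task's partial sums into this one.
    void reduce(LRSums that) {
        sumX += that.sumX;
        sumY += that.sumY;
        sumY2 += that.sumY2;
    }

    static LRSums doAll(double[] x, double[] y, int chunkLen) {
        if (x.length != y.length) throw new IllegalArgumentException("length mismatch");
        if (chunkLen <= 0) throw new IllegalArgumentException("chunkLen must be positive");
        return ForkJoinPool.commonPool().invoke(new Pass(x, y, 0, x.length, chunkLen));
    }

    // Splits the row range in half until a piece is at most one chunk long.
    private static final class Pass extends RecursiveTask<LRSums> {
        private final double[] x, y;
        private final int lo, hi, chunkLen;

        Pass(double[] x, double[] y, int lo, int hi, int chunkLen) {
            this.x = x; this.y = y; this.lo = lo; this.hi = hi; this.chunkLen = chunkLen;
        }

        @Override
        protected LRSums compute() {
            if (hi - lo <= chunkLen) {
                LRSums s = new LRSums();
                s.map(x, y, lo, hi);
                return s;
            }
            int mid = lo + (hi - lo) / 2;
            Pass left = new Pass(x, y, lo, mid, chunkLen);
            Pass right = new Pass(x, y, mid, hi, chunkLen);
            left.fork();                          // run the left half asynchronously
            LRSums r = right.compute();           // run the right half in this thread
            LRSums l = left.join();
            l.reduce(r);
            return l;
        }
    }
}
```

Each leaf task writes only to its own `LRSums`, so the map step needs no locking; all coordination happens in `reduce`.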
  14. GBM (for K-classifier)
      Elements of Statistical Learning, 2nd Ed, 2009, Pg 387
      Trevor Hastie, Robert Tibshirani, Jerome Friedman
  15. Distributed Trees
      ● Overlay a Tree over the data
        – Really: Assign a Tree Node to each Row
        – Number the Nodes
        – Store "Node_ID" per row in a temp Vec
            Vec nids = v.makeZero();
            … nids.set(row,nid)...
      ● Make a pass over all Rows
        – Nodes not visited in order...
        – but all rows, all Nodes efficiently visited
      ● Do work (e.g. histogram) per Row/Node
  16. Distributed Trees
      ● An initial Tree
        – All rows start on n0
      ● MRTask: compute stats

      Tree: n0

        X     Y    nids
        0    1.3    0
        1    1.1    0
        2    3.1    0
        3   -2.1    0

      MRTask.avg = 0.85
      MRTask.var = 3.5075

      ● Use the stats to make a decision...
        – (varies by algorithm)!
        – (e.g. lowest MSE, best col, best split)
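The two statistics on this slide can be reproduced from the three running sums n, Σ y, and Σ y², which is the kind of state a map/reduce pass can accumulate. The sketch below assumes the slide reports the population variance (dividing by n); that assumption reproduces the 3.5075 shown. `NodeStats` is a hypothetical helper name.

```java
// Mean and population variance from running sums (n, sum of y, sum of y^2).
final class NodeStats {
    static double mean(double[] y) {
        double sum = 0;
        for (double v : y) sum += v;
        return sum / y.length;
    }

    // var = E[y^2] - (E[y])^2, so only sums need to be carried between chunks.
    static double variance(double[] y) {
        double sum = 0, sumSq = 0;
        for (double v : y) {
            sum += v;
            sumSq += v * v;
        }
        double m = sum / y.length;
        return sumSq / y.length - m * m;
    }
}
```

For Y = {1.3, 1.1, 3.1, -2.1}: Σ y = 3.4 and Σ y² = 16.92, so the mean is 3.4/4 = 0.85 and the variance is 16.92/4 − 0.85² = 4.23 − 0.7225 = 3.5075.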
  17. Distributed Trees
      ● Next layer in the Tree (and MRTask across rows)
        – Each row: decide! If "X<1.5" go right else left
        – Compute stats per new leaf
      ● Each pass across all rows builds entire layer

      Tree:
        n0
          X>=1.5 -> n1 (avg=0.5, var=6.76)
          X<1.5  -> n2 (avg=1.2, var=0.01)

        X     Y    nids
        0    1.3    2
        1    1.1    2
        2    3.1    1
        3   -2.1    1
  18. Distributed Trees
      ● Another MRTask, another layer...
      ● i.e., a 5-deep tree takes 5 passes
      ● Fully data-parallel for each tree level

      Tree:
        n0
          X>=1.5 -> n1
            X>=2.5 -> n3 (avg=-2.1)
            X<2.5  -> n4 (avg=3.1)
          X<1.5  -> n2 (avg=1.2)

        X     Y    nids
        0    1.3    2
        1    1.1    2
        2    3.1    4
        3   -2.1    3
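One level of this walk-through can be replayed on the four-row toy data with two small loops: one that assigns each row to a child node, and one that computes a per-node average. `LevelPass`, `assign`, and `meanPerNode` are hypothetical names for this sketch, and the loops are sequential; in the talk's design the same work runs as a distributed pass over Chunks.

```java
// Replays one tree level on the toy data: reassign rows, then average Y per node.
final class LevelPass {
    // Rows with x >= split go to geNode, the rest to ltNode.
    static int[] assign(double[] x, double split, int geNode, int ltNode) {
        int[] nids = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            nids[i] = x[i] >= split ? geNode : ltNode;
        }
        return nids;
    }

    // Per-node mean of y; nodes that received no rows get NaN.
    static double[] meanPerNode(double[] y, int[] nids, int numNodes) {
        double[] sum = new double[numNodes];
        int[] cnt = new int[numNodes];
        for (int i = 0; i < y.length; i++) {
            sum[nids[i]] += y[i];
            cnt[nids[i]]++;
        }
        double[] mean = new double[numNodes];
        for (int n = 0; n < numNodes; n++) {
            mean[n] = cnt[n] == 0 ? Double.NaN : sum[n] / cnt[n];
        }
        return mean;
    }
}
```

With X = {0, 1, 2, 3}, Y = {1.3, 1.1, 3.1, -2.1} and the split X >= 1.5 sending rows to n1 (otherwise n2), the node IDs come out as {2, 2, 1, 1} and the averages as 0.5 for n1 and 1.2 for n2, matching the first split shown in the slides.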
  19. Distributed Trees
      ● Each pass is over one layer in the tree
      ● Builds per-node histogram in map+reduce calls
          class Pass extends MRTask2<Pass> {
            void map( Chunk chks[] ) {
              Chunk nids = chks[...];              // Node-IDs per row
              for( int r=0; r<nids.len; r++ ) {    // All rows
                int nid = (int)nids.at80(r);       // Node-ID THIS row
                // Lazy: not all Chunks see all Nodes
                if( dHisto[nid]==null ) dHisto[nid]=...
                // Accumulate histogram stats per node
                dHisto[nid].accum(chks,r);
              }
            }
          }
          new Pass().doAll(myDataFrame,nids);
  20. Distributed Trees
      ● Each pass analyzes one Tree level
        – Then decide how to build next level
        – Reassign Rows to new levels in another pass
          (actually merge the two passes)
      ● Builds a Histogram-per-Node
        – Which requires a reduce() call to roll up
        – All Histograms for one level done in parallel
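The histogram-per-node idea and its reduce step can be sketched with plain arrays. All names here (`NodeHisto`, `HistoPass`, `accum`, `merge`) and the fixed-width binning are assumptions for illustration; the talk does not show H2O's histogram class, and this is not it. Each "map" call sees only its own rows, creates histograms lazily for the nodes it meets, and "reduce" rolls two partial results together node by node.

```java
// Hypothetical per-node histogram: per-bin row count and sum of the response.
final class NodeHisto {
    final long[] counts;
    final double[] sums;
    private final double min, binWidth;

    NodeHisto(double min, double max, int nbins) {
        this.min = min;
        this.binWidth = (max - min) / nbins;
        this.counts = new long[nbins];
        this.sums = new double[nbins];
    }

    void accum(double x, double y) {
        int b = (int) ((x - min) / binWidth);
        b = Math.max(0, Math.min(counts.length - 1, b));   // clamp edge values into range
        counts[b]++;
        sums[b] += y;
    }

    void merge(NodeHisto that) {
        for (int b = 0; b < counts.length; b++) {
            counts[b] += that.counts[b];
            sums[b] += that.sums[b];
        }
    }
}

final class HistoPass {
    // "map": scan rows [lo, hi) of one chunk, creating histograms lazily per node.
    static NodeHisto[] map(double[] x, double[] y, int[] nids, int lo, int hi,
                           int numNodes, double min, double max, int nbins) {
        NodeHisto[] h = new NodeHisto[numNodes];
        for (int r = lo; r < hi; r++) {
            int nid = nids[r];
            if (h[nid] == null) h[nid] = new NodeHisto(min, max, nbins);
            h[nid].accum(x[r], y[r]);
        }
        return h;
    }

    // "reduce": roll two partial results into one, node by node.
    static NodeHisto[] reduce(NodeHisto[] a, NodeHisto[] b) {
        for (int n = 0; n < a.length; n++) {
            if (b[n] == null) continue;            // b's chunk never saw this node
            if (a[n] == null) a[n] = b[n];
            else a[n].merge(b[n]);
        }
        return a;
    }
}
```

The size of what travels between workers is (nodes in the level) × (bins) × (columns), independent of the row count, which is why the histograms rather than the rows are what gets sent over the network.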
  21. Distributed Trees: utilities
      ● “score+build” in one pass:
        – Test each row against decision from prior pass
        – Assign to a new leaf
        – Build histogram on that leaf
      ● “score”: just walk the tree, and get results
      ● “compress”: Tree from POJO to byte[]
        – Easily 10x smaller, can still walk, score, print
      ● Plus utilities to walk, print, display
  22. GBM on Distributed Trees
      ● GBM builds 1 Tree, 1 level at a time, but...
      ● We run the entire level in parallel & distributed
        – Built breadth-first because it's "free"
        – More data offset by more CPUs
      ● Classic GBM otherwise
        – ESL2, page 387
        – Build residuals tree-by-tree
        – Tuning knobs: trees, depth, shrinkage, min_rows
      ● Pure Java
  23. GBM on Distributed Trees
      ● Limiting factor: latency in turning over a level
        – About 4x faster than R single-node on covtype
        – Does the per-level compute in parallel
        – Requires sending histograms over network
          (can get big for very deep trees)
      ● Adding more data offset by adding more Nodes
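"Build residuals tree-by-tree" and the trees and shrinkage knobs can be seen in a deliberately tiny, single-threaded example. It is a simplification and not the algorithm from the talk: it boosts squared-error regression rather than a K-class classifier, uses one feature, and fixes the depth at 1 (a single split per tree). `TinyGbm` and all of its members are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical toy: squared-error boosting with depth-1 trees on one feature.
final class TinyGbm {
    private final double init, shrinkage;
    private final double[] split, left, right;           // one stump per tree

    private TinyGbm(double init, double shrinkage, double[] split, double[] left, double[] right) {
        this.init = init; this.shrinkage = shrinkage;
        this.split = split; this.left = left; this.right = right;
    }

    static TinyGbm fit(double[] x, double[] y, int trees, double shrinkage) {
        int n = x.length;
        double init = Arrays.stream(y).average().orElse(0.0);   // start from the mean
        double[] f = new double[n];
        Arrays.fill(f, init);
        double[] split = new double[trees], left = new double[trees], right = new double[trees];
        double[] resid = new double[n];
        for (int t = 0; t < trees; t++) {
            for (int i = 0; i < n; i++) resid[i] = y[i] - f[i];  // residuals of the current model
            double[] s = bestStump(x, resid);                    // fit the next tree to them
            split[t] = s[0]; left[t] = s[1]; right[t] = s[2];
            for (int i = 0; i < n; i++) {
                f[i] += shrinkage * (x[i] < s[0] ? s[1] : s[2]); // damped update
            }
        }
        return new TinyGbm(init, shrinkage, split, left, right);
    }

    double predict(double x) {
        double f = init;
        for (int t = 0; t < split.length; t++) {
            f += shrinkage * (x < split[t] ? left[t] : right[t]);
        }
        return f;
    }

    static double mse(TinyGbm m, double[] x, double[] y) {
        double sse = 0;
        for (int i = 0; i < x.length; i++) {
            double e = y[i] - m.predict(x[i]);
            sse += e * e;
        }
        return sse / x.length;
    }

    // Tries every x value as a cut (rows with x < cut go left) and keeps the lowest SSE.
    private static double[] bestStump(double[] x, double[] r) {
        double mean = Arrays.stream(r).average().orElse(0.0);
        double[] best = {Double.NEGATIVE_INFINITY, 0.0, mean};   // fallback: no split at all
        double bestSse = Double.POSITIVE_INFINITY;
        for (double cut : x) {
            double sl = 0, sr = 0;
            int nl = 0, nr = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i] < cut) { sl += r[i]; nl++; } else { sr += r[i]; nr++; }
            }
            if (nl == 0) continue;                               // nothing on the left: not a split
            double ml = sl / nl, mr = sr / nr;
            double sse = 0;
            for (int i = 0; i < x.length; i++) {
                double e = r[i] - (x[i] < cut ? ml : mr);
                sse += e * e;
            }
            if (sse < bestSse) { bestSse = sse; best = new double[] {cut, ml, mr}; }
        }
        return best;
    }
}
```

With zero trees the model predicts the mean of Y everywhere; each added tree fits what the previous trees left unexplained, and `shrinkage` controls how much of that correction is applied per tree.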
  24. Summary: Write (parallel) Java
      ● Most simple Java “just works”
      ● Fast: parallel distributed reads, writes, appends
        – Reads same speed as plain Java array loads
        – Writes, appends: slightly slower (compression)
        – Typically memory bandwidth limited
          (may be CPU limited in a few cases)
      ● Slower: conflicting writes (but follows strict JMM)
        – Also supports transactional updates
  25. Summary: Writing Analytics
      ● We're writing Big Data Analytics
        – Generalized Linear Modeling (ADMM, GLMNET):
          Logistic Regression, Poisson, Gamma
        – Random Forest, GBM, KMeans++, KNN
      ● State-of-the-art Algorithms, running Distributed
      ● Solidly working on 100G datasets
        – Heading for Tera Scale
      ● Paying customers (in production!)
      ● Come write your own (distributed) algorithm!!!
  26. Cool Systems Stuff...
      ● … that I ran out of space for
      ● Reliable UDP, integrated w/RPC
      ● TCP is reliably UNReliable
        – Already have a reliable UDP framework, so no prob
      ● Fork/Join Goodies:
        – Priority Queues
        – Distributed F/J
        – Surviving fork bombs & lost threads
      ● K/V does JMM via hardware-like MESI protocol
  27. H2O is...
      ● Pure Java, Open Source: 0xdata.com
        – https://github.com/0xdata/h2o/
      ● A Platform for doing Math
      ● Parallel Distributed Math
      ● In-memory analytics: GLM, GBM, RF, Logistic Reg
      ● Accessible via REST & JSON
      ● A K/V Store: ~150ns per get or put
      ● Distributed Fork/Join + Map/Reduce + K/V
  28. The Platform
      [Figure: the same software stack drawn for JVM 1 and JVM 2. Labels per JVM: User code?, extends MRTask, extends DRemoteTask, extends DTask, extends Iced, D/F/J, RPC, AutoBuffer, byte[], NFS, HDFS. Labels between the JVMs: K/V get/put, UDP / TCP]
  29. Other Simple Examples
      ● Filter & Count (underage males):
        – (can pass in any number of Vecs or a Frame)
          long count = new MRTask() {
            long map( long age, long sex ) {
              return (age<=17 && sex==MALE) ? 1 : 0;
            }
            long reduce( long d1, long d2 ) { return d1+d2; }
          }.doAll( vecAge, vecSex );
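The same filter-and-count can be tried on one machine with a parallel stream over row indices, so that two columns are read in lockstep. This is a single-JVM analogue only: `FilterCount` is a hypothetical name, the columns are plain `long[]` arrays, and encoding `MALE` as 1 is an assumption, since the slides do not show how sex is encoded.

```java
import java.util.stream.IntStream;

// Single-JVM analogue of Filter & Count over two aligned columns.
final class FilterCount {
    static final long MALE = 1;   // assumed encoding; the slides do not specify it

    static long underageMales(long[] age, long[] sex) {
        if (age.length != sex.length) throw new IllegalArgumentException("length mismatch");
        return IntStream.range(0, age.length)
                .parallel()
                .filter(i -> age[i] <= 17 && sex[i] == MALE)   // per-row predicate ("map")
                .count();                                      // sum of matches ("reduce")
    }
}
```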
  30. Other Simple Examples
      ● Filter into new set (underage males):
        – Can write or append subset of rows
          (append order is preserved)
          class Filter extends MRTask {
            void map(Chunk CRisk, Chunk CAge, Chunk CSex) {
              for( int i=0; i<CAge.len; i++ )
                if( CAge.at(i)<=17 && CSex.at(i)==MALE )
                  CRisk.append(CAge.at(i)); // build a set
            }
          };
          Vec risk = new AppendableVec();
          new Filter().doAll( risk, vecAge, vecSex );
          ...risk... // all the underage males
  32. Other Simple Examples
      ● Group-by: count of car-types by age
          class AgeHisto extends MRTask {
            long carAges[][]; // count of cars by age
            void map( Chunk CAge, Chunk CCar ) {
              carAges = new long[numAges][numCars];
              for( int i=0; i<CAge.len; i++ )
                carAges[(int)CAge.at(i)][(int)CCar.at(i)]++;
            }
            void reduce( AgeHisto that ) {
              for( int i=0; i<carAges.length; i++ )
                for( int j=0; j<carAges[i].length; j++ )
                  carAges[i][j] += that.carAges[i][j];
            }
          }
  33. Other Simple Examples
      ● Group-by: count of car-types by age (the AgeHisto code from slide 32, annotated)
        – Setting carAges in map() makes it an output field.
        – Private per-map call, single-threaded write access.
        – Must be rolled-up in the reduce call.
  34. Other Simple Examples
      ● Uniques
        – Uses distributed hash set
          class Uniques extends MRTask {
            DNonBlockingHashSet<Long> dnbhs = new ...;
            void map( long id ) { dnbhs.add(id); }
            void reduce( Uniques that ) {
              dnbhs.putAll(that.dnbhs);
            }
          };
          long uniques = new Uniques().
            doAll( vecVisitors ).dnbhs.size();
  35. Other Simple Examples
      ● Uniques (the Uniques code from slide 34, annotated)
        – Uses distributed hash set
        – Setting dnbhs in <init> makes it an input field.
        – Shared across all maps(). Often read-only.
        – This one is written, so needs a reduce.
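On a single JVM the "shared set written by every map call" idea can be sketched with the JDK's concurrent key set. This is an analogue and not the `DNonBlockingHashSet` from the slide: `UniqueCount` is a hypothetical name, and because all threads here share one heap there is nothing to reduce, whereas in the distributed case each JVM holds its own copy that must be merged.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Single-JVM analogue of Uniques: many threads add into one thread-safe set.
final class UniqueCount {
    static int uniques(long[] ids) {
        Set<Long> seen = ConcurrentHashMap.newKeySet();    // shared across all workers
        Arrays.stream(ids).parallel().forEach(seen::add);  // concurrent "map" calls
        return seen.size();
    }
}
```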
