Top 10 Data Parallelism and Model Parallelism lessons from scaling H2O.
"Math Algorithms have primarily been the domain of desktop data science. With the success of scalable algorithms at Google, Amazon, and Netflix, there is an ever growing demand for sophisticated algorithms over big data. In this talk, we get a ringside view in the making of the world's most scalable and fastest machine learning framework, H2O, and the performance lessons learnt scaling it over EC2 for Netflix and over commodity hardware for other power users.
Top 10 Performance Gotchas is about the white-hot stories of I/O wars, S3 resets, and muxers, as well as the power of primitive byte arrays, non-blocking structures, and fork/join queues. It is also about good data distribution and the decomposition of algorithms into fine-grain blocks of parallel computation. It's a 10-point story of the rage of a network of machines against the tyranny of Amdahl, all while preserving the statistical properties of the data and the accuracy of the algorithm."
5. H2O, the Prediction Engine
Big Data
Adhoc Exploration: Group By, Grep, Messy NAs
Math Modeling: Classification, Regression, Clustering, Ensembles
Real-time Scoring: 100's nanos models
6. No New API!
Big Data: Exploration, Modeling, Real-time Scoring
Approximate results each step!
7. Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
9. 10
Move Code not Data
Data chunks > code chunks
TCP for Data. UDP for Control.
>> Generated Javassist
10. A Chunk, Unit of Parallel Access
A Frame: Vec[]
[Diagram: Vecs age, sex, zip, ID, car laid out as columns, each split across the heaps of JVM 1 to JVM 4]
Vecs aligned in heaps
• Optimized for concurrent access
• Random access any row, any JVM
11. 9
Chunk-ing Express!
A season for variable-sized chunks
and a season for uniform chunks.
Tightly-packed!
(chunk is also unit of batch!)
12. 8
Reduce early. Reduce Often!
No Expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join
13. 8
Reduce early. Reduce Often!
[Diagram: five Vecs, each split across the heaps of JVM 1 to JVM 4]
All CPUs grab Chunks in parallel
Map/Reduce & F/J handles all sync
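The "reduce early, reduce often" idea with Fork/Join can be sketched as below. This is a minimal single-JVM illustration, not H2O's code: the class name `ChunkSum` and the use of a plain `double[][]` as a stand-in for a Vec's chunks are assumptions. Each chunk is reduced to a single double as soon as it is touched, and partial results are combined on the way back up the task tree, so no large intermediate state is built.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: sums a column stored as an array of primitive chunks.
final class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks; // each inner array stands in for one Chunk
    private final int lo, hi;        // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {
            // Leaf: reduce one chunk to a single double right away.
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();                 // run the left half asynchronously
        double r = right.compute();  // compute the right half in this thread
        return left.join() + r;      // combine the two partial sums
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

Calling `ChunkSum.sum(chunks)` lets idle worker threads steal unprocessed halves of the range, which is the fine-grain parallelism the slide refers to.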
14. 7
Slow is not different from Dead
Debugging slow
>> Heartbeats, Messages
Two Generals' Paradox
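A timeout-based heartbeat check shows why a slow node and a dead node look the same to its peers. The sketch below is illustrative only: `HeartbeatMonitor`, its method names, and the timeout policy are assumptions, not H2O's protocol. Time is passed in explicitly so the logic is easy to test.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical failure detector: a node is suspect once its last
// heartbeat is older than the timeout, whether it is slow or dead.
final class HeartbeatMonitor {
    private final ConcurrentHashMap<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Record that a heartbeat from `node` arrived at time `nowMillis`.
    void onHeartbeat(String node, long nowMillis) {
        lastSeen.put(node, nowMillis);
    }

    // True if the node has never been heard from or has been silent too long.
    boolean isSuspect(String node, long nowMillis) {
        Long seen = lastSeen.get(node);
        return seen == null || nowMillis - seen > timeoutMillis;
    }
}
```

The monitor cannot tell which of the two cases it is looking at, so the rest of the system has to treat both the same way.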
15. 6
Memory Manager
An in-memory system is only as good as
its memory manager!
lazy eviction.
compress.
align.
Corollary: Track down Leaks!
16. 5
Memory Overheads
Use primitives
// A Distributed Vector
// much more than 2 billion elements
abstract class Vec {
  abstract long length();                // more than an int's worth
  // fast random access
  abstract double at(long idx);          // Get the idx'th elem
  abstract boolean isNA(long idx);
  abstract void set(long idx, double d); // writable
  abstract void append(double d);        // variable sized
}
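One way to back such a vector with primitives is a list of fixed-size `double[]` blocks indexed by a `long`. The sketch below is a single-JVM illustration under stated assumptions: the class name `ChunkedVec`, the block size of 1024, and the use of NaN to mark NA are all choices made for this example, not details taken from H2O.

```java
import java.util.ArrayList;

// Hypothetical single-JVM stand-in for the Vec API on the slide.
final class ChunkedVec {
    private static final int CHUNK = 1 << 10; // assumed block size
    private final ArrayList<double[]> chunks = new ArrayList<>();
    private long length;

    long length() { return length; }

    double at(long idx) {
        checkIndex(idx);
        return chunks.get((int) (idx / CHUNK))[(int) (idx % CHUNK)];
    }

    // Assumption: NaN marks a missing value.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        chunks.get((int) (idx / CHUNK))[(int) (idx % CHUNK)] = d;
    }

    void append(double d) {
        // Allocate a new primitive block only when the last one is full.
        if (length % CHUNK == 0) chunks.add(new double[CHUNK]);
        chunks.get(chunks.size() - 1)[(int) (length % CHUNK)] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
    }
}
```

Each element costs 8 bytes in a `double[]`, whereas a collection of boxed `Double` objects adds an object header and a reference per element.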
17. 4
Cache-Oblivious
Tree size
Bin size
Recursively divide
Till Data → Cache
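The slide does not show how the recursion is applied to trees and bins, so the sketch below swaps in a classic textbook example of the same technique: a recursive matrix transpose. The problem is halved along its longer side until the working set is small enough to sit in cache, without the code knowing the cache size. The class name and the small base-case cutoff `BASE` are assumptions for illustration.

```java
// Hypothetical illustration of "recursively divide till data fits in cache".
final class CacheObliviousTranspose {
    private static final int BASE = 16; // small cutoff to limit recursion overhead

    // Transposes a row-major rows x cols matrix `src` into `dst` (cols x rows).
    static void transpose(double[] src, double[] dst, int rows, int cols) {
        rec(src, dst, rows, cols, 0, rows, 0, cols);
    }

    private static void rec(double[] src, double[] dst, int rows, int cols,
                            int r0, int r1, int c0, int c1) {
        int dr = r1 - r0, dc = c1 - c0;
        if (dr <= BASE && dc <= BASE) {
            // Base case: the tile is small, copy it directly.
            for (int r = r0; r < r1; r++)
                for (int c = c0; c < c1; c++)
                    dst[c * rows + r] = src[r * cols + c];
        } else if (dr >= dc) {
            int mid = r0 + dr / 2;   // split the longer side: rows
            rec(src, dst, rows, cols, r0, mid, c0, c1);
            rec(src, dst, rows, cols, mid, r1, c0, c1);
        } else {
            int mid = c0 + dc / 2;   // split the longer side: columns
            rec(src, dst, rows, cols, r0, r1, c0, mid);
            rec(src, dst, rows, cols, r0, r1, mid, c1);
        }
    }
}
```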
18. 3
EC2 – Nothing is bounded
User-mode reliability
S3 Readers will TCP Reset
Mux your connections
Not all toolkits are equal.
>> JetS3t
19. 2 No Locks, No Cry
Non-Blocking Data Structures.
// VOLATILE READ before key compare.
// CAS
private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
  return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
}
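The same compare-and-swap retry pattern can be shown without `Unsafe`, using the standard `AtomicReference`. The sketch below is a well-known lock-free stack (a Treiber stack), swapped in as a small self-contained example. It is not the hash map the fragment above comes from, and the class name `LockFreeStack` is an assumption.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lock-free stack: every update is a read, then a CAS, then a retry on failure.
final class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) {
            this.value = value;
            this.next = next;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    void push(T value) {
        Node<T> oldHead, newHead;
        do {
            oldHead = head.get();                 // volatile read of the current head
            newHead = new Node<>(value, oldHead);
        } while (!head.compareAndSet(oldHead, newHead)); // retry if another thread won
    }

    // Returns null when the stack is empty.
    T pop() {
        Node<T> oldHead;
        do {
            oldHead = head.get();
            if (oldHead == null) return null;
        } while (!head.compareAndSet(oldHead, oldHead.next));
        return oldHead.value;
    }
}
```

No thread ever blocks while holding a lock. A thread that loses the race re-reads the head and tries again.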
45. Distributed Coding Taxonomy
• No Distribution Coding:
    • Whole Algorithms, Whole Vector-Math!
    • REST + JSON: e.g. load data, GLM, get results!
• Simple Data-Parallel Coding:
    • Per-Row (or neighbor row) Math!
    • Map/Reduce-style: e.g. Any dense linear algebra!
• Complex Data-Parallel Coding
    • K/V Store, Graph Algo's, e.g. PageRank!
46. Distributed Coding Taxonomy
• No Distribution Coding: (Read the docs!)
    • Whole Algorithms, Whole Vector-Math!
    • REST + JSON: e.g. load data, GLM, get results!
• Simple Data-Parallel Coding: (This talk!)
    • Per-Row (or neighbor row) Math!
    • Map/Reduce-style: e.g. Any dense linear algebra!
• Complex Data-Parallel Coding (Join our GIT!)
    • K/V Store, Graph Algo's, e.g. PageRank!
47. Distributed Data Taxonomy
Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
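The taxonomy above maps onto a few small types. The sketch below is a single-JVM illustration with hypothetical names (`SketchVec`, `SketchFrame`); chunk placement across JVMs, compression, and NA handling are left out. It shows how a global row index is resolved to a chunk and an offset, and how row i is assembled from the i'th element of every Vec.

```java
// Hypothetical Vec: a collection of chunks, each a primitive double[].
final class SketchVec {
    private final double[][] chunks;
    private final long[] starts; // starts[k] = global index of chunk k's first element

    SketchVec(double[][] chunks) {
        this.chunks = chunks;
        this.starts = new long[chunks.length + 1];
        for (int k = 0; k < chunks.length; k++) starts[k + 1] = starts[k] + chunks[k].length;
    }

    long length() { return starts[chunks.length]; }

    double at(long idx) {
        if (idx < 0 || idx >= length()) throw new IndexOutOfBoundsException("idx=" + idx);
        // Binary search for the last chunk whose start is <= idx.
        int lo = 0, hi = chunks.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (starts[mid] <= idx) lo = mid; else hi = mid - 1;
        }
        return chunks[lo][(int) (idx - starts[lo])];
    }
}

// Hypothetical Frame: a collection of Vecs of equal length.
final class SketchFrame {
    private final SketchVec[] vecs;

    SketchFrame(SketchVec[] vecs) {
        for (SketchVec v : vecs)
            if (v.length() != vecs[0].length())
                throw new IllegalArgumentException("all Vecs must have the same length");
        this.vecs = vecs;
    }

    // Row i: the i'th element of every Vec in the Frame.
    double[] row(long i) {
        double[] r = new double[vecs.length];
        for (int j = 0; j < vecs.length; j++) r[j] = vecs[j].at(i);
        return r;
    }
}
```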