Your SlideShare is downloading. ×
0
Better
Predictions!

H2O – The Open Source Math Engine !
H2O –
Open Source
in-memory
Machine Learning
for Big Data
4/23/13
Universe is sparse. Life is messy. 

Data is sparse & messy.!
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code java
Group	
  By	
  
Grep	
  
Messy	
  
NAs	
  

Classifica-on	
  

Regression	
  

Clustering	
  

	
  	
  	
  	
  	
  	
  	
  ...
No New API!
Big	
  Data	
  
Explora-on	
  
Modeling	
  
Scoring	
  
Real-­‐-me	
  
	
  

H 2O
the
Prediction
Engine

Appro...
Intellectual	
  
Legacy	
  
	
  
Math	
  needs	
  	
  
to	
  be	
  free	
  
	
  
Open	
  Source	
  
	
  

Support and Inno...
All Top 10ʼs are binary!
- Anonymous
10	
  	
  	
  Move Code not Data	
  
Data chunks > code chunks
TCP for Data. UDP for Control.
>> Generated Java Assist
A Chunk, Unit of Parallel Access
A Frame: Vec[]
age	
  

sex	
  

zip	
  

ID	
  

car	
  

JVM
1
Heap
JVM
2
Heap
JVM
3
He...
9	
  	
  	
  Chunk-ing Express!	
  
season for Variable-sized chunks
and a season Uniform chunks.
Tightly-packed!
(chunk i...
8	
  	
  	
  Reduce early. Reduce Often!	
  
No Expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join
8	
  	
  	
  Reduce early. Reduce Often!	
  
Vec	
   Vec	
   Vec	
   Vec	
   Vec	
  

All CPUs grab
Chunks in parallel
Map...
7	
  	
  	
  Slow is not different from Dead	
  
Debugging slow
>> Heartbeats, Messages
Two General’s Paradox
6	
  	
  	
  Memory Manager	
  
in-memory system as good as
your memory manager!
lazy eviction.
compress.
align.
Corollary...
5	
  	
  	
  Memory Overheads	
  
Use primitives
// A Distributed Vector
//
much more than 2billion elements
class Vec {
l...
4	
  	
  	
  Cache-­‐Oblivious	
  
Tree size
Bin size

Recursively divide
Till Data à Cache
3	
  	
  	
  EC2 – Nothing is bounded	
  
User-mode reliability
S3 Readers will TCP Reset
Mux your connections
Not all too...
2 No Locks, No Cry

	
  

Non-Blocking Data Structures.

// VOLATILE READ before key compare.
// CAS
private final boolean...
1 endian wars ended!
Keep-It-Simple-Serialization.

	
  

byte[ ]. roll-your-own. fast.
public AutoBuffer putA1
( byte[] a...
Data Movement is a Defect.
Slowing down helps communication.
Got Speed?	
  
0	
  	
  	
  Math always produces a number	
  
Accuracy rules over speed.
Predictive Performance
1	
  	
  	
  Shuffle	
  
Data presentation bias.
Sorted data => interesting results
2	
  	
  	
  Random acts of Kindness?	
  
3	
  	
  	
  Convex Problems: ADMM	
  
4  Amdahl strikes:
Cholesky / QR Decomposition	
  
Matrix operations
jama, jblas.. all single node.
Distributed version
ne...
5	
  	
  Random	
  Forests	
  
embarrassingly parallel
binning
tree-building
splits
6	
  	
  Boos-ng	
  
iterate & stage
weak-learners =>
strong learners
each tree can be parallel
minimize communication
7	
  	
  Neural	
  Nets	
  &	
  Clustering	
  
embarrassingly parallel
pre-calculate base stats
distance calculation
weigh...
8	
  	
  Ensembles	
  
Daisy chain a bunch of models
Interleave.
JIT – Minimize loops over data.
9	
  	
  	
  Tools	
  
Deterministic versions first!
Got Pen & Paper?
Optimize often.
Test Big Data soon.
Replace NAs to improves

predictive performance by about 10pc.





!
- Newton
Munging Missing Features

impute NAs with mean

impute NAs with knn

impute with recursive pca!
- Boyd
Unbalanced data

single rare classes

Fraud / No-Fraud!
Stratify
Unbalanced data

multiple rare classes

Browse, Click, Purchase!
Stratify
10	
  	
  	
  Data

is the System	
  

Use Customer Data
Algorithms for Sparse vs. Dense
Unbalanced Data.
Robustness under...
Before H2O

Velocity:	
  Events	
  

Online	
  Scoring	
  

Volume:	
  HDFS	
  

Rule	
  Engine	
  

Munging
slice n dice
...
Big	
  Data	
  
Explora-on	
  
Modeling	
  
Scoring	
  
Real-­‐-me	
  
	
  

Big Data beats Better Algorithms!
Big	
  Data	
  
Explora-on	
  
Modeling	
  
Scoring	
  
Real-­‐-me	
  
	
  

Big Data and Better Algorithms!
Scale & Paral...
Intellectual	
  
Legacy	
  
	
  
Math	
  needs	
  	
  
to	
  be	
  free	
  
	
  
Open	
  Source	
  
	
  

Support and Inno...
Better
Predictions!

H2O – The Open Source Math Engine !
Distributed Coding Taxonomy

l 

No Distribution Coding:
l 
l 

l 

Whole Algorithms, Whole Vector-Math!
REST + JSON: ...
Distributed Coding Taxonomy

l 

No Distribution Coding:
l 

l 

Whole Algorithms, Whole Vector-Math!

l 

REST + JSON...
Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 el...
Usecases

Conversion, Retention & Churn!
•  Lead Conversion!
•  Engagement!
•  Product Placement!
•  Recommendations!
Pric...
Top 10 Performance Gotchas for scaling in-memory Algorithms.
Top 10 Performance Gotchas for scaling in-memory Algorithms.
Top 10 Performance Gotchas for scaling in-memory Algorithms.
Upcoming SlideShare
Loading in...5
×

Top 10 Performance Gotchas for scaling in-memory Algorithms.

1,885

Published on

Top 10 Data Parallelism and Model Parallelism lessons from scaling H2O.

"Math Algorithms have primarily been the domain of desktop data science. With the success of scalable algorithms at Google, Amazon, and Netflix, there is an ever growing demand for sophisticated algorithms over big data. In this talk, we get a ringside view in the making of the world's most scalable and fastest machine learning framework, H2O, and the performance lessons learnt scaling it over EC2 for Netflix and over commodity hardware for other power users.

Top 10 Performance Gotchas is about the white hot stories of i/o wars, S3 resets, and muxers, as well as the power of primitive byte arrays, non-blocking structures, and fork/join queues. Of good data distribution & fine-grain decomposition of Algorithms to fine-grain blocks of parallel computation. It's a 10-point story of the rage of a network of machines against the tyranny of Amdahl while keeping the statistical properties of the data and accuracy of the algorithm."

Published in: Technology, Education
0 Comments
6 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,885
On Slideshare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
33
Comments
0
Likes
6
Embeds 0
No embeds

No notes for slide

Transcript of "Top 10 Performance Gotchas for scaling in-memory Algorithms."

  1. 1. Better Predictions! H2O – The Open Source Math Engine !
  2. 2. H2O – Open Source in-memory Machine Learning for Big Data 4/23/13
  3. 3. Universe is sparse. Life is messy. 
 Data is sparse & messy.! - Lao Tzu
  4. 4. Hadoop = opportunity Not enough Data Scientists Analysts won’t code java
  5. 5. Group  By   Grep   Messy   NAs   Classifica-on   Regression   Clustering                           Ensembles 100’s       nanos     models                           H 2O Big Data the Adhoc   Explora-on   Math   Modeling   Real-­‐-me   Scoring   Prediction Engine
  6. 6. No New API! Big  Data   Explora-on   Modeling   Scoring   Real-­‐-me     H 2O the Prediction Engine Approximate! results each step!
  7. 7. Intellectual   Legacy     Math  needs     to  be  free     Open  Source     Support and Innovation hFps://github.com/0xdata/h2o   H 2O the Prediction Engine
  8. 8. All Top 10ʼs are binary! - Anonymous
  9. 9. 10      Move Code not Data   Data chunks > code chunks TCP for Data. UDP for Control. >> Generated Java Assist
  10. 10. A Chunk, Unit of Parallel Access A Frame: Vec[] age   sex   zip   ID   car   JVM 1 Heap JVM 2 Heap JVM 3 Heap JVM 4 Heap Vecs aligned in heaps l Optimized for concurrent access l Random access any row, any JVM l 
  11. 11. 9      Chunk-ing Express!   season for Variable-sized chunks and a season Uniform chunks. Tightly-packed! (chunk is also unit of batch!)
  12. 12. 8      Reduce early. Reduce Often!   No Expensive intermediate states. Fine-grain parallelism wins! >> Fork / Join
  13. 13. 8      Reduce early. Reduce Often!   Vec   Vec   Vec   Vec   Vec   All CPUs grab Chunks in parallel Map/Reduce & F/J handles all sync JVM 1 Hea p JVM 2 Hea p JVM 3 Hea p JVM 4 Hea p
  14. 14. 7      Slow is not different from Dead   Debugging slow >> Heartbeats, Messages Two General’s Paradox
  15. 15. 6      Memory Manager   in-memory system as good as your memory manager! lazy eviction. compress. align. Corollary: Track down Leaks!
  16. 16. 5      Memory Overheads   Use primitives // A Distributed Vector // much more than 2billion elements class Vec { long length(); // more than an int's worth // fast random access double at(long idx); // Get the idx'th elem boolean isNA(long idx); } void set(long idx, double d); // writable void append(double d); // variable sized
  17. 17. 4      Cache-­‐Oblivious   Tree size Bin size Recursively divide Till Data à Cache
  18. 18. 3      EC2 – Nothing is bounded   User-mode reliability S3 Readers will TCP Reset Mux your connections Not all toolkits are equal. >> JetS3
  19. 19. 2 No Locks, No Cry   Non-Blocking Data Structures. // VOLATILE READ before key compare. // CAS private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) { return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs ); }
  20. 20. 1 endian wars ended! Keep-It-Simple-Serialization.   byte[ ]. roll-your-own. fast. public AutoBuffer putA1 ( byte[] ary, int sofar, int length ) { while( sofar < length ) { int len = Math.min(length - sofar, _bb.remaining()); _bb.put(ary, sofar, len); sofar += len; if( sofar < length ) sendPartial(); } return this; }
  21. 21. Data Movement is a Defect. Slowing down helps communication. Got Speed?  
  22. 22. 0      Math always produces a number   Accuracy rules over speed. Predictive Performance
  23. 23. 1      Shuffle   Data presentation bias. Sorted data => interesting results
  24. 24. 2      Random acts of Kindness?  
  25. 25. 3      Convex Problems: ADMM  
  26. 26. 4  Amdahl strikes: Cholesky / QR Decomposition   Matrix operations jama, jblas.. all single node. Distributed version needs data transfer!
  27. 27. 5    Random  Forests   embarrassingly parallel binning tree-building splits
  28. 28. 6    Boos-ng   iterate & stage weak-learners => strong learners each tree can be parallel minimize communication
  29. 29. 7    Neural  Nets  &  Clustering   embarrassingly parallel pre-calculate base stats distance calculation weight matrices – small footprint
  30. 30. 8    Ensembles   Daisy chain a bunch of models Interleave. JIT – Minimize loops over data.
  31. 31. 9      Tools   Deterministic versions first! Got Pen & Paper? Optimize often. Test Big Data soon.
  32. 32. Replace NAs to improves
 predictive performance by about 10pc.
 
 
 ! - Newton
  33. 33. Munging Missing Features
 impute NAs with mean
 impute NAs with knn
 impute with recursive pca! - Boyd
  34. 34. Unbalanced data
 single rare classes
 Fraud / No-Fraud! Stratify
  35. 35. Unbalanced data
 multiple rare classes
 Browse, Click, Purchase! Stratify
  36. 36. 10      Data is the System   Use Customer Data Algorithms for Sparse vs. Dense Unbalanced Data. Robustness under noise
  37. 37. Before H2O Velocity:  Events   Online  Scoring   Volume:  HDFS   Rule  Engine   Munging slice n dice Features HIVE/SQL Applications Explora-on   Data Scientist        Modeling   Offline  Scoring   Engineer Business Analyst Ensemble models Low latency Classification Regression Clustering Optimal Model Predictions
  38. 38. Big  Data   Explora-on   Modeling   Scoring   Real-­‐-me     Big Data beats Better Algorithms!
  39. 39. Big  Data   Explora-on   Modeling   Scoring   Real-­‐-me     Big Data and Better Algorithms! Scale & Parallelism!
  40. 40. Intellectual   Legacy     Math  needs     to  be  free     Open  Source     Support and Innovation hFps://github.com/0xdata/h2o   H 2O the Prediction Engine
  41. 41. Better Predictions! H2O – The Open Source Math Engine !
  42. 42. Distributed Coding Taxonomy l  No Distribution Coding: l  l  l  Whole Algorithms, Whole Vector-Math! REST + JSON: e.g. load data, GLM, get results! Simple Data-Parallel Coding: l  l  l  Per-Row (or neighbor row) Math! Map/Reduce-style: e.g. Any dense linear algebra! Complex Data-Parallel Coding l  K/V Store, Graph Algo's, e.g. PageRank! 0xdata.c45  
  43. 43. Distributed Coding Taxonomy l  No Distribution Coding: l  l  Whole Algorithms, Whole Vector-Math! l  REST + JSON: e.g. load data, GLM, get results! Simple Data-Parallel Coding: l  Per-Row (or neighbor row) Math! l  l  Read  the  docs!   This  talk!   Map/Reduce-style: e.g. Any dense linear algebra! Complex Data-Parallel Coding l  K/V Store, Graph Algo's, e.g. PageRank! Join  our  GIT!   46  
  44. 44. Distributed Data Taxonomy Frame – a collection of Vecs Vec – a collection of Chunks Chunk – a collection of 1e3 to 1e6 elems elem – a java double Row i – i'th elements of all the Vecs in a Frame 0xdata.c47  
  45. 45. Usecases Conversion, Retention & Churn! •  Lead Conversion! •  Engagement! •  Product Placement! •  Recommendations! Pricing Engine! Fraud Detection!
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×