Better
Predictions!

H2O – The Open Source Math Engine!
H2O –
Open Source
in-memory
Machine Learning
for Big Data
4/23/13
Universe is sparse. Life is messy. 

Data is sparse & messy.
- Lao Tzu
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java
Group By, Grep, Messy NAs
Classification, Regression, Clustering
Ensembles: 100's of models, nanos
H2O, the Prediction Engine
Big Data: Adhoc Exploration, Math Modeling, Real-time Scoring
No New API!
Big Data: Exploration, Modeling, Scoring, Real-time
H2O, the Prediction Engine
Approximate results each step!
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
H2O, the Prediction Engine
All Top 10ʼs are binary!
- Anonymous
10 Move Code not Data
Data chunks > code chunks
TCP for Data. UDP for Control.
>> Generated Java: Javassist
A Chunk, Unit of Parallel Access
A Frame: Vec[]
Vecs: age, sex, zip, ID, car
JVM 1 Heap | JVM 2 Heap | JVM 3 Heap | JVM 4 Heap
Vecs aligned in heaps
•  Optimized for concurrent access
•  Random access any row, any JVM
9 Chunk-ing Express!
A season for variable-sized chunks
and a season for uniform chunks.
Tightly-packed!
(chunk is also unit of batch!)
8 Reduce early. Reduce Often!
No Expensive intermediate states.
Fine-grain parallelism wins!
>> Fork / Join
Vec | Vec | Vec | Vec | Vec
All CPUs grab Chunks in parallel
Map/Reduce & F/J handle all sync
JVM 1 Heap | JVM 2 Heap | JVM 3 Heap | JVM 4 Heap
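The "all CPUs grab Chunks in parallel" picture can be sketched with the JDK Fork/Join framework. This is only an illustration under stated assumptions, not H2O's task class: `ChunkSum` is a hypothetical name, and a plain `double[][]` stands in for the Chunks of one Vec inside a single JVM. Each leaf reduces one chunk locally, and partial sums are combined on the way back up, so no large intermediate state is ever built.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: each double[] stands in for one Chunk of a Vec.
class ChunkSum extends RecursiveTask<Double> {
    private final double[][] chunks;
    private final int lo, hi; // half-open range of chunk indices

    ChunkSum(double[][] chunks, int lo, int hi) {
        this.chunks = chunks;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo == 0) return 0.0;
        if (hi - lo == 1) {
            // Map step: reduce a single chunk locally.
            double s = 0;
            for (double d : chunks[lo]) s += d;
            return s;
        }
        // Split the chunk range in half; run one half asynchronously.
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(chunks, lo, mid);
        ChunkSum right = new ChunkSum(chunks, mid, hi);
        left.fork();
        double r = right.compute();
        // Reduce step: combine two partial sums.
        return left.join() + r;
    }

    static double sum(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(chunks, 0, chunks.length));
    }
}
```

Calling `ChunkSum.sum(chunks)` blocks until every chunk has been reduced; Fork/Join takes care of the synchronization.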
7 Slow is not different from Dead
Debugging slow
>> Heartbeats, Messages
Two Generals’ Paradox
6 Memory Manager
An in-memory system is only as good as
its memory manager!
lazy eviction.
compress.
align.
Corollary: Track down Leaks!
5 Memory Overheads
Use primitives
// A Distributed Vector
// much more than 2 billion elements
abstract class Vec {
  abstract long length();                 // more than an int's worth
  // fast random access
  abstract double at(long idx);           // get the idx'th elem
  abstract boolean isNA(long idx);        // is the idx'th elem missing?
  abstract void set(long idx, double d);  // writable
  abstract void append(double d);         // variable sized
}
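A minimal single-JVM sketch of how such a vector might sit on top of primitive arrays. `ChunkedVec`, the uniform chunk length, and the use of NaN to mark an NA are all assumptions made for illustration; the slides do not show how H2O stores or distributes its chunks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-JVM stand-in for a distributed vector.
class ChunkedVec {
    private final int chunkLen;                              // elements per chunk (uniform)
    private final List<double[]> chunks = new ArrayList<>(); // primitives, no boxing
    private long length;

    ChunkedVec(int chunkLen) { this.chunkLen = chunkLen; }

    long length() { return length; }

    // Random access: the long index is split into (chunk, offset).
    double at(long idx) {
        check(idx);
        return chunks.get((int) (idx / chunkLen))[(int) (idx % chunkLen)];
    }

    // Assumption: NaN marks a missing value.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        check(idx);
        chunks.get((int) (idx / chunkLen))[(int) (idx % chunkLen)] = d;
    }

    // Appending allocates a fresh chunk whenever the last one is full.
    void append(double d) {
        if (length % chunkLen == 0) chunks.add(new double[chunkLen]);
        length++;
        set(length - 1, d);
    }

    private void check(long idx) {
        if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("idx=" + idx);
    }
}
```

Because every element lives in a `double[]`, the per-element cost stays at 8 bytes, with no per-element object header as there would be for `Double`.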
4 Cache-Oblivious
Tree size
Bin size

Recursively divide
till the data fits in cache
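"Recursively divide till the data fits in cache" is the general cache-oblivious recipe. The slide applies it to trees and bins; the sketch below uses a matrix transpose instead, purely because it is a compact, well-known instance of the same technique. `Transpose` and `LEAF` are hypothetical names, and `LEAF` is just a recursion cutoff, not a tuned cache size.

```java
// Cache-oblivious out-of-place transpose of a rows x cols row-major matrix.
class Transpose {
    static final int LEAF = 16; // recursion cutoff only; no cache size is assumed

    static double[] transpose(double[] a, int rows, int cols) {
        double[] out = new double[a.length];
        rec(a, out, rows, cols, 0, rows, 0, cols);
        return out;
    }

    // Handles the sub-block [r0, r1) x [c0, c1) of the input.
    private static void rec(double[] a, double[] out, int rows, int cols,
                            int r0, int r1, int c0, int c1) {
        int dr = r1 - r0, dc = c1 - c0;
        if (dr <= LEAF && dc <= LEAF) {
            // Small block: copy directly.
            for (int r = r0; r < r1; r++)
                for (int c = c0; c < c1; c++)
                    out[c * rows + r] = a[r * cols + c];
        } else if (dr >= dc) {
            // Split the longer dimension in half and recurse.
            int m = r0 + dr / 2;
            rec(a, out, rows, cols, r0, m, c0, c1);
            rec(a, out, rows, cols, m, r1, c0, c1);
        } else {
            int m = c0 + dc / 2;
            rec(a, out, rows, cols, r0, r1, c0, m);
            rec(a, out, rows, cols, r0, r1, m, c1);
        }
    }
}
```

At some depth of the recursion the working block fits in each level of the cache hierarchy, without the code ever naming a cache size.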
3 EC2 – Nothing is bounded
User-mode reliability
S3 Readers will TCP Reset
Mux your connections
Not all toolkits are equal.
>> JetS3t
2 No Locks, No Cry
Non-Blocking Data Structures.

// VOLATILE READ before key compare.
// CAS
private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
  return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
}
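The snippet above goes through `sun.misc.Unsafe`. A portable sketch of the same compare-and-swap pattern can be written with `java.util.concurrent.atomic.AtomicReference`. `KvsHolder` and `growTo` are hypothetical names; the retry loop shows only the lock-free swap of the backing array, and a real non-blocking hash map needs considerably more care (for example, writes that race with the copy).

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for a key/value array that is swapped with CAS, never locked.
class KvsHolder {
    private final AtomicReference<Object[]> kvs = new AtomicReference<>(new Object[2]);

    // Same contract as CAS_kvs: install newkvs only if oldkvs is still the current array.
    boolean casKvs(Object[] oldkvs, Object[] newkvs) {
        return kvs.compareAndSet(oldkvs, newkvs);
    }

    // Lock-free resize: retry until the current array is large enough.
    Object[] growTo(int minLen) {
        while (true) {
            Object[] cur = kvs.get(); // volatile read
            if (cur.length >= minLen) return cur;
            Object[] bigger = Arrays.copyOf(cur, Math.max(minLen, cur.length * 2));
            if (casKvs(cur, bigger)) return bigger;
            // Lost the race to another thread: loop and re-read.
        }
    }
}
```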
1 endian wars ended!
Keep-It-Simple-Serialization.
byte[ ]. roll-your-own. fast.
public AutoBuffer putA1( byte[] ary, int sofar, int length ) {
  while( sofar < length ) {
    int len = Math.min(length - sofar, _bb.remaining());
    _bb.put(ary, sofar, len);
    sofar += len;
    if( sofar < length ) sendPartial();
  }
  return this;
}
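A self-contained sketch of the same loop, to show what the buffer and the partial send could look like. `TinyBuffer`, its `close()` method, and the `ByteArrayOutputStream` standing in for the network socket are assumptions for illustration; the slide does not show how `_bb` or `sendPartial()` are actually implemented.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical roll-your-own serializer: a small fixed buffer flushed in pieces.
class TinyBuffer {
    private final ByteBuffer bb;
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream(); // stands in for the socket

    TinyBuffer(int capacity) { bb = ByteBuffer.allocate(capacity); }

    // Copies ary[sofar, length) into the buffer, flushing whenever it fills up.
    TinyBuffer putA1(byte[] ary, int sofar, int length) {
        while (sofar < length) {
            int len = Math.min(length - sofar, bb.remaining());
            bb.put(ary, sofar, len);
            sofar += len;
            if (sofar < length) sendPartial();
        }
        return this;
    }

    // Drains whatever is buffered to the sink and resets the buffer for reuse.
    private void sendPartial() {
        bb.flip();
        sink.write(bb.array(), bb.position(), bb.remaining());
        bb.clear();
    }

    // Flushes the tail and returns everything that was "sent".
    byte[] close() {
        sendPartial();
        return sink.toByteArray();
    }
}
```

An array larger than the buffer goes out as several partial sends, so the buffer size bounds memory use regardless of the array length.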
Data Movement is a Defect.
Slowing down helps communication.
Got Speed?
0 Math always produces a number
Accuracy rules over speed.
Predictive Performance
1 Shuffle
Data presentation bias.
Sorted data => interesting results
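One way to remove presentation bias is to visit rows in a seeded random order. The sketch below is a plain Fisher-Yates shuffle over row indices; `Shuffle.permutation` is a hypothetical helper, and the fixed seed is an assumption chosen so that runs stay reproducible.

```java
import java.util.Random;

class Shuffle {
    // Returns a seeded random permutation of the row indices 0..n-1 (Fisher-Yates).
    static int[] permutation(int n, long seed) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i;
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1); // uniform in [0, i]
            int t = idx[i];
            idx[i] = idx[j];
            idx[j] = t;
        }
        return idx;
    }
}
```

Shuffling indices rather than the data itself leaves the rows where they are; only the order in which they are presented changes.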
2 Random acts of Kindness?
3 Convex Problems: ADMM
4 Amdahl strikes:
Cholesky / QR Decomposition
Matrix operations
JAMA, jblas... all single node.
Distributed version
needs data transfer!
5 Random Forests
embarrassingly parallel
binning
tree-building
splits
6 Boosting
iterate & stage
weak-learners =>
strong learners
each tree can be parallel
minimize communication
7 Neural Nets & Clustering
embarrassingly parallel
pre-calculate base stats
distance calculation
weight matrices – small footprint
8 Ensembles
Daisy chain a bunch of models
Interleave.
JIT – Minimize loops over data.
9 Tools
Deterministic versions first!
Got Pen & Paper?
Optimize often.
Test Big Data soon.
Replace NAs to improve
predictive performance by about 10%.
- Newton
Munging Missing Features
Impute NAs with mean
Impute NAs with kNN
Impute with recursive PCA!
- Boyd
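The simplest of the three strategies, mean imputation, can be sketched as follows. `MeanImpute` is a hypothetical helper, and using NaN to represent an NA is an assumption of this sketch. The kNN and recursive-PCA variants are not shown.

```java
class MeanImpute {
    // Replaces NaN entries (standing in for NAs) with the mean of the observed values, in place.
    // Returns the mean that was used, or NaN if the column has no observed values.
    static double imputeMean(double[] col) {
        double sum = 0;
        int n = 0;
        for (double d : col) {
            if (!Double.isNaN(d)) { sum += d; n++; }
        }
        if (n == 0) return Double.NaN; // nothing observed: leave the column untouched
        double mean = sum / n;
        for (int i = 0; i < col.length; i++) {
            if (Double.isNaN(col[i])) col[i] = mean;
        }
        return mean;
    }
}
```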
Unbalanced data

single rare class

Fraud / No-Fraud!
Stratify
Unbalanced data

multiple rare classes

Browse, Click, Purchase!
Stratify
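Stratifying means sampling within each class rather than across the whole data set, so a rare class such as Fraud or Purchase is not sampled away. A minimal sketch, assuming integer class labels; `Stratify.sample`, the per-class fraction, and the "keep at least one row per class" rule are illustrative choices, not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class Stratify {
    // Samples the same fraction of row indices from every class.
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        // Group row indices by class label.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd);
            // Keep at least one row per class, and never more than the class holds.
            int keep = Math.min(rows.size(),
                                Math.max(1, (int) Math.round(rows.size() * fraction)));
            picked.addAll(rows.subList(0, keep));
        }
        return picked;
    }
}
```

The same grouping step works for both the single-rare-class and the multiple-rare-class case; only the number of strata changes.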
10 Data is the System

Use Customer Data
Algorithms for Sparse vs. Dense
Unbalanced Data.
Robustness under noise
Before H2O
Velocity: Events, Online Scoring, Rule Engine, Applications
Volume: HDFS, Munging (slice n dice), Features, HIVE/SQL, Exploration, Modeling, Offline Scoring
Data Scientist, Engineer, Business Analyst
Ensemble models, Low latency
Classification, Regression, Clustering
Optimal Model, Predictions
Big Data: Exploration, Modeling, Scoring, Real-time
Big Data beats Better Algorithms!
Big Data: Exploration, Modeling, Scoring, Real-time
Big Data and Better Algorithms!
Scale & Parallelism!
Distributed Coding Taxonomy
•  No Distribution Coding:
   •  Whole Algorithms, Whole Vector-Math
   •  REST + JSON: e.g. load data, GLM, get results
•  Simple Data-Parallel Coding:
   •  Per-Row (or neighbor row) Math
   •  Map/Reduce-style: e.g. Any dense linear algebra
•  Complex Data-Parallel Coding
   •  K/V Store, Graph Algo's, e.g. PageRank
Read the docs! This talk! Join our GIT!
Distributed Data Taxonomy

Frame – a collection of Vecs
Vec – a collection of Chunks
Chunk – a collection of 1e3 to 1e6 elems
elem – a Java double
Row i – i'th elements of all the Vecs in a Frame
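The Frame and Row definitions describe a column-major layout: data is stored per Vec, and a row is assembled on demand. A minimal sketch under that reading; `TinyFrame` is a hypothetical class, and a single `double[]` per Vec stands in for the chunked, distributed storage.

```java
// Hypothetical column-major frame: one double[] per Vec, all of equal length.
class TinyFrame {
    private final String[] names;
    private final double[][] vecs;

    TinyFrame(String[] names, double[][] vecs) {
        if (names.length != vecs.length) throw new IllegalArgumentException("one name per Vec");
        for (double[] v : vecs) {
            if (v.length != vecs[0].length) throw new IllegalArgumentException("Vecs must be aligned");
        }
        this.names = names;
        this.vecs = vecs;
    }

    int numRows() { return vecs.length == 0 ? 0 : vecs[0].length; }

    String name(int col) { return names[col]; }

    // Row i is the i'th element of every Vec in the frame.
    double[] row(int i) {
        double[] r = new double[vecs.length];
        for (int c = 0; c < vecs.length; c++) r[c] = vecs[c][i];
        return r;
    }
}
```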

Use cases

Conversion, Retention & Churn
•  Lead Conversion
•  Engagement
•  Product Placement
•  Recommendations
Pricing Engine
Fraud Detection

How to write a Business Continuity Plan
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 

Top 10 Performance Gotchas for scaling in-memory Algorithms.

  • 1. Better Predictions! H2O – The Open Source Math Engine!
  • 2. H2O – Open Source in-memory Machine Learning for Big Data. 4/23/13
  • 3. Universe is sparse. Life is messy. Data is sparse & messy! - Lao Tzu
  • 4. Hadoop = opportunity. Not enough Data Scientists. Analysts won't code Java.
  • 5. Group By, Grep, Messy NAs. Classification, Regression, Clustering. Ensembles, 100's, nanos, models. H2O, the Prediction Engine. Big Data: Adhoc Exploration, Math Modeling, Real-time Scoring.
  • 6. No New API! Big Data: Exploration, Modeling, Scoring, Real-time. H2O, the Prediction Engine. Approximate results each step!
  • 7. Intellectual Legacy. Math needs to be free. Open Source. Support and Innovation. https://github.com/0xdata/h2o H2O, the Prediction Engine.
  • 8. All Top 10's are binary! - Anonymous
  • 9. #10 Move Code not Data. Data chunks > code chunks. TCP for Data. UDP for Control. >> Generated Java Assist
  • 10. A Chunk, Unit of Parallel Access. A Frame: Vec[] (age, sex, zip, ID, car). JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap. Vecs aligned in heaps. Optimized for concurrent access. Random access any row, any JVM.
  • 11. #9 Chunk-ing Express! A season for variable-sized chunks and a season for uniform chunks. Tightly-packed! (chunk is also unit of batch!)
  • 12. #8 Reduce early. Reduce often! No expensive intermediate states. Fine-grain parallelism wins! >> Fork / Join
  • 13. #8 Reduce early. Reduce often! Vec, Vec, Vec, Vec, Vec. All CPUs grab Chunks in parallel. Map/Reduce & F/J handles all sync. JVM 1 Heap, JVM 2 Heap, JVM 3 Heap, JVM 4 Heap.
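The "reduce early, reduce often" idea from slides 12 and 13 can be sketched with the JDK's Fork/Join framework. This is a minimal illustration, not H2O's own Map/Reduce code: the class name `ChunkSum` and the chunk size of 1,000 are assumptions made for the example. Each task sums one chunk locally and the partial sums are combined as soon as both halves finish, so no large intermediate state is ever built.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example class: sums a double[] by splitting it into chunks.
class ChunkSum extends RecursiveTask<Double> {
    static final int CHUNK = 1_000;   // assumed chunk size, not taken from the slides
    private final double[] data;
    private final int lo, hi;         // half-open range [lo, hi)

    ChunkSum(double[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Double compute() {
        if (hi - lo <= CHUNK) {
            // "Map" step: one tight loop over a single chunk of primitives.
            double s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ChunkSum left = new ChunkSum(data, lo, mid);
        ChunkSum right = new ChunkSum(data, mid, hi);
        left.fork();                    // run the left half asynchronously
        double r = right.compute();     // compute the right half in this thread
        return left.join() + r;         // "Reduce" step: combine immediately
    }

    static double sum(double[] data) {
        return ForkJoinPool.commonPool().invoke(new ChunkSum(data, 0, data.length));
    }
}
```

Calling `ChunkSum.sum(values)` returns the total; the only state that travels between tasks is a single `double` per subtree.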
  • 14. #7 Slow is not different from Dead. Debugging slow. >> Heartbeats, Messages. Two Generals' Paradox.
  • 15. #6 Memory Manager. An in-memory system is as good as your memory manager! Lazy eviction. Compress. Align. Corollary: Track down leaks!
  • 16. #5 Memory Overheads. Use primitives.
      // A Distributed Vector
      // much more than 2 billion elements
      abstract class Vec {
        abstract long length();                 // more than an int's worth
        // fast random access
        abstract double at(long idx);           // Get the idx'th elem
        abstract boolean isNA(long idx);
        abstract void set(long idx, double d);  // writable
        abstract void append(double d);         // variable sized
      }
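To make the `Vec` signature on slide 16 concrete, here is a small single-JVM sketch that stores values in fixed-size `double[]` chunks and indexes them with a `long`. It is an illustration only: the class name `ChunkedVec`, the chunk size, and the choice of `NaN` as the NA marker are all assumptions, and the real H2O `Vec` is distributed and compressed, which this is not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical local stand-in for the Vec API shown on the slide.
class ChunkedVec {
    private static final int CHUNK_SIZE = 1 << 16;  // assumed uniform chunk size
    private final List<double[]> chunks = new ArrayList<>();
    private long length;                            // long: may exceed int range

    long length() { return length; }

    // Random access: locate the chunk, then the offset inside it.
    double at(long idx) {
        checkIndex(idx);
        return chunks.get((int) (idx / CHUNK_SIZE))[(int) (idx % CHUNK_SIZE)];
    }

    // Assumption for this sketch: a missing value is stored as NaN.
    boolean isNA(long idx) { return Double.isNaN(at(idx)); }

    void set(long idx, double d) {
        checkIndex(idx);
        chunks.get((int) (idx / CHUNK_SIZE))[(int) (idx % CHUNK_SIZE)] = d;
    }

    // Appending allocates a new primitive chunk only when the last one is full.
    void append(double d) {
        int off = (int) (length % CHUNK_SIZE);
        if (off == 0) chunks.add(new double[CHUNK_SIZE]);
        chunks.get(chunks.size() - 1)[off] = d;
        length++;
    }

    private void checkIndex(long idx) {
        if (idx < 0 || idx >= length)
            throw new IndexOutOfBoundsException("index " + idx);
    }
}
```

Using `double[]` rather than `List<Double>` avoids one boxed object per element, which is the memory-overhead point the slide makes with "Use primitives".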
  • 17. #4 Cache-Oblivious. Tree size. Bin size. Recursively divide till Data → Cache.
  • 18. #3 EC2 – Nothing is bounded. User-mode reliability. S3 Readers will TCP Reset. Mux your connections. Not all toolkits are equal. >> JetS3
  • 19. #2 No Locks, No Cry. Non-Blocking Data Structures.
      // VOLATILE READ before key compare.
      // CAS
      private final boolean CAS_kvs( final Object[] oldkvs, final Object[] newkvs ) {
        return _unsafe.compareAndSwapObject(this, _kvs_offset, oldkvs, newkvs );
      }
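The compare-and-swap pattern on slide 19 can be tried without `sun.misc.Unsafe` by using `java.util.concurrent.atomic.AtomicReference`, which is swapped in here for the slide's `_unsafe` call. The sketch below shows only the read, build, CAS, retry loop on an array reference; the class `CasTable` and its `grow` method are hypothetical, and it does not handle concurrent writes into the old array the way a real non-blocking hash table must.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical example: lock-free replacement of a backing array.
class CasTable {
    private final AtomicReference<Object[]> kvs =
            new AtomicReference<>(new Object[4]);

    // Same contract as the slide's CAS_kvs: succeeds only if nobody
    // else replaced the array since we read it.
    boolean casKvs(Object[] oldKvs, Object[] newKvs) {
        return kvs.compareAndSet(oldKvs, newKvs);
    }

    Object[] current() { return kvs.get(); }

    // Grow to at least minLen without taking a lock; retry on a lost race.
    Object[] grow(int minLen) {
        while (true) {
            Object[] old = kvs.get();               // volatile read
            if (old.length >= minLen) return old;   // someone already grew it
            Object[] bigger =
                    Arrays.copyOf(old, Math.max(minLen, old.length * 2));
            if (casKvs(old, bigger)) return bigger; // we won the race
            // Otherwise another thread installed a new array: loop and re-read.
        }
    }
}
```

No thread ever blocks: a loser of the race simply re-reads the current array and tries again.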
  • 20.
  • 21. #1 Endian wars ended! Keep-It-Simple-Serialization. byte[]. Roll-your-own. Fast.
      public AutoBuffer putA1( byte[] ary, int sofar, int length ) {
        while( sofar < length ) {
          int len = Math.min(length - sofar, _bb.remaining());
          _bb.put(ary, sofar, len);
          sofar += len;
          if( sofar < length ) sendPartial();
        }
        return this;
      }
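The `putA1` loop on slide 21 writes a byte array through a fixed-size buffer and flushes whenever the buffer fills. A self-contained way to experiment with that loop is sketched below. The class `TinyBuffer` is hypothetical, and a `ByteArrayOutputStream` stands in for the network channel that the slide's `sendPartial()` presumably writes to; that stand-in is an assumption, not H2O's `AutoBuffer` implementation.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical miniature of the buffer-and-flush pattern from the slide.
class TinyBuffer {
    private final ByteBuffer bb;
    // Stand-in for a socket: collects everything that was "sent".
    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();

    TinyBuffer(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be > 0");
        bb = ByteBuffer.allocate(capacity);
    }

    // Copies ary[sofar .. length) into the buffer, flushing when it is full.
    TinyBuffer putA1(byte[] ary, int sofar, int length) {
        while (sofar < length) {
            int len = Math.min(length - sofar, bb.remaining());
            bb.put(ary, sofar, len);
            sofar += len;
            if (sofar < length) sendPartial();  // buffer full, more to write
        }
        return this;
    }

    // Flushes the filled part of the buffer to the sink and resets it.
    private void sendPartial() {
        bb.flip();
        sink.write(bb.array(), bb.arrayOffset(), bb.limit());
        bb.clear();
    }

    // Flushes any remainder and returns all bytes written so far.
    byte[] close() {
        sendPartial();
        return sink.toByteArray();
    }
}
```

With a 4-byte buffer and a 10-byte payload, the loop flushes twice inside `putA1` and once more in `close()`, and the bytes arrive in their original order.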
  • 22. Data Movement is a Defect. Slowing down helps communication. Got Speed?  
  • 23. #0 Math always produces a number. Accuracy rules over speed. Predictive Performance.
  • 24. #1 Shuffle. Data presentation bias. Sorted data => interesting results.
  • 25. #2 Random acts of Kindness?
  • 26.
  • 27. #3 Convex Problems: ADMM
  • 28. #4 Amdahl strikes: Cholesky / QR Decomposition. Matrix operations: jama, jblas... all single node. Distributed version needs data transfer!
  • 29. #5 Random Forests. Embarrassingly parallel. Binning. Tree-building. Splits.
  • 30. #6 Boosting. Iterate & stage. Weak learners => strong learners. Each tree can be parallel. Minimize communication.
  • 31. #7 Neural Nets & Clustering. Embarrassingly parallel. Pre-calculate base stats. Distance calculation. Weight matrices – small footprint.
  • 32. #8 Ensembles. Daisy chain a bunch of models. Interleave. JIT – minimize loops over data.
  • 33. #9 Tools. Deterministic versions first! Got Pen & Paper? Optimize often. Test Big Data soon.
  • 34. Replace NAs to improve predictive performance by about 10%! - Newton
  • 35. Munging Missing Features
 impute NAs with mean
 impute NAs with kNN
 impute with recursive PCA! - Boyd
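The simplest of the three options on slide 35, mean imputation, can be sketched in a few lines. The class name `MeanImpute` is hypothetical, and representing an NA as `Double.NaN` is an assumption of this example rather than something the slides specify.

```java
// Hypothetical helper: replaces missing values in one column with the column mean.
class MeanImpute {
    // Returns a copy of col in which every NaN (assumed NA marker)
    // is replaced by the mean of the non-missing values.
    static double[] imputeMean(double[] col) {
        double sum = 0;
        int n = 0;
        for (double v : col) {
            if (!Double.isNaN(v)) { sum += v; n++; }
        }
        double[] out = col.clone();
        if (n == 0) return out;          // every value missing: nothing to impute from
        double mean = sum / n;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) out[i] = mean;
        }
        return out;
    }
}
```

The input array is left untouched, so the original missingness pattern stays available, for example to add an "was missing" indicator column alongside the imputed one.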
  • 36. Unbalanced data
 single rare class
 Fraud / No-Fraud! Stratify
  • 37. Unbalanced data
 multiple rare classes
 Browse, Click, Purchase! Stratify
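One way to read "Stratify" on slides 36 and 37 is stratified sampling: draw the same fraction from every class so that rare classes are not lost by chance. The sketch below is an assumption about what is meant, not the speaker's method. The class `Stratify`, its `sample` method, and the rule of keeping at least one row per class are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical helper: per-class random sampling of row indices.
class Stratify {
    // labels[i] is the class of row i; fraction is in [0, 1].
    // Returns the indices of the sampled rows, grouped by class.
    static List<Integer> sample(int[] labels, double fraction, long seed) {
        // Group row indices by class label.
        Map<Integer, List<Integer>> byClass = new TreeMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);
        }
        Random rnd = new Random(seed);
        List<Integer> picked = new ArrayList<>();
        for (List<Integer> rows : byClass.values()) {
            Collections.shuffle(rows, rnd);
            // Same fraction from each class, but never drop a class entirely
            // (an assumption of this sketch, useful for very rare classes).
            int take = (int) Math.round(fraction * rows.size());
            take = Math.min(rows.size(), Math.max(1, take));
            picked.addAll(rows.subList(0, take));
        }
        return picked;
    }
}
```

With 90 "no-fraud" rows and 10 "fraud" rows, a fraction of 0.5 yields 45 and 5 rows respectively, so the class ratio of the sample matches the ratio of the full data.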
  • 38. #10 Data is the System. Use Customer Data. Algorithms for Sparse vs. Dense. Unbalanced Data. Robustness under noise.
  • 39. Before H2O (diagram labels): Velocity: Events; Online Scoring; Volume: HDFS; Rule Engine; Munging; slice n dice; Features; HIVE/SQL; Applications; Exploration; Data Scientist; Modeling; Offline Scoring; Engineer; Business Analyst; Ensemble models; Low latency; Classification; Regression; Clustering; Optimal Model; Predictions.
  • 40. Big Data: Exploration, Modeling, Scoring, Real-time. Big Data beats Better Algorithms!
  • 41. Big Data: Exploration, Modeling, Scoring, Real-time. Big Data and Better Algorithms! Scale & Parallelism!
  • 42. Intellectual Legacy. Math needs to be free. Open Source. Support and Innovation. https://github.com/0xdata/h2o H2O, the Prediction Engine.
  • 43.
  • 44. Better Predictions! H2O – The Open Source Math Engine!
  • 45. Distributed Coding Taxonomy. No Distribution Coding: Whole Algorithms, Whole Vector-Math; REST + JSON, e.g. load data, GLM, get results. Simple Data-Parallel Coding: Per-Row (or neighbor row) Math; Map/Reduce-style, e.g. any dense linear algebra. Complex Data-Parallel Coding: K/V Store, Graph Algo's, e.g. PageRank.
  • 46. Distributed Coding Taxonomy. No Distribution Coding: Whole Algorithms, Whole Vector-Math; REST + JSON, e.g. load data, GLM, get results. Simple Data-Parallel Coding: Per-Row (or neighbor row) Math; Map/Reduce-style, e.g. any dense linear algebra. Complex Data-Parallel Coding: K/V Store, Graph Algo's, e.g. PageRank. Annotations: Read the docs! This talk! Join our GIT!
  • 47. Distributed Data Taxonomy. Frame – a collection of Vecs. Vec – a collection of Chunks. Chunk – a collection of 1e3 to 1e6 elems. elem – a Java double. Row i – i'th elements of all the Vecs in a Frame.
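The Frame, Vec, Chunk, elem hierarchy on slide 47 can be modeled locally to see how "Row i" is assembled. The classes `MiniVec` and `MiniFrame` below are hypothetical single-JVM stand-ins, and the `starts` offset table used to find a chunk by binary search is an assumption of this sketch; it is one common way to support variable-sized chunks, not necessarily how H2O lays them out.

```java
import java.util.Arrays;

// Hypothetical Vec: a column stored as variable-sized chunks of doubles.
class MiniVec {
    final double[][] chunks;
    final long[] starts;  // starts[c] = global index of chunk c's first elem;
                          // starts[chunks.length] = total length

    MiniVec(double[]... chunks) {
        this.chunks = chunks;
        starts = new long[chunks.length + 1];
        for (int c = 0; c < chunks.length; c++) {
            starts[c + 1] = starts[c] + chunks[c].length;
        }
    }

    long length() { return starts[chunks.length]; }

    double at(long idx) {
        if (idx < 0 || idx >= length())
            throw new IndexOutOfBoundsException("index " + idx);
        int c = Arrays.binarySearch(starts, idx);
        if (c < 0) c = -c - 2;             // chunk whose start precedes idx
        while (idx >= starts[c + 1]) c++;  // skip any empty chunks
        return chunks[c][(int) (idx - starts[c])];
    }
}

// Hypothetical Frame: a collection of Vecs of equal length.
class MiniFrame {
    final MiniVec[] vecs;

    MiniFrame(MiniVec... vecs) { this.vecs = vecs; }

    // Row i is the i'th element of every Vec in the Frame.
    double[] row(long i) {
        double[] r = new double[vecs.length];
        for (int v = 0; v < vecs.length; v++) r[v] = vecs[v].at(i);
        return r;
    }
}
```

Because each Vec keeps its own offset table, the columns of a Frame do not need identical chunk boundaries in this sketch, although aligned chunks (as slide 10 describes) make row access cheaper.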
  • 48. Usecases. Conversion, Retention & Churn: Lead Conversion; Engagement; Product Placement; Recommendations. Pricing Engine. Fraud Detection.