0xdata H2O Podcast

In this slidecast, SriSatish Ambati from 0xdata describes the company's new H2O open source, in-memory machine learning application for Big Data.

"We developed H2O to unlock the predictive power of big data through better algorithms," said SriSatish Ambati, CEO and co-founder of 0xdata. "H2O is simple, extensible and easy to use and deploy from R, Excel and Hadoop. The big data science world is one of algorithm-haves and have-nots. Amazon, Goldman Sachs, Google and Netflix have proven the power of algorithms on data. With our viral and open Apache software license philosophy, along with close ties into the math, Hadoop and R communities, we bring the power of Google-scale machine learning and modeling without sampling to the rest of the world."

Watch the presentation video: http://wp.me/p3RLEV-1xc
Learn more: http://0xdata.com

Published in: Technology, Education

0xdata H2O Podcast

  1. 1. Better Predictions! H2O – The Open Source Math Engine!
  2. 2. H2O – Open Source in-memory Machine Learning for Big Data 4/23/13 SriSatish Ambati, July 2013
  3. 3. Universe is sparse. Life is messy. Data is sparse & messy! - Lao Tzu
  4. 4. Hadoop = opportunity. Not enough data scientists. Analysts won't code Java.
  5. 5. Before H2O [diagram labels: Velocity: Events; Volume: HDFS; Munging (slice n dice); Features; HIVE/SQL; Rule Engine; Exploration; Modeling; Online Scoring; Offline Scoring; Data Scientist; Engineer; Business Analyst; Ensemble models; Low latency; Classification; Regression; Clustering; Optimal Model; Predictions; Applications]
  6. 6. H2O, the Prediction Engine [diagram labels: Big Data; Adhoc Exploration; Math Modeling; Real-time Scoring; Group By; Grep; Messy NAs; Classification; Regression; Clustering; Ensembles; 100's; nanos; models]
  7. 7. No New API! Approximate results each step! H2O, the Prediction Engine [diagram labels: Big Data; Exploration; Modeling; Real-time Scoring]
  8. 8. Big Data beats Better Algorithms! [diagram labels: Big Data; Exploration; Modeling; Real-time Scoring]
  9. 9. Big Data and Better Algorithms! Scale & Parallelism! [diagram labels: Big Data; Exploration; Modeling; Real-time Scoring]
  10. 10. Intellectual Legacy. Math needs to be free. Open Source. Support and Innovation. https://github.com/0xdata/h2o H2O, the Prediction Engine
  11. 11. Usecases: Conversion, Retention & Churn • Lead Conversion • Engagement • Product Placement • Recommendations. Pricing Engine. Fraud Detection.
  12. 12. Customers, Users: Insurance, Credit Card, Others…
  13. 13. Big Data and Better Algorithms - Antonio Mollins, Data Scientist
  14. 14. Pete Fishman, Data Science @Yammer
  15. 15. Screen title
  16. 16. Screen title
  17. 17. A Collection of Distributed Vectors
      // A Distributed Vector
      // much more than 2 billion elements
      class Vec {
        long length();                 // more than an int's worth
        // fast random access
        double at(long idx);           // Get the idx'th elem
        boolean isNA(long idx);
        void set(long idx, double d);  // writable
        void append(double d);         // variable sized
      }
  18. 18. Frames. A Frame: Vec[] • Vecs aligned in heaps • Optimized for concurrent access • Random access any row, any JVM • But faster if local... more on that later [diagram labels: age; sex; zip; ID; car; JVM 1 Heap; JVM 2 Heap; JVM 3 Heap; JVM 4 Heap]
  19. 19. Distributed Data Taxonomy. A Chunk, Unit of Parallel Access • Typically 1e3 to 1e6 elements • Stored compressed • In byte arrays • Get/put is a few clock cycles including compression [diagram labels: Vec (x5); JVM 1 Heap; JVM 2 Heap; JVM 3 Heap; JVM 4 Heap]
  20. 20. Distributed Parallel Execution • All CPUs grab Chunks in parallel • F/J load balances • Code moves to Data • Map/Reduce & F/J handles all sync • H2O handles all comm, data manag [diagram labels: Vec (x5); JVM 1 Heap; JVM 2 Heap; JVM 3 Heap; JVM 4 Heap]
  21. 21. Distributed Data Taxonomy • Frame – a collection of Vecs • Vec – a collection of Chunks • Chunk – a collection of 1e3 to 1e6 elems • elem – a java double • Row i – i'th elements of all the Vecs in a Frame
  22. 22. Distributed Coding Taxonomy • No Distribution Coding: Whole Algorithms, Whole Vector-Math; REST + JSON: e.g. load data, GLM, get results • Simple Data-Parallel Coding: Per-Row (or neighbor row) Math; Map/Reduce-style: e.g. Any dense linear algebra • Complex Data-Parallel Coding: K/V Store, Graph Algo's, e.g. PageRank
  23. 23. Distributed Coding Taxonomy • No Distribution Coding: Whole Algorithms, Whole Vector-Math; REST + JSON: e.g. load data, GLM, get results • Simple Data-Parallel Coding: Per-Row (or neighbor row) Math; Map/Reduce-style: e.g. Any dense linear algebra • Complex Data-Parallel Coding: K/V Store, Graph Algo's, e.g. PageRank [callouts: Read the docs! This talk! Join our GIT!]
  24. 24. Better Predictions! H2O – The Open Source Math Engine!
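
The Frame / Vec / Chunk taxonomy in the slides can be made concrete with a small sketch. This is not H2O's code: it is a single-JVM illustration under stated assumptions. The class name `TaxonomySketch`, the `chunkSize` parameter, the `numChunks()` and `row()` helpers, and the use of `NaN` to mark an NA are all hypothetical, and chunks are stored as plain `double[]` rather than the compressed byte arrays the slides describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TaxonomySketch {
    /** Chunk: a fixed-capacity run of elements (uncompressed here, unlike the slides). */
    static final class Chunk {
        final double[] elems;
        int used; // number of slots filled so far
        Chunk(int capacity) { elems = new double[capacity]; }
    }

    /** Vec: a collection of Chunks, addressed by a long index. */
    static final class Vec {
        private final int chunkSize; // assumption: every chunk has the same capacity
        private final List<Chunk> chunks = new ArrayList<>();
        private long length;

        Vec(int chunkSize) { this.chunkSize = chunkSize; }

        long length() { return length; }
        int numChunks() { return chunks.size(); }

        /** Variable sized: start a new Chunk whenever the last one is full. */
        void append(double d) {
            if (chunks.isEmpty() || chunks.get(chunks.size() - 1).used == chunkSize) {
                chunks.add(new Chunk(chunkSize));
            }
            Chunk last = chunks.get(chunks.size() - 1);
            last.elems[last.used++] = d;
            length++;
        }

        /** Random access: locate the Chunk, then the offset inside it. */
        double at(long idx) {
            checkIndex(idx);
            return chunks.get((int) (idx / chunkSize)).elems[(int) (idx % chunkSize)];
        }

        void set(long idx, double d) {
            checkIndex(idx);
            chunks.get((int) (idx / chunkSize)).elems[(int) (idx % chunkSize)] = d;
        }

        /** Assumption: a missing value (NA) is encoded as NaN. */
        boolean isNA(long idx) { return Double.isNaN(at(idx)); }

        private void checkIndex(long idx) {
            if (idx < 0 || idx >= length) throw new IndexOutOfBoundsException("index " + idx);
        }
    }

    /** Frame: a collection of equal-length Vecs (one per column). */
    static final class Frame {
        final String[] names;
        final Vec[] vecs;
        Frame(String[] names, Vec[] vecs) { this.names = names; this.vecs = vecs; }

        /** Row i: the i'th element of every Vec in the Frame. */
        double[] row(long i) {
            double[] r = new double[vecs.length];
            for (int c = 0; c < vecs.length; c++) r[c] = vecs[c].at(i);
            return r;
        }
    }

    public static void main(String[] args) {
        Vec age = new Vec(4), zip = new Vec(4);
        for (int i = 0; i < 10; i++) { age.append(20 + i); zip.append(94000 + i); }
        age.set(5, Double.NaN); // mark one element as missing
        Frame f = new Frame(new String[] {"age", "zip"}, new Vec[] {age, zip});
        System.out.println("rows=" + age.length() + " chunks=" + age.numChunks());
        System.out.println("row 9 = " + Arrays.toString(f.row(9)));
        System.out.println("age[5] is NA: " + age.isNA(5));
    }
}
```

Indexing by `long` and splitting storage into chunks is what lets a Vec exceed the roughly 2 billion element limit of a single Java array, which is the point the Vec slide makes.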

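
The "Distributed Parallel Execution" slide says all CPUs grab Chunks in parallel, with Fork/Join (F/J) balancing the load and a Map/Reduce step combining results. A minimal single-JVM sketch of that pattern follows, using the standard `java.util.concurrent` Fork/Join classes. It is an illustration only: `ChunkMapReduceSketch`, `Partial`, and `MeanTask` are hypothetical names, chunks are plain `double[]` arrays, NA is assumed to be `NaN`, and nothing here moves code between JVMs as H2O does.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ChunkMapReduceSketch {
    /** Partial result over some chunks: sum and count of non-NA elements. */
    static final class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) { this.sum = sum; this.count = count; }

        /** Reduce step: combine two partial results. */
        Partial reduce(Partial other) {
            return new Partial(sum + other.sum, count + other.count);
        }

        double mean() { return count == 0 ? Double.NaN : sum / count; }
    }

    /** Splits the chunk range in half until each task owns a single chunk. */
    static final class MeanTask extends RecursiveTask<Partial> {
        private final double[][] chunks;
        private final int lo, hi; // half-open range [lo, hi) of chunk indices

        MeanTask(double[][] chunks, int lo, int hi) {
            this.chunks = chunks;
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        protected Partial compute() {
            if (hi - lo == 0) return new Partial(0, 0);
            if (hi - lo == 1) return map(chunks[lo]);
            int mid = (lo + hi) >>> 1;
            MeanTask left = new MeanTask(chunks, lo, mid);
            MeanTask right = new MeanTask(chunks, mid, hi);
            left.fork();                       // run the left half asynchronously
            Partial r = right.compute();       // work on the right half in this thread
            return left.join().reduce(r);      // wait for the left half, then combine
        }

        /** Map step: one pass over a single chunk, skipping NAs (NaN). */
        static Partial map(double[] chunk) {
            double s = 0;
            long n = 0;
            for (double d : chunk) {
                if (!Double.isNaN(d)) { s += d; n++; }
            }
            return new Partial(s, n);
        }
    }

    /** Mean of all non-NA elements across the chunks. */
    static double mean(double[][] chunks) {
        return ForkJoinPool.commonPool().invoke(new MeanTask(chunks, 0, chunks.length)).mean();
    }

    public static void main(String[] args) {
        double[][] chunks = { {1, 2, 3}, {4, Double.NaN}, {5} };
        System.out.println("mean = " + mean(chunks)); // mean of 1..5, skipping the NA
    }
}
```

The task never shares mutable state between chunks: each map call reads one chunk and returns an immutable `Partial`, so the only synchronization is the `fork`/`join` pair, which matches the slide's claim that Map/Reduce and F/J handle the sync.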