0xdata_h2o_BigDataScience_5.28.2013

Data science is no longer rocket science with H2O.
H2O is the open source math and prediction engine for big data. H2O makes Hadoop do math, scaling statistics, machine learning, and math over big data. With H2O, everyone can get past tooling and scale issues to discover insights in the data.

H2O is extensible: users can build blocks from simple math legos in the core. H2O keeps familiar interfaces like R, Excel, and JSON so that big data enthusiasts and experts can explore, munge, model, and score datasets using a range of simple to advanced algorithms. Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. H2O has a vision of online scoring and modeling in a single platform.

Published in: Technology

  1. H2O, the Prediction Engine. Better predictions. https://github.com/0xdata/h2o
  2. H2O makes Hadoop do math. Hadoop = opportunity. Not enough data scientists. Analysts won't code Java.
  3. H2O, the Prediction Engine. Exploration; Modeling; Scoring; Big Data.
  4. H2O, the Prediction Engine. Ad hoc Exploration; Math Modeling; Real-time Scoring; Big Data; Velocity; Volume.
  5. H2O, the Prediction Engine. Ad hoc Exploration; Math Modeling; Real-time Scoring; Big Data; Messy; Clustering; Classification; Ensembles; 100's; nanos; models; Regression.
  6. H2O, the Prediction Engine. Big Data; Exploration; Modeling; Scoring; Real-time.
  7. H2O, the Prediction Engine. Big Data; Exploration; Modeling; Scoring; Real-time. No new API. Approximate results each step.
  8. H2O, the Prediction Engine. Big Data; Exploration; Modeling; Scoring; Real-time. More Data beats Better Algorithms.
  9. H2O, the Prediction Engine. Big Data; Exploration; Modeling; Scoring; Real-time. More Data and Better Algorithms. Scale & Parallelism.
  10. H2O, the Prediction Engine. Big Data; Exploration; Modeling; Scoring; Real-time. More Data and Better Algorithms. Scale & Parallelism. Apps: fraud detection; reco engine.
  11. H2O, the Prediction Engine. Intellectual Legacy. Math needs to be free. Open Source. Support and Innovation. https://github.com/0xdata/h2o
  12. Credits & Team. SriSatish Ambati, CEO & Co-founder: Director of Engineering, DataStax; Cassandra & Hadoop; Customers & Platform Marketing, Azul. Cliff Click, CTO & Co-founder: Chief JVM Architect, Azul; Sun, HP, Motorola; JIT & Hotspot. Tomas Nykodym: PhD; Security, Intrusion Detection. Cyprien Noel: Founder, ObjectFabric; TradeWeb, SmartTrade. Michal Malohlava: PhD; DSLs, Compilers. Jan Vitek: Full Professor, Purdue, on sabbatical; Real-time VM, R/stats Compiler. Kevin Normoyle: AMD Fellow; Distinguished Engineer, Sun; Consistency Models. Tom Kraljevic: VP of Engineering; founder, Luminix; Azul, PMC-Sierra, Chromatic.
  13. Data Science & Advisors. Stephen Boyd: Professor of Mathematical Engineering, Stanford; Convex Opt. Trevor Hastie: Professor of Statistics, Stanford; Generalized Additive Models. Rob Tibshirani: Professor of Statistics, Stanford; GLMNet, Lasso. Doug Lea: Malloc for C, fork-join, Java memory model; SUNY Oswego. Dhruba Borthakur: HDFS, Hive, Facebook. Niall Dalton: TimeSeries DB, KX, High-frequency Trading, Cantor-Fitz. Charles Zedlewski: VP Products, Cloudera.
  14. Distributed! Extensible, reconfigurable! Math at Scale: Simple Legos. H2O legos: +, σ, cov, *, µ, mean, n, GLM, Logistic Regression, rand, shuffle, histogram, Random Decision Trees, OLS, k-means.
  15. Before H2O (diagram labels). Volume: HDFS; HIVE/SQL; Data Scientist; Munging; slice n dice; Features; Classification; Regression; Clustering; Optimal Model; Engineer. Velocity: Events; Online Scoring; Exploration; Modeling; Offline Scoring; Business Analyst; Ensemble models; Low latency; Applications; Predictions; Rule Engine.
  16. Product Road Map. 1.15.2013: algos: RandomForest, GLM, ADMM, GLMnet, k-means; data: dense, categorical; api: REST, JSON, R-like console; Scale, Single-Execution; GridSearch; in 4 pilots. 5.15.2013: algos: GroupBy, Grep, Unbalanced; App: Fraud Detection; data: sparse; api: R, math, string; Adhoc Analytics; Multi-Execution; Scoring Engine; Event Ingest; in production. 8.15.2013: algos: GBM, SVM, KNN, Optimization; App: RecoEngine; data: sparse; api: Tableau; Visualization; Multi-tenant; Library; Big Adoption.
  17. Secret sauce: move code, not data. Linear Regression: fork/join; data partitioning; fine-grain parallelism; phase 1 sums; phase 2 distance; phase 3 validate; arraylets; leaf computes, parent aggregates. Company confidential. Copyright 2012.
  18. Use Cases. Fraud Detection: Scoring: event stream on a ScoreCard model; Modeling: Random Forest for outlier detection; Modeling: event sequence patterns. Customer Behavior & Merchant Analytics: Scoring: purchase event stream scoring on Ensemble Models; Modeling: Logistic Regression models for Customer Engagement. Failure Prediction from Sensor Data: model device failures and rank vendor graphs. Upstream Oil Exploration: Distance & Regression on 1TB big data; MLS for oil fields.
  19. Math & Hadoop users recommend us!
  20. Data & Algorithms. SQL | HDFS | S3 | NoSQL. H2O: Real Time. REST; patterns; sequences; Distributed; Collections; Execution; JSON; R; Excel; Java API.
  21. Hadoop Ecosystem. HDFS; H2O; Map Reduce; Hive; Pig; Impala; Drill; Batch; Interactive; H2O.
  22. Generalized Linear Modeling. Alternating Direction Method of Multipliers (Boyd); Decomposition-coordination; Small Local Sub-Problems and Global Coordination; Broadcast & Gather; Decomposability; Dual Ascent + Convergence of Multipliers; Block & Component Separability; Generalized Gradients (Hastie, Tibshirani, et al).
  23. l1 norm regularization. https://github.com/0xdata/h2o/blob/master/src/main/java/hex/DLSM.java
  24. Random Forest. Textbook implementation from Breiman's paper; Data is distributed upon ingest; Splits on random selection of features; Gini & Entropy; Handles NAs (during training); Class-Weighting; Stratified Sampling (local). https://github.com/0xdata/h2o/tree/master/src/main/java/hex/rf
  25. Forest for the tree: iris dataset.
  26. Models unlock value in data. A 1% increase in predictive power is worth $11m at a major online payment system; Each fraud scored accurately has an expected value of tens of thousands of dollars; Leads cost $10-100 each, so accurately predicting conversion and lead quality goes directly to the bottom line; Competitive advantage in predicting which assets to acquire.
  27. Deployment: commodity / cloud. H2O on x86. H2O is pure Java and easy to install.
  28. H2O, the Prediction Engine. Better predictions. https://github.com/0xdata/h2o
  29. H2O, the Prediction Engine. Big Data Science. Modeling & Scoring Engine. Approximate results each step. No new API. Use R, Excel & SAS. Scale & Parallelism.
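
The linear regression outline on slide 17 (phase 1 sums, phase 2 distance, phase 3 validate; leaf computes, parent aggregates) can be read as three tree reductions over partitioned data. The sketch below is one possible interpretation and is not taken from the H2O source: `tree_reduce` and `fit_line` are hypothetical names, reading "distance" as deviations from the means is an assumption, and the recursion runs sequentially where a fork/join pool would process the two halves in parallel.

```python
from typing import Callable, Sequence, Tuple

Row = Tuple[float, float]   # one (x, y) observation
Vec = Tuple[float, ...]     # partial sums produced by a leaf


def tree_reduce(rows: Sequence[Row],
                leaf: Callable[[Sequence[Row]], Vec],
                leaf_size: int = 2) -> Vec:
    """Leaves compute partial sums on small chunks; parents add them up."""
    if len(rows) <= leaf_size:
        return leaf(rows)
    mid = len(rows) // 2
    left = tree_reduce(rows[:mid], leaf, leaf_size)    # would be forked
    right = tree_reduce(rows[mid:], leaf, leaf_size)   # would run in parallel
    return tuple(a + b for a, b in zip(left, right))   # parent aggregates


def fit_line(rows: Sequence[Row], leaf_size: int = 2) -> Tuple[float, float, float]:
    """Fit y = intercept + slope * x; assumes at least two distinct x values."""

    # Phase 1 (sums): count and totals give the means.
    def sums(chunk: Sequence[Row]) -> Vec:
        return (float(len(chunk)),
                sum(x for x, _ in chunk),
                sum(y for _, y in chunk))

    n, sx, sy = tree_reduce(rows, sums, leaf_size)
    mx, my = sx / n, sy / n

    # Phase 2 (distance): squared and cross deviations from the means.
    def deviations(chunk: Sequence[Row]) -> Vec:
        return (sum((x - mx) ** 2 for x, _ in chunk),
                sum((x - mx) * (y - my) for x, y in chunk))

    sxx, sxy = tree_reduce(rows, deviations, leaf_size)
    slope = sxy / sxx
    intercept = my - slope * mx

    # Phase 3 (validate): sum of squared residuals of the fitted line.
    def residuals(chunk: Sequence[Row]) -> Vec:
        return (sum((y - (intercept + slope * x)) ** 2 for x, y in chunk),)

    (sse,) = tree_reduce(rows, residuals, leaf_size)
    return slope, intercept, sse
```

Only the small tuples of partial sums travel up the tree; the rows stay in their partitions, which is the sense of "move code, not data" that this sketch tries to capture.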

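Slide 24 names Gini and entropy as the Random Forest split criteria. A minimal sketch of both measures, using their textbook definitions over per-class row counts, might look like the following. The function names are hypothetical and nothing here is taken from the `hex/rf` sources.

```python
import math
from typing import Callable, Sequence


def gini(counts: Sequence[int]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy in bits; empty classes contribute nothing."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def split_impurity(left: Sequence[int],
                   right: Sequence[int],
                   measure: Callable[[Sequence[int]], float]) -> float:
    """Impurity of a candidate split, weighted by the rows on each side."""
    n_left, n_right = sum(left), sum(right)
    total = n_left + n_right
    if total == 0:
        return 0.0
    return (n_left * measure(left) + n_right * measure(right)) / total
```

A tree builder would evaluate `split_impurity` for each candidate threshold on the randomly selected features and keep the split with the lowest value.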