Open Source AI Platform for
Business Transformation
Desmond Chan
Senior Director of Marketing, H2O.ai
Agenda for H2O Introduction Webinar
▪ Company Introduction (5 mins)
▪ H2O Introduction and Demo (35 mins)
– Installation of H2O
– Flight delay prediction use case
• Use case description
• Data set description
• Data munging
• Model creation
▪ Q&A (10 mins)
H2O AI Platform
In-Memory, Distributed Machine Learning with Visual Intelligence
H2O AI in Spark with Data Prep and ML Pipelines
Operationalize Model Building and Deployment Governance.
Best-of-breed GPU Deep Learning with easy API and AutoML
TensorFlow, MXNet or Caffe and H2O
Deep Water
AI For Business Transformation
Insights on Text, Images, Transactions, Speech
Best Machine Learning Algorithms on Spark
Platform to Build and Scale Data Products.
Dual licensing (AGPL and Commercial)
H2O is the #1 Platform for Open Source AI
Open Source Drives Community Adoption
Companies Using H2O.ai
2014: 400
2015: 3,810
2016: 6,427
2017: 9,173
H2O.ai Users
2014: 1,000
2015: 38,257
2016: 54,163
2017: 83,108
* Data from July of every year, except for 2017 when data from Feb 21st are used.
H2O Recognized by Press and Customers
H2O.ai Strongly Positioned in Key Analyst Reports and Press
“Overall customer satisfaction is very
high.”
“H2O is especially suited to IoT edge
and device scenarios.”
“H2O had the highest reference customer
analytics support score of all the
vendors.”
H2O.ai is a Visionary in the Gartner Magic Quadrant for Data Science Platforms
“H2O.ai has significant adoption by
large enterprises such as Macy’s,
Comcast, and Capital One.”
“H2O.ai is best known for developing
open source, cluster-distributed ML
algorithms at a time (2011) when big data
demanded them, but no one else had
them.”
H2O.ai is a Strong Performer in the Forrester Predictive Analytics & Machine Learning
H2O.ai is one of the Top 10 Hot Artificial Intelligence (AI) Technologies on Forbes
H2O.ai named alongside Nvidia, Google, IBM, Intel, Microsoft, SAS, et al. in the Top 10 Hot Artificial Intelligence (AI) Technologies on Forbes, contributed by Gil Press
H2O Use Cases – Videos and Talks
Progressive: Auto Insurance, UBI Telematics
Pawan Divarkarla, Chief Data Officer
“H2O is an enabler in how people are thinking about data.”
Zurich: Commercial Insurance, Risk Analytics
Conor Jensen, Analytics Director
“Advanced analytics was one of the key investments we decided to make.”
Capital One: Financial Services, Customer Insights
Brendan Herger, Data Scientist
“H2O is the best solution to iterate very quickly on large datasets and produce meaningful models.”
Nielsen Catalina: Digital Marketing, Consumer Behavior
Satya Satyamoorthy, Director, Software Dev
“I am a big fan of open source. H2O is the best fit in terms of cost as well as ease of use and scalability and usability.”
Amy Wang
Math Hacker, H2O.ai
What is H2O?
Math Platform: Open source in-memory prediction engine
• Parallelized and distributed algorithms making the most use out of multithreaded systems
• GLM, Random Forest, GBM, PCA, etc.
API: Easy to use and adopt
• Written in Java – perfect for Java programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
Big Data: More data? Or better models? BOTH
• Use all of your data – model without down sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
H2O Algorithms: Supervised Learning
Categories: Statistical Analysis, Ensembles, Deep Neural Networks
• Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson, and Tweedie
• Naive Bayes: Binary Text Classification
• Distributed Random Forest: Classification or Regression Models
• Gradient Boosting Machine: Ensembles of shallow decision trees with increasingly refined approximations
• Deep Learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
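The Gradient Boosting Machine bullet above ("ensembles of shallow decision trees with increasingly refined approximations") can be illustrated with a minimal pure-Python sketch. This is not H2O's implementation: it assumes a single numeric feature, squared-error loss, and depth-1 trees (stumps), and every function name and data value here is hypothetical. Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Pick the threshold on x whose two-leaf fit has the lowest squared error."""
    best = None
    # Dropping the largest value keeps the right-hand leaf non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left leaf value, right leaf value)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, stumps, learning_rate


def predict_gbm(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return mean((y - predict_gbm(model, x)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data: a step from roughly 1 to roughly 3 between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

one_tree = fit_gbm(xs, ys, n_trees=1)
fifty_trees = fit_gbm(xs, ys, n_trees=50)
print("training MSE with 1 tree:", mse(one_tree, xs, ys))
print("training MSE with 50 trees:", mse(fifty_trees, xs, ys))
```

With squared-error loss and a learning rate below 1, each added stump cannot increase the training error, which is the "increasingly refined" behavior the slide describes.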
H2O Algorithms: Unsupervised Learning
Categories: Clustering, Dimensionality Reduction, Anomaly Detection
• K-means: Partition observations into k clusters of the same spatial size. Categorical features are one hot encoded.
• Archetypes [GLRM]: Partition observations into k archetypes.
• Principal Component Analysis: Linearly transforms correlated variables to independent components
• Generalized Low Rank Model: Approximates data set as a product of two low dimensional factors. Extends PCA to handle sparse data, categorical data, and adds regularization.
• Autoencoders [Deep Learning]: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
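The K-means bullet above notes that categorical features are one hot encoded. As a rough illustration of what that means (a sketch only, not H2O's internal code), the snippet below expands a categorical column into 0/1 indicator columns so that a distance-based method such as K-means can compare rows. The column names `distance` and `carrier` and all values are made up for this example.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 indicator per observed level."""
    # Collect the sorted set of levels seen in each categorical column.
    levels = {
        col: sorted({row[col] for row in rows})
        for col in categorical_columns
    }
    encoded = []
    for row in rows:
        vector = []
        for col, value in row.items():
            if col in levels:
                vector.extend(
                    1.0 if value == level else 0.0 for level in levels[col]
                )
            else:
                vector.append(float(value))
        encoded.append(vector)
    return encoded


def squared_distance(a, b):
    """Squared Euclidean distance, the quantity K-means minimizes within clusters."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical rows loosely modeled on a flight-delay data set.
flights = [
    {"distance": 0.2, "carrier": "AA"},
    {"distance": 0.3, "carrier": "UA"},
    {"distance": 0.2, "carrier": "AA"},
]

vectors = one_hot_encode(flights, ["carrier"])
for vector in vectors:
    print(vector)
print("same carrier:", squared_distance(vectors[0], vectors[2]))
print("different carrier:", squared_distance(vectors[0], vectors[1]))
```

After encoding, two rows that differ only in a categorical level are a fixed distance apart, so the category contributes to clustering without implying any ordering between levels.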
Accuracy with Speed and Scale
[Architecture diagram]
Data sources: HDFS, S3, SQL, NoSQL
In-Memory: Map Reduce/Fork Join, Columnar Compression
Fast Modeling Engine: Deep Learning; PCA, GLM, Cox; Random Forest / GBM; Ensembles
Tasks: Classification, Regression, Feature Engineering, Matrix Factorization, Clustering, Munging
Output: Streaming, Nano Fast Java Scoring Engines
Reading Data into H2O with R
STEP 1
R user
h2o_df = h2o.importFile("../data/allyears2k.csv")
Reading Data from HDFS into H2O with R
STEP 2
2.1 R function call: h2o.importFile()
2.2 HTTP REST API request to H2O has HDFS path
2.3 H2O Cluster: Initiate distributed ingest
2.4 Request data from HDFS (data.csv)
Reading Data from HDFS into H2O with R
STEP 3
3.1 HDFS provides data (data.csv)
3.2 Distributed H2O Frame in DKV (H2O Cluster)
3.3 Return pointer to data in REST API JSON Response
3.4 h2o_df object created in R (Cluster IP, Cluster Port, Pointer to Data)
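The ingest slides make one point that is easy to miss: the data stays in the cluster, and the client object holds only a cluster IP, a cluster port, and a pointer to the data. The toy Python model below sketches that idea under stated assumptions. `FakeCluster`, `FrameHandle`, the path, and the response fields are all hypothetical stand-ins; none of them is part of the H2O API, and a plain dict plays the role of the distributed key-value store (DKV).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameHandle:
    """Client-side stand-in for a frame that lives in the cluster."""
    cluster_ip: str
    cluster_port: int
    frame_key: str  # the "pointer to data": a key into the cluster's store


class FakeCluster:
    """Toy cluster: a dict stands in for the distributed key-value store."""

    def __init__(self, ip, port):
        self.ip = ip
        self.port = port
        self.store = {}

    def handle_import(self, path, read_rows):
        """Ingest on the cluster side and answer with a small JSON-like dict."""
        key = path.rsplit("/", 1)[-1]
        self.store[key] = read_rows(path)  # the rows never leave the cluster
        return {"ip": self.ip, "port": self.port, "key": key}


def import_file(cluster, path, read_rows):
    """Client call: send the path, get back a handle rather than the data."""
    response = cluster.handle_import(path, read_rows)
    return FrameHandle(response["ip"], response["port"], response["key"])


# Example values only; the reader function fakes what storage would provide.
cluster = FakeCluster("127.0.0.1", 54321)
handle = import_file(
    cluster,
    "hdfs://namenode/data/data.csv",
    lambda path: [[1, 2], [3, 4]],
)
print(handle)
print("rows held by the cluster:", len(cluster.store[handle.frame_key]))
```

Because the handle carries no rows, later operations on it would be sent to the cluster as further requests, which is why the client can work with data sets larger than its own memory.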
Data Munging in R
Installing in R
R> install.packages("h2o")
Installing in Python
Terminal$ pip install h2o
Demo Time!
Questions?
Thanks for joining us!

Start Getting Your Feet Wet in Open Source Machine and Deep Learning
