SlideShare a Scribd company logo
1 of 12
Download to read offline
Machine Learning
and Data Munging
in H2O Driverless AI
with datatable
Pasha Stetsenko & Oleksiy Kononenko
H2O.ai
@pydatatable
#H2OWORLD
Machine Learning with datatable
https://www.kaggle.com/c/microsoft-malware-prediction/discussion/75478
datatable.models
Follow the Regularized Leader (FTRL) algo for binomial classification
proposed by H. B. McMahan et al., “Ad click prediction: a view from the
trenches” https://research.google.com/pubs/archive/41159.pdf
• Python frontend, C++ backend
• Parallelized with OpenMP and Hogwild
• Supports boolean, integer, real and string features
• Hashing trick based on Murmur hash function
• Second-order feature interactions
• One-vs.-rest multinomial classification and regression for continuous
targets (experimental)
Python code example
For detailed help please refer to https://datatable.readthedocs.io/en/latest/ftrl.html
Five reasons to use datatable FTRL
1. Reliable: integrated into H2O Driverless AI as of v1.5
2. Fast: million rows in seconds
3. It’s all datatable: read data, munge data, fit/predict, save results
4. Already on Kaggle, thanks to Bojan Tunguz, Kaggle GM:
https://www.kaggle.com/tunguz/eda-with-python-datatable
5. Open source: MIT license
What is datatable anyways?
• R data.table is one of the top 10
most popular R packages
• Python datatable was started in an
attempt to mimic the internal design
and API of R data.table
• First customer: Driverless AI
Capabilities: • Efficient multi-threaded algorithms
• Memory-thrifty
• Memory-mapped on-disk datasets
• Native C++ implementation
• Open source
Load and view data Shows progress bar
while parsing
Type and size of each
column
Integer columns with
NAs are parsed as
integer
> 5x times faster than
pandas.read_csv()
• A large portion of data is ingested into DAI through fread
• Automatically detects parse parameters
• Multi-threaded parsing
• Recovers from encoding errors
• Reads CSV and Excel files
• Reads files inside archives
fread: a doorway to Driverless AI
Save to binary file
• 300 ms to write a 750MB file, or 2.5GB/s (on MacOS laptop)
• Writes are cached at OS level and deferred to a later stage
• Write immediately followed by a read from another process is equivalent to
direct memory sharing
• Opening a .jay file is nearly instant
DT[i, j, by(…),
sort(…),
join(…)]
Primary datatable syntax
SELECT j
FROM DT
JOIN join
WHERE|HAVING i
GROUP BY by
ORDER BY sort
Examples
Find the average flight duration for each flight
Remove from DT all records where average flight duration is either
negative or NA
For each carrier, select 3 longest flights
Notes:
R data.table logo is available in Source Code Form at
https://github.com/Rdatatable/data.table/blob/master/rdatatable.svg
The End
Thanks for watching

More Related Content

More from Sri Ambati

Automatic Model Documentation with H2O
Automatic Model Documentation with H2OAutomatic Model Documentation with H2O
Automatic Model Documentation with H2O
Sri Ambati
 
AI Solutions in Manufacturing
AI Solutions in ManufacturingAI Solutions in Manufacturing
AI Solutions in Manufacturing
Sri Ambati
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 Recap
Sri Ambati
 
AI and AutoML: Debunking Myths
AI and AutoML: Debunking MythsAI and AutoML: Debunking Myths
AI and AutoML: Debunking Myths
Sri Ambati
 

More from Sri Ambati (20)

Building Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2OBuilding Custom GenAI Apps at H2O
Building Custom GenAI Apps at H2O
 
Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical Applied Gen AI for the Finance Vertical
Applied Gen AI for the Finance Vertical
 
Cutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM PapersCutting Edge Tricks from LLM Papers
Cutting Edge Tricks from LLM Papers
 
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
KGM Mastering Classification and Regression with LLMs: Insights from Kaggle C...
 
LLM Interpretability
LLM Interpretability LLM Interpretability
LLM Interpretability
 
Never Reply to an Email Again
Never Reply to an Email AgainNever Reply to an Email Again
Never Reply to an Email Again
 
Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)Introducción al Aprendizaje Automatico con H2O-3 (1)
Introducción al Aprendizaje Automatico con H2O-3 (1)
 
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...
 
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...
 
AI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation JourneyAI Foundations Course Module 1 - An AI Transformation Journey
AI Foundations Course Module 1 - An AI Transformation Journey
 
ML Model Deployment and Scoring on the Edge with Automatic ML & DF
ML Model Deployment and Scoring on the Edge with Automatic ML & DFML Model Deployment and Scoring on the Edge with Automatic ML & DF
ML Model Deployment and Scoring on the Edge with Automatic ML & DF
 
Scaling & Managing Production Deployments with H2O ModelOps
Scaling & Managing Production Deployments with H2O ModelOpsScaling & Managing Production Deployments with H2O ModelOps
Scaling & Managing Production Deployments with H2O ModelOps
 
Automatic Model Documentation with H2O
Automatic Model Documentation with H2OAutomatic Model Documentation with H2O
Automatic Model Documentation with H2O
 
Your AI Transformation
Your AI Transformation Your AI Transformation
Your AI Transformation
 
AI Solutions in Manufacturing
AI Solutions in ManufacturingAI Solutions in Manufacturing
AI Solutions in Manufacturing
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 Recap
 
Getting Your Supply Chain Back on Track with AI
Getting Your Supply Chain Back on Track with AIGetting Your Supply Chain Back on Track with AI
Getting Your Supply Chain Back on Track with AI
 
AI and AutoML: Debunking Myths
AI and AutoML: Debunking MythsAI and AutoML: Debunking Myths
AI and AutoML: Debunking Myths
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 

Recently uploaded (20)

[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Machine Learning and Data Munging in H2O Driverless AI with datatable

  • 1. Machine Learning and Data Munging in H2O Driverless AI with datatable Pasha Stetsenko & Oleksiy Kononenko H2O.ai @pydatatable #H2OWORLD
  • 2. Machine Learning with datatable https://www.kaggle.com/c/microsoft-malware-prediction/discussion/75478
  • 3. datatable.models Follow the Regularized Leader (FTRL) algo for binomial classification proposed by H. B. McMahan et al., “Ad click prediction: a view from the trenches” https://research.google.com/pubs/archive/41159.pdf • Python frontend, C++ backend • Parallelized with OpenMP and Hogwild • Supports boolean, integer, real and string features • Hashing trick based on Murmur hash function • Second-order feature interactions • One-vs.-rest multinomial classification and regression for continuous targets (experimental)
  • 4. Python code example For detailed help please refer to https://datatable.readthedocs.io/en/latest/ftrl.html
  • 5. Five reasons to use datatable FTRL 1. Reliable: integrated into H2O Driverless AI as of v1.5 2. Fast: million rows in seconds 3. It’s all datatable: read data, munge data, fit/predict, save results 4. Already on Kaggle, thanks to Bojan Tunguz, Kaggle GM: https://www.kaggle.com/tunguz/eda-with-python-datatable 5. Open source: MIT license
  • 6. What is datatable anyways? • R data.table is one of the top 10 most popular R packages • Python datatable was started in an attempt to mimic the internal design and API of R data.table • First customer: Driverless AI Capabilities: • Efficient multi-threaded algorithms • Memory-thrifty • Memory-mapped on-disk datasets • Native C++ implementation • Open source
  • 7. Load and view data Shows progress bar while parsing Type and size of each column Integer columns with NAs are parsed as integer > 5x times faster than pandas.read_csv()
  • 8. • A large portion of data is ingested into DAI through fread • Automatically detects parse parameters • Multi-threaded parsing • Recovers from encoding errors • Reads CSV and Excel files • Reads files inside archives fread: a doorway to Driverless AI
  • 9. Save to binary file • 300 ms to write a 750MB file, or 2.5GB/s (on MacOS laptop) • Writes are cached at OS level and deferred to a later stage • Write immediately followed by a read from another process is equivalent to direct memory sharing • Opening a .jay file is nearly instant
  • 10. DT[i, j, by(…), sort(…), join(…)] Primary datatable syntax SELECT j FROM DT JOIN join WHERE|HAVING i GROUP BY by ORDER BY sort
  • 11. Examples Find the average flight duration for each flight Remove from DT all records where average flight duration is either negative or NA For each carrier, select 3 longest flights
  • 12. Notes: R data.table logo is available in Source Code Form at https://github.com/Rdatatable/data.table/blob/master/rdatatable.svg The End Thanks for watching