Scalable Automatic Machine Learning in H2O (Sri Ambati)
Abstract:
In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, a fair amount of data science knowledge and background is still required to produce high-performing machine learning models. Deep neural networks, in particular, are notoriously difficult for a non-expert to tune properly.
In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface that automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top-performing model in the AutoML Leaderboard.
H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.
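As a rough illustration of the workflow the abstract describes, the sketch below wraps a basic AutoML run from the h2o Python module in one function. It is not the presenters' own example: the function name `run_automl`, its parameters, and the assumption that the task is classification on a CSV file are all hypothetical choices made for illustration, and it needs the `h2o` package plus a Java runtime to actually run.

```python
def run_automl(csv_path, target, max_models=10, seed=1):
    """Run H2O AutoML on a CSV file and return (leaderboard, leader model).

    Hypothetical helper for illustration; assumes a classification task.
    """
    # Imported inside the function so the sketch can be loaded
    # even on a machine where h2o is not installed.
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()  # start or connect to a local H2O cluster

    frame = h2o.import_file(csv_path)
    # Treat the target column as categorical so AutoML does classification.
    frame[target] = frame[target].asfactor()
    predictors = [col for col in frame.columns if col != target]

    # Train up to `max_models` candidate models; stacked ensembles built
    # from those candidates are added to the leaderboard by AutoML itself.
    aml = H2OAutoML(max_models=max_models, seed=seed)
    aml.train(x=predictors, y=target, training_frame=frame)

    # The leaderboard ranks all trained models; the leader is the top row.
    return aml.leaderboard, aml.leader


# Example call (file name and column name are placeholders):
# leaderboard, leader = run_automl("train.csv", "response")
```

Capping the run with `max_models` keeps a first experiment short; a time budget could be used instead, depending on what the installed H2O version supports.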
Erin’s Bio:
Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.
Strata San Jose 2016: Scalable Ensemble Learning with H2O (Sri Ambati)
Erin LeDell's presentation on Scalable Ensemble Learning with H2O at Strata + Hadoop World San Jose, 03.29.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Scalable Machine Learning in R and Python with H2O (Sri Ambati)
The focus of this presentation is scalable machine learning using the h2o R and Python packages. H2O is an open source, distributed machine learning platform designed for big data, with the added benefit that it's easy to use on a laptop (in addition to a multi-node Hadoop or Spark cluster). The core machine learning algorithms of H2O are implemented in high-performance Java; however, fully featured APIs are available in R, Python, Scala, and REST/JSON, as well as through a web interface.
Because H2O's algorithm implementations are distributed, the software can scale to very large datasets that may not fit into RAM on a single machine. H2O currently features distributed implementations of Generalized Linear Models, Gradient Boosting Machines, Random Forest, Deep Neural Nets, Stacked Ensembles (aka "Super Learners"), dimensionality reduction methods (PCA, GLRM), clustering algorithms (K-means), and anomaly detection methods, among others.
R and Python code examples using H2O will be demoed live and made available on GitHub so that participants can follow along on their laptops if they choose. For those interested in running the code on a multi-node Amazon EC2 cluster, an H2O AMI is also available.
Author Bio:
Dr. Erin LeDell is a Machine Learning Scientist at H2O.ai, the company that produces the open source machine learning platform, H2O. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from UC Berkeley. Before joining H2O.ai, she was the Principal Data Scientist at Wise.io (acquired by GE in 2016) and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc.
Intro to H2O Machine Learning in R at Santa Clara University (Sri Ambati)
Erin LeDell's presentation on Intro to H2O Machine Learning in R at SCU
Intro to H2O in Python - Data Science LA (Sri Ambati)
Erin LeDell's presentation on Intro to H2O Machine Learning in Python at Data Science LA meetup on 1.19.16
Skutil - H2O meets Sklearn - Taylor Smith (Sri Ambati)
Skutil brings the best of both worlds to H2O and sklearn, delivering an easy transition into the world of distributed computing that H2O offers, while providing the same, familiar interface that sklearn users have come to know and love.
Slides for a presentation I gave for the Machine Learning with Spark Tokyo meetup.
Introduction to Spark, H2O, SparklingWater and live demos of GBM and DL.
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, which was previously available only in the h2oEnsemble R package; it now enables stacking from all the H2O APIs: Python, R, Scala, etc.
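To make the idea of "learning the optimal combination of the base learner fits" concrete, here is a minimal, standard-library-only sketch of stacking on a toy regression problem. It is not H2O's Stacked Ensemble implementation: the two base learners (a straight-line fit and k-nearest neighbours), the synthetic data, and the use of plain unconstrained least squares as the metalearner are all assumptions chosen to keep the example small.

```python
import math
import random


def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x


def fit_knn(xs, ys, k=5):
    """k-nearest-neighbour regression; returns a prediction function."""
    points = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def cv_predictions(fit, xs, ys, n_folds=5):
    """Out-of-fold predictions: each point is scored by a model that never saw it."""
    n = len(xs)
    preds = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, n, n_folds):
            preds[i] = model(xs[i])
    return preds


def fit_metalearner(z1, z2, ys):
    """Least-squares weights (w1, w2) minimising sum((y - w1*z1 - w2*z2)**2)."""
    a11 = sum(p * p for p in z1)
    a12 = sum(p * q for p, q in zip(z1, z2))
    a22 = sum(q * q for q in z2)
    b1 = sum(p * y for p, y in zip(z1, ys))
    b2 = sum(q * y for q, y in zip(z2, ys))
    det = a11 * a22 - a12 * a12  # 2x2 normal equations solved by Cramer's rule
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


# Synthetic data: a linear trend plus a wiggle that neither learner fits perfectly.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(200)]
ys = [x + math.sin(3 * x) + rng.gauss(0, 0.2) for x in xs]

# Level-one data: cross-validated predictions from each base learner.
z_lin = cv_predictions(fit_linear, xs, ys)
z_knn = cv_predictions(fit_knn, xs, ys)

# The metalearner learns how to weight the base learners.
w_lin, w_knn = fit_metalearner(z_lin, z_knn, ys)
stacked = [w_lin * p + w_knn * q for p, q in zip(z_lin, z_knn)]

# Final ensemble: base learners refit on all data, combined with learned weights.
lin_model, knn_model = fit_linear(xs, ys), fit_knn(xs, ys)


def ensemble(x):
    return w_lin * lin_model(x) + w_knn * knn_model(x)


print("linear CV MSE :", mse(z_lin, ys))
print("kNN CV MSE    :", mse(z_knn, ys))
print("stacked MSE   :", mse(stacked, ys))
```

The stacked error printed here is measured on the same out-of-fold predictions that were used to fit the weights, so it is somewhat optimistic; a held-out test set would give a fairer comparison. Metalearners in practice are often constrained (for example, to non-negative weights), which this sketch does not do.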
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing.
H2O Deep Water - Making Deep Learning Accessible to Everyone (Sri Ambati)
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all of H2O's properties in scalability, ease of use and deployment. In this talk, I will go through the motivation and benefits of Deep Water. After that, I will demonstrate how to build and deploy deep learning models with or without programming experience using H2O's R/Python/Flow (Web) interfaces.
Jo-fai (or Joe) is a data scientist at H2O.ai. Before joining H2O, he was in the business intelligence team at Virgin Media in the UK, where he developed data products to enable quick and smart business decisions. He also worked remotely for Domino Data Lab in the US as a data science evangelist, promoting products via blogging and giving talks at meetups. Joe has a background in water engineering. Before his data science journey, he was an EngD research engineer at STREAM Industrial Doctorate Centre working on machine learning techniques for drainage design optimization. Prior to that, he was an asset management consultant specializing in data mining and constrained optimization for the utilities sector in the UK and abroad. He also holds an MSc in Environmental Management and a BEng in Civil Engineering.
H2O World - H2O Deep Learning with Arno Candel (Sri Ambati)
H2O World 2015
Tutorial scripts for R and Python are here:
https://github.com/h2oai/h2o-world-2015-training/tree/master/tutorials/deeplearning
Michal Malohlava talks about the PySparkling Water package for Spark and Python users.
H2O Rains with Databricks Cloud - NY 02.16.16 (Sri Ambati)
Michal Malohlava's presentation on H2O Rains with Databricks Cloud, New York, NY 02.16.16
Jeff will showcase sparklyr, the new R package for interfacing with Spark, and talk about its different extensions, including the rsparkling ML package.
These slides show how to approach a multi-class classification problem using H2O. The data used is an aggregated log from multiple systems that constantly report information about their status, connections and traffic. In large organizations, these log datasets can be very large and unidentifiable due to the number of sources, legacy systems, etc. In our example, we use a created response for each source. We then use H2O to classify the source of the data.
Author Bio: Ashrith Barthur is a Security Scientist at H2O, currently working on algorithms that detect anomalous behaviour in user activities, network traffic, attacks, financial fraud and global money movement. He has a PhD from Purdue University in the field of information security, specializing in anomalous behaviour in the DNS protocol.
Don’t forget to download H2O!
http://www.h2o.ai/download/
Dmitry will show the audience how to get started with MXNet and build deep learning models to classify images, sound and text.
Making Multimillion-Dollar Baseball Decisions with H2O AutoML, LIME and Shiny (Jo-fai Chow)
Joe recently teamed up with IBM and Aginity to create a proof of concept "Moneyball" app for the IBM Think conference in Vegas. The original goal was to prove that different tools (e.g. H2O, Aginity AMP, IBM Data Science Experience, R and Shiny) could work together seamlessly for common business use-cases. Little did Joe know, the app would be used by Ari Kaplan (the real "Moneyball" guy) to validate the future performance of some baseball players. Ari recommended one player to a Major League Baseball team. The player was signed the next day with a multimillion-dollar contract. This talk is about Joe's journey to a real "Moneyball" application.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups” and that operate over change data.
Workflow-Driven Geoinformatics Applications and Training in the Big Data Era (Ilkay Altintas, Ph.D.)
My slides from the Big Data and The Earth Sciences: Grand Challenges Workshop on May 31st, 2017. Workshop link: http://prp.ucsd.edu/events/big-data-and-the-earth-science-grand-challenges-workshop
Intro to Data Science for Non-Data Scientists (Sri Ambati)
Erin LeDell and Chen Huang's presentations from the Intro to Data Science for Non-Data Scientists Meetup at H2O HQ on 08.20.15
A three hour lecture I gave at the Jyväskylä Summer School. The talk goes through important details about the use of data science in real businesses. These include data deployment, data processing, practical issues with data solutions and arising trends in data science.
See also Part 1 of the lecture: Introduction Data Science. You can find it in my profile (click the face)
Considerations and challenges in building an end-to-end microbiome workflow (Eagle Genomics)
Many of the data management and analysis challenges in microbiome research are shared with genomics and other life-science big-data disciplines. However, some aspects are specific: some are intrinsic to microbiome data, some are related to the maturity of the field, and others are related to extracting business value from the data.
I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human in the Loop Data Analytics!
Two hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
Many of us data science and business analytics practitioners perform research and analysis for decision makers on a regular basis. The deliverable of such analysis is often a PowerPoint presentation and/or a model that needs to be productionalized. The code used to produce the analysis also needs to be considered a deliverable.
Many of us perform analysis without reproducibility in mind. With the increasing democratization of data, it is becoming more and more important for people who may not have scientific training to be able to create analyses that somebody else can pick up and reproduce. Beyond that, creating reproducible research is just solid science.
We are going to spend an evening walking through the various tools available to create reproducible research on Big Data. You will be introduced to the Tidyverse of R packages and how to use them. We will discuss the ins and outs of various notebook technologies like Jupyter and Zeppelin. You will have an opportunity to learn how to get up and running with R and Spark, and the various options you have to learn on real clusters instead of just your local environment. There will also be a quick introduction to source control and the various options you have around using Git.
The theme of the evening will be “getting started”. We will go over various training resources and show you the optimal path to go from zero to master. Some commentary will be provided around the current state of the job market and intel from the front lines of the data science language wars. This is a large topic and the evening will be fairly dynamic and responsive to the needs of the audience.
Bob Wakefield has spent the better part of 16 years building data systems for many organizations across various industries. He has been running Hadoop in a lab environment for 3 years. He is the principal of Mass Street Analytics, LLC, a boutique data consultancy. Mass Street is a Hortonworks Consultant Partner and Confluent Partner.
In his spare time, he likes to work on an equity investment application that combines various sources of information to automatically arrive at investing decisions. When he is not doing that, you’ll find him flying his A-10 simulator. Full CV can be found here: https://www.linkedin.com/in/bobwakefieldmba/
On April 11th 2016, Prof. Henning Müller (HES-SO Valais-Wallis and Martinos Center) presented Challenges in medical imaging and the VISCERAL model at the National Cancer Institute in Washington.
The Data Lake and Getting Businesses the Big Data Insights They Need (Dunn Solutions Group)
Do terms like "Data Lake" confuse you? You’re not alone. With all of the technology buzzwords flying around today, it can be a task to keep up with and clearly understand each of them. However, a data lake is definitely worth taking the time to understand. Leveraging data lake technology, companies are finally able to keep all of their disparate information and streams of data in one secure location, ready for consumption at any time. This includes structured, unstructured, and semi-structured data. For more information on our Big Data Consulting Services, don’t hesitate to visit us online at: http://bit.ly/2fvV5rR
Building LLM Solutions using Open Source and Closed Source Solutions in Coher... (Sri Ambati)
Sandeep Singh, Head of Applied AI Computer Vision, Beans.ai
H2O Open Source GenAI World SF 2023
In the modern era of machine learning, leveraging both open-source and closed-source solutions has become paramount for achieving cutting-edge results. This talk delves into the intricacies of seamlessly integrating open-source Large Language Model (LLM) solutions like Vicuna, Falcon, and Llama with closed-source offerings from industry giants, such as ChatGPT and Google's PaLM. As the demand for fine-tuned and specialized datasets grows, it is imperative to understand the synergy between these tools. Attendees will gain insights into best practices for building and enriching datasets tailored for fine-tuning tasks, ensuring that their LLM projects are both robust and efficient. Through real-world examples and hands-on demonstrations, this talk will equip attendees with the knowledge to harness the power of both open and closed-source tools in a coherent and effective manner.
Patrick Hall, Professor, AI Risk Management, The George Washington University
H2O Open Source GenAI World SF 2023
Language models are incredible engineering breakthroughs but require auditing and risk management before productization. These systems raise concerns about toxicity, transparency and reproducibility, intellectual property licensing and ownership, disinformation and misinformation, supply chains, and more. How can your organization leverage these new tools without taking on undue or unknown risks? While language models and associated risk management are in their infancy, a small number of best practices in governance and risk are starting to emerge. If you have a language model use case in mind, want to understand your risks, and do something about them, this presentation is for you!
Dr. Alexy Khrabrov, Open Source Science Community Director, IBM
H2O Open Source GenAI World SF 2023
In this talk, Dr. Alexy Khrabrov, recently elected Chair of the new Generative AI Commons at Linux Foundation for AI & Data, outlines the OSS AI landscape, challenges, and opportunities. With new models and frameworks being unveiled weekly, one thing remains constant: community building and validation of all aspects of AI is key to reliable and responsible AI we can use for business and society needs. Industrial AI is one key area where such community validation can prove invaluable.
Michelle Tanco, Head of Product, H2O.ai
H2O Open Source GenAI World SF 2023
Learn how the makers at H2O.ai are building internal tools to solve real use cases using H2O Wave and h2oGPT. We will walk through an end-to-end use case and discuss how to incorporate business rules and generated content to rapidly develop custom AI apps using only Python APIs.
Applied Gen AI for the Finance Vertical Sri Ambati
Megan Kurka, Vice President, Customer Data Scientist, H2O.ai
H2O Open Source GenAI World SF 2023
Discover the transformative power of Applied Gen AI. Learn how the H2O team builds customized applications and workflows that integrate capabilities of Gen AI and AutoML specifically designed to address and enhance financial use cases. Explore real world examples, learn best practices, and witness firsthand how our innovative solutions are reshaping the landscape of finance technology.
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
Pascal Pfeiffer, Principal Data Scientist, H2O.ai
H2O Open Source GenAI World SF 2023
This talk dives into the expansive ecosystem of Large Language Models (LLMs), offering practitioners an insightful guide to various relevant applications, from natural language understanding to creative content generation. While exploring use cases across different industries, it also honestly addresses the current limitations of LLMs and anticipates future advancements.
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
En esta reunión virtual, damos una introducción a la plataforma de aprendizaje automático de código abierto número 1, H2O-3 y te mostramos cómo puedes usarla para desarrollar modelos para resolver diferentes casos de uso.
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
Numerai is an open, crowd-sourced hedge fund powered by predictions from data scientists around the world. In return, participants are rewarded with weekly payouts in crypto.
In this talk, Joe will give an overview of the Numerai tournament based on his own experience. He will then explain how he automates the time-consuming tasks such as testing different modelling strategies, scoring new datasets, submitting predictions to Numerai as well as monitoring model performance with H2O Driverless AI and R.
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
In this session, you will learn about what you should do after you’ve taken an AI transformation baseline. Over the span of this session, we will discuss the next steps in moving toward AI readiness through alignment of talent and tools to drive successful adoption and continuous use within an organization.
To find additional videos on AI courses, earn badges, join the courses at H2O.ai Learning Center: https://training.h2o.ai/products/ai-foundations-course
To find the Youtube video about this presentation: https://youtu.be/K1Cl3x3rd8g
Speaker:
Chemere Davis (H2O.ai - Senior Data Scientist Training Specialist)
2. INTRO – BIO
• Jorge Luis Hernandez Villapol
• Engineering Intern at H2O.ai
• Graduate Student at UNT, Master's in Electrical Engineering
• Background: Electronics Engineer
• Jorge@h2o.ai
3. AGENDA
• Intro - Bio
• Data Scientist Checklist
• H2O Intro – Products
• H2O Workflow
• Demo
• Where to go next?
• Q&A
4. CHECKLIST – BE PASSIONATE
• 10,000-Hour Rule - Malcolm Gladwell
• Marios Michailidis aka KazAnova
– Kaggle Grand Master – Top 3
– Senior Data Scientist at H2O.ai
5. CHECKLIST – BE EAGER TO LEARN
• New models
• New frameworks
• New technology
• New and old approaches
6. CHECKLIST – KEEP YOUR FUNDAMENTALS IN CHECK
• Statistics Fundamentals
– Mean, Median, Variance
– Random Variables, pdf
– Central Limit Theorem, i.i.d.
• Error Metrics
– RMSE, MSE, AUC
• Accuracy vs Precision
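The fundamentals listed above can be tried out with nothing but the Python standard library. The sketch below is only an illustration: the `actual` and `predicted` values are made-up numbers, and the helper names `mse` and `rmse` are hypothetical, not part of H2O.

```python
import math
import statistics

# Hypothetical sample: observed values and a model's predictions for them.
actual = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.5, 5.0, 3.0, 8.0, 4.0]

# Descriptive statistics of the observed values.
mean_actual = statistics.mean(actual)            # arithmetic mean
median_actual = statistics.median(actual)        # middle value after sorting
variance_actual = statistics.pvariance(actual)   # population variance


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))


print("mean, median, variance:", mean_actual, median_actual, variance_actual)
print("MSE, RMSE:", mse(actual, predicted), rmse(actual, predicted))
```

RMSE is often easier to read than MSE because it is expressed in the same units as the value being predicted.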
7. CHECKLIST – THE DATA SCIENTIST CYCLE
• Idea
• Question & Hypothesis
• Data Mining
• Modeling
• Evaluation
• Present & Document
• Deploy
9. CHECKLIST – HAVE A TOOLBOX … OF SOLUTIONS
• Whenever you get a new problem
– Have I done this before?
– Have I done something similar before?
– Can I reuse/adapt something I have done before?
• Whenever you get a new solution
– Document
– Present
– Save
10. CHECKLIST – KEEP YOUR TOOLBOX UPDATED AND GROWING
• Do your own benchmark between your tools.
• Keep an eye out for updates (FYI H2O makes minor releases every 2 weeks)
11. CHECKLIST – SEPARATE YOUR DATA
• Overfitting is public enemy #1
• A good rule of thumb is to have a training, a validation and a test set.
• Be careful with the split! No leakage to your test set!
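One way to picture the split is the small standard-library sketch below. It is an assumption-laden illustration, not H2O's own splitting routine: the function name `train_valid_test_split`, the 70/15/15 proportions and the seed are all hypothetical choices.

```python
import random


def train_valid_test_split(rows, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle the row indices once, then cut them into three disjoint parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split reproducible
    n_test = round(len(rows) * test_frac)
    n_valid = round(len(rows) * valid_frac)
    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_idx), pick(valid_idx), pick(test_idx)


rows = list(range(100))  # stand-in for 100 distinct records
train, valid, test = train_valid_test_split(rows)

# Basic leakage check: no record may appear in more than one split.
assert not set(train) & set(test)
assert not set(valid) & set(test)
assert not set(train) & set(valid)
print(len(train), len(valid), len(test))
```

A disjoint split only rules out the most direct form of leakage; duplicated records or features derived from the target can still leak information and need their own checks.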
12. CHECKLIST – ONE ENSEMBLE TO RULE THEM ALL OR SIMPLER IS BETTER?
• Start with a simpler model as your baseline.
• Grow in complexity until satisfied.
• Ensembles and stacking help against overfitting.
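The "baseline first" advice can be sketched as follows. The data is a made-up toy set, and the two models (a constant mean predictor, then a least-squares line) are illustrative assumptions rather than anything H2O-specific.

```python
import statistics

# Hypothetical toy data with a roughly linear relationship.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1]
x_test = [6.0, 7.0]
y_test = [12.0, 14.1]


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Baseline: always predict the training mean, ignoring the input entirely.
y_mean = statistics.mean(y_train)
baseline_pred = [y_mean for _ in x_test]

# One step up in complexity: a least-squares line y = slope * x + intercept.
x_mean = statistics.mean(x_train)
slope = (
    sum((x - x_mean) * (y - y_mean) for x, y in zip(x_train, y_train))
    / sum((x - x_mean) ** 2 for x in x_train)
)
intercept = y_mean - slope * x_mean
line_pred = [slope * x + intercept for x in x_test]

baseline_mse = mse(y_test, baseline_pred)
line_mse = mse(y_test, line_pred)
print("baseline MSE:", baseline_mse)
print("line MSE:", line_mse)
```

Any more complex model then has to beat both numbers on held-out data to justify its extra complexity.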
13. WHAT IS H2O?
H2O is an open source, in-memory, distributed, fast, and scalable machine
learning and predictive analytics platform that allows you to build machine
learning models on big data and provides easy productionalization of those
models in an enterprise environment.
14. ALGORITHMS ON H2O
Supervised Learning
• Statistical Analysis
– Generalized Linear Models: Binomial, Gaussian, Gamma, Poisson and Tweedie
– Naïve Bayes
• Ensembles
– Distributed Random Forest: Classification or regression models
– Gradient Boosting Machine: Produces an ensemble of decision trees with increasingly refined approximations
• Deep Neural Networks
– Deep learning: Creates multi-layer feed-forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations
Unsupervised Learning
• Clustering
– K-means: Partitions observations into k clusters/groups of the same spatial size. Automatically detects optimal k
• Dimensionality Reduction
– Principal Component Analysis: Linearly transforms correlated variables to uncorrelated components
– Generalized Low Rank Models: Extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data
• Anomaly Detection
– Autoencoders: Find outliers using nonlinear dimensionality reduction with deep learning
16. WHERE TO GO NEXT?
• Download and test for yourself
– https://www.h2o.ai/
• Docs
– http://docs.h2o.ai/h2o/latest-stable/index.html
• Video Tutorials
– https://www.youtube.com/user/0xdata
Editor's Notes
A little bit about myself: my research is currently on cognitive radios, using machine learning to sense the RF spectrum environment and make use of underutilized frequency bands.
At the last meetup, people approached me with the question: where do I start? That's why we decided to do a more introductory meetup this time. If you are already a data scientist, the first half hour will be a little slow, but it is a good chance to review the principles.
A little generic advice to start with. Like everything you do, when you do it with passion you have a higher chance of excelling at it.
Malcolm Gladwell – author
KazAnova started his data science career relatively recently. In his talk on how to win Kaggle competitions, he tells the story of how he participated in 100 competitions before he won his first one, and how he was putting in 60 hours a week on top of his normal work. Most of the other Kaggle Grand Masters will tell you the same story: the passion that drives these people makes them obsessed with each competition and everything they do. Take them as an example of how much time you should invest in your data science journey.
You can check this one just by being here.
Truth is, it is good to take some time every couple of weeks to try new solutions. Take a webinar, attend a meetup. Set aside a couple of hours to do some critical review.
This is where I see most people fail, and that's why we are gonna touch on this early on.
When you ask me where to start, I'm gonna say some Statistics 101. You need to know these concepts like the back of your hand. Data science is not just making models and throwing them at the data. You need to do some data preparation, and these concepts are the ones that are gonna help you understand your data.
Metrics are possibly the most important thing you need to know. Why? Because your model is gonna report how it is doing in terms of these metrics. Choosing the right metric depends on your data and on your problem, and understanding whether you are doing well or badly is essential. "Lower is better" is not gonna cut it.
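A quick illustration of why "lower is better" is not enough: AUC is a metric where higher is better. The sketch below computes it by brute force as the fraction of positive/negative pairs that the scores rank correctly. The labels and scores are made-up, and the `auc` helper is a hypothetical name, not an H2O function.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This compares every positive/negative pair,
    which is fine for a small illustration but slow on large data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]             # hypothetical true classes
scores = [0.1, 0.4, 0.35, 0.8]    # hypothetical model scores
print("AUC:", auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, the opposite direction from error metrics such as RMSE.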
One honorable mention is understanding the difference between accuracy and precision. If your model is very accurate but not precise, it means that when you are wrong, you are very wrong, miles away from your target. Looking into those cases might be useful. Be a scientist; always be curious.
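One way to make the distinction concrete, assuming the measurement-style reading where accuracy is how close the average prediction is to the target and precision is how tightly the predictions cluster. The numbers and the `bias_and_spread` helper are hypothetical.

```python
import statistics

target = 10.0
# Hypothetical predictions for the same target from two models.
model_a = [4.0, 16.0, 7.0, 13.0, 10.0]    # centered on the target, widely scattered
model_b = [12.0, 12.1, 11.9, 12.0, 12.0]  # tightly grouped, but off target


def bias_and_spread(predictions, true_value):
    """Return (mean error, standard deviation of the errors)."""
    errors = [p - true_value for p in predictions]
    return statistics.mean(errors), statistics.pstdev(errors)


bias_a, spread_a = bias_and_spread(model_a, target)
bias_b, spread_b = bias_and_spread(model_b, target)
print("model A: bias", bias_a, "spread", spread_a)  # accurate, not precise
print("model B: bias", bias_b, "spread", spread_b)  # precise, not accurate
```

Model A is right on average, but its individual misses are large, which is the "when you are wrong, you are very wrong" case worth investigating.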
You start with an idea! I want to predict something. This might be your own idea or management's idea, and it is your job to make it happen. So the very first step is defining your question: what do I want to predict? And your hypothesis: “I think that DL can do that.”
Then you start a very extensive process, which I summarized in this slide as data mining. Data mining involves many things, from data exploration to feature engineering and data insights. The amount of time you spend on this step varies from problem to problem and from data scientist to data scientist. Some spend 90% of their time on this step and then build a very simple model. This step is for sure the most important part of the whole cycle, since a mistake here can break your whole solution. Then comes modeling: this is where you select which model you want to use and which hyperparameters are best.
Then comes your evaluation, the moment of truth. In this phase you will know whether your solution is worth something or not. If you are not satisfied with your model and think you can do better, you go back to the very beginning, to your question and hypothesis. If you are satisfied, you make sure to document your work and present your findings (to your boss or even to yourself). If everything looks good, you move on to the final step, which is to deploy your model.
This might be obvious, but you have to know how to write good-quality code. There are so many languages nowadays that the question arises: which one should I learn?
Short answer: all of them. Before you lose faith in me, let me give you the long answer.
You need to be able to understand all these languages at a basic level and be able to write simple tasks in them. Truth is, most data operations are no more than mathematical operations that can be broken down into simple steps. For example, Census example. Another example: you have no idea how to code a particular task, and you google it. If you are lucky, you will find the answer in your preferred language. If you are a more advanced coder and find a solution not in your language but in another one, you should still be able to get the overall idea of what's happening.
Moreover, you do have to select a language that you feel comfortable with and want to be an expert in. In my case it is Python. Most or all of your work is going to be written in this language, and it has to be good-quality code. You not only want to be able to do all your data transformations, you also want to be efficient. Performance is where you are going to be an expert. Last but not least, good code should be readable. Sometimes there is a tradeoff between readability and performance, so be careful. Remember your data science cycle: you are going to document and present your work.