H2O World 2015 - Mark Landry
Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Curriculum Development
Learning Strategies
Very basic ideas about curriculum development focused for teachers in medical education with medical background .
Simple Regression presentation is a
partial fulfillment to the requirement in PA 297 Research for Public Administrators, presented by Atty. Gayam , Dr. Cabling and Mr. Cagampang
Curriculum Development
Learning Strategies
Very basic ideas about curriculum development focused for teachers in medical education with medical background .
Simple Regression presentation is a
partial fulfillment to the requirement in PA 297 Research for Public Administrators, presented by Atty. Gayam , Dr. Cabling and Mr. Cagampang
I gave a talk on Diaries for Data Collection to the staff and PhD students of Department of Computer Science at Bina Nusantra University, Indonesia, on 17 December 2020.
In this presentation, I shared my experience as a researcher who often uses diaries for data collection in research projects. I have been using diaries since my PhD years, and some of my PhD & MSc IT students also used diaries in their research studies. Diary is not as popular as other data collection methods, however, is able to provide insights where privacy is most concerned. As a UX researcher, I can see the benefits of a diary to my research findings.
Top 10 Data Science Practitioner PitfallsSri Ambati
Top 10 Data Science Practitioner Pitfalls Meetup with Erin LeDell and Mark Landry on 09.09.15
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
I gave a talk on Diaries for Data Collection to the staff and PhD students of Department of Computer Science at Bina Nusantra University, Indonesia, on 17 December 2020.
In this presentation, I shared my experience as a researcher who often uses diaries for data collection in research projects. I have been using diaries since my PhD years, and some of my PhD & MSc IT students also used diaries in their research studies. Diary is not as popular as other data collection methods, however, is able to provide insights where privacy is most concerned. As a UX researcher, I can see the benefits of a diary to my research findings.
Top 10 Data Science Practitioner PitfallsSri Ambati
Top 10 Data Science Practitioner Pitfalls Meetup with Erin LeDell and Mark Landry on 09.09.15
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
H2O World - Sparkling Water - Michal MalohlavaSri Ambati
H2O World 2015 - Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
These slides will show how to approach a multi-class (classification) problem using H2O. The data that is being used is an aggregated log of multiple systems that are constantly providing information about their status, connections and traffic. In large organizations, these log datasets can be very huge and unidentifiable due to the number of sources, legacy systems etc. In our example, we use a created response for each source. The use H2O to classify the source of data.
Ashrith Barthur is a Security Scientist at H2O currently working on algorithms that detect anomalous behaviour in user activities, network traffic, attacks, financial fraud and global money movement. He has a PhD from Purdue University in the field of information security, specialized in Anomalous behaviour in DNS protocol.
Don’t forget to download H2O!
http://www.h2o.ai/download/
YSI 5500D 5400 and 5200A aquaculture monitors and controllersXylem Inc.
Continuously monitor water quality in any aquaculture system with the YSI monitoring and control instruments.
Whether it is the need for a multiparameter solution or single measurement at one location, the YSI monitors are ideal.
The Model and the Train Wreck - A Training Data How-To -- @mrogati's talk at ...Monica Rogati
Getting training data for a recommender system is easy: if users clicked it, it’s a positive – if they didn’t, it’s a negative.
… Or is it? You’ve probably learned an algorithm to run on top of your existing algorithm, now and every time you re-train. And what do you do when the data product you’re building doesn’t have any users yet? Do you really launch with random results, hand label 50K examples, or ask a Turker to pretend they’re User #1337?
Unlike having a better algorithm, having better training data can improve your results by orders of magnitude. Yet training data generation is often an afterthought—a footnote in a formula-filled publication.
In this talk, we use examples from production recommender systems to bring training data to the forefront: from overcoming presentation bias to the art of crowdsourcing subjective judgments to creative data exhaust exploitation and feature creation.
Sparkling Water Meetup: Deep Learning for Public SafetySri Ambati
Sparkling Water Meetup 4.22.15 with Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Engineering patterns for implementing data science models on big data platformsHisham Arafat
Discussion of practically implementing data science models on big data platforms from engineering perspective. An eye opener on the engineering factors associated with designing and working solution. We use a simple text mining example on social media analytics for brand marketing. At the first while, it seems simple solution however if you go deeply and think on implementation aspects of even a simple analytics model, you can discover the degree of complexity at each part of the solution. An Abstraction of the Big Data key advantages would be very helpful to select appropriate Big Data technology components out of very large landscape. Two examples with reference are given for using Lambda Architecture and unusual way of image processing using Big Data abstraction provided.
H2O World - Welcome to H2O World with Arno CandelSri Ambati
H2O World 2015
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
H2O Machine Learning and Kalman Filters for Machine Prognostics - Galvanize SFSri Ambati
Hank Roark's presentation at Galvanize SF, 02.23.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Exploit Research and Development Megaprimer: DEP Bypassing with ROP ChainsAjin Abraham
Exploit Research and Development Megaprimer
http://opensecurity.in/exploit-research-and-development-megaprimer/
http://www.youtube.com/playlist?list=PLX3EwmWe0cS_5oy86fnqFRfHpxJHjtuyf
Open Source Framework for Deploying Data Science Models and Cloud Based Appli...ETCenter
Next generation applications address more sophisticated questions that go beyond 'What happened?' by using Machine Learning/Statistical modelling to answer 'Why?' and 'What will happen next? Data insights can be easily deployed and rapidly delivered to the decision makers via cloud based applications. This framework focuses on technologies available for the entire data workflow from ingestion and modeling to cloud deployment; Hadoop, MADlib, Python, R, CloudFoundry, etc. This presentation will also include examples of how this framework and innovative Data Science techniques have been applied across diverse business units within Media, including pricing analyses for ad optimization and predicting viewership.
Hacking Tizen: The OS of everything - WhitepaperAjin Abraham
Samsung’s first Tizen-based devices are set to launch in the middle of 2015. This paper presents the research outcome on the security analysis of Tizen OS and it’s underlying security architecture.
The paper begins with a quick introduction to Tizen architecture and explains the various components of Tizen OS. This will be followed by Tizen’s security model where application sandboxing and resource access control will be explained. Moving on, an overview of Tizen’s Content Security Framework which acts as an in-built malware detection API will be covered.
Various vulnerabilities in Tizen will be discussed including issues like Tizen WebKit2 address spoofing and content injection, Tizen WebKit CSP bypass and issues in Tizen’s memory protection (ASLR and DEP).
Applications in Tizen can be written in HTML5/JS/CSS or natively using C/C++. As a bonus, an overview of pentesting Tizen applications will also be presented along with some of the security implications. There will be comparisons made to traditional Android applications and how these security issues differ with Tizen.
Top 10 Data Science Practioner Pitfalls - Mark LandrySri Ambati
Over-fitting, misread data, NAs, collinear column elimination and other common issues play havoc in the day of practicing data scientist. In this talk, we review top 10 common pitfalls and steps to avoid them. #h2ony
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Top 10 Data Science Practitioner PitfallsSri Ambati
Over-fitting, misread data, NAs, collinear column elimination and other common issues play havoc in the day of practicing data scientist. In this talk, Mark Landry, one of the world’s leading Kagglers, will review the top 10 common pitfalls and steps to avoid them.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Lecture related to machine learning. Here you can read multiple things. Lecture related to machine learning. Here you can read multiple things. Lecture related to machine learning. Here you can read multiple things. Lecture related to machine learning. Here you can read multiple things. Lecture related to machine learning. Here you can read multiple things.
Data Analytics, Machine Learning, and HPC in Today’s Changing Application Env...Intel® Software
This session explains how solutions desired by such IT/Internet/Silicon Valley etc companies can look like, how they may differ from the more “classical” consumers of machine learning and analytics, and the arising challenges that current and future HPC development may have to cope with.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
Top 100+ Google Data Science Interview Questions.pdfDatacademy.ai
Data science interviews can be particularly difficult due to the many proficiencies that you'll have to demonstrate (technical skills, problem solving, communication) and the generally high bar to entry for the industry.we Provide Top 100+ Google Data Science Interview Questions : All You Need to know to Crack it
visit by :-https://www.datacademy.ai/google-data-science-interview-questions/
Building LLM Solutions using Open Source and Closed Source Solutions in Coher...Sri Ambati
Sandeep Singh, Head of Applied AI Computer Vision, Beans.ai
H2O Open Source GenAI World SF 2023
In the modern era of machine learning, leveraging both open-source and closed-source solutions has become paramount for achieving cutting-edge results. This talk delves into the intricacies of seamlessly integrating open-source Large Language Model (LLM) solutions like Vicuna, Falcon, and Llama with industry giants such as ChatGPT and Google's Palm. As the demand for fine-tuned and specialized datasets grows, it is imperative to understand the synergy between these tools. Attendees will gain insights into best practices for building and enriching datasets tailored for fine-tuning tasks, ensuring that their LLM projects are both robust and efficient. Through real-world examples and hands-on demonstrations, this talk will equip attendees with the knowledge to harness the power of both open and closed-source tools in a coherent and effective manner.
Patrick Hall, Professor, AI Risk Management, The George Washington University
H2O Open Source GenAI World SF 2023
Language models are incredible engineering breakthroughs but require auditing and risk management before productization. These systems raise concerns about toxicity, transparency and reproducibility, intellectual property licensing and ownership, disinformation and misinformation, supply chains, and more. How can your organization leverage these new tools without taking on undue or unknown risks? While language models and associated risk management are in their infancy, a small number of best practices in governance and risk are starting to emerge. If you have a language model use case in mind, want to understand your risks, and do something about them, this presentation is for you!
Dr. Alexy Khrabrov, Open Source Science Community Director, IBM
H2O Open Source GenAI World SF 2023
In this talk, Dr. Alexy Khrabrov, recently elected Chair of the new Generative AI Commons at Linux Foundation for AI & Data, outlines the OSS AI landscape, challenges, and opportunities. With new models and frameworks being unveiled weekly, one thing remains constant: community building and validation of all aspects of AI is key to reliable and responsible AI we can use for business and society needs. Industrial AI is one key area where such community validation can prove invaluable.
Michelle Tanco, Head of Product, H2O.ai
H2O Open Source GenAI World SF 2023
Learn how the makers at H2O.ai are building internal tools to solve real use cases using H2O Wave and h2oGPT. We will walk through an end-to-end use case and discuss how to incorporate business rules and generated content to rapidly develop custom AI apps using only Python APIs.
Applied Gen AI for the Finance Vertical Sri Ambati
Megan Kurka, Vice President, Customer Data Scientist, H2O.ai
H2O Open Source GenAI World SF 2023
Discover the transformative power of Applied Gen AI. Learn how the H2O team builds customized applications and workflows that integrate capabilities of Gen AI and AutoML specifically designed to address and enhance financial use cases. Explore real world examples, learn best practices, and witness firsthand how our innovative solutions are reshaping the landscape of finance technology.
Practitioner's Guide to LLMs: Exploring Use Cases and a Glimpse Beyond Curren...Sri Ambati
Pascal Pfeiffer, Principal Data Scientist, H2O.ai
H2O Open Source GenAI World SF 2023
This talk dives into the expansive ecosystem of Large Language Models (LLMs), offering practitioners an insightful guide to various relevant applications, from natural language understanding to creative content generation. While exploring use cases across different industries, it also honestly addresses the current limitations of LLMs and anticipates future advancements.
Introducción al Aprendizaje Automatico con H2O-3 (1)Sri Ambati
En esta reunión virtual, damos una introducción a la plataforma de aprendizaje automático de código abierto número 1, H2O-3 y te mostramos cómo puedes usarla para desarrollar modelos para resolver diferentes casos de uso.
From Rapid Prototypes to an end-to-end Model Deployment: an AI Hedge Fund Use...Sri Ambati
Numerai is an open, crowd-sourced hedge fund powered by predictions from data scientists around the world. In return, participants are rewarded with weekly payouts in crypto.
In this talk, Joe will give an overview of the Numerai tournament based on his own experience. He will then explain how he automates the time-consuming tasks such as testing different modelling strategies, scoring new datasets, submitting predictions to Numerai as well as monitoring model performance with H2O Driverless AI and R.
AI Foundations Course Module 1 - Shifting to the Next Step in Your AI Transfo...Sri Ambati
In this session, you will learn about what you should do after you’ve taken an AI transformation baseline. Over the span of this session, we will discuss the next steps in moving toward AI readiness through alignment of talent and tools to drive successful adoption and continuous use within an organization.
To find additional videos on AI courses, earn badges, join the courses at H2O.ai Learning Center: https://training.h2o.ai/products/ai-foundations-course
To find the Youtube video about this presentation: https://youtu.be/K1Cl3x3rd8g
Speaker:
Chemere Davis (H2O.ai - Senior Data Scientist Training Specialist)
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Accelerate Enterprise Software Engineering with PlatformlessWSO2
Key takeaways:
Challenges of building platforms and the benefits of platformless.
Key principles of platformless, including API-first, cloud-native middleware, platform engineering, and developer experience.
How Choreo enables the platformless experience.
How key concepts like application architecture, domain-driven design, zero trust, and cell-based architecture are inherently a part of Choreo.
Demo of an end-to-end app built and deployed on Choreo.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
CyanicLab, an offshore custom software development company based in Sweden,India, Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
3. H2O.ai
Machine Intelligence
1. Train vs Test
Training Set vs.
Test Set
• Partition the original data (randomly or stratified) into a
training set and a test set. (e.g. 70/30)
• It can be useful to evaluate the training error, but you
should not look at training error alone.
• Training error is not an estimate of generalization error
(on a test set or cross-validated), which is what you
should care more about.
• Training error vs test error over time is an useful thing to
calculate. It can tell you when you start to overfit your
model, so it is a useful metric in supervised machine
learning.
Training Error vs.
Test Error
6. H2O.ai
Machine Intelligence
2. Train vs Test vs Valid
Training Set vs.
Validation Set vs.
Test Set
• If you have “enough” data and plan to do some model
tuning, you should really partition your data into three
parts — Training, Validation and Test sets.
• There is no general rule for how you should partition the
data and it will depend on how strong the signal in your
data is, but an example could be: 50% Train, 25%
Validation and 25% Test
• The validation set is used strictly for model tuning (via
validation of models with different parameters) and the
test set is used to make a final estimate of the
generalization error.
Validation is for
Model Tuning
8. H2O.ai
Machine Intelligence
3. Model Performance
Test Error • Partition the original data (randomly) into a training set
and a test set. (e.g. 70/30)
• Train a model using the training set and evaluate
performance (a single time) on the test set.
• Train & test K
models as shown.
• Average the model
performance over
the K test sets.
• Report cross-
validated metrics.
• Regression: R^2, MSE, RMSE
• Classification: Accuracy, F1, H-measure, Log-loss
• Ranking (Binary Outcome): AUC, Partial AUC
K-fold
Cross-validation
Performance
Metrics
10. H2O.ai
Machine Intelligence
4. Class Imbalance
Imbalanced
Response Variable
• A dataset is said to be imbalanced when the binomial or
multinomial response variable has one or more classes
that are underrepresented in the training data, with
respect to the other classes.
• This is incredibly common in real-word datasets.
• In practice, balanced datasets are the rarity, unless they
have been artificially created.
• There is no precise definition of what defines an
imbalanced vs balanced dataset — the term is vague.
• My rule of thumb for binary response: If the minority
class makes <10% of the data, this can cause issues.
• Advertising — Probability that someone clicks on ad is
very low… very very low.
• Healthcare & Medicine — Certain diseases or adverse medical
conditions are rare.
• Fraud Detection — Insurance or credit fraud is rare.
Very common
Industries
11. H2O.ai
Machine Intelligence
4. Remedies
Artificial Balance • You can balance the training set using sampling.
• Notice that we don’t say to balance the test set. The test
set represents the true data distribution. The only way
to get “honest” model performance on your test set is to
use the original, unbalanced, test set.
• The same goes for the hold-out sets in cross-validation.
For this, you may end up having to write custom code,
depending on what software you use.
• H2O has a “balance_classes” argument that can be used to do
this properly & automatically.
• You can manually upsample (or downsample) your minority
(or majority) class(es) set either by duplicating (or sub-
sampling) rows, or by using row weights.
• The SMOTE (Synthetic Minority Oversampling Technique)
algorithm generates simulated training examples from the
minority class instead of upsampling.
Potential Pitfalls
Solutions
13. H2O.ai
Machine Intelligence
5. Categorical Data
Real Data • Most real world datasets contain categorical data.
• Problems can arise if you have too many categories.
• A lot of ML software will place limits on the number of
categories allowed in a single column (e.g. 1024) so you
may be forced to deal with this whether you like it or
not.
• When there are high-cardinality categorical columns,
often there will be many categories that only occur a
small number of times (not very useful).
• If you have some hierarchical knowledge about the data, then
you may be able to reduce the number of categories by using
some sensible higher-level mapping of the categories.
• Example: ICD-9 codes — thousands of unique diagnostic and
procedure codes. You can map each category to a higher
level super-category to reduce the cardinality.
Too Many
Categories
Solutions
15. H2O.ai
Machine Intelligence
6. Missing Data
Types of
Missing Data
• Unavailable: Valid for the observation, but not available
in the data set.
• Removed: Observation quality threshold may have not
been reached, and data removed
• Not applicable: measurement does not apply to the
particular observation (e.g. number of tires on a boat
observation)
• It depends! Some options:
• Ignore entire observation.
• Create an binary variable for each predictor to indicate
whether the data was missing or not
• Segment model based on data availability.
• Use alternative algorithm: decision trees accept missing
values; linear models typically do not.
What to Do
17. H2O.ai
Machine Intelligence
7. Outliers/Extreme Values
Types of Outliers
• Outliers can exist in response or predictors
• Valid outliers: rare, extreme events
• Invalid outliers: erroneous measurements
• Remove observations.
• Apply a transformation to reduce impact: e.g. log or
bins.
• Choose a loss function that is more robust: e.g. MAE vs
MSE.
• Impose a constraint on data range (cap values).
• Ask questions: Understand whether the values are valid
or invalid, to make the most appropriate choice.
What to Do
What Can
Happen
• Outlier values can have a disproportionate weight on the
model.
• MSE will focus on handling outlier observations more to
reduce squared error.
• Boosting will spend considerable modeling effort fitting
these observations.
19. H2O.ai
Machine Intelligence
8. Data Leakage
What Is It
• Leakage is allowing your model to use information that
will not be available in a production setting.
• Obvious example: using the Dow Jones daily gain/loss as
part of a model to predict individual stock performance
• Model is overfit.
• Will make predictions inconsistent with those you scored
when fitting the model (even with a validation set).
• Insights derived from the model will be incorrect.
• Understand the nature of your problem and data.
• Scrutinize model feedback, such as relative influence or
linear coefficient.
What Happens
What to Do
21. H2O.ai
Machine Intelligence
9. Useless Models
What is a
“Useless” Model?
• Solving the Wrong Problem.
• Not collecting appropriate data.
• Not structuring data correctly to solve the problem.
• Choosing a target/loss measure that does not optimize
the end use case: using accuracy to prioritize resources.
• Having a model that is not actionable.
• Using a complicated model that is less accurate than a
simple model.
• Understand the problem statement.
• Solving the wrong problem is an issue in all problem-solving
domains, but arguably easier with black box techniques
common to ML
• Utilize post-processing measures
• Create simple baseline models to understand lift of more
complex models
• Plan on an iterative approach: start quickly, even if on
imperfect data
• Question your models and attempt to understand them
What To Do
23. H2O.ai
Machine Intelligence
10. No Free Lunch
No Such Thing as
a Free Lunch
• No general purpose algorithm to solve all problems.
• No right answer on optimal data preparation.
• General heuristics are not always true:
• Tree models solve problems equivalently with any
order-preserving transformation.
• Decision trees and neural networks will
automatically find interactions.
• High number of predictors may be handled, but
lead to a less optimal result than fewer key
predictors.
• Models can not find relative information that span
multiple observations.
• Model feedback can be misleading: relative
influence, linear coefficients
• Understand how the underlying algorithms operate
• Try several algorithms and observe relative performance and
the characteristics of your data
• Feature engineering & feature selection
• Interpret and react to model feedback
What To Do
24. H2O.ai
Machine Intelligence
Where to learn more about H2O?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com