DutchMLSchool. Introduction to Machine Learning, Models, Evaluations, and Ensembles (Supervised Learning I) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesBigML, Inc
DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Supervised vs Unsupervised LearningBigML, Inc
Supervised versus Unsupervised Learning Techniques - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine learning is becoming widely used to automate decision making. While machine learning seems complex, it involves finding patterns in data that can be used to make useful predictions. The document discusses how factors like increased data availability, faster computation, and easier tools have led to the rise of machine learning applications. It also notes common pitfalls in early machine learning adoption like overhyping results and failing to develop a clear strategy. Overall machine learning is transforming industries by enabling cheaper and more data-driven decisions at scale.
DutchMLSchool. ML: A Technical PerspectiveBigML, Inc
DutchMLSchool. Machine Learning: A Technical Perspective
TITLE AS IN SCHEDULE - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Introduction to Machine Learning with the BigML PlatformBigML, Inc
Introduction to Machine Learning with the BigML Platform - ML for Executives Course.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Enhancing and Automating Decision Making with Machine Learning - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Anatomy of an Application: Machine Learning End-to-End - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine Learning: Business Perspective - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Logistic Regression, Deepnets, Time SeriesBigML, Inc
DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Supervised vs Unsupervised LearningBigML, Inc
Supervised versus Unsupervised Learning Techniques - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine learning is becoming widely used to automate decision making. While machine learning seems complex, it involves finding patterns in data that can be used to make useful predictions. The document discusses how factors like increased data availability, faster computation, and easier tools have led to the rise of machine learning applications. It also notes common pitfalls in early machine learning adoption like overhyping results and failing to develop a clear strategy. Overall machine learning is transforming industries by enabling cheaper and more data-driven decisions at scale.
DutchMLSchool. ML: A Technical PerspectiveBigML, Inc
DutchMLSchool. Machine Learning: A Technical Perspective
TITLE AS IN SCHEDULE - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Introduction to Machine Learning with the BigML PlatformBigML, Inc
Introduction to Machine Learning with the BigML Platform - ML for Executives Course.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Enhancing and Automating Decision Making with Machine Learning - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Anatomy of an Application: Machine Learning End-to-End - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine Learning: Business Perspective - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Opening Remarks - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. ML for Energy Trading and Automotive SectorBigML, Inc
Machine Learning for Energy Trading, Automotive Sector, and Logistics, presented by BigML's Partners A1 Digital.
Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
1) Square uses machine learning for fraud detection in payments and to power recommendations on its Square Market platform.
2) Random forests and gradient boosted trees are the primary algorithms used for fraud detection, achieving up to a 10-11% improvement over random forests alone.
3) Square has built scalable machine learning infrastructure including parallel environments, data transport systems, and a learning management system to support rapid model development and evaluation.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
MLSEV Virtual. Supervised vs UnsupervisedBigML, Inc
Supervised vs Unsupervised Learning Techniques, by Charles Parker, Vice President of Machine Learning algorithms at BigML.
*MLSEV 2020: Virtual Conference.
Module 4: Model Selection and EvaluationSara Hooker
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
H2O World 2015 - Mark Landry
Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Machine Learning: Understanding the Invisible Force Changing Our WorldKen Tabor
This document discusses the rise of machine learning and artificial intelligence. It provides quotes from industry leaders about the potential for AI to improve lives and build a better society. The text then explains what machine learning is, how it works through supervised, unsupervised and reinforcement learning, and some of the business applications of AI like product recommendations, fraud detection and machine translation. It also discusses the increasing investment in and priority placed on AI by companies, governments and researchers. The document encourages readers to consider the ethical implications of AI and ensure it is developed and applied in a way that benefits all of humanity.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes “best” interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we’ve faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I’ll offer general principles and concrete examples of how to interview data scientists. I’ll also touch on the challenges of sourcing and closing top candidates.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
Data Science: A Mindset for Productivity
Keynote at 2015 Ronin Labs West Coast CTO Summit
https://www.eventjoy.com/e/west-coast-cto-summit-2015
Abstract
Data science isn't just about using a collection of technologies and algorithms. Data science requires a mindset that solves problems at a higher level of abstraction. How do we model utility when we think about optimization? How do we decide which hypotheses to test? How do we allocate our scarce resources to make progress?
There are no silver bullets. But I'll share what I've learned from a variety of contexts over the course of my work at Endeca, Google, and LinkedIn; and I hope you'll leave this talk with some practical wisdom you can apply to your next data science project.
Machine learning the high interest credit card of technical debt [PWL]Jenia Gorokhovsky
Machine learning systems can accumulate significant technical debt, like other complex software systems. This debt makes the systems difficult to change, maintain, and improve over time. There are several common sources of technical debt unique to machine learning systems, including entanglement between components, correction cascades between models, unstable or underutilized data dependencies, undeclared outputs being consumed by other systems, and issues around changes in external data. Mitigating this debt requires strategies like merging mature models, pruning experimental code, comprehensively testing data and configurations, monitoring outputs, and mapping all data and system dependencies.
As we move into a new era of ITSM computing, new big data and machine learning tools and methodologies are being developed to support IT staff by intelligently extracting insights and making predictions from the enormous amounts of data accumulated from the organization. According to Gartner, I&O leaders must take a comprehensive approach to incorporate advanced big data and machine learning technologies into their organizations or risk becoming irrelevant. But what exactly is big data and machine learning all about? How can you introduce these concepts into your existing Service Desk?
Join USF’s distinguished Computer Science and Engineering Professor Lawrence Hall and SunView Software’s VP of Marketing and Product Strategy John Prestridge as they break down the fundamentals of big data and machine learning and provide real-world examples of the impact the technologies will have on ITSM.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org. If you would like to use this material to further our mission of improving access to machine learning. Education please reach out to inquiry@deltanalytics.org.
MLSEV. Models, Evaluations and Ensembles BigML, Inc
Introduction to Machine Learning. Supervised Learning (Part I): Models, Evaluations and Ensembles, by BigML.
MLSEV 2019: 1st edition of the Machine Learning School in Seville, Spain.
Opening Remarks - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. ML for Energy Trading and Automotive SectorBigML, Inc
Machine Learning for Energy Trading, Automotive Sector, and Logistics, presented by BigML's Partners A1 Digital.
Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
1) Square uses machine learning for fraud detection in payments and to power recommendations on its Square Market platform.
2) Random forests and gradient boosted trees are the primary algorithms used for fraud detection, achieving up to a 10-11% improvement over random forests alone.
3) Square has built scalable machine learning infrastructure including parallel environments, data transport systems, and a learning management system to support rapid model development and evaluation.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
MLSEV Virtual. Supervised vs UnsupervisedBigML, Inc
Supervised vs Unsupervised Learning Techniques, by Charles Parker, Vice President of Machine Learning algorithms at BigML.
*MLSEV 2020: Virtual Conference.
Module 4: Model Selection and EvaluationSara Hooker
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
H2O World - Top 10 Data Science Pitfalls - Mark LandrySri Ambati
H2O World 2015 - Mark Landry
Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Machine Learning: Understanding the Invisible Force Changing Our WorldKen Tabor
This document discusses the rise of machine learning and artificial intelligence. It provides quotes from industry leaders about the potential for AI to improve lives and build a better society. The text then explains what machine learning is, how it works through supervised, unsupervised and reinforcement learning, and some of the business applications of AI like product recommendations, fraud detection and machine translation. It also discusses the increasing investment in and priority placed on AI by companies, governments and researchers. The document encourages readers to consider the ethical implications of AI and ensure it is developed and applied in a way that benefits all of humanity.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes “best” interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we’ve faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I’ll offer general principles and concrete examples of how to interview data scientists. I’ll also touch on the challenges of sourcing and closing top candidates.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org.
Data Science: A Mindset for Productivity
Keynote at 2015 Ronin Labs West Coast CTO Summit
https://www.eventjoy.com/e/west-coast-cto-summit-2015
Abstract
Data science isn't just about using a collection of technologies and algorithms. Data science requires a mindset that solves problems at a higher level of abstraction. How do we model utility when we think about optimization? How do we decide which hypotheses to test? How do we allocate our scarce resources to make progress?
There are no silver bullets. But I'll share what I've learned from a variety of contexts over the course of my work at Endeca, Google, and LinkedIn; and I hope you'll leave this talk with some practical wisdom you can apply to your next data science project.
Machine learning the high interest credit card of technical debt [PWL]Jenia Gorokhovsky
Machine learning systems can accumulate significant technical debt, like other complex software systems. This debt makes the systems difficult to change, maintain, and improve over time. There are several common sources of technical debt unique to machine learning systems, including entanglement between components, correction cascades between models, unstable or underutilized data dependencies, undeclared outputs being consumed by other systems, and issues around changes in external data. Mitigating this debt requires strategies like merging mature models, pruning experimental code, comprehensively testing data and configurations, monitoring outputs, and mapping all data and system dependencies.
As we move into a new era of ITSM computing, new big data and machine learning tools and methodologies are being developed to support IT staff by intelligently extracting insights and making predictions from the enormous amounts of data accumulated from the organization. According to Gartner, I&O leaders must take a comprehensive approach to incorporate advanced big data and machine learning technologies into their organizations or risk becoming irrelevant. But what exactly is big data and machine learning all about? How can you introduce these concepts into your existing Service Desk?
Join USF’s distinguished Computer Science and Engineering Professor Lawrence Hall and SunView Software’s VP of Marketing and Product Strategy John Prestridge as they break down the fundamentals of big data and machine learning and provide real-world examples of the impact the technologies will have on ITSM.
Delta Analytics is a 501(c)3 non-profit in the Bay Area. We believe that data is powerful, and that anybody should be able to harness it for change. Our teaching fellows partner with schools and organizations worldwide to work with students excited about the power of data to do good.
Welcome to the course! These modules will teach you the fundamental building blocks and the theory necessary to be a responsible machine learning practitioner in your own community. Each module focuses on accessible examples designed to teach you about good practices and the powerful (yet surprisingly simple) algorithms we use to model data.
To learn more about our mission or provide feedback, take a look at www.deltanalytics.org. If you would like to use this material to further our mission of improving access to machine learning. Education please reach out to inquiry@deltanalytics.org.
MLSEV. Models, Evaluations and Ensembles BigML, Inc
Introduction to Machine Learning. Supervised Learning (Part I): Models, Evaluations and Ensembles, by BigML.
MLSEV 2019: 1st edition of the Machine Learning School in Seville, Spain.
Introduction to End-to-End Machine Learning: Classification and Regression - Mercè Martín, VP of Bindings and Applications at BigML.
*Machine Learning School in The Netherlands 2022.
Cluster Analysis and Anomaly Detection (Unsupervised I) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
MLSEV Virtual. Automating Model SelectionBigML, Inc
1) Bayesian parameter optimization uses machine learning to predict the performance of untrained models based on parameters from previous models to efficiently search the parameter space.
2) However, there are still important issues like choosing the right evaluation metric, ensuring no information leakage between training and test data, and selecting the appropriate model for the problem and available data.
3) Automated model selection requires sufficient data to make accurate predictions; with insufficient data, the process can fail.
This document provides an overview of model evaluation metrics for supervised machine learning models. It discusses the importance of evaluating models to assess their performance and usefulness. Common evaluation metrics are introduced, including accuracy, precision, recall, F1 score, confusion matrices and more. The document cautions that models should not be evaluated on the training data, as this can provide overly optimistic results, and advocates for train-test splits or cross validation. Real-world examples and demonstrations are provided.
DutchMLSchool 2022 - Anomaly Detection at ScaleBigML, Inc
Lessons Learned Applying Anomaly Detection at Scale, by Álvaro Clemente, Machine Learning Engineer at BigML.
*Machine Learning School in The Netherlands 2022.
This document summarizes a presentation on model evaluation given at the 4th annual Valencian Summer School in Machine Learning. It discusses the importance of evaluating models to understand how well they will perform on new data and identify mistakes. Various evaluation metrics are introduced like accuracy, precision, recall, F1 score, and Phi coefficient. The dangers of evaluating on training data are explained, and techniques like train-test splits and cross-validation are recommended to get less optimistic evaluations. Regression metrics like MAE, MSE, and R-squared error are also covered. Different evaluation techniques for specific problem types like imbalanced classification, time series forecasting, and model selection are discussed.
An introduction to machine learning and statisticsSpotle.ai
This document provides an overview of machine learning and predictive modeling. It begins by describing how predictive models can be used in various domains like healthcare, finance, telecom, and business. It then discusses the differences between machine learning and predictive modeling, noting that machine learning aims to allow machines to learn autonomously using feedback mechanisms, while predictive modeling focuses on building statistical models to predict outcomes. The document also uses examples like Microsoft's Tay chatbot to illustrate how machine learning systems can be exposed to real-world data to continuously learn and improve. It concludes by explaining how predictive analytics fits within machine learning as the starting point to build initial predictive models and continuously monitor and refine them.
The document discusses parameter optimization and machine learning techniques for tuning models. It covers using machine learning to predict the performance of parameter configurations before training models on them, called Bayesian parameter optimization. It also discusses dangers of naive cross-validation and how to select the best model by considering factors beyond just performance like retraining needs and prediction speed. The document advocates creating diverse ensembles through techniques like fusions to improve stability and importance profiles.
VSSML17 L3. Clusters and Anomaly DetectionBigML, Inc
Valencian Summer School in Machine Learning 2017 - Day 1
Lecture 3: Clusters and Anomaly Detection. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
HR Analytics: Using Machine Learning to Predict Employee Turnover - Matt Danc...Sri Ambati
Presented at #H2OWorld 2017 in Mountain View, CA.
Enjoy the video: https://youtu.be/-qfEOwm5Th4.
Learn more about H2O.ai: https://www.h2o.ai/.
Follow @h2oai: https://twitter.com/h2oai.
- - -
In this talk, we discuss how we implemented H2O and LIME to predict and explain employee turnover on the IBM Watson HR Employee Attrition dataset. We use H2O’s new automated machine learning algorithm to improve on the accuracy of IBM Watson. We use LIME to produce feature importance and ultimately explain the black-box model produced by H2O.
Matt Dancho is the founder of Business Science (www.business-science.io), a consulting firm that assists organizations in applying data science to business applications. He is the creator of R packages tidyquant and timetk and has been working with data science for business and financial analysis since 2011. Matt holds master’s degrees in business and engineering, and has extensive experience in business intelligence, data mining, time series analysis, statistics and machine learning. Connect with Matt on twitter (https://twitter.com/mdancho84) and LinkedIn (https://www.linkedin.com/in/mattdancho/).
Machine Learning Essentials Demystified part1 | Big Data DemystifiedOmid Vahdaty
Machine Learning Essentials Abstract:
Machine Learning (ML) is one of the hottest topics in the IT world today. But what is it really all about?
In this session we will talk about what ML actually is and in which cases it is useful.
We will talk about a few common algorithms for creating ML models and demonstrate their use with Python. We will also take a peek at Deep Learning (DL) and Artificial Neural Networks and explain how they work (without too much math) and demonstrate DL model with Python.
The target audience are developers, data engineers and DBAs that do not have prior experience with ML and want to know how it actually works.
AI in the Real World: Challenges, and Risks and how to handle them?Srinath Perera
This document discusses challenges, risks, and how to handle them with AI in the real world. It covers:
- AI can perform tasks like driving a car faster and cheaper than humans, but can't fully explain how.
- Deploying and managing AI models at scale is complex, as is integrating models with user experiences. Bias and lack of transparency are also risks.
- When applying AI, such as in high-risk domains like medicine, it is important to audit models, gradually introduce them with trials, monitor outcomes, and find ways to identify and address errors or unfair impacts. With care and oversight, AI can be developed to help more people than it harms.
Hacking Predictive Modeling - RoadSec 2018HJ van Veen
This document provides an overview of machine learning and predictive modeling techniques for hackers and data scientists. It discusses foundational concepts in machine learning like functionalism, connectionism, and black box modeling. It also covers practical techniques like feature engineering, model selection, evaluation, optimization, and popular Python libraries. The document encourages an experimental approach to hacking predictive models through techniques like brute forcing hyperparameters, fuzzing with data permutations, and social engineering within data science communities.
If you are curious what is ML all about, this is a gentle introduction to Machine Learning and Deep Learning. This includes questions such as why ML/Data Analytics/Deep Learning ? Intuitive Understanding o how they work and some models in detail. At last I share some useful resources to get started.
Similar to DutchMLSchool. Models, Evaluations, and Ensembles (20)
Digital Transformation and Process Optimization in ManufacturingBigML, Inc
Keyanoush Razavidinani, Digital Services Consultant at A1 Digital, a BigML Partner, highlights why it is important to identify and reduce human bottlenecks that optimize processes and let you focus on important activities. Additionally, Guillem Vidal, Machine Learning Engineer at BigML completes the session by showcasing how Machine Learning is put to use in the manufacturing industry with a use case to detect factory failures.
The Road to Production: Automating your Anomaly Detectors - by jao (Jose A. Ortega), Co-Founder and Chief Technology Officer at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - ML for AML ComplianceBigML, Inc
Machine Learning for Anti Money Laundering Compliance, by Kevin Nagel, Consultant and Data Scientist at INFORM.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Multi Perspective AnomaliesBigML, Inc
Multi Perspective Anomalies, by Jan W Veldsink, Master in the art of AI at Nyenrode, Rabobank, and Grio.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - My First Anomaly Detector BigML, Inc
The document discusses building an anomaly detector model to identify unusual transactions in a dataset. It describes loading transaction data with 31 features into the BigML platform and creating an anomaly detector model. The model scores new data and identifies the most anomalous fields to help detect fraud. Creating the anomaly detector involves interpreting the data, exploring the dataset distribution, and setting a threshold score to define what is considered anomalous.
DutchMLSchool 2022 - History and Developments in MLBigML, Inc
History and Present Developments in Machine Learning, by Tom Dietterich, Emeritus Professor of computer science at Oregon State University and Chief Scientist at BigML.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - A Data-Driven CompanyBigML, Inc
A Data-Driven Company: 21 Lessons for Large Organizations to Create Value from AI, by Richard Benjamins, Chief AI and Data Strategist at Telefónica.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - ML in the Legal SectorBigML, Inc
How Machine Learning Transforms and Automates Legal Services, by Arnoud Engelfriet, Co-Founder at Lynn Legal.
*Machine Learning School in The Netherlands 2022.
This document describes a proposed solution using machine learning and artificial intelligence to help create a safer stadium experience. The solution involves two parts: 1) linking access to stadiums to a verified identity through a fan app for preregistration, and 2) using AI/ML to help detect unwanted behaviors or events early. The rest of the document provides more details on the proposed smart video review framework, including using computer vision and audio analysis techniques to help identify issues like flares, flags, banners, chants including monkey chants. The goal is to help reviewers more efficiently identify potential problems but with privacy, ethics and human oversight.
DutchMLSchool 2022 - Process Optimization in Manufacturing PlantsBigML, Inc
Process Optimization in Manufacturing Plants, by Keyanoush Razavidinani, Digital Business Consultant at A1 Digital.
*Machine Learning School in The Netherlands 2022.
DutchMLSchool 2022 - Citizen Development in AIBigML, Inc
The document discusses the need for citizen developers and humans in the AI/ML process. It notes that while technology and talent are important, company culture must also support broad data analytics and AI/ML adoption. It then provides examples of how involving domain experts can help attribute meaning to correlations and build better causal models to improve AI systems. The document advocates for a systems thinking approach and having humans in the loop to help AI/ML systems consider the wider context and avoid issues like bias.
This new feature is a continuation of and improvement on our previous Image Processing release. Now, Object Detection lets you go a step further with your image data and allows you to locate objects and annotate regions in your images. Once your image regions are defined, you can train and evaluate Object Detection models, make predictions with them, and automate end-to-end Machine Learning workflows on a single platform. To make that possible, BigML enables Object Detection by introducing the regions optype.
As with any other BigML feature, Object Detection is available from the BigML Dashboard, API, and WhizzML for automation. Object Detection is extremely helpful to tackle a wide range of computer vision use cases such as medical image analysis, quality control in manufacturing, license plate recognition in transportation, people detection in security surveillance, among many others.
This new release brings Image Processing to the BigML platform, a feature that enhances our offering to solve image data-driven business problems with remarkable ease of use. Because BigML treats images as any other data type, this unique implementation allows you to easily use image data alongside text, categorical, numeric, date-time, and items data types as input to create any Machine Learning model available in our platform, both supervised and unsupervised.
Now, it is easier than ever to solve a wide variety of computer vision and image classification use cases in a single platform: label your image data, train and evaluate your models, make predictions, and automate your end-to-end Machine Learning workflows. As with any other BigML feature, Image Processing is available from the BigML Dashboard, API, and WhizzML, and it can be applied to solve use cases such as medical image analysis, visual product search, security surveillance, and vehicle damage detection, among others.
Machine Learning in Retail: Know Your Customers' Customer. See Your FutureBigML, Inc
This session presents a quite common situation for those working in food and beverage retail (FnB) and highlights interesting insights to fight waste reduction.
Speaker: Stephen Kinns, CEO and Co-Founder at catsAi.
*ML in Retail 2021: Webinar.
Machine Learning in Retail: ML in the Retail SectorBigML, Inc
This is an introductory session about the role that Machine Learning is playing in the retail sector and how it is being deployed across the different areas of this industry.
Speaker: Atakan Cetinsoy, VP of Predictive Applications at BigML.
*ML in Retail 2021: Webinar.
ML in GRC: Machine Learning in Legal Automation, How to Trust a LawyerbotBigML, Inc
This presentation analyzes the role that Machine Learning plays in legal automation with a real-world Machine Learning application.
Speaker: Arnoud Engelfriet, Co-Founder at Lynn Legal.
*ML in GRC 2021: Virtual Conference.
ML in GRC: Supporting Human Decision Making for Regulatory Adherence with Mac...BigML, Inc
This is a real-life Machine Learning use case about integrated risk.
Speakers: Thomas Rengersen, Product Owner of the Governance Risk and Compliance Tool for Rabobank, and Thomas Alderse Baas, Co-Founder and Director of The Bowmen Group.
*ML in GRC 2021: Virtual Conference.
ML in GRC: Cybersecurity versus Governance, Risk Management, and ComplianceBigML, Inc
Some of these concepts (Cybersecurity, Governance, Risk Management, and Compliance) overlap and sometimes they can be confusing. This session helps us understand why those terms are key for any business to be successful.
Speaker: Jon Shende, Founding Investor at MyVayda.
*ML in GRC 2021: Virtual Conference.
Intelligent Mobility: Machine Learning in the Mobility IndustryBigML, Inc
The document discusses intelligent mobility and how machine learning can help improve transportation systems. It provides examples of how ML can be applied to roads, ports, railways and airports for tasks like license plate recognition, container tracking, flight delay prediction and predictive maintenance. The document also discusses how ML platforms can help companies build scalable predictive applications by standardizing workflows, integrating various data sources and empowering employees and domain experts to develop and use ML models.
Intelligent Mobility: Embedded Machine Learning, Damage Detection in RailBigML, Inc
BigML’s partner, A1 Digital, showcases in this presentation a specific use case on “Damage Detection in Rail with Embedded Machine Learning”.
Speaker: Dieter Myer, Senior Data Scientist and Machine Learning Consultant at A1 Digital.
*Intelligent Mobility 2021: Virtual Conference.
We are pleased to share with you the latest VCOSA statistical report on the cotton and yarn industry for the month of March 2024.
Starting from January 2024, the full weekly and monthly reports will only be available for free to VCOSA members. To access the complete weekly report with figures, charts, and detailed analysis of the cotton fiber market in the past week, interested parties are kindly requested to contact VCOSA to subscribe to the newsletter.
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
06-20-2024-AI Camp Meetup-Unstructured Data and Vector DatabasesTimothy Spann
Tech Talk: Unstructured Data and Vector Databases
Speaker: Tim Spann (Zilliz)
Abstract: In this session, I will discuss the unstructured data and the world of vector databases, we will see how they different from traditional databases. In which cases you need one and in which you probably don’t. I will also go over Similarity Search, where do you get vectors from and an example of a Vector Database Architecture. Wrapping up with an overview of Milvus.
Introduction
Unstructured data, vector databases, traditional databases, similarity search
Vectors
Where, What, How, Why Vectors? We’ll cover a Vector Database Architecture
Introducing Milvus
What drives Milvus' Emergence as the most widely adopted vector database
Hi Unstructured Data Friends!
I hope this video had all the unstructured data processing, AI and Vector Database demo you needed for now. If not, there’s a ton more linked below.
My source code is available here
https://github.com/tspannhw/
Let me know in the comments if you liked what you saw, how I can improve and what should I show next? Thanks, hope to see you soon at a Meetup in Princeton, Philadelphia, New York City or here in the Youtube Matrix.
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/141-10June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
https://www.meetup.com/unstructured-data-meetup-new-york/events/301383476/?slug=unstructured-data-meetup-new-york&eventId=301383476
https://www.aicamp.ai/event/eventdetails/W2024062014
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches for business process simulation based on had-crafted model with restrictive assumptions. It shows how process mining techniques can be assembled together to discover high-fidelity digital twins of end-to-end processes from event data.
06-18-2024-Princeton Meetup-Introduction to MilvusTimothy Spann
06-18-2024-Princeton Meetup-Introduction to Milvus
tim.spann@zilliz.com
https://www.linkedin.com/in/timothyspann/
https://x.com/paasdev
https://github.com/tspannhw
https://github.com/milvus-io/milvus
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/142-17June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications.
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)Rebecca Bilbro
To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science of the 2010's, Rebecca will share lessons learned the hard way, often from watching data science projects go sideways and learning to fix broken things. Through the lens of these canon events, she'll identify some of the anti-patterns and red flags she's learned to steer around.
2. BigML, Inc #DutchMLSchool
Supervised Learning I
Introduction to Machine Learning, Models, Evaluations and Ensembles
Poul Petersen
CIO, BigML, Inc
2
3. BigML, Inc #DutchMLSchool
Machine Learning Motivation
3
• You are looking to buy a house
• Recently found a house you like
• Is the asking price fair?
Imagine:
What Next?
4. BigML, Inc #DutchMLSchool
Maching Learning Motivation
4
Why not ask an expert?
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics
5. BigML, Inc #DutchMLSchool
Data vs Expert
5
Replace the expert with data?
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT SOLD
2424 360000
1785 307500
1003 185000
4135 600000
1676 328500
1012 247000
3352 420000
2825 435350
PRICE = 125.3*SQFT + 96535
PREDICT
400262
320195
222211
614651
306538
223339
516541
450508
6. BigML, Inc #DutchMLSchool
Data vs Expert
6
Replace the expert scorecard
• Experts can be rare / expensive
• Hard to validate experience:
• Experience with similar properties?
• Do they consider all relevant variables?
• Knowledge of market up to date?
• Hard to validate answer:
• How many times expert right / wrong?
• Probably can’t explain decision in detail
• Humans are not good at intuitive statistics
7. BigML, Inc #DutchMLSchool
Data vs Expert
7
Replace the expert with data
• Intuition: square footage relates to price.
• Collect data from past sales
SQFT SOLD
2424 360000
1785 307500
1003 185000
4135 600000
1676 328500
1012 247000
3352 420000
2825 435350
PRICE = 125.3*SQFT + 96535
8. BigML, Inc #DutchMLSchool
More Data!
8
SQFT BEDS BATHS ADDRESS LOCATION
LOT
SIZE
YEAR
BUILT
PARKING
SPOTS
LATITUDE LONGITUDE SOLD
2424 4 3
1522 NW
Jonquil
Timberhill
SE 2nd
5227 1991 2 44,594828 -123,269328 360000
1785 3 2
7360 NW
Valley Vw
Country
Estates
25700 1979 2 44,643876 -123,238189 307500
1003 2 1
2620 NW
Chinaberry
Tamarack
Village
4792 1978 2 44,593704 -123,295424 185000
4135 5 3,5
4748 NW
Veronica
Suncrest 6098 2004 3 44,5929659 -123,306916 600000
1676 3 2
2842 NW
Monterey
Corvallis 8712 1975 2 44,5945279 -123,291523 328500
1012 3 1
2320 NW
Highland
Corvallis 9583 1959 2 44,591476 -123,262841 247000
3352 4 3
1205 NW
Ridgewood
Ridgewood
2
60113 1975 2 44,579439 -123,333888 420000
2825 3 411 NW 16th
Wilkins
Addition
4792 1938 1 44,570883 -123,272113 435350
Uhhhh……..
• Can we still fit a line to 10 variables? (well, yes)
• Will fitting a line give good results? (unlikely)
• What about those text fields and categorical values?
10. BigML, Inc #DutchMLSchool
Mythical ML Model?
10
• High representational power
• Fitting a line is an example of low
• Deep neural networks is an example of high
• High Ease-of-use
• Easy to configure - relatively few parameters
• Easy to interpret - how are decisions made?
• Easy to put into production
• Ability to work with real-world data
• Mixed data types: numeric, categorical, text, etc
• Handle missing values
• Resilient to outliers
• There are actually hundreds of possible choices…
13. BigML, Inc #DutchMLSchool
What Just Happened?
13
• We started with Housing data as a CSV from Redfin
• We uploaded the CSV to create Source
• Then we created a Dataset from the Source and reviewed the
summary statistics
• With 1-click we build a Model which can predict home prices
based on all the housing features
• We explored the Model and used it to make a Prediction
15. BigML, Inc #DutchMLSchool
Why Decision Trees
15
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
17. BigML, Inc #DutchMLSchool
Why Decision Trees
17
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training
19. BigML, Inc #DutchMLSchool
Why Decision Trees
19
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training & Prediction
22. BigML, Inc #DutchMLSchool
Why Decision Trees
22
• Works for classification or regression
• Easy to understand: splits are features and values
• Lightweight and super fast at prediction time
• Relatively parameter free
• Data can be messy
• Useless features are automatically ignored
• Works with un-normalized data
• Works with missing data at Training & Prediction
• Resilient to outliers
• High representational power
• Works easily with mixed data types
23. BigML, Inc #DutchMLSchool
Data Types
23
numeric
1 2 3
1, 2.0, 3, -5.4 categorical
true / false
yes / no
giraffe / zebra / ape
categoricalcategorical
A B C
YEAR
MONTH
DAY-OF-MONTH
YYYY-MM-DD
DAY-OF-WEEK
HOUR
MINUTE
YYYY-MM-DD
YYYY-MM-DD
M-T-W-T-F-S-D
HH:MM:SS
HH:MM:SS
2013
September
25
Wednesday
10
02
DATE-TIME2013-09-25 10:02
DATE-TIME
text
Be not afraid of greatness:
some are born great, some
achieve greatness, and
some have greatness
thrust upon 'em.
text
“great”
“afraid”
“born”
appears 2 times
appears 1 time
appears 1 time
items
bread, sugar, coffee, milk
ice cream, hot fudge
items
25. BigML, Inc #DutchMLSchool
Learning Problems (fit)
25
Under-fitting Over-fitting
• Model fits too well does not “generalize”
• Captures the noise or outliers of the data
• Change algorithm or filter outliers
26. BigML, Inc #DutchMLSchool
Why Not Decision Trees
26
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
29. BigML, Inc #DutchMLSchool
Why Not Decision Trees
29
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
• More data!
• Predictions outside training data can be problematic
31. BigML, Inc #DutchMLSchool
Why Not Decision Trees
31
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
• More data!
• Predictions outside training data can be problematic
• We can catch this with model competence
• Can be sensitive to small changes in training data
33. BigML, Inc #DutchMLSchool
Why Not Decision Trees
33
• Slightly prone to over-fitting
• But we’ll fix this with ensembles
• Splitting prefers decision boundaries that are parallel
to feature axes
• More data!
• Predictions outside training data can be problematic
• We can catch this with model competence
• Can be sensitive to small changes in training data
• What other models can we try?
• And how will we know which one works best?
36. BigML, Inc #DutchMLSchool
Mistakes can be Costly
36
FUN!
+ = DANGER!
Insight: Labeling a Yield as a stop is not as bad as
labelling a stop as a yield… Need better metrics!
37. BigML, Inc #DutchMLSchool
Evaluation Metrics
37
• Imagine we have a model that can predict a person’s dominant
hand, that is for any individual it predicts left / right
• Define the positive class
• This selection is arbitrary
• It is the class you are interested in!
• The negative class is the “other” class (or others)
• For this example, we choose : left
38. BigML, Inc #DutchMLSchool
Evaluation Metrics
38
• We choose the positive class: left
• True Positive (TP)
• We predicted left and the correct answer was left
• True Negative (TN)
• We predicted right and the correct answer was right
• False Positive (FP)
• Predicted left but the correct answer was right
• False Negative (FN)
• Predict right but the correct answer was left
39. BigML, Inc #DutchMLSchool
Evaluation Metrics
39
True Positive: Correctly predicted the positive class
True Negative: Correctly predicted the negative class
False Positive: Incorrectly predicted the positive class
False Negative: Incorrectly predicted the negative class
Remember…
40. BigML, Inc #DutchMLSchool
Accuracy
40
TP + TN
Total
• “Percentage correct” - like an exam
• If Accuracy = 1 then no mistakes
• If Accuracy = 0 then all mistakes
• Intuitive but not always useful
• Watch out for unbalanced classes!
• Ex: 90% of people are right-handed and 10% are left
• A silly model which always predicts right handed is
90% accurate
42. BigML, Inc #DutchMLSchool
Precision
42
TP
TP + FP
• “accuracy” or “purity” of positive class
• How well you did separating the positive class from the
negative class
• If Precision = 1 then no FP.
• You may have missed some left handers, but of the
ones you identified, all are left handed. No mistakes.
• If Precision = 0 then no TP
• None of the left handers you identified are actually left
handed. All mistakes.
44. BigML, Inc #DutchMLSchool
Recall
44
TP
TP + FN
• percentage of positive class correctly identified
• A measure of how well you identified all of the positive
class examples
• If Recall = 1 then no FN → All left handers identified
• There may be FP, so precision could be <1
• If Recall = 0 then no TP → No left handers identified
46. BigML, Inc #DutchMLSchool
f-Measure
46
2 * Recall * Precision
Recall + Precision
• harmonic mean of Recall & Precision
• If f-measure = 1 then Recall == Precision == 1
• If Precision OR Recall is small then the f-measure is small
47. BigML, Inc #DutchMLSchool
Phi Coefficient
47
__________TP*TN_-_FP*FN__________
SQRT[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
• Returns a value between -1 and 1
• If -1 then predictions are opposite reality
• =0 no correlation between predictions and reality
• =1 then predictions are always correct
49. BigML, Inc #DutchMLSchool
What Just Happened?
49
• Starting with the Diabetes Source, we created a Dataset and
then a Model.
• Using both the Model and the original Dataset, we created an
Evaluation.
• We reviewed the metrics provided by the Evaluation:
• Confusion Matrix
• Accuracy, Precision, Recall, f-measure and
phi
• This Model seemed to perform really, really well…
Question: Can we trust this model?
50. BigML, Inc #DutchMLSchool
Evaluation Danger!
50
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
51. BigML, Inc #DutchMLSchool
“Memorizing” Training Data
51
plasma
glucose
bmi
diabetes
pedigree
age diabetes
148 33,6 0,627 50 TRUE
85 26,6 0,351 31 FALSE
183 23,3 0,672 32 TRUE
89 28,1 0,167 21 FALSE
137 43,1 2,288 33 TRUE
116 25,6 0,201 30 FALSE
78 31 0,248 26 TRUE
115 35,3 0,134 29 FALSE
197 30,5 0,158 53 TRUE
Training Evaluating
plasma
glucose
bmi
diabetes
pedigree
age diabetes
148 33,6 0,627 50 ?
85 26,6 0,351 31 ?
• Exactly the same values!
• Who needs a model?
• What we want to know is how the
model performs with values never
seen at training:
124 22 0,107 46 ?
52. BigML, Inc #DutchMLSchool
Evaluation Danger!
52
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
53. BigML, Inc #DutchMLSchool
Train / Test Split
53
plasma
glucose
bmi
diabetes
pedigree
age diabetes
148 33,6 0,627 50 TRUE
183 23,3 0,672 32 TRUE
89 28,1 0,167 21 FALSE
78 31 0,248 26 TRUE
115 35,3 0,134 29 FALSE
197 30,5 0,158 53 TRUE
Train Test
plasma
glucose
bmi
diabetes
pedigree
age diabetes
85 26,6 0,351 31 FALSE
137 43,1 2,288 33 TRUE
116 25,6 0,201 30 FALSE
• These instances were never seen
at training time.
• Better evaluation of how the
model will perform with “new” data
54. BigML, Inc #DutchMLSchool
Evaluation Danger!
54
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
56. BigML, Inc #DutchMLSchool
What Just Happened?
56
• Starting with the Diabetes Dataset we created a train/test split
• We built a Model using the train set and evaluated it with the
test set
• The scores were much worse than before, showing the danger
of evaluating with training data.
• Then we built several other models with different parameters
and used the evaluation comparison tool to see which
performed the best.
Question:
Couldn’t we search for the best Model
or parameters?
STAY
TUNED
57. BigML, Inc #DutchMLSchool
Evaluation
57
• Never evaluate with the training data!
• Many models are able to “memorize” the training data
• This will result in overly optimistic evaluations!
• If you only have one Dataset, use a train/test split
• Even a train/test split may not be enough!
• Might get a “lucky” split
• Solution is to repeat several times (formally to cross validate)
• Don’t forget that accuracy can be mis-leading!
• Mostly useless with unbalanced classes (left/right?)
• Use weighting, operating points, other tricks…
58. BigML, Inc #DutchMLSchool
Operating Points
58
• The default probability threshold is 50%
• Changing the threshold can change the outcome for a
specific class
Rate Payment …
Actual
Outcome
Probability
PAID
Threshold
@ 50%
Threshold
@ 60%
Threshold
@ 90%
8,4 % US$456 … PAID 95 % PAID PAID PAID
9,6 % US$134 … PAID 87 % PAID PAID DEFAULT
18 % US$937 … DEFAULT 36 % DEFAULT DEFAULT DEFAULT
21 % US$35 … PAID 88 % PAID PAID DEFAULT
17,5 %US$1.044 … DEFAULT 55 % PAID DEFAULT DEFAULT
59. BigML, Inc #DutchMLSchool
What about Regressions?
59
• No classes:
• Not possible to count mistakes: TP, FP, TN, FN
• Predicted values are numeric: error is the amount “off”
• actual 200, predict 180 = error 20
• Mean Absolute Error / Mean Squared Error
• Both are a measure of total error
• Note: value of the error is “unbounded”.
• When comparing models, lower values are “better”
• R-Squared Error
• Measure of how much better the model is than always
predicting the mean
• < 0 model is worse then mean
• = 0 model is no better than the mean
• ➞ 1 model fits the data “perfectly”
61. BigML, Inc #DutchMLSchool
What just happened?
61
• We split the RedFin data into training and test Datasets
• We created a Model and Evaluation
• We examined the Evaluation metrics
Wait - What about Time Series?
67. BigML, Inc #DutchMLSchool
what is an Ensemble?
67
• Rather than build a single model…
• Combine the output of several typically “weaker” models into
a powerful ensemble…
• Q1: Why is this necessary?
• Q2: How do we build “weaker” models?
• Q3: How do we “combine” models?
68. BigML, Inc #DutchMLSchool
No Model is Perfect
68
• A given ML algorithm may simply not be able to exactly
model the “real solution” of a particular dataset.
• Try to fit a line to a curve
• Even if the model is very capable, the “real solution” may be
elusive
• DT/NN can model any decision boundary with enough
training data, but the solution is NP-hard
• Practical algorithms involve random processes and may
arrive at different, yet equally good, “solutions” depending
on the starting conditions, local optima, etc.
• If that wasn’t bad enough…
69. BigML, Inc #DutchMLSchool
No Data is Perfect
69
• Not enough data!
• Always working with finite training data
• Therefore, every “model” is an approximation of the “real
solution” and there may be several good approximations.
• Anomalies / Outliers
• The model is trying to generalize from discrete training
data.
• Outliers can “skew” the model, by overfitting
• Mistakes in your data
• Does the model have to do everything for you?
• But really, there is always mistakes in your data
70. BigML, Inc #DutchMLSchool
Ensemble Techniques
70
• Key Idea:
• By combining several good “models”, the combination
may be closer to the best possible “model”
• we want to ensure diversity. It’s not useful to use an
ensemble of 100 models that are all the same
• Training Data Tricks
• Build several models, each with only some of the data
• Introduce randomness directly into the algorithm
• Add training weights to “focus” the additional models on
the mistakes made
• Prediction Tricks
• Model the mistakes
• Model the output of several different algorithms
73. BigML, Inc #DutchMLSchool
Simple Example - Fit a Line
73
Partition the data… then model each partition…
For predictions, use the model for the same partition
?
74. BigML, Inc #DutchMLSchool
Decision Forest
74
MODEL 1
DATASET
SAMPLE 1
SAMPLE 2
SAMPLE 3
SAMPLE 4
MODEL 2
MODEL 3
MODEL 4
PREDICTION 1
PREDICTION 2
PREDICTION 3
PREDICTION 4
PREDICTION
COMBINER
75. BigML, Inc #DutchMLSchool
Random Decision Forest
75
MODEL 1
DATASET
SAMPLE 1
SAMPLE 2
SAMPLE 3
SAMPLE 4
MODEL 2
MODEL 3
MODEL 4
PREDICTION 1
PREDICTION 2
PREDICTION 3
PREDICTION 4
SAMPLE 1
PREDICTION
COMBINER
76. BigML, Inc #DutchMLSchool
Boosting
76
ADDRESS BEDS BATHS SQFT
LOT
SIZE
YEAR
BUILT
LATITUDE LONGITUDE
LAST SALE
PRICE
1522 NW
Jonquil
4 3 2424 5227 1991 44,594828 -123,269328 360000
7360 NW
Valley Vw
3 2 1785 25700 1979 44,643876 -123,238189 307500
4748 NW
Veronica
5 3,5 4135 6098 2004 44,5929659 -123,306916 600000
411 NW 16th 3 2825 4792 1938 44,570883 -123,272113 435350
MODEL 1
PREDICTED
SALE PRICE
360750
306875
587500
435350
ERROR
750
-625
-12500
0
ADDRESS BEDS BATHS SQFT
LOT
SIZE
YEAR
BUILT
LATITUDE LONGITUDE ERROR
1522 NW
Jonquil
4 3 2424 5227 1991 44,594828 -123,269328 750
7360 NW
Valley Vw
3 2 1785 25700 1979 44,643876 -123,238189 625
4748 NW
Veronica
5 3,5 4135 6098 2004 44,5929659 -123,306916 12500
411 NW 16th 3 2825 4792 1938 44,570883 -123,272113 0
MODEL 2
PREDICTED
ERROR
750
625
12393,83333
6879,67857
Why stop at one iteration?
"Hey Model 1, what do you predict is the sale price of this home?"
"Hey Model 2, how much error do you predict Model 1 just made?"
77. BigML, Inc #DutchMLSchool
Boosting
77
DATASET MODEL 1
DATASET 2 MODEL 2
DATASET 3 MODEL 3
DATASET 4 MODEL 4
PREDICTION 1
PREDICTION 2
PREDICTION 3
PREDICTION 4
PREDICTION
SUM
Iteration 1
Iteration 2
Iteration 3
Iteration 4
etc…
79. BigML, Inc #DutchMLSchool
Which Ensemble Method
79
• The one that works best!
• Ok, but seriously. Did you evaluate?
• For "large" / "complex" datasets
• Use DF/RDF with deeper node threshold
• Even better, use Boosting with more iterations
• For "noisy" data
• Boosting may overfit
• RDF preferred
• For "wide" data
• Randomize features (RDF) will be quicker
• For "easy" data
• A single model may be fine
• Bonus: also has the best interpretability!
• For classification with "large" number of classes
• Boosting will be slower
• For "general" data
• DF/RDF likely better than a single model or Boosting.
• Boosting will be slower since the models are processed serially
81. BigML, Inc #DutchMLSchool
Summary
81
• Models have shortcomings: ability to fit, NP-hard, etc
• Data has shortcomings: not enough, outliers, mistakes, etc
• Ensemble Techniques can improve on single models
• Sampling: partitioning, Decision Tree bagging
• Adding Randomness: RDF
• Modeling the Error: Boosting
• Modeling the Models: Stacking
• Guidelines for knowing which one might work best in a given
situation