Enhancing and Automating Decision Making with Machine Learning - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine learning is becoming widely used to automate decision making. While machine learning seems complex, at its core it involves finding patterns in data that can be used to make useful predictions. The document discusses how factors like increased data availability, faster computation, and easier tools have led to the rise of machine learning applications. It also notes common pitfalls in early machine learning adoption, such as overhyping results and failing to develop a clear strategy. Overall, machine learning is transforming industries by enabling cheaper and more data-driven decisions at scale.
DutchMLSchool. ML: A Technical Perspective - BigML, Inc
DutchMLSchool. Machine Learning: A Technical Perspective
Machine Learning: A Technical Perspective - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Supervised vs Unsupervised Learning - BigML, Inc
Supervised versus Unsupervised Learning Techniques - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine Learning: Business Perspective - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Anatomy of an Application: Machine Learning End-to-End - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. ML for Energy Trading and Automotive Sector - BigML, Inc
Machine Learning for Energy Trading, Automotive Sector, and Logistics, presented by BigML's Partners A1 Digital.
Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Introduction to Machine Learning with the BigML Platform - BigML, Inc
Introduction to Machine Learning with the BigML Platform - ML for Executives Course.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Logistic Regression, Deepnets, Time Series - BigML, Inc
DutchMLSchool. Logistic Regression, Deepnets, and Time Series (Supervised Learning II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Models, Evaluations, and Ensembles - BigML, Inc
DutchMLSchool. Introduction to Machine Learning, Models, Evaluations, and Ensembles (Supervised Learning I) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Opening Remarks - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
DutchMLSchool. Associations and Topic Models - BigML, Inc
DutchMLSchool. Association Discovery and Topic Modeling (Unsupervised II) - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
Machine Learning for Logistics: Predicting Expedition Outcome - Main Conference: Introduction to Machine Learning.
DutchMLSchool: 1st edition of the Machine Learning Summer School in The Netherlands.
MLSEV. Models, Evaluations and Ensembles - BigML, Inc
Introduction to Machine Learning. Supervised Learning (Part I): Models, Evaluations and Ensembles, by BigML.
MLSEV 2019: 1st edition of the Machine Learning School in Seville, Spain.
Elena Grewal, Data Science Manager, Airbnb at MLconf SF 2016 - MLconf
Before the Model: How Machine Learning Products Start, with Examples from Airbnb: Often the most important part of building a machine learning product is the formulation of the problem; the most elegant model is rendered useless without the right application and model architecture. Airbnb is an online marketplace for accommodations which has found many interesting applications for machine learning products by taking a data-driven approach to investment in machine learning products. Come hear about how the Airbnb team generates and vets ideas for machine learning products and tailors the product to business problems, with some examples of success and lessons learned along the way.
Square's Machine Learning Infrastructure and Applications - Rong Yan - Hakka Labs
1) Square uses machine learning for fraud detection in payments and to power recommendations on its Square Market platform.
2) Random forests and gradient boosted trees are the primary algorithms used for fraud detection, achieving up to a 10-11% improvement over random forests alone.
3) Square has built scalable machine learning infrastructure including parallel environments, data transport systems, and a learning management system to support rapid model development and evaluation.
Building Custom Machine Learning Algorithms with Apache SystemML - sparktc
This document discusses Apache SystemML, which is a machine learning framework for building custom machine learning algorithms on Apache Spark. It originated from research projects at IBM involving machine learning on Hadoop. SystemML aims to allow data scientists to build ML solutions using languages like R and Python, while executing algorithms on big data platforms like Spark. It provides a high-level language for expressing algorithms and performs automatic parallelization and optimization. The document demonstrates SystemML through a matrix factorization example for a targeted advertising problem. It shows how to use SystemML, Spark and Zeppelin together to build a custom algorithm and optimize part of the machine learning pipeline.
Data Workflows for Machine Learning - Seattle DAML - Paco Nathan
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
The document provides guidance on building an end-to-end machine learning project to predict California housing prices using census data. It discusses getting real data from open data repositories, framing the problem as a supervised regression task, preparing the data through cleaning, feature engineering, and scaling, selecting and training models, and evaluating on a held-out test set. The project emphasizes best practices like setting aside test data, exploring the data for insights, using pipelines for preprocessing, and techniques like grid search, randomized search, and ensembles to fine-tune models.
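The practice of setting aside test data before any modeling can be sketched in a few lines. This is a minimal standard-library illustration, not the project's own code: the rows are made-up (median income, house value) pairs rather than real census data, and `train_test_split` and `rmse` are hypothetical helper names defined here.

```python
import math
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of rows and hold out a test_ratio fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def rmse(y_true, y_pred):
    """Root mean squared error, a common metric for regression tasks."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (median_income, house_value) pairs, not real census data.
rows = [(float(i), 50_000.0 + 40_000.0 * i) for i in range(1, 11)]
train, test = train_test_split(rows)

# Baseline "model": always predict the mean house value seen in training.
mean_value = sum(value for _, value in train) / len(train)
baseline_rmse = rmse([value for _, value in test], [mean_value] * len(test))
```

Any real model should beat `baseline_rmse` on the same held-out rows; the test set is touched only once, after model selection is finished.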
Building a performing Machine Learning model from A to Z - Charles Vestur
A 1-hour read to become highly knowledgeable about Machine learning and the machinery underneath, from scratch!
A presentation that introduces all the fundamental concepts of Machine Learning step by step, following a classical approach to building a well-performing model. Simple examples and illustrations are used throughout the presentation to make the concepts easier to grasp.
Unified Approach to Interpret Machine Learning Model: SHAP + LIME - Databricks
For companies that solve real-world problems and generate revenue from data science products, being able to understand why a model makes a certain prediction can be as crucial as achieving high prediction accuracy in many applications. However, as data scientists pursue higher accuracy by implementing complex algorithms such as ensemble or deep learning models, the algorithm itself becomes a black box, which creates a trade-off between the accuracy and the interpretability of a model’s output.
To address this problem, a unified framework SHAP (SHapley Additive exPlanations) was developed to help users interpret the predictions of complex models. In this session, we will talk about how to apply SHAP to various modeling approaches (GLM, XGBoost, CNN) to explain how each feature contributes and extract intuitive insights from a particular prediction. This talk is intended to introduce the concept of general purpose model explainer, as well as help practitioners understand SHAP and its applications.
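SHAP attributions are grounded in Shapley values, and the underlying idea can be illustrated by brute force on a toy model. The sketch below is an assumption-laden illustration, not the SHAP library's algorithm: it enumerates every feature ordering (feasible only for a handful of features) and treats a "missing" feature as one set to a baseline value. `shapley_values`, `linear_model`, and `product_model` are hypothetical names.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features switch from baseline to x."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [total / len(orderings) for total in phi]


def linear_model(v):
    # Toy model: each feature contributes independently.
    return 2 * v[0] + 3 * v[1] - v[2]


def product_model(v):
    # Toy model with an interaction between the two features.
    return v[0] * v[1]


linear_phi = shapley_values(linear_model, [1, 2, 3], [0, 0, 0])
product_phi = shapley_values(product_model, [2, 3], [0, 0])
```

For the linear model the attributions equal each weight times the feature's offset from the baseline, and in both cases they sum to the difference between the prediction and the baseline prediction.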
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes “best” interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we’ve faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I’ll offer general principles and concrete examples of how to interview data scientists. I’ll also touch on the challenges of sourcing and closing top candidates.
How to Become a Data Scientist
SF Data Science Meetup, June 30, 2014
Video of this talk is available here: https://www.youtube.com/watch?v=c52IOlnPw08
More information at: http://www.zipfianacademy.com
Zipfian Academy @ Crowdflower
This document discusses max-diff (maximum difference) analysis, which is a method for collecting preference data. It covers when to use max-diff, experimental design considerations, problems with simple "counting" analysis, using latent class analysis instead, and computing preference shares from max-diff data. Latent class analysis addresses issues with counting analysis by accounting for experimental design, inconsistencies in preferences, and differences between individuals.
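For reference, the simple "counting" analysis that the document criticizes can be sketched as follows. This assumes one common form of the counting score, (times chosen best minus times chosen worst) divided by times shown; the task data and the `count_scores` helper are hypothetical.

```python
from collections import Counter


def count_scores(tasks):
    """tasks: iterable of (shown_items, best_item, worst_item) tuples.
    Returns (best count - worst count) / times shown for each item."""
    shown, best, worst = Counter(), Counter(), Counter()
    for items, best_item, worst_item in tasks:
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical responses from one respondent over three max-diff tasks.
tasks = [
    (("price", "battery", "camera", "screen"), "price", "screen"),
    (("price", "battery", "weight", "camera"), "battery", "weight"),
    (("screen", "weight", "price", "camera"), "price", "weight"),
]
scores = count_scores(tasks)
```

Scores range from -1 (always worst) to 1 (always best). Because the score pools all choices into a single number per item, it cannot account for the experimental design, inconsistent choices, or differences between individuals, which is what latent class analysis is meant to address.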
OSCON 2014: Data Workflows for Machine Learning - Paco Nathan
This document provides examples of different frameworks that can be used for machine learning data workflows, including KNIME, Python, Julia, Summingbird, Scalding, and Cascalog. It describes features of each framework such as KNIME's large number of integrations and visual workflow editing, Python's broad ecosystem, Julia's performance and parallelism support, Summingbird's ability to switch between Storm and Scalding backends, and Scalding's implementation of the Scala collections API over Cascading for compact workflow code. The document aims to familiarize readers with options for building machine learning data workflows.
Feature engineering is the process of using domain knowledge to create new features that allow machine learning algorithms to work better or work at all. It involves applying transformations and encoding schemes to raw data to construct informative features for modeling. Feature engineering is important because ML algorithms only learn from the data and features provided, so carefully engineered features are crucial. Effective feature engineering requires domain expertise, experimentation, and evaluation to identify representations of the data that best support predictive tasks.
Feature engineering is the process of using domain knowledge to create new features that allow machine learning algorithms to work better or work at all. It involves applying transformations to existing features, like splitting date-time fields or normalizing numeric values, as well as computing new features from existing ones. Flatline is a domain-specific language for programmatic feature engineering and filtering that allows creating new features using expressions over existing fields. Care must be taken to avoid leakage when creating new features.
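The two transformations named above, splitting a date-time field and normalizing numeric values, can be sketched in plain Python. This is not Flatline; the helper names, the timestamp format, and the sample values are assumptions made for illustration.

```python
from datetime import datetime
from statistics import mean, pstdev


def split_datetime(value):
    """Turn one timestamp string into several simpler numeric features."""
    dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return {
        "year": dt.year,
        "month": dt.month,
        "day_of_week": dt.weekday(),  # Monday is 0
        "hour": dt.hour,
    }


def z_scores(values):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


features = split_datetime("2019-07-08 14:30:00")
normalized = z_scores([10.0, 20.0, 30.0])
```

To avoid leakage, the mean and standard deviation would normally be computed on the training rows only and then reused unchanged for test or production rows.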
The document discusses feature engineering for machine learning models. It provides examples of how to create new features from existing data fields using a domain-specific language called Flatline. Feature engineering techniques discussed include discretization, normalization, and adding new fields through calculations on other fields. The document emphasizes that feature engineering is important for helping machine learning algorithms work better or work at all, and that features should be carefully evaluated to avoid data leakage. Automating feature engineering is presented as an important part of the overall process.
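Discretization, one of the techniques listed, maps a continuous value to a small set of labels. A minimal sketch in plain Python rather than Flatline, with hypothetical cut points and labels:

```python
def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value;
    values at or above the last edge fall into the final label."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


AGE_EDGES = [18, 65]                        # hypothetical cut points
AGE_LABELS = ["minor", "adult", "senior"]   # one more label than edges

ages = [5, 18, 40, 70]
age_groups = [discretize(age, AGE_EDGES, AGE_LABELS) for age in ages]
```

Cut points chosen by looking at the target variable can leak information, so they are best fixed from domain knowledge or from training data alone.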
This document summarizes a presentation on feature engineering for machine learning. It discusses how feature engineering is important for allowing machine learning algorithms to work better or at all by creating new features that provide better representations of the data. Various techniques for feature engineering are presented, including transforming date/time fields, handling categorical variables, text analysis, and discretizing continuous variables. The use of feature engineering tools like Flatline for programmatically creating new features is also demonstrated. Feature selection techniques are briefly discussed to help identify the most important and non-leaky features.
BigML Education - Feature Engineering with Flatline - BigML, Inc
Flatline is a domain-specific language for feature engineering and programmatic filtering of datasets. It allows users to transform datasets by adding new fields through custom expressions, extracting features from text and date fields, normalizing numeric values, and filtering rows programmatically. Flatline expressions are written in a Lisp-like syntax and can perform tasks like computing new features from existing ones, extracting structure from text, and labeling datasets based on conditional logic. Feature engineering with Flatline helps machine learning algorithms work better by providing more informative representations of the data.
The document discusses preparing data for machine learning models. It describes how real-world data is often messy and unstructured, requiring transformations like cleaning, labeling, aggregation, and structuring to make it suitable for ML tasks. The document provides examples of common data transformations including denormalizing, adding labels, handling missing values, and structuring output in CSV format. It emphasizes that the goal of transformations is to end up with tabular data where each row is an observation and each column is a feature.
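Denormalizing and aggregating toward "one row per observation, one column per feature" can be sketched as below. The customer and order records, the field names, and the helper functions are all hypothetical; a real pipeline would more likely use a database join or a dataframe library.

```python
from collections import defaultdict

# Two normalized tables: one row per customer, many rows per customer in orders.
customers = [
    {"customer_id": 1, "country": "NL"},
    {"customer_id": 2, "country": "ES"},
    {"customer_id": 3, "country": "DE"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 5.0},
    {"order_id": 12, "customer_id": 2, "amount": 7.5},
]


def aggregate_orders(order_rows):
    """Collapse many order rows into one summary per customer."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for order in order_rows:
        summary = totals[order["customer_id"]]
        summary["order_count"] += 1
        summary["total_amount"] += order["amount"]
    return totals


def denormalize(customer_rows, order_rows):
    """Join the per-customer summaries onto the customer table."""
    totals = aggregate_orders(order_rows)
    empty = {"order_count": 0, "total_amount": 0.0}
    return [{**c, **totals.get(c["customer_id"], empty)} for c in customer_rows]


table = denormalize(customers, orders)
```

Each entry in `table` is now a single observation with fixed columns, and customers without orders get explicit zeros instead of missing values.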
Introduction to End-to-End Machine Learning: Classification and Regression - Mercè Martín, VP of Bindings and Applications at BigML.
Machine Learning School in The Netherlands 2022.
The document discusses preparing data for machine learning by transforming raw data into machine learning-ready data. It outlines a holistic approach that involves defining goals, understanding required data structures, assessing available data, and performing transformations like cleaning, denormalizing, aggregating, pivoting, and feature engineering. The transformations are aimed at structuring the data into a format that machine learning algorithms can consume to build models. Automating the transformations and evaluating results is also emphasized.
VSSML17 L5. Basic Data Transformations and Feature Engineering - BigML, Inc
Valencian Summer School in Machine Learning 2017 - Day 2
Lecture 5: Basic Data Transformations and Feature Engineering. By Poul Petersen (BigML).
https://bigml.com/events/valencian-summer-school-in-machine-learning-2017
Examples of use cases you can draw inspiration from, and ML-as-a-Service platforms that make the human learning of machine learning, experimentation, and deployment to production easier!
This document discusses predictive apps for startups using machine learning. It provides examples of everyday use cases like real estate price prediction and email spam detection. It explains that machine learning works by training a model on data and then using the model to make predictions on new data. The document also discusses how to make machine learning more accessible through cloud platforms and APIs, and how automation tools can help simplify machine learning tasks like model tuning and algorithm selection.
This document is a pandas application by a student named Soham Chakraborty. It introduces pandas and its use for data analysis. The document discusses Python libraries like pandas, problem statements around missing data, and solutions for handling missing values. Code is provided to read a dataset, describe the data, identify missing values, fill them using the median, and output summaries. The dataset comes from Kaggle and contains app information. The document concludes that pandas makes filling missing values simple.
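The median-fill step can be sketched without any third-party dependency. This is a standard-library stand-in, not the student's code: the ratings column is made up, and `fill_missing_with_median` is a hypothetical helper. In pandas the usual counterpart is `fillna` combined with `median` on the column.

```python
from statistics import median


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical app ratings with two missing entries.
ratings = [4.1, None, 3.9, 4.5, None]
missing_count = sum(v is None for v in ratings)
filled = fill_missing_with_median(ratings)
```

The median is often preferred over the mean here because a few extreme values do not shift it.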
Tensors Are All You Need: Faster Inference with Hummingbird - Databricks
The ever-increasing interest around deep learning and neural networks has led to a vast increase in processing frameworks like TensorFlow and PyTorch. These libraries are built around the idea of a computational graph that models the dataflow of individual units. Because tensors are their basic computational unit, these frameworks can run efficiently on hardware accelerators (e.g. GPUs). Traditional machine learning (ML) such as linear regressions and decision trees in scikit-learn cannot currently be run on GPUs, missing out on the potential accelerations that deep learning and neural networks enjoy.
In this talk, we’ll show how you can use Hummingbird to achieve a 1000x speedup in inference on GPUs by converting your traditional ML models to tensor-based models (PyTorch and TVM). https://github.com/microsoft/hummingbird
This talk is for intermediate audiences that use traditional machine learning and want to speed up the time it takes to perform inference with these models. After watching the talk, the audience should be able to use ~5 lines of code to convert their traditional models to tensor-based models to be able to try them out on GPUs.
Outline:
Introduction of what ML inference is (and why it’s different than training)
Motivation: Tensor-based DNN frameworks allow inference on GPU, but “traditional” ML frameworks do not
Why “traditional” ML methods are important
Introduction of what Hummingbird does and main benefits
Deep dive on how traditional ML models are built
Brief intro on how Hummingbird converter works
Example of how Hummingbird can convert a tree model into a tensor-based model
Other models
Demo
Status
Q&A
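The idea behind converting a tree model into tensor operations can be illustrated with a tiny hand-built tree evaluated purely through matrix products and comparisons. This is a sketch in the spirit of a matrix-multiplication formulation of tree inference, not Hummingbird's own code or API: the tree, the matrices A to E, and all function names are assumptions made for the example, and plain lists stand in for tensors.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hand-built tree (hypothetical):
#   node0: x[0] < 5 ? node1 : leaf2
#   node1: x[1] < 2 ? leaf0 : leaf1
A = [[1, 0], [0, 1]]           # features x internal nodes: which feature each node tests
B = [5, 2]                     # threshold of each internal node
C = [[1, 1, -1], [1, -1, 0]]   # nodes x leaves: +1 leaf in left subtree, -1 right, 0 unrelated
D = [2, 1, 0]                  # per leaf: number of ancestors reached by going left
E = [[10.0], [20.0], [30.0]]   # value stored at each leaf


def predict(X):
    """Evaluate the tree for a batch of rows using only matrix ops and comparisons."""
    tested = [[1 if v < b else 0 for v, b in zip(row, B)] for row in matmul(X, A)]
    reached = [[1 if v == d else 0 for v, d in zip(row, D)] for row in matmul(tested, C)]
    return [row[0] for row in matmul(reached, E)]


def predict_by_traversal(x):
    """Ordinary if/else traversal of the same tree, for comparison."""
    if x[0] < 5:
        return 10.0 if x[1] < 2 else 20.0
    return 30.0


samples = [[3, 1], [3, 4], [7, 1], [7, 4]]
tensor_predictions = predict(samples)
```

Every row is processed by the same fixed sequence of dense operations with no data-dependent branching, which is the property that lets tensor runtimes batch the work on accelerators.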
This document provides an overview of data transformations needed to prepare data for machine learning applications. It discusses how data needs to be structured as instances with features for different machine learning tasks like classification, regression, clustering etc. It also covers common obstacles in data like missing values, incorrect formats and discusses techniques to address them like data cleaning, feature engineering and feature selection. Specific techniques discussed include joining multiple normalized datasets, handling missing values, aggregating features through counts and dealing with data from different sources and formats.
This document discusses data transformations for machine learning. It begins by noting that perfectly formatted data is ideal but rarely exists in reality. Common obstacles to machine learning-ready data are discussed, including data structure, missing values, and unwanted features. The process of transforming data involves understanding the goal, identifying relevant machine learning tasks, accessing and structuring the data, and performing feature engineering. Common transformations include data cleaning, labeling, denormalizing, aggregating, pivoting, and handling time windows. An example of transforming loan data from Prosper is provided to demonstrate handling streaming XML data updates.
2. BigML, Inc #DutchMLSchool 2
Feature Engineering
Creating Features that Make Machine Learning Work
Poul Petersen
CIO, BigML, Inc
3. BigML, Inc #DutchMLSchool
Gaming the ML Performance
A Tale of Two Strategies…
• Use ML to improve performance automatically
  • OptiML
  • Unsupervised Feature Engineering (PCA, Topic Models, Clustering, Anomaly Detection, etc.)
  • Automated feature selection
• Use domain knowledge to improve performance manually
  • Bespoke features (requires expertise)
  • Fusions of models
  • Manual feature selection
4. BigML, Inc #DutchMLSchool
What is Feature Engineering?
Feature Engineering: applying domain knowledge of the data to create new features that allow ML algorithms to work better, or to work at all.
• This is really, really important - more than algorithm selection!
  • In fact, so important that BigML often does it automatically
• ML algorithms have no deeper understanding of data
  • Numerical: have a natural order, can be scaled, etc.
  • Categorical: have discrete values, etc.
  • The "magic" is the ability to find patterns quickly and efficiently
• ML algorithms only know what you tell/show them with data
  • Medical: weight (kg) and height (m), but BMI = kg/m² is better
  • Lending: Debt and Income, but DTI (debt-to-income ratio) is better
  • Intuition can be risky: remember to prove it with an evaluation!
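The BMI example can be sketched in a few lines of Python. This is only an illustration of deriving one feature from two raw ones; the function name, the field names and the patient values are hypothetical and not taken from the slides.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight / height^2, in kg/m^2."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient records (illustrative values only).
patients = [
    {"weight_kg": 70.0, "height_m": 1.75},
    {"weight_kg": 95.0, "height_m": 1.80},
]

# Add the derived feature next to the raw ones.
for patient in patients:
    patient["bmi"] = round(
        body_mass_index(patient["weight_kg"], patient["height_m"]), 1
    )
```

Whether the derived feature actually helps still has to be checked with an evaluation, as the slide warns.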
5. BigML, Inc #DutchMLSchool
Built-in Transformations
Date-Time Fields
DATE-TIME field: 2013-09-25 10:02
…  year  month  day  hour  minute  …
…  2013  Sep    25   10    2       …
…  …     …      …    …     …       …
(type labels on the generated fields: NUM and CAT)
• Date-Time fields have a lot of information "packed" into them
• Splitting out the time components allows ML algorithms to discover time-based patterns.
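A minimal Python sketch of the same idea, splitting one timestamp into separate components. It assumes the `YYYY-MM-DD HH:MM` layout shown on the slide; the function name is hypothetical and this is not how BigML implements the transformation.

```python
from datetime import datetime

# Fixed English abbreviations, so the result does not depend on the locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def split_datetime(value: str) -> dict:
    """Split a 'YYYY-MM-DD HH:MM' string into separate features."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return {
        "year": parsed.year,
        "month": MONTHS[parsed.month - 1],  # categorical
        "day": parsed.day,
        "hour": parsed.hour,
        "minute": parsed.minute,
    }


features = split_datetime("2013-09-25 10:02")
```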
6. BigML, Inc #DutchMLSchool
Built-in Transformations
Categorical Fields for Clustering/LR
Original field (CAT):
…  alchemy_category  …
…  business          …
…  recreation        …
…  health            …
…  …                 …
Encoded fields (NUM):
…  business  health  recreation  …
…  1         0       0           …
…  0         0       1           …
…  0         1       0           …
…  …         …       …           …
• Clustering and Logistic Regression require numeric fields for inputs
• Categorical values are transformed to numeric vectors automatically*
• *Note: In BigML, clustering uses k-prototypes and the encoding used for LR can be configured.
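The encoding in the table above is a one-hot encoding. A small Python sketch, assuming alphabetical ordering of the categories (an assumption that happens to match the column order on the slide); the function name is hypothetical.

```python
def one_hot(values):
    """Return the sorted categories and one 0/1 vector per input value."""
    categories = sorted(set(values))
    vectors = [[1 if value == category else 0 for category in categories]
               for value in values]
    return categories, vectors


categories, rows = one_hot(["business", "recreation", "health"])
```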
7. BigML, Inc #DutchMLSchool
Built-in Transformations
Text Fields
TEXT field: "Be not afraid of greatness: some are born great, some achieve greatness, and some have greatness thrust upon ’em."
…  great  afraid  born  achieve  …
…  4      1       1     1        …
…  …      …       …     …        …
(all generated fields are NUM)
• Unstructured text contains a lot of potentially interesting patterns
• Bag-of-words analysis happens automatically and extracts the "interesting" tokens in the text
• Another option is Topic Modeling to extract thematic meaning
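A rough Python sketch of a bag-of-words count for the quote above. The count of 4 for "great" on the slide implies some form of stemming, so a deliberately naive stemmer is assumed here (it only strips a trailing "ness"); the real tokenization and stemming used by the platform are not shown in the slides.

```python
import re
from collections import Counter


def naive_stem(token: str) -> str:
    """Toy stemmer (assumption): strip a trailing 'ness' from longer words."""
    if token.endswith("ness") and len(token) > 6:
        return token[:-4]
    return token


def bag_of_words(text: str) -> Counter:
    """Lowercase, split into alphabetic tokens, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(token) for token in tokens)


quote = ("Be not afraid of greatness: some are born great, some achieve "
         "greatness, and some have greatness thrust upon 'em.")
counts = bag_of_words(quote)
```

Each distinct token then becomes a numeric field holding its count for that row.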
8. BigML, Inc #DutchMLSchool
Help ML to Work Better
When text is not actually unstructured
TEXT field:
{
"url":"cbsnews",
"title":"Breaking News Headlines Business Entertainment World News ",
"body":" news covering all the latest breaking national and world news headlines, including politics, sports, entertainment, business and more."
}
Extracted fields:
title (TEXT)      body (TEXT)
Breaking News…    news covering…
…                 …
• In this case, the text field has structure (key/value pairs)
• Extracting the structure as new features may allow the ML algorithm to work better
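Outside the platform, the same extraction can be sketched with Python's standard `json` module. The function name and the choice to return empty strings for unparseable rows are assumptions made for this illustration.

```python
import json

raw = ('{"url": "cbsnews", '
       '"title": "Breaking News Headlines Business Entertainment World News ", '
       '"body": " news covering all the latest breaking national and world '
       'news headlines, including politics, sports, entertainment, business '
       'and more."}')


def extract_fields(text: str, keys=("title", "body")) -> dict:
    """Parse a JSON text field and return the selected keys as new features."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        record = {}
    if not isinstance(record, dict):
        record = {}
    # Missing keys become empty strings so every row has the same fields.
    return {key: str(record.get(key, "")).strip() for key in keys}


fields = extract_fields(raw)
```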
10. BigML, Inc #DutchMLSchool
Help ML to Work at All
When the pattern does not exist
Highway Number   Direction     Is Long
2                East-West     FALSE
4                East-West     FALSE
5                North-South   TRUE
8                East-West     FALSE
10               East-West     TRUE
…                …             …
Goal: Predict principal direction from highway number
( = (mod (field "Highway Number") 2) 0)
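The Flatline expression above adds an "is even" feature. A Python equivalent, checked against the rows shown in the table (the function name is hypothetical):

```python
def is_even(highway_number: int) -> bool:
    """Same test as (= (mod (field "Highway Number") 2) 0)."""
    return highway_number % 2 == 0


# Rows from the table above: (highway number, direction).
highways = [
    (2, "East-West"),
    (4, "East-West"),
    (5, "North-South"),
    (8, "East-West"),
    (10, "East-West"),
]

# With the new feature, direction follows directly from parity in these rows.
parity_matches = all(
    is_even(number) == (direction == "East-West")
    for number, direction in highways
)
```

The raw highway number carries no ordering that a model can exploit, but the parity feature exposes the pattern directly.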
12. BigML, Inc #DutchMLSchool
Feature Engineering
Discretization
Total Spend (NUM)   Spend Category (CAT)
7.342,99            Top 33%
304,12              Bottom 33%
4,56                Bottom 33%
345,87              Middle 33%
8.546,32            Top 33%
With the NUM field: “Predict will spend $3,521 with error $1,232”
With the CAT field: “Predict customer will be Top 33% in spending”
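A small Python sketch of discretizing a numeric field into three bands. The two cut points are hypothetical values chosen only so that the five rows above reproduce the categories on the slide; in practice they would come from the 33rd and 67th percentiles of the full dataset, which the slides do not give.

```python
def discretize(value: float, low_cut: float, high_cut: float) -> str:
    """Map a numeric value to one of three spend categories."""
    if value < low_cut:
        return "Bottom 33%"
    if value < high_cut:
        return "Middle 33%"
    return "Top 33%"


# Hypothetical tercile boundaries (assumption, not from the slides).
LOW_CUT, HIGH_CUT = 320.0, 5000.0

total_spend = [7342.99, 304.12, 4.56, 345.87, 8546.32]
spend_category = [discretize(v, LOW_CUT, HIGH_CUT) for v in total_spend]
```

The categorical objective turns a regression problem into a classification problem, which can be easier to act on.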
14. BigML, Inc #DutchMLSchool
Built-ins for FE
• Discretize: Converts a numeric value to categorical
• Replace missing values: fixed/max/mean/median/etc.
• Normalize: Adjust a numeric value to a specific range of values while preserving the distribution
• Math: Exponentiation, Logarithms, Squares, Roots, etc.
• Types: Force a field value to categorical, integer, or real
• Random: Create random values for introducing noise
• Statistics: Mean, Population
• Refresh Fields:
  • Types: recomputes field types. Ex: #classes > 1000
  • Preferred: recomputes preferred status
15. BigML, Inc #DutchMLSchool
Flatline Add Fields
Computing with Existing Features
Debt (NUM)   Income (NUM)   Debt to Income Ratio (NUM)
10.134       100.000        0,10
85.234       134.000        0,64
8.112        21.500         0,38
0            45.900         0
17.534       52.000         0,34
(/ (field "Debt") (field "Income"))
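For comparison, the same new field computed in Python from the rows in the table above (the slide writes numbers with European separators, so 10.134 is ten thousand one hundred thirty-four). The function name and the choice to return `None` for a zero income are assumptions.

```python
def debt_to_income(debt: float, income: float):
    """Python analogue of (/ (field "Debt") (field "Income"))."""
    if income == 0:
        return None  # assumption: treat the ratio as missing
    return debt / income


# (Debt, Income) rows from the table above.
rows = [
    (10134, 100000),
    (85234, 134000),
    (8112, 21500),
    (0, 45900),
    (17534, 52000),
]

ratios = [round(debt_to_income(debt, income), 2) for debt, income in rows]
```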
17. BigML, Inc #DutchMLSchool
What is Flatline?
Flatline: a domain specific language for feature engineering and programmatic filtering
• DSL:
  • Invented by BigML - Programmatic / Optimized for speed
  • Transforms datasets into new datasets
  • Adding new fields / Filtering
  • Transformations are written in lisp-style syntax
• Feature Engineering
  • Computing new fields: (/ (field "Debt") (field "Income"))
• Programmatic Filtering:
  • Filtering datasets according to functions that evaluate to true/false using the row of data as an input.
18. BigML, Inc #DutchMLSchool
Flatline
• Lisp style syntax: Operators come first
  • Correct: (+ 1 2), not correct: (1 + 2)
• Dataset Fields are first-class citizens
  • (field "diabetes pedigree")
• Limited programming language structures
  • let, cond, if, map, list operators, */+-, etc.
• Built-in transformations
  • statistics, strings, timestamps, windows
19. BigML, Inc #DutchMLSchool
Flatline s-expressions
Adding Simple Labels to Data
Define "default" as missing three payments in a row
Un-Labelled Data
Name         Month - 3   Month - 2   Month - 1
Joe Schmo    123,23      0           0
Jane Plain   0           0           0
Mary Happy   0           55,22       243,33
Tom Thumb    12,34       8,34        14,56
(= 0 (+ (abs (f "Month - 3")) (abs (f "Month - 2")) (abs (f "Month - 1"))))
Labelled data
Name         Month - 3   Month - 2   Month - 1   Default
Joe Schmo    123,23      0           0           FALSE
Jane Plain   0           0           0           TRUE
Mary Happy   0           55,22       243,33      FALSE
Tom Thumb    12,34       8,34        14,56       FALSE
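The labelling rule can be mirrored in Python as a quick check of the table above: a customer is in default when the absolute payments of the last three months sum to zero. The function and variable names are hypothetical.

```python
def is_default(payments) -> bool:
    """True when all payments are zero, i.e. the sum of absolute values is 0."""
    return sum(abs(payment) for payment in payments) == 0


# Payments for Month - 3, Month - 2, Month - 1, taken from the table above.
customers = {
    "Joe Schmo": (123.23, 0, 0),
    "Jane Plain": (0, 0, 0),
    "Mary Happy": (0, 55.22, 243.33),
    "Tom Thumb": (12.34, 8.34, 14.56),
}

labels = {name: is_default(payments) for name, payments in customers.items()}
```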
31. BigML, Inc #DutchMLSchool
Advanced s-expressions
JSON Parser???
• Remember, Flatline is not a full programming language
  • No loops
  • No accumulated values
  • Code executes on one row at a time and has a limited view into other rows
https://gist.github.com/petersen-poul/504c62ceaace76227cc6d8e0c5f1704b
32. BigML, Inc #DutchMLSchool
Feature Engineering
Fix Missing Values in a “Meaningful” Way
[Workflow diagram. Datasets: Original Dataset, Clean Dataset, Amended Dataset, Fixed Dataset. Steps: Filter Zeros, Model insulin, Predict insulin, Select insulin.]
( if ( = (field "insulin") 0) (field "predicted insulin") (field "insulin"))
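The final selection step can be sketched in Python as follows. It assumes, as the Flatline expression does, that a zero marks a missing insulin reading and that a "predicted insulin" value has already been produced by a model trained on the rows with real readings; the sample rows are hypothetical.

```python
def select_insulin(row: dict) -> float:
    """Use the model's prediction only where the original value is zero."""
    if row["insulin"] == 0:
        return row["predicted insulin"]
    return row["insulin"]


# Hypothetical rows after batch prediction (illustrative values only).
amended = [
    {"insulin": 94.0, "predicted insulin": 101.5},
    {"insulin": 0.0, "predicted insulin": 130.2},
]

fixed = [select_insulin(row) for row in amended]
```

Original readings are kept untouched; only the zeros are replaced.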
35. BigML, Inc #DutchMLSchool
Feature Selection
• Model Summary
  • Field Importance
• Algorithmic
  • Best-First Feature Selection
  • Boruta
• Leakage
  • Tight Correlations (AD, Plot, Correlations)
  • Test Data
  • Perfect future knowledge
Care must be taken when creating features!
36. BigML, Inc #DutchMLSchool
Feature Selection
Leakage
• sales pipeline where step n-1 has no other outcome than step n
• stock close predicts stock open
• churn retention: the worst rep is actually the best (correlation != causation)
• cancer prediction where one input is a doctor-ordered test for the condition
• account ID predicts fraud (because only new accounts are fraudsters)
37. BigML, Inc #DutchMLSchool
Summary
• Feature Engineering: what is it / why it is important
• Automatic transformations: date-time, text, etc
• Built-in functions: filtering and feature engineering
• Discretization / Normalization / etc.
• Flatline: programmatic feature engineering / filtering
• Structure
• Examples: Adding fields / filtering
• When building features it is important to watch for leakage
39. BigML, Inc #DutchMLSchool
[Figure: choosing a model type by project stage. Axis labels: "Decreasing Interpretability / Better Representation / Longer Training" and "Increasing Data Size / Complexity". Stages: Early Stage (Rapid Prototyping), Mid Stage (Proven Application), Late Stage (Critical Performance). Models shown: Single Tree Model, Logistic Regression, Decision Forest, Random Decision Forest, Boosted Trees, Deepnets. One region is labelled "TOO HARD".]
40. BigML, Inc #DutchMLSchool
BigML Deepnets
Remember this?
• The success of a Deepnet is dependent on getting the right network structure for the dataset
• But, there are too many parameters:
  • Nodes, layers, activation function, learning rate, etc…
  • And setting them takes significant expert knowledge
• Solution:
  • Metalearning (a good initial guess)
  • Network search (try a bunch)
41. BigML, Inc #DutchMLSchool
OptiML
• Each resource has several parameters that impact quality
  • Number of trees, missing splits, nodes, weight
• Rather than trial and error, we can use ML to find ideal parameters
• Why not make the model type (Decision Tree, Boosted Tree, etc.) a parameter as well?
• Similar to Deepnet network search, but finds the optimum machine learning algorithm and parameters for your data automatically
Key Insight: We can solve any parameter selection problem in a similar way.
42. BigML, Inc #DutchMLSchool
The Challenge…
• We will start with a dataset from StumbleUpon
• Train/Test split with seed “bigml”
• Build and Evaluate:
• 1-click Model, LR, Ensemble, Deepnet
• Top model from OptiML output
• Compare the results using the phi coefficient
• Explore other ideas for improving performance further
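For reference, the phi coefficient of a binary classifier can be computed from its confusion matrix. The sketch below uses the standard formula; the function name and the example counts are hypothetical, and returning 0.0 when a row or column of the matrix is empty is a common convention assumed here.

```python
import math


def phi_coefficient(tp: int, fp: int, fn: int, tn: int) -> float:
    """Phi (Matthews correlation) from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when phi is undefined
    return (tp * tn - fp * fn) / denominator


# Hypothetical evaluation counts (illustrative only).
example_phi = phi_coefficient(tp=40, fp=10, fn=20, tn=30)
```

Phi ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it a convenient single number for comparing the models in this challenge.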
44. BigML, Inc #DutchMLSchool
Results…
All scores are phi, evaluated against a holdout
• 1-Click Decision Tree: 0.36
• 1-Click LR: 0.47
• 1-Click Ensemble: 0.58
• Best OptiML Model (LR): 0.66
• 1-Click Deepnet: 0.67
What else can we try?
45. BigML, Inc #DutchMLSchool
Fusions Inside
• Fuse any set of models into a new “fusion”
  • Must have the same objective type
  • Inputs and feature space can differ
• Weights can be added
  • Give more importance to individual models
• Fusions can be fused as well
• Especially useful for fusing OptiML models
Key Insight: ML algorithms each have unique strengths and weaknesses
48. BigML, Inc #DutchMLSchool
Results…
All scores are phi, evaluated against a holdout
• 1-Click Decision Tree: 0.36
• 1-Click LR: 0.47
• 1-Click Ensemble: 0.58
• Best OptiML Model (LR): 0.66
• 1-Click Deepnet: 0.67
• Fusion of top Model Types: 0.68
49. BigML, Inc #DutchMLSchool
Fusions: Under the Hood
Classification
Model      Prediction   Probability   Weight
Ensemble   TRUE         56%           1
Deepnet    FALSE        67%           1
Model      TRUE         78%           2
Fusion     TRUE         61%
P(TRUE) = [56 + (100 - 67) + 2*78] / 4
Regression
Model      Prediction   Error   Weight
Ensemble   156,78       12,56   1
Deepnet    139,55       9,88    1
Model      172,10       23,76   2
Fusion     160,13       17,49
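Both tables above are weighted averages, which a few lines of Python can reproduce. This sketch covers only the binary case shown on the slide (a FALSE vote with probability p counts as 100 - p towards TRUE); the function names are hypothetical and this is not BigML's implementation.

```python
def fuse_probability(votes, positive="TRUE") -> float:
    """Weighted average of P(positive), in percent, over (label, prob, weight)."""
    total_weight = sum(weight for _, _, weight in votes)
    weighted = sum(
        weight * (prob if label == positive else 100.0 - prob)
        for label, prob, weight in votes
    )
    return weighted / total_weight


def fuse_values(values) -> float:
    """Weighted average over (value, weight) pairs."""
    total_weight = sum(weight for _, weight in values)
    return sum(value * weight for value, weight in values) / total_weight


# Classification rows: Ensemble, Deepnet, Model.
p_true = fuse_probability([("TRUE", 56.0, 1), ("FALSE", 67.0, 1), ("TRUE", 78.0, 2)])

# Regression rows: predictions and errors with the same weights.
prediction = fuse_values([(156.78, 1), (139.55, 1), (172.10, 2)])
error = fuse_values([(12.56, 1), (9.88, 1), (23.76, 2)])
```

The results (61.25, about 160.13 and about 17.49) agree with the Fusion rows in the tables.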
50. BigML, Inc #DutchMLSchool
Fusions: Like any BigML Model
• Fully accessible through the API and WhizzML
• Bindings have support for local predictions
51. BigML, Inc #DutchMLSchool
Decision Boundary Smoothness
Single Tree:
• Outcome changes abruptly near decision boundary
• And not at all parallel to the boundary
• This can be “surprising”
Single Tree + Deepnet:
• Keep the interpretability of the tree
• But with a more nuanced decision boundary
52. BigML, Inc #DutchMLSchool
Feature Stability
Feature Importance: Different subsets of features may have similar modeling performance
Fusing models gives better resilience against missing values as well as ensuring that all relevant features are utilized.
53. BigML, Inc #DutchMLSchool
Weighting over Time
Data significance over time:
• Some data may change significance at different times
• Short-term user behavior versus long-term
• Weights can be set to account for significance of time
Time windows shown: 1 Day, 1 Week, 1 Month
Weights shown: w=8, w=4, w=2
54. BigML, Inc #DutchMLSchool
Improved Class Separation
Consider a 3-class objective
• Really only care about “yes” versus “not yes”
• A single model may struggle to separate the two negative classes
Yes No Maybe
yes/no/maybe
yes/no
yes/maybe
55. BigML, Inc #DutchMLSchool
Feature Space Optimization
Model Skills: Some ML algorithms “generally” do better on some feature types:
• RDF for sparse text vectors
• LR/Deepnets for numeric features
• Trees for categorical features
Feature spaces shown: Full, Numeric, Text
59. BigML, Inc #DutchMLSchool
Issues with High Dimensionality
• Implicitly increases model complexity, prone to overfitting
• Requires more observations in order to generalize well
• Contains correlated or useless variables
• Data is difficult to visualize
• Takes a longer time to train models or make predictions
Principal Component Analysis addresses all of these issues
60. BigML, Inc #DutchMLSchool
Other Approaches
MODEL                 Pruning, Node threshold
ENSEMBLE              Bagging, Randomization
LOGISTIC REGRESSION   L1 and L2 penalties
DEEPNET               Dropout
61. BigML, Inc #DutchMLSchool
Dimensionality Reduction
Feature Selection
• Preserves the original variables and selects a subset
• Often uses recursive methods or statistical thresholds
• Examples: RFE, Chi-Squared Test, Boruta
Feature Extraction
• Transforms original variables into variables better suited for modeling
• Examples: word vectors, clustering
• PCA falls into this category
Manual Approach
62. BigML, Inc #DutchMLSchool
When to use PCA
1. You want to reduce the number of variables in your model, but it is not clear which should be eliminated
2. You want to generate variables that are not correlated
3. You are okay with sacrificing some amount of interpretability for potential downstream performance gains
63. BigML, Inc #DutchMLSchool
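How Does PCA Work?
Each PC is a linear combination of original variables, with its own set of weights
PC_1 = w_{1,1} F_1 + w_{1,2} F_2 + w_{1,3} F_3 + … + w_{1,N} F_N
PC_2 = w_{2,1} F_1 + w_{2,2} F_2 + w_{2,3} F_3 + … + w_{2,N} F_N
…
PC_N = w_{N,1} F_1 + w_{N,2} F_2 + w_{N,3} F_3 + … + w_{N,N} F_N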
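To make the linear combination concrete, here is a dependency-free Python sketch that finds the weights of the first principal component by power iteration on the sample covariance matrix and then scores each row. It is a teaching sketch under stated assumptions (numeric data with non-zero variance, a starting vector that is not orthogonal to the leading direction), not the method used by BigML; all names are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (weights, scores) for PC1 of a list of numeric rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration: repeatedly apply the covariance matrix and normalize.
    weights = [1.0] * dims
    for _ in range(iterations):
        weights = [sum(cov[i][j] * weights[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(w * w for w in weights))
        weights = [w / norm for w in weights]
    # PC1 score of a row = w_1 F_1 + w_2 F_2 + ... on the centered features.
    scores = [sum(w * f for w, f in zip(weights, c)) for c in centered]
    return weights, scores


# Two perfectly correlated features: PC1 lies along the diagonal.
weights, scores = first_principal_component([(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
```

Here both weights come out equal (about 0.707), so a single component captures all of the variation in the two original features.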
70. BigML, Inc #DutchMLSchool
BigML PCA
• Standard PCA only applies to numerical data
• BigML uses three different data transformation methods in order to handle different data types
  • Numeric data: Principal Component Analysis (PCA)
  • Categorical data: Multiple Correspondence Analysis (MCA)
  • Mixed data: Factorial Analysis of Mixed Data (FAMD)
• BigML will automatically handle numeric, text, items, and categorical data without needing user input