Yellowbrick is an open source Python library that provides visual diagnostic tools called “Visualizers” that extend the Scikit-Learn API to allow human steering of the model selection process. For teachers and students of machine learning, Yellowbrick can be used as a framework for teaching and understanding a large variety of algorithms and methods.
Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it's easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model's evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better-fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
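A short sketch of that definition in code (the synthetic blobs and k=3 are illustrative assumptions): each point is assigned to the cluster whose mean (centroid) is nearest, which is exactly the Voronoi-cell partition described above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# Three well-separated blobs centred at (0,0), (5,5), and (0,5)
X = np.vstack([
    rng.normal(loc, 0.5, size=(50, 2))
    for loc in [(0, 0), (5, 5), (0, 5)]
])

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Assigning a new point means finding its nearest centroid (its Voronoi cell)
point = np.array([[4.8, 5.1]])
label = km.predict(point)[0]
nearest = np.argmin(np.linalg.norm(km.cluster_centers_ - point, axis=1))
print(label == nearest)  # True: predict() picks the nearest mean
```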
This presentation is aimed at fitting a Simple Linear Regression model in a Python program. The IDE used is Spyder. Screenshots from a working example are used for demonstration.
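A minimal, runnable sketch of the kind of fit the presentation walks through (the synthetic data, true slope of 3, and intercept of 2 are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))                 # single predictor
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 0.5, 100)   # y = 3x + 2 + noise

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # estimates close to slope 3, intercept 2
print(model.predict([[5.0]]))            # prediction near 3*5 + 2 = 17
```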
Linear Regression Analysis | Linear Regression in Python | Machine Learning A... (Simplilearn)
This Linear Regression in Machine Learning Presentation will help you understand the basics of Linear Regression algorithm - what is Linear Regression, why is it needed and how Simple Linear Regression works with solved examples, Linear regression analysis, applications of Linear Regression and Multiple Linear Regression model. At the end, we will implement a use case on profit estimation of companies using Linear Regression in Python. This Machine Learning presentation is ideal for beginners who want to understand Data Science algorithms as well as Machine Learning algorithms.
Below topics are covered in this Linear Regression Machine Learning Tutorial:
1. Introduction to Machine Learning
2. Machine Learning Algorithms
3. Applications of Linear Regression
4. Understanding Linear Regression
5. Multiple Linear Regression
6. Use case - Profit estimation of companies
What is Machine Learning: Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - - -
Who should take this Machine Learning Training Course?
We recommend this Machine Learning training course for the following professionals in particular:
1. Developers aspiring to be a data scientist or Machine Learning engineer
2. Information architects who want to gain expertise in Machine Learning algorithms
3. Analytics professionals who want to work in Machine Learning or artificial intelligence
4. Graduates looking to build a career in data science and Machine Learning
- - - - - -
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel... (Benjamin Bengfort)
This is an overview of the goals and roadmap for the Yellowbrick model visualization library (www.scikit-yb.org). If you're interested in contributing to Yellowbrick or writing visualizers, this is a good place to get started.
In the presentation we discuss the expected workflow of data scientists interacting with the model selection triple and Scikit-Learn. We describe the Yellowbrick API and its relationship to the Scikit-Learn API. We introduce our primary object: the Visualizer, an estimator that learns from data and displays it visually. Finally, we describe the requirements for developing for Yellowbrick, the tools and utilities in place, and how to get started.
Yellowbrick is a suite of visual diagnostic tools called "Visualizers" that extend the Scikit-Learn API to allow human steering of the model selection process. In a nutshell, Yellowbrick combines Scikit-Learn with Matplotlib in the best tradition of the Scikit-Learn documentation, but to produce visualizations for your models!
This presentation was given during the opening session of the 2017 Spring DDL Research Labs.
Linear Regression vs Logistic Regression | Edureka (Edureka!)
YouTube: https://youtu.be/OCwZyYH14uw
** Data Science Certification using R: https://www.edureka.co/data-science **
This Edureka PPT on Linear Regression Vs Logistic Regression covers the basic concepts of linear and logistic models. The following topics are covered in this session:
Types of Machine Learning
Regression Vs Classification
What is Linear Regression?
What is Logistic Regression?
Linear Regression Use Case
Logistic Regression Use Case
Linear Regression Vs Logistic Regression
Blog Series: http://bit.ly/data-science-blogs
Data Science Training Playlist: http://bit.ly/data-science-playlist
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
This presentation introduces the Boosting ensemble method for machine learning. Its objective is to compare Boosting to the Random Forest ensemble method, explain the difference between AdaBoost and Gradient Boosting, and annotate the pseudo-code for each algorithm for classification and regression, respectively. Kirkwood gave this instructional demonstration while applying to be a Data Scientist in Residence at Galvanize, Boulder.
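A hedged sketch of the comparison the presentation draws (the synthetic dataset and estimator counts are illustrative assumptions): AdaBoost reweights training instances after each round, while Gradient Boosting fits each new tree to the residual errors of the current ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=7)

ada = AdaBoostClassifier(n_estimators=100, random_state=7)
gbm = GradientBoostingClassifier(n_estimators=100, random_state=7)

for name, clf in [("AdaBoost", ada), ("GradientBoosting", gbm)]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f}")
```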
Exploratory data analysis in R - Data Science Club (Martin Bago)
How to analyse new dataset in R? What libraries to use, and what commands? How to understand your dataset in few minutes? Read my presentation for Data Science Club by Exponea and find out!
Predict Backorder on supply chain data for an Organization (Piyush Srivastava)
Cleaned the data, identified the important variables, and built a best-performing model to predict back-orders for an organization, comparing several classification techniques (Random Forest, Naïve Bayes, Decision Tree, KNN, Neural Network, Support Vector Machine).
This is an elaborate presentation on how to predict employee attrition using various machine learning models. This presentation will take you through the process of statistical model building using Python.
These slides summarize the Counterfactual Explanation session from the "Explainable AI Planning!" track of the 18th cohort of the Pullip School (풀잎스쿨) study group.
They were compiled from papers, YouTube videos, and the following resource:
https://christophm.github.io/interpretable-ml-book/
CRAM (Change Risk Assessment Model) is a novel modeling approach that can significantly contribute to the missing formality of business models, especially in the area of change risk assessment.
Project management has long established the need for risk management techniques to succinctly define the risks associated with projects and to agree on countervailing actions, with the aim of reducing scope creep and increasing the probability of on-time, in-budget delivery.
Uncontrolled changes, regardless of size and complexity, can pose risks of any magnitude to projects and affect project success or even an organisation’s coherence.
Using Python libraries such as NumPy, SciPy, and pandas to carry out supervised learning operations like support vector machines, decision trees, and k-nearest neighbors.
Welcome to Supervised Machine Learning and Data Science.
Algorithms for building models: Support Vector Machines.
Classification algorithm explanation and code in Python (SVM).
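An illustrative sketch of the supervised workflow described above (the Iris dataset, split ratio, and hyperparameters are assumptions for demonstration): train an SVM and compare it against a decision tree and k-nearest neighbors.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

models = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    # Fit on the training split, score accuracy on the held-out split
    acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {acc:.2f}")
```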
Isotonic Regression is a statistical technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing (or non-increasing) everywhere, and lies as close to the observations as possible. Isotonic Regression is limited to predicting numeric output so the dependent variable must be numeric in nature…
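A minimal sketch of that fit with scikit-learn (the noisy log-shaped data is an illustrative assumption): the fitted sequence is non-decreasing everywhere while staying as close to the observations as least squares allows.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.RandomState(1)
x = np.arange(50, dtype=float)
y = np.log1p(x) + rng.normal(0, 0.3, 50)   # increasing trend plus noise

iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)

# Monotonicity holds everywhere in the fitted sequence
print(np.all(np.diff(y_fit) >= 0))  # True
```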
Steering Model Selection with Visual Diagnostics: Women in Analytics 2019 (Rebecca Bilbro)
Machine learning is ultimately a search for the best combination of features, algorithm, and hyperparameters that result in the best performing model. Oftentimes, this leads us to stay in our algorithmic comfort zones, or to resort to automated processes such as grid searches and random walks. Whether we stick to what we know or try many combinations, we are sometimes left wondering if we have actually succeeded.
By enhancing model selection with visual diagnostics, data scientists can inject human guidance to steer the search process. Visualizing feature transformations, algorithmic behavior, cross-validation methods, and model performance allows us a peek into the high dimensional realm in which our models operate. As we continue to tune our models, trying to minimize both bias and variance, these glimpses allow us to be more strategic in our choices. The result is more effective modeling, speedier results, and greater understanding of underlying processes.
Visualization is an integral part of the data science workflow, but visual diagnostics are directly tied to machine learning transformers and models. The Yellowbrick library extends the scikit-learn API providing a Visualizer object, an estimator that learns from data and produces a visualization as a result. In this tutorial, we will explore feature visualizers, visualizers for classification, clustering, and regression, as well as model analysis visualizers. We'll work through several examples and show how visual diagnostics steer model selection, making machine learning more informed, and more effective.
Data Structures for Data Privacy: Lessons Learned in Production (Rebecca Bilbro)
In this talk we present a decentralized messaging protocol and storage system in use across North America, Europe, and South East Asia. Built at the behest of an international nonprofit working group, the protocol and system are designed to address a unique problem at the intersection of financial crime regulation, distributed ledger technology, and user privacy. In our talk we discuss the many lessons learned in the process of architecting, implementing, and fostering adoption of this system. We present the Secure Envelope, a data structure that employs a combination of methods (symmetric and asymmetric encryption, mTLS, protocol buffers, etc) to safeguard data privacy both at rest and in flight.
Conflict-Free Replicated Data Types (PyCon 2022) (Rebecca Bilbro)
Jupyter Notebook may be one of the most controversial open source projects released in the last ten years! Love them or hate them, they’ve become a mainstay of data science and machine learning, and a significant part of the Python ecosystem. While Jupyter can simplify experimentation, rapid prototyping, documentation, and visualization, it often impedes version control, code review, and test coverage. Dev teams must accept the good with the bad… but what if they didn’t have to? In this talk we introduce conflict-free replicated data types (CRDT), a special object that supports strong consistency, and which can be used to enhance Jupyter notebooks for a truly collaborative experience.
First proposed by Shapiro et al. in 2011, conflict-free replicated data types (CRDTs) evolved out of the distributed systems community for replicating data across a network of replicas. CRDTs are objects that come with a special guarantee — namely, that two different copies of that object can be strongly consistent, meaning they can be kept in sync. While CRDTs have enjoyed a good amount of attention from academia in recent years, primarily amongst database and cloud researchers, they have not led to many practical applications for everyday developers. However, recent work by Kleppmann et al. shows CRDTs can be used for real-time rich-text collaboration — creating a “Google doc”-type experience with any document in a networked file system.
In this talk, we’ll present the basics of CRDTs and demonstrate how they work with examples written in Python. Next, we’ll explain how CRDTs can enable more collaborative Jupyter notebooks, opening up features such as synchronous insertions, diffs, and auto-merges, even with multiple simultaneous contributors!
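In the spirit of the Python examples the talk promises, here is a sketch of one of the simplest CRDTs, a grow-only counter (G-Counter); this is a generic textbook construction, not code from the talk. Each replica increments only its own slot, and merge takes the element-wise maximum, so replicas converge regardless of the order in which updates arrive.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merge by element-wise max."""

    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.replica_id] += 1   # a replica only touches its own slot

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # which is exactly what gives CRDTs their convergence guarantee.
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

# Two replicas update independently, then sync in either order:
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()
b.increment()
a.merge(b); b.merge(a)
print(a.value(), b.value())  # both converge to 3
```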
(Py)testing the Limits of Machine Learning (Rebecca Bilbro)
Despite the hype cycle, each day machine learning becomes a little less magic and a little more real. Predictions increasingly drive our everyday lives, embedded into more of our everyday applications. To support this creative surge, development teams are evolving, integrating novel open source software and state-of-the-art GPU hardware, and bringing on essential new teammates like data ethicists and machine learning engineers. Software teams are also now challenged to build and maintain codebases that are intentionally not fully deterministic.
This nondeterminism can manifest in a number of surprising and oftentimes very stressful ways! Successive runs of model training may produce slight but meaningful variations. Data wrangling pipelines turn out to be extremely sensitive to the order in which transformations are applied, and require thoughtful orchestration to avoid leakage. Model hyperparameters that can be tuned independently may have mutually exclusive conditions. Models can also degrade over time, producing increasingly unreliable predictions. Moreover, open source libraries are living, dynamic things; the latest release of your team's favorite library might cause your code to suddenly behave in unexpected ways.
Put simply, as ML becomes more of an expectation than an exception in our industry, testing has never been more important! Fortunately, we are lucky to have a rich open source ecosystem to support us in our journey to build the next generation of apps in a safe, stable way. In this talk we'll share some hard-won lessons, favorite open source packages, and reusable techniques for testing ML software components.
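To make that concrete, here is a hedged sketch (the model, dataset, and score threshold are illustrative assumptions) of pytest-style tests for a nondeterministic component: pin the random seed for reproducibility, and assert on properties rather than exact values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def train(seed):
    """Train a small model with a pinned seed so tests are repeatable."""
    X, y = make_classification(n_samples=300, random_state=seed)
    return RandomForestClassifier(random_state=seed).fit(X, y), X, y

def test_model_is_reproducible_with_fixed_seed():
    (m1, X, _), (m2, _, _) = train(0), train(0)
    assert np.array_equal(m1.predict(X), m2.predict(X))

def test_model_beats_chance():
    # Assert a property (a score threshold), never an exact float
    model, X, y = train(0)
    assert model.score(X, y) > 0.9

# Under pytest these would be collected automatically; run directly here:
test_model_is_reproducible_with_fixed_seed()
test_model_beats_chance()
```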
Anti-Entropy Replication for Cost-Effective Eventual Consistency (Rebecca Bilbro)
Eventually consistent systems are often more cost-effective to implement and maintain than their strongly consistent cousins. Gossip-based anti-entropy methods can be used to improve the consistency of such systems even as they expand geographically.
In the machine learning community, we're trained to think of size as inversely proportional to bias, driving us to ever larger datasets, increasingly complex model architectures, and ever better accuracy scores. But bigger doesn't always mean better.
What data quality issues emerge in large datasets? What complications surface as features become more geodistributed (e.g., diurnal patterns, seasonal variations, datetime formatting, multilingual text, etc.)? What happens as models attempt to extrapolate bigger and bigger patterns? Why is it that the pursuit of megamodels has driven a wedge between the ML definition of “bias” and the more colloquial sense of the word?
Perhaps the time has come to move away from monolithic models that reduce rich variations and complexities to a simple argmax on the output layer and instead embrace a new generation of model architectures that are just as organic and diverse as the data they seek to encode.
There have never been more commercial tools available for building distributed data apps — from cloud hosting services, to cloud-native databases, to cloud-based analytics platforms. So why is it still so hard to make a successful app with a global user base?
One of the toughest challenges cloud offerings take on is the problem of consensus, abstracting away most of the complexity. That's no small feat, given that this is a hard enough problem that people spend years getting a PhD just to understand it! Unfortunately, while buying off-the-shelf cloud services can accelerate the path to an MVP, it also makes optimization tough. How will we scale during a period of rapid user growth? How do we do I18n and l10n or guarantee a good UX for users on the other side of the world? How do we prevent replication that might get us into legal trouble?
In this talk, we'll consider several case studies of global apps (both successful and otherwise!), talk about the limitations of off-the-shelf consensus, and consider a future where everyday developers can use open source tools to build distributed data apps that are easier to reason about, maintain, and tune.
We live with an abundance of ML resources; from open source tools, to GPU workstations, to cloud-hosted autoML. What’s more, the lines between AI research and everyday ML have blurred; you can recreate a state-of-the-art model from arxiv papers at home. But can you afford to? In this talk, we explore ways to recession-proof your ML process without sacrificing on accuracy, explainability, or value.
EuroSciPy 2019: Visual diagnostics at scale (Rebecca Bilbro)
The hunt for the most effective machine learning model is hard enough with a modest dataset, and much more so as our data grow! As we search for the optimal combination of features, algorithm, and hyperparameters, we often use tools like histograms, heatmaps, embeddings, and other plots to make our processes more informed and effective. However, large, high-dimensional datasets can prove particularly challenging. In this talk, we’ll explore a suite of visual diagnostics, investigate their strengths and weaknesses in face of increasingly big data, and consider how we can steer the machine learning process, not only purposefully but at scale!
A Visual Exploration of Distance, Documents, and Distributions (Rebecca Bilbro)
Machine learning often requires us to think spatially and make choices about what it means for two instances to be close or far apart. So which is best - Euclidean? Manhattan? Cosine? It all depends! In this talk, we'll explore open source tools and visual diagnostic strategies for picking good distance metrics when doing machine learning on text.
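An illustrative sketch of why "it all depends" (the toy documents are assumptions for demonstration): on bag-of-words vectors, cosine distance ignores document length, while Euclidean distance does not.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

docs = [
    "machine learning with python",
    "machine learning with python " * 10,   # same content, 10x longer
    "gossip protocols for replication",     # no shared vocabulary with doc 0
]
X = CountVectorizer().fit_transform(docs)

# Cosine sees doc 0 and its repeated copy as identical (distance ~0),
# while Euclidean places them far apart because of raw count magnitude.
print(cosine_distances(X[0], X[1])[0, 0])     # ~0.0
print(euclidean_distances(X[0], X[1])[0, 0])  # large (18.0 here)
print(cosine_distances(X[0], X[2])[0, 0])     # ~1.0 (no shared terms)
```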
The Incredible Disappearing Data Scientist (Rebecca Bilbro)
The last decade saw advances in compute power combine with an avalanche of open source software development, resulting in a revolution in machine learning and scalable analytics. “Data science” and “data product” are now household terms. This led to a new job description, the Data Scientist, which quickly became one of the most significant, exciting, and misunderstood jobs of the 21st century. One part statistician, one part computer scientist, and one part domain expert, data scientists seem poised to become the most pivotal value creators of the information age. And yet, danger (supposedly) lies ahead: human decisions are increasingly outsourced to algorithms of questionable ethical design; we’re putting everything on the blockchain; and perhaps most disturbingly, data science salaries are dropping precipitously as new graduates and Machine Learning as a Service (MLaaS) offerings flood the market. As we move into a future where predictive analytics is no longer a differentiator but instead a core business function, will data scientists proliferate or be automated out of a job?
In this talk, one humble data scientist attempts to cut through the hype to present an alternate vision of what data science is and can become. If not the “Sexiest Job of the 21st Century” as the Harvard Business Review once quipped, what is it like to be a workaday data scientist? What problems are we solving? How do we integrate with mature engineering teams? How do we engage with clients and product owners? How do we deploy non-deterministic models in production? In particular, we’ll examine critical integration points — technological and otherwise — we are currently tackling, which will ultimately determine our success, and our viability, over the next 10 years.
While data privacy challenges long predate current trends in machine-learning-as-a-service (MLAAS) offerings, predictive APIs do expose significant new attack vectors. To provide users with tailored recommendations, these applications often expose endpoints either to dynamic models or to pre-trained model artifacts, which learn patterns from data to surface insights. Problems arise when training data are collected, stored, and modeled in ways that jeopardize privacy. Even when user data is not exposed directly, private information can often be inferred using a technique called model inversion. In this talk, I discuss current research in black box model inversion and present a machine learning approach to discovering the model families of deployed black box models using only their decision topologies. Prior work suggests the efficacy of model family specific attack vectors (i.e., once the model is no longer a black box, it is easier to exploit). As such, we approach the problem only of model discovery and not of model inversion, reasoning that by solving the problem of model identification, we clear a path for information security and cryptography experts to use domain-specific tools for model inversion.
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search.
This talk presents a new open source Python library, Yellowbrick, which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Visualizers enable machine learning practitioners to visually interpret the model selection process, steer workflows toward more predictive models, and avoid common pitfalls and traps. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
Data Intelligence 2017 - Building a Gigaword Corpus (Rebecca Bilbro)
As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on domain-specific corpora (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of at least hundreds of gigabytes of data inside of thousands of documents, and often more!
In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
Yellowbrick: Steering machine learning with visual transformers (Rebecca Bilbro)
In machine learning, model selection is a bit more nuanced than simply picking the 'right' or 'wrong' algorithm. In practice, the workflow includes (1) selecting and/or engineering the smallest and most predictive feature set, (2) choosing a set of algorithms from a model family, and (3) tuning the algorithm hyperparameters to optimize performance. Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can more effectively hone in on quality models than exhaustive search.
This talk presents a new Python library, Yellowbrick (scikit-yb.org), which extends the Scikit-Learn API with a visual transformer (visualizer) that can incorporate visualizations of the model selection process into pipelines and modeling workflows. Yellowbrick is an open source, pure Python project that extends Scikit-Learn with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grained control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
In this talk, we'll explore not only what you can do with Yellowbrick, but how it works under the hood (since we're always looking for new contributors!). We'll illustrate how Yellowbrick extends the Scikit-Learn and Matplotlib APIs with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process - providing iterative visual diagnostics throughout the transformation of high dimensional data.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Techniques to optimize the pagerank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices that have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and can thus also reduce iteration time. Road networks often have chains that can be short-circuited before pagerank computation to improve performance, since the final ranks of chain nodes are easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the pagerank of each strongly connected component can be computed in topological order. This can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
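For reference, the baseline these techniques optimize is plain power iteration; a minimal sketch in Python (the graph encoding, damping factor, and tolerance are illustrative) shows the per-iteration work and the convergence loop that the methods above try to shrink:

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=100):
    """Plain power-iteration PageRank over a dict mapping node -> out-links.

    The optimizations described above (skipping converged or in-identical
    vertices, short-circuiting chains, per-component ordering) all aim to
    reduce the work inside this loop or the number of loop passes.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank mass of dangling nodes (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new = {v: (1 - d) / n + d * dangling / n for v in nodes}
        for v in nodes:
            for u in out_links[v]:
                new[u] += d * rank[v] / len(out_links[v])
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:
            return new  # every vertex converged; STICD-style skipping is per-vertex
        rank = new
    return rank

# A 3-cycle: by symmetry, every node ends up with rank 1/3.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```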
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
5. Yellowbrick Feature Visualizers
Use radviz or parallel coordinates to look for class separability.
6. Radial Visualization
● Based on spring tension minimization algorithm.
● Features equally spaced on a unit circle, instances dropped into circle.
● Features pull instances towards their position on the circle in proportion to their normalized numerical value for that instance.
● Classification coloring based on labels in data.
8. Parallel Coordinates
● Visualize clusters in data.
● Points represented as connected line segments.
● Each vertical line represents one attribute (x-axis units not meaningful).
● One set of connected line segments represents one instance.
● Points that tend to cluster will appear closer together.
9. Rank2D
Use Rank2D for pairwise feature analysis; find strong correlations (potential collinearity?).
10. Rank2D
● Feature engineering requires understanding of the relationships between features.
● Visualize pairwise relationships as a heatmap.
● Pearson shows us strong correlations, potential collinearity.
● Covariance helps us understand the sequence of relationships.
11. PCA Projection Plots
● Uses PCA to decompose high dimensional data into two or three dimensions.
● Each instance plotted in a scatter plot.
● Projected dataset can be analyzed along axes of principal variation.
● Can be interpreted to determine if spherical distance metrics can be utilized.
12. PCA Projection Plots
Can also plot in 3D to visualize more components and get a better sense of the distribution in high dimensions.
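The decomposition underneath these plots can be sketched directly with scikit-learn; the dataset here is synthetic and illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic, illustrative data: 100 instances of 6 features with
# low-rank (3-dimensional) latent structure.
rng = np.random.RandomState(42)
X = rng.randn(100, 3) @ rng.randn(3, 6)

# Project onto the two axes of greatest (principal) variation;
# the PCA projection plot scatters these two coordinates.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
```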
15. Feature Importance Plot
● Need to select the minimum required features to produce a valid model.
● The more features a model contains, the more complex it is (sparse data, errors due to variance).
● This visualizer ranks and plots the underlying impact of features relative to each other.
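The underlying attribute the visualizer ranks can be inspected directly; a small sketch with synthetic, illustrative data in which only one feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, illustrative data: only the first of four features is predictive.
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
# The attribute a feature importance plot ranks and draws:
importances = model.feature_importances_
```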
17. Recursive Feature Elimination
● Recursive feature elimination fits a model and removes the weakest feature(s) until the specified number is reached.
● Features are ranked by the internal model's coef_ or feature_importances_ attributes.
● Attempts to eliminate dependencies and collinearity that may exist in the model.
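The procedure is available in scikit-learn itself; a minimal sketch on synthetic, illustrative data where only two of five features matter:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data: only the first two of five features matter.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Recursively drop the weakest feature (ranked by coef_) until two remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
selected = rfe.support_  # boolean mask of the surviving features
```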
19. Evaluating Classifiers
● How well did predicted values match actual labeled values?
● In a 2-class problem, there are two ways to be “right”:
○ Classifier correctly identifies cases (aka “True Positives”)
○ Classifier correctly identifies non-cases (aka “True Negatives”)
● ...and two ways to be “wrong”:
○ Classifier incorrectly identifies a non-case as a case (aka “False Positive” or “Type I Error”)
○ Classifier incorrectly identifies a case as a non-case (aka “False Negative” or “Type II Error”)
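These four outcomes can be tallied directly; a pure-Python sketch (the labels are illustrative, with 1 as the positive class):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four outcomes for a 2-class problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # correctly identified case (True Positive)
            else:
                fp += 1  # non-case flagged as case (Type I Error)
        else:
            if truth == positive:
                fn += 1  # case missed (Type II Error)
            else:
                tn += 1  # correctly identified non-case (True Negative)
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
```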
20. Metrics for Classification
Metric | Measures | In Scikit-learn
Precision | How many selected are relevant? | from sklearn.metrics import precision_score
Recall | How many relevant were selected? | from sklearn.metrics import recall_score
F1 | Weighted average of precision & recall | from sklearn.metrics import f1_score
Confusion Matrix | True positives, true negatives, false positives, false negatives | from sklearn.metrics import confusion_matrix
ROC | True positive rate vs. false positive rate, as classification threshold varies | from sklearn.metrics import roc_curve
AUC | Aggregate accuracy, as classification threshold varies | from sklearn.metrics import auc
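The first rows of the table can be exercised on a tiny hand-checkable example containing one of each outcome:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             confusion_matrix)

# One true positive, one false negative, one false positive, one true negative.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 1/2
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 1/2
f1 = f1_score(y_true, y_pred)                # harmonic mean = 1/2
cm = confusion_matrix(y_true, y_pred)        # rows: actual, cols: predicted
```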
23. Classification Report
from sklearn.metrics import classification_report as cr
print(cr(y, yhat, target_names=target_names))
● includes same basic info as confusion matrix
● 3 different evaluation metrics: precision, recall, F1 score
● includes class labels for interpretability
24. Classification Heatmaps
● Precision: of those labelled edible, how many actually were?
● Recall: how many of the poisonous ones did our model find?
● The heatmap makes you ask: is it better to have false positives in one class or the other?
25. ROC-AUC
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y, yhat)
roc_auc = auc(fpr, tpr)
Visualize the tradeoff between the classifier's sensitivity (how well it finds true positives) and specificity (how well it avoids false positives):
● straight horizontal line across the top -> perfect classifier
● pulling a lot toward the upper left corner -> good accuracy
● exactly aligned with the diagonal -> coin toss
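The snippet above runs end-to-end on a tiny hand-checkable example (the labels and scores are illustrative):

```python
from sklearn.metrics import roc_curve, auc

# Two negatives, two positives, with classifier scores for the positive class.
y = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)  # 0.75: better than a coin toss, short of perfect
```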
27. ROC-AUC for Multiclass Classification
ROC curves are typically used in binary classification, but Yellowbrick allows for multiclass classification evaluation by binarizing the output (per-class) or by using one-vs-rest (micro score) or one-vs-all (macro score) classification strategies.
28. Confusion Matrix
● Takes as arguments the actual values and the predicted values generated by the fitted model.
● Outputs a confusion matrix.
from sklearn.metrics import confusion_matrix
29. Confusion Matrix
● I have a lot of classes; how does my model perform on each?
● Do I care about certain classes more than others?
30. Class Prediction Error Plot
Similar to a confusion matrix, but sometimes more interpretable.
31. Discrimination Threshold Visualizer (for binary classification only)
● Probability or score at which the positive class is chosen over the negative.
● Generally set to 50%.
● Can be adjusted to increase/decrease sensitivity to false positives or other application factors.
● Cases that require special treatment?
32. Evaluating Regressors
● How well does the model describe the training data?
● How well does the model predict out-of-sample data?
○ Goodness-of-fit
○ Randomness of residuals
○ Prediction error
33. Metrics for Regression
Metric | Measures | In Scikit-learn
Mean Square Error (MSE, RMSE) | distance between predicted values and actual values (more sensitive to outliers) | from sklearn.metrics import mean_squared_error
Absolute Error (MAE, RAE) | distance between predicted values and actual values (less sensitive to outliers) | from sklearn.metrics import mean_absolute_error, median_absolute_error
Coefficient of Determination (R-Squared) | % of variance explained by the regression; how well future samples are likely to be predicted by the model | from sklearn.metrics import r2_score
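These imports can be exercised on a tiny hand-checkable example (the values are illustrative):

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
mae = mean_absolute_error(y_true, y_pred)  # (0.5 + 0.5 + 0 + 1) / 4 = 0.5
r2 = r2_score(y_true, y_pred)              # close to 1: most variance explained
```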
35. Prediction Error Plots
from sklearn.model_selection import cross_val_predict
● Cross-validation is a way of measuring model performance.
● Divide data into training and test splits; fit model on training, predict on test.
● Use cross_val_predict to visualize prediction errors as a scatterplot of the predicted and actual values.
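A minimal sketch of the procedure on synthetic, illustrative data; the resulting y vs. yhat pairs are exactly what a prediction error plot scatters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic, illustrative regression data with a known linear signal.
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.randn(60)

# Each prediction comes from a fold whose model never saw that point.
yhat = cross_val_predict(LinearRegression(), X, y, cv=5)
```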
37. Plotting Residuals
● Standardized y-axis.
● Model prediction on x-axis.
● Model accuracy on y-axis; distance from the line at 0 indicates how good/bad the prediction was for that value.
● Check whether residuals are consistent with random error; data points should appear evenly dispersed around the plotted line. We should not be able to predict the error.
● Visualize train and test data with different colors.
42. Class Balance
● What to do with a low-accuracy classifier?
● Check for class imbalance.
● Visual cue that we might try stratified sampling, oversampling, or getting more data.
43. Cross Validation Scores
● Real world data are often distributed somewhat unevenly; the fitted model is likely to perform better on some sections of the data than others.
● See cross-validated scores as a bar chart (one bar for each fold) with the average score across all folds plotted as a dotted line.
● Explore variations in performance using different cross validation strategies.
44. Learning Curve
● Relationship of the training score vs. the cross validated test score for an estimator.
● Do we need more data? If the scores converge together, then probably not. If the training score is much higher than the validation score, then yes.
● Is the estimator more sensitive to error due to variance or error due to bias?
45. Validation Curve
● Plot the influence of a single hyperparameter on the training and test data.
● Is the estimator under- or over-fitting for some hyperparameter values?
For SVC, gamma is the coefficient of the RBF kernel. The larger gamma is, the tighter the support vector is around single points (e.g. overfitting). Here, around gamma=0.1, the SVC memorizes the data.
47. Hyperparameters
● When we call fit() on an estimator, it learns the parameters of the algorithm that make it fit the data best.
● However, some parameters are not directly learned within an estimator. These are the ones we provide when we instantiate the estimator.
○ alpha for LASSO or Ridge
○ C, kernel, and gamma for SVC
● These parameters are often referred to as hyperparameters.
48. Hyperparameters
Examples:
● Alpha/penalty for regularization
● Kernel function in a support vector machine
● Leaves or depth of a decision tree
● Neighbors used in a nearest neighbor classifier
● Clusters in k-means clustering
49. Hyperparameters
How to pick the best hyperparameters?
● Use the defaults
● Pick randomly
● Search parameter space for the best score (e.g. grid search)
… Except that hyperparameter space is large and grid search is slow if you don't already know what you're looking for.
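The grid search option above looks like this in scikit-learn; the data, estimator, and alpha grid are illustrative, and the slowness the slide warns about grows with the size of the grid:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic, illustrative data with a known linear signal.
rng = np.random.RandomState(0)
X = rng.randn(80, 3)
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.randn(80)

# Exhaustively score every alpha in the grid with 5-fold cross validation.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```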
51. Alpha Selection with Yellowbrick
● Should I use Lasso, Ridge, or ElasticNet?
● Is regularization even working?
● More alpha => less complexity.
● Reduced bias, but increased variance.
52. What's the right k?
● How many clusters do you see?
● How do you pick an initial value for k in k-means clustering?
● How do you know whether to increase or decrease k?
● Is partitive clustering the right choice?
53. K-Selection with Yellowbrick
● Higher silhouette scores mean denser, more separate clusters.
● The elbow shows the best value of k… or suggests a different algorithm.
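The silhouette criterion can be sketched with scikit-learn alone; the blob data and candidate k values are illustrative, with the best-scoring k matching the number of generated blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs; the right k should score highest.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + center
               for center in ([0, 0], [10, 10], [20, 0])])

scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # denser, more separate => higher

best_k = max(scores, key=scores.get)
```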
54. Manifold Visualization
● Embed instances described by many dimensions into 2.
● Look for latent structures in the data, noise, separability.
● Is it possible to create a decision space in the data?
● Unlike PCA or SVD, manifolds use nearest neighbors, so they can capture non-linear structures.
57. Scikit-Learn Estimator Interface
# Import the estimator
from sklearn.linear_model import Lasso
# Instantiate the estimator
model = Lasso()
# Fit the data to the estimator
model.fit(X_train, y_train)
# Generate a prediction
model.predict(X_test)
58. Yellowbrick Visualizer Interface
# Import the model and visualizer
from sklearn.linear_model import Lasso
from yellowbrick.regressor import PredictionError
# Instantiate the visualizer
visualizer = PredictionError(Lasso())
# Fit
visualizer.fit(X_train, y_train)
# Score and visualize
visualizer.score(X_test, y_test)
visualizer.poof()
59. Scikit-learn Estimators
The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data.

class Estimator(object):
    def fit(self, X, y=None):
        """Fits estimator to data."""
        # set state of self
        return self

    def predict(self, X):
        """Predict response of X."""
        # compute predictions pred
        return pred
60. Scikit-learn Transformers
Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X′.

class Transformer(Estimator):
    def transform(self, X):
        """Transforms the input data."""
        # transform X to X_prime
        return X_prime
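A concrete, runnable instance of that contract, with fit() learning state and transform() producing X′; this class is illustrative and not part of Scikit-Learn or Yellowbrick:

```python
import numpy as np

class MeanCenterer:
    """A minimal transformer in the same style: fit() learns state and
    returns self; transform() maps X to a new dataset X'."""

    def fit(self, X, y=None):
        # set state of self: the per-column mean
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # transform X to X_prime by subtracting the learned means
        return np.asarray(X, dtype=float) - self.mean_

X = [[1.0, 2.0], [3.0, 4.0]]
X_prime = MeanCenterer().fit(X).transform(X)
```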
61. Yellowbrick Visualizers
A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions. Visualizers are intended to work in concert with Transformers and Estimators to shed light onto the modeling process.

class Visualizer(Estimator):
    def draw(self):
        """Draw the data."""
        self.ax.plot()

    def finalize(self):
        """Complete the figure."""
        self.ax.set_title()

    def poof(self):
        """Show the figure."""
        plt.show()
63. Yellowbrick is an open source project that is supported by a community who will gratefully and humbly accept any contributions you might make to the project. Large or small, any contribution makes a big difference; and if you've never contributed to an open source project before, we hope you will start with Yellowbrick!
The model selection triple.
Arun Kumar did a survey of the analytical process
He’s going to crop up in a bit in a more interesting way
This feels right to me; and hopefully you see something similar.
Machine learning is about learning from example
And works on instances (examples)
Cite: http://pages.cs.wisc.edu/~arun/vision/SIGMODRecord15.pdf
analysts typically use an iterative exploratory process
Visit the docs! http://www.scikit-yb.org/en/develop/index.html
For classification; potentially we want to see if there is good separability
Are some features more predictive than others?
We can see that the co2 values for the two classes are intertwined. We get a sense that something like a decision tree will have a hard time with this. Perhaps Gaussian instead? It will be able to use probabilities to describe the spread of those co2 values.
Feature engineering requires understanding of the relationships between features
Visualize pairwise relationships as a heatmap
Pearson shows us strong correlations => potential collinearity
Covariance helps us understand the sequence of relationships
Uses PCA to decompose high dimensional data into two or three dimensions
Each instance plotted in a scatter plot.
Projected dataset can be analyzed along axes of principal variation
Can be interpreted to determine if spherical distance metrics can be utilized.
Can also be plotted in three dimensions to attempt to visualize more components and get a better sense of the distribution in high dimensions
Frequency distribution - top 50 tokens
Stochastic Neighbor Embedding, decomposition then projection into 2D scatterplot
Visual part-of-speech tagging
The feature engineering process involves selecting the minimum required features to produce a valid model because the more features a model contains, the more complex it is (and the more sparse the data), therefore the more sensitive the model is to errors due to variance.
A common approach to eliminating features is to describe their relative importance to a model, then eliminate weak features or combinations of features and re-evaluate to see if the model fares better during cross-validation.
Many model forms describe the underlying impact of features relative to each other. This visualizer uses this attribute to rank and plot relative importances.
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
Features are ranked by the model’s coef_ or feature_importances_ attributes, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.
RFE requires a specified number of features to keep, however it is often not known in advance how many features are valid.
To find the optimal number of features cross-validation is used with RFE to score different feature subsets and select the best scoring collection of features.
The RFECV visualizer plots the number of features in the model along with their cross-validated test score and variability, and visualizes the selected number of features.
Receiver operating characteristics/area under curve
Classification report heatmap - Quickly identify strengths & weaknesses of model - F1 vs Type I & Type II error
Visual confusion matrix - misclassification on a per-class basis
The class prediction error chart provides a way to quickly understand how good your classifier is at predicting the right classes.
A visualization of precision, recall, f1 score, and queue rate with respect to the discrimination threshold of a binary classifier.
The discrimination threshold is the probability or score at which the positive class is chosen over the negative class.
Generally, this is set to 50% but the threshold can be adjusted to increase or decrease the sensitivity to false positives or to other application factors.
One common use is to determine cases that require special treatment.
For example, a fraud prevention application might use a classification algorithm to determine if a transaction is likely fraudulent and needs to be investigated in detail.
Spam/not spam
Precision: An increase in precision is a reduction in the number of false positives; this metric should be optimized when the cost of special treatment is high (e.g. wasted time in fraud prevention or missing an important email).
Recall: An increase in recall decreases the likelihood that the positive class is missed; this metric should be optimized when it is vital to catch the case even at the cost of more false positives. (e.g. SPAM v. VIRUS)
F1 Score: The F1 score is the harmonic mean between precision and recall. The fbeta parameter determines the relative weight of precision and recall when computing this metric, by default set to 1 or F1. Optimizing this metric produces the best balance between precision and recall.
Queue Rate: The “queue” is the spam folder or the inbox of the fraud investigation desk. This metric describes the percentage of instances that must be reviewed. If review has a high cost (e.g. fraud prevention) then this must be minimized with respect to business requirements; if it doesn’t (e.g. spam filter), this could be optimized to ensure the inbox stays clean.
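Sweeping the threshold by hand makes the tradeoff concrete; a sketch with an assumed imbalanced dataset standing in for the fraud/spam case:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced binary problem (assumed setup): ~20% positive class.
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

queue_rates = []
for t in (0.25, 0.50, 0.75):
    pred = probs >= t                       # positive class chosen above threshold t
    tp = int(np.sum(pred & (y_te == 1)))
    precision = tp / max(int(pred.sum()), 1)
    recall = tp / int((y_te == 1).sum())
    queue_rates.append(float(pred.mean()))  # fraction of instances sent for review
    print(f"t={t:.2f} precision={precision:.2f} recall={recall:.2f} "
          f"queue={pred.mean():.2%}")
```

Raising the threshold always shrinks the queue; the visualizer shows where precision and recall pay for that.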
Where/why/how is the model performing well or badly?
Prediction error plot - the 45-degree line represents a theoretically perfect fit
Residuals plot - the zero line indicates no error
A change in the spread of the residuals along the x axis => heteroscedasticity
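A residuals plot is built from these quantities; a minimal sketch, where the regression setup is an assumption:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

model = LinearRegression().fit(X_tr, y_tr)
residuals = y_te - model.predict(X_te)   # the points on a residuals plot

# A roughly zero-mean residual cloud with constant spread along x suggests
# homoscedastic errors; a funnel shape would indicate heteroscedasticity.
print("mean residual:", residuals.mean())
```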
Can we quickly detect class imbalance issues?
Stratified sampling, oversampling, getting more data -- these tricks will help us balance classes
But supervised methods can mask training data; simple graphs like these give us an at-a-glance reference
As this gets into multiclass problems, domination by one class can be harder to see and can really affect modeling
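The at-a-glance check is just a bar chart of label counts; a sketch with a hypothetical label array:

```python
from collections import Counter

# Hypothetical multiclass labels; in practice this is your training y.
y = ["ham"] * 950 + ["spam"] * 120 + ["virus"] * 30

counts = Counter(y)
total = sum(counts.values())
for label, n in counts.most_common():
    # These proportions are the bar heights in a class balance chart.
    print(f"{label:>6}: {n:4d} ({n / total:.1%})")
```

Here "virus" is under 3% of instances, the kind of domination that is easy to miss without the plot.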
A learning curve shows the relationship of the training score vs the cross-validated test score for an estimator with a varying number of training samples. This visualization is typically used to show two things:
How much the estimator benefits from more data (e.g. do we have “enough data” or will the estimator get better if used in an online fashion).
Whether the estimator is more sensitive to error due to variance or to error due to bias.
If the training and cross validation scores converge together as more data is added (shown in the left figure), then the model will probably not benefit from more data. If the training score is much greater than the validation score (as shown in the right figure) then the model probably requires more training examples in order to generalize more effectively.
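The two curves come straight from scikit-learn's learning_curve utility, which Yellowbrick's visualizer wraps; the dataset and estimator here are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

# Score the model at 5 increasing training-set sizes, with 5-fold CV at each.
sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# The learning-curve plot draws these two mean curves against the sizes.
print("sizes:", sizes)
print("train:", train_scores.mean(axis=1))
print("test: ", test_scores.mean(axis=1))
```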
Plot the influence of a single hyperparameter on the training and test data to determine if the estimator is underfitting or overfitting for some hyperparameter values.
For a support vector classifier, gamma is the coefficient of the RBF kernel. It controls the influence of a single example. The larger gamma is, the tighter the support vector is around single points (overfitting the model).
In this visualization we see a definite inflection point around gamma=0.1. At this point the training score climbs rapidly as the SVC memorizes the data, while the cross-validation score begins to decrease as the model cannot generalize to unseen data.
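The same curve can be recomputed with scikit-learn's validation_curve, which the visualizer builds on; the dataset here is an assumed stand-in:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
gammas = np.logspace(-3, 2, 6)   # sweep gamma over several orders of magnitude

train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="gamma", param_range=gammas, cv=5)

for g, tr, te in zip(gammas, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    # At large gamma the training score saturates while the CV score falls off.
    print(f"gamma={g:g} train={tr:.2f} cv={te:.2f}")
```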
Which regularization technique to use? Lasso/L1, Ridge/L2, or ElasticNet L1+L2
Regularization uses a Norm to penalize complexity at a rate, alpha
The higher the alpha, the more the regularization.
Minimizing complexity reduces variance in the model, but increases bias
Goal: select the smallest alpha such that error is minimized
Visualize the tradeoff
Surprising to see: higher alpha increasing error, alpha jumping around, etc.
Embed R2, MSE, etc into the graph - quick reference
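One way to sketch the search is scikit-learn's LassoCV, which tries a range of alphas and keeps the lowest-error one; the data and alpha grid are assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=3)

# Score each candidate alpha by cross-validated error and select the best.
model = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5).fit(X, y)
print("selected alpha:", model.alpha_)
print("R2 on training data:", model.score(X, y))   # quick-reference score
```

An alpha-selection plot draws the error curve over all 30 candidates rather than reporting only the winner.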
The Manifold visualizer provides high dimensional visualization using manifold learning to embed instances described by many dimensions into 2, thus allowing the creation of a scatter plot that shows latent structures in data.
Unlike decomposition methods such as PCA and SVD, manifolds generally use nearest-neighbors approaches to embedding, allowing them to capture non-linear structures that would be otherwise lost.
The projections that are produced can then be analyzed for noise or separability to determine if it is possible to create a decision space in the data.
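A minimal sketch of such an embedding with scikit-learn's Isomap, a nearest-neighbors manifold method (the digits dataset and neighbor count are assumptions):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)   # instances described by 64 dimensions

# Nearest-neighbors embedding of 500 instances into 2 dimensions,
# suitable for a scatter plot colored by y to inspect separability.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X[:500])
print(embedding.shape)
```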
Estimators learn from data
Have fit and predict methods
Transformers transform data
Have a transform method
Visualizers can be estimators or transformers
Generally have draw, finalize, and poof methods
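A toy illustration of that pattern (a hypothetical visualizer class, not Yellowbrick's actual base class):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting (an assumption here)
import matplotlib.pyplot as plt


class ClassBalanceSketch:
    """Hypothetical visualizer: fit() learns from data like an estimator,
    while draw()/finalize()/poof() handle the rendering."""

    def fit(self, y):
        self.counts_ = {label: list(y).count(label) for label in set(y)}
        self.draw()
        return self  # fit returns self, mirroring scikit-learn estimators

    def draw(self):
        self.ax_ = plt.gca()
        self.ax_.bar(list(self.counts_), list(self.counts_.values()))

    def finalize(self):
        self.ax_.set_title("Class Balance")

    def poof(self, path=None):
        self.finalize()
        if path:
            plt.savefig(path)


viz = ClassBalanceSketch().fit(["a", "a", "b"])
viz.poof()
print(viz.counts_)
```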