Machine Learning
Case Study
1994 U.S. CENSUS
Objective of Project
Machine Learning Presentation | Introduction
In this project, several supervised algorithms were employed to accurately model individuals' income using data
collected from the 1994 U.S. Census. The best candidate algorithm was chosen from preliminary results and further
optimized to best model the data.
The goal in this project was to construct a model that accurately predicts whether an individual makes more than
$50,000. The first ten rows of the data set are shown below:
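A minimal loading sketch (the file name "census.csv" and the "<=50K"/">50K" encoding of the income column are assumptions, not details taken from the presentation):

```python
# Load the 1994 census extract and inspect the first ten rows.
# "census.csv" and the income encoding below are assumed, not stated in the slides.
import pandas as pd

data = pd.read_csv("census.csv")
print(data.head(10))                                 # the first ten rows of the data set

# Prediction target: does the individual make more than $50,000?
y = (data["income"] == ">50K").astype(int)
X = pd.get_dummies(data.drop(columns=["income"]))    # one-hot encode categorical features
```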
Top Findings
Machine Learning Presentation | Overview
• AdaBoost is an appropriate model for this data - Based on the results, the AdaBoost model is most appropriate for the task of identifying individuals who make more than $50,000
• Highest F-Score out of several models tested
• Low prediction/training time
• Highly suitable for binary classifications
• Model generalizes well to the test set at 10,000 observations and beyond
• High AUC (Area Under the Curve) score of 0.90
• The appropriate decision/classification threshold for the test yields a true positive rate of about 0.97 and a false positive rate of about 0.58
• Capital Loss, Age, and Capital Gain Have the Most Effect on the Prediction – Of the top five features with the most impact on the accuracy of the model, these three had weights above 0.05
• Relatively High Model Scores:
• F-Score – 85%, the harmonic mean of precision and recall [2 * (Precision * Recall) / (Precision + Recall)]
• Accuracy – 86% of predictions were correct [(True Positives + True Negatives) / All Values]
Performance Metrics for Three Supervised Learning Models
Machine Learning Presentation | Model Evaluation
• Based on the results, the AdaBoost model is most appropriate for the task of identifying individuals that make more than $50,000. This conclusion is based
on the following reasons:
• Out of the three models shown above, the AdaBoost model has the highest F-score on the testing set when 100% of the training data is used.
• The AdaBoost model has a low prediction/training time, especially when compared to the SVC model.
• AdaBoost is highly suitable for the data since the label is binary (two classes).
The AdaBoost model has the highest F-score on the testing set.
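The comparison summarized above can be reproduced along these lines. This is a hedged sketch rather than the presentation's actual code: it assumes X and y were prepared as in the loading sketch, and LogisticRegression stands in for the unnamed third model.

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# X, y: encoded features and binary income labels from the loading sketch (assumed).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Three candidate supervised models; the third model is an assumption.
candidates = [AdaBoostClassifier(random_state=0), SVC(), LogisticRegression(max_iter=1000)]

for model in candidates:
    start = time.time()
    model.fit(X_train, y_train)                  # training time is part of the comparison
    predictions = model.predict(X_test)          # so is prediction time
    elapsed = time.time() - start
    print(f"{type(model).__name__}: "
          f"F-score={f1_score(y_test, predictions):.3f}, time={elapsed:.1f}s")
```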
Confusion Matrix
Machine Learning Presentation | Scoring the Chosen Model
• Relatively High Model Scores:
• F-Score – 85%, the harmonic mean of precision and recall [2 * (Precision * Recall) / (Precision + Recall)]
• Accuracy – 86% of predictions were correct [(True Positives + True Negatives) / All Values]
• Precision – 85% of positive predictions were correct [True Positives/(True Positives + False Positives)]
• Recall – 86% of positive cases were true positives [True Positives/(True Positives + False Negatives)]
Category          Count
True Positive      6431
True Negative      1342
False Positive      863
False Negative      409
These numbers are based on predictions generated by an optimized AdaBoost algorithm.
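A sketch of how the optimized model and the counts above could be produced with scikit-learn. The parameter grid is hypothetical; the presentation does not list the tuned hyperparameters.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# X_train, X_test, y_train, y_test: the split from the earlier sketch (assumed).
# The grid below is illustrative only; the tuned values are not stated in the slides.
grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 200, 500],
                                "learning_rate": [0.1, 0.5, 1.0]},
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_

predictions = best_model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
print(f"accuracy={accuracy_score(y_test, predictions):.2f}  "
      f"precision={precision_score(y_test, predictions):.2f}  "
      f"recall={recall_score(y_test, predictions):.2f}  "
      f"F-score={f1_score(y_test, predictions):.2f}")
```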
ROC (Receiver Operating Characteristics) Curve
Machine Learning Presentation | ROC Curve for the Chosen Model
• The appropriate decision/classification threshold for the test yields a true positive rate of about 97% and a false positive rate of about 58%
• Relatively high AUC (Area Under Curve) score:
• The AUC score is 90%
• The AUC scores using 5-fold cross-validation are 92%, 91%, 92%, 92%, and 91%
The appropriate cut-off for the test is around 97% for true positives and 58% for false positives.
These numbers are based on predictions generated by an optimized AdaBoost algorithm.
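A hedged sketch of the ROC/AUC evaluation, assuming the fitted model and data splits from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_score

# best_model, X, y, X_test, y_test: from the earlier sketches (assumed).
probabilities = best_model.predict_proba(X_test)[:, 1]      # estimated P(income > $50,000)
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
print(f"AUC on the test set: {roc_auc_score(y_test, probabilities):.2f}")

# 5-fold cross-validated AUC scores, as reported on this slide.
print(cross_val_score(best_model, X, y, cv=5, scoring="roc_auc"))

plt.plot(fpr, tpr, label="AdaBoost")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```

The decision threshold quoted above is read off this curve: each point corresponds to one value in the thresholds array, trading the true positive rate against the false positive rate.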
Learning Curve
Machine Learning Presentation | Bias and Variance
• The Learning Curve is satisfactory for the following reasons:
• There are reasonable prediction accuracies—probably because there are an adequate number of features leading to an acceptable model
complexity.
• Variance is low, so over-fitting is not prevalent and the model will generalize well on the test set.
• There is low bias.
• The regularization parameter is adequate.
• 10,000 Observations Are Adequate for Optimal Accuracy - Above roughly 10,000 observations, adding more observations will not lead to a more accurate model because the training and testing curves converge and remain converged.
Another version of this visualization would use Mean Squared Error (MSE) as the metric instead of accuracy.
Accuracy for the test set starts low because the model is unlikely to generalize from so few observations. Around 10,000 observations is adequate for optimal accuracy.
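The learning curve itself can be generated along these lines (a sketch assuming the fitted model and full data set from the earlier sketches; swapping the scoring argument for a squared-error metric would give the MSE variant mentioned above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# best_model, X, y: from the earlier sketches (assumed).
train_sizes, train_scores, test_scores = learning_curve(
    best_model, X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, test_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("Number of training observations")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```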
Normalized Weights for Five Most Predictive Features
Machine Learning Presentation | Feature Importance
• Capital Loss, Age, and Capital Gain Have the Most Effect on the Prediction – Of the top five features with the most impact on the accuracy of the model, these three had weights above 0.05
With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories.
In models related to predictors of sales, approaches like these can highlight the features that are most important to customers.
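A sketch of how the normalized weights could be extracted; AdaBoost exposes impurity-based importances through its feature_importances_ attribute, and the fitted model and encoded features are assumed from the earlier sketches.

```python
import pandas as pd
import matplotlib.pyplot as plt

# best_model: the tuned AdaBoost classifier; X: one-hot encoded feature DataFrame (assumed).
importances = pd.Series(best_model.feature_importances_, index=X.columns)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)                     # normalized weights of the five most predictive features

top_five.plot(kind="bar")
plt.ylabel("Normalized weight")
plt.show()
```

Given the caveat above about correlated features and high-cardinality categoricals, permutation importance (sklearn.inspection.permutation_importance) is a common cross-check on these impurity-based weights.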


Editor's Notes

  • #5  Based on the results, the AdaBoost model is most appropriate for the task of identifying individuals who make more than $50,000. I reached this conclusion for the following reasons:
    • Out of the three models, the AdaBoost model has the highest F-score on the testing set when 100% of the training data is used.
    • The AdaBoost model has a low prediction/training time, especially when compared to the SVC model.
    • AdaBoost is highly suitable for the data since the label is binary (two classes).
  • #6 https://medium.com/@djocz/confusion-matrix-aint-that-confusing-d29e18403327 A confusion matrix gives us a better idea of what our classification model is predicting right and what types of errors it is making.
    https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall". True Negative Rate: When it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus the False Positive Rate; also known as "Specificity". Null Error Rate: This is how often you would be wrong if you always predicted the majority class. (In our example, the null error rate would be 60/165 = 0.36 because if you always predicted yes, you would only be wrong for the 60 "no" cases.) This can be a useful baseline metric to compare your classifier against. However, the best classifier for a particular application will sometimes have a higher error rate than the null error rate, as demonstrated by the Accuracy Paradox. Cohen's Kappa: This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model will have a high Kappa score if there is a big difference between the accuracy and the null error rate. (More details about Cohen's Kappa.) F Score: This is a weighted average of the true positive rate (recall) and precision. (More details about the F Score.)
    https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62 It is difficult to compare two models with low precision and high recall or vice versa. So to make them comparable, we use the F-score. The F-score helps to measure recall and precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing the extreme values more.
    https://www.geeksforgeeks.org/confusion-matrix-machine-learning/ High recall, low precision: This means that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives. Low recall, high precision: This shows that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP). F-measure: Since we have two measures (precision and recall), it helps to have a measurement that represents both of them. We calculate an F-measure which uses the harmonic mean in place of the arithmetic mean as it punishes the extreme values more. The F-measure will always be nearer to the smaller value of precision or recall.
    https://medium.com/datadriveninvestor/simplifying-the-confusion-matrix-aa1fa0b0fc35 Udacity note: Recap of accuracy, precision, recall. Accuracy measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (the number of test data points). Precision tells us what proportion of messages we classified as spam actually were spam. It is a ratio of true positives (words classified as spam, and which are actually spam) to all positives (all words classified as spam, irrespective of whether that was the correct classification); in other words, it is the ratio [True Positives/(True Positives + False Positives)]. Recall (sensitivity) tells us what proportion of messages that actually were spam were classified by us as spam. It is a ratio of true positives (words classified as spam, and which are actually spam) to all the words that were actually spam; in other words, it is the ratio [True Positives/(True Positives + False Negatives)]. For classification problems that are skewed in their classification distributions, like in our case (for example, if we had 100 text messages and only 2 were spam and the rest 98 weren't), accuracy by itself is not a very good metric. We could classify 90 messages as not spam (including the 2 that were spam but we classify them as not spam, hence they would be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score. For such cases, precision and recall come in very handy. These two metrics can be combined to get the F1 score, which is the weighted average (harmonic mean) of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score (we take the harmonic mean as we are dealing with ratios).
  • #7 https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5 ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with disease and no disease.
    https://acutecaretesting.org/en/articles/roc-curves-what-are-they-and-how-are-they-used ROC curves are frequently used to show in a graphical way the connection/trade-off between clinical sensitivity and specificity for every possible cut-off for a test or a combination of tests. In addition, the area under the ROC curve gives an idea about the benefit of using the test(s) in question. ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a test. The best cut-off has the highest true positive rate together with the lowest false positive rate. As the area under an ROC curve is a measure of the usefulness of a test in general, where a greater area means a more useful test, the areas under ROC curves are used to compare the usefulness of tests. The cut-off determines the clinical sensitivity (fraction of true positives to all with disease) and specificity (fraction of true negatives to all without disease). The AUC is the area under the ROC curve. This score gives us a good idea of how well the model performs. When you change the cut-off, you will get other values for true positives and negatives and false positives and negatives, but the number of all with disease is the same and so is the number of all without disease. Thus you will get an increase in sensitivity or specificity at the expense of lowering the other parameter when you change the cut-off [1]. An ROC curve shows the relationship between clinical sensitivity and specificity for every possible cut-off. The ROC curve is a graph with:
    - The x-axis showing 1 – specificity (= false positive fraction = FP/(FP+TN))
    - The y-axis showing sensitivity (= true positive fraction = TP/(TP+FN))
    https://medium.com/greyatom/lets-learn-about-auc-roc-curve-4a94b4d88152 The proportion of patients that were identified correctly to have the disease (i.e. true positives) out of the total number of patients who actually have the disease is called sensitivity or recall. The proportion of patients that were identified correctly to not have the disease (i.e. true negatives) out of the total number of patients who do not have the disease is called specificity. When sensitivity increases, specificity decreases, and vice versa. In an ROC graph, when the sensitivity increases, (1 – specificity) will also increase.
    https://www.theanalysisfactor.com/what-is-an-roc-curve/ A common usage in medical studies is to run an ROC to see how much better a single continuous predictor (a "biomarker") can predict disease status compared to chance.
    https://www.medcalc.org/manual/roc-curves.php The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal). Sensitivity (with optional 95% Confidence Interval): probability that a test result will be positive when the disease is present (true positive rate). Specificity (with optional 95% Confidence Interval): probability that a test result will be negative when the disease is not present (true negative rate).
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
  • #8 https://www.dataquest.io/blog/learning-curves-machine-learning/ When the training set size is 1, we can see that the MSE for the training set is 0. This is normal behavior, since the model has no problem fitting a single data point perfectly. So when tested upon the same data point, the prediction is perfect. But when tested on the validation set (which has 1914 instances), the MSE rockets up to roughly 423.4. This relatively high value is the reason we restrict the y-axis range between 0 and 40; this enables us to read most MSE values with precision. Such a high value is expected, since it's extremely unlikely that a model trained on a single data point can generalize accurately to 1914 new instances it hasn't seen in training. From 500 training data points onward, the validation MSE stays roughly the same. This tells us something extremely important: adding more training data points won't lead to significantly better models. So instead of wasting time (and possibly money) collecting more data, we need to try something else, like switching to an algorithm that can build more complex models. To avoid a misconception here, it's important to notice that what really won't help is adding more instances (rows) to the training data. Adding more features, however, is a different thing and is very likely to help because it will increase the complexity of our current model. To find the answer, we need to look at the training error. If the training error is very low, it means that the training data is fitted very well by the estimated model. If the model fits the training data very well, it means it has low bias with respect to that set of data. If the training error is high, it means that the training data is not fitted well enough by the estimated model. If the model fails to fit the training data well, it means it has high bias with respect to that set of data. If the variance is high, then the model fits the training data too well. When the training data is fitted too well, the model will have trouble generalizing on data it hasn't seen in training. When such a model is tested on its training set, and then on a validation set, the training error will be low and the validation error will generally be high.
    https://medium.com/@datalesdatales/why-you-should-be-plotting-learning-curves-in-your-next-machine-learning-project-221bae60c53 What can you do if your model performance is not so good? There are several things you can do:
    - Get more data
    - Try a smaller set of features (reduce model complexity)
    - Try adding/creating more features (increase model complexity)
    - Try decreasing the regularisation parameter λ (increase model complexity)
    - Try increasing the regularisation parameter λ (decrease model complexity)
    If your learning curves look like this, it means your model is suffering from high bias. Both the training and validation (or cross-validation) error is high and it doesn't seem to improve with more training examples. The fact that your model is performing similarly badly for both the training and validation sets suggests that the model is underfitting the data and therefore has high bias. What can you do if your model performance is not so good? (pt. II) Cool, so you have now identified what's going on with your model and are in a great position to decide what to do next. If your model has high bias, you should:
    - Try adding/creating more features
    - Try decreasing the regularisation parameter λ
    These two things will increase your model complexity and therefore will contribute to solving your underfitting problem. If your model has high variance, you should:
    - Get more data
    - Try a smaller set of features
    - Try increasing the regularisation parameter λ
    When your model is overfitting the training data, you can either try reducing its complexity or getting more data. As you can see above, the learning-curves chart of a high-variance model suggests that, with enough data, the validation and training error will end up closer to each other. An intuitive explanation for this is that if you give your model more data, the gap between your model's complexity and the underlying complexity in your data will get smaller and smaller.
    https://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting.
  • #9 https://blog.datadive.net/selecting-good-features-part-iii-random-forests/ https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html https://chrisalbon.com/machine_learning/trees_and_forests/feature_selection_using_random_forest/ Random forests are often used for feature selection in a data science workflow. The reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, i.e., the mean decrease in impurity (Gini impurity) over all trees. Nodes with the greatest decrease in impurity happen at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
    https://explained.ai/rf-importance/index.html Your feature importance measures will only be reliable if your model is trained with suitable hyper-parameters. For example, if you build a model of house prices, knowing which features are most predictive of price tells us which features people are willing to pay for. Feature importance is the most useful interpretation tool, and data scientists regularly examine model parameters (such as the coefficients of linear models) to identify important features. Landmines include not normalizing input data, properly interpreting coefficients when using Lasso or Ridge regularization, and avoiding highly correlated variables (such as country and country_name). To learn more about the difficulties of interpreting regression coefficients, see Statistical Modeling: The Two Cultures (2001) by Leo Breiman (co-creator of Random Forests). In order to explain feature selection, we added a column of random numbers. (Any feature less important than a random column is junk and should be tossed out.) Spearman's correlation is the same thing as converting two variables to rank values and then running a standard Pearson's correlation on those ranked variables. Spearman's is nonparametric and does not assume a linear relationship between the variables; it looks for monotonic relationships. You can visualize this more easily using plot_corr_heatmap().