Machine Learning
Case Study
1994 U.S. CENSUS
Objective of Project
Machine Learning Presentation | Introduction
In this project, several supervised algorithms were employed to accurately model individuals' income using data
collected from the 1994 U.S. Census. The best candidate algorithm was chosen from preliminary results and further
optimized to best model the data.
The goal in this project was to construct a model that accurately predicts whether an individual makes more than
$50,000. The first ten rows of the data set are shown below:
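A minimal loading sketch (the file name "census.csv" and the "<=50K"/">50K" encoding of the income column are assumptions, not details taken from the presentation):

```python
# Load the 1994 census extract and inspect the first ten rows.
# "census.csv" and the income encoding below are assumed, not stated in the slides.
import pandas as pd

data = pd.read_csv("census.csv")
print(data.head(10))                                 # the first ten rows of the data set

# Prediction target: does the individual make more than $50,000?
y = (data["income"] == ">50K").astype(int)
X = pd.get_dummies(data.drop(columns=["income"]))    # one-hot encode categorical features
```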
Top Findings
Machine Learning Presentation | Overview
• AdaBoost is an appropriate model for this data - Based on the results, the AdaBoost model is most appropriate for the task of identifying individuals who make more than $50,000
• Highest F-Score out of several models tested
• Low prediction/training time
• Highly suitable for binary classifications
• Model generalizes well to the test set at 10,000 observations and beyond
• High AUC (Area Under the Curve) score of 0.90
• The appropriate decision/classification threshold for the test yields a true positive rate of about 0.97 and a false positive rate of about 0.58
• Capital Loss, Age, and Capital Gain Have the Most Effect on the Prediction – Of the top five features with the most impact on the accuracy of the model, these three had weights above 0.05
• Relatively High Model Scores:
• F-Score – 85%, the harmonic mean of precision and recall [2 * (Precision * Recall) / (Precision + Recall)]
• Accuracy – 86% of predictions were correct [(True Positives + True Negatives) / All Values]
Performance Metrics for Three Supervised Learning Models
Machine Learning Presentation | Model Evaluation
• Based on the results, the AdaBoost model is most appropriate for the task of identifying individuals that make more than $50,000. This conclusion is based
on the following reasons:
• Out of the three models shown above, the AdaBoost model has the highest F-score on the testing set when 100% of the training data is used.
• The AdaBoost model has a low prediction/training time, especially when compared to the SVC model.
• AdaBoost is highly suitable for the data since the label is binary (two classes).
The AdaBoost model has the highest F-score on the testing set.
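The comparison summarized above can be reproduced along these lines. This is a hedged sketch rather than the presentation's actual code: it assumes X and y were prepared as in the loading sketch, and LogisticRegression stands in for the unnamed third model.

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# X, y: encoded features and binary income labels from the loading sketch (assumed).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Three candidate supervised models; the third model is an assumption.
candidates = [AdaBoostClassifier(random_state=0), SVC(), LogisticRegression(max_iter=1000)]

for model in candidates:
    start = time.time()
    model.fit(X_train, y_train)                  # training time is part of the comparison
    predictions = model.predict(X_test)          # so is prediction time
    elapsed = time.time() - start
    print(f"{type(model).__name__}: "
          f"F-score={f1_score(y_test, predictions):.3f}, time={elapsed:.1f}s")
```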
Confusion Matrix
Machine Learning Presentation | Scoring the Chosen Model
• Relatively High Model Scores:
• F-Score – 85%, the harmonic mean of precision and recall [2 * (Precision * Recall) / (Precision + Recall)]
• Accuracy – 86% of predictions were correct [(True Positives + True Negatives) / All Values]
• Precision – 85% of positive predictions were correct [True Positives/(True Positives + False Positives)]
• Recall – 86% of positive cases were true positives [True Positives/(True Positives + False Negatives)]
Category          Count
True Positive      6431
True Negative      1342
False Positive      863
False Negative      409
These numbers are based on predictions generated by an optimized AdaBoost algorithm.
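A sketch of how the optimized model and the counts above could be produced with scikit-learn. The parameter grid is hypothetical; the presentation does not list the tuned hyperparameters.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# X_train, X_test, y_train, y_test: the split from the earlier sketch (assumed).
# The grid below is illustrative only; the tuned values are not stated in the slides.
grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    param_grid={"n_estimators": [50, 200, 500],
                                "learning_rate": [0.1, 0.5, 1.0]},
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_

predictions = best_model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
print(f"accuracy={accuracy_score(y_test, predictions):.2f}  "
      f"precision={precision_score(y_test, predictions):.2f}  "
      f"recall={recall_score(y_test, predictions):.2f}  "
      f"F-score={f1_score(y_test, predictions):.2f}")
```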
ROC (Receiver Operating Characteristics) Curve
Machine Learning Presentation | ROC Curve for the Chosen Model
• The appropriate decision/classification threshold for the test yields a true positive rate of about 97% and a false positive rate of about 58%
• Relatively high AUC (Area Under Curve) score:
• The AUC score is 90%
• The AUC scores using 5-fold cross-validation are 92%, 91%, 92%, 92%, and 91%
The appropriate cut-off for the test is around 97% for true positives and 58% for false positives.
These numbers are based on predictions generated by an optimized AdaBoost algorithm.
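A hedged sketch of the ROC/AUC evaluation, assuming the fitted model and data splits from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_score

# best_model, X, y, X_test, y_test: from the earlier sketches (assumed).
probabilities = best_model.predict_proba(X_test)[:, 1]      # estimated P(income > $50,000)
fpr, tpr, thresholds = roc_curve(y_test, probabilities)
print(f"AUC on the test set: {roc_auc_score(y_test, probabilities):.2f}")

# 5-fold cross-validated AUC scores, as reported on this slide.
print(cross_val_score(best_model, X, y, cv=5, scoring="roc_auc"))

plt.plot(fpr, tpr, label="AdaBoost")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```

The decision threshold quoted above is read off this curve: each point corresponds to one value in the thresholds array, trading the true positive rate against the false positive rate.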
Learning Curve
Machine Learning Presentation | Bias and Variance
• The Learning Curve is satisfactory for the following reasons:
• There are reasonable prediction accuracies—probably because there are an adequate number of features leading to an acceptable model
complexity.
• Variance is low, so over-fitting is not prevalent and the model will generalize well on the test set.
• There is low bias.
• The regularization parameter is adequate.
• 10,000 Observations Are Adequate for Optimal Accuracy - Above roughly 10,000 observations, adding more observations will not lead to a more accurate model because the training and testing curves converge and remain converged.
Another version of this visualization would use Mean Squared Error (MSE) as the metric instead of accuracy.
Accuracy for the test set starts low because the model is unlikely to generalize from so few observations. Around 10,000 observations is adequate for optimal accuracy.
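The learning curve itself can be generated along these lines (a sketch assuming the fitted model and full data set from the earlier sketches; swapping the scoring argument for a squared-error metric would give the MSE variant mentioned above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# best_model, X, y: from the earlier sketches (assumed).
train_sizes, train_scores, test_scores = learning_curve(
    best_model, X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 10))

plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.plot(train_sizes, test_scores.mean(axis=1), label="cross-validation accuracy")
plt.xlabel("Number of training observations")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```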
Normalized Weights for Five Most Predictive Features
Machine Learning Presentation | Feature Importance
• Capital Loss, Age, and Capital Gain Have the Most Effect on the Prediction – Of the top five features with the most impact on the accuracy of the model, these three had weights above 0.05
With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories.
In models related to predictors of sales, approaches like these can highlight the features that are most important to customers.
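A sketch of how the normalized weights could be extracted; AdaBoost exposes impurity-based importances through its feature_importances_ attribute, and the fitted model and encoded features are assumed from the earlier sketches.

```python
import pandas as pd
import matplotlib.pyplot as plt

# best_model: the tuned AdaBoost classifier; X: one-hot encoded feature DataFrame (assumed).
importances = pd.Series(best_model.feature_importances_, index=X.columns)
top_five = importances.sort_values(ascending=False).head(5)
print(top_five)                     # normalized weights of the five most predictive features

top_five.plot(kind="bar")
plt.ylabel("Normalized weight")
plt.show()
```

Given the caveat above about correlated features and high-cardinality categoricals, permutation importance (sklearn.inspection.permutation_importance) is a common cross-check on these impurity-based weights.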


Editor's Notes

  • #5  Based on the results, the AdaBoost model is most appropriate for the task of identifying individuals who make more than $50,000. I reached this conclusion for the following reasons:
    • Out of the three models, the AdaBoost model has the highest F-score on the testing set when 100% of the training data is used.
    • The AdaBoost model has a low prediction/training time, especially when compared to the SVC model.
    • AdaBoost is highly suitable for the data since the label is binary (two classes).
  • #6 https://medium.com/@djocz/confusion-matrix-aint-that-confusing-d29e18403327 A confusion matrix gives us a better idea of what our classification model is predicting right and what types of errors it is making.
    https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ True Positive Rate: When it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall". True Negative Rate: When it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus the False Positive Rate; also known as "Specificity". Null Error Rate: This is how often you would be wrong if you always predicted the majority class. (In our example, the null error rate would be 60/165 = 0.36 because if you always predicted yes, you would only be wrong for the 60 "no" cases.) This can be a useful baseline metric to compare your classifier against. However, the best classifier for a particular application will sometimes have a higher error rate than the null error rate, as demonstrated by the Accuracy Paradox. Cohen's Kappa: This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model will have a high Kappa score if there is a big difference between the accuracy and the null error rate. (More details about Cohen's Kappa.) F Score: This is a weighted average of the true positive rate (recall) and precision. (More details about the F Score.)
    https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62 It is difficult to compare two models with low precision and high recall or vice versa. So to make them comparable, we use the F-score. The F-score helps to measure recall and precision at the same time. It uses the harmonic mean in place of the arithmetic mean, punishing the extreme values more.
    https://www.geeksforgeeks.org/confusion-matrix-machine-learning/ High recall, low precision: This means that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives. Low recall, high precision: This shows that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP). F-measure: Since we have two measures (precision and recall), it helps to have a measurement that represents both of them. We calculate an F-measure which uses the harmonic mean in place of the arithmetic mean as it punishes the extreme values more. The F-measure will always be nearer to the smaller value of precision or recall.
    https://medium.com/datadriveninvestor/simplifying-the-confusion-matrix-aa1fa0b0fc35 Udacity note: Recap of accuracy, precision, recall. Accuracy measures how often the classifier makes the correct prediction. It's the ratio of the number of correct predictions to the total number of predictions (the number of test data points). Precision tells us what proportion of messages we classified as spam actually were spam. It is a ratio of true positives (words classified as spam, and which are actually spam) to all positives (all words classified as spam, irrespective of whether that was the correct classification); in other words, it is the ratio [True Positives/(True Positives + False Positives)]. Recall (sensitivity) tells us what proportion of messages that actually were spam were classified by us as spam. It is a ratio of true positives (words classified as spam, and which are actually spam) to all the words that were actually spam; in other words, it is the ratio [True Positives/(True Positives + False Negatives)]. For classification problems that are skewed in their classification distributions, like in our case (for example, if we had 100 text messages and only 2 were spam and the rest 98 weren't), accuracy by itself is not a very good metric. We could classify 90 messages as not spam (including the 2 that were spam but we classify them as not spam, hence they would be false negatives) and 10 as spam (all 10 false positives) and still get a reasonably good accuracy score. For such cases, precision and recall come in very handy. These two metrics can be combined to get the F1 score, which is the weighted average (harmonic mean) of the precision and recall scores. This score can range from 0 to 1, with 1 being the best possible F1 score (we take the harmonic mean as we are dealing with ratios).
  • #7 https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5 ROC is a probability curve and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with disease and no disease.
    https://acutecaretesting.org/en/articles/roc-curves-what-are-they-and-how-are-they-used ROC curves are frequently used to show in a graphical way the connection/trade-off between clinical sensitivity and specificity for every possible cut-off for a test or a combination of tests. In addition, the area under the ROC curve gives an idea about the benefit of using the test(s) in question. ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a test. The best cut-off has the highest true positive rate together with the lowest false positive rate. As the area under an ROC curve is a measure of the usefulness of a test in general, where a greater area means a more useful test, the areas under ROC curves are used to compare the usefulness of tests. The cut-off determines the clinical sensitivity (fraction of true positives to all with disease) and specificity (fraction of true negatives to all without disease). The AUC is the area under the ROC curve. This score gives us a good idea of how well the model performs. When you change the cut-off, you will get other values for true positives and negatives and false positives and negatives, but the number of all with disease is the same and so is the number of all without disease. Thus you will get an increase in sensitivity or specificity at the expense of lowering the other parameter when you change the cut-off [1]. An ROC curve shows the relationship between clinical sensitivity and specificity for every possible cut-off. The ROC curve is a graph with:
    - The x-axis showing 1 – specificity (= false positive fraction = FP/(FP+TN))
    - The y-axis showing sensitivity (= true positive fraction = TP/(TP+FN))
    https://medium.com/greyatom/lets-learn-about-auc-roc-curve-4a94b4d88152 The proportion of patients that were identified correctly to have the disease (i.e. true positives) out of the total number of patients who actually have the disease is called sensitivity or recall. The proportion of patients that were identified correctly to not have the disease (i.e. true negatives) out of the total number of patients who do not have the disease is called specificity. When sensitivity increases, specificity decreases, and vice versa. In an ROC graph, when the sensitivity increases, (1 – specificity) will also increase.
    https://www.theanalysisfactor.com/what-is-an-roc-curve/ A common usage in medical studies is to run an ROC to see how much better a single continuous predictor (a "biomarker") can predict disease status compared to chance.
    https://www.medcalc.org/manual/roc-curves.php The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups (diseased/normal). Sensitivity (with optional 95% Confidence Interval): probability that a test result will be positive when the disease is present (true positive rate). Specificity (with optional 95% Confidence Interval): probability that a test result will be negative when the disease is not present (true negative rate).
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
  • #8 https://www.dataquest.io/blog/learning-curves-machine-learning/ When the training set size is 1, we can see that the MSE for the training set is 0. This is normal behavior, since the model has no problem fitting a single data point perfectly. So when tested upon the same data point, the prediction is perfect. But when tested on the validation set (which has 1914 instances), the MSE rockets up to roughly 423.4. This relatively high value is the reason we restrict the y-axis range between 0 and 40; this enables us to read most MSE values with precision. Such a high value is expected, since it's extremely unlikely that a model trained on a single data point can generalize accurately to 1914 new instances it hasn't seen in training. From 500 training data points onward, the validation MSE stays roughly the same. This tells us something extremely important: adding more training data points won't lead to significantly better models. So instead of wasting time (and possibly money) collecting more data, we need to try something else, like switching to an algorithm that can build more complex models. To avoid a misconception here, it's important to notice that what really won't help is adding more instances (rows) to the training data. Adding more features, however, is a different thing and is very likely to help because it will increase the complexity of our current model. To find the answer, we need to look at the training error. If the training error is very low, it means that the training data is fitted very well by the estimated model. If the model fits the training data very well, it means it has low bias with respect to that set of data. If the training error is high, it means that the training data is not fitted well enough by the estimated model. If the model fails to fit the training data well, it means it has high bias with respect to that set of data. If the variance is high, then the model fits the training data too well. When the training data is fitted too well, the model will have trouble generalizing on data it hasn't seen in training. When such a model is tested on its training set, and then on a validation set, the training error will be low and the validation error will generally be high.
    https://medium.com/@datalesdatales/why-you-should-be-plotting-learning-curves-in-your-next-machine-learning-project-221bae60c53 What can you do if your model performance is not so good? There are several things you can do:
    - Get more data
    - Try a smaller set of features (reduce model complexity)
    - Try adding/creating more features (increase model complexity)
    - Try decreasing the regularisation parameter λ (increase model complexity)
    - Try increasing the regularisation parameter λ (decrease model complexity)
    If your learning curves look like this, it means your model is suffering from high bias. Both the training and validation (or cross-validation) error is high and it doesn't seem to improve with more training examples. The fact that your model is performing similarly badly for both the training and validation sets suggests that the model is underfitting the data and therefore has high bias. What can you do if your model performance is not so good? (pt. II) Cool, so you have now identified what's going on with your model and are in a great position to decide what to do next. If your model has high bias, you should:
    - Try adding/creating more features
    - Try decreasing the regularisation parameter λ
    These two things will increase your model complexity and therefore will contribute to solving your underfitting problem. If your model has high variance, you should:
    - Get more data
    - Try a smaller set of features
    - Try increasing the regularisation parameter λ
    When your model is overfitting the training data, you can either try reducing its complexity or getting more data. As you can see above, the learning-curves chart of a high-variance model suggests that, with enough data, the validation and training error will end up closer to each other. An intuitive explanation for this is that if you give your model more data, the gap between your model's complexity and the underlying complexity in your data will get smaller and smaller.
    https://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent overfitting.
  • #9 https://blog.datadive.net/selecting-good-features-part-iii-random-forests/ https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html https://chrisalbon.com/machine_learning/trees_and_forests/feature_selection_using_random_forest/ Random forests are often used for feature selection in a data science workflow. The reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, i.e., the mean decrease in impurity (Gini impurity) over all trees. Nodes with the greatest decrease in impurity happen at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
    https://explained.ai/rf-importance/index.html Your feature importance measures will only be reliable if your model is trained with suitable hyper-parameters. For example, if you build a model of house prices, knowing which features are most predictive of price tells us which features people are willing to pay for. Feature importance is the most useful interpretation tool, and data scientists regularly examine model parameters (such as the coefficients of linear models) to identify important features. Landmines include not normalizing input data, properly interpreting coefficients when using Lasso or Ridge regularization, and avoiding highly correlated variables (such as country and country_name). To learn more about the difficulties of interpreting regression coefficients, see Statistical Modeling: The Two Cultures (2001) by Leo Breiman (co-creator of Random Forests). In order to explain feature selection, we added a column of random numbers. (Any feature less important than a random column is junk and should be tossed out.) Spearman's correlation is the same thing as converting two variables to rank values and then running a standard Pearson's correlation on those ranked variables. Spearman's is nonparametric and does not assume a linear relationship between the variables; it looks for monotonic relationships. You can visualize this more easily using plot_corr_heatmap().