1. Two datasets were merged and a response variable was created to classify customers as having good or bad credit based on their delinquency levels. Several data preprocessing steps were applied including median imputation, variable reduction, and discretization of variables.
2. Logistic regression with backward selection was used to identify the top twelve predictive variables. The model was validated on holdout data and showed consistent error rates.
3. The model was used to estimate profits by classifying customers and applying a cost-benefit calculation with various default probability cutoffs, finding an optimal cutoff of 0.21.
- The document discusses predicting the age of abalone based on its physical characteristics using statistical analysis.
- It analyzes a dataset containing measurements of 4,177 abalone to identify predictors of age, which is measured by the number of rings (rings + 1 = age).
- Principal component analysis is used to reduce the dimensionality of the seven predictor variables which include shell measurements and weight measurements. The first principal component best represents overall abalone size when using the correlation matrix rather than the covariance matrix.
1) The document discusses statistical process control charts analyzing defects per day for a manufacturing process. Charts show the process was initially in control but trends out of control later, suggesting the need for process adjustments.
2) Control charts were created for individual variables and together using multivariate methods. Most charts showed increasing out of control signals over time, indicating an unstable process in need of calibration to reduce unwanted variability.
3) Recommendations include eliminating assignable causes of variation, closely monitoring the process, and using multivariate control charts to better detect out of control conditions through analysis of variable interactions.
This study analyzed NFL running back statistics from 2008-2013 to determine when production peaks. A random intercepts model showed peaks at ages 24-26. Player level plots showed younger backs improving and older backs declining. Average points peaked at ages 26-27. Heavier backs peaked earlier from 22-26 while lighter backs were marginally more productive 28-32, with little difference between weight classes from 26-28.
- NFL running backs generally peak in fantasy point production between ages 24-27, with lighter backs peaking slightly later.
- Heavier backs have the highest peak production but decline the fastest after peaking, while lighter backs have a smaller peak but remain more productive later in their careers.
- Based on the data, NFL teams should focus on drafting heavier running backs out of college and look to sign lighter backs to shorter contracts after age 27 rather than re-signing backs based solely on past performance.
This document describes analyses performed on data from the National Health and Aging Trends Study to explore dimension reduction techniques. Principal component analysis and factor analysis were used. PCA yielded the most interpretable results, identifying three components explaining 22% of variance: Balance/Mobility, Household/Family Size, and Basic Care. Factor analysis with different extraction methods also identified three factors, though interpretations varied slightly with cross-loadings. Orthogonal and oblique rotations were compared.
This analysis examines differences in physical attributes between NFL player positions using data from the 2013 NFL Combine. Multivariate analysis of variance finds significant differences between four groups of positions across variables like vertical jump, bench press, and hand size. The most significant differences are between defensive backs/wide receivers and offensive linemen/defensive tackles, with the former jumping higher and having better bench press scores on average. The largest difference is between running backs and offensive linemen/defensive tackles, with the former being shorter on average. The analysis provides evidence that positions requiring more athleticism differ physically from those emphasizing stamina and stability.
The multiple linear regression model aims to predict water cases produced from four predictor variables: run time, downtime, setup time, and efficiency. Preliminary analysis found run time has the highest correlation to water cases. Residual analysis showed non-constant variance, so a square root transformation of water cases was tested but did not improve the model. Further analysis is needed to develop the best-fitting multiple linear regression model.
Detailed illustration of MSA procedures for both variable and attribute studies, analysis of results, and planning for MSA. Complete guidance for planning and implementing MSA.
This document summarizes the analysis of data from a pharmaceutical company to model and predict the output variable (titer) from input variables in a biochemical drug production process. Several statistical models were evaluated including linear regression, random forest, and MARS. The analysis involved developing blackbox models using only controlled input variables, snapshot models using all input variables at each time point, and history models incorporating changes in input variables over time to predict titer values. Model performance was compared using cross-validation.
Chapter 3: Describing, Exploring, and Comparing Data
3.2: Measures of Variation
The document describes various variable selection methods applied to predict violent crime rates using socioeconomic data from US cities. It analyzes a dataset with 95 variables and 807 observations, using several variable selection techniques to determine the most predictive factors of violent crime. These include best random subset selection (BRSS), which approximates best subset selection by randomly selecting variable combinations. BRSS identified factors like immigration, ethnicity, family structure, and income as best predicting violent crime rates. Model performance was evaluated using metrics like R2, and BRSS had strong out-of-sample prediction, outperforming some other common techniques.
The document describes various variable selection methods applied to predict violent crime rates using socioeconomic data from US cities. It analyzes a dataset with 95 variables and 807 observations on income, family structure, ethnicity, and other factors to predict violent crime rates. Several variable selection techniques are applied including forward selection, backward elimination, lasso, elastic net, best random subset selection (BRSS), decision trees, and random forests. BRSS, which approximates best subset selection, identified 15 variables as most predictive of violent crime and had strong out-of-sample performance. Analysis of 1000 training and test splits found that BRSS, random forests, and decision trees consistently outperformed other techniques in terms of out-of-sample predictive accuracy.
This document discusses repeatability and reproducibility in measurement systems. Repeatability refers to the variability from measurements taken by the same person on the same item, and depends on the precision of the measurement equipment. Reproducibility refers to variability from measurements taken by different operators on the same item. The document provides examples of using the range-and-average method and analysis of variance (ANOVA) method to quantify repeatability, reproducibility, and overall measurement system variability.
This document summarizes chapter 3 section 2 of an elementary statistics textbook. It discusses measures of variation, including range, variance, and standard deviation. The standard deviation describes how spread out data values are from the mean and is used to determine consistency and predictability within a specified interval. Several examples demonstrate calculating range, variance, and standard deviation for data sets. Chebyshev's theorem and the empirical rule relate standard deviations to the percentage of values that fall within certain intervals of the mean.
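For reference, the bound this summary alludes to is Chebyshev's theorem, which holds for any distribution (a standard statement, not taken from the document itself):

$$P\left(|X - \mu| < k\sigma\right) \ge 1 - \frac{1}{k^2}, \qquad k > 1$$

For example, k = 2 guarantees that at least 75% of values lie within two standard deviations of the mean; for approximately normal data the empirical rule sharpens this to about 95%.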
Credit Card Fraudulent Transaction Detection Research Paper (Garvit Burad)
Credit Card Fraudulent Transaction Detection research paper using machine learning technologies like Logistic Regression, Random Forest, and Feature Engineering, with various techniques to deal with a highly skewed dataset.
This document discusses measurement system analysis (MSA), which assesses the quality of measurement systems that generate measurements critical to a facility's operations. MSA applies statistical methods to evaluate six key properties (repeatability, reproducibility, accuracy, bias, linearity, stability) that characterize a measurement system's variation. Conducting an MSA is important when a new measuring device is introduced, to compare measurement devices, or when required by a customer's control plan. The document outlines methods like variable and attribute gauge studies to assess a measurement system, and provides guidance on properly planning and conducting an MSA.
This document discusses measurement system analysis (MSA), which assesses the quality of measurement systems that generate measurements critical to a facility's operations. MSA applies statistical methods to evaluate six key properties (repeatability, reproducibility, accuracy, bias, linearity, stability) that characterize a measurement system's variation. Conducting an MSA is important when a new measuring device is introduced, to compare measurement devices, or when required by a customer's control plan. The document outlines methods like variable and attribute gauge studies to evaluate a measurement system, and provides acceptance criteria to determine if a system is acceptable based on the percentage of variation attributed to the gauge.
Measurement systems analysis and a study of ANOVA method (eSAT Journals)
Abstract
Instruments and measurement systems form the base of any process improvement strategy. Widely used QC tools such as SPC depend on sample data taken from processes to track process variation, which in turn depends on the measuring system itself. The purpose of Measurement System Analysis is to qualify a measurement system for use by quantifying its accuracy, precision, and stability, and to minimize their contribution to process variation through inherent tools such as ANOVA. The purpose of this paper is to outline MSA and study the ANOVA method through a real-time shop floor experiment.
Keywords: SPC, Accuracy, Precision, Stability, QC, ANOVA
This document provides an overview of quality control programs in medical laboratories. It defines key terms related to quality control such as accuracy, bias, calibration, standard deviation, and Westgard rules. The Westgard rules are statistical approaches used to detect errors in analytical testing by monitoring quality control sample measurements. The document explains concepts like standard deviation, coefficient of variation, and standard deviation index which are used to statistically analyze quality control sample results.
Practical Data Science: Data Modelling and Presentation (HariniMS1)
This document summarizes a student assignment to predict red wine quality using classification models. It describes using the wine quality dataset from UCI, preprocessing the data, exploring it visually, and training KNN and decision tree classifiers to predict wine quality. Evaluation shows the decision tree model achieved slightly higher accuracy than KNN, particularly when standard scaling was applied during modeling.
International Journal of Computational Engineering Research (IJCER), ijceronline
The document compares four techniques for imputing missing data: mean substitution, median substitution, standard deviation, and linear regression. It analyzes the accuracy of each technique after classifying a dataset into groups of 3, 6, and 9 using k-nearest neighbors classification. The results show that median substitution and standard deviation have higher accuracy than mean substitution and linear regression, especially as the group size increases and amount of missing data decreases. Standard deviation and median substitution on average achieved the highest accuracy rates, around 75-77%, while mean substitution and linear regression were lower, around 66-74%.
This paper advances the Domain Segmentation based on Uncertainty in the Surrogate (DSUS) framework, a novel approach to characterize the uncertainty in surrogates. The leave-one-out cross-validation technique is adopted in the DSUS framework to measure local errors of a surrogate. A method is proposed in this paper to evaluate the performance of the leave-one-out cross-validation errors as local error measures. This method evaluates local errors by comparing: (i) the leave-one-out cross-validation error with (ii) the actual local error estimated within a local hypercube for each training point. The comparison results show that the leave-one-out cross-validation strategy can capture the local errors of a surrogate. The DSUS framework is then applied to key aspects of wind resource assessment and wind farm cost modeling. The uncertainties in the wind farm cost and the wind power potential are successfully characterized, which provides designers/users more confidence when using these models.
This document summarizes the work done by an intern during their summer internship in the Medical Physics Department of Radiology. The intern conducted research to predict cancer outcomes based on breast lesion features. Key work included feature extraction from mammograms, analyzing features to differentiate malignant and benign lesions using ROC analysis and LDA, and exploring features to predict invasive vs. non-invasive cancer. Top predictive features were FWHM ROI, diameter, and margin sharpness. The intern gained skills in medical image analysis, statistical analysis, and evaluating results to identify trends.
This document summarizes a crime analysis project conducted by a university team. The team analyzed multiple data sets to build models predicting crime rates based on factors like population, weather, daylight hours, and economic indicators. They created binary, numeric, and crime ratio models and found the crime ratio model was most statistically significant. The team's analyses found crime rates generally increased with daylight savings time and increased slightly with higher temperatures. Their best model could predict monthly crime totals by city for most crime types except rare crimes like homicide and sexual assault.
The document presents an analysis comparing public and private undergraduate business schools. Some key findings include:
- Private school students on average pay $25,710 more per semester in tuition costs and have total program costs that are $102,000 higher than public schools.
- Public schools have a higher average starting salary per tuition dollar spent per semester ($4.22) compared to private schools.
- Public schools have better average ROI rankings (53 spots higher) indicating a better return on investment compared to private schools.
- On average, private school students spend $82,279 more on tuition than their starting median salary, while public school students make $16,598 more than their total tuition costs in
The analysts predicted the stock price of Apple (AAPL) over the next month, six months, and year. They reviewed three investment strategies and found that simply owning AAPL stock (Strategy 1) was expected to provide the highest return, though Strategy 2 of insuring half the position could lower risk slightly. Strategy 3 was not recommended as it carried much risk for only slight expected returns above the risk-free rate. The analysts advised contacting them with any other questions.
Sweden has a highly developed social democratic welfare state and constitutional monarchy. It has transitioned from an agricultural to a knowledge economy focused on biotechnology, telecommunications, medical technology, banking, and pharmaceuticals. Sweden has a very high human development index and ranks highly on measures of quality of life, education, health, and gender equality. It has a skilled workforce and strong economic institutions that have supported steady economic growth and rising living standards in recent decades.
This document outlines a regression analysis of SAT scores. It begins with an introduction and abstract. The authors then describe building a baseline model and progressively improving the model by testing for non-linearity, heteroskedasticity, normality, and outliers. Their final model explains 33% of the variation in SAT scores. Key limitations are its inability to predict future scores and lack of some relevant variables. The authors conclude more data could improve the model's explanatory power.
This document summarizes the process of creating a regression model to analyze factors that influence NBA players' points per game (PPG). The authors began with a baseline model and tested for non-linearity, heteroskedasticity, normality, and outliers. They refined the model by taking natural logs and inverse transformations of variables. The final model included minutes played, salary, years in college, assists, star status, and cubic forms of minutes and rebounds as significant predictors of PPG. While useful for explanation, the model has limitations like inability to predict future values and could be improved with additional data variables.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
River Forest is a 21-unit cabin community situated on 12 acres along the Toccoa River in Georgia. It features underground utilities, landscaping, and amenities like a riverside kitchen and golf cart trail. Units include duplexes and triplexes priced from $174,000-$225,000, offering profits up to $51,000 per unit. With construction and sales of all units, the potential profit is estimated at $1.4 million. Existing infrastructure like septic systems and wells are already in place.
This document provides an analysis of potential profits from developing riverfront property containing duplexes and triplexes. It details features and floor plans of the units, projected construction costs, the strong recent sales market with average prices from $275,000-$260,000, and estimates total potential profits of $1.8 million from selling 6 duplexes and 3 triplexes based on costs of $2.7 million and sales of $4.5 million.
This document outlines the constitution for the Economics and Social Science Club of Kennesaw State University. It establishes the club's name, purpose of promoting cultural and academic exchange, membership as undergraduate students, elected officer positions and their duties, requirements for meetings and voting, processes for officer replacement and amendments, and policies regarding nondiscrimination, anti-hazing, and agreement to follow university rules. It also provides meeting dates for the Student Activities Budget Advisory Committee.
A Path to Profit via Logistic Regression
John Michael Croft
Advising Faculty: Dr. Priestley
Department of Statistics and Analytical Sciences
Step 2: Create Response Variable
The response variable is created from delinquency ID (DelqID) values ranging from 0 to 9. Where customers have multiple delinquency IDs, the highest value is used to mitigate default risk.
• Good = 0 where DelqID < 3
• Bad = 1 where DelqID > 2
• The average 'default' rate is 17.57% within the sample data.
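A minimal pandas sketch of this step (not the authors' SAS code); the column names CUSTID and DELQID are assumed for illustration:

```python
import pandas as pd

# Hypothetical layout: one row per delinquency record, DELQID in 0-9.
perf = pd.DataFrame({
    "CUSTID": [101, 101, 102, 103, 103],
    "DELQID": [1, 4, 0, 2, 2],
})

# Keep the worst (highest) delinquency ID per customer to mitigate default risk.
worst = perf.groupby("CUSTID", as_index=False)["DELQID"].max()

# GOODBAD = 0 (good) where DelqID < 3, 1 (bad) where DelqID > 2.
worst["GOODBAD"] = (worst["DELQID"] > 2).astype(int)

print(worst)
print("Sample default rate:", worst["GOODBAD"].mean())
```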
TABLE 23: MODEL VALIDATION COMPARISONS
Validation Dataset | % Valid 1 | % Valid 2 | % Type I Error | % Type II Error | Est. Profit per 1000
Training           | 11.60%    | 67.00%    | 5.93%          | 15.47%          | $102,650
1                  | 11.68%    | 66.85%    | 5.95%          | 15.53%          | $110,432
2                  | 11.69%    | 66.83%    | 5.91%          | 15.61%          | $110,895
3                  | 11.68%    | 66.84%    | 5.90%          | 15.58%          | $110,757
4                  | 11.67%    | 66.82%    | 5.95%          | 15.57%          | $110,473
5                  | 11.72%    | 66.79%    | 5.92%          | 15.58%          | $110,792
TABLE 20: ANALYSIS OF MAXIMUM LIKELIHOOD ESTIMATES
Parameter        | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq
ORDRADB6         | 1  | 0.4292   | 0.00349        | 15147.1948      | <.0001
LODSEQTOPENB75   | 1  | 0.5552   | 0.00692        | 6440.2938       | <.0001
CRATE2           | 1  | 0.4273   | 0.00632        | 4570.2359       | <.0001
ORDBRR4524       | 1  | 0.8795   | 0.0104         | 7218.3086       | <.0001
LODSEQBRCRATE1   | 1  | 0.7009   | 0.00851        | 6774.7374       | <.0001
BRCRATE3         | 1  | 1.0478   | 0.0143         | 5366.704        | <.0001
LODSEQBRMINB     | 1  | 0.3288   | 0.00752        | 1913.1348       | <.0001
ODSEQBRAGE       | 1  | 3.8637   | 0.0618         | 3910.3534       | <.0001
LOCINQS          | 1  | 0.0862   | 0.00145        | 3521.8592       | <.0001
BRCRATE4         | 1  | 0.4266   | 0.00998        | 1829.0249       | <.0001
TPCTSAT          | 1  | -1.0071  | 0.0156         | 4182.2141       | <.0001
ODDSBRRATE2      | 1  | 2.0349   | 0.0484         | 1764.9455       | <.0001
Figure 13: BADPR1 rank default rates (DISC 2). [Plot of average default rate (avg_dep, roughly 0.09 to 0.35) against rank of variable BADPR1 (ranks 4 to 9).]
Figure 12: Initial and final ORDBADPR1 bin default rates (DISC 1). [Two plots of mean GOODBAD (roughly 0.09 to 0.41) by BADPR1 bin: initial bins 0 to 14 and combined final bins 0 to 12.]
TABLE 9: CLUSTER VARIABLE SELECTION
Cluster   | Variable  | Own Cluster R^2 | Next Closest R^2 | 1-R^2 Ratio
CLUSTER 1 | DCCRATE7  | .892            | .510             | .221
          | DCRATE79  | .892            | .511             | .220
          | DCR7924   | .875            | .531             | .267
          | DCN90P24  | .844            | .620             | .412
          | DCR39P24  | .755            | .633             | .669
          | DCCR49    | .881            | .565             | .272
CLUSTER 2 | BRTRADES  | .892            | .463             | .201
          | BROPEN    | .800            | .584             | .482
          | BRCRATE1  | .939            | .514             | .126
          | BRRATE1   | .903            | .507             | .197
          | BRR124    | .835            | .585             | .397
          | BROPENEX  | .892            | .463             | .201
CLUSTER 3 | BRWCRATE  | .619            | .347             | .584
          | BRCRATE7  | .856            | .370             | .228
          | BRRATE79  | .855            | .370             | .230
          | BRR7924   | .568            | .365             | .680
          | BRR49     | .775            | .677             | .700
          | BRCR39    | .810            | .626             | .507
          | BRCR49    | .873            | .551             | .283
TABLE 7: PROPORTION OF VARIANCE EXPLAINED
Number of Clusters | Proportion of Variance Explained
30 | .725
31 | .733
32 | .739
33 | .745
34 | .753
35 | .760
36 | .766
37 | .772
38 | .778
39 | .783
40 | .788
41 | .793
42 | .797
43 | .802
44 | .807
45 | .811
Figure 6: Maximum standard deviation of 4 for RBAL and TRADES. [Histograms of the RBAL distribution (values 0 to 45,000, percent 0 to 8) and the TRADES distribution (values 0 to 60, percent 0 to 5).]
TABLE 2: GOODBAD DISTRIBUTION
GOODBAD | Frequency | Percent
0       | 1,034,829 | 82.43
1       | 220,600   | 17.57
Step 1: Merge Datasets
Two datasets are merged via an inner join to associate potential credit performance predictors with delinquency data.
• CPR: 1.44M observations and 339 variables. Each observation represents a unique customer.
• PERF: 17.24M observations and 18 variables. This file contains the post hoc performance data.
• The post-merge dataset has 1.26M observations and 339 potential predictors.
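A sketch of the inner join, assuming the PERF file has already been collapsed to one GOODBAD flag per customer (Step 2) and that both files share a customer key; MATCHKEY is a hypothetical name:

```python
import pandas as pd

# cpr: one row per customer with the 339 candidate predictors (one shown here).
cpr = pd.DataFrame({"MATCHKEY": [1, 2, 3], "RBAL": [1200, 999999, 560]})
# perf_flags: one GOODBAD flag per customer, built in Step 2.
perf_flags = pd.DataFrame({"MATCHKEY": [1, 3, 4], "GOODBAD": [0, 1, 0]})

# An inner join keeps only customers present in both files, which is why the
# merged data (1.26M rows) is smaller than the CPR file (1.44M rows).
merged = cpr.merge(perf_flags, on="MATCHKEY", how="inner")
print(merged)
```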
Step 3: Median Imputation of Coded Values
Many variables contain coded values (e.g., 999 and 999,999) that significantly skew their distributions. The large number of variables makes more advanced methods such as regression imputation impractical for this process.
To minimize distributional impact, median imputation is implemented: because many variables are right skewed, the mean exceeds the median, making the median the more robust fill value.
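A sketch of the recode-then-impute logic in pandas, standing in for the SAS macro and assuming 999 and 999,999 are the sentinel codes described above:

```python
import numpy as np
import pandas as pd

CODED_VALUES = [999, 999999]  # sentinel codes described in the text

def median_impute(s: pd.Series) -> pd.Series:
    """Treat sentinel codes as missing, then fill all missing with the median."""
    s = s.mask(s.isin(CODED_VALUES))
    return s.fillna(s.median())

df = pd.DataFrame({"RBAL": [100.0, 999999.0, 250.0, np.nan, 400.0]})
df["RBAL"] = median_impute(df["RBAL"])
print(df)
```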
Step 4: Variable Reduction
For efficiency, a macro analyzes all potential predictors to address coded and missing values. A cutoff of 40% retains the 165 variables whose combined share of coded and missing values does not exceed the cutoff. Cutoffs of 20% and 30% were strongly considered, but because this is an early stage in the process, more variables are retained.
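A sketch of the 40% screening rule, again assuming the same sentinel codes; in the project this was handled by the SAS macro over all 339 candidates:

```python
import numpy as np
import pandas as pd

CODED_VALUES = [999, 999999]
CUTOFF = 0.40  # drop variables whose combined coded + missing share exceeds 40%

df = pd.DataFrame({
    "A": [1, 999, np.nan, 4, 5],          # 40% coded/missing -> retained
    "B": [999999, 999, np.nan, 1, 999],   # 80% coded/missing -> dropped
})

bad_share = (df.isin(CODED_VALUES) | df.isna()).mean()
retained = bad_share.index[bad_share <= CUTOFF]
print(bad_share.to_dict(), list(retained))
```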
Step 5: Variable Dispersion Reduction
The same macro addresses variable dispersion through a maximum standard deviation parameter. Any observation beyond this threshold is recoded to the median value, just as missing and coded values are above.
As the maximum allowed standard deviation decreases, more 'outliers' are recoded to the median, compressing the right tail toward the median. Because many variables have right-skewed distributions, a maximum standard deviation of four is chosen to allow some distributional spread in the right tails.
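A sketch of the dispersion rule with the chosen maximum of four standard deviations; the synthetic series stands in for variables like RBAL:

```python
import numpy as np
import pandas as pd

MAX_SD = 4  # chosen to leave some spread in the right tails

def cap_to_median(s: pd.Series, max_sd: float = MAX_SD) -> pd.Series:
    """Recode observations beyond max_sd standard deviations to the median."""
    outlier = (s - s.mean()).abs() > max_sd * s.std()
    return s.mask(outlier, s.median())

rng = np.random.default_rng(0)
rbal = pd.Series(rng.gamma(2.0, 500.0, 1000))   # right-skewed toy balances
rbal.iloc[:3] = [45000, 60000, 52000]           # a few extreme values
capped = cap_to_median(rbal)
print(rbal.max(), capped.max())
```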
Step 6: Variable Clustering
Further reduction of the remaining variables is achieved through variable cluster analysis. A scatter plot evaluates the proportion of variance explained at different numbers of clusters. SAS recommends 32 clusters as optimal; however, 43 clusters are retained to reach the desired 80% proportion of variance explained.
Within each cluster, a single variable is selected as the representative: the variable with the lowest 1-R^2 ratio in that cluster.
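The selection rule can be sketched as below; the R-squared inputs are the quantities reported by SAS variable clustering (the values here mirror a few rows of Table 9), and the DataFrame layout is assumed for illustration:

```python
import pandas as pd

# R-squared of each variable with its own cluster and the next-closest cluster.
tbl = pd.DataFrame({
    "cluster":  ["C1", "C1", "C2", "C2"],
    "variable": ["DCCRATE7", "DCR39P24", "BRCRATE1", "BROPEN"],
    "r2_own":   [0.892, 0.755, 0.939, 0.800],
    "r2_next":  [0.510, 0.633, 0.514, 0.584],
})

# 1-R^2 ratio: lower is better (tight to own cluster, distinct from the next one).
tbl["ratio"] = (1 - tbl["r2_own"]) / (1 - tbl["r2_next"])

# Keep the variable with the lowest ratio in each cluster.
representatives = tbl.loc[tbl.groupby("cluster")["ratio"].idxmin()]
print(representatives)
```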
Step 7: Discretization Processes
The remaining 44 potential predictor variables are evaluated for transformations under two separate processes that create ordinal versions of the variables.
Discretization 1 (DISC 1) is a supervised process that distributes transformed variables into equal-width bins. DISC 1 requires user evaluation to combine consecutive bins when they lack separation in bin default probabilities or have low bin frequencies.
Discretization 2 (DISC 2) is an unsupervised process that distributes transformed variables into equal-frequency bins. DISC 2 uses a t-test to evaluate consecutive ranks for significance, combining lower ranks into higher ranks where no significant difference is found.
Potential transformations per DISC technique:
• Ordinal transformations
• Odds ratio of default
• Log odds ratio of default
Up to seven potential versions of each variable are evaluated for modeling. Not all variables went through both discretization processes, for multiple reasons. Approximately 250 variables enter the modeling phase.
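A rough illustration of the two binning styles and the odds / log-odds versions, on synthetic data rather than the project's variables:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.gamma(2.0, 2.0, 5000),          # a right-skewed predictor
    "goodbad": rng.binomial(1, 0.18, 5000),  # 1 = default
})

# DISC 1 style: equal-width bins; DISC 2 style: equal-frequency bins.
df["bin_width"] = pd.cut(df["x"], bins=10, labels=False)
df["bin_freq"] = pd.qcut(df["x"], q=10, labels=False, duplicates="drop")

# Per-bin default rate, odds, and log-odds -- the candidate transformed versions.
def bin_summary(col: str) -> pd.DataFrame:
    p = df.groupby(col)["goodbad"].mean().clip(1e-6, 1 - 1e-6)
    return pd.DataFrame({"default_rate": p,
                         "odds": p / (1 - p),
                         "log_odds": np.log(p / (1 - p))})

print(bin_summary("bin_freq"))
```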
Step 8: Final Model Predictors
The remaining potential predictors are used to logistically model GOODBAD via backward selection (an iterative process that begins with all potential predictors in the model and removes the least significant predictor per iteration) on 70% of the sample data.
The process repeats until all remaining predictors are significant at the .05 level (i.e., all remaining predictors' p-values are less than .05). The top twelve predictors from the backward selection logistic regression model are reported in Table 20.
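The backward-selection loop can be sketched as follows (a statsmodels approximation, not the authors' SAS procedure):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_select(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    """Drop the least significant predictor until every p-value is below alpha."""
    cols = list(X.columns)
    while cols:
        model = sm.Logit(y, sm.add_constant(X[cols])).fit(disp=0)
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < alpha:
            return model, cols
        cols.remove(worst)
    return None, []

# Toy data standing in for the ~250 candidate predictors and GOODBAD.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(2000, 5)), columns=list("abcde"))
y = pd.Series((X["a"] + 0.5 * X["b"] + rng.normal(size=2000) > 1).astype(int))
model, kept = backward_select(X, y)
print("retained predictors:", kept)
```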
Step 9: Maximize Estimated Profit
Estimated profit is optimized using a function provided by the 'client', with a cutoff specifying when a customer's default probability is considered high risk. Estimated profit is evaluated in two passes: 1) a high-level scan of cutoffs from 0 to 1 by .1, and 2) a drill-down from .1 to .4 by .01. The optimal cutoff is .21.
Profit function: (250 * correctly predicted 'GOODs') - (856 * incorrectly predicted 'GOODs')
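A sketch of the cutoff search against that profit function, using toy predictions (the real search used the model's estimated default probabilities):

```python
import numpy as np

def est_profit(p_default: np.ndarray, actual_bad: np.ndarray, cutoff: float) -> float:
    """Client profit rule: +$250 per correctly predicted GOOD,
    -$856 per BAD incorrectly passed as GOOD."""
    predicted_good = p_default < cutoff
    correct_good = np.sum(predicted_good & (actual_bad == 0))
    wrong_good = np.sum(predicted_good & (actual_bad == 1))
    return 250 * correct_good - 856 * wrong_good

# Toy probabilities and outcomes stand in for the model's predictions.
rng = np.random.default_rng(2)
p = rng.beta(2, 8, 10000)
bad = rng.binomial(1, p)

coarse = {round(c, 2): est_profit(p, bad, c) for c in np.arange(0.0, 1.01, 0.1)}
fine = {round(c, 2): est_profit(p, bad, c) for c in np.arange(0.10, 0.41, 0.01)}
print("best coarse cutoff:", max(coarse, key=coarse.get))
print("best fine cutoff:", max(fine, key=fine.get))   # the project found .21
```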
Step 10: Model Validation
The final model is validated five times by randomly sampling from the remaining 30% of the sample data. Type I and Type II error rates appear consistent across all validation samples (Table 23).
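A sketch of the repeated-holdout check, with toy data and an assumed Type I / Type II convention (good customer flagged as bad / bad customer passed as good):

```python
import numpy as np
import pandas as pd

def error_rates(y_true: pd.Series, y_pred: pd.Series) -> dict:
    """Type I: good customer flagged as bad; Type II: bad customer passed as good."""
    n = len(y_true)
    return {"type_I": float(np.sum((y_true == 0) & (y_pred == 1)) / n),
            "type_II": float(np.sum((y_true == 1) & (y_pred == 0)) / n)}

# Five repeated random draws from a toy 30% holdout, echoing Table 23's layout.
rng = np.random.default_rng(3)
holdout = pd.DataFrame({"y": rng.binomial(1, 0.18, 50000),
                        "p": rng.beta(2, 8, 50000)})
for i in range(1, 6):
    sample = holdout.sample(frac=0.5, random_state=i)
    pred = (sample["p"] >= 0.21).astype(int)   # 0.21 cutoff from Step 9
    print(i, error_rates(sample["y"], pred))
```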
The 1-R^2 ratio used for cluster variable selection in Step 6:

$$(1 - R^2)\ \text{Ratio} = \frac{1 - R^2_{\text{Own Cluster}}}{1 - R^2_{\text{Next Cluster}}}$$
Potential Next Steps
• Compare to a simplified model using the original versions of the final predictors.
• Perform observational clustering and re-optimize the profit function with cluster-specific default probability cutoffs.
• Evaluate other imputation methods and selection criteria.