Categorical data. The dataset is usually very big: say 100,000 individuals.
1. Distributional features of data
The nature of consumer credit lends itself to statistical analysis.
DA: e.g. Altman's Z-score model. Normality: approximately justified for large samples (central limit theorem), so the assumption matters mainly for small samples.
The further apart the two distributions are, the better the credit scoring model can discriminate between good and bad credits. Several measures can gauge the difference between the two distributions, e.g. Wilks' lambda, information value, alpha/beta errors, etc. See pp. 92-99 of the Handbook.
How are the coefficients derived? See Altman's paper. Altman selects the 5 variables, out of a list of 22, that do the best overall job together in predicting corporate bankruptcy. He uses iterative procedures to evaluate different combinations of eligible variables and selects the profile that does the best job among the alternatives. Two samples of firms, healthy and bankrupt, are used to find the discriminant function.
Heteroscedasticity: use the generalized least squares method. The error u = y − β'x takes only two values for each i, since y is either 0 or 1.
Mathematical programming:
- An objective criterion to optimize, e.g. the proportion of applicants correctly classified
- Subject to certain discriminating conditions (to discriminate good and bad credits)
Recursive partitioning algorithm:
- A computerized nonparametric technique based on pattern recognition
- Result: a binary classification tree that assigns objects into a priori groups; terminal nodes represent the final classification of all objects
- From two samples of default and non-default firms, e.g., one can count the misclassified cases
Expert systems:
- Also known as artificial intelligence systems; computer-based decision-support systems
- Evidence shows their predictive performance is quite poor
- A consultation module: asks the user questions until enough evidence has been collected to support a final recommendation
- A knowledge base: static data, algorithms, and rules that tell the system 'what to do if'
- A knowledge acquisition and learning module: rules from the lending officer and rules of its own
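As a sketch of the recursive-partitioning idea, the single split step below (the helper name `best_split` and the data are mine) scans thresholds on one characteristic and keeps the one that misclassifies the fewest firms; a full tree would apply this step recursively within each resulting node.

```python
def best_split(xs, ys):
    """One step of recursive partitioning: pick the threshold on a single
    characteristic that minimizes the number of misclassified firms.
    ys uses 1 = default, 0 = non-default."""
    best = None
    for t in sorted(set(xs)):
        # try classifying x < t as default, and also the reverse labelling
        for default_below in (True, False):
            pred = [1 if (x < t) == default_below else 0 for x in xs]
            errors = sum(p != y for p, y in zip(pred, ys))
            if best is None or errors < best[0]:
                best = (errors, t, default_below)
    return best  # (misclassified count, threshold, direction)
```

With clearly separated groups the best single split misclassifies nothing; overlapping groups leave some terminal-node errors, which is how the misclassified numbers in the text arise.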
Classification methods that are easy to understand (such as regression, nearest-neighbor, and tree-based methods) are much more appealing than those that are essentially black boxes (such as neural networks). But neural networks have advantages too.
Two groups: one defaulted, the other non-defaulted. Logistic regression is hence more robust than linear models. Normality: for binary data, such as 0 or 1, it is hard to justify a normal distribution. This is especially severe in significance testing.
ML method: we want estimates of the parameters that make the observed data most likely. This is usually done by specifying the exact distribution of the error terms, i.e. writing down the likelihood function.
The inverse of h is called the logit function: g = log[P/(1 − P)]. Note that h guarantees P lies between 0 and 1, as required of a probability.
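A minimal sketch of the h/logit pair in Python (the function names `h` and `logit` are mine):

```python
import math

def h(g):
    """Logistic function: maps a real-valued score g to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-g))

def logit(p):
    """Logit function, the inverse of h: maps (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

p = h(1.5)   # any real score gives a valid probability
```

Applying `logit` after `h` recovers the original score, confirming the two are inverses.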
M is the number of 1's, i.e. the number of defaults in the sample. Order the observations so that the defaulted ones come first, followed by the non-defaulted ones. The probability here is conditional, i.e. conditional on the X's in the sample. The likelihood function serves as the (negative) loss function to be optimized. The last function is the cumulative logistic distribution function.
The simplification is correct:
  Π_{i=1..m} P_i × Π_{i=m+1..n} (1 − P_i) = Π_{i=1..m} [P_i / (1 − P_i)] × Π_{i=1..n} (1 − P_i)
When doing MLE, use the log version of this formula.
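The product identity can be checked numerically; a sketch with made-up probabilities (n observations, the first m being defaults):

```python
import math
import random

random.seed(0)
n, m = 10, 4
P = [random.uniform(0.05, 0.95) for _ in range(n)]

# left side: product of P over defaults times product of (1 - P) over non-defaults
lhs = math.prod(P[:m]) * math.prod(1 - p for p in P[m:])
# right side: product of odds over defaults times product of (1 - P) over everyone
rhs = math.prod(p / (1 - p) for p in P[:m]) * math.prod(1 - p for p in P)
```

The two sides agree, since each P_i/(1 − P_i) for i ≤ m cancels its (1 − P_i) factor in the full product.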
All the observations are either default or non-default (so the observed P's are either 1 or 0 in the sample). Logit = log(odds). Log(odds) makes the odds of default to non-default the opposite of the odds of non-default to default (e.g. 2 vs. -2). The logit function is then assumed to be linear. Not solvable by OLS, because P is either 1 or 0; the coefficients are solved through MLE. If the dataset is large enough, we can use the sample relative frequency as an estimate of the true probability for each X level; then we have values of the logits and OLS can be used. This website is a logistic calculator: http://members.aol.com/johnp71/logistic.html
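A sketch of the grouped-data shortcut: with enough data at each X level, replace the unknown P's by sample relative frequencies, take logits, and fit by OLS. The data and variable names below are hypothetical.

```python
import math

# Hypothetical grouped data: for each level x, the observed default frequency
x = [1, 2, 3, 4, 5]
freq = [0.88, 0.72, 0.50, 0.28, 0.12]            # sample relative frequencies
z = [math.log(f / (1 - f)) for f in freq]        # empirical logits

# Simple OLS of z on x: slope = cov(x, z) / var(x)
n = len(x)
mx, mz = sum(x) / n, sum(z) / n
beta = (sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
        / sum((xi - mx) ** 2 for xi in x))
alpha = mz - beta * mx
```

Here default frequency falls as x rises, so the fitted slope on the logit scale is negative.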
The regression line would be nonlinear. None of the observations actually fall on the regression line; they all fall on 0 or 1.
1. Explain thin tail implications: extreme events
The probit model uses the probit function, which maps a probability to a value in (−∞, ∞). The probit function is the inverse of the standard normal cdf. Assuming the observations are independent, one solves for the β's by ML estimation. See pp. 100-101 of the Handbook. The other commonly used link function is the logit.
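A sketch of the probit link using only the standard library: the normal cdf comes from `math.erf`, and its inverse is found by bisection (the helper names are mine).

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def probit(p, lo=-10.0, hi=10.0):
    """Inverse of the standard normal cdf, found by bisection on [lo, hi]."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

As expected, probit(0.5) = 0 and probit(0.975) is about 1.96, the familiar normal quantile.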
Transfer function : that converts the combination of inputs to an output.
Several neural network models:
- multilayer perceptron (MLP): best model for credit scoring purposes
- mixture of experts (MOE)
- radial basis function (RBF)
- learning vector quantization (LVQ)
- fuzzy adaptive resonance (FAR)
The weighted combination of inputs is called the NET.
The overall input g for a neuron is called the potential. The potential g is a linear combination of the inputs X to the neuron, weighted by the W's. The activation function, also called the transfer function, converts the potential to an output f. W0 is called the bias, or the excitation threshold value; it is like the constant in a regression model. One can set x0 = 1 and start the summation at i = 0.
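A sketch of a single neuron with the x0 = 1 trick (a logistic transfer function is chosen here for illustration; any sigmoid would do):

```python
import math

def neuron_output(w, x):
    """Single neuron: w[0] is the bias W0; prepending x0 = 1 lets the
    potential g be one dot product with the summation starting at i = 0."""
    x = [1.0] + list(x)
    g = sum(wi * xi for wi, xi in zip(w, x))   # potential: linear combination
    return 1.0 / (1.0 + math.exp(-g))          # logistic transfer function
```

With all weights zero the potential is 0 and the output is 0.5; a positive bias alone pushes the output above 0.5, mirroring a regression constant.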
SVM can perform binary classification (pattern recognition) and real valued function approximation (regression estimation) tasks.
1. Smiley faces: good credits; stars: bad credits
The largest margin means the strongest differentiating power and the most robustness: any additional obligor that falls outside the margin region can be clearly identified, while one that falls in the middle is hard to label, but this incurs no labeling error. Support vectors: the data points X associated with the obligors on the margin boundaries are called support vectors.
Nadaraya-Watson (NW) regression: a nonparametric kernel method. Fan's paper is cited in Atiya's review paper on predicting bankruptcies.
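A minimal sketch of the Nadaraya-Watson estimator (Gaussian kernel and the function name are my choices):

```python
import math

def nw_regression(x0, xs, ys, bandwidth=1.0):
    """Nadaraya-Watson estimate at x0: a kernel-weighted average of the ys,
    with weights decaying in the distance from x0."""
    k = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]  # Gaussian kernel
    return sum(ki * yi for ki, yi in zip(k, ys)) / sum(k)
```

Because the estimate is a weighted average, it always lies between the smallest and largest observed y, and no functional form is assumed, which is the nonparametric appeal.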
Data get outdated: e.g. income will change, so behavior will change.
- Application scores: computed at application time, i.e. whether to extend a facility based on this score
- Behavior scores: computed after the facility has been granted
- Probability scores: we want not only the binary result (0 or 1) but also the expected probability. This is important, e.g., for calculating capital and expected returns.
will not calculate until another chapter.
Revolving credit: e.g. credit card
In this example, X has 2 dimensions. W is perpendicular to the line.
Subtract the 1st equation from the 2nd: w'(x₊ − x₋) = 2 ⇒ margin = 2/||w||.
Minimizing ||w|| is equivalent to maximizing the margin.
- n refers to the number of obligors.
- The constraints: the two groups must lie on either side of the margin.
- y identifies default or non-default; note the labels here (±1) differ from those in other methods (say 0 and 1).
- This is the linear SVM: the data are assumed to be linearly separable (the constraint).
- The constraint is binding only for the support vectors!
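A numerical sketch of the linear-SVM geometry (the hyperplane and the four obligors below are made up): for a separating hyperplane w'x + b = 0, every obligor satisfies y(w'x + b) ≥ 1, with equality only for support vectors, and the margin equals 2/||w||.

```python
import math

# Hypothetical separating hyperplane w'x + b = 0 in 2 dimensions
w, b = [1.0, 1.0], -3.0
good = [[2.0, 2.0], [4.0, 3.0]]   # labelled y = +1
bad  = [[1.0, 1.0], [0.0, 0.0]]   # labelled y = -1

def score(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# constraint y * (w'x + b) >= 1; it binds (= 1) only at the support vectors
ok = all(score(x) >= 1 for x in good) and all(-score(x) >= 1 for x in bad)

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))   # margin = 2 / ||w||
```

Here (2, 2) and (1, 1) sit exactly on the margin boundaries (constraint binding), so they are the support vectors; the margin is 2/√2 = √2.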
If the constraint is binding (i.e. holds with equality), then λ > 0 (economic meaning: the shadow price of the constraint). If it is not binding (strict inequality), then λ = 0.
Using the dual is more convenient. See: http://en.wikipedia.org/wiki/Linear_programming#Duality Intuition: in the primal problem, we push against the constraints to meet the objective; in the dual problem, we minimize the objective subject to meeting the constraints (use a graph to show this). λ is a price: we want to maximize it so that more constraints are binding (less slack).
Evaluating the credit applicant

Characteristic                   Judgment    Credit scoring
Time at present address          +           12
Time at present job              +           20
Residential status               -           5
Debt ratio                       +           21
Bank reference                   +           28
Age                              N/A         15
Income                           -           5
# of recent inquiries            -           -7
% of balance to avail. lines     +           10
# of major derogs.               +           35
Overall decision                 Accept?     212 -> Accept
Odds of repayment                ?           11:1
Multivariate statistical analysis: several predictors (independent variables) and several groups (categorical dependent variable, e.g. 0 and 1)
Predictive DA: for a new observation, calculate the discriminant score, then classify it according to the score
The objective is to maximize the ratio of between-group to within-group sum of squares, which yields the best discrimination between the groups (within-group variance is solely due to randomness; between-group variability is due to the difference of the means).
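A sketch of the between/within sum-of-squares ratio computed on discriminant scores (the helper name is mine); well-separated groups give a much larger ratio than overlapping ones, which is exactly what DA maximizes.

```python
def discriminant_ratio(scores_a, scores_b):
    """Between-group to within-group sum-of-squares ratio of the scores."""
    all_s = scores_a + scores_b
    grand = sum(all_s) / len(all_s)
    groups = (scores_a, scores_b)
    # between: group sizes times squared deviations of group means from grand mean
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within: squared deviations of each score from its own group mean
    within = sum(sum((s - sum(g) / len(g)) ** 2 for s in g) for g in groups)
    return between / within
```

The ratio grows as the group means move apart (between rises) or as each group tightens around its mean (within falls).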
A normal distribution for the response variables (dependent variables) is assumed (but normality only becomes important if significance tests are to be performed on small samples).
Statistical credit scoring: [figure] distributions of good and bad credits across the credit score (vertical axis: # of customers), with a cut-off score separating accepts from rejects.
In general there is no overall best method. What is best will depend on the details of the problem:
The data structure
The characteristics used
The extent to which it is possible to separate the classes by using those characteristics
The objective of the classification (overall misclassification rate, cost-weighted misclassification rate, bad risk rate among those accepted, some measure of profitability, etc.)
In the following slides, we introduce in detail three models that are widely used today: logistic regression, neural networks, and SVM.
Empirical studies show that logistic regression may perform better than linear models (hence better than discriminant analysis) when the data are nonnormal (particularly for binary data), or when the covariance matrices of the two groups are not identical.
Therefore, logistic regression is the preferred method among the statistical methods
Probit regression is similar to logistic regression
It is easy to show that the log of the odds (= logit) are a linear function:
Therefore, the odds per se are a multiplicative function.
Since probability takes on values between (0,1), the odds take on values between (0,∞), logits take on values between (-∞,∞). So, it looks very much like linear regression, and it does not need to restrict the dependent variable to values of {0, 1}.
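The multiplicative nature of the odds can be seen directly: under a linear logit, a one-unit increase in x multiplies the odds by exp(β₁), no matter where x starts. The coefficients below are made up for illustration.

```python
import math

beta0, beta1 = -1.0, 0.4   # hypothetical logit coefficients

def odds(x):
    """Odds of default at x: the exponential of the linear logit."""
    return math.exp(beta0 + beta1 * x)

# a one-unit increase in x multiplies the odds by exp(beta1), regardless of x
ratio_low = odds(3.0) / odds(2.0)
ratio_high = odds(10.0) / odds(9.0)
```

This is the sense in which the odds per se are a multiplicative function while the logit is additive.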
Saturation effect: i.e. marginal effect of a financial ratio may decline quickly
Multiplicative factors: highly leveraged firms have a harder time borrowing money
Neural networks decide how to combine and transform the raw characteristics in the data, as well as yielding estimates of the parameters of the decision surface
Well suited to situations where we have a poor understanding of the data structure
Response analysis: avoid adverse selection consequences that result in increased concentrations of high-risk borrowers
Pricing strategies: avoid “follow the competition”, focus on segment profitability and cash flow
Loan amount determination: avoid being judgmental; quantify the probabilities of losses
Credit loss forecasting: decompositional roll rate modeling, trend and seasonal indexing, and vintage curve
Portfolio management strategies: important for repricing and retention, don’t be judgmental, integrating behavioral element and cash flow profitability analysis ( underwriting )
Collection strategies: behavioral models are useful
Maximizing the margin is equivalent to minimizing || w || 2 .
Minimize || w || 2 subject to the constraints:
where we have defined y(n) = +1 for all observations in the first class and y(n) = −1 for all observations in the second class. This enables us to write the constraints as y(n)(w′x(n) + b) ≥ 1.
Quadratic programming problem: minimize the cost function (the Lagrangian). Here we have introduced non-negative Lagrange multipliers λₙ ≥ 0 that enforce the constraints.