General Concepts - Cluster Analysis What is it? Descriptive process. Groups objects so that those within a cluster are more similar to each other than to those in other clusters. Objectives: Characterize the population. Understand behaviors. Identify opportunities. Apply different treatments.
General Concepts - Cluster Analysis Different clustering algorithms. Define the measures of similarity. Algorithms K-means. Kohonen Self-organizing maps (SOM).
General Concepts - Cluster Analysis K-means Iterative technique. Assigns a set of n observations to k clusters. Each observation is assigned to the cluster with the nearest centroid. Each observation belongs to exactly one cluster. Unsupervised algorithm.
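A minimal sketch of the iterative nearest-centroid assignment described on this slide. The slide does not say how the centroids are initialized or when iteration stops, so the `initial_centroids` argument, the `max_iter` limit, and the "stop when labels no longer change" rule are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid updates."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assumed stopping rule: assignments are stable
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Illustrative data only: two well-separated groups of three observations.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
```

Each observation ends up with exactly one label, which matches the "one cluster per observation" property stated on the slide.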
General Concepts - Cluster Analysis Kohonen SOM Unsupervised and iterative algorithm. Kohonen's learning law. Winning cluster. Training case. Learning rate.
General Concepts - Cluster Analysis Kohonen SOM [Figure: SOM network diagram with K input variables]
General Concepts - Cluster Analysis Kohonen SOM Define a learning rate -> Initialize the weights for each node -> Select one training case and calculate the winning cluster -> Update the codebook of the winning cluster -> Convergence check -> STOP
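The flow above can be sketched as follows. This is a simplified illustration, not the SAS Enterprise Miner implementation: it updates only the winning node, as the slide describes, and omits the neighborhood update that a full self-organizing map also applies. The decay factor, tolerance, and epoch limit are assumptions.

```python
def nearest_node(weights, x):
    """Index of the winning node: the codebook vector closest to case x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)),
    )


def train_winner_only(cases, init_weights, learning_rate=0.5,
                      decay=0.9, tol=1e-4, max_epochs=100):
    """Winner-only Kohonen-style training loop (hypothetical helper)."""
    weights = [list(w) for w in init_weights]   # initialize the weights
    for _ in range(max_epochs):
        largest_change = 0.0
        for x in cases:                          # select one training case
            j = nearest_node(weights, x)         # calculate the winning cluster
            for d, v in enumerate(x):
                # Kohonen learning law: w_new = w_old + rate * (x - w_old)
                delta = learning_rate * (v - weights[j][d])
                weights[j][d] += delta
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:                 # convergence check (assumed rule)
            break
        learning_rate *= decay                   # assumed learning-rate schedule
    return weights


# Illustrative data only.
cases = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
trained = train_winner_only(cases, init_weights=[(0.1, 0.1), (0.8, 0.8)])
```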
General Concepts - Logistic Regression Models the probability of occurrence of an event. Determines the relationship between the independent variables and the dichotomous dependent variable. Fits output values between 0 and 1. Widely used in credit risk scorecards for its simplicity and interpretability.
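To illustrate how the output stays between 0 and 1, here is a small sketch of the logistic function applied to a linear combination of the independent variables. The intercept, coefficients, and applicant values are made up for the example and do not come from the study.

```python
import math


def logistic(z):
    """Map any real number to the open interval (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(intercept, coefficients, x):
    """Probability of the event for one observation x."""
    z = intercept + sum(b * v for b, v in zip(coefficients, x))
    return logistic(z)


# Hypothetical scorecard with two variables.
p = predict_probability(intercept=-1.0, coefficients=[0.8, -0.5], x=[2.0, 1.0])
```

Because each coefficient shifts the log-odds linearly, the sign and size of a coefficient can be read directly, which is the interpretability the slide refers to.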
General Concepts - Multinomial Logistic Regression Generalization of the logistic regression. Dependent variable is not dichotomous. Predict more than two possibilities. Parameter estimation through maximum likelihood.
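As a sketch of how the generalization yields more than two outcomes, the block below turns one linear score per class into class probabilities with the softmax function. The parameters are hypothetical; in practice they would be estimated by maximum likelihood, which is not shown here.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(intercepts, coefficients, x):
    """One linear score per class, then softmax across the classes."""
    scores = [
        b0 + sum(b * v for b, v in zip(bs, x))
        for b0, bs in zip(intercepts, coefficients)
    ]
    return softmax(scores)


# Hypothetical parameters for three classes (for example, three clusters).
probs = class_probabilities(
    intercepts=[0.0, 0.5, -0.5],
    coefficients=[[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]],
    x=[0.3, 0.7],
)
```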
General Concepts – MLP neural network [Figure: multilayer perceptron with an input layer (X1 ... Xn), hidden layers 1 to i (units H11 ... Hij, each layer with a bias term), and an output layer (Yi)] Abstraction of the nervous system. Collection of interconnected units.
General Concepts – Classification Distances Three algorithms: Minimum Euclidean distance. Minimum adjusted distance. Minimum Mahalanobis distance. Classify a new client to a corresponding cluster. Assign to the cluster with the minimum distance.
General Concepts – Classification Distances Minimum Euclidean distance Assign based on the distance to the centroids of the defined clusters.
General Concepts – Classification Distances Minimum adjusted distance Defined by the authors. Assign based on the distance to the cluster radius. The radius is the average distance of all observations within a cluster to its centroid.
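The cluster radius, as described on this slide, can be computed as below. The sketch covers only the radius; how the authors combine it with the distance to obtain the adjusted distance is not recoverable from the slide text, so that step is left out. The sample observations are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_radius(members, centroid):
    """Average distance of a cluster's observations to its centroid."""
    return sum(euclidean(p, centroid) for p in members) / len(members)


# Illustrative cluster: four corners of a square around the centroid (1, 1).
members = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
radius = cluster_radius(members, centroid=(1.0, 1.0))
```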
General Concepts – Classification Distances Minimum Mahalanobis distance Considers the correlation between variables. Scale-invariant. When the covariance matrix is the identity matrix, it reduces to the Euclidean distance.
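A small two-variable sketch of minimum-distance assignment, showing that the Mahalanobis distance, sqrt((x - c)' S^-1 (x - c)), equals the Euclidean distance when the covariance matrix S is the identity. The restriction to two variables (so the inverse can be written by hand), the function names, and the sample values are assumptions for illustration.

```python
import math


def euclidean(x, c):
    """Euclidean distance from observation x to centroid c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c)))


def mahalanobis_2d(x, c, cov):
    """Mahalanobis distance for two variables, given a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    # Closed-form inverse of a 2x2 matrix.
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    d0, d1 = x[0] - c[0], x[1] - c[1]
    q = d0 * (inv[0][0] * d0 + inv[0][1] * d1) + d1 * (inv[1][0] * d0 + inv[1][1] * d1)
    return math.sqrt(q)


def assign(x, centroids, distance):
    """Index of the cluster whose centroid is at minimum distance from x."""
    return min(range(len(centroids)), key=lambda j: distance(x, centroids[j]))


identity = ((1.0, 0.0), (0.0, 1.0))
d_euclid = euclidean((3.0, 4.0), (0.0, 0.0))
d_mahal = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), identity)
cluster = assign((1.0, 1.0), [(0.0, 0.0), (5.0, 5.0)], euclidean)
```

With a non-identity covariance matrix, directions with high variance count for less, which is why the measure is scale-invariant.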
General Concepts – The F1 score Two concepts: Precision. Recall. Standard measure of comparison when the outcome is binary. Classification accuracy. Percentage of true observations that were correctly predicted. Percentage of real good observations that were correctly predicted. Harmonic mean between precision and recall.
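For reference, a short sketch of the F1 computation using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). The label vectors are invented for the example, and treating `1` as the positive class is an assumption.

```python
def f1_score(actual, predicted, positive=1):
    """Harmonic mean of precision and recall for a binary outcome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative labels: 3 true positives, 1 false positive, 1 false negative.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
score = f1_score(actual, predicted)
```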
Modeling Logistic regression or neural network model developed for each resulting cluster. Weighted average of the scores, weighted by the probability of belonging to each cluster. Vote of each score, weighted by the probability of belonging to each cluster.
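The two ways of combining per-cluster scores can be sketched as follows. This is an interpretation of the slide, not the authors' exact procedure: in particular, the 0.5 cutoff that turns a score into a vote is an assumption, as are the function names and sample values.

```python
def weighted_average_score(scores, membership_probs):
    """Average of per-cluster scores, weighted by cluster membership probability."""
    total = sum(membership_probs)
    return sum(s * p for s, p in zip(scores, membership_probs)) / total


def weighted_vote(scores, membership_probs, cutoff=0.5):
    """Share of membership-weighted votes in favor of the event.

    Each cluster model votes for the event when its score reaches the
    cutoff (assumed value), and the vote counts in proportion to the
    probability of belonging to that cluster.
    """
    total = sum(membership_probs)
    in_favor = sum(p for s, p in zip(scores, membership_probs) if s >= cutoff)
    return in_favor / total


# Illustrative client: scores from three cluster models and the
# probabilities of belonging to each cluster.
scores = [0.2, 0.7, 0.9]
membership_probs = [0.5, 0.3, 0.2]
avg = weighted_average_score(scores, membership_probs)
vote = weighted_vote(scores, membership_probs)
```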
Modeling F1 score is calculated for every model. 248 models. SAS® base and SAS Enterprise Miner™ procedures.
Results 240 predictive clustering models. 8 scorecards for the entire population. Top 25% and Top 10% of the models. Contrast on each stage of the modeling process.
Results Stage 1: Clustering methodology
Clustering Methodology    Top 25%    Top 10%
K-Means                   46%        52%
Kohonen SOM               54%        48%
There is no significant difference between the techniques. Opposite relationship between the Top 25% and the Top 10%.
Results Stage 2: Predictive clusters methodology
Predictive Clusters Methodology    Top 25%    Top 10%
Multinomial Logistic Regression    4%         0%
MLP Neural Network                 15%        0%
Minimum Euclidean Distance         24%        28%
Minimum Adjusted Distance          28%        44%
Minimum Mahalanobis Distance       28%        28%
Distance methodologies combined    81%        100%
Distance methodologies are more powerful.
Results Stage 3: Credit scoring methodology
Credit Scoring Methodology    Top 25%    Top 10%
Logistic Regression           49%        64%
MLP Neural Network            51%        36%
Top 25%: approximately the same percentage. Top 10%: logistic regression exceeds the neural network.
Logistic regression scorecards perform better on segmented populations.
Results Stage 4: Final score methodology
Final Score Methodology             Top 25%    Top 10%
Cluster Score                       21%        8%
Score Ensemble                      28%        20%
Classifier Average Vote Ensemble    51%        72%
Undisputed winner: the best method to define the final score is the classifier average vote ensemble.
Results Entire population models ranking
Database      Logistic Regression position    MLP Neural Network position
              (F1 Score Ranking)              (F1 Score Ranking)
Database 1    34th                            13th
Database 2    4th                             21st
Database 3    32nd                            16th
Database 4    13th                            24th
Best scenario: 4th position (Database 2, logistic regression).
The best models according to the F1 score statistic are the ones developed using the predictive clusters methodology.
Conclusions Clustering methods lead to similar results. For cluster assignment, distance methods perform better. Logistic regression could have higher predictive power after using cluster analysis. The classifier average vote ensemble produces superior results when defining the final score. Predictive clusters provide better results than a single scorecard.