Text Mining in Credit Scoring: Enhancement or a Fully Fledged Modelling Approach?
Rafał Wojdan, Warsaw School of Economics
References
- Kolyshkina, I., & van Rooyen, M. (2006). Text Mining for Insurance Claim Cost Prediction. In G. J. Williams & S. J. Simoff (Eds.), Data Mining, LNAI 3775 (pp. 192-202). Berlin: Springer-Verlag.
- Chakraborty, G., Pagolu, M., & Garla, S. (2013). Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. SAS Institute.
- http://support.sas.com/kb/22/601.html
Abstract
As has recently been demonstrated, good credit scoring models are crucial not only for the banking business but for the whole global economy. Hence, this study verifies whether credit scoring models can be improved with variables obtained through text mining analysis using SAS Enterprise Miner 12.1. Furthermore, encouraged by successful applications of purely text-based models in the insurance industry, the project attempts to build a model constrained to text variables only, with predictive power similar to that of a model built on structured data alone. The purpose of the study is to evaluate the impact of textual information on credit scoring and to uncover the potential hidden in text variables. Because banking data is highly confidential, the study is based on loans made via a social lending service, which is assumed to be a good proxy for the banking industry. Social lending is a relatively new phenomenon that combines features of social services and loan services; investors' comments on loan inquiries, which can be highly useful for text mining analyses, are therefore accessible there.
Objective
•To predict loan default using logistic regression models in SAS® Enterprise Miner™
•To compare the results of the best models in each of three groups: a structured-data model, a text-data model, and a hybrid-data model
•To identify the value of text data for credit scoring modelling
Methods
Data preparation:
•The data were collected from a Polish social lending site. Besides structured data, the site also offers text data for every loan: the title, description, and comments. The whole set contained 47224 observations and 154 variables.
•Text data were converted to structured form using the Text Cluster node. Various combinations of the number of clusters, SVD resolution, and maximum SVD dimensions were tried in order to find the best new variables.
•As each loan had anywhere from zero to many comments, aggregation was necessary. A set of variables indicating the number and presence of comments was built; for the other categories, SVD variables were used.
•Some of the structured variables were additionally grouped into broader categories, for example age or loan percentage.
•The target variable is binary: 1 = default, 0 = repaid.
Model building:
Different logistic regression models were built for every set of variables obtained in clustering, and the models were compared using the misclassification rate. Because the models performed poorly at identifying defaults, which from the investors' perspective are the more important outcome, modifications were applied. Since the event '1' comprised only 17% of the analysed population, oversampling was used to improve the prediction of defaults: a stratified sample of 12 000 observations, split equally by target value, was drawn, and prior probabilities reflecting the original population proportions were then applied. Additionally, decision weights set as the rounded inverse of the a priori probabilities were used. The appropriate measures for comparing models in this new setting were the profit/loss measure and classifications based on the decision table. The remaining observations were used for scoring and validation.
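The steps above can be sketched outside SAS Enterprise Miner as well. The following is a minimal Python analogue of the hybrid-model pipeline, assuming scikit-learn is available: text is reduced to SVD components (roughly analogous to the Text Cluster node's SVD dimensions), combined with structured columns, and fed to a logistic regression. The toy data, column meanings, and dimension counts are hypothetical, not the study's actual settings.

```python
# Illustrative sketch, NOT the original SAS EM flow: text -> TF-IDF ->
# truncated SVD features, concatenated with structured variables, then
# a logistic regression predicting default. All data here is synthetic.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for the social-lending data: one text field per loan
# plus two structured variables (e.g. age, loan percentage).
texts = ["urgent loan for car repair", "consolidation of old debts",
         "business expansion funds", "holiday trip financing",
         "car repair urgent need", "debt consolidation please",
         "expanding small business", "trip abroad funding"] * 25
X_struct = rng.normal(size=(len(texts), 2))
y = rng.integers(0, 2, size=len(texts))   # 1 = default, 0 = repaid

# Convert text to a handful of SVD components (dimension count is
# arbitrary here; the study tuned it across several combinations).
tfidf = TfidfVectorizer().fit_transform(texts)
svd_feats = TruncatedSVD(n_components=5, random_state=0).fit_transform(tfidf)

# Hybrid variable set: structured columns plus SVD text columns.
X_hybrid = np.hstack([X_struct, svd_feats])

model = LogisticRegression(max_iter=1000).fit(X_hybrid, y)
probs = model.predict_proba(X_hybrid)[:, 1]   # estimated P(default)
```

Dropping either `X_struct` or `svd_feats` from the `hstack` gives the structured-only and text-only variants that the study compares against the hybrid model.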
Results
•The uncorrected models performed well on misclassification rate (between 10% and 20%) but were very poor at predicting the '1' class: no more than 50% of defaults were identified as true positives (train data).
•The accuracy of the corrected models was measured with decision-table classifications, where misclassification rates ranged from 15% to 28%. However, the true positive rate improved dramatically: the best hybrid model correctly identified 88% of all defaults (train data).
•The obtained models performed very well in classifying the validation set; even the worst text-based model had an accuracy of 71%.
Conclusions
•Across the model selection statistics, the hybrid model performs better than the rest, with 75% accuracy.
•After the adjustments, text variables increased model accuracy by 5% compared to the structured-only model.
•Even though the distribution of the target variable is broadly similar across the obtained clusters, the SVD variables still allow a text-only model with 72% accuracy on the train data and 71% on the validation data (adjusted model).
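The prior-probability step described under model building, rescaling scores from the balanced 50/50 oversample back to the true 17% default rate, follows a standard correction. A minimal sketch (the function name and the mapping of the study's settings onto its parameters are illustrative):

```python
# Standard prior correction for a model trained on an oversampled
# (50/50) sample when the true event rate is 17%, as in the study.
def correct_for_priors(p_sampled, true_prior=0.17, sample_prior=0.5):
    """Rescale a probability estimated on an oversampled training set
    back to the original population prior."""
    num = p_sampled * true_prior / sample_prior
    den = num + (1 - p_sampled) * (1 - true_prior) / (1 - sample_prior)
    return num / den

# A 50% score on the balanced sample maps back to the 17% base rate.
print(round(correct_for_priors(0.5), 2))  # -> 0.17
```

In SAS Enterprise Miner this rescaling is what applying prior probabilities in the decisions setup accomplishes; the sketch only makes the arithmetic explicit.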
[Figures: ROC curve and lift charts comparing the hybrid, structured, and text models]