Text Mining in Credit Scoring: Enhancement or maybe fully fledged modelling approach?
Rafał Wojdan
Warsaw School of Economics
References
- Kolyshkina, I., & van Rooyen, M. (2006). Text Mining for Insurance Claim Cost Prediction. In G.J. Williams & S.J. Simoff (Eds.), Data Mining, LNAI 3775, 192-202. Berlin: Springer-Verlag.
- Chakraborty, G., Pagolu, M., & Garla, S. (2013). Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using SAS. SAS Institute.
- http://support.sas.com/kb/22/601.html
Abstract
As has recently been shown, good credit scoring models are crucial not only for the banking business but for the whole global economy. Hence, the study verifies whether credit scoring models can be improved with variables obtained from text mining analysis in SAS Enterprise Miner 12.1. Furthermore, encouraged by successful applications of purely text-based models in the insurance industry, the project tries to build a model restricted to text variables only, with predictive power similar to that of a model built on structured data only. The purpose of the study is to evaluate the impact of textual information on credit scoring and to uncover the potential hidden in text variables. Because banking data is highly confidential, the study is based on loans made via a social lending service, which is assumed to be a good proxy for the banking industry. Social lending is a relatively new phenomenon that combines features of social networking services and loan services. As a result, investors' comments on loan requests, which can be highly useful for text mining, are accessible there.
Objective
• Predict loan default using logistic regression models in SAS® Enterprise Miner™
• Compare the results of the best model in each of 3 groups: structured-data model, text-data model, and hybrid model
• Identify the value of text data for credit scoring modelling
Methods
Data preparation:
• The data were collected from one Polish social lending site. Besides structured data, the site also offers access to text data such as the title, description, and comments for every loan. The whole set contained 47,224 observations and 154 variables.
• Text data were converted to structured form using the Text Cluster node. Various combinations of the number of clusters, SVD resolution, and maximum number of SVD dimensions were tried in order to find the best new variables.
• Because each loan had zero to many comments, aggregation was necessary. A set of variables indicating the number and presence of comments from particular categories was built. For the other categories, SVD variables were used.
• Some of the structured variables were additionally grouped into broader categories, for example age or loan percent.
• The target variable is binary: 1 = default, 0 = repaid.

Model building:
Different logistic models were built for every set of variables obtained in clustering. The models were first compared using the misclassification rate. Because they performed poorly at identifying defaults, which matter more from the investors' perspective, several modifications were applied. Since the event (1) made up only 17% of the analysed population, oversampling was used to improve the prediction of defaults: stratified sampling produced a sample of 12,000 observations split equally by target value. Next, prior probabilities reflecting the original proportions in the population were applied. Additionally, decision weights equal to the rounded inverses of these prior probabilities were used. In this new setting, the appropriate measures for comparing models were the Profit/Loss measure and the classifications based on the decision table. The remaining observations were used for scoring and validation.
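The prior correction and decision weights described above can be sketched in a few lines of Python. This is an illustration only, not the SAS Enterprise Miner implementation: the function names are hypothetical, the 17% event prior and the 50/50 sample split come from the text, and the decision rule (flag a loan when the weighted probability of default exceeds the weighted probability of repayment) is an assumption about how the weights enter the decision.

```python
def correct_for_oversampling(p_sample, prior_event=0.17, sample_event=0.5):
    """Rescale a probability estimated on the balanced sample to the population prior."""
    event = p_sample * prior_event / sample_event
    non_event = (1.0 - p_sample) * (1.0 - prior_event) / (1.0 - sample_event)
    return event / (event + non_event)


def decision_weights(prior_event=0.17):
    """Rounded inverse priors for the event and the non-event (assumed weighting scheme)."""
    return round(1.0 / prior_event), round(1.0 / (1.0 - prior_event))


def decide(p_sample, prior_event=0.17, sample_event=0.5):
    """Return 1 (default) when the weighted event probability is the larger one."""
    p = correct_for_oversampling(p_sample, prior_event, sample_event)
    w_default, w_repaid = decision_weights(prior_event)
    return 1 if p * w_default > (1.0 - p) * w_repaid else 0


if __name__ == "__main__":
    for p_sample in (0.3, 0.5, 0.6):
        p = correct_for_oversampling(p_sample)
        print(f"sample p={p_sample:.2f}  corrected p={p:.3f}  decision={decide(p_sample)}")
```

With these weights the effective cut-off on the corrected probability falls well below 0.5, which is why more loans are flagged as defaults than under a plain 0.5 threshold.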
Results
• The models without corrections performed well in terms of misclassification rate (between 10% and 20%) but were very poor at predicting defaults: no more than 50% of the actual defaults (1) were identified correctly (train data).
• The accuracy of the corrected models was measured with the classifications in the decision table. Here the misclassification rates ranged from 15% to 28%. However, the true positive rate improved dramatically, to 88% of all defaults correctly identified for the best hybrid model (train data).
• The resulting models performed very well in classifying the validation set; even the worst text-based model reached an accuracy of 71%.
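To see why a low misclassification rate can coexist with a poor true positive rate when only 17% of loans default, consider the sketch below. The confusion-matrix counts are hypothetical, chosen only to illustrate the effect for 1,000 loans; they are not the study's figures.

```python
def rates(tp, fp, fn, tn):
    """Summary rates from confusion-matrix counts (event = default = 1)."""
    total = tp + fp + fn + tn
    return {
        "misclassification": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),
    }


# Hypothetical counts for 1,000 loans, 170 of which default (17%).
uncorrected = rates(tp=70, fp=20, fn=100, tn=810)   # cautious model, misses most defaults
corrected = rates(tp=150, fp=200, fn=20, tn=630)    # flags more loans, catches most defaults

if __name__ == "__main__":
    for name, r in (("uncorrected", uncorrected), ("corrected", corrected)):
        print(name, {k: round(v, 3) for k, v in r.items()})
```

In this toy case the uncorrected model has the lower misclassification rate yet finds fewer than half of the defaults, while the corrected model trades some overall accuracy for a much higher true positive rate.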
Conclusions
• Across the selection statistics, the hybrid model performs better than the rest, with an accuracy of 75%.
• After the adjustments, adding text variables increased model accuracy by 5% compared to the structured-data model.
• Even though the distribution of the target variable is more or less similar across the obtained clusters, the SVD variables still make it possible to build a text-only model with an accuracy of 72% on train data and 71% on validation data (adjusted model).
[Figures: ROC curve and lift chart comparing the hybrid model, structured model, and text model]
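For readers unfamiliar with the two charts, the following is a minimal sketch of the quantities behind them: the area under the ROC curve and the lift in the top-scored fraction of loans. The function names and the toy labels and scores are assumptions made for illustration; they are not taken from the poster's data.

```python
def roc_auc(labels, scores):
    """Probability that a random default outscores a random repaid loan (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def lift_at(labels, scores, fraction):
    """Default rate among the top-scored fraction divided by the overall default rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    k = max(1, round(fraction * len(ranked)))
    return (sum(ranked[:k]) / k) / (sum(labels) / len(labels))


# Toy data: 1 = default, 0 = repaid; higher score = higher predicted default risk.
labels = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.2, 0.1]

if __name__ == "__main__":
    print("AUC:", round(roc_auc(labels, scores), 3))
    print("Lift in top 30%:", round(lift_at(labels, scores, 0.3), 3))
```

An AUC of 0.5 and a lift of 1 correspond to random ranking, so values above these indicate that the scores concentrate defaults near the top of the list.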
