RatingBot: A Text Mining Based Rating Approach
Presented by Wong Ming Jie & Zhu Cungen
CONTENT
● Motivation
● Credit Risk
● Measurement of Credit Risk
● Related Work
● Text Mining in Finance
● Machine Learning in Credit Rating
● Research Question
● Data Context
● Methodology
● Discussion and Conclusion
MOTIVATION
● Financial institutions such as banks serve as intermediaries between
o lenders (seeking short-term investment opportunities) and
o borrowers (seeking long-term financing)
● They facilitate the development of products and services for both groups of clientele
● Their services and products need to strike a balance between risk-taking and profit
o Maximise profits
o Without exceeding their risk appetite
MOTIVATION
● Risk is at the core of the financial industry
o Credit;
o Market; and
o Operational risks
● Credit risk is defined as the risk that a counterparty (borrower) becomes less likely to fulfil its obligation, in part or in full, on the agreed-upon date
● To manage credit risk, banks need to develop rating models for capital calculation to
o Quantify the possible expected and unexpected loss of the counterparty
o Ascertain the counterparty's creditworthiness
MOTIVATION
● Measurement of credit risk comprises:
o Probability of Default (PD);
o Loss Given Default (LGD); and
o Exposure at Default (EAD)
● PD represents the creditworthiness of the counterparty and is used for loan approval
● PD is typically estimated by statistical inference from quantitative/categorical input variables, using financial information (historical number of defaults) and demographics (country) to determine whether a counterparty will default
● This form of inference is limited to large homogeneous populations with substantial incidents of default
MOTIVATION
● For small, heterogeneous populations with low default rates, the data used for statistical inference may not be generalizable, and PD is difficult to predict
● In most cases, banks move towards an approach that combines
o Quantitative sources (including profitability)
o Qualitative sources (corporate disclosures, news, analyst calls)
● Characteristics of qualitative sources may contribute to estimating creditworthiness (improving PD estimation)
o The sentiment of corporate disclosures correlates with future earnings, return on assets, stock returns and return volatility
o Strong financial performance may relate to creditworthiness in ways that would otherwise be overlooked
MOTIVATION
● Qualitative sources such as annual reports provide structured, objective and forward-looking information and are publicly available
● They add predictive power because they cover factors that are difficult to quantify (future strategy, performance) and may be relevant to creditworthiness
● Unlike quantitative sources, qualitative sources require experts to extract and interpret meaningful information
● Manual interpretation and coding are exposed to subjectivity, errors and inefficiencies, which affects PD estimation
RELATED WORK – TEXT MINING
● Text mining has been studied extensively in finance
● Content analysis methods are required to extract relevant information from the unstructured form of annual reports for PD estimation
o Bag-of-words
o Document narrative extraction
● Unlike document narrative extraction, bag-of-words is more flexible and assumes word (or sentence) order is irrelevant for document representation
● The bag-of-words method can be implemented with either
o a term-weighting approach, or
o a machine-learning approach
RELATED WORK – TEXT MINING
● Term-weighting structures a document as a set of terms and assigns each term a weight based on its importance, from which a sentiment score can be derived
● Predefined sentiment dictionaries can be used to determine the sentiment score of the document
o Harvard GI word list (designed for general use)
o Loughran and McDonald (designed for financial use and credit rating applications)
● The Loughran and McDonald dictionary was extracted from a set of 10-K SEC filings
● A 10-K filing is an annual report required by the U.S. Securities and Exchange Commission (SEC) from a company; it provides a comprehensive summary of the company's financial performance (filed within 60 days after its fiscal year end)
RELATED WORK – TEXT MINING
● Machine-learning approaches require a set of manually labelled words or sentences as inputs, which are then used to train classification models
● These models are then used to label new words or sentences in the document
● This method is costlier to implement and requires finance experts who are native speakers of the language of the documents
● A newer approach is to identify themes within a corpus of documents by discovering latent (hidden) topics that represent each document, using probabilistic Latent Dirichlet Allocation (LDA) with Gibbs sampling (MCMC) to derive the topics and approximate the distribution parameters
RELATED WORK – CREDIT RATING
● Most machine learning in credit rating uses binary classification (default/non-default) to predict the default of a counterparty
● The most common methods in the literature are:
o Neural Networks (NN)
o Support Vector Machines (SVM)
o Linear and Logistic Regression (LR)
o Decision Trees (DT)
o Fuzzy Logic (FL)
o Genetic Programming (GP)
o Discriminant Analysis (DA)
● Most of the proposed machine learning techniques use only quantitative information as model variables or inputs
RESEARCH QUESTION
● Research Question: Can we combine the two approaches?
o Use term-weighting, topic models and sentiment analysis to identify and represent the qualitative information in annual reports in a structured way -> inputs
o Apply machine-learning classification approaches on these inputs to predict the credit rating of a company
● Proposition: use the annual report of a company as input and apply text mining and classification approaches to automatically and objectively derive its credit rating
● Credit ratings, as an indication of perceived risk, can be used by banks and financial institutions to investigate PD
DATA CONTEXT
● Dataset 1:
o The available 10-K filings were downloaded from the SEC EDGAR database for 2009 to 2016
o Starting in 2009 avoids the influence of the financial crisis of 2007–2008
o 34,972 10-K filings were joined with 17,622 Standard & Poor's (S&P) ratings from 2016 (9,197 unique companies)
o The joined dataset was then constructed by matching the SEC statements with the latest S&P ratings after 2009, so that the reports were issued at most 9 months before the credit rating dates
o This resulted in 1,716 data points for 1,351 companies
o After removing 228 data points with a defaulted rating (the intention is to study the credit rating class), the final sample has 1,488 data points
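As a rough illustration of the matching step above, the sketch below pairs each filing with the next rating issued within roughly 9 months using pandas. The frame and column names (company_id, filing_date, rating_date) are hypothetical stand-ins, not the paper's actual schema.

```python
# Hypothetical sketch of the Dataset 1 construction: match each 10-K filing to
# the company's next S&P rating, keeping pairs where the filing precedes the
# rating date by at most ~9 months. Column names are assumptions.
import pandas as pd

filings = pd.DataFrame({
    "company_id": ["A", "A", "B"],
    "filing_date": pd.to_datetime(["2010-02-15", "2011-02-20", "2012-03-01"]),
})
ratings = pd.DataFrame({
    "company_id": ["A", "A", "B"],
    "rating_date": pd.to_datetime(["2010-06-30", "2011-07-15", "2013-06-01"]),
    "rating": [5, 6, 12],
})

# merge_asof with direction="forward" finds, for each filing, the earliest
# rating issued on or after the filing date (both frames must be sorted).
matched = pd.merge_asof(
    filings.sort_values("filing_date"),
    ratings.sort_values("rating_date"),
    left_on="filing_date", right_on="rating_date",
    by="company_id", direction="forward",
    tolerance=pd.Timedelta(days=9 * 30),   # "at most 9 months before the rating"
)
matched = matched.dropna(subset=["rating"])  # drop filings with no rating in the window
print(matched)
```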
DATA CONTEXT
● Dataset 2:
o Provided by a major European bank
o Consists of annual reports and internal credit ratings of companies between 2009 and 2016
o Contains non-standardized general reports, some partly in 10-K filing format and some not written in English
o 10,435 annual statements in total
o After removing read-protected and non-English reports and reports from defaulted and non-rated companies, 5,508 annual statements remained for consideration
● Dataset 1 was used because it allows replication of the results and 10-K filings are a commonly used source in the literature
● Dataset 2 was used because it represents a real-world scenario with internal ratings
DATA CONTEXT
● Document representation:
o The Loughran and McDonald dictionary was used for the sentiment-weighted analysis in the term-weighting approach
o The dictionary was derived from 10-K SEC filings (suitable for the credit rating context)
o Terms appearing in less than 5% of the documents (Datasets 1 and 2) were removed, and the chi-squared statistic was then used so that only the most important terms remained
● A robustness check was also made to ensure that the annual reports from Datasets 1 and 2 are relevant to the credit rating of a company
● Datasets 1 and 2 were compared on the absolute frequency of terms and the sentiment-weighted frequency of terms after applying the dictionary
DATA CONTEXT
Methodology
● Data Context
o Feature: a bank rates its counterparty $i$ based on the counterparty's annual report $a_i$, $i \in \{1, 2, \ldots, m\}$: textual or qualitative data $X$
o Label: counterparty $i$ receives a credit rating $c_i$, $c_i \in \{1, 2, \ldots, n\}$
o What can we do?
✓ Derive the relationship between the textual report and the rating
✓ Predict the rating for new counterparties given their annual reports
● Textual data cannot be computed on directly
o Information transformation
✓ Preprocess the $m$ textual documents $\{a_i\}_{i=1}^{m}$
✓ Represent each report in a new, computable quantitative form
[Diagram: Transformation: 1. Text preprocessing, 2. Document representation (qualitative form → quantitative form)]
Methodology
● Text Preprocessing
o Step 1: remove pictures, HTML tags, formatting, etc.
✓ Only the raw text is left
o Step 2: transform every letter to lower case
✓ For instance, 'Capital' becomes 'capital'
o Step 3: remove numbers, special characters and punctuation
o Step 4: tokenize sentences into words by splitting on spaces
o Step 5: remove meaningless stop words, such as 'or', 'and' and 'the'
o Step 6: stem terms (words) back to their root form
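A minimal Python sketch of these six steps is shown below, assuming nltk is available for the Porter stemmer; the stop-word list is a small illustrative subset rather than a full list.

```python
# Minimal sketch of the six preprocessing steps (assumes nltk is installed;
# the stop-word list here is a small illustrative subset, not a full list).
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"or", "and", "the", "of", "to", "in", "a", "is", "are"}
stemmer = PorterStemmer()

def preprocess(raw_html: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", raw_html)            # Step 1: strip HTML tags/formatting
    text = text.lower()                                  # Step 2: lower-case every letter
    text = re.sub(r"[^a-z\s]", " ", text)                # Step 3: drop numbers, punctuation, special chars
    tokens = text.split()                                # Step 4: tokenize on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]  # Step 5: remove stop words
    return [stemmer.stem(t) for t in tokens]             # Step 6: stem terms to their root form

print(preprocess("<p>Credit risks declined in 2016, and profits increased.</p>"))
# e.g. ['credit', 'risk', 'declin', 'profit', 'increas']
```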
Methodology
● Text Preprocessing
o Step 6: stem terms (words) back to their root form
✓ Stemming: remove 's' from 'risks' or 'ed' from 'declined'
• Groups terms into the same semantic root
• However, not applicable to terms like 'goes' and 'went'
✓ Alternative: lemmatization
• Can overcome the limitations of stemming with higher precision
• Requires complex computation and is unwieldy in practice
• The stemming algorithm is still chosen here
o Result of text preprocessing
✓ A list of stemmed terms
Methodology
● Document representation
o In the list of stemmed terms
✓ Many terms occur more than once
✓ Compress them and make them interpretable
• Term-weighting: weight every unique term
Methodology
● Term-weighting
o Binary frequency: whether a term occurs (dummy variable)
✓ Too naive: ignores the true frequency
o Absolute or relative frequency: how many times a term occurs in $a_i$
✓ Ignores the distribution of a term over different $a_i$
✓ E.g. if a term has the same distribution across all $a_i$, it might be useless for predicting the rating
o Term frequency–inverse document frequency (tf-idf)
✓ Down-weights very frequent terms while up-weighting rare terms (smoothing)
o All of these schemes ignore the sentiment of words
✓ In finance, the sentiment of a report has a significant relationship with company performance
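The sketch below illustrates these weighting schemes with scikit-learn on two toy, already-preprocessed "reports"; it is a minimal example, not the paper's implementation.

```python
# Small sketch of the term-weighting schemes above, using scikit-learn.
# The two "reports" are toy stand-ins for preprocessed 10-K text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reports = ["credit risk declin profit increas",
           "credit loss increas impair charg"]

binary = CountVectorizer(binary=True).fit_transform(reports)   # binary frequency (dummy)
absolute = CountVectorizer().fit_transform(reports)            # absolute frequency
tfidf = TfidfVectorizer().fit_transform(reports)               # tf-idf weights
# Terms shared by both reports (e.g. 'credit', 'increas') receive lower tf-idf weight.
print(tfidf.toarray().round(2))
```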
Methodology
● Term-weighting
o The schemes above ignore the sentiment of words
o Sentiment-weighted frequency
✓ What if $Sen_{i,l} = v_{i,l} \times Sen(\mathrm{term}_l)$, where $v_{i,l}$ is the weight of term $l$ in report $a_i$ and $Sen(\mathrm{term}_l)$ is its dictionary sentiment?
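A minimal sketch of this idea, assuming $v_{i,l}$ is the tf-idf weight and $Sen(\mathrm{term}_l)$ is a ±1 dictionary value; the tiny sentiment dictionary below is purely illustrative, not the Loughran and McDonald list.

```python
# Minimal sketch of a sentiment-weighted frequency, assuming
# Sen_{i,l} = v_{i,l} * Sen(term_l), where v_{i,l} is the tf-idf weight and
# Sen(term_l) is +1/-1 from a dictionary. The tiny dictionary below is
# illustrative only, not the Loughran and McDonald word list.
from sklearn.feature_extraction.text import TfidfVectorizer

SENTIMENT = {"profit": +1.0, "increas": +1.0, "loss": -1.0, "impair": -1.0, "declin": -1.0}

reports = ["credit risk declin profit increas",
           "credit loss increas impair charg"]

vec = TfidfVectorizer()
v = vec.fit_transform(reports).toarray()   # v[i, l]: tf-idf weight of term l in report i
terms = vec.get_feature_names_out()
sen = [[v[i, l] * SENTIMENT.get(t, 0.0) for l, t in enumerate(terms)]
       for i in range(len(reports))]
print(dict(zip(terms, [round(x, 2) for x in sen[0]])))   # signed weights for report 0
```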
Methodology
● Document representation
o In the list of stemmed terms
✓ Many terms occur more than once
✓ Compress them and make them interpretable
• Term-weighting: weight every unique term
✓ Reduce the number of terms to avoid overfitting and computational complexity
• Term selection: remove terms with low frequency
• Term selection: remove terms with low explanatory power by a chi-squared test
• Term extraction: topic model (LDA)
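The sketch below shows one way the two term-selection steps could look in scikit-learn (a 5% minimum document frequency and a chi-squared filter); the documents, labels and value of k are toy assumptions.

```python
# Sketch of the two term-selection steps: drop rare terms via a minimum document
# frequency, then keep the terms with the highest chi-squared association with
# the rating class. Documents, labels and k are toy assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

reports = ["profit increas strong growth",
           "loss impair default risk",
           "profit growth dividend",
           "loss declin risk impair"]
ratings = [1, 4, 1, 4]                                 # toy rating classes

vec = CountVectorizer(min_df=0.05)                     # mirrors the 5% document-frequency threshold
X = vec.fit_transform(reports)
selector = SelectKBest(chi2, k=4).fit(X, ratings)      # keep the 4 most discriminative terms
kept = vec.get_feature_names_out()[selector.get_support()]
print(kept)
```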
Methodology
● LDA
o Sample one document $a_i$ with probability $p(a_i)$
o Sample from the $\mathrm{Dirichlet}(\boldsymbol{\alpha})$ distribution to generate the topic distribution $\boldsymbol{\theta}_i$ for $a_i$
o Sample from $\boldsymbol{\theta}_i$ for $a_i$ to get topic $z_{i,j}$
o Topics are latent and unknown
o Sample from the $\mathrm{Dirichlet}(\boldsymbol{\beta})$ distribution to generate the word distribution $\boldsymbol{\varphi}_{z_{i,j}}$ for topic $z_{i,j}$
o Sample from $\boldsymbol{\varphi}_{z_{i,j}}$ to generate the final word $w_{i,j}$
Hyperparameters ($\boldsymbol{\alpha}$, $\boldsymbol{\beta}$): parameters of parameters
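A small sketch of fitting LDA and recovering the per-document topic proportions is given below; note that scikit-learn's implementation uses variational inference rather than the Gibbs sampling mentioned in the related work, so this is only an approximation of the described pipeline.

```python
# Sketch of fitting LDA to a document-term matrix and recovering, for each
# document a_i, the topic proportions p(topic_h | a_i). scikit-learn's LDA
# uses variational inference rather than Gibbs sampling.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reports = ["profit increas strong growth dividend",
           "loss impair default risk declin",
           "profit growth revenu increas",
           "risk loss declin impair default"]

X = CountVectorizer().fit_transform(reports)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # alpha/beta priors use defaults
theta = lda.fit_transform(X)      # theta[i, h] approximates p(topic_h | a_i)
phi = lda.components_             # unnormalized word distributions per topic
print(theta.round(2))
```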
Methodology
● Document representation
o In the list of stemmed terms
✓ Many terms occur more than once
✓ Compress them and make them interpretable
• Term-weighting: weight every unique term
✓ Reduce the number of terms to avoid overfitting and computational complexity
• Term selection: remove terms with low frequency
• Term selection: remove terms with low explanatory power by a chi-squared test
• Term extraction: topic model (LDA)
o Results: $\{p(\mathrm{term}_l \mid \mathrm{topic}_h)\}_{l=1}^{p}$ and $\mathrm{topic}_{i,h} = p(\mathrm{topic}_h \mid a_i)$
Methodology
● Document representation
o Results: $\{p(\mathrm{term}_l \mid \mathrm{topic}_h)\}_{l=1}^{p}$ and $\mathrm{topic}_{i,h} = p(\mathrm{topic}_h \mid a_i)$
✓ Interpretation: $\mathrm{topic}_{i,h}$ measures the significance of topic $h$ for document $a_i$, just as the weight of word term $l$ does
o Final data set: $\left\{\{w_{i,l}\}_{l=1}^{p},\, c_i\right\}_{i=1}^{m}$
✓ where $w_{i,l}$ includes the weights $Sen_{i,l}$ and $\mathrm{topic}_{i,h}$
✓ i.e. the transformation maps $\{a_i, c_i\}_{i=1}^{m}$ from the qualitative form $a_i$ to the quantitative form $w_{i,l}$
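One plausible reading of the final data set is a feature matrix that concatenates the sentiment-weighted term weights with the topic proportions; the sketch below assumes this reading, with toy arrays standing in for the quantities built above.

```python
# Sketch of assembling the final data set {({w_{i,l}}, c_i)}: concatenate the
# sentiment-weighted term weights with the LDA topic proportions per document.
# The arrays are toy stand-ins for the outputs of the earlier sketches.
import numpy as np

sen = np.array([[0.3, -0.2, 0.0], [0.0, -0.5, 0.4]])  # toy sentiment-weighted term weights
theta = np.array([[0.8, 0.2], [0.1, 0.9]])            # toy topic proportions per document
c = np.array([1, 4])                                   # toy rating labels

W = np.hstack([sen, theta])      # quantitative form w_{i,l} for each report a_i
print(W.shape, c.shape)          # features and labels ready for the classifiers below
```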
Methodology
● Classification
o Naïve Bayes (NB): benchmark
✓ Aim: $p(c_i = k \mid a_i) \propto p(c_i = k)\, p(a_i \mid c_i = k)$
✓ Known: $p(c_i = k) = \sum_{i=1}^{m} I(c_i = k) / m$
✓ MLE: $p(a_i \mid c_i = k) = \prod_{l=1}^{p} p(w_{i,l} \mid c_i = k)$
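A minimal sketch of the Naïve Bayes benchmark with scikit-learn: the class prior corresponds to the count-based estimate above, and the likelihood factorizes over term counts, which MultinomialNB expects to be non-negative.

```python
# Minimal Naive Bayes sketch: class priors p(c = k) are estimated from label
# counts and p(a_i | c = k) factorizes over the term features, as in the
# formulas above. Plain term frequencies are used as features here.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0, 1],       # toy term-frequency features
              [0, 4, 2],
              [2, 1, 0],
              [0, 3, 3]])
y = np.array([1, 2, 1, 2])     # toy rating classes

nb = MultinomialNB().fit(X, y)
print(nb.class_log_prior_)     # log p(c = k), estimated as count(k)/m
print(nb.predict([[1, 0, 1]])) # predicted class for a new report
```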
Methodology
● Classification
o Support Vector Machine (SVM): what about using it as the benchmark?
✓ Aim: maximum-margin hyperplane
✓ The distance between the hyperplane (defined by the weight vector $\boldsymbol{w}$) and the nearest points from either group is maximized
✓ Reduces overfitting via L2-norm regularization
✓ Stable
✓ Lacks interpretability
Methodology
● Classification
o Neural Networks (NN)
✓ Three layers: input, hidden (one or multiple) and output layer
✓ Before the hidden layer, an aggregation function: $\sum_{l=1}^{p} \beta_l\, w_{i,l}$
✓ In the hidden layer, an activation function $g(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$ for each subsequent layer
✓ Label rule: $c(a_{new}) = \begin{cases} 1, & g(\boldsymbol{\beta}^{T} \boldsymbol{w}_{new}) > 0 \\ 2, & \text{else} \end{cases}$
✓ Train the model with backpropagation, which propagates prediction errors to update the weights
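The sketch below mirrors the single-unit computation described on this slide (aggregation followed by a step activation); the weights are illustrative and untrained, since training with backpropagation would in practice use a differentiable activation.

```python
# Minimal sketch of the single-unit network described above: an aggregation
# sum_l beta_l * w_l followed by a step activation. The weights beta are
# illustrative, not trained.
import numpy as np

def g(x):                        # step activation from the slide
    return 1 if x > 0 else 0

beta = np.array([0.4, -0.7, 0.2])        # toy weights
w_new = np.array([0.5, 0.1, 0.3])        # term weights of a new report

aggregation = float(beta @ w_new)        # sum_l beta_l * w_{new,l}
label = 1 if g(aggregation) == 1 else 2  # label rule: class 1 if g(.) fires, else 2
print(aggregation, label)
```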
Methodology
● Classification
o Decision Tree (DT)
✓ Aim: $p(c_i = k \mid a_i) \propto p(c_i = k)\, p(a_i \mid c_i = k)$
✓ Splitting criterion: chi-squared, Gini coefficient or entropy-based
✓ Prone to overfitting
✓ Good interpretability
Methodology
● Classification
o Logistic Regression (LR)
✓ A popular method
✓ Requires uncorrelated independent variables (exogeneity)
Methodology
● Classification
o Discriminant Analysis (DA)
✓ Assumption: $w_i \mid c_i = k \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$
✓ Estimate $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}$
✓ Rule: $c(a_{new}) = \begin{cases} 1, & \log \dfrac{p(c_i = 1 \mid a_{new})}{p(c_i = 2 \mid a_{new})} > 0 \\ 2, & \text{else} \end{cases}$ (i.e. class 1 if its posterior ratio exceeds 1)
✓ $\log \dfrac{p(c_i = 1 \mid a_{new})}{p(c_i = 2 \mid a_{new})}$ is a linear function of the $w_{i,l}$
✓ Not applicable to non-linear cases
Methodology
● Classification
o Supervised Topic Models (STM)
✓ Define the topic distribution of each document: $\theta_{a_i} \mid \vartheta \sim \mathrm{Dirichlet}(\vartheta)$
✓ Sample topic $z_{a_i,l} \mid \theta_{a_i} \sim \mathrm{Multinomial}(\theta_{a_i})$
✓ Sample term $\mathrm{term}_{a_i,l} \mid z_{a_i,l}, \boldsymbol{\beta} \sim \mathrm{Mult}(\beta_{z_{a_i,l}})$
✓ Sample the response variable $c_{a_i} \mid z_{a_i,l}, \boldsymbol{\delta}, \sigma^2 \sim N(\boldsymbol{\delta}^{T} \bar{z}_{a_i}, \sigma^2)$ by linear regression
• $\boldsymbol{\delta} = (\delta_1, \ldots, \delta_h, \ldots, \delta_H)^{T}$, $\bar{z}_{a_i} = \frac{1}{p} \sum_{l=1}^{p} z_{a_i,l}$
• Regress the label on the topics
• Parameters are estimated by Expectation-Maximization (EM)
✓ Rule: $c(a_{new}) = \mathrm{round}\!\left(\boldsymbol{\delta}^{T} E[\bar{z}_{a_{new}} \mid \vartheta, \boldsymbol{\beta}, w]\right)$
Model Development and Evaluation
● Dependent Variables
o 19 ratings or classes: $c_i$
o Rating bands
✓ $\mathrm{band}_i = \begin{cases} 1 & \text{if } c_i \le 7 \\ 2 & \text{if } 7 < c_i \le 10 \\ 3 & \text{if } 10 < c_i \le 13 \\ 4 & \text{if } c_i > 13 \end{cases}$
✓ Less granular
o Binary dependent variable: $\mathrm{bin}_i$
✓ $\mathrm{bin}_i = \begin{cases} \text{investment grade company} & \text{if } c_i \le 10 \\ \text{speculative grade company} & \text{if } c_i > 10 \end{cases}$
✓ Comparable
o All three dependent variables are unbalanced
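A small sketch of deriving the rating bands and the binary investment/speculative grade variable from the 19-class rating, following the thresholds above.

```python
# Sketch of deriving the three dependent variables from the 19-class rating c_i.
def rating_band(c: int) -> int:
    if c <= 7:
        return 1
    elif c <= 10:
        return 2
    elif c <= 13:
        return 3
    return 4

def binary_grade(c: int) -> str:
    return "investment grade" if c <= 10 else "speculative grade"

ratings = [3, 8, 12, 17]
print([(c, rating_band(c), binary_grade(c)) for c in ratings])
```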
Model Development and Evaluation
● Dependent Variables
o In addition, intuitively, fewer classes mean higher accuracy
✓ $p(\text{landing in the big circle}) > p(\text{landing in a sub-circle})$
✓ Accuracy: binary ratings > rating bands > 19 ratings
Model Development and Evaluation
● Test Set Performance
o Train : test = 3 : 1
o For each column (dependent variable), the classifier reported is the one that performed best
o All accuracies are higher than the random baselines 1/19 ≈ 5.3%, 1/4 = 25%, 1/2 = 50%
o SVM performs best for Dataset 2, while DT and NN perform best for Dataset 1
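The sketch below reproduces this evaluation setup in spirit: a 3:1 train/test split and several of the classifiers discussed above compared on test accuracy. The features and labels are random toy data, so the printed accuracies hover around chance; only the setup, not the numbers, reflects the paper.

```python
# Sketch of the evaluation setup: 3:1 train/test split and several classifiers
# compared on test accuracy. X and y are random toy stand-ins for the document
# representations and rating bands built earlier.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 30))   # toy non-negative term counts
y = rng.integers(1, 5, size=200)         # toy rating bands 1-4

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)  # train:test = 3:1

models = {"NB": MultinomialNB(), "SVM": LinearSVC(), "DT": DecisionTreeClassifier(),
          "LR": LogisticRegression(max_iter=1000)}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {acc:.2f}   (random baseline for 4 classes = 0.25)")
```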
Model Development and Evaluation
● Test Set Performance
o $\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$: overall accuracy
o $\mathrm{precision} = \frac{TP}{TP + FP}$: the proportion of correctly classified points within a predicted (classified) class
o $\mathrm{recall} = \frac{TP}{TP + FN}$: the proportion of correctly classified points within a true class
Model Development and Evaluation
● Test Set Performance
o $\mathrm{acc} = \frac{TP + TN}{TP + TN + FP + FN}$; $\mathrm{pre} = \frac{TP}{TP + FP}$; $\mathrm{recall} = \frac{TP}{TP + FN}$
o Thought: a small class tends to have low precision
o Thought: some classifiers sacrifice the precision and recall of a small class in order to pursue a high overall accuracy or a high precision on the big class
o Thought: understand the class distribution before choosing and developing a classifier
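A tiny illustration of these points: on an unbalanced test set, a classifier biased toward the big class can score high accuracy while the small class receives poor precision and recall.

```python
# Illustration of the imbalance effect: high overall accuracy can coexist with
# poor precision and recall on the small class.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1]*90 + [2]*10                        # unbalanced: class 2 is the small class
y_pred = [1]*88 + [2]*2 + [1]*8 + [2]*2         # classifier biased toward the big class

print("accuracy :", accuracy_score(y_true, y_pred))                   # 0.90
print("precision:", precision_score(y_true, y_pred, pos_label=2))     # TP/(TP+FP) for the small class
print("recall   :", recall_score(y_true, y_pred, pos_label=2))        # TP/(TP+FN) for the small class
```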
Model Development and Evaluation
● Model Results
[Results tables on slides 40-41]
Model Development and Evaluation
● Model Results
o How do we interpret the results?
✓ Compare accuracy with and without using sentiment
Model Development and Evaluation
● Model Results
o The effect of considering the sentiment of the terms depends on the classifier, the dataset, and the type of dependent variable
o NN, DT and SVM work best
o STM works poorly, which is surprising
✓ It has good interpretability
✓ Potential reason given by the paper:
• Linear regression should be used rather than a generalized linear model (GLM)
✓ My thoughts:
• A GLM might be expected to work better, since it makes fewer assumptions than linear regression
• So why does STM perform poorly? In STM's generative process, the response variable (class) is assumed to be produced by, or sampled from, the topics. However, the true ratings may not be produced in this way
• We only know that sentiment is related to the rating; that does not mean sentiment produces the rating
Conclusion
● RatingBot predicts a rating score from annual report text
● Limitations
o The credibility of the rating
✓ Consider a time component to capture companies manipulating their annual reports to obtain a higher rating
o Try other classifiers, such as deep learning
o Include other text sources, such as news and social media content
Thank you!