CORPORATE CREDIT RATING PREDICTION USING MACHINE LEARNING
by
Pedro Henrique Veronezi e Sa
A THESIS
Submitted to the Faculty of the Stevens Institute of Technology
in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE - FINANCIAL ENGINEERING
Pedro Henrique Veronezi e Sa, Candidate
ADVISORY COMMITTEE
Rupak Chatterjee, Advisor Date
David Starer, Reader Date
STEVENS INSTITUTE OF TECHNOLOGY
Castle Point on Hudson
Hoboken, NJ 07030
2016
© 2016, Pedro Henrique Veronezi e Sa. All rights reserved.
CORPORATE CREDIT RATING PREDICTION USING MACHINE LEARNING
ABSTRACT
This study uses machine learning techniques, such as the Random Forest
and the Multilayer Perceptron method, to predict corporate credit ratings for any
given company using publicly available financial data. Those ratings are then com-
pared to Standard & Poors and Moody’s credit ratings. The data used comes from
financial reports from 170 companies from the Health Care, Financial Services,
and Technology sectors of the S&P500 index, dating from 1990 to 2014. The pe-
riod of investigation is from 2000 to 2014. This study uses a specific Machine
Learning architecture framework for both learning methods. This thesis also introduces a new performance measurement, the Credit Rating Dissimilarity Coefficient, a statistical measure that compares the ratings from the prediction models
and the ratings from the credit rating agencies. The results presented in this study
show that it is possible to rate a company, even if it is not publicly traded, using
the same standards as the two biggest credit rating agencies through the use of
Machine Learning techniques. Machine Learning makes the rating process faster
and more efficient. The architecture framework presented achieved one notch ac-
curacy of more than 90% over the investigation period, for both credit agencies, and a Credit Rating Dissimilarity Coefficient of less than 0.40, which is approximately
60% better than the benchmark.
Author: Pedro Henrique Veronezi e Sa
Advisor: Rupak Chatterjee
Date: May 2, 2016
Department: School of Systems and Enterprises
Degree: Master of Science - Financial Engineering
Dedication
This thesis is dedicated to my family, who supported me during this phase;
especially to my mother, Lucila Maria Veronezi and to my grandmother, Barbara
Quagliato Veronezi.
Acknowledgments
I would like to acknowledge my advisor, Rupak Chatterjee, for his support in this
opportunity of personal and professional growth. I would like to acknowledge my
girlfriend, Katherine Thompson, who stood by my side during this process offering
nothing but support and help. I also would like to acknowledge my family, as they
were always present during my master's program and supported me in difficult
moments.
Table of Contents
Abstract iii
Dedication iv
Acknowledgments v
List of Tables viii
List of Figures ix
Chapter 1 Introduction 1
1.1 The Problem Statement 4
1.1.1 Hypothesis 6
1.2 Problem Scope 7
1.3 Research Approach 9
1.4 Organization and Structure 11
Chapter 2 Literature Review 14
2.1 Standard & Poor’s Method 16
2.2 Moody’s Method 21
2.3 Machine Learning Techniques 25
2.3.1 Deep Learning 25
2.3.2 Neural Networks 26
2.3.2.1 Multilayer Perceptron 32
2.3.3 Decision Tree 40
2.3.3.1 Random Forest 40
Chapter 3 Data and Computational Procedures 43
3.1 Data Source and Preprocessing 44
3.2 Data structure and characteristics 47
3.3 Framework architecture 57
Chapter 4 Results 67
4.1 Standard & Poors 68
4.2 Moody’s 80
4.3 Credit Rating Dissimilarity Coefficient 91
Chapter 5 Summary and Conclusion 97
5.1 Conclusion 97
5.2 Further Research 100
Appendices 103
Appendix A 103
Appendix B 107
Bibliography 109
List of Tables
3.1 Table of scale values for the corporate credit ratings 57
4.1 Statistics for Standard & Poors 69
4.2 Statistics for Moody’s 81
5.1 List of companies used on this study. 103
List of Figures
1.1 S&P500 index breakdown by GICS sectors, as of Jan 29, 2016. 7
2.1 Rating Agencies around the world, as of 2006. Source International Rating
Group (IRG) 15
2.2 *Due to lack of data, the highest number of rated issuers by one of 3 credit
rating agencies is assumed to be the total rated issuers, then the coverage
is calculated based on this total number. Source [Estrella, 2000] 16
2.3 Corporate Criteria Framework. Source [Standard & Poors, 2014a] 20
2.4 Ratings summary. Source [Standard & Poors, 2014a] 22
2.5 Corporate Default summary. Source [Standard & Poors, 2015] 23
2.6 Neuron representation 29
2.7 Multilayer Neural Network 34
2.8 Basics schematics of a Random Forest 40
3.1 Number of companies in the study over the years. 48
3.2 Distribution of datapoints in the sectors. 49
3.3 Ratings distribution for Moody’s. 49
3.4 Investment Level distribution for Moody’s. 50
3.5 Changes over the years for Moody’s. 51
3.6 Changes per rating for Moody’s. 52
3.7 Ratings distribution for Standard & Poors. 53
3.8 Investment Level distribution for Standard & Poors. 53
3.9 Changes over the years for Standard & Poors. 54
3.10 Changes per rating for Standard & Poors. 54
3.11 Comparison of the credit rating agencies over the years. 56
3.12 Comparison of the credit rating agencies by the rates. 56
3.13 Architecture Flowchart. 61
3.14 Performance indicators flowchart. 65
4.1 Ratings distribution for Random Forest model for Standard & Poors. 69
4.2 Ratings distribution for MLP model for Standard & Poors. 70
4.3 Ratings over years for Random Forest model for Standard & Poors. 71
4.4 Ratings over years for MLP model for Standard & Poors. 71
4.5 Accuracy by ratings for Random Forest model for Standard & Poors. 72
4.6 Accuracy by ratings for MLP model for Standard & Poors. 73
4.7 Cumulative overall accuracy for Random Forest model for Standard & Poors.
75
4.8 Cumulative overall accuracy for MLP model for Standard & Poors. 76
4.9 Cumulative changes only accuracy for Random Forest model for Standard
& Poors. 76
4.10 Cumulative changes only accuracy for MLP model for Standard & Poors. 77
4.11 Random Forest vs Multilayer Perceptron performance comparison for Stan-
dard & Poors. 78
4.12 Random Forest vs Standard & Poors performance comparison. 79
4.13 Ratings distribution for Random Forest model for Moody’s. 81
4.14 Ratings distribution for MLP model for Moody’s. 82
4.15 Ratings over years for Random Forest model for Moody’s. 83
4.16 Ratings over years for MLP model for Moody’s. 83
4.17 Accuracy by ratings for Random Forest model for Moody’s. 84
4.18 Accuracy by ratings for MLP model for Moody’s. 85
4.19 Cumulative overall accuracy for Random Forest model for Moody’s. 86
4.20 Cumulative overall accuracy for MLP model for Moody’s. 87
4.21 Cumulative changes only accuracy for Random Forest model for Moody’s. 87
4.22 Cumulative changes only accuracy for MLP model for Moody’s. 88
4.23 Random Forest vs Multilayer Perceptron performance comparison for Moody’s.
89
4.24 Random Forest vs Moody’s performance comparison. 90
4.25 Credit Rating Dissimilarity Coefficient for overall ratings. 92
4.26 Credit Rating Dissimilarity Coefficient for changes ratings. 95
Chapter 1
Introduction
Credit ratings have become ubiquitous these days since all market agents have
come to depend on these reports. The investors use these ratings to determine
their positions on any given corporate financial instrument, e.g. bonds, stocks, credit default swaps, etc. Bond issuers, the ones whose ratings are being scored, know that the rating affects their financing costs in a very fundamental way. After several defaults and crises in the market, even regulators use credit ratings as parameters for a series of regulations, from allowable investment alternatives to required capital for most global banking firms. For instance, we can cite the
regulation which states that pension funds must only hold Investment Grade (IG)
corporate bonds. This regulation can cause considerable changes in the market,
since pension funds play an important role in the financial system. So one can infer that downgrading a bond issuer, e.g. moving it to a non-Investment Grade rating, can trigger a sell-off of its bonds and a large loss in the bonds' value.
The market can be seen as an exchange of information in many different ways, and stakeholders need access to information about a company, including its strategy, in order to make investment decisions. In order to make the information standardized and available
throughout the market, the concept of ratings was created. The first appearance
of a standardized ratings business was in the 19th century, at the time of the US
railroad expansion. Henry Poor recognized the problem of an information gap
between the investors and the companies building the railroads. As stated by
[Langohr, 2010], Henry Poor was an editor of a local journal focused on railroads, The American Railroad Journal, and gathered information on the business standing and creditworthiness of these companies using a network of agents spread all across the US. Later, at the beginning of the 20th century, John Moody initiated agency
bond ratings in the US, expanding his business analysis services to include a rating
based on the company’s credit risk.
In 1999 the Bank for International Settlements (BIS) proposed rule changes that would provide an explicit role for credit ratings in determining a bank's required regulatory risk capital, widely known as Basel II. The BIS proposal vastly elevated
the importance of the credit rating by linking the required bank capital to the credit
rating of its obligations. [Richard M. Levich, 2002]
With the advance of time and technology, followed by the globalization of the
financial market, the quantity of information available at both the macroeconomic
and institutional level increased exponentially. Correspondingly, this increased the
complexity of rating a company, creating more information asymmetries and elevating the value of the credit rating agencies [Richard M. Levich, 2002]. Moving forward to today, three main companies dominate both the US and global market: Standard & Poor's Financial Services LLC, Moody's Investors Service, Inc. and Fitch Ratings Inc. According to [Hill, 2003], those companies hold a collective global market share of roughly 95% as of 2013, when that study was conducted. Un-
til 2003, those same companies were the only "Nationally Recognized Statistical
Rating Organizations" (NRSROs) in the US, a designation that means their credit
ratings were used by the government in several regulatory areas. [U.S. SEC, 2003]
Since corporate credit ratings are such an important force in the worldwide
economy, there have been many studies attempting to develop different methods
to understand, predict, and model the rating process. In [Figlewski et al., 2012],
the author explores how general economic conditions impact defaults and major credit rating changes by fitting a reduced-form Cox intensity model with a broad range of macroeconomic and firm-specific ratings-related variables.
[Frydman and Schuermann, 2008] proposes a parsimonious model that is a
mixture of two Markov chains, implying that the future distribution of a corporation
depends not only on its current rating but also on its past rating history. Also,
[Koopman et al., 2008] proposes a new empirical reduced-form model for credit rating. The model is driven by exogenous covariates and latent dynamic factors in a generalized semi-Markov fashion, simulating transitions using Monte Carlo maximum likelihood methods.
In a more technological environment, where data is available in great quantity and quality and from a variety of sources, [Lu Hsin-min, 2012] shows that the use of news coverage improves the accuracy of credit rating models by using a missing-tolerant multinomial probit model, which treats missing values within a Bayesian theoretical framework, and shows that this model outperforms an SVM model in credit rating prediction.
With the increasing quantity of data and complexity of the financial instru-
ments, it is clear that credit ratings are heading towards increasingly computational
methods. Most of the recent studies in the area show that the challenge for the
future of credit rating predicting is to apply computational methods with a combi-
nation of financial and economic data. In [Lee, 2007], the author uses a support
vector machine (SVM) to predict corporate credit ratings and compares the re-
sults with traditional statistical models, such as multiple discriminant analysis and
case-based reasoning, and shows that the SVM model outperforms the traditional
methods. The literature for machine learning algorithms and statistical algorithms
applied to credit rating prediction has been extensively explored, and the use of hy-
brid machine learning models is evolving. One example is [Tsai and Chen, 2010], in which the author explores a series of different machine learning techniques and their combinations, and shows that this approach improves the prediction accuracy.
All those previous studies show that predicting credit ratings is an exhaust-
ing and difficult task that involves knowledge in different areas such as financial
markets, financial instruments, macroeconomics, fundamental financial reports,
computer science, advanced math and so on.
1.1 The Problem Statement
The credit rating industry is characterized by a high barrier to entry, due to market regulation and to the fact that the established companies hold more than 95% of the market [Hill, 2003]. This creates a perfect scenario for these companies to charge large fees, both from the companies that need to be rated, also known as the issuers, and from the individuals or entities that want to know the score of a rated company, which could be any investor, government or company willing to acquire the report. Another recurrent issue is error in those ratings, since their classification methods have not always been shown to reflect the real default risk. Since the credit rating is important for many financial instruments and entities, correctly modeling it with computational methods and selecting the inputs has become a challenge for all players that need the rating, without paying the large fees, but still require high accuracy and credibility.
If a medium-to-large corporation needs to take out a loan from a financial institution,
there are a few ways for the financial institution to analyze the creditworthiness of
the corporation. One of them would be to ask the corporation to pay the large fees charged by the credit rating agencies and wait for a considerable period of time, at least 6 months, to get a rating. Another method could be the use of a model that uses credit default swap price information in order to predict the credit rating, as stated in [Tanthanongsakkun and Treepongkaruna, 2008]. The author uses a Black-Scholes-type option-pricing model known as the Merton model. [MERTON, 1974] shows that
the company default probability can be estimated using an option-pricing model,
viewing the market equity of a firm as a European call option on its firm assets,
considering the strike price equal to the value of its liabilities. Another approach
extensively explored in the field is the use of accounting-based models to explain
the credit rating. Works such as [PINCHES and MINGO, 1973] use multiple dis-
criminant analysis with factor analysis using those accounting-based features and
information from the bond market.
It is worth mentioning that none of the previous studies in the area using the latter method, the accounting-based model, uses more than a few years of data or an extensive list of companies. As we can see in [PINCHES and MINGO, 1973],
the author uses two years of data, and restricts the model to ratings above B. In
another study, [Pogue and Soldofsky, 1969], the author explores the same problem
explored in this study by asking the following: "how well can corporate bond ratings
be explained by available financial and operating statistics?". The author makes
use of six years of fundamental data (accounting-based) in the form of ratios, but the study's date and the limitations of data, computational methods and hardware at the time restricted the extensiveness of its work. Most of the previous studies in this area limit their data to a few years, a few companies, or a few possible ratings, such as using only investment grade companies and bonds, if not all of these restrictions together.
1.1.1 Hypothesis
This thesis hypothesizes that the corporate credit rating given by the two main companies (Moody's and Standard & Poors) in the US can be explained, with a high confidence level, by the firm's accounting-based information. The use of multiple machine learning techniques is a key factor in this analysis, since the shortage of public financial data is not an issue and the risk of high computational cost is mitigated by advances in technology and the use of cutting-edge algorithms. With the discussed scenario in mind, this study uses quarterly data from 1990 to 2015 for 170 different companies, representing three main sectors of the S&P500: technology, healthcare and financial services. A framework is created in order to evaluate the model over time. This framework should be able to deal with large quantities of data, both in features and in unique entries. The second hypothesis to be tested is that the application of multiple machine learning techniques is a viable and solid method to build a prediction model. The resulting algorithms should be able to outperform previous statistical methods and other computational methods from other studies.
The model as a whole should be able to perform predictions on any given company without the use of its specific financial instruments, such as bond quotes or stock prices, since for the vast majority of companies this information is not available. The model will focus on using accounting-based information with the use of
multiple machine learning techniques to perform the prediction.
The preceding discussion suggests that corporate credit ratings may depend on the firm's accounting-based reports. The ratings may also depend on the rating agency's judgement about factors that are not easily measured, such as quality of management, future shifts in the market and other qualitative measures that influence the long-term results of any given company; these factors are out of the scope of this study, but could easily be implemented and added to the current structure.

Figure 1.1: S&P500 index breakdown by GICS sectors, as of Jan 29, 2016.
1.2 Problem Scope
The scope of this study is limited to the United States market, more specifically 170
companies currently part of the S&P500 index. According to [S&P DOW JONES INDICES, 2016], those companies altogether represent approximately 51.3% of the index, as of January 2016, as we can see in Figure 1.1. This index has been on the market since the 4th of March 1957, and has 504 constituents. The maximum market cap of its constituents is 542,702.72 million, the minimum is 1,818.69 million, and the average is 35,423.67 million, all values in US dollars.
The history of corporate credit ratings dates back to the 1900s, but for the scope of this thesis the window of data gathered is restricted to 1990 through 2015, and the complete list of companies is given in Appendix A. If any given company does not have a credit rating for any reason, its data is not considered. The quantity of data used is large enough for this research to be considered a Big Data analysis, since the data consists of quarterly financial information for each company. As we can see in [has, 2015], Big Data is defined by the following aspects: the data are numerous, and the data are generated, captured, and processed rapidly. The nature of this study, in a production environment, satisfies all those criteria, but since this thesis focuses on the back-test results of the architecture and model created, the data do not change as rapidly as they would in production. Another valid point to be raised is that the Credit Rating Agencies have access to any financial data of the companies they evaluate at any moment, not just at the quarterly releases. However, since the data is not publicly available until the quarterly releases, this study uses the quarterly information with a one-quarter lag, to ensure that no data from the future is used as input for the model to predict the ratings.
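As an illustration of this lag, a minimal Python sketch is shown below. The file name and the "ticker", "quarter" and "rating" column names are hypothetical assumptions for illustration; this is not the thesis's actual preprocessing code.

    import pandas as pd

    # Illustrative sketch: shift each company's quarterly fundamentals by one
    # quarter, so the rating at quarter t is predicted only from data that was
    # already public at t. File and column names are hypothetical.
    df = pd.read_csv("fundamentals.csv")
    feature_cols = [c for c in df.columns if c not in ("ticker", "quarter", "rating")]

    df = df.sort_values(["ticker", "quarter"])
    df[feature_cols] = df.groupby("ticker")[feature_cols].shift(1)   # one-quarter lag

    # The first quarter of each company has no lagged features and is dropped
    df = df.dropna(subset=feature_cols, how="all")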
All data was gathered using an Application Programming Interface (API)
from the Bloomberg database using computational methods. The number of features available in the database is vast, so in order to make a first selection, all possible features were downloaded, and then the features that had the fewest occurrences of NA were manually selected. The first download had more than 900
different features for each quarter for each company. From those, 230 different
features were selected for each company, for each quarter.
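A minimal sketch of this kind of NA-based pre-selection is shown below; the file name and the 20% missing-value threshold are illustrative assumptions, not the thesis's actual values.

    import pandas as pd

    # Illustrative sketch: rank candidate features by their share of missing
    # values and keep only the most complete ones.
    raw = pd.read_csv("bloomberg_download.csv")      # 900+ candidate features

    na_share = raw.isna().mean()                     # fraction of NA per column
    keep = na_share[na_share < 0.20].index           # keep columns with <20% NA
    selected = raw[keep]
    print(len(keep), "features kept out of", raw.shape[1])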
Those features are fed to a series of machine learning techniques: a multi-layer perceptron in a deep learning architecture and, in parallel, a distributed random forest; the results of those models are then fed into another multi-layer perceptron deep learning architecture, looping again over the same period. This structure is called in the literature an ensemble model; [Hsieh et al., 2012] shows that ensemble methods can be applied successfully to improve the accuracy of single models. In order to reduce the dimensionality of the features, a feature selection technique, such as a random forest, is applied to improve the accuracy and reduce the model's computational cost. The use of feature selection when dealing with a large database of features that can explain a phenomenon is believed to separate the relevant features from the irrelevant ones, improving the performance of the classifier, as sketched below.
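The sketch below illustrates the general idea of random-forest-based feature selection using scikit-learn on synthetic data; the number of trees, the number of retained features and the synthetic dataset are assumptions for illustration and do not reproduce the thesis's configuration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Illustrative sketch: rank features by impurity importance, keep the top k
    X, y = make_classification(n_samples=2000, n_features=230, n_informative=40,
                               n_classes=5, n_clusters_per_class=1, random_state=0)

    rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
    rf.fit(X, y)

    k = 50
    top_idx = np.argsort(rf.feature_importances_)[::-1][:k]   # most important first
    X_reduced = X[:, top_idx]
    print("reduced shape:", X_reduced.shape)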
The machine learning techniques were chosen given their ability to work
well with big databases and multi-class classification problems. In this thesis there are 21 different categories to classify into, and more than 920 different features to explain each classification. The size of the database and the nature of the classification determined the machine learning methods chosen.
1.3 Research Approach
A literature review on machine learning techniques, feature selection and ensemble models was performed in order to use the state of the art for each technique. This research focuses on corporate credit rating in the US, and on the two main credit rating agencies: Moody's and Standard & Poors. During the literature review, it was found that past research uses either a small set of features or a small sample of companies, often focuses only on investment grade (IG) issuers, excluding the non-investment grade companies, or uses samples spanning only a few years.
This study was developed with the premise of evaluating any given com-
pany, at any given time (given the restrictions of availability on data). In order to
ensure the flexibility of the model, information such as the company’s ticker, time
references for the ratings and company’s specific information on exchanges, e.g.
CDS or equity, were not given as part of the training dataset, which was composed of publicly available financial data. A framework for back-testing was created in order to perform all the different analyses and set-ups, gather the results in a uniform fashion and guarantee a consistent analysis. In order to test the model's consistency over time, the framework made it viable to test using a rolling window that trains the model on all data up to the selected period and tests it on the next quarter; with that set-up it is impossible for the test data to be used as training data, assuring the out-of-sample concept and preventing the model from showing biased results, as sketched below.
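A minimal sketch of such a rolling back-test loop is given below; the data layout, the fit_and_predict callable and the length of the initial training window are hypothetical placeholders for the framework described in Chapter 3.

    # Illustrative sketch: train on every quarter strictly before quarter i,
    # then predict quarter i, so test data is never seen during training.
    # `df` is a pandas DataFrame with hypothetical "quarter" and "rating"
    # columns; `quarters` is the ordered list of quarter labels.
    def rolling_backtest(df, quarters, fit_and_predict, n_initial=40):
        results = []
        for i in range(n_initial, len(quarters)):
            train = df[df["quarter"].isin(quarters[:i])]      # strictly before i
            test = df[df["quarter"] == quarters[i]]
            preds = fit_and_predict(train, test)              # model pipeline
            results.append((quarters[i], preds, test["rating"].to_numpy()))
        return results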
This thesis uses different measurements to evaluate the model performance, as
we can see in the list that follows:
1. Credit Rating Dissimilarity Coefficient (Overall)
(a) Overall accuracy
(b) Crude accuracy
(c) Node accuracy
2. Credit Rating Dissimilarity Coefficient (Changes)
(a) Overall accuracy on changes
(b) Node accuracy on changes
Item 1a is the measure of accuracy for the model that represents how many of the ratings present in the test dataset the model predicted correctly. Item 1b is defined as how many times the model predicted the correct rating on a crude basis, with crude being defined by the following example: if a rating is AA+, its crude rating is defined as AA. Item 1c is defined as how many times the model predicted the right rating within an acceptable range, or node. A node is a measure of distance between ratings; for example, if a rating is Aa1, the node accuracy accepts an error of up to one node, which means that a prediction such as Aaa or Aa2 is counted as correct. As for the performance measurement items 2a and 2b, they are, respectively, the same concepts as explained above, but measured only for the ratings that changed relative to the previous period. By considering all those measurements, this study can test and evaluate the hypothesis previously proposed.
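The sketch below illustrates how these accuracy measures can be computed for S&P-style labels; the numeric mapping is only an illustrative stand-in for the actual scale given in Table 3.1.

    # Illustrative rating-to-number mapping (the thesis's scale is in Table 3.1)
    SP_SCALE = {"AAA": 1, "AA+": 2, "AA": 3, "AA-": 4, "A+": 5, "A": 6, "A-": 7,
                "BBB+": 8, "BBB": 9, "BBB-": 10, "BB+": 11, "BB": 12, "BB-": 13,
                "B+": 14, "B": 15, "B-": 16, "CCC+": 17, "CCC": 18, "CCC-": 19,
                "CC": 20, "C": 21}

    def overall_accuracy(pred, true):
        return sum(p == t for p, t in zip(pred, true)) / len(true)

    def crude_accuracy(pred, true):
        crude = lambda r: r.rstrip("+-")                 # e.g. AA+ -> AA
        return sum(crude(p) == crude(t) for p, t in zip(pred, true)) / len(true)

    def node_accuracy(pred, true, nodes=1):
        # correct if within `nodes` notches of the agency rating
        hits = [abs(SP_SCALE[p] - SP_SCALE[t]) <= nodes for p, t in zip(pred, true)]
        return sum(hits) / len(hits)

    print(node_accuracy(["AA", "A+"], ["AA+", "A+"]))    # 1.0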
In order to provide a statistical measurement that reflects all previous mea-
surements, a new measurement is introduced: the Credit Rating Dissimilarity Co-
efficient, which is explained in more detail in Chapter 4. The Credit Rating Dis-
similarity Coefficient is applied to the overall ratings in the investigation period, and
also applied separately to only the ratings which changed compared to the pre-
vious rating. This new statistical measurement makes it possible to compare the
predictions with the observed ratings and to compare them with a benchmark accepted by the market.
1.4 Organization and Structure
Chapter 1 presents an introduction to corporate credit rating since its beginning,
how the two main credit ratings agencies were created, and the current market
distribution for corporate credit rating. After this introduction about the market and
the creation of the credit rating agencies, this study reviews the different approaches taken over time to solve this problem, beginning with statistical models using a variety of features and moving up to modern methods such as machine learning and its variations. This chapter also characterizes the problem and the current environ-
ment in which the study was developed. It also lays out the hypothesis on which
this study is based and sets the ground to evaluate the hypothesis. It continues
to define the scope and the research approach chosen for the development of this
study.
Chapter 2 reviews the literature and the theory behind the algorithms being
used. The mathematical formulation of the algorithms used is also presented so
the reader can have a full understanding of the theory. The review is structured as follows:
1. Standard and Poor’s Method
2. Moody’s Method
3. Machine Learning Techniques
(a) Neural Networks
i. Multilayer Perceptron
(b) Decision Tree
i. Random Forest
Chapter 3 details the computational procedures, such as the back-test method, the assumptions made in the simulation execution and other relevant details used to perform the tests. This chapter also presents the data structure and the preprocessing required for this study, and then proceeds to the specifics of the framework construction.
Chapter 4 presents the results found by the tests performed, approaching
the results from different angles in order to better understand the framework model
application and performance measurements.
Chapter 5 finally draws a conclusion from the test results, presents ideas for further research and discusses the viability of using this research in a production environment.
Chapter 2
Literature Review
As explained in [Langohr, 2010], there are six main macroeconomic factors that
shaped the current credit rating industry: financial disintermediation, institution-
alization of investments, accelerated rate of industry change, complex financial
innovations, the globalization of international capital markets, and the growth in
regulatory uses of ratings. Those factors transformed the market, such that as of
2009, there are about 150 local and international credit rating agencies around the
world, and the major US agencies have established operations and joint ventures
abroad to meet the globalization of capital markets. A big picture of the market as of 2006 is drawn in Figure 2.1.
With the joint ventures and mergers that occur in this industry, it can be
described as an oligopoly of three dominant global credit agencies, as we can see in Figure 2.2, where [Estrella, 2000] shows the current state of the credit rating agencies
industry.
With all that information, it is easy to see that there are three dominant players, S&P, Moody's, and Fitch, and all of them follow a similar pattern: large companies, global focus, cross-industry issuer- and instrument-specific ratings. They take an analytical approach with committee reporting, use ordinal scales and have an issuer-pays business model. When it comes to comparisons between the three main agencies, investors perceive all three of them as roughly equal, but in older markets, such as US corporate debt instruments, issuers automatically get ratings from two or three different agencies. This study is focused on the first two companies, S&P and Moody's, since a preliminary analysis showed that those companies have more information available for the companies chosen.

Figure 2.1: Rating Agencies around the world, as of 2006. Source: International Rating Group (IRG)

Figure 2.2: *Due to lack of data, the highest number of rated issuers by one of 3 credit rating agencies is assumed to be the total rated issuers; the coverage is then calculated based on this total number. Source [Estrella, 2000]
For several years great effort has been devoted to the study of corporate credit ratings and how to predict them using computational methods. By definition, an algorithm can only present results as good as its model; therefore, it is important to understand the process and methods behind the ratings produced by the companies analysed in this study. In order to have a complete and improved understanding of the corporate credit rating process, each company's method is explored in this thesis. Bearing that in mind, this thesis uses accounting-based, publicly available information to fit a machine learning model for both companies' methods. The machine learning methods applied in this thesis are explained here and a high-level mathematical explanation is given as well.
2.1 Standard & Poor’s Method
This section refers to S&P Credit Market Services, which, as described earlier,
traces its roots to the 19th century when Henry Poor published in a newspaper the
financial information and analysis for the railroad companies of the time. Nowadays
S&P has a parent company called McGraw-Hill, which provides financial services
related to equities, including the S&P’s Credit Market Services, the affiliate respon-
sible for credit rating activity. Since the beginning, the business had stable growth. With market-based funding becoming more common in the early 70s, and with that decade marking a peak in speculative bond offerings, S&P refined its ratings to better serve the market's needs by adding '+' (plus) and '-' (minus) to each generic category, moving from a 10-point to a 22-point scale. In the same
decade, S&P decided to charge issuers for their ratings, as investor subscriptions
could no longer meet the costs.
In 1975, the structured finance market was created and S&P started to rate
mortgage-backed securities, and in 1976 the company received the designation
of Nationally Recognized Statistical Rating Organization (NRSRO), a regulatory
aid created by the Securities and Exchange Commission (SEC). After that, the
company kept growing by adapting to new market needs, by developing new prod-
ucts and expanding the business worldwide, merging and acquiring several ratings
agencies across all continents. S&P is widely accepted by investors in both the
US and the European markets, but the US market is considered a rated market,
as issuers get two ratings if not three, so S&P has a policy of systematically rating
issuers in the US debt market, whether solicited or not. On the other hand, in Eu-
ropean markets, S&P tends to be preferred over its competitors when issuers are
looking for only one rating.
Standard and Poor’s has specialized in analysing the credit risk of issuers
and debt issues. The company formulates and disseminates ratings opinions that
are used by investors and other market players who may consider credit risk in
their decisions. The credit ratings process at Standard and Poor’s is given by the
following steps, as seen in [Standard & Poors, 2014b]:
1. Contract
2. Pre-evaluation
3. Management meeting
4. Notification
5. Rating Committee
6. Analysis
7. Publication
8. Surveillance of rated Issuers & Issues
The current payment model used by Standard and Poor’s is composed of:
Issuer-pay model The agency charges the issuers a fee for providing a ratings
opinion. In order to conduct the analysis the agencies may obtain information
that might not otherwise be available to the public and factor this information
into their ratings opinion. The released rating information is published broadly
to the public.
Subscription model The agency charges investors and other market players a
fee for access to the agency's ratings. Critics point out that both this model and the Issuer-pay model have the potential for conflicts of interest.
When rating an Issuer, as seen in [Standard & Poors, 2014b], Standard &
Poor's evaluates the issuer's ability and willingness to repay its obligations in accordance with the terms of said obligations. To form its opinion on the rating, S&P reviews
a broad range of financial and business attributes that may influence future pay-
ments. Those attributes include, for example: key performance indicators, eco-
nomic, regulatory and geopolitical influences, management and corporate gover-
nance, and competitive position. There is a framework defining this work, as seen in [Standard & Poors, 2014a] and illustrated in Figure 2.3.
Figure 2.3: Corporate Criteria Framework. Source [Standard & Poors, 2014a]
According to [Standard & Poors, 2014a], there is more than just financial
ratios and accounting-based information that contributes to the rating for S&P. The
agency also incorporates in its analysis the country risk, the industry risk, and the
competitive position to create the business risk profile, and then analyzes the cash
flow and leverage to create the financial risk profile, and finally analyzes the
company’s qualitative factors.
Standard and Poor's defines its ratings scales in [Standard & Poors, 2014a], as shown in Figure 2.4.
According to [Standard & Poors, 2014b], Standard and Poor’s tracks its rat-
ings yearly to evaluate their accuracy. The update is always available on the S&P website www.spratings.com; for this study the latest is the 2014 edition. In [Standard & Poors, 2015], the agency goes through the details of the previous year's performance and keeps track of all the ratings and defaults it has analyzed since the 1980s, and it shows that the rate of rating changes over the years is small, and usually
motivated by factors and influences other than financial information.
All of these data and methods were considered during the experiment con-
struction and thesis assumptions, and all the results and comparisons are made in
the corresponding section, as explained in the previous chapter.
2.2 Moody’s Method
This section refers to Moody’s, which, as described earlier, was established in the
late 19th century by John Moody when he published Moody’s Manual of Industry
and Corporation Securities and later on, when he directly competed with S&P by
publishing the first bond rating as part of Moody’s Analyses of Railroad Securities.
Nowadays Moody’s is an essential component of the global capital markets. It
Figure 2.4: Ratings summary. Source [Standard & Poors, 2014a]
Figure 2.5: Corporate Default summary. Source [Standard & Poors, 2015]
provides credit ratings, research, tools and analysis aiming to protect the integrity
of credit, as said in [Moody’s Investor Service, ]. In 1962, Moody’s was sold to Dun
& Bradstreet, and ten years later, Moody’s began to assign short-term ratings and
bank deposit ratings after Penn Central defaulted on its commercial obligations.
Around the same time that S&P changed its business model to issuer pays, so
did Moody’s, and shortly thereafter Moody’s received the NRSRO status along
with the other two main agencies. After a series of mergers and acquisitions,
the agency expanded to Europe and established a global footprint. In the early 1980s, the company refined its rating system by moving from a 9-point to a 21-point scale, a few years after S&P performed a similar restructuring. In the early
2000’s the CEO, John Rutherford Jr, pointed out that debt sold in public capital
markets usually requires ratings, and decided to focus the company on that. At the
same time, the reports showed the success of two new products, Collateralized Debt Obligations (CDOs) and syndicated bank loans. Currently Moody's operates in over 26 countries outside the US, and as of 2006, covered approximately 12,000 corporations and financial institutions.
Although Moody’s website provides few details about their rating technique,
the basic methodology is similar to that of S&P. They use a proprietary combination
of financial data and other financial indicators that considers the market risk, the
company risk and other factors that might influence the corporate credit rating. De-
spite using similar inputs, the ratings Moody's and S&P assign often differ, possibly due to the different weight each company assigns to each factor used in the rating process; this difference will be discussed further in this research.
2.3 Machine Learning Techniques
Machine learning grew out of the quest for artificial intelligence. In the beginning
of AI as an academic discipline, some researchers were interested in having ma-
chines learn from data. They attempted to approach the problem with various symbolic methods; probabilistic reasoning was also employed. However, an increasing
emphasis on the logical, knowledge-based approach caused a rift between AI and
machine learning. Probabilistic systems were plagued by theoretical and practi-
cal problems of data acquisition and representation. Work on knowledge-based
learning did continue within AI, leading to inductive logic programming, but the
more statistical line of research was now outside the field of AI proper, in pattern
recognition and information retrieval.
2.3.1 Deep Learning
Deep Learning, also known as deep structured learning, hierarchical learning or
deep machine learning, is a branch of machine learning based on a set of algo-
rithms arranged in multiple levels of non-linear operations in order to learn compli-
cated functions that can represent high-level abstractions, according to [Begio, 2009].
With the increased data offering, an effect of the already discussed big data wave, it is increasingly impossible to manually formalize all that information in a machine-usable format that can actually generate intelligent conclusions. For this reason, research and studies have been developed to create algorithms able to learn from the data, and deep architectures are the bleeding-edge algorithms
for that task. Research in this area attempts to make better representations and
create models that learn from those representations in a large and scalable fash-
ion. This methodology was first developed due to inspiration from advances in neuro-
science and the communication patterns in the nervous system. Deep learning is
often erroneously used as a re-branding of neural networks, but in fact, the deep
architecture can be applied to several different methods. This study makes use of
a specific model of Neural Network called Multilayer Perceptron, topics which will
be reviewed in the next subsections.
As seen in [LeCun Y, 2015], deep learning methods are representation-learning
methods with multiple levels, that are obtained by composing non-linear modules
that each transform the representation at one level into a higher and more abstract
level. After enough transformations, the data can be learned by use of complex
functions. When applied to classification tasks, which is the case in this study, the higher layers of representation amplify aspects of the input that are important for the differentiation and would be suppressed in shallow machine learning methods. Conversely, they also suppress representations that are actually irrelevant for the differentiation and would appear to be more relevant in shallow methods. Deep
learning is advancing quickly, as evinced by great results, often beating previous records in each application, and has been shown to be especially efficient when applied
to intricate structures in high-dimensional data, which is the case in this research.
2.3.2 Neural Networks
In machine learning, the Neural Networks (NN), often seen as Artificial Neural
Networks (ANN), encompass a family of models inspired by biological neural net-
works and represent a technology rooted in many disciplines: neuroscience, math-
ematics, statistics, physics, computer science, and engineering. According to
[Haykin, 1999] the study of neural networks has been motivated by the capacity
of the human brain to compute in a different way from the conventional digital com-
puter and deliver better results. The human brain is complex, non-linear, uses parallel computing and has the capacity to reorganize its structural constituents, known as neurons.
own rules based on past experiences, meaning that the human brain is plastic and
allows the system to adapt to its surroundings.
[Haykin, 1999] defines ANN as a massively parallel distributed processor
made up of simple processing units, which has a natural propensity for storing
experiential knowledge and making it available for use. It resembles the brain in
two respects:
1. Knowledge is acquired by the network from its environment through a learn-
ing process.
2. Inter-neuron connection strengths, known as synaptic weights, are used to
store the acquired knowledge.
The ability to learn and generalize, which is inherent to ANNs, makes them an extremely powerful tool for machine learning, since they can produce reasonable out-
puts for inputs not encountered during training (out-of-sample). The main qualities
and benefits of ANN are discussed as follows:
Nonlinearity The connection between the neurons can either be nonlinear or lin-
ear, but the ability to be nonlinear when the underlying physical mechanism
is also nonlinear makes this property important, especially because the nonlinearity is distributed through the network via each neuron activation.
Input-Output Mapping This property states that the algorithm learns from a train-
ing algorithm, in which during each iteration one element is presented to the
algorithm and the free parameters of the ANN are adjusted to minimize the
distance between the target value and the predicted value, according to a statistical measure. That means that no prior assumptions are made about a model for the input data.
Adaptivity ANN have a built-in capability to adapt their synaptic weights to changes
in the surrounding environment, making it a useful tool in adaptive pattern
classification and adaptive tasks in general, but the principal time constants
of the system must be long enough for the model to ignore high disturbances
and short enough to respond to meaningful changes in the environment.
Evidential Response When the task is classification, the ANN can be designed
to return not just the class selected, but also the confidence in the decision
made, which can improve the overall classification if this information is used
to reject ambiguous patterns.
Fault Tolerance Due to the distributed nature of information stored in the network,
if any of the neurons is damaged, it will not affect the overall quality of the
model and will not stop the system from working.
Large Scale Implementability It has potential to make use of tools for distributed
computation, making it faster to process and more responsive.
To better understand the function of an ANN one needs to understand the concept behind each of its components. The neuron is an information-processing unit that is fundamental to the operation of a neural network. Figure 2.6 shows the model of a neuron. The neuron is composed of three basic elements. The first is the set of input signals, or synapses, each of them characterized by a weight of its own.
Figure 2.6: Neuron representation
Specifically, an input x_j connected to a neuron k is multiplied by the synaptic weight w_{kj}. Another basic element is the adder, which sums the input signals weighted by the respective synapses, also known as the linear combiner. The last basic element is the activation function, which delimits the output signal, also known as the squashing function due to its property of squashing the amplitude range of the output signal to a finite value. In mathematical terms, the neuron can be described as follows:
u_k = \sum_{j=1}^{m} w_{kj} x_j    (2.1)
and:
y_k = \phi(u_k + b_k)    (2.2)

where x_1, x_2, ..., x_m are the input signals, w_{k1}, w_{k2}, ..., w_{km} are the synaptic weights of neuron k, u_k is the linear combiner output, b_k is the bias, \phi(\cdot) is the activation function, and y_k is the output signal of the neuron.
The use of the bias b_k has the effect of applying a transformation to the output u_k of the linear combiner, as seen in the following equation. The bias is an external parameter of the artificial neuron k; combining the previous equations, one can write:

v_k = u_k + b_k    (2.3)

so that the combination can be seen as:

v_k = \sum_{j=0}^{m} w_{kj} x_j    (2.4)

or:

y_k = \phi(v_k)    (2.5)
Still looking at the neurons, there are several types of activation functions,
and each one of them defines the output of a neuron in terms of the induced local
field v. The three basic types are:
Threshold Function This type of function can be expressed by:
\phi(v) = \begin{cases} 1 & \text{if } v \geq 0 \\ 0 & \text{if } v < 0 \end{cases}    (2.6)
or:
y_k = \begin{cases} 1 & \text{if } v_k \geq 0 \\ 0 & \text{if } v_k < 0 \end{cases}    (2.7)
where vk is the induced local field of the neuron:
v_k = \sum_{j=1}^{m} w_{kj} x_j + b_k    (2.8)
Piecewise-Linear Function The amplification inside the linear region is assumed to be uniform, approximating a non-linear amplifier; the equation is as follows:
\phi(v) = \begin{cases} 1, & v \geq +\frac{1}{2} \\ v, & +\frac{1}{2} > v > -\frac{1}{2} \\ 0, & v \leq -\frac{1}{2} \end{cases}    (2.9)
Sigmoid Function is the most common form of activation function and it is defined
as a strictly increasing function:
\phi(v) = \frac{1}{1 + \exp(-\alpha v)}    (2.10)
Antisymmetric Form All the previous activation functions had a range from 0 to 1,
however, depending on the application it is convenient to have the activation
function in the range from −1 to 1, and in those cases the threshold is defined
as:
\phi(v) = \begin{cases} 1, & \text{if } v > 0 \\ 0, & \text{if } v = 0 \\ -1, & \text{if } v < 0 \end{cases}    (2.11)
This case is common when using the hyperbolic tangent function, defined by
\phi(v) = \tanh(v)    (2.12)
Stochastic Model All the previously presented activation functions are deterministic, in the sense that their input-output behaviour is precisely defined for all inputs. A stochastic activation is especially interesting when applied to situations where the state of the neuron should always be either −1 or 1, which in this context means whether the neuron forwards its information or not. In the stochastic model of a neuron, the activation takes the following form:
x = \begin{cases} +1 & \text{with probability } P(v) \\ -1 & \text{with probability } 1 - P(v) \end{cases}    (2.13)
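As a concrete illustration of the neuron model above, the following sketch computes the output of a single neuron with a sigmoid activation (Equations 2.1, 2.2 and 2.10); the numeric values are arbitrary examples.

    import numpy as np

    def sigmoid(v, alpha=1.0):
        return 1.0 / (1.0 + np.exp(-alpha * v))          # Equation 2.10

    x = np.array([0.5, -1.2, 3.0])                       # input signals x_j
    w = np.array([0.8, 0.1, -0.4])                       # synaptic weights w_kj
    b = 0.2                                              # bias b_k

    u = np.dot(w, x)                                     # linear combiner, Eq. 2.1
    y = sigmoid(u + b)                                   # neuron output, Eq. 2.2
    print(y)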
The neurons described above can be arranged in an endless variety of configurations, and can also be combined in various ways. This study focuses on the multilayer perceptron, which is the method to be tested in the hypothesis. The multilayer perceptron (MLP) construction will be explored in more detail in the next subsection.
2.3.2.1 Multilayer Perceptron
Deep Architecture is not a new approach nor a new technique; this method dates
back at least to [Fukushima, ], when he developed a deep network based on an
ANN in 1980. Since then, different approaches were taken to develop a viable
deep neural network, mostly because of the computational cost implied on those
algorithms. The breakthrough happened in the early 2000s, when Geoffrey Hinton published the paper [Hinton, 2007]. In this paper, the author uses a deep architecture in which each layer is pre-trained using an unsupervised restricted Boltzmann machine, and then fine-tuned using supervised back-propagation, in a feed-forward fashion. The paper rests on a combination of three ideas: train a model that generates sensory data rather than classifying it (unsupervised learning); train one layer of representation at a time using a restricted Boltzmann machine, decomposing the overall learning task into multiple simpler tasks and eliminating the inference problems; and lastly, use a fine-tuning stage to improve the previous model. The term "deep learning" can cause confusion if not used cor-
rectly: it encompasses any algorithm that has a series of layers instead of just one layer, the latter being called "shallow" algorithms. Therefore there is a great variety of deep architectures, and most of them branch from some original parent architecture. The deep version of a Single Layer Perceptron is called the Multilayer Perceptron.
The multilayer perceptron follows the same general concept of all ANN:
it is designed based on the human brain and consists of a number of artificial
neurons, whose function was explained in previous subsections. According to
[Gurney, 1997], a perceptron comes from a set of preprocessing association units.
The perceptron is assigned any arbitrary boolean functionality and is fixed, so it
does not learn from the data. It performs a rough classification and sends the in-
formation to the next node that performs the training algorithm for ANN. In order
for the ANN, or MLP, to perform a given classification it must have the desired decision surface. This is obtained by adjusting the weights and thresholds
in a network, either in a shallow or a deep architecture. The adjustment of those
weights and thresholds is made in an iterative fashion, where it is presented with
examples of the required task repeatedly, and at each presentation makes small
changes to the weights and thresholds to bring them closer to the desired values.
This study makes use of the MLP, which consists of a series of layers of neurons and perceptrons arranged into an input layer, hidden layers and an output layer, as seen in Figure 2.7. The input layer receives the data that is fed to the algorithm, usually after some kind of preprocessing, such as PCA, FCA, etc.
Figure 2.7: Multilayer Neural Network
MLPs are usually feedforward neural networks, are often trained with an error-correction learning rule and can be viewed as a gener-
alization of an adaptive filtering algorithm. According to [Haykin, 1999], the error
back-propagation learning consists of two passes through the MLP’s layers: a for-
ward pass and a backward pass. The forward pass occurs when the input vector is
applied to the neurons, also called nodes, of the network and propagates through
the network layer by layer. The weights in this process stay fixed, and in the end it
produces a set of outputs as the response of the network. After the forward pass,
the backward pass starts, and during this process the weights are adjusted in ac-
cordance with an error-correction rule. Usually the output is subtracted from the labelled target producing an error signal, and this signal is propagated backwards through the network, against the direction of the synaptic connections. The error signal at the output of neuron j at iteration n, when j is an output node, is defined by the following equation:

e_j(n) = d_j(n) - y_j(n)    (2.14)
The error is calculated in terms of the error energy, which has to be computed over all the nodes in the output layer, denoted by the set C, for the nth sample:

E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)    (2.15)
In order to calculate the average squared error energy, the following formula
must be applied:
E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n)    (2.16)
This measure, the average error energy E_{av}, is a function of all the free parameters of the network, such as the synaptic weights and bias levels. The error energy can be seen as the loss function or cost function of the ANN and serves as a performance measurement for the learning process. The objective of the learning algorithm is to minimize E_{av} by updating the weights on a pattern-by-pattern basis until one complete presentation of the entire training set has gone through the neural network; this is also called an epoch. This relationship is better explained by looking at one neuron. For this example, consider a neuron j being fed by a set of function signals produced by a previous layer; the induced local field v_j(n) produced at the input of the activation function associated with neuron j is:

v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)    (2.17)

where m is the total number of inputs, bias included, and the output y_j(n) can be represented by:

y_j(n) = \phi_j(v_j(n))    (2.18)
The improvement brought by the use of the error back-propagation is mainly
explained by the approach in which it applies corrections in a backward fashion
to the synaptic weights w_{ji}(n) proportionally to the partial derivative \partial E(n) / \partial w_{ji}(n).
When applied to the whole network and according to the chain rule the gradient is
expressed by the following equation:
\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}    (2.19)
Applying the previously explained equations to the last expression yields the result below, in which the factor e_j(n)\phi'_j(v_j(n)) multiplying y_i(n) is the local gradient \delta_j(n), appearing with the opposite sign:

\frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n) \phi'_j(v_j(n)) y_i(n)    (2.20)
Then, finally, the correction applied is:
\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)}    (2.21)
In the training process, as stated before, there are two phases: the for-
ward pass and the backward pass. During the forward pass the synaptic weights
are fixed throughout the network and the functions are computed on a neuron-by-
neuron basis. The forward process is mathematically explained as follows:
y_j(n) = \phi_j(v_j(n))    (2.22)

where v_j(n) is defined for neuron j as

v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n)    (2.23)
If neuron j is in the output layer of the network, m = m_L, and the output can then be written as

y_j(n) = o_j(n)    (2.24)

where o_j(n) is the jth element of the output vector. This output is then compared with the labelled response d_j(n) in order to get the error signal e_j(n). The forward pass starts at the input layer, continues to the output layer and finishes with the error signal calculation.
After the forward pass comes the backwards pass, which starts on the out-
put layer and goes to the first hidden layer. The process consists of recursively
calculating the local gradient δ for each neuron and adjusting and
updating the synaptic weights according to the delta rule, explained in previous
equations. For the adjustment task there are still various types of loss functions
that can be applied in ANN or MLP models. The loss function is represented by
L(W, B|j), where B are the biases and W the synaptic weights, for a given jth training sample; t^{(j)} are the target (labelled) values, o^{(j)} are the predicted output values, and y ranges over the units of the output layer O of the network:
Mean Square Error The mean square error (MSE) is typically used in regres-
sion tasks, since it does not consider different categories. The formula is
given by:
L(W, B|j) = \frac{1}{2} \, \| t^{(j)} - o^{(j)} \|^2    (2.25)
Huber The Huber method is also typically used in regression tasks and is defined as:

L(W, B|j) = | t^{(j)} - o^{(j)} |    (2.26)
Cross Entropy The cross entropy is used often for classification tasks and can be
defined as:
L(W, B|j) = -\sum_{y \in O} \left[ \ln(o_y^{(j)}) \cdot t_y^{(j)} + \ln(1 - o_y^{(j)}) \cdot (1 - t_y^{(j)}) \right]    (2.27)
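A minimal numerical sketch of Equation 2.27 for a single training example is given below; the target and output vectors are arbitrary illustrative values.

    import numpy as np

    def cross_entropy(t, o, eps=1e-12):
        # t: one-hot target over the output units O; o: predicted outputs
        o = np.clip(o, eps, 1 - eps)                     # avoid log(0)
        return -np.sum(np.log(o) * t + np.log(1 - o) * (1 - t))

    t = np.array([0.0, 1.0, 0.0])                        # true class is unit 2
    o = np.array([0.1, 0.8, 0.1])                        # network outputs
    print(cross_entropy(t, o))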
Since this study uses the cross entropy method, the literature review will fo-
cus on it. According to [de Boer et al., 2005], the cross entropy (CE) was motivated by an adaptive method for estimating probabilities in complex stochastic networks, which mainly involves variance minimization. In addition, [Suresh et al., 2008] states that the cross entropy minimizes the misclassification between all classes in an all-in-one approach. It is common for ANN or MLP algorithms to exhibit overfitting, and in order to avoid this problem this study makes use of different regularization techniques, such as the Lasso, also known as l1; Ridge, expressed by l2; and dropout, which is a newer technique applied in deep architectures. The first two
techniques work either with shallow or deep architectures, and the last technique,
dropout, will be explored in the next subsection. The loss function modified after
the application of the regularization techniques is represented in Equation 2.28.
L'(W, B|j) = L(W, B|j) + \lambda_1 R_1(W, B|j) + \lambda_2 R_2(W, B|j)    (2.28)
In Equation 2.28, R_1(W, B|j) is the sum of the l1 norms of all the weights and biases in the network, and R_2(W, B|j) is the sum of squares of all the weights and biases in the network. The constants \lambda_1 and \lambda_2 are usually very small (on the order of 10^{-5}). LASSO stands for least absolute shrinkage and selection operator; according to [Tibshirani, 2011] it is a regression with an l1-norm penalty, given by the formula presented in Equation 2.29.
\sum_{i=1}^{N} \left( y_i - \sum_{j} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} | \beta_j |    (2.29)
In Equation 2.29, x_{ij} are the predictors and y_i are the centered response values, for i = 1, 2, ..., N and j = 1, 2, ..., p. The problem is solved by finding \beta = {\beta_j}, which is equivalent to minimizing the sum of squares subject to a constraint of the form \sum_j |\beta_j| \le s. Seen from that angle it resembles the Ridge method, a regression with a constraint of the form \sum_j \beta_j^2 \le t. The main difference between the two methods is that LASSO performs both variable selection and shrinkage, whereas Ridge regression performs only shrinkage. Considering both of them in a general form, the penalty is shown in Equation 2.30.
\left( \sum_{j=1}^{p} | \beta_j |^q \right)^{\frac{1}{q}}    (2.30)
In this general form, LASSO corresponds to q = 1 and Ridge regression to q = 2.
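As a concrete illustration (not the code used in this study), the following R sketch evaluates the cross entropy of Equation 2.27 together with the l1 and l2 penalties of Equation 2.28 for a single training example; the variable names and the lambda values are illustrative assumptions.

# Regularized cross-entropy loss for one training example (illustrative sketch).
# o: predicted class probabilities, t: one-hot target, W: list of weight arrays.
regularized_loss <- function(o, t, W, lambda1 = 1e-5, lambda2 = 1e-5) {
  eps <- 1e-15                                   # guard against log(0)
  o <- pmin(pmax(o, eps), 1 - eps)
  ce <- -sum(t * log(o) + (1 - t) * log(1 - o))  # Equation 2.27
  r1 <- sum(sapply(W, function(w) sum(abs(w))))  # l1 (LASSO) term of Equation 2.28
  r2 <- sum(sapply(W, function(w) sum(w^2)))     # l2 (Ridge) term of Equation 2.28
  ce + lambda1 * r1 + lambda2 * r2
}

# Example: three classes, true class is the second one.
o <- c(0.2, 0.7, 0.1)
t <- c(0, 1, 0)
W <- list(matrix(rnorm(6), 2, 3), rnorm(3))
regularized_loss(o, t, W)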
With all the previous concepts presented, the training algorithm for an MLP becomes clearer and can be given as the following procedure:
1. Initialize W, B
2. Iterate until a convergence criterion is reached:
(a) Get training example j
(b) Update all weights w_j \in W and biases b_j \in B:
w_j := w_j - \eta\, \frac{\partial L'(W, B|j)}{\partial w_j}
b_j := b_j - \eta\, \frac{\partial L'(W, B|j)}{\partial b_j}
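To make the update rule concrete, the following minimal R sketch applies it to the special case of a single sigmoid output neuron trained with the cross entropy loss, for which the gradient reduces to (o - t) y_i; the synthetic data, the learning rate, and the omission of the regularization terms are simplifying assumptions.

# Stochastic gradient descent for one sigmoid neuron with cross-entropy loss
# (a special case of the update rule above; data and eta are illustrative).
sigmoid <- function(v) 1 / (1 + exp(-v))

set.seed(1)
X <- matrix(rnorm(200), nrow = 100, ncol = 2)   # 100 examples, 2 inputs
t <- as.numeric(X[, 1] + X[, 2] > 0)            # synthetic binary target
w <- rep(0, 2); b <- 0; eta <- 0.1

for (epoch in 1:50) {
  for (j in sample(nrow(X))) {                  # one training example at a time
    o <- sigmoid(sum(w * X[j, ]) + b)           # forward pass: o = phi(v)
    grad <- o - t[j]                            # dL/dv for sigmoid + cross entropy
    w <- w - eta * grad * X[j, ]                # w := w - eta * dL/dw
    b <- b - eta * grad                         # b := b - eta * dL/db
  }
}
round(w, 2)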
2.3.3 Decision Tree
2.3.3.1 Random Forest
Random Forest is a specific branch of the general technique of random decision forests, an ensemble method for classification and regression that operates by constructing multiple decision trees in the learning process and outputting the class designated by the majority of the individual trees or their mean prediction, depending on the task. The general decision tree method is defined by [Rokach, 2016] as a predictive model expressed as a recursive partition of the covariate space into subspaces that constitute a basis for prediction. The same author defines the random forest as a combination of decision trees in which the individual predictions are combined into a final prediction. The basic schematic of a random forest construction is shown in Figure 2.8.
Figure 2.8: Basics schematics of a Random Forest
As seen in [Verikas et al., 2011], the decision trees are sensitive to small
perturbations in the learning dataset. To mitigate this problem, one can build a random forest using an ensemble technique called bagging. To better understand the Random Forest (RF) model, let L = {(X_1, Y_1), ..., (X_n, Y_n)} be the learning set, made of i.i.d. observations of a random vector (X, Y), where X = (X^1, ..., X^p) contains the predictors, the so-called explanatory variables, and Y describes the labelled class or numerical response, with X \in R^p and Y \in \psi, where \psi is the set of categorical labels or numerical responses. For classification problems, which is the case in this study, a classifier is a mapping t : R^p \to \psi. Lastly, it is assumed that Y = s(X) + \varepsilon, with the expectation of \varepsilon equal to zero and its variance equal to 1. The statistical framework is as follows:
1. Each tree is grown on a bootstrap sample of the training dataset.
2. At each node, n variables are randomly selected out of the p available predictors.
3. n is usually defined as n = log_2(p) + 1, p being the number of predictors.
The RF algorithm has a by-product: measures of variable importance that are computed alongside the forest. One of them is the Gini index, usually used for classification tasks. Given a node t and estimated class probabilities p(k|t), k = 1, ..., Q, the Gini index is defined as

G(t) = 1 - \sum_{k=1}^{Q} p^2(k|t)

where Q is the number of different classes. The whole process consists of calculating the decrease in the Gini index for every split made on a variable X_n, and the Gini variable importance is given by the average decrease in the Gini index over the forest at the nodes t in which the variable X_n is used.
Another measurement of variable importance is the accuracy-based estimator. This method computes the mean decrease in classification accuracy on the out-of-sample (OOS) data. Let the bootstrap samples be b = 1, ..., B, and let \bar{D}_j denote the importance measurement for a given predictor X_j, i.e., the average decrease in correct classifications across the B trees after permuting X_j. The importance is calculated as follows:
1. Set b = 1 and define the OOS data points for tree b.
2. Classify the OOS data points using tree b and count the correct classifications.
3. For each variable X_j, j = 1, ..., N:
(a) permute the values of X_j in the OOS data points;
(b) use tree b to classify the permuted data and count the correct classifications.
4. Repeat the first three steps for b = 2, ..., B.
5. Compute the standard deviation \sigma_j of the decrease in correct classifications and the z-score

z_j = \frac{\bar{D}_j}{\sigma_j / \sqrt{B}}

and then convert z_j to a significance value, assuming a Gaussian distribution.
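Both importance measures can be sketched in a few lines of base R, under the simplifying assumption that the class probabilities at a node and the per-tree decreases in correct classifications are already available:

# Gini index of a node from estimated class probabilities p(k|t).
gini_index <- function(p) 1 - sum(p^2)
gini_index(c(0.7, 0.2, 0.1))    # more mixed nodes give larger values

# Accuracy-based (permutation) importance for one predictor X_j:
# d[b] is the decrease in correct OOS classifications of tree b after permuting X_j.
permutation_importance <- function(d) {
  B <- length(d)
  d_bar <- mean(d)                         # average decrease across the B trees
  z <- d_bar / (sd(d) / sqrt(B))           # z-score as in the text
  p_value <- pnorm(z, lower.tail = FALSE)  # significance under a Gaussian assumption
  c(importance = d_bar, z = z, p = p_value)
}
permutation_importance(c(3, 5, 2, 4, 6, 3, 4))   # illustrative per-tree decreases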
Chapter 3
Data and Computational Procedures
This chapter covers the data gathering, data cleaning and other procedures related
to the architecture building used to evaluate the models proposed in this study. As
explained before, the data evaluated in this study is from 160 companies that are
part of the current S&P500 (TICKER: GSPC) from the following sectors: health
care, financial services and technology. These sectors were chosen due to their distinct characteristics. The financial services sector is intrinsically more sensitive to changes in economic indicators and other variables such as oil prices, financial markets (both local and global), the federal funds rate, interest rates, and others. The health care industry is related to another set of variables, most of
them not directly related to the variables which influence financial services. On the
other hand, the technology sector shares some variables with the financial market,
but has less correlation with the health care sector. By using a combination of all
three sectors, the data reflects changes in the market based on a broad variety of
factors. This set of companies was chosen in order to make the data as unbiased
as possible, and therefore closer to reality. As shown in previous chapters, these
sectors have significant weight in the S&P500 index.
In the past, a great deal of research has been done in this area trying to model and predict corporate credit ratings. As seen in [Lee, 2007], most of it uses techniques to narrow the problem: Lee's study showed that corporate credit ratings can be narrowed down to five categories, from AAA to C, using data from 1997 to 2002. [Darrell Duffie, 2003] explores different approaches using coarser classifications: he shrinks the number of possible categories in order to improve the results of the classification.
In this thesis, none of those techniques was used to shrink the categories available to the machine learning models; instead, it was decided to use as much data as possible, which means that for the previously cited 160 companies, all data available in their financial statements since 1990 was gathered and used in the procedures. It is important to restate that this study did not use financial ratios, and instead used the full value of each variable as input, normalized using techniques discussed further on. Because the data is not used as ratios, it becomes even more important to have a heterogeneous dataset for each sector. The numbers presented for each sector and for each variable are of different natures, and the goal of this study is to develop a model able to identify and generalize that information in order to correctly predict a corporate credit rating.
3.1 Data Source and Preprocessing
This section presents the data source, the data gathering procedures, and the preprocessing techniques used in this study. The data was
gathered using the R programming language to access the API of Bloomberg Ter-
minal. All the data gathering was done in the Hanlon Financial Systems Lab, which
has several Bloomberg terminals available for student use. The data gathering and
preprocessing is as follows:
1. Gather financial data:
(a) Define the companies
(b) Define the period
(c) Define the frequency
(d) Define the variables
(e) Preprocess the financial data
2. Gather ratings information
3. Preprocess the ratings data
4. Combine the preprocessed data
For gathering the financial data, the first step, the definition of the companies to be evaluated, was explained in the previous section. The second step, defining the period of analysis, was based on the availability of the data. As this study aims to deal with a big data problem, it would ideally gather as much data as available. However, due to restrictions in the database and to the existence of the companies (with the same name and ticker) over time, the time period was restricted in order to maximize the quality of the data. For this reason, the period was defined from the year 1990 until 2014. This incorporates 24 years of financial data, which is enough to be considered a big data problem and enough to perform this study. For the definition of the data frequency, it was observed that corporate credit ratings do not change often, but that a wrongly classified corporate credit rating lagging by more than one quarter would be too risky. In order to make the model responsive to changes in ratings, it was decided to collect data on a quarterly basis, coinciding with the public financial reports that the companies are obliged to release. In order to have the maximum amount of information available for the machine learning algorithms, a robust and complete data gathering process was deployed and more than 900 features were extracted for each company for each quarter when the data was available. When not available, the data was filled with NA values. The data was then saved in a serialized
format on the disk. To better analyze and verify the data, the serialized file was transformed into a comma-delimited values format and inspected using spreadsheet software. The preprocessing of the gathered financial data focused first on eliminating the features with the most NA data points, and then on verifying the existence of NA data points row-wise. Columns and rows with more than 20% of their values as NA were deleted. After this verification the resulting dataset contained 230 different features for each quarter for each company. The basic call for gathering the financial data is as follows:
BDH(ticker, apiform, startdate, enddate, overrides)
Here the ticker represents the company to be searched for, the apiform represents the variable to be searched, the start and end dates stand for the period over which the analysis is to be performed, and the overrides are the specifications that have to be passed as arguments to the API in order to get the correct results. For this study the override was "BEST_FPERIOD_OVERRIDE", with the value "QUARTERLY" [Bloomberg Finance L.P., 2014].
The next step in preparing the experiment consists of gathering the ratings. The ratings were gathered individually from the Bloomberg Database and stored for the purpose of this study. They were collected individually and preprocessed in spreadsheet software, in which they were assembled into a database-like, machine-readable file.
After gathering the ratings, it was possible to merge the datasets using the R programming language in order to tie each rating to its respective financial data. To make later analysis possible, columns were added to the dataset, such as "classification", which states whether the related rating is investment grade (IG) or non-investment grade (N-IG); the respective rating without its modifier, + and - for Standard & Poors and 1, 2, 3 for Moody's; and the related movements for each of those added columns, indicating the status of the related rating in comparison with the previously given rating. This combined and preprocessed dataset is then saved in serialized form for later use as input to the machine learning algorithms. Another important point about the preprocessing is that all the movements were calculated in this part of the study but only used later in the results analysis: they were not fed to the machine learning algorithms. If a movement was greater than 5 positions, either up or down, it was treated as an outlier and excluded from further investigation.
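The cleaning rules just described translate into a short base-R sketch; the assumed layout (one row per company and quarter, ratings already on a numeric scale) is an illustration, not the exact structure used in the study:

# Drop columns and rows with more than 20% NA values, as described above.
drop_sparse <- function(df, max_na = 0.20) {
  keep_cols <- colMeans(is.na(df)) <= max_na
  df <- df[, keep_cols, drop = FALSE]
  keep_rows <- rowMeans(is.na(df)) <= max_na
  df[keep_rows, , drop = FALSE]
}

# Flag rating movements larger than 5 positions as outliers; the input is one
# company's scaled ratings ordered by quarter.
flag_outliers <- function(scaled_rating) {
  movement <- c(NA, diff(scaled_rating))   # change versus the previous quarter
  abs(movement) > 5                        # TRUE marks an outlier movement
}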
3.2 Data structure and characteristics
This section presents the particularities and characteristics of the data used in this study. For certain aspects of the data it is possible to perform the analysis without specifying the credit rating agency, because the information shown relates to the companies and not to their corporate credit ratings. Since the credit rating agencies have access to proprietary information at any time, meaning that they are not restricted to the quarterly public financial statements, it was assumed that a rating given at any point between the first and the last day of a quarter belongs to that quarter, and that it is related to the previous four quarters of financial data of the same company, with a one-quarter offset, in order to guarantee that no future data is used.
Even though this study uses a total of 160 companies, not all of them have data available over the whole period needed for the research. It was decided not to update the constituents of the S&P500 index, since the focus of this study is not on the companies themselves but on modelling the credit rating agencies during the determined time frame. Figure 3.1 displays the number of companies in this study over the years.
Figure 3.1: Number of companies in the study over the years.
Figure 3.2 shows the distribution of entries for each sector. Although the companies are fairly evenly distributed across the sectors, the number of rated data points in the Financial Services sector is greater than in the others. Because of the nature of their business, Financial Services companies need to be rated and are more often involved in financial transactions in which those ratings are required, either by regulation or by the parties involved.
Figure 3.2: Distribution of datapoints in the sectors.
This section will also discuss the specifics of each credit rating agency. As stated before, this study focuses on the two major players in the market, Standard and Poor's and Moody's. Their data are analyzed separately due to their individualities and the differences in their rating methods. As observed in this research, the two agencies do not always agree on a corporate credit rating. Furthermore, past events such as the 2009 financial crisis have shown that the ratings of these agencies are not always in accordance with the market. The goal of this study is not to analyze the accuracy of their ratings; it is instead to create a model that is more accessible and faster, with the same reliability as their methods.
For the Moody's data analysis, the dataset used in this study has the following distribution:
Figure 3.3: Ratings distribution for Moody's.
When a rating shows a value of zero, it means that the corresponding credit rating did not appear in our source data during the period of analysis. Using the process explained above, it is evident that the ratings are concentrated around A3, which is still considered investment grade but not high grade. Figure 3.3 uses data from the whole time frame and ignores the periods in which a company is in default or listed as NR (non-rated). When analyzed at the investment-grade versus non-investment-grade level, the Moody's distribution is as follows:
Figure 3.4: Investment Level distribution for Moody’s.
Figure 3.4 is consistent with Figure 3.3, since most of the ratings in the first graph are at investment-grade level. All the previous graphs have shown overall characteristics of the data, either in general or specific to Moody's. One important aspect, if not the most important, is to evaluate the behavior of the proposed model at the points where a company's corporate credit rating changes. In the following graph it is possible to visualize the quantity of changes over the years analyzed in this study, together with the average change step for each period. The change step is the size of the movement of a rating. For example, if a rating is Ba2 and in the next quarter it changes to B3, the change step has a value of 4, even though it was a downgrade. The average change step is the sum of all change steps divided by the number of changes in the period. The result of those calculations can be seen in Figure 3.5, where the bars represent the quantity of changes in each year and relate to the left axis of the graph, and the line represents the average change step for the period and relates to the right axis.
Figure 3.5: Changes over the years for Moody's.
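As an arithmetic illustration of this calculation, with made-up ratings on the numeric scale introduced later in Table 3.1:

# Change step = absolute size of a rating movement on the numeric rating scale.
ratings <- c(10, 10, 10, 6, 6, 7)          # e.g. Ba2, Ba2, Ba2, B3, B3, B2
steps   <- abs(diff(ratings))              # 0 0 4 0 1
changes <- steps[steps > 0]                # keep only actual changes
quantity_of_changes <- length(changes)     # 2
average_change_step <- mean(changes)       # (4 + 1) / 2 = 2.5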
One interesting observation in this graph is the spike in both the line and the bars in 2009, the year after the most recent US financial crisis. The earlier spike in 1993 is due to the limited and small sample of rating changes in our dataset for that year, which represents neither the whole market nor the whole S&P500 index. Another interesting analysis concerns the changes per credit rating. As shown in Figure 3.6, there are ratings, such as Aaa, Ca, and C, that do not present any changes in the period studied, which covers 25 years of data from those sectors. Another observation about this graph is the spike for the rating Caa3, for which there is no clear explanation; the only plausible one would be that those ratings have a high chance of default or of misclassification by the credit rating agencies.
Figure 3.6: Changes per rating for Moody's.
The following paragraphs display the same analysis for the Standard & Poor's credit rating agency. The first graph shows the distribution of the corporate credit ratings of the same companies and over the same period as those analyzed in the Moody's case. Comparing the ratings from both credit rating agencies, it is evident that they are similar. Later in this study, the general divergences between the agencies will be shown.
Another important aspect of the Standard & Poors ratings distribution is the ratio of companies classified as investment grade (IG) to companies that are non-investment grade. That relationship is displayed in Figure 3.8, and it is consistent with the ratings distribution in Figure 3.7.
Again, one of the most important aspects of the data is the analysis of the periods of change; for Standard & Poors, those periods are characterized in the following graphs.
Figure 3.7: Ratings distribution for Standard & Poors.
Figure 3.8: Investment Level distribution for Standard & Poors.
Figure 3.9: Changes over the years for Standard & Poors.
Figure 3.10: Changes per rating for Standard & Poors.
Figure 3.9 shows the quantity of changes and the evolution of the average change step over the years, while Figure 3.10 shows the quantity of changes for each corporate credit rating as bars and the average change step as a line. As in the Moody's case, there is a spike in the year 2009, which is understandable as a consequence of the crisis the US faced in that period.
During the conception of this study, one question raised concerned the agreement between the corporate credit ratings given by the two credit rating agencies analyzed here. To proceed with this analysis, the ratings were matched by evaluated company (ticker) and by period (year and quarter) for both agencies, and both ratings were transformed using the scale displayed in Table 3.1. After transforming the ratings into numbers, the absolute difference between the two agencies' ratings was calculated, which made it possible to compute the average distance between the given ratings. Another measurement is the notch difference between the ratings given by the two agencies; the notch concept is explained in the next subsection.
From this transformation of the ratings to scaled numbers, it was possible to construct the following figures. Figure 3.11 visualizes the difference that exists between the credit rating agencies. The line "% exact match" represents the percentage of ratings that match exactly in the given year, the line "% notch match" represents the percentage of matches for which the distance between the ratings is less than or equal to 1, and the bar labelled "ave diff" represents the average distance between the agencies' ratings. Figure 3.12 uses the same quantities, but instead of aggregating by year, it aggregates by corporate credit rating, using the Moody's scale as the basis for the analysis.
Figure 3.11: Comparison of the credit rating agencies over the years.
Figure 3.12: Comparison of the credit rating agencies by the rates.
Standard & Poors Moody’s Scale
AAA Aaa 21
AA+ Aa1 20
AA Aa2 19
AA- Aa3 18
A+ A1 17
A A2 16
A- A3 15
BBB+ Baa1 14
BBB Baa2 13
BBB- Baa3 12
BB+ Ba1 11
BB Ba2 10
BB- Ba3 9
B+ B1 8
B B2 7
B- B3 6
CCC+ Caa1 5
CCC Caa2 4
CCC- Caa3 3
CC Ca 2
C C 1
Table 3.1: Table of scale values for the corporate credit ratings
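This transformation and the agreement statistics can be sketched in R as follows; the matched rating vectors are illustrative examples, not the study data:

# Map rating symbols to the numeric scale of Table 3.1 (1 = C, ..., 21 = AAA/Aaa).
sp_levels <- c("C","CC","CCC-","CCC","CCC+","B-","B","B+","BB-","BB","BB+",
               "BBB-","BBB","BBB+","A-","A","A+","AA-","AA","AA+","AAA")
moody_levels <- c("C","Ca","Caa3","Caa2","Caa1","B3","B2","B1","Ba3","Ba2","Ba1",
                  "Baa3","Baa2","Baa1","A3","A2","A1","Aa3","Aa2","Aa1","Aaa")
sp_scale    <- setNames(1:21, sp_levels)
moody_scale <- setNames(1:21, moody_levels)

# Matched ratings for the same companies and quarters (illustrative values).
sp    <- c("A",  "BBB+", "AA-", "BB")
moody <- c("A2", "Baa1", "A1",  "Ba2")

diff_scale   <- abs(sp_scale[sp] - moody_scale[moody])
exact_match  <- mean(diff_scale == 0)   # share of exact agreements
notch_match  <- mean(diff_scale <= 1)   # share of agreements within one notch
average_diff <- mean(diff_scale)        # average distance between the agencies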
3.3 Framework architecture
This section explores the construction of the framework used to perform this study and obtain the respective results. The machine learning methods applied were explained in detail in previous sections of this dissertation, and all the frameworks were built in the R programming language. However, one of the premises of this work is that it should be portable to other platforms more suitable for a production environment, such as Python or JavaScript. Bearing that in mind, the code was developed in R using functions and separate environments, which are natural constructs in the R language. The preprocessing already relates each available credit rating of each company to its respective financial information in a machine-readable format. Given that, the first step of the framework is a function that loads and slices the dataset with the correct arguments for each iteration. The experiment consists of evaluating the model quarterly, training on all data available up to a given quarter and testing on the subsequent one. This reinforces that no data from the future is used either for training or for validation, ensuring an out-of-sample evaluation. This first function is responsible for identifying the training time frame and the test time frame for each iteration, slicing the complete dataset according to those time frames, and feeding the resulting data frames to the functions responsible for applying the machine learning techniques.
At this point there are two data frames: the training data and the test data. The next step in the framework is the division of the training data into training and validation data. All the machine learning techniques were fed both datasets, performing the learning process on the training data and checking the performance of that learning process on the validation data. The ratio used to split those data frames was 80% for the training dataset and 20% for the validation data. No modifications were made to the test data, in order to keep it close to reality. To better illustrate this slicing process, take for example the iteration that performs the test for quarter 02 of the year 2005. For this scenario, the function slices all data available up to and including 2005 quarter 01 and registers it as the training dataset, and then takes the data from 2005 quarter 02 and saves it as the test set. Furthermore, the training set previously mentioned is again sliced into two different data frames: the training data, which contains approximately 80% of the data available for training, and the validation data, which contains approximately the remaining 20%. The second step, after the data slicing and preparation for the iteration, is to perform a feature selection using the Gini index, which is implemented together with the Random Forest. From all the available features, it selects the best features, those representing at least 80% of the response, which was observed to be usually between 40% and 60% of the available features.
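A simplified sketch of this slicing step in R, assuming a data frame with numeric year and quarter columns and a random 80/20 split of the past data:

# Slice the full dataset into train / validation / test for one iteration
# (illustrative: df is assumed to have numeric 'year' and 'quarter' columns).
slice_iteration <- function(df, test_year, test_quarter, train_share = 0.8) {
  idx_time  <- df$year * 4 + df$quarter
  test_time <- test_year * 4 + test_quarter
  test <- df[idx_time == test_time, ]            # e.g. 2005 Q2
  past <- df[idx_time <  test_time, ]            # everything up to 2005 Q1
  n_train    <- floor(train_share * nrow(past))
  train_rows <- sample(nrow(past), n_train)      # ~80% for training
  list(train = past[train_rows, ],
       valid = past[-train_rows, ],              # remaining ~20% for validation
       test  = test)
}
# Example: sets <- slice_iteration(full_data, 2005, 2)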
After the features are selected, the next step in the framework is the training of two different models: one aiming to predict the movement, that is, whether a given rating is going up, going down, or not changing, and a second one aiming to predict the probability that a credit rating will change or keep its current value. Those models are trained using the Random Forest method and saved to disk for later use. The next step in the framework is to use the two previously trained models to predict these values on the training data and the validation data, always keeping the training, validation, and test data separated. The predicted values are then combined with the previously selected features and fed to the other machine learning techniques.
This part of the architecture is the same for all variations of the code, meaning that the data preparation and feature selection are identical for all the following parts. During the conception of this study, a brainstorming exercise was carried out to better understand the data and to conceptualize a solution for the proposed problem. From reviewing and discussing previous studies, it became apparent that the data could present the challenge of overfitting, since ratings do not change often over the years and there is a large number of available features. The problem of overfitting was addressed with the use of feature selection, which shrinks the number of features fed into the machine learning algorithms, and also with regularization techniques applied to the loss function of the machine learning algorithms. The latter was mainly applied to the Multilayer Perceptron technique, since it presented characteristics of overfitting in the earlier tests, with its error rate concentrated on the corporate credit ratings that changed. The problem of overfitting and the high error rate in the changing periods for the MLP was solved, or at least reduced, after the use of regularization techniques such as Lasso and Ridge.
With the data preprocessed, prepared, and the features selected, the architecture moves on to the machine learning algorithms. At this stage of the code there are two datasets to be presented to the algorithms: the training data and the validation data, divided as explained before. For this part of the study a series of different constructions was tested. The first used both MLP and Random Forest, training a first guess and then refining it using a different model for each notch with three possible outputs: the closest upper rating, the closest lower rating, and the rating itself, using the ratios of each quarter. Other constructions used different configurations of Random Forest and MLP, using them to predict the data frame and, in a later step, another MLP, always in a supervised fashion, to learn from the data which predictor was best at each moment. The main goal of this study is not to evaluate the differences between those architectures but to evaluate the existence of a model that can predict with satisfactory accuracy. For the scope of this study, those different architectures were tested and the one with the best results was selected. The method that presented the best results was the architecture in which, after the first steps, an MLP algorithm with backpropagation and a Random Forest algorithm are run separately, both aiming to learn the same variable, the corporate credit rating as a whole, in which each rating followed by its modifier is treated as a different class and presented as-is to the algorithms. The basic architecture used to implement the algorithms is displayed in Figure 3.13.
Figure 3.13: Architecture Flowchart.
The process above describes each iteration of training (the learning phase) and testing. The following explanation concerns the functions and procedures in which the performance indicators are defined and calculated for each iteration. As explained before, the performance indicators for this study are the accuracy achieved by both models, the MLP and the Random Forest, for the predicted values of the full ratings, the crude ratings, and the notch ratings in the quarter defined as the test quarter, together with the accuracy for the full ratings and notch ratings computed only over the ratings that changed from the previous period. For a better understanding of those performance indicators, let o be the output values, t the target values, and N the length of the test dataset, so that o = {o_1, o_2, ..., o_N} and t = {t_1, t_2, ..., t_N}, where each o_i corresponds to its t_i. Bearing that in mind, the first performance measurement to be explored is the accuracy of the full rating predictions over the whole test dataset. The first step is to remove the data considered outliers, meaning the observations with a movement greater than 5. After that, the procedure is to count how many o_i have the same value as t_i, as follows:
\Gamma_{fullrating} = \frac{\sum_{i=1}^{N} \theta(i)}{N}    (3.1)
where:
\theta(i) = \begin{cases} 1 & \text{if } o_i = t_i \\ 0 & \text{if } o_i \neq t_i \end{cases}    (3.2)
The next performance measurement to be calculated is the accuracy of the crude rating, that is, how many times o_i equals t_i once the ratings have been transformed. In this transformation, the function creates o'_i and t'_i, the full output and target ratings stripped of their respective modifiers, and then compares their values.
\Gamma_{cruderating} = \frac{\sum_{i=1}^{N} \theta(i)}{N}    (3.3)
where:
\theta(i) = \begin{cases} 1 & \text{if } o'_i = t'_i \\ 0 & \text{if } o'_i \neq t'_i \end{cases}    (3.4)
Following the crude rating is the notch analysis. For this performance measurement it is necessary to transform the categorical ratings, such as Aa1, Baa3, AA+ or CCC, into the corresponding numbers on a scale. This procedure is applied to t_i and o_i, and the results are called t''_i and o''_i, respectively. After the transformation, each value of o''_i is compared with its corresponding t''_i, and if the absolute distance between them on the scale is less than or equal to 1, the output is considered correct. The mathematical definition is as follows:
\Gamma_{notchrating} = \frac{\sum_{i=1}^{N} \theta(i)}{N}    (3.5)
where:
\theta(i) = \begin{cases} 1 & \text{if } |o''_i - t''_i| \le 1 \\ 0 & \text{if } |o''_i - t''_i| \ge 2 \end{cases}    (3.6)
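These three measures translate directly into a few lines of R; the sign-stripping rule and the small example below are illustrative assumptions:

# Accuracy measures over the test set; o and t are vectors of predicted and
# observed rating symbols, o_num and t_num their values on the numeric scale.
full_rating_accuracy <- function(o, t) mean(o == t)                 # Equation 3.1

crude <- function(x) gsub("[+-]$|[123]$", "", x)                    # drop the modifier
crude_rating_accuracy <- function(o, t) mean(crude(o) == crude(t))  # Equation 3.3

notch_accuracy <- function(o_num, t_num) mean(abs(o_num - t_num) <= 1)  # Equation 3.5

# Example with Standard & Poors symbols:
o <- c("A+", "BBB", "BB-")
t <- c("A",  "BBB", "B+")
full_rating_accuracy(o, t)    # 1/3
crude_rating_accuracy(o, t)   # 2/3 (A+ and A agree once the modifier is dropped)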
Those are the performance measurements used to evaluate the complete test dataset. However, since this study intends to evaluate the model thoroughly, and since the periods in which a change in the corporate credit rating is observed are the periods in which the model is truly put to the test, it was decided to implement additional performance measurements that evaluate the indicators only for the ratings t_i that changed. To perform this evaluation, the test dataset is filtered to contain only the predictions and responses for the companies whose corporate credit rating changed from the previous period. The setup is the same, except that instead of N being the length of the test dataset, the length of the filtered test dataset is represented by N'. For this specific analysis of the changes, it was decided to evaluate and keep track of only two performance measurements, full rating accuracy and notch accuracy, which follow the same logic as the previous ones:
\Gamma'_{fullrating} = \frac{\sum_{j=1}^{N'} \theta(j)}{N'}    (3.7)
where:
\theta(j) = \begin{cases} 1 & \text{if } o_j = t_j \\ 0 & \text{if } o_j \neq t_j \end{cases}    (3.8)
and,
\Gamma'_{notchrating} = \frac{\sum_{j=1}^{N'} \theta(j)}{N'}    (3.9)
where:
\theta(j) = \begin{cases} 1 & \text{if } |o''_j - t''_j| \le 1 \\ 0 & \text{if } |o''_j - t''_j| \ge 2 \end{cases}    (3.10)
The whole process is displayed in Figure 3.14.
Figure 3.14: Performance indicators flowchart.
The concern about corporate credit ratings goes beyond accuracy: it is important for the predictive model not to predict values that are distant from the real value. To better comprehend the results, the Credit Rating Dissimilarity Coefficient was created. This statistical measurement, here denoted \psi, penalizes predictions based on their distance from the observed value; the scale used to calculate the distance is shown in Table 3.1. The indicator is calculated as follows:
\psi_{O \times P} = \sum_{i=1}^{21} i \cdot \frac{\sum_{j=1}^{N} \Theta_i(j)}{N}    (3.11)
where N is the number of samples, O is the observed value (or the value to be compared against), and P is the predicted value. For each sample j, \Theta_i(j) is defined as:
\Theta_i(j) = \begin{cases} 1 & \text{if } |O_j - P_j| = i \\ 0 & \text{otherwise} \end{cases}    (3.12)
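A direct base-R rendering of Equations 3.11 and 3.12 (the example vectors are illustrative scaled ratings):

# Credit Rating Dissimilarity Coefficient: distances are weighted by their size,
# so predictions far from the observed rating are penalized more heavily.
crdc <- function(observed, predicted, max_dist = 21) {
  d <- abs(observed - predicted)             # distances on the scale of Table 3.1
  N <- length(d)
  sum(sapply(1:max_dist, function(i) i * sum(d == i) / N))
}

observed  <- c(16, 14, 10, 18)   # e.g. A2, Baa1, Ba2, Aa3
predicted <- c(16, 13, 12, 18)
crdc(observed, predicted)        # (1*1 + 2*1) / 4 = 0.75

Written this way, \psi is simply the average absolute distance between the observed and predicted scaled ratings.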
Chapter 4
Results
This chapter presents and discusses the results obtained in this study. Since the framework developed makes it possible to use different machine learning techniques, this chapter evaluates the results for the Multilayer Perceptron and the Random Forest techniques, and discusses the results obtained with the methods previously presented. To develop statistical results, it was decided to work with the Chi-square test, used both to test for independence between two discrete distributions and as a goodness-of-fit test. The first aims to test whether one distribution is independent of the other; the goodness-of-fit test has the null hypothesis that the observed frequency f is equal to an expected count e for each category. The null hypothesis is rejected if the p-value of the calculated Chi-square test statistic is less than a given significance level \alpha. A literature review of the \chi^2 (Chi-square) test is presented in Appendix B. As explained before, the tests presented here were performed in an out-of-sample (OOS) fashion, in which the machine learning technique did not have access to the test data during the learning phase; this supports the validity of the results. The results are first presented by agency (Standard & Poors and Moody's), and for each one the results are given for the test data: first for the entire dataset, and then only for the periods in which the corporate credit rating changed compared with the previous period. For the analysis of the changes, the first score of any given company is excluded because there is no previous score for comparison. Additionally, scores that changed more than five notches compared to the previous score were considered outliers and also excluded. The results are presented first for the MLP technique and then for the Random Forest, followed by a comparison between the two methods.
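Both tests are available in base R through the chisq.test function; a minimal illustration with made-up frequency counts per rating class:

# Illustrative use of the two chi-square tests on made-up rating counts.
observed  <- c(AAA = 10, AA = 25, A = 60, BBB = 40, BB = 15)   # observed frequencies
predicted <- c(AAA = 12, AA = 22, A = 58, BBB = 42, BB = 16)   # predicted frequencies

# Goodness-of-fit: do the observed counts follow the predicted distribution?
chisq.test(observed, p = predicted / sum(predicted))

# Independence: are the two count distributions independent of each other?
chisq.test(rbind(observed, predicted))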
4.1 Standard & Poors
This section discusses the results obtained from the models created using the Standard & Poors credit ratings and the methodology described in Chapter 3, both for the construction of the experiment and for the calculation of the result measurements. Two different machine learning techniques were used: MLP and Random Forest. The results are presented for both methods and for each measurement, and at the end there is a discussion of each method's efficiency. The first analysis concerns the distribution of the predicted values versus the observed values. The following graphs show the distribution of the corporate credit ratings produced by those models. Figure 4.1 shows the frequency of the observed values for each rating against the frequency of each predicted rating when using the Random Forest model on the Standard & Poors ratings. Similarly, Figure 4.2 shows the distribution of observed corporate credit ratings against the predicted ratings from the Multilayer Perceptron model.
These figures show that the Random Forest model appears, at least visually, to be closer to the observed distribution than the MLP model. To examine this conclusion more rigorously, a \chi^2 test was employed to test for independence and for goodness-of-fit. In addition, the correlation between the two distributions was calculated to serve as a simple measure of comparison; it is displayed in Table 4.1.
Figure 4.1: Ratings distribution for Random Forest model for Standard & Poors.
Standard & Poors Statistics Multilayer Perceptron Random Forest
Correlation 0.547542 0.928467
Independence test 2.2e-16 2.2e-16
"Goodness-of-fit" test 2.2e-16 2.874e-05
Table 4.1: Statistics for Standard & Poors
As shown in Table 4.1, the correlation for the Random Forest model's predictions is considerably higher than the correlation presented by the MLP model. With
a correlation of approximately 93%, there is a strong indication that the observed data and the predicted data are dependent. Another measurement presented here is the \chi^2 test for independence. For this test both results were similar: for both methods the returned p-value was less than 0.05, meaning that with a confidence level of 95% it is possible to reject the null hypothesis. The null hypothesis of the \chi^2 independence test is that the two distributions are independent; by rejecting it, it is possible to conclude that the distributions are not independent, and are in fact dependent. The \chi^2 goodness-of-fit test, on the other hand, has the null hypothesis that one distribution fits the other; in this application, it tests whether the predicted data fits the observed data. The result of this test was less than 0.05 for both models, so it is possible to reject the null hypothesis, implying that the predicted values of both methods do not follow exactly the same distribution as the observed values. This statistical picture is understandable: the predictions of both models show a good relationship with the observed data, hence the low p-values in the independence test, and the Random Forest produced the closer result, hence its higher correlation. The goodness-of-fit result implies that neither model's predictions follow exactly the same distribution as the observed data. This does not imply a failure to accurately predict the credit rating, given that the observed data from S&P and Moody's do not always align over time either.
Figure 4.2: Ratings distribution for MLP model for Standard & Poors.
Figure 4.3 and Figure 4.4 show the average difference between the predicted and the observed ratings over the years. For a better understanding, four different performance measurements are used. The lines on the graph represent the percentage of matches in each scenario: exact match, within one notch, or within two notches; these lines relate to the left axis of the graph. The bars represent the yearly average distance between the predicted and the observed values. All those calculations were
Figure 4.3: Ratings over years for Random Forest model for Standard & Poors.
Figure 4.4: Ratings over years for MLP model for Standard & Poors.
  • 1. CORPORATE CREDIT RATING PREDICTION USING MACHINE LEARNING by Pedro Henrique Veronezi e Sa A THESIS Submitted to the Faculty of the Stevens Institute of Technology in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE - FINANCIAL ENGINEERING Pedro Henrique Veronezi e Sa, Candidate ADVISORY COMMITTEE Rupak Chatterjee, Advisor Date David Starer, Reader Date STEVENS INSTITUTE OF TECHNOLOGY Castle Point on Hudson Hoboken, NJ 07030 2016
  • 2. c 2016, Pedro Henrique Veronezi e Sa. All rights reserved.
  • 3. iii CORPORATE CREDIT RATING PREDICTION USING MACHINE LEARNING ABSTRACT This study uses machine learning techniques, such as the Random Forest and the Multilayer Perceptron method, to predict corporate credit ratings for any given company using publicly available financial data. Those ratings are then com- pared to Standard & Poors and Moody’s credit ratings. The data used comes from financial reports from 170 companies from the Health Care, Financial Services, and Technology sectors of the S&P500 index, dating from 1990 to 2014. The pe- riod of investigation is from 2000 to 2014. This study uses a specific Machine Learning architecture framework for both learning methods. This thesis also intro- duces a new performance measurement called Credit Rating Divergence Measure- ment. It is a statistical measure to compare the ratings from the prediction models and the ratings from the credit rating agencies. The results presented in this study show that it is possible to rate a company, even if it is not publicly traded, using the same standards as the two biggest credit rating agencies through the use of Machine Learning techniques. Machine Learning makes the rating process faster and more efficient. The architecture framework presented achieved one notch ac- curacy of more than 90% over the investigation period, for both credit agencies, and a Credit Rating Divergence Measurement of less than 0.40, which is approximately 60% better than the benchmark. Author: Pedro Henrique Veronezi e Sa Advisor: Rupak Chatterjee Date: May 2, 2016 Department: School of System Enterprises Degree: Master of Science - Financial Engineering
  • 4. iv Dedication This thesis is dedicated to my family, who supported me during this phase; especially to my mother, Lucila Maria Veronezi and to my grandmother, Barbara Quagliato Veronezi.
  • 5. v Acknowledgments I would like to acknowledge my advisor, Rupak Chatterjee, for his support in this opportunity of personal and professional growth. I would like to acknowledge my girlfriend, Katherine Thompson, who stood by my side during this process offering nothing but support and help. I also would like to acknowledge my family, as they were always present during my masters program and supported me in difficult moments.
  • 6. Table of Contents Abstract iii Dedication iv Acknowledgments v List of Tables viii List of Figures ix Chapter 1 Introduction 1 1.1 The Problem Statement 4 1.1.1 Hypothesis 6 1.2 Problem Scope 7 1.3 Research Approach 9 1.4 Organization and Structure 11 Chapter 2 Literature Review 14 2.1 Standard & Poor’s Method 16 2.2 Moody’s Method 21 2.3 Machine Learning Techniques 25 2.3.1 Deep Learning 25 2.3.2 Neural Networks 26 vi
  • 7. vii 2.3.2.1 Multilayer Perceptron 32 2.3.3 Decision Tree 40 2.3.3.1 Random Forest 40 Chapter 3 Data and Computational Procedures 43 3.1 Data Source and Preprocessing 44 3.2 Data structure and characteristics 47 3.3 Framework architecture 57 Chapter 4 Results 67 4.1 Standard & Poors 68 4.2 Moody’s 80 4.3 Credit Rating Dissimilarity Coefficient 91 Chapter 5 Summary and Conclusion 97 5.1 Conclusion 97 5.2 Further Research 100 Appendices 103 Appendix A 103 Appendix B 107 Bibliography 109
  • 8. List of Tables 3.1 Table of scale values for the corporate credit ratings 57 4.1 Statistics for Standard & Poors 69 4.2 Statistics for Moody’s 81 5.1 List of companies used on this study. 103 viii
  • 9. List of Figures 1.1 S&P500 index breakdown by GICS sectors, as of Jan 29, 2016. 7 2.1 Rating Agencies around the world, as of 2006. Source International Rating Group (IRG) 15 2.2 *Due to lack of data, the highest number of rated issuers by one of 3 credit rating agencies is assumed to be the total rated issuers, then the coverage is calculated based on this total number. Source [Estrella, 2000] 16 2.3 Corporate Criteria Framework. Source [Standard & Poors, 2014a] 20 2.4 Ratings summary. Source [Standard & Poors, 2014a] 22 2.5 Corporate Default summary. Source [Standard & Poors, 2015] 23 2.6 Neuron representation 29 2.7 Multilayer Neural Network 34 2.8 Basics schematics of a Random Forest 40 3.1 Number of companies in the study over the years. 48 3.2 Distribution of datapoints in the sectors. 49 3.3 Ratings distribution for Moody’s. 49 3.4 Investment Level distribution for Moody’s. 50 3.5 Changes over the years for Moody’s. 51 3.6 Changes per rating for Moody’s. 52 3.7 Ratings distribution for Standard & Poors. 53 ix
  • 10. x 3.8 Investment Level distribution for Standard & Poors. 53 3.9 Changes over the years for Standard & Poors. 54 3.10 Changes per rating for Standard & Poors. 54 3.11 Comparison of the credit rating agencies over the years. 56 3.12 Comparison of the credit rating agencies by the rates. 56 3.13 Architecture Flowchart. 61 3.14 Performance indicators flowchart. 65 4.1 Ratings distribution for Random Forest model for Standard & Poors. 69 4.2 Ratings distribution for MLP model for Standard & Poors. 70 4.3 Ratings over years for Random Forest model for Standard & Poors. 71 4.4 Ratings over years for MLP model for Standard & Poors. 71 4.5 Accuracy by ratings for Random Forest model for Standard & Poors. 72 4.6 Accuracy by ratings for MLP model for Standard & Poors. 73 4.7 Cumulative overall accuracy for Random Forest model for Standard & Poors. 75 4.8 Cumulative overall accuracy for MLP model for Standard & Poors. 76 4.9 Cumulative changes only accuracy for Random Forest model for Standard & Poors. 76 4.10 Cumulative changes only accuracy for MLP model for Standard & Poors. 77 4.11 Random Forest vs Multilayer Perceptron performance comparison for Stan- dard & Poors. 78 4.12 Random Forest vs Standard & Poors performance comparison. 79 4.13 Ratings distribution for Random Forest model for Moody’s. 81 4.14 Ratings distribution for MLP model for Moody’s. 82 4.15 Ratings over years for Random Forest model for Moody’s. 83
  • 11. xi 4.16 Ratings over years for MLP model for Moody’s. 83 4.17 Accuracy by ratings for Random Forest model for Moody’s. 84 4.18 Accuracy by ratings for MLP model for Moody’s. 85 4.19 Cumulative overall accuracy for Random Forest model for Moody’s. 86 4.20 Cumulative overall accuracy for MLP model for Moody’s. 87 4.21 Cumulative changes only accuracy for Random Forest model for Moody’s. 87 4.22 Cumulative changes only accuracy for MLP model for Moody’s. 88 4.23 Random Forest vs Multilayer Perceptron performance comparison for Moody’s. 89 4.24 Random Forest vs Moody’s performance comparison. 90 4.25 Credit Rating Dissimilarity Coefficient for overall ratings. 92 4.26 Credit Rating Dissimilarity Coefficient for changes ratings. 95
  • 12. 1 Chapter 1 Introduction Credit ratings have become ubiquitous these days since all market agents have come to depend on these reports. Investors use these ratings to determine their positions on any given corporate financial instrument, e.g. bonds, stocks, credit default swaps, etc. The issuers of bonds, the ones whose ratings are scored, know that the rating affects their financial costs in a very fundamental way. After some failures in the market, even the regulators use credit ratings as parameters for a series of regulations, from allowable investment alternatives to required capital for most global banking firms. For instance, we can cite the regulation which states that pension funds must only hold Investment Grade (IG) corporate bonds. This regulation can cause considerable changes in the market, since pension funds play an important role in the financial system. So one can infer that changing the credit score of a bond issuer, e.g. changing it to a non-Investment Grade rating, can cause a rush to sell its bonds and a big loss in the bonds' value. The market can be seen as an exchange of information in many different ways, and the stakeholder needs access to information about the company in order to make investment decisions and be better informed about the company's strategy. In order to make this information standardized and available throughout the market, the concept of ratings was created. The first appearance of a standardized ratings business was in the 19th century, at the time of the US railroad expansion. Henry Poor recognized the problem of an information gap between the investors and the companies building the railroads. As stated by
  • 13. 2 [Langohr, 2010] Henry Poor was an editor for a local journal focused on railroads called The American Railroad Journal and gathered information on the business standing and creditworthiness of companies using a network of agents spread all across the US. Later, in the beginning of the 20th century, John Moody initiated agency bond ratings in the US, expanding his business analysis services to include a rating based on the company's credit risk. In 1999 the Bank for International Settlements (BIS) proposed rule changes that would provide an explicit role for credit ratings in determining a bank's required regulatory risk capital, widely known as Basel II. The BIS proposal vastly elevated the importance of the credit rating by linking the required bank capital to the credit rating of its obligations. [Richard M. Levich, 2002] With the advance of time and technology, followed by the globalization of the financial market, the quantity of information available at both the macroeconomic and institutional level increased exponentially. Correspondingly, this increased the complexity of rating a company, creating more information asymmetries and elevating the value of the credit rating agencies. [Richard M. Levich, 2002] Moving forward to today, three main companies dominate both the US and global market: Standard & Poor's Financial Services LLC, Moody's Investors Service, Inc. and Fitch Ratings Inc. According to [Hill, 2003] those companies hold a collective global market share of roughly 95% as of 2013, when the study was conducted. Until 2003, those same companies were the only "Nationally Recognized Statistical Rating Organizations" (NRSROs) in the US, a designation that means their credit ratings were used by the government in several regulatory areas. [U.S. SEC, 2003] Since corporate credit ratings are such an important force in the worldwide economy, there have been many studies attempting to develop different methods to understand, predict, and model the rating process. In [Figlewski et al., 2012],
  • 14. 3 the author explores how general economic conditions impact defaults and major credit rating changes by fitting a reduced-form Cox intensity model with a broad range of macroeconomic and firm-specific ratings-related variables. [Frydman and Schuermann, 2008] proposes a parsimonious model that is a mixture of two Markov chains, implying that the future distribution of a corporation depends not only on its current rating but also on its past rating history. Also, [Koopman et al., 2008] proposes a new empirical reduced-form model for credit rating. The model is driven by exogenous covariates and latent dynamic factors in a generalized semi-Markov fashion, simulating transitions using Monte Carlo maximum likelihood methods. In a more technological environment, where data is available in great quantity and quality and from a variety of sources, [Lu Hsin-min, 2012] shows that the use of news coverage improves the accuracy of credit rating models by using a missing-tolerant multinomial probit model, which treats missing values within a Bayesian theoretical framework, and proves that this model outperforms an SVM model in credit rating predictions. With the increasing quantity of data and complexity of financial instruments, it is clear that credit ratings are heading towards increasingly computational methods. Most of the recent studies in the area show that the challenge for the future of credit rating prediction is to apply computational methods to a combination of financial and economic data. In [Lee, 2007], the author uses a support vector machine (SVM) to predict corporate credit ratings and compares the results with traditional statistical models, such as multiple discriminant analysis and case-based reasoning, and proves that the SVM model outperforms the traditional methods. The literature on machine learning algorithms and statistical algorithms applied to credit rating prediction has been extensively explored, and the use of hybrid
  • 15. 4 machine learning models is evolving. One example is [Tsai and Chen, 2010], in which the author explores the use of a series of different machine learning techniques and their combination and proves that this evolution improves the prediction accuracy. All those previous studies show that predicting credit ratings is an exhausting and difficult task that involves knowledge in different areas such as financial markets, financial instruments, macroeconomics, fundamental financial reports, computer science, advanced math and so on. 1.1 The Problem Statement The credit rating industry is characterized by its high barrier to entry, due to market regulation and the fact that the established companies hold more than 95% of the market [Hill, 2003]. This creates a perfect scenario for these companies to charge large fees, both to the companies that need to be rated, also known as the issuers, and to the individual or entity that desires to know the score of a rated company, which could be any investor, government or company willing to acquire the report. Another recurrent issue is error in those ratings, since their classification methods have not always been shown to reflect the real default risk. Since the credit rating is important for many financial instruments and entities, correct modeling using computational methods and selecting the inputs has become a challenge for all players that need a rating without paying the large fees but still need high accuracy and credibility. If a medium-to-large corporation needs to take out a loan from a financial institution, there are a few ways for the financial institution to analyze the creditworthiness of the corporation. One of them would be to ask the corporation to face the large
  • 16. 5 fees charged by the credit rating agencies and wait for a considerable period of time, at least 6 months, to get a rating. Another method could be the use of a model based on credit default swap price information in order to predict the credit rating, as stated in [Tanthanongsakkun and Treepongkaruna, 2008]. The author uses a Black-Scholes-style option-pricing model known as the Merton model. [MERTON, 1974] shows that a company's default probability can be estimated using an option-pricing model, viewing the market equity of a firm as a European call option on its assets, with the strike price equal to the value of its liabilities. Another approach extensively explored in the field is the use of accounting-based models to explain the credit rating. Works such as [PINCHES and MINGO, 1973] use multiple discriminant analysis with factor analysis on those accounting-based features and information from the bond market. It is worth mentioning that none of the previous studies in the area using the latter method, the accounting-based model, uses more than a few years of data or an extensive list of companies. As we can see in [PINCHES and MINGO, 1973], the author uses two years of data and restricts the model to ratings above B. In another study, [Pogue and Soldofsky, 1969], the author explores the same problem explored in this study by asking the following: "how well can corporate bond ratings be explained by available financial and operating statistics?". The author makes use of six years of fundamental data (accounting-based) in the form of ratios, but the study's date, the shortage of data, and the computational methods and hardware of the time limited the extensiveness of the work. Most of the previous studies done in this area limit their data to a few years, a few companies, or a few possible ratings, such as using only investment grade companies and bonds, if not all of these restrictions together.
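For reference, the Merton setup cited above can be summarized in a few equations (shown only for context; this thesis relies on accounting data and does not implement this model). With firm asset value V_0, asset volatility sigma_V, debt face value D maturing at T, and risk-free rate r, equity is valued as a European call on the assets and the risk-neutral default probability follows from the option formula:

    E_0 = V_0 N(d_1) - D e^{-rT} N(d_2), \quad
    d_1 = \frac{\ln(V_0/D) + (r + \tfrac{1}{2}\sigma_V^2)T}{\sigma_V \sqrt{T}}, \quad
    d_2 = d_1 - \sigma_V \sqrt{T}, \quad
    \Pr(\text{default}) = N(-d_2)

where N(.) is the standard normal cumulative distribution function.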
  • 17. 6 1.1.1 Hypothesis This thesis hypothesizes that the corporate credit ratings given by the two main companies in the US (Moody's and Standard & Poors) can be explained, with a high confidence level, by the firm's accounting-based information. The use of multiple machine learning techniques is a key factor in this analysis, since the shortage of public financial data is not an issue and the risk of high computational cost is mitigated by advances in technology and the use of cutting-edge algorithms. With the discussed scenario in mind, this study will use data from 1990 to 2015, in a quarterly fashion, for 170 different companies, representing three main sectors of the S&P500: technology, healthcare and financial services. A framework is created in order to evaluate the model over time. This framework should be able to deal with large quantities of data, both in features and in unique entries. The second hypothesis to be tested is that the application of multiple machine learning techniques is a viable and solid method to build a prediction model. The result of the algorithms should be able to outperform previous statistical methods and other computational methods from other studies. The model as a whole should be able to perform predictions on any given company without the use of its specific financial instruments, such as bond quotes or stock prices, since for the vast majority of companies this information is not available. The model will focus on using accounting-based information with multiple machine learning techniques to perform the prediction. The preceding discussion suggests that corporate credit ratings may depend on a firm's accounting-based reports. The ratings may also depend on the rating agency's judgement about factors that are not easily measured, such as quality of management, future shifts in the market and other qualitative measures that
  • 18. 7 Figure 1.1: S&P500 index breakdown by GICS sectors, as of Jan 29, 2016. influence the long-term results of any given company; such factors are out of the scope of this study, but can easily be implemented and added on top of the current structure. 1.2 Problem Scope The scope of this study is limited to the United States market, more specifically 170 companies currently part of the S&P500 index. According to [S&P DOW JONES INDICES, 2016] those companies together represent approximately 51.3% of the index, as of January 2016, as we can see in Figure 1.1. This index has been on the market since the 4th of March 1957, and has 504 constituents. The maximum market cap of its constituents is 542,702.72 million and the minimum is 1,818.69 million, with an average of 35,423.67 million, values in US dollars. The history of corporate credit ratings dates back to the 1900s, but for the scope of this thesis the window of data gathered is restricted from 1990 to 2015,
  • 19. 8 and the complete list of companies is provided in Appendix A. If any given company does not have a credit rating for any reason, its data is not considered. The quantity of data used is large enough for this research to be considered a Big Data analysis, since the data consists of quarterly financial information for each company. As we can see in [has, 2015], Big Data is defined by the following aspects: data are numerous, and data are generated, captured, and processed rapidly. The nature of this study, in a production environment, satisfies all those criteria, but since this thesis is focused on the back-test results of the architecture and model created, the data does not change as rapidly as it would in a live setting. Another valid point to be raised is that the credit rating agencies have access to any financial data of the companies evaluated by them at any moment, not just at the quarterly releases. However, since the data is not publicly available until the quarterly releases, this study uses the quarterly information with a one-quarter lag, to ensure that no data from the future is used as input for the model to predict its ratings. All data was gathered using an Application Programming Interface (API) from the Bloomberg database using computational methods. The number of features available in the database is vast, so in order to make a first selection, all possible features were downloaded at first, and then the features that had fewer occurrences of NA were manually selected. The first download had more than 900 different features for each quarter for each company. From those, 230 different features were selected for each company, for each quarter. Those features are fed to a series of machine learning techniques: a multi-layer perceptron in a deep learning architecture and, in parallel, a distributed random forest; the results of those models are then fed into another multi-layer perceptron deep learning architecture, looping over the same period. This structure is called in the literature an ensemble model, and [Hsieh et al., 2012] shows that
  • 20. 9 ensemble methods can be applied successfully, improving the accuracy of single models. In order to reduce the dimensionality of the features, a feature selection technique, such as random forest, is applied to improve the accuracy and reduce the model's computational cost. The use of feature selection when dealing with a large database of features that can explain a phenomenon is believed to identify which features are relevant and which are irrelevant, improving the performance of the classifier machine learning technique. The machine learning techniques were chosen given their ability to work well with big databases and multi-classification problems. In this thesis there are 21 different categories to classify into, and more than 920 different features to explain each classification. The size of the database and the nature of the classification determined the machine learning methods chosen. 1.3 Research Approach A literature review on machine learning techniques, feature selection and ensemble models was performed in order to use the state of the art for each technique. This research focused on corporate credit rating in the US, and on the two main credit rating agencies: Moody's and Standard & Poors. During the literature review, it was found that past research uses either a small set of features or a small sample of companies, often focuses only on investment grade (IG) companies, excluding the non-investment grade ones, or uses samples of only a few years. This study was developed with the premise of evaluating any given company, at any given time (given the restrictions of data availability). In order to ensure the flexibility of the model, information such as the company's ticker, time references for the ratings and company-specific information from exchanges, e.g.
  • 21. 10 CDS or equity quotes, was not given as part of the training dataset, which was composed of publicly available financial data. A framework for back-testing was created in order to perform all the different analyses and setups, gather the results in a uniform fashion and guarantee a consistent analysis. In order to test the model's consistency over time, the framework enables testing with a rolling window that trains the model up to the selected period and tests it on the next quarter; with that setup it is impossible for the test data to be used as training data, assuring the concept of out-of-sample testing and preventing the model from showing biased results. This thesis uses different measurements to evaluate the model performance, as we can see in the list that follows: 1. Credit Rating Dissimilarity Coefficient (Overall) (a) Overall accuracy (b) Crude accuracy (c) Node accuracy 2. Credit Rating Dissimilarity Coefficient (Changes) (a) Overall accuracy on changes (b) Node accuracy on changes Item 1a is the accuracy measure for the model that represents how many of the ratings present in the test dataset the model predicted correctly. Item 1b is defined as how many times the model predicted the correct rating on a crude basis, with crude being defined by the following example: if a rating is AA+, its crude rating is defined as AA. Item 1c is defined as how many times the model predicted the right rating within an acceptable range, or node.
  • 22. 11 A node is a definition of distance between the ratings; for example, if a rating is Aa1, the node accuracy accepts an error of up to one node, which means that a prediction such as Aaa or Aa2 is counted as correct. As for the performance measurement items 2a and 2b, they are, respectively, the same concepts as explained above, but only measured for the ratings that changed relative to the previous period. By considering all those measurements, this study can test and evaluate the hypothesis previously proposed. In order to provide a statistical measurement that reflects all previous measurements, a new measurement is introduced: the Credit Rating Dissimilarity Coefficient, which is explained in more detail in Chapter 4. The Credit Rating Dissimilarity Coefficient is applied to the overall ratings in the investigation period, and also applied separately to only the ratings which changed compared to the previous rating. This new statistical measurement makes it possible to compare the predictions with the observed ratings and against a benchmark accepted by the market. 1.4 Organization and Structure Chapter 1 presents an introduction to corporate credit rating since its beginning, how the two main credit rating agencies were created, and the current market distribution for corporate credit rating. After this introduction about the market and the creation of the credit rating agencies, this study shows the different approaches taken over time to solve this problem, from the use of statistical models with a variety of features through to modern methods such as machine learning and its variations. This chapter also characterizes the problem and the current environment in which the study was developed. It also lays out the hypothesis on which
  • 23. 12 this study is based and sets the ground to evaluate the hypothesis. It continues to define the scope and the research approach chosen for the development of this study. Chapter 2 reviews the literature and the theory behind the algorithms being used. The mathematical formulation of the algorithms used is also presented so the reader can have a full understanding of the theory. The review is structured as follows: 1. Standard and Poor's Method 2. Moody's Method 3. Machine Learning Techniques (a) Neural Networks i. Multilayer Perceptron (b) Decision Tree i. Random Forest Chapter 3 details the computational procedures, such as the back-test method, the assumptions made in the simulation execution and other relevant details used to perform the tests. This chapter also presents the data structure and the preprocessing required for this study, followed by the specifics of the framework construction. Chapter 4 presents the results of the tests performed, approaching them from different angles in order to better understand the framework model application and the performance measurements.
  • 24. 13 Chapter 5 finally draws a conclusion from the test results, presents ideas for further research and discusses the viability of using this research in a production environment.
  • 25. 14 Chapter 2 Literature Review As explained in [Langohr, 2010], there are six main macroeconomic factors that shaped the current credit rating industry: financial disintermediation, institutionalization of investments, the accelerated rate of industry change, complex financial innovations, the globalization of international capital markets, and the growth in regulatory uses of ratings. Those factors transformed the market, such that as of 2009 there are about 150 local and international credit rating agencies around the world, and the major US agencies have established operations and joint ventures abroad to meet the globalization of capital markets. A big picture of the market as of 2006 is drawn in Figure 2.1. With the joint ventures and mergers that occur in this industry, it can be described as an oligopoly of three dominant global credit agencies, as we can see in Figure 2.2, where [Estrella, 2000] shows the current state of the credit rating agency industry. With all that information, it is easy to see that there are three dominant players, S&P, Moody's, and Fitch, and all of them follow a similar pattern: large companies, global focus, and cross-industry issuer- and instrument-specific ratings. They take an analytical approach with committee reporting, use ordinal scales and have an issuer-pays business model. When it comes to comparisons between the three main agencies, investors perceive all three of them equally, but in older markets, such as US corporate debt instruments, the issuers automatically get ratings from two or three different agencies. This study is focused on the first two companies, S&P and Moody's, since a preliminary analysis showed
  • 26. 15 Figure 2.1: Rating Agencies around the world, as of 2006. Source International Rating Group (IRG)
  • 27. 16 Figure 2.2: *Due to lack of data, the highest number of rated issuers by one of 3 credit rating agencies is assumed to be the total rated issuers, then the coverage is calculated based on this total number. Source [Estrella, 2000] that those companies have more information available for the companies chosen. For several years great effort has been devoted to the study of corporate credit ratings and how to predict them using computational methods. By definition, an algorithm can only present results as good as its models, which means it is important to understand the process and methods behind the ratings produced by the companies analysed in this study. In order to have a complete and improved understanding of the corporate credit rating process, each company's method is explored in this thesis. Bearing that in mind, this thesis uses accounting-based, publicly available information to fit a machine learning model for both companies' methods. The machine learning methods applied in this thesis are explained here and a high-level mathematical explanation is given as well. 2.1 Standard & Poor's Method This section refers to S&P Credit Market Services, which, as described earlier, traces its roots to the 19th century when Henry Poor published in a newspaper the financial information and analysis for the railroad companies of the time. Nowadays S&P has a parent company called McGraw-Hill, which provides financial services
  • 28. 17 related to equities, including S&P's Credit Market Services, the affiliate responsible for credit rating activity. Since the beginning, the business had stable growth, but with market-based funding becoming more common in the early 1970s, a decade marking a peak in the speculative bonds offered, S&P refined its ratings to better serve the market's needs by adding '+' (plus) and '-' (minus) to each generic category, moving from a 10-point to a 22-point scale. In the same decade, S&P decided to charge issuers for their ratings, as investor subscriptions could no longer meet the costs. In 1975, the structured finance market was created and S&P started to rate mortgage-backed securities, and in 1976 the company received the designation of Nationally Recognized Statistical Rating Organization (NRSRO), a regulatory aid created by the Securities and Exchange Commission (SEC). After that, the company kept growing by adapting to new market needs, developing new products and expanding the business worldwide, merging with and acquiring several rating agencies across all continents. S&P is widely accepted by investors in both the US and the European markets, but the US market is considered a rated market, as issuers get two ratings if not three, so S&P has a policy of systematically rating issuers in the US debt market, whether solicited or not. On the other hand, in European markets, S&P tends to be preferred over its competitors when issuers are looking for only one rating. Standard and Poor's has specialized in analysing the credit risk of issuers and debt issues. The company formulates and disseminates rating opinions that are used by investors and other market players who may consider credit risk in their decisions. The credit rating process at Standard and Poor's is given by the following steps, as seen in [Standard & Poors, 2014b]:
  • 29. 18 1. Contract 2. Pre-evaluation 3. Management meeting 4. Notification 5. Rating Committee 6. Analysis 7. Publication 8. Surveillance of rated Issuers & Issues The current payment model used by Standard and Poor's is composed of: Issuer-pay model The agency charges the issuers a fee for providing a ratings opinion. In order to conduct the analysis the agency may obtain information that might not otherwise be available to the public and factor this information into its ratings opinion. The released rating information is published broadly to the public. Subscription model The agency charges investors and other market players a fee for access to the agency's ratings. Critics point out that both this model and the issuer-pay model carry a potential conflict of interest. When rating an issuer, as seen in [Standard & Poors, 2014b], Standard & Poor's evaluates its ability and willingness to repay its obligations in accordance with the terms of said obligations. To form its opinion on the rating, S&P reviews
  • 30. 19 a broad range of financial and business attributes that may influence future payments. Those attributes include, for example: key performance indicators; economic, regulatory and geopolitical influences; management and corporate governance; and competitive position. There is a framework defining this work, as seen in [Standard & Poors, 2014a] and better explained by Figure 2.3.
  • 31. 20 Figure 2.3: Corporate Criteria Framework. Source [Standard & Poors, 2014a]
  • 32. 21 According to [Standard & Poors, 2014a], there is more than just financial ratios and accounting-based information that contributes to the rating for S&P. The agency also incorporates in its analysis the country risk, the industry risk, and the competitive position to create the business risk profile, then analyzes cash flow and leverage to create the financial risk profile, and finally analyzes the company's qualitative factors. The Standard and Poor's definition of its rating scales is given in [Standard & Poors, 2014a] and shown in Figure 2.4. According to [Standard & Poors, 2014b], Standard and Poor's tracks its ratings yearly to evaluate their accuracy. The update is always available on the S&P website www.spratings.com; the latest edition available for this study is the 2014 edition. In [Standard & Poors, 2015], the agency goes through the details of the last year's performance and keeps track of all the ratings and defaults analyzed by them since the 1980s, and it shows that the ratio of rating changes over the years is small, and usually motivated by factors and influences other than financial information. All of these data and methods were considered during the experiment construction and thesis assumptions, and all the results and comparisons are made in the corresponding section, as explained in the previous chapter. 2.2 Moody's Method This section refers to Moody's, which, as described earlier, was established in the late 19th century by John Moody when he published Moody's Manual of Industry and Corporation Securities and, later on, directly competed with S&P by publishing the first bond rating as part of Moody's Analyses of Railroad Securities. Nowadays Moody's is an essential component of the global capital markets. It
  • 33. 22 Figure 2.4: Ratings summary. Source [Standard & Poors, 2014a]
  • 34. 23 Figure 2.5: Corporate Default summary. Source [Standard & Poors, 2015]
  • 35. 24 provides credit ratings, research, tools and analysis aiming to protect the integrity of credit, as said in [Moody's Investor Service, ]. In 1962, Moody's was sold to Dun & Bradstreet, and ten years later Moody's began to assign short-term ratings and bank deposit ratings after Penn Central defaulted on its commercial obligations. Around the same time that S&P changed its business model to issuer-pays, so did Moody's, and shortly thereafter Moody's received NRSRO status along with the other two main agencies. After a series of mergers and acquisitions, the agency expanded to Europe and established a global footprint. In the early 1980s, the company refined its rating system by moving from a 9-point to a 21-point scale, a few years after S&P performed a similar restructuring. In the early 2000s the CEO, John Rutherford Jr, pointed out that debt sold in public capital markets usually requires ratings, and decided to focus the company on that. At the same time, the reports showed the success of two new products: Collateralized Debt Obligations (CDOs) and syndicated bank loans. Currently Moody's operates in over 26 countries outside the US and, as of 2006, covered approximately 12,000 corporations and financial institutions. Although Moody's website provides few details about its rating technique, the basic methodology is similar to that of S&P. It uses a proprietary combination of financial data and other financial indicators that considers the market risk, the company risk and other factors that might influence the corporate credit rating. Despite using similar inputs, the ratings Moody's and S&P assign often differ, possibly due to the different weight each company assigns to each factor used in the rating process; this difference will be discussed further in this research.
  • 36. 25 2.3 Machine Learning Techniques Machine learning grew out of the quest for artificial intelligence. In the beginning of AI as an academic discipline, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods; probabilistic reasoning was also employed. However, an increasing emphasis on the logical, knowledge-based approach caused a rift between AI and machine learning. Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation. Work on knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval. 2.3.1 Deep Learning Deep Learning, also known as deep structured learning, hierarchical learning or deep machine learning, is a branch of machine learning based on a set of algorithms arranged in multiple levels of non-linear operations in order to learn complicated functions that can represent high-level abstractions, according to [Begio, 2009]. With the increased supply of data, an effect of the already discussed big data wave, it has become increasingly impossible to manually formalize all that information in a machine-usable format that can actually generate intelligent conclusions. For this reason, research and studies have been developed to create algorithms able to learn from the data, and deep architectures are the bleeding-edge algorithms for that task. Research in this area attempts to make better representations and create models that learn from those representations in a large and scalable fashion. This methodology first developed due to inspiration from advances in
  • 37. 26 neuroscience and the communication patterns in the nervous system. Deep learning is often erroneously used as a re-branding of neural networks, but in fact the deep architecture can be applied to several different methods. This study makes use of a specific model of neural network called the Multilayer Perceptron, topics which will be reviewed in the next subsections. As seen in [LeCun Y, 2015], deep learning methods are representation-learning methods with multiple levels, obtained by composing non-linear modules that each transform the representation at one level into a higher and more abstract level. After enough transformations, very complex functions can be learned. When applied to classification tasks, which is the case in this study, the higher layers of representation amplify aspects of the input that are important for the differentiation and would be suppressed in shallow machine learning methods. Conversely, they also suppress representations that are actually irrelevant for the differentiation and would appear to be more relevant in shallow methods. Deep learning is advancing quickly, as evinced by great results, often beating previous records in each application, and has shown to be especially efficient when applied to intricate structures in high-dimensional data, which is the case in this research. 2.3.2 Neural Networks In machine learning, Neural Networks (NN), often referred to as Artificial Neural Networks (ANN), encompass a family of models inspired by biological neural networks and represent a technology rooted in many disciplines: neuroscience, mathematics, statistics, physics, computer science, and engineering. According to [Haykin, 1999] the study of neural networks has been motivated by the capacity of the human brain to compute in a different way from the conventional digital
  • 38. 27 computer and deliver better results. The human brain is complex, non-linear, uses parallel computing and has the capacity to reorganize its structural constituents, known as neurons. Another feature of the human brain is its capacity to build its own rules based on past experiences, meaning that the human brain is plastic and allows the system to adapt to its surroundings. [Haykin, 1999] defines an ANN as a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: 1. Knowledge is acquired by the network from its environment through a learning process. 2. Inter-neuron connection strengths, known as synaptic weights, are used to store the acquired knowledge. The ability to learn and generalize, which is inherent to ANNs, makes them an extremely powerful tool for machine learning, since they can produce reasonable outputs for inputs not encountered during training (out-of-sample). The main qualities and benefits of ANNs are discussed as follows: Nonlinearity The connections between the neurons can be either nonlinear or linear, but the ability to be nonlinear when the underlying physical mechanism is also nonlinear makes this property important, especially because the nonlinearity is distributed through the network in each neuron activation. Input-Output Mapping This property states that the algorithm learns through a training algorithm, in which during each iteration one element is presented to the algorithm and the free parameters of the ANN are adjusted to minimize the
  • 39. 28 distance between the target value and the predicted value, given a statistical measure. That means that no prior assumptions are made on the model for the input data. Adaptivity ANNs have a built-in capability to adapt their synaptic weights to changes in the surrounding environment, making them a useful tool in adaptive pattern classification and adaptive tasks in general, but the principal time constants of the system must be long enough for the model to ignore spurious disturbances and short enough to respond to meaningful changes in the environment. Evidential Response When the task is classification, the ANN can be designed to return not just the selected class, but also the confidence in the decision made, which can improve the overall classification if this information is used to reject ambiguous patterns. Fault Tolerance Due to the distributed nature of information stored in the network, if any of the neurons is damaged, it will not affect the overall quality of the model and will not stop the system from working. Large Scale Implementability It has the potential to make use of tools for distributed computation, making it faster to process and more responsive. To better understand the function of an ANN one needs to understand the concept behind all its components. The neuron is an information-processing unit that is fundamental to the operation of a neural network. Figure 2.6 shows the model of a neuron. The neuron is composed of three basic elements; the first is the input signals, a set of synapses, each of them characterized by a weight of its own.
  • 40. 29 Figure 2.6: Neuron representation Specifically, an input x_j connected to a neuron k is multiplied by the synaptic weight w_{kj}. Another basic element is the adder, which is the summing element for the input signals weighted by the respective synapses, also known as the linear combiner. The last basic element is the activation function, which delimits the output signal; it is also known as a squashing function due to its property of squashing the amplitude range of the output signal to a finite value. In mathematical terms, the neuron can be described as follows: u_k = \sum_{j=1}^{m} w_{kj} x_j (2.1) and: y_k = \varphi(u_k + b_k) (2.2) where x_1, x_2, ..., x_m are the input signals, w_{k1}, w_{k2}, ..., w_{km} are the synaptic weights of neuron k, u_k is the linear combiner output, b_k is the bias, \varphi(\cdot) is the activation function, and y_k is the output signal of the neuron.
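As a concrete illustration of Equations 2.1 and 2.2, the short R sketch below computes the output of a single artificial neuron with a sigmoid activation; the input values, weights and bias are arbitrary example numbers, not taken from the thesis.

    # Toy example of one neuron k: u_k = sum_j w_kj * x_j, y_k = phi(u_k + b_k)
    x <- c(0.5, -1.2, 3.0)        # input signals x_1 ... x_m
    w <- c(0.8, 0.1, -0.4)        # synaptic weights w_k1 ... w_km
    b <- 0.2                      # bias b_k

    sigmoid <- function(v) 1 / (1 + exp(-v))   # activation function phi(.), Eq. 2.10 with alpha = 1

    u <- sum(w * x)               # linear combiner output u_k (Eq. 2.1)
    y <- sigmoid(u + b)           # neuron output y_k (Eq. 2.2)
    y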
  • 41. 30 The use of the bias b_k has the effect of applying an affine transformation to the output u_k of the linear combiner in the model, as seen in the following equations. Since the bias is an external parameter of the artificial neuron k, one can combine the previous equations as: v_k = u_k + b_k (2.3) which can equivalently be written as: v_k = \sum_{j=0}^{m} w_{kj} x_j (2.4) where a new input x_0 = +1 with weight w_{k0} = b_k is included, and: y_k = \varphi(v_k) (2.5) Still looking at the neurons, there are several types of activation functions, each of which defines the output of a neuron in terms of the induced local field v. The basic types are: Threshold Function This type of function can be expressed as: \varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ 0 & \text{if } v < 0 \end{cases} (2.6) or: y_k = \begin{cases} 1 & \text{if } v_k \ge 0 \\ 0 & \text{if } v_k < 0 \end{cases} (2.7)
  • 42. 31 where v_k is the induced local field of the neuron: v_k = \sum_{j=1}^{m} w_{kj} x_j + b_k (2.8) Piecewise-Linear Function The amplification is assumed to be uniform, so, approximating the non-linear function, the equation is as follows: \varphi(v) = \begin{cases} 1, & v \ge +\tfrac{1}{2} \\ v, & +\tfrac{1}{2} > v > -\tfrac{1}{2} \\ 0, & v \le -\tfrac{1}{2} \end{cases} (2.9) Sigmoid Function This is the most common form of activation function and it is defined as a strictly increasing function: \varphi(v) = \frac{1}{1 + \exp(-\alpha v)} (2.10) Antisymmetric Form All the previous activation functions have a range from 0 to 1; however, depending on the application it is convenient to have the activation function range from -1 to 1, and in those cases the threshold is defined as: \varphi(v) = \begin{cases} 1, & \text{if } v > 0 \\ 0, & \text{if } v = 0 \\ -1, & \text{if } v < 0 \end{cases} (2.11) This case is common when using the hyperbolic tangent function, defined by \varphi(v) = \tanh(v) (2.12)
  • 43. 32 Stochastic Model All the activation functions presented previously are deterministic, in the sense that their input-output behaviour is precisely defined for all inputs. The stochastic solution is especially interesting when applied to situations where the state of the neuron should always be either -1 or 1, which in this context means whether the neuron will forward its information or not. In the stochastic model of the neuron the activation takes the following form: x = \begin{cases} +1 & \text{with probability } P(v) \\ -1 & \text{with probability } 1 - P(v) \end{cases} (2.13) The neurons described can be arranged in an infinite number of configurations and can also be combined in various ways. This study will focus on exploring the multilayer perceptron, which is the method to be tested in the hypothesis. The multilayer perceptron (MLP) construction will be explored in more detail in the next subsection. 2.3.2.1 Multilayer Perceptron Deep architecture is not a new approach nor a new technique; this method dates back at least to [Fukushima, ], who developed a deep network based on an ANN in 1980. Since then, different approaches were taken to develop a viable deep neural network, mostly because of the computational cost implied by those algorithms. The breakthrough happened in the early 2000s, when Geoffrey Hinton published the paper [Hinton, 2007]. In this paper, the author uses a deep architecture in which each layer is pre-trained using an unsupervised restricted Boltzmann machine and then fine-tuned using supervised back-propagation, in a feedforward fashion. The main idea behind the paper is a combination of three ideas: train a model that generates sensory data rather than classifying it (unsupervised
  • 44. 33 learning), train one layer of representation at a time using a restricted Boltzmann machine, decomposing the overall learning task into multiple simpler tasks and eliminating the inference problems, and lastly, use a fine-tuning stage to improve the previous model. The term "deep learning" can cause confusion if not used correctly: this term encompasses any algorithm that has a series of layers instead of just one layer, the latter being called "shallow" algorithms. Therefore there is a great variety of deep architectures, and most of them are branched from some original parent architectures. The deep version of the single-layer perceptron is called the Multilayer Perceptron. The multilayer perceptron follows the same general concept as all ANNs: it is designed based on the human brain and consists of a number of artificial neurons, whose function was explained in the previous subsections. According to [Gurney, 1997], a perceptron comes from a set of preprocessing association units. The perceptron is assigned an arbitrary boolean functionality and is fixed, so it does not learn from the data. It performs a rough classification and sends the information to the next node, which performs the training algorithm for the ANN. In order for the ANN, or MLP, to perform a given classification it must have the desired decision surface. This is achieved by adjusting the weights and thresholds in a network, either in a shallow or a deep architecture. The adjustment of those weights and thresholds is made in an iterative fashion, where the network is presented with examples of the required task repeatedly, and at each presentation small changes are made to the weights and thresholds to bring them closer to the desired values. This study makes use of the MLP, which consists of a series of layers of neurons and perceptrons arranged in an input layer, hidden layers and an output layer, as seen in Figure 2.7. The input layer is the data that is fed to the algorithm, usually after some kind of preprocessing, such as PCA, FCA, etc.
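To make the MLP setup more concrete, the sketch below shows how such a network could be specified in R using the h2o package, whose deep learning implementation follows the multilayer perceptron design described here. The thesis does not spell out its exact configuration at this point, so the file name, column names, layer sizes and parameter values are placeholders, not the actual settings used in the study.

    # Illustrative only: fitting a multilayer perceptron classifier with h2o in R.
    library(h2o)
    h2o.init()

    train_hex <- h2o.importFile("train_data.csv")      # hypothetical training file
    train_hex$rating <- as.factor(train_hex$rating)    # categorical rating target

    mlp_model <- h2o.deeplearning(
      x = setdiff(colnames(train_hex), "rating"),  # input layer: financial features
      y = "rating",                                # output layer: rating class
      training_frame = train_hex,
      hidden = c(200, 200),                        # two hidden layers of 200 neurons
      activation = "Tanh",                         # activation function of the neurons
      epochs = 50,                                 # passes over the training set
      l1 = 1e-5, l2 = 1e-5                         # Lasso / Ridge penalties, discussed later in this subsection
    )

    h2o.confusionMatrix(mlp_model)                 # in-sample classification summary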
  • 45. 34 Figure 2.7: Multilayer Neural Network MLPs are usually feedforward neural networks, are often trained with an error-correction learning rule and can be viewed as a generalization of an adaptive filtering algorithm. According to [Haykin, 1999], error back-propagation learning consists of two passes through the MLP's layers: a forward pass and a backward pass. The forward pass occurs when the input vector is applied to the neurons, also called nodes, of the network and propagates through the network layer by layer. The weights in this process stay fixed, and in the end the network produces a set of outputs as its response. After the forward pass, the backward pass starts, and during this process the weights are adjusted in accordance with an error-correction rule. Usually the output is subtracted from the labelled target, producing an error signal, and this signal is propagated backwards through the network, against the direction of the synaptic connections. The error signal at the output of neuron j at iteration n, when j is an output node, is defined by the following equation: e_j(n) = d_j(n) - y_j(n) (2.14)
  • 46. 35 The error is calculated in terms of error energy and has to be computed over all the nodes in the output layer, expressed as the set C, for the nth sample: E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n) (2.15) In order to calculate the average squared error energy, the following formula is applied: E_{av} = \frac{1}{N} \sum_{n=1}^{N} E(n) (2.16) This measure, the error energy E_{av}, is a function of all the free parameters, such as the synaptic weights and bias levels of the network. The error energy can be seen as the loss function or cost function of the ANN and serves as a performance measurement for the learning process. The objective of the learning algorithm is to minimize E_{av} by updating the weights on a pattern-by-pattern basis until one complete presentation of the entire training set has gone through the neural network; this is also called an epoch. This relationship is better explained by looking at one neuron: for this example consider a neuron j, fed by a set of function signals produced by a previous layer; the induced local field v_j(n) produced at the input of the activation function associated with neuron j is: v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n) (2.17) where m is the total number of inputs, with the bias included, and the output y_j(n) can be represented by: y_j(n) = \varphi_j(v_j(n)) (2.18)
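A small numerical illustration of Equations 2.14-2.16 in R, computing the error signals, the instantaneous error energy and its average over a toy batch (the targets and outputs below are arbitrary example values, not thesis data):

    # Rows = samples n, columns = output nodes j in the set C
    d <- matrix(c(1, 0, 0,
                  0, 1, 0), nrow = 2, byrow = TRUE)        # labelled targets d_j(n)
    y <- matrix(c(0.8, 0.1, 0.1,
                  0.3, 0.6, 0.1), nrow = 2, byrow = TRUE)  # network outputs y_j(n)

    e    <- d - y                  # error signals e_j(n)          (Eq. 2.14)
    E_n  <- 0.5 * rowSums(e^2)     # error energy E(n) per sample  (Eq. 2.15)
    E_av <- mean(E_n)              # average error energy E_av     (Eq. 2.16)
    E_av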
  • 47. 36 The improvement brought by the use of error back-propagation is mainly explained by the way it applies corrections to the synaptic weights w_{ji}(n) in a backward fashion, proportionally to the partial derivative \partial E(n)/\partial w_{ji}(n). When applied to the whole network, and according to the chain rule, the gradient is expressed by the following equation: \frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)} (2.19) Applying the previous equations to this expression results in the following, which is understood as the opposite sign of the local gradient \delta_j(n): \frac{\partial E(n)}{\partial w_{ji}(n)} = -e_j(n)\,\varphi'_j(v_j(n))\,y_i(n) (2.20) Then, finally, the correction applied is: \Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)} (2.21) In the training process, as stated before, there are two phases: the forward pass and the backward pass. During the forward pass the synaptic weights are fixed throughout the network and the function signals are computed on a neuron-by-neuron basis. The forward process is mathematically expressed as follows: y_j(n) = \varphi(v_j(n)) (2.22) where v_j(n) is defined at neuron j as v_j(n) = \sum_{i=0}^{m} w_{ji}(n) y_i(n) (2.23)
  • 48. 37 If neuron j is in the output layer of the network, m = m_L, and the output can be written as y_j(n) = o_j(n) (2.24) where o_j(n) is the jth element of the output vector; this output is then compared with the labelled response d_j(n) in order to get the error signal e_j(n). The forward pass starts at the input layer, proceeds to the output layer and finishes with the calculation of the error signals. After the forward pass comes the backward pass, which starts at the output layer and goes back to the first hidden layer. The process consists of recursively calculating the \delta, also known as the local gradient, for each neuron and adjusting and updating the synaptic weights according to the delta rule, explained in the previous equations. For the adjustment task there are various types of loss functions that can be applied in ANN or MLP models. The loss function is represented by L(W, B|j), where B is the set of biases and W is the set of synaptic weights for a given jth training sample, t^{(j)} are the predicted values, o^{(j)} are the labelled output values, and y indexes the output layer of the network: Mean Square Error The mean square error (MSE) is typically used in regression tasks, since it does not consider different categories. The formula is given by: L(W, B|j) = \frac{1}{2} \| t^{(j)} - o^{(j)} \|^2 (2.25) Huber The Huber method is also usually used in regression tasks and is defined here as: L(W, B|j) = \| t^{(j)} - o^{(j)} \| (2.26)
  • 49. 38 Cross Entropy The cross entropy is often used for classification tasks and can be defined as: L(W, B|j) = -\sum_{y \in O} \left[ \ln(o_y^{(j)}) \cdot t_y^{(j)} + \ln(1 - o_y^{(j)}) \cdot (1 - t_y^{(j)}) \right] (2.27) Since this study uses the cross entropy method, the literature review will focus on it. According to [de Boer et al., 2005], the cross entropy (CE) was motivated by an adaptive method for estimating probabilities in complex stochastic networks, which mainly involves variance minimization. In addition, [Suresh et al., 2008] states that the cross entropy minimizes the misclassification between all classes in an all-in-one approach. It is common for ANN or MLP algorithms to exhibit overfitting, and in order to avoid this problem this study makes use of different regularization techniques, such as Lasso, also known as l1; Ridge, expressed by l2; and dropout, which is a newer technique applied in deep architectures. The first two techniques work with either shallow or deep architectures, and the last technique, dropout, will be explored in the next subsection. The loss function modified after the application of the regularization techniques is represented in Equation 2.28. L'(W, B|j) = L(W, B|j) + \lambda_1 R_1(W, B|j) + \lambda_2 R_2(W, B|j) (2.28) In Equation 2.28, R_1(W, B|j) is the sum of all l1 norms of the weights and biases in the network, and R_2(W, B|j) is the sum of squares of all the weights and biases in the network. The constants \lambda_1 and \lambda_2 are usually very small (on the order of 10^{-5}). LASSO stands for least absolute shrinkage and selection operator, and according to [Tibshirani, 2011] it is a regression with an l1-norm penalty, which is given by the formula presented in Equation 2.29.
  • 50. 39 \sum_{i=1}^{N} \left( y_i - \sum_{j} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| (2.29) In Equation 2.29, x_{ij} are the predictors and y_i are the centered response values, for i = 1, 2, ..., N and j = 1, 2, ..., p. This function is solved by finding \beta = \{\beta_j\}, which is the same as minimizing the sum of squares subject to a constraint of the form \sum |\beta_j| \le s. Seen from that angle it resembles the Ridge method, which is a regression with a constraint of the form \sum \beta_j^2 \le t. The main difference between those methods is that LASSO performs both variable selection and shrinkage, whereas Ridge regression performs only shrinkage. When considering both of them in a general form, the penalty is shown in Equation 2.30. \left( \sum_{j=1}^{p} |\beta_j|^q \right)^{1/q} (2.30) In this general form, LASSO uses q = 1 and Ridge regression uses q = 2. With all the previous concepts presented, the training algorithm for an MLP becomes clearer and can be given as the following procedure: 1. Initialize W, B 2. Iterate until the convergence criterion is reached: (a) Get training example j (b) Update all weights w \in W and biases b \in B: w := w - \eta \, \partial L'(W, B|j)/\partial w
  • 51. 40 Figure 2.8: Basics schematics of a Random Forest b := b - \eta \, \partial L'(W, B|j)/\partial b 2.3.3 Decision Tree 2.3.3.1 Random Forest Random Forest is a specific branch of the general technique of random decision forests, an ensemble method for classification and regression that operates by constructing multiple decision trees in the learning process and outputting the predicted class or the mean prediction, depending on the method applied, over the individual trees. The general decision tree method is defined by [Rokach, 2016] as a predictive model expressed as a recursive partition of the covariate space into subspaces that constitute a basis for prediction. The same author defines the random forest as a combination of decision trees in which the individual predictions are combined into a final prediction. The basic schematic of a random forest construction is shown in Figure 2.8. As seen in [Verikas et al., 2011], decision trees are sensitive to small
  • 52. 41 perturbations in the learning dataset. To mitigate this problem, one can build a random forest using an ensemble technique called bagging. To better understand the Random Forest (RF) model, let X = \{(X_1, Y_1), ..., (X_n, Y_n)\} be the learning set, made of i.i.d. observations of a vector (X, Y), where the vector X = (X^1, ..., X^p) contains the predictors, also called explanatory variables, and Y describes the labelled classes or the numerical response, with X \in \mathbb{R}^p and Y \in \psi, where \psi is the set of categorical labels or numerical responses. For classification problems, which is the case in this study, a classifier is a mapping t : \mathbb{R}^p \to \psi. Lastly, it is assumed that Y = s(X) + \varepsilon, with the expectation of \varepsilon equal to zero and its variance equal to 1. The statistical framework is as follows: 1. Each tree is created on a bootstrap sample of the training dataset. 2. At each node, n variables are randomly selected out of X. 3. n is usually defined as n = \log_2(N) + 1, N being the sample size. The RF algorithm has a by-product: a measurement of variable importance that is implemented alongside it. One such measure is the Gini index, which is usually used for classification tasks. Given a node t and estimated class probabilities p(k|t), k = 1, ..., Q, the Gini index is defined as G(t) = 1 - \sum_{k=1}^{Q} p^2(k|t) where Q is the number of different classes. The whole process consists of calculating the Gini index for every variable X^n used to make a split, and the Gini variable importance is given by the average decrease in the Gini index over the forest at the nodes t where the variable X^n is used. Another measurement of variable importance is the accuracy-based estimator. This method computes the mean decrease in classification accuracy on the out-of-sample (OOS) data. Let the bootstrap samples be b = 1, ..., B, and the
  • 53. 42 importance measurement \bar{D}_j for a given predictor X_j. The importance is calculated as follows: 1. Set b = 1 and define the OOS data points. 2. Classify the OOS data points using the tree grown on bootstrap sample b and count the correct classifications. 3. For each variable X_j, where j = 1, ..., N: (a) permute the values of X_j in the OOS data points; (b) use the same tree to classify the permuted values and count the correct classifications. 4. Repeat the first three steps for b = 2, ..., B. 5. Compute the standard deviation \sigma_j of the decrease in correct classifications and a z-score: z_j = \frac{\bar{D}_j}{\sigma_j / \sqrt{B}} and then convert z_j to a significance value, assuming a Gaussian distribution.
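A minimal R sketch of a random forest classifier with both importance measures described above, using the randomForest package on a built-in dataset as a stand-in for the rating data; this is only an illustration and does not reproduce the thesis's own implementation or parameters.

    # Illustration only: random forest classification with variable importance.
    library(randomForest)

    data(iris)                                 # stand-in for the financial feature set
    set.seed(42)

    rf_model <- randomForest(
      Species ~ .,                             # categorical response, as with rating classes
      data = iris,
      ntree = 500,                             # number of bootstrapped trees
      mtry = floor(log2(ncol(iris) - 1)) + 1,  # variables sampled at each split
      importance = TRUE                        # compute permutation and Gini importance
    )

    importance(rf_model)                       # MeanDecreaseAccuracy and MeanDecreaseGini
    varImpPlot(rf_model)                       # visual comparison of the two measures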
  • 54. 43 Chapter 3 Data and Computational Procedures This chapter covers the data gathering, data cleaning and other procedures related to building the architecture used to evaluate the models proposed in this study. As explained before, the data evaluated in this study is from 160 companies that are part of the current S&P500 (TICKER: GSPC) from the following sectors: health care, financial services and technology. These sectors were chosen due to their distinct characteristics. The financial services sector is intrinsically more sensitive to changes in economic indicators and other variables such as oil prices, financial markets (both local and global), the federal funds rate, interest rates and others. The health care industry is related to another set of variables, most of them not directly related to the variables which influence financial services. On the other hand, the technology sector shares some variables with the financial market, but has less correlation with the health care sector. By using a combination of all three sectors, the data reflects changes in the market based on a broad variety of factors. This set of companies was chosen in order to make the data as unbiased as possible, and therefore closer to reality. As shown in previous chapters, these sectors have significant weight in the S&P500 index. In the past, much research was done in this area trying to model and predict corporate credit ratings. As seen in [Lee, 2007], most of it uses techniques to narrow the problem: in Lee's study it was shown that corporate credit ratings can be narrowed down to five categories, from AAA to C, using data from 1997 to 2002. [Darrell Duffie, 2003] explores different approaches to using coarser classifications: he shrinks the number of possible categories in order to
improve the results of the classification. In this thesis none of those techniques were used to shrink the categories available to the machine learning models; it was decided to use as much data as possible, which means that for the 160 companies previously cited, all data available in their financial statements since 1990 was gathered and used in the procedures. It is important to restate that this study did not use financial ratios, and instead used the full value of each variable as input, normalized using techniques discussed going forward. By not expressing the data as ratios, it becomes even more important to have a heterogeneous dataset for each sector. The numbers presented for each sector and for each variable are of different natures, and the goal of this study is to develop a model able to identify and generalize that information in order to correctly predict a corporate credit rating.

3.1 Data Source and Preprocessing

This section shows the data source, data gathering procedures, and preprocessing techniques necessary for the accomplishments made in this study. The data was gathered using the R programming language to access the API of the Bloomberg Terminal. All the data gathering was done in the Hanlon Financial Systems Lab, which has several Bloomberg terminals available for student use. The data gathering and preprocessing is as follows:

1. Gather financial data:
   (a) Define the companies
   (b) Define the period
   (c) Define the frequency
   (d) Define the variables
   (e) Preprocess the financial data
2. Gather ratings information
3. Preprocess the ratings data
4. Combine the preprocessed data

For gathering the financial data, the first step, the definition of the companies to be evaluated, was explained in the previous section. The second step, the definition of the period of analysis, was done based on the availability of the data. As this study aims to deal with a big data problem, it would ideally gather as much data as available. However, due to restrictions in the database and the existence of the companies (with the same name and ticker) over time, the time period was restricted to maximize the quality of the data. For this reason, the period was defined from the year 1990 until 2014. This incorporates 24 years of financial data, which is enough to be considered a big data problem and enough to perform this study. For the definition of the frequency of the data gathered, it was observed that corporate credit ratings do not change often, but a wrongly classified corporate credit rating with a time lag of more than one quarter is too risky. In order to make the model responsive to changes in ratings, it was decided to collect data on a quarterly basis to coincide with the public financial reports that the companies are obliged to release. In order to have the maximum amount of information available for the machine learning algorithms, a robust and complete data gathering process was deployed and more than 900 features were extracted for each company for each quarter when the data was available. When not available, the data was filled with NA values.
The data was then saved in a serialized format on the disk. To better analyze and verify the data, the serialized file was transformed into a comma-delimited format and edited using spreadsheet software. The preprocessing of the gathered financial data focused on eliminating the features with the most NA data points, and then verifying the existence of NA data points row-wise. The columns and rows with more than 20% of their values as NA were deleted. After this verification, the resulting data set contained 230 different features for each quarter for each company. The basic formula for gathering the financial data is as follows:

BDH(ticker, apiform, startdate, enddate, overrides)

where the ticker represents the company to be searched for, the apiform represents the variable (field) to be searched, the start and end dates stand for the period in which one desires to perform the analysis, and the overrides are the specifications that have to be passed as arguments to the API in order to get the correct results. For this study the override was "BEST_FPERIOD_OVERRIDE", with value "QUARTERLY". [Bloomberg Finance L.P., 2014]

The next step in preparing the experiment consists of gathering the ratings. The ratings were gathered individually from the Bloomberg database and stored for the purpose of this study. They were collected individually and preprocessed in spreadsheet software, in which they were assembled into a database-like, machine-readable file. After gathering the ratings, it was possible to merge those datasets together using the R programming language in order to tie each rating to its respective financial data.
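As an illustration of the gathering and NA-filtering steps described above, the sketch below uses the Rblpapi package as the Bloomberg wrapper; the text does not name the specific R package used, and the tickers and field mnemonics shown are hypothetical placeholders.

```r
# Sketch of quarterly data gathering and NA filtering (illustrative only).
library(Rblpapi)
blpConnect()  # requires a running Bloomberg Terminal session

tickers <- c("JPM US Equity", "PFE US Equity", "MSFT US Equity")   # placeholder tickers
fields  <- c("TOTAL_ASSETS", "NET_INCOME", "TOTAL_DEBT")           # placeholder field mnemonics

raw <- lapply(tickers, function(tk) {
  bdh(tk, fields,
      start.date = as.Date("1990-01-01"),
      end.date   = as.Date("2014-12-31"),
      overrides  = c(BEST_FPERIOD_OVERRIDE = "QUARTERLY"))
})
names(raw) <- tickers

# NA filtering as described in the text: drop columns, then rows, with more than 20% NA.
filter_na <- function(df, threshold = 0.20) {
  col_keep <- colMeans(is.na(df)) <= threshold
  df <- df[, col_keep, drop = FALSE]
  row_keep <- rowMeans(is.na(df)) <= threshold
  df[row_keep, , drop = FALSE]
}
clean <- lapply(raw, filter_na)
```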
In order to make later analysis possible, columns were added to the dataset, such as "classification", which indicates whether the related rating is investment grade (IG) or non-investment grade (N-IG); the respective rating without its modifiers, + and - for Standard & Poors and 1, 2, 3 for Moody's; and the related movements for each of those added columns, indicating the status of the related rating in comparison with the previously given rating. This combined and preprocessed dataset is then saved in serialized form for later use as input to the machine learning algorithms. Another important point in the preprocessing is that all the movements were calculated in this part of the study, but only used later on for the results analysis: they were not fed to the machine learning algorithms. If a movement was greater than 5 positions, either up or down, it was considered an outlier and excluded from further investigation.

3.2 Data structure and characteristics

This section shows the particularities and characteristics of the data used in this study. For certain aspects of the data it is possible to make an analysis without specifying the credit rating agency, because the information shown is related to the companies and not to their corporate credit ratings. Since the credit rating agencies have access to proprietary information at any time, meaning that they are not restricted to the quarterly public financial statements, it was assumed that a rating issued at any point between the first and the last day of a quarter belongs to that quarter and is related to the previous four quarters of financial data of the same company, with one quarter of offset, in order to guarantee that no future data is used. Even though this study uses a total of 160 companies, not all of them have data available during the entire period needed for the research. It was decided not to update the constituents of the S&P500 index, since the focus of this study is not on
the companies themselves, but on modelling the credit rating agencies during the determined time frame.

Figure 3.1: Number of companies in the study over the years.

Figure 3.1 displays the number of companies in this study over the years. Figure 3.2 shows the distribution of entries for each sector. Although the number of companies is roughly balanced across sectors, the number of rated entries in the Financial Services sector is greater than in the others. Because of the nature of their business, Financial Services companies need to be rated and are more often involved in financial transactions in which those ratings are required either by regulation or by the parties involved. This section will also discuss the specifics of each credit rating agency. As stated before, this study focuses on the two major players in the market, Standard & Poors and Moody's. Their data are analyzed separately due to their individualities and the differences in their methods of evaluating ratings. As observed in this research, the two agencies do not always agree on a corporate credit rating. Furthermore, past events such as the 2009 financial crisis have shown that the ratings of these agencies are not always in accordance with the market. The goal
of this study is not to analyze the accuracy of their ratings; it is instead to create a model that is more accessible and faster, with the same reliability as their methods.

Figure 3.2: Distribution of datapoints in the sectors.

For the Moody's data analysis, the data set used in this study has the distribution shown in Figure 3.3.

Figure 3.3: Ratings distribution for Moody's.

When a rating shows a value of zero, it means that the corresponding credit rating did not appear in the source data during the period of analysis.
Using the process explained above, it is evident that the concentration of ratings is in the area around the rating A3, which is still considered investment grade but not high grade. Figure 3.3 uses data from the whole time frame and ignores the periods in which a company is in default or is listed as NR (non-rated). When analyzed at the investment-grade or non-investment-grade level, the Moody's distribution is shown in Figure 3.4.

Figure 3.4: Investment Level distribution for Moody's.

Figure 3.4 is consistent with Figure 3.3, since most of the ratings in the first graph are at investment-grade level. All the previous graphs have shown overall characteristics of the data, either for the general data or for the Moody's specifics. One important aspect, if not the most important, is to evaluate the behavior of the proposed model at points where a company's corporate credit rating changes. Figure 3.5 shows the quantity of changes over the years analyzed in this study, which is related to the left axis of the graph, while the line represents the average change step for the period. The change step is the size of the movement for a rating. For example, if a rating is Ba2 and in the next quarter it changes to B3, the change step has a value of 4, even though it was a downgrading movement.
Figure 3.5: Changes over the years for Moody's.

The average change step is the sum of all change steps divided by the number of changes in the period. The result of those calculations can be seen in Figure 3.5, where the bars represent the quantity of changes in each year, related to the left axis of the graph, and the line represents the average change step for the period, related to the right axis of the graph. One interesting observation in this graph is the spike in both the line and the bars in the year 2009, the year after the most recent US financial crisis. The first spike, observed in 1993, is due to the limited and small sample of rating changes in the dataset, which does not represent the whole market nor the whole S&P500 index. Another interesting analysis concerns the changes per credit rating. As shown in Figure 3.6, there are ratings, such as Aaa, Ca, and C, that do not present any changes in the period studied, which includes 25 years of data from those sectors. Another observation about this graph is the spike for the rating Caa3. There is no clear explanation for it; the only plausible explanation is that those ratings have a high chance of default or misclassification by the credit rating agencies.
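For illustration, the change step and the yearly average change step described above could be computed as in the sketch below, assuming a data frame ratings_df with hypothetical columns ticker, year, quarter, and scale (the rating converted to an ordinal numeric scale, one unit per notch).

```r
# Sketch of the change-step calculation (illustrative placeholder names).
ratings_df <- ratings_df[order(ratings_df$ticker, ratings_df$year, ratings_df$quarter), ]

# Change step: absolute distance to the same company's previous rating
ratings_df$change_step <- ave(ratings_df$scale, ratings_df$ticker,
                              FUN = function(x) c(NA, abs(diff(x))))

# Keep only the quarters in which a change actually occurred
changes <- subset(ratings_df, !is.na(change_step) & change_step > 0)

# Average change step per year: sum of change steps divided by number of changes
avg_change_step <- tapply(changes$change_step, changes$year, mean)
changes_per_year <- table(changes$year)
```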
Figure 3.6: Changes per rating for Moody's.

The following section displays the same analysis as above, but for the Standard & Poor's credit rating agency. The first graph, Figure 3.7, shows the distribution of the corporate credit ratings of the same companies and in the same period as those analyzed in the Moody's case. It is evident when comparing the ratings from both credit rating agencies that they are similar. Further on in this study, the general divergences between the agencies will be shown. Another important aspect of the Standard & Poors ratings distribution is the ratio of companies classified as investment grade (IG) to companies that are non-investment grade. That relationship is displayed in Figure 3.8, and the graph is consistent with the ratings distribution in Figure 3.7. Again, one of the most important aspects of the data is the analysis of the periods of change, and for Standard & Poors those periods are characterized in the following graphs.
Figure 3.7: Ratings distribution for Standard & Poors.

Figure 3.8: Investment Level distribution for Standard & Poors.
Figure 3.9: Changes over the years for Standard & Poors.

Figure 3.10: Changes per rating for Standard & Poors.
Figure 3.9 shows the average change step and its evolution over the years, and Figure 3.10 shows the quantity of changes for each corporate credit rating as bars and the average change step as a line. As can be observed, there are spikes in the year 2009, which is understandable as a consequence of the crisis the US faced in that period. During the conception of this study, one question raised was about the agreement between the corporate credit ratings given by the two credit rating agencies analyzed in this study. In order to proceed with the analysis, the ratings were matched by evaluated company (ticker) and by period (year and quarter) for both credit agencies, and both ratings were transformed using the scale displayed in Table 3.1. After transforming the ratings into numbers, the absolute difference between the two agencies' ratings was calculated, making it possible to compute the average distance between the ratings given. Another measurement is the notch difference between the ratings given by the two agencies; the notch concept is explained in the next section. From this transformation of the ratings into scaled numbers, it was possible to construct the following figures. Figure 3.11 visualizes the difference that exists between the credit rating agencies. The line "% exact match" represents the percentage of ratings that match exactly in the given year, "% notch match" represents the percentage of matches where the distance between the ratings is equal to or less than 1, and lastly, the bar labelled "ave diff" represents the average distance between the agencies' ratings. Figure 3.12 uses the same measurements, but instead of aggregating by year, the analysis aggregates by corporate credit rating, using the Moody's scale as the basis for the analysis.
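As an illustration, the agreement statistics behind Figures 3.11 and 3.12 could be computed as in the sketch below, assuming a merged data frame matched with hypothetical columns sp_scale and moodys_scale (the numeric values of Table 3.1) and a year column.

```r
# Sketch of the agency-agreement statistics (illustrative placeholder names).
matched$abs_diff <- abs(matched$sp_scale - matched$moodys_scale)

agreement_by_year <- do.call(rbind, lapply(split(matched, matched$year), function(d) {
  data.frame(year            = d$year[1],
             pct_exact_match = mean(d$abs_diff == 0),   # "% exact match"
             pct_notch_match = mean(d$abs_diff <= 1),   # "% notch match"
             avg_diff        = mean(d$abs_diff))        # "ave diff"
}))
```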
Figure 3.11: Comparison of the credit rating agencies over the years.

Figure 3.12: Comparison of the credit rating agencies by the rates.
Standard & Poors   Moody's   Scale
AAA                Aaa       21
AA+                Aa1       20
AA                 Aa2       19
AA-                Aa3       18
A+                 A1        17
A                  A2        16
A-                 A3        15
BBB+               Baa1      14
BBB                Baa2      13
BBB-               Baa3      12
BB+                Ba1       11
BB                 Ba2       10
BB-                Ba3        9
B+                 B1         8
B                  B2         7
B-                 B3         6
CCC+               Caa1       5
CCC                Caa2       4
CCC-               Caa3       3
CC                 Ca         2
C                  C          1

Table 3.1: Table of scale values for the corporate credit ratings

3.3 Framework architecture

This section explores the construction of the framework used to perform this study and obtain the respective results. The machine learning methods applied were explained in detail in previous sections of this dissertation, and all the frameworks were built in the R programming language. However, one of the premises of this work is that it should be portable to another platform more suitable for a production environment, such as Python or JavaScript. Bearing that in mind, the code was developed in R using functions and separate environments, which are natural constructs in the R programming language. The preprocessing already relates each available credit rating of each company to its respective financial information in a machine-readable format.
Given that preprocessing step, the first step of the framework is a function to load and slice the dataset for each iteration with the correct arguments. The experiment consists of evaluating the model quarterly, training on the data available up to a given quarter and testing on the subsequent one. By doing that, it ensures that no data from the future is used in either the training or the validation data, enforcing the out-of-sample technique. This first function is responsible for identifying the training time frame and the test time frame for each iteration, slicing the complete dataset according to those time frames, and feeding the resulting data frames to further functions responsible for performing the machine learning techniques. At this point there are two data frames: the training data and the test data. The next step in the framework is the division of the training data into training and validation data. All the machine learning techniques were fed both datasets in order to perform the learning process on the training data and check the performance of that learning process on the validation data. The ratio used to split those data frames was 80% for the training dataset and 20% for the validation data. For the test data, no modifications were made, in order to maintain the relation with reality. To better illustrate this slicing process (also sketched in the code below), take for example the iteration that performs the test for the second quarter of 2005. In this scenario, the function slices all data available up to and including the first quarter of 2005 and registers it as the training dataset, and then takes the data from the second quarter of 2005 and saves it as the test set. Furthermore, the training set previously mentioned is again sliced into two different data frames: the training data, which contains approximately 80% of the available data for training, and the validation data, which contains approximately the other 20% of the available data.
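A minimal sketch of such a slicing function is shown below; the data frame full_data and its year and quarter columns are hypothetical placeholders, and the random 80/20 split is one possible way to implement the division described above.

```r
# Sketch of the walk-forward slicing step (illustrative, not the original code).
slice_iteration <- function(full_data, test_year, test_quarter,
                            train_fraction = 0.80, seed = 42) {
  period      <- full_data$year * 4 + full_data$quarter
  test_period <- test_year * 4 + test_quarter

  test_set  <- full_data[period == test_period, ]
  train_all <- full_data[period <  test_period, ]   # strictly before the test quarter

  # 80/20 split of the historical data into training and validation sets
  set.seed(seed)
  idx <- sample(seq_len(nrow(train_all)), size = floor(train_fraction * nrow(train_all)))
  list(train      = train_all[idx, ],
       validation = train_all[-idx, ],
       test       = test_set)
}

# Example: train on everything up to and including 2005 Q1, test on 2005 Q2
splits <- slice_iteration(full_data, test_year = 2005, test_quarter = 2)
```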
The second step, after the data slicing and preparation for each iteration, is to perform a feature selection using the Gini index, which is implemented together with the Random Forest. From all the available features, it selects the best features, which represent at least 80% of the responses; this was observed to be usually between 40% and 60% of the available features. After the features are selected, the next step in the framework is the training of two different models: one aiming to predict the movement, that is, whether a given rating is going up, down or not changing, and a second one aiming to predict the probability of a credit rating changing or keeping its current value. Those models are trained using the Random Forest method, and the models are saved to disk for later use. The next step in the framework is to use the two models previously trained to predict those values on the training data and the validation data, always making sure the training, validation, and test data remain separated. The predicted values are then combined with the previously selected features and fed to the other machine learning techniques. This part of the architecture is the same for all variations of the code, meaning that the data preparation and feature selection are the same for all following parts. During the conception of this study, a brainstorming process was carried out in order to better understand the data and to conceptualize a solution for the proposed problem. From the previous studies reviewed and discussed, it became apparent that the data could present the challenge of overfitting, since ratings do not change often over the years and there is a large number of available features. The problem of overfitting was addressed with the use of feature selection, which shrinks the number of features fed into the machine learning algorithms, and also with the use of regularization techniques applied to the loss function of the machine learning algorithms.
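The sketch below illustrates the feature selection step, under the assumption that "representing at least 80% of the responses" refers to the cumulative share of the mean decrease in Gini importance; the data frame train and its rating column are hypothetical placeholders.

```r
# Sketch of Gini-based feature selection (illustrative interpretation).
library(randomForest)

select_features <- function(train, coverage = 0.80, ntree = 500) {
  train$rating <- as.factor(train$rating)
  rf  <- randomForest(rating ~ ., data = train, ntree = ntree)
  imp <- importance(rf, type = 2)[, 1]            # mean decrease in Gini per feature
  imp <- sort(imp, decreasing = TRUE)
  cum_share <- cumsum(imp) / sum(imp)             # cumulative importance share
  names(imp)[seq_len(which(cum_share >= coverage)[1])]
}

selected <- select_features(splits$train)
```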
The latter was mainly applied to the Multilayer Perceptron technique, since it presented characteristics of overfitting in the earlier tests, with its error rate concentrated on the corporate credit ratings that changed. The problem of overfitting and the high error rate in the changing periods for the MLP was solved, or at least improved, after the use of regularization techniques such as Lasso and Ridge. After the data is preprocessed and prepared and the features are selected, the architecture runs the machine learning algorithms. At this stage of the code, there are two datasets presented to the algorithms: the training data and the validation data, divided as explained before. As said previously, for this part of the study a series of different constructions were tested. The first used both MLP and Random Forest, training a first guess and then refining it using a different model for each notch with three possible outputs: the closest upper rating, the closest lower rating, and the rating itself, using the ratios of each quarter. Other constructions used different configurations of Random Forest and MLP, using them to predict the data frame and, in a later step, using another MLP, always in a supervised fashion, to learn from the data which predictor was best at each moment. The main goal of this study is not to evaluate the differences between those architectures but instead to evaluate the existence of a model that can predict with satisfactory accuracy. For the scope of this study, those different architectures were tested and the one with the best results was selected. The method that presented the best results was the architecture in which, after the first steps, an MLP algorithm with backpropagation and a Random Forest algorithm are performed separately, both aiming to learn the same variable, the corporate credit rating as a whole, in which each rating together with its sign is treated as a different class and presented as it is to the algorithms. The basic architecture used to implement the algorithms is displayed in Figure 3.13.
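To make this final step concrete, the sketch below trains the two classifiers on the selected features to predict the full rating, each rating with its sign treated as a distinct class. The randomForest package is used for the Random Forest; the nnet package is used here as a stand-in single-hidden-layer MLP with weight decay (a ridge-style penalty), since the text does not name the R implementation or the exact regularization setup, and all hyperparameter values are illustrative. The train, validation, and selected objects are assumed to come from the slicing and feature-selection sketches above.

```r
# Sketch of the two final classifiers (illustrative, not the original architecture code).
library(randomForest)
library(nnet)

train_sel <- splits$train[, c(selected, "rating")]
train_sel$rating <- as.factor(train_sel$rating)

rf_final  <- randomForest(rating ~ ., data = train_sel, ntree = 500)

mlp_final <- nnet(rating ~ ., data = train_sel,
                  size    = 20,      # hidden units (illustrative value)
                  decay   = 0.01,    # weight decay, a ridge-style regularization term
                  maxit   = 500,
                  MaxNWts = 10000)   # allow enough weights for many features and classes

# Validation-set predictions used to check the learning process
rf_pred  <- predict(rf_final,  newdata = splits$validation)
mlp_pred <- predict(mlp_final, newdata = splits$validation, type = "class")
```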
The preceding process describes each iteration of training (the learning phase) and testing. The following explanation concerns the functions and procedures in which the performance indicators are defined and calculated for each iteration. As explained before, the performance indicators for this study are the accuracy presented by both models, the MLP and the Random Forest, for the predicted values of the full ratings, the crude ratings, and the notch ratings in the quarter defined as the test quarter, and the accuracy for the full ratings and notch ratings only for the ratings that changed from the previous period. For a better understanding of those performance indicators, let o be the output values, t the target values, and N the length of the test dataset. So, $o = \{o_1, o_2, ..., o_N\}$ and $t = \{t_1, t_2, ..., t_N\}$, where each $o_i$ corresponds to its $t_i$. Bearing that in mind, the first performance measurement to be explored is the accuracy of the full rating predicted values for the overall test dataset. The first step is to remove the data considered outliers, meaning those with a movement greater than 5. After that, the procedure is to count how many $o_i$ have the same value as $t_i$:

$$\Gamma_{\text{full rating}} = \frac{\sum_{i=1}^{N} \theta(i)}{N} \qquad (3.1)$$

where:

$$\theta(i) = \begin{cases} 1 & \text{if } o_i = t_i \\ 0 & \text{if } o_i \neq t_i \end{cases} \qquad (3.2)$$

The next performance measurement to be calculated is the accuracy for the crude rating, which measures how many times $o_i$ is equal to $t_i$ after the ratings have been transformed. In this transformation the function creates an $o'_i$ and
$t'_i$, which are the output and target full ratings without their respective signs, and then proceeds to compare their values:

$$\Gamma_{\text{crude rating}} = \frac{\sum_{i=1}^{N} \theta(i)}{N} \qquad (3.3)$$

where:

$$\theta(i) = \begin{cases} 1 & \text{if } o'_i = t'_i \\ 0 & \text{if } o'_i \neq t'_i \end{cases} \qquad (3.4)$$

Following the crude rating is the notch analysis. For this performance measurement it is necessary to transform the categorical rating, such as Aa1, Baa3, AA+ or CCC, into a corresponding number on a scale. This procedure is done for $t_i$ and $o_i$, and the results are called $t''_i$ and $o''_i$, respectively. After the transformation, each value of $o''_i$ is compared with its corresponding $t''_i$, and if the absolute distance between them on the scale is less than or equal to 1, the output is considered correct. The mathematical definition is as follows:

$$\Gamma_{\text{notch rating}} = \frac{\sum_{i=1}^{N} \theta(i)}{N} \qquad (3.5)$$

where:

$$\theta(i) = \begin{cases} 1 & \text{if } |o''_i - t''_i| \leq 1 \\ 0 & \text{if } |o''_i - t''_i| \geq 2 \end{cases} \qquad (3.6)$$

Those were the performance measurements used to evaluate the complete dataset. However, since this study intends to evaluate the model completely, and it is known that the periods when a change in the corporate credit rating is observed are the periods when the model is truly put to the test, it was decided to implement
extra performance measurements, which evaluate the indicators only for the ratings $t_i$ that changed. In order to perform this evaluation, the test dataset is filtered to contain only the predictions and responses for the companies whose corporate credit rating changed from the last period. The setup is still the same, except that instead of N being the length of the test dataset, the length of the filtered test dataset is now represented as $N'$. For the specific analysis of the changes, it was decided to evaluate and keep track of only two performance measurements: full rating accuracy and notch accuracy. They follow the same logic as the previous ones:

$$\Gamma'_{\text{full rating}} = \frac{\sum_{j=1}^{N'} \theta(j)}{N'} \qquad (3.7)$$

where:

$$\theta(j) = \begin{cases} 1 & \text{if } o_j = t_j \\ 0 & \text{if } o_j \neq t_j \end{cases} \qquad (3.8)$$

and

$$\Gamma'_{\text{notch rating}} = \frac{\sum_{j=1}^{N'} \theta(j)}{N'} \qquad (3.9)$$

where:

$$\theta(j) = \begin{cases} 1 & \text{if } |o''_j - t''_j| \leq 1 \\ 0 & \text{if } |o''_j - t''_j| \geq 2 \end{cases} \qquad (3.10)$$

The whole process is displayed in the flowchart of Figure 3.14.
Figure 3.14: Performance indicators flowchart.
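As a companion to Equations 3.1 through 3.10, the sketch below shows one way the accuracy indicators could be computed in R. The vectors o and t, the helper to_scale() (mapping a rating to the numeric scale of Table 3.1), and the logical vector rating_changed are hypothetical placeholders, not names from the original code.

```r
# Sketch of the accuracy measurements (illustrative placeholder names).
to_crude <- function(r) gsub("[+-]|[0-9]", "", r)   # strip +/- or 1, 2, 3 modifiers

full_accuracy  <- mean(o == t)                                  # Eq. 3.1
crude_accuracy <- mean(to_crude(o) == to_crude(t))              # Eq. 3.3
notch_accuracy <- mean(abs(to_scale(o) - to_scale(t)) <= 1)     # Eq. 3.5

# Change-only versions (Eq. 3.7 and 3.9): keep only observations whose observed
# rating differs from the company's previously observed rating.
changed            <- which(rating_changed)   # hypothetical logical vector computed upstream
full_accuracy_chg  <- mean(o[changed] == t[changed])
notch_accuracy_chg <- mean(abs(to_scale(o[changed]) - to_scale(t[changed])) <= 1)
```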
The concern about corporate credit ratings goes beyond accuracy: it is important for the predictive model not to predict values that are distant from the real value. To better comprehend the results, the Credit Rating Dissimilarity Coefficient was created. This statistical measurement, here denoted ψ, penalizes predictions based on the distance from the predicted to the observed value; the scale used to calculate the distance is shown in Table 3.1. The indicator is calculated as follows:

$$\psi_{O \times P} = \sum_{i=1}^{21} i \cdot \left( \frac{\sum_{j=1}^{N} \Theta_i(j)}{N} \right) \qquad (3.11)$$

where N is the number of samples, O is the observed value (or the value to be compared with), and P is the predicted value. $\Theta_i$ is defined as:

$$\Theta_i(j) = \begin{cases} 1 & \text{if } |O_j - P_j| = i \\ 0 & \text{otherwise} \end{cases} \qquad (3.12)$$
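A minimal sketch of Equation 3.11 in R, assuming the observed and predicted ratings have already been converted to the 1-21 scale of Table 3.1:

```r
# Sketch of the Credit Rating Dissimilarity Coefficient (Equation 3.11).
crd_coefficient <- function(observed_scale, predicted_scale) {
  d <- abs(observed_scale - predicted_scale)          # distance per prediction
  n <- length(d)
  sum(sapply(1:21, function(i) i * sum(d == i) / n))  # distance-weighted frequencies
}

# Example usage with hypothetical scaled vectors:
# psi <- crd_coefficient(to_scale(t), to_scale(o))
```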
Chapter 4
Results

This chapter shows and discusses the results obtained in this study. Since the framework developed makes possible the use of different machine learning techniques, this chapter evaluates the results for the Multilayer Perceptron and the Random Forest techniques. Additionally, it discusses the results gathered using the methods previously presented. In order to develop statistical results, it was chosen to work with the Chi-square test, which tests for independence between two discrete distributions and for goodness-of-fit. The first test evaluates whether one distribution is independent of the other, and the goodness-of-fit test has the null hypothesis that the observed frequency f is equal to an expected count e for each category. The null hypothesis is rejected if the p-value of the calculated Chi-square test statistic is less than a given significance level α. A literature review of the χ2 (Chi-square) test is presented in Appendix B. As explained before, the tests presented here were performed in an out-of-sample (OOS) fashion, in which the machine learning technique did not have access to the test data during the learning phase. By doing this, it is possible to guarantee the validity of the results. The results will first be presented by agency (Standard & Poors and Moody's), and then for each of them the results for the test data will be presented: first for the entire dataset, and then only for the periods in which the corporate credit rating changed compared with the previous period. For the analysis of the changes, the first score for any given company is excluded because there is no previous score for comparison. Additionally, scores that changed more than five notches compared to the previous score were considered outliers and also excluded.
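For illustration, the sketch below shows how these tests could be run in R with the base chisq.test function on the observed and predicted rating distributions; obs and pred are hypothetical vectors of ratings, and restricting the goodness-of-fit test to levels present in the observed data is an implementation detail assumed here to keep the expected counts positive.

```r
# Sketch of the Chi-square tests and the distribution correlation (illustrative).
lv   <- union(levels(factor(obs)), levels(factor(pred)))
obs  <- factor(obs,  levels = lv)
pred <- factor(pred, levels = lv)

# Test of independence between the two discrete distributions
chisq.test(table(obs, pred))

# Goodness-of-fit: predicted counts against the observed frequency profile,
# restricted to rating levels that actually appear in the observed data
present <- table(obs) > 0
chisq.test(x = as.vector(table(pred)[present]),
           p = as.vector(prop.table(table(obs)[present])))

# Simple correlation between the two frequency distributions (as in Table 4.1)
cor(as.vector(table(obs)), as.vector(table(pred)))
```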
The results will be presented first for the MLP technique and then for the Random Forest, followed by a comparison between the two methods.

4.1 Standard & Poors

This section is dedicated to discussing the results obtained from the model created using the Standard & Poors credit ratings and the methodology described in Chapter 3 for the construction of the experiment and the calculation of the results measurements. Two different machine learning techniques were used: MLP and Random Forest. The results will be presented for both methods for each measurement, and at the end there will be a discussion about each method's efficiency. The first analysis is of the distribution of the predicted values and the observed values. The following graphs show the distributions according to the corporate credit ratings for those models. Figure 4.1 shows the frequency of each observed rating against the frequency of each predicted rating when using the Random Forest model on the Standard & Poors ratings. Similarly, Figure 4.2 shows the distribution of observed corporate credit ratings against the predicted ratings when predicted using the Multilayer Perceptron model. These figures show that the Random Forest model appears, at least visually, to be closer to the observed distribution than the MLP model. To better understand and statistically support this conclusion, a χ2 test was employed to test for independence and goodness-of-fit. Also, the correlation between those two distributions was calculated to serve as a simple measure of comparison and is displayed in Table 4.1.
Figure 4.1: Ratings distribution for Random Forest model for Standard & Poors.

Standard & Poors Statistics   Multilayer Perceptron   Random Forest
Correlation                   0.547542                0.928467
Independence test             2.2e-16                 2.2e-16
Goodness-of-fit test          2.2e-16                 2.874e-05

Table 4.1: Statistics for Standard & Poors

As shown in Table 4.1, the correlation for the Random Forest model predictions is considerably higher than the correlation presented by the MLP model. With a correlation of approximately 93%, there is a strong indication that the observed data and the predicted data are closely related. Another measurement presented here is the χ2 test for independence. For this test, both results were similar: for both methods the p-value returned was less than 0.05, meaning that at a confidence level of 95% it is possible to reject the null hypothesis. The null hypothesis in the χ2 independence test is that the two samples' distributions are independent, and by rejecting it, it is possible to conclude that the distributions are not independent, and are in fact dependent. The χ2 goodness-of-fit test, on the other hand, has the null hypothesis that one distribution supports the other; in this application, it tests whether the predicted data supports the observed data. The result of this test for both models was less than 0.05, so it is possible to reject the null hypothesis, implying that the predicted values for both methods do not follow the same distribution as the observed values.
Figure 4.2: Ratings distribution for MLP model for Standard & Poors.

The statistical scenario presented here is understandable, since the predictions of both models presented a good relationship with the observed data, hence the low p-values in the independence test. The Random Forest presented a closer result, generating a better correlation measurement. The result of the goodness-of-fit test implies that neither model's predictions follow the same distribution as the observed data. This does not imply a failure to accurately predict the credit rating, given that the observed data from S&P and Moody's do not always align over time. Figure 4.3 and Figure 4.4 show the average difference between the predicted and the observed ratings over the years. For a better understanding, four different performance measurements are used. The different lines on the graph represent the percentage of matches in each scenario: exact match, within one notch, or within two notches. These lines correspond to the left axis of the graph. The bar represents the yearly average distance of the predicted value from the observed value. All those calculations were
Figure 4.3: Ratings over years for Random Forest model for Standard & Poors.

Figure 4.4: Ratings over years for MLP model for Standard & Poors.