Unstructured Data Management

Predicting Relative Risk of Financial Investment in
Publicly-Traded Companies
Krishna Vijaywargiy, Kshitij Deshpande, Manish Lingamallu
MSA 8050: UNSTRUCTURED DATA MANAGEMENT FINAL PROJECT REPORT
J. MACK ROBINSON SCHOOL OF BUSINESS, GEORGIA STATE UNIVERSITY

Krishna Vijaywargiy
Master of Science in Analytics
Georgia State University
kvijaywargiy1@student.gsu.edu
Kshitij Deshpande
kdeshpande4@student.gsu.edu
Manish Lingamallu
jlingamallu1@student.gsu.edu
ABSTRACT
We address a text analysis problem: given a set of publicly-traded companies, we try to predict the
relative risk of financial investment by analyzing the text of the SEC-mandated financial report that
these companies publish annually. We focus on the latent information in Items 7 and 7A of this report,
which describe in detail the company’s financial status in the previous year, and use well-known
analysis models to predict future trends from this information. By using text instead of numbers, our
project follows modern approaches to financial trend prediction and aims to improve prediction accuracy.
INTRODUCTION
While much of the information about the financial health of a publicly-traded company is conveyed through
numbers, many insights can also be extracted from the textual content of the mandated 10-K reports
submitted yearly to the United States Securities and Exchange Commission. Research on using text analysis
to improve the accuracy of stock price prediction is still in its early stages but has already achieved
a few milestones. This project report forecasts the relative investment risk for a set of companies by
analyzing the text content in Items 7 and 7A of the 10-K report. Item 7 is “Management’s Discussion and
Analysis of Financial Condition and Results of Operations” and Item 7A is “Quantitative and Qualitative
Disclosures about Market Risk”. Together they comprise discussions of trends, capital resources,
liquidity, a general business review, discontinued operations, and interim financial statements, among
other things.
Extracting the latent information about a company’s past-year and future performance from the
aforementioned subsections of these freely available reports is an important step towards predicting the
risk of investing in stocks. 10-K reports from 15 companies with positively trending stock prices and 15
with negatively trending stock prices were evaluated in this project to identify patterns in the text.
Before the data can generate insights, it has to be carefully extracted and cleaned. Mining the text
involves extracting only the textual data from the subsections and ignoring all tables. The text then
undergoes a cleaning process that removes all special characters, tokenizes the whole document into
words, and converts all tokens to their root words. These processed documents are then used to fit the
models that generate a relative risk prediction. The two models used for prediction are Logistic
Regression, which estimates the probability of a categorical dependent variable as a logistic function
of the independent variable(s), and Neural Networks, which are trained on a corpus of documents and then
improved through feature fusion to generate high ranking summaries. We then evaluate the results of both
methods to compare the two models.
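To make the role of the logistic function concrete, here is a minimal Python sketch. It is not the fitted model from this project: the feature values, weights and bias are hypothetical and chosen only for illustration.

```python
import math

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for one document (hypothetical weights)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical inputs: two topic scores for a single 10-K document.
probability = predict_probability(features=[0.8, 0.1], weights=[2.0, -3.0], bias=-0.5)
print(round(probability, 3))
```

A probability above a chosen cutoff would then be read as the "safe to invest" class.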
METHODOLOGY
We started our project by researching the available papers and identifying which parts of the 10-K
reports are most significant for predicting financial risk. The paper ‘Predicting Risk from Financial
Reports with Regression’ by Kogan, Levin, Routledge, Sagi and Smith [1] presents an insightful approach
of constructing regression models for the volatility of stock returns, which is an empirical measure of
financial risk. [1] It aims at providing summarizing statistical facts that are not subject to any kind
of human expertise, knowledge or regression. This establishes that a simple representation of text
(unigrams and bigrams) can substantially improve a strong baseline that does not use text. Volatility is
measured as the standard deviation of a stock’s returns over a finite period of time. To predict risk,
the paper uses clustering and Support Vector Regression for volatility prediction and finds that the
predictions of the text regression model correlate well with true volatility and, combined with
historical volatility, give higher accuracy. [1] However, instead of clustering, we followed an approach
based on ‘text topics’ and ‘text parser’. We evaluated the 5-year stock price trends of the companies
along with their 10-K reports to create a start list of both positive and negative words and added it to
the text parsing node.
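The volatility definition above (standard deviation of returns over a finite period) can be sketched in a few lines of Python. This is only an illustration: the use of simple period-over-period returns and the sample standard deviation are assumptions, and the prices are made up.

```python
import statistics

def simple_returns(prices):
    """Period-over-period returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

def volatility(prices):
    """Sample standard deviation of returns over the given window."""
    return statistics.stdev(simple_returns(prices))

# Hypothetical closing prices, for illustration only.
steady = [100.0, 101.0, 102.0, 103.0, 104.0]
choppy = [100.0, 110.0, 95.0, 108.0, 92.0]
print(volatility(steady) < volatility(choppy))
```

A series that swings widely from period to period yields a larger value than one that drifts smoothly, which is why volatility serves as an empirical measure of risk.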
In their paper “Combining Data and Text Mining Techniques for Analyzing Financial Reports”, Antonia
Kloptchenko, Tomas Eklund and Barbro Back take a different approach to analyzing information from
telecommunication companies and state that the textual part contains more precise information about
company performance than the quantitative data available in the form of financial ratios. They used data
and text mining methods to study hidden indications of the financial performance of companies in both
the qualitative and quantitative parts of their financial reports. By combining all occurring matches
with clustering of the quantitative data, they concluded that the analysis schema had captured a
tendency: the text of the reports tends to foresee changes in the financial state of a company before
those changes influence the financial ratios. Their results indicate that some future changes in
financial performance can be anticipated by analyzing the text of the reports. Given this finding,
together with our time constraints and limited know-how about financial ratios and terminology, we
decided to focus only on the text data contained in Item 7 and Item 7A of the SEC reports.

IMPLEMENTATION
To extract the content from the 10-K reports, we used Python to parse each document and acquire the
relevant text. This is a tedious process because the companies use different formats for their XBRL
reports. We then stored the text in a pandas data-frame to merge all the companies. The text in the
data-frame was then tokenized and stripped of markup, punctuation marks and any special characters other
than alphabetic characters. We used regular expressions in Python to process this text for analysis.
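The cleaning steps described above might look roughly like the following sketch. It is not the project's actual script: the regular expressions, the tiny suffix list used to approximate root words, and the sample sentence are all assumptions made for illustration.

```python
import re

# Hypothetical, tiny suffix list used only to illustrate reduction to root words;
# a real pipeline would likely use an established stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")

def clean_and_tokenize(raw_text):
    """Strip markup and non-alphabetic characters, then split into lowercase tokens."""
    no_markup = re.sub(r"<[^>]+>", " ", raw_text)          # drop HTML/XBRL-style tags
    letters_only = re.sub(r"[^A-Za-z]+", " ", no_markup)   # keep alphabetic characters only
    return letters_only.lower().split()

def crude_stem(token):
    """Remove one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sample = "<p>Revenues increased 12% in 2015, offsetting rising costs.</p>"
tokens = [crude_stem(t) for t in clean_and_tokenize(sample)]
print(tokens)
```

The crude suffix stripping produces truncated stems such as "revenu" rather than dictionary words, which is typical of simple stemmers and is acceptable when the tokens are only used as model features.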
After extracting the data in the desired format, we created a chain of nodes in SAS Enterprise Miner to
run our data. The first node in the chain was the data-source node, “Training Data”, which contained a
comprehensive list of 30 companies with over 90 records that were parsed to train the model. An
additional field, “Investment”, was added to the spreadsheet and assigned a “0” or a “1” based on the
stock trend of the company: “1” for a company considered safe to invest in and “0” for a company
considered not safe to invest in. The input to the data-source node was the Excel spreadsheet. The output
from the data-source node included the company name, year of filing, and Investment preference. If we
were to run the experiment again, we would use growth or Profit/Equity ratios in order to have a numeric
value to set as the threshold for our decisions. The data then ran through a data-partition node with the
partition set at a 75/25 split between training and validation data. After the data-partition node, the
data was parsed using the Text Parsing node. We created the start list by analyzing textual data from the
good and bad companies, which we classified on the basis of stock performance over the last 5 years.
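Outside Enterprise Miner, a 75/25 partition of labelled records can be sketched as below. The record layout, company names and random seed are hypothetical, and the data-partition node may sample differently (for example, stratified by the target).

```python
import random

def partition(records, train_fraction=0.75, seed=42):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: (company, filing year, Investment label).
records = [(f"Company {i}", 2012 + i % 3, i % 2) for i in range(12)]
training, validation = partition(records)
print(len(training), len(validation))
```

Fixing the seed keeps the split reproducible, so both models are trained and validated on the same records.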
The output from the Text Parsing node flowed to a Text Filter node, where term weights were assigned
based on Inverse Document Frequency. The weight reflected the importance of the word in a document.
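As an illustration of inverse document frequency weighting, the sketch below uses one common variant, log2(N / df) + 1, where N is the number of documents and df is the number of documents containing the term. Whether the Text Filter node uses exactly this formula is an assumption, and the toy documents are made up.

```python
import math

def idf_weights(documents):
    """Return an IDF weight per term: log2(N / document frequency) + 1."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / df) + 1 for term, df in doc_freq.items()}

# Hypothetical tokenized documents.
documents = [
    ["profit", "revenue"],
    ["profit", "loss"],
    ["profit", "cash"],
    ["profit", "revenue"],
]
weights = idf_weights(documents)
print(sorted(weights.items(), key=lambda item: item[1]))
```

Terms that appear in every document receive the lowest weight, while rare terms are weighted up, so the weight highlights words that help distinguish one filing from another.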
Fig 2(a): Number of Documents by Frequency
Fig 2(b): High Frequency Terms

The output of the text parsing node included the terms along with their weights.
The output of the text filter node was passed through the text topic node. The text topic node matched
terms that were strongly associated and grouped them into topics. Topics are collections of terms that
describe and characterize a main theme or idea. For example, the term “profit” would have a strong
association with terms like “revenue” and “cash”, so these terms would have a higher probability of
appearing in the same topic. We limited the number of multi-term topics to 5. The results of the text
filter node gave significant insights, including the frequently used words, which were consistent with
our earlier findings.
Fig 3(a): Weights for terms – Result of Text Filter node
Fig 4(a): Scatter plot of Positive words, Negative terms and the topics.
Fig 4(b): Weights of high frequency terms with other attributes like role, frequency etc.

Some of the topics that we got as a result of text topic node included:
From our analysis, we found that documents which included terms of the topic have good investment
preferences, whereas the documents which have terms related to. This method identified a collection of
terms, which in turn helped us determine the performance.
The output included the documents, the topics and the relevance of each document to each of the topics.
This output was given to the Variable Selection node. The Variable Selection node helped reduce the
number of input variables to the model by rejecting input variables that were not related to the target.
The results window also displayed a histogram called “Variable Importance”.
The histogram shows each variable's contribution towards the prediction, based on the R-Square scores.
Here, we observed that the topic “+weak +termination +average +tend” had the highest importance in
deciding the investment preferences as it had the highest Variable Importance.
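The idea of ranking inputs by R-Square can be sketched as follows. This is a simplified stand-in, not the Variable Selection node's actual procedure: it scores each input by its squared correlation with the target, and the topic names and values are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between one input and the target."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column explains nothing
    return (cov * cov) / (var_x * var_y)

# Hypothetical topic scores per document and the 0/1 Investment target.
target = [1, 1, 0, 0]
inputs = {
    "topic_a": [0.9, 0.8, 0.1, 0.2],
    "topic_b": [0.5, 0.4, 0.5, 0.4],
}
ranking = sorted(
    ((name, r_squared(values, target)) for name, values in inputs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Inputs near the bottom of such a ranking are candidates for rejection, which is the effect the histogram makes visible.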
We finally built two models, one using Logistic Regression and the other using Neural Networks.
We analyzed the performance of both models based on the results on the training and validation data.
Fig 5: Document cutoff value for each topic and the number of documents having satisfied the criteria
Fig 7: Histogram displaying the variable importance for prediction using R-Square scores.
Fig 8(a): Mean Predicted against Mean Target for Logistic Regression
Fig 8(b): Mean Predicted against Mean Target for Neural Networks

From the results of both the Regression model and the Neural Network model, we found that a strong
relationship exists between the text of Item 7 and 7A and the company’s performance in the upcoming
year. Comparing accuracy on the validation data, we inferred it to be approximately 90% and 95% for the
Logistic Regression and Neural Network models, respectively. The final model we developed is shown in
Fig 10.
TESTING THE MODEL
To test the model that was built, we passed new data through it and validated the results. The initial
task was to decide whether to go with the Regression model or the Neural Network model. To resolve this,
we used the Model Comparison node, which selects the best-performing model based on the errors of the
input models. Fig 11(a) shows the error statistics of both the Regression model and the Neural Network
model. The selected model was the Regression model, with Average Square Error as the selection criterion.
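Selecting a model by average squared error amounts to the comparison sketched below. The targets and predictions are invented for illustration and are not the project's results; the model names are just dictionary keys.

```python
def average_squared_error(targets, predictions):
    """Mean of squared differences between target and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def select_model(targets, predictions_by_model):
    """Return the name of the model with the lowest error, plus all errors."""
    errors = {
        name: average_squared_error(targets, predictions)
        for name, predictions in predictions_by_model.items()
    }
    best = min(errors, key=errors.get)
    return best, errors

# Hypothetical validation targets and model outputs.
targets = [1, 0, 1, 0]
predictions_by_model = {
    "regression": [0.9, 0.1, 0.8, 0.2],
    "neural_network": [0.7, 0.3, 0.6, 0.4],
}
best, errors = select_model(targets, predictions_by_model)
print(best, errors)
```

Because the criterion penalizes large misses quadratically, a model with slightly lower classification accuracy can still be selected if its predicted values sit closer to the targets.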
Fig 10: SAS Model Developed
Fig 11(a): The error statistics of Regression Model and Neural Network Model.
Fig 11(b): Regression Model selected based on Average Square Error as Criterion.

Finally, we used the Score node to score new raw data. The inputs to the Score node were the output of
the Model Comparison node and a new raw data source. Based on the model selected by the Model Comparison
node, the Score node scored the new data. The raw data with no Investment preference and the final model
developed are shown in the figure below.
The output of the Score node included the prediction for the field “Should Invest”, the target variable.
Here, for the company VISA the “Should Invest” field is predicted at around 1.059, whereas for the
company Frontier it is predicted at around 0.28. This clearly indicates that it is worth investing in
VISA. To further validate the model, we checked the stock trends for these companies on Yahoo Finance
and found the trend lines in the graphs to be in agreement with the results of the model.
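Turning the scored values into a decision can be sketched as a simple cutoff. The two scores are the ones reported above; the 0.5 threshold and the "invest"/"avoid" labels are assumptions for illustration, not settings taken from the Score node.

```python
def recommend(scores, threshold=0.5):
    """Label each company by comparing its predicted score to a cutoff (assumed 0.5)."""
    return {
        company: ("invest" if score >= threshold else "avoid")
        for company, score in scores.items()
    }

# Predicted "Should Invest" values reported in the text.
scores = {"VISA": 1.059, "Frontier": 0.28}
decisions = recommend(scores)
print(decisions)
```

With a numeric threshold such as a growth or Profit/Equity ratio, as suggested earlier, the cutoff could be chosen from the data instead of being fixed by hand.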
CONCLUSION
Using the data from Item 7 and Item 7A of the 10-K reports, we concluded that there exists a correlation
between the text of the filings and the trending stock performance of the companies. As suggested in one
of the research papers, we would like to use the quantitative data in the financial reports to refine the
relation between the filings and stock performance. We would also like to build a start list that is more
effective at distinguishing between positive and negative words in the financial sector, which should
enhance the overall accuracy of the model. This methodology would be more meaningful if we included
other sections such as Item 1A, which covers the prevalent risk factors that play an important role in
determining company performance.
Fig 12(b): Raw data for model testing
Fig 13: The predicted values of the raw data, which is the output of the score node.
Fig 14(a): Stock trend of VISA
Fig 14(b): Stock trend of Frontier

REFERENCES
1. Kogan, Levin, Routledge, Sagi and Smith. Predicting Risk from Financial Reports with Regression.
http://homes.cs.washington.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf
2. Back, B., Toivonen, J., Vanharanta, H., and Visa, A. Comparing numerical data and text information
from annual reports using self-organizing maps, International Journal of Accounting Information
Systems (2), 2001, pp. 249-269.
3. Kohonen, T. Self-Organizing Maps, Leipzig, Germany: Springer-Verlag, 1997.
4. Kohut, G., and Segars, A. The president’s letter to stockholders: An examination of corporate
communication strategy, Journal of Business Communication (29:1), 1992, pp. 7-21.
5. Lehtinen, J. Financial Ratios in an International Comparison, Vasa: Acta Wasaensia, 1996.
