Predicting Relative Risk of Financial Investment in
Publicly-Traded Companies
Krishna Vijaywargiy, Kshitij Deshpande, Manish Lingamallu
MSA 8050: UNSTRUCTURED DATA MANAGEMENT FINAL PROJECT REPORT
J. MACK ROBINSON SCHOOL OF BUSINESS, GEORGIA STATE UNIVERSITY
Krishna Vijaywargiy
Master of Science in Analytics
Georgia State University
kvijaywargiy1@student.gsu.edu
Kshitij Deshpande
Master of Science in Analytics
Georgia State University
kdeshpande4@student.gsu.edu
Manish Lingamallu
Master of Science in Analytics
Georgia State University
jlingamallu1@student.gsu.edu
ABSTRACT
We address a text analysis problem: given a set of publicly-traded companies, we try to predict the
relative risk of financial investment by analyzing the text of the SEC-mandated financial report that
these companies publish annually. We focus on the latent information in Items 7 and 7A of this report,
which give a detailed description of the company’s financial status in the previous year, and use
well-known analysis models to predict future trends from this information. By using text instead of
numbers, our project follows recent approaches to financial trend prediction and aims to push their
accuracy higher.
INTRODUCTION
While much of the information about the financial health of a publicly-traded company is conveyed through
numbers, many insights can also be extracted from the text of the mandated 10-K reports submitted yearly
to the United States Securities and Exchange Commission. Research on using text analysis to improve the
accuracy of stock price prediction is still in its early stages but has already achieved a few milestones.
This project report forecasts the relative investment risk for a set of companies by analyzing the text
in Items 7 and 7A of the 10-K report. Item 7 contains “Management’s Discussion and Analysis of Financial
Condition and Results of Operations” and Item 7A contains “Quantitative and Qualitative Disclosures about
Market Risk”. Together they comprise discussions of trends, capital resources, liquidity, general business
review, discontinued operations and interim financial statements, among other things.
Extracting the latent information about a company’s past-year and future performance from the
aforementioned subsections of these freely available reports is an important step towards predicting the
risk of investing in its stock. In this project, 10-K reports from 15 companies with positively trending
stock prices and 15 with negatively trending stock prices were evaluated to identify patterns in the text.
Before the data can generate insights, it has to be carefully extracted and cleaned. Mining the text
involves extracting only the textual data from the subsections and ignoring all tables. The text then
undergoes a cleaning process which removes all special characters, tokenizes the whole document into
words and converts each token to its root word. The processed documents are then used to fit models that
predict relative risk. Two models are used for prediction: a Logistic Regression model, which handles a
categorical dependent variable by estimating its probabilities as a logistic function of the independent
variable(s), and a Neural Network, which is trained on a corpus of documents and is then improved through
feature fusion to generate high-ranking summaries. We then evaluate the results of both methods and
compare the two models.
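To make the logistic regression description concrete, the following minimal sketch shows how a logistic function turns a weighted sum of inputs into a probability for the "1" class. The feature values, weights and bias are hypothetical numbers chosen for illustration; they are not coefficients from the model built in this project.

```python
import math


def logistic(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of class 1 for one document's feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(score)


# Hypothetical topic scores and coefficients, for illustration only.
p = predict_probability([0.8, 0.1], weights=[2.0, -3.0], bias=-0.5)
label = 1 if p >= 0.5 else 0  # 0.5 cutoff is an assumption, not from the report
```

A cutoff of 0.5 is only one possible choice; any threshold on the probability can be used to turn it into a 0/1 decision.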
METHODOLOGY
We started our project by researching the available papers and identifying which parts of the 10-K reports
are most significant for predicting financial risk. The paper ‘Predicting Risk from Financial Reports with
Regression’ by Kogan, Levin, Routledge, Sagi and Smith [1] presents an insightful approach of constructing
regression models for the volatility of stock returns, which is an empirical measure of financial risk. It
aims to provide summary statistics that do not depend on human expertise or domain knowledge. It
establishes that a simple representation of text (unigrams and bigrams) can substantially improve a
strong volatility prediction baseline that does not use text. Volatility is measured as the standard
deviation of a stock’s returns over a finite period of time. Thus, to predict risk, the paper focuses on
using clustering and Support Vector Regression for volatility prediction, and finds that the predictions
of a text regression model correlate well with true volatility and that combining them with historical
volatility gives higher accuracy.[1] However, instead of clustering, we followed an approach based on
‘text topics’ and a ‘text parser’. We evaluated the 5-year stock price trends of the companies along with
their 10-K reports to create a start list of positive and negative words, which we added to the text
parsing node.
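As an illustration of the volatility definition above (the standard deviation of a stock’s returns over a finite period), here is a minimal sketch. It assumes simple period-over-period returns and the sample standard deviation; the price series are made-up numbers, and the exact return definition used in [1] may differ.

```python
import statistics


def simple_returns(prices):
    """Period-over-period returns: (p_t - p_{t-1}) / p_{t-1}."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def volatility(prices):
    """Sample standard deviation of the returns over the given window."""
    return statistics.stdev(simple_returns(prices))


# Hypothetical closing prices, for illustration only.
steady = [100.0, 101.0, 102.0, 103.0]
choppy = [100.0, 110.0, 95.0, 108.0]

steady_vol = volatility(steady)
choppy_vol = volatility(choppy)  # larger swings give a larger value
```

Under this definition a stock whose price swings widely from period to period has higher volatility, and so higher measured risk, than one that moves steadily.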
In their paper “Combining Data and Text Mining techniques for Analyzing Financial Reports”, Antonia
Kloptchenko, Tomas Eklund and Barbro Back take a different approach to analyzing information from
telecommunication companies and state that the textual part of a report contains more precise information
about company performance than the quantitative data available in the form of financial ratios. They
used data and text mining methods to study hidden indications about the financial performance of
companies in both the qualitative and the quantitative parts of their financial reports. Combining the
information gathered from all occurring matches with clustering of the quantitative data, they concluded
that their analysis schema had captured a tendency: the text of the reports tends to foresee changes in
the financial state of a company before those changes influence the financial ratios. Their results show
that some future changes in financial performance can be anticipated by analyzing text from the reports.
Given this finding, and because of time constraints and our limited know-how about financial ratios and
terminology, we decided to focus only on the text data contained in Item 7 and Item 7A of the SEC
reports.
IMPLEMENTATION
To extract the content from the 10-K reports, we used Python to parse each document and acquire the
relevant text. This was a tedious process, as the companies all use different formats for their XBRL
reports. We then stored the text in a pandas data frame so that all the companies could be merged. The
text in the data frame was then tokenized and stripped of markup, punctuation marks and any special
characters other than letters. We used regular expressions in Python to process this text for analysis.
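The cleaning step described above can be sketched as follows. This is a minimal illustration using only Python's `re` module, not the project's actual script: the two patterns, the function name `clean_and_tokenize` and the lowercasing step are assumptions, and reduction to root words is not shown.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # markup such as <p> or <div ...>
NON_ALPHA_RE = re.compile(r"[^A-Za-z]+")   # anything that is not a letter


def clean_and_tokenize(raw_text):
    """Strip markup, drop non-alphabetic characters and split into tokens."""
    no_markup = TAG_RE.sub(" ", raw_text)
    letters_only = NON_ALPHA_RE.sub(" ", no_markup)
    # Lowercasing is an assumption; the report does not say whether case was kept.
    return letters_only.lower().split()


# Hypothetical sentence in the style of an Item 7 discussion.
sample = "<p>Net revenue increased 12% in fiscal 2015, driven by higher volumes.</p>"
tokens = clean_and_tokenize(sample)
```

Because digits and punctuation are replaced by spaces rather than deleted, words on either side of a number or symbol stay separate tokens.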
After extracting the data in the desired format, we created a chain of nodes in SAS Enterprise Miner to
process it. The first node in the chain was the data-source node, “Training Data”, which contained a
comprehensive list of 30 companies with over 90 records that were parsed to train the model. An
additional field, “Investment”, was added to the spreadsheet and assigned a “0” or a “1” based on the
stock trend of the company: “1” for a company considered safe to invest in and “0” for a company
considered not safe to invest in. The input to the data-source node was the Excel spreadsheet. The output
from the data-source node included company name, year of filing and Investment preference. If we were to
run the experiment again, we would use growth or Profit/Equity ratios in order to have a numeric value to
set as the threshold for our decisions. This data ran through a Data Partition node with a 75/25 split
between training and validation data. After the Data Partition node, the data was parsed using the Text
Parsing node. We created the start list by analyzing textual data from the good and bad companies, which
we classified on the basis of stock performance over the last 5 years.
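For readers without access to Enterprise Miner, the 75/25 partition can be approximated with a simple random split. The sketch below is an assumption-laden stand-in: the record layout, the count of 92 records and the fixed seed are hypothetical, and the Data Partition node may sample differently (for example, stratified on the target).

```python
import random


def partition(records, train_fraction=0.75, seed=42):
    """Shuffle a copy of the records and split it into training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records; the real data had 30 companies and over 90 filings.
records = [{"company": f"company_{i}", "investment": i % 2} for i in range(92)]
train, valid = partition(records)
```

Fixing the seed keeps the split reproducible, so both models are trained and validated on exactly the same records.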
The output from the Text Parsing node flowed to a Text Filter node, where weights were assigned to the
terms based on Inverse Document Frequency. The weight reflected the importance of the word in a document.
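A minimal sketch of Inverse Document Frequency weighting is given below. It assumes one common form of the weight, log2(N / df) + 1, where N is the number of documents and df is the number of documents containing the term; the report does not state the exact formula applied by the Text Filter node, and the toy documents are invented.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log2(N / df) + 1} for a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / df) + 1 for term, df in doc_freq.items()}


# Hypothetical tokenized documents, for illustration only.
docs = [
    ["revenue", "growth"],
    ["revenue", "loss"],
    ["revenue", "impairment", "loss"],
    ["revenue"],
]
weights = inverse_document_frequency(docs)
```

A term that appears in every document, such as "revenue" here, gets the lowest weight, while a term confined to a single document gets the highest.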
Fig 2(a): Number of Documents by Frequency
Fig 2(b): High Frequency Terms
The output of the text parsing nodes included the terms along with their weights.
The output of the Text Filter node was passed through the Text Topic node. The Text Topic node matched
terms that were strongly associated and grouped them into topics. Topics are collections of terms that
describe and characterize a main theme or idea. For example, the term “profit” would have a strong
association with terms like “revenue” and “cash”, and these terms would therefore have a higher
probability of appearing in the same topic. We limited the number of multi-term topics to 5. The results
of the Text Filter node gave useful insights, including the most frequently used words, which were
consistent with our earlier findings.
Fig 3(a): Weights for terms – Result of Text Filter node
Fig 4(a): Scatter plot of Positive words, Negative terms and the topics.
Fig 4(b): Weights of high frequency terms with other attributes like role, frequency etc.
Some of the topics that we got as a result of text topic node included:
From our analysis, we found that documents which included terms of the topic have good investment
preferences, whereas the documents which have terms related to. This method identified a collection of
terms, which in turn helped us determine the performance.
The output included the documents, the topics and the relevance of each document to each of the topics.
This output was given to the Variable Selection node. The Variable Selection node helped reduce the
number of input variables to the model by rejecting input variables that were not related to the market.
The results window also displayed a histogram called “Variable Importance”.
The histogram shows each variable's contribution towards the prediction, based on the R-Square scores.
Here, we observed that the topic “+weak +termination +average +tend” had the highest importance in
deciding the investment preferences, as it had the highest Variable Importance.
We finally built two models, one using Logistic Regression and the other using neural networks.
We analyzed the effects of both the models based on the results of validation and training data.
Fig 5: Document cutoff value for each topic and the number of documents having satisfied the criteria
Fig 7: Histogram displaying the variable importance for prediction using R-Square scores.
Fig 8(a): Mean Predicted against Mean Target for Logistic Regression
Fig 8(b): Mean Predicted against Mean Target for Neural Networks
From the results of both the Regression model and the Neural Network, we found that a strong relationship
exists between the text of Item 7 and Item 7A and the company’s performance in the upcoming year.
Comparing accuracy on the validation data, we estimate it at approximately 90% for the Logistic
Regression model and 95% for the Neural Network model. The final model we developed is shown in Fig 10.
TESTING THE MODEL
To test the model that was built, we decided to pass new data through it and validate the result. The
first task was to decide whether to go with the Regression model or the Neural Network model. To resolve
this, we used the Model Comparison node, which selects the best-performing of its input models based on
their errors. The figures below show the statistics of both the Regression model and the Neural Network
model. The selected model was the Regression model, with Average Square Error as the selection criterion.
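The selection rule can be sketched outside Enterprise Miner as follows: compute the average squared error of each candidate on the same validation targets and keep the one with the lowest value. The predictions below are hypothetical numbers for illustration and do not reproduce the statistics reported in the figures; the function names are likewise assumptions.

```python
def average_squared_error(targets, predictions):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def select_model(targets, predictions_by_model):
    """Return the name of the model with the lowest error, plus all errors."""
    errors = {
        name: average_squared_error(targets, preds)
        for name, preds in predictions_by_model.items()
    }
    best = min(errors, key=errors.get)
    return best, errors


# Hypothetical validation targets and model outputs, for illustration only.
targets = [1, 0, 1, 0]
candidates = {
    "regression": [0.9, 0.2, 0.8, 0.1],
    "neural_network": [0.6, 0.4, 0.7, 0.5],
}
best, errors = select_model(targets, candidates)
```

Other criteria, such as misclassification rate, could rank the same models differently, so the choice of criterion is part of the result.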
Fig 10: SAS Model Developed
Fig 11(a): The error statistics of Regression Model and Neural Network Model.
Fig 11(b): Regression Model selected based on Average Square Error as Criterion.
Finally, we used the Score node to score new raw data. The inputs to the Score node were the output of
the Model Comparison node and a new raw data source. The Score node scored the new data using the model
selected by the Model Comparison node. The raw data, which has no Investment preference, and the final
model developed are shown in the figures below.
The output of the Score node included the prediction for the field “Should Invest”, the target variable.
Here, we can see that for the company VISA the “Should Invest” field is predicted at around 1.059, whereas
for the company Frontier it is predicted at around 0.28. This clearly indicates that VISA is the one
worth investing in. To further validate the model, we checked the stock trends of these companies on
Yahoo Finance and found the trend lines in the graphs to be in agreement with the results of the model.
CONCLUSION
Using the data from Item 7 and Item 7A of the 10-K reports, we concluded that there is a correlation
between the text of the filings and the trending stock performance of the companies. As suggested in one
of the research papers, we would like to use the quantitative data in the financial reports to strengthen
the relation between the filings and stock performance. We would also like to add a start list that is
more effective at distinguishing between positive and negative words in the financial sector, which
should enhance the overall accuracy of the model. This methodology would be more meaningful if we
included other sections such as Item 1A, which covers the prevalent risk factors and plays an important
role in determining company performance.
Fig 12(b): Raw data for model testing
Fig 13: The predicted values of the raw data, which is the output of the score node.
Fig 14(a): Stock trend of VISA
Fig 14(b): Stock trend of Frontier
REFERENCES
1. Kogan, Levin, Routledge, Sagi, and Smith. Predicting Risk from Financial Reports with Regression.
http://homes.cs.washington.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf
2. Back, B., Toivonen, J., Vanharanta, H., and Visa, A. Comparing numerical data and text information
from annual reports using self-organizing maps, International Journal of Accounting Information
Systems (2), 2001, pp. 249-269.
3. Kohonen, T. Self-Organizing Maps, Leipzig, Germany: Springer-Verlag, 1997.
4. Kohut, G., and Segars, A. The president’s letter to stockholders: An examination of corporate
communication strategy, Journal of Business Communication (29:1), 1992, pp. 7-21.
5. Lehtinen, J. Financial Ratios in an International Comparison, Vasa: Acta Wasaensia, 1996.

More Related Content

What's hot

Text Analytics- An application in Indian Stock Markets
Text Analytics- An application in Indian Stock MarketsText Analytics- An application in Indian Stock Markets
Text Analytics- An application in Indian Stock MarketsSinjana Ghosh
 
Private Information
Private InformationPrivate Information
Private InformationAmit Mittal
 
Portfolio Management Project
Portfolio Management ProjectPortfolio Management Project
Portfolio Management ProjectRan Zhang
 
tibu-published article10.11648.j.jfa.20160404.13
tibu-published article10.11648.j.jfa.20160404.13tibu-published article10.11648.j.jfa.20160404.13
tibu-published article10.11648.j.jfa.20160404.13Tibu Ngozi
 
Financial Analysis on Recession Period at M&M Tractors
Financial Analysis on Recession Period at M&M TractorsFinancial Analysis on Recession Period at M&M Tractors
Financial Analysis on Recession Period at M&M TractorsProjects Kart
 
A study on effect of liquidity management on profitability with select privat...
A study on effect of liquidity management on profitability with select privat...A study on effect of liquidity management on profitability with select privat...
A study on effect of liquidity management on profitability with select privat...Supriya Mondal
 
Liquidity reactions towards dividend announcements and information efficiency...
Liquidity reactions towards dividend announcements and information efficiency...Liquidity reactions towards dividend announcements and information efficiency...
Liquidity reactions towards dividend announcements and information efficiency...Evans Tee
 
Portfolio Optimization Project Report
Portfolio Optimization Project ReportPortfolio Optimization Project Report
Portfolio Optimization Project ReportJohn Cui
 
Summer Training Report on Fundamental Analysis
Summer Training Report on Fundamental AnalysisSummer Training Report on Fundamental Analysis
Summer Training Report on Fundamental AnalysisFellowBuddy.com
 
Fundamental analysis of banking industry
Fundamental analysis of banking industryFundamental analysis of banking industry
Fundamental analysis of banking industryDARUN V
 
FIN 571 Extraordinary Success |tutorialrank.com
FIN 571 Extraordinary Success |tutorialrank.comFIN 571 Extraordinary Success |tutorialrank.com
FIN 571 Extraordinary Success |tutorialrank.combeautifuljasmine
 
Fundamental Analysis Of Mahindra&Mahindra
Fundamental Analysis Of Mahindra&MahindraFundamental Analysis Of Mahindra&Mahindra
Fundamental Analysis Of Mahindra&MahindraStudying
 
Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...
Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...
Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...Wasim Uddin
 
00251740510626254
0025174051062625400251740510626254
00251740510626254Jan Ahmed
 
Project report on fundamental analysis of scrips under banking sector
Project report on fundamental analysis of scrips under banking sectorProject report on fundamental analysis of scrips under banking sector
Project report on fundamental analysis of scrips under banking sectoraftabshaikh04
 
The influence of debt ratio
The influence of debt ratioThe influence of debt ratio
The influence of debt ratioIntan Ayuna
 
Working capital investment and financing policies of selected pharmaceutical ...
Working capital investment and financing policies of selected pharmaceutical ...Working capital investment and financing policies of selected pharmaceutical ...
Working capital investment and financing policies of selected pharmaceutical ...Alexander Decker
 

What's hot (19)

Project eby
Project ebyProject eby
Project eby
 
Text Analytics- An application in Indian Stock Markets
Text Analytics- An application in Indian Stock MarketsText Analytics- An application in Indian Stock Markets
Text Analytics- An application in Indian Stock Markets
 
Private Information
Private InformationPrivate Information
Private Information
 
Sapm
SapmSapm
Sapm
 
Portfolio Management Project
Portfolio Management ProjectPortfolio Management Project
Portfolio Management Project
 
tibu-published article10.11648.j.jfa.20160404.13
tibu-published article10.11648.j.jfa.20160404.13tibu-published article10.11648.j.jfa.20160404.13
tibu-published article10.11648.j.jfa.20160404.13
 
Financial Analysis on Recession Period at M&M Tractors
Financial Analysis on Recession Period at M&M TractorsFinancial Analysis on Recession Period at M&M Tractors
Financial Analysis on Recession Period at M&M Tractors
 
A study on effect of liquidity management on profitability with select privat...
A study on effect of liquidity management on profitability with select privat...A study on effect of liquidity management on profitability with select privat...
A study on effect of liquidity management on profitability with select privat...
 
Liquidity reactions towards dividend announcements and information efficiency...
Liquidity reactions towards dividend announcements and information efficiency...Liquidity reactions towards dividend announcements and information efficiency...
Liquidity reactions towards dividend announcements and information efficiency...
 
Portfolio Optimization Project Report
Portfolio Optimization Project ReportPortfolio Optimization Project Report
Portfolio Optimization Project Report
 
Summer Training Report on Fundamental Analysis
Summer Training Report on Fundamental AnalysisSummer Training Report on Fundamental Analysis
Summer Training Report on Fundamental Analysis
 
Fundamental analysis of banking industry
Fundamental analysis of banking industryFundamental analysis of banking industry
Fundamental analysis of banking industry
 
FIN 571 Extraordinary Success |tutorialrank.com
FIN 571 Extraordinary Success |tutorialrank.comFIN 571 Extraordinary Success |tutorialrank.com
FIN 571 Extraordinary Success |tutorialrank.com
 
Fundamental Analysis Of Mahindra&Mahindra
Fundamental Analysis Of Mahindra&MahindraFundamental Analysis Of Mahindra&Mahindra
Fundamental Analysis Of Mahindra&Mahindra
 
Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...
Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...
Stock Return Predictability with Financial Ratios: Evidence from PSX 100 Inde...
 
00251740510626254
0025174051062625400251740510626254
00251740510626254
 
Project report on fundamental analysis of scrips under banking sector
Project report on fundamental analysis of scrips under banking sectorProject report on fundamental analysis of scrips under banking sector
Project report on fundamental analysis of scrips under banking sector
 
The influence of debt ratio
The influence of debt ratioThe influence of debt ratio
The influence of debt ratio
 
Working capital investment and financing policies of selected pharmaceutical ...
Working capital investment and financing policies of selected pharmaceutical ...Working capital investment and financing policies of selected pharmaceutical ...
Working capital investment and financing policies of selected pharmaceutical ...
 

Similar to Unstructured Data Management

Working Capital Management of Larsen & Turbo
Working Capital Management of Larsen & TurboWorking Capital Management of Larsen & Turbo
Working Capital Management of Larsen & TurboDr. Amarjeet Singh
 
1408-Article Text-5906-1-10-20220221.pdf
1408-Article Text-5906-1-10-20220221.pdf1408-Article Text-5906-1-10-20220221.pdf
1408-Article Text-5906-1-10-20220221.pdfDR BHADRAPPA HARALAYYA
 
Accounting Research Center, Booth School of Business, Universi.docx
Accounting Research Center, Booth School of Business, Universi.docxAccounting Research Center, Booth School of Business, Universi.docx
Accounting Research Center, Booth School of Business, Universi.docxnettletondevon
 
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...IRJET Journal
 
Module 7 Discussion ForumDiscussion Statement of Cash and Financi.docx
Module 7 Discussion ForumDiscussion Statement of Cash and Financi.docxModule 7 Discussion ForumDiscussion Statement of Cash and Financi.docx
Module 7 Discussion ForumDiscussion Statement of Cash and Financi.docxhelzerpatrina
 
Team Project Deliverable and PresentationYou team works for XY.docx
Team Project Deliverable and PresentationYou team works for XY.docxTeam Project Deliverable and PresentationYou team works for XY.docx
Team Project Deliverable and PresentationYou team works for XY.docxerlindaw
 
My name is highlighted in Blue and thatt the portion I am respo.docx
My name is highlighted in Blue and thatt the portion I am respo.docxMy name is highlighted in Blue and thatt the portion I am respo.docx
My name is highlighted in Blue and thatt the portion I am respo.docxgemaherd
 
Ratio analysis - Introduction
Ratio analysis - IntroductionRatio analysis - Introduction
Ratio analysis - Introductionuma reur
 
IRJET - Stock Recommendation System using Machine Learning Approache
IRJET - Stock Recommendation System using Machine Learning ApproacheIRJET - Stock Recommendation System using Machine Learning Approache
IRJET - Stock Recommendation System using Machine Learning ApproacheIRJET Journal
 
The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...
The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...
The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...Lora Cecere
 
Financial Performance Analysis of Selected Private Sector Banks in India
Financial Performance Analysis of Selected Private Sector Banks in IndiaFinancial Performance Analysis of Selected Private Sector Banks in India
Financial Performance Analysis of Selected Private Sector Banks in IndiaDr. Amarjeet Singh
 
IRJET - Bankruptcy Score Indexing
IRJET - Bankruptcy Score IndexingIRJET - Bankruptcy Score Indexing
IRJET - Bankruptcy Score IndexingIRJET Journal
 
Capital structure and eps a study on selected financial institutions listed o...
Capital structure and eps a study on selected financial institutions listed o...Capital structure and eps a study on selected financial institutions listed o...
Capital structure and eps a study on selected financial institutions listed o...Alexander Decker
 
International Journal of Business and Management Invention (IJBMI)
International Journal of Business and Management Invention (IJBMI)International Journal of Business and Management Invention (IJBMI)
International Journal of Business and Management Invention (IJBMI)inventionjournals
 
A Study on Ratio Analysis at Accord Puducherry
A Study on Ratio Analysis at Accord PuducherryA Study on Ratio Analysis at Accord Puducherry
A Study on Ratio Analysis at Accord Puducherryijtsrd
 
Vencon Research International 2020
Vencon Research International 2020Vencon Research International 2020
Vencon Research International 2020Vicente Farias
 

Similar to Unstructured Data Management (20)

Working Capital Management of Larsen & Turbo
Working Capital Management of Larsen & TurboWorking Capital Management of Larsen & Turbo
Working Capital Management of Larsen & Turbo
 
1408-Article Text-5906-1-10-20220221.pdf
1408-Article Text-5906-1-10-20220221.pdf1408-Article Text-5906-1-10-20220221.pdf
1408-Article Text-5906-1-10-20220221.pdf
 
Accounting Research Center, Booth School of Business, Universi.docx
Accounting Research Center, Booth School of Business, Universi.docxAccounting Research Center, Booth School of Business, Universi.docx
Accounting Research Center, Booth School of Business, Universi.docx
 
Performance Analysis through Financial Modelling
Performance Analysis through Financial ModellingPerformance Analysis through Financial Modelling
Performance Analysis through Financial Modelling
 
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
STOCK PRICE PREDICTION AND RECOMMENDATION USINGMACHINE LEARNING TECHNIQUES AN...
 
Module 7 Discussion ForumDiscussion Statement of Cash and Financi.docx
Module 7 Discussion ForumDiscussion Statement of Cash and Financi.docxModule 7 Discussion ForumDiscussion Statement of Cash and Financi.docx
Module 7 Discussion ForumDiscussion Statement of Cash and Financi.docx
 
Team Project Deliverable and PresentationYou team works for XY.docx
Team Project Deliverable and PresentationYou team works for XY.docxTeam Project Deliverable and PresentationYou team works for XY.docx
Team Project Deliverable and PresentationYou team works for XY.docx
 
My name is highlighted in Blue and thatt the portion I am respo.docx
My name is highlighted in Blue and thatt the portion I am respo.docxMy name is highlighted in Blue and thatt the portion I am respo.docx
My name is highlighted in Blue and thatt the portion I am respo.docx
 
Ratio analysis - Introduction
Ratio analysis - IntroductionRatio analysis - Introduction
Ratio analysis - Introduction
 
IRJET - Stock Recommendation System using Machine Learning Approache
IRJET - Stock Recommendation System using Machine Learning ApproacheIRJET - Stock Recommendation System using Machine Learning Approache
IRJET - Stock Recommendation System using Machine Learning Approache
 
F0272050059
F0272050059F0272050059
F0272050059
 
The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...
The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...
The Supply Chain Index - Improving Strength, Balance and Resiliency - 13 MAY ...
 
Dss project analytics writeup
Dss project analytics writeup Dss project analytics writeup
Dss project analytics writeup
 
Financial Performance Analysis of Selected Private Sector Banks in India
Financial Performance Analysis of Selected Private Sector Banks in IndiaFinancial Performance Analysis of Selected Private Sector Banks in India
Financial Performance Analysis of Selected Private Sector Banks in India
 
Ratios Analysis
Ratios Analysis Ratios Analysis
Ratios Analysis
 
IRJET - Bankruptcy Score Indexing
IRJET - Bankruptcy Score IndexingIRJET - Bankruptcy Score Indexing
IRJET - Bankruptcy Score Indexing
 
Capital structure and eps a study on selected financial institutions listed o...
Capital structure and eps a study on selected financial institutions listed o...Capital structure and eps a study on selected financial institutions listed o...
Capital structure and eps a study on selected financial institutions listed o...
 
International Journal of Business and Management Invention (IJBMI)
International Journal of Business and Management Invention (IJBMI)International Journal of Business and Management Invention (IJBMI)
International Journal of Business and Management Invention (IJBMI)
 
A Study on Ratio Analysis at Accord Puducherry
A Study on Ratio Analysis at Accord PuducherryA Study on Ratio Analysis at Accord Puducherry
A Study on Ratio Analysis at Accord Puducherry
 
Vencon Research International 2020
Vencon Research International 2020Vencon Research International 2020
Vencon Research International 2020
 

Unstructured Data Management

  • 1. Predicting Relative Risk of Financial Investment in Publicly-Traded Companies Krishna Vijaywargiy, Kshitij Deshpande, Manish Lingamallu MSA 8050: UNSTRUCTURED DATA MANAGEMENT FINAL PROJECT REPORT J. MACK ROBINSON SCHOOL OF BUSINESS, GEORGIA STATE UNIVERSITY
  • 2. Predicting Relative Risk of Financial Investment in Publicly-Traded Companies Krishna Vijaywargiy Master of Science in Analytics Georgia State University kvijaywargiy1@student.gsu.edu Kshitij Deshpande Master of Science in Analytics Georgia State University kdeshpande4@student.gsu.edu Manish Lingamallu Master of Science in Analytics Georgia State University jlingamallu1@student.gsu.edu ABSTRACT We address a text analysis problem: given a set of publicly-traded companies, we try to predict the relative risk of financial investment by analyzing the text content in the SEC-mandated financial report published by these companies annually. We focus on the latent information in parts 7 & 7a of this report which are a detailed description of the company’s financial status in the previous year and use well known analysis models to predict future trends from this information. By using text instead of numbers, our project revolves around the modern approaches for financial trend prediction and stretches the accuracy higher. INTRODUCTION While a lot of information about the financial health of a publicly-traded company is transmitted through numbers, insights galore can also be extracted from the textual content in the mandated 10-K reports submitted yearly to the United States Securities and Exchange Commission. The research for efficiently using text analysis to improve the accuracy of stock price prediction is still in the early stages but has certainly achieved a few milestones. This project report forecasts the relative investment risk for a set of companies by analyzing the text content in parts 7 and 7a of the 10k report. 
Part 7 gives a description of “Management’s Discussion and Analysis of Financial Condition and Results of Operations” and part 7a focuses on “Quantitative and Qualitative Disclosures about Market Risk”. Together they comprise discussions of trends, capital resources, liquidity, general business review, discontinued operations and interim financial statements, among other things. Extracting the latent information about a company’s performance in the past year and in the future from the aforementioned subsections of these freely available reports is an important step towards predicting the risk of investing in its stock. 10-K reports from 15 companies with positively trending stock prices and 15 with negatively trending stock prices were evaluated in this project to identify patterns in the text. Before the data can generate insights, it has to be carefully extracted and cleaned. Mining the text involves extracting only the textual data from the subsections and ignoring all the tables. The text then undergoes a cleaning process which, after removing all special characters, tokenizes the whole document into
  • 3. words and converts all tokens/words to their root words. These processed documents are then fitted with the models to generate a relative risk prediction model. The two models used for prediction are a Logistic Regression model, which models a categorical dependent variable by estimating its probability as a logistic function of the independent variable(s), and a Neural Network, which is trained on a corpus of documents and is then improved through feature fusion to generate high-ranking summaries. We then evaluate the results of both methods and compare the two models.
METHODOLOGY
We started our project by researching the available papers and identifying which parts of the 10-K reports are most significant for predicting financial risk. The paper ‘Predicting Risk from Financial Reports with Regression’ by Kogan, Levin, Routledge, Sagi and Smith [1] presents an insightful approach of constructing regression models for the volatility of stock returns, which is an empirical measure of financial risk. It aims to provide summary statistical facts that are not subject to any kind of human expertise, knowledge or regression. The paper establishes that a simple representation of text (unigrams and bigrams) can substantially improve on a strong stock price prediction baseline that does not use text. Volatility is measured as the standard deviation of a stock’s returns over a finite period of time. Thus, to predict risk, the paper uses clustering and Support Vector Regression for volatility prediction and finds that the text regression model’s predictions correlate well with true volatility and, in combination with historical volatility, provide higher accuracy. [1] However, instead of clustering, we followed a pattern using ‘text topics’ and ‘text parser’.
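The volatility definition above (the standard deviation of a stock's returns over a finite period) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in [1] or in this project: the use of simple period-over-period returns, the sample (n - 1) standard deviation, and the function names `simple_returns` and `volatility` are all assumptions made for the example, and the price series are made up.

```python
import math


def simple_returns(prices):
    """Period-over-period simple returns: r_t = p_t / p_(t-1) - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def volatility(prices):
    """Sample standard deviation of the returns implied by a price series."""
    returns = simple_returns(prices)
    if len(returns) < 2:
        raise ValueError("need at least three prices to estimate volatility")
    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return math.sqrt(variance)


# Made-up price series, for illustration only.
steady = [100.0, 101.0, 102.01, 103.0301]  # grows by 1% every period
choppy = [100.0, 110.0, 95.0, 108.0]       # swings up and down

print(volatility(steady))  # close to zero: every return is about 1%
print(volatility(choppy))  # noticeably larger
```

A stock whose returns barely vary from period to period has volatility near zero regardless of its growth rate, which is why volatility serves as a measure of risk rather than of performance.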
We evaluated the 5-year stock price trends of the companies along with their 10-K reports to create a start list of both positive and negative words, and added it to the text parsing node. In their paper “Combining Data and Text Mining Techniques for Analyzing Financial Reports”, Antonina Kloptchenko, Tomas Eklund and Barbro Back take a different approach to analyzing information from telecommunication companies and state that the textual part contains more precise information about company performance than the quantitative data available in the form of financial ratios. They used data and text mining methods to study hidden indications about the financial performance of companies in both the qualitative and the quantitative parts of their financial reports. They gathered information from all occurring matches in combination with clustering of the quantitative data, which made it possible to conclude that the analysis schema had captured a tendency: the text of the reports tends to foresee changes in the financial state of a company before those changes influence the financial ratios. Their results show that some future changes in financial performance can be anticipated by analyzing the text of the reports. Thus, given time constraints and our limited familiarity with financial ratios and terminology, we decided to focus only on the text contained in Item 7 and Item 7A of the SEC reports.
  • 4. IMPLEMENTATION
To extract the content from the 10-K reports, we used Python to parse each document and acquire the relevant text. This is a tedious process, as the companies all use different formats for their XBRL reports. We then stored the text in a pandas data frame to merge all the companies. The text in the data frame was then tokenized and stripped of markup, punctuation marks and any special characters other than letters of the alphabet. We used regular expressions in Python to prepare this text for analysis. After extracting the data in the desired format, we created a chain of nodes in SAS Enterprise Miner to run our data. The first node in the chain was the data-source node “Training Data”, which contained a comprehensive list of 30 companies with over 90 records that were parsed to train the model. An additional field, “Investment”, was added to the spreadsheet and assigned a “0” or a “1” based on the stock trend of the company: “1” for a company considered safe to invest in and “0” for a company considered not safe to invest in. The input to the data-source node was the Excel spreadsheet. The output from the data-source node included the company name, year of filing and Investment preference. If we were to run the experiment again, we would use growth or Profit/Equity ratios in order to have a numeric value to set as the threshold for our decisions. The data ran through a data-partition node with the partition set at a 75/25 split between training and validation data. After the data-partition node, the data was parsed using the Text Parsing node. We created the start list by analyzing textual data from the good companies and the bad companies, which we classified on the basis of their stock performance over the last 5 years. The output from the Text Parsing node flowed to a Text Filter node, where term weights were assigned based on Inverse Document Frequency. The weight reflected the importance of the word in a document.
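The cleaning and weighting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the regular expressions, the function names `clean_and_tokenize` and `idf_weights`, and the three sample sentences are hypothetical. The weight uses the textbook form log(N / df); the exact formula applied by the Text Filter node may differ. Reduction of tokens to root words (stemming) is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")        # markup tags such as <p> or <td>
NON_ALPHA_RE = re.compile(r"[^a-z]+")  # anything that is not a lowercase letter


def clean_and_tokenize(raw_text):
    """Strip markup, lowercase, drop non-alphabetic characters, split into tokens."""
    text = TAG_RE.sub(" ", raw_text).lower()
    text = NON_ALPHA_RE.sub(" ", text)
    return text.split()


def idf_weights(tokenized_docs):
    """Textbook inverse document frequency: log(N / df) for each term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Made-up sentences standing in for extracted report text.
docs = [
    "<p>Revenue increased 12% in 2015.</p>",
    "Revenue declined; liquidity remains weak.",
    "Cash flow and revenue were stable.",
]
tokens = [clean_and_tokenize(d) for d in docs]
weights = idf_weights(tokens)

print(tokens[0])           # ['revenue', 'increased', 'in']
print(weights["revenue"])  # 0.0, because the term appears in every document
print(weights["weak"])     # log(3), because the term appears in one of three documents
```

A term that occurs in every document receives a weight of zero under this form of the formula, so only terms that distinguish some reports from others contribute to later steps.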
Fig 2(a): Number of Documents by Frequency
Fig 2(b): High Frequency Terms
  • 5. The output of the text parsing nodes included the terms along with their weights. The output of the text filter node was passed to the text topic node. The text topic node matched terms that were strongly associated and created topics. Topics are collections of terms that describe and characterize a main theme or idea. For example, the term “profit” would have a strong association with terms like “revenue” and “cash”, and these terms would have a higher probability of appearing in the same topic. We limited the number of multi-term topics to 5. The results of the text filter node gave useful insights, including the most frequently used words, which were consistent with our earlier findings.
Fig 3(a): Weights for terms (result of Text Filter node)
Fig 4(a): Scatter plot of positive words, negative terms and the topics.
Fig 4(b): Weights of high frequency terms with other attributes like role, frequency etc.
  • 6. Some of the topics that we got as a result of the text topic node included: From our analysis, we found that documents which included terms of the topic have good investment preferences, whereas the documents which have terms related to. This method identified a collection of terms, which in turn helped us determine the performance. The output included the documents, the topics and the relevance of each document to each of the topics. This output was given to the Variable Selection node. The Variable Selection node helped reduce the number of input variables to the model by rejecting input variables that were not related to the market. The results window also displayed a histogram called “Variable Importance”. The histogram shows each variable's contribution to the prediction, based on the R-Square scores. Here, we observed that the topic “+weak +termination +average +tend” had the highest Variable Importance and therefore the greatest influence in deciding the investment preferences. We finally built two models, one using Logistic Regression and the other using neural networks. We analyzed the effects of both models based on the results for the validation and training data.
Fig 5: Document cutoff value for each topic and the number of documents satisfying the criteria
Fig 7: Histogram displaying the variable importance for prediction using R-Square scores.
Fig 8(a): Mean Predicted against Mean Target for Logistic Regression
Fig 8(b): Mean Predicted against Mean Target for Neural Networks
  • 7. From the results of both the Regression model and the Neural Network, we found that a strong relationship exists between the text of Item 7 and Item 7A and the company’s performance in the upcoming year. Comparing accuracy on the validation data, we infer it to be approximately 90% and 95% for the Logistic Regression and Neural Network models, respectively. The final model we developed is shown in Fig 10.
TESTING THE MODEL
To test the model that was built, we decided to pass new data through it and validate the result. The initial task was to decide whether to go with the Regression model or the Neural Network model. To resolve this, we used the Model Comparison node, which selects the best performing of its input models based on their errors. Fig 11(a) shows the statistics of both the Regression model and the Neural Network model. The selected model was the Regression model, with Average Square Error as the selection criterion.
Fig 10: SAS Model Developed
Fig 11(a): The error statistics of the Regression Model and Neural Network Model.
Fig 11(b): Regression Model selected based on Average Square Error as Criterion.
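The selection rule described above, choosing the model with the lowest Average Square Error on validation data, can be sketched as below. This is only an illustration of the criterion, not a reproduction of the Model Comparison node: the function names `average_squared_error` and `select_model` are hypothetical, and the targets and predictions are made-up numbers rather than results from this project.

```python
def average_squared_error(targets, predictions):
    """Mean of squared differences between targets and predicted values."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty and equal in length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def select_model(targets, predictions_by_model):
    """Return the name of the model with the lowest error, plus all errors."""
    errors = {
        name: average_squared_error(targets, preds)
        for name, preds in predictions_by_model.items()
    }
    best = min(errors, key=errors.get)
    return best, errors


# Made-up validation targets (1 = safe to invest, 0 = not safe) and predictions.
targets = [1, 0, 1, 1, 0]
candidates = {
    "regression": [0.9, 0.2, 0.8, 0.7, 0.1],
    "neural_network": [0.6, 0.4, 0.9, 0.5, 0.3],
}

best, errors = select_model(targets, candidates)
print(best)    # name of the model with the lowest average squared error
print(errors)  # error for each candidate
```

Because squared error penalizes confident wrong predictions more heavily than classification accuracy does, a model can win on this criterion even when another model has a slightly higher accuracy.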
  • 8. Finally, we used the Score node to score new raw data. The input to the Score node was both the output of the Model Comparison node and a new raw data source. Based on the model selected by the Model Comparison node, the Score node scored the new data. The raw data, with no Investment preference, is shown in Fig 12(b). The output of the Score node included the prediction for the field “Should Invest”, the target variable. Here, we can see that for the company VISA the “Should Invest” field is predicted at around 1.059, whereas for the company Frontier it is predicted at around 0.28. This clearly indicates that it is worth investing in VISA. To further validate the model, we checked the stock trends of these companies on Yahoo Finance and found the trend lines in the graphs to be in agreement with the results of the model.
CONCLUSION
Using the data from Item 7 and Item 7A of the 10-K reports, we concluded that there exists a correlation between the text of the filings and the trending stock performance of the companies. As suggested in one of the research papers, we would like to use the quantitative data in the financial reports to refine the relationship between the filings and stock performance. We would also like to add a start list that is more effective at distinguishing between positive and negative words in the financial sector, which we expect to enhance the overall accuracy of the model. This methodology would be more meaningful if we included other sections such as Item 1A, which describes the prevalent risk factors and plays an important role in determining company performance.
Fig 12(b): Raw data for model testing
Fig 13: The predicted values of the raw data, which is the output of the score node.
Fig 14(a): Stock trend of VISA
Fig 14(b): Stock trend of Frontier
  • 9. REFERENCES
1. Kogan, Levin, Routledge, Sagi and Smith. Predicting Risk from Financial Reports with Regression. http://homes.cs.washington.edu/~nasmith/papers/kogan+levin+routledge+sagi+smith.naacl09.pdf
2. Back, B., Toivonen, J., Vanharanta, H., and Visa, A. Comparing numerical data and text information from annual reports using self-organizing maps, International Journal of Accounting Information Systems (2), 2001, pp. 249-269.
3. Kohonen, T. Self-Organizing Maps, Leipzig, Germany: Springer-Verlag, 1997.
4. Kohut, G., and Segars, A. The president’s letter to stockholders: An examination of corporate communication strategy, Journal of Business Communication (29:1), 1992, pp. 7-21.
5. Lehtinen, J. Financial Ratios in an International Comparison, Vasa: Acta Wasaensia, 1996.