Financial Forecasting Fortune 500 Stocks Using Statistical Classification Learning Algorithms

Zack Pollak • Zach Murray • Kyle He
● Department of Statistics, University of Michigan — Stats 415 Data Mining
● Contact information:
○ Z. Pollak — UM ‘16 BS Statistics — zpollak@umich.edu
○ Z. Murray — UM ‘16 BS Informatics — zhmurray@umich.edu
○ K. He — UM ‘17 BS Informatics — kylehe@umich.edu
Abstract:
A theory lying at the foundation of financial asset pricing, the Efficient Market
Hypothesis (EMH), claims that the price of an asset reflects all available information,
implying it is not possible to “beat the market”. Weak-form EMH implies the expected value of
any stock, given all available security market information, is equal to the current market price of
the stock. This paper applies statistical learning methods to historical securities data with the
intention of forecasting the stock price movements of Fortune 500 companies from multiple
verticals.
Keywords: Financial forecast • K-nearest neighbors (KNN) • Linear discriminant analysis
(LDA) • Quadratic discriminant analysis (QDA) • quantmod • R • Statistical learning

Methodology
Data Collection
We pulled our data from Yahoo! Finance using the getSymbols() function in the R
quantmod package to create an xts (eXtensible Time Series) object of stock information, which
we then transformed into a data frame for use with the statistical learning functions found in the
R packages MASS and class. Each stock’s data is indexed by date, ranging from 2007 up until
today’s date; up-to-date information is one of the many benefits provided by quantmod. The
xts object from getSymbols() consists of the stock’s daily opening price, high price, low price,
closing price, trading volume, and adjusted closing price. Of these variables, we use
the stock’s closing price and the volume of shares traded for each day.
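The download step described above might look roughly like the following sketch. The helper name fetch_stock and its arguments are hypothetical; getSymbols(), Cl(), and Vo() are quantmod functions, while index() comes from zoo. The function is only defined here, since calling it requires quantmod and a network connection.

```r
# Hypothetical helper: download one ticker and keep the date, close, and volume.
# Assumes the quantmod and zoo packages are installed.
fetch_stock <- function(symbol, from = "2007-01-01", to = Sys.Date()) {
  # auto.assign = FALSE returns the xts object instead of assigning it by name
  x <- quantmod::getSymbols(symbol, src = "yahoo", from = from, to = to,
                            auto.assign = FALSE)
  data.frame(
    date   = zoo::index(x),                  # trading dates from the xts index
    close  = as.numeric(quantmod::Cl(x)),    # daily closing price
    volume = as.numeric(quantmod::Vo(x))     # daily shares traded
  )
}

# Example call (not run here because it needs network access):
# wmt <- fetch_stock("WMT")
```

Returning a plain data frame keeps the later modeling steps independent of xts, at the cost of losing xts date-range subsetting.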
Using the getSymbols() data, we created another data frame, taking advantage of
the date indexing of xts, containing the daily volume of shares traded, 10 lagged returns, and our
response variable, direction. We focused our analysis on stable Fortune 500 stocks:
Walmart (WMT) representing the retail sector, Exxon (XOM) representing petroleum refining,
and Alphabet (GOOGL) representing the tech sector. We focus on industry leaders because they
tend to represent their sector well and carry relatively little risk. This lets us concentrate on
the strength of each learning method when determining the best method to forecast a financial
time series.
Building the Data Frames
A vector of returns was calculated using dailyReturn(), which computes the
close-to-close return as a percentage for a given stock for each day since 2007. Using
these returns we created our response variable, direction. Direction is a two-level categorical
factor taking the values “UP” and “DOWN” depending on the sign of the day’s return. Lags were
calculated using a time series lag function included in the quantmod package in order to
generate 10 lagged return vectors (where the k-th vector is the stock’s returns vector shifted k days
down, for k = 1, ..., 10). We chose 10 lags because this allowed us to capture the last two weeks
of trading.
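As a rough illustration of this construction, the sketch below builds the returns, the lagged columns, and the direction factor in base R on simulated prices, so it runs without quantmod or a network connection. The simulated data, the helper lag_vec, and the column names are assumptions for illustration; the text does not say whether volume enters as same-day or prior-day volume, so same-day volume is assumed here.

```r
# Simulated daily closes and volumes (hypothetical data, not real quotes)
set.seed(415)
n      <- 300
close  <- 100 * cumprod(1 + rnorm(n, mean = 0, sd = 0.01))
volume <- round(runif(n, 1e6, 5e6))

# Close-to-close simple returns; the first day has no previous close
ret <- c(NA, diff(close) / close[-n])

# Hypothetical helper: shift a vector down by k positions, padding with NA
lag_vec <- function(x, k) c(rep(NA, k), x[seq_len(length(x) - k)])

# Ten lagged return columns: lag1 is yesterday's return, lag10 is ten days back
n_lags <- 10
lags <- sapply(seq_len(n_lags), function(k) lag_vec(ret, k))
colnames(lags) <- paste0("lag", seq_len(n_lags))

# Two-level response: "UP" for a positive return, "DOWN" otherwise (zero is DOWN)
direction <- factor(ifelse(ret > 0, "UP", "DOWN"), levels = c("DOWN", "UP"))

# Assumption: same-day volume is used as the volume predictor
stock_df <- data.frame(direction = direction, volume = volume, lags)

# Drop the leading rows whose return or lags are undefined
stock_df <- na.omit(stock_df)
```

With 10 lags and one undefined first return, the first 11 rows are dropped, which is the NA filtering step the paper describes.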
Filtering the Data
The data were split into a training set and a testing set in order to validate model performance.
Each model is fit using the training set, which we define to be all of the stock data
from January 2007 through the end of December 2013. The test set for each model is all of the
stock data from January 2014 up until April 20, 2016. All observations containing an NA introduced
by lagging were omitted from each data frame.
In order to implement our statistical learning methods, we partitioned the xts objects by
date. The training set accounted for roughly ¾ of the data (seven years), while the test set
represented the remaining ¼ (a little over two years). We split the data before building the
direction or any lag vectors to take advantage of xts indexing. We created vectors for each set by
extracting the respective training and testing volumes of stocks traded, 10 days of lagged returns,
and directions from the train and test xts objects. Then, binding these vectors by column into new
training and testing data frames, we were ready to implement Linear Discriminant Analysis,
Quadratic Discriminant Analysis, and K-Nearest Neighbors classification.
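A minimal sketch of a chronological split with the cutoff dates given in the text follows. It uses a simulated weekday calendar (holidays ignored) in a plain data frame, which is an assumption made so the example runs offline; with an xts object, date-range subsetting such as x["2007/2013"] would play the same role.

```r
# Hypothetical weekday calendar covering the study period (holidays ignored)
dates <- seq(as.Date("2007-01-01"), as.Date("2016-04-20"), by = "day")
wday  <- as.POSIXlt(dates)$wday          # 0 = Sunday, ..., 6 = Saturday
dates <- dates[wday %in% 1:5]            # keep Monday through Friday

# Simulated returns attached to those dates (not real data)
set.seed(1)
obs <- data.frame(date = dates, ret = rnorm(length(dates), sd = 0.01))

# Chronological split: everything before 2014 trains, the rest tests
split_date <- as.Date("2014-01-01")
train <- obs[obs$date <  split_date, ]
test  <- obs[obs$date >= split_date, ]

# Share of observations that falls in the training window
train_share <- nrow(train) / nrow(obs)
```

Splitting on a date rather than sampling rows at random keeps every test observation strictly later than every training observation, which matters for time series.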
Statistical Modeling and Analysis
Our primary objective was to estimate the success rates of predicting a stock’s
direction using the different statistical learning methods we explored this semester. Our
secondary objective was to use the most successful prediction methods to create our own
quantitative trading strategies; however, this will be explored in a follow-up study. All statistical
analysis was done using RStudio (Version 0.99.484) with various additional packages:
class, ggplot2, MASS, PerformanceAnalytics, quantmod and xts. Our analysis took
the form of exploring the effects of past returns and volume on future returns; visualizing
and interpreting model output then led to an assessment of statistical learning for financial
forecasting.
Comparing Learning Methods
Statistical Classification — Theoretical Framework
As the goal of this study is to successfully predict the direction a stock will move
tomorrow given the past 10 days’ returns and the stock’s volume, we look to build a model to
classify the stock’s direction as a function of the explanatory variables using the training set.
Predictions will be made for the unseen test set in order to validate the accuracy of the model.
This study only incorporates binary classification methods since Direction is limited to “UP” and
“DOWN” (a day with a return of exactly zero, if one occurs, is labeled “DOWN”).
The mathematical framework behind binary classification includes:
● Two-class label: $Y \in \{c_1, c_2\}$
● Input variables: $X = (X_1, \ldots, X_p) \in \mathbb{R}^p$
This study will have the response variable labels, $c_1$ and $c_2$, be represented by the stock’s
direction. The input variables are the 10 lagged returns and the stock’s volume, so $p = 11$. The
goal of statistical classification is to produce a classifier that accurately predicts unseen cases.
Using the training class-conditional densities, $f_k(x)$, and class prior probabilities, $\pi_k$,
we can apply Bayes’ Theorem to estimate the posterior probability,
$$P(Y = c_k \mid X = x) = \frac{\pi_k f_k(x)}{\pi_1 f_1(x) + \pi_2 f_2(x)}, \quad k = 1, 2.$$
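To make the Bayes step concrete, here is a small numeric sketch with a single predictor and two Gaussian class-conditional densities. All numbers (priors, means, standard deviation, and the observed return) are hypothetical and chosen only for illustration.

```r
# Hypothetical class priors and one-dimensional Gaussian class densities
prior <- c(DOWN = 0.45, UP = 0.55)
mu    <- c(DOWN = -0.002, UP = 0.002)   # class means of yesterday's return
sigma <- 0.01                           # common standard deviation

# Observed value of the predictor for the day being classified
x <- 0.005

# Class-conditional likelihoods f_k(x)
lik <- dnorm(x, mean = mu, sd = sigma)

# Bayes' theorem: posterior is prior times likelihood, normalized over classes
posterior <- lik * prior / sum(lik * prior)
names(posterior) <- names(prior)

# The predicted class is the one with the larger posterior probability
predicted <- names(posterior)[which.max(posterior)]
```

Because the observed value lies closer to the UP mean, the posterior probability of UP ends up above its prior.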

Linear Discriminant Analysis
Linear discriminant analysis (LDA) is a learning method that
finds a linear combination of continuous predictor variables to predict a categorical response
variable; it can also serve as a dimensionality reduction technique. LDA is a generative, linear
classifier with underlying assumptions that the
class-conditional density functions, $f_k(x)$, are Gaussian and the classes share a common
covariance matrix (homoscedasticity across classes: $\Sigma_k = \Sigma \ \forall k$).
These assumptions lead to a linear discriminant function for each class k,
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k,$$
where $\mu_k$ is the class mean and $\pi_k$ the class prior, and the decision rule is given by
$\hat{y}(x) = \arg\max_k \delta_k(x)$, generating the LDA decision
boundaries. The MASS package’s lda() function can be used with the training set to
fit the LDA model; in this case, Direction is the response variable while the rest of
the data frame makes up the explanatory variables. R’s predict() function may be used in
conjunction with the LDA model and the unseen test set to generate class predictions, “UP” or
“DOWN”, for each test day.
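The fit-and-predict workflow just described could be sketched as follows. The data are simulated with a weak, made-up dependence on the first lag, so the resulting error rate says nothing about the paper's results; lda() and predict() are used as documented in MASS.

```r
library(MASS)

# Simulated predictors: one volume column and ten lagged returns (hypothetical)
set.seed(415)
n <- 600
p <- 11
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, c("volume", paste0("lag", 1:10))))

# Simulated labels with a weak, made-up dependence on lag1
prob_up   <- plogis(0.2 + 0.5 * X[, "lag1"])
direction <- factor(ifelse(runif(n) < prob_up, "UP", "DOWN"),
                    levels = c("DOWN", "UP"))
dat <- data.frame(direction, X)

# Earlier rows train the model, later rows are held out
train <- dat[1:400, ]
test  <- dat[401:600, ]

# Fit LDA with direction as the response and all other columns as predictors
lda_fit  <- lda(direction ~ ., data = train)
lda_pred <- predict(lda_fit, newdata = test)

# Confusion matrix and misclassification rate on the held-out rows
conf_mat   <- table(predicted = lda_pred$class, actual = test$direction)
test_error <- mean(lda_pred$class != test$direction)
```

The posterior component of lda_pred holds the class probabilities for each test day, which is what a position rule based on LDA posteriors would read.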
Figure 1: LDA discriminant density histograms for WMT, XOM, and GOOGL

Plots of the LDA classification density histograms for each stock can be found in Figure 1, and
the LDA test error rates for each stock in Figure 2. It is clear that LDA tends to predict that the
stock price will move up the next day, which is confirmed by the test error table. Looking at the
density plots from the LDA model, we see that in all three cases (WMT, XOM, GOOGL) the
LDA classification rules predicted the stock would move up a majority of the time. LDA was
much more accurate at predicting upward moves than at predicting downward moves.
Figure 2: LDA classification test error rate table
Here we see that the error rates for the LDA classification algorithm are rather similar for
each stock; however, Exxon resulted in a test error rate higher than 50%, implying we were
unable to forecast Exxon’s direction better than chance using LDA with 10 days of lagged returns
and the trading volume. LDA performed better than chance for Walmart
and Google, whose misclassification error rates were below 50%. We expected LDA to
be a good forecasting method because our analysis was run on large-cap stocks, where the
potential percentage increase or decrease is very small and the stocks tend to be rather stable
week by week.
Quadratic Discriminant Analysis
LDA is actually a special case of Quadratic Discriminant Analysis (QDA). QDA is a
generative learning method similar to LDA in that the class-conditional densities, $f_k(x)$,
are modeled as multivariate Gaussian. However, the covariance matrices are not assumed to be
equal for QDA, resulting in a quadratic discriminant function for each class k:
$$\delta_k(x) = -\tfrac{1}{2} \log|\Sigma_k| - \tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k,$$
where the decision rule is once again given by $\hat{y}(x) = \arg\max_k \delta_k(x)$ to define the QDA
decision boundaries.
QDA is preferable to LDA when the covariance matrices differ noticeably between classes and
there is a large number of observations. The qda() function in the MASS package can be
used in the same way as lda() to build a model and predict test classes. Looking at
Figure 3, the QDA test misclassification error straddles 50% for all three equities. As we surely
have enough observations, we can attribute the decline in prediction accuracy to similar
covariance structures among classes.
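The point about class covariances can be illustrated with a small simulation in which the two classes share a mean but differ in spread, a setting where QDA has something to exploit and LDA does not. The data, the two predictor names, and the random (rather than chronological) split are assumptions for this sketch only.

```r
library(MASS)

# Two simulated classes with equal means but different spread (hypothetical)
set.seed(415)
n_per_class <- 2000
x_down <- matrix(rnorm(n_per_class * 2, sd = 1), ncol = 2)
x_up   <- matrix(rnorm(n_per_class * 2, sd = 3), ncol = 2)

dat <- data.frame(
  direction = factor(rep(c("DOWN", "UP"), each = n_per_class),
                     levels = c("DOWN", "UP")),
  rbind(x_down, x_up)
)
names(dat)[2:3] <- c("lag1", "lag2")

# Random half/half split; acceptable here because the rows are independent draws
train_idx <- sample(nrow(dat), size = nrow(dat) / 2)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Same formula interface for both classifiers
lda_fit <- lda(direction ~ ., data = train)
qda_fit <- qda(direction ~ ., data = train)

# Held-out misclassification rates
lda_err <- mean(predict(lda_fit, newdata = test)$class != test$direction)
qda_err <- mean(predict(qda_fit, newdata = test)$class != test$direction)
```

When the class covariances are instead nearly equal, QDA spends extra parameters estimating differences that are not there, which is consistent with the explanation given in the text for its weaker results.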
Figure 3: QDA classification test error rate table
The error rates for the QDA classification algorithm on the three stocks are all fairly
similar. However, in this case both Exxon and Walmart had error rates higher than 50%, implying
that we were not able to accurately forecast direction using QDA classification with 10 days of
lagged returns and trading volume. Google did achieve a test error rate of less than 50%, but since
it is so close to 50%, we would not claim that QDA was an accurate prediction method. We expected
QDA to perform slightly worse than LDA due to similar variance structures between lagged
returns and the natural stability of Fortune 500 stocks.
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a flexible, nonparametric method where we predict a new point
by looking at the k nearest points, referred to as its “neighbors”. For use in classification
problems, the classifier can be written as a majority vote,
$\hat{y}(x) = \arg\max_{c} \sum_{x_i \in N_k(x)} I(y_i = c)$, where $N_k(x)$ is the neighborhood
consisting of the k points closest to the point being predicted.

Figure 4: KNN classification test error rate for k in [1,20]
Figure 5: KNN classification test error rate table

The KNN test error rate plot in Figure 4 uses k values ranging from 1 to 20 in order to
tune our models to the k values that provide the minimum test error rates. It is
clearly visible in the KNN error plot that the k value minimizing test error differed across
the three stocks. As the k value increases, the model complexity decreases. Thus, a smaller optimal
k value means a more complex model was needed to make the most accurate prediction.
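A sketch of the k sweep follows, using knn() from the class package on simulated data. Standardizing the predictors with training-set means and standard deviations is an assumption added here because KNN is distance based; the text does not say whether the predictors were scaled.

```r
library(class)

# Simulated predictors and labels (hypothetical; 11 columns as in the study)
set.seed(415)
n <- 600
p <- 11
X <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(X[, 1] + rnorm(n) > 0, "UP", "DOWN"),
            levels = c("DOWN", "UP"))
train_idx <- 1:400

# Assumption: standardize using training-set statistics only
ctr <- colMeans(X[train_idx, ])
scl <- apply(X[train_idx, ], 2, sd)
Xs  <- scale(X, center = ctr, scale = scl)

# Test error rate for each k from 1 to 20
ks  <- 1:20
err <- sapply(ks, function(k) {
  pred <- knn(train = Xs[train_idx, ], test = Xs[-train_idx, ],
              cl = y[train_idx], k = k)
  mean(pred != y[-train_idx])
})

# The k with the smallest test error, as selected in the text
best_k <- ks[which.min(err)]
```

Plotting err against ks would give a curve of the kind shown in Figure 4 for one stock.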
Google resulted in an error rate higher than 50%, indicating we were unable to accurately
forecast its direction. The KNN classification method was able to provide adequate prediction
rates for Exxon and Walmart, with both of their minimized error rates falling below 50%. KNN
was the most accurate prediction method, giving the lowest average error rate across the three
stocks; however, it was also the least stable across the three stocks.

We did not expect KNN to perform nearly as well as it did given the
dimensionality of the data, so these results were a little surprising to us. In the case of Walmart,
the smallest error rate was achieved with k=1, giving the lowest test error rate of the study;
with k=1, KNN reduces to the nearest-neighbor rule.
Conclusions
LDA proves to be the best statistical learning method in this study for forecasting a financial
time series. The misclassification error rate for LDA was noticeably smaller or more stable than
the other learning methods’ error rates. The misclassification error rate for QDA was too high,
while the KNN error rate was too unstable in comparison to the LDA error rate. This may be due
to the high dimensionality of our data and homoscedasticity between lags, allowing LDA to
thrive.
QDA does not seem to be the best choice when forecasting financial data using past
observations; however, there may be some limitations to our models. Rather than using 10 lags,
it could have been beneficial to use some sort of information criterion to select the optimal lags
to use with each learning method. In the case of QDA, it would have made sense to only include
the lags with the most dissimilar variances. Using fewer, more significant lags could also
potentially allow KNN to further excel due to reduced dimensions. This reduction in
dimensionality could have been explored by means of the autocorrelation structure of the data. It
is worthwhile to note that including factors such as accounting ratios, economic measures, and
even Google search frequencies could have enriched and improved our
forecasting. Follow-up research to this project will consist of developing a long/short day trading
strategy using LDA posterior probabilities as position rules.

References
Efficient market hypothesis. In Morningstar. Retrieved from
    http://www.morningstar.com/InvGlossary/efficient_market_hypothesis_definition_what_is.aspx

Georgakopoulos, H. (2015). Quantitative trading with R: understanding mathematical and
    computational tools from a quant's perspective. Palgrave Macmillan US.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data
    mining, inference, and prediction (2nd ed.). Springer.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An introduction to statistical
    learning: with applications in R (corrected 6th printing). Springer.

Wang, L., & Zhu, J. (2010). Financial market forecasting using a two-step kernel learning
    method for the support vector regression. Annals of Operations Research, 174(1), 103-120.
    doi:10.1007/s10479-008-0357-7

Wickham, H. (2009). ggplot2: elegant graphics for data analysis. Springer.

Zhu, J. (2016). Lecture on assessing model accuracy. Personal Collection of J. Zhu, University of
    Michigan, Ann Arbor MI.

Zhu, J. (2016). Lecture on classification LDA, QDA and LR. Personal Collection of J. Zhu,
    University of Michigan, Ann Arbor MI.

Zhu, J. (2016). Lecture on linear model selection and regularization. Personal Collection of J.
    Zhu, University of Michigan, Ann Arbor MI.
