Using CART For Beginners with A Telco Example Dataset (Salford Systems)
Familiarize yourself with CART Decision Tree technology in this beginner's tutorial using a telecommunications example dataset from the 1990s. By the end of this tutorial you should feel comfortable using CART on your own with sample or real-world data.
The ARIMA analytical method predicts future values of a time series using a linear combination of past values and a series of errors. It is suitable for univariate data, whether stationary or non-stationary, with any type of data pattern. It produces accurate, dependable forecasts for short-term planning, and provides forecasted values of target variables for user-specified periods to illustrate results for planning, production, sales and other factors.
Effect of Temporal Collaboration Network, Maintenance Activity, and Experienc... (ESEM 2014)
Context: The number of defects fixed in a given month is used as an input for several project management decisions such as release timing, maintenance effort estimation and software quality assessment. Past activity of developers and testers may help us understand the future number of reported defects. Goal: To find a simple and easy-to-implement solution for predicting defect exposure. Method: We propose a temporal collaboration network model that uses the history of collaboration among developers, testers, and other issue originators to estimate the defect exposure for the next month. Results: Our empirical results show that the temporal collaboration model can be used to predict the number of exposed defects in the next month with an R² value of 0.73. We also show that temporality gives a more realistic picture of the collaboration network compared to a static one. Conclusions: We believe that our novel approach may be used to better plan for upcoming releases, helping managers make evidence-based decisions.
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin... (ThinkInnovation)
Primary Goals
1. To determine what factors are driving the lead conversion process.
2. To identify which leads are more likely to convert to paying customers.
Data Description
3. Dataset consists of 4613 rows and 15 columns.
Modelling Strategies
4. Plan
4.1 Perform Dummy Encoding
4.2 List Variables for Modeling
4.3 Identify metric of interest to judge model's performance
5. Build
5.1 Build Logistic Regression Model (Preliminary Model)
5.2 Observe the metrics of the model
6. Improve
6.1 Identify the significant variables
6.2 Rebuild model
6.3 Observe the metrics of the models
7. Decide
7.1 Compare the results of Logistic Regression model (Base model) and Decision Tree Model
7.2 Conclude on best model for this project
8. Recommend
8.1 Determine factors driving the lead conversion process
8.2 Recommend actions that may help identify which leads are more likely to convert to paying customers
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
In this presentation I review various data science techniques and discuss their usefulness to pricing actuaries working in general insurance.
This presentation was originally given at the TIGI webinar in 2020.
https://www.actuaries.org.uk/learn-develop/attend-event/tigi-2020-technical-issues-general-insurance
SLALOM Webinar Final Technical Outcomes Explained "Using the SLALOM Technica... (Oliver Barreto Rodríguez)
SLALOM organized two live sessions to present the final versions of our legal terms and technical specifications for #Cloud #SLAs. The sessions provide examples showing how to practically apply SLALOM to improve current industry practice for #Cloud #SLAs and to support the development of cloud computing metrics.
The first webinar covered SLALOM Technical track "Using metrics to improve Cloud SLAs".
Introductory course on concepts used in predictive control. For more files and MATLAB supporting information go to:
http://controleducation.group.shef.ac.uk/OER_index.htm
Reinforcement Learning (RL) refers to a branch of Artificial Intelligence (AI) that is able to achieve complex goals by maximizing a reward function in real time. Given that RL-based approaches can be applied to almost any optimization problem, enterprise adoption is picking up fast. In this talk, we will focus on Industrial Control Systems, and show why RL is a 'best fit' for many control optimization problems, from controlling combustion engines, to robotic arms cutting metals, to air conditioning systems in buildings.
Predictive Analysis - Using Insight-informed Data to Plan Inventory in Next 6... (ThinkInnovation)
Project’s Primary Goals
1. To analyse past sales data to generate insights into which features of mobile phones drive sales.
2. To use these insights to efficiently plan the inventory in the next 6 months.
Data Description
3. Dataset consists of sales and product-related features.
4. Dataset contains descriptions of the top 5 most popular mobile brands.
5. Dataset consists of 418 row-instances and 16 column-features.
Strategies Deployed for Modelling
6. Check for, and treat with suitable methods, missing values in the dataset.
7. Check for outliers and take suitable steps to treat them.
8. Check for multicollinearity amongst variables and use suitable steps to treat highly correlated variables.
9. Build a Linear Regression Model to predict the sales of mobile phones.
10. Report on the metrics of the models.
11. Identify the significant variables, and rebuild and report on the model using only these variables.
12. Based on the final model outcomes, determine the features driving mobile phone sales.
13. List recommendations to help with inventory planning for the next 6 months.
Author: Anthony Mok
Date: 16 Nov 2023
Email: xxiaohao@yahoo.com
Stock Market Trends Prediction after Earning Release.pptx
Big Data Project - Final version
1.
2. Machine Learning Analysis of Micro & Macroeconomic Variables
Alakar Srinivasan
Pinank Shah
Tjisana Kerr
Mihir Sanghavi
Vinayak Kishanchandani
3. Background & Objectives
Background:
The S&P Global 500 is a popular index consisting of 500 of the largest listed global companies in terms of market capitalization.
Stock index prediction is a relatively old but challenging problem to solve using new techniques.
The other challenging problem is to devise portfolio optimization strategies that outperform the index.
Key Challenges:
Stock markets can be extremely volatile and reactive to new events and information.
The Efficient Market Hypothesis suggests that stock markets cannot be predicted or outperformed.
Objective:
Compare the performance of various statistical and ML prediction techniques in estimating future daily returns of the S&P index, since stock market prediction is enticing and continues to remain one of the ultimate challenges.
[Chart] S&P Global 100 Companies: Market Capitalization ($ Billion) by year:
2009: 8.4, 2010: 12, 2011: 13, 2012: 12.9, 2013: 13.6, 2014: 15, 2015: 16.2
4. List of Prediction Methods For Comparison
1. Linear Regression
2. Lasso Regression
3. Holt Winters Filtering
4. K Nearest Neighbors Algorithm
5. Support Vector Regression
5. 1. Linear Regression
About Linear Regression:
• A linear model that determines the relationship between a dependent variable (daily returns) and a set of explanatory variables, in this case sector-specific returns, exchange rates, commodity prices, etc.
• One of the first regression techniques to be studied extensively
• Linear models are fitted using the Ordinary Least Squares (OLS) approach
• Pros: Simple to execute, easy to implement, easy to interpret model results, fast in processing
• Cons: A linear model will not be able to fit the non-linearity that exists in the context of stock market prediction
• Results are not reliable when very few data points are available, while the model overfits when a large amount of data is used for training
Observation / Result:
The model included all variables except one (the enterprise index), whose p-value was greater than 0.05 and which was hence removed.
With all other factors considered, the performance of the model was as follows - RMSE: 47.44
[Chart] Actual vs Predicted Values
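To make the step concrete, here is a minimal Python sketch (not the project's actual code) of an OLS fit with the p-value-based pruning described above; the factor names and synthetic data are placeholders for the real sector, exchange-rate and commodity series.

```python
# Minimal sketch: fit OLS on daily returns, drop predictors with p-value > 0.05,
# and report RMSE. Factor names and data are illustrative placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)), columns=["sector_ret", "fx_rate", "oil"])
df["sp_return"] = 0.5 * df["sector_ret"] - 0.3 * df["oil"] + rng.normal(scale=0.1, size=500)

X = sm.add_constant(df[["sector_ret", "fx_rate", "oil"]])
y = df["sp_return"]
model = sm.OLS(y, X).fit()

# Mirror the slide: keep only predictors significant at the 5% level, then refit.
keep = [c for c in X.columns if c == "const" or model.pvalues[c] <= 0.05]
refit = sm.OLS(y, X[keep]).fit()

rmse = float(np.sqrt(np.mean((y - refit.predict(X[keep])) ** 2)))  # in-sample, for brevity
print("kept:", keep, "RMSE:", rmse)
```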
6. 2. Lasso Regression
About Lasso Regression:
• Lasso stands for Least Absolute Shrinkage and Selection Operator and belongs to the class of generalized linear models
• Lasso regression performs both variable selection and regularization in order to improve prediction accuracy
• Pros: Lasso regression models are typically better than linear regression models, since Lasso includes variable selection as well as regularization, which helps reduce overfitting
• Lasso is better than ridge regression in that it performs both parameter shrinkage and selection, while ridge can perform only regularization
• Cons: Although Lasso can be extended to generalized linear models, its performance in predicting non-linear relationships is constrained
Observation / Result:
Using LASSO, a few additional factors - (1) the US 10-year bond, (2) Health, and (3) Telecom - turned out not to be significant for the model and were eliminated due to high p-values.
Due to this further variable selection and coefficient shrinkage, the prediction performance of the model improved to RMSE: 43.34
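A hedged sketch of this step: scikit-learn's cross-validated LassoCV on synthetic placeholder factors. Coefficients shrunk exactly to zero play the role of the factors the slide eliminates.

```python
# Sketch: LassoCV picks the L1 penalty by cross-validation. Standardization
# matters because the L1 penalty is scale-sensitive. X and y are synthetic
# stand-ins for the factor matrix and daily returns.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # 6 placeholder factors
y = 0.8 * X[:, 0] - 0.4 * X[:, 2] + rng.normal(scale=0.1, size=500)

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X, y)

coefs = lasso.named_steps["lassocv"].coef_
print("eliminated factor indices:", np.where(coefs == 0)[0])  # variables selected out
```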
7. 3. Holt Winters Filtering
About Holt Winters:
• Holt Winters filtering can be used to perform exponential smoothing, a simple forecasting method for time series data
• The method can apply as many as three low-pass filters recursively with exponential windows, defined by the parameters alpha, beta and gamma
• Exponential smoothing differs from a simple moving average in that it assigns exponentially decreasing weights to historical data instead of equal weights within the time window
• Pros: Better than SMA in that it determines the right parameters to assign exponentially decreasing weights to historical data, improving prediction performance
• Cons: Although simpler to execute than more complex time series forecasting techniques such as ARIMA, prediction accuracy is lower in comparison
Observation / Result:
Since the daily returns data is stationary and non-seasonal, only the alpha parameter was used to perform exponential smoothing.
An alpha of 0.191 was found to minimize the SSE, although the prediction performance of exponential smoothing was found to be much lower than that of the other techniques.
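The alpha-only setting above can be reproduced with statsmodels' simple exponential smoothing; this sketch uses a synthetic returns series, not the project's data.

```python
# Sketch: simple exponential smoothing with only the alpha (level) parameter,
# matching the slide's stationary, non-seasonal setting.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

returns = pd.Series(np.random.default_rng(0).normal(scale=0.01, size=500))
fit = SimpleExpSmoothing(returns).fit()  # alpha chosen by minimizing the SSE
print("alpha:", fit.params["smoothing_level"])
print("one-step forecast:", fit.forecast(1).iloc[0])
```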
8. 4. K Nearest Neighbors
About K-NN:
• The KNN algorithm is a method for classifying objects based on the closest training examples in the feature space
• KNN is the fundamental and simplest classification technique when there is little or no prior knowledge about the distribution of the data
• In addition to classification, KNN can also be used for regression and prediction: the average response of the K nearest neighbors is taken as the predicted value
• Pros: Computationally efficient, and can provide higher accuracy if there is good correlation between historical and future events
• Pros: A lazy, non-parametric learner, so there is no explicit model training phase
• Cons: Input parameters and dimensions need to be scaled appropriately in order for KNN to perform well
Observation / Result:
• KNN regression with k=5 gives the best RMSE of 15.5498 on the test data
• To predict tomorrow's close of the S&P 500, we find the k nearest neighbors of today's observations from the beginning of the data (t=1) and then compute the mean of the next day's closing prices of those k nearest neighbors
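A short scikit-learn sketch of the k=5 nearest-neighbor regression described above; the data is synthetic, and the features are standardized per the scaling caveat in the cons.

```python
# Sketch: k-NN regression with k=5 as on the slide. Features are scaled first
# because k-NN is distance-based; X and y stand in for today's observations
# and tomorrow's close.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)  # preserve time order
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("test RMSE:", np.sqrt(mean_squared_error(y_te, knn.predict(X_te))))
```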
9. 5. Support Vector Regression
About SVR:
The SVM technique can also be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin)
In the case of regression, a margin of tolerance (epsilon) is set in the approximation to the SVM to predict real values
This makes the algorithm more complicated than SVM classification, which should be taken into consideration
Pros: Can function well with a high number of variables and a high number of observations, and can model non-linearity better than simple regression
Cons: A complicated algorithm and output, which makes it difficult to interpret results and identify drivers
Also very time-consuming compared to other models during model training
Observation / Result:
As observed from the charts, SVR is able to predict the daily returns better than other techniques such as exponential smoothing and linear regression
SVR took about 2 hours to run compared to a few seconds for most other models. However, performance was as follows - RMSE: 11.25
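A sketch of epsilon-SVR with an RBF kernel in scikit-learn; the hyperparameters and data below are illustrative, not the values behind the reported RMSE.

```python
# Sketch: epsilon-SVR with an RBF kernel. The non-linear kernel is what lets
# it capture structure that linear models miss; C and epsilon are illustrative.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, shuffle=False)
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
svr.fit(X_tr, y_tr)
print("test RMSE:", np.sqrt(mean_squared_error(y_te, svr.predict(X_te))))
```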
10. Comparison of Techniques / Results
Parameter           | Linear Regression | Lasso Regression | Holt Winters Filtering | K-NN   | Support Vector Regression
RMSE                | 47.44             | 43.34            | 71.65                  | 15.5   | 11.25
Processing Time     | Very Low          | Low              | Low                    | Medium | High
Actionability       | High              | High             | Low                    | Low    | Medium
Implementation Ease | High              | Medium           | High                   | Medium | Low
11. Conclusion and Further Research
Key takeaways from the stock market prediction analysis:
Linear Regression has a considerably high RMSE value, possibly because the stock market does not follow a linear model.
Best Performing Model: Support Vector Regression is the best model, which is clearly evident from the RMSE values shown in the previous slide.
SVR, being non-linear, is a better model than all the others that were used.
In terms of processing time, however, Support Vector Regression took the maximum time to process, as shown in the previous slide.
12. List of Time Series Models for Equity Factors
1. Linear Approximation
2. I.I.D. Analysis
3. Technical Indicators
4. Neural Networks
13. 1. Linear Approximation
About Linear Approximations:
• Linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data.
• Pros: Simple model.
• Cons: Patches of positive and negative residuals if the data is curved.
• In polynomial regression, the relationship between the independent variable x and the dependent variable y is modelled as an nth-degree polynomial. It fits a nonlinear relationship between the value of x and the corresponding conditional mean of y.
• Pros: A polynomial fits curvilinear terms better.
• Cons: There is a likelihood of overfitting when using the polynomial regression method.
Observations - R²:
Equity | Linear   | Poly
MS     | 0.000575 | 0.125526
JPM    | 0.132070 | 0.655651
GS     | 0.092261 | 0.279183
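A sketch of the linear-vs-polynomial comparison behind this table, computing R² for both fits; the data and polynomial degree are illustrative, not the deck's.

```python
# Sketch: compare linear and polynomial least-squares fits by R-squared,
# in the spirit of the per-equity table above.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 0.4 * x ** 2 - 0.2 * x + rng.normal(scale=0.3, size=300)

def fit_r2(degree: int) -> float:
    # Expand x into polynomial terms, fit OLS, and score the fit in-sample.
    X = PolynomialFeatures(degree).fit_transform(x.reshape(-1, 1))
    return r2_score(y, LinearRegression().fit(X, y).predict(X))

print("linear R2:", fit_r2(1))
print("poly R2:  ", fit_r2(3))  # degree is illustrative
```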
14. 2. I.I.D. Analysis
About I.I.D. Analysis:
A sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent.
Observation / Result:
Notice that the scatter plot is symmetrical with respect to the reference axes and resembles a circular cloud, which implies that all the terms are identically distributed. Thus, the returns for both sets of data are an invariant; in particular, all the terms in the series are independent of each other.
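The visual check described above can be reproduced with a lag-1 scatter plot; this sketch uses a synthetic returns series rather than the project's data.

```python
# Sketch of the slide's visual check: scatter each return against its lagged
# value. A symmetric, roughly circular cloud around the reference axes is
# consistent with the terms being i.i.d. invariants.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

returns = pd.Series(np.random.default_rng(0).normal(scale=0.01, size=1000))
plt.scatter(returns.shift(1), returns, s=5, alpha=0.5)
plt.axhline(0, linewidth=0.5)  # reference axes
plt.axvline(0, linewidth=0.5)
plt.xlabel("return at t-1")
plt.ylabel("return at t")
plt.title("Lag-1 scatter: i.i.d. check")
plt.show()
```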
15. 3. Technical Indicators
About Technical Indicators:
• Momentum: Momentum is the rate of acceleration of a security's price or volume. The idea of momentum in securities is that their price is more likely to keep moving in the same direction than to change directions.
• Relative Strength Indicator: The relative strength index (RSI) is a technical momentum indicator that compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions of an asset.
• Price Rate of Change: The price rate of change (ROC) is a technical indicator that measures the percentage change between the most recent price and the price "n" periods in the past.
• Price Volume Trend: A technical indicator consisting of a cumulative volume line that adds or subtracts a multiple of the percentage change in share price trend and current volume, depending upon their upward or downward movements.
• Bollinger Bands of 50-Day Moving Average: A Bollinger Band® is a band plotted two standard deviations away from a simple moving average, developed by famous technical trader John Bollinger.
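A pandas sketch of how three of these indicators might be computed; the synthetic price series and the window lengths (14 for RSI/ROC, 50 for the Bollinger moving average) are conventional defaults, not values from the deck.

```python
# Sketch: RSI, ROC, and Bollinger Bands from a "close" price column.
import numpy as np
import pandas as pd

px = pd.DataFrame({"close": 100 + np.random.default_rng(0).normal(0, 1, 300).cumsum()})

n = 14
delta = px["close"].diff()
gain = delta.clip(lower=0).rolling(n).mean()          # average gain over the window
loss = (-delta.clip(upper=0)).rolling(n).mean()       # average loss over the window
px["rsi"] = 100 - 100 / (1 + gain / loss)             # Relative Strength Index
px["roc"] = px["close"].pct_change(periods=n) * 100   # Price Rate of Change, in %
ma50 = px["close"].rolling(50).mean()                 # 50-day simple moving average
sd50 = px["close"].rolling(50).std()
px["bb_upper"], px["bb_lower"] = ma50 + 2 * sd50, ma50 - 2 * sd50  # Bollinger Bands
print(px.tail(3))
```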