Stock Market Analysis using Data Mining & Machine Learning
SCHOOL OF ENGINEERING
COMPUTER ENGINEERING & INFORMATICS
DEPARTMENT
Stock Market Analysis using Data Mining and
Machine Learning Algorithms
DIPLOMA THESIS
Grivas G. Panagiotis
griva@ceid.upatras.gr
Advisor: Professor Vasileios Megalooikonomou
Patra, September 2014
Abstract
The huge volume of financial data available today has created the need for technical
analysis and information processing that help investors make sound
decisions. The subject of this Diploma Thesis is the extraction of useful information
from financial data. For the purposes of this work we use historical daily data from the
S&P 500 index. The basic data mining tasks studied are the following:
Preprocessing, Technical Feature Extraction, Clustering, Classification, Lag
Correlation and Forecasting. The thesis is organized
into seven chapters.
The first chapter is introductory and states the aim and motivation of this
thesis. The second chapter presents the basic market analysis techniques, which use
charts and indicators. The third chapter examines the Data Mining methods and
Learning Algorithms that aim to discover patterns in the data and to construct useful
models that closely fit the characteristics studied. The fourth chapter presents the
way in which data mining techniques are applied to the analysis of shares,
highlighting the importance of each data mining algorithm for the stock market. The
fifth chapter describes the environments, Matlab and Weka, in which we run the data
mining algorithms in order to analyze stock market data.
The sixth chapter covers the experimental procedure of the present work. In
the first section of the chapter, Preprocessing techniques are applied to
improve the quality of the share data by removing errors and incorrect attribute
values. The second section examines the Clustering problem, where the
k-Means and Hierarchical algorithms are used to detect 'similar' shares. Initially
we evaluate the performance of the Hierarchical Clustering algorithm with the Euclidean and
Dynamic Time Warping (DTW) distance metrics, for various types of linkage between clusters. Then we
evaluate the performance of the k-Means and Hierarchical (with the Ward linkage criterion)
Clustering algorithms for various numbers of clusters. Finally we apply the Clustering
algorithms for a fixed number of clusters and assess the quality of the resulting clusters
with the Intra/Inter cluster distance and Silhouette value techniques. The third
section applies the k-Nearest Neighbors Classification algorithm so that each new
stock entering the stock market is classified into one of the predefined groups obtained
through Clustering. The Classification method is then evaluated by checking
whether the shares are assigned to the appropriate class. In the fourth section we
use the Pearson correlation coefficient to find Lag Correlation between shares. First we detect shares
with a positive (proportional) or inverse temporal association at non-zero delay, and examine
whether these shares belong to the same or to different classes, as defined at the outset by the
Hierarchical-DTW Clustering process. We also identify the shares with positive
or inverse correlation at zero delay. Finally we apply the lag correlation
algorithm to check for correlation between stocks not only over their entire length,
but also over a window of given length that starts at a specified time. In the fifth section we
apply Forecasting algorithms to a set of stocks, where we construct a suitable
prediction model (using the first 225 closing values as the training set) that forecasts
the last 20 closing values of the shares. The forecasting methods applied are the
following: the statistical technique ARIMA, Artificial Neural Networks (Multilayer
Perceptron), Decision Trees (M5P Tree), Support Vector Machines (SMOreg), Linear
Regression and Instance-Based Learning Algorithms (k-Nearest Neighbors). Finally
we evaluate the performance of the forecasting algorithms using both the mean
absolute percentage error (MAPE) between actual and predicted values and
the prediction accuracy for the investment reliability of the shares over a 20-day horizon
(Trend Prediction).
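
The lag correlation step of the fourth section can be illustrated with a minimal Python sketch. It is not the thesis implementation (the experiments were run in Matlab and Weka). The function names `pearson`, `lagged_pearson` and `best_lag`, the window parameters `start` and `length`, and the toy series are all assumptions made for illustration. The sketch correlates one series with a delayed copy of another, either over the whole length or inside a window that starts at a chosen time.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def lagged_pearson(x, y, lag, start=0, length=None):
    """Correlate x[t] with y[t + lag] inside the window [start, start + length).

    Hypothetical helper: with the default arguments the whole series is used.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    stop = len(x) if length is None else start + length
    xw, yw = x[start:stop], y[start:stop]
    # Drop the last `lag` points of x and the first `lag` points of y so that
    # position t of x is paired with position t + lag of y.
    return pearson(xw[:len(xw) - lag], yw[lag:])


def best_lag(x, y, max_lag):
    """Return (lag, r) with the largest |r| over the lags 0..max_lag."""
    scores = [(lag, lagged_pearson(x, y, lag)) for lag in range(max_lag + 1)]
    return max(scores, key=lambda s: abs(s[1]))


# Toy data (not from the thesis): `follower` is an inverse copy of `leader`
# delayed by two steps, so the strongest association is inverse, at lag 2.
leader = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 12.0, 10.0, 11.0]
follower = [0.0, 0.0] + [-v for v in leader[:-2]]

lag, r = best_lag(leader, follower, max_lag=4)
print(f"strongest association at lag {lag}, r = {r:.3f}")

# The same check restricted to an 8-point window starting at t = 2.
print(f"windowed r = {lagged_pearson(leader, follower, 2, start=2, length=8):.3f}")
```

A positive `r` at the selected lag corresponds to a positive (proportional) association and a negative `r` to an inverse one. Searching only non-negative lags means each ordered pair of shares has to be tested in both directions.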
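
The evaluation protocol of the fifth section (225 closing values for training, 20 for testing, scored with MAPE and a trend check) can be sketched as follows. This is a hedged illustration, not the thesis code. The drift forecast is only a stand-in baseline and is not one of the models listed above. The trend rule is also an assumption, since the abstract does not define Trend Prediction: here a forecast counts as a hit when it agrees with the actual series on whether the final close of the horizon lies above the last training close. The names `split_series`, `mape`, `trend_hit` and `drift_forecast` are hypothetical.

```python
def split_series(closes, n_train=225, n_test=20):
    """Split a closing-price series into the training and test parts."""
    if len(closes) < n_train + n_test:
        raise ValueError("series is shorter than n_train + n_test")
    return closes[:n_train], closes[n_train:n_train + n_test]


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def trend_hit(last_train_close, actual, predicted):
    """Assumed trend rule: do forecast and reality agree on up vs. not up?"""
    actual_up = actual[-1] > last_train_close
    predicted_up = predicted[-1] > last_train_close
    return actual_up == predicted_up


def drift_forecast(train, horizon):
    """Stand-in baseline: extend the average historical step for `horizon` days."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + slope * (h + 1) for h in range(horizon)]


# Synthetic, steadily rising series of 245 closes (not real market data).
closes = [100.0 + 0.1 * t for t in range(245)]
train, test = split_series(closes)
forecast = drift_forecast(train, len(test))

print(f"MAPE = {mape(test, forecast):.4f}%")
print(f"trend predicted correctly: {trend_hit(train[-1], test, forecast)}")
```

Any of the forecasting models named above could replace `drift_forecast`, as long as it returns 20 predicted closes. Averaging `trend_hit` over a set of stocks would give an accuracy figure in the spirit of the Trend Prediction measure.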
The seventh chapter presents both the conclusions drawn from
the experiments and the future extensions that could be applied to the Financial Data
Mining models we constructed.