Historical prices have been downloaded from finance.yahoo.com for the 500 stocks in the S&P 500 index, the most influential stocks in the United States.
Each company is associated with a label corresponding to the sector in which it operates. Yahoo identifies eight sectors plus a ninth, Conglomerates, used for companies that own divisions in different and separate businesses.
Historical prices for each stock can be downloaded through the URL:
We want to measure how well stock prices reflect the sector division given by the Yahoo website.
To this end, we apply the k-means clustering algorithm to the 500 stocks in the S&P 500 index to investigate how price variations follow the market sectorization.
As a first step we compute the log of the closing prices for every day.
Then we compute the daily return, defined as the difference between the current day's log price and the previous day's log price.
For every day we subtract the average return across all stocks, which we take as the general market return for that day.
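The three steps above can be sketched in a few lines of Python with NumPy. This is only an illustration on made-up prices: the function name `market_adjusted_returns` and the layout of `close` (one row per day, one column per stock) are assumptions, not taken from the original source code.

```python
import numpy as np

def market_adjusted_returns(close):
    """close: array of shape (n_days, n_stocks) with positive closing prices."""
    log_price = np.log(close)
    # Daily log return: today's log price minus yesterday's.
    returns = np.diff(log_price, axis=0)            # shape (n_days - 1, n_stocks)
    # Market return for each day: the average return across all stocks.
    market = returns.mean(axis=1, keepdims=True)
    return returns - market

# Toy prices for 3 days and 3 stocks (invented numbers).
close = np.array([[10.0, 20.0, 30.0],
                  [11.0, 19.0, 33.0],
                  [12.1, 19.0, 29.7]])
adjusted = market_adjusted_returns(close)
```

By construction, the adjusted returns of all stocks sum to zero on every day, since the cross-sectional mean has been removed.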
From the series of modified returns we compute the 500x500 matrix of correlations among stocks, using the daily returns from January 1st, 2000 to December 7th, 2007.
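As a rough sketch of this step, the correlation matrix can be obtained with NumPy's `np.corrcoef`. The data below are synthetic random numbers standing in for the modified returns, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the modified returns: 250 days, 5 stocks.
adjusted = rng.normal(size=(250, 5))

# rowvar=False tells NumPy that each column (stock) is a variable
# and each row (day) is an observation.
corr = np.corrcoef(adjusted, rowvar=False)      # shape (n_stocks, n_stocks)
```

With 500 stocks in place of 5, the same call would return the 500x500 matrix.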
To visualize the result we apply principal component analysis to the correlation matrix and plot the stocks projected onto its first two eigenvectors.
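One possible reading of this projection, sketched in Python on synthetic data: each stock is represented by its row of the correlation matrix, and that row is projected onto the two eigenvectors with the largest eigenvalues. Whether the original figures used exactly this convention is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 6))             # synthetic: 250 days, 6 stocks
corr = np.corrcoef(returns, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]               # largest eigenvalue first
top2 = eigvecs[:, order[:2]]                    # first two eigenvectors

# Row i of coords holds the 2-D coordinates of stock i.
coords = corr @ top2                            # shape (n_stocks, 2)
```

Because the matrix is symmetric, these coordinates equal the eigenvector entries scaled by their eigenvalues. A scatter plot of `coords`, colored by sector, would give the kind of picture described in the text.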
If we exclude the 7 stocks in the Conglomerates sector, that is, those stocks that do not belong to any specific sector, we can try to cluster the remaining stocks using k-means with the number of groups set to 8.
To make the process supervised, we set the "seed" centroids for k-means to the centroids of the 8 groups of stocks indicated by Yahoo.
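A minimal sketch of seeded k-means, written as plain Lloyd iterations in NumPy and run on toy two-dimensional data. The function `seeded_kmeans`, its parameters, and the toy data are illustrative assumptions. The original MATLAB code may differ, for instance in the distance it uses.

```python
import numpy as np

def seeded_kmeans(X, labels, n_iter=100):
    """X: (n_samples, n_features); labels: integer group ids 0..k-1 used as seeds."""
    k = int(labels.max()) + 1
    # Seed centroids: the mean of each labelled group.
    centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    assign = labels.copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                               # assignments stable: converged
        assign = new_assign
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:                # keep old centroid if cluster empties
                centroids[j] = members.mean(axis=0)
    return assign, centroids

# Toy data: two well-separated blobs, with one point given the wrong label.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
true_groups = np.repeat([0, 1], 20)
sectors = true_groups.copy()
sectors[0] = 1
assign, centroids = seeded_kmeans(X, sectors)
accuracy = (assign == sectors).mean()
```

Since cluster j starts from the centroid of group j, cluster indices can be compared directly with the original labels, which is what makes an accuracy figure meaningful here.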
At first sight, a classification accuracy of 67.75% might seem low. However, we have to remember that there were 8 clusters, so a uniformly random classification would have yielded an accuracy of only 12.5%.
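The comparison with chance is simple arithmetic, shown here in Python for concreteness. The helper `accuracy` and the toy label lists are illustrative. Only the figures 67.75% and 8 clusters come from the text.

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted label matches the actual one."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

n_clusters = 8
chance = 1 / n_clusters        # expected accuracy of a uniform random guess
reported = 0.6775              # accuracy quoted in the text
lift = reported / chance       # how many times better than chance

# Toy check of the helper: 3 of 4 labels agree.
toy = accuracy([0, 1, 1, 2], [0, 1, 2, 2])
```

Under this uniform-guess baseline, the reported accuracy is a little over five times the chance level.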
Mahalanobis distance cannot be applied because the space in which the stock vectors are embedded has a dimension greater than the size of the clusters, so each cluster's covariance matrix is singular and cannot be inverted.
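The obstacle can be checked numerically. In the sketch below, on synthetic data with arbitrarily chosen sizes, a cluster of 10 points in 50 dimensions yields a sample covariance matrix whose rank is at most 9. The matrix therefore has no inverse, and the Mahalanobis distance needs that inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n_members, n_dims = 10, 50      # fewer cluster members than dimensions
cluster = rng.normal(size=(n_members, n_dims))

# Sample covariance of the cluster: one column per dimension.
cov = np.cov(cluster, rowvar=False)             # shape (n_dims, n_dims)

# Centering removes one degree of freedom, so rank <= n_members - 1 < n_dims.
rank = np.linalg.matrix_rank(cov)
is_singular = rank < n_dims
```

Common workarounds, such as regularizing the covariance or reducing the dimension first, are outside the scope of the procedure described here.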
Figures were generated using MATLAB. Source code and data are freely available at the following address: