This document summarizes an unsuccessful experiment using self-organizing maps (SOM) for unsupervised learning on S&P 500 historical index data. The goal was to cluster unusual trading patterns like those during financial crises, but the SOM failed to produce meaningful clusters. Even after adjusting the data set sizes and attributes tracked, the resulting maps showed randomly distributed nodes with no clear separation of clusters. The SOM was only somewhat successful in clustering when tracking a single attribute, but the clusters did not clearly correspond to known unusual periods in the market index.
MACHINE LEARNING PROJECT REPORT, JUNE 2014
SOM for Temporal Clustering Experiment
Henrik Grandin Aditya Hendra
Abstract—This paper is a report for the machine learning project course. In this project we apply unsupervised learning to S&P 500 historical index data to find unusual trading patterns. Such patterns could serve as an indicator that, during a certain period, the index was in a very unusual condition, for example during the 2008 financial crisis.
This paper discusses the unsuccessful results of our preliminary experiments using a SOM [1] as the unsupervised learning method, what we believe caused these results, and our suggestions for future studies.
Keywords—Machine Learning, SOM, Unsupervised Learning,
Time Series, S&P500 index, STS Clustering
I. INTRODUCTION
Time series data is everywhere, especially in the data mining field, where we want to extract valuable and interpretable information from huge raw data sets. This is especially true for financial time series, since predicting financial trends from historical data has been one of the most prevalent ways to make a profit.
To learn more about this kind of financial time series data, we propose an experiment applying a machine learning algorithm to the S&P 500 data set, to see whether such time series contain distinctive patterns that can be clustered, e.g. a cluster of highly volatile index days during a financial crisis.
The goal of clustering is to identify structure in an unlabeled data set by objectively organizing the data into homogeneous groups, where within-group similarity is maximized and between-group similarity is minimized. In this sense, clustering is sometimes called automatic classification [3].
Because it uses no class labels, clustering is also known as unsupervised learning, or learning by observation instead of by example. Han and Kamber [3] classified clustering methods into five major categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
One of model-based methods approach is neural network[7],
which consists of competitive learning, including ART and
self-organizing feature maps (SOM).
We choose SOM because it is one of the most frequently
used method for clustering temporal sequences [9].
This report is organized as follows. Section 2 gives a brief
introduction to SOMs. Section 3 explains the preprocessing
approach we use. Section 4 describes the experiment design.
Section 5 examines the results and their consequences.
Section 6 gives possible improvement suggestions, and Section
7 concludes the report.
Henrik Grandin is a student at Uppsala University.
Aditya Hendra is a student at Uppsala University.
II. SELF ORGANIZING MAP
B. Hammer, A. Micheli, A. Sperduti, and M. Strickert [6]
used the Self-Organizing Map (SOM) for time series clustering
and prediction with recurrent neural networks, and state that
the SOM is the most frequently used method for
clustering temporal sequences.
The SOM is an unsupervised learning method, working
without labels or examples, used for automatic classification,
data segmentation, and vector quantization [4]. The SOM adopts a
modified form of competitive learning: the node most
similar to the input wins the competition and has its weights
updated. The difference from plain competitive learning is
that the SOM also updates the weights of the winner's neighbours,
by a smaller amount than the winner's.
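A single update step of this scheme can be sketched as follows. This is a minimal illustration in NumPy; the function name, the Gaussian neighbourhood, and the parameters are our own choices for the sketch, not the API of any particular SOM package:

```python
import numpy as np

def som_update(weights, grid, x, lr, sigma):
    """One SOM step: move the winner, and its neighbours less strongly, toward x.

    weights: (n_nodes, d) node weight vectors
    grid:    (n_nodes, 2) node coordinates on the 2D output map
    lr:      learning rate; sigma: neighbourhood radius on the map
    """
    # the winner is the node whose weights are closest to the input
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Gaussian neighbourhood: 1 at the winner, decaying with map distance
    d2 = ((grid - grid[winner]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))
    return weights + lr * h[:, None] * (x - weights)
```

In a full training run, `lr` and `sigma` would both shrink over the iterations, which is what the "initial and final neighbourhood size" settings in Section IV control.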
C. Y. Tsao and S.-H. Chen [4] also mention, based on
their literature review, that SOMs have proven to
be an effective methodology for analyzing problems in finance,
economics, and marketing.
We therefore think it is appropriate to use a SOM to find
clusters containing the so-called unusual trading data that
typically occurs during a financial crisis.
III. DATA PREPROCESSING
The attributes we worked with for each day are open, high,
low, close, and volume. The data was downloaded as a CSV file
from the Yahoo Finance web page [11].
Since a stock index is a non-stationary time series (its level
drifts over time, e.g., with inflation), working with the raw
data would not give any results; we need to process it first.
It is important that one attribute does not dwarf the other
attributes, but at the same time we do not want to downplay
a big change in a single attribute. The solution we ended up
using was to work with the relative (day-over-day) change of
each attribute:
P_n = A_n / A_{n-1}    (1)
where P is the processed attribute and A is the raw value.
To study the stock over time, we need to group the days into
sets. It is not obvious what size these sets should be, so we
will have to try a number of different sizes. To ensure that we
do not miss interesting patterns that span the end of one set
and the beginning of the next, the sets need to overlap.
All of this is handled by a Python script that reads a CSV file
with the raw data. It lets us decide the size of each set, the
number of days two neighbouring sets have in common, and which
of the attributes we wish to use (open, high, low, close, and
volume).
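The script itself is not reproduced in this report, but a minimal sketch of the preprocessing it performs might look like the following. The function and parameter names are our own, and it assumes the CSV rows have already been read into one dictionary per trading day:

```python
def percent_change(values):
    # P_n = A_n / A_{n-1}, as in equation (1); drops the first day
    return [values[i] / values[i - 1] for i in range(1, len(values))]

def sliding_sets(rows, set_size, overlap, attributes):
    """Group days into overlapping sets of flat attribute vectors.

    rows: one dict per trading day, e.g. {"Open": "...", "Close": "..."}
    Two neighbouring sets share `overlap` days, so the window slides
    by (set_size - overlap) days at a time.
    """
    step = set_size - overlap
    processed = {a: percent_change([float(r[a]) for r in rows])
                 for a in attributes}
    n_days = len(rows) - 1  # the ratio transform drops the first day
    sets = []
    for start in range(0, n_days - set_size + 1, step):
        vector = []
        for i in range(start, start + set_size):
            # one flat input vector per set: all attributes of each day
            vector.extend(processed[a][i] for a in attributes)
        sets.append(vector)
    return sets
```

With `set_size=20`, `overlap=19`, and all five attributes, each set becomes a 100-dimensional input vector for the SOM, matching the dimensionality discussed in Section IV.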
MACHINE LEARNING PROJECT REPORT, JUNE 2014
IV. EXPERIMENT DESIGN
We decided to use the software Orange [12], written in Python,
with a nice, easy-to-use graphical interface. It offers many
classification methods, but we are only interested in its SOM
implementation. We started experimenting in Matlab, but the SOM
implementation there does not let us see exactly which input
sets get associated with which nodes in the output map. This is
vital functionality for our experiment, and Orange does provide it.
The SOM application in Orange lets us customize it to
some extent. We can change the size and topology of the output
map. The initial weights of the map can be either random or
evenly distributed. We can set the initial and final size of the
neighbourhood, as well as decide whether the neighbourhood
function should be Gaussian or a "top hat" function. One
setting that would have been helpful, but unfortunately is
not included in the Orange application, is the number of
iterations for the tuning phase; the only thing you can set is
the total number of iterations.
In our first experiment we used sets of 20 trading days
(one month) with an overlap of 10 days between sets. The
result was basically a random spread over the 2D map: no
distance had formed between the nodes, and the number of sets
in each node was essentially the same all over the map.
This remained true with every setting we tried on the SOM.
After this experiment we concluded that we had to account for
a known problem with time series in a SOM.
Suppose two sets are basically identical, except that one
is delayed by one day. The SOM will not be able
to recognize the similarity, since it only compares day
one with day one, day two with day two, and so forth. The
solution is to ensure that each day appears in every
position of a set: with a set size of ten days, every day
appears in ten different sets, and two neighbouring sets have
nine days in common. In effect, we are sliding the window over
the data one day at a time.
With the new settings, 20 days with 19 days of overlap,
the resulting 2D map remained the same. Using 20 days
with five attributes per day gives our SOM 100 input dimensions,
so we tried to reduce the dimensionality by shrinking either
the set size or the number of attributes tracked per day.
We tried a number of combinations
between 5-20 days and 1-5 attributes, still sliding the sets
by one day.
Nevertheless, the result was for the most part as unimpressive
as before; only when we tracked a single attribute did
some resemblance of clusters start to appear. The problem
is that the clusters are not really separated once you
look at the days contained in each cluster. Since our sets
overlap, each day is contained in multiple sets, and for the
most part these sets do not end up in the
same cluster. Most days are therefore
present in most clusters, outliers as well as the main clusters.
Fig. 1: 20-day sets, 10-day overlap, tracking open, high, low,
close and volume
Fig. 2: 5-day sets, 4-day overlap, tracking open, high, low,
close and volume
Fig. 3: 20-day sets, 19-day overlap, tracking closing value
Fig. 4: 5-day sets, 4-day overlap, tracking amount of stocks
traded
These are samples of the resulting 2D maps from our SOM.
The size of a circle represents how many sets are
associated with that node. The colour of each node represents
the distance between nodes: light colours represent short
distances, dark colours larger ones. Nodes
without circles are shown to indicate the distances between the
nodes with circles.
Figures 1 and 2 were run while tracking all five attributes,
and as a result no clusters were found. A few
corner nodes have separated from the rest of the map, but
each such node contains only one set. These sets do not
include any dates on the list of days with large changes
in the S&P 500 index [13], and when we inspected the values
in each set, there was no apparent reason why they were outliers.
As Figures 3 and 4 show, we did manage to get clusters
when tracking only one attribute in the
time series (closing value generally produced more distinct
clusters than volume).
V. RESULTS AND FAILURES
Overall, the SOM does not produce a consistent cluster
pattern that could indicate whether a cluster contains a group of
unusual data, such as the index during the 2008 financial crisis.
The resulting clusters contain data that looks more like
random data, each holding data from various time periods, as
seen in the previous figures.
For preprocessing, we extracted data sets from a single
time series using the sliding-window method to create more,
overlapping data sets. This is also called
subsequence clustering or STS (Subsequence Time Series)
clustering. J. Lin, E. Keogh, and W. Truppel [5] claim that
clustering of streaming time series is meaningless precisely
because of data sets extracted with the sliding-window method.
As shocking as it sounds, the claim comes with evidence that
clustering sliding-window time series is essentially no different
from clustering random-walk data. The paper states
that for any time series data set T, if T is clustered using
sliding windows and the window length is very
small compared to the length of the overall time series, then
the mean of all extracted subsequences will be an approximately
constant vector. Although we have not formally tested this
theorem ourselves, visual inspection of our SOM clusters
indicates that each cluster contains generic data resembling
random samples from the data sets.
Why this happens can be explained by introducing
cluster_distance(A, B) and cluster_meaningfulness(X, Y).
We will use the following definitions:
• Let A = (ā1, ā2, ..., āk) be the cluster centers derived
from one run of STS k-means.
• Let B = (b̄1, b̄2, ..., b̄k) be the cluster centers derived from
another, different run of STS k-means.
• Let dist(āi, b̄j) be the distance between two cluster
centers, measured as the Euclidean distance.
Then we can define the distance between two sets of
clusters as:

cluster_distance(A, B) ≡ Σ_{i=1}^{k} min[ dist(āi, b̄j) ],  1 ≤ j ≤ k    (2)
We can use this distance to measure the similarity between
two sets of clusters. The experiment described in the literature
uses k-means as its main clustering algorithm: three
random restarts of k-means on a stock market data set were
saved as set X, and another three random restarts on a
random-walk data set were saved as set Y.
Both sets are then processed as follows:
• within_set_X_distance is the average cluster_distance
between one member of X and the other members of X.
• between_set_X_and_Y_distance is the average cluster_distance
between a member of X and a member of Y.
The relationship between these two quantities is:

cluster_meaningfulness(X, Y) ≡ within_set_X_distance / between_set_X_and_Y_distance    (3)
Since the numerator measures the distance between similar
clusterings, its value should be very small, close to zero,
while the denominator measures the distance between two
different clusterings and should be large; overall,
cluster_meaningfulness(X, Y) should therefore be very close
to zero. The result reported in the literature is very
different: the between_set_X_and_Y_distance values suggest
that the sets X and Y are very similar.
The literature also reports that the experiments were repeated
with many other clustering algorithms, including SOM.
One suggested root cause is that "STS clustering
algorithms are simply returning a set of basis functions that
can be added together in a weighted combination to approximate
the original data."
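To illustrate the effect Lin et al. describe, here is a small self-contained sketch of equations (2) and (3). It is our own construction, using a plain NumPy k-means instead of the SOM and synthetic random-walk data in place of real index data:

```python
import numpy as np

def kmeans(data, k, iters=30, seed=0):
    """Plain k-means on rows of `data`; returns the k cluster centers."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each window to its nearest center, then recompute centers
        labels = np.argmin(((data[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def cluster_distance(A, B):
    # equation (2): for each center in A, distance to the closest center in B
    return sum(min(np.linalg.norm(a - b) for b in B) for a in A)

def sliding_windows(series, w):
    return np.array([series[i:i + w] for i in range(len(series) - w + 1)])

rng = np.random.default_rng(1)
stock_like = np.cumsum(rng.standard_normal(500))   # stand-in for index data
random_walk = np.cumsum(rng.standard_normal(500))

k, w = 5, 20
X = [kmeans(sliding_windows(stock_like, w), k, seed=s) for s in range(3)]
Y = [kmeans(sliding_windows(random_walk, w), k, seed=s) for s in range(3)]

within = np.mean([cluster_distance(X[i], X[j])
                  for i in range(3) for j in range(3) if i != j])
between = np.mean([cluster_distance(x, y) for x in X for y in Y])
meaningfulness = within / between  # equation (3)
print(meaningfulness)
```

For meaningful clusters this ratio should be near zero; Lin et al. report values near one for sliding-window clustering, meaning the centers obtained from real data are about as far from each other as they are from centers obtained from a pure random walk.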
VI. POSSIBLE IMPROVEMENTS
One problem with the regular SOM is that it has no sense
of time. When the weights of a node are updated, the update is
based only on its current position, which is just the sum of all
previous movements plus its starting position. When working
with time series it makes sense to change this: patterns from
last month should have more impact than patterns from 10 years
ago. The recurrent self-organizing map (RSOM) is an alteration
of the regular SOM that aims to fix this. The RSOM update rule
is [10]:
y_i(t) = a · Σ_{k=0}^{n-1} (1 − a)^k (x(t−k) − w_i(t−k)) + (1 − a)^n y_i(t−n)    (4)
where x(t) is the input pattern at iteration t, w_i(t) is the
weight vector of node i at iteration t, x(t) − w_i(t) is the
movement needed to move node i to x(t), and a, 0 < a ≤ 1,
determines the impact of older movements. When a approaches 1,
old movements are discarded and the system acts as a short-term
memory; when a approaches 0, it acts as a long-term memory.
We did not manage to find any application with an RSOM
implementation, and we did not have enough time to implement
one ourselves, but it seems reasonable that it would at least
improve our clusters.
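As a sketch of how equation (4) could be computed, here is our own minimal implementation of the leaky difference vector for one node (the function name and array layout are our choices):

```python
import numpy as np

def rsom_response(x_hist, w_hist, a):
    """Leaky difference vector y_i(t) of equation (4) for one node.

    x_hist, w_hist: (n, d) arrays of inputs and node weights, oldest row
    first, so row -1 corresponds to time t; y_i(t-n) is taken as zero.
    a: leak coefficient, 0 < a <= 1; larger a forgets old movements faster.
    """
    y = np.zeros(x_hist.shape[1])
    for x, w in zip(x_hist, w_hist):
        # recursive form of (4): y(t) = (1 - a) * y(t-1) + a * (x(t) - w(t))
        y = (1.0 - a) * y + a * (x - w)
    return y
```

The winning node is then the one with the smallest ‖y_i(t)‖, so recent mismatches dominate the competition while older ones decay geometrically.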
When we used multiple attributes in our time series (open,
high, low, close, and volume), our SOM did not manage to
create any clusters, and we have not found any paper that used
more than one attribute in its time series.
One obvious problem with multiple attributes is that the
input dimension of the SOM greatly increases. However, a
25-day time series with one attribute created better
clusters than five attributes over a five-day time series;
if the problem were only dimensionality, these should
be comparable. We therefore propose an experiment with a
different approach to multiple-attribute time series.
The first step is to run the SOM on each attribute individually
and define clusters on the output map for each attribute.
The next step is to combine clusters from different attributes:
if two clusters from two different attributes have 80% of their
members in common, create a new cluster from that 80%, and put
the remaining 20% from each of the two original clusters into
two smaller clusters.
We imagine there might be quite a lot of clusters, so it
may be necessary to define a distance measure between clusters
that allows merging clusters that get too close to each other.
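The merging rule described above can be sketched as follows; the 80% threshold and the function name are our own illustration, with clusters represented as sets of day-set identifiers:

```python
def split_by_overlap(cluster_a, cluster_b, threshold=0.8):
    """Combine two clusters of day-sets coming from different attributes.

    If the shared fraction reaches the threshold, the shared members become
    a new cluster and the leftovers form two smaller clusters; otherwise
    the clusters are kept as they are.
    """
    a, b = set(cluster_a), set(cluster_b)
    shared = a & b
    if shared and len(shared) / max(len(a), len(b)) >= threshold:
        return [shared, a - shared, b - shared]
    return [a, b]
```

Running this pairwise over the per-attribute clusterings would give the set of combined clusters that the proposed distance measure could then merge further.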
Another simple modification we could suggest is to not use
the sliding-window approach, and instead use subsequences that
are randomly extracted [5].
A quite different approach is to combine clustering
(SOM) with a recurrent neural network (RNN)
[9]. In that work, the SOM is used for temporal
sequence processing and classification, and a recurrent neural
network is associated with each created cluster as a
predictor. The literature also states that the RNN uses an
"internal feedback mechanism that creates an implicit memory
that contributes to the prediction," which means that using
the data in sequentially correct order is obligatory for
correct predictions. Although this approach also uses a sliding
temporal window, we do not know how that affects the overall
approach, or whether a non-sliding window should be used.
VII. CONCLUSION
In this project we tried to find unique clusters in the
S&P 500 time series using a Self-Organizing Map, based on the
daily opening, closing, highest, and lowest index values and
the daily trading volume. The resulting clusters did not have
any distinguishable structure, and one of the main causes is
probably the use of the sliding-window method.
One reason we think the approach of A. Cherif, H.
Cardot, and R. Boné [9] could work is that their experiments
use a well-known time series, the Mackey-Glass series, which
visually looks very similar to the S&P 500 index chart. The
usefulness of their approach on a time series similar to the
S&P 500 makes it worth mentioning.
ACKNOWLEDGMENT
The authors would like to thank Joseph Scott for supervising
our project and our lecturer Olle Gallmo for his teaching during
the course.
REFERENCES
[1] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78,
no. 9, pp. 1464-1480, Sep. 1990.
[2] T. Koskela, M. Varsta, J. Heikkonen, and K. Kaski, "Recurrent SOM with
Local Linear Models in Time Series Prediction," in 6th European Symposium
on Artificial Neural Networks, 1998, pp. 167-172.
[3] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San
Francisco: Morgan Kaufmann, 2001, pp. 346-389.
[4] C. Y. Tsao and S.-H. Chen, "Self-organizing maps as a foundation for
charting or geometric pattern recognition in financial time series," in
2003 IEEE International Conference on Computational Intelligence for
Financial Engineering, Proceedings, 2003, pp. 387-394.
[5] J. Lin, E. Keogh, and W. Truppel, "Clustering of streaming time series is
meaningless," in Proceedings of the 8th ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery, 2003, pp. 56-65.
[6] B. Hammer, A. Micheli, A. Sperduti, and M. Strickert, "Recursive self-
organizing network models," Neural Networks, vol. 17, no. 8-9, pp.
1061-1085, Oct. 2004.
[7] T. Warren Liao, "Clustering of time series data—a survey," Pattern
Recognition, vol. 38, no. 11, pp. 1857-1874, Nov. 2005.
[8] A. Fonseka, D. Alahakoon, and S. Bedingfield, "GSOM sequence: An
unsupervised dynamic approach for knowledge discovery in temporal
data," in 2011 IEEE Symposium on Computational Intelligence and Data
Mining (CIDM), 2011, pp. 232-238.
[9] A. Cherif, H. Cardot, and R. Boné, "SOM time series clustering and
prediction with recurrent neural networks," Neurocomputing, vol. 74, no.
11, pp. 1936-1944, May 2011.
[10] M. Varsta, J. Heikkonen, and J. del R. Millán, "A Recurrent Self-Organizing
Map for Temporal Sequence Processing," Lecture Notes in Computer
Science, vol. 1327, 1997, pp. 421-426.
[11] Yahoo Finance, S&P 500 stock data, accessed 28/5 2014. URL:
http://finance.yahoo.com/q/hp?s=%5EGSPC+Historical+Prices
[12] Orange (SOM application), accessed 28/5 2015. URL:
http://orange.biolab.si/
[13] Wikipedia, "List of largest daily changes in the S&P 500," accessed
29/5 2015. URL:
http://en.wikipedia.org/wiki/List_of_largest_daily_changes_in_the_S%26P_500