Data Mining for Prediction. Financial Series Case



  1. 1. Data Mining for Prediction. Financial Series Case. Stefan Zemke. Doctoral Thesis, The Royal Institute of Technology, Department of Computer and Systems Sciences, December 2003
  2. 2. Doctoral Thesis, The Royal Institute of Technology, Sweden. ISBN 91-7283-613-X. Copyright © by Stefan Zemke. Contact: Printed by Akademitryck AB, Edsbruk, 2003
  3. 3. Abstract
     Hard problems force innovative approaches and attention to detail, and their exploration often contributes beyond the area initially attempted. This thesis investigates the data mining process resulting in a predictor for numerical series. The series experimented with come from financial data, which is usually hard to forecast.
     One approach to prediction is to spot patterns in the past, when we already know what followed them, and to test them on more recent data. If a pattern is followed by the same outcome frequently enough, we can gain confidence that it is a genuine relationship. Because this approach does not assume any special knowledge or form of the regularities, the method is quite general and applicable to other time series, not just financial ones. However, the generality puts strong demands on the pattern detection, which has to notice regularities in any of the many possible forms.
     The thesis' quest for automated pattern-spotting involves numerous data mining and optimization techniques: neural networks, decision trees, nearest neighbors, regression, genetic algorithms and others. A comparison of their performance on stock exchange index data is one of the contributions. As no single technique performed sufficiently well, a number of predictors have been put together, forming a voting ensemble. The vote is diversified not only by different training data, as is usually done, but also by the learning method and its parameters. An approach to speeding up predictor fine-tuning is also proposed.
     The algorithm development goes still further: a prediction can only be as good as the training data, hence the need for good data preprocessing. In particular, new multivariate discretization and attribute selection algorithms are presented. The thesis also includes overviews of prediction pitfalls and possible solutions, as well as of ensemble-building for series data with financial characteristics, such as noise and many attributes.
     The Ph.D. thesis consists of an extended background on financial prediction, 7 papers, and 2 appendices.
  4. 4. Acknowledgements
     I would like to take the opportunity to express my gratitude to the many people who helped me with the developments leading to the thesis. In particular, I would like to thank Ryszard Kubiak for his tutoring and support reaching back to my high-school days and the beginnings of my university education, and also for his help in improving the thesis. I enjoyed and appreciated the fruitful exchange of ideas and cooperation with Michal Rams, to whom I am also grateful for comments on a part of the thesis. I am also grateful to Miroslawa Kajko-Mattsson for words of encouragement in the final months of the Ph.D. efforts and for her style-improving suggestions.
     In the early days of my research Henrik Boström stimulated my interest in machine learning and Pierre Wijkman in evolutionary computation. I am thankful for that and for the many discussions I had with both of them. And finally, I would like to thank Carl Gustaf Jansson for being such a terrific supervisor.
     I am indebted to Jozef Swiatycki for all forms of support during the study years. Also, I would like to express my gratitude to the computer support people, in particular Ulf Edvardsson, Niklas Brunbäck and Jukka Luukkonen at DMC, and to other staff at DSV, in particular to Birgitta Olsson for her patience with the final formatting efforts.
     I dedicate the thesis to my parents, who always believed in me.
     Gdynia, October 27, 2003. Stefan Zemke
  5. 5. Contents
     1 Introduction
       1.1 Background
       1.2 Questions in Financial Prediction
         1.2.1 Questions Addressed by the Thesis
       1.3 Method of the Thesis Study
         1.3.1 Limitations of the Research
       1.4 Outline of the Thesis
     2 Extended Background
       2.1 Time Series
         2.1.1 Time Series Glossary
         2.1.2 Financial Time Series Properties
       2.2 Data Preprocessing
         2.2.1 Data Cleaning
         2.2.2 Data Integration
         2.2.3 Data Transformation
         2.2.4 Data Reduction
         2.2.5 Data Discretization
         2.2.6 Data Quality Assessment
       2.3 Basic Time Series Models
         2.3.1 Linear Models
         2.3.2 Limits of Linear Models
         2.3.3 Nonlinear Methods
         2.3.4 General Learning Issues
       2.4 Ensemble Methods
       2.5 System Evaluation
  6. 6.     2.5.1 Evaluation Data
         2.5.2 Evaluation Measures
         2.5.3 Evaluation Procedure
         2.5.4 Non/Parametric Tests
     3 Development of the Thesis
       3.1 First half: Exploration
       3.2 Second half: Synthesis
     4 Contributions of Thesis Papers
       4.1 Nonlinear Index Prediction
       4.2 ILP via GA for Time Series Prediction
       4.3 Bagging Imperfect Predictors
       4.4 Rapid Fine Tuning of Computationally Intensive Classifiers
       4.5 On Developing Financial Prediction System: Pitfalls and Possibilities
       4.6 Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data
       4.7 Multivariate Feature Coupling and Discretization
     5 Bibliographical Notes
     A Feasibility Study on Short-Term Stock Prediction
     B Amalgamation of Genetic Selection and Boosting. Poster, GECCO-99, US, 1999
  7. 7. List of Thesis Papers
     Stefan Zemke. Nonlinear Index Prediction. Physica A 269 (1999)
     Stefan Zemke. ILP and GA for Time Series Prediction. Dept. of Computer and Systems Sciences Report 99-006
     Stefan Zemke. Bagging Imperfect Predictors. ANNIE'99, St. Louis, MO, US, 1999
     Stefan Zemke. Rapid Fine-Tuning of Computationally Intensive Classifiers. MICAI'2000, Mexico, 2000. LNAI 1793
     Stefan Zemke. On Developing Financial Prediction System: Pitfalls and Possibilities. DMLL Workshop at ICML-2002, Australia, 2002
     Stefan Zemke. Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data. HIS-2002, Chile, 2002
     Stefan Zemke and Michal Rams. Multivariate Feature Coupling and Discretization. FEA-2003, Cary, US, 2003
  9. 9. Chapter 1 Introduction
     Predictions are hard, especially about the future.
     Niels Bohr and Yogi Berra
     1.1 Background
     As computers, sensors and information distribution channels proliferate, there is an increasing flood of data. However, the data is of little use unless it is analyzed and exploited. There is indeed little use in just gathering the telltale signals of a volcanic eruption, a heart attack, or a stock exchange crash, unless they are recognized and acted upon in advance. This is where prediction steps in.
     To be effective, a prediction system requires, among other things, good input data, good pattern-spotting ability, and sound evaluation of the discovered patterns. The input data needs to be preprocessed, perhaps enhanced by a domain expert's knowledge. The prediction algorithms can be provided by methods from statistics, machine learning, and the analysis of dynamical systems, together known as data mining, which is concerned with extracting useful information from raw data. And predictions need to be carefully evaluated to see if they fulfill criteria of significance, novelty, usefulness etc.
     In other words, prediction is not an ad hoc procedure. It is a process involving a number of premeditated steps and domains, all of which influence the quality of the outcome. The process is far from automatic. A particular prediction task requires experimentation to assess what works best. Part of the assessment comes from intelligent but to some extent artful exploratory data analysis. If the task is poorly addressed by existing methods, the exploration might lead
  10. 10. to the development of a new algorithm. The thesis research follows that progression, starting from the question of the days-ahead predictability of stock exchange index data. The thesis work and contributions consist of three developments. First, the exploration of simple methods of prediction, exemplified by the initial thesis papers. Second, a higher-level analysis of the development process leading to a successful predictor. The process also supplements the simple methods with specifics of the domain and advanced approaches such as elaborate preprocessing, ensembles, and chaos theory. Third, the thesis presents new algorithmic solutions, such as bagging a Genetic Algorithms population, parallel experiments for rapid fine-tuning, and multivariate discretization.
     Time series are common. Road traffic in cars per minute, heart beats per minute, the number of applications to a school every year, and a whole range of scientific and industrial measurements all represent time series which can be analyzed and perhaps predicted. Many of the prediction tasks face similar challenges, such as how to decide which input series will enhance prediction, how to preprocess them, or how to tune various parameters efficiently. Although the thesis refers to financial data, most of the work is applicable to other domains, if not directly, then indirectly by pointing out different possibilities and pitfalls in predictor development.
     1.2 Questions in Financial Prediction
     Some questions of scientific and practical interest concerning financial prediction follow.
     Prediction possibility. Is statistically significant prediction of financial markets data possible? Is profitable prediction of such data possible? The latter involves the answer to the former question, adjusted for constraints imposed by real markets, such as commissions, liquidity limits, and the influence of the trades themselves.
     Methods. If prediction is possible, what methods are best at performing it? What methods are best suited for what data characteristics, and could it be said in advance?
  11. 11. Meta-methods. What are the ways to improve the methods? Can meta-heuristics successful in other domains, such as ensembles or pruning, improve financial prediction?
     Data. Can the amount and type of data needed for prediction be characterized?
     Data preprocessing. Can data transformations that facilitate prediction be identified? In particular, what transformation formulae enhance input data? Are the commonly used financial indicator formulae of any use?
     Evaluation. What are the features of a sound evaluation procedure, respecting the properties of financial data and the expectations of financial prediction? How should rare but important data events, such as crashes, be handled? What are the common evaluation pitfalls?
     Predictor development. Are there any common features of successful prediction systems? If so, what are they, and how could they be advanced? Can common reasons for the failure of financial prediction be identified? Are they intrinsic and non-reparable, or is there a way to amend them?
     Transfer to other domains. Can the methods developed for financial prediction benefit other domains?
     Predictability estimation. Can financial data be reasonably quickly estimated to be predictable or not, without the investment needed to build a custom system? What are the methods, what do they actually say, and what are their limits?
     Consequences of predictability. What are the theoretical and practical consequences of demonstrated predictability of financial data, or of its impossibility? How does a successful prediction method translate into economic models? What could be the social consequences of financial prediction?
  12. 12. 1.2.1 Questions Addressed by the Thesis
     The thesis addresses many of the questions, in particular the prediction possibility, methods, meta-methods, data preprocessing, and the predictor development process. More details on the contributions are provided in the chapter Contributions of Thesis Papers.
     1.3 Method of the Thesis Study
     The investigation behind the thesis has been mostly goal driven. As problems appeared on the way to realizing financial prediction, they were confronted by various means, including the following:
     • Investigation of existing machine learning and data mining methods and meta-heuristics.
     • Reading of financial literature for properties and hints of regularities in financial data which could be exploited.
     • Analysis of existing financial prediction systems, looking for approaches that commonly work.
     • Implementation of, and experimentation with, own machine learning methods and hybrid approaches involving a number of existing methods.
     • Some theoretical considerations on mechanisms behind the generation of financial data, e.g. deterministic chaotic systems, and on general predictability demands and limits.
     • Practical insights into the realm of trading, some contacts with professional investors, courses on finance and economics.
     1.3.1 Limitations of the Research
     Like any closed work, this thesis research has its limitations. One criticism of the thesis could be that the contributions do not directly tackle the prominent question of whether financial prediction can be profitable. A Ph.D. student concentrating efforts on this would make a heavy bet: either s/he would
  13. 13. end up with a Ph.D. and as a millionaire, or with nothing, should the prediction attempts fail. This is too high a risk to take. This is why in my research, after the initial head-on attempts, I took a more balanced path, investigating prediction from the side: methods, data preprocessing etc., instead of prediction results per se.
     Another criticism could address the omission or shallowness of experiments involving some of the relevant methods. For instance, a researcher devoted to Inductive Logic Programming could bring forward a new system good at dealing with numerical/noisy series, or an econometrician could point out the omission of linear methods. The reply could be: there are too many possibilities for one person to explore, so it was necessary to skip some. Even then, the interdisciplinary research demanded much work, among other things, for:
     • Studying 'how to' in 3 areas: machine learning/data mining, finance and mathematics; 2 years of graduate courses taken.
     • Designing systems exploiting and efficiently implementing the resulting ideas.
     • Collecting data for prospective experiments, initially quite a time-consuming task of low visibility.
     • Programming, which for new ideas not guaranteed to work takes hundreds of hours.
     • Evaluating the programs, adjusting parameters, evaluating again; the loop possibly takes hundreds of hours. The truth here is that most new approaches do not work, so the design, implementation and initial evaluation efforts are not publishable.
     • Writing papers and extended background study for the successful attempts.
     Another limitation of the research concerns evaluation methods. The Evaluation section stresses how careful the process should be, preferably involving a trading model and commissions, whereas the evaluations in the thesis papers do not have that. The reasons are manifold. First, as already
  14. 14. pointed out, the objective was not to prove that the predictions can be profitable. This would involve not only commissions, but also a trading model. A simple model would not fit the bill, so there would be a need to investigate how predictions, together with general knowledge, trader's experience etc., merge into successful trading: a subject for another Ph.D. Second, after commissions, the above-random gains would be much thinner, demanding better predictions, more data, and more careful statistics to spot the effect; perhaps too much for a pilot study.
     The lack of experiments backing some of the thesis ideas is another shortcoming. The research attempts to be practical, i.e. mostly experimental, but there are tradeoffs. As ideas become more advanced, the path from an idea to a reported evaluation becomes more involved. For instance, to predict, one needs data preprocessing, often including discretization. So, even with an experimental predictor implemented, it could not be evaluated before the discretization was completed, which pressed towards describing just the prediction part, without real evaluation. Computational demands also grow: a notebook computer is no longer enough.
     1.4 Outline of the Thesis
     The rest of the initial chapters, preceding the thesis papers, is meant to provide the reader with the papers' background, often only skimmed in them for page limit reasons. Thus, the Extended Background chapter goes through the successive areas and issues involved in time series prediction in the financial domain, one of the objectives being to introduce the vocabulary. The intention is also to present the breadth of the prediction area and of my study of it, which perhaps will allow one to appreciate the effort and knowledge behind the developments in this domain.
     Then comes the Development of the Thesis chapter which, more or less chronologically, presents the advancement of the research. In this tale one can also see the many attempts that proved to be dead ends. As such, the positive published results can be seen as the essence of a much bigger work.
     The next chapter, Contributions of Thesis Papers, summarizes all the thesis papers and their contributions. The summaries assume familiarity
  15. 15. with the vocabulary of the Extended Background chapter. The rest of the thesis consists of 7 thesis papers, formatted for a common appearance, otherwise quoted the way they were published. The thesis ends with a common bibliography, resolving references for the introductory chapters and all the included papers.
  17. 17. Chapter 2 Extended Background
     This chapter is organized as follows. Section 1 presents time series preliminaries and characteristics of financial series, Section 2 summarizes data preprocessing, Section 3 lists basic learning schemes, Section 4 covers ensemble methods, and Section 5 discusses predictor evaluation.
     2.1 Time Series
     This section introduces properties of time series appearing in the context of developing a prediction system in general, and in the thesis papers in particular. The presentation is divided into generic series properties and characteristics of financial time series. Most of the generic time series definitions follow (Tsay, 2002).
     A time series, series for short, is a sequence of numerical values indexed by increasing time units, e.g. the price of a commodity, such as oranges in a particular shop, indexed by the time when the price is checked. In the sequel, the return values of a series $s_t$ refer to $r_t = \log(s_{t+T}) - \log(s_t)$, with the return period $T$ assumed to be 1 if not specified. Remarks about the distribution of a series refer to the distribution of the returns series $r_t$. A predictor forecasts a future value $s_{t+T}$, having access only to past values $s_i$, $i \le t$, of this and usually other series. For the prediction to be of any value it has to be better than random, which can be measured by various metrics, such as accuracy, discussed in Section 5.
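The return definition above can be illustrated with a short sketch. The function name `log_returns` and the sample prices are hypothetical, chosen only for this example; the formula is the one given in the text.

```python
import math

def log_returns(series, period=1):
    """Compute r_t = log(s_{t+T}) - log(s_t) for every t where s_{t+T} exists.

    `period` plays the role of the return period T from the text.
    """
    if period < 1:
        raise ValueError("period must be a positive integer")
    return [math.log(series[t + period]) - math.log(series[t])
            for t in range(len(series) - period)]

# Hypothetical prices growing by 10% per step: both one-step returns are log(1.1).
prices = [100.0, 110.0, 121.0]
returns = log_returns(prices)
```

A series of n prices yields n - T returns, so the returns series is always shorter than the price series it is derived from.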
  18. 18. 2.1.1 Time Series Glossary
     Stationarity of a series indicates that its mean value and arbitrary autocorrelations are time invariant. Finance literature commonly assumes that asset returns are weakly stationary. This can be checked, provided there is a sufficient number of values, e.g. one can divide the data into subsamples and check the consistency of the mean and autocorrelations (Tsay, 2002). Determining whether a series has moved into a nonstationary regime is not trivial, let alone deciding which of the series properties still hold. Therefore, most prediction systems, which are based on past data, implicitly assume that the predicted series is to a great extent stationary, at least with respect to the invariants that the system may spot, which most likely go beyond mean and autocorrelations.
     Seasonality means periodic fluctuations. For example, retail sales peak around the Christmas season and decline after the holidays. So the time series of retail sales will show increasing values from September through December and declining values in January and February. Seasonality is common in economic time series and less so in engineering and scientific data. It can be identified, e.g. by correlation or Fourier analysis, and removed, if desired.
     Linearity and Nonlinearity are wide notions depending on the context in which they appear. Usually, linearity signifies that an entity can be decomposed into sub-entities whose properties, such as their influence on the whole, carry over to the whole entity in an additive way that is easy to analyze. Nonlinear systems do not allow such a simple decomposition analysis, since the interactions do not need to be additive, often leading to complex emergent phenomena not seen in the individual sub-entities (Bak, 1997).
     In the much narrower context of prediction methods, nonlinear often refers to the form of the dependency between the data and the predicted variable. In nonlinear systems the function describing that dependency might be nonlinear. Hence, linear approaches, such as correlation analysis and linear regression, are not sufficient. One must use less orthodox tools to find and exploit nonlinear dependencies, e.g. neural networks.
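The subsample check of stationarity mentioned in the glossary can be sketched as follows. This is a minimal illustration, not the procedure of (Tsay, 2002): the function names, the choice of equal-sized consecutive subsamples, and the use of a single lag are all assumptions made for the example.

```python
from statistics import mean

def autocorrelation(xs, lag=1):
    """Sample autocorrelation of xs at the given lag (0.0 for a constant series)."""
    m = mean(xs)
    var = sum((x - m) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - m) * (xs[i + lag] - m) for i in range(len(xs) - lag))
    return cov / var

def subsample_stats(xs, parts=2, lag=1):
    """Split xs into `parts` consecutive equal-sized chunks and return
    (mean, autocorrelation) per chunk; leftover trailing values are ignored."""
    size = len(xs) // parts
    chunks = [xs[i * size:(i + 1) * size] for i in range(parts)]
    return [(mean(c), autocorrelation(c, lag)) for c in chunks]

# A perfectly alternating series has the same statistics in both halves,
# whereas a series with a level shift shows clearly different chunk means.
alternating = subsample_stats([1.0, -1.0] * 50)
shifted = subsample_stats([0.0] * 50 + [5.0] * 50)
```

Large differences between the per-chunk values hint at nonstationarity; how large a difference is significant depends on the sample size and would need a proper statistical test, which this sketch does not provide.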
  19. 19. Deterministic and Nondeterministic Chaos. For a reader new to chaos, an illustration of the theory applied to finance can be found in (Deboeck, 1994). A system is chaotic if its trajectory through state space is sensitively dependent on the initial conditions, that is, if small differences are magnified exponentially with time. This means that initially unobservable fluctuations will eventually dominate the outcome. So, though the process may be deterministic, it is unpredictable in the long run (Kantz & Schreiber, 1999a; Gershenfeld & Weigend, 1993). Deterministic means that given the same circumstances the transition from a state is always the same.
     Whether financial markets exhibit this kind of behavior is hotly debated, and there are numerous publications supporting each view. The deterministic chaos notion involves a number of issues. First, whether markets react deterministically to events influencing prices, or whether the reaction is more probabilistic. Second, whether magnified small changes indeed eventually take over, which does not need to be the case, e.g. self-correction could step in if a value is too far off the mark, overpriced or underpriced. Financial time series have been analyzed in those respects; however, the mathematical theory behind chaos often deals poorly with the noise prevalent in financial data, making the results dubious.
     Even a chaotic system can be predicted up to the point where magnified disturbances dominate. The time when this happens depends inversely on the largest Lyapunov exponent, a measure of divergence. It is an average statistic: at any time the process is likely to have a different divergence/predictability, especially if nonstationary. Beyond that point, prediction is possible only in statistical terms, i.e. which outcomes are more likely, no matter what we start with. Weather, a chaotic system, is a good illustration: despite global efforts in data collection, forecasts are precise up to a few days and in the long run offer only statistical views such as the average monthly temperature. However, chaos is not to be blamed for all poor forecasts: it recently came to attention that the errors in weather forecasts initially grow not exponentially but linearly, which points more to imprecise weather models than to chaos at work.
     Another exciting aspect of a chaotic system is its control. If at times the
  20. 20. system is so sensitive to disturbances, a small influence at that time can profoundly alter the trajectory, provided that the system will be deterministic for a while thereafter. So potentially a government, or a speculator, who knew the rules could control the markets without a vast investment. Modern pacemakers for the human heart, another chaotic system, work by this principle, providing a little electrical impulse only when needed, without the need to constantly overwhelm the heart's electrical activity.
     Still, it is unclear if the markets are stochastic or deterministic, let alone chaotic. A mixed view is also possible: markets are deterministic only in part, so even short-term prediction cannot be fully accurate, or there are pockets of predictability, i.e. markets or market conditions in which the moves are deterministic, while otherwise being stochastic.
     Delay vectors embedding converts a scalar series $s_t$ into a vector series: $v_t = (s_t, s_{t-\mathrm{delay}}, \ldots, s_{t-(D-1)\cdot\mathrm{delay}})$. This is a standard procedure in (nonlinear) time series analysis, and a way to present a series to a predictor demanding an input of constant dimension $D$. More on how to fit the delay embedding parameters can be found in (Kantz & Schreiber, 1999a).
     Takens Theorem (Takens, 1981) states that we can reconstruct the dynamics of a deterministic system, possibly multidimensional, in which each state is a vector, by long-enough observation of just one noise-free variable of the system. Thus, given a series, we can answer questions about the dynamics of the system that generated it by examining the dynamics in a space defined by delayed values of just that series. From this, we can compute features such as the number of degrees of freedom and the linking of trajectories, and make predictions by interpolating in the delay embedding space. However, the Takens theorem holds for mathematical measurement functions, not the ones seen in the laboratory or market: an asset price is not a noise-free function. Nevertheless, the theorem supports experiments with a delay embedding, which might yield useful models. In fact, they often do (Deboeck, 1994).
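The delay vector embedding defined above can be sketched in a few lines. The function name `delay_vectors` and its parameter names are hypothetical; `dim` corresponds to the dimension D and `delay` to the delay in the formula.

```python
def delay_vectors(series, dim, delay=1):
    """Build v_t = (s_t, s_{t-delay}, ..., s_{t-(dim-1)*delay}) for every t
    that has enough history; returns a list of tuples, oldest t first."""
    if dim < 1 or delay < 1:
        raise ValueError("dim and delay must be positive integers")
    start = (dim - 1) * delay  # first index with a full history
    return [tuple(series[t - k * delay] for k in range(dim))
            for t in range(start, len(series))]

# With D = 2 and delay = 1, each vector pairs a value with its predecessor.
pairs = delay_vectors([1, 2, 3, 4], dim=2)            # [(2, 1), (3, 2), (4, 3)]
sparse = delay_vectors([1, 2, 3, 4, 5], dim=3, delay=2)  # [(5, 3, 1)]
```

Note that the first (D-1)·delay values of the series produce no vector, so long delays or high dimensions shorten the usable data.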
  21. 21. Prediction, modeling, characterization are three different goals of time series analysis (Gershenfeld & Weigend, 1993): "The aim of prediction is to accurately forecast the short-term evolution of the system; the goal of modeling is to find description that accurately captures features of the long-term behavior. These are not necessarily identical: finding governing equations with proper long-term properties may not be the most reliable way to determine parameters for short-term forecasts, and a model that is useful for short-term forecasts may have incorrect long-term properties. Characterization attempts with little or no a priori knowledge to determine fundamental properties, such as the number of degrees of freedom of a system or the amount of randomness."
     2.1.2 Financial Time Series Properties
     One may wonder if there are universal characteristics of the many series coming from markets different in size, location, commodities, sophistication etc. The surprising fact is that there are (Cont, 1999). Moreover, interacting systems in other fields, such as statistical mechanics, suggest that the properties of financial time series depend only loosely on the market microstructure and are common to a range of interacting systems. Such observations have stimulated new models of markets based on analogies with particle systems and brought in new analysis techniques, opening the era of econophysics (Mantegna & Stanley, 2000).
     Efficient Market Hypothesis (EMH), developed in 1965 (Fama, 1965), initially gained wide acceptance in the financial community. It asserts, in its weak form, that the current price of an asset already reflects all information obtainable from past prices, and it assumes that news is promptly incorporated into prices. Since news is assumed unpredictable, so are prices.
     However, real markets do not obey all the consequences of the hypothesis: e.g. a price random walk implies a normal distribution, which is not what is observed, and there is a delay while the price stabilizes at a new level after news. Such discrepancies, among others, lead to a more modern view (Haughen, 1997): "Overall, the best evidence points to the following conclusion. The market isn't efficient with respect to any of the so-called levels of efficiency. The value investing
  22. 22. phenomenon is inconsistent with semi-strong form efficiency, and the January effect is inconsistent even with weak form efficiency. Overall, the evidence indicates that a great deal of information available at all levels is, at any given time, reflected in stock prices. The market may not be easily beaten, but it appears to be beatable, at least if you are willing to work at it."
     Distribution of financial series (Cont, 1999) tends to be non-normal, sharp-peaked and heavy-tailed, these properties being more pronounced for intraday values. Such observations were pioneered in the 1960s (Mandelbrot, 1963), interestingly around the time the EMH was formulated.
     Volatility, measured by the standard deviation, also has common characteristics (Tsay, 2002). First, there exist volatility clusters, i.e. volatility may be high for certain periods and low for others. Second, volatility evolves over time in a continuous manner; volatility jumps are rare. Third, volatility does not diverge to infinity but varies within a fixed range, which means that it is often stationary. Fourth, the volatility reaction to a big price increase seems to differ from the reaction to a big price drop.
     Extreme values appear more frequently in a financial series than in a normally distributed series of the same variance. This is important to the practitioner, since often the values cannot be disregarded as erroneous outliers but must be actively anticipated, because their magnitude can influence trading performance.
     Scaling property of a time series indicates that the series is self-similar at different time scales (Mantegna & Stanley, 2000). This is common in financial time series, i.e. given a plot of returns without the axes labeled, it is next to impossible to say if it represents hourly, daily or monthly changes, since all the plots look similar, with differences appearing only at minute resolution. Thus prediction methods developed for one resolution could, in principle, be applied to others.
     Data frequency refers to how often series values are collected: hourly, daily, weekly etc. Usually, if a financial series provides values on a daily
  23. 23. or longer, basis, it is low frequency data, otherwise – when many intraday quotes are included – it is high frequency. Tick-by-tick data includes all individual transactions, and as such, the event-driven time between data points varies creating challenge even for such a simple calculation as corre- lation. The minute market microstructure and massive data volume create new problems and possibilities not dealt with by the thesis. The reader interested in high frequency finance can start at (Dacorogna et al., 2001). 2.2 Data Preprocessing Before data is scrutinized by a prediction algorithm, it must be collected, inspected, cleaned and selected. Since even the best predictor will fail on bad data, data quality and preparation is crucial. Also, since a predictor can exploit only certain data features, it is important to detect which data preprocessing/presentation works best. 2.2.1 Data Cleaning Data cleaning fills in missing values, smoothes noisy data, handles or re- moves outliers, resolves inconsistencies. Missing values can be handled by a generic method (Han & Kamber, 2001). Methods include skipping the whole instance with a missing value, or filling the miss with the mean/new ’unknown’ constant, or using inference, e.g. based on most similar instances or some Bayesian considerations. Series data has another dimension – we do not want to spoil the temporal relationship, thus data restoration is preferable to removal. The restora- tion should also accommodate the time aspect – not use too time-distant values. Noise is prevalent, especially low volume markets should be dealt with suspicion. Noise reduction usually involves some form of averaging or putting a range of values into one bin, discretization. If data changes are numerous, a test if the predictor picks the inserted bias is advisable. This can be done by ’missing’ some values from a random series – or better: permuted actual returns – and then restoring, cleaning etc. the series as if genuine. 
If the predictor can subsequently predict
  24. 24. anything from this, after all random, series, there is too much structure introduced (Gershenfeld & Weigend, 1993). 2.2.2 Data Integration Data integration combines data from multiple sources into a coherent store. Time alignment can demand consideration in series from different sources, e.g. different time zones. Series to instances conversion is required by most learning algorithms, which expect a fixed length vector as input. It can be done by the delay vector embedding technique. Such delay vectors with the same time index t – coming from all input series – are appended to give an instance (also called a data point or example), whose coordinates are referred to as data features, attributes or variables. 2.2.3 Data Transformation Data transformation changes the values of series to make them more suitable for prediction. Detrending is one such common transformation; it removes the growth of a series, e.g. by working with differences of subsequent values, or by subtracting an interpolated trend (linear, quadratic etc.). For stocks, indexes, and currencies, converting into the series of returns does the trick. For volume, dividing it by the average of the last k quotes, e.g. over a year, can scale it down. Indicators are series derived from others, enhancing some features of interest, such as trend reversal. Over the years traders and technical analysts trying to predict stock movements developed the formulae (Murphy, 1999), some later confirmed to carry useful information (Sullivan et al., 1999). Indicators can also reduce noise due to the averaging in many of the formulae. Common indicators include: Moving Average (MA), Stochastic Oscillator, Moving Average Convergence Divergence (MACD), Rate of Change (ROC), Relative Strength Index (RSI). Normalization brings values to a certain range, minimally distorting initial data relationships, e.g. the SoftMax norm increasingly squeezes extreme values, linearly mapping the middle 95% of values.
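The conversion of prices into returns and of series into instances by delay vector embedding, as described in Sections 2.2.2 and 2.2.3, can be illustrated with a minimal sketch. The function names, the embedding dimension `dim` and the delay `lag` are assumptions made for this illustration; they are not taken from the thesis.

```python
def returns(prices):
    """Simple returns r_t = p_t / p_{t-1} - 1 (one value shorter than prices)."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]


def delay_vectors(series, dim, lag=1):
    """Map a series to (time index, delay vector) pairs.

    The vector at index t is (x_t, x_{t-lag}, ..., x_{t-(dim-1)*lag}).
    """
    start = (dim - 1) * lag
    return [
        (t, [series[t - k * lag] for k in range(dim)])
        for t in range(start, len(series))
    ]


def make_instances(input_series, dim, lag=1):
    """Append the delay vectors sharing a time index t into one instance."""
    embedded = [dict(delay_vectors(s, dim, lag)) for s in input_series]
    # Keep only the time indices available in every input series.
    common = sorted(set.intersection(*(set(e) for e in embedded)))
    return [(t, [x for e in embedded for x in e[t]]) for t in common]
```

For two aligned series of length 4 and `dim=2`, each instance has four features: the current and previous value of the first series followed by those of the second.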
  25. 25. 2.2.4 Data Reduction Sampling – not using all the data available – might be worthwhile. In my experiments with NYSE predictability, skipping the half of the training instances with the lowest weight (i.e. weekly return) enhanced predictions, as similarly reported by (Deboeck, 1994). The improvement could be due to skipping noise-dominated small changes, and/or to the dominant changes being ruled by a mechanism whose learning is distracted by the numerous small changes. Feature selection – choosing informative attributes – can make learning feasible: because of the curse of dimensionality (Mitchell, 1997), multi-feature instances demand (exponentially w.r.t. the number of features) more data to train. There are two approaches to the problem: in the filter approach, a purpose-made algorithm evaluates and selects features, whereas in the wrapper approach the final learning algorithm is presented with different feature subsets, which are selected on the quality of the resulting predictions. 2.2.5 Data Discretization Discretization maps similar values into one discrete bin, with the idea that it preserves important information, e.g. if all that matters is a real value’s sign, it could be digitized to {0; 1}, 0 for negative, 1 otherwise. Some prediction algorithms require discrete data, sometimes referred to as nominal. Discretization can improve predictions by reducing the search space, reducing noise, and by pointing to important data characteristics. Unsupervised approaches work by dividing the original feature value range into a few equal-length or equal-data-frequency intervals; supervised ones – by maximizing a measure involving the predicted variable, e.g. entropy or the chi-square statistic (Liu et al., 2002). 
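The two unsupervised discretization schemes and the sign digitization mentioned in Section 2.2.5 can be sketched as follows. This is only an illustration under simple assumptions: the function names are hypothetical, and ties in the equal-frequency variant are split by input order, which a careful implementation would probably handle differently.

```python
def sign_bin(value):
    """Digitize a real value to {0, 1}: 0 for negative, 1 otherwise."""
    return 0 if value < 0 else 1


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall into bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins bins holding (nearly) equal numbers of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins
```

On the values 0, 1, 2, 10 with two bins, equal width puts the first three values together, while equal frequency splits them two and two, which shows how an outlier affects the two schemes differently.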
Since discretization is an information-losing transformation, it should be approached with caution, especially as most algorithms perform univariate discretization – they look at one feature at a time, disregarding that it may have (additional) significance only in the context of other features, as would be preserved in multivariate discretization. For example, if the predicted class = sign(xy), only discretizing x and y in tandem can discover their significance; alone, x and y can be inferred as not related to
  26. 26. class and even disregarded! The multivariate approach is especially important in financial prediction, where no single variable can be expected to bring significant predictability (Zemke & Rams, 2003). 2.2.6 Data Quality Assessment Predictability assessment allows one to concentrate on feasible cases (Hawawini & Keim, 1995). Some tests are simple non-parametric predictors – prediction quality reflecting predictability. The tests may involve: 1) Linear methods, e.g. to measure correlation between the predicted and feature series. 2) Nearest Neighbor prediction method, to assess local model-free predictability. 3) Entropy, to measure information content (Molgedey & Ebeling, 2000). 4) Detrended Fluctuation Analysis (DFA), to reveal long term self-similarity, even in nonstationary series (Vandewalle et al., 1997). 5) Chaos and Lyapunov exponent, to test short-term determinism. 6) Randomness tests like chi-square, to assess the likelihood that the observed sequence is random. 7) Nonstationarity tests. 2.3 Basic Time Series Models This section presents basic prediction methods, starting with the linear models well established in the financial literature and moving on to modern nonlinear learning algorithms. 2.3.1 Linear Models Most linear time series models descend from the AutoRegressive Moving Average (ARMA) and Generalized Autoregressive Conditional Heteroskedastic (GARCH) (Bollerslev, 1986) models, a summary of which follows (Tsay, 2002). ARMA models join the simpler AutoRegressive (AR) and Moving-Average (MA) models. The concept is useful in volatility modelling, less so in return prediction. A general ARMA(p, q) has the form:
  27. 27.
$$r_t = \phi_0 + \sum_{i=1}^{p} \phi_i r_{t-i} + a_t - \sum_{i=1}^{q} \theta_i a_{t-i}$$
where p is the order of the AR part, $\phi_i$ its parameters, q the order of the MA part, $\theta_i$ its parameters, and $a_t$ normally-distributed noise. Given data series $r_t$, there are heuristics to specify the order and parameters, e.g. either by the conditional or exact likelihood method. The Ljung-Box statistics of residuals can check the fit (Tsay, 2002). GARCH models volatility, which is influenced by time dependent information flows resulting in pronounced temporal volatility clustering. For a log return series $r_t$, we assume its mean ARMA-modelled, then let $a_t = r_t - \mu_t$ be the mean-corrected log return. Then $a_t$ follows a GARCH(m, s) model if:
$$a_t = \sigma_t \epsilon_t, \qquad \sigma_t^2 = \alpha_0 + \sum_{i=1}^{m} \alpha_i a_{t-i}^2 + \sum_{j=1}^{s} \beta_j \sigma_{t-j}^2$$
where $\epsilon_t$ is a sequence of independent and identically distributed (iid) random variables with mean 0 and variance 1, $\alpha_0 > 0$, $\alpha_i \geq 0$, $\beta_j \geq 0$, and $\sum_{i=1}^{\max(m,s)} (\alpha_i + \beta_i) < 1$. Box-Jenkins AutoRegressive Integrated Moving Average (ARIMA) models extend the ARMA models, moreover coming with a detailed procedure for how to fit and test such a model, not an easy task (Box et al., 1994). Because of wide applicability, extendable to nonstationary series, and the fitting procedure, the models are commonly used. ARIMA assumes that a probability model generates the series, with future values related to past values and errors. Econometric models extend the notion of a series depending only on its past values – they additionally use related series. This involves a regression model in which the time series is forecast as the dependent variable; the related time series as well as the past values of the time series are the independent or predictor variables. This, in principle, is the approach of the thesis papers.
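To make the ARMA(p, q) and GARCH(m, s) recursions of Section 2.3.1 concrete, the following sketch evaluates them for given parameters. It is an illustration only: the function names are hypothetical, the parameters are assumed to be already estimated, and terms that reach before the start of the sample are replaced by an assumed starting variance `init_var`.

```python
def arma_one_step(r_hist, a_hist, phi0, phi, theta):
    """One-step ARMA(p, q) forecast of r_t, using E[a_t] = 0.

    r_hist and a_hist hold past returns and past shocks, most recent last;
    phi and theta are the lists (phi_1..phi_p) and (theta_1..theta_q).
    """
    ar = sum(phi[i] * r_hist[-1 - i] for i in range(len(phi)))
    ma = sum(theta[i] * a_hist[-1 - i] for i in range(len(theta)))
    return phi0 + ar - ma


def garch_variances(a, alpha0, alpha, beta, init_var):
    """Conditional variances sigma_t^2 for t = 0..len(a) under GARCH(m, s).

    a holds the mean-corrected returns a_0..a_{n-1}; the last value returned
    is the variance forecast for the step after the sample.
    """
    var = []
    for t in range(len(a) + 1):
        arch = sum(
            alpha[i] * (a[t - 1 - i] ** 2 if t - 1 - i >= 0 else init_var)
            for i in range(len(alpha))
        )
        garch = sum(
            beta[j] * (var[t - 1 - j] if t - 1 - j >= 0 else init_var)
            for j in range(len(beta))
        )
        var.append(alpha0 + arch + garch)
    return var


def satisfies_garch_constraints(alpha0, alpha, beta):
    """Check alpha0 > 0, alpha_i >= 0, beta_j >= 0 and sum(alpha + beta) < 1."""
    nonneg = all(x >= 0 for x in alpha) and all(x >= 0 for x in beta)
    return alpha0 > 0 and nonneg and sum(alpha) + sum(beta) < 1
```

With alpha0 = 0.1, alpha_1 = 0.2 and beta_1 = 0.7, starting from variance 1.0, a quiet period lowers the variance towards 0.8 and a shock then raises it again, which is the volatility clustering the model is meant to capture.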
  28. 28. 2.3.2 Limits of Linear Models Modern econometrics increasingly shifts towards nonlinear models of risk and return. Bera – actively involved in (G)ARCH research – remarked (Bera & Higgins, 1993): ”a major contribution of the ARCH literature is the finding that apparent changes in the volatility of economic time series may be predictable and result from a specific type of nonlinear dependence rather than exogenous structural changes in variables”. Campbell further argued (Campbell et al., 1997): ”it is both logically inconsistent and statistically inefficient to use volatility measures that are based on the assumption of constant volatility over some period when the resulting series moves through time.” 2.3.3 Nonlinear Methods Nonlinear methods are increasingly preferred for financial prediction, due to the perceived nonlinear dependencies in financial data which cannot be handled by purely linear models. A short overview of the methods follows (Mitchell, 1997). Artificial Neural Network (ANN) advances linear models by applying a non-linear function to the linear combination of inputs to a network unit – a perceptron. In an ANN, perceptrons are usually prearranged in layers, with those in the first layer having access to the inputs, and the perceptrons’ outputs forming the inputs to the next layer, the final one providing the ANN output(s). Training a network involves adjusting the weights in each unit’s linear combination so as to minimize an objective, e.g. squared error. Backpropagation – the classical training method – may, however, miss an optimal network due to falling into a local minimum, so other methods might be preferred (Zemke, 2002b). Inductive Logic Programming (ILP) and a decision tree (Mitchell, 1997) learner C4.5 (Quinlan, 1993) generate if-conditions-then-outcome symbolic rules, human understandable if small. Since the search for such rules is expensive, the algorithms either employ greedy heuristics, e.g. C4.5 looking
  29. 29. at a single variable at a time, or perform exhaustive search, e.g. ILP Progol. These limit the applicability, especially in an area where data is voluminous and unlikely to be in the form of simple rules. Additionally, ensembles – putting a number of different predictors to vote – obstruct the acclaimed human comprehension of the rules. However, the approach could be of use in more regular domains, such as customer rating and perhaps fraud detection. Rules can also be extracted from an ANN, or used together with probabilities, making them more robust (Kovalerchuk & Vityaev, 2000). Nearest Neighbor (kNN) does not create a general model, but to predict, it looks back for the k most similar cases. It is distracted by noisy/irrelevant features, but if this is ruled out, failure of kNN suggests that the most that can be predicted are general regularities, e.g. based on the output (conditional) distribution. Bayesian predictor first learns the probabilities with which evidence supports outcomes, which are then used to predict new evidence’s outcome. Although the simple learning scheme is robust to violating the ’naive’ independent-evidence assumption, watching independence might pay off, especially as in decreasing markets variables become more correlated than usual. Support Vector Machines (SVM) offer a relatively new and powerful learner, having attractive characteristics for time series prediction (Muller et al., 1997). First, the model deals with multidimensional instances, actually the more features the better – reducing the need for (wrong) feature selection. Second, it has few parameters, thus finding optimal settings can be easier; one parameter refers to the noise level the system can handle. Genetic Algorithms (GAs) (Deboeck, 1994) mimic biological evolution by mutation and cross-over of solutions, in order to maximize their fitness. This is a general optimization technique, thus can be applied to any problem – a solution can encode data selection, preprocessing, predictor. 
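The Nearest Neighbor idea described above, looking back for the k most similar cases and letting them decide, can be sketched in a few lines. The Euclidean distance and the unweighted majority vote are assumptions of this illustration; the thesis papers may use other distances or weightings.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict the majority label among the k training cases nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Because every feature enters the distance with equal weight, a few irrelevant features can change which cases count as nearest, which is one way to see why kNN is distracted by noisy or irrelevant features.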
GAs explore novel possibilities, often not thought of by humans. Therefore, it may be worth keeping some predictor settings as parameters that
  30. 30. can be (later) GA-optimized. Evolutionary systems – another example of evolutionary computation – work in a similar way to GAs, except that the solution is coded as a real-valued vector, and optimized not only with respect to the values but also to the optimization rate. 2.3.4 General Learning Issues Computational Learning Theory (COLT) theoretically analyzes prediction algorithms, with respect to the learning process assumptions, data and computation requirements. Probably Approximately Correct (PAC) Learnability is a central notion in the theory, meaning that we learn probably – with probability $1 - \delta$ – and approximately – within error $\epsilon$ – the correct predictor drawn from a space H. The lower bound on the number of training examples m to find such a predictor is an important result:
$$m \geq \frac{1}{\epsilon}\left(\ln |H| + \ln(1/\delta)\right)$$
where |H| is the size of the space – the number of predictors in it. This is usually an overly large bound – specifics about the learning process can lower it. However, it provides some insights: m grows linearly in the error factor $1/\epsilon$ and logarithmically in $1/\delta$, where $1 - \delta$ is the probability that we find the hypothesis at all (Mitchell, 1997). Curse of dimensionality (Bellman, 1961) involves two related problems. As the data dimension – the number of features in an instance – grows, the predictor needs increasing resources to cover the increasing instances. It also needs more instances to learn – exponentially with the dimension. Some prediction algorithms, e.g. kNN, will not be able to generalize at all if the dimension is greater than ln(M), where M is the number of instances. This is why feature selection – reducing the data dimension – is so important. The amount of data to train a predictor can be experimentally estimated (Walczak, 2001). Overfitting means that a predictor memorizes non-general aspects of the training data, such as noise. This leads to poor prediction on new data.
  31. 31. This is a common problem due to a number of reasons. First, the training and testing data are often not well separated, so memorizing the common part will give the predictor a higher score. Second, multiple trials might be performed on the same data (split), so in effect the predictor coming out will be best suited for exactly that data. Third, the predictor complexity – number of internal parameters – might be too big for the number of training instances, so the predictor learns even the unimportant data characteristics. Precautions against overfitting involve: good separation of training and testing data, careful evaluation, use of ensembles averaging-out the individual overfitting, and an application of Occam’s razor. In general, overfitting is a difficult problem that must be approached individually. A discussion of how to deal with it can be found in (Mitchell, 1997). Occam’s razor – preferring a smaller solution, e.g. a predictor involving fewer parameters, to a bigger one, other things equal – is not a specific technique but a general guidance. There are indeed arguments (Mitchell, 1997) that a smaller hypothesis has a bigger chance to generalize well on new data. Speed is another motivation – a smaller predictor is likely to be faster, which can be especially important in an ensemble. Entropy (Shannon & Weaver, 1949) is an information measure useful at many stages in a prediction system development. Entropy expresses the number of bits of information brought in by an entity, be it the next training instance or checking another condition. Since the notion does not assume any data model, it is well suited to deal with nonlinear systems. As such it is used in feature selection, predictability estimation, and predictor construction, e.g. in C4.5 as the information gain measure to decide which feature to split on. 2.4 Ensemble Methods An ensemble (Dietterich, 2000) is a number of predictors whose votes are put together into the final prediction. 
The predictors, on average,
  32. 32. are expected to be better than random and to make independent errors. The idea is that a correct majority offsets individual errors, thus the ensemble will be correct more often than an individual predictor. The diversity of errors is usually achieved by training a scheme, e.g. C4.5, on different instance samples or features. Alternatively, different predictor types – like C4.5, ANN, kNN – can be used. Common schemes include Bagging, Boosting, Bayesian ensembles and their combinations (Dietterich, 2000). Bagging produces an ensemble by training predictors on different bootstrap samples – each the size of the original data, but sampled allowing repetitions. The final prediction is the majority vote. This simple to implement scheme is always worth trying, in order to reduce prediction variance. Boosting initially assigns equal weights to all data instances and trains a predictor, then it increases the weights of the misclassified instances, trains the next predictor on the new distribution etc. The final prediction is a weighted vote of the predictors obtained in this way. Boosting increasingly pays attention to misclassified instances, which may lead to overfitting if the instances are noisy. Bayesian ensemble, similarly to the Bayesian predictor, uses conditional probabilities accumulated for the individual predictors, to arrive at the most evidenced outcome. Given good estimates of the predictors’ accuracy, a Bayesian ensemble results in a more optimal prediction compared to bagging. 2.5 System Evaluation Proper evaluation is crucial to a prediction system development. First, it has to measure exactly the interesting effect, e.g. trading return as opposed to related, but not identical, prediction accuracy. Second, it has to be sensitive enough to spot even minor gains. Third, it has to convince that the gains are not merely a coincidence.
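The bagging scheme of Section 2.4, bootstrap samples of the original size followed by a majority vote, can be sketched as below. The base learner `train_threshold` is a hypothetical stand-in chosen to keep the example short (it assumes one-dimensional inputs with class 1 lying at larger values); any learner returning a callable predictor could take its place.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Sample len(data) items with repetitions (a bootstrap sample)."""
    return [rng.choice(data) for _ in data]


def train_threshold(sample):
    """Hypothetical base learner for (x, label) pairs with labels 0/1.

    Predicts 1 when x exceeds the midpoint between the two class means; if
    the sample happens to contain one class only, it predicts that class.
    """
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        only = 0 if xs0 else 1
        return lambda x: only
    cut = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0
    return lambda x: 1 if x > cut else 0


def bagging(data, train, n_predictors, seed=0):
    """Train n_predictors predictors, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, rng)) for _ in range(n_predictors)]


def majority_vote(predictors, x):
    """Final prediction: the label output by most predictors."""
    votes = Counter(p(x) for p in predictors)
    return votes.most_common(1)[0][0]
```

An odd number of predictors avoids ties in a two-class vote; with an even number, the tie here is resolved by the order in which labels were first seen, which is an arbitrary choice of this sketch.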
  33. 33. Usually prediction performance is compared against published results. Although it has its problems, such as data overfitting and accidental successes due to multiple (worldwide!) trials, this approach works well as long as everyone uses the same data and evaluation procedure, so meaningful comparisons are possible. However, when no agreed benchmark is available, as in the financial domain, another approach must be adopted. Since the main question concerning financial data is whether prediction is at all possible, it suffices to compare a predictor’s performance against the intrinsic growth of a series – also referred to as the buy and hold strategy. Then a statistical test can judge if there is a significant improvement. 2.5.1 Evaluation Data To reasonably test a prediction system, the data must include different trends and the assets for which the system is to perform, and be plentiful enough to warrant significant conclusions. Overfitting a system to data is a real danger. Dividing data into three disjoint sets is the first precaution. The training portion of the data is used to build the predictor. If the predictor involves some parameters which need to be tuned, they can be adjusted so as to maximize performance on the validation part. Now, with the system parameters frozen, its performance on an unseen test set provides the final performance estimation. In multiple tests, the significance level should be adjusted, e.g. if 10 tests are run and the best appears 99.9% significant, it really is $0.999^{10} \approx 0.99$, i.e. 99% (Zemke, 2000). If we want the system to predict the future of a time series, it is important to maintain the proper time relation between the training, validation and test sets – basically training should involve instances time-preceding any test data. 
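The two precautions of Section 2.5.1, a time-ordered split into training, validation and test sets and the adjustment of significance for multiple tests, can be illustrated as follows. The split fractions 0.6 and 0.2 are arbitrary assumptions for the example, and the function names are hypothetical.

```python
def chronological_split(instances, train_frac=0.6, valid_frac=0.2):
    """Split time-ordered instances into train, validation and test parts.

    The order is preserved, so every training instance precedes every
    validation instance, which in turn precedes every test instance.
    """
    n = len(instances)
    n_train = round(n * train_frac)
    n_valid = round(n * valid_frac)
    train = instances[:n_train]
    valid = instances[n_train:n_train + n_valid]
    test = instances[n_train + n_valid:]
    return train, valid, test


def joint_significance(single_level, n_tests):
    """Significance of the best of n_tests independent tests.

    For example, 10 tests at the 0.999 level give 0.999 ** 10, about 0.99.
    """
    return single_level ** n_tests
```

The second function assumes the tests are independent, which is the reading under which the 99.9% to 99% example in the text works out.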
Bootstrap (Efron & Tibshirani, 1993) – sampling with repetitions as many elements as in the original – and deriving a predictor for each such sample, is useful for collecting various statistics (LeBaron & Weigend, 1994), e.g. return and risk-variability. It can also be used for ensemble creation or best predictor selection, however not without limits (Hastie et al., 2001).
  34. 34. 2.5.2 Evaluation Measures Financial forecasts are often developed to support semi-automated trading (profitability), whereas the algorithms used in those systems might originally have had different objectives. Accuracy – percentage of correct discrete (e.g. up/down) predictions – is a common measure for discrete systems, e.g. ILP/decision trees. Square error – sum of squared deviations from actual outputs – is a common measure in numerical prediction, e.g. ANN. Performance measure – incorporating both the predictor and the trading model it is going to benefit – is preferable and ideally should measure exactly what we are interested in, e.g. commission and risk adjusted return (Hellström & Holmström, 1998), not just return. Actually, many systems’ ’profitability’ disappears once the commissions are taken into account. 2.5.3 Evaluation Procedure In data sets where instance order does not matter, N-fold cross validation – data divided into N disjoint parts, N − 1 for training and 1 for testing, error averaged over all N (Mitchell, 1997) – is a standard approach. However, in the case of time series data, it underestimates error because in order to train a predictor we sometimes use the data that comes after the test instances – unlike in real life, where the predictor knows only the past, not the future. For series, a sliding window approach is more apt: a window/segment of consecutive instances is used for training and a following segment for testing, with the windows sliding over all data as statistics are collected. 2.5.4 Non/Parametric Tests Parametric statistical tests have assumptions, e.g. concerning the sample independence and distribution, and as such allow stronger conclusions for smaller data – the assumptions can be viewed as additional input information, so they need to be demonstrated – which is often missed. Nonparametric tests put much weaker requirements, so for equally numerous data allow weaker conclusions. 
Since financial data are not normally distributed, as many of the parametric tests require, non-parametric comparisons might be safer (Heiler, 1999).
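The sliding window procedure of Section 2.5.3 can be sketched as a generator of index ranges, with each test segment directly following its training segment. The window sizes, the step and the callables `train` and `error` are assumptions of this illustration rather than settings from the thesis.

```python
def sliding_windows(n_instances, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) pairs sliding over the data.

    Each test segment immediately follows its training segment, so a
    predictor never sees data from after the instances it is tested on.
    """
    step = test_size if step is None else step
    start = 0
    while start + train_size + test_size <= n_instances:
        split = start + train_size
        yield range(start, split), range(split, split + test_size)
        start += step


def sliding_window_error(instances, train, error, train_size, test_size):
    """Average the test error over all sliding windows.

    train(list_of_instances) returns a predictor; error(predictor,
    list_of_instances) returns a number. Both are supplied by the caller.
    """
    errors = []
    for tr, te in sliding_windows(len(instances), train_size, test_size):
        predictor = train([instances[i] for i in tr])
        errors.append(error(predictor, [instances[i] for i in te]))
    if not errors:
        raise ValueError("not enough instances for a single window")
    return sum(errors) / len(errors)
```

With 10 instances, a training window of 4 and a test window of 2, this yields three windows, the last one testing on instances 8 and 9.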
  35. 35. Surrogate data is a useful concept in a system evaluation (Kantz & Schreiber, 1999a). The idea is to generate data sets sharing characteristics of the original data – e.g. permutations of a series have the same mean, variance etc. – and for each to compute a statistic of interest, e.g. the return of a strategy. If α is the acceptable risk of wrongly rejecting the null hypothesis that the original series statistic is lower (higher) than that of any surrogate, then 1/α − 1 surrogates are needed; if all give a higher (lower) statistic than the original series, then the hypothesis can be rejected. Thus, if a predictor’s error was lower on the original series, as compared to 19 runs on surrogates, we can be 95% sure it was not a fluke.
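The surrogate data test described above can be sketched for the one-sided case in which a lower statistic (e.g. prediction error) on the original series is the interesting outcome. Random permutations serve as surrogates, as in the example from the text; the function names and the fixed random seed are assumptions of this illustration.

```python
import random


def n_surrogates(alpha):
    """Number of surrogates needed for risk level alpha: 1/alpha - 1."""
    return round(1.0 / alpha) - 1


def surrogate_test(series, statistic, alpha=0.05, seed=0):
    """Return True if the original statistic is lower than on every surrogate.

    Surrogates are random permutations of the series, which keep its mean
    and variance but destroy the temporal order.
    """
    rng = random.Random(seed)
    original = statistic(series)
    for _ in range(n_surrogates(alpha)):
        surrogate = list(series)
        rng.shuffle(surrogate)
        if statistic(surrogate) <= original:
            return False
    return True
```

With alpha = 0.05 this uses the 19 surrogates of the example in the text; the statistic passed in would typically be the error of a predictor trained and tested on the given series.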
  37. 37. Chapter 3 Development of the Thesis Be concerned with the ends not the means. Bruce Lee. 3.1 First half – Exploration When introduced to the area of machine learning (ML) around 1996, I noticed that many of the algorithms were developed on artificial ’toy problems’ and once done, the search started for more realistic problems ’suitable’ for the algorithm. As reasonable as such a strategy might initially appear – knowledge of the optimal performance area of a learning algorithm is what is often desired – such studies seldom yielded general area insights, merely performance comparisons for the carefully chosen test domains. This is in sharp contrast to the needs of a practitioner, who faces a learning problem first and searches for the solution method later, not vice versa. So, in my research, I adopted the practical approach: here is my prediction problem, what can I do about it. My starting point was that financial prediction is difficult, but is it impossible? Or perhaps, the notion of unpredictability emerged due to the nature of the method rather than the data – a case already known: with the advent of chaotic analysis many processes previously considered random turned out to be deterministic, at least in the short run. Though I do not believe that such a complex socio-economical process as the markets will any time soon be found completely predictable, the question of a limited predictability remains open and challenging. And since challenging problems often lead to profound discoveries I considered the subject worthwhile.
  38. 38. The experiments started with Inductive Logic Programming (ILP) – learning logic programs by combining provided background predicates supposedly useful in the domain in question. I used the then (in 1997) state-of-the-art system, Progol, reported successful in other domains, such as toxicology and chemistry. I provided the system with various financial indicators, however, despite many attempts, no compressed rules were ever generated. This could be due to the noise present in financial data and the rules, if any, being far from the compact form sought by an ILP system. The initial failure reiterated the question: is financial prediction at all possible, and if so, which algorithm works best? The failure of an otherwise successful learning paradigm directed the search towards more original methods. After many fruitless trials, some promising results started appearing, with the unorthodox method briefly presented in the Feasibility Study on Short-Term Stock Prediction, Appendix A. This method looked for invariants in the time series predicted – not just patterns with high predictive accuracy, but patterns that have above-random accuracy in a number of temporally distinct time epochs, thus excluding those that work perhaps well, but only for a time. The work went unpublished since the trials were limited and in the early stages of my research I was encouraged to use more established methods. However, it is interesting to note that the method is similar to entropy-based compression schemes, which I discovered later. So I went on to evaluate standard machine learning – to see which of the methods warrants further investigation. I tried: Neural Network, Nearest Neighbor, Naive Bayesian Classifier and Genetic Algorithms (GA) evolved rules. That research, presented and published as Nonlinear Index Prediction – thesis paper 1, concludes that Nearest Neighbor (kNN) works best. 
Some of the details, not included in the paper, made it into a report ILP and GA for Time Series Prediction, thesis paper 2. The success of kNN suggested that the delay embedding and local prediction work for my data, so perhaps could be improved. However, when I tried to GA-optimize the embedding parameters, the prediction results were not better. If fine-tuning was not the way, perhaps averaging a number of rough predictors would be. The majority voting scheme has indeed
  39. 39. improved the prediction accuracy. The originating publication Bagging Imperfect Predictors, thesis paper 3, presents bagging results from Nonlinear Index Prediction and an approach believed to be novel at that time – bagging predictions from a number of classifiers evolved in one GA population. Another spin-off from the success of kNN in Nonlinear Index Prediction, and thus from the implicit presence of determinism and perhaps limited dimension of the data, was a research proposal Evolving Differential Equations for Dynamical System Modeling. The idea behind this more extensive project is to use a Genetic Programming-like approach, but instead of evolving programs, to evolve differential equations, known as the best descriptive and modeling tool for dynamical systems. This is what the theory says, but finding equations fitting given data is not yet a solved task. The project was stalled, awaiting financial support. But coming back to the main thesis track. GA experiments in Bagging Imperfect Predictors were computationally intensive, as is often the case when developing a new learning approach. This problem gave rise to an idea of how to try a number of development variants at once, instead of one-by-one, saving on computation time. Rapid Fine-Tuning of Computationally Intensive Classifiers, thesis paper 4, explains the technique, together with some experimental guidelines. The ensemble of GA individuals, as in Bagging Imperfect Predictors, could further benefit from a more powerful classifier committee technique, such as boosting. The published poster Amalgamation of Genetic Selection and Boosting, Appendix B, highlights the idea. 3.2 Second half – Synthesis At that point, I presented the mid-Ph.D. results and thought what to do next. 
Since ensembles, becoming mainstream in the machine learning community, seemed the most promising way to go, I investigated how different types of ensembles performed with my predictors, with the Bayesian coming a bit ahead of Bagging and Boosting. However, the results were not that startling and I found more extensive comparisons in the
  40. 40. literature, making me abandon that line of research. However, while searching for the comparisons above, I had done quite an extensive review. I selected the most practical and generally-applicable papers in Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data, a publication which addresses the four data issues relevant to financial prediction, thesis paper 5. Apart from the general algorithmic considerations, there are also the tens of little decisions that need to be taken while developing a prediction system, many leading to pitfalls. While reviewing descriptions of many systems ’beating the odds’ I realized that, although widely different, the acclaimed successful systems share common characteristics, while the naive systems – quite often manipulative in presenting the results – share common mistakes. This led to the thesis paper 6: On Developing Financial Prediction System: Pitfalls and Possibilities which is an attempt to highlight some of the common solutions. Financial data are generated in complex and interconnected ways. What happens in Tokyo influences what happens in New York and vice versa. For prediction this has several consequences. First, there are very many data series to potentially take as inputs, creating data selection and curse of dimensionality problems. Second, many of the series are interconnected, in general, in nonlinear ways. Hence, an attempt to predict must identify the important series and their interactions, having decided that the data warrants predictability at all. These considerations led me to a long investigation. Searching for a predictability measure, I had the idea to use the common Zip compression to estimate entropy in a constructive way – if the algorithm could compress (many interleaved series), its internal working could provide the basis for a prediction system. But reviewing references, I found a similar work, more mathematically grounded, so I abandoned mine. Then, I shifted attention to uncovering multivariate dependencies, along with a predictability measure, by means of weighted and GA-optimized Nearest Neighbor, which failed (footnote 1: it worked, but only up to 15 input data series, whereas I wanted the method to work for more than 50 series). Then came a multivariate discretization idea, initially based on Shannon
  41. 41. (conditional) entropy, later reformulated in terms of accuracy. After so many false starts, the feat was quite spectacular as the method was able to spot multivariate regularities, involving only a fraction of the data, in up to 100 series. To my knowledge, this is also the first (multivariate) discretization having the maximization of ensemble performance as an objective. Multivariate Feature Coupling and Discretization is the thesis paper number 7. Along the second part of the thesis, I have steadily developed time series prediction software incorporating my experiences and expertise. However, at the thesis print time the system is not yet operational so its description is not included.
  43. 43. Chapter 4 Contributions of Thesis Papers This section summarizes some of the contributions of the 7 papers included in the thesis. 4.1 Nonlinear Index Prediction This publication (Zemke, 1998) examines index predictability by means of Neural Networks (ANN), Nearest Neighbor (kNN), Naive Bayesian and Genetic Algorithms-optimized Inductive Logic Program (ILP) classifiers. The results are interesting in many respects. First, they show that a limited prediction is indeed possible. This adds to the growing evidence that an unqualified Efficient Market Hypothesis might one day be revised. Second, Nearest Neighbor achieves the best accuracy among the commonly used Machine Learning methods, which might encourage further exploration in this area dominated by Neural Network and rule-based, ILP-like, systems. Also, the success might hint at specific features of the data analyzed. Namely, unlike the other approaches, Nearest Neighbor is a local, model-free technique that does not assume any form of the learnt hypothesis, as is done by Neural Network architecture or LP background predicates. Third, the superior performance of Nearest Neighbor, as compared to the other methods, points to the problems in constructing global models for the financial data. If confirmed in more extensive experiments, it would highlight the intrinsic difficulties of describing some economical dependencies in terms of simple rules, as taught to economics students. And fourth, the failure of the Naive Bayesian classifier can point out limitations of some statistical
  44. 44. techniques used to analyze complex preprocessed data, a common approach in the earlier studies of financial data that contributed so much to the Efficient Market Hypothesis view.

4.2 ILP via GA for Time Series Prediction

Since only the main results of the GA-optimized ILP could be included in the earlier paper, due to publisher space limits, this report presents some details of these computationally intensive experiments (Zemke, 1999c). Although the overall accuracy of LP on the index data was not impressive, the attempts still have practical value in outlining the limits of otherwise successful techniques. First, the initial experiments applying Progol, at that time a 'state of the art' Inductive Logic Programming system, show that a learning system successful on some domains can fail on others. There could be at least two reasons for this: a domain unsuitable for the learning paradigm, or unskillful use of the system. Here, I only note that most of the successful applications of Progol involve domains where a few rules hold most of the time: chemistry, astronomy, (simple) grammars, whereas financial prediction rules, if any, are softer. As for the unskillful use of an otherwise capable system, the comment could be that such a system would merely shift the burden from learning the theory implied by the data provided to learning its 'correct usage', instead of lessening the burden altogether. As such, one should be aware that machine learning is still more of an art, demanding experience and experimentation, than engineering, which provides procedures for almost blindly solving a given problem.

The second contribution of this paper exposes background predicate sensitivity, exemplified by variants of equal. The predicate definitions can have a substantial influence on the achieved results, again highlighting the importance of an experimental approach and, possibly, a requirement for nonlinear predicates. Third, since GA-evolved LP can be viewed as an instance of Genetic Programming (GP), the results confirm that GP is perhaps not the best vehicle for time series prediction. And fourth, a general observation about GA-optimization and learning: while evolving LP of
  45. 45. varying size, the best (most accurate) programs usually emerged in GA experiments with only a secondary fitness bonus for smaller programs, as opposed to runs in which programs would be penalized by their size. Actually, it was interesting to note that the path to smaller and accurate programs often led through much bigger programs which were subsequently reduced: had the bigger programs not been allowed to appear in the first place, the smaller ones would not have been found either. This observation, together with the not so good generalization of the smallest programs, issues a warning against blind application of Occam's Razor in evolutionary computation.

4.3 Bagging Imperfect Predictors

This publication (Zemke, 1999b), again due to publisher restrictions, compactly presents a number of contributions both to the area of financial prediction and machine learning. The key tool here is bagging, a scheme involving majority voting of a number of different classifiers so as to increase the ensemble's accuracy. The contributions could be summarized as follows. First, instead of the usual bagging of the same classifier trained on different (bootstrap) partitions of the data, classifiers based on different data partitions as well as different methods are bagged together, an idea described as 'neat' by one of the referees. This leads to higher accuracy than that achieved by bagging each of the individual method classifiers or data selections separately. Second, as applied to index data, prediction accuracy seems highly correlated to returns, a relationship reported to break down at higher accuracies. Third, since the above two points hold, bagging applied to a variety of financial predictors has the potential to increase the accuracy of prediction and, consequently, of returns, which is demonstrated. Fourth, in the case of GA-optimized classifiers, it is advantageous to bag all above-average classifiers present in the final GA population, instead of the usual practice of taking the single best classifier. And fifth, somewhat contrary to conventional wisdom, it turned out that on the data analyzed, big index movements were more predictable than smaller ones, most likely due to the smaller ones consisting of relatively more noise.
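The voting scheme summarized in section 4.3 can be illustrated with a minimal Python sketch. It is not the code used in the paper: the function names, the toy classifiers and the accuracy figures are all hypothetical, and the selection rule simply keeps classifiers whose estimated accuracy exceeds the population mean, as one possible reading of "all above-average classifiers".

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]


def select_above_average(classifiers, accuracies):
    """Keep the classifiers whose estimated accuracy exceeds the mean accuracy."""
    mean_accuracy = sum(accuracies) / len(accuracies)
    return [clf for clf, acc in zip(classifiers, accuracies) if acc > mean_accuracy]


def ensemble_predict(classifiers, vector):
    """Let every classifier vote on the vector and return the majority label."""
    return majority_vote([clf(vector) for clf in classifiers])


# Hypothetical classifiers built by different methods (stand-ins for kNN, ANN, ...).
def always_up(vector):
    return "up"


def always_down(vector):
    return "down"


def first_value_rule(vector):
    return "up" if vector[0] > 4 else "down"


def last_value_rule(vector):
    return "up" if vector[-1] > 4 else "down"


population = [always_up, always_down, first_value_rule, always_down, last_value_rule]
estimated_accuracy = [0.55, 0.48, 0.60, 0.51, 0.58]  # made-up illustration values

voters = select_above_average(population, estimated_accuracy)
print(len(voters), ensemble_predict(voters, (6, 2)))
```

An odd number of voters avoids ties in a two-class vote; with an even number, the sketch falls back on whichever label `Counter` encountered first.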
  46. 46. 4.4 Rapid Fine Tuning of Computationally Intensive Classifiers

This publication (Zemke, 2000), a spin-off of the experiments carried out for the previous paper, elaborates on a practical aspect applicable to almost any machine learning system development, namely the rapid fine-tuning of parameters for optimal performance. The results could be summarized as follows. First, working on a specific difficult problem, as in the case of index prediction, can lead to a solution and insights into more general problems, and as such is of value beyond merely the domain of the primary investigation. Second, the paper describes a strategy for simultaneous exploration of many versions of a fine-tuned algorithm with different parameter choices. And third, a statistical analysis method for detection of superior parameter settings is presented, which together with the earlier point allows for rapid fine-tuning.

4.5 On Developing Financial Prediction System: Pitfalls and Possibilities

The publication (Zemke, 2002b) is the result of my own experiments with a financial prediction system development and of a review of such in the literature. The paper succinctly lists issues appearing in the development process, pointing to some common pitfalls and solutions. The contributions could be summarized as follows.

First, it makes the reader aware of the many steps involved in a successful system implementation. The presentation tries to follow the development progression: from data preparation, through predictor selection and training, 'boosting' the accuracy, to evaluation issues. Being aware of the progression can help in a more structured development and pinpoint some omissions.

Second, for each stage of the process, the paper lists some common pitfalls. The importance of this cannot be overstated. For instance, many 'profit-making' systems presented in the literature are tested only in the decade-long bull market of 1990-2000, and never tested in long-term
  47. 47. falling markets, which would most likely average out the systems' performance. Such are some of the many pitfalls pointed out.

Third, the paper suggests some solutions to the pitfalls and to general issues appearing in a prediction system development.

4.6 Ensembles in Practice: Prediction, Estimation, Multi-Feature and Noisy Data

This publication (Zemke, 2002a) is the result of an extensive literature search on ensembles applied to realistic data sets, with 4 objectives in mind: 1) time series prediction: how ensembles can specifically exploit the serial nature of the data; 2) accuracy estimation: how ensembles can measure the maximal prediction accuracy for a given data set, in a better way than any single method; 3) how ensembles can exploit multidimensional data; and 4) how to use ensembles in the case of noisy data.

The four issues appear in the context of financial time series prediction, though the examples referred to are non-financial. Actually, this cross-domain application of working solutions could bring new methods to financial prediction. The contributions of the publication can be summarized as follows.

First, after a general introduction to how and why ensembles work, and to the different ways to build them, the paper diverges into the four title areas. The message here can be that although ensembles are generally applicable and robust techniques, a search for the 'ultimate ensemble' should not overlook the characteristics and requirements of the problem in question. A similar quest for the 'best' machine learning technique a few years ago failed with the realization that different techniques work best in different circumstances. Similarly with ensembles: different problem settings require individual approaches.

Second, the paper goes on to present some of the working approaches addressing the four issues in question. This has a practical value. Usually the ensemble literature is organized by ensemble method, whereas a practitioner has data and a goal, e.g. to predict from noisy series data. The paper points to possible solutions.
  48. 48. 4.7 Multivariate Feature Coupling and Discretization

This paper (Zemke & Rams, 2003) presents a multivariate discretization method based on Genetic Algorithms applied twice: first to identify important feature groupings, second to perform the discretization maximizing a desired function, e.g. the predictive accuracy of an ensemble built on those groupings. The contributions could be summarized as follows. First, as the title suggests, a multivariate discretization is provided, presenting an alternative to the very few multivariate methods reported. Second, feature grouping and ranking, the intermediate outcome of the procedure, has a value in itself: it allows one to see which features are interrelated and how much predictability they bring in, promoting feature selection. Third, the second, global GA-optimization allows an arbitrary objective to be maximized, unlike in other discretization schemes where the objective is hard-coded into the algorithm. The objective exemplified in the paper maximizes the goal of prediction, accuracy, whereas other schemes often only indirectly attempt to maximize it via measures such as entropy or the chi-square statistic. Fourth, to my knowledge, this is the first discretization to allow explicit optimization for an ensemble. This forces the discretization to act on a global basis, searching not merely for maximal information gain per selected feature (grouping) but for all features viewed together. Fifth, the global discretization can also yield a global estimate of predictability for the data.
  49. 49. Chapter 5 Bibliographical Notes

This chapter is intended to provide a general bibliography introducing newcomers to the interdisciplinary area of financial prediction. I list a few books I have found to be both educational and interesting to read in my study of the domain.

Machine Learning

Machine Learning (Mitchell, 1997). As for now, I would regard this book as the textbook for machine learning. It not only presents the main learning paradigms (neural networks, decision trees, rule induction, nearest neighbor, analytical and reinforcement learning) but also introduces hypothesis testing and computational learning theory. As such, it balances the presentation of machine learning algorithms with practical issues of using them, and some theoretical aspects of their function. Future editions of this otherwise excellent book could also consider the more novel approaches: support vector machines and rough sets.

Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (Witten & Frank, 1999). Using this book, and the software package Weka behind it, could save time otherwise spent on implementing the many learning algorithms. This book essentially provides an extended user guide to the open-source code available online. The Weka toolbox, in addition to more than 20 parameterized machine learning methods, offers data preparation, hypothesis evaluation and some visualization tools. A word of warning, though: most of the implementations are
  50. 50. straightforward and non-optimized, suitable for learning the nuts and bolts of the algorithms rather than for large-scale data mining.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Hastie et al., 2001). This book, similar in its wide scope to Machine Learning (Mitchell, 1997), could be recommended for its more rigorous treatment and some additional topics, such as ensembles.

Data Mining and Knowledge Discovery with Evolutionary Algorithms (Alex, 2002). This could be a good introduction to practical applications of evolutionary computation to various aspects of data mining.

Financial Prediction

Here, I present a selection of books introducing various aspects of nonlinear financial time series analysis.

Data Mining in Finance: Advances in Relational and Hybrid Methods (Kovalerchuk & Vityaev, 2000). This is an overview of some of the methods used for financial prediction and of features such a prediction system should have. The authors also present their system, supposedly overcoming many of the common pitfalls. However, the book is somewhat short on the details needed to re-evaluate some of the claims, but good as an overview.

Trading on the Edge (Deboeck, 1994). This is an excellent book of self-contained chapters giving a practical introduction to the essentials of neural networks, chaos analysis, genetic algorithms and fuzzy sets, as applied to financial prediction.

Neural Networks in the Capital Markets (Refenes, 1995). This collection on neural networks for economic prediction highlights some of the practical considerations in developing a prediction system. Many of the hints are applicable to prediction systems based on other paradigms, not just on neural networks.

Fractal Market Analysis (Peters, 1994). In this book, I found the chapters on various applications of Hurst or R/S analysis the most interesting. Although this has not resulted in my immediately using that approach, it is always good to know what the self-similarity analysis can reveal about the data at hand.
  51. 51. Nonlinear Analysis, Chaos

Nonlinear Time Series Analysis (Kantz & Schreiber, 1999a). As authors can be divided into those who write what they know, and those who know what they write about, this is definitely the latter case. I would recommend this book, among other introductions to nonlinear time series, for its readability, practical approach, examples (though mostly from physics), and formulae with clearly explained meaning. I could easily convert into code many of the algorithms described in the text.

Time Series Prediction: Forecasting the Future and Understanding the Past (Weigend & Gershenfeld, 1994). A primer on nonlinear prediction methods. The book, concluding the Santa Fe Institute prediction competition, introduces time series forecasting issues and discusses them in the context of the competition entries.

Coping with Chaos (Ott, 1994). This book, by a contributor to chaos theory, is a worthwhile read providing insights into aspects of chaotic data analysis, prediction, filtering and control, with the theoretical motivations revealed.

Finance, General

Modern Investment Theory (Haughen, 1997). A relatively easy to read book systematically introducing current views on investments, mostly from an academic point of view, though. This book also discusses the Efficient Market Hypothesis.

Financial Engineering (Galitz, 1995). A basic text on what financial engineering is about and what it can do.

Stock Index Futures (Sutcliffe, 1997). Mostly an overview work, providing numerous references to research on index futures. I considered skimming the book essential for insights into documented futures behavior, so as not to reinvent the wheel.

A Random Walk down Wall Street (Malkiel, 1996) and Reminiscences of a Stock Operator (Lefvre, 1994). Enjoyable leisure reads about the mechanics of Wall Street. In some sense the books, presenting investment activity in a wider historical and social context, also have great
  52. 52. educational value. Namely, they show the influence of subjective, not always rational, drives on the markets, which as such perhaps cannot be fully analyzed by rational methods.

Finance, High Frequency

An Introduction to High-Frequency Finance (Dacorogna et al., 2001). A good introduction to high frequency finance, presenting facts about the data and ways to process it, with simple prediction schemes presented.

Financial Markets Tick by Tick (Lequeux, 1998). In high frequency finance, where data is usually not equally time-spaced, certain mathematical notions, such as correlation and volatility, require new precise definitions. This book attempts that.
  53. 53. Nonlinear Index Prediction. International Workshop on Econophysics and Statistical Finance, 1998. Physica A 269 (1999)
  55. 55. Nonlinear Index Prediction

Stefan Zemke
Department of Computer and System Sciences
Royal Institute of Technology (KTH) and Stockholm University
Forum 100, 164 40 Kista, Sweden
Email:

Presented: International Workshop on Econophysics and Statistical Finance, Palermo, 1998.
Published: Physica A, volume 269, 1999

Abstract

Neural Network, K-Nearest Neighbor, Naive Bayesian Classifier and Genetic Algorithm evolving classification rules are compared for their prediction accuracies on stock exchange index data. The method yielding the best result, Nearest Neighbor, is then refined and incorporated into a simple trading system achieving returns above index growth. The success of the method hints at the plausibility of nonlinearities present in the index series and, as such, at the scope for nonlinear modeling/prediction.

Keywords: Stock Exchange Index Prediction, Machine Learning, Dynamics Reconstruction via delay vectors, Genetic Algorithms optimized Trading System

Introduction

Financial time series present a fruitful area for research. On one hand there are economists claiming that profitable prediction is not possible, as voiced by the Efficient Market Hypothesis; on the other, there is growing evidence of exploitable features of these series. This work describes a prediction effort involving 4 Machine Learning (ML) techniques. These experiments use the same data and lack unduly specializing adjustments, the goal being a relative comparison of the basic methods. Only subsequently is the most promising technique scrutinized.

Machine Learning (Mitchell, 1997) has been extensively applied to finance (Deboeck, 1994; Refenes, 1995; Zirilli, 1997) and trading (Allen
  56. 56. & Karjalainen, 1993; Bauer, 1994; Dacorogna, 1993). Nonlinear time series approaches (Kantz & Schreiber, 1999a) have also become commonplace (Trippi, 1995; Weigend & Gershenfeld, 1994). The controversial notion of (deterministic) chaos in financial data is important since the presence of a chaotic attractor warrants partial predictability of financial time series, in contrast to the random walk and Efficient Market Hypothesis (Fama, 1965; Malkiel, 1996). Some of the results supporting deviation from the log-normal theory (Mandelbrot, 1997) and a limited financial prediction can be found in (LeBaron, 1993; LeBaron, 1994).

The Task

Some evidence suggests that markets with lower trading volume are easier to predict (Lerche, 1997). Since the task of the study is to compare ML techniques, data from the relatively small and scientifically unexplored Warsaw Stock Exchange (WSE) (Aurell & Zyczkowski, 1996) is used, with the quotes, from the opening of the exchange in 1991, freely available on the Internet. At the exchange, prices are set once a day (with intraday trading introduced more recently). The main index, WIG, is a capitalization-weighted average of all the stocks traded on the main floor, and provides the time series used in this study.

The learning task involves predicting the relative index value 5 quotes ahead, i.e., a binary decision whether the index value one trading week ahead will be up or down in relation to the current value. The interpretation of up and down is such that they are equally frequent in the data set, with down also including small index gains. This facilitates detection of above-random predictions: their accuracy, as measured by the proportion of correctly predicted changes, is at least 0.5 + s, where s is the threshold for the required significance level.
For the data including 1200 index quotes, the following table presents the s values for one-sided 95% significance, assuming that 1200 − WindowSize data points are used for the accuracy estimate.

Window size:         60     125    250    500    1000
Significant error:   0.025  0.025  0.027  0.031  0.06
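The s values in the table can be approximately reproduced with a short Python sketch. The paper does not state which test was used, so the normal approximation to the binomial distribution under a fair-coin null hypothesis is an assumption here, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def significance_margin(n_predictions, confidence=0.95):
    """One-sided margin s above accuracy 0.5, using a normal approximation.

    Assumption: under the null hypothesis each prediction is correct with
    probability 0.5, so the accuracy has standard deviation sqrt(0.25 / n).
    """
    z = NormalDist().inv_cdf(confidence)  # about 1.645 for 95%
    return z * sqrt(0.25 / n_predictions)


TOTAL_QUOTES = 1200
for window_size in (60, 125, 250, 500, 1000):
    s = significance_margin(TOTAL_QUOTES - window_size)
    print(f"window {window_size:4d}: s = {s:.3f}")
```

Under this assumption the computed margins agree with the tabulated ones up to rounding, and they grow with the window size because fewer points remain for the accuracy estimate.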
  57. 57. Learning involves WindowSize consecutive index values. Index daily (relative) changes are digitized by monotonically mapping them into 8 integer values, 1..8, such that each is equally frequent in the resulting series. This preprocessing is necessary since some of the ML methods require bounded and/or discrete values. The digitized series is then used to create delay vectors of 10 values, with lag one. Such a vector, (c_t, c_{t−1}, c_{t−2}, ..., c_{t−9}), is the sole basis for prediction of the index up/down value at time t + 5 w.r.t. the value at time t. Only vectors, and their matching predictions, derived from index values falling within the current window are used for learning. The best generated predictor, the one achieving the highest accuracy on the window cases, is then applied to the vector following the last one in the window, yielding a prediction for the index value falling just after the window. With the accuracy estimate accumulating and the window shifting over all available data points, the resulting prediction accuracies are presented in the tables as percentages.

Neural Network Prediction

Five layered network topologies have been tested. The topologies, as described by the numbers of non-bias units in subsequent layers, are: G0: 10-1, G1: 10-5-1, G2: 10-5-3-1, G3: 10-8-5-1, G4: 10-20-5-1. Units in the first layer represent the input values. The standard backpropagation (BP) algorithm is used for learning weights, with the change values 1..8 linearly scaled down to the [0.2, 0.8] range required by the sigmoid BP, and with up denoted by 0.8 and down by 0.2. The window examples are randomly assigned to either the training or the validation set, comprising 80% and 20% of the examples respectively. The training set is used by BP to update weights, while the validation set is used to evaluate the network's squared output error. The minimal error network for the whole run is then applied to the example next to the window for prediction.
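The preprocessing described at the start of this slide (equal-frequency digitization into 8 values, then delay vectors of 10 values labelled with the direction 5 quotes ahead) can be sketched in Python as follows. This is an illustrative reconstruction, not the original code: the function names are hypothetical, the digitization is done over the whole series, and taking the median 5-quote return as the up/down threshold is an assumption made so that both classes are equally frequent.

```python
import random
from bisect import bisect_right
from statistics import median


def digitize_equal_frequency(values, n_bins=8):
    """Monotonically map values to 1..n_bins so that bins are (nearly) equally frequent."""
    ordered = sorted(values)
    n = len(ordered)
    # Bin boundaries at the empirical quantiles k / n_bins.
    cuts = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    return [bisect_right(cuts, v) + 1 for v in values]


def build_examples(index, dim=10, horizon=5, n_bins=8):
    """Return (delay_vector, label) pairs; label 1 means 'up' at t + horizon."""
    changes = [index[t] / index[t - 1] - 1.0 for t in range(1, len(index))]
    codes = digitize_equal_frequency(changes, n_bins)  # codes[t - 1] belongs to time t
    times = range(dim, len(index) - horizon)
    future = {t: index[t + horizon] / index[t] - 1.0 for t in times}
    threshold = median(future.values())  # assumed: balances up and down
    examples = []
    for t in times:
        vector = tuple(codes[t - 1 - k] for k in range(dim))  # (c_t, ..., c_{t-9})
        examples.append((vector, 1 if future[t] > threshold else 0))
    return examples


# Usage on a synthetic random-walk "index" (made-up data, for illustration only).
random.seed(0)
quotes = [1000.0]
for _ in range(299):
    quotes.append(quotes[-1] * (1.0 + random.gauss(0.0, 0.01)))
examples = build_examples(quotes)
print(len(examples), examples[0])
```

A sliding-window experiment would then train each predictor only on the examples whose index values fall inside the current window and test it on the example immediately following the window.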
Prediction accuracies and some observations follow.
  58. 58.
Window/Graph   G0   G1   G2   G3   G4
60             56   -    -    -    -
125            58   56   63   58   -
250            57   57   60   60   -
500            58   54   57   57   58
1000           -    -    -    61   61

• Prediction accuracy, without outliers, is in the significant 56–61% range
• Accuracies seem to increase with window size, reaching above 60% for bigger networks (G2–G4); as such, the results could further improve with more training data

Naive Bayesian Classifier

Here the basis for prediction consists of the probabilities P(class_j) and P(evidence_i | class_j) for all recognized evidence/class pairs. The class_p preferred by the observed evidence_o1 ... evidence_on is given by maximizing the expression P(class_p) * P(evidence_o1 | class_p) * ... * P(evidence_on | class_p). In the task at hand, evidence can take the form attribute_n = value_n, where attribute_n, n = 1..10, denotes the n-th position in the delay vector, and value_n is a fixed value. If the position has this value, the evidence is present. Class and conditional probabilities are computed by counting the respective occurrences in the window, with missing conditionals assigned the default probability 1/equivalentSampleSize = 1/80 (Mitchell, 1997). Some results and comments follow.

Window size:   60   125   250   500   1000
Accuracy:      54   52    51    47    50

• The classifier performs poorly, in the bigger window case no better than a guessing strategy, perhaps due to the preprocessing of the dataset removing any major probability shifts
• The results show, however, some autocorrelation in the data: positive for shorter periods (up to 250 data-points) and mildly negative for
  59. 59. longer (up to 1000 data-points), which is consistent with other studies on stock returns (Haughen, 1997).

K-Nearest Neighbor

In this approach, the K window vectors most similar to the one being classified are found. The most frequent class among the K vectors is then returned as the classification. The standard similarity metric is the Euclidean distance between the vectors. Some results and comments follow.

Window/K   1    11   125
125        56   -    -
250        55   53   56
500        54   52   54
1000       64   61   56

• Peak of 64%
• Accuracy always at least 50% and significant in most cases

The above table has been generated for the Euclidean metric. However, the peak of 64% accuracy (though for other Window/K combinations) has also been achieved for the Angle and Manhattan metrics [1], indicating that the result is not merely an outlier due to some idiosyncrasies of the data and parameters.

[1] The results were obtained from a GA run in the space MetricsType × K × WindowSize. For a pair of vectors, the Angle metric returns the angle between them, Maximal the maximal coordinate-wise absolute difference, and Manhattan the sum of such differences.

GA-evolved Logic Programs

The logic program is a list of clauses for the target up predicate. Each clause is 10 literals long, with each literal drawn from the set of available 2-argument predicates: lessOrEqual and greaterOrEqual, with the implied interpretation, as well as Equal(X, Y) if abs(X − Y) < 2 and nonEqual(X, Y) if abs(X − Y) > 1. The first argument of each literal is a constant