Flexible least squares for temporal data mining and statistical arbitrage


Giovanni Montana a,*, Kostas Triantafyllopoulos b, Theodoros Tsagaris a,1

a Department of Mathematics, Statistics Section, Imperial College London, London SW7 2AZ, UK
b Department of Probability and Statistics, University of Sheffield, Sheffield S3 7RH, UK

* Corresponding author. E-mail address: g.montana@imperial.ac.uk (G. Montana).
1 The author is also affiliated with BlueCrest Capital Management. The views presented here reflect solely the author's opinion.

Expert Systems with Applications 36 (2009) 2819–2830. Available online at www.sciencedirect.com; www.elsevier.com/locate/eswa. © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2008.01.062

Abstract

A number of recent emerging applications call for studying data streams, potentially infinite flows of information updated in real-time. When multiple co-evolving data streams are observed, an important task is to determine how these streams depend on each other, accounting for dynamic dependence patterns without imposing any restrictive probabilistic law governing this dependence. In this paper we argue that flexible least squares (FLS), a penalized version of ordinary least squares that accommodates for time-varying regression coefficients, can be deployed successfully in this context. Our motivating application is statistical arbitrage, an investment strategy that exploits patterns detected in financial data streams. We demonstrate that FLS is algebraically equivalent to the well-known Kalman filter equations, and take advantage of this equivalence to gain a better understanding of FLS and suggest a more efficient algorithm. Promising experimental results obtained from an FLS-based algorithmic trading system for the S&P 500 Futures Index are reported.

Keywords: Temporal data mining; Flexible least squares; Time-varying regression; Algorithmic trading system; Statistical arbitrage

1. Introduction

Temporal data mining is a fast-developing area concerned with processing and analyzing high-volume, high-speed data streams. A common example of a data stream is a time series, a collection of univariate or multivariate measurements indexed by time. Furthermore, each record in a data stream may have a complex structure involving both continuous and discrete measurements collected in sequential order. There are several application areas in which temporal data mining tools are being increasingly used, including finance, sensor networking, security, disaster management, e-commerce and many others. In the financial arena, data streams are being monitored and explored for many different purposes such as algorithmic trading, smart order routing, real-time compliance, and fraud detection. At the core of all such applications lies the common need to make time-aware, instant, intelligent decisions that exploit, in one way or another, patterns detected in the data.

In the last decade we have seen an increasing trend by investment banks, hedge funds, and proprietary trading boutiques to systematize the trading of a variety of financial instruments. These companies resort to sophisticated trading platforms based on predictive models to transact market orders that serve specific speculative investment strategies. Algorithmic trading, otherwise known as automated or systematic trading, refers to the use of expert systems that enter trading orders without any user intervention; these systems decide on all aspects of the order such as the timing, price, and its final quantity. They effectively implement pattern recognition methods in order to detect and exploit market inefficiencies for speculative purposes. Moreover, automated trading systems can slice a large trade automatically into several smaller trades in order to hide its impact on the market (a technique called iceberging) and lower
trading costs. According to the Financial Times, the London Stock Exchange foresees that about 60% of all its orders in the year 2007 will be entered by algorithmic trading.

Over the years, a plethora of statistical and econometric techniques have been developed to analyze financial data (De Gooijer & Hyndman, 2006). Classical time series analysis models, such as ARIMA and GARCH, as well as many other extensions and variations, are often used to obtain insights into the mechanisms that generate the observed data and to make predictions (Chatfield, 2004). However, in some cases, conventional time series and other predictive models may not be up to the challenges that we face when developing modern algorithmic trading systems. Firstly, as the result of developments in data collection and storage technologies, these applications generate massive amounts of data streams, thus requiring more efficient computational solutions. Such streams are delivered in real-time; as new data points become available at very high frequency, the trading system needs to quickly adjust to the new information and take almost instantaneous buying and selling decisions. Secondly, these applications are mostly exploratory in nature: they are intended to detect patterns in the data that may be continuously changing and evolving over time. Under this scenario, little prior knowledge should be injected into the models; the algorithms should require minimal assumptions about the data-generating process, as well as minimal user specification and intervention.

In this work we focus on the problem of identifying time-varying dependencies between co-evolving data streams. This task can be cast into a regression problem: at any specified point in time, the system needs to quantify to what extent a particular stream depends on a possibly large number of other explanatory streams. In algorithmic trading applications, a data stream may comprise daily or intra-day prices or returns of a stock, an index or any other financial instrument. At each time point, we assume that a target stream of interest depends linearly on a number of other streams, but the coefficients of the regression models are allowed to evolve and change smoothly over time.

The paper is organized as follows. In Section 2 we briefly review a number of common trading strategies and formulate the problem arising in statistical arbitrage, thus providing some background material and motivation for the proposed methods. The flexible least squares (FLS) methodology is introduced in Section 3 as a powerful exploratory method for temporal data mining; this method fits our purposes well because it imposes no probabilistic assumptions and relies on minimal parameter specification. In Section 4 some assumptions of the FLS method are revisited, and we establish a clear connection between FLS and the well-known Kalman filter equations. This connection sheds light on the interpretation of the model, and naturally yields a modification of the original FLS that is computationally more efficient and numerically stable. Experimental results that have been obtained using the FLS-based trading system are described in Section 5. In that section, in order to deal with the large number of predictors, we complement FLS with a feature extraction procedure that performs on-line dimensionality reduction. We conclude in Section 7 with a discussion on related work and directions for further research.

2. A concise review of trading strategies

Two popular trading strategies are market timing and trend following. Market timers and trend followers both attempt to profit from price movements, but they do it in different ways. A market timer forecasts the direction of an asset, going long (i.e. buying) to capture a price increase, and going short (i.e. selling) to capture a price decrease. A trend follower attempts to capture the market trends. Trends are commonly related to serial correlations in price changes; a trend is a series of asset prices that move persistently in one direction over a given time interval, where price changes exhibit positive serial correlation. A trend follower attempts to identify developing price patterns with this property and trade in the direction of the trend if and when this occurs.

Although the time-varying regression models discussed in this work may be used to implement such trading strategies, we will not discuss this further. We rather focus on statistical arbitrage, a class of strategies widely used by hedge funds or proprietary traders. The distinctive feature of such strategies is that profits can be made by exploiting statistical mispricing of one or more assets, based on the expected value of these assets.

The simplest special case of these strategies is perhaps pairs trading (see Elliott, van der Hoek, & Malcolm, 2005; Gatev, Goetzmann, & Rouwenhorst, 2006). In this case, two assets are initially chosen by the trader, usually based on an analysis of historical data or other financial considerations. If the two stocks appear to be tied together in the long term by some common stochastic trend, a trader can take maximum advantage from temporary deviations from this assumed equilibrium.²

A specific example will clarify this simple but effective strategy. Fig. 1 shows the historical prices of two assets, SouthWest Airlines and Exxon Mobil; we denote the two price time series by y_t and x_t for t = 1, 2, ..., respectively. Clearly, from 1997 till 2004, the two assets exhibited some dependence: their spread, defined as s_t = y_t − x_t (plotted in the inset figure), fluctuates around a long-term average of about −20. A trading system implementing a pairs trading strategy on these two assets would exploit temporary divergences from this market equilibrium. For instance, when the spread s_t is greater than some predetermined positive constant c, the system assumes that SouthWest Airlines is overpriced and would go short on SouthWest Airlines and long on Exxon Mobil, in some predetermined ratio.

² This strategy relies on the idea of co-integration. Several applications of co-integration-based trading strategies are presented in Alexander and Dimitriu (2002) and Burgess (2003).
Fig. 1. Historical prices of Exxon Mobil Corporation and SouthWest Airlines for the period 1997–2007. The spread time series, reported in the inset, shows an equilibrium level between the two prices until about January 2004.

A profit is made when the prices revert back to their long-term average. Although a stable relationship between two assets may persist for quite some time, it may suddenly disappear or present itself in different patterns, such as periodic or trend patterns. In Fig. 1, for instance, the spread shows a downward trend after January 2004, which may be captured by implementing more refined models.

2.1. A statistical arbitrage strategy

Opportunities for pairs trading in the simple form described above are dependent upon the existence of similar pairs of assets, and thus are naturally limited. Many other variations and extensions exist that exploit temporary mispricing among securities. For instance, in index arbitrage, the investor looks for temporary discrepancies between the prices of the stocks comprising an index and the price of a futures contract³ on that index. By buying either the stocks or the futures contract and selling the other, market inefficiency can be exploited for a profit.

³ A futures contract is an obligation to buy or sell a certain underlying instrument at a specific date and price, in the future.

In this paper we adopt a simpler strategy than index arbitrage, somewhat more related to pairs trading. The trading system we develop tries to exploit discrepancies between a target asset, selected by the investor, and a paired artificial asset that reproduces the target asset. This artificial asset is represented by a data stream obtained as a linear combination of a possibly large set of explanatory streams assumed to be correlated with the target stream. The rationale behind this approach is the following: if there is a strong association between synthetic and target assets persisting over a long period of time, this association implies that both assets react to some underlying (and unobserved) systematic component of risk that explains their dynamics. Such a systematic component may include all market-related sources of risk, including financial and economic factors. The objective of this approach is to neutralize all market-related sources of risk and ultimately obtain a data stream that best represents the target-specific risk, also known as idiosyncratic risk.

Suppose that y_t represents the data stream of the target asset, and ŷ_t is the artificial asset estimated using a set of p explanatory and co-evolving data streams x_1, ..., x_p, over the same time period. In this context, the artificial asset can also be interpreted as the fair price of the target asset, given all available information and market conditions. The difference y_t − ŷ_t then represents the risk associated with the target asset only, or mispricing. Given that this construction indirectly accounts for all sources of variation due to various market-related factors, the mispricing data stream is more likely to contain predictable patterns (such as the mean-reverting behavior seen in Fig. 1) that could potentially be exploited for speculative purposes. For instance, in an analogy with the pairs trading approach, a possibly large mispricing (in absolute value) would flag a temporary inefficiency that will soon be corrected by the market. This construction crucially relies on accurately and dynamically estimating the artificial asset, and we discuss this problem next.

3. Flexible least squares (FLS)

The standard linear regression model involves a response variable y_t and p predictor variables x_1, ..., x_p, which usually form a predictor column vector x_t = (x_{1t}, ..., x_{pt})′. The model postulates that y_t can be approximated well by x_t′β, where β is a p-dimensional vector of regression parameters. In ordinary least squares (OLS) regression, estimates β̂ of the parameter vector are found as those values that minimize the cost function

  C(β) = Σ_{t=1}^{T} (y_t − x_t′β)²    (1)

When both the response variable y_t and the predictor vector x_t are observations at time t of co-evolving data streams, it may be possible that the linear dependence between y_t and x_t changes and evolves, dynamically, over time. Flexible least squares were introduced at the end of the 80's by Tesfatsion and Kalaba (1989) as a generalization of the standard linear regression model above in order to allow for time-variant regression coefficients. Together with the usual regression assumption that

  y_t − x_t′β_t ≈ 0    (2)

the FLS model also postulates that

  β_{t+1} − β_t ≈ 0    (3)

that is, the regression coefficients are now allowed to evolve slowly over time.

FLS does not require the specification of probabilistic properties for the residual error in (2). This is a favorable
aspect of the method for applications in temporal data mining, where we are usually unable to precisely specify a model for the errors; moreover, any assumed model would not hold true at all times. We have found that FLS performs well even when assumption (3) is violated, and there are large and sudden changes from β_{t−1} to β_t for some t. We will illustrate this point by means of an example in the next section.

With these minimal assumptions in place, given a predictor x_t, a procedure is called for to estimate a unique path of coefficients, β_t = (β_{1t}, ..., β_{pt})′, for t = 1, 2, .... The FLS approach consists of minimizing a penalized version of the OLS cost function (1), namely⁴

  C(β; μ) = Σ_{t=1}^{T} (y_t − x_t′β_t)² + μ Σ_{t=1}^{T−1} ν_t    (4)

where we have defined

  ν_t = (β_{t+1} − β_t)′(β_{t+1} − β_t)    (5)

and μ ≥ 0 is a scalar to be determined.

⁴ This cost function is called the incompatibility cost in Tesfatsion and Kalaba (1989).

In their original formulation, Kalaba and Tesfatsion (1988) propose an algorithm that minimizes this cost with respect to every β_t in a sequential way. They envisage a situation where all data points are stored in memory and promptly accessible, in an off-line fashion. The core of their approach is summarized in the sequel for completeness. The smallest cost of the estimation process at time t can be written recursively as

  c(β_{t+1}; μ) = inf_{β_t} {(y_t − x_t′β_t)² + μν_t + c(β_t; μ)}    (6)

Furthermore, this cost is assumed to have a quadratic form

  c(β_t; μ) = β_t′Q_{t−1}β_t − 2β_t′p_{t−1} + r_{t−1}    (7)

where Q_{t−1} and p_{t−1} have dimensions p × p and p × 1, respectively, and r_{t−1} is a scalar. Substituting (7) into (6) and then differentiating the cost (6) with respect to β_t, conditioning on β_{t+1}, one obtains a recursive updating equation for the time-varying regression coefficient

  β_t = e_t + M_t β_{t+1}    (8)

with

  e_t = μ⁻¹M_t (p_{t−1} + x_t y_t)
  M_t = μ (Q_{t−1} + μI_p + x_t x_t′)⁻¹

The recursions are started with some initial Q_0 and p_0. Now, using (8), the cost function can be written as

  c(β_{t+1}; μ) = β_{t+1}′Q_t β_{t+1} − 2β_{t+1}′p_t + r_t

where

  Q_t = μ(I_p − M_t)
  p_t = μe_t
  r_t = r_{t−1} + y_t² − (p_{t−1} + x_t y_t)′e_t

and where I_p is the p × p identity matrix. In order to apply (8), this procedure requires all data points till time T to be available, so the coefficient vector β_T should be computed first. Kalaba and Tesfatsion (1988) show that the estimate of β_T can be obtained sequentially as

  β̂_T = (Q_{T−1} + x_T x_T′)⁻¹ (p_{T−1} + x_T y_T)

Subsequently, (8) can be used to estimate all remaining coefficient vectors β_{T−1}, ..., β_1, going backwards in time.

The procedure relies on the specification of the regularization parameter μ ≥ 0; this scalar penalizes the dynamic component of the cost function (4), defined in (5), and acts as a smoothness parameter that forces the time-varying vector towards or away from the fixed-coefficient OLS solution. We prefer the alternative parameterization based on μ = (1 − δ)/δ, controlled by a scalar δ varying in the unit interval. Then, with δ set very close to 0 (corresponding to very large values of μ), near total weight is given to minimizing the static part of the cost function (4). This is the smoothest solution and results in standard OLS estimates. As δ moves away from 0, greater priority is given to the dynamic component of the cost, which results in time-varying estimates.

3.1. Off-line and on-line FLS: an illustration

As noted above, the original FLS has been introduced for situations in which all the data points are available, in batch, prior to the analysis. In contrast, we are interested in situations where each data point arrives sequentially. Each component of the p-dimensional vector x_t represents a new point of a data stream, and the path of regression coefficients needs to be updated at each time step so as to incorporate the most recently acquired information. Using the FLS machinery in this setting, the estimate of β_t is given recursively by

  β̂_t = (S_{t−1} + x_t x_t′)⁻¹ (s_{t−1} + x_t y_t)    (9)

where we have defined the quantities

  S_t = μ(S_{t−1} + μI_p + x_t x_t′)⁻¹ (S_{t−1} + x_t x_t′)
  s_t = μ(S_{t−1} + μI_p + x_t x_t′)⁻¹ (s_{t−1} + x_t y_t)    (10)

The recursions are initially started with some arbitrarily chosen values S_0 and s_0.
Fig. 2. Simulated versus estimated time-varying regression coefficients using FLS in both off-line and on-line mode.

Fig. 2 illustrates how accurately the FLS algorithm recovers the path of the time-varying coefficients, in both off-line and on-line settings, for some artificially created data streams. The target stream y_t for this example has been generated using the model

  y_t = x_t β_t + ε_t    (11)

where ε_t is uniformly distributed over the interval [−2, 2] and the explanatory stream x_t evolves as x_t = 0.8 x_{t−1} + z_t, with z_t being white noise. The regression coefficients have been generated using a slightly complex mechanism for the purpose of illustrating the flexibility of FLS. Starting with β_1 = 7, we then generate β_t as

  β_t = β_{t−1} + a_t        for t = 2, ..., 99
  β_t = β_{t−1} + 4          for t = 100
  β_t = β_{t−1} + b_t        for t = 101, ..., 200
  β_t = 5 sin(0.5t) + c_t    for t = 201, ..., 300

where a_t and b_t are Gaussian random variables with standard deviations 0.1 and 0.001, respectively, and c_t is uniformly distributed over [−2, 2]. We remark that this example features non-Gaussian error terms, as well as linear and non-linear behaviors in the dynamics of the regression coefficient, varying over time.

In this example we set δ = 0.98. Although such a high value of δ encourages the regression parameters to be very dynamic, the nearly constant coefficients observed between t = 101 and t = 200, as well as the two sudden jumps at times t = 100 and t = 201, are estimated well, and especially so in the on-line setting. The non-linear dynamics observed from time t = 201 onwards is also well captured.
4. An alternative look at FLS

In Section 3, we have stressed that FLS relies on a quite general assumption concerning the evolution of the regression coefficients, as it only requires β_{t+1} − β_t to be small at all times. Accordingly, assumption (3) does not imply or require that each vector β_t is a random vector. Indeed, in the original work of Kalaba and Tesfatsion (1988), {β_t} is not treated as a sequence of random variables, but rather taken as a sequence of unknown quantities to be estimated.

We ask ourselves whether we can gain a better understanding of the FLS method after assuming that the regression coefficients are indeed random vectors, without losing the generality and flexibility of the original FLS method. As it turns out, if we are willing to make such an assumption, it is possible to establish a neat algebraic correspondence between the FLS estimation equations and the well-known Kalman filter (KF) equations. This correspondence has a number of advantages. Firstly, this connection sheds light on the meaning and interpretation of the smoothing parameter μ in the cost function (4). Secondly, once the connection with KF is established, we are able to estimate the covariance matrix of the estimator of β_t. Furthermore, we are able to devise a more efficient version of FLS that does not require any matrix inversion. As in the original method, we refrain from imposing any specific probability distribution. The remainder of this section is dedicated to providing an alternative perspective on FLS, and deriving a clear connection between this method and the well-known Kalman filter equations.

4.1. The state-space model

In our formulation, the regression coefficient at time t + 1 is modeled as a noisy version of the previous coefficient at time t. First, we introduce a random vector ω_t with zero mean and some covariance matrix V_ω, so that

  β_{t+1} = β_t + ω_t,  t = 0, 1, ..., T − 1.    (12)

Then, along the same lines, we introduce a random variable ε_t having zero mean and some variance V_ε, so that

  y_t = x_t′β_t + ε_t,  t = 1, ..., T.    (13)

Eqs. (12) and (13), jointly considered, result in a linear state-space model, for which it is assumed that the innovation series {ε_t} and {ω_t} are mutually and individually uncorrelated, i.e. ε_i is uncorrelated with ε_j, ω_i is uncorrelated with ω_j, and ε_k is uncorrelated with ω_ℓ, for any i ≠ j and for any k, ℓ. It is also assumed that, for all t, ε_t and ω_t are uncorrelated with the initial state β_0. It should be emphasized again that no specific distributional assumptions for ε_t and ω_t have been made. We only assume that ε_t and ω_t attain some distributions, which we do not know, and we only need to specify the first two moments of such distributions. In this sense, the only difference between the system specified by (12) and (13) and FLS is the assumption of randomness of β_t.

4.2. The Kalman filter

The Kalman filter (Kalman, 1960) is a powerful method for the estimation of β_t in the above linear state-space model. In order to establish the connection between FLS and KF, we derive an alternative and self-contained proof of the KF recursions that makes no assumptions on the distributions of ε_t and ω_t. We have found related proofs of such recursions that do not rely on probabilistic assumptions, as in Kalman (1960) and Eubank (2006). In comparison with
these, we believe that our derivation is simpler and does not involve matrix inversions, which serves our purposes well.

We start with some definitions and notation. At time t, we denote by β̂_t the estimate of β_t and by ŷ_{t+1} = E(y_{t+1}) the one-step forecast of y_{t+1}, where E(·) denotes expectation. The variance of y_{t+1} is known as the one-step forecast variance and is denoted by Q_t = Var(y_{t+1}). The one-step forecast error is defined as e_t = y_t − E(y_t). We also define the covariance matrix of β_t − β̂_t as P_t and the covariance matrix of β_t − β̂_{t−1} as R_t, and we write Cov(β_t − β̂_t) = P_t and Cov(β_t − β̂_{t−1}) = R_t. With these definitions, and assuming linearity of the system, we can see that, at time t − 1,

  R_t = P_{t−1} + V_ω
  ŷ_t = x_t′β̂_{t−1}
  Q_t = x_t′R_t x_t + V_ε

where P_{t−1} and β̂_{t−1} are assumed known. The KF gives recursive updating equations for P_t and β̂_t as functions of P_{t−1} and β̂_{t−1}.

Suppose we wish to obtain an estimator of β_t that is linear in y_t, that is β̂_t = a_t + K_t y_t, for some a_t and K_t (to be specified later). Then we can write

  β̂_t = a*_t + K_t e_t    (14)

with e_t = y_t − x_t′β̂_{t−1}. We will show that, for some K_t, if β̂_t is required to minimize the sum of squares

  C = Σ_{t=1}^{T} (y_t − x_t′β_t)²    (15)

then a*_t = β̂_{t−1}. To prove this, write Y = (y_1, ..., y_T)′, X = (x_1′, ..., x_T′)′, B = (β_1′, ..., β_T′)′, E = (e_1, ..., e_T)′ and K = diag(K_1, K_2, ..., K_T), the block-diagonal matrix with K_1, ..., K_T on its diagonal. Then we can write (15) as C(B) = (Y − XB)′(Y − XB) and B̂ = A* + KE, where A* = ((a*_1)′, ..., (a*_T)′)′. We will show that A* = B*, where B* = (β̂_0′, ..., β̂_{T−1}′)′. With the above B̂, the sum of squares can be written as

  S(B̂) = (Y − XA* − XKE)′(Y − XA* − XKE)
       = (Y − XA*)′(Y − XA*) − 2(Y − XA*)′XKE + E′K′X′XKE

which is minimized when Y − XA* is minimized, or when E(Y − XA*) = 0, leading to A* = B* as required. Thus a*_t = β̂_{t−1} and from (14) we have

  β̂_t = β̂_{t−1} + K_t e_t    (16)

for some value of K_t to be defined. From the definition of P_t, we have that

  P_t = Cov(β_t − (β̂_{t−1} + K_t(x_t′β_t + ε_t − x_t′β̂_{t−1})))
      = Cov((I_p − K_t x_t′)(β_t − β̂_{t−1}) − K_t ε_t)    (17)
      = (I_p − K_t x_t′) R_t (I_p − x_t K_t′) + V_ε K_t K_t′
      = R_t − K_t x_t′R_t − R_t x_t K_t′ + Q_t K_t K_t′

Now, we can choose the K_t that minimizes E(β_t − β̂_t)′(β_t − β̂_t), which is the same as minimizing the trace of P_t, and thus K_t is the solution of the matrix equation

  ∂trace(P_t)/∂K_t = −2(x_t′R_t)′ + 2Q_t K_t = 0

where ∂trace(P_t)/∂K_t denotes the partial derivative of the trace of P_t with respect to K_t. Solving the above equation we obtain K_t = R_t x_t / Q_t. The quantity K_t, also known as the Kalman gain, is optimal in the sense that, among all linear estimators β̂_t, (16) minimizes E(β_t − β̂_t)′(β_t − β̂_t). With K_t = R_t x_t / Q_t, from (17) the minimum covariance matrix P_t becomes

  P_t = R_t − Q_t K_t K_t′    (18)

The KF consists of Eqs. (16) and (18), together with

  K_t = R_t x_t / Q_t
  R_t = P_{t−1} + V_ω
  Q_t = x_t′R_t x_t + V_ε
  e_t = y_t − x_t′β̂_{t−1}

Initial values for β̂_0 and P_0 have to be specified; usually we set β̂_0 = 0 and P_0⁻¹ = 0. Note that from the recursions of P_t and R_t we have

  R_{t+1} = R_t − Q_t K_t K_t′ + V_ω    (19)
4.3. Correspondence between FLS and KF

Traditionally, the KF equations are derived under the assumption that ε_t and ω_t follow the normal distribution, as in Jazwinski (1970). This stronger distributional assumption allows the derivation of the likelihood function. When the normal likelihood is available, we note that its maximization is equivalent to minimizing the quantity

  Σ_{t=1}^{T} (y_t − x_t′β_t)² + (1/V_ω) Σ_{t=1}^{T−1} ν_t

with respect to β_1, ..., β_T, where ν_t has been defined in (5) (see Jazwinski (1970) for a proof). The above expression is exactly the cost function (4) with μ replaced by 1/V_ω.

This correspondence can now be taken a step further: in a more general setting, where no distributional assumptions are made, it is possible to arrive at the same result. This is achieved by rearranging Eq. (9) in the form of
(16), which is the KF estimator of β_t. First, note that from (10) we can write

  (S_{t−1} + x_t x_t′)⁻¹ = μ S_t⁻¹ (S_{t−1} + μI_p + x_t x_t′)⁻¹

and substituting into Eq. (9) we get β̂_t = S_t⁻¹ s_t. Thus we have

  β̂_t − β̂_{t−1} = S_t⁻¹s_t − S_{t−1}⁻¹s_{t−1}
    = (S_{t−1} + x_t x_t′)⁻¹(s_{t−1} + x_t y_t) − S_{t−1}⁻¹s_{t−1}
    = S_{t−1}⁻¹x_t y_t − [S_{t−1}⁻¹x_t x_t′S_{t−1}⁻¹(s_{t−1} + x_t y_t)] / (x_t′S_{t−1}⁻¹x_t + 1)
    = [S_{t−1}⁻¹x_t / (x_t′S_{t−1}⁻¹x_t + 1)] (y_t x_t′S_{t−1}⁻¹x_t + y_t − x_t′S_{t−1}⁻¹s_{t−1} − x_t′S_{t−1}⁻¹x_t y_t)
    = [S_{t−1}⁻¹x_t / (x_t′S_{t−1}⁻¹x_t + 1)] (y_t − x_t′β̂_{t−1}) = K_t e_t

with

  K_t = R_t x_t / Q_t,  R_t = S_{t−1}⁻¹,  Q_t = x_t′R_t x_t + 1,  V_ε = 1

It remains to prove that the recursion of S_t in Eq. (10) agrees with the recursion of Eq. (19), for R_{t+1} = S_t⁻¹. To this end, starting from Eq. (10) and using the matrix inversion lemma, we obtain

  R_{t+1} = S_t⁻¹ = μ⁻¹(S_{t−1} + x_t x_t′)⁻¹(S_{t−1} + μI_p + x_t x_t′)
    = μ⁻¹(I_p + μ(S_{t−1} + x_t x_t′)⁻¹)
    = μ⁻¹I_p + (S_{t−1} + x_t x_t′)⁻¹
    = S_{t−1}⁻¹ − [S_{t−1}⁻¹x_t x_t′S_{t−1}⁻¹] / (x_t′S_{t−1}⁻¹x_t + 1) + μ⁻¹I_p
    = R_t − Q_t K_t K_t′ + V_ω

which is the KF recursion of R_t, where V_ω = μ⁻¹I_p. Clearly, the FLS estimator β̂_t of (9) is the same as the KF estimator β̂_t of (16). From this equivalence, it follows that μ = 1/V_ω, which in turn means that

  Cov(β_{t+1} − β_t) = (1/μ) I_p

This result further clarifies the role of the smoothing parameter μ in (4). As μ → ∞, the covariance matrix of β_{t+1} − β_t is almost zero, which means that β_{t+1} = β_t for all t, reducing the model to a usual regression model with constant coefficients. In the other extreme, when μ ≈ 0, the covariance matrix of β_{t+1} − β_t has very high diagonal elements (variances) and therefore the estimated β_t's fluctuate erratically.

An important computational consequence of the established correspondence between the FLS and the KF is apparent. For each time t, FLS requires the inversion of two matrices, namely S_{t−1} + x_t x_t′ and S_{t−1} + μI_p + x_t x_t′. However, these inversions are not necessary, as it is clear by the KF that β̂_t can be computed by performing only matrix multiplications. This is particularly useful for temporal data mining applications where T can be infinite and p very large.

It is interesting to note how the two procedures arrive at the same solution, although they are based on quite different principles. On one hand, FLS merely solves an optimization problem, as it minimizes the cost function C(β; μ) of (4). On the other hand, KF performs two steps: first, all linear estimators are restricted to forms of (16), for any parameter vector K_t; in the second step, K_t is optimized so that it minimizes P_t, the covariance matrix of β_t − β̂_t. This matrix, known as the error matrix of β_t, gives a measure of the uncertainty of the estimation of β_t.

The relationship between FLS and KF has important implications for both methods. For FLS, it suggests that the regression coefficients can be learned from the data in a recursive way without the need of performing matrix inversions; also, the error matrix P_t is routinely available to us. For KF, we have proved that the estimator β̂_t minimizes the cost function C(β; μ) = C(β; 1/V_ω) when only the mean and the variance of the innovations ε_t and ω_t are specified, without assuming these errors to be normally distributed.
BlueCrest Capital Management, and covers a period of about 9 years, from 02/01/1997 to 25/10/2005 (see Fig. 3). Our explanatory data streams are taken to be a subset of all constituents of the underlying S&P 500 Price Index. The constituents list was acquired from the Standard & Poor's web site as of 1st of March 2007, whereas the constituents data streams were downloaded from Yahoo! Finance. The constituents of the S&P index are added and deleted frequently on the basis of the characteristics of the index. For our experiments, we have selected a time-invariant subset of 432 stocks, namely all the constituents whose historical data is available over the entire 1997–2005 period.

The system thus monitors 433 co-evolving data streams comprising one target asset and 432 explanatory streams. All raw prices are pre-processed in several ways: data adjustments are made for discontinuities relating to stock splits, bonus issues, and other financial events; missing observations are filled in using the most recent data points; finally, prices are transformed into log-returns. At each time $t > 1$, the log-return for asset $i$ is defined as
\[
r_{it} = \log p_{it} - \log p_{i(t-1)}, \qquad i = 1, \ldots, 432,
\]
where $p_{it}$ is the observed price of asset $i$ at time $t$. Taking returns provides a more convenient representation of the assets, as it makes different prices directly comparable and centers them around zero. We collect all explanatory assets available at time $t$ in a column vector $r_t$. Analogously, we denote by $a_t$ the log-return of the S&P 500 Futures Index at time $t$.

5.2. Incremental SVD for dimensionality reduction

When the dimensionality of the regression model is large, as in our application, the model might suffer from multicollinearity. Moreover, in real-world trading applications using high-frequency data, the regression model generating trading signals needs to be updated quickly as new information is acquired. A much smaller set of explanatory streams would achieve remarkable computational speed-ups. In order to address all these issues, we implement on-line feature extraction by reducing the dimensionality of the space of explanatory streams.

Suppose that $R_t = E(r_t r_t')$ is the unknown population covariance matrix of the explanatory streams, with data available up to time $t = 1, \ldots, T$. The algorithm proposed by Weng, Zhang, and Hwang (2003) provides an efficient procedure to incrementally update the eigenvectors of the $R_t$ matrix as new data are made available at time $t + 1$. In turn, this procedure allows us to extract the first few principal components of the explanatory data streams in real-time, and effectively perform incremental dimensionality reduction.

A brief outline of the procedure suggested by Weng et al. (2003) is provided in the sequel. First, note that the eigenvector $g_t$ of $R_t$ satisfies the characteristic equation
\[
h_t = \lambda_t g_t = R_t g_t \qquad (20)
\]
where $\lambda_t$ is the corresponding eigenvalue. Let us call $\hat h_t$ the current estimate of $h_t$ using all the data up to time $t$ ($t = 1, \ldots, T$). We can write the above characteristic equation in matrix form as
\[
h = \begin{pmatrix} h_1 \\ \vdots \\ h_T \end{pmatrix}
  = \begin{pmatrix} R_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & R_T \end{pmatrix}
    \begin{pmatrix} g_1 \\ \vdots \\ g_T \end{pmatrix} = R g,
\]
and then, noting that
\[
\frac{h_1 + \cdots + h_T}{T} = \frac{1}{T}(1, \ldots, 1)\, h = \frac{1}{T}(R_1, \ldots, R_T)\, g = \frac{1}{T}\sum_{i=1}^{T} R_i g_i,
\]
the estimate $\hat h_T$ is obtained from $\hat h_T = (h_1 + \cdots + h_T)/T$ by substituting $R_i$ by $r_i r_i'$. This leads to
\[
\hat h_t = \frac{1}{t} \sum_{i=1}^{t} r_i r_i' g_i \qquad (21)
\]
which is the incremental average of $r_i r_i' g_i$, where $r_i r_i'$ accounts for the contribution to the estimate of $R_i$ at point $i$. Observing that $g_t = h_t / \lVert h_t \rVert$, an obvious choice is to estimate $g_i$ in (21) as $\hat h_{i-1} / \lVert \hat h_{i-1} \rVert$; in this setting, $\hat h_0$ is initialized by equating it to $r_1$, the first direction of data spread. After plugging this estimator into (21), we obtain
\[
\hat h_t = \frac{1}{t} \sum_{i=1}^{t} r_i r_i' \frac{\hat h_{i-1}}{\lVert \hat h_{i-1} \rVert} \qquad (22)
\]
In an on-line setting, we need a recursive expression for $\hat h_t$. Eq. (22) can be rearranged to obtain an equivalent expression that only uses $\hat h_{t-1}$ and the most recent data point $r_t$:
\[
\hat h_t = \frac{1}{t}\left( \sum_{i=1}^{t-1} r_i r_i' \frac{\hat h_{i-1}}{\lVert \hat h_{i-1} \rVert} + r_t r_t' \frac{\hat h_{t-1}}{\lVert \hat h_{t-1} \rVert} \right)
 = \frac{t-1}{t}\, \hat h_{t-1} + \frac{1}{t}\, r_t r_t' \frac{\hat h_{t-1}}{\lVert \hat h_{t-1} \rVert}.
\]
The weights $(t-1)/t$ and $1/t$ control the influence of old values in determining the current estimates. Full details related to the computation of the subsequent eigenvectors can be found in the contribution of Weng et al. (2003).
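As an illustration, the two recursive building blocks described so far — the inversion-free FLS/KF coefficient update of the previous section and the incremental first-eigenvector update of Eq. (22) — can each be sketched in a few lines. This is only a minimal sketch under our own naming conventions (`fls_kf_update`, `ccipca_update`, and the variable names are ours, not the paper's), not the authors' implementation.

```python
import numpy as np

def fls_kf_update(beta, R, x, y, mu):
    """One inversion-free FLS step in Kalman-filter form.

    beta : current coefficient estimate, shape (p,)
    R    : current error matrix R_t = S_{t-1}^{-1}, shape (p, p)
    x    : new regressor vector x_t, shape (p,)
    y    : new scalar response y_t
    mu   : FLS smoothness parameter; the state noise covariance is (1/mu) I_p
    """
    Q = float(x @ R @ x) + 1.0                # Q_t = x_t' R_t x_t + 1 (V_eps = 1)
    K = (R @ x) / Q                           # Kalman gain K_t = R_t x_t / Q_t
    e = y - float(x @ beta)                   # one-step-ahead prediction error e_t
    beta_new = beta + K * e                   # beta_t = beta_{t-1} + K_t e_t
    R_new = R - Q * np.outer(K, K) + np.eye(len(x)) / mu   # R_{t+1} recursion
    return beta_new, R_new

def ccipca_update(h, r, t):
    """Incremental estimate of the first eigenvector direction, Eq. (22):
    h_t = ((t-1)/t) h_{t-1} + (1/t) r_t r_t' h_{t-1} / ||h_{t-1}||."""
    g = h / np.linalg.norm(h)                 # current normalized eigenvector
    return ((t - 1) / t) * h + (1.0 / t) * r * float(r @ g)
```

Both updates cost only matrix-vector products per observation, which is the computational point made above: no matrix is ever inverted, so the system can process a potentially infinite stream with large `p`.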
In our application, we have used data points from 02/01/1997 till 01/11/2000 as a training set to obtain stable estimates of the first few dominant eigenvectors. Therefore, data points prior to 01/11/2000 will be excluded from the experimental results.

5.3. Trading rule

The trade unit for the S&P 500 Futures Index is set by the Chicago Mercantile Exchange (CME) to $250 multiplied by the current S&P 500 Price Index, $p_t$. Accordingly, we denote the trade unit expressed in monetary terms as $C_t = 250 p_t$, which also gives the contract value at time $t$. For instance, if the current stock-index price