In recent years there has been increasing interest in forecasting methods that exploit large data sets. Indeed, a huge quantity of information is available in the economic arena which might be useful for forecasting, but standard econometric techniques are not well suited to extracting it in a useful form. This is not only an issue of academic interest: central bankers and policy makers are interested in summarising large data sets for forecasting purposes (Eklund and Kapetanios give a wide review of the recent literature on this issue). Since the reference paper of Stock and Watson (2002), factor methods have been at the forefront of developments in forecasting with large data sets. The methodology for extracting a small number of common factors has become more and more sophisticated, from principal components to dynamic factor models (for example in Forni et al.), including ways to deal with unbalanced data sets at the end of the sample (the 'jagged edge' feature of the data), as in the recent contribution by Giannone on real-time nowcasting. Finally, factor analysis combined with linear models (bridge models) has been among the main tools used to summarise large data sets for forecasting purposes.
In our study, we would like to present a new statistical approach to forecasting macroeconomic aggregates based on the random forest (RF) technique proposed by Breiman in the 2000s. This technique is widely used in biostatistics, is becoming more and more popular and appears to be very powerful in many different applications (classification problems as well as regression problems). But it is largely unknown in economics. To our knowledge, the only application in the economic field is the paper I wrote with my co-authors (Gérard Biau and Laurent Rouvière) when I was working for the French INSEE. In that paper, we used the micro data from the French Industry Business Tendency Survey to track manufacturing output.
If RF is so widely used in medical research, it is because it enjoys good prediction properties, is robust to noise and can handle a very large number of input variables (which is very often the case in medicine, where you can collect many variables on your patients but the number of observations is often limited). RF is considered to be one of the most accurate general-purpose learning techniques available, independent of any functional or distributional assumptions. RF is considered to be very powerful… but it is not clearly elucidated from a mathematical point of view. Although the mechanism appears simple, it involves many different driving forces which make it difficult to analyse. In fact, its mathematical properties remain to date largely unknown and, up to now, most theoretical studies have concentrated on isolated parts or stylised versions of the algorithm (Lin and Jeon make the connection with adaptive nearest neighbours). The most recent paper by Gérard Biau is a step forwards, as it proves a consistency theorem for the RF algorithm.
After this introduction, here is the outline of the presentation: I will first present the dataset. In order to explain the algorithm, I will take one basic example. Then I will present the benchmarks, I mean, the competitors of the random forest outputs. Finally, we will see our results and the perspectives of this work.
Well, the dataset used in this paper is based on the Joint Harmonised EU Programme of Business and Consumer Surveys. It covers 5 sectors. A key aspect of the business surveys is that most questions ask for qualitative responses, reflecting the sentiment or confidence (optimistic, pessimistic or neutral) of managers and consumers. The Programme covers all 27 Member States, Croatia, the FYROM and Turkey. More than 125 000 firms and over 40 000 consumers are surveyed every month.
To be more precise, the dataset mainly consists of the euro area balances of opinion (% positive - % negative). The time series used in the analysis are those available at the end of the third month of each quarter: the level series (monthly St or quarterly Sq) and the difference series (St - St-1, St - St-2, St - St-3 for monthly questions; Sq - Sq-1 for quarterly questions). The dataset is composed of p = 172 'soft' series Xi. The only 'hard' variable is the euro area GDP quarter-on-quarter growth series Yi. Finally, we have what is called in this literature a 'learning set' L = {(X1, Y1), …, (Xn, Yn)} with n = 57 (from 1995Q3 to 2009Q3).
I will now speak about the RF. But before dealing with the forest, I have to explain what a tree is and, above all, how it is grown. What is a tree? It is a partition of the space into regions. The most important kind is the binary tree, which has just 2 children per node. Here we have an example: we first split the space into two regions; one or both of these regions are split into two more regions, and this process is continued until some stopping rule is applied. How do we grow a tree (the answer is, for example, with CART), and what is the tree predictor (or tree regressor)? In my next slide, I will take a basic example…
Imagine I want to predict the income of someone who enters this room using the information I have collected about yourselves. So I know a lot of characteristics Xi about you, such as your gender, size, age, your position (I mean, do you have responsibility or not, are you head of unit, etc.) …and of course I have collected your income Y. We have to find the first node of the tree, that is, the first splitting variable Xj and the first split point s which discriminate the most (by solving this expression, if we want to model the response, for example, by a constant in each region). If we take the variable 'size', N1 represents the small ones (smaller than me) and N2 the tall ones; C1 is the average income of the small ones and C2 the average income of the tall ones. Do you think this would be the best partition in terms of minimum sum of squares? Undoubtedly the answer would be no: the first node will be the position (the chiefs / the others…) or the age. Having found the first split by scanning all the covariates, CART repeats the process with other variables until, for example, the terminal nodes contain a user-specified number of individuals.
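As an illustration, the exhaustive split search just described can be written out as a short sketch (this is an invented toy example, not the code used in the study; the data and variable names are made up to mirror the income story):

```python
# Minimal sketch of the CART split search: for each candidate variable j and
# split point s, model the response by a constant in each half and keep the
# pair (j, s) that minimises the total within-region sum of squares.

def best_split(X, y):
    """X: list of feature vectors, y: list of responses."""
    best = None  # (sse, j, s)
    p = len(X[0])
    for j in range(p):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue
            c1 = sum(left) / len(left)    # optimal constant in region 1
            c2 = sum(right) / len(right)  # optimal constant in region 2
            sse = (sum((v - c1) ** 2 for v in left)
                   + sum((v - c2) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, s)
    return best

# Toy data: feature 0 = size (cm), feature 1 = position (1 = chief).
# Income tracks position, not size, so the first split should use feature 1.
X = [[150, 0], [180, 0], [160, 1], [175, 1]]
y = [2000, 2100, 5000, 5200]
sse, j, s = best_split(X, y)
```

On this toy learning set the search picks the 'position' variable (j = 1), just as the slide argues that position, not size, would be the first node.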
Now the tree is grown… Given the characteristics X of the newcomer entering the room (gender, size, position, …), I put him down the tree; he falls into a terminal node N(X), and I predict his income by averaging the observed Yi over the observations i 'falling' in that node. For example, if the newcomer entering this room is a young man, 150 cm tall, French and not a head of unit, … I will predict his income by averaging those of the young, small, French men who are not chiefs.
Breiman has demonstrated that consequential gains in prediction accuracy can be achieved by using a set of simpler trees. Here we have the RF algorithm as described in the book by Hastie. More precisely, a random forest is a collection of K tree predictors, where each tree is constructed from a bootstrap sample drawn from the learning dataset L. However, instead of determining the optimal split on a given node by evaluating all possible covariates, a subset of covariates drawn at random is used. Finally, we aggregate the K predictors of these simpler trees. By doing that, it is possible to approximate a rich class of functions, and it gives an accurate approximation of the conditional mean. For the free parameters K, nodesize and mtry (the number of variables chosen at random from the p), we used the default values 500, 5 and p/3 of the randomForest R package.
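For concreteness, here is a self-contained sketch of the algorithm just described (an illustration under simplified assumptions, not the randomForest package or the authors' code): K trees grown on bootstrap samples, each split chosen among mtry randomly drawn covariates, and the forest prediction taken as the average over the trees.

```python
# Illustrative random forest regression: bootstrap + random covariate
# subsets + averaging. Trees are stored as nested (j, s, left, right)
# tuples; a leaf is just the mean response of the observations in it.
import random

def grow_tree(X, y, mtry, nodesize=5):
    if len(y) <= nodesize or len(set(map(tuple, X))) == 1:
        return sum(y) / len(y)                    # terminal node: mean of Yi
    p = len(X[0])
    best = None                                   # (sse, j, s)
    for j in random.sample(range(p), mtry):       # random subset of covariates
        for s in {row[j] for row in X}:
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue
            c1, c2 = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - c1) ** 2 for v in left)
                   + sum((v - c2) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, s)
    if best is None:
        return sum(y) / len(y)
    _, j, s = best
    L = [(row, yi) for row, yi in zip(X, y) if row[j] <= s]
    R = [(row, yi) for row, yi in zip(X, y) if row[j] > s]
    return (j, s,
            grow_tree([r for r, _ in L], [v for _, v in L], mtry, nodesize),
            grow_tree([r for r, _ in R], [v for _, v in R], mtry, nodesize))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, s, left, right = tree
        tree = left if x[j] <= s else right
    return tree

def random_forest(X, y, K=500, mtry=None, nodesize=5):
    mtry = mtry or max(1, len(X[0]) // 3)         # R-package default: p/3
    trees = []
    for _ in range(K):
        idx = [random.randrange(len(y)) for _ in y]   # bootstrap sample
        trees.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                               mtry, nodesize))
    return lambda x: sum(predict_tree(t, x) for t in trees) / len(trees)

# Toy usage: one covariate, y = x, forest prediction at x = 10 is near 10.
random.seed(0)
X = [[float(i)] for i in range(20)]
y = [float(i) for i in range(20)]
forest = random_forest(X, y, K=30, mtry=1, nodesize=2)
nowcast = forest([10.0])
```

Note how the sketch makes visible the point raised later about the crisis: a leaf can only return an average of Yi values seen in the learning set, so the forecast can never leave the observed range.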
In addition, RF possesses a number of features that can be used, for example, to deal with the problem of missing values. In this study, we use the variable importance feature to reduce data dimensionality, since here n, the number of observations, is much lower than p, the dimension of the explanatory variables. The variable importance makes it possible to discriminate between informative and non-informative variables. For each variable, the idea is to compare the prediction error with the prediction error obtained when the variable is randomly permuted: large positive values for a variable indicate that this variable is predictive, since noising it up increases the prediction error; zero or negative importance values indicate non-predictive variables. In my previous example, the importance of size would very likely have been negative, while age or position would have been positive…
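The permutation idea can be sketched in a few lines (again an illustration under invented toy data, not the study's code): shuffle one column, re-score the model, and report the average increase in prediction error.

```python
# Sketch of permutation importance: for each variable, compare the model's
# prediction error before and after randomly permuting that variable's
# values. A large increase in error means the variable is predictive.
import random

def mse(predict, X, y):
    return sum((predict(x) - yi) ** 2 for x, yi in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, j, n_repeats=10):
    """Average increase in MSE when column j is randomly shuffled."""
    base = mse(predict, X, y)
    increases = []
    for _ in range(n_repeats):
        col = [row[j] for row in X]
        random.shuffle(col)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        increases.append(mse(predict, X_perm, y) - base)
    return sum(increases) / n_repeats

# Toy setup: y depends only on variable 0, so permuting variable 0 raises
# the error while permuting the pure-noise variable 1 changes nothing.
random.seed(1)
X = [[float(i), random.random()] for i in range(30)]
y = [2.0 * row[0] for row in X]
model = lambda x: 2.0 * x[0]   # a model of the true relationship
imp0 = permutation_importance(model, X, y, 0)
imp1 = permutation_importance(model, X, y, 1)
```

As in the size/position story above: the informative variable gets a large positive importance, the uninformative one an importance of (about) zero.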
So, our aim is to predict GDP growth for quarter Q based on data available at the end of quarter Q. To be clear: based on the information available until the end of June 2010, we want to forecast the 2nd quarter of 2010 (a nowcast; remember that the flash estimate of GDP will be released by Eurostat in mid-August). To assess the quality of our forecasts, we carry out an out-of-sample analysis over 2004Q1 – 2009Q3. The criterion will be the out-of-sample MSE. We started our study with a univariate AR model… which appeared to be a poor competitor. So we decided to compare our results with a fair competitor: the quarterly projections of the EZEO (jointly released by three major European economic institutes, the German IFO, the French INSEE and the Italian ISAE). EZEO is a quarterly publication: based on the information available at the end of the previous quarter, EZEO forecasts GDP growth of quarter Q. In fact, this publication also provides 2-step-ahead projections (Q+1 and Q+2) for GDP, IP, consumption and inflation, and it also describes the economic links explaining these forecasts… Here, however, our concern was just to assess how a data-driven model like the RF performs relative to competitors for GDP nowcasting.
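The evaluation scheme just described can be sketched as a recursive out-of-sample loop (an assumed setup to illustrate the criterion, not the authors' exact code): at each quarter of the evaluation window the model is re-fit on all data up to that quarter, a one-step nowcast is made, and the squared errors are averaged.

```python
# Out-of-sample MSE: refit on the expanding learning set at each step of
# the evaluation window, nowcast the next quarter, and average the
# squared forecast errors.

def out_of_sample_mse(fit, X, y, start):
    """fit(X_train, y_train) -> predict(x); evaluate from index `start` on."""
    errors = []
    for t in range(start, len(y)):
        predict = fit(X[:t], y[:t])          # learn on quarters before t
        errors.append((predict(X[t]) - y[t]) ** 2)
    return sum(errors) / len(errors)

# Toy check with a naive "last observed value" forecaster.
naive_fit = lambda X_tr, y_tr: (lambda x: y_tr[-1])
X = [[t] for t in range(8)]
y = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
score = out_of_sample_mse(naive_fit, X, y, start=4)
```

The same loop scores any competitor (AR, RF, bridge model), which is what makes the MSE comparison across models fair.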
Results: we compare the pure random forest (non-parametric) to our benchmarks. Based on the MSE computed over the whole out-of-sample period, RF outperforms the AR but not the Euro Zone Economic Outlook. You have all the forecasts, quarter after quarter, in the paper, but in this graph I plotted how the MSE develops over time (real-time MSE). The graph highlights the good performance of RF before the crisis and its poor performance during the crisis: in fact, for 2008Q4 and 2009Q1 we made a huge error. But this is not surprising, as the pure RF forecasts are averages of values observed in the learning set, and before the crisis no negative values were present in the learning set! However, it is worth noting that the RF algorithm 'learns' and will be able to predict negative output in the future.
To get around the problem of the absence of negative values in our learning set, we tried to use the power of RF to select the 25 most important variables and then to insert these variables into a bridge model. We used the general-to-specific procedure to choose the best linear model; RF_LINMOD is the model we retained. Remarkably, 3 of the five explanatory variables come from industry (whose variability explains most of the GDP cycles); moreover, these variables are related to order books, which are well known to be among the most factual and reliable soft variables. We also have one question about orders in retail trade and one question from the consumer survey.
Results: we then compare the RF_LINMOD outputs using the same criterion. Based on the MSE computed over the whole out-of-sample period, it performs as well as the Euro Zone Economic Outlook; it outperforms the pure RF and compares well to the Euro Zone Economic Outlook during the 'crisis'.
For further developments, we would like to continue this analysis by adding to our dataset the hard variables that are available at the end of each quarter (for example the carry-over of industrial production, or first registrations of private cars)… and according to our tests, the carry-over of IP appears to be one of the most important variables! Moreover, this kind of information is certainly used by our colleagues in the Euro Zone Economic Outlook… So, this work is still ongoing but promising.
Transcript
Euro area GDP forecasting using large survey datasets A random forest approach Olivier Biau – Angela D’Elia Directorate General for Economic and Financial Affairs European Commission Groupe Travail prévision Paris, 12 May 2010 Views expressed represent exclusively the positions of the authors and do not necessarily correspond to those of the European Commission
RF enjoys good prediction properties, is robust to noise and can handle a very large number of input variables
RF is considered to be one of the most accurate general-purpose learning techniques available, independent of any functional and distributional assumptions
However, from a mathematical point of view, the mechanism of RF algorithms remains largely unknown and is not clearly explained
Breiman, 2002; Lin and Jeon, 2006; Biau et al., 2008
A specific application to short-term GDP forecasting in the euro area is shown using (only) the harmonized European Union Business and Consumer Surveys dataset
A typical high-dimensional regression problem ( n << p )
The RF technique is explored with two aims in mind:
1. to obtain a non-parametric forecast of GDP
2. to obtain a ranking of the explanatory variables, and then select those variables to build a linear model to forecast GDP
The forecast performance is assessed through an out-of-sample exercise
The mission of the Directorate-General for Economic and Financial Affairs is to contribute to raising the economic welfare of the citizens in the European Union and beyond, notably by developing and promoting policies that ensure sustainable economic growth, a high level of employment, stable public finances and financial stability.
At the present juncture, this means working to ensure that the European economy emerges quickly and strongly from the present deep economic and financial crisis. We do this by helping to find the right policy-mix for overcoming the economic and financial crisis and for the EU economy to significantly reduce unemployment and to attain sustainable growth and convergence in living standards in a stable
Based on the Joint Harmonized EU Programme of Business and Consumer Surveys (manufacturing industry, services, retail trade, construction, and consumers)
Qualitative surveys
It covers all 27 Member States, Croatia, the FYROM and Turkey
More than 125 000 firms and over 40 000 consumers surveyed every month
How to predict the income of a newcomer (entering the room) based on observed characteristics X i (gender, size, weight, age, …) and income Y i of people attending the conference?
For the first node, we seek the first splitting variable X j and the first split point s which discriminate the most, by solving:
For any choice j and s , the inner minimization is solved by:
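The formulas on these two slide lines did not survive transcription; the standard CART criterion they refer to (in the notation of Hastie et al., consistent with the constant-per-region description above) is:

```latex
\min_{j,\,s}\Bigl[\,\min_{c_1}\sum_{X_i \in R_1(j,s)} (Y_i - c_1)^2
\;+\; \min_{c_2}\sum_{X_i \in R_2(j,s)} (Y_i - c_2)^2 \Bigr],
\qquad
R_1(j,s) = \{X : X_j \le s\},\quad R_2(j,s) = \{X : X_j > s\},
```

and for any choice of $j$ and $s$ the inner minimization is solved by the region averages:

```latex
\hat{c}_1 = \operatorname{ave}\bigl(Y_i \mid X_i \in R_1(j,s)\bigr),
\qquad
\hat{c}_2 = \operatorname{ave}\bigl(Y_i \mid X_i \in R_2(j,s)\bigr).
```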
Having found the best split, we partition the data into the two resulting regions and repeat the splitting process until each node reaches a user-specified minimum nodesize and becomes a terminal node
Breiman’s idea: instead of finding the best tree, build a large number ( K ) of simpler regression trees and aggregate them
RF algorithm (Hastie, Tibshirani and Friedman, 2009)
1. For k = 1 to K :
(a) Draw a bootstrap sample from the learning dataset L = { ( X 1 ,Y 1 ) … ( X n , Y n ) }.
(b) Grow a random-forest tree h k to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum nodesize is reached.
Select mtry variables at random from the p variables
Pick the best variable/split-point among the mtry
Split the node into two daughter nodes
2. The predicted outcome (final decision) is obtained as the average value over the K trees:
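The averaging formula from this slide was lost in transcription; in the notation of the steps above (tree predictors h_k), the standard form is:

```latex
\hat{f}_{\mathrm{RF}}(x) \;=\; \frac{1}{K}\sum_{k=1}^{K} h_k(x).
```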
For the free parameters K , nodesize and mtry , we used the default values 500, 5 and p /3 of the random forest R-package