Internet Traffic Forecasting using Time Series Methods
Multi-scale Internet traffic forecasting using neural networks and time series methods

Paulo Cortez (1), Miguel Rio (2), Miguel Rocha (3) and Pedro Sousa (3)

(1) Department of Information Systems/Algoritmi, University of Minho, 4800-058 Guimarães, Portugal. Email: pcortez@dsi.uminho.pt
(2) Department of Electronic and Electrical Engineering, University College London, Torrington Place, WC1E 7JE London, UK
(3) Department of Informatics/CCTC, University of Minho, 4710-059 Braga, Portugal

Article. DOI: 10.1111/j.1468-0394.2010.00568.x. © 2010 Blackwell Publishing Ltd. Expert Systems, May 2012, Vol. 29, No. 2.

Abstract: This article presents three methods to forecast accurately the amount of traffic in TCP/IP based networks: a novel neural network ensemble approach and two important adapted time series methods (ARIMA and Holt-Winters). In order to assess their accuracy, several experiments were held using real-world data from two large Internet service providers. In addition, different time scales (5 min, 1 h and 1 day) and distinct forecasting lookaheads were analysed. The experiments with the neural ensemble achieved the best results for 5 min and hourly data, while the Holt-Winters is the best option for the daily forecasts. This research opens possibilities for the development of more efficient traffic engineering and anomaly detection tools, which will result in financial gains from better network resource management.

Keywords: network monitoring, multi-layer perceptron, time series, traffic engineering

1. Introduction

As more applications vital to today's society migrate to TCP/IP networks, it is crucial to develop techniques to better understand and forecast the behaviour of these networks. In effect, TCP/IP traffic prediction is an important issue for any medium/large network provider and it is gaining more attention from the computer networks community (Papagiannaki et al., 2005; Babiarz & Bedo, 2006). By improving this task's performance, network providers can optimize resources (e.g. adaptive congestion control and proactive network management), allowing a better quality of service (Alarcon-Aquino & Barria, 2006). Moreover, traffic forecasting can also help to detect anomalies in the data networks. Security attacks like denial-of-service or even an irregular amount of SPAM can in theory be detected by comparing the real traffic with the values predicted by forecasting algorithms (Krishnamurthy et al., 2003; Jiang & Papavassiliou, 2004). The earlier detection of these problems would lead to a more reliable service.

Nowadays, TCP/IP traffic prediction is often done intuitively by experienced network administrators, with the help of marketing information on the future number of customers and their behaviours (Papagiannaki et al., 2005). Yet, this produces only a rough idea of the real traffic. On the other hand, contributions from the areas of Operational Research and Computer Science have led to solid forecasting methods that replaced intuition based ones in other fields.
In particular, the field of time series forecasting (TSF) deals with the prediction of a chronologically ordered variable (Makridakis et al., 1998). The goal of TSF is to model a complex system as a black-box, predicting its behavior based on historical data, and not how it works.

Owing to its importance, several TSF methods have been proposed, such as the Holt-Winters (Makridakis et al., 1998), the ARIMA methodology (Box & Jenkins, 1976) and artificial neural networks (NN) (Lapedes & Farber, 1987; Ding et al., 1995; Malki et al., 2004; Cortez et al., 2005). Holt-Winters was devised for series with trended and seasonal factors. More recently, a double seasonal version has been proposed (Taylor, 2003). The ARIMA is a more complex approach, requiring steps such as model identification, estimation and validation. NNs are connectionist models inspired by the behavior of the central nervous system, and in contrast with the previous methods, they can predict non-linear series. In the past, several studies have demonstrated the predictability of network traffic by using similar methods, such as Holt-Winters (Krishnamurthy et al., 2003) and ARIMA (Sang & Li, 2002; Papagiannaki et al., 2005). Following the evidence of non-linear network traffic (Hansegawa et al., 2001), NNs have also been proposed (Jiang & Papavassiliou, 2004; Alarcon-Aquino & Barria, 2006; Wang et al., 2008).

In this work, several experiments are carried out, based on recent real-world data provided by two ISPs, in order to provide network engineers with useful feedback. The main contributions of this work are:

  i. Internet traffic is predicted using a pure TSF approach (i.e. only past values are used as inputs), in contrast to (Krishnamurthy et al., 2003), which uses compact summaries of traffic data, and (Papagiannaki et al., 2005), which uses wavelets to smooth the signal, allowing its use in wider contexts;
  ii. several forecasting methods are tested and compared, including a novel NN ensemble based on fast heuristic procedures for time window and model selection, and adaptations of the Holt-Winters, both traditional and recent double seasonal versions, and the ARIMA methodology;
  iii. in contrast with previous studies (Hansegawa et al., 2001; Jiang & Papavassiliou, 2004; Papagiannaki et al., 2005; Babiarz & Bedo, 2006; Alarcon-Aquino & Barria, 2006; Wang et al., 2008), two distinct ISPs are considered, the predictions are analysed at different time scales (i.e. 5 min, hourly, daily) and distinct ahead forecasts are performed.

The result of this research is expected to allow the development of intelligent TCP/IP traffic forecasting engines.

The remainder of the article is organized in four sections. First, the Internet traffic data is presented and analysed. The forecasting methods are given in Section 3, while the results are presented and discussed in Section 4. Finally, in the last section closing conclusions are drawn.

2. Time series analysis

A time series is a collection of time ordered observations $(y_1, y_2, \ldots, y_t)$, each one being recorded at a specific time $t$ (period), appearing in a wide set of domains such as Finance, Production and Control (Makridakis et al., 1998). A time series model $(\hat{y}_t)$ assumes that past patterns will occur in the future. Another relevant concept is the horizon or lead time ($h$), which is defined by the time in advance that a forecast is issued.

The performance of a forecasting model is evaluated by an accuracy measure, such as the sum squared error (SSE) and mean absolute percentage error (MAPE) (Makridakis et al., 1998):

$$
\begin{aligned}
e_t &= y_t - \hat{y}_{t,t-h} \\
SSE_h &= \sum_{i=P+1}^{P+N} e_i^2 \\
MAPE_h &= \frac{1}{N} \sum_{i=P+1}^{P+N} \frac{|e_i|}{y_i} \times 100\%
\end{aligned}
\tag{1}
$$
where $e_t$ denotes the forecasting error at time $t$; $y_t$ the desired value; $\hat{y}_{t,p}$ the predicted value for period $t$ and computed at period $p$; $P$ is the present time and $N$ the number of forecasts.

The MAPE is a common metric in forecasting applications, such as electricity demand (Malki et al., 2004; Taylor et al., 2006), and it measures the proportionality between the forecasting error and the actual value. This metric will be adopted in this work, since it is easier for network administrators to interpret. In addition, it presents the advantage of being scale independent. It should be noted that the SSE values were also calculated but the results will not be reported since the relative forecasting performances are similar.

Our approach uses already available information provided by the Simple Network Management Protocol (SNMP) that quantifies the traffic passing through every network interface with reasonable accuracy (Stallings, 1999). SNMP is widely deployed by every ISP/network, so the collection of this data does not induce any extra traffic on the network.

This work analyses traffic data (in bits) from two different ISPs, denoted here as A and B. The A dataset belongs to a private ISP with centres in 11 European cities. The data corresponds to a transatlantic link and was collected from 06:57 hours on 7 June to 11:17 hours on 29 July 2005. Dataset B comes from UKERNA (the United Kingdom education and research networking association) and represents aggregated traffic in the United Kingdom academic network backbone. It was collected between 19 November 2004, at 09:30 hours and 27 January 2005, at 11:11 hours. The A time series was registered every 30 s, while the B data was recorded every 5 min. The first series (A) included eight missing values, which were replaced using a linear interpolation (Hastie et al., 2001). The missing data is explained by the fact that the SNMP scripts are not 100% reliable, since the SNMP messages may be lost. Yet, this occurs very rarely and it is statistically insignificant. Finally, it should be mentioned that within this domain it is difficult to collect more than two or three months of data, since network servers often reboot or pass through update/maintenance changes.

Depending on the time scale, the following forecasting types can be defined (Ding et al., 1995):

  • real-time, which concerns samples not exceeding a few minutes and requires an on-line forecasting system;
  • short-term, from one to several hours, crucial for optimal control or detection of abnormal situations;
  • middle-term, typically from one to several days, used to plan resources; and
  • long-term, often issued several months/years in advance and needed for strategic decisions, such as financial investments.

Owing to the characteristics of the Internet traffic collected, this study will only consider the first three types. Therefore, three new time series were created for each ISP by aggregating the original values; that is, summing all data samples within a given period of time. The selected time scales were (Figure 1): every 5 min (series A5M and B5M), every hour (A1H and B1H) and every day (A1D and B1D). The datasets are available at: http://www3.dsi.uminho.pt/pcortez/series/. Owing to the temporal nature of this domain, a sequential holdout (i.e. train/test split) will be adopted for the forecasting evaluation. Hence, the first 2/3 of the series will be used to fit (train) the forecasting models and the remaining last 1/3 to evaluate (test) the forecasting accuracies (Table 1). Under this scheme, the number of forecasts is equal to $N = N_T - h + 1$, where $h$ is the lead time period and $N_T$ is the number of samples used for testing.

[Figure 1: The Internet traffic time series (A5M, B5M, A1H, B1H, A1D and B1D) with several daily, weekly and seasonal patterns (e.g. traffic decrease in late December in B datasets is due to Christmas season). Six panels, each divided into a train and a test segment; x-axes: time (× 5 minutes, × 1 hour, × 1 day); y-axes: traffic in bits.]

The autocorrelation coefficient is a statistic that measures the correlation between a series and itself, lagged by $k$ periods (Box & Jenkins, 1976):

$$
r_k = \frac{\sum_{t=1}^{T-k} (y_t - \bar{y})(y_{t+k} - \bar{y})}{\sum_{t=1}^{T} (y_t - \bar{y})^2}
\tag{2}
$$
where $y_1, \ldots, y_T$ stands for the time series and $\bar{y}$ for the series' average. Autocorrelations are useful for the detection of seasonal components (Makridakis et al., 1998). For example, the autocorrelations for the A5M and A1H series are plotted in Figure 2. The daily seasonal effect ($K_1 = 288$) is visible for the 5 min data, while two seasonal components appear at the hourly scale, due to the intraday ($K_1 = 24$) and intraweek cycles ($K_2 = 168$).

Table 1: The scale and length of Internet traffic time series

Series   Time scale   Train length   Test length   Total length
A5M      5 min        9848           4924          14772
A1H      1 h          821            410           1231
A1D      1 day        34             17            51
B5M      5 min        13259          6629          19888
B1H      1 h          1105           552           1657
B1D      1 day        46             23            69

[Figure 2: The autocorrelations for the series A5M (left, daily seasonal period K1 = 288) and A1H (right, weekly seasonal period K2 = 168). x-axes: lag; y-axes: autocorrelation.]

3. Forecasting methods

3.1. Naive benchmark

A common naive forecasting method is to predict the future as the present value. Yet, this method will perform poorly in seasonal data. Thus, a better alternative is to use a seasonal version, where a forecast will be given by the observed value for the same period related to the previous seasonal cycle (Taylor et al., 2006):

$$
\hat{y}_{t+h,t} = y_{t+h-K}
\tag{3}
$$

where $K$ is the seasonal period. In this work, $K$ will be set to the weekly cycle. This naive method, which can be easily adopted by the network administrators, will be used as a benchmark for the comparison with other forecasting approaches.

3.2. Holt-Winters

The Holt-Winters is an important forecasting technique where the predictive model is based on trended and seasonal patterns that are distinguished from noise by averaging the historical values. It presents advantages such as simplicity of use, reduced computational demand and accuracy for seasonal series. The model is defined by the equations (Makridakis et al., 1998):

$$
\begin{aligned}
\text{Level:}\quad & S_t = \alpha \frac{y_t}{D_{t-K_1}} + (1-\alpha)(S_{t-1} + T_{t-1}) \\
\text{Trend:}\quad & T_t = \beta (S_t - S_{t-1}) + (1-\beta) T_{t-1} \\
\text{Seasonality:}\quad & D_t = \gamma \frac{y_t}{S_t} + (1-\gamma) D_{t-K_1} \\
& \hat{y}_{t+h,t} = (S_t + h T_t) \times D_{t-K_1+h}
\end{aligned}
\tag{4}
$$

where $S_t$, $T_t$ and $D_t$ stand for the level, trend and seasonal estimates, $K_1$ for the seasonal period, and $\alpha$, $\beta$ and $\gamma$ for the model parameters. When there is no seasonal component, $\gamma$ is discarded and the $D_{t-K_1+h}$ factor in the last equation is replaced by unity.
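As a rough illustration of equations (1), (3) and (4), the sketch below implements the MAPE, the seasonal naive benchmark and the single-seasonal multiplicative Holt-Winters forecast in plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and the start-up values (level set to the mean of the first cycle, zero trend, seasonal indices taken from the first cycle) are a simple choice made here for illustration only.

```python
def mape(actual, predicted):
    """Mean absolute percentage error of Eq. (1), in percent."""
    total = sum(abs(a - p) / a for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def seasonal_naive(history, h, K):
    """Eq. (3): repeat the value seen one seasonal cycle before t+h (needs 1 <= h <= K)."""
    t = len(history)                 # history holds y_1 .. y_t
    return history[t + h - K - 1]    # y_{t+h-K}, shifted to 0-based indexing


def holt_winters(history, K, alpha, beta, gamma, h):
    """Eq. (4): h-step-ahead forecast issued at the last observation (needs 1 <= h <= K)."""
    # Assumed start-up values (an illustrative choice, not taken from the article):
    level = sum(history[:K]) / K                 # mean of the first seasonal cycle
    trend = 0.0
    season = [y / level for y in history[:K]]    # season[i] is D at 0-based time i
    for i in range(K, len(history)):
        y = history[i]
        prev_level = level
        level = alpha * y / season[i - K] + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season.append(gamma * y / level + (1 - gamma) * season[i - K])
    last = len(history) - 1
    return (level + h * trend) * season[last - K + h]


# Toy series with a period of 4 samples (made-up numbers, not the ISP data).
traffic = [10.0, 20.0, 30.0, 40.0] * 3
hw_forecasts = [holt_winters(traffic, 4, 0.5, 0.1, 0.3, h) for h in (1, 2, 3, 4)]
naive_forecasts = [seasonal_naive(traffic, h, 4) for h in (1, 2, 3, 4)]
```

On a perfectly periodic toy series both methods simply reproduce the cycle; they start to differ once the level or trend of the series drifts, which is where the smoothing parameters matter.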
More recently, this method has been extended to encompass two seasonal cycles (Taylor, 2003):

$$
\begin{aligned}
\text{Level:}\quad & S_t = \alpha \frac{y_t}{D_{t-K_1} W_{t-K_2}} + (1-\alpha)(S_{t-1} + T_{t-1}) \\
\text{Trend:}\quad & T_t = \beta (S_t - S_{t-1}) + (1-\beta) T_{t-1} \\
\text{Seasonality 1:}\quad & D_t = \gamma \frac{y_t}{S_t W_{t-K_2}} + (1-\gamma) D_{t-K_1} \\
\text{Seasonality 2:}\quad & W_t = \omega \frac{y_t}{S_t D_{t-K_1}} + (1-\omega) W_{t-K_2} \\
& \hat{y}_{t+h,t} = (S_t + h T_t) \times D_{t-K_1+h} W_{t-K_2+h}
\end{aligned}
\tag{5}
$$

where $W_t$ is the second seasonal estimate, $K_1$ and $K_2$ are the first and second seasonal periods; and $\omega$ is the second seasonal parameter.

The initial values for the level, trend and seasonal estimates will be set by averaging the early observations (Taylor, 2003). The parameters ($\alpha$, $\beta$, $\gamma$ and $\omega$) will be optimized by a grid search, which works by testing all combinations of a discrete set of values for each parameter. The aim is to get the lowest training error ($SSE_1$), which is a common procedure within the forecasting community.

3.3. ARIMA methodology

The ARIMA is another important forecasting approach that goes through model identification, parameter estimation and model validation (Box & Jenkins, 1976). The main advantage of this method lies in its accuracy over a wider domain of series, despite being more complex than the Holt-Winters. The model is based on a linear combination of past values (AR components) and errors (MA components), being named autoregressive integrated moving-average (ARIMA).

The non-seasonal model is denoted by the form ARIMA(p, d, q) and is defined by the equation

$$
\phi_p(L)(1-L)^d y_t = \theta_q(L) e_t
\tag{6}
$$

where $y_t$ is the series; $e_t$ is the error; $L$ is the lag operator (e.g. $L^3 y_t = y_{t-3}$); $\phi_p = 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p$ is the AR polynomial of order $p$; $d$ is the differencing order; and $\theta_q = 1 - \theta_1 L - \theta_2 L^2 - \ldots - \theta_q L^q$ is the MA polynomial of order $q$. When the series has a non-zero average through time, the model may also contemplate a constant term $\mu$ in the right side of the equation. For demonstrative purposes, the full time series model is presented for ARIMA(1,1,1): $\hat{y}_{t,t-1} = \mu + (1 + \phi_1) y_{t-1} - \phi_1 y_{t-2} - \theta_1 e_{t-1}$. To create multi-step predictions, the one step-ahead forecasts are used iteratively as inputs (Taylor et al., 2006).

There is also a multiplicative seasonal version, often called SARIMA and denoted by the term ARIMA(p, d, q)(P1, D1, Q1). It can be written as:

$$
\phi_p(L)\,\Phi_{P_1}(L^{K_1})\,(1-L)^d (1-L^{K_1})^{D_1} y_t = \theta_q(L)\,\Theta_{Q_1}(L^{K_1})\, e_t
\tag{7}
$$

where $K_1$ is the seasonal period; $\Phi_{P_1}$ and $\Theta_{Q_1}$ are polynomial functions of orders $P_1$ and $Q_1$. Finally, the double seasonal ARIMA(p, d, q)(P1, D1, Q1)(P2, D2, Q2) is defined by (Taylor et al., 2006)

$$
\phi_p(L)\,\Phi_{P_1}(L^{K_1})\,\Omega_{P_2}(L^{K_2})\,(1-L)^d (1-L^{K_1})^{D_1} (1-L^{K_2})^{D_2} y_t = \theta_q(L)\,\Theta_{Q_1}(L^{K_1})\,\Psi_{Q_2}(L^{K_2})\, e_t
\tag{8}
$$

where $K_2$ is the second seasonal period; $\Omega_{P_2}$ and $\Psi_{Q_2}$ are the polynomials of orders $P_2$ and $Q_2$.

The constant and the coefficients of the model are usually estimated by using statistical approaches (e.g. least squares methods). It was decided to use the forecasting package X-12-ARIMA from the US Bureau of the Census (Time-Series-Staff, 2002), for the parameter estimation of a given model. For each series, several ARIMA models will be tested and the BIC statistic, which penalizes model complexity and is evaluated over the training data, will be the criterion for the model selection, as advised by the X-12-ARIMA manual.

3.4. Artificial neural networks

Neural models are innate candidates for forecasting due to their flexibility (i.e. there are no a priori restrictions on the type of relationship to be modeled) and non-linear learning capabilities. Indeed, the use of NNs for TSF began in the late 1980s with encouraging results and the field has been consistently growing since (Lapedes & Farber, 1987; Ding et al., 1995; Malki et al., 2004; Cortez et al., 2005).
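Returning briefly to the ARIMA(1,1,1) model written out in Section 3.3, the following sketch shows one way the one-step equation and the iterative multi-step scheme could be coded. It is only an illustration under stated assumptions: the function name is hypothetical, the coefficients are supplied by the caller rather than estimated (the article uses X-12-ARIMA for estimation), the unobserved start-up error is taken as zero, and future errors are replaced by their expected value of zero.

```python
def arima111_forecasts(series, phi1, theta1, mu=0.0, steps=1):
    """Iterated forecasts from y_hat = mu + (1+phi1)*y[t-1] - phi1*y[t-2] - theta1*e[t-1].

    `series` needs at least two observations.
    """
    # Pass over the history to recover the most recent one-step error e_t.
    # Assumption: the error before the first forecastable point is zero.
    prev_error = 0.0
    for t in range(2, len(series)):
        one_step = mu + (1 + phi1) * series[t - 1] - phi1 * series[t - 2] - theta1 * prev_error
        prev_error = series[t] - one_step

    # Multi-step forecasts: each prediction is fed back as if it were observed.
    values = list(series)
    forecasts = []
    for _ in range(steps):
        nxt = mu + (1 + phi1) * values[-1] - phi1 * values[-2] - theta1 * prev_error
        forecasts.append(nxt)
        values.append(nxt)
        prev_error = 0.0   # future errors are unknown; use their expected value
    return forecasts


# With phi1 = 1 and theta1 = 0 the recursion extrapolates the last slope.
trend_forecasts = arima111_forecasts([1.0, 2.0, 3.0, 4.0], phi1=1.0, theta1=0.0, steps=3)
```

Because the moving-average term only depends on the last observed error, it influences the first forecast alone; from the second step onwards the recursion is driven purely by the autoregressive and differencing terms.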
Although there are other neural architectures, the majority of the NN studies use the multi-layer perceptron network (Lapedes & Farber, 1987; Cortez et al., 1995, 2005; Ding et al., 1995; Tong et al., 2004). With this network, TSF is achieved by using a sliding time window (Wang et al., 2008), defined by the set of time lags $\{k_1, k_2, \ldots, k_I\}$ used to build a forecast. For a given time period $t$, the NN inputs are $y_{t-k_I}, \ldots, y_{t-k_2}, y_{t-k_1}$ and the desired output is $y_t$. For example, let us consider the series $y_1 = 5$, $y_2 = 10$, $y_3 = 14$, $y_4 = 19$, $y_5 = 23$. If the $\{1, 3\}$ window is adopted, then two training examples can be created: $5, 14 \to 19$ and $10, 19 \to 23$.

[Figure 3: The neural network architecture, with an input layer of lagged values $x_{t-k_1}, \ldots, x_{t-k_I}$, one hidden layer, an output layer producing $x_t$, bias (+1) nodes and connection weights $w_{i,j}$.]

In this work, fully connected multi-layer perceptrons, with one hidden layer of $H$ hidden nodes, bias and shortcut connections will be adopted (Figure 3). To introduce non-linearity, the logistic activation function was applied on the hidden nodes. The linear function was used in the output node, in order to scale the range of the outputs (Cortez et al., 2005). The final model is given by:

$$
\hat{y}_{t,t-1} = w_{o,0} + \sum_{i=1}^{I} y_{t-k_i} w_{o,i} + \sum_{j=I+1}^{o-1} f\!\left( \sum_{i=1}^{I} y_{t-k_i} w_{j,i} + w_{j,0} \right) w_{o,j}
\tag{9}
$$

where $w_{i,j}$ denotes the weight of the connection from node $j$ to $i$ (if $j = 0$ then it is a bias connection), $o$ denotes the output node and $f$ the logistic function $\left( \frac{1}{1+e^{-x}} \right)$. Similar to ARIMA, multi-step forecasts are built by iteratively using 1-ahead predictions as inputs (Taylor et al., 2006).

In the training stage, the NN initial weights are randomly set within the range $[-1.0, 1.0]$. Then, the RPROP algorithm was adopted, since it presents a faster training when compared with other algorithms such as the backpropagation (Riedmiller, 1994). The training is stopped when the error slope approaches zero or after a maximum of 1000 epochs.

The quality of the trained network will depend on the choice of the starting weights, since the error function is non-convex and the training may fall into local minima. To solve this issue, the solution adopted is to use a neural network ensemble (NNE) where $R$ different networks are trained (here set to $R = 5$) and the final prediction is given by the average of the individual predictions (Hastie et al., 2001). In general, ensembles are better than individual learners, provided that the errors made by the individual models are uncorrelated, a condition easily met with NNs, since the training algorithms are stochastic in nature (Dietterich, 2000).

Under this setup, the NNE performance will depend on two crucial parameters: the choice of the input time lags and number of hidden nodes ($H$). Feeding a NN with uncorrelated variables or time lags will affect the learning process due to the increase of noise. A NN with 0 hidden neurons can only learn linear relationships and it is equivalent to the classic Auto-Regressive (AR) model. Increasing the number of hidden neurons allows more complex functions to be learned, but it also increases the probability of overfitting to the data and thus losing the generalization capability.

Since the search space for these parameters is high, heuristic procedures will be proposed in the next section to reduce the computational effort, limiting the search to a few time window/hidden node combinations during the model selection step. In this stage, the training data (2/3 of the series' length) will be further divided into training and validation sets. The former, with 2/3 of the training data, will be used to train the NNE. The latter, with the remaining 1/3, will be used to
estimate the network generalization capabilities. The NNE with the lowest validation error (average of all $MAPE_h$ values) will be selected. After the model selection, the final NNE is retrained using all training data.

4. Experiments and results

The Holt-Winters and NNs were implemented in an object oriented programming environment developed in the Java language by the authors. Regarding the ARIMA methodology, the different models will be estimated using the X-12-ARIMA package (Time-Series-Staff, 2002). The best model (with the lowest BIC values) will be selected and then the forecasts are produced in the Java environment.

The Holt-Winters models were adapted to the series characteristics. The seasonal version ($K_1 = 7$) was used for the daily values, while the double seasonal variant ($K_1 = 24$ and $K_2 = 168$) was applied on the hourly series. Both seasonal ($K_1 = 288$) and non-seasonal versions were tested for the 5 min scale data, since it was suspected that the seasonal effect could be less relevant in this case. Indeed, SSE errors obtained in the training data backed this claim. To optimize the parameters of the selected models (Table 2), the grid-search used a step of 0.01 for the 5 min and daily data. The grid step was increased to 0.05 in the hourly series, due to the higher computational effort required by the double seasonal models.

Table 2: The selected Holt-Winters forecasting models

Series   K1   K2     α      β      γ      ω
A5M      –    –      0.76   0.09   –      –
A1H      24   168    0.70   0.00   1.00   1.00
A1D      7    –      0.00   0.00   1.00   –
B5M      –    –      1.00   0.07   –      –
B1H      24   1105   0.95   0.00   0.75   1.00
B1D      7    –      1.00   0.01   0.01   –

Regarding the ARIMA, an extensive range of models was tested. In all cases, the $\mu$ constant was set to zero by the X-12-ARIMA package. For the daily series, the $p$, $P_1$, $q$ and $Q_1$ orders ranged from 0 to 2; and the $d$ and $D_1$ orders were set to 0 and 1, in a total of 35 models. In case of the hourly data, no differencing factors were used, since the series seems stationary and the Holt-Winters models provided no evidence for trended factors, with very low $\beta$ values. A total of eight double seasonal ARIMA models were tested, by using combinations of the $p$, $P_1$, $P_2$, $q$, $Q_1$ and $Q_2$ values up to a maximum order of 2. Finally, for the 5 min datasets, three single seasonal (maximum order of 1) and 25 non-seasonal (maximum order of 5) models were explored. Similar to the Holt-Winters case, for these series only non-seasonal ARIMA models were selected. Table 3 shows the best ARIMA models.

The NNE heuristic rules for model selection were set as follows. The number of tested hidden nodes ($H$) was within the range {0,2,4,6,8}, since in previous work (Cortez et al., 2005) it has been shown that complex series can be modeled by small neural structures. Based on the seasonal traits, three different sliding windows were explored in each time scale:

  • {1,2,3,4,5,6,7,8}, {1,2,3,6,7,8} and {1,7,8} for the daily series;
  • {1,2,3,24,25,26,168,167,169}, {1,2,3,11,12,13,24,25,26} and {1,2,3,24,25,26} for the hourly data; and
  • {1,2,3,5,6,7,287,288,289}, {1,2,3,5,6,7,11,12,13} and {1,2,3,4,5,6,7} for the 5 min scale.

Table 4 presents the selected NNEs. Regarding the selected time lags, it is interesting to notice that there are two models that contrast with the previous methods. The B5M model includes seasonal information ($K_1 = 288$), while the A1H does not use the second seasonal factor ($K_2 = 168$).

After the model selection stage, the forecasts were performed for each method by testing a lead time from $h = 1$ to 24, for the 5 min and hourly data, and a horizon of $h = 1$ to 7 for the daily series. In case of the NNE, 20 runs were applied to each configuration in order to present the results in terms of the average and Student's t 95% confidence intervals (Flexer, 1996). Table 5 shows the forecasting errors for each method,
when using the smallest and largest lookaheads. The global performance is presented as the average error ($\bar{h}$) for all $h$ values. The overall view is given in Figure 4, where the MAPE is plotted for all horizons.

Table 3: The selected ARIMA forecasting models

Series   Model                   Parameters
A5M      (5 0 5)                 φ1 = 2.81, φ2 = 3.49, φ3 = 2.40, φ4 = −0.58, φ5 = −0.13, θ1 = 1.98, θ2 = 1.91, θ3 = 0.75, θ4 = 0.26, θ5 = −0.20
A1H      (2 0 0)(2 0 0)(2 0 0)   φ1 = 1.70, φ2 = −0.74, Φ1 = 0.60, Φ2 = 0.06, Ω1 = −0.08, Ω2 = −0.28
A1D      (2 1 0)(0 1 0)          φ1 = −0.46, φ2 = −0.35
B5M      (5 0 5)                 φ1 = 1.58, φ2 = −0.59, φ3 = 1.00, φ4 = −1.58, φ5 = 0.59, θ1 = 0.74, θ2 = −0.08, θ3 = −0.97, θ4 = −0.77, θ5 = −0.06
B1H      (2 0 1)(1 0 1)(1 0 1)   φ1 = 1.59, φ2 = −0.62, Φ1 = 0.93, Ω1 = 0.82, θ1 = 0.36, Θ1 = 0.72, Ψ1 = 0.44
B1D      (1 1 2)(0 1 1)          φ1 = 0.41, θ1 = 0.45, θ2 = 0.36, Θ1 = 0.53

Table 4: The selected neural forecasting models

Series   Hidden nodes (H)   Input time lags
A5M      6                  {1,2,3,5,6,7,11,12,13}
A1H      8                  {1,2,3,24,25,26}
A1D      0                  {1,7,8}
B5M      0                  {1,2,3,5,6,7,287,288,289}
B1H      0                  {1,2,3,24,25,26,168,167,169}
B1D      0                  {1,7,8}

As expected, the naive benchmark reveals a constant performance at all lead times for the 5 min series and it was greatly outperformed by the other forecasting approaches. Indeed, the remaining three methods obtain quite similar and very good forecasts (MAPE values within the range 1.4–3%) for a 5 min lead. As the horizon is increased, the results decay slowly and in a linear fashion, although the Holt-Winters method presents a higher slope for both ISPs. At this time scale, the best approach is given by the NNE (Table 5).

Turning to the hourly scale, the naive method is still the worst method. As before, the other methods present the lowest errors for the 1-ahead forecasts. However, the error curves are not linear and after a given horizon, the error decreases, in a behavior that may be explained by the seasonal effects (Figure 4). The differences between the methods are higher for the first provider (A) than the second one. Nevertheless, in both cases the ARIMA and NNE outperform the Holt-Winters method. Overall, the neural approach is the best model with a 3.5% global difference to ARIMA in dataset A1H and a 0.4% improvement in the second series (B1H). The higher relative NNE performance for the A ISP may be explained by the presence of non-linear effects (as suggested in Table 4).

The analysis of the daily results shows a different behavior. The naive approach is one of the best options for the A1D data. This effect also occurs for series B1D, although only after a lead time of $h \geq 3$ for NNE, $h \geq 4$ for ARIMA and $h \geq 6$ for Holt-Winters. In both series, the best choice is the Holt-Winters method, which is equivalent to the naive method for the A series. These results are not surprising, since the Holt-Winters can be quite accurate even when few historic values are present (Makridakis et al., 1998). It should be noticed that the training data for A1D contains only 34 elements, while B1D contains 46. In contrast, NNs tend to give bad results when fewer than 50 observations are used (Makridakis, 1982).

For demonstrative purposes, Figure 5 presents 100 forecasts given by the NNE method for the series A1H and horizons of 1 and 24. The figure shows a good fit by the forecasts, which follow the series. Another relevant issue is related to the computational complexity. With a
  • 10. Table 5: Comparison between the forecasting methods (MAPEh values, in percentage, bold denotesbest values)Series Horizon Naı¨ve Holt- ARIMA NNE (h) WintersA5M 1 34.79 2.98 2.95 2.91 Æ 0.00* 24 34.83 21.65 18.08 16.30 Æ 0.21* h 34.80 11.98 10.68 9.59 Æ 0.08*B5M 1 20.10 1.44 1.74 1.43 Æ 0.01 24 19.99 14.36 11.32 10.92 Æ 0.24* h 20.05 7.65 6.60 6.34 Æ 0.11*A1H 1 65.19 12.96 7.37 5.23 Æ 0.03* 24 65.89 33.95 28.18 25.11 Æ 0.59* h 65.67 50.60 26.96 23.48 Æ 0.49*B1H 1 34.82 3.30 3.13 3.25 Æ 0.01 24 35.54 17.31 15.15 12.20 Æ 0.07* h 35.18 13.69 12.69 12.26 Æ 0.03*A1D 1 6.77 6.77 8.49 8.76 Æ 0.00 7 6.25 6.25 7.23 7.99 Æ 0.00 h 6.34 6.34 8.12 8.48 Æ 0.00B1D 1 20.81 7.00 9.79 12.99 Æ 0.01 7 13.65 18.38 21.11 31.04 Æ 0.01 h 17.62 13.43 18.18 24.89 Æ 0.01*Statistically significant when compared with other methods.Pentium IV 1.6 GHz processor, the NNE train- performance. As shown in the previous section,ing (including five runs of the RPROP algorithm) and also argued in (Taylor et al., 2006), theand testing for this series required only 41 s. In ARIMA methodology is impractical for on-linethis case, the computational demand for Holt- forecasting systems because it requires moreWinters increases around a factor of three, since computation. Although the search space forthe 0.05 grid-search required 137 s. For the dou- NNE is high (i.e. selecting the best neuralble seasonal series, the highest effort is given by architecture and set of time lags), the heuristicsthe ARIMA model, where the X-12-ARIMA proposed here for feature=model selection re-estimation took more than 2 h of processing time. duce substantially the computational effort and are easy to implement, while still providing competitive forecasts. Hence, the NNE is the5. Conclusions recommended approach, since it can be used inIn this article, three time series methods were real-time and this is crucial for dynamic re-presented to forecast the amount of traffic in source allocation. 
At the daily scale, the Holt-TCP=IP based networks. A neural network Winters provided the best forecasts since ourensemble (NNE) was developed and the both datasets contained few observations. However,the Holt-Winters and the ARIMA methods in a on-line setting, an ISP could easily storewere adapted. Recent real-world data collected hundreds of daily aggregated data. Thus, wefrom two large Internet source providers (ISPs) believe that the proposed NNE would also leadwas analysed using different ahead predictions to accurate forecasts in such scenario.and time scales (e.g. every 5 min, hour and day). The experimental results reveal promising A comparison among the time series methods performances. Only a 1–3% error was obtainedshows that both ARIMA and NNE produce the for the 5 min forecasts. This value increasedlowest errors for the 5 min and hourly data, with from 11% to 17% when the forecasts werethe latter method presenting the best overall issued 2 h in advance. For the short-term152 Expert Systems May 2012, Vol. 29, No. 210 Expert Systems, 2010 Blackwell Publishing Ltd c
  • 11. Figure 4: The forecasting error results (MAPE) plotted against the lead time (h). [Six panels: A5M and B5M (lead time in steps of 5 minutes), A1H and B1H (lead time in hours) and two daily-scale panels (lead time in days). The vertical axis of each panel is the MAPE and the curves are labelled Naive, HW, ARIMA and NNE.]
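Figure 4 reports one MAPE value per lead time for each method. The sketch below shows one way such a curve could be computed with a rolling forecast origin. It is only an illustration under stated assumptions: the function names, the toy `traffic` series and the last-value `naive_forecast` baseline are hypothetical and are not taken from the article, whose Naive benchmark and test-set protocol may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error (in %) over paired values.

    Assumes every actual value is non-zero (true for traffic volumes).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def naive_forecast(history, h):
    """Hypothetical baseline: repeat the last observed value for any lead time h."""
    return history[-1]


def mape_by_lead_time(series, n_test, max_h, forecaster=naive_forecast):
    """Return {h: MAPE} for h = 1..max_h over the last n_test observations.

    For each test index t and lead time h, the forecaster only sees the
    observations up to index t - h (rolling forecast origin).
    """
    start = len(series) - n_test
    result = {}
    for h in range(1, max_h + 1):
        actual, predicted = [], []
        for t in range(start, len(series)):
            origin = t - h  # last observation known when the forecast is issued
            if origin < 0:
                continue  # not enough history for this lead time
            actual.append(series[t])
            predicted.append(forecaster(series[: origin + 1], h))
        result[h] = mape(actual, predicted)
    return result


# Toy series (made-up numbers, not the article's data).
traffic = [100.0, 110.0, 121.0, 110.0, 100.0, 110.0, 121.0, 110.0]
errors = mape_by_lead_time(traffic, n_test=4, max_h=2)
for h, value in errors.items():
    print(f"h={h}: MAPE={value:.2f}%")
```

Any other forecaster with the same `(history, h)` signature can be passed in place of `naive_forecast`, which makes it straightforward to compare several methods on the same test window, as done in Table 5 and Figure 4.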
  • 12. predictions, the error ranges from 3–5% (1 h in advance) to 13–22% (24 h lookahead). Finally, the daily forecasts gave rise to error rates of 7% (1 day horizon) and 6–13% (1 week lookahead). Moreover, since this work was designed assuming a passive monitoring system, no extra traffic is required in the network. Hence, the recommended approach opens room for the development of better traffic engineering tools and methods to detect anomalies in the traffic patterns.

In the future, similar methods will be applied to forecast traffic demands associated with specific Internet applications, since this might benefit management operations performed by ISPs, such as traffic prioritization. Another interesting possibility would be the exploration of similar forecasting approaches in other domains (e.g. electricity demand or road traffic).

Figure 5: Example of the neural forecasts for series A1H and lead times of h = 1 and h = 24. [Plot of traffic (x 10^12, bits) against time (hours), with curves labelled A1H, H=1 and H=24.]

Acknowledgements

This work is supported by the FCT (Portuguese science foundation) project PTDC/EIA/64541/2006. We would also like to thank Steve Williams from UKERNA for providing us with part of the data used in this work.

References

ALARCON-AQUINO, V. and J. BARRIA (2006) Multiresolution FIR neural-network-based learning algorithm applied to network traffic prediction, IEEE Transactions on Systems, Man and Cybernetics – Part C, 36, 208–220.
BABIARZ, R. and J. BEDO (2006) Internet traffic mid-term forecasting: a pragmatic approach using statistical analysis tools, Lecture Notes on Computer Science, 3976, 111–121.
BOX, G. and G. JENKINS (1976) Time Series Analysis: Forecasting and Control, San Francisco, CA, USA: Holden Day.
CORTEZ, P., M. ROCHA, J. MACHADO and J. NEVES (1995) A neural network based forecasting system, In Proceedings of IEEE ICNN’95, Vol. 5, Perth, Australia, pp. 2689–2693.
CORTEZ, P., M. ROCHA and J. NEVES (2005) Time series forecasting by evolutionary neural networks, Chapter III: Artificial Neural Networks in Real-Life Applications, Hershey, PA, USA: Idea Group Publishing, pp. 47–70.
DIETTERICH, T. (2000) Ensemble methods in machine learning, in J. Kittler and F. Roli (eds), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Berlin: Springer, 1–15.
DING, X., S. CANU and T. DENOEUX (1995) Neural network based models for forecasting, In Proceedings of Applied Decision Technologies Conference (ADT’95), Uxbridge, UK, pp. 243–252.
FLEXER, A. (1996) Statistical evaluation of neural networks experiments: minimum requirements and current practice, In Proceedings of the 13th European Meeting on Cybernetics and Systems Research, Vol. 2, Vienna, Austria, pp. 1005–1008.
HANSEGAWA, M., G. WU and M. MIZUNO (2001) Applications of nonlinear prediction methods to the internet traffic, In Proceedings of IEEE International Symposium on Circuits and Systems, Vol. 3, Sydney, Australia, pp. 169–172.
HASTIE, T., R. TIBSHIRANI and J. FRIEDMAN (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, New York, USA: Springer-Verlag.
JIANG, J. and S. PAPAVASSILIOU (2004) Detecting network attacks in the internet via statistical network traffic normality prediction, Journal of Network and Systems Management, 12, 51–72.
KRISHNAMURTHY, B., S. SEN, Y. ZHANG and Y. CHEN (2003) Sketch-based change detection: methods, evaluation, and applications, In Proceedings of Internet Measurement Conference (IMC’03), Miami, USA.
LAPEDES, A. and R. FARBER (1987) Non-Linear Signal Processing Using Neural Networks: Prediction and System Modelling, Technical Report LA-UR-87-2662, Los Alamos National Laboratory, USA.
MAKRIDAKIS, S. (1982) The accuracy of extrapolation (time series) methods: results of a forecasting competition, Journal of Forecasting, 1, 111–153.
MAKRIDAKIS, S., S. WHEELWRIGHT and R. HYNDMAN (1998) Forecasting: Methods and Applications, New York, USA: John Wiley & Sons.
MALKI, H., N. KARAYIANNIS, B. NICOLAOS and M. BALASUBRAMANIAN (2004) Short-term electric
  • 13. power load forecasting using feedforward neural networks, Expert Systems, 21, 157–167.
PAPAGIANNAKI, K., N. TAFT, Z. ZHANG and C. DIOT (2005) Long-term forecasting of internet backbone traffic, IEEE Transactions on Neural Networks, 16, 1110–1124.
RIEDMILLER, M. (1994) Advanced supervised learning in multilayer perceptrons – from backpropagation to adaptive learning techniques, International Journal of Computer Standards and Interfaces, 16, 265–278.
SANG, A. and S. LI (2002) A predictability analysis of network traffic, Computer Networks, 39, 329–345.
STALLINGS, W. (1999) SNMP, SNMPv2, SNMPv3 and RMON 1 and 2, Reading, MA: Addison-Wesley.
TAYLOR, J. (2003) Short-term electricity demand forecasting using double seasonal exponential smoothing, Journal of Operational Research Society, 54, 799–805.
TAYLOR, J., L. MENEZES and P. MCSHARRY (2006) A comparison of univariate methods for forecasting electricity demand up to a day ahead, International Journal of Forecasting, 21, 1–16.
Time-Series-Staff (2002) X-12-ARIMA reference manual. Available at http://www.census.gov/srd/www/x12a/ (accessed December 2008), U. S. Census Bureau, Washington, USA, July.
TONG, H., C. LI and J. HE (2004) Boosting feed-forward neural network for internet traffic prediction, In Proceedings of the IEEE 3rd International Conference on Machine Learning and Cybernetics, Shanghai, China, pp. 3129–3134.
WANG, C., X. ZHANG, H. YAN and L. ZHENG (2008) An internet traffic forecasting model adopting radical based on function neural network optimized by genetic algorithm, In Proceedings of IEEE Workshop on Knowledge Discovery and Data Mining (WKDD08), Adelaide, Australia, pp. 367–370.

The authors

Paulo Cortez

Paulo Cortez received an MSc degree (1998) and a PhD (2002), both in Computer Science, University of Minho, where he has worked since 2001 as an Assistant Professor in the Department of Information Systems. He is also a researcher at the Algoritmi centre, with interests in the fields of business intelligence, data mining, neural networks, evolutionary computation and forecasting. Currently, he is associate editor of the Neural Processing Letters journal and he has participated in seven R&D projects (principal investigator in two). He is co-author of more than sixty publications in international peer reviewed journals and conferences. Web-page: http://www3.dsi.uminho.pt/pcortez

Miguel Rio

Miguel Rio received the PhD from the University of Kent at Canterbury, where he worked on Multicast distribution with Quality of Service. He has been the Principal Investigator of several UK and EU funded research projects in areas of Telecommunications and Future Internet. Currently, he is Senior Lecturer in the Department of Electrical and Electronic Engineering, University College London. His research interests include peer-to-peer real-time delivery, routing, congestion control, and network traffic analysis. Web-page: http://www.ee.ucl.ac.uk/~mrio

Miguel Rocha

Miguel Rocha obtained an MSc degree (1998) and a PhD (2004), both in Computer Science, University of Minho. Since 1998, he has been an Assistant Professor in the Artificial Intelligence group at the Department of Informatics at the same institution. His research interests include bioinformatics and systems biology, evolutionary computation and neural networks, where he coordinates funded projects and has a number of refereed publications in journals and international conferences (see http://www.di.uminho.pt/~mpr).

Pedro Sousa

Pedro Sousa received an MSc degree (1997) and a PhD (2005), both in Computer Science, University of Minho. In 1996, he joined the Computer Communications Group of the Department of Informatics at University of Minho, where he is an assistant professor and performs his research activities within the CCTC R&D Center. His main research interests include computer networks technologies and protocols, network simulation, TCP/IP protocols, quality of service, traffic scheduling and mobile networks. Web-page: http://marco.uminho.pt/~pns