Web Traffic Time Series Forecasting
SUBMITTED BY –
Korivi Sravan Kumar
Introduction:
The data set contains the daily view counts of Wikipedia articles. Each row is an individual page and each column holds one day of views for that page.
The data set contains about 145k pages (145,063 rows). Training data set 1 (train_1.csv) contains daily views from July 1st, 2015 to December 31st, 2016, a total of 550 days.
The forecast models are tested on data from January 1st, 2017 to March 1st, 2017, which is 60 days including March 1st, 2017.
Training data set 2 (train_2.csv) extends the data up to September 1st, 2017.
The test data set is created from training data set 2 so that forecast accuracy can be evaluated.
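The day counts quoted above can be checked with a few lines of base R date arithmetic. This is only a sanity check on the stated date ranges; the variable names are illustrative and not part of the original analysis.

```r
# Count the days in each range, both endpoints included.
train_start <- as.Date("2015-07-01")
train_end   <- as.Date("2016-12-31")
test_start  <- as.Date("2017-01-01")
test_end    <- as.Date("2017-03-01")

n_train_days <- as.integer(train_end - train_start) + 1L
n_test_days  <- as.integer(test_end - test_start) + 1L

print(c(train = n_train_days, test = n_test_days))
```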
Importing libraries:
All the libraries needed for data manipulation, time series handling and forecasting are imported.
library(forecast) #working with time series
library(fpp2) #working with time series
library(dplyr) # data manipulation
library(tidyverse) #data manipulation
library(lubridate) # easily work with dates and times
library(zoo) # working with time series data
Data Input:
setwd("D:/Assignment-2/") #Set the working directory
train <- read.csv("train_1.csv") #Read train_1 csv file
dim(train) # Rows = 145063; Columns = 551
rows_count = nrow(train) #No. of rows
cols_count = ncol(train) #No. of columns
train2 <- read.csv("train_2.csv") #Read train_2 csv file
dim(train2)
Creation of training and test data sets:
The data is split into training and test data using the train_1 and train_2 data sets.
The columns of train_2 from January 1st, 2017 to March 1st, 2017 (inclusive) form the test set.
test <- train2[, (cols_count+1):(cols_count+60)] # columns 552 to 611: 551 + 60 (days) = 611
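Selecting the test columns by position assumes that both files keep the same column order. A minimal sketch of an alternative, assuming read.csv() names the date columns in the X2015.07.01 style that the later substr() call relies on, is to parse the column names into dates and select by date range. The toy data frame and the helper name select_date_cols are hypothetical.

```r
# Hypothetical helper: keep the columns whose names parse to dates in [from, to].
select_date_cols <- function(df, from, to) {
  col_dates <- as.Date(sub("^X", "", names(df)), format = "%Y.%m.%d")
  keep <- !is.na(col_dates) &
    col_dates >= as.Date(from) & col_dates <= as.Date(to)
  df[, keep, drop = FALSE]
}

# Toy stand-in for train_2.csv: one Page column plus four daily columns.
toy <- data.frame(
  Page = c("page_a", "page_b"),
  X2016.12.30 = c(10, 3),
  X2016.12.31 = c(12, 4),
  X2017.01.01 = c(11, 5),
  X2017.01.02 = c(15, 2)
)

toy_test <- select_date_cols(toy, "2017-01-01", "2017-03-01")
print(names(toy_test))
```

The Page column does not parse as a date, so it is dropped automatically.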
After the data has been split into training and test sets, the daily views of each page need to be converted into a time series object for forecasting.
To make the code easier to follow, we selected a random row using sample() and use row number 70772 to explain the conversion to time series data, the application of the different forecasting models and the methodology for evaluating them.
In practice, all of the code below is run in a loop to produce a forecast for each page of the Kaggle 'Web Traffic Time Series Forecasting' data. The full loop is provided at the end of the document.
Converting to time series
trainsep = train[70772,] # training row of the selected page
testsep = test[70772,] # test row of the same page
row_total = sum(trainsep[, 2:cols_count]) # NA if the page has any missing daily value
if(!is.na(row_total)){
  f = t(trainsep[,-1]) # drop the Page column and transpose to one row per day
  f_test = t(testsep)
  # To convert X(yyyy.mm.dd) into date(yyyy.mm.dd)
  f = data.frame(f,substr(row.names(f),2,11))
  colnames(f) = c("visits","dat")
  f_test = data.frame(f_test,substr(row.names(f_test),2,11))
  colnames(f_test) = c("visits","dat")
  #---------------------Rest of the code is in the if condition------------------------
}
f.ts = ts(f$visits, frequency = 7) # time series object with a weekly cycle (7 observations per period)
f.ts = tsclean(f.ts) # identify and replace outliers and missing values in the time series
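To illustrate what frequency = 7 means for a daily series, here is a small sketch on made-up numbers. Each cycle of the ts object is one week, cycle() gives the position of each day within its week, and the time index counts weeks rather than calendar dates.

```r
# Two weeks of made-up daily visits (illustrative values only).
visits <- c(120, 130, 125, 140, 150, 90, 80,
            118, 135, 128, 138, 155, 95, 85)

x <- ts(visits, frequency = 7)       # 7 observations per seasonal cycle

day_in_week <- as.integer(cycle(x))  # position 1..7 within each week
week_index  <- as.numeric(time(x))   # starts at 1 and advances by 1/7 per day

print(day_in_week)
print(round(week_index, 3))

# Average visits by position in the week: a quick look at weekly seasonality.
print(tapply(visits, day_in_week, mean))
```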
Exploratory data analysis:
autoplot(f.ts)
gglagplot(f.ts)
ggAcf(f.ts)
A Ljung-Box test is performed to check whether the time series is white noise. As the p-value is below 0.05,
the null hypothesis of no autocorrelation is rejected, so the time series is not white noise.
> Box.test(f.ts, lag = 10, fitdf = 0, type = "Lj")
Box-Ljung test
data: f.ts
X-squared = 5260.9, df = 10, p-value < 2.2e-16
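As a rough illustration of what Box.test() reports, the Ljung-Box statistic can be computed by hand from the sample autocorrelations r_k: Q = n(n+2) * sum over k = 1..h of r_k^2 / (n - k), compared against a chi-squared distribution with h - fitdf degrees of freedom. The sketch runs on a simulated random walk, not on the Wikipedia data.

```r
set.seed(1)
x <- cumsum(rnorm(200))  # simulated random walk: strongly autocorrelated

h <- 10
n <- length(x)
r <- acf(x, lag.max = h, plot = FALSE)$acf[-1]  # r_1 ... r_h (drop lag 0)

q_manual <- n * (n + 2) * sum(r^2 / (n - seq_len(h)))
p_manual <- 1 - pchisq(q_manual, df = h)        # fitdf = 0, so df = h

bt <- Box.test(x, lag = h, type = "Ljung-Box", fitdf = 0)

print(c(manual = q_manual, box_test = unname(bt$statistic)))
print(c(manual = p_manual, box_test = bt$p.value))
```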
Forecasting models:
The following models are fitted to the data to forecast the next 60 days: naïve forecast, seasonal naïve forecast, moving average,
simple exponential smoothing, Holt's smoothing and Holt-Winters smoothing (fitted with ets()).
A mean forecast (meanf) is also included as a baseline in the evaluation.
1. Naïve forecast:
The naïve forecast is applied to the training time series.
fcnaive_ts = naive(f.ts, 60)
summary(fcnaive_ts)
autoplot(fcnaive_ts)
checkresiduals(fcnaive_ts)
Output:
> summary(fcnaive_ts)
Forecast method: Naive method
Model Information:
Call: naive(y = f.ts, h = 60)
Residual sd: 100.2178
Error measures:
                   ME     RMSE      MAE        MPE     MAPE       MASE       ACF1
Training set 1.967213 100.2178 66.30965 -0.2189641 7.587369 0.03936731 -0.1744151
Forecasts:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2016.5233 1264 1135.5657 1392.434 1067.576726 1460.423
2016.5260 1264 1082.3665 1445.633 986.215542 1541.784
2016.5288 1264 1041.5453 1486.455 923.784910 1604.215
2016.5315 1264 1007.1314 1520.869 871.153452 1656.847
2016.5342 1264 976.8122 1551.188 824.784207 1703.216
2016.5370 1264 949.4016 1578.598 782.863205 1745.137
2016.5397 1264 924.1948 1603.805 744.312865 1783.687
2016.5425 1264 900.7330 1627.267 708.431084 1819.569
2016.5452 1264 878.6972 1649.303 674.730178 1853.270
2016.5479 1264 857.8552 1670.145 642.855069 1885.145
2016.5507 1264 838.0317 1689.968 612.537700 1915.462
2016.5534 1264 819.0906 1708.909 583.569819 1944.430
2016.5562 1264 800.9236 1727.076 555.785814 1972.214
2016.5589 1264 783.4429 1744.557 529.051406 1998.949
2016.5616 1264 766.5762 1761.424 503.255931 2024.744
2016.5644 1264 750.2629 1777.737 478.306904 2049.693
2016.5671 1264 734.4519 1793.548 454.126094 2073.874
2016.5699 1264 719.0995 1808.900 430.646626 2097.353
2016.5726 1264 704.1680 1823.832 407.810799 2120.189
2016.5753 1264 689.6245 1838.376 385.568414 2142.432
2016.5781 1264 675.4402 1852.560 363.875479 2164.125
2016.5808 1264 661.5899 1866.410 342.693180 2185.307
2016.5836 1264 648.0509 1879.949 321.987071 2206.013
2016.5863 1264 634.8031 1893.197 301.726410 2226.274
2016.5890 1264 621.8286 1906.171 281.883630 2246.116
2016.5918 1264 609.1111 1918.889 262.433893 2265.566
2016.5945 1264 596.6359 1931.364 243.354729 2284.645
2016.5973 1264 584.3897 1943.610 224.625731 2303.374
2016.6000 1264 572.3603 1955.640 206.228298 2321.772
2016.6027 1264 560.5365 1967.463 188.145420 2339.855
2016.6055 1264 548.9082 1979.092 170.361495 2357.639
2016.6082 1264 537.4660 1990.534 152.862168 2375.138
2016.6110 1264 526.2013 2001.799 135.634197 2392.366
2016.6137 1264 515.1059 2012.894 118.665338 2409.335
2016.6164 1264 504.1726 2023.827 101.944240 2426.056
2016.6192 1264 493.3943 2034.606 85.460356 2442.540
2016.6219 1264 482.7648 2045.235 69.203869 2458.796
2016.6247 1264 472.2780 2055.722 53.165619 2474.834
2016.6274 1264 461.9282 2066.072 37.337047 2490.663
2016.6301 1264 451.7103 2076.290 21.710138 2506.290
2016.6329 1264 441.6194 2086.381 6.277374 2521.723
2016.6356 1264 431.6508 2096.349 -8.968306 2536.968
2016.6384 1264 421.8001 2106.200 -24.033544 2552.034
2016.6411 1264 412.0634 2115.937 -38.924600 2566.925
2016.6438 1264 402.4367 2125.563 -53.647379 2581.647
2016.6466 1264 392.9164 2135.084 -68.207460 2596.207
2016.6493 1264 383.4990 2144.501 -82.610122 2610.610
2016.6521 1264 374.1812 2153.819 -96.860361 2624.860
2016.6548 1264 364.9601 2163.040 -110.962918 2638.963
2016.6575 1264 355.8325 2172.167 -124.922290 2652.922
2016.6603 1264 346.7958 2181.204 -138.742753 2666.743
2016.6630 1264 337.8473 2190.153 -152.428372 2680.428
2016.6658 1264 328.9844 2199.016 -165.983019 2693.983
2016.6685 1264 320.2047 2207.795 -179.410384 2707.410
2016.6712 1264 311.5059 2216.494 -192.713987 2720.714
2016.6740 1264 302.8859 2225.114 -205.897188 2733.897
2016.6767 1264 294.3425 2233.658 -218.963198 2746.963
2016.6795 1264 285.8737 2242.126 -231.915087 2759.915
2016.6822 1264 277.4776 2250.522 -244.755796 2772.756
2016.6849 1264 269.1524 2258.848 -257.488138 2785.488
checkresiduals(fcnaive_ts)
Ljung-Box test
data: Residuals from Naive method
Q* = 655.3, df = 110, p-value < 2.2e-16
Model df: 0. Total lags used: 110
The residual check shows that significant autocorrelation remains at several lags, because the naïve method
does not capture the trend and seasonality in the data.
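The widening intervals in the forecast table follow from the naïve method itself: the point forecast is the last observed value, and under the usual assumption of uncorrelated, normally distributed residuals the h-step forecast standard deviation is the residual standard deviation times sqrt(h). The sketch below plugs in the last value (1264) and the residual sd (100.2178) printed above to reproduce the first rows of the table. The function name is illustrative.

```r
# Hypothetical helper: naive point forecast with normal-theory intervals.
naive_interval <- function(last_value, resid_sd, h, level = 0.80) {
  z <- qnorm(1 - (1 - level) / 2)   # about 1.28 for 80%, 1.96 for 95%
  half_width <- z * resid_sd * sqrt(seq_len(h))
  data.frame(h = seq_len(h),
             point = last_value,
             lo = last_value - half_width,
             hi = last_value + half_width)
}

pi80 <- naive_interval(1264, 100.2178, h = 3, level = 0.80)
pi95 <- naive_interval(1264, 100.2178, h = 3, level = 0.95)
print(pi80)
print(pi95)
```

The same growth in width explains why the lower 95% bound becomes negative at longer horizons, even though page views cannot be negative.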
2. Seasonal naïve forecast:
fcsnaive_ts = snaive(f.ts,60)
summary(fcsnaive_ts)
autoplot(fcsnaive_ts)
checkresiduals(fcsnaive_ts)
Output:
> summary(fcsnaive_ts)
Forecast method: Seasonal naive method
Model Information:
Call: snaive(y = f.ts, h = 60)
Residual sd: 1701.5666
Error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1684.384 1701.566 1684.384 87.29204 87.29204 1 0.7978843
Forecasts:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2016.5233 294 -1886.645 2474.645 -3041.009 3629.009
2016.5260 321 -1859.645 2501.645 -3014.009 3656.009
2016.5288 335 -1845.645 2515.645 -3000.009 3670.009
2016.5315 399 -1781.645 2579.645 -2936.009 3734.009
2016.5342 352 -1828.645 2532.645 -2983.009 3687.009
2016.5370 348 -1832.645 2528.645 -2987.009 3683.009
2016.5397 369 -1811.645 2549.645 -2966.009 3704.009
2016.5425 312 -1868.645 2492.645 -3023.009 3647.009
2016.5452 303 -1877.645 2483.645 -3032.009 3638.009
2016.5479 396 -1784.645 2576.645 -2939.009 3731.009
2016.5507 363 -1817.645 2543.645 -2972.009 3698.009
2016.5534 405 -1775.645 2585.645 -2930.009 3740.009
2016.5562 377 -1803.645 2557.645 -2958.009 3712.009
2016.5589 385 -1795.645 2565.645 -2950.009 3720.009
2016.5616 381 -1799.645 2561.645 -2954.009 3716.009
2016.5644 405 -1775.645 2585.645 -2930.009 3740.009
2016.5671 414 -1766.645 2594.645 -2921.009 3749.009
2016.5699 482 -1698.645 2662.645 -2853.009 3817.009
2016.5726 420 -1760.645 2600.645 -2915.009 3755.009
2016.5753 464 -1716.645 2644.645 -2871.009 3799.009
2016.5781 449 -1731.645 2629.645 -2886.009 3784.009
2016.5808 436 -1744.645 2616.645 -2899.009 3771.009
2016.5836 477 -1703.645 2657.645 -2858.009 3812.009
2016.5863 518 -1662.645 2698.645 -2817.009 3853.009
2016.5890 456 -1724.645 2636.645 -2879.009 3791.009
2016.5918 504 -1676.645 2684.645 -2831.009 3839.009
2016.5945 519 -1661.645 2699.645 -2816.009 3854.009
2016.5973 489 -1691.645 2669.645 -2846.009 3824.009
2016.6000 455 -1725.645 2635.645 -2880.009 3790.009
2016.6027 444 -1736.645 2624.645 -2891.009 3779.009
2016.6055 480 -1700.645 2660.645 -2855.009 3815.009
2016.6082 506 -1674.645 2686.645 -2829.009 3841.009
2016.6110 469 -1711.645 2649.645 -2866.009 3804.009
2016.6137 529 -1651.645 2709.645 -2806.009 3864.009
2016.6164 524 -1656.645 2704.645 -2811.009 3859.009
2016.6192 474 -1706.645 2654.645 -2861.009 3809.009
2016.6219 519 -1661.645 2699.645 -2816.009 3854.009
2016.6247 493 -1687.645 2673.645 -2842.009 3828.009
2016.6274 585 -1595.645 2765.645 -2750.009 3920.009
2016.6301 627 -1553.645 2807.645 -2708.009 3962.009
2016.6329 562 -1618.645 2742.645 -2773.009 3897.009
2016.6356 590 -1590.645 2770.645 -2745.009 3925.009
2016.6384 581 -1599.645 2761.645 -2754.009 3916.009
2016.6411 575 -1605.645 2755.645 -2760.009 3910.009
2016.6438 711 -1469.645 2891.645 -2624.009 4046.009
2016.6466 641 -1539.645 2821.645 -2694.009 3976.009
2016.6493 749 -1431.645 2929.645 -2586.009 4084.009
2016.6521 749 -1431.645 2929.645 -2586.009 4084.009
2016.6548 706 -1474.645 2886.645 -2629.009 4041.009
2016.6575 698 -1482.645 2878.645 -2637.009 4033.009
2016.6603 778 -1402.645 2958.645 -2557.009 4113.009
2016.6630 956 -1224.645 3136.645 -2379.009 4291.009
2016.6658 848 -1332.645 3028.645 -2487.009 4183.009
2016.6685 810 -1370.645 2990.645 -2525.009 4145.009
2016.6712 803 -1377.645 2983.645 -2532.009 4138.009
2016.6740 883 -1297.645 3063.645 -2452.009 4218.009
2016.6767 813 -1367.645 2993.645 -2522.009 4148.009
2016.6795 815 -1365.645 2995.645 -2520.009 4150.009
2016.6822 710 -1470.645 2890.645 -2625.009 4045.009
2016.6849 797 -1383.645 2977.645 -2538.009 4132.009
> checkresiduals(fcnaive_ts)
Ljung-Box test
data: Residuals from Naive method
Q* = 655.3, df = 110, p-value < 2.2e-16
Model df: 0. Total lags used: 110
The Ljung-Box test on the residuals gives a p-value below 0.05, which suggests that the
residuals are not white noise.
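For reference, the seasonal naïve rule can be sketched in base R: each forecast repeats the value observed one seasonal period (here m = 7 days) earlier. This hand-rolled function is only an illustration on made-up data and is not the snaive() implementation.

```r
# Hypothetical helper: repeat the last full season of length m for h steps.
snaive_point <- function(y, m, h) {
  stopifnot(length(y) >= m)
  last_season <- tail(y, m)
  rep(last_season, length.out = h)
}

y <- c(10, 12, 11, 13, 15, 7, 6,   # week 1 (made-up values)
       11, 13, 12, 14, 16, 8, 7)   # week 2
fc <- snaive_point(y, m = 7, h = 10)
print(fc)
```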
3. Moving average:
autoplot(f.ts, series = "Data") +
  autolayer(ma(f.ts, 7), series = "1 week MA") +
  autolayer(ma(f.ts, 31), series = "1 month MA") +
  autolayer(ma(f.ts, 91), series = "3 month MA") +
  autolayer(ma(f.ts, 183), series = "6 month MA") +
  xlab("Date") +
  ylab("visits")
4. Simple exponential smoothing:
fcses_ts <- ses(f.ts, alpha = .2, h = 60) # simple exponential smoothing with alpha fixed at 0.2
summary(fcses_ts)
autoplot(fcses_ts) #plot
checkresiduals(fcses_ts) #residuals to check whether it is white noise or not
Output:
> checkresiduals(fcses_ts)
Ljung-Box test
data: Residuals from Simple exponential smoothing
Q* = 908.14, df = 108, p-value < 2.2e-16
Model df: 2. Total lags used: 110
As the p-value of the Ljung-Box test is below 0.05, the residuals are not white noise. This is expected, because the data contains both trend
and seasonality, which simple exponential smoothing does not model.
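To make the role of alpha concrete, here is a minimal sketch of the simple exponential smoothing recursion, level_t = alpha * y_t + (1 - alpha) * level_(t-1), with alpha = 0.2 as in the call above. Starting the level at the first observation is an assumption made for the example; ses() can estimate the initial level instead. The helper name and data are made up.

```r
# Hypothetical helper: SES level recursion with a fixed alpha.
ses_level <- function(y, alpha, l0 = y[1]) {
  level <- numeric(length(y))
  prev <- l0
  for (t in seq_along(y)) {
    level[t] <- alpha * y[t] + (1 - alpha) * prev
    prev <- level[t]
  }
  level
}

y <- c(100, 120, 110, 130)   # made-up daily visits
lv <- ses_level(y, alpha = 0.2)
print(lv)

# Every future point forecast equals the last smoothed level (a flat line).
fc <- rep(tail(lv, 1), 3)
print(fc)
```

Because the forecast is flat, this method cannot follow a trend or a weekly pattern, which is consistent with the autocorrelated residuals reported above.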
5. Holt's smoothing:
fcholt_ts <- holt(f.ts, h = 60)
summary(fcholt_ts)
autoplot(fcholt_ts)
checkresiduals(fcholt_ts)
Output:
> checkresiduals(fcholt_ts)
Ljung-Box test
data: Residuals from Holt's method
Q* = 1002, df = 106, p-value < 2.2e-16
Model df: 4. Total lags used: 110
The beta (trend smoothing) parameter is then tuned by fitting Holt's method over a grid of beta values and keeping the value with the lowest RMSE on the test data:
# identify optimal beta parameter
beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
  fit <- holt(f.ts, beta = beta[i], h = 60)
  RMSE[i] <- accuracy(fit, f_test$visits)[2,2] # row 2 = test set, column 2 = RMSE
}
# convert to a tibble and identify the beta value with the minimum RMSE
beta.fit <- tibble(beta, RMSE)
beta.min <- dplyr::filter(beta.fit, RMSE == min(RMSE))
# plot RMSE vs. beta
ggplot(beta.fit, aes(beta, RMSE)) +
  geom_line() +
  geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
# refit Holt's method with the selected beta
fcholt_ts <- holt(f.ts, h = 60, beta = beta.min$beta[1])
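The grid search above can be sketched without the forecast package to show what is being tuned. The code below is a simplified Holt recursion with fixed alpha and beta and a naive initialisation, all of which are assumptions for illustration; it is not how holt() estimates its parameters. The series is made up, and the search scores each beta on a held-out slice.

```r
# Hypothetical helper: Holt's linear trend recursion with fixed parameters.
holt_point <- function(y, alpha, beta, h) {
  level <- y[1]
  trend <- y[2] - y[1]          # simple initialisation (an assumption)
  for (t in 2:length(y)) {
    prev_level <- level
    level <- alpha * y[t] + (1 - alpha) * (prev_level + trend)
    trend <- beta * (level - prev_level) + (1 - beta) * trend
  }
  level + seq_len(h) * trend    # h-step-ahead point forecasts
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up trending series, split into a fitting part and a hold-out part.
y_all  <- c(50, 54, 57, 63, 66, 70, 75, 79, 84, 88, 91, 97)
y_fit  <- y_all[1:9]
y_hold <- y_all[10:12]

beta_grid <- seq(0.05, 0.5, by = 0.05)
grid_rmse <- sapply(beta_grid, function(b) {
  rmse(y_hold, holt_point(y_fit, alpha = 0.5, beta = b, h = length(y_hold)))
})
best_beta <- beta_grid[which.min(grid_rmse)]

print(data.frame(beta = beta_grid, rmse = grid_rmse))
print(best_beta)
```

One design point worth keeping in mind: a parameter chosen by minimising RMSE on a data slice will look optimistic when it is scored on that same slice, so a separate validation slice is a common choice.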
6. Holt-Winters smoothing:
[Figure: decomposition of additive time series]
hw.ts <- ets(f.ts, model = "ZZZ") # "ZZZ" lets ets() select the error, trend and seasonal components
checkresiduals(hw.ts)
autoplot(hw.ts)
summary(hw.ts)
Output:
> summary(hw.ts)
ETS(M,N,M)
Call:
ets(y = f.ts, model = "ZZZ")
Smoothing parameters:
alpha = 0.6672
gamma = 0.0364
Initial states:
l = 194.5145
s = 1.1697 1.0074 0.9371 0.9015 0.9571 1.0013 1.0259
sigma: 0.1116
AIC AICc BIC
8362.877 8363.286 8405.977
Training set error measures:
                   ME     RMSE      MAE        MPE     MAPE      MASE        ACF1
Training set 2.652725 88.74605 61.03587 -0.1384452 7.216793 0.6028258 -0.01053619
The selected model, ETS(M,N,M), has multiplicative errors, no trend and multiplicative seasonality. Its residuals have a higher Ljung-Box p-value than those of the other models.
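A short worked check of the multiplicative seasonal states printed above: the seven weekly indices average to 1, so each index acts as a multiplier on the level. The values are copied from the summary output; the sketch does not rely on the order in which ets() prints the states.

```r
# Seasonal states and initial level as printed in the summary above.
s <- c(1.1697, 1.0074, 0.9371, 0.9015, 0.9571, 1.0013, 1.0259)
level0 <- 194.5145

print(mean(s))                      # multiplicative indices average to about 1
print(round(100 * (s - 1), 1))      # percent above or below the level
print(round(level0 * range(s), 1))  # implied lowest and highest day at the initial level
```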
Evaluating the different forecast models:
Every model is evaluated on the RMSE of its forecasts against the test data. Holt's method has the
lowest test RMSE (308.8983) and is therefore selected for forecasting.
> accuracy(fcnaive_ts, f_test$visits)
                     ME     RMSE       MAE        MPE      MAPE     MASE       ACF1
Training set   1.967213 102.6296  68.49265 -0.2271511  7.699251 1.000000 -0.1835412
Test set     283.950000 419.6527 302.65000 15.8649924 17.797103 4.418722         NA
> accuracy(fcsnaive_ts, f_test$visits)
                   ME     RMSE      MAE       MPE     MAPE     MASE      ACF1
Training set 16.93582 145.0159 101.2496 1.3114613 11.96902 1.478255 0.6341429
Test set     46.02771 315.4499 181.7056 0.2651809 11.04817 2.652921        NA
> accuracy(mean_fc, f_test$visits)
                       ME     RMSE      MAE        MPE      MAPE      MASE    ACF1
Training set 4.291307e-14 751.6030 694.2092 -119.25917 156.48239 10.135528 0.98933
Test set     4.602121e+02 554.3247 466.6034   27.59744  28.31075  6.812459      NA
> accuracy(fcses_ts,f_test$visits)
                    ME     RMSE       MAE       MPE      MAPE     MASE      ACF1
Training set 22.432887 128.3612  87.42038  1.732784  9.373469 1.276347 0.6310515
Test set     -3.173597 309.0159 188.65375 -3.246674 11.869201 2.754365        NA
> accuracy(fcholt_ts,f_test$visits)
                    ME     RMSE       MAE        MPE      MAPE      MASE       ACF1
Training set -3.993354  99.2416  66.82666 -1.6738377  7.629988 0.9756764 0.08924597
Test set     28.649399 308.8983 193.02659 -0.9831642 11.896523 2.8182087         NA
> accuracy(fcets_ts, f_test$visits)
                     ME      RMSE       MAE        MPE      MAPE      MASE        ACF1
Training set   2.652725  88.74605  61.03587 -0.1384452  7.216793 0.8911302 -0.01053619
Test set     114.850686 314.26532 173.62511  4.8900676 10.024239 2.5349451          NA
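The comparison can be summarised in a few lines of base R. The test-set RMSE values below are copied from the accuracy() output above; the rmse() helper and the toy check at the end are illustrative additions.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Test-set RMSE values copied from the accuracy() output above.
rmse_test <- c(naive = 419.6527, snaive = 315.4499, mean = 554.3247,
               ses = 309.0159, holt = 308.8983, ets = 314.26532)

ranking <- sort(rmse_test)
best <- names(ranking)[1]
print(ranking)
print(best)

# Toy check of the rmse() helper on made-up numbers.
print(rmse(c(3, 5, 7), c(2, 5, 9)))
```

The gap between Holt's method and simple exponential smoothing is only about 0.12 views, so the ranking of the top two models is close for this page.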
R code to run for 145 k pages automatically:
#Library
library(forecast) #working with time series
library(fpp2) #working with time series
library('dplyr') # data manipulation
library('tidyverse') #data manipulation
library(lubridate) # easily work with dates and times
library(zoo) # working with time series data
#train data
train <- read.csv("train_1.csv")
dim(train)
# head(train)
rows_count = nrow(train)
cols_count = ncol(train)
train2 <- read.csv("train_2.csv")
dim(train2)
#Creation of test data from training data set
test <- train2[, (cols_count+1):(cols_count+60)]
dim(test)
for(j in 1:nrow(train)){
trainsep = train[j,]
testsep = test[j,]
row_total = sum(trainsep[, 2:cols_count]) # NA if page j has any missing daily value
if(!is.na(row_total)){
#Matrix to store RMSE of training and test data set accuracy of forecasts
accur <- matrix(NA_real_, nrow = 6, ncol = 2)
#Data imputations
f = t(trainsep[,-1]) # drop the Page column and transpose to one row per day
f_test = t(testsep)
head(f_test)
f = data.frame(f,substr(row.names(f),2,11))
colnames(f) = c("visits","dat")
f_test = data.frame(f_test,substr(row.names(f_test),2,11))
colnames(f_test) = c("visits","dat")
head(f)
head(f_test)
#Creation of timeseries data after cleaning using ts and tsclean
f.ts = tsclean(ts(f$visits, frequency = 7))
head(f.ts, 45)
#Data Exploration
autoplot(f.ts)
gglagplot(f.ts)
acf(f.ts)
Box.test(f.ts, lag = 10, fitdf = 0, type = "Lj")
#Removing trend and to check for the seasonality
f.ts.dif = diff(f.ts)
gglagplot(f.ts.dif)
ggAcf(f.ts.dif)
autoplot(f.ts.dif)
f_test.dif <- diff(f_test$visits)
Box.test(f.ts.dif, lag = 10, fitdf = 0, type = "Lj")
ggAcf(f.ts)
#Naive test
fcnaive_ts = naive(f.ts, 60)
summary(fcnaive_ts)
autoplot(fcnaive_ts)
checkresiduals(fcnaive_ts)
act = accuracy(fcnaive_ts, f_test$visits)
accur[1,1] = act[2,2] #test RMSE accuracy
accur[1,2] = act[1,2] #train RMSE accuracy
#seasonal naive test
fcsnaive_ts = snaive(f.ts,60)
summary(fcsnaive_ts)
autoplot(fcsnaive_ts)
checkresiduals(fcsnaive_ts)
act = accuracy(fcsnaive_ts, f_test$visits)
accur[2,1] = act[2,2] #test RMSE accuracy
accur[2,2] = act[1,2] #train RMSE accuracy
#mean forecast
mean_fc <- meanf(f.ts, h = 60)
act = accuracy(mean_fc, f_test$visits)
accur[3,1] = act[2,2] #test RMSE accuracy
accur[3,2] = act[1,2] #train RMSE accuracy
#SES(Simple Exponential smoothing)
fcses_ts <- ses(f.ts, alpha = .2, h = 60)
summary(fcses_ts)
autoplot(fcses_ts)
checkresiduals(fcses_ts)
accuracy(fcses_ts,f_test$visits)
fces_ts1 <-ses(f.ts.dif, alpha = .2, h = 60)
autoplot(fces_ts1)
summary(fces_ts1)
autoplot(f.ts.dif)
checkresiduals(fces_ts1)
accuracy(fces_ts1,f_test.dif)
# identify the alpha value with the lowest test RMSE
alpha <- seq(.01, .99, by = .01)
RMSE <- NA
for(i in seq_along(alpha)) {
fit <- ses(f.ts, alpha = alpha[i], h = 60)
RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
}
alpha.fit <- tibble(alpha, RMSE)
alpha.min <- dplyr::filter(alpha.fit, RMSE == min(RMSE))
ggplot(alpha.fit, aes(alpha, RMSE)) +
geom_line() +
geom_point(data = alpha.min, aes(alpha, RMSE), size = 2, color = "blue")
fcses_ts <- ses(f.ts, alpha = alpha.min$alpha[1], h = 60)
autoplot(fcses_ts)
act = accuracy(fcses_ts,f_test$visits)
accur[4,1] = act[2,2] #test RMSE accuracy
accur[4,2] = act[1,2] #train RMSE accuracy
#Holt's method
fcholt_ts <- holt(f.ts, h = 60)
summary(fcholt_ts)
autoplot(fcholt_ts)
checkresiduals(fcholt_ts)
act = accuracy(fcholt_ts,f_test$visits)
accur[5,1] = act[2,2] #test RMSE accuracy
accur[5,2] = act[1,2] #train RMSE accuracy
# identify optimal beta parameter
beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
fit <- holt(f.ts, beta = beta[i], h = 60)
RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
}
# convert to a tibble and identify the beta value with the minimum RMSE
beta.fit <- tibble(beta, RMSE)
beta.min <- dplyr::filter(beta.fit, RMSE == min(RMSE))
# plot RMSE vs. beta
ggplot(beta.fit, aes(beta, RMSE)) +
geom_line() +
geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
# refit with the selected beta and overwrite the stored RMSE values
fcholt_ts <- holt(f.ts, h = 60, beta = beta.min$beta[1])
act = accuracy(fcholt_ts,f_test$visits)
accur[5,1] = act[2,2] #test RMSE accuracy
accur[5,2] = act[1,2] #train RMSE accuracy
autoplot(decompose(f.ts))
#HoltWinters seasonal model
hw.ts <- ets(f.ts, model = "ZZZ")
checkresiduals(hw.ts)
autoplot(hw.ts)
summary(hw.ts)
fcets_ts <- forecast(hw.ts, h = 60)
act= accuracy(fcets_ts, f_test$visits)
accur[6,1] = act[2,2] #test RMSE accuracy
accur[6,2] = act[1,2] #train RMSE accuracy
#Model evaluation using RMSE of test data
method = c("naive","snaive","mean", "ses","holts","aes") # "aes" labels the ets() model
accur1 = tibble(method = method, RMSE_TEST = accur[,1])
minimum <- dplyr::filter(accur1, RMSE_TEST == min(RMSE_TEST))
#Select the forecast of the model with the lowest test RMSE
if (minimum$method == "naive"){
fcnaive_ts
}else if(minimum$method == "snaive"){
fcsnaive_ts
}else if(minimum$method == "mean"){
mean_fc
}else if(minimum$method == "ses"){
fcses_ts
}else if(minimum$method == "holts"){
fcholt_ts
}else if(minimum$method == "aes"){
fcets_ts
}
}
}
Conclusion:
Each series has a different forecast depending on the trend, seasonality and error terms in its daily page visits. Some pages have no trend, some have both trend and seasonality, and some have seasonality but no trend. Data exploration was used to understand each time series: ACF and lag plots show the autocorrelation structure, and moving average plots show the smoothed behaviour of the data.
Different forecast models were applied to the time series: naïve, seasonal naïve, simple exponential smoothing, Holt's smoothing and Holt-Winters smoothing. For each model, residual plots were made to check whether the errors are centred around 0 and whether the ACF spikes lie within the significance bounds, and a Ljung-Box test was used to check whether the residuals are white noise (p-value > 0.05).
RMSE on the test data was used to evaluate the different models. The model with the lowest RMSE is selected to predict the next 60 days of page visits.

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Web traffic time series forecasting

  • 1. Web Traffic Time Series Forecasting SUBMITTED BY – Korivi Sravan Kumar
  • 2. Introduction:
    The data contains daily views of Wikipedia articles. Each row is an individual page and each column holds the views of that page for one day.
    The data set contains about 145k pages. Training data set 1 contains daily views from July 1st 2015 to Dec 31st 2016, a total of 550 days.
    The forecast models are tested on data from January 1st 2017 to March 1st 2017 inclusive, which is 60 days.
    Training data set 2 contains data up to 1st Sept 2017. The test data set has been created from training data set 2 for evaluating accuracy.
    Importing libraries:
    All the libraries are imported for data manipulation, time series and forecasting.
    library(forecast) # working with time series
    library(fpp2) # working with time series
    library(dplyr) # data manipulation
    library(tidyverse) # data manipulation
    library(lubridate) # easily work with dates and times
    library(zoo) # working with time series data
    Data Input:
    setwd("D:/Assignment-2/") # Set the working directory
    train <- read.csv("train_1.csv") # Read train_1 csv file
    dim(train) # Rows = 145063; Columns = 551
    rows_count = nrow(train) # No. of rows
    cols_count = ncol(train) # No. of columns
    train2 <- read.csv("train_2.csv") # Read train_2 csv file
    dim(train2)
    Creation of training and test data sets:
    The data is split into training and test data based on the Train 1 and Train 2 data sets. Columns from the Train 2 data set are selected from Jan 1st 2017 to March 1st 2017 inclusive.
    test <- train2[, (cols_count+1):(cols_count+60)] # columns 552 to 611: the 60 days after the training window
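The column-based split described on this slide can be tried on a small stand-in before loading the real files. This is only a sketch: the toy data frame, its 10 daily columns, and the names `toy_train2`, `n_train_cols` and `horizon` are assumptions made for illustration, not part of the original script.

```r
# Toy stand-in for train_2.csv: one Page column plus 10 daily columns.
# (The real files have 550 training days and a 60-day test window.)
days <- format(seq(as.Date("2015-07-01"), by = "day", length.out = 10), "X%Y.%m.%d")
toy_train2 <- data.frame(Page = c("a", "b"), matrix(1:20, nrow = 2))
names(toy_train2)[-1] <- days

n_train_cols <- 8  # Page + 7 training days (hypothetical split point)
horizon <- 2       # number of test days

# Training columns first, then the `horizon` columns that follow them
toy_train <- toy_train2[, 1:n_train_cols]
toy_test  <- toy_train2[, (n_train_cols + 1):(n_train_cols + horizon)]

names(toy_test)  # the dates immediately after the training window
```

The same index arithmetic, with `cols_count` in place of `n_train_cols` and 60 in place of `horizon`, gives the test window used in the slides.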
  • 3. After the data is split into training and test data sets, the data for each page needs to be converted into a time series object for forecasting.
    To make the code easier to follow, we selected a random row using sample() and use row number 70772 to explain the conversion to time series data, the application of the different forecasting models and the methodology for evaluating them.
    In practice, all the code below is run in a loop to get a forecast for each page of the Kaggle 'Web Time Series Forecasting' data. The full loop is provided at the end of the document.
    Converting to time series
    trainsep = train[70772,]
    testsep = test[70772,]
    sum = sum(trainsep[,2:cols_count]) # NA if the selected page has missing values
    if(!is.na(sum)){
      f = t(trainsep[,-1]) # drop the Page column; one row per day
      f_test = t(testsep)
      f = data.frame(f,substr(row.names(f),2,11)) # To convert X(yyyy.mm.dd) into date(yyyy.mm.dd)
      colnames(f) = c("visits","dat")
      f_test = data.frame(f_test,substr(row.names(f_test),2,11))
      colnames(f_test) = c("visits","dat")
      #---------------------Rest of the code is in the if condition------------------------
    }
    f.ts = ts(f$visits, frequency = 7) # to create a time series object with weekly seasonality, as in the full script
    f.ts = tsclean(f.ts) # To identify and replace outliers and missing values in a time series
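The date handling on this slide relies on read.csv() prefixing date column names with "X". A minimal sketch of that conversion follows, using made-up visit counts. The `as.Date()` step is an addition for illustration: the original code keeps the date as a character string.

```r
# Column names as read.csv() produces them: "X" followed by yyyy.mm.dd
col_names <- c("X2015.07.01", "X2015.07.02", "X2015.07.03")
visits <- c(120, 135, 128)  # made-up counts, for illustration only

f <- data.frame(
  visits = visits,
  # characters 2 to 11 hold the date; parse them into a Date
  dat = as.Date(substr(col_names, 2, 11), format = "%Y.%m.%d")
)

# Daily data with a weekly cycle: 7 observations per seasonal period
f.ts <- ts(f$visits, frequency = 7)
f
```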
  • 5. ggAcf(f.ts)
    A Box test is performed to check whether the time series is white noise. As the p-value is < 0.05, the time series is not white noise.
    > Box.test(f.ts, lag = 10, fitdf = 0, type = "Lj")
    Box-Ljung test
    data: f.ts
    X-squared = 5260.9, df = 10, p-value < 2.2e-16
    Forecasting models:
    Forecasting is applied to the data using the naïve forecast, seasonal naïve forecast, moving average, simple exponential smoothing, Holt's smoothing and Holt-Winters smoothing to forecast the next 60 days.
    1. Naïve forecast:
    The naïve forecast is applied to the training time series.
    fcnaive_ts = naive(f.ts, 60)
    summary(fcnaive_ts)
    autoplot(fcnaive_ts)
    checkresiduals(fcnaive_ts)
    Output:
    > summary(fcnaive_ts)
  • 6. Forecast method: Naive method
    Model Information:
    Call: naive(y = f.ts, h = 60)
    Residual sd: 100.2178
    Error measures:
                       ME     RMSE      MAE        MPE     MAPE       MASE       ACF1
    Training set 1.967213 100.2178 66.30965 -0.2189641 7.587369 0.03936731 -0.1744151
    Forecasts:
              Point Forecast     Lo 80    Hi 80       Lo 95    Hi 95
    2016.5233           1264 1135.5657 1392.434 1067.576726 1460.423
    2016.5260           1264 1082.3665 1445.633  986.215542 1541.784
    2016.5288           1264 1041.5453 1486.455  923.784910 1604.215
    2016.5315           1264 1007.1314 1520.869  871.153452 1656.847
    2016.5342           1264  976.8122 1551.188  824.784207 1703.216
    2016.5370           1264  949.4016 1578.598  782.863205 1745.137
    2016.5397           1264  924.1948 1603.805  744.312865 1783.687
    2016.5425           1264  900.7330 1627.267  708.431084 1819.569
    2016.5452           1264  878.6972 1649.303  674.730178 1853.270
    2016.5479           1264  857.8552 1670.145  642.855069 1885.145
    2016.5507           1264  838.0317 1689.968  612.537700 1915.462
    2016.5534           1264  819.0906 1708.909  583.569819 1944.430
    2016.5562           1264  800.9236 1727.076  555.785814 1972.214
    2016.5589           1264  783.4429 1744.557  529.051406 1998.949
    2016.5616           1264  766.5762 1761.424  503.255931 2024.744
    2016.5644           1264  750.2629 1777.737  478.306904 2049.693
    2016.5671           1264  734.4519 1793.548  454.126094 2073.874
    2016.5699           1264  719.0995 1808.900  430.646626 2097.353
    2016.5726           1264  704.1680 1823.832  407.810799 2120.189
    2016.5753           1264  689.6245 1838.376  385.568414 2142.432
    2016.5781           1264  675.4402 1852.560  363.875479 2164.125
    2016.5808           1264  661.5899 1866.410  342.693180 2185.307
    2016.5836           1264  648.0509 1879.949  321.987071 2206.013
    2016.5863           1264  634.8031 1893.197  301.726410 2226.274
    2016.5890           1264  621.8286 1906.171  281.883630 2246.116
    2016.5918           1264  609.1111 1918.889  262.433893 2265.566
    2016.5945           1264  596.6359 1931.364  243.354729 2284.645
    2016.5973           1264  584.3897 1943.610  224.625731 2303.374
    2016.6000           1264  572.3603 1955.640  206.228298 2321.772
    2016.6027           1264  560.5365 1967.463  188.145420 2339.855
    2016.6055           1264  548.9082 1979.092  170.361495 2357.639
    2016.6082           1264  537.4660 1990.534  152.862168 2375.138
    2016.6110           1264  526.2013 2001.799  135.634197 2392.366
    2016.6137           1264  515.1059 2012.894  118.665338 2409.335
    2016.6164           1264  504.1726 2023.827  101.944240 2426.056
    2016.6192           1264  493.3943 2034.606   85.460356 2442.540
    2016.6219           1264  482.7648 2045.235   69.203869 2458.796
    2016.6247           1264  472.2780 2055.722   53.165619 2474.834
    2016.6274           1264  461.9282 2066.072   37.337047 2490.663
    2016.6301           1264  451.7103 2076.290   21.710138 2506.290
    2016.6329           1264  441.6194 2086.381    6.277374 2521.723
    2016.6356           1264  431.6508 2096.349   -8.968306 2536.968
    2016.6384           1264  421.8001 2106.200  -24.033544 2552.034
    2016.6411           1264  412.0634 2115.937  -38.924600 2566.925
    2016.6438           1264  402.4367 2125.563  -53.647379 2581.647
    2016.6466           1264  392.9164 2135.084  -68.207460 2596.207
    2016.6493           1264  383.4990 2144.501  -82.610122 2610.610
    2016.6521           1264  374.1812 2153.819  -96.860361 2624.860
    2016.6548           1264  364.9601 2163.040 -110.962918 2638.963
    2016.6575           1264  355.8325 2172.167 -124.922290 2652.922
    2016.6603           1264  346.7958 2181.204 -138.742753 2666.743
    2016.6630           1264  337.8473 2190.153 -152.428372 2680.428
    2016.6658           1264  328.9844 2199.016 -165.983019 2693.983
    2016.6685           1264  320.2047 2207.795 -179.410384 2707.410
  • 7. 2016.6712           1264  311.5059 2216.494 -192.713987 2720.714
    2016.6740           1264  302.8859 2225.114 -205.897188 2733.897
    2016.6767           1264  294.3425 2233.658 -218.963198 2746.963
    2016.6795           1264  285.8737 2242.126 -231.915087 2759.915
    2016.6822           1264  277.4776 2250.522 -244.755796 2772.756
    2016.6849           1264  269.1524 2258.848 -257.488138 2785.488
    > checkresiduals(fcnaive_ts)
    Ljung-Box test
    data: Residuals from Naive method
    Q* = 655.3, df = 110, p-value < 2.2e-16
    Model df: 0. Total lags used: 110
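The widening intervals in the naive output appear consistent with the usual random-walk formula, point forecast ± z × residual sd × sqrt(h). The sketch below recomputes the first four rows from the two numbers printed in the output (point forecast 1264, residual sd 100.2178). The variable names are illustrative, and small differences can arise because the printed sd is rounded.

```r
point <- 1264         # point forecast (last observed value) from the output
resid_sd <- 100.2178  # residual sd reported by summary(fcnaive_ts)
h <- 1:4              # first four forecast horizons

# Normal quantiles for 80% and 95% two-sided intervals
z80 <- qnorm(0.90)
z95 <- qnorm(0.975)

lo80 <- point - z80 * resid_sd * sqrt(h)
hi80 <- point + z80 * resid_sd * sqrt(h)
lo95 <- point - z95 * resid_sd * sqrt(h)
hi95 <- point + z95 * resid_sd * sqrt(h)

round(cbind(h, lo80, hi80, lo95, hi95), 3)
```

Because the interval half-width grows with sqrt(h), the lower 95% bound eventually becomes negative, which is why the table shows negative page views at longer horizons.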
  • 8. The residual check shows that autocorrelation remains at several lags, because the data contains trend and seasonality.
    2. Seasonal naive forecast:
    fcsnaive_ts = snaive(f.ts,60)
    summary(fcsnaive_ts)
    autoplot(fcsnaive_ts)
    checkresiduals(fcsnaive_ts)
    Output:
    > summary(fcsnaive_ts)
    Forecast method: Seasonal naive method
    Model Information:
    Call: snaive(y = f.ts, h = 60)
    Residual sd: 1701.5666
    Error measures:
                       ME     RMSE      MAE      MPE     MAPE MASE      ACF1
    Training set 1684.384 1701.566 1684.384 87.29204 87.29204    1 0.7978843
    Forecasts:
              Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
    2016.5233            294 -1886.645 2474.645 -3041.009 3629.009
    2016.5260            321 -1859.645 2501.645 -3014.009 3656.009
    2016.5288            335 -1845.645 2515.645 -3000.009 3670.009
    2016.5315            399 -1781.645 2579.645 -2936.009 3734.009
    2016.5342            352 -1828.645 2532.645 -2983.009 3687.009
    2016.5370            348 -1832.645 2528.645 -2987.009 3683.009
    2016.5397            369 -1811.645 2549.645 -2966.009 3704.009
    2016.5425            312 -1868.645 2492.645 -3023.009 3647.009
    2016.5452            303 -1877.645 2483.645 -3032.009 3638.009
    2016.5479            396 -1784.645 2576.645 -2939.009 3731.009
    2016.5507            363 -1817.645 2543.645 -2972.009 3698.009
    2016.5534            405 -1775.645 2585.645 -2930.009 3740.009
    2016.5562            377 -1803.645 2557.645 -2958.009 3712.009
    2016.5589            385 -1795.645 2565.645 -2950.009 3720.009
    2016.5616            381 -1799.645 2561.645 -2954.009 3716.009
    2016.5644            405 -1775.645 2585.645 -2930.009 3740.009
    2016.5671            414 -1766.645 2594.645 -2921.009 3749.009
    2016.5699            482 -1698.645 2662.645 -2853.009 3817.009
    2016.5726            420 -1760.645 2600.645 -2915.009 3755.009
    2016.5753            464 -1716.645 2644.645 -2871.009 3799.009
    2016.5781            449 -1731.645 2629.645 -2886.009 3784.009
    2016.5808            436 -1744.645 2616.645 -2899.009 3771.009
    2016.5836            477 -1703.645 2657.645 -2858.009 3812.009
    2016.5863            518 -1662.645 2698.645 -2817.009 3853.009
    2016.5890            456 -1724.645 2636.645 -2879.009 3791.009
    2016.5918            504 -1676.645 2684.645 -2831.009 3839.009
    2016.5945            519 -1661.645 2699.645 -2816.009 3854.009
    2016.5973            489 -1691.645 2669.645 -2846.009 3824.009
    2016.6000            455 -1725.645 2635.645 -2880.009 3790.009
    2016.6027            444 -1736.645 2624.645 -2891.009 3779.009
    2016.6055            480 -1700.645 2660.645 -2855.009 3815.009
    2016.6082            506 -1674.645 2686.645 -2829.009 3841.009
    2016.6110            469 -1711.645 2649.645 -2866.009 3804.009
  • 9. 2016.6137            529 -1651.645 2709.645 -2806.009 3864.009
    2016.6164            524 -1656.645 2704.645 -2811.009 3859.009
    2016.6192            474 -1706.645 2654.645 -2861.009 3809.009
    2016.6219            519 -1661.645 2699.645 -2816.009 3854.009
    2016.6247            493 -1687.645 2673.645 -2842.009 3828.009
    2016.6274            585 -1595.645 2765.645 -2750.009 3920.009
    2016.6301            627 -1553.645 2807.645 -2708.009 3962.009
    2016.6329            562 -1618.645 2742.645 -2773.009 3897.009
    2016.6356            590 -1590.645 2770.645 -2745.009 3925.009
    2016.6384            581 -1599.645 2761.645 -2754.009 3916.009
    2016.6411            575 -1605.645 2755.645 -2760.009 3910.009
    2016.6438            711 -1469.645 2891.645 -2624.009 4046.009
    2016.6466            641 -1539.645 2821.645 -2694.009 3976.009
    2016.6493            749 -1431.645 2929.645 -2586.009 4084.009
    2016.6521            749 -1431.645 2929.645 -2586.009 4084.009
    2016.6548            706 -1474.645 2886.645 -2629.009 4041.009
    2016.6575            698 -1482.645 2878.645 -2637.009 4033.009
    2016.6603            778 -1402.645 2958.645 -2557.009 4113.009
    2016.6630            956 -1224.645 3136.645 -2379.009 4291.009
    2016.6658            848 -1332.645 3028.645 -2487.009 4183.009
    2016.6685            810 -1370.645 2990.645 -2525.009 4145.009
    2016.6712            803 -1377.645 2983.645 -2532.009 4138.009
    2016.6740            883 -1297.645 3063.645 -2452.009 4218.009
    2016.6767            813 -1367.645 2993.645 -2522.009 4148.009
    2016.6795            815 -1365.645 2995.645 -2520.009 4150.009
    2016.6822            710 -1470.645 2890.645 -2625.009 4045.009
    2016.6849            797 -1383.645 2977.645 -2538.009 4132.009
    > checkresiduals(fcnaive_ts)
    Ljung-Box test
    data: Residuals from Naive method
    Q* = 655.3, df = 110, p-value < 2.2e-16
    Model df: 0. Total lags used: 110
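The Ljung-Box test reported by checkresiduals() can also be run directly with Box.test() from base R. The sketch below contrasts a simulated white-noise series with a random walk. The simulated data and the names `noise` and `walk` are assumptions for illustration and are unrelated to the Wikipedia series.

```r
set.seed(1)
noise <- rnorm(200)          # white noise by construction
walk  <- cumsum(rnorm(200))  # random walk: strongly autocorrelated

bt_noise <- Box.test(noise, lag = 10, type = "Ljung-Box")
bt_walk  <- Box.test(walk,  lag = 10, type = "Ljung-Box")

# Null hypothesis: no autocorrelation up to the chosen lag.
bt_noise$p.value  # usually above 0.05 for white noise, though a single sample can fall below
bt_walk$p.value   # essentially zero, so the white-noise hypothesis is rejected
```

A small p-value on model residuals, as in the outputs on these slides, indicates that the model has left autocorrelation unexplained.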
  • 10. Upon checking the residuals and performing the Box test, the p-value is < 0.05, which suggests that the residuals are not white noise.
    3. Moving average:
    autoplot(f.ts, series = "Data") +
      autolayer(ma(f.ts, 7), series = "1 week MA") +
      autolayer(ma(f.ts, 31), series = "1 month MA") +
      autolayer(ma(f.ts, 91), series = "3 month MA") +
      autolayer(ma(f.ts, 183), series = "6 month MA") +
      xlab("Date") + ylab("visits")
    4. Simple exponential smoothing:
  • 11. fcses_ts <- ses(f.ts, alpha = .2, h = 60) # simple exponential smoothing
    summary(fcses_ts)
    autoplot(fcses_ts) # plot
    checkresiduals(fcses_ts) # check whether the residuals are white noise
    Output:
    > checkresiduals(fcses_ts)
    Ljung-Box test
    data: Residuals from Simple exponential smoothing
    Q* = 908.14, df = 108, p-value < 2.2e-16
    Model df: 2. Total lags used: 110
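To make the role of alpha concrete, the smoothing recursion behind simple exponential smoothing can be written by hand: each new level is alpha times the latest observation plus (1 - alpha) times the previous level. This is a minimal sketch with a hypothetical helper `ses_level` and toy data. It assumes the level starts at the first observation, whereas ses() from the forecast package estimates the initial level, so its results can differ.

```r
# Simple exponential smoothing by hand.
# Assumption: the level is initialised at the first observation.
ses_level <- function(y, alpha) {
  level <- y[1]
  for (i in seq_along(y)) {
    # weighted average of the new observation and the previous level
    level <- alpha * y[i] + (1 - alpha) * level
  }
  level  # SES forecasts are flat: this value is used for every horizon
}

ses_level(c(10, 20, 30), alpha = 0.5)  # 22.5
```

With alpha close to 1 the forecast follows the latest observation; with alpha close to 0 it changes slowly. The flat forecast is one reason SES leaves trend and seasonality in the residuals.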
  • 12. As the p-value of the Box test is < 0.05, the residuals are not white noise: the data contains both trend and seasonality, which simple exponential smoothing does not model.
    5. Holt's smoothing
    fcholt_ts <- holt(f.ts, h = 60)
    summary(fcholt_ts)
    autoplot(fcholt_ts)
    checkresiduals(fcholt_ts)
    > checkresiduals(fcholt_ts)
  • 13. Ljung-Box test
    data: Residuals from Holt's method
    Q* = 1002, df = 106, p-value < 2.2e-16
    Model df: 4. Total lags used: 110
    Tuning the beta parameter:
    # identify the optimal beta parameter
    beta <- seq(.0001, .5, by = .001)
    RMSE <- NA
    for(i in seq_along(beta)) {
      fit <- holt(f.ts, beta = beta[i], h = 60)
      RMSE[i] <- accuracy(fit, f_test$visits)[2,2] # test-set RMSE
    }
    # convert to a data frame and identify the beta value with the minimum RMSE
    beta.fit <- tibble(beta, RMSE)
    beta.min <- filter(beta.fit, RMSE == min(RMSE))
    # plot RMSE vs. beta
    ggplot(beta.fit, aes(beta, RMSE)) +
      geom_line() +
      geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
    # refit with the tuned beta (the argument is named beta; a misspelled name would be silently ignored)
    fcholt_ts <- holt(f.ts, h = 60, beta = beta.min$beta[1])
  • 14. 6. Holt-Winters smoothing:
    Decomposition of additive time series:
    hw.ts <- ets(f.ts, model = "ZZZ") # "ZZZ" lets ets() select the error, trend and seasonal components automatically
    checkresiduals(hw.ts)
    autoplot(hw.ts)
    summary(hw.ts)
  • 15. > summary(hw.ts)
    ETS(M,N,M)
    Call:
    ets(y = f.ts, model = "ZZZ")
    Smoothing parameters:
    alpha = 0.6672
    gamma = 0.0364
    Initial states:
    l = 194.5145
    s = 1.1697 1.0074 0.9371 0.9015 0.9571 1.0013 1.0259
    sigma: 0.1116
         AIC     AICc      BIC
    8362.877 8363.286 8405.977
    Training set error measures:
                       ME     RMSE      MAE        MPE     MAPE      MASE        ACF1
    Training set 2.652725 88.74605 61.03587 -0.1384452 7.216793 0.6028258 -0.01053619
    The Holt-Winters model selected by ets(), ETS(M,N,M), has residuals with a higher Box test p-value than the other models.
  • 16. Evaluating the different forecast models:
    Every model is evaluated by its RMSE on the test data. Holt's method has the lowest test RMSE (308.8983) and is therefore selected for forecasting.
    > accuracy(fcnaive_ts, f_test$visits)
                         ME     RMSE       MAE        MPE      MAPE     MASE       ACF1
    Training set   1.967213 102.6296  68.49265 -0.2271511  7.699251 1.000000 -0.1835412
    Test set     283.950000 419.6527 302.65000 15.8649924 17.797103 4.418722         NA
    > accuracy(fcsnaive_ts, f_test$visits)
                       ME     RMSE      MAE       MPE     MAPE     MASE      ACF1
    Training set 16.93582 145.0159 101.2496 1.3114613 11.96902 1.478255 0.6341429
    Test set     46.02771 315.4499 181.7056 0.2651809 11.04817 2.652921        NA
    > accuracy(mean_fc, f_test$visits)
                           ME     RMSE      MAE        MPE      MAPE      MASE    ACF1
    Training set 4.291307e-14 751.6030 694.2092 -119.25917 156.48239 10.135528 0.98933
    Test set     4.602121e+02 554.3247 466.6034   27.59744  28.31075  6.812459      NA
    > accuracy(fcses_ts,f_test$visits)
                        ME     RMSE       MAE       MPE      MAPE     MASE      ACF1
    Training set 22.432887 128.3612  87.42038  1.732784  9.373469 1.276347 0.6310515
    Test set     -3.173597 309.0159 188.65375 -3.246674 11.869201 2.754365        NA
    > accuracy(fcholt_ts,f_test$visits)
                        ME     RMSE       MAE        MPE      MAPE      MASE       ACF1
    Training set -3.993354  99.2416  66.82666 -1.6738377  7.629988 0.9756764 0.08924597
    Test set     28.649399 308.8983 193.02659 -0.9831642 11.896523 2.8182087         NA
    > accuracy(fcets_ts, f_test$visits)
                         ME      RMSE       MAE        MPE      MAPE      MASE        ACF1
    Training set   2.652725  88.74605  61.03587 -0.1384452  7.216793 0.8911302 -0.01053619
    Test set     114.850686 314.26532 173.62511  4.8900676 10.024239 2.5349451          NA
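The selection rule on this slide amounts to taking the model with the smallest test-set RMSE. A short sketch follows, with the RMSE values copied from the accuracy() output above. The `rmse` helper and the vector names are illustrative and not part of the original script.

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Test-set RMSE values copied from the accuracy() output on this slide
test_rmse <- c(naive = 419.6527, snaive = 315.4499, mean = 554.3247,
               ses = 309.0159, holt = 308.8983, ets = 314.26532)

best <- names(which.min(test_rmse))
best             # "holt"
sort(test_rmse)  # ranking from lowest to highest test RMSE
```

Holt's method and simple exponential smoothing differ by only about 0.12 in test RMSE for this page, so the ranking may well change from page to page.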
  • 17. R code to run for 145k pages automatically:
    # Libraries
    library(forecast) # working with time series
    library(fpp2) # working with time series
    library(dplyr) # data manipulation
    library(tidyverse) # data manipulation
    library(lubridate) # easily work with dates and times
    library(zoo) # working with time series data
    # Train data
    train <- read.csv("train_1.csv")
    dim(train)
    # head(train)
    rows_count = nrow(train)
    cols_count = ncol(train)
    train2 <- read.csv("train_2.csv")
    dim(train2)
    # Creation of test data from training data set 2
    test <- train2[, (cols_count+1):(cols_count+60)]
    dim(test)
    for(j in 1:nrow(train)){
      trainsep = train[j,]
      testsep = test[j,]
      sum = sum(trainsep[,2:cols_count]) # NA if page j has missing values
      if(!is.na(sum)){
        # Matrix to store the test and training RMSE of each forecast
        accur <- matrix(NA_real_, nrow = 6, ncol = 2)
        # Data preparation
        f = t(trainsep[,-1]) # drop the Page column; one row per day
        f_test = t(testsep)
        head(f_test)
        f = data.frame(f,substr(row.names(f),2,11))
        colnames(f) = c("visits","dat")
        f_test = data.frame(f_test,substr(row.names(f_test),2,11))
        colnames(f_test) = c("visits","dat")
        head(f)
        head(f_test)
        # Creation of time series data after cleaning, using ts and tsclean
        f.ts = tsclean(ts(f$visits,frequency = 7))
        head(f.ts, 45)
  • 18.     # Data exploration
        autoplot(f.ts)
        gglagplot(f.ts)
        acf(f.ts)
        Box.test(f.ts, lag = 10, fitdf = 0, type = "Lj")
        # Removing trend to check for seasonality
        f.ts.dif = diff(f.ts)
        gglagplot(f.ts.dif)
        ggAcf(f.ts.dif)
        autoplot(f.ts.dif)
        f_test.dif <- diff(f_test$visits)
        Box.test(f.ts.dif, lag = 10, fitdf = 0, type = "Lj")
        ggAcf(f.ts)
        # Naive forecast
        fcnaive_ts = naive(f.ts, 60)
        summary(fcnaive_ts)
        autoplot(fcnaive_ts)
        checkresiduals(fcnaive_ts)
        act = accuracy(fcnaive_ts, f_test$visits)
        accur[1,1] = act[2,2] # test RMSE
        accur[1,2] = act[1,2] # train RMSE
        # Seasonal naive forecast
        fcsnaive_ts = snaive(f.ts,60)
        summary(fcsnaive_ts)
        autoplot(fcsnaive_ts)
        checkresiduals(fcsnaive_ts)
        act = accuracy(fcsnaive_ts, f_test$visits)
        accur[2,1] = act[2,2] # test RMSE
        accur[2,2] = act[1,2] # train RMSE
        # Mean forecast
        mean_fc <- meanf(f.ts, h = 60)
        act = accuracy(mean_fc, f_test$visits)
        accur[3,1] = act[2,2] # test RMSE
        accur[3,2] = act[1,2] # train RMSE
        # SES (simple exponential smoothing)
        fcses_ts <- ses(f.ts, alpha = .2, h = 60)
        summary(fcses_ts)
        autoplot(fcses_ts)
        checkresiduals(fcses_ts)
        accuracy(fcses_ts,f_test$visits)
        # SES on the differenced series
        fces_ts1 <- ses(f.ts.dif, alpha = .2, h = 60)
        autoplot(fces_ts1)
        summary(fces_ts1)
        autoplot(f.ts.dif)
        checkresiduals(fces_ts1)
        accuracy(fces_ts1,f_test.dif)
        # identify the optimal alpha parameter
        alpha <- seq(.01, .99, by = .01)
        RMSE <- NA
        for(i in seq_along(alpha)) {
          fit <- ses(f.ts, alpha = alpha[i], h = 60)
          RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
        }
  • 19.     alpha.fit <- tibble(alpha, RMSE)
        alpha.min <- filter(alpha.fit, RMSE == min(RMSE))
        ggplot(alpha.fit, aes(alpha, RMSE)) +
          geom_line() +
          geom_point(data = alpha.min, aes(alpha, RMSE), size = 2, color = "blue")
        fcses_ts <- ses(f.ts, alpha = alpha.min$alpha[1], h = 60)
        autoplot(fcses_ts)
        act = accuracy(fcses_ts,f_test$visits)
        accur[4,1] = act[2,2] # test RMSE
        accur[4,2] = act[1,2] # train RMSE
        # Holt's smoothing
        fcholt_ts <- holt(f.ts, h = 60)
        summary(fcholt_ts)
        autoplot(fcholt_ts)
        checkresiduals(fcholt_ts)
        act = accuracy(fcholt_ts,f_test$visits)
        accur[5,1] = act[2,2] # test RMSE
        accur[5,2] = act[1,2] # train RMSE
        # identify the optimal beta parameter
        beta <- seq(.0001, .5, by = .001)
        RMSE <- NA
        for(i in seq_along(beta)) {
          fit <- holt(f.ts, beta = beta[i], h = 60)
          RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
        }
        # convert to a data frame and identify the beta value with the minimum RMSE
        beta.fit <- tibble(beta, RMSE)
        beta.min <- filter(beta.fit, RMSE == min(RMSE))
        # plot RMSE vs. beta
        ggplot(beta.fit, aes(beta, RMSE)) +
          geom_line() +
          geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
        # refit with the tuned beta (the argument is named beta; a misspelled name would be silently ignored)
        fcholt_ts <- holt(f.ts, h = 60, beta = beta.min$beta[1])
        act = accuracy(fcholt_ts,f_test$visits)
        accur[5,1] = act[2,2] # test RMSE
        accur[5,2] = act[1,2] # train RMSE
        autoplot(decompose(f.ts))
        # Holt-Winters seasonal model
        hw.ts <- ets(f.ts, model = "ZZZ")
        checkresiduals(hw.ts)
        autoplot(hw.ts)
        summary(hw.ts)
        fcets_ts <- forecast(hw.ts, h = 60)
        act = accuracy(fcets_ts, f_test$visits)
        accur[6,1] = act[2,2] # test RMSE
        accur[6,2] = act[1,2] # train RMSE
        # Model evaluation using RMSE of test data
        method = c("naive","snaive","mean", "ses","holts","aes")
        accur1 = tibble(method, as.vector(t(accur[,1])))
        colnames(accur1) = c("method","RMSE_TEST")
        minimum <- filter(accur1, RMSE_TEST == min(RMSE_TEST))
  • 20.     # Keep the forecast from the model with the lowest test RMSE
        best <- minimum$method[1] # first row in case of ties
        if (best == "naive"){
          best_fc <- fcnaive_ts
        }else if(best == "snaive"){
          best_fc <- fcsnaive_ts
        }else if(best == "mean"){
          best_fc <- mean_fc
        }else if(best == "ses"){
          best_fc <- fcses_ts
        }else if(best == "holts"){
          best_fc <- fcholt_ts
        }else if(best == "aes"){
          best_fc <- fcets_ts
        }
      }
    }
    Conclusion:
    Each series has a different forecast depending on the trend, seasonality and error terms in its daily page visits. Some pages have no trend, some have both trend and seasonality, and some have seasonality but no trend.
    Data exploration has been used to understand the time series. ACF plots show the autocorrelation at each lag, and moving-average plots show the smoothed series.
    Different forecast models are applied to the time series: naïve, seasonal naïve, simple exponential smoothing, Holt's smoothing and Holt-Winters smoothing.
    For each model, residual plots are made to check whether the errors are centered around 0 and whether the ACF values lie within the significance bounds, and the Box test is used to check whether the p-value is > 0.05.
    RMSE is used to evaluate the different models. The forecast model with the lowest RMSE is selected to predict the page visits for the next 60 days.