Web Traffic Time Series Forecasting
SUBMITTED BY – Korivi Sravan Kumar
Introduction:
The data contains the daily views of Wikipedia articles. The data set contains individual pages and the daily views of each page.
The total number of pages in the data set is 145k. Training data set 1 contains daily views from July 1st 2015 to Dec 31st 2016, a total of 550 days.
Testing of the forecast model is based on data from January 1st 2017 up to and including March 1st 2017, which is 60 days.
Training data set 2 contains data up to Sept 1st 2017.
A test data set has been created from training data set 2 for evaluating accuracy.
Importing libraries:
All the libraries needed for data manipulation, time series handling and forecasting are imported.
Data Input:
Creation of training and test data sets:
The data is converted into training and testing sets based on the train_1 and train_2 data sets. The test columns are selected from train_2, covering January 1st 2017 to March 1st 2017, including March 1st.
library(forecast) #working with time series
library(fpp2) #working with time series
library(dplyr) # data manipulation
library(tidyverse) #data manipulation
library(lubridate) # easily work with dates and times
library(zoo) # working with time series data
setwd("D:/Assignment-2/") #Set the working directory
train <- read.csv("train_1.csv") #Read train_1 csv file
dim(train) # Rows = 145063; Columns = 551
rows_count = nrow(train) #No. of rows
cols_count = ncol(train) #No. of columns
train2 <- read.csv("train_2.csv") #Read train_2 csv file
dim(train2)
test <- train2[, (cols_count+1): (cols_count+60)] # 551+60(days) =611
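As a cross-check (this by-name selection is our illustration, not part of the original workflow), the same 60 test columns can be picked from train2 by date, relying on read.csv turning a 2017-01-01 header into the name X2017.01.01:

# Illustrative alternative: select the test window by column name
test_dates <- format(seq(as.Date("2017-01-01"), as.Date("2017-03-01"),
                         by = "day"), "X%Y.%m.%d")
length(test_dates) # 60 days, Jan 1st to March 1st 2017 inclusive
test_by_name <- train2[, test_dates] # should match train2[, 552:611]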
After converting the data into train and test data sets, each page's data needs to be converted into a time series object for forecasting.
To make the code easier to follow, we selected a random row using sample() and use row number 70772 to explain the conversion to time series data, the application of the different forecasting models, and the evaluation methodology.
In practice, all of the code below is run in a loop to produce a forecast for each page, as in the Kaggle 'Web Traffic Time Series Forecasting' competition; the full code is provided at the end of the document.
Converting to time series
trainsep = train[70772,]
testsep = test[70772,]
sum = sum(trainsep[, 2:cols_count]) # NA if this page has any missing values
if(!is.na(sum)){
f = t(trainsep[, -1]) # drop the Page column, keep the 550 day columns
f_test = t(testsep)
f = data.frame(f,substr(row.names(f),2,11))
colnames(f) = c("visits","dat")
# To convert X(yyyy.mm.dd) into date(yyyy.mm.dd)
f_test = data.frame(f_test,substr(row.names(f_test),2,11))
colnames(f_test) = c("visits","dat")
#--------------------- The rest of the code also runs inside this if condition ------------------------
f.ts = ts(f$visits, frequency = 7) # create a time series object with weekly seasonality
f.ts = tsclean(f.ts) # identify and replace outliers and missing values in the series
}
ggAcf(f.ts)
A Box-Ljung test is performed to check whether the time series is white noise. As the p-value < 0.05, the time series is not white noise.
> Box.test(f.ts, lag = 10, fitdf = 0, type = "Lj")
Box-Ljung test
data: f.ts
X-squared = 5260.9, df = 10, p-value < 2.2e-16
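This cutoff can be wrapped in a small helper; the sketch below (the helper name is ours, not from the original code) applies the same 0.05 rule:

# Illustrative helper: TRUE when the Ljung-Box p-value exceeds 0.05,
# i.e. the series is indistinguishable from white noise at this lag
is_white_noise <- function(x, lag = 10) {
  Box.test(x, lag = lag, fitdf = 0, type = "Lj")$p.value > 0.05
}
is_white_noise(f.ts) # FALSE for this page, since p-value < 2.2e-16
is_white_noise(rnorm(550)) # usually TRUE for simulated white noise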
Forecasting models:
Forecasts for the next 60 days are produced with the naïve forecast, seasonal naïve forecast, moving average, simple exponential smoothing, Holt's smoothing and Holt-Winters smoothing.
1. Naïve forecast:
The naïve forecast is applied on the training time series.
fcnaive_ts = naive(f.ts, 60)
summary(fcnaive_ts)
autoplot(fcnaive_ts)
checkresiduals(fcnaive_ts)
Output:
Upon checking the residuals and performing the Box test, the p-value < 0.05, which suggests the residuals are not white noise.
3. Moving average:
autoplot(f.ts, series = "Data") +
autolayer(ma(f.ts, 7), series = "1 week MA") +
autolayer(ma(f.ts, 31), series = "1 month MA") +
autolayer(ma(f.ts, 91), series = "3 month MA") +
autolayer(ma(f.ts, 183), series = "6 month MA") +
xlab("Date") +
ylab("visits")
4. Simple exponential smoothing:
fcses_ts <- ses(f.ts, alpha = .2, h = 60) # simple exponential smoothing
summary(fcses_ts)
autoplot(fcses_ts) # plot
checkresiduals(fcses_ts) # check whether the residuals are white noise
Output:
> checkresiduals(fcses_ts)
Ljung-Box test
data: Residuals from Simple exponential smoothing
Q* = 908.14, df = 108, p-value < 2.2e-16
Model df: 2. Total lags used: 110
As the p-value of the Box test < 0.05, the residuals are not white noise; the data contains trend and seasonality that simple exponential smoothing does not capture.
5. Holt's smoothing:
fcholt_ts <- holt(f.ts, h = 60)
summary(fcholt_ts)
autoplot(fcholt_ts)
checkresiduals(fcholt_ts)
Output:
> checkresiduals(fcholt_ts)
Ljung-Box test
data: Residuals from Holt's method
Q* = 1002, df = 106, p-value < 2.2e-16
Model df: 4. Total lags used: 110
Tuning the beta parameter:
# identify the optimal beta parameter
beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
fit <- holt(f.ts, beta = beta[i], h = 60)
RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
}
# convert to a data frame and identify the minimum-RMSE beta value
beta.fit <- data_frame(beta, RMSE)
beta.min <- filter(beta.fit, RMSE == min(RMSE))
# plot RMSE vs. beta
ggplot(beta.fit, aes(beta, RMSE)) +
geom_line() +
geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
fcholt_ts <- holt(f.ts, h = 60, beta = beta.min$beta)
6. Holt-Winters smoothing:
Decomposition of the additive time series:
hw.ts <- ets(f.ts, model = "ZZZ")
checkresiduals(hw.ts)
autoplot(hw.ts)
summary(hw.ts)
> summary(hw.ts)
ETS(M,N,M)
Call:
ets(y = f.ts, model = "ZZZ")
Smoothing parameters:
alpha = 0.6672
gamma = 0.0364
Initial states:
l = 194.5145
s = 1.1697 1.0074 0.9371 0.9015 0.9571 1.0013 1.0259
sigma: 0.1116
AIC AICc BIC
8362.877 8363.286 8405.977
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.652725 88.74605 61.03587 -0.1384452 7.216793 0.6028258 -0.01053619
The Holt-Winters model, ETS(M,N,M), has residuals with a higher Ljung-Box p-value than the other models.
Evaluating the different forecast models:
Every model is evaluated against the RMSE on the test data. On the basis of the lowest test RMSE, Holt's method is selected and used to forecast.
> accuracy(fcnaive_ts, f_test$visits)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 1.967213 102.6296 68.49265 -0.2271511 7.699251 1.000000 -0.1835412
Test set 283.950000 419.6527 302.65000 15.8649924 17.797103 4.418722 NA
> accuracy(fcsnaive_ts, f_test$visits)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 16.93582 145.0159 101.2496 1.3114613 11.96902 1.478255 0.6341429
Test set 46.02771 315.4499 181.7056 0.2651809 11.04817 2.652921 NA
> accuracy(mean_fc, f_test$visits)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 4.291307e-14 751.6030 694.2092 -119.25917 156.48239 10.135528 0.98933
Test set 4.602121e+02 554.3247 466.6034 27.59744 28.31075 6.812459 NA
> accuracy(fcses_ts, f_test$visits)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 22.432887 128.3612 87.42038 1.732784 9.373469 1.276347 0.6310515
Test set -3.173597 309.0159 188.65375 -3.246674 11.869201 2.754365 NA
> accuracy(fcholt_ts, f_test$visits)
ME RMSE MAE MPE MAPE MASE ACF1
Training set -3.993354 99.2416 66.82666 -1.6738377 7.629988 0.9756764 0.08924597
Test set 28.649399 308.8983 193.02659 -0.9831642 11.896523 2.8182087 NA
> accuracy(fcets_ts, f_test$visits)
ME RMSE MAE MPE MAPE MASE ACF1
Training set 2.652725 88.74605 61.03587 -0.1384452 7.216793 0.8911302 -0.01053619
Test set 114.850686 314.26532 173.62511 4.8900676 10.024239 2.5349451 NA
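The test-set RMSE reported by accuracy() in its second row is simply the root mean squared error of the point forecasts against the held-out visits; a minimal sketch of the same computation (the rmse helper is ours, not from the original code):

# Illustrative check: compute the test RMSE directly from the point forecasts
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2, na.rm = TRUE))
rmse(as.numeric(f_test$visits), as.numeric(fcholt_ts$mean))
# should agree with accuracy(fcholt_ts, f_test$visits)[2, 2]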
R code to run for all 145k pages automatically:
#Library
library(forecast) #working with time series
library(fpp2) #working with time series
library('dplyr') # data manipulation
library('tidyverse') #data manipulation
library(lubridate) # easily work with dates and times
library(zoo) # working with time series data
#train data
train <- read.csv("train_1.csv")
dim(train)
# head(train)
rows_count = nrow(train)
cols_count = ncol(train)
train2 <- read.csv("train_2.csv")
dim(train2)
#Creation of test data from training data set
test <- train2[, (cols_count+1):(cols_count+60)]
dim(test)
for(j in 1:nrow(train)){
trainsep = train[j,]
testsep = test[j,]
sum = sum(train[j, 2:cols_count]) # NA if page j has any missing values
if(!is.na(sum)){
#Matrix to store RMSE of training and test data set accuracy of forecasts
accur <- matrix(NA, nrow = 6, ncol = 2)
#Data imputations
f = t(trainsep[, -1]) # drop the Page column, keep the 550 day columns
f_test = t(testsep)
head(f_test)
f = data.frame(f,substr(row.names(f),2,11))
colnames(f) = c("visits","dat")
f_test = data.frame(f_test,substr(row.names(f_test),2,11))
colnames(f_test) = c("visits","dat")
head(f)
head(f_test)
#Creation of timeseries data after cleaning using ts and tsclean
f.ts =tsclean(ts(f$visits,frequency = 7))
head(f.ts, 45)
#Data Exploration
autoplot(f.ts)
gglagplot(f.ts)
acf(f.ts)
Box.test(f.ts, lag = 10, fitdf = 0, type = "Lj")
#Removing trend and to check for the seasonality
f.ts.dif = diff(f.ts)
gglagplot(f.ts.dif)
ggAcf(f.ts.dif)
autoplot(f.ts.dif)
f_test.dif <- diff(f_test$visits)
Box.test(f.ts.dif, lag = 10, fitdf = 0, type = "Lj")
ggAcf(f.ts)
#Naive test
fcnaive_ts = naive(f.ts, 60)
summary(fcnaive_ts)
autoplot(fcnaive_ts)
checkresiduals(fcnaive_ts)
act = accuracy(fcnaive_ts, f_test$visits)
accur[1,1] = act[2,2] # test RMSE
accur[1,2] = act[1,2] # train RMSE
#seasonal naive test
fcsnaive_ts = snaive(f.ts,60)
summary(fcsnaive_ts)
autoplot(fcsnaive_ts)
checkresiduals(fcsnaive_ts)
act = accuracy(fcsnaive_ts, f_test$visits)
accur[2,1] = act[2,2] # test RMSE
accur[2,2] = act[1,2] # train RMSE
#mean forecast
mean_fc <- meanf(f.ts, h = 60)
act = accuracy(mean_fc, f_test$visits)
accur[3,1] = act[2,2] # test RMSE
accur[3,2] = act[1,2] # train RMSE
#SES(Simple Exponential smoothing)
fcses_ts <- ses(f.ts, alpha = .2, h = 60)
summary(fcses_ts)
autoplot(fcses_ts)
checkresiduals(fcses_ts)
accuracy(fcses_ts,f_test$visits)
fces_ts1 <-ses(f.ts.dif, alpha = .2, h = 60)
autoplot(fces_ts1)
summary(fces_ts1)
autoplot(f.ts.dif)
checkresiduals(fces_ts1)
accuracy(fces_ts1,f_test.dif)
alpha <- seq(.01, .99, by = .01)
RMSE <- NA
for(i in seq_along(alpha)) {
fit <- ses(f.ts, alpha = alpha[i], h = 60)
RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
}
alpha.fit <- data_frame(alpha, RMSE)
alpha.min <- filter(alpha.fit, RMSE == min(RMSE))
ggplot(alpha.fit, aes(alpha, RMSE)) +
geom_line() +
geom_point(data = alpha.min, aes(alpha, RMSE), size = 2, color = "blue")
fcses_ts <- ses(f.ts, alpha = alpha.min$alpha, h = 60)
autoplot(fcses_ts)
act = accuracy(fcses_ts,f_test$visits)
accur[4,1] = act[2,2] # test RMSE
accur[4,2] = act[1,2] # train RMSE
fcholt_ts <- holt(f.ts, h = 60)
summary(fcholt_ts)
autoplot(fcholt_ts)
checkresiduals(fcholt_ts)
act = accuracy(fcholt_ts,f_test$visits)
accur[5,1] = act[2,2] # test RMSE
accur[5,2] = act[1,2] # train RMSE
# identify optimal beta parameter
beta <- seq(.0001, .5, by = .001)
RMSE <- NA
for(i in seq_along(beta)) {
fit <- holt(f.ts, beta = beta[i], h = 60)
RMSE[i] <- accuracy(fit, f_test$visits)[2,2]
}
# convert to a data frame and identify the minimum-RMSE beta value
beta.fit <- data_frame(beta, RMSE)
beta.min <- filter(beta.fit, RMSE == min(RMSE))
# plot RMSE vs. beta
ggplot(beta.fit, aes(beta, RMSE)) +
geom_line() +
geom_point(data = beta.min, aes(beta, RMSE), size = 2, color = "blue")
fcholt_ts <- holt(f.ts, h = 60, beta = beta.min$beta)
act = accuracy(fcholt_ts,f_test$visits)
accur[5,1] = act[2,2] # test RMSE (overwrites the untuned Holt result)
accur[5,2] = act[1,2] # train RMSE
autoplot(decompose(f.ts))
#HoltWinters seasonal model
hw.ts <- ets(f.ts, model = "ZZZ")
checkresiduals(hw.ts)
autoplot(hw.ts)
summary(hw.ts)
fcets_ts <- forecast(hw.ts, h = 60)
act= accuracy(fcets_ts, f_test$visits)
accur[6,1] = act[2,2] #test RMSE accuracy
accur[6,2] = act[1,2] #trin RMSE accuracy
#Model evaluation using RMSE of test data
method = c("naive", "snaive", "mean", "ses", "holts", "aes") # "aes" labels the ETS forecast
accur1 = data_frame(method, accur[,1])
colnames(accur1) = c("method","RMSE_TEST")
minimum <- filter(accur1, RMSE_TEST == min(RMSE_TEST))
if (minimum$method == "naive"){
fcnaive_ts
}else if(minimum$method == "snaive"){
fcsnaive_ts
}else if(minimum$method == "mean"){
mean_fc
}else if(minimum$method == "ses"){
fcses_ts
}else if(minimum$method == "holts"){
fcholt_ts
}else if(minimum$method == "aes"){
fcets_ts
}
}
}
Conclusion:
Each series has a different forecast depending on the trend, seasonality and error terms in its daily page visits. Some pages have no trend, some have both trend and seasonality, and some have seasonality but no trend. Data exploration was used to understand each time series: ACF and lag plots help in understanding the autocorrelation, and moving-average plots are used to smooth the data.
Different forecast models are applied: naïve, seasonal naïve, mean, simple exponential smoothing, Holt's smoothing and Holt-Winters smoothing. For each model, residual plots are checked to verify that the errors are centered around 0, the residual ACF lies within the confidence bounds, and the Ljung-Box p-value is > 0.05.
RMSE on the test data is used to evaluate the models; the model with the lowest test RMSE is selected to predict the next 60 days of page visits.