Business forecasting project border

Course: Business Forecasting
Submitted to: Prof. Apratim Guha
Topic
Time Series Forecasting of the number of Individuals crossing into US via Mexico and
Canada Borders
Submitted by:
Name: Shruti Nigam
Roll No.: EA21029
Business Forecasting Project Report
PGCBA Batch-IV

Preface
As a part of the curriculum for the course Business Forecasting, XLRI PGCBA-4
Program, we are required to apply various models and forecasting techniques
to a case study and submit a project report. The basic objective of this
project report is to gain hands-on experience and practical knowledge of
business forecasting tools and techniques that can be used to solve real-world
problems.
In this project report, we have used techniques such as basic exploratory
analysis, visualization of time series, regression modelling for time series data,
a 2-layer ARIMA model, decomposition methods, and residual analysis to
smooth out the data, forecast accurately, and reach our conclusion.
Doing this report helped us gain deeper insight into the field of analytics
and the application of business forecasting to real-life challenges and
problems.

Abstract
Business forecasting allows us to analyse the data in hand, create strategies
for projections, and then compare the forecasting model with the realized
outcome. Forecasting can be done by many methods. The time series analysis
method has been used in this project to predict the future trend or pattern by
analysing the given data over a given period. Time series analysis focuses on the
patterns found in the historical data and uses statistical methods to understand
how time affects the target variable. Here, concepts such as analysis of the
seasonality, trend, cyclicity, and irregularity found in historical data are used to
understand the future better. In this project we apply time series
analysis to determine the current trend in the rate of influx of individuals and
commercial vehicles from Mexico and Canada into the US via multiple modes of
transport. Based on this, a forecast has been produced and compared with the
realized outcome.
An overview of the methodology is given with the introduction on the following
pages.

Table of Contents
Introduction
Overview of the procedure to be followed
Data Exploration
2.2 Data Loading
2.3 Data Pre-processing
2.4 Data Exploration
Decomposition
3. Model 1: Regression with trend and seasonality
Model 2: Pure ARIMA model
Model 3: Seasonal Naïve
Model 4: Holt-Winters
Model 5: Neural network model
Model 6: ETS
ACCURACY
Forecasting
Bibliography
Appendix

Introduction
The conventional wisdom of globalization over the years has introduced us to the concept of tearing
down borders. The supporting argument is that growing integration and interdependence lead
to a retreat of the regulatory state, more open borders, and more harmonious cross-border
relationships. In fact, prominent free-market advocates such as the Wall Street Journal even argued
that borders were becoming more meaningless not only for the flow of goods and money but also for
people, backing the emerging ‘borderless world’.
However, the devastating 9/11 terrorist attacks on the US mainland turned North America’s
vision of a border-free continent in another direction. One of the immediate responses by US
authorities after the attacks was to do something about the leaking border. Rather than simply being
dismantled in the face of intensifying pressures of economic integration, border controls are being
retooled and redesigned as part of a new and expanding “war on terrorism.” Traditional border issues
such as trade and migration are now inescapably evaluated through a security lens. Optimistic talk of
opening borders has been replaced by more anxious and sombre talk about “security perimeters” and
“homeland defence”. The American public’s views on the borderless world were also reshaped as
fears about unpredictable terrorism heightened after the incident.
In this project, we focus on how the changing practice and politics of North American border
controls are reflected in the inflow of people into the US, and on the implications of these changes
for cross-border relations and continental integration. We use past data to find and analyse the
patterns and past trends, and finally to forecast the trend for the most recent period, 2020-21.
Abbreviations used in the report
ts – Time Series
WN – White noise
Overview of the procedure to be followed

Data Exploration
2.1 Data Explanation
This data file contains all data from January 1996 to February 2020 on the total incoming
crossing counts into the US. The file contains 7 columns specifying the port and its unique code,
the state, the border, the date of crossing, the mode of transport used to cross, and the number of
people crossing the border into the US.
• Port Name: Name of the port from which the border is crossed.
• State: States in US
• Port Code: Unique port code
• Border: US-Canada or US-Mexico border
• Month: Jan to Dec till Feb 2020
• Year: 1996 to 2020
• Date (DD/MM/YY): Date of crossing the border
• Measure: Mode of transportation
• Value: Count of people crossing
2.2 Data Loading
The data, which is in Excel format, was loaded into a variable. There are 355511 records and 7 variables
in the dataset.
2.3 Data Pre-processing
We checked for the nulls in the data set as a part of data pre-processing. No nulls were found in the
data set.
2.4 Data Exploration
The dataset represents the influx of individuals across the Mexico and Canada borders of the US into
various states via varied modes of transport. The data consists of the number of crossings
at multiple entry points of states across the US. In time series forecasting, only numerical variables
can be considered as X variables/predictors.
The data was imported into R and then grouped into monthly data, from January 1996 to February
2020. Visualization plots were drawn to ascertain the time series behaviour over the years and
months.
The plots show a varying trend over the years with no cyclic pattern. Seasonality seems to be a prominent
characteristic.

Figure 1. Data Table for Individuals and commercial vehicles crossing US border into states, Jan 1996 to Feb 2020
There is a sharp decline after 2001, which stabilizes by 2012, after which the series starts moving
slightly upwards. A significant change in trend is observed from 2010; hence, data up to 2010 should not
be used in the analysis. We reduce the dataset and consider data from January 2011 onwards; please
refer to Fig-2.
Figure 2. Autoplot of entire data, since 1996

There is clear seasonality every 12 months, so the dataset can be treated as having frequency 12. The
seasonality can be seen clearly over the years; it follows an almost identical pattern every year.
People crossing the border follow the same month-by-month pattern that has held since 1996,
despite the huge decrease in the number of people crossing the border every
decade. See (Fig-3, 4).
There seems to be no cyclic pattern: in the first decade the numbers descend, and in the next decade they
ascend. Even 5-year movements differ.
The graphs suggest that the time series has prominent trend and seasonality components, i.e., the time
series is non-stationary.
The data has been reduced to 2011 onwards. On plotting the data from 2011 to 2020, we observe
a clear pattern of seasonality and a slightly increasing trend. See (Fig-6) and (Fig-7).
Figure 3. Seasonality plot of entire data
Figure 4. Seasonality plot of entire data, numbers have decreased
but pattern remains similar
Figure 5. Plot of Reduced dataset since Jan 2011 to Feb 2020

Let us decompose the time series to understand the components that will help in building the model
for forecasting.
The presence of seasonality in the data between 2011 and 2020 can be seen in the seasonal plots. Please
refer to (Fig-8 and Fig-9).
Over the years, the number of people crossing the borders into the US has followed the same pattern
through the months of the year since 1996. Despite the huge decrease in the number of people crossing
the border every decade, the seasonal pattern persists. Please refer to (Fig-8, Fig-9, and Fig-10).
Decomposition
To analyse the data further, we decompose it into the individual components of trend, seasonality,
and residuals/errors. Please refer to (Fig-6).
Classical Method (additive model) of decomposition was run on the data and the following
observations were made:
1. Presence of strong seasonality
2. Trend line is strong and volatile
3. High Variance in the remainders
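As an illustration of what the classical additive method does, here is a minimal sketch in plain Python. It is not the R code used for this report: the function name, the toy series, and the seasonal pattern are assumptions made up for the example. The trend is estimated with a centred 2x12 moving average, the seasonal indices are the average detrended values per month (centred to sum to zero), and the remainder is whatever is left.

```python
from statistics import mean

def classical_additive_decompose(y, period=12):
    """Split y into trend, seasonal and remainder parts (additive model).

    Assumes an even seasonal period such as 12.
    """
    n, half = len(y), period // 2
    trend = [None] * n
    # Centred 2 x period moving average; undefined for the first/last half-year.
    for t in range(half, n - half):
        window = y[t - half:t + half + 1]
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Average the detrended values for each position in the seasonal cycle.
    detrended = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            detrended[t % period].append(y[t] - trend[t])
    raw = [mean(values) for values in detrended]
    centre = mean(raw)
    seasonal_indices = [s - centre for s in raw]  # indices sum to zero
    seasonal = [seasonal_indices[t % period] for t in range(n)]
    remainder = [None if trend[t] is None else y[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder

# Toy monthly series (made-up numbers): linear trend plus a fixed yearly pattern.
pattern = [5, -3, 0, 2, 8, 7, 1, 2, -4, -1, -9, -8]
series = [100 + 2 * t + pattern[t % 12] for t in range(48)]
trend, seasonal, remainder = classical_additive_decompose(series)
```

On this toy series the method recovers the linear trend and the yearly pattern, leaving a remainder of zero; on real data the remainder carries the variance noted in point 3.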
Figure 5. Seasonality, sub series plots of reduced Dataset. Similar patterns as entire dataset

The classical method (multiplicative model) of decomposition was also run on the data, and the
observations were similar to those of the additive model.
Since the two classical decompositions do not separate the structures, X11 seasonal smoothing is
applied to understand the time series components.
The X11 method was applied to the data and the findings are as follows:
1. It has automatically selected the additive time series structure
2. Presence of seasonality is strong, and its intervals do not increase with time
The time series decomposition has defined the model as additive.
Now, we prepare for creating Models for forecasting.
We divide the dataset into train and test. January 2011 to December 2018 is train and January 2019
onwards is test data.
Before modelling, we check for the presence of heteroscedasticity in both the grouped data and the
train dataset using the Box-Cox function and looking at the lambda value.
We get λ = 0.5925186. The transformation makes the size of the seasonal variation about the same
across the whole series, which makes the forecasting model simpler. In this case it works quite
well.
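For reference, the Box-Cox transformation with a given λ can be written in a few lines. This is a minimal sketch in plain Python, assuming strictly positive data; the function names and the sample counts are made up for illustration, and only the λ value comes from the text above.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform: log(y) if lam == 0, otherwise (y**lam - 1) / lam."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

def inverse_box_cox(w, lam):
    """Back-transform to the original scale (needed for forecasts)."""
    if lam == 0:
        return [math.exp(v) for v in w]
    return [(lam * v + 1) ** (1 / lam) for v in w]

LAMBDA = 0.5925186  # value reported in the text
counts = [21_000_000.0, 25_500_000.0, 30_200_000.0]  # made-up monthly totals
transformed = box_cox(counts, LAMBDA)
recovered = inverse_box_cox(transformed, LAMBDA)
```

A λ between 0 (log) and 1 (no change in shape) gives a mild, roughly square-root-like compression of large values.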
There is clear seasonality every 12 months, so the dataset can be treated as having frequency 12. It is
also evident from the ACF plot that the lag values decrease towards zero very slowly, and that the lags
fall outside the significance bounds. Hence, the train dataset is non-stationary (since seasonality is
present, a separate check for non-stationarity is not needed).
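The ACF reading above can be reproduced by hand. The sketch below, in plain Python, computes sample autocorrelations and flags the lags that fall outside the usual approximate 95% bounds of ±1.96/√N. The function name and the toy series are assumptions for illustration, not the report's data.

```python
import math

def acf(y, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag."""
    n = len(y)
    mean_y = sum(y) / n
    denom = sum((v - mean_y) ** 2 for v in y)
    return [
        sum((y[t] - mean_y) * (y[t - k] - mean_y) for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]

# Toy monthly series (made-up): a pure yearly pattern repeated for 8 years.
pattern = [5, -3, 0, 2, 8, 7, 1, 2, -4, -1, -9, -8]
series = [pattern[t % 12] for t in range(96)]

r = acf(series, 24)
bound = 1.96 / math.sqrt(len(series))  # approximate 95% bounds for white noise
significant = [k for k, v in enumerate(r, start=1) if abs(v) > bound]
```

For a seasonal series like this one, lags 12 and 24 stand out well above the bound, which is the pattern read off the ACF plot.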
Figure 6. Residual check in reduced dataset.

Moreover, the Box-Pierce test rejects the null hypothesis of white noise, i.e., the series is
autocorrelated.
Let's start modelling the ts to forecast the influx of passengers into the states (including both
border crossings).
3. Model 1: Regression with trend and seasonality
The tslm() function fits a linear regression model to time series data.
Now we model our dataset for obtaining forecast.
To build the forecasting model, the data has been split into test and train, with the train set containing
datapoints from Jan 2010 to Dec 2017 and the test set containing datapoints from Jan 2018 to the end.
The t-values and p-values hold no meaning in terms of forecasting. If the predictions are close to the actual
values, we would expect R² to be close to 1. On the other hand, if the predictions are unrelated to the
actual values, then R² = 0 (again, assuming there is an intercept). In all cases, R² lies between 0 and 1.
Multiple R-squared: 0.9575 and Adjusted R-squared: 0.9504 are near 1, meaning the values fitted
by Model 1 are close to the original values. We can see that in the following figure.
Let’s check the residuals’ behaviour too, to see whether the model is good enough for prediction on test
data. We check the residuals on train data and then check the accuracy on test data, because there is no
guarantee that a model performing well on train data will perform the same or better on test data.
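To make the R² statement concrete, here is a minimal sketch in plain Python using the usual least-squares definition, 1 minus the ratio of residual to total sum of squares. The function name and the numbers are made up for illustration and are not taken from the fitted model.

```python
def r_squared(actual, fitted):
    """R^2 = 1 - SS_residual / SS_total (least-squares fit with an intercept)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10.0, 12.0, 15.0, 19.0, 24.0]     # made-up observations
good_fit = [10.5, 11.5, 15.5, 18.5, 24.0]   # fitted values close to the data
mean_only = [16.0] * 5                      # predicts the mean everywhere

r2_good = r_squared(actual, good_fit)
r2_mean = r_squared(actual, mean_only)
```

Fitted values that track the data give an R² near 1, while predicting the mean everywhere gives exactly 0. A high R² on train data says nothing about autocorrelation in the residuals, which is why the residual check follows.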
Though the fitted values are close to the original values, the residuals of the model exhibit
autocorrelation. This is clear from the plot and from the Breusch-Godfrey test for serial
correlation: p-value = 0.01292 < 0.05 rejects the null hypothesis of no autocorrelation (white
noise).
tslm(formula = Value ~ trend + season, data = crossing.ts.train)
Value' = 43948 + (-2177459) * season2 + 1797434 * season3 + 1110099 * season4 +
2456941 * season5 + 2125431 * season6 + 5482345 * season7 + 5572770 * season8 +
1241338 * season9 + 1989670 * season10 + 968427 * season11 + 2760972 * season12
Figure 7. Fitted line by TSLM model over train dataset.
Variation in the time series is present, but there is no sign of heteroscedasticity.
The histogram shows that the residuals are slightly skewed, which may also affect the coverage
probability of the prediction intervals. Most of the lags cross the significance bounds of the ACF
plot (±1.96/√N). In any case, the autocorrelation is large, which will affect the prediction
intervals and thus the forecasted values.
Hence, we need to fit a second-layer ARIMA model to the residuals, to improve the prediction
capability by capturing the information left in the residuals.
First, check for the differencing needed to smooth out the residuals.
Differencing can help stabilise the mean of a time series by removing changes in the level of a time
series, and therefore eliminating (or reducing) trend and seasonality.
A sequence of KPSS tests is used to determine the appropriate number of first differences needed to
make the data stationary; here that number is zero.
The seasonal differencing check returns 1, indicating that one seasonal difference is required. These
functions therefore suggest one seasonal difference and no first difference.
So, d = 0 and D = 1. Now create the trend and seasonality variables based on this, to be fed into the
ARIMA model.
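What one seasonal difference (D = 1, lag 12) does to a series can be shown in a few lines. This is a sketch in plain Python on a made-up series, not on the regression residuals used in the report; the helper name is hypothetical.

```python
def difference(y, lag=1):
    """Return y[t] - y[t - lag]; lag=1 is a first difference, lag=12 a seasonal one."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

# Toy monthly series (made-up): linear trend plus a fixed yearly pattern.
pattern = [5, -3, 0, 2, 8, 7, 1, 2, -4, -1, -9, -8]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]

seasonal_diff = difference(series, lag=12)  # D = 1, no first difference (d = 0)
```

Here the seasonal difference removes the yearly pattern completely and turns the linear trend into a constant (24 per year), so no further first difference is needed. The series also becomes 12 observations shorter.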
The model fitted by auto.arima is ARIMA(1,0,1)(2,1,0)[12] with AICc = 1241.86.
The coefficients of the AR and MA terms are less than 1, the sum of the coefficients of the two seasonal
AR terms is less than one, and the sum of the coefficients of the two MA terms is also less than one.
Figure 8. Residual check for TSLM model
\[ (1 - 0.9764\,B)\,(1 - (-0.4329)\,B^{12})\,(1 - B^{12})\,Y_t = (1 + (-0.3645)\,B^{12})\,e_t \]

The p-value of the Ljung-Box test is 0.1967, greater than 5%, so we fail to reject the null hypothesis,
i.e., the residuals are white noise. Lag 7 does go outside the significance bounds, but a single lag
outside the bounds will not impact the prediction from the model. This seems to be a good model. This
ARIMA model is stationary and invertible.
Model TSLM + SARIMA: AICc = 1241.86, MAPE = 225.185, MASE = 0.6689771
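The Ljung-Box statistic behind these p-values can be computed directly. The sketch below, in plain Python, evaluates Q = n(n+2) Σ r_k²/(n−k) over the first h lags. The function name and the alternating toy residuals are assumptions for illustration. The p-value itself would come from a chi-square distribution (with the degrees of freedom reduced by the number of fitted ARMA parameters), which is not computed here; instead Q is compared with the 5% critical value for 10 degrees of freedom.

```python
def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(residuals)
    mean_r = sum(residuals) / n
    denom = sum((v - mean_r) ** 2 for v in residuals)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum((residuals[t] - mean_r) * (residuals[t - k] - mean_r)
                  for t in range(k, n)) / denom
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

CHI2_95_DF10 = 18.307  # 5% critical value of chi-square with 10 degrees of freedom

# Made-up residuals with strong negative lag-1 autocorrelation.
alternating = [1.0 if t % 2 == 0 else -1.0 for t in range(40)]
q_stat = ljung_box_q(alternating, max_lag=10)
rejects_white_noise = q_stat > CHI2_95_DF10
```

A large Q (small p-value) rejects white noise; a p-value above 5%, as for the model above, means no evidence of leftover autocorrelation.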
Model 2: Pure ARIMA model
Before fitting ARIMA, we check for non-stationarity and whether differencing is needed, using
unit root tests. These are statistical hypothesis tests of stationarity that are designed for determining
whether differencing is required.
A number of unit root tests are available, which are based on different assumptions and may lead to
conflicting answers. In our analysis, we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS)
test (Kwiatkowski, Phillips, Schmidt, & Shin, 1992). In this test, the null hypothesis is that the data are
stationary, and we look for evidence that the null hypothesis is false. Consequently, small p-values
(e.g., less than 0.05) suggest that differencing is required.
On checking residuals on train data, lag 1 is highly significant, followed by a negative lag 6. The lags
oscillate at a fixed interval, indicating prominent seasonality. Lags also go outside the significance
bounds, meaning the series is not white noise.
The KPSS test for level stationarity has p-value = 0.01, less than 0.05, and rejects the null, i.e., the
series is non-stationary.
The Augmented Dickey-Fuller test has p-value = 0.2981 and fails to reject the null, i.e., the ts is
non-stationary.
The tests show that one first difference and zero seasonal differences are needed to make the
time series stationary.
We apply a pure ARIMA model with d = 1 and D = 0 using auto.arima(); the model is ARIMA(0,1,0) with
AICc = 1015.96, MAPE = 1.374052, MASE = 0.611111.
Figure 9. Pure Arima Model fitted over train data

A Ljung-Box test returns a p-value = 0.001733, less than 0.05, suggesting that the residuals are NOT
white noise. Lag 1 is negative and significant and goes outside the significance bounds. The others move
in waves; a single lag may not have a significant impact on forecasting.
The seasonality component is not captured by the model. We compare the values forecasted by
the model with the original dataset. We can easily see that the fitted values closely follow the test
dataset values. The model seems a good fit.
Next, we try to build a better model on the train data directly. To make the train data stationary, we need
one non-seasonal and one seasonal difference, making the next pure model a seasonal ARIMA.
The model is ARIMA(0,1,1)(0,1,1)[12] with AICc = 852.48, MAPE = -0.2317185, MASE = 1.103649.
This has the lowest AICc value so far. The Ljung-Box test has p-value = 0.9549, so we fail to reject the
null, i.e., the residuals are pure WN, which is evident from the ACF plot too. This is the best model so far.
Now, we explore other models.
Now, we explore other models.
Model 3: Seasonal Naïve
The forecasting techniques of naïve, seasonal naïve, average, and drift were applied to the data.
Seasonal naïve has the lowest MAPE and MASE on test data.
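The seasonal naïve method simply repeats the value observed in the same month of the last available year. A minimal sketch in plain Python follows; the function name and the toy data are assumptions for illustration.

```python
def seasonal_naive(train, horizon, period=12):
    """Forecast each future month with the last observed value of that month."""
    start = len(train) - period  # index of the first month of the last full cycle
    return [train[start + (h % period)] for h in range(horizon)]

# Toy training data (made-up): two years of monthly values 0..23.
train = list(range(24))
forecast = seasonal_naive(train, horizon=14)
```

The first 12 forecasts copy the last year of the training data, and month 13 onwards starts repeating the same cycle. This is why the method captures seasonality but none of the trend, leaving structure in the residuals.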
On comparing the output of the various forecasting methods on the data, we find a better fit with the
seasonal naïve forecasting technique. On checking the model’s residuals, the following observations can be
made (Fig-10):
1. ACF – all the lags are beyond the threshold limits
2. The 1st lag is significant, and every 12th lag is higher than the rest
3. Residuals are left-skewed; however, the curve is approximately normal
4. There is a strong presence of autocorrelation.
On performing the Ljung-Box test we find a highly significant p-value (< 2.2e-16) (Fig-11),
indicating that the residuals are not white noise, contain more information, and require
further modelling.
Again, we check the differencing needed to make the residuals stationary. One non-seasonal difference is
needed on the residuals. We fit auto.arima with d = 1. The model is ARIMA(0,1,1) with AICc = 1195.28,
MAPE = 246.2098, MASE = 0.6888965.
This is the model for residual forecasting. The residuals have become pure WN, with Ljung-Box test
p-value = 0.1834, greater than 5%. Now, we need to map the forecast back to the original values. First,
we calculate the final forecasted values by adding the forecasted values from the seasonal naïve model to
the forecasted residuals from auto.arima. Then we put this on the plot.
- \[ (1 - B)\,(1 - B^{12})\,Y_t = (1 + (-0.7723)\,B) + (1 + (-0.6338)\,B^{12})\,e_t \]
- \[ (1 - B)\,Y_t = (1 + (-0.7723)\,B) + (1 + (-0.8044)\,B)\,e_t \]

As we can see, the fit is not as good as that of the pure ARIMA model. We will compare the accuracy
later. Let’s see how smoothing parameters work on this data for forecasting.
Model 4: Holt-Winters
We fit both additive and multiplicative Holt-Winters models, as both trend and seasonality are
present.
The output of HoltWinters() tells us that the estimated value of the alpha parameter is about 0.032
for the additive and 0.0616 for the multiplicative model. These values are very close to zero, telling us
that the forecasts are based on both recent and less recent observations (although somewhat more weight
is placed on recent observations).
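The effect of a small alpha can be seen from the weights that exponential smoothing places on past observations. The sketch below is plain Python and uses simple exponential smoothing as a simplified stand-in (in Holt-Winters, alpha plays the same role for the level component); the function name is an assumption for illustration.

```python
def ses_weights(alpha, n_obs):
    """Weight on the observation k steps back: alpha * (1 - alpha)**k."""
    return [alpha * (1 - alpha) ** k for k in range(n_obs)]

ALPHA_ADDITIVE = 0.032  # alpha reported for the additive model
weights = ses_weights(ALPHA_ADDITIVE, n_obs=24)

latest_weight = weights[0]         # weight on the most recent observation
year_old_weight = weights[12]      # weight on the observation 12 months back
covered_by_two_years = sum(weights)
```

With alpha = 0.032 the weights decay very slowly: the observation from a year ago still carries about two thirds of the weight of the latest one, and two full years of data account for only about half of the total weight.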
The models show that the additive ts structure fits better than the multiplicative one.
On checking the residuals, only lag 24 is significant. The Ljung-Box test for the additive model
gives p-value = 0.01273, less than 5%, which rejects the null, i.e., some autocorrelation remains. However,
one larger lag will not impact the forecast significantly. We plot the dataset with the fitted values.
The residuals are approximately normally distributed, slightly skewed, with no heteroscedasticity.
The linear exponential smoothing models are all special cases of ARIMA models, while the nonlinear
exponential smoothing models have no equivalent ARIMA counterparts. For non-stationary residuals
and data, we can also explore ETS models.
Model 5: Neural network model
We model the train data with a neural network. It is evident that lag 1 is significant and positive, and
from lag 18 all lags become negative. No heteroscedasticity is present in the residuals.
The model is NNAR(2,1,2). There is no AICc value computed for this model.
We compare the accuracy of the forecast later.
Let’s explore ETS, since we have seen that the data and residuals display non-stationarity. As we know, all
ETS models are non-stationary, while some ARIMA models are stationary.
Model 6: ETS
The ETS model is a univariate time series forecasting method; its use focuses on trend and seasonal
components. An ETS model is labelled by a three-character string identifying the nature of the time
series components:
first character: Nature of Remainder: l_t
second character: Nature of Trend: b_t
third character: Nature of Seasonality: s_t
(Sindhanuru, n.d.) (Athanasopoulos, n.d.)
The ETS models with seasonality or non-damped trend or both have two unit roots (i.e., they need
two levels of differencing to make them stationary). All other ETS models have one unit root (they
need one level of differencing to make them stationary).
Let’s fit a simple ETS model on the train dataset. The model is ETS(M,Ad,M):
the remainder is multiplicative, the trend is additive but damped, and the seasonality is multiplicative,

with alpha = 0.1316, close to zero. AICc = 2581.995, MAPE = 1.051038, MASE = 0.4693708.
The residuals are not white noise: the Ljung-Box test p-value = 0.006745 is less than 5%, which rejects
the null hypothesis. As visible in the plot, only lag 24 is outside the bounds, so it may not have a
significant effect on prediction. The MASE is the lowest, but the AICc value is the highest among the
models of this project. Additionally, the residuals are normally distributed.
On plotting the fitted values on the training dataset, the fitted line closely follows the original
data points and the shape of the ts.
ACCURACY
We have forecasted the number of individuals crossing the US border with each model we have built,
and then tested the accuracy of each model on the test data. The following is the performance of each
model on test data:
Accuracy on test data
Model Number   Model Name       AICc of Model   MAPE       MASE
1              TSLM             1241.86         1.672851   0.7902939
2              Pure ARIMA       852.48          1.570884   0.7342843
3              Seasonal Naive   1195.28         2.017029   0.9388456
4              Holt Winter's    3436.111        1.527619   0.7171233
5              Neural Network   -               2.309248   1.0874329
6              ETS              2581.995        1.865868   0.8714899
The AICc value is the lowest for the pure ARIMA model, which also has a low MASE and MAPE.
The Holt-Winters MAPE and MASE are the lowest, but its AICc value is the highest.
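For readers unfamiliar with the two accuracy measures in the table, here is a minimal sketch in plain Python. MAPE is the mean absolute percentage error; MASE scales the mean absolute error of the forecast by the in-sample error of a seasonal naïve forecast, which is one common convention for seasonal data and is assumed here. The function names and numbers are made up for illustration and do not reproduce the table.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def mase(actual, forecast, train, period=12):
    """MAE of the forecast divided by the in-sample seasonal naive MAE."""
    naive_errors = [abs(train[t] - train[t - period]) for t in range(period, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

# Made-up data: two years of training values, then two test months.
train = [100.0 + t for t in range(24)]
actual = [130.0, 140.0]
forecast = [127.0, 146.0]

mape_value = mape(actual, forecast)
mase_value = mase(actual, forecast, train)
```

A MASE below 1 means the forecast beats the seasonal naïve benchmark on average, which is the case for every model in the table except the neural network.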
Forecasting
Let’s forecast the values with the pure ARIMA model by re-training the chosen model on the whole
dataset and then producing the forecast.
Simple ETS model:
\[ Y_t = l_t + b_t + s_t \]
\[ Y_t = 27398694.1062 + 104492.8019 + \operatorname{sum}(1.0273,\ 0.9665,\ 1.0007,\ 0.977,\ 1.1223,\ 1.1185,\ 1.0062,\ 1.0183,\ 0.972,\ 0.9953,\ 0.8618,\ 0.9341) \]

We can see that the number of individuals has increased, replicating the shape of the previous seasons of
the ts. These are the forecasted numbers for the next 12 months with 80%, 90%, and 95% prediction intervals.
In conclusion, it can be deduced that the pattern of influx may witness a decline compared to
previous years. This could be an impact of the ongoing pandemic, the increased legal requirements
for crossing the border, or many other factors which are beyond the scope of this study.
We can now go ahead and validate the model using the dataset from March 2020 to December 2021,
available at (Border-Crossing-Entry-Data, n.d.).
The model cannot predict the movement from the previous data, because of the pandemic restrictions across
borders. We can see a sharp decrease in April 2020, followed by a gradual increase as restrictions were
lifted over time. But the model shows a similar seasonal movement across the pandemic, showing that
macro-economic and political factors are not factored into the forecasting model. We have learned that a
proper forecast requires a comprehensive approach that combines analytics, domain knowledge, and expert
views with judgemental forecasts.
Figure 10. Final Forecast by Pure ARIMA

Bibliography
Athanasopoulos, R. J. (n.d.). Forecasting: Principles and Practice (2nd ed). Retrieved from otexts.com:
https://otexts.com/fpp2/arima-ets.html
Border-Crossing-Entry-Data. (n.d.). Retrieved from data.bts.gov:
https://data.bts.gov/Research-and-Statistics/Border-Crossing-Entry-Data/keg4-3bc2/data
Sindhanuru, H. (n.d.). Retrieved from www.latentview.com:
https://www.latentview.com/idealab/exponential-smoothing-ets-framework/
time-series-analysisforecast-with-visualization. (n.d.). Retrieved from kaggle:
https://www.kaggle.com/datafan07/time-series-analysisforecast-with-visualization/data
PowerPoints, R and Rmd files, and code provided by the Professor during sessions.
Figure 1. Data Table for Individuals and commercial vehicles crossing US border into states, Jan 1996 to Feb 2020
Figure 2. Autoplot of entire data, since 1996
Figure 3. Seasonality plot of entire data
Figure 4. Seasonality plot of entire data, numbers have decreased but pattern remains similar
Figure 5. Seasonality, sub series plots of reduced Dataset. Similar patterns as entire dataset
Figure 6. Residual check in reduced dataset.
Figure 7. Fitted line by TSLM model over train dataset.
Figure 8. Residual check for TSLM model
Figure 9. Pure Arima Model fitted over train data
Figure 10. Final Forecast by Pure ARIMA
Appendix
.rmd file
Knitted word document.
C:UsersShruti
DocumentsXLRIBF Apratim Guhaassignment BFG6Group 6 PGCBA IV BF PROJECTGroup6-Border-Crossing Knitted doc.docx
