Basic Principles to Create a
Time Series Forecast
We are surrounded by patterns: the four seasons shape the
weather, peak hours shape the volume of traffic, and similar
regularities appear in your heartbeat, in the shares of the stock
market and in the sales cycles of certain products.
Analyzing time series data can be extremely useful for checking
these patterns and creating predictions for the future. There are
several ways to create these forecasts; in this post I will cover
the concepts behind the most basic and traditional methodologies.
All code is written in Python, and additional information can be
found on my Github.
Let's start with the initial condition for analyzing time
series:
Stationary Series
A stationary time series is one whose statistical properties, such
as mean, variance and auto correlation, are relatively constant
over time. Therefore, a non-stationary series is one whose
statistical properties change over time.
Before starting any predictive modeling, it is necessary to
verify whether these statistical properties are constant. I will
explain each of these points below:
• Constant mean
• Constant variance
• Auto correlation
Constant Mean
A stationary series has a relatively constant mean over time;
there are no bullish or bearish trends. A constant mean with
small variations around it makes it much easier to extrapolate
to the future.
There are cases where the variance is small relative to the mean,
and the mean itself may then be a good metric for predicting the
future. The chart below shows a relatively constant mean in
relation to the variance over time:
If the series is not stationary, on the other hand, a forecast
based on the mean will not be reliable, because the values drift
significantly away from it, as can be seen in the chart below:
In the chart above, it is clear that there is a bullish trend and
the mean is gradually rising. In this case, if the average were
used to make future forecasts, the error would be significant,
since forecast prices would always be below the real price.
Constant Variance
When the series has constant variance, we have an idea of the
standard deviation around the mean. When the variance is not
constant (as in the image below), the forecast will probably have
bigger errors in certain periods, and these periods will not be
predictable. The variance is expected to remain non-constant
over time, including in the future.
In order to reduce the variance effect, the logarithmic
transformation can be applied. Power transformations, such as
the Box-Cox method, or an inflation adjustment can be used as
well.
Autocorrelated Series
When two variables show similar variation, relative to their
standard deviations, over time, you can say that these variables
are correlated. For instance, heart disorders increase along with
body weight: the greater the weight, the greater the incidence
of heart problems. In this case, the correlation is positive and
the graph would look something like this:
A case of negative correlation would be something like: the
greater the investment in safety measures at work, the smaller
the number of work-related accidents.
Here are several examples of scatter plots with different
correlation levels:
source: Wikipedia
Auto correlation means that there is a correlation between
certain previous periods and the current period; the name given
to the earlier period with this correlation is a lag. For
instance, in a series that has measurements every hour, today's
temperature at 12:00 is very similar to the temperature at 12:00,
24 hours ago. If you compare the variation of temperatures
over this 24-hour time frame, there will be an auto correlation;
in this case we have an auto correlation at the 24th lag.
Auto correlation is a condition for creating forecasts with a
single variable, because if there is no correlation, you cannot
use past values to predict the future. When there are several
variables, you can verify whether there is a correlation between
the dependent variable and the lags of the independent variables.
If a series does not have auto correlation, it is a series with
random and unpredictable sequences, and the best way to make
a prediction is usually to use the value from the previous
period. I will use more detailed charts and explanations below.
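To make the idea of a lag concrete, here is a minimal sketch of the sample auto correlation in plain Python. The hourly temperatures are synthetic (a pure 24-hour sine cycle that I made up for illustration), and the function name `autocorrelation` is my own, not part of any library.

```python
import math

def autocorrelation(series, lag):
    """Sample auto correlation of `series` at a positive integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    # Total variation of the series around its mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Co-variation between each value and the value `lag` periods earlier.
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom

# Hypothetical hourly temperatures: a clean daily cycle over 30 days.
temps = [20 + 5 * math.sin(2 * math.pi * hour / 24) for hour in range(24 * 30)]

acf_24 = autocorrelation(temps, 24)  # same hour on the previous day
acf_12 = autocorrelation(temps, 12)  # opposite point of the daily cycle

print(f"lag 24: {acf_24:.2f}, lag 12: {acf_12:.2f}")
```

On this idealized series the 24th lag comes out strongly positive and the 12th lag strongly negative; real temperature data would be noisier.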
From here on I will analyze the weekly hydrous ethanol prices
from Esalq (a price reference for negotiating hydrous ethanol in
Brazil); the data can be downloaded here.
The price is in Brazilian Reais per cubic meter (BRL/m3).
Before starting any analysis, let's split the data into a training
set and a test set.
Dividing the data into training and test sets
When we are going to create a time series prediction model, it's
crucial to separate the data into two parts:
Training set: these data will be the main basis for defining the
coefficients/parameters of the model;
Test set: these data are held out and not seen by the model, so
they can be used to test whether the model works (generally the
predictions are compared with these values using a walk-forward
method, and finally the mean error is measured).
The size of the test set is usually about 20% of the total sample,
although this percentage depends on the sample size that you
have and also on how far ahead you want to forecast. The test
set should ideally be at least as large as the maximum forecast
horizon required.
Unlike other prediction methods, such as classifications and
regressions without the influence of time, in time series we
cannot divide the training and test data with random samples
from any part of the data. We must follow the time order of the
series, where the training data always come before the test data.
In this example of Esalq hydrous ethanol prices we have 856
weeks; we will use the first 700 weeks as the training set and
the last 156 weeks (3 years, ~18%) as the test set:
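A chronological split like the one described above can be sketched as follows. The price list is a made-up stand-in for the 856 weekly observations (the real data would be loaded from the Esalq file), and `chronological_split` is a hypothetical helper name.

```python
# Hypothetical stand-in for the 856 weekly prices, oldest first.
prices = [1000.0 + 0.5 * week for week in range(856)]

def chronological_split(series, train_size):
    """Split in time order: the first `train_size` points train, the rest test."""
    if not 0 < train_size < len(series):
        raise ValueError("train_size must be between 1 and len(series) - 1")
    return series[:train_size], series[train_size:]

train, test = chronological_split(prices, train_size=700)
test_share = len(test) / len(prices)

print(len(train), len(test), f"{test_share:.1%}")
```

No shuffling happens anywhere: every training observation is older than every test observation.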
From now on we will only use the training set for the analysis;
the test set will only be used to validate the predictions that we
will make.
Every time series can be broken down into 3 parts: trend,
seasonality and residuals, the residuals being what remains
after removing the first two parts from the series. The
separation of these parts is shown below:
Clearly the series has an uptrend, with peaks between the end
and beginning of each year and minimums between April and
September (beginning of the sugarcane crushing season in the
center-south of Brazil).
However, it is advisable to use statistical tests to confirm
whether the series is stationary. We will use two tests: the
Dickey-Fuller test and the KPSS test.
First, we will use the Dickey-Fuller test with a P-value
threshold of 5%; that is, if the P-value is below 5%, the series
is statistically stationary.
In addition, there is the test statistic, which can be compared
with the critical values at 1%, 5% and 10%; if the test statistic
is below the chosen critical value, the series is stationary:
In this case, the Dickey-Fuller test indicated that the series is
not stationary (the P-value is 36% and the test statistic is above
the 5% critical value).
Now we are going to analyze the series with the KPSS test.
Unlike the Dickey-Fuller test, the KPSS test assumes that the
series is stationary, and this assumption is rejected only if the
P-value is less than 5% or the test statistic is greater than the
chosen critical value:
Confirming the Dickey-Fuller test, the KPSS test also shows that
the series is not stationary, because the P-value is at 1% and the
test statistic is above every critical value.
Next I will demonstrate ways to make a series stationary.
Making the series stationary
Differencing is used to remove trend signals and also to reduce
the variance; it is simply the difference between the value of
period T and the value of the previous period T-1.
To make it easier to understand, below we take only a fraction of
the ethanol prices for better visualization. Note that from
May/2005 prices start rising until mid-May/2006; these weekly
rises accumulate, creating an uptrend, so in this case we have
a non-stationary series.
When the first difference is taken (graph below), we remove
the cumulative effect of the series and only show the variation of
period T against period T-1 throughout the whole series. So if
the price 3 days ago was BRL 800.00 and then changed to BRL
850.00, the value of the difference will be BRL 50.00, and if
today's value is BRL 860.00, the difference will be BRL 10.00.
Normally only one round of differencing is necessary to make a
series stationary, but if necessary a second one can be applied;
in this case, the differencing is done on the values of the first
difference (there will hardly be cases that need more than 2
rounds of differencing).
Using the same example, to take a second difference we subtract
the first difference at T-1 from the first difference at T:
BRL 2.9 - BRL 5.5 = -BRL 2.6, and so on.
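The differencing arithmetic can be sketched in a few lines of plain Python. The three prices are the BRL 800, 850 and 860 values used in the text; `difference` is a helper name I am assuming for illustration.

```python
def difference(series, order=1):
    """Difference the series `order` times; each pass drops one value."""
    result = list(series)
    for _ in range(order):
        # Value of period T minus value of period T-1.
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result

prices = [800.0, 850.0, 860.0]

first_diff = difference(prices)            # changes between consecutive prices
second_diff = difference(prices, order=2)  # changes of those changes

print(first_diff, second_diff)
```

Note that each round of differencing shortens the series by one observation, which matters when aligning the result with dates.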
Let's run the Dickey-Fuller test to see whether the series is
stationary after the first difference:
In this case we confirm that the series is stationary: the P-value
is effectively zero, and the test statistic is far below the
critical values.
In the next example we will try to make a series stationary
using the inflation adjustment.
Inflation Adjustment
Prices are relative to the time at which they were traded. In
2002 the price of ethanol was BRL 680.00; if the product were
traded at this price nowadays, certainly many mills would be
closed, as it is a very low price.
To try to make the series stationary, I will adjust the whole
series to current values using the IPCA index (the Brazilian CPI
index), accumulated from the end of the training period
(Apr/2016) back to the beginning of the study. The source of the
data is the IBGE website.
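As a rough sketch of this kind of adjustment, each nominal price can be rescaled by the ratio between the index level at the end of the period and the index level at the time of the price. The numbers below are invented for illustration only (they are not real IPCA or Esalq values), and `adjust_for_inflation` is a hypothetical helper.

```python
def adjust_for_inflation(prices, index_levels):
    """Express every price in the money of the last period.

    `index_levels` is a cumulative price index aligned with `prices`;
    each price is scaled by index_last / index_at_that_time.
    """
    if len(prices) != len(index_levels):
        raise ValueError("prices and index_levels must have the same length")
    last = index_levels[-1]
    return [price * last / level for price, level in zip(prices, index_levels)]

# Invented numbers, for illustration only.
nominal = [680.0, 900.0, 1500.0]   # BRL/m3 at three points in time
index = [100.0, 150.0, 250.0]      # cumulative price index at the same points

real = adjust_for_inflation(nominal, index)
print(real)
```

With these made-up values the nominal uptrend largely disappears once everything is expressed in end-of-period money, which is the effect the adjustment is meant to produce.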
Now let's see what the series looks like and also whether it
became stationary:
As can be seen, the uptrend has disappeared, with only the
seasonal oscillations remaining; the Dickey-Fuller test also
confirms that the series is now stationary.
Just for the sake of curiosity, see below the graph of the
inflation-adjusted price against the original series.
Reducing variance
The logarithm is usually used to transform series with
exponential growth into series with more linear growth. In this
example we will use the natural logarithm (NL), whose base is
e ≈ 2.718; this type of logarithm is widely used in economics.
The difference of the values transformed into NL is
approximately equivalent to the percentage variation of the
values of the original series, which makes it a valid basis for
reducing the variance in series with different price levels. See
the example below:
If we have a product whose price rose in 2000 from BRL 50.00 to
BRL 52.50, and some years later (2019) the price was already
BRL 100.00 and rose to BRL 105.00, the absolute differences
between prices are BRL 2.50 and BRL 5.00 respectively; however,
the percentage difference in both cases is 5%.
When we use the NL on these prices we have: NL(52.50) -
NL(50.00) = 3.961 - 3.912 = 0.049, or 4.9%. In the same way,
using the NL on the second price sequence we have: NL(105) -
NL(100) = 4.654 - 4.605 = 0.049, or 4.9%.
In this example, we reduce the variation of the values by
bringing almost everything to the same basis.
Below the same example:
Result: The percentage variation of the first example is 4.9 and the
second is 4.9
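The comparison between percentage change and log difference can be reproduced with the standard `math` module. This is a small sketch using the two price pairs from the text; the helper names are my own.

```python
import math

def pct_change(old, new):
    """Ordinary percentage change from `old` to `new`."""
    return (new - old) / old * 100

def log_change(old, new):
    """Difference of natural logs, expressed in percent."""
    return (math.log(new) - math.log(old)) * 100

pairs = [(50.0, 52.5), (100.0, 105.0)]
for old, new in pairs:
    print(f"{old:.2f} -> {new:.2f}: "
          f"pct {pct_change(old, new):.2f}%, log diff {log_change(old, new):.2f}%")
```

Both pairs give the same log difference because it depends only on the ratio between the two prices, not on their absolute level.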
The table below compares the percentage variation of X with the
variation of NL(X):
Let's plot the original series against the series with the NL
transform:
Box-Cox Transformation (Power Transform)
The Box-Cox transformation is also a way to transform a series;
the lambda (λ) value is the parameter that controls the
transformation.
In short, this function is a family of power transformations,
where we search for the value of lambda that transforms the
series so that its distribution is as close as possible to a
normal (Gaussian) distribution. A condition for using this
transformation is that the series has only positive values. For
an original value $y$ and its transformed value $y^{(\lambda)}$,
the formula is:
$$ y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases} $$
Below I will plot the original series with its distribution, and
after that the series transformed with the optimal value of
lambda with its new distribution. To find the value of lambda we
will use the function boxcox of the Scipy library, which returns
the transformed series and the ideal lambda:
Below is an interactive chart where you can change the lambda
value and check the change in the chart:
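This tool is usually used to improve the performance of the
model, since it brings the data closer to a normal distribution.
Remember that after the model has made its predictions, you must
return to the original scale by inverting the transformation.
With $y^{(\lambda)}$ the transformed value and $y$ the original
one, the inverse follows the formula below:
$$ y = \begin{cases} \left(\lambda\, y^{(\lambda)} + 1\right)^{1/\lambda}, & \lambda \neq 0 \\ \exp\left(y^{(\lambda)}\right), & \lambda = 0 \end{cases} $$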
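Both directions of the transformation can be sketched in plain Python for a fixed lambda. This is only an illustration of the standard Box-Cox definition: it does not search for the optimal lambda (that is what Scipy's boxcox function does), and the prices and function names are hypothetical.

```python
import math

def boxcox_transform(values, lam):
    """Box-Cox transform for a given lambda; values must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def boxcox_inverse(transformed, lam):
    """Undo boxcox_transform, returning values on the original scale."""
    if lam == 0:
        return [math.exp(t) for t in transformed]
    return [(lam * t + 1) ** (1 / lam) for t in transformed]

# Hypothetical prices in BRL/m3.
prices = [680.0, 1200.0, 1850.0]

transformed = boxcox_transform(prices, lam=0.5)
recovered = boxcox_inverse(transformed, lam=0.5)
print(transformed, recovered)
```

The round trip returns the original prices (up to floating point error), which is the step needed to bring model predictions back to BRL/m3.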
Looking for correlated lags
To be predictable, a series with a single variable must have auto
correlation; that is, the current period must be explained by
an earlier period (a lag).
As this series has weekly periods and 1 year is approximately 52
weeks, I will use the auto correlation function over a period
of 60 lags to verify the correlations of the current period with
these lags:
Analyzing the auto correlation chart above, it seems that all
lags could be used to create forecasts for future events, since
they have a positive correlation close to 1 and are also outside
the confidence interval; but this characteristic is typical of a
non-stationary series.
Another very important function is the partial auto correlation
function, where the effect of the intermediate lags on the
current period is removed and only the direct effect of the lag
being analyzed remains. For instance, the partial auto
correlation of the fourth lag removes the effects of the first,
second and third lags.
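One common way to compute the partial auto correlation is the Durbin-Levinson recursion over the sample auto correlations. The sketch below is my own minimal version in plain Python, run on a simulated AR(1) series (coefficient 0.7, an arbitrary choice), so only the first lag should stand out; it is not necessarily how a plotting library computes its chart.

```python
import random

def autocorrelations(series, max_lag):
    """Sample auto correlations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    return [sum((series[t] - mean) * (series[t - k] - mean)
                for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def partial_autocorrelations(series, max_lag):
    """Partial auto correlations for lags 1..max_lag (Durbin-Levinson)."""
    r = autocorrelations(series, max_lag)
    pacf, prev = [], []  # prev holds the coefficients of the order k-1 fit
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = r[1]
        else:
            num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

# Simulated AR(1) series: each value depends directly only on the previous one.
random.seed(0)
series = [0.0]
for _ in range(4999):
    series.append(0.7 * series[-1] + random.gauss(0, 1))

pacf = partial_autocorrelations(series, max_lag=4)
print([round(p, 2) for p in pacf])
```

The plain auto correlation of this series decays slowly over many lags, while the partial version drops to roughly zero after lag 1, which is why the two charts are read together.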
Below is the partial auto correlation graph:
As can be seen, almost no lag has an effect on the current
period. But, as demonstrated earlier, the series without
differencing is not stationary, so we will now plot these two
functions for the series with one difference to see how it
changes:
The auto correlation plot changed significantly, showing that the
series has a significant correlation only at the first lag and a
seasonal effect with negative correlation around the 26th lag
(26 weeks, half a year).
To create forecasts, we must pay attention to an extremely
important detail about correlated lags: there should be a reason
behind the correlation, because if there is no logical reason,
it may be only chance, and the correlation can disappear when
you include more data.
Another important point is that the auto correlation and partial
auto correlation graphs are very sensitive to outliers, so it's
important to analyze the time series itself and compare it with
the two auto correlation charts.
In this example the first lag has a high correlation with the
current period, since historically prices do not vary
significantly from one week to the next. Likewise, the 26th lag
presents a negative correlation, indicating a tendency contrary
to the current period, probably due to the different periods of
supply and demand over the course of a year.
As the inflation-adjusted series has become stationary, we will
use it to create our forecasts. Below are the auto correlation
and partial auto correlation graphs of the adjusted series:
We will use only the first two lags as predictors for the
auto-regressive model.
For more information, Duke University professor Robert
Nau’s website is one of the best related to this subject.
Metrics to evaluate the model
To check whether the forecasts are close to the actual values,
one must measure the error; the error (or residual) in this case
is basically the actual value minus the predicted value.
The error on the training data is evaluated to verify whether the
model fits well, and the model is then validated by checking the
error on the test data (data that was not “seen” by the model).
Checking the error is very important to verify whether your model
is overfitting or underfitting when you compare the training data
with the test data.
Below are the key metrics used to evaluate time series models:
The mean forecast error (MFE) is nothing more than the average of
the errors of the evaluated series; the values can be positive or
negative. This metric shows whether the model tends to make
predictions above the real value (negative errors) or below the
real value (positive errors), so it can also be said that the
mean forecast error is the bias of the model.
The mean absolute error (MAE) is very similar to the mean
forecast error mentioned above; the only difference is that
errors with a negative value are transformed into positive ones
before the mean is calculated.
This metric is widely used in time series, since there are cases
where the negative errors cancel the positive errors and give the
impression that the model is accurate. With the MAE this does
not happen, because this metric shows how far the forecast is
from the real values, regardless of whether it is above or below.
See the case below:
Result: The error of each model value looks like this: [-4 -2 0 2 4]
The MFE error was 0.0, the MAE error was 2.4
The mean squared error (MSE) places more weight on larger errors,
because each individual error value is squared and then the mean
is calculated. Thus, this metric is very sensitive to outliers
and puts a lot of weight on predictions with more significant
errors.
Unlike the MAE and MFE, the MSE values are in squared units
rather than the units of the model.
The root mean squared error (RMSE) is simply the square root of
the MSE, which returns the error to the unit of measure of the
model (BRL/m3). It is widely used in time series because it is
more sensitive to bigger errors, due to the squaring in the MSE
from which it originates.
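The mean absolute percentage error (MAPE) is another interesting
metric, generally used in management reports because the error
is measured in percentage terms, so the error of a product X can
be compared with the error of a product Y.
The calculation of this metric takes the absolute value of the
error divided by the actual price, then the mean is calculated.
With $y_t$ the actual price and $\hat{y}_t$ the forecast:
$$ \mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right| $$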
Let’s create a function to evaluate the errors of training and test
data with several evaluation metrics:
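A function along these lines could look like the sketch below, written in plain Python. It assumes the error is defined as actual minus predicted, as described in the text; the function name, the dictionary layout and the example values (chosen to give the errors -4, -2, 0, 2, 4) are my own.

```python
import math

def forecast_errors(actual, predicted):
    """Return MFE, MAE, MSE, RMSE and MAPE for two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and aligned")
    errors = [a - p for a, p in zip(actual, predicted)]  # actual minus predicted
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "MFE": sum(errors) / n,                 # bias: sign shows over/under forecast
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        # MAPE needs non-zero actual values.
        "MAPE": sum(abs(e / a) for e, a in zip(errors, actual)) / n * 100,
    }

# Hypothetical values whose errors are [-4, -2, 0, 2, 4].
actual = [100.0, 100.0, 100.0, 100.0, 100.0]
predicted = [104.0, 102.0, 100.0, 98.0, 96.0]

metrics = forecast_errors(actual, predicted)
print(metrics)
```

Calling it once on the training predictions and once on the test predictions gives the side-by-side comparison used to spot overfitting or underfitting.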
Checking the residual values
It's not enough to create the model and check the error values
according to the chosen metric; you must also analyze the
characteristics of the residual itself, as there are cases where
the model cannot capture the information necessary to make a
good forecast, leaving in the error information that should be
used to improve the forecast.
To verify this residual we will check:
• Actual vs. predicted values (sequential chart);
• Residual vs. predicted values (scatter plot):
It is very important to analyze this graph, since in it we can
spot patterns that tell us whether some modification is needed
in the model. Ideally, the error is spread evenly along the
forecast sequence.
• QQ plot of the residual (scatter plot):
In short, this is a graph that shows where the residual should
theoretically be distributed, following a Gaussian distribution,
versus where it actually is.
• Residual auto correlation (sequential chart):
There should be no values outside the confidence margin;
otherwise the model is leaving information out.
We need to create another function to plot these graphs:
Most basic ways to make a forecast
From now on we will create some price forecast models for
hydrous ethanol. Below are the steps that we will follow for
each model:
• Create predictions on the training data and subsequently
validate on the test data;
• Check the error of each model according to the metrics
mentioned above;
• Plot the model with the residual comparatives.
Let’s go to the models:
Naive approach:
The simplest way to make a forecast is to use the value of the
previous period. In some cases this is the best approach
available, with a lower error than other forecast methods.
Generally, this methodology doesn't work well for predicting
many periods ahead, as the errors tend to increase in relation to
the real values.
Many people also use this approach as a baseline to improve on
with more complex models.
Below we will use the training and test data to make the
forecasts:
The QQ chart shows that some residuals are larger (up and down)
than they theoretically should be; these are the so-called
outliers. There is also still a significant auto correlation at
the first, sixth and seventh lags, which could be used to improve
the model.
In the same way, we will now make the forecast on the test data.
The first value of the predicted series will be the last of the
training data; then the forecasts are updated step by step with
the actual values of the test data, and so on:
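The walk-forward procedure described above can be sketched in plain Python as follows. The prices are made up, and `naive_walk_forward` is a hypothetical name for illustration.

```python
def naive_walk_forward(train, test):
    """One-step-ahead naive forecasts: each prediction is the last observed value."""
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(history[-1])  # forecast = previous observed value
        history.append(actual)           # only then reveal the real value
    return predictions

# Made-up prices in BRL/m3.
train = [1500.0, 1520.0, 1510.0]
test = [1530.0, 1525.0, 1540.0]

preds = naive_walk_forward(train, test)
print(preds)
```

The first prediction is the last training value, and every later prediction is simply the test value from one step before, so the forecast never looks at the value it is trying to predict.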
The RMSE and MAE errors were similar to those on the training
data. The residuals in the QQ chart are more in line with what
they theoretically should be, probably because there are fewer
sample values than in the training data.
In the chart comparing the residuals with the predicted values,
there is a tendency for the errors to increase in absolute value
when prices increase; perhaps a logarithmic adjustment would
reduce this error expansion. Finally, the residual correlation
graph shows that there is still room for improvement, as there
is a strong correlation at the first lag, where a regression
based on the first lag could probably be added to improve
predictions. The next model is the simple mean:
Simple Mean:
Another way to make predictions is to use the series mean.
Usually this form of forecasting is good when the values
oscillate closely around the mean, with constant variance and no
uptrend or downtrend, but there are better methods, which can
make the forecast using seasonal patterns, among others.
This model uses the mean from the beginning of the data until the
period before the one being forecast, and the window expands
period by period until the end of the data. In the end, the line
tends to become straight. We will now compare the error of this
model with the first model:
On the test data, I will continue using the mean from the
beginning of the training data and expand it with the values
that are added from the test data:
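An expanding mean carried over from the training data into the test data can be sketched like this. The numbers are invented and `expanding_mean_forecast` is a hypothetical helper.

```python
def expanding_mean_forecast(train, test):
    """Forecast each test value with the mean of everything observed before it."""
    history = list(train)
    total = sum(history)
    predictions = []
    for actual in test:
        predictions.append(total / len(history))  # mean of all past values
        history.append(actual)                    # expand the window by one period
        total += actual
    return predictions

# Invented values, for illustration only.
train = [10.0, 20.0, 30.0]
test = [40.0, 50.0]

preds = expanding_mean_forecast(train, test)
print(preds)
```

Because every new value is averaged with the whole history, the forecast reacts more and more slowly as the series grows, which is why the line flattens out.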
The simple mean model failed to capture relevant information
from the series, as can be seen in the Real vs. Forecast graph,
and also in the correlation and Residual vs. Predicted graphs.
Simple Moving Average:
The moving average is an average calculated over a given window
(5 days, for example) that moves along the series and is always
recalculated over that same window length; in this case, we will
always use the average of the last 5 days to predict the
value of the next day.
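A one-step-ahead moving average forecast can be sketched in plain Python as below. The window of 5 matches the example in the text; the values and the function name are made up for illustration.

```python
def moving_average_forecast(train, test, window=5):
    """Forecast each test value with the mean of the last `window` observations."""
    if len(train) < window:
        raise ValueError("need at least `window` training observations")
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(sum(history[-window:]) / window)
        history.append(actual)  # slide the window forward by one period
    return predictions

# Invented values, for illustration only.
train = [10.0, 20.0, 30.0, 40.0, 50.0]
test = [60.0, 70.0]

preds = moving_average_forecast(train, test, window=5)
print(preds)
```

Unlike the expanding mean, observations older than the window are dropped entirely, so the forecast follows recent levels more closely.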
The error was lower than with the simple mean, but still above
the naive model. Below is the model on the test data:
Similarly to the training data, the moving-average model is
better than the simple mean, but it still does not beat the
naive model.
The predictions show auto correlation at two lags, and the
error has a very high variance in relation to the predicted
values.
Exponential Moving Average:
The simple moving average model described above has the
property of treating the last X observations equally and
completely ignoring all previous observations. Intuitively, past
data should be discounted more gradually; for example, the
most recent observation should theoretically be slightly more
important than the second most recent, and the second most
recent should have a little more importance than the third most
recent, and so on. The Exponential Moving Average (EMA) model
does this.
With α (alpha) a constant between 0 and 1, we calculate the
forecast with the following formula:
$$ F_t = F_{t-1} + \alpha\,(Y_{t-1} - F_{t-1}) $$
Here $F_t$ is the forecast for period t and $Y_{t-1}$ is the
actual value of the previous period. The first value of the
forecast is the respective actual value; the other values are
updated by α times the difference between the actual value and
the forecast of the previous period. When α is zero we have a
constant based on the first value of the forecast; when α is 1 we
have the naive approach, because the result is the actual value
of the previous period.
Below is a chart with several values of α:
The average age of the data in the EMA forecast is 1/α periods.
For example, when α = 0.5 the lag is equivalent to 2 periods;
when α = 0.2 the lag is 5 periods; when α = 0.1 the lag is 10
periods, and so on.
In this model, we will arbitrarily use an α of 0.50, but you can
do a grid search to look for the α that reduces the error on the
training data and also on the validation data. Let's see how it
looks:
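The update rule described in the text (previous forecast plus α times the previous error) can be sketched in plain Python. I am assuming the usual simple exponential smoothing recursion, with the first forecast set to the first actual value; the prices and the function name are hypothetical.

```python
def exponential_moving_average_forecast(series, alpha):
    """Return forecasts aligned with `series`; forecasts[t] predicts series[t]."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must be between 0 and 1")
    forecasts = [series[0]]  # first forecast is the respective actual value
    for t in range(1, len(series)):
        prev_forecast = forecasts[-1]
        prev_error = series[t - 1] - prev_forecast
        forecasts.append(prev_forecast + alpha * prev_error)
    return forecasts

# Hypothetical prices.
prices = [100.0, 110.0, 105.0, 120.0]

smoothed = exponential_moving_average_forecast(prices, alpha=0.5)
naive_like = exponential_moving_average_forecast(prices, alpha=1.0)
constant = exponential_moving_average_forecast(prices, alpha=0.0)
print(smoothed, naive_like, constant)
```

The two extreme values of α reproduce the limiting cases mentioned above: α = 1 gives the naive forecast and α = 0 keeps the first value forever. A grid search would simply call this function for several α values and keep the one with the lowest error.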
The error of this model was similar to the error of the moving
average; however, we still have to validate the model on the
test data:
On the validation data, the error is so far the second best of
the models that we have trained, but the characteristics of the
residual graphs are very similar to those of the 5-day moving
average model.
Auto-regressive:
An auto-regressive model is basically a linear regression on
significantly correlated lags; the autocorrelation and partial
autocorrelation charts should initially be plotted to verify
whether there is anything relevant.
Below are the autocorrelation and partial autocorrelation charts
of the training series, which show the signature of an
auto-regressive model with 2 significantly correlated lags:
Below we will create the model based on the training data and,
after obtaining the coefficients of the model, we will apply
them to the actual values of the test data:
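As a rough, library-free sketch of this idea, the code below fits an auto-regressive model with 2 lags by ordinary least squares and then applies the fixed coefficients step by step to held-out data. The series is simulated (true coefficients 0.5 and 0.3, chosen arbitrarily), and all function names are my own; a statistics library would normally do the fitting.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ar(series, lags=2):
    """Least-squares fit of y_t = c + phi_1*y_{t-1} + ... ; returns [c, phi_1, ...]."""
    rows = [[1.0] + [series[t - i] for i in range(1, lags + 1)]
            for t in range(lags, len(series))]
    targets = series[lags:]
    k = lags + 1
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def ar_walk_forward(coefs, train, test):
    """One-step-ahead forecasts on `test` using fixed coefficients."""
    lags = len(coefs) - 1
    history = list(train)
    predictions = []
    for actual in test:
        recent = history[-1:-lags - 1:-1]  # y_{t-1}, y_{t-2}, ...
        predictions.append(coefs[0] + sum(c * y for c, y in zip(coefs[1:], recent)))
        history.append(actual)             # reveal the real value afterwards
    return predictions

# Simulated AR(2) series, for illustration only.
random.seed(1)
series = [100.0, 100.0]
for _ in range(2998):
    series.append(20.0 + 0.5 * series[-1] + 0.3 * series[-2] + random.gauss(0, 1))

train, test = series[:2500], series[2500:]
coefs = fit_ar(train, lags=2)
preds = ar_walk_forward(coefs, train, test)
print([round(c, 2) for c in coefs])
```

The coefficients are estimated once on the training data and never refitted; only the lagged inputs are refreshed with the actual test values at each step.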
In this model the error was the lowest of all the models that we
trained. Now let's use its coefficients to do the step-by-step
forecast on the test data:
Note that on the test data the error did not remain stable, and
was even worse than the naive model. Note in the chart that the
forecasts are almost always below the actual values; the bias
measurement shows that the real values are on average BRL 50.19
above the predictions. Maybe tuning some parameters in the
training model would reduce this difference.
To improve these models you can apply several transformations,
such as those explained in this post; you can also add external
variables as a forecast source. However, this is a subject for
another post.
Final considerations
Each time series model has its own characteristics and should be
analyzed individually, so that we can extract as much
information as possible to make good predictions, reducing the
uncertainty of the future.
Checking for stationarity, transforming the data, creating the
model on the training data, validating on the test data and
checking the residuals are key steps to create a good time series
forecast.

Measures and Strengths of AssociationRemember that while w.docx
What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...
What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...
What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...
Smarten Augmented Analytics

Similar to Time series basics (20)

Logistic regression analysis
Logistic regression analysisLogistic regression analysis
Logistic regression analysis
Regression Analysis of SAT Scores Final
Regression Analysis of SAT Scores FinalRegression Analysis of SAT Scores Final
Regression Analysis of SAT Scores Final
Multivariate time series
Multivariate time seriesMultivariate time series
Multivariate time series
MSc Finance_EF_0853352_Kartik Malla
MSc Finance_EF_0853352_Kartik MallaMSc Finance_EF_0853352_Kartik Malla
MSc Finance_EF_0853352_Kartik Malla
Multiple linear regression
Multiple linear regressionMultiple linear regression
Multiple linear regression
Building a Regression Model using SPSS
Building a Regression Model using SPSSBuilding a Regression Model using SPSS
Building a Regression Model using SPSS
Distribution of EstimatesLinear Regression ModelAssume (yt,.docx
Distribution of EstimatesLinear Regression ModelAssume (yt,.docxDistribution of EstimatesLinear Regression ModelAssume (yt,.docx
Distribution of EstimatesLinear Regression ModelAssume (yt,.docx
1 Assignment Quantitative Methods 2 The following ass.docx
1 Assignment Quantitative Methods 2 The following ass.docx1 Assignment Quantitative Methods 2 The following ass.docx
1 Assignment Quantitative Methods 2 The following ass.docx
699 Final Report
699 Final Report699 Final Report
699 Final Report
New Hypothesis Testing Method
New Hypothesis Testing MethodNew Hypothesis Testing Method
New Hypothesis Testing Method
What is the Paired Sample T Test and How is it Beneficial to Business Analysis?
What is the Paired Sample T Test and How is it Beneficial to Business Analysis?What is the Paired Sample T Test and How is it Beneficial to Business Analysis?
What is the Paired Sample T Test and How is it Beneficial to Business Analysis?
Your Paper was well written, however; I need you to follow the f
Your Paper was well written, however; I need you to follow the fYour Paper was well written, however; I need you to follow the f
Your Paper was well written, however; I need you to follow the f
Topic 5 (multiple regression)
Topic 5 (multiple regression)Topic 5 (multiple regression)
Topic 5 (multiple regression)
Tv watching time project
Tv watching time projectTv watching time project
Tv watching time project
Applications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationshipApplications of regression analysis - Measurement of validity of relationship
Applications of regression analysis - Measurement of validity of relationship
Time series
Time seriesTime series
Time series
Interpreting Regression Results - Machine Learning
Interpreting Regression Results - Machine LearningInterpreting Regression Results - Machine Learning
Interpreting Regression Results - Machine Learning
Measures and Strengths of AssociationRemember that while w.docx
Measures and Strengths of AssociationRemember that while w.docxMeasures and Strengths of AssociationRemember that while w.docx
Measures and Strengths of AssociationRemember that while w.docx
What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...
What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...
What is the Holt-Winters Forecasting Algorithm and How Can it be Used for Ent...
Case Study of Petroleum Consumption With R Code
Case Study of Petroleum Consumption With R CodeCase Study of Petroleum Consumption With R Code
Case Study of Petroleum Consumption With R Code

More from akshay ghanwat

kundankulan nuclear power plant 2
kundankulan nuclear power plant 2kundankulan nuclear power plant 2
kundankulan nuclear power plant 2
akshay ghanwat
Kudankulam nuclear power plant
Kudankulam nuclear power plantKudankulam nuclear power plant
Kudankulam nuclear power plant
akshay ghanwat
central Railway project report
central Railway project report central Railway project report
central Railway project report
akshay ghanwat
akshay ghanwat
manufacturing and desighn of cnc milling machine
manufacturing and desighn of cnc milling machinemanufacturing and desighn of cnc milling machine
manufacturing and desighn of cnc milling machine
akshay ghanwat
Machining process in mechanical engineering
Machining process in mechanical engineeringMachining process in mechanical engineering
Machining process in mechanical engineering
akshay ghanwat

More from akshay ghanwat (7)

kundankulan nuclear power plant 2
kundankulan nuclear power plant 2kundankulan nuclear power plant 2
kundankulan nuclear power plant 2
Kudankulam nuclear power plant
Kudankulam nuclear power plantKudankulam nuclear power plant
Kudankulam nuclear power plant
central Railway project report
central Railway project report central Railway project report
central Railway project report
manufacturing and desighn of cnc milling machine
manufacturing and desighn of cnc milling machinemanufacturing and desighn of cnc milling machine
manufacturing and desighn of cnc milling machine
Machining process in mechanical engineering
Machining process in mechanical engineeringMachining process in mechanical engineering
Machining process in mechanical engineering

Recently uploaded

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
Investigate & Recover / / Crypto_Crimes
Investigate & Recover / / Crypto_CrimesInvestigate & Recover / / Crypto_Crimes
Investigate & Recover / / Crypto_Crimes
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx

Recently uploaded (20)

Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Investigate & Recover / / Crypto_Crimes
Investigate & Recover / / Crypto_CrimesInvestigate & Recover / / Crypto_Crimes
Investigate & Recover / / Crypto_Crimes
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx

Time series basics

  • 1. Basic Principles to Create a Time Series Forecast We are surrounded by patterns: the four seasons in the weather, peak hours in traffic volume, your heartbeat, share prices on the stock market and the sales cycles of certain products. Analyzing time series data is extremely useful for checking these patterns and creating predictions for the future. There are several ways to create these forecasts; in this post I will cover the concepts behind the most basic and traditional methodologies. All code is written in Python, and additional information can be found on my GitHub. Let’s start with the initial condition for analyzing time series: Stationary Series A stationary time series is one whose statistical properties, such as mean, variance and autocorrelation, are relatively constant
  • 2. over time. Therefore, a non-stationary series is one whose statistical properties change over time. Before starting any predictive modeling it is necessary to verify whether these statistical properties are constant. I will explain each of these points below: • Constant mean • Constant variance • Autocorrelation Constant Mean A stationary series has a relatively constant mean over time, with no bullish or bearish trends. A constant mean with small variations around it makes it much easier to extrapolate to the future. There are cases where the variance is small relative to the mean, and the mean itself may then be a good metric for making predictions. The chart below shows a relatively constant mean in relation to the variance over time:
  • 3. In this case, if the series is not stationary, the forecast will not be reliable, because the values deviate significantly from the mean, as can be seen in the chart below: In the chart above, it is clear that there is a bullish trend and the mean is gradually rising. If the average were used to make future forecasts, the error would be significant, since forecast prices would always be below the real price. Constant Variance When the series has constant variance, we have an idea of the typical deviation from the mean. When the variance is not constant (as in the image below), the forecast will probably have bigger errors in certain periods, and these periods will not be predictable; the variance is expected to remain non-constant over time, including in the future. To reduce the variance effect, the logarithmic transformation can be applied. In this case, a power
  • 4. transformation, such as the Box-Cox method, or an inflation adjustment can be used as well. Autocorrelated Series When two variables vary similarly over time relative to their standard deviations, we say that these variables are correlated. For instance, heart disorders tend to increase along with body weight: the greater the weight, the greater the incidence of heart problems. In this case the correlation is positive, and the graph would look something like this: A case of negative correlation would be: the greater the investment in workplace safety measures, the smaller the number of work-related accidents. Here are several examples of scatter plots with different correlation levels: source: Wikipedia
  • 5. Autocorrelation means that certain previous periods are correlated with the current period; the distance between such a period and the current one is called the lag. For instance, in a series with hourly measurements, today’s temperature at 12:00 is very similar to the temperature at 12:00 24 hours earlier. If you compare the temperature variation across this 24-hour time frame, there will be an autocorrelation; in this case, an autocorrelation at the 24th lag. Autocorrelation is a condition for creating forecasts with a single variable, because if there is no correlation you cannot use past values to predict the future. When there are several variables, you can check whether there is a correlation between the dependent variable and the lags of the independent variables. If the period-to-period changes of a series have no autocorrelation, the series behaves like a random walk with unpredictable movements, and the best prediction is usually the value of the previous period. I will use more detailed charts and explanations below. From here on I will analyze the weekly hydrous ethanol prices from Esalq (a price reference for negotiating hydrous ethanol in Brazil); the data can be downloaded here. The price is in Brazilian reais per cubic meter (BRL/m3). Before starting any analysis, let’s split the data into a training set and a test set
  • 6. Splitting the data into training and test sets When creating a time series prediction model, it is crucial to separate the data into two parts: Training set: these data are the main basis for defining the coefficients/parameters of the model; Test set: these data are held out and not seen by the model, and are used to test whether the model works (generally these values are compared using a walk-forward method and the mean error is then measured). The size of the test set is usually about 20% of the total sample, although this percentage depends on the sample size and on how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required. Unlike other prediction methods, such as classification and regression without the influence of time, in time series we cannot split training and test data by random sampling from any part of the data; we must follow the time order of the series, so the training data always come before the test data. In this example of Esalq hydrous ethanol prices we have 856 weeks; we will use the first 700 weeks as the training set and the last 156 weeks (3 years, about 18%) as the test set:
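A chronological split like the one described on this slide can be sketched in a few lines. This is only a minimal illustration: the helper name `chronological_split` is hypothetical, and the list of numbers is placeholder data standing in for the 856 weekly prices, not the real Esalq series.

```python
def chronological_split(series, n_train):
    """Split an ordered series so every training point precedes every test point."""
    if not 0 < n_train < len(series):
        raise ValueError("n_train must be between 1 and len(series) - 1")
    return series[:n_train], series[n_train:]


# Placeholder values standing in for 856 weekly observations (assumption).
prices = [float(week) for week in range(856)]

train, test = chronological_split(prices, n_train=700)
test_share = len(test) / len(prices)
print(len(train), len(test), round(test_share, 3))
```

No shuffling is involved: the slice boundary is the only decision, which is what keeps future information out of the training set.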
  • 7. From now on we will use only the training set for the analysis; the test set will be used only to validate the predictions we make. Every time series can be broken down into 3 parts: trend, seasonality and residuals, the last being what remains after removing the first two parts from the series. The separation of these parts is shown below: Clearly the series has an uptrend, with peaks between the end and the beginning of each year and minimums between April and September (the beginning of sugarcane crushing in the center-south of Brazil).
  • 8. However, it is advisable to use statistical tests to confirm whether the series is stationary. We will use two tests: the Dickey-Fuller test and the KPSS test. First, the Dickey-Fuller test. I will use a P-value threshold of 5%; that is, if the P-value is below 5%, the series is statistically stationary. In addition, the test statistic can be compared with the critical values at 1%, 5% and 10%; if the test statistic is below the chosen critical value, the series is stationary: In this case, the Dickey-Fuller test indicates that the series is not stationary (the P-value is 36% and the test statistic is above the 5% critical value). Now we analyze the series with the KPSS test. Unlike the Dickey-Fuller test, the KPSS test assumes that the series is stationary, and rejects this only if the P-value is less than 5% or the test statistic is greater than the chosen critical value:
  • 9. Confirming the Dickey-Fuller test, the KPSS test also shows that the series is not stationary, because the P-value is at 1% and the test statistic is above every critical value. Next I will demonstrate ways to make a series stationary. Making the series stationary Differencing Differencing is used to remove trend signals and also to reduce the variance; it is simply the difference between the value of period T and the value of the previous period T-1. To make this easier to understand, below we take only a fraction of the ethanol prices for better visualization. Note that from May 2005 prices rise until mid-May 2006; these weekly rises accumulate, creating an uptrend, so in this case we have a non-stationary series.
  • 10. When the first difference is taken (graph below), we remove the cumulative effect of the series and show only the variation of period T against period T-1 throughout the whole series. So if the price was BRL 800.00 and changed to BRL 850.00 in the next period, the value of the difference is BRL 50.00, and if the following value is BRL 860.00, the difference is BRL 10.00. Normally only one round of differencing is necessary to make a series stationary, but if necessary a second one can be applied; in that case, the differencing is done on the values of the first difference (cases needing more than 2 rounds are rare). Using the same example, for the second difference we take the first difference at T minus the first difference at T-1: BRL 2.9 - BRL 5.5 = -BRL 2.6, and so on.
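First and second differences are easy to compute by hand, so a small sketch may help. The function name `difference` is a hypothetical helper, and the three prices are the illustrative BRL 800, 850 and 860 values from this slide rather than the actual series.

```python
def difference(series, order=1):
    """Return the series differenced `order` times (value at T minus value at T-1)."""
    result = list(series)
    for _ in range(order):
        result = [curr - prev for prev, curr in zip(result, result[1:])]
    return result


prices = [800.0, 850.0, 860.0]  # illustrative values only

first_diff = difference(prices)            # changes between consecutive periods
second_diff = difference(prices, order=2)  # changes of those changes
print(first_diff)   # [50.0, 10.0]
print(second_diff)  # [-40.0]
```

Each round of differencing shortens the series by one observation, which is worth remembering when aligning the result with dates.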
  • 11. Let’s run the Dickey-Fuller test to see whether the series is stationary after the first difference: In this case we confirm that the series is stationary: the P-value is practically zero, and the test statistic is far below the critical values. In the next example we will try to make a series stationary using an inflation adjustment. Inflation Adjustment
  • 12. Prices are relative to the time at which they were traded. In 2002 the price of ethanol was BRL 680.00; if the product were traded at this price nowadays, many mills would certainly be closed, as it is a very low price. To try to make the series stationary, I will adjust the whole series to current values using the IPCA index (the Brazilian CPI), accumulating from the end of the training period (Apr/2016) back to the beginning of the study; the source of the data is the IBGE website. Now let’s see how the series looks and whether it became stationary.
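The idea of accumulating inflation backwards from the last period can be sketched as follows. This is an assumption-laden illustration: `adjust_for_inflation` is a hypothetical helper, the flat prices and the 10% rates are made-up numbers, and the real adjustment would use the IPCA figures from IBGE at whatever frequency they are published.

```python
def adjust_for_inflation(prices, inflation_rates):
    """Express every price in the money of the last period.

    inflation_rates[t] is the inflation between period t and period t + 1,
    so it must have exactly len(prices) - 1 entries.
    """
    if len(inflation_rates) != len(prices) - 1:
        raise ValueError("need one inflation rate per step between periods")
    adjusted = list(prices)
    factor = 1.0
    # Walk backwards, compounding the inflation that happened after each period.
    for t in range(len(prices) - 2, -1, -1):
        factor *= 1.0 + inflation_rates[t]
        adjusted[t] = prices[t] * factor
    return adjusted


nominal = [100.0, 100.0, 100.0]  # made-up nominal prices
rates = [0.10, 0.10]             # made-up 10% inflation per step
adjusted = adjust_for_inflation(nominal, rates)
print([round(p, 2) for p in adjusted])
```

The last observation is left untouched because it is already expressed in end-of-period money; older prices are scaled up the most.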
  • 13. As can be seen, the uptrend has disappeared and only the seasonal oscillations remain; the Dickey-Fuller test also confirms that the series is now stationary. Just out of curiosity, see below the graph of the inflation-adjusted price against the original series. Reducing variance Logarithm The logarithm is usually used to transform series with exponential growth into series with more linear growth. In this example we will use the natural logarithm (ln), whose base is e ≈ 2.718; this type of logarithm is widely used in economic models. The difference between ln-transformed values is approximately equal to the percentage variation of the values of the original series, which makes it a valid basis for reducing the variance in series with different price levels. See the example below: If we have a product whose price went from BRL 50.00 to BRL 52.50 in 2000, and some years later (2019) the price was already BRL 100.00 and changed to BRL 105.00, the absolute
  • 14. difference between prices is BRL 2.50 and BRL 5.00 respectively; however, the percentage difference in both cases is 5%. When we apply ln to these prices we have: ln(52.50) − ln(50.00) = 3.961 − 3.912 = 0.049, or 4.9%; in the same way, for the second price sequence we have: ln(105) − ln(100) = 4.654 − 4.605 = 0.049, or 4.9%. In this example, we reduce the variation of the values by bringing almost everything to the same basis. Below is the same example: Result: The percentage variation of the first example is 4.9 and the second is 4.9 Below is the table comparing the percentage variation of X with the variation of ln(X): Source
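The relationship between percentage change and the difference of natural logarithms can be checked with a short sketch using the two price pairs from the example. The helper names `pct_change` and `log_change` are hypothetical.

```python
import math


def pct_change(old, new):
    """Ordinary percentage change from old to new."""
    return (new - old) / old * 100.0


def log_change(old, new):
    """Difference of natural logs, expressed in percent."""
    return (math.log(new) - math.log(old)) * 100.0


pairs = [(50.0, 52.5), (100.0, 105.0)]
for old, new in pairs:
    print(f"{old} -> {new}: pct={pct_change(old, new):.1f}% log={log_change(old, new):.1f}%")
```

Both pairs give the same log difference even though the absolute changes are BRL 2.50 and BRL 5.00, which is why the transform puts different price levels on a comparable basis. The approximation gets worse as the percentage change grows.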
  • 15. Let’s plot the comparison between the original series and the ln-transformed series: Box-Cox Transformation (Power Transform) The Box-Cox transformation is also a way to transform a series, where the lambda (λ) value is the parameter of the transformation. In short, this function combines several power transformation functions, and we search for the value of lambda that transforms the series so that its distribution is closest to a normal (Gaussian) distribution. A condition for using this transformation is that the series has only positive values. The formula is: Below I will plot the original series with its distribution, and after that the series transformed with the optimal value of lambda and its new distribution. To find the value of lambda we will use the boxcox function of the SciPy library, which returns the transformed series and the ideal lambda:
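For reference, the standard Box-Cox definition is y = (x^λ − 1) / λ when λ ≠ 0 and y = ln(x) when λ = 0. The sketch below applies that definition and its inverse for a lambda chosen by hand; it does not search for the optimal lambda the way SciPy's boxcox function does, and the helper names and sample prices are assumptions for illustration.

```python
import math


def boxcox_value(x, lam):
    """Box-Cox transform of a single positive value for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inv_boxcox_value(y, lam):
    """Invert the transform to return to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


prices = [680.0, 1200.0, 1850.0]  # made-up positive values
lam = 0.5                         # hand-picked lambda, not an optimized one

transformed = [boxcox_value(p, lam) for p in prices]
restored = [inv_boxcox_value(y, lam) for y in transformed]
print([round(y, 3) for y in transformed])
print([round(p, 3) for p in restored])
```

The round trip matters in practice: forecasts made on the transformed scale have to be inverted with the same lambda before errors are measured in BRL/m3.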
  • 16. Below is an interactive chart where you can change the lambda value and check the change in the chart: This tool is usually used to improve the performance of the model, since it makes the distributions closer to normal. Remember that after finishing the prediction, you must return to the original scale by inverting the transformation according to the formula below: Looking for correlated lags To be predictable, a series with a single variable must have autocorrelation; that is, the current period must be explained by an earlier period (a lag).
  • 17. As this series has weekly periods and 1 year is approximately 52 weeks, I will use the autocorrelation function over 60 lags to check the correlation of the current period with these lags. Analyzing the autocorrelation chart above, it seems that all lags could be used to create forecasts, since they have a positive correlation close to 1 and are outside the confidence interval, but this is characteristic of a non-stationary series. Another very important function is the partial autocorrelation function, which removes the effect of the intermediate lags so that only the effect of the analyzed lag on the current period remains. For instance, the partial autocorrelation of the fourth lag removes the effects of the first, second and third lags. Below is the partial autocorrelation graph:
  • 18. As can be seen, almost no lag has an effect on the current period, but as demonstrated earlier, the series without differencing is not stationary. We will now plot these two functions for the series with one round of differencing to see how they behave:
  • 19. The autocorrelation plot changed significantly, showing that the series has a significant correlation only at the first lag, plus a seasonal effect with negative correlation around the 26th week (half a year). When creating forecasts, we must pay attention to an extremely important detail about correlated lags: there should be a reason behind the correlation, because without a logical reason it may be only chance, and the correlation can disappear when you include more data.
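To make the notion of correlation at a given lag concrete, here is a minimal sketch of the sample autocorrelation. The function name is hypothetical, and the input is a synthetic sine wave with a 52-step period chosen to mimic a yearly cycle in weekly data; it is not the ethanol series, and the sketch does not compute confidence intervals or the partial autocorrelation.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Synthetic seasonal series: 10 cycles of 52 steps each (assumption).
seasonal = [math.sin(2 * math.pi * t / 52) for t in range(520)]

acf_half_cycle = autocorrelation(seasonal, 26)
acf_full_cycle = autocorrelation(seasonal, 52)
print(round(acf_half_cycle, 2), round(acf_full_cycle, 2))
```

On this synthetic series the half-cycle lag comes out strongly negative and the full-cycle lag strongly positive, the same kind of signature the slide describes for the 26th lag.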
  • 20. Another important point is that the autocorrelation and partial autocorrelation graphs are very sensitive to outliers, so it is important to analyze the time series itself and compare it with the two autocorrelation charts. In this example the first lag has a high correlation with the current period, since prices historically do not vary significantly from one week to the next; likewise, the 26th lag shows a negative correlation, indicating a tendency opposite to the current period, probably due to the different periods of supply and demand over the course of a year. As the inflation-adjusted series has become stationary, we will use it to create our forecasts. Below are the autocorrelation and partial autocorrelation graphs of the adjusted series:
  • 21. We will use only the first two lags as predictors for autoregressive series. For more information, Duke University professor Robert Nau’s website is one of the best on this subject. Metrics to evaluate the model To analyze whether the forecasts are close to the actual values, one must measure the error;
  • 22. the error (or residual) in this case is basically Yreal − Ypred. The error on the training data is evaluated to verify whether the model is accurate, and the model is validated by checking the error on the test data (data that was not “seen” by the model). Comparing the error on the training data with the error on the test data is very important to verify whether your model is overfitting or underfitting. Below are the key metrics used to evaluate time series models: MEAN FORECAST ERROR (BIAS) This is nothing more than the average of the errors of the evaluated series; the values can be positive or negative. This metric indicates whether the model tends to make predictions above the real value (negative errors) or below the real value (positive errors), so the mean forecast error can also be called the bias of the model. MAE: MEAN ABSOLUTE ERROR This metric is very similar to the mean forecast error above; the only difference is that errors with a negative value are made positive before the mean is calculated.
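The two metrics just described can be sketched directly from their definitions, with the error taken as the real value minus the predicted value. The function names and the five data points are illustrative assumptions, chosen so that positive and negative errors cancel out.

```python
def mean_forecast_error(actual, predicted):
    """Average of (actual - predicted); the sign shows the direction of the bias."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|; errors cannot cancel each other."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


actual = [10, 10, 10, 10, 10]     # made-up real values
predicted = [14, 12, 10, 8, 6]    # made-up forecasts; errors are -4, -2, 0, 2, 4

mfe = mean_forecast_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(mfe, mae)  # 0.0 2.4
```

A mean forecast error of zero here says only that the model is unbiased on average; the MAE of 2.4 shows the forecasts are still off by a noticeable amount.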
  • 23. This metric is widely used in time series, since there are cases where negative errors cancel positive errors and give the impression that the model is accurate. With the MAE this does not happen, because the metric shows how far the forecast is from the real values, regardless of whether it is above or below. See the case below: Result: The error of each model value looks like this: [-4 -2 0 2 4] The MFE error was 0.0, the MAE error was 2.4 MSE: MEAN SQUARED ERROR This metric places more weight on larger errors because each individual error value is squared before the mean is calculated. Thus, this metric is very sensitive to outliers and puts a lot of weight on predictions with more significant errors. Unlike the MAE and MFE, the MSE values are in squared units rather than the units of the model. RMSE: ROOT MEAN SQUARED ERROR This metric is simply the square root of the MSE, which returns the error to the unit of measure of the model (BRL/m3). It is widely used in time series because it is more sensitive to bigger errors, due to the squaring in its calculation. MAPE: MEAN ABSOLUTE PERCENTAGE ERROR
  • 24. This is another interesting metric, generally used in management reports because the error is measured in percentage terms, so the error of a product X can be compared with the error of a product Y. The calculation takes the absolute value of the error divided by the actual value, and then the mean is calculated: Let’s create a function to evaluate the errors of the training and test data with several evaluation metrics: Checking the residual values It is not enough to create the model and check the error values according to the chosen metric; you must also analyze the characteristics of the residual itself, as there are cases where the model cannot capture the information necessary to make a good forecast, leaving in the error information that should be used to improve the forecast. To verify this residual we will check: • Actual vs. predicted values (sequential chart); • Residual vs. predicted values (scatter plot):
  • 25. It is very important to analyze this graph, since it can reveal patterns that tell us whether the model needs some modification. Ideally, the errors are spread evenly along the forecast sequence, with no visible pattern.
    • QQ plot of the residual (scatter plot): In short, this graph compares where the residuals should theoretically fall under a Gaussian distribution with where they actually fall.
    • Residual autocorrelation (sequential chart): No values should fall outside the confidence margin; otherwise the model is leaving information out.
    We need to create another function to plot these graphs:
    Most basic ways to make a forecast
    From now on we will create some models to forecast the price of hydrous ethanol. These are the steps we will follow for each model:
    • Create the prediction on the training data and then validate it on the test data;
    • Check the error of each model according to the metrics mentioned above;
  • 26. • Plot the model with the residual comparison charts.
    Let’s go to the models:
    Naive approach:
    The simplest way to make a forecast is to use the value of the previous period. In some cases this is the best available approach, with a lower error than other forecasting methodologies. Generally, it doesn’t work well for predicting many periods ahead, as the errors tend to grow relative to the real values. Many people also use this approach as a baseline to improve on with more complex models.
    Below we will use the training and test data to make the simulations:
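As a rough illustration of the naive approach and of the error metrics described earlier (MFE, MAE, MSE, RMSE and MAPE), here is a minimal sketch in plain Python. The function names, the sign convention (error = actual minus forecast) and all the numbers are assumptions made for this example; they are not the original code or the ethanol data from the post.

```python
import math


def evaluate(actual, forecast):
    """Return MFE, MAE, MSE, RMSE and MAPE for two equal-length sequences.

    Assumes error = actual - forecast and that no actual value is zero
    (MAPE divides by the actual value).
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    return {
        "MFE": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n,
    }


def naive_forecast(series):
    """Forecasts for series[1:], each one being the previous actual value."""
    return series[:-1]


# Hypothetical values chosen so the errors are [-4, -2, 0, 2, 4]:
# positive and negative errors cancel in the MFE but not in the MAE.
toy = evaluate([6, 8, 10, 12, 14], [10, 10, 10, 10, 10])

# Hypothetical prices for the train/test walk-through
train = [100.0, 102.0, 101.0, 105.0, 107.0]
test = [108.0, 106.0, 110.0]

# Training data: the first period has no previous value, so it is skipped
train_scores = evaluate(train[1:], naive_forecast(train))

# Test data: the first forecast is the last training value, after that
# each forecast is the previous actual value from the test data
test_forecast = [train[-1]] + test[:-1]
test_scores = evaluate(test, test_forecast)

print(toy["MFE"], toy["MAE"])
print(train_scores)
print(test_scores)
```

With pandas, the same one-step shift is usually written with `Series.shift(1)`; plain lists are used here only to keep the sketch dependency-free.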
  • 27. The QQ chart shows that some residuals are larger (up and down) than they theoretically should be; these are the so-called outliers. There is also still a significant autocorrelation at the first, sixth and seventh lags, which could be used to improve the model.
    In the same way, we will now make the forecast on the test data. The first value of the predicted series is the last value of the training data; after that, each forecast is updated step by step with the actual value from the test data:
  • 28. The RMSE and MAE errors were similar to those on the training data. In the QQ chart the residuals are more in line with the theoretical distribution, probably because there are few sample values compared to the training data.
    The chart comparing the residuals with the predicted values shows a tendency for the errors to increase in absolute value as prices increase; perhaps a logarithmic adjustment would reduce this error expansion. Finally, the residual correlation graph shows that there is still room for improvement: there is a strong correlation at the first lag, so a regression based on the first lag could probably be added to improve the predictions.
    The next model is the simple average:
  • 29. Simple Mean:
    Another way to make predictions is to use the mean of the series. This form of forecasting usually works well when the values oscillate closely around the mean, with constant variance and no uptrend or downtrend, although better methods exist that can, for example, make use of seasonal patterns.
    This model uses the mean from the beginning of the data up to the previous period, and the window expands daily until the end of the data, so the line tends to flatten toward the end. We will now compare the error of this model with the first model:
    In the test data, I will keep using the mean from the beginning of the training data and expand it with the values that are added from the test data:
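The expanding mean described above can be sketched as follows. This is only an illustration in plain Python: the function name and the sample values are hypothetical, and each forecast uses only the values observed before the period being predicted.

```python
def expanding_mean_forecast(history, new_values):
    """One-step forecasts for new_values using the mean of all values seen so far."""
    total = sum(history)
    count = len(history)
    forecasts = []
    for value in new_values:
        forecasts.append(total / count)  # forecast made before `value` is known
        total += value                   # then the window expands by one period
        count += 1
    return forecasts


# Hypothetical values, not the data used in the post
train = [10.0, 12.0, 14.0, 12.0]
test = [16.0, 8.0]

# Training data: start from the first observation and expand period by period
train_forecast = expanding_mean_forecast(train[:1], train[1:])

# Test data: keep the mean from the start of the training data and
# keep expanding it with the test values as they arrive
test_forecast = expanding_mean_forecast(train, test)

print(train_forecast)
print(test_forecast)
```

In pandas, a similar result can be obtained with `Series.expanding().mean().shift(1)`.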
  • 30. The simple mean model failed to capture relevant information from the series, as can be seen in the Real vs. Forecast graph and also in the correlation and Residual vs. Predicted graphs.
    Simple Moving Average:
    The moving average is an average calculated over a fixed period (5 days, for example) that moves forward in time, always using that same period length. In this case we will always use the average of the last 5 days to predict the value of the next day.
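A minimal sketch of the 5-day moving average forecast, assuming a plain list of daily values. The function name and the sample prices are hypothetical; periods without 5 previous values get no forecast.

```python
def moving_average_forecast(series, window=5):
    """Forecast series[t] as the mean of the previous `window` values.

    Returns None for the first `window` periods, where there is not
    enough history to compute the average.
    """
    forecasts = []
    for t in range(len(series)):
        if t < window:
            forecasts.append(None)
        else:
            forecasts.append(sum(series[t - window:t]) / window)
    return forecasts


# Hypothetical daily prices
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
forecast = moving_average_forecast(prices, window=5)
print(forecast)
```

Because the series above rises steadily, each forecast lags behind the actual value, which is the typical behaviour of a moving average on trending data. The pandas equivalent would be along the lines of `Series.rolling(5).mean().shift(1)`.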
  • 31. The error was lower than that of the simple mean, but still above that of the naive model. Below is the result on the test data:
    As with the training data, the moving average model is better than the simple mean, but it still does not beat the naive model.
  • 32. The predictions still show autocorrelation at two lags, and the error has a very high variance in relation to the predicted values.
    Exponential Moving Average:
    The simple moving average model described above treats the last X observations equally and completely ignores all earlier ones. Intuitively, past data should be discounted more gradually: the most recent observation should be slightly more important than the second most recent, the second most recent slightly more important than the third, and so on. The Exponential Moving Average (EMA) model does this. With α (alpha) a constant between 0 and 1, we calculate the forecast with the following formula:
    F(t) = F(t-1) + α × (A(t-1) - F(t-1))
    Here F is the forecast and A is the actual value. The first forecast is set to the respective actual value, and each subsequent forecast is updated by α times the difference between the actual value and the forecast of the previous period. When α is zero the forecast is a constant equal to its first value; when α is 1 the model reduces to the naive approach, because the result is the actual value of the previous period.
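The update rule just described can be sketched in a few lines of plain Python. The function name and the sample values are hypothetical; the two extreme cases, α = 0 and α = 1, are included to show the constant forecast and the naive forecast mentioned above.

```python
def ema_forecast(series, alpha):
    """Exponential moving average forecast.

    The first forecast equals the first actual value; each later forecast
    is the previous forecast plus alpha times the previous period's error.
    """
    forecasts = [series[0]]
    for t in range(1, len(series)):
        previous = forecasts[-1]
        forecasts.append(previous + alpha * (series[t - 1] - previous))
    return forecasts


# Hypothetical values
series = [10.0, 14.0, 12.0, 16.0]

smoothed = ema_forecast(series, alpha=0.5)
constant = ema_forecast(series, alpha=0.0)  # stays at the first value
naive = ema_forecast(series, alpha=1.0)     # previous actual value

print(smoothed)
print(constant)
print(naive)
```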
  • 33. Below is a chart for several values of α:
    The average age of the data in the EMA forecast is 1/α. For example, when α = 0.5 the lag is equivalent to 2 periods; when α = 0.2 the lag is 5 periods; when α = 0.1 the lag is 10 periods, and so on.
    In this model we will arbitrarily use an α of 0.50, but you can do a grid search for the α that reduces the error on both the training and the validation data. Let’s see how it looks:
    The error of this model was similar to that of the moving average. However, we still have to validate the model on the test data:
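The grid search over α mentioned above could look like the following sketch. It assumes RMSE as the selection metric and a grid from 0.1 to 1.0; the function names, the grid and the toy series are all choices made for this illustration. In practice the same scoring would also be run on the validation data before settling on a value.

```python
import math


def ema_forecast(series, alpha):
    """First forecast is the first actual value, then F += alpha * (A - F)."""
    forecasts = [series[0]]
    for t in range(1, len(series)):
        previous = forecasts[-1]
        forecasts.append(previous + alpha * (series[t - 1] - previous))
    return forecasts


def rmse(actual, forecast):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def grid_search_alpha(series, grid):
    """Score every alpha in the grid and return the best one with all scores."""
    scores = {alpha: rmse(series, ema_forecast(series, alpha)) for alpha in grid}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical, steadily rising series
train = [float(v) for v in range(1, 11)]
grid = [i / 10 for i in range(1, 11)]  # 0.1, 0.2, ..., 1.0

best_alpha, scores = grid_search_alpha(train, grid)
print(best_alpha, scores[best_alpha])
```

On a steadily trending series like this one, smaller values of α lag further behind the data, so the search favours large values of α.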
  • 34. On the validation data, the error is so far the second best among the models we have trained, but the residual charts look very similar to those of the 5-day moving average model.
    Auto-Regressive:
    An auto-regressive model is basically a linear regression on the significantly correlated lags. The autocorrelation and partial autocorrelation charts should be plotted first to check whether there is anything relevant.
    Below are the autocorrelation and partial autocorrelation charts of the training series, which show the signature of an auto-regressive model with 2 significantly correlated lags:
  • 35. Below we will fit the model on the training data and, after obtaining its coefficients, multiply them by the values observed in the test data:
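One possible way to carry out this step, shown only as a sketch: fit an auto-regressive model with 2 lags by ordinary least squares on the training data, then apply the coefficients to the test data one step at a time. The original post most likely used a statistics library; this version solves the normal equations in plain Python instead, assumes an intercept term, and uses a toy series generated from known coefficients (3, 1, -1) so that the fit can be checked. All names here are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ar(series, lags=2):
    """Least-squares fit of y[t] = c + phi_1*y[t-1] + ... + phi_lags*y[t-lags].

    Returns [c, phi_1, ..., phi_lags].
    """
    rows = [[1.0] + [series[t - k] for k in range(1, lags + 1)]
            for t in range(lags, len(series))]
    targets = [series[t] for t in range(lags, len(series))]
    p = lags + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear(xtx, xty)


def ar_forecast(coefs, history, new_values):
    """One-step-ahead forecasts for new_values, using actual values as they arrive."""
    lags = len(coefs) - 1
    data = list(history[-lags:]) + list(new_values)
    return [coefs[0] + sum(coefs[k] * data[t - k] for k in range(1, lags + 1))
            for t in range(lags, len(data))]


# Toy series generated by y[t] = 3 + y[t-1] - y[t-2] (not real prices)
train = [1.0, 2.0, 4.0, 5.0, 4.0, 2.0, 1.0, 2.0, 4.0, 5.0, 4.0, 2.0]
test = [1.0, 2.0, 4.0]

coefs = fit_ar(train, lags=2)
test_forecast = ar_forecast(coefs, train, test)
print(coefs)
print(test_forecast)
```

Solving the normal equations directly is fine for a couple of lags on a short series; for larger problems a numerical library with a dedicated least-squares routine is a safer choice.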
  • 36. In this model the error was the lowest of all the models we trained. Now let’s use its coefficients to make the step-by-step forecast on the test data:
    Note that on the test data the error did not remain stable and was even worse than that of the naive model. The chart shows that the forecasts are almost always below the actual values, and the bias measurement shows that the real values are BRL 50.19 above the predictions. Tuning some parameters of the trained model might reduce this difference.
    To improve these models you can apply several transformations, such as those explained in this post; you can also add external
  • 37. variables as a forecast source, although that is a subject for another post.
    Final considerations
    Each time series model has its own characteristics and should be analyzed individually, so that we can extract as much information as possible and make good predictions that reduce the uncertainty about the future.
    Checking for stationarity, transforming the data, creating the model on the training data, validating it on the test data and checking the residuals are key steps in creating a good time series forecast.