This document provides an overview of basic principles for creating a time series forecast. It discusses key concepts like stationary series, differencing to make series stationary, and transformations like logarithms and Box-Cox. It also covers evaluating autocorrelation and partial autocorrelation to identify lag effects, and metrics for evaluating forecasts like mean error and mean absolute error. The goal is to explain traditional methodologies for time series analysis and prediction using Python code examples.
It is most useful for BBA students taking the subject "Data Analysis and Modeling".
It covers the content of the chapter "Data Regression Model".
For more, visit www.ramkumarshah.com.np/
Basic Principles to Create a Time Series Forecast
We are surrounded by patterns that can be found everywhere: patterns in the four seasons in relation to the weather; patterns in peak-hour traffic volume; in your heartbeats; in the shares of the stock market; and in the sales cycles of certain products.

Analyzing time series data can be extremely useful for identifying these patterns and creating predictions for the future. There are several ways to create these forecasts; in this post I will cover the concepts behind the most basic and traditional methodologies. All code is written in Python, and additional information can be seen on my Github.
So let's start with the initial condition for analyzing a time series:
Stationary Series
A stationary time series is one whose statistical properties, such as mean, variance and autocorrelation, are relatively constant over time. A non-stationary series, therefore, is one whose statistical properties change over time.
Before starting any predictive modeling it is necessary to verify whether these statistical properties are constant. I will explain each of these points below:

• Constant mean
• Constant variance
• Autocorrelation
Constant Mean
A stationary series has a relatively constant mean over time, with no bullish or bearish trends. Having a constant mean, with small variations around it, makes it much easier to extrapolate into the future.

There are cases where the variance is small relative to the mean, and using the mean itself may be a good way to make predictions for the future. Below is a chart showing a relatively constant mean in relation to the variations over time:
If the series is not stationary, the forecast for the future will not be efficient, because the variations around the mean deviate significantly, as can be seen in the chart below:

In the chart above it is clear that there is a bullish trend and that the mean is gradually rising. In this case, if the mean were used to make future forecasts, the error would be significant, since the forecast prices would always be below the real prices.
Constant Variance
When the series has constant variance, we have an idea of the typical variation in relation to the mean. When the variance is not constant (as in the image below), the forecast will probably have bigger errors in certain periods, and those periods will not be predictable; the variance is expected to remain inconstant over time, including in the future.

In order to reduce the variance effect, a logarithmic transformation can be applied. A power transformation, like the Box-Cox method, or an inflation adjustment can be used as well.
Autocorrelated Series
When two variables vary together in a similar way over time, we say that these variables are correlated. For instance, body weight and heart disorders: the greater the weight, the greater the incidence of heart problems. In this case the correlation is positive, and the graph would look something like this:

A case of negative correlation would be something like: the greater the investment in safety measures at work, the smaller the number of work-related accidents.

Here are several examples of scatter plots with different correlation levels:

source: wikipedia
Autocorrelation means that there is a correlation between certain previous periods and the current period; the period with this correlation is called a lag. For instance, in a series with hourly measurements, today's temperature at 12:00 is very similar to the temperature at 12:00 24 hours ago. If you compare the variation of temperatures over this 24-hour time frame, there will be an autocorrelation; in this case we will have an autocorrelation with the 24th lag.

Autocorrelation is a condition for creating forecasts with a single variable, because if there is no correlation you cannot use past values to predict the future. When there are several variables, you can check whether there is a correlation between the dependent variable and the lags of the independent variables.

If a series has no autocorrelation it is a series of random and unpredictable values, and the best way to make a prediction is usually to use the value from the previous day. I will show more detailed charts and explanations below.
From here on I will analyze the weekly hydrous ethanol prices from Esalq (a price reference used to negotiate hydrous ethanol in Brazil); the data can be downloaded here. The price is in Brazilian Reais per cubic meter (BRL/m3).

Before starting any analysis, let's split the data into a training set and a test set.
Dividing the data into training and test sets
When we are going to create a time series prediction model, it's crucial to separate the data into two parts:

Training set: this data will be the basis for estimating the coefficients/parameters of the model;

Test set: this data is held out and is not seen by the model; it is used to test whether the model works (generally these values are compared using a walk-forward method and the mean error is then measured).

The size of the test set is usually about 20% of the total sample, although this percentage depends on the sample size you have and on how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.
Unlike other prediction problems, such as classification and regression without the influence of time, in time series we cannot split the training and test data with random samples from any part of the data; we must follow the time order of the series, where the training data always comes before the test data.

In this example of Esalq hydrous ethanol prices we have 856 weeks. We will use the first 700 weeks as the training set and the last 156 weeks (3 years, ~18%) as the test set:
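A minimal sketch of this split, assuming a hypothetical CSV file (esalq_hydrous_ethanol.csv) with date and price columns; the real file and column names depend on where you downloaded the data:

```python
import pandas as pd

# Hypothetical file and column names: weekly Esalq hydrous ethanol prices (856 rows).
prices = pd.read_csv("esalq_hydrous_ethanol.csv",
                     parse_dates=["date"], index_col="date")["price"]

train = prices.iloc[:700]   # first 700 weeks: used to fit the models
test = prices.iloc[700:]    # last 156 weeks (~3 years): used only for validation
print(len(train), len(test))
```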
From now on we will only use the training set for the studies; the test set will only be used to validate the predictions that we make.

Every time series can be broken down into 3 parts: trend, seasonality and residuals, which is what remains after removing the first two parts from the series. Below, the separation of these parts:
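A minimal sketch of this decomposition with statsmodels, reusing the train series from the split above (period=52 because the data is weekly and the seasonality is yearly):

```python
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition into trend, seasonal and residual components.
decomposition = seasonal_decompose(train, model="additive", period=52)
decomposition.plot()
plt.show()
```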
Clearly the series has an uptrend, with peaks between the end and the beginning of each year and lows between April and September (the beginning of the sugarcane crushing season in the center-south of Brazil).
However, it is advisable to use statistical tests to confirm whether the series is stationary. We will use two tests: the Dickey-Fuller test and the KPSS test.

First, we will use the Dickey-Fuller test. I will use a base p-value of 5%; that is, if the p-value is below 5%, the series is considered statistically stationary.

In addition, the test returns a test statistic, which can be compared with the critical values at 1%, 5% and 10%; if the test statistic is below the chosen critical value, the series is stationary:
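A minimal sketch of the test with statsmodels, again on the train series from above:

```python
from statsmodels.tsa.stattools import adfuller

# Null hypothesis of the ADF test: the series has a unit root (is non-stationary).
adf_stat, p_value, _, _, critical_values, _ = adfuller(train)
print(f"ADF statistic: {adf_stat:.3f}")
print(f"p-value:       {p_value:.3f}")
for level, value in critical_values.items():
    print(f"critical value ({level}): {value:.3f}")
```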
In this case, the Dickey-Fuller test indicated that the series is not stationary (the p-value is 36% and the test statistic is above the 5% critical value).
Now we are going to analyze the series with the KPSS test. Unlike the Dickey-Fuller test, the KPSS test assumes that the series is stationary, and it will only be considered non-stationary if the p-value is less than 5% or the test statistic is greater than a critical value:
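A minimal sketch with statsmodels' kpss function:

```python
from statsmodels.tsa.stattools import kpss

# Null hypothesis of the KPSS test: the series IS stationary, so a small p-value
# (or a large test statistic) is evidence of non-stationarity.
kpss_stat, p_value, lags, critical_values = kpss(train, regression="c", nlags="auto")
print(f"KPSS statistic: {kpss_stat:.3f}")
print(f"p-value:        {p_value:.3f}")
print("critical values:", critical_values)
```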
Confirming the Dickey-Fuller test, the KPSS test also shows that the series is not stationary, because the p-value is at 1% and the test statistic is above every critical value.
Next I will demonstrate ways of making a series stationary.

Making the series stationary
Differencing
Differencing is used to remove trend signals and also to reduce the variance; it is simply the difference between the value of period T and the value of the previous period T-1.

To make it easier to understand, below we take only a fraction of the ethanol prices for better visualization. Note that from May/2005 prices start rising until mid-May/2006; these weekly rises accumulate, creating an uptrend, so we have a non-stationary series.
When the first difference is taken (graph below), we remove the cumulative effect of the series and show only the variation of period T against period T-1 throughout the whole series. So if the price 3 days ago was BRL 800.00 and it changed to BRL 850.00, the value of the difference will be BRL 50.00, and if today's value is BRL 860.00 then the difference will be BRL 10.00.

Normally only one difference is necessary to make a series stationary, but if necessary a second difference can be applied; in this case the differencing is done on the values of the first difference (there will hardly be cases needing more than 2 differences).

Using the same example, to take a second difference we must take the difference of the first differences, T minus T-1: BRL 2.9 − BRL 5.5 = −BRL 2.6, and so on.
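In pandas, differencing is a one-liner; a minimal sketch on the train series:

```python
import matplotlib.pyplot as plt

# First difference: value at period T minus value at period T-1.
diff1 = train.diff().dropna()

# Second difference (rarely needed): the difference of the first difference.
diff2 = train.diff().diff().dropna()

diff1.plot(title="First difference of the training series")
plt.show()
```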
Let's run the Dickey-Fuller test to see whether the series becomes stationary with the first difference:

In this case we confirm that the series is stationary: the p-value is zero and the test statistic is far below the critical values.
In the next example we will try to make the series stationary using an inflation adjustment.
Inflation Adjustment
Prices are relative to the time at which they were traded. In 2002 the price of ethanol was BRL 680.00; if this product were traded at that price today, many mills would certainly close, as it is a very low price.

To try to make the series stationary, I will adjust the whole series to current values using the IPCA index (the Brazilian CPI), accumulating from the end of the training period (Apr/2016) back to the beginning of the study; the source of the data is the IBGE website.

Now let's see how the series looks and whether it has become stationary.
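A minimal sketch of one way to do the adjustment, assuming a hypothetical ipca_index.csv holding the IPCA index level by date; every price is re-expressed in the money of the last training week by multiplying it by the ratio between the index at that final date and the index at the price's own date:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for the IPCA (Brazilian CPI) index level.
ipca = pd.read_csv("ipca_index.csv", parse_dates=["date"], index_col="date")["ipca"]

# Align the monthly index level with the weekly price dates.
ipca_weekly = ipca.reindex(train.index, method="ffill")

# Rebase every price to the money of the last training period (Apr/2016).
train_adjusted = train * (ipca_weekly.iloc[-1] / ipca_weekly)

train_adjusted.plot(title="Inflation-adjusted training series (BRL/m3, Apr/2016 money)")
plt.show()
```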
As can be seen, the uptrend has disappeared, with only the seasonal oscillations remaining, and the Dickey-Fuller test confirms that the series is now stationary.

Just out of curiosity, see below the graph of the inflation-adjusted price against the original series.
Reducing variance
Logarithm
The logarithm is usually used to transform series that grow exponentially into series with more linear growth. In this example we will use the natural logarithm (ln), whose base is 2.718; this type of logarithm is widely used in economic models.

The difference between values transformed with ln is approximately equal to the percentage variation of the values in the original series, which makes it a good basis for reducing the variance in series with very different price levels. See the example below:
If we have a product whose price increased in 2000 from BRL 50.00 to BRL 52.50, and some years later (2019) the price was already BRL 100.00 and changed to BRL 105.00, the absolute differences are BRL 2.50 and BRL 5.00 respectively, but the percentage difference in both cases is 5%.

When we use ln on these prices we have: ln(52.50) − ln(50.00) = 3.961 − 3.912 = 0.049, or 4.9%; in the same way, using ln on the second pair of prices we have: ln(105) − ln(100) = 4.654 − 4.605 = 0.049, or 4.9%.

In this example, we reduce the variation of the values by bringing almost everything to the same basis.

Below, the same example in code:
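A minimal sketch of the computation, using only NumPy:

```python
import numpy as np

# Two price moves with the same 5% percentage change but different absolute sizes.
first = np.log(52.50) - np.log(50.00)     # ~0.049
second = np.log(105.00) - np.log(100.00)  # ~0.049

print(f"The percentage variation of the first example is {first * 100:.1f} "
      f"and the second is {second * 100:.1f}")
```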
Result: The percentage variation of the first example is 4.9 and the
second is 4.9
Below is a table comparing the percentage variation of X with the variation of ln(X):
Let's plot the comparison between the original series and the log-transformed series:
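A minimal plotting sketch, reusing the train series from above:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
train.plot(ax=axes[0], title="Original series (BRL/m3)")
np.log(train).plot(ax=axes[1], title="Natural log of the series")
plt.tight_layout()
plt.show()
```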
Box-Cox Transformation (Power Transform)
The Box-Cox transformation is another way to transform a series; the lambda (λ) value is the parameter used in the transformation. In short, this function unifies a family of power transformations, and we search for the value of lambda that transforms the series so that its distribution is as close as possible to a normal (Gaussian) distribution. A condition for using this transformation is that the series has only positive values. The formula is:

y(λ) = (y^λ − 1) / λ when λ ≠ 0, and y(λ) = ln(y) when λ = 0
Below I will plot the original series with its distribution and, after that, the transformed series with the optimal value of lambda and its new distribution. To find the value of lambda we will use the boxcox function from the SciPy library, which returns the transformed series and the optimal lambda:
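A minimal sketch using scipy.stats.boxcox (and scipy.special.inv_boxcox to undo the transformation later):

```python
from scipy import stats
from scipy.special import inv_boxcox

# boxcox returns the transformed values and the lambda that makes the result
# as close to a normal distribution as possible (maximum likelihood).
transformed, best_lambda = stats.boxcox(train)
print(f"optimal lambda: {best_lambda:.3f}")

# After forecasting on the transformed scale, invert back to BRL/m3.
back_to_original = inv_boxcox(transformed, best_lambda)
```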
Below is an interactive chart where you can change the lambda value and see how the chart changes.

This tool is usually used to improve the performance of the model, since it produces distributions closer to normal. Remember that after the model's prediction is finished, you must return to the original scale by inverting the transformation: y = (λ·y(λ) + 1)^(1/λ) when λ ≠ 0, and y = exp(y(λ)) when λ = 0.
Looking for correlated lags
To be predictable, a series with a single variable must have autocorrelation; that is, the current period must be explainable by an earlier period (a lag).

As this series has weekly periods and 1 year is approximately 52 weeks, I will use the autocorrelation function over a period of 60 lags to check the correlation of the current period with these lags.
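A minimal sketch with statsmodels' plot_acf on the train series:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Autocorrelation of the training series for the first 60 weekly lags.
plot_acf(train, lags=60)
plt.show()
```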
Analyzing the autocorrelation chart above, it seems that all lags could be used to create forecasts of future events, since they have a positive correlation close to 1 and are outside the confidence interval; but this characteristic is typical of a non-stationary series.
Another very important function is the partial autocorrelation function, where the effect of the earlier lags on the current period is removed and only the effect of the lag being analyzed remains. For instance, the partial autocorrelation of the fourth lag removes the effects of the first, second and third lags.

Below, the partial autocorrelation graph:
As can be seen, almost no lag has an effect on the current period. But, as demonstrated earlier, the series without differencing is not stationary, so we will now plot these two functions for the series with one difference to see how it behaves:
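A minimal sketch plotting both functions on the differenced training series:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

diff1 = train.diff().dropna()             # first difference of the training series

fig, axes = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(diff1, lags=60, ax=axes[0])      # autocorrelation after differencing
plot_pacf(diff1, lags=60, ax=axes[1])     # partial autocorrelation after differencing
plt.tight_layout()
plt.show()
```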
The autocorrelation plot changed significantly, showing that the series has a significant correlation only in the first lag and a seasonal effect with negative correlation around the 26th lag (half a year, since the data is weekly).

To create forecasts, we must pay attention to an extremely important detail about correlated lags: it is important that there is a reason behind the correlation, because if there is no logical reason it may just be chance, and the correlation may disappear when you include more data.
Another important point is that the autocorrelation and partial autocorrelation graphs are very sensitive to outliers, so it is important to analyze the time series itself and compare it with the two autocorrelation charts.

In this example the first lag has a high correlation with the current period, since prices historically do not vary much from one week to the next; likewise, the 26th lag presents a negative correlation, indicating a tendency contrary to the current period, probably due to the different periods of supply and demand over the course of a year.

As the inflation-adjusted series has become stationary, we will use it to create our forecasts. Below, the autocorrelation and partial autocorrelation graphs of the adjusted series:
We will use only the first two lags as predictors for the auto-regressive model.

For more information, Duke University professor Robert Nau's website is one of the best resources on this subject.
Metrics to evaluate the model
In order to check whether the forecasts are close to the actual values, we must measure the error; the error (or residual) in this case is basically Yreal − Ypred.

The error on the training data is evaluated to verify whether the model has good accuracy, and the model is validated by checking the error on the test data (data that was not "seen" by the model). Checking the error is very important to verify whether your model is overfitting or underfitting when you compare the training data with the test data.
Below are the key metrics used to evaluate time series models:
MEAN FORECAST ERROR — (BIAS)
This is nothing more than the average of the errors of the evaluated series; the values can be positive or negative. This metric indicates whether the model tends to make predictions above the real value (negative errors) or below the real value (positive errors), so the mean forecast error can also be said to be the bias of the model.
MAE — MEAN ABSOLUTE ERROR
This metric is very similar to the mean forecast error mentioned above; the only difference is that negative errors are turned into positive values before the mean is calculated.

This metric is widely used in time series, since there are cases where negative errors can cancel out positive errors and give the impression that the model is accurate. With the MAE this does not happen, because the metric shows how far the forecast is from the real values, regardless of whether it is above or below. See the case below:
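A minimal sketch of this comparison, assuming hypothetical real and predicted arrays chosen so that the errors match the result shown below:

```python
import numpy as np

real = np.array([10, 11, 12, 13, 14])
predicted = np.array([14, 13, 12, 11, 10])
errors = real - predicted                 # [-4 -2  0  2  4]

mfe = errors.mean()                       # positive and negative errors cancel: 0.0
mae = np.abs(errors).mean()               # absolute errors do not cancel: 2.4

print(f"The error of each model value looks like this: {errors}")
print(f"The MFE error was {mfe}, the MAE error was {mae}")
```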
Result: The error of each model value looks like this: [-4 -2 0 2 4]
The MFE error was 0.0, the MAE error was 2.4
MSE — MEAN SQUARED ERROR
This metric places more weight on larger errors, because each individual error is squared before the mean is calculated. Thus, this metric is very sensitive to outliers and puts a lot of weight on predictions with larger errors. Unlike the MAE and the MFE, the MSE values are in squared units rather than the units of the series.
RMSE — ROOT MEAN SQUARED ERROR
This metric is simply the square root of the MSE, so the error returns to the unit of measure of the series (BRL/m3). It is widely used in time series because it is more sensitive to larger errors, due to the squaring in the MSE from which it originates.
MAPE — MEAN ABSOLUTE PERCENTAGE ERROR
This is another interesting metric, generally used in management reports, because the error is measured in percentage terms, so the error of a product X can be compared with the error of a product Y. The calculation takes the absolute value of the error divided by the actual value, and then the mean is calculated:
Let’s create a function to evaluate the errors of training and test
data with several evaluation metrics:
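A minimal sketch of such a function; the author's original helper is not shown, so this is an assumption about its shape:

```python
import numpy as np

def evaluate(real, predicted, label=""):
    """Print MFE (bias), MAE, MSE, RMSE and MAPE for a forecast."""
    real = np.asarray(real, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = real - predicted

    mfe = errors.mean()
    mae = np.abs(errors).mean()
    mse = (errors ** 2).mean()
    rmse = np.sqrt(mse)
    mape = np.abs(errors / real).mean() * 100

    print(f"{label} MFE: {mfe:.2f} | MAE: {mae:.2f} | MSE: {mse:.2f} "
          f"| RMSE: {rmse:.2f} | MAPE: {mape:.2f}%")
    return mfe, mae, mse, rmse, mape
```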
Checking the residual values
It is not enough to create the model and check the error according to the chosen metric; you must also analyze the characteristics of the residuals themselves, as there are cases where the model cannot capture all the information necessary to make a good forecast, resulting in residuals that still contain information that could be used to improve the forecast.

To check the residuals we will look at:
• Actual vs. predicted values (sequential chart);
• Residuals vs. predicted values (scatter chart):
It is very important to analyze this chart, since in it we can spot patterns that tell us whether some modification is needed in the model; ideally the errors are distributed evenly, with no pattern, along the forecast sequence.
• QQ plot of the residuals (scatter chart):
In short, this is a chart that shows where the residuals should theoretically fall, following a Gaussian distribution, versus where they actually fall.
• Residual autocorrelation (sequential chart):
There should be no values outside the confidence margin; otherwise the model is leaving information out.
We need to create another function to plot these graphs:
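A minimal sketch of such a plotting helper, assuming real and predicted are pandas Series aligned on the same index (again, an assumption about the original function, which is not shown):

```python
import matplotlib.pyplot as plt
import scipy.stats as scs
from statsmodels.graphics.tsaplots import plot_acf

def plot_residuals(real, predicted):
    """Actual vs predicted, residuals vs predicted, QQ plot and residual autocorrelation."""
    residuals = real - predicted
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))

    axes[0, 0].plot(real.values, label="actual")
    axes[0, 0].plot(predicted.values, label="predicted")
    axes[0, 0].set_title("Actual vs predicted")
    axes[0, 0].legend()

    axes[0, 1].scatter(predicted, residuals, s=10)
    axes[0, 1].axhline(0, color="grey")
    axes[0, 1].set_title("Residuals vs predicted")

    scs.probplot(residuals, dist="norm", plot=axes[1, 0])
    axes[1, 0].set_title("QQ plot of the residuals")

    plot_acf(residuals, lags=30, ax=axes[1, 1])
    axes[1, 1].set_title("Residual autocorrelation")

    plt.tight_layout()
    plt.show()
```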
Most basic ways to make a forecast
From now on we will create some price forecast models for hydrous ethanol. Below are the steps we will follow for each model:
• Create the prediction on the training data and subsequently validate it on the test data;
• Check the error of each model according to the metrics mentioned above;
• Plot the model with the residual comparisons.
Let’s go to the models:
Naive approach:
The simplest way to make a forecast is to use the value of the previous period. In some cases this is the best approach possible, with a lower error than more elaborate forecast methodologies.

Generally this methodology does not work well for predicting many periods ahead, as the errors tend to grow relative to the real values.

Many people also use this approach as a baseline to try to improve on with more complex models.

Below we will use the training and test data to run the simulations:
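A minimal sketch, reusing the train/test split and the evaluate and plot_residuals helpers sketched earlier:

```python
# Naive forecast on the training data: each week's prediction is the previous week's price.
naive_train_pred = train.shift(1).dropna()
evaluate(train.iloc[1:], naive_train_pred, label="Naive (train)")
plot_residuals(train.iloc[1:], naive_train_pred)

# Walk-forward on the test data: the first prediction is the last training value,
# after that each prediction is the previously observed test value.
naive_test_pred = test.shift(1)
naive_test_pred.iloc[0] = train.iloc[-1]
evaluate(test, naive_test_pred, label="Naive (test)")
plot_residuals(test, naive_test_pred)
```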
The QQ chart shows that there are some residuals larger (up and down) than there theoretically should be; these are the so-called outliers. There is also still a significant autocorrelation in the first, sixth and seventh lags, which could be used to improve the model.

In the same way, we will now make the forecast on the test data. The first value of the predicted series will be the last value of the training data; after that, the predictions are updated step by step with the actual values of the test set, and so on:
The RMSE and MAE errors were similar to those of the training data, and the QQ chart shows residuals more in line with what they should theoretically be, probably due to the small number of samples compared to the training data.

In the chart comparing the residuals with the predicted values, it can be seen that the errors tend to increase in absolute value when prices increase; perhaps a logarithmic adjustment would reduce this error expansion. Finally, the residual correlation chart shows that there is still room for improvement, as there is a strong correlation in the first lag, so a regression based on the first lag could probably be added to improve the predictions. The next model is the simple mean:
Simple Mean:
Another way to make predictions is to use the mean of the series. Usually this form of forecasting is good when the values oscillate closely around the mean, with constant variance and no uptrend or downtrend, but it is usually possible to use better methods that exploit seasonal patterns, among others.

This model uses the mean from the beginning of the data up to the previous period, expanding with each period until the end of the data; in the end, the tendency is for the forecast line to become straight. We will now compare the error of this model with the first model:

On the test data, I will continue using the mean from the beginning of the training data and expand it with the values that are added from the test data:
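A minimal sketch of the expanding mean, reusing the earlier variables and helpers:

```python
import pandas as pd

# Training data: the prediction for each week is the mean of all previous weeks.
mean_train_pred = train.expanding().mean().shift(1).dropna()
evaluate(train.iloc[1:], mean_train_pred, label="Expanding mean (train)")

# Test data: the mean keeps expanding over the full history observed so far.
history = pd.concat([train, test])
mean_test_pred = history.expanding().mean().shift(1).loc[test.index]
evaluate(test, mean_test_pred, label="Expanding mean (test)")
plot_residuals(test, mean_test_pred)
```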
The simple mean model failed to capture relevant information from the series, as can be seen in the Actual vs. Forecast chart, and also in the correlation and Residuals vs. Predicted charts.
Simple Moving Average:
The moving average is a mean calculated over a given window (5 periods, for example) that moves along the series, always being recalculated over that window; in this case we will always use the average of the last 5 periods to predict the value of the next period.
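A minimal sketch with a 5-period rolling window, reusing the earlier variables and helpers:

```python
import pandas as pd

# Training data: predict each week with the mean of the previous 5 weeks.
ma_train_pred = train.rolling(window=5).mean().shift(1).dropna()
evaluate(train.loc[ma_train_pred.index], ma_train_pred, label="Moving average (train)")

# Test data: the rolling window slides over the full observed history.
history = pd.concat([train, test])
ma_test_pred = history.rolling(window=5).mean().shift(1).loc[test.index]
evaluate(test, ma_test_pred, label="Moving average (test)")
plot_residuals(test, ma_test_pred)
```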
The error was lower than the simple mean, but still higher than the naive model. Below, the test model:

Similarly to the training data, the moving average model is better than the simple mean, but it still does not beat the naive model.

The predictions show autocorrelation in two lags, and the error has a very high variance relative to the predicted values.
Exponential Moving Average:
The simple moving average model described above has the property of treating the last X observations equally and completely ignoring all earlier observations. Intuitively, past data should be discounted more gradually: the most recent observation should matter slightly more than the second most recent, the second most recent a little more than the third most recent, and so on. The Exponential Moving Average (EMA) model does exactly this.

With α (alpha) a constant between 0 and 1, we will calculate the forecast with the following formula:

Forecast(t) = Forecast(t-1) + α × (Actual(t-1) − Forecast(t-1))

The first value of the forecast is the corresponding actual value; the other values are updated by α times the difference between the actual value and the forecast of the previous period. When α is zero we have a constant based on the first value of the forecast; when α is 1 we have the naive model, because the result is the actual value of the previous period.
Below is a chart with several values of α:

The average age of the data in the EMA forecast is 1/α. For example, when α = 0.5 the lag is equivalent to 2 periods; when α = 0.2 the lag is 5 periods; when α = 0.1 the lag is 10 periods; and so on.
In this model we will arbitrarily use an α of 0.50, but you could do a grid search for the α that reduces the error on the training data and also on the validation data. Let's see how it looks:
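A minimal sketch that builds the forecast recursively from the formula above (α = 0.50), reusing the earlier variables and helpers:

```python
import pandas as pd

def ema_forecast(series, alpha):
    """Forecast(t) = Forecast(t-1) + alpha * (Actual(t-1) - Forecast(t-1))."""
    forecast = [series.iloc[0]]                 # first forecast = first observed value
    for t in range(1, len(series)):
        prev = forecast[-1]
        forecast.append(prev + alpha * (series.iloc[t - 1] - prev))
    return pd.Series(forecast, index=series.index)

alpha = 0.50
ema_train_pred = ema_forecast(train, alpha)
evaluate(train, ema_train_pred, label="EMA (train)")
plot_residuals(train, ema_train_pred)
```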
The error of this model was similar to the error of the moving average; however, we still have to validate the model on the test set:

On the validation data, the error so far is the second best of the models we have trained, but the characteristics of the residual charts are very similar to those of the 5-period moving average model.
Auto-Regressive:
An auto-regressive model is basically a linear regression on significantly correlated lags, so the autocorrelation and partial autocorrelation charts should be plotted first to check whether there is anything relevant.

Below are the autocorrelation and partial autocorrelation charts of the training series, which show the signature of an auto-regressive model with 2 significantly correlated lags.

Below we will create the model based on the training data and, after obtaining the coefficients of the model, we will apply them to the values that are realized in the test data:
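A minimal sketch of the fit with statsmodels' AutoReg (here applied to the train series; in the post the inflation-adjusted series is used), reusing the evaluate helper:

```python
from statsmodels.tsa.ar_model import AutoReg

# Fit an auto-regressive model with the two significant lags on the training data.
ar_model = AutoReg(train, lags=2).fit()
print(ar_model.params)            # intercept and the coefficients of lags 1 and 2

# In-sample predictions, skipping the first 2 periods that are used as lags.
ar_train_pred = ar_model.predict(start=2, end=len(train) - 1)
evaluate(train.iloc[2:], ar_train_pred, label="AR(2) (train)")
```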
In this model the error was the lowest compared to all the other models we trained on the training data; now let's use its coefficients to do the step-by-step forecast of the test data:
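A minimal walk-forward sketch: at each test week the fitted coefficients are applied to the two previously observed values (reusing the earlier variables and helpers):

```python
import pandas as pd

const, phi1, phi2 = ar_model.params
history = pd.concat([train, test])

# Prediction for week t uses the actual values observed at t-1 and t-2.
ar_test_pred = (const + phi1 * history.shift(1) + phi2 * history.shift(2)).loc[test.index]
evaluate(test, ar_test_pred, label="AR(2) (test)")
plot_residuals(test, ar_test_pred)
```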
Note that on the test data the error did not remain stable and was even worse than the naive model. Note in the chart that the forecasts are almost always below the actual values; the bias measurement shows that the real values are, on average, BRL 50.19 above the predictions. Perhaps tuning some parameters of the training model would reduce this difference.

To improve these models you can apply several transformations, such as those explained in this post, and you can also add external variables as forecast inputs; however, that is a subject for another post.
Final considerations
Each time series model has its own characteristics and should be analyzed individually, so that we can extract as much information as possible to make good predictions and reduce the uncertainty about the future.

Checking for stationarity, transforming the data, creating the model on the training data, validating it on the test data and checking the residuals are the key steps to create a good time series forecast.