Basic statistics by_david_solomon_hadi_-_split_and_reviewed

Basic Statistics by David Solomon Hadi, Chief Financial Officer, Rock StarConsulting Group
Consultant, contact +61 424 102 603 www.rockstarconsultinggroup.com
Everything we see is distributed on some scale. Some people are tall, some short and some are
neither tall nor short. Once we find out how many are tall, short or middle heighted we get to
know how people are distributed when it comes to height. This distribution can also be of
chances. For example, we throw, 100 times, an unbalanced dice and find out how many times
1,2,3,4,5 or 6 appeared on top. This knowledge of distribution plays animportant role in empirical
work.
These distributions give us an idea about our chances of facing a particular type of
person/event/thing/process if we interact randomly. That is why it is formally called probability
distribution. A probability is written as a number between 0 and 1. If we multiply it with 100 it is
% of chances of meeting our desired event/person/…..
We may write probability of an event X = p(X) = number of time X occurs in our observation / total
number of observations we have.
Two concepts that come with distributions are their mean values (average) and variance.
Variance is measure of average distance of any value from mean. The calculations for mean,
sometimes called expected value, is E(X) = sum of all values of X / total number of values of X.
The calculation for variance is V(X) = [Xi – E(X)]2/total number of values of X. (A square root of
variance is called standard deviation i.e. a deviation that is standard or average)
Once these distributions and their means and variances are known we can then answer such
questions as what is distribution of chances if people of different heights throw ball in basket
(basket ball game)? Basket ball needs height but players of same height may throw differently
depending on other factors like skill etc. Therefore, now two distributions would be interacting
with each other. The result of mathematical process is called conditional probability i.e.
probability of throwing in the basket conditioned to the fact that person is tall.
However, the distributions I talked earlier may be entirely misleading. These may be just a
representative of sample we take. For example people we took could be of one city only which
we studied (sample). To solve this we say if same distribution is seen in several samples, we may
call it long run frequency or objective probability. However, some on other hand would suggest
that it cannot be established that there is an objective probability and that is why they use
continuous updating of probability distribution knowledge (belief).

This updating leads us back to use of conditional probability. We would answer such questions
as given that I know hard working employees produce the output with such probability (chances)
and I have seen that this employee has produced that particular output, what chances are he
really did work hard? (confused? Well that’s why everyone needs to hire econometricians and
statisticians). Formally we put these questions in domain of Bayesian Decision Making. But this is
not going to be the topic below.
A related concept is to ask how much two distributions vary together e.g. if we have two
distributions of heights from two cities, we may ask if these two distribution show similar
variation of number of people of particular height as we move on scale of height from short to
tall? This is called covariance. Larger covariance means two populations vary similarly. We
sometime standardize these on a scale of -1 to +1 and call it correlation. +1 means both
populations move in same way, -1 means both move in totally opposite way and 0 mean they do
not show any similar changes.
The mathematical notation for covariance is Cov(X,Y)={Sum (Xi*Yi) - n * (average X* average Y)},
with n is total number of observations in X or Y and “i” represent observation number.
The mathematical notation for correlation is corr (X,Y) = r(X,Y) = Cov(X,Y)/{standard deviation of
X * standard deviation of Y} (I ignored “i” which is present).
If we square the correlation we have an R squared. This tell us that how much % variation in X is
explained by Y (or vice versa too).
Before leaving to next section I should add there are several known types of distributions. The
usual one we use in economics are normal distribution, t distribution, F distribution. Recently
Pareto or Power Law distributions have been introduced. Also anything that is distributed is
called variable (formally random variable). With these basic concepts I would engage in brief
description of all tools we use in economics.
I assume that readers are familiar with basic terms in statistics or at least do understand the
ordinary meaning of the statistical terms.
Becoming an Empiricist:
Let us say you speculate that a rise in interest rate from bank would lead to fall of investment by
the firms since the loans are costly now. You further assume that a fall in taxes can induce
investment for similar reason.
Being an empiricist I would ask you have you observed this happen in real life? If it does happen
in real life, by what value the investment would fall or rise if we change tax or interest rate by a
particular value? I would also ask, would this relation hold for all times? I may as well be

interested in a counter factual world for decision making but would like to make my counter
factual using real data.
With such questions at hand, an empirical economist would try to model the real world using a
theory. The focus is especially on questions one and two above. In the rest of the lines I present
most of the methods an economist can employ, their simplified calculation and their uses. If
possible I can provide an example too. I assume the reader is aware of basic terms in economics
and statistics. I would start with basics of regression, time series analysis techniques, proceed to
macro econometrics methods and introduce panel data and micro data analysis.
Regression
In regression we try to reproduce the real world relations modeled using a theory. This
reproduction uses real world data. The underlying idea is to explainvariation in one variable using
others in such a way that we may produce a best fit. A best fit would be one where deviations of
predicted values (of regression model) from real data are minimized.
While achieving that fit, we find out that some variables explain greatly the variation in variable
of our interest (explained variable) and other do not. We start with the most general form of
model i.e. gather all possible variables that explain the variation and then gradually drop one
after another which does not explain.
This helps us test our hypothesis. The hypotheses are driven from atheory and are testable ones.
Hypotheses are such astatement as “wagedoes not influence output of an employee” and “wage
does not influence output of an employee positively”. In regression, while fitting the data, if wage
significantly explains output we say we fail to reject the hypothesis. (Technically it is not same as
we accept the hypothesis but in practice it means that). Significance is established with t test, F
test and p values that we talk about in later parts.
The method used to minimize the distance (error) is called ordinary least square simply called
OLS. In a regression we assume that values of explanatory variables (those explaining the
explained variable) are fixed, non random, non-repeating. To minimize errors we write down our
model and do the calculus to find out minimum value of the function. An example can clear it.
We are interested in Employees Output as a function of Wage. We write model as:
Outputi = a + b * Wagei + Errorsi.
The “i” means employee number. In remaining part I ignore “i” for simplicity. Here a is constant
and b absorbs the effectof 1unit change inwage on output. Errors are calculatedas actualoutput
– predicted output from our model. Our aim is to find out combination of a and b so that errors
are minimized.

To do so we write errors =actual output - predicted output and plug-in Output = a + b * Wage.
Therefore, Error = actual output - a - b * Wage.
We are interested in minimum mean of square of errors. Mean of square of error is a standard
used in econometrics to measure errors for technical reasons. Therefore we may write:
Sum (Error)2 = Sum (actual output - a - b * Wage)2
Our desired minimum value is 0, therefore;
Minimum {Sum (actual output - a - b * Wage)2} = 0
We take first order derivatives with respect to a and b. And calculate the math as follows:
d Sum (actual output - a - b * Wage)2 / d (a) = 0
d Sum (actual output - a - b * Wage)2/ d (b) = 0
d on this occasion stands for derivative.
Solving for (a):
d Sum (actual output - a - b * Wage)2 / d (a) = 0
2 * Sum (actual output - a - b * Wage) (-1) = 0
Sum (actual output - a - b * Wage) = 0
Sum (actual output) – sum(a) –sum( b * Wage) = 0
Since the Sum (actual output) can be written as n*average output where n is total number of
employees, a is constant and is repeated for each employee so it can be replaced with n*a and
sum of wage can be re written as n * average wage.
N * (average output) – n * (a) – n* ( b * average Wage) = 0
a = average output – b * average wage.
Solving for b:
d Sum (actual output - a - b * Wage)2 / d (b) = 0
2 * Sum (actual output - a - b * Wage) (-Wage) = 0
Sum (actual output * Wage - a * Wage - b * Wage2) = 0
Sum (actual output * Wage) = Sum( a * Wage) + Sum (b * Wage2)

Again using the same idea of averages and calculation of a:
Sum (actual output * Wage) = n*(Average Wage) * (average output – b * average wage ) + Sum
(b * Wage2)
Sum (actual output * Wage) = n * (average Wage * average output) + b * {(Sum (Wage2) –
n*(average wage)2}
B = {Sum (actual output * Wage) - n * (average Wage * average output)} / {(Sum (Wage2) –
n*(average wage)2}
From before we know that {Sum (actual output * Wage) - n * (average Wage * average output)}
can be covariance of actual output and wage and {(Sum (Wage2) – n*(average wage)2} is variance
of wage.
This is basic setup of a regression model for demo purpose only. When we have many variables
like X = a + b Y + c Z + d M +…. We do similarly take first derivatives and solve for each a,b,c,…..
Few features of this OLS method are that b and a (coefficients and constant) found in this process
are unbiased, have smallest variance (among all other methods, other than OLS), consistent (in a
sense that as we get larger sample we get better results) and linear (donot change with time or
for any other reason).
From these “a” and “b” we can then predict the values of employee output. With that employee
output we can calculate once again the errors and their squared average. The square root of this
average of square of errors is called standard error of regression.
We use variance (square of standard error of regression) and divide it with the variance of wage.
We get variance of “b”. Take square root of it and we get standard error of “b”. We divide “b”
with its standard error and it serves as at testof hypothesis that “b” is not zero againsthypothesis
that “b” is zero. If the resultant value is larger than 2, we accept that b is not zero (Similarly for
“a”).In some cases the individual coefficients may fail to explain but collectively the alldo explain
the results of a regression. In that case we use an F test.
A Regression is evaluated on several grounds.
P value is the probability of making an error of rejecting something when it is true. Therefore we
prefer a lower p value. It is to be noted that in our regression we have a “null hypothesis” that
our coefficient (b in above example) is zero. P-value is found using the t test table. In situations
where our focus is on causal analysis we can consider a p-value only and not the R square (R2)
below.

R2 is the square of correlation of forecasted values from a regression to its actual values with
which regression model was built. It tells us how much % age our model explains of the data we
used to develop the model. It might be of some help in some cases. I personally am not much
interested in R2.
F test: F test is likea t test which checks ifallcoefficients are zero or they collectivelyare not zero.
The formula is (R2/1-R2) * (n-k-1/k) where n is total number of observations and k is number of
coefficients. It should be checked in its F table if the value is acceptable for our combination of n
and k or not. If value is acceptable we say collectively we have non zero coefficients. It is used
when we have t tests that say coefficients are zero but we suspect that the coefficients
collectively (in their sum) are not.
Autocorrelation. The errors from the regression i.e. the actual values minus values that a
regression model generate could be of such nature that previous years’ value forecast current
years values.This is a hint that some pattern is missed,some variables from pastforecast present
values. Different test to check for autocorrelation include DW-test, LM test etc.
Hetero: We assume that all values of regression line are in the center of real data for all
observations. That is variance of errors is constant. This is not always the case. Once we have
hetero, we may end up with a case where, for example, lower values in regression fit could have
a smallererror and bigger have bigger. A viceversa is alsopossible.This canlead to biasedresults.
We have following test and solution for that. Diagnosis of heteros is done via White’s test.
Normality: All errors of a regression result should be normally distributed around a zero mean
and constant variance.This would mean that if we repeatedly useour model, the underprediction
would be canceled out by overprediction and on average we would have zero errors. Also that
normality means most errors are not far from zero. Once this is violated, it means that we have
some information missing in the data. Normality is checked via Jarque-Bera test.
Outliers: Some time errors are not normally distributed because of few extreme errors. These
are not by chance. These have information that can be used for insights into the economic and
business phenomenon. We can study them separately. The outliers are fixed using a dummy
variable approach i.e. we define a dummy that is ON when the outlier observation occurs. This
absorbs the effects of special case and we can then be informed about the
Time Series
A time series is set of observations of any economic phenomenon in arranged in order of time. It
represents the development of something over time. For example industrial output, interest rate,
inflation. The three main components in time series are its long term path (trend), short term

deviation (cycle) and irregular movements (errors). To handle a time series we first filter the
desired component and study it. There are different methods which I shall introduce later.
Once filtering is done, any time series can be studied for forecasting or for measuring or testing
causalimpact. Measuring acausalrelation would mean to answer second question above, testing
a causal relation refers to first question (see becoming empiricist). However, time series is
insufficient, sometimes, to test causal relation, we shift to panel data. Panel data is when we
observe a group of individuals over time.
Any time series can be represented as Xt where X is the observation and t is time for example
GDP2013 is GDP in year 2013. We may further write it as Xt= Trendt + Cyclet + Errort . This time
series may have two additional properties of stability of mean over time and stability of variance
over time. Once two properties are met we call a time series as stationary. To study a time series
we some do pre-filtering discussed below in order to achieve these properties.
The properties of time series are seldom met. However we may modify in a reversible way to
obtain a time series which have these properties. This is done by pre-filtering methods. One of
the most used method is method of differencing. Differencing means to subtract yesterday’s
value from today’s value. Denoted with d, we may say d = Xt – X(t-1). We may take logs and then
difference, we may take double or triple differences. Underlying idea is to obtain transformed
time series with the properties mentioned above. Double differencing would be dd = dt – d(t-1).
We may take difference at long lag. Lag mean how much in past e.g. D12 = Xt – X(t-12).
Filtering
As time series is composed of mainly a trend and cycle we may separate them and study each of
them. However, trend and cycles are arbitrarily defined by investigator. The two methods we use
are Hodrick Prescott filter and Baxter King Filter. (HP and BK). These separate trend as cycle as
needed. The HP filter is sensitive to last values in the data but BK is not. However, BK omits the
last values of data. I prefer BK for its analytical solutions and ease. The Formulas are below.
Hodrick Prescott:
The HP filter is often used to get an estimate of long-term trend components. HP tries to be as
close to real time series as possible but also produce smooth curves. The problem it faces is
written as: Minimise the variance of the original series y around the smoothed series μ, and
subject this to a penalty based on the variance of μ. Mathematically:
Minimize
As the λ rise the filtered time series become smoother and finally it may be reduced to one
straight line. Hodrick and Prescott (who introduced it) claim that λ should be 100 for data from
 


 
T
t
T
t
tttttty
1
1
2
2
11
2
)]()[()( 

year to year changes and 14400 for monthly changes. But then again this is in discretion of
researcher.
Baxter King:
It is based on a spectral analysis which decomposes a time series into its components each with
a different frequency. The sum of these components results in the original series. Lower
frequency would mean a long-run component and higher frequency would imply the short-run.
The calculation of this filter is as follows:
We find a two sided moving average written as
Aj = Bj + X;
j = 0, ±1, ±2,………±K; j is lags.
Bj =(W1 – W2)/π for j = 0 and
Bj = (1/ πj) * {sine (W2j) – sine(W1j)} for j being any number other than 0.
X = -1 * (Sum of all Bj) / (2 K + 1)
W1 and W2 are arbitrary and a researcher can choose which frequency he/she wants to extract.
Baxter and King proposed them to be as follows: If K = 12, and we have a 3 months aggregate for
US Business Cycle, W1 = 2 π / 32 and W2 = 2 π / 6.
Forecast Evaluation
Why would you rely on forecasts I would give you? We can surely not rely 100% but there are
methods to develop a trust on our models. One of these methods is to cut the sample
(observations available) in two parts. With one bigger part we develop our model and with one
smaller part we test our model. We consider that the smaller sample is not know to us, and
pretend that we are forecasting in real life. Then we compare the results of forecast real data.
There are two main methods of evaluation, one is called RMSE or root mean square error and
other Theil’s U.
RMSE simply calculates the square root of average of square errors. This tells us about on average
how much we deviated from real data. Mathematics is : Square Root [ Sum of [Error2]/number
of observation].
Theil’ U gives a better picture as it compares the naïve forecast errors against our model’s errors.
This is a ratio then. A ratio of 1 means we have a model which is same as naïve model and close
to zero means we have perfect forecast no error. Naïve model is where forecasted value is same

as today’s values i.e. we saytomorrow would be allthe same as today. The error of naïve forecast
is simply first difference (differencing of level 1).
A page in GRETL manual is informative. It is page 215 of GRETL manual Feb 2011.
Its screen shot is below.
ARFIMA ANN
With this basic setup we may proceed to models of time series. A model is simplified version of
reality and helps us acquire some useful insights in real life. I would introduce a very general
notation of model in time series which gives rise to many other models. It can be written as
ARFIMA – ANN. It relies on fact that due to trend or cycle or both we may be able to forecast
future values based on past values i.e. it focuses on pattern finding and mathematically modeling
them. It stands for Auto Regressive Fractionally Integrated Moving Average with Artificial Neural
Networks. Technically it may written as ARFIMA – ANN (a,b,c,d,e). where a,b,c,d,e are details of
AR, F,I, MA and AN. Details of each of the terms used are below.
Auto Regressive: A time series is auto regressive if its regresses to past i.e. today’s value is driven
by past. So we may say as Xt= a + b1 Xt-1 + b2 Xt-1 + ………… + bn Xt-n +Et

Here b’s represent how much effect of past value has on current value. Consider Company Sales
2000 = 1 million + 0.20 Company Sales 1999 would mean that current sales is 20% above previous
sales with and 1 million sale is minimum achieved (if the company just started today). This 1
million is ‘a’.
Moving Average: In time series analysis (opposed to technical analysis), a moving average is
lagged errors. It may influence today’s value. For example, an irregular movement or error in last
year sales may increase revenue and this may in-turn increase investment this year and sales
also. So we may say: Xt= a + b1 Et-1 + b2 Et-2 + ………… + bn Et-n + Et where E stands for errors in
past.
Integration: A time series may not be stationary and we need to do differencing. The level of
difference i.e. how many times we need to repeat differencing is called integration. So we may
say d Xt = a + b1 dXt-1 + b2 dXt-1 + ………… + bn dXt-n +dEt. .Here d stands for level of differencing.
Fractional: An integration which is not complete is called fractional integration. This means if we
subtract sales of this year from past, we use only x% of the resultant. Using only a fraction of
integration has technical reasons which would be made clear later. Formula is same as above but
with d being only in part i.e. d now is not complete difference but a part of that.
AN: Neural Network is a powerful tool for identifying patterns using a computer. This comes into
play when our present knowledge fails to tell us any pattern in data. It acts as an artificial but fast
brain with ability to learn.
“An Muti Layer Perceptron (or ANN) is typically composed of several layers of nodes. The first or
the lowest layer is an input layer where external information is received. The last or the highest
layer is an output layer where the problem solution is obtained. The input layer and output layer
are separated by one or more intermediate layers called the hidden layers.The nodes in adjacent
layers are usually fully connected by acyclic arcs from a lower layer to a higher layer. The
knowledge learned by a network is stored in the arcs and the nodes in the form of arc weights
and node biases which will be estimated in the neural network training process”
(Zhang, G., & Hu, M. Y. (1998). Neural network forecasting of the British Pound/US Dollar
exchange rate. Omega, 26(4), 495-506.).
Cases of an ARFIMA-ANN.
AR(1) = ARFIMA-ANN(1,0,0,0,0) also Random Walk
ARFIMA-ANN(1,0,0,0,0) is a case of special interest in finance and economics. It says what
happens today is based only on yesterday’s events.

Underlying assumptions (only when following maths is true) are that that all influences on time
series are exogenous and are random i.e. all changes in future are possible equally. An example
of this is day to day exchange rate. This means that time series walks randomly. Random Walk is
unpredictable. However, statisticians and mathematicians have come up with solutions.
These solutions rely on finding a pattern in time series and measuring the frequency with which
this pattern occurs. This frequency, if observed over large number of time, is called probability.
Further we may find a signal of pattern and pattern itself and calculated conditional probability.
This method equips decision makers with an idea of possibilities intime series and they can make
decisions depending on their risk appetite.
For the mathematically oriented we may write it as:
Xt = a + b Xt-1 + e
With b usually close to 1 and a is 0. The e is error and is always distributed around a 0 mean and
has a limited variance that matches the properties of normal distribution.
ARIMA(p,d,q) = ARFIMA-ANN(p,0,d,q,0)
Standard model in time series and it is used in forecasting (or pattern finding) using only previous
values of atime series.p stands for lags of auto regression, d for level of differencing, q for moving
average.
Mathematically we may write it as:
dXt= a + b1 dXt-1 + b2 dXt-2 + ………… + bn dXt-n + c1 dEt-1 + c2 dEt-2 + ………… + cn dEt-n + dEt.
ARFIMA-ANN(p,z,d,q,0) or ARFIMA
Used to model either long term trend along with short term deviations or used when we have
repeated cross sections. p stands for lags of auto regression, d for level of differencing, q for
moving average and z for fraction.
Long memory is when a very old event has impact on today’s value but this impact decays
gradually. Therefore we do not consider the whole of difference but a fraction.
Repeated cross section is when we observe some part of whole population at one time, another
part at other time and yet another part another time. We then combine these observations in
order of time. This is not a perfect time series since different people were observed in different
time. But if we can somehow argue that all these different people have similar traits which we
want to forecast, we may use fractional integration to use only that fraction of data that is
common between different agents.

dzXt= a + b1 dzXt-1 + b2 dzXt-2 + ………… + bn dzXt-n + c1 dzEt-1 + c2 dzEt-2 + ………… + cn dzEt-n + dEt.
here z stands for fraction taken.
ARIMA ANN = ARFIMA-ANN(p,0,d,q,y) And
ARFIMA-ANN(p,z,d,q,y) = ARFIMA ANN
These are state of the art forecasting methods. These combine properties of ARFIMA and
employees the power of computers to figure out patterns that are far more complicated than
ones in ARFIMA process. ARFIMA are known to model long memory time series (or repeated
cross section) while otherwise we would use ARIMA. ARIMA and ARFIMA models are also linear.
Linear is where coefficient themselves are constant and not variable. (for example X = a + b*Y,
should have a and b constant, if it turns out that a or b itself are influenced by, for example, time,
then we have a non linear model).
A hybrid model of ARFIMA and ANN is proposed to increase forecasting accuracy by exploiting
the linear and non linear properties simultaneously. p stands for lags of auto regression, d for
level of differencing, q for moving average and z for fraction and y represents ANN. Y could tell
number of layers and also type of function. Therefore it is not a number, for example it can be 2
tanh.
The two articles I found in this are :“Aladag, Cagdas Hakan, Erol Egrioglu, and Cem Kadilar.
"Improvement in Forecasting Accuracy Using the Hybrid Model of ARFIMA and Feed Forward
Neural Network." American Journal of Intelligent Systems 2.2 (2012): 12-17” and “Valenzuela,
Olga,et al."Hybridization of intelligent techniques and ARIMA models for time series prediction."
Fuzzy Sets and Systems 159.7 (2008): 821-845.”
Modeling Procedure ARFIMA-ANN.
There have been severalprocedures proposed to selectthe lag order of ARIMA models. Lag order
means how many previous values are needed in model (dzXt= a + b1 dzXt-1 + b2 dzXt-2 + …………
+ bn dnXt-n ……... this n is lag). Main one was by Box and Jenkins (who introduced ARIMA) who
used three steps in their Arima Model. These were identification, parameter estimation, and
diagnostic checking. The idea was that if a time series is generated from an ARIMA process, it
should have some theoretical autocorrelation properties. The empirical autocorrelation should
match with the theoretical. Some authors proposed the information-theoretic approaches such
as the Akaike’s information criterion (AIC), recently approaches based on intelligent paradigms,
such as neural networks, genetic algorithms or fuzzy system have been proposed to improve the
accuracy of order selection of ARIMA models.

This selection of lag, selection of differencing and selection of fraction and so is usually left on
the researcher to choose as he needs. The parameters p,d,q,z,y are at discretion of researcher.
(Slight variation from “Valenzuela,Olga,et al. "Hybridization of intelligent techniques and ARIMA
models for time series prediction." Fuzzy Sets and Systems 159.7 (2008): 821-845.”)
Adding an X to ARIMA (ADL)
Let us now say that sales today are not only influenced by past sales but also by a host of other
factors for example, interest rate, prices, income of buyers etc. When we add these to our model
we callitARFIMA-ANN-X model. Most popular one is ARIMA-X model. However, the other models
ARFIMA-X or ARFIMA-ANN-X are also possible. X stands for host of other factors that influence
our values.
The addition of ‘X’ allows for studying causal relations. This allowance is easily exploited in an
ADL model which is a special case of ARFIMA-ANN-X with all zero but AR and X non zero and I
being as needed. That is we do not consider Fractional integration or ANN or MA. We only focus
on previous values of, let say, sales and of other host of factors influencing, for example prices.
The previous values are called lags. The reason it is called ADL is that it stands for auto regressive
distributed lag model.
The ADL model can give us two information; first it forms basis of Granger causation and second
it tells us about the long term effect of explanatory variables. Mathematically we may write it as:
Xt= a + b1 Xt-1 + b2 Xt-2 + ………… + bn Xt-n + c1 Yt-1 + c2 Yt-2 + ………… + cn Yt-n + Et.
Now if we test using an F test for all c’s being zero we have started a Granger causation test.
Hypothesis is that changes in Y does not cause changes X. We may as well re-write the ADL with
X replaced with Y (that is now X is explaining Y), and do another test of reverse causation. Better
test of causation is found in VAR, panels FE models and 2SLS.
The second thing ADL give is long term average which is simply (Sum of c’s)/(1-sumof b’s). This
might be called a long term effect of Y on X.
Interesting:
I have not been able to find but would be interesting to see ARFIMA-ANN – X model. I am very
much interested in a practical work on this topic if someone is willing to join me.
Regime Shift
Till now we are making models that consider that relations do not change over time. However,
in real lifewe may facechanging circumstances. In one time a change in price may trigger a lower

demand and in other time it may increase the demand. Lets calleachtime period in which a single
relation exist between all variables as a regime. The regimes may change. And we should be able
to model that. It is here that a regime shift model comes into play. These models are sometimes
called xTAR models for example LSTAR.
A simple way to solvethis issueis to use adummy. Following example can clearit. Before German
reunification GDP of West Germany grew at a particular rate, after reunification growth and
levels both could have changed. Lets say GDP = a + b(Consumption) + c(Host of other factors) +
error.
The regime shift can be modeled using a dummy (a variable that is 1 or 0, like a switch 1 means
something is ON 0 means OFF.) In our case a dummy would turn ON and remain ON from the
year when Germany was reunited. We may then re write it as
GDP = a + b(Consumption) + c(Host of other factors) + d (Dummy of reunification) + e (Dummy of
reunification * Consumption) + f(Dummy of reunification *Host of other factors) + error.
Now if the dummy has a significant coefficient, then we have an overall impact on GDP because
of the reunion. If the coefficients d and f are also significant they can be compared to b and c to
check influence of shift in regime.
ARCH GARCH
From earlier discussion we remember time series would have a constant variance over time. In
real life we may not find such a situation. To solve these problems we may model variance. In
econometrics we would call changing variance in a regression as heteroskedasticity. (the simple
way to call it is heteros). The models of heteros are either ARCH or GARCH. If the time series has
heteros, then we can model the changes in variance and use the model to enhance our
understanding.
In an ARCH model we would take the errors and square them. These squared errors are same as
variance (for technical reasons). We then regress these against their own past values (lags)
and/or other explanatory variables (variables explaining the explained, wage in our example).
This gives us the model of variance.
VAR
Going back to the ADL model, we note that we can check two-way causation. If the two-way
causation is found to exist, we would have a closed system. If we do know that a system is closed
but do not know exactly how the variables interact with each other, we can define a closed
system with several ADL models and study effects of sudden changes in a variable on others.

These sudden changes are called shocks. This model itself is called Vector Auto regressions or
VAR.
A VAR offers a graphic (or tabular as needed) output. Each graph explains the evolution of
response of one variable to shock in other. These graphs can be analyzed separately. However,
VAR offers us ability to understand complicated effects for example sales might change due to
changes in threes years old sales, which influence two year old revenue, which influenced one
year old investment which then influenced this year’s sales.Thestudy is calledanalysis ofimpulse
responses. Here Impulse is the sudden shock.
A shock is a special transformation of error in the ADL equations. Errors in ADL underlying the
VAR are for technical reasons related across the equations and also not independent. This would
not be helpful if we want analysis ofone variable’s error (or one equation’s errors). Since an error
in one variable is now also error of other variable.
Technical reason is that we have a reduced form of model. A reduced form of model would
present a mixed effect of two or more variables. The errors would also be representing a mix of
errors of two or more variables from the original model. The original model can sometime not be
estimated but reduced form can be.
For example we donot see in aggregate economy that income changes such that price don’t
change. Therefore we cannot separate the effect of change in price and income when studying
changes in consumer’s demand. This means we have a reduced model of economy where effects
are mixed. To disentangle statistically these effects is, however, possible. This is what we do in
VAR using following method.
To achieve this analyticalend, we use CholeskyDecomposition. In this method ADLequations are
solved one after another leaving us with cleardevelopment of aunique error that is independent
of other errors. The first variable which is in first equation in ADL is influenced only by its own
lags and errors. The second is influenced by its own lags and shocks (errors) in first and the third
by its own lags, and shocks to second and first variables.
Later to make sure errors are not correlated we engage in mathematical process of gaining
orthogonal errors. The mathematics is to multiply first ADL equation in a VAR with a ratio as
follows: ratio= covariance of errors from first and second equation / variance of error of first
equation. We then subtract the first equation from second equation in a 2 variable VAR. The
errors now are as follows: first equation errors are errors (e1) and second equation errors are
adjusted errors (e2 – ratio * e1).
A covariance between these would be:

E(e1, adjusted e2) = E(e1 * (e2 – (e1*cov(e1,e2)/e12))) where E stands for expectation (average)
in case of errors (zero mean and normal distribution) it is variance or covariance.
E(e1, adjusted e2) = E(e1 * e2) – E(e1*cov(e1,e2)/e1)
E(e1, adjusted e2) = cov(e1, e2) – cov(e1,e2) since E(e1*e2) in case of error with normal
distribution and zero mean is cov (e1,e2).
E(e1, adjusted e2) = 0. No covariance.
A shock is also called impulse. A response of all variables to an impulse in one variable is the main
interest in VAR. Using the Cholesky Decomposition and Orthogonal Shocks, we study how
variables adjust themselves back to a state when shock did not appear. The time series generated
by this process is saved in a table and presented as a graph. (The basic mathematical details are
similar as above).
VAR are also good in causal analysis. In a system like VAR where X and Y re enforce each other,
we can find a factor Z that has nothing to do with X but disturbs Y. And this is STRICT condition.
If Z is active and we see a change in X through Y, we have evidence that causal relation of X and
Y exist. A shock in VAR just acts as Z. A significant effect of a shock implies that variable which
received shock was cause of the effect we observe. Kindly consider 2SLS and Panel Models too
for details of causal relation.
Different variants of VAR include a structured VAR with certain restriction (for example we may
say a variable will never receive effect from another), VAR with Moving Average process called
VAR-MA, VAR where two VARs interact, VAR with time varying parameters (similar to regime
shift) and Panel VAR.
Main power of VAR is that it offers a flexible and powerful closed system that can be used to
analyze our desired changes. Main weakness is heavy demand of data.
Panel and FE
We just focused on time series till now. We took example of sales of one company. How about
sales of several companies in an industry or output of several employees in one company over
time? Such a data is called Panel data under strict condition that same employees or companies
are observed over time.
The model we use for Panel Data is called Fixed Effect Model. There are other models too, but
simplest and powerful one is Fixed Effects Model. Its name has the reason that it assumes that
each individual (firm etc) has a unique fixed feature that can be controlled statistically. Similarly
each unit of time (month, year, etc) has a unique feature and can be controlled too. The most

important part is that even if we cannot observe these unique features we can control their
average effect by statistical means.
Once these unique features (especially of individuals) are controlled we can use the remaining
observations as if they are of same individual. This would be same as if we have many time series
observations (now the fixed features of individuals are removed and they are all same).
This has powerful effect of using observed data to identify effects of policy (or any other
question) as if we conducted a real experiment.
Also a perfect counterfactual can be understood since we statistically controlled the unique
features of individuals (observable or not). That is, after controlling of unique features, time
specific features and analyzing policy (or desired experiment) we may re combine them in our
imagined manner which may never existin real world. (Controlling means we have separated and
saved the average effects).
Consider this example. A firm wants to know if a policy, that they made in past, did effect the
output. For one reason or other, this policy was implemented on only few headquarters of the
firm.
We may take data of all of headquarters in a firm for last few years. We can then use FE to first
remove the individual effects. This means that we ruled out the possibility that some
headquarters had their own unique features that caused a change in output. This means we
donot need to investigate which were those features.
Second, the model would introduce a dummy (a variable that is 1 or 0, like a switch 1 means
something is ON 0 means OFF. Here it is policy). This dummy would turn ON on the year of policy
introduction and remain ON forever. As we regress, this dummy would show a coefficient. This
would be the effect of change of policy on firms headquarters regardless of the individual
features of the headquarters. This would same as if we conducted a real experiment.
Mathematically we would say:
Output it = ai + b Xit + c (Policy Dummy it) + errorit
Here “i” and “t” represent headquarter “i” and time “t”. “a” is unique for “i”, since it captures,
statistically, the (un)observed features of headquarter i. The X is set of all possible reasons for
output changes that change with time and perhaps headquarters.
This kind of causal study, controlling unobserved factors and counter factual development is
nearly difficult and perhaps impossible in times series or cross section. Since only time or only
individuals are studied in these (cross section is study of individuals only at one time).

2SLS:
In analyzing business and economics empirically we face such problems as missing data, low
quality data or we know that two variables re-enforce each other but cannot disentangle effect
of one on other. To solve such problems is one of the major tasks of an econometrician. The last
problem is extremely important since it solves the causation.
To properly understand the casual relation is the most important aspect in decision making. The
proper testing of a causal relation in economics is done via different methods. One of them is
called 2SLS or two stage least square method.
A two stage least square method can disentangle the two way feedback. Following example can
clarify the problem and use of 2sls.
A century and ahalfago, in old Prussiacrime started rising.At the same time alcohol consumption
went up too. The authorities we confused. They faced two opinions; first that the alcohol
consumption lead to rise in crime and second that rise in criminals in town lead to rise in alcohol
consumption. Back then no one was able to answer this question.
By now economist developed tools to solve this. We would now solve it as follows: We start with
assumption that beer and crime have a feedback effect. We find out all factors that influence
alcohol consumption (other than crime) such that those factors never on any ground (strictly)
influence the crime. Then we forecast the values of alcohol using these factors. These forecasted
alcohol consumption values are then used to forecast crime. If forecasted alcohol can forecast
crime we may say that crime is indeed influenced by alcohol. Since the alcohol consumption was
forecasted with factors that are neither crime nor influence crime. This method is called two
stage least square.
The factors influencing consumption are called instruments and in our example it was found that
they were factors influencing production i.e. weather conditions from last year. Since beer (only
alcohol drink) was neither stored nor exported, beer production equaled its consumption and
production itself has no effect on crime. No relation to crime (violent) is known for having more
barley production.
One technical aspect is that we have now two regressions; First of beer production and second
of crime. The evaluation of first stage regression is done not only on standard regression
evaluation (p-value etc) but also on one more test. It is called test of excluded instruments it
should have a value of 10 or so. It is an F test.
Logit

We are till now interested in data of sales, consumption, interest rate etc all of which is usually
represented as a number. However, being married or not is not a number but this difference do
play a role in labor force activity. Women’s labor activity might decline and man’s might increase.
So how to put these characters in our completely statistical, number based, analytical world of
econometrics?
Here we use Dummies. A dummy as mentioned else where is like a switch. It can be turned ON
or OFF. If we want to study effects of being married as compared to being unmarried, we may
define a dummy that assumes a value of 1 when an individual is married and 0 otherwise.
However, some time we have variables where there are many values. For example, belonging to
different states in a country with 10 states. In such a case we use 10-1 = 9 dummies, each as a
switch for one state. Once all switches are OFF we assume (for mathematical reasons) that the
missing state dummy is active. So our analysis is based on comparison of all states against
missing.
The analysis where we try to find out explanation of these dummies i.e. we question why one
marries or why one changes a job etc is done using Logit Models. Other Models are probit, tobit
etc. Logit is commonly used. In case when a variable has many values, we analyze one dummy
after another it would be called hierarchal logit model. The result of logit model is an odds ratio.
Odd ratio : probability of our desired event / probability of other events.

Basic statistics by_david_solomon_hadi_-_split_and_reviewed

Recommended

Recommended

More Related Content

What's hot

What's hot (13)

Similar to Basic statistics by_david_solomon_hadi_-_split_and_reviewed

Similar to Basic statistics by_david_solomon_hadi_-_split_and_reviewed (20)

More from bob panic

More from bob panic (6)

Recently uploaded

Recently uploaded (20)

Basic statistics by_david_solomon_hadi_-_split_and_reviewed