M140 Introducing Statistics
ARIMA
Slide 1:
One of the most important questions underlying this presentation is "what is data analysis?".
Data analysis is crucial in any business firm: it is the process of examining a given data set
to draw the most appropriate conclusions about the key question one wants to answer. It is
usually a continuous process, with data collected and analyzed while still under scrutiny,
because research normally tries to identify the patterns present in the whole of the data
that has been collected.
Tools like MS Excel, Python, and RStudio are used in data analysis projects. In the current
project, Python has been selected as the data analysis tool. Python has become the tool
many researchers prefer when analyzing a data set, because the software is flexible and the
language is easy for analysts to understand.
An ARIMA model will be used to forecast a given set of data.
Slide 2:
An ARIMA model entails three terms: p, d and q. These terms have specific meanings and
are essential to the model:
p – the order of the autoregressive (AR) term.
d – the number of differencing operations needed to make the time series stationary; for a
series that is already stationary, d = 0.
q – the order of the moving average (MA) term, i.e. the number of lagged forecast errors
that enter the ARIMA model.
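As a minimal illustration (not from the slides) of what the d term does, first differencing can be sketched in plain Python; the series and numbers here are made up for demonstration:

```python
def difference(series, d=1):
    """Apply d rounds of first differencing: x[t] - x[t-1]."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A series with a linear trend: one round of differencing (d = 1) removes it.
trend = [3, 5, 7, 9, 11, 13]
print(difference(trend, d=1))  # [2, 2, 2, 2, 2] -- constant, so stationary
```

A quadratic trend would need d = 2, which is exactly the role the d term plays in ARIMA(p, d, q).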
Slide 3:
1) Investigation of the concept of time series and forecasting. The project will show how
time series and forecasting are intertwined. Forecasting uses past information about the
matter under study to predict the future outcome, and time series forecasting draws on
historical trends, cyclical analysis, and seasonality. These concepts will be used in the
project to predict future business outcomes using the data from the Light IT consultancy
firm, to determine whether or not the implementation of the ARIMA model will be effective
in the business sector.
2) To perform the requirements gathering for system development. This project will ensure
adequate requirements gathering for the effective development of the ARIMA model using
Python 3. This will also show how critical requirements gathering is for system
development: for the model to work a system has to be developed, the system must ensure
that the time series is stationary, and it is in this phase that the most important
requirements of the project are discovered and the project team shows that it understands
what is needed. The project therefore aims at doing the requirements gathering properly, to
avoid issues that would cause the failure or delay of the project.
3) To design and model the proposed ARIMA system focusing on business customer data
from the Light IT consultancy firm. This project aims at implementing the ARIMA model in
the business sector, and a good system design will ensure that the model works effectively.
The system will ensure that the data analysis is done with precision and the results are
interpreted appropriately, allowing entrepreneurs to make proper decisions for their
business. The data from the consultancy firm will be analyzed with Python using the
ARIMA model, which entails time series analysis and forecasting.
4) Coding the system using the Python programming language. Python allows one to work
faster and integrate the system effectively. That is why this project uses Python, so as to
finish the analysis faster and obtain the most effective results: Python allows analysts and
programmers to write clear, logical code.
5) To test and validate the ARIMA system code against the requirements using business
customer data from the Light IT firm. This is the part where the project analyzes the data
and applies the suggested model so that clear interpretations can be made. The project
wants to determine whether the model is effective in the data analysis process, prove that it
can be used in the business sector, and predict the future outcome of a business using time
series forecasting.
Slide 4:
The challenges involved in the project are:
Hardware failure. Hardware such as computers can sometimes crash due to internal drive
malfunction, which can lead to loss of the data or important information to be used in the
project. Other hardware includes external storage devices, which might get lost or be
infected with malicious software.
Power shortage. The data analysis software and machines depend on electricity, and
whenever there is a power shortage these processes cannot continue. That is why a
constant power supply is very necessary for this project.
Data loss. This problem sometimes arises from poor storage of the data set: the data might
be lost because it was stored on a device infected by viruses, or because the storage device
was stolen. For this project no data has been lost yet, and to prevent this from happening
several backup plans have been put in place.
Poor management of change. Change is inevitable when doing a project, and when team
members do not accept change quickly, that becomes a challenge because helpful new ideas
will not be generated. At one point in this project this problem was experienced: a
difference in ideology resulted in conflict. Although it was resolved, it might happen again,
and to avoid that, communication among team members has been improved.
Slide 5:
This is the general equation for an ARIMA(p, d, q) model:
Xt = β0 + β1·Xt-1 + … + βp·Xt-p + εt − θ1·εt-1 − … − θq·εt-q
where:
β0 is a constant;
β1, …, βp are the autoregressive parameters to be estimated;
Xt is the observed value of the time series at time t (month t = 1, 2, 3, …);
θ1, …, θq are the moving average parameters to be estimated;
Xt-1, …, Xt-p are the observed values at times t−1 up to t−p, and εt-1, …, εt-q are the past error terms;
εt is the error term, with mean 0 and constant variance.
Slide 6:
Before proceeding with an ARIMA model, it is necessary to check the stationarity of the data
involved in the study. Certain tests help check stationarity, such as the Augmented
Dickey-Fuller test and the Phillips-Perron unit root test. The null hypothesis for both tests
is the same: the data is not stationary, i.e. a unit root is present in the data. The null
hypothesis is rejected when the p-value obtained is less than 0.05.
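In Python one would normally call `adfuller` from statsmodels, but the core mechanics of the Dickey-Fuller test can be sketched with NumPy alone. The sketch below is a deliberately simplified, non-augmented version (no lag-difference terms) on made-up simulated data, so it illustrates the idea rather than replacing the real test:

```python
import numpy as np

def dickey_fuller_stat(x):
    """Simplified (non-augmented) Dickey-Fuller statistic.

    Regress the first difference on the lagged level (with intercept) and
    return the t-statistic of the lag coefficient.  Values well below about
    -2.86 (the approximate 5% critical value) reject the unit root, i.e.
    suggest the series is stationary.
    """
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                                   # dependent variable
    X = np.column_stack([np.ones(len(dx)), x[:-1]])   # intercept + lagged level
    beta, *_ = np.linalg.lstsq(X, dx, rcond=None)
    resid = dx - X @ beta
    sigma2 = resid @ resid / (len(dx) - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
eps = rng.standard_normal(300)
random_walk = np.cumsum(eps)          # has a unit root: non-stationary
ar1 = np.zeros(300)                   # AR(1) with phi = 0.5: stationary
for t in range(1, 300):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]

print(dickey_fuller_stat(random_walk))  # modest value: cannot reject unit root
print(dickey_fuller_stat(ar1))          # strongly negative: stationary
```

For real work the augmented version (extra lagged-difference regressors) and proper critical values from statsmodels should be used.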
Slide 7:
From the graph it can be said that the series is not stationary, because the mean, variance,
and covariance are not constant over the given period of time. The differing means and
variances can be seen in the distribution of the peaks and troughs on the graph; they are
not the same because the data has a trend, meaning that sales increase over time. Because
of this, the time series analysis cannot yet be performed.
Slide 8:
From the ACF results it can be seen that the autocorrelations decay quickly after the first
lag, which is a sign of stationary data. The new graph for the differenced series shows no
trend, meaning that the mean and the variance are roughly constant, confirming that the
data is now stationary. The data is now ready for the analysis needed to find the best
ARIMA model for this series.
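The ACF behaviour described above can be checked numerically. A small sketch of the sample autocorrelation function, with simulated data standing in for the sales series:

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = x @ x
    return np.array([x[:len(x) - k] @ x[k:] / denom for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
# Random walk plus a deterministic trend: clearly non-stationary.
trending = np.cumsum(rng.standard_normal(200)) + 0.5 * np.arange(200)
differenced = np.diff(trending)

print(sample_acf(trending, 5))     # stays close to 1: the trend dominates
print(sample_acf(differenced, 5))  # drops to near zero after lag 0
```

A slowly decaying ACF signals non-stationarity, while a sharp drop after lag 0 is the pattern the slide describes for the differenced series.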
Slide 9:
The best ARIMA model has been found to be the ARIMA model of order (9, 2, 0). This is the
model recommended for implementation in Python 3 in the business sector, and the Light
IT firm can also use it to carry out any other analysis it wishes. Having this model does not
mean it is the only option; other models can also be used, depending entirely on a person's
preference.
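Because q = 0, an ARIMA(9, 2, 0) model reduces to an AR(9) fitted to the twice-differenced series. Below is a pure-NumPy sketch of that idea; the toy quadratic series stands in for the firm's data, and in practice one would use a library such as statsmodels:

```python
import numpy as np

def fit_arima_p20(x, p=9, steps=5):
    """Sketch of ARIMA(p, 2, 0): since q = 0, fit an AR(p) by least squares
    to the twice-differenced series, then integrate forecasts back twice."""
    x = np.asarray(x, dtype=float)
    d2 = np.diff(x, n=2)                              # apply d = 2
    # Build the lag matrix for the AR(p) regression (with intercept).
    rows = [d2[t - p:t][::-1] for t in range(p, len(d2))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = d2[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    history = list(d2)
    forecasts = []
    last2, last1 = x[-2], x[-1]
    for _ in range(steps):
        lags = np.array(history[-p:][::-1])           # most recent lag first
        d2_next = beta[0] + lags @ beta[1:]
        history.append(d2_next)
        # Undo the two rounds of differencing: x_t = d2_t + 2*x_{t-1} - x_{t-2}.
        nxt = d2_next + 2 * last1 - last2
        forecasts.append(nxt)
        last2, last1 = last1, nxt
    return np.array(forecasts)

# On a purely quadratic series the model recovers the trend exactly:
quad = np.arange(30, dtype=float) ** 2
print(fit_arima_p20(quad))  # continues the squares: 900, 961, 1024, 1089, 1156
```

The twice-differenced quadratic is a constant, so the AR(9) fit is exact and the integrated forecasts continue n² perfectly, which makes the differencing-then-forecasting mechanics easy to see.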
From the research, it was realized that using an ARIMA model in Python 3 is a highly
effective form of data analysis. This was shown when the data from the Light IT consultancy
firm was tested using the model: the results were pleasing because the forecast was
obtained in a straightforward way. The analysis showed that the data was not stationary, so
differencing had to be done to make it stationary. This led to the conclusion that firms in
the business sector could use the same model for their data analysis and keep their
businesses on track.
With the advancement of technology, everybody wants a programming language they
understand well; software these days works with code, avoiding long structural sentences
that take plenty of time to write. This research brought us to the realization that languages
like Python might look difficult to handle but are very effective to use, implying that a
company should consider having an expert who is good at coding to facilitate fast
manipulation of the company's data.
From the extensive discussion conducted in this study, it was realized that the ARIMA
model is among the best time series models, one that many people use when they want to
conduct time series research; the others are good, but not as good as the ARIMA model for
this task. The model performs well because it uses the lagged moving average to smooth the
time series under study and works on the assumption that the future depends on past
incidences; it is broadly used in statistical and technical analysis to forecast data.
Slide 10:
A few recommendations can be made based on the results:
Companies should venture into the use of software to boost their businesses; the world is
advancing and analogue methods of data storage and analysis are almost forgotten. Good
software captures details in a firm that cannot be handled manually, or that would take far
longer if done manually. Tools such as Python not only run the analysis but also keep the
results stored for future reference.
The business sector should use programming languages to facilitate faster manipulation of
data; since computers increasingly use artificial intelligence, code is an easier way to tell or
command the computer what you would like done. This makes such machine languages a
very important asset for all businesses.
It is wise for all companies to store all records of their sales for future reference; for
example, a time might come when a company wants to know how it has been faring in
general since it was started. It is the stored data that will make this possible, because
without data the company cannot know whether it is doing well or where it needs to make
adjustments.
Regression
Slide 1:
Regression analysis is the most widely used technique for fitting models to data. When a
regression model is fit using ordinary least squares, we get a few statistics to describe a
large set of data. These statistics can be highly influenced by a small set of data that is
different from the bulk of the data. These points could be y-type outliers (vertical outliers)
that do not follow the general model of the data or x-type outliers (leverage points) that are
systematically different from the rest of the explanatory data. We can also have points that
are both leverage points and vertical outliers, which are sometimes referred to as bad
leverage points. Collectively, we call any points of these kinds outliers.
There are a few techniques to estimate the parameters in regression; one of the methods is
Ordinary Least Squares (OLS). Estimating parameters with OLS requires certain prescribed
assumptions to be satisfied, with errors typically independent and normally distributed
with mean 0 and constant variance σ².
The Assumptions Involved In Regression Are:
Autocorrelation: It measures how the lagged version of a variable's value relates to its
original version, when time series data is considered.
Heteroscedasticity: It refers to situations where the variance of the residuals is unequal
over the range of measured values. When running a regression analysis,
heteroscedasticity results in an unequal scatter of the residuals.
Multicollinearity: It is a statistical concept where several independent variables in a model
are correlated. Two variables are considered perfectly collinear if their correlation
coefficient is +/- 1.0. Multicollinearity among independent variables results in less reliable
statistical inferences.
Normality: The error terms (and hence the residuals) must follow a normal distribution.
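Two of these assumptions can be checked numerically without any specialist package. A hedged NumPy sketch: the data is simulated, and the Durbin-Watson and variance-inflation-factor diagnostics are standard choices of mine rather than tests named in the slides:

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: values near 2 mean no first-order autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def vif(X, j):
    """Variance inflation factor of column j: based on the R^2 from regressing
    X[:, j] on the remaining columns (plus intercept).  VIF > ~10 signals
    problematic multicollinearity."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(2)
white = rng.standard_normal(200)
print(durbin_watson(white))                 # close to 2: no autocorrelation

x1 = rng.standard_normal(200)
x2 = rng.standard_normal(200)               # independent of x1
x3 = x1 + 0.01 * rng.standard_normal(200)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])
print(vif(X, 1))                            # near 1: x2 is unproblematic
print(vif(X, 0))                            # very large: x1 ~ x3 collinearity
```

In practice these diagnostics are available ready-made (e.g. in statsmodels), but the formulas above show what they compute.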
Slide 2
The objective deliverables under consideration are:
To define data analytics from the business and data usage perspectives: Defining what
exactly data analytics is gives a broader view of the topic of discussion; in this case, data
analytics is described as the analysis of data to reach a substantive conclusion on a given
topic of study (Mehta and Pandit, 2018). This will be achieved through research and
reading of materials to gain a precise understanding of what data analytics is in the data
processing and business decision-making process.
To justify the need for data analytics in the business decision-making process: This
objective is tailored to identify the key gaps in conventional data processing, presenting
analytics as a solution by comparing traditional data processing and its efficiency with data
analytics and its respective efficiency in the business decision-making process.
To identify the regression data analytics techniques and algorithms: Identifying the diverse
techniques used in regression analytics is key in tailoring the research toward achieving its
overall goal as specified in the topic of study.
To demonstrate the benefits of regression algorithms in predictive analytics: Through a
demonstration of regression analytics, the benefits drawn from it shall be identified as the
specifics to be achieved in this objective. This is done through a hands-on demonstration of
one or two regression algorithms used in data analytics.
Slide 3
Regression analysis is one of the most important and commonly used statistical tools for
examining the relationship between a dependent variable and one or more independent
variables, with wide applications in finance, economics, medicine, and psychology. A
regression model is generally written as
Y = Xβ + ε (1)
in which Y is the vector of the dependent variable, ε is the vector of true errors, and X is the
n × p design matrix. β̂ denotes the estimator of β, and the fitted residuals in (2) are
e = Y − Xβ̂ (2)
Regression analysis ordinarily uses the least-squares method to estimate the model
parameters, under certain assumptions such as normality of the errors with zero mean and
constant variance, i.e. ε ∼ N(0, σ²).
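The least-squares estimate has the closed form β̂ = (XᵀX)⁻¹XᵀY, with fitted residuals e = Y − Xβ̂. A minimal NumPy sketch with made-up data:

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares via the normal equations:
    beta_hat = (X'X)^(-1) X'y, residuals e = y - X @ beta_hat."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    return beta, y - X @ beta

# On noiseless data OLS recovers the coefficients exactly.
rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n), rng.standard_normal(n)])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta
beta_hat, resid = ols(X, y)
print(beta_hat)  # recovers [1.0, 2.0, -0.5]
```

Solving the normal equations directly is fine for a sketch; numerically robust libraries use a QR or SVD factorization instead.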
Slide 4
Outliers, being observations that are inconsistent with and markedly deviate from the
majority of the observations in the data, need proper handling, as they present a genuine
danger to the regression model and its estimated coefficients and can thus give misleading
results (Werner, 2019). Two kinds of outliers can occur in a regression dataset:
observations with extreme values in the response are referred to as vertical outliers, while
observations with extreme values in the explanatory variables are called leverage points.
Robust regression is a process for overcoming the problem of outliers and influential
observations in data and limiting their influence on the regression results. Many so-called
robust regression techniques, however, do not fully possess this property. The basic goal of
robust estimation is to provide reliable estimates and inferences for the unknown
parameters while keeping outliers at bay. A robust method substitutes for OLS's sum of
squared residuals some other function that is less affected by unusual observations. These
methods first fit a regression to the data and then identify outliers as observations with
large residuals. Efficiency, breakdown point, and bounded influence are three desirable
features of robust methods. The breakdown point is the smallest fraction of contaminated
observations that an estimator can tolerate before producing an arbitrarily wrong result.
Least Trimmed Squares (LTS) is a highly robust and reasonably efficient estimator among
the robust estimators available in the literature, obtained by minimizing the trimmed sum
of the squared residuals. The LTS estimator is a modified version of the LS estimator that
concentrates on the best-fitting majority of the data while ignoring the extreme
observations.
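LTS as described can be sketched in a few lines of NumPy: minimize the sum of the h smallest squared residuals via random elemental starts plus concentration steps (the idea behind the FAST-LTS algorithm). The data is simulated, and this is an illustrative sketch rather than a production implementation:

```python
import numpy as np

def lts_fit(x, y, h=None, n_starts=50, n_csteps=20, seed=0):
    """Sketch of Least Trimmed Squares for simple regression: minimize the
    sum of the h smallest squared residuals using random two-point starts
    followed by concentration steps."""
    rng = np.random.default_rng(seed)
    n = len(x)
    h = h or int(0.75 * n)                    # fraction of points to keep
    X = np.column_stack([np.ones(n), x])
    best, best_obj = None, np.inf
    for _ in range(n_starts):
        idx = rng.choice(n, size=2, replace=False)   # elemental start
        for _ in range(n_csteps):
            beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
            r2 = (y - X @ beta) ** 2
            idx = np.argsort(r2)[:h]                 # concentration step
        obj = np.sort(r2)[:h].sum()                  # trimmed sum of squares
        if obj < best_obj:
            best, best_obj = beta, obj
    return best

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 1 + 2 * x + 0.3 * rng.standard_normal(100)   # inliers follow y = 1 + 2x
y[:20] = -10                                      # 20 gross vertical outliers
ols_beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(100), x]), y, rcond=None)
lts_beta = lts_fit(x, y)
print(ols_beta)  # slope dragged away from 2 by the outliers
print(lts_beta)  # close to [1, 2]: the outliers are trimmed away
```

With 20% contamination, OLS is pulled off the true line while the trimmed objective lets LTS recover it, which is exactly the breakdown-point advantage described above.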
Slide 5
This is the result of the residual normality test carried out using the Kolmogorov-Smirnov
test. The p-values are very low; therefore the null hypothesis is rejected, and it can be
inferred that the residuals of the classical linear regression models are not normally
distributed. Residuals that are not normally distributed can be caused by outliers in the
data.
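The Kolmogorov-Smirnov check can be sketched without any statistics package. One caveat: because the mean and variance are estimated from the residuals, the 1.36/√n critical value used here is only approximate (the Lilliefors correction would be stricter). The residuals below are simulated stand-ins for the study's residuals:

```python
import math
import numpy as np

def ks_normal_stat(resid):
    """Kolmogorov-Smirnov distance between the empirical CDF of the
    standardized residuals and the standard normal CDF."""
    z = np.sort((resid - resid.mean()) / resid.std())
    phi = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    n = len(z)
    ecdf_hi = np.arange(1, n + 1) / n     # empirical CDF just after each point
    ecdf_lo = np.arange(0, n) / n         # empirical CDF just before each point
    return max(np.max(ecdf_hi - phi), np.max(phi - ecdf_lo))

rng = np.random.default_rng(5)
normal_resid = rng.standard_normal(500)
skewed_resid = rng.exponential(1.0, 500)   # clearly non-normal residuals
threshold = 1.36 / math.sqrt(500)          # approximate 5% critical value

print(ks_normal_stat(normal_resid), threshold)  # below threshold: keep H0
print(ks_normal_stat(skewed_resid), threshold)  # well above: reject normality
```

A statistic above the threshold rejects the null of normality, mirroring the low p-values reported on the slide.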
Slide 6
Outlier detection was done using TRES detection, and this result was obtained. In light of
the results from the table, all the malnutrition data from 2012-2017 contain outliers.
Slide 7
These figures show the standardized residuals versus fitted values and the robust distances
for India, respectively. A thorough investigation reveals that both population growth and
foreign direct investment (FDI) inflows play a significant role in Pakistan's economic
development. Nonetheless, FDI inflows are inextricably linked to population growth, and so
is economic growth, even though the influence of gross investment is minor.
Slide 8
The current study examined the impact of FDI inflows, yearly population growth, and gross
investment on Pakistan's and India's GDP per capita using least squares (LS) and
high-breakdown robust least trimmed squares (LTS) regression techniques. FDI has a small
but positive impact on both Pakistan's and India's economic growth within the LS
framework; however, once the LTS approach is implemented, FDI becomes a decisive and
vital part of Pakistan's economic development model. Nonetheless, after the removal of 5
and 2 outliers from the data of Pakistan and India respectively, FDI has a negligible impact
on the Indian economy. Population growth contributes to GDP per capita in the two
economies alike. Both methods reveal that rapid population growth adversely impacts the
economic development of the two countries and hence is a significant issue for the
economic development of both economies, one that requires prompt attention.