The objective of this report is to identify any pattern or seasonality in the egg depositions of age-3 Coregonus (Bloater) in Lake Huron, one of the five Great Lakes of North America, using time-series analysis methods, and to forecast any changes in egg depositions for the next five years. The report covers finding a relevant model and applying suitable approaches to fit it, using visualisation and R functions on the provided dataset.
Egg deposition for Bloaters, aged 3, in Lake Huron and predicting next five year values
Time Series Analysis MATH1318
Arpan Kumar (s3696599)
May 12, 2019
Importing necessary libraries
library(TSA)
library(lmtest)
library(tseries)
library(rlang)
library(pillar)
library(forecast)
Introduction
Coregonus hoyi is a silver-coloured freshwater fish found mostly in Lake Nipigon and the Great Lakes, where it inhabits underwater slopes. It is also known as the Bloater and belongs to the family Salmonidae.
The objective of this report is to identify any pattern or seasonality in the egg depositions of age-3 Coregonus in Lake Huron (one of the five Great Lakes in North America) using time-series analysis methods, and to forecast any changes in egg depositions for the next five years. This report covers finding a relevant model and applying suitable approaches to fit it, using visualisation and R functions on the provided dataset.
Reading Dataset
Eggs_Depositions <- read.csv("C:/Users/kumar/Downloads/Semester 3/Time Series Analysis/Assignment 2/eggs.csv")
The egg deposition series is available as the BloaterLH dataset in the FSAdata package and consists of two variables: Year (a numerical variable from 1981 to 1996) and Egg depositions (in millions).
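As an alternative to reading the CSV file below, the series could in principle be loaded directly from the FSAdata package. This is only a sketch; it assumes the package is installed and that the BloaterLH data frame carries the year and egg deposition columns described above.
# Sketch only: load the egg deposition series straight from FSAdata
# (assumes the package is installed and the column names match the
# description above; the analysis below uses the CSV file instead).
library(FSAdata)
data(BloaterLH)
head(BloaterLH)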
Before converting the dataset into a time series, it is important to check the class of the dataset.
class(Eggs_Depositions)
[1] "data.frame"
Converting the data frame into time-series format using the ts() function
Data <- ts(as.vector(Eggs_Depositions$eggs),start = 1981, end = 1996, frequency =1)
class(Data)
[1] "ts"
Data Exploration
Time Series visualisation
plot(Data,ylab = "Egg Egg deposition (in Mns)",xlab="Years", main = "Figure 1, Egg depositions o
f age 3 Bloaters in Lake Huron
(1981-1996)",type="o",col="darkblue",xaxt="n")
axis(1,at=seq(1981,1996,by=1),las=2)
From the above plot (Figure 1), a trend is clearly observed. Egg depositions rise to a peak around 1990, then show a downward trend until 1993, followed by an upward trend after 1993. It can therefore be concluded that there is changing variance within the dataset. The dependence between succeeding observations also suggests auto-regressive behaviour. Hence it will be challenging to prepare the data for predictions for the next five years.
Scatterplot comparing lagged value
y=Data
x=zlag(Data)
index=2:length(x)
cor(y[index],x[index])
[1] 0.7445657
plot(y = y, x = x, ylab = "Egg deposition (in Mns)", xlab = "Egg depositions for previous year (in Mns)", main = "Figure 2, Scatter plot for Egg depositions against its lagged value", col = "darkblue")
With a correlation value of 0.74 and from Figure 2, it can be inferred that there is a strong correlation between a year's egg deposition and its lagged value (the previous year's egg deposition).
Interpreting the Time Series using Modeling Techniques
In the process of selecting the best model, different modeling techniques will be used to identify the model that fits the data best.
Linear Model
model1 = lm(Data~time(Data))
summary(model1)
Call:
lm(formula = Data ~ time(Data))
Residuals:
Min 1Q Median 3Q Max
-0.4048 -0.2768 -0.1933 0.2536 1.1857
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -165.98275 49.58836 -3.347 0.00479 **
time(Data) 0.08387 0.02494 3.363 0.00464 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4598 on 14 degrees of freedom
Multiple R-squared: 0.4469, Adjusted R-squared: 0.4074
F-statistic: 11.31 on 1 and 14 DF, p-value: 0.004642
plot(Data, ylab = "Egg deposition (in Mns)",xlab="Years", main = "Figure 3, Fitted linear trend
model",type="o",col="darkblue",xaxt="n")
axis(1,at=seq(1981,1996,by=1),las=2)
abline(model1,col="Blue",lty=2)
res.model1 = rstudent(model1)
plot(y = res.model1, x = as.vector(time(Data)), xlab = "Years", ylab = "Standardised Residuals", main = "Figure 4, Residual of linear trend model", type = "o", col = "darkblue", xaxt = "n")
axis(1,at=seq(1981,1996,by=1),las=2)
qqnorm(res.model1,main="Figure 5, Normal QQ Plot for residual values")
qqline(res.model1,col =4,lwd=1,lty=2)
shapiro.test(res.model1)
Shapiro-Wilk normality test
data: res.model1
W = 0.7726, p-value = 0.001205
From the linear model summary, the adjusted R-squared of 40% explains only a weak share of the variance in the values;
from Figure 5, the data points do not lie close to the line of best fit, indicating that the residuals are not normal; and
from the Shapiro-Wilk test summary, the p-value is less than 0.05.
It can therefore be inferred that the linear model is not a good model to go ahead with.
Quadratic Model
t = time(Data)
t2 = t^2
model2= lm(Data ~ t+t2)
summary(model2)
Call:
lm(formula = Data ~ t + t2)
Residuals:
Min 1Q Median 3Q Max
-0.50896 -0.25523 -0.02701 0.16615 0.96322
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.647e+04 2.141e+04 -2.170 0.0491 *
t 4.665e+01 2.153e+01 2.166 0.0494 *
t2 -1.171e-02 5.415e-03 -2.163 0.0498 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4092 on 13 degrees of freedom
Multiple R-squared: 0.5932, Adjusted R-squared: 0.5306
F-statistic: 9.479 on 2 and 13 DF, p-value: 0.00289
plot(ts(fitted(model2)), ylim = c(min(c(fitted(model2), as.vector(Data))), max(c(fitted(model2), as.vector(Data)))), ylab = "Egg deposition (in Mns)", main = "Figure 6, fitted quadratic model", type = "l", lty = 2, col = "blue", xlab = "Years")
lines(as.vector(Data), type = "o")
res.model2 = rstudent(model2)
plot(y = res.model2, x = as.vector(time(Data)), xlab = "Year", ylab = "Standardised Residuals", type = "o", xaxt = "n", main = "Figure 7, Residual of quadratic model", col = "darkblue")
axis(1,at=seq(1981,1996,by=1),las=2)
abline(h=0, col="Blue")
qqnorm(res.model2,main="Figure 8, Normal QQ Plot for residual values")
qqline(res.model2, col=4,lwd=1,lty=2)
shapiro.test(res.model2)
Shapiro-Wilk normality test
data: res.model2
W = 0.87948, p-value = 0.03809
From the quadratic model summary, the adjusted R-squared of 53% explains slightly more of the variance in the values than the linear model;
from Figure 8, the data points lie closer to the line of best fit, suggesting the residuals are closer to normal; and
from the Shapiro-Wilk test summary, the p-value is still less than 0.05.
It can be inferred that the quadratic model is a slightly better model than the linear model, with a higher adjusted R-squared (53%) and QQ-plot points closer to the line of best fit. However, both models have Shapiro-Wilk p-values below 0.05, rejecting the null hypothesis, which means the residuals are not normally distributed.
Preparing Dataset
Hypothesis Testing
ACF (Auto-correlation function) and PACF (Partial auto-correlation function).
Using ACF and PACF to conduct an initial check of the hypothesis.
H0: Dataset is non-stationary
HA: Dataset is stationary
The ACF shows how the present values of the dataset are related to its lagged (past) values.
acf(Data)
The PACF, in turn, shows the correlation between the residuals and the next lag value of a given time series.
pacf(Data)
From the ACF and PACF plots shown above, a slowly decaying pattern is observed in the ACF and a high first-lag correlation in the PACF. This indicates the presence of a trend and that the data are non-stationary. Therefore, to make the data stationary, it is necessary to perform transformation and differencing on the given data.
Transformation
Before transforming the dataset, the Augmented Dickey-Fuller test is conducted to statistically re-confirm the non-stationarity of the dataset, and the Shapiro-Wilk test is used to check for normality.
adf.test(Data) #Augmented Dickey-fuller test
Augmented Dickey-Fuller Test
data: Data
Dickey-Fuller = -2.0669, Lag order = 2, p-value = 0.5469
alternative hypothesis: stationary
shapiro.test(Data) #Shapiro-Wilk normality test
Shapiro-Wilk normality test
data: Data
W = 0.94201, p-value = 0.3744
Since the p-value (0.5469) of the Dickey-Fuller test is greater than 0.05, we fail to reject the null hypothesis; in other words, the dataset is non-stationary.
In the Shapiro-Wilk test, since the p-value (0.37) is greater than 0.05, we fail to reject the null hypothesis of normality; in other words, the dataset is normally distributed.
Box-Cox Transformation
Data_T=BoxCox.ar(Data,method="yule-walker")
possible convergence problem: optim gave code = 1
possible convergence problem: optim gave code = 1
Data_T
$`lambda`
 [1] -2.0 -1.9 -1.8 -1.7 -1.6 -1.5 -1.4 -1.3 -1.2 -1.1 -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4  0.5
[27]  0.6  0.7  0.8  0.9  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0

$loglike
 [1] -18.4457011 -15.1801393 -11.9522118  -8.7654260  -5.6237637  -2.5317646   0.5053751   3.4816875   6.3903005   9.2232907  11.9715279
[12]  14.6245247  17.1703124  19.5953784  21.8847108  24.0220046  25.9900870  27.7715993  29.3499352  30.7103710  31.8412544  32.7350622
[23]  33.3891294  33.8059042  33.9926850  33.9609089  33.7251475  33.3019920  32.7089913  31.9637533  31.0832597  30.0833992  28.9786936
[34]  27.7821762  26.5053815  25.1584095  23.7500337  22.2878321  20.7783238  19.2271034  17.6389661

$mle
[1] 0.4

$ci
[1] 0.1 0.8
The lambda values captured by the 95% confidence interval fall between 0.1 and 0.8. The mid-point of the confidence interval CI[0.1, 0.8] is 0.45, and it is this value of 0.45 that will be used as the lambda value for the Box-Cox transformation.
lambda = 0.45
Data_T_BoxCox = (Data^lambda-1)/lambda
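As a minimal sketch (not in the original code), the lambda value could also be taken directly from the BoxCox.ar output Data_T instead of being typed manually:
lambda = mean(Data_T$ci)                    # mid-point of the 95% CI returned by BoxCox.ar, (0.1 + 0.8)/2 = 0.45
Data_T_BoxCox = (Data^lambda - 1)/lambda    # Box-Cox transformation for lambda != 0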
Normality check post BoxCox transformation
qqnorm(Data_T_BoxCox,main="Figure 9, Normal QQ Plot post BoxCox transformation, lambda=0.45")
qqline(Data_T_BoxCox,col=4)
shapiro.test(Data_T_BoxCox)
Shapiro-Wilk normality test
data: Data_T_BoxCox
W = 0.96269, p-value = 0.7107
With the increase in the Shapiro-Wilk p-value from 0.374 to 0.711, and with the data points in Figure 9 lying closer to the reference line (i.e., showing less deviation from the line of best fit), it can be said that the normality of the data series has improved after the Box-Cox transformation. Hence, the transformed series can now be treated as normally distributed.
Differencing
First Differencing
Data_T_BoxCox_diff = diff(Data_T_BoxCox)
plot(Data_T_BoxCox_diff,type="o",ylab="Egg deposition (in Mns)", main = "Figure 10, First differencing plot on transformed data",xaxt="n",xlab="Years",col="darkblue")
axis(1,at=seq(1981,1996,by=1),las=2)
adf.test(Data_T_BoxCox_diff)
Augmented Dickey-Fuller Test
data: Data_T_BoxCox_diff
Dickey-Fuller = -3.6798, Lag order = 2, p-value = 0.0443
alternative hypothesis: stationary
With the p-value of 0.04 below the alpha level of 0.05, we reject the null hypothesis, which means the data series is stationary after first differencing. However, a trend can still be observed in Figure 10. Therefore, second differencing is applied to eliminate the remaining trend.
Second Differencing
Data_T2_BoxCox_diff = diff(Data_T_BoxCox,differences = 2)
plot(Data_T2_BoxCox_diff,type="o",ylab="Egg deposition (in Mns)", main = "Figure 11, Second differencing plot on transformed data",xaxt="n",xlab="Years",col="darkblue")
axis(1,at=seq(1981,1996,by=1),las=2)
From Figure 11, a trend can no longer be observed, which suggests that the data series is stationary. To confirm this, the ADF test is applied again.
adf.test(Data_T2_BoxCox_diff)
Augmented Dickey-Fuller Test
data: Data_T2_BoxCox_diff
Dickey-Fuller = -3.1733, Lag order = 2, p-value = 0.1254
alternative hypothesis: stationary
Although the ADF p-value (0.13) is above alpha = 0.05, it is only marginally so, and Figure 11 shows no remaining trend after second differencing. On this basis, the second-differenced series is considered adequate for further analysis.
Modeling
Model Specification
Differencing has removed the trend and the changing variance from the data series (Figure 10), and with the series now stationary, multiple candidate models will be built. Before proceeding, model specification is carried out using several approaches.
Model specification using ACF and PACF for the differenced series
acf(Data_T2_BoxCox_diff)
pacf(Data_T2_BoxCox_diff)
From the ACF plot, there is no significant lag. However, in the PACF plot shown above, a significant lag is present at lag 4. Therefore, an ARIMA(1,2,0) model should be considered, with no clear white-noise behaviour.
Model specification using EACF (Extended ACF) for the differenced series
The EACF is used here to identify the orders of the AR (autoregressive) and MA (moving average) components of the ARMA model.
eacf(Data_T2_BoxCox_diff, ar.max=3, ma.max=3)
AR/MA
0 1 2 3
0 o o o o
1 o o o o
2 o o o o
3 o o o o
With the upper-left vertex of the EACF table at (0,0), white-noise behaviour of the differenced series is suggested. From the output above, the neighbouring points can also be considered, so ARIMA(0,2,1), ARIMA(1,2,1) and ARIMA(1,2,0) are additionally taken forward for further analysis.
Model specification using BIC (Bayesian Information Criterion) for the differenced series
res = armasubsets(y=Data_T2_BoxCox_diff,nar=3,nma=2,y.name='test',ar.method='ols')
model order: 7 singularities in the computation of the projection matrix results are only valid
up to model order 6
plot(res)
From the table above, the shaded columns correspond to the AR(1) and AR(3) coefficients. Two MA effects, MA(1) and MA(2), can also be seen. Therefore, using these coefficients, four more candidate ARIMA models can be considered: ARIMA(1,2,1), ARIMA(1,2,2), ARIMA(3,2,1) and ARIMA(3,2,2).
In summary, the candidate set of ARIMA models is (a fitting sketch follows the list):
ARIMA(1,2,0) - using ACF and PACF
ARIMA(0,2,1) - using EACF
ARIMA(1,2,1) - using BIC
ARIMA(1,2,2) - using BIC
ARIMA(3,2,1) - using BIC
ARIMA(3,2,2) - using BIC
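The six candidates are fitted one by one in the next section. As a compact alternative (a sketch, not part of the original workflow), the same fits and coefficient tests could be produced in a loop; coeftest comes from the lmtest package already used below.
orders <- list(c(1,2,0), c(0,2,1), c(1,2,1), c(1,2,2), c(3,2,1), c(3,2,2))
for (ord in orders){
  for (m in c("CSS", "ML")){
    fit <- arima(Data, order = ord, method = m)            # fit the candidate with the given estimation method
    cat("\nARIMA(", paste(ord, collapse = ","), ") - ", m, "\n", sep = "")
    print(coeftest(fit))                                    # z tests of the estimated coefficients
  }
}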
Parameter estimation
ARIMA(1,2,0)
model_120_css = arima(Data,order=c(1,2,0),method='CSS')
coeftest(model_120_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.45944 0.23810 -1.9296 0.05365 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_120_ml = arima(Data,order=c(1,2,0),method='ML')
coeftest(model_120_ml)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.42966 0.22743 -1.8892 0.05886 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AR(1) is insignificant under both CSS and ML estimation.
ARIMA(0,2,1)
model_021_css = arima(Data,order=c(0,2,1),method='CSS')
coeftest(model_021_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ma1 -1.066739 0.071847 -14.847 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_021_ml = arima(Data,order=c(0,2,1),method='ML')
coeftest(model_021_ml)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ma1 -1.00000 0.25823 -3.8725 0.0001077 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
MA(1) is significant under both CSS and ML estimation.
ARIMA(1,2,1)
model_121_css = arima(Data,order=c(1,2,1),method='CSS')
coeftest(model_121_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.073817 0.284315 0.2596 0.7951
ma1 -1.132556 0.074796 -15.1419 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_121_ml = arima(Data,order=c(1,2,1),method='ML')
coeftest(model_121_ml)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.071764 0.269251 0.2665 0.7898
ma1 -0.999999 0.236872 -4.2217 2.425e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AR(1) is insignificant under both CSS and ML estimation, while MA(1) is significant under both.
ARIMA(1,2,2)
model_122_css = arima(Data,order=c(1,2,2),method='CSS')
coeftest(model_122_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 1.005671 0.049778 20.203 < 2.2e-16 ***
ma1 -2.824099 0.125344 -22.531 < 2.2e-16 ***
ma2 1.838559 0.114620 16.040 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_122_ml = arima(Data,order=c(1,2,2),method='ML')
coeftest(model_122_ml)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.058932 1.166803 0.0505 0.9597
ma1 -0.987060 1.155631 -0.8541 0.3930
ma2 -0.012925 1.130998 -0.0114 0.9909
AR(1), MA(1) and MA(2) are significant under the CSS method but not under ML estimation.
ARIMA(3,2,1)
model_321_css = arima(Data,order=c(3,2,1),method='CSS')
coeftest(model_321_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.17209 0.26139 -0.6584 0.510286
ar2 -0.19198 0.25364 -0.7569 0.449106
ar3 -0.52748 0.24922 -2.1165 0.034300 *
ma1 -0.64906 0.24639 -2.6343 0.008432 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_321_ml = arima(Data,order=c(3,2,1),method='ML')
coeftest(model_321_ml)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.0042103 0.3147996 0.0134 0.9893
ar2 -0.0431425 0.2891200 -0.1492 0.8814
ar3 -0.3350403 0.2798031 -1.1974 0.2311
ma1 -0.9018704 0.6489515 -1.3897 0.1646
AR(3) and MA(1) are significant under the CSS method, while AR(1), AR(2), AR(3) and MA(1) are all insignificant under ML estimation.
ARIMA(3,2,2)
model_322_css = arima(Data,order=c(3,2,2),method='CSS')
coeftest(model_322_css)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 -0.139922 0.348205 -0.4018 0.68780
ar2 -0.198602 0.258520 -0.7682 0.44235
ar3 -0.544728 0.277057 -1.9661 0.04928 *
ma1 -0.683391 0.360419 -1.8961 0.05795 .
ma2 0.045948 0.329455 0.1395 0.88908
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model_322_ml = arima(Data,order=c(3,2,2),method='ML')
coeftest(model_322_ml)
z test of coefficients:
Estimate Std. Error z value Pr(>|z|)
ar1 0.132356 0.472334 0.2802 0.77931
ar2 -0.099038 0.323387 -0.3063 0.75941
ar3 -0.384399 0.314004 -1.2242 0.22088
ma1 -0.996970 0.589359 -1.6916 0.09072 .
ma2 0.199121 0.472894 0.4211 0.67371
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AR(3) is significant under the CSS method, while AR(1), AR(2), AR(3), MA(1) and MA(2) are all insignificant under ML estimation.
Comparing all the results above, ARIMA(1,2,2) is the only model whose coefficients are all significant under the CSS method.
Sorting models by AIC and BIC score
The function below sorts the fitted models by their AIC or BIC score.
sort.score = function(x, score = c("bic","aic")){
  if (score == "aic"){
    x[with(x, order(AIC)), ]
  } else if (score == "bic"){
    x[with(x, order(BIC)), ]
  } else {
    warning('score = "x" accepts valid arguments ("aic","bic")')
  }
}
Sorting models by AIC
sort.score(AIC(model_120_ml,model_021_ml,model_121_ml,model_122_ml,model_321_ml,model_322_ml),score="aic")
                df      AIC
model_021_ml     2 22.74602
model_121_ml     3 24.67428
model_120_ml     2 26.57611
model_122_ml     4 26.67412
model_321_ml     5 26.90919
model_322_ml     6 28.75165
sort.score(BIC(model_120_ml,model_021_ml,model_121_ml,model_122_ml,model_321_ml,model_322_ml),score="bic")
                df      BIC
model_021_ml     2 24.02413
model_121_ml     3 26.59145
model_120_ml     2 27.85423
model_122_ml     4 29.23035
model_321_ml     5 30.10448
model_322_ml     6 32.58599
With the lowest AIC and BIC scores of 22.7 and 24.0 respectively, the ARIMA(0,2,1) model is the best model according to both criteria.
Overfitting
Since the additional AR(1) coefficient is insignificant under ML estimation, ARIMA(1,2,2) can be regarded as an overfitted version of the ARIMA(0,2,1) model (a minimal check is sketched below).
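As a small illustration of this overfitting check, the coefficient test of the ML fit already estimated above can be inspected directly; the row subscript is ordinary matrix indexing of the coeftest output.
ct <- coeftest(model_122_ml)   # coefficient tests of the overfitted ARIMA(1,2,2)
ct["ar1", ]                    # the added AR term: estimate, std. error, z value and p-value (0.96, insignificant)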
Best model selection
After reviewing the AIC and BIC results and checking the overfitted models, it can be concluded that ARIMA(0,2,1) is the best model for predicting the egg depositions (in Mns) for the next five years.
Model Diagnostics
For the selected best model, the standardised residuals will be analysed for normality and autocorrelation in order to validate the fit. In addition, the Ljung-Box test will be performed to verify that the model has been correctly specified.
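Before the full residual-analysis function below, a stand-alone Ljung-Box check can be run with base R's Box.test; the lag and fitdf values here are illustrative choices rather than values taken from the original analysis.
res <- rstandard(model_021_ml)                            # standardised residuals (rstandard method provided by TSA)
Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 1)    # fitdf = 1 accounts for the single MA coefficient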
residual.analysis <- function(model, std = TRUE){
  library(TSA)
  library(FitAR)   # provides LBQPlot; install.packages("FitAR") should be run once beforehand, not inside the function
  if (std == TRUE){
    res.model = rstandard(model)      # standardised residuals
  } else {
    res.model = residuals(model)      # raw residuals
  }
  par(mfrow = c(3,2))
  plot(res.model, type = 'o', ylab = 'Standardised residuals', main = "Time series plot of standardised residuals")
  abline(h = 0)
  hist(res.model, main = "Histogram of standardised residuals")
  qqnorm(res.model, main = "QQ plot of standardised residuals")
  qqline(res.model, col = 2)
  acf(res.model, main = "ACF of standardised residuals")
  pacf(res.model, main = "PACF of standardised residuals")
  print(shapiro.test(res.model))
  k = 0
  LBQPlot(res.model, lag.max = length(model$residuals) - 1, StartLag = k + 1, k = 0, SquaredQ = FALSE)
  par(mfrow = c(1,1))
}
residual.analysis(model=model_021_ml)
Shapiro-Wilk normality test
data: res.model
W = 0.92478, p-value = 0.2013
From the outputs above, the following can be concluded:
The time series plot of the standardised residuals shows no general trend and no changing variance, supporting the selected ARIMA(0,2,1) model.
The histogram of the standardised residuals is roughly similar to a normal distribution.
In the QQ plot of the standardised residuals, the majority of the points lie close to the reference line, indicating little deviation. The points at the tails lie further from the line, which can be attributed to the small number of observations in the dataset.
With a Shapiro-Wilk p-value of 0.20, greater than the alpha level of 0.05, we fail to reject the null hypothesis, which means the residuals satisfy the normality assumption.
The ACF and PACF of the residuals show no significant lags, so the residuals can be regarded as white noise.
In the Ljung-Box test plot, none of the points fall below the red significance line, further supporting the chosen model.
Forecast
To forecast the ‘Egg depositions (in Mns)’ for the next five years, the ARIMA(0,2,1) model is used.
Fit = Arima(Data,c(0,2,1),lambda=0.45)
plot(forecast(Fit,h=5),xlab="Year",ylab="Egg Deposition (in Mns)",type="o",xaxt="n",main="Forecasting next five year values for Egg deposition (1996-2001)",col=4,fcol=4,shadebars=TRUE)
axis(1,at=seq(1981,2001,by=1),las=2)
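To read the forecasts numerically rather than only from the plot, the forecast object can be printed; this is a small sketch, and the object name fc is hypothetical.
fc = forecast(Fit, h = 5)   # five-step-ahead forecasts from the fitted ARIMA(0,2,1)
print(fc)                   # point forecasts with 80% and 95% prediction intervals, back-transformed via lambda = 0.45
# summary(fc)               # additionally reports accuracy measures on the training data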
Summary
This report covers the processes followed to achieve the final goal, i.e., the prediction of egg depositions of Bloaters for the five years after 1996. The first step was to convert the series from non-stationary to stationary using transformation and differencing. Candidate model specifications were then identified from the ACF, PACF, EACF and BIC results. Parameter estimation was carried out using the CSS and ML methods. The best model was confirmed by the AIC and BIC outputs, and the model diagnostics showed that ARIMA(0,2,1) is the best feasible model for forecasting the egg depositions of Bloaters over the next five years. The forecast output indicates an increase in egg deposition for the five years after 1996.