This document summarizes a time series analysis of air pollution data from Richmond, Virginia, conducted in R. The analysis examined particulate matter (PM2.5 and PM10), lead, carbon monoxide, and ozone over 2010-2013. Correlation between the pollutants was low. Univariate time series models (ARIMA) were fitted to each pollutant, and their forecasts were compared against observed 2013 data. The ARIMA models predicted PM2.5 and lead levels accurately but not the other pollutants. The analysis aimed to apply methods from a Bulgarian air pollution study to a US city.
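The ARIMA fitting described above was done in R; as a rough illustration of the same idea, the sketch below fits a plain AR(2) model to a synthetic pollutant series by ordinary least squares (NumPy only — no differencing or moving-average terms, and all data are simulated).

```python
import numpy as np

# Minimal AR(2) fit by least squares on a synthetic "pollutant" series --
# a toy stand-in for the ARIMA models described above (no seasonality,
# no differencing; all data simulated).
rng = np.random.default_rng(0)
n = 500
y = np.zeros(n)
for t in range(2, n):                      # true process: AR(2)
    y[t] = 0.6 * y[t - 1] + 0.2 * y[t - 2] + rng.normal(scale=0.5)

# Design matrix of lagged values; solve for the AR coefficients.
X = np.column_stack([y[1:-1], y[:-2]])     # lags 1 and 2
target = y[2:]
phi, *_ = np.linalg.lstsq(X, target, rcond=None)

# One-step-ahead forecast from the last two observations.
forecast = phi[0] * y[-1] + phi[1] * y[-2]
```

With 500 observations the least-squares estimates land close to the true coefficients (0.6, 0.2); a real analysis would also difference the series and select the order, as ARIMA does.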
This document summarizes a study that determined coefficients for the nonlinear Muskingum model of flood routing using genetic algorithms and numerical solutions of continuity equations. The researchers optimized the coefficients using genetic algorithms to increase computation speed compared to conventional methods. They then computed outflows using the optimized coefficients and by solving continuity equations with the Runge-Kutta method. Results showed the Runge-Kutta method produced hydrographs that more closely matched actual flows compared to the Muskingum and Muskingum-Cunge models.
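As a sketch of the Runge-Kutta routing step described above, the code below integrates the continuity equation dS/dt = I − O together with the nonlinear Muskingum storage relation S = K(xI + (1−x)O)^m using a classic RK4 step. The coefficients K, x, m and the inflow hydrograph are illustrative values, not the paper's calibrated ones.

```python
import numpy as np

# Nonlinear Muskingum routing: invert S = K*(x*I + (1-x)*O)**m for O,
# then integrate continuity dS/dt = I - O with RK4 (illustrative values).
K, x, m = 0.5, 0.2, 1.2                        # assumed coefficients
t = np.arange(0, 60.0, 1.0)                    # hours
I = 20 + 80 * np.exp(-((t - 15) ** 2) / 50)    # synthetic inflow hydrograph

def outflow(S, inflow):
    """Invert the storage relation for O given storage S and inflow."""
    return ((S / K) ** (1.0 / m) - x * inflow) / (1.0 - x)

def dSdt(S, inflow):
    return inflow - outflow(S, inflow)

dt = 1.0
S = K * (I[0] ** m)                            # start at steady state: O = I
O = [I[0]]
for k in range(len(t) - 1):
    Imid = 0.5 * (I[k] + I[k + 1])             # inflow at the half step
    k1 = dSdt(S, I[k])
    k2 = dSdt(S + 0.5 * dt * k1, Imid)
    k3 = dSdt(S + 0.5 * dt * k2, Imid)
    k4 = dSdt(S + dt * k3, I[k + 1])
    S += dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    O.append(outflow(S, I[k + 1]))
O = np.array(O)
```

The routed hydrograph lags the inflow peak and relaxes back to the inflow once the flood wave passes, which is the qualitative behaviour the paper compares against observed flows.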
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Temporal trends of spatial correlation within the PM10 time series of the Air...Florencia Parravicini
We analyse the temporal variations observed within time series of variogram parameters (nugget, sill and range) of daily air quality data (PM10) over a ten-year time frame.
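A single step of the analysis above amounts to fitting a variogram model to one day's empirical variogram and recording its nugget, sill and range. The sketch below fits a spherical model to a synthetic empirical variogram with `scipy.optimize.curve_fit`; all lags and values are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

# Spherical variogram model: rises from the nugget to nugget + partial
# sill over the range, then stays flat.
def spherical(h, nugget, psill, rng_):
    h = np.asarray(h, dtype=float)
    inside = nugget + psill * (1.5 * h / rng_ - 0.5 * (h / rng_) ** 3)
    return np.where(h < rng_, inside, nugget + psill)

lags = np.linspace(5, 200, 20)                 # separation distances (km)
true = spherical(lags, 2.0, 8.0, 120.0)        # "true" parameters
gamma_emp = true + np.random.default_rng(1).normal(scale=0.2, size=lags.size)

params, _ = curve_fit(spherical, lags, gamma_emp, p0=[1.0, 5.0, 100.0])
nugget, psill, vrange = params
```

Repeating this fit for each day yields the three parameter time series (nugget, sill, range) whose temporal trends the paper studies.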
Storage Resource Estimates and Seal Evaluation of Cambrian-Ordovician Units i...Cristian Medina
This document summarizes a study evaluating the carbon dioxide (CO2) storage potential of Cambrian-Ordovician saline aquifers in the Midwest Regional Carbon Sequestration Partnership region using different methodologies, and the sealing efficiency of the Maquoketa Group. Six methods were used to independently generate storage resource estimates that differed in how porosity was estimated. Results showed the potential to store over 100 years of CO2 emissions from power plants in the region. Analysis of rock samples from the Maquoketa Group using mercury injection capillary pressure testing suggested it could support CO2 columns of 500-5000 feet, indicating potential as a sealing layer.
Calibration of Environmental Sensor Data Using a Linear Regression Techniqueijtsrd
Linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. In this study, a linear regression analysis technique was used to improve the reliability of light-scattering instruments for continuously measuring PM10 concentrations in underground subway stations: the raw data measured by these instruments must be converted to actual PM10 concentrations using calibration factors. The findings suggest that light-scattering instruments can be used to measure and control PM10 concentrations in underground subway stations. In addition, the linear regression technique was also applied to calibrate a radon counter against an expensive radon measuring apparatus. Chungyong Kim | Gyu-Sik Kim, "Calibration of Environmental Sensor Data Using a Linear Regression Technique", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-2, Issue-1, December 2017. URL: http://www.ijtsrd.com/papers/ijtsrd7060.pdf http://www.ijtsrd.com/engineering/electrical-engineering/7060/calibration-of-environmental-sensor-data-using-a-linear-regression-technique/chungyong-kim
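The calibration idea reduces to fitting y = ax + b between raw sensor readings and a co-located reference monitor, then applying the fitted line to new readings. A minimal sketch with synthetic data:

```python
import numpy as np

# Fit a calibration line mapping raw light-scattering readings to
# reference PM10 concentrations (all data synthetic and illustrative).
rng = np.random.default_rng(2)
raw = rng.uniform(10, 200, size=60)                              # raw sensor output
reference = 0.85 * raw + 4.0 + rng.normal(scale=3.0, size=60)    # reference monitor

a, b = np.polyfit(raw, reference, deg=1)       # least-squares slope and intercept
calibrated = a * raw + b

# Calibration should remove most of the systematic bias.
rmse_before = np.sqrt(np.mean((raw - reference) ** 2))
rmse_after = np.sqrt(np.mean((calibrated - reference) ** 2))
```

After fitting, the residual error is dominated by the reference monitor's noise rather than the sensor's systematic bias, which is the improvement in reliability the paper reports.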
How do air quality models perform with different validation datasets and diff...Haneen Khreis
This paper explores the performance of two commonly used air quality models: dispersion models and land-use regression models. Both models are widely used in air pollution epidemiological studies and in health impact assessment studies. In this work, we looked at how the choice of the validation dataset impacts the performance of air quality models and the insights gleaned from their validation. We also looked at whether the spatial resolution for the models' setup impacts the performance of air quality models and the insights gleaned from their validation. We saw that R-squared almost halved when the air quality models' estimates were made at the centroid of the 100x100m grid in which the validation point fell, instead of at the exact location of the validation point. We also saw that the different validation datasets give very different insights.
Dispersion models and land-use regression models are widely used in air pollution epidemiological studies and in health impact assessment studies. As such, the performance of these air quality models has implications for the ability of epidemiological studies to pick up associations between the exposures and the health outcomes of interest, and for the ability of health impact assessment studies to quantify the impacts accurately. This work demonstrated the value of validating modeled air quality data against various datasets to obtain a better understanding of the performance of models, and the value of reporting these validation results. The work also suggested that the spatial resolution of the models' estimates has a significant influence on their validity at the application point. These results should be considered when air quality models are used to assign human exposures and study the health effects and impacts of these exposures. Significant work is still needed to improve the performance of air quality models and their ability to pick up the variations in air pollution levels, especially the higher and more variable levels related to traffic. Significant work is also still needed to account for the factors that underlie this variation in epidemiological and health impact assessment studies, especially the time-activity patterns of the exposed populations.
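The resolution effect described above (R-squared roughly halving when estimates are taken at the 100x100 m grid centroid rather than the exact validation point) can be reproduced on a toy surface. Everything below is synthetic; only the 100 m grid size is taken from the text.

```python
import numpy as np

# Evaluate a smooth "model" surface at exact monitor locations versus at
# the centroids of 100x100 m grid cells, and compare R^2 against the
# same validation observations (entirely synthetic example).
rng = np.random.default_rng(3)
pts = rng.uniform(0, 1000, size=(200, 2))            # monitor coordinates (m)

def surface(xy):                                     # stand-in model field
    return np.sin(2 * np.pi * xy[:, 0] / 300) + np.cos(2 * np.pi * xy[:, 1] / 300)

obs = surface(pts) + rng.normal(scale=0.1, size=len(pts))   # validation data
est_exact = surface(pts)                             # model at exact locations
centroids = (pts // 100) * 100 + 50                  # snap to 100 m cell centres
est_centroid = surface(centroids)                    # model at cell centroids

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_exact, r2_centroid = r2(obs, est_exact), r2(obs, est_centroid)
```

Because the field varies within a grid cell, displacing the evaluation point to the centroid (up to ~70 m) systematically degrades the validation R-squared, mirroring the paper's observation.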
International Journal of Engineering Research and DevelopmentIJERD Editor
Electrical, Electronics and Computer Engineering,
Information Engineering and Technology,
Mechanical, Industrial and Manufacturing Engineering,
Automation and Mechatronics Engineering,
Material and Chemical Engineering,
Civil and Architecture Engineering,
Biotechnology and Bio Engineering,
Environmental Engineering,
Petroleum and Mining Engineering,
Marine and Agriculture engineering,
Aerospace Engineering.
Integration Method of Local-global SVR and Parallel Time Variant PSO in Water...TELKOMNIKA JOURNAL
Floods are a type of natural disaster that is difficult to predict; one of the main causes of flooding is continuous rain. Meteorologically, floods arise from high rainfall and high tides, which raise the water level. Analysing rainfall and water level period by period has not been sufficient to solve the existing problems. This study therefore proposes an integration of Parallel Time Variant PSO (PTVPSO) and Local-Global Support Vector Regression (SVR) to forecast water level. SVR serves as the regression method for forecasting the water level, the Local-Global concept reduces the computing time, and PTVPSO optimizes the SVR parameters to obtain maximum performance and higher accuracy. The aim is a system that can support flood early warning under erratic weather.
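A bare-bones version of the SVR forecasting step is sketched below: predict the next water level from the three previous readings of a synthetic tidal-like series. The paper's Local-Global split and PTVPSO parameter tuning are omitted; kernel and parameters here are ad hoc.

```python
import numpy as np
from sklearn.svm import SVR

# SVR water-level forecasting sketch: lagged readings -> next reading
# (synthetic series; parameters chosen by hand, not by PTVPSO).
rng = np.random.default_rng(4)
t = np.arange(400)
level = 2.0 + np.sin(2 * np.pi * t / 24) + 0.05 * rng.normal(size=t.size)

lag = 3
X = np.column_stack([level[i:len(level) - lag + i] for i in range(lag)])
y = level[lag:]                                  # value to predict

model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X[:300], y[:300])
r2_test = model.score(X[300:], y[300:])          # held-out R^2
```

In the paper, PTVPSO would search over C, epsilon and the kernel width instead of the fixed values used here.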
IRJET- Rainfall Forecasting using Regression TechniquesIRJET Journal
This document discusses rainfall forecasting using regression techniques. It begins with an abstract stating that rainfall is important for agriculture and food production in India. It then provides an introduction to different rainfall forecasting methods, emphasizing empirical regression approaches. The paper performs multiple linear regression on five years of monthly rainfall data from Mumbai to predict rainfall values. It calculates correlation coefficients between months and plots actual versus predicted rainfall to evaluate forecast accuracy. The goal is to aid agricultural planning and management in India's monsoon-dependent regions.
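The multiple linear regression step can be sketched as follows, with two illustrative predictors standing in for the paper's monthly variables (all data synthetic):

```python
import numpy as np

# Multiple linear regression for monthly rainfall from two illustrative
# predictors, via ordinary least squares (synthetic five-year dataset).
rng = np.random.default_rng(5)
n = 60                                      # five years of monthly data
humidity = rng.uniform(40, 95, n)
temperature = rng.uniform(20, 35, n)
rainfall = 3.0 * humidity - 4.0 * temperature + 50 + rng.normal(scale=10, size=n)

A = np.column_stack([humidity, temperature, np.ones(n)])   # design matrix
coef, *_ = np.linalg.lstsq(A, rainfall, rcond=None)
predicted = A @ coef
corr = np.corrcoef(rainfall, predicted)[0, 1]   # actual-vs-predicted check
```

Plotting `rainfall` against `predicted` (as the paper does for Mumbai) and checking the correlation coefficient gives a quick read on forecast accuracy.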
The document compares earthquake wave propagation analysis results from SHAKE2000 and Plaxis v8 based on Indonesian standards SNI 03-1726-2012 for North, Central, and South Jakarta. The analysis models soil at five locations using synthetic ground motions for a 2500-year earthquake. The results show site-specific spectra are generally higher than the standard. Linear Elastic modeling in Plaxis gives very high values unsuitable for seismic analysis, while Mohr-Coulomb is also unsuitable. Hardening Soil with Small Strain modeling is most suitable. SHAKE2000's Linear Equivalent modeling best matches the standard but more study of soil parameters is needed.
Refining Underwater Target Localization and Tracking EstimatesCSCJournals
Improving the accuracy and reliability of localization estimates and tracking of underwater targets is a constant quest in ocean surveillance operations. The localization estimates may vary owing to various noises and interferences, such as sensor errors and environmental noise. Although adaptive filters such as the Kalman filter mitigate these problems and yield dependable results, maneuvering targets can introduce large errors unless suitable corrective measures are implemented. Simulation studies on improving the localization and tracking estimates for a stationary target and a moving target, including maneuvering situations, are presented in this paper.
Time Series Data Analysis for Forecasting – A Literature ReviewIJMER
This document summarizes literature on using statistical and data mining techniques for time series forecasting, with a focus on weather prediction. Section 2 discusses various statistical techniques used in literature such as ARIMA models, exponential smoothing models, and spectral analysis methods for time series rainfall and weather forecasting. Section 3 discusses data mining techniques used for time series forecasting, including neural networks and evolutionary computation methods. Several studies applying neural networks to weather prediction are summarized.
This document presents a method for predicting stream flow distributions based on climatic and geomorphic data alone, without discharge measurements. It combines a physically-based stream flow model with water balance and geomorphic recession flow models. Key parameters of the stream flow model are estimated from rainfall, potential evapotranspiration, and digital elevation model data. The method was tested on calibration and test catchments. While offering a unique approach, the method has limitations including additional assumptions and reduced accuracy of parameter estimates and flow regime predictions.
This document discusses a study analyzing the impact of climate change on precipitation characteristics in Guwahati, India using an Earth System Model. It summarizes the use of statistical downscaling with multiple linear regression to project future precipitation data. Predictors with the highest correlation to total monthly precipitation, maximum monthly precipitation, and number of dry days were selected from the ESM dataset. The downscaled results will be used for flood frequency analysis to project precipitation levels and dry days under different return periods.
Computer model simulations are widely used in the investigation of complex hydrological systems. In particular, hydrological models are tools that help both to better understand hydrological processes and to predict extreme events such as floods and droughts. Usually, model parameters need to be estimated through calibration, in order to constrain model outputs to observed variables.
Relevant model parameters used for calibration are usually selected based on the expert knowledge of the modeller or by using a local one-at-a-time (OAT) sensitivity analysis (SA). However, for complex models those approaches may not properly identify the most sensitive parameters for model calibration. In particular, local OAT SA methods are only effective for assessing the relative importance of input factors when the model is linear, monotonic, and additive, which is rarely the case for complex environmental models. In contrast, Global Sensitivity Analysis (GSA) is a formal method for statistical evaluation of the parameters that contribute significantly to model performance. GSA techniques explore the entire feasible space of each model parameter, and they do not require any assumptions about the model's nature (such as linearity or additivity).
In this work we apply GSA to LISFLOOD, a fully-distributed hydrological model used for flood forecasting at the Pan-European scale within the European Flood Awareness System (EFAS). Two case studies are considered, a snowmelt-driven and an evapotranspiration-driven catchment, to identify sensitive parameters for both types of hydrological regime. Results of the GSA will then be used to select the parameters that need to be estimated during model calibration. Considering the large number of parameters of a fully-distributed model, a two-step GSA framework is applied. First, we implement the computationally efficient screening method of Morris, which requires a limited number of simulations and produces a qualitative ranking and selection of important factors. Second, we apply the variance-based method of Sobol, only to the subset of factors identified as important during the screening. The method of Sobol provides quantitative estimates of the first-order and total-order sensitivity indices of the input factors.
The calibration results after the GSA will be described for both case studies and compared against those obtained using only prior expert knowledge.
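The first GSA step, Morris screening, can be hand-rolled in a few lines: perturb one factor at a time along random trajectories and average the absolute elementary effects (mu*). The toy model and factor count below are illustrative; in practice a library such as SALib would be used for this and for the Sobol step.

```python
import numpy as np

# Morris-style screening: mean absolute elementary effect mu* per factor.
# Factors with larger mu* matter more and go on to the Sobol analysis.
rng = np.random.default_rng(6)

def model(p):                       # toy "hydrological" response
    return 4.0 * p[0] + 0.5 * p[1] ** 2 + 0.01 * p[2]

k, trajectories, delta = 3, 50, 0.1
effects = np.zeros((trajectories, k))
for r in range(trajectories):
    base = rng.uniform(0, 1 - delta, size=k)   # random base point in [0,1)^k
    f0 = model(base)
    for i in range(k):              # perturb one factor at a time
        bumped = base.copy()
        bumped[i] += delta
        effects[r, i] = abs(model(bumped) - f0) / delta

mu_star = effects.mean(axis=0)      # Morris mu*: mean |elementary effect|
ranking = np.argsort(mu_star)[::-1]
```

For the toy model the ranking correctly identifies factor 0 as dominant and factor 2 as negligible; only the retained factors would then receive the more expensive variance-based Sobol treatment.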
Estimating Parameter of Nonlinear Bias Correction Method using NSGA-II in Dai...TELKOMNIKA JOURNAL
The nonlinear (NL) method is the most effective bias correction method for correcting statistical bias when observed precipitation data cannot be approximated by a gamma distribution. Since the NL method only adjusts the mean and variance, it does not perform well in handling bias in quantile values. This paper presents a scheme of the NL method with an additional condition aimed at mitigating bias in quantile values. The Non-dominated Sorting Genetic Algorithm II (NSGA-II) was applied to estimate the parameters of the NL method. Furthermore, to investigate the suitability of NSGA-II, we ran a Single Objective Genetic Algorithm (SOGA) as a comparison. The experimental results revealed that NSGA-II was suitable when the SOGA solution produced low fitness. Application of NSGA-II could minimize the impact of daily bias correction on monthly precipitation. The proposed scheme successfully reduced biases in the mean, variance, and first and second quantiles. However, biases in the third and fourth moments could not be handled robustly, while biases in the third quantile were reduced only during dry months.
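The NL correction has the form x' = a·x^b, with a and b chosen so the corrected model precipitation matches the observed mean and variance. The paper estimates these parameters with NSGA-II; the sketch below swaps in a plain Nelder-Mead search on the two moment errors purely for illustration, with synthetic data throughout.

```python
import numpy as np
from scipy.optimize import minimize

# Fit the nonlinear bias-correction x' = a * x**b by matching the mean
# and standard deviation of "observed" precipitation (synthetic data;
# a simple optimizer stands in for the paper's NSGA-II).
rng = np.random.default_rng(7)
model_p = rng.gamma(shape=2.0, scale=3.0, size=2000)              # biased model rainfall
obs_p = 1.5 * rng.gamma(shape=2.0, scale=3.0, size=2000) ** 1.2   # "observations"

def moment_error(params):
    a, b = params
    corrected = a * model_p ** b
    return (corrected.mean() - obs_p.mean()) ** 2 + \
           (corrected.std() - obs_p.std()) ** 2

res = minimize(moment_error, x0=[1.0, 1.0], method="Nelder-Mead")
a, b = res.x
corrected = a * model_p ** b
```

With only two moments matched, quantile biases can remain, which is exactly the limitation the paper's additional condition (and multi-objective NSGA-II search) is meant to address.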
An improved method for predicting heat exchanger network areaAlexander Decker
This document presents an improved methodology for predicting the area required for heat exchanger networks. The current methodology relies on film heat transfer coefficients, which can vary significantly between the targeting, synthesis, and detailed design stages of process integration. The new methodology accounts for changes in stream properties with temperature and relates film heat transfer coefficients to pressure drop constraints, allowing the three stages to be consistent. It was tested on two case studies and found to have less than 2% difference between stages, compared to up to 59% difference with the current methodology. The new methodology provides an excellent agreement between the targeting, synthesis, and detailed design of heat exchanger networks.
This document presents a data-driven approach to establish relationships between the microstructure and hydraulic conductivity of packed soil particles. Soil particle packings were generated numerically and their microstructures characterized using 2-point statistics. Hydraulic conductivity was estimated using finite volume simulations. Principal component analysis was used to reduce the microstructural data, and regression analysis was employed to correlate hydraulic conductivity with the principal components, establishing a structure-property relationship. Leave-one-out cross validation was used to assess the regression models.
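The pipeline above (descriptor reduction by PCA, regression of conductivity on the principal components, leave-one-out validation) can be sketched end to end on synthetic stand-ins for the 2-point statistics:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Structure-property sketch: PCA on microstructure descriptors, linear
# regression of log-conductivity on the leading PCs, LOO validation.
# Descriptors and conductivities are synthetic stand-ins.
rng = np.random.default_rng(8)
n, d = 40, 20
latent = rng.normal(size=(n, 3))                     # underlying structure modes
W = rng.normal(size=(3, d))
descriptors = latent @ W + 0.1 * rng.normal(size=(n, d))   # e.g. 2-point stats
log_k = 1.0 * latent[:, 0] + 0.5 * latent[:, 1] \
        + rng.normal(scale=0.1, size=n)              # log hydraulic conductivity

scores = PCA(n_components=3).fit_transform(descriptors)
reg = LinearRegression().fit(scores, log_k)
loo_pred = cross_val_predict(reg, scores, log_k, cv=LeaveOneOut())
loo_r2 = 1 - np.sum((log_k - loo_pred) ** 2) / np.sum((log_k - log_k.mean()) ** 2)
```

Because the descriptors are driven by a few latent modes, three principal components capture the structure and the leave-one-out R-squared stays high; this is the structure-property relationship the paper establishes with its real finite-volume data.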
This document provides guidance on sampling principles for hydrological and hydro-meteorological variables. It discusses units of measurement, basic statistics, measurement error, sampling frequency, sampling in space, and network design. The key topics covered include defining sampling terms, describing common statistical distributions, estimating parameters and errors from samples, addressing errors from discrete time and spatial sampling, and designing monitoring networks based on sampling objectives and system characteristics.
Consequence assessment methods for incidents from lngaob
This document summarizes methods for assessing consequences of incidents involving releases from liquefied natural gas carriers. It recommends using an orifice model to estimate release rates, Webber's methodology to model pool spreading accounting for frictional forces, and heat transfer theory to calculate heat flux to boiling pools. For fires, it recommends using a solid flame model. It also provides injury and damage criteria for assessing impacts to people and structures from thermal radiation. The document notes limitations in current models and calls for additional research, particularly spill tests, to improve accuracy.
Comparison of MOC and Lax FDE for simulating transients in Pipe FlowsIRJET Journal
This document compares the method of characteristics (MOC) and Lax finite difference explicit (Lax FDE) methods for simulating transients in pipe flows. It develops numerical models using both MOC and Lax FDE to discretize the governing equations of fluid flow through a pipe. The models are implemented using data from a previous study to simulate pressure and discharge changes after rapid valve closure. The results show that Lax FDE provides more damping of pressure and discharge fluctuations compared to MOC. Therefore, the document concludes that Lax FDE is a better numerical method for calculating hydraulic transients in pipes.
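The extra damping of the Lax scheme can be seen on a one-equation surrogate. The sketch below applies the Lax (Lax-Friedrichs) update to simple linear advection of a pressure-like pulse; it is not the full water-hammer system, and all values are illustrative.

```python
import numpy as np

# Lax-Friedrichs step for linear advection u_t + c u_x = 0 on a periodic
# domain: u_i^{n+1} = (u_{i+1}+u_{i-1})/2 - (c dt / 2 dx)(u_{i+1}-u_{i-1}).
nx, c, dx = 200, 1.0, 1.0
dt = 0.8 * dx / c                      # CFL number 0.8 < 1 for stability
u = np.exp(-0.5 * ((np.arange(nx) - 50) / 5.0) ** 2)   # initial pulse
peak0, mass0 = u.max(), u.sum()

for _ in range(100):
    up = np.roll(u, -1)                # u[i+1] (periodic boundary)
    um = np.roll(u, 1)                 # u[i-1]
    u = 0.5 * (up + um) - c * dt / (2 * dx) * (up - um)

peak_after = u.max()                   # numerical diffusion lowers the peak
```

The scheme conserves the total (mass) exactly on a periodic domain but smears the pulse, which is the damping of pressure and discharge fluctuations the comparison attributes to Lax FDE relative to MOC.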
This document presents a hybrid Measure-Correlate-Predict (MCP) method for wind resource assessment that combines predictions from multiple nearby meteorological stations. The existing MCP methods only use data from one reference station and do not consider distance or elevation differences between stations. The new hybrid MCP method assigns weights to individual MCP predictions based on distance and elevation differences to the target site. It was evaluated using stations in North Dakota and showed improved accuracy over individual MCP methods based on error metrics and predicted wind farm power generation. The hybrid approach more accurately characterized the long-term wind distribution at the target site.
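The hybrid weighting idea reduces to combining per-station MCP predictions with weights that penalize distance and elevation difference. The penalty form below is an assumption for illustration; the paper's exact weighting function is not reproduced here.

```python
import numpy as np

# Combine long-term wind-speed predictions from several reference
# stations, down-weighting distant or elevation-mismatched stations
# (illustrative numbers; assumed inverse-distance/elevation penalty).
pred = np.array([7.2, 6.8, 7.9])        # MCP prediction per station (m/s)
dist_km = np.array([12.0, 45.0, 80.0])  # distance to target site
delev_m = np.array([30.0, 5.0, 250.0])  # elevation difference to target

score = 1.0 / (dist_km * (1.0 + delev_m / 100.0))   # assumed penalty form
weights = score / score.sum()           # normalize to a convex combination
hybrid = float(weights @ pred)          # hybrid long-term prediction
```

The nearest, best-matched station dominates the combination, and the hybrid estimate always lies within the range of the individual predictions.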
ONLINE SCALABLE SVM ENSEMBLE LEARNING METHOD (OSSELM) FOR SPATIO-TEMPORAL AIR...IJDKP
Environmental air pollution studies often fail to consider that air pollution is a spatio-temporal problem. The volume and complexity of the data have created the need to explore various machine learning models; however, those models have advantages and disadvantages when applied to regional air pollution analysis, and furthermore, most environmental problems are global distribution problems. This research addressed the spatio-temporal problem using a decentralized computational technique named the Online Scalable SVM Ensemble Learning Method (OSSELM). Evaluation criteria for computational air pollution analysis include accuracy, real-time prediction, spatio-temporal coverage, and decentralised analysis; we assert that these criteria can be improved using the proposed OSSELM. Special consideration is given to the distributed ensemble to resolve the spatio-temporal data collection problem (i.e., data collected from multiple monitoring stations dispersed over a geographical area). Moreover, the experimental results demonstrated that the proposed OSSELM produced impressive results compared to the SVM ensemble for air pollution analysis in the Auckland region.
Online flooding monitoring in packed towersJames Cao
This document proposes an enhanced data-driven method called EDPCA for online flooding monitoring in packed towers. EDPCA improves upon DPCA (dynamic principal component analysis) monitoring by first using fuzzy c-means clustering to separate historical data into subsets. Then, multiple single DPCA models are trained on each subset. When a new data point arrives, each DPCA model evaluates it, and the results are integrated using Bayesian inference to obtain the overall monitoring result. The method was tested on an air-water packed tower and showed better performance than single DPCA monitoring.
Determination of the corrosion rate of a mic influenced pipeline using four c...GeraldoRossoniSisqui
1) Gasunie used data from four pig runs over five years on a pipeline affected by MIC to determine the corrosion rate. Three approaches were used: the first calculated a single rate for all defects, the second calculated individual rates per defect, and the third used maximum likelihood estimation.
2) The third approach, which took into account measurement errors and ensured positive rates, estimated an average rate of 0.20 mm/yr with initiation largely at installation.
3) Analyzing shallow and deep defects separately found average rates of 0.23 mm/yr and 0.25 mm/yr respectively, indicating depth does not significantly impact the rate.
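The per-defect approach amounts to regressing matched defect depths from successive inspections against time and constraining the growth rate to be non-negative. A sketch with illustrative depths (the ~0.20 mm/yr magnitude matches the summary, but the data points are invented):

```python
import numpy as np

# Per-defect corrosion rate: least-squares slope of measured depth vs
# time across pig runs, clipped to a physical (non-negative) rate.
years = np.array([0.0, 1.5, 3.0, 5.0])            # time since first pig run
depths = np.array([1.10, 1.42, 1.68, 2.12])       # one defect's depth (mm)

slope, intercept = np.polyfit(years, depths, deg=1)
rate = max(slope, 0.0)                            # enforce a positive rate
```

The intercept estimates the defect depth at the first run (consistent with initiation at installation), while measurement error handling — the point of the paper's third, maximum-likelihood approach — is omitted here.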
Integration Method of Local-global SVR and Parallel Time Variant PSO in Water... (TELKOMNIKA JOURNAL)
Flood is a type of natural disaster that cannot be predicted; one of its main causes is continuous rain. In meteorological terms, floods result from high rainfall and high sea tides, which raise the water level. Analysing rainfall and water level in each period alone has not solved the existing problems. Therefore, this study proposes an integration of Parallel Time Variant PSO (PTVPSO) and Local-Global Support Vector Regression (SVR) to forecast water level. SVR serves as the regression method for forecasting the water level, the Local-Global concept minimizes computing time, and PTVPSO optimizes the SVR parameters to obtain maximum performance and more accurate results. The aim is for this system to support a flood early warning system under erratic weather.
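As a hedged illustration of the PSO component above, here is a plain global-best PSO on a toy objective; the parallel time-variant refinements and the SVR coupling the paper describes are omitted, and all constants are illustrative:

```python
import random

def pso_minimize(f, bounds, n_particles=20, iters=100, seed=7):
    """Bare-bones global-best PSO: each particle tracks its personal best,
    and velocities are pulled toward both personal and global bests."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Toy objective standing in for an SVR cross-validation error surface.
best, val = pso_minimize(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2,
                         bounds=[(-10, 10), (-10, 10)])
print([round(b, 2) for b in best], round(val, 6))
```

In the paper's setting, `f` would be the validation error of an SVR trained with the candidate hyperparameters rather than this quadratic.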
IRJET - Rainfall Forecasting using Regression Techniques (IRJET Journal)
This document discusses rainfall forecasting using regression techniques. It begins with an abstract stating that rainfall is important for agriculture and food production in India. It then provides an introduction to different rainfall forecasting methods, emphasizing empirical regression approaches. The paper performs multiple linear regression on five years of monthly rainfall data from Mumbai to predict rainfall values. It calculates correlation coefficients between months and plots actual versus predicted rainfall to evaluate forecast accuracy. The goal is to aid agricultural planning and management in India's monsoon-dependent regions.
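The regression step described above can be sketched with ordinary least squares; the predictors and rainfall values below are synthetic stand-ins, not the Mumbai dataset:

```python
import numpy as np

# Synthetic stand-in for five years of monthly records (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(20, 35, size=(60, 3))            # e.g. temperature/humidity/pressure proxies
true_beta = np.array([4.0, -2.0, 1.5])
y = X @ true_beta + 10.0 + rng.normal(0, 0.5, size=60)   # monthly rainfall (mm)

# Multiple linear regression: solve min ||X1 beta - y|| with an intercept column.
X1 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Coefficient of determination to judge the fit, as the paper does by
# comparing actual versus predicted rainfall.
y_hat = X1 @ beta
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```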
The document compares earthquake wave propagation analysis results from SHAKE2000 and Plaxis v8 based on Indonesian standards SNI 03-1726-2012 for North, Central, and South Jakarta. The analysis models soil at five locations using synthetic ground motions for a 2500-year earthquake. The results show site-specific spectra are generally higher than the standard. Linear Elastic modeling in Plaxis gives very high values unsuitable for seismic analysis, while Mohr-Coulomb is also unsuitable. Hardening Soil with Small Strain modeling is most suitable. SHAKE2000's Linear Equivalent modeling best matches the standard but more study of soil parameters is needed.
Refining Underwater Target Localization and Tracking Estimates (CSCJournals)
Improving the accuracy and reliability of the localization estimates and tracking of underwater targets is a constant quest in ocean surveillance operations. The localization estimates may vary owing to various noises and interferences such as sensor errors and environmental noises. Even though adaptive filters like the Kalman filter subdue these problems and yield dependable results, targets that undergo maneuvering can cause incomprehensible errors, unless suitable corrective measures are implemented. Simulation studies on improving the localization and tracking estimates for a stationary target as well as a moving target including the maneuvering situations are presented in this paper
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Time Series Data Analysis for Forecasting – A Literature Review (IJMER)
This document summarizes literature on using statistical and data mining techniques for time series forecasting, with a focus on weather prediction. Section 2 discusses various statistical techniques used in literature such as ARIMA models, exponential smoothing models, and spectral analysis methods for time series rainfall and weather forecasting. Section 3 discusses data mining techniques used for time series forecasting, including neural networks and evolutionary computation methods. Several studies applying neural networks to weather prediction are summarized.
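One of the classical statistical techniques surveyed, simple exponential smoothing, can be sketched in a few lines (the series values are illustrative, not from any study in the review):

```python
def simple_exponential_smoothing(series, alpha):
    """Each smoothed value is a weighted average of the newest observation
    and the previous level: level = alpha*x + (1-alpha)*level."""
    level = series[0]
    fitted = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
        fitted.append(level)
    return fitted

# Toy daily temperature series (deg C); alpha controls how fast the level adapts.
temps = [21.0, 23.0, 22.0, 25.0, 24.0, 26.0]
smoothed = simple_exponential_smoothing(temps, alpha=0.5)
print([round(v, 2) for v in smoothed])
```

The final `level` would serve as the one-step-ahead forecast; ARIMA models generalize this by also modelling autoregressive and differencing structure.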
This document presents a method for predicting stream flow distributions based on climatic and geomorphic data alone, without discharge measurements. It combines a physically-based stream flow model with water balance and geomorphic recession flow models. Key parameters of the stream flow model are estimated from rainfall, potential evapotranspiration, and digital elevation model data. The method was tested on calibration and test catchments. While offering a unique approach, the method has limitations including additional assumptions and reduced accuracy of parameter estimates and flow regime predictions.
This document discusses a study analyzing the impact of climate change on precipitation characteristics in Guwahati, India using an Earth System Model. It summarizes the use of statistical downscaling with multiple linear regression to project future precipitation data. Predictors with the highest correlation to total monthly precipitation, maximum monthly precipitation, and number of dry days were selected from the ESM dataset. The downscaled results will be used for flood frequency analysis to project precipitation levels and dry days under different return periods.
Computer model simulations are widely used in the investigation of complex hydrological systems. In particular, hydrological models are tools that help both to better understand hydrological processes and to predict extreme events such as floods and droughts. Usually, model parameters need to be estimated through calibration, in order to constrain model outputs to observed variables.
Relevant model parameters used for calibration are usually selected based on expert knowledge of the modeller or by using a local one-at-a-time (OAT) sensitivity analysis (SA). However, in case of complex models those approaches may not result in proper identification of the most sensitive parameters for model calibration. In particular local OAT SA methods are only effective for assessing the relative importance of input factors when the model is linear, monotonic, and additive, which is rarely the case for complex environmental models. In contrast Global Sensitivity Analysis (GSA)
is a formal method for statistical evaluation of relevant parameters that contribute significantly to model performance. GSA techniques explore the entire feasible space of each model parameter, and they do not require any assumptions on the model nature (such as linearity or additivity).
In this work we apply the GSA to LISFLOOD, a fully-distributed hydrological model used for flood forecasting at Pan-European scale within the European Flood Awareness System (EFAS). Two case studies are considered, snowmelt- and evapotranspiration-driven catchments, to identify sensitive parameters for both types of hydrological regimes. Results of the GSA will then be used for selecting parameters that need to be estimated during model calibration. Considering the large
number of parameters of a fully-distributed model, a two-step GSA framework is applied. First, we implement the computationally efficient screening method of Morris. This method requires a limited number of simulations and produces a qualitative ranking and selection of important factors. As a second step, we apply the variance-based method of Sobol, only to the subset of factors determined as important during the previous screening. The method of Sobol provides quantitative estimates for first order and total order sensitivity indexes of input factors.
The calibration results after the GSA will be described for both case studies and compared against those obtained by using only prior expert knowledge
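The screening step of the two-step GSA framework can be illustrated with a minimal Morris-style elementary-effects computation; this is a simplified sketch on a toy model (random base points rather than the full trajectory design, and no connection to LISFLOOD parameters):

```python
import random

def morris_mu_star(f, bounds, r=20, delta=0.1, seed=1):
    """Mean absolute elementary effect (mu*) per input factor: perturb one
    factor at a time by delta (in normalized units) and average |dY/delta|."""
    rng = random.Random(seed)
    k = len(bounds)
    sums = [0.0] * k
    for _ in range(r):
        # Sample a base point leaving room for the +delta step.
        x = [rng.uniform(lo, hi - delta * (hi - lo)) for lo, hi in bounds]
        base = f(x)
        for i, (lo, hi) in enumerate(bounds):
            xp = list(x)
            xp[i] += delta * (hi - lo)
            sums[i] += abs((f(xp) - base) / delta)
    return [s / r for s in sums]

# Toy model: output depends strongly on x0, weakly on x1, not at all on x2.
model = lambda x: 10 * x[0] + 0.5 * x[1] + 0 * x[2]
mu = morris_mu_star(model, [(0, 1)] * 3)
print([round(m, 2) for m in mu])
```

Factors with large mu* would then be passed to the variance-based Sobol analysis; the unimportant ones are fixed.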
Estimating Parameter of Nonlinear Bias Correction Method using NSGA-II in Dai... (TELKOMNIKA JOURNAL)
The nonlinear (NL) method is the most effective bias correction method for correcting statistical bias when observed precipitation data cannot be approximated by a gamma distribution. Since the NL method only adjusts the mean and variance, it does not perform well in handling bias in quantile values. This paper presents a scheme of the NL method with an additional condition aimed at mitigating bias in quantile values. The Non-dominated Sorting Genetic Algorithm II (NSGA-II) was applied to estimate the parameters of the NL method. Furthermore, to investigate the suitability of NSGA-II, we ran a Single Objective Genetic Algorithm (SOGA) as a comparison. The experimental results revealed that NSGA-II was suitable when the SOGA solution produced low fitness. Application of NSGA-II could minimize the impact of daily bias correction on monthly precipitation. The proposed scheme successfully reduced biases in the mean, variance, and first and second quantiles. However, biases in the third and fourth moments could not be handled robustly, while biases in the third quantile were reduced only during dry months.
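For orientation, a minimal sketch of the kind of mean-and-variance matching the NL method performs; this uses a simple linear rescaling with made-up series (the NL method itself applies a nonlinear transform with fitted parameters):

```python
import statistics as st

def mean_variance_correction(model, obs):
    """Rescale model output so its mean and variance match observations.
    Like the NL method described above, only these two moments are
    constrained; quantiles are not explicitly corrected."""
    m_mu, o_mu = st.mean(model), st.mean(obs)
    m_sd, o_sd = st.pstdev(model), st.pstdev(obs)
    return [(x - m_mu) * (o_sd / m_sd) + o_mu for x in model]

# Toy daily precipitation series (mm): the model systematically overestimates.
model = [2.0, 4.0, 6.0, 8.0]
obs = [1.0, 2.0, 3.0, 4.0]
corrected = mean_variance_correction(model, obs)
print(corrected)
```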
An improved method for predicting heat exchanger network area (Alexander Decker)
This document presents an improved methodology for predicting the area required for heat exchanger networks. The current methodology relies on film heat transfer coefficients, which can vary significantly between the targeting, synthesis, and detailed design stages of process integration. The new methodology accounts for changes in stream properties with temperature and relates film heat transfer coefficients to pressure drop constraints, allowing the three stages to be consistent. It was tested on two case studies and found to have less than 2% difference between stages, compared to up to 59% difference with the current methodology. The new methodology provides an excellent agreement between the targeting, synthesis, and detailed design of heat exchanger networks.
This document presents a data-driven approach to establish relationships between the microstructure and hydraulic conductivity of packed soil particles. Soil particle packings were generated numerically and their microstructures characterized using 2-point statistics. Hydraulic conductivity was estimated using finite volume simulations. Principal component analysis was used to reduce the microstructural data, and regression analysis was employed to correlate hydraulic conductivity with the principal components, establishing a structure-property relationship. Leave-one-out cross validation was used to assess the regression models.
This document provides guidance on sampling principles for hydrological and hydro-meteorological variables. It discusses units of measurement, basic statistics, measurement error, sampling frequency, sampling in space, and network design. The key topics covered include defining sampling terms, describing common statistical distributions, estimating parameters and errors from samples, addressing errors from discrete time and spatial sampling, and designing monitoring networks based on sampling objectives and system characteristics.
Consequence assessment methods for incidents from LNG (aob)
This document summarizes methods for assessing consequences of incidents involving releases from liquefied natural gas carriers. It recommends using an orifice model to estimate release rates, Webber's methodology to model pool spreading accounting for frictional forces, and heat transfer theory to calculate heat flux to boiling pools. For fires, it recommends using a solid flame model. It also provides injury and damage criteria for assessing impacts to people and structures from thermal radiation. The document notes limitations in current models and calls for additional research, particularly spill tests, to improve accuracy.
Comparison of MOC and Lax FDE for simulating transients in Pipe Flows (IRJET Journal)
This document compares the method of characteristics (MOC) and Lax finite difference explicit (Lax FDE) methods for simulating transients in pipe flows. It develops numerical models using both MOC and Lax FDE to discretize the governing equations of fluid flow through a pipe. The models are implemented using data from a previous study to simulate pressure and discharge changes after rapid valve closure. The results show that Lax FDE provides more damping of pressure and discharge fluctuations compared to MOC. Therefore, the document concludes that Lax FDE is a better numerical method for calculating hydraulic transients in pipes.
This document presents a hybrid Measure-Correlate-Predict (MCP) method for wind resource assessment that combines predictions from multiple nearby meteorological stations. The existing MCP methods only use data from one reference station and do not consider distance or elevation differences between stations. The new hybrid MCP method assigns weights to individual MCP predictions based on distance and elevation differences to the target site. It was evaluated using stations in North Dakota and showed improved accuracy over individual MCP methods based on error metrics and predicted wind farm power generation. The hybrid approach more accurately characterized the long-term wind distribution at the target site.
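The weighting idea can be sketched as follows; the weighting function, station values, and the 50/50 split between distance and elevation terms are all illustrative assumptions, not the paper's exact formulation:

```python
def hybrid_mcp(predictions, distances, elev_diffs, w_d=0.5, w_e=0.5):
    """Combine per-station MCP wind-speed predictions with weights that
    decay with distance (km) and elevation difference (m) to the target
    site, so nearer, similar-elevation references dominate."""
    scores = [w_d / d + w_e / (e + 1.0) for d, e in zip(distances, elev_diffs)]
    total = sum(scores)
    return sum(p * s for p, s in zip(predictions, scores)) / total

# Three hypothetical reference stations and their individual MCP estimates (m/s).
speed = hybrid_mcp(predictions=[7.0, 8.0, 6.5],
                   distances=[10.0, 50.0, 80.0],
                   elev_diffs=[5.0, 100.0, 20.0])
print(round(speed, 2))
```

The nearby, low-elevation-difference station pulls the combined estimate toward its own prediction of 7.0 m/s.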
Online flooding monitoring in packed towers (James Cao)
This document proposes an enhanced data-driven method called EDPCA for online flooding monitoring in packed towers. EDPCA improves upon DPCA (dynamic principal component analysis) monitoring by first using fuzzy c-means clustering to separate historical data into subsets. Then, multiple single DPCA models are trained on each subset. When a new data point arrives, each DPCA model evaluates it, and the results are integrated using Bayesian inference to obtain the overall monitoring result. The method was tested on an air-water packed tower and showed better performance than single DPCA monitoring.
Determination of the corrosion rate of a MIC influenced pipeline using four c... (GeraldoRossoniSisqui)
1) Gasunie used data from four pig runs over five years on a pipeline affected by microbiologically influenced corrosion (MIC) to determine the corrosion rate. Three approaches were used: the first calculated a single rate for all defects, the second calculated individual rates, and the third used maximum likelihood estimation.
2) The third approach, which took into account measurement errors and ensured positive rates, estimated an average rate of 0.20 mm/yr with initiation largely at installation.
3) Analyzing shallow and deep defects separately found average rates of 0.23 mm/yr and 0.25 mm/yr respectively, indicating depth does not significantly impact the rate.
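The first two approaches reduce to simple per-defect rate arithmetic, which can be sketched as below (illustrative depths; the positivity constraint is applied as a clamp, whereas the paper's third approach handles measurement error properly via maximum likelihood):

```python
def corrosion_rates(depths_run1, depths_run2, years_between, floor=0.0):
    """Per-defect corrosion rates (mm/yr) from two inspection (pig) runs,
    clamped to be non-negative. Measurement error is ignored here."""
    return [max((d2 - d1) / years_between, floor)
            for d1, d2 in zip(depths_run1, depths_run2)]

# Hypothetical defect depths (mm) at two runs five years apart; the third
# defect appears shallower due to measurement noise and is clamped to zero.
rates = corrosion_rates([1.0, 2.0, 1.5], [2.0, 2.9, 1.4], years_between=5.0)
print([round(r, 2) for r in rates])
```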
The document presents a new method called GA-KELM for predicting air quality index (AQI) values. It introduces extreme learning machines (ELM) and discusses their limitations. It then proposes using a genetic algorithm to optimize the number of hidden nodes, thresholds, and weights in a kernel extreme learning machine (KELM) model in order to improve prediction accuracy. Experimental results on real-world datasets show the GA-KELM method trains faster and more accurately predicts AQI values than other methods like CMAQ, SVM, and DBN-BP.
A Smart air pollution detector using SVM Classification (IRJET Journal)
This document summarizes a research paper that proposes a smart air pollution detector using an SVM classification model. It begins with an abstract that describes the need to control rising air pollution levels in developing countries like India. It then discusses particulate matter (PM) and its health risks when concentrated. The paper proposes to regularly check PM concentration levels using machine learning techniques. It reviews related work applying models like naive Bayes, SVM and regression to predict air quality. It then describes the existing systems' limitations and proposes a system that classifies PM2.5 levels using logistic regression and forecasts levels using an SVM model for improved accuracy. The paper analyzes the results and concludes that machine learning can accurately predict future pollution levels to help people be aware and take action.
Atmospheric Pollutant Concentration Prediction Based on KPCA-BP (ijtsrd)
PM2.5 prediction research has important significance for improving human health and atmospheric environmental quality. This paper uses a model combining the kernel principal component analysis (KPCA) method and a neural network to study the prediction of meteorological pollutant concentration, and compares the experimental results with the predictions of the original neural network and the principal component analysis neural network. Based on O3, CO, PM10, SO2 and NO2 concentrations and parallel meteorological data for Beijing from 2016 to 2020, the PM2.5 concentration was predicted. First, the dimensionality of the data is reduced, and then the KPCA-BP neural network algorithm is used for training. The results show that the mean absolute error, root mean square error and explained variance score of the combined model are relatively good, the generalization ability is strong, and the extreme value prediction is the best, outperforming the single models. Xin Lin | Bo Wang | Wenjing Ai "Atmospheric Pollutant Concentration Prediction Based on KPCA-BP" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-6 | Issue-5, August 2022, URL: https://www.ijtsrd.com/papers/ijtsrd51746.pdf Paper URL: https://www.ijtsrd.com/engineering/environment-engineering/51746/atmospheric-pollutant-concentration-prediction-based-on-kpcabp/xin-lin
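The dimensionality-reduction stage of such a model can be sketched in NumPy as kernel PCA with an RBF kernel; the features below are random stand-ins for the pollutant/meteorology inputs, and the BP neural network stage is omitted:

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA: build an RBF kernel matrix, centre it in feature space,
    and project the samples onto its leading eigenvectors."""
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
    n = len(X)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one   # double-centre the kernel
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    # Projection: K_c alpha, with alpha normalized by 1/sqrt(eigenvalue).
    return Kc @ (vecs[:, idx] / np.sqrt(np.maximum(vals[idx], 1e-12)))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))    # stand-in for 5 pollutant/weather features
Z = kernel_pca(X, n_components=2, gamma=0.1)
print(Z.shape)
```

The reduced components `Z` would then be fed to the BP (backpropagation) network for PM2.5 regression.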
Analysis Of Air Pollutants Affecting The Air Quality Using ARIMA (IRJET Journal)
This document discusses analyzing air pollutants affecting air quality using the ARIMA time series model. It begins with an abstract describing the decreasing air quality due to factors like traffic and industry. It then discusses predicting and forecasting the Air Quality Index using time series models like ARIMA. The document reviews literature on previous studies analyzing air pollution data using techniques like neural networks and random forests. It describes preprocessing time series air pollution data to address missing values and assess stationarity before deploying the ARIMA model to make predictions.
This document discusses a study conducted to calibrate and validate air quality models used in environmental impact assessments in India. The study involved collecting emissions data from point, area, and line sources as well as meteorological data. Air quality was then monitored and models were used to predict pollutant concentrations, which were compared to observed values. The model that took into account emissions from all source types (point, area, and line) produced predictions closest to observed concentrations. Additional scenarios were run varying the stability class input to the model.
This document describes research using genetic programming (GP) and artificial neural networks (ANN) to develop short-term air quality forecast models for Pune, India. 36 models were developed using daily average meteorological and pollutant concentration data from 2005-2008 to predict concentrations of SOx, NOx, and particulate matter one day in advance. The models were designed to be robust in situations where complete input data is unavailable. Performance of the GP and ANN models was evaluated based on correlation, error, and other statistical measures. The research found that the GP models generally performed better than the ANN models, especially in cases with incomplete data, and had the advantage of generating equation-based forecasts.
This document discusses applying a novel approach using multi-criterion decision analysis (MCDA) with the generalized likelihood uncertainty estimation (GLUE) method to quantify uncertainty in hydrological modeling. Specifically, it examines uncertainty in the SLURP hydrological model. Rather than considering overall Nash-Sutcliffe efficiency, the approach considers NSE values for different flow magnitudes simultaneously. The TOPSIS MCDA method is used to compute predictive intervals by considering NSE values for different flow periods simultaneously. The Kootenay Catchment case study is used to demonstrate the MCDA-GLUE approach.
Calculation of solar radiation by using regression methods (mehmet şahin)
Abstract. In this study, solar radiation was estimated at 53 locations over Turkey with varying climatic conditions using linear, ridge, lasso, smoother, partial least squares, KNN and Gaussian process regression methods. Data from 2002 and 2003 were used to obtain the regression coefficients of the relevant methods. The coefficients were obtained from the input parameters: month, altitude, latitude, longitude and land-surface temperature (LST). The LST values were derived from National Oceanic and Atmospheric Administration Advanced Very High Resolution Radiometer (NOAA-AVHRR) satellite data. Solar radiation for 2004 was then calculated using the obtained coefficients in the regression methods, and the results were compared statistically. The most successful method was Gaussian process regression; the least successful was lasso regression. The mean bias error (MBE) of the Gaussian process regression method was 0.274 MJ/m2, its root mean square error (RMSE) was 2.260 MJ/m2, and its correlation coefficient was 0.941. The statistical results are consistent with the literature, and the Gaussian process regression method is recommended for other studies.
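The three reported metrics (MBE, RMSE, correlation coefficient) are straightforward to compute; the readings below are illustrative values, not the Turkish station data:

```python
import math

def mbe(obs, pred):
    """Mean bias error: average signed deviation (positive = overprediction)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error: penalizes large deviations quadratically."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))

def corr(obs, pred):
    """Pearson correlation coefficient between observed and predicted series."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

# Hypothetical daily solar radiation values (MJ/m2), observed vs predicted.
obs = [18.2, 20.5, 22.1, 19.8]
pred = [18.0, 21.0, 21.8, 20.2]
print(round(mbe(obs, pred), 3), round(rmse(obs, pred), 3), round(corr(obs, pred), 3))
```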
Conference on the Environment - GUERRA presentation Nov 19, 2014 (Sergio A. Guerra)
This document discusses innovative dispersion modeling practices to achieve reasonable conservatism in regulatory modeling demonstrations. It presents a case study evaluating the Emissions and Meteorological Variability Processor (EMVAP) and approaches to establish background concentrations. The case study models SO2 concentrations from a power plant using 1) constant emissions, 2) variable emissions, and 3) EMVAP. EMVAP provides more realistic concentrations while accounting for emission variability. Using the 50th percentile monitored background concentration when combining with modeled values provides statistical conservatism compared to using high percentile values.
This document summarizes a study that used the Generalized Likelihood Uncertainty Estimation (GLUE) method to analyze parametric uncertainty in hydrological modeling of the Kootenay Watershed in Canada. The study used the SLURP hydrological model and analyzed over 1 million parameter combinations using the GLUE method. The results identified distributions for key model parameters and showed variability in parameter averages and distributions between different land cover types within the watershed. This provided insights into parametric uncertainties and improved understanding of hydrological processes in the study area.
PPT.pdf internship demo on machine learning (Misbanausheen1)
The document summarizes an internship project to predict air quality using linear regression. The intern collected air quality data on carbon monoxide and ozone levels from the World Weather Repository. Using Python and scikit-learn, a linear regression model was trained to predict ozone levels based on carbon monoxide levels. The model achieved a prediction score of [number]%. The intern concluded the project demonstrated the potential of machine learning techniques like linear regression for making accurate predictions to benefit various sectors.
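The underlying fit is ordinary least squares with a single predictor, which scikit-learn's `LinearRegression` performs internally; it can be sketched without the library (illustrative readings, not the World Weather Repository data):

```python
def fit_line(x, y):
    """Single-predictor ordinary least squares: slope = cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

# Hypothetical CO (ppm) readings and co-occurring ozone (ppb) readings.
co = [0.4, 0.6, 0.8, 1.0, 1.2]
o3 = [30.0, 34.0, 38.0, 42.0, 46.0]
slope, intercept = fit_line(co, o3)
print(round(slope, 3), round(intercept, 3))
```

A new ozone prediction is then `slope * co_reading + intercept`; the model's score reported by scikit-learn corresponds to R-squared on held-out data.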
Air quality index forecasting using time series analysis.pptx (CUInnovationTeam)
This document summarizes a research project analyzing air quality data from 2019-2022 to study the impact of COVID-19 lockdowns. The research collected air quality index data from 9 Asian cities, analyzed trends, and used time series forecasting models like ARIMA, Prophet, and LSTM to predict future air quality. Results showed average air quality improved during 2020 lockdowns but increased again after. The research concludes with recommendations on long-term measures like sustainable industry and renewable fuels to maintain improved air quality.
Defining Homogenous Climate zones of Bangladesh using Cluster Analysis (Premier Publishers)
Climate zones of Bangladesh are identified using the mathematical methodology of cluster analysis. Monthly rainfall data from 34 climate stations for 1991 to 2013 are used in the cluster analysis. Five agglomerative hierarchical clustering methods, based on six commonly used proximity measures, are chosen to perform the regionalization. In addition, three popular techniques (K-means, fuzzy and density-based clustering) are applied initially to decide the most suitable method for identifying homogeneous regions. Cluster stability is also tested using nine validity indices. The Ward method based on Euclidean distance, K-means and fuzzy clustering are judged most likely to yield acceptable results in this particular case, as is often the case in climatological research. In this analysis we found seven different climate zones in Bangladesh.
Air pollution is a global environmental challenge that has continued to receive worldwide attention despite the recent decline in concentration of atmospheric pollutants following stringent environmental protection regulations. The major source of this pollution remains fossil fuels; hence the urgent need for cleaner energy sources. This study presents a review of the models applied in monitoring ambient air quality. The primary aim of air pollution modeling is to identify and quantitatively characterize pollutant emission at its source and subsequent dispersion through the atmosphere, subject to meteorological conditions, physical and chemical transformations. The common models and model assumptions for modeling air pollution and quality were critically reviewed and analyzed in this work for application in both forecasting and estimation of air pollutants on the basis of considered causes and in air quality assessment and air pollution control.
Use of Probabilistic Statistical Techniques in AERMOD Modeling Evaluations (Sergio A. Guerra)
The advent of the short term National Ambient Air Quality Standards (NAAQS) prompted modelers to reassess the common practices in dispersion modeling analyses. The probabilistic nature of the new short term standards also opens the door to alternative modeling techniques that are based on probability. One of these is the Monte Carlo technique that can be used to account for emission variability in permit modeling.
Currently, it is assumed that a given emission unit is in operation at its maximum capacity every hour of the year. This assumption may be appropriate for facilities that operate at full capacity most of the time. However, in most cases, emission units operate at variable loads that produce variable emissions. Thus, assuming constant maximum emissions is overly conservative for facilities such as power plants that are not in operation all the time and which exhibit high concentrations during very short periods of time.
Another element of conservatism in NAAQS demonstrations relates to combining predicted concentrations from the AMS/EPA Regulatory Model (AERMOD) with observed (monitored) background concentrations. Normally, some of the highest monitored observations are added to the AERMOD results yielding a very conservative combined concentration.
A case study is presented to evaluate the use of alternative probabilistic methods to complement the shortcomings of current dispersion modeling practices. This case study includes the use of the Monte Carlo technique and the use of a reasonable background concentration to combine with the AERMOD predicted concentrations. The use of these methods is in harmony with the probabilistic nature of the NAAQS and can help demonstrate compliance through dispersion modeling analyses, while still being protective of the NAAQS.
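The Monte Carlo idea can be sketched as follows; the uniform load distribution and the linear concentration response are both toy assumptions for illustration (AERMOD itself, and any real dispersion physics, are not involved):

```python
import random

def monte_carlo_design_value(n_hours=8760, trials=200, seed=42):
    """Sample hourly emission loads instead of assuming maximum emissions
    every hour, then average a high-percentile hourly concentration over
    many simulated years."""
    rng = random.Random(seed)
    peaks = []
    for _ in range(trials):
        # Load factor between 30% and 100% of capacity each hour;
        # assume concentration scales linearly, 50 ug/m3 at full load.
        conc = [rng.uniform(0.3, 1.0) * 50.0 for _ in range(n_hours)]
        conc.sort()
        peaks.append(conc[int(0.99 * n_hours)])  # 99th-percentile hour
    return sum(peaks) / trials

dv = monte_carlo_design_value(n_hours=1000, trials=50)
print(round(dv, 1))
```

Under constant-maximum emissions every hour would contribute 50 ug/m3; sampling the load distribution yields a design value below that ceiling while still reflecting high-load hours.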
(1) NASA has built a quasi-operational global modeling and data assimilation system to monitor carbon dioxide concentrations with about 3 months of latency. (2) The system assimilates observations from NASA's OCO-2 satellite to produce 3D CO2 fields and has been used to track changes from events like COVID-19 and the 2019-20 Indian Ocean Dipole. (3) NASA is working to improve accessibility of greenhouse gas data products and push toward true near real-time monitoring through initiatives like EIS-GHG and by addressing challenges around data latency and developing quasi-operational flux estimation methods.
Urbanization and population growth negatively impact air quality. This study used spatial interpolation techniques in a GIS to estimate the temporal and spatial variation of air pollution levels over Mumbai, India. Air pollution data on sulfur dioxide, nitrogen dioxide, and suspended particulate matter was collected from three monitoring organizations and interpolated using inverse distance weighting, kriging spherical, and kriging Gaussian methods. The results showed that winter had the highest pollution concentrations due to lower temperatures and wind. Kriging spherical and Gaussian techniques best matched the observed data. The study concluded kriging performed best for interpolation and can help evaluate the health impacts of air pollution.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
1. Marshall University
College of Science
Department of Mathematics
- STA 564 -
Time Series Analysis and Forecasting
focused on Air Pollution in an Urban
Area
By Kenneth Guzman
December 7, 2018
2. Contents
1 INTRODUCTION
2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality
3 Factor Analysis and PCA
4 Box Jenkins Methodology
5 Personal Study Carried out in R
6 R Code Explanation and Software Packages Used
7 Conclusion
References
3. TIME SERIES ANALYSIS FORECASTING
NOTES
At its core, the behavior of air pollution in the atmosphere is strongly governed by meteorology. However, in the "univariate" models we will consider, it is assumed that the final concentration of air pollutants in the atmosphere is the net result of all the complex interactions of meteorology, chemistry, transport, diffusion, etc. For this reason, the combined information about their effect on air pollutant concentration is contained in the corresponding time series in a stochastic way. Using this approach, calculations are simplified and performed using only the time series of the pollutant, without explicit inclusion of meteorological or other measurements.
Four professors from Plovdiv University in Bulgaria produced a research paper on time series analysis concerning air pollution. The methods used were explicitly stated in their article as:
(i) Identify correlation-type dependencies and grouping of observed air pollutants using the method of factor analysis, to explain mutual effects of pollution.
(ii) Conduct time series analysis by determining seasonal ARIMA (based on hourly data) relevant parametric models of pollutants.
(iii) Analysis and diagnostics of the constructed models.
(iv) Application of the models for short-term forecasting.
(v) Interpretation of the results and definition of the conditions contributing to the exceeding of national and European concentration norms for the considered air pollutants.
Their study was carried out using IBM SPSS 19 and EViews 7.[3]
4.
1 INTRODUCTION
Even though there are established regulations for monitoring and controlling effects on
air quality in certain territories, air quality may remain unsatisfactory. Let's consider the particular case where our focus lies within the town of Blagoevgrad, Bulgaria. Blagoevgrad is a typical representative of a small urban region, with a population of approximately 70,000. Time span of study: a one-year period from September 1st, 2011 to August 31st, 2012, based on hourly measurements, during which six air pollutants were observed. Factor analysis and Box-Jenkins methodology were applied to inspect concentrations of the primary air pollutants of interest. The pollutants were grouped into three factors, and the degree of contribution of the factors to the overall pollution was determined; this contribution was interpreted as the presence
of common sources of pollution. The classical techniques of principal component analysis
(PCA) and factor analysis are important statistical instruments frequently used in the
environmental sciences.
The focus of the study involved the performance of time series analysis and the development
of univariate stochastic seasonal autoregressive integrated moving average (SARIMA) models
with hourly seasonality. The study incorporates the Yeo-Johnson power transformation for variance stabilization of the data, and model selection using the Bayesian Information Criterion. The SARIMA models obtained in the Bulgarian study demonstrated good fits to the observed air pollutants and short-term predictions 72 hours ahead, specifically in the case of ozone and particulate matter PM10. The methods presented allowed the building of less complex models that are effective for short-term air pollution forecasting and useful for advance-warning purposes in urban areas.[3]
Continuous and careful monitoring and forecasting of atmospheric air pollutants is important
when evaluating regulatory control measures related to air quality. In Bulgaria, 12 types
of pollutants are systematically monitored by more than 36 automated stations run by the Executive Environment Agency (EEA), which manages and coordinates activities related to the control and environmental protection of the country. Atmospheric air quality reports for the various regions of the country are regularly published, and from this much data is accumulated. This data accumulation is what allows us to carry out statistical analysis, which leads to the discovery of general patterns and dependencies for different time periods and
relationships between observed air pollutants. The observed air pollutants related to the
study carried out in Blagoevgrad, Bulgaria are concentrations of particulate matter PM10,
nitrogen oxide NO, nitrogen dioxide NO2, nitrogen oxides NOx, sulfur dioxide SO2, and
ground level ozone O3. The data measurements are expressed in units of mass concentration of pollutants in µg/m³; only NOx is in ppb (parts per billion), as it observes pollution from all kinds of nitrogen oxides. The data consisted of 8,744 observations (hourly data).
The goal of their study was to demonstrate the capabilities of the mentioned methods, which
can be applied to other recorded sets including for shorter and longer periods of time.
5.
2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality
Time series data often requires preparation before forecasting methods are used; for this reason a normal or near-normal distribution of the univariate data is important, because it reduces issues when we forecast future values. The obtained K-S statistic indicated non-normality of the data collected in Bulgaria, which led to the transformation of the data prior to constructing the forecasting models. In that particular case the Yeo-Johnson transformation was carried out, after which the data satisfied the Kolmogorov-Smirnov test for normality at the 0.05 level of significance and may be assumed to be normally distributed. The Yeo-Johnson transformation finds the optimal value of lambda that minimizes the Kullback-Leibler distance between the normal distribution and the transformed distribution.[1][2]
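The Kolmogorov-Smirnov statistic mentioned above is simply the largest vertical gap between the empirical CDF of the (transformed) data and the normal CDF. A minimal hand-rolled sketch in Python, purely to make the statistic concrete (the studies themselves ran the test in SPSS and R):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF expressed via the error function.
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_statistic(sample, mu=0.0, sigma=1.0):
    """One-sample K-S statistic D = sup |F_n(x) - F(x)| against N(mu, sigma^2)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical CDF jumps at x, so compare F to its value
        # just before the jump (i/n) and just after it ((i+1)/n).
        d = max(d, abs(f - i / n), abs(f - (i + 1) / n))
    return d
```

In practice D would be compared against the K-S critical value at the 0.05 level for the given sample size, which is what "satisfying the test for normality" means above.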
Properties of the Yeo-Johnson transformation:

g(x; λ) = ((x + 1)^λ − 1) / λ              if λ ≠ 0, x ≥ 0
g(x; λ) = log(x + 1)                       if λ = 0, x ≥ 0
g(x; λ) = ((1 − x)^(2−λ) − 1) / (λ − 2)    if λ ≠ 2, x < 0
g(x; λ) = −log(1 − x)                      if λ = 2, x < 0
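As a direct check on the piecewise definition above, here is a minimal Python sketch of the transformation (illustrative only; the personal study used the yeojohnson function from the R package bestNormalize, which also searches for the optimal λ):

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transformation g(x; lambda), branching on the sign of x."""
    if x >= 0:
        if lam != 0:
            return ((x + 1) ** lam - 1) / lam
        return math.log(x + 1)          # lambda = 0 branch
    if lam != 2:
        return ((1 - x) ** (2 - lam) - 1) / (lam - 2)
    return -math.log(1 - x)             # lambda = 2 branch
```

Note that λ = 1 leaves non-negative data essentially unchanged, which is why fitted λ values near 1 (like the NO2 values reported later) indicate the data was already close to normal.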
3 Factor Analysis and PCA
The statistical techniques of factor analysis and principal component analysis help identify patterns in the correlation between variables. The patterns identified are used to create factors, which was the case in Bulgaria and allowed the grouping of correlated pollutants. The steps followed for the particular case in Bulgaria were: (a) calculation of the correlation matrix; (b) testing the adequacy of factor analysis; (c) factor extraction; (d) factor rotation; and (e) score calculation of factor variables. The particular advantages of these methods are
that they reveal strong correlation relationships between observed variables and allow their
grouping into new variables (factors) in order to reduce the dimensions of the complex data
structure. The factors can thereafter be used to build regression or other types of models.[5]
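The dimension-reduction idea can be made concrete with a toy two-variable sketch in pure Python (only an illustration; the Bulgarian study ran a full factor analysis on six pollutants in SPSS). The first principal component corresponds to the largest eigenvalue of the covariance matrix, and its share of the total variance measures how much of the joint behavior a single factor captures:

```python
import math

def first_pc_share(xs, ys):
    """Largest eigenvalue of the 2x2 sample covariance matrix of (xs, ys),
    and the fraction of total variance it explains."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    xc = [x - mx for x in xs]
    yc = [y - my for y in ys]
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(v * v for v in xc) / (n - 1)
    syy = sum(v * v for v in yc) / (n - 1)
    sxy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    tr = sxx + syy
    det = sxx * syy - sxy * sxy
    lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # largest eigenvalue
    return lam1, lam1 / tr
```

For two perfectly correlated series the first component explains 100% of the variance (one common "factor"), while for uncorrelated series of equal variance the share drops to 50%, i.e. no useful grouping exists.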
4 Box Jenkins Methodology
Other methods frequently used in time series analysis and forecasting are the auto-regressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) models, also known as Box-Jenkins stochastic models. Box-Jenkins methodology is widely applied in air quality research among other disciplines, and is a systematic strategy for identifying, fitting, and forecasting univariate time series data. ARIMA models generally take the form ARIMA(p,d,q),
(Footnote: In mathematical statistics, the Kullback-Leibler divergence, also called relative entropy, is a measure of how one probability distribution differs from a second, reference probability distribution.)
6.
where p is the number of parameters describing the auto-regressive process, d is the number of
nonseasonal differences needed to reach stationarity, and q is the number of lagged forecast
errors in the prediction equation. Similarly, the SARIMA models take the general form
ARIMA(p,d,q)(P,D,Q)s, where P is the number of seasonal auto-regressive terms, D is the
order of seasonal differencing and Q is the number of seasonal moving average terms. In the
seasonal part of the model, the three parameters P,D,Q operate across multiples of lag s,
where s is the number of time periods until a pattern repeats itself.
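The role of the d and D parameters can be seen in a small sketch: differencing once removes a linear trend and leaves a stationary (here constant) series, and a seasonal difference does the same at lag s. Illustrative Python only, not the Box-Jenkins fitting itself:

```python
def difference(series, lag=1):
    """Apply one differencing pass: y[t] - y[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A series with a linear trend is non-stationary...
trend = [3 + 2 * t for t in range(6)]        # 3, 5, 7, 9, 11, 13
# ...but one nonseasonal difference (d = 1) makes it constant.
print(difference(trend))                     # [2, 2, 2, 2, 2]

# A seasonal difference uses lag s, e.g. s = 4 for quarterly data.
seasonal = [10, 20, 30, 40, 11, 21, 31, 41]
print(difference(seasonal, lag=4))           # [1, 1, 1, 1]
```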
Main advantages of the Box-Jenkins approach:
(i) Applicability for modeling and forecasting practically any time series that is stationary or can be reduced to stationary by a differencing procedure.
(ii) Ability to extract all the trends and serial correlations in the data, with a minimized sequence of white noise (shocks), through inclusion in one general model equation that gets to the basis of historical data development.
(iii) The method has been incorporated into many standard software packages, such as those within R, SPSS, etc., which speeds up and assists the modeling process considerably.
5 Personal Study Carried out in R
Using the presented methods, I was able to carry out my own study using the statistical
software R. Using data provided by our own Environmental Protection Agency here in the
United States (https://www.epa.gov/outdoor-air-quality-data), I accessed pollutant concentration
data for the city of Richmond, Virginia, which has a population of approximately 220,000.
Time span of observed data: a total of 4 years of data was accessed, covering January 2010 to December 2013, based on weekly measurements of the following air pollutants: particulate matter PM2.5, particulate matter PM10, and lead Pb are expressed in units of mass concentration (µg/m³); carbon monoxide CO and ground level ozone O3 are in ppm (parts per million); sulfur dioxide SO2 and nitrogen dioxide NO2 are in ppb (parts per billion). The goal of my personal research is to apply the time series analysis
and forecasting methods from the research paper produced in Bulgaria, to a local city here
in the US. As was the case in Bulgaria, once these methods are applied to the Richmond
pollutant data I hope to visually show an appropriate forecast for each pollutant for the year
2013.
Before I proceed, I would like to point out that while the research paper concerning Bulgaria highlighted a factor analysis and principal component analysis approach, the correlation matrix calculated in R for the Richmond pollutant data sets displayed no signs of positive or negative correlation between the pollutants; therefore, I did not proceed to carry out any sort of factor analysis or PCA. Also, the 2013 pollutant data sets were strictly used to compare our forecast models to the actual data recorded by the EPA in 2013.
7.
Directly below is the correlation matrix for all 7 pollutants concerning data over the time
span of the years 2010, 2011, and 2012.
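Each entry of such a correlation matrix is a Pearson coefficient between a pair of pollutant series. A minimal Python sketch of the computation, with made-up weekly values standing in for the EPA data (the actual matrix was computed in R):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical weekly values, for illustration only.
pm25 = [9.1, 10.4, 8.7, 12.0, 11.3]
ozone = [0.041, 0.035, 0.047, 0.030, 0.044]
print(round(pearson(pm25, ozone), 3))
```

Values near zero across all pairs are what justified skipping factor analysis for the Richmond data.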
Analyzing PM-2.5 using 3 year data
The first pollutant we will analyze is particulate matter PM2.5.
The lambda value used to transform the original PM-2.5 observations was λ = 0.227158.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
8.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for PM-2.5 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original PM-2.5 2013 observations was λ = 0.05030683.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the values of 2013 for the pollutant PM-2.5.
9.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original PM-2.5 observations for the year 2012 was λ = 0.7078218.
10.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our arima model, the next step was to access our 2013 pollutant concentration
data for PM-2.5 to see how accurately auto.arima() predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original PM-2.5 2013 observations was λ = 0.05030683.
Finally, once we plot the arima model against the 2013 time series plot, I believe the auto.arima function is somewhat appropriate for predicting the trend of the pollutant PM-2.5 for the year 2013.
11.
Analyzing PM10 using 3 year data
The second pollutant we will analyze is particulate matter PM10.
The lambda value used to transform the original PM10 observations was λ = 0.2409915.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Below is the time series plot using only the forecast function in R.
12.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for PM10 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original PM10 2013 observations was λ = 0.7845362.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe neither the forecast nor the auto.arima function is appropriate for predicting the values of 2013 for the pollutant PM10.
13.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original PM10 observations for the year 2012 was λ = −0.04297711.
Below is the time series plot for 2012 after a Yeo-Johnson transformation.
14.
The time series plot using only the forecast function did not yield an appropriate graph in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing Pb(Lead) using 3 year data
The third pollutant we will analyze is lead Pb.
The lambda value used to transform the original Pb observations was λ = −4.99994.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
15.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for Pb and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original Pb 2013 observations was λ = −4.99994.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the values of 2013 for the pollutant Pb.
16.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original Pb (lead) observations for the year 2012 was λ = −4.99994.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
17.
The time series plot using only the forecast function did not yield an appropriate graph in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing CO using 3 year data
The fourth pollutant we will analyze is carbon monoxide CO.
The lambda value used to transform the original CO observations was λ = −3.577325.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
18.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for CO and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original CO 2013 observations was λ = −2.432302.
19.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the values of 2013 for the pollutant CO.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original CO observations for the year 2012 was λ = −3.641187.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
20.
The time series plot using only the forecast function did not yield an appropriate graph in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our arima model, the next step was to access our 2013 pollutant concentration
data for CO and see how accurately the arima model predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original CO 2013 observations was λ = −2.432302.
Finally, once we plot the arima model against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the trend of the pollutant CO for 2013.
21.
Analyzing O3 using 3 year data
The fifth pollutant we will analyze is ground level ozone O3.
The lambda value used to transform the original O3 observations was λ = 3.615548.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
22.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for O3 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original O3 2013 observations was λ = 4.99994.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe the forecast function is most appropriate for predicting the values of 2013 for the pollutant O3.
23.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original O3 observations for the year 2012 was λ = 4.99994.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
24.
The time series plot using only the forecast function did not yield an appropriate graph in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing SO2 using 3 year data
The sixth pollutant we will analyze is sulfur dioxide SO2.
The lambda value used to transform the original SO2 observations was λ = −0.227093.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
25.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for SO2 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original SO2 2013 observations was λ = 0.2616144.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe the forecast function is most appropriate for predicting the values of 2013 for the pollutant SO2.
26.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original SO2 observations for the year 2012 was λ = −0.1123281.
Below is the time series plot for 2012 after a Yeo-Johnson transformation.
27.
The time series plot using only the forecast function did not yield an appropriate graph in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing NO2 using 3 year data
The seventh and final pollutant we will analyze is nitrogen dioxide NO2.
The lambda value used to transform the original NO2 observations was λ = 0.9783584.
Below is the time series plot for the 3 years after a Yeo-Johnson transformation.
28.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our forecast and arima models, the next step was to access our 2013
pollutant concentration data for NO2 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original NO2 2013 observations was λ = 1.003092.
29.
Finally, once we plot the forecast and arima models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the values of 2013 for the pollutant NO2.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original NO2 observations for the year 2012 was λ = 1.229131.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
30.
The time series plot using only the forecast function did not yield an appropriate graph in R.
Directly below is the time series plot using auto.arima function in R.
Now that we have our arima model, the next step was to access our 2013 pollutant concentration
data for NO2 and see how accurately the model predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original NO2 2013 observations was λ = 1.003092.
Finally, once we plot the arima model against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the trend of the pollutant NO2 for 2013.
31.
6 R Code Explanation and Software Packages Used
The following packages in the R software were used: MASS, bestNormalize, forecast.
• From MASS, the function truehist was used to plot histograms of the pollutant data before and after the yeojohnson transformation was applied, to visually show the change from a non-normal to an approximately normal distribution of the data.
• From bestNormalize, the function yeojohnson was used to transform the pollutant data from non-normal to approximately normally distributed, in order to better carry out our statistical analysis.
• From forecast, the functions forecast and auto.arima were used; each played a central role in analyzing prior pollutant observations and forecasting future values as accurately as R allows for each pollutant.
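As a quick illustration of the first two bullets, the following sketch pairs truehist with the yeojohnson transformation. Skewed synthetic data stands in for the actual pollutant observations:

```r
# Sketch only: synthetic right-skewed data stands in for the pollutants.
library(MASS)           # truehist()
library(bestNormalize)  # yeojohnson()

set.seed(1)
x <- rexp(174, rate = 0.5)                 # skewed stand-in observations

yj <- yeojohnson(x, standardize = FALSE)   # estimates lambda, transforms x
print(yj$lambda)                           # the fitted lambda value

# Side-by-side histograms: before and after the transformation
par(mfrow = c(1, 2))
truehist(x,      main = "Before transformation")
truehist(yj$x.t, main = "After Yeo-Johnson")
```

The yeojohnson object stores both the estimated lambda (`$lambda`) and the transformed series (`$x.t`), which is the form in which the lambda values quoted earlier in this report were obtained.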
The main functions I will highlight in this section are the forecast() and auto.arima() functions in R, but I will also briefly explain my usage of the ts() and yeojohnson() functions. It was important to my study that, within the forecast() function, level = FALSE: while having confidence intervals in our graphs could be useful, they were not needed here, since I was mostly interested in the specific point values that the forecast() function gave in its output. It was also very important that we forecast exactly 59 future values, simply because there are exactly 59 values in our EPA 2013 data for each pollutant. In the auto.arima() function, no restrictions needed to be set within the call, but it was most important that we accessed our forecast values by auto.arima()$f; for reference, we can also access the original values that were put into the function by auto.arima()$x.
One last note: when plotting the time series for the 3-year data, you should notice that within each ts() call frequency = 58, which I interpret as an average of 58 observations per year; I got 58 simply by dividing the total number of observations in our 3-year data by 3, so 174/3 = 58. Within the yeojohnson() function you will notice that standardize = FALSE. If this is not declared within the function, R will by default further standardize the values put into it. I did not find this additional standardization useful when dealing with the Richmond data, mainly because the yeojohnson transformation itself was what was of interest in the Bulgaria study, and I wanted to follow that transformation as-is without further standardization.
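Putting the pieces of this section together, a minimal sketch of the full workflow might look as follows. Simulated values stand in for the Richmond observations; the frequency, horizon, and function arguments are the ones described above:

```r
# Sketch only: simulated data stands in for the 3-year Richmond series.
library(bestNormalize)
library(forecast)

set.seed(1)
raw <- rexp(174, rate = 0.5)                    # 174 obs over 2010-2012

yj     <- yeojohnson(raw, standardize = FALSE)  # no extra standardization
series <- ts(yj$x.t, frequency = 58)            # 174 / 3 = 58 obs per year

fit <- auto.arima(series)
fc  <- forecast(fit, h = 59, level = FALSE)     # 59 values, no intervals

head(fc$mean)  # forecast point values
head(fit$x)    # the original series passed to the model
```

Here the forecast values are read from the forecast object's $mean component and the fitted model's input from $x; the report's auto.arima()$f / auto.arima()$x notation refers to the same pair of quantities.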
7 Conclusion
In the Bulgaria study, the researchers' main goal was to use ARIMA models to forecast 72 hours ahead, since they used hourly data. Similarly, I feel it necessary to highlight the importance of the auto.arima() function in helping to forecast the year 2013. While it was not helpful in forecasting all pollutants, it was definitely more helpful than the forecast() function in identifying the trend or behavior of each pollutant throughout the year(s). The most important finding I came across was that the 2012 data alone was certainly not enough in most cases when attempting to forecast a future year, but the 3-year (2010-2012) data combination allowed both the forecast() and auto.arima() functions to display their usefulness when forecasting. I certainly enjoyed preparing this study and learning about time series, and I hope I am given the opportunity to further explore this discipline in the future.
References
[1] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons. Republished by Dover Publications in 1968; reprinted in 1978. ISBN 0-8446-5625-9.
[2] Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954-959.
[3] Gocheva-Ilieva, S., Ivanov, A., Voynikova, D., and Boyadzhiev, D. (2013). Time series analysis and forecasting for air pollution in small urban area: an SARIMA and factor analysis approach. Stochastic Environmental Research and Risk Assessment, 28, 1045-1060. doi:10.1007/s00477-013-0800-4.
[4] Alcosser, Howard. "Diamond Bar High School" Internal Assessment: Mathematical Exploration. Web. 27 May 2015.
[5] Jolliffe, I. (1986). Principal Component Analysis and Factor Analysis. In Principal Component Analysis. Springer. doi:10.1007/978-1-4757-1904-8_7.