This document provides an overview of statistical procedures commonly used to evaluate analytical data precision and accuracy. It discusses confidence intervals and how to calculate them based on population standard deviation, sample size, and confidence level. Rejection of outlier data and regression analysis for calibration curves are also covered. Examples are provided to illustrate calculation of confidence intervals for single measurements and sample means, as well as outlier detection criteria.
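To make the confidence-interval calculations concrete, here is a minimal Python sketch (not taken from the document's own examples; the measurement values and sigma below are hypothetical) covering both cases the overview mentions: an interval based on a known population standard deviation and one based on the sample standard deviation with Student's t.

```python
# Hypothetical illustration of the confidence-interval formulas described above;
# the numbers are made up and are not the document's worked examples.
import math
from scipy import stats

def ci_known_sigma(x_bar, sigma, n, confidence=0.95):
    """CI for a mean when the population standard deviation is known:
    x_bar +/- z * sigma / sqrt(n)."""
    z = stats.norm.ppf(0.5 + confidence / 2.0)
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

def ci_sample(values, confidence=0.95):
    """CI for a mean estimated from the sample itself:
    x_bar +/- t * s / sqrt(n), with n - 1 degrees of freedom."""
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    t = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    half_width = t * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical replicate measurements (e.g. mg/L):
print(ci_known_sigma(x_bar=30.6, sigma=5.0, n=10))
print(ci_sample([29.8, 31.2, 30.5, 28.9, 31.0, 30.2, 29.5, 31.8, 30.9, 30.4]))
```

Narrowing the interval requires either more replicates (larger n) or a lower confidence level; both dependencies are visible directly in the half-width terms above.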
Webinar slides how to reduce sample size ethically and responsibly (nQuery)
[Webinar] How to reduce sample size...ethically and responsibly | In this free webinar, you will learn various design strategies to help reduce the sample size of your study in an ethical and responsible manner. Practical examples will be used throughout.
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose two novel types of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC), respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local model is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques, called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC), to build accurate and interpretable PXR and PXC models.
We have conducted comprehensive performance studies to evaluate CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins, and that they are especially effective on heterogeneous and high-dimensional datasets. Besides being new types of models, PXR and PXC models can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to the classification of imbalanced datasets, introducing a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we apply a weighting method to boost minority instances and a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than the other learning algorithms in those three applications.
Bayesian Assurance: Formalizing Sensitivity Analysis For Sample Size (nQuery)
Title: Bayesian Assurance: Formalizing Sensitivity Analysis For Sample Size
Duration: 60 minutes
Speaker: Ronan Fitzpatrick, Head of Statistics, Statsols
Watch Here: http://bit.ly/2ndRG4B
In this webinar you’ll learn about:
Benefits of Sensitivity Analysis: What does the researcher gain by conducting a sensitivity analysis?
Why isn't Sensitivity Analysis formalized: Why does sensitivity analysis still lack the type of formalized rules and grounding to make it a routine part of sample size determination in every field?
How Bayesian Assurance works: Bayesian Assurance provides key contextual information on what is likely to happen over the total range of possible values, rather than at the small number of fixed points used in a sensitivity analysis (a short simulation sketch follows this list)
Elicitation & SHELF: How expert opinion is elicited, and how these opinions are then integrated with each other and with prior data using the Sheffield Elicitation Framework (SHELF)
Why use in both Frequentist or Bayesian analysis: How and why these methods can be used for studies whose final analysis will use Frequentist or Bayesian methods
Plus more
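As a rough illustration of the assurance idea mentioned above (a sketch under assumed inputs, not material from the webinar), the snippet below averages the power of a two-arm z-test over a normal prior on the treatment effect instead of evaluating power at a few fixed effect sizes; the prior parameters, standard deviation, and sample size are illustrative assumptions.

```python
# Minimal Monte Carlo sketch of Bayesian assurance for a two-arm trial with a
# normally distributed endpoint; all numbers are illustrative assumptions.
import numpy as np
from scipy import stats

def power_two_sample(delta, sigma, n_per_arm, alpha=0.025):
    """One-sided power of a two-sample z-test at a given true effect delta."""
    se = sigma * np.sqrt(2.0 / n_per_arm)
    z_crit = stats.norm.ppf(1 - alpha)
    return stats.norm.sf(z_crit - delta / se)

def assurance(prior_mean, prior_sd, sigma, n_per_arm, n_sims=100_000, seed=1):
    """Assurance = expected power, averaging over a normal prior on the effect."""
    rng = np.random.default_rng(seed)
    deltas = rng.normal(prior_mean, prior_sd, size=n_sims)
    return power_two_sample(deltas, sigma, n_per_arm).mean()

# Power at a single assumed effect vs. assurance over the whole prior:
print(power_two_sample(delta=0.4, sigma=1.0, n_per_arm=100))
print(assurance(prior_mean=0.4, prior_sd=0.2, sigma=1.0, n_per_arm=100))
```

Because assurance averages over effect sizes the prior considers plausible, it is often lower than the power quoted at the single hoped-for effect, which is exactly the contextual information described above.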
Bayesian Approaches To Improve Sample Size Webinar (nQuery)
Title: Bayesian Approaches To Improve Sample Size
Duration: 60 minutes
Speaker: Ronan Fitzpatrick, Head of Statistics, Statsols
In this webinar you'll learn about:
Bayesian Sample Size Determination: See how the growth of Bayesian analysis has helped transform our ideas about statistical inference and methodologies in clinical trials
Bayesian Assurance: Get an informative answer on how likely it is to see a “positive” outcome from the trial and then make better decisions on what trials to back
Posterior Credible Intervals and Mixed Bayesian Likelihood: Enable researchers to use prior information from pilot studies and other sources to make quicker and better decisions (a short credible-interval sketch follows this list)
Plus much more
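As a hedged illustration of the posterior credible interval idea above (not reproduced from the webinar; the pilot and trial counts are hypothetical), the sketch below folds pilot-study information into a conjugate Beta prior for a response rate and reads off an equal-tailed credible interval from the updated posterior.

```python
# Sketch of a posterior credible interval for a response rate using a
# conjugate Beta-Binomial model; pilot and trial counts are hypothetical.
from scipy import stats

def credible_interval(prior_a, prior_b, successes, failures, level=0.95):
    """Equal-tailed credible interval from the posterior
    Beta(prior_a + successes, prior_b + failures)."""
    posterior = stats.beta(prior_a + successes, prior_b + failures)
    tail = (1 - level) / 2
    return posterior.ppf(tail), posterior.ppf(1 - tail)

# Prior built from a hypothetical pilot study (12 responders out of 40),
# then updated with 30 responders out of 80 in the current trial:
prior_a, prior_b = 1 + 12, 1 + 28   # flat Beta(1, 1) plus pilot counts
print(credible_interval(prior_a, prior_b, successes=30, failures=50))
```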
Application of predictive analytics on semi-structured north Atlantic tropica... (Skylar Hernandez)
A doctoral dissertation final defense investigating which weather pattern components can improve Atlantic tropical cyclone (TC) forecast accuracy, using the C4.5 algorithm on all five-day tropical discussions from 2001-2015.
Innovative Sample Size Methods For Clinical Trials (nQuery)
"Innovative Sample Size Methods for Clinical Trials" is hosted to coincide with the Spring 2018 update to nQuery - The leading Sample Size Software.
Hosted by Ronan Fitzpatrick - Head of Statistics and nQuery Lead Researcher at Statsols - you'll learn about the benefits of a range of procedures and how you can implement them into your work:
1) Dose-escalation with the Bayesian Continual Reassessment Method
CRM is a growing alternative to the 3+3 method for Phase I trials finding the Maximum Tolerated Dose (MTD).
See how researchers can overcome 3+3 drawbacks to easily find the required sample size for this beneficial alternative for finding the MTD.
2) Bayesian Assurance with Survival Example
This Bayesian alternative to power has experienced a rapid rise in interest and application from researchers.
See how Assurance is being used by researchers to discover the true “probability of success” of a trial.
3) Mendelian Randomization
Mendelian randomization (MR) is a method that allows testing of a causal effect from observational data in the presence of confounding factors.
However, designing efficient Mendelian randomization studies requires calculating the appropriate sample sizes. We demonstrate how to achieve this.
4) Negative Binomial Distribution
The negative binomial model is increasingly used to model count data. One of the challenges of applying it in clinical trial design is sample size estimation.
We demonstrate how best to determine the appropriate sample size in the presence of challenges such as unequal follow-up or dispersion.
Poster Randomization Approach in Case-Based Reasoning: Case of Study of Mammo... (Miled Basma Bentaiba)
This document was presented on 8 May 2018 at the doctoral symposium at IFIP International Conference on Computational Intelligence and Its Applications (IFIP CIIA 2018).
Webinar slides - alternatives to the p-value and power (nQuery)
What are the alternatives to the p-value & power? What is the next step for sample size determination? We will explore these issues in this free webinar presented by nQuery
This document provides information about a training module on processing stream flow data organized by the Central Training Unit of the Central Water Commission in India. The training is intended for engineers involved in reviewing, analyzing, and processing stream flow data. The document includes details about the module such as its objectives, key concepts, session plan, and evaluation suggestions. It aims to help participants learn how to analyze and process gauge-discharge data, sediment data, water quality data, bed material data, and meteorological data through methods like consistency checks and reliability assessments.
This document discusses quality assurance and within-laboratory analytical quality control programs. It emphasizes the need for quality assurance due to errors found in analytical results. A quality assurance program includes sample control, standard procedures, equipment maintenance, calibration, and analytical quality control. Within-laboratory quality control focuses on precision using statistical control charts to monitor performance over time and identify issues. Regular quality control practices help ensure reliability of laboratory results.
Gw02 role of dwlr data in groundwater resource estimation (hydrologyproject0)
This document discusses the role of data from Digital Water Level Recorders (DWLRs) in estimating groundwater resources. DWLRs provide high-frequency water level data that can help understand recharge processes and parameters. Their data allows identifying accurate peaks and troughs in the water table to define optimal periods for water balance studies estimating specific yield and rainfall recharge. DWLR hydrographs also aid in determining rainfall amounts needed to initiate recharge, lag times between rainfall and recharge, effective rainfall events, and periods of evapotranspiration loss - all improving the accuracy of water balance assessments and groundwater resource estimation.
This document summarizes the rationalization of the surface water quality monitoring program under the Hydrology Project in India. It discusses that various agencies were monitoring water quality with different objectives and no coordination, resulting in duplication of efforts. The Hydrology Project aims to design a unified monitoring network and methodology. Key points discussed include:
- Monitoring objectives of establishing baseline quality, observing trends, and calculating pollutant flux.
- Frequency of sampling every 2 months at baseline stations and monthly at trend stations to represent all seasons.
- Parameters to include general, nutrients, organic, and microbiological parameters depending on station type.
- Emphasis on representative sampling and sample collection/transport procedures.
This document summarizes the Hydrology Project Phase-II being implemented in Himachal Pradesh. The key points are:
1. The project was approved in 2006 for Rs. 49.50 Crore with an implementation period of 6 years, which was later extended by 23 months.
2. The project aims to strengthen hydrological monitoring networks and institutional capacity in Himachal Pradesh. It includes installation of rain gauges, weather stations, piezometers, labs, and data management systems.
3. As of 2014, most of the planned networks have been installed but some equipment procurement and installations are still ongoing. Data is being collected from most stations and shared with other organizations.
Ch sw water availability study and supply demand analysis in kharun sub basin... (hydrologyproject0)
This document describes a water availability study and supply-demand analysis conducted in the Kharun Sub-Basin of Chhattisgarh, India. The study was carried out by the Water Resources Department of Chhattisgarh and the National Institute of Hydrology between 2014-2017. Key aspects of the study included developing a rainfall-runoff model to assess water availability, estimating current and projected water demands, and evaluating measures to meet future water needs in the sub-basin. Field data collection, drought assessment, infiltration modeling, and stakeholder workshops were also part of the multi-faceted study of water resources management in the Kharun Sub-Basin.
The document discusses two phases of the Central Water Commission's Hydrology Project aimed at establishing a functional hydrological information system and improving institutional capacity in several Indian states. It outlines the objectives, activities, and achievements of each phase, including the development of water monitoring and management software, training programs, and infrastructure improvements at the National Water Academy. The post-project plan and lessons learned from the two phases are also summarized.
Here are the steps to solve this problem (a short computational sketch follows these steps):
1) Calculate the mean (x̅) of the lead concentration data: ∑x/n = 306.3/10 = 30.63
2) Calculate the absolute deviation of each value from the mean: |x - x̅|
3) Calculate the sum of the deviations: ∑|x - x̅| = 50.03
4) Calculate the average deviation: sum of deviations/n = 50.03/10 = 5.003
5) Calculate (x - x̅)² for each value
6) Calculate the sum of the squared deviations: ∑(x - x̅)²
7) Calculate the sample standard deviation: s = √[∑(x - x̅)²/(n - 1)]
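The steps above can be reproduced with a short script. The sketch below is generic; the module's actual ten lead-concentration values are not listed in this summary, so the usage line runs on a clearly hypothetical list.

```python
# Generic implementation of the steps above: mean, average (absolute) deviation,
# sum of squared deviations, and the sample standard deviation computed from it.
# The original ten lead concentrations are not reproduced here; the list used
# below is hypothetical.
import math

def dispersion_summary(values):
    n = len(values)
    mean = sum(values) / n                           # step 1: x-bar = sum(x) / n
    abs_devs = [abs(x - mean) for x in values]       # step 2: |x - x-bar|
    avg_dev = sum(abs_devs) / n                      # steps 3-4: average deviation
    ss = sum((x - mean) ** 2 for x in values)        # steps 5-6: sum of (x - x-bar)^2
    s = math.sqrt(ss / (n - 1))                      # sample standard deviation
    return {"mean": mean, "average_deviation": avg_dev,
            "sum_of_squares": ss, "std_dev": s}

print(dispersion_summary([25.1, 33.0, 28.4, 36.2, 30.0, 27.5, 34.1, 29.3, 31.9, 30.8]))
```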
This document provides materials for a training module on basic statistics, including a session plan, overhead masters, handouts, and main text. The module introduces key concepts such as frequency distributions, accuracy and precision, and propagation of errors. Participants will learn to construct histograms, calculate descriptive statistics, and understand the difference between random and systematic errors. The goal is for participants to be able to evaluate the accuracy of chemical data using basic statistical analysis.
This document provides guidance on how to carry out secondary validation of discharge data. It discusses validating a single station's data against limits and expected behavior through graphical inspection. It also describes validating multiple stations by comparing their time series plots and residual series, as well as comparing streamflow and rainfall data. The overall goal of secondary validation is to identify potential errors or anomalies in discharge data for further investigation and correction.
This document provides guidance on analyzing discharge data. It discusses computing basic statistics, constructing flow duration curves to analyze variability, fitting theoretical frequency distributions, and other time series analysis techniques like moving averages and mass curves. The main text provides detailed explanations of these methods and their uses in hydrological analysis, data validation, and reporting. It is intended to train hydrologists and data managers on effectively analyzing discharge data.
This document outlines Daren Harmel's work monitoring watersheds and modeling water quality. It discusses developing guidance for sampling small watersheds, quantifying uncertainty in water quality data, expanding a database of agricultural management practices and their impacts on nitrogen and phosphorus runoff, challenges modeling E. coli transport, and how monitoring and modeling can best be used together to inform water quality decision making. The goal is to provide practical guidance on watershed monitoring, account for uncertainty in data, and leverage all available data and models to evaluate management impacts and alternative practices. Both monitoring and modeling are necessary for water quality decisions, and an adaptive approach is recommended that uses measured data, models, and stakeholder input to identify sources, estimate contributions
This document describes procedures for conducting an inter-laboratory analytical quality control (AQC) exercise. It outlines how to plan the exercise by selecting a coordinating laboratory and test samples, determine reference values, and analyze results. Laboratories' performance is evaluated using Youden plots to identify systematic errors. The coordinating laboratory collates the data, identifies errors, and provides feedback to participating laboratories to improve quality. The goal is to enhance quality control through interaction between laboratories and evaluation of their facilities and testing capabilities.
Statistics is a critical tool for robustness analysis, measurement system error analysis, test data analysis, probabilistic risk assessment, and many other fields in the engineering world. These basic applications relate directly to fundamental engineering problems, helping us solve them, arrive at better solutions, and work more efficiently on real-life engineering problems.
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation (Thomas Ploetz)
Tutorial @Ubicomp 2015: Bridging the Gap -- Machine Learning for Ubiquitous Computing (evaluation session).
A tutorial on promises and pitfalls of Machine Learning for Ubicomp (and Human Computer Interaction). From Practitioners for Practitioners.
Presenter: Nils Hammerla <n.hammerla@gmail.com>
Video recording of the talks as they were held at Ubicomp:
https://youtu.be/LgnnlqOIXJc?list=PLh96aGaacSgXw0MyktFqmgijLHN-aQvdq
Study on Application of Ensemble learning on Credit Scoring (harmonylab)
This document discusses using ensemble learning techniques for credit scoring applications. It begins by providing background on the growing need for effective credit scoring models due to the expansion of financing opportunities. It then discusses some common problems with credit scoring models, including a lack of understandability, multiple evaluation metrics, and imbalanced data. The purpose is to build a better credit scoring model using machine learning, specifically ensemble learning with XGBoost. Two proposed methods are described: 1) using EasyEnsemble resampling prior to XGBoost to address imbalanced data, and 2) customizing XGBoost by changing the evaluation metric to directly optimize cost or using a focal loss function. Experiments on several credit scoring datasets show these approaches can reduce costs without reducing accuracy
Machine Learning part 3 - Introduction to data science (Frank Kienle)
Lecture: Introduction to Data Science
given 2017 at Technical University of Kaiserslautern, Germany
Topic: part 3 machine learning, link to data science practice
This document summarizes an introduction to deep learning with MXNet and R. It discusses MXNet, an open source deep learning framework, and how to use it with R. It then provides an example of using MXNet and R to build a deep learning model to predict heart disease by analyzing MRI images. Specifically, it discusses loading MRI data, architecting a convolutional neural network model, training the model, and evaluating predictions against actual heart volume measurements. The document concludes by discussing additional ways the model could be explored and improved.
Researc-paper_Project Work Phase-1 PPT (21CS09).pptx (AdityaKumar993506)
This project aims to develop a machine learning model for early stage diabetes prediction. The group will implement classification algorithms like KNN, logistic regression, decision trees, and random forests on a diabetes dataset. The inputs to the model will be patient attributes and the output will predict their risk of having diabetes. Currently, the group is working on data preprocessing, model selection, and algorithm implementation to accurately classify patients' diabetes risk. The goal is to create an effective predictive tool to help clinicians diagnose and treat diabetes early.
Shaun Douglas Smith's graduate studies focused on operations research, industrial engineering, statistics, and quality engineering. His coursework covered topics like optimization theory, simulation, stochastic processes, experiment design, statistical modeling, production and inventory control, and quality engineering. His final project applied an algorithm to solve a bin packing problem. Overall, his program equipped him with technical and analytical skills for improving production systems and decision-making.
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
This document provides guidance on validating a rating curve using graphical and numerical tests. It describes plotting new gauging data against the existing rating curve and confidence limits to check fit. Graphical tests illustrated include period-flow deviation scattergrams, stage-flow deviation diagrams, and cumulative deviation plots to identify bias over time. Numerical tests like Student's t-test are also described to statistically compare new gaugings to the original dataset and determine if the rating curve should be recalculated. The document provides an example validation analysis for a station in Pargaon.
The document discusses key concepts in quantitative research methods and data analytics covered in a university course. It outlines the course content, which includes topics like data visualization, the normal distribution, and hypothesis testing. It then details the course assessments, which include a mid-term assignment and final coursework report worth 30% and 70% respectively. The final report involves selecting a topic, collecting and analyzing data using R Studio, and reporting the results in a 2000 word paper with sections on introduction, data, results, and conclusion.
Customer churn prediction model using inferential machine learning methods, Phương ... (NguyenThiNgocAnh9)
1) The document discusses methods for predicting customer churn for a bank using machine learning, including decision trees, artificial neural networks, and online variational inference for multivariate Gaussian distributions.
2) These techniques were applied to a Vietnamese dataset containing over 2,000 customer observations and 30 features to classify customers as churned or non-churned.
3) The results showed that decision trees achieved 99.78% accuracy while artificial neural networks achieved 94.26% accuracy according to cross-validation. Online variational inference for multivariate Gaussian distributions was also tested.
Eugm 2011 mehta - adaptive designs for phase 3 oncology trials (Cytel USA)
This document discusses adaptive designs for phase 3 oncology trials. It uses the VALOR trial as a case study to illustrate a sponsor's dilemma in designing a trial with limited prior data. It proposes a promising zone design that allows staged investment - an initial modest sample size with the option to increase size and power if interim results are promising. Simulations show this two-stage investment approach increases power over a non-adaptive design while managing risks for sponsors. The document also discusses extensions to population enrichment designs.
Similar to Download-manuals-surface water-software-48appliedstatistics (20)
The document summarizes a water purification system that includes a reverse osmosis unit, storage tank, and ultrapure water purification unit. The system uses a microprocessor to control purification from a municipal water supply to produce at least 35 liters/hour of water with a resistivity of 18 megohm cm or higher that is at least 99% free of bacteria, ions, organics, and particles. The system is also capable of continuous monitoring and will automatically shut off if the feed water is inadequate or the storage tank reaches capacity.
The document describes a water purifier using ion exchange resin columns that produces reagent-grade water for trace analysis. It has separate cation and anion exchange columns with a capacity of 25 liters each and can purify water at a rate over 1 liter per minute. Accessories include spare columns, instructions, a cover, and vacuum hose. Key features are an inline conductivity monitor and the ability to regenerate the resin columns.
This document provides specifications for a double distillation water purification unit. It distills water to a conductivity of 1.0 μS/cm or less at 25°C for generating reagent-grade water. The unit is made of quartz glass with a 1.5 liter/hour capacity. It runs on 220V power and includes a metallic stand, ring clamp, and operation manual. Safety features include over-heating protection and indicators.
This document provides specifications for an automatic water distillation unit. The unit distills water to generate reagent water type III with a maximum conductivity of 0.01 mS/cm at 25°C for use in washing and quantitative analysis. It has a stainless steel construction, 1.5 liter per hour capacity, operates on 220VAC power, and has automatic shutoff when water levels are low.
This document specifies a water geyser (heater) that has a 120 liter capacity, heats water to 90°C with thermostatic control, and requires a 220 VAC ±25%, 47 to 53 Hz power supply for operation in a laboratory setting.
A general purpose water bath has an inner size of approximately 0.4 x 0.3 x 0.1 meters, is made of stainless steel inside and stove enameled outside, and can maintain temperatures from ambient to 100°C with an accuracy of ±2°C. It has a stainless steel cover with 12 holes and features a drain cock or plug, double-walled insulation, and a pilot lamp to indicate thermostat operation.
This document describes the specifications of a water bath used for incubating culture tubes in coliform analysis. The water bath has an inner size of approximately 0.4 x 0.3 x 0.15 meters, is made of stainless steel inside and stove enameled outside, and can maintain temperatures from ambient to 50°C with an accuracy of ±0.1°C. It has a double wall for insulation, a dome lid to hold test tubes vertically, and displays the internal water temperature.
The document specifies the requirements for a wash bottle used to flush glassware. The wash bottle must be made of polythene, hold 500 ml of liquid, and have a bent nozzle and screw cap. It was last reviewed on October 23, 2007 and is used to wash away any sticking sediment from glassware.
The document specifies volumetric flasks that will be used for sediment analyses in a laboratory. The flasks must comply with IS 915-1975, be made of Corning glass or similar material, and come in sizes of 50, 100, 250, and 500 ml with B class accuracy.
The visual accumulation tube is used to assess particle sizes of sediments. It consists of a vertical transparent settling tube through which sediment samples are passed. Particles settle individually based on terminal velocity related to diameter, and accumulated sediment volume in the tube's narrow end relates to solid weight. It is best for uniform, spherical particles. The analysis involves introducing a small sample and recording accumulated sediment height over time.
This document provides specifications for a vacuum pump for general laboratory use. The pump is a single stage pump with a capacity of 50 l/min, capable of reaching a final vacuum of 0.05 mm Hg without ballast or 2 mm Hg with ballast. It has a 200W motor that operates on 220VAC at 47-53 Hz. Accessories include a filter, regulator, gauge, hose, and valve. The pump is designed for noise free operation.
This document provides specifications for a turbidity meter used to directly measure suspended matter in water samples. The turbidity meter has a range of 0 to 1000 NTU in at least 2 ranges, an accuracy of ±2% full scale deflection, and requires a power supply of 220 VAC ±25%, 47 to 53 Hz. Accessories for the turbidity meter include an ambient light shield, 6 spare tubes, a sensor stand, voltage stabilizer, instruction manual, and dust cover.
This tool kit contains basic tools for minor repairs of electrical laboratory equipment, including a set of screwdrivers, pliers, soldering iron, and multi-meter. The tools come in a lockable storage box for organization and security.
The document specifies the requirements for a microprocessor-controlled TOC analyzer. It must be able to directly measure total carbon, total organic carbon, and purgeable organic carbon in water samples. It uses high temperature catalytic combustion up to 900°C and non-dispersive infrared detection. It must have a measuring range of 1-500 mg/L carbon, precision within 3%, and detection limit of at least 500 ppb carbon. The analyzer and auto-sampler require 240V 50Hz power and it uses nitrogen or high purity air as carrier gases. It includes an auto-sampler, syringes, printer, manuals, spare parts, and application software in English.
This document provides specifications for a tissue grinder used to prepare tissue or sediment samples. The grinder must be able to macerate glass fibre filters and is a manual, porcelain device used to homogenize samples for further analysis such as chemical extraction.
This document specifies a set of thermometers for laboratory use, including mercury-filled glass thermometers with three temperature ranges (0-80oC, 10-150oC, 20-250oC) and an accuracy of ±0.5oC, along with a storage box for the thermometers.
This document specifies test tubes for laboratory use in sediment analysis. It outlines that the test tubes should be made of Corning glass or similar material, and lists two standard sizes - 15 x 125 mm and 25 x 200 mm in diameter and height. The test tubes are intended for general use in analyzing sediments in the laboratory.
The document specifies requirements for test sieves with a shaker. It requires stainless steel sieves that are 200mm in diameter, 50mm in height, and have nominal aperture sizes between 63-250 micrometers. It also requires a shaker capable of holding at least 5 sieves that runs on 220V power. Accessories include a sieve brush and wash bottle. Some sieves can be used manually without the shaker.
This document provides specifications for a magnetic stirrer with a hot plate. It can rotate between 0-1200 rpm and heat with a 300 Watt thermostatically controlled element. The stirrer has a stainless steel top and comes with PTFE coated magnets ranging from 10-50mm in 5mm increments, with two of each size included as accessories.
This document provides specifications for a sterilizer autoclave. The autoclave is 0.3 x 0.5 meters in size, made of stainless steel inside and lid with an enameled outside. It operates at a working pressure of 1 bar with a maximum pressure of 1.5 bars. Accessories include a pressure gauge, steam release cock, safety valve, perforated aluminum basket, water level indicator, and lifting arrangement. The autoclave is powered by 220 VAC at 47-53 Hz and uses approximately 2500 Watts.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Full-RAG: A modern architecture for hyper-personalization (Zilliz)
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
World Bank & Government of The Netherlands funded
Training module # WQ - 48
Applied Statistics
New Delhi, September 2000
CSMRS Building, 4th Floor, Olof Palme Marg, Hauz Khas,
New Delhi – 11 00 16 India
Tel: 68 61 681 / 84 Fax: (+ 91 11) 68 61 685
E-Mail: dhvdelft@del2.vsnl.net.in
DHV Consultants BV & DELFT HYDRAULICS
with
HALCROW, TAHAL, CES, ORG & JPS
Table of contents
1. Module context
2. Module profile
3. Session plan
4. Overhead/flipchart master
5. Evaluation sheets
6. Handout
7. Additional handout
8. Main text
1. Module context
This module discusses statistical procedures commonly used by chemists to evaluate the precision and accuracy of results of analyses. Prior training in 'Basic Statistics', module no. 47, or equivalent is necessary to complete this module satisfactorily.
When designing a training course, the relationship between this module and the others should be maintained by keeping them close together in the syllabus and placing them in a logical sequence. The actual selection of topics and the depth of training will, of course, depend on the training needs of the participants, i.e. their knowledge level and skills performance at the start of the course.
2. Module profile
Title: Applied Statistics
Target group: HIS function(s): Q2, Q3, Q5, Q6
Duration: one session of 90 min
Objectives: After the training the participants will be able to:
• Apply common statistical tests for evaluation of the precision of data
Key concepts:
• confidence interval
• regression analysis
Training methods: Lecture, exercises, OHS
Training tools required: Board, flipchart
Handouts: As provided in this module
Further reading and references:
• 'Analytical Chemistry', Douglas A. Skoog and Donald M. West, Saunders College Publishing, 1986 (Chapter 3)
• 'Statistical Procedures for Analysis of Environmental Monitoring Data and Risk Assessment', Edward A. McBean and Frank A. Rovers, Prentice Hall, 1998
3. Session plan
1. Preparations
2. Introduction: need for statistical procedures; scope of the module (10 min, OHS)
3. Confidence interval: definition; significance of the standard deviation and its correct estimation; when the population SD is known; when the population SD is not known (30 min, OHS)
4. Rejection of data: caution in rejection; procedures (20 min, OHS)
5. Regression analysis (20 min, OHS)
6. Wrap up (10 min)
4. Overhead/flipchart master
OHS format guidelines
Headings: style OHS-Title; Arial 30-36, with a bottom border line (not underline)
Text: styles OHS-lev1, OHS-lev2; Arial 24-26, maximum two levels
Case: sentence case; avoid full text in UPPERCASE
Italics: use occasionally and in a consistent way
Listings: styles OHS-lev1, OHS-lev1-Numbered; big bullets; numbers for a definite series of steps; avoid roman numerals and letters
Colours: none, as these get lost in photocopying and some colours do not reproduce at all
Formulas/Equations: style OHS-Equation; using a table will ease horizontal alignment over more lines (columns); use the equation editor for advanced formatting only
Applied Statistics
• Confidence intervals
• Number of replicate observations
• Rejection of data
• Regression analysis
Confidence Intervals (1)
• Population mean µ is always unknown
• Sample mean x
- only for a large number of replicates does x → µ
• Estimation of limits for µ from x
(x – a) < µ < (x + a)
• Value of a depends on
- confidence level (probability of occurrence)
- standard deviation
• The difference between the limits is the confidence interval
Confidence Intervals
Figure 1. Normal distribution curve and areas under curve for different distances
Confidence Intervals (2)
• When s is a good estimate of σ
- for a single observation, (x – zσ) < µ < (x + zσ)
Confidence level, % 50 90 95 99 99.7
z 0.67 1.64 1.96 2.58 3.00
- as z (and hence the interval) increases, the confidence level also increases
Example (1)
• Mercury concentration in fish was determined as 1.8 µg/kg.
Calculate limits for µ for 95% confidence level if σ = 0.1 µg/kg
- For 95% confidence level z = 1.96
- Limits for mean
1.8 – 1.96 x 0.1 < µ < 1.8 + 1.96 x 0.1
1.6 < µ < 2.0
• What are the limits for 99.7% confidence level?
Confidence Intervals (3)
• Confidence interval when the mean, x, of n replicates is available:
x – zσ/√n < µ < x + zσ/√n
• The confidence interval is reduced compared to a single measurement as n increases
n: 1 2 3 4 9 16
√n: 1 1.4 1.7 2 3 4
• Analyse 2 to 4 replicates; more replicates give diminishing returns
Example (2)
• Average mercury conc. from 3 replicates was 1.67 µg/kg
• Calculate limits for 95% confidence level, σ = 0.1 µg/kg
1.67 – 1.96 x 0.1/√3 < µ < 1.67 + 1.96 x 0.1/√3
1.56 < µ < 1.78
Example (3)
• How many replicate analyses will be required to decrease the 95% confidence interval to ± 0.07, σ = 0.1?
Interval = ± zσ/√n
± 0.07 = ± 1.96 x 0.1/√n
n = 7.8
• 8 measurements will give a slightly better than 95% chance for µ to be within x ± 0.07
Confidence Intervals (4)
• When σ is unknown
• Sample s based on n measurements is available:
x – ts/√n < µ < x + ts/√n
Degrees of freedom 1 4 10 ∞
t95 12.7 2.78 2.23 1.96
t99 63.7 4.60 3.17 2.58
• Note that t → z as the degrees of freedom increase to ∞
Example (4)
• Groundwater samples were analysed for TOC: n = 5, x = 11.7 mg/L, s = 3.2 mg/L
• Find the 95% confidence limits for the true mean
- degrees of freedom = n – 1 = 5 – 1 = 4
- t95 = 2.78 (from table)
11.7 – 2.78 x 3.2/√5 < µ < 11.7 + 2.78 x 3.2/√5
7.7 < µ < 15.7
Detection of Outliers
• Extreme values, very high or low, will influence calculated statistics
• Rejection needs a sound basis
- the suspect value may be due to an unrecorded event
- or to errors in transcription, reading of instruments, inconsistent methodology, etc.
- otherwise apply statistical yardsticks
Q Test
• Qexp = (difference between suspect value and nearest neighbour) / (spread of entire set)
• Reject if Qexp > Qcrit
Total no. of observations 3 5 7 10
Qcrit 90% 0.94 0.64 0.51 0.41
Qcrit 96% 0.98 0.73 0.59 0.48
• Other criteria are also available
Regression Analysis
• Quantification of relationships between variables
- used for predictions
- calibration of instruments
• Least-square method
- deviations due to random errors
y = a + bx
- linear relationship
5. Evaluation sheets
6. Handout
Applied Statistics
• Confidence intervals
• Number of replicate observations
• Rejection of data
• Regression analysis
Confidence Intervals (1)
• Population mean µ is always unknown
• Sample mean x
- only for a large number of replicates does x → µ
• Estimation of limits for µ from x
(x – a) < µ < (x + a)
• Value of a depends on
- confidence level (probability of occurrence)
- standard deviation
• The difference between the limits is the confidence interval
Confidence Intervals
Figure 1. Normal distribution curve and areas under curve for different distances
Confidence Intervals (2)
• When s is a good estimate of σ
- for a single observation, (x – zσ) < µ < (x + zσ)
Confidence level, % 50 90 95 99 99.7
z 0.67 1.64 1.96 2.58 3.00
- as z (and hence the interval) increases, the confidence level also increases
Example (1)
• Mercury concentration in fish was determined as 1.8 µg/kg.
• Calculate limits for µ for 95% confidence level if σ = 0.1 µg/kg
- For 95% confidence level z = 1.96
- Limits for mean
1.8 – 1.96 x 0.1 < µ < 1.8 + 1.96 x 0.1
1.6 < µ < 2.0
• What are the limits for 99.7% confidence level?
Confidence Intervals (3)
• Confidence interval when the mean, x, of n replicates is available:
x – zσ/√n < µ < x + zσ/√n
• The confidence interval is reduced compared to a single measurement as n increases
n: 1 2 3 4 9 16
√n: 1 1.4 1.7 2 3 4
• Analyse 2 to 4 replicates; more replicates give diminishing returns
Example (2)
• Average mercury conc. from 3 replicates was 1.67 µg/kg
• Calculate limits for 95% confidence level, σ = 0.1 µg/kg
1.67 – 1.96 x 0.1/√3 < µ < 1.67 + 1.96 x 0.1/√3
1.56 < µ < 1.78
Example (3)
• How many replicate analyses will be required to decrease the 95% confidence interval to ± 0.07, σ = 0.1?
Interval = ± zσ/√n
± 0.07 = ± 1.96 x 0.1/√n
n = 7.8
• 8 measurements will give a slightly better than 95% chance for µ to be within x ± 0.07
Confidence Intervals (4)
• When σ is unknown
• Sample s based on n measurements is available:
x – ts/√n < µ < x + ts/√n
Degrees of freedom 1 4 10 ∞
t95 12.7 2.78 2.23 1.96
t99 63.7 4.60 3.17 2.58
• Note that t → z as the degrees of freedom increase to ∞
Example (4)
• Groundwater samples were analysed for TOC: n = 5, x = 11.7 mg/L, s = 3.2 mg/L
• Find the 95% confidence limits for the true mean
- degrees of freedom = n – 1 = 5 – 1 = 4
- t95 = 2.78
11.7 – 2.78 x 3.2/√5 < µ < 11.7 + 2.78 x 3.2/√5
7.7 < µ < 15.7
Detection of Outliers
• Extreme values, very high or low, will influence calculated statistics
• Rejection needs a sound basis
- the suspect value may be due to an unrecorded event
- or to errors in transcription, reading of instruments, inconsistent methodology, etc.
- otherwise apply statistical yardsticks
Q Test
• Qexp = (difference between suspect value and nearest neighbour) / (spread of entire set)
• Reject if Qexp > Qcrit
Total no. of observations 3 5 7 10
Qcrit 90% 0.94 0.64 0.51 0.41
Qcrit 96% 0.98 0.73 0.59 0.48
• Other criteria are also available
Regression Analysis
• Quantification of relationships between variables
- used for predictions
- calibration of instruments
• Least-square method
- deviations due to random errors
y = a + bx
- linear relationship
Add a copy of the main text in chapter 8 for all participants.
7. Additional handout
These handouts are distributed during delivery and contain test questions, answers to questions, special worksheets, optional information, and other material that should not appear in the regular handouts.
It is a good practice to pre-punch these additional handouts, so the participants can easily
insert them in the main handout folder.
8. Main text
Contents
1. Confidence Intervals
2. Detection of Data Outliers
3. Regression Analysis
Applied Statistics
For small data sets, there is always the question as to how well the data represent the
population and what useful information can be inferred regarding the population
characteristics. Statistical calculations are used for this purpose. This module deals with four
such applications:
1. Definition of an interval around the mean of a data set within which the true mean can be expected to be found with a particular degree of probability.
2. Determination of the number of measurements needed to obtain the true mean within a
predetermined interval around the experimental mean with a particular degree of
probability.
3. Determining if an outlying value in a set of replicate measurements should be retained in
calculating the mean for the set.
4. The fitting of a straight line to a set of experimental points.
1. Confidence Intervals
The true mean of population, µ, is always unknown. However, limits can be set about the
experimentally determined mean, x, within which the true mean may be expected to occur
with a given degree of probability. These bounds are called confidence limits and the interval
between these limits, the confidence interval.
For a given set of data, if the confidence interval about the mean is large, the probability of an observation falling within the limits is also large. On the other hand, when the limits are set close to the mean, the probability that an observed value falls within them becomes smaller; in other words, a larger fraction of observations is expected to fall outside the limits. Usually confidence intervals are calculated such that 95% of the observations are likely to fall within the limits.
For a given probability of occurrence, the confidence limits depend on the value of the
standard deviation, s, of the observed data and the certainty with which this quantity can be
taken to represent the true standard deviation, σ, of the population.
1.1 Confidence limits where σ is known
When the number of repetitive observations is 20 or more, the standard deviation of the
observed set of data, s, can be taken to be a good approximation of the standard deviation
of the population, σ.
However, it may not always be possible to perform such a large number of repetitive
analyses, particularly when costly and time consuming extraction and analytical procedures
are involved. In such cases, data from previous records or different laboratories may be
pooled, provided that identical precautions and analytical steps are followed in each case.
Further, care should be taken that the medium sampled is also similar, for example, analysis
results of groundwater samples having high TDS should not be pooled with the results of
surface waters, which usually have low TDS.
As discussed in the previous module, most randomly varying data can be approximated by a normal distribution curve. For a normal distribution, 68% of the observations lie between the limits µ ± σ, 96% between the limits µ ± 2σ, 99.7% between the limits µ ± 3σ, etc. (Figure 1). For a single observation, xi, the confidence limits for the population mean are given by:
confidence limit for µ = xi ± zσ (1)
where z assumes different values depending upon the desired confidence level, as given in Table 1.
Figure 1. Normal distribution curve and areas under curve for different distances
Table 1: Values of z for various confidence levels
Confidence level, % z
50 0.67
68 1.00
80 1.29
90 1.64
95 1.96
96 2.00
99 2.58
99.7 3.00
99.9 3.29
Example 1:
Mercury concentration in the sample of a fish was determined to be 1.80 µg/kg. Calculate
the 50% and 95% confidence limits for this observation. Based on previous analysis records,
it is known that the standard deviation of such observations, following similar analysis
procedures, is 0.1 µg/kg and it closely represents the population standard deviation, σ.
From Table 1, it is seen that z = 0.67 and 1.96 for the two confidence limits in question.
Upon substitution in Equation 1, we find that
50% confidence limit = 1.80 ± 0.67 x 0.1 = 1.8 ± 0.07 µg/kg
95% confidence limit = 1.80 ± 1.96 x 0.1 = 1.8 ± 0.2 µg/kg
Therefore, if 100 replicate analyses are made, the results of 50 analyses are expected to lie between the limits 1.73 and 1.87, and 95 results are expected to lie within the wider limits of 1.6 and 2.0.
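For readers who want to check such calculations numerically, a minimal Python sketch of Equation 1 is given below. It is illustrative only and not part of the original module; the z-values are simply those of Table 1, and the helper name limits_single is an assumption made here.

# Confidence limits for a single observation when sigma is known (Equation 1),
# evaluated for the Example 1 values. Illustrative sketch only.
z_values = {50: 0.67, 90: 1.64, 95: 1.96, 99: 2.58, 99.7: 3.00}  # from Table 1

def limits_single(x, sigma, level):
    """Return (lower, upper) confidence limits for the population mean."""
    z = z_values[level]
    return x - z * sigma, x + z * sigma

print(limits_single(1.80, 0.1, 50))   # ~ (1.73, 1.87), i.e. 1.8 +/- 0.07 ug/kg
print(limits_single(1.80, 0.1, 95))   # ~ (1.60, 2.00), i.e. 1.8 +/- 0.2 ug/kg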
Equation 1 applies to the result of a single measurement. When a number of observations, n, are made and the average of the replicates is taken, the confidence interval decreases. In that case the limits are given by:
confidence limit for µ = x ± zσ/ √n (2)
confidence limit for µ = x ± zσ/ √n (2)
Example 2:
Calculate the confidence limits for the problem of Example 1, if three samples of the fish
were analysed yielding an average of 1.67 µg/kg Hg.
Substitution in Equation 2 gives:
50% confidence limit = 1.67 ± 0.67 x 0.1/ √3 = 1.67 ± 0.04 µg/kg
95% confidence limit = 1.67 ± 1.96 x 0.1/ √3 = 1.67 ± 0.11 µg/kg
For the same odds the confidence limits are now substantially smaller and the result can be
said to be more accurate and probably more useful.
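As a numerical check, the sketch below applies Equation 2 to the Example 2 values. It is illustrative only; the function name limits_mean is an assumption made here, not part of the module.

from math import sqrt

def limits_mean(xbar, sigma, n, z):
    """Confidence limits for the mean of n replicates (Equation 2)."""
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

print(limits_mean(1.67, 0.1, 3, 0.67))  # 50% level: ~1.67 +/- 0.04 ug/kg
print(limits_mean(1.67, 0.1, 3, 1.96))  # 95% level: ~1.67 +/- 0.11 ug/kg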
Note that Equation 2 indicates that the confidence interval can be halved by increasing the
number of analyses to 4 (√4 = 2) compared to a single observation. Increasing the number
of measurements beyond 4 does not decrease the confidence interval proportionately. To narrow the interval to one fourth of that for a single observation, 16 measurements would be required (√16 = 4), giving diminishing returns. Consequently, 2 to 4 replicate measurements are made in most cases.
Equation 2 can also be used to find the number of replicate measurements required such
that with a given probability the true mean would be found within a predetermined interval.
This is illustrated in Example 3.
Example 3:
How many replicate measurements of the specimen in Example 1 would be needed to
decrease the 95% confidence interval to ±0.07.
Substituting for confidence interval in Equation 2:
0.07 = 1.96 x 0.1/ √n
√n = 1.96 x 0.1/0.07, n = 7.8
Thus, 8 measurements will provide a slightly better than 95% chance of the true mean lying
within ±0.07 of the experimental mean.
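The same rearrangement of Equation 2 can be scripted; the sketch below reproduces Example 3. It is illustrative only, and the helper name replicates_needed is an assumption made here.

from math import ceil

def replicates_needed(z, sigma, half_interval):
    """Number of replicates required for a target half-interval (rearranged Equation 2)."""
    n = (z * sigma / half_interval) ** 2
    return n, ceil(n)

print(replicates_needed(1.96, 0.1, 0.07))  # ~ (7.84, 8): take 8 measurements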
1.2 Confidence limits where σ is unknown
When the number of individual measurements in a set of data is small, the reproducibility of the calculated value of the standard deviation, s, is decreased. Therefore, for a given
probability, the confidence interval must be larger under these circumstances.
To account for the potential variability in s, the confidence limits are calculated using the
statistical parameter t:
confidence limit for µ = x ± ts/ √n (3)
In contrast to z in Equation 2, t depends not only on the desired confidence level, but also
upon the number of degrees of freedom available in the calculation of s. Table 2 provides
values for t for various degrees of freedom and confidence levels. Note that the values of t
become equal to those for z (Table 1) as the number of degrees of freedom becomes
infinite.
Table 2: Values of t for various confidence levels and degrees of freedom
Degrees of freedom   90%    95%    99%
1     6.31   12.7   63.7
2     2.92   4.30   9.92
3     2.35   3.18   5.84
4     2.13   2.78   4.60
5     2.02   2.57   4.03
6     1.94   2.45   3.71
8     1.86   2.31   3.36
10    1.81   2.23   3.17
12    1.78   2.18   3.06
13    1.77   2.16   3.01
14    1.76   2.14   2.98
∞     1.64   1.96   2.58
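The convergence of t towards z noted above can also be verified numerically. The sketch below is illustrative only and assumes the SciPy library is available; it is not part of the original module.

from scipy import stats

# Two-sided 95% critical values of t for increasing degrees of freedom.
for df in (1, 4, 10, 30, 1000):
    print(df, round(stats.t.ppf(0.975, df), 2))   # ~12.71, 2.78, 2.23, 2.04, 1.96
print("z:", round(stats.norm.ppf(0.975), 2))      # ~1.96, the Table 1 value for 95%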
Example 4:
A chemist obtained the following data for the concentration of total organic carbon (TOC) in
groundwater samples: 13.55, 6.39, 13.81, 11.20, 13.88 mg/L. Calculate the 95% confidence
limits for the mean of the data.
Thus, from the given data, n = 5, x = 11.77 and s = 3.2. For degrees of freedom = 5 – 1 = 4
and 95% confidence limit, t from Table 2 = 2.78. Therefore, from Equation 3:
confidence limit for µ = x ± ts/ √n = 11.77 ± 2.78 x 3.2/ √5
= 11.77 ± 3.97
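A short Python check of Example 4 follows; it is illustrative only, and the t value of 2.78 is taken from Table 2 rather than computed.

from math import sqrt
from statistics import mean, stdev

toc = [13.55, 6.39, 13.81, 11.20, 13.88]       # mg/L, data of Example 4
xbar, s, n = mean(toc), stdev(toc), len(toc)
t = 2.78                                       # 95% confidence, 4 degrees of freedom
half = t * s / sqrt(n)
print(round(xbar, 2), round(s, 2), round(half, 2))    # ~11.77, ~3.20, ~3.98
print(round(xbar - half, 2), round(xbar + half, 2))   # ~7.78 < mu < ~15.75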
2. Detection of Data Outliers
Data outliers are extreme (high or low) values that diverge widely from the main body of the
data set. The presence of one or more outliers may greatly influence any calculated statistics
and yield biased results. However, there is also the possibility that the outlier is a legitimate
member of the data set. Outlier detection tests determine whether there is sufficient statistical evidence to conclude that an apparently extreme observation does not belong to the data set and should be rejected.
Data outliers may result from faulty instruments, error in transcription, misreading of
instruments, inconsistent methodology of sampling and analysis and so on. These aspects
should be investigated, and if any such reason can be attributed to an outlier, the value may be safely deleted from consideration. However, it should be kept in mind that the suspect value may rightfully belong to the set and may be the consequence of an unrecorded event, such as a short rainfall, intrusion of sea water or a spill.
Table 3: Critical values for the rejection quotient (reject if Qexp > Qcrit)
Number of observations   Qcrit, 90% confidence   Qcrit, 96% confidence
3    0.94   0.98
4    0.76   0.85
5    0.64   0.73
6    0.56   0.64
7    0.51   0.59
8    0.47   0.54
9    0.44   0.51
10   0.41   0.48
Of the numerous statistical criteria available for detection of outliers, the Q test, which is
commonly used, will be discussed here. To apply the Q test, the difference between the
questionable result and its closest neighbour is divided by the spread of the entire set. The
resulting ratio Qexp is compared with the rejection values Qcrit given in Table 3, which are critical for a particular degree of confidence. If Qexp is larger, a statistical basis for rejection
exists. The table shows only some selected values. Any standard statistical analysis book
may be consulted for a complete set.
Example 5:
Fluoride concentrations in a well were measured as 2.77, 2.80, 2.90, 2.92, 3.45, 3.95, 4.44, 4.61, 5.21, 7.46. Use the Q test to examine whether the highest value is an outlier.
Qexp = (7.46 – 5.21) / (7.46 – 2.77) = 2.25 / 4.69 = 0.48
Since 0.48 is larger than 0.41, the Qcrit value for 90% confidence, there is a statistical basis for excluding the value (at the 96% level Qexp just equals Qcrit of 0.48, so the evidence there is borderline).
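The Q test is easy to script. The sketch below is illustrative only (the helper name q_exp is an assumption made here); it checks the largest value of the fluoride data against its nearest neighbour, as in Example 5.

def q_exp(values):
    """Q statistic for the largest value of the data set (Q test described above)."""
    data = sorted(values)
    spread = data[-1] - data[0]
    gap = data[-1] - data[-2]        # difference from the nearest neighbour
    return gap / spread

fluoride = [2.77, 2.80, 2.90, 2.92, 3.45, 3.95, 4.44, 4.61, 5.21, 7.46]
print(round(q_exp(fluoride), 2))     # ~0.48; compare with Qcrit for 10 observations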
Other criteria that can be used to evaluate an apparent outlier are:
- Plotting a scatter diagram indicating the inter-parameter relationship between two
constituents. If there is a correlation between the two constituents, an outlier would lie a
significant distance from the general trend.
- Applying a test for normal distribution by calculating the mean and standard deviation of all the data and determining whether the extreme value lies outside the limits mean ± 3 times the standard deviation. If so, it is indeed an unusual value. (A cumulative normal probability plot may also be used for this test.)
- When the data set is not large, calculating and comparing the standard deviation, with
and without the suspect observations. A suspect value which has a considerable
influence on the calculated standard deviation suggests an outlier.
3. Regression Analysis
In assessing environmental quality, it is often of interest to quantify the relationship between two or more variables. This may allow filling in of missing data for one constituent and may also help in predicting future levels of a constituent.
Regression analysis is focused on determining the degree to which one or more dependent variables depend on another, independent, variable. Regression is thus a means of calibrating the coefficients of a predictive equation. In correlation analysis, by contrast, neither of the variables is identified as more important than the other; correlation is not causation, and the correlation coefficient only provides a measure of the goodness of fit.
A common example is the calibration step of most analytical procedures, where a 'best' straight line is fitted to the observed response of the detector system when known amounts of analyte (standards) are analysed. This section discusses the regression analysis procedure employed for this purpose.
The least-squares method is the most straightforward regression procedure. Application of the method requires two assumptions: that a linear relationship exists between the amount of analyte (x) and the magnitude of the measured response (y), and that any deviation of individual points from the straight line is entirely the consequence of indeterminate error, that is, no significant error exists in the composition of the standards.
The line generated is of the form:
y = a + bx (4)
where a is the value of y when x is zero (the intercept) and b is the slope. The method
minimises the squares of the vertical displacements of data points from the best fit line.
For convenience, the following three quantities are defined:
Sxx = ∑(xi – x)² = ∑xi² – (∑xi)²/n
Syy = ∑(yi – y)² = ∑yi² – (∑yi)²/n
Sxy = ∑(xi – x)(yi – y) = ∑xiyi – (∑xi)(∑yi)/n
Calculating these quantities permits the determination of the following:
1. The slope of the line, b:
b = Sxy/Sxx (5)
2. The intercept, a:
a = y – bx (6)
3. The standard deviation about the regression line, sr:
sr = √{(Syy – b²Sxx)/(n – 2)} (7)
4. The standard deviation of the slope, sb:
sb = √(sr²/Sxx) (8)
5. The standard deviation of results obtained from the calibration curve, sc:
sc = (sr/b) √{(1/m) + (1/n) + (yc – y)²/(b²Sxx)} (9)
where yc is the mean of m replicate measurements made using the calibration curve.
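The quantities above translate directly into a few lines of Python. The sketch below is illustrative only (the function name least_squares and its return structure are assumptions made here); it computes the intercept, slope and the standard deviations of Equations 5 to 8, and it is applied to the worked example that follows.

from math import sqrt

def least_squares(x, y):
    """Least-squares fit y = a + bx and related statistics (Equations 5 to 8)."""
    n = len(x)
    sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx                                # slope, Equation 5
    a = sum(y) / n - b * sum(x) / n              # intercept, Equation 6
    sr = sqrt((syy - b * b * sxx) / (n - 2))     # s.d. about the regression line, Eq. 7
    sb = sqrt(sr * sr / sxx)                     # s.d. of the slope, Eq. 8
    return a, b, sr, sb, sxx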
Example 6:
The following table gives calibration data for chromatographic analysis of a pesticide and the computations for fitting a straight line according to the least-squares method.
Pesticide conc., µg/L (xi)   Peak area, cm² (yi)   xi²       yi²       xiyi
0.352   1.09   0.12390   1.1881    0.38368
0.803   1.78   0.64481   3.1684    1.42934
1.08    2.60   1.16640   6.7600    2.80800
1.38    3.03   1.90440   9.1809    4.18140
1.75    4.01   3.06250   16.0801   7.01750
Sums: ∑xi = 5.365, ∑yi = 12.51, ∑xi² = 6.90201, ∑yi² = 36.3775, ∑xiyi = 15.81992
Therefore:
Sxx = ∑xi² – (∑xi)²/n = 1.145365
Syy = ∑yi² – (∑yi)²/n = 5.07748
Sxy = ∑xiyi – (∑xi)(∑yi)/n = 2.39669
and from Equations 5 and 6:
b = 2.0925 ≈ 2.09 and a = 0.2567 ≈ 0.26
Thus the equation of the least-squares line is
y = 0.26 + 2.09x
Note that rounding should not be performed until the end, in order to avoid rounding errors. The data points and the fitted line are plotted in Figure 2.
Figure 2: Calibration curve (peak area, cm², versus concentration of pesticide, µg/L)
Example 7:
Using the relationship derived in Example 6, calculate:
1. the concentration of pesticide in a sample for which a peak area of 2.65 cm² was obtained;
2. the standard deviation of the result based on the single measurement of 2.65;
3. the standard deviation of the result if 2.65 represents the average of 4 replicates.
1. Substituting in the equation derived in the previous example
2.65 = 0.26 + 2.09x
x = 1.14 µg/L
2. Substituting in Equation 9 for m = 1:
sc = (0.144/2.09) √{(1/1) + (1/5) + (2.65 – 12.51/5)²/(2.09² x 1.145)} = ±0.08
3. Substituting in Equation 9 for m = 4:
sc = (0.144/2.09) √{(1/4) + (1/5) + (2.65 – 12.51/5)²/(2.09² x 1.145)} = ±0.05
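For completeness, the self-contained sketch below reproduces the numerical results of Examples 6 and 7 from the raw calibration data. It is illustrative only; the variable names are assumptions made here.

from math import sqrt

conc = [0.352, 0.803, 1.08, 1.38, 1.75]     # pesticide concentration, ug/L
area = [1.09, 1.78, 2.60, 3.03, 4.01]       # peak area, cm2
n = len(conc)
sxx = sum(x * x for x in conc) - sum(conc) ** 2 / n
syy = sum(y * y for y in area) - sum(area) ** 2 / n
sxy = sum(x * y for x, y in zip(conc, area)) - sum(conc) * sum(area) / n
b = sxy / sxx                                # slope, ~2.09
a = sum(area) / n - b * sum(conc) / n        # intercept, ~0.26
sr = sqrt((syy - b * b * sxx) / (n - 2))     # s.d. about the regression line, ~0.144

yc, ybar = 2.65, sum(area) / n               # observed peak area and mean response
x_found = (yc - a) / b                       # inverse prediction, ~1.14 ug/L
for m in (1, 4):                             # single reading vs. mean of 4 replicates
    sc = (sr / b) * sqrt(1 / m + 1 / n + (yc - ybar) ** 2 / (b * b * sxx))
    print(m, round(x_found, 2), round(sc, 2))   # m=1: sc ~0.08; m=4: sc ~0.05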