This document provides materials for a training module on applied statistics, covering confidence intervals, the number of replicate observations needed, rejection of outlier data, and regression analysis. The module aims to help participants apply common statistical tests to evaluate the precision of data. Topics include calculating confidence intervals for means both when the population standard deviation is known and when it is unknown, using replicate measurements to reduce the interval size, and identifying outliers using the Q-test. Regression analysis techniques are also summarized to quantify relationships between variables.
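The module's worked examples are not reproduced in this summary. As a rough illustration only, the short Python sketch below uses hypothetical replicate measurements (and assumes SciPy is available) to show two of the techniques named above: a Student's t confidence interval for a mean when the population standard deviation is unknown, and the Dixon Q ratio used to screen a suspected outlier.

```python
# Illustrative sketch only: hypothetical replicate measurements, not data from the module.
import math
from scipy import stats  # assumes SciPy is installed

values = [30.1, 30.5, 29.8, 31.2, 30.9]   # hypothetical replicate results
n = len(values)
mean = sum(values) / n
s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # sample standard deviation

# 95% confidence interval for the mean when sigma is unknown: Student's t with n-1 df.
t_crit = stats.t.ppf(0.975, df=n - 1)
half_width = t_crit * s / math.sqrt(n)
print(f"mean = {mean:.2f}, 95% CI = ({mean - half_width:.2f}, {mean + half_width:.2f})")

# Dixon Q ratio for the most extreme observations: Q = gap / range.
# The computed Q is compared against a tabulated critical value for n observations.
ordered = sorted(values)
data_range = ordered[-1] - ordered[0]
q_low = (ordered[1] - ordered[0]) / data_range
q_high = (ordered[-1] - ordered[-2]) / data_range
print(f"Q(lowest) = {q_low:.2f}, Q(highest) = {q_high:.2f}")
```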
Webinar slides: how to reduce sample size ethically and responsibly (nQuery)
[Webinar] How to reduce sample size...ethically and responsibly | In this free webinar, you will learn various design strategies to help reduce the sample size of your study in an ethical and responsible manner. Practical examples will be used throughout.
Regression and classification techniques play an essential role in many data mining tasks and have broad applications. However, most of the state-of-the-art regression and classification techniques are often unable to adequately model the interactions among predictor variables in highly heterogeneous datasets. New techniques that can effectively model such complex and heterogeneous structures are needed to significantly improve prediction accuracy.
In this dissertation, we propose novel types of accurate and interpretable regression and classification models, named Pattern Aided Regression (PXR) and Pattern Aided Classification (PXC) respectively. Both PXR and PXC rely on identifying regions in the data space where a given baseline model has large modeling errors, characterizing such regions using patterns, and learning specialized models for those regions. Each PXR/PXC model contains several pairs of contrast patterns and local models, where a local classifier is applied only to data instances matching its associated pattern. We also propose a class of classification and regression techniques called Contrast Pattern Aided Regression (CPXR) and Contrast Pattern Aided Classification (CPXC) to build accurate and interpretable PXR and PXC models.
We conducted comprehensive performance studies to evaluate CPXR and CPXC. The results show that CPXR and CPXC outperform state-of-the-art regression and classification algorithms, often by significant margins. The results also show that CPXR and CPXC are especially effective for heterogeneous and high-dimensional datasets. Besides being new types of models, PXR and PXC can also provide insights into data heterogeneity and diverse predictor-response relationships.
We have also adapted CPXC to the classification of imbalanced datasets, introducing a new algorithm called Contrast Pattern Aided Classification for Imbalanced Datasets (CPXCim). In CPXCim, we apply a weighting method to boost minority instances as well as a new filtering method to prune patterns with imbalanced matching datasets.
Finally, we applied our techniques to three real applications, two in the healthcare domain and one in the soil mechanics domain. PXR and PXC models are significantly more accurate than other learning algorithms in those three applications.
Bayesian Assurance: Formalizing Sensitivity Analysis For Sample Size (nQuery)
Title: Bayesian Assurance: Formalizing Sensitivity Analysis For Sample Size
Duration: 60 minutes
Speaker: Ronan Fitzpatrick, Head of Statistics, Statsols
Watch Here: http://bit.ly/2ndRG4B
In this webinar you’ll learn about:
Benefits of Sensitivity Analysis: What does the researcher gain by conducting a sensitivity analysis?
Why isn't Sensitivity Analysis formalized: Why does sensitivity analysis still lack the type of formalized rules and grounding to make it a routine part of sample size determination in every field?
How Bayesian Assurance works: Using Bayesian Assurance provides key contextual information on what is likely to happen over the total range of possible values, rather than the small number of fixed points used in a sensitivity analysis
Elicitation & SHELF: How expert opinion is elicited and then how to integrate these opinions with each other plus prior data using the Sheffield Elicitation Framework (SHELF)
Why use in both Frequentist and Bayesian analysis: How and why these methods can be used for studies that will use Frequentist or Bayesian methods in their final analysis
Plus more
Bayesian Approaches To Improve Sample Size Webinar (nQuery)
Title: Bayesian Approaches To Improve Sample Size
Duration: 60 minutes
Speaker: Ronan Fitzpatrick, Head of Statistics, Statsols
In this webinar you'll learn about:
Bayesian Sample Size Determination: See how the growth of Bayesian analysis has helped transform our ideas about statistical inference and methodologies in clinical trials
Bayesian Assurance: Get an informative answer on how likely it is to see a “positive” outcome from the trial and then make better decisions on what trials to back
Posterior Credible Intervals and Mixed Bayesian Likelihood: Enable researchers to use prior information from pilot studies and other sources to make quicker and better decisions
Plus much more
Application of predictive analytics on semi-structured north Atlantic tropica... (Skylar Hernandez)
A doctoral dissertation final defense investigating which weather pattern components can improve Atlantic TC forecast accuracy, using the C4.5 algorithm on all five-day tropical discussions from 2001-2015.
Innovative Sample Size Methods For Clinical Trials (nQuery)
"Innovative Sample Size Methods for Clinical Trials" is hosted to coincide with the Spring 2018 update to nQuery - The leading Sample Size Software.
Hosted by Ronan Fitzpatrick - Head of Statistics and nQuery Lead Researcher at Statsols - you'll learn about the benefits of a range of procedures and how you can implement them into your work:
1) Dose-escalation with the Bayesian Continual Reassessment Method
The CRM is a growing alternative to the 3+3 method for Phase I trials seeking the Maximum Tolerated Dose (MTD).
See how researchers can overcome 3+3 drawbacks to easily find the required sample size for this beneficial alternative for finding the MTD.
2) Bayesian Assurance with Survival Example
This Bayesian alternative to power has experienced a rapid rise in interest and application from researchers.
See how Assurance is being used by researchers to discover the true “probability of success” of a trial.
3) Mendelian Randomization
Mendelian randomization (MR) is a method that allows testing of a causal effect from observational data in the presence of confounding factors.
However, in order to design efficient Mendelian randomization studies, it is essential to calculate the appropriate sample sizes required. We demonstrate what to do to achieve this.
4) Negative Binomial Distribution
The negative binomial model has been increasingly used to model count data. One of the challenges of applying the negative binomial model in clinical trial design is sample size estimation.
We demonstrate how best to determine the appropriate sample size in the presence of challenges such as unequal follow-up or dispersion.
Poster: Randomization Approach in Case-Based Reasoning: Case of Study of Mammo... (Miled Basma Bentaiba)
This document was presented on 8 May 2018 at the doctoral symposium at IFIP International Conference on Computational Intelligence and Its Applications (IFIP CIIA 2018).
Webinar slides: alternatives to the p-value and power (nQuery)
What are the alternatives to the p-value & power? What is the next step for sample size determination? We will explore these issues in this free webinar presented by nQuery
Here are the steps to solve this problem:
1) Calculate the mean (x̅) of the lead concentration data: ∑x/n = 306.3/10 = 30.63
2) Calculate the deviation of each value from the mean: |x − x̅|
3) Calculate the sum of the deviations: ∑|x − x̅| = 50.03
4) Calculate the average deviation: sum of deviations/n = 50.03/10 = 5.003
5) Calculate (x − x̅)² for each value
6) Calculate the sum of the squared deviations: ∑(x − x̅)²
7) Calculate the sample standard deviation: s = √[∑(x − x̅)²/(n − 1)]
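As a minimal sketch, the Python function below carries out the same sequence of calculations for an arbitrary set of replicate results and finishes with the sample standard deviation. The ten lead concentration values from the exercise are not listed in this summary, so the call uses hypothetical data and its output will not match the numbers quoted above.

```python
import math

def summary_stats(values):
    """Mean, average deviation, and sample standard deviation of replicate results."""
    n = len(values)
    mean = sum(values) / n                                 # step 1: x̅ = ∑x / n
    avg_dev = sum(abs(x - mean) for x in values) / n       # steps 2-4: average deviation
    sum_sq = sum((x - mean) ** 2 for x in values)          # steps 5-6: ∑(x − x̅)²
    std_dev = math.sqrt(sum_sq / (n - 1))                  # step 7: s = √[∑(x − x̅)² / (n − 1)]
    return mean, avg_dev, std_dev

# Hypothetical lead concentrations; the real exercise data are not shown here.
mean, avg_dev, s = summary_stats([28.5, 31.2, 30.0, 29.7, 32.1, 30.6, 27.9, 33.0, 31.5, 31.8])
print(f"mean = {mean:.2f}, average deviation = {avg_dev:.3f}, s = {s:.3f}")
```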
The document provides information about a workshop on standards for groundwater monitoring, processing, and data dissemination. It includes the following key points:
1. The workshop aims to review current practices and adopt standard formats, techniques, and procedures for computerized groundwater data acquisition, processing, validation, retrieval and dissemination.
2. Topics to be addressed include computerized techniques, data standards, quality monitoring objectives and procedures, dedicated software demonstrations, and requirements for software.
3. The 3-day workshop program includes sessions on data standards, software discussions, and a visit to an operational digital monitoring network site. Standardizing procedures and using computerization can help establish a reliable hydrological information system.
The document provides an overview of training courses for staff involved in managing and operating a Hydrological Information System for groundwater in India, outlining the objectives, target groups, providers, locations, durations, and sample programs for courses on topics like groundwater data collection, data entry and processing, HIS management, and more.
This document provides guidance on collecting, entering, and validating hydrological data for storage and use in a water resources information system in India. It discusses the mandatory information needed for spatial (e.g. well locations) and temporal (e.g. water level measurements) data. It also describes proper data collection procedures like using field forms, maintaining a data collection register, and entering data directly from field forms to reduce errors. The document emphasizes validation of data at multiple stages and storing data according to standards to ensure long-term usability and reliability of the hydrological information system.
This document provides an operations manual for water quality analysis laboratories. It discusses good laboratory practices and quality assurance protocols that should be followed, including procedures for sample handling, analytical methods, equipment maintenance, and data recording. Standard analytical methods are described for over 40 water quality parameters. Laboratories are expected to adhere to proper chemical and equipment handling techniques, quality control measures, and documentation practices to ensure reliable and comparable analytical results.
This document provides information on a training module about major ions in water. It includes an introduction to the module, a module profile, session plan, overheads and masters, evaluation sheets, and handouts. The module aims to help participants understand the major ions found in water and their sources, how to characterize water samples based on ion concentrations, the water quality consequences of different ions, and the technique of ion balancing to assess water quality. The session covers topics such as hardness, sodium adsorption ratio, percent sodium, and residual sodium carbonate as ways to evaluate water for irrigation use.
This document provides information about a training module on absorption spectroscopy. It discusses the World Bank and Government of the Netherlands funding a training program in New Delhi, India. The document includes an introduction to absorption spectroscopy, outlines the module context and profile, presents a session plan and materials including overhead masters, and discusses measuring various water quality parameters using a spectrophotometer.
This document provides materials for a training module on basic statistics, including a session plan, overhead masters, handouts, and main text. The module introduces key concepts such as frequency distributions, accuracy and precision, and propagation of errors. Participants will learn to construct histograms, calculate descriptive statistics, and understand the difference between random and systematic errors. The goal is for participants to be able to evaluate the accuracy of chemical data using basic statistical analysis.
This document provides guidance on how to carry out secondary validation of discharge data. It discusses validating a single station's data against limits and expected behavior through graphical inspection. It also describes validating multiple stations by comparing their time series plots and residual series, as well as comparing streamflow and rainfall data. The overall goal of secondary validation is to identify potential errors or anomalies in discharge data for further investigation and correction.
This document provides guidance on analyzing discharge data. It discusses computing basic statistics, constructing flow duration curves to analyze variability, fitting theoretical frequency distributions, and other time series analysis techniques like moving averages and mass curves. The main text provides detailed explanations of these methods and their uses in hydrological analysis, data validation, and reporting. It is intended to train hydrologists and data managers on effectively analyzing discharge data.
This document outlines Daren Harmel's work monitoring watersheds and modeling water quality. It discusses developing guidance for sampling small watersheds, quantifying uncertainty in water quality data, expanding a database of agricultural management practices and their impacts on nitrogen and phosphorus runoff, challenges modeling E. coli transport, and how monitoring and modeling can best be used together to inform water quality decision making. The goal is to provide practical guidance on watershed monitoring, account for uncertainty in data, and leverage all available data and models to evaluate management impacts and alternative practices. Both monitoring and modeling are necessary for water quality decisions, and an adaptive approach is recommended that uses measured data, models, and stakeholder input to identify sources, estimate contributions
This document discusses quality assurance and within-laboratory analytical quality control programs. It emphasizes the need for quality assurance due to errors found in analytical results. A quality assurance program includes sample control, standard procedures, equipment maintenance, calibration, and analytical quality control. Within-laboratory quality control focuses on precision using statistical control charts to monitor performance over time and identify issues. Regular quality control practices help ensure reliability of laboratory results.
This document describes procedures for conducting an inter-laboratory analytical quality control (AQC) exercise. It outlines how to plan the exercise by selecting a coordinating laboratory and test samples, determine reference values, and analyze results. Laboratories' performance is evaluated using Youden plots to identify systematic errors. The coordinating laboratory collates the data, identifies errors, and provides feedback to participating laboratories to improve quality. The goal is to enhance quality control through interaction between laboratories and evaluation of their facilities and testing capabilities.
Statistics is a critical tool for robustness analysis, measurement system error analysis, test data analysis, probabilistic risk assessment, and many other fields in the engineering world. These basic applications relate to everyday engineering problems, helping us solve them, reach better solutions, and work more efficiently on real-life engineering tasks.
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation (Thomas Ploetz)
Tutorial @Ubicomp 2015: Bridging the Gap -- Machine Learning for Ubiquitous Computing (evaluation session).
A tutorial on promises and pitfalls of Machine Learning for Ubicomp (and Human Computer Interaction). From Practitioners for Practitioners.
Presenter: Nils Hammerla <n.hammerla@gmail.com>
Video recording of talks as they were held at Ubicomp:
https://youtu.be/LgnnlqOIXJc?list=PLh96aGaacSgXw0MyktFqmgijLHN-aQvdq
Study on Application of Ensemble learning on Credit Scoring (harmonylab)
This document discusses using ensemble learning techniques for credit scoring applications. It begins by providing background on the growing need for effective credit scoring models due to the expansion of financing opportunities. It then discusses some common problems with credit scoring models, including a lack of understandability, multiple evaluation metrics, and imbalanced data. The purpose is to build a better credit scoring model using machine learning, specifically ensemble learning with XGBoost. Two proposed methods are described: 1) using EasyEnsemble resampling prior to XGBoost to address imbalanced data, and 2) customizing XGBoost by changing the evaluation metric to directly optimize cost or using a focal loss function. Experiments on several credit scoring datasets show these approaches can reduce costs without reducing accuracy
Machine Learning part 3 - Introduction to data science (Frank Kienle)
Lecture: Introduction to Data Science
given 2017 at Technical University of Kaiserslautern, Germany
Topic: part 3 machine learning, link to data science practice
This document summarizes an introduction to deep learning with MXNet and R. It discusses MXNet, an open source deep learning framework, and how to use it with R. It then provides an example of using MXNet and R to build a deep learning model to predict heart disease by analyzing MRI images. Specifically, it discusses loading MRI data, architecting a convolutional neural network model, training the model, and evaluating predictions against actual heart volume measurements. The document concludes by discussing additional ways the model could be explored and improved.
Researc-paper_Project Work Phase-1 PPT (21CS09).pptx (AdityaKumar993506)
This project aims to develop a machine learning model for early stage diabetes prediction. The group will implement classification algorithms like KNN, logistic regression, decision trees, and random forests on a diabetes dataset. The inputs to the model will be patient attributes and the output will predict their risk of having diabetes. Currently, the group is working on data preprocessing, model selection, and algorithm implementation to accurately classify patients' diabetes risk. The goal is to create an effective predictive tool to help clinicians diagnose and treat diabetes early.
Shaun Douglas Smith's graduate studies focused on operations research, industrial engineering, statistics, and quality engineering. His coursework covered topics like optimization theory, simulation, stochastic processes, experiment design, statistical modeling, production and inventory control, and quality engineering. His final project applied an algorithm to solve a bin packing problem. Overall, his program equipped him with technical and analytical skills for improving production systems and decision-making.
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
This document provides guidance on validating a rating curve using graphical and numerical tests. It describes plotting new gauging data against the existing rating curve and confidence limits to check fit. Graphical tests illustrated include period-flow deviation scattergrams, stage-flow deviation diagrams, and cumulative deviation plots to identify bias over time. Numerical tests like Student's t-test are also described to statistically compare new gaugings to the original dataset and determine if the rating curve should be recalculated. The document provides an example validation analysis for a station in Pargaon.
The document discusses key concepts in quantitative research methods and data analytics covered in a university course. It outlines the course content, which includes topics like data visualization, the normal distribution, and hypothesis testing. It then details the course assessments, which include a mid-term assignment and final coursework report worth 30% and 70% respectively. The final report involves selecting a topic, collecting and analyzing data using R Studio, and reporting the results in a 2000 word paper with sections on introduction, data, results, and conclusion.
Churn prediction model for customers using inferential machine learning methods, Phương... (NguyenThiNgocAnh9)
1) The document discusses methods for predicting customer churn for a bank using machine learning, including decision trees, artificial neural networks, and online variational inference for multivariate Gaussian distributions.
2) These techniques were applied to a Vietnamese dataset containing over 2,000 customer observations and 30 features to classify customers as churned or non-churned.
3) The results showed that decision trees achieved 99.78% accuracy while artificial neural networks achieved 94.26% accuracy according to cross-validation. Online variational inference for multivariate Gaussian distributions was also tested.
Eugm 2011 mehta - adaptive designs for phase 3 oncology trials (Cytel USA)
This document discusses adaptive designs for phase 3 oncology trials. It uses the VALOR trial as a case study to illustrate a sponsor's dilemma in designing a trial with limited prior data. It proposes a promising zone design that allows staged investment - an initial modest sample size with the option to increase size and power if interim results are promising. Simulations show this two-stage investment approach increases power over a non-adaptive design while managing risks for sponsors. The document also discusses extensions to population enrichment designs.
Similar to Download-manuals-surface water-software-48appliedstatistics
This document provides guidance on working with map layers and network layers in HYMOS, a hydrological modeling software. It describes how to obtain map layers from digitized topographic maps and remotely sensed data. It also explains how to create network layers by manually adding observation stations or importing them from another database. The document outlines how to manage and set properties for map layers and network layers within HYMOS to control visibility, styling, and other display options.
This document contains information about receiving hydrological data at different levels in India, including:
1. Data is transferred from field stations to subdivisional offices, then to divisional offices and state/regional data processing centers in stages. Target dates are set for receipt and transmission at each level to ensure smooth processing.
2. Records of receipt are maintained at each office to track data and identify delays, with feedback provided if data is not received by targets.
3. Original paper records are filed by station for easy retrieval, while digital copies are stored for long-term archiving.
The document describes a training module on understanding different types and forms of data in hydrological information systems (HIS). It was developed with funding from the World Bank and Government of the Netherlands. The module provides an overview of the session plan and covers various types of data in HIS, including space-oriented data like catchment maps, time-oriented data such as meteorological observations, and relation-oriented data like stage-discharge relationships. The goal is for participants to learn about all the different types and forms of data managed in HIS.
The document provides details on a surface water data processing plan for India. It discusses distributing data processing activities across three levels - sub-divisional, divisional, and state data processing centers. It outlines the activities, computing facilities, staffing, and time schedules needed at each level to efficiently manage the large volume of hydrological data. The plan aims to ensure data is properly validated and processed within time limits while not overwhelming staff.
This document outlines the stages of surface water data processing under the Hydrological Information System (HIS) in India. It discusses: 1) Receipt of data from field stations and storage of raw records; 2) Data entry at sub-divisional offices; 3) Validation of data through primary, secondary, and hydrological checks; 4) Completion and correction of missing or erroneous data; 5) Compilation, analysis, and reporting of validated data; 6) Transfer of data between processing levels from sub-division to division to state centers. The overall goal is to process field data in a systematic series of steps to produce quality-controlled hydrological information.
This document provides information on a training module for understanding hydrological information system (HIS) concepts and setup. It includes an introduction to HIS, why they are needed, how they are set up under the Hydrology Project. It also discusses who the key users of hydrological data are and how computers are used in hydrological data processing. The training module contains session plans, presentations, handouts, and text to educate participants on HIS objectives, components, and how they provide reliable hydrological data to various end users.
This document provides guidance on reporting climatic data in India. It discusses the purpose and contents of annual reports on climatic data, including evaporation data. Key points covered include:
- Annual reports summarize evaporation data for the reporting year and compare to long-term statistics.
- Reports include details on the observational network, basic evaporation statistics, data validation processes.
- Network maps and station listings provide details of monitoring locations. Statistics include monthly and annual evaporation amounts for the current year and historical averages.
- Reports aim to inform water resource planning, acknowledge data collection efforts, and provide access to climatic data records.
This document provides information and guidance on analyzing climatic data to estimate evaporation and evapotranspiration rates. It discusses the use of evaporation pans and appropriate pan coefficients to estimate open water evaporation from lakes and reservoirs. It also describes the Penman method for estimating potential evapotranspiration using standard climatological measurements. The Penman method combines the energy budget and mass transfer approaches and provides formulas for calculating evapotranspiration based on climatic variables like temperature, humidity, wind speed, and solar radiation. Substitutions are suggested when some climatic variables are not directly measured.
This document provides guidance on how to carry out secondary validation of climatic data. It describes various methods for validating data spatially using multiple station comparisons, including comparison plots, balance series, regression analysis, and double mass curves. It also describes single station validation tests for homogeneity, including mass curves and tests of differences in means. The document is part of a training module on secondary validation of climatic data funded by the World Bank and Government of the Netherlands. It provides context for the training and outlines the session plan, materials, and main validation methods to be covered.
This document provides guidance on how to carry out primary validation of climatic data. It discusses validating temperature, humidity, wind speed, atmospheric pressure, sunshine duration, and pan evaporation data. For each variable, it describes typical variations and measurement methods, potential errors, and approaches to error detection such as setting maximum/minimum limits. The goal of primary validation is to check for errors by comparing individual observations to physical limits and sequential observations for unacceptable changes.
This document provides guidance on entering climatic data into a hydrological data processing software called SWDES. It describes the various types of climatic data that can be entered, including daily, twice daily, hourly, and sunshine duration data. Instructions are provided on inspecting paper records, setting up data entry screens, entering values, and performing basic data validation checks. The overall aim is to make climatic data available electronically using SWDES in order to facilitate validation, processing, and reporting of the data.
This document provides guidance on how to report rainfall data in yearly and periodic reports. It outlines the typical contents and structure of annual reports including descriptive summaries of rainfall patterns, comparisons to long-term averages, basic statistics, and descriptions of major storms. Periodic reports produced every 10 years would include long-term statistics updated over the previous decade as well as frequency analysis of rainfall data. The reports aim to inform stakeholders of rainfall patterns and data availability as well as validate and improve the quality of data collection.
The document describes a training module on analyzing rainfall data. It includes sessions on checking data homogeneity, computing basic statistics, fitting frequency distributions, and deriving frequency-duration and intensity-duration-frequency curves. Exercises are provided for trainees to practice analyzing monthly and daily rainfall series, fitting distributions, and deriving curves for different durations and return periods. Case studies from India are referenced as examples throughout the training material.
This document provides guidance on compiling rainfall data from various time intervals into longer standardized durations. It discusses aggregating hourly data into daily totals, daily data into weekly, ten-daily, monthly, and yearly totals. Methods are presented for arithmetic averaging and Thiessen polygons to estimate areal rainfall from point measurements. Guidance is also given on transforming non-equidistant time series into equidistant series and compiling extreme rainfall statistics. Examples demonstrate compiling hourly rainfall from an autographic rain gauge into daily totals and further aggregating daily point rainfall into areal averages and statistics for various durations.
This document provides guidance on correcting and completing rainfall data. It discusses using autographic rain gauge (ARG) and standard rain gauge (SRG) data to correct errors. When the SRG is faulty but ARG is available, the SRG can be corrected to match the ARG totals. When the ARG is faulty but SRG is available, hourly distributions from neighboring stations can be used to estimate hourly totals for the station based on its daily SRG total. The document also discusses correcting time shifts, apportioning partial daily accumulations, adjusting for systematic shifts using double mass analysis, and using spatial interpolation methods to estimate missing values. Examples are provided to demonstrate each technique.
This document describes a training module on how to carry out secondary validation of rainfall data. It includes the following key points:
1. Secondary validation involves comparing rainfall data to neighboring stations to identify suspect values, taking into account spatial correlation which depends on duration, distance, precipitation type, and physiography.
2. Validation methods described include screening data against limits, scrutinizing multiple time series graphs and tabulations, checking against data limits for longer durations, spatial homogeneity testing, and double mass analysis.
3. Examples demonstrate how spatial correlation varies with duration and distance, and how physiography affects correlation. Screening listings with basic statistics are used to flag suspect data values.
This document provides guidance on how to carry out primary validation of rainfall data. It discusses comparing daily rainfall measurements from a standard raingauge to those from an autographic or digital raingauge. Differences greater than 5% between the two measurements would be further investigated. Likely sources of error are outlined for each type of raingauge. The validation can be done graphically or tabularly by aggregating hourly rainfall data to daily totals and comparing. Actions are suggested based on the patterns of discrepancies found.
This document provides guidance on entering rainfall data into a dedicated hydrological data processing software (SWDES). It discusses entering daily rainfall data, twice daily rainfall data, and hourly rainfall data from manual records or digital loggers. The key steps are:
1. Manually inspecting field records for completeness and errors before data entry.
2. Entering data into customized SWDES forms that match field observation sheets. This allows direct data transfer with minimal risk of errors.
3. Performing automated checks of the entered data against limits and computed totals to ensure accuracy. Any errors are flagged for further inspection.
4. Graphing the entered time series data during the entry process as an additional validation check.
The document provides guidance on sampling surface waters for water quality analysis. It discusses selecting sampling sites that are representative of the waterbody and safely accessible. It describes three types of samples - grab samples, composite samples, and integrated samples - and when each would be used. It also outlines appropriate sampling devices and containers for different analyses, as well as procedures for sample handling, preservation, and identification. The overall aim is to collect samples that accurately represent water quality without significant changes prior to analysis.
The document describes methods for hydrological observations including rainfall, water level, discharge, and inspection of observation stations. It contains sections on ordinary and recording rainfall observation, ordinary and recording water level observation, observation of discharge using current meters and floats, and inspection of rainfall and water level observation stations. The document was produced by the Ministry of Construction in Japan.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
1. World Bank & Government of The Netherlands funded
Training module # WQ - 48
Applied Statistics
New Delhi, September 2000
CSMRS Building, 4th Floor, Olof Palme Marg, Hauz Khas,
New Delhi – 11 00 16 India
Tel: 68 61 681 / 84 Fax: (+ 91 11) 68 61 685
E-Mail: dhvdelft@del2.vsnl.net.in
DHV Consultants BV & DELFT HYDRAULICS
with
HALCROW, TAHAL, CES, ORG & JPS
2. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 1
Table of contents
Page
1. Module context 2
2. Module profile 3
3. Session plan 4
4. Overhead/flipchart master 5
5. Evaluation sheets 21
6. Handout 23
7. Additional handout 30
8. Main text 32
3. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 2
1. Module context
This module discusses statistical procedures that are commonly used by chemists to
evaluate the precision and accuracy of analytical results. Prior training in ‘Basic Statistics’,
module no. 47, or equivalent is necessary to complete this module satisfactorily.
While designing a training course, the relationship between this module and the others
should be maintained by keeping them close together in the syllabus and placing them in a
logical sequence. The actual selection of topics and the depth of training would, of
course, depend on the training needs of the participants, i.e. their knowledge level and skills
at the start of the course.
4. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 3
2. Module profile
Title : Applied Statistics
Target group : HIS function(s): Q2, Q3, Q5, Q6
Duration : one session of 90 min
Objectives : After the training the participants will be able to:
• Apply common statistical tests for evaluation of the precision of
data
Key concepts : • confidence interval
• regression analyses
Training methods : Lecture, exercises, OHS
Training tools required : Board, flipchart
Handouts : As provided in this module
Further reading and references :
• ‘Analytical Chemistry’, Douglas A. Skoog and Donald M. West,
Saunders College Publishing, 1986 (Chapter 3)
• ‘Statistical Procedures for Analysis of Environmental Monitoring
Data and Risk Assessment’, Edward A. McBean and Frank A.
Rovers, Prentice Hall, 1998.
5. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 4
3. Session plan
No Activities Time Tools
1 Preparations
2 Introduction:
• Need for statistical procedures
• Scope of the module
10 min OHS
3 Confidence interval
• Definition
• Significance of standard deviation and its correct estimation
• When population SD is known
• When population SD is not known
30 min OHS
4 Rejection of Data
• Caution in rejection
• Procedures
20 min OHS
5 Regression analysis 20 min OHS
6 Wrap up 10 min
6. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 5
4. Overhead/flipchart master
OHS format guidelines
Type of text Style Setting
Headings: OHS-Title Arial 30-36, with bottom border line (not:
underline)
Text: OHS-lev1
OHS-lev2
Arial 24-26, maximum two levels
Case: Sentence case. Avoid full text in UPPERCASE.
Italics: Use occasionally and in a consistent way
Listings: OHS-lev1
OHS-lev1-Numbered
Big bullets.
Numbers for definite series of steps. Avoid
roman numbers and letters.
Colours: None, as these get lost in photocopying and
some colours do not reproduce at all.
Formulas/Equations: OHS-Equation Use of a table will ease horizontal alignment
over more lines (columns).
Use the equation editor for advanced formatting only.
7. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 6
Applied Statistics
• Confidence intervals
• Number of replicate observations
• Rejection of data
• Regression analysis
8. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 7
Confidence Intervals (1)
• Population mean µ is always unknown
• Sample mean x
- Only for a large number of replicates does x → µ
• Estimation of limits for µ from x
(x – a) < µ < (x + a)
• Value of a depends on
- confidence level (probability of occurrence)
- standard deviation
- difference between limits is the interval
9. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 8
Confidence Intervals
Figure 1. Normal distribution curve and areas under curve for different distances
10. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 9
Confidence Intervals (2)
• When s is a good estimate of σ
- for a single observation, (x – zσ) < µ < (x + zσ)
Confidence level 50 90 95 99 99.7
Z 0.67 1.64 1.96 2.58 3
- as z or interval increases confidence level also increases
11. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 10
Example (1)
• Mercury concentration in fish was determined as 1.8 µg/kg.
Calculate limits for µ for 95% confidence level if σ = 0.1 µg/kg
- For 95% confidence level z = 1.96
- Limits for mean
1.8 – 1.96 x 0.1 < µ < 1.8 + 1.96 x 0.1
1.6 < µ < 2.0
• What are the limits for 99.7% confidence level?
12. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 11
Confidence Intervals (3)
• Confidence interval when the mean, x, of n replicates is available
x – zσ/√n < µ < x + zσ/√n
• Confidence interval is reduced compared to a single
measurement as n increases
n 1 2 3 4 9 16
√n 1 1.4 1.7 2 3 4
• Analyse 2 to 4 replicates; more replicates give diminishing
return
13. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 12
Example (2)
• Average mercury conc. from 3 replicates was 1.67 µg/ kg
• Calculate limits for 95% confidence level, σ = 0.1 µg/ kg
1.67 – 1.96 x 0.1/√3 < µ < 1.67 + 1.96 x 0.1/√3
1.56 < µ < 1.78
14. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 13
Example (3)
• How many replicate analyses will be required to decrease the
95% confidence interval to ± 0.07, σ = 0.1?
Interval = ± zσ/√n
± 0.07 = ± 1.96 x 0.1/√n
n = 7.8
• 8 measurements will give a slightly better than 95% chance for µ
to be within x ± 0.07
15. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 14
Confidence Intervals (4)
• When σ is unknown
• Sample s based on n measurements is available
x – ts/√n < µ < x + ts/√n
Degrees of Freedom 1 4 10 ∞
t95 12.7 2.78 2.23 1.96
t99 63.7 4.60 3.17 2.58
• Note t → z as degrees of freedom increases to ∞
16. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 15
Example (4)
• Groundwater samples were analysed for TOC: n = 5, x = 11.77 mg/L,
s = 3.2 mg/L
• Find the 95% confidence limits for the true mean
- Degrees of freedom = n – 1 = 5 – 1 = 4
- t95 = 2.78 (from table)
11.77 – 2.78 x 3.2/√5 < µ < 11.77 + 2.78 x 3.2/√5
7.8 < µ < 15.74
17. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 16
Detection of Outliers
• Extreme values, very high or low, will influence calculated
statistics
• Rejection
- sound basis
- may be due to an unrecorded event
- errors in transcription, reading of instruments, inconsistent
methodology, etc.
- otherwise apply statistical yardsticks
18. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 17
Q Test
• Qexp = (Difference between suspect value and nearest neighbour) /
(Spread of entire set)
• Reject if Qexp > Qcrit
Total No. of observations 3 5 7 10
Qcrit 90% 0.94 0.64 0.51 0.41
Qcrit 96% 0.98 0.73 0.59 0.48
• Other criteria are also available
19. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 18
Regression Analysis
• Quantification of relationships between variables
- used for predictions
- calibration of instruments
• Least-square method
- deviations due to random errors
y = a + bx
- linear relationship
22. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 21
5. Evaluation sheets
23. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 22
24. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 23
6. Handout
25. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 24
Applied Statistics
• Confidence intervals
• Number of replicate observations
• Rejection of data
• Regression analysis
Confidence Intervals (1)
• Population mean µ is always unknown
• Sample mean x
- Only for a large number of replicates does x → µ
• Estimation of limits for µ from x
(x – a) < µ < (x + a)
• Value of a depends on
- confidence level (probability of occurrence)
- standard deviation
- difference between limits is the interval
Confidence Intervals
Figure 1. Normal distribution curve and areas under curve for different distances
26. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 25
Confidence Intervals (2)
• When s is a good estimate of σ
- for a single observation, (x – zσ) < µ < (x + zσ)
Confidence level 50 90 95 99 99.7
Z 0.67 1.64 1.96 2.58 3
- as z or interval increases confidence level also increases
Example (1)
• Mercury concentration in fish was determined as 1.8 µg/kg.
• Calculate limits for µ for 95% confidence level if σ = 0.1 µg/kg
- For 95% confidence level z = 1.96
- Limits for mean
1.8 – 1.96 x 0.1 < µ < 1.8 + 1.96 x 0.1
1.6 < µ < 2.0
• What are the limits for 99.7% confidence level?
Confidence Intervals (3)
• Confidence interval when the mean, x, of n replicates is available
x – zσ/√n < µ < x + zσ/√n
• Confidence interval is reduced compared to a single measurement as n increases
n 1 2 3 4 9 16
√n 1 1.4 1.7 2 3 4
• Analyse 2 to 4 replicates; more replicates give diminishing return
Example (2)
• Average mercury conc. from 3 replicates was 1.67 µg/kg.
• Calculate limits for 95% confidence level, σ = 0.1 µg/kg
1.67 – 1.96 x 0.1/√3 < µ < 1.67 + 1.96 x 0.1/√3
1.56 < µ < 1.78
27. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 26
Example (3)
• How many replicate analyses will be required to decrease the 95% confidence
interval to ± 0.07, σ = 0.1?
Interval = ± zσ/√n
± 0.07 = ± 1.96 x 0.1/√n
n = 7.8
• 8 measurements will give a slightly better than 95% chance for µ to be within x ±
0.07
Confidence Intervals (4)
• When σ is unknown
• Sample s based on n measurements is available
x – ts/√n < µ < x + ts/√n
Degrees of Freedom 1 4 10 ∞
t95 12.7 2.78 2.23 1.96
t99 63.7 4.60 3.17 2.58
• Note t → z as degrees of freedom increases to ∞
Example (4)
• Groundwater samples were analysed for TOC: n = 5, x = 11.77 mg/L,
s = 3.2 mg/L
• Find the 95% confidence limits for the true mean
- Degrees of freedom = n – 1 = 5 – 1 = 4
- t95 = 2.78
11.77 – 2.78 x 3.2/√5 < µ < 11.77 + 2.78 x 3.2/√5
7.8 < µ < 15.74
28. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 27
Detection of Outliers
• Extreme values, very high or low, will influence calculated statistics
• Rejection
- sound basis
- may be due to an unrecorded event
- errors in transcription, reading of instruments, inconsistent methodology, etc.
- otherwise apply statistical yardsticks
Q Test
• Qexp = (Difference between suspect value and nearest neighbour) /
(Spread of entire set)
• Reject if Qexp > Qcrit
Total No. of observations 3 5 7 10
Qcrit 90% 0.94 0.64 0.51 0.41
Qcrit 96% 0.98 0.73 0.59 0.48
• Other criteria are also available
Regression Analysis
• Quantification of relationships between variables
- used for predictions
- calibration of instruments
• Least-square method
- deviations due to random errors
y = a + bx
- linear relationship
30. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 29
Add copy of Main text in chapter 8, for all participants.
31. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 30
7. Additional handout
These handouts are distributed during delivery and contain test questions, answers to
questions, special worksheets, optional information, and other material that you would not
want to appear in the regular handouts.
It is a good practice to pre-punch these additional handouts, so the participants can easily
insert them in the main handout folder.
32. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 31
33. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 32
8. Main text
Contents
1. Confidence Intervals 1
2. Detection of Data Outliers 5
3. Regression Analysis 6
34. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 1
Applied Statistics
For small data sets, there is always the question as to how well the data represent the
population and what useful information can be inferred regarding the population
characteristics. Statistical calculations are used for this purpose. This module deals with four
such applications:
1. Definition of an interval around the mean of a data set within which the true mean can
be expected to be found with a particular degree of probability.
2. Determination of the number of measurements needed to obtain the true mean within a
predetermined interval around the experimental mean with a particular degree of
probability.
3. Determining if an outlying value in a set of replicate measurements should be retained in
calculating the mean for the set.
4. The fitting of a straight line to a set of experimental points.
1. Confidence Intervals
The true mean of a population, µ, is always unknown. However, limits can be set about the
experimentally determined mean, x, within which the true mean may be expected to occur
with a given degree of probability. These bounds are called confidence limits and the interval
between them the confidence interval.
For a given set of data, if the confidence interval about the mean is large, the probability of
an observation falling within the limits is also large. On the other hand, when the limits are
set close to the mean, the probability that an observed value falls within the limits becomes
smaller or, in other words, a larger fraction of observations is expected to fall outside the
limits. Usually confidence intervals are calculated such that 95%
of the observations are likely to fall within the limits.
For a given probability of occurrence, the confidence limits depend on the value of the
standard deviation, s, of the observed data and the certainty with which this quantity can be
taken to represent the true standard deviation, σ, of the population.
1.1 Confidence limits where σ is known
When the number of repetitive observations is 20 or more, the standard deviation of the
observed set of data, s, can be taken to be a good approximation of the standard deviation
of the population, σ.
However, it may not always be possible to perform such a large number of repetitive
analyses, particularly when costly and time consuming extraction and analytical procedures
are involved. In such cases, data from previous records or different laboratories may be
pooled, provided that identical precautions and analytical steps are followed in each case.
Further, care should be taken that the medium sampled is also similar, for example, analysis
results of groundwater samples having high TDS should not be pooled with the results of
surface waters, which usually have low TDS.
As discussed in the previous module, most of the randomly varying data can be
approximated to a normal distribution curve. For normal distribution, 68% of the observations
lie between the limits µ ± σ, 96% between the limits µ ± 2σ, and 99.7% between the limits µ ± 3σ, etc. (Figure 1).
35. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 2
For a single observation, xi, the confidence limits for the population mean
are given by:
Figure 1. Normal distribution curve and areas under curve for different distances
confidence limit for µ = xi ± zσ (1)
where z assumes different values depending upon the desired confidence level as given in
Table 1.
Table 1: Values of z for various confidence levels
Confidence level, % z
50 0.67
68 1.00
80 1.29
90 1.64
95 1.96
96 2.00
99 2.58
99.7 3.00
99.9 3.29
Example 1:
Mercury concentration in the sample of a fish was determined to be 1.80 µg/kg. Calculate
the 50% and 95% confidence limits for this observation. Based on previous analysis records,
it is known that the standard deviation of such observations, following similar analysis
procedures, is 0.1 µg/kg and it closely represents the population standard deviation, σ.
36. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 3
From Table 1, it is seen that z = 0.67 and 1.96 for the two confidence limits in question.
Upon substitution in Equation 1, we find that
50% confidence limit = 1.80 ± 0.67 x 0.1 = 1.8 ± 0.07 µg/kg
95% confidence limit = 1.80 ± 1.96 x 0.1 = 1.8 ± 0.2 µg/kg
Therefore, if 100 replicate analyses are made, the results of 50 analyses will lie between the
limits 1.73 and 1.87, and 95 results are expected to lie within the enlarged limits of 1.6 and 2.0.
Equation 1 applies to the result of a single measurement. In case a number of observations,
n, is made and an average of the replicate samples is taken, the confidence interval
decreases. In such a case the limits are given by:
confidence limit for µ = x ± zσ/ √n (2)
Example 2:
Calculate the confidence limits for the problem of Example 1, if three samples of the fish
were analysed yielding an average of 1.67 µg/kg Hg.
Substitution in Equation 2 gives:
50% confidence limit = 1.67 ± 0.67 x 0.1/ √3 = 1.67 ± 0.04 µg/kg
95% confidence limit = 1.67 ± 1.96 x 0.1/ √3 = 1.67 ± 0.11 µg/kg
For the same odds the confidence limits are now substantially smaller and the result can be
said to be more accurate and probably more useful.
Note that Equation 2 indicates that the confidence interval can be halved by increasing the
number of analyses to 4 (√4 = 2) compared to a single observation. Increasing the number
of measurements beyond 4 does not decrease the confidence interval proportionately. To
narrow the interval to one fourth, 16 measurements would be required (√16 = 4), thus giving
diminishing returns. Consequently, 2 to 4 replicate measurements are made in most cases.
Equation 2 can also be used to find the number of replicate measurements required such
that with a given probability the true mean would be found within a predetermined interval.
This is illustrated in Example 3.
Example 3:
How many replicate measurements of the specimen in Example 1 would be needed to
decrease the 95% confidence interval to ±0.07?
Substituting for confidence interval in Equation 2:
0.07 = 1.96 x 0.1/ √n
√n = 1.96 x 0.1/0.07, n = 7.8
Thus, 8 measurements will provide a slightly better than 95% chance of the true mean lying
within ±0.07 of the experimental mean.
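For participants who wish to check these calculations, the following minimal Python sketch applies Equations 1 and 2 and inverts Equation 2 to obtain the number of replicates; the function names and the small z table are illustrative only, reproducing Examples 1 to 3:

```python
import math

# z values from Table 1 for selected confidence levels (percent)
Z = {50: 0.67, 90: 1.64, 95: 1.96, 99: 2.58, 99.7: 3.00}

def confidence_limits(mean, sigma, n=1, level=95):
    """Equations 1 and 2: mean +/- z*sigma/sqrt(n) when sigma is known."""
    half_width = Z[level] * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width

def replicates_needed(sigma, half_width, level=95):
    """Invert Equation 2 to find n giving the requested half-width."""
    n = (Z[level] * sigma / half_width) ** 2
    return n, math.ceil(n)

# Example 1: single Hg measurement, 1.80 ug/kg, sigma = 0.1 ug/kg
print(confidence_limits(1.80, 0.1, n=1, level=95))   # ~ (1.60, 2.00)

# Example 2: mean of 3 replicates, 1.67 ug/kg
print(confidence_limits(1.67, 0.1, n=3, level=95))   # ~ (1.56, 1.78)

# Example 3: replicates needed for a half-width of +/- 0.07 at 95%
print(replicates_needed(0.1, 0.07, level=95))        # ~ (7.8, 8)
```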
37. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 4
1.2 Confidence limits where σ is unknown
When the number of individual measurements in a set of data is small, the reliability of
the calculated value of the standard deviation, s, is reduced. Therefore, for a given
probability, the confidence interval must be larger under these circumstances.
To account for the potential variability in s, the confidence limits are calculated using the
statistical parameter t:
confidence limit for µ = x ± ts/ √n (3)
In contrast to z in Equation 2, t depends not only on the desired confidence level, but also
upon the number of degrees of freedom available in the calculation of s. Table 2 provides
values for t for various degrees of freedom and confidence levels. Note that the values of t
become equal to those for z (Table 1) as the number of degrees of freedom becomes
infinite.
Table 2: Values of t for various confidence levels and degrees of freedom
Degrees of    Confidence Level, %
Freedom       90        95        99
1             6.31      12.7      63.7
2             2.92      4.30      9.92
3             2.35      3.18      5.84
4             2.13      2.78      4.60
5             2.02      2.57      4.03
6             1.94      2.45      3.71
8             1.86      2.31      3.36
10            1.81      2.23      3.17
12            1.78      2.18      3.11
13            1.78      2.16      3.06
14            1.76      2.14      2.98
∞             1.64      1.96      2.58
Example 4:
A chemist obtained the following data for the concentration of total organic carbon (TOC) in
groundwater samples: 13.55, 6.39, 13.81, 11.20, 13.88 mg/L. Calculate the 95% confidence
limits for the mean of the data.
Thus, from the given data, n = 5, x = 11.77 and s = 3.2. For degrees of freedom = 5 – 1 = 4
and 95% confidence limit, t from Table 2 = 2.78. Therefore, from Equation 3:
confidence limit for µ = x ± ts/ √n = 11.77 ± 2.78 x 3.2/ √5
= 11.77 ± 3.97
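The same calculation with the t statistic (Equation 3) can be sketched as follows; the excerpt of Table 2 and the function name are illustrative:

```python
import math
import statistics

# Excerpt of Table 2: two-sided 95% t values indexed by degrees of freedom
T95 = {1: 12.7, 2: 4.30, 3: 3.18, 4: 2.78, 5: 2.57, 10: 2.23}

def t_confidence_limits(data):
    """Equation 3: x_bar +/- t*s/sqrt(n) when sigma is unknown."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation, n - 1 degrees of freedom
    t = T95[n - 1]
    half_width = t * s / math.sqrt(n)
    return x_bar, s, x_bar - half_width, x_bar + half_width

toc = [13.55, 6.39, 13.81, 11.20, 13.88]   # TOC in mg/L, Example 4
print(t_confidence_limits(toc))            # ~ (11.77, 3.2, 7.8, 15.7)
```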
38. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 5
2. Detection of Data Outliers
Data outliers are extreme (high or low) values that diverge widely from the main body of the
data set. The presence of one or more outliers may greatly influence any calculated statistics
and yield biased results. However, there is also the possibility that the outlier is a legitimate
member of the data set. Outlier detection tests are to determine whether there is sufficient
statistical evidence to conclude that an observation appears extreme and does not belong to
the data set and should be rejected.
Data outliers may result from faulty instruments, error in transcription, misreading of
instruments, inconsistent methodology of sampling and analysis and so on. These aspects
should be investigated and if any of such reasons can be pegged to an outlier, the value
may be safely deleted from consideration. However, this is to be kept in mind that the
suspect data may rightfully belong to the set and may be the consequence of an unrecorded
event, such as, a short rainfall, intrusion of sea water or a spill.
Table 3: Critical values for rejection quotient
Number of observations    Qcrit (reject if Qexp > Qcrit)
                          90% confidence    96% confidence
3                         0.94              0.98
4                         0.76              0.85
5                         0.64              0.73
6                         0.56              0.64
7                         0.51              0.59
8                         0.47              0.54
9                         0.44              0.51
10                        0.41              0.48
Of the numerous statistical criteria available for detection of outliers, the Q test, which is
commonly used, will be discussed here. To apply the Q test, the difference between the
questionable result and its closest neighbour is divided by the spread of the entire set. The
resulting ratio, Qexp, is compared with the rejection values, Qcrit, given in Table 3, which are
critical for a particular degree of confidence. If Qexp is larger, a statistical basis for rejection
exists. The table shows only some selected values. Any standard statistical analysis book
may be consulted for a complete set.
Example 5:
Fluoride concentrations in a well were measured as 2.77, 2.80, 2.90, 2.92,
3.45, 3.95, 4.44, 4.61, 5.21, 7.46. Use the Q test to examine whether the highest value is an
outlier.
Qexp = (7.46 – 5.21) / (7.46 – 2.77) = 2.25 / 4.69 = 0.48
Since 0.48 is larger than 0.41, the Qcrit value for 90% confidence, there is a statistical basis for
excluding the value at the 90% confidence level; at 96% confidence Qexp is only borderline (Qcrit = 0.48).
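A minimal sketch of the Q test as described above, using the 96% Qcrit values from Table 3 (the helper name q_test is illustrative):

```python
# Qcrit at 96% confidence from Table 3, indexed by number of observations
Q_CRIT_96 = {3: 0.98, 4: 0.85, 5: 0.73, 6: 0.64, 7: 0.59, 8: 0.54, 9: 0.51, 10: 0.48}

def q_test(values, q_crit=Q_CRIT_96):
    """Q test on the most extreme value: gap to nearest neighbour / spread of the set."""
    data = sorted(values)
    spread = data[-1] - data[0]
    gap_low = data[1] - data[0]          # gap if the lowest value is the suspect
    gap_high = data[-1] - data[-2]       # gap if the highest value is the suspect
    suspect, gap = (data[0], gap_low) if gap_low > gap_high else (data[-1], gap_high)
    q_exp = gap / spread
    return suspect, q_exp, q_exp > q_crit[len(data)]

fluoride = [2.77, 2.80, 2.90, 2.92, 3.45, 3.95, 4.44, 4.61, 5.21, 7.46]
print(q_test(fluoride))   # suspect 7.46, Qexp ~ 0.48, not above the 96% Qcrit of 0.48
```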
Other criteria that can be used to evaluate an apparent outlier are:
- Plotting of a scatter diagram indicating inter-parameter relationship between two
constituents. If there is a correlation between the two constituents, an outlier would lie a
significant distance from the general trend.
39. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 6
- Applying a test for normal distribution by calculating the mean and standard deviation of all
the data and determining whether the extreme value lies outside the mean ±3 times the standard
deviation limit; if so, it is indeed an unusual value (see the sketch after this list). A cumulative
normal probability plot may also be used for this test.
- When the data set is not large, calculating and comparing the standard deviation, with
and without the suspect observations. A suspect value which has a considerable
influence on the calculated standard deviation suggests an outlier.
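The mean ± 3s screen mentioned in the second bullet can be sketched as follows (the helper name is illustrative and the suspect value is included when computing the statistics, as described above):

```python
import statistics

def outside_three_sigma(values, suspect):
    """Check whether a suspect value lies outside mean +/- 3 standard deviations
    computed from all the data."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return abs(suspect - mean) > 3 * s

fluoride = [2.77, 2.80, 2.90, 2.92, 3.45, 3.95, 4.44, 4.61, 5.21, 7.46]
print(outside_three_sigma(fluoride, 7.46))   # False: 7.46 is still within mean + 3s here
```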
3. Regression Analysis
In assessing environmental quality, it is often of interest to quantify relationships between two
or more variables. This may allow filling of missing data for one constituent and may also
help in predicting future levels of a constituent.
Regression analysis is focused on determining the degree to which one or more variable(s)
is dependent on another variable, the independent variable. Thus regression is a means of
calibrating the coefficients of a predictive equation. In correlation analysis, neither of the
variables is identified as more important than the other; correlation is not causation. It
provides a measure of the goodness of fit.
A common application is the calibration step of most analytical procedures, where a ‘best’
straight line is fitted to the observed response of the detector system when known amounts
of analyte (standards) are analysed. This section discusses the regression analysis
procedure employed for this purpose.
The least-squares method is the most straightforward regression procedure. Application of the
method requires two assumptions: that a linear relationship exists between the amount of analyte
(x) and the magnitude of the measured response (y), and that any deviation of individual
points from the straight line is entirely the consequence of indeterminate error, that is, no
significant error exists in the composition of the standards.
The line generated is of the form:
y = a + bx (4)
where a is the value of y when x is zero (the intercept) and b is the slope. The method
minimises the squares of the vertical displacements of data points from the best fit line.
For convenience, the following three quantities are defined:
Sxx = ∑(xi – x)² = ∑xi² – (∑xi)²/n
Syy = ∑(yi – y)² = ∑yi² – (∑yi)²/n
Sxy = ∑(xi – x)(yi – y) = ∑xiyi – (∑xi)(∑yi)/n
Calculating these quantities permits the determination of the following:
1. The slope of the line b:
b = Sxy/Sxx (5)
2. The intercept a:
a = y – bx (6)
40. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 7
3. The standard deviation about the regression line, sr:
sr = √{(Syy – b²Sxx)/(n – 2)} (7)
4. The standard deviation of the slope, sb:
sb = √(sr²/Sxx) (8)
5. The standard deviation of the results based on the calibration curve, sc:
sc = (sr/b) √{(1/m) + (1/n) + (yc – y)²/(b²Sxx)} (9)
where yc is the mean of m replicate measurements made using the calibration curve.
Example 6:
The following table gives calibration data for the chromatographic analysis of a pesticide and
the computations for fitting a straight line according to the least-squares method.
Pesticide conc.,    Peak area,    xi²        yi²        xiyi
µg/L (xi)           cm² (yi)
0.352               1.09          0.12390    1.1881     0.38368
0.803               1.78          0.64481    3.1684     1.42934
1.08                2.60          1.16640    6.7600     2.80800
1.38                3.03          1.90440    9.1809     4.18140
1.75                4.01          3.06250    16.0801    7.01750
∑   5.365           12.51         6.90201    36.3775    15.81992
Therefore
Sxx = ∑xi² – (∑xi)²/n = 1.145365
Syy = ∑yi² – (∑yi)²/n = 5.07748
Sxy = ∑xiyi – (∑xi)(∑yi)/n = 2.39669
and from Equations 5 & 6:
b = 2.0925 ≈ 2.09 and a = 0.2567 ≈ 0.26
Thus the equation of the least-squares line is
y = 0.26 + 2.09x
Note that rounding should not be performed until the end, in order to avoid rounding errors.
The data points and the fitted line are plotted in Figure 2.
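The fit in Example 6 can be reproduced with a short least-squares routine implementing Equations 5 to 8; this is a minimal sketch and the function name is illustrative:

```python
import math

def least_squares_fit(x, y):
    """Fit y = a + b*x by least squares (Equations 4 to 8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    b = sxy / sxx                                   # slope, Equation 5
    a = y_bar - b * x_bar                           # intercept, Equation 6
    sr = math.sqrt((syy - b * b * sxx) / (n - 2))   # std. dev. about the line, Equation 7
    sb = math.sqrt(sr * sr / sxx)                   # std. dev. of the slope, Equation 8
    return a, b, sr, sb

conc = [0.352, 0.803, 1.08, 1.38, 1.75]     # pesticide concentration, ug/L
area = [1.09, 1.78, 2.60, 3.03, 4.01]       # peak area, cm^2
a, b, sr, sb = least_squares_fit(conc, area)
print(round(a, 2), round(b, 2), round(sr, 3))   # ~ 0.26  2.09  0.144
```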
41. HP Training Module File: “ 48 Applied Statistics.doc” Version: September 2000 Page 8
Figure 2: Calibration Curve
Example 7:
Using the relationship derived in Example 6, calculate:
1. the concentration of pesticide in a sample if a peak area of 2.65 was obtained;
2. the standard deviation of the result based on the single measurement of 2.65;
3. the standard deviation of the result if 2.65 represents the average of 4 replicates.
1. Substituting in the equation derived in the previous example:
2.65 = 0.26 + 2.09x
x = 1.14 µg/L
2. Substituting in Equation 9 for m = 1:
sc = (0.144/2.09) x √{(1/1) + (1/5) + (2.65 – 12.51/5)²/(2.09² x 1.145)} = ±0.08
3. Substituting in Equation 9 for m = 4:
sc = (0.144/2.09) x √{(1/4) + (1/5) + (2.65 – 12.51/5)²/(2.09² x 1.145)} = ±0.05
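Equation 9 can likewise be checked with a small sketch that hard-codes the fitted values from Example 6 (b, sr, Sxx and the mean peak area); the helper name is illustrative:

```python
import math

# Values from the Example 6 fit (hard-coded so the snippet stands alone)
b, sr, sxx = 2.0925, 0.144, 1.145   # slope, std. dev. about the line, Sxx
y_bar, n = 12.51 / 5, 5             # mean peak area and number of calibration points

def sc_from_calibration(yc, m):
    """Equation 9: std. dev. of a result read from the calibration curve,
    where yc is the mean of m replicate measurements."""
    return (sr / b) * math.sqrt(1 / m + 1 / n + (yc - y_bar) ** 2 / (b ** 2 * sxx))

print(round(sc_from_calibration(2.65, m=1), 2))   # ~ 0.08, single measurement
print(round(sc_from_calibration(2.65, m=4), 2))   # ~ 0.05, mean of 4 replicates
```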