This document defines key statistical terms and concepts. It discusses populations and samples, measures of central tendency like mean and median, measures of variation like standard deviation and coefficient of variation, distributions like Gaussian and standard normal, and methods of analyzing data like linear regression and correlation coefficient. Uncertainty analysis is also covered, including identifying possible outliers using z-scores and Chauvenet's criterion.
1. Statistics: Terms and Definitions
Population: The complete set of data; treated as continuous.
Sample: A subset of the population; discrete. Inferential statistics draws conclusions about the population from a sample.
Every statistical problem contains five elements:
•Questions to be answered. Identification of the populations
•Design of experiment, sampling procedure
•Analysis of the sampled data (equations and distributions)
•Inference (based on confidence level)
•How good the inference is, measure of goodness
2. Statistics: Terms and Definitions
Measurements: Single Point
Multiple Point
Uncertainty is the total error associated with a measurement, stated at a specific level of confidence.
Errors: Bias or fixed error (Systematic Error)
Precision or random error
Mean $= \mu = \bar{x} = \frac{\sum x_i}{n}$, where $x_i$ is the sample and $n$ is the total number of samples.
Variance $= \sigma^2 = s^2 = \frac{1}{n-1} \sum (\bar{x} - x_i)^2$
Average deviation from the mean $= \frac{1}{n} \sum |\bar{x} - x_i|$
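R.M.S. deviation from the mean $= \sqrt{\frac{1}{n} \sum (\bar{x} - x_i)^2}$
Standard Deviation (SD) $= s = \sigma = \sqrt{s^2} = \sqrt{\sigma^2}$
Coefficient of Variation: The relative variation of the data, $\frac{s}{\bar{x}}$
Standard Error of the Mean $= s_{\bar{x}} = \frac{s}{\sqrt{n}}$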
Mode: The most frequently occurring value in the measurements.
Median: The central value when the data are arranged in ascending or descending order.
Degrees of freedom: F or DF = n − k, where k is the number of constraints imposed on the data.
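The definitions above can be checked numerically with a short Python sketch. The sample values are hypothetical, the function name `describe` is illustrative, and the average deviation is assumed here to be the mean absolute deviation.

```python
import math
import statistics


def describe(sample):
    """Return the summary statistics defined above for a list of numbers."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((mean - x) ** 2 for x in sample)  # sum of squared deviations
    variance = sum_sq / (n - 1)                    # sample variance, n - 1 in the denominator
    std = math.sqrt(variance)
    return {
        "mean": mean,
        "variance": variance,
        "std": std,
        # Assumption: average deviation taken as the mean absolute deviation.
        "avg_dev": sum(abs(mean - x) for x in sample) / n,
        "rms_dev": math.sqrt(sum_sq / n),
        "cv": std / mean,            # coefficient of variation
        "sem": std / math.sqrt(n),   # standard error of the mean
        "mode": statistics.mode(sample),
        "median": statistics.median(sample),
        "dof": n - 1,                # k = 1: the mean is the single constraint used here
    }


# Hypothetical sample, for illustration only.
stats = describe([2.0, 4.0, 4.0, 5.0, 7.0, 8.0])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Note that the variance divides by n − 1 while the R.M.S. deviation divides by n, so the two differ noticeably for small samples.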
3. Probability Density Function (PDF)
Probability is a measure of how likely an event is to occur.
Probability of an event between a and b:
$P(a < x < b) = \int_a^b p(x)\,dx$
Total probability: $\int_{-\infty}^{\infty} p(x)\,dx = 1$
Gaussian Distribution
$p(x) = \frac{1}{\sigma_x \sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma_x^2}\right]$
4. Standard Normal Distribution
If the data set is large and random, then after the following conversion it should follow a standard normal distribution.
$z = \frac{x - \mu}{\sigma_x}$
$p(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$
Area under the curve is one.
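As a rough numerical check of the two density functions, the sketch below integrates them with the trapezoidal rule. The values of `mu` and `sigma` are hypothetical, and the helper names are illustrative. It shows that P(μ − σ < x < μ + σ) for a Gaussian equals P(−1 < z < 1) for the standard normal, and that the total area is one.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def standard_normal_pdf(z):
    """Standard normal density (mean 0, standard deviation 1)."""
    return math.exp(-(z ** 2) / 2) / math.sqrt(2 * math.pi)


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f from a to b with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


mu, sigma = 10.0, 2.0  # hypothetical population parameters

# P(mu - sigma < x < mu + sigma) computed directly on x ...
p_x = integrate(lambda x: gaussian_pdf(x, mu, sigma), mu - sigma, mu + sigma)
# ... and the same probability after converting to z = (x - mu) / sigma.
p_z = integrate(standard_normal_pdf, -1.0, 1.0)
# Total area, truncated at +/- 8 standard deviations.
area = integrate(standard_normal_pdf, -8.0, 8.0)

print(p_x, p_z, area)
```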
5. Histogram
A histogram shows how often the data fall within each increment, which gives the probability of events in that increment. It can be used to check whether the data follow a standard distribution. The following steps can be used to draw a histogram:
–Choose a number of class intervals (usually between 5 and 20) that cover the data range. Select the class marks, which are the mid-points of the class intervals. If the data are arranged in ascending order, the first data point should fall in the first class interval.
–For each class interval, determine the number of data points that fall within that interval. If a data point falls exactly on a division point, it is placed in the lower interval.
–Construct rectangles with centers at the class marks and areas proportional to the class frequencies. If the widths of the rectangles are the same, then the heights of the rectangles represent the class frequencies.
6. Histogram
Data: 25 data points.
3.0, 6.0, 7.5, 15.0, 12.0, 6.5, 8.0, 4.0, 5.5, 6.5, 5.5,
12.0, 1.0, 3.5, 3.0, 7.5, 5.0, 10.0, 8.0, 3.5, 9.0, 2.0,
6.5, 1.0, 5.0
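$\Delta x = \frac{x_{max} - x_{min}}{\text{class intervals}} = \frac{15.0 - 1.0}{6} = 2.33$
The class boundaries below use a width of 2.4, starting from $x_1 = -0.2$:
$x_2 = x_1 + \Delta x = -0.2 + 2.4 = 2.2$
$x_3 = x_2 + \Delta x = 2.2 + 2.4 = 4.6$
$x_4 = x_3 + \Delta x = 4.6 + 2.4 = 7.0$
$x_5 = x_4 + \Delta x = 7.0 + 2.4 = 9.4$
$x_6 = x_5 + \Delta x = 9.4 + 2.4 = 11.8$
$x_7 = x_6 + \Delta x = 11.8 + 2.4 = 14.2$
$x_8 = x_7 + \Delta x = 14.2 + 2.4 = 16.6$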
Class   Subinterval start   Subinterval end   Class mark   Class frequency
1       -0.2                2.2               1.0          3
2       2.2                 4.6               3.4          5
3       4.6                 7.0               5.8          8
4       7.0                 9.4               8.2          5
5       9.4                 11.8              10.6         1
6       11.8                14.2              13.0         2
7       14.2                16.6              15.4         1
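The class frequencies can be reproduced with a short Python sketch using the 25 data points and the class boundaries above (start −0.2, width 2.4, seven classes). The function name `class_frequencies` is illustrative. A value that falls exactly on a division point goes into the lower interval, as described in the steps for drawing a histogram.

```python
data = [
    3.0, 6.0, 7.5, 15.0, 12.0, 6.5, 8.0, 4.0, 5.5, 6.5, 5.5,
    12.0, 1.0, 3.5, 3.0, 7.5, 5.0, 10.0, 8.0, 3.5, 9.0, 2.0,
    6.5, 1.0, 5.0,
]


def class_frequencies(values, start, width, n_classes):
    """Count values per class; a value on a division point goes to the lower class."""
    # Round the edges to suppress floating-point noise such as 4.6000000000000005.
    edges = [round(start + i * width, 10) for i in range(n_classes + 1)]
    counts = [0] * n_classes
    for x in values:
        for i in range(n_classes):
            if edges[i] < x <= edges[i + 1]:
                counts[i] += 1
                break
    marks = [round((edges[i] + edges[i + 1]) / 2, 10) for i in range(n_classes)]
    return edges, marks, counts


edges, marks, counts = class_frequencies(data, start=-0.2, width=2.4, n_classes=7)

# Plain-text histogram: one '#' per data point in the class.
for mark, count in zip(marks, counts):
    print(f"{mark:5.1f} | {'#' * count} ({count})")
```

Because the bars share one width, the count in each row plays the role of the rectangle height.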
10. Uncertainty and Level of Confidence
The variation of the mean value is identified by the number of standard deviations (± σ or ± s) we select. That number is tied to the level of confidence we choose, which indicates how sure we are that our data fall within the identified range of standard deviations.
The relationships between the confidence level and the standard deviation are approximately as follows:
68% level of confidence: ± s
95% level of confidence: ± 2s (this is what engineers use, unless stated otherwise)
99% level of confidence: ± 3s
For a large sample: $\bar{x} \pm t_{\alpha}\, s_{\bar{x}}$
Here α = 1 − level of confidence.
For a small sample: $\bar{x} \pm t_{\alpha/2}\, \frac{s_x}{\sqrt{n}}$
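A minimal Python sketch of the interval calculation follows. It makes two assumptions: the interval is two-sided, and for a large sample the critical value is taken from the standard normal distribution. The standard library has no t distribution, so for a small sample the critical value must be looked up in a t table and passed in by hand. The sample values and function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def z_critical(confidence):
    """Two-sided critical value from the standard normal (large-sample assumption)."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def confidence_interval(sample, critical_value):
    """Return (low, high) = mean -/+ critical_value * s / sqrt(n)."""
    x_bar = mean(sample)
    sem = stdev(sample) / math.sqrt(len(sample))  # standard error of the mean
    half_width = critical_value * sem
    return x_bar - half_width, x_bar + half_width


# Hypothetical small sample (n = 6). The value 2.571 is the tabulated
# two-sided 95% t value for 5 degrees of freedom.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
low, high = confidence_interval(sample, critical_value=2.571)
print(f"95% interval for the mean: {low:.2f} to {high:.2f}")

# For a large sample the normal value can be used instead.
print(f"z for 95%: {z_critical(0.95):.3f}")
```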
11. Identification of Possible Bad Data Point
Z score: The z score is a measure of the relative standing of a data point within the sample.
$z = \frac{x - \bar{x}}{s}$
Data points with |z| greater than 1.96 (95% level of confidence) are discarded.
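The z-score test can be sketched in a few lines of Python. The readings are hypothetical, the function names are illustrative, and the comparison uses the magnitude of z so that both unusually high and unusually low points are flagged.

```python
from statistics import mean, stdev


def z_scores(sample):
    """z = (x - mean) / s for every point, with s the sample standard deviation."""
    x_bar = mean(sample)
    s = stdev(sample)
    return [(x - x_bar) / s for x in sample]


def flag_outliers(sample, z_limit=1.96):
    """Return the points whose |z| exceeds z_limit (1.96 for 95% confidence)."""
    return [x for x, z in zip(sample, z_scores(sample)) if abs(z) > z_limit]


# Hypothetical readings with one suspicious value.
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 15.0]
flagged = flag_outliers(readings)
print(flagged)
```

A single extreme point inflates both the mean and s, so it is worth recomputing the statistics after a point is discarded.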
Chauvenet's Criterion:
•For a sample population, calculate $\bar{x}$ and $\sigma_x$.
•Using the sample size n, find $\sigma_{max} / \sigma_x$.
•Knowing $\sigma_x$, find $\sigma_{max}$ from the table below.
•Calculate $|\bar{x} - x|$. Here $x$ is the sample that you are assessing. If the difference is larger than $\sigma_{max}$, the sample is discarded; otherwise it is retained.
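The steps above can be sketched in Python. The lookup table is not reproduced in this text, so the sketch swaps in the usual closed form behind such tables: a point is rejected when the probability of a deviation at least that large is below 1/(2n), which gives $\sigma_{max} / \sigma_x$ as the standard normal quantile at 1 − 1/(4n). This substitution is an assumption, and the function names and readings are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def chauvenet_ratio(n):
    """sigma_max / sigma_x for sample size n, from the normal distribution.

    Assumption: stands in for the lookup table, using P(|z| > ratio) = 1 / (2n).
    """
    return NormalDist().inv_cdf(1.0 - 1.0 / (4.0 * n))


def chauvenet_reject(sample):
    """Return the points whose deviation from the mean exceeds sigma_max."""
    n = len(sample)
    x_bar = mean(sample)
    sigma_x = stdev(sample)
    sigma_max = chauvenet_ratio(n) * sigma_x
    return [x for x in sample if abs(x_bar - x) > sigma_max]


# Hypothetical readings with one suspicious value.
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 15.0]
rejected = chauvenet_reject(readings)
print(f"ratio for n = {len(readings)}: {chauvenet_ratio(len(readings)):.3f}")
print(f"rejected: {rejected}")
```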
12. Linear Regression
Linear regression is used extensively for calibration. It gives a relationship between input (x) and output (y). Calibration is used to eliminate bias error.
$y = a_0 + a_1 x$
Where:
The error associated with fitting the data with this equation is:
This is a mathematical error.
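A minimal least-squares sketch for the line $y = a_0 + a_1 x$ is given below. The coefficient and error expressions are not shown in this text, so the sketch assumes the standard least-squares formulas and takes the fitting error to be the standard error of the fit with n − 2 degrees of freedom. The calibration data and function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a0 and slope a1 for y = a0 + a1 * x."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    a0 = (sy - a1 * sx) / n
    return a0, a1


def standard_error_of_fit(xs, ys, a0, a1):
    """Assumed error measure: sqrt(sum of squared residuals / (n - 2))."""
    n = len(xs)
    ss_res = sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ss_res / (n - 2))


# Hypothetical calibration data: instrument input x against reading y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a0, a1 = fit_line(xs, ys)
err = standard_error_of_fit(xs, ys, a0, a1)
print(f"y = {a0:.3f} + {a1:.3f} x, standard error of fit = {err:.3f}")
```

Two degrees of freedom are removed because the fit uses the data to estimate both $a_0$ and $a_1$.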