1. Henry R. Kang (1/2010)
General Chemistry
Lecture 5
Statistical Data Analysis
2. Henry R. Kang (7/2008)
Outline
• Fundamental Statistics
• Accuracy and Precision
• Data Rejection
3. Henry R. Kang (1/2010)
Accuracy & Precision
• Accuracy
Accuracy is a measure of the closeness of a
measured quantity to the true value.
• Precision
Precision is a measure of how closely two or more
replicate measurements of the same quantity agree
with one another.
5. Henry R. Kang (7/2008)
Errors
• All Measurements Contain Errors.
• Types of Errors
Systematic errors
One-sided errors (either positive or negative)
• Usually from a single source
• Resulting data are consistently high or low
Results may be precise but inaccurate
• Examples: a balance that is incorrectly zeroed; an incorrect constant used in
calculations.
Random errors
Occur randomly
Positive and negative deviations occur with equal frequency and size.
• The deviations follow a bell-shaped curve (Gaussian, or normal, distribution)
The source of the error is usually not known
6. Henry R. Kang (7/2008)
Gaussian Distribution
• Gaussian distribution gives the distribution of data points with respect to the
true value. It gives a bell-shaped curve as shown in the figure.
The closer to the true value, the higher the probability.
[Figure: bell-shaped Gaussian curve. x-axis: Standard Deviation (-3 to 3); y-axis: Probability (0 to 0.4)]
7. Henry R. Kang (7/2008)
Measuring Accuracy
• Percent Error
If the true value is known:
$\%\ \text{error} = \dfrac{|\text{true value} - \text{experimental value}|}{|\text{true value}|} \times 100$
• Parts Per Thousand (PPT)
$\text{PPT} = \dfrac{|\text{true value} - \text{experimental value}|}{|\text{true value}|} \times 1000$
• Parts Per Million (PPM)
$\text{PPM} = \dfrac{|\text{true value} - \text{experimental value}|}{|\text{true value}|} \times 10^{6}$
• Unfortunately, the true value is often not known.
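The three accuracy measures differ only in the scale factor, which a short Python sketch can make concrete. The function name `relative_error` is hypothetical, and the numbers are borrowed for illustration from the %S examples later in the lecture (true value 32.69%, one reading of 28.72%).

```python
def relative_error(true_value, experimental_value, scale=100.0):
    """Return |true - experimental| / |true| multiplied by a scale factor.

    scale = 100 gives percent error, 1000 gives ppt, 1e6 gives ppm.
    """
    if true_value == 0:
        raise ValueError("relative error is undefined when the true value is zero")
    return abs(true_value - experimental_value) / abs(true_value) * scale


true_value = 32.69   # assumed true value (illustration only)
measured = 28.72     # assumed single measurement (illustration only)

percent_error = relative_error(true_value, measured, 100.0)
ppt = relative_error(true_value, measured, 1_000.0)
ppm = relative_error(true_value, measured, 1_000_000.0)

print(f"{percent_error:.2f} %   {ppt:.1f} ppt   {ppm:.0f} ppm")
```

Because only the multiplier changes, 1% corresponds to 10 ppt and to 10,000 ppm.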
8. Henry R. Kang (7/2008)
Measuring Precision
• Mean (or Average)
• Deviation and Absolute Deviation
• Absolute Average Deviation
• Relative Deviation
• Relative Average Deviation (RAD)
• Standard Deviation
• Relative Standard Deviation
9. Henry R. Kang (7/2008)
Mean (Average)
• For multiple measurements of a given quantity,
we have numerical values x1, x2, x3, ..., xn, where
n is the number of measurements.
• Sum is defined as
Sum = x1 + x2 + x3 + ... + xn = ∑ xi
• Mean xavg is defined as
$x_{avg} = \dfrac{\text{Sum}}{n} = \dfrac{\sum x_i}{n}$
10. Henry R. Kang (7/2008)
Deviation & Absolute Deviations
• Deviation is the difference (or variation) of a single measurement,
xi, from the mean value, xavg.
d1 = x1 – xavg
d2 = x2 – xavg
d3 = x3 – xavg
...
dn = xn – xavg
• Absolute deviation is the magnitude of the deviation, so it is never negative.
d1 = | x1 – xavg|
d2 = | x2 – xavg|
d3 = | x3 – xavg|
...
dn = | xn – xavg|
11. Henry R. Kang (7/2008)
Absolute Average Deviation
• Absolute average deviation, davg, is the arithmetic
mean of individual absolute deviations, di.
d1 = | x1 – xavg|
d2 = | x2 – xavg|
d3 = | x3 – xavg|
...
dn = | xn – xavg|
$d_{avg} = \dfrac{\sum d_i}{n}$
12. Henry R. Kang (7/2008)
Relative Deviation
• Relative deviation, Di, is the ratio of
individual absolute deviations, di, to the
mean value, xavg.
D1 = d1 / xavg = | x1 – xavg| / xavg
D2 = d2 / xavg = | x2 – xavg| / xavg
D3 = d3 / xavg = | x3 – xavg| / xavg
...
Di = di / xavg = | xi – xavg| / xavg
...
13. Henry R. Kang (7/2008)
Relative Average Deviation
• Relative average deviation (RAD) is the
absolute average deviation relative to
the mean xavg
$\text{RAD (ppt)} = \dfrac{d_{avg}}{x_{avg}} \times 1000$
A precision of 3 ppt or less is considered
very good.
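The chain from mean to absolute deviations to RAD can be followed step by step in a short Python sketch. The function name `precision_summary` and the three readings are assumptions made for illustration; they are not taken from the lecture.

```python
def precision_summary(values):
    """Return (x_avg, absolute deviations, d_avg, RAD in ppt) for replicate data."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one measurement is required")
    x_avg = sum(values) / n                      # mean
    abs_dev = [abs(x - x_avg) for x in values]   # d_i = |x_i - x_avg|
    d_avg = sum(abs_dev) / n                     # absolute average deviation
    rad_ppt = d_avg / x_avg * 1000               # relative average deviation, ppt
    return x_avg, abs_dev, d_avg, rad_ppt


data = [35.00, 35.05, 35.10]   # hypothetical replicate readings
x_avg, abs_dev, d_avg, rad_ppt = precision_summary(data)
print(f"mean = {x_avg:.2f}, d_avg = {d_avg:.4f}, RAD = {rad_ppt:.2f} ppt")
```

For these readings the RAD comes out just under 1 ppt, which would count as very good by the 3 ppt guideline.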
14. Henry R. Kang (7/2008)
Standard Deviation
• Standard deviation (σ) is useful in estimating data points
distribution in the form of the Gaussian distribution (a
bell-shaped curve).
(xavg ± σ) incorporates 68.3% of the data points.
(xavg ± 3σ) incorporates 99.7% of the data points.
The smaller the σ, the smaller the spread of the data points.
d1 = x1 – xavg
d2 = x2 – xavg
d3 = x3 – xavg
...
dn = xn – xavg
$\sigma = \sqrt{\dfrac{\sum d_i^{2}}{n - 1}} = \sqrt{\dfrac{d_1^{2} + d_2^{2} + d_3^{2} + \cdots + d_n^{2}}{n - 1}}$
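As a check on the formula, the sketch below computes σ directly from the squared deviations with the n – 1 denominator and compares the result with Python's standard-library `statistics.stdev`, which uses the same denominator. The function name `sample_std` is hypothetical; the data are the three %S readings used in Example 1 later in the lecture.

```python
import math
import statistics


def sample_std(values):
    """Standard deviation with the n - 1 denominator, as defined on the slide."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two measurements are required")
    x_avg = sum(values) / n
    squared_dev = [(x - x_avg) ** 2 for x in values]   # d_i squared
    return math.sqrt(sum(squared_dev) / (n - 1))


data = [28.72, 28.40, 28.57]
sigma = sample_std(data)
print(f"sigma = {sigma:.3f}")
print(math.isclose(sigma, statistics.stdev(data)))   # both use n - 1
```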
15. Henry R. Kang (7/2008)
Relative Standard Deviation
• Relative standard deviation (σr) is the standard
deviation relative to the mean value.
d1 = x1 – xavg
d2 = x2 – xavg
d3 = x3 – xavg
...
dn = xn – xavg
$\sigma_r = \sqrt{\dfrac{\sum (d_i / x_{avg})^{2}}{n - 1}} = \sqrt{\dfrac{D_1^{2} + D_2^{2} + D_3^{2} + \cdots + D_n^{2}}{n - 1}}$
where n is the number of measurements and Di = di / xavg
or σr (ppt) = (σ / xavg ) × 1000
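The two forms of σr are algebraically the same, because dividing every deviation by xavg simply divides σ by xavg. A brief Python sketch can confirm this numerically; the function names are hypothetical and the data are the Example 1 readings from later in the lecture.

```python
import math


def relative_std_from_ratios(values):
    """sigma_r from the relative deviations D_i = d_i / x_avg."""
    n = len(values)
    x_avg = sum(values) / n
    ratios = [(x - x_avg) / x_avg for x in values]
    return math.sqrt(sum(r * r for r in ratios) / (n - 1))


def relative_std_from_sigma(values):
    """sigma_r computed as sigma / x_avg."""
    n = len(values)
    x_avg = sum(values) / n
    sigma = math.sqrt(sum((x - x_avg) ** 2 for x in values) / (n - 1))
    return sigma / x_avg


data = [28.72, 28.40, 28.57]
sr_a = relative_std_from_ratios(data)
sr_b = relative_std_from_sigma(data)
print(f"sigma_r = {sr_a:.5f} = {sr_a * 1000:.1f} ppt")
print(math.isclose(sr_a, sr_b))
```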
16. Henry R. Kang (7/2008)
Gaussian Distribution
• Gaussian distribution gives the
distribution of data points with
respect to the true value. It gives a
bell-shaped curve as shown in the
figure.
[Figure: bell-shaped Gaussian curve. x-axis: Standard Deviation (-3 to 3); y-axis: Probability (0 to 0.4)]
• The Gaussian equation is
$P(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\dfrac{(x - X)^{2}}{2\sigma^{2}}\right]$
where σ is the standard deviation and X is the true value.
The closer to the true value, the higher the probability.
The area under the curve (the integral of the Gaussian function) gives the
fraction of data points within a given range:
(xtrue ± σ) incorporates 68.3% of the data points.
(xtrue ± 3σ) incorporates 99.7% of the data points.
(xtrue ± 3.8906σ) incorporates 99.99% of the data points.
(xtrue ± 4.4172σ) incorporates 99.999% of the data points.
(xtrue ± 6σ) incorporates nearly 100% of the data points.
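The percentages listed above follow from integrating the Gaussian function between X – kσ and X + kσ, which reduces to erf(k/√2). The Python sketch below evaluates both the curve and that integral using only the standard `math` module; the function names `gaussian_pdf` and `coverage` are hypothetical.

```python
import math


def gaussian_pdf(x, true_value, sigma):
    """P(x) = exp(-(x - X)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return math.exp(-((x - true_value) ** 2) / (2 * sigma ** 2)) / (
        math.sqrt(2 * math.pi) * sigma
    )


def coverage(k):
    """Fraction of data points expected within X ± k·sigma."""
    return math.erf(k / math.sqrt(2))


print(f"peak height for sigma = 1: {gaussian_pdf(0.0, 0.0, 1.0):.4f}")
for k in (1, 2, 3, 6):
    print(f"within ±{k} sigma: {coverage(k) * 100:.5f} %")
```

The peak height for σ = 1 is about 0.399, matching the top of the plotted curve, and `coverage(1)` and `coverage(3)` come out near 0.683 and 0.997.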
17. Henry R. Kang (7/2008)
Standard Deviation & Data Distribution
• The smaller the σ, the smaller the spread of the data points.
[Figure: Gaussian curves for σ = 0.5, σ = 1.0, and σ = 2.0. x-axis: Standard Deviation (-4 to 4); y-axis: Probability (0 to 0.8)]
18. Henry R. Kang (7/2008)
Approximation of Standard Deviation
• Calculating the standard deviation takes several steps;
a simple approximation based on the range of the data
requires much less computation.
• š = Ř/√N
Ř is the range of data points from the lowest value to the
highest value
Ř = xmax – xmin
N is the number of data points.
• For a small number of measurements the approximation
is accurate enough to replace the formal standard
deviation.
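To see how close the approximation gets for a small data set, the sketch below compares š = Ř/√N with the formal n – 1 standard deviation from Python's `statistics.stdev`. The function name `range_estimate` is hypothetical; the data are the three readings of Example 1 later in the lecture.

```python
import math
import statistics


def range_estimate(values):
    """Approximate standard deviation: (x_max - x_min) / sqrt(N)."""
    if len(values) < 2:
        raise ValueError("at least two measurements are required")
    return (max(values) - min(values)) / math.sqrt(len(values))


data = [28.72, 28.40, 28.57]
s_est = range_estimate(data)
s_formal = statistics.stdev(data)
print(f"range-based estimate = {s_est:.3f}, formal value = {s_formal:.3f}")
```

For this set the two values are about 0.18 and 0.16, so the estimate is slightly larger than the formal result but of the same size.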
20. Henry R. Kang (1/2010)
Accuracy & Precision of Measurements
• Accuracy is a measure of the closeness of a measured quantity to
the true value.
• Precision is a measure of the agreement of replicate
measurements.
• Measurements can be precise but not accurate or accurate but not
precise or neither. The best result is, of course, accurate and
precise.
[Figure: four panels labeled "accurate and precise", "precise but not accurate", "accurate but not precise", and "not accurate and not precise"]
21. Henry R. Kang (1/2010)
Example 1 of Accuracy and Precision
• Measured %S values in H2SO4 are 28.72%, 28.40%, and 28.57%,
where the true value is 32.69%. Determine the accuracy and
precision.
• Answer:
Mean = (28.72% + 28.40% + 28.57%) / 3 = 28.56%
Estimated precision by using the approximation: š = Ř / √N
š = (28.72 – 28.40)% / 3^(1/2)
= 0.32% / 1.732 = 0.18 %
Relative standard deviation: sr = š / xM
sr = 0.18% / 28.56% = 0.0063
Accuracy = |X − xM| = | 32.69% − 28.56% | = 4.13%
Relative accuracy = Accuracy / True value
= 4.13% / 32.69% = 0.126
• These results indicate that the data are precise but inaccurate.
22. Henry R. Kang (1/2010)
Example 2 of Accuracy and Precision
• Measured %S values in H2SO4 are 28.89%, 32.56%, and 36.64%,
where the true value is 32.69%. Determine the accuracy and
precision.
• Answer:
Mean = (28.89% + 32.56% + 36.64%) / 3 = 32.70%
Estimated precision by using the approximation: š = Ř / √N
š = (36.64 – 28.89)% / 3^(1/2)
= 7.75% / 1.732 = 4.47 %
Relative standard deviation: sr = š / xM
sr = 4.47% / 32.70% = 0.137
Accuracy = |X − xM| = | 32.69% − 32.70% | = 0.01%
Relative accuracy = Accuracy / True value
= 0.01% / 32.69% = 0.0003
• These results indicate that the data are imprecise but accurate.
23. Henry R. Kang (1/2010)
Example 3 of Accuracy and Precision
• Measured %S values in H2SO4 are 25.62%, 33.56%, and 27.93%,
where the true value is 32.69%. Determine the accuracy and
precision.
• Answer:
Mean = (25.62% + 33.56% + 27.93%) / 3 = 29.04%
Estimated precision by using the approximation: š = Ř / √N
š = (33.56 – 25.62)% / 3^(1/2)
= 7.94% / 1.732 = 4.58 %
Relative standard deviation: sr = š / xM
sr = 4.58% / 29.04% = 0.158
Accuracy = |X − xM| = | 32.69% − 29.04% | = 3.65%
Relative accuracy = Accuracy / True value
= 3.65% / 32.69% = 0.112
• These results indicate that the data are imprecise and inaccurate.
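The same steps can be applied to all three data sets at once. The Python sketch below is one possible way to do so; the function name `assess` and the dictionary keys are hypothetical. It carries full precision through every step, so the trailing digits can differ slightly from hand calculations that round intermediate values.

```python
import math


def assess(values, true_value):
    """Return the mean, range-based precision, and accuracy measures."""
    n = len(values)
    mean = sum(values) / n
    s_est = (max(values) - min(values)) / math.sqrt(n)   # range / sqrt(N)
    accuracy = abs(true_value - mean)                     # |X - xM|
    return {
        "mean": mean,
        "s_est": s_est,
        "s_rel": s_est / mean,
        "accuracy": accuracy,
        "rel_accuracy": accuracy / true_value,
    }


TRUE_PERCENT_S = 32.69
examples = {
    "Example 1": [28.72, 28.40, 28.57],
    "Example 2": [28.89, 32.56, 36.64],
    "Example 3": [25.62, 33.56, 27.93],
}
results = {name: assess(vals, TRUE_PERCENT_S) for name, vals in examples.items()}

for name, r in results.items():
    print(
        f"{name}: mean = {r['mean']:.2f}%, s_est = {r['s_est']:.2f}%, "
        f"s_rel = {r['s_rel']:.4f}, accuracy = {r['accuracy']:.2f}%, "
        f"rel. accuracy = {r['rel_accuracy']:.4f}"
    )
```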
25. Henry R. Kang (7/2008)
Data Rejection
• Replicate measurements of a given quantity are usually
scattered.
Some values are closer than others.
• Which values to keep (or which values to discard)
If a single result differs greatly from the others because of an
identifiable error by the experimenter, then this result
should be discarded.
If a result is significantly “off”, but there is no error in the
experiment, then the result, in general, should be kept.
• If in doubt, use the rejection coefficient Q test.
• Do not discard any result just to get “good precision”.
26. Henry R. Kang (7/2008)
Q Test
• Q test is used to test the extreme values (the highest and lowest
values)
• Procedure
Calculate the range
Range = xmax – xmin
Calculate the difference between each extreme value and its nearest
neighbor
dhi = xmax – xnbor,hi; dlo = | xmin – xnbor,lo |
Calculate the ratio (Q value) of each difference to the range
Qhi = dhi / Range; Qlo = dlo / Range
• Compare the resulting Q value with the rejection table at 90%
confidence level (or other selected confidence level)
If the calculated Q value is greater than the Q value given in the table, then
reject the value.
27. Henry R. Kang (7/2008)
Rejection Q Tables
Number of Data   Q90    Q96    Q99
3                0.94   0.98   0.99
4                0.76   0.85   0.93
5                0.64   0.73   0.82
6                0.56   0.64   0.74
7                0.51   0.59   0.68
8                0.47   0.54   0.63
9                0.44   0.51   0.60
10               0.41   0.48   0.57
28. Henry R. Kang (7/2008)
Q Test - Example
• Data: 35.00, 35.05, 35.10, 35.80
• Calculate the range
Range = xmax – xmin= 35.80 – 35.00 = 0.80
• Calculate the difference between each extreme value and its
nearest neighbor.
dhi = xmax – xnbor,hi = 35.80 – 35.10 = 0.70
dlo = | xmin – xnbor,lo | = | 35.00 – 35.05 | = 0.05
• Calculate the Q values (the ratio of each difference to the range).
Qhi = dhi / Range = 0.70 / 0.80 = 0.88
Qlo = dlo / Range = 0.05 / 0.80 = 0.063
• Compare the resulting Q value with the rejection table at 90%
confidence level.
For 4 samples, the Q value in the table is 0.76
Qhi > 0.76; therefore, the highest value 35.80 can be dropped
Once the value is dropped, it is no longer in the data set and should not
be used for the calculations of mean and various deviations.
#Data Q90
3 0.94
4 0.76
5 0.64
6 0.56
7 0.51
8 0.47
9 0.44
10 0.41
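The Q test procedure can be sketched in Python as follows. The function name `q_test` and the dictionary `Q90` are hypothetical; the critical values are copied from the Q90 column of the rejection table, and the data are those of the worked example.

```python
# Q90 critical values keyed by the number of data points (from the table).
Q90 = {3: 0.94, 4: 0.76, 5: 0.64, 6: 0.56, 7: 0.51, 8: 0.47, 9: 0.44, 10: 0.41}


def q_test(values, q_table=Q90):
    """Return (Q_hi, Q_lo, rejected values) for the two extreme values."""
    n = len(values)
    if n not in q_table:
        raise ValueError("the table covers 3 to 10 data points only")
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        raise ValueError("all values are identical; the Q value is undefined")
    q_hi = (ordered[-1] - ordered[-2]) / data_range   # highest vs. its neighbor
    q_lo = (ordered[1] - ordered[0]) / data_range     # lowest vs. its neighbor
    q_crit = q_table[n]
    rejected = []
    if q_hi > q_crit:
        rejected.append(ordered[-1])
    if q_lo > q_crit:
        rejected.append(ordered[0])
    return q_hi, q_lo, rejected


data = [35.00, 35.05, 35.10, 35.80]
q_hi, q_lo, rejected = q_test(data)
kept = [x for x in data if x not in rejected]
print(f"Q_hi = {q_hi:.3f}, Q_lo = {q_lo:.3f}, rejected = {rejected}")
print(f"mean of retained values = {sum(kept) / len(kept):.2f}")
```

Only 35.80 exceeds the critical value of 0.76 for four data points, so the mean is taken over the remaining three values.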