1. The Role of Data Quality
Assessment in a Project
Dr. Ferdin Joe John Joseph
Kamnoetvidya Science Academy
Rayong, Thailand
ferdinjoe@gmail.com
2. Course Objectives
At the end of this course, you will be able to:
•Explain why DQA is important and how it can be applied to
your projects
•List the five steps of DQA and explain the purpose of each
step
•Evaluate the application of DQA on a dataset
•Interpret basic statistics and simple graphs
•Recognize different software tools and other resources for
performing DQA
3. Data Quality
Meaningful only when "data quality" relates to
intended use of data
Some data are good ("high quality") for some
purposes but are bad ("low quality") for others
4. Data Quality Assessment
A scientific and a statistical evaluation to determine if
data are adequate for their intended use
DQA is described in the Guidance for Data Quality
Assessment: Practical Methods for Data Analysis
(EPA/QA G-9), EPA/600/R-96/084, July 2000
5. The Project Life Cycle
Product or Decision
Plan for Data Collection - Set data quality
objectives or other performance and acceptance
criteria. Document in QA Project Plan.
Collect Data - Collect/assemble data in
accordance with QA Project Plan. Perform
assessments defined in Plan.
Assess and Use Data - Verify whether the
data meet acceptance criteria. Run
statistical methods to analyze data.
6. DQA is Performed ...
Whenever data are used to make a decision, for
estimation, or for research purposes
This applies to:
–New data to be collected
–Data collected by someone else
–Data collected by you for another project
7. [Diagram: DQA within the planning, implementation, and assessment phases]
PLANNING: Systematic Planning (e.g., Data Quality Objectives Process); QA Project Plan Development
IMPLEMENTATION: Field Data Collection and Associated QA/QC Activities
ASSESSMENT: Data Validation/Verification; Data Quality Assessment
QUALITY ASSURANCE ASSESSMENT
–INPUTS: Routine Data; QC/Performance Evaluation Data
–DATA VALIDATION/VERIFICATION: Verify measurement performance; verify measurement procedures and reporting requirements
–OUTPUT (and INPUT to the next stage): VALIDATED/VERIFIED DATA
–DATA QUALITY ASSESSMENT: Review objectives and design; conduct preliminary data review; select statistical method; verify assumptions; draw conclusions
–OUTPUT: CONCLUSIONS DRAWN FROM DATA
8. Verification vs. Validation vs. Assessment
Data Verification - the process of evaluating the
completeness, correctness, and conformance/compliance
of a specific data set against the method,
procedural, or contractual requirements
Data Validation - an analyte- and sample-specific
process that extends the evaluation of data beyond
method, procedural, or contractual compliance (i.e.,
data verification) to determine the analytical quality of
a specific data set
Data Quality Assessment - the process to determine if
the data are suitable for a specific use
9. The Five Steps of
Data Quality Assessment
1. Review the Objectives and Sampling Design
2. Conduct a Preliminary Data Review
3. Select the Statistical Method
4. Verify the Assumptions of the Statistical
Method
5. Draw Conclusions from the Data
10. Data Quality Assessment
[Flowchart]
Step 1: Review Decision Problem
Step 2: Learn more about the data
Step 3: Select the statistical method
Step 4: Verify the assumptions of the method
Do the assumptions hold?
–Yes: Step 5: Draw conclusions, leading to the product/decision
–No: Revise the scope of the problem, OR choose a new statistical test, OR transform or otherwise modify the data
11. Two Views of DQA
DECISION MAKER'S VIEW
1 Define the Decision Rule and Decision Errors
2 Specify Acceptable Limits on Decision Errors
3 Identify Method for Applying Decision Rule
4 Ensure that Method is Defensible
5 Apply the Decision Rule
DATA ANALYST'S VIEW
1 Define the Statistical Hypotheses
2 Determine Acceptable Type I and Type II Error Rates
3 Identify Statistical Test or Method and Assumptions
4 Assess Validity of Statistical Test/Method
5 Perform Statistical Test and Assess Design
12. DQA Project Table
Column 1: Project Objective & Data Collection Design (Step 1)
LIST:
- Objective
- Parameter of interest
- Type of analysis needed
- Type of data collection design
- Information on deviations from the design in the implementation
[This column will contain an overview of the project and background information against which to determine "quality."]
Column 2: Observations from QA Reports, Summary Statistics, and Graphs (Step 2)
LIST:
- Non-detects
- Probable distribution
- Potential outliers
- Anomalies
[This column will contain information that will provide insight into which assumptions might be met.]
Column 3: Statistical Method and Assumptions (Step 3)
LIST:
- Analysis method
- Assumptions to verify
- Significance levels
[This column will contain information on the statistical method and its assumptions.]
Column 4: Verification of Assumptions (Step 4)
LIST:
- Assumptions, whether they were met, and how they were verified (including significance levels)
[This column will describe what assumptions were checked, how they were checked, and what the results were.]
Column 5: Results from Statistical Method (Step 5)
LIST:
- Final results from data analysis
- Other factors affecting the final product or decision
[This column will summarize the final results from the statistical test and other factors to consider in the final product or decision.]
14. The Five Steps of Data Quality
Assessment
1. Review objectives and data collection design
2. Conduct a preliminary data review
3. Select the statistical method
4. Verify the assumptions of the statistical method
5. Draw conclusions from the data
IMPORTANT: If data other than (or in addition to) new
data are being used, ALL steps must still be performed!
15. DQA Step 1: Review Objectives
and Data Collection Design
Translate the data user's objectives into a statement
of the primary statistical hypothesis or estimation
goal
Translate the data user's objectives into tolerable
limits on the probability of committing decision errors
Review the sampling design and note any special
features or deviation from the sampling plan
16. Step 1: Input
QA Project Plan or any other planning documents that
contain:
–Project objective or question to be answered
–Decision performance criteria (DQOs) or other
performance and acceptance criteria
Field Sampling Plan and any reports on actual
implementation of sampling plan
17. If Systematic Planning was Performed...
Use the reports documenting the planning to answer:
–What is the objective of the project?
–What are the performance or acceptance criteria for
the product or decision?
18. If Systematic Planning was NOT
performed...
Decision Making: Apply the Data Quality Objectives
Process or other planning process to:
–develop hypotheses
–define the potential decision errors
–specify tolerable limits on making decision errors
Estimation: Use a systematic planning process to:
–select parameters
–develop performance or acceptance criteria
19. Example Hypothesis & Decision Errors
Example Hypotheses:
–Null Hypothesis: True mean is less than 50 mg/Kg
–Alternative: True mean is greater than 50 mg/Kg
Example Decision Errors: If the null hypothesis is
that the mean is less than 50 mg/Kg:
–False Rejection: Decide that the true mean PAH
concentration is more than 50 mg/Kg when it is
really less than 50 mg/Kg
–False Acceptance: Decide that the true mean PAH
concentration is less than 50 mg/Kg when it is
really greater than 50 mg/Kg
20. Example Limits on Decision Errors
False Rejection: Decide that the true mean PAH
concentration is more than 50 mg/Kg, when it is really
less than 50 mg/Kg
–10% probability of making an error at 50 mg/Kg
–5% probability of making an error at 25 mg/Kg
False Acceptance: Decide that the true mean PAH
concentration is less than 50 mg/Kg, when it is really
greater than 50 mg/Kg
–10% probability of making an error at 70 mg/Kg
–5% probability of making an error at 100 mg/Kg
21. Reviewing the Sampling Design
Review the planned sampling design and the
information on the actual data collection; note any
special features or deviations
Determine whether these deviations could affect the
planned analysis of the data
22. Step 1 - Output
Well-defined project objectives and criteria
Verification that the hypothesis chosen is consistent
with the objective and criteria
A list of any deviations from the planned sampling
design and the effects of these deviations
23. DQA Step 2:
Conduct Preliminary Data Review
Review quality assurance reports for anomalies
Calculate standard statistical quantities
Display the data using graphical representations
24. Step 2: Input
Verified and Validated Data
QA reports, QC data
Technical systems audit results:
–Performance evaluations
–Corrective action reports
–Data verification and validation reports
QA Project Plan, Sampling and Analysis Plan, or other
planning documents
25. Review QA Reports
Look for:
–Failure to meet acceptance criteria/obvious QC
violations:
  • variable detection limits
  • nonequivalent analytical methods
–Implementation anomalies from the QA Project
Plan:
  • negative emission rates
  • pH values exceeding 14.0
  • values in wrong reporting units
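A screening pass for the kinds of anomalies listed above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names (`ph`, `emission_rate`), and the sample values are hypothetical and do not come from any QA Project Plan.

```python
# Hypothetical field records; names and values are illustrative only.
records = [
    {"id": "S1", "ph": 7.2, "emission_rate": 0.8},
    {"id": "S2", "ph": 14.6, "emission_rate": 1.1},  # pH above 14.0
    {"id": "S3", "ph": 6.9, "emission_rate": -0.3},  # negative emission rate
]


def screen_record(rec):
    """Return a list of anomaly descriptions for one record."""
    problems = []
    if not 0.0 <= rec["ph"] <= 14.0:
        problems.append("pH outside 0-14")
    if rec["emission_rate"] < 0:
        problems.append("negative emission rate")
    return problems


# Keep only the records that have at least one problem.
flagged = {}
for rec in records:
    problems = screen_record(rec)
    if problems:
        flagged[rec["id"]] = problems

print(flagged)
```

Flagged records are candidates for follow-up with the QA reports, not automatic deletions.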
26. Calculate Standard Statistical Quantities
Statistical Quantities include measures of:
–Central Tendency (mean, median, etc.)
–Relative Standing (percentiles)
–Dispersion (range, variance)
–Association (correlation)
Review these quantities to determine:
–Do the data look reasonable - do the values make
sense?
–Are there any obvious anomalies?
–Are there any trends or patterns?
27. Display the Data using Graphs
Common Graphs:
–Histogram
–Stem-and-Leaf
–Box and Whiskers
–Scatter Plot
–Time Plot
Review graphs to determine:
–Do the data look reasonable?
–What is the distribution like? Is it symmetric,
bimodal?
–Are there extremely high or extremely low values?
–Are there any obvious trends?
28. Step 2: Output
Statistical quantities and graphs that provide you with
a preliminary understanding of the data and any
potential issues including:
–Distribution of data
–Potential outliers
–Non-detects
29. DQA Step 3:
Select the Statistical Method
Select the statistical method based on the data user's
objectives and the preliminary data review
Identify the assumptions underlying the statistical
method
30. Step 3: Input
Project objectives, hypotheses, and preliminary
statistical method if identified
Background on statistical methods
31. Select Method
If a statistical method was identified during planning,
determine if that choice seems reasonable based on the
preliminary review of the data
Otherwise, select statistical method based on the data
user's objectives and the preliminary data review
Example Methods:
–Tests: One-sample t-test, Two-sample t-test, Test for
a single proportion, Wilcoxon Signed Rank Test
–Estimation
–Regression Analysis
–Time Series Analysis
32. Identify Assumptions
Every method has assumptions.
Common Assumptions:
–Distributional form
–Independence
–Dispersion characteristics
–Homogeneity
–Basis for randomization
Example - One-sample t-test: random sample,
independence of data; sample mean is normally
distributed; no outliers; few "non-detects"
33. Step 3: Output
Proposed statistical method that looks appropriate for
the data and the project objectives
List of assumptions for the statistical method
34. DQA Step 4: Verify the Assumptions
of the Statistical Method
Determine approach for verifying assumptions
Perform tests of assumptions
If necessary, determine corrective actions to be taken
35. Step 4: Input
Data
Assumptions identified for statistical method
Methods to verify these assumptions along with their
formulas
36. Determine Approach for Verifying
Assumptions and Perform Test
Evaluate Step 3 to see what assumptions need to be
verified
Determine what tests are available for verifying
assumptions for this dataset
Select test and appropriate significance level
37. Determine Corrective Actions
If this data set does not meet the needed assumptions,
determine the next steps that should be taken:
–Repeat Step 3 and select a different method for
analyzing the data
–Transform the data
–Reduce significance level
–Gather additional data
–Modify objective (but this should be done with caution)
38. Step 4: Output
Documentation of the method used to verify each
assumption and the results of these methods
Corrective actions (if necessary)
39. DQA Step 5:
Draw Conclusions from the Data
Perform the calculations for the statistical method
Evaluate the results and draw conclusions
40. Step 5: Input
Data
Objective, hypotheses (if applicable), and
performance or acceptance criteria
Formulas for statistical method
Non-statistical factors to incorporate into the final
decision or product
41. Perform the Statistical Method
Use formulas and procedures from standard text
books
Use software to perform the calculations:
–SPSS
–SAS or Splus
–R
–DataQUEST
Note: Software could be used on any data, whether
the assumptions have been verified or not. But when
the assumptions don't hold, then the results are
highly suspect.
42. Evaluate the Results
The statistical results are not necessarily the answer.
Factor in items like:
–Practical significance
–Political/social factors
–Contextual significance
44. DQA Steps - Summary Table
STEP 1
INPUT:
–QA Project Plan or any other planning documents
–Project objective or question to be answered; decision performance criteria or other performance and acceptance criteria
–Reports (e.g., Field Sampling Plan) on actual implementation of sampling plan
PROCESS:
–Translate objectives into a statement of the primary statistical hypothesis or estimation goal
–Translate objectives into tolerable limits on the probability of committing decision errors
–Review the sampling design and note any special features or deviations
OUTPUT:
–Well-defined project objectives and criteria
–Verification that the hypothesis chosen is consistent with the objective and criteria
–A list of deviations from the planned sampling design and the effects of these deviations
STEP 2
INPUT:
–Verified and Validated Data
–QA reports, QC data
–Technical systems audit results
–QA Project Plan, Sampling and Analysis Plan, or other planning documents
PROCESS:
–Review quality assurance reports for anomalies
–Calculate standard statistical quantities
–Display the data using graphical representations
OUTPUT:
–Statistical quantities and graphs that provide you with a preliminary understanding of the data and any potential issues
STEP 3
INPUT:
–Project objectives, hypotheses, and preliminary statistical method if identified
–Background on statistical methods
PROCESS:
–Select the statistical method based on the data user's objectives and the preliminary data review
–Identify the assumptions underlying the statistical method
OUTPUT:
–Proposed statistical method that seems appropriate for the data and the project objectives
–List of assumptions for the statistical method
STEP 4
INPUT:
–Data
–Assumptions identified for statistical method
–Methods to verify assumptions along with formulas
PROCESS:
–Determine approach for verifying assumptions
–Perform tests of assumptions
–If necessary, determine corrective actions to be taken
OUTPUT:
–Documentation of the methods used to verify each assumption and the results
–Corrective actions (if necessary)
STEP 5
INPUT:
–Data
–Hypotheses (if applicable) and performance or acceptance criteria
–Formula for statistical method
–Non-statistical factors to incorporate into the final decision or product
PROCESS:
–Perform the calculations for the statistical method
–Evaluate the results of the statistical method and draw conclusions
OUTPUT:
–Statistical results with a specified significance level
–Final product or decision
47. DQA Step 1: Review Objectives
and Data Collection Design
A. Translate the project's objectives into a statement
of the primary statistical hypothesis or estimation goal
B. Translate the data user's objectives into
performance or acceptance criteria
C. Review the sampling design and note any special
features or deviation from the sample plan
48. Two Situations to Consider
Project Systematically Planned
–Use QA Project Plan and other planning documents
to perform this action
Project NOT Systematically Planned
–Use a systematic planning process (e.g., the Data
Quality Objectives Process) to plan retrospectively
49. Project Objective (Step 1A)
The objective indicates what the final outputs from
the project should be
For example:
–If the goal is to ascertain if the contamination
exceeds a threshold, the objective would be
"determine whether the contamination is greater
than X"
–If the goal is to ascertain if the contamination in the
soil has reached the groundwater, the objective
would be "determine whether the contamination
can be detected in the aquifer"
51. Defining the Boundaries (Step 1A)
Define the geographical area within which decisions
apply and the media of concern (Spatial Boundary)
Determine the time frame to which the study results
apply (Temporal Boundary)
Define a scale of decision making
52. Specifying Criteria (Step 1B)
Set quantitative performance or acceptance criteria
Consider consequences of any potential decision
errors. Consequences may include:
  • Health risks
  • Ecological risks
  • Political risks
  • Social risks
  • Resource risks
EXAMPLE: When selecting between two opposing
conditions, define a 'gray region', false rejection error
limit, false acceptance error limit.
53. Review the Sampling Design (Step 1C)
Information needed: Original sampling plan (locations and types of samples to be taken).
Where it would be found: QA Project Plan or other planning documents. Documentation from the systematic planning process.
Information needed: Details on how the samples were actually collected.
Where it would be found: Summary reports from field notes, maps.
Information needed: Documentation of deviations from sampling plan.
Where it would be found: Must be developed based on comparison of original plan and actual implementation.
Information needed: Review of deviations to ensure that the implemented plan still meets the objectives for the project.
Where it would be found: Must be developed based on information from systematic planning on study objectives.
54. What if you are assessing data
collected by another project?
Gather available information about the way that the
data were collected - for example, sample collection
plans, implementation of sample collection plans,
analytical method used, . . .
55. PCB Example: Background
Electronic Manufacturing Corporation of America
operated at site from 1965 to 1985 and sold the site to
Energy Components Company in 1985. Both
companies went bankrupt in 1990.
In 1991, chlorinated solvents were discovered in
water from city wells located in a field east of site.
Waste oil contaminated with PCBs was sprayed on a
dirt road on the site for dust suppression while the
site was operational.
Problem: Determine if the extent of PCB
contamination along the road presents unacceptable
risks and remedial action is needed.
56. PCB Example: Statistical Hypotheses
If the mean concentration of total PCBs in surface soil
(top 1 inch) over the dirt road exceeds 2 mg/Kg, then
take remedial action; otherwise, take no further
action.
Null Hypothesis: True Mean < 2 mg/Kg
vs.
Alternative Hypothesis: True Mean > 2 mg/Kg
57. PCB Example: Tolerable Limits on
Decision Errors
[Figure. Vertical axis: Probability of Deciding that the Mean Exceeds 2 mg/Kg. Horizontal axis: True Concentration of PCB (mg/Kg). Labeled features: Action Level, Gray Region, Tolerable False Rejection Decision Error Rates, Tolerable False Acceptance Decision Error Rates.]
58. PCB Example: Sampling Design
Composite samples selected using simple random
sampling from the dirt road.
Dirt road was one of 4 strata at the site (stratum 1).
Each composite consists of 5 mini-samples.
[Figure label: Stratum 1]
59. PCB Example: Data
PCB concentration levels were measured (in mg/Kg)
from 16 surface soil samples (top one inch of soil)
from the dirt road. Each soil sample consists of 5
mini-samples composited together.
1.92 2.49 4.58 1.17
2.48 5.62 2.54 25.15
7.72 1.02 2.91 3.23
2.87 8.66 1.71 1.18
60. DQA Step 2:
Conduct Preliminary Data Review
A. Review quality assurance reports for anomalies:
  • Anomalies in recorded data, missing values,
deviations from standard operating procedures,
failure to meet acceptance criteria, use of
nonstandard data collection methodologies
B. Calculate standard statistical quantities:
  • Do the data look reasonable? Do the values make
sense? Are there any anomalies?
C. Display the data using graphical representations:
  • Are there any trends? What is the distribution
like? Are there any extreme values?
61. Review QA Reports (Step 2A)
Data validation reports that document sample
collection, handling, analysis, data reduction, and
reporting procedures used
Quality control reports from laboratories or field
stations that document measurement system
performance, including data from check samples,
split samples, spiked samples, or any other internal
QC measures
Technical system reviews, performance evaluation
audits, and audits of data quality, including data from
performance evaluation samples
63. Commonly Used Graphs (Step 2C)
Histogram Scatter Plot
Stem-and-Leaf Time Plot
Ranked Data Plot Spatial Correlogram
Quantile Plot Posting Plot
Normal Probability Plot Symbol Plot
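As an illustration of the first graph type, the sketch below builds a text histogram of the 16 PCB values from the example data slide using only the Python standard library. The bin width of 5 mg/Kg is an arbitrary choice made for this sketch, not something specified in the slides.

```python
# PCB concentrations (mg/Kg) from the example data slide.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

bin_width = 5  # mg/Kg; chosen for illustration only

# Count how many values fall in each bin [lower, lower + bin_width).
counts = {}
for x in pcb:
    lower = int(x // bin_width) * bin_width
    counts[lower] = counts.get(lower, 0) + 1

# Print one row of asterisks per bin, including empty bins.
for lower in range(0, 30, bin_width):
    n = counts.get(lower, 0)
    print(f"{lower:>2}-{lower + bin_width:<2} | {'*' * n}")
```

Even this crude display shows the right skew and the single value far above the rest.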
64. PCB Example: Statistical Quantities
Number of Observations: 16
Minimum: 1.020 Maximum: 25.150
Mean: 4.703 Median: 2.705
Variance: 34.808 Standard Deviation: 5.900
Range: 24.130 Interquartile Range: 3.285
Coefficient of Variation: 1.254
Coefficient of Skewness: 2.818
Coefficient of Kurtosis: 7.321
Percentiles:
1st: 1.020 75th: 5.100
5th: 1.020 90th: 8.660
10th: 1.170 95th: 25.150
25th: 1.815 99th: 25.150
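Most of the quantities above can be reproduced from the 16 data values with the Python standard library, as sketched below. One assumption is made here: the quartiles are taken as the medians of the lower and upper halves of the sorted data, which matches the 25th and 75th percentiles shown but may not be the method the original software used.

```python
import statistics

# PCB concentrations (mg/Kg) from the example data slide.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

data = sorted(pcb)
n = len(data)

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)
data_range = data[-1] - data[0]

# Quartiles as medians of the lower and upper halves (assumed method).
half = n // 2
q1 = statistics.median(data[:half])
q3 = statistics.median(data[-half:])
iqr = q3 - q1

cv = std_dev / mean  # coefficient of variation

print(f"n={n} mean={mean:.3f} median={median:.3f}")
print(f"variance={variance:.3f} sd={std_dev:.3f} range={data_range:.3f}")
print(f"Q1={q1:.3f} Q3={q3:.3f} IQR={iqr:.3f} CV={cv:.3f}")
```

A mean well above the median, together with a coefficient of variation above 1, is a quick numerical hint of right skew.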
67. PCB Example: Summary of Statistical
Quantities and Graphs
Not symmetric
One extreme value (25.15)
Does not appear to be normally distributed
68. DQA Step 3:
Select the Statistical Method
A. Select the statistical method based on the project's
objectives and the preliminary data review
B. Identify the assumptions underlying the statistical
method
69. Select Method (Step 3A)
If the statistical method has been identified during
planning, ensure it is applicable based on the
preliminary review of the data
Otherwise, select statistical method based on the data
user's objectives and the preliminary data review
Example Tests:
–One-sample t-test
–Two-sample t-test
–Test for a single proportion
–Wilcoxon Signed Rank Test
70. Identify Assumptions (Step 3B)
Each method has assumptions which can be found in
references on that test or in Guidance for Data Quality
Assessment
Common Assumptions:
–Distributional form
–Independence
–Dispersion characteristics
–Homogeneity
–Basis for randomization
Example: One-sample t-test: random sample;
independence of data; sample mean is normally
distributed; no outliers; not an excessive amount of
"non-detects"
71. PCB Example: Select Method and
Identify Assumptions
ONE SAMPLE t-TEST ASSUMPTIONS:
No outliers (sample mean and standard deviation are
very sensitive to outliers)
Sample mean approximately normally distributed
Random sample (independence of the data values)
Relatively few values below the detection limit
83. DQA Step 4: Verify the Assumptions
of the Statistical Test
A. Perform method to test assumptions:
  • Determine what methods are available
  • Select method and significance level
  • Perform calculations
B. If necessary, determine corrective actions to be
taken, e.g., select different test in Step 3, reduce
significance level, gather additional data, modify
slightly the objectives of the study, etc.
84. Types of Assumptions (Step 4A)
Random sample
Independence of data
Distribution of Data
Existence of Outliers
Extent of non-detect data
85. Verify Random Sample
Review data collection plan to verify that some
element of random selection was used to locate the
samples
If judgmental sampling was used to choose some
sample locations, you may need to consult a
statistician to determine if the data collection is
"random enough" to be able to draw reliable
conclusions from the data
If judgmental sampling was used to choose all sample
locations, the ability to draw conclusions about the
entire population is limited
86. Verify Independence of Data
For one variable (i.e., one contaminant) ensure there
are no trends within the data - for example, by
location, by time period, etc.
For two (or more) variables (i.e., two or more
contaminants), ensure that they are not correlated
with one another
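One simple way to look for a linear trend, or for association between two variables, is the Pearson correlation coefficient. The sketch below is a plain implementation; the sampling order and concentration values are hypothetical and serve only to show the call.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical example: sampling order vs. measured concentration.
order = [1, 2, 3, 4, 5, 6]
conc = [2.1, 2.4, 2.2, 2.9, 3.1, 3.4]

r = pearson_r(order, conc)
print(f"r = {r:.3f}")
```

A coefficient near zero is consistent with no linear trend, although it does not rule out other forms of dependence, so plots by location and time remain useful.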
87. Verify Distribution
Typical assumptions are:
–Normal distribution
–Symmetric distribution
An indication of the distribution can be gained by
reviewing the summary statistics and the graphs
Quantitative tests are available to verify whether a
data set has a specific distribution
88. Verify Outliers
Statistical outliers are anomalies with respect to the
proposed distribution for the data -- an extreme value
may be a statistical outlier but may still be a valid data
point
Before removing any statistical outliers, review these
values on both a scientific and a quality assurance basis
-- do not delete any values unless there is a scientific
or quality assurance based reason for the removal
If any data are deleted, run all analyses both with and
without the data to see the effect of the deletions
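The with-and-without comparison can be sketched as follows, using the PCB example data and treating 25.15 as the candidate outlier. Only the mean and median are compared here; a full analysis would rerun every statistic and test.

```python
import statistics

# PCB concentrations (mg/Kg) from the example data slide.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

candidate = 25.15  # extreme value under review
without = [x for x in pcb if x != candidate]

# Summarize both versions of the data set side by side.
summary = {
    "with": (statistics.mean(pcb), statistics.median(pcb)),
    "without": (statistics.mean(without), statistics.median(without)),
}

for label, (mean, median) in summary.items():
    print(f"{label:>8}: mean={mean:.3f} median={median:.3f}")
```

The mean moves much more than the median when the extreme value is dropped, which shows how sensitive mean-based methods are to a single point.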
89. Using Data Below the Detection Limit
Quality control information would indicate which data
points were non-detects
Methods for Addressing non-detects:
–Substitution, e.g., detection limit, 1/2 detection limit,
. . .
–Special algorithms, e.g., Winsorized Mean, Cohen's
Method, . . .
–Use of percentiles
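The substitution approach can be sketched as below. The tuple layout `(value, detection_limit, detected)` and the sample results are hypothetical; the default of one-half the detection limit is just one of the options listed above.

```python
def substitute_nondetects(results, fraction=0.5):
    """Replace non-detects with fraction * detection limit.

    results: list of (value, detection_limit, detected) tuples,
             where value is ignored when detected is False.
    """
    cleaned = []
    for value, detection_limit, detected in results:
        if detected:
            cleaned.append(value)
        else:
            cleaned.append(fraction * detection_limit)
    return cleaned


# Hypothetical results: two detects and one non-detect (DL = 0.5).
results = [(1.2, 0.5, True), (None, 0.5, False), (0.9, 0.5, True)]

half_dl = substitute_nondetects(results)               # 1/2 detection limit
full_dl = substitute_nondetects(results, fraction=1.0)  # detection limit
print(half_dl, full_dl)
```

Running the analysis under more than one substitution rule shows how much the conclusion depends on that choice.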
90. Types of Corrective Actions (Step 4B)
Select a different test or method of analysis
Transform the data to a different metric
Gather additional data or modify objective
91. PCB Example: Background
Electronic Manufacturing Corporation of America
(EMCA) operated at site from 1965 to 1985, when site
was sold to Energy Components Company (ECC).
Both companies went bankrupt in 1990.
In 1991, chlorinated solvents were discovered in water
from city well field east of site.
Waste oil contaminated with PCBs was sprayed on a
dirt road for dust suppression.
Problem: Determine the extent of PCB contamination
on the dirt road that presents unacceptable risks.
92. PCB Example: Assumptions of the t-Test
for a Single Mean
No outliers?
Data are approximately normally distributed?
Random sample (for independence of data values)?
No data reported as "Not Detected"?
93. PCB Example: Identifying Outliers
Is the extreme value of 25.15 a statistical outlier?
Extreme Value Test will be used - this test assumes
data without the outlier are normally distributed.
94. PCB Example: Testing for Normality
Shapiro-Wilk W Test
Null Hypothesis: 'Data are normally distributed'
Sample Value: 0.836
Tabled Value: 0.881
Non-normality has been detected at a 5% significance level.
Result: Data are not normally distributed
95. PCB Example: Not Normally Distributed
Data without the outlier appear to be lognormally
distributed in the histogram
So, apply Shapiro-Wilk W Test to natural logarithms of
the data to test for lognormality -- if logged data are
normally distributed, then untransformed data are
lognormally distributed
Original data value 1.92 becomes 0.652, 2.49 becomes
0.912, etc.
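The log transformation itself is a one-liner. The sketch below applies the natural logarithm to the 15 values that remain once 25.15 is set aside, keeping the order of the data slide.

```python
import math
import statistics

# PCB data (mg/Kg) in slide order, with the extreme value 25.15 set aside.
pcb_no_outlier = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54,
                  7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

# Natural logarithm of each value.
logged = [math.log(x) for x in pcb_no_outlier]

for original, value in zip(pcb_no_outlier[:2], logged[:2]):
    print(f"{original} becomes {value:.3f}")

print(f"mean of logs = {statistics.mean(logged):.3f}")
```

A normality test such as Shapiro-Wilk would then be run on `logged` rather than on the raw concentrations.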
96. PCB Example: Testing Data for
Lognormality
Shapiro-Wilk Test
Null Hypothesis: 'Data are normally distributed'
Sample Value: 0.953
Tabled Value: 0.881
There is not enough evidence to reject the assumption of
normality with a 5% significance level.
Result: Cannot reject the assumption that the data are
lognormally distributed and that the logs of the data are
normally distributed.
97. PCB Example: Discordance Test for
Outliers
Discordance Test for Outliers
Value Tested: 3.225 [ln(25.15)]
Sample Value: 2.476
Tabled Value: 2.443
Conclude 3.225 is an outlier at a 5%
significance level.
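The sample value above can be reproduced under the assumption that the discordance statistic is the distance of the largest logged value from the mean, in standard deviations, computed over all 16 logged values. The critical value is not computed here; it is the tabled value quoted on the slide.

```python
import math
import statistics

# PCB concentrations (mg/Kg) from the example data slide.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

logged = [math.log(x) for x in pcb]
suspect = max(logged)  # ln(25.15)

# Discordance statistic: (suspect value - mean) / sample standard deviation.
d_stat = (suspect - statistics.mean(logged)) / statistics.stdev(logged)

critical = 2.443  # tabled value quoted on the slide (5% level)
is_outlier = d_stat > critical

print(f"value tested = {suspect:.3f}")
print(f"D = {d_stat:.3f}, critical = {critical}, outlier: {is_outlier}")
```

The statistic exceeds the tabled value only narrowly, so the scientific and QA review of this data point still matters.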
98. PCB Example: Assumptions Satisfied?
No outliers?
Data approximately normally distributed?
Random sample?
All data are above the detection limit?
99. PCB Example: A New Look at the
Logged Data
Number of Observations: 15
Minimum: 0.020 Maximum: 2.159
Mean: 1.002 Median: 0.932
Variance: 0.428 Standard Deviation: 0.654
Range: 2.139 Interquartile Range: 0.985
Coefficient of Variation: 0.653
Coefficient of Skewness: 0.244
Coefficient of Kurtosis: -0.783
Percentiles:
1st: 0.020 75th: 1.522
5th: 0.020 90th: 2.044
10th: 0.157 95th: 2.159
25th: 0.536 99th: 2.159
100. DQA Step 5: Perform Method and
Draw Conclusions from the Data
A. Perform the calculations for the statistical method
B. Evaluate the results and draw conclusions
101. Perform the Statistical Method (Step 5A)
For basic methods, formulas and procedures are
available in standard text books
Software can also do the calculations:
–SAS, SPSS, or SPlus
–DataQUEST
102. Evaluate the Results (Step 5B)
A statistical test can give two results:
–Reject the baseline condition or
–Fail to reject the baseline condition
If the test does not reject the baseline condition, then
the false acceptance error rate must be verified
103. Significant at 5%
If the baseline condition (Null Hypothesis) were true, a
result as extreme as the one observed would occur by
chance less than one time in twenty.
Because such a result is unlikely under the baseline
condition, the baseline condition is rejected and the
alternative condition is selected as being correct (with
a significance level of 5%).
104. PCB Example: Perform Calculations
Formula:
  • Reject baseline condition if $t > t_{1-\alpha}$
$\alpha = 0.05$, $t_{0.95} = 1.753$
$AL = 0.6931$ [the natural log of 2.0]
$$t = \frac{\bar{X} - AL}{s / \sqrt{n}} = \frac{1.002 - 0.6931}{0.654 / \sqrt{15}} = 1.829$$
Results: Reject baseline condition at a 5% level of
significance.
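The calculation can be checked with a short sketch on the logged data (15 values, outlier excluded). The critical value is hard-coded from the slide because the standard library has no t-distribution quantile function; a standard t table with n - 1 = 14 degrees of freedom gives 1.761, which leads to the same decision. Computed from the unrounded logs, the statistic comes out near 1.83.

```python
import math
import statistics

# PCB data (mg/Kg) with the extreme value 25.15 excluded.
pcb_no_outlier = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54,
                  7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

logged = [math.log(x) for x in pcb_no_outlier]
n = len(logged)
mean_log = statistics.mean(logged)
sd_log = statistics.stdev(logged)

action_level = math.log(2.0)  # AL on the log scale

# One-sample t statistic: (mean - AL) / (s / sqrt(n)).
t_stat = (mean_log - action_level) / (sd_log / math.sqrt(n))

critical = 1.753  # quoted on the slide; taken from a table, not computed
reject = t_stat > critical

print(f"mean={mean_log:.3f} sd={sd_log:.3f} n={n}")
print(f"t = {t_stat:.3f}, critical = {critical}, reject baseline: {reject}")
```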
105. PCB Example: Conclusions
The t-test rejected the null hypothesis that the mean
is less than the action level at a 5% significance level
Statistically, the mean concentration of total PCBs in
surface soil (top 1 inch) over the dirt road exceeds
2 mg/Kg and remedial action is necessary
107. Manganese Exposure
Manganese (Mn) is a metal emitted from industrial
operations (e.g., steel mills) and cars using gasoline
with additive MMT
Mn is an essential dietary requirement in trace
amounts but higher exposures are toxic:
–occupational exposure shown to affect motor skills
–disruption of neurotransmitters may adversely
affect higher-level cognitive skills
–exposure during infancy and early childhood may
cause learning disabilities and emotional problems
Research Problem: Determine if elevated exposure to
Mn during early childhood can be linked to cognitive
dysfunction in teenagers
108. Manganese Exposure Routes
Mn is usually inhaled or ingested
MMT can be absorbed through the skin
Key route of concern: Mn is transported to the brain
via the olfactory nerve when Mn in coarse particles is
inhaled
Exposure to Mn can be estimated by analyzing
ambient air monitoring data in Total Suspended
Particulates (TSP) - undisturbed dust in attics and
other areas of homes can be an indicator of long-term
exposure
109. DQA Step 1: Review Objectives
and Data Collection Design
Translate the data user's objectives into a statement
of the primary statistical hypothesis or estimation
goal
Translate the data user's objectives into tolerable
limits on the probability of committing decision errors
Review the sampling design and note any special
features or deviation from the sample plan
110. Pilot Study: Objective and Background
Objective: Determine whether the proposed study
area is suitable for a large-scale investigation
Background
–Range of Mn exposures varies from very high to very
low
–Concentrations of Mn in air and dust decrease with
distance from known point source (steel mill)
–There are few, if any, confounding factors, such as
lead in paint
111. Performance Criteria for Pilot Study
Exploratory nature of pilot study does not warrant
specification of rigorous performance criteria;
however, results must support the larger study goals
–Detect correlation of 40% with at least 80% power
–Identify very high and very low exposures
Sample size for TSP analysis driven by existing
ambient air monitoring network design for air
measurements; sample size for dust samples based
on performance criteria, budget, professional
judgment, and experience with similar studies
112. Pilot Study: Statistical Methods and
Performance Criteria
Statistical Methods:
–Determine correlation between distance to steel mill
and Mn concentrations in air (TSP) and house dust
–Estimate annual average air concentrations of Mn
in TSP at various distances from steel mill
10% significance level chosen for testing hypotheses
about relationships between Mn exposure
concentrations and results of cognitive tests
113. Pilot Study: Performance Criteria
(continued)
Correlations expected to be relatively low (10%-50%)
and negative
False negative error rate specified by determining
power under various assumptions of correlation
Sample size by power (rows) and true population
correlation (columns):
Power    10%    20%    30%    40%    50%
50%      166     42     19     11      7
60%      237     60     27     15     10
70%      327     82     36     20     13
75%      383     96     42     23     14
80%      451    113     49     27     17
85%      537    134     58     32     19
90%      656    163     72     39     24
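A rough way to see how power, correlation, and sample size trade off is the Fisher z approximation, sketched below. The slides do not say how the table was computed, so the formula, the one-sided 10% significance level, and the function name `approx_n_for_correlation` are all assumptions made for illustration. The approximation lands close to the tabulated values for small correlations but should not be expected to reproduce the table exactly.

```python
import math
from statistics import NormalDist


def approx_n_for_correlation(rho, power, alpha=0.10):
    """Approximate sample size to detect a true correlation `rho`.

    Assumption: one-sided test at level `alpha`, Fisher z approximation
    n = ((z_alpha + z_power) / atanh(|rho|))**2 + 3.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)  # one-sided critical value
    z_power = std_normal.inv_cdf(power)      # quantile for the desired power
    effect = math.atanh(abs(rho))            # Fisher z transform of rho
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)


# Sample sizes at 80% power across the correlations shown in the table
for rho in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(rho, approx_n_for_correlation(rho, power=0.80))
```

The useful pattern is the steep drop in required sample size as the true correlation grows, which is why a pilot of about 30 households can target a 40% correlation but not a 10% one.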
114. Review Sampling Design
2-stage stratified random sampling design
–Sampling frame was phone list in the target area
around the steel mill, pre-filtered to include
households with children age 13-17
–Sampling population was stratified by distance to
the steel mill
–First stage of sampling was random-digit-dialing to
recruit households who met the study criteria
–Second stage of sampling was within the set of
eligible and willing participants
For each household selected, dust from the attic or
basement was collected using a vacuum device per
established protocols
115. Review Data Collection Reports
30 households were successfully sampled including
individual houses, attached townhouses, and
apartments
The field report indicated that 50% of the households
were sampled in a basement furnace area, 10% in
other basement areas, 10% in attics, and the
remainder in miscellaneous crawl spaces or other
furnace areas
All of these locations seem reasonable in light of the
project objectives
116. DQA Step 2:
Conduct Preliminary Data Review
Review quality assurance reports for anomalies
Calculate standard statistical quantities
Display the data using graphical representations
117. Quality Control Reports
There were no "below the detection limit" values in
the Mn data set
Duplicate sample analyses and instrument analyses
indicated excellent agreement, all within 5%
There were no significant anomalies in the QC reports
118. Summary Statistics for Mn in Dust
Mean 924.23
Std Dev 2566.9
C.V. 2.78
Median 390.0
Mode N.A.
Range 14420
Minimum 16
Maximum 14436
Percentile Value
95% 1339
75% 699
50% 390
25% 258
5% 130
N = 30
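The summary quantities above can be cross-checked against each other, and the same set can be computed for any data set with Python's standard library. The sketch below is illustrative only: the `demo` values are hypothetical (the 30 study measurements are not listed in the slides), and `summarize` is a made-up helper name.

```python
import statistics


def summarize(values):
    """Return the basic summary statistics used in a preliminary data review."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)               # sample standard deviation
    quartiles = statistics.quantiles(values, n=4)    # three cut points
    return {
        "n": len(values),
        "mean": mean,
        "std_dev": std_dev,
        "cv": std_dev / mean,                        # coefficient of variation
        "median": statistics.median(values),
        "q25": quartiles[0],
        "q75": quartiles[2],
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# Consistency checks using the values reported on the slide
cv_from_slide = 2566.9 / 924.23      # Std Dev / Mean
range_from_slide = 14436 - 16        # Maximum - Minimum
print(round(cv_from_slide, 2), range_from_slide)

# Hypothetical values, NOT the study data
demo = [16, 130, 258, 390, 699, 1339]
print(summarize(demo))
```

A C.V. well above 1 and a mean more than twice the median are both early signs of a strongly right-skewed data set.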
120. Box-and-Whiskers Plots of Mn in Dust
Full Data Set: minimum 16, median 390, mean 924,
outlier at 14436
Without Potential Outlier: minimum 16.0, median 256.0,
mean 458.3, maximum 1339.0
122. DQA Step 3:
Select the Statistical Method
Select the statistical method based on the data user's
objectives and the preliminary data review
Identify the assumptions underlying the statistical
method
123. Correlation Between Mn and
Distance to Point Source
Two potential measures of correlation:
–Pearson's correlation coefficient
• detects linear relationship between two sets of
values
• sensitive to extreme values
• not affected by linear transformations
–Spearman's rank correlation coefficient
• uses ranks of data values
• less sensitive to extreme values
• not affected by monotonic nonlinear
transformations
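The contrast between the two coefficients can be illustrated with a small sketch. The data below are hypothetical (not the study measurements), and the helper names `pearson`, `ranks`, and `spearman` are written here for illustration using only the standard library.

```python
import math


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient applied to ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: concentration falls steadily with distance
distance = [1, 2, 3, 4, 5, 6, 7, 8]
conc = [900, 700, 650, 500, 450, 300, 250, 200]
conc_outlier = conc[:]
conc_outlier[3] = 14000          # one extreme value

print("clean:       ", pearson(distance, conc), spearman(distance, conc))
print("with outlier:", pearson(distance, conc_outlier),
      spearman(distance, conc_outlier))

# A monotonic transformation (log) leaves Spearman's coefficient unchanged
log_conc = [math.log(c) for c in conc_outlier]
print("log scale:   ", spearman(distance, log_conc))
```

In this toy example a single extreme value pulls Pearson's coefficient toward zero, while Spearman's coefficient moves much less and is identical on the raw and log scales.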
124. Assumptions to be Verified for Pearson's
Correlation Coefficient
Random Sample
Independence of data
Linear Relationship
No Outliers
Normal Distribution
125. DQA Step 4: Verify the Assumptions
of the Statistical Method
Determine approach for verifying assumptions
Perform tests of assumptions
If necessary, determine corrective actions to be taken
126. Assumption: Random Sample
Verification Method: Review how the data collection
plan was developed and implemented (i.e., how the
sample locations were chosen)
Verification Results: The QA Project Plan and the
data collection documentation report that random
sample locations were chosen
It is highly probable that this assumption of a random
sample has been met
127. Assumption: Independence of Data
Verification Method: Rank von Neumann test
Verification Result: Using the baseline condition (Null
Hypothesis) that there is no serial correlation present,
the Rank von Neumann test concludes that there is no
serial correlation at the 10% level of significance
It is highly probable that this assumption of
independence has been met
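For reference, the rank von Neumann ratio itself is simple to compute. The sketch below assumes the usual form of the statistic (sum of squared differences between successive ranks, divided by n(n^2 - 1)/12), no tied values, and data listed in collection order; the function name is hypothetical, and critical values must still be looked up in published tables.

```python
def rank_von_neumann_ratio(series):
    """Rank von Neumann ratio for values listed in collection order.

    Assumes no tied values, so the ranks are exactly 1..n.
    """
    n = len(series)
    order = sorted(range(n), key=lambda i: series[i])
    rank = [0] * n
    for position, index in enumerate(order, start=1):
        rank[index] = position

    # Squared differences between ranks of successive observations
    numerator = sum((rank[i] - rank[i - 1]) ** 2 for i in range(1, n))
    # Sum of squared deviations of ranks 1..n from their mean
    denominator = n * (n * n - 1) / 12
    return numerator / denominator


# Hypothetical series: a steady trend gives a ratio far below 2,
# an alternating pattern gives a ratio above 2
print(rank_von_neumann_ratio([1, 2, 3, 4, 5]))
print(rank_von_neumann_ratio([1, 5, 2, 4, 3]))
```

Ratios near 2 are consistent with independence; values well below 2 point to positive serial correlation.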
128. Assumption: Few Values Below
Detection Limit
Verification Method: Review data
Verification Result: There are none in this data set
This assumption of few values below the detection
limit has been met
129. Assumption: No Outliers
Verification Method:
–Review histogram: Look for very high or very low
values as compared to the rest of the distribution
–Review Box Plot: Potential outliers show up
quickly.
–Use Rosner's Test on any extreme values (at a
5% significance level)
131. Review Box and Whiskers Plot
Looks like an extreme outlier!
[Box-and-whiskers plot: minimum 16, median 390,
mean 924, outlier at 14436]
132. Assumption: No Outliers
Verification Result: Rosner's test has determined that
the observation 14436 is a statistical outlier at the 5%
significance level and should be investigated further.
(The sample value was 5.264 and the critical value
was 2.910.)
This assumption of no outliers has not been met!
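The reported sample value can be checked from the summary statistics, assuming the first-stage Rosner statistic is the largest absolute deviation from the mean divided by the sample standard deviation. The helper name below is hypothetical, the `demo` list is made-up data, and the critical value is simply the one quoted on the slide.

```python
import statistics


def extreme_studentized_deviate(values):
    """Return (most extreme value, its deviation from the mean in SD units)."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)
    suspect = max(values, key=lambda v: abs(v - mean))
    return suspect, abs(suspect - mean) / std_dev


# Check against the slide: mean 924.23, std dev 2566.9, maximum 14436
statistic_from_slide = abs(14436 - 924.23) / 2566.9
critical_value = 2.910           # quoted on the slide for the 5% level
print(round(statistic_from_slide, 3), statistic_from_slide > critical_value)

# Hypothetical values, NOT the study data
demo = [10, 12, 11, 13, 50]
print(extreme_studentized_deviate(demo))
```

Because the mean and standard deviation are themselves inflated by the suspect value, the statistic is conservative, which makes a value above 5 all the more striking.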
133. Look at the data without the outlier...
Looks better, but still may have an outlier...
[Box-and-whiskers plot: minimum 16.0, median 256.0,
mean 458.3, maximum 1339.0]
134. Check for outliers in censored data set
Verification Result: Rosner's test has determined that
the observation 1339 is a statistical outlier at the 5%
significance level and should be investigated further.
(The sample value was 3.129 and the critical value
was 2.890.)
HOWEVER, Rosner's test assumes normality...
135. Shapiro-Wilk Test for Normality
The Shapiro-Wilk test for normality determined that the
baseline condition that the data are normally distributed
can be rejected at the 5% significance level (Sample
value = 0.900; Critical Value = 0.926)
Non-normality has been detected at the 5% level of
significance, but only just!
136. What about a log transform of
the censored data set?
[Box-and-whiskers plot: minimum 16.0, median 256.0,
mean 458.3, maximum 1339.0]
137. What about a log transform of
the censored data set?
Still looks skewed: Shapiro-Wilk test confirms that
data are not strictly normal.
[Histogram: Frequency (0 to 15) vs. Concentration (ppm),
16 to 1339]
138. What should we conclude about
assumptions?
Data seem to be an independent random sample
Data are not normally distributed (or lognormal,
either)
Data probably have multiple outliers; they are heavily
skewed to the right.
Methods such as Pearson's correlation coefficient,
that are sensitive to extreme values, are not
recommended
Spearman's Rank Correlation Coefficient is a better
approach for evaluating the relationship between Mn
in dust vs. distance to point source
139. DQA Step 5: Perform Method and
Draw Conclusions from the Data
Perform the calculations for the statistical method
Evaluate the results of the statistical method and
draw conclusions
140. Spearman's Rank Correlation Coefficient
Results are less sensitive to extreme values (use the
original full data set)
The Spearman's Correlation Coefficient of distance
from the source to Mn concentration is -0.406
• The correlation is negative (as suspected): the
concentration decreases as distance from the
point source increases
• The relationship is not strong (a coefficient of
about -0.4 corresponds to roughly 16% of the
variation in ranks being attributable to a relationship
between distance and concentration)
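One common way to gauge the strength of a correlation is to square the coefficient, which gives the share of variation (here, in ranks) associated with the relationship. The sketch below does that for the reported coefficient and also computes the usual large-sample t approximation for testing whether the correlation differs from zero. Using the t approximation for Spearman's coefficient with n = 30 is an assumption made here, not something the slides state.

```python
import math

r_s = -0.406   # Spearman coefficient reported on the slide
n = 30         # households sampled

# Share of rank variation associated with distance
r_squared = r_s ** 2

# t approximation: t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom
t_stat = r_s * math.sqrt((n - 2) / (1 - r_squared))

print(f"r_s^2 = {r_squared:.3f}")
print(f"t = {t_stat:.2f} on {n - 2} degrees of freedom")
```

The t statistic would then be compared with a tabulated critical value at the 10% significance level chosen for the pilot study.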
141. Interpret the Results
Factors other than distance-to-point-source that
determine concentrations of Mn in dust:
–Infiltration rates into the dwelling
–Housekeeping practices
–Age of dwelling (how long the dust was left
undisturbed)
–Changing wind patterns in the region
Those other factors mediate the relationship between
Mn in dust and distance to point source.
Nonetheless, the researchers believe there is enough
of a relationship to support the use of distance as a
basis for stratifying the population in the larger study