Role of Data Quality Assessment in a Project

The Role of Data Quality
Assessment in a Project
Dr. Ferdin Joe John Joseph
Kamnoetvidya Science Academy
Rayong, Thailand
ferdinjoe@gmail.com

Course Objectives
At the end of this course, you will be able to:
•Explain why DQA is important and how it can be applied to
your projects
•List the five steps of DQA and explain the purpose of each
step
•Evaluate the application of DQA on a dataset
•Interpret basic statistics and simple graphs
•Recognize different software tools and other resources for
performing DQA

Data Quality
Meaningful only when "data quality" relates to
intended use of data
Some data are good ("high quality") for some
purposes but are bad ("low quality") for others

Data Quality Assessment
A scientific and a statistical evaluation to determine if
data are adequate for their intended use
DQA is described in the Guidance for Data Quality
Assessment: Practical Methods for Data Analysis
(EPA/QA G-9), EPA/600/R-96/084, July 2000

The Project Life Cycle
Product or Decision
Plan for Data Collection - Set data quality
objectives or other performance and acceptance
criteria. Document in QA Project Plan.
Collect Data - Collect/assemble data in
accordance with QA Project Plan. Perform
assessments defined in Plan.
Assess and Use Data - Verify whether the
data meet acceptance criteria. Run
statistical methods to analyze data.

DQA is Performed ...
Whenever data are used to make a decision, for
estimation, or for research purposes
This applies to:
–New data to be collected
–Data collected by someone else
–Data collected by you for another project

PLANNING
–Systematic Planning (e.g., Data Quality Objectives Process)
–QA Project Plan Development
IMPLEMENTATION
–Field Data Collection and Associated QA/QC Activities
ASSESSMENT
–Data Validation/Verification
QUALITY ASSURANCE ASSESSMENT
INPUTS: Routine Data; QC/Performance Evaluation Data
DATA VALIDATION/VERIFICATION
–Verify measurement performance
–Verify measurement procedures and reporting requirements
OUTPUT (and INPUT to the next stage): VALIDATED/VERIFIED DATA
DATA QUALITY ASSESSMENT
–Review objectives and design
–Conduct preliminary data review
–Select statistical method
–Verify assumptions
–Draw conclusions
OUTPUT: CONCLUSIONS DRAWN FROM DATA

Data Verification - the process of evaluating the
completeness, correctness, and conformance/
compliance of a specific data set against the method,
procedural, or contractual requirements
Data Validation - an analyte- and sample-specific
process that extends the evaluation of data beyond
method, procedural, or contractual compliance (i.e.,
data verification) to determine the analytical quality of
a specific data set
Data Quality Assessment - the process to determine if
the data are suitable for a specific use
Verification vs. Validation vs.
Assessment

The Five Steps of DQA
1. Review the Objectives and Sampling Design
2. Conduct a Preliminary Data Review
3. Select the Statistical Method
4. Verify the Assumptions of the Statistical
Method
5. Draw Conclusions from the Data

Step 1: Review Decision Problem
Step 2: Learn more about the data
Step 3: Select the statistical method
Step 4: Verify the assumptions of the method
Do the assumptions hold?
–Yes: Step 5: Draw conclusions, leading to the product/decision
–No: Revise the scope of the problem, OR choose a new statistical test, OR transform or otherwise modify the data

Two Views of DQA
DECISION MAKER'S VIEW
1. Define the Decision Rule and Decision Errors
2. Specify Acceptable Limits on Decision Errors
3. Identify Method for Applying Decision Rule
4. Ensure that Method is Defensible
5. Apply the Decision Rule
DATA ANALYST'S VIEW
1. Define the Statistical Hypotheses
2. Determine Acceptable Type I and Type II Error Rates
3. Identify Statistical Test or Method and Assumptions
4. Assess Validity of Statistical Test/Method
5. Perform Statistical Test and Assess Design

DQA Project Table
Column 1: Project Objective & Data Collection Design (Step 1)
LIST:
- Objective
- Parameter of interest
- Type of analysis needed
- Type of data collection design
- Information on deviations from the design in the implementation
[This column will contain an overview of the project and background information against which to determine "quality."]
Column 2: Observations from QA Reports, Summary Statistics, and Graphs (Step 2)
LIST:
- Non-detects
- Probable distribution
- Potential outliers
- Anomalies
[This column will contain information that will provide insight into which assumptions might be met.]
Column 3: Statistical Method and Assumptions (Step 3)
LIST:
- Analysis method
- Assumptions to verify
- Significance levels
[This column will contain information on the statistical method and its assumptions.]
Column 4: Verification of Assumptions (Step 4)
LIST:
- Assumptions, whether they were met, and how they were verified (including significance levels)
[This column will describe what assumptions were checked, how they were checked, and what the results were.]
Column 5: Results from Statistical Method (Step 5)
LIST:
- Final results from data analysis
- Other factors affecting the final product or decision
[This column will summarize the final results from the statistical test and other factors to consider in the final product or decision.]

Overview of

The Five Steps of Data Quality
Assessment
1. Review objectives and data collection design
2. Conduct a preliminary data review
3. Select the statistical method
4. Verify the assumptions of the statistical method
5. Draw conclusions from the data
IMPORTANT: If data other than (or in addition to) new
data are being used, ALL steps must still be performed!

DQA Step 1: Review Objectives
and Data Collection Design
Translate the data user's objectives into a statement
of the primary statistical hypothesis or estimation
goal
Translate the data user's objectives into tolerable
limits on the probability of committing decision errors
Review the sampling design and note any special
features or deviation from the sampling plan

Step 1: Input
QA Project Plan or any other planning documents that
contain:
–Project objective or question to be answered
–Decision performance criteria (DQOs) or other
performance and acceptance criteria
Field Sampling Plan and any reports on actual
implementation of sampling plan

If Systematic Planning was Performed...
Use the reports documenting the planning to answer:
–What is the objective of the project?
–What are the performance or acceptance criteria for
the product or decision?

If Systematic Planning was NOT
performed...
Decision Making: Apply the Data Quality Objectives
Process or other planning process to:
–develop hypotheses
–define the potential decision errors
–specify tolerable limits on making decision errors
Estimation: Use a systematic planning process to:
–select parameters
–develop performance or acceptance criteria

Example Hypotheses:
–Null Hypothesis: True mean is less than 50 mg/Kg
–Alternative: True mean is greater than 50 mg/Kg
Example Decision Errors: If the null hypothesis is
that the mean is less than 50 mg/Kg:
–False Rejection: Decide that the true mean PAH
concentration is more than 50 mg/Kg when it is
really less than 50 mg/Kg
–False Acceptance: Decide that the true mean PAH
concentration is less than 50 mg/Kg when it is
really greater than 50 mg/Kg
Example Hypothesis & Decision Errors

False Rejection: Decide that the true mean PAH
concentration is more than 50 mg/Kg, when it is really
less than 50 mg/Kg
–10% probability of making an error at 50 mg/Kg
False Acceptance: Decide that the true mean PAH
concentration is less than 50 mg/Kg, when it is really
greater than 50 mg/Kg
Example Limits on Decision Errors

Reviewing the Sampling Design
Review the planned sampling design and the
information on the actual data collection; note any
special features or deviations
Determine whether these deviations could affect the
potential analysis of the data

Step 1 - Output
Well-defined project objectives and criteria
Verification that the hypothesis chosen is consistent
with the objective and criteria
A list of any deviations from the planned sampling
design and the effects of these deviations

Review quality assurance reports for anomalies
Calculate standard statistical quantities
Display the data using graphical representations
DQA Step 2:
Conduct Preliminary Data Review

Step 2: Input
Verified and Validated Data
QA reports, QC data
Technical systems audit results:
–Performance evaluations
–Corrective action reports
–Data verification and validation reports
QA Project Plan, Sampling and Analysis Plan, or other
planning documents

Review QA Reports
Look for:
–Failure to meet acceptance criteria/obvious QC
violations:
▪ variable detection limits
▪ nonequivalent analytical methods
–Implementation anomalies from the QA Project
Plan:
▪ negative emission rates
▪ pH values exceeding 14.0
▪ values in wrong reporting units
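The anomaly checks listed above can be automated as simple range and unit screens. The following is a minimal sketch, not part of the original course material: the record layout, field names, sample values, and the expected reporting unit are all hypothetical and chosen only to illustrate the idea.

```python
# Hypothetical records; field names and values are illustrative assumptions.
records = [
    {"id": "S1", "ph": 7.2, "emission_rate": 0.8, "units": "mg/Kg"},
    {"id": "S2", "ph": 14.6, "emission_rate": 1.1, "units": "mg/Kg"},
    {"id": "S3", "ph": 6.9, "emission_rate": -0.3, "units": "mg/Kg"},
    {"id": "S4", "ph": 7.0, "emission_rate": 0.5, "units": "ug/Kg"},
]
EXPECTED_UNITS = "mg/Kg"  # assumed reporting unit for this sketch


def screen(record):
    """Return a list of anomaly descriptions for one record."""
    problems = []
    if not 0.0 <= record["ph"] <= 14.0:
        problems.append("pH outside 0-14")
    if record["emission_rate"] < 0:
        problems.append("negative emission rate")
    if record["units"] != EXPECTED_UNITS:
        problems.append("unexpected reporting units")
    return problems


# Keep only the records that have at least one problem.
flagged = {r["id"]: screen(r) for r in records if screen(r)}
for sample_id, problems in flagged.items():
    print(sample_id, problems)
```

A screen like this only flags values for review; whether a flagged value is an error still has to be decided from the QA reports.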

Calculate Standard Statistical Quantities
Statistical Quantities include measures of:
–Central Tendency (mean, median, etc.)
–Relative Standing (percentiles)
–Dispersion (range, variance)
–Association (correlation)
Review these quantities to determine:
–Do the data look reasonable - do the values make
sense?
–Are there any obvious anomalies?
–Are there any trends or patterns?

Display the Data using Graphs
Common Graphs:
–Histogram
–Stem-and-Leaf
–Box and Whiskers
–Scatter Plot
–Time Plot
Review graphs to determine:
–Do the data look reasonable?
–What is the distribution like? Is it symmetric,
bimodal?
–Are there extremely high or extremely low values?
–Are there any obvious trends?

Step 2: Output
Statistical quantities and graphs that provide you with
a preliminary understanding of the data and any
potential issues including:
–Distribution of data
–Potential outliers
–Non-detects

DQA Step 3:
Select the Statistical Method
Select the statistical method based on the data user's
objectives and the preliminary data review
Identify the assumptions underlying the statistical
method

Step 3: Input
Project objectives, hypotheses, and preliminary
statistical method if identified
Background on statistical methods

If identified during planning, determine if that choice
seems reasonable based on the preliminary review of
the data
Otherwise, select statistical method based on the data
user's objectives and the preliminary data review
Example Methods:
–Tests: One-sample t-test, Two-sample t-test, Test for
a single proportion, Wilcoxon Signed Rank Test
–Estimation
–Regression Analysis
–Time Series Analysis
Select Method

Every method has assumptions.
Common Assumptions:
–Distributional form
–Independence
–Dispersion characteristics
–Homogeneity
–Basis for randomization
Example - One-sample t-test: random sample,
independence of data; sample mean is normally
distributed; no outliers; few "non-detects"
Identify Assumptions

Step 3: Output
Proposed statistical method that looks appropriate for
the data and the project objectives
List of assumptions for the statistical method

DQA Step 4: Verify the Assumptions
of the Statistical Method
Determine approach for verifying assumptions
Perform tests of assumptions
If necessary, determine corrective actions to be taken

Step 4: Input
Data
Assumptions identified for statistical method
Methods to verify these assumptions along with their
formulas

Determine Approach for Verifying
Assumptions and Perform Test
Evaluate Step 3 to see what assumptions need to be
verified
Determine what tests are available for verifying
assumptions for this dataset
Select test and appropriate significance level

Determine Corrective Actions
If this data set does not meet the needed assumptions,
determine the next steps that should be taken:
–Repeat Step 3 and select a different method for
analyzing the data
–Transform the data
–Reduce significance level
–Gather additional data
–Modify objective
▪ . . . but this should be done with caution.

Step 4: Output
Documentation of the method used to verify each
assumption and the results of these methods
Corrective actions (if necessary)

DQA Step 5:
Draw Conclusions from the Data
Perform the calculations for the statistical method
Evaluate the results and draw conclusions

Step 5: Input
Data
Objective, hypotheses (if applicable), and
performance or acceptance criteria
Formulas for statistical method
Non-statistical factors to incorporate into the final
decision or product

Perform the Statistical Method
Use formulas and procedures from standard text
books
Use software to perform the calculations:
–SPSS
–SAS or Splus
–R
–DataQUEST
Note: Software could be used on any data, whether
the assumptions have been verified or not. But when
the assumptions don't hold, then the results are
highly suspect.

Evaluate the Results
The statistical results are not necessarily the answer.
Factor in items like:
–Practical significance
–Political/social factors
–Contextual significance

Step 5: Output
Statistical results with a specified significance level
Final Product or Decision

DQA Steps - Summary Table
STEP 1
INPUT:
–QA Project Plan or any other planning documents
–Project objective or question to be answered; decision performance criteria or other performance and acceptance criteria
–Reports (e.g., Field Sampling Plan) on actual implementation of sampling plan
PROCESS:
–Translate objectives into a statement of the primary statistical hypothesis or estimation goal
–Translate objectives into tolerable limits on the probability of committing decision errors
–Review the sampling design and note any special features or deviations
OUTPUT:
–Well-defined project objectives and criteria
–Verification that the hypothesis chosen is consistent with the objective and criteria
–A list of deviations from the planned sampling design and the effects of these deviations
STEP 2
INPUT:
–Verified and Validated Data
–QA reports, QC data
–Technical systems audit results
–QA Project Plan, Sampling and Analysis Plan, or other planning documents
PROCESS:
–Review quality assurance reports for anomalies
–Calculate standard statistical quantities
–Display the data using graphical representations
OUTPUT:
–Statistical quantities and graphs that provide you with a preliminary understanding of the data and any potential issues
STEP 3
INPUT:
–Project objectives, hypotheses, and preliminary statistical method if identified
–Background on statistical methods
PROCESS:
–Select the statistical method based on the data user's objectives and the preliminary data review
–Identify the assumptions underlying the statistical method
OUTPUT:
–Proposed statistical method that seems appropriate for the data and the project objectives
–List of assumptions for the statistical method
STEP 4
INPUT:
–Data
–Assumptions identified for statistical method
–Methods to verify assumptions along with formulas
PROCESS:
–Determine approach for verifying assumptions
–Perform tests of assumptions
–If necessary, determine corrective actions to be taken
OUTPUT:
–Documentation of the methods used to verify each assumption and the results
–Corrective actions (if necessary)
STEP 5
INPUT:
–Data
–Hypotheses (if applicable) and performance or acceptance criteria
–Formula for statistical method
–Non-statistical factors to incorporate into the final decision or product
PROCESS:
–Perform the calculations for the statistical method
–Evaluate the results of the statistical method and draw conclusions
OUTPUT:
–Statistical results with a specified significance level
–Final product or decision

Data Quality
Assessment
Steps 1, 2, and 3

The Five Steps of Data Quality
Assessment
1. Review objectives and data collection design
2. Conduct a preliminary data review
3. Select the statistical method
4. Verify the assumptions of the statistical method
5. Draw conclusions from the data

A. Translate the project's objectives into a statement
of the primary statistical hypothesis or estimation goal
B. Translate the data user's objectives into
performance or acceptance criteria
C. Review the sampling design and note any special
features or deviation from the sample plan

Two Situations to Consider
Project Systematically Planned
–Use QA Project Plan and other planning documents
to perform this action
Project NOT Systematically Planned
–Use a systematic planning process (e.g., the Data
Quality Objectives Process) to plan retrospectively

Project Objective (Step 1A)
The objective indicates what the final outputs from
the project should be
For example:
–If the goal is to ascertain if the contamination
exceeds a threshold, the objective would be
"determine whether the contamination is greater
than X"
–If the goal is to ascertain if the contamination in the
soil has reached the groundwater, the objective
would be "determine whether the contamination
can be detected in the aquifer"

Different Analysis Methods
Estimation
Hypothesis Testing
Regression
Analysis of Variance
Time Series Analysis
Spatial Analysis

Defining the Boundaries (Step 1A)
Define the geographical area within which decisions
apply and the media of concern (Spatial Boundary)
Determine the time frame to which the study results
apply (Temporal Boundary)
Define a scale of decision making

Specifying Criteria (Step 1B)
Set quantitative performance or acceptance criteria
Consider consequences of any potential decision
errors. Consequences may include:
▪ Health risks
▪ Ecological risks
▪ Political risks
▪ Social risks
▪ Resource risks
EXAMPLE: When selecting between two opposing
conditions, define a 'gray region', false rejection error
limit, false acceptance error limit.

Review the Sampling Design (Step 1C)
Information needed: Original Sampling Plan - locations and types of samples to be taken.
Where it would be found: QA Project Plan or other planning documents. Documentation from the systematic planning process.

Information needed: Details on how the samples were actually collected.
Where it would be found: Summary reports from field notes, maps.

Information needed: Documentation of deviations from sampling plan.
Where it would be found: Must be developed based on comparison of original plan and actual implementation.

Information needed: Review of deviations to ensure that the implemented plan still meets the objectives for the project.
Where it would be found: Must be developed based on information from systematic planning on study objectives.

What if you are assessing data
collected by another project?
Gather available information about the way the
data were collected - for example, sample collection
plans, implementation of sample collection plans,
analytical method used, . . .

PCB Example: Background
Electronic Manufacturing Corporation of America
operated at site from 1965 to 1985 and sold the site to
Energy Components Company in 1985. Both
companies went bankrupt in 1990.
In 1991, chlorinated solvents were discovered in
water from city wells located in a field east of site.
Waste oil contaminated with PCBs was sprayed on a
dirt road on the site for dust suppression while the
site was operational.
Problem: Determine if the extent of PCB
contamination along the road presents unacceptable
risks and remedial action is needed.

PCB Example: Statistical Hypotheses
If the mean concentration of total PCBs in surface soil
(top 1 inch) over the dirt road exceeds 2 mg/Kg, then
take remedial action; otherwise, take no further
action.
Null Hypothesis: True Mean < 2 mg/Kg
vs.
Alternative Hypothesis: True Mean > 2 mg/Kg

PCB Example: Tolerable Limits on
Decision Errors
[Figure. Vertical axis: Probability of Deciding that the Mean Exceeds 2 mg/Kg. Horizontal axis: True Concentration of PCB (mg/Kg). Labeled features: Tolerable False Rejection Decision Error Rates, Tolerable False Acceptance Decision Error Rates, Action Level, Gray Region.]

PCB Example: Sampling Design
Composite samples selected using simple random sampling from the dirt road.
Dirt road was one of 4 strata at the site (stratum 1).
Each composite consists of 5 mini-samples.
[Figure label: Stratum 1]

PCB Example: Data
PCB concentration levels were measured (in mg/Kg)
from 16 surface soil samples (top one inch of soil)
from the dirt road. Each soil sample consists of 5
mini-samples composited together.
1.92 2.49 4.58 1.17
2.48 5.62 2.54 25.15
7.72 1.02 2.91 3.23
2.87 8.66 1.71 1.18

DQA Step 2:
Conduct Preliminary Data Review
A. Review quality assurance reports for anomalies:
▪ Anomalies in recorded data, missing values,
deviations from standard operating procedures,
failure to meet acceptance criteria, use of
nonstandard data collection methodologies
B. Calculate standard statistical quantities:
▪ Do the data look reasonable? Do the values make
sense? Are there any anomalies?
C. Display the data using graphical representations:
▪ Are there any trends? What is the distribution
like? Are there any extreme values?

Review QA Reports (Step 2A)
Data validation reports that document sample
collection, handling, analysis, data reduction, and
reporting procedures used
Quality control reports from laboratories or field
stations that document measurement system
performance, including data from check samples,
split samples, spiked samples, or any other internal
QC measures
Technical system reviews, performance evaluation
audits, and audits of data quality, including data from
performance evaluation samples

Summary Statistics (Step 2B)
Central Tendency: mean, median, mode
Relative Standing:
–5th percentile
–25th percentile
–50th percentile
–75th percentile
–95th percentile
–99th percentile
Dispersion: variance, standard deviation, range
Association: correlation coefficient, regression

Commonly Used Graphs (Step 2C)
–Histogram
–Stem-and-Leaf
–Ranked Data Plot
–Quantile Plot
–Normal Probability Plot
–Scatter Plot
–Time Plot
–Spatial Correlogram
–Posting Plot
–Symbol Plot

PCB Example: Statistical Quantities
Number of Observations: 16
Minimum: 1.020 Maximum: 25.150
Mean: 4.703 Median: 2.705
Variance: 34.808 Standard Deviation: 5.900
Range: 24.130 Interquartile Range: 3.285
Coefficient of Variation: 1.254
Coefficient of Skewness: 2.818
Coefficient of Kurtosis: 7.321
Percentiles:
1st: 1.020 75th: 5.100
5th: 1.020 90th: 8.660
10th: 1.170 95th: 25.150
25th: 1.815 99th: 25.150
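The quantities above can be reproduced from the 16 PCB measurements with the Python standard library. This is a sketch under stated assumptions: the `percentile` helper is hypothetical and uses a convention inferred from the tabulated values (average the two neighboring ordered values when n × p/100 is a whole number, otherwise take the next ordered value); other software may use different percentile conventions and give slightly different results.

```python
import statistics

# PCB concentrations (mg/Kg) from the 16 composite surface soil samples.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]


def percentile(values, pct):
    """Percentile for an integer pct (1-99) using the assumed convention."""
    if not 0 < pct < 100:
        raise ValueError("pct must be between 1 and 99")
    ordered = sorted(values)
    whole, frac = divmod(len(ordered) * pct, 100)
    if frac == 0:
        # n*pct/100 is a whole number: average the two neighboring values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise take the next ordered value.
    return ordered[whole]


mean = statistics.mean(pcb)
sd = statistics.stdev(pcb)  # sample standard deviation (n - 1 divisor)
summary = {
    "n": len(pcb),
    "minimum": min(pcb),
    "maximum": max(pcb),
    "mean": mean,
    "median": statistics.median(pcb),
    "variance": statistics.variance(pcb),
    "standard deviation": sd,
    "range": max(pcb) - min(pcb),
    "interquartile range": percentile(pcb, 75) - percentile(pcb, 25),
    "coefficient of variation": sd / mean,
}
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The mean (about 4.7) sits well above the median (about 2.7), which is an early hint that the distribution is skewed by the largest value.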

PCB Example: Ordered Data Plot

PCB Example: Summary of Statistical
Quantities and Graphs
Not symmetric
One extreme value (25.15)
Does not appear to be normally distributed

DQA Step 3:
Select the Statistical Method
A. Select the statistical method based on the project's
objectives and the preliminary data review
B. Identify the assumptions underlying the statistical
method

If the statistical method has been identified during
planning, ensure it is applicable based on the
preliminary review of the data
Otherwise, select statistical method based on the data
user's objectives and the preliminary data review
Example Tests:
–One-sample t-test
–Two-sample t-test
–Test for a single proportion
–Wilcoxon Signed Rank Test
Select Method (Step 3A)

Each method has assumptions which can be found in
references on that test or in Guidance for Data Quality
Assessment
Common Assumptions:
–Distributional form
–Independence
–Dispersion characteristics
–Homogeneity
–Basis for randomization
Example: One-sample t-test: random sample;
independence of data; sample mean is normally
distributed; no outliers; not an excessive amount of
"non-detects"
Identify Assumptions (Step 3B)

ONE SAMPLE t-TEST ASSUMPTIONS:
No outliers (sample mean and standard deviation are
very sensitive to outliers)
Sample mean approximately normally distributed
Random sample (independence of the data values)
Relatively few values below the detection limit
PCB Example: Select Method and
Identify Assumptions

Example: Raw Data
ND 4.5 19.4
ND 5.8 19.5
ND 6.2 19.5
ND 6.8 19.7
ND 8.33348 19.7
ND 7.4 19.7
ND 12.5 19.8
ND 14.7 19.8
ND 19.0
204.6

Example: Ordered Data Plot
[Figure: data values plotted in order from lowest to highest]

6 5 4 3 1 60 3
9 6 55 5 5 5 6 7 7 8
2 50 0 0 1 2 3 3 3
7 7 7 7 7 5 5 5 5 5 5 45 5 6 7 7 8 8 8 8 8 9
4 4 4 4 4 4 3 3 3 3 2 40 0 2 3 4
35 9
Site 2 Ozone
June
Site 1 Ozone
June
Example: Stem and Leaf

Example: Posting Plot
15.4 11.4 7.4
12.3 8.3
43.0
14.7 10.7 6.7
10.5 6.5 2.4
Creek
Road

Steps 4 and 5

DQA Step 4: Verify the Assumptions
of the Statistical Test
A. Perform method to test assumptions:
▪ Determine what methods are available
▪ Select method and significance level
▪ Perform calculations
B. If necessary, determine corrective actions to be
taken, e.g., select different test in Step 3, reduce
significance level, gather additional data, modify
slightly the objectives of the study, etc.

Types of Assumptions (Step 4A)
Random sample
Independence of data
Distribution of Data
Existence of Outliers
Extent of non-detect data

Verify Random Sample
Review data collection plan to verify that some
element of random selection was used to locate the
samples
If judgmental sampling was used to choose some
sample locations, you may need to consult a
statistician to determine if the data collection is
"random enough" to be able to draw reliable
conclusions from the data
If judgmental sampling was used to choose all sample
locations, the ability to draw conclusions about the
entire population is limited

Verify Independence of Data
For one variable (i.e., one contaminant) ensure there
are no trends within the data - for example, by
location, by time period, etc.
For two (or more) variables (i.e., two or more
contaminants), ensure that they are not correlated
with one another

Verify Distribution
Typical assumptions are:
–Normal distribution
–Symmetric distribution
An indication of the distribution can be gained by
reviewing the summary statistics and the graphs
Quantitative tests are available to verify whether a
data set has a specific distribution

Verify Outliers
Statistical outliers are anomalies with respect to the
proposed distribution for the data -- an extreme value
may be a statistical outlier but may still be a valid data
point
Before removing any statistical outliers, review these
values on both a scientific and a quality assurance basis
-- do not delete any values unless there is a scientific
or quality assurance based reason for the removal
If any data are deleted, run all analyses both with and
without the data to see the effect of the deletions
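Running an analysis with and without a suspect value can be as simple as computing the same summary twice. The sketch below uses the PCB example data and treats the maximum (25.15 mg/Kg) as the suspect value purely for illustration; the `describe` helper is a hypothetical name, and the sketch does not decide whether the value should actually be removed.

```python
import statistics

# PCB concentrations (mg/Kg) from the PCB example.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

suspect = max(pcb)  # the extreme value under review
without = [x for x in pcb if x != suspect]


def describe(values):
    """Sample size, mean, and sample standard deviation."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


with_all = describe(pcb)
without_suspect = describe(without)
print("with suspect value:   ", with_all)
print("without suspect value:", without_suspect)
```

Here a single value moves the mean from about 3.3 to about 4.7 mg/Kg, which is why both sets of results should be reported.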

Using Data Below the Detection Limit
Quality control information would indicate which data
points were non-detects
Methods for Addressing non-detects:
–Substitution, e.g., detection limit, 1/2 detection limit,
. . .
–Special algorithms, e.g., Winsorized Mean, Cohen's
Method, . . .
–Use of percentiles
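The substitution approach can be sketched as follows. The measurements, the detection limit, and the `substitute` helper are hypothetical and serve only to show how much the choice of substitute can move the estimated mean; this is not a recommendation of one substitute over another.

```python
import statistics

DETECTION_LIMIT = 1.0  # hypothetical detection limit, same units as the data
# Hypothetical measurements; None marks a value reported as a non-detect.
measurements = [None, 2.4, None, 3.1, 1.8, 5.0]


def substitute(values, fill):
    """Replace each non-detect with the chosen substitute value."""
    return [fill if v is None else v for v in values]


fraction_nd = sum(v is None for v in measurements) / len(measurements)
means = {
    "detection limit": statistics.mean(substitute(measurements, DETECTION_LIMIT)),
    "1/2 detection limit": statistics.mean(substitute(measurements, DETECTION_LIMIT / 2)),
}
print(f"fraction of non-detects: {fraction_nd:.2f}")
for rule, value in means.items():
    print(f"mean using {rule}: {value:.3f}")
```

The larger the fraction of non-detects, the more the result depends on the substitute chosen, which is one reason the other methods listed above exist.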

Types of Corrective Actions (Step 4B)
Select a different test or method of analysis
Transform the data to a different metric
Gather additional data or modify objective

PCB Example: Background
Electronic Manufacturing Corporation of America
(EMCA) operated at site from 1965 to 1985, when site
was sold to Energy Components Company (ECC).
Both companies went bankrupt in 1990.
In 1991, chlorinated solvents discovered in water from
city well field east of site.
Waste oil contaminated with PCBs was sprayed on a
dirt road for dust suppression.
Problem: Determine the extent of PCB contamination
on the dirt road that presents unacceptable risks.

No outliers?
Data are approximately normally distributed?
Random sample (for independence of data values)?
No data reported as "Not Detected?"
PCB Example: Assumptions of the t-Test
for a Single Mean

Is the extreme value of 25.15 a statistical outlier?
Extreme Value Test will be used - this test assumes
data without the outlier are normally distributed.
PCB Example: Identifying Outliers

Results: Data are Not Normally Distributed
PCB Example: Testing for Normality
Shapiro-Wilk W Test
Shapiro-Wilk Test
Null Hypothesis: 'Data are normally distributed'
Sample Value: 0.836
Tabled Value: 0.881
Non-normality has been detected at a 5% significance level.

Data without outlier appear to be lognormally
distributed in the Histogram
So, apply Shapiro-Wilk W Test to natural logarithms of
the data to test for lognormality -- if logged data are
normally distributed, then untransformed data are
lognormally distributed
Original data value 1.92 becomes 0.652, 2.49 becomes
0.912, etc.
PCB Example: Not Normally Distributed

PCB Example: Testing Data for
Lognormality
Shapiro-Wilk Test
Null Hypothesis: 'Data are normally distributed'
Sample Value: 0.953
Tabled Value: 0.881
There is not enough evidence to reject the assumption of
normality with a 5% significance level.
Result: Cannot reject the assumption that the data are
lognormally distributed and that the logs of the data are
normally distributed.

PCB Example: Discordance Test for
Outliers
Discordance Test for Outliers
Value Tested: 3.225 [ln(25.15)]
Sample Value: 2.476
Tabled Value: 2.443
Conclude 3.225 is an outlier at a 5%
significance level.
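The sample value above can be reproduced by computing, on the natural logs of all 16 measurements, how many sample standard deviations the largest value lies above the sample mean. This is a sketch under that assumption about how the statistic is formed; the critical value is copied from the tabled value quoted above, not computed.

```python
import math
import statistics

# PCB concentrations (mg/Kg) from the PCB example.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

logs = [math.log(x) for x in pcb]  # natural logarithms of all 16 values
value_tested = max(logs)           # ln(25.15)

# Distance of the largest logged value from the mean, in standard deviations.
sample_value = (value_tested - statistics.mean(logs)) / statistics.stdev(logs)

TABLED_VALUE = 2.443  # tabled value quoted in the example (5% significance)
is_outlier = sample_value > TABLED_VALUE

print(f"value tested: {value_tested:.3f}")
print(f"sample value: {sample_value:.3f}")
print("statistical outlier at 5%:", is_outlier)
```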

PCB Example: Assumptions Satisfied?
No outliers?
Data approximately normally distributed?
Random sample?
All data are above the detection limit?

PCB Example: A New Look at the
Logged Data
Number of Observations: 15
Minimum: 0.020 Maximum: 2.159
Mean: 1.002 Median: 0.932
Variance: 0.428 Standard Deviation: 0.654
Range: 2.139 Interquartile Range: 0.985
Coefficient of Variation: 0.653
Coefficient of Skewness: 0.244
Coefficient of Kurtosis: -0.783
Percentiles:
1st: 0.020 75th: 1.522
5th: 0.020 90th: 2.044
10th: 0.157 95th: 2.159
25th: 0.536 99th: 2.159
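These figures follow from taking natural logs of the 15 measurements that remain once the value 25.15 is set aside. A minimal sketch with the standard library, assuming the same PCB data as earlier in the example:

```python
import math
import statistics

# PCB concentrations (mg/Kg) from the PCB example.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

outlier = max(pcb)
# Natural logs of the 15 values that remain without the outlier.
logged = sorted(math.log(x) for x in pcb if x != outlier)

log_summary = {
    "n": len(logged),
    "minimum": logged[0],
    "maximum": logged[-1],
    "mean": statistics.mean(logged),
    "median": statistics.median(logged),
    "variance": statistics.variance(logged),
    "standard deviation": statistics.stdev(logged),
}
for name, value in log_summary.items():
    print(f"{name}: {value:.3f}")
```

After the transformation the mean and median nearly coincide, consistent with the much smaller skewness reported for the logged data.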

DQA Step 5: Perform Method and
A. Perform the calculations for the statistical method
B. Evaluate the results and draw conclusions

Perform the Statistical Method (Step 5A)
For basic methods, formulas and procedures are
available in standard text books
Software can also do the calculations:
–SAS, SPSS, or SPlus
–DataQUEST

A statistical test can give two results:
–Reject the baseline condition or
–Fail to reject the baseline condition
If the test does not reject the baseline condition, then
the false acceptance error rate must be verified
Evaluate the Results (Step 5B)

If the baseline condition (Null Hypothesis) is true, a
result extreme enough to be rejected at a 5% significance
level occurs by chance less than one time in twenty.
Because such a result is unlikely when the baseline
condition is true, the baseline condition is rejected and
the alternative condition is selected (at a significance
level of 5%).
Significant at 5%

PCB Example: Perform Calculations
Formula:
$$ t = \frac{\bar{X} - AL}{s / \sqrt{n}} = \frac{1.002 - 0.6931}{0.654 / \sqrt{15}} \approx 1.83 $$
▪ Reject baseline condition if $t > t_{1-\alpha}$
$\alpha = 0.05$; $t_{0.95} = 1.761$ for $n - 1 = 14$ degrees of freedom
AL = 0.6931 [the natural log of 2.0]
Results: Reject baseline condition at a 5% level of
significance.
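The calculation can be checked with a few lines of Python. This sketch assumes the one-sample t statistic is formed on the logged data without the outlier, as in the preceding slides. The critical value is hard-coded from a standard t table for n - 1 = 14 degrees of freedom (one-sided, 5%) rather than computed, and because the code works from unrounded logs its result may differ in the third digit from a hand calculation with rounded inputs.

```python
import math
import statistics

# PCB concentrations (mg/Kg) from the PCB example.
pcb = [1.92, 2.49, 4.58, 1.17, 2.48, 5.62, 2.54, 25.15,
       7.72, 1.02, 2.91, 3.23, 2.87, 8.66, 1.71, 1.18]

ACTION_LEVEL = 2.0  # mg/Kg
outlier = max(pcb)
logged = [math.log(x) for x in pcb if x != outlier]

n = len(logged)
mean_log = statistics.mean(logged)
sd_log = statistics.stdev(logged)

# One-sample t statistic against the natural log of the action level.
t_stat = (mean_log - math.log(ACTION_LEVEL)) / (sd_log / math.sqrt(n))

# One-sided 5% critical value for 14 degrees of freedom, taken from a
# standard t table (an assumption of this sketch, not computed here).
T_CRITICAL = 1.761
reject_baseline = t_stat > T_CRITICAL

print(f"n = {n}, mean = {mean_log:.3f}, s = {sd_log:.3f}")
print(f"t = {t_stat:.3f}, critical value = {T_CRITICAL}")
print("reject baseline condition:", reject_baseline)
```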

PCB Example: Conclusions
The t-test rejected the null hypothesis that the mean
is less than the action level at a 5% significance level
Statistically, the mean concentration of total PCBs in
surface soil (top 1 inch) over the dirt road exceeds
2 ppm and remedial action is necessary

Research Example:
Inhalation Exposure to
Manganese

Manganese Exposure
Manganese (Mn) is a metal emitted from industrial
operations (e.g., steel mills) and cars using gasoline
with additive MMT
Mn is an essential dietary requirement in trace
amounts but higher exposures are toxic:
–occupational exposure shown to affect motor skills
–disruption of neurotransmitters may adversely
affect higher-level cognitive skills
–exposure during infancy and early childhood may
cause learning disabilities and emotional problems
Research Problem: Determine if elevated exposure to
Mn during early childhood can be linked to cognitive
dysfunction in teenagers

Manganese Exposure Routes
Mn is usually inhaled or ingested
MMT can be absorbed through the skin
Key route of concern: Mn is transported to the brain
via the olfactory nerve when Mn in coarse particles is
inhaled
Exposure to Mn can be estimated by analyzing
ambient air monitoring data in Total Suspended
Particulates (TSP) - undisturbed dust in attics and
other areas of homes can be an indicator of long-term
exposure

Translate the data user's objectives into a statement
of the primary statistical hypothesis or estimation
goal
Translate the data user's objectives into tolerable
limits on the probability of committing decision errors
Review the sampling design and note any special
features or deviation from the sample plan

Pilot Study: Objective and Background
Objective: Determine whether the proposed study
area is suitable for a large-scale investigation
Background
–Range of Mn exposures varies from very high to very
low
–Concentrations of Mn in air and dust decrease with
distance from known point source (steel mill)
–There are few, if any, confounding factors, such as
lead in paint

Performance Criteria for Pilot Study
Exploratory nature of pilot study does not warrant
specification of rigorous performance criteria;
however, results must support the larger study goals
–Detect correlation of 40% with at least 80% power
–Identify very high and very low exposures
Sample size for TSP analysis driven by existing
ambient air monitoring network design for air
measurements; sample size for dust samples based
on performance criteria, budget, professional
judgment, and experience with similar studies

Pilot Study: Statistical Methods and
Performance Criteria
Statistical Methods:
–Determine correlation between distance to steel mill
and Mn concentrations in air (TSP) and house dust
–Estimate annual average air concentrations of Mn
in TSP at various distances from steel mill
10% significance level chosen for testing hypotheses
about relationships between Mn exposure
concentrations and results of cognitive tests

Pilot Study: Performance Criteria
(continued)
Correlations expected to be relatively low (10%-50%)
and negative
False negative error rate specified by determining
power under various assumptions of correlation
Required sample size, by power (rows) and true
population correlation (columns):
Power    10%    20%    30%    40%    50%
50%      166     42     19     11      7
60%      237     60     27     15     10
70%      327     82     36     20     13
75%      383     96     42     23     14
80%      451    113     49     27     17
85%      537    134     58     32     19
90%      656    163     72     39     24
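Sample sizes of this kind can be approximated with Fisher's z transformation. The sketch below is an illustration, not the method used to build the table: it assumes a one-sided test at the 10% significance level, and the function name `n_for_correlation` is hypothetical. Under those assumptions it matches the 10% column and lands within one unit of the other entries checked.

```python
import math
from statistics import NormalDist

def n_for_correlation(rho, power, alpha=0.10):
    """Approximate sample size needed to detect a true correlation rho.

    Uses the Fisher z approximation for a one-sided test (an assumption):
    n = ((z_alpha + z_power) / atanh(|rho|))^2 + 3, rounded to the
    nearest whole number.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    fisher_z = math.atanh(abs(rho))
    return round(((z_alpha + z_power) / fisher_z) ** 2 + 3)

# Print a few rows for comparison with the table above.
for power in (0.50, 0.80, 0.90):
    row = [n_for_correlation(r, power) for r in (0.1, 0.2, 0.3, 0.4, 0.5)]
    print(f"power {power:.0%}: {row}")
```

The pilot study's goal of detecting a 40% correlation with 80% power corresponds to the table entry of 27, which the 30 sampled households satisfy.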

Review Sampling Design
2-stage stratified random sampling design
–Sampling frame was phone list in the target area
around the steel mill, pre-filtered to include
households with children age 13-17
–Sampling population was stratified by distance to
the steel mill
–First stage of sampling was random-digit-dialing to
recruit households who met the study criteria
–Second stage of sampling was within the set of
eligible and willing participants
For each household selected, dust from the attic or
basement was collected using a vacuum device per
established protocols

Review Data Collection Reports
30 households were successfully sampled including
individual houses, attached townhouses, and
apartments
The field report indicated that 50% of the households
were sampled in a basement furnace area, 10% in
other basement areas, 10% in attics, and the
remainder in miscellaneous crawl spaces or other
furnace areas
All of these locations seem reasonable in light of the
project objectives

Quality Control Reports
There were no "below the detection limit" values in
the Mn data set
Duplicate sample analyses and instrument analyses
indicated excellent agreement, all within 5%
There were no significant anomalies in the QC reports

Summary Statistics for Mn in Dust
N = 30
Statistic    Value (ppm)
Mean         924.23
Std Dev      2566.9
C.V.         2.78 (unitless)
Median       390.0
Mode         N.A.
Range        14420
Minimum      16
Maximum      14436
Percentile   Value (ppm)
95%          1339
75%          699
50%          390
25%          258
5%           130
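These statistics can be recomputed from the 30 Mn values listed with the scatter plot data. A minimal sketch using only the Python standard library; the variable names are illustrative, and the percentiles are omitted because they depend on the interpolation rule chosen.

```python
import statistics

# Mn in dust (ppm) for the 30 sampled households, as listed on the slides.
mn_ppm = [465, 712, 456, 502, 448, 414, 341, 212, 619, 1339,
          243, 250, 16, 203, 130, 327, 332, 309, 350, 366,
          258, 225, 743, 468, 14436, 726, 347, 1037, 754, 699]

mean = statistics.mean(mn_ppm)
std_dev = statistics.stdev(mn_ppm)        # sample standard deviation (n - 1)
median = statistics.median(mn_ppm)
cv = std_dev / mean                       # coefficient of variation
value_range = max(mn_ppm) - min(mn_ppm)

print(f"N = {len(mn_ppm)}")
print(f"Mean = {mean:.2f}, Std Dev = {std_dev:.1f}, C.V. = {cv:.2f}")
print(f"Median = {median}, Range = {value_range}")
```

A mean more than twice the median, together with a coefficient of variation well above 1, is an early sign that one or more extreme values dominate the data set.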

Histogram of Mn in Dust Example
Conc. (ppm) Frequency
0-99 1
100-199 1
200-299 6
300-399 7
400-499 5
500-599 1
600-699 2
700-799 4
800-899 0
900-999 0
1000-1099 1
1100-1199 0
1200-1299 0
1300-1399 1
1400-1499 0
1500 + 1

Box-and-Whiskers Plots of Mn in Dust
[Box plot, Full Data Set: minimum 16, median 390, mean 924, potential outlier at 14436]
[Box plot, Without Potential Outlier: minimum 16.0, median 256.0, mean 458.3, maximum 1339.0]

Scatter Plot of Mn vs. Distance
using a logarithmic scale for Mn
Mn (ppm) Distance (mi)
465 39.06
712 20.98
456 20.29
502 27.42
448 34.82
414 27.53
341 39.26
212 16.88
619 34.86
1339 30.80
243 42.77
250 39.26
16 39.19
203 39.26
130 39.41
327 12.83
332 12.15
309 16.17
350 12.82
366 12.25
258 14.67
225 6.66
743 6.60
468 16.08
14436 13.85
726 14.74
347 16.29
1037 1.21
754 1.22
699 13.86

Correlation Between Mn and
Distance to Point Source
Two potential measures of correlation:
–Pearson's correlation coefficient
 • detects linear relationship between two sets of
values
 • sensitive to extreme values
 • not affected by linear transformations
–Spearman's rank correlation coefficient
 • uses ranks of data values
 • less sensitive to extreme values
 • not affected by monotonic nonlinear
transformations

Assumptions to be Verified for Pearson's
Correlation Coefficient
Random Sample
Independence of data
Linear Relationship
No Outliers
Normal Distribution

Assumption: Random Sample
Verification Method: Review how the data collection
plan was developed and implemented (i.e., how the
sample locations were chosen)
Verification Results: The QA Project Plan and the
data collection documentation report that sample
locations were chosen at random
It is highly probable that this assumption of a random
sample has been met

Assumption: Independence of Data
Verification Method: Rank von Neumann test
Verification Result: Using the baseline condition (Null
Hypothesis) that there is no serial correlation present,
the Rank von Neumann test concludes that there is no
serial correlation at the 10% level of significance
It is highly probable that this assumption of
independence has been met

Assumption: Few Values Below
Detection Limit
Verification Method: Review data
Verification Result: There are none in this data set
This assumption of few values below the detection
limit has been met

Assumption: No Outliers
Verification Method:
–Review histogram: Look for very high or very low
values as compared to the rest of the distribution
–Review Box Plot: Potential outliers show up
quickly.
–Use Rosner's Test on any extreme values (with a
5% significance level).

Review Box and Whiskers Plot
Looks like an extreme outlier!
[Box plot, full data set: minimum 16, median 390, mean 924, potential outlier at 14436]

Assumption: No Outliers
Verification Result: Rosner's test has determined that
the observation 14436 is a statistical outlier at the 5%
significance level and should be investigated further.
(The sample value was 5.264 and the critical value
was 2.910.)
This assumption of no outliers has not been met!

Look at the data without the outlier...
Looks better, but still may have an outlier...
[Box plot, without 14436: minimum 16.0, median 256.0, mean 458.3, maximum 1339.0]

Check for outliers in censored data set
Verification Result: Rosner's test has determined that
the observation 1339 is a statistical outlier at the 5%
significance level and should be investigated further.
(The sample value was 3.129 and the critical value
was 2.890.)
HOWEVER, Rosner's test assumes normality...
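The "sample values" quoted for the two outlier checks above can be reproduced as the largest absolute deviation from the mean divided by the sample standard deviation, which is the statistic Rosner's procedure evaluates at each step. The sketch below computes only that statistic; the critical values (2.910 and 2.890) are taken from the slides rather than calculated, and the helper name `extreme_deviate` is hypothetical.

```python
import statistics

def extreme_deviate(values):
    """Return the most extreme value and its studentized deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)    # sample standard deviation
    extreme = max(values, key=lambda v: abs(v - mean))
    return extreme, abs(extreme - mean) / std_dev

# Mn in dust (ppm) for the 30 sampled households, as listed on the slides.
mn_ppm = [465, 712, 456, 502, 448, 414, 341, 212, 619, 1339,
          243, 250, 16, 203, 130, 327, 332, 309, 350, 366,
          258, 225, 743, 468, 14436, 726, 347, 1037, 754, 699]

# Step 1: full data set, compared with the slide's critical value of 2.910.
first_value, first_stat = extreme_deviate(mn_ppm)

# Step 2: drop that observation and repeat (slide's critical value: 2.890).
remaining = [v for v in mn_ppm if v != first_value]
second_value, second_stat = extreme_deviate(remaining)

print(f"{first_value}: statistic {first_stat:.3f} vs critical 2.910")
print(f"{second_value}: statistic {second_stat:.3f} vs critical 2.890")
```

Both statistics depend on the mean and standard deviation, which the suspected outliers themselves inflate, and the critical values assume normality. That is why the next step checks the distribution before trusting the result.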

Shapiro-Wilk Test for Normality
The Shapiro-Wilk test for normality determined that the
baseline condition that the data are normally distributed
can be rejected at the 5% significance level (Sample
value = 0.900; Critical Value = 0.926)
Non-normality has been detected at the 5% level of
significance, but only just!

What about a log transform of
the censored data set?
[Box plot of the censored data set; labels: 16.0, (median) 256.0, (mean) 458.3, 1339.0]
Still looks skewed:
Shapiro-Wilk test
confirms that data
are not strictly normal.
[Histogram: Frequency (0 to 15) vs. Concentration (ppm), from 16 to 1339]

What should we conclude about
assumptions?
Data seem to be an independent random sample
Data are not normally distributed (or lognormal,
either)
Data probably have multiple outliers -- they are heavily
skewed to the right.
Methods such as Pearson's correlation coefficient,
that are sensitive to extreme values, are not
recommended
Spearman's Rank Correlation Coefficient is a better
approach for evaluating the relationship between Mn
in dust vs. distance to point source

DQA Step 5: Perform Method and
Perform the calculations for the statistical method
Evaluate the results of the statistical method and
draw conclusions

Spearman's Rank Correlation Coefficient
Results are less sensitive to extreme values (use the
original full data set)
The Spearman's Correlation Coefficient of distance
from the source to Mn concentration is -0.406
 • The correlation is negative (as suspected): the
concentration tends to decrease as distance from the
point source increases
 • The relationship is not strong (a coefficient of
about -0.4; its square suggests that only about 16% of
the variation in the ranks is associated with distance)
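The coefficient can be reproduced from the 30 (Mn, distance) pairs listed with the scatter plot. A minimal sketch in plain Python, assuming tied values receive their average rank and that Spearman's coefficient is computed as Pearson's correlation of the ranks; the helper names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1             # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

mn_ppm = [465, 712, 456, 502, 448, 414, 341, 212, 619, 1339,
          243, 250, 16, 203, 130, 327, 332, 309, 350, 366,
          258, 225, 743, 468, 14436, 726, 347, 1037, 754, 699]
distance_mi = [39.06, 20.98, 20.29, 27.42, 34.82, 27.53, 39.26, 16.88,
               34.86, 30.80, 42.77, 39.26, 39.19, 39.26, 39.41, 12.83,
               12.15, 16.17, 12.82, 12.25, 14.67, 6.66, 6.60, 16.08,
               13.85, 14.74, 16.29, 1.21, 1.22, 13.86]

rho = spearman(distance_mi, mn_ppm)
print(f"Spearman's rho = {rho:.3f}")
```

Because only ranks enter the calculation, the value 14436 counts simply as the largest observation, so the full data set can be used without removing it.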

Interpret the Results
Factors other than distance-to-point-source that
determine concentrations of Mn in dust:
–Infiltration rates into the dwelling
–Housekeeping practices
–Age of dwelling (how long the dust was left
undisturbed)
–Changing wind patterns in the region
Those other factors mediate the relationship between
Mn in dust and distance to point source.
Nonetheless, the researchers believe there is enough
of a relationship to support the use of distance as a
basis for stratifying the population in the larger study
