Lab #1 Basic Statistics
• Definition of STATISTICS
• 1: a branch of mathematics dealing with the collection, analysis,
interpretation, and presentation of masses of numerical data
• 2: a collection of quantitative data
• Origin of STATISTICS: German Statistik study of political facts and figures,
from New Latin statisticus of politics, from Latin status state
• First Known Use: 1770
• Rhymes with STATISTICS: ballistics, ekistics, linguistics, logistics, patristics
Why is this important?
∗ Need to know relationships
∗ Parameters (examples):
Amount of a chemical or other
material in air, water, soil
∗ pH meter
∗ Gas Chromatography
∗ Ozone monitor
Morning Session of FE Exam
Engineering Probability and Statistics Topic Area
The following subtopics are covered in the Engineering Probability and
Statistics portion of the FE Examination:
A. Measures of central tendencies and dispersions (e.g., mean, mode, standard deviation)
B. Probability distributions (e.g., discrete, continuous, normal, binomial)
C. Conditional probabilities
D. Estimation (e.g., point, confidence intervals) for a single mean
E. Regression and curve fitting
F. Expected value (weighted average) in decision-making
G. Hypothesis testing
The Engineering Probability and Statistics portion covers approximately 7% of
the morning session test content.
• “Sample” versus “population”
• Random variables
• Population mean (μ), variance (σ²) & standard deviation (σ), kurtosis, skewness
• Sample equivalents: sample mean (ȳ), variance (s²), and standard deviation (s)
• Frequency distribution/histogram (relates to skewness)
• Precision and accuracy, Confidence interval
• Linear regression
Some Key Ideas
• It is impossible to determine the concentrations of a
given pollutant at every possible location at a site.
• Statistical methods allow us to use a small number of
samples to make inferences about the entire site.
• A single sample is a subset of all the possible samples that could be taken from a given site.
–Multivariate data sets have several data values generated for each location and time, as opposed to univariate data sets, which have one.
• The hypothetical set of all possible values is referred to
as the population.
Key Ideas: continued
• Number of samples collected is the sample size (n).
• A random variable is a variable whose value is determined by a chance outcome.
• Experimental observations are considered random variables.
• Experimental errors contribute to this randomness.
Key Ideas continued
∗ Experimental measurements are always imperfect:
∗ Measured value = true value ± error
∗ The error is a combined measure of the inherent variation
of the phenomenon we are observing and the numerous
factors that interfere with the measurement.
∗ Any quantitative result should be reported with an
accompanying estimate of its error.
∗ Systematic errors (or determinate errors) can be traced to their source (e.g., improper sampling or analytical technique) and corrected.
∗ Random errors (or indeterminate errors) are random
fluctuations and cannot be identified or corrected for.
Example: Population versus Sample
[Figure: numbered grid illustrating a population of units versus a smaller sample drawn from it]
• Accuracy is the degree of agreement of a measured
value with the true or expected value.
• Precision is the degree of mutual agreement among individual measurements (x1, x2, …, xn) made under the same conditions.
• Precision measures the variation among measurements
and may be expressed as sample standard deviation (s):
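As a quick sketch of accuracy versus precision in code (using Python's standard `statistics` module rather than Excel; the five measurement values are hypothetical replicates of a known 8.0 mg/L standard):

```python
import statistics

# Five hypothetical replicate measurements of a known 8.0 mg/L standard
measurements = [7.8, 8.1, 8.0, 7.9, 8.2]

mean = statistics.mean(measurements)  # accuracy: compare mean to the true value 8.0
s = statistics.stdev(measurements)    # precision: sample standard deviation

print(f"mean = {mean:.2f} mg/L, s = {s:.3f} mg/L")
# → mean = 8.00 mg/L, s = 0.158 mg/L
```

A mean close to 8.0 indicates good accuracy; a small s indicates good precision. The two are independent: measurements can cluster tightly around the wrong value.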
Accuracy and Precision
Example: Five analysts were each given five samples that were
prepared to have a known concentration of 8.0 mg/L. The results are
summarized in the figure below.
Accuracy and Precision
• A random variable y is characterized by:
• A set of possible values.
• An associated set of relative likelihoods (this is called a probability distribution).
• Random variables can be discrete or continuous.
e.g., a die toss is a discrete random variable.
e.g., ozone conc. is a continuous random variable.
• Experimental observations are considered random variables.
• When we sample the environment, the sample values are
known, but not the population values.
• For a sample size n, the number of times a specific value occurs is called the frequency.
• The frequency divided by the sample size n is the relative frequency.
• The relative frequency is an estimate of the probability that a given value occurs in the population.
• If we compute the relative frequencies for each possible value of a random variable, we have an estimate of the probability distribution of the random variable (see next slide).
• For continuous random variables, we can group the
measured values into intervals (or “bins”).
• Plotting the number of values measured in each interval
gives a frequency histogram (see next slide).
• Plotting the total number of measured values in or below
a given interval gives a cumulative frequency distribution
(see next slide).
• To obtain the relative frequency, the number of measured
values falling within a given interval is divided by the
sample size n.
• The shape of a histogram can allow us to infer the
distribution of the population.
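The binning steps above can be sketched in Python (the ozone readings and the 10 ppb bin width are hypothetical choices for illustration):

```python
from collections import Counter

# Hypothetical hourly ozone readings (ppb), grouped into 10 ppb bins
readings = [12, 18, 22, 25, 27, 31, 33, 34, 38, 41, 44, 52]
width = 10

bins = Counter((r // width) * width for r in readings)  # frequency per bin
n = len(readings)

cumulative = 0
for lower in sorted(bins):
    freq = bins[lower]
    cumulative += freq            # running total gives the cumulative distribution
    rel = freq / n                # relative frequency estimates P(value in bin)
    print(f"[{lower},{lower + width}): freq={freq}, rel={rel:.2f}, cum={cumulative}")
```

Plotting `freq` per bin gives the frequency histogram; plotting `cumulative` gives the cumulative frequency distribution.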
Continuous Frequency Distributions
Normal (Gaussian) and skewed
Bimodal and Uniform
∗ In general, we do not know the mean and standard
deviation of the underlying population.
∗ The population mean μ and standard deviation σ can be estimated from the sample mean ȳ and sample standard deviation s:
∗ Note that in environmental monitoring, the standard
deviation s for the sample depends on the amount of
Sample Mean and Standard Deviation
ȳ = (1/n) Σ yᵢ        s² = Σ (yᵢ − ȳ)² / (n − 1)        s = √s²
In many situations, environmental data involves working with a small sample set.
Also known as Bessel’s correction or unbiased estimate.
Another way of looking at it:
The POPULATION VARIANCE (σ2) is a PARAMETER of the population.
The SAMPLE VARIANCE (s²) is a STATISTIC of the sample.
We use the sample statistic to estimate the population parameter.
The sample variance s2 is an estimate of the population variance σ2.
Note: Excel 2010 has two functions for standard deviation: one for a population (=STDEV.P(range)) and one for a sample (=STDEV.S(range)).
A note about (n-1)
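The population/sample distinction above maps directly onto Python's `statistics` module, which mirrors the two Excel functions (the data values are hypothetical, and here we pretend they form an entire population):

```python
import statistics

data = [6, 7, 8, 9, 10]  # pretend this is the entire population

pop_var = statistics.pvariance(data)   # divides by n      (like Excel =VAR.P)
samp_var = statistics.variance(data)   # divides by n - 1  (like Excel =VAR.S)

print(pop_var, samp_var)  # → 2.0 2.5
```

The sample variance is larger because dividing by n − 1 (Bessel's correction) compensates for the sample mean being closer to the sample values than the true population mean is.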
• Most random variables have two important characteristic values: the mean (μ) and the variance (σ²).
• The square root of the variance is the standard deviation (σ).
• The mean is also called the expected value of the random variable.
• The mean represents the balance point of the distribution's graph.
• The variance & standard deviation both quantify how
much the possible values disperse away from the mean.
• For a normal distribution, 68% of values lie within µ ± σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ.
Mean, Variance, Standard Deviation
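The 68/95/99.7 rule can be verified numerically with `statistics.NormalDist` (available in Python 3.8+):

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mu = 0, sigma = 1

for k in (1, 2, 3):
    inside = z.cdf(k) - z.cdf(-k)  # P(mu - k*sigma < Y < mu + k*sigma)
    print(f"within {k} sigma: {inside:.4f}")
# → within 1 sigma: 0.6827
# → within 2 sigma: 0.9545
# → within 3 sigma: 0.9973
```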
Mean, Median, Mode
∗ The coefficient of variation is a simple screening test for whether the data can be characterized by a normal distribution. It is the standard deviation divided by the mean. The closer the ratio is to zero, the more plausible a normal distribution; a value greater than unity indicates a non-normal distribution.
∗ Skewness is a measure of symmetry or the lack of it, and can be zero, negative, or positive.
∗ Kurtosis is a measure of whether the data are flat or peaked relative to a normal distribution.
Coefficient of Variation, Skewness, Kurtosis
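The ratio described above (standard deviation over mean) is a one-liner in Python; the concentration values are hypothetical:

```python
import statistics

data = [8.2, 7.9, 8.4, 8.1, 7.8, 8.0]  # hypothetical concentrations, mg/L

# Coefficient of variation: sample standard deviation divided by the mean
cv = statistics.stdev(data) / statistics.mean(data)
print(f"coefficient of variation = {cv:.3f}")
```

Here the ratio is well below unity, consistent with (though not proof of) a normal distribution.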
Normal Distribution at 68%, 95%, 99.7%
The α value is the probability that a random variable will fall in the upper or lower tail of a probability distribution.
For example, α = 0.05 implies that there is a 0.95
probability that a random variable will not fall in the
upper or lower tail of the probability distribution.
Statistical tables of probability distributions (e.g.,
normal and “student t”) list probabilities that a random
variable will fall in the upper tail only.
α Values for Probability Distributions
• We typically want to determine a confidence interval
for which we are 90% confident that a random
variable will not fall in either tail.
• In this case, we use an α/2 = 0.05.
• Similarly, to determine 95% and 99% confidence intervals, we would use α/2 = 0.025 and 0.005, respectively.
α values and confidence intervals
CI = ȳ ± t(α/2, n−1) (s/√n)
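A sketch of the confidence-interval calculation in Python, with hypothetical ozone readings. The standard library has no Student t distribution, so this uses the large-sample z value from `NormalDist.inv_cdf` in place of t(α/2, n − 1); for a small n like this one, a t table would give a slightly wider interval:

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

data = [31, 28, 35, 30, 33, 29, 32, 34, 30, 31]  # hypothetical readings, n = 10
n = len(data)
ybar, s = mean(data), stdev(data)

alpha = 0.05                             # 95% confidence interval
z = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96; stand-in for t(alpha/2, n-1)

half_width = z * s / sqrt(n)
print(f"{ybar:.2f} ± {half_width:.2f}")
```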
Regression analysis (dependency) – an analysis focused on the degree to which one variable (the dependent variable) depends on one or more other variables (the independent variables).
(examples: ozone vs. temperature, bacteria counts
versus chlorination treatment)
Correlation analysis – neither variable is identified as
more important than the other, but the investigator is
interested in their interdependence or joint behavior
NOTE: Correlation or association is not causation.
Linear Regression Examples
• Line equation: y = mx + b, where m is the slope and b is the intercept
• coefficient of determination, R2 is used in the context of statistical models whose main
purpose is the prediction of future outcomes on the basis of other related
information. It is the proportion of variability in a data set that is accounted for by the
statistical model. It provides a measure of how well future outcomes are likely to be
predicted by the model.
R2 does NOT tell whether:
• the independent variables are a true cause of the changes in the dependent variable
• omitted-variable bias exists
• the correct regression was used
• the most appropriate set of independent variables has been chosen
• there is collinearity present in the data
• the model might be improved by using transformed versions of the existing set of independent variables
R2, Slope Equation
Statistics Excel 2010
Standard Error 5.013678308
Standard Deviation 18.75946647
Sample Variance 351.9175824
Confidence Level(95.0%) 10.83139138
Ozone April 2013
Histogram and Summary Statistics
Standard Dev 10.72231
Sample Variance 114.968
April 2013 Ozone
Population size: 713
First quartile: 28
Third quartile: 43
Interquartile Range: 15
Outliers: 2 5 5 5 6 8 10 11 11 68 65 64 62 62 61 61 60 59 58 58 58
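A sketch of the quartile/IQR summary in Python with a small hypothetical data set (the real lab data comes from the TCEQ site); `statistics.quantiles` uses the exclusive method by default, which may differ slightly from Excel's quartile functions, and the outlier rule shown is the common 1.5 × IQR fence:

```python
import statistics

# Hypothetical ozone values (ppb), sorted for readability
values = [2, 25, 28, 30, 31, 32, 33, 34, 35, 36, 38, 40, 42, 44, 90]

q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles (exclusive method)
iqr = q3 - q1

lo_fence = q1 - 1.5 * iqr
hi_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lo_fence or v > hi_fence]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```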
∗ Access TCEQ web site data.
∗ Importing files into Excel and Matlab.
∗ Using Excel for statistical work, Matlab for statistics.
∗ Read the papers posted on Blackboard: Statistics for Analysis of Experimental Data, Errors and Limitation Associated with Regression, and Why We Divide by n−1.
∗ Lab will be assigned.