1.0 Descriptive statistics.pdf

Overview and Descriptive Statistics introduction
• Statistical concepts and methods are critical to
understanding the world around us.
• Statistics is the science of collecting, organizing, presenting,
analyzing, and interpreting data to assist in making more
effective decisions
– It allows us to make informed decisions based upon data in the
presence of uncertainty and variation
• Statistical methods have broad applications
– Evaluating research
– Making predictions
– Understanding variability among components
manufactured by a certain process

Numerical Summaries of Data
• Data summaries and displays are essential to good statistical thinking
• It is useful to describe data features numerically
• Characterizing the location or central tendency in the data is an example of a
numerical summary
• Data are often a sample of observations that have been selected from some
larger population of observations
• The population is the collection of all individuals or objects under
consideration in a statistical study
• Obtaining information on the entire population (a census) is often
impractical
• Instead, we use a sample, a subset (part) of the population, to obtain
information

• A parameter is a descriptive measure for a population
– e.g. population mean, µ; population standard deviation, σ
• A statistic is a descriptive measure for a sample
– e.g. About 33% of people in our sample were between 30 and 40 years old.
– Used to estimate population parameters
– e.g. sample mean, x̄; sample standard deviation, s
Practice:
1. Determine whether each group is best described as a sample or a population:
a) The participants in a study of a new cholesterol drug.
b) All the people living in Doha.

2. Determine whether each situation is a sample or a census:
a) The government of Qatar collects information from all residents about
their income.
b) The government of Qatar asks 1000 residents of Doha where they like to
shop.
c) Data about the effectiveness of a new vaccine that has been given to
volunteers.
d) The types of cars driven by all residents of Doha.
Measures of Center (Mean, Median, Mode)
- Sample mean
The location or central tendency in the data can be characterized by the
arithmetic average, or sample mean. It is calculated by adding up all values in the
data set, then dividing the sum by the number of values in the data set.
Mathematically, the formula is written as
$$\bar{X} = \frac{\sum X}{n}$$
$\bar{X}$ This is read as "X bar". Sample mean.
n Sample size.
X Random variable to denote the items in a sample.
- Population Mean:
$$\mu = \frac{\sum X}{N}$$
$\mu$ Greek letter mu, always used to denote the population mean.
N Population size. Capital N is used for population size.
X Random variable to denote the items in a population.
➢ A different symbol is used for each measure depending on whether the data come from a sample or a population.

Example 1:
- The Median
➢It is the middle value once the data have been sorted into ascending or descending
order.
➢If we have an odd number of data points, it is the middle value.
3, 5, 8, 12, 17, 18, 19

➢If we have an even number of data points the median is the mean of the middle
two points. Therefore, the median does not have to be one of the data points.
12, 8, 5, 3, 21, 18, 19, 17
i. We first have to order the numbers
ii. 3, 5, 8, 12, 17, 18, 19, 21
iii. The median is $\frac{12 + 17}{2} = 14.5$
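The two cases above (odd and even number of data points) can be sketched in Python as follows. The function name `median_by_hand` is only an illustrative choice; the data are the two lists from the examples.

```python
from statistics import median

# Data sets taken from the two median examples above.
odd_sample = [3, 5, 8, 12, 17, 18, 19]
even_sample = [12, 8, 5, 3, 21, 18, 19, 17]


def median_by_hand(values):
    """Sort the values, then take the middle one (odd n)
    or the mean of the middle two (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_hand(odd_sample))   # middle value of the 7 sorted points
print(median_by_hand(even_sample))  # mean of the two middle values, 12 and 17

# The standard library gives the same answers.
print(median(odd_sample), median(even_sample))
```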
The Mode:
➢This is the value in the data set with the greatest frequency. It is possible to have
more than one mode in a data set.
Example: Consider the following data set. What is the mode?
12, 13, 14, 15, 12, 4, 2, 6, 14, 13, 15, 12, 4, 12
It is helpful to sort the list first: 2 4 4 6 12 12 12 12 13 13 14 14 15 15
Mode is 12
Example: Consider the following data set. What is the mode?
12, 13, 14, 15, 12, 4, 2, 15, 6, 14, 13, 15, 12, 4, 12, 15
It is helpful to sort the list first: 2 4 4 6 12 12 12 12 13 13 14 14 15 15 15 15
The modes are 12 and 15
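Counting frequencies by hand is error-prone for longer lists. A minimal sketch of the same counting in Python, using the two data sets above (the helper name `modes` is an illustrative choice):

```python
from collections import Counter

first = [12, 13, 14, 15, 12, 4, 2, 6, 14, 13, 15, 12, 4, 12]
second = [12, 13, 14, 15, 12, 4, 2, 15, 6, 14, 13, 15, 12, 4, 12, 15]


def modes(values):
    """Return every value that attains the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes(first))   # a single mode
print(modes(second))  # two values tie for the highest frequency
```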
Roles of the mean, median and mode
➢The mean, median and mode are each useful, depending on the purpose. Choosing the best measure of central tendency depends on the type of data
you have.
Examples:
1) Suppose the annual salaries of a sample of accountants are $62,900, $61,600, $62,500,
$60,800, and $1,200,000.

• The mean salary is $289,560. All accountants have an income between
$60,000 and $63,000 except for the last, who has a $1.2 million salary. This one salary
pulls the mean far above the others, so the mean is not a representative average of
this group of workers.
• For data containing one or two very large or very small values, the mean may
not be representative. The center for such data can be better represented by the
median.
$60,800, $61,600, $62,500, $62,900, $1,200,000
The median, $62,500, would be a more representative value for the average salary.
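The effect of the single large salary can be checked with a short sketch. The 1,000,000 cutoff used to drop the extreme value is an arbitrary illustrative choice, not part of the example.

```python
from statistics import mean, median

salaries = [62_900, 61_600, 62_500, 60_800, 1_200_000]

print(mean(salaries))    # pulled upward by the single $1.2 million salary
print(median(salaries))  # middle value of the sorted salaries

# Dropping the extreme salary shows how strongly it drives the mean.
typical = [s for s in salaries if s < 1_000_000]
print(mean(typical))
```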
Measures of Variability
• Measures of center only give partial information about a
distribution
• Consider the following three samples:

A 8 9 10 11 12
B 4 8 10 12 16
C 4 7 10 13 16
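A quick check of the three samples, as a sketch: all three have the same mean, and B and C also have the same range, yet their spread differs. The spread is measured here with `statistics.stdev`, which computes the sample standard deviation introduced in the next section.

```python
from statistics import mean, stdev

samples = {
    "A": [8, 9, 10, 11, 12],
    "B": [4, 8, 10, 12, 16],
    "C": [4, 7, 10, 13, 16],
}

for name, values in samples.items():
    sample_range = max(values) - min(values)
    # stdev() uses the n - 1 denominator (sample standard deviation).
    print(name, mean(values), sample_range, round(stdev(values), 3))
```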
Standard Deviation and Variance
➢ The standard deviation is the most often used and the most important measure of variability.
➢ It can help us predict how data points are distributed about the mean.
➢ Variance: The Variance is the square of the standard deviation.
➢ There are several abbreviations for standard deviation. We will use
“s” for a sample standard deviation and a lowercase
sigma, “σ”, for a population standard deviation.

POPULATION STANDARD DEVIATION “σ”
➢ The population standard deviation for ungrouped data is the square root of the arithmetic
mean of the squared deviations from the population mean.
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$


SAMPLE STANDARD DEVIATION “s”
➢ The sample standard deviation uses different notation and a slightly different formula.
$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
Note the use of X-bar ($\bar{X}$) rather than mu, and that the denominator is (n − 1) rather than N.
* We can use the TI-84 calculator to help us calculate the standard deviation for the data
In Example 1:

The table displays the quantities needed for calculating the sample variance and sample
standard deviation.
The numerator of $s^2$ is
**The prior calculation is definitional and tedious. A shortcut can be derived that involves just
two sums.

We can calculate the sample variance and standard deviation in the previous example using the
shortcut method.
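One standard form of the shortcut, assumed here to be the one meant, follows from expanding the squared deviations and using $\bar{x} = \sum x_i / n$. Only the two sums $\sum x_i$ and $\sum x_i^2$ are needed:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n},
\qquad\text{so}\qquad
s^2 = \frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}.
\end{aligned}
```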
Example 2.
A sample of ages was taken from students in an EFL course. The data set was as follows,
34, 32, 19, 22, 24, 32, 25, 23

Find the mean, variance and standard deviation.
x     X̄        x − X̄     (x − X̄)²
34    26.375    7.625     58.141
32    26.375    5.625     31.641
19    26.375   -7.375     54.391
22    26.375   -4.375     19.141
24    26.375   -2.375      5.641
32    26.375    5.625     31.641
25    26.375   -1.375      1.891
23    26.375   -3.375     11.391
SUM =                    213.875
Mean:
$$\bar{X} = \frac{\sum X}{n} = \frac{211}{8} = 26.375$$
Variance ($s^2$):
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{213.875}{7} = 30.55$$
Standard deviation (s):
$$s = \sqrt{\frac{213.875}{7}} = 5.53$$
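The hand calculation in Example 2 can be reproduced with a short sketch that follows the same steps (mean, squared deviations, division by n − 1) and then compares the result with Python's `statistics` module. Variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

ages = [34, 32, 19, 22, 24, 32, 25, 23]

n = len(ages)
x_bar = sum(ages) / n                             # sample mean
squared_devs = [(x - x_bar) ** 2 for x in ages]   # the (x - x_bar)^2 column
s2 = sum(squared_devs) / (n - 1)                  # sample variance
s = sqrt(s2)                                      # sample standard deviation

print(x_bar, sum(squared_devs), round(s2, 2), round(s, 2))

# The standard library uses the same n - 1 definition.
print(mean(ages), round(variance(ages), 2), round(stdev(ages), 2))
```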
– s = 0 only if all the observations are the same; otherwise
– s > 0 if there is variability in the data
– s increases with the amount of variation in the data. It can be roughly
interpreted as the “typical” deviation of an observation from the
mean.
• s is not resistant to outliers
• When the sample variance is calculated with the quantity n − 1 in the denominator,
the quantity n − 1 is called the degrees of freedom
• Origin of term:

• There are n deviations from x̄ in the sample
• The sum of the deviations is zero
• n − 1 of the deviations can be freely determined, but the nth deviation is fixed to
maintain the zero sum
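The zero-sum constraint can be illustrated with the ages from Example 2. This is only a sketch; the comparison allows for floating-point rounding.

```python
ages = [34, 32, 19, 22, 24, 32, 25, 23]  # Example 2 data

x_bar = sum(ages) / len(ages)
deviations = [x - x_bar for x in ages]

# The deviations from the sample mean always sum to zero (up to rounding).
print(sum(deviations))

# So the last deviation is fixed by the other n - 1 deviations.
implied_last = -sum(deviations[:-1])
print(implied_last, deviations[-1])
```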
Sample Range:
In addition to the sample variance and sample standard deviation, the sample
range, the difference between the largest and smallest observations, is a useful measure of variability.
In Example 2: Range = 34 − 19 = 15

Example:
Find the range of the data below: −9, 7, 5, 4, 1, 8, 4, 5, 3, 3
Range =
Frequency Distributions and Histograms:
• A frequency distribution is a compact summary of data
• To construct one, we must divide the range of the data into intervals, which are
usually called class intervals, cells, or bins
• Choosing the number of bins approximately equal to the square root of the
number of observations often works well in practice
• After choosing the number of bins, we choose the class width (interval), which
can be evaluated as follows:
Class width = Range / number of bins
Then we can find the frequency in each class by counting the number of
observations that fall in each class.

Example 3:
The data below are the compressive strengths in pounds per square inch (psi)
of 80 specimens of a new aluminum-lithium alloy undergoing evaluation as a
possible material for aircraft structural elements
Because the data set contains 80 observations, we suspect that about eight to
nine bins will provide a satisfactory frequency distribution. The largest and
smallest data values are 245 and 76, respectively, so the bins must cover a range
of at least 245 − 76 = 169. If we want the lower limit for the first bin to begin
slightly below the smallest data value and the upper limit for the last bin to be
slightly above the largest data value, we might start the frequency distribution at
70 and end it at 250.
Class width = Range / number of bins
= 169 / 9 ≈ 18.8 (we can round it up to 20)
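The bin arithmetic for Example 3 can be sketched as follows. Rounding the width up to a multiple of 10 and starting at the multiple of 10 just below the minimum are illustrative assumptions that happen to reproduce the 70 to 250 choice above; they are not a general rule.

```python
from math import ceil, sqrt

n_obs = 80                 # number of specimens in Example 3
x_min, x_max = 76, 245     # smallest and largest observed strengths (psi)

n_bins = round(sqrt(n_obs))            # square-root rule of thumb
data_range = x_max - x_min
raw_width = data_range / n_bins

# Round the width up to a convenient multiple of 10 (an illustrative choice).
width = ceil(raw_width / 10) * 10
start = (x_min // 10) * 10             # begin slightly below the minimum
edges = [start + i * width for i in range(n_bins + 1)]

print(n_bins, data_range, round(raw_width, 1), width)
print(edges)
```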
The second row of the table contains a relative frequency distribution. The
relative frequencies are found by dividing the observed frequency in each bin
by the total number of observations. The last row of the table expresses the
relative frequencies on a cumulative basis. Frequency distributions are often
easier to interpret than tables of data. For example, it is very easy to see that
most of the specimens have compressive strengths between 130 and 190 psi
and that 97.5 percent of the specimens fall below 230 psi.
Histogram:
• A histogram is a visual display of the frequency distribution
• Provides a visual impression of the shape and distribution of the
measurements and information about the central tendency and scatter
or dispersion in the data
• Discrete variables: the frequencies are the count of the number of
observations for each possible value
• Continuous variables: define bins (ranges/classes) and count the
number of observations that fall in each bin
• Can also plot the relative frequency, the proportion (fraction) of times
that a value occurs (or that values fall inside a bin):
relative frequency = frequency / total number of observations
Example: The number of classes each of 50 university students is taking:
1 4 4 5 6 5 5 6 4 5 5 4 2 2 1 3
2 3 1 3 5 3 4 4 3 4 2 5 5 4 3 2
3 4 4 5 2 5 5 6 5 3 6 4 5 5 4 2
5 6

Frequency and relative frequency distribution
Number of Classes   Frequency   Relative Frequency
1                   3           3/50 = 0.06
2                   7           7/50 = 0.14
3                   8           8/50 = 0.16
4                   12          12/50 = 0.24
5                   15          15/50 = 0.30
6                   5           5/50 = 0.10
Total               50          1.00
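The table can be rebuilt from the raw data with a short sketch. `Counter` does the tallying; the variable names are illustrative.

```python
from collections import Counter

classes = [
    1, 4, 4, 5, 6, 5, 5, 6, 4, 5, 5, 4, 2, 2, 1, 3,
    2, 3, 1, 3, 5, 3, 4, 4, 3, 4, 2, 5, 5, 4, 3, 2,
    3, 4, 4, 5, 2, 5, 5, 6, 5, 3, 6, 4, 5, 5, 4, 2,
    5, 6,
]

freq = Counter(classes)                 # frequency of each value
total = len(classes)                    # total number of observations
rel_freq = {value: freq[value] / total for value in sorted(freq)}

for value, share in rel_freq.items():
    print(value, freq[value], f"{share:.2f}")
```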
[Figure: three histogram shapes. Negative (left) skewed: median > mean. Symmetric (bell shape): median = mean. Positive (right) skewed: median < mean.]

Histograms are the most common method for graphically displaying data and determining
the shape of the distribution and the existence of outliers.
– For symmetric distributions, one half of the distribution is a mirror
image of the other.
– Skewed distributions: negative (left-skewed) or positive (right-skewed)
Outlier:

Boxplots
• The box plot is a graphical display that simultaneously describes several
important features of a data set, such as center, spread, departure from
symmetry, and identification of unusual observations or outliers
• Sometimes called box-and-whisker plots
• Displays three quartiles
• A line, or whisker, extends from each end of the box
Description of the Box Plot

• The pth percentile is the number that divides the bottom p% of the
data from the top (100-p)%
• Q1: 25th percentile (lower quartile/fourth)
• Q3: 75th percentile (upper quartile/fourth)
• Median = Q2 → 50th percentile
• Five number summary: Minimum, Q1, Median, Q3, Maximum
Calculation of Q1 and Q3 (by hand)*
• Q1: median of the bottom half of the (ordered) data set
• Q3: median of the top half of the (ordered) data set

• The fourth spread (denoted fs), or interquartile range
(IQR), is a resistant measure of spread: IQR = Q3 − Q1
• Can be used to identify any observations that may be potential
outliers:
– Mild: more than 1.5 × IQR from closest quartile
– Extreme: more than 3 × IQR from closest quartile
• The boxplot is a graphical representation of the five number
summary.
[Figure: box plot and five number summary for Example 3.]

Comparative box plots of a quality index at three plants
• Boxplots can be useful for comparing the distribution between
two (or more) groups.
The graph shows the comparative box plots for a manufacturing quality index
on semiconductor devices at three manufacturing plants. Inspection of this
display reveals that there is too much variability at plant 2 and that plants 2 and
3 need to raise their quality index performance.
Practice:
Measurement of total nitrogen loads from a particular Chesapeake Bay
location
Raw data
9.69 13.16 17.09 18.12 23.70 24.07 24.29 26.43
30.75 31.54 35.07 36.99 40.32 42.51 45.64 48.22
49.98 50.06 55.02 57.00 58.41 61.31 64.25 65.24
66.14 67.68 81.40 90.80 92.17 92.42 100.82 101.94
103.61 106.28 106.80 108.69 114.61 120.86 124.54 143.27
143.75 149.64 167.79 182.50 192.55 193.53 271.57 292.61
312.45 352.09 371.47 444.68 460.86 563.92 690.11 826.54
1529.35
• Five number summary: Min, Q1, Q2, Q3, Max
• 9.69, 44.075, 92.17, 175.145, 1529.35

IQR =
Mild outlier limits:
Upper limit = Q3 + (1.5 × IQR) =
Lower limit = Q1 − (1.5 × IQR) =
Mild outliers:
Extreme outlier limits:
Upper limit = Q3 + (3 × IQR) =
Lower limit = Q1 − (3 × IQR) =
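To check answers to this practice, the by-hand quartile rule and the 1.5 × IQR and 3 × IQR limits can be sketched in Python. One assumption is made explicit here: when the number of observations is odd, the median itself is left out of both halves, which is the convention that reproduces the five number summary stated above. Function and variable names are illustrative.

```python
from statistics import median

loads = [
    9.69, 13.16, 17.09, 18.12, 23.70, 24.07, 24.29, 26.43,
    30.75, 31.54, 35.07, 36.99, 40.32, 42.51, 45.64, 48.22,
    49.98, 50.06, 55.02, 57.00, 58.41, 61.31, 64.25, 65.24,
    66.14, 67.68, 81.40, 90.80, 92.17, 92.42, 100.82, 101.94,
    103.61, 106.28, 106.80, 108.69, 114.61, 120.86, 124.54, 143.27,
    143.75, 149.64, 167.79, 182.50, 192.55, 193.53, 271.57, 292.61,
    312.45, 352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54,
    1529.35,
]


def five_number_summary(values):
    """Min, Q1, median, Q3, max, with Q1 and Q3 taken as the medians
    of the lower and upper halves of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # bottom half (median excluded when n is odd)
    upper = ordered[half + n % 2:]    # top half (median excluded when n is odd)
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


minimum, q1, q2, q3, maximum = five_number_summary(loads)
iqr = q3 - q1

mild_limits = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
extreme_limits = (q1 - 3 * iqr, q3 + 3 * iqr)

# Anything beyond the 1.5 x IQR limits is a potential outlier;
# those beyond the 3 x IQR limits are extreme, the rest are mild.
outliers = [x for x in loads if x < mild_limits[0] or x > mild_limits[1]]
extreme = [x for x in loads if x < extreme_limits[0] or x > extreme_limits[1]]
mild = [x for x in outliers if x not in extreme]

print(minimum, q1, q2, q3, maximum)
print(iqr, mild_limits, extreme_limits)
print(mild, extreme)
```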
Time Sequence Plot:
• A time series or time sequence is a data set in which the observations are recorded in
the order in which they occur.
• A time series plot is a graph in which the vertical axis denotes the observed value of
the variable and the horizontal axis denotes the time

Scatter Diagrams:
• Multivariate: each observation consists of measurements of several variables
• The scatter diagram is a useful way to graphically display the potential relationship
between two of the variables
• When two or more variables exist, the matrix of scatter diagrams may be useful in
looking at all of the pairwise relationships between the variables in the sample
• The sample correlation coefficient is a quantitative measure of the strength of the linear
relationship between two random variables x and y
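A minimal sketch of the sample correlation coefficient, using the usual Pearson form r = S_xy / sqrt(S_xx · S_yy). The paired values below are hypothetical, invented purely for illustration; they do not come from these slides.

```python
from math import sqrt

# Hypothetical paired measurements, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def sample_correlation(xs, ys):
    """Pearson's r: S_xy / sqrt(S_xx * S_yy), using deviations from the sample means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    s_xx = sum((a - x_bar) ** 2 for a in xs)
    s_yy = sum((b - y_bar) ** 2 for b in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = sample_correlation(x, y)
print(round(r, 4))  # close to +1 for a strong increasing linear relationship
```

Values of r near +1 or −1 indicate a strong linear relationship; values near 0 indicate a weak one.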
