DESCRIPTIVE STATISTICS
1.0 INTRODUCTION
In everyday life, whether at home or at work, we usually keep records or read reports. An item in
the record or report is a fact that is expressed in terms of a numerical value or described by its quality or
kind. That single item or fact is referred to as a datum. All these facts in a record or report are called
data.
Examples of data:
Color of the hair
Number of students in a class
Height and weight
Number of times you were absent from class
1.1 Population and Sample
In the data-gathering phase, information is taken from a unit, which is part of a collection of all
such units called a population. A population consists of an entire set of objects, observations, or
scores that have something in common.
Some Definitions:
Population – collection of all units from which the data is to be collected.
Element – unit in a population
Sample – subset or a representative part of the population.
Frame – listing of all the elements of the population
Census – complete enumeration in which every member of the population is included
Sampling – or sample survey; only a part or a portion of the population is used to obtain data
1.2 Definition of Statistics
The word “statistics” is used in several different senses. In the broadest sense, “statistics” is the branch of
science that deals with the development of methods for collecting, organizing,
presenting, and analyzing data more effectively. Data and how to deal with them are the main concern of statistics.
In a second usage, a “statistic” is defined as a numerical quantity (such as the mean) calculated from a
sample. The grade point average (GPA) is an example of a statistic. It is a value computed from a set of
grades of a student in a particular semester.
Illustration:
If the data is the set of grades, then GPA is the statistic. Another numerical value that can be
computed from the set of grades is the percentage passing. The percentage passing is also a statistic.
From the same set of grades, the number of subjects that received a failing grade is another statistic.
Taken together, the GPA, the percentage passing, and the number that received a failing grade are
called statistics.
Major Areas of Statistics
1. Descriptive Statistics – deals largely with summary calculations, graphical displays, and
describing important features of a set of data. It does not attempt to draw conclusions about
anything that pertains to more than the data themselves.
2. Inferential Statistics – concerned with making generalizations from information gathered from a
small group of observations (sample) to a bigger group of observations (population).
Two Main Methods:
1. Estimation
- the sample statistic is used to estimate a population parameter
- a confidence interval about the estimate is constructed.
2. Hypothesis Testing
- a null hypothesis is put forward.
- Analysis of the data is then used to determine whether to reject it.
1.3 Variables
A variable is any measured characteristic or attribute that differs from subject to subject. When two
variables are in a cause-and-effect relationship, the cause is called the independent variable and the
effect is called the dependent variable.
Types of Variables:
1. Qualitative Variables – sometimes called “categorical variables”
- facts for which no numerical measure exists
- expressed in categories or kind
Examples:
color of the skin which can be black, brown, or white
person’s sex which can be male or female
2. Quantitative Variables – variables that can be expressed in numbers.
- can be measured and counted.
Examples:
person’s height and weight – can be measured
number of students in a class – can be counted
Classification of Quantitative Variables
1. Continuous variable
A continuous variable is one that can take any value within the limits over which it
ranges.
Examples:
Time to solve a math problem is continuous since it could take 2 minutes, 2.13
minutes, etc. to finish a problem
Height is continuous since it could take 1.55 meters, 1.65 meters, etc.
2. Discrete variable
A discrete variable is one that cannot take on all values within the limits of the variable.
Examples:
Responses to a five-point rating scale are discrete since they can only take the values 1, 2, 3, 4,
and 5.
Number of provinces
1.4 Types of Measurements
1. Nominal measurement consists of assigning items to groups or categories. No quantitative
information is conveyed and no ordering of the items is implied. Nominal measurements are
therefore qualitative rather than quantitative. Nominal measurement is the lowest form of
measurement.
Examples:
Color
Sex
Blood type
Religion
2. Ordinal Measurement
Measurements on ordinal scales are ordered in the sense that higher numbers
represent higher values. However, the intervals between the numbers are not necessarily
equal. For example, on a five-point rating scale measuring attitudes towards gun control,
the difference between a rating of 2 and a rating of 3 may not represent the same
difference as the difference between a rating of 4 and a rating of 5. There is no “true” zero
point for ordinal scales since the zero point is chosen arbitrarily. The lowest point on the
rating scale in the example was chosen to be 1. It could just as well have been 0 or 5.
Examples:
Taste preferences
Satisfactions
Social classes
Academic honors
3. Interval Measurement
On interval measurement scales, one unit on the scale represents the same
magnitude on the trait or characteristic being measured across the whole range of scale.
Interval scales do not have a “true” zero point, however, and therefore it is not
possible to make statements about how many times higher one score is than another.
A good example of interval scale is the Fahrenheit scale for temperature.
4. Ratio Measurement
Ratio measurements are like interval measurements except that they have a true zero
point. Ratio measurement is the highest form of measurement.
Examples:
Length
Weight
Note: A large number of statistical analysis tools are available for each type of measurement. It is
important that the user has a good understanding of the type of data to be processed so
that the chosen statistical tool is used properly.
1.5 Random and Non-Random Sampling
Random sampling is the most commonly used sampling technique in which each member in
the population is given an equal chance of being selected in the sample.
Non-random sampling is the method of collecting a small portion of the population by which
not all the members in the population are given the chance to be included in the sample.
Properties of Random Sampling
1. Equiprobability – means that each member of the population has an equal chance
of being selected and included in the sample.
2. Independence – means that the chance of one member being drawn does not
affect the chance of the other member.
1.6 Probability Sampling Techniques
1. Simple Random Sampling (SRS) – process for selecting a sample wherein every element in
the sampled population is given an equal chance of being included in the sample
2. Systematic Random Sampling – sampling wherein every kth unit is included after a random
start is taken for the sample
3. Stratified Proportional Random Sampling – population is divided into homogeneous groups, or
strata, and selection is done within each stratum in proportion to its size
4. Multi-stage Sampling – this technique uses several stages or phases in getting sample from
the population. This method is an extension or a multiple application of the stratified random
sampling technique.
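The first three techniques above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the seeded `random.Random` generator, the numbered frame of 100 units, and the rounding of stratum shares are choices made for this sketch and are not part of the notes.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n distinct units; every unit has the same chance of selection."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start within the first k units."""
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random start: 0, 1, ..., k - 1
    return [population[start + i * k] for i in range(n)]


def stratified_proportional_sample(strata, n, rng):
    """Sample each stratum in proportion to its share of the population.

    Shares are rounded to whole units, so the total may differ slightly
    from n when the proportions are not exact (an assumption of this sketch).
    """
    total = sum(len(units) for units in strata.values())
    return {
        name: rng.sample(units, round(n * len(units) / total))
        for name, units in strata.items()
    }


rng = random.Random(2024)             # fixed seed only to make runs repeatable
population = list(range(1, 101))      # hypothetical frame of 100 numbered units

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_proportional_sample(
    {"A": list(range(1, 61)), "B": list(range(61, 101))}, 10, rng
)
print(srs, systematic, stratified, sep="\n")
```

With strata of 60 and 40 units and a sample of 10, the proportional shares are 6 and 4 units.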
1.7 Non-random Sampling Techniques
1. Judgment or Purposive Sampling – this method is also referred to as non-probability sampling. The
researcher's judgment plays a major role in the selection of a particular item and in making decisions in cases of
incomplete responses or observations.
2. Quota Sampling – this is a relatively quick and inexpensive method to operate since the choice
of the number of subjects to be included in a sample is done at the researcher’s own
convenience or preference and is not predetermined by some carefully operated randomizing
plan.
3. Cluster Sampling – population is divided into a number of relatively small subdivisions, which
are themselves clusters of still smaller units, and then some of these subdivisions, or clusters,
are randomly selected for inclusion in the overall sample.
4. Incidental Sampling – this design is applied to those samples which are taken because they
are the most available.
5. Convenience Sampling – this method has been widely used in television and radio programs to
find out the opinions of TV viewers and listeners regarding a controversial issue.
1.8 Methods of Collecting Data
There are many ways of collecting data, each of which has its own advantages and
disadvantages. The more general methods of collecting information are:
1. Direct or Interview Method
A very common and effective method of obtaining information is by conducting interviews.
People usually respond when visited in person.
Disadvantages: People may tend to lie, and interviews are quite costly and need thorough
training of the interviewers (untrained interviewers tend to influence the respondent’s
answers).
2. Indirect or Questionnaire Method
Questionnaires can either be mailed or handed personally to respondents.
Advantages: It does not require interviews and is therefore less costly. It also covers a wider
area than interviews.
Disadvantages: The response rate is usually lower than for interviews. Many people tend to ignore
mailed questionnaires.
To encourage participation, a questionnaire should be kept as short as possible and
contain only questions related to the objectives of the survey.
3. Direct Observation
In situations where less personal responses are needed, collecting data by direct
observation may be used.
Disadvantage: The person assigned to observe may commit observational errors.
4. Experimentation – is used when the objective is to determine the cause-and-effect of a
certain phenomenon under some controlled conditions.
5. Utilizing Existing Records
A very convenient way of obtaining data is by utilizing existing records. There are a
number of institutions that gather data not only for their own purposes but also for the purposes of
other groups of people.
Advantage: It is very economical and requires less cooperation from people.
Disadvantage: The information needed may not be found in these sources.
Data are sometimes obtained from published or unpublished documents, which can be
classified as follows:
Primary sources – provide data first hand; the data as originally gathered have not been
subjected to any transcription or condensation. Their authenticity is guaranteed by the
group who gathered them originally.
Secondary sources – provide data that have been transcribed or compiled from
original sources
2.0 ORGANIZATION AND PRESENTATION OF DATA
After data have been gathered and checked for possible errors, the next logical step is to present
the data in a manner that is easy to understand. It should also readily convey the relevant information
and the important results at a glance.
Ways/Methods of presenting data:
1. Textual presentation – a narrative way of describing the collected characteristics of the population
based on the data collected and organized
2. Tabular presentation – data are tallied into the appropriate row and/or column categories
3. Graphical presentation – data are presented graphically such as bar chart, histogram, pie chart
and pictograph
2.1 Textual Presentation
Example:
A total of 22.4 million children aged 5-17 years old in 9.6 million households were
estimated from the 1995 National Survey of Working Children (NSWC).
Sixteen percent (16%) or 3.6 million children were reported engaged in economic activities
at any time in 1995. Boys were more likely to work than girls with a national sex ratio of working
children of 187.
2.2 Tabular Presentation
- may be in the form of a cross tabulation table, a frequency distribution table (FDT) or a
stem-and-leaf plot.
2.2.1 Cross Tabulation Table
When data are in categories, results are usually presented in a systematic manner by using a table,
which arranges data in rows and columns.
Example:
Table 1. Numbers of Subjects Falling Into Smoking/Lung Cancer Combination
                 Lung Cancer
Smoker     Present    Absent    Total
Yes            688       650     1338
No              21        59       80
Total          709       709     1418
A table contains:
1. Heading
The heading includes a table number and a title. A table number is necessary to easily identify
the table. It should be followed by a title, which briefly describes the contents of the table.
2. Body
The body is the main part of the table. It contains row categories (which are found on the left
side of the table) and the column categories (which are found at the top of the table). Row
totals may also be included and are located on the right side of the table. Column totals may
also be included and are located at the bottom of the table.
The figures found in the cells of the main body are usually the frequencies, representing
the number of times the two categories occur together. Percentages can be used instead of
frequencies, or both percentages and frequencies can be shown.
3. Footnote (optional)
The data used may have been taken from some publication or provided by another group of
persons. Footnotes may be added to indicate the source of information.
Contingency Table – a table listing the frequencies for the different combination of values of two
categorical variables.
2.2.2 Frequency Distribution
In many instances, the information gathered is numerical in nature, such as the age of a respondent or
the exam score of a student. When faced with a large set of this kind of data, it is often
advantageous to group the data into a number of classes or intervals so as to get a better
overall picture.
Table 2.3 Scores in a Statistics Final Exam
31 28 15 10 47
18 32 29 58 48
37 49 26 54 56
21 24 28 32 28
43 12 23 29 61
16 42 40 32 26
48 36 39 22 40
20 63 54 30 17
18 30 23 26 36
47 19 25 38 35
Table 2.3 is a set of scores in a Statistics final exam. These data will be used to illustrate the
construction of a frequency table.
Frequency distribution – is a grouping of all observations into intervals or classes together with a count
of the number of observations that fall in each interval or class.
The data in Table 2.3 are called raw data, and such a form is difficult to read and analyze. In a frequency
distribution the data are presented in a more compact and usable manner. However, this process brings
about some loss of detail.
1.1 Steps in Constructing a Frequency Distribution
1. From the data set, identify the highest value and lowest value. Compute the range R as
R = highest value – lowest value
2. Estimate the number of classes, k, as
$$k = \sqrt{n}$$
where n is the number of observations.
Note: The result is rounded up to the next higher integer, NOT to the usual nearest integer.
Rounding off to the nearest integer will often yield a number of intervals that cannot
accommodate all the observations.
3. Estimate the width c of the interval by dividing the range R by the number of classes k. That is,
$$c = \frac{R}{k}$$
Round off this estimate to the same number of decimal places as the original data set.
No. of decimal places of the raw data    Precision
0                                        1
1                                        0.1
2                                        0.01
3                                        0.001
4. List the lower and upper class limits of the first interval. This interval should contain the smallest
observation in the data set. The starting lower limit could be the lowest or any number closest to
it.
5. List all the class limits by adding the class width to the limits of the previous interval. The highest
class should contain the largest observation in the data set.
6. Tally the frequencies for each class.
7. Compute the class marks and the class boundaries.
Class midpoint, or class mark, is the midpoint of an interval. That is,
$$CM = \frac{LL + UL}{2}$$
where CM – class mark
LL – lower limit
UL – upper limit
To find class boundaries, it is important to know the unit of accuracy of the raw data. The final
exam scores are accurate to the ones unit. The value reported as 5.8 kg. is accurate to the tenth
unit, while a GPA of 2.64 is accurate to the hundredth unit.
Lower class boundary, Li, is given as
Li = LL – 0.5 (Precision)
Upper class boundary, Ui, is given as
Ui = UL + 0.5 (Precision)
Additional columns may be added to obtain additional information about the distributional
characteristics of the data. Among these are:
a) Relative Frequency (RF) – frequency of a class expressed as a proportion or
percentage of the total number of observations. That is,
$$RF = \frac{f_i}{n}$$
where $f_i$ is the frequency in each interval and n is the total number of observations
b) Cumulative Frequency (CF). This is the accumulated frequency of a class. There are
two types:
The “less than” CF (<CF) of a class is the number of observations whose values are less than or equal to
the upper limit of the class.
The “greater than” CF (>CF) of a class is the number of observations whose values are greater than or
equal to the lower limit of the class.
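The construction steps can be traced on the scores of Table 2.3 with a short Python sketch. It is a minimal illustration under assumptions: the function name `frequency_table` and its `precision` argument are hypothetical, the first lower limit is taken to be the lowest score (step 4 allows other choices), and the width is assumed to come out greater than zero.

```python
import math

# Scores from Table 2.3 (n = 50)
scores = [
    31, 28, 15, 10, 47, 18, 32, 29, 58, 48,
    37, 49, 26, 54, 56, 21, 24, 28, 32, 28,
    43, 12, 23, 29, 61, 16, 42, 40, 32, 26,
    48, 36, 39, 22, 40, 20, 63, 54, 30, 17,
    18, 30, 23, 26, 36, 47, 19, 25, 38, 35,
]


def frequency_table(data, precision=1):
    """Build a frequency distribution table following steps 1 to 7."""
    n = len(data)
    low, high = min(data), max(data)
    data_range = high - low                                  # step 1: R
    k = math.ceil(math.sqrt(n))                              # step 2: round UP
    width = round(data_range / k / precision) * precision    # step 3: c
    rows, lower, cumulative = [], low, 0
    while lower <= high:                                     # steps 4 and 5
        upper = lower + width - precision
        f = sum(1 for x in data if lower <= x <= upper)      # step 6: tally
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5 * precision, upper + 0.5 * precision),
            "mark": (lower + upper) / 2,                     # step 7
            "f": f,
            "rf": f / n,                                     # relative frequency
            "less_cf": cumulative + f,                       # "less than" CF
            "greater_cf": n - cumulative,                    # "greater than" CF
        })
        cumulative += f
        lower += width
    return rows


table = frequency_table(scores)
for row in table:
    lo, hi = row["limits"]
    print(f"{lo}-{hi}", row["boundaries"], row["mark"], row["f"],
          f"{row['rf']:.2f}", row["less_cf"], row["greater_cf"])
```

For these data R = 53, k = 8 (the square root of 50 rounded up), and c = 7, which gives the eight classes 10-16, 17-23, and so on up to 59-65.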
2.3 Graphical Presentation
This form is often the most effective means of organizing and presenting data because the important
relationships are brought out more clearly and creatively in visually solid and colorful figures.
2.3.1 Different Kinds of Graphs/Charts
1. Line Graph – it shows relationships between two sets of quantities. This is done by
plotting the X set of quantities along the horizontal axis against the Y set of quantities
along the vertical axis in a Cartesian coordinate plane. The plotted points are then
connected by line segments, which form the line graph.
2. Bar Graph – it consists of bars or rectangles of equal widths, either drawn vertically or
horizontally.
3. Circle Graph or Pie Chart – it represents relationships of the different components of a
single total as revealed in the sectors of a circle.
4. Picture Graph or Pictogram – it is a visual presentation of statistical quantities by means
of drawing pictures or symbols related to the subject under study.
2.3.2 Graphical Representation of the Frequency Distribution
1. Bar Chart and Histogram - the bar chart is one of the more popular ways of representing a frequency
distribution graphically. It is a graph where the different classes are represented on the
horizontal axis by the class limits, or by the categories for nominal data. The length of each
rectangle, measured along the vertical axis, represents the class frequency. A graph that
closely resembles the bar chart is the histogram. The basic difference is that a bar chart
uses class limits for the horizontal axis while the histogram employs the class boundaries.
Using the class boundaries eliminates the spaces between the rectangles, giving the histogram a solid
appearance.
2. Frequency Polygon - is constructed by plotting the class marks against the frequencies.
The set of (x,y) points formed by the class marks and their corresponding frequencies are
connected by straight lines. To complete the polygon, which is defined as a closed figure, an
additional class mark with zero frequency is added at the beginning and at the end of the distribution.
3. Frequency Ogive - A cumulative frequency distribution can be represented graphically by
a frequency ogive. An ogive is obtained by plotting the upper class boundaries on the
horizontal scale and the cumulative frequency less than the upper class boundaries on the
vertical scale.
3.0 NUMERICAL DESCRIPTION OF DATA
A numerical description is a numerical value that summarizes a set of observations into a single value, and that value
may be used to represent the entire population.
3.1 The Summation Symbol
The Greek letter $\Sigma$ (upper case sigma) denotes the summation symbol. It is a more compact
way of writing a sum of a set of data values. A convenient way of writing a data value in mathematical
notation is the subscripted variable $x_i$, which is read as ‘x sub i’. When a set of data values is written
in the subscripted variable notation $x_1, x_2, x_3, \ldots, x_n$, the notation $\sum_{i=1}^{n} x_i$ is defined as
$$\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n.$$
The symbol $\sum_{i=1}^{n} x_i$ is read as ‘the summation of x sub i from 1 to n’.
Example: Consider the set of data values 5, 4, 8 and 6, which are measurements of weights. Find the
following:
1. $\sum_{i=1}^{4} x_i$    2. $\sum_{i=1}^{4} x_i^2$    3. $\left(\sum_{i=1}^{4} x_i\right)^2$
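The three sums in this example can be checked with a few lines of Python. The variable names are chosen here for illustration only.

```python
x = [5, 4, 8, 6]                                # the four weights from the example

sum_x = sum(x)                                  # 1. sum of the values
sum_of_squares = sum(value ** 2 for value in x) # 2. sum of the squared values
square_of_sum = sum(x) ** 2                     # 3. square of the sum

print(sum_x, sum_of_squares, square_of_sum)     # prints: 23 141 529
```

Items 2 and 3 differ: squaring each value before adding gives 141, while adding first and then squaring gives 529.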
3.2 Measures of Central Tendency
A measure of central tendency is a single value about which the set of observations tends to cluster.
3.2.1 ARITHMETIC MEAN
The arithmetic mean or simply mean, is the sum of a set of measurements divided by the
number of measurements in the set. This measure is appropriate for the data in the interval or
ratio scale.
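a. Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
b. Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
c. Weighted mean: $\bar{x}_w = \dfrac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
d. Grand mean: $\bar{\bar{x}} = \dfrac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$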
Examples 3.2.1:
1. The number of hours spent by ten students in studying per day were recorded as follows: 5, 8, 2,
2, 2, 6, 5, 3, 1, and 4. Find the mean.
2. The following table shows the number of households in the five (5) Barangays in Iligan City in
2010, and corresponding percentage changes in the number of households 2010 – 2012.
Barangay      Number of Households    Percentage Change
Tibanga       11,802                  9.1
Suarez        8,624                   8.3
Hinaplanon    5,326                   4.5
Digkilaan     894                     1.4
Palao         12,012                  10.6
Find the weighted mean of the percentage changes.
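Both examples can be worked with a short Python sketch. The function names are hypothetical, and using the 2010 household counts as the weights $f_i$ in the second example is an assumption of this sketch, since the exercise does not name the weights explicitly.

```python
def mean(values):
    """Arithmetic mean: sum of the measurements divided by their number."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum of f_i * x_i divided by the sum of f_i."""
    return sum(f * x for x, f in zip(values, weights)) / sum(weights)


# Example 1: hours spent studying per day by ten students
hours = [5, 8, 2, 2, 2, 6, 5, 3, 1, 4]
mean_hours = mean(hours)

# Example 2: percentage changes weighted by the number of households
# (assumption: the household counts serve as the weights)
households = [11802, 8624, 5326, 894, 12012]
pct_change = [9.1, 8.3, 4.5, 1.4, 10.6]
weighted_change = weighted_mean(pct_change, households)

print(mean_hours, round(weighted_change, 2))
```

Under these assumptions the mean study time is 3.8 hours and the weighted mean percentage change is about 8.58, noticeably higher than the unweighted average of the five percentages (6.78) because the largest barangays also grew the most.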
3.2.2 MEDIAN
The median is not affected by the presence of abnormally large or abnormally small
observations. It is the middle value of a set of observations arranged in increasing or
decreasing order of magnitude when the number of observations is odd, and the mean of the
two middle values when the number is even; i.e., it is the value such that half of the
observations fall above it and half below it.
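a. Population Median:
$$\tilde{\mu} = \begin{cases} x_{\frac{N+1}{2}}, & \text{if } N \text{ is odd} \\ \frac{1}{2}\left(x_{\frac{N}{2}} + x_{\frac{N}{2}+1}\right), & \text{if } N \text{ is even} \end{cases}$$
b. Sample Median:
$$\tilde{x} = \begin{cases} x_{\frac{n+1}{2}}, & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{\frac{n}{2}} + x_{\frac{n}{2}+1}\right), & \text{if } n \text{ is even} \end{cases}$$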
3.2.3 MODE – is the value which occurs the most number of times, or the value with the greatest
frequency.
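The definitions of the median and the mode can be sketched in Python as follows. The function names are hypothetical, and returning every tied value as a mode is a design choice of this sketch; the study-hours data from Examples 3.2.1 are reused for illustration.

```python
from collections import Counter


def median(values):
    """Middle value of the ordered data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                  # x at position (n+1)/2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2    # mean of n/2 and n/2+1


def modes(values):
    """All values that occur with the greatest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, count in counts.items() if count == highest)


hours = [5, 8, 2, 2, 2, 6, 5, 3, 1, 4]
print(median(hours), modes(hours))   # prints: 3.5 [2]
```

The positions in the formulas are 1-based, so the code subtracts 1 when indexing the sorted list.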
Remarks 3.2.1
1. In a normally distributed data set, the mean, median, and mode are equal. The converse
does not hold in general: equal values alone do not guarantee that the data are normally distributed.
2. The graph of normally distributed data is a symmetrical bell-shaped curve.
3.3 Measures of Variability or Dispersion
They are numerical values computed from the given observations that measure how the data
spread out from the central location.
3.3.1 RANGE – is the difference between the largest and the smallest values in the set.
It is denoted by R i.e., R = Highest Value – Lowest Value
3.3.2 VARIANCE – is the average of the squared differences of the scores from the mean score of a
distribution.
a. Population Variance. Given the finite population x1, x2,…,xN, the population variance is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
For ease of computation, an alternative form is suggested below:
$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$$
b. Sample Variance. Given the random sample x1, x2,…,xn, the sample variance is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
A computationally faster form is
$$s^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$$
Note that the denominator of the sample variance is “n – 1”. Dividing by “n” instead would tend to
underestimate the population variance, making the estimate biased.
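The defining and computational forms of the variance give the same result, which a short Python sketch can confirm. The function names are hypothetical, and the study-hours data from Examples 3.2.1 are reused as an illustration; the standard library's `statistics.pvariance` and `statistics.variance` serve as a cross-check.

```python
import statistics


def population_variance(xs):
    """Average squared deviation from the mean (divisor N)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n


def population_variance_alt(xs):
    """Alternative form: (sum of x^2 - N * mu^2) / N."""
    n = len(xs)
    mu = sum(xs) / n
    return (sum(x ** 2 for x in xs) - n * mu ** 2) / n


def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_variance_alt(xs):
    """Faster form: (n * sum of x^2 - (sum of x)^2) / (n * (n - 1))."""
    n = len(xs)
    return (n * sum(x ** 2 for x in xs) - sum(xs) ** 2) / (n * (n - 1))


hours = [5, 8, 2, 2, 2, 6, 5, 3, 1, 4]
print(population_variance(hours), population_variance_alt(hours))
print(sample_variance(hours), sample_variance_alt(hours))
print(statistics.pvariance(hours), statistics.variance(hours))
```

For these ten values the population variance is 4.36 and the sample variance is about 4.84; the sample value is larger because of the n - 1 divisor.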
3.3.3 STANDARD DEVIATION – is the positive square root of the variance.
a. population standard deviation: $\sigma = \sqrt{\sigma^2}$
b. sample standard deviation: $s = \sqrt{s^2}$
3.3.4 COEFFICIENT OF VARIATION (denoted by CV) – is a measure of relative variation expressed
as percentage. It is the ratio of the standard deviation and the mean multiplied by 100%.
a. $CV = \dfrac{\sigma}{\mu} \times 100\%$
b. $CV = \dfrac{s}{\bar{x}} \times 100\%$
Examples 3.3.4
1. The final examination given to two sections of Math 2 gave the following mean and standard
deviation:
Statistics Section A Section B
Mean 30 46
Standard Deviation 10 12
Find the coefficient of variation of the two sections and determine which of the two sections
has greater variability of scores.
2. The mean height of college women is 157.48 cm. with a standard deviation of 6.35 cm., while
their mean weight is 47.70 kg. with a standard deviation of 3.64 kg. Which is more variable, the
height or the weight of the college women?
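Both examples reduce to dividing each standard deviation by its mean, as this Python sketch shows. The function name is hypothetical; the figures are the ones given in Examples 3.3.4.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


# Example 1: two sections of Math 2
cv_section_a = coefficient_of_variation(10, 30)
cv_section_b = coefficient_of_variation(12, 46)

# Example 2: heights (cm) and weights (kg) of college women
cv_height = coefficient_of_variation(6.35, 157.48)
cv_weight = coefficient_of_variation(3.64, 47.70)

print(f"Section A: {cv_section_a:.2f}%  Section B: {cv_section_b:.2f}%")
print(f"Height: {cv_height:.2f}%  Weight: {cv_weight:.2f}%")
```

Section A (about 33.33%) is relatively more variable than Section B (about 26.09%) even though its standard deviation is smaller, and weight (about 7.63%) is more variable than height (about 4.03%). Because the CV has no units, it allows comparisons such as centimeters against kilograms.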
3.3.5 Characteristics of the Standard Deviation
The standard deviation and variance are the most commonly used measures of dispersion in the social
sciences because:
1. Both take into account the precise difference between each score and the mean.
2. If any single score is changed, the standard deviation changes. If the score is moved away from the
mean, the standard deviation increases; if it is moved toward the mean, the standard deviation decreases.
3. If a score is added that is far from the mean, the standard deviation increases; if the added score is
close to the mean, the standard deviation decreases.
3.3.6 Interpreting the Standard Deviation
The standard deviation is very important regardless of the mean. It makes a great deal of
difference whether the distribution is spread-out over a broad range or bunched up closely
around the mean. Figure 3.1 shows a set of scores which are normally distributed.
Figure 3.1 A Normal Curve Showing the Percent of Cases Lying Within 1, 2, and 3 Standard Deviations From
the Mean
3.3.6.1 Chebyshev’s Theorem
The position of the scores in a frequency distribution relative to the mean can
be described by using Chebyshev’s Theorem.
Chebyshev’s Theorem: The proportion or
percentage of any data set that lies within k standard deviations of the mean (where k
is any number greater than 1) is at least
$$1 - \frac{1}{k^2}.$$
For example, for any data set, at least 88.9% of the data lie
within three standard deviations to either side of its
mean.
Example 3.3.6.1
If the mean score of the students enrolled in a
Statistics class is 66 points with a standard deviation
of 5 points, at least what percentage of the scores
must lie between 46 and 86?
Solution:
$$\bar{x} - ks = 46$$
$$66 - k(5) = 46$$
$$5k = 66 - 46$$
$$k = 4$$
Hence, from Chebyshev’s Theorem, $1 - \dfrac{1}{k^2} = 1 - \dfrac{1}{4^2} = \dfrac{15}{16} = 93.75\%$, so at least 93.75% of the scores lie between 46 and 86.
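The same calculation can be written as a small Python sketch. The function name `chebyshev_lower_bound` is hypothetical; the numbers are those of Example 3.3.6.1.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k greater than 1")
    return 1 - 1 / k ** 2


mean_score, sd = 66, 5
lower, upper = 46, 86

# The interval is symmetric about the mean, so either endpoint gives k.
k = (mean_score - lower) / sd

print(k, chebyshev_lower_bound(k))           # prints: 4.0 0.9375
print(round(chebyshev_lower_bound(3), 3))    # prints: 0.889
```

The bound holds for any data set regardless of the shape of its distribution, which is why it is weaker than the percentages quoted for a normal curve.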
3.4 Other Measures of Location (Quantiles or Fractiles)
The measures of central tendency refer only to the center of the entire set of data, but there are
other measures of location that describe or locate the non-central positions of this set of data. These
measures are referred to as quantiles or fractiles. In this section, we will consider the fractiles, which can
be a percentile, a decile, or a quartile.
3.4.1 Percentiles – are values that divide an ordered set of observations into 100 equal parts. These
values, denoted by P1, P2, … , P99, are such that 1 % of the data falls below P1, 2% falls below
P2,…, and 99 % falls below P99.
3.4.2 Deciles – are values that divide an ordered set of observations into 10 equal parts. These values
denoted by D1, D2, …, D9, are such that 10 % of the data falls below D1, 20 % falls below D2, …,
and 90 % falls below D9.
3.4.3 Quartiles – are values that divide an ordered set of observations into 4 equal parts. These
values, denoted by Q1, Q2, and Q3, are such that 25 % of the data falls below Q1, 50 % falls below
Q2, and 75 % falls below Q3.
Procedure for the computation of the fractiles:
1. Arrange the data in an increasing order of magnitude.
2. Solve for the value of L, where
$$L = \begin{cases} \dfrac{mn}{100}, & \text{for percentiles} \\ \dfrac{mn}{10}, & \text{for deciles} \\ \dfrac{mn}{4}, & \text{for quartiles} \end{cases}$$
where: m is the location of the percentile, decile, or quartile
n is the number of observations.
3. If L is an integer, the desired fractile is the average of the Lth and the (L + 1)th observations. If L is
fractional, round it up to the next higher integer to find the required location. The fractile corresponds to the
value in that location.
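The three-step procedure can be sketched in Python. The function name `fractile` and its `parts` argument (100, 10, or 4) are hypothetical, and the study-hours data from Examples 3.2.1 are reused for illustration. Other textbooks and software use different interpolation rules, so results may differ from library functions.

```python
def fractile(data, m, parts):
    """m-th fractile when the ordered data are divided into `parts` equal parts.

    parts = 100 for percentiles, 10 for deciles, 4 for quartiles; 0 < m < parts.
    """
    ordered = sorted(data)                      # step 1: arrange in increasing order
    n = len(ordered)
    if (m * n) % parts == 0:                    # step 2: L = mn / parts is an integer
        L = m * n // parts
        # step 3: average of the L-th and (L + 1)-th observations (1-based)
        return (ordered[L - 1] + ordered[L]) / 2
    L = m * n // parts + 1                      # step 3: next higher integer
    return ordered[L - 1]


hours = [5, 8, 2, 2, 2, 6, 5, 3, 1, 4]          # sorted: 1 2 2 2 3 4 5 5 6 8

q1 = fractile(hours, 1, 4)      # L = 2.5, so use the 3rd observation
q2 = fractile(hours, 2, 4)      # L = 5, so average the 5th and 6th observations
q3 = fractile(hours, 3, 4)      # L = 7.5, so use the 8th observation
d9 = fractile(hours, 9, 10)     # L = 9, so average the 9th and 10th observations
p99 = fractile(hours, 99, 100)  # L = 9.9, so use the 10th observation
print(q1, q2, q3, d9, p99)
```

Integer arithmetic is used to test whether L is whole, which avoids floating-point surprises such as 9.9 being stored inexactly.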
Remark 3.4:
1. The interquartile range represents the distance on a scale between Q1 and Q3, that is, Q3 – Q1.
2. The quartile deviation, also called the semi-interquartile range, is half of the interquartile range.
3.5 Skewness and Kurtosis
Skewness is the degree of departure from symmetry of a distribution. Kurtosis is the
degree of peakedness of a distribution.
3.5.1 Symmetric Distributions are those where one side is the mirror image of the other; the
normal curve is a familiar example. They have a mean and a median that
have the same value. If the distribution is symmetric and unimodal, the mode also has
the same value as the mean and median (see Graph 1 in Figure 4.1).
3.5.2 Skewed Distributions – have different values for the mean, median, and mode. For
unimodal skewed distributions, the mean is pulled toward the tail, and the median is
between the mean and mode.
Figure 4.1 Graphs of Different Types of Distributions
Remarks 3.5
1. A positively skewed distribution has a “tail” which is
pulled in the positive direction (see Graph 3 in
Figure 4.1).
2. A negatively skewed distribution has a “tail” which is
pulled in the negative direction (see Graph 2 in
Figure 4.1).
3. A symmetric distribution has zero skewness.
4. A normal distribution is a mesokurtic distribution.
5. A pure leptokurtic distribution has a higher peak
than the normal distribution and has heavier
tails.
6. A pure platykurtic distribution has a lower peak than a normal distribution and lighter tails.
3.5.3 Application of Measuring Skewness and Kurtosis
One application is testing for normality: many statistical inferences require that a distribution be normal or
nearly normal. A normal distribution has skewness and excess kurtosis of 0, so if your distribution is
close to those values then it is probably close to normal.
3.5.4 Calculating Skewness
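The moment coefficient of skewness of a data set is
$$g_1 = \frac{m_3}{m_2^{3/2}}$$
where
$$m_3 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n}$$
$\bar{x}$ is the mean and n is the sample size, as usual.
$m_3$ is called the third moment of the data set.
$m_2$ is the variance, computed here with divisor n: $m_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$.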
Note: Remember that you have to choose one of two different measures of standard deviation,
depending on whether you have data for the whole population or just a sample. The same is true of
skewness. If you have the whole population, then g1 above is the measure of skewness. But if you have
just a sample, you need the sample skewness:
$$G_1 = \frac{\sqrt{n(n-1)}}{n-2}\, g_1$$
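The two skewness measures can be computed with a short Python sketch. The function names are hypothetical, and the study-hours data from Examples 3.2.1 are reused purely as an illustration.

```python
import math


def central_moment(xs, order):
    """Average of (x - mean) raised to `order`, with divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** order for x in xs) / n


def population_skewness(xs):
    """Moment coefficient of skewness: g1 = m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def sample_skewness(xs):
    """Sample skewness: G1 = sqrt(n(n - 1)) / (n - 2) * g1 (needs n > 2)."""
    n = len(xs)
    return math.sqrt(n * (n - 1)) / (n - 2) * population_skewness(xs)


hours = [5, 8, 2, 2, 2, 6, 5, 3, 1, 4]
g1 = population_skewness(hours)
G1 = sample_skewness(hours)
print(round(g1, 3), round(G1, 3))
```

For these data m2 = 4.36 and m3 = 4.824, giving g1 of about 0.53 and G1 of about 0.63, so the sample leans to the right. The correction factor is always greater than 1, so G1 is larger in magnitude than g1.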
3.5.5 Interpreting Skewness
1. If skewness is positive, the data are positively skewed or skewed right, meaning that the right
tail of the distribution is longer than the left.
2. If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail
is longer.
3. If skewness = 0, the data are perfectly symmetrical.
4. But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret the
skewness number? Bulmer, M. G., Principles of Statistics (Dover, 1979), suggests this classic
rule of thumb:
a. If skewness is less than −1 or greater than +1, the distribution is highly skewed.
b. If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately
skewed.
c. If skewness is between −½ and +½, the distribution is approximately symmetric.
Inferring
Your data set is just one sample drawn from a population. Maybe, from ordinary sample variability, your
sample is skewed even though the population is symmetric. But if the sample is skewed too much for
random chance to be the explanation, then you can conclude that there is skewness in the population.
To answer that, you need to divide the sample skewness G1 by the standard error of skewness (SES) to
get the test statistic, which measures how many standard errors separate the sample skewness from
zero:
test statistic:
$$Z_{g_1} = \frac{G_1}{SES}, \quad \text{where } SES = \sqrt{\frac{6n(n-1)}{(n-2)(n+1)(n+3)}}$$
The critical value of $Z_{g_1}$ is approximately 2. (This is a two-tailed test of skewness ≠ 0 at roughly the 0.05
significance level.)
If $Z_{g_1}$ < −2, the population is very likely skewed negatively (though you don’t know by how much).
If $Z_{g_1}$ is between −2 and +2, you can’t reach any conclusion about the skewness of the
population: it might be symmetric, or it might be skewed in either direction.
If $Z_{g_1}$ > 2, the population is very likely skewed positively (though you don’t know by how much).
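The test can be sketched in Python as follows. The function names and the returned wording are hypothetical, and the inputs (a sample skewness of 0.628 from a sample of size 10) are illustrative values, not results reported in these notes.

```python
import math


def standard_error_of_skewness(n):
    """SES = sqrt(6n(n - 1) / ((n - 2)(n + 1)(n + 3))), defined for n > 2."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def skewness_test(sample_skewness, n, critical=2.0):
    """Return the test statistic G1 / SES and the rule-of-thumb conclusion."""
    z = sample_skewness / standard_error_of_skewness(n)
    if z < -critical:
        verdict = "population very likely negatively skewed"
    elif z > critical:
        verdict = "population very likely positively skewed"
    else:
        verdict = "inconclusive"
    return z, verdict


# Hypothetical inputs: G1 = 0.628 from a sample of n = 10
z, verdict = skewness_test(0.628, 10)
print(round(standard_error_of_skewness(10), 3), round(z, 2), verdict)
```

With n = 10 the standard error is about 0.687, so a sample skewness of 0.628 is less than one standard error from zero and the test is inconclusive. Small samples need a large skewness before any conclusion about the population can be drawn.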
CASE STUDIES:
Case Study 1
1. A study was conducted to see how well reading success in first grade could be predicted from
various kinds of information obtained in kindergarten: age, sex, tribe, academic rank, and IQ.
Which of the variables represents each of the following?
a. nominal scale
b. ordinal scale
c. interval scale
d. ratio scale
2. Are the following variables discrete or continuous?
a. The number of correct answers on a true-false test.
b. The duration of the effectiveness of a pain medication.
c. The number of commercials aired daily by a television station.
d. The weights of Sunday newspapers.
e. The heights of basketball players.
3. Among 250 employees of the local office of an international insurance company, 182 are whites,
51 are blacks, and 17 are Orientals. If we use stratified random sampling to select a
committee of 15 employees, how many employees must we take from each group?
4. Suppose you were asked to make a study on the brand preferences and satisfaction of the
customers of famous laundry soaps in four (4) different supermarkets.
a. Arrange the following steps of statistical inquiry in logical order.
A. Collecting relevant information
B. Defining a problem
C. Interpreting the data
D. Analyzing the data
E. Organizing and presenting data
b. Who will be the most appropriate respondents of the study?
c. How will you apply multi-stage sampling to the population of the study?
e. Calculate the sample size if the population size is 2000 and the margin of error is 5%.
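The stratified sampling question and the sample-size question above both reduce to short arithmetic, sketched below in Python. The exercise names neither method, so two assumptions are made: the committee is split by proportional allocation (each stratum's share of the sample equals its share of the population), and the sample size uses Slovin's formula, n = N / (1 + N e^2). The function names and the stratum labels A, B, C are illustrative.

```python
import math


def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to stratum size."""
    total = sum(strata_sizes.values())
    exact = {name: sample_size * size / total
             for name, size in strata_sizes.items()}
    allocation = {name: math.floor(share) for name, share in exact.items()}
    # Give any seats lost to rounding down to the largest fractional parts.
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - allocation[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation


def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula (assumed here), rounded up to a whole respondent."""
    return math.ceil(population_size / (1 + population_size * margin_of_error ** 2))


# The three groups of 182, 51, and 17 employees; committee of 15.
print(proportional_allocation({"A": 182, "B": 51, "C": 17}, 15))
# Population of 2000 with a 5% margin of error.
print(slovin_sample_size(2000, 0.05))
```

Rounding up in the sample-size function is a design choice: a fractional respondent is not possible, and rounding down would give a margin of error slightly above the target.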
Case Study 2
1. Create a textual presentation based on the table shown below. Suppose there are 800 million
users per day.
2. Create tabular and graphical (any type) presentations of the textual presentation below.
“The top three regions in terms of population count are Region IV-Southern
Tagalog (11.32 million or 15.04% of the total), NCR (10.49 million or 13.93%), Region III
– Central Luzon (7.80 million or 10.35%). The population residing in these regions
combined comprises 39.32% of the total Filipino population. This means that four out of
ten persons in the country reside in NCR and the adjoining regions of Central Luzon and
Southern Tagalog.”
3. Using the table below
Table 2.5 Number of Passengers for P&P Airlines
68 72 50 70 65 83 77 78 80 93
71 74 60 84 72 84 73 81 84 92
77 57 70 59 85 74 78 79 91 102
83 67 66 75 79 82 93 90 101 80
79 69 76 94 71 97 92 83 86 69
a. Construct a frequency distribution table (with the class interval, frequency, class
boundaries, class marks and cumulative frequency) for the given data.
b. Construct its bar graph, histogram, frequency polygon, and frequency ogive.
c. Determine whether the given data set is normally distributed.
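One way to build the frequency distribution table for the passenger data is sketched below in Python. The number of classes comes from Sturges' rule, k = 1 + 3.322 log n, rounded up, and the class width is chosen so that k classes starting at the smallest value cover all the data. Both choices are assumptions, and a different rule gives a different but equally valid table. The function name is illustrative.

```python
import math

# Number of passengers for P&P Airlines (the 50 values of Table 2.5).
passengers = [
    68, 72, 50, 70, 65, 83, 77, 78, 80, 93,
    71, 74, 60, 84, 72, 84, 73, 81, 84, 92,
    77, 57, 70, 59, 85, 74, 78, 79, 91, 102,
    83, 67, 66, 75, 79, 82, 93, 90, 101, 80,
    79, 69, 76, 94, 71, 97, 92, 83, 86, 69,
]


def frequency_table(data, num_classes):
    """Frequency distribution for whole-number data."""
    low, high = min(data), max(data)
    width = math.ceil((high - low + 1) / num_classes)
    rows = []
    cumulative = 0
    for i in range(num_classes):
        lower = low + i * width
        upper = lower + width - 1
        frequency = sum(lower <= x <= upper for x in data)
        cumulative += frequency
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "mark": (lower + upper) / 2,
            "frequency": frequency,
            "cumulative": cumulative,
        })
    return rows


# Sturges' rule (assumed) for the number of classes.
num_classes = math.ceil(1 + 3.322 * math.log10(len(passengers)))
table = frequency_table(passengers, num_classes)
for row in table:
    print(row)
```

The class boundaries and class marks in this sketch assume whole-number data, which is why each boundary sits 0.5 below or above a class limit.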
4. Given the frequency polygon below.
[Figure: frequency polygon. Vertical axis: FREQUENCY (labeled values 0, 6, 10, 12, 14). Horizontal axis: CLASS MARKS (21.2 22.9 24.6 26.3 28 29.7 31.1 34.8 36.5).]
a. Reconstruct the frequency distribution table.
b. Construct the frequency histogram.
c. Answer the following:
i. What is the lower class limit of the lowest class?
ii. What is the lower class boundary of the highest class?
iii. What is the class width?
Case Study 3
1. A random sample of 10 students was given a special test. The times, in minutes, that the students
took to finish the exam are as follows:
15 30 26 40 35 19 22 28 17 38
Find the following:
a) Mean
b) Median
c) Variance
d) Standard Deviation
e) Range
f) Mode
g) Coefficient of Variation
h) 18th Percentile
i) 7th Decile
j) 3rd Quartile
2. Suppose that you are investigating the influence of an interactive approach on students’
mathematics performance. Consider the following samples of students’ final examination scores
taken from three (3) sections of Math 1 enrolled during the first semester of SY 2011 – 2012.
Sections Sample Scores
Rizal 19 8 7 2 19 29 36 20 3 14
Bonifacio 14 25 12 32 13 17 10 22 13 32
Luna 24 13 20 1 8 28 16 21 23 26
a. Describe the performance of each section by their respective mean and standard
deviation.
b. Which of these three sections showed the greatest improvement in the students’ performance in
mathematics? Explain why.
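A minimal Python sketch for describing the three sections is given below. It uses the sample standard deviation (divisor n − 1) because the scores are samples, and it also reports the coefficient of variation so that spread can be compared relative to each mean. The helper name is illustrative, and the choice of the n − 1 divisor is an assumption.

```python
import statistics

scores = {
    "Rizal": [19, 8, 7, 2, 19, 29, 36, 20, 3, 14],
    "Bonifacio": [14, 25, 12, 32, 13, 17, 10, 22, 13, 32],
    "Luna": [24, 13, 20, 1, 8, 28, 16, 21, 23, 26],
}


def summarize(values):
    """Mean, sample standard deviation, and coefficient of variation (%)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, divisor n - 1
    return {"mean": mean, "sd": sd, "cv_percent": 100 * sd / mean}


summary = {section: summarize(values) for section, values in scores.items()}
for section, stats in summary.items():
    print(f"{section}: mean = {stats['mean']:.2f}, "
          f"SD = {stats['sd']:.2f}, CV = {stats['cv_percent']:.2f}%")
```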
3. The table below shows the distribution of the responses of your respondents in the emotional
intelligence inventory.
Emotional Intelligence Inventory
Indicators   Almost Never (1)   Seldom (2)   Sometimes (3)   Usually (4)   Almost Always (5)
1. I appropriately communicate decisions to stakeholders. 11 9 15 5 9
2. I fail to recognize how my feelings drive my behavior at work. 18 2 10 12 8
3. When upset at work, I still think clearly. 5 6 15 14 8
4. I fail to handle stressful situations at work effectively. 10 12 8 14 6
5. I understand the things that make people feel optimistic at work. 18 2 13 7 10
6. I fail to keep calm in difficult situations at work. 21 12 8 9 0
7. I am effective in helping others feel positive at work. 1 4 16 19 10
8. I find it difficult to identify the things that motivate people at work. 15 12 5 8 5
1. Find the weighted mean of each statement.
2. Set up a Likert scale with 5 intervals to interpret the results by assigning a descriptive equivalent
such as “very low”, “low”, “average”, “high”, “very high”.
3. Find the standard deviation of each item.
4. Find the grand mean.
5. Interpret the results.
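The weighted mean, standard deviation, and grand mean for the inventory can be sketched in Python as follows. Several choices here are assumptions rather than part of the exercise: the five intervals are equal in width ((5 − 1) / 5 = 0.8), the standard deviation uses the divisor n − 1, the grand mean is the average of the item weighted means, and the function names are illustrative.

```python
import math

WEIGHTS = [1, 2, 3, 4, 5]  # Almost Never ... Almost Always

# Response counts per item, in the order of the five weights above.
responses = {
    1: [11, 9, 15, 5, 9],
    2: [18, 2, 10, 12, 8],
    3: [5, 6, 15, 14, 8],
    4: [10, 12, 8, 14, 6],
    5: [18, 2, 13, 7, 10],
    6: [21, 12, 8, 9, 0],
    7: [1, 4, 16, 19, 10],
    8: [15, 12, 5, 8, 5],
}

# Assumed scale: upper limit of each interval and its descriptive equivalent.
SCALE = [(1.80, "very low"), (2.60, "low"), (3.40, "average"),
         (4.20, "high"), (5.00, "very high")]


def weighted_mean(counts):
    """Sum of weight x frequency divided by the number of responses."""
    return sum(w * f for w, f in zip(WEIGHTS, counts)) / sum(counts)


def weighted_sd(counts):
    """Standard deviation of the responses to one item (divisor n - 1)."""
    mean = weighted_mean(counts)
    squared = sum(f * (w - mean) ** 2 for w, f in zip(WEIGHTS, counts))
    return math.sqrt(squared / (sum(counts) - 1))


def interpret(mean):
    """Descriptive equivalent of a mean under the assumed scale."""
    for upper, label in SCALE:
        if mean <= upper:
            return label
    return SCALE[-1][1]


item_means = {item: weighted_mean(counts) for item, counts in responses.items()}
grand_mean = sum(item_means.values()) / len(item_means)

for item, counts in responses.items():
    print(f"Item {item}: mean = {item_means[item]:.2f}, "
          f"SD = {weighted_sd(counts):.2f}, {interpret(item_means[item])}")
print(f"Grand mean = {grand_mean:.2f} ({interpret(grand_mean)})")
```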