Chapter-Four.pdf

ARBA MINCH TECHNOLOGY INSTITUTE
DEPARTMENT OF MECHANICAL ENGINEERING
Chapter Four: Processing and Data Analysis
Instructor: Solomon N.(Ph.D.)
Academic Year: 2022/23
1

Processing and Data Analysis
2
 After data collection, raw data must be converted into
meaningful statements; this involves data processing,
data analysis, and data interpretation and
presentation.
 Acquiring data: Acquisition involves collecting or
adding to the data holdings. There are several methods
of acquiring data:
 collecting new data
 using your own previously collected data
 reusing someone else's data
 purchasing data
 acquiring data from the Internet (texts, social media, photos)

Processing and Data Analysis
3
 Data processing: A series of actions or steps performed
on data to verify, organize, transform, integrate, and
extract data in an appropriate output form for
subsequent use.
 Methods of processing must be rigorously
documented to ensure the utility and integrity of the
data.
 Data Analysis involves actions and methods performed
on data that help describe facts, detect patterns,
develop explanations and test hypotheses. This
includes data quality assurance, statistical data analysis,
modeling, and interpretation of results.

Data Preservation and Re-use
4
 Data preservation involves actions and procedures to keep
data for future use, and includes data archiving and/or data
submission to a data repository. Data preservation needs data
description, documentation and metadata.
 The goal of all these actions is to make data findable,
comprehensible and easy to use. It also involves long-term
preservation and curation of data.
 Documentation provides an overview of the research context
and design, data collection methods, data preparation and
results or findings and is key to enabling the secondary user
to make informed use of the data.
 Metadata provide standardized, structured information
explaining the purpose, origin, time references, geographic
location, creator, access conditions and terms of use of data.

Data Processing
5
 Data processing is concerned with editing, coding,
classifying, tabulating, charting and diagramming
research data. The essence of data processing in research
is data reduction.
 Data reduction involves winnowing out the irrelevant
from the relevant data, establishing order from disorder
and giving shape to a mass of data.
 DOI: a digital object identifier is a unique and persistent
identifier that makes data sets easy to find and cite.
Example: doi.org/10.1016/j.ecolind.2015.04.011
 Data re-use means data mining, replication research,
comparative studies, longitudinal research etc. E.g. data
collected for one research objective can be used in a new
study dealing with some other similar problem.

Data Processing
6
 Six stages of data processing
 Data collection
 Data preparation
 Data input
 Data processing
 Data output/interpretation
 Data storage and report writing
1. Data collection: Collecting data is the first step in data
processing. Data is pulled from available sources,
including data lakes and data warehouses.
2. Data preparation: Once the data is collected, it
enters the data preparation stage. Data preparation,
often referred to as “pre-processing”, is the stage at
which raw data is cleaned up and organized for the
following stage of data processing.

Data Processing
7
3. Data input: The clean data is then entered into its
destination and translated into a language that the
destination system can understand. Data input is the first
stage in which raw data begins to take the form of usable
information.
4. Processing: During this stage, the data entered into
the computer in the previous stage is actually
processed for interpretation. Processing is often done
using machine learning algorithms, though the process
itself may vary slightly depending on the source of the
data being processed.

Data Processing
8
5. Data output/interpretation: The output/interpretation
stage is the stage at which data is finally usable to
non-data scientists. It is translated, readable, and often
in the form of graphs, videos, images, plain text, etc.
6. Data storage and report writing: The final stage of
data processing is storage. After all of the data is
processed, it is stored for future use. While some
information may be put to use immediately, much of it
will serve a purpose later on.

STATISTICS IN RESEARCH
9
 The role of statistics in research is to function as a tool in
designing research, analyzing its data and drawing conclusions.
Most research studies result in a large volume of raw data, which
must be suitably reduced so that it can be read easily and
used for further analysis. There are two major areas of
statistics: descriptive statistics and inferential statistics.

Cont.
10
 Descriptive statistics concern the development of certain
indices from the raw data.
 Inferential statistics are concerned with the process of
generalization. Inferential statistics are also known as
sampling statistics and are mainly concerned with two
major types of problems:
 the estimation of population parameters, and
 the testing of statistical hypotheses.
“Descriptive” describes data, while “inferential” infers or allows
the researcher to arrive at a conclusion based on the collected
information.

Cont.
12
For example, suppose you are asked to research teenage
pregnancy in a certain high school. Using either descriptive
or inferential statistics, you would study the number
of teenage pregnancy cases in the school over a specific
number of years. The difference is that with descriptive
statistics, you are merely summarizing the collected data
and, if possible, detecting a pattern in the changes.
For example, it can be said that for the past five years, the
majority of teenage pregnancies in X High School happened
to those enrolled in the third year. Descriptive statistics do
not predict that in the sixth year the third-year students would
still be the ones with the greater number of teenage
pregnancies. Conclusions as well as predictions are only
made in inferential statistics.

13
The important statistical measures that are used to
summarize the survey/research data are:
 measures of central tendency or statistical
averages;
 measures of dispersion;
 measures of asymmetry (skewness);
 measures of relationship; and
 other measures.

14
Measures of Central Tendency
 Amongst the measures of central tendency, the three
most important ones are the arithmetic average or
mean, median and mode.
 A measure of central tendency is a single value that
attempts to describe a set of data by identifying the
central position within that set of data. As such,
measures of central tendency are sometimes called
measures of central location.
 The mean, median and mode are all valid measures of
central tendency, but under different conditions, some
measures of central tendency become more appropriate
to use than others.

Cont.
15
Mean (Arithmetic)
 The mean (or average) is the most popular and well
known measure of central tendency.
 It can be used with both discrete and continuous
data, although its use is most often with continuous
data.
 The mean is equal to the sum of all the values in
the data set divided by the number of values in the
data set.
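In symbols, the definition above can be written as follows, where $n$ is the number of values and $x_i$ is the $i$-th value (the notation is an assumption, since the slide's own formula was not preserved):

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$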

Cont.
16
For example, consider the wages of ten staff at a factory.
 The mean salary for these ten staff is $30.7k. However,
inspecting the raw data suggests that this mean value
might not be the best way to accurately reflect the
typical salary of a worker, as most workers have salaries
in the $12k to $18k range. The mean is being skewed
by the two large salaries. Therefore, in this
situation, we would like to have a better measure of
central tendency.

Cont.
17
Median
The median is the middle score for a set of data that has
been arranged in order of magnitude. The median is less
affected by outliers and skewed data. To calculate
the median, first arrange the marks in order of magnitude;
the median is then the middle mark,
in this case 56. It is the middle mark because there are 5
scores before it and 5 scores after it. This works fine when
you have an odd number of scores, but what happens
when you have an even number of scores? What if you had
only 10 scores? Well, you simply have to take the middle
two scores and average them.

Cont.
18
Mode
The mode is the most frequent score in our data set. It is
represented by the highest bar in a bar chart or histogram. You can,
therefore, sometimes consider the mode as being the most popular
option.
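The three measures of central tendency can be sketched with Python's standard `statistics` module. The marks below are hypothetical values (they are not taken from the slides), chosen only so that the median of the eleven scores is 56, as in the median discussion above.

```python
import statistics

# Hypothetical marks, assumed for illustration only (11 scores).
marks = [65, 55, 89, 56, 35, 14, 56, 55, 87, 45, 92]

mean_mark = statistics.mean(marks)       # sum of values / number of values
median_mark = statistics.median(marks)   # middle value of the sorted data
modes = statistics.multimode(marks)      # every value tied for most frequent

# With an even number of scores the two middle values are averaged.
median_of_ten = statistics.median(marks[:10])

print(mean_mark, median_mark, modes, median_of_ten)
```

Note that a data set can have more than one mode, which is why `multimode` returns a list.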

Measure of variation
19
 Example
Consider the following two sets of scores:
Set 1: 40, 50, 60, 60, 40, 50
Set 2: 0, 100, 25, 75, 80, 20
 Both these sets have the same mean (50),
 but the second set is a lot more widely dispersed ("scattered") than
the first.
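A quick check of this example, as a minimal sketch: both sets share the same mean, while the range and the (population) standard deviation show how much more scattered Set 2 is. The choice of these two dispersion measures for the comparison is an assumption for illustration.

```python
import statistics

set_1 = [40, 50, 60, 60, 40, 50]
set_2 = [0, 100, 25, 75, 80, 20]

for name, scores in (("Set 1", set_1), ("Set 2", set_2)):
    mean = statistics.mean(scores)
    value_range = max(scores) - min(scores)   # maximum - minimum
    sd = statistics.pstdev(scores)            # population standard deviation
    print(f"{name}: mean={mean}, range={value_range}, SD={sd:.2f}")
```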

Measure of variation/dispersion
20
 The scatter or spread of items of a distribution is
known as dispersion or variation.
 In other words the degree to which numerical data
tend to spread about an average value is called
dispersion or variation of the data.
 Measures of dispersion are statistical measures which
provide ways of measuring the extent to which data
are dispersed or spread out.

Objective of Measuring Variation
21
 To determine the reliability of an average by showing
how far the average is representative of the
entire data.
 To determine the nature and cause of variation in
order to control the variation itself.
 To enable comparison of two or more distributions with
regard to their variability.
 Measuring variability is of great importance to other
statistical analyses. E.g., it is the basis of statistical
quality control.

A good measure of variation
22
 It should be easy to compute and understand.
 It should be based on all observations.
 It should be uniquely defined.
 It should be capable of further statistical treatment.
 It should be affected as little as possible by extreme values.

Types of measure of variation
23
Absolute measures: measures of dispersion that are
expressed in the original units of the data are termed absolute
measures:
 Range
 Quartile deviation
 Mean deviation
 Variance
 Standard deviation
Relative measures: these are known as coefficients of dispersion and
are obtained as ratios or percentages.
 Relative range
 Coefficient of quartile deviation
 Coefficient of mean deviation
 Coefficient of variation
 Standard scores

The range
24
Several measures of dispersion are available. We will
discuss the common ones below.
The Range:
 The difference between the largest (maximum) and
smallest (minimum) values.
Range = Maximum – Minimum (1)
For frequency distributed data, the range is:
 The difference between the upper class boundary of
the last class and the lower class boundary of the first
class.

Measure of Dispersion
25
 Find the range of 54.5, 55.0, 55.7, 51.8, 54.2, 52.4 (in cm).
Solution:
 Range (R) = 55.7 − 51.8 = 3.9 cm
Solution for a frequency distribution:
 Range = UCB of the last class − LCB of the first class = 118.5 − 52.5 = 66

26
Quartile deviation (QD):
The range expresses the extreme variability of the
observations of a variable. QD is half of the difference
between the upper and lower quartiles, that is, half of the
interquartile range:
$$QD = \frac{Q_3 - Q_1}{2}$$
 It gives the average amount by which the two quartiles differ
from the median.
 Coefficient of quartile deviation (CQD):
$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

27
 Mean Deviation(M.D):
 The average deviation measures the scatter of the
individual observations around a central value usually
the mean or the median of a distribution.
 The mean deviation is defined as the arithmetic mean
of the absolute deviations of each observation from either
the mean or the median of a distribution.
 If the deviations are taken from the mean then it is
called mean deviation about the mean.
 On the other hand, if the deviations are taken from
the median we call it mean deviation about the
median.

Mean deviation
28
 The mean deviation (M.D) is the arithmetic mean of the absolute
deviations of the values from the mean, that is, the “average absolute
deviation of the values from the mean”:
$$M.D = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
 Note that when dealing with population values, the formula is adjusted
accordingly.
 Mean deviation for grouped data (discrete or continuous):
$$M.D = \frac{1}{n}\sum_{i=1}^{m} f_i\,|x_i - \bar{x}|$$
 where m = number of classes, xi = class mark of the ith class,
fi = frequency of the ith class and n = number of observations.

Mean deviation
29
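 Mean deviation about the median (MD), where $\tilde{x}$ denotes the median
Ungrouped data:
$$M.D(\tilde{x}) = \frac{1}{n}\sum_{i=1}^{n} |x_i - \tilde{x}|$$
Grouped frequency distribution:
$$M.D(\tilde{x}) = \frac{1}{n}\sum_{i=1}^{m} f_i\,|x_i - \tilde{x}|$$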

Example
30
 The weights of a sample of six students from a class (in kilograms) are
measured as: 53, 56, 57, 59, 63 and 66. Find the mean deviation about
the mean and the mean deviation about the median.
 Solution: First find the mean and the median. The mean is 59 kg and
the median is 58 kg. Then take the absolute deviations of each observation
from these averages and average them.
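The calculation for the six weights can be sketched as below. The helper name `mean_deviation` is an assumption introduced for this illustration.

```python
import statistics

weights_kg = [53, 56, 57, 59, 63, 66]

def mean_deviation(values, center):
    """Average absolute deviation of the values from the given center."""
    return sum(abs(v - center) for v in values) / len(values)

mean_w = statistics.mean(weights_kg)       # 59 kg
median_w = statistics.median(weights_kg)   # (57 + 59) / 2 = 58 kg

md_about_mean = mean_deviation(weights_kg, mean_w)
md_about_median = mean_deviation(weights_kg, median_w)

print(f"MD about mean   = {md_about_mean:.2f} kg")
print(f"MD about median = {md_about_median:.2f} kg")
```

For this particular data set both absolute deviation totals equal 22, so both mean deviations are 22/6 ≈ 3.67 kg.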

Example cont.
31
Example: Calculate the mean deviation from the mean and median for the
following data.

Solution
32
Class mark of each class = (lower class limit + upper class limit) / 2,
e.g. (1 + 5)/2 = 3
Class   xi   fi   fi·xi   |xi − x̄|       fi·|xi − x̄|
1-5     3    4    12      |3 − 10| = 7    4 × 7 = 28
6-10    8    1    8       2               1 × 2 = 2
11-15   13   2    26      3               2 × 3 = 6
16-20   18   3    54      8               3 × 8 = 24
Total        10   100                     60
Mean = Σfi·xi / Σfi = 100/10 = 10
MD from the mean = Σfi·|xi − x̄| / Σfi = 60/10 = 6

Solution
33
Class marks listed by frequency: 3, 3, 3, 3, 8, 13, 13, 18, 18, 18
Median = (8 + 13)/2 = 10.5
Class   xi   fi   fi·xi   |xi − median|       fi·|xi − median|
1-5     3    4    12      |3 − 10.5| = 7.5    4 × 7.5 = 30
6-10    8    1    8       2.5                 1 × 2.5 = 2.5
11-15   13   2    26      2.5                 2 × 2.5 = 5
16-20   18   3    54      7.5                 3 × 7.5 = 22.5
Total        10   100                         60
MD from the median = Σfi·|xi − median| / Σfi = 60/10 = 6

Coefficients of Mean Deviation(C.M.D)
34
Example: Find the coefficient of mean deviation about the mean and
mean deviation about the median for the weights of six students in
example above.
Solution: Coefficient of mean deviation about the mean
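The slide's own working was not preserved. A sketch of the calculation, assuming the usual definition of the coefficient of mean deviation as the mean deviation divided by the average it is taken about, and using the six weights 53, 56, 57, 59, 63, 66 (mean 59, median 58, sum of absolute deviations 22 in both cases):

$$C.M.D_{\text{mean}} = \frac{M.D(\bar{x})}{\bar{x}} = \frac{22/6}{59} \approx \frac{3.67}{59} \approx 0.062$$

$$C.M.D_{\text{median}} = \frac{M.D(\tilde{x})}{\tilde{x}} = \frac{22/6}{58} \approx \frac{3.67}{58} \approx 0.063$$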

Variance and Standard Deviation
35
 The variance and standard deviation are the most
widely used measures of dispersion.
 Both measure the average dispersion of the observations
around the mean.
 The variance is defined as the average of the squared deviations
from the mean.
 Variance is a measure of dispersion that takes into account the
spread of all data points in a data set. It is the measure of dispersion
most often used, along with the standard deviation, which is
simply the square root of the variance.
 The variance is the mean squared difference between each data point
and the center of the distribution as measured by the mean.
 An item selected at random from a data set whose standard
deviation is low has a better chance of being close to the mean
than an item from a data set whose standard deviation is higher.

Variance and standard deviation formula
36
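The formulas on this slide were not preserved. The standard definitions, given here as an assumption about what the slide showed, are as follows for a population of size $N$ with mean $\mu$ and for a sample of size $n$ with mean $\bar{x}$:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$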

37

Cont.
38
Example
Calculate the population variance from the following 5
observations: 50, 55, 45, 60, 40.
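This example can be checked with a minimal sketch using Python's standard `statistics` module, where `pvariance` and `pstdev` use the population (divide by N) definitions.

```python
import statistics

observations = [50, 55, 45, 60, 40]

# Mean is 50; squared deviations are 0, 25, 25, 100, 100 (sum 250).
pop_variance = statistics.pvariance(observations)   # 250 / 5
pop_sd = statistics.pstdev(observations)            # square root of the variance

print(pop_variance, round(pop_sd, 3))
```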

Cont.
39
Example: 24, 25, 29,29,30,31
Find variance and standard deviation ?
Solution:
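The slide's working was not preserved, and the example does not say whether the data are a sample or a population. A sketch assuming sample data (divisor $n - 1$):

$$\bar{x} = \frac{24+25+29+29+30+31}{6} = 28, \qquad \sum (x_i - \bar{x})^2 = 16+9+1+1+4+9 = 40$$

$$s^2 = \frac{40}{6-1} = 8, \qquad s = \sqrt{8} \approx 2.83$$

If the six values are instead treated as a whole population, the divisor is 6, giving $\sigma^2 = 40/6 \approx 6.67$ and $\sigma \approx 2.58$.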

40
Quiz-1
Find the variance and standard deviation of the following sample data
i. 5, 17, 12, 10,8
ii .The data is given in the form of frequency distribution.

Coefficient of Variation
41
The coefficient of variation (CV) is the ratio of the
standard deviation to the mean. The higher the coefficient
of variation, the greater the level of dispersion around the
mean. It is generally expressed as a percentage. Without
units, it allows for comparison between distributions of
values whose scales of measurement are not comparable.
When we are presented with estimated values, the CV
relates the standard deviation of the estimate to the value
of this estimate. The lower the value of the coefficient of
variation, the more precise the estimate.

Coefficient of Variation formula
42
 In situations where either two series have different units of
measurements, or their means differ sufficiently in size, the CV
should be used as a measure of dispersion.
 In spite of the fact that the C.V. is broadly applied, its
disadvantage is that it’s not useful when the mean is negative or
zero or very close to zero.
 Interpretation of the coefficient of variation: the distribution
having less CV is said to be less variable or more consistent

Why We Need the Coefficient of Variation
43
Standard deviation is the most common measure of
variability for a single data set. But why do we need yet
another measure such as the coefficient of variation?
Comparing the standard deviations of two data sets can be
misleading when their means or units of measurement differ,
whereas coefficients of variation can be compared directly.
Example question: Two versions of a test are given to
students. One test has pre-set answers and a second test
has randomized answers. Find the coefficient of variation
of each.
        Regular test   Randomized answers
Mean    50.1           45.8
SD      11.2           12.9

Cont.
44
Solution
Step 1: Divide the standard deviation by the mean for the
first sample: 11.2 / 50.1 = 0.22355
Step 2: Multiply Step 1 by 100: 0.22355 * 100 = 22.355%
Step 3: Divide the standard deviation by the mean for the
second sample: 12.9 / 45.8 = 0.28166
Step 4: Multiply Step 3 by 100: 0.28166 * 100 = 28.166%
That’s it! Now you can compare the two results directly: the
randomized test has the higher coefficient of variation, so its
scores are relatively more variable.
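The four steps above can be sketched as a small function. The function name `coefficient_of_variation` is an assumption introduced for this illustration.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

cv_regular = coefficient_of_variation(11.2, 50.1)
cv_randomized = coefficient_of_variation(12.9, 45.8)

print(f"Regular test:       {cv_regular:.3f}%")
print(f"Randomized answers: {cv_randomized:.3f}%")
```

The guard for a zero mean reflects the limitation noted earlier: the CV is not useful when the mean is zero or very close to zero.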

Cont.
45
Quiz-2: Find the population coefficient of variation of 24,
26, 33, 37, 29, 31.
solution


Cont.
47
Example: Suppose that the mean weight of a group
of students is 165 pounds with a S.D of 8 pounds. If
the height of the same group of students has a
mean of 60 inches with a S.D of 3 inches, compare
the variability in weight and height measurements.
Solution:
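The slide's working was not preserved. A sketch of the comparison using the coefficient of variation, since weight and height are measured in different units:

$$CV_{\text{weight}} = \frac{8}{165} \times 100\% \approx 4.85\%, \qquad CV_{\text{height}} = \frac{3}{60} \times 100\% = 5\%$$

Because the CV of height is slightly larger, the height measurements are slightly more variable, relative to their mean, than the weight measurements.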

Standard Scores (Z-Scores)
48
 A Z-score is a statistical measurement of a score's
relationship to the mean in a group of scores.
 Standard scores are not measures of relative dispersion, but
one of the applications of the standard deviation.
 We define the standard score of a value $x$ as
$z = \frac{x - \bar{x}}{s}$, where $\bar{x}$ is the mean and $s$ the
standard deviation.
 It tells us how many standard deviations a value lies
above (if positive) or below (if negative) the mean.
 The standard score gives the deviation from the mean in
units of standard deviation. It is used to compare two
observations coming from different groups.
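A minimal sketch of the standard score and its inverse. The function names and the numbers used (a group with mean 70 and standard deviation 5) are hypothetical and chosen only for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Inverse of z_score: the raw value that corresponds to a given z."""
    return mean + z * sd

# Hypothetical group: mean 70, standard deviation 5.
z_for_80 = z_score(80, 70, 5)          # 80 is two SDs above the mean
x_for_low_z = value_from_z(-1.5, 70, 5)

print(z_for_80, x_for_low_z)
```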

49
 Questions: Two third year Medical laboratory sections were given
introduction to biostatistics examinations. The following information
was given.
 Student A from section1 scored 90 and student B from section 2
scored 95.
 Relatively speaking who performed better ?
Student A performed better relative to his section because the score of student A
is two standard deviation above the mean score of his section while, the score of
student B is only one standard deviation above the mean score of his section

50
Quiz 3: Given that the mean and standard deviation are 50 and 10, what value
of x has a z-score of 1.4? What is the z-score that corresponds to x
= 30?

Moments
51
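 The rth moment about the mean (the rth central moment) is defined as:
$$m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$$
 For continuous grouped data it is given by:
$$m_r = \frac{1}{n}\sum_{i=1}^{k} f_i\,(x_i - \bar{x})^r$$
where $x_i$ is the class mark and $f_i$ the frequency of the ith of $k$ classes, and $n = \sum f_i$.
Example: Find the first three central moments of the numbers 2, 3 and 7.
Solution: first find the mean: $\bar{x} = (2 + 3 + 7)/3 = 4$. The deviations
from the mean are −2, −1 and 3, so $m_1 = 0$, $m_2 = 14/3 \approx 4.67$ and $m_3 = 18/3 = 6$.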
52
Normal Distribution, Skewness and Kurtosis
 A normal distribution is the proper term for a probability
bell curve.
 In a standard normal distribution the mean is zero and the
standard deviation is 1. Every normal distribution has zero
skew and a kurtosis of 3. Normal distributions are symmetrical,
but not all symmetrical distributions are normal.
What are the 4 characteristics of a normal distribution?
 Normal distributions are symmetric, unimodal, and
asymptotic, and the mean, median, and mode are all
equal.
 A normal distribution is perfectly symmetrical around its
center. That is, the right side of the center is a mirror
image of the left side

53
Skewness
 Skewness is the degree of asymmetry or departure from
symmetry of a distribution.
 A skewed frequency distribution is one that is not
symmetrical.
 Skewness is concerned with the shape of the curve not size
 If the frequency curve (smoothed frequency polygon) of a
distribution has a longer tail to the right of the central
maximum than to the left, the distribution is said to be
skewed to the right or said to have positive skewness. If it
has a longer tail to the left of the central maximum than to
the right, it is said to be skewed to the left or said to have
negative skewness.
 For a moderately skewed distribution, the following empirical relation
holds among the three commonly used measures of central tendency:
Mean − Mode ≈ 3 (Mean − Median)

Skewness
54
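 A unimodal distribution is a distribution with one clear peak or most
frequent value.
 “Asymptotic”, for a normal curve, means that the tails approach the
horizontal axis but never touch it. (In estimation, the same word refers to
how an estimator behaves as the sample size tends to infinity.)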

Skewness
55
 Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point.
 In respect of the measures of skewness and kurtosis, we mostly use
the first measure of skewness based on mean and mode or on
mean and median.
 Positive Skewed
Mode < Median < Mean
 Negative Skewed
Mean < Median < Mode
 Zero Skewed
Mean = Median = Mode

Skewness
56
The Karl Pearson’s Coefficient of Skewness (SK):
$$SK = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$$
If SK = 0, then the distribution is symmetrical.
If SK > 0, then the distribution is positively skewed.
If SK < 0, then the distribution is negatively skewed.
Example. Suppose the mean, the mode, and the standard deviation of
a certain distribution are 32, 30.5 and 10 respectively. What is the
shape of the curve representing the distribution?
 Solution:
SK = (32 − 30.5)/10 = 0.15 > 0
The distribution is positively skewed.

Kurtosis
57
Kurtosis
 Kurtosis is a measure of whether the data are heavy-
tailed or light-tailed relative to a normal distribution.
 A normal distribution has a kurtosis of 3 and is
recognized as mesokurtic.
 An increased kurtosis (> 3) can be visualized as a thin
“bell” with a high peak and heavier tails, whereas a
decreased kurtosis (< 3) corresponds to a broader, flatter
peak and lighter tails.
 In finance, kurtosis is used as a measure of financial risk.

Kurtosis
58
 A large kurtosis is associated with a high level of risk for an
investment because it indicates that there are high
probabilities of extremely large and extremely small returns.
 On the other hand, a small kurtosis signals a moderate level
of risk because the probabilities of extreme returns are
relatively low.

Kurtosis
59
 Kurtosis is the degree of peakedness of a distribution, usually taken
relative to a normal distribution.
When the curve of a distribution is:
 flatter than normal, it is known as platykurtic;
 more peaked than normal, it is called leptokurtic.
 The normal distribution, which is neither very highly peaked nor flat
topped, is called mesokurtic.
The moment coefficient of kurtosis (β2): $\beta_2 = m_4 / m_2^2$
If β2 = 3, then the distribution is mesokurtic.
If β2 > 3, then the distribution is leptokurtic.
If β2 < 3, then the distribution is platykurtic.

Acceptable Standard Deviation (SD)
60
 A smaller SD represents data where the results are
very close in value to the mean. The larger the SD, the
more variation there is in the results.
 Data points in a normal distribution are more likely to
fall closer to the mean. In fact, about 68% of all data points
will be within ±1 SD of the mean, about 95% of all data
points will be within ±2 SD of the mean, and about 99.7%
of all data points will be within ±3 SD.
 Statisticians have determined that values no greater
than plus or minus 2 SD represent measurements that
are closer to the true value than those that
fall in the area beyond ±2 SD.
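The percentages quoted above can be reproduced with a minimal sketch. It relies on the standard result that, for a normal distribution, the probability of lying within k standard deviations of the mean equals erf(k/√2); the function name `within_k_sd` is an assumption for this illustration.

```python
import math

def within_k_sd(k):
    """Probability that a normal variable lies within k SDs of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} SD: {within_k_sd(k) * 100:.2f}%")
```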


62
A cholesterol control is run 20 times over 25 days, yielding the following
results in mg/dL: 192, 188, 190, 190, 189, 191, 188, 193, 188, 190, 191, 194,
194, 188, 192, 190, 189, 189, 191, 192.
• Using the cholesterol control results, follow the steps described below to
establish quality control (QC) ranges.
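The steps on the following slide were not preserved. A minimal sketch of one common approach, assuming the QC ranges are taken as the mean ± 1, 2 and 3 sample standard deviations (this choice is an assumption, not something the slides confirm):

```python
import statistics

cholesterol_mg_dl = [192, 188, 190, 190, 189, 191, 188, 193, 188, 190,
                     191, 194, 194, 188, 192, 190, 189, 189, 191, 192]

mean = statistics.mean(cholesterol_mg_dl)
sd = statistics.stdev(cholesterol_mg_dl)   # sample SD (divisor n - 1), an assumption

# QC range for each multiple k of the standard deviation.
qc_ranges = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

print(f"mean = {mean:.2f} mg/dL, SD = {sd:.2f} mg/dL")
for k, (low, high) in qc_ranges.items():
    print(f"±{k} SD range: {low:.1f} to {high:.1f} mg/dL")
```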

63

Skewness and Kurtosis
64
Formula & Examples
1. Calculate the sample skewness and sample kurtosis from the following grouped
data:
Class     Frequency
2 - 4     3
4 - 6     4
6 - 8     2
8 - 10    1
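The formulas for this example were not preserved. The sketch below assumes the moment-based coefficients (skewness m3 / m2^1.5 and kurtosis m4 / m2^2, with divisor n and class midpoints standing in for the observations). Sample formulas with n − 1 corrections, if that is what the slide intended, would give slightly different values.

```python
class_limits = [(2, 4), (4, 6), (6, 8), (8, 10)]
frequencies = [3, 4, 2, 1]

midpoints = [(low + high) / 2 for low, high in class_limits]   # 3, 5, 7, 9
n = sum(frequencies)
mean = sum(f * x for f, x in zip(frequencies, midpoints)) / n

def grouped_central_moment(r):
    """r-th central moment of the grouped data (divisor n)."""
    return sum(f * (x - mean) ** r for f, x in zip(frequencies, midpoints)) / n

m2, m3, m4 = (grouped_central_moment(r) for r in (2, 3, 4))
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2

print(f"mean={mean:.2f}, skewness={skewness:.3f}, kurtosis={kurtosis:.3f}")
```

Under these assumptions the distribution is positively skewed and platykurtic (kurtosis below 3).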

Coefficient of Correlation
66
 DEFINITION OF CORRELATION
 “If two or more quantities vary in sympathy so that
movements in one tend to be accompanied by
corresponding movements in other(s) then they are said
to be correlated.” Or
 “Correlation is an analysis of co-variation between two or
more variables.”
 A coefficient of correlation is generally applied in statistics
to measure the strength of the relationship between two variables.
Types of Correlation
The following are different types of correlation:
 Positive and Negative Correlation
 Simple, Partial and Multiple Correlation
 Linear and Non-linear Correlation

Types of Coefficient of Correlation
67
 Positive correlation: the correlation between two variables
is said to be positive or direct if an increase (or a decrease)
in one variable corresponds to an increase (or a decrease)
in the other.
 Negative Correlation: the correlation between two
variables is said to be negative or inverse if an increase (or
a decrease) corresponds to a decrease (or an increase) in
the other.
 Simple Correlation: It involves the study of only two
variables. For example, when we study the correlation
between the price and demand of a product, it is a
problem of simple correlation.

68
 Partial Correlation: It involves the study of three or more
variables, but considers only two variables to be
influencing each other. For example, if we consider three
variables, namely yield of wheat, amount of rainfall and
amount of fertilizers and limit our correlation analysis to
yield and rainfall, with the effect of fertilizers removed, it
becomes a problem relating to partial correlation only.
 Multiple Correlation: It involves the study of three or more
variables simultaneously. For example, if we study the
relationship between the yield of wheat per acre and both
amount of rainfall and the amount of fertilizers used, it
becomes a problem relating to multiple correlation.

69
 Linear Correlation: The correlation between two
variables is said to be linear if the amount of change in
one variable tends to bear a constant ratio to the
amount of change in other variable.
 Non-linear (or Curvilinear): The correlation between two
variables is said to be non-linear or curvilinear if the
amount of change in one variable does not bear a
constant ratio to the amount of change in other
variable.

Methods of Studying Correlation
70
 Scatter Diagram Method
 Karl Pearson’s Coefficient of Correlation, and
 Spearman's Rank Correlation Method
 Scatter diagram method
 A scatter diagram of the data gives a
visual idea of the nature of the association between two
variables. If the points cluster along a straight line, the
association between the two variables is linear.
 Further, if the points cluster along a curve, the
corresponding association is non-linear or curvilinear.
 Finally, if the points neither cluster along a straight line
nor along a curve, there is absence of any association
between the variables.

Karl Pearson’s Coefficient Correlation
72
 Karl Pearson’s coefficient of correlation is an extensively used
mathematical method in which a numerical value is
used to measure the degree of relation between linearly related
variables. The coefficient of correlation is denoted by “r”.
Actual mean method, which is expressed as:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}}$$
Pearson correlation example
 When a correlation coefficient is (1), that means for every increase in one
variable, there is an increase of a fixed proportion in the other. For example,
shoe sizes change according to the length of the feet and are (almost) perfectly
correlated.
 When a correlation coefficient is (-1), that means for every increase in
one variable, there is a decrease of a fixed proportion in the other. For
example, the decrease in the quantity of gas in a gas tank shows an
(almost) perfect inverse correlation with speed.
 When a correlation coefficient is (0), an increase in one variable is
accompanied by neither an increase nor a decrease in the other, and the two
variables are not linearly related.

73
Correlation coefficient formulas are used to find how strong a
relationship is between data. The formulas return a value
between -1 and 1, where:
 1 indicates a perfect positive linear relationship.
 -1 indicates a perfect negative linear relationship.
 A result of zero indicates no linear relationship at all.

74
Example: Find the value of the correlation coefficient from the following table:
Solution
Step 1: Make a chart. Use the given data, and add three more columns:
xy, x², and y².
Step 2: Multiply x and y together to fill the xy column.
Step 3: Take the square of the numbers in the x column, and put the result in the x²
column.
Step 4: Take the square of the numbers in the y column, and put the result in the y²
column.
Step 5: Add up all of the numbers in each column and put the result at the bottom of the
column. The Greek letter sigma (Σ) is a short way of saying “sum of” or summation.
Step 6: Use the following correlation coefficient formula, where n is the number of (x, y) pairs:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
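The six steps can be sketched as below, using the standard computational (sums-based) form of Pearson's r. The data table for this example was not preserved, so the x and y values are hypothetical and serve only to illustrate the procedure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation via column sums (steps 1 to 6)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # step 2: xy column
    sum_x2 = sum(x * x for x in xs)               # step 3: x squared column
    sum_y2 = sum(y * y for y in ys)               # step 4: y squared column
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator                # step 6

# Hypothetical data, assumed for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.4f}")
```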

75
Classwork: Find the value of the correlation coefficient from the following table:

Spearman Rank Correlation Method
76
 A rank correlation coefficient measures the degree of
similarity between two rankings, and can be used to
assess the significance of the relation between them.
 Also called rank-order.
 Used when one or both variables are rank or ordinal
scales.
 Difference (D) between ranks of two sets of scores is
used to determine correlation coefficient.
Examples - golf driving distance and order of finish in golf
tournament; height and IQ score; weight and order of
finish in 400 meter race; number of calories consumed
and weight lost

77
To determine the rank correlation coefficient:
1. List each set of scores in a column.
2. Rank the two sets of scores.
3. Place the appropriate rank beside each score.
4. Head a column D and determine the difference in rank for each pair of
scores. (The sum of the D column should always be 0.)
5. Square each number in the D column and sum the
values (∑D²).
6. Calculate the correlation coefficient by substituting the
values into the formula
$$R = 1 - \frac{6\sum D^2}{n^3 - n}$$
where n = number of observations (pairs of ranks).

78
As an example,
Food   R1   R2   D = R1 − R2   D²
A      2    1    1             1
B      1    3    -2            4
C      4    2    2             4
D      3    4    -1            1
E      5    5    0             0
F      7    6    1             1
G      6    7    -1            1
Total            0             12
$$R = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6 \times 12}{7^3 - 7} = 1 - \frac{72}{336} = 1 - 0.2143 = 0.786$$
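The worked example can be checked with a minimal sketch that applies the same formula to the two rank columns. The function name `spearman_from_ranks` is an assumption, and the formula as used here presumes there are no tied ranks.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman rank correlation from two lists of ranks (no ties assumed)."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

r1 = [2, 1, 4, 3, 5, 7, 6]   # ranks R1 for foods A to G
r2 = [1, 3, 2, 4, 5, 6, 7]   # ranks R2 for foods A to G

rank_differences = [a - b for a, b in zip(r1, r2)]
spearman_r = spearman_from_ranks(r1, r2)

print(sum(rank_differences), round(spearman_r, 3))
```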

chi-square test
79
 A chi-square test is a statistical test used to compare
observed results with expected results.
 The purpose of this test is to determine if a difference
between observed data and expected data is due to
chance, or if it is due to a relationship between the
variables you are studying.
 A chi-square (χ2) statistic is a test that measures how
a model compares to actual observed data.
 The chi-square statistic compares the size of any
discrepancies between the expected results and the
actual results, given the size of the sample and the
number of variables in the relationship.

chi-square test (cont’d.)
80
 The formula for the chi-square statistic used in the
chi-square test is:
$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
The subscript “c” is the degrees of freedom, “O” is your
observed value and “E” is your expected value. It is rare
that you will want to compute this statistic by hand.
The summation symbol means that you have to perform a
calculation for every single data item in your data set. As
you can probably imagine, the calculations can get very
lengthy and tedious. Instead, you will probably want to
use technology.

81
EXAMPLE
Employers want to know which days of the week employees are
absent in a five-day work week. Most employers would like to believe
that employees are absent equally during the week. Suppose a
random sample of 60 managers were asked on which day of the week
they had the highest number of employee absences. The results
were distributed as follows (use a 5% level of significance):
                    Monday   Tuesday   Wednesday   Thursday   Friday
Observed absences   15       12        9           9          15
Expected absences   12       12        12          12         12
Calculate the χ² test statistic. Make a chart with the following
column headings and fill in the cells: O, E, O − E, (O − E)², (O − E)²/E.

82
SOLUTION
The null and alternate hypotheses are:
 H0: The absent days occur with equal frequencies.
 Ha: The absent days occur with unequal frequencies.
 The degrees of freedom are one fewer than the number of cells:
df = n − 1 = 5 − 1 = 4.
Now add (sum) the values of the last column. Verify that this sum is 3.
This is the χ² test statistic. Because 3 is smaller than the critical value of
9.488 (df = 4, 5% level of significance), the decision is to not reject the null hypothesis.
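The calculation can be sketched without any external library. The critical value 9.488 is the standard chi-square table value for 4 degrees of freedom at the 5% level; it is hard-coded here as an assumption rather than computed.

```python
observed = [15, 12, 9, 9, 15]     # Monday to Friday
expected = [12, 12, 12, 12, 12]

# Last column of the chart: (O - E)^2 / E for each day.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

degrees_of_freedom = len(observed) - 1
critical_value = 9.488            # table value for df = 4, alpha = 0.05
reject_null = chi_square > critical_value

print(chi_square, degrees_of_freedom, reject_null)
```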

83
EXAMPLE
