ARBA MINCH TECHNOLOGY INSTITUTE
DEPARTMENT OF MECHANICAL ENGINEERING
Chapter Four: Processing and Data Analysis
Instructor: Solomon N.(Ph.D.)
Academic Year: 2022/23
1
Processing and Data Analysis
2
 After data collection, the raw data must be converted
into meaningful statements. This involves data
processing, data analysis, and data interpretation and
presentation.
 Acquiring data: Acquisition involves collecting or
adding to the data holdings. There are several methods
of acquiring data:
 collecting new data
 using your own previously collected data
 reusing someone else's data
 purchasing data
 acquiring data from the Internet (texts, social media, photos)
Processing and Data Analysis
3
 Data processing: A series of actions or steps performed
on data to verify, organize, transform, integrate, and
extract data in an appropriate output form for
subsequent use.
 Methods of processing must be rigorously
documented to ensure the utility and integrity of the
data.
 Data Analysis involves actions and methods performed
on data that help describe facts, detect patterns,
develop explanations and test hypotheses. This
includes data quality assurance, statistical data analysis,
modeling, and interpretation of results.
Data Preservation and Re-use
4
 Data preservation involves actions and procedures to keep
data for future use, and includes data archiving and/or data
submission to a data repository. Data preservation needs data
description, documentation and metadata.
 The goal of all these actions is to make data findable,
comprehensible and easy to use. It also involves long-term
preservation and curation of data.
 Documentation provides an overview of the research context
and design, data collection methods, data preparation and
results or findings and is key to enabling the secondary user
to make informed use of the data.
 Metadata provide standardized, structured information
explaining the purpose, origin, time references, geographic
location, creator, access conditions and terms of use of data.
Data Processing
5
 Data processing is concerned with editing, coding,
classifying, tabulating and charting and diagramming
research data. The essence of data processing in research
is data reduction.
 Data reduction involves winnowing out the irrelevant
from the relevant data, establishing order from disorder,
and giving shape to a mass of data.
 DOI: a digital object identifier is a unique and persistent
identifier that makes data sets easy to find and cite.
Example: doi.org/10.1016/j.ecolind.2015.04.011
 Data re-use means data mining, replication research,
comparative studies, longitudinal research etc. E.g. data
collected for one research objective can be used in a new
study dealing with some other similar problem.
Data Processing
6
 Six stages of data processing
1. Data collection: Collecting data is the first step in data
processing. Data is pulled from available sources,
including data lakes and data warehouses.
2. Data preparation: Once the data is collected, it then
enters the data preparation stage. Data preparation,
often referred to as “pre-processing” is the stage at
which raw data is cleaned up and organized for the
following stage of data processing.
 Data collection
 Data preparation
 Data input
 Data processing
 Data output/interpretation
 Data storage and report writing
Data Processing
7
3. Data input: The clean data is then entered into its
destination system and translated into a language that the
system can understand. Data input is the first stage in which
raw data begins to take the form of usable information.
4. Processing: During this stage, the data entered into
the computer in the previous stage is actually
processed for interpretation. Processing is often done
using machine learning algorithms, though the process
itself may vary depending on the source of the data
being processed.
Data Processing
8
5. Data output/interpretation: The output/interpretation
stage is the stage at which data finally becomes usable to
non-data scientists. It is translated, readable, and often
in the form of graphs, videos, images, plain text, etc.
6. Data storage and report writing: The final stage of
data processing is storage. After all of the data is
processed, it is stored for future use. While some
information may be put to use immediately, much of it
will serve a purpose later on.
STATISTICS IN RESEARCH
9
 The role of statistics in research is to function as a tool in
designing research, analyzing its data and drawing conclusions.
Most research studies result in a large volume of raw data which
must be suitably reduced so that the same can be read easily and
can be used for further analysis. There are two major areas of
statistics: descriptive statistics and inferential statistics.
Cont.
10
 Descriptive statistics concern the development of certain
indices from the raw data.
 Inferential statistics are concerned with the process of
generalization. Inferential statistics are also known as
sampling statistics and are mainly concerned with two
major types of problems:
 the estimation of population parameters, and
 the testing of statistical hypotheses.
“Descriptive” describes data, while “inferential” infers or allows
the researcher to arrive at a conclusion based on the collected
information.
Cont.
12
For example, suppose you are tasked with researching teenage
pregnancy in a certain high school. Using both descriptive
and inferential statistics, you would study the number
of teenage pregnancy cases in the school over a specific
number of years. The difference is that with descriptive
statistics, you are merely summarizing the collected data
and, if possible, detecting a pattern in the changes.
For example, it can be said that for the past five years, the
majority of teenage pregnancies in X High School happened
to those enrolled in the third year. There’s no need to
predict that in the sixth year the third-year students would
still be the ones with the greater number of teenage
pregnancies. Generalizations and predictions beyond the
collected data are made only in inferential statistics.
STATISTICS IN RESEARCH
13
The important statistical measures that are used to
summarize the survey/research data are:
 measures of central tendency or statistical
averages;
 measures of dispersion;
 measures of asymmetry (skewness);
 measures of relationship; and
 other measures.
STATISTICS IN RESEARCH
14
Measures of Central Tendency
 Amongst the measures of central tendency, the three
most important ones are the arithmetic average or
mean, median and mode.
 A measure of central tendency is a single value that
attempts to describe a set of data by identifying the
central position within that set of data. As such,
measures of central tendency are sometimes called
measures of central location.
 The mean, median and mode are all valid measures of
central tendency, but under different conditions, some
measures of central tendency become more appropriate
to use than others.
Cont.
15
Mean (Arithmetic)
 The mean (or average) is the most popular and well
known measure of central tendency.
 It can be used with both discrete and continuous
data, although its use is most often with continuous
data.
 The mean is equal to the sum of all the values in
the data set divided by the number of values in the
data set.
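Written as a formula, the verbal definition above becomes the following (the symbols are the usual textbook ones and are an assumption here, since the slide's own formula is not reproduced in the transcript: x_i are the values, n is the number of values):

```latex
\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i
```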
Cont.
16
For example, consider the wages of ten staff at a factory
(the salary table is not reproduced here).
 The mean salary for these ten staff is $30.7k. However,
inspecting the raw data suggests that this mean value
might not be the best way to reflect the typical salary
of a worker, as most workers have salaries in the $12k
to $18k range. The mean is being skewed by the two
large salaries. Therefore, in this situation, we would
like to have a better measure of central tendency.
Cont.
17
Median
The median is the middle score for a set of data that has
been arranged in order of magnitude. The median is less
affected by outliers and skewed data than the mean. To
calculate the median, first arrange the scores in order of
magnitude. For a set of 11 marks, the median is the middle
mark, in this case 56: there are 5 scores before it and 5
scores after it. This works fine when you have an odd
number of scores, but what happens when you have an even
number of scores? What if you had only 10 scores? Then you
simply take the middle two scores and average them.
Cont.
18
Mode
The mode is the most frequent score in our data set. On a bar chart or
histogram it is represented by the highest bar. You can, therefore,
sometimes consider the mode as being the most popular option.
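The three measures can be compared side by side with a short sketch. The salary values below are illustrative assumptions, chosen only to be consistent with the $30.7k mean quoted earlier; the slide's own table is not available in the transcript.

```python
import statistics

# Assumed salaries in $k (illustrative; not taken from the slide's table)
salaries = [12, 14, 15, 15, 15, 16, 17, 18, 90, 95]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # n is even: average of the two middle values
mode_salary = statistics.mode(salaries)      # most frequent value

print("mean:", mean_salary)
print("median:", median_salary)
print("mode:", mode_salary)
```

With these values the two large salaries pull the mean well above the median and mode, which is the effect the salary example describes.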
Measure of variation
19
 Example
Consider the following two sets of scores:
Set 1: 40, 50, 60, 60, 40, 50
Set 2: 0, 100, 25, 75, 80, 20
 Both these sets have the same mean (50),
 But the second set is a lot more widely dispersed ("scattered") than
the first.
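A minimal sketch that checks this claim for the two sets above. Using the range and the population standard deviation as the measures of spread is a choice made here for illustration.

```python
import statistics

set_1 = [40, 50, 60, 60, 40, 50]
set_2 = [0, 100, 25, 75, 80, 20]

def summarize(scores):
    """Return (mean, range, population standard deviation) of the scores."""
    return (
        statistics.mean(scores),
        max(scores) - min(scores),
        statistics.pstdev(scores),
    )

mean_1, range_1, sd_1 = summarize(set_1)
mean_2, range_2, sd_2 = summarize(set_2)

print("Set 1:", mean_1, range_1, round(sd_1, 2))
print("Set 2:", mean_2, range_2, round(sd_2, 2))
```

Both sets share the same mean, but the range and standard deviation of Set 2 are several times larger.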
Measure of variation/dispersion
20
 The scatter or spread of items of a distribution is
known as dispersion or variation.
 In other words the degree to which numerical data
tend to spread about an average value is called
dispersion or variation of the data.
 Measures of dispersion are statistical measures which
provide ways of measuring the extent to which data
are dispersed or spread out.
Objective of Measuring Variation
21
 To determine the reliability of an average by indicating
how far the average is representative of the entire
data.
 To determine the nature and cause of variation in
order to control the variation itself.
 To enable comparison of two or more distributions with
regard to their variability.
 Measuring variability is of great importance to other
statistical analysis. E.g., it is the basis of statistical
quality control
A good measure of variation
22
 It should be easy to compute and understand.
 It should be based on all observations.
 It should be uniquely defined.
 It should be capable of further statistical treatment.
 It should be affected as little as possible by extreme values.
Types of measure of variation
23
Absolute measures: measures of dispersion that are expressed in
the original units of the data are termed absolute measures:
 Range
 Quartile deviation
 Mean deviation
 Variance
 Standard deviation
Relative measures: also known as coefficients of dispersion, these
are obtained as ratios or percentages.
 Relative range
 Coefficient of quartile deviation
 Coefficient of mean deviation
 Coefficient of variation
 Standard scores
The range
24
Several measures of dispersion are available. We will
discuss the common ones below.
The Range:
 The difference between the largest (maximum) and
smallest (minimum) values.
Range = Maximum – Minimum (1)
For frequency distributed data, the range is:
 The difference between the upper class boundary of
the last class and the lower class boundary of the first
class.
Measure of Dispersion
25
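 Find the range of 54.5, 55.0, 55.7, 51.8, 54.2, 52.4 (in cm).
Solution:
 Range (R) = 55.7 − 51.8 = 3.9 cm
For a frequency distribution whose first class has a lower class
boundary of 52.5 and whose last class has an upper class boundary of 118.5:
Solution: Range = UCB(last) − LCB(first) = 118.5 − 52.5 = 66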
Measure of Dispersion
26
Quartile deviation (QD):
The range expresses the extreme variability of the
observations of a variable. The quartile deviation is half
of the difference between the upper and lower quartiles,
i.e. half of the interquartile range: $QD = (Q_3 - Q_1)/2$.
 It gives the average amount by which the two quartiles differ
from the median.
 Coefficient of quartile deviation (CQD): $CQD = (Q_3 - Q_1)/(Q_3 + Q_1)$
Measure of Dispersion
27
 Mean Deviation(M.D):
 The average deviation measures the scatter of the
individual observations around a central value usually
the mean or the median of a distribution.
 The mean deviation is defined as the arithmetic mean
of the absolute deviations of each observation from either
the mean or the median of a distribution.
 If the deviations are taken from the mean then it is
called mean deviation about the mean.
 On the other hand, if the deviations are taken from
the median we call it mean deviation about the
median.
Mean deviation
28
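 The mean deviation (M.D) is the arithmetic mean of the absolute
deviations of the values from the mean.
 It is the “average absolute deviation of the values from the mean”.
For ungrouped sample data: $M.D = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
 Note that when dealing with population values, the formula is adjusted
accordingly (the population mean and size replace $\bar{x}$ and $n$).
 Mean deviation for grouped data (discrete or continuous):
$M.D = \frac{1}{n}\sum_{i=1}^{m} f_i |x_i - \bar{x}|$
 where m = number of classes, x_i = class mark of the ith class,
f_i = frequency of the ith class, and n = number of observations.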
Mean deviation
29
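 Mean deviation about the median (M.D), where $\tilde{x}$ is the median:
ungrouped data: $M.D = \frac{1}{n}\sum_{i=1}^{n} |x_i - \tilde{x}|$
grouped frequency distribution: $M.D = \frac{1}{n}\sum_{i=1}^{m} f_i |x_i - \tilde{x}|$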
Example
30
 The weights of a sample of six students from a class (in kilograms) are
measured as 53, 56, 57, 59, 63 and 66. Find the mean deviation about
the mean and the mean deviation about the median.
 Solution: First find the mean and the median. The mean is 59 kg and
the median is 58 kg. Then take the absolute deviations of each
observation from these averages. In both cases the absolute deviations
sum to 22, so each mean deviation is 22/6 ≈ 3.67 kg.
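The calculation for the six weights can be sketched as follows. The helper name `mean_abs_deviation` is introduced here for illustration only.

```python
import statistics

weights = [53, 56, 57, 59, 63, 66]  # kg, from the example above

def mean_abs_deviation(values, center):
    """Arithmetic mean of the absolute deviations of `values` from `center`."""
    return sum(abs(x - center) for x in values) / len(values)

mean_w = statistics.mean(weights)      # mean of the six weights
median_w = statistics.median(weights)  # average of the two middle weights

md_about_mean = mean_abs_deviation(weights, mean_w)
md_about_median = mean_abs_deviation(weights, median_w)

print("mean:", mean_w, "median:", median_w)
print("M.D about mean:", round(md_about_mean, 2))
print("M.D about median:", round(md_about_median, 2))
```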
Example cont.
31
Example: Calculate the mean deviation from the mean and median for the
following data.
Solution
32
Class mark of each class = (lower class limit + upper class limit) / 2,
e.g. (1+5)/2 = 3
Mean = 100/10 = 10
MD from the mean = 60/10 = 6
Class   xi   fi   fi·xi   |xi − x̄|        fi·|xi − x̄|
1-5     3    4    12      |3 − 10| = 7    4 × 7 = 28
6-10    8    1    8       2               1 × 2 = 2
11-15   13   2    26      3               2 × 3 = 6
16-20   18   3    54      8               3 × 8 = 24
Total        10   100                     60
Solution
33
Median: the ten class marks in order are 3, 3, 3, 3, 8, 13, 13, 18, 18, 18,
so median = (8+13)/2 = 10.5
MD from the median = 60/10 = 6
Class   xi   fi   fi·xi   |xi − median|        fi·|xi − median|
1-5     3    4    12      |3 − 10.5| = 7.5     4 × 7.5 = 30
6-10    8    1    8       2.5                  1 × 2.5 = 2.5
11-15   13   2    26      2.5                  2 × 2.5 = 5
16-20   18   3    54      7.5                  3 × 7.5 = 22.5
Total        10   100                          60
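The grouped-data calculation can be checked with a short sketch. It follows the slide's approach of taking the median from the class marks repeated by frequency. This is a simplification: an interpolated grouped median would generally give a different value.

```python
import statistics

class_marks = [3, 8, 13, 18]   # midpoints of classes 1-5, 6-10, 11-15, 16-20
frequencies = [4, 1, 2, 3]
n = sum(frequencies)

# Grouped mean: sum of f_i * x_i divided by n
mean = sum(f * x for x, f in zip(class_marks, frequencies)) / n

# Median taken from the class marks expanded by their frequencies
expanded = [x for x, f in zip(class_marks, frequencies) for _ in range(f)]
median = statistics.median(expanded)

def grouped_mean_deviation(center):
    """Sum of f_i * |x_i - center| divided by n."""
    return sum(f * abs(x - center) for x, f in zip(class_marks, frequencies)) / n

md_mean = grouped_mean_deviation(mean)
md_median = grouped_mean_deviation(median)
print(mean, median, md_mean, md_median)
```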
Coefficients of Mean Deviation(C.M.D)
34
Example: Find the coefficient of mean deviation about the mean and
mean deviation about the median for the weights of six students in
example above.
Solution: Coefficient of mean deviation about the mean
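The worked solution is not reproduced in the transcript. As a sketch, assuming the usual definition (the mean deviation divided by the average it is measured about) and the values from the six-student example (mean 59 kg, median 58 kg, and a mean deviation of 22/6 ≈ 3.67 kg about either average):

```latex
\text{C.M.D}_{\bar{x}} = \frac{\text{M.D}_{\bar{x}}}{\bar{x}} = \frac{3.67}{59} \approx 0.062,
\qquad
\text{C.M.D}_{\tilde{x}} = \frac{\text{M.D}_{\tilde{x}}}{\tilde{x}} = \frac{3.67}{58} \approx 0.063
```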
Variance and Standard Deviation
35
 The variance and standard deviation are the most widely used
measures of dispersion.
 Both measure the average dispersion of the observations
around the mean.
 The variance is defined as the average of the squared deviations
from the mean, so it takes into account the spread of all data
points in a data set.
 The standard deviation is simply the square root of the variance,
which returns the measure to the original units of the data.
 An item selected at random from a data set whose standard
deviation is low has a better chance of being close to the mean
than an item from a data set whose standard deviation is higher.
Variance and standard deviation formula
36
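The formula slides are images and are not reproduced in the transcript. The standard definitions, consistent with the description of variance as the average squared deviation from the mean, are given below. The symbols are the conventional ones and are an assumption here: N and μ for a population, n and x̄ for a sample.

```latex
\text{Population:}\quad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}

\text{Sample:}\quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad s = \sqrt{s^2}
```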
Variance and standard deviation formula
37
Cont.
38
Example
Calculate the population variance from the following 5
observations: 50, 55, 45, 60, 40.
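A minimal sketch of this calculation, dividing by N because the five observations are treated as the whole population:

```python
import math

observations = [50, 55, 45, 60, 40]
n = len(observations)

mu = sum(observations) / n                            # population mean
squared_devs = [(x - mu) ** 2 for x in observations]  # squared deviations from the mean
pop_variance = sum(squared_devs) / n                  # divide by N for a population
pop_sd = math.sqrt(pop_variance)

print("mean:", mu)
print("population variance:", pop_variance)
print("population SD:", round(pop_sd, 2))
```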
Cont.
39
Example: 24, 25, 29,29,30,31
Find variance and standard deviation ?
Solution:
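The slide's solution is not reproduced, and the question does not say whether the six values are a sample or a population. The sketch below therefore computes both, using Python's standard `statistics` module.

```python
import statistics

data = [24, 25, 29, 29, 30, 31]

mean = statistics.mean(data)

# Treating the data as a sample (divide by n - 1)
sample_var = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Treating the data as a whole population (divide by n)
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

print("mean:", mean)
print("sample variance:", sample_var, "sample SD:", round(sample_sd, 3))
print("population variance:", round(pop_var, 3), "population SD:", round(pop_sd, 3))
```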
Variance and standard deviation formula
40
Quiz-1
Find the variance and standard deviation of the following sample data
i. 5, 17, 12, 10,8
ii .The data is given in the form of frequency distribution.
Coefficient of Variation
41
The coefficient of variation (CV) is the ratio of the
standard deviation to the mean. The higher the coefficient
of variation, the greater the level of dispersion around the
mean. It is generally expressed as a percentage. Because it
has no units, it allows for comparison between distributions
of values whose scales of measurement are not comparable.
When we are presented with estimated values, the CV
relates the standard deviation of the estimate to the value
of this estimate. The lower the value of the coefficient of
variation, the more precise the estimate.
Coefficient of Variation formula
42
 $CV = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$
 In situations where either two series have different units of
measurement, or their means differ sufficiently in size, the CV
should be used as a measure of dispersion.
 In spite of the fact that the C.V. is broadly applied, its
disadvantage is that it’s not useful when the mean is negative or
zero or very close to zero.
 Interpretation of the coefficient of variation: the distribution
having less CV is said to be less variable or more consistent
Why We Need the Coefficient of Variation
43
So, standard deviation is the most common measure of
variability for a single data set. But why do we need yet
another measure such as the coefficient of variation?
Comparing the standard deviations of two data sets can be
misleading when they are measured in different units or have
very different means; comparing coefficients of variation
avoids this problem.
Example question: Two versions of a test are given to
students. One test has pre-set answers and a second test
has randomized answers. Find the coefficient of variation
of each.
        Regular Test   Randomized Answers
Mean    50.1           45.8
SD      11.2           12.9
Cont.
44
Solution
Step 1: Divide the standard deviation by the mean for the
first sample:
11.2 / 50.1 = 0.22355
Step 2: Multiply Step 1 by 100: 0.22355 * 100 =22.355%
Step 3: Divide the standard deviation by the mean for the
second sample: 12.9 / 45.8 = 0.28166
Step 4: Multiply Step 3 by 100: 0.28166 * 100 = 28.166%
That’s it! Now you can compare the two results directly.
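The four steps above can be sketched as one small function. The name `coefficient_of_variation` is introduced here for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

cv_regular = coefficient_of_variation(11.2, 50.1)     # regular test
cv_randomized = coefficient_of_variation(12.9, 45.8)  # randomized answers

print(f"Regular test CV:       {cv_regular:.3f}%")
print(f"Randomized answers CV: {cv_randomized:.3f}%")
```

The randomized-answer version has the larger CV, so its scores are more variable relative to their mean.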
Cont.
45
Quiz-2: Find the population coefficient of variation of 24,
26, 33, 37, 29, 31.
solution
Cont.
47
Example: Suppose that the mean weight of a group
of students is 165 pounds with a S.D of 8 pounds. If
the height of the same group of students has a
mean of 60 inches with a S.D of 3 inches, compare
the variability in weight and height measurements.
Solution:
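The slide's solution is not reproduced in the transcript. A sketch of the comparison, applying the CV definition to the figures given in the example:

```latex
CV_{\text{weight}} = \frac{8}{165} \times 100\% \approx 4.85\%,
\qquad
CV_{\text{height}} = \frac{3}{60} \times 100\% = 5\%
```

Since the two coefficients are unit-free, they can be compared directly: the height measurements are slightly more variable relative to their mean than the weight measurements.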
Standard Scores (Z-Scores)
48
 A Z-score is a statistical measurement of a score's
relationship to the mean in a group of scores.
 Z-scores are not measures of relative dispersion as such,
but one of the applications of the standard deviation.
 We define the standard score as: $z = \frac{x - \bar{x}}{s}$
(or $z = \frac{x - \mu}{\sigma}$ for a population).
 It tells us how many standard deviations a value lies
above (if positive) or below (if negative) the mean.
 The standard score gives the deviation from the mean in
units of standard deviation. It is used to compare two
observations coming from different groups.
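A minimal sketch of the standard score and its inverse. The numbers used (a score of 86 in a group with mean 70 and standard deviation 8) are hypothetical and chosen only for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Inverse of z_score: the raw value that corresponds to a given z."""
    return mean + z * sd

# Hypothetical values, not taken from the slides
z = z_score(86, mean=70, sd=8)
x = value_from_z(z, mean=70, sd=8)
print("z =", z, "x =", x)
```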
Standard Scores (Z-Scores)
49
 Question: Two third-year medical laboratory sections were given an
introduction to biostatistics examination. The mean and standard deviation
of each section's scores were given (the table is not reproduced here).
 Student A from section 1 scored 90 and student B from section 2
scored 95.
 Relatively speaking, who performed better?
Solution: Student A performed better relative to his section, because the score
of student A is two standard deviations above the mean score of his section, while
the score of student B is only one standard deviation above the mean score of his section.
Standard Scores (Z-Scores)
50
Quiz 3: Given that the mean and standard deviation are 50 and 10, what
value of x has a z-score of 1.4? What is the z-score that corresponds to
x = 30?
Moments
51
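 The rth moment about the mean (the rth central moment) is defined as:
$M_r = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^r$
 For continuous grouped data it is given by:
$M_r = \frac{1}{n}\sum_{i=1}^{m} f_i (x_i - \bar{x})^r$, where x_i is the class mark
and f_i the frequency of the ith class, and m is the number of classes.
Example: Find the first three central moments of the numbers 2, 3 and 7.
Solution: First find the mean: $\bar{x} = (2 + 3 + 7)/3 = 4$.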
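The rest of the solution is not reproduced in the transcript. A minimal sketch of the calculation, assuming the usual definition of the rth central moment as the average of the rth powers of the deviations from the mean:

```python
def central_moment(values, r):
    """rth moment about the mean: average of (x - mean) ** r."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

numbers = [2, 3, 7]
m1 = central_moment(numbers, 1)  # always 0 for the first central moment
m2 = central_moment(numbers, 2)  # equals the population variance
m3 = central_moment(numbers, 3)  # positive here, reflecting the longer right tail

print(m1, round(m2, 3), m3)
```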
52
Normal Distribution, Skewness and Kurtosis
 A normal distribution is the proper term for a probability
bell curve.
 A normal distribution is defined by its mean and standard
deviation; in the standard normal distribution the mean is
zero and the standard deviation is 1. Every normal
distribution has zero skew and a kurtosis of 3. Normal
distributions are symmetrical, but not all symmetrical
distributions are normal.
What are the 4 characteristics of a normal distribution?
 Normal distributions are symmetric, unimodal, and
asymptotic, and the mean, median, and mode are all
equal.
 A normal distribution is perfectly symmetrical around its
center. That is, the right side of the center is a mirror
image of the left side
53
Skewness
 Skewness is the degree of asymmetry or departure from
symmetry of a distribution.
 A skewed frequency distribution is one that is not
symmetrical.
 Skewness is concerned with the shape of the curve not size
 If the frequency curve (smoothed frequency polygon) of a
distribution has a longer tail to the right of the central
maximum than to the left, the distribution is said to be
skewed to the right or said to have positive skewness. If it
has a longer tail to the left of the central maximum than to
the right, it is said to be skewed to the left or said to have
negative skewness.
 For moderately skewed distributions, the following empirical
relation holds among the three commonly used measures of
central tendency: Mean − Mode ≈ 3 (Mean − Median).
Skewness
54
 A unimodal distribution is a distribution with one clear peak or most
frequent value.
 “Asymptotic” means that the tails of the normal curve approach the
horizontal axis ever more closely but never touch it.
Skewness
55
 Skewness is a measure of symmetry, or more precisely, the lack of
symmetry. A distribution, or data set, is symmetric if it looks the
same to the left and right of the center point.
 In respect of the measures of skewness and kurtosis, we mostly use
the first measure of skewness based on mean and mode or on
mean and median.
 Positive Skewed
Mode < Median < Mean
 Negative Skewed
Mean < Median < Mode
 Zero Skewed
Mean = Median = Mode
Skewness
56
Example. Suppose the mean, the mode, and the standard deviation of
a certain distribution are 32, 30.5 and 10 respectively. What is the
shape of the curve representing the distribution?
Karl Pearson’s coefficient of skewness (SK):
$SK = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$
If SK = 0, then the distribution is symmetrical.
If SK > 0, then the distribution is positively skewed.
If SK < 0, then the distribution is negatively skewed.
 Solution: SK = (32 − 30.5)/10 = 0.15 > 0, so the distribution is
positively skewed.
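The decision rule above can be sketched as follows, assuming the mode-based form of Pearson's coefficient, (mean − mode) / standard deviation. The function names are introduced here for illustration.

```python
def pearson_skewness(mean, mode, sd):
    """Pearson's first coefficient of skewness: (mean - mode) / sd."""
    return (mean - mode) / sd

def describe_skew(sk):
    """Translate the sign of the coefficient into the slide's wording."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"

sk = pearson_skewness(mean=32, mode=30.5, sd=10)
shape = describe_skew(sk)
print(sk, shape)
```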
Kurtosis
57
 Kurtosis is a measure of whether the data are heavy-
tailed or light-tailed relative to a normal distribution.
 A normal distribution has a kurtosis of 3 and is
described as mesokurtic.
 An increased kurtosis (> 3) indicates heavier tails and is
often visualized as a thin “bell” with a high peak, whereas
a decreased kurtosis (< 3) indicates lighter tails and is
often visualized as a broader, flatter peak.
 In finance, kurtosis is used as a measure of financial risk.
Kurtosis
58
 A large kurtosis is associated with a high level of risk for an
investment because it indicates that there are high
probabilities of extremely large and extremely small returns.
 On the other hand, a small kurtosis signals a moderate level
of risk because the probabilities of extreme returns are
relatively low.
Kurtosis
59
 Kurtosis is the degree of peakedness of a distribution, usually taken
relative to a normal distribution.
When the curve of a distribution is relatively:
 flatter than normal, it is known as platykurtic, and
 more peaked than normal, it is called leptokurtic.
 The normal distribution, which is neither very peaked nor flat-
topped, is called mesokurtic.
The moment coefficient of kurtosis (β2): $\beta_2 = M_4 / M_2^2$, where
$M_2$ and $M_4$ are the second and fourth central moments.
If β2 = 3, then the distribution is mesokurtic.
If β2 > 3, then the distribution is leptokurtic.
If β2 < 3, then the distribution is platykurtic.
Acceptable Standard Deviation (SD)
60
 A smaller SD represents data where the results are
very close in value to the mean. The larger the SD, the
more the results vary.
 Data points in a normal distribution are more likely to
fall closer to the mean. In fact, about 68% of all data
points will be within ±1SD from the mean, about 95%
within ±2SD, and about 99.7% within ±3SD.
 Statisticians have determined that values no greater
than plus or minus 2SD from the mean represent
measurements that are closer to the true value than
those that fall in the area beyond ±2SD.
Acceptable Standard Deviation (SD)
62
A cholesterol control is run 20 times over 25 days yielding the following
results in mg/dL: 192, 188, 190, 190, 189, 191, 188, 193, 188, 190, 191, 194,
194, 188, 192, 190, 189,189, 191, 192.
• Using the cholesterol control results, establish quality control (QC)
ranges from the mean and standard deviation.
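The slide's own steps are not reproduced in the transcript. The sketch below assumes the common approach of computing the mean and the sample standard deviation of the control results and reporting the ±1SD, ±2SD and ±3SD ranges.

```python
import statistics

# Cholesterol control results in mg/dL, from the example above
results = [192, 188, 190, 190, 189, 191, 188, 193, 188, 190,
           191, 194, 194, 188, 192, 190, 189, 189, 191, 192]

mean = statistics.mean(results)
sd = statistics.stdev(results)  # sample SD (n - 1 in the denominator), an assumption

# QC ranges as (lower, upper) limits for 1, 2 and 3 standard deviations
qc_ranges = {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

print(f"mean = {mean:.2f} mg/dL, SD = {sd:.2f} mg/dL")
for k, (low, high) in qc_ranges.items():
    print(f"±{k}SD range: {low:.1f} to {high:.1f} mg/dL")
```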
Acceptable Standard Deviation (SD)
63
Skewness and Kurtosis
64
Formula & Examples
Examples
1. Calculate Sample Skewness, Sample Kurtosis from the following grouped
data
Class Frequency
2 - 4 3
4 - 6 4
6 - 8 2
8 - 10 1
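The formula slide is not reproduced in the transcript, and several definitions of "sample" skewness and kurtosis are in use. The sketch below makes its assumptions explicit: it uses class midpoints and the moment-based coefficients (third central moment over the second raised to the power 3/2, and fourth central moment over the second squared), dividing by n with no small-sample correction.

```python
classes = [(2, 4), (4, 6), (6, 8), (8, 10)]
frequencies = [3, 4, 2, 1]

midpoints = [(low + high) / 2 for low, high in classes]  # class marks
n = sum(frequencies)
mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n

def grouped_central_moment(r):
    """rth central moment for grouped data: sum of f * (x - mean) ** r, over n."""
    return sum(f * (x - mean) ** r for x, f in zip(midpoints, frequencies)) / n

m2 = grouped_central_moment(2)
m3 = grouped_central_moment(3)
m4 = grouped_central_moment(4)

skewness = m3 / m2 ** 1.5  # moment coefficient of skewness
kurtosis = m4 / m2 ** 2    # moment coefficient of kurtosis (normal = 3)

print(f"mean = {mean:.2f}, skewness = {skewness:.3f}, kurtosis = {kurtosis:.3f}")
```

Under these assumptions the data are positively skewed and slightly platykurtic. Formulas with an n − 1 correction would give somewhat different values.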
Skewness and Kurtosis
65
Coefficient of Correlation
66
 DEFINITION OF CORRELATION
 “If two or more quantities vary in sympathy so that
movements in one tend to be accompanied by
corresponding movements in other(s) then they are said
to be correlated.” Or
 “Correlation is an analysis of co-variation between two or
more variables.”
 A coefficient of correlation is generally applied in statistics
to calculate a relationship between two variables
Types of Correlation
The following are different types of correlation:
 Positive and Negative Correlation
 Simple, Partial and Multiple Correlation
 Linear and Non-linear Correlation
Types of Coefficient of Correlation
67
 Positive correlation: the correlation between two variables
is said to be positive or direct if an increase (or a decrease)
in one variable corresponds to an increase (or a decrease)
in the other.
 Negative Correlation: the correlation between two
variables is said to be negative or inverse if an increase (or
a decrease) corresponds to a decrease (or an increase) in
the other.
 Simple Correlation: It involves the study of only two
variables. For example, when we study the correlation
between the price and demand of a product, it is a
problem of simple correlation.
Types of Coefficient of Correlation
68
 Partial Correlation: It involves the study of three or more
variables, but considers only two variables to be
influencing each other. For example, if we consider three
variables, namely yield of wheat, amount of rainfall and
amount of fertilizers and limit our correlation analysis to
yield and rainfall, with the effect of fertilizers removed, it
becomes a problem relating to partial correlation only.
 Multiple Correlation: It involves the study of three or more
variables simultaneously. For example, if we study the
relationship between the yield of wheat per acre and both
amount of rainfall and the amount of fertilizers used, it
becomes a problem relating to multiple correlation.
Types of Coefficient of Correlation
69
 Linear Correlation: The correlation between two
variables is said to be linear if the amount of change in
one variable tends to bear a constant ratio to the
amount of change in other variable.
 Non-linear (or Curvilinear): The correlation between two
variables is said to be non-linear or curvilinear if the
amount of change in one variable does not bear a
constant ratio to the amount of change in other
variable.
Methods of Studying Correlation
70
 Scatter Diagram Method
 Karl Pearson’s Coefficient of Correlation, and
 Spearman's Rank Correlation Method
 Scatter diagram method
 A scatter diagram of the data helps in forming a
visual idea about the nature of the association between two
variables. If the points cluster along a straight line, the
association between the two variables is linear.
 Further, if the points cluster along a curve, the
corresponding association is non-linear or curvilinear.
 Finally, if the points neither cluster along a straight line
nor along a curve, there is absence of any association
between the variables.
Scatter Diagram
71
Karl Pearson’s Coefficient Correlation
72
 Karl Pearson’s coefficient of correlation is an extensively used
mathematical method in which a numerical value measures the degree
of relationship between linearly related variables. The coefficient
of correlation is denoted by “r”.
Actual mean method, which is expressed as:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
Pearson correlation example
 When the correlation coefficient is 1, for every increase in one variable
there is an increase of a fixed proportion in the other. For example, shoe
sizes change according to the length of the feet and are (almost) perfectly
correlated.
 When the correlation coefficient is −1, for every increase in one variable
there is a decrease of a fixed proportion in the other. For example, the
decrease in the quantity of gas in a gas tank shows an (almost) perfect
inverse correlation with speed.
 When the correlation coefficient is 0, an increase in one variable is not
associated with any consistent increase or decrease in the other, and the
two variables are not linearly related.
Coefficient of Correlation
73
Correlation coefficient formulas are used to find how strong a
relationship is between data. The formulas return a value
between -1 and 1, where:
 1 indicates a perfect positive linear relationship.
 −1 indicates a perfect negative linear relationship.
 A result of zero indicates no linear relationship at all.
Coefficient of Correlation
74
Example: Find the value of the correlation coefficient from the following table:
Solution
Step 1: Make a chart. Use the given data, and add three more columns: xy, x²,
and y².
Step 2: Multiply x and y together to fill the xy column.
Step 3: Take the square of the numbers in the x column, and put the result in the x²
column.
Step 4: Take the square of the numbers in the y column, and put the result in the y²
column.
Step 5: Add up all of the numbers in the columns and put the result at the bottom of the
column. The Greek letter sigma (Σ) is a short way of saying “sum of” or summation.
Step 6: Use the following correlation coefficient formula:
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$
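The six steps can be sketched in code. The example table is not reproduced in the transcript, so the x and y values below are hypothetical. The formula used is the standard computational form of Pearson's r built from the column sums of Step 5.

```python
import math

# Hypothetical data (the slide's table is not available)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Steps 1-4: build the xy, x^2 and y^2 columns
xy = [a * b for a, b in zip(x, y)]
x_sq = [a ** 2 for a in x]
y_sq = [b ** 2 for b in y]

# Step 5: column sums
sum_x, sum_y = sum(x), sum(y)
sum_xy, sum_x_sq, sum_y_sq = sum(xy), sum(x_sq), sum(y_sq)

# Step 6: r = [n*Sxy - Sx*Sy] / sqrt([n*Sxx - Sx^2] * [n*Syy - Sy^2])
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2))
r = numerator / denominator

print("r =", round(r, 4))
```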
Coefficient of Correlation
75
Classwork: Find the value of the correlation coefficient from the following table:
Spearman Rank Correlation Method
76
 A rank correlation coefficient measures the degree of
similarity between two rankings, and can be used to
assess the significance of the relation between them.
 Also called rank-order.
 Used when one or both variables are rank or ordinal
scales.
 Difference (D) between ranks of two sets of scores is
used to determine correlation coefficient.
Examples - golf driving distance and order of finish in golf
tournament; height and IQ score; weight and order of
finish in 400 meter race; number of calories consumed
and weight lost
Spearman Rank Correlation Method
77
To determine the rank correlation coefficient:
1. List each set of scores in a column.
2. Rank the two sets of scores.
3. Place the appropriate rank beside each score.
4. Head a column D and determine the difference in rank for each pair of
scores. (Sum of the D column should always be 0)
5. Square each number in the D column and sum the
values (∑D²).
6. Calculate the correlation coefficient by substituting the
values into the formula $R = 1 - \frac{6\sum D^2}{n^3 - n}$, where
n = number of observations
Spearman Rank Correlation Method
78
As an example,
Food   R1   R2   D = R1 − R2   D²
A      2    1    1             1
B      1    3    -2            4
C      4    2    2             4
D      3    4    -1            1
E      5    5    0             0
F      7    6    1             1
G      6    7    -1            1
Total            0             12
$R = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6 \times 12}{7^3 - 7} = 1 - 0.2143 = 0.786$
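The same calculation can be sketched in code for the seven foods. The function assumes the two lists already contain ranks with no ties, which is the case the simple formula covers.

```python
def spearman_rank_correlation(ranks_1, ranks_2):
    """Spearman's R = 1 - 6 * sum(D^2) / (n^3 - n) for untied ranks."""
    if len(ranks_1) != len(ranks_2):
        raise ValueError("both rankings must have the same length")
    n = len(ranks_1)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - (6 * sum_d_squared) / (n ** 3 - n)

r1 = [2, 1, 4, 3, 5, 7, 6]  # ranks R1 for foods A-G
r2 = [1, 3, 2, 4, 5, 6, 7]  # ranks R2 for foods A-G

differences = [a - b for a, b in zip(r1, r2)]
R = spearman_rank_correlation(r1, r2)

print("sum of D:", sum(differences))
print("R =", round(R, 3))
```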
chi-square test
79
 A chi-square test is a statistical test used to compare
observed results with expected results.
 The purpose of this test is to determine if a difference
between observed data and expected data is due to
chance, or if it is due to a relationship between the
variables you are studying.
 A chi-square (χ2) statistic is a test that measures how
a model compares to actual observed data.
 The chi-square statistic compares the size of any
discrepancies between the expected results and the
actual results, given the size of the sample and the
number of variables in the relationship.
chi-square test (cont’d.)
80
 The formula for the chi-square statistic used in the
chi-square test is:
$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
The subscript “c” is the degrees of freedom, “O” is the
observed value, and “E” is the expected value. You will rarely
want to use this formula to compute a chi-square value by hand.
The summation symbol means that you have to perform the
calculation for every data item in your data set, so the
calculations can get lengthy and tedious. Instead, you will
probably want to use technology.
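One way "using technology" might look is a short Python function that applies the chi-square formula to lists of observed and expected counts. This is a minimal sketch: the function name `chi_square_statistic` is hypothetical, and the counts below are invented purely for illustration.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three categories (illustration only).
observed = [18, 22, 20]
expected = [20, 20, 20]

stat = chi_square_statistic(observed, expected)
print("chi-square statistic:", stat)
```

Each term of the sum is one "data item" in the sense of the text above, so the loop replaces the tedious hand calculation.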
chi-square test (cont’d.)
81
EXAMPLE
Employers want to know on which days of the week employees are
absent in a five-day work week. Most employers would like to believe
that employees are absent equally during the week. Suppose a
random sample of 60 managers was asked on which day of the week
they had the highest number of employee absences. The results
were distributed as follows. (Use a 5% level of significance.)
                    Monday  Tuesday  Wednesday  Thursday  Friday
Observed Absences   15      12       9          9         15
Expected Absences   12      12       12         12        12
Calculate the χ² test statistic. Make a chart with the following
column headings and fill in the cells: Observed (O), Expected (E),
(O − E), (O − E)², and (O − E)²/E.
chi-square test (cont’d.)
82
SOLUTION
The null and alternate hypotheses are:
 H0: The absent days occur with equal frequencies.
 Ha: The absent days occur with unequal frequencies.
 The degrees of freedom are one fewer than the number of cells:
df = n − 1 = 5 − 1 = 4.
Now add (sum) the values of the last column. Verify that this sum is 3.
This is the χ² test statistic. At the 5% level of significance with df = 4,
the critical value is 9.488. Because 3 < 9.488, the decision is to not
reject the null hypothesis.
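The solution can be reproduced in Python as a rough check. The observed and expected counts are taken from the example. The p-value step is an addition that the slides do not show: it uses the closed-form upper-tail probability of the chi-square distribution, which holds only when the degrees of freedom are even (here df = 4). The helper name `chi2_sf_even_df` is hypothetical.

```python
import math

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
observed = [15, 12, 9, 9, 15]
expected = [12, 12, 12, 12, 12]

# One row per day: O, E, O - E, (O - E)^2, (O - E)^2 / E
rows = [
    (day, o, e, o - e, (o - e) ** 2, (o - e) ** 2 / e)
    for day, o, e in zip(days, observed, expected)
]
for row in rows:
    print(row)

chi2 = sum(row[-1] for row in rows)  # sum of the last column
df = len(observed) - 1               # one fewer than the number of cells

def chi2_sf_even_df(x, df):
    """P(X >= x) for a chi-square variable; valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(
        half ** k / math.factorial(k) for k in range(df // 2)
    )

p_value = chi2_sf_even_df(chi2, df)
reject_h0 = p_value < 0.05           # 5% level of significance

print("chi-square:", chi2, "df:", df)
print("p-value:", round(p_value, 4))
print("reject H0:", reject_h0)
```

Comparing the p-value with 0.05 leads to the same decision as comparing the statistic with the tabulated critical value; for odd degrees of freedom a statistics library or table would be needed instead of this shortcut.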

Chapter-Four.pdf

  • 1. ARBA MINCH TECHNOLOGY INSTITUTE DEPARTMENT OF MECHANICAL ENGINEERING Chapter Four: Processing and Data Analysis Instructor: Solomon N.(Ph.D.) Academic Year: 2022/23 1
  • 2. Processing and Data Analysis 2  After collecting data, the method of converting raw data into meaningful statement; includes data processing, data analysis, and data interpretation and presentation.  Acquiring data: Acquisition involves collecting or adding to the data holdings. There are several methods of acquiring data:  collecting new data  using your own previously collected data  reusing someone others data  purchasing data  acquired from Internet (texts, social media, photos)
  • 3. Processing and Data Analysis 3  Data processing: A series of actions or steps performed on data to verify, organize, transform, integrate, and extract data in an appropriate output form for subsequent use.  Methods of processing must be rigorously documented to ensure the utility and integrity of the data.  Data Analysis involves actions and methods performed on data that help describe facts, detect patterns, develop explanations and test hypotheses. This includes data quality assurance, statistical data analysis, modeling, and interpretation of results.
  • 4. Data Preservation and Re-use 4  Data preservation involves actions and procedures to keep data for future use, and includes data archiving and/or data submission to a data repository. Data preservation needs data description, documentation and metadata.  The goal of all these actions is to make data findable, comprehensible and easy to use. It also involves long-term preservation and curation of data.  Documentation provides an overview of the research context and design, data collection methods, data preparation and results or findings and is key to enabling the secondary user to make informed use of the data.  Metadata are providing standardized structured information explaining the purpose, origin, time references, geographic location, creator, access conditions and terms of use of data.
  • 5. Data Processing 5  Data processing is concerned with editing, coding, classifying, tabulating and charting and diagramming research data. The essence of data processing in research is data reduction/saving.  Data reduction involves winnowing/inspecting/ out the irrelevant from the relevant data and establishing order from disorder and giving shape to a mass of data  DOI: digital object identifier is a unique and persistent identifier makes data easy to find and cite data sets. Example: doi.org/10.1016/j.ecolind.2015.04.011  Data re-use means data mining, replication research, comparative studies, longitudinal research etc. E.g. data collected for one research objective can be used in a new study dealing with some other similar problem.
  • 6. Data Processing 6  Six stages of data processing 1. Data collection: Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. 2. Data preparation: Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as “pre-processing” is the stage at which raw data is cleaned up and organized for the following stage of data processing.  Data collection  Data preparation  Data input/interpretation  Data Processing  Data output/interpretation  Data storage and Report Writing
  • 7. Data Processing 7 3. Data input: The clean data is then entered into its destination and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information. 4. Processing: During this stage, the data inputted to the computer in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of data being processed.
  • 8. Data Processing 8 6. Data storage and Report Writing: The final stage of data processing is storage. After all of the data is processed, it is then stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. 5.Data output/interpretation: The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.).
  • 9. STATISTICS IN RESEARCH 9  The role of statistics in research is to function as a tool in designing research, analyzing its data and drawing conclusions. Most research studies result in a large volume of raw data which must be suitably reduced so that the same can be read easily and can be used for further analysis. There are two major areas of statistics Descriptive statistics and Inferential statistics.
  • 10. Cont. 10  Descriptive statistics concern the development of certain indices/directions from the raw data,  Inferential statistics concern with the process of generalization. Inferential statistics are also known as sampling statistics and are mainly concerned with two major type of problems: “Descriptive” describes data, while “inferential” infers or allows the researcher to arrive at a conclusion based on the collected information.  the estimation of population parameters, and  the testing of statistical hypotheses.
  • 12. Cont. 12 For example, you are tasked to research about teenage pregnancy in a certain high school. Using both descriptive and inferential statistics, you will be researching the number of teenage pregnancy cases in the school for a specific number of years. The difference is that with descriptive statistics, you are merely summarizing the collected data and, if possible, detecting a pattern in the changes. For example, it can be said that for the past five years, the majority of teenage pregnancies in X High School happened to those enrolled in the third year. There’s no need to predict that on the sixth year, the third year students would still be the ones with a greater number of teenage pregnancies. Conclusions as well as predictions are only done in inferential statistics.
  • 13. STATISTICS IN RESEARCH 13 The important statistical measures that are used to summarize the survey/research data are:  measures of central tendency or statistical averages;  measures of dispersion;  measures of asymmetry (skewness);  measures of relationship; and  other measures.
  • 14. STATISTICS IN RESEARCH 14 Measures of Central Tendency  Amongst the measures of central tendency, the three most important ones are the arithmetic average or mean, median and mode.  A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location.  The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others.
  • 15. Cont. 15 Mean (Arithmetic)  The mean (or average) is the most popular and well known measure of central tendency.  It can be used with both discrete and continuous data, although its use is most often with continuous data.  The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
  • 16. Cont. 16  The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to accurately reflect the typical salary of a worker, as most workers have salaries in the $12k to 18k range. The mean is being skewed /tilted by the two large salaries. Therefore, in this situation, we would like to have a better measure of central tendency. For example, consider the wages of staff at a factory below:
  • 17. Cont. 17 Median The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below: in this case, 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply have to take the middle two scores and average the result.
  • 18. Cont. 18 Mode The mode is the most frequent score in our data set. On a histogram it represents the highest bar in a bar chart or histogram. You can, therefore, sometimes consider the mode as being the most popular option. An example of a mode is presented below:
  • 19. Measure of variation 19  Example Consider the following two sets of scores: Set 1: 40, 50, 60, 60, 40, 50 Set 2: 0,100, 25, 75, 80, 20 Alert block  Both these sets have the same mean (50),  But the second set is a lot more widely dispersed ("scattered") than the first.
  • 20. Measure of variation/dispersion 20  The scatter or spread of items of a distribution is known as dispersion or variation.  In other words the degree to which numerical data tend to spread about an average value is called dispersion or variation of the data.  Measures of dispersion are statistical measures which provide ways of measuring the extent in which data are dispersed or spread out.
  • 21. Objective of Measuring Variation 21  To determine the reliability of an average by pointing out as how far an average is representative of the entire data.  To determine the nature and cause of variation in order to control the variation itself.  Enable comparison of two or more distribution with regard to their variability.  Measuring variability is of great importance to other statistical analysis. E.g., it is the basis of statistical quality control
  • 22. A good measure of variation 22  It should be easy to compute and understand.  It should be based on all observations.  It should be Uniquely defined  It should be capable of further statistical treatment.  It should be as little as affected by extreme values
  • 23. Types of measure of variation 23 Absolute measure: The measures of dispersion which are expressed in terms of original units of a data termed as absolute measures. :  Range  Quartile deviation  Mean deviation  Variance  Standard deviation Relative measures: are known as coefficients of dispersion, are obtained as ratios or percentages.  Relative range  Coefficient of quartile deviation  Coefficient of mean deviation  Coefficient of variation  Standard scores
  • 24. The range 24 Several measures of dispersion are available. We will discuss the common ones below. The Range:  The difference between the largest (maximum) and smallest (minimum) values. Range = Maximum – Minimum (1) For frequency distributed data, the range is:  The difference between the upper class boundary of the last class and the lower class boundary of the first class.
  • 25. Measure of Dispersion 25 Measure of variation-dispersion  Find the Range of 54.5, 55.0, 55.7, 51.8, 54.2, 52.4 Solution:  range(R) = 55.7- 51.8 = 3.9cm Solution: Range = UCBl - LCBf = 118.5-52.5 = 66
  • 26. Measure of Dispersion 26 Quartile deviation (QD): QD is the product of half of the difference between the upper and lower quartiles. The range expresses the extreme variability of observations of a variable. is half of the inter quartile range.  Coefficient of quartile deviation (CQD):  It gives the average amount by which the two quartiles differ from the median
  • 27. Measure of Dispersion 27  Mean Deviation(M.D):  The average deviation measures the scatter of the individual observations around a central value usually the mean or the median of a distribution.  The mean deviation is defined as the arithmetic mean of positive deviations of each observation from either the mean or the median of a distribution.  If the deviations are taken from the mean then it is called mean deviation about the mean.  On the other hand, if the deviations are taken from the median we call it mean deviation about the median.
  • 28. Mean deviation 28  The mean Deviation (M.D) is the arithmetic mean of the absolute deviations of the values from the mean.  It is the “average absolute deviation of the values from the mean”.  Note that: while dealing with population values, it is adjusted accordingly  Mean Deviations for Grouped data (discrete or continuous)  Where m = number of classes and xi = class mark of the ith class; n = number of observation
  • 29. Mean deviation 29  Mean deviation about the median ( MD) ungrouped data: grouped Frequency Distribution:
  • 30. Example 30  The weights of a sample of six students from a class (in kilograms) is measured as: 53, 56, 57, 59, 63 and 66. Find the mean deviation about the mean and the mean deviation from the median.  solution: First find the mean and the median. The mean is 59 kg and the median is 58 kg. Then take the deviations of each observation from these averages as shown below
  • 31. Example cont. 31 Example: Calculate the mean deviation from the mean and median for the following data.
  • 32. Solution 32 Mean of each class = lower class point + upper class point divided by two. = (1+5)/2= 3 Mean= 100/10=10 MD from the mean = =60/10=6 Class xi fi fixi |xi-ẋ| fi|xi-ẋ| 1-5 3 4 12 |3-10|=7 4*7=28 6-10 8 1 8 2 1*2=2 11-15 13 2 26 3 2*3=6 16-20 18 3 54 8 3*8=24 𝑓𝑖 = 10 100 𝑓𝑖 = 60
  • 33. Solution 33 MD from the median = =60/10=6 Class xi fi fixi |xi-ẋ| fi|xi-ẋ| 1-5 3 4 12 |3-10.5|=7 4*7.5=28 6-10 8 1 8 2.5 1*2.5=2.5 11-15 13 2 26 2.5 2*2.5=5 16-20 18 3 54 7.5 3*7.5=24 𝑓𝑖 = 10 100 𝑓𝑖 = 60 Median = 3, 3,3, 3, 8,,13,13, 18,18,18 median= (8+13)/2=10.5
  • 34. Coefficients of Mean Deviation(C.M.D) 34 Example: Find the coefficient of mean deviation about the mean and mean deviation about the median for the weights of six students in example above. Solution: Coefficient of mean deviation about the mean
  • 35. Variance and Standard Deviation 35  The variance and standard deviation are the most superior and widely used measures of dispersion  Both measures the average dispersion of the observations around the mean.  The variance is defined as the average of the squared deviation from the mean.  variance is a measure of dispersion that takes into account the spread of all data points in a data set. It’s the measure of dispersion the most often used, along with the standard deviation, which is simply the square root of the variance.  The variance is mean squared difference between each data point and the center of the distribution measured by the mean.  An item selected at random from a data set whose standard deviation is low has a better chance of being close to the mean than an item from a data set whose standard deviation is higher.
  • 36. Variance and standard deviation formula 36
  • 37. Variance and standard deviation formula 37
  • 38. Cont. 38 Example Calculate the population variance from the following 5 observations: 50, 55, 45, 60, 40.
  • 39. Cont. 39 Example: 24, 25, 29,29,30,31 Find variance and standard deviation ? Solution:
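The two variance examples can be checked with Python's standard `statistics` module. This is a sketch only: the slide does not say whether 24, 25, 29, 29, 30, 31 is a sample or a population, so both versions are computed.

```python
from statistics import pvariance, pstdev, variance, stdev

# Population variance of the five observations from the first example.
# The mean is 50 and the squared deviations sum to 250.
observations = [50, 55, 45, 60, 40]
population_variance = pvariance(observations)

# Second example: mean is 28 and the squared deviations sum to 40.
data = [24, 25, 29, 29, 30, 31]
as_population = (pvariance(data), pstdev(data))  # divide by n
as_sample = (variance(data), stdev(data))        # divide by n - 1

print(population_variance)
print(as_population)
print(as_sample)
```

The only difference between the two versions is the denominator, which is why the sample variance is always the larger of the two for the same data.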
  • 40. Variance and standard deviation formula 40 Quiz-1 Find the variance and standard deviation of the following sample data: i. 5, 17, 12, 10, 8 ii. The data given in the form of a frequency distribution.
  • 41. Coefficient of Variation 41 The coefficient of variation (CV) is the ratio of the standard deviation to the mean. The higher the coefficient of variation, the greater the level of dispersion around the mean. It is generally expressed as a percentage. Because it has no units, it allows for comparison between distributions of values whose scales of measurement are not comparable. When we are presented with estimated values, the CV relates the standard deviation of the estimate to the value of this estimate. The lower the value of the coefficient of variation, the more precise the estimate.
  • 42. Coefficient of Variation formula 42 CV = (standard deviation / mean) × 100%.  In situations where either two series have different units of measurement, or their means differ sufficiently in size, the CV should be used as a measure of dispersion.  Although the CV is broadly applied, its disadvantage is that it is not useful when the mean is negative, zero, or very close to zero.  Interpretation of the coefficient of variation: the distribution with the smaller CV is said to be less variable, or more consistent.
  • 43. Why We Need the Coefficient of Variation 43 Standard deviation is the most common measure of variability for a single data set. But why do we need yet another measure such as the coefficient of variation? Comparing the standard deviations of two data sets that have different units or very different means can be misleading, whereas their coefficients of variation can be compared directly. Example question: Two versions of a test are given to students. One test has pre-set answers and a second test has randomized answers. Find the coefficient of variation for each test.
            Regular Test    Randomized Answers
    Mean    50.1            45.8
    SD      11.2            12.9
  • 44. Cont. 44 Solution Step 1: Divide the standard deviation by the mean for the first sample: 11.2 / 50.1 = 0.22355 Step 2: Multiply Step 1 by 100: 0.22355 * 100 = 22.355% Step 3: Divide the standard deviation by the mean for the second sample: 12.9 / 45.8 = 0.28166 Step 4: Multiply Step 3 by 100: 0.28166 * 100 = 28.166% That’s it! Now you can compare the two results directly: the randomized test shows more relative variability.
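The four steps above amount to one small function. The sketch below is illustrative only; the function name `coefficient_of_variation` is an assumption introduced here, and the guard for a zero mean reflects the limitation of the CV noted earlier.

```python
def coefficient_of_variation(sd, mean):
    """Return the coefficient of variation as a percentage."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

regular = coefficient_of_variation(11.2, 50.1)     # regular test
randomized = coefficient_of_variation(12.9, 45.8)  # randomized answers

print(f"regular: {regular:.3f}%  randomized: {randomized:.3f}%")
```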
  • 45. Cont. 45 Quiz-2: Find the population coefficient of variation of 24, 26, 33, 37, 29, 31. solution
  • 47. Cont. 47 Example: Suppose that the mean weight of a group of students is 165 pounds with a S.D of 8 pounds. If the height of the same group of students has a mean of 60 inches with a S.D of 3 inches, compare the variability in weight and height measurements. Solution:
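The solution is not shown on this slide. Assuming the comparison is meant to use the coefficient of variation, CV = (S.D. / mean) × 100%, a worked version would be:

```latex
CV_{\text{weight}} = \frac{8}{165}\times 100\% \approx 4.85\%,
\qquad
CV_{\text{height}} = \frac{3}{60}\times 100\% = 5\%
```

On this reading the heights are slightly more variable than the weights relative to their mean, even though the weights have the larger standard deviation.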
  • 48. Standard Scores (Z-Scores) 48  A z-score is a statistical measurement of a score's relationship to the mean in a group of scores.  Z-scores are not measures of relative dispersion, but one of the applications of the standard deviation.  We define the standard score as z = (x − mean) / standard deviation.  It tells us how many standard deviations a value lies above (if positive) or below (if negative) the mean.  Because the standard score gives the deviation from the mean in units of standard deviation, it is used to compare two observations coming from different groups.
  • 49. Standard Scores (Z-Scores) 49  Question: Two third-year medical laboratory sections were given an introduction to biostatistics examination. The following information was given.  Student A from section 1 scored 90 and student B from section 2 scored 95.  Relatively speaking, who performed better? Student A performed better relative to his section, because the score of student A is two standard deviations above the mean score of his section, while the score of student B is only one standard deviation above the mean score of his section.
  • 50. Standard Scores (Z-Scores) 50 Quiz 3: Given that the mean and standard deviation are 50 and 10, what value of x has a z-score of 1.4? What is the z-score that corresponds to x = 30?
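A minimal sketch of the two directions of the z-score calculation, using the standard definition z = (x − mean) / SD. The function names are assumptions introduced here, and the numbers are those of Quiz 3, so the printed values can be used to check a hand calculation.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the z-score: recover the raw value for a given z."""
    return mean + z * sd

x_for_z = value_from_z(1.4, mean=50, sd=10)
z_for_x = z_score(30, mean=50, sd=10)

print(x_for_z, z_for_x)
```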
  • 51. Moments 51  The rth moment about the mean (the rth central moment) is defined as m_r = Σ(xi − x̄)^r / n;  for continuous grouped data it is given by m_r = Σfi(xi − x̄)^r / Σfi, where xi is the class midpoint. Example: Find the first three central moments of the numbers 2, 3 and 7. Solution: first find the mean: x̄ = (2 + 3 + 7)/3 = 4.
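The rest of the solution can be sketched as follows, assuming the usual definition of the rth central moment as the average of the rth powers of the deviations from the mean. The helper name `central_moment` is introduced here for illustration.

```python
def central_moment(values, r):
    """rth moment about the mean: average of (x - mean) ** r."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

numbers = [2, 3, 7]  # mean is 4, so the deviations are -2, -1 and 3
m1, m2, m3 = (central_moment(numbers, r) for r in (1, 2, 3))

print(m1, m2, m3)
```

The first central moment is always zero, the second is the population variance (here 14/3), and the third (here 6) is positive because the largest deviation lies above the mean.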
  • 52. 52 Normal Distribution, Skewness and Kurtosis  A normal distribution is the proper term for a probability bell curve.  In the standard normal distribution the mean is zero and the standard deviation is 1; a general normal distribution can have any mean and any positive standard deviation. Every normal distribution has zero skew and a kurtosis of 3. Normal distributions are symmetrical, but not all symmetrical distributions are normal. What are the 4 characteristics of a normal distribution?  Normal distributions are symmetric, unimodal, and asymptotic, and the mean, median, and mode are all equal.  A normal distribution is perfectly symmetrical around its center. That is, the right side of the center is a mirror image of the left side.
  • 53. 53 Skewness  Skewness is the degree of asymmetry or departure from symmetry of a distribution.  A skewed frequency distribution is one that is not symmetrical.  Skewness is concerned with the shape of the curve, not its size.  If the frequency curve (smoothed frequency polygon) of a distribution has a longer tail to the right of the central maximum than to the left, the distribution is said to be skewed to the right, or to have positive skewness. If it has a longer tail to the left of the central maximum than to the right, it is said to be skewed to the left, or to have negative skewness.  For a moderately skewed distribution, the following empirical relation holds among the three commonly used measures of central tendency: Mean − Mode ≈ 3 (Mean − Median).
  • 54. Skewness 54  A unimodal distribution is a distribution with one clear peak or most frequent value.  For a distribution curve, “asymptotic” means that the tails approach the horizontal axis ever more closely without touching it. (In estimation theory the same word describes how an estimator behaves as the sample size tends to infinity.)
  • 55. Skewness 55  Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.  In respect of the measures of skewness and kurtosis, we mostly use the first measure of skewness based on mean and mode or on mean and median.  Positive Skewed Mode < Median < Mean  Negative Skewed Mean < Median < Mode  Zero Skewed Mean = Median = Mode
  • 56. Skewness 56 Example. Suppose the mean, the mode, and the standard deviation of a certain distribution are 32, 30.5 and 10 respectively. What is the shape of the curve representing the distribution?  Solution: The mean exceeds the mode, so the distribution is positively skewed. Karl Pearson’s coefficient of skewness (SK) is SK = (Mean − Mode) / Standard deviation; here SK = (32 − 30.5)/10 = 0.15. If SK = 0, then the distribution is symmetrical. If SK > 0, then the distribution is positively skewed. If SK < 0, then the distribution is negatively skewed.
  • 57. Kurtosis 57  Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.  A normal distribution has a kurtosis of 3 and is described as mesokurtic.  An increased kurtosis (>3) can be visualized as a thin “bell” with a high peak and heavier tails, whereas a decreased kurtosis (<3) corresponds to a broader, flatter peak and lighter tails.  In finance, kurtosis is used as a measure of financial risk.
  • 58. Kurtosis 58  A large kurtosis is associated with a high level of risk for an investment because it indicates that there are high probabilities of extremely large and extremely small returns.  On the other hand, a small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low.
  • 59. Kurtosis 59  Kurtosis is the degree of peakedness of a distribution, usually taken relative to a normal distribution.  When the curve of a distribution is flatter than normal it is known as platykurtic;  when the distribution is more peaked than normal, it is called leptokurtic.  The normal distribution, which is neither very high peaked nor flat topped, is called mesokurtic. The moment coefficient of kurtosis is β2 = m4 / m2², where m2 and m4 are the second and fourth central moments. If β2 = 3, then the distribution is mesokurtic. If β2 > 3, then the distribution is leptokurtic. If β2 < 3, then the distribution is platykurtic.
  • 60. Acceptable Standard Deviation (SD) 60  A smaller SD represents data where the results are very close in value to the mean. The larger the SD, the more variance in the results.  Data points in a normal distribution are more likely to fall closer to the mean. In fact, about 68% of all data points will be within ±1SD from the mean, about 95% will be within ±2SD, and about 99.7% will be within ±3SD.  Values within ±2SD of the mean are taken to represent measurements that are closer to the true value than those that fall in the area beyond ±2SD.
  • 62. Acceptable Standard Deviation (SD) 62 A cholesterol control is run 20 times over 25 days, yielding the following results in mg/dL: 192, 188, 190, 190, 189, 191, 188, 193, 188, 190, 191, 194, 194, 188, 192, 190, 189, 189, 191, 192. • Using the cholesterol control results, follow the steps described below to establish Quality Control (QC) ranges.
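The steps themselves are not reproduced in this transcript. The sketch below assumes the common convention of setting the QC range at mean ± 2 SD, in line with the ±2SD rule stated earlier, and assumes the sample standard deviation (n − 1 denominator); both choices are assumptions rather than the slide's confirmed procedure.

```python
from statistics import mean, stdev

# Cholesterol control results in mg/dL, copied from the slide.
results = [192, 188, 190, 190, 189, 191, 188, 193, 188, 190,
           191, 194, 194, 188, 192, 190, 189, 189, 191, 192]

center = mean(results)
sd = stdev(results)  # sample SD; an assumption, see the note above

lower, upper = center - 2 * sd, center + 2 * sd  # assumed +/- 2 SD limits
print(f"mean = {center:.2f}, SD = {sd:.2f}, QC range = {lower:.1f} to {upper:.1f} mg/dL")
```

A future control result falling outside the printed range would then be flagged for review under this convention.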
  • 64. Skewness and Kurtosis 64 Formula & Examples Example 1. Calculate the sample skewness and sample kurtosis from the following grouped data:
    Class     Frequency
    2 - 4     3
    4 - 6     4
    6 - 8     2
    8 - 10    1
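The slide's own formulas are not shown, so the sketch below uses the moment-based coefficients (skewness = m3 / m2^1.5 and kurtosis = m4 / m2²) computed from class midpoints. That choice is an assumption: bias-corrected sample formulas with n − 1 adjustments would give somewhat different values.

```python
# (class limits, frequency) pairs taken from the table on the slide.
classes = [((2, 4), 3), ((4, 6), 4), ((6, 8), 2), ((8, 10), 1)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
freqs = [f for _, f in classes]
n = sum(freqs)
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n

def grouped_central_moment(r):
    """rth central moment of the grouped data, weighting midpoints by frequency."""
    return sum(f * (x - mean) ** r for x, f in zip(midpoints, freqs)) / n

m2, m3, m4 = (grouped_central_moment(r) for r in (2, 3, 4))
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2

print(f"mean = {mean:.2f}, skewness = {skewness:.4f}, kurtosis = {kurtosis:.4f}")
```

Under these assumptions the skewness is positive (longer right tail) and the kurtosis is below 3 (platykurtic).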
  • 66. Coefficient of Correlation 66  DEFINITION OF CORRELATION  “If two or more quantities vary in sympathy so that movements in one tend to be accompanied by corresponding movements in other(s) then they are said to be correlated.” Or  “Correlation is an analysis of co-variation between two or more variables.”  A coefficient of correlation is generally applied in statistics to calculate a relationship between two variables Types of Correlation The following are different types of correlation:  Positive and Negative Correlation  Simple, Partial and Multiple Correlation  Linear and Non-linear Correlation
  • 67. Types of Coefficient of Correlation 67  Positive correlation: the correlation between two variables is said to be positive or direct if an increase (or a decrease) in one variable corresponds to an increase (or a decrease) in the other.  Negative Correlation: the correlation between two variables is said to be negative or inverse if an increase (or a decrease) in one variable corresponds to a decrease (or an increase) in the other.  Simple Correlation: It involves the study of only two variables. For example, when we study the correlation between the price and demand of a product, it is a problem of simple correlation.
  • 68. Types of Coefficient of Correlation 68  Partial Correlation: It involves the study of three or more variables, but considers only two variables to be influencing each other. For example, if we consider three variables, namely yield of wheat, amount of rainfall and amount of fertilizers and limit our correlation analysis to yield and rainfall, with the effect of fertilizers removed, it becomes a problem relating to partial correlation only.  Multiple Correlation: It involves the study of three or more variables simultaneously. For example, if we study the relationship between the yield of wheat per acre and both amount of rainfall and the amount of fertilizers used, it becomes a problem relating to multiple correlation.
  • 69. Types of Coefficient of Correlation 69  Linear Correlation: The correlation between two variables is said to be linear if the amount of change in one variable tends to bear a constant ratio to the amount of change in other variable.  Non-linear (or Curvilinear): The correlation between two variables is said to be non-linear or curvilinear if the amount of change in one variable does not bear a constant ratio to the amount of change in other variable.
  • 70. Methods of Studying Correlation 70  Scatter Diagram Method  Karl Pearson’s Coefficient of Correlation, and  Spearman's Rank Correlation Method Scatter Diagram Method  A scatter diagram of the data gives a visual idea of the nature of the association between two variables. If the points cluster along a straight line, the association between the two variables is linear.  Further, if the points cluster along a curve, the corresponding association is non-linear or curvilinear.  Finally, if the points cluster neither along a straight line nor along a curve, there is no association between the variables.
  • 72. Karl Pearson’s Coefficient of Correlation 72  Karl Pearson’s coefficient of correlation is a widely used mathematical method that gives a numerical measure of the degree of relation between linearly related variables. The coefficient of correlation is denoted by “r”. Actual Mean Method, which is expressed as r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]. Pearson correlation example  When the correlation coefficient is 1, every increase in one variable is matched by an increase of a fixed proportion in the other. For example, shoe sizes change according to the length of the feet and are (almost) perfectly correlated.  When the correlation coefficient is −1, every increase in one variable is matched by a decrease of a fixed proportion in the other. For example, the decrease in the quantity of gas in a gas tank shows an (almost) perfect inverse correlation with speed.  When the correlation coefficient is 0, an increase in one variable is not accompanied by any consistent increase or decrease in the other, and the two variables are not linearly related.
  • 73. Coefficient of Correlation 73 Correlation coefficient formulas are used to find how strong a relationship is between data. The formulas return a value between −1 and 1, where:  1 indicates a perfect positive relationship.  −1 indicates a perfect negative relationship.  A result of zero indicates no linear relationship at all.
  • 74. Coefficient of Correlation 74 Example: Find the value of the correlation coefficient from the following table: Solution Step 1: Make a chart. Use the given data for x and y, and add three more columns: xy, x², and y². Step 2: Multiply x and y together to fill the xy column. Step 3: Take the square of the numbers in the x column, and put the result in the x² column. Step 4: Take the square of the numbers in the y column, and put the result in the y² column. Step 5: Add up all of the numbers in each column and put the result at the bottom of the column. The Greek letter sigma (Σ) is a short way of saying “sum of” or summation. Step 6: Use the following correlation coefficient formula: r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²] [nΣy² − (Σy)²]}.
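The six steps can be followed in code using the computational form r = [nΣxy − ΣxΣy] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}. The slide's data table is not available in this transcript, so the `x` and `y` values below are hypothetical, chosen only to make the sketch runnable.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation from the column sums built in Steps 1 to 5."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # Step 2: xy column
    sum_x2 = sum(x * x for x in xs)              # Step 3: x squared column
    sum_y2 = sum(y * y for y in ys)              # Step 4: y squared column
    numerator = n * sum_xy - sum_x * sum_y       # Step 6: apply the formula
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Hypothetical data, not taken from the slide.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)

print(round(r, 4))
```

The function does not guard against a zero denominator, which occurs when either variable is constant; the correlation is undefined in that case.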
  • 75. Coefficient of Correlation 75 Classwork: Find the value of the correlation coefficient from the following table:
  • 76. Spearman Rank Correlation Method 76  A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them.  Also called rank-order.  Used when one or both variables are rank or ordinal scales.  Difference (D) between ranks of two sets of scores is used to determine correlation coefficient. Examples - golf driving distance and order of finish in golf tournament; height and IQ score; weight and order of finish in 400 meter race; number of calories consumed and weight lost
  • 77. Spearman Rank Correlation Method 77 To determine the rank correlation coefficient: 1. List each set of scores in a column. 2. Rank the two sets of scores. 3. Place the appropriate rank beside each score. 4. Head a column D and determine the difference in rank for each pair of scores. (The sum of the D column should always be 0.) 5. Square each number in the D column and sum the values (ΣD²). 6. Calculate the correlation coefficient by substituting the values into the formula R = 1 − 6ΣD² / (n³ − n), where n = number of pairs of observations.
  • 78. Spearman Rank Correlation Method 78 As an example:
    Food    R1    R2    D = R1 − R2    D²
    A       2     1      1             1
    B       1     3     −2             4
    C       4     2      2             4
    D       3     4     −1             1
    E       5     5      0             0
    F       7     6      1             1
    G       6     7     −1             1
    Total                0             ΣD² = 12
    R = 1 − 6ΣD² / (n³ − n) = 1 − (6 × 12) / (7³ − 7) = 1 − 0.2143 = 0.786
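The same calculation as a short Python sketch, using the ranks from the food example and the formula R = 1 − 6ΣD² / (n³ − n). The function name is introduced here for illustration, and the formula in this form assumes there are no tied ranks.

```python
def spearman_from_ranks(rank1, rank2):
    """Spearman rank correlation from two lists of ranks (no ties assumed)."""
    n = len(rank1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return 1 - 6 * sum_d2 / (n ** 3 - n)

r1 = [2, 1, 4, 3, 5, 7, 6]  # ranks R1 for foods A to G
r2 = [1, 3, 2, 4, 5, 6, 7]  # ranks R2 for foods A to G
rho = spearman_from_ranks(r1, r2)

print(round(rho, 3))
```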
  • 79. chi-square test 79  A chi-square test is a statistical test used to compare observed results with expected results.  The purpose of this test is to determine if a difference between observed data and expected data is due to chance, or if it is due to a relationship between the variables you are studying.  The chi-square (χ²) statistic measures how well a model's expected values compare to the actual observed data.  The chi-square statistic compares the size of any discrepancies between the expected results and the actual results, given the size of the sample and the number of variables in the relationship.
  • 80. chi-square test (cont’d.) 80  The formula for the chi-square statistic used in the chi-square test is: χ²c = Σ (Oi − Ei)² / Ei. The subscript “c” is the degrees of freedom, “O” is your observed value and “E” is your expected value. It’s very rare that you’ll want to actually use this formula to find a chi-square value by hand. The summation symbol means that you’ll have to perform a calculation for every single data item in your data set. As you can probably imagine, the calculations can get very lengthy and tedious. Instead, you’ll probably want to use technology.
  • 81. chi-square test (cont’d.) 81 EXAMPLE Employers want to know which days of the week employees are absent in a five-day work week. Most employers would like to believe that employees are absent equally during the week. Suppose a random sample of 60 managers were asked on which day of the week they had the highest number of employee absences. The results were distributed as follows (use a 5% level of significance):
                         Monday   Tuesday   Wednesday   Thursday   Friday
    Observed Absences    15       12        9           9          15
    Expected Absences    12       12        12          12         12
    Calculate the χ² test statistic. Make a chart with the following column headings and fill in the cells:
  • 82. chi-square test (cont’d.) 82 SOLUTION The null and alternate hypotheses are:  H0: The absent days occur with equal frequencies.  Ha: The absent days occur with unequal frequencies.  The degrees of freedom are one fewer than the number of cells: df = n − 1 = 5 − 1 = 4. Now add (sum) the values of the last column. Verify that this sum is 3. This is the χ² test statistic. Because 3 is less than the critical value of 9.488 for df = 4 at the 5% level of significance, the decision is to not reject the null hypothesis.
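The test statistic for the absence example can be verified with a few lines of Python. This is a sketch under one stated assumption: the critical value 9.488 (df = 4, 5% significance) is hard-coded from a standard chi-square table rather than computed.

```python
observed = [15, 12, 9, 9, 15]  # Monday to Friday, from the example
expected = [12] * 5            # 60 managers spread equally over 5 days

# Sum of (O - E)^2 / E over the five cells.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

CRITICAL_VALUE_5PCT_DF4 = 9.488  # table value, assumed rather than computed
reject_null = chi_square > CRITICAL_VALUE_5PCT_DF4

print(chi_square, df, reject_null)
```

A statistics library could supply the critical value or a p-value directly; the hard-coded constant simply keeps the sketch free of external dependencies.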