This document provides an overview of key concepts in statistics, including:
- Descriptive statistics, which describe data, and inferential statistics, which draw conclusions about a population from a sample.
- Measures of central tendency (mean, median, mode) and how to calculate them to summarize data.
- Measures of dispersion (range, variance, standard deviation and coefficient of variation), which indicate how spread out the data is.
- When to use the mean versus the median, based on the presence of outliers.
- The geometric mean, which gives the average percentage change over time.
- Worked examples of calculating these measures and interpreting the results.
2. STATISTICS
Descriptive
• Descriptive statistics, in the simplest sense, give people a description of the data that we currently have.
• Example: the question "How did a class of students perform?" with an answer such as "The mean mark is 64.5".
Inferential
• Inferential statistics are used when we have to infer an outcome by looking at only a small portion of the data.
• Example: the question "Who will win this election?" with an answer such as "A survey of 10,000 people suggests that XYZ will get 60-70% of the vote, with 95% confidence".
3. Sample vs Population
Sample
• A sample is a portion of the population which is readily available or easily attainable.
• Example: a survey, or just a million people from the population.
Population
• A population is the entire set of data that should ideally be used for a statistic.
• Example: a census, or the population of a country.
4. Why do we use a sample?
• Going around and asking the entire population who they are going to vote for is impractical.
• Taking the heights of the entire population is not feasible, as many people would die and many would be born by the time we finished.
• Sometimes the sample is a “representative sample”, meaning it has the same nature and characteristics as the population.
6. Terminology
Measures of Central Tendency:
Mean
Median
Mode
Geometric Mean
Measures of Dispersion:
Range
Variance
Standard Deviation
Coefficient of Variation
8. Arithmetic Mean (Average), Median & Mode
Mean: The most commonly used measure of central tendency, computed as the sum of the values divided by the number of values. It represents the “typical” value in a data set.
Median: The “middle” number in a set of numbers after arranging them in ascending order. With an even number of values, it is the average of the two middle values.
Mode: The most frequently occurring value in a data set.
9. Exercise 1
The following are the ages of eight employees of
the Math department. Find the average age of
the department.
53, 32, 61, 27, 39, 44, 49, 57
Ans: 45.25
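The answer above can be checked with a short sketch. Python and its standard `statistics` module are an assumption here (the slides themselves only refer to Excel and R), and the variable names are illustrative.

```python
from statistics import mean

# Ages of the eight employees, as listed in Exercise 1.
ages = [53, 32, 61, 27, 39, 44, 49, 57]

# Arithmetic mean: sum of the values divided by the number of values.
average_age = mean(ages)
print(average_age)  # 45.25
```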
10. Exercise 2
A cosmetics manufacturer recently purchased a machine to fill
3-ounce cologne bottles. To test the accuracy of the machine’s
volume setting, 18 trial bottles were run. The resulting volumes
(in ounces) for the trials were as follows:
3.02, 2.89, 2.92, 2.84, 2.90, 2.97, 2.95, 2.94, 2.93, 3.01,
2.97, 2.95, 2.90, 2.94, 2.96, 2.99, 2.99, 2.97.
The company does not normally recalibrate the filling machine
for this cologne if the average volume is within 0.04 of 3.00
ounces. Should it re-calibrate based on the trial runs?
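One way to work through this exercise is sketched below in Python (an assumption; the slides do not prescribe a tool). It reads "within 0.04 of 3.00 ounces" as an absolute difference of at most 0.04 between the average volume and the target; the names `target` and `tolerance` are illustrative.

```python
from statistics import mean

# Volumes (in ounces) of the 18 trial bottles from Exercise 2.
volumes = [
    3.02, 2.89, 2.92, 2.84, 2.90, 2.97, 2.95, 2.94, 2.93, 3.01,
    2.97, 2.95, 2.90, 2.94, 2.96, 2.99, 2.99, 2.97,
]

target = 3.00      # intended fill volume
tolerance = 0.04   # allowed gap between the average and the target

average_volume = mean(volumes)
gap = abs(average_volume - target)

# Recalibrate only if the average falls outside the tolerance band.
needs_recalibration = gap > tolerance
print(round(average_volume, 4), needs_recalibration)  # 2.9467 True
```

Under this reading the average is about 0.053 ounces below the target, which is outside the 0.04 band, so the machine would be recalibrated.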
11. Concept of Mode
Given below are the number of books borrowed
by employees from the office library yesterday.
What is the mode?
2, 4, 3, 2, 2, 2, 2, 3, 3, 5, 2, 1, 1, 2
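A minimal sketch of finding the mode by counting how often each value occurs, written in Python as an assumption (the slides do not name a tool for this exercise):

```python
from collections import Counter
from statistics import mode

# Number of books borrowed by each employee, as listed above.
books = [2, 4, 3, 2, 2, 2, 2, 3, 3, 5, 2, 1, 1, 2]

# Tally the occurrences of each value and take the most frequent one.
counts = Counter(books)
most_common_value, frequency = counts.most_common(1)[0]

print(most_common_value, frequency)  # 2 7
print(mode(books))                   # 2
```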
12. Check Your Understanding
Given below are the percentage of planned training
programmes that have been conducted for 10
departments in an organization at the end of the
planning year. If you were to report the average
percentage of planned programmes conducted for the
entire organization, how would you compute the
average?
95, 89, 90, 97, 85, 88, 85, 94, 25, 90
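The sketch below shows how much the single low value (25) moves the mean compared with the median. It is written in Python as an assumption, and it treats the ten figures as ten separate values, one per department.

```python
from statistics import mean, median

# Percentage of planned programmes conducted, one value per department.
conducted = [95, 89, 90, 97, 85, 88, 85, 94, 25, 90]

mean_all = mean(conducted)      # pulled down by the unusual value 25
median_all = median(conducted)  # average of the two middle sorted values

# For comparison only: the mean with the unusual value left out.
mean_without_25 = mean([v for v in conducted if v != 25])

print(mean_all, median_all, round(mean_without_25, 1))  # 83.8 89.5 90.3
```

The median (89.5) is close to what most departments achieved, while the mean (83.8) is lower than nine of the ten values.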
13. Remember!
The median value divides a data set into two
halves; one half of the values are less than the
median and one half of the values are more
than the median value.
14. Exercise 3
Given below are the number of absentees at ten
different training programmes in
communication skills, conducted at an
organization. If you were to report the “typical”
number of absentees at a communication skills
programme, what would that number be?
1, 1, 1, 1, 3, 1, 4, 2, 2, 1
17. Exercise 4
Given below are the annual salaries (in Rs. Lakh) for
seven employees. Find their mean salary and median
salary. What do you observe?
2, 3, 6, 8, 12, 40, 75
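A short sketch of the comparison this exercise asks for, in Python (an assumption; variable names are illustrative):

```python
from statistics import mean, median

# Annual salaries (in Rs. Lakh) of the seven employees.
salaries = [2, 3, 6, 8, 12, 40, 75]

mean_salary = mean(salaries)      # 146 / 7
median_salary = median(salaries)  # 4th of the 7 sorted values

print(round(mean_salary, 2), median_salary)  # 20.86 8
```

The two largest salaries pull the mean well above what five of the seven employees earn, while the median stays at 8.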
18. Practical Use of Median
The median is used as a measure of central tendency
when the data set has values which vary a lot in
magnitude.
Example: Magazines like Fortune report the median
salaries of CEOs. Why?
19. Outliers
An outlier is a value which is very different from
the rest of the data.
Question: How should you handle outliers in a
data set?
Ans: Investigate them first, then remove or normalize them.
20. Fundamentals
The mean is affected by extreme values, whereas
the median is largely unaffected by them.
REMEMBER!
Whether the MEAN or the MEDIAN is appropriate
depends on the context.
21. Practical Tip
Before analyzing any data, take a careful look at
the data. If there are unusual values, they may
have crept in because of an error in recording
data, or the data may reflect some unusual (rare)
event. Investigate those unusual values before
proceeding with analysis of the data!
23. Application of Geometric Mean
The geometric mean is used to compute the
“average rate of increase (or decrease)” in a
variable over time.
Example: to compute the average percentage
change in the training budget of EY from 2007 to
2012.
24. Example
If Rs. 100 becomes Rs. 150 at the end of the first
year, and if Rs. 150 becomes Rs. 75 at the end of
the second year, at what % has your money
decreased?
25. Growth Factors (GF)
Growth Factor = New Amount / Old Amount
e.g. If Rs. 100 becomes Rs. 120 at the end of
Year 1, the growth factor = 120/100 = 1.2
e.g. If Rs. 100 becomes Rs. 90 at the end of Year
1, the growth factor is 90/100 = 0.9
26. Computing Change from GFs
Subtract 1 from the GF to arrive at the change.
If GF = 1.2, change = 1.2 – 1 = 0.2 i.e. increase of
20%
If GF = 0.9, change = 0.9 – 1 = -0.1 i.e. a
decrease of 10%
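The growth-factor steps can be combined into an average rate of change, as in the sketch below. It is written in Python (an assumption) and uses the Rs. 100 → Rs. 150 → Rs. 75 figures from the earlier example slide; `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import geometric_mean

# Amount at the start, after year 1 and after year 2.
amounts = [100, 150, 75]

# Growth factor for each year = new amount / old amount.
growth_factors = [new / old for old, new in zip(amounts, amounts[1:])]

# Average growth factor is the geometric mean of the yearly factors;
# subtracting 1 turns it into an average rate of change per year.
average_gf = geometric_mean(growth_factors)
average_change = average_gf - 1

print(growth_factors)                  # [1.5, 0.5]
print(round(average_change * 100, 1))  # -13.4
```

Applying the average factor twice reproduces the final amount (100 × 0.866 × 0.866 ≈ 75), whereas the arithmetic mean of +50% and -50% would wrongly suggest no change.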
27. Exercise 5
The training manager in an IT company has noted the
following changes (percentage increase) in the training
budget over the following years:
2005: 0.11
2006: 0.08
2007: 0.075
2008: 0.08
2009: 0.095
2010: 0.108
2011: 0.120
He has to submit the average percentage change in the
training budget over the years in a presentation to top
management. What figure should he report? [Ans: 9.53%]
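A sketch of how the quoted answer can be reproduced, assuming Python 3.8+ and reading each figure as a fractional increase (0.11 = 11%):

```python
from statistics import geometric_mean

# Yearly increases in the training budget, as fractions.
increases = {
    2005: 0.11, 2006: 0.08, 2007: 0.075, 2008: 0.08,
    2009: 0.095, 2010: 0.108, 2011: 0.120,
}

# Convert each increase to a growth factor (1 + rate), take the
# geometric mean, then convert back to a rate.
growth_factors = [1 + rate for rate in increases.values()]
average_increase = geometric_mean(growth_factors) - 1

print(f"{average_increase:.2%}")  # 9.53%
```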
29. Exercise 6
The employee counts at a leading company are as
follows. Find the average percentage increase in the
number of employees each year. [Ans: 8.98%]
2008: 12,500 employees
2009: 13,250
2010: 14,310
2011: 15,741
2012: 17,630
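Here the data are levels rather than rates, so the growth factors have to be derived first. The Python sketch below (an assumption, with illustrative names) also shows that the geometric mean of the yearly factors equals the overall ratio raised to 1 / (number of periods).

```python
from statistics import geometric_mean

# Employee count at each year, as listed in Exercise 6.
headcount = {2008: 12500, 2009: 13250, 2010: 14310, 2011: 15741, 2012: 17630}
counts = list(headcount.values())

# Year-on-year growth factors (four periods for five data points).
growth_factors = [new / old for old, new in zip(counts, counts[1:])]
average_increase = geometric_mean(growth_factors) - 1

# Equivalent shortcut: only the first and last values matter.
periods = len(counts) - 1
shortcut = (counts[-1] / counts[0]) ** (1 / periods) - 1

print(f"{average_increase:.2%}")  # 8.98%
```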
35. Concept of SD
The standard deviation (SD) of a data set is an
indication of how closely the data points are
clustered about the mean. The greater the SD of a
data set, the greater the dispersion or spread in
the data set.
36. Key Features of SD
A data set in which all values are the same will have
zero SD.
SD can never be negative.
The SD represents, roughly, the “average” distance of a
data point from the mean of the data set.
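These features can be seen in a small sketch. It assumes Python and the sample form of the SD (dividing by n - 1), which is also what Excel's stdev() and R's sd() compute; the two data sets are made up purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(values):
    """Sample SD: root of the summed squared deviations over n - 1."""
    m = mean(values)
    return sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))

identical = [5, 5, 5, 5]      # hypothetical data: no spread at all
spread_out = [1, 3, 5, 7, 9]  # hypothetical data: same mean, more spread

print(sample_sd(identical))             # 0.0
print(round(sample_sd(spread_out), 2))  # 3.16
```

Because the deviations are squared before being summed, the result cannot be negative, and it is zero only when every value equals the mean.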
37. Significance of SD in Training
SD is an indication of consistency. So trainers in
an organization can be evaluated for consistency
(in delivery) based on the SD of the participant
ratings.
38. Exercise 8
Given below are the times taken (in months) by two
groups of employees (with similar experience) to
complete multimedia Java training programmes
developed by two different vendors (A and B). Which
vendor’s programme shows more consistency in
completion times?
A: 2, 1.5, 1.75, 1.8, 2.2, 3, 1, 2.8, 2
B: 2, 3, 2.75, 3.25, 1, 3, 2.8, 2.95,
39. Exercise 9
Students’ ages in the regular daytime MBA programme and the
evening programmes in a popular Management institute are
shown below as two samples. If homogeneity of the class (as
regards age) is a positive factor in learning, using SD, determine
which class will be easier to teach.
Regular MBA: 23, 25, 27, 22, 24, 21, 25, 20, 24, 25
Evening MBA: 27, 29, 34, 31, 35, 28, 32, 26, 25, 35
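A sketch of the comparison, assuming Python and the sample SD (the exercise describes the two classes as samples):

```python
from statistics import stdev

# Students' ages in the two programmes, as listed in Exercise 9.
regular = [23, 25, 27, 22, 24, 21, 25, 20, 24, 25]
evening = [27, 29, 34, 31, 35, 28, 32, 26, 25, 35]

sd_regular = stdev(regular)
sd_evening = stdev(evening)

print(round(sd_regular, 2), round(sd_evening, 2))  # 2.12 3.74
```

The regular class has the smaller SD, so by the exercise's criterion it is the more homogeneous of the two.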
41. Exercise 10
You are employed as a statistician for a company that
sells electronic equipment through salespersons. The
company has four salespersons (A, B, C, and D)
employed in a small town. The sales records (in
thousands of rupees) for the past six months for these
four salespersons are shown on the next slide. The MD
of the company wants to reward the salesperson who is
most consistent and also meets/surpasses the average
6-monthly target of Rs. 1,80,000. Who is the star
salesperson?
42. Data (sales in thousands of rupees)
Month    A    B    C    D
1      177  221  133  140
2      181  151  130  167
3      189  235  150  153
4      193  194  110  183
5      185  205  119  170
6      173  163  144  150
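One possible way to work through this data is sketched below in Python (an assumption). It interprets the target as a mean monthly sale of at least 180 (thousand rupees) over the six months, and it measures consistency by the sample SD; both interpretations are assumptions, not something the slides state.

```python
from statistics import mean, stdev

# Monthly sales (in thousands of rupees) for months 1 to 6.
sales = {
    "A": [177, 181, 189, 193, 185, 173],
    "B": [221, 151, 235, 194, 205, 163],
    "C": [133, 130, 150, 110, 119, 144],
    "D": [140, 167, 153, 183, 170, 150],
}
target = 180  # Rs. 1,80,000 expressed in thousands (assumed reading)

# Mean and sample SD for each salesperson.
summary = {name: (mean(v), stdev(v)) for name, v in sales.items()}

# Keep those who meet the target, then pick the smallest SD.
eligible = {name: sd for name, (avg, sd) in summary.items() if avg >= target}
star = min(eligible, key=eligible.get)

print(star)  # A
```

Under these assumptions only A (mean 183) and B (mean about 194.8) meet the target, and A's sales vary far less from month to month. Ranking by CV instead of SD gives the same result for this data.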
44. Formula for CV
CV = SD / Mean
CV is also called relative variation or relative
dispersion.
45. Exercise 11
A manufacturing company is considering employing
one of two training programmes. Two groups were
trained for the same task. Group 1 was trained by
Program A; Group 2 by Program B. For Group 1 the
time required to train the employees had an average of
32.11 hours and a variance of 68.09 hours squared. For
Group 2, the average was 19.75 hours and the variance
was 71.14 hours squared. Which training programme
has lesser relative variation (i.e. CV)? [Ans.:
Program A]
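Since this exercise gives variances rather than SDs, the square root has to be taken before dividing by the mean. A minimal Python sketch (an assumption; the helper name `cv_from_variance` is hypothetical):

```python
from math import sqrt

def cv_from_variance(mean_value, variance):
    """Coefficient of variation: SD / mean, where SD = sqrt(variance)."""
    return sqrt(variance) / mean_value

cv_a = cv_from_variance(32.11, 68.09)  # Program A
cv_b = cv_from_variance(19.75, 71.14)  # Program B

print(f"{cv_a:.1%} {cv_b:.1%}")  # 25.7% 42.7%
```

The two variances are close, but Program B's much smaller mean makes its relative variation considerably larger, which agrees with the stated answer.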
46. Exercise 12
The Board of Directors of a company that is considering
acquiring one of two companies is closely examining
their inclinations towards risk. During the past five
years, the first company’s returns on investments had
an average of 28% and a standard deviation of 5.3%.
The second company’s returns had an average of 37.8%
and a standard deviation of 4.8%. If we consider risk to
be associated with greater relative dispersion, which of
these two companies has pursued a riskier strategy?
[Ans.: Company 1]
47. Application of CV [training domain]
Let Suresh and Aruna be two trainers. Suresh
has an average audience rating score of 65 with
a standard deviation of 5, and Aruna has an
average audience rating score of 75 with a
standard deviation of 8. Who is more
inconsistent in delivery quality?
48. Contd…
CV (Suresh) = 5 / 65 = 7.7%
CV (Aruna) = 8 / 75 = 10.7%
Since CV of Aruna is higher, she is more
inconsistent in terms of delivery quality as
perceived by the audience.
49. Fundamentals!
Do not compare the SDs of two data sets with
different means to judge their relative dispersion;
compare their CVs and then decide.
50. Excel and R formulas
Measure              Excel                  R
Mean                 average()              mean()
Median               median()               median()
Mode                 mode()                 No standard function
Growth factor        D(n) / D(n-1)          Same
Geometric mean       geomean()              No standard function
Standard deviation   stdev()                sd()
Variance             var()                  var()