BUS308 – Week 1 Lecture 2
Describing Data
Expected Outcomes
After reading this lecture, the student should be familiar with:
1. Basic descriptive statistics for data location
2. Basic descriptive statistics for data consistency
3. Basic descriptive statistics for data position
4. Basic approaches for describing likelihood
5. Difference between descriptive and inferential statistics
What this lecture covers
This lecture focuses on describing data and how these descriptions can be used in an
analysis. It also introduces and defines some specific descriptive statistical tools and results.
Even if we never become a data detective or do statistical tests, we will be exposed to and
bombarded with statistics and statistical outcomes. We need to understand what they are telling
us and how they help uncover what the data means on the “crime,” AKA research question/issue.
How we obtain these results will be covered in lecture 1-3.
Detecting
In our favorite detective shows, starting out always seems difficult. They have a crime,
but no real clues or suspects, no idea of what happened, no “theory of the crime,” etc. Much as
we are at this point with our question on equal pay for equal work.
The process followed is remarkably similar across the different shows. First, a case or
situation presents itself. The heroes start by understanding the background of the situation and
those involved. They move on to collecting clues and following hints, some of which do not pan
out to be helpful. They then start to build relationships between and among clues and facts,
tossing out ideas that seemed good but lead to dead-ends or non-helpful insights (false leads,
etc.). Finally, a conclusion is reached and the initial question of “who done it” is solved.
Data analysis, and specifically statistical analysis, is done quite the same way as we will
see.
Descriptive Statistics
Week 1 Clues
We are interested in whether or not males and females are paid the same for doing equal
work. So, how do we go about answering this question? The “victim” in this question could be
considered the difference in pay between males and females, specifically when they are doing
equal work. An initial examination (Doc, was it murder or an accident?) involves obtaining
basic information to see if we even have cause to worry.
The first action in any analysis involves collecting the data. This generally involves
conducting a random sample from the population of employees so that we have a manageable
data set to operate from. In this case, our sample, presented in Lecture 1, gave us 25 males and
25 females spread throughout the company. A quick look at the sample by HR provided us with
assurance that the group looked representative of the company workforce we are concerned with
as a whole. Now we can confidently collect clues to see if we should be concerned or not.
As with any detective, the first issue is to understand the “who”
and “what” about the
victim. In this case, we need to use our sample to understand
basic information about how males
and females are paid. Understanding data sets typically
involves looking at several characteristics.
These descriptive measures describe the data set. Typical
descriptive measures include:
• Measures of location such as the average (AKA mean), the
median (middle
point), and mode (most often occurring value if it exists).
• Measures of consistency such as range (largest value minus
the smallest value),
variance, and standard deviation.
• Measures of position showing where a single data point is
within the data set, such
as percentile and rank.
• Measures of likelihood showing the probability of obtaining
specific outcomes.
Note: Descriptive statistics describe a particular data set and
can only be used for that
data set. However, often we want to use a sample to “infer”
back to a larger population. In this
case, we would use inferential statistics. Most measures, except
for variance and standard
deviation, are calculated the same way. We will see the specific
difference for those two later in
this lecture.
The key to whether we have descriptive statistics or inferential
statistics lies with the
group we are taking the measures on. If we are only concerned
with that group, we use
descriptive statistics. If, however, we want to use that group to
make inferences, claims, and
conclusions about a larger population, then we take a random
sample from the population and
use inferential statistics (allowing us to infer back to the
population). Our class data sets – both
the lecture and homework – are random samples from a larger
population, so we will basically
be using inferential statistical measures.
Note that this is not the complete list of possible descriptive
statistics. Excel’s
Descriptive Statistics function (described in Lecture 3 for this
week) includes a couple of
measures that focus on data distribution shape. These have
some specialized uses that we will
not be getting into.
Location Measures
Perhaps the most often asked question about data sets is what is
the average? The intent
is to get a measure that shows us the center of the data.
Unfortunately, average is a somewhat
imprecise term that could mean all three of our measures of
location identified above. So, as
analysts we tend to be more precise and use mean, median, and
mode.
While these all tell us something about where the data might be
clustered, they can
provide very different views of the data. An illustration of this comes from an example the author heard back in high school. At that time, the mean per capita
income for citizens of Kuwait was
about $25,000; the median income was around $125; and the
mode was $25! The very high (due
to oil revenues) income of the Royal family accounted for much
of this difference, but just look
at the different impressions we get about the country depending
on which value we look at.
• Mean, AKA average, is the sum of all the values divided by
the count. This can be
considered the “weighted center” of the data set. For example,
the mean of 1, 2, 3, 4, and
5 = (1+2+3+4+5)/5 = 15/5 = 3. The mean is generally the best
measure for any data set as
it uses all of the data values and requires interval or ratio level
data. Thus, while we can
average salary, compa-ratio, seniority, etc., we cannot average
gender or gender1 (even if
one is coded in numbers) or grade in our data set.
• The median is the middle value in an ordered (listed from low
to high) data set. This is
the “physical center” of a data set. For example, the median of
1, 2, 3, 4, and 5 = 3, the
middle value. If we have an even number of values, the median
is the average of the
middle two values. Medians can be found on ordinal, interval,
or ratio level data.
• The mode is the most frequently occurring value. This is more
or less the “popular
center” as it is where most numbers group together. A data set
may have no mode, one mode, or several. Modes may occur with any level of data. The data
set 1,1,2,2,2,2,3,8,8,9
has a primary mode of 2, and two secondary modes of 1 and 8.
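In this course we will get these measures from Excel (AVERAGE, MEDIAN, and MODE), but for readers who like to see the arithmetic, here is a short Python sketch of the same three measures using the lecture's example values:

```python
import statistics

data = [1, 2, 3, 4, 5]
print(statistics.mean(data))    # 3, the "weighted center"
print(statistics.median(data))  # 3, the "physical center"

# The lecture's mode example: 2 occurs most often (four times)
mode_data = [1, 1, 2, 2, 2, 2, 3, 8, 8, 9]
print(statistics.mode(mode_data))  # 2
```

Note that the library reports only the single most frequent value; the "secondary modes" the lecture mentions would need a separate frequency count.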
Consistency/Variation Measures
While they do not have the popularity of their location cousins,
knowing the consistency
or variation within the data is as important as, some say even more important than, knowing the
central tendency for us to understand what the data is trying to
tell us. Very consistent data, with
little variation, has a mean that is very representative of the
data and is unlikely to change much
if we resample the population. Data with a large amount of
variation tends to have unstable
means, meaning that these values would change a lot with
multiple samples. Inconsistent data
(having large variation) is often a problem for businesses,
particularly for manufacturing
operations, as it means the results they produce differ and might
often not meet the quality
specifications. Predictions based on data with large variations
are rarely useful. Consider
attempting to estimate how long it would take you to get to
work if your route had frequent
traffic accidents that made the travel time different every day.
The key measures of variation are:
• Range, which equals the maximum value minus the minimum
value. For our
example data set of 1, 2, 3, 4, and 5, the range is 5 – 1 = 4.
• Variance, which is the average of the squared differences between each value in the data set and the mean. To get the variance, find the mean of the data,
the mean of the data,
subtract this value from each of the data points, square this
result (to get rid of the
negative differences), add them up and divide by the total
count. For our example
data set, this would look like:
Value  Mean  Difference  Squared
  1     3       -2          4
  2     3       -1          1
  3     3        0          0
  4     3        1          1
  5     3        2          4
Sum of squared differences = 10
Variance = 10/5 = 2
The problem with variance is that it is expressed in units squared.
So, if our data set
were dollars, the variance would be 2 dollars squared. How
should we interpret
dollars squared? In general, we do not; we use the next measure instead.
• Standard Deviation is the (positive) square root of the
variance. It returns the
dispersion measure back to one that is in the same units as the
original data, so we can
compare it to the data values. For our example, the standard
deviation is the square root of 2 dollars squared, or about 1.4 dollars. This much easier-to-understand measure means that, roughly speaking, the typical difference from the mean is 1.4 dollars (in our example above, which has a mean, or average, value of 3 dollars).
• Important point about the variance and standard deviation.
When we find these
values for a population, the entire group we are interested in,
we divide the numerator
by the sample size. However, when we have a sample of the
entire group (and want to
use this sample to estimate the population value for either
variance or standard
deviation), we create the inferential estimate by dividing the
numerator by the (count
– 1). This is an adjustment that increases the estimate to take
into account that we most
likely do not have the extreme low and extreme high value from
the population in our
sample, so its variation is less than the group we are using the
sample to describe.
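As a check on the arithmetic above, and on the population-versus-sample divisor, here is a Python sketch; Python's statistics module exposes both versions directly, just as Excel's VAR.P and VAR.S do:

```python
import statistics

data = [1, 2, 3, 4, 5]
mean = sum(data) / len(data)                     # 3.0
squared_diffs = [(x - mean) ** 2 for x in data]  # 4, 1, 0, 1, 4
print(sum(squared_diffs))                        # 10.0, the table's sum

# Population measures divide by the count (n = 5)
print(statistics.pvariance(data))                # 10/5 = 2
print(statistics.pstdev(data))                   # about 1.41

# The sample (inferential) estimate divides by count - 1
print(statistics.variance(data))                 # 10/4 = 2.5, larger
```

Notice that the sample estimate (2.5) is larger than the population value (2), which is exactly the adjustment the lecture describes.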
Just as detectives want to know what victims typically did and
how consistent they were
in their behavior around the time of the crime (For example:
Was he usually in this area, and if
not, why last night?), examining location and consistency
measures provide a similar perspective
on data variables and how they behave.
Applying the Information: Equal Pay Questions
OK, we can now start looking at our data set to see what the
numbers are hiding, and
develop some clues. As with all analysis, we start with
questions, then identify the tools to use
for those questions, and finally apply those tools to the data.
Our initial question is, do males
and females get equal pay for equal work? We also said we
needed to start with the question of
whether or not we had some measures that showed pay
comparisons between males and females.
Let’s take a look at some of the group and sub-group data. A
couple of measures that might
answer this question are:
• What are the group averages for each variable?
• What are the average male and female compa-ratios?
(Remember, you will work with
the Salary variable in the homework.)
• How consistent are the compa-ratios for each?
Note that we will be focusing on the compa-ratio data in the
lectures, while you will
focus on the same questions using salary in the weekly
homework assignments. As described,
compa-ratio is the result of dividing an employee’s salary by
their grade midpoint. It generally
ranges from about 0.80 to 1.20 in most pay plans. The value of
this measure is that it removes the
impact of different grades (each of which, we assume, represents a different level of work from other grades and contains equal work for all the jobs within the
grade). While not a perfect
measure, it is the start of measuring what is paid for equal
work. Side note: a grade’s midpoint is
generally pegged to the average market pay needed to hire new
employees into a job.
Week 1 Question 1
Question 1 asks for some summary statistics. Part A asks you to use the Excel Descriptive Statistics function (more on this in the third lecture), while part
B asks for some specific statistics
using the Fx function list (again, how to do this is covered in
lecture 3). The purpose for these
specific requests is to let you show mastery in using these two
Excel tools.
For part a, the mean, standard deviation, and range of the entire
compa-ratio data set are highlighted. This shows us that the mean is 1.062, the standard
deviation is 0.077, and the range
is 0.34. As interesting as these values are, they do not really
tell us anything. Measures
generally need to be compared to provide information.
This is where part b comes in. We see that the male and female
averages (1.056 and
1.069 (rounded) respectively) appear relatively close and are on
opposite sides of the overall
mean. The standard deviations are also close at 0.084 and 0.07
and surround the standard
deviation from the entire data set. The ranges are both smaller
than the overall range – meaning
that neither gender has both the smallest and largest value. The
female compa-ratios appear to
be slightly more clustered (less variation, more consistent) than
the male values from both the
range and standard deviation results.
Two things stand out. First, perhaps surprisingly, the females
appear to be paid more
relative to their grade midpoints than the males. Second,
measures of dispersion appear fairly
close with males being slightly more spread out than females.
So far, nothing seems to create
any concerns as we expect sample results to be a bit different
from the overall population values.
These differences seem to be small enough that they might be
simple sampling errors – if we
resampled (such as the data set you will be working with) the
male and female results might
switch.
Remember, when you do this problem in the homework, use the
salary data. As practice, you can copy the data set into a practice Excel file and try to replicate the same answers as shown
up in the lectures. Ask a question if you are unsure of how to
do this or do not get the same
results using the lecture dataset.
Position Measures
Often, we are interested in where within a data set a particular
measure falls. This opens
up the idea of distributions, how the data values are spread
across the range of values. Our
detectives would be looking at where victims typically went and
where they spent their time –
the pattern of their normal behavior.
Distributions. Location and consistency measures are important
for summarizing the
data set. Important as they are, they do not always give us all
the information we need. At times
we want to know how specific values fit within the data set.
For example, we might want to
compare the 10th highest male and female value to get a sense
of how relative positions within
the data range differ. This often means we need to examine the
distribution, or shape, of the
data. This shows us how all the data values relate to all of the
other values within the sample.
One important tool in analyzing data sets that we will not cover
(we cannot cover
everything, alas) is graphical analysis – looking at how data
sets are distributed when graphed.
One example will show how powerful these techniques can be.
One very common graph is a
histogram – a count of how many times a certain value occurs.
For example, if you tossed a pair
of dice 50 times, you might get the following results. The table shows the results we got. The histogram shows the distribution, or shape, of the data, with the x-axis (horizontal) showing the sum of the numbers on the two faces and the y-axis (vertical) showing how often we observed a particular result in our 50 tosses.

Outcomes from tossing a pair of dice
Sum showing:    2  3  4  5  6  7  8  9  10  11  12
Frequency seen: 1  2  4  3  9  12  7  5  4   1   2

[Histogram omitted: it plots the sums (2 through 12) on the x-axis against the frequencies above on the y-axis.]
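Counting how often each outcome occurs is all a histogram really needs. Here is a hypothetical Python sketch (the random seed is arbitrary, so the counts are illustrative, not the table's exact values):

```python
from collections import Counter
import random

random.seed(42)  # reproducible illustration only
# Toss a pair of dice 50 times and tally each sum
tosses = [random.randint(1, 6) + random.randint(1, 6) for _ in range(50)]
counts = Counter(tosses)

# A crude text histogram: one asterisk per observation
for total in range(2, 13):
    print(f"{total:2d} {'*' * counts.get(total, 0)}")
```

Run a few times with different seeds and you will see the same overall shape: a peak near 7 with thinner tails toward 2 and 12.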
A couple of things we can do with distributions can be easily
shown with this histogram.
First, we can find the center, in this case 7. We can see that
there are two tails around the center,
one to the left showing counts for values less than the middle
value of 7, and one to the right
showing how often we got values greater than 7. Visually, we
can see that the further away from
the center we get, the less often – or less likely – we are to get
any particular outcome. Ways to
quantify these observations are discussed below.
Our detectives use this logic when they attempt to find out
where all the persons of
interest were at the critical times. These approaches provide
more detailed information about how the data looks than the summaries of dispersion examined earlier.
Position Measures. Central tendency and variation are group
descriptive measures –
particularly the mean and standard deviation, which use all the
values in the data set in their
calculation. At times, however, we are concerned with specific values within the distribution,
such as:
• Quartiles,
• Percentiles, or centiles,
• The 5-number summary, or
• Z-score.
Quartiles and Percentiles. These measures divide the data into
groups, four with the
quartile and 100 with the percentile. One example that many
of you might be familiar with is
percentile (AKA percentile rank). This is often used when
doctors describe a child as in the 80th
percentile in height or weight for his/her age. This means that
80% of other children at this age
are at or below this particular child’s measure. Percentiles
range from 1 to 100%-tile, meaning
the lowest score would be at the first (or 1%-tile) and the
highest score would be at the 100%-
tile. Percentiles are very useful for comparing groups.
The general percentile formula lets us find percentiles, deciles
(the 10% divisions), and/or
quartiles, although Excel will do this for us. The formula is:
Lp = (n+1) * P/100; where
Lp is the location (rank position) of the desired percentile (using P = 25 would give the location of the first quartile, for example)
n is the size/count of the data set
P is the desired percentile; using 25, 50, or 75 gives the quartile
points, while using 10,
20, etc. would give the decile points.
Example: if we wanted to find the cut-off for the first (or
lowest) quartile of the data, also
known as the 25th percentile in a data set of 50, we would use
(50+1)*25/100 = 12.75, or
the 13th value from the bottom in an ordered list. By
convention, we always round up to
the next whole value.
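The Lp formula above can be written directly in code. A minimal sketch (note that Excel's built-in percentile functions interpolate between positions rather than rounding up, so their answers can differ slightly):

```python
import math

def percentile_location(n, p):
    """1-based position of the p-th percentile in an ordered list
    of n values, using Lp = (n + 1) * p / 100 and rounding up."""
    return math.ceil((n + 1) * p / 100)

print(percentile_location(50, 25))  # 12.75 rounds up to the 13th value
print(percentile_location(25, 25))  # 6.5 rounds up to the 7th value
```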
5-Number Summary. As its name suggests, the 5-number
summary identifies five key
values in a data set: minimum value, 1st quartile, median or 2nd
quartile, 3rd quartile, and
maximum values. For the compa-ratio data set used in the
lectures, the 5-number summary can
be found from the following table results. The 1st quartile, for
either gender group of 25 is
(25+1) * 25/100 = 6.5, or the 7th value in a rank-ordered list.
The 3rd quartile is located at 19.5.
For the entire sample of 50, these values are located at the 13th
and 39th rank ordered places,
respectively. Here is a 5-number summary for the overall
compa-ratio values in the sample:
Compa-ratio 5-number summary: 0.870, 1.013, 1.051, 1.134,
1.210.
More on this shortly.
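Putting the location rule to work, here is a sketch of a 5-number summary function. The data values below are made up for illustration; the lecture's actual summary comes from the compa-ratio sample:

```python
import math

def five_number_summary(values):
    """Minimum, 1st quartile, median, 3rd quartile, and maximum,
    using the lecture's Lp = (n + 1) * P / 100 location rule
    (rounding up to the next whole position)."""
    data = sorted(values)
    n = len(data)

    def at_percentile(p):
        rank = math.ceil((n + 1) * p / 100)  # 1-based position
        return data[min(rank, n) - 1]

    return (data[0], at_percentile(25), at_percentile(50),
            at_percentile(75), data[-1])

print(five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```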
Z-score. What is often of more value is looking at where
specific measures lie within
each range. The z-score measures show how far from the mean
a specific data point lies,
measured in standard deviation units. (I know that sounds
strange but keep reading.) The Z-
score provides a measure of how many standard deviations a
particular score lies from the mean,
and in what direction (above or below). The Z-score formula is:
Z = (individual score – mean) / (standard deviation).
Looking at this formula we can see that a score above the mean
would give us a positive
z-score, a score below the mean would give us a negative z-
score, and a score that exactly equals
the mean would give us a z-score of 0. For most data sets, the z-score ranges from -3.0 to +3.0.
For example, in our earlier data set (1, 2, 3, 4, and 5; see above for its descriptive statistics), the z-score for 2 would be (2-3)/1.4 =
-1/1.4 = -0.71. The negative
value means that 2 is below (or less than) the mean and is 0.71
standard deviation units away
from the mean (0.71 times the standard deviation of 1.4 = 1).
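The z-score arithmetic is a one-liner; here is a sketch using the example's rounded standard deviation of 1.4:

```python
def z_score(x, mean, std_dev):
    """Standard-deviation units between x and the mean."""
    return (x - mean) / std_dev

# Lecture example: data set 1-5, mean 3, standard deviation about 1.4
print(round(z_score(2, 3, 1.4), 2))  # -0.71: below the mean
print(z_score(3, 3, 1.4))            # 0.0: exactly at the mean
```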
Using this measure, we can easily examine relative placement of
scores. For example, a
compa-ratio of 1.06 would have Z-scores of 0.048 for males, -
0.129 for females, and -0.03 for
the overall group. (We will see how we got these values
shortly.) Thus, we can see that a person
with this compa-ratio is slightly above average for males, but
below average for the overall
group and for females.
Applying the information
Week 1 Question 2
Question 2 asks for a 5-number summary for the overall compa-
ratio data set as well as for the
male and female sub-groups within the data.
Note: Lecture 1-3 will show the same screen shot with the cell
formulas displayed.
One of the first observations, confirming an earlier one, is that neither the male nor the female data set has both the largest and smallest values.
The males appear to have a slightly
lower overall range of values than do the females. Some other
interesting observations include
the relatively similar 3rd quartile values for all three groups and
the lower midpoint for females,
meaning that more females are lower in the overall range than
males. More males are in the first
quartile than females. What other observations can you make
about how employees are
distributed within their respective compa-ratio ranges?
Week 1 Question 3
Often looking at how a single point lies within a data range is
helpful to get some insight
into how the distributions are positioned. Question 3 asks for
us to examine where the midpoint
of each gender’s dataset fits within the entire compa-ratio data
set. The PERCENTRANK.EXC function
returns a percentile rank, the percent of data values that fall at
or below a given value. For
example, the PERCENTRANK.EXC of the median would be 50%-tile
as half the values are above and
half below the median (as expected).
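The underlying idea is just "what share of the values sit at or below this point." A simplified, count-based Python stand-in (Excel's PERCENTRANK.EXC interpolates and excludes the endpoints, so its exact values differ from this sketch):

```python
def percent_rank(values, x):
    """Share of values at or below x - a rough, count-based
    stand-in for a percentile-rank calculation."""
    return sum(v <= x for v in values) / len(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(percent_rank(data, 5))   # 0.5: half the values are at or below 5
print(percent_rank(data, 10))  # 1.0: every value is at or below the max
```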
When we look at the male median, we see it falls at the 51st %-
tile, meaning it is slightly
above the overall median. The female median (half of the
female compa-ratios are below this
value remember) falls at the 33rd %-tile! This means that most
of the females are in the bottom
half of the distribution, even though (from Question 2), females
have the “higher” range.
Interesting.
The z score is a measure of relative placement based on the
mean rather than the median.
A value that equals the mean would have a z score of 0, a value
that is greater than the mean
would have a positive z score, while a value less than the mean
would have a negative z score.
Both the male and female medians fall below the overall compa-
ratio mean, with the female
median being relatively lower in the distribution. This is
consistent with what the percentile
scores suggested. Overall, these two questions are suggesting
that males and females are not
distributed the same within the compa-ratio data set.
Likelihood Measures
Likelihood, or probability, focuses on how often we can expect
to see an outcome. In
statistics, many decisions are made based upon how likely, or
more accurately, how unlikely it is
to see an outcome.
Probability
Probability is the likelihood that an event will happen. For
example, if we toss a fair
coin, we have a 50/50 chance, or a probability of .5 of getting a
head. If we pick a date between
1 and 7, we have a 1 in 7 chance (or a probability of 1/7 =
.14 or 14%) that it will be a
Wednesday in the current month. Statisticians recognize three
types of probabilities:
• Theoretical – based on a theory; for example, since a die (half of a pair of dice) has 6 sides, and our theory says each face is equally likely to show up when we toss it, we therefore expect to see a 1 on 1/6th of the tosses (assuming we toss it a lot).
• Empirical – count based; if we see that an accident happens on
our way to work 5
times (days) within every 4 weeks, we can say the probability of
an accident today is 5/20
or 25% since there are 20 work days within a 4-week period.
An empirical probability
equals the number of successes we see divided by the number of
times we could have
seen the outcome.
• Subjective – a guess based on some experience or feeling.
There are some basic probability rules that will be helpful
during the course. The
probability
• of something (an event) happening is called P(event),
• of two things happening together – called joint probability:
P(A and B),
• of either one or the other but not both events occurring – P(A
or B),
• of something occurring given that something else has
occurred, conditional probability:
P(A|B) (read as probability of A given B).
• Complement rule: P(not A) = 1 - P(A).
Two other terms are needed. Mutually exclusive means that the elements of
one data set do not belong to another – for example, males and
pregnant are mutually exclusive
data sets. The other term we frequently hear with probability is
collectively exhaustive – this
simply means that all members of the data set are listed.
Some rules, which apply for both theoretical and empirical
based probabilities, for
dealing with these different probability situations include:
• P(event) = (number of successes)/(number of attempts or possible outcomes)
• P(A and B) = P(A)*P(B) for independent events or
P(A)*P(B|A) for dependent events
(P(B|A) is the conditional probability: the probability of B occurring given that A has occurred.)
• P(A or B) = P(A) + P(B) – P(A and B); if A and B cannot
occur together (such as the
example of male and pregnant) then P(A and B) = 0
• P(A|B) = P(A and B)/P(B).
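These rules can be checked empirically on a small, made-up sample space. A sketch using one die, with A = "roll is even" and B = "roll is at most 4":

```python
# Sample space for one fair die
outcomes = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}      # roll is even
B = {1, 2, 3, 4}   # roll is at most 4

def p(s):
    """Empirical probability: favorable outcomes over all outcomes."""
    return len(s) / len(outcomes)

p_a_and_b = p(A & B)                  # joint: {2, 4} -> 2/6
p_a_or_b = p(A) + p(B) - p_a_and_b    # 3/6 + 4/6 - 2/6 = 5/6
p_a_given_b = p_a_and_b / p(B)        # conditional: (2/6)/(4/6) = 1/2
p_not_a = 1 - p(A)                    # complement rule: 1/2
print(p_a_and_b, p_a_or_b, p_a_given_b, p_not_a)
```

Counting the union directly ({1, 2, 3, 4, 6} is 5 outcomes of 6) confirms the P(A or B) rule.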
One of the more interesting uses of probabilities (other than
forecasting the likelihood of
rain on our days off) is comparing outcome likelihoods
for different groups.
• The probability of randomly picking a female [P(F)] is the
same as randomly picking
a male [P(M)] from the group = 25 specified outcomes/50
possible outcomes. This is
a simple empirical probability – counts divided by counts.
• We can get a bit more complicated, such as the probability of
picking a female from a
specific grade such as B – P(F|B), probability of picking a
female given (from) only
grade B. Again, this is empirical – we have 7 employees in
grade B, and 4 of these
are females, so P(F|B) = 4/7.
• Now, the probability of picking a female who is also in grade B (from the entire data set) is 4 females out of 50 = 4/50 = 0.08, empirically. We can find this using the P(A and B) formula referenced above: P(F and B) = P(F)*P(B|F), since the events of female and grade B are not independent. So, we know P(F) = .5 and P(B|F) = 4/25 (4 of the 25 females are in grade B), so by theory, P(Female and grade B) = .5 * .16 = 0.08, the same result.
• The complement rule is often helpful. If we want the probability of picking anyone EXCEPT a female in grade B, we could figure out the probability for every other gender-and-grade combination and add them together, OR we could simply say that the probability of not (Female and grade B) is 1 - P(Female and grade B), or 1 - 0.08 = 0.92. We will use this property of probabilities a lot in the rest of the class.
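The gender-and-grade bullets above reduce to a few lines of arithmetic. A sketch re-creating the lecture's counts (50 employees, 25 female, 7 in grade B of whom 4 are female):

```python
# Counts from the lecture's example
p_f = 25 / 50                  # P(F) = 0.5
p_b_given_f = 4 / 25           # P(B|F) = 0.16
p_f_and_b = p_f * p_b_given_f  # P(F and B) = 0.08, matching 4/50 directly
p_f_given_b = 4 / 7            # P(F|B): 4 of the 7 grade-B employees
p_not_f_and_b = 1 - p_f_and_b  # complement: 0.92
print(p_f_and_b, round(p_f_given_b, 2), p_not_f_and_b)
```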
As we can see, probabilities can show us a lot and can be
somewhat complex in determining
their values. The nice thing is that this is about as complicated
as it gets.
Applying the information
Week 1 Question 4
Question 4 gives us some probability values – how likely are we
to exceed the respective
gender midpoints in the entire data set. We are looking at the
empirical and normal curve
probabilities. If the data set is normally distributed, the
probabilities should be fairly close; if
not, we have a clue that the data might not be normally
distributed over the entire data range.
The male empirical probability of exceeding the midpoint in the entire data set is 50% (close to the 51st percentile value we got above) and 55% assuming normality –
fairly close. The female probabilities are 68 and 60%
respectively; again not too far off.
The data again support the idea that a lot of females are at the
higher end of the compa-
ratio distribution.
Drawing Conclusions: Week 1 Question 5
As interesting as the numbers are themselves, they mean very
little unless we can
interpret their meaning and apply that insight to the question(s)
at hand.
Recapping our results, we see that while the female overall average
compa-ratio is somewhat
higher than the males, the probability and distribution outcomes
suggest that males and females
are not distributed in the same fashion and that more of the
females are relatively lower in their
range than the males.
While we have not yet accounted for equal work, it appears that
there are some issues
suggesting that males and females are not paid the same within
the company. At least, there is enough to warrant more investigation.
On our detective shows, we might say that we have some
evidence, but not enough to
take it to the grand jury for an indictment yet.
Summary
This lecture looked at descriptive statistics and what they can
tell us about the data set.
We reviewed the questions that are asked in the Week 1
assignment and the answers for each
question using the COMPA-RATIO variable. The focus of this
lecture was on interpreting
presented results, as that is a more frequent activity for
professionals than actually developing
the measures.
Specifically, we looked at developing the following information.
Note that this was created by listing the tool as we introduced
it, the data requirements,
and then a typical question that would require this tool. By
copying this information to a second
Excel sheet and sorting the columns we can create a guide as to
when to use each tool, as shown
below.
Now, we move on to some specific ways to set up Excel to
provide the results that we
just looked at.
Before we do, however, please respond to Discussion Thread 2
for this week with your
initial response and responses to others over a couple of days
before moving on to reading the
second lecture for the week.
Please ask your instructor if you have any questions about this
material.