An Unsuccessful Attempt at Understanding
Statistics
Nadia Afroze Disha
Quantitative Analysis for Managers
What is Quantitative Analysis?
Quantitative Analysis for Managers is essentially statistical analysis, that is, working with numbers, to help managers make decisions.
Is decision-making a science or an art?
Decision-making is a bit of both: part art and part science.
What factors act as inputs for decision-making?
The Continuum of Decision-making Environment
Decision-making under Uncertainty
Most significant decisions made in today’s complex environment are formulated under a state of uncertainty. Conditions
of uncertainty exist when the future environment is unpredictable and everything is in a state of flux. The decision-
maker is not aware of all available alternatives, the risks associated with each and the consequences of each alternative
or their probabilities.
The manager does not possess complete information about the alternatives and whatever information is available may
not be completely reliable. In the face of such uncertainty, managers need to make certain assumptions about the
situation in order to provide a reasonable framework for decision-making. They have to depend upon their judgment
and experience for making decisions.
[Figure: Inputs for decision-making: manager's knowledge, intuition, judgment, capabilities, models and tools, risk-taking/risk-averse attitude, manager's mood, situation, information.]
[Figure: The continuum of the decision-making environment: Uncertainty, Risk, Certainty.]
Decision-making under Risk
When a manager lacks perfect information or whenever an information asymmetry exists, risk arises. Under a state of
risk, the decision maker has incomplete information about available alternatives but has a good idea of the probability
of outcomes for each alternative. While making decisions under a state of risk, managers must determine the probability
associated with each alternative on the basis of the available information and their experience.
Decision-making under Certainty
A condition of certainty exists when the decision-maker knows with reasonable certainty what the alternatives are, what
conditions are associated with each alternative, and the outcome of each alternative. Under conditions of certainty,
accurate, measurable, and reliable information on which to base decisions is available. The cause and effect
relationships are known and the future is highly predictable under conditions of certainty. Such conditions exist in case
of routine and repetitive decisions concerning the day-to-day operations of the business.
What are the differences between data and information?
Data are the facts or details from which information is derived. Data are simply facts or figures: the raw material of information, but
not information itself. Individual pieces of data are rarely useful alone. For data to become information, they need to
be put into context. In other words, information is data placed in context.
Data: Raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
Information: When data is processed, interpreted, organized, structured or presented in a given context so as to make it meaningful and useful, it is called information.
Example of data: In a certain class, each student’s age is one piece of data.
Example of information: The average age or range of ages of the students in a class is a piece of information that can be derived from the given data. However, when an individual is asked their age, their age is a piece of information.
What are the differences between possibility and probability?
Possibility describes an event that may or may not happen, and it is not always possible
to calculate how likely such an event is to occur. Possibility is a qualitative characteristic of an event. The probability of an
event is the likelihood or chance that the event occurs. Probability is the numerical
measure of the likelihood of an event.
Let us take the example of tossing a coin to understand it.
It is possible that the coin shows up a head or a tail or lands on its edge or a bird takes away the coin when tossed up in
the air or the coin is lost on tossing. There are numerous possibilities that may or may not happen; we have listed down
a few! And we could only calculate the likelihood of a few (not all!) possibilities.
In contrast, we can estimate the probabilities of the coin showing a head or a tail by performing a large number of
experiments, and thus conclude whether the coin is biased or fair. For a fair coin,
the probability of showing a head is ½, and the same for showing a tail. We simply cannot assign numerical
values to all the possibilities.
“There’s a 90% chance it will rain tomorrow.” – what does it mean?
It means that, over many days with conditions like tomorrow's, it rains on about 9 out of every 10 such days, or about 90 out of every 100.
Decision-making
For ABC Bread Manufacturing Company (the significance of bread here is that it is a perishable good; if 42 units are
produced while there is a demand for 40, 2 units will be wasted as they cannot be stored for sales the next day),
Demand = 40 – 44 units [25 different outcomes e.g. demand for 40 units and production of 40 units of bread is one
outcome]
Selling Price = $38/unit
Variable Cost = $25/unit
Fixed Cost = $200/day
Question: How many units should be produced?
Solution
Method 1: Payoff Matrix/Profit Matrix
Decision-making under Absolute Uncertainty
ABC is a new bread manufacturing company and they have absolutely no information regarding the market except that
the range of demand will be 40 units to 44 units.
Decision Alternatives (Productions)
Demand 40 41 42 43 44
40 320* 295 270 245 220
41 320 333 308 283 258
42 320 333 346 321 296
43 320 333 346 359 334
44 320 333 346 359 372
*Revenue = 40 X 38 = $1520
(-) Variable Cost = 40 X 25 = $1000
(-) Fixed Cost = $200
Profit = $320
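The whole payoff matrix can be generated from the same three steps (revenue, minus variable cost, minus fixed cost). The following is a minimal sketch; the names `profit`, `payoff` and `LEVELS` are chosen here for illustration, and it assumes that unsold bread is simply wasted and unmet demand is simply lost, as in the example.

```python
# Figures from the ABC Bread example; all names are illustrative.
PRICE, VARIABLE_COST, FIXED_COST = 38, 25, 200
LEVELS = range(40, 45)  # demand and production both run from 40 to 44


def profit(demand, production):
    """Daily profit when `production` units are baked and `demand` units are wanted."""
    units_sold = min(demand, production)  # cannot sell more than was baked or wanted
    return PRICE * units_sold - VARIABLE_COST * production - FIXED_COST


# payoff[demand][production], laid out like the table above
payoff = {d: {q: profit(d, q) for q in LEVELS} for d in LEVELS}

for d in LEVELS:
    print(d, [payoff[d][q] for q in LEVELS])
```

Each printed row corresponds to one demand level, and each entry in the row to one production level.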
The decision to manufacture can be anything from 40 units to 44 units, depending on the producer/manager.
• The manager who chooses to produce 40 units of bread has a highly risk-averse attitude: the profit is $320 whatever the demand.
• The manager who chooses to produce 44 units of bread has a highly risk-taking attitude: the profit ranges from $220 to $372.
• The manager who chooses to produce 42 units of bread takes a middle course: the best possible profit is not the highest available, but the worst possible profit is not the lowest either.
Decision-making under Risk
Now ABC has been in the market for some time and so, has some information regarding demands of their customers.
Decision Alternatives (Productions)
Probability Demand 40 41 42 43 44
10% 40 320 295 270 245 220
20% 41 320 333 308 283 258
40% 42 320 333 346 321 296
20% 43 320 333 346 359 334
10% 44 320 333 346 359 372
100% Expected Value (EV) 320 329.2 330.8 317.2 296*
Expected Value under Risk (EVR) = 330.8
*(10 X 220 + 20 X 258 + 40 X 296 + 20 X 334 + 10 X 372) / 100 = 296
Expected Value is a predicted value of a variable calculated as the sum of all possible values each multiplied by the
probability of its occurrence.
Under a risk environment, 5 Expected Values (EV) have been found for the 5 different production levels. The optimum (highest)
Expected Value is known as the Expected Value under Risk (EVR).
Here, EVR = $330.8, so 42 units will be produced.
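The expected values can be checked with a short script. This is a sketch under the example's assumptions; `DEMAND_PROB`, `expected_value` and the other names are illustrative, not part of the original notes.

```python
PRICE, VARIABLE_COST, FIXED_COST = 38, 25, 200
# Demand probabilities from the table above.
DEMAND_PROB = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}


def profit(demand, production):
    return PRICE * min(demand, production) - VARIABLE_COST * production - FIXED_COST


def expected_value(production):
    """Probability-weighted profit for one production level."""
    return sum(p * profit(d, production) for d, p in DEMAND_PROB.items())


# One EV per production level (rounded to avoid floating-point noise).
ev = {q: round(expected_value(q), 2) for q in DEMAND_PROB}

# EVR is the best (largest) of these expected values.
best_production = max(ev, key=ev.get)
evr = ev[best_production]

print(ev)
print(best_production, evr)
```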
Decision-making under Absolute Certainty
Now ABC knows the exact demand for each day in advance.
Decision Alternatives (Productions)
Probability Demand 40 41 42 43 44
10% 40 320 295 270 245 220
20% 41 320 333 308 283 258
40% 42 320 333 346 321 296
20% 43 320 333 346 359 334
10% 44 320 333 346 359 372
100% Expected Value (EV) 320 329.2 330.8 317.2 296
Expected Value under Certainty (EVC) = 346*
*320 X 0.1 + 333 X 0.2 + 346 X 0.4 + 359 X 0.2 + 372 X 0.1 = 346
Now this situation has no uncertain component. When ABC knows the demand is 40, they will produce 40 units, so their
profit will be $320. Similarly, when ABC knows the demand is 41, they will produce 41 units, so their profit will be $333.
There will never be any case of overstocking or understocking. In other words, there will be no Opportunity Loss (OL).
We have found 5 Expected Values for the 5 production levels. The Expected Value under Certainty (EVC) is the probability-weighted average of the best profit for each demand level (the diagonal of the table), which is $346.
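As a quick check, EVC is the weighted average of the profits obtained when production always matches demand. A minimal sketch, with illustrative names:

```python
PRICE, VARIABLE_COST, FIXED_COST = 38, 25, 200
DEMAND_PROB = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}


def profit(demand, production):
    return PRICE * min(demand, production) - VARIABLE_COST * production - FIXED_COST


# With perfect information, production is set equal to the known demand,
# so only the diagonal of the payoff table is ever realised.
evc = sum(p * profit(d, d) for d, p in DEMAND_PROB.items())
print(round(evc, 2))
```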
Method 2: Opportunity Loss Table
Opportunity Loss (OL): Loss incurred by not taking the best decision.
Contribution Margin (CM): Marginal profit per unit of sale (Selling Price – Variable Cost)
Opportunity Loss of Understocking (OLU) = Contribution Margin (CM)
Opportunity Loss of Overstocking (OLO) = Variable Cost (VC)
Decision Alternatives (Productions)
Probability Demand 40 41 42 43 44
10% 40 0 25 50 75 100
20% 41 13 0 25 50 75
40% 42 26 13 0 25 50
20% 43 39 26 13 0 25
10% 44 52 39 26 13 0
100% Expected Opportunity Loss (EOL) 26* 16.8 15.2 28.8 50
+ Expected Value (EV) 320 329.2 330.8 317.2 296
= EVC 346 346 346 346 346
Expected Opportunity Loss under Risk (EOLR) = 15.2
*0 X 0.1 + 13 X 0.2 + 26 X 0.4 + 39 X 0.2 + 52 X 0.1 = 26
Here, EOLR is $15.2, so the best decision will be to produce 42 units.
In the short term, these models may produce undesired outcomes. In the long term, however, they
help us make better decisions.
EVC – EVR = 346 – 330.8
= 15.2
= EOLR
= EVPI (Expected Value of Perfect Information)
So, EOLR = EVPI
The value of perfect information is the opportunity loss under risk.
Also, EOL + EV = EVC
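The opportunity loss table and the identity EOL + EV = EVC can be verified numerically. The sketch below uses illustrative names (`opportunity_loss`, `eol`, `ev`) and the OLU and OLO definitions given above.

```python
PRICE, VARIABLE_COST, FIXED_COST = 38, 25, 200
OLU = PRICE - VARIABLE_COST  # loss per unit of unmet demand (contribution margin)
OLO = VARIABLE_COST          # loss per unsold unit
DEMAND_PROB = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}


def profit(demand, production):
    return PRICE * min(demand, production) - VARIABLE_COST * production - FIXED_COST


def opportunity_loss(demand, production):
    if production < demand:                  # understocking
        return OLU * (demand - production)
    return OLO * (production - demand)       # overstocking (zero when equal)


def expect(f, production):
    """Probability-weighted average of f(demand, production) over demand."""
    return sum(p * f(d, production) for d, p in DEMAND_PROB.items())


eol = {q: expect(opportunity_loss, q) for q in DEMAND_PROB}
ev = {q: expect(profit, q) for q in DEMAND_PROB}

eolr = min(eol.values())                     # smallest expected opportunity loss
best_production = min(eol, key=eol.get)
evpi = eolr                                  # EVPI = EVC - EVR = EOLR

for q in DEMAND_PROB:
    print(q, round(eol[q], 2), round(ev[q], 2), round(eol[q] + ev[q], 2))
```

The last column printed is the same for every production level, which is the EVC.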
Method 3: Incremental Analysis
Decision Alternatives (Productions)
Probability Demand 40 41 42 43 44
10% 40 320 295 270 245 220
20% 41 320 333 308 283 258
40% 42 320 333 346 321 296
20% 43 320 333 346 359 334
10% 44 320 333 346 359 372
Probability of selling the additional unit:
40th: 100%
41st: 90%
42nd: 70%
43rd: 30%
44th: 10%
Additional unit: expected profit increase, and its meaning
41st unit: 13 X 0.9 – 25 X 0.1 = 9.2. Since the value is greater than 0, it is profitable to produce 41 units.
42nd unit: 13 X 0.7 – 25 X 0.3 = 1.6. Since the value is greater than 0, it is profitable to produce 42 units.
43rd unit: 13 X 0.3 – 25 X 0.7 = -13.6. Since the value is less than 0, it is not profitable to produce 43 units.
44th unit: 13 X 0.1 – 25 X 0.9 = -21.2. Since the value is less than 0, it is not profitable to produce 44 units.
So, the best decision is to produce 42 units.
OLU X P – OLO X (1 – P) = 0
OLU X P = OLO X (1 – P)
OLU X P + OLO X P = OLO
P = OLO / (OLU + OLO)
P = 25 / (13 + 25)
P = 65.8%
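The break-even probability and the resulting production level can be sketched as follows. The names `critical_p` and `prob_selling` are illustrative; the probability of selling the n-th unit is taken to be the probability that demand is at least n.

```python
OLU = 38 - 25   # gain if the additional unit is sold
OLO = 25        # loss if the additional unit is not sold
DEMAND_PROB = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}

# Break-even probability: OLU * P - OLO * (1 - P) = 0
critical_p = OLO / (OLU + OLO)


def prob_selling(unit):
    """Probability that the unit-th loaf is sold, i.e. demand >= unit."""
    return sum(p for d, p in DEMAND_PROB.items() if d >= unit)


def expected_gain(unit):
    p = prob_selling(unit)
    return OLU * p - OLO * (1 - p)


# Produce every unit whose selling probability is at least the break-even value.
production = max(u for u in DEMAND_PROB if prob_selling(u) >= critical_p)

print(round(critical_p * 100, 1))
for unit in sorted(DEMAND_PROB):
    print(unit, round(prob_selling(unit), 2), round(expected_gain(unit), 2))
print(production)
```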
Advantages and Disadvantages of Three Methods
Payoff Matrix/Profit Matrix
Advantages: Can be used to directly find EVR, EVC and EVPI. EOL can be found indirectly from EOL + EV = EVC.
Disadvantages: Calculations are complex and time-consuming. EOL cannot be found directly.
Opportunity Loss Table
Advantages: Can be used to directly find EOL and EVPI. Simpler calculations.
Disadvantages: Does not reveal all information such as EVR and EVC.
Incremental Analysis
Advantages: Very simple and more viable for larger numbers of production/demand values. Only SP, VC and P values are needed.
Disadvantages: Does not reveal EVR, EVC, EOL, EVPI.
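Note on the break-even probability P = OLO / (OLU + OLO) from Method 3: P is a minimum value, the lowest probability of selling an additional unit at which producing that unit is still worthwhile. Since P = 65.8% is greater than 30% (the probability of selling the 43rd unit), at most 42 units should be produced and the 43rd unit should not be produced. Even if P were only 30.05%, it would still be greater than 30%, so 42 units should be produced. Producing 43 units becomes worthwhile only when P is less than 30%.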
Three flips of a fair coin
Example 1. Suppose you have a fair coin: this means it has a 50% chance of landing heads up and a 50% chance of
landing tails up. Suppose you flip it three times and these flips are independent. What is the probability that it lands
heads up, then tails up, then heads up?
We're asking about the probability of this outcome:
(H,T,H)
Since the flips are independent this is
p(H,T,H) = pH pT pH
Since the coin is fair we have
pH = pT = 1/2
so
pH pT pH = ½ × ½ × ½ = 1/8
So the answer is 1/8, or 12.5%.
Example 2. In the same situation, what's the probability that the coin lands heads up exactly twice?
There are 2 × 2 × 2 = 8 outcomes that can happen:
(H,H,H), (H,H,T), (H,T,H), (T,H,H), (H,T,T), (T,H,T), (T,T,H), (T,T,T)
We can work out the probability of each of these outcomes. For example, we've already seen that the probability of (H,T,H) is
p(H,T,H) = pH pT pH = 1/8
since the coin is fair and the flips are independent. In fact, all 8 probabilities work out the same way. We always get 1/8.
In other words, each of the 8 outcomes is equally likely!
But we're interested in the probability that we get exactly two heads. That's the probability of this subset:
S = {(T,H,H), (H,T,H), (H,H,T)}
p(S) = p(T,H,H) + p(H,T,H) + p(H,H,T) = 3 × 1/8
So the answer is 3/8, or 37.5%.
Three flips of a very unfair coin
Example 3. Now suppose we have an unfair coin with a 90% chance of landing heads up and 10% chance of landing tails
up! What's the probability that if we flip it three times, it lands heads up exactly twice?
Again let's assume the coin flips are independent.
Most of the calculation works exactly the same way, but now our coin has
pH = 0.9, pT = 0.1
We're interested in the outcomes where the coin comes up heads twice, so we look at this subset:
S = {(T,H,H), (H,T,H), (H,H,T)}
The probability of this subset is
p(S) = p(T,H,H) + p(H,T,H) + p(H,H,T)
= pT pH pH + pH pT pH + pH pH pT
= 3 pT pH pH
= 3 × 0.1 × 0.9 × 0.9
= 0.3 × 0.81
= 0.243
So now the probability is just 24.3%.
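Examples 2 and 3 follow the same recipe: list every outcome, keep those with exactly two heads, and add up their probabilities. A minimal sketch of that recipe, assuming independent flips (the function name `prob_exact_heads` is illustrative):

```python
from itertools import product


def prob_exact_heads(p_heads, flips, heads_wanted):
    """Sum the probabilities of all outcomes with exactly `heads_wanted` heads."""
    p_tails = 1 - p_heads
    total = 0.0
    for outcome in product("HT", repeat=flips):   # e.g. ('H', 'T', 'H')
        if outcome.count("H") != heads_wanted:
            continue
        prob = 1.0
        for face in outcome:                      # independence: multiply per flip
            prob *= p_heads if face == "H" else p_tails
        total += prob
    return total


fair = prob_exact_heads(0.5, flips=3, heads_wanted=2)
unfair = prob_exact_heads(0.9, flips=3, heads_wanted=2)
print(round(fair, 3), round(unfair, 3))
```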
What is Statistics?
Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set
of experimental data or real-life studies. Statistics studies methodologies to gather, review, analyze and draw
conclusions from data.
Statistics is a term used to summarize a process that an analyst uses to characterize a data set. If the data set depends
on a sample of a larger population, then the analyst can develop interpretations about the population primarily based
on the statistical outcomes from the sample. Statistical analysis involves the process of gathering and evaluating data
and then summarizing the data into a mathematical form.
Statistics is used in various disciplines such as psychology, business, physical and social sciences, humanities,
government, and manufacturing. Statistical data is gathered using a sample procedure or other method.
Types of Statistics
Descriptive Statistics
Descriptive statistics is the type of statistics that probably springs to most people’s minds when they hear the word
“statistics.” In this branch of statistics, the goal is to describe. Use descriptive statistics to summarize and graph the data
for a group that you choose. This process allows you to understand that specific set of observations.
Descriptive statistics describe a sample. That’s pretty straightforward. You simply take a group that you’re interested in,
record data about the group members, and then use summary statistics and graphs to present the group properties.
[Figure: Statistics as a process: collecting data, organising data, analysing data, interpreting data.]
[Figure: Branches of statistics: Mathematical (development of tools and techniques) and Business (application of tools and techniques in decision-making); Descriptive Statistics (describes data) and Inferential Statistics (information is inferred from data).]
The field of statistics is divided into two major divisions: descriptive and inferential. Each of these segments is important, offering different techniques that accomplish different objectives. Descriptive statistics describe what is going on in a population or data set. Inferential statistics, by contrast, allow scientists to take findings from a sample group and generalize them to a larger population. The two types of statistics have some important differences.
With descriptive statistics, there is no uncertainty because you are describing only the people or items that you actually
measure. You’re not trying to infer properties about a larger population.
The process involves taking a potentially large number of data points in the sample and reducing them down to a few
meaningful summary values and graphs. This procedure lets us gain more insight into the data, and visualize it better, than
simply poring through row upon row of raw numbers.
There are a number of items that belong in this portion of statistics, such as:
• The average, or measure of the center of a data set, consisting of the mean, median, mode, or midrange
• The spread of a data set, which can be measured with the range or standard deviation
• Overall descriptions of data such as the five number summary
• Measurements such as skewness and kurtosis
• The exploration of relationships and correlation between paired data
• The presentation of statistical results in graphical form
These measures are important and useful because they allow scientists to see patterns among data, and thus to make
sense of that data. Descriptive statistics can only be used to describe the population or data set under study: The results
cannot be generalized to any other group or population.
Example of descriptive statistics
Suppose we want to describe the test scores in a specific class of 30 students. We record all of the test scores, calculate the summary statistics and produce graphs.
These results indicate that the mean score of this class is 79.18. The scores range from 66.21 to 96.53, and the distribution is symmetrically centered around the mean. A score of at least 70 on the test is acceptable. The data show that 86.7% of the students have acceptable scores.
Collectively, this information gives us a pretty good picture of this specific class. There is no uncertainty surrounding
these statistics because we gathered the scores for everyone in the class. However, we can’t take these results and
extrapolate to a larger population of students.
Inferential Statistics
Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample
was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a
population, we need to have confidence that our sample accurately reflects the population. This requirement affects
our process. At a broad level, we must do the following:
1. Define the population we are studying.
2. Draw a representative sample from that population.
3. Use analyses that incorporate the sampling error.
We don’t get to pick a convenient group. Instead, random sampling allows us to have confidence that the sample
represents the population. This process is a primary method for obtaining samples that mirror the population
on average. Random sampling produces statistics, such as the mean, that do not tend to be too high or too low. Using
a random sample, we can generalize from the sample to the broader population. Unfortunately, gathering a truly
random sample can be a complicated process.
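The claim that random sampling gives estimates that are right on average can be illustrated with a small simulation. Everything here is hypothetical: the population is 10,000 made-up values, and the seed, sample size and number of repetitions are arbitrary choices for the sketch.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# A made-up population of 10,000 values (for illustration only).
population = [rng.gauss(100, 15) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Draw many simple random samples of 50 and record each sample mean.
sample_means = [
    statistics.fmean(rng.sample(population, 50)) for _ in range(2_000)
]

# Individual sample means are a bit high or a bit low...
lowest, highest = min(sample_means), max(sample_means)
# ...but their average sits very close to the population mean.
average_of_sample_means = statistics.fmean(sample_means)

print(round(population_mean, 2), round(average_of_sample_means, 2))
print(round(lowest, 2), round(highest, 2))
```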
Difference between Descriptive Statistics and Inferential Statistics
As you can see, the difference between descriptive and inferential statistics lies in the process as much as it does the
statistics that you report.
Descriptive Statistics: For descriptive statistics, we choose a group that we want to describe and then measure all subjects in that group. The statistical summary describes this group with complete certainty (outside of measurement error). A study using descriptive statistics is simpler to perform.
Inferential Statistics: For inferential statistics, we need to define the population and then devise a sampling plan that produces a representative sample. The statistical results incorporate the uncertainty that is inherent in using a sample to understand an entire population. However, if you need evidence that an effect or relationship between variables exists in an entire population rather than only your sample, you need to use inferential statistics.
Populations, Parameters and Samples
Inferential statistics lets you draw conclusions about populations by using small samples. Consequently, inferential
statistics provide enormous benefits because typically you can’t measure an entire population. However, to gain these
benefits, you must understand the relationship between populations, subpopulations, population parameters, samples,
and sample statistics.
Populations
Populations can include people, but other examples include objects, events, businesses, and so on. In statistics, there
are two general types of populations. Populations can be the complete set of all similar items that exist. For example,
the population of a country includes all people currently within that country. It’s a finite but potentially large list of
members. However, a population can be a theoretical construct that is potentially infinite in size. For example, quality
improvement analysts often consider all current and future output from a manufacturing line to be part of a population.
Populations share a set of attributes that you define. For example, the following are populations:
• Stars in the Milky Way galaxy.
• Parts from a production line.
• Citizens of the United States.
Population Parameters
Parameter: A parameter is a value that describes a characteristic of an entire population, such as the population mean.
Because you can almost never measure an entire population, you usually don’t know the real value of a parameter. In
fact, parameter values are nearly always unknowable. While we don’t know the value, it definitely exists.
For example, the average height of adult women in the United States is a parameter that has an exact value—we just
don’t know what it is!
The population mean and standard deviation are two common parameters. In statistics, Greek symbols usually
represent population parameters, such as μ (mu) for the mean and σ (sigma) for the standard deviation.
Statistic: A statistic is a characteristic of a sample. If you collect a sample and calculate the mean and standard deviation,
these are sample statistics. Inferential statistics allow you to use sample statistics to make conclusions about a
population. However, to draw valid conclusions, you must use particular sampling techniques. These techniques help
ensure that samples produce unbiased estimates. Biased estimates are systematically too high or too low. You want
unbiased estimates because they are correct on average.
In inferential statistics, we use sample statistics to estimate population parameters. For example, if we collect a random
sample of adult women in the United States and measure their heights, we can calculate the sample mean and use it as
an unbiased estimate of the population mean. We can also perform hypothesis testing on the sample estimate and
create confidence intervals to construct a range that the actual population value likely falls within.
Representative Sampling and Simple Random Samples
In statistics, sampling refers to selecting a subset of a population. After drawing the sample, you measure one or more
characteristics of all items in the sample, such as height, income, temperature, opinion, etc. If you want to draw
conclusions about these characteristics in the whole population, it imposes restrictions on how you collect the sample.
If you use an incorrect methodology, the sample might not represent the population, which can lead you to erroneous
conclusions.
The most well-known method to obtain an unbiased, representative sample is simple random sampling. With this
method, all items in the population have an equal probability of being selected. This process helps ensure that the
sample includes the full range of the population. Additionally, all relevant subpopulations should be incorporated into
the sample and represented accurately on average. Simple random sampling minimizes the bias and simplifies data
analysis.
While this approach minimizes bias, it does not guarantee that your sample statistics exactly equal the population
parameters. Instead, estimates from a specific sample are likely to be a bit high or low, but the process produces
accurate estimates on average. Furthermore, it is possible to obtain unusual samples with random sampling; it’s just
not the expected result. Additionally, random sampling might sound a bit haphazard and easy to do, but neither is
true. Simple random sampling assumes that you systematically compile a complete list of all people or items that
exist in the population. You then randomly select subjects from that list and include them in the sample. It can be a very
cumbersome process.
Why Is Sampling Often Better than a Census?
• Reduces cost - both in monetary terms and staffing requirements.
• Reduces time needed to collect and process the data and produce results as it requires a smaller scale of operation.
• (Because of the above reasons) enables more detailed questions to be asked.
• Enables characteristics to be tested which could not otherwise be assessed. Examples are the life span of light bulbs and
the strength of springs. Testing every light bulb of a particular brand is not possible because the test destroys the
product, so only a sample of bulbs can be tested.
• Importantly, surveys lead to less respondent burden, as fewer people are needed to provide the required data.
• Results can be made available quickly
Some Negative Points of Sampling
• Data on sub-populations (such as a particular ethnic group) may be too unreliable to be useful.
• Data for small geographical areas also may be too unreliable to be useful.
• (Because of the above reasons) detailed cross-tabulations may not be practical.
• Estimates are subject to sampling error which arises as the estimates are calculated from a part (sample) of the
population.
• May have difficulty communicating the precision (accuracy) of the estimates to users.
Descriptive Statistics: Measure of Central Tendency
A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These
measures indicate where most values in a distribution fall and are also referred to as the central location of a
distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most
common measures of central tendency are the mean, median, and mode. Each of these measures calculates the
location of the central point using a different method.
Choosing the best measure of central tendency depends on the type of data you have.
The central tendency of a distribution represents one characteristic of a distribution. Another aspect is the variability
around that central value, which describes how far away the data points tend to fall from the center (see the section on
measures of dispersion). Distributions with the same central tendency (mean = 100) can actually be quite different: one may be
tightly clustered around the mean, while another is more spread out. It is crucial to understand that
the central tendency summarizes only one aspect of a distribution and that it provides an incomplete picture by itself.
Mean
The mean describes an entire sample with a single number that represents the center of the data. The mean is the
arithmetic average. You calculate the mean by adding up all of the observations and then dividing the total by the
number of observations.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the average apple weight is 6.2.
(5 + 5 + 6 + 7 + 8) / 5 = 31 / 5 = 6.2
The mean is sensitive to skewed data and extreme values. For data sets with these properties, the mean gets pulled
away from the center of the data. In these cases, the mean can be misleading because the most common values in the
distribution might not be near the mean.
Median
The median is the middle of the data. Half of the observations are less than or equal to it and half of the observations
are greater than or equal to it. The median is equivalent to the second quartile or the 50th percentile.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the median apple weight is 6 because it is the middle
value. If there is an even number of observations, you take the average of the two middle values.
The median is less sensitive than the mean to skewed data and extreme values. For data sets with these properties, the
mean gets pulled away from the center of the data. In these cases, the mean can be misleading because the most
common values in the distribution might not be near the mean.
For example, the mean might not be a good statistic for describing annual income. A few extremely wealthy individuals
can increase the overall average, giving a misleading view of annual incomes. In this case, the median is more
informative.
Mode
The mode is the value that occurs most frequently in a set of observations. You can find the mode simply by counting
the number of times each value occurs in a data set.
For example, if the weights of five apples are 5, 5, 6, 7, and 8, the apple weight mode is 5 because it is the most frequent
value.
Identifying the mode can help you understand your distribution.
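The three measures can be computed for the apple weights with Python's standard `statistics` module. The extra value of 50 added at the end is a made-up extreme value, included only to show how differently the mean and the median react.

```python
import statistics

apple_weights = [5, 5, 6, 7, 8]

mean_weight = statistics.mean(apple_weights)      # sum divided by count
median_weight = statistics.median(apple_weights)  # middle value of the sorted data
mode_weight = statistics.mode(apple_weights)      # most frequent value

# Add one hypothetical extreme value and recompute.
with_outlier = apple_weights + [50]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)

print(mean_weight, median_weight, mode_weight)
print(mean_with_outlier, median_with_outlier)
```

The mean more than doubles when the extreme value is added, while the median moves only from 6 to 6.5.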
Which is Best—the Mean, Median, or Mode?
When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal. In this case,
analysts tend to use the mean because it includes all of the data in the calculations. However, if you have a skewed
distribution, the median is often the best measure of central tendency.
When you have ordinal data, the median or mode is usually the best choice. For categorical data, you have to use the
mode.
In cases where you are deciding between the mean and median as the better measure of central tendency, you are also
determining which types of statistical hypothesis tests are appropriate for your data, if that is your ultimate goal:
parametric tests assess the mean, while nonparametric tests assess the median.
Descriptive Statistics: Measure of Dispersion
Suppose you are given a data series and someone asks you for some interesting facts about it. You
could find the mean, the median or the mode of the series and describe where its values are centered. But is that the
only thing you can do? Are measures of central tendency the only way to learn how the
observations are concentrated? This is where measures of dispersion come in.
As the name suggests, a measure of dispersion shows the scatter of the data. It tells how much the data values vary from one
another and gives a clear idea about the distribution of the data. A measure of dispersion shows the homogeneity or the
heterogeneity of the distribution of the observations.
Suppose you have four data sets of the same size and the mean is also the same, say, m. In all the cases the sum of the observations
will be the same. Here, the measure of central tendency does not give a clear and complete idea about the distribution for the
four given sets.
Can we get an idea about the distribution if we know the dispersion of the observations from one another
within and between the data sets? The main idea of a measure of dispersion is to show how the data are spread.
It shows how much the data vary from their average value.
Characteristics of Measures of Dispersion
• A measure of dispersion should be rigidly defined
• It must be easy to calculate and understand
• Not affected much by the fluctuations of observations
• Based on all observations
Classification of Measures of Dispersion
The measure of dispersion is categorized as:
(i) An absolute measure of dispersion:
• The measures which express the scattering of observation in terms of distances i.e., range, quartile deviation.
• The measure which expresses the variations in terms of the average of deviations of observations like mean deviation
and standard deviation.
(ii) A relative measure of dispersion:
We use a relative measure of dispersion for comparing the distributions of two or more data sets and for unit-free comparison.
They are the coefficient of range, the coefficient of mean deviation, the coefficient of quartile deviation, the coefficient of
variation, and the coefficient of standard deviation.
Range
The range is the most common and most easily understood measure of dispersion. It is the difference between the two extreme
observations of the data set. If X max and X min are the two extreme observations then
Range = X max – X min
Merits of Range
• It is the simplest of the measure of dispersion
• Easy to calculate
• Easy to understand
• Independent of change of origin
Demerits of Range
• It is based on only the two extreme observations and is therefore strongly affected by fluctuations
• It is not a reliable measure of dispersion
• Dependent on change of scale
Mean Absolute Deviation (MAD)
The Mean Absolute Deviation (MAD) of a set of data is the average distance between each data value and the mean.
The steps to find the MAD are:
1. find the mean (average)
2. find the difference between each data value and the mean
3. take the absolute value of each difference
4. find the mean (average) of these absolute differences
Merits of Mean Deviation
• Based on all observations
• It provides a minimum value when the deviations are taken from the median
• Independent of change of origin
Demerits of Mean Deviation
• Not easily understandable
• Its calculation is not easy and is time-consuming
• Dependent on the change of scale
• Ignoring the signs of the deviations is artificial and makes the measure unsuitable for further mathematical treatment
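Example: Erica enjoys posting pictures of her cat online. Here's how many "likes" the past 6 pictures each received:
10, 15, 15, 17, 18, 21
Find the mean absolute deviation.
Step 1: Calculate the mean.
The sum of the data is 96 total "likes" and there are 6 pictures, so the mean is 96 / 6 = 16.
Step 2: Calculate the distance between each data point and the mean.
The distances are |10 – 16| = 6, |15 – 16| = 1, |15 – 16| = 1, |17 – 16| = 1, |18 – 16| = 2 and |21 – 16| = 5.
Step 3: Add the distances together.
6 + 1 + 1 + 1 + 2 + 5 = 16
Step 4: Divide the sum by the number of data points.
MAD = 16 / 6 ≈ 2.67, so on average each picture is about 2.67 "likes" away from the mean.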
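The four steps of this example can also be followed in a few lines of Python. This is a small sketch rather than a library routine; the function name mean_absolute_deviation is chosen here for illustration.

```python
def mean_absolute_deviation(data):
    """Average distance between each value and the mean of the data."""
    m = sum(data) / len(data)                 # Step 1: the mean
    distances = [abs(x - m) for x in data]    # Step 2: distance of each value from the mean
    total = sum(distances)                    # Step 3: add the distances together
    return total / len(data)                  # Step 4: divide by the number of data points

likes = [10, 15, 15, 17, 18, 21]
print(sum(likes) / len(likes))                    # mean: 16.0
print(round(mean_absolute_deviation(likes), 2))   # MAD: 2.67
```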
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are. Its symbol is σ (the Greek letter sigma). The formula
is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"
The Variance is defined as: The average of the squared differences from the Mean.
To calculate the variance, follow these steps:
• Work out the Mean (the simple average of the numbers)
• Then for each number: subtract the Mean and square the result (the squared difference).
• Then work out the average of those squared differences.
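Why square the differences? Squaring makes every difference positive, so that differences above and below the Mean do not
cancel each other out, and it gives larger differences more weight.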
Example
You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
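Your first step is to find the Mean:
Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
so the mean (average) height is 394 mm.
Now we calculate each dog's difference from the Mean:
206, 76, –224, 36 and –94
To calculate the Variance, take each difference, square it, and then average the result:
Variance = (206² + 76² + (–224)² + 36² + (–94)²) / 5 = (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5 = 108,520 / 5
So the Variance is 21,704
And the Standard Deviation is just the square root of Variance, so:
σ = √21,704 = 147.32... ≈ 147 (to the nearest mm)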
And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one
Standard Deviation (147mm) of the Mean:
So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra
small.
Rottweilers are tall dogs. And Dachshunds are a bit short, right?
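The dog example can be reproduced with a short Python sketch. It follows the definition used in these notes, where the Variance is the plain average of the squared differences (dividing by the number of values, often called the population variance); the variable names are illustrative.

```python
from math import sqrt

heights_mm = [600, 470, 170, 430, 300]

mean = sum(heights_mm) / len(heights_mm)
# Squared difference of each height from the mean.
squared_diffs = [(h - mean) ** 2 for h in heights_mm]
# Variance as defined in these notes: the average of the squared differences.
variance = sum(squared_diffs) / len(heights_mm)
sd = sqrt(variance)

# Heights that lie within one standard deviation of the mean.
within_one_sd = [h for h in heights_mm if abs(h - mean) <= sd]

print(mean, variance, round(sd))   # 394.0 21704.0 147
print(within_one_sd)               # [470, 430, 300]
```

Three of the five dogs fall within one Standard Deviation of the Mean; the 600 mm and 170 mm dogs are the ones that stand out as extra large and extra small.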
Merits of Standard Deviation
• Squaring the deviations overcomes the drawback of ignoring signs in mean deviations
• Suitable for further mathematical treatment
• Least affected by the fluctuation of the observations
• The standard deviation is zero if all the observations are constant
• Independent of change of origin
Demerits of Standard Deviation
• Not easy to calculate
• Difficult to understand for a layman
• Dependent on the change of scale
Note:
• If the effect of error is non-linear, Standard Deviation should be used, not Mean Absolute Deviation.
• If the effect of error is linear, either Standard Deviation or Mean Absolute Deviation will do.
Stem and Leaf Diagram
A stem and leaf diagram shows numbers in a table format. It can be a useful way to organize data to find the median,
mode and range of a set of data.
A stem and leaf plot, or stem plot, is a technique used to classify either discrete or continuous variables. A stem and leaf
plot is used to organize data as they are collected.
A stem and leaf plot looks something like a bar graph. Each number in the data is broken down into a stem and a leaf,
thus the name. The stem of the number includes all but the last digit. The leaf of the number will always be a single digit.
Elements of a good stem and leaf plot
A good stem and leaf plot
• shows the first digits of the number (thousands, hundreds or tens) as the stem and shows the last digit (ones) as
the leaf.
• usually uses whole numbers. Anything that has a decimal point is rounded to the nearest whole number. For
example, test results, speeds, heights, weights, etc.
• looks like a bar graph when it is turned on its side.
• shows how the data are spread—that is, highest number, lowest number, most common number and outliers (a
number that lies outside the main group of numbers).
Tips on how to draw a stem and leaf plot
Once you have decided that a stem and leaf plot is the best way to show your data, draw it as follows:
• On the left hand side of the page, write down the thousands, hundreds or tens (all digits but the last one). These
will be your stems.
• Draw a line to the right of these stems.
• On the other side of the line, write down the ones (the last digit of a number). These will be your leaves.
For example, if the observed value is 25, then the stem is 2 and the leaf is the 5. If the observed value is 369, then the
stem is 36 and the leaf is 9. Where observations are accurate to one or more decimal places, such as 23.7, the stem is
23 and the leaf is 7. If the range of values is too great, the number 23.7 can be rounded up to 24 to limit the number of
stems.
In stem and leaf plots, tally marks are not required because the actual data are used.
Example
The marks that a class scored in a maths test are shown below.
A stem and leaf diagram is drawn by splitting the tens and units column. The tens column becomes the 'stem' and the
units become the 'leaf'.
The leaves in a stem and leaf diagram must be written in order so that the diagram can be read properly.
The 'leaf' should only ever contain single digits. Therefore, to put the number 124 in a stem and leaf diagram, the 'stem'
would be 12 and the 'leaf' would be 4. To put the number 78.9 into a stem and leaf diagram, the 'stem' would be 78 and
the 'leaf' would be 9. In this case, the key would indicate that the split between stem and leaf is a decimal.
The main advantage of a stem and leaf plot
The main advantage of a stem and leaf plot is that the data are grouped and all the original data are shown, too.
In Example 3 on battery life in the Frequency distribution tables section, the table shows that two observations occurred
in the interval from 360 to 369 minutes. However, the table does not tell you what those actual observations are. A
stem and leaf plot would show that information. Without a stem and leaf plot, the two values (363 and 369) can only
be found by searching through all the original data—a tedious task when you have lots of data!
When looking at a data set, each observation may be considered as consisting of two parts—a stem and a leaf. To make
a stem and leaf plot, each observed value must first be separated into its two parts:
• The stem is the first digit or digits;
• The leaf is the final digit of a value;
• Each stem can consist of any number of digits; but
• Each leaf can have only a single digit.
Example 1 – Making a stem and leaf plot
Each morning, a teacher quizzed his class with 20 geography questions. The class marked them together and everyone
kept a record of their personal scores. As the year passed, each student tried to improve his or her quiz marks. Every
day, Elliot recorded his quiz marks on a stem and leaf plot. This is what his marks looked like plotted out:
Analyse Elliot's stem and leaf plot. What is his most common score on the geography quizzes? What is his highest score?
His lowest score? Rotate the stem and leaf plot onto its side so that it looks like a bar graph. Are most of Elliot's scores
in the 10s, 20s or under 10? It is difficult to know from the plot whether Elliot has improved or not because we do not
know the order of those scores.
Example 2 – Making a stem and leaf plot
A teacher asked 10 of her students how many books they had read in the last 12 months. Their answers were as follows:
12, 23, 19, 6, 10, 7, 15, 25, 21, 12
Prepare a stem and leaf plot for these data.
Tip: The number 6 can be written as 06, which means that it has a stem of 0 and a leaf of 6.
The stem and leaf plot should look like this:
In Table 2:
• stem 0 represents the class interval 0 to 9;
• stem 1 represents the class interval 10 to 19; and
• stem 2 represents the class interval 20 to 29.
Usually, a stem and leaf plot is ordered, which simply means that the leaves are arranged in ascending order from left
to right. Also, there is no need to separate the leaves (digits) with punctuation marks (commas or periods) since each
leaf is always a single digit.
Using the data from Table 2, we made the ordered stem and leaf plot shown below:
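As a rough illustration, an ordered stem and leaf plot for whole numbers can be built in Python as sketched below. The function name stem_and_leaf and the "stem | leaves" text layout are choices made here, not part of the original notes; the data are the ten answers from Example 2.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of an ordered stem and leaf plot for whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):          # sorting first keeps every row in ascending order
        stem, leaf = divmod(v, 10)    # stem = all digits but the last, leaf = last digit
        leaves[stem].append(leaf)
    rows = []
    # Include every stem in the range, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}")
    return rows

books = [12, 23, 19, 6, 10, 7, 15, 25, 21, 12]
print("\n".join(stem_and_leaf(books)))
# 0 | 6 7
# 1 | 0 2 2 5 9
# 2 | 1 3 5
```

The value 6 ends up on stem 0 with leaf 6, exactly as the tip in Example 2 suggests.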
Example 3 – Making an ordered stem and leaf plot
Fifteen people were asked how often they drove to work over 10 working days. The number of times each person drove
was as follows:
5, 7, 9, 9, 3, 5, 1, 0, 0, 4, 3, 7, 2, 9, 8
Make an ordered stem and leaf plot for this table.
It should be drawn as follows:
Splitting the stems
The organization of this stem and leaf plot does not give much information about the data. With only one stem, the
leaves are overcrowded. If the leaves become too crowded, then it might be useful to split each stem into two or more
components. Thus, an interval 0–9 can be split into two intervals of 0–4 and 5–9. Similarly, a 0–9 stem could be split into
five intervals: 0–1, 2–3, 4–5, 6–7 and 8–9.
The stem and leaf plot should then look like this:
Note: The stem 0(0) means all the data within the interval 0–4. The stem 0(5) means all the data within the interval 5–9.
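Splitting each stem into a 0–4 half and a 5–9 half can be sketched in Python as follows. The function name split_stem_and_leaf is an assumption made for this illustration, and the stem labels follow the 0(0) and 0(5) notation of the note above; the data are the 15 answers from Example 3.

```python
def split_stem_and_leaf(values):
    """Ordered stem and leaf rows with each stem split into leaves 0-4 and 5-9."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        half = 0 if leaf < 5 else 5   # which half of the stem the leaf belongs to
        rows.setdefault((stem, half), []).append(leaf)
    return [
        f"{stem}({half}) | {' '.join(map(str, leaves))}"
        for (stem, half), leaves in sorted(rows.items())
    ]

drives = [5, 7, 9, 9, 3, 5, 1, 0, 0, 4, 3, 7, 2, 9, 8]
print("\n".join(split_stem_and_leaf(drives)))
# 0(0) | 0 0 1 2 3 3 4
# 0(5) | 5 5 7 7 8 9 9 9
```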
Example 4 – Splitting the stems
Britney is a swimmer training for a competition. The number of 50-metre laps she swam each day for 30 days is as
follows:
22, 21, 24, 19, 27, 28, 24, 25, 29, 28, 26, 31, 28, 27, 22, 39, 20, 10, 26, 24, 27, 28, 26, 28, 18, 32, 29, 25, 31, 27
1. Prepare an ordered stem and leaf plot. Make a brief comment on what it shows.
2. Redraw the stem and leaf plot by splitting the stems into five-unit intervals. Make a brief comment on what the new
plot shows.
Answers
1. The observations range in value from 10 to 39, so the stem and leaf plot should have stems of 1, 2 and 3. The ordered
stem and leaf plot is shown below:
The stem and leaf plot shows that Britney usually swims between 20 and 29 laps in training each day.
2. Splitting the stems into five-unit intervals gives the following stem and leaf plot:
Note: The stem 1(0) means all data between 10 and 14, 1(5) means all data between 15 and 19, and so on.
The revised stem and leaf plot shows that Britney usually swims between 25 and 29 laps in training each day. The values
1(0) | 0 = 10 and 3(5) | 9 = 39 could be considered outliers, a concept that will be described in the next section.
Example 5 – Splitting stems using decimal values
The weights (to the nearest tenth of a kilogram) of 30 students were measured and recorded as follows:
59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7, 61.6, 56.3, 61.9, 65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4, 58.4,
60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2
Prepare an ordered stem and leaf plot for the data. Briefly comment on what the analysis shows.
Answer
In this case, the stems will be the whole number values and the leaves will be the decimal values. The data range from
56.3 to 65.7, so the stems should start at 56 and finish at 65.
In this example, it was not necessary to split stems because the leaves are not crowded on too few stems; nor was it
necessary to round the values, since the range of values is not large. This stem and leaf plot reveals that the group with
the highest number of observations recorded is the 61.0 to 61.9 group.
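For decimal data such as these weights, the same idea can be sketched in Python by using the whole-number part as the stem and the tenths digit as the leaf. Working in whole tenths (round(w * 10)) is a choice made here to avoid floating-point surprises; the variable names are illustrative.

```python
weights = [59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7,
           61.6, 56.3, 61.9, 65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4,
           58.4, 60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2]

rows = {}
for tenths in sorted(round(w * 10) for w in weights):
    stem, leaf = divmod(tenths, 10)   # e.g. 592 -> stem 59, leaf 2
    rows.setdefault(stem, []).append(leaf)

# Print every stem from the smallest to the largest, including empty ones.
for stem in range(min(rows), max(rows) + 1):
    print(f"{stem} | {' '.join(map(str, rows.get(stem, [])))}")

# The stem with the most leaves is the busiest group.
busiest = max(rows, key=lambda s: len(rows[s]))
print(busiest, len(rows[busiest]))   # 61 9
```

The busiest stem is 61 with nine leaves, which agrees with the comment above that the 61.0 to 61.9 group has the most observations.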
Outliers
An outlier is an extreme value of the data. It is an observation value that is significantly different from the rest of the
data. There may be more than one outlier in a set of data. Sometimes, outliers are significant pieces of information and
should not be ignored. Other times, they occur because of an error or misinformation and should be ignored.
In the previous example, 56.3 and 65.7 could be considered outliers, since these two values are quite different from the
other values. By ignoring these two outliers, the previous example's stem and leaf plot could be redrawn as below:
When using a stem and leaf plot, spotting an outlier is often a matter of judgment. This is because, except when using
box plots (explained in the section on box and whisker plots), there is no strict rule on how far removed a value must be
from the rest of a data set to qualify as an outlier.
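The notes do not state the box plot rule itself. A widely used convention, assumed here, flags any value more than 1.5 times the interquartile range below the first quartile or above the third quartile. The sketch below applies that convention to the 30 weights of Example 5; the quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one of several conventions, and the function names are illustrative.

```python
from statistics import median

def quartiles(data):
    """First and third quartiles as medians of the lower and upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return median(lower), median(upper)

def iqr_outliers(data, k=1.5):
    """Values lying more than k * IQR outside the quartiles (assumed rule)."""
    q1, q3 = quartiles(data)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in data if x < low or x > high]

weights = [59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7,
           61.6, 56.3, 61.9, 65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4,
           58.4, 60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2]

print(quartiles(weights))      # (59.3, 61.7)
print(iqr_outliers(weights))   # [65.7]
```

Under these assumptions only 65.7 is flagged; 56.3 lies just inside the lower fence, which shows that a judgment made by eye and a fixed rule do not always agree.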
Features of distributions
When you assess the overall pattern of any distribution (which is the pattern formed by all values of a particular
variable), look for these features:
• number of peaks
• general shape (skewed or symmetric)
• centre
• spread
Number of peaks
Line graphs are useful because they readily reveal some characteristics of the data. (See the section on line graphs for
details on this type of graph.) The first characteristic that can be readily seen from a line graph is the number of high
points or peaks the distribution has. While most distributions that occur in statistical data have only one main
peak (unimodal), other distributions may have two peaks (bimodal) or more than two peaks (multimodal).
Examples of unimodal, bimodal and multimodal line graphs are shown below:
General shape
The second main feature of a distribution is the extent to which it is symmetric.
A perfectly symmetric curve is one in which both sides of the distribution would exactly match the other if the figure
were folded over its central point. An example is shown below:
A symmetric, unimodal, bell-shaped distribution—a relatively common occurrence—is called a normal distribution.
If the distribution is lop-sided, it is said to be skewed.
A distribution is said to be skewed to the right, or positively skewed, when most of the data are concentrated on the left
of the distribution. Distributions with positive skews are more common than distributions with negative skews.
Income provides one example of a positively skewed distribution. Most people make under $40,000 a year, but some
make quite a bit more, with a smaller number making many millions of dollars a year. Therefore, the positive (right) tail
on the line graph for income extends out quite a long way, whereas the negative (left) tail stops at zero. The right
tail clearly extends farther from the distribution's centre than the left tail, as shown below:
A distribution is said to be skewed to the left, or negatively skewed, if most of the data are concentrated on the right of
the distribution. The left tail clearly extends farther from the distribution's centre than the right tail, as shown below:
Centre and spread
Locating the centre (median) of a distribution can be done by counting half the observations up from the smallest.
Obviously, this method is impracticable for very large sets of data. A stem and leaf plot makes this easy, however,
because the data are arranged in ascending order. The mean is another measure of central tendency. (See the chapter
on central tendency for more detail.)
The amount of distribution spread and any large deviations from the general pattern (outliers) can be quickly spotted
on a graph.
Using stem and leaf plots as graphs
A stem and leaf plot is a simple kind of graph that is made out of the numbers themselves. It is a means of displaying
the main features of a distribution. If a stem and leaf plot is turned on its side, it will resemble a bar graph or histogram
and provide similar visual information.
Example 6 – Using stem and leaf plots as graphs
The results of 41 students' math tests (with a best possible score of 70) are recorded below:
31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54,
57, 39, 52, 45, 35, 51, 63, 42
1. Is the variable discrete or continuous? Explain.
2. Prepare an ordered stem and leaf plot for the data and briefly describe what it shows.
3. Are there any outliers? If so, which scores?
4. Look at the stem and leaf plot from the side. Describe the distribution's main features such as:
a) number of peaks
b) symmetry
c) value at the centre of the distribution
Answers
1. A test score is a discrete variable. For example, it is not possible to have a test score of 35.74542341....
2. The lowest value is 4 and the highest is 67. Therefore, the stem and leaf plot that covers this range of values looks
like this:
Note: The notation 2|4 represents stem 2 and leaf 4.
The stem and leaf plot reveals that most students scored in the interval between 50 and 59. The large number of
students who obtained high results could mean that the test was too easy, that most students knew the material well,
or a combination of both.
3. The result of 4 could be an outlier, since there is a large gap between this and the next result, 18.
4. If the stem and leaf plot is turned on its side, it will look like the following:
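The features asked for in question 4 can also be checked numerically. The sketch below is only an illustration: it counts the scores in each ten-point interval, prints one asterisk per score as a rough sideways bar graph, and finds the median with the standard library.

```python
from collections import Counter
from statistics import median

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54,
          26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4,
          54, 57, 39, 52, 45, 35, 51, 63, 42]

# Number of scores on each stem (0-9, 10-19, ..., 60-69).
counts = Counter(s // 10 for s in scores)
for stem in range(0, 7):
    print(f"{stem * 10:>2}-{stem * 10 + 9:<2} {'*' * counts[stem]}")

peak = max(counts, key=counts.get)
print(peak * 10, counts[peak])   # 50 14  (the 50-59 interval holds 14 scores)
print(median(scores))            # 48
```

The counts rise slowly from the low scores to a single peak in the 50–59 interval and then drop quickly, and the middle (21st) score is 48.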
a) The distribution has a single peak within the 50–59 interval.
b) Although there are only 41 observations, the distribution shows that most data are clustered at the right. The left tail
extends farther from the data centre than the right tail. Therefore, the distribution is skewed to the left or negatively
skewed.
c) Since there are 41 observations, the distribution centre (the median value) will occur at the 21st observation.
Counting 21 observations up from the smallest, the centre is 48. (Note that the same value would have been obtained
if 21 observations were counted down from the highest observation.)
Example:
The stem and leaf plot below shows the grade point averages of 18 students. The digit in the stem represents the ones
and the digit in the leaf represents the tenths. So for example 0 | 8 = 0.8, 1 | 2 = 1.2 and so on.
a) What is the range of the data in the stem and leaf plot?
b) How many students have a grade of 2 or more?
c) What is the mode of the grades?
d) What is the median of the grades?
Solution to Example:
a) range = maximum value - minimum value = 4.0 - 0.8 = 3.2
b) 7 + 4 + 1 = 12 students
c) two modes: 1.4 and 2.5
d) There are 18 data values, and they are already ordered in the stem and leaf diagram.
median = (the 9th value + the 10th value) / 2 = (2.5 + 2.5) / 2 = 2.5
Example:
The back to back stem and leaf plot below shows the exam grades (out of 100) of two sections. The digit in the stem
represents the tens and the digit in the leaf represents the ones. So for example 5 | 3 = 53 and so on.
a) How many students scored higher than 60 in section 1?
b) How many students scored higher than 60 in section 2?
c) What are the minimum and maximum scores in section 1?
d) What are the minimum and maximum scores in section 2?
e) Without counting, which section has more students scoring 80 or more?
f) Without counting, which section has more students scoring 50 or less?
Solution to Example:
a) 6 + 7 + 5 + 4 = 22 students
b) 8 + 6 + 2 + 2 = 18 students
c) minimum = 40 , maximum = 95
d) minimum = 41 , maximum = 91
e) section 1
f) section 2
Example:
The back to back stem and leaf plot below shows the LDL cholesterol levels (in milligrams per deciliter, mg/dL) of two
groups of people, smokers and non smokers. The digits in the stem represent the hundreds and tens and the digit in
the leaf represents the ones. So for example 11 | 8 = 118 and so on.
a) People with a cholesterol level of 129 or less are said to have a near ideal level of cholesterol. How many people, in
each group, have a near ideal level of cholesterol?
b) People with a cholesterol level between 130 and 159 inclusive are said to be borderline high. How many people,
in each group, are borderline high?
c) People with a cholesterol level between 160 and 189 inclusive are said to have a high level of cholesterol. How many
people, in each group, have a high level of cholesterol?
d) People with a cholesterol level of 190 or above are said to have a very high level of cholesterol. How many people, in
each group, have a very high level of cholesterol?
e) Comparing the two groups, which group has more people with a higher level of cholesterol?
Solution to Example:
a) smokers: 1 + 2 = 3 people , non smokers: 2 + 4 + 7 = 13 people
b) smokers: 3 + 4 + 5 = 12 people , non smokers: 7 + 6 + 3 = 16 people
c) smokers: 6 + 5 + 4 = 15 people , non smokers: 3 + 2 + 1 = 6 people
d) smokers: 3 + 2 = 5 people , non smokers: none
e) The group of smokers has more people with higher cholesterol.
Box Plot
Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of
the data. They also show how far the extreme values are from most of the data. A box plot is constructed from five
values: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We use these
values to compare how close other data values are to them.
To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and largest data
values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other
end of the box. Approximately the middle 50 percent of the data fall inside the box. The “whiskers” extend from the
ends of the box to the smallest and largest data values. The median or second quartile can be between the first and
third quartiles, or it can be one, or the other, or both. The box plot gives a good, quick picture of the data.
Note: You may encounter box-and-whisker plots that have dots marking outlier values. In those cases, the whiskers
are not extending to the minimum and maximum values.
Consider, again, this dataset.
1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5
The first quartile is two, the median is seven, and the third quartile is nine. The smallest value is one, and the largest
value is 11.5. The following image shows the constructed box plot.
The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value.
The median is shown with a dashed line.
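The five values behind this box plot can be computed with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, which reproduces the values quoted above; statistical software may use slightly different quartile conventions and give slightly different results. The function name five_number_summary is illustrative.

```python
from statistics import median

def five_number_summary(data):
    """Minimum, first quartile, median, third quartile and maximum."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]                                          # values below the median position
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]   # values above it
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

data = [1, 1, 2, 2, 4, 6, 6.8, 7.2, 8, 8.3, 9, 10, 10, 11.5]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)
print("IQR =", q3 - q1)   # width of the box
```

The box runs from the first quartile to the third quartile, and the whiskers run from the box out to the minimum and maximum.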
Note
It is important to start a box plot with a scaled number line. Otherwise the box plot may not be useful.
Example
The following data are the heights of 40 students in a statistics class.
59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68 68 69 70 70 70 70 70 71 71 72 72 73 74 74 75
77
Construct a box plot with the following properties:
• Minimum value = 59
• Maximum value = 77
• Q1: First quartile = 64.5
• Q2: Second quartile or median= 66
• Q3: Third quartile = 70
1. Each quarter has approximately 25% of the data.
2. The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (third
quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter
has the largest spread.
3. Range = maximum value – the minimum value = 77 – 59 = 18
4. Interquartile Range: IQR = Q3 – Q1 = 70 – 64.5 = 5.5.
5. The interval 59–65 has more than 25% of the data so it has more data in it than the interval 66 through 70 which
has 25% of the data.
6. The middle 50% (middle half) of the data has a range of 5.5 inches.
For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the
same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the
diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both
the third quartile and the median. For example, if the smallest value and the first quartile were both one, the median
and the third quartile were both five, and the largest value was seven, the box plot would look like:
In this case, at least 25% of the values are equal to one. Twenty-five percent of the values are between one and five,
inclusive. At least 25% of the values are equal to five. The top 25% of the values fall between five and seven, inclusive.
Example
Test scores for a college statistics class held during the day are:
99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70 72 68 32 79 90
Test scores for a college statistics class held during the evening are:
98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85 79 78 98 90 79 81 25.5
1. Find the smallest and largest values, the median, and the first and third quartile for the day class.
2. Find the smallest and largest values, the median, and the first and third quartile for the night class.
3. For each data set, what percentage of the data is between the smallest value and the first quartile? the first quartile
and the median? the median and the third quartile? the third quartile and the largest value? What percentage of
the data is between the first quartile and the largest value?
4. Create a box plot for each set of data. Use one number line for both box plots.
5. Which box plot has the widest spread for the middle 50% of the data (the data between the first and third quartiles)?
What does this mean for that set of data in comparison to the other set of data?
Solution:
1. Day class:
• Min = 32
• Q1 = 56
• M = 74.5
• Q3 = 82.5
• Max = 99
2. Night class:
• Min = 25.5
• Q1 = 78
• M = 81
• Q3 = 89
• Max = 98
3. Day class: There are six data values ranging from 32 to 56: 30%. There are six data values ranging from 56 to 74.5: 30%.
(The two values equal to 56 lie on the boundary and are counted in both of these intervals.) There are five data values
ranging from 74.5 to 82.5: 25%. There are five data values ranging from 82.5 to 99: 25%. There are 16 data values between
the first quartile, 56, and the largest value, 99: 16 of 20 is 80%, close to the 75% expected above the first quartile.
Night class:
5. The first data set has the wider spread for the middle 50% of the data. The IQR for the first data set is greater than
the IQR for the second set. This means that there is more variability in the middle 50% of the first data set.
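The comparison of the two classes can be checked with the sketch below. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted scores, which reproduces the Q1 and Q3 values given in the solution; the function name quartiles is illustrative.

```python
from statistics import median

def quartiles(data):
    """First and third quartiles as medians of the lower and upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return median(lower), median(upper)

day = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
       45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
night = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
         90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

iqr = {}
for name, scores in (("day", day), ("night", night)):
    q1, q3 = quartiles(scores)
    iqr[name] = q3 - q1
    print(name, "Q1 =", q1, "Q3 =", q3, "IQR =", iqr[name])
```

The day class has an IQR of 26.5 against 11 for the night class, so the middle 50% of the day scores is more spread out.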
SPSS : BASICS
General Functions
The Menu bar (“File”, “Edit” and so on) is located in the upper area.
In the lower left corner, two tabs are available: Data View and Variable View. When you start SPSS, Variable View is
the default.
File Types
SPSS uses three different types of files with different functions and extensions.
Options
The SPSS menu works similar to the menus in many other programs, such as Word or Excel. Some useful options are
listed below:
Variable View
In Variable View, different columns are displayed. Each line corresponds to a variable. A variable is simply a quantity of
something, which varies and can be measured, such as height, weight, number of children, educational level, gender
and so forth.
Options
To alter the variable options, click the cells. Some cells can be typed into directly, while for others you need to press
the arrows or dots that appear when you click in the cell. It is often possible to use “copy and paste” here – this
may be efficient when, for example, you have several variables with the same Values.
If you want to delete a variable, select the numbered cell to the left of the variable and then right-click and choose Clear.
Creating a New Data Set
If you have a questionnaire, you can easily create the corresponding data structure in Variable View in SPSS. For example:
Data View
Once the structure of the data set is determined, it is time to take a look at Data View. Access this view by clicking on
the tab named Data View in the lower left corner.
Here, each column corresponds to a variable, whereas each row corresponds to a case (most commonly an individual).
It is possible to change the order of the variables by highlighting a column and then dragging and dropping it. You may also
change the width of a column by placing the mouse over its right border (next to the name of the column),
pressing down the button and then dragging.
If you are creating a new data set, simply type in your data, one row (and one column) at a time. Use the left and right
arrow keys on your keyboard to move between cells.
Make sure that you have chosen the right Type of variable before you enter your data (i.e. Numeric or String).
Output
Everything you order in SPSS (e.g. graphs, tables, or analyses) ends up in a window called Output. In the area to the left,
all the different steps are listed. It is possible to collapse specific steps by clicking on the box with the minus sign (and
expand it again by clicking on the same box, now with a plus sign). In the area to the right, your actual output is shown.
First, you see the syntax for what you have ordered, and then you get the tables or graphs related to the specific
command.

Quantitative Analysis for Managers Notes.pdf

  • 1.
    An Unsuccessful Attemptat Understanding Statistics Nadia Afroze Disha
  • 2.
    Quantitative Analysis forManagers What is Quantitative Analysis? Quantitative Analysis for Manager is basically statistical analysis – dealing with numbers – to help managers in decision- making. Is decision-making science or arts? Decision-making is bits of both – art and science. What factors act as inputs for decision-making? The Continuum of Decision-making Environment Decision-making under Uncertainty Most significant decisions made in today’s complex environment are formulated under a state of uncertainty. Conditions of uncertainty exist when the future environment is unpredictable and everything is in a state of flux. The decision- maker is not aware of all available alternatives, the risks associated with each and the consequences of each alternative or their probabilities. The manager does not possess complete information about the alternatives and whatever information is available may not be completely reliable. In the face of such uncertainty, managers need to make certain assumptions about the Decision-making Manager's Knowledge Intuition Judgment Capabilities Models and Tools Risk- taking/Risk- aversing Attitude Manager's Mood Uncertainty Risk Situation Certainty Information
  • 3.
    situation in orderto provide a reasonable framework for decision-making. They have to depend upon their judgment and experience for making decisions. Decision-making under Risk When a manager lacks perfect information or whenever an information asymmetry exists, risk arises. Under a state of risk, the decision maker has incomplete information about available alternatives but has a good idea of the probability of outcomes for each alternative. While making decisions under a state of risk, managers must determine the probability associated with each alternative on the basis of the available information and his experience. Decision-making under Certainty A condition of certainty exists when the decision-maker knows with reasonable certainty what the alternatives are, what conditions are associated with each alternative, and the outcome of each alternative. Under conditions of certainty, accurate, measurable, and reliable information on which to base decisions is available. The cause and effect relationships are known and the future is highly predictable under conditions of certainty. Such conditions exist in case of routine and repetitive decisions concerning the day-to-day operations of the business. What are the differences between data and information? Data are the facts or details from which information is derived. Data are simply facts or figures – bits of information but not information itself. Individual pieces of data are rarely useful alone. For data to become information, data needs to be put into contexts. In other words, information provides context for data. Data Information Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized. When data is processed, interpreted, organized, structured or presented in a given context so as to make it meaningful and useful, it is called information. In a certain class, each student’s age is one piece of data. 
The average age or range of ages of the students in a class is a piece of information that can be derived from the given data. However, when an individual is asked their age, their age is a piece of information.

What are the differences between possibility and probability?
Possibility describes an event that may or may not happen, and it is not always possible to calculate how likely the event is to occur. Possibility is the qualitative characteristic of an event.
Probability of an event is the likelihood or chance that the event occurs. Probability is basically the numerical characteristic of the likelihood of an event.
Let us take the example of tossing a coin. It is possible that the coin shows a head or a tail, lands on its edge, is taken away by a bird while in the air, or is lost on tossing. There are numerous possibilities that may or may not happen; we have listed only a few, and we can calculate the likelihood of only some of them. By contrast, we can estimate the probabilities of the coin showing a head or a tail by performing a large number of tosses, and thus conclude whether the coin is biased or fair. For a fair coin, the probability of showing a head is ½ and the same for showing a tail. We simply cannot assign numerical values to all the possibilities.
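The idea of estimating a probability from a large number of tosses can be illustrated with a small simulation. This is only a sketch: the function name, the seed and the numbers of tosses are assumptions chosen for illustration, and the simulated coin stands in for a real one.

```python
import random

def estimate_head_probability(num_tosses, p_head=0.5, seed=0):
    """Estimate P(head) as the relative frequency of heads in simulated tosses.

    p_head is the true (normally unknown) chance of a head; the seed is an
    arbitrary choice that only makes the run repeatable.
    """
    rng = random.Random(seed)
    heads = sum(1 for _ in range(num_tosses) if rng.random() < p_head)
    return heads / num_tosses

# The estimate tends to settle near the true value as the tosses increase.
for n in (10, 1_000, 100_000):
    print(n, estimate_head_probability(n))
```

With only a handful of tosses the relative frequency can be far from ½; with many tosses it tends to stay close, which is what lets us judge whether a coin looks fair or biased.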
“There’s a 90% chance it will rain tomorrow.” What does it mean?
It means that, in the long run, it rains on about 9 out of every 10 days (or about 90 out of every 100 days) for which such a forecast is made.

Decision-making
For ABC Bread Manufacturing Company (bread matters here because it is a perishable good: if 42 units are produced while there is a demand for 40, 2 units will be wasted as they cannot be stored for sale the next day):
Demand = 40 to 44 units [25 different outcomes, e.g. demand for 40 units and production of 40 units of bread is one outcome]
Selling Price = $38/unit
Variable Cost = $25/unit
Fixed Cost = $200/day
Question: How many units should be produced?

Solution
Method 1: Payoff Matrix/Profit Matrix

Decision-making under Absolute Uncertainty
ABC is a new bread manufacturing company and has absolutely no information regarding the market except that demand will range from 40 units to 44 units.

            Decision Alternatives (Production)
Demand      40      41      42      43      44
40          320*    295     270     245     220
41          320     333     308     283     258
42          320     333     346     321     296
43          320     333     346     359     334
44          320     333     346     359     372

*Revenue = 40 × 38 = $1520; (-) Variable Cost = 40 × 25 = $1000; (-) Fixed Cost = $200; Profit = $320

The decision can be to manufacture anything from 40 units to 44 units, depending on the producer/manager.
• The manager who chooses to produce 40 units of bread has a highly risk-averse attitude.
• The manager who chooses to produce 44 units of bread has a highly risk-taking attitude.
• The manager who chooses to produce 42 units of bread does so because, while his profit is not maximized, his loss isn’t maximized either.

Decision-making under Risk
Now ABC has been in the market for some time and so has some information regarding the demands of its customers.
                              Decision Alternatives (Production)
Probability   Demand          40      41      42      43      44
10%           40              320     295     270     245     220
20%           41              320     333     308     283     258
40%           42              320     333     346     321     296
20%           43              320     333     346     359     334
10%           44              320     333     346     359     372
100%          Expected Value  320     329.2   330.8   317.2   296*
Expected Value under Risk (EVR) = 330.8

*(10 × 220 + 20 × 258 + 40 × 296 + 20 × 334 + 10 × 372) / 100 = 296

Expected Value is a predicted value of a variable, calculated as the sum of all possible values each multiplied by the probability of its occurrence.
Under a risk environment, 5 Expected Values (EV) have been found for the 5 different production levels. The optimum Expected Value is known as the Expected Value under Risk (EVR). Here, EVR = $330.8, so 42 units will be produced.

Decision-making under Absolute Certainty
Now ABC knows the exact demand for each day.

                              Decision Alternatives (Production)
Probability   Demand          40      41      42      43      44
10%           40              320     295     270     245     220
20%           41              320     333     308     283     258
40%           42              320     333     346     321     296
20%           43              320     333     346     359     334
10%           44              320     333     346     359     372
100%          Expected Value  320     329.2   330.8   317.2   296
Expected Value under Certainty (EVC) = 346*

*320 × 0.1 + 333 × 0.2 + 346 × 0.4 + 359 × 0.2 + 372 × 0.1 = 346

Now the situation has no uncertain component. When ABC knows the demand is 40, they will produce 40 units, so their profit will be $320. Similarly, when ABC knows the demand is 41, they will produce 41 units, so their profit will be $333. There will never be any case of overstocking or understocking. In other words, there will be no Opportunity Loss (OL).
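The payoff matrix and the expected values above can be recomputed from the price and cost figures. The following is a minimal sketch, not the author's own method of calculation; the variable and function names are assumptions made for illustration.

```python
# Figures taken from the ABC Bread example.
SELLING_PRICE = 38      # $ per unit
VARIABLE_COST = 25      # $ per unit
FIXED_COST = 200        # $ per day
PROBABILITY = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}  # by demand
LEVELS = list(PROBABILITY)  # 40..44, used for both demand and production

def profit(demand, production):
    """Daily profit: unsold bread is wasted, unmet demand is simply lost."""
    units_sold = min(demand, production)
    return units_sold * SELLING_PRICE - production * VARIABLE_COST - FIXED_COST

# Expected Value (EV) of each production level: probability-weighted profit.
expected_value = {
    q: sum(p * profit(d, q) for d, p in PROBABILITY.items()) for q in LEVELS
}

# EVR: the best EV when only the probabilities are known.
best_production = max(expected_value, key=expected_value.get)
evr = expected_value[best_production]

# EVC: production always matches demand, so only the diagonal is used.
evc = sum(p * profit(d, d) for d, p in PROBABILITY.items())

for q in LEVELS:
    print(q, round(expected_value[q], 1))
print("produce", best_production, "EVR", round(evr, 1), "EVC", round(evc, 1))
```

Rounding is applied only when printing, because the probabilities are stored as binary floating-point numbers.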
We have found 5 Expected Values for 5 production levels, and the Expected Value under Certainty (EVC) is $346.

Method 2: Opportunity Loss Table
Opportunity Loss (OL): Loss incurred by not taking the best decision.
Contribution Margin (CM): Marginal profit per unit of sale (Selling Price - Variable Cost)
Opportunity Loss of Understocking (OLU) = Contribution Margin (CM) = $13/unit
Opportunity Loss of Overstocking (OLO) = Variable Cost (VC) = $25/unit

                              Decision Alternatives (Production)
Probability   Demand          40      41      42      43      44
10%           40              0       25      50      75      100
20%           41              13      0       25      50      75
40%           42              26      13      0       25      50
20%           43              39      26      13      0       25
10%           44              52      39      26      13      0
100%          Expected Opportunity Loss (EOL)
                              26*     16.8    15.2    28.8    50
            + Expected Value (EV)
                              320     329.2   330.8   317.2   296
            = EVC             346     346     346     346     346
Expected Opportunity Loss under Risk (EOLR) = 15.2

*0 × 0.1 + 13 × 0.2 + 26 × 0.4 + 39 × 0.2 + 52 × 0.1 = 26

Here, EOLR is $15.2, the smallest EOL, so the best decision will be to produce 42 units.
In the short term, the models may produce undesired outcomes. However, in the long term, these models help us make more correct decisions.
EVC - EVR = 346 - 330.8 = 15.2 = EOLR = EVPI (Expected Value of Perfect Information)
So, EOLR = EVPI
The value of perfect information is the opportunity loss under risk.
Also, EOL + EV = EVC
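The opportunity loss table and the identity EOL + EV = EVC can be checked in the same way. This is a sketch under the definitions given above (OLU = contribution margin, OLO = variable cost); the names used in the code are illustrative assumptions.

```python
SELLING_PRICE = 38
VARIABLE_COST = 25
FIXED_COST = 200
OLU = SELLING_PRICE - VARIABLE_COST   # loss per unit of unmet demand: 13
OLO = VARIABLE_COST                   # loss per unit of wasted bread: 25
PROBABILITY = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}
LEVELS = list(PROBABILITY)

def opportunity_loss(demand, production):
    """Loss relative to the best decision for this demand (production = demand)."""
    if production > demand:
        return (production - demand) * OLO
    return (demand - production) * OLU

def profit(demand, production):
    return min(demand, production) * SELLING_PRICE - production * VARIABLE_COST - FIXED_COST

# Expected Opportunity Loss (EOL) for each production level.
eol = {q: sum(p * opportunity_loss(d, q) for d, p in PROBABILITY.items()) for q in LEVELS}
best_production = min(eol, key=eol.get)
eolr = eol[best_production]           # equals EVPI in this model

# Cross-check: EOL + EV gives the same EVC for every production level.
ev = {q: sum(p * profit(d, q) for d, p in PROBABILITY.items()) for q in LEVELS}
evc_by_level = {q: round(eol[q] + ev[q], 1) for q in LEVELS}

print({q: round(v, 1) for q, v in eol.items()})
print("produce", best_production, "EOLR = EVPI =", round(eolr, 1))
print(evc_by_level)
```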
Method 3: Incremental Analysis

                      Decision Alternatives (Production)
Probability   Demand  40      41      42      43      44
10%           40      320     295     270     245     220
20%           41      320     333     308     283     258
40%           42      320     333     346     321     296
20%           43      320     333     346     359     334
10%           44      320     333     346     359     372

Probability of selling the additional unit:
40th: 100%
41st: 90%
42nd: 70%
43rd: 30%
44th: 10%

Additional Unit   Expected Profit Increase     Value    Meaning
41st unit         13 × 0.9 - 25 × 0.1          9.2      Since the value is greater than 0, it is profitable to produce 41 units.
42nd unit         13 × 0.7 - 25 × 0.3          1.6      Since the value is greater than 0, it is profitable to produce 42 units.
43rd unit         13 × 0.3 - 25 × 0.7          -13.6    Since the value is less than 0, it is not profitable to produce 43 units.
44th unit         13 × 0.1 - 25 × 0.9          -21.2    Since the value is less than 0, it is not profitable to produce 44 units.

So, the best decision is to produce 42 units.

The break-even probability P of selling an additional unit is found by setting the expected profit increase to zero:
OLU × P - OLO × (1 - P) = 0
OLU × P = OLO × (1 - P)
OLU × P + OLO × P = OLO
P = OLO / (OLU + OLO)
P = 25 / (13 + 25)
P = 65.8%

The value of P is a minimum value: an additional unit is worth producing only if its probability of being sold is at least P. The 42nd unit sells with probability 70%, which is above 65.8%, so it should be produced. The 43rd unit sells with probability 30%, which is below 65.8%, so at most 42 units should be produced. Even if P were 30.05%, it would still be greater than 30%, so 42 units should be produced. The 43rd unit would be worth producing only if P fell below 30%.

Advantages and Disadvantages of Three Methods

Payoff Matrix/Profit Matrix
Advantages:
• Can be used to directly find EVR, EVC and EVPI
• EOL can be found indirectly from EOL + EV = EVC
Disadvantages:
• Calculations are complex and time-consuming
• EOL cannot be found directly

Opportunity Loss Table
Advantages:
• Can be used to directly find EOL and EVPI
• Simpler calculations
Disadvantages:
• Does not reveal all information, such as EVR and EVC

Incremental Analysis
Advantages:
• Very simple and more viable for larger numbers of production/demand values
• Only SP, VC and P values are needed
Disadvantages:
• Does not reveal EVR, EVC, EOL, EVPI
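The unit-by-unit reasoning of incremental analysis can be sketched as a short loop. The names below are assumptions for illustration; the probabilities of selling each additional unit are derived from the demand distribution rather than typed in.

```python
OLU = 13   # gain if the additional unit is sold (contribution margin)
OLO = 25   # loss if the additional unit is not sold (variable cost)
PROBABILITY = {40: 0.10, 41: 0.20, 42: 0.40, 43: 0.20, 44: 0.10}  # by demand

def probability_of_selling(unit):
    """The unit-th loaf is sold only when demand is at least that many units."""
    return sum(p for demand, p in PROBABILITY.items() if demand >= unit)

# Break-even probability from OLU * P - OLO * (1 - P) = 0.
critical_p = OLO / (OLU + OLO)

# Expected profit increase from producing each additional unit.
incremental_gain = {}
for unit in range(41, 45):
    p_sell = probability_of_selling(unit)
    incremental_gain[unit] = OLU * p_sell - OLO * (1 - p_sell)

# Start from 40 units (always sold) and add units while the gain is positive.
best_production = 40
for unit in range(41, 45):
    if incremental_gain[unit] <= 0:
        break
    best_production = unit

print("break-even P =", round(critical_p, 3))
print({u: round(g, 1) for u, g in incremental_gain.items()})
print("produce", best_production)
```

Checking `probability_of_selling(unit) >= critical_p` gives the same stopping point as checking the sign of the gain, which is why only SP, VC and P are needed for this method.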
Three flips of a fair coin

Example 1. Suppose you have a fair coin: this means it has a 50% chance of landing heads up and a 50% chance of landing tails up. Suppose you flip it three times and these flips are independent. What is the probability that it lands heads up, then tails up, then heads up?
We're asking about the probability of this outcome: (H,T,H)
Since the flips are independent this is
p(H,T,H) = p_H × p_T × p_H
Since the coin is fair we have p_H = p_T = 1/2, so
p_H × p_T × p_H = ½ × ½ × ½ = 1/8
So the answer is 1/8, or 12.5%.

Example 2. In the same situation, what's the probability that the coin lands heads up exactly twice?
There are 2 × 2 × 2 = 8 outcomes that can happen:
(H,H,H), (H,H,T), (H,T,H), (T,H,H), (H,T,T), (T,H,T), (T,T,H), (T,T,T)
We can work out the probability of each of these outcomes. For example, we've already seen that
p(H,T,H) = p_H × p_T × p_H = 1/8
since the coin is fair and the flips are independent. In fact, all 8 probabilities work out the same way. We always get 1/8. In other words, each of the 8 outcomes is equally likely!
But we're interested in the probability that we get exactly two heads. That's the probability of this subset:
S = {(T,H,H), (H,T,H), (H,H,T)}
p(S) = p(T,H,H) + p(H,T,H) + p(H,H,T) = 3 × 1/8
So the answer is 3/8, or 37.5%.

Three flips of a very unfair coin

Example 3. Now suppose we have an unfair coin with a 90% chance of landing heads up and a 10% chance of landing tails up! What's the probability that if we flip it three times, it lands heads up exactly twice? Again let's assume the coin flips are independent.
Most of the calculation works exactly the same way, but now our coin has
p_H = 0.9, p_T = 0.1
We're interested in the outcomes where the coin comes up heads twice, so we look at this subset:
S = {(T,H,H), (H,T,H), (H,H,T)}
The probability of this subset is
p(S) = p(T,H,H) + p(H,T,H) + p(H,H,T)
     = p_T p_H p_H + p_H p_T p_H + p_H p_H p_T
     = 3 p_T p_H p_H
     = 3 × 0.1 × 0.9 × 0.9
     = 0.3 × 0.81
     = 0.243
So now the probability is just 24.3%.
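The three examples follow the same recipe: list every outcome, multiply the per-flip probabilities (independence), and add up the outcomes in the subset of interest. A brute-force sketch of that recipe is shown below; the function name is an assumption made for illustration.

```python
from itertools import product

def probability_of_exactly_k_heads(p_head, flips, k):
    """Sum the probabilities of all outcomes that contain exactly k heads."""
    total = 0.0
    for outcome in product("HT", repeat=flips):   # e.g. ('H', 'T', 'H')
        if outcome.count("H") != k:
            continue
        probability = 1.0
        for face in outcome:                      # independent flips multiply
            probability *= p_head if face == "H" else 1 - p_head
        total += probability
    return total

print(probability_of_exactly_k_heads(0.5, flips=3, k=2))  # fair coin, Example 2
print(probability_of_exactly_k_heads(0.9, flips=3, k=2))  # unfair coin, Example 3
```

Enumerating outcomes is only practical for a small number of flips, but it mirrors the reasoning in the examples step by step.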
What is Statistics?
Statistics is a form of mathematical analysis that uses quantified models, representations and synopses for a given set of experimental data or real-life studies. Statistics studies methodologies to gather, review, analyze and draw conclusions from data.
Statistics is a term used to summarize a process that an analyst uses to characterize a data set. If the data set depends on a sample of a larger population, then the analyst can develop interpretations about the population primarily based on the statistical outcomes from the sample. Statistical analysis involves the process of gathering and evaluating data and then summarizing the data into a mathematical form.
Statistics is used in various disciplines such as psychology, business, physical and social sciences, humanities, government, and manufacturing. Statistical data is gathered using a sampling procedure or other method.
[Figure: Statistics as a cycle of collecting data, organising data, analysing data and interpreting data.]
[Figure: Statistics divided into Mathematical (development of tools and techniques) and Business (application of tools and techniques in decision-making).]

Types of Statistics
[Figure: STATISTICS divided into Descriptive Statistics (describes data) and Inferential Statistics (information is inferred from data).]

Descriptive Statistics
Descriptive statistics is the type of statistics that probably springs to most people’s minds when they hear the word “statistics.” In this branch of statistics, the goal is to describe. Use descriptive statistics to summarize and graph the data for a group that you choose. This process allows you to understand that specific set of observations.
Descriptive statistics describe a sample. You simply take a group that you’re interested in, record data about the group members, and then use summary statistics and graphs to present the group properties.
The field of statistics is divided into two major divisions: descriptive and inferential.
Each of these segments is important, offering different techniques that accomplish different objectives. Descriptive statistics describe what is going on in a population or data set. Inferential statistics, by contrast, allow scientists to take findings from a sample group and generalize them to a larger population. The two types of statistics have some important differences.
With descriptive statistics, there is no uncertainty because you are describing only the people or items that you actually measure. You’re not trying to infer properties about a larger population.
The process involves taking a potentially large number of data points in the sample and reducing them down to a few meaningful summary values and graphs. This procedure gives us more insight into the data, and a better way to visualize it, than simply poring through row upon row of raw numbers!
There are a number of items that belong in this portion of statistics, such as:
• The average, or measure of the center of a data set, consisting of the mean, median, mode, or midrange
• The spread of a data set, which can be measured with the range or standard deviation
• Overall descriptions of data such as the five number summary
• Measurements such as skewness and kurtosis
• The exploration of relationships and correlation between paired data
• The presentation of statistical results in graphical form
These measures are important and useful because they allow scientists to see patterns among data, and thus to make sense of that data. Descriptive statistics can only be used to describe the population or data set under study: the results cannot be generalized to any other group or population.

Example of descriptive statistics
Suppose we want to describe the test scores in a specific class of 30 students. We record all of the test scores, calculate the summary statistics and produce graphs. These results indicate that the mean score of this class is 79.18. The scores range from 66.21 to 96.53, and the distribution is symmetrically centered around the mean. A score of at least 70 on the test is acceptable. The data show that 86.7% of the students have acceptable scores.
Collectively, this information gives us a pretty good picture of this specific class. There is no uncertainty surrounding these statistics because we gathered the scores for everyone in the class. However, we can’t take these results and extrapolate to a larger population of students.

Inferential Statistics
Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population. This requirement affects our process. At a broad level, we must do the following:
1. Define the population we are studying.
2. Draw a representative sample from that population.
3. Use analyses that incorporate the sampling error.
We don’t get to pick a convenient group. Instead, random sampling allows us to have confidence that the sample represents the population. This process is a primary method for obtaining samples that mirror the population on average. Random sampling produces statistics, such as the mean, that do not tend to be too high or too low. Using a random sample, we can generalize from the sample to the broader population. Unfortunately, gathering a truly random sample can be a complicated process.

Difference between Descriptive Statistics and Inferential Statistics
As you can see, the difference between descriptive and inferential statistics lies in the process as much as it does in the statistics that you report.
Descriptive Statistics: For descriptive statistics, we choose a group that we want to describe and then measure all subjects in that group. The statistical summary describes this group with complete certainty (outside of measurement error).
Inferential Statistics: For inferential statistics, we need to define the population and then devise a sampling plan that produces a representative sample. The statistical results incorporate the uncertainty that is inherent in using a sample to understand an entire population.
A study using descriptive statistics is simpler to perform. However, if you need evidence that an effect or relationship between variables exists in an entire population rather than only in your sample, you need to use inferential statistics.

Populations, Parameters and Samples
Inferential statistics lets you draw conclusions about populations by using small samples.
Consequently, inferential statistics provide enormous benefits because typically you can’t measure an entire population. However, to gain these benefits, you must understand the relationship between populations, subpopulations, population parameters, samples, and sample statistics.

Populations
Populations can include people, but other examples include objects, events, businesses, and so on. In statistics, there are two general types of populations.
Populations can be the complete set of all similar items that exist. For example, the population of a country includes all people currently within that country. It’s a finite but potentially large list of members.
However, a population can be a theoretical construct that is potentially infinite in size. For example, quality improvement analysts often consider all current and future output from a manufacturing line to be part of a population.
Populations share a set of attributes that you define. For example, the following are populations:
• Stars in the Milky Way galaxy.
• Parts from a production line.
• Citizens of the United States.
Population Parameters
Parameter: A parameter is a value that describes a characteristic of an entire population, such as the population mean. Because you can almost never measure an entire population, you usually don’t know the real value of a parameter. In fact, parameter values are nearly always unknowable. While we don’t know the value, it definitely exists.
For example, the average height of adult women in the United States is a parameter that has an exact value; we just don’t know what it is!
The population mean and standard deviation are two common parameters. In statistics, Greek symbols usually represent population parameters, such as μ (mu) for the mean and σ (sigma) for the standard deviation.
Statistic: A statistic is a characteristic of a sample. If you collect a sample and calculate the mean and standard deviation, these are sample statistics. Inferential statistics allow you to use sample statistics to draw conclusions about a population. However, to draw valid conclusions, you must use particular sampling techniques. These techniques help ensure that samples produce unbiased estimates. Biased estimates are systematically too high or too low. You want unbiased estimates because they are correct on average.
In inferential statistics, we use sample statistics to estimate population parameters. For example, if we collect a random sample of adult women in the United States and measure their heights, we can calculate the sample mean and use it as an unbiased estimate of the population mean. We can also perform hypothesis testing on the sample estimate and create confidence intervals to construct a range that the actual population value likely falls within.

Representative Sampling and Simple Random Samples
In statistics, sampling refers to selecting a subset of a population. After drawing the sample, you measure one or more characteristics of all items in the sample, such as height, income, temperature, opinion, etc.
If you want to draw conclusions about these characteristics in the whole population, it imposes restrictions on how you collect the sample. If you use an incorrect methodology, the sample might not represent the population, which can lead you to erroneous conclusions.
The most well-known method to obtain an unbiased, representative sample is simple random sampling. With this method, all items in the population have an equal probability of being selected. This process helps ensure that the sample includes the full range of the population. Additionally, all relevant subpopulations should be incorporated into the sample and represented accurately on average. Simple random sampling minimizes bias and simplifies data analysis.
While this approach minimizes bias, it does not mean that your sample statistics exactly equal the population parameters. Instead, estimates from a specific sample are likely to be a bit high or low, but the process produces accurate estimates on average. Furthermore, it is possible to obtain unusual samples with random sampling; it’s just not the expected result.
Additionally, random sampling might sound a bit haphazard and easy to do, both of which are not true. Simple random sampling assumes that you systematically compile a complete list of all people or items that exist in the population. You then randomly select subjects from that list and include them in the sample. It can be a very cumbersome process.

Why is Sampling Often Better than a Census?
• Reduces cost, both in monetary terms and staffing requirements.
• Reduces the time needed to collect and process the data and produce results, as it requires a smaller scale of operation.
• (Because of the above reasons) enables more detailed questions to be asked.
• Enables characteristics to be tested which could not otherwise be assessed. Examples are the life span of light bulbs, the strength of springs, etc. Testing all light bulbs of a particular brand is not possible because the test destroys the product, so only a sample of bulbs can be tested.
• Importantly, surveys lead to less respondent burden, as fewer people are needed to provide the required data.
• Results can be made available quickly.

Some Negative Points of Sampling
• Data on sub-populations (such as a particular ethnic group) may be too unreliable to be useful.
• Data for small geographical areas also may be too unreliable to be useful.
• (Because of the above reasons) detailed cross-tabulations may not be practical.
• Estimates are subject to sampling error, which arises because the estimates are calculated from a part (sample) of the population.
• It may be difficult to communicate the precision (accuracy) of the estimates to users.
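The claim that a single random sample may be a bit high or low, while the procedure is accurate on average, can be illustrated with a simulation. Everything numeric below (the population size, the mean of 162 and spread of 7, the sample size and the seed) is a made-up assumption used only to have a population to draw from.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed, only for repeatability

# Hypothetical population: 10,000 heights in cm (invented for illustration).
population = [rng.gauss(162, 7) for _ in range(10_000)]
population_mean = statistics.mean(population)   # the parameter, usually unknown

# Simple random sampling: every member has the same chance of selection.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)           # one estimate, a bit high or low

# Repeating the procedure shows that the estimates are right on average.
sample_means = [statistics.mean(rng.sample(population, 100)) for _ in range(500)]
average_of_sample_means = statistics.mean(sample_means)

print("population mean        :", round(population_mean, 2))
print("one sample mean        :", round(sample_mean, 2))
print("average of sample means:", round(average_of_sample_means, 2))
```

Note that `rng.sample` needs the complete list of population members, which is exactly the requirement that makes simple random sampling cumbersome in practice.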
Descriptive Statistics: Measure of Central Tendency
A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.
Choosing the best measure of central tendency depends on the type of data you have.
The central tendency of a distribution represents one characteristic of a distribution. Another aspect is the variability around that central value. Measures of variability are covered later under Measure of Dispersion; this property describes how far away the data points tend to fall from the center. Distributions with the same central tendency (mean = 100) can actually be quite different: one may be tightly clustered around the mean while another is more spread out. It is crucial to understand that the central tendency summarizes only one aspect of a distribution and that it provides an incomplete picture by itself.
[Figure: two distributions with mean = 100; the left panel is tightly clustered around the mean, the right panel is more spread out.]

Mean
The mean describes an entire sample with a single number that represents the center of the data. The mean is the arithmetic average. You calculate the mean by adding up all of the observations and then dividing the total by the number of observations. For example, if the weights of five apples are 5, 5, 6, 7, and 8, the average apple weight is 6.2.
(5 + 5 + 6 + 7 + 8) / 5 = 31 / 5 = 6.2
The mean is sensitive to skewed data and extreme values. For data sets with these properties, the mean gets pulled away from the center of the data.
In these cases, the mean can be misleading because the most common values in the distribution might not be near the mean.
Median
The median is the middle of the data. Half of the observations are less than or equal to it and half of the observations are greater than or equal to it. The median is equivalent to the second quartile or the 50th percentile. For example, if the weights of five apples are 5, 5, 6, 7, and 8, the median apple weight is 6 because it is the middle value. If there is an even number of observations, you take the average of the two middle values.
The median is less sensitive than the mean to skewed data and extreme values. For data sets with these properties, the mean gets pulled away from the center of the data. In these cases, the mean can be misleading because the most common values in the distribution might not be near the mean.
For example, the mean might not be a good statistic for describing annual income. A few extremely wealthy individuals can increase the overall average, giving a misleading view of annual incomes. In this case, the median is more informative.

Mode
The mode is the value that occurs most frequently in a set of observations. You can find the mode simply by counting the number of times each value occurs in a data set. For example, if the weights of five apples are 5, 5, 6, 7, and 8, the apple weight mode is 5 because it is the most frequent value.
Identifying the mode can help you understand your distribution.

Which is Best: the Mean, Median, or Mode?
When you have a symmetrical distribution for continuous data, the mean, median, and mode are equal. In this case, analysts tend to use the mean because it includes all of the data in the calculations. However, if you have a skewed distribution, the median is often the best measure of central tendency.
When you have ordinal data, the median or mode is usually the best choice. For categorical data, you have to use the mode.
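The three measures can be computed for the apple weights with Python's standard `statistics` module. This is a small sketch: the extra weight of 50 is a made-up extreme value, added only to show how differently the mean and the median react to it.

```python
import statistics

apple_weights = [5, 5, 6, 7, 8]

mean_weight = statistics.mean(apple_weights)      # (5 + 5 + 6 + 7 + 8) / 5
median_weight = statistics.median(apple_weights)  # middle value of the sorted data
mode_weight = statistics.mode(apple_weights)      # most frequent value

# One hypothetical extreme apple pulls the mean far more than the median.
with_outlier = apple_weights + [50]
mean_with_outlier = statistics.mean(with_outlier)
median_with_outlier = statistics.median(with_outlier)  # average of the two middle values

print(mean_weight, median_weight, mode_weight)
print(mean_with_outlier, median_with_outlier)
```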
In cases where you are deciding between the mean and median as the better measure of central tendency, you are also determining which types of statistical hypothesis tests are appropriate for your data, if that is your ultimate goal. Parametric hypothesis tests are based on the mean and nonparametric hypothesis tests on the median, and each type has its own advantages and disadvantages.

Descriptive Statistics: Measure of Dispersion
Suppose you are given a data series and someone asks you to tell some interesting facts about it. You can find the mean, the median or the mode of the series and describe its distribution. But is that the only thing you can do? Are measures of central tendency the only way to learn about the concentration of the observations? This is where the measure of dispersion comes in.
As the name suggests, the measure of dispersion shows the scattering of the data. It tells how the data vary from one another and gives a clear idea about the distribution of the data. The measure of dispersion shows the homogeneity or the heterogeneity of the distribution of the observations.
Suppose you have four datasets of the same size and the mean is also the same, say, m. In all the cases the sum of the observations will be the same. Here, the measure of central tendency does not give a clear and complete idea about the distribution of the four given sets.
Can we get an idea about the distribution if we get to know about the dispersion of the observations from one another within and between the datasets? The main idea of the measure of dispersion is to get to know how the data are spread. It shows how much the data vary from their average value.

Characteristics of Measures of Dispersion
• A measure of dispersion should be rigidly defined
• It must be easy to calculate and understand
• Not affected much by the fluctuations of observations
• Based on all observations

Classification of Measures of Dispersion
The measure of dispersion is categorized as:
(i) An absolute measure of dispersion:
• The measures which express the scattering of observations in terms of distances, i.e., range and quartile deviation.
• The measures which express the variations in terms of the average of deviations of observations, like mean deviation and standard deviation.
(ii) A relative measure of dispersion:
We use a relative measure of dispersion for comparing the distributions of two or more data sets and for unit-free comparison. They are the coefficient of range, the coefficient of mean deviation, the coefficient of quartile deviation, the coefficient of variation, and the coefficient of standard deviation.

Range
The range is the most common and easily understandable measure of dispersion. It is the difference between two extreme observations of the data set.
If X_max and X_min are the two extreme observations, then
Range = X_max - X_min

Merits of Range
• It is the simplest of the measures of dispersion
• Easy to calculate
• Easy to understand
• Independent of change of origin

Demerits of Range
• It is based on two extreme observations. Hence, it is affected by fluctuations
• The range is not a reliable measure of dispersion
• Dependent on change of scale

Mean Absolute Deviation (MAD)
The Mean Absolute Deviation (MAD) of a set of data is the average distance between each data value and the mean.
The steps to find the MAD include:
1. find the mean (average)
2. find the difference between each data value and the mean
3. take the absolute value of each difference
4. find the mean (average) of these differences

Merits of Mean Deviation
• Based on all observations
• It provides a minimum value when the deviations are taken from the median
• Independent of change of origin

Demerits of Mean Deviation
• Not easily understandable
• Its calculation is not easy and is time-consuming
• Dependent on the change of scale
• Ignoring the negative signs creates artificiality and makes it unsuitable for further mathematical treatment

Example: Erica enjoys posting pictures of her cat online. Here's how many "likes" the past 6 pictures each received:
10, 15, 15, 17, 18, 21
Find the mean absolute deviation.
Step 1: Calculate the mean. The sum of the data is 96 total "likes" and there are 6 pictures, so the mean is 96 / 6 = 16.
Step 2: Calculate the distance between each data point and the mean. The distances are 6, 1, 1, 1, 2 and 5.
Step 3: Add the distances together. 6 + 1 + 1 + 1 + 2 + 5 = 16.
Step 4: Divide the sum by the number of data points. MAD = 16 / 6 ≈ 2.67 "likes".

Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma).
The formula is easy: it is the square root of the Variance. So now you ask, "What is the Variance?"
The Variance is defined as: the average of the squared differences from the Mean.
To calculate the variance, follow these steps:
• Work out the Mean (the simple average of the numbers)
• Then for each number: subtract the Mean and square the result (the squared difference).
• Then work out the average of those squared differences.
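Both recipes above, the four steps for the MAD and the three steps for the variance, can be sketched as two short functions and applied to the "likes" data. The function names are assumptions made for illustration, and the variance here divides by the number of observations, as in the definition given above.

```python
import math

def mean_absolute_deviation(values):
    """Average distance between each value and the mean (the four MAD steps)."""
    mean = sum(values) / len(values)                    # step 1
    distances = [abs(value - mean) for value in values] # steps 2 and 3
    return sum(distances) / len(distances)              # step 4

def variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)
    squared_differences = [(value - mean) ** 2 for value in values]
    return sum(squared_differences) / len(squared_differences)

likes = [10, 15, 15, 17, 18, 21]

mad = mean_absolute_deviation(likes)
var = variance(likes)
std = math.sqrt(var)   # standard deviation = square root of the variance

print(round(mad, 2), round(var, 2), round(std, 2))
```

For the same data the standard deviation comes out larger than the MAD, because squaring gives the two farthest pictures (10 and 21 likes) more weight.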
Why square the differences?
Example
You and your friends have just measured the heights of your dogs (in millimeters).
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394
Answer: so the mean (average) height is 394 mm.
[Figure: chart of the dogs' heights with the mean marked.]
Now we calculate each dog's difference from the Mean: 206, 76, -224, 36 and -94.
To calculate the Variance, take each difference, square it, and then average the result:
Variance = (206² + 76² + (-224)² + 36² + (-94)²) / 5 = 108,520 / 5 = 21,704
So the Variance is 21,704.
And the Standard Deviation is just the square root of the Variance, so:
Standard Deviation = √21,704 ≈ 147.32, or 147 to the nearest mm.
And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean:
[Figure: chart of the dogs' heights with a band of one Standard Deviation on either side of the mean.]
So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
Rottweilers are tall dogs. And Dachshunds are a bit short, right?
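The dog example can be reproduced in a few lines, including the check of which heights fall within one Standard Deviation of the Mean. This is an illustrative sketch; the variable names are assumptions.

```python
import math

heights_mm = [600, 470, 170, 430, 300]

mean_height = sum(heights_mm) / len(heights_mm)
differences = [height - mean_height for height in heights_mm]
variance = sum(diff ** 2 for diff in differences) / len(differences)
standard_deviation = math.sqrt(variance)

# Heights no more than one standard deviation away from the mean.
within_one_sd = [
    height for height in heights_mm
    if abs(height - mean_height) <= standard_deviation
]

print("mean:", mean_height)
print("differences:", differences)
print("variance:", variance)
print("standard deviation:", round(standard_deviation, 2))
print("within one SD:", within_one_sd)
```

The two heights left out of `within_one_sd` (600 mm and 170 mm) are the ones the text treats as extra large and extra small.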
Merits of Standard Deviation
• Squaring the deviations overcomes the drawback of ignoring signs in mean deviations
• Suitable for further mathematical treatment
• Least affected by the fluctuation of the observations
• The standard deviation is zero if all the observations are the same
• Independent of change of origin
Demerits of Standard Deviation
• Not easy to calculate
• Difficult to understand for a layman
• Dependent on the change of scale
Note:
• If the effect of error is non-linear, Standard Deviation should be used, not Mean Absolute Deviation.
• If the effect of error is linear, either Standard Deviation or Mean Absolute Deviation will do.
Stem and Leaf Diagram
A stem and leaf diagram shows numbers in a table format. It can be a useful way to organize data to find the median, mode and range of a set of data.
A stem and leaf plot, or stem plot, is a technique used to classify either discrete or continuous variables. A stem and leaf plot is used to organize data as they are collected.
A stem and leaf plot looks something like a bar graph. Each number in the data is broken down into a stem and a leaf, thus the name. The stem of the number includes all but the last digit. The leaf of the number will always be a single digit.
Elements of a good stem and leaf plot
A good stem and leaf plot
• shows the first digits of the number (thousands, hundreds or tens) as the stem and shows the last digit (ones) as the leaf.
• usually uses whole numbers. Anything that has a decimal point is rounded to the nearest whole number. For example, test results, speeds, heights, weights, etc.
• looks like a bar graph when it is turned on its side.
• shows how the data are spread, that is, the highest number, lowest number, most common number and outliers (numbers that lie outside the main group of numbers).
Tips on how to draw a stem and leaf plot
Once you have decided that a stem and leaf plot is the best way to show your data, draw it as follows:
• On the left hand side of the page, write down the thousands, hundreds or tens (all digits but the last one). These will be your stems.
• Draw a line to the right of these stems.
• On the other side of the line, write down the ones (the last digit of a number). These will be your leaves.
For example, if the observed value is 25, then the stem is 2 and the leaf is 5. If the observed value is 369, then the stem is 36 and the leaf is 9. Where observations are accurate to one or more decimal places, such as 23.7, the stem is 23 and the leaf is 7. If the range of values is too great, the number 23.7 can be rounded up to 24 to limit the number of stems.
In stem and leaf plots, tally marks are not required because the actual data are used.
Example
The marks that a class scored in a maths test are shown below.
A stem and leaf diagram is drawn by splitting the tens and units columns. The tens column becomes the 'stem' and the units become the 'leaf'.
The leaves of a stem and leaf diagram must be in order for it to be read properly.
The 'leaf' should only ever contain single digits. Therefore, to put the number 124 in a stem and leaf diagram, the 'stem' would be 12 and the 'leaf' would be 4. To put the number 78.9 into a stem and leaf diagram, the 'stem' would be 78 and the 'leaf' would be 9. In this case, the key would indicate that the split between stem and leaf is a decimal.
The main advantage of a stem and leaf plot
The main advantage of a stem and leaf plot is that the data are grouped and all the original data are shown, too. In Example 3 on battery life in the Frequency distribution tables section, the table shows that two observations occurred in the interval from 360 to 369 minutes. However, the table does not tell you what those actual observations are. A stem and leaf plot would show that information. Without a stem and leaf plot, the two values (363 and 369) can only be found by searching through all the original data, a tedious task when you have lots of data!
When looking at a data set, each observation may be considered as consisting of two parts: a stem and a leaf. To make a stem and leaf plot, each observed value must first be separated into its two parts:
• The stem is the first digit or digits;
• The leaf is the final digit of a value;
• Each stem can consist of any number of digits; but
• Each leaf can have only a single digit.
Example 1 – Making a stem and leaf plot
Each morning, a teacher quizzed his class with 20 geography questions. The class marked them together and everyone kept a record of their personal scores. As the year passed, each student tried to improve his or her quiz marks. Every day, Elliot recorded his quiz marks on a stem and leaf plot. This is what his marks looked like plotted out:
Analyse Elliot's stem and leaf plot. What is his most common score on the geography quizzes? What is his highest score? His lowest score? Rotate the stem and leaf plot onto its side so that it looks like a bar graph. Are most of Elliot's scores in the 10s, 20s or under 10? It is difficult to know from the plot whether Elliot has improved or not because we do not know the order of those scores.
Example 2 – Making a stem and leaf plot
A teacher asked 10 of her students how many books they had read in the last 12 months.
Their answers were as follows: 12, 23, 19, 6, 10, 7, 15, 25, 21, 12 Prepare a stem and leaf plot for these data. Tip: The number 6 can be written as 06, which means that it has a stem of 0 and a leaf of 6. The stem and leaf plot should look like this:
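For readers who would rather generate the plot than draw it, here is a minimal Python sketch for the book data. The helper names `stem_and_leaf` and `show` are invented for this illustration; with `ordered=False` the leaves stay in the order the answers were collected, and with the default they are sorted.

```python
from collections import defaultdict


def stem_and_leaf(values, ordered=True):
    """Map each stem (all digits but the last) to its list of leaves (last digit)."""
    plot = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)        # 6 -> stem 0, leaf 6; 23 -> stem 2, leaf 3
        plot[stem].append(leaf)
    # List every stem from the smallest to the largest, even one with no leaves
    rows = {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}
    if ordered:
        rows = {s: sorted(leaves) for s, leaves in rows.items()}
    return rows


def show(rows):
    for stem, leaves in rows.items():
        print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")


books = [12, 23, 19, 6, 10, 7, 15, 25, 21, 12]
show(stem_and_leaf(books, ordered=False))   # leaves in the order collected
print()
show(stem_and_leaf(books))                  # leaves in ascending order
```

The first call gives stems 0, 1 and 2 with leaves `6 7`, `2 9 0 5 2` and `3 5 1`; the second gives `6 7`, `0 2 2 5 9` and `1 3 5`.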
In Table 2:
• stem 0 represents the class interval 0 to 9;
• stem 1 represents the class interval 10 to 19; and
• stem 2 represents the class interval 20 to 29.
Usually, a stem and leaf plot is ordered, which simply means that the leaves are arranged in ascending order from left to right. Also, there is no need to separate the leaves (digits) with punctuation marks (commas or periods) since each leaf is always a single digit.
Using the data from Table 2, we made the ordered stem and leaf plot shown below:
Example 3 – Making an ordered stem and leaf plot
Fifteen people were asked how often they drove to work over 10 working days. The number of times each person drove was as follows:
5, 7, 9, 9, 3, 5, 1, 0, 0, 4, 3, 7, 2, 9, 8
Make an ordered stem and leaf plot for this table.
It should be drawn as follows:
Splitting the stems
The organization of this stem and leaf plot does not give much information about the data. With only one stem, the leaves are overcrowded. If the leaves become too crowded, then it might be useful to split each stem into two or more components. Thus, an interval 0–9 can be split into two intervals of 0–4 and 5–9. Similarly, a 0–9 stem could be split into five intervals: 0–1, 2–3, 4–5, 6–7 and 8–9.
The stem and leaf plot should then look like this:
Note: The stem 0(0) means all the data within the interval 0–4. The stem 0(5) means all the data within the interval 5–9.
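Splitting stems is easy to get wrong by hand, so a small Python sketch may help. The function name `split_stem_plot` and its `width` parameter are assumptions made for this illustration: `width=5` gives the 0–4 / 5–9 split described above, and `width=2` gives the five-interval split. Rows that contain no data are simply not printed.

```python
def split_stem_plot(values, width=5):
    """Group whole numbers into rows labelled stem(start), each covering `width` leaves."""
    rows = {}
    for v in sorted(values):                  # sorting keeps rows and leaves in order
        stem, leaf = divmod(v, 10)
        start = (leaf // width) * width       # width=5: 0 for leaves 0-4, 5 for leaves 5-9
        rows.setdefault((stem, start), []).append(leaf)
    return rows


drives = [5, 7, 9, 9, 3, 5, 1, 0, 0, 4, 3, 7, 2, 9, 8]
for (stem, start), leaves in split_stem_plot(drives).items():
    print(f"{stem}({start}) | {' '.join(map(str, leaves))}")
```

For the driving data this prints `0(0) | 0 0 1 2 3 3 4` and `0(5) | 5 5 7 7 8 9 9 9`, so the fifteen leaves are spread over two rows instead of one.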
Example 4 – Splitting the stems
Britney is a swimmer training for a competition. The numbers of 50-metre laps she swam each day for 30 days are as follows:
22, 21, 24, 19, 27, 28, 24, 25, 29, 28, 26, 31, 28, 27, 22, 39, 20, 10, 26, 24, 27, 28, 26, 28, 18, 32, 29, 25, 31, 27
1. Prepare an ordered stem and leaf plot. Make a brief comment on what it shows.
2. Redraw the stem and leaf plot by splitting the stems into five-unit intervals. Make a brief comment on what the new plot shows.
Answers
1. The observations range in value from 10 to 39, so the stem and leaf plot should have stems of 1, 2 and 3. The ordered stem and leaf plot is shown below:
The stem and leaf plot shows that Britney usually swims between 20 and 29 laps in training each day.
2. Splitting the stems into five-unit intervals gives the following stem and leaf plot:
Note: The stem 1(0) means all data between 10 and 14, 1(5) means all data between 15 and 19, and so on.
The revised stem and leaf plot shows that Britney usually swims between 25 and 29 laps in training each day. The values 1(0) 0 = 10 and 3(5) 9 = 39 could be considered outliers, a concept that will be described in the next section.
Example 5 – Splitting stems using decimal values
The weights (to the nearest tenth of a kilogram) of 30 students were measured and recorded as follows:
59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7, 61.6, 56.3, 61.9, 65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4, 58.4, 60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2
Prepare an ordered stem and leaf plot for the data. Briefly comment on what the analysis shows.
Answer
In this case, the stems will be the whole number values and the leaves will be the decimal values. The data range from 56.3 to 65.7, so the stems should start at 56 and finish at 65.
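The plot for the weights can be sketched in Python as follows. This is an illustration only: the function name `decimal_stem_plot` is invented, and the values are converted to whole tenths before splitting so that floating-point rounding cannot produce a wrong leaf.

```python
def decimal_stem_plot(values):
    """Stem = whole-number part, leaf = tenths digit, for data given to one decimal place."""
    rows = {}
    for v in sorted(values):
        tenths = round(v * 10)              # 59.2 -> 592
        stem, leaf = divmod(tenths, 10)     # 592 -> stem 59, leaf 2
        rows.setdefault(stem, []).append(leaf)
    # Keep stems with no observations so that gaps in the data stay visible
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}


weights = [59.2, 61.5, 62.3, 61.4, 60.9, 59.8, 60.5, 59.0, 61.1, 60.7,
           61.6, 56.3, 61.9, 65.7, 60.4, 58.9, 59.0, 61.2, 62.1, 61.4,
           58.4, 60.8, 60.2, 62.7, 60.0, 59.3, 61.9, 61.7, 58.4, 62.2]

plot = decimal_stem_plot(weights)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
print("Key: 59 | 2 = 59.2 kg")
```

The empty rows for stems 57, 63 and 64 are kept on purpose, because they show how far 56.3 and 65.7 sit from the rest of the data.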
In this example, it was not necessary to split stems because the leaves are not crowded on too few stems; nor was it necessary to round the values, since the range of values is not large. This stem and leaf plot reveals that the group with the highest number of observations recorded is the 61.0 to 61.9 group.
Outliers
An outlier is an extreme value of the data. It is an observation value that is significantly different from the rest of the data. There may be more than one outlier in a set of data.
Sometimes, outliers are significant pieces of information and should not be ignored. Other times, they occur because of an error or misinformation and should be ignored.
In the previous example, 56.3 and 65.7 could be considered outliers, since these two values are quite different from the other values.
By ignoring these two outliers, the previous example's stem and leaf plot could be redrawn as below:
When using a stem and leaf plot, spotting an outlier is often a matter of judgment. This is because, except when using box plots (explained in the section on box and whisker plots), there is no strict rule on how far removed a value must be from the rest of a data set to qualify as an outlier.
Features of distributions
When you assess the overall pattern of any distribution (which is the pattern formed by all values of a particular variable), look for these features:
• number of peaks
• general shape (skewed or symmetric)
• centre
• spread
Number of peaks
Line graphs are useful because they readily reveal some characteristics of the data. (See the section on line graphs for details on this type of graph.)
The first characteristic that can be readily seen from a line graph is the number of high points or peaks the distribution has.
While most distributions that occur in statistical data have only one main peak (unimodal), other distributions may have two peaks (bimodal) or more than two peaks (multimodal).
Examples of unimodal, bimodal and multimodal line graphs are shown below:
General shape
The second main feature of a distribution is the extent to which it is symmetric.
A perfectly symmetric curve is one in which both sides of the distribution would exactly match the other if the figure were folded over its central point. An example is shown below:
A symmetric, unimodal, bell-shaped distribution (a relatively common occurrence) is called a normal distribution.
If the distribution is lop-sided, it is said to be skewed.
A distribution is said to be skewed to the right, or positively skewed, when most of the data are concentrated on the left of the distribution. Distributions with positive skews are more common than distributions with negative skews.
Income provides one example of a positively skewed distribution. Most people make under $40,000 a year, but some make quite a bit more, with a smaller number making many millions of dollars a year. Therefore, the positive (right) tail on the line graph for income extends out quite a long way, whereas the negative (left) tail stops at zero. The right tail clearly extends farther from the distribution's centre than the left tail, as shown below:
A distribution is said to be skewed to the left, or negatively skewed, if most of the data are concentrated on the right of the distribution. The left tail clearly extends farther from the distribution's centre than the right tail, as shown below:
Centre and spread
Locating the centre (median) of a distribution can be done by counting half the observations up from the smallest. Obviously, this method is impracticable for very large sets of data. A stem and leaf plot makes this easy, however, because the data are arranged in ascending order. The mean is another measure of central tendency. (See the chapter on central tendency for more detail.)
The amount of distribution spread and any large deviations from the general pattern (outliers) can be quickly spotted on a graph.
Using stem and leaf plots as graphs
A stem and leaf plot is a simple kind of graph that is made out of the numbers themselves. It is a means of displaying the main features of a distribution. If a stem and leaf plot is turned on its side, it will resemble a bar graph or histogram and provide similar visual information.
Example 6 – Using stem and leaf plots as graphs
The results of 41 students' math tests (with a best possible score of 70) are recorded below:
31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54, 26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4, 54, 57, 39, 52, 45, 35, 51, 63, 42
1. Is the variable discrete or continuous? Explain.
2. Prepare an ordered stem and leaf plot for the data and briefly describe what it shows.
3. Are there any outliers? If so, which scores?
4. Look at the stem and leaf plot from the side. Describe the distribution's main features such as:
a) number of peaks
b) symmetry
c) value at the centre of the distribution
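One way to check your own work on questions 2 to 4 is to let Python do the counting. The sketch below is not part of the original exercise: it prints one star per score so that each stem reads like a bar, finds the middle (21st) observation by position, and reports the widest gap between neighbouring scores as a rough hint about possible outliers. Taking the single middle element as the median only works because the number of observations is odd.

```python
from collections import Counter

scores = [31, 49, 19, 62, 50, 24, 45, 23, 51, 32, 48, 55, 60, 40, 35, 54,
          26, 57, 37, 43, 65, 50, 55, 18, 53, 41, 50, 34, 67, 56, 44, 4,
          54, 57, 39, 52, 45, 35, 51, 63, 42]

ordered = sorted(scores)

# Number of scores on each stem (tens digit); one star per score
per_stem = Counter(s // 10 for s in ordered)
for stem in range(min(per_stem), max(per_stem) + 1):
    print(f"{stem} | {'*' * per_stem[stem]}")

# Centre: with 41 observations the median is the 21st value (index 20)
centre = ordered[len(ordered) // 2]
print("median:", centre)

# Widest gap between neighbouring values, with the two values on either side of it
gap, low, high = max((b - a, a, b) for a, b in zip(ordered, ordered[1:]))
print(f"widest gap: {gap} (between {low} and {high})")
```

Whether a value separated by a wide gap really is an outlier remains a judgment call, as noted in the section on outliers.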
Answers
1. A test score is a discrete variable. For example, it is not possible to have a test score of 35.74542341....
2. The lowest value is 4 and the highest is 67. Therefore, the stem and leaf plot that covers this range of values looks like this:
Note: The notation 2|4 represents stem 2 and leaf 4.
The stem and leaf plot reveals that the largest group of students (14 of the 41) scored in the interval between 50 and 59. The large number of students who obtained high results could mean that the test was too easy, that most students knew the material well, or a combination of both.
3. The result of 4 could be an outlier, since there is a large gap between this and the next result, 18.
4. If the stem and leaf plot is turned on its side, it will look like the following:
a) The distribution has a single peak within the 50–59 interval.
b) Although there are only 41 observations, the distribution shows that most data are clustered at the right. The left tail extends farther from the data centre than the right tail. Therefore, the distribution is skewed to the left or negatively skewed.
c) Since there are 41 observations, the distribution centre (the median value) will occur at the 21st observation. Counting 21 observations up from the smallest, the centre is 48. (Note that the same value would have been obtained if 21 observations were counted down from the highest observation.)
Example: The stem and leaf plot below shows the grade point averages of 18 students. The digit in the stem represents the ones and the digit in the leaf represents the tenths. So for example 0 | 8 = 0.8, 1 | 2 = 1.2 and so on.
a) What is the range of the data in the stem and leaf plot?
b) How many students have a grade of 2 or more?
c) What is the mode of the grades?
d) What is the median of the grades?
Solution to Example:
a) range = maximum value - minimum value = 4.0 - 0.8 = 3.2
b) 7 + 4 + 1 = 12 students
c) two modes: 1.4 and 2.5
d) There are 18 data values, already ordered in the stem and leaf diagram. median = (the 9th value + the 10th value) / 2 = (2.5 + 2.5) / 2 = 2.5
Example: The back to back stem and leaf plot below shows the exam grades (out of 100) of two sections. The digit in the stem represents the tens and the digit in the leaf represents the ones. So for example 5 | 3 = 53 and so on.
a) How many students scored higher than 60 in section 1?
b) How many students scored higher than 60 in section 2?
c) What are the minimum and maximum scores in section 1?
d) What are the minimum and maximum scores in section 2?
e) Without counting, which section has more students scoring 80 or more?
f) Without counting, which section has more students scoring 50 or less?
Solution to Example:
a) 6 + 7 + 5 + 4 = 22 students
b) 8 + 6 + 2 + 2 = 18 students
c) minimum = 40, maximum = 95
d) minimum = 41, maximum = 91
e) section 1
f) section 2
Example: The back to back stem and leaf plot below shows the LDL cholesterol levels (in milligrams per deciliter, mg/dL) of two groups of people, smokers and non smokers. The digits in the stem represent the hundreds and tens and the digit in the leaf represents the ones. So for example 11 | 8 = 118 and so on.
a) People with a cholesterol level of 129 or less are said to have a near ideal level of cholesterol. How many people, in each group, have a near ideal level of cholesterol?
b) People with a cholesterol level between 130 and 159 inclusive are said to be in the border high. How many people, in each group, are in the border high?
c) People with a cholesterol level between 160 and 189 inclusive are said to have a high level of cholesterol. How many people, in each group, have a high level of cholesterol?
d) People with a cholesterol level of 190 or above are said to have a very high level of cholesterol. How many people, in each group, have a very high level of cholesterol?
e) Comparing the two groups, which group has more people with a higher level of cholesterol?
Solution to Example:
a) smokers: 1 + 2 = 3 people, non smokers: 2 + 4 + 7 = 13 people
b) smokers: 3 + 4 + 5 = 12 people, non smokers: 7 + 6 + 3 = 16 people
c) smokers: 6 + 5 + 4 = 15 people, non smokers: 3 + 2 + 1 = 6 people
d) smokers: 3 + 2 = 5 people, non smokers: none
e) The group of smokers has more people with higher cholesterol.
Box Plot
Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data. They also show how far the extreme values are from most of the data. A box plot is constructed from five values: the minimum value, the first quartile, the median, the third quartile, and the maximum value. We use these values to compare how close other data values are to them.
To construct a box plot, use a horizontal or vertical number line and a rectangular box. The smallest and largest data values label the endpoints of the axis. The first quartile marks one end of the box and the third quartile marks the other end of the box. Approximately the middle 50 percent of the data fall inside the box. The “whiskers” extend from the ends of the box to the smallest and largest data values. The median or second quartile can be between the first and third quartiles, or it can be one, or the other, or both. The box plot gives a good, quick picture of the data.
Note: You may encounter box-and-whisker plots that have dots marking outlier values. In those cases, the whiskers are not extending to the minimum and maximum values. Consider, again, this dataset. 1 1 2 2 4 6 6.8 7.2 8 8.3 9 10 10 11.5
The first quartile is two, the median is seven, and the third quartile is nine. The smallest value is one, and the largest value is 11.5. The following image shows the constructed box plot.
The two whiskers extend from the first quartile to the smallest value and from the third quartile to the largest value. The median is shown with a dashed line.
Note: It is important to start a box plot with a scaled number line. Otherwise the box plot may not be useful.
Example
The following data are the heights of 40 students in a statistics class.
59 60 61 62 62 63 63 64 64 64 65 65 65 65 65 65 65 65 65 66 66 67 67 68 68 69 70 70 70 70 70 71 71 72 72 73 74 74 75 77
Construct a box plot with the following properties; the calculator instructions for the minimum and maximum values as well as the quartiles follow the example.
• Minimum value = 59
• Maximum value = 77
• Q1: First quartile = 64.5
• Q2: Second quartile or median = 66
• Q3: Third quartile = 70
1. Each quarter has approximately 25% of the data.
2. The spreads of the four quarters are 64.5 – 59 = 5.5 (first quarter), 66 – 64.5 = 1.5 (second quarter), 70 – 66 = 4 (third quarter), and 77 – 70 = 7 (fourth quarter). So, the second quarter has the smallest spread and the fourth quarter has the largest spread.
3. Range = maximum value – minimum value = 77 – 59 = 18
4. Interquartile Range: IQR = Q3 – Q1 = 70 – 64.5 = 5.5.
5. The interval 59–65 has more than 25% of the data so it has more data in it than the interval 66 through 70 which has 25% of the data.
6. The middle 50% (middle half) of the data has a range of 5.5 inches.
For some sets of data, some of the largest value, smallest value, first quartile, median, and third quartile may be the same. For instance, you might have a data set in which the median and the third quartile are the same. In this case, the diagram would not have a dotted line inside the box displaying the median. The right side of the box would display both the third quartile and the median. For example, if the smallest value and the first quartile were both one, the median and the third quartile were both five, and the largest value was seven, the box plot would look like:
In this case, at least 25% of the values are equal to one. Twenty-five percent of the values are between one and five, inclusive. At least 25% of the values are equal to five. The top 25% of the values fall between five and seven, inclusive.
Example
Test scores for a college statistics class held during the day are:
99 56 78 55.5 32 90 80 81 56 59 45 77 84.5 84 70 72 68 32 79 90
Test scores for a college statistics class held during the evening are:
98 78 68 83 81 89 88 76 65 45 98 90 80 84.5 85 79 78 98 90 79 81 25.5
1. Find the smallest and largest values, the median, and the first and third quartile for the day class.
2. Find the smallest and largest values, the median, and the first and third quartile for the night class.
3. For each data set, what percentage of the data is between the smallest value and the first quartile? the first quartile and the median? the median and the third quartile? the third quartile and the largest value? What percentage of the data is between the first quartile and the largest value?
4. Create a box plot for each set of data. Use one number line for both box plots.
5. Which box plot has the widest spread for the middle 50% of the data (the data between the first and third quartiles)? What does this mean for that set of data in comparison to the other set of data?
Solution:
1. Day class:
• Min = 32
• Q1 = 56
• M = 74.5
• Q3 = 82.5
• Max = 99
2. Night class:
• Min = 25.5
• Q1 = 78
• M = 81
• Q3 = 89
• Max = 98
3. Day class: There are six data values ranging from 32 to 56: 30%. There are six data values ranging from 56 to 74.5: 30%. There are five data values ranging from 74.5 to 82.5: 25%. There are five data values ranging from 82.5 to 99: 25%. There are 16 data values between the first quartile, 56, and the largest value, 99: 75%.
Night class:
The first data set (the day class) has the wider spread for the middle 50% of the data. The IQR for the first data set (82.5 – 56 = 26.5) is greater than the IQR for the second set (89 – 78 = 11). This means that there is more variability in the middle 50% of the first data set.
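The five values behind each box plot can be reproduced with a short Python sketch. One assumption needs stating clearly: quartile conventions differ between textbooks and software, and the function below (a made-up helper called `five_number_summary`) takes the quartiles as the medians of the lower and upper halves of the sorted data, because that convention reproduces the figures quoted in the examples above. Other tools may return slightly different quartiles for the same data.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles as medians of the two halves."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # values below the median position
    upper = data[half + n % 2:]      # values above it (the middle value is skipped when n is odd)
    return min(data), median(lower), median(data), median(upper), max(data)


day = [99, 56, 78, 55.5, 32, 90, 80, 81, 56, 59,
       45, 77, 84.5, 84, 70, 72, 68, 32, 79, 90]
night = [98, 78, 68, 83, 81, 89, 88, 76, 65, 45, 98,
         90, 80, 84.5, 85, 79, 78, 98, 90, 79, 81, 25.5]

for name, scores in (("day", day), ("night", night)):
    lo, q1, med, q3, hi = five_number_summary(scores)
    print(f"{name}: min={lo}, Q1={q1}, median={med}, Q3={q3}, max={hi}, IQR={q3 - q1}")
```

Run on the 40 student heights from the earlier example, the same function gives 59, 64.5, 66, 70 and 77, which matches the values listed there.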
SPSS: BASICS
General Functions
The Menu bar (“File”, “Edit” and so on) is located in the upper area. In the lower left corner, two tabs are available: Data View and Variable View. When you start SPSS, Variable View is default.
File Types
SPSS uses three different types of files with different functions and extensions.
Options
The SPSS menu works similarly to the menus in many other programs, such as Word or Excel. Some useful options are listed below:
Variable View
In Variable View, different columns are displayed. Each line corresponds to a variable. A variable is simply a quantity of something, which varies and can be measured, such as height, weight, number of children, educational level, gender and so forth.
Options
To alter the variable options, click the cells. Some columns can be typed into directly, while for others you need to press the arrows or dots that appear when you click in the cell. It is often possible to use “copy and paste” here. This may be efficient when you, for example, have several variables with the same Values. If you want to delete a variable, select the numbered cell to the left of the variable, then right-click and choose Clear.
Creating a New Data Set
If you have a questionnaire, you can easily create the corresponding data structure in Variable View in SPSS. For example:
Data View
Once the structure of the data set is determined, it is time to take a look at Data View. Access this view by clicking on the tab named Data View in the lower left corner. Here, each column corresponds to a variable, whereas each row corresponds to a case (most commonly an individual). It is possible to change the order of the variables by highlighting a column and using “drag and drop”. You may also change the width of a column by placing the mouse over its right border (next to the name of the column), pressing down the button and then using “drag and drop”.
If you are creating a new data set, simply type in your data, one row (and one column) at a time. Use the left and right arrow keys on your keyboard to move between cells. Make sure that you have chosen the right Type of variable before you enter your data (i.e. Numeric or String).
Output
Everything you order in SPSS (e.g. graphs, tables, or analyses) ends up in a window called Output. In the area to the left, all the different steps are listed. It is possible to collapse specific steps by clicking on the box with the minus sign (and expand it again by clicking on the same box, now with a plus sign). In the area to the right, your actual output is shown. First, you see the syntax for what you have ordered, and then you get the tables or graphs related to the specific command.