Statistics is both the science of uncertainty and the technology.docx

Statistics is both the science of uncertainty and the technology
of extracting information from data.
A statistic is a summary measure of data.
Descriptive statistics are methods that describe and summarize
data.
Microsoft Excel supports statistical analysis in two ways:
1. Statistical functions
2. Analysis Toolpak add-in
Statistical Methods for Summarizing Data
A frequency distribution is a table that shows the number of
observations in each of several nonoverlapping groups.
Categorical variables naturally define the groups in a frequency
distribution.
To construct a frequency distribution, we need only count the
number of observations that appear in each category.
This can be done using the Excel COUNTIF function.
Frequency Distributions for Categorical Data
Example 3.16: Constructing a Frequency Distribution for Items
in the Purchase Orders Database
List the item names in a column on the spreadsheet.
Use the function =COUNTIF($D$4:$D$97,cell_reference),
where cell_reference is the cell containing the item name
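The counting that COUNTIF performs can also be sketched outside Excel. The short Python example below is only an illustration: the item names are made up and are not taken from the Purchase Orders database.

```python
from collections import Counter

# Hypothetical item names standing in for the item column (D4:D97 in the text).
items = ["Bolt", "Gasket", "Bolt", "O-Ring", "Gasket", "Bolt"]

# Counter tallies how many times each category appears, which is what
# =COUNTIF(range, item) does for one item at a time.
frequency = Counter(items)

for item, count in sorted(frequency.items()):
    print(f"{item}: {count}")
```

The frequencies always sum to the number of observations, which is a quick check that no category was missed.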

Example 3.16: Constructing a Frequency Distribution for Items
in the Purchase Orders Database
Construct a column chart to visualize the frequencies.
Relative frequency is the fraction, or proportion, of the total.
If a data set has n observations, the relative frequency of
category i is:
Relative frequency of category i = (frequency of category i) / n
We often multiply the relative frequencies by 100 to express
them as percentages.
A relative frequency distribution is a tabular summary of the
relative frequencies of all categories.
Relative Frequency Distributions
Example 3.17: Constructing a Relative Frequency Distribution
for Items in the Purchase Orders Database
First, sum the frequencies to find the total number (note that the
sum of the frequencies must be the same as the total number of
observations, n).
Then divide the frequency of each category by this value.
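The two steps above (sum the frequencies, then divide each one by the total) can be sketched in Python. The category counts here are hypothetical and are not the Purchase Orders values.

```python
# Hypothetical category frequencies (not the Purchase Orders counts).
frequency = {"Bolt": 3, "Gasket": 2, "O-Ring": 1}

# Step 1: sum the frequencies; this must equal the number of observations n.
n = sum(frequency.values())

# Step 2: divide the frequency of each category by n.
relative_frequency = {item: count / n for item, count in frequency.items()}

for item, rel in relative_frequency.items():
    # Multiply by 100 to express the relative frequency as a percentage.
    print(f"{item}: {rel:.3f} ({rel * 100:.1f}%)")
```

The relative frequencies should sum to 1 (or 100%), apart from rounding.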
For numerical data that consist of a small number of discrete
values, we may construct a frequency distribution similar to the
way we did for categorical data; that is, we simply use
COUNTIF to count the frequencies of each discrete value.
Frequency Distributions for Numerical Data
In the Purchase Orders data, the A/P terms are all whole
numbers 15, 25, 30, and 45.
Example 3.18: Frequency and Relative Frequency Distribution
for A/P Terms
A graphical depiction of a frequency distribution for numerical
data in the form of a column chart is called a histogram.
Frequency distributions and histograms can be created using the
Analysis Toolpak in Excel.
Click the Data Analysis tools button in the Analysis group
under the Data tab in the Excel menu bar and select Histogram

from the list.
Excel Histogram Tool
Specify the Input Range corresponding to the data. If you
include the column header, then also check the Labels box so
Excel knows that the range contains a label. The Bin Range
defines the groups (Excel calls these “bins”) used for the
frequency distribution.
Histogram Dialog
If you do not specify a Bin Range, Excel will automatically
determine bin values for the frequency distribution and
histogram, which often results in a rather poor choice.
If you have discrete values, set up a column of these values in
your spreadsheet for the bin range and specify this range in the
Bin Range field.
Using Bin Ranges
We will create a frequency distribution and histogram for the
A/P Terms variable in the Purchase Orders database.
We defined the bin range below the data in cells H99:H103 as
follows:

Month
15
25
30
45
Example 3.19: Using the Histogram Tool
Histogram tool results:
Example 3.19: Using the Histogram Tool
For numerical data that have many different discrete values with
little repetition or are continuous, a frequency distribution
requires that we define groups by specifying
the number of groups,
the width of each group, and
the upper and lower limits of each group.
Choose between 5 and 15 groups, and make the width of each group
equal.
Choose the lower limit of the first group (LL) as a whole
number smaller than the minimum data value and the upper
limit of the last group (UL) as a whole number larger than the
maximum data value.
Histograms for Numerical Data

The data range from a minimum of $68.75 to a maximum of
$127,500; set the lower limit of the first group to $0 and the
upper limit of the last group to $130,000.
If we select 5 groups, using equation (3.2) the width of each
group is ($130,000 - 0) / 5 = $26,000
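The width calculation and the resulting group limits can be sketched in Python. Only the limits ($0 and $130,000), the number of groups, and the minimum and maximum data values come from the example; the other cost values are hypothetical. The sketch assumes each value is counted in the first group whose upper limit is greater than or equal to it, which is how bin values are usually treated by the Histogram tool.

```python
lower_limit = 0          # LL from the example
upper_limit = 130_000    # UL from the example
num_groups = 5

# Group width as in the example: (UL - LL) / number of groups.
width = (upper_limit - lower_limit) / num_groups

# Upper limit of each group, used as the bin values.
upper_limits = [lower_limit + width * (i + 1) for i in range(num_groups)]

# Hypothetical order costs (only the minimum and maximum come from the text).
costs = [68.75, 4_500.00, 27_000.00, 51_999.99, 78_000.00, 127_500.00]

# Count each value in the first group whose upper limit is >= the value.
counts = [0] * num_groups
for cost in costs:
    for i, limit in enumerate(upper_limits):
        if cost <= limit:
            counts[i] += 1
            break

print(width)
print(list(zip(upper_limits, counts)))
```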
Example 3.20: Constructing a Frequency Distribution and
Histogram for Cost per Order
Ten-group histogram
Example 3.20: Constructing a Frequency Distribution and
Histogram for Cost per Order
Set the cumulative relative frequency of the first group equal to
its relative frequency. Then add the relative frequency of the
next group to the cumulative relative frequency.
For example, the cumulative relative frequency in cell D3 is
computed as =D2+C3 = 0.000 + 0.447 = 0.447.
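The running-total rule described above can be sketched in Python. The relative frequencies below are hypothetical and are not the values from the example worksheet.

```python
from itertools import accumulate

# Hypothetical relative frequencies for five groups (they sum to 1).
relative = [0.40, 0.30, 0.15, 0.10, 0.05]

# accumulate() keeps a running total: the first value is unchanged and
# each later value adds the next relative frequency to the previous total.
cumulative = list(accumulate(relative))

for rel, cum in zip(relative, cumulative):
    print(f"{rel:.2f}  {cum:.2f}")
```

The last cumulative relative frequency should equal 1, apart from rounding.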
Example 3.21 Computing Cumulative Relative Frequencies

The kth percentile is a value at or below which at least k
percent of the observations lie. The most common way to
compute the kth percentile is to order the data values from
smallest to largest and calculate the rank of the kth percentile
using the formula:
Rank of kth percentile = nk/100 + 0.5 (3.3)
rounded to the nearest integer.
Statistical software packages use different methods that often
involve interpolating between ranks instead of rounding, thus
producing different results.
The Excel function PERCENTILE.INC(array, k) computes the
kth percentile of data in the range specified in the array field,
where k is in the range 0 to 1, inclusive (i.e., including 0 and
1).
Percentiles
Compute the 90th percentile for Cost per order in the Purchase
Orders data.
Rank of kth percentile = nk/100 + 0.5
n = 94; k = 90
For the 90th percentile, the rank is
= 94(90)/100+0.5 = 85.1 (round to 85)
Value of the 85th observation = $74,375
Using the Excel function PERCENTILE.INC(G4:G97,0.9), the
90th percentile is $73,737.50, which is different from using
formula (3.3).
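The difference between the two approaches can be sketched in Python. The first function follows the rank formula nk/100 + 0.5 with rounding. The second interpolates linearly at rank k(n - 1) + 1, which is my understanding of how PERCENTILE.INC works; treat that as an assumption rather than a statement of Excel's exact algorithm. The data are hypothetical, and both function names are made up for this illustration.

```python
import math

def percentile_by_rank(values, k):
    """kth percentile (k from 0 to 100) using rank = n*k/100 + 0.5, rounded."""
    data = sorted(values)
    n = len(data)
    rank = math.floor(n * k / 100 + 0.5 + 0.5)  # round half up to the nearest rank
    rank = min(max(rank, 1), n)                 # keep the rank inside 1..n
    return data[rank - 1]

def percentile_inclusive(values, k):
    """kth percentile (k from 0 to 1) by linear interpolation between ranks."""
    data = sorted(values)
    n = len(data)
    position = k * (n - 1)            # zero-based fractional position
    lower = math.floor(position)
    upper = min(lower + 1, n - 1)
    fraction = position - lower
    return data[lower] + fraction * (data[upper] - data[lower])

# Hypothetical data: the integers 1 through 10.
data = list(range(1, 11))
print(percentile_by_rank(data, 90))     # rank 9.5 rounds up to the 10th value
print(percentile_inclusive(data, 0.9))  # falls between the 9th and 10th values
```

Even on this tiny data set the two methods disagree, which mirrors the difference between $74,375 and $73,737.50 in the example.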

Examples 3.22 and 3.23: Computing Percentiles
Data > Data Analysis > Rank and Percentile
90.3rd percentile = $74,375 (same result as manually computing
the 90th percentile)
Example 3.24 Excel Rank and Percentile Tool
The value of $74,375 that was computed as the 90th percentile
with formula (3.3) is the 90.3rd percentile value in the Rank and
Percentile output.
Quartiles break the data into four parts.
The 25th percentile is called the first quartile, Q1;
the 50th percentile is called the second quartile, Q2;
the 75th percentile is called the third quartile, Q3; and
the 100th percentile is the fourth quartile, Q4.
One-fourth of the data fall below the first quartile, one-half are
below the second quartile, and three-fourths are below the third
quartile.
The Excel function QUARTILE.INC(array, quart) computes quartiles,
where array specifies the range of the data and quart is a whole
number between 1 and 4, designating the desired quartile.
Quartiles

Compute the Quartiles of the Cost per Order data
First quartile: =QUARTILE.INC(G4:G97,1) = $6,757.81
Second quartile: =QUARTILE.INC(G4:G97,2) = $15,656.25
Third quartile: =QUARTILE.INC(G4:G97,3) = $27,593.75
Fourth quartile: =QUARTILE.INC(G4:G97,4) = $127,500.00
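A comparable calculation can be sketched with Python's standard library. The data are hypothetical (not the Cost per Order values), and the assumption here is that the "inclusive" method of statistics.quantiles interpolates in the same way as QUARTILE.INC.

```python
import statistics

# Hypothetical data (not the Cost per Order values).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# method="inclusive" treats the data as including the minimum and maximum
# and returns the three cut points Q1, Q2, and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
q4 = max(data)  # the 100th percentile is the largest value

print(q1, q2, q3, q4)
```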
Example 3.25 Computing Quartiles in Excel
A cross-tabulation is a tabular method that displays the number
of observations in a data set for different subcategories of two
categorical variables.
A cross-tabulation table is often called a contingency table.
The subcategories of the variables must be mutually exclusive
and exhaustive, meaning that each observation can be classified
into only one subcategory, and, taken together over all
subcategories, they must constitute the complete data set.
Cross-Tabulations
Sales Transactions database

Count the number (and compute the percentage) of books and
DVDs ordered by region.
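A cross-tabulation of this kind can be sketched in Python by counting (region, product) pairs. The transactions below are hypothetical and are not taken from the Sales Transactions database.

```python
from collections import Counter

# Hypothetical (region, product) pairs standing in for sales transactions.
transactions = [
    ("East", "Book"), ("East", "DVD"), ("West", "Book"),
    ("West", "Book"), ("North", "DVD"), ("East", "Book"),
]

# Each observation falls in exactly one (region, product) cell.
counts = Counter(transactions)

regions = sorted({region for region, _ in transactions})
products = sorted({product for _, product in transactions})

for region in regions:
    row_total = sum(counts[(region, p)] for p in products)
    cells = ", ".join(
        f"{p}: {counts[(region, p)]} ({counts[(region, p)] / row_total:.0%})"
        for p in products
    )
    print(f"{region}: {cells}")
```

Because the subcategories are mutually exclusive and exhaustive, the cell counts add up to the total number of observations.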
Example 3.26: Constructing a Cross-Tabulation
Cross-Tabulation Visualization: Chart of Regional Sales by
Product
Select the Insert tab.
Highlight the data.
Click on chart type, then subtype.
Use Chart Tools to customize.
Creating Charts in Microsoft Excel

Excel distinguishes between vertical and horizontal bar charts,
calling the former column charts and the latter bar charts.
A clustered column chart compares values across categories
using vertical rectangles;
a stacked column chart displays the contribution of each value
to the total by stacking the rectangles;
a 100% stacked column chart compares the percentage that each
value contributes to a total.
Column and bar charts are useful for comparing categorical or
ordinal data, for illustrating differences between sets of values,
and for showing proportions or percentages of a whole.
Column and Bar Charts
Example 3.2: Creating a Column Chart
Highlighted Cells
Highlight the range C3:K6, which includes the headings and
data for each category. Click on the Column Chart button and
then on the first chart type in the list (a clustered column chart).
Example 3.2: Creating a Column Chart
To add a title, click on the first icon in the Chart Layouts group.
Click on “Chart Title” in the chart and change it to “EEO
Employment Report—Alabama.” The names of the data series
can be changed by clicking on the Select Data button in the

Data group of the Design tab. In the Select Data Source dialog
(see below), click on “Series1” and then the Edit button. Enter
the name of the data series, in this case “All Employees.”
Change the names of the other data series to “Men” and
“Women” in a similar fashion.
Line charts provide a useful means for displaying data over
time.
You may plot multiple data series in line charts; however, they
can be difficult to interpret if the magnitude of the data values
differs greatly. In that case, it would be advisable to create
separate charts for each data series.
Line Charts
Example 3.3: A Line Chart for China Export Data
Pie Charts
A pie chart displays the relative proportion of each category to
the total by partitioning a circle into pie-shaped areas.
Example 3.4: A Pie Chart for Census Data

Pie Charts
Data visualization professionals don't recommend using pie
charts. In a pie chart, it is difficult to compare the relative sizes
of areas; however, the bars in the column chart can easily be
compared to determine relative ratios of the data.
If you do use pie charts, restrict them to small numbers of
categories, always ensure that the numbers add to 100%, and
use labels to display the group names and actual percentages.
Avoid three-dimensional (3-D) pie charts—especially those that
are rotated—and keep them simple.
An area chart combines the features of a pie chart with those of
line charts.
Area charts present more information than pie or line charts
alone but may clutter the observer’s mind with too many details
if too many data series are used; thus, they should be used with
care.
Area Charts
Example 3.5: An Area Chart for Energy Consumption
Scatter charts show the relationship between two variables. To
construct a scatter chart, we need observations that consist of
pairs of variables.
Scatter Charts

Example 3.6: A Scatter Chart for Real Estate Data
A bubble chart is a type of scatter chart in which the size of the
data marker corresponds to the value of a third variable;
consequently, it is a way to plot three variables in two
dimensions.
Bubble Charts
Example 3.7: A Bubble Chart for Stock Comparisons
Stock chart
Surface chart
Doughnut chart
Radar chart
Miscellaneous Excel Charts
Many applications of business analytics involve geographic
data. Visualizing geographic data can highlight key data
relationships, identify trends, and uncover business
opportunities. In addition, it can often help to spot data errors
and help end users understand solutions, thus increasing the
likelihood of acceptance of decision models.
Companies like Nike use geographic data and information

systems for visualizing where products are being distributed and
how that relates to demographic and sales information. This
information is vital to marketing strategies.
Geographic mapping capabilities were introduced in Excel 2000
but were not available in Excel 2002 and later versions. These
capabilities are now available through Microsoft MapPoint
2010, which must be purchased separately.
Geographic Data
Visualizing and Exploring Data
Data visualization is the process of displaying data (often in
large quantities) in a meaningful fashion to provide insights that
will support better decisions.
Data visualization improves decision-making, provides
managers with better analysis capabilities that reduce reliance
on IT professionals, and improves collaboration and information
sharing.
Data Visualization
Tabular data can be used to determine exactly how many units

of a certain product were sold in a particular month, or to
compare one month to another.
For example, we see that sales of product A dropped in
February, specifically by 6.7% (computed as 1 – B3/B2).
Beyond such calculations, however, it is difficult to draw big
picture conclusions.
Example 3.1: Tabular vs. Visual Data Analysis
A visual chart provides the means to
easily compare overall sales of different products (Product C
sells the least, for example);
identify trends (sales of Product D are increasing), other
patterns (sales of Product C are relatively stable while sales of
Product B fluctuate more over time), and exceptions (Product
E’s sales fell considerably in September).
Example 3.1: Tabular vs. Visual Data Analysis
A dashboard is a visual representation of a set of key business
measures. It is derived from the analogy of an automobile’s
control panel, which displays speed, gasoline level,
temperature, and so on.
Dashboards provide important summaries of key business
information to help manage a business process or function.
Dashboards

Hypothesis Testing – Examples and
Case Studies
How Hypothesis Tests Are Reported
Determine the null hypothesis and the
alternative hypothesis.
Collect and summarize the data into a
test statistic.
Use the test statistic to determine the p-value.
The result is statistically significant if the p-value is less than
or equal to the level of significance.

Testing Hypotheses About Proportions and Means
If the null and alternative hypotheses are expressed in terms of
a population proportion, mean, or difference between two means
and if the sample sizes are large …
… the test statistic is simply the corresponding standardized
score computed assuming the null hypothesis is true; and the
p-value is found from a table of percentiles for standardized
scores.
Example 2: Weight Loss for Diet vs Exercise
Did dieters lose more fat than the exercisers?
Diet Only:
sample mean = 5.9 kg
sample standard deviation = 4.1 kg
sample size = n = 42
standard error = 4.1/√42 = 0.633
Exercise Only:
sample mean = 4.1 kg
sample standard deviation = 3.7 kg
sample size = n = 47
standard error = 3.7/√47 = 0.540
measure of variability = √[(0.633)² + (0.540)²] = 0.83

Step 1. Determine the null and alternative hypotheses.
Null hypothesis: No difference in average fat lost in population
for two methods. Population mean difference is zero.
Alternative hypothesis: There is a difference in average fat lost
in population for two methods. Population mean difference is
not zero.
Step 2. Collect and summarize data into a test statistic.
The sample mean difference = 5.9 – 4.1 = 1.8 kg and the
standard error of the difference is 0.83.
So the test statistic: z = (1.8 – 0)/0.83 = 2.17
Step 3. Determine the p-value.
Recall the alternative hypothesis was two-sided.
p-value = 2 × [proportion of bell-shaped curve above 2.17] = 0.03

Step 4. Make a decision.
The p-value of 0.03 is less than or equal to 0.05, so …
If really no difference between dieting and exercise as fat loss
methods, would see such an extreme result only 3% of the time,
or 3 times out of 100.
Prefer to believe truth does not lie with null hypothesis. We
conclude that there is a statistically significant difference
between average fat loss for the two methods.
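The arithmetic in Steps 2 and 3 can be checked with a short Python sketch. It uses only the summary statistics given in the example and finds the tail area with the standard library's complementary error function. Because nothing is rounded along the way, the z value comes out near 2.16 rather than the 2.17 obtained from the rounded standard error of 0.83.

```python
import math

# Summary statistics from the example.
mean_diet, sd_diet, n_diet = 5.9, 4.1, 42
mean_exercise, sd_exercise, n_exercise = 4.1, 3.7, 47

# Standard error of the difference between two independent sample means.
standard_error = math.sqrt(sd_diet**2 / n_diet + sd_exercise**2 / n_exercise)

# Standardized score under the null hypothesis (mean difference = 0).
z = ((mean_diet - mean_exercise) - 0) / standard_error

# Two-sided p-value from the standard normal curve:
# P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(standard_error, 2), round(z, 2), round(p_value, 3))
```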
Example 3: Public Opinion About President
On May 16, 1994, Newsweek reported the results of a public
opinion poll that asked: “From everything you know about Bill
Clinton, does he have the honesty and integrity you expect in a
president?” (p. 23).
Poll surveyed 518 adults and 233, or 0.45 of them (clearly less
than half), answered yes.
Could Clinton’s adversaries conclude from this that only a
minority (less than half) of the population of Americans thought
Clinton had the honesty and integrity to be president?

Step 1. Determine the null and alternative hypotheses.
Null hypothesis: There is no clear winning opinion on this
issue; the proportions who would answer yes or no are each
0.50.
Alternative hypothesis: Fewer than 0.50, or 50%, of the
population would answer yes to this question. The majority do
not think Clinton has the honesty and integrity to be president.
Step 2. Collect and summarize data into a test statistic.
Sample proportion is: 233/518 = 0.45.
The standard deviation = √[(0.50)(1 – 0.50)/518] = 0.022.
Test statistic: z = (0.45 – 0.50)/0.022 = –2.27
Step 3. Determine the p-value.
Recall the alternative hypothesis was one-sided.
p-value = proportion of bell-shaped curve below –2.27.
Exact p-value = 0.0116.
Step 4. Make a decision.
The p-value of 0.0116 is less than 0.05, so we conclude that the
proportion of American adults in 1994 who believed Bill

Clinton had the honesty and integrity they expected in a
president was significantly less than a majority.
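The same calculation can be sketched in Python from the raw counts. Because the sample proportion and standard deviation are not rounded here, z comes out near –2.28 and the p-value near 0.011, slightly different from the –2.27 and 0.0116 obtained from the rounded values; the decision is unchanged.

```python
import math

yes_count, n = 233, 518
null_proportion = 0.50

sample_proportion = yes_count / n

# Standard deviation of the sample proportion, assuming the null hypothesis is true.
standard_deviation = math.sqrt(null_proportion * (1 - null_proportion) / n)

z = (sample_proportion - null_proportion) / standard_deviation

# One-sided (lower-tail) p-value: P(Z <= z) = 0.5 * erfc(-z / sqrt(2)).
p_value = 0.5 * math.erfc(-z / math.sqrt(2))

print(round(sample_proportion, 2), round(standard_deviation, 3),
      round(z, 2), round(p_value, 4))
```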
Revisiting Case Studies: How Journals Present Tests
Whereas newspapers and magazines tend to simply report the
decision from hypothesis testing, journals tend to report
p-values as well.
This allows you to make your own decision, based on the
severity of a type 1 error and the magnitude of the p-value.
Case Study 5.1: Quitting Smoking with
Nicotine Patches
Compared the smoking cessation rates for smokers randomly
assigned to use a nicotine patch versus a placebo patch.
Null hypothesis: The proportions of smokers in the population
who would quit smoking using a nicotine patch and using a placebo
patch are the same.
Alternative hypothesis: The proportion of smokers in the
population who would quit smoking using a nicotine patch is

higher than the proportion who would quit using a placebo
patch.
Case Study 5.1: Quitting Smoking with
Nicotine Patches
Higher smoking cessation rates were observed in the active
nicotine patch group at 8 weeks (46.7% vs 20%) (P < .001)
and at 1 year (27.5% vs 14.2%) (P = .011).
(Hurt et al., 1994, p. 595)
Conclusion: p-values are quite small: less than 0.001 for
difference after 8 weeks and equal to 0.011 for difference after
a year. Therefore, rates of quitting are significantly higher
using a nicotine patch than using a placebo patch after 8 weeks
and after 1 year.
Case Study 6.4: Smoking During
Pregnancy and Child’s IQ
Study investigated impact of maternal smoking on subsequent
IQ of child at 1, 2, 3, and 4 years of age.

Null hypothesis: Mean IQ scores for children whose mothers
smoke 10 or more cigarettes a day during pregnancy are same as
mean for those whose mothers do not smoke, in populations
similar to one from which this sample was drawn.
Alternative hypothesis: Mean IQ scores for children whose
mothers smoke 10 or more cigarettes a day during pregnancy are
not the same as mean for those whose mothers do not smoke, in
populations similar to one from which this sample was drawn.
Case Study 6.4: Smoking During
Pregnancy and Child’s IQ
Children born to women who smoked 10+ cigarettes per day
during pregnancy had developmental quotients at 12 and 24
months of age that were 6.97 points lower (averaged across
these two time points) than children born to women who did not
smoke during pregnancy (95% CI: 1.62,12.31, P = .01); at 36
and 48 months they were 9.44 points lower (95% CI:
4.52, 14.35, P = .0002). (Olds et al., 1994, p. 223)
Researchers conducted two-tailed tests to allow for the possibility
that the mean IQ score could actually be higher for those whose
mothers smoke. The CI provides evidence of the direction in which the
difference falls. The p-value simply tells us there is a
statistically significant difference.
For Those Who Like Formulas

Statistics
Spring 2019
Module 3 Comprehensive Problem

INFERENTIAL STATISTICS – HYPOTHESIS TESTING
Either individually or in groups of 2 or 3, your task is to
perform some real-world inferential statistics. You will take a
claim that someone has made, form a hypothesis from that,
collect the data necessary to test the hypothesis, perform a
hypothesis test, and interpret the results.
You will test to see if less than 50% of students participate in
the Student Evaluation of Teaching system (SETS) in the
School of Business Administration at USCA. Why or Why not?
Determine and describe the type of data that you will collect
and how you plan to collect this data in order to answer your
questions. You will need to collect data on many characteristics
of your sample so that these characteristics can later be
compared somehow (e.g., before and after data; comparisons by
gender, major, type, year, age, etc.) Define the population and
the sample that you will be studying. (You must sample at least
100 students in the SOBA.)
Project Components
The report will include a description of the problem, and why
you think it is important, or what you hope to gain from testing
the hypothesis. It should also include the context of the data, all
data collected, and the values generated in EXCEL. A decision
and conclusion should be stated. An analysis should follow
with what the conclusion means in terms of the original
problem. The report should be in narrative format like you were
writing for a newspaper or magazine, must be typed, printed,
and should be double spaced.
An excellent final report (100 points) will have the following
components.
· An introduction to the problem including the claim(s) being
tested
· The context (who, what, where, when, why, how) of the data

(remember this is in narrative format) and any possible
problems with collecting the data
· Descriptive statistics and/or tables depending on your type of
data
· Appropriate graphs (every project should have at least one
graph or chart of the data in it)
· Inferential statistics including ...
· the null and alternative hypotheses written symbolically
· statistical output including a test statistic and p-value
· a graph showing the critical and non-critical regions, test
statistic, and p-value
· the decision and a conclusion written in terms of the original
claim
· Conclusion
· Suggestions for the next time this project is done
· No statistical usage errors
What can we test?
Some things are easier to test than other things. The purpose of
this project is to expose you to the process of hypothesis testing
in a real-world application. You may test means, proportions, or
linear correlation. You may have one or more samples. You may
categorize your variables in one or two ways.
If you are dealing with one sample, then you will need some
numerical value to test against. The claim "more people prefer
Pepsi than Coke" becomes a claim that the proportion of Pepsi
drinkers is greater than 0.5. There are not two independent
samples (Pepsi drinkers / Coke drinkers), just one sample
categorized in two ways. A problem with the Pepsi / Coke thing
is that it omits other soft drinks because that is more difficult to
do. A chi-square goodness of fit test would be more appropriate
in this case.
Categorical Data
If your data consists solely of categories and not measured
quantities, then you should be looking at proportions or counts.

Things to look for that let you know you're dealing with
categorical data or proportions include: proportions, percents,
counts, frequencies, fractions, or ratios. If your data consists of
names or labels, you're dealing with categorical data.
You really need to think about the response that was recorded
for each case. Did you record a yes/no response for each case
or did you record a number that means something? If it was a
yes/no or other categorical data, then this is the place to be.
Example Claims about Categorical Data
· 93.1% of Americans feel there should not be nudity on
television during children's viewing time.
http://www.parentstv.org/PTC/publications/lbbcolumns/2003/0528.asp
This is a claim about a single proportion. We know this because
the value includes a percentage and the data is categorical (yes
or no), not numerical. The original claim here could be written
as p=0.931.
Quantitative (Numerical) Data
If your data consists of measured quantities, then you will
probably be testing a mean or perhaps correlation between two
variables. It is possible to test a claim about a standard
deviation, but that is rare, and not covered in this course.
There are four main ways to analyze means.
1. A test about a single mean that requires a number as the
claimed value.
2. A test about two independent means doesn't need a number
because you compare them to each other. This compares the
same thing in two different groups.
3. A test for two dependent means, often called paired samples,
compares two values for each case in the same group.
4. The Analysis of Variance is an extension of the two
independent samples case where there are more than two
groups.

You can also perform correlation and regression with two
quantitative variables. Simple regression, with just one
predictor variable, is covered in the book. Multiple regression,
with several predictor variables, is not covered in the textbook
but is available online.
Example Claims about Quantitative Data:
· Women live five years longer than men.
http://www.medicalnewstoday.com/medicalnews.php?newsid=18866
This is a claim about two averages, the average lifespan of
women and that of men. We don't need to know the average for either
gender (although they're given in the article); we just know that women
are supposed to live five years longer than men. When you're
working with one sample, it's important to have a value to
compare against, but with two samples, you don't need a value
for each, just the difference between the two (in this case 5
years). The original claim here could be written as μw-μm=5
(the difference in the mean ages of women and men is 5 years).
· Seat belts save lives.
http://dot.state.il.us/trafficsafety/seatbelt june 2006.pdf and
http://www-fars.nhtsa.dot.gov/FinalReport.cfm?stateid=17&title=states&title2=fatalities_and_fatality_rates&year=2005
Okay, this claim
is all over the place, but I wanted to give some links on how it
would be tested.
You could take the data regarding the percent of people wearing
their seat belts and compare it to the fatality rate. These are two
numerical values that are paired together for each case
(probably based on an annual report). Remember that you
cannot perform correlation and regression with categorical
variables. The original claim that seat belts save lives would be
interpreted as a negative correlation (as seat belt use goes up,
fatalities go down) and would be written as ρ<0.
Sample Final Report
Available online are sample projects and resources. Your
project may not be as long or detailed.

Assignment is due April 15th, either electronically prior to the
start of class or a hard copy at the start of class.
Hypothesis testing

Hypothesis testing: procedure
1. We ask a yes/no question about a population.
2. We answer the question yes, and answer the question no, using
symbols for the population means.
3. We label one answer the null hypothesis and the other answer
the alternative hypothesis.
4. We decide the criterion for rejecting the null hypothesis.
5. The test is one of: two-tailed, right-tailed, or left-tailed.
6. We take a sample, and calculate our test statistic (Z or t for now).
7. We find if the observed test statistic is in the rejection region
(critical region or tail) of the distribution.
8. If the statistic is in the rejection region, we reject the null
hypothesis and accept the alternative hypothesis.
9. If the statistic is not in the rejection region, we retain the null
hypothesis, and do not accept the alternative hypothesis.

STATISTICS PROJECT: Hypothesis Testing
INTRODUCTION
My topic is the average tuition cost of a 4-yr. public college.
Since I will soon be transferring to a 4-yr. college, I thought
this topic would be perfect. "The College Board" says that the
average tuition cost of college is $5836 per year. I will be
researching online the costs of different public colleges to test
this claim. I will be using the T-test for a mean, since my
sample size will be less than 30 and the population standard
deviation is unknown. I will also use the Chi-Square Test of
Independence.
HYPOTHESIS
I think the average cost of tuition is lower than the average
stated by “The College Board”.
H0: μ ≥ $5836
H1: μ < $5836 (claim)
DATA ANALYSIS
I collected my data from various college websites. I looked up
the cost of tuition per year and the number of students enrolled.
Here is what I came up with:

College                          Tuition   Number of Students
Central Washington University    $4392     10,200
University of Washington         $5985     25,469
Washington State University      $5888     18,432
Western Washington University    $4356     13,000
Evergreen State University       $4590     4400
Eastern Washington University    $5904     10,000
Peninsula College                $3639     10,120
University of Oregon             $6174     20,394
Portland State University        $5208     24,284
Oregon State University          $5604     19,362
Southern Oregon University       $5233     5000
Eastern Oregon University        $4500     3000
Western Oregon University        $5763     4500
University of Idaho              $4410     11,739
Idaho State University           $4400     13,000
There weren’t really any large gaps or outliers in the data that I
collected. There was a gap between 5,000 – 10,000 students.
But the rest was mostly consistent. The lowest tuition was
$3639 from Peninsula College and the highest tuition was
$6174 from the University of Oregon. On some of the websites it
was hard to find the information I wanted, but I eventually
found it. Some of the websites were specific as to undergraduate
or graduate and some probably contain both. I should have done
further research to make sure that my numbers only contain
undergraduates and not graduates. So, that is one possible
mistake in the data collection.
HYPOTHESIS TESTING
T-Test for a Mean
Step 1: State the hypothesis and identify the claim.
I claim that the average cost of college tuition is less than
the $5836 per year reported by “The College Board”. At
α = .025, can it be concluded that the average is less than $5836
based on a sample of 15 colleges?
H0: μ ≥ $5836
H1: μ < $5836 (claim)
Step 2: Find the critical value
At a=.025 and d.f. = 14, the critical value is -2.145.
Step 3: Compute the sample test value. m= 5069.73, s=787.80
t= (5069.73-5836)/(787.80/sqrt(15)) = -3.767
Step 4: Make the decision to reject or not reject the null
hypothesis. Reject the null hypothesis since -3.767 falls in the
critical region.
Step 5: Summarize the results.
I reject the null hypothesis since there is enough evidence
to support the claim that the average cost of tuition is less than
$5836 per year.
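The test value in Step 3 can be reproduced with a short Python sketch from the fifteen tuition values in the data table. The critical value of -2.145 is taken from the report rather than computed.

```python
import math
import statistics

# Tuition values from the data table.
tuition = [4392, 5985, 5888, 4356, 4590, 5904, 3639, 6174,
           5208, 5604, 5233, 4500, 5763, 4410, 4400]

claimed_mean = 5836
critical_value = -2.145  # from the report: alpha = .025, d.f. = 14, left tail

n = len(tuition)
sample_mean = statistics.mean(tuition)
sample_sd = statistics.stdev(tuition)  # sample standard deviation (n - 1 divisor)

# t = (sample mean - claimed mean) / (s / sqrt(n))
t = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))

# Left-tailed test: reject the null hypothesis if t falls below the critical value.
reject_null = t < critical_value
print(round(sample_mean, 2), round(sample_sd, 2), round(t, 3), reject_null)
```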
Chi-Squared Independence Test
Step 1: State the hypotheses and identify the claim.
I claim that there is a correlation between the number of
students at a college and the cost of tuition per year. Here is
the data that I collected:
Cost of Tuition   Number of Students                                          Total
                  3000-9,999   10,000-16,999   17,000-23,999   24,000-30,999
$3500-4500        1            5               0               0               6
$4501-5500        2            0               0               1               3
$5501-6500        1            1               3               1               6
Total             4            6               3               2               15
At .025, can we conclude that the cost of tuition is dependent on
the number of students?
Ho: The cost of tuition is independent of the number of students
that attend the college. (x²=0)
H1: The cost of tuition is dependent on the number of students

that attend the college. (claim) (x²>0)
Step 2: Find the critical value:
The critical value is 14.449 since the degrees of freedom are (3-
1)(4-1)=6.
Step 3: Compute the test value.
First we have to find the expected value:
E1,1 = (6)(4)/15=1.6
E2,1 = (3)(4)/15=.8
E3,1 = (6)(4)/15=1.6
E1,2 = (6)(6)/15=2.4
E2,2 = (3)(6)/15=1.2
E3,2 = (6)(6)/15=2.4
E1,3 = (6)(3)/15=1.2
E2,3 = (3)(3)/15=.6
E3,3 = (6)(3)/15=1.2
E1,4 = (6)(2)/15=.8
E2,4 = (3)(2)/15=.4
E3,4 = (6)(2)/15=.8
The completed table is shown:
Cost of Tuition   Number of Students                                          Total
                  3000-9,999   10,000-16,999   17,000-23,999   24,000-30,999
$3500-4500        1 (1.6)      5 (2.4)         0 (1.2)         0 (.8)          6
$4501-5500        2 (.8)       0 (1.2)         0 (.6)          1 (.4)          3
$5501-6500        1 (1.6)      1 (2.4)         3 (1.2)         1 (.8)          6
Total             4            6               3               2               15
Then the test value is x² = ∑ (O-E)²/E
= (1-1.6)²/1.6 + (5-2.4)²/2.4 + (0-1.2)²/1.2 + (0-.8)²/.8
+ (2-.8)²/.8 + (0-1.2)²/1.2 + (0-.6)²/.6 + (1-.4)²/.4
+ (1-1.6)²/1.6 + (1-2.4)²/2.4 + (3-1.2)²/1.2 + (1-.8)²/.8
= 13.333

Step 4: Make the decision to reject or not to reject the null
hypothesis. Do not reject the null hypothesis since 13.333 is
less than 14.449.
Step 5: Summarize the results.
There is not enough evidence to support the claim that the cost
of tuition is dependent on the number of students that attend the
college.
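The expected values and the test value can be reproduced with a short Python sketch from the observed counts in the report. The critical value of 14.449 is taken from the report rather than computed.

```python
# Observed counts from the report: rows are tuition ranges,
# columns are ranges for the number of students.
observed = [
    [1, 5, 0, 0],  # $3500-4500
    [2, 0, 0, 1],  # $4501-5500
    [1, 1, 3, 1],  # $5501-6500
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: (row total)(column total) / grand total.
expected = [[r * c / grand_total for c in column_totals] for r in row_totals]

# Test value: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
critical_value = 14.449  # from the report: alpha = .025, d.f. = 6

print(round(chi_square, 3), degrees_of_freedom, chi_square > critical_value)
```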
SUMMARY
My first hypothesis test about the tuition cost of 4-year
universities being less than the average was correct. The
average as stated by “The College Board” said that the tuition
was $5836 per year. I thought that was a little high. The average
tuition of the fifteen colleges that I researched was $5069.73.
Maybe if I would have researched colleges all around the
country instead of just our surrounding states I would have
come up with different numbers. Another thing that may have
caused this test to be a little off was that when I was collecting
data, some of the costs of tuition may include other fees and
some may not. When I looked them up, some fees were listed
separately and some were not. This could have led to a Type I
error where the null hypothesis was true and it was rejected.
My second hypothesis, that the cost of tuition is
dependent on the number of students that attend the college, was
not supported. I thought that the fewer the students that attend a
specific college, the cheaper tuition would be, but that wasn’t
the case. One main problem I can see with collecting my data is
that on the college websites for the number of students, some
said “over” or “approximately”. So, these weren’t the exact
numbers of students enrolled. Also, as stated earlier, some of
the students could be undergraduates or graduates. Some of the
websites didn’t list them separately. Tuition is higher for
graduates, so they should not have been included in this study

and it would have thrown off the number of students. So, these
may have affected the outcome a little, but I don’t think
enough for it to change the hypothesis.
It would have also been interesting to test to see whether the
tuition is higher in urban areas where more people live versus
rural areas where there are not as many people. I would be
inclined to say that this is true, but it
would need to be tested further to say for sure. It would also be
interesting to do this same testing for private colleges to see if
they have the same results. I thought this was fun to come up
with our own hypothesis and try to prove ourselves right or
wrong using what we have learned all quarter. It was a good test
of our skills and it made me get a better understanding of how
the formulas really work rather than just doing the homework
examples in the book.
