Will Shonk
BANA STATS
3.) This problem will use the "GSS_2012.csv" data in the "Data Sets" folder. This data set
contains the results of the 2012 General Social Survey (GSS). Think of an issue that
interests you either personally, academically, or professionally that could be addressed
by examining the relationship between two categorical or ordinal variables. For
example, you might be interested in the relationship between marital status and job
satisfaction. You can find a list of the variable names as well as what they stand for from
the GSS website: GSS documentation
a). Briefly state the issue and why it interests you.
I am going to compare EDUC (respondent's education) with SEX (respondent's gender) in one
pivot table, and then compare ABPOOR (low income, can't afford more children) with CONDOM
(did the respondent use a condom the last time he/she had sex). As a graduate student I value
education, and I had heard that more females than males were graduating, so I wanted to test
the claim. The ABPOOR vs. CONDOM comparison intrigued me because I wanted to see whether
people who had low income and couldn't afford more kids were actually using condoms and being
responsible enough to avoid pregnancy until their financial situation was better suited to
raising children.
b). Make a side-by-side bar chart of the distribution of the response variable by each
level of the explanatory variable. To do this using an Excel pivot table, first select the
table, right click, and go to Show Values As -> % of Row Total. Then insert a pivot chart.
Be sure that the chart is easy to read and labeled properly so that anyone looking at your
paper could tell what is being plotted.
I used ABPOOR as the explanatory (independent) variable and CONDOM as the response
(dependent) variable.
I was not able to delete the NA values, since a pivot table will not let you delete part of a
variable (the NA portion), so I copied the values and the format, entered the row percentages,
and created the following tables without the NAs.
It was necessary to add the axis titles to the graph: go to the LAYOUT tab > AXIS TITLES, and
then also add a chart title. The explanatory variable, ABPOOR (can the respondent afford more
children), is on the X axis. The response (dependent) variable, CONDOM (did the respondent use
a condom the last time he/she had sex), is on the Y axis.
I made a second chart for education by males vs. females. I had to copy the values and the
format to make a presentable table, as shown below, instead of using the original 20
ungrouped rows.
I grouped the years of education into "no high school" = 1-12 years of education (group 1);
"high school" = 12 years of education and "some college" = 13-15 years of education (group 2);
and "bachelors" = 16 years, "post undergraduate" = 17 years, "graduate" = 18 years, and
"PhD" = 19-20 years (group 3).
c). Conduct the chi-squared test of independence, either using the pivot table or using R.
Use the four-step procedure, explaining each step, and state the conclusion in the
context of the problem. In R, go to Statistics -> Contingency Tables -> Two-Way Table.
This will let you select the row and column variables from the data set. If you want to
enter the data directly, select Enter and analyze two-way table. With either approach,
select the "Statistics" tab under "Hypothesis tests," and tick the first three boxes. If
using Excel, you must first make a table of expected frequencies and then use the
CHITEST() function, which returns the p-value of the test.
Below I carried out the chi-squared test of independence in Excel by building a table of
expected values.
1. Sum each column.
2. Sum each row.
3. Sum all the frequencies to get the total sample size.
4. Total females (column total) / total sample size = female proportion = .55 (rounded)
5. Total males (column total) / total sample size = male proportion = .45 (rounded)
6. Female proportion (.55) * row total ("no high school" row, 318) = expected number of females
for that row (expected value 175.44 females).
7. Male proportion * row total = expected number of males for that row.
8. In the expected-value formulas in steps 6-7, add "$" signs to the female/male proportion
cell reference to keep it fixed, then drag the formula down to get the expected values for the
remaining rows.
9. Perform the chi-squared test with the formula =CHITEST(original values, expected values).
10. CHITEST returns .26, which is the p-value. We compare the p-value to alpha, which is .05
(95% confidence).
11. .26 > .05, therefore we fail to reject the null hypothesis that education and gender are
independent. We do not have enough evidence of an association; more data is needed.
12. The correlation (=CORREL) between the original values and the expected values is .996, so
there is a very strong positive correlation between them.
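The expected-value steps above can also be sketched in code. This is a minimal Python sketch using only the standard library. The counts in `observed` are hypothetical placeholders (not the GSS counts), and the function names are illustrative assumptions. It builds each expected count as row total * column total / n, which is the same as the column proportion * row total used in steps 6-7, and then computes the upper-tail chi-squared probability, the kind of p-value CHITEST returns.

```python
import math

def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over every cell."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable (series expansion)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return max(0.0, 1.0 - lower)

def chi_squared_test(observed):
    """Return (statistic, degrees of freedom, p-value) for a two-way table."""
    expected = expected_counts(observed)
    stat = chi_squared_statistic(observed, expected)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, chi2_sf(stat, df)

# Hypothetical counts: rows = education groups, columns = (female, male).
observed = [[170, 148], [420, 330], [260, 222]]
stat, df, p_value = chi_squared_test(observed)
print(f"chi-squared = {stat:.3f}, df = {df}, p-value = {p_value:.3f}")
```

As in step 11, a p-value above .05 would mean failing to reject independence.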
d). Calculate Cramer’s coefficient and explain what it tells you about the relationship
between the variables you calculated.
Cramer's V is used to calculate correlation in tables that are greater than 2 x 2 columns and
rows. Cramer's V is between 0 and 1. A value close to 0 means that there is little association
between the variables. A Cramer's V close to 1 means there is a strong association.
I found the X squared by plugging in the years of education and genders from above into Rcmdr,
STATISTICS > CONTINGENCY TABLES > ENTER AND ANALYZE TWO-WAY TABLE.
N is large, but we still fail to detect dependence: a Cramer's V of .078 is close to 0, not
close to 1, so the association between education and gender is weak.
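For reference, Cramer's V is computed from the chi-squared statistic as V = sqrt(X squared / (n * min(rows - 1, columns - 1))). The short Python sketch below shows the arithmetic. The input numbers are hypothetical, not the values from the Rcmdr output, and the function name is an illustrative assumption.

```python
import math

def cramers_v(chi_squared, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi^2 / (n * min(rows - 1, cols - 1)))."""
    k = min(n_rows - 1, n_cols - 1)
    return math.sqrt(chi_squared / (n * k))

# Hypothetical inputs: a 3 x 2 table with 2000 respondents.
v = cramers_v(chi_squared=12.0, n=2000, n_rows=3, n_cols=2)
print(f"Cramer's V = {v:.3f}")
```

Because n is in the denominator, a large sample with a small chi-squared statistic gives a V close to 0.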
4.) Open the "AmesHousing.csv" data set again and create a new variable named "TotalSF"
to represent the total square footage of a house. It will be composed of the variables
"First_Flr_SF," "Low_Qual_Fin_SF," "Open_Porch_SF," "Scnd_Flr_SF," and "TotalBsmtSF"
added together. As a check, if you do this correctly, the total square footage for the
first house should be 2798. Also, create a variable called "SalesPrice000s" to represent
the sales price of a home in thousands.
Using R: create the new variable with DATA > Manage Variables in active data set > Compute new
variable. Enter "To.talSF" in the New variable name field, and under Expression to compute add
the variables "X1st.Flr.SF," "Low.Qual.Fin.SF," "Open.Porch.SF," "X2nd.Flr.SF," and
"Tota.lBsmt.SF":
"X1st_Flr_SF+Low_Qual_Fin_SF+Open_Porch_SF+X2nd_Flr_SF+TotalBsmtSF"
NOTE: each _ must be replaced with . and there should be no quotation marks in the actual
variable name or formula.
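The same two derived variables can be sketched outside of R Commander. The Python sketch below uses the column names from the problem statement; the price column name "SalePrice" is an assumption, and the example row holds illustrative values chosen to be consistent with the stated check of 2798 for the first house, not values read from the file.

```python
# Component columns named in the problem statement.
COMPONENTS = [
    "First_Flr_SF",
    "Low_Qual_Fin_SF",
    "Open_Porch_SF",
    "Scnd_Flr_SF",
    "TotalBsmtSF",
]

def add_derived_columns(row, price_column="SalePrice"):
    """Return a copy of the row with TotalSF and SalesPrice000s added."""
    out = dict(row)
    out["TotalSF"] = sum(float(row[name]) for name in COMPONENTS)
    out["SalesPrice000s"] = float(row[price_column]) / 1000.0
    return out

# Illustrative row (assumed values, not read from AmesHousing.csv).
example = {
    "First_Flr_SF": 1656,
    "Low_Qual_Fin_SF": 0,
    "Open_Porch_SF": 62,
    "Scnd_Flr_SF": 0,
    "TotalBsmtSF": 1080,
    "SalePrice": 215000,
}
result = add_derived_columns(example)
print(result["TotalSF"], result["SalesPrice000s"])
```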
a). Produce a scatter plot of SalesPrice000s by TotalSF. State whether the appearance of
the plot makes sense to you and why you feel that way.
In R:
X is the explanatory variable and Y is the response (dependent) variable; that is, the prices
of homes depend on the total square footage.
In Excel:
The line of best fit is a straight, upward-sloping line, meaning that the paired data have a
positive linear correlation. That is, price (the dependent variable) increases as total square
footage (the independent variable) increases.
b). Your supervisor looks at the scatter plot in (a) and does not like the outliers. Produce
a scatter plot with the outliers you see in (a) removed. You can delete these
observations in R or delete them in the Excel file and then re-import the data into R.
Comment on the effect that removing outliers has on the appearance of the plot.
Below is a box plot that shows the outlying data, marked by zeros.
[Figure: scatter plot "X House sq footage effect on Y house price"; X axis: House sq footage
(0 to 8000); Y axis: House price in thousands (-100 to 800)]
Below is a histogram in which the outlier data has been highlighted in yellow. The outlier
data is located where the price range starts at about $375,000 (the definition of an outlier
is arbitrary, so I defaulted to R's outlier points, marked as the zeros above).
We will remove the outliers and graph a new scatter plot:
Removed the values corresponding to sales prices under $75,000 and greater than $375,000,
which represent the outliers in the histogram above.
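The removal rule above can be written as a simple filter. This is a minimal Python sketch; the list of houses is hypothetical, and the cutoffs of 75 and 375 (in thousands of dollars) come from the text. Prices exactly at a cutoff are kept, since only values under $75,000 or greater than $375,000 are removed.

```python
def remove_price_outliers(rows, low=75.0, high=375.0, key="SalesPrice000s"):
    """Keep rows whose price (in thousands) lies within [low, high]."""
    return [row for row in rows if low <= row[key] <= high]

# Hypothetical houses: total square footage and price in thousands.
houses = [
    {"TotalSF": 900, "SalesPrice000s": 60.0},
    {"TotalSF": 1400, "SalesPrice000s": 75.0},
    {"TotalSF": 2800, "SalesPrice000s": 215.0},
    {"TotalSF": 4200, "SalesPrice000s": 375.0},
    {"TotalSF": 6100, "SalesPrice000s": 610.0},
]
kept = remove_price_outliers(houses)
print(len(houses) - len(kept), "rows removed")
```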
In Excel:
[Figure: scatter plot "X House sq footage effect on Y house price removed outliers" with
trend line y = 0.0695x - 2.8554, R² = 0.6205; X axis: House sq footage (0 to 6000);
Y axis: House price in thousands (0 to 400)]
c). Explain why what you did in (b), that is, just removing the outliers and making a new
plot, might be considered unethical. Discuss one way you could justify what you did in
(b).
This could be considered unethical because I manipulated the data so that it is no longer
representative of the true population. The aggregates no longer take all of the data into
account, so the information is portrayed incorrectly, and I could use it to support a bias.
However, this could be justified if I presented the information while making it known that the
outliers were removed in order to prevent skewed aggregates. The bulk of the data is then more
representative of the mean of the sample, and therefore gives a more accurate generalization
of what to expect from the population.
d). Suppose your supervisor wants to predict SalePrice000s using TotalSF using least-
squares linear regression. Explain, conceptually, the idea behind the least-squares
method. You shouldn’t include any calculations.
The idea behind the least-squares method is to predict the value of a dependent variable Y
from the value of an independent variable X. It describes how Y tends to change with X; by
itself it does not prove a cause-and-effect relationship.
Linear regression calculates a straight line, called the "least-squares regression line," that
minimizes the sum of the squared vertical differences between the line and all of the data
points. In that sense it is the line that best fits ALL the data points. The line has a fixed
slope, so to make a prediction you find a specific square footage on the X axis, move up to the
line, and read the corresponding price off the Y axis.
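To make the idea concrete, the Python sketch below fits a line with the usual closed-form least-squares formulas and then checks that nudging the slope makes the sum of squared differences larger. The data points are hypothetical (square footage and price in thousands), and the function names are illustrative assumptions.

```python
def sum_squared_errors(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: square footage and price in thousands.
xs = [1000, 1500, 2000, 2500, 3000]
ys = [70, 100, 135, 175, 200]
a, b = least_squares_fit(xs, ys)
best = sum_squared_errors(xs, ys, a, b)
worse = sum_squared_errors(xs, ys, a, b * 1.05)
print(f"intercept = {a:.3f}, slope = {b:.4f}, best fit beats nudged line: {best < worse}")
```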
e). Fit the regression line in (d) to the data set with the outliers removed using Excel or
R. Display the output from whatever program you use (a screen shot will probably be
easiest), and write down the estimated regression equation.
Using Excel, I removed the rows with the outliers stated above (omitting sales prices under
$75,000 and above $375,000). Then I selected the INSERT tab > SCATTER, right-clicked the graph,
and chose Select Data > Add. The X series is the total square footage and the Y series is the
sales price; I selected all the values in the SalesPrice000s and to.talSF variables with
ctrl+shift+down.
Once it is plotted, I add X and Y axis labels by selecting the graph, then LAYOUT tab > AXIS
TITLES, and then CHART TITLE.
Lastly, add a trend line by selecting a data point, then right click > ADD TRENDLINE > check
DISPLAY EQUATION.
The equation for the regression line without the outliers is shown above the points in the
scatter plot, y = 0.0695x - 2.8554. The R^2, or the proportion of the variation in sales price
that is explained by square footage, is 0.6192. If you take the square root of that you get the
correlation, r, which is 0.787; alternatively, enter "=correl(values of house price, values of
house square footage)" in a cell. With a correlation this high, close to the 0.80 benchmark, we
can infer that when the square footage of a home increases, the sales price of the house also
tends to increase.
Y is the price, the response or dependent variable.
X is the square footage, the explanatory or independent variable.
f). Interpret the slope and intercept term in (e) in the context of the problem.
y = a + bx, where a is the y-intercept and b is the slope of the straight line that predicts
"Yhat" (the estimated mean of Y at a given x). This line is the "least-squares line" because,
of all possible lines, it has the smallest sum of squared differences between each data point
and its estimated point on the line.
y = bx + a -> y = 0.0695x - 2.8554
Note: the values are in thousands, so the values must be moved three decimal places to the
right, that is, multiplied by 1,000.
y = 69.538x - 2855.4
a = y-intercept: the predicted value at x = 0, no square footage. It says, "when a house has 0
square feet, its average price is -$2,855.40." A house cannot have zero square feet or a
negative price, so the intercept has no practical meaning here.
b = gradient/slope: the increase in price per one square foot. It says, "for every additional
square foot of a house, there is a $69.54 increase in predicted price."
For every one-unit increase in X, Y goes up by the amount of the slope, which is the predicted
price increase.
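As a small worked example of reading the slope and intercept, the Python sketch below plugs square footage into the displayed trend-line equation and converts the result from thousands to dollars. The coefficients are the rounded values shown on the chart (0.0695 and -2.8554), so the per-square-foot change comes out as $69.50 rather than the unrounded $69.538; the function names are illustrative assumptions.

```python
# Rounded coefficients from the displayed trend line (price in thousands).
SLOPE_THOUSANDS = 0.0695
INTERCEPT_THOUSANDS = -2.8554

def predicted_price_thousands(total_sf):
    """Predicted price in thousands of dollars for a given square footage."""
    return INTERCEPT_THOUSANDS + SLOPE_THOUSANDS * total_sf

def predicted_price_dollars(total_sf):
    """Same prediction converted to dollars (multiplied by 1,000)."""
    return 1000.0 * predicted_price_thousands(total_sf)

# Two houses one square foot apart differ by the slope.
difference = predicted_price_dollars(2001) - predicted_price_dollars(2000)
print(f"2000 sq ft: ${predicted_price_dollars(2000):,.2f}")
print(f"one extra sq ft: ${difference:.2f}")
```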
g). Using software, calculate a 95% confidence interval for the slope parameter
and interpret the interval in the context of the problem. Show the components of the
calculation.
In Excel, go to DATA > DATA ANALYSIS (on the far right) and choose Regression, then input the
SalePrice values as the Y (dependent) range and the TotalSF values as the X (independent)
range.
Check the Labels box.
Check the Confidence Level box, set at 95%.
We are 95% confident, based on the method used to calculate the interval, that for every
additional square foot of the house the sales price increases by between $67.51 and $71.56 on
average.
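The components of the interval are the point estimate, the critical value, and the standard error of the slope: estimate +/- critical value * standard error. The standard error is not shown in the text, so the Python sketch below backs it out from the reported interval. It assumes a large sample, where the t critical value that Excel uses is very close to the normal value of about 1.96. The back-calculated standard error is therefore an approximation, not a number taken from the Excel output.

```python
from statistics import NormalDist

def slope_confidence_interval(slope, std_error, confidence=0.95):
    """Large-sample interval: slope +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    margin = z * std_error
    return slope - margin, slope + margin

def implied_components(lower, upper, confidence=0.95):
    """Back out the estimate, margin of error, and standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    estimate = (lower + upper) / 2.0
    margin = (upper - lower) / 2.0
    return estimate, margin, margin / z

# Reported interval for the slope, in dollars per square foot.
estimate, margin, std_error = implied_components(67.51, 71.56)
print(f"estimate = {estimate:.3f}, margin = {margin:.3f}, std error = {std_error:.3f}")
```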
h). What does the interval imply about the test of H0: B1 = 0 and why?
The hypothesis above is the most common inference in regression: is X related to Y? The idea
is that confidence intervals and tests get at the same question.
If the confidence interval contained the null-hypothesis value of B1, which is 0, we would fail
to reject B1 = 0. Here the interval ($67.51 to $71.56) does not contain 0, so we reject B1 = 0.
Equivalently, if the p-value is less than or equal to alpha we reject H0, and if the p-value is
greater than alpha we fail to reject.
The F statistic is 4,532 and the p-value of the test is reported as 0.
Therefore we reject H0, since 0 < .05.
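The link between the interval and the test can be checked in a few lines. The Python sketch below tests whether the reported interval excludes 0 and uses the fact that, in a regression with a single predictor, the F statistic equals the square of the slope's t statistic. The function names are illustrative assumptions.

```python
import math

def interval_excludes(lower, upper, null_value=0.0):
    """True when the null value lies outside the confidence interval."""
    return not (lower <= null_value <= upper)

def t_from_f(f_statistic):
    """In simple linear regression (one predictor), F equals t squared."""
    return math.sqrt(f_statistic)

# Reported 95% interval for the slope and reported F statistic.
reject_null = interval_excludes(67.51, 71.56)
t_statistic = t_from_f(4532)
print(f"reject H0: {reject_null}, |t| = {t_statistic:.2f}")
```

A t statistic this far from 0 is consistent with a p-value that Excel displays as 0.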

assignment 2

  • 1. Will Shonk BANA STATS 3.) This problem will use the "GSS_2012.csv" data in the "Data Sets" folder. This data set contains the results of the 2012 General Social Survey (GSS). Think of an issue that interests you either personally, academically, or professionally that could be addressed by examining the relationship between two categorical or ordinal variables. For example, you might be interested in the relationship between marital status and job satisfaction. You can find a list of the variable names as well as what they stand for from the GSS website: GSS documentation a). Briefly state the issue and why it interests you. I am going to compare EDUC, respondent’s education with SEX, respondent’s gender, in one pivot table and then use ABPOOR, low income can’t afford more children with CONDOM, did the respondent use a condom the last time he/she had sex. As a graduate student I value education and I heard that there were more females graduating then males so I wanted to test the claim. The low income and can’t afford more children vs. condom usage data intrigued me because I wanted to see if people who had low income and couldn’t afford more kids were actually using condoms and being responsible enough to avoid pregnancy till their financial situation was more accommodating to rear children. b). Make a side-by-side bar chart of the distribution of the response variable by each level of the explanatory variable. To do this using an Excel pivot table, first select the table, right click, and go to Show Values As -> % of Row Total. Then insert a pivot chart. Be sure that the chart is easy to read and labeled properly so that anyone looking at your paper could tell what is being plotted. I used use the ABPOOR as the explanatory/independent variable and the CONDOMvariable is the response/ dependent variable. 
I was not able to delete the NA values since pivot tables will not allow you to delete parts of a variable (the NA portion) so I copied the values and the format and put in the percentages by row and created the following tables without the NA’s.
  • 2. It was necessary to go in and add in the axis titles for the graph. Go to LAYOUT tab > AXIS TITLES and then also add a title. The explanatory is on the X axis, ABPOOR, can the respondent afford more children. And the Y axis is the response, or the dependent, CONDOM, did respondent use a condom last time had sex. I made a second chart for education and males vs females. I had to copy the values and the format to make it into a presentable table as shown below instead of with the original 20 rows and not in groups.
  • 3. I grouped the years of education into “ no highschool”= 1 -12 years of education ( group1) “highschool”= 12 years of education, “some college”=13-15 years of education ( group2), “bachelors”= 16 years, “post undergraduate”= 17 years, “graduate”= 18 years, and “PHD”= 19- 20 years (group3 c). Conduct the chi-squared test of independence, either using the pivot table or using R. Use the four-step procedure, explaining each step, and state the conclusion in the context of the problem. In R, go to Statistics -> Contingency Tables -> Two-Way Table. This will let you select the row and column variables from the data set. If you want to enter the data directly, select Enter and analyze two-way table. With either approach, select the "Statistics" tab under "Hypothesis tests," and tick the first three boxes. If using Excel, you must first make a table of expected frequencies and then use the CHITEST() function, which returns the p-value of the test. Below I created a chi-squared test of independence in Excel making an expected values chart.
  • 4. 1. Sum each column.
       2. Sum each row.
       3. Sum all the frequencies to get the total sample size.
       4. Total females (column) / total sample size = female proportion = .55.
       5. Total males (column) / total sample size = male proportion = .45.
       6. Female proportion (.55) * row total (No High School, 318) = expected number of females for that row (expected value 175.44 females).
       7. Male proportion * row total = expected number of males for that row.
       8. In the expected-value formulas in steps 6-7, add the "$" sign to the female/male proportion cell reference to keep it fixed, then drag the formula down to get the expected values for the remaining rows.
       9. Perform the chi-squared test with the formula =CHITEST(original values, expected values).
       10. CHITEST returns .26, which is the p-value. We compare the p-value to alpha, which is .05 for 95% confidence.
       11. .26 > .05, so we fail to reject the null hypothesis; we do not have enough evidence, and more data is needed.
  • 5. 12. Correlation (=CORREL) = .996, so there is a very strong positive correlation between the original values and the expected values. d). Calculate Cramer’s coefficient and explain what it tells you about the relationship between the variables you calculated. Cramer’s V is used to measure association in tables that are larger than 2 x 2 (columns and rows). Cramer’s V is between 0 and 1. A value close to 0 means that there is little association between the variables. A Cramer’s V close to 1 means there is a strong association. I found the X-squared value by plugging the years of education and genders from above into Rcmdr: STATISTICS > CONTINGENCY TABLES > ENTER AND ANALYZE TWO-WAY TABLE. N is large, but we still fail to detect dependence: .078 is not close to 1.
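The expected-frequency steps in (c) and the Cramer's V calculation in (d) can be sketched together in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Excel or Rcmdr output: the function names are made up for the example, the counts are hypothetical (not the education-by-gender table), and Cramer's V is taken as the square root of X-squared / (n * (min(rows, columns) - 1)).

```python
import math


def expected_counts(observed):
    """Expected count for each cell = row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


def cramers_v(observed):
    """Cramer's V = sqrt(chi-squared / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed), len(observed[0]))
    return math.sqrt(chi_squared(observed) / (n * (k - 1)))


# Hypothetical 3 x 2 table (rows: education groups, columns: female, male).
# These are NOT the GSS counts.
observed = [[20, 30], [30, 20], [25, 25]]

print(expected_counts(observed))
print(chi_squared(observed))
print(cramers_v(observed))
```

The test statistic is compared against a chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom to get the p-value that CHITEST reports; that last step is left out here to keep the sketch free of external libraries.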
  • 6. 4.) Open the "AmesHousing.csv" data set again and create a new variable named "TotalSF" to represent the total square footage of a house. It will be composed of the variables "First_Flr_SF," "Low_Qual_Fin_SF," "Open_Porch_SF," "Scnd_Flr_SF," and "TotalBsmtSF" added together. As a check, if you do this correctly, the total square footage for the first house should be 2798. Also, create a variable called "SalesPrice000s" to represent the sales price of a home in thousands. Using R: to add a new variable, go to DATA > Manage variables in active data set > Compute new variable. Enter “To.talSF” in the New variable name field. Under Expression to compute, add the variables “X1st.Flr.SF,” “Low.Qual.Fin.SF,” “Open.Porch.SF,” “X2nd.Flr.SF,” and “Total.Bsmt.SF”: "X1st_Flr_SF+Low_Qual_Fin_SF+Open_Porch_SF+X2nd_Flr_SF+TotalBsmtSF". NOTE: each _ must be replaced with a ., and there should be no quotation marks in the actual variable names or formula. a). Produce a scatter plot of SalesPrice000s by TotalSF. State whether the appearance of the plot makes sense to you and why you feel that way. In R: X is the explanatory variable and Y is the response (dependent) variable; that is, the prices of homes depend on the total square footage.
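The two derived variables from problem 4 can be sketched in Python as follows. This is a minimal illustration rather than the Rcmdr steps: the component column names follow the assignment text, the raw price column name `SalePrice` is an assumption, and the sample house is hypothetical (it is not the first house in the data set, so its total is not 2798).

```python
# Component columns named in the assignment text.
SF_COLUMNS = [
    "First_Flr_SF",
    "Low_Qual_Fin_SF",
    "Open_Porch_SF",
    "Scnd_Flr_SF",
    "TotalBsmtSF",
]


def add_derived_columns(house):
    """Return a copy of one house record with TotalSF and SalesPrice000s added."""
    out = dict(house)
    out["TotalSF"] = sum(house[col] for col in SF_COLUMNS)
    # "SalePrice" is an assumed name for the raw price column, in dollars.
    out["SalesPrice000s"] = house["SalePrice"] / 1000.0
    return out


# Hypothetical record for illustration only.
example_house = {
    "First_Flr_SF": 1000,
    "Low_Qual_Fin_SF": 0,
    "Open_Porch_SF": 50,
    "Scnd_Flr_SF": 500,
    "TotalBsmtSF": 900,
    "SalePrice": 200000,
}

derived = add_derived_columns(example_house)
print(derived["TotalSF"], derived["SalesPrice000s"])
```

Running the same function on the first row of the real data is a quick way to perform the 2798 check the assignment asks for.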
  • 7. In Excel: The line of best fit is an upward-sloping straight line, meaning that the paired data have a positive linear correlation. That is, price (dependent) increases as total square footage (independent) increases. b). Your supervisor looks at the scatter plot in (a) and does not like the outliers. Produce a scatter plot with the outliers you see in (a) removed. You can delete these observations in R or delete them in the Excel file and then re-import the data into R. Comment on the effect that removing outliers has on the appearance of the plot. Below is a box plot that shows the outlying data, marked by zeros. [Scatter plot: House sq footage effect on house price. X axis: House sq footage; Y axis: House price in thousands]
  • 8. Below is a histogram in which the outlier data has been highlighted in yellow. The outliers begin where the price range reaches about $375,000 (what counts as an outlier is arbitrary, so I defaulted to R’s outlier points, marked as the zeros above). We will remove the outliers and graph a new scatter plot:
  • 9. I removed the values corresponding to sales prices under $75,000 and greater than $375,000, which represent the outliers in the histogram above. In Excel: y = 0.0695x - 2.8554, R² = 0.6205 [Scatter plot: House sq footage effect on house price, outliers removed. X axis: House sq footage; Y axis: House price in thousands]
  • 10. c). Explain why what you did in (b), that is, just removing the outliers and making a new plot, might be considered unethical. Discuss one way you could justify what you did in (b). This could be considered unethical because I manipulated the data so that it no longer represents the true population. The aggregates no longer take all the data into account. This portrays the information incorrectly, in a way I could use to support a bias. However, it could be justified if I presented the information while making it known that the outliers were removed in order to prevent skewed aggregates. The bulk of the data is then more representative of the true mean of the sample, and therefore gives a more accurate generalization of what to expect from the population. d). Suppose your supervisor wants to predict SalePrice000s using TotalSF using least-squares linear regression. Explain, conceptually, the idea behind the least-squares method. You shouldn’t include any calculations. The least-squares method predicts the value of a dependent variable Y from the value of an independent variable X; simply said, it describes how Y changes with X. Linear regression finds the straight line, called the “least squares regression line,” that minimizes the sum of the squared vertical differences between the line and all the data points, so it is the line that best fits ALL the data points. The line has a fixed slope, so to make a prediction you find a specific square footage on the X axis, follow it up to the line, and read the corresponding price on the Y axis. e). Fit the regression line in (d) to the data set with the outliers removed using Excel or R. Display the output from whatever program you use (a screen shot will probably be easiest), and write down the estimated regression equation.
Using Excel, I removed the rows with the outliers stated above (omitting sales prices under $75,000 and above $375,000). Then I selected the INSERT tab > SCATTER, right-clicked the graph, and chose Select Data > Add; the X series is the total square footage and the Y series is the sales price, chosen by selecting all the values in the SalesPrice000s and to.talSF variables with ctrl+shift+down. Once it was plotted, I added X and Y axis labels by selecting the graph, then LAYOUT tab > AXIS TITLES, and then CHART TITLE. Lastly, I added a trend line by selecting a data point, right click > ADD TRENDLINE > check DISPLAY EQUATION. The equation for the regression line without the outliers is located above the points in the scatter plot, y = 0.0695x - 2.8554, and the R^2, or the proportion of the variation in price that is explained by square footage, is 0.6192. If you take the square root of that you will get the correlation, R, which is 0.787, or in a cell put
  • 11. in “=correl(values of house price, values of house square footage)”. With a high correlation, close to the 80% benchmark, we can infer that when the square footage of a home increases, the sales price of the house also increases. Y is the price, the dependent variable; X is the square footage, the explanatory or independent variable. f). Interpret the slope and intercept term in (e) in the context of the problem. Y = a + bx, where a is the y intercept and b is the slope of the straight line predicting “Yhat” (the means of Y). This line is the “least squares line” because it minimizes the sum of the squared differences between each data point and its estimated point on the line. Y = bx + a, so y = 0.0695x - 2.8554. Note: the values are in thousands, so they must be multiplied by 1,000 (the decimal point moves three places to the right): y = 69.538x - 2855.4. A = y intercept, the starting point at 0, no square footage. It says, “when a house has 0 square feet, its average price is -$2,855.40.” B = gradient/slope, the increase in price per one square foot. It says, “for every additional square foot of a house, there is a $69.54 increase in price.” For every one-unit increase in X, Y goes up by the amount of the slope, which is the predicted price increase. g). Using software, calculate a 95% confidence interval for the slope parameter and interpret the interval in the context of the problem. Show the components of the calculation. In Excel, go to DATA > DATA ANALYSIS (on the far right) and select Regression, then input the Y (dependent) SalePrice values and the X (independent) TotalSF values. Check the Labels box, check the Confidence Level box, and set it to 95%.
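The least-squares idea from (d) and the slope and intercept interpretation from (f) can be illustrated with a short Python sketch. The closed-form formulas below are the standard ones for simple linear regression; the function names and the four data points are hypothetical, and only the coefficients 0.0695 and -2.8554 come from the trendline reported above.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def predict_price_dollars(total_sf, intercept=-2.8554, slope=0.0695):
    """Predicted price in dollars from the reported equation (in thousands)."""
    return 1000.0 * (intercept + slope * total_sf)


# Hypothetical points: square footage and price in thousands (NOT the Ames data).
xs = [1000, 1500, 2000, 2500]
ys = [70, 100, 140, 170]

a, b = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
# Any other slope gives a larger sum of squared residuals.
worse = sum_squared_residuals(xs, ys, a, b + 0.01)
print(a, b, best, worse)
print(predict_price_dollars(2000))
```

Comparing `best` with `worse` shows the defining property: nudging the slope away from the fitted value can only increase the sum of squared differences.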
  • 12. For every one-square-foot increase in the size of the house, we are 95% confident, based on the method used to calculate the interval, that the sales price of the house will increase by between $67.51 and $71.56 on average, all other variables held constant.
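The components of the interval in (g) follow the usual form: slope plus or minus a critical value times the standard error of the slope. The Python sketch below rebuilds the reported interval from its midpoint and half-width. The standard error is not shown in this transcript, so it is backed out under an assumed critical value of 1.96 (a large-sample approximation to the t critical value); both the function name and that critical value are assumptions for illustration.

```python
def slope_confidence_interval(slope, std_err, critical_value):
    """Confidence interval for the slope: slope +/- critical_value * std_err."""
    margin = critical_value * std_err
    return slope - margin, slope + margin


# Interval reported above, in dollars per square foot.
lower, upper = 67.51, 71.56

midpoint = (lower + upper) / 2    # should sit at the estimated slope
half_width = (upper - lower) / 2  # the margin of error

# ASSUMPTION: 1.96 stands in for the exact t critical value from the software.
ASSUMED_CRITICAL_VALUE = 1.96
implied_std_err = half_width / ASSUMED_CRITICAL_VALUE

rebuilt = slope_confidence_interval(midpoint, implied_std_err, ASSUMED_CRITICAL_VALUE)
print(midpoint, half_width, implied_std_err, rebuilt)
```

The midpoint of 69.535 agrees with the slope of 69.538 from (f) up to rounding, which is a useful consistency check on the Excel output.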
  • 13. h). What does the interval imply about the test of H0: B1 = 0 and why? The above hypothesis is the most common inference about regression: is X related to Y? The idea is that confidence intervals and tests get at the same idea. If the confidence interval contains the null-hypothesis value of B1, which is 0, we would fail to reject B1 = 0. Here the interval ($67.51 to $71.56) does not contain 0, so we reject B1 = 0. If the P-value is less than or equal to alpha, we reject H0; if the P-value is greater than alpha, we fail to reject. The F-test statistic, given below, is 4,532 and the P-value of the test is 0. Therefore, we reject H0, since 0 < .05.
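The two decision rules in (h) can be written side by side to show that they agree. This is a minimal Python sketch with hypothetical function names; the inputs are the interval and P-values reported in this write-up.

```python
def interval_rejects_zero_slope(lower, upper):
    """Reject B1 = 0 when 0 lies outside the confidence interval."""
    return not (lower <= 0 <= upper)


def p_value_rejects(p_value, alpha=0.05):
    """Reject H0 when the P-value is less than or equal to alpha."""
    return p_value <= alpha


# Regression slope: reported 95% interval and reported P-value (shown as 0).
reject_by_interval = interval_rejects_zero_slope(67.51, 71.56)
reject_by_p_value = p_value_rejects(0.0)

# For contrast, the chi-squared test in problem 3 reported a P-value of .26.
reject_chi_squared = p_value_rejects(0.26)

print(reject_by_interval, reject_by_p_value, reject_chi_squared)
```

A 95% interval that excludes 0 and a two-sided test at alpha = .05 lead to the same decision, which is why the interval alone is enough to answer (h).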