The document contains a problem set with exercises on regression models using survey and NBA player salary data. For exercise 1, the respondent specifies several linear regression models to test the effects of marijuana usage on wages while controlling for other factors such as education and gender. For exercise 2, the respondent estimates regression models relating NBA player points per game to experience, position, and other variables such as marital status. After interaction terms between marital status and the experience variables are added, the results show no strong evidence that marital status affects points per game.
Problem set 3 - Statistics and Econometrics - Msc Business Analytics - Imperial College London
Problem set 3
Jonathan Zimmermann
31 October 2015
Exercise 1
Suppose you collect data from a survey on wages, education, experience, and gender. In
addition, you ask for information about marijuana usage. The original question is: “On how
many separate occasions last month did you smoke marijuana?”
a) Write an equation that would allow you to estimate the effects of marijuana usage on wage, while
controlling for other factors. You should be able to make statements such as, “Smoking marijuana five
more times per month is estimated to change wage by x%.”
>>
To be able to interpret the coefficients in that way, we need to build a log-linear model. The regression equation
would look like this:
$$\log(\text{wage}) = \beta_0 + \beta_1\,\text{marijuana\_usage} + \beta_2\,\text{education} + \beta_3\,\text{experience} + \delta_1\,\text{gender} + u$$
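As a rough illustration of how a statement such as "five more times per month changes wage by x%" follows from the log-linear form, the sketch below converts a coefficient into a percentage change. The value of `beta1` is a made-up number chosen for the example, not an estimate from any data.

```r
# Hypothetical coefficient on marijuana_usage (NOT estimated from real data)
beta1 <- -0.01
delta_usage <- 5  # five more occasions per month

# Usual approximation: %change in wage is about 100 * beta1 * delta
approx_pct <- 100 * beta1 * delta_usage

# Exact percentage change implied by the log-linear model
exact_pct <- 100 * (exp(beta1 * delta_usage) - 1)

print(c(approx = approx_pct, exact = exact_pct))
```

The two numbers are close when the coefficient is small, which is why the approximation is commonly quoted; for larger effects the exact form is the safer one.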
b) Write a model that would allow you to test whether drug usage has different effects on wages for men
and women. How would you test that there are no differences in the effects of drug usage for men and
women?
>>
We would need to add an interaction variable between the gender and the marijuana variables. The new
regression equation would look like this:
$$\log(\text{wage}) = \beta_0 + \beta_1\,\text{marijuana\_usage} + \beta_2\,\text{education} + \beta_3\,\text{experience} + \delta_1\,\text{gender} + \delta_2\,(\text{gender} \times \text{marijuana\_usage}) + u$$
To test whether there are differences in the effects of drug usage for men and women, we could test the
following hypothesis with a t-test:
$$H_0: \delta_2 = 0 \qquad H_1: \delta_2 \neq 0$$
To perform the t-test, we would first calculate the t-statistic as the estimated interaction coefficient
divided by its standard error:
$$t = \frac{\hat{\delta}_2 - 0}{\operatorname{se}(\hat{\delta}_2)}$$
We would then look up the critical value at the (1 − α/2) percentile of the t distribution with n − k − 1
degrees of freedom (here k = 5 regressors, so n − 6). If the absolute value of the t-statistic is greater
than the critical value, we would reject H0.
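The test on δ2 can be sketched in R on simulated data. Everything about the data below (variable names such as `female`, the sample size, the coefficients used to generate wages) is an assumption made for illustration; only the mechanics of reading the interaction's t-statistic from `lm` and comparing it with a two-sided critical value are the point.

```r
set.seed(1)
n <- 500

# Simulated survey data (hypothetical, for illustration only)
d <- data.frame(
  marijuana_usage = rpois(n, 2),
  education       = round(rnorm(n, 13, 2)),
  experience      = round(runif(n, 0, 30)),
  female          = rbinom(n, 1, 0.5)
)
d$log_wage <- 1 + 0.08 * d$education + 0.01 * d$experience -
  0.10 * d$female - 0.01 * d$marijuana_usage + rnorm(n, sd = 0.4)

# Model with the gender x usage interaction
fit <- lm(log_wage ~ marijuana_usage + education + experience + female +
            female:marijuana_usage, data = d)

coefs <- summary(fit)$coefficients
idx <- grep(":", rownames(coefs))   # row of the interaction term

# t-statistic = estimate / standard error
t_stat <- coefs[idx, "Estimate"] / coefs[idx, "Std. Error"]

# Two-sided 5% critical value with residual degrees of freedom (n - k - 1)
t_crit <- qt(1 - 0.05 / 2, df = df.residual(fit))
reject_h0 <- abs(t_stat) > t_crit

print(c(t_stat = t_stat, t_crit = t_crit))
```

Because the simulated wages contain no interaction effect, the test should usually fail to reject H0 here, although any single draw can do so by chance.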
c) Suppose you think it is better to measure marijuana usage by putting people into one of four categories:
nonuser, light user (1 to 5 times per month), moderate user (6 to 10 times per month), and heavy
user (more than 10 times per month). Now, write a model that allows you to estimate the effects of
marijuana usage on wage.
>>
Incorporating this change into the model in a), we would have:
$$\log(\text{wage}) = \beta_0 + \beta_2\,\text{education} + \beta_3\,\text{experience} + \delta_1\,\text{gender} + \delta_2\,\text{light\_user} + \delta_3\,\text{moderate\_user} + \delta_4\,\text{heavy\_user} + u$$
Each coefficient can now be estimated by running the regression as usual; nonusers are the base group.
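The four categories in part (c) could be built from the raw monthly count along the following lines. The vector `usage` is a handful of made-up counts, and the cut points simply follow the category definitions in the question; this is a sketch, not part of the original solution.

```r
# Hypothetical monthly counts of marijuana use
usage <- c(0, 3, 5, 6, 10, 11, 25)

# Intervals are right-closed: (-Inf,0], (0,5], (5,10], (10,Inf)
category <- cut(
  usage,
  breaks = c(-Inf, 0, 5, 10, Inf),
  labels = c("nonuser", "light_user", "moderate_user", "heavy_user")
)

# Dummy variables with "nonuser" as the omitted base group
dummies <- model.matrix(~ category)[, -1]

print(data.frame(usage, category))
```

Passing the factor `category` directly to `lm` would create the same three dummies automatically, with the first level as the base group.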
d) Using the model in part (c), explain in detail how to test the null hypothesis that marijuana usage has
no effect on wage.
>>
We would need to test the following hypothesis (i.e. we want to test whether δ2, δ3 and δ4 are
jointly significant), using an F-test:
$$H_0: \delta_2 = 0 \text{ and } \delta_3 = 0 \text{ and } \delta_4 = 0 \qquad H_1: H_0 \text{ is false}$$
Let’s call the model in c) the “unrestricted model”. The “restricted model” would then be:
$$\log(\text{wage}) = \beta_0 + \beta_2\,\text{education} + \beta_3\,\text{experience} + \delta_1\,\text{gender} + u$$
We then calculate the F-statistic, using the following formula:
$$F = \frac{(SSR_{restricted} - SSR_{unrestricted})/q}{SSR_{unrestricted}/(n - k - 1)}$$
where q = number of restrictions = 3 (because we test three parameters) and k = number of variables in the
unrestricted model = 6.
We would then reject H0 if the F-statistic is higher than the critical value (based on the F distribution
with d1 = q and d2 = n − k − 1 degrees of freedom).
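The joint test can be sketched in R by fitting both models and plugging their sums of squared residuals into the F formula. The data are simulated and all variable names and generating coefficients are assumptions for illustration; wages are generated with no usage effect, so H0 is true by construction.

```r
set.seed(2)
n <- 400
levels_usage <- c("nonuser", "light_user", "moderate_user", "heavy_user")

# Simulated survey data (hypothetical, for illustration only)
d <- data.frame(
  education  = rnorm(n, 13, 2),
  experience = runif(n, 0, 30),
  female     = rbinom(n, 1, 0.5),
  category   = factor(sample(levels_usage, n, replace = TRUE),
                      levels = levels_usage)
)
d$log_wage <- 1 + 0.08 * d$education + 0.01 * d$experience -
  0.10 * d$female + rnorm(n, sd = 0.4)

unrestricted <- lm(log_wage ~ education + experience + female + category, data = d)
restricted   <- lm(log_wage ~ education + experience + female, data = d)

ssr_ur <- sum(resid(unrestricted)^2)
ssr_r  <- sum(resid(restricted)^2)
q <- 3  # number of restrictions
k <- 6  # regressors in the unrestricted model

f_stat  <- ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))
f_crit  <- qf(0.95, df1 = q, df2 = n - k - 1)
p_value <- pf(f_stat, q, n - k - 1, lower.tail = FALSE)

print(c(F = f_stat, critical = f_crit, p = p_value))
```

Calling `anova(restricted, unrestricted)` reports the same F-statistic without the manual arithmetic.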
e) What are some potential problems with drawing causal inference using the survey data that you
collected?
>>
The survey data might have several problems that would make it unrepresentative of the population. Two
of the biggest issues are self-selection and social desirability bias. In this study, for example, we could
expect individuals to deliberately (or unconsciously) report less than their actual marijuana
consumption, for fear of looking like an addict (social desirability). Other issues might be linked to
the way the data were collected. For example, if the survey was conducted in a particular area or at
a particular time of day, the respondents might not be a truly random sample of the population; this
will be the case if, say, the survey is conducted by phone during the day, when the working
population is at work (which would result in an overrepresentation of unemployed people, housewives, retired
people, etc.). There are of course many other response biases that could make the data inaccurate, such as
acquiescence bias.
Exercise 2
** Use the data in nbasal.RData for this exercise. **
a) Estimate a linear regression model relating points per game to experience in the league and position
(guard, forward, or center). Include experience in quadratic form and use centers as the base group.
Report the results (including SRF, the sample size, and R-squared).
>>
load("nbasal.RData")
The regression model is:
$$\text{points} = \beta_0 + \beta_1\,\text{exper} + \beta_2\,\text{expersq} + \delta_1\,\text{guard} + \delta_2\,\text{forward} + u$$
The SRF is:
a = lm(points~exper+expersq+guard+forward,data)
a
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward, data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## 4.76076 1.28067 -0.07184 2.31469 1.54457
summary(a)
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.220 -4.268 -1.003 3.444 22.265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.76076 1.17862 4.039 7.03e-05 ***
## exper 1.28067 0.32853 3.898 0.000123 ***
## expersq -0.07184 0.02407 -2.985 0.003106 **
## guard 2.31469 1.00036 2.314 0.021444 *
## forward 1.54457 1.00226 1.541 0.124492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.668 on 264 degrees of freedom
## Multiple R-squared: 0.09098, Adjusted R-squared: 0.07721
## F-statistic: 6.606 on 4 and 264 DF, p-value: 4.426e-05
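Regression results (standard errors in parentheses):
$$\widehat{\text{points}} = \underset{(1.17862)}{4.76076} + \underset{(0.32853)}{1.28067}\,\text{exper} - \underset{(0.02407)}{0.07184}\,\text{expersq} + \underset{(1.00036)}{2.31469}\,\text{guard} + \underset{(1.00226)}{1.54457}\,\text{forward}$$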
The sample size is:
summary(a)$df[2]+length(coef(a)) # = Degrees of freedom + number of coefficients. nrow(data) would have
## [1] 269
The R-squared is:
summary(a)$r.squared
## [1] 0.09097856
b) Holding experience fixed, does a guard score more than a center? How much more? Is the difference
statistically significant?
>>
Yes, a guard seems to score more than a center. When we control for experience and experience^2, a guard
is estimated to score on average 2.31469 (δ1) more points per game than a center.
To check whether this positive effect is statistically significant, we can test the following hypothesis:
$$H_0: \delta_1 = 0 \qquad H_1: \delta_1 > 0$$
The one-sided p-value of δ1 is 0.010722 (the two-sided p-value divided by two), so the effect is statistically
significant at the 5% level, and at any significance level above roughly 1.07%.
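The halving of the two-sided p-value can be checked from the reported estimate, standard error and residual degrees of freedom. The sketch below uses only the rounded figures printed in the regression output above, so it should reproduce the reported two-sided p-value of 0.021444 up to rounding.

```r
# Figures taken from the summary(a) output for the guard coefficient
estimate  <- 2.31469
std_error <- 1.00036
df_resid  <- 264

t_value <- estimate / std_error

# Two-sided p-value, as reported by summary()
p_two_sided <- 2 * pt(abs(t_value), df_resid, lower.tail = FALSE)

# One-sided p-value for H1: delta1 > 0 (valid because the estimate is positive)
p_one_sided <- pt(t_value, df_resid, lower.tail = FALSE)

print(c(t = t_value, two_sided = p_two_sided, one_sided = p_one_sided))
```

Dividing by two is only appropriate when the estimate has the sign stated in H1; with a negative estimate the one-sided p-value would be one minus half the two-sided value.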
c) Now, add marital status to the equation. Holding position and experience fixed, are married players
more productive (based on points per game)?
>>
The new regression model is:
$$\text{points} = \beta_0 + \beta_1\,\text{exper} + \beta_2\,\text{expersq} + \delta_1\,\text{guard} + \delta_2\,\text{forward} + \delta_3\,\text{marr} + u$$
The SRF is:
a = lm(points~exper+expersq+guard+forward+marr,data)
a
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## 4.70294 1.23326 -0.07037 2.28632 1.54091
## marr
## 0.58427
summary(a)
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.874 -4.227 -1.251 3.631 22.412
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.70294 1.18174 3.980 8.93e-05 ***
## exper 1.23326 0.33421 3.690 0.000273 ***
## expersq -0.07037 0.02416 -2.913 0.003892 **
## guard 2.28632 1.00172 2.282 0.023265 *
## forward 1.54091 1.00298 1.536 0.125660
## marr 0.58427 0.74040 0.789 0.430751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.672 on 263 degrees of freedom
## Multiple R-squared: 0.09313, Adjusted R-squared: 0.07588
## F-statistic: 5.401 on 5 and 263 DF, p-value: 9.526e-05
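Regression results (standard errors in parentheses):
$$\widehat{\mathit{points}} = \underset{(1.18174)}{4.70294} + \underset{(0.33421)}{1.23326}\,\mathit{exper} - \underset{(0.02416)}{0.07037}\,\mathit{expersq} + \underset{(1.00172)}{2.28632}\,\mathit{guard} + \underset{(1.00298)}{1.54091}\,\mathit{forward} + \underset{(0.74040)}{0.58427}\,\mathit{marr}$$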
The sample size is still:
summary(a)$df[2]+length(coef(a)) # = Degrees of freedom + number of coefficients. nrow(data) would have
## [1] 269
The r-squared is:
summary(a)$r.squared
## [1] 0.09312579
Yes, married players seem to be more productive than non-married players. Controlling for experience,
experience^2 and position, a married player scores on average 0.58427 (δ3) more points. However, this
difference might not be statistically significant.
To check whether marriage has a statistically significant positive effect, we test the following
hypothesis:
H0 : δ3 = 0
H1 : δ3 > 0
The one-sided p-value of δ3 is 0.2153757 (the two-sided p-value divided by two), so the null hypothesis could
only be rejected at significance levels above roughly 21.5%. At the conventional 1%, 5% and 10% levels, the
effect is therefore not statistically significant.
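The same conclusion can be reached by comparing the t statistic with one-sided critical values instead of reading off the p-value. This is a minimal sketch using only the estimate, standard error and degrees of freedom printed in the summary for part c); the variable names are illustrative.

```r
# Values copied from the summary(a) output for the marr coefficient
t_stat   <- 0.58427 / 0.74040   # estimate divided by standard error
df_resid <- 263                 # residual degrees of freedom

# One-sided critical values for H1: delta3 > 0 at the usual levels
alpha  <- c(0.10, 0.05, 0.01)
crit   <- qt(1 - alpha, df = df_resid)

# H0 is rejected at a given level only if t exceeds the critical value
reject <- t_stat > crit

print(data.frame(alpha = alpha, critical_value = crit, reject_H0 = reject))
```

Since the t statistic (about 0.79) lies below even the 10% critical value, the null hypothesis is not rejected at any of the three levels.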
d) Add interactions of marital status with both experience variables. In this expanded model, is there
strong evidence that marital status affects points per game?
>>
The new regression model is:
points = β0 + β1exper + β2expersq + δ1guard + δ2forward + δ3marr + δ4marr∗exper + δ5marr∗expersq + u
The SRF is:
a = lm(points~exper+expersq+guard+forward+marr+marr*exper+marr*expersq,data)
a
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr +
## marr * exper + marr * expersq, data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## 5.81615 0.70255 -0.02950 2.25079 1.62915
## marr exper:marr expersq:marr
## -2.53750 1.27965 -0.09359
summary(a)
##
## Call:
## lm(formula = points ~ exper + expersq + guard + forward + marr +
## marr * exper + marr * expersq, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.239 -4.328 -1.067 3.742 22.197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.81615 1.34878 4.312 2.29e-05 ***
## exper 0.70255 0.43405 1.619 0.1067
## expersq -0.02950 0.03267 -0.903 0.3674
## guard 2.25079 1.00002 2.251 0.0252 *
## forward 1.62915 1.00199 1.626 0.1052
## marr -2.53750 2.03822 -1.245 0.2143
## exper:marr 1.27965 0.68229 1.876 0.0618 .
## expersq:marr -0.09359 0.04887 -1.915 0.0566 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.654 on 261 degrees of freedom
## Multiple R-squared: 0.1058, Adjusted R-squared: 0.08184
## F-statistic: 4.413 on 7 and 261 DF, p-value: 0.0001188
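Regression results (standard errors in parentheses):
$$\widehat{\mathit{points}} = \underset{(1.34878)}{5.81615} + \underset{(0.43405)}{0.70255}\,\mathit{exper} - \underset{(0.03267)}{0.02950}\,\mathit{expersq} + \underset{(1.00002)}{2.25079}\,\mathit{guard} + \underset{(1.00199)}{1.62915}\,\mathit{forward} - \underset{(2.03822)}{2.53750}\,\mathit{marr} + \underset{(0.68229)}{1.27965}\,\mathit{exper} \cdot \mathit{marr} - \underset{(0.04887)}{0.09359}\,\mathit{expersq} \cdot \mathit{marr}$$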
The sample size is still:
summary(a)$df[2]+length(coef(a)) # = Degrees of freedom + number of coefficients. nrow(data) would have
## [1] 269
The r-squared is:
summary(a)$r.squared
## [1] 0.1058214
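This time, we want a test in both directions (we are interested in whether there is an effect of either
sign) on three coefficients at the same time. This is therefore a joint hypothesis test: we want to know
whether, together, all the coefficients that involve marital status have an effect on points:
H0 : δ3 = 0 and δ4 = 0 and δ5 = 0
H1 : H0 is false
Taken on its own, δ3 has a two-sided p-value of 0.2142624, so it is not significant at any conventional
level, and the two interaction terms are individually significant only at the 10% level (p-values 0.0618
and 0.0566). Individual t-tests do not settle a joint hypothesis, however. The F-test of the three
restrictions, which compares the R-squared of this model (0.1058214) with that of the model without marital
status from part a) (0.09097856), gives F ≈ 1.44 with 3 and 261 degrees of freedom, below the 10% critical
value of about 2.1.
So no, there is no strong evidence that marital status affects points per game.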
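A joint hypothesis of this kind is usually tested with an F-test rather than with a single t-test. The following is a minimal sketch of the R-squared form of the F statistic, assuming the restricted model is the one from part a) (no marital status terms) and using only the R-squared values and degrees of freedom printed in this document; the variable names are illustrative.

```r
# R-squared values copied from the outputs above
r2_unrestricted <- 0.1058214    # model with marr, exper:marr and expersq:marr
r2_restricted   <- 0.09097856   # model from part a), without marital status

q        <- 3     # number of restrictions (delta3, delta4, delta5)
df_resid <- 261   # residual degrees of freedom of the unrestricted model

# R-squared form of the F statistic; valid because both models share the
# same dependent variable and the same sample of 269 observations
f_stat <- ((r2_unrestricted - r2_restricted) / q) /
          ((1 - r2_unrestricted) / df_resid)

# Upper-tail probability of the F(q, df_resid) distribution
p_value <- pf(f_stat, df1 = q, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_stat, p_value = p_value))
```

With the data loaded, fitting the two models as separate `lm` objects and passing them to `anova()` should give the same test without copying numbers by hand.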
e) Estimate the model from part (c) but use assists per game as the dependent variable. Are there any
notable differences from part (c)? Discuss.
>>
The new regression model is:
assists = β0 + β1exper + β2expersq + δ1guard + δ2forward + δ3marr + u
The SRF is:
a = lm(assists~exper+expersq+guard+forward+marr,data)
a
##
## Call:
## lm(formula = assists ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Coefficients:
## (Intercept) exper expersq guard forward
## -0.22581 0.44360 -0.02673 2.49167 0.44747
## marr
## 0.32190
summary(a)
##
## Call:
## lm(formula = assists ~ exper + expersq + guard + forward + marr,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3127 -1.0780 -0.3157 0.6788 8.2488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.225809 0.354904 -0.636 0.52516
## exper 0.443603 0.100372 4.420 1.45e-05 ***
## expersq -0.026726 0.007256 -3.683 0.00028 ***
## guard 2.491672 0.300842 8.282 6.19e-15 ***
## forward 0.447471 0.301220 1.486 0.13860
## marr 0.321899 0.222359 1.448 0.14891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.704 on 263 degrees of freedom
## Multiple R-squared: 0.3499, Adjusted R-squared: 0.3375
## F-statistic: 28.31 on 5 and 263 DF, p-value: < 2.2e-16
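Regression results (standard errors in parentheses):
$$\widehat{\mathit{assists}} = \underset{(0.354904)}{-0.225809} + \underset{(0.100372)}{0.443603}\,\mathit{exper} - \underset{(0.007256)}{0.026726}\,\mathit{expersq} + \underset{(0.300842)}{2.491672}\,\mathit{guard} + \underset{(0.301220)}{0.447471}\,\mathit{forward} + \underset{(0.222359)}{0.321899}\,\mathit{marr}$$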
The sample size is still:
summary(a)$df[2]+length(coef(a)) # = Degrees of freedom + number of coefficients. nrow(data) would have
## [1] 269
The r-squared is:
summary(a)$r.squared
## [1] 0.3498759
As we can see, there are some differences compared to c), but nothing major. Except for the intercept,
which changed sign, the direction of all the effects is the same. The intercept, which was highly statistically
significant in c), is no longer statistically significant, and the variable “guard” is now much more significant
than in c). All the coefficients changed in magnitude, sometimes substantially. Most of these differences
in magnitude are explained by the different scales of “assists” and “points”: