Two-Variable (Bivariate) Regression
In the last unit, we covered scatterplots and correlation. Social
scientists use these as descriptive tools for getting an idea about
how our variables of interest are related. But these tools only
get us so far. Regression analysis is the next step. Regression is
by far the most used tool in social science research.
Simple regression analysis can tell us several things:
1. Regression can estimate the relationship between x and y in
original units of measurement. To see why this is so
useful, consider the example of infant mortality and median
family income. Let’s say that a policymaker is interested in
knowing how much of a change in median family income is
needed to significantly reduce the infant mortality rate.
Correlation cannot answer this question, but regression can.
2. Regression can tell us how well the independent variable (x)
explains the dependent variable (y). The measure is called the
R squared.
Simple Two-Variable (Bivariate) Regression
Regression uses the equation of a line to estimate the
relationship between x and y. You may remember back in
algebra learning about the equation of a line. Some learned it as
Y = sX + K or Y = mX + B. In statistics, we use a different notation:
Equation 1: Y = B0 + B1X + u
Let’s define each term in the equation:
· Y is the dependent variable. It is placed on the Y (vertical)
axis. In the example below, the dependent variable (Y) is the
infant mortality rate.
· B0 is the Y intercept. B0 is also referred to as “the constant.”
B0 is the point where the regression line crosses the Y axis.
Importantly, B0 is equal to the
predicted value of Y when X=0. In most cases, B0
does not get much attention for two reasons. First, the
researcher is usually interested in the relationship between x
and y, not the relationship between x and y at the single value
of x=0. Second, often independent variables do not take on the
value zero. Consider the AECF sample data. There are no states
with low-birth-weight percentages equal to zero, so we would
be extrapolating beyond what the data tell us.
· B1 is usually the main point of interest for researchers. It is
the slope of the line relating x to y. Researchers usually refer to
B1 as a slope coefficient, regression coefficient or simply a coefficient.
B1 measures the change in Y for a one-unit change in x.
We represent change by the symbol ∆.
B1 = ∆Y / ∆X
· u is the error term. The error term is the distance between the
regression line and the dots on the scatterplot. Think about it:
regression estimates a single line through the cloud of data.
Naturally, the line does not hit all the data points. The degree to
which the line “misses” a data point is the error. u can also be
thought of as all the other factors that affect the infant mortality
rate besides X. Importantly, we assume that u is totally random given X.
The Black Box of Regression
Intuitively, regression analysis finds the line that is the best
predictor of the dependent variable. In the scatterplot, this line
is the one that “fits” the data the best. From the scatterplot, we
can see that the line does not go through all of the points in the
scatterplot. So, how does regression find this line? Regression
does this by finding the line that minimizes the sum of the squared
errors. This is why regression is also called “least squares”
regression. The mathematical proof of this is not important,
as long as we understand that the regression line is the best fit for the data.
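To see what "minimizing the squared error" means in practice, here is a minimal sketch in Python. The five data points are invented for illustration and are not the Kids Count data; the formulas are the standard closed-form least-squares estimates for a single independent variable.

```python
# Toy data, invented purely for illustration (not the Kids Count data).
x = [6.0, 7.0, 8.0, 9.0, 10.0]
y = [4.9, 5.1, 5.9, 6.0, 7.1]

def sse(b0, b1):
    """Sum of squared errors for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Closed-form least-squares estimates for one independent variable.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(f"B0 = {b0:.3f}, B1 = {b1:.3f}, SSE = {sse(b0, b1):.4f}")

# Nudging the intercept or the slope away from these values raises the error.
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    print(f"shifted line SSE = {sse(b0 + db0, b1 + db1):.4f}")
```

Every shifted line prints a larger squared error than the fitted one, which is the sense in which the regression line is the "best fit."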
The Predicted Value of Y, “yhat”
This is the estimated regression equation for the line that relates
infant mortality to low birth weight. Notice that this equation
does not contain an error term.
This makes sense, because this is the equation for the
regression line itself, not the actual data points (Y).
To make this distinction clear, define the term
Ŷ as the predicted value of Y along the regression line.
Equation 2: Ŷ = B0 + B1X
Subtracting Equation 2 from Equation 1 gives:
Y = B0 + B1X + u
minus Ŷ = B0 + B1X
Y - Ŷ = u
This means each observation has values for Y, Ŷ and u. To
make this more concrete, let’s consider the example of infant
mortality and low birth weights.
Example: Infant Mortality and Low Birth Weights
For regression (unlike correlation), the researcher must specify
the dependent variable and the independent variable. Logically,
low birth weights should contribute to the infant mortality rate.
This makes sense too if we think about how the regression
equation works. To make things concrete, let’s say that a
lawmaker wants to know what effect low birth weights have on
infant mortality. The regression equation would be:
imr = B0 + B1lobweight + u
The Stata output has a lot of numbers. First let’s focus on
getting the actual estimates from the regression equation. We
get these numbers from the “coefficient” column.
The bottom coefficient is labeled _cons. This is short for
“constant,” which is just another name for the y intercept, B0.
In this case, B0 = 1.205.
The coefficient labeled lobweight is the one we are really
interested in. This coefficient is B1. For this regression, B1 = 0.562.
Now we can write out the regression:
imr = B0 + B1lobweight + u
Substituting the numbers from the table:
imr = 1.205 + 0.562 lobweight + u
Interpreting the equation
B0 is usually not of interest to the researcher for reasons
discussed above.
B1 is the main coefficient of interest, especially for policy. It
tells us about the relationship between low birth weights and the
infant mortality rate.
Rules for Interpreting B1
· B1 measures the change in Y that results from a one unit
change in X.
· So, we can say that
a one unit change in X results in a B1 change in Y.
· In the regression above B1=0.562. That means that a one-unit
increase in the percentage of low birth weights results in a 0.562-unit
increase in the infant mortality rate.
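The estimated equation can be used directly to compute a predicted value (Ŷ), an error (u), and the effect of a one-unit change in X. The sketch below uses the two coefficients from the Stata table; the state with 8 percent low birth weights and an observed infant mortality rate of 6.0 is hypothetical and only serves the arithmetic.

```python
B0 = 1.205   # intercept (_cons) from the Stata table
B1 = 0.562   # slope on lobweight from the Stata table

def predict_imr(lobweight):
    """Predicted infant mortality rate (Y-hat) on the regression line."""
    return B0 + B1 * lobweight

# Hypothetical state, invented for illustration: 8% low birth weights,
# observed infant mortality rate of 6.0.
lobweight, observed_imr = 8.0, 6.0

y_hat = predict_imr(lobweight)
u = observed_imr - y_hat          # error: Y minus Y-hat

print(f"predicted imr = {y_hat:.3f}")
print(f"error u       = {u:.3f}")

# A one-unit change in X moves the prediction by exactly B1.
change = predict_imr(lobweight + 1) - predict_imr(lobweight)
print(f"change for one more unit of lobweight = {change:.3f}")
```

The last line reproduces B1, which is the rule for interpreting the slope stated above.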
The user-written Stata command aaplot gives a nice summary:
Model Fit
We already saw with scatterplots and correlation that different
models have different degrees of “fit”, meaning how well the
data cluster around a line.
In regression, most analysts use the R Squared. The R Squared
has a ready interpretation once we know its properties:
Box 1: R Squared Properties
R2 Property 1: R squared measures the proportion of the
variation in Y that is explained by the variation in X. An easier
way to say it is that the model explains (R2*100)% of the variation
in Y. For the running example, R2=0.436. That means that low birth
weights explain 43.6% of the variation in the infant mortality
rate. Or, for short, the model explains 43.6%.
R2 Property 2: R square will always (except in extreme and
unusual cases) lie somewhere on the interval between 0 and +1.
In other words, R squared will be a positive value between 0
and 1.
R2 Property 3: R squared values are only comparable
if the dependent variable is the same. This means that if
we want to compare two models on the R squared, Y must be
the same for both models.
Effect Size for R Squared
As with correlation coefficients, it is helpful to have a
benchmark to determine effect size. Recall that effect size tells
us how large (or small) the effect of one variable is on another.
We can take the benchmarks for r and square them to get the
benchmarks for R2.
Table 1: Cohen’s Effect Size Benchmarks for R Squared
R Squared       Effect Size
0.01 to 0.09    Small
0.09 to 0.25    Medium
0.25 to 1.0     Large
In the example, the R squared was 0.436, which exceeds 0.25,
so we conclude that the R squared shows a large effect size
between low birth weights and infant mortality.
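The R squared benchmarks come from squaring Cohen's cutoffs for r (0.1, 0.3 and 0.5). The short Python sketch below does that squaring and then labels the running example. How to treat a value that falls exactly on a cutoff is an assumption made here, since the table's ranges share their endpoints.

```python
# Cohen's cutoffs for r (small, medium, large).
r_cutoffs = {"small": 0.1, "medium": 0.3, "large": 0.5}

# Squaring each cutoff gives the matching R squared cutoff.
r2_cutoffs = {label: round(r ** 2, 2) for label, r in r_cutoffs.items()}
print(r2_cutoffs)

def r2_effect_size(r2):
    """Label an R squared value using the squared benchmarks.

    Assumption: a value exactly on a cutoff is given the larger label.
    """
    if r2 >= r2_cutoffs["large"]:
        return "large"
    if r2 >= r2_cutoffs["medium"]:
        return "medium"
    if r2 >= r2_cutoffs["small"]:
        return "small"
    return "below the small benchmark"

print(r2_effect_size(0.436))
```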
Hypothesis Testing
So far, we have been focusing on how to interpret regression
results. But our results are derived from a
sample. This means we cannot be sure that our results
reflect what is going on in the population. Of course, we cannot
know what we don’t know, so instead we can do hypothesis testing.
Generally, with hypothesis testing, we are focused on a “null”
hypothesis. This involves a little thought experiment. We ask
the following, “If there was no effect of X on Y in the
population, how likely is it that we would have obtained our
regression results?”
We write the null hypothesis as:
Null Hypothesis H0: B1pop = 0
This is equivalent to saying that B1 in the population is zero.
Remember, we do not know what B1 is in the population, we are
just testing if it is zero.
Alternative Hypothesis H1: B1pop ≠ 0
The alternative hypothesis is that B1 in the population does not
equal zero (i.e., there is some effect of X on Y).
Using the T Test
To test the hypothesis above, we use a t test. The t distribution
is very similar to the Z distribution (standard normal).
The formula for the t test in regression is
t = (B1 - B1pop) / SE(B1)
Notice that when we do a t test, we are comparing our actual
sample regression coefficient B1,
with a hypothesized value of B1
for the population, B1pop.
We could test for ANY population value using this formula. We
could set the population value to 8,000, 50 or -0.0078. The
reason we set the population value to zero is that this is the only
value for B1pop that would indicate NO relationship between X
and Y. As a result, the standard hypothesized value for B1pop is
zero. Notice what this does to the formula above. If we
substitute zero for B1pop:
t = (B1 - 0) / SE(B1) = B1 / SE(B1)
What is SE(B1)? This is called the standard error of B1. If we
think of running an infinite number of regressions with different
samples, we could plot our values of B1 on a graph. The
standard error of B1 tells us how much variation there would be
in this hypothetical distribution.
Now let’s look back at the table. B1 is 0.562 and the standard
error of B1 is 0.09138. Plugging in the numbers gives
t = 0.562 / 0.09138 = 6.15
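The same arithmetic can be checked in a few lines of Python. Both inputs are the values quoted from the Stata table; the variable names are just labels chosen for this sketch.

```python
b1 = 0.562        # sample slope from the Stata table
se_b1 = 0.09138   # standard error of the slope from the Stata table
b1_pop = 0.0      # hypothesized population value under the null

# t compares the sample slope with the hypothesized value, in standard-error units.
t = (b1 - b1_pop) / se_b1
print(f"t = {t:.2f}")
```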
From t to a P value
The t statistic on its own does not tell us much. What we are
interested in is the p value. The p value is the probability of
obtaining a t statistic at least as large (in absolute value) as ours
if the null hypothesis were true. To get the p value, we must use a
t distribution.
Properties of the t distribution and p values
Property 1: The t distribution is a probability distribution that
measures the likelihood of different t values. Therefore, the
total area of the t distribution equals 1.
Property 2: For a t test, we assume that the mean of the
population t distribution is zero, which is the same as saying B1pop = 0.
Property 3: A large t statistic is unlikely because as we move
from the mean of the t distribution to its tails, the probability of
the t values goes down.
Property 4: t tests tell us the probability that we would obtain
our sample t value, if the population t value was, in fact, zero.
Thus, the term hypothesis testing. This probability is called a p
value. Put another way,
the p value tells us the probability that we would be
incorrect in saying B1pop ≠ 0 if in fact B1pop = 0.
Property 5: A small p value gives us reason to REJECT the null
hypothesis B1pop = 0 because a small p value indicates that it is
unlikely, given our sample value for B1, that B1pop = 0.
Looking back at the results, the p value corresponding to the t
statistic of 6.15 is 0.000. The p value is so small, it has zeroes to
three digits! This means that the chances of our obtaining our
sample t value of 6.15 are very, very small, if the true
population t statistic were zero.
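To get a feel for how small this probability is, the sketch below approximates the two-sided p value with the standard normal (Z) distribution from Python's standard library. This is a stand-in, not Stata's calculation: Stata uses a t distribution with the regression's degrees of freedom, which has heavier tails, so the exact p value would be somewhat larger than this approximation.

```python
from statistics import NormalDist

t = 6.15

# Two-sided tail area under the standard normal curve, used here as an
# approximation to the t distribution (an assumption, not Stata's exact value).
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))

print(f"approximate p value = {p_approx:.1e}")
print(f"rounded to three decimals: {p_approx:.3f}")
```

Rounded to three decimals the approximation prints as 0.000, which is why a table that reports three digits shows nothing but zeroes.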
Confidence Intervals
Another way to think about hypothesis testing is using
confidence intervals. Confidence intervals tell us the range of
values a coefficient could take. Typically, researchers use 95%
confidence intervals.
We can rearrange some of the terms from the t test to obtain
confidence intervals.
CI upper = B + (SE(B) * t)
CI lower = B - (SE(B) * t)
With confidence intervals, we must specify a value for t. This
value of t corresponds to whatever confidence level we want to
set. Usually this is 95%.
Stata gets this value of t for us, so we do not have to look it up.
Intuitively we can say that if we compared a 95% CI to a 90%
CI, the former would be WIDER. This makes sense when we
think about the relationship between t and probability. The
larger the t value, the smaller the probability or equivalently,
the higher the confidence level, the wider the CI.
In the results above, the 95% CI for the coefficient on low birth
weight is 0.378 to 0.745, which is a wide margin! The CI allows
us to get an idea of how much a coefficient could vary. The
“official” interpretation of the 95% CI is, “95 times out of 100,
the true population coefficient would be contained in this
interval.”
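The two CI formulas can be tried out with the numbers from the table. The critical t values below (about 2.01 for 95% and about 1.68 for 90%) are assumptions chosen for a sample of roughly 50 observations; they are not taken from the Stata output, which supplies the exact value itself.

```python
b1 = 0.562        # slope from the Stata table
se_b1 = 0.09138   # standard error of the slope from the Stata table

def conf_interval(t_crit):
    """Lower and upper bounds: B minus and plus (SE * t)."""
    margin = se_b1 * t_crit
    return b1 - margin, b1 + margin

# Assumed critical values for roughly 50 observations (not from the source).
lower95, upper95 = conf_interval(2.01)
lower90, upper90 = conf_interval(1.68)

print(f"95% CI: {lower95:.3f} to {upper95:.3f}")
print(f"90% CI: {lower90:.3f} to {upper90:.3f}")
```

Under these assumptions the 95% bounds land within rounding of the reported 0.378 to 0.745, and the 90% interval comes out narrower, as the text describes.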
Assignment 1
Due Date/Time: 9/23/2021, 11:59 PM
Total Points: 100
You will implement the K-means clustering and Fuzzy C-means
clustering from scratch using a programming language of your
Follow software design principles and document (comment)
your code
clearly explaining what you did and why you did what you did.
In your
report, include a README that states how your code is
supposed to be
run to obtain the expected results.
You will use a dataset representing ten years of clinical care at
130 US
hospitals and integrated delivery networks. It includes over 50
representing patient and hospital outcomes. The dataset is
included in
the assignment with the filename diabetic_data.csv.
Use the Euclidean distance to compute the distance between any
patients in the dataset. You will run your clustering algorithms
different combinations of variables as specified in each
1. K-means clustering with different numbers of clusters (30 points)
a. Run K-means on the entire dataset with the following two
‘time_in_hospital’, and ‘num_medications’ with the number of
K = 2. Plot your clusters using a 3D scatter plot and report
(print) the
centroid locations. Based on this plot, what are your thoughts
on the
generated clusters?
b. Test with different numbers of clusters K, running from K =
2 to K = 10
using the same variables in 1a. According to the scatter plots,
number of clusters do you think is the most appropriate? Justify
c. Implement Dunn index (DI) cluster validity measure from
Repeat the experiments in problem 1b and compute the
DI indices.
Which one do you believe is the best number of clusters
according to
Dunn indices? Does this agree with your initial observation in
2. K-means clustering with different variables and sample size
a. Based on the best number of clusters you obtained in
problems 1c and
the two variables, does adding the ‘insulin’ variable (total 3
improve clustering results for any 30 patients randomly
selected? Use
scatter plots or any other equivalent method to justify your
b. Based on the model in problem 2a, does adding the
and ‘change’ variables (total five variables) improve the
results for the same 30 patients? Plot the results and compute
the Dunn
index to justify your response.
c. Randomly sample 50,000 observations and 10,000
observations from
the entire dataset and re-run 2a and 2b for each sample size.
Plot the
clustering results and compute the Dunn index for each sample
size and
compare the results with 50,000 and 10,000 observations vs the
dataset. Justify what you observe.
d. (Bonus): What happens to the relative positioning of the
centroids as
you sample fewer observations (50,000, 10,000, 5,000) from the
data? Do
the centroids go farther apart, or do they get closer after your
algorithm has converged? Justify why. Plot your findings
(sample size
(x-axis) vs Dunn Index (y-axis)). (Bonus: 10 points)
3. Fuzzy C-means clustering (40 points)
a. Implement Fuzzy C-means and apply it with the best number
clusters you selected in problem 1 and the best combination of
you selected in problem 2 for the entire observations. Was there
difference in the clusters as compared to the K-means clusters?
(Compare using visualization tools, using centroid values, OR
some labels and observing the differences).
b. Harden the cluster assignment of Fuzzy C-means and use the
index to compare it with the K-means clustering result. Is there
difference in the results? Which clustering algorithm do you
produces better clusters and why?
c. Select one more variable by exploring the data and add this
into the model in problem 3a. Does adding this new variable
the clustering results? If so, why or why not? If you play with
variables for 3c, please mention that as well as the variables
experimented with and why you chose that particular additional
Submission Instructions:
Submit a zipped file containing your code(s) and report (in pdf)
in the
Dropbox folder titled “Assignment 1-LastName” on Pilot.
Academic Integrity: Please note that the code and report you
should be your work and yours alone. If plagiarism is detected,
it will be
dealt with strictly and in accordance with Wright State
Scatterplots and Correlation
Scatterplots show the relationship between two (usually)
continuous variables. Recall that continuous variables have many
different numeric values; age or income are examples.
Scatterplots are very useful for data visualization because they
can give us an intuition for the
direction of the relationship between variables (positive
or negative) and the
strength of the relationship. Usually, we are interested
in both things.
With a scatterplot, we normally assume that one variable is the
independent variable. Most researchers denote the
independent variable as X. The independent variable is the input
to the model. The dependent variable is the output from the model.
One way to keep these straight is that the
dependent variable is dependent on another variable in
the model, the independent variable. Researchers denote the
dependent variable as Y. Just like in the alphabet, X comes before Y, meaning a
change in X results in some change in Y. In some cases, the
independent variable X may be a “cause” of the dependent
variable Y, but in most cases, causation is difficult to establish.
We discuss the distinction between correlation and causation
toward the end of the chapter.
In the examples below, we will be using the State Kids Count
data. In both scatterplots, the
dependent variable is the infant mortality rate (imr).
We will construct two scatterplots using two
independent variables: the percentage of low-birth-weight
babies in each state and the median family income in the
state. Figure 1 shows the scatterplot for infant mortality (y axis)
and low birth weight babies (x axis).
Figure 1: The relationship between low birth weights and infant mortality
Here low birth weight is on the x axis and the infant mortality
rate is on the y axis. This scatterplot helps answer two questions.
Direction of Relationship. The graph shows there is a
positive relationship between low birth weights and the
state infant mortality rate. As low birth weights increase, so
does infant mortality. This makes sense, as low birth weight
babies are often premature or have other health difficulties,
making survival less likely. So, it makes sense that states that
have a high percentage of low birthweight infants would also
have higher overall infant mortality rates.
Strength of the Relationship. The way to determine the
strength of the relationship in a scatterplot is to look at how
tightly (or loosely) the data points cluster around the line. This
line is the “best fit” line for the data. This graph shows a strong
relationship between low birth weight and infant mortality, but
interpreting graphs can be a bit like interpreting art! It is
important to note that while the direction of the relationship is
usually easy to figure out, determining the strength of the
relationship from a scatterplot alone is a subjective judgment.
Figure 2: The relationship between median family income and
infant mortality
Direction of Relationship. The graph shows there is a
negative relationship between state median family
income and the state infant mortality rate. In states with higher
median family incomes, there is less infant mortality. This also
makes sense: in states with higher family incomes, more private
resources are available throughout the pregnancy, which reduces
infant mortality.
Strength of the Relationship. The way to determine the
strength of the relationship in a scatterplot is to look at how
tightly (or loosely) the data points cluster around the line. In
this respect, the data fit the line well, but not as well as the
scatterplot in Figure 1. But again, such an interpretation is
inherently subjective.
The Correlation Coefficient
Scatterplots are helpful for visualizing the association between
X and Y, but graphs cannot provide a precise numerical
estimate of the relationship between X and Y. The numerical
estimate of the relationship between X and Y is called the
correlation coefficient. It is sometimes denoted as
r in published research. Correlation coefficients tell us
both the direction of the relationship between X and Y and the
strength of the relationship. The correlation coefficient is easy
to interpret once we understand its properties.
Box 1: Properties of the Correlation Coefficient
Correlation Coefficient Property 1: r will always indicate a
positive or negative relationship through its sign.
Correlation Coefficient Property 2: r will always lie within a
defined range between -1 and 1. r is a
normalized measure. This means that r does not depend
on the scale of measurement for a variable. For example, age
and income are measured on different scales, but r is not
affected by the scales; it will always be between -1 and +1.
Correlation Coefficient Property 3: r is bidirectional. This
means that the correlation between X and Y is the exact same as
the correlation between y and x. In other words, the “ordering”
of the independent and dependent variable is irrelevant to the
value of r.
Correlation Coefficient Property 4: r measures the strength of
the linear relationship between X and Y. That means it
measures how well the data fit along a straight line. r is also an
effect size measure.
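Properties 2 and 3 are easy to check numerically. The Python sketch below computes r with the usual Pearson formula on a small set of invented numbers (not the Kids Count data), then swaps the two variables and rescales one of them.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy data, invented for illustration only.
x = [6.0, 7.0, 8.0, 9.0, 10.0]
y = [4.9, 5.1, 5.9, 6.0, 7.1]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)                            # Property 3: order does not matter
r_rescaled = pearson_r([a * 100 for a in x], y)   # Property 2: scale does not matter

print(f"r(x, y)          = {r_xy:.3f}")
print(f"r(y, x)          = {r_yx:.3f}")
print(f"r(100 * x, y)    = {r_rescaled:.3f}")
```

All three lines print the same value, and that value sits between -1 and +1.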
Correlation Coefficient Effect Size
Property 4 says that
r measures the degree to which the data fit along a
single straight line. But what does an r=0.58 or an r=-0.10 tell
us? Is this a large effect? This brings in the concept of
effect size. Effect sizes tell us how strong the
relationship is between variables. Effect sizes help to answer
the question of
substantive significance (McCloskey, 1996). Cohen
(1988) offers this guidance for benchmarking r. Note that
whether r is positive or negative, the effect size is the same.
Table 1: Cohen’s Effect Size Benchmarks for r
r Value (-)     r Value (+)    Effect Size
-0.1 to -0.3    0.1 to 0.3     Small
-0.3 to -0.5    0.3 to 0.5     Medium
-0.5 to -1.0    0.5 to 1.0     Large
We can now answer the question as to what an r=0.58 means in
terms of effect size. Using Cohen’s benchmarks, 0.58>0.50, so
we conclude that there is a large effect size, or in other words,
a strong relationship between X and Y. For r=-0.10, the absolute
value is 0.10, which is a small effect size, or equivalently a weak
relationship between X and Y.
Correlation Coefficients for Infant Mortality, Low Birthweight
and Median Family Income
The Stata output below is called a
correlation matrix. Correlation matrices show us how
each variable is correlated with another. This matrix only
contains three variables: imr (infant mortality rate), lobweight
and mhhif (median family income).
The first thing you’ll notice is the three ones in the diagonal.
This is because those cells in the matrix report the correlation
of the variable with itself.
Figure 3: Correlation Matrix for Infant Mortality Data
The correlation between infant mortality and low birth weight is
0.66 (rounded). Based on Cohen’s benchmarks, anything above
r=0.5 is considered a large effect size. Therefore, we conclude
that the correlation shows a strong relationship between the
variables. The correlation between infant mortality and median
family income is -0.59. Because 0.59 exceeds Cohen’s 0.5
benchmark for a large effect size, it is also a large effect size.
Notice that the matrix also reports the correlation between low
birth weight and median family income as -0.47. This
correlation would be classified as a medium effect size because
it is in between 0.3 and 0.5.
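The three classifications above follow one rule: drop the sign, then compare with Cohen's cutoffs. A minimal Python sketch of that rule is shown here, using the three correlations reported in the matrix. Giving a value that sits exactly on a cutoff the larger label is an assumption of this sketch.

```python
def r_effect_size(r):
    """Label a correlation using Cohen's benchmarks; the sign is ignored."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "below the small benchmark"

# Rounded correlations as reported in the text.
correlations = {
    "imr and lobweight": 0.66,
    "imr and mhhif": -0.59,
    "lobweight and mhhif": -0.47,
}

for pair, r in correlations.items():
    print(f"{pair}: r = {r:+.2f}, {r_effect_size(r)} effect size")
```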
Correlation and Causation
Correlation does not necessarily mean causation. Correlation
can only establish that two variables are related to one another
mathematically. Consider a simple example where a researcher
is looking at the relationship between snow cone consumption
and swimming pool accidents. The researcher finds that there is
a positive correlation between snow cone consumption and
swimming pool accidents. Are we to conclude that eating snow
cones causes swimming pool accidents? Here the relationship is
not causal even though a correlation exists; both plausibly rise
in hot weather. Correlation cannot
establish causation. Instead, researchers must use theory to
explain and justify why correlations exist between variables.
· Scatterplots show the relationship between two continuous
variables
· The correlation coefficient r measures the linear association
between two variables
· The sign tells us the direction of the relationship
· The effect size can be determined by using Cohen’s effect size
benchmarks
· Usually, correlations are displayed in a correlation matrix that
shows the pairwise correlation between the variables
· Correlation matrices are an easy way to see how all the
variables in a list are related.
· Correlation cannot establish causation
Stata Code
*Scatterplots and Correlation
* This Code Uses the Annie E. Casey Foundation Data
*Figure 1
twoway (scatter imr lobweight) (lfit imr lobweight)
*Figure 2
twoway (scatter imr mhhif) (lfit imr mhhif)
*Correlation Matrix
correlate imr lobweight mhhif

Primary Care Interventions for Prevention and Cessation of Tob.docxPrimary Care Interventions for Prevention and Cessation of Tob.docx
Primary Care Interventions for Prevention and Cessation of Tob.docx
Presentation given in 2 separate PP documents as example.8-10 .docx
Presentation given in 2 separate PP documents as example.8-10 .docxPresentation given in 2 separate PP documents as example.8-10 .docx
Presentation given in 2 separate PP documents as example.8-10 .docx
Prepare a PowerPoint presentation (8 slides minimum) that presents a.docx
Prepare a PowerPoint presentation (8 slides minimum) that presents a.docxPrepare a PowerPoint presentation (8 slides minimum) that presents a.docx
Prepare a PowerPoint presentation (8 slides minimum) that presents a.docx
Porwerpoint The steps recommended for efficiently developing an ef.docx
Porwerpoint  The steps recommended for efficiently developing an ef.docxPorwerpoint  The steps recommended for efficiently developing an ef.docx
Porwerpoint The steps recommended for efficiently developing an ef.docx
Prepare a 2-page interprofessional staff update on HIPAA and appro.docx
Prepare a 2-page interprofessional staff update on HIPAA and appro.docxPrepare a 2-page interprofessional staff update on HIPAA and appro.docx
Prepare a 2-page interprofessional staff update on HIPAA and appro.docx
post 5-7 Sentences of a response to the Discovery Board Whic.docx
post 5-7 Sentences of a response to the Discovery Board Whic.docxpost 5-7 Sentences of a response to the Discovery Board Whic.docx
post 5-7 Sentences of a response to the Discovery Board Whic.docx

More from LacieKlineeb (20)

Professional Memo 1 IFSM 201 Professional Memo .docx
Professional Memo   1  IFSM 201 Professional Memo .docxProfessional Memo   1  IFSM 201 Professional Memo .docx
Professional Memo 1 IFSM 201 Professional Memo .docx
Principals in EpidemiologyHomework #2Please complete the fol.docx
Principals in EpidemiologyHomework #2Please complete the fol.docxPrincipals in EpidemiologyHomework #2Please complete the fol.docx
Principals in EpidemiologyHomework #2Please complete the fol.docx
Prevalence Of Pressure Ulcer Name xxxUnited State Universit.docx
Prevalence Of  Pressure Ulcer Name xxxUnited State Universit.docxPrevalence Of  Pressure Ulcer Name xxxUnited State Universit.docx
Prevalence Of Pressure Ulcer Name xxxUnited State Universit.docx
Professional Disposition and Ethics - Introduction kthometz post.docx
Professional Disposition and Ethics - Introduction kthometz post.docxProfessional Disposition and Ethics - Introduction kthometz post.docx
Professional Disposition and Ethics - Introduction kthometz post.docx
Problem 7PurposeBreak apart a complicated system.ConstantsC7C13.docx
Problem 7PurposeBreak apart a complicated system.ConstantsC7C13.docxProblem 7PurposeBreak apart a complicated system.ConstantsC7C13.docx
Problem 7PurposeBreak apart a complicated system.ConstantsC7C13.docx
Procedure1. Research occupation as it relates to Occupati.docx
Procedure1. Research occupation as it relates to Occupati.docxProcedure1. Research occupation as it relates to Occupati.docx
Procedure1. Research occupation as it relates to Occupati.docx
Problem 1 (10 Points)Jackson Browne Corporation is authorized to.docx
Problem 1 (10 Points)Jackson Browne Corporation is authorized to.docxProblem 1 (10 Points)Jackson Browne Corporation is authorized to.docx
Problem 1 (10 Points)Jackson Browne Corporation is authorized to.docx
Primary Task Response Within the Discussion Board area, write 350.docx
Primary Task Response Within the Discussion Board area, write 350.docxPrimary Task Response Within the Discussion Board area, write 350.docx
Primary Task Response Within the Discussion Board area, write 350.docx
Principles of Scientific Management, Frederick Winslow Taylor .docx
Principles of Scientific Management, Frederick Winslow Taylor .docxPrinciples of Scientific Management, Frederick Winslow Taylor .docx
Principles of Scientific Management, Frederick Winslow Taylor .docx
Printed by [email protected] Printing is for personal, privat.docx
Printed by [email protected] Printing is for personal, privat.docxPrinted by [email protected] Printing is for personal, privat.docx
Printed by [email protected] Printing is for personal, privat.docx
Primary Care Integration in Rural AreasA Community-Focused .docx
Primary Care Integration in Rural AreasA Community-Focused .docxPrimary Care Integration in Rural AreasA Community-Focused .docx
Primary Care Integration in Rural AreasA Community-Focused .docx
PrepareStep 1 Prepare a shortened version of your Final Pape.docx
PrepareStep 1 Prepare a shortened version of your Final Pape.docxPrepareStep 1 Prepare a shortened version of your Final Pape.docx
PrepareStep 1 Prepare a shortened version of your Final Pape.docx
Princess Nourah bint Abdulrahman University Strategy and Ope.docx
Princess Nourah bint Abdulrahman University Strategy and Ope.docxPrincess Nourah bint Abdulrahman University Strategy and Ope.docx
Princess Nourah bint Abdulrahman University Strategy and Ope.docx
Primary Care Interventions for Prevention and Cessation of Tob.docx
Primary Care Interventions for Prevention and Cessation of Tob.docxPrimary Care Interventions for Prevention and Cessation of Tob.docx
Primary Care Interventions for Prevention and Cessation of Tob.docx
Presentation given in 2 separate PP documents as example.8-10 .docx
Presentation given in 2 separate PP documents as example.8-10 .docxPresentation given in 2 separate PP documents as example.8-10 .docx
Presentation given in 2 separate PP documents as example.8-10 .docx
Prepare a PowerPoint presentation (8 slides minimum) that presents a.docx
Prepare a PowerPoint presentation (8 slides minimum) that presents a.docxPrepare a PowerPoint presentation (8 slides minimum) that presents a.docx
Prepare a PowerPoint presentation (8 slides minimum) that presents a.docx
Porwerpoint The steps recommended for efficiently developing an ef.docx
Porwerpoint  The steps recommended for efficiently developing an ef.docxPorwerpoint  The steps recommended for efficiently developing an ef.docx
Porwerpoint The steps recommended for efficiently developing an ef.docx
Prepare a 2-page interprofessional staff update on HIPAA and appro.docx
Prepare a 2-page interprofessional staff update on HIPAA and appro.docxPrepare a 2-page interprofessional staff update on HIPAA and appro.docx
Prepare a 2-page interprofessional staff update on HIPAA and appro.docx
post 5-7 Sentences of a response to the Discovery Board Whic.docx
post 5-7 Sentences of a response to the Discovery Board Whic.docxpost 5-7 Sentences of a response to the Discovery Board Whic.docx
post 5-7 Sentences of a response to the Discovery Board Whic.docx

Recently uploaded

Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
Thiyagu K
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Pavel ( NSTU)
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Atul Kumar Singh
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Tamralipta Mahavidyalaya
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Mohd Adib Abd Muin, Senior Lecturer at Universiti Utara Malaysia
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh

Recently uploaded (20)

Unit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdfUnit 2- Research Aptitude (UGC NET Paper I).pdf
Unit 2- Research Aptitude (UGC NET Paper I).pdf
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
Thesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.pptThesis Statement for students diagnonsed withADHD.ppt
Thesis Statement for students diagnonsed withADHD.ppt
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th SemesterGuidance_and_Counselling.pdf B.Ed. 4th Semester
Guidance_and_Counselling.pdf B.Ed. 4th Semester
Home assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdfHome assignment II on Spectroscopy 2024 Answers.pdf
Home assignment II on Spectroscopy 2024 Answers.pdf
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
Mule 4.6 & Java 17 Upgrade | MuleSoft Mysore Meetup #46
CACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdfCACJapan - GROUP Presentation 1- Wk 4.pdf
CACJapan - GROUP Presentation 1- Wk 4.pdf
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
Chapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptxChapter 3 - Islamic Banking Products and Services.pptx
Chapter 3 - Islamic Banking Products and Services.pptx
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.

  • 1. Two-Variable (Bivariate) Regression In the last unit, we covered scatterplots and correlation. Social scientists use these as descriptive tools for getting an idea of how the variables of interest are related. But these tools only get us so far. Regression analysis is the next step. Regression is by far the most widely used tool in social science research. Simple regression analysis can tell us several things: 1. Regression can estimate the relationship between x and y in their original units of measurement. To see why this is so useful, consider the example of infant mortality and median family income. Let’s say that a policymaker is interested in knowing how much of a change in median family income is needed to significantly reduce the infant mortality rate. Correlation cannot answer this question, but regression can. 2. Regression can tell us how well the independent variable (x) explains the dependent variable (y). The measure is called the R squared. Simple Two-Variable (Bivariate) Regression Regression uses the equation of a line to estimate the relationship between x and y. You may remember learning about the equation of a line back in algebra. Some learned it as Y = sX + K or Y = mX + B. In statistics, we use a different form: Equation 1: Y = B0 + B1X + u Let’s define each term in the equation: · Y is the dependent variable. It is placed on the Y (vertical) axis. In the example below, the dependent variable (Y) is the infant mortality rate. · B0 is the Y intercept. B0 is also referred to as “the constant.” B0 is the point where the regression line crosses the Y axis. Importantly, B0 is equal to the
  • 2. predicted value of Y when X=0. In most cases, B0 does not get much attention, for two reasons. First, the researcher is usually interested in the relationship between x and y, not the relationship between x and y at the single value of x=0. Second, independent variables often do not take on the value zero. Consider the AECF sample data. There are no states with low-birth-weight percentages equal to zero, so we would be extrapolating beyond what the data tell us. · B1 is usually the main point of interest for researchers. It is the slope of the line relating x to y. Researchers usually refer to B1 as a slope coefficient, regression coefficient or simply a coefficient. B1 measures the change in Y for a one-unit change in X. We represent change by the symbol ∆: B1 = ∆Y / ∆X · u is the error term. The error term is the vertical distance between the regression line and the dots on the scatterplot. Think about it: regression estimates a single line through the cloud of data. Naturally, the line does not hit all the data points. The degree to which the line “misses” a data point is the error. u can also be thought of as all the other factors that affect the infant mortality rate besides X. Importantly, we assume that u is totally random given X. The Black Box of Regression Intuitively, regression analysis finds the line that is the best predictor of the dependent variable. In the scatterplot, this line is the one that “fits” the data the best. From the scatterplot, we can see that the line does not go through all of the points. So, how does regression find this line? Regression does this by finding the line that minimizes the sum of the squared errors. This is why regression is also called “least squares” regression. The mathematical proof of this is not important,
  • 3. if we understand that the regression line is the best fit for the data. The Predicted Value of Y, “yhat” This is the estimated regression equation for the line that relates infant mortality to low birth weight. Notice that this equation does not contain an error term. This makes sense, because this is the equation for the regression line itself, not the actual data points (Y). To make this distinction clear, define the term Ŷ as the predicted value of Y along the regression line. Equation 2: Ŷ = B0 + B1X Subtracting Equation 2 from Equation 1 gives: Y − Ŷ = (B0 + B1X + u) − (B0 + B1X) = u This means each observation has values for Y, Ŷ and u. To make this more concrete, let’s consider the example of infant mortality and low birth weights. Example: Infant Mortality and Low Birth Weights For regression (unlike correlation), the researcher must specify the dependent variable and the independent variable. Logically, low birth weights should contribute to the infant mortality rate. This makes sense too if we think about how the regression equation works. To make things concrete, let’s say that a lawmaker wants to know what effect low birth weights have on infant mortality. The regression equation would be: imr = B0 + B1*lobweight + u The Stata output has a lot of numbers. First let’s focus on getting the actual estimates from the regression equation. We get these numbers from the “coefficient” column. The bottom coefficient is labeled _cons. This is short for
  • 4. “constant,” which is just another name for the y intercept, B0. In this case, B0 = 1.205. The coefficient labeled lobweight is the one we are really interested in. This coefficient is B1. For this regression, B1 = 0.562. Now we can write out the regression: imr = B0 + B1*lobweight + u Substituting the numbers from the table: imr = 1.205 + 0.562*lobweight + u Interpreting the equation B0 is usually not of interest to the researcher for the reasons discussed above. B1 is the main coefficient of interest, especially for policy. It tells us about the relationship between low birth weights and the infant mortality rate. Rules for Interpreting B1 · B1 measures the change in Y that results from a one-unit change in X. · So, we can say that a one-unit change in X results in a change of B1 in Y. · In the regression above, B1 = 0.562. That means that a one-unit change in the percentage of low birth weights results in a 0.562 change in the infant mortality rate. The user-written Stata command aaplot gives a nice summary: Model Fit We already saw with scatterplots and correlation that different models have different degrees of “fit”, meaning how well the data cluster around a line. In regression, most analysts measure fit with the R Squared. The R Squared has a ready interpretation once we know its properties: Box 1: R Squared Properties R2 Property 1: R squared measures the proportion of the variation in Y that is explained by the variation in X. An easier way to say it is that the model explains (R2*100)% of the variation in Y. For the
  • 5. running example, the R2 = 0.436. That means that low birth weights explain 43.6% of the variation in the infant mortality rate. Or, for short, the model explains 43.6%. R2 Property 2: R squared will always (except in extreme and unusual cases) lie somewhere on the interval between 0 and +1. In other words, R squared will be a positive value between 0 and 1. R2 Property 3: R squared values are only comparable if the dependent variable is the same. This means that if we want to compare two models on the R squared, Y must be the same for both models. Effect Size for R Squared As with correlation coefficients, it is helpful to have a benchmark to determine effect size. Recall that effect size tells us how large (or small) the effect of one variable is on another. We can take the benchmarks for r and square them to get the benchmarks for R2.
    Table 1: Cohen’s Effect Size Benchmarks for R Squared
    R Squared       Effect Size
    0.01 to 0.09    Small
    0.09 to 0.25    Medium
    0.25 to 1.0     Large
    In the example, the R squared was 0.436, which exceeds 0.25, so we conclude that the R squared shows a large effect size between low birth weights and infant mortality. Hypothesis Testing So far, we have been focusing on how to interpret regression results. But our results are derived from a sample. This means we cannot be sure that our results reflect what is going on in the population. Of course, we cannot
  • 6. know what we don’t know, so instead we can do hypothesis testing. Generally, with hypothesis testing, we are focused on a “null” hypothesis. This involves a little thought experiment. We ask the following: “If there were no effect of X on Y in the population, how likely is it that we would have obtained our regression results?” We write the null hypothesis as: Null Hypothesis H0: B1pop = 0 This is equivalent to saying that B1 in the population is zero. Remember, we do not know what B1 is in the population; we are just testing whether it is zero. Alternative Hypothesis H1: B1pop ≠ 0 The alternative hypothesis is that B1 in the population does not equal zero (i.e., there is some effect of X on Y). Using the T Test To test the hypothesis above, we use a t test. The t distribution is very similar to the Z distribution (standard normal). The formula for the t test in regression is: t = (B1 − B1pop) / SE(B1) Notice that when we do a t test, we are comparing our actual sample regression coefficient, B1, with a hypothesized value of B1 for the population, B1pop. We could test for ANY population value using this formula. We could set the population value to 8,000, 50 or -0.0078. The reason we set the population value to zero is that this is the only value for B1pop that would indicate NO relationship between X and Y. As a result, the standard hypothesized value for B1pop is zero. Notice what this does to the formula above. If we substitute zero for B1pop: t = (B1 − 0) / SE(B1) = B1 / SE(B1) What is SE(B1)? This is called the standard error of B1. If we think of running an infinite number of regressions with different
  • 7. samples, we could plot our values of B1 on a graph. The standard error of B1 tells us how much variation there would be in this hypothetical distribution. Now let’s look back at the table. B1 is 0.562 and the standard error of B1 is 0.09138. Plugging in the numbers gives: t = 0.562 / 0.09138 = 6.15 From t to a P value The t statistic on its own does not tell us much. What we are interested in is the p value. The p value is the probability of obtaining a t statistic at least as extreme as ours if the null hypothesis is true. To get the p value, we must use a t distribution. Properties of the t distribution and p values Property 1: The t distribution is a probability distribution that measures the likelihood of different t values. Therefore, the total area of the t distribution equals 1. Property 2: For a t test, we assume that the mean of the population t distribution is zero, which is the same as saying B1pop = 0. Property 3: A large t statistic is unlikely because as we move from the mean of the t distribution to its tails, the probability of the t values goes down. Property 4: t tests tell us the probability that we would obtain our sample t value if the population t value was, in fact, zero. Thus the term hypothesis testing. This probability is called a p value. Put another way, the p value tells us the probability that we would be incorrect in saying B1pop ≠ 0 if in fact B1pop = 0. Property 5: A small p value gives us reason to REJECT the null hypothesis B1pop = 0, because a small p value indicates that it is unlikely, given our sample value for B1, that B1pop = 0. Looking back at the results, the p value corresponding to the t statistic of 6.15 is 0.00. The p value is so small that it is zero to three digits! This means that the chances of obtaining our sample t value of 6.15 are very, very small if the true population t statistic were zero. Confidence Intervals
  • 8. Another way to think about hypothesis testing is using confidence intervals. Confidence intervals tell us the range of values a coefficient could take. Typically, researchers use 95% confidence intervals. We can rearrange some of the terms from the t test to obtain confidence intervals. CI upper = B + (SEB*t) CI lower = B − (SEB*t) With confidence intervals, we must specify a value for t. This value of t corresponds to whatever confidence level we want to set. Usually this is 95%. Stata gets this value of t for us, so we do not have to look it up. Intuitively, we can say that if we compared a 95% CI to a 90% CI, the former would be WIDER. This makes sense when we think about the relationship between t and probability. The larger the t value, the smaller the probability; equivalently, the higher the confidence level, the wider the CI. In the results above, the 95% CI for the coefficient on low birth weight is 0.378 to 0.745, which is a wide margin! The CI allows us to get an idea of how much a coefficient could vary. The “official” interpretation of the 95% CI is: if we drew 100 samples and built a 95% CI from each one, about 95 of those intervals would contain the true population coefficient.
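The arithmetic in this unit can be checked with a short script. What follows is only a sketch in Python (the text itself uses Stata): the `ols` helper and the five toy data points are hypothetical and serve only to show how least squares produces B0, B1 and the R squared. The second half reuses the figures reported above (B1 = 0.562, SE(B1) = 0.09138, 95% CI of 0.378 to 0.745). Because the text does not state the sample size, the critical t value is backed out from the reported interval rather than looked up, and the p value uses a standard normal approximation to the t distribution, which is an assumption.

```python
from statistics import NormalDist, mean


def ols(x, y):
    """Least-squares intercept (B0), slope (B1) and R squared for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope: change in Y per one-unit change in X
    b0 = y_bar - b1 * x_bar             # intercept: predicted Y when X = 0
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # squared errors u
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation in Y
    return b0, b1, 1 - sse / sst


# Hypothetical toy data, NOT the AECF state data used in the text.
toy_x = [1, 2, 3, 4, 5]
toy_y = [2, 4, 5, 4, 5]
b0, b1, r2 = ols(toy_x, toy_y)
print(f"toy fit: B0={b0:.2f}, B1={b1:.2f}, R2={r2:.2f}")

# Figures reported in the text for the infant mortality regression.
B1, SE_B1 = 0.562, 0.09138
CI_LOW, CI_HIGH = 0.378, 0.745

t_stat = (B1 - 0) / SE_B1                    # null hypothesis: B1pop = 0
t_crit = (CI_HIGH - CI_LOW) / (2 * SE_B1)    # critical t implied by the reported CI
ci = (B1 - t_crit * SE_B1, B1 + t_crit * SE_B1)

# Two-sided p value using the standard normal as an approximation
# to the t distribution (an assumption; the sample size is not given).
p_approx = 2 * NormalDist().cdf(-abs(t_stat))

print(f"t = {t_stat:.2f}, implied critical t = {t_crit:.3f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), approximate p = {p_approx:.1e}")
```

The implied critical value comes out close to 2, which is why a quick rule of thumb for a 95% CI is the coefficient plus or minus about two standard errors. The recomputed interval matches the reported one only up to rounding of the published figures.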
  • 9. Assignment 1 Due Date/Time: 9/23/2021, 11:59 PM Total Points: 100 You will implement K-means clustering and Fuzzy C-means clustering from scratch using a programming language of your choice. Follow software design principles and document (comment) your code, clearly explaining what you did and why you did it. In your report, include a README that states how your code is supposed to be run to obtain the expected results. You will use a dataset representing ten years of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. The dataset is included in the assignment with the filename diabetic_data.csv. Use the Euclidean distance to compute the distance between any two patients in the dataset. You will run your clustering algorithms with different combinations of variables as specified in each question.
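To make the setup concrete, here is a minimal sketch of the standard K-means loop (assign each point to its nearest centroid by Euclidean distance, then move each centroid to the mean of its members). It is an illustration under stated assumptions, not the assignment's required design: the function name `kmeans`, the explicit `init_centroids` argument and the four toy points are hypothetical, and nothing here reads diabetic_data.csv.

```python
import math


def kmeans(points, init_centroids, max_iter=100):
    """Plain K-means with Euclidean distance on tuples of numbers.

    `init_centroids` is passed in explicitly (a hypothetical design choice)
    so that the result is reproducible; random initialisation is also common.
    """
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:     # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy points standing in for (time_in_hospital, num_medications).
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = kmeans(toy_points, init_centroids=toy_points[:2])
print(centroids, labels)
```

Passing the starting centroids in makes runs repeatable, which helps when comparing results across different values of K or different variable sets.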
  • 10. 1. K-means clustering with different numbers of clusters (30 points) a. Run K-means on the entire dataset with the following two variables: ‘time_in_hospital’ and ‘num_medications’, with the number of clusters K = 2. Plot your clusters using a 3D scatter plot and report (print) the centroid locations. Based on this plot, what are your thoughts on the generated clusters? b. Test with different numbers of clusters K, running from K = 2 to K = 10 using the same variables as in 1a. According to the scatter plots, which number of clusters do you think is the most appropriate? Justify your response. c. Implement the Dunn index (DI) cluster validity measure from scratch. Repeat the experiments in problem 1b and compute the corresponding Dunn indices. Which one do you believe is the best number of clusters according to the Dunn indices? Does this agree with your initial observation in problem 1b? 2. K-means clustering with different variables and sample size
  • 11. (30 points) a. Based on the best number of clusters you obtained in problem 1c and the two variables, does adding the ‘insulin’ variable (total 3 variables) improve clustering results for any 30 patients randomly selected? Use scatter plots or any other equivalent method to justify your response. b. Based on the model in problem 2a, does adding the ‘diabetesMed’ and ‘change’ variables (total five variables) improve the clustering results for the same 30 patients? Plot the results and compute the Dunn index to justify your response. c. Randomly sample 50,000 observations and 10,000 observations from the entire dataset and re-run 2a and 2b for each sample size. Plot the clustering results and compute the Dunn index for each sample size and compare the results with 50,000 and 10,000 observations vs the entire dataset. Justify what you observe. d. (Bonus): What happens to the relative positioning of the centroids as you sample fewer observations (50,000, 10,000, 5,000) from the data? Do the centroids go farther apart, or do they get closer after your clustering
  • 12. algorithm has converged? Justify why. Plot your findings (sample size (x-axis) vs Dunn Index (y-axis)). (Bonus: 10 points) 3. Fuzzy C-means clustering (40 points) a. Implement Fuzzy C-means and apply it with the best number of clusters you selected in problem 1 and the best combination of variables you selected in problem 2 for the entire observations. Was there any difference in the clusters as compared to the K-means clusters? (Compare using visualization tools, using centroid values, OR using some labels and observing the differences). b. Harden the cluster assignment of Fuzzy C-means and use the Dunn index to compare it with the K-means clustering result. Is there any difference in the results? Which clustering algorithm do you think produces better clusters and why? c. Select one more variable by exploring the data and add this variable into the model in problem 3a. Does adding this new variable improve the clustering results? If so, why or why not? If you play with different variables for 3c, please mention that as well as the variables you
  • 13. experimented with and why you chose that particular additional variable. Submission Instructions: Submit a zipped file containing your code(s) and report (in pdf) in the Dropbox folder titled “Assignment 1-LastName” on Pilot. Academic Integrity: Please note that the code and report you submit should be your work and yours alone. If plagiarism is detected, it will be dealt with strictly and in accordance with Wright State guidelines. Scatterplots and Correlation Scatterplots Scatterplots show the relationship between two (usually) continuous variables. Recall that continuous variables have many different numeric values; age and income are examples. Scatterplots are very useful for data visualization because they can give us an intuition for the direction of the relationship between variables (positive or negative) and the strength of the relationship. Usually, we are interested in both things. With a scatterplot, we normally assume that one variable is the independent variable. Most researchers denote the independent variable as X. The independent variable is the input to the model. The dependent variable is the output from the model. One way to keep these straight is that the dependent variable is dependent on another variable in
  • 14. the model, the independent variable. Researchers denote the dependent variable as Y. Just as in the alphabet, X comes before Y, meaning that a change in X results in some change in Y. In some cases, the independent variable X may be a “cause” of the dependent variable Y, but in most cases, causation is difficult to establish. We discuss the distinction between correlation and causation toward the end of the chapter.
In the examples below, we use the State Kids Count data. In both scatterplots, the dependent variable is the infant mortality rate (imr). We construct the two scatterplots using two different independent variables: the percentage of low-birth-weight babies in each state and the median family income in the state. Figure 1 shows the scatterplot for infant mortality (y axis) and low-birth-weight babies (x axis).
Figure 1: The relationship between low birth weights and infant mortality
Here low birth weight is on the x axis and the infant mortality rate is on the y axis. This scatterplot helps answer two questions.
1) Direction of Relationship. The graph shows a positive relationship between low birth weights and the state infant mortality rate. As the percentage of low-birth-weight babies increases, so does infant mortality. This makes sense, as low-birth-weight babies are often premature or have other health difficulties, making survival less likely. States with a high percentage of low-birth-weight infants would therefore be expected to have higher overall infant mortality rates.
  • 15. 2) Strength of the Relationship. The way to determine the strength of the relationship in a scatterplot is to look at how tightly (or loosely) the data points cluster around the line. This line is the “best fit” line for the data. This graph shows a strong relationship between low birth weight and infant mortality, but interpreting graphs can be a bit like interpreting art! While the direction of the relationship is usually easy to figure out, determining the strength of the relationship from a scatterplot alone is a subjective judgment.
Figure 2: The relationship between median family income and infant mortality
1) Direction of Relationship. The graph shows a negative relationship between state median family income and the state infant mortality rate. In states with higher median family incomes, there is less infant mortality. This also makes sense: in states with higher family incomes, more private resources are available throughout the pregnancy, which can reduce infant mortality.
2) Strength of the Relationship. As before, we look at how tightly (or loosely) the data points cluster around the line. Here the data fit the line well, but not as well as in Figure 1. Again, such an interpretation is inherently subjective.
The Correlation Coefficient
Scatterplots are helpful for visualizing the association between X and Y, but graphs cannot provide a precise numerical estimate of the relationship between X and Y. The numerical
  • 16. estimate of the relationship between X and Y is called the correlation coefficient; it is sometimes denoted as r in published research. Correlation coefficients tell us both the direction of the relationship between X and Y and the strength of the relationship. The correlation coefficient is easy to interpret once we understand its properties.
Box 1: Properties of the Correlation Coefficient
Correlation Coefficient Property 1: r always indicates a positive or negative relationship through its sign.
Correlation Coefficient Property 2: r always lies between -1 and +1. r is a normalized measure, which means that r does not depend on the scale of measurement for a variable. For example, age and income are measured on different scales, but r is not affected by the scales; it will always be between -1 and +1.
Correlation Coefficient Property 3: r is bidirectional. This means that the correlation between X and Y is exactly the same as the correlation between Y and X. In other words, the “ordering” of the independent and dependent variables is irrelevant to the value of r.
Correlation Coefficient Property 4: r measures the strength of the linear relationship between X and Y. That means it measures how well the data fit along a straight line. r is also an effect size measure.
Correlation Coefficient Effect Size
Property 4 says that r measures the degree to which the data fit along a single straight line. But what does an r=0.58 or an r=-0.10 tell us? Is this a large effect? This brings in the concept of
  • 17. effect size. Effect sizes tell us how strong the relationship is between variables. Effect sizes help to answer the question of substantive significance (McCloskey, 1996). Cohen (1988) offers this guidance for benchmarking r. Note that whether r is positive or negative, the effect size is the same.
Table 1: Cohen’s Effect Size Benchmarks for r
r Value (-)      r Value (+)     Effect Size
-0.1 to -0.3     0.1 to 0.3      Small
-0.3 to -0.5     0.3 to 0.5      Medium
-0.5 to -1.0     0.5 to 1.0      Large
We can now answer the question of what an r=0.58 means in terms of effect size. Using Cohen’s benchmarks, 0.58>0.50, so we conclude that there is a large effect size, or in other words, a strong relationship between X and Y. For r=-0.10, the absolute value is 0.10, which is a small effect size, or equivalently a weak relationship between X and Y.
Correlation Coefficients for Infant Mortality, Low Birthweight and Median Family Income
The Stata output below is called a correlation matrix. Correlation matrices show us how each variable is correlated with every other variable. This matrix contains only three variables: imr (infant mortality rate), lobweight (percentage of low-birth-weight babies) and mhhif (median family income). The first thing you’ll notice is the three ones on the diagonal.
  • 18. This is because those cells in the matrix report the correlation of each variable with itself.
Figure 3: Correlation Matrix for Infant Mortality Data
The correlation between infant mortality and low birth weight is 0.66 (rounded). Based on Cohen’s benchmarks, anything above r=0.5 is considered a large effect size. Therefore, we conclude that the correlation shows a strong relationship between the variables. The correlation between infant mortality and median family income is -0.59. Because its absolute value, 0.59, exceeds Cohen’s 0.5 benchmark, it is also a large effect size. Notice that the matrix also reports the correlation between low birth weight and median family income as -0.47. This correlation would be classified as a medium effect size because its absolute value is between 0.3 and 0.5.
Correlation and Causation
Correlation does not necessarily mean causation. Correlation can only establish that two variables are related to one another mathematically. Consider a simple example where a researcher is looking at the relationship between snow cone consumption and swimming pool accidents. The researcher finds a positive correlation between the two. Are we to conclude that eating snow cones causes swimming accidents? Here the relationship is not causal even though a correlation exists; a more plausible explanation is that a third factor, such as hot weather, increases both. Correlation cannot establish causation. Instead, researchers must use theory to explain and justify why correlations exist between variables.
Review
· Scatterplots show the relationship between two continuous variables
· The correlation coefficient r measures the linear association between two variables
· The sign tells us the direction of the relationship
· The effect size can be determined by using Cohen’s effect size
  • 19. benchmarks
· Usually, correlations are displayed in a correlation matrix that shows the pairwise correlations between the variables
· Correlation matrices are an easy way to see how all the variables in a list are related
· Correlation cannot establish causation
Stata Code
* Scatterplots and Correlation
* This code uses the Annie E. Casey Foundation data
* Figure 1
twoway (scatter imr lobweight) (lfit imr lobweight)
* Figure 2
twoway (scatter imr mhhif) (lfit imr mhhif)
* Correlation matrix
correlate imr lobweight mhhif
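The arithmetic behind the correlation coefficient and Cohen’s benchmarks can also be sketched outside Stata. The Python snippet below is a minimal illustration and not part of the chapter’s own code. The function names `pearson_r` and `cohen_effect_size` and the toy data are hypothetical. Two choices are assumptions rather than anything stated in Table 1: values of |r| below 0.1 are labeled “negligible”, and a value that falls exactly on a boundary (such as 0.3) is assigned to the higher category.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)


def cohen_effect_size(r):
    """Classify r using the benchmarks in Table 1 (the sign is ignored)."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:  # assumption: boundary values go to the higher category
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"  # assumption: Table 1 does not cover |r| < 0.1


if __name__ == "__main__":
    # Hypothetical toy data, not the Kids Count values
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]
    r_xy = pearson_r(x, y)
    r_yx = pearson_r(y, x)  # Property 3: swapping X and Y leaves r unchanged
    print(r_xy, r_yx, cohen_effect_size(r_xy))

    # The correlations reported in the chapter's matrix
    for r in (0.66, -0.59, -0.47):
        print(r, cohen_effect_size(r))
```

Applied to the correlations reported in Figure 3, the classifier labels 0.66 and -0.59 as large and -0.47 as medium, which matches the interpretation in the text. Because `pearson_r` divides by the spread of both variables, rescaling either variable (for example, measuring income in thousands of dollars) leaves r unchanged, which is Property 2 in Box 1.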