3. Introduction:
3
• Regression analysis is a tool in assessing specific forms of
relationship between the variables.
• The ultimate objective of this method of analysis is to
predict or estimate the value of one variable
corresponding to a given value of another variable.
• The ideas of regression were first elucidated by the English
scientist Sir Francis Galton (1822–1911) in reports of his
research on heredity—first in sweet peas and later in
human stature.
4. • Correlation gives degree & direction of relationship
between 2 variables, whereas the regression analysis
enables us to predict the values of one variable
(dependent) based on other variable/s (independent).
• Regression coefficient is a measure of change of one
dependent variable with one unit change in independent
variable.
4
5. Regression:
5
• Purpose of regression analysis is to find a reasonable
regression equation to predict the average value of
dependent variable that is associated with a fixed value of
one independent variable.
• If more than one independent variable were used to
predict the average value of a dependent variable , then
we need multiple regression .
6. Scatter Plot :
6
• The scatter plots aid us in
determining the nature of
the relationship and
correlation by comparing
two sets of data.
• It gives us a visual picture
of any connection between
the two variables.
7. Components of Regression analysis:
7
• Simple regression problem is a statistical relationship
between dependent and independent variable.
• Dependent variable/ response variable /outcome
variable (y) — the main factor that you’re trying to
understand or predict. Eg: weight
• Independent variables /predictor variable/regressor
variable/explanatory variable (x) — the factors you
suspect have an impact on your dependent variable. Eg:
height
8. Components of Regression analysis:
8
Where ,
• Y is a value on the vertical axis (dependent variable),
• X is a value on the horizontal axis (independent variable),
• A is the point where the line crosses the vertical axis and the
value of Y at X = 0,
• Y-intercept is the value of A when X = 0,
• B shows the amount, by which Y changes for each unit change in
X, i.e. slope of the straight line.
Y = A + B(X)
• The independent variable predicts the value of dependent
variable for a given value of independent variable.
• The general equation for a straight line may be written as
9. Regression equation:
Where ,
• Yi is the value of the dependent variable at the i th level of
the independent variable,
• β0 and β1 are unknown regression coefficients whose values
are to be estimated,
• Xi is a known constant, which is the value of the independent
variable at the i th level. 9
10. Regression line :
10
• A regression line, also called a line of best fit, is the line for
which the sum of the squares of the residuals is a minimum.
• The vertical distance between the data point and the fitted
regression line is called a residual.
• Regression minimises residuals.
• By using the least squares method (a procedure that
minimizes the vertical deviations of plotted points
surrounding a straight line) a best fitting straight line is
constructed to the scatter diagram points and then formulate
a regression equation.
11. The equation of the line is μy=βo+β1 x, with intercept βo and
slope β1
11
12. 2. Multivariate regression –it involves the multiple
variables in the regression model. It is used to learn
about the relationship between several independent or
predictor variables and a dependent or outcome
variable.
E.g. – when predicting the changes in mandibular length
(dependent variable) using IGF-1, cervical stage, skeletal
classification, and gender (independent variables).
12
Types of Regression Analysis:
1. Bivariate regression – it is the simplest form involving
the prediction of the value of unknown variable from the
value of known variable.
E.g. – to predict the amount of mandibular growth
remaining from the Cervical Vertebra Maturation stage.
14. Linear Regression:
Types of Regression Models:
14
• Dependent variable is quantitative (preferably continuous)
• Linear regression is a parametric test and is based on a linear
relationship between variables.
• Linear means “straight line”
• Regression tells us how to draw the straight line described
by the correlation
16. Types of Linear Regression Model :
16
1. Simple Linear Regression Model –
• It is the most widely known model in which the dependent
variable is continuous and the independent variable can be
continuous or discrete. Here the nature of regression line is
linear which establishes a relationship between the
variables using a best fit straight line, known as the
regression line.
• In a simple linear regression model we assume that the
graph of the mean of the response variable Yi for given
values of the independent variable Xi is a straight line.
17. Types of Linear Regression Model :
17
2. Multiple Linear Regression Model –
The difference between simple linear regression and multiple
linear regression is that the later has more than one (>1)
independent variables, whereas the former has only 1
independent variable.
18. 18
Four assumptions are associated with a linear regression model:
• Linearity: The relationship between X and the mean of Y is
linear.
• Homoscedasticity: The variance of residual is the same for any
value of X.
• Independence: Observations are independent of each other.
• Normality: For any fixed value of X, Y is normally distributed.
Assumptions for Linear Regression:
19. Logistic Regression Model:
19
• It is used when the dependent variable is binary/nominal or categorical
with just two values like yes/no, true/false, male/female,
healthy/diseased etc. The logit function is a link which provides the
distribution amongst these two values ranging from 0 to 1.
• There are two types of logistic regression. Simple logistic regression with
only one independent variable and multiple logistic regressions which
has more than one independent variable. Simple logistic regression
finds the equation that best predicts the value of the Y variable for each
value of the X variable.
20. Aptness of a model:
20
• A two-way plot of the residuals versus the values of the
independent variable (or the fitted values of the response
variable), known as the residual plots, is a useful tool for
examining the aptness of a regression model.
• The residual can be viewed as an observed error.
21. The residual plots are used to check the
following:
• The regression function is linear
• The error terms have constant variance
• The error terms are statistically independent
• The error terms are normally distributed
• The model fits the data points except for a few
outlier observations.
21
24. Correlation coefficient:
24
• Correlation is a statistical method used to assess a possible
linear association between two continuous variables.
• Statisticians use a correlation coefficient to quantify the
strength and direction of the relationship between two
variables.
25. 25
Pearson's correlation coefficient:
• Pearson's correlation coefficient is denoted as ϱ(rho) for a
population parameter and as r for a sample statistic.
• It is used when both variables being studied are normally
distributed. For a correlation between variables x and y, the
formula for calculating the sample Pearson's correlation
coefficient is given by:
26. 26
Spearman's rank correlation coefficient:
• Spearman's rank correlation coefficient is denoted as ϱs for a
population parameter and as rs for a sample statistic. It is
appropriate when one or both variables are skewed or ordinal
and is robust when extreme values are present.
• For a correlation between variables x and y, the formula for
calculating the sample Spearman's correlation coefficient is
given by:
27. Types of correlation :
Positive – the two variables changes in the same
direction and in the same proportion.
Negative – the two variables changes in the
opposite direction and in the same proportion.
Curved – it represents nonlinear association
between the two variables. Even if the
relationship is strong the correlation coefficient
can be small or zero.
Partial – it measures the association between two
variables while controlling or adjusting the effect
of one or more variables. 27
29. 29
Degree of correlation:
None – this is no relationship between the
variables.
Low – there is some relationship between the
variables but a weak one.
High – there exists a very close relationship
between the two variables.
Perfect – it’s an ideal relationship.
As scores on one of the two variables increase
or decrease, the scores on the other variable
increase or decrease by the same magnitude
30. Difference Between Correlation And Regression:
30
• A parametric correlation coefficient (r) is a measure of the linear
relationship between two continuous variables. The range of r is 1 to –
1. When r is equal to zero, there is no relationship between the two
variables.
• Regression coefficients estimate how much of the change in the
outcome is associated with changes in the explanatory variables.
• Correlation measures the strength of relationship and regression
measures the magnitude of relationship between the two variables.
31. The Standard Error Of Estimate:
2
( )
ˆ
2
i i
e
y y
s
n
31
• When a ŷ-value is predicted from an x-value, the prediction is a point estimate.
• An interval can also be constructed.
• The standard error of estimate se is the standard deviation of the observed yi -values about the predicted ŷ-value for
given xi -value. It is given by
where n is the number of ordered pairs in the data set.
• The closer the observed y-values are to the predicted ŷ-values, the smaller the standard error of estimate will be.
32. Advantages of Regression Analysis:
32
1. Predicts the future course of events in growth, development
and treatment.
2. Support treatment planning decisions
3. Correcting errors.
4. Provide new insights.
33. 33
Limitations of Regression Analysis:
1. Looks at only linear straight-line relationship between the
dependent and independent variable.
2. Observes on the mean of the dependent variable.
3. Sensitive to extreme data.
4. Not useful when the actual and exact mathematical
relationship between the variables are know.
34. Summary:
34
• The regression analysis is a statistical technique that deals
with the analysis of relationship between such variables.
• They help us to predict the duration, course and outcome
of the treatment parameters.
• It is a complex process of predicting/estimating the
magnitude of some unknown characteristics which might
be involved in the growth and treatment of a given patient.
• Correlation coefficients are used to assess the strength and
direction of the linear relationships between pairs of
variables.
35. References:
35
• Jay S. Kim, Ronald J. Dailey. Biostatistics for Oral Healthcare. 2008, First Edition,
Blackwell Munskgaard Publication.
• B. K. Mahajan. Methods in Biostatistics, 9th Edition, Jaypee Publications
• Jain S, Chourse S, Dubey S, Jain S, Kamakoty J, Jain D. Regression analysis–its
formulation and execution in dentistry. Journal of Applied Dental and Medical
Sciences. 2016 Jan;2:1.
• M.M Mukaka. Statistics Corner: A guide to appropriate use of Correlation
coefficient in medical research. Malawi Medical Journal 2012; 24(3): 69-71
• Saha I, Paul B. ESSENTIALS OF BIOSTATISTICS & RESEARCH METHODOLOGY.
Academic Publishers; 2020 Oct 20.
• Lewis S. Regression analysis. Pract Neurol. 2007 Aug;7(4):259-64. doi:
10.1136/jnnp.2007.120055. PMID: 17636142.
Editor's Notes
Variable-attribute-that describes place, animal or thing eg-BP,PI,GI
Slope of regression line – regression coefficient, y on x and x on y
The variables involved in regression can be of either continuous or discrete type. If they are discrete then they should be treated as continuous. Only quantitative variables can be used in regression analysis. It can’t be applied to qualitative variables.
Hetero-variance of residuals are not constant, but they are different for different observation
What makes logistic regression different from linear regression is that you do not measure the Y variable directly; it is instead the probability of obtaining a particular value of a nominal variable.
When a scatter plot indicates there is a high positive relationship between the two variables, it is confirmed by calculating the correlation coefficient, which is deemed to come high. Then only the regression analysis is performed.