Lecture 1.pdf

Regression Modelling
Lecture 1

Lecturer (Me)
Contact details:
Dale Roberts
E: dale.roberts@anu.edu.au
T: +61 2 612 57336
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c

STAT6014 - Additional material
Contact details:
Lucy Yunxi Hu
E: yunxi.hu@anu.edu.au
T: +61 2 612 50836
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c

Communication
I Please consult with your allocated tutor for course content
questions
I And/or, use the discussion forum on Wattle
I Please contact the course convenor (me) for issues and
concerns including grades, illness, falling behind, and academic
accessibility issues

Lecture times
I Wednesday, 13:00 - 15:00 (2 hour lecture)
I Friday, 11:00 - 12:00 (1 hour lecture / workshop)

Tutorials
I Begin in Week 2; take time in Week 1 to visit the computer lab
and check that you can log on, etc.
I Tutorial sign-up: see instructions on Wattle and the course outline
I You should read through the tutorial sheet, think about the
questions and attempt them before class
I Best opportunity to learn skills and techniques that will be
required in the quizzes and exams
I Your tutors are your main source for help

Textbook
I The required textbook for this course is Linear Regression by
Michael H Kutner
I This is a custom printed textbook available in print at the Harry
Hartog bookstore
I An eBook is available from McGraw Hill. Use the link and discount
code on Wattle to buy it
I There are multiple copies of this text available in the Hancock
library for 2-hour loan
I Linear Models with R by Julian J. Faraway is another good
resource, available in the Hancock library for 2-day loan

Course website
I http://wattle.anu.edu.au
I Access to all enrolled students
I Course announcements
I Lecture resources
I Echo360 lecture recordings
I Data sets
I Tutorial questions, selected solutions
I Online quizzes
I Please check this site frequently!

Assessment
Assessment Task      Value   Due Date
Online Quiz          5%      Week 5
Assignment 1         15%     Week 6
Assignment 2         20%     Week 10
Final Examination    65%     Central Exam Period

Hints for success
I Attend lectures and tutorials, supplement given materials with
your own comments and notes.
I Be prepared for classes (read the textbook, attempt tutorial
questions)
I Do the tutorials - statistics is a discipline in which hands-on
participation ⇒ learning
I Time spent trying questions is well spent

R and RStudio
I We will be using the R software throughout the course
I Please see course website for installation instructions for R and
RStudio
I Please attempt Tutorial 0 - Intro to R before your first tutorial

What is regression?
I Statistical methodology that utilises the relation between two or
more quantitative variables so that a response or outcome
variable can be predicted from the other (or others)
I A core and important methodology in Statistics and Machine
Learning

What is regression?
Examples:
I Predict sales of a product using the relationship between sales and
the amount spent on advertising
I Predict the performance of an employee using the relationship
between performance and aptitude test results

Relations between variables
I We should distinguish between a functional relation and a
statistical relation between variables
I A functional relation between two variables is expressed as a
mathematical formula. If X is the independent variable and Y
the dependent variable, a functional relation is
Y = f(X)
I A functional relation is a “perfect” mapping from X to Y

[Figure: Dollar Sales (Y) against Units Sold (X); all points lie on the line Y = 2X]

I A statistical relationship is not perfect and the observations
do not fall directly on the curve of relationship
I There is (hopefully) a function/curve that captures a general
tendency but the observations are typically scattered around
this curve

[Figure: Year-end Evaluation (Y) against Mid-year Evaluation (X)]
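
The contrast between the two kinds of relation can be sketched in R. This is only an illustration: the X values, the trend 10 + 0.9X and the noise level are hypothetical choices, not the data behind the figures.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Functional relation: every point lies exactly on Y = 2X
units_sold   <- c(25, 75, 130)          # hypothetical X values
dollar_sales <- 2 * units_sold

# Statistical relation: a general tendency plus random scatter
mid_year <- seq(60, 100, by = 5)        # hypothetical X values
year_end <- 10 + 0.9 * mid_year +       # hypothetical trend
  rnorm(length(mid_year), mean = 0, sd = 4)

# Deviations from the underlying curve: all zero vs. not zero
functional_dev  <- dollar_sales - 2 * units_sold
statistical_dev <- year_end - (10 + 0.9 * mid_year)

print(functional_dev)
print(round(statistical_dev, 2))
```

Plotting `year_end` against `mid_year` with `plot()` shows the points scattered around the line rather than on it.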

History of regression
I The term regression was first used by Francis Galton in the late
19th century to explain a biological phenomenon he observed:
“regression towards the mean”

Galton’s dataset
library(HistData)
help(GaltonFamilies)
This data set lists the individual observations for 934 children in 205
families on which Galton (1886) based his cross-tabulation.
I midparentHeight: mid-parent height, calculated as (father
+ 1.08*mother)/2
I childHeight: height of child

Galton’s dataset
[Figure: scatterplot of childHeight against midparentHeight]

Basic concepts
A regression model is a formal means of expressing two essential
ingredients of a statistical relation:
I A tendency of the response variable Y to vary with the
predictor variable X in a systematic fashion
I A scattering of points around the curve of statistical relationship
These two characteristics are embodied in a regression model by
postulating that:
I There is a probability distribution of Y for each level of X
I The means of these probability distributions vary in some
systematic fashion with X

Probability distributions varying with X
[Figure: Year-end Evaluation (Y) against Mid-year Evaluation (X)]

Construction of Regression Models

Selection of predictor variables / covariates
I Note on terminology:
I Independent variable X, a.k.a. predictor, regressor, covariate,
feature (ML), ...
I Dependent variable Y, a.k.a. response, outcome, output, ...
I Only a limited number of covariates should be included in the
regression model
I How do you choose? Through exploratory studies, theory, etc.

Choice of functional form of regression relation
I Choice of f in the functional form Y = f (X) is tied to the
choice of covariate(s)
I Sometimes the relevant theory may indicate the appropriate
form for f
I Typically needs to be determined empirically from the data
I Linear or quadratic regression functions are often a good first
approximation

Scope of model
I We usually need to restrict the coverage of the model to some
interval or region of values
I We may not have observed the full range of possible
observations and the effect of those observations on our model
I The model may perform badly given previously unobserved data
I Training / fitting model vs. predicting given new observations
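
Why the scope matters can be seen in a small sketch. It assumes a hypothetical curved relationship (Y = X^2, noise-free for simplicity) that is observed only for X between 0 and 5, fits a straight line, and then predicts at X = 10, outside the observed range.

```r
# Hypothetical data: a curved relationship observed only on 0 <= X <= 5
dat <- data.frame(x = 0:5)
dat$y <- dat$x^2

# A straight line is a tolerable approximation inside the observed range
fit <- lm(y ~ x, data = dat)
max_in_range_error <- max(abs(resid(fit)))

# Prediction outside the observed range (extrapolation)
x_new  <- 10
y_pred <- predict(fit, newdata = data.frame(x = x_new))
extrapolation_error <- abs(x_new^2 - y_pred)

print(max_in_range_error)
print(extrapolation_error)
```

The error at X = 10 is far larger than any error inside the fitted range, which is the sense in which the model may perform badly on previously unobserved data.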

Use of regression
I Regression serves three major purposes:
I Description (How one variable influences the other)
I Control (Set standards, monitor operations, etc.)
I Prediction (Given new observations)

Regression and Causality
I Existence of a statistical relation between response Y and
covariate X does not imply in any way that Y depends causally
on X
I Funny examples

Use of computers
I Regression analysis requires lots of tedious calculations
I So we will make extensive use of R to perform these calculations

Simple Linear Regression Model

Formal statement of model
Only one covariate and a linear regression function f(x) = β0 + β1x,
giving
Yi = β0 + β1Xi + εi
where:
I Yi: response from ith trial / observation
I β0 and β1 are parameters to be determined
I Xi: observed covariate from ith trial / observation
I εi: random error term with mean zero and variance σ²
I εi and εj are uncorrelated for all i ≠ j

Fitting model
I We are given or we observe n pairs of values
(Y1, X1), (Y2, X2), . . . , (Yn, Xn)
I The process that relates X to Y is a black box but we assume
it does some linear transformation and we are trying to
determine what the parameters are
I We must fit a linear model
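
The black-box idea can be sketched by simulation. The parameter values, the range of X and the use of normal errors below are all assumptions made for illustration (the model itself only requires mean-zero, constant-variance, uncorrelated errors); `lm()` is R's standard function for fitting linear models.

```r
set.seed(2024)  # arbitrary seed, for reproducibility only

# Hypothetical "true" parameters hidden inside the black box
beta0 <- 3
beta1 <- 0.5
sigma <- 2

n   <- 200
X   <- runif(n, min = 0, max = 50)       # observed covariate values
eps <- rnorm(n, mean = 0, sd = sigma)    # normal errors: an extra assumption
Y   <- beta0 + beta1 * X + eps           # responses produced by the black box

# Fit a linear model to the n observed pairs (Yi, Xi)
fit <- lm(Y ~ X)
b   <- coef(fit)   # b[1] estimates beta0, b[2] estimates beta1
print(b)
```

In practice only `X` and `Y` are observed; the point of fitting is to recover values close to `beta0` and `beta1` from the data alone.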

Important features of the model
I The response Yi is a random variable as it is the sum of two
components:
I the constant term β0 + β1Xi
I the random term εi
I Since E[εi] = 0, we have
E[Yi] = E[β0 + β1Xi + εi]
= β0 + β1Xi + E[εi]
= β0 + β1Xi

I So the response Yi, for level Xi, has a probability distribution
with mean
E[Yi] = β0 + β1Xi
I So we know the regression function for the model is
E[Y ] = β0 + β1X
I The response Yi falls above or below the regression line based
on the random fluctuations of εi
I We have that
Var[Yi] = Var[β0 + β1Xi + εi] = Var[εi] = σ²

I Since the error terms εi and εj are uncorrelated, so are the
responses Yi and Yj (for i ≠ j)
I Our model assumes that the Yi's come from a probability
distribution with mean β0 + β1Xi and variance σ²
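
These two moments can be checked empirically. The sketch below fixes one level of X and draws many responses at that level; the parameter values and the normal error distribution are hypothetical choices, since the model only specifies the mean and variance of the errors.

```r
set.seed(7)  # arbitrary seed, for reproducibility only

# Hypothetical parameter values
beta0 <- 3
beta1 <- 0.5
sigma <- 2

x_level <- 10       # one fixed level of the covariate
n_rep   <- 100000   # number of simulated responses at that level

# Responses at X = x_level; only the error term is random
Y_rep <- beta0 + beta1 * x_level + rnorm(n_rep, mean = 0, sd = sigma)

mean_Y <- mean(Y_rep)  # compare with beta0 + beta1 * x_level = 8
var_Y  <- var(Y_rep)   # compare with sigma^2 = 4
print(c(mean = mean_Y, variance = var_Y))
```

Changing `x_level` shifts the sample mean along the line β0 + β1X but leaves the sample variance near σ², which is exactly what the model postulates.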

Summary of model
I Linear models can be specified as: Yi = β0 + β1Xi + εi
I The assumptions are E[εi] = 0, Var[εi] = σ², Cor[εi, εj] = 0 for i ≠ j
I Which gives E[Yi] = β0 + β1Xi, Var[Yi] = σ², Cor[Yi, Yj] = 0 for i ≠ j

Regression parameters
I The parameters are called regression coefficients
I The intercept: β0
I The slope: β1
I The slope gives the change in mean of the probability
distribution of Y per unit increase in X
I The intercept, when the scope of the model includes X = 0,
gives the mean of the probability distribution at X = 0

Before fitting the model
I What is your question of interest?
I Statistical formulation of the question
I Source of the data
I Sample size
I Missing data
I Coding of data and inconsistencies
I Exploratory Data Analysis
I Scatterplots
I Summary statistics

Least squares estimation
I To find a “good” estimator of the regression parameters β0 and
β1, we employ the method of least squares
I For each observation pair (Yi, Xi), we consider the deviation of
Yi from its expected value Yi − E[Yi] given by
Yi − (β0 + β1Xi)

Least squares estimation
I The method of “least squares” considers the sum of the n
squared deviations
I The criterion is denoted by Q:
$$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
I The estimators of β0 and β1 are the values b0 and b1 that
minimise Q given the observation pairs (Y1, X1), . . . , (Yn, Xn)
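
For reference, the minimisation can be carried out with calculus. The derivation below is the standard one and is not spelled out on the slides; the sample means written as \bar{X} and \bar{Y} are notation introduced here. Setting both partial derivatives of Q to zero and solving for the minimising values gives b0 and b1:

```latex
\begin{align*}
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0, \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i (Y_i - \beta_0 - \beta_1 X_i) = 0,
\end{align*}
% solving the two equations simultaneously:
\begin{align*}
b_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \\
b_0 &= \bar{Y} - b_1 \bar{X}.
\end{align*}
```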

Least squares estimation (Figure 1.9)
[Figure 1.9: Attempts (Y) against Age (X), with two candidate lines:
Y = 2.8 + 0.18*X (Q = 5.7) and Y = 9.0 + 0*X (Q = 26)]
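
The comparison in Figure 1.9 can be reproduced in a few lines of R. The data behind the figure are not listed in these slides, so the three (age, attempts) pairs below are an assumption, chosen because they reproduce both Q values quoted in the figure. The helper `Q()` is a name introduced here for the least squares criterion.

```r
# Assumed data: three (age, attempts) pairs consistent with Figure 1.9
age      <- c(20, 55, 30)
attempts <- c(5, 12, 10)

# Least squares criterion for a candidate line with intercept b0 and slope b1
Q <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)

# Candidate 1: horizontal line Y = 9.0 + 0*X
Q_flat <- Q(9.0, 0, age, attempts)

# Candidate 2: the least squares line, from the closed-form estimators
b1 <- sum((age - mean(age)) * (attempts - mean(attempts))) /
  sum((age - mean(age))^2)
b0 <- mean(attempts) - b1 * mean(age)
Q_ls <- Q(b0, b1, age, attempts)

print(round(c(b0 = b0, b1 = b1, Q_flat = Q_flat, Q_ls = Q_ls), 3))

# Cross-check against R's built-in fit
print(coef(lm(attempts ~ age)))
```

Any other choice of intercept and slope passed to `Q()` gives a value at least as large as `Q_ls`, which is what it means for b0 and b1 to minimise Q.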

Properties of LS estimators
I Unbiased, with minimum variance among all unbiased linear
estimators (Gauss-Markov theorem)
E[b0] = β0, E[b1] = β1
I Estimate of σ² = Var[εi] = Var[Yi]
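
Unbiasedness can be illustrated by repeated sampling. The sketch below uses hypothetical parameter values, a fixed set of X values and normal errors (all assumptions), refits the model many times and averages the estimates. For σ² it uses the usual estimator, the error sum of squares divided by n − 2, which the slide above does not write out.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Hypothetical parameter values and a fixed design
beta0 <- 3
beta1 <- 0.5
sigma <- 2
X     <- seq(1, 20, length.out = 25)
n     <- length(X)
n_sim <- 2000

# Each replication: new errors, same X, refit by least squares
sims <- replicate(n_sim, {
  Y   <- beta0 + beta1 * X + rnorm(n, mean = 0, sd = sigma)
  fit <- lm(Y ~ X)
  c(b0  = unname(coef(fit)[1]),
    b1  = unname(coef(fit)[2]),
    mse = sum(resid(fit)^2) / (n - 2))   # estimate of sigma^2
})

avg <- rowMeans(sims)  # compare with beta0 = 3, beta1 = 0.5, sigma^2 = 4
print(round(avg, 3))
```

Individual fits vary from sample to sample; it is the long-run average of b0, b1 and the variance estimate that sits close to the true values.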

What is regression?
I Modelling of a relationship or an association between variables
of interest
I Model the outcome variable on one or more predictor variables

Linear modelling
I Our core analytical method in this course
I Can be extended to nonlinear modelling
I Linear models help us in:
I Description
I Prediction
I Control

More than just fitting a model
I Fitting a model is the easy part
I Consider appropriateness of the model
I Ensuring the assumptions are met
I Diagnostics for a model to check for validity and significance
I Remedies for violations of assumptions
I Finally, make inferences

Pitfalls in regression
I Is a linear model the right model based on theory?
I Correlation does not mean causation
I Do high ice-cream sales lead to higher homicide rates?
I Does high temperature lead to higher homicide rates?
I Reverse Causality
I e.g., GDP and unemployment
I Higher GDP lowers unemployment, but a model may instead test the
effect of unemployment on GDP
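
The ice-cream example can be mimicked with simulated data. Everything below is hypothetical: temperature is made to drive both ice-cream sales and homicide rates, and ice-cream sales have no effect of their own. A regression of homicides on ice-cream sales alone still finds a clear positive slope.

```r
set.seed(99)  # arbitrary seed, for reproducibility only

n <- 5000
temperature <- rnorm(n, mean = 25, sd = 5)                    # common cause
ice_cream   <- 2 * temperature + rnorm(n, mean = 0, sd = 5)   # driven by temperature
homicide    <- 0.5 * temperature + rnorm(n, mean = 0, sd = 2) # driven by temperature only

# Naive model: ice-cream sales appear to "explain" homicides
naive       <- lm(homicide ~ ice_cream)
naive_slope <- coef(naive)["ice_cream"]

# Including temperature removes the apparent effect
adjusted       <- lm(homicide ~ ice_cream + temperature)
adjusted_slope <- coef(adjusted)["ice_cream"]

print(round(c(naive = unname(naive_slope), adjusted = unname(adjusted_slope)), 3))
```

The naive slope reflects the shared dependence on temperature, not any effect of ice cream, so a fitted relation on its own says nothing about cause.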

Pitfalls in regression
I Omitted variable bias
I Study finds “Golfers more prone to heart disease, cancer and arthritis”
I Modelling mistake: the effect of age was omitted
I Multicollinearity
I Child’s education performance predicted by “mother’s education” and
“father’s education”
I Extrapolating beyond the data and data mining (too many
variables)
