Regression Modelling
Lecture 1
Lecturer (Me)
Contact details:
Dale Roberts
E: dale.roberts@anu.edu.au
T: +61 2 612 57336
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c
STAT6014 - Additional material
Contact details:
Lucy Yunxi Hu
E: yunxi.hu@anu.edu.au
T: +61 2 612 50836
Consultation time:
Friday 14:00 - 16:00 (2 hour block)
Room 3.48
CBE Building, 26c
Communication
I Please consult with your allocated tutor for course content
questions
I And/or, use the discussion forum on Wattle
I Please contact the course convenor (me) for issues and
concerns including grades, illness, falling behind, and academic
accessibility issues
Lecture times
I Wednesday, 13:00 - 15:00 (2 hour lecture)
I Friday, 11:00 - 12:00 (1 hour lecture / workshop)
Tutorials
I Begin in Week 2; take time in Week 1 to visit the computer lab;
check you can log on, etc.
I Tutorial sign-up: see instructions on Wattle and in the course outline
I You should read through the tutorial sheet, then think about and
attempt the questions before class
I Best opportunity to learn skills and techniques that will be
required in the quizzes and exams
I Your tutors are your main source for help
Textbook
I The required textbook for this course is Linear Regression by
Michael H Kutner
I This is a custom printed textbook available in print at the Harry
Hartog bookstore
I eBook is available from McGraw Hill. Use the link and discount
code on Wattle to buy the eBook
I There are multiple copies of this text available in the Hancock
library for 2-hour loan
I Linear Models with R by Julian J. Faraway is another good
resource. Available in the Hancock library for 2-day loan.
Course website
I http://wattle.anu.edu.au
I Access to all enrolled students
I Course announcements
I Lecture resources
I Echo360 lecture recordings
I Data sets
I Tutorial questions, selected solutions
I Online quizzes
I Please check this site frequently!
Assessment
Assessment Task      Value   Due Date
Online Quiz          5%      Week 5
Assignment 1         15%     Week 6
Assignment 2         20%     Week 10
Final Examination    65%     Central Exam Period
Hints for success
I Attend lectures and tutorials, supplement given materials with
your own comments and notes.
I Be prepared for classes (read the textbook, attempt tutorial
questions)
I Do the tutorials: statistics is a discipline in which hands-on
participation ⇒ learning
I Time spent trying questions is well spent
R and RStudio
I We will be using the R software throughout the course
I Please see course website for installation instructions for R and
RStudio
I Please attempt Tutorial 0 - Intro to R before your first tutorial
Linear Regression
What is regression?
I Statistical methodology that utilises the relation between two or
more quantitative variables so that a response or outcome
variable can be predicted from the other (or others)
I A core and important methodology in Statistics and Machine
Learning
What is regression?
Examples:
I Predict sales of a product using relationship between sales and
amount spent on advertising
I Predict performance of employee using relationship between
performance and aptitude test
Relations between variables
I We should distinguish between a functional relation and a
statistical relation between variables
I A functional relation between two variables is expressed as a
mathematical formula. If X is the independent variable and Y
the dependent variable, a functional relation is
Y = f (X)
I A functional relation is a “perfect” mapping from X to Y
Relations between variables
[Figure: Dollar Sales (Y) plotted against Units Sold (X), with the line Y = 2X]
Relations between variables
I A statistical relationship is not perfect and the observations
do not fall directly on the curve of relationship
I There is (hopefully) a function/curve that captures a general
tendency but the observations are typically scattered around
this curve
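A small simulation may make the distinction concrete. The sketch below is illustrative only: the noise level and the variable names (`units`, `sales_exact`, `sales_noisy`) are assumptions, not values from the lecture.

```r
# Functional vs. statistical relation (simulated; all numbers are assumptions)
set.seed(1)
units <- seq(20, 140, by = 20)           # X: units sold

# Functional relation: every point lies exactly on Y = 2X
sales_exact <- 2 * units

# Statistical relation: same general tendency, plus random scatter
sales_noisy <- 2 * units + rnorm(length(units), mean = 0, sd = 15)

# Deviations from the line are all zero in the functional case only
dev_exact <- sales_exact - 2 * units
dev_noisy <- sales_noisy - 2 * units
print(round(dev_noisy, 1))
```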
Relations between variables
[Figure: scatterplot of Year-end Evaluation (Y) against Mid-year Evaluation (X)]
Regression Models
History of regression
I The term regression was first used by Francis Galton in the late
19th century to explain a biological phenomenon he observed:
“regression towards the mean”
Galton’s dataset
library(HistData)     # load the package that provides GaltonFamilies
help(GaltonFamilies)  # open the documentation for the data set
This data set lists the individual observations for 934 children in 205
families on which Galton (1886) based his cross-tabulation.
I midparentHeight: mid-parent height, calculated as (father
+ 1.08*mother)/2
I childHeight: height of child
Galton’s dataset
[Figure: scatterplot of childHeight against midparentHeight]
Basic concepts
A regression model is a formal means of expressing two essential
ingredients of a statistical relation:
I A tendency of the response variable Y to vary with the
predictor variable X in a systematic fashion
I A scattering of points around the curve of statistical relationship
These two characteristics are embodied in a regression model by
postulating that:
I There is a probability distribution of Y for each level of X
I The means of these probability distributions vary in some
systematic fashion with X
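The two postulates can be sketched by simulating a separate distribution of Y at a few levels of X. The mean function `10 + 0.9 * X`, the normal scatter and the standard deviation of 5 are assumptions made for illustration.

```r
# Simulate a distribution of Y at each of a few X levels (assumed model)
set.seed(2)
x_levels    <- c(60, 70, 80, 90)
n_per_level <- 500
x <- rep(x_levels, each = n_per_level)

# Assumed mean function 10 + 0.9 * X with normal scatter (sd = 5)
y <- 10 + 0.9 * x + rnorm(length(x), sd = 5)

# Mean and spread of Y within each X level
level_means <- tapply(y, x, mean)
level_sds   <- tapply(y, x, sd)
print(round(level_means, 1))
print(round(level_sds, 1))
```

In this sketch the level means rise steadily with X while the spread within each level stays roughly the same.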
Probability distributions varying with X
[Figure: Year-end Evaluation (Y) against Mid-year Evaluation (X), illustrating the distribution of Y at each level of X]
Construction of Regression Models
Selection of predictor variables / covariates
I Note on terminology:
I Independent variable X, a.k.a. predictor, regressor, covariate,
feature (ML), ...
I Dependent variable Y, a.k.a. response, outcome, output, ...
I Only a limited number of covariates should be included in the
regression model
I How do you choose? Through exploratory studies, theory, etc.
Choice of functional form of regression relation
I Choice of f in the functional form Y = f (X) is tied to the
choice of covariate(s)
I Sometimes the relevant theory may indicate the appropriate
form for f
I Typically needs to be determined empirically from the data
I Linear or quadratic regression functions are often a good first
approximation
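As a rough sketch of choosing the form empirically, one can fit both a linear and a quadratic regression function and compare them. The data are simulated from an assumed curved relationship, so the coefficients and noise level are illustrative only.

```r
# Compare a linear and a quadratic regression function (simulated data)
set.seed(3)
x <- runif(100, min = 0, max = 10)
y <- 1 + 2 * x - 0.3 * x^2 + rnorm(100, sd = 1)   # assumed curved truth

fit_linear    <- lm(y ~ x)
fit_quadratic <- lm(y ~ x + I(x^2))   # I() protects ^ inside a formula

# Residual sums of squares: the larger model cannot fit worse in-sample
rss_linear    <- sum(resid(fit_linear)^2)
rss_quadratic <- sum(resid(fit_quadratic)^2)
print(c(linear = rss_linear, quadratic = rss_quadratic))
```

A smaller in-sample residual sum of squares does not by itself justify the more complex form; theory and diagnostics should also inform the choice.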
Scope of model
I We usually need to restrict the coverage of the model to some
interval or region of values
I We may not have observed the full range of possible
observations and the effect of those observations on our model
I The model may perform badly given previously unobserved data
I Training / fitting model vs. predicting given new observations
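The risk of predicting outside the observed range can be sketched with simulated data. The "true" mean function `true_mean` is an assumption made for illustration; in practice it is unknown.

```r
# Extrapolation outside the observed range (simulated; truth is assumed)
set.seed(4)
true_mean <- function(x) 10 * log(1 + x)   # hypothetical, unknown in practice
x <- runif(50, min = 0, max = 5)           # observed range of X: 0 to 5
y <- true_mean(x) + rnorm(50, sd = 1)
fit <- lm(y ~ x)

new_x <- data.frame(x = c(2.5, 20))        # inside vs. far outside the range
pred  <- predict(fit, newdata = new_x)
abs_error <- abs(pred - true_mean(new_x$x))
names(abs_error) <- c("inside", "outside")
print(round(abs_error, 1))
```

The straight line is a reasonable approximation where data were observed, but the prediction error grows sharply once X leaves that region.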
Use of regression
I Regression serves three major purposes:
I Description (How one variable influences the other)
I Control (Set standards, monitor operations, etc.)
I Prediction (Given new observations)
Regression and Causality
I Existence of a statistical relation between response Y and
covariate X does not imply in any way that Y depends causally
on X
I Funny examples
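One way a statistical relation can arise without causation is a common cause. The sketch below uses simulated, entirely hypothetical variables (`temperature`, `ice_cream`, `incidents`); none of the numbers are real data.

```r
# A spurious association induced by a common cause (simulated; assumed setup)
set.seed(5)
n <- 500
temperature <- rnorm(n, mean = 25, sd = 5)          # common cause
ice_cream   <- 3 * temperature + rnorm(n, sd = 5)   # driven by temperature
incidents   <- 2 * temperature + rnorm(n, sd = 5)   # also driven by temperature

# ice_cream "predicts" incidents, although neither causes the other here
slope_naive <- unname(coef(lm(incidents ~ ice_cream))["ice_cream"])

# Adjusting for the common cause removes most of the apparent effect
slope_adjusted <- unname(coef(lm(incidents ~ ice_cream + temperature))["ice_cream"])

print(round(c(naive = slope_naive, adjusted = slope_adjusted), 3))
```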
Use of computers
I Regression analysis requires lots of tedious calculations
I So we will make extensive use of R to perform these calculations
Simple Linear Regression Model
Formal statement of model
Only one covariate and a linear regression function $f(x) = \beta_0 + \beta_1 x$,
giving
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
where:
I $Y_i$: response from the $i$th trial / observation
I $\beta_0$ and $\beta_1$ are parameters to be determined
I $X_i$: observed covariate from the $i$th trial / observation
I $\varepsilon_i$: random error term with mean zero and variance $\sigma^2$
I $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$
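To see what data from this model look like, one can simulate from it. The parameter values and the normal distribution for the errors are assumptions; the model itself only requires errors with mean zero, constant variance and no correlation.

```r
# Simulate n observations from Y_i = beta0 + beta1 * X_i + eps_i
set.seed(6)
n     <- 100
beta0 <- 2      # assumed intercept
beta1 <- 0.5    # assumed slope
sigma <- 1      # assumed error standard deviation

x   <- runif(n, min = 0, max = 10)      # covariate values, treated as known
eps <- rnorm(n, mean = 0, sd = sigma)   # independent errors, mean 0, variance sigma^2
y   <- beta0 + beta1 * x + eps

sim_data <- data.frame(x = x, y = y)
print(head(sim_data))
```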
Fitting model
I We are given or we observe $n$ pairs of values
$(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$
I The process that relates X to Y is a black box but we assume
it does some linear transformation and we are trying to
determine what the parameters are
I We must fit a linear model
Important features of the model
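I The response $Y_i$ is a random variable as it is the sum of two
components:
I the constant term $\beta_0 + \beta_1 X_i$
I the random term $\varepsilon_i$
I Since $E[\varepsilon_i] = 0$, we have
$$\begin{aligned}
E[Y_i] &= E[\beta_0 + \beta_1 X_i + \varepsilon_i] \\
&= \beta_0 + \beta_1 X_i + E[\varepsilon_i] \\
&= \beta_0 + \beta_1 X_i
\end{aligned}$$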
Important features of the model
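I So the response $Y_i$, for level $X_i$, has a probability distribution
with mean
$$E[Y_i] = \beta_0 + \beta_1 X_i$$
I So we know the regression function for the model is
$$E[Y] = \beta_0 + \beta_1 X$$
I The response $Y_i$ falls above or below the regression line based
on the random fluctuations of $\varepsilon_i$
I We have that, since the constant term $\beta_0 + \beta_1 X_i$ does not
contribute to the variance,
$$\mathrm{Var}[Y_i] = \mathrm{Var}[\beta_0 + \beta_1 X_i + \varepsilon_i] = \mathrm{Var}[\varepsilon_i] = \sigma^2$$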
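These two results can be checked numerically at one fixed level of X. The sketch below draws many responses from the model with assumed parameter values and normal errors, then compares the sample mean and variance with the model's mean and variance at that level.

```r
# Monte Carlo check of E[Y_i] and Var[Y_i] at a fixed X_i (assumed values)
set.seed(7)
beta0 <- 2; beta1 <- 0.5; sigma <- 1.5
x_i <- 4                                   # one fixed level of X

y_draws <- beta0 + beta1 * x_i + rnorm(100000, sd = sigma)

print(c(sample_mean = mean(y_draws), model_mean = beta0 + beta1 * x_i))
print(c(sample_var  = var(y_draws),  model_var  = sigma^2))
```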
Important features of the model
I Error terms $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated; this implies that so are
$Y_i$ and $Y_j$
I Our model assumes that the $Y_i$'s come from probability
distributions with mean $\beta_0 + \beta_1 X_i$ and variance $\sigma^2$
Summary of model
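I The simple linear regression model is specified as: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
I The assumptions are $E[\varepsilon_i] = 0$, $\mathrm{Var}[\varepsilon_i] = \sigma^2$, $\mathrm{Cor}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$
I Which gives $E[Y_i] = \beta_0 + \beta_1 X_i$, $\mathrm{Var}[Y_i] = \sigma^2$, $\mathrm{Cor}[Y_i, Y_j] = 0$ for $i \neq j$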
Regression parameters
I The parameters are called regression coefficients
I The intercept: β0
I The slope: β1
I The slope gives the change in mean of the probability
distribution of Y per unit increase in X
I The intercept, when the scope of the model includes X = 0,
gives the mean of the probability distribution at X = 0
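The interpretation of the two coefficients can be sketched with a fitted model. The data are simulated with an assumed intercept of 3 and slope of 1.2, and the observed range of X includes 0, so the intercept is interpretable here.

```r
# Interpreting the fitted intercept and slope (simulated; assumed truth)
set.seed(9)
x <- runif(80, min = 0, max = 10)
y <- 3 + 1.2 * x + rnorm(80, sd = 1)
fit <- lm(y ~ x)
b <- coef(fit)                       # b["(Intercept)"] and b["x"]

# Estimated mean response at X = 0 equals the fitted intercept
mean_at_0 <- predict(fit, newdata = data.frame(x = 0))

# A one-unit increase in X changes the estimated mean by the fitted slope
mean_diff <- predict(fit, newdata = data.frame(x = 6)) -
  predict(fit, newdata = data.frame(x = 5))

print(b)
```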
Before fitting the model
I What is your question of interest?
I Statistical formulation of the question
I Source of the data
I Sample size
I Missing data
I Coding of data and inconsistencies
I Exploratory Data Analysis
I Scatterplots
I Summary statistics
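A few of these checks might look as follows in R. The data frame `dat` and its values are made up for illustration; the plot is only drawn in an interactive session.

```r
# Basic checks before fitting (hypothetical data; all values are made up)
dat <- data.frame(
  x = c(61, 70, 75, 82, 90, 95),
  y = c(58, 72, NA, 85, 88, 97)
)

n_obs     <- nrow(dat)               # sample size
n_missing <- colSums(is.na(dat))     # missing values per variable
print(summary(dat))                  # summary statistics

complete <- na.omit(dat)             # keep complete rows only
if (interactive()) {
  plot(y ~ x, data = complete)       # scatterplot of response vs. covariate
}
```

Dropping incomplete rows is only one way to handle missing data, and it is worth asking why values are missing before doing so.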
Least squares estimation
I To find a “good” estimator of the regression parameters $\beta_0$ and
$\beta_1$, we employ the method of least squares
I For each observation pair $(Y_i, X_i)$, we consider the deviation of
$Y_i$ from its expected value, $Y_i - E[Y_i]$, given by
$$Y_i - (\beta_0 + \beta_1 X_i)$$
Least squares estimation
I The method of “least squares” considers the sum of the $n$
squared deviations
I The criterion is denoted by $Q$:
$$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$
I The estimators of $\beta_0$ and $\beta_1$ are the values $b_0$ and $b_1$ that
minimise $Q$ given the observation pairs $(Y_1, X_1), \ldots, (Y_n, X_n)$
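For reference, the minimisation has a closed-form solution. The derivation below is the standard one for simple linear regression and is not shown on the slides; $\bar{X}$ and $\bar{Y}$ denote the sample means, which is notation introduced here. Setting the partial derivatives of $Q$ to zero gives the normal equations, which solve to $b_1$ and $b_0$:

```latex
\begin{aligned}
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0, \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i (Y_i - \beta_0 - \beta_1 X_i) = 0, \\[6pt]
b_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}.
\end{aligned}
```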
Least squares estimation (Figure 1.9)
[Figure 1.9: Attempts (Y) against Age (X), with two candidate lines: Y = 2.8 + 0.18*X (Q = 5.7) and Y = 9.0 + 0*X (Q = 26)]
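The comparison of the two candidate lines can be reproduced in a few lines of R. The three (age, attempts) pairs below are an assumption: they are not given on the slide and were chosen because they are consistent with the two Q values in the figure legend.

```r
# Least squares criterion Q for candidate lines (data values are assumed)
age      <- c(20, 55, 30)
attempts <- c(5, 12, 10)

# Q for a candidate intercept b0 and slope b1
Q <- function(b0, b1) sum((attempts - b0 - b1 * age)^2)

q_flat <- Q(9.0, 0)                  # horizontal line Y = 9.0 + 0 * X

fit  <- lm(attempts ~ age)           # least squares fit
b    <- unname(coef(fit))            # b[1] = intercept, b[2] = slope
q_ls <- Q(b[1], b[2])

print(round(c(b0 = b[1], b1 = b[2], Q_flat = q_flat, Q_ls = q_ls), 2))
```

No other choice of intercept and slope gives a smaller Q on these data than the least squares fit.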
Properties of LS estimators
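I Unbiased and minimum variance among unbiased linear estimators
(Gauss-Markov theorem):
$$E[b_0] = \beta_0, \quad E[b_1] = \beta_1$$
I Estimate of $\sigma^2 = \mathrm{Var}[\varepsilon_i] = \mathrm{Var}[Y_i]$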
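Unbiasedness can be illustrated by repeated simulation. The sketch assumes true parameter values and normal errors, and it estimates the error variance with the usual mean squared error, SSE / (n - 2); that formula is the standard one and is not stated on the slide.

```r
# Monte Carlo illustration of unbiasedness (assumed true values)
set.seed(10)
beta0 <- 1; beta1 <- 0.5; sigma <- 2
x <- 1:30                               # fixed design
n <- length(x)

one_fit <- function() {
  y   <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  # MSE = SSE / (n - 2), the usual estimator of sigma^2
  c(coef(fit), mse = sum(resid(fit)^2) / (n - 2))
}

sims <- replicate(2000, one_fit())      # 3 x 2000 matrix of estimates
avg  <- rowMeans(sims)
names(avg) <- c("b0", "b1", "mse")
print(round(avg, 3))
```

Averaged over many samples, the estimates should sit close to the assumed values 1, 0.5 and 4 (that is, sigma squared).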
Summary
What is regression?
I Modelling of a relationship or an association between variables
of interest
I Model the outcome variable on one or more predictor variables
Linear modelling
I Our core analytical method in this course
I Can be extended to nonlinear modelling
I Linear models help us in:
I Description
I Prediction
I Control
More than just fitting a model
I Fitting a model is the easy part
I Consider appropriateness of the model
I Ensuring the assumptions are met
I Diagnostics for a model to check for validity and significance
I Remedies for violations of assumptions
I Finally, make inferences
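A first pass at diagnostics usually starts from the residuals. The sketch below uses simulated data from an assumed model that satisfies the assumptions; the plots are only drawn in an interactive session.

```r
# Residual-based checks after fitting (simulated; assumed model)
set.seed(11)
x <- runif(60, min = 0, max = 10)
y <- 5 + 2 * x + rnorm(60, sd = 1.5)
fit <- lm(y ~ x)

res         <- resid(fit)
fitted_vals <- fitted(fit)

# With an intercept in the model, least squares residuals sum to zero
# and are uncorrelated with the fitted values
print(c(sum_res = sum(res), cor_res_fit = cor(res, fitted_vals)))

if (interactive()) {
  plot(fitted_vals, res); abline(h = 0)   # look for curvature or funnel shapes
  qqnorm(res); qqline(res)                # rough check of normality
}
```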
Pitfalls in regression
I Is a linear model the right model based on theory?
I Correlation does not mean causation
I Do high ice-cream sales lead to higher homicide rates?
I Does high temperature lead to higher homicide rates?
I Reverse Causality
I e.g., GDP and unemployment
I Higher GDP may cause lower unemployment, but the model may
instead test for the effect of unemployment on GDP
Pitfalls in regression
I Omitted variable bias
I Study finds “Golfers more prone to heart disease, cancer and arthritis”
I Modelling mistake: the effect of age was omitted
I Multicollinearity
I Child’s education performance predicted by “mother’s education” and
“father’s education”
I Extrapolating beyond the data and data mining (too many
variables)
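The effect of multicollinearity on the precision of the estimates can be sketched by simulation. The variables `mother_edu`, `father_edu` and `score`, and every number below, are hypothetical.

```r
# Multicollinearity inflates standard errors (simulated; assumed setup)
set.seed(8)
n <- 200
mother_edu <- rnorm(n, mean = 12, sd = 3)
father_edu <- mother_edu + rnorm(n, sd = 0.3)   # nearly collinear with mother_edu
score <- 50 + 1.0 * mother_edu + 1.0 * father_edu + rnorm(n, sd = 5)

# Helper: standard error of one coefficient from a fitted model
se_of <- function(fit, term) summary(fit)$coefficients[term, "Std. Error"]

se_alone <- se_of(lm(score ~ mother_edu), "mother_edu")
se_both  <- se_of(lm(score ~ mother_edu + father_edu), "mother_edu")
cor_parents <- cor(mother_edu, father_edu)

print(round(c(cor = cor_parents, se_alone = se_alone, se_both = se_both), 3))
```

With both predictors in the model, the data carry little information for separating their individual effects, so the standard error of each coefficient grows.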

More Related Content

Similar to Lecture 1.pdf

151028_abajpai1
151028_abajpai1151028_abajpai1
151028_abajpai1
Anshumaan Bajpai
 
Pearson Correlation
Pearson CorrelationPearson Correlation
Pearson Correlation
Noreen Morales
 
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxWeek 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
cockekeshia
 
Intermediate Statistics 1
Intermediate Statistics 1Intermediate Statistics 1
Intermediate Statistics 1
Michael Parent, Ed.D
 
Class 1 Introduction, Levels Of Measurement, Hypotheses, Variables
Class 1   Introduction, Levels Of Measurement, Hypotheses, VariablesClass 1   Introduction, Levels Of Measurement, Hypotheses, Variables
Class 1 Introduction, Levels Of Measurement, Hypotheses, Variables
aoudshoo
 
Designing Test Collections for Comparing Many Systems
Designing Test Collections for Comparing Many SystemsDesigning Test Collections for Comparing Many Systems
Designing Test Collections for Comparing Many Systems
Tetsuya Sakai
 
Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...
Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...
Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...
Arif Rahman
 
advanced_statistics.pdf
advanced_statistics.pdfadvanced_statistics.pdf
advanced_statistics.pdf
GerryMakilan2
 
Basic concepts of_econometrics
Basic concepts of_econometricsBasic concepts of_econometrics
Basic concepts of_econometrics
SwapnaJahan
 
Advanced Econometrics L3-4.pptx
Advanced Econometrics L3-4.pptxAdvanced Econometrics L3-4.pptx
Advanced Econometrics L3-4.pptx
akashayosha
 
Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah Assagaf
 
Aminullah assagaf regresi17
Aminullah assagaf regresi17Aminullah assagaf regresi17
Aminullah assagaf regresi17
Aminullah Assagaf
 
Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah Assagaf
 
Aminullah assagaf regresi17
Aminullah assagaf regresi17Aminullah assagaf regresi17
Aminullah assagaf regresi17
Aminullah Assagaf
 
European conference on educational research
European conference on educational research European conference on educational research
European conference on educational research
GIRUMTAREKE
 
European conference on educational research
European conference on educational research European conference on educational research
European conference on educational research
GIRUMTAREKE
 
European conference on educational research
European conference on educational research European conference on educational research
European conference on educational research
GIRUMTAREKE
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
Atula Ahuja
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
pujashri1975
 
Chapter 2Theories & Models Week II – Slides 1
Chapter 2Theories & Models Week II – Slides 1Chapter 2Theories & Models Week II – Slides 1
Chapter 2Theories & Models Week II – Slides 1
EstelaJeffery653
 

Similar to Lecture 1.pdf (20)

151028_abajpai1
151028_abajpai1151028_abajpai1
151028_abajpai1
 
Pearson Correlation
Pearson CorrelationPearson Correlation
Pearson Correlation
 
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxWeek 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
 
Intermediate Statistics 1
Intermediate Statistics 1Intermediate Statistics 1
Intermediate Statistics 1
 
Class 1 Introduction, Levels Of Measurement, Hypotheses, Variables
Class 1   Introduction, Levels Of Measurement, Hypotheses, VariablesClass 1   Introduction, Levels Of Measurement, Hypotheses, Variables
Class 1 Introduction, Levels Of Measurement, Hypotheses, Variables
 
Designing Test Collections for Comparing Many Systems
Designing Test Collections for Comparing Many SystemsDesigning Test Collections for Comparing Many Systems
Designing Test Collections for Comparing Many Systems
 
Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...
Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...
Modul Ajar Statistika Inferensia ke-12: Uji Asumsi Klasik pada Regresi Linier...
 
advanced_statistics.pdf
advanced_statistics.pdfadvanced_statistics.pdf
advanced_statistics.pdf
 
Basic concepts of_econometrics
Basic concepts of_econometricsBasic concepts of_econometrics
Basic concepts of_econometrics
 
Advanced Econometrics L3-4.pptx
Advanced Econometrics L3-4.pptxAdvanced Econometrics L3-4.pptx
Advanced Econometrics L3-4.pptx
 
Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021
 
Aminullah assagaf regresi17
Aminullah assagaf regresi17Aminullah assagaf regresi17
Aminullah assagaf regresi17
 
Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021Aminullah assagaf model regresi 17+ 5 des 2021
Aminullah assagaf model regresi 17+ 5 des 2021
 
Aminullah assagaf regresi17
Aminullah assagaf regresi17Aminullah assagaf regresi17
Aminullah assagaf regresi17
 
European conference on educational research
European conference on educational research European conference on educational research
European conference on educational research
 
European conference on educational research
European conference on educational research European conference on educational research
European conference on educational research
 
European conference on educational research
European conference on educational research European conference on educational research
European conference on educational research
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
Module-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data scienceModule-2_Notes-with-Example for data science
Module-2_Notes-with-Example for data science
 
Chapter 2Theories & Models Week II – Slides 1
Chapter 2Theories & Models Week II – Slides 1Chapter 2Theories & Models Week II – Slides 1
Chapter 2Theories & Models Week II – Slides 1
 

Recently uploaded

Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
mzpolocfi
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
Bill641377
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Kiwi Creative
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 

Recently uploaded (20)

Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
一比一原版(Dalhousie毕业证书)达尔豪斯大学毕业证如何办理
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...Population Growth in Bataan: The effects of population growth around rural pl...
Population Growth in Bataan: The effects of population growth around rural pl...
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataPredictably Improve Your B2B Tech Company's Performance by Leveraging Data
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 

Lecture 1.pdf

  • 2. Lecturer (Me) Contact details: Dale Roberts E: dale.roberts@anu.edu.au T: +61 2 612 57336 Consultation time: Friday 14:00 - 16:00 (2 hour block) Room 3.48 CBE Building, 26c
  • 3. STAT6014 - Additional material Contact details: Lucy Yunxi Hu E: yunxi.hu@anu.edu.au T: +61 2 612 50836 Consultation time: Friday 14:00 - 16:00 (2 hour block) Room 3.48 CBE Building, 26c
  • 4. Communication I Please consult with your allocated tutor for course content questions I And/or, use the discussion forum on Wattle I Please contact the course convenor (me) for issues and concerns including grades, illness, falling behind, and academic accessibility issues
  • 5. Lecture times I Wednesday, 13:00 - 15:00 (2 hour lecture) I Friday, 11:00 - 12:00 (1 hour lecture / workshop)
  • 6. Tutorials I Begin week 2; take time in Week 1 to visit the computer lab; check you can log on, etc. I Tutorial sign up – see instructions on wattle and course outline I You should read through the tutorial sheet and think and attempt the questions before class I Best opportunity to learn skills and techniques that will be required in the quizzes and exams I Your tutors are your main source for help
  • 7. Textbook I The required textbook for this course is Linear Regression by Michael H Kutner I This is a custom printed textbook available in print at the Harry Hartog bookstore I eBook is available from McGraw Hill. Use the link and discount code on wattle to buy the ebook I There are multiple copies of this text available in the Hancock library for 2Hr loan I Linear Models with R by Julian J. Faraway is another good resource. Available in Hancock library for 2 day loans.
  • 8. Course website I http://wattle.anu.edu.au I Access to all enrolled students I Course announcements I Lecture resources I Echo360 lecture recordings I Data sets I Tutorial questions, selected solutions I Online quizzes I Please check this site frequently!
  • 9. Assessment Assessment Task Value Due Date Online Quiz 5% Week 5 Assignment 1 15% Week 6 Assignment 2 20% Week 10 Final Examination 65% Central Exam Period
  • 10. Hints for success I Attend lectures and tutorials, supplement given materials with your own comments and notes. I Be prepared for classes (read the textbook, attempt tutorial questions) I Do the tutorials - statistics is a discipline in which hands on participation ⇒ learning I Time spent trying questions is well spent
  • 11. R and RStudio I We will be using the R software throughout the course I Please see course website for installation instructions for R and RStudio I Please attempt Tutorial 0 - Intro to R before your first tutorial
  • 13. What is regression? I Statistical methodology that utilises the relation between two or more quantatitive variables to that a response or outcome variable can be predicted from the other (or others) I A core and important methodology in Statistics and Machine Learning
  • 14. What is regression? Examples: I Predict sales of a product using relationship between sales and amount spent on advertising I Predict performance of employee using relationship between performance and aptitude test
  • 15. Relations between variables I We should distinguish between functional relation and a statistical relation between variables I A functional relation between two variables is expressed as a mathematical formula. If X is the independent variable and Y the dependent variable, a functional relation is Y = f (X) I A functional relation is a “perfect” mapping from X to Y
  • 16. Relations between variables 20 40 60 80 100 120 140 50 150 250 Units Sold (X) Dollar Sales (Y) Y = 2X
  • 17. Relations between variables I A statistical relationship is not perfect and the observations to not fall directly on the curve of relationship I There is (hopefully) a function/curve that captures a general tendency but the observations are typically scattered around this curve
  • 18. Relations between variables 60 70 80 90 100 60 70 80 90 110 Mid-year Evaluation (X) Year-end Evaluation (Y)
  • 20. History of regression I The term regression was first used by Francis Galton in the late 19th century to explain a biological phenomenon he observed: “regression towards the mean”
  • 21. Galton’s dataset library(HistData) help(GaltonFamilies) This data set lists the individual observations for 934 children in 205 families on which Galton (1886) based his cross-tabulation. I midparentHeight: mid-parent height, calculated as (father + 1.08*mother)/2 I childHeight: height of child
  • 22. Galton’s dataset 64 66 68 70 72 74 60 65 70 75 midparentHeight childHeight
  • 23. Basic concepts A regression model is a formal means of expressing two essential ingredients of a statistical relation: I A tendency of the response variable Y to vary with the predictor variable X in a systematic fasion I A scattering of points around the curve of statistical relationship These two characteristics are embodied in a regression model by postulating that: I There is a probability distribution of Y for each level of X I The means of these probability distributions vary in some systematic fashion with X
  • 24. Probability distributions varying with X 60 70 80 90 50 60 70 80 90 Mid-year Evaluation (X) Year-end Evaluation (Y)
  • 26. Selection of predictor variables / covariates I Note on terminology: I Independent variable X, aka. predictor, regressor, covariate, feature (ML), . . . I Dependent variable Y , aka. response, outcome, output, . . . I Only a limited number of covariates should be included in the regression model I How do you choose? Through exploratory studies, theory, etc.
  • 27. Choice of functional form of regression relation I Choice of f in the functional form Y = f (X) is tied to the choice of covariate(s) I Sometimes the relevant theory may indicate the appropriate form for f I Typically needs to be determined empirically from the data I Linear or quadratic regression functions are often a first good approximation
  • 28. Scope of model I We usually need to restrict the coverage of the model to some interval or region of values I We may not have observed the full range of possible observations and the effect of those observations on our model I The model may perform badly given previously unobserved data I Training / fitting model vs. predicting given new observations
  • 29. Use of regression I Regression serves three major purposes: I Description (How one variable influences the other) I Control (Set standards, monitor operations, etc.) I Prediction (Given new observations)
  • 30. Regression and Causality I Existence of a statistical relation between response Y and covariate X does not imply in any way that Y depends causally on X I Funny examples
  • 31. Use of computers I Regression analysis requires lots of tedious calculations I So we will make extensive use of R to perform these calculations
  • 33. Formal statement of model Only one covariate and a linear regression function $f(x) = \beta_0 + \beta_1 x$, giving $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ where: I $Y_i$: response from the $i$th trial / observation I $\beta_0$ and $\beta_1$ are parameters to be determined I $X_i$: observed covariate from the $i$th trial / observation I $\varepsilon_i$: random error term with mean zero and variance $\sigma^2$ I $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$
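To make the model statement concrete, here is a minimal R sketch that simulates observations from it. The parameter values, the sample size and the range of X are illustrative assumptions, not values from the lecture. The errors are drawn from a normal distribution only for convenience; the model itself requires only mean zero, constant variance and uncorrelated errors.

```r
# Simulate n observations from Y_i = beta0 + beta1 * X_i + eps_i.
# All numeric values below are illustrative assumptions.
set.seed(1)
n     <- 50
beta0 <- 10    # intercept (assumed)
beta1 <- 0.8   # slope (assumed)
sigma <- 5     # error standard deviation (assumed)

x   <- runif(n, min = 60, max = 100)    # observed covariate values
eps <- rnorm(n, mean = 0, sd = sigma)   # errors: mean zero, variance sigma^2
y   <- beta0 + beta1 * x + eps          # responses

sim <- data.frame(x = x, y = y)
head(sim)
```

Each response is the value of the regression function at its X plus a random error, so the simulated points scatter around the line with intercept beta0 and slope beta1.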
  • 34. Fitting model I We are given or we observe $n$ pairs of values $(Y_1, X_1), (Y_2, X_2), \ldots, (Y_n, X_n)$ I The process that relates X to Y is a black box, but we assume it applies a linear transformation and we try to determine its parameters I We must fit a linear model
  • 35. Important features of the model I The response $Y_i$ is a random variable as it is the sum of two components: I the constant term $\beta_0 + \beta_1 X_i$ I the random term $\varepsilon_i$ I Since $E[\varepsilon_i] = 0$, we have $E[Y_i] = E[\beta_0 + \beta_1 X_i + \varepsilon_i] = \beta_0 + \beta_1 X_i + E[\varepsilon_i] = \beta_0 + \beta_1 X_i$
  • 36. Important features of the model I So the response $Y_i$, for level $X_i$, has a probability distribution with mean $E[Y_i] = \beta_0 + \beta_1 X_i$ I So we know the regression function for the model is $E[Y] = \beta_0 + \beta_1 X$ I The response $Y_i$ falls above or below the regression line based on the random fluctuations of $\varepsilon_i$ I We have that $\mathrm{Var}[Y_i] = \mathrm{Var}[\beta_0 + \beta_1 X_i + \varepsilon_i] = \mathrm{Var}[\varepsilon_i] = \sigma^2$
  • 37. Important features of the model I Error terms $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for $i \neq j$; this implies that $Y_i$ and $Y_j$ are also uncorrelated I Our model assumes that the $Y_i$ come from probability distributions with mean $\beta_0 + \beta_1 X_i$ and variance $\sigma^2$
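  • 38. Summary of model I Linear models can be specified as: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ I The assumptions are $E[\varepsilon_i] = 0$, $\mathrm{Var}[\varepsilon_i] = \sigma^2$, $\mathrm{Cor}[\varepsilon_i, \varepsilon_j] = 0$ for $i \neq j$ I Which gives $E[Y_i] = \beta_0 + \beta_1 X_i$, $\mathrm{Var}[Y_i] = \sigma^2$, $\mathrm{Cor}[Y_i, Y_j] = 0$ for $i \neq j$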
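The mean and variance results in the summary can be checked numerically. The sketch below fixes one level of X, draws many responses at that level, and compares the sample mean and variance with the theoretical values. The parameter values and the normal error distribution are assumptions made for illustration only.

```r
# Check E[Y_i] = beta0 + beta1 * X_i and Var[Y_i] = sigma^2 by simulation.
# Parameter values are illustrative assumptions.
set.seed(42)
beta0 <- 10
beta1 <- 0.8
sigma <- 5
x0    <- 75   # a single fixed level of X

# Many independent responses at the same level x0
y_draws <- beta0 + beta1 * x0 + rnorm(100000, mean = 0, sd = sigma)

mean(y_draws)   # should be close to beta0 + beta1 * x0 = 70
var(y_draws)    # should be close to sigma^2 = 25
```

Only the error term varies between draws, which is why the variance of the responses matches the variance of the errors.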
  • 39. Regression parameters I The parameters are called regression coefficients I The intercept: $\beta_0$ I The slope: $\beta_1$ I The slope gives the change in mean of the probability distribution of Y per unit increase in X I The intercept, when the scope of the model includes $X = 0$, gives the mean of the probability distribution at $X = 0$
  • 40. Before fitting the model I What is your question of interest? I Statistical formulation of the question I Source of the data I Sample size I Missing data I Coding of data and inconsistencies I Exploratory Data Analysis I Scatterplots I Summary statistics
  • 41. Least squares estimation I To find a “good” estimator of the regression parameters $\beta_0$ and $\beta_1$, we employ the method of least squares I For each observation pair $(Y_i, X_i)$, we consider the deviation of $Y_i$ from its expected value, $Y_i - E[Y_i]$, given by $Y_i - (\beta_0 + \beta_1 X_i)$
  • 42. Least squares estimation I The method of “least squares” considers the sum of the $n$ squared deviations I The criterion is denoted by $Q$: $Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$ I The estimators of $\beta_0$ and $\beta_1$ are the values $b_0$ and $b_1$ that minimise $Q$ given the observation pairs $(Y_1, X_1), \ldots, (Y_n, X_n)$
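For reference, the minimisation of the criterion Q can be worked out in closed form. The derivation below is the standard one for simple linear regression and is not taken from the slides; $\bar{X}$ and $\bar{Y}$ denote the sample means, and it assumes the $X_i$ are not all equal.

```latex
% Set both partial derivatives of Q to zero at (b_0, b_1):
\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0,
\qquad
\frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0.

% Solving these two normal equations gives
b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}.
```

The second expression shows that the fitted line always passes through the point $(\bar{X}, \bar{Y})$.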
  • 43. Least squares estimation (Figure 1.9) [Figure: Attempts (Y) plotted against Age (X) with two candidate lines: Y = 2.8 + 0.18*X (Q=5.7) and Y = 9.0 + 0*X (Q=26)]
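The comparison of the two candidate lines can be reproduced with a few lines of R. The three (Age, Attempts) pairs below are an assumption: they are chosen to be consistent with the Q values quoted for the figure and are not read from the slide. The helper `Q` is a hypothetical name for the least squares criterion.

```r
# Three (X, Y) pairs assumed for illustration; they are consistent with the
# Q values quoted for the figure but are not taken from the slide itself.
age      <- c(20, 55, 30)
attempts <- c(5, 12, 10)

# Least squares criterion Q for a candidate line Y = b0 + b1 * X
Q <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)

Q(9.0, 0, age, attempts)      # horizontal line at the mean of Y: 26
Q(2.8, 0.18, age, attempts)   # rounded least squares line: 5.69

# Closed-form least squares estimates
b1 <- sum((age - mean(age)) * (attempts - mean(attempts))) /
      sum((age - mean(age))^2)
b0 <- mean(attempts) - b1 * mean(age)
c(b0 = b0, b1 = b1)

# lm() minimises the same criterion, so its coefficients should agree
fit <- lm(attempts ~ age)
coef(fit)
```

No other straight line gives a smaller Q on these points than the one returned by `lm()`, which is what makes it the least squares line.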
  • 44. Properties of LS estimators I Unbiased and minimum variance among all unbiased linear estimators: $E[b_0] = \beta_0$, $E[b_1] = \beta_1$ I Estimate of $\sigma^2 = \mathrm{Var}[\varepsilon_i] = \mathrm{Var}[Y_i]$
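Unbiasedness can be illustrated by refitting the model to many simulated data sets and averaging the estimates. The sketch below also computes the usual estimate of the error variance, the residual sum of squares divided by n - 2. That formula is the standard one and is not stated on the slide. The parameter values, the design points and the normal errors are illustrative assumptions, and `one_fit` is a hypothetical helper name.

```r
# Repeated sampling check of E[b0] = beta0, E[b1] = beta1, and of the
# variance estimate s^2 = SSE / (n - 2). All values are assumptions.
set.seed(7)
beta0 <- 2
beta1 <- 3
sigma <- 1.5                         # true sigma^2 = 2.25
x <- seq(0, 10, length.out = 30)     # fixed design points
n <- length(x)

one_fit <- function() {
  y   <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(b0 = unname(coef(fit)[1]),
    b1 = unname(coef(fit)[2]),
    s2 = sum(residuals(fit)^2) / (n - 2))  # two parameters were estimated
}

sims <- replicate(2000, one_fit())   # 3 x 2000 matrix of estimates
rowMeans(sims)                       # averages should be near 2, 3 and 2.25
```

Individual fits miss the true values, but the averages over many samples settle close to them, which is what unbiasedness means.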
  • 46. What is regression? I Modelling of a relationship or an association between variables of interest I Model the outcome variable on one or more predictor variables
  • 47. Linear modelling I Our core analytical method in this course I Can be extended to nonlinear modelling I Linear models help us in: I Description I Prediction I Control
  • 48. More than just fitting a model I Fitting a model is the easy part I Consider appropriateness of the model I Ensuring the assumptions are met I Diagnostics for a model to check for validity and significance I Remedies for violations of assumptions I Finally, make inferences
  • 49. Pitfalls in regression I Is a linear model the right model based on theory? I Correlation does not mean causation I Do high ice-cream sales lead to higher homicide rates? I Does high temperature lead to higher homicide rates? I Reverse Causality I e.g., GDP and unemployment I Higher GDP causes lower unemployment, but the model may test for the effect of unemployment on GDP
  • 50. Pitfalls in regression I Omitted variable bias I Study finds “Golfers more prone to heart disease, cancer and arthritis” I Modelling mistake: the effect of age was omitted I Multicollinearity I Child’s educational performance predicted by “mother’s education” and “father’s education” I Extrapolating beyond the data and data mining (too many variables)
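Omitted variable bias, as in the golfers example, can be reproduced with simulated data. In the sketch below every number is invented for illustration: age drives both the amount of golf played and the disease score, while golf has no effect on disease at all.

```r
# Simulated illustration of omitted variable bias. All values are invented:
# age affects both golf and disease, golf has no direct effect on disease.
set.seed(3)
n       <- 5000
age     <- runif(n, min = 20, max = 80)
golf    <- 0.05 * age + rnorm(n)   # older people play more golf (assumed)
disease <- 0.10 * age + rnorm(n)   # disease score depends on age only

slope_without_age <- coef(lm(disease ~ golf))["golf"]
slope_with_age    <- coef(lm(disease ~ golf + age))["golf"]

slope_without_age   # clearly positive: golf stands in for the omitted age
slope_with_age      # near zero once age is included in the model
```

The apparent effect of golf disappears when age enters the model, which is the modelling mistake the slide describes.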