Introduction to R : Regression Module

Introduction to R
Regression Module
Pinaki M Mukherjee

1 About R software environment
2 Download and install R
3 R packages
4 Types of data in R & import data in R
5 Regression Modelling
6 Go to R Lab
7 Interpretation of R regression output

About R software environment

What is R?
R is a powerful software environment for statistical computing and
graphics
R is free open source software licensed under the GNU general public
license
R runs on Linux, Mac and Windows
Users of R
Economists
Financial Analysts
Market Researchers
Academicians
Bioscientists
Many other professionals for quantitative research

Why use R?
There are many statistical software packages such as SAS, SPSS, EViews, Stata etc.
So why use R?
Some of the very strong reasons
It's FREE
More than 2 million users around the world
High acceptability and recognition: Extensively used by large corporate
houses and business schools & universities (like Stanford, Harvard,
Johns Hopkins, Princeton, Washington University, MIT etc.)
New features are being developed all the time
Active R community and free updated resources

Applications of R
Data sourcing
Data cleaning
Data structuring
Data warehousing
Statistical and econometric modelling
Report document generation
Preparing presentations
Automating reproducible research

Download and install R

Install R
Open www.r-project.org
Click the "download R" link
Select a CRAN location (a mirror site) and click the corresponding link.
Click on the “Download R for Windows” link at the top of the page
Click on the “install R for the first time” link at the top of the page.
Click “Download R for Windows” and save the executable file
Run the .exe file and follow the installation instructions
After installing R, you need to download and install RStudio. RStudio
is an easy-to-use interface for R, loaded with many user-friendly and useful
features. Like R, RStudio is also free.

Install RStudio
Open www.rstudio.com
Click on the “Download RStudio” button
Click on “Download RStudio Desktop”
Click on the version recommended for your system, or the latest
Windows version
Download and save the executable file
Run the .exe file and follow the installation instructions

R packages

About R packages
Packages are collections of R functions, data, and compiled code in a
well-defined format
There are more than 6000 packages available for R
However, we need only a handful of packages to work
Only 1% of external packages are required to efficiently execute 99% of
the work
The packages are kept on dedicated servers maintained by the R community.
In India, we have the server at IIT Madras
This network of servers is called CRAN (Comprehensive R
Archive Network)

Install R packages
(You need a working internet connection to run this code successfully)
To install one specific package in R
On the R Console, type install.packages("forecast"), where "forecast" is the
name of the package
Run the following code in the "Console" of RStudio to install only the most
important packages relevant for us.
install.packages("ctv")
library("ctv")
install.views("Econometrics")
install.views("ReproducibleResearch")
install.views("Finance")
install.views("MachineLearning")

Some important commands
Load a package in the R session
library(forecast)
See the packages loaded in the R session
search()
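A minimal sketch of the two commands above. It uses the stats package only because stats ships with every R installation, so the example runs without installing anything; the variable names are illustrative.

```r
# Load a package into the current session. 'stats' ships with base R,
# so this works even before any extra package has been installed.
library(stats)

# search() returns the attached packages and environments as a character vector
attached <- search()
print(attached)

# Check whether a particular package is currently attached
is_attached <- "package:stats" %in% attached
print(is_attached)
```

The same check works for forecast once it has been installed and loaded with library(forecast).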

Types of data in R & import data in R

Vectors, Factors & Matrix
A vector is a collection of data elements of the same type (also called class
in R). There are five different classes of data elements
Character
Numeric
Logical
Date and
Integer
Factors are qualitative variables used extensively in modelling. For example,
the interest rate change made by the RBI at different RBI monetary policy review
meetings
A matrix is a numeric vector arranged in two dimensions: rows and columns
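The classes above can be sketched in a few lines of base R. All values and variable names here are made up for illustration.

```r
# Vectors: every element of a vector shares one class
names_vec  <- c("TV", "Radio", "Newspaper")            # character
budget_vec <- c(230.1, 37.8, 69.2)                     # numeric
flag_vec   <- c(TRUE, FALSE, TRUE)                     # logical
count_vec  <- c(1L, 2L, 3L)                            # integer (note the L suffix)
date_vec   <- as.Date(c("2015-01-01", "2015-02-01"))   # Date

# class() reports the class of each vector
print(sapply(list(names_vec, budget_vec, flag_vec, count_vec, date_vec), class))

# Factor: a qualitative variable with a fixed set of levels
# (hypothetical policy decisions at four review meetings)
decision <- factor(c("hike", "hold", "cut", "hold"),
                   levels = c("cut", "hold", "hike"))
print(levels(decision))
print(table(decision))

# Matrix: a vector laid out in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)
print(dim(m))
```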

Dataframe & List
Dataframe
Most frequently used format for statistical analysis
Different from matrices: it can store vectors of different classes
It can be created in R by simple data entry, or
it can be imported from external sources, like datasets in csv files,
Excel files etc.
A list is a collection of different objects, such as dataframes. It may resemble a workbook
with different sheets
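A small sketch of both structures, built by simple data entry. The column names and values are hypothetical.

```r
# A dataframe can hold columns of different classes side by side
df <- data.frame(
  medium = c("TV", "Radio", "Newspaper"),   # character column
  budget = c(230.1, 37.8, 69.2),            # numeric column
  online = c(FALSE, FALSE, FALSE),          # logical column
  stringsAsFactors = FALSE
)
str(df)
print(sapply(df, class))

# A list can bundle several dataframes, much like sheets in a workbook
sheets <- list(
  ads     = df,
  targets = data.frame(quarter = 1:4, sales = c(10, 12, 15, 11))
)
print(names(sheets))

# Elements are reached by name, with $ or [[ ]]
print(sheets$ads$budget)
print(sheets[["targets"]])
```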

Import external data into R
From csv or txt file
read.csv("data.csv")
From Excel
install.packages("xlsx") (if the xlsx package is not already installed)
library(xlsx) (loads the xlsx package into the R session)
read.xlsx("data.xlsx", sheetIndex = 1) (imports the data in Sheet 1
of data.xlsx)
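To try the csv import without an existing file, the sketch below first writes a tiny made-up dataset to a temporary file and then reads it back. The column names and numbers are assumptions for illustration only.

```r
# Write a small, made-up dataset to a temporary csv file
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(TV = c(230.1, 44.5, 17.2), Sales = c(22.1, 10.4, 9.3)),
  file = tmp,
  row.names = FALSE
)

# Import it; read.csv returns a dataframe
ads <- read.csv(tmp, header = TRUE)

# Inspect the first rows and the column classes
print(head(ads))
str(ads)
```

With a real file, replace tmp with the path to the file, for example "data.csv" in the working directory.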

Regression Modelling

Linear regression
Linear regression is a simple approach to supervised learning. It
assumes that the dependence of Y on X1, X2, ..., Xp is linear
True regression functions are never linear
Although it may seem overly simplistic, linear regression is extremely
useful both conceptually and practically

Questions we might ask
Is there a relationship between the dependent and independent
variable?
How strong is the relationship between the dependent and independent
variable?
Which independent variables contribute to the dependent variable?
Is the relationship linear?
How accurately can we forecast the value of the dependent variable?

Simple linear regression using a single predictor X.
We assume a model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where $\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept
and slope, also known as coefficients or parameters, and $\varepsilon$ is the error term
(its sample counterpart is the residual)
Given the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the model coefficients, we can
forecast Y using the following equation
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
where $\hat{y}$ indicates a prediction of Y on the basis of X = x. The hat symbol
denotes an estimated value.

Estimation of the parameters by least squares
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Let $\hat{y}_i$ be the prediction for Y based on the ith value $x_i$ of X
$e_i = y_i - \hat{y}_i$ represents the ith residual
We define the residual sum of squares, also called RSS, as
$$RSS = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$$
The least squares approach chooses $\hat{\beta}_0$ and $\hat{\beta}_1$ to minimize the RSS
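A minimal sketch of this estimation on simulated data. The data, the seed and the variable names are assumptions for illustration. The slope and intercept are computed with the standard closed-form least squares solution (not shown on the slide) and then compared with R's lm().

```r
# Simulated data: a hypothetical linear relationship plus noise
set.seed(1)
x <- runif(50, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(50, sd = 3)

# Closed-form least squares estimates for one predictor:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Predictions, residuals and the residual sum of squares
y_hat <- b0 + b1 * x
e     <- y - y_hat
rss   <- sum(e^2)

# The same fit with lm(), which also minimizes the RSS
fit <- lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
print(rss)
```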

Simple regression model: The advertisement data
[Figure: "Sales to TV Advertisement", plotting Sales (0 to 30) against TV Ad budget (0 to 300)]

Multiple regression using more than one predictor
We assume a model:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
where $\beta_0$, $\beta_1$ and $\beta_2$ are three unknown constants that represent the
intercept and slopes, also known as coefficients or parameters, and $\varepsilon$ is the
error term (its sample counterpart is the residual)
Given the estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ of the model coefficients, we can
forecast Y using the following equation
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$$
where $\hat{y}$ indicates a prediction of Y on the basis of $X_1 = x_1$ and $X_2 = x_2$. The hat symbol
denotes an estimated value.

Estimation of the parameters by least squares
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$$
Let $\hat{y}_i$ be the prediction for Y based on the value $x_{1i}$ of $X_1$ and the value $x_{2i}$
of $X_2$
$e_i = y_i - \hat{y}_i$ represents the ith residual
We define the residual sum of squares, also called RSS, as
$$RSS = e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$$
The least squares approach chooses $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ to minimize the RSS
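With more than one predictor, the RSS-minimizing coefficients can be obtained by solving the normal equations (X'X) b = X'y, a standard result that the slide does not show. The sketch below does this on simulated data and compares the result with lm(). The data and the names tv, radio and sales are assumptions for illustration.

```r
# Simulated data with two hypothetical predictors
set.seed(2)
n     <- 100
tv    <- runif(n, min = 0, max = 300)
radio <- runif(n, min = 0, max = 50)
sales <- 3 + 0.045 * tv + 0.19 * radio + rnorm(n, sd = 1.5)

# Design matrix: a column of ones for the intercept, then the predictors
X <- cbind(intercept = 1, tv = tv, radio = radio)

# Solve the normal equations (X'X) b = X'y for b
beta_hat <- solve(crossprod(X), crossprod(X, sales))

# Residual sum of squares at the solution
rss <- sum((sales - X %*% beta_hat)^2)

# lm() gives the same coefficients
fit <- lm(sales ~ tv + radio)
print(as.numeric(beta_hat))
print(coef(fit))
print(rss)
```

lm() itself uses a QR decomposition rather than the normal equations, which is numerically more stable; the direct solve is shown only because it mirrors the minimization described above.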

Multiple regression model: The advertisement data
[Figure: Sales (0 to 30) plotted against TV (0 to 300) and Radio (0 to 50) advertising budgets]

Go to R Lab

Go to R Lab: Objective of the Lab
Import external data
Correlation matrix
Estimating regression coefficients
Estimating error term/residuals
Print regression model summary
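The lab steps can be sketched end to end as below. Because the lab's data file is not part of these slides, the sketch simulates a stand-in dataset; the column names TV, Radio and Sales, the numbers and the object name reg are assumptions.

```r
# Step 1: data. In the lab this would be an import such as read.csv();
# here a simulated stand-in dataset is used instead.
set.seed(42)
n   <- 200
ads <- data.frame(TV = runif(n, 0, 300), Radio = runif(n, 0, 50))
ads$Sales <- 3 + 0.045 * ads$TV + 0.19 * ads$Radio + rnorm(n, sd = 1.5)

# Step 2: correlation matrix of all numeric columns
cor_mat <- cor(ads)
print(round(cor_mat, 3))

# Step 3: estimate the regression coefficients
reg <- lm(Sales ~ TV + Radio, data = ads)
print(coef(reg))

# Step 4: estimate the residuals (one per observation)
res <- residuals(reg)
print(head(res))

# Step 5: print the regression model summary
print(summary(reg))
```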

Interpretation of R regression output

Accuracy of the estimated coefficient: Confidence interval
The standard error of an estimator reflects how it varies under repeated
sampling
These standard errors can be used to compute confidence intervals
A 95% confidence interval is defined as a range of values such that
with 95% probability, the range will contain the true unknown value of
the parameter.
It has the form $\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$
That is, there is approximately a 95% chance that the interval
$[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)]$
will contain the true value of $\beta_1$
For the advertising data, the 95% confidence interval for $\beta_1$ is [0.042,
0.053]
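A short sketch comparing the "estimate plus or minus 2 standard errors" rule with the interval returned by confint(). The data are simulated for illustration, so the numbers will not match the advertising data interval quoted above.

```r
# Simulated data (hypothetical)
set.seed(3)
x <- runif(100, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

# Estimate and standard error of the slope from the coefficient table
coefs <- coef(summary(fit))
est <- coefs["x", "Estimate"]
se  <- coefs["x", "Std. Error"]

# Approximate 95% interval: estimate +/- 2 * SE
approx_ci <- c(est - 2 * se, est + 2 * se)

# Interval from confint(), which uses the t distribution quantile
# instead of the rounded value 2
exact_ci <- confint(fit, "x", level = 0.95)

print(approx_ci)
print(exact_ci)
```

The two intervals differ slightly because 2 is only a rounded stand-in for the t quantile.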

Hypothesis testing
Standard errors can also be used to perform hypothesis tests on the
coefficients
The most common hypothesis test involves testing
The null hypothesis of H0 : There is no relationship between X and Y
Vs
The alternative hypothesis of HA : There is some relationship between X
and Y

Hypothesis testing: what it means mathematically
Testing
$H_0: \beta_1 = 0$
Vs
$H_A: \beta_1 \neq 0$
If $\beta_1 = 0$, it means X is not associated with Y
To test the null hypothesis, we compute a t-statistic, given by
$$t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}$$
Using R, it is easy to compute the probability of observing any value
equal to |t| or larger, assuming $\beta_1 = 0$. We call this probability the p-value.
If we see a small p-value, then we can infer that there is an association
between the predictor and the response. We reject the null
hypothesis; that is, we declare a relationship to exist between X and Y
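The t-statistic and p-value can be reproduced by hand from the coefficient table, as in this sketch on simulated data. The data and variable names are assumptions for illustration.

```r
# Simulated data (hypothetical)
set.seed(4)
x <- runif(60, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(60, sd = 3)
fit <- lm(y ~ x)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# t-statistic for H0: beta1 = 0
t_stat <- (coefs["x", "Estimate"] - 0) / coefs["x", "Std. Error"]

# Two-sided p-value: probability of a value at least as large as |t|
# under a t distribution with the residual degrees of freedom
dof     <- df.residual(fit)
p_value <- 2 * pt(-abs(t_stat), df = dof)

print(c(t = t_stat, p = p_value))
print(coefs["x", c("t value", "Pr(>|t|)")])
```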

Assessing the Overall Accuracy of the Model: R2
R-squared, or the fraction of variance explained, is
$$R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}$$
TSS = Total Sum of Squares, also called total variation
RSS = Residual Sum of Squares, also called unexplained variation
Explained variation = Total variation - Unexplained variation
$R^2$ measures the proportion of variability in the dependent variable that
can be explained using the independent variable
An $R^2$ statistic that is close to 1 indicates that a large proportion of
the variability in the response has been explained by the regression
A number near 0 indicates that the regression did not explain much of
the variability in the response
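A minimal sketch that computes R-squared from TSS and RSS and checks it against the value reported by summary(). The data are simulated for illustration.

```r
# Simulated data (hypothetical)
set.seed(5)
x <- runif(80, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(80, sd = 3)
fit <- lm(y ~ x)

# Total variation and unexplained variation
tss <- sum((y - mean(y))^2)
rss <- sum(residuals(fit)^2)

# R-squared = 1 - RSS / TSS
r2 <- 1 - rss / tss

print(r2)
print(summary(fit)$r.squared)
```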

Assessing the Overall Accuracy of the Model: R2
$R^2$ always lies between 0 and 1.
However, it can still be challenging to determine what a good $R^2$ value is.
It depends on the application
Physics ~ close to 1; a smaller value might indicate a serious
problem
Biology, Psychology ~ well below 0.1 might be more realistic!
Economics and finance ~ well above 0.6 might be more acceptable!
What is the value of $R^2$ in our data?

Assessing the Overall Accuracy of the Model: F Test
$$F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}$$
n = number of observations
p = number of independent variables
Intuitively, if the model is a good fit then the explained variation
(TSS − RSS) will be high relative to the RSS.
An F value higher than 1 is desired
Just how high depends on the sample size n and the number of
independent variables p
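The F statistic can be computed directly from the formula and compared with the one stored in the model summary. This is a sketch on simulated data with two hypothetical predictors.

```r
# Simulated data with p = 2 predictors (hypothetical)
set.seed(6)
n     <- 100
p     <- 2
tv    <- runif(n, min = 0, max = 300)
radio <- runif(n, min = 0, max = 50)
sales <- 3 + 0.045 * tv + 0.19 * radio + rnorm(n, sd = 1.5)
fit   <- lm(sales ~ tv + radio)

# Total and residual sums of squares
tss <- sum((sales - mean(sales))^2)
rss <- sum(residuals(fit)^2)

# F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))

# summary() stores the F statistic with its two degrees of freedom
f_r <- summary(fit)$fstatistic   # named vector: value, numdf, dendf

# p-value of the F test: upper tail of the F distribution
f_p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(f_stat)
print(f_r)
print(f_p_value)
```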

Answer to “Questions we might ask” in advertisement data
Is there a relationship between the dependent and independent variable?
Is there a relationship between advertising budget and sales?
How strong is the relationship between the dependent and independent
variable?
How strong is the relationship?
Which independent variables contribute to the dependent variable?
Which media contribute to sales?
How large is the effect of each medium on sales?
How accurately can we forecast the value of the dependent variable?

Exporting the regression output
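# Capture the printed summary of the fitted model 'reg' as a character vector
a <- capture.output(summary(reg))
# Append it to trial.txt, writing each captured line on its own line
cat(a, file = "trial.txt", sep = "\n", append = TRUE)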

Online free resources
R Cookbook : http://www.cookbook-r.com/
Try R: http://tryr.codeschool.com/
Video tutorials: http://www.twotorials.com/
I shall be glad to help you
Follow me on Google Plus and on my blog
Email: pinaki.economics@gmail.com
Mobile: +91 9818383989
