“Statistical Learning Model using R”
SRES's Sanjivani College of Engineering, Kopargaon [IT], 2018-2019
CHAPTER 1
INTRODUCTION
Statistical learning theory is a framework for machine learning drawing from the
fields of statistics and functional analysis. Statistical learning theory deals with the
problem of finding a predictive function based on data. Statistical learning theory has
led to successful applications in fields such as computer vision, speech
recognition, bioinformatics and baseball.
The goals of learning are understanding and prediction. Learning falls into many
categories, including supervised learning, unsupervised learning, online learning,
and reinforcement learning. From the perspective of statistical learning theory, supervised
learning is best understood. Supervised learning involves learning from a training set of
data. Every point in the training set is an input-output pair, where the input maps to an
output. The learning problem consists of inferring the function that maps between the
input and the output, such that the learned function can be used to predict output from
future input.
Depending on the type of output, supervised learning problems are either problems
of regression or problems of classification. If the output takes a continuous range of
values, it is a regression problem. Using Ohm's Law as an example, a regression could be
performed with voltage as input and current as output. The regression would find the
functional relationship between voltage and current.
Classification problems are those for which the output will be an element from a
discrete set of labels. Classification is very common for machine learning applications.
In facial recognition, for instance, a picture of a person's face would be the input, and the
output label would be that person's name. The input would be represented by a large
multidimensional vector whose elements represent pixels in the picture.
After learning a function based on the training set data, that function is validated on
a test set of data, data that did not appear in the training set.
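As a minimal sketch of this train/test workflow in R (mydata is a hypothetical data frame), the data can be split at random before any model is fit:

set.seed(1)                           # make the random split reproducible
n <- nrow(mydata)                     # mydata is a hypothetical data frame
train <- sample(n, round(0.7 * n))    # indices of a 70% training sample
train.df <- mydata[train, ]           # training set, used to learn the function
test.df  <- mydata[-train, ]          # test set, kept unseen during fitting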
1.1 Supervised Versus Unsupervised Learning:
Most statistical learning problems fall into one of two categories: supervised or
unsupervised. The examples that we have discussed so far in this chapter all fall into the
supervised learning domain. For each observation of the predictor measurement(s) xi,
i = 1, . . . , n, there is an associated response measurement yi. We wish to fit a model that
relates the response to the predictors, with the aim of accurately predicting the response
for future observations (prediction) or better understanding the relationship between the
response and the predictors (inference). Many classical statistical learning methods, such
as linear regression and logistic regression, as well as more modern approaches such as
GAMs, boosting, and support vector machines, operate in the supervised learning
domain. The vast majority
of this book is devoted to this setting.
In contrast, unsupervised learning describes the somewhat more challenging situation
in which for every observation i = 1, … n, we observe a vector of measurements xi but no
associated response yi. It is not possible to fit a linear regression model, since there is no
response variable to predict. In this setting, we are in some sense working blind; the
situation is referred to as unsupervised because we lack a response variable that can
supervise our analysis.
1.2 Issues in Statistical Learning:
This section surveys a special journal issue on the fundamental modeling and learning
issues of newly emerging approaches and empirical applications in speech and language
processing. Another focus
of this special issue is on the cross-fertilization of learning approaches to speech and
language processing problems. Many problems in speech and language processing share
similarities (despite some conspicuous differences), and techniques in these two fields
can be successfully cross-pollinated.
Our additional goal is to bring together a diverse but complementary set of
contributions on emerging learning methods for speech processing, language processing,
as well as unifying approaches to problems cross cutting these two fields. Discriminative
learning has become a major theme in most areas of speech and language processing.
One of the recent advances in discriminative learning is the integration of the large
margin idea, which is the classical training standard in machine learning, into the
conventional discriminative training criteria for string recognition.
One contribution shows how typical training criteria, such as minimum phone error and
maximum mutual information, can be extended to incorporate the margin concept. In this
work, a new margin-based formalism is proposed for various conventional training criteria.
Experimental results show that the new criteria help the performance across a wide
variety of string recognition scenarios including speech recognition, concept tagging, and
handwriting recognition. In another paper, Cheng et al. explore online learning and
acoustic feature adaptation in large margin hidden Markov models (HMMs), which lead
to a better optimization method for large-margin HMM training. Moving beyond
acoustics, language modeling is one of the essential problems in speech and language
fields. Zhou et al. introduce a novel pseudo-conventional N-gram language model with
discriminative training, and also carry out an empirical study of the robustness of
discriminatively trained LMs. Experimental results show that cumulative performance
improvements can be achieved via this method.
Sequential pattern classification is at the core of many speech and language
processing problems. Conditional random field (CRF) is a widely adopted approach to
supervised sequential labeling.
However, the computational load and model complexity grow dramatically when taking
complex structure into account. Here, Sokolovska et al. address this issue through
efficient feature selection
based on imposing sparsity through an L1 regularization for CRF. The results show that,
without performance degradation, the L1 regularized CRF results in significantly faster
training and labeling speed, and hence makes it possible to scale up systems to handle
very large dimensional models. Meanwhile, Yu et al. improve the CRF model from
another perspective.
They propose a multi-layer sequence classification algorithm where each layer is a
CRF, and each higher layer’s input consists of both the previous layer’s observation
sequence and the resulting frame-level marginal probabilities. Compared with the
conventional CRF, the deep-structured CRF achieves superior labeling accuracy on
common tagging tasks. Using the kernel method to improve the performance of
sequential pattern classifiers is also an important direction. Kubo et al. describe a novel
sequential pattern classifier based on kernel methods.
Unlike conventional approaches, they use kernel methods to estimate the emission
probability of HMM, with the extra benefit due to the powerful nonlinear classification
capability of kernel methods. On the other hand, unlike conventional CRF/HMM-based
methods, Bellegarda attacks this problem from a novel angle based on latent semantic
mapping and obtains insightful results.
CHAPTER 2
GETTING STARTED WITH R PROGRAMMING
2.1 Introduction to the R-Studio
R is a free, open-source software environment and programming language developed in
1995 at the University of Auckland for statistical computing and graphics (Ihaka and
Gentleman, 1996). Since then R has become one of the dominant software environments
for data analysis and is used by a variety of scientific disciplines, including soil science,
ecology, and geoinformatics (Environmetrics CRAN Task View; Spatial CRAN Task
View). R is particularly popular for its graphical capabilities, but it is also prized for its
GIS capabilities, which make it relatively easy to generate raster-based models. More
recently, R has also gained several packages which are designed specifically for
analyzing soil data.
2.2 User Interface:
R is a dialect of the S language. It is a case-sensitive, interpreted language. You
can enter commands one at a time at the command prompt (>) or run a set of commands
from a source file. There is a wide variety of data types, including vectors (numerical,
character, logical), matrices, data frames, and lists. Most functionality is provided
through built-in and user-created functions and all data objects are kept in memory during
an interactive session. Basic functions are available by default. Other functions are
contained in packages that can be attached to a current session as needed.
This section describes working with the R interface. A key skill to using R
effectively is learning how to use the built-in help system. Other sections describe the
working environment, inputting programs and outputting results, installing new
functionality through packages, GUIs that have been developed for R, customizing the
environment, producing high quality output, and running programs in batch. A
fundamental design feature of R is that the output from most functions can be used as
input to other functions. This is described in reusing results.
2.3 Basic Commands:
 Input and Display:
read.table(filename, header=TRUE)            # read a tab- or space-delimited file with labels in the first row
read.table(filename, header=TRUE, sep=',')   # read csv files
x <- c(1,2,4,8,16)          # create a data vector with specified elements
y <- c(1:10)                # create a data vector with elements 1-10
n <- 10
x1 <- rnorm(n)              # create an n-item vector of random normal deviates
y1 <- runif(n) + n          # create another n-item vector with n added to each random uniform draw
z <- rbinom(n, size, prob)  # create n binomial samples of size "size" with success probability prob
vect <- c(x, y)             # combine them into one vector of length 2n
mat <- cbind(x, y)          # combine them into an n x 2 matrix
mat[4,2]                    # display the element in the 4th row and 2nd column
mat[3,]                     # display the 3rd row
mat[,2]                     # display the 2nd column
subset(dataset, logical)                    # those rows meeting a logical criterion
subset(data.df, logical, select=variables)  # rows of a data frame that meet a criterion, keeping selected variables
data.df[logical, ]          # yet another way to take a subset
x[order(x$B), ]             # sort a data frame by the order of the elements in B
x[rev(order(x$B)), ]        # sort the data frame in reverse order
 Moving around
ls()                      # list the variables in the workspace
rm(x)                     # remove x from the workspace
rm(list = ls())           # remove all the variables from the workspace
attach(mat)               # make the names of the variables in the matrix or data frame available in the workspace
detach(mat)               # release the names (remember to do this each time you attach something)
with(mat, ...)            # a preferred alternative to attach ... detach
new <- old[,-n]           # drop the nth column
new <- old[-n,]           # drop the nth row
new <- old[,-c(i,j)]      # drop the ith and jth columns
new <- subset(old, logical)   # select those cases that meet the logical condition
complete <- subset(data.df, complete.cases(data.df))   # find those cases with no missing values
new <- old[n1:n2, n3:n4]  # select rows n1 through n2 and columns n3 through n4
 Distributions
beta(a, b)                # beta function
gamma(x)                  # gamma function
choose(n, k)              # binomial coefficients
factorial(x)              # factorial
dnorm(x, mean=0, sd=1, log=FALSE)                     # normal density
pnorm(q, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)  # normal cumulative distribution function
qnorm(p, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)  # normal quantile function
rnorm(n, mean=0, sd=1)                                # normal random deviates
dunif(x, min=0, max=1, log=FALSE)                     # uniform density
punif(q, min=0, max=1, lower.tail=TRUE, log.p=FALSE)  # uniform cumulative distribution function
qunif(p, min=0, max=1, lower.tail=TRUE, log.p=FALSE)  # uniform quantile function
runif(n, min=0, max=1)                                # uniform random deviates
 Data manipulation
replace(x, list, values)  # remember to assign the result to some object,
                          # e.g. x <- replace(x, x == -9, NA),
                          # similar to the operation x[x == -9] <- NA
scrub(x, where, min, max, isvalue, newvalue)   # a convenient way to change particular values (in the psych package)
cut(x, breaks, labels=NULL, include.lowest=FALSE, right=TRUE, dig.lab=3, ...)   # cut a numeric vector into intervals
x.df <- data.frame(x1, x2, x3)   # combine different kinds of data into a data frame
as.data.frame(x)          # coerce an object to a data frame
is.data.frame(x)          # test whether x is a data frame
x <- as.matrix(x.df)      # coerce to a matrix
scale(x)                  # convert a data frame to standardized scores
round(x, n)               # round the values of x to n decimal places
ceiling(x)                # smallest integers not less than x
floor(x)                  # largest integers not greater than x
as.integer(x)             # truncate real x to integer (compare to round(x, 0))
as.integer(x < cutpoint)  # vector of 0 where x is less than cutpoint, 1 where greater
factor(ifelse(a < cutpoint, "Neg", "Pos"))   # another way to dichotomize and make a factor for analysis
transform(data.df, variable = some_operation)   # can be part of setting up a data set
x %in% y                  # test each element of x for membership in y
y %in% x                  # test each element of y for membership in x
all(x %in% y)             # TRUE if every element of x appears in y (x is a subset of y)
all(x)                    # for a vector of logical values, are they all TRUE?
any(x)                    # for a vector of logical values, is at least one TRUE?
2.4 Data Structures in R:
R programming supports several basic data structures, namely vector, matrix, list,
data frame, and factor; character strings are covered as well. This chapter will discuss
these data structures and the way to write them in R programming.
1. Vector – This data structure contains elements of a single type, i.e., integer, double,
logical, complex, etc. A vector can be created in R with the c() function or the : operator.
For example,
> x <- 1:7; x
[1] 1 2 3 4 5 6 7
> y <- 2:-2; y
[1] 2 1 0 -1 -2
2. Matrix – A matrix is a two-dimensional data structure and can be created using the
matrix() function. The numbers of rows and columns can be set with the nrow and
ncol arguments. Providing both is not required, as the other dimension is inferred
from the length of the data.
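For example (a minimal sketch), a 2 x 3 matrix can be created by supplying the data and nrow alone; the matrix is filled column-wise by default:
> m <- matrix(1:6, nrow = 2)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6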
3. List – This data structure can include data of different types. It is similar to a vector,
but where a vector contains elements of one type, a list can contain mixed data. A list
is created using list().
For example,
> x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
> str(x)
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3
4. Dataframe – This data structure is a special case of a list where each component has
the same length. A data frame is created using the data.frame() function.
For example,
> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
> str(x) # structure of x
'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1
5. Factor – Factors are used to store categorical data with predefined levels. A factor
can be created using the factor() function.
For example,
> x <- factor(c("single", "married", "married", "single"))
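Printing the factor then shows both the data and its levels:
> x
[1] single  married married single
Levels: married single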
6. String – Any value written inside single or double quotes is referred to as a string.
For example,
x <- "This is a valid 'string'"
print(x)
y <- 'this is still valid as the " double quote is used inside single quotes'
print(y)
Output:
[1] "This is a valid 'string'"
[1] "this is still valid as the \" double quote is used inside single quotes"
2.5 Graphics:
The plot() function is the primary way to plot data in R. For instance, plot(x, y)
produces a scatterplot of the numbers in x versus the numbers in y. There are many
additional options that can be passed to the plot() function. For example, passing in the
argument xlab will result in a label on the x-axis. To find out more information about the
plot() function, type ?plot.
> x = rnorm(100)
> y = rnorm(100)
> plot(x, y)
> plot(x, y, xlab="this is the x-axis", ylab="this is the y-axis", main="Plot of X vs Y")
We will often want to save the output of an R plot. The command that we use to do this
will depend on the file type that we would like to create. For instance, to create a pdf, we
use the pdf() function, and to create a jpeg, we use the jpeg() function.
> pdf("Figure.pdf")
> plot(x, y, col="green")
> dev.off()
null device
The function dev.off() indicates to R that we are done creating the plot.
Alternatively, we can simply copy the plot window and paste it into an appropriate file
type, such as a Word document. The function seq() can be used to create a sequence of
numbers. For instance, seq(a,b) makes a vector of integers between a and b. There
are many other options: for instance, seq(0,1,length=10) makes a sequence of 10 numbers
that are equally spaced between 0 and 1. Typing 3:11 is a shorthand for seq(3,11) for
integer arguments.
> x = seq(1, 10)
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x = 1:10
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x = seq(-pi, pi, length = 50)
We will now create some more sophisticated plots. The contour() function produces a
contour plot in order to represent three-dimensional data; it is like a topographical map.
It takes three arguments:
1. A vector of the x values (the first dimension),
2. A vector of the y values (the second dimension), and
3. A matrix whose elements correspond to the z value (the third dimension) for each pair
of (x,y) coordinates.
As with the plot() function, there are many other inputs that can be used to fine-tune
the output of the contour() function. To learn more about these, take a look at the help
file by typing ?contour.
> y = x
> f = outer(x, y, function(x, y) cos(y) / (1 + x^2))
> contour(x, y, f)
> contour(x, y, f, nlevels = 45, add = TRUE)
> fa = (f - t(f)) / 2
> contour(x, y, fa, nlevels = 15)
The image() function works the same way as contour(), except that it produces a
color-coded plot whose colors depend on the z value. This is known as a heatmap, and is
sometimes used to plot temperature in weather forecasts. Alternatively, persp() can be
used to produce a three-dimensional plot. The arguments theta and phi control the
angles at which the plot is viewed.
> image(x, y, fa)
> persp(x, y, fa)
> persp(x, y, fa, theta = 30)
> persp(x, y, fa, theta = 30, phi = 20)
> persp(x, y, fa, theta = 30, phi = 70)
> persp(x, y, fa, theta = 30, phi = 40)
2.6 Reading data into R:
Usually we will be using data already in a file that we need to read into R in order
to work on it. R can read data from a variety of file formats—for example, files created as
text, or in Excel, SPSS or Stata. We will mainly be reading files in text format .txt or .csv
(comma-separated, usually created in Excel).
To read an entire data frame directly, the external file will normally have a special form:
 The first line of the file should have a name for each variable in the data frame.
 Each additional line of the file has as its first item a row label and the values for
each variable.
Here we use the example dataset called airquality.csv and airquality.txt
Input file form with names and row labels:

   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
...
By default numeric items (except row labels) are read as numeric variables. This can be
changed if necessary.
The function read.table() can then be used to read the data frame directly:
> airqual <- read.table("C:/Desktop/airquality.txt")
Similarly, to read .csv files the read.csv() function can be used to read in the data
frame directly
[Note: I have noticed that occasionally you'll need to do a double slash in your path //.
This seems to depend on the machine.]
> airqual <- read.csv("C:/Desktop/airquality.csv")
In addition, you can read in files using the file.choose() function in R. After typing in
this command in R, you can manually select the directory and file where your dataset is
located.
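For instance:
> airqual <- read.csv(file.choose())   # opens a file dialog; pick airquality.csv manually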
CHAPTER 3
LINEAR REGRESSION MODELS
3.1 Linear Regression:
This chapter is about linear regression, a very simple approach for supervised
learning. In particular, linear regression is a useful tool for predicting a quantitative
response. Linear regression has been around for a long time and is the topic of
innumerable textbooks. Though it may seem somewhat dull compared to some of the
more modern statistical learning approaches described in later chapters of this book,
linear regression is still a useful and widely used statistical learning method. Moreover, it
serves as a good jumping-off point for newer approaches: as we will see in later chapters,
many fancy statistical learning approaches can be seen as generalizations or extensions of
linear regression. Consequently, the importance of having a good understanding of linear
regression before studying more complex learning methods cannot be overstated. In this
chapter, we review some of the key ideas underlying the linear regression model, as well
as the least squares approach that is most commonly used to fit this model. Recall the
Advertising data, which records sales (in thousands of units) for a particular product as a
function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper
media. Suppose that in our role as statistical consultants we are asked to suggest, on the
basis of these data, a marketing plan for next year that will result in high product sales.
What information would be useful in order to provide such a recommendation? A few
important questions we might seek to address include: Is there a relationship between
advertising budget and sales? How strong is that relationship? And which media
contribute to sales?
Simple Linear Regression
Simple linear regression lives up to its name: it is a very straightforward approach for
predicting a quantitative response Y on the basis of a single predictor variable X. It
assumes that there is approximately a linear relationship between X and Y. Mathematically,
we can write this linear relationship as

Y ≈ β0 + β1X.

You might read "≈" as "is approximately modeled as". We will sometimes describe this
by saying that we are regressing Y on X (or Y onto X). For example, X may
represent TV advertising and Y may represent sales. Then we can regress sales onto TV
by fitting the model sales ≈ β0 + β1 × TV.
In this equation, β0 and β1 are two unknown constants that represent the intercept and
slope terms in the linear model. Together, β0 and β1 are known as the model coefficients
or parameters. Once we have used our training data to produce estimates ˆβ0 and ˆβ1 for
the model coefficients, we can predict future sales on the basis of a particular value of
TV advertising by computing

ˆy = ˆβ0 + ˆβ1x,

where ˆy indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ,
to denote the estimated value for an unknown parameter or coefficient, or to denote the
predicted value of the response.
Estimating the Coefficients
In practice, β0 and β1 are unknown. So before we can use the model to make predictions,
we must use data to estimate the coefficients. Let (x1, y1), (x2, y2), . . . , (xn, yn) represent
n observation pairs, each of which consists of a measurement of X and a measurement of
Y. In the Advertising example, this data set consists of the TV advertising budget and
product sales in n = 200 different markets. Our goal is to obtain coefficient estimates
ˆβ0 and ˆβ1 such that the linear model fits the available data well; that is, so that
yi ≈ ˆβ0 + ˆβ1xi for i = 1, . . . , n. In other words, we want to find an intercept ˆβ0 and a
slope ˆβ1 such that the resulting line is as close as possible to the n = 200 data points.
There are a number of ways of measuring closeness. However, by far the most common
approach involves minimizing the least squares criterion, and we take that approach in
this chapter.
For the Advertising data, the least squares fit for the regression of sales onto TV is
found by minimizing the sum of squared errors. In the corresponding plot, each grey line
segment represents an error, and the fit makes a compromise by averaging their squares.
In this case a linear fit captures the essence of the relationship, although it is somewhat
deficient in the left of the plot. Let ˆyi = ˆβ0 + ˆβ1xi be the prediction for Y based on the
ith value of X. Then ei = yi − ˆyi represents the ith residual; this is the difference between
the ith observed response value and the ith response value that is predicted by our linear
model. We define the residual sum of squares (RSS) as

RSS = e1² + e2² + · · · + en²,

or equivalently as

RSS = (y1 − ˆβ0 − ˆβ1x1)² + (y2 − ˆβ0 − ˆβ1x2)² + · · · + (yn − ˆβ0 − ˆβ1xn)².

The least squares approach chooses ˆβ0 and ˆβ1 to minimize the RSS.
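As a minimal sketch of least squares fitting in R (assuming the Advertising data is available as a CSV file with columns TV and sales; the file name here is an assumption):

ads <- read.csv("Advertising.csv")            # hypothetical file with TV and sales columns
fit <- lm(sales ~ TV, data = ads)             # least squares fit of sales onto TV
coef(fit)                                     # estimated intercept and slope
summary(fit)                                  # coefficients, standard errors, R-squared
predict(fit, newdata = data.frame(TV = 100))  # predicted sales for a TV budget of 100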
CHAPTER 4
CLASSIFICATION
The linear regression model discussed in Chapter 3 assumes that the response
variable Y is quantitative. But in many situations, the response variable is instead
qualitative. For example, eye color is qualitative, taking on values blue, brown, or green.
Often qualitative variables are referred to as categorical; we will use these terms
interchangeably.
In this chapter, we study approaches for predicting qualitative responses, a process
that is known as classification. Predicting a qualitative response for an observation can be
referred to as classifying that observation, since it involves assigning the observation to a
category, or class. On the other hand, often the methods used for classification first
predict the probability of each of the categories of a qualitative variable, as the basis for
making the classification. In this sense they also behave like regression methods. There
are many possible classification techniques, or classifiers, that one might use to predict a
qualitative response.
We touched on some of these in Sections 2.1.5 and 2.2.3. In this chapter we discuss
three of the most widely-used classifiers: logistic regression, linear discriminant analysis,
and K-nearest neighbors.
4.1 An Overview of Classification:
Classification problems occur often, perhaps even more so than regression
problems. Some examples include:
1. A person arrives at the emergency room with a set of symptoms that could possibly be
attributed to one of three medical conditions. Which of the three conditions does the
individual have?
2. An online banking service must be able to determine whether or not a transaction being
performed on the site is fraudulent, on the basis of the user’s IP address, past transaction
history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a given
disease, a biologist would like to figure out which DNA mutations are deleterious
(disease-causing) and which are not.
Just as in the regression setting, in the classification setting we have a set of training
observations (x1, y1), . . . , (xn, yn) that we can use to build a classifier. We want our
classifier to perform well not only on the training data, but also on test observations that
were not used to train the classifier. In this chapter, we will illustrate the concept of
classification using the simulated Default data set. We are interested in predicting
whether an individual will default on his or her credit card payment, on the basis of
annual income and monthly credit card balance. The data set is displayed in Figure 4.1.
We have plotted annual income and monthly credit card balance for a subset of 10,000
individuals.
The left-hand panel of Figure 4.1 displays individuals who defaulted in a given month in
orange, and those who did not in blue. (The overall default rate is about 3%, so we have
plotted only a fraction of the individuals who did not default.) It appears that individuals
who defaulted tended to have higher credit card balances than those who did not. In the
right-hand panel of Figure 4.1, two pairs of boxplots are shown. The first shows the
distribution of balance split by the binary default variable; the second is a similar plot for
income. In this chapter, we learn how to build a model to predict default (Y ) for any
given value of balance (X1) and income (X2). Since Y is not quantitative, the simple
linear regression model of Chapter 3 is not appropriate.
It is worth noting that Figure 4.1 displays a very pronounced relationship between the
predictor balance and the response default. In most real applications, the relationship
between the predictor and the response will not be nearly so strong. However, for the
sake of illustrating the classification procedures discussed in this chapter, we use an
example in which the relationship between the predictor and the response is somewhat
exaggerated.
FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card
balances of a number of individuals. The individuals who defaulted on their credit card
payments are shown in orange, and those who did not are shown in blue. Center:
Boxplots of balance as a function of default status. Right: Boxplots of income as a
function of default status.
4.2 Why Not Linear Regression?
We have stated that linear regression is not appropriate in the case of a qualitative
response. Why not?
Suppose that we are trying to predict the medical condition of a patient in the emergency
room on the basis of her symptoms. In this simplified example, there are three possible
diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding
these values as a quantitative response variable, Y , as follows:
Y ={ 1 if stroke;
2 if drug overdose;
3 if epileptic seizure.}
Using this coding, least squares could be used to fit a linear regression model to predict Y
on the basis of a set of predictors X1, . . .,Xp. Unfortunately, this coding implies an
ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure,
and insisting that the difference between stroke and drug overdose is the same as the
difference between drug overdose and epileptic seizure. In practice there is no particular
reason that this needs to be the case. For instance, one could choose an equally
reasonable coding,
Y ={1 if epileptic seizure;
2 if stroke;
3 if drug overdose.}
which would imply a totally different relationship among the three conditions. Each of
these codings would produce fundamentally different linear models that would ultimately
lead to different sets of predictions on test observations. If the response variable’s values
did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap
between mild and moderate was similar to the gap between moderate and severe, then a
1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to
convert a qualitative response variable with more than two levels into a quantitative
response that is ready for linear regression. For a binary (two level) qualitative response,
the situation is better. For instance, perhaps there are only two possibilities for the
patient’s medical condition: stroke and drug overdose. We could then potentially use the
dummy variable approach from Section 3.3.1 to code the response as follows:
Y ={0 if stroke;
1 if drug overdose.}
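As a quick illustration (the condition vector below is hypothetical), such a dummy variable is easy to construct in R:
> cond <- c("stroke", "drug overdose", "drug overdose", "stroke")   # hypothetical observations
> y <- ifelse(cond == "stroke", 0, 1)   # 0 if stroke, 1 if drug overdose
> y
[1] 0 1 1 0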
4.3 Logistic Regression:
FIGURE 4.2. Classification using the Default data. Left: Estimated probability of default
using linear regression. Some estimated probabilities are negative! The orange ticks
indicate the 0/1 values coded for default (No or Yes). Right: Predicted probabilities of
default using logistic regression. All probabilities lie between 0 and 1.
For the Default data, logistic regression models the probability of default. For example,
the probability of default given balance can be written as
Pr(default = Yes|balance).
The values of Pr(default = Yes|balance), which we abbreviate p(balance), will range
between 0 and 1. Then for any given value of balance, a prediction can be made for
default. For example, one might predict default = Yes for any individual for whom
p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting
individuals who are at risk for default, then they may choose to use a lower threshold,
such as p(balance) > 0.1.
4.3.1 The Logistic Model
How should we model the relationship between p(X) = Pr(Y = 1|X) and X? (For
convenience we are using the generic 0/1 coding for the response). In Section 4.2 we
talked of using a linear regression model to represent these probabilities:
p(X) = β0 + β1X.    (4.1)
If we use this approach to predict default=Yes using balance, then we obtain the model
shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach:
for balances close to zero we predict a negative probability of default; if we were to
predict for very large balances, we would get values bigger than 1. These predictions are
not sensible, since of course the true probability of default, regardless of credit card
balance, must fall between 0 and 1. This problem is not unique to the credit default data.
Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we
can always predict p(X) < 0 for some values of X and p(X) > 1 for others (unless the
range of X is limited).
To avoid this problem, we must model p(X) using a function that gives outputs between 0
and 1 for all values of X. Many functions meet this description. In logistic regression, we
use the logistic function,

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)).    (4.2)
To fit the model (4.2), we use a method called maximum likelihood, which we discuss in
the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic
regression model to the Default data. Notice that for low balances we now predict the
probability of default as close to, but never below, zero. Likewise, for high balances we
predict a default probability close to, but never above, one. The logistic function will
always produce an S-shaped curve of this form, and so regardless of the value of X, we
will obtain a sensible prediction. We also see that the logistic model is better able to
capture the range of probabilities than is the linear regression model in the left-hand plot.
The average fitted probability in both cases is 0.0333 (averaged over the training data),
which is the same as the overall proportion of defaulters in the data set.
4.3.2 Estimating the Regression Coefficients
The coefficients β0 and β1 in (4.2) are unknown, and must be estimated based on the
available training data. In Chapter 3, we used the least squares approach to estimate the
unknown linear regression coefficients. Although we could use (non-linear) least squares
to fit the model (4.2), the more general method of maximum likelihood is preferred, since
it has better statistical properties. The basic intuition behind using maximum likelihood to
fit a logistic regression model is as follows: we seek estimates for β0 and β1 such that the
predicted probability ˆp(xi) of default for each individual, using (4.2), corresponds as
closely as possible to the individual's observed default status. In other words, we try to
find ˆβ0 and ˆβ1 such that plugging these estimates into the model for p(X), given in
(4.2), yields a number close to one for all individuals who defaulted, and a number close
to zero for all individuals who did not. This intuition can be formalized using a
mathematical equation called a likelihood function:

ℓ(β0, β1) = ∏_{i: yi = 1} p(xi) × ∏_{i′: yi′ = 0} (1 − p(xi′)).
The estimates ˆ β0 and ˆβ1 are chosen to maximize this likelihood function. Maximum
likelihood is a very general approach that is used to fit many of the non-linear models that
we examine throughout this book. In the linear regression setting, the least squares
approach is in fact a special case of maximum likelihood. The mathematical details of
maximum likelihood are beyond the scope of this book. However, in general, logistic
regression and other models can be easily fit using a statistical software package such as
R, and so we do not need to concern ourselves with the details of the maximum
likelihood fitting procedure.
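In R, the glm() function fits logistic regression by maximum likelihood. A minimal sketch, assuming the Default data from the ISLR package:

library(ISLR)                               # provides the Default data set
fit <- glm(default ~ balance, data = Default, family = binomial)
summary(fit)                                # ˆβ0, ˆβ1 and their standard errors
probs <- predict(fit, type = "response")    # fitted probabilities p(balance)
pred <- ifelse(probs > 0.5, "Yes", "No")    # classify using a 0.5 threshold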
4.4 Linear Discriminant Analysis
Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic
function, given by (4.7) for the case of two response classes. In statistical jargon, we
model the conditional distribution of the response Y , given the predictor(s) X. We now
consider an alternative and less direct approach to estimating these probabilities. In this
alternative approach, we model the distribution of the predictors X separately in each of
the response classes (i.e. given Y ), and then use Bayes’ theorem to flip these around into
estimates for Pr(Y = k|X = x). When these distributions are assumed to be normal, it turns
out that the model is very similar in form to logistic regression. Why do we need another
method, when we have logistic regression? There are several reasons:
 When the classes are well-separated, the parameter estimates for the logistic
regression model are surprisingly unstable. Linear discriminant analysis does not
suffer from this problem.
 If n is small and the distribution of the predictors X is approximately normal in
each of the classes, the linear discriminant model is again more stable than the
logistic regression model.
 As mentioned in Section 4.3.5, linear discriminant analysis is popular when we
have more than two response classes.
4.4.1 Using Bayes’ Theorem for Classification
Suppose that we wish to classify an observation into one of K classes, where K ≥ 2. In
other words, the qualitative response variable Y can take on K possible distinct and
unordered values. Let πk represent the overall or prior probability that a randomly chosen
observation comes from the kth class; this is the probability that a given observation is
associated with the kth category of the response variable Y. Let fk(X) ≡ Pr(X = x|Y = k)
denote the density function of X for an observation that comes from the kth class. In
other words, fk(x) is relatively large if there is a high probability that an observation in
the kth class has X ≈ x, and fk(x) is small if it is very unlikely that an observation in the
kth class has X ≈ x. Then Bayes' theorem states that

Pr(Y = k|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x).    (4.10)

In accordance with our earlier notation, we will use the abbreviation pk(X) =
Pr(Y = k|X). This suggests that instead of directly computing pk(X) as in Section 4.3.1,
we can simply plug in estimates of πk and fk(X) into (4.10). In general, estimating πk is
easy if we have a random sample of Y s from the population: we simply compute the
fraction of the training observations that belong to the kth class.
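A minimal sketch of linear discriminant analysis in R, assuming the lda() function from the MASS package and the Default data from ISLR:

library(MASS)                          # provides lda()
library(ISLR)                          # provides the Default data set
fit <- lda(default ~ balance + income, data = Default)
fit$prior                              # estimated prior probabilities πk
pred <- predict(fit, Default)
table(pred$class, Default$default)     # confusion matrix on the training data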
CHAPTER 5
TREE BASED METHODS
In this chapter, we describe tree-based methods for regression and classification.
These involve stratifying or segmenting the predictor space into a number of simple
regions. In order to make a prediction for a given observation, we typically use the mean
or the mode of the training observations in the region to which it belongs. Since the set of
splitting rules used to segment the predictor space can be summarized in a tree, these
types of approaches are known as decision tree methods.
Tree-based methods are simple and useful for interpretation. However, they typically are
not competitive with the best supervised learning approaches, such as those seen in
Chapters 6 and 7, in terms of prediction accuracy. Hence in this chapter we also introduce
bagging, random forests, and boosting. Each of these approaches involves producing
multiple trees which are then combined to yield a single consensus prediction. We will
see that combining a large number of trees can often result in dramatic improvements in
prediction accuracy, at the expense of some loss in interpretation.
5.1 The Basics of Decision Trees:
Decision trees can be applied to both regression and classification problems. We first
consider regression problems, and then move on to classification. For the Hitters data,
consider a regression tree for predicting the log salary of a baseball player, based on the
number of years that he has played in the major leagues and the number of hits that he
made in the previous year. At a given internal node, the label (of the form Xj < tk)
indicates the left-hand branch emanating from that split, and the right-hand branch
corresponds to Xj ≥ tk. For instance, the split at the top of the tree results in two large
branches. The left-hand branch corresponds to Years < 4.5, and the right-hand branch
corresponds to Years ≥ 4.5.
The tree has two internal nodes and three terminal nodes, or leaves. The number in each
leaf is the mean of the response for the observations that fall there.
5.1.1 Regression Trees:
In order to motivate regression trees, we begin with a simple example: predicting
baseball players' salaries using regression trees. We use the Hitters data set to predict a
baseball player's Salary based on Years (the number of years that he has played in the
major leagues) and Hits (the number of hits that he made in the previous year). We first
remove observations that are missing Salary values, and log-transform Salary so that its
distribution has more of a typical bell shape. (Recall that Salary is measured in
thousands of dollars.) The fitted regression tree consists of a series of splitting rules,
starting at the top of the tree. The top split assigns observations having Years < 4.5 to
the left branch.
Algorithm 5.1 Building a Regression Tree.
1. Use recursive binary splitting to grow a large tree on the training data, stopping only
when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best
subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into
K folds. For each k =1,...,K:
(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
(b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a
function of α. Average the results for each value of α, and pick α to minimize the average
error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
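A minimal sketch of Algorithm 5.1 in R, assuming the Hitters data from the ISLR package and the tree package (rpart is a common alternative):

library(ISLR)                          # Hitters data
library(tree)                          # tree(), cv.tree(), prune.tree()
Hitters <- na.omit(Hitters)            # remove observations missing Salary
fit <- tree(log(Salary) ~ Years + Hits, data = Hitters)   # grow a large tree
cv  <- cv.tree(fit)                    # cross-validation over the pruning parameter
best <- cv$size[which.min(cv$dev)]     # tree size with the smallest CV error
pruned <- prune.tree(fit, best = best) # cost-complexity pruning to that size
plot(pruned); text(pruned)             # display the pruned tree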
5.1.2 Advantages and Disadvantages of Trees:
Decision trees for regression and classification have a number of advantages over the
more classical approaches seen in Chapters 3 and 4:
▲ Trees are very easy to explain to people. In fact, they are even easier to explain than
linear regression!
▲ Some people believe that decision trees more closely mirror human decision-making
than do the regression and classification approaches seen in previous chapters.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert
(especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy
variables.
5.2 Bagging, Random Forests, Boosting:
Bagging, random forests, and boosting use trees as building blocks to construct
more powerful prediction models.
5.2.1 Bagging:
The bootstrap is an extremely powerful idea. It is used
in many situations in which it is hard or even impossible to directly compute the standard
deviation of a quantity of interest. We see here that the bootstrap can be used in a
completely different context, in order to improve statistical learning methods such as
decision trees. The decision trees discussed in Section 5.1 suffer from high variance. This
means that if we split the training data into two parts at random, and fit a decision tree to
both halves, the results that we get could be quite different. In contrast, a procedure with
low variance will yield similar results if applied repeatedly to distinct data sets; linear
regression tends to have low variance, if the ratio of n to p is moderately large. Bootstrap
aggregation, or bagging, is a general-purpose procedure for reducing the variance of a
statistical learning method.
It turns out that there is a very straightforward way to estimate the test error of a
bagged model, without the need to perform cross-validation or the validation set
approach. Recall that the key to bagging is that trees are repeatedly fit to bootstrapped
subsets of the observations.
One can show that on average, each bagged tree makes use of around two-thirds
of the observations. The remaining one-third of the observations not used to fit a given
bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the
response for the ith observation using each of the trees in which that observation was
OOB. This will yield around B/3 predictions for the ith observation. In order to obtain a
single prediction for the ith observation, we can average these predicted responses (if
regression is the goal) or can take a majority vote (if classification is the goal). This leads
to a single OOB prediction for the ith observation. An OOB prediction can be obtained in
this way for each of the n observations, from which the overall OOB MSE (for a
regression problem) or classification error (for a classification problem) can be computed.
The resulting OOB error is a valid estimate of the test error for the bagged model, since
the response for each observation is predicted using only the trees that were not fit using
that observation. Figure 8.8 displays the OOB error on the Heart data. It can be shown
that with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-
validation error.
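A minimal sketch of bagging with its OOB error estimate in R, assuming the randomForest package (bagging is a random forest with mtry set to the number of predictors):

library(randomForest)                      # implements bagging and random forests
library(ISLR)
Hitters <- na.omit(Hitters)                # drop rows with missing Salary
bag <- randomForest(log(Salary) ~ Years + Hits, data = Hitters,
                    mtry = 2, ntree = 500) # mtry = 2 uses both predictors: bagging
bag                                        # prints the OOB estimate of the MSE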
CHAPTER 6
CONCLUSION
 To get familiar with statistical learning / machine learning, which has become a very
hot field with the explosion of “Big Data” problems.
 To learn statistical learning and modeling skills, which are in high demand, and to
cover basic concepts of statistical learning / modeling methods that have
widespread use in business and scientific research.
 To get hands-on experience with the applications and the underlying statistical /
mathematical concepts that are relevant to modeling techniques. The report is
designed to familiarize students with implementing the statistical learning methods
using the highly popular statistical software package R.
CHAPTER 7
REFERENCES
1) James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert. An Introduction
to Statistical Learning with Applications in R. 6th edition. Springer.
2) Saffran, Jenny R. (2003). "Statistical language learning: mechanisms and
constraints". Current Directions in Psychological Science. 12 (4): 110–114.
doi:10.1111/1467-8721.01243.
3) Brent, Michael R.; Cartwright, Timothy A. (1996). "Distributional regularity and
phonotactic constraints are useful for segmentation". Cognition. 61 (1–2): 93–125.
doi:10.1016/S0010-0277(96)00719-6.
4) Saffran, J. R.; Aslin, R. N.; Newport, E. L. (1996). "Statistical Learning by
8-Month-Old Infants". Science. 274 (5294): 1926–1928.
doi:10.1126/science.274.5294.1926. PMID 8943209.
5) Saffran, Jenny R.; Newport, Elissa L.; Aslin, Richard N. (1996). "Word
Segmentation: The Role of Distributional Cues". Journal of Memory and Language.
35 (4): 606–621. doi:10.1006/jmla.1996.0032.
6) Aslin, R. N.; Saffran, J. R.; Newport, E. L. (1998). "Computation of Conditional
Probability Statistics by 8-Month-Old Infants". Psychological Science. 9 (4):
321–324. doi:10.1111/1467-9280.00063.
7) Saffran, Jenny R. (2001). "Words in a sea of sounds: the output of infant statistical
learning". Cognition. 81 (2): 149–169. doi:10.1016/S0010-0277(01)00132-9.
8) Saffran, Jenny R.; Wilson, Diana P. (2003). "From Syllables to Syntax: Multilevel
Statistical Learning by 12-Month-Old Infants". Infancy. 4 (2): 273–284.
doi:10.1207/S15327078IN0402_07.
9) Mattys, Sven L.; Jusczyk, Peter W.; Luce, Paul A.; Morgan, James L. (1999).
"Phonotactic and Prosodic Effects on Word Segmentation in Infants". Cognitive
Psychology. 38 (4): 465–494.
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answeringAli Kabbadj
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Textkevig
 
IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...
IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...
IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...IRJET Journal
 
LLM Paradigm Adaptations in Recommender Systems.pdf
LLM Paradigm Adaptations in Recommender Systems.pdfLLM Paradigm Adaptations in Recommender Systems.pdf
LLM Paradigm Adaptations in Recommender Systems.pdfNagaBathula1
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...IRJET Journal
 
New Fuzzy Model For quality evaluation of e-Training of CNC Operators
New Fuzzy Model For quality evaluation of e-Training of CNC OperatorsNew Fuzzy Model For quality evaluation of e-Training of CNC Operators
New Fuzzy Model For quality evaluation of e-Training of CNC Operatorsinventionjournals
 
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
IRJET-  	  Modeling Student’s Vocabulary Knowledge with NaturalIRJET-  	  Modeling Student’s Vocabulary Knowledge with Natural
IRJET- Modeling Student’s Vocabulary Knowledge with NaturalIRJET Journal
 
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELijcsit
 
Proceedings of the 2015 Industrial and Systems Engineering Res.docx
Proceedings of the 2015 Industrial and Systems Engineering Res.docxProceedings of the 2015 Industrial and Systems Engineering Res.docx
Proceedings of the 2015 Industrial and Systems Engineering Res.docxwkyra78
 
ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...
ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...
ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...ijseajournal
 
A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...
A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...
A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...Erin Taylor
 

Similar to Audit report[rollno 49] (20)

A hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzerA hybrid composite features based sentence level sentiment analyzer
A hybrid composite features based sentence level sentiment analyzer
 
Automated Essay Scoring Using Generalized Latent Semantic Analysis
Automated Essay Scoring Using Generalized Latent Semantic AnalysisAutomated Essay Scoring Using Generalized Latent Semantic Analysis
Automated Essay Scoring Using Generalized Latent Semantic Analysis
 
228-SE3001_2
228-SE3001_2228-SE3001_2
228-SE3001_2
 
Re2018 Semios for Requirements
Re2018 Semios for RequirementsRe2018 Semios for Requirements
Re2018 Semios for Requirements
 
Graph embedding approach to analyze sentiments on cryptocurrency
Graph embedding approach to analyze sentiments on cryptocurrencyGraph embedding approach to analyze sentiments on cryptocurrency
Graph embedding approach to analyze sentiments on cryptocurrency
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali TextChunker Based Sentiment Analysis and Tense Classification for Nepali Text
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text
 
IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...
IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...
IRJET- Aspect based Sentiment Analysis on Financial Data using Transferred Le...
 
LLM Paradigm Adaptations in Recommender Systems.pdf
LLM Paradigm Adaptations in Recommender Systems.pdfLLM Paradigm Adaptations in Recommender Systems.pdf
LLM Paradigm Adaptations in Recommender Systems.pdf
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
 
New Fuzzy Model For quality evaluation of e-Training of CNC Operators
New Fuzzy Model For quality evaluation of e-Training of CNC OperatorsNew Fuzzy Model For quality evaluation of e-Training of CNC Operators
New Fuzzy Model For quality evaluation of e-Training of CNC Operators
 
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
IRJET-  	  Modeling Student’s Vocabulary Knowledge with NaturalIRJET-  	  Modeling Student’s Vocabulary Knowledge with Natural
IRJET- Modeling Student’s Vocabulary Knowledge with Natural
 
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODELADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
ADABOOST ENSEMBLE WITH SIMPLE GENETIC ALGORITHM FOR STUDENT PREDICTION MODEL
 
Mangai
MangaiMangai
Mangai
 
Mangai
MangaiMangai
Mangai
 
Proceedings of the 2015 Industrial and Systems Engineering Res.docx
Proceedings of the 2015 Industrial and Systems Engineering Res.docxProceedings of the 2015 Industrial and Systems Engineering Res.docx
Proceedings of the 2015 Industrial and Systems Engineering Res.docx
 
ONE HIDDEN LAYER ANFIS MODEL FOR OOS DEVELOPMENT EFFORT ESTIMATION
ONE HIDDEN LAYER ANFIS MODEL FOR OOS DEVELOPMENT EFFORT ESTIMATIONONE HIDDEN LAYER ANFIS MODEL FOR OOS DEVELOPMENT EFFORT ESTIMATION
ONE HIDDEN LAYER ANFIS MODEL FOR OOS DEVELOPMENT EFFORT ESTIMATION
 
ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...
ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...
ENSEMBLE REGRESSION MODELS FOR SOFTWARE DEVELOPMENT EFFORT ESTIMATION: A COMP...
 
A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...
A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...
A Flowchart-Based Multi-Agent System for Assisting Novice Programmers with Pr...
 

Recently uploaded

High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Recently uploaded (20)

High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 

Audit report[rollno 49]

supervise our analysis.
1.2 Issues on Statistical Learning:

The main goal of the work surveyed here is to address the fundamental modeling and learning issues of new emerging approaches, and their empirical applications, in speech and language processing. A second focus is the cross-fertilization of learning approaches between speech and language processing problems. Many problems in the two fields share similarities (despite some conspicuous differences), and techniques in one field can be successfully cross-pollinated into the other. An additional goal is to bring together a diverse but complementary set of contributions on emerging learning methods for speech processing, for language processing, and on unifying approaches to problems cutting across the two fields.

Discriminative learning has become a major theme in most areas of speech and language processing. One recent advance in discriminative learning is the integration of the large-margin idea, a classical training standard in machine learning, into conventional discriminative training criteria for string recognition. A central question is how typical training criteria, such as minimum phone error and maximum mutual information, can be extended to incorporate the margin concept. In this line of work, a new margin-based formalism is proposed for various conventional training criteria. Experimental results show that the new criteria improve performance across a wide variety of string recognition scenarios, including speech recognition, concept tagging, and handwriting recognition. In another paper, Cheng et al. explore online learning and acoustic feature adaptation in large-margin hidden Markov models (HMMs), which lead to a better optimization method for large-margin HMM training.

Moving beyond acoustics, language modeling is one of the essential problems in the speech and language fields. Zhou et al. introduce a novel pseudo-conventional N-gram language model with discriminative training, and also carry out an empirical study of the robustness of discriminatively trained LMs. Experimental results show that cumulative performance improvements can be achieved via this method.

Sequential pattern classification is at the core of many speech and language processing problems. The conditional random field (CRF) is a widely adopted approach to supervised sequential labeling.
However, the computational load and model complexity grow dramatically when taking complex structure into account. Here, Sokolovska et al. address this issue through efficient feature selection, imposing sparsity on the CRF via an L1 regularization. The results show that, without performance degradation, the L1-regularized CRF trains and labels significantly faster, and hence makes it possible to scale systems up to very large dimensional models. Meanwhile, Yu et al. improve the CRF model from another perspective. They propose a multi-layer sequence classification algorithm in which each layer is a CRF, and each higher layer's input consists of both the previous layer's observation sequence and the resulting frame-level marginal probabilities. Compared with the conventional CRF, this deep-structured CRF achieves superior labeling accuracy on common tagging tasks.

Using kernel methods to improve the performance of sequential pattern classifiers is also an important direction. Kubo et al. describe a novel sequential pattern classifier based on kernel methods. Unlike conventional approaches, they use kernel methods to estimate the emission probability of an HMM, gaining the benefit of the powerful nonlinear classification capability of kernel methods. By contrast with conventional CRF/HMM-based methods, Bellegarda attacks this problem from a novel angle based on latent semantic mapping and obtains insightful results.
CHAPTER 2
GETTING STARTED WITH R PROGRAMMING

2.1 Introduction to R-Studio:

R is a free, open-source software environment and programming language developed in 1995 at the University of Auckland for statistical computing and graphics (Ihaka and Gentleman, 1996). Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines, including soil science, ecology, and geoinformatics (Envirometrics CRAN Task View; Spatial CRAN Task View). R is particularly popular for its graphical capabilities, but it is also prized for its GIS capabilities, which make it relatively easy to generate raster-based models. More recently, R has also gained several packages designed specifically for analyzing soil data.

2.2 User-interface:

R is a dialect of the S language. It is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists. Most functionality is provided through built-in and user-created functions, and all data objects are kept in memory during an interactive session. Basic functions are available by default; other functions are contained in packages that can be attached to the current session as needed.

A key skill for using R effectively is learning how to use the built-in help system. Other important skills include managing the working environment, inputting programs and outputting results, installing new functionality through packages, customizing the environment, producing high-quality output, and running programs in batch. A fundamental design feature of R is that the output from most functions can be used as input to other functions.
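As a minimal illustration of this workflow, a first interactive session might look like the sketch below (the package name is just an example; any installed package is attached the same way):

library(MASS)        # attach a contributed package to the current session
?mean                # query the built-in help system about a function
x <- c(2, 4, 6)      # create an object; it stays in memory for the session
ls()                 # list the objects currently in the workspace
round(mean(x), 1)    # output of one function (mean) used as input to another (round)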
2.3 Basic commands:

• Input and Display:

read.table(filename, header = TRUE)             # read files with labels in the first row (tab- or space-delimited)
read.table(filename, header = TRUE, sep = ",")  # read csv files
x <- c(1, 2, 4, 8, 16)      # create a data vector with specified elements
y <- c(1:10)                # create a data vector with elements 1-10
n <- 10
x1 <- c(rnorm(n))           # create an n-item vector of random normal deviates
y1 <- c(runif(n)) + n       # create another n-item vector with n added to each random uniform draw
z <- rbinom(n, size, prob)  # create n binomial samples of size "size" with probability prob
vect <- c(x, y)             # combine them into one vector of length 2n
mat <- cbind(x, y)          # combine them into an n x 2 matrix
mat[4, 2]                   # display the 4th row, 2nd column
mat[3, ]                    # display the 3rd row
mat[, 2]                    # display the 2nd column
subset(dataset, logical)                      # those objects meeting a logical criterion
subset(data.df, select = variables, logical)  # objects from a data frame that meet a criterion
data.df[logical, ]                            # yet another way to get a subset, by row indexing
x[order(x$B), ]             # sort a data frame by the order of the elements in B
x[rev(order(x$B)), ]        # sort the data frame in reverse order

• Moving around:

ls()               # list the variables in the workspace
rm(x)              # remove x from the workspace
rm(list = ls())    # remove all the variables from the workspace
attach(mat)        # make the names of the variables in the matrix or data frame available in the workspace
detach(mat)        # release the names (remember to do this each time you attach something)
with(mat, ...)     # a preferred alternative to the attach ... detach pattern
new <- old[, -n]             # drop the nth column
new <- old[-n, ]             # drop the nth row
new <- old[, -c(i, j)]       # drop the ith and jth columns
new <- subset(old, logical)  # select those cases that meet the logical condition
complete <- subset(data.df, complete.cases(data.df))  # find those cases with no missing values
new <- old[n1:n2, n3:n4]     # select rows n1 through n2 of variables n3 through n4

• Distributions:

beta(a, b)         # beta function
gamma(x)           # gamma function
choose(n, k)       # number of combinations
factorial(x)       # factorial
dnorm(x, mean = 0, sd = 1, log = FALSE)   # normal distribution: density
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)   # cumulative probability
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)   # quantiles
rnorm(n, mean = 0, sd = 1)                # random deviates
dunif(x, min = 0, max = 1, log = FALSE)   # uniform distribution
punif(q, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
qunif(p, min = 0, max = 1, lower.tail = TRUE, log.p = FALSE)
runif(n, min = 0, max = 1)

• Data manipulation:

replace(x, list, values)   # remember to assign this to some object, i.e., x <- replace(x, x == -9, NA)
# the replacement above is similar to the operation x[x == -9] <- NA
scrub(x, where, min, max, isvalue, newvalue)  # a convenient way to change particular values (in the psych package)
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3, ...)  # bin a numeric vector
x.df <- data.frame(x1, x2, x3, ...)  # combine different kinds of data into a data frame
as.data.frame()
is.data.frame()
x <- as.matrix()
scale()              # converts a data frame to standardized scores
round(x, n)          # rounds the values of x to n decimal places
ceiling(x)           # vector of the smallest integers >= x
floor(x)             # vector of the largest integers <= x
as.integer(x)        # truncates real x to integers (compare to round(x, 0))
as.integer(x < cutpoint)  # vector of 0 if less than cutpoint, 1 otherwise
factor(ifelse(a < cutpoint, "Neg", "Pos"))  # another way to dichotomize and make a factor for analysis
transform(data.df, variable = some operation)  # can be part of the setup for a data set
x %in% y             # tests each element of x for membership in y
y %in% x             # tests each element of y for membership in x
all(x %in% y)        # true if x is a proper subset of y
all(x)               # for a vector of logical values, are they all true?
any(x)               # for a vector of logical values, is at least one true?
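A short worked example tying a few of these commands together (the small data frame here is invented purely for illustration):

data.df <- data.frame(A = c(3, 1, 2, -9), B = c(10, 30, 20, 40))
data.df$A <- replace(data.df$A, data.df$A == -9, NA)   # recode -9 as missing
subset(data.df, B > 15)                                # rows meeting a logical criterion
data.df[order(data.df$B), ]                            # sort by the values in B
complete <- subset(data.df, complete.cases(data.df))   # keep only the complete cases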
2.4 Data Structures in R:

R supports five basic data structures, namely vector, matrix, list, data frame, and factor; character strings are covered briefly as well. This section discusses these structures and the way to write them in R.

1. Vector – This data structure contains elements of a single type, i.e., integer, double, logical, complex, etc. To create a vector in R, the c() function is used. For example:

> x <- 1:7; x
[1] 1 2 3 4 5 6 7
> y <- 2:-2; y
[1]  2  1  0 -1 -2

2. Matrix – A matrix is a two-dimensional data structure and can be created using the matrix() function. The numbers of rows and columns can be set with the nrow and ncol arguments. Providing both is not required, as the other dimension is inferred automatically from the length of the data.

3. List – This data structure can hold data of different types. It is similar to a vector, but where a vector contains elements of one type, a list can contain mixed data. A list is created using list(). For example:

> x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
> str(x)
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3

4. Data frame – This data structure is a special case of a list where each component has the same length. A data frame is created using the data.frame() function. For example:

> x <- data.frame("SN" = 1:2, "Age" = c(21, 15), "Name" = c("John", "Dora"))
> str(x)   # structure of x
'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1

5. Factor – Factors are used to store predefined, categorical data. A factor can be created using the factor() function. For example:

> x <- factor(c("single", "married", "married", "single"))

6. String – Any value written inside single or double quotes is a string. For example:

> x <- "This is a valid proper ' string"
> print(x)
[1] "This is a valid proper ' string"
> y <- 'this is still valid as this one" double quote is used inside single quotes'
> print(y)
[1] "this is still valid as this one\" double quote is used inside single quotes"

2.5 Graphics:

The plot() function is the primary way to plot data in R. For instance, plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis. To find out more information about the plot() function, type ?plot.

> x <- rnorm(100)
> y <- rnorm(100)
> plot(x, y)
> plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis", main = "Plot of X vs Y")
We will often want to save the output of an R plot. The command that we use to do this depends on the file type that we would like to create. For instance, to create a pdf, we use the pdf() function, and to create a jpeg, we use the jpeg() function.

> pdf("Figure.pdf")
> plot(x, y, col = "green")
> dev.off()
null device

The function dev.off() indicates to R that we are done creating the plot. Alternatively, we can simply copy the plot window and paste it into an appropriate file type, such as a Word document.

The function seq() can be used to create a sequence of numbers. For instance, seq(a, b) makes a vector of integers between a and b. There are many other options: for instance, seq(0, 1, length = 10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is shorthand for seq(3, 11) for integer arguments.

> x <- seq(1, 10)
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x <- 1:10
> x
[1]  1  2  3  4  5  6  7  8  9 10
> x <- seq(-pi, pi, length = 50)

We will now create some more sophisticated plots. The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:
1. A vector of the x values (the first dimension),
2. A vector of the y values (the second dimension), and
3. A matrix whose elements correspond to the z value (the third dimension) for each pair of (x, y) coordinates.

As with the plot() function, there are many other inputs that can be used to fine-tune the output of the contour() function. To learn more about these, take a look at the help file by typing ?contour.

> y <- x
> f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))
> contour(x, y, f)
> contour(x, y, f, nlevels = 45, add = TRUE)
> fa <- (f - t(f)) / 2
> contour(x, y, fa, nlevels = 15)

The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts. Alternatively, persp() can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.

> image(x, y, fa)
> persp(x, y, fa)
> persp(x, y, fa, theta = 30)
> persp(x, y, fa, theta = 30, phi = 20)
> persp(x, y, fa, theta = 30, phi = 70)
> persp(x, y, fa, theta = 30, phi = 40)

2.6 Reading data into R:

Usually we will be using data already in a file that we need to read into R in order to work on it. R can read data from a variety of file formats—for example, files created as text, or in Excel, SPSS or Stata. We will mainly be reading files in text format (.txt) or .csv (comma-separated, usually created in Excel). To read an entire data frame directly, the external file will normally have a special form:
• The first line of the file should have a name for each variable in the data frame.
• Each additional line of the file has as its first item a row label and then the values for each variable.

Here we use the example dataset called airquality.csv / airquality.txt. Input file form, with names and row labels:

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
...

By default, numeric items (except row labels) are read as numeric variables. This can be changed if necessary. The function read.table() can then be used to read the data frame directly:

> airqual <- read.table("C:/Desktop/airquality.txt")

Similarly, to read .csv files, the read.csv() function can be used to read in the data frame directly. (Note: occasionally you'll need to use a double slash in your path, //; this seems to depend on the machine.)

> airqual <- read.csv("C:/Desktop/airquality.csv")

In addition, you can read in files using the file.choose() function. After typing this command in R, you can manually select the directory and file where your dataset is located.
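Once a file has been read in, it is worth checking that R parsed it as expected; a minimal sketch (the path is a placeholder to adjust for your machine):

airqual <- read.csv("C:/Desktop/airquality.csv")  # hypothetical location of the example file
head(airqual)     # first six rows, to confirm the columns were split correctly
str(airqual)      # variable names and types
summary(airqual)  # basic summaries; NA counts reveal missing values such as those above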
CHAPTER 3
LINEAR REGRESSION MODELS

3.1 Linear Regression:

This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression. Consequently, the importance of having a good understanding of linear regression before studying more complex learning methods cannot be overstated.

In this chapter, we review some of the key ideas underlying the linear regression model, as well as the least squares approach that is most commonly used to fit this model. Recall the Advertising data from Chapter 2, which records sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media. Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation? For instance: Is there a relationship between advertising budget and sales? How strong is that relationship? Which media contribute to sales?

Simple Linear Regression

Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y. Mathematically, we can write this linear relationship as

Y ≈ β0 + β1X.

You might read "≈" as "is approximately modeled as". We will sometimes describe this by saying that we are regressing Y on X (or Y onto X). For example, X may
represent TV advertising and Y may represent sales. Then we can regress sales onto TV by fitting the model

sales ≈ β0 + β1 × TV.

In this equation, β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model. Together, β0 and β1 are known as the model coefficients or parameters. Once we have used our training data to produce estimates β̂0 and β̂1 for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing

ŷ = β̂0 + β̂1x,

where ŷ indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.

Estimating the Coefficients

In practice, β0 and β1 are unknown. So before we can use the model to make predictions, we must use data to estimate the coefficients. Let (x1, y1), (x2, y2), . . . , (xn, yn) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. In the Advertising example, this data set consists of the TV advertising budget and product sales in n = 200 different markets. Our goal is to obtain coefficient estimates β̂0 and β̂1 such that the linear model fits the available data well—that is, so that yi ≈ β̂0 + β̂1xi for i = 1, . . . , n. In other words, we want to find an intercept β̂0 and a slope β̂1 such that the resulting line is as close as possible to the n = 200 data points. There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion, and we take that approach in this chapter.
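As a minimal sketch of this fit in R (the advertising data below is simulated with a known slope, since the real Advertising data set is not bundled with base R):

set.seed(1)
TV <- runif(200, 0, 300)                      # simulated TV budgets for n = 200 markets
sales <- 7 + 0.05 * TV + rnorm(200, sd = 2)   # simulated response with known coefficients
fit <- lm(sales ~ TV)                         # least squares fit of sales onto TV
coef(fit)                                     # the estimates beta0-hat and beta1-hat
plot(TV, sales); abline(fit, col = "red")     # the data with the fitted line overlaid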
For the Advertising data, the least squares fit for the regression of sales onto TV is shown in the accompanying figure. The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.

Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual—this is the difference between the ith observed response value and the ith response value that is predicted by our linear model. We define the residual sum of squares (RSS) as

RSS = e1² + e2² + · · · + en²,

or equivalently as

RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + · · · + (yn − β̂0 − β̂1xn)².
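Continuing the simulated example above, the least squares criterion can be checked directly (a sketch, not part of the original text):

e <- resid(fit)                     # residuals e_i = y_i - yhat_i
RSS <- sum(e^2)                     # residual sum of squares for the fitted line
RSS
# any other line through the data gives a larger RSS, e.g. a slightly perturbed slope:
yhat_alt <- coef(fit)[1] + (coef(fit)[2] + 0.01) * TV
sum((sales - yhat_alt)^2) > RSS     # TRUE: least squares minimizes the RSS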
CHAPTER 4
CLASSIFICATION

The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green. Qualitative variables are often referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, often the methods used for classification first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.

There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response. We touched on some of these in Sections 2.1.5 and 2.2.3. In this chapter we discuss three of the most widely used classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.

4.1 An Overview of Classification:

Classification problems occur often, perhaps even more often than regression problems. Some examples include:

1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?

2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user's IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Just as in the regression setting, in the classification setting we have a set of training observations (x1, y1), . . . , (xn, yn) that we can use to build a classifier. We want our classifier to perform well not only on the training data, but also on test observations that were not used to train the classifier.

In this chapter, we will illustrate the concept of classification using the simulated Default data set. We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance. The data set is displayed in Figure 4.1, which plots annual income and monthly credit card balance for a subset of 10,000 individuals. The left-hand panel of Figure 4.1 displays individuals who defaulted in a given month in orange, and those who did not in blue. (The overall default rate is about 3%, so we have plotted only a fraction of the individuals who did not default.) It appears that individuals who defaulted tended to have higher credit card balances than those who did not. In the right-hand panel of Figure 4.1, two pairs of boxplots are shown. The first shows the distribution of balance split by the binary default variable; the second is a similar plot for income. In this chapter, we learn how to build a model to predict default (Y) for any given value of balance (X1) and income (X2). Since Y is not quantitative, the simple linear regression model of Chapter 3 is not appropriate.

It is worth noting that Figure 4.1 displays a very pronounced relationship between the predictor balance and the response default. In most real applications, the relationship between the predictor and the response will not be nearly so strong. However, for the sake of illustrating the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated.
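The Default data ships with the ISLR package, so plots like those in Figure 4.1 can be reproduced along these lines (a sketch, assuming ISLR is installed):

library(ISLR)                          # provides the simulated Default data set
str(Default)                           # variables: default, student, balance, income
mean(Default$default == "Yes")         # overall default rate, roughly 3%
boxplot(balance ~ default, data = Default,
        xlab = "default", ylab = "balance")   # balance split by default status
boxplot(income ~ default, data = Default,
        xlab = "default", ylab = "income")    # income split by default status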
FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status.

4.2 Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

Y = 1 if stroke;
    2 if drug overdose;
    3 if epileptic seizure.

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors X1, . . . , Xp. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the
difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

Y = 1 if epileptic seizure;
    2 if stroke;
    3 if drug overdose,

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations.

If the response variable's values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.

For a binary (two-level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient's medical condition: stroke and drug overdose. We could then potentially use the dummy variable approach from Section 3.3.1 to code the response as follows:

Y = 0 if stroke;
    1 if drug overdose.
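In R, this kind of coding is usually handled through factors rather than hand-built numeric codes; a small sketch (the condition vector is invented for illustration):

condition <- c("stroke", "drug overdose", "stroke", "drug overdose")
y <- factor(condition)             # R stores the levels, not an arbitrary 1/2/3 ordering
as.integer(y == "drug overdose")   # an explicit 0/1 dummy variable for a binary response
# with more than two unordered levels, no single numeric coding is appropriate,
# which is exactly the problem described above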
4.3 Logistic Regression:

FIGURE 4.2. Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default (No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as

Pr(default = Yes | balance).

The values of Pr(default = Yes | balance), which we abbreviate p(balance), will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk of default, then it may choose to use a lower threshold, such as p(balance) > 0.1.

4.3.1 The Logistic Model

How should we model the relationship between p(X) = Pr(Y = 1|X) and X? (For convenience we are using the generic 0/1 coding for the response.) In Section 4.2 we talked of using a linear regression model to represent these probabilities:

p(X) = β0 + β1X.     (4.1)

If we use this approach to predict default = Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data. Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict p(X) < 0 for some values of X and p(X) > 1 for others (unless the range of X is limited).
To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Many functions meet this description. In logistic regression, we use the logistic function,

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)).     (4.2)

To fit the model (4.2), we use a method called maximum likelihood, which we discuss in the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of X, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot. The average fitted probability in both cases is 0.0333 (averaged over the training data), which is the same as the overall proportion of defaulters in the data set.

4.3.2 Estimating the Regression Coefficients

The coefficients β0 and β1 in (4.2) are unknown, and must be estimated based on the available training data. In Chapter 3, we used the least squares approach to estimate the unknown linear regression coefficients. Although we could use (non-linear) least squares to fit the model (4.2), the more general method of maximum likelihood is preferred, since it has better statistical properties. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for β0 and β1 such that the predicted probability p̂(xi) of default for each individual, using (4.2), corresponds as closely as possible to the individual's observed default status. In other words, we try to find β̂0 and β̂1 such that plugging these estimates into the model for p(X), given in (4.2), yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

ℓ(β0, β1) = ∏(i: yi = 1) p(xi) × ∏(i′: yi′ = 0) (1 − p(xi′)).
As the sketch above suggests, the estimates β̂0 and β̂1 are chosen to maximize this likelihood function. Maximum likelihood is a very general approach that is used to fit many non-linear models. In the linear regression setting, the least squares approach is in fact a special case of maximum likelihood. The mathematical details of maximum likelihood are beyond the scope of this report; in practice, logistic regression and other models can be fit easily using a statistical software package such as R, so we need not concern ourselves with the details of the fitting procedure.

4.4 Linear Discriminant Analysis

Logistic regression involves directly modeling Pr(Y = k | X = x) using the logistic function, given by (4.2) in the case of two response classes. In statistical jargon, we model the conditional distribution of the response Y given the predictor(s) X. We now consider an alternative, less direct approach to estimating these probabilities. In this approach, we model the distribution of the predictors X separately in each of the response classes (i.e. given Y), and then use Bayes' theorem to flip these around into estimates of Pr(Y = k | X = x). When these distributions are assumed to be normal, the resulting model turns out to be very similar in form to logistic regression.

Why do we need another method when we already have logistic regression? There are several reasons:

• When the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
• If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
• Linear discriminant analysis is also popular when we have more than two response classes.

4.4.1 Using Bayes' Theorem for Classification

Suppose that we wish to classify an observation into one of K classes, where K ≥ 2. In other words, the qualitative response variable Y can take on K possible distinct and unordered values. Let πk represent the overall, or prior, probability that a randomly chosen observation comes from the kth class; this is the probability that a given observation is
associated with the kth category of the response variable Y. Let fk(x) ≡ Pr(X = x | Y = k) denote the density function of X for an observation that comes from the kth class. In other words, fk(x) is relatively large if there is a high probability that an observation in the kth class has X ≈ x, and fk(x) is small if it is very unlikely that an observation in the kth class has X ≈ x. Then Bayes' theorem states that

Pr(Y = k | X = x) = πk fk(x) / ( π1 f1(x) + · · · + πK fK(x) ). (4.10)

In accordance with our earlier notation, we will use the abbreviation pk(X) = Pr(Y = k | X). This suggests that instead of directly computing pk(X) as in Section 4.3.1, we can simply plug estimates of πk and fk(x) into (4.10). In general, estimating πk is easy if we have a random sample of Y s from the population: we simply compute the fraction of the training observations that belong to the kth class.
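In R, linear discriminant analysis is available through the lda() function in the MASS package. The following is a minimal sketch, again assuming the Default data from the ISLR package; note that the fitted prior probabilities are exactly the class fractions described above:

library(MASS)                      # provides lda()
library(ISLR)                      # provides the Default data

lda_fit <- lda(default ~ balance, data = Default)
lda_fit$prior                      # estimated priors pi_k = class fractions
# equivalently: table(Default$default) / nrow(Default)

lda_pred <- predict(lda_fit)
head(lda_pred$posterior)           # estimates of Pr(Y = k | X = x), as in (4.10)
head(lda_pred$class)               # predicted class labels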
CHAPTER 5
TREE-BASED METHODS

In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. To make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision tree methods. Tree-based methods are simple and useful for interpretation. However, they are typically not competitive with the best supervised learning approaches in terms of prediction accuracy. Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches produces multiple trees, which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretability.

5.1 The Basics of Decision Trees:

Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification. For the Hitters data, a regression tree can be built for predicting the log salary of a baseball player, based on the number of years that he has played in the major leagues (Years) and the number of hits that he made in the previous year (Hits). At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk. For instance, the split at the top of the tree results in two large branches: the left-hand branch corresponds to Years < 4.5, and the right-hand branch corresponds to Years >= 4.5. The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.
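This tree can be reproduced in R. The following is a minimal sketch, assuming the tree package and the Hitters data from the ISLR package; the pruning step at the end anticipates the cost-complexity procedure formalized in Algorithm 5.1 in the next section:

library(ISLR)                      # Hitters data
library(tree)                      # tree(), cv.tree(), prune.tree()

# Drop rows with missing Salary and log-transform the response
Hitters2 <- na.omit(Hitters)
Hitters2$Salary <- log(Hitters2$Salary)

# Grow the regression tree on Years and Hits
fit <- tree(Salary ~ Years + Hits, data = Hitters2)
plot(fit); text(fit)               # draw the tree with its split labels

# Cost-complexity pruning, with the subtree size chosen by cross-validation
cv_out <- cv.tree(fit)
best_size <- cv_out$size[which.min(cv_out$dev)]
pruned <- prune.tree(fit, best = best_size)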
5.1.1 Regression Trees:

To motivate regression trees, we begin with a simple example: predicting baseball players' salaries using regression trees. We use the Hitters data set to predict a baseball player's Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell shape. (Recall that Salary is measured in thousands of dollars.) The fitted regression tree consists of a series of splitting rules, starting at the top of the tree; the top split assigns observations having Years < 4.5 to the left branch.

Algorithm 5.1 Building a Regression Tree.
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
(b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.

5.1.2 Advantages and Disadvantages of Trees:

Decision trees for regression and classification have a number of advantages over the more classical approaches seen in Chapters 3 and 4:
▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.
▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
▲ Trees can easily handle qualitative predictors without the need to create dummy variables.

5.2 Bagging, Random Forests, Boosting:

Bagging, random forests, and boosting use trees as building blocks to construct more powerful prediction models.

5.2.1 Bagging:

The bootstrap is an extremely powerful idea. It is used in many situations in which it is hard, or even impossible, to directly compute the standard deviation of a quantity of interest. Here we see that the bootstrap can also be used in a completely different context: to improve statistical learning methods such as decision trees. The decision trees discussed in Section 5.1 suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results could be quite different. In contrast, a procedure with low variance will yield similar results when applied repeatedly to distinct data sets; linear regression tends to have low variance if the ratio of n to p is moderately large. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
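In R, bagging can be carried out with the randomForest package: bagging is simply a random forest in which all p predictors are considered at every split, i.e. mtry = p. A minimal sketch, reusing the log-transformed Hitters2 data frame from the sketch in Section 5.1:

library(randomForest)
set.seed(1)                                   # bagging is random, so fix the seed

p <- ncol(Hitters2) - 1                       # number of predictors
bag <- randomForest(Salary ~ ., data = Hitters2,
                    mtry = p,                 # use all p predictors: bagging
                    ntree = 500)              # B = 500 bootstrapped trees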
It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach. Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that, on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in which that observation was OOB. This yields around B/3 predictions for the ith observation. To obtain a single prediction for the ith observation, we can average these predicted responses (if regression is the goal) or take a majority vote (if classification is the goal). This leads to a single OOB prediction for the ith observation. An OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation. It can be shown that with B sufficiently large, the OOB error is virtually equivalent to the leave-one-out cross-validation error.
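The randomForest object from the sketch above already tracks this quantity, so no extra work is needed; calling predict() with no newdata argument returns the OOB predictions:

oob_pred <- predict(bag)                      # no newdata => out-of-bag predictions
mean((oob_pred - Hitters2$Salary)^2)          # overall OOB MSE

bag$mse                                       # OOB MSE after 1, 2, ..., B trees
bag$mse[bag$ntree]                            # final OOB MSE (same quantity as above)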
CHAPTER 6
CONCLUSION

• To get familiar with statistical learning, which, with the explosion of "Big Data" problems, has become a very hot field.
• To learn statistical learning and modeling skills that are in high demand, and to cover basic concepts of statistical learning / modeling methods that have widespread use in business and scientific research.
• To get hands-on experience with the applications and the underlying statistical / mathematical concepts relevant to these modeling techniques. The course is designed to familiarize students with implementing statistical learning methods using the highly popular statistical software package R.
CHAPTER 7
REFERENCES

1) James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert. An Introduction to Statistical Learning with Applications in R. Springer.
2) Saffran, Jenny R. (2003). "Statistical language learning: mechanisms and constraints". Current Directions in Psychological Science. 12 (4): 110–114. doi:10.1111/1467-8721.01243.
3) Brent, Michael R.; Cartwright, Timothy A. (1996). "Distributional regularity and phonotactic constraints are useful for segmentation". Cognition. 61 (1–2): 93–125. doi:10.1016/S0010-0277(96)00719-6.
4) Saffran, J. R.; Aslin, R. N.; Newport, E. L. (1996). "Statistical Learning by 8-Month-Old Infants". Science. 274 (5294): 1926–1928. doi:10.1126/science.274.5294.1926. PMID 8943209.
5) Saffran, Jenny R.; Newport, Elissa L.; Aslin, Richard N. (1996). "Word Segmentation: The Role of Distributional Cues". Journal of Memory and Language. 35 (4): 606–621. doi:10.1006/jmla.1996.0032.
6) Aslin, R. N.; Saffran, J. R.; Newport, E. L. (1998). "Computation of Conditional Probability Statistics by 8-Month-Old Infants". Psychological Science. 9 (4): 321–324. doi:10.1111/1467-9280.00063.
7) Saffran, Jenny R. (2001). "Words in a sea of sounds: the output of infant statistical learning". Cognition. 81 (2): 149–169. doi:10.1016/S0010-0277(01)00132-9.
8) Saffran, Jenny R.; Wilson, Diana P. (2003). "From Syllables to Syntax: Multilevel Statistical Learning by 12-Month-Old Infants". Infancy. 4 (2): 273–284. doi:10.1207/S15327078IN0402_07.
9) Mattys, Sven L.; Jusczyk, Peter W.; Luce, Paul A.; Morgan, James L. (1999). "Phonotactic and Prosodic Effects on Word Segmentation in Infants". Cognitive Psychology. 38 (4): 465–494.