


"Statistical Learning Model using R"
SRES's SANJIVANI COLLEGE OF ENGINEERING, KOPARGAON [IT] 2018-2019

CHAPTER 1
INTRODUCTION

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. It deals with the problem of finding a predictive function based on data, and has led to successful applications in fields such as computer vision, speech recognition, bioinformatics, and baseball. The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data. Every point in the training set is an input-output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict output from future input. Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as output; the regression would find the functional relationship between voltage and current. Classification problems are those for which the output will be an element from a discrete set of labels. Classification is very common in machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector whose elements represent pixels in the picture.
After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set.
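The train/validate cycle described above can be sketched in a few lines of R. This is a minimal illustration using the built-in mtcars data, with a linear model standing in for the learned function; the 70/30 split ratio is an arbitrary choice for the example.

```r
# Learn a function on a training set, then validate it on a held-out test set.
set.seed(1)                                          # reproducible split
idx   <- sample(nrow(mtcars), floor(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]                               # training set
test  <- mtcars[-idx, ]                              # data not seen in training

fit  <- lm(mpg ~ wt, data = train)                   # learn a regression function
pred <- predict(fit, newdata = test)                 # predict outputs for unseen inputs
sqrt(mean((test$mpg - pred)^2))                      # test-set root-mean-squared error
```

The error computed on the test rows, rather than the training rows, is what tells us how well the learned function generalizes.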
1.1 Supervised Versus Unsupervised Learning:

Most statistical learning problems fall into one of two categories: supervised or unsupervised. The examples that we have discussed so far in this chapter all fall into the supervised learning domain. For each observation of the predictor measurement(s) xi, i = 1, . . . , n, there is an associated response measurement yi. We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference). Many classical statistical learning methods such as linear regression and logistic regression, as well as more modern approaches such as GAMs, boosting, and support vector machines, operate in the supervised learning domain. The vast majority of this report is devoted to this setting. In contrast, unsupervised learning describes the somewhat more challenging situation in which for every observation i = 1, . . . , n, we observe a vector of measurements xi but no associated response yi. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis.
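As an illustration of working "blind" with no response variable, one common unsupervised analysis is clustering. The sketch below runs k-means on the built-in iris measurements with the species labels deliberately removed; the choice of 3 clusters is an assumption made for the example.

```r
# Unsupervised learning: measurements x_i, but no response y_i to supervise us.
set.seed(2)
x  <- iris[, 1:4]               # four numeric measurements only, labels removed
km <- kmeans(x, centers = 3)    # search "blind" for 3 groups in the data
table(km$cluster)               # sizes of the discovered clusters
```

There is no yi against which to score the result; the groups must be judged on internal criteria such as within-cluster variation.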
1.2 Issues in Statistical Learning:

The main goal of the special issue assembled here is to fill the above need, with a focus on the fundamental modeling and learning issues of newly emerging approaches and empirical applications in speech and language processing. Another focus of this special issue is the cross-fertilization of learning approaches to speech and language processing problems. Many problems in speech and language processing share similarities (despite some conspicuous differences), and techniques in these two fields can be successfully cross-pollinated. An additional goal is to bring together a diverse but complementary set of contributions on emerging learning methods for speech processing, language processing, and unifying approaches to problems cutting across these two fields. Discriminative learning has become a major theme in most areas of speech and language processing. One recent advance in discriminative learning is the integration of the large-margin idea, the classical training standard in machine learning, into conventional discriminative training criteria for string recognition. A key question is how typical training criteria, such as minimum phone error and maximum mutual information, can be extended to incorporate the margin concept. In this work, a new margin-based formalism is proposed for various conventional training criteria. Experimental results show that the new criteria improve performance across a wide variety of string recognition scenarios, including speech recognition, concept tagging, and handwriting recognition. In another paper, Cheng et al. explore online learning and acoustic feature adaptation in large-margin hidden Markov models (HMMs), which lead to a better optimization method for large-margin HMM training.
Moving beyond acoustics, language modeling is one of the essential problems in the speech and language fields. Zhou et al. introduce a novel pseudo-conventional N-gram language model with discriminative training, and also carry out an empirical study of the robustness of discriminatively trained LMs. Experimental results show that cumulative performance improvements can be achieved via this method. Sequential pattern classification is at the core of many speech and language processing problems. The conditional random field (CRF) is a widely adopted approach to supervised sequential labeling.
However, the computational load and model complexity grow dramatically when taking complex structure into account. Here, Sokolovska et al. address this issue through efficient feature selection, imposing sparsity through an L1 regularization for the CRF. The results show that, without performance degradation, the L1-regularized CRF trains and labels significantly faster, and hence makes it possible to scale up systems to handle very-large-dimensional models. Meanwhile, Yu et al. improve the CRF model from another perspective. They propose a multi-layer sequence classification algorithm where each layer is a CRF, and each higher layer's input consists of both the previous layer's observation sequence and the resulting frame-level marginal probabilities. Compared with the conventional CRF, the deep-structured CRF achieves superior labeling accuracy on common tagging tasks. Using kernel methods to improve the performance of sequential pattern classifiers is also an important direction. Kubo et al. describe a novel sequential pattern classifier based on kernel methods. Unlike conventional approaches, they use kernel methods to estimate the emission probability of an HMM, with the extra benefit of the powerful nonlinear classification capability of kernel methods. On the other hand, unlike conventional CRF/HMM-based methods, Bellegarda attacks this problem from a novel angle based on latent semantic mapping and obtains insightful results.
CHAPTER 2
GETTING STARTED WITH R PROGRAMMING

2.1 Introduction to R-Studio

R is a free, open-source software environment and programming language developed in 1995 at the University of Auckland for statistical computing and graphics (Ihaka and Gentleman, 1996). Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines, including soil science, ecology, and geoinformatics (Environmetrics CRAN Task View; Spatial CRAN Task View). R is particularly popular for its graphical capabilities, but it is also prized for its GIS capabilities, which make it relatively easy to generate raster-based models. More recently, R has also gained several packages designed specifically for analyzing soil data.

2.2 User Interface:

R is a dialect of the S language. It is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There is a wide variety of data types, including vectors (numerical, character, logical), matrices, data frames, and lists. Most functionality is provided through built-in and user-created functions, and all data objects are kept in memory during an interactive session. Basic functions are available by default; other functions are contained in packages that can be attached to the current session as needed. This section describes working with the R interface. A key skill for using R effectively is learning how to use the built-in help system. Other sections describe the working environment, inputting programs and outputting results, installing new functionality through packages, GUIs that have been developed for R, customizing the environment, producing high-quality output, and running programs in batch.
A fundamental design feature of R is that the output from most functions can be used as input to other functions, which makes it easy to reuse results.
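A small sketch of this design feature: each function below consumes the result of the previous one directly.

```r
# Reusing results: output of one function feeds straight into the next.
x <- rnorm(50)      # rnorm() returns a numeric vector...
m <- mean(x)        # ...which mean() consumes, returning a single number...
r <- round(m, 2)    # ...which round() consumes in turn
r
```

The same chaining works with model objects, data frames, and plots, which is why most R analyses are built up as pipelines of small function calls.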
2.3 Basic Commands:

Input and Display:

read.table(filename, header=TRUE)           # read a tab- or space-delimited file with labels in the first row
read.table(filename, header=TRUE, sep=',')  # read a comma-delimited (csv) file
x <- c(1,2,4,8,16)          # create a data vector with specified elements
y <- c(1:10)                # create a data vector with elements 1-10
n <- 10
x1 <- c(rnorm(n))           # create an n-item vector of random normal deviates
y1 <- c(runif(n)) + n       # create another n-item vector with n added to each random uniform draw
z <- rbinom(n, size, prob)  # create n binomial samples of size "size" with probability prob
vect <- c(x, y)             # combine them into one vector of length 2n
mat <- cbind(x, y)          # combine them into an n x 2 matrix
mat[4, 2]                   # display the element in the 4th row, 2nd column
mat[3, ]                    # display the 3rd row
mat[, 2]                    # display the 2nd column
subset(dataset, logical)    # those objects meeting a logical criterion
subset(data.df, select=variables, logical)  # get those objects from a data frame that meet a criterion
data.df[logical, ]          # yet another way to get a subset
x[order(x$B), ]             # sort a data frame by the order of the elements in B
x[rev(order(x$B)), ]        # sort the data frame in reverse order

Moving Around:

ls()                        # list the variables in the workspace
rm(x)                       # remove x from the workspace
rm(list=ls())               # remove all the variables from the workspace
attach(mat)                 # make the names of the variables in the matrix or data frame available in the workspace
detach(mat)                 # release the names (remember to do this each time you attach something)
with(mat, ...)              # a preferred alternative to attach ... detach
new <- old[, -n]            # drop the nth column
new <- old[-n, ]            # drop the nth row
new <- old[, -c(i,j)]       # drop the ith and jth columns
new <- subset(old, logical) # select those cases that meet the logical condition
complete <- subset(data.df, complete.cases(data.df))  # find those cases with no missing values
new <- old[n1:n2, n3:n4]    # select rows n1 through n2 of variables n3 through n4

Distributions:

beta(a, b)
gamma(x)
choose(n, k)
factorial(x)
dnorm(x, mean=0, sd=1, log=FALSE)                     # normal distribution
pnorm(q, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
qnorm(p, mean=0, sd=1, lower.tail=TRUE, log.p=FALSE)
rnorm(n, mean=0, sd=1)
dunif(x, min=0, max=1, log=FALSE)                     # uniform distribution
punif(q, min=0, max=1, lower.tail=TRUE, log.p=FALSE)
qunif(p, min=0, max=1, lower.tail=TRUE, log.p=FALSE)
runif(n, min=0, max=1)

Data Manipulation:

replace(x, list, values)    # remember to assign this to some object, i.e., x <- replace(x, x==-9, NA)
                            # similar to the operation x[x==-9] <- NA
scrub(x, where, min, max, isvalue, newvalue)  # a convenient way to change particular values (in the psych package)
cut(x, breaks, labels=NULL, include.lowest=FALSE, right=TRUE, dig.lab=3, ...)
x.df <- data.frame(x1, x2, x3, ...)  # combine different kinds of data into a data frame
as.data.frame()
is.data.frame()
x <- as.matrix()
scale()                     # converts a data frame to standardized scores
round(x, n)                 # rounds the values of x to n decimal places
ceiling(x)                  # vector of smallest integers >= x
floor(x)                    # vector of largest integers <= x
as.integer(x)               # truncates real x to integers (compare to round(x, 0))
as.integer(x < cutpoint)    # vector of 0 if less than cutpoint, 1 if greater
factor(ifelse(a < cutpoint, "Neg", "Pos"))  # another way to dichotomize and make a factor for analysis
transform(data.df, variable=some operation) # can be part of a setup for a data set
x %in% y                    # tests each element of x for membership in y
y %in% x                    # tests each element of y for membership in x
all(x %in% y)               # true if x is a subset of y
all(x)                      # for a vector of logical values, are they all true?
any(x)                      # for a vector of logical values, is at least one true?

2.4 Data Structures in R:
R programming supports five basic data structures, namely the vector, matrix, list, data frame, and factor. This chapter discusses these data structures and the way to write them in R.

1. Vector – This data structure contains elements of a single type (integer, double, logical, complex, etc.). To create a vector in R, the c() function is used. For example,

> x <- 1:7; x
[1] 1 2 3 4 5 6 7
> y <- 2:-2; y
[1] 2 1 0 -1 -2

2. Matrix – A matrix is a two-dimensional data structure and can be created using the matrix() function. The numbers of rows and columns can be defined using the nrow and ncol arguments. Providing both is not required, as the other dimension is inferred from the length of the data.

3. List – This data structure includes data of different types. It is similar to a vector, but a vector contains elements of one type, while a list can contain mixed data. A list is created using list(). For example,

> x <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)
> str(x)
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3

4. Data frame – This data structure is a special case of a list where each component has the same length. A data frame is created using the data.frame() function. For example,

> x <- data.frame("SN" = 1:2, "Age" = c(21,15), "Name" = c("John","Dora"))
> str(x)   # structure of x
'data.frame': 2 obs. of 3 variables:
 $ SN  : int 1 2
 $ Age : num 21 15
 $ Name: Factor w/ 2 levels "Dora","John": 2 1

5. Factor – Factors are used to store predefined, categorical data. A factor can be created using the factor() function. For example,

> x <- factor(c("single", "married", "married", "single"))

6. String – Any value written inside single or double quotes is referred to as a string. For example,

x <- "This is a valid proper ' string"
print(x)
y <- 'this is still valid as the " double quote is used inside single quotes'
print(y)

Output:
This is a valid proper ' string
this is still valid as the " double quote is used inside single quotes

2.5 Graphics:

The plot() function is the primary way to plot data in R. For instance, plot(x, y) produces a scatterplot of the numbers in x versus the numbers in y. There are many additional options that can be passed in to the plot() function. For example, passing in the argument xlab will result in a label on the x-axis. To find out more information about the plot() function, type ?plot.

> x <- rnorm(100)
> y <- rnorm(100)
> plot(x, y)
> plot(x, y, xlab="this is the x-axis", ylab="this is the y-axis", main="Plot of X vs Y")
We will often want to save the output of an R plot. The command that we use to do this will depend on the file type that we would like to create. For instance, to create a pdf we use the pdf() function, and to create a jpeg we use the jpeg() function.

> pdf("Figure.pdf")
> plot(x, y, col="green")
> dev.off()
null device

The function dev.off() indicates to R that we are done creating the plot. Alternatively, we can simply copy the plot window and paste it into an appropriate file type, such as a Word document. The function seq() can be used to create a sequence of numbers. For instance, seq(a,b) makes a vector of integers between a and b. There are many other options: for instance, seq(0,1,length=10) makes a sequence of 10 numbers that are equally spaced between 0 and 1. Typing 3:11 is shorthand for seq(3,11) for integer arguments.

> x <- seq(1, 10)
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x <- 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10
> x <- seq(-pi, pi, length=50)

We will now create some more sophisticated plots. The contour() function produces a contour plot in order to represent three-dimensional data; it is like a topographical map. It takes three arguments:
1. A vector of the x values (the first dimension),
2. A vector of the y values (the second dimension), and
3. A matrix whose elements correspond to the z value (the third dimension) for each pair of (x,y) coordinates.

As with the plot() function, there are many other inputs that can be used to fine-tune the output of the contour() function. To learn more about these, take a look at the help file by typing ?contour.

> y <- x
> f <- outer(x, y, function(x, y) cos(y)/(1 + x^2))
> contour(x, y, f)
> contour(x, y, f, nlevels=45, add=TRUE)
> fa <- (f - t(f))/2
> contour(x, y, fa, nlevels=15)

The image() function works the same way as contour(), except that it produces a color-coded plot whose colors depend on the z value. This is known as a heatmap, and is sometimes used to plot temperature in weather forecasts. Alternatively, persp() can be used to produce a three-dimensional plot. The arguments theta and phi control the angles at which the plot is viewed.

> image(x, y, fa)
> persp(x, y, fa)
> persp(x, y, fa, theta=30)
> persp(x, y, fa, theta=30, phi=20)
> persp(x, y, fa, theta=30, phi=70)
> persp(x, y, fa, theta=30, phi=40)

2.6 Reading Data into R:

Usually we will be using data already in a file that we need to read into R in order to work on it. R can read data from a variety of file formats, for example files created as text, or in Excel, SPSS, or Stata. We will mainly be reading files in text format (.txt) or .csv (comma-separated, usually created in Excel). To read an entire data frame directly, the external file will normally have a special form. The first line of the file should have a name for each variable in the data frame. Each additional line of the file has as its first item a row label, followed by the values for each variable. Here we use the example data sets airquality.csv and airquality.txt. Input file form with names and row labels:

  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
...

By default, numeric items (except row labels) are read as numeric variables. This can be changed if necessary. The function read.table() can then be used to read the data frame directly:

> airqual <- read.table("C:/Desktop/airquality.txt")

Similarly, to read .csv files the read.csv() function can be used to read in the data frame directly. [Note: occasionally you'll need to use a double slash in your path (//); this seems to depend on the machine.]

> airqual <- read.csv("C:/Desktop/airquality.csv")

In addition, you can read in files using the file.choose() function. After typing this command in R, you can manually select the directory and file where your dataset is located.
CHAPTER 3
LINEAR REGRESSION MODELS

3.1 Linear Regression:

This chapter is about linear regression, a very simple approach for supervised learning. In particular, linear regression is a useful tool for predicting a quantitative response. Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this report, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression. Consequently, the importance of having a good understanding of linear regression before studying more complex learning methods cannot be overstated. In this chapter, we review some of the key ideas underlying the linear regression model, as well as the least squares approach that is most commonly used to fit this model. Recall the Advertising data from Chapter 2: sales (in thousands of units) for a particular product as a function of advertising budgets (in thousands of dollars) for TV, radio, and newspaper media. Suppose that in our role as statistical consultants we are asked to suggest, on the basis of these data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation?

Simple Linear Regression

Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.
It assumes that there is approximately a linear relationship between X and Y. Mathematically, we can write this linear relationship as

Y ≈ β0 + β1X.

You might read "≈" as "is approximately modeled as". We will sometimes describe this by saying that we are regressing Y on X (or Y onto X). For example, X may
represent TV advertising and Y may represent sales. Then we can regress sales onto TV by fitting the model

sales ≈ β0 + β1 × TV.

In this equation, β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model. Together, β0 and β1 are known as the model coefficients or parameters. Once we have used our training data to produce estimates β̂0 and β̂1 for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing

ŷ = β̂0 + β̂1x,

where ŷ indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.

Estimating the Coefficients

In practice, β0 and β1 are unknown. So before we can use the model to make predictions, we must use data to estimate the coefficients. Let (x1, y1), (x2, y2), . . . , (xn, yn) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y. In the Advertising example, this data set consists of the TV advertising budget and product sales in n = 200 different markets. (Recall that the data are displayed.) Our goal is to obtain coefficient estimates β̂0 and β̂1 such that the linear model fits the available data well, that is, so that yi ≈ β̂0 + β̂1xi for i = 1, . . . , n. In other words, we want to find an intercept β̂0 and a slope β̂1 such that the resulting line is as close as possible to the n = 200 data points. There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion, and we take that approach in this chapter.
For the Advertising data, the least squares fit for the regression of sales onto TV is shown. The fit is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot. Let ŷi = β̂0 + β̂1xi be the prediction for Y based on the ith value of X. Then ei = yi − ŷi represents the ith residual: the difference between the ith observed response value and the ith response value that is predicted by our linear model. We define the residual sum of squares (RSS) as

RSS = e1² + e2² + · · · + en²,

or equivalently as

RSS = (y1 − β̂0 − β̂1x1)² + (y2 − β̂0 − β̂1x2)² + · · · + (yn − β̂0 − β̂1xn)².
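The least squares fit described above is exactly what R's lm() function computes. The Advertising data set is not bundled with R, so the sketch below simulates a TV-budget/sales relationship (the intercept 7 and slope 0.05 are illustrative assumptions, not values from the real data).

```r
# Least squares in R: lm() chooses beta0.hat and beta1.hat to minimize the RSS.
set.seed(42)
TV    <- runif(200, 0, 300)                  # TV budgets in n = 200 markets
sales <- 7 + 0.05 * TV + rnorm(200, sd = 2)  # a true line plus random noise

fit <- lm(sales ~ TV)    # fit the model sales ~ beta0 + beta1 * TV
coef(fit)                # the estimated intercept beta0.hat and slope beta1.hat
sum(resid(fit)^2)        # the residual sum of squares, RSS = e1^2 + ... + en^2
```

Because the data were generated from a known line, the estimated slope should land close to the true value 0.05, and no other choice of intercept and slope can give a smaller RSS on these data.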
CHAPTER 4
CLASSIFICATION

The linear regression model discussed in Chapter 3 assumes that the response variable Y is quantitative. But in many situations, the response variable is instead qualitative. For example, eye color is qualitative, taking on values blue, brown, or green. Often qualitative variables are referred to as categorical; we will use these terms interchangeably. In this chapter, we study approaches for predicting qualitative responses, a process that is known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. On the other hand, the methods used for classification often first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods. There are many possible classification techniques, or classifiers, that one might use to predict a qualitative response. We touched on some of these in Sections 2.1.5 and 2.2.3. In this chapter we discuss three of the most widely used classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.

4.1 An Overview of Classification:

Classification problems occur often, perhaps even more so than regression problems. Some examples include:
1. A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
2. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user's IP address, past transaction history, and so forth.
3. On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

Just as in the regression setting, in the classification setting we have a set of training observations (x1, y1), . . . , (xn, yn) that we can use to build a classifier. We want our classifier to perform well not only on the training data, but also on test observations that were not used to train the classifier. In this chapter, we will illustrate the concept of classification using the simulated Default data set. We are interested in predicting whether an individual will default on his or her credit card payment, on the basis of annual income and monthly credit card balance. The data set is displayed in Figure 4.1. We have plotted annual income and monthly credit card balance for a subset of 10,000 individuals. The left-hand panel of Figure 4.1 displays individuals who defaulted in a given month in orange, and those who did not in blue. (The overall default rate is about 3%, so we have plotted only a fraction of the individuals who did not default.) It appears that individuals who defaulted tended to have higher credit card balances than those who did not. In the right-hand panel of Figure 4.1, two pairs of boxplots are shown. The first shows the distribution of balance split by the binary default variable; the second is a similar plot for income. In this chapter, we learn how to build a model to predict default (Y) for any given value of balance (X1) and income (X2). Since Y is not quantitative, the simple linear regression model of Chapter 3 is not appropriate. It is worth noting that Figure 4.1 displays a very pronounced relationship between the predictor balance and the response default.
In most real applications, the relationship between the predictor and the response will not be nearly so strong. However, for the sake of illustrating the classification procedures discussed in this chapter, we use an example in which the relationship between the predictor and the response is somewhat exaggerated.
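To make the idea of training a classifier and testing it on held-out observations concrete, here is a hedged sketch of one of the classifiers named above, K-nearest neighbors, using the knn() function from the class package (shipped with standard R installations). The two simulated predictors are stand-ins for variables like balance and income; the decision rule used to generate the labels is an assumption of the example.

```r
# K-nearest-neighbors classification on simulated training/test data.
library(class)                                     # provides knn()
set.seed(3)
train.X <- matrix(rnorm(200), ncol = 2)            # 100 training observations
train.y <- ifelse(train.X[, 1] > 0, "Yes", "No")   # class depends on predictor 1
test.X  <- matrix(rnorm(40), ncol = 2)             # 20 observations not used in training

pred <- knn(train.X, test.X, cl = train.y, k = 3)  # classify each test point by its
table(pred)                                        # 3 nearest training neighbors
```

The labels in pred are predictions for observations the classifier never saw during training, which is exactly the test-set performance we care about.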
FIGURE 4.1. The Default data set. Left: The annual incomes and monthly credit card balances of a number of individuals. The individuals who defaulted on their credit card payments are shown in orange, and those who did not are shown in blue. Center: Boxplots of balance as a function of default status. Right: Boxplots of income as a function of default status.

4.2 Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not? Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

Y = 1 if stroke; 2 if drug overdose; 3 if epileptic seizure.

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors X1, . . . , Xp. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the difference between stroke and drug overdose is the same as the
difference between drug overdose and epileptic seizure. In practice there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

Y = 1 if epileptic seizure; 2 if stroke; 3 if drug overdose,

which would imply a totally different relationship among the three conditions. Each of these codings would produce fundamentally different linear models that would ultimately lead to different sets of predictions on test observations. If the response variable's values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression. For a binary (two-level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient's medical condition: stroke and drug overdose. We could then potentially use the dummy variable approach from Section 3.3.1 to code the response as follows:

Y = 0 if stroke; 1 if drug overdose.

4.3 Logistic Regression:
FIGURE 4.2. Classification using the Default data. Left: Estimated probability of default using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for default (No or Yes). Right: Predicted probabilities of default using logistic regression. All probabilities lie between 0 and 1.

For the Default data, logistic regression models the probability of default. For example, the probability of default given balance can be written as Pr(default = Yes|balance). The values of Pr(default = Yes|balance), which we abbreviate p(balance), will range between 0 and 1. Then for any given value of balance, a prediction can be made for default. For example, one might predict default = Yes for any individual for whom p(balance) > 0.5. Alternatively, if a company wishes to be conservative in predicting individuals who are at risk for default, then it may choose a lower threshold, such as p(balance) > 0.1.

4.3.1 The Logistic Model

How should we model the relationship between p(X) = Pr(Y = 1|X) and X? (For convenience we are using the generic 0/1 coding for the response.) In Section 4.2 we talked of using a linear regression model to represent these probabilities:

p(X) = β0 + β1X. (4.1)

If we use this approach to predict default = Yes using balance, then we obtain the model shown in the left-hand panel of Figure 4.2. Here we see the problem with this approach: for balances close to zero we predict a negative probability of default; if we were to predict for very large balances, we would get values bigger than 1. These predictions are not sensible, since of course the true probability of default, regardless of credit card balance, must fall between 0 and 1. This problem is not unique to the credit default data.
Any time a straight line is fit to a binary response that is coded as 0 or 1, in principle we can always predict p(X) < 0 for some values of X and p(X) > 1 for others (unless the range of X is limited).
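This can be seen with a small numerical sketch. The data and logistic coefficients below are made up for illustration (they are not the Default data set): an ordinary least squares line fit to a 0/1 response produces "probabilities" outside [0, 1], whereas the logistic function used by logistic regression never leaves (0, 1).

```python
# Toy illustration (assumed data): a straight-line fit to a 0/1 response
# versus the bounded logistic function.
import math

# Hypothetical balances (x) and default indicators (y); defaults cluster at high x.
x = [100, 500, 900, 1300, 1700, 2100, 2500]
y = [0,   0,   0,   0,    1,    1,    1]

# Ordinary least squares for one predictor: slope = cov(x, y) / var(x).
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx

def linear_p(v):
    # The least squares line, read as a "probability" of default.
    return b0 + b1 * v

def logistic_p(v, beta0=-5.0, beta1=0.004):
    # Logistic function with made-up coefficients, for illustration only.
    z = beta0 + beta1 * v
    return math.exp(z) / (1 + math.exp(z))

# linear_p(0) is negative and linear_p(5000) exceeds 1,
# while logistic_p stays strictly between 0 and 1 for any input.
```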
To avoid this problem, we must model p(X) using a function that gives outputs between 0 and 1 for all values of X. Many functions meet this description. In logistic regression, we use the logistic function,

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X)). (4.2)

To fit the model (4.2), we use a method called maximum likelihood, which we discuss in the next section. The right-hand panel of Figure 4.2 illustrates the fit of the logistic regression model to the Default data. Notice that for low balances we now predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of X, we will obtain a sensible prediction. We also see that the logistic model is better able to capture the range of probabilities than is the linear regression model in the left-hand plot. The average fitted probability in both cases is 0.0333 (averaged over the training data), which is the same as the overall proportion of defaulters in the data set.

4.3.2 Estimating the Regression Coefficients

The coefficients β0 and β1 in (4.2) are unknown, and must be estimated based on the available training data. In Chapter 3, we used the least squares approach to estimate the unknown linear regression coefficients. Although we could use (non-linear) least squares to fit the model (4.2), the more general method of maximum likelihood is preferred, since it has better statistical properties. The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for β0 and β1 such that the predicted probability p̂(xi) of default for each individual, using (4.2), corresponds as closely as possible to the individual's observed default status.
In other words, we try to find β̂0 and β̂1 such that plugging these estimates into the model for p(X), given in (4.2), yields a number close to one for all individuals who defaulted, and a number close to zero for all individuals who did not. This intuition can be formalized using a mathematical equation called a likelihood function:

ℓ(β0, β1) = Π_{i: yi = 1} p(xi) × Π_{i': yi' = 0} (1 − p(xi')).
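As a rough sketch of what maximizing this likelihood involves, the toy example below (made-up data, not the Default set) maximizes the log of the likelihood by simple gradient ascent; in practice one would call R's glm() with family = binomial, which fits the same model by iteratively reweighted least squares.

```python
# Toy maximum likelihood fit of logistic regression by gradient ascent
# on the log-likelihood sum_i [y_i*log p(x_i) + (1 - y_i)*log(1 - p(x_i))].
import math

# Made-up, slightly overlapping data so the maximum likelihood estimate is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y = [0,   0,   1,   0,   1,   1]

def p(xi, b0, b1):
    # Logistic function from (4.2).
    return 1 / (1 + math.exp(-(b0 + b1 * xi)))

b0, b1, rate = 0.0, 0.0, 0.1
for _ in range(5000):
    # Gradient of the log-likelihood: sum_i (y_i - p(x_i)) * [1, x_i].
    g0 = sum(yi - p(xi, b0, b1) for xi, yi in zip(x, y))
    g1 = sum((yi - p(xi, b0, b1)) * xi for xi, yi in zip(x, y))
    b0 += rate * g0
    b1 += rate * g1

# After fitting, predicted probabilities are low for the y = 0 observations
# and high for the y = 1 observations, matching the intuition in the text.
```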
The estimates β̂0 and β̂1 are chosen to maximize this likelihood function. Maximum likelihood is a very general approach that is used to fit many of the non-linear models that we examine throughout this book. In the linear regression setting, the least squares approach is in fact a special case of maximum likelihood. The mathematical details of maximum likelihood are beyond the scope of this book. However, in general, logistic regression and other models can be easily fit using a statistical software package such as R, and so we do not need to concern ourselves with the details of the maximum likelihood fitting procedure.

4.4 Linear Discriminant Analysis

Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function, for the case of two response classes. In statistical jargon, we model the conditional distribution of the response Y, given the predictor(s) X. We now consider an alternative and less direct approach to estimating these probabilities. In this alternative approach, we model the distribution of the predictors X separately in each of the response classes (i.e. given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k|X = x). When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression. Why do we need another method, when we have logistic regression? There are several reasons:

▲ When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

▲ If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
▲ As mentioned in Section 4.3.5, linear discriminant analysis is popular when we have more than two response classes.

4.4.1 Using Bayes' Theorem for Classification

Suppose that we wish to classify an observation into one of K classes, where K ≥ 2. In other words, the qualitative response variable Y can take on K possible distinct and unordered values. Let πk represent the overall or prior probability that a randomly chosen observation comes from the kth class; this is the probability that a given observation is
associated with the kth category of the response variable Y. Let fk(X) ≡ Pr(X = x|Y = k) denote the density function of X for an observation that comes from the kth class. In other words, fk(x) is relatively large if there is a high probability that an observation in the kth class has X ≈ x, and fk(x) is small if it is very unlikely that an observation in the kth class has X ≈ x. Then Bayes' theorem states that

pk(x) = Pr(Y = k|X = x) = πk fk(x) / Σ_{l=1}^{K} πl fl(x). (4.10)

In accordance with our earlier notation, we will use the abbreviation pk(X) = Pr(Y = k|X). This suggests that instead of directly computing pk(X) as in Section 4.3.1, we can simply plug in estimates of πk and fk(X) into (4.10). In general, estimating πk is easy if we have a random sample of Ys from the population: we simply compute the fraction of the training observations that belong to the kth class.
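A small sketch of plugging estimates into (4.10), assuming one-dimensional normal densities fk with made-up priors and parameters (linear discriminant analysis additionally assumes a shared variance across classes):

```python
# Bayes' theorem for classification: p_k(x) = pi_k * f_k(x) / sum_l pi_l * f_l(x),
# with Gaussian class densities f_k. Priors and (mu, sigma) are illustrative.
import math

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

priors = [0.5, 0.3, 0.2]                      # assumed pi_k for K = 3 classes
params = [(-2.0, 1.0), (0.0, 1.0), (3.0, 1.0)]  # assumed (mu_k, sigma); shared sigma, as in LDA

def posteriors(x):
    # Numerators pi_k * f_k(x), normalized so the posteriors sum to one.
    numerators = [pi * normal_pdf(x, mu, s) for pi, (mu, s) in zip(priors, params)]
    total = sum(numerators)
    return [num / total for num in numerators]

post = posteriors(-1.5)
# Assign the observation to the class with the largest posterior probability.
predicted = max(range(3), key=lambda k: post[k])
```

Here an observation at x = -1.5 is assigned to the first class, whose mean (-2.0) is closest once the priors are taken into account.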
CHAPTER 5 TREE BASED METHODS

In this chapter, we describe tree-based methods for regression and classification. These involve stratifying or segmenting the predictor space into a number of simple regions. In order to make a prediction for a given observation, we typically use the mean or the mode of the training observations in the region to which it belongs. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods. Tree-based methods are simple and useful for interpretation. However, they typically are not competitive with the best supervised learning approaches, such as those seen in Chapters 6 and 7, in terms of prediction accuracy. Hence in this chapter we also introduce bagging, random forests, and boosting. Each of these approaches involves producing multiple trees which are then combined to yield a single consensus prediction. We will see that combining a large number of trees can often result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.

5.1 The Basics of Decision Trees

Decision trees can be applied to both regression and classification problems. We first consider regression problems, and then move on to classification. For the Hitters data, a regression tree for predicting the log salary of a baseball player can be built from the number of years that he has played in the major leagues and the number of hits that he made in the previous year. At a given internal node, the label (of the form Xj < tk) indicates the left-hand branch emanating from that split, and the right-hand branch corresponds to Xj ≥ tk. For instance, the split at the top of the tree results in two large branches. The left-hand branch corresponds to Years < 4.5, and the right-hand branch corresponds to Years ≥ 4.5.
The tree has two internal nodes and three terminal nodes, or leaves. The number in each leaf is the mean of the response for the observations that fall there.
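The splitting rule at an internal node can be sketched as follows. Using made-up values standing in for Years and log Salary, the code scans candidate cutpoints t and picks the split X < t that minimizes the combined residual sum of squares of the two resulting regions — one step of the recursive binary splitting used to grow such a tree:

```python
# One step of recursive binary splitting for a regression tree:
# choose the cutpoint t minimizing RSS(left) + RSS(right).
def rss(values):
    # Residual sum of squares around the mean of a region.
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    best = None
    for t in sorted(set(x))[1:]:  # candidate cutpoints between observed values
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        total = rss(left) + rss(right)
        if best is None or total < best[0]:
            best = (total, t)
    return best[1]

# Hypothetical data standing in for Years and log Salary in the Hitters example.
years = [1, 2, 3, 4, 5, 6, 14, 20]
log_salary = [4.5, 4.6, 4.4, 4.7, 6.0, 6.2, 6.5, 6.4]
t = best_split(years, log_salary)  # → 5: separates low- from high-experience players
```

Growing a full tree simply repeats this search within each resulting region until a stopping rule is met, after which the tree is pruned as in Algorithm 5.1.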
5.1.1 Regression Trees

In order to motivate regression trees, we begin with a simple example: predicting baseball players' salaries using regression trees. We use the Hitters data set to predict a baseball player's Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell shape. (Recall that Salary is measured in thousands of dollars.) Figure 8.1 shows a regression tree fit to this data. It consists of a series of splitting rules, starting at the top of the tree. The top split assigns observations having Years < 4.5 to the left branch.

Algorithm 5.1 Building a Regression Tree.

1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
(a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
(b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
Average the results for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.

5.1.2 Advantages and Disadvantages of Trees

Decision trees for regression and classification have a number of advantages over the more classical approaches seen in Chapters 3 and 4:

▲ Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
▲ Some people believe that decision trees more closely mirror human decision-making than do the regression and classification approaches seen in previous chapters.

▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).

▲ Trees can easily handle qualitative predictors without the need to create dummy variables.

5.2 Bagging, Random Forests, Boosting

Bagging, random forests, and boosting use trees as building blocks to construct more powerful prediction models.

5.2.1 Bagging

The bootstrap is an extremely powerful idea. It is used in many situations in which it is hard or even impossible to directly compute the standard deviation of a quantity of interest. We see here that the bootstrap can be used in a completely different context, in order to improve statistical learning methods such as decision trees. The decision trees discussed in Section 5.1 suffer from high variance. This means that if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets; linear regression tends to have low variance, if the ratio of n to p is moderately large. Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method.
It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach. Recall that the key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. One can show that on average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in which that observation was OOB. This will yield around B/3 predictions for the ith observation. In order to obtain a single prediction for the ith observation, we can average these predicted responses (if regression is the goal) or can take a majority vote (if classification is the goal). This leads to a single OOB prediction for the ith observation. An OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE (for a regression problem) or classification error (for a classification problem) can be computed. The resulting OOB error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation. Figure 8.8 displays the OOB error on the Heart data. It can be shown that with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error.
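The OOB bookkeeping can be sketched as follows. For simplicity the "tree" fitted to each bootstrap sample is just that sample's mean, and the data are made up; a real implementation would fit an actual regression tree to each sample, e.g. with R's randomForest package:

```python
# Bagging with an out-of-bag (OOB) test error estimate, in a toy regression
# setting where each "tree" is simply the mean of its bootstrap sample.
import random
import statistics

random.seed(0)
y = [2.0, 2.5, 3.0, 3.5, 4.0, 10.0, 2.8, 3.2, 3.6, 2.9]
n, B = len(y), 200

oob_preds = [[] for _ in range(n)]  # OOB predictions collected per observation
for _ in range(B):
    # Bootstrap sample: n draws with replacement (~2/3 of observations appear).
    in_bag = [random.randrange(n) for _ in range(n)]
    fit = statistics.mean(y[i] for i in in_bag)  # the "tree" for this sample
    # Record this tree's prediction for every observation it did NOT use.
    for i in set(range(n)) - set(in_bag):
        oob_preds[i].append(fit)

# Average the OOB predictions for each observation, then compute the OOB MSE.
oob_mse = statistics.mean(
    (statistics.mean(preds) - y[i]) ** 2
    for i, preds in enumerate(oob_preds) if preds
)
```

With B = 200 each observation is out-of-bag for roughly B/3 of the trees, so every observation receives an honest prediction from trees that never saw it.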
CHAPTER 6 CONCLUSION

With the explosion of "Big Data" problems, statistical learning has become a very hot field within machine learning. This work covers statistical learning and modeling skills that are in high demand, along with basic concepts of the statistical learning and modeling methods that have widespread use in business and scientific research. It provides hands-on exposure to the applications and to the underlying statistical and mathematical concepts that are relevant to these modeling techniques. The course is designed to familiarize students with implementing statistical learning methods using the highly popular statistical software package R.
CHAPTER 7 REFERENCES

1) James, Gareth; Witten, Daniela; Hastie, Trevor; Tibshirani, Robert. An Introduction to Statistical Learning with Applications in R. 6th edition. Springer Publications.
2) Saffran, Jenny R. (2003). "Statistical language learning: mechanisms and constraints". Current Directions in Psychological Science. 12 (4): 110–114. doi:10.1111/1467-8721.01243.
3) Brent, Michael R.; Cartwright, Timothy A. (1996). "Distributional regularity and phonotactic constraints are useful for segmentation". Cognition. 61 (1–2): 93–125. doi:10.1016/S0010-0277(96)00719-6.
4) Saffran, J. R.; Aslin, R. N.; Newport, E. L. (1996). "Statistical Learning by 8-Month-Old Infants". Science. 274 (5294): 1926–1928. doi:10.1126/science.274.5294.1926. PMID 8943209.
5) Saffran, Jenny R.; Newport, Elissa L.; Aslin, Richard N. (1996). "Word Segmentation: The Role of Distributional Cues". Journal of Memory and Language. 35 (4): 606–621. doi:10.1006/jmla.1996.0032.
6) Aslin, R. N.; Saffran, J. R.; Newport, E. L. (1998). "Computation of Conditional Probability Statistics by 8-Month-Old Infants". Psychological Science. 9 (4): 321–324. doi:10.1111/1467-9280.00063.
7) Saffran, Jenny R. (2001). "Words in a sea of sounds: the output of infant statistical learning". Cognition. 81 (2): 149–169. doi:10.1016/S0010-0277(01)00132-9.
8) Saffran, Jenny R.; Wilson, Diana P. (2003). "From Syllables to Syntax: Multilevel Statistical Learning by 12-Month-Old Infants". Infancy. 4 (2): 273–284. doi:10.1207/S15327078IN0402_07.
9) Mattys, Sven L.; Jusczyk, Peter W.; Luce, Paul A.; Morgan, James L. (1999). "Phonotactic and Prosodic Effects on Word Segmentation in Infants". Cognitive Psychology. 38 (4): 465–494.