This document analyzes the Boston housing data from 1970 using R. It examines the relationships between variables using scatterplots and correlation. Various regression models are tested to analyze properties of the data. Model selection methods like forward selection, backward selection, and information criteria are used to identify the best fitting model. The selected model is then used to compute statistics like SSPE on a subset of the data.
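As a rough illustration of the SSPE computation mentioned above, here is a minimal Python sketch; the observations and predictions are made-up numbers, not values from the actual Boston data:

```python
# Sum of squared prediction errors (SSPE) on a held-out subset:
# fit on the training rows, then sum squared residuals on the validation rows.
def sspe(y_true, y_pred):
    """Sum of squared prediction errors over a validation set."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))

# Hypothetical held-out observations and model predictions
y_val = [24.0, 21.6, 34.7]
y_hat = [25.1, 20.9, 33.5]
print(round(sspe(y_val, y_hat), 2))  # 3.14
```

Comparing SSPE across candidate models on the same held-out rows is what makes it useful as a selection criterion alongside AIC-style measures.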
This document discusses nervous system injury and regeneration strategies. It describes how the nervous system is divided into the central and peripheral systems. Injury can arise from a variety of causes and disrupts communication between neurons. In the central nervous system, axons do not regenerate naturally due to inhibitors at injury sites. Tissue engineering strategies aim to use guidance channels, cell recovery, drug delivery, and electrical stimulation to promote regeneration, notably by transforming glial cells into neurons using genetic modification in mouse models. Advances in developmental biology and tissue engineering principles can mimic developmental cues to direct neural tissue regeneration.
This document provides an overview of the origins and influences of weaving at the Bauhaus school in Germany from 1919-1933. It discusses precursors like the Arts and Crafts movement and German Expressionism. It describes the establishment of the Weaving Workshop and its curriculum. Key weavers like Anni Albers produced innovative textile designs that integrated art and craft and emphasized function over decoration. The Bauhaus weaving tradition had lasting influence on textile design in the 20th century.
The document provides information about the Nomadic Empire, also known as the Mongol Empire, and Genghis Khan, who was instrumental in establishing the empire. It discusses how Genghis Khan unified various nomadic tribes in Mongolia and began a campaign of conquest that expanded the empire across Asia and into Eastern Europe. By the time of his death in 1227, Genghis Khan had established the largest contiguous land empire in history that spanned from the Sea of Japan to the Caspian Sea. The empire had a significant impact by encouraging trade, spreading innovations, and establishing religious tolerance across Eurasia.
The Internet is a network of networks that interconnects computers around the world. Search engines such as Google, Yahoo, and Bing are used to find information on the Internet. Metasearch engines query several conventional search engines at once, and Google services such as Gmail, Drive, and YouTube provide email, storage, and video, respectively.
Allam Abu Hasan is a Jordanian national seeking a Project Management role. He has over 3 years of experience in procurement and project management. His most recent role was as a Procurement Engineer at Drake & Scull International, where he managed the procurement process, analyzed supplier quotations, and ensured timely delivery of materials. Prior to that, he worked as a Sales Engineer coordinating deliveries and providing customer support. He holds a Bachelor's degree in Mechatronics Engineering and is proficient in English, Microsoft Office, programming languages, and project management skills.
Hawaii has a unique culture that blends Native Hawaiian and Western influences. Tourism is a major part of Hawaii's economy, with visitors spending over $1.2 billion in September 2016 alone. Hawaii promotes various types of tourism, including pleasure trips, vacations, and cultural events, with its warm climate, beaches, and cultural attractions drawing many visitors throughout the year, especially from countries like China, Japan, and South Korea. While tourism has brought economic benefits, it has also increased road traffic and the cost of living. Overall, Hawaii has successfully leveraged its diverse culture and infrastructure to develop a thriving tourism industry.
This document does not contain any substantive content to summarize. It only includes random characters without any meaningful words or sentences. Therefore, a meaningful summary cannot be generated from the given text.
Digital Revolution: adapt or you may find yourself without a company or a job (Alfredo Vela Zancada)
The document discusses the challenges of digital transformation and the need to adapt to technological change. It notes that the world's largest companies are now technology companies, and that sectors such as banking and the media are struggling with the transformation. It also describes new job profiles such as the "knowmads", digital-nomad workers with the skills to work remotely and collaboratively.
The document presents the 2015 Annual Report on the Information Society in Spain. It summarizes the main aspects of the evolution of the Information Society at the global, European, and national levels, including data on the deployment and use of ICT in households, businesses, and public administration in Spain. It also analyzes emerging technology trends and their possible future impact.
Provisional Measure No. 766/2017: Tax Regularization Program (Alexandre Pantoja)
The document describes the Tax Regularization Program established by Provisional Measure No. 766/2017, which allows the settlement of tax and non-tax debts that fell due by November 30, 2016. The program offers several payment options, including installment plans of up to 120 payments and the use of tax credits. Joining the program implies acknowledging the debts and accepting its rules.
Conegliano, capital of Prosecco Superiore (Michael Mazzer)
2016 is a very important year for the wine world, particularly for the city of Conegliano, elected "European City of Wine 2016". It also marks other significant milestones, such as the 140th anniversary of the Conegliano School of Enology and the 50th anniversary of the Prosecco Road. This thesis presents the general characteristics of territorial marketing, the initiatives for promoting Prosecco, the Conegliano-Valdobbiadene area with its viticultural and enological institutions, the CEV 2016 designation, the UNESCO candidacy, wine tourism, and the future prospects for Prosecco and for the Conegliano-Valdobbiadene area in particular.
The document discusses performing a correlation analysis on selected numerical and categorical variables from a data set to identify highly correlated variables. A heat map was generated from the correlation analysis. Two numerical variables, total sales by branch and dairy sales total, were identified as highly correlated with other variables and removed from further analysis. Stepwise regression was then performed on the remaining variables to further reduce the number of predictor variables.
Multiple Linear Regression Applications in Real Estate Pricing (inventionjournals)
In this paper, we attempt to predict the prices of individual homes sold in Northwest Indiana, based on homes sold in 2014. The data were collected from realtor.com. The purpose of this paper is to predict the price of individual homes using a multiple regression model, implemented with SAS forecasting tools. We also determine which factors influence housing prices and to what extent they affect the price. Independent variables include square footage, number of bathrooms, whether there is a finished basement, whether there is a brick front, and the type of home (Colonial, Contemporary, or Tudor). We also ask how much each type of home adds to the price of the property.
Multiple Linear Regression Applications in Real Estate Pricing (inventionjournals)
This document describes using multiple linear regression to predict real estate prices. House price data from 480 homes sold in Indiana in 2014 is used. Independent variables like size, number of bedrooms/bathrooms, and whether there is a basement are considered. Correlations between variables are examined. An initial regression model is developed using all potential predictors. The best-fitting model is found to use only homeowner association (HOA) fees as a predictor, with the equation Price = 312,638 + 17.854 × HOA.
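Using the coefficients reported in the summary, the fitted single-predictor model can be evaluated directly. The sketch below simply encodes the equation as given; the fee values plugged in are illustrative, not from the paper:

```python
def predict_price(hoa_fee):
    """Predicted sale price from the reported fitted model:
    Price = 312638 + 17.854 * HOA."""
    return 312638 + 17.854 * hoa_fee

print(predict_price(0))    # the intercept: predicted price with no HOA fee
print(predict_price(100))  # each dollar of HOA fee adds 17.854 to the prediction
```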
Chapter 10: Correlation and Regression
10.1: Correlation
The multiple regression equation to predict January heating costs based on mean outside temperature, inches of insulation, and furnace age is:
Estimated Heating Cost = 82.57 + 1.23(Temperature) - 1.39(Insulation) + 1.11(Furnace Age)
Positive regression coefficients indicate that higher values of those variables increase heating costs, while negative coefficients decrease them. The intercept of 82.57 is the estimated cost when all independent variables equal zero. For a home with a 10-year-old furnace, 5 inches of insulation, and a mean outside temperature of 30 degrees, the equation gives 82.57 + 1.23(30) - 1.39(5) + 1.11(10) = $123.62.
This presentation is on a recommender system for question-paper prediction using machine learning techniques. We conducted a literature survey and implemented the system using the same techniques.
This document provides an overview of simultaneous equation models. Some key points:
- Simultaneous equation models account for relationships where one variable is determined by others, which are also determined by the first variable (two-way relationship).
- They consist of multiple equations, with one for each endogenous (jointly determined) variable. The parameters cannot be estimated by looking at single equations.
- Reduced form equations express the endogenous variables solely in terms of exogenous variables and error terms. This allows consistent estimation of coefficients using OLS.
- Identification concerns whether the structural form parameters can be uniquely determined from the reduced form parameters. Exact, over- or under-identification can occur.
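The reduced-form idea above can be made concrete with a hypothetical two-equation supply-and-demand system; all coefficients below are invented for illustration:

```python
# Structural form (hypothetical coefficients):
#   demand: q = a0 + a1*p + a2*x   (x = income, exogenous; p, q endogenous)
#   supply: q = b0 + b1*p
a0, a1, a2 = 10.0, -1.0, 0.5
b0, b1 = 2.0, 1.0

def reduced_form(x):
    """Solve the two structural equations jointly, expressing the
    endogenous p and q in terms of the exogenous x alone."""
    p = (a0 - b0 + a2 * x) / (b1 - a1)
    q = b0 + b1 * p
    return p, q

p, q = reduced_form(x=4.0)  # p = 5.0, q = 7.0
```

Because p and q on the left-hand side depend only on x, regressing them on x (the reduced form) avoids the simultaneity bias that OLS on a single structural equation would suffer.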
The document summarizes research on nonlinear correlation coefficients on manifolds. It defines a new nonlinear correlation coefficient called SEVP, proves some of its basic properties including that it ranges from 0 to 1. It discusses how to measure nonlinear correlation between variables on a manifold and reviews common dimensionality reduction methods for manifolds. The goal is to preserve nonlinear structure as much as possible by projecting onto the orthogonal complement of tangent spaces. An optimization problem is formulated to find the linear space with the largest angle to all tangent spaces, transforming it into an eigenvalue problem to solve.
This document discusses methods for analyzing the relationship between two quantitative variables, including:
- Scatter diagrams can show the relationship and be used to identify if the variables are positively or negatively correlated.
- The linear correlation coefficient, r, quantifies the strength of the linear relationship between -1 and 1, where values closer to -1 or 1 indicate a stronger negative or positive correlation, respectively.
- Least-squares regression finds the best-fitting straight line to describe the linear relationship between two variables by minimizing the sum of the squared residuals. It can be used to make predictions, but may not be accurate far outside the original data range.
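The least-squares fit described above can be computed directly from the deviation sums. A small Python sketch with made-up data:

```python
from statistics import mean

def least_squares(x, y):
    """Slope and intercept of the line minimizing the sum of squared residuals."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, ybar - slope * xbar

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
slope, intercept = least_squares(x, y)  # roughly 1.94 and 0.15
```

Predictions from the fitted line (`slope * new_x + intercept`) are most trustworthy for `new_x` inside the range of the original x values, matching the caution above about extrapolation.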
Regression analysis is used to establish relationships between variables and make predictions. It can estimate a dependent variable from independent variables, extend to multiple predictors, and show the nature of relationships. The key objectives are establishing whether relationships exist and making forecasts. Regression requires interval-scale data, and the regression equation includes estimated parameters and an error term. The least-squares method chooses the parameters that minimize the errors between observed and estimated values of the dependent variable. Goodness of fit is measured by R-squared, while F-tests and t-tests determine statistical significance.
This document discusses correlation analysis in agriculture. It begins by defining correlation as the relationship between two or more variables. Some key points:
- Correlation can be positive (variables move in the same direction), negative (variables move in opposite directions), linear, nonlinear, simple, multiple, partial or total.
- Common types analyzed in agriculture include the relationship between yield and rainfall, price and supply, height and weight.
- Methods for measuring correlation are discussed, including Karl Pearson's coefficient of correlation (denoted by r), Spearman's rank correlation, and scatter diagrams.
- The value of r ranges from -1 to 1, with higher positive or negative values indicating a stronger linear relationship between variables.
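The contrast between Karl Pearson's r and Spearman's rank correlation mentioned above can be illustrated with a short sketch; the rainfall/yield numbers are hypothetical, and the simple ranking helper assumes no tied values:

```python
from statistics import mean

def pearson_r(x, y):
    """Karl Pearson's product-moment correlation coefficient."""
    xbar, ybar = mean(x), mean(y)
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = (sum((a - xbar) ** 2 for a in x) *
           sum((b - ybar) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]  # assumes no ties
    return pearson_r(rank(x), rank(y))

rainfall = [10, 20, 30, 40]
yield_t  = [1.0, 4.0, 9.0, 16.0]  # increases with rainfall, but nonlinearly
```

On these data Spearman's coefficient is exactly 1 (the relationship is perfectly monotone) while Pearson's r falls just short of 1, since the relationship is not a straight line.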
This document is an empirical assignment report submitted by a group of students analyzing the relationship between urbanization, transportation, GDP, and carbon dioxide emissions across 209 countries. The report finds that:
1) Carbon dioxide emission levels in a country can be significantly explained by its levels of urbanization and vehicle density, with higher levels of both associated with higher CO2 emissions.
2) The model used satisfies assumptions of classical linear regression, and urbanization and vehicle density jointly explain over 50% of the variation in CO2 emissions levels.
3) GDP per capita is also likely to influence CO2 emissions but is excluded from the main model due to multicollinearity with urbanization and vehicle density.
This document summarizes a research article about using particle swarm optimization to find a different shrinkage parameter (k value) for each explanatory variable in ridge regression, rather than a single k value. Ridge regression is used to address multicollinearity in multiple regression analysis. Typically, ridge regression estimates a single k value; this study instead uses a particle-swarm-based algorithm to estimate a separate k for each variable. The new method is applied to real data and simulations to evaluate its performance against other ridge regression methods.
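The core idea, replacing the single ridge constant with per-variable shrinkage, amounts to solving (X'X + diag(k)) b = X'y instead of the OLS normal equations. A toy two-predictor sketch follows; the matrices are invented, and the particle swarm search for k itself is not shown:

```python
def generalized_ridge_2var(xtx, xty, k):
    """Solve (X'X + diag(k)) b = X'y for two predictors by Cramer's rule.
    Ordinary ridge uses k[0] == k[1]; here each predictor gets its own k."""
    a = [[xtx[0][0] + k[0], xtx[0][1]],
         [xtx[1][0], xtx[1][1] + k[1]]]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    b0 = (a[1][1] * xty[0] - a[0][1] * xty[1]) / det
    b1 = (a[0][0] * xty[1] - a[1][0] * xty[0]) / det
    return b0, b1

xtx = [[2.0, 1.0], [1.0, 2.0]]  # X'X for a small hypothetical design
xty = [3.0, 3.0]                # X'y

ols    = generalized_ridge_2var(xtx, xty, k=[0.0, 0.0])  # reduces to OLS: (1.0, 1.0)
shrunk = generalized_ridge_2var(xtx, xty, k=[0.5, 2.0])  # second coefficient shrinks more
```

With k = [0, 0] the estimate is the OLS solution; giving the second predictor a larger k shrinks its coefficient harder, which is exactly the flexibility the per-variable approach buys.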
This document summarizes the results of an analysis of factors influencing individuals' job satisfaction using panel data from the British Household Panel Survey. A fixed effects model was preferred to a random effects model based on a Hausman test. The analysis found that being married, having an improved financial situation compared to the previous year, and living outside of London were associated with higher levels of job satisfaction, while a worse financial situation was associated with lower satisfaction. Regional differences in satisfaction were also observed.
1. Regression analysis is a statistical technique used to model relationships between variables and make predictions. It can be used to describe relationships, estimate coefficients, make predictions, and control systems.
2. Linear regression models describe straight-line relationships between variables, while non-linear models describe curved relationships. The goodness of fit of a model can be evaluated using the coefficient of determination.
3. The least squares method is used to fit regression lines by minimizing the sum of the squared vertical distances between observed and estimated y-values for a regression of y on x, or minimizing the sum of squared horizontal distances for a regression of x on y.
This study evaluated the performance of bootstrap confidence intervals for estimating slope coefficients in Model II regression with three or more variables. Simulation studies were conducted for different correlation structures between variables, sampling from both normal and lognormal distributions. The results showed that bootstrap intervals provided less than the nominal 95% coverage. Scenarios with strong relationships between variables produced better coverage, while scenarios with weaker relationships and bias produced poorer coverage, even with larger sample sizes. Future work could explore additional scenarios and alternative interval methods to improve accuracy of confidence intervals in Model II regression.
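The general flavor of interval studied here, a percentile bootstrap for a slope, can be sketched as follows; the data are simulated, and an ordinary least-squares slope stands in for the Model II estimator:

```python
import random
from statistics import mean

def ls_slope(pairs):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    xbar = mean(x for x, _ in pairs)
    ybar = mean(y for _, y in pairs)
    num = sum((x - xbar) * (y - ybar) for x, y in pairs)
    den = sum((x - xbar) ** 2 for x, _ in pairs)
    return num / den

def bootstrap_slope_ci(pairs, n_boot=2000, alpha=0.05, seed=1):
    """Percentile-bootstrap confidence interval for the slope:
    resample (x, y) pairs with replacement, refit, take empirical quantiles."""
    rng = random.Random(seed)
    boots = sorted(ls_slope([rng.choice(pairs) for _ in pairs])
                   for _ in range(n_boot))
    return boots[int(n_boot * alpha / 2)], boots[int(n_boot * (1 - alpha / 2)) - 1]

# Simulated data with true slope near 2
data = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.2), (5, 9.9),
        (6, 12.1), (7, 14.0), (8, 15.8), (9, 18.2), (10, 19.9)]
lo, hi = bootstrap_slope_ci(data)
```

Whether such intervals actually achieve the nominal 95% coverage is exactly the question the simulation study addresses; the percentile method above is the simplest variant and often undercovers.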
16 USING LINEAR REGRESSION PREDICTING THE FUTURE / 16 MEDIA LIBRARY.docx (hyacinthshackley2629)
16 USING LINEAR REGRESSION PREDICTING THE FUTURE
16: MEDIA LIBRARY
Premium Videos
Core Concepts in Stats Video
· Linear Regression
Lightboard Lecture Video
· Multiple Regression
Time to Practice Video
· Chapter 16: Problem 2
Difficulty Scale
(as hard as they get!)
WHAT YOU WILL LEARN IN THIS CHAPTER
· Understanding how prediction works and how it can be used in the social and behavioral sciences
· Understanding how and why linear regression works when predicting one variable on the basis of another
· Judging the accuracy of predictions
· Understanding how multiple regression works and why it is useful
INTRODUCTION TO LINEAR REGRESSION
You’ve seen it all over the news—concern about obesity and how it affects work and daily life. A set of researchers in Sweden was interested in looking at how well mobility disability and/or obesity predicted job strain and whether social support at work can modify this association. The study included more than 35,000 participants, and differences in job strain mean scores were estimated using linear regression, the exact focus of what we are discussing in this chapter. The results found that level of mobility disability did predict job strain and that social support at work significantly modified the association among job strain, mobility disability, and obesity.
Want to know more? Go to the library or go online …
Norrback, M., De Munter, J., Tynelius, P., Ahlstrom, G., & Rasmussen, F. (2016). The association of mobility disability, weight status and job strain: A cross-sectional study. Scandinavian Journal of Public Health, 44, 311–319.
WHAT IS PREDICTION ALL ABOUT?
Here’s the scoop. Not only can you compute the degree to which two variables are related to one another (by computing a correlation coefficient as we did in Chapter 5), but you can also use these correlations to predict the value of one variable based on the value of another. This is a very special case of how correlations can be used, and it is a very powerful tool for social and behavioral sciences researchers.
The basic idea is to use a set of previously collected data (such as data on variables X and Y), calculate how correlated these variables are with one another, and then use that correlation and the knowledge of X to predict Y. Sound difficult? It’s not really, especially once you see it illustrated.
For example, a researcher collects data on total high school grade point average (GPA) and first-year college GPA for 400 students in their freshman year at the state university. He computes the correlation between the two variables. Then, he uses the techniques you’ll learn about later in this chapter to take a new set of high school GPAs and (knowing the relationship between high school GPA and first-year college GPA from the previous set of students) predict what first-year GPA should be for a new student who is just starting out. Pretty nifty, huh?
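The prediction step in the GPA example can be written directly from the correlation: the regression line is y_hat = ybar + r * (s_y / s_x) * (x - xbar). A small sketch with invented GPA pairs (not the 400-student data described above):

```python
from statistics import mean, stdev

def fit_predictor(x, y):
    """Return a function predicting y from x via the regression line
    y_hat = ybar + r * (sy / sx) * (x - xbar)."""
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    r = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / ((len(x) - 1) * sx * sy)
    return lambda new_x: ybar + r * (sy / sx) * (new_x - xbar)

# Hypothetical (high school GPA, first-year college GPA) pairs
hs  = [2.0, 2.5, 3.0, 3.5, 4.0]
col = [1.8, 2.4, 2.8, 3.3, 3.9]
predict = fit_predictor(hs, col)
prediction = predict(3.2)  # predicted first-year GPA for a new student
```

Note that at the mean of x the line predicts exactly the mean of y, and the stronger the correlation r, the more the prediction moves away from that mean for students with unusual high school GPAs.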
Here’s another example. A group of kindergarten teachers is interested in finding out how well ex.
16 USING LINEAR REGRESSION PREDICTING THE FUTURE16 MEDIA LIBRAR.docxnovabroom
16 USING LINEAR REGRESSION PREDICTING THE FUTURE
16: MEDIA LIBRARY
Premium Videos
Core Concepts in Stats Video
· Linear Regression
Lightboard Lecture Video
· Multiple Regression
Time to Practice Video
· Chapter 16: Problem 2
Difficulty Scale
(as hard as they get!)
WHAT YOU WILL LEARN IN THIS CHAPTER
· Understanding how prediction works and how it can be used in the social and behavioral sciences
· Understanding how and why linear regression works when predicting one variable on the basis of another
· Judging the accuracy of predictions
· Understanding how multiple regression works and why it is useful
INTRODUCTION TO LINEAR REGRESSION
You’ve seen it all over the news—concern about obesity and how it affects work and daily life. A set of researchers in Sweden was interested in looking at how well mobility disability and/or obesity predicted job strain and whether social support at work can modify this association. The study included more than 35,000 participants, and differences in job strain mean scores were estimated using linear regression, the exact focus of what we are discussing in this chapter. The results found that level of mobile disability did predict job strain and that social support at work significantly modified the association among job strain, mobile disability, and obesity.
Want to know more? Go to the library or go online …
Norrback, M., De Munter, J., Tynelius, P., Ahlstrom, G., & Rasmussen, F. (2016). The association of mobility disability, weight status and job strain: A cross-sectional study. Scandinavian Journal of Public Health, 44, 311–319.
WHAT IS PREDICTION ALL ABOUT?
Here’s the scoop. Not only can you compute the degree to which two variables are related to one another (by computing a correlation coefficient as we did in Chapter 5), but you can also use these correlations to predict the value of one variable based on the value of another. This is a very special case of how correlations can be used, and it is a very powerful tool for social and behavioral sciences researchers.
The basic idea is to use a set of previously collected data (such as data on variables X and Y), calculate how correlated these variables are with one another, and then use that correlation and the knowledge of X to predict Y. Sound difficult? It’s not really, especially once you see it illustrated.
For example, a researcher collects data on total high school grade point average (GPA) and first-year college GPA for 400 students in their freshman year at the state university. He computes the correlation between the two variables. Then, he uses the techniques you’ll learn about later in this chapter to take a new set of high school GPAs and (knowing the relationship between high school GPA and first-year college GPA from the previous set of students) predict what first-year GPA should be for a new student who is just starting out. Pretty nifty, huh?
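Here’s a quick sketch (in Python, with made-up GPA numbers, not real student data) of how the prediction machinery works: compute the slope and intercept of the regression line from the paired data, then plug a new high school GPA into the line.

```python
# Hypothetical (high school GPA, first-year college GPA) pairs -- illustrative only.
hs = [3.0, 3.4, 2.8, 3.9, 3.2, 3.6]
college = [2.7, 3.1, 2.5, 3.8, 3.0, 3.3]

n = len(hs)
mean_x = sum(hs) / n
mean_y = sum(college) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hs, college))
sxx = sum((x - mean_x) ** 2 for x in hs)

b = sxy / sxx              # slope of the regression line
a = mean_y - b * mean_x    # intercept

# Predict first-year college GPA for a new student with a 3.5 high school GPA.
predicted = a + b * 3.5
print(round(predicted, 2))
```

The same two summary quantities (slope and intercept) are all the regression line needs, no matter how many students you predict for afterward.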
Here’s another example. A group of kindergarten teachers is interested in finding out how well ex.
Analysis of the Boston Housing Data from the 1970 census
ysstats@bu.edu U37074009
Analysis of the Boston Housing Data
from the 1970 census:
Diverse Tests and Model Selection Processes regarding the
Variables in Boston Housing Data
Shuai Yuan
December 8, 2016
Abstract
In this project, we study the Boston Housing Data offered by Harrison and Rubinfeld (1978). The data contain many variables related to housing for 506 tracts of Boston from the 1970 census and are included in the R package mlbench. Using the data and the R software, we first study the scatterplot matrix and the correlations of different variables to get a brief view of their relations. Then, we test several null hypotheses to examine the properties of different models. Finally, we perform model selection using different methods such as the forward algorithm, the backward algorithm, and the AIC and BIC criteria to find and analyze the best-fitting model for our data set. At the same time, we also compute the SSPE for a subset of the data.
Contents
1 Introduction
2 Analysis
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
2.1.2 Explanation of correlation between two variables
2.2 The statistical tests for the null hypotheses of the fitted model
2.3 Model selection by using the forward algorithm
2.4 Model selection by using the backward algorithm
2.5 Model selection by using the AIC and BIC criteria
2.6 Analysis of the related statistics
2.6.1 Fit the model by using the subset of the data
2.6.2 Compute and analyze the SSPE for the subset of the data
3 Conclusion
4 Appendix
1 Introduction
The Boston Housing data from the 1970 census are used in this project. The dataset contains 14 variables with 506 observations and is included in the R package mlbench.
In this project, we use various tools to analyze the Boston Housing data, the most frequently used being linear regression. We also use hypothesis testing (t-tests and F-tests) as well as model selection to analyze the properties of the data. Using the data and the R software, we first study the scatterplot matrix and the correlations of different variables to get a brief view of their relations. Then, we test several null hypotheses to examine the properties of different models. Finally, we perform model selection using different methods such as the forward algorithm, the backward algorithm, and the AIC and BIC criteria to find and analyze the best-fitting model for our data set. At the same time, we also compute the SSPE for a subset of the data.
The outline for the remainder of the paper is as follows. In Section 2, we provide the main results and analysis of the multiple aspects of our topic. Section 3 concludes. In the Appendix, we provide our R code as well as the related outputs, followed by the references used in this project. To be specific, part 2.1.1 addresses question 1, part 2.1.2 question 2, part 2.2 question 3, part 2.3 question 4, part 2.4 question 5, part 2.5 question 6, and part 2.6 question 7.
2 Analysis
To get a brief understanding of the relationships between the variables, we begin with the scatterplot matrix of four variables and find non-linearity between some of them. Therefore, the correlation of these variables may not be appropriate for describing the relationships among them. We also compute various test statistics and test several hypotheses for the general model. Moreover, we perform variable selection using the forward algorithm, the backward algorithm, and the AIC and BIC criteria. We find that both criteria select the same model, and we explain why the selected model is the one we need. Finally, we fit the selected model on a subset of the data and compute and compare the SSPE of the selected models.
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
First, according to the description of the R Package “mlbench”, we can get the meaning of the
following variables as well as the scatterplot matrix for these four variables which are listed below:
𝒏𝒐𝒙: nitric oxides concentration (parts per 10 million).
𝒊𝒏𝒅𝒖𝒔: proportion of non-retail business acres per town.
𝒅𝒊𝒔: weighted distances to five Boston employment centers.
𝒕𝒂𝒙: full-value property-tax rate per USD 10,000.
Plot 1: Scatterplot matrix for the variables nox, indus, dis, tax
According to the scatterplot matrix, these four variables are all related in some pattern. For instance, generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable 𝑑𝑖𝑠, and the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠. On the other hand, the relationships between the other variables tend to be positive at low values, while they become vague and weak at high values.
We can also find possible explanations from the meanings of these variables. The variable 𝑛𝑜𝑥 is the nitric oxides concentration (parts per 10 million), which represents the degree of air pollution in an area. The variable 𝑑𝑖𝑠 is the weighted distance to five Boston employment centers, which represents how far an area is from downtown. The variable 𝑖𝑛𝑑𝑢𝑠 is the proportion of non-retail business acres per town, which represents the level of industrial and commercial activity. As we all know, the air in areas far away from downtown is better because there are more trees and less traffic and industry, so the level of pollution there is lower. It is therefore reasonable to see a negative relationship between the variables 𝑛𝑜𝑥 and 𝑑𝑖𝑠. Likewise, industrial and business activity is concentrated near the employment centers, so the proportion of non-retail business land in areas far from downtown is smaller than in downtown areas. It is therefore reasonable to see a negative relationship between the variables 𝑖𝑛𝑑𝑢𝑠 and 𝑑𝑖𝑠.
2.1.2 Explanation of Correlation between two variables
We know that the formula for the correlation coefficient between two variables X and Y is:

ρ_{XY} = Cov(X, Y) / ( √D(X) · √D(Y) )
Therefore, according to the R code, we find that the correlation between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 is about -0.7692301, which suggests that these two variables are negatively correlated.
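As a sanity check on the formula above, the coefficient can be reproduced by hand from the covariance and the two standard deviations. A minimal Python sketch with made-up numbers (not the Boston data):

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly 2*x, so rho should be close to +1

n = len(x)
mx, my = sum(x) / n, sum(y) / n
# Sample covariance and standard deviations (n - 1 denominators cancel in rho).
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

rho = cov / (sx * sy)
print(round(rho, 4))
```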
However, we should not forget that the correlation coefficient between two variables measures the linear relationship between them. We can see from the scatterplot that the relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠 looks more like a curved (roughly exponential) relationship, which means it is not reasonable to use the correlation coefficient between these two variables to examine the relationship between them.
On the other hand, we can also test their relationship by fitting a model between them. In the model, we assume a reciprocal relation between them and obtain a significant p-value for this model. Therefore, according to the discussion above, we can safely draw the conclusion that we cannot use the correlation between these two variables to quantify the strength of the relationship between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠.
2.2 The statistical tests for the Null Hypotheses of the fitted model
For this question, the full model given contains only five variables and an intercept:

nox = β0 + β1·dis + β2·log(dis) + β3·dis² + β4·indus + β5·tax + ε

Here β0 is the intercept, β1 measures the change in the variable 𝑛𝑜𝑥 if the variable 𝑑𝑖𝑠 increases by one unit, β2 the change if log(𝑑𝑖𝑠) increases by one unit, β3 the change if 𝑑𝑖𝑠² increases by one unit, β4 the change if 𝑖𝑛𝑑𝑢𝑠 increases by one unit, and β5 the change if 𝑡𝑎𝑥 increases by one unit. Since three of these variables are already given in the data set, we just need to transform and add the remaining two, log(dis) and dis². We therefore create two new variables, named logdis and dissquare, and use them to refer to log(dis) and dis².
For this section, since we want to decide whether specified parameters are equal to 0 or to each other, we use an F-test for all three sub-questions. The F statistic for comparing a reduced model with the full model is

F = [ (RSS_reduced − RSS_full) / (df_reduced − df_full) ] / ( RSS_full / df_full )

where RSS denotes the residual sum of squares and df the residual degrees of freedom.
The F values and corresponding p-values for the three sub-questions are summarized below:

            question a   question b   question c
F value     5.911        6.0524       42.80353
p-value     0.0154       0.002528     < 0.0001

Table 1 The F values and p-values of questions a, b, c
We will use these values to evaluate each question below.
Question a:
According to the definition of the null hypothesis, the main idea is to test whether the coefficient of the variable log(dis) is equal to 0 or not. Since the variable log(dis) is the only target we focus on here, we can build a new regression model that does not contain the variable log(dis) and compare it with the original regression model. When comparing two nested regression models, we use the F-test to see if they are significantly different from each other. From the results of the R code, the F value is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis depends on the significance level α. Here, we set α to 0.05. Since the p-value is smaller than 0.05, we reject the null hypothesis and conclude that β2 is not equal to 0 at the 95% confidence level. However, if we want to be 99% confident about the result, α changes to 0.01. Since the p-value is bigger than 0.01, we cannot reject the null hypothesis at the 99% confidence level.
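The F statistic for question a can be recovered directly from the two residual sums of squares in the anova() output reproduced in the Appendix. A minimal Python sketch (the small discrepancy from the printed 5.911 comes from the RSS values being rounded to four decimals):

```python
# RSS and residual df from the anova(u1, u2) output in the Appendix.
rss_full, df_full = 1.6897, 500        # model with log(dis)
rss_reduced, df_reduced = 1.7097, 501  # model without log(dis)

f_stat = ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
print(round(f_stat, 2))  # about 5.92; R reports 5.911 from full-precision RSS
```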
Question b:
For part b, we want to determine whether the coefficients of the variable dis and the variable dis² are equal to 0 or not. Since this question focuses on two variables and asks whether they are jointly different from 0, we can do a similar test as in part a. For this question, we build another regression model that contains only the intercept and the three variables other than dis and dis². Then we compare the new regression model with the original full model using an F-test. Here the null hypothesis is β1 = β3 = 0, and the alternative hypothesis is that at least one of them is not equal to 0. The F value is 6.0524 and the corresponding p-value is 0.002528. Similarly, we set α to 0.05. Since the p-value is smaller than 0.05, we reject the null hypothesis and conclude that at least one of β1, β3 is not equal to 0.
Question c:
The situation for part c is different. The question asks whether β2 = β3 = 0 and whether β4 = β5. We will not use the approach above but instead use a matrix to express the hypothesis. We split the first part (β2 = β3 = 0) into β2 = 0 and β3 = 0. So the first row of the matrix A has a “1” in the position corresponding to β2 and “0” everywhere else. The second row of A has a “1” in the position corresponding to β3 and “0” everywhere else. When we multiply A by the coefficient vector, the first two entries of the product are just β2 and β3. To test whether β4 = β5, we put a “1” in the position of β4 and a “-1” in the position of β5 in the third row of A, so the third entry of the product is β4 − β5. To test whether each of these quantities equals 0, we perform an F-test. The F value is 42.80353 and the corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than α, so we reject the null hypothesis and conclude that at least one of β2, β3 is not equal to 0, or that β4 is not equal to β5.
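The contrast matrix A described above can be written out explicitly. In the sketch below (Python, with an arbitrary illustrative coefficient vector, not the fitted one), the three rows of A pick out exactly β2, β3, and β4 − β5:

```python
# One row per constraint; columns correspond to (beta0, ..., beta5).
A = [
    [0, 0, 1, 0, 0, 0],   # tests beta2 = 0
    [0, 0, 0, 1, 0, 0],   # tests beta3 = 0
    [0, 0, 0, 0, 1, -1],  # tests beta4 - beta5 = 0
]

beta = [0.5, -1.2, 0.3, 0.0, 2.0, 2.0]  # arbitrary values for illustration

# A @ beta; the null hypothesis H0: A beta = 0 fails here because beta2 != 0.
Abeta = [sum(a * b for a, b in zip(row, beta)) for row in A]
print(Abeta)  # [0.3, 0.0, 0.0]
```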
2.3 Model selection by using the forward algorithm
In this section, we will use the method of forward algorithm to analyze the relationship between
response variable and potential explanatory variables below. Moreover, according to the question’s
requirements, we have transformed the original variables to different formats, which are all
presented below.
Response variable:
𝐥𝐨𝐠(𝐦𝐞𝐝𝐯), which means that we now use the natural logarithm of the median value of owner-
occupied homes in $1000's.
Potential explanatory variables:
𝐫𝐦^𝟐, which means the square of average number of rooms per dwelling.
𝐥𝐨𝐠(𝐝𝐢𝐬), which means the natural logarithm of weighted distances to five Boston employment
centers.
𝐚𝐠𝐞, which means the proportion of owner-occupied units built prior to 1940.
We performed variable selection using a forward algorithm at the 5% significance level. For the forward algorithm, we first regressed the response on each candidate term separately. We name the models “forward11” to “forward14”; details can be found in the Appendix. The results of the regressions are summarized as follows:
name model variable t - value Pr(>|t|)
forward11 log(medv) ~ 1 intercept 167 <2e-16
forward12 log(medv) ~ rm^2 - 1 rm^2 130 <2e-16
forward13 log(medv) ~ age - 1 age 44.84 <2e-16
forward14 log(medv) ~ log(dis) - 1 log(dis) 54.66 <2e-16
Table 2 The summary of different models from forward11 to forward14
We observe from the table that, while all the variables are significant, the intercept has the largest t-value. Hence, we add the intercept to our model first. Next, we regressed the intercept together with each of the remaining three variables in the models “forward21” to “forward23”. The summarized results are shown in the table below:
name model variable t - value Pr(>|t|)
forward21 log(medv) ~ rm^2 rm^2 18.8 <2e-16
forward22 log(medv) ~ age age 11.42 <2e-16
forward23 log(medv) ~ log(dis) log(dis) 9.965 <2e-16
Table 3 The summary of different models from forward21 to forward23
As shown in the table, the p-values of all the variables are significant. However, the variable rm^2 has the largest t-value among them, so we added rm^2 to the model. Then, we tested adding each of the remaining variables, log(dis) and age, to the model with rm^2 and the intercept, in the models named “forward31” and “forward32”. We got the following table:
name model variable t - value Pr(>|t|)
forward31 log(medv) ~ rm^2 + log(dis) log(dis) 8.269 1.21e-15
forward32 log(medv) ~ rm^2 + age age -10.23 <2e-16
Table 4 The summary of different models from forward31 to forward32
From the results above, both remaining variables are significant, but the variable age has a larger absolute t-value (and hence a smaller p-value) than log(dis). Therefore, we add the variable age to our model. Finally, we regressed the response variable log(medv) on all of the variables in the model “forward41”.
name model variable t - value Pr(>|t|)
forward41 log(medv) ~ rm^2 + age + log(dis) log(dis) 1.068 0.286
Table 4 The summary of different models from forward41
Based on the table above, the variable log(dis) is not significant in the model and thus we do not add it. Therefore, after the forward selection, our final model is

log(medv) = β0 + β1·rm² + β2·age + ε

where ε is the error term.
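Each single-variable step of this forward search amounts to ranking the remaining candidates by the t statistic of a simple regression, which can be computed from the correlation alone as t = r·sqrt((n − 2)/(1 − r²)). A Python sketch on made-up data (not the Boston variables):

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_stat(x, y):
    # t statistic for the slope in a simple regression of y on x
    r = pearson(x, y)
    n = len(x)
    return r * math.sqrt((n - 2) / (1 - r * r))

y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]    # response
candidates = {
    "x1": [1, 2, 3, 4, 5, 6],           # strongly related to y
    "x2": [3, 1, 4, 1, 5, 9],           # weakly related to y
}

# One forward step: add the candidate with the largest absolute t value.
best = max(candidates, key=lambda k: abs(t_stat(candidates[k], y)))
print(best)
```

Repeating this step on the residual variation (and stopping when no candidate is significant) gives the full forward procedure used above.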
2.4 Model selection by using the backward algorithm
In this section, we use the backward algorithm to analyze the relationship between the response variable and the potential explanatory variables, using the transformed variables defined in the previous section. We performed variable selection using a backward algorithm at the 5% significance level. For the backward algorithm, we first regressed the response on all the variables together, in the model named “backward11”; details can be found in the Appendix. The results of the regression are summarized as follows:
name model variable t - value Pr(>|t|)
backward11 log(medv) ~ rm^2 + age + log(dis) intercept 21.224 <2e-16
rm^2 17.676 <2e-16
age -5.758 1.48e-08
log(dis) 1.068 0.286
Table 5 The summary of different models from backward11
Based on the result, we can see that except for the variable log(dis), whose t-value is 1.068 and p-value is 0.286, all the explanatory variables are significant. Thus, we removed the variable log(dis) and built a new model with the remaining variables, called “backward21”. Here are the results:
name model variable t - value Pr(>|t|)
backward21 log(medv) ~ rm^2 + age intercept 32.12 <2e-16
rm^2 17.85 <2e-16
age -10.23 <2e-16
Table 6 The summary of different models from backward21
After deleting the variable log(dis) from the model, the remaining variables are all significant, so we ended up with the model “backward21”. This is the same model obtained by the forward algorithm,

log(medv) = β0 + β1·rm² + β2·age + ε

where ε is the error term.
2.5 Model selection by using the AIC and BIC criteria
First of all, we can do a preliminary analysis of the full model we are interested in. In the full linear regression model, the t-value and p-value are used to determine whether each variable is significant for the model. Setting α = 0.05, we can easily see that the variables rm^2 and age and the intercept have p-values smaller than 0.05, which means they are significant. However, the variable log(dis) has a p-value of 0.286, which is not significant at all.
In this section, we perform variable selection using the AIC and BIC criteria. AIC is a measure of the relative quality of statistical models for a given set of data: given a collection of models, AIC estimates the quality of each model relative to the others, and hence provides a means for model selection. BIC is likewise a criterion for model selection among a finite set of models, and the model with the lowest BIC score is preferred. The formulas for AIC and BIC are shown below,
AIC(m) = n · log( RSS_m / n ) + 2 · m′

BIC(m) = n · log( RSS_m / n ) + log(n) · m′

where m is the regression model, n is the sample size, and m′ denotes the number of estimated coefficients (including the intercept) in the model m. In this project, the sample size is 506, and all we need to do is put every possible regression model into R to compute the corresponding AIC and BIC scores. The candidate models and their scores are summarized below:
Candidate Models AIC Score BIC Score
log(medv) ~ 1 -904.371 -900.145
log(medv) ~ rm^2 - 1 -659.289 -655.063
log(medv) ~ age - 1 321.927 326.154
log(medv) ~ log(dis) - 1 155.969 160.195
log(medv) ~ rm^2 + log(dis) - 1 -750.189 -741.736
log(medv) ~ log(dis) + age - 1 -533.471 -525.018
log(medv) ~ rm^2 + age - 1 -702.556 -694.102
log(medv) ~ age -1018.83 -1010.378
log(medv) ~ rm^2 -1171.36 -1162.907
log(medv) ~ log(dis) -993.379 -984.926
log(medv) ~ rm^2 + log(dis) -1233.86 -1221.175
log(medv) ~ rm^2 + age -1265.07 -1252.394
log(medv) ~ log(dis) + age -1021.36 -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1 -940.149 -929.47
log(medv) ~ rm^2 + log(dis) + age -1264.22 -1247.315
log(medv) ~ -1 1132.453 1132.453
Table 7 The AIC and BIC scores of all possible models
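Note that the two criteria differ only in the penalty term, so for the same model BIC − AIC = (log n − 2)·m′; with n = 506 and m′ = 3 this gap is about 12.68, matching (up to rounding) the gap between the two columns of Table 7 for the three-coefficient models. A Python sketch of the two formulas (the RSS value below is a made-up placeholder, not from an actual fit):

```python
import math

def aic(n, rss, m):
    # m = number of estimated coefficients (including the intercept)
    return n * math.log(rss / n) + 2 * m

def bic(n, rss, m):
    return n * math.log(rss / n) + math.log(n) * m

n, m = 506, 3     # 506 tracts; intercept + rm^2 + age
rss = 18.0        # placeholder RSS, for illustration only

gap = bic(n, rss, m) - aic(n, rss, m)
print(round(gap, 3))  # (log(506) - 2) * 3, independent of the RSS value
```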
From the table above, we can see that the regression model with the smallest AIC score contains the variables rm^2 and age as well as the intercept, and the regression model with the smallest BIC score is the same model. When we check this regression model with α = 0.05, all the variables in it are significant. So we select the model that contains the variables rm^2, age and the intercept under both the AIC and the BIC criterion.
2.6 Analysis of the related statistics
2.6.1 Fit the model by using the subset of the data
According to the results above, we finally choose the model “m12”, which has the minimum value of BIC, as our fitted model. From question 6, the fitted model can be written as

log(medv) = β0 + β1·rm² + β2·age + ε

where ε is the error term. Therefore, we can now use the data from Group1 to fit the above model. From the results generated by R, the fitted model is:

log(medv) = 2.3360 + 0.0256·rm² − 0.0048·age

Moreover, the p-values of all the explanatory variables are significant at all conventional levels.
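Because the response is log(medv), a prediction on the original dollar scale requires exponentiating the fitted value. A Python sketch using the coefficients above for a hypothetical tract (the inputs rm = 6 and age = 50 are made up):

```python
import math

b0, b1, b2 = 2.3360, 0.0256, -0.0048  # fitted coefficients from the Group1 model

rm, age = 6.0, 50.0                   # hypothetical tract: 6 rooms, 50% built pre-1940
log_medv = b0 + b1 * rm**2 + b2 * age
medv = math.exp(log_medv)             # back to the $1000s scale
print(round(medv, 1))                 # about 20.4, i.e. roughly $20,400
```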
2.6.2 Compute and analyze the SSPE for subset of the data
On the other hand, we can also apply another method, called cross-validation, to further analyze the model selection process. For this method, we apply the following steps. First, we split the data into two subsets according to a user-defined criterion, Group1 and Group2, also called the training data and the validation data. Second, we fit the model using the data from Group1. Third, based on the data from Group2, we predict the response variable log(medv); denote the observed value for tract i by y_i and the predicted value by ŷ_i. Finally, we compute the SSPE, the “Sum of Squared Prediction Errors”.
Therefore, according to the question, we first divided the original data set “BostonHousing” into two groups, Group1 and Group2. We can then compute the SSPE for Group2 according to its definition:

SSPE = Σ_{i=1}^{n} ( y_i − ŷ_i )²

where y_i denotes the observed response log(medv) in Group2 and ŷ_i denotes the predicted value of the response, computed with the prediction function in R. The SSPE of Group2 is 0.02835043.
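The SSPE itself is a one-line computation once predictions are available. A Python sketch with made-up observed and predicted log(medv) values:

```python
observed  = [3.02, 2.87, 3.15, 2.96]   # log(medv) in the validation set (made up)
predicted = [3.00, 2.90, 3.10, 3.00]   # model predictions for the same tracts (made up)

# Sum of squared prediction errors over the validation set.
sspe = sum((o - p) ** 2 for o, p in zip(observed, predicted))
print(round(sspe, 4))
```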
At the same time, we can see that the model we get from part 2.4 (question 5),

log(medv) = β0 + β1·rm² + β2·age + ε

where ε is the error term, is the same as the model from part 2.5 (question 6). Therefore, we get the same results for the same model.
3 Conclusion
In this project, we first obtained the scatterplot matrix of four variables, 𝑛𝑜𝑥, 𝑖𝑛𝑑𝑢𝑠, 𝑑𝑖𝑠 and 𝑡𝑎𝑥. According to the scatterplot matrix, we found that these four variables are all related in some pattern. Generally speaking, the variable 𝑛𝑜𝑥 is negatively related to the variable 𝑑𝑖𝑠, and the variable 𝑖𝑛𝑑𝑢𝑠 is also negatively related to the variable 𝑑𝑖𝑠; the relationships between the other variables tend to be positive at low values and become vague and weak at high values. We can also find possible explanations from the meanings of these variables. In addition, we found non-linearity between the variable 𝑛𝑜𝑥 and the variable 𝑑𝑖𝑠. Therefore, we cannot use the correlation between these two variables to quantify the strength of the relationship between 𝑛𝑜𝑥 and 𝑑𝑖𝑠.
Second, we tested several null hypotheses for the fitted model. Using the F-test and the related p-values, we found that the p-values for the null hypotheses β2 = 0, β1 = β3 = 0, and β2 = β3 = 0 together with β4 = β5 are all smaller than 0.05, which means we reject all these null hypotheses at the 5% level.
Third, we used the forward algorithm to find the best model for the regression problem. Starting from the intercept, we added candidate variables one at a time and used the p-values of the variables to test whether each one is significant in the model. We found that the final model includes the variable rm², the variable age, and the intercept. At the same time, we also used the backward algorithm for model selection. For the backward algorithm, we started from the model containing all the variables and removed the non-significant variables one by one according to their p-values. According to the results, the model found through the backward algorithm is the same as the one found by the forward algorithm.
At the same time, we also used both the AIC and the BIC criterion for model selection. After doing the model selection, we found that the regression model with the smallest AIC score contains the variables rm^2 and age as well as the intercept, and the regression model with the smallest BIC score is the same model. With α = 0.05, all the variables in this model are significant, so we select the model containing the variables rm^2, age and the intercept under both the AIC and the BIC criterion.
Finally, we applied cross-validation to further analyze the model selection process, and computed the sum of squared prediction errors, SSPE, for Group2. The model we get from part 2.4 (question 5) is the same as the one from part 2.5 (question 6), so we get the same results for the same model.
4 Appendix
The following materials are the R codes used for this project. The contents in bold denote the original code.
R codes:
# Question 1:
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax,main="Scatterplot for nox,indus,dis,tax")
# Question 2:
> cor(nox,dis)
[1] -0.7692301
> model <- lm(nox ~ I(1/dis))  # I() is needed so 1/dis means the reciprocal, not formula nesting
> summary(model)
# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis+logdis+dissquare +indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis+dissquare +indus + tax, BostonHousing)
> anova(u1,u2)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 500 1.6897
2 501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(b)
> u3 <- lm(nox ~ logdis +indus + tax, BostonHousing)
> anova(u1,u3)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
Res.Df RSS Df Sum of Sq F Pr(>F)
1 500 1.6897