Polynomial regression is used to model the nonlinear relationship between employee wage and age. A quartic (4th degree) polynomial provides a good fit according to hypothesis testing of different polynomial models. The polynomial model estimates the probability of an employee earning over $250,000 based on their age, with this probability peaking around age 40-45 years old.
Moving Beyond Linearity [ISLR.2013.Ch7-7]
Theodore Grammatikopoulos∗
Tue 6th Jan, 2015
Abstract
Linear models are relatively simple to describe and implement, and have advantages over other approaches in terms of interpretation and inference. However, standard linear regression can have significant limitations in terms of predictive power. This is because the linearity assumption is almost always an approximation, and sometimes a poor one. In Article 6 we saw that we can improve on least squares estimation by using Ridge Regression, the Lasso, Principal Components Regression (PCR), Partial Least Squares (PLS), and other techniques. In that setting, the improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates. But we are still using a linear model, which can only be improved so far. Here we relax the linearity assumption while still attempting to maintain as much interpretability as possible. We do this by examining very simple extensions of linear models, such as Polynomial Regression and Piecewise Step Functions, as well as more sophisticated approaches such as Splines, Local Regression, and Generalized Additive Models (GAMs).
## OTN License Agreement: Oracle Technology Network - Developer
## Oracle Distribution of R version 3.0.1 -- "Good Sport"
## Copyright (C) The R Foundation for Statistical Computing
## Platform: x86_64-unknown-linux-gnu (64-bit)
∗ e-mail: tgrammat@gmail.com
1 Non-Linear Modeling
In this lab we re-analyze the Wage data set considered in the examples throughout Chapter
7 of “ISLR.2013” [James et al., 2013]. We will see that many of the complex non-linear
fitting procedures discussed in this Chapter can be easily implemented in R.
library(ISLR)
attach(Wage)
1.1 Polynomial Regression
As a first attempt to describe employees' wage in terms of their age, we will try a
polynomial regression fit. We will determine the best polynomial degree to use and then
examine to what extent this fit is successful.
The reason for searching for a non-linear fit to describe the wage ∼ age dependence is
almost apparent from the corresponding plot of the two variables, shown in Figure 1 below.
First, the employee data set can easily be distinguished into two groups, a "High Earners"
group and a "Low Earners" one. Secondly, the wage ∼ age dependence of the "Low Earners"
group is certainly non-linear, and most probably of higher than second polynomial degree.
Next, we should decide on the exact degree of the polynomial to use. In the article "Linear
Model Selection and Regularization (ISLR.2013.Ch6-6)" we studied two different ways to do
so: either by applying some subset variable selection method, or by using cross-validation.
Here, we will discuss an alternative approach, so-called hypothesis testing.
More specifically, we can fit models ranging from linear to degree-5 polynomial and seek to
determine the simplest model which is sufficient to explain the wage ∼ age relationship.
To do so we perform analysis of variance (ANOVA, F-test), by using the anova() function,
in order to test the null hypothesis that a model M1 is sufficient to explain the data against
the alternative hypothesis that a more complex model M2 is required. In order to use the
anova() function, M1 and M2 must be nested models: the predictors in M1 must be a
subset of the predictors in M2. In this case, we fit five different models and sequentially
compare the simpler model to the more complex model.
lm.Poly1.fit <- lm(wage ~ age, data = Wage)
lm.Poly2.fit <- lm(wage ~ poly(age, 2), data = Wage)
lm.Poly3.fit <- lm(wage ~ poly(age, 3), data = Wage)
lm.Poly4.fit <- lm(wage ~ poly(age, 4), data = Wage)
lm.Poly5.fit <- lm(wage ~ poly(age, 5), data = Wage)
Figure 1: Degree-4 polynomial fit for Employees' Wages vs their Age. The probability of an
employee being a high earner as a function of age is also depicted.
anova(lm.Poly1.fit, lm.Poly2.fit, lm.Poly3.fit, lm.Poly4.fit, lm.Poly5.fit)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value comparing the linear Model 1 to the quadratic Model 2 is essentially zero
(< 10^−15), indicating that a linear fit is not sufficient. Similarly, the p-value comparing
the quadratic Model 2 to the cubic Model 3 is very low (∼ 0.0017), so the quadratic fit is
also insufficient. The p-value comparing the cubic and degree-4 polynomials, Model 3 and
Model 4, is approximately 5%, while the degree-5 polynomial Model 5 seems unnecessary
because its p-value is ∼ 0.37 with a small F-statistic. Hence, either a cubic or a quartic
polynomial appears to provide a reasonable fit to the data, but lower- or higher-order
models are not justified.
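As a quick cross-check (an addition to the original text): since poly() produces orthogonal polynomials, the squared t-statistics from the single degree-5 fit reproduce the sequential F-statistics above; for example, the quadratic term gives roughly (−11.98)^2 ≈ 143.6.
# Cross-check: squared t-statistics of the degree-5 fit vs the anova() F-statistics
tvals <- coef(summary(lm.Poly5.fit))[, "t value"]
tvals[-1]^2  # the quadratic through quintic entries match the F column above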
Here, we choose to describe the wage ∼ age dependence by a quartic polynomial, i.e.:
wage ∼ Intercept + β1 ∗ age + β2 ∗ age^2 + β3 ∗ age^3 + β4 ∗ age^4 .
The estimated coefficients can be retrieved by the following call
lm.Poly4.fit.Wage <- lm.Poly4.fit
coef(summary(lm.Poly4.fit.Wage))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
Next, we create a grid of values for age at which we want predictions, call the generic
predict() function, and calculate the standard errors
ageMinMax <- range(Wage$age)
age.grid <- seq(from = ageMinMax[1], to = ageMinMax[2])
preds.poly <- predict(lm.Poly4.fit.Wage, newdata = list(age = age.grid), se.fit = TRUE)
se.bands <- cbind(preds.poly$fit + 2 * preds.poly$se.fit,
                  preds.poly$fit - 2 * preds.poly$se.fit)
Finally, we plot the data and add the degree-4 polynomial fit
par(mfrow = c(1, 2), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(Wage$age, Wage$wage, xlim = ageMinMax, cex = 0.5, col = "darkgrey",
     xlab = "Age", ylab = "Wage")
title("Wage vs Age Fit \n Degree-4 Polynomial [Wage]", outer = TRUE)
lines(age.grid, preds.poly$fit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 2, col = "blue", lty = 3)
As shown in Figure 1, the employee data set can easily be distinguished into two groups, a
"High Earners" group and a "Low Earners" one. To estimate the probability that an employee
annually earns more than 250k USD, we create the appropriate response vector for the
indicator variable I(wage > 250)
glm.Poly4.binomial.fit.Wage <- glm(I(wage > 250) ~ poly(age, 4), data = Wage,
                                   family = binomial)
and make the predictions as before.
glm.Poly4.binomial.preds.Wage <- predict(glm.Poly4.binomial.fit.Wage,
                                         newdata = list(age = age.grid), se.fit = TRUE)
However, calculating the probability P(Wage > 250 | Age) and its corresponding confidence
intervals is slightly more involved than in the linear regression case. The default
prediction type for a glm() model is type = "link", which is what we use here. This means
we get predictions for the logit, i.e. we have fit a model of the form
log [ P(Y = 1 | X) / (1 − P(Y = 1 | X)) ] = Xβ ,    (1)
so the predictions, as well as their standard errors, are on the Xβ (logit) scale. Therefore,
if we want to plot P(Wage > 250 | Age) as a function of the employee's age, we have to
transform the resulting fit accordingly, that is
P(Y = 1 | X) = exp(Xβ) / (1 + exp(Xβ)) ,    (2)
or in R code
preds <- glm.Poly4.binomial.preds.Wage
pfit <- exp(preds$fit)/(1 + exp(preds$fit))
se.bands.logit <- cbind(preds$fit + 2 * preds$se.fit, preds$fit -
2 * preds$se.fit)
se.bands <- exp(se.bands.logit)/(1 + exp(se.bands.logit))
and plot the result, which is shown in the right panel of Figure 1.
plot(age, I(wage > 250), xlim = ageMinMax, type = "n", ylim = c(0, 0.2),
     xlab = "Age", ylab = "P(Wage>250|Age)")
points(jitter(age), (I(wage > 250)/5), cex = 0.5, pch = "|", col = "darkgrey")
lines(age.grid, pfit, lwd = 2, col = "blue")
matlines(age.grid, se.bands, lwd = 2, lty = 3, col = "blue")
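As a side note (not part of the original code), the probabilities could also have been requested directly from predict() with type = "response"; these agree with the transformed logit-scale fit above, but standard-error bands built on the response scale can fall outside [0, 1], which is why the logit-scale transformation is preferred.
# Direct probability predictions, for comparison with pfit computed above
pfit.resp <- predict(glm.Poly4.binomial.fit.Wage, newdata = list(age = age.grid),
                     type = "response")
max(abs(pfit.resp - pfit))  # essentially zero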
It is interesting to note here that the poly(age, 4) term used in the fit above generates a
matrix whose columns are a basis of orthogonal polynomials, which essentially means that
each column is a linear combination of the variables age, age^2, age^3 and age^4. However,
we can also obtain a direct fit in the {age, age^2, age^3, age^4} variable basis by setting
raw = TRUE in the previous code, as shown below. This does not affect the model in a
meaningful way: the choice of basis affects the coefficient estimates, but it does not
affect the fitted values obtained.
# Direct Fit in {age,age^2,age^3,age^4} Basis
lm.Poly4.fit.Wage2 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
coef(summary(lm.Poly4.fit.Wage2))
##                                 Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)                -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
## poly(age, 4, raw = TRUE)1   2.124552e+01 5.886748e+00  3.609042 0.0003123618
## poly(age, 4, raw = TRUE)2  -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
## poly(age, 4, raw = TRUE)3   6.810688e-03 3.065931e-03  2.221409 0.0263977518
## poly(age, 4, raw = TRUE)4  -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
Two other equivalent ways of calculating the same fit, while protecting the power terms of
age inside the formula, are the following:
# Direct Fit in {age,age^2,age^3,age^4} Basis
lm.Poly4.fit.Wage3 <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
coef(summary(lm.Poly4.fit.Wage3))
# Direct Fit in {age,age^2,age^3,age^4} Basis
lm.Poly4.fit.Wage4 <- lm(wage ~ cbind(age, age^2, age^3, age^4), data = Wage)
coef(summary(lm.Poly4.fit.Wage4))
Comparing the fitted values obtained in either case, we find them to be identical, as
expected.
preds.raw <- predict(lm.Poly4.fit.Wage2, newdata = list(age = age.grid), se.fit = TRUE)
max(abs(preds.poly$fit - preds.raw$fit))
## [1] 8.739676e-12
Note:
The ANOVA method also works in more general cases, that is, when terms other than
orthogonal polynomials are also included. For example, we can use anova() to compare the
following four models
fit.1.Wage <- lm(wage ~ education + age, data = Wage)
fit.2.Wage <- lm(wage ~ education + poly(age, 2), data = Wage)
fit.3.Wage <- lm(wage ~ education + poly(age, 3), data = Wage)
fit.4.Wage <- lm(wage ~ education + poly(age, 4), data = Wage)
anova(fit.1.Wage, fit.2.Wage, fit.3.Wage, fit.4.Wage)
## Analysis of Variance Table
##
## Model 1: wage ~ education + age
## Model 2: wage ~ education + poly(age, 2)
## Model 3: wage ~ education + poly(age, 3)
## Model 4: wage ~ education + poly(age, 4)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2994 3867992
## 2 2993 3725395 1 142597 114.6595 < 2e-16 ***
## 3 2992 3719809 1 5587 4.4921 0.03413 *
## 4 2991 3719777 1 32 0.0255 0.87308
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
giving an outcome which actually supports the third model, i.e.:
Model 3 : wage ∼ education + poly(age, 3) .
Now, comparing this new model with the one we examined before, i.e.
wage ∼ Intercept + β1 ∗ age + β2 ∗ age^2 + β3 ∗ age^3 + β4 ∗ age^4 ,
we obtain the following results
# Split the data set in a Train and a Test Data Part
set.seed(356)
train <- sample(c(TRUE, FALSE), nrow(Wage), rep = TRUE)
test <- (!train)
# wage ~ poly(age,4) Model
lm.Poly4.fit <- lm(wage ~ poly(age, 4), data = Wage[train, ])
preds.polyNew <- predict(lm.Poly4.fit, newdata = Wage[test, ],
se.fit = TRUE)
# wage ~ education + poly(age,3)
fit.3.Wage <- lm(wage ~ education + poly(age, 3), data = Wage[train, ])
preds.fit.3 <- predict(fit.3.Wage, newdata = Wage[test, ], se.fit = TRUE)
# MSEs calculation
mse.polyNew <- mean((Wage$wage[test] - preds.polyNew$fit)^2)
mse.polyNew
## [1] 1622
mse.fit.3 <- mean((Wage$wage[test] - preds.fit.3$fit)^2)
mse.fit.3
## [1] 1278.3
which suggests that the new model,
Model 3 : wage ∼ education + poly(age, 3) ,
is in fact a better fit for predicting the employee's wage.
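Since a single random split can be noisy, a small sketch (an addition, not part of the original analysis) repeats the comparison over several seeds and averages the test MSEs:
# Average the test-MSE comparison over ten random train/test splits
mse.compare <- sapply(1:10, function(s) {
  set.seed(s)
  tr <- sample(c(TRUE, FALSE), nrow(Wage), replace = TRUE)
  m.poly4 <- lm(wage ~ poly(age, 4), data = Wage[tr, ])
  m.edu3 <- lm(wage ~ education + poly(age, 3), data = Wage[tr, ])
  c(poly4 = mean((Wage$wage[!tr] - predict(m.poly4, Wage[!tr, ]))^2),
    edu.poly3 = mean((Wage$wage[!tr] - predict(m.edu3, Wage[!tr, ]))^2))
})
rowMeans(mse.compare)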
1.2 Piece-wise constant functions
Here we try to fit a piece-wise constant function to describe the employee’s wage in terms
of their age. To do so we use the cut() function as shown below
stepfunction.lm.fit.Wage <- lm(wage ~ cut(age, 4), data = Wage)
coef(summary(stepfunction.lm.fit.Wage))
## Estimate Std. Error t value
## (Intercept) 94.158 1.476 63.790
## cut(age, 4)(33.5,49] 24.053 1.829 13.148
## cut(age, 4)(49,64.5] 23.665 2.068 11.443
## cut(age, 4)(64.5,80.1] 7.641 4.987 1.532
## Pr(>|t|)
## (Intercept) 0.000e+00
## cut(age, 4)(33.5,49] 1.982e-38
## cut(age, 4)(49,64.5] 1.041e-29
## cut(age, 4)(64.5,80.1] 1.256e-01
The function cut() returns an ordered categorical variable; the lm() function then
creates a set of dummy variables for use in the regression. The age < 33.5 category is left
out, so the intercept coefficient of $94,158 can be interpreted as the average salary for
those under 33.5 years of age, and the other coefficients can be interpreted as the average
additional salary for those in the other age groups. Of course, we can produce predictions
and plots just as we did in the case of the polynomial fit.
Finally, note that the cut() function automatically picked the cut-points of the age
variable. However, one can also impose cut-points of her/his choice by using the
breaks option of the function.
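As a minimal sketch (the cut-points below are arbitrary, chosen only for illustration), a fit with user-supplied breaks looks as follows:
# Piece-wise constant fit with user-chosen cut-points (illustrative values only)
stepfunction.lm.fit.Wage2 <- lm(wage ~ cut(age, breaks = c(17, 30, 45, 60, 81)),
                                data = Wage)
coef(summary(stepfunction.lm.fit.Wage2))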
2 Splines
2.1 Regression Splines
Regression spline fits can be produced in R by loading the splines library. First we
construct an appropriate matrix of basis functions for a specified set of knots, by calling
the bs{splines}() function.
library(splines)
basis.fxknots <- bs(Wage$age, knots = c(25, 40, 60))
dim(basis.fxknots)
## [1] 3000 6
Alternatively, we can let the library determine the knot locations by specifying only the
required degrees of freedom, df. For a degree-d spline (a piecewise degree-d polynomial)
fitted with K knots, one needs (d + 1)(K + 1) − d·K = d + K + 1 "dofs", or d + K "dofs"
if there is no intercept in the model. In particular, for a cubic spline basis without
an intercept (the default) and with 6 "dofs", the model is constrained to use only 3
knots, which are distributed along uniform quantiles of the age variable.
basis.fxdf1 <- bs(Wage$age, df = 6, intercept = FALSE)
dim(basis.fxdf1)
## [1] 3000 6
attr(basis.fxdf1, "knots")
## 25% 50% 75%
## 33.75 42.00 51.00
Should we demand that this model also have an intercept:
basis.fxdf2 <- bs(Wage$age, df = 6, intercept = TRUE)
dim(basis.fxdf2)
## [1] 3000 6
attr(basis.fxdf2, "knots")
## 33.33% 66.67%
## 37 48
whereas for one polynomial degree higher, i.e. quartic spline:
basis.fxdf3 <- bs(Wage$age, df = 6, degree = 4, intercept = TRUE)
dim(basis.fxdf3)
## [1] 3000 6
attr(basis.fxdf3, "knots")
## 50%
## 42
The first case of the cubic splines referenced above seems more promising. To produce a
prediction fit
# Produce an age Grid
ageMinMax <- range(Wage$age)
age.grid <- seq(from = ageMinMax[1], to = ageMinMax[2])
# Produce a prediction fit
splines.bs1.fit <- lm(wage ~ bs(age, df = 6, intercept = FALSE), data = Wage)
splines.bs1.pred <- predict(splines.bs1.fit, newdata = list(age = age.grid), se.fit = TRUE)
se.bands <- cbind(splines.bs1.pred$fit + 2 * splines.bs1.pred$se.fit,
                  splines.bs1.pred$fit - 2 * splines.bs1.pred$se.fit)
and a corresponding plot of the Wage ∼ Age dependence
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(Wage$age, Wage$wage, col = "grey", xlab = "Age", ylab = "Wage",
     cex = 0.5, pch = 8)
lines(age.grid, splines.bs1.pred$fit, col = "blue", lwd = 2)
matlines(age.grid, se.bands, lty = "dashed", col = "blue")
title(main = "Wage vs employee Age \n Regression and Natural Splines Fit [Wage{ISLR}]",
      outer = TRUE)
In order to fit a natural spline instead, that is, a regression spline with additional
boundary constraints that force the fit to be linear beyond the boundary knots, we make use
of the ns() function. All these results, the Wage ∼ Age data points as well as the two
prediction fits, that of the cubic spline (blue line) and that of the natural cubic spline
(red line), are shown in Figure 2.
basis.ns <- ns(Wage$age, df = 6, intercept = TRUE)
splines.ns.fit <- lm(wage ~ ns(age, df = 6), data = Wage)
splines.ns.pred <- predict(splines.ns.fit, newdata = list(age = age.grid), se.fit = TRUE)
se.ns.bands <- cbind(splines.ns.pred$fit - 2 * splines.ns.pred$se.fit,
                     splines.ns.pred$fit + 2 * splines.ns.pred$se.fit)
Figure 2: Cubic spline fit for Employees' Wage vs their Age with 3 knots and 6 dofs (blue
lines). A natural spline fit with 6 dofs and an intercept is also depicted (red lines).
lines(age.grid, splines.ns.pred$fit, col = "red", lwd = 2)
matlines(age.grid, se.ns.bands, lty = "dashed", col = "red")
legend("topright", inset = 0.05, legend = c("Cubic Spline", "Natural Cubic Spline"),
       col = c("blue", "red"), lty = 1, lwd = c(2, 2))
2.2 Smoothing Splines
Here, we make a smoothing spline fit for the wage ∼ age dependence of the employees’
Wage data set. To do so we utilize the smooth.spline{stats}() function as shown
below
# Smooth Spline with 16 effective dofs
sspline.fit <- smooth.spline(Wage$age, Wage$wage, df = 16)
# Let smooth.spline() choose the effective dofs automatically
# (generalized cross-validation is used here, since cv = FALSE)
sspline.fit2 <- smooth.spline(Wage$age, Wage$wage, cv = FALSE, df.offset = 1)
# effective dofs
sspline.fit2$df
## [1] 6.467555
which can be plotted by running the code below (Figure 3).
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(age, wage, xlim = ageMinMax, cex = 0.5, col = "darkgrey",
     xlab = "Age", ylab = "Wage", pch = 8)
lines(sspline.fit, col = "red", lwd = 2)
lines(sspline.fit2, col = "blue", lwd = 1)
title(main = "Wage vs employee Age \n Smoothing Spline Fit [Wage{ISLR}]",
      outer = TRUE)
legend("topright", inset = 0.05, legend = c("16 dofs", "6.47 dofs"),
       col = c("red", "blue"), lty = 1, lwd = c(2, 1), cex = 0.8)
Figure 3: Smoothing spline fits for Employees' Wage vs their Age. One with 16 effective
dofs fixed in advance (red line), and the other with 6.47 effective dofs as determined
automatically by generalized cross-validation (blue line).
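For comparison (a side sketch, not part of the original analysis), smooth.spline() can also select the smoothness by ordinary leave-one-out cross-validation with cv = TRUE; it typically warns here because age contains many tied values, and the selected effective dofs may differ slightly from the GCV choice above.
# LOOCV-based selection of the smoothing parameter, for comparison with sspline.fit2
sspline.fit3 <- smooth.spline(Wage$age, Wage$wage, cv = TRUE)
sspline.fit3$df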
3 Local Regression
Here, as an alternative way to produce a non-linear fit, we perform local regression by
making use of the loess{stats}() function.
# Local Regression with each neighborhood spanning 20% of the
# observations
loess.fit <- loess(wage ~ age, span = 0.2, data = Wage)
# Local Regression with each neighborhood spanning 50% of
# the observations
loess.fit2 <- loess(wage ~ age, span = 0.5, data = Wage)
and produce the corresponding plot by executing the code below (Figure 4).
par(mfrow = c(1, 1), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(age, wage, xlim = ageMinMax, cex = 0.5, col = "darkgrey",
     pch = 8, xlab = "Age", ylab = "Wage")
lines(age.grid, predict(loess.fit, data.frame(age = age.grid)),
      col = "blue", lwd = 1)
lines(age.grid, predict(loess.fit2, data.frame(age = age.grid)),
      col = "red", lwd = 1)
title(main = "Wage vs employee Age \n Local Regression Fit [Wage{ISLR}]",
      outer = TRUE)
legend("topright", inset = 0.05, legend = c("Span 20%", "Span 50%"),
       col = c("blue", "red"), lty = 1, lwd = c(1, 1), cex = 0.8)
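As a small extension (not in the original code), predict() for loess fits can also return standard errors, so bands analogous to those of the earlier figures could be overlaid:
# Standard-error bands for the 20%-span local regression fit (sketch)
loess.pred <- predict(loess.fit, data.frame(age = age.grid), se = TRUE)
se.loess <- cbind(loess.pred$fit + 2 * loess.pred$se.fit,
                  loess.pred$fit - 2 * loess.pred$se.fit)
matlines(age.grid, se.loess, lty = "dashed", col = "blue")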
4 Generalized Additive Models (GAMs)
As a generalization of the previously studied models, we now discuss additive models that
allow a more flexible choice of fitting method for each of the variables we use as
predictors. This class of models is known as Generalized Additive Models (GAMs), and they
have the general form
GAMs : y_i = β0 + Σ_{j=1..p} f_j(x_ij) + ε_i .    (3)
As a first example we examine the fit
Figure 4: Local Regression fits for Employees' Wage vs their Age: one with each neighborhood
pre-configured to span 20% of the observations (blue line) and the other with a 50% span
(red line).
library(splines)
gam1 <- lm(wage ~ ns(year, 4) + ns(age, 5) + education, data = Wage, subset = train)
However, in case we want to use smoothing splines or other components that cannot be
expressed in terms of basis functions, we have to use a more general sort of GAM fit, even
if the model is additive. To do so we use the mgcv library, which was introduced in
[Wood, 2006] and is provided here by the Oracle R distribution∗.
The s() function, which is part of the mgcv library, is used to specify smoothing spline
fits.
∗ Alternatively, one can use Trevor Hastie's original library for that purpose, gam
[Hastie and Tibshirani, 1990]. However, we find mgcv much more complete for building GAM
models and have chosen to use this package for our calculations.
To repeat the previous fit, but now with smoothing spline terms, we execute the following
R code.
library(mgcv)
gam.m3 <- gam(wage ~ s(year, k = 5) + s(age, k = 6) + education,
family = gaussian(), data = Wage, subset = train)
Here, the arguments k = 5 and k = 6 set the basis dimensions for the smooths of year and
age, so the function of year can use at most 4 effective degrees of freedom and the
function of age at most 5; the actual effective dofs are chosen by penalized fitting. Since
education is a categorical variable, we leave it as is, and it is converted by the gam()
function into four dummy variables. The fitted model can be plotted as below.
par(mfrow = c(1, 3), mar = c(4, 4, 1, 1), oma = c(0, 0, 3, 0))
plot(gam.m3, se = TRUE, col = "blue")
plot(education[train], gam.m3$y)
title("Smoothing Splines Fit [mgcv] \n Wage ~ s(year, k = 5) + s(age, k = 6) + education",
      outer = TRUE)
Note that in the first panel of Figure 5 the fitted function of year looks rather linear.
We can perform a series of ANOVA tests in order to determine which of these three models is
best: a GAM that excludes year (M1), a GAM that uses a linear function of year (M2), or a
GAM that uses a spline function of year (M3), as the one built above. Note that in all
these models we include the education variable, which seems to be a good choice according
to the short discussion at the end of Section 1.1.
gam.m1 <- gam(wage ~ s(age, k = 6) + education, family = gaussian(),
              data = Wage, subset = train)
gam.m2 <- gam(wage ~ year + s(age, k = 6) + education, family = gaussian(),
              data = Wage, subset = train)
Figure 5: GAM fitted model using smoothing splines through the mgcv library.
anova(gam.m1, gam.m2, gam.m3, test = "F")
## Analysis of Deviance Table
##
## Model 1: wage ~ s(age, k = 6) + education
## Model 2: wage ~ year + s(age, k = 6) + education
## Model 3: wage ~ s(year, k = 5) + s(age, k = 6) + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 1489.0 1812973
## 2 1488.0 1804686 1.00039 8287.4 6.8395 0.008999 **
## 3 1487.2 1801351 0.77868 3334.8 3.5358 0.069726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We find that there is compelling evidence that a GAM with a linear function of year is
better than a GAM that does not include year at all (F = 6.8395, p-value = 0.008999).
However, there is no strong evidence that a non-linear function of year is actually
required (F = 3.5358, p-value = 0.069726). So, based on the results of this ANOVA test,
the M2 model is preferred. Indeed, a closer look at the summary of the last fitted model
gam.m3
summary(gam.m3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ s(year, k = 5) + s(age, k = 6) + education
##
## Parametric coefficients:
## Estimate Std. Error t value
## (Intercept) 86.395 2.845 30.373
## education2. HS Grad 9.289 3.266 2.844
## education3. Some College 23.054 3.450 6.682
## education4. College Grad 37.938 3.404 11.144
## education5. Advanced Degree 61.579 3.738 16.473
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## education2. HS Grad 0.00452 **
## education3. Some College 3.31e-11 ***
## education4. College Grad < 2e-16 ***
## education5. Advanced Degree < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(year) 1.760 2.171 3.908 0.0179 *
## s(age) 3.018 3.658 29.208 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.288 Deviance explained = 29.2%
## GCV score = 1219.2 Scale est. = 1211.2 n = 1497
reveals that the fitted smooth of year is nearly linear (edf ≈ 1.76, F = 3.908, p-value =
0.0179), whereas the age variable clearly requires a non-linear function (edf ≈ 3.02,
F = 29.208, p-value < 2e-16). Note that in mgcv these p-values correspond to a null
hypothesis that the particular smooth term is zero, rather than linear, so the
near-linearity of year is read off from its effective degrees of freedom rather than from
the p-value itself.
Of course, we can make predictions as before using the test part of the Wage data set and
draw a safer conclusion by comparing the Mean-Squared Errors of the two models gam.m2
and gam.m3.
gam.m2.pred <- predict(gam.m2, newdata = Wage[test, ])
gam.m3.pred <- predict(gam.m3, newdata = Wage[test, ])
mean((Wage[test, ]$wage - gam.m2.pred)^2)
## [1] 1275.063
mean((Wage[test, ]$wage - gam.m3.pred)^2)
## [1] 1276.13
Again, the gam.m2 model is found to give the (marginally) better fit.
References
[Hastie and Tibshirani, 1990] Hastie, T. and Tibshirani, R. (1990). Generalized Additive
Models. Chapman and Hall/CRC.
[James et al., 2013] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An
Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics).
Springer, 1st ed. 2013, corr. 4th printing 2014.
[Wood, 2006] Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chapman
and Hall/CRC.